Project · Personal · Shipped

Job Market Analyzer

A pipeline that reads my resume, pulls live job postings across the US, EU, and India, and scores each market against my actual skill set, seniority, and visa situation, so I stop guessing where to apply.

Python · HuggingFace NER · spaCy · Zero-shot BART · Sentence embeddings · JSearch / Remotive / Arbeitnow APIs

01 · The problem & why I built it

Picking a job market by intuition is a bad strategy.

I'm an Indian citizen with three years of experience targeting data/engineering roles. I needed visa sponsorship in the US and EU, and I had to pick where to focus my limited application energy. Which market is actually worth applying to? Not by gut feel. By evidence. Matching my real skills, my seniority, and my visa situation against actual job postings.

Doing that manually means scanning hundreds of postings across three regions, filtering for sponsorship language, and judging skill fit on each one. It's not a thing a human does well at scale. So I built the thing that does.

Side benefit: I now have a reusable system. Next time my situation changes (different role, different visa status, different markets), I re-run the pipeline.

02 · Architecture & flow

How the pipeline moves a resume to a market score.

Five stages, each writing a stable file that the next stage reads. That makes it easy to debug: if scoring looks wrong, I open the enriched CSV and check whether skill extraction or the score function is at fault.

resume.pdf
    │
    ▼
parse_resume.py ──────────────► profile.json (skills, seniority, visa status)
                                     │
    ┌────────────────────────────────┘
    ▼
fetch_jobs.py
    ├─ Arbeitnow API       → EU jobs (DE, NL)
    ├─ Remotive API        → Remote jobs
    └─ JSearch (RapidAPI)  → US + India jobs
    │
    ▼ (title relevance filter)
jobs.csv
    │
    ▼
extract_skills.py
    ├─ NER (HF: lm-ner-linkedin-skills)
    │    └─ embedding filter (drops non-tech tokens)
    ├─ Keyword taxonomy scan (~120 known skills)
    ├─ spaCy Matcher → seniority + years exp
    └─ Zero-shot → visa sponsorship detection (facebook/bart-large-mnli)
    │
    ▼
jobs_enriched.csv
    │
    ▼
score_markets.py
    ├─ skill_match (cosine ≥ 0.6 per job, avg per market)
    ├─ seniority_match
    ├─ visa_score
    ├─ volume
    └─ salary_score
    │
    ▼
scores.csv
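A minimal sketch of one way the skill_match step in the diagram could be computed. Only the 0.6 cosine threshold and the per-market averaging come from the diagram. The function names, the "share of a job's skills that match a resume skill" definition, and the toy 2-D vectors (standing in for sentence embeddings) are assumptions for illustration, not the project's actual code.

```python
import math
from collections import defaultdict

THRESHOLD = 0.6  # cosine cutoff named in the diagram


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def job_skill_match(job_vecs, resume_vecs, threshold=THRESHOLD):
    """Assumed definition: share of the job's skills with a resume skill at or above the threshold."""
    if not job_vecs or not resume_vecs:
        return 0.0
    hits = sum(
        1 for jv in job_vecs
        if max(cosine(jv, rv) for rv in resume_vecs) >= threshold
    )
    return hits / len(job_vecs)


def market_skill_match(jobs, resume_vecs):
    """Average the per-job score within each market. `jobs` is (market, skill_vectors) pairs."""
    per_market = defaultdict(list)
    for market, job_vecs in jobs:
        per_market[market].append(job_skill_match(job_vecs, resume_vecs))
    return {m: sum(s) / len(s) for m, s in per_market.items()}


# Toy data: two resume skills, one job per market.
resume = [[1.0, 0.0], [0.0, 1.0]]
jobs = [
    ("IN", [[1.0, 0.1], [0.1, 1.0]]),   # both job skills sit close to a resume skill
    ("EU", [[1.0, 0.0], [-1.0, 0.0]]),  # only the first job skill matches
]
scores = market_skill_match(jobs, resume)
```

With these toy vectors the IN job matches on both skills and the EU job on one of two, so the per-market averages come out at 1.0 and 0.5.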

03 · Key technical decisions & bottlenecks

Where it broke, what I changed, and why.

The first version used Claude Haiku for skill extraction. It worked, but it was expensive at scale, and the latency made iteration slow. I rebuilt around open NLP models. Every decision below traces back to a real failure I hit while testing.

Bottleneck · NER alone wasn't enough

The NER model (algiraldohe/lm-ner-linkedin-skills-recognition) was trained on LinkedIn data, so it treated words like "english", "marketing", and "events" as valid skills. Top extracted skills per country came out as noise.

Resolution · A three-layer filter:

Confidence threshold ≥ 0.75 (drops uncertain extractions)
Embedding proximity filter (cosine similarity ≥ 0.45 against tech seed terms). This generalizes to new roles without me maintaining a hardcoded blocklist.
Keyword taxonomy scan as a second pass, directly matching ~120 known skills (AWS, Kafka, dbt, Terraform, etc.) with regex word-boundary matching.

Why this combination: NER catches novel/emerging tools the taxonomy doesn't know about yet. The taxonomy catches known skills that NER fragments or misses. The embedding filter bridges them without me having to maintain a blocklist forever.
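The three layers above can be sketched as follows. The 0.75 and 0.45 thresholds and the regex word-boundary scan come from the text. Everything else is an assumption: the entity dicts are assumed to carry "word" and "score" keys, `embed` is a hypothetical callable that returns a vector for a string, and the four-item taxonomy and toy vectors are placeholders for the real ~120-skill list and real sentence embeddings.

```python
import math
import re

CONF_MIN = 0.75  # NER confidence threshold from the text
SIM_MIN = 0.45   # cosine cutoff against tech seed terms from the text

# Hypothetical slice of the taxonomy; the real list has ~120 skills.
TAXONOMY = ["AWS", "Kafka", "dbt", "Terraform"]
TAXONOMY_RE = {
    skill: re.compile(r"\b" + re.escape(skill) + r"\b", re.IGNORECASE)
    for skill in TAXONOMY
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_ner(entities, embed, seed_vecs):
    """Layers 1 and 2: drop low-confidence entities, then anything far from the tech seeds."""
    kept = set()
    for ent in entities:
        if ent["score"] < CONF_MIN:
            continue
        vec = embed(ent["word"])
        if max((cosine(vec, seed) for seed in seed_vecs), default=0.0) >= SIM_MIN:
            kept.add(ent["word"].lower())
    return kept


def taxonomy_scan(text):
    """Layer 3: direct word-boundary match against known skills."""
    return {skill.lower() for skill, pat in TAXONOMY_RE.items() if pat.search(text)}


def extract_skills(text, entities, embed, seed_vecs):
    """Union of filtered NER output and taxonomy hits."""
    return filter_ner(entities, embed, seed_vecs) | taxonomy_scan(text)


# Toy stand-ins for the embedding model and NER output.
TOY_VECS = {"pyspark": [0.9, 0.1], "marketing": [0.1, 0.9], "english": [0.0, 1.0]}
toy_embed = lambda word: TOY_VECS[word.lower()]
seeds = [[1.0, 0.0]]  # one "tech" seed direction
entities = [
    {"word": "PySpark", "score": 0.93},
    {"word": "marketing", "score": 0.88},  # confident, but far from the seed
    {"word": "english", "score": 0.60},    # below the confidence threshold
]
text = "We use PySpark, Kafka and dbt. Marketing experience a plus."
skills = extract_skills(text, entities, toy_embed, seeds)
```

In this toy run "english" falls to the confidence threshold, "marketing" falls to the embedding filter, and "kafka" and "dbt" arrive through the taxonomy scan even though the fake NER output never mentioned them.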

Bottleneck · BERT's 512-token limit

Job descriptions typically run 800–1500 tokens, well past BERT's hard limit of 512. Running the full text through NER threw indexing errors.

Resolution · Chunked tokenization. Split each description into 500-token windows with a 50-token overlap, run NER on each chunk, and merge the results into a set. The overlap is intentional: it keeps skills that straddle a chunk boundary from going missing.
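A minimal sketch of the windowing described above. The 500-token window and 50-token overlap come from the text. The function names are hypothetical, and `run_ner` is an assumed callable that takes one chunk of tokens and returns the skills found in it, so the sketch runs without loading a model.

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    if not tokens:
        return []
    stride = size - overlap  # 450 new tokens per window with the defaults
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window reached the end of the text
        start += stride
    return chunks


def extract_chunked(tokens, run_ner, size=500, overlap=50):
    """Run NER per chunk and merge into a set, so skills seen in two overlapping chunks count once."""
    found = set()
    for chunk in chunk_tokens(tokens, size, overlap):
        found.update(run_ner(chunk))
    return found


# Toy run: 1200 filler tokens with two "skills", one inside an overlap region.
tokens = ["x"] * 1200
tokens[475] = "kafka"   # falls in both the first and second window
tokens[1100] = "dbt"
chunks = chunk_tokens(tokens)
skills = extract_chunked(tokens, lambda chunk: {t for t in chunk if t != "x"})
```

A 500-token window also leaves room for the two special tokens a BERT-style tokenizer adds around each input, which keeps every chunk under the 512 limit.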

Bottleneck · Irrelevant jobs polluting EU scores

Arbeitnow's search parameter does full-text matching, not title matching. Searching "Data Engineer" returned CRM, marketing, and sales roles because those words appeared somewhere in the description. The skill_match score for EU tanked. A CRM job has zero overlap with PySpark or Terraform.

Resolution · Post-fetch title relevance filter. Only keep jobs whose title contains keywords like "data", "engineer", "analyst", or "scientist". It runs inside the fetch stage, right after the API responses come back and before any ML processing, so I'm not wasting GPU cycles on garbage.
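The title filter can be sketched in a few lines. The four keywords are the ones quoted above. The function names and the assumption that each job is a dict with a "title" key are illustrative only.

```python
TITLE_KEYWORDS = ("data", "engineer", "analyst", "scientist")  # keywords quoted in the text


def is_relevant(title, keywords=TITLE_KEYWORDS):
    """True if the lowercased title contains any keyword as a substring."""
    lowered = title.lower()
    return any(keyword in lowered for keyword in keywords)


def filter_jobs(jobs):
    """Keep only jobs with a relevant title; a missing or empty title is treated as irrelevant."""
    return [job for job in jobs if is_relevant(job.get("title") or "")]


# Toy postings, shaped like the full-text search noise described above.
jobs = [
    {"title": "Senior Data Engineer"},
    {"title": "CRM Sales Manager"},
    {"title": "Marketing Lead"},
    {"title": None},
]
kept = filter_jobs(jobs)
```

Substring matching is deliberately loose: a title like "Sales Engineer" would still pass, so this removes the obvious noise rather than guaranteeing relevance.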

Bottleneck · Visa detection without labeled training data

"Visa sponsorship" appears in many forms ("we sponsor", "no sponsorship available", "candidates must be authorized to work") and negation flips the meaning entirely. A keyword search would mislabel half the postings.

Resolution · Zero-shot classification using facebook/bart-large-mnli. Extract the sentences that contain visa-related keywords, then pass each one to the classifier with the labels ["visa sponsored", "no visa sponsorship"]. This handles negation and paraphrase out of the box, with no training data needed.
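A hedged sketch of that flow. The model name and the two labels come from the text, and the classifier call uses the Hugging Face `transformers` zero-shot-classification pipeline. The keyword pattern, the naive sentence splitter, the function names, and the rule of taking the single most confident sentence as the posting's label are all assumptions. The classifier is passed in as an argument so the logic can be exercised without downloading the model.

```python
import re

LABELS = ["visa sponsored", "no visa sponsorship"]  # labels from the text

# Assumed keyword pattern for picking out candidate sentences.
VISA_HINT = re.compile(r"visa|sponsor|authori[sz]ed to work|work permit", re.IGNORECASE)


def visa_sentences(text):
    """Naive sentence split on ., ! or ? followed by whitespace, keeping visa-related sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s.strip() for s in sentences if VISA_HINT.search(s)]


def load_classifier():
    """Build the real classifier. Imported lazily because it downloads a large model."""
    from transformers import pipeline
    return pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


def detect_visa(text, classifier):
    """Return the top label of the most confident visa sentence, or None if none is found."""
    best_label, best_score = None, 0.0
    for sentence in visa_sentences(text):
        result = classifier(sentence, candidate_labels=LABELS)
        # The pipeline returns labels and scores sorted from most to least likely.
        label, score = result["labels"][0], result["scores"][0]
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Real usage (not run here because it needs the model):
#   clf = load_classifier()
#   detect_visa("We sponsor H-1B visas for strong candidates.", clf)
```

Filtering to keyword sentences first keeps the number of model calls small, since most sentences in a posting say nothing about work authorization.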

Why I moved off Claude Haiku

The original prototype piped each job description through Haiku to extract skills and visa status. The output quality was good, but the per-job cost added up fast across ~400 postings, and re-running the pipeline (which I do every time I tweak the scorer) made the cost feel unjustified. Open NLP models gave me roughly 90% of the quality at near-zero marginal cost. That is the right trade-off for a tool I run on my own machine repeatedly.

04 · Results & what they mean

India wins on volume and fit. US wins on quality per posting.

Market	Jobs	Skill	Seniority	Visa	Final Score
India ✓	235	0.356	0.535	1.00	0.611
US	57	0.339	0.440	0.83	0.436
EU	132	0.269	0.446	0.62	0.411

What the numbers actually say

India (0.611) is the strongest market for me. Highest skill overlap, most job volume, and an automatic 1.0 visa score (Indian citizen, no sponsorship needed). Top skills demanded: Python, SQL, AWS, Java, ETL, Kubernetes. Well-aligned.

US (0.436) has strong skill match and a good sponsorship rate (83% of postings mention it), but the low volume (57 jobs) drags the score down. High competition, but the postings that exist match me well. Worth a selective, targeted pipeline rather than mass applications.

EU (0.411) has the most postings after India but the lowest skill match. DE and NL markets skew toward JavaScript/TypeScript, which isn't my stack. Only about 62% of postings mention sponsorship (visa score 0.62). Worth monitoring, not the primary target.

The takeaway I actually act on

Focus application effort on India for volume and fit. Maintain a selective US pipeline for the well-matched roles. Treat EU as opportunistic. Apply only when something specific surfaces. That's a strategy I can defend, not a feeling.

05 · What I'd build next

The honest list of things I'd improve.

A web UI. Right now it's CLI. I want a small Streamlit or Next.js front end where I upload a new resume and see a fresh market report.
Salary normalization across currencies. The salary_score is currently the weakest signal. Comparing INR, USD, and EUR ranges meaningfully needs purchasing power adjustment.
Company-quality signal. Job count overweights aggregators. I'd add Glassdoor or Levels.fyi data to weight postings by company quality, not just count.
Trend over time. Run the pipeline weekly and surface which markets are gaining or losing roles. Right now it's a snapshot.

See the code

All projects live on GitHub. Issues and PRs welcome.


← vaibhavimutya.github.io Built by me, obviously.