A pipeline that reads my resume, pulls live job postings across the US, EU, and India, and scores each market against my actual skill set, seniority, and visa situation, so I stop guessing where to apply.
I'm an Indian citizen with three years of experience, targeting data/engineering roles. I need visa sponsorship in the US and EU, and I had to pick where to focus my limited application energy. Which market is actually worth applying to? Not by gut feel. By evidence. Matching my real skills, my seniority, and my visa situation against actual job postings.
Doing that manually means scanning hundreds of postings across three regions, filtering for sponsorship language, and judging skill fit on each one. It's not a thing a human does well at scale. So I built the thing that does.
Side benefit: I now have a reusable system. Next time my situation changes (different role, different visa status, different markets), I re-run the pipeline.
The pipeline has five stages, and each one writes a stable file that the next stage reads. That makes it easy to debug. If scoring looks wrong, I open the enriched CSV and check whether skill extraction or the score function is at fault.
The first version used Claude Haiku for skill extraction. It worked, but it was expensive at scale, and the latency made iteration slow. I rebuilt around open NLP models. Every decision below traces back to a real failure I hit while testing.
The NER model (algiraldohe/lm-ner-linkedin-skills-recognition) was trained on LinkedIn data, so it treated words like "english", "marketing", and "events" as valid skills. Top extracted skills per country came out as noise.
Resolution: a three-layer filter that combines NER, a skill taxonomy, and an embedding filter.
Why this combination: NER catches novel/emerging tools the taxonomy doesn't know about yet. The taxonomy catches known skills that NER fragments or misses. The embedding filter bridges them without me having to maintain a blocklist forever.
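A minimal sketch of how those three layers could fit together. The function name `filter_skills`, the injected `embed` callable, the substring-based taxonomy lookup, and the 0.6 similarity threshold are all assumptions for illustration, not the pipeline's actual code.

```python
import math
from typing import Callable, Iterable, Sequence, Set

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_skills(
    text: str,
    ner_candidates: Iterable[str],
    taxonomy: Iterable[str],
    embed: Callable[[str], Vector],  # hypothetical: any sentence/word embedding function
    threshold: float = 0.6,          # assumed cut-off, would need tuning on real data
) -> Set[str]:
    lowered = text.lower()
    terms = sorted({t.lower() for t in taxonomy})

    # Taxonomy layer: known skills found verbatim in the posting.
    # (Crude substring match; a real lookup would respect word boundaries.)
    known = {t for t in terms if t in lowered}

    # Embedding layer: keep an NER candidate outside the taxonomy only if it
    # sits close to at least one known skill in embedding space.
    anchors = [embed(t) for t in terms]
    novel: Set[str] = set()
    for candidate in ner_candidates:
        c = candidate.lower().strip()
        if not c:
            continue
        if c in terms:
            known.add(c)
            continue
        vec = embed(c)
        if any(cosine(vec, a) >= threshold for a in anchors):
            novel.add(c)

    return known | novel
```

With this shape, a word like "marketing" is dropped because it is far from every taxonomy term, while a tool the taxonomy has never seen can still pass if it embeds near known skills.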
Job descriptions average 800–1500 tokens, which exceeds BERT's 512-token input limit. Running the full text through NER threw indexing errors.
Resolution: chunked tokenization. Split the text into 500-token windows with a 50-token overlap, run NER on each chunk, and merge the results into a set. The overlap is intentional: it prevents skills that straddle a chunk boundary from going missing.
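The windowing logic can be sketched in plain Python. Here `tokens` is assumed to be an already-tokenized list, and `ner_fn` is a hypothetical callable standing in for the NER model; both names are illustrative.

```python
from typing import Callable, Iterable, List, Set


def chunk_windows(tokens: List[str], size: int = 500, overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    windows: List[List[str]] = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return windows


def extract_skills_chunked(
    tokens: List[str],
    ner_fn: Callable[[List[str]], Iterable[str]],  # hypothetical wrapper around the NER model
    size: int = 500,
    overlap: int = 50,
) -> Set[str]:
    """Run NER on every window and merge the results; the set removes duplicates from overlaps."""
    skills: Set[str] = set()
    for window in chunk_windows(tokens, size, overlap):
        skills.update(ner_fn(window))
    return skills
```

A two-token skill such as "apache spark" that lands on positions 499 and 500 is cut in half by the first window but sits whole inside the second, which is the case the overlap exists for.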
Arbeitnow's search parameter does full-text matching, not title matching. Searching "Data Engineer" returned CRM, marketing, and sales roles because those words appeared somewhere in the description. The skill_match score for EU tanked. A CRM job has zero overlap with PySpark or Terraform.
Resolution: a post-fetch title relevance filter. Only keep jobs whose title contains keywords like "data", "engineer", "analyst", or "scientist". It runs right after the fetch, before any ML processing, so I'm not wasting GPU cycles on garbage.
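A filter like this needs only a few lines. The sketch below assumes each job is a dict with a `"title"` key, which is a guess at the record shape rather than Arbeitnow's documented schema.

```python
from typing import Dict, Iterable, List

# Keywords taken from the description above; extend as needed.
TITLE_KEYWORDS = ("data", "engineer", "analyst", "scientist")


def is_relevant_title(title: str) -> bool:
    """True if the title contains any target keyword (case-insensitive substring match)."""
    lowered = (title or "").lower()
    return any(keyword in lowered for keyword in TITLE_KEYWORDS)


def filter_jobs(jobs: Iterable[Dict]) -> List[Dict]:
    """Drop postings whose title does not look like a data/engineering role."""
    return [job for job in jobs if is_relevant_title(job.get("title", ""))]
```

Plain substring matching is deliberately loose: "Sales Engineer" still passes, so the later skill-match score remains the real judge of fit.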
"Visa sponsorship" appears in many forms ("we sponsor", "no sponsorship available", "candidates must be authorized to work") and negation flips the meaning entirely. A keyword search would mislabel half the postings.
Resolution: zero-shot classification using facebook/bart-large-mnli. Extract the sentences containing visa-related keywords, then pass each to the classifier with the labels ["visa sponsored", "no visa sponsorship"]. This handles negation and paraphrase out of the box, with no training data needed.
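A rough sketch of that flow using the Hugging Face `transformers` zero-shot pipeline. The keyword list, the regex sentence splitter, and the rule of trusting the single most confident sentence are assumptions made for the example; only the model name and the two labels come from the text above.

```python
import re
from typing import Callable, List, Optional

# Assumed keyword list for spotting visa-related sentences.
VISA_KEYWORDS = ("visa", "sponsor", "authorized to work", "work authorization", "work permit")
LABELS = ["visa sponsored", "no visa sponsorship"]


def visa_sentences(description: str) -> List[str]:
    """Return the sentences that mention any visa-related keyword."""
    sentences = re.split(r"(?<=[.!?])\s+", description)
    return [s.strip() for s in sentences if any(k in s.lower() for k in VISA_KEYWORDS)]


def load_classifier():
    """Build the zero-shot classifier (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily so the helpers above stay lightweight
    return pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


def classify_visa(description: str, classifier: Callable) -> Optional[str]:
    """Label a posting by its most confidently classified visa sentence, or None if there is none."""
    best_label, best_score = None, -1.0
    for sentence in visa_sentences(description):
        result = classifier(sentence, candidate_labels=LABELS)
        # The pipeline returns labels sorted by descending score.
        label, score = result["labels"][0], result["scores"][0]
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Usage would look like `classify_visa(job_text, load_classifier())`. Postings with no visa-related sentence return `None`, which lets the scorer treat "not mentioned" separately from an explicit refusal.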
The original prototype piped each job description through Haiku to extract skills and visa status. The output quality was good but the per-job cost added up fast across ~400 postings, and re-running the pipeline (which I do every time I tweak the scorer) made the cost feel unjustified. Open NLP models gave me 90% of the quality at near-zero marginal cost. The right trade-off for a tool I run on my own machine repeatedly.
| Market | Jobs | Skill | Seniority | Visa | Final Score |
|---|---|---|---|---|---|
| India ✓ | 235 | 0.356 | 0.535 | 1.00 | 0.611 |
| US | 57 | 0.339 | 0.440 | 0.83 | 0.436 |
| EU | 132 | 0.269 | 0.446 | 0.62 | 0.411 |
India (0.611) is the strongest market for me. Highest skill overlap, most job volume, and an automatic 1.0 visa score (Indian citizen, no sponsorship needed). Top skills demanded: Python, SQL, AWS, Java, ETL, Kubernetes. Well-aligned.
US (0.436) has strong skill match and a good sponsorship rate (83% of postings mention it), but the low volume (57 jobs) drags the score down. High competition, but the postings that exist match me well. Worth a selective, targeted pipeline rather than mass applications.
EU (0.411) has the most postings after India but the lowest skill match. DE and NL markets skew toward JavaScript/TypeScript, which isn't my stack. Only 62% mention sponsorship. Worth monitoring, not the primary target.
Focus application effort on India for volume and fit. Maintain a selective US pipeline for the well-matched roles. Treat EU as opportunistic. Apply only when something specific surfaces. That's a strategy I can defend, not a feeling.
All projects live on GitHub. Issues and PRs welcome.