Project · Personal · Shipped

H1B Cap-Exempt Sponsor Tracker

A pipeline that turns three years of raw federal H-1B disclosure spreadsheets into a clean list of cap-exempt employers, then scrapes their careers pages to surface real, currently-open IT roles.

Python · Playwright · Serper API · DOL OFLC data · ATS-aware scraping

01 · The problem & why I built it

The H-1B cap is a lottery. Cap-exempt employers aren't.

Most H-1B applicants are subject to the annual cap and its lottery, which means even a perfect application has only a roughly 25% chance of being selected. But one category of employers is cap-exempt: universities, non-profit research organizations, government research labs, and qualifying medical centers. For people who can target them, those employers are a stable sponsorship path that bypasses the lottery entirely.

The catch: there's no clean public list of who they are. The Department of Labor publishes raw LCA disclosure spreadsheets every year, but they're enormous, messy, and don't flag cap-exemption. And once you've identified a sponsor, you still have to find their careers page and check if they're hiring anything you can actually do.

I built this for my own visa situation. The system aggregates three years of DOL data, filters for cap-exempt employers using NAICS codes plus name heuristics, ranks them by hiring volume, finds each company's careers page, and scrapes open IT roles. What was a week of manual research becomes a CSV.

02 · Architecture & flow

A linear pipeline with resumable checkpoints.

Each stage is a separate script that writes its output to disk before the next stage reads it. That separation matters because the scraping stage is slow and fragile: I want to be able to interrupt and resume it without re-downloading 200 MB of DOL data.

DOL OFLC LCA Excel (2024, 2025, 2026)
        │
        ▼  parse + filter
┌──────────────────────────┐
│ NAICS prefix match       │  ← 6113 (univ), 6221 (hospital),
│ + employer name keywords │    8139 (nonprofit assoc)
└──────────────────────────┘
        │
        ▼  aggregate H-1B counts per employer (2024–2026)
        │
        ▼  careers page lookup via Serper API
        │     └─ filter out aggregators (LinkedIn, Indeed, Glassdoor, ZipRecruiter…)
        │
        ▼  h1b_cap_exempt_sponsors.csv
        │
        ▼  Playwright scraper (with checkpoint.json)
        │     ├─ ATS-specific extractors:
        │     │    Greenhouse, Lever, Workday, SmartRecruiters,
        │     │    iCIMS, Jobvite, Ashby, Taleo, SuccessFactors
        │     └─ Generic fallback for unknown sites
        │
        ▼  it_jobs.csv
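The aggregation step in that flow can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the row keys ("employer", "year", "workers") and the function name are hypothetical, and the real DOL column names differ.

```python
from collections import defaultdict


def aggregate_counts(rows, years=(2024, 2025, 2026)):
    """Sum worker counts per employer, one column per year, ranked by total volume.

    `rows` is an iterable of dicts with hypothetical keys
    "employer", "year", and "workers" (already parsed from the spreadsheets).
    """
    totals = defaultdict(lambda: dict.fromkeys(years, 0))
    display_names = {}
    for row in rows:
        year = row["year"]
        if year not in years:
            continue
        # Normalize case and whitespace so spelling variants of one employer merge.
        key = " ".join(row["employer"].upper().split())
        display_names.setdefault(key, row["employer"].strip())
        totals[key][year] += int(row["workers"])

    # Highest total hiring volume first.
    ranked = sorted(totals.items(), key=lambda kv: sum(kv[1].values()), reverse=True)
    return [
        {"employer": display_names[key], **{str(y): counts[y] for y in years}}
        for key, counts in ranked
    ]
```

Normalizing only case and whitespace is deliberately conservative. It will not merge "Univ." with "University", so a real run would likely still need some manual review of near-duplicates.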

03 · Key technical decisions & bottlenecks

The hard parts were never in the data parsing.

Parsing the DOL spreadsheets was tedious but mechanical. The interesting work was deciding which employers count as cap-exempt, how to find their careers pages without getting blocked, and how to scrape job listings reliably across dozens of different ATS vendors.

Decision: Filtering for cap-exempt status

DOL doesn't tag employers as cap-exempt. That classification is downstream of who they are. I needed a hybrid rule set.

Approach: Two-pass filter:

NAICS prefix match. Codes 6113 (colleges/universities), 6221 (general medical/surgical hospitals), 8139 (business, professional, labor, and similar organizations, used here as a proxy for nonprofit associations).
Employer name keyword match. University, medical center, institute, national laboratory, association, foundation.

Why hybrid: NAICS alone misses employers that are misclassified or use parent-company codes. Name matching alone catches generic "Institute" entities that aren't actually cap-exempt. Running both and unioning gives broad coverage with manageable false positives, which a human can review in the CSV.
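A minimal sketch of that union rule, using the NAICS prefixes and name keywords listed above. The function names are hypothetical and this is not the project's actual code. A match only marks a candidate for review; it does not establish legal cap-exempt status.

```python
NAICS_PREFIXES = ("6113", "6221", "8139")
NAME_KEYWORDS = (
    "university",
    "medical center",
    "institute",
    "national laboratory",
    "association",
    "foundation",
)


def matches_naics(naics_code):
    """True if the NAICS code starts with one of the target prefixes."""
    # Codes may arrive as ints, floats, or strings depending on how the sheet was read.
    code = str(naics_code or "").strip()
    return code.startswith(NAICS_PREFIXES)


def matches_name(employer_name):
    """True if the employer name contains any target keyword (case-insensitive)."""
    name = (employer_name or "").lower()
    return any(keyword in name for keyword in NAME_KEYWORDS)


def is_cap_exempt_candidate(naics_code, employer_name):
    """Union of both passes: either signal is enough to keep the row."""
    return matches_naics(naics_code) or matches_name(employer_name)
```

Keeping the two passes as separate functions makes it easy to record which signal fired for each employer, which helps when reviewing false positives in the CSV.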

Decision: Finding careers pages without scraping Google

Once I have an employer list, I need their actual careers URL. Scraping Google directly gets blocked quickly and is fragile.

Approach: Use the Serper.dev API, which exposes Google search results through a clean REST interface. I filter results aggressively to drop aggregators and other third-party sites (LinkedIn, Indeed, Glassdoor, ZipRecruiter, Wikipedia) because those don't give me the company's own ATS URL.

Why this matters: Aggregator URLs lead to listings that may be stale, missing, or behind login walls. The company's own careers page is where the source of truth lives.
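The result filtering can be sketched like this. It assumes the search results have already been fetched and reduced to a list of dicts with a "link" key; that shape, the domain list, and the function names are assumptions for illustration, not the project's actual code.

```python
from urllib.parse import urlparse

# Third-party sites to skip; extend as needed.
BLOCKED_DOMAINS = {
    "linkedin.com",
    "indeed.com",
    "glassdoor.com",
    "ziprecruiter.com",
    "wikipedia.org",
}


def is_blocked(url):
    """True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def pick_careers_url(results):
    """Return the first result link that is not on a blocked domain, else None."""
    for result in results:
        link = result.get("link", "")
        if link and not is_blocked(link):
            return link
    return None
```

Matching on the parsed hostname rather than on a substring of the URL avoids dropping an employer's own page just because its path happens to mention an aggregator's name.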

Bottleneck: Every careers page is different

There's no single way to scrape job listings. Workday is a single-page app that loads its listings over XHR. Greenhouse boards are often embedded in iframes. Some universities use 20-year-old static HTML. A single generic scraper covers maybe 40% of cases.

Approach: Use Playwright for headless browser automation (it handles dynamic JS-rendered content), and write ATS-specific extractors for the common vendors: Greenhouse, Lever, Workday, SmartRecruiters, iCIMS, Jobvite, Ashby, Taleo, SuccessFactors. When the URL doesn't match a known ATS pattern, fall back to a generic page scanner that looks for common job-listing markup.

Why this combination: The ATS-specific path is reliable and fast for the ~70% of employers using a known vendor. The generic fallback gets me partial coverage on the long tail without writing a custom scraper for every weird site.
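The dispatch between ATS-specific extractors and the generic fallback might look like the sketch below. The hostname suffixes are the vendors' commonly seen hosted domains, listed here as assumptions; employers on custom domains would fall through to the generic path. All names are hypothetical.

```python
from urllib.parse import urlparse

# Hostname suffix -> vendor label (assumed patterns, not exhaustive).
ATS_HOST_SUFFIXES = {
    "greenhouse.io": "greenhouse",
    "lever.co": "lever",
    "myworkdayjobs.com": "workday",
    "smartrecruiters.com": "smartrecruiters",
    "icims.com": "icims",
    "jobvite.com": "jobvite",
    "ashbyhq.com": "ashby",
    "taleo.net": "taleo",
    "successfactors.com": "successfactors",
}


def detect_ats(url):
    """Return the vendor label for a careers URL, or "generic" if none matches."""
    host = (urlparse(url).hostname or "").lower()
    for suffix, vendor in ATS_HOST_SUFFIXES.items():
        if host == suffix or host.endswith("." + suffix):
            return vendor
    return "generic"


def choose_extractor(url, extractors, fallback):
    """Pick the vendor-specific extractor callable, or the generic fallback."""
    return extractors.get(detect_ats(url), fallback)
```

Here `extractors` is a plain dict mapping vendor labels to callables, so adding a vendor means adding one suffix and one function rather than touching the scraper loop.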

Bottleneck: Resilience, because the scrape takes hours

Scraping a few hundred careers pages with a real browser takes serious time. A crash three hours in shouldn't mean starting over.

Approach: Save progress to h1b_cap_exempt_checkpoint.json after each employer. On restart, skip employers already processed. Add polite delays between requests to avoid rate limits. A separate cron_jobs.py checks whether the sponsor CSV exists and whether enough careers pages are already populated before rerunning the search step, so scheduled runs are idempotent.
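A minimal sketch of the checkpoint-and-resume loop, assuming the checkpoint is a JSON object with a single "done" list of employer names. That file layout, the function names, and the `scrape_one` callback are hypothetical; the project's real checkpoint format may differ.

```python
import json
import os
import tempfile
import time


def load_checkpoint(path):
    """Return the set of employers already processed (empty if no file yet)."""
    try:
        with open(path, encoding="utf-8") as f:
            return set(json.load(f)["done"])
    except FileNotFoundError:
        return set()


def save_checkpoint(path, done):
    """Write the checkpoint via a temp file so a crash cannot leave it half-written."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"done": sorted(done)}, f)
    os.replace(tmp_path, path)  # atomic rename over the old checkpoint


def run_scrape(employers, scrape_one, path="h1b_cap_exempt_checkpoint.json",
               delay_seconds=2.0):
    """Scrape each employer once, skipping any already recorded in the checkpoint."""
    done = load_checkpoint(path)
    for employer in employers:
        if employer in done:
            continue
        scrape_one(employer)          # may raise; progress so far is already saved
        done.add(employer)
        save_checkpoint(path, done)
        if delay_seconds:
            time.sleep(delay_seconds)  # polite pause between sites
    return done
```

Saving after every employer costs one small file write per site, which is negligible next to a browser page load, and means a crash loses at most the employer in flight.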

04 · Results & what they mean

What this turns into, practically.

2 files: CSVs that did not exist before

3 years: of federal disclosure data aggregated

9 ATS: vendors covered with dedicated extractors

The outputs

h1b_cap_exempt_sponsors.csv. Employer name, careers page URL, state, and H-1B worker counts for 2024 / 2025 / 2026.
it_jobs.csv. Currently-open IT roles at those employers: job title, URL, location, posting date, and scrape timestamp.
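The keyword-based filter that decides which scraped titles land in it_jobs.csv could be sketched as below. The keyword list and function name are illustrative assumptions, not the project's actual list.

```python
import re

# Illustrative keywords only; a real list would be tuned against scraped titles.
IT_KEYWORDS = (
    "software",
    "developer",
    "programmer",
    "data engineer",
    "devops",
    "cloud",
    "database",
    "systems administrator",
    "network engineer",
    "information technology",
)

# Word boundaries keep "cloud" from matching inside words like "clouded".
_IT_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in IT_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def is_it_title(title):
    """True if the job title contains any IT keyword as a whole word or phrase."""
    return bool(_IT_PATTERN.search(title or ""))
```

A filter like this is cheap and transparent, but it cannot separate tracks such as Data or DevOps reliably, and it will miss titles that use none of the listed terms.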

What it actually means

It turns raw federal disclosure data (technically public, but practically unusable) into something actionable. If I'm a candidate trying to bypass the H-1B lottery, I now have a ranked list of who actively sponsors and what they're hiring right now. If I'm doing immigration research or recruiting, I have a fresh employer panel I can rebuild any time the DOL publishes new disclosures.

The deeper point: a lot of "impossible to find" information isn't actually hidden. It's published in a format nobody wants to deal with. Sometimes the highest-leverage thing you can build is the cleanup layer.

05 · What I'd build next

The honest list of things I'd improve.

A public web UI. Right now you have to run the pipeline locally. I want a hosted version where someone can filter by state, role type, and employer size.
Job-title classification. The "IT jobs" filter is keyword-based. Adding a small classifier would let me carve out Data, ML, Backend, DevOps as separate tracks.
Diff over time. Compare scrapes week-over-week to surface new openings the moment they're posted. Useful for high-demand cap-exempt employers.
Better ATS coverage. Add Personio and Greenhouse iframe-embedded variants. The generic fallback handles them today, but with lower accuracy.
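For the diff-over-time idea, one possible starting point is a set comparison of job URLs between two scrapes. This is a sketch of something not yet built: the "url" column name and both function names are assumptions.

```python
import csv


def load_job_urls(csv_path, url_column="url"):
    """Read one scrape's CSV and return the set of job URLs it contains."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row[url_column] for row in csv.DictReader(f) if row.get(url_column)}


def diff_scrapes(previous_urls, current_urls):
    """Compare two scrapes: postings that appeared and postings that vanished."""
    return {
        "new": sorted(current_urls - previous_urls),
        "closed": sorted(previous_urls - current_urls),
    }
```

Using the URL as the identity of a posting is a simplification. Some ATS vendors rotate URLs or append tracking parameters, so a real version would probably need to normalize links first.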

See the code

All projects live on GitHub. Issues and PRs welcome.


← vaibhavimutya.github.io Built by me, obviously.