A pipeline that turns three years of raw federal H-1B disclosure spreadsheets into a clean list of cap-exempt employers, then scrapes their careers pages to surface real, currently-open IT roles.
Most H-1B applicants are subject to the annual cap and the lottery, which means even a perfect application has only a ~25% chance of being selected. But a category of employers is cap-exempt: universities, non-profit research organizations, government research labs, and qualifying medical centers. For people who can target them, those employers are a stable sponsorship path that bypasses the lottery entirely.
The catch: there's no clean public list of who they are. The Department of Labor publishes raw LCA disclosure spreadsheets every year, but they're enormous, messy, and don't flag cap-exemption. And once you've identified a sponsor, you still have to find their careers page and check if they're hiring anything you can actually do.
I built this for my own visa situation. The system aggregates three years of DOL data, filters for cap-exempt employers using NAICS codes plus name heuristics, ranks them by hiring volume, finds each company's careers page, and scrapes open IT roles. What was a week of manual research becomes a CSV.
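A minimal sketch of the aggregate-and-rank step, assuming each year's spreadsheet has already been loaded into a list of dict rows. The column names `EMPLOYER_NAME` and `TOTAL_WORKER_POSITIONS` and the `rank_employers` helper are assumptions for illustration, not necessarily what my scripts use.

```python
from collections import defaultdict


def rank_employers(rows_by_year):
    """Sum worker positions per employer per year, then rank by the overall total.

    rows_by_year maps a fiscal year to an iterable of dict rows (assumed layout).
    """
    totals = defaultdict(lambda: defaultdict(int))
    for year, rows in rows_by_year.items():
        for row in rows:
            # Normalize case and whitespace so spelling variants of one employer merge.
            name = " ".join(row["EMPLOYER_NAME"].upper().split())
            totals[name][year] += int(row.get("TOTAL_WORKER_POSITIONS") or 0)
    ranked = sorted(totals.items(), key=lambda kv: sum(kv[1].values()), reverse=True)
    return [(name, dict(per_year)) for name, per_year in ranked]
```

Normalizing the name before summing matters because the same employer often appears with different casing or spacing across filings.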
Each stage is a separate script and writes its output to disk before the next stage reads it. That separation matters because the scraping stage is slow and fragile. I want to be able to interrupt and resume without re-downloading 200MB of DOL data.
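The stage separation can be sketched as a small guard that skips a stage when its output file already exists. The `run_stage` helper and its signature are hypothetical; the real scripts may check for their outputs differently.

```python
from pathlib import Path


def run_stage(output_path, build):
    """Run `build` only when the stage's output is missing; return True if it ran."""
    out = Path(output_path)
    if out.exists():
        return False  # a previous run already produced this stage's output
    # Write to a temporary file first so an interrupted run never leaves a partial output.
    tmp = out.with_suffix(out.suffix + ".tmp")
    tmp.write_text(build(), encoding="utf-8")
    tmp.replace(out)
    return True
```

With this shape, rerunning the whole pipeline after a crash only redoes the stages whose files are missing.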
Parsing the DOL spreadsheets was tedious but mechanical. The interesting work was deciding which employers count as cap-exempt, how to find their careers pages without getting blocked, and how to scrape job listings reliably across dozens of different ATS vendors.
DOL doesn't tag employers as cap-exempt. That classification depends on what kind of organization the employer is, which the spreadsheets don't record directly. I needed a hybrid rule set.
Approach: Two-pass filter. The first pass matches NAICS codes: 6113 (colleges/universities), 6221 (general medical/surgical hospitals), 8139 (nonprofit business associations). The second pass matches employer names against keyword heuristics.
Why hybrid: NAICS alone misses employers that are misclassified or use parent-company codes. Name matching alone catches generic "Institute" entities that aren't actually cap-exempt. Running both and taking the union gives broad coverage with manageable false positives, which a human can review in the CSV.
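Here is a rough sketch of the union of the two passes. The NAICS prefixes come from the text above; the keyword list in `NAME_PATTERN` and the `classify` helper are illustrative assumptions, not my actual heuristics.

```python
import re

# NAICS prefixes named in the write-up; longer codes such as 611310 start with these.
CAP_EXEMPT_NAICS_PREFIXES = ("6113", "6221", "8139")

# Hypothetical keyword list for the name pass (an assumption for this sketch).
NAME_PATTERN = re.compile(
    r"\b(UNIVERSITY|COLLEGE|INSTITUTE|HOSPITAL|MEDICAL CENTER|RESEARCH FOUNDATION)\b",
    re.IGNORECASE,
)


def classify(employer_name, naics_code):
    """Return which passes matched; an empty list means the employer is filtered out."""
    reasons = []
    if str(naics_code).strip().startswith(CAP_EXEMPT_NAICS_PREFIXES):
        reasons.append("naics")
    if NAME_PATTERN.search(employer_name):
        reasons.append("name")
    return reasons
```

Keeping the reasons rather than a bare boolean makes the manual review easier: a row flagged only by `"name"` is the likeliest false positive.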
Once I have an employer list, I need their actual careers URL. Scraping Google directly gets blocked quickly and is fragile.
Approach: Use the Serper.dev API, which exposes Google search results through a clean REST interface. I filter results aggressively to drop aggregator sites (LinkedIn, Indeed, Glassdoor, ZipRecruiter, Wikipedia) because those don't give me the company's own ATS URL.
Why this matters: Aggregator URLs lead to listings that may be stale, missing, or behind login walls. The company's own careers page is where the source of truth lives.
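A sketch of the search-and-filter step. The aggregator list is the one from the text; the endpoint, the `X-API-KEY` header, and the `organic` / `link` response fields reflect my understanding of Serper's API and should be treated as assumptions, as should the helper names and the query string.

```python
import json
import urllib.request
from urllib.parse import urlparse

AGGREGATOR_DOMAINS = (
    "linkedin.com", "indeed.com", "glassdoor.com", "ziprecruiter.com", "wikipedia.org",
)


def is_aggregator(url):
    """True when the URL's host is an aggregator domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in AGGREGATOR_DOMAINS)


def pick_careers_url(organic_results):
    """Return the first result link that is not an aggregator, or None."""
    for result in organic_results:
        link = result.get("link", "")
        if link and not is_aggregator(link):
            return link
    return None


def search_careers_page(employer, api_key):
    """Query Serper (assumed endpoint and response shape) and filter the results."""
    request = urllib.request.Request(
        "https://google.serper.dev/search",
        data=json.dumps({"q": f"{employer} careers"}).encode("utf-8"),
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return pick_careers_url(payload.get("organic", []))
```

Matching on the hostname suffix rather than a substring avoids dropping an unrelated domain that merely contains an aggregator's name.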
There's no single way to scrape job listings. Workday is a SPA that hides content behind XHR. Greenhouse uses iframes. Some universities use 20-year-old static HTML. A single generic scraper covers maybe 40% of cases.
Approach: Use Playwright for headless browser automation (it handles dynamic JS-rendered content), and write ATS-specific extractors for the common vendors: Greenhouse, Lever, Workday, SmartRecruiters, iCIMS, Jobvite, Ashby, Taleo, SuccessFactors. When the URL doesn't match a known ATS pattern, fall back to a generic page scanner that looks for common job-listing markup.
Why this combination: The ATS-specific path is reliable and fast for the ~70% of employers using a known vendor. The generic fallback gets me partial coverage on the long tail without writing a custom scraper for every weird site.
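The routing between ATS-specific extractors and the generic fallback can be sketched as a hostname lookup. The vendor list is from the text; the host suffixes and the `detect_ats` helper are assumptions for illustration, and employers that host their ATS on a custom domain would not match this simple check.

```python
from urllib.parse import urlparse

# Assumed host suffixes for each vendor named in the write-up.
ATS_HOST_SUFFIXES = {
    "greenhouse.io": "greenhouse",
    "lever.co": "lever",
    "myworkdayjobs.com": "workday",
    "smartrecruiters.com": "smartrecruiters",
    "icims.com": "icims",
    "jobvite.com": "jobvite",
    "ashbyhq.com": "ashby",
    "taleo.net": "taleo",
    "successfactors.com": "successfactors",
}


def detect_ats(url):
    """Return the vendor key for a careers URL, or "generic" when nothing matches."""
    host = (urlparse(url).hostname or "").lower()
    for suffix, vendor in ATS_HOST_SUFFIXES.items():
        if host == suffix or host.endswith("." + suffix):
            return vendor
    return "generic"
```

The returned key can then select an extractor from a dictionary of functions, with the generic page scanner registered under `"generic"`.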
Scraping a few hundred careers pages with a real browser takes serious time. A crash three hours in shouldn't mean starting over.
Approach: Save progress to h1b_cap_exempt_checkpoint.json after each employer. On restart, skip employers already processed. Add polite delays between requests to avoid rate limits. A separate cron_jobs.py checks whether the sponsor CSV exists and whether enough careers pages are already populated before rerunning the search step, so it's idempotent on a schedule.
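A minimal sketch of the checkpoint loop. The file name is the real one; the JSON layout (employer name mapped to its scraped jobs), the helper names, and the two-second default delay are assumptions for illustration.

```python
import json
import time
from pathlib import Path

CHECKPOINT = Path("h1b_cap_exempt_checkpoint.json")


def load_checkpoint(path=CHECKPOINT):
    """Return previously saved results, or an empty dict on a first run."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def save_checkpoint(done, path=CHECKPOINT):
    """Write to a temp file and rename, so a crash mid-write can't corrupt the checkpoint."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(done, indent=2), encoding="utf-8")
    tmp.replace(path)


def scrape_all(employers, scrape_one, path=CHECKPOINT, delay_seconds=2.0):
    """Scrape each employer once, saving after every employer and skipping finished ones."""
    done = load_checkpoint(path)
    for employer in employers:
        if employer in done:
            continue  # already processed in an earlier run
        done[employer] = scrape_one(employer)
        save_checkpoint(done, path)
        time.sleep(delay_seconds)  # polite delay between requests
    return done
```

Saving after every employer costs a small write each time, but it means a crash loses at most the one employer that was in flight.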
h1b_cap_exempt_sponsors.csv: employer name, careers page URL, state, and H-1B worker counts for 2024 / 2025 / 2026.
it_jobs.csv: currently-open IT roles at those employers, with job title, URL, location, posting date, and scrape timestamp.
It turns raw federal disclosure data (technically public, but practically unusable) into something actionable. If I'm a candidate trying to bypass the H-1B lottery, I now have a ranked list of who actively sponsors and what they're hiring right now. If I'm doing immigration research or recruiting, I have a fresh employer panel I can rebuild any time the DOL publishes new disclosures.
The deeper point: a lot of "impossible to find" information isn't actually hidden. It's published in a format nobody wants to deal with. Sometimes the highest-leverage thing you can build is the cleanup layer.
All projects live on GitHub. Issues and PRs welcome.