A clean, fast Streamlit dashboard for exploring global seismic activity, backed by a pandas pipeline that turns messy USGS GeoJSON into a reusable analytical dataset.
The USGS publishes a comprehensive real-time earthquake feed. The data is excellent: magnitude, depth, location, tsunami flags, felt reports, alert levels, all of it. The problem is that consuming it directly is painful. The GeoJSON structure is nested, fields are inconsistent, location strings come in formats like "23km SSW of Volcano, Hawaii", and there's no built-in way to filter or chart it.
I wanted a tool that would let me actually look at the data. Filter by country, see where events cluster, check how depth correlates with magnitude, look at monthly trends. Less of an "earthquake monitoring" tool, more of a "give me a fast lens on whatever USGS just published" tool.
The deliberate split: heavy cleaning runs in a separate pipeline that writes a clean CSV. The dashboard just reads that CSV. That means the Streamlit app starts in under a second and never has to parse raw GeoJSON at request time.
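To give a sense of the pipeline side, here is a minimal sketch of flattening the nested feed into flat rows. The property names (mag, place, time, felt, tsunami, alert) and the [longitude, latitude, depth] coordinate order follow the USGS GeoJSON format; the function names and output column names are my own illustrative choices, not necessarily what pipeline.py uses.

```python
import csv
import io
from datetime import datetime, timezone


def flatten_feature(feature):
    """Turn one nested GeoJSON feature into a flat dict (one CSV row)."""
    props = feature.get("properties") or {}
    coords = (feature.get("geometry") or {}).get("coordinates") or []
    # USGS orders coordinates as [longitude, latitude, depth_km]; pad if short.
    lon, lat, depth = (list(coords) + [None, None, None])[:3]
    time_ms = props.get("time")  # milliseconds since the Unix epoch
    return {
        "id": feature.get("id"),
        "time_utc": (
            datetime.fromtimestamp(time_ms / 1000, tz=timezone.utc).isoformat()
            if time_ms is not None
            else None
        ),
        "magnitude": props.get("mag"),
        "depth_km": depth,
        "latitude": lat,
        "longitude": lon,
        "place": props.get("place"),
        "felt": props.get("felt"),
        "tsunami": props.get("tsunami"),
        "alert": props.get("alert"),
    }


def flatten_feed(feed):
    """Flatten every feature in a parsed GeoJSON feed."""
    return [flatten_feature(f) for f in feed.get("features", [])]


def rows_to_csv(rows):
    """Serialize flat rows to CSV text (missing values become empty cells)."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hand-written sample in the shape of the feed (values are made up).
sample_feed = {
    "type": "FeatureCollection",
    "features": [
        {
            "id": "example1",
            "properties": {
                "mag": 4.2,
                "place": "23km SSW of Volcano, Hawaii",
                "time": 1700000000000,
                "felt": 12,
                "tsunami": 0,
                "alert": None,
            },
            "geometry": {"type": "Point", "coordinates": [-155.2, 19.3, 12.5]},
        },
        {"id": "example2", "properties": {"mag": 2.1}, "geometry": None},
    ],
}

rows = flatten_feed(sample_feed)
csv_text = rows_to_csv(rows)
```

The second sample feature has no geometry and most properties missing, which is the kind of inconsistency the real feed produces; the sketch fills those cells with None rather than failing.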
I needed an interactive dashboard with maps and charts, with the lowest possible setup cost so I'd actually finish.
Why: Streamlit gives me Python-native widgets, built-in caching, and a one-line deployment story. The trade-off is less UI polish, but for a personal analytics tool that's the right call. I'd reach for React only if I needed custom interactions Streamlit can't express.
The place field looks like "23km SSW of Volcano, Hawaii" or "central Mid-Atlantic Ridge". There's no formal schema. Sometimes it ends with a country, sometimes a US state, sometimes a geographic region with no political boundary.
Resolution: A multi-step parser in pipeline.py.
Why this conservative approach: Wrong country labels would silently break the filter. Better to mark unknowns honestly and surface them as their own bucket than to over-confidently mis-classify.
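A conservative parser in this spirit might look like the sketch below. This is not the actual code from pipeline.py: the function name, the lookup tables (deliberately tiny subsets), and the exact order of checks are all assumptions made for illustration. The one property it is meant to show is that anything it cannot match confidently ends up in an explicit "Unknown" bucket.

```python
# Illustrative subsets only; a real pipeline would need complete tables.
US_STATES = {"Alaska", "AK", "California", "CA", "Hawaii", "HI", "Nevada", "NV"}
KNOWN_COUNTRIES = {"Chile", "Indonesia", "Japan", "Mexico", "Turkey"}
UNKNOWN = "Unknown"


def resolve_country(place):
    """Map a free-text USGS place string to a country, or to UNKNOWN."""
    # Step 1: missing or non-string values are unknown by definition.
    if not isinstance(place, str) or not place.strip():
        return UNKNOWN
    # Step 2: no comma means no trailing region to inspect,
    # e.g. "central Mid-Atlantic Ridge".
    if "," not in place:
        return UNKNOWN
    # Step 3: take only the text after the last comma.
    tail = place.rsplit(",", 1)[1].strip()
    # Step 4: US states (full names or abbreviations) map to one country.
    if tail in US_STATES:
        return "United States"
    # Step 5: accept the tail only if it is a country we recognize.
    if tail in KNOWN_COUNTRIES:
        return tail
    # Step 6: never guess; unknowns become their own filterable bucket.
    return UNKNOWN
```

For example, resolve_country("23km SSW of Volcano, Hawaii") returns "United States", while resolve_country("central Mid-Atlantic Ridge") returns "Unknown".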
The first version of the app called the USGS API and ran the full cleaning pipeline on every page render. Load time was 8+ seconds and the dashboard felt unusable.
Resolution: Pre-compute. pipeline.py runs separately (or on a schedule) and writes earthquake_clean.csv to disk. app.py only reads that file. Startup drops to under a second.
Why this matters: It's a small change in architecture but a big change in UX. It also separates concerns properly. The pipeline can fail or get updated without breaking the dashboard.
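One way to make "the pipeline can fail without breaking the dashboard" concrete is to have the pipeline write to a temporary file and swap it into place only after the write succeeds. The sketch below uses only the standard library, and the function names are hypothetical; the real app.py reads the file with pandas, and whether pipeline.py writes atomically like this is an assumption on my part.

```python
import csv
import os
import tempfile


def write_clean_csv(rows, path):
    """Pipeline side: write to a temp file, then atomically swap it in."""
    if not rows:
        raise ValueError("refusing to overwrite the clean CSV with no rows")
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
        # os.replace swaps the file in a single step, so a reader sees
        # either the old complete file or the new complete file.
        os.replace(tmp_path, path)
    except BaseException:
        # A failed run cleans up after itself and leaves the old CSV alone.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_clean_csv(path):
    """Dashboard side: read the pre-computed file; no cleaning happens here."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

In a Streamlit app, the loader is the natural function to wrap with st.cache_data so reruns do not re-read the file.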
Derived metrics like is_felt, time_gap_secs, magnitude buckets, and "higher-than-normal magnitude" flags could live in the dashboard layer.
Approach: Compute them once in the pipeline and store them as columns in the clean CSV. The dashboard reads them directly.
Why: It keeps the dashboard code purely about display. Every derived field is documented in one place. If I ever want to swap Streamlit for something else, the cleaned dataset is portable.
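As a rough illustration, the derived columns could be computed in one pandas function like the sketch below. The definitions are my assumptions, not the pipeline's documented rules: is_felt means at least one felt report, time_gap_secs is the gap to the previous event in time order, the bucket edges (4.0 and 6.0) are arbitrary, and "higher than normal" is taken to mean more than one standard deviation above the mean magnitude. The input column names (time_utc, magnitude, felt) are also hypothetical.

```python
import pandas as pd


def add_derived_columns(df):
    """Return a time-sorted copy of df with derived columns added."""
    out = df.sort_values("time_utc").reset_index(drop=True)

    # Felt reports are often missing; treat missing as "not felt".
    felt = pd.to_numeric(out["felt"], errors="coerce").fillna(0)
    out["is_felt"] = felt > 0

    # Seconds since the previous event (NaN for the earliest row).
    out["time_gap_secs"] = out["time_utc"].diff().dt.total_seconds()

    # Left-closed buckets: [-inf, 4), [4, 6), [6, inf). Edges are assumed.
    out["magnitude_bucket"] = pd.cut(
        out["magnitude"],
        bins=[-float("inf"), 4.0, 6.0, float("inf")],
        labels=["minor", "moderate", "strong"],
        right=False,
    )

    # Assumed rule: flag events more than one std above the mean magnitude.
    threshold = out["magnitude"].mean() + out["magnitude"].std()
    out["is_high_magnitude"] = out["magnitude"] > threshold
    return out


# Made-up sample rows, deliberately out of time order.
sample = pd.DataFrame(
    {
        "time_utc": pd.to_datetime(
            [
                "2024-01-01T00:10:00Z",
                "2024-01-01T00:00:00Z",
                "2024-01-01T01:10:00Z",
                "2024-01-01T00:00:30Z",
            ]
        ),
        "magnitude": [4.2, 2.5, 6.8, 3.0],
        "felt": [12, None, 340, 0],
    }
)
enriched = add_derived_columns(sample)
```

Because these end up as ordinary columns in the clean CSV, any consumer can filter on them without knowing how they were derived.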
The app has a handful of dependencies (Streamlit, pandas, plotly, requests). Running it on a fresh machine is annoying.
Approach: A small Dockerfile that pins the Python version, installs requirements, and exposes the Streamlit port. Build the image once, then docker run and the dashboard is live.
The shipped dashboard gives me:
The pandas pipeline is the part I'm most proud of. The clean CSV is the reusable artifact. Anyone can pick it up and build their own analysis on top (Tableau, a notebook, a different framework). The dashboard is one consumer. It's not the product.
Engineering lesson that generalizes: separate the data-prep layer from the consumption layer, even on small projects. It costs almost nothing to do early and saves real effort later.
All projects live on GitHub. Issues and PRs welcome.