PyPI - opendata-fetch - Versions diffs - 0.1.0__tar.gz - Mend

opendata-fetch 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (28)

opendata_fetch-0.1.0/.gitignore +29 -0
opendata_fetch-0.1.0/CONTRIBUTING.md +140 -0
opendata_fetch-0.1.0/LICENSE +21 -0
opendata_fetch-0.1.0/PKG-INFO +157 -0
opendata_fetch-0.1.0/README.md +110 -0
opendata_fetch-0.1.0/docs/sources/01-nces-schools.md +96 -0
opendata_fetch-0.1.0/docs/sources/02-epa-npl-superfund.md +106 -0
opendata_fetch-0.1.0/docs/sources/03-epa-sdwa-water.md +95 -0
opendata_fetch-0.1.0/docs/sources/04-epa-tri.md +75 -0
opendata_fetch-0.1.0/docs/sources/05-cdc-svi.md +76 -0
opendata_fetch-0.1.0/docs/sources/06-cdc-places.md +86 -0
opendata_fetch-0.1.0/docs/sources/07-hrsa-hpsa.md +79 -0
opendata_fetch-0.1.0/docs/sources/08-cdc-nors.md +71 -0
opendata_fetch-0.1.0/docs/sources/09-census-tracts.md +71 -0
opendata_fetch-0.1.0/docs/sources/10-nces-edge-districts.md +84 -0
opendata_fetch-0.1.0/docs/sources/future-fema-nfhl.md +77 -0
opendata_fetch-0.1.0/opendata_fetch/__init__.py +64 -0
opendata_fetch-0.1.0/opendata_fetch/cli.py +123 -0
opendata_fetch-0.1.0/opendata_fetch/engine.py +217 -0
opendata_fetch-0.1.0/opendata_fetch/fetchers/__init__.py +26 -0
opendata_fetch-0.1.0/opendata_fetch/registry.py +153 -0
opendata_fetch-0.1.0/opendata_fetch/runner.py +75 -0
opendata_fetch-0.1.0/opendata_fetch/sources.toml +195 -0
opendata_fetch-0.1.0/opendata_fetch/verify.py +124 -0
opendata_fetch-0.1.0/pyproject.toml +51 -0
opendata_fetch-0.1.0/tests/test_engine.py +174 -0
opendata_fetch-0.1.0/tests/test_registry.py +173 -0
opendata_fetch-0.1.0/tests/test_runner.py +139 -0

opendata_fetch-0.1.0/.gitignore ADDED

@@ -0,0 +1,29 @@
+# Downloaded data is never committed. opendata-fetch fetches it on demand.
+data/
+*.zip
+*.csv
+*.json
+*.geojson
+*.gpkg
+*.shp
+*.dbf
+*.shx
+*.prj
+*.parquet
+*.part
+# Python
+__pycache__/
+*.pyc
+.venv/
+venv/
+*.egg-info/
+dist/
+build/
+.pytest_cache/
+.ruff_cache/
+# OS / editor
+.DS_Store
+.idea/
+.vscode/

opendata_fetch-0.1.0/CONTRIBUTING.md ADDED

@@ -0,0 +1,140 @@
+# Contributing to opendata-fetch
+Thanks for helping. The two most valuable contributions are **adding a new
+source** and **keeping existing sources current** when an agency publishes a
+new vintage. Both are usually one-file edits.
+opendata-fetch has a hard scope rule that keeps it small and trustworthy:
+> **opendata-fetch downloads files. It does not transform them.**
+No column subsetting, no reshaping, no joins, no derived fields, no format
+conversion. Users get the file exactly as the agency ships it. PRs that add
+transform logic will be declined, not because they're bad, but because they
+belong downstream of opendata-fetch, not inside it.
+## Add a source (the common case)
+Most datasets are a fixed set of URLs. Add a `[[source]]` block to
+[`opendata_fetch/sources.toml`](opendata_fetch/sources.toml):
+```toml
+[[source]]
+slug = "12-my-dataset"                 # NN-kebab-case, unique
+agency = "Some Agency"
+dataset = "One line describing what this dataset is"
+doc = "docs/sources/12-my-dataset.md"  # optional but encouraged
+homepage = "https://agency.gov/dataset"
+update_cadence = "annual; bump `vintage`"
+vars = { vintage = "2025" }            # optional; see below
+  [[source.files]]
+  url = "https://agency.gov/data/{vintage}/file.csv"
+  dest = "file_{vintage}.csv"          # filename saved on disk
+  min_bytes = 1_000_000                # integrity floor (see below)
+```
+Then:
+```bash
+opendata-fetch fetch 12-my-dataset           # confirm it downloads
+opendata-fetch verify 12-my-dataset          # confirm the live URL is byte-stable
+```
+Add a matching doc under `docs/sources/` (copy an existing one for the shape).
+That's it. No Python.
+### Fields
+| Field | Required | Notes |
+|-------|----------|-------|
+| `slug` | yes | `NN-kebab-case`, unique across the registry |
+| `agency` | yes | publishing agency |
+| `dataset` | yes | one-line description |
+| `doc` | no | path to the source's doc |
+| `homepage` | no | agency landing page |
+| `update_cadence` | no | how often it changes / how to bump it |
+| `vars` | no | substitution variables (see below) |
+| `files` | one of | list of `{url, dest, min_bytes}` |
+| `handler` | one of | custom fetcher, for irregular sources (see below) |
+A source must have **either** `files` **or** `handler`, never both.
+### `vars` and the `{placeholder}` convention
+Many dataset URLs embed a year or version. Put that value in `vars` and
+reference it as `{name}` in `url` and `dest`. Bumping a dataset to a new
+release then means changing **one line**:
+```toml
+vars = { vintage = "2024" }    # change to "2026" when the new file lands
+```
+Common variables already in use: `vintage` (year/school-year) and `dataset_id`
+(Socrata four-by-four IDs that rotate each release). Use whatever name reads
+clearly; it just has to match the `{...}` in the URLs.
+### `min_bytes`
+Set it comfortably **below** the real file size (roughly half is a good rule).
+It is an integrity floor: a download smaller than this is rejected as a
+truncated file or an HTML error page. Too high and normal files fail; too low
+and a truncated download slips through. Note the observed size in the source's
+doc.
+## Bump a vintage (keeping a source current)
+When an agency publishes a new year/version:
+1. Find the new value (year, or the new Socrata `dataset_id`, etc.).
+2. Update the source's `vars` in `sources.toml`. If the filename pattern also
+   changed (e.g. NCES embeds a release-date stamp that isn't the vintage),
+   update the `url`/`dest` templates too.
+3. Update `min_bytes` if the size shifted materially.
+4. `opendata-fetch fetch <slug> --force` to confirm.
+5. Note the change in the source's doc.
+## Add an irregular source (custom fetcher)
+A few sources can't be expressed as a fixed URL list, e.g. they require a
+discovery call or assemble many files into one dataset (FEMA's National Flood
+Hazard Layer is ~2,600 county zips found via a search endpoint). For those,
+point the source at a `handler` instead of `files`:
+```toml
+[[source]]
+slug = "11-fema-nfhl"
+agency = "FEMA"
+dataset = "National Flood Hazard Layer (per-county)"
+handler = "opendata_fetch.fetchers.fema_nfhl:fetch"
+```
+Implement the handler in `opendata_fetch/fetchers/` with this signature:
+```python
+def fetch(source, dest_dir, *, force=False, extract=False) -> list[Path]:
+    """Download source.files-equivalent into dest_dir, return written paths."""
+```
+Keep custom fetchers stdlib-only where possible. If a fetcher genuinely needs
+a third-party library, add it as an **optional extra** in `pyproject.toml`, not
+to the core dependencies; the base install stays dependency-free.
+## Development
+```bash
+python -m venv .venv && source .venv/bin/activate
+pip install -e ".[dev]"
+pytest          # run the test suite (mocked HTTP, no network)
+ruff check .    # lint
+```
+Tests must not hit the network. Mock at the `urlopen` boundary (see
+`tests/test_engine.py`) or patch `download_file` (see `tests/test_runner.py`).
+## What to expect from review
+- New sources: welcome. Include a doc and confirm `fetch` + `verify` work.
+- Vintage bumps: welcome, the most useful routine contribution.
+- Transform/analysis features: out of scope (see the scope rule above).
+- Bug fixes to the engine/registry: welcome, with a test.
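
As an illustration outside the package contents: the `vars` and `{placeholder}` convention in CONTRIBUTING.md above can be sketched with plain `str.format`. This is a minimal sketch, not the package's code. The helper name `expand_file` is hypothetical, and the real expansion logic (presumably in `registry.py`, which is not shown in this excerpt) may differ, for example in how it treats URLs that contain literal braces.

```python
def expand_file(entry: dict, variables: dict[str, str]) -> dict:
    """Fill {name} placeholders in a file entry's url and dest (hypothetical helper)."""
    expanded = dict(entry)
    for key in ("url", "dest"):
        try:
            expanded[key] = entry[key].format(**variables)
        except KeyError as exc:
            # A placeholder with no matching entry in `vars` is a config error.
            raise ValueError(f"undefined variable {exc} in {key!r}") from exc
    return expanded


# Values taken from the example [[source]] block in CONTRIBUTING.md.
entry = {
    "url": "https://agency.gov/data/{vintage}/file.csv",
    "dest": "file_{vintage}.csv",
    "min_bytes": 1_000_000,
}
resolved = expand_file(entry, {"vintage": "2025"})
print(resolved["url"])   # https://agency.gov/data/2025/file.csv
print(resolved["dest"])  # file_2025.csv
```

Failing loudly on an undefined placeholder, rather than leaving `{vintage}` in the URL, is a design choice of this sketch: a typo in `vars` then surfaces before any request is made.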

opendata_fetch-0.1.0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Bhanu Bade
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

opendata_fetch-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,157 @@
+Metadata-Version: 2.4
+Name: opendata-fetch
+Version: 0.1.0
+Summary: Reliably download U.S. government open datasets as the agencies ship them.
+Project-URL: Homepage, https://github.com/manojbade/opendata-fetch
+Project-URL: Source, https://github.com/manojbade/opendata-fetch
+Project-URL: Issues, https://github.com/manojbade/opendata-fetch/issues
+Author: Bhanu Bade
+License: MIT License
+        Copyright (c) 2026 Bhanu Bade
+        Permission is hereby granted, free of charge, to any person obtaining a copy
+        of this software and associated documentation files (the "Software"), to deal
+        in the Software without restriction, including without limitation the rights
+        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+        copies of the Software, and to permit persons to whom the Software is
+        furnished to do so, subject to the following conditions:
+        The above copyright notice and this permission notice shall be included in all
+        copies or substantial portions of the Software.
+        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+        SOFTWARE.
+License-File: LICENSE
+Keywords: cdc,census,download,epa,federal,government,hrsa,nces,open-data
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Scientific/Engineering
+Classifier: Topic :: Utilities
+Requires-Python: >=3.11
+Provides-Extra: dev
+Requires-Dist: pytest>=8.0; extra == 'dev'
+Requires-Dist: ruff>=0.6; extra == 'dev'
+Description-Content-Type: text/markdown
+# opendata-fetch
+**Reliably download U.S. government open datasets as the agencies ship them.**
+opendata-fetch is a small, dependency-free tool for fetching public government open data
+files (EPA, Census, CDC, NCES, HRSA, ...) into your own pipeline. It handles
+the annoying parts of pulling data from government endpoints:
+- **rotating / vintage URLs** (NCES school-year stamps, annual Census vintages,
+  Socrata dataset IDs) expressed as one-line config knobs
+- **integrity checks** that reject truncated downloads and HTML error pages
+  served with a `200`
+- **retries with backoff**, atomic writes (no half-written files), and
+  idempotent skips
+- **optional whole-archive extraction** for zipped shapefiles and bundles
+What opendata-fetch **does not** do: transform, reshape, filter, or subset the data.
+You get the file exactly as the agency publishes it. Slicing columns, joining,
+and analysis are your job, with whatever tools you prefer.
+## Why
+Every team that builds on government open data re-solves the same fetch problems:
+the URL changed for the new vintage, the download silently returned an error
+page, the 3.8 GB file got truncated, the shapefile zip needs all four sibling
+files. opendata-fetch encodes those fixes once, as data, so adding or maintaining a
+source is a config edit, not a coding task.
+## Install
+```bash
+pip install opendata-fetch
+```
+Core has zero runtime dependencies (stdlib only). Requires Python 3.11+ (uses the stdlib `tomllib`).
+For development (editable install with tests):
+```bash
+git clone https://github.com/manojbade/opendata-fetch
+cd opendata-fetch
+pip install -e ".[dev]"
+```
+## Usage
+```bash
+opendata-fetch list                      # show every available source
+opendata-fetch fetch 05-cdc-svi          # download one source into ./data/05-cdc-svi/
+opendata-fetch fetch 05-cdc-svi 09-census-tracts   # several at once
+opendata-fetch fetch --all               # everything (large: hundreds of GB total)
+opendata-fetch fetch 09-census-tracts --extract    # unzip archives after download
+opendata-fetch fetch 05-cdc-svi --dest /data/raw   # choose the download root
+opendata-fetch fetch 04-epa-tri --force  # re-download even if it already exists
+opendata-fetch verify 05-cdc-svi         # re-download and byte-compare vs local copy
+```
+As a library:
+```python
+import opendata_fetch
+paths = opendata_fetch.fetch("05-cdc-svi", dest="data", extract=False)
+for src in opendata_fetch.list_sources():
+    print(src.slug, "-", src.dataset)
+```
+## Available sources
+Run `opendata-fetch list`. v1 ships 10 sources spanning EPA, Census, CDC, NCES, and
+HRSA. Per-source provenance and field notes live in [`docs/sources/`](docs/sources/).
+> **Note on size.** Several sources are large (CDC PLACES ~700 MB, EPA SDWA
+> ~1 GB across two zips). `fetch --all` downloads everything; fetch individual
+> slugs unless you really want the full set.
+## Adding your own source
+Most sources are pure config. Add a `[[source]]` block to
+[`opendata_fetch/sources.toml`](opendata_fetch/sources.toml):
+```toml
+[[source]]
+slug = "12-my-dataset"
+agency = "Some Agency"
+dataset = "What this dataset is"
+update_cadence = "annual; bump `vintage`"
+vars = { vintage = "2025" }
+  [[source.files]]
+  url = "https://example.gov/data/{vintage}/file.csv"
+  dest = "file_{vintage}.csv"
+  min_bytes = 1_000_000
+```
+Then run `opendata-fetch fetch 12-my-dataset`. No Python required. For irregular sources
+that need discovery calls or assembling many files, point the source at a
+custom `handler` instead; see [CONTRIBUTING.md](CONTRIBUTING.md).
+## Data licensing
+opendata-fetch is MIT-licensed. It downloads, but does not redistribute, government
+data. The datasets themselves are published by U.S. government agencies and are
+generally in the public domain; confirm the terms for any specific dataset on
+its agency page (linked in each source's doc) before redistributing.
+## Contributing
+See [CONTRIBUTING.md](CONTRIBUTING.md). The most useful contributions are new
+sources and keeping existing vintage knobs current.

opendata_fetch-0.1.0/README.md ADDED

@@ -0,0 +1,110 @@
+# opendata-fetch
+**Reliably download U.S. government open datasets as the agencies ship them.**
+opendata-fetch is a small, dependency-free tool for fetching public government open data
+files (EPA, Census, CDC, NCES, HRSA, ...) into your own pipeline. It handles
+the annoying parts of pulling data from government endpoints:
+- **rotating / vintage URLs** (NCES school-year stamps, annual Census vintages,
+  Socrata dataset IDs) expressed as one-line config knobs
+- **integrity checks** that reject truncated downloads and HTML error pages
+  served with a `200`
+- **retries with backoff**, atomic writes (no half-written files), and
+  idempotent skips
+- **optional whole-archive extraction** for zipped shapefiles and bundles
+What opendata-fetch **does not** do: transform, reshape, filter, or subset the data.
+You get the file exactly as the agency publishes it. Slicing columns, joining,
+and analysis are your job, with whatever tools you prefer.
+## Why
+Every team that builds on government open data re-solves the same fetch problems:
+the URL changed for the new vintage, the download silently returned an error
+page, the 3.8 GB file got truncated, the shapefile zip needs all four sibling
+files. opendata-fetch encodes those fixes once, as data, so adding or maintaining a
+source is a config edit, not a coding task.
+## Install
+```bash
+pip install opendata-fetch
+```
+Core has zero runtime dependencies (stdlib only). Requires Python 3.11+ (uses the stdlib `tomllib`).
+For development (editable install with tests):
+```bash
+git clone https://github.com/manojbade/opendata-fetch
+cd opendata-fetch
+pip install -e ".[dev]"
+```
+## Usage
+```bash
+opendata-fetch list                      # show every available source
+opendata-fetch fetch 05-cdc-svi          # download one source into ./data/05-cdc-svi/
+opendata-fetch fetch 05-cdc-svi 09-census-tracts   # several at once
+opendata-fetch fetch --all               # everything (large: hundreds of GB total)
+opendata-fetch fetch 09-census-tracts --extract    # unzip archives after download
+opendata-fetch fetch 05-cdc-svi --dest /data/raw   # choose the download root
+opendata-fetch fetch 04-epa-tri --force  # re-download even if it already exists
+opendata-fetch verify 05-cdc-svi         # re-download and byte-compare vs local copy
+```
+As a library:
+```python
+import opendata_fetch
+paths = opendata_fetch.fetch("05-cdc-svi", dest="data", extract=False)
+for src in opendata_fetch.list_sources():
+    print(src.slug, "-", src.dataset)
+```
+## Available sources
+Run `opendata-fetch list`. v1 ships 10 sources spanning EPA, Census, CDC, NCES, and
+HRSA. Per-source provenance and field notes live in [`docs/sources/`](docs/sources/).
+> **Note on size.** Several sources are large (CDC PLACES ~700 MB, EPA SDWA
+> ~1 GB across two zips). `fetch --all` downloads everything; fetch individual
+> slugs unless you really want the full set.
+## Adding your own source
+Most sources are pure config. Add a `[[source]]` block to
+[`opendata_fetch/sources.toml`](opendata_fetch/sources.toml):
+```toml
+[[source]]
+slug = "12-my-dataset"
+agency = "Some Agency"
+dataset = "What this dataset is"
+update_cadence = "annual; bump `vintage`"
+vars = { vintage = "2025" }
+  [[source.files]]
+  url = "https://example.gov/data/{vintage}/file.csv"
+  dest = "file_{vintage}.csv"
+  min_bytes = 1_000_000
+```
+Then run `opendata-fetch fetch 12-my-dataset`. No Python required. For irregular sources
+that need discovery calls or assembling many files, point the source at a
+custom `handler` instead; see [CONTRIBUTING.md](CONTRIBUTING.md).
+## Data licensing
+opendata-fetch is MIT-licensed. It downloads, but does not redistribute, government
+data. The datasets themselves are published by U.S. government agencies and are
+generally in the public domain; confirm the terms for any specific dataset on
+its agency page (linked in each source's doc) before redistributing.
+## Contributing
+See [CONTRIBUTING.md](CONTRIBUTING.md). The most useful contributions are new
+sources and keeping existing vintage knobs current.
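
As an illustration outside the package contents: the README's "integrity checks" and "atomic writes" bullets can be sketched as a size floor, an HTML sniff, and a write-then-rename. This is a minimal sketch under stated assumptions, not the code in `engine.py`, which is not shown in this excerpt. The names `looks_like_html` and `save_checked`, the `.part` suffix, and the HTML heuristic are all assumptions. A real downloader would stream to disk instead of holding the payload in memory.

```python
import os
import tempfile
from pathlib import Path


def looks_like_html(head: bytes) -> bool:
    """Heuristic (assumption): does the payload start like an HTML page?"""
    start = head.lstrip()[:15].lower()
    return start.startswith(b"<!doctype html") or start.startswith(b"<html")


def save_checked(payload: bytes, dest: Path, min_bytes: int) -> Path:
    """Validate the payload, then publish it to dest with an atomic rename."""
    if len(payload) < min_bytes:
        raise ValueError(f"{dest.name}: {len(payload)} bytes is below floor {min_bytes}")
    if looks_like_html(payload[:512]):
        raise ValueError(f"{dest.name}: looks like an HTML error page")
    dest.parent.mkdir(parents=True, exist_ok=True)
    part = dest.with_name(dest.name + ".part")  # temporary sibling file
    part.write_bytes(payload)
    os.replace(part, dest)  # atomic when both paths are on the same filesystem
    return dest


with tempfile.TemporaryDirectory() as tmp:
    out = save_checked(b"a,b\n1,2\n" * 100, Path(tmp) / "file.csv", min_bytes=100)
    print(out.stat().st_size)  # 800
```

Because the rename happens only after both checks pass, a rejected download never replaces an existing good copy at `dest`.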

opendata_fetch-0.1.0/docs/sources/01-nces-schools.md ADDED

@@ -0,0 +1,96 @@
+# NCES Schools: CCD + EDGE Geocode (NCES, U.S. Dept. of Education)
+The National Center for Education Statistics (NCES) Common Core of Data (CCD) is the federal census of US public schools and school districts. Four files together define the universe of public schools for a given school year: the Directory (identity, address, district, charter, grade range, operational status), Characteristics (NSLP designation, virtual flag, shared time), Membership (enrollment counts by grade/race/sex), and the EDGE Geocode file (latitude/longitude, locale code, county FIPS, CBSA, congressional district).
+Schools are keyed by the 12-character NCESSCH identifier; districts by the 7-character LEAID. The EDGE Geocode file comes from the companion Education Demographic and Geographic Estimates program and supplies the geographic attributes the CCD tables lack.
+**opendata-fetch slug:** `01-nces-schools`
+**Agency:** NCES (U.S. Dept. of Education)
+**File type:** zip archives (each contains CSV / SAS7BDAT; EDGE zip also contains pipe-delimited TXT, xlsx, and a nested shapefile zip)
+**Approx. size:** Directory ~13 MB, Characteristics ~5.4 MB, Membership ~203 MB (2.3 GB uncompressed), EDGE Geocode ~29 MB
+**Update cadence:** annual (school year). Bump `vintage` and the 6-digit release stamp in the filenames.
+## Download
+opendata-fetch pulls four files:
+```
+https://nces.ed.gov/ccd/Data/zip/ccd_sch_029_{vintage}_w_1a_073025.zip   (Directory)
+https://nces.ed.gov/ccd/Data/zip/ccd_sch_129_{vintage}_w_1a_073025.zip   (Characteristics)
+https://nces.ed.gov/ccd/Data/zip/ccd_sch_052_{vintage}_l_1a_073025.zip   (Membership)
+https://nces.ed.gov/programs/edge/data/EDGE_GEOCODE_PUBLICSCH_{vintage}.zip (EDGE Geocode)
+```
+The general patterns are:
+- CCD school-level files: `https://nces.ed.gov/ccd/Data/zip/ccd_sch_{filetype}_{SY}_{stage}_1a_{releaseDate}.zip`
+- EDGE Geocode: `https://nces.ed.gov/programs/edge/data/EDGE_GEOCODE_PUBLICSCH_{SY}.zip`
+`{vintage}` is the school-year code (`2425` = 2024-25, `2526` = 2025-26). It is filled from the source's `vars` table in `opendata_fetch/sources.toml`, so moving to a new year is a one-line edit. Note two things change per release: `{vintage}` and the embedded 6-digit release-date stamp (`073025`), so the filename literals in the registry must be updated alongside the variable.
+**Stage suffix is inconsistent across files for the same year.** As of 2024-25: Directory (`029`) and Characteristics (`129`) use `w` (working/preliminary), Membership (`052`) uses `l` (locked/final). When probing for a new release, try both `w` and `l` since the stage depends on file type and timing.
+The same school-level page also lists Staff (`ccd_sch_033`) and Lunch Program Eligibility (`ccd_sch_059`) files that opendata-fetch does not ship; add them as extra `[[source.files]]` entries if you need them.
+## Fields and files in this download
+Captured from the `2425` (2024-25) release. Column sets are stable year to year but can shift; treat this as a snapshot.
+**Directory** (`ccd_sch_029_2425_w_1a_073025.csv`, 65 columns). Annotated by group rather than per-column:
+`SCHOOL_YEAR, FIPST, STATENAME, ST, SCH_NAME, LEA_NAME, STATE_AGENCY_NO, UNION, ST_LEAID, LEAID, ST_SCHID, NCESSCH, SCHID, MSTREET1, MSTREET2, MSTREET3, MCITY, MSTATE, MZIP, MZIP4, LSTREET1, LSTREET2, LSTREET3, LCITY, LSTATE, LZIP, LZIP4, PHONE, WEBSITE, SY_STATUS, SY_STATUS_TEXT, UPDATED_STATUS, UPDATED_STATUS_TEXT, EFFECTIVE_DATE, SCH_TYPE_TEXT, SCH_TYPE, RECON_STATUS, OUT_OF_STATE_FLAG, CHARTER_TEXT, CHARTAUTH1, CHARTAUTHN1, CHARTAUTH2, CHARTAUTHN2, NOGRADES, G_PK_OFFERED, G_KG_OFFERED, G_1_OFFERED, G_2_OFFERED, G_3_OFFERED, G_4_OFFERED, G_5_OFFERED, G_6_OFFERED, G_7_OFFERED, G_8_OFFERED, G_9_OFFERED, G_10_OFFERED, G_11_OFFERED, G_12_OFFERED, G_13_OFFERED, G_UG_OFFERED, G_AE_OFFERED, GSLO, GSHI, LEVEL, IGOFFERED`
+- **Identifiers:** `SCHOOL_YEAR`, `FIPST` (state FIPS), `STATE_AGENCY_NO`, `UNION`, `ST_LEAID`/`LEAID` (state and NCES district IDs), `ST_SCHID`/`NCESSCH`/`SCHID` (state and NCES school IDs).
+- **Names:** `SCH_NAME` (school), `LEA_NAME` (district), `STATENAME`, `ST`.
+- **Mailing address (`M*`):** `MSTREET1-3`, `MCITY`, `MSTATE`, `MZIP`, `MZIP4`.
+- **Location/physical address (`L*`):** `LSTREET1-3`, `LCITY`, `LSTATE`, `LZIP`, `LZIP4`.
+- **Contact:** `PHONE`, `WEBSITE`.
+- **Operational status:** `SY_STATUS`/`SY_STATUS_TEXT` (school-year status, see Quirks for codes), `UPDATED_STATUS`/`UPDATED_STATUS_TEXT`, `EFFECTIVE_DATE`, `RECON_STATUS`, `OUT_OF_STATE_FLAG`.
+- **Type & charter:** `SCH_TYPE`/`SCH_TYPE_TEXT` (school type), `CHARTER_TEXT` (charter status), `CHARTAUTH1/CHARTAUTHN1/CHARTAUTH2/CHARTAUTHN2` (charter authorizer IDs/names).
+- **Grades offered:** `NOGRADES` flag, the `G_*_OFFERED` per-grade flags (`PK`, `KG`, `1`-`13`, `UG`, `AE`), and the summaries `GSLO`/`GSHI` (lowest/highest grade), `LEVEL` (school level), `IGOFFERED`.
+**Characteristics** (`ccd_sch_129_2425_w_1a_073025.csv`, 17 columns):
+`SCHOOL_YEAR, FIPST, STATENAME, ST, SCH_NAME, STATE_AGENCY_NO, UNION, ST_LEAID, LEAID, ST_SCHID, NCESSCH, SCHID, SHARED_TIME, NSLP_STATUS, NSLP_STATUS_TEXT, VIRTUAL, VIRTUAL_TEXT`
+- **Identity (12):** same identifier/name/state columns as the Directory (`SCHOOL_YEAR` … `SCHID`).
+- `SHARED_TIME` — whether the school is a shared-time facility (students attend part-time from other schools).
+- `NSLP_STATUS`/`NSLP_STATUS_TEXT` — National School Lunch Program participation designation (free/reduced-price lunch program status).
+- `VIRTUAL`/`VIRTUAL_TEXT` — virtual-school status code and its text label.
+**Membership** (`ccd_sch_052_2425_l_1a_073025.csv`, 18 columns; one row per school x grade x race/ethnicity x sex):
+`SCHOOL_YEAR, FIPST, STATENAME, ST, SCH_NAME, STATE_AGENCY_NO, UNION, ST_LEAID, LEAID, ST_SCHID, NCESSCH, SCHID, GRADE, RACE_ETHNICITY, SEX, STUDENT_COUNT, TOTAL_INDICATOR, DMS_FLAG`
+- **Identity (12):** same identifier/name/state columns as the Directory.
+- `GRADE` — grade level for the row.
+- `RACE_ETHNICITY` — race/ethnicity category for the row.
+- `SEX` — sex category for the row.
+- `STUDENT_COUNT` — enrollment count for that grade x race x sex cell.
+- `TOTAL_INDICATOR` — flags whether the row is a subtotal/total (vs a single grade x race x sex cell).
+- `DMS_FLAG` — data-management-system reporting/quality flag.
+**EDGE Geocode** (`EDGE_GEOCODE_PUBLICSCH_2425.TXT`, pipe-delimited, **no header row**, 23 positional fields). In file order: NCES school ID, LEAID, school name, operating-state FIPS, street, city, state, ZIP, state FIPS, county FIPS, county name, locale code, latitude, longitude, CBSA code, CBSA name, CBSA type, CSA code, CSA name, congressional district code, state-legislature lower district, state-legislature upper district, school year. The key columns to note: `NCESSCH` and `LEAID` (join keys back to the CCD tables), latitude/longitude (address-geocoded point), the NCES locale code (urban/rural classification), county FIPS, the CBSA code (metro/micro area), and the congressional district code. The same EDGE zip also contains the identical data as an `.xlsx` (with labeled headers), a `.sas7bdat`, and a nested `Shapefile_SCH.zip`; field meanings are documented in `EDGE_GEOCODE_PUBLIC_FILEDOC.pdf` inside the zip.
+## Provenance & landing pages
+- CCD data index: https://nces.ed.gov/ccd/files.asp
+- EDGE geocode: https://nces.ed.gov/programs/edge/geographic/schoollocations
+- EDGE locale framework: https://nces.ed.gov/programs/edge/Geographic/LocaleBoundaries
+- Release calendar: https://ies.ed.gov/nces-statistical-products-release-calendar
+- EDGE geocode field documentation: `EDGE_GEOCODE_PUBLIC_FILEDOC.pdf` (inside the EDGE zip)
+## Quirks & notes
+- **Membership zip uses DEFLATE64 compression** (zip compression method 9). Python's stdlib `zipfile` cannot decode it, and `zipfile-deflate64` does not currently build on Python 3.13. Use the `unzip` binary (`unzip -p`) to extract it; `unzip` handles DEFLATE64 and is available on macOS and standard Linux runners.
+- **EDGE Geocode zip contains multiple formats:** pipe-delimited TXT (no header), xlsx (with headers), SAS7BDAT, and a nested `Shapefile_SCH.zip`. The TXT column meanings are documented in `EDGE_GEOCODE_PUBLIC_FILEDOC.pdf`.
+- The Directory and Characteristics zips contain both CSV and SAS7BDAT.
+- NCESSCH (12-char) and LEAID (7-char) are string IDs with leading zeros for some states. Read them as strings; integer parsing drops leading zeros.
+- Membership is large (203 MB compressed, ~2.3 GB uncompressed) because rows are at the school x grade x race x sex level. A downstream reader should stream/chunk it rather than load it whole.
+- Latitude/longitude in EDGE Geocode are address-geocoding estimates, not GPS readings (documented in the EDGE file doc).
+- FYI for downstream users: `SY_STATUS` codes are 1=Open, 2=Closed, 3=New, 4=Added, 5=Changed Boundary/Agency, 6=Inactive, 7=Future, 8=Reopened. Grade codes (`GSLO`/`GSHI`) are 2-char strings including reserved codes `M` (Missing) and `N` (Not applicable), not integers.
+## Manual fallback
+1. Download each file from the corresponding landing page above using a browser (CCD index for the `ccd_sch_*` zips; EDGE geocode page for the geocode zip).
+2. Place the file in the source's download directory with the exact filename opendata-fetch expects.
+3. Re-run `opendata-fetch fetch 01-nces-schools`; it detects the present file and skips the download.
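
As an illustration outside the package contents (opendata-fetch itself does no parsing): a downstream reader for the headerless, pipe-delimited EDGE Geocode TXT described above might look like the sketch below. The 23 field labels are names chosen for this sketch, not official NCES column names; only their order follows the doc. The sample row is made up. Every value is kept as a string so that `NCESSCH` and `LEAID` retain their leading zeros.

```python
import csv
import io

# Labels are hypothetical; the order follows the field list in the doc above.
EDGE_FIELDS = [
    "ncessch", "leaid", "name", "op_state_fips", "street", "city", "state", "zip",
    "state_fips", "county_fips", "county_name", "locale", "lat", "lon",
    "cbsa", "cbsa_name", "cbsa_type", "csa", "csa_name",
    "cong_district", "sld_lower", "sld_upper", "school_year",
]


def read_edge(lines):
    """Yield one dict per row; all values stay strings (no int parsing)."""
    # QUOTE_NONE treats any quote characters in school names as literal text.
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(EDGE_FIELDS):
            raise ValueError(f"expected {len(EDGE_FIELDS)} fields, got {len(row)}")
        yield dict(zip(EDGE_FIELDS, row))


# Made-up sample row with 23 fields, for demonstration only.
sample = (
    "010000500870|0100005|Example School|01|1 Main St|Exampleville|AL|35950|"
    "01|01095|Example County|32|34.2600|-86.2000|10700|Example Metro|2|N|N|"
    "0104|026|009|2024-2025\n"
)
rows = list(read_edge(io.StringIO(sample)))
print(rows[0]["ncessch"])  # 010000500870
print(rows[0]["leaid"])    # 0100005
```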

opendata_fetch-0.1.0/docs/sources/02-epa-npl-superfund.md ADDED

@@ -0,0 +1,106 @@
+# EPA NPL Superfund Site Boundaries + Human Exposure (EPA)
+EPA's National Priorities List (NPL) is the federal roster of the most contaminated hazardous-waste sites in the US (Superfund sites). Two files describe them together: a GeoJSON of site boundary polygons (with identity attributes such as EPA_ID, site name, state, county, status, last change date) and a JSON list of per-site human-exposure status codes. The two join on the site's EPA ID.
+The boundaries file is published on EPA's ArcGIS Hub; the human-exposure file comes from EPA's Superfund Enterprise Management System (SEMS).
+**opendata-fetch slug:** `02-epa-npl-superfund`
+**Agency:** EPA
+**File type:** geojson + json
+**Approx. size:** NPL boundaries GeoJSON ~60 MB, Human Exposure JSON ~1.3 MB
+**Update cadence:** rolling. The boundaries dataset is refreshed by EPA irregularly; the human-exposure JSON is updated roughly daily.
+## Download
+opendata-fetch pulls two files at fixed URLs (no templated variable):
+```
+https://opendata.arcgis.com/api/v3/datasets/d6e1591d9a424f1fa6d95a02095a06d7_0/downloads/data?format=geojson&spatialRefId=4326  -> npl_boundaries.geojson
+https://www3.epa.gov/semsjson/Human_Exposure_Site_List.json                                                                      -> human_exposure.json
+```
+The boundaries URL embeds an ArcGIS dataset id (`d6e1591d9a424f1fa6d95a02095a06d7_0`). If EPA republishes the layer under a new id, edit the URL in `opendata_fetch/sources.toml`. There is no per-year variable for this source.
+## Fields and files in this download
+Captured from the files as served at audit time. Schemas may drift: EPA refreshes the boundaries layer irregularly and the human-exposure JSON roughly daily.
+**`npl_boundaries.geojson`** — MultiPolygon/Polygon features, CRS EPSG:4326, ~2,406 features, 32 attribute fields:
+| column | meaning |
+|--------|---------|
+| `OBJECTID` | Internal feature object ID. |
+| `REGION_CODE` | EPA region (1-10). |
+| `EPA_PROGRAM` | EPA program associated with the site. |
+| `EPA_ID` | Site EPA/SEMS ID (join key; usually 12 characters). |
+| `SITE_NAME` | Superfund site name. |
+| `SITE_FEATURE_CLASS` | Feature classification grouping. |
+| `SITE_FEATURE_TYPE` | Feature type (e.g. `Site Boundary`, `Comprehensive Site Area`); distinguishes multiple polygons per site. |
+| `SITE_FEATURE_NAME` | Name of the individual feature. |
+| `SITE_FEATURE_DESCRIPTION` | Free-text description of the feature. |
+| `NPL_STATUS_CODE` | NPL status (F Final, D Deleted, P Proposed, etc.; see official documentation). |
+| `FEDERAL_FACILITY_DETER_CODE` | Federal-facility determination code. |
+| `LAST_CHANGE_DATE` | Date the feature record was last changed. |
+| `ORIGINAL_CREATION_DATE` | Date the feature record was created. |
+| `SITE_FEATURE_SOURCE` | Source of the feature geometry. |
+| `STREET_ADDR_TXT` | Site street address. |
+| `ADDR_COMMENT` | Address comment/note. |
+| `CITY_NAME` | Site city. |
+| `COUNTY` | Site county. |
+| `STATE_CODE` | 2-letter state code. |
+| `ZIP_CODE` | Site ZIP code. |
+| `SITE_CONTACT_NAME` | Site contact name. |
+| `PRIMARY_TELEPHONE_NUM` | Site contact phone number. |
+| `SITE_CONTACT_EMAIL` | Site contact email. |
+| `URL_ALIAS_TXT` | Site profile URL alias. |
+| `FEATURE_INFO_URL` | URL with more information about the feature. |
+| `FEATURE_INFO_URL_DESC` | Description of the info URL. |
+| `GIS_AREA` | Feature area as computed in GIS. |
+| `GIS_AREA_UNITS` | Units for `GIS_AREA`. |
+| `PROJECTION` | Native projection of the source geometry. |
+| `SF_GEOSPATIAL_DATA_DISCLAIMER` | Superfund geospatial data disclaimer text. |
+| `SHAPE_Length` | Polygon perimeter length (geometry attribute). |
+| `SHAPE_Area` | Polygon area (geometry attribute). |
+**`human_exposure.json`** — a JSON object with top-level keys `success`, `data`, `meta`. `data` is a list (~1,905 records), one per site, each with 16 fields:
+| field | meaning |
+|--------|---------|
+| `humanexposurepathdesc` | Human-exposure pathway status description (plain-text). |
+| `city` | Site city. |
+| `nplstatus` | NPL status of the site. |
+| `saaflag` | SAA flag (SEMS code; see EPA SEMS documentation). |
+| `county` | Site county. |
+| `fedfacilityflag` | Federal-facility flag. |
+| `zipcode` | Site ZIP code. |
+| `epaid` | Site EPA/SEMS ID (lowercase; join key to `EPA_ID`). |
+| `regionid` | EPA region. |
+| `sitename` | Site name. |
+| `eibaselinesiteInd` | Environmental-indicator baseline-site indicator (SEMS code; see EPA SEMS documentation). |
+| `siteid` | SEMS internal site ID. |
+| `humexposurestscode` | Human-exposure status code (e.g. `HENC`, `HEUC`; see EPA SEMS documentation). |
+| `state` | State name. |
+| `state_code` | 2-letter state code. |
+| `friendlyurl` | Public-facing site profile URL. |
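A minimal sketch of reading this envelope in Python, assuming only the structure described above: top-level `success`, `data`, and `meta` keys, with `data` holding a list of per-site records. The sample payload and its values are invented for illustration, the helper names `load_exposure_records` and `count_status_codes` are hypothetical, and the types shown for `success` and `meta` are assumptions.

```python
import json
from collections import Counter

# Hypothetical miniature payload with the documented envelope.
# All values are invented; the types of `success` and `meta` are assumptions.
SAMPLE = json.dumps({
    "success": True,
    "meta": {},
    "data": [
        {"epaid": "AAA000000001", "sitename": "Example Site A", "humexposurestscode": "HENC"},
        {"epaid": "AAA000000002", "sitename": "Example Site B", "humexposurestscode": "HEUC"},
        {"epaid": "AAA000000003", "sitename": "Example Site C", "humexposurestscode": ""},
    ],
})


def load_exposure_records(text):
    """Parse the envelope and return the list stored under `data`."""
    payload = json.loads(text)
    records = payload.get("data")
    if not isinstance(records, list):
        raise ValueError("expected a list under the top-level 'data' key")
    return records


def count_status_codes(records):
    """Tally `humexposurestscode`, folding blank or missing codes into None."""
    return Counter(
        (rec.get("humexposurestscode") or "").strip() or None
        for rec in records
    )


records = load_exposure_records(SAMPLE)
status_counts = count_status_codes(records)
```

For the real file, the same helpers could be fed the text of `human_exposure.json` instead of `SAMPLE`.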
+## Provenance & landing pages
+- NPL Boundaries dataset (ArcGIS Hub): https://hub.arcgis.com/maps/EPA::npl-superfund-site-boundaries-epa-public-2022/explore
+- ArcGIS Hub item page: https://www.arcgis.com/home/item.html?id=d6e1591d9a424f1fa6d95a02095a06d7
+- EPA Superfund landing: https://www.epa.gov/superfund/superfund-national-priorities-list-npl
+- EPA SEMS overview: https://www.epa.gov/enviro/sems-overview
+## Quirks & notes
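+- **Join-key field names differ:** the boundaries file uses `EPA_ID` (uppercase, with underscore); the human-exposure file uses `epaid` (lowercase, no underscore). A downstream user joining the two must map one field name onto the other, and should confirm that the ID values themselves match in case.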
+- **The human-exposure JSON drifts on essentially every download.** EPA updates exposure status codes in place; the byte size can stay constant (observed at exactly 1,346,934 bytes across consecutive days) while the content (SHA256) changes. Treat content changes here as expected, not as corruption.
+- **Multiple polygons per site:** the boundaries dataset has more features than unique sites. A single `EPA_ID` can have several feature polygons distinguished by `SITE_FEATURE_TYPE` (e.g. `Comprehensive Site Area`, `Site Boundary`, `Current Ground Boundary`). opendata-fetch ships the raw file with all of them; any dedup is a downstream choice.
+- **Dirty `SITE_FEATURE_TYPE` values:** some rows carry a trailing `\r\n`, and some hold the literal string `<Null>` instead of a JSON null. Downstream parsers should strip whitespace and treat `<Null>` as missing before grouping or filtering on this field.
+- **Status codes (FYI):** `NPL_STATUS_CODE` values include `F` (Final), `D` (Deleted), `N`, `P` (Proposed), `R`, `S`, `A`, and null. Human-exposure codes (`humexposurestscode`): `HENC` (not under control, worst), `HEPR`, `HEUC`, `HHPA`, `HEID`, `HEIC`, and blank (not assigned).
+- Geometries are a mix of Polygon and MultiPolygon. At least one site (`RID09321243`) has an 11-character `EPA_ID` rather than the usual 12.
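The quirks above can be handled downstream along these lines. This is a sketch under stated assumptions, not part of opendata-fetch: the helper names are hypothetical, the sample records are invented, and trimming plus upper-casing the ID values is an assumed canonical form that should be checked against the real files. It cleans `SITE_FEATURE_TYPE`, joins on `EPA_ID`/`epaid` while keeping one row per polygon, and compares content by SHA-256 rather than by byte size.

```python
import hashlib


def clean_feature_type(value):
    """Strip stray whitespace (e.g. a trailing CR/LF); map '<Null>' or empty to None."""
    if value is None:
        return None
    text = value.strip()
    return None if text in ("", "<Null>") else text


def normalize_epa_id(value):
    """Assumption: trimmed, upper-cased text is a safe canonical form for the ID."""
    return value.strip().upper() if isinstance(value, str) else None


def join_by_epa_id(features, exposure_records):
    """Attach the matching exposure record (or None) to every boundary feature.

    One output row per polygon, so a site with several polygons appears several times.
    """
    exposure_by_id = {}
    for rec in exposure_records:
        key = normalize_epa_id(rec.get("epaid"))
        if key is not None:
            exposure_by_id[key] = rec
    rows = []
    for feature in features:
        props = feature.get("properties") or {}
        epa_id = normalize_epa_id(props.get("EPA_ID"))
        rows.append({
            "epa_id": epa_id,
            "feature_type": clean_feature_type(props.get("SITE_FEATURE_TYPE")),
            "exposure": exposure_by_id.get(epa_id),
        })
    return rows


def content_changed(old_bytes, new_bytes):
    """Compare SHA-256 digests; byte length alone can stay constant across updates."""
    return hashlib.sha256(old_bytes).digest() != hashlib.sha256(new_bytes).digest()


# Hypothetical records shaped like the field tables; all values are invented.
features = [
    {"type": "Feature", "geometry": None,
     "properties": {"EPA_ID": "AAA000000001", "SITE_FEATURE_TYPE": "Site Boundary\r\n"}},
    {"type": "Feature", "geometry": None,
     "properties": {"EPA_ID": "AAA000000001", "SITE_FEATURE_TYPE": "Comprehensive Site Area"}},
    {"type": "Feature", "geometry": None,
     "properties": {"EPA_ID": "AAA000000002", "SITE_FEATURE_TYPE": "<Null>"}},
]
exposure = [{"epaid": "AAA000000001", "humexposurestscode": "HEUC"}]

rows = join_by_epa_id(features, exposure)
old_payload = b'{"code": "HENC"}'
new_payload = b'{"code": "HEUC"}'
```

Here the join is a left join from polygons to exposure records, so sites without an exposure record keep `exposure` as `None` instead of being dropped.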
+## Manual fallback
+1. **NPL Boundaries:** open the ArcGIS Hub item page above, click Download -> GeoJSON, save as `npl_boundaries.geojson` into the source's download directory.
+2. **Human Exposure:** open the JSON URL in a browser and save it as `human_exposure.json`.
+3. Re-run `opendata-fetch fetch 02-epa-npl-superfund`; it detects the files already present and skips the downloads.
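The manual steps above could also be scripted. The following is a rough sketch using only the Python standard library and the two URLs listed under Download; it is not how opendata-fetch itself is implemented. The function names, the `.part` temporary suffix, and the destination directory are hypothetical, and whether the ArcGIS endpoint serves the file directly to a plain HTTP client is an assumption.

```python
import shutil
import urllib.request
from pathlib import Path

# Target filename -> URL, copied from the Download section.
FILES = {
    "npl_boundaries.geojson": (
        "https://opendata.arcgis.com/api/v3/datasets/"
        "d6e1591d9a424f1fa6d95a02095a06d7_0/downloads/data"
        "?format=geojson&spatialRefId=4326"
    ),
    "human_exposure.json": "https://www3.epa.gov/semsjson/Human_Exposure_Site_List.json",
}


def download(url, dest):
    """Stream one URL to a temporary name, then rename it to the final name."""
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url, timeout=60) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(dest)


def fetch_missing(dest_dir, downloader=download):
    """Download only the files not already present; return the names fetched."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for name, url in FILES.items():
        target = dest_dir / name
        if target.exists():
            continue  # mirror the skip-if-present behavior described in step 3
        downloader(url, target)
        fetched.append(name)
    return fetched
```

Calling `fetch_missing("path/to/download-dir")` with the source's actual download directory would fetch whichever of the two files is absent. The `downloader` parameter exists so the skip logic can be exercised without network access.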