opendata-fetch 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (28)
  1. opendata_fetch-0.1.0/.gitignore +29 -0
  2. opendata_fetch-0.1.0/CONTRIBUTING.md +140 -0
  3. opendata_fetch-0.1.0/LICENSE +21 -0
  4. opendata_fetch-0.1.0/PKG-INFO +157 -0
  5. opendata_fetch-0.1.0/README.md +110 -0
  6. opendata_fetch-0.1.0/docs/sources/01-nces-schools.md +96 -0
  7. opendata_fetch-0.1.0/docs/sources/02-epa-npl-superfund.md +106 -0
  8. opendata_fetch-0.1.0/docs/sources/03-epa-sdwa-water.md +95 -0
  9. opendata_fetch-0.1.0/docs/sources/04-epa-tri.md +75 -0
  10. opendata_fetch-0.1.0/docs/sources/05-cdc-svi.md +76 -0
  11. opendata_fetch-0.1.0/docs/sources/06-cdc-places.md +86 -0
  12. opendata_fetch-0.1.0/docs/sources/07-hrsa-hpsa.md +79 -0
  13. opendata_fetch-0.1.0/docs/sources/08-cdc-nors.md +71 -0
  14. opendata_fetch-0.1.0/docs/sources/09-census-tracts.md +71 -0
  15. opendata_fetch-0.1.0/docs/sources/10-nces-edge-districts.md +84 -0
  16. opendata_fetch-0.1.0/docs/sources/future-fema-nfhl.md +77 -0
  17. opendata_fetch-0.1.0/opendata_fetch/__init__.py +64 -0
  18. opendata_fetch-0.1.0/opendata_fetch/cli.py +123 -0
  19. opendata_fetch-0.1.0/opendata_fetch/engine.py +217 -0
  20. opendata_fetch-0.1.0/opendata_fetch/fetchers/__init__.py +26 -0
  21. opendata_fetch-0.1.0/opendata_fetch/registry.py +153 -0
  22. opendata_fetch-0.1.0/opendata_fetch/runner.py +75 -0
  23. opendata_fetch-0.1.0/opendata_fetch/sources.toml +195 -0
  24. opendata_fetch-0.1.0/opendata_fetch/verify.py +124 -0
  25. opendata_fetch-0.1.0/pyproject.toml +51 -0
  26. opendata_fetch-0.1.0/tests/test_engine.py +174 -0
  27. opendata_fetch-0.1.0/tests/test_registry.py +173 -0
  28. opendata_fetch-0.1.0/tests/test_runner.py +139 -0
@@ -0,0 +1,29 @@
1
+ # Downloaded data is never committed. opendata-fetch fetches it on demand.
2
+ data/
3
+ *.zip
4
+ *.csv
5
+ *.json
6
+ *.geojson
7
+ *.gpkg
8
+ *.shp
9
+ *.dbf
10
+ *.shx
11
+ *.prj
12
+ *.parquet
13
+ *.part
14
+
15
+ # Python
16
+ __pycache__/
17
+ *.pyc
18
+ .venv/
19
+ venv/
20
+ *.egg-info/
21
+ dist/
22
+ build/
23
+ .pytest_cache/
24
+ .ruff_cache/
25
+
26
+ # OS / editor
27
+ .DS_Store
28
+ .idea/
29
+ .vscode/
@@ -0,0 +1,140 @@
1
+ # Contributing to opendata-fetch
2
+
3
+ Thanks for helping. The two most valuable contributions are **adding a new
4
+ source** and **keeping existing sources current** when an agency publishes a
5
+ new vintage. Both are usually one-file edits.
6
+
7
+ opendata-fetch has a hard scope rule that keeps it small and trustworthy:
8
+
9
+ > **opendata-fetch downloads files. It does not transform them.**
10
+
11
+ No column subsetting, no reshaping, no joins, no derived fields, no format
12
+ conversion. Users get the file exactly as the agency ships it. PRs that add
13
+ transform logic will be declined, not because they're bad, but because they
14
+ belong downstream of opendata-fetch, not inside it.
15
+
16
+ ## Add a source (the common case)
17
+
18
+ Most datasets are a fixed set of URLs. Add a `[[source]]` block to
19
+ [`opendata_fetch/sources.toml`](opendata_fetch/sources.toml):
20
+
21
+ ```toml
22
+ [[source]]
23
+ slug = "12-my-dataset" # NN-kebab-case, unique
24
+ agency = "Some Agency"
25
+ dataset = "One line describing what this dataset is"
26
+ doc = "docs/sources/12-my-dataset.md" # optional but encouraged
27
+ homepage = "https://agency.gov/dataset"
28
+ update_cadence = "annual; bump `vintage`"
29
+ vars = { vintage = "2025" } # optional; see below
30
+
31
+ [[source.files]]
32
+ url = "https://agency.gov/data/{vintage}/file.csv"
33
+ dest = "file_{vintage}.csv" # filename saved on disk
34
+ min_bytes = 1_000_000 # integrity floor (see below)
35
+ ```
36
+
37
+ Then:
38
+
39
+ ```bash
40
+ opendata-fetch fetch 12-my-dataset # confirm it downloads
41
+ opendata-fetch verify 12-my-dataset # confirm the live URL is byte-stable
42
+ ```
43
+
44
+ Add a matching doc under `docs/sources/` (copy an existing one for the shape).
45
+ That's it. No Python.
46
+
47
+ ### Fields
48
+
49
+ | Field | Required | Notes |
50
+ |-------|----------|-------|
51
+ | `slug` | yes | `NN-kebab-case`, unique across the registry |
52
+ | `agency` | yes | publishing agency |
53
+ | `dataset` | yes | one-line description |
54
+ | `doc` | no | path to the source's doc |
55
+ | `homepage` | no | agency landing page |
56
+ | `update_cadence` | no | how often it changes / how to bump it |
57
+ | `vars` | no | substitution variables (see below) |
58
+ | `files` | one of | list of `{url, dest, min_bytes}` |
59
+ | `handler` | one of | custom fetcher, for irregular sources (see below) |
60
+
61
+ A source must have **either** `files` **or** `handler`, never both.
62
+
63
+ ### `vars` and the `{placeholder}` convention
64
+
65
+ Many dataset URLs embed a year or version. Put that value in `vars` and
66
+ reference it as `{name}` in `url` and `dest`. Bumping a dataset to a new
67
+ release then means changing **one line**:
68
+
69
+ ```toml
70
+ vars = { vintage = "2024" } # change to "2026" when the new file lands
71
+ ```
72
+
73
+ Common variables already in use: `vintage` (year/school-year) and `dataset_id`
74
+ (Socrata four-by-four IDs that rotate each release). Use whatever name reads
75
+ clearly; it just has to match the `{...}` in the URLs.
76
+
77
+ ### `min_bytes`
78
+
79
+ Set it comfortably **below** the real file size (roughly half is a good rule).
80
+ It is an integrity floor: a download smaller than this is rejected as a
81
+ truncated file or an HTML error page. Too high and normal files fail; too low
82
+ and a truncated download slips through. Note the observed size in the source's
83
+ doc.
84
+
85
+ ## Bump a vintage (keeping a source current)
86
+
87
+ When an agency publishes a new year/version:
88
+
89
+ 1. Find the new value (year, or the new Socrata `dataset_id`, etc.).
90
+ 2. Update the source's `vars` in `sources.toml`. If the filename pattern also
91
+ changed (e.g. NCES embeds a release-date stamp that isn't the vintage),
92
+ update the `url`/`dest` templates too.
93
+ 3. Update `min_bytes` if the size shifted materially.
94
+ 4. `opendata-fetch fetch <slug> --force` to confirm.
95
+ 5. Note the change in the source's doc.
96
+
97
+ ## Add an irregular source (custom fetcher)
98
+
99
+ A few sources can't be expressed as a fixed URL list, e.g. they require a
100
+ discovery call or assemble many files into one dataset (FEMA's National Flood
101
+ Hazard Layer is ~2,600 county zips found via a search endpoint). For those,
102
+ point the source at a `handler` instead of `files`:
103
+
104
+ ```toml
105
+ [[source]]
106
+ slug = "11-fema-nfhl"
107
+ agency = "FEMA"
108
+ dataset = "National Flood Hazard Layer (per-county)"
109
+ handler = "opendata_fetch.fetchers.fema_nfhl:fetch"
110
+ ```
111
+
112
+ Implement the handler in `opendata_fetch/fetchers/` with this signature:
113
+
114
+ ```python
115
+ def fetch(source, dest_dir, *, force=False, extract=False) -> list[Path]:
116
+ """Download source.files-equivalent into dest_dir, return written paths."""
117
+ ```
118
+
119
+ Keep custom fetchers stdlib-only where possible. If a fetcher genuinely needs
120
+ a third-party library, add it as an **optional extra** in `pyproject.toml`, not
121
+ to the core dependencies; the base install stays dependency-free.
122
+
123
+ ## Development
124
+
125
+ ```bash
126
+ python -m venv .venv && source .venv/bin/activate
127
+ pip install -e ".[dev]"
128
+ pytest # run the test suite (mocked HTTP, no network)
129
+ ruff check . # lint
130
+ ```
131
+
132
+ Tests must not hit the network. Mock at the `urlopen` boundary (see
133
+ `tests/test_engine.py`) or patch `download_file` (see `tests/test_runner.py`).
134
+
135
+ ## What to expect from review
136
+
137
+ - New sources: welcome. Include a doc and confirm `fetch` + `verify` work.
138
+ - Vintage bumps: welcome, the most useful routine contribution.
139
+ - Transform/analysis features: out of scope (see the scope rule above).
140
+ - Bug fixes to the engine/registry: welcome, with a test.
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Bhanu Bade
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,157 @@
1
+ Metadata-Version: 2.4
2
+ Name: opendata-fetch
3
+ Version: 0.1.0
4
+ Summary: Reliably download U.S. government open datasets as the agencies ship them.
5
+ Project-URL: Homepage, https://github.com/manojbade/opendata-fetch
6
+ Project-URL: Source, https://github.com/manojbade/opendata-fetch
7
+ Project-URL: Issues, https://github.com/manojbade/opendata-fetch/issues
8
+ Author: Bhanu Bade
9
+ License: MIT License
10
+
11
+ Copyright (c) 2026 Bhanu Bade
12
+
13
+ Permission is hereby granted, free of charge, to any person obtaining a copy
14
+ of this software and associated documentation files (the "Software"), to deal
15
+ in the Software without restriction, including without limitation the rights
16
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
17
+ copies of the Software, and to permit persons to whom the Software is
18
+ furnished to do so, subject to the following conditions:
19
+
20
+ The above copyright notice and this permission notice shall be included in all
21
+ copies or substantial portions of the Software.
22
+
23
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
24
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
25
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
26
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
27
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
28
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
29
+ SOFTWARE.
30
+ License-File: LICENSE
31
+ Keywords: cdc,census,download,epa,federal,government,hrsa,nces,open-data
32
+ Classifier: Development Status :: 4 - Beta
33
+ Classifier: Intended Audience :: Developers
34
+ Classifier: Intended Audience :: Science/Research
35
+ Classifier: License :: OSI Approved :: MIT License
36
+ Classifier: Programming Language :: Python :: 3
37
+ Classifier: Programming Language :: Python :: 3.11
38
+ Classifier: Programming Language :: Python :: 3.12
39
+ Classifier: Programming Language :: Python :: 3.13
40
+ Classifier: Topic :: Scientific/Engineering
41
+ Classifier: Topic :: Utilities
42
+ Requires-Python: >=3.11
43
+ Provides-Extra: dev
44
+ Requires-Dist: pytest>=8.0; extra == 'dev'
45
+ Requires-Dist: ruff>=0.6; extra == 'dev'
46
+ Description-Content-Type: text/markdown
47
+
48
+ # opendata-fetch
49
+
50
+ **Reliably download U.S. government open datasets as the agencies ship them.**
51
+
52
+ opendata-fetch is a small, dependency-free tool for fetching public government open data
53
+ files (EPA, Census, CDC, NCES, HRSA, ...) into your own pipeline. It handles
54
+ the annoying parts of pulling data from government endpoints:
55
+
56
+ - **rotating / vintage URLs** (NCES school-year stamps, annual Census vintages,
57
+ Socrata dataset IDs) expressed as one-line config knobs
58
+ - **integrity checks** that reject truncated downloads and HTML error pages
59
+ served with a `200`
60
+ - **retries with backoff**, atomic writes (no half-written files), and
61
+ idempotent skips
62
+ - **optional whole-archive extraction** for zipped shapefiles and bundles
63
+
64
+ What opendata-fetch **does not** do: transform, reshape, filter, or subset the data.
65
+ You get the file exactly as the agency publishes it. Slicing columns, joining,
66
+ and analysis are your job, with whatever tools you prefer.
67
+
68
+ ## Why
69
+
70
+ Every team that builds on government open data re-solves the same fetch problems:
71
+ the URL changed for the new vintage, the download silently returned an error
72
+ page, the 3.8 GB file got truncated, the shapefile zip needs all four sibling
73
+ files. opendata-fetch encodes those fixes once, as data, so adding or maintaining a
74
+ source is a config edit, not a coding task.
75
+
76
+ ## Install
77
+
78
+ ```bash
79
+ pip install opendata-fetch
80
+ ```
81
+
82
+ Core has zero runtime dependencies (stdlib only). Requires Python 3.11+ (uses the stdlib `tomllib`).
83
+
84
+ For development (editable install with tests):
85
+
86
+ ```bash
87
+ git clone https://github.com/manojbade/opendata-fetch
88
+ cd opendata-fetch
89
+ pip install -e ".[dev]"
90
+ ```
91
+
92
+ ## Usage
93
+
94
+ ```bash
95
+ opendata-fetch list # show every available source
96
+ opendata-fetch fetch 05-cdc-svi # download one source into ./data/05-cdc-svi/
97
+ opendata-fetch fetch 05-cdc-svi 09-census-tracts # several at once
98
+ opendata-fetch fetch --all # everything (large: hundreds of GB total)
99
+ opendata-fetch fetch 09-census-tracts --extract # unzip archives after download
100
+ opendata-fetch fetch 05-cdc-svi --dest /data/raw # choose the download root
101
+ opendata-fetch fetch 04-epa-tri --force # re-download even if it already exists
102
+ opendata-fetch verify 05-cdc-svi # re-download and byte-compare vs local copy
103
+ ```
104
+
105
+ As a library:
106
+
107
+ ```python
108
+ import opendata_fetch
109
+
110
+ paths = opendata_fetch.fetch("05-cdc-svi", dest="data", extract=False)
111
+ for src in opendata_fetch.list_sources():
112
+ print(src.slug, "-", src.dataset)
113
+ ```
114
+
115
+ ## Available sources
116
+
117
+ Run `opendata-fetch list`. v1 ships 10 sources spanning EPA, Census, CDC, NCES, and
118
+ HRSA. Per-source provenance and field notes live in [`docs/sources/`](docs/sources/).
119
+
120
+ > **Note on size.** Several sources are large (CDC PLACES ~700 MB, EPA SDWA
121
+ > ~1 GB across two zips). `fetch --all` downloads everything; fetch individual
122
+ > slugs unless you really want the full set.
123
+
124
+ ## Adding your own source
125
+
126
+ Most sources are pure config. Add a `[[source]]` block to
127
+ [`opendata_fetch/sources.toml`](opendata_fetch/sources.toml):
128
+
129
+ ```toml
130
+ [[source]]
131
+ slug = "12-my-dataset"
132
+ agency = "Some Agency"
133
+ dataset = "What this dataset is"
134
+ update_cadence = "annual; bump `vintage`"
135
+ vars = { vintage = "2025" }
136
+
137
+ [[source.files]]
138
+ url = "https://example.gov/data/{vintage}/file.csv"
139
+ dest = "file_{vintage}.csv"
140
+ min_bytes = 1_000_000
141
+ ```
142
+
143
+ Then `opendata-fetch fetch 12-my-dataset`. No Python required. For irregular sources
144
+ that need discovery calls or assembling many files, point the source at a
145
+ custom `handler` instead; see [CONTRIBUTING.md](CONTRIBUTING.md).
146
+
147
+ ## Data licensing
148
+
149
+ opendata-fetch is MIT-licensed. It downloads, but does not redistribute, government
150
+ data. The datasets themselves are published by U.S. government agencies and are
151
+ generally in the public domain; confirm the terms for any specific dataset on
152
+ its agency page (linked in each source's doc) before redistributing.
153
+
154
+ ## Contributing
155
+
156
+ See [CONTRIBUTING.md](CONTRIBUTING.md). The most useful contributions are new
157
+ sources and keeping existing vintage knobs current.
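Reviewer sketch for the PKG-INFO description above, which lists "retries with backoff" among the tool's features. The following is one common way to structure such a loop, assuming exponential delays and treating any `OSError` as retryable (`urllib.error.URLError` is an `OSError` subclass). The name `with_retries` and its parameters are hypothetical; the package's actual retry policy lives in `engine.py` and is not shown here.

```python
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    *,
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` until it succeeds, sleeping 1x, 2x, 4x... base_delay between tries.

    `sleep` is injectable so tests can record delays instead of waiting.
    """
    for attempt in range(attempts):
        try:
            return action()
        except OSError:
            # Out of attempts: surface the last error to the caller.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2**attempt)
    raise ValueError("attempts must be >= 1")
```

A caller would wrap a single download attempt, for example `with_retries(lambda: urllib.request.urlopen(url).read())`, so that transient network failures are retried while the final failure still propagates.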
@@ -0,0 +1,110 @@
1
+ # opendata-fetch
2
+
3
+ **Reliably download U.S. government open datasets as the agencies ship them.**
4
+
5
+ opendata-fetch is a small, dependency-free tool for fetching public government open data
6
+ files (EPA, Census, CDC, NCES, HRSA, ...) into your own pipeline. It handles
7
+ the annoying parts of pulling data from government endpoints:
8
+
9
+ - **rotating / vintage URLs** (NCES school-year stamps, annual Census vintages,
10
+ Socrata dataset IDs) expressed as one-line config knobs
11
+ - **integrity checks** that reject truncated downloads and HTML error pages
12
+ served with a `200`
13
+ - **retries with backoff**, atomic writes (no half-written files), and
14
+ idempotent skips
15
+ - **optional whole-archive extraction** for zipped shapefiles and bundles
16
+
17
+ What opendata-fetch **does not** do: transform, reshape, filter, or subset the data.
18
+ You get the file exactly as the agency publishes it. Slicing columns, joining,
19
+ and analysis are your job, with whatever tools you prefer.
20
+
21
+ ## Why
22
+
23
+ Every team that builds on government open data re-solves the same fetch problems:
24
+ the URL changed for the new vintage, the download silently returned an error
25
+ page, the 3.8 GB file got truncated, the shapefile zip needs all four sibling
26
+ files. opendata-fetch encodes those fixes once, as data, so adding or maintaining a
27
+ source is a config edit, not a coding task.
28
+
29
+ ## Install
30
+
31
+ ```bash
32
+ pip install opendata-fetch
33
+ ```
34
+
35
+ Core has zero runtime dependencies (stdlib only). Requires Python 3.11+ (uses the stdlib `tomllib`).
36
+
37
+ For development (editable install with tests):
38
+
39
+ ```bash
40
+ git clone https://github.com/manojbade/opendata-fetch
41
+ cd opendata-fetch
42
+ pip install -e ".[dev]"
43
+ ```
44
+
45
+ ## Usage
46
+
47
+ ```bash
48
+ opendata-fetch list # show every available source
49
+ opendata-fetch fetch 05-cdc-svi # download one source into ./data/05-cdc-svi/
50
+ opendata-fetch fetch 05-cdc-svi 09-census-tracts # several at once
51
+ opendata-fetch fetch --all # everything (large: hundreds of GB total)
52
+ opendata-fetch fetch 09-census-tracts --extract # unzip archives after download
53
+ opendata-fetch fetch 05-cdc-svi --dest /data/raw # choose the download root
54
+ opendata-fetch fetch 04-epa-tri --force # re-download even if it already exists
55
+ opendata-fetch verify 05-cdc-svi # re-download and byte-compare vs local copy
56
+ ```
57
+
58
+ As a library:
59
+
60
+ ```python
61
+ import opendata_fetch
62
+
63
+ paths = opendata_fetch.fetch("05-cdc-svi", dest="data", extract=False)
64
+ for src in opendata_fetch.list_sources():
65
+ print(src.slug, "-", src.dataset)
66
+ ```
67
+
68
+ ## Available sources
69
+
70
+ Run `opendata-fetch list`. v1 ships 10 sources spanning EPA, Census, CDC, NCES, and
71
+ HRSA. Per-source provenance and field notes live in [`docs/sources/`](docs/sources/).
72
+
73
+ > **Note on size.** Several sources are large (CDC PLACES ~700 MB, EPA SDWA
74
+ > ~1 GB across two zips). `fetch --all` downloads everything; fetch individual
75
+ > slugs unless you really want the full set.
76
+
77
+ ## Adding your own source
78
+
79
+ Most sources are pure config. Add a `[[source]]` block to
80
+ [`opendata_fetch/sources.toml`](opendata_fetch/sources.toml):
81
+
82
+ ```toml
83
+ [[source]]
84
+ slug = "12-my-dataset"
85
+ agency = "Some Agency"
86
+ dataset = "What this dataset is"
87
+ update_cadence = "annual; bump `vintage`"
88
+ vars = { vintage = "2025" }
89
+
90
+ [[source.files]]
91
+ url = "https://example.gov/data/{vintage}/file.csv"
92
+ dest = "file_{vintage}.csv"
93
+ min_bytes = 1_000_000
94
+ ```
95
+
96
+ Then `opendata-fetch fetch 12-my-dataset`. No Python required. For irregular sources
97
+ that need discovery calls or assembling many files, point the source at a
98
+ custom `handler` instead; see [CONTRIBUTING.md](CONTRIBUTING.md).
99
+
100
+ ## Data licensing
101
+
102
+ opendata-fetch is MIT-licensed. It downloads, but does not redistribute, government
103
+ data. The datasets themselves are published by U.S. government agencies and are
104
+ generally in the public domain; confirm the terms for any specific dataset on
105
+ its agency page (linked in each source's doc) before redistributing.
106
+
107
+ ## Contributing
108
+
109
+ See [CONTRIBUTING.md](CONTRIBUTING.md). The most useful contributions are new
110
+ sources and keeping existing vintage knobs current.
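Reviewer sketch for the README hunk above: the integrity checks ("reject truncated downloads and HTML error pages served with a `200`") and atomic writes can be combined as below. This is a simplified illustration, not the package's engine. It assumes the payload is already in memory (a real downloader would stream to disk), that a `.part` suffix is used for the temporary file (suggested by the `*.part` entry in `.gitignore`, but not confirmed), and the function names are hypothetical.

```python
import os
from pathlib import Path


def looks_like_html(payload: bytes) -> bool:
    """Cheap sniff: does the body start like an HTML document?"""
    head = payload[:512].lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))


def write_atomically(payload: bytes, dest: Path, min_bytes: int) -> Path:
    """Validate the payload, then publish it with a rename so readers
    never observe a half-written file at `dest`."""
    if len(payload) < min_bytes:
        raise ValueError(f"{dest.name}: {len(payload)} bytes is below the {min_bytes} floor")
    if looks_like_html(payload):
        raise ValueError(f"{dest.name}: body looks like an HTML error page")

    dest.parent.mkdir(parents=True, exist_ok=True)
    part = dest.with_name(dest.name + ".part")  # assumed temp-file convention
    part.write_bytes(payload)
    # os.replace overwrites the target in one step on the same filesystem.
    os.replace(part, dest)
    return dest
```

Validating before the rename means a rejected download never replaces a good local copy, which is what makes a later idempotent skip safe to trust.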
@@ -0,0 +1,96 @@
1
+ # NCES Schools: CCD + EDGE Geocode (NCES, U.S. Dept. of Education)
2
+
3
+ The National Center for Education Statistics (NCES) Common Core of Data (CCD) is the federal census of US public schools and school districts. Four files together define the universe of public schools for a given school year: the Directory (identity, address, district, charter, grade range, operational status), Characteristics (NSLP designation, virtual flag, shared time), Membership (enrollment counts by grade/race/sex), and the EDGE Geocode file (latitude/longitude, locale code, county FIPS, CBSA, congressional district).
4
+
5
+ Schools are keyed by the 12-character NCESSCH identifier; districts by the 7-character LEAID. The EDGE Geocode file comes from the companion Education Demographic and Geographic Estimates program and supplies the geographic attributes the CCD tables lack.
6
+
7
+ **opendata-fetch slug:** `01-nces-schools`
8
+ **Agency:** NCES (U.S. Dept. of Education)
9
+ **File type:** zip archives (each contains CSV / SAS7BDAT; EDGE zip also contains pipe-delimited TXT, xlsx, and a nested shapefile zip)
10
+ **Approx. size:** Directory ~13 MB, Characteristics ~5.4 MB, Membership ~203 MB (2.3 GB uncompressed), EDGE Geocode ~29 MB
11
+ **Update cadence:** annual (school year). Bump `vintage` and the 6-digit release stamp in the filenames.
12
+
13
+ ## Download
14
+
15
+ opendata-fetch pulls four files:
16
+
17
+ ```
18
+ https://nces.ed.gov/ccd/Data/zip/ccd_sch_029_{vintage}_w_1a_073025.zip (Directory)
19
+ https://nces.ed.gov/ccd/Data/zip/ccd_sch_129_{vintage}_w_1a_073025.zip (Characteristics)
20
+ https://nces.ed.gov/ccd/Data/zip/ccd_sch_052_{vintage}_l_1a_073025.zip (Membership)
21
+ https://nces.ed.gov/programs/edge/data/EDGE_GEOCODE_PUBLICSCH_{vintage}.zip (EDGE Geocode)
22
+ ```
23
+
24
+ The general patterns are:
25
+ - CCD school-level files: `https://nces.ed.gov/ccd/Data/zip/ccd_sch_{filetype}_{SY}_{stage}_1a_{releaseDate}.zip`
26
+ - EDGE Geocode: `https://nces.ed.gov/programs/edge/data/EDGE_GEOCODE_PUBLICSCH_{SY}.zip`
27
+
28
+ `{vintage}` is the school-year code (`2425` = 2024-25, `2526` = 2025-26). It is filled from the source's `vars` table in `opendata_fetch/sources.toml`, so moving to a new year is a one-line edit. Note two things change per release: `{vintage}` and the embedded 6-digit release-date stamp (`073025`), so the filename literals in the registry must be updated alongside the variable.
29
+
30
+ **The `{stage}` letter differs across files for the same year.** As of 2024-25: Directory (`029`) and Characteristics (`129`) use `w`, Membership (`052`) uses `l`. These letters appear to mark file layout (`w` = wide, `l` = long) rather than a release stage, which would match Membership's one-row-per-cell shape; confirm against the NCES file documentation. When probing for a new release, try both `w` and `l` if the expected URL is missing.
31
+
32
+ The same school-level page also lists Staff (`ccd_sch_033`) and Lunch Program Eligibility (`ccd_sch_059`) files that opendata-fetch does not ship; add them as extra `[[source.files]]` entries if you need them.
33
+
34
+ ## Fields and files in this download
35
+
36
+ Captured from the `2425` (2024-25) release. Column sets are stable year to year but can shift; treat this as a snapshot.
37
+
38
+ **Directory** (`ccd_sch_029_2425_w_1a_073025.csv`, 65 columns). Annotated by group rather than per-column:
39
+
40
+ `SCHOOL_YEAR, FIPST, STATENAME, ST, SCH_NAME, LEA_NAME, STATE_AGENCY_NO, UNION, ST_LEAID, LEAID, ST_SCHID, NCESSCH, SCHID, MSTREET1, MSTREET2, MSTREET3, MCITY, MSTATE, MZIP, MZIP4, LSTREET1, LSTREET2, LSTREET3, LCITY, LSTATE, LZIP, LZIP4, PHONE, WEBSITE, SY_STATUS, SY_STATUS_TEXT, UPDATED_STATUS, UPDATED_STATUS_TEXT, EFFECTIVE_DATE, SCH_TYPE_TEXT, SCH_TYPE, RECON_STATUS, OUT_OF_STATE_FLAG, CHARTER_TEXT, CHARTAUTH1, CHARTAUTHN1, CHARTAUTH2, CHARTAUTHN2, NOGRADES, G_PK_OFFERED, G_KG_OFFERED, G_1_OFFERED, G_2_OFFERED, G_3_OFFERED, G_4_OFFERED, G_5_OFFERED, G_6_OFFERED, G_7_OFFERED, G_8_OFFERED, G_9_OFFERED, G_10_OFFERED, G_11_OFFERED, G_12_OFFERED, G_13_OFFERED, G_UG_OFFERED, G_AE_OFFERED, GSLO, GSHI, LEVEL, IGOFFERED`
41
+
42
+ - **Identifiers:** `SCHOOL_YEAR`, `FIPST` (state FIPS), `STATE_AGENCY_NO`, `UNION`, `ST_LEAID`/`LEAID` (state and NCES district IDs), `ST_SCHID`/`NCESSCH`/`SCHID` (state and NCES school IDs).
43
+ - **Names:** `SCH_NAME` (school), `LEA_NAME` (district), `STATENAME`, `ST`.
44
+ - **Mailing address (`M*`):** `MSTREET1-3`, `MCITY`, `MSTATE`, `MZIP`, `MZIP4`.
45
+ - **Location/physical address (`L*`):** `LSTREET1-3`, `LCITY`, `LSTATE`, `LZIP`, `LZIP4`.
46
+ - **Contact:** `PHONE`, `WEBSITE`.
47
+ - **Operational status:** `SY_STATUS`/`SY_STATUS_TEXT` (school-year status, see Quirks for codes), `UPDATED_STATUS`/`UPDATED_STATUS_TEXT`, `EFFECTIVE_DATE`, `RECON_STATUS`, `OUT_OF_STATE_FLAG`.
48
+ - **Type & charter:** `SCH_TYPE`/`SCH_TYPE_TEXT` (school type), `CHARTER_TEXT` (charter status), `CHARTAUTH1/CHARTAUTHN1/CHARTAUTH2/CHARTAUTHN2` (charter authorizer IDs/names).
49
+ - **Grades offered:** `NOGRADES` flag, the `G_*_OFFERED` per-grade flags (`PK`, `KG`, `1`-`13`, `UG`, `AE`), and the summaries `GSLO`/`GSHI` (lowest/highest grade), `LEVEL` (school level), `IGOFFERED`.
50
+
51
+ **Characteristics** (`ccd_sch_129_2425_w_1a_073025.csv`, 17 columns):
52
+
53
+ `SCHOOL_YEAR, FIPST, STATENAME, ST, SCH_NAME, STATE_AGENCY_NO, UNION, ST_LEAID, LEAID, ST_SCHID, NCESSCH, SCHID, SHARED_TIME, NSLP_STATUS, NSLP_STATUS_TEXT, VIRTUAL, VIRTUAL_TEXT`
54
+
55
+ - **Identity (12):** same identifier/name/state columns as the Directory (`SCHOOL_YEAR` … `SCHID`).
56
+ - `SHARED_TIME` — whether the school is a shared-time facility (students attend part-time from other schools).
57
+ - `NSLP_STATUS`/`NSLP_STATUS_TEXT` — National School Lunch Program participation designation (free/reduced-price lunch program status).
58
+ - `VIRTUAL`/`VIRTUAL_TEXT` — virtual-school status code and its text label.
59
+
60
+ **Membership** (`ccd_sch_052_2425_l_1a_073025.csv`, 18 columns; one row per school x grade x race/ethnicity x sex):
61
+
62
+ `SCHOOL_YEAR, FIPST, STATENAME, ST, SCH_NAME, STATE_AGENCY_NO, UNION, ST_LEAID, LEAID, ST_SCHID, NCESSCH, SCHID, GRADE, RACE_ETHNICITY, SEX, STUDENT_COUNT, TOTAL_INDICATOR, DMS_FLAG`
63
+
64
+ - **Identity (12):** same identifier/name/state columns as the Directory.
65
+ - `GRADE` — grade level for the row.
66
+ - `RACE_ETHNICITY` — race/ethnicity category for the row.
67
+ - `SEX` — sex category for the row.
68
+ - `STUDENT_COUNT` — enrollment count for that grade x race x sex cell.
69
+ - `TOTAL_INDICATOR` — flags whether the row is a subtotal/total (vs a single grade x race x sex cell).
70
+ - `DMS_FLAG` — data-management-system reporting/quality flag.
71
+
72
+ **EDGE Geocode** (`EDGE_GEOCODE_PUBLICSCH_2425.TXT`, pipe-delimited, **no header row**, 23 positional fields). In file order: NCES school ID, LEAID, school name, operating-state FIPS, street, city, state, ZIP, state FIPS, county FIPS, county name, locale code, latitude, longitude, CBSA code, CBSA name, CBSA type, CSA code, CSA name, congressional district code, state-legislature lower district, state-legislature upper district, school year. The key columns to note: `NCESSCH` and `LEAID` (join keys back to the CCD tables), latitude/longitude (address-geocoded point), the NCES locale code (urban/rural classification), county FIPS, the CBSA code (metro/micro area), and the congressional district code. The same EDGE zip also contains the identical data as an `.xlsx` (with labeled headers), a `.sas7bdat`, and a nested `Shapefile_SCH.zip`; field meanings are documented in `EDGE_GEOCODE_PUBLIC_FILEDOC.pdf` inside the zip.
73
+
74
+ ## Provenance & landing pages
75
+
76
+ - CCD data index: https://nces.ed.gov/ccd/files.asp
77
+ - EDGE geocode: https://nces.ed.gov/programs/edge/geographic/schoollocations
78
+ - EDGE locale framework: https://nces.ed.gov/programs/edge/Geographic/LocaleBoundaries
79
+ - Release calendar: https://ies.ed.gov/nces-statistical-products-release-calendar
80
+ - EDGE geocode field documentation: `EDGE_GEOCODE_PUBLIC_FILEDOC.pdf` (inside the EDGE zip)
81
+
82
+ ## Quirks & notes
83
+
84
+ - **Membership zip uses DEFLATE64 compression** (zip compression method 9). Python's stdlib `zipfile` cannot decode it, and `zipfile-deflate64` does not currently build on Python 3.13. Use the `unzip` binary (`unzip -p`) to extract it; `unzip` handles DEFLATE64 and is available on macOS and standard Linux runners.
85
+ - **EDGE Geocode zip contains multiple formats:** pipe-delimited TXT (no header), xlsx (with headers), SAS7BDAT, and a nested `Shapefile_SCH.zip`. The TXT column meanings are documented in `EDGE_GEOCODE_PUBLIC_FILEDOC.pdf`.
86
+ - The Directory and Characteristics zips contain both CSV and SAS7BDAT.
87
+ - NCESSCH (12-char) and LEAID (7-char) are string IDs with leading zeros for some states. Read them as strings; integer parsing drops leading zeros.
88
+ - Membership is large (203 MB compressed, ~2.3 GB uncompressed) because rows are at the school x grade x race x sex level. A downstream reader should stream/chunk it rather than load it whole.
89
+ - Latitude/longitude in EDGE Geocode are address-geocoding estimates, not GPS readings (documented in the EDGE file doc).
90
+ - FYI for downstream users: `SY_STATUS` codes are 1=Open, 2=Closed, 3=New, 4=Added, 5=Changed Boundary/Agency, 6=Inactive, 7=Future, 8=Reopened. Grade codes (`GSLO`/`GSHI`) are 2-char strings including reserved codes `M` (Missing) and `N` (Not applicable), not integers.
91
+
92
+ ## Manual fallback
93
+
94
+ 1. Download each file from the corresponding landing page above using a browser (CCD index for the `ccd_sch_*` zips; EDGE geocode page for the geocode zip).
95
+ 2. Place the file in the source's download directory with the exact filename opendata-fetch expects.
96
+ 3. Re-run `opendata-fetch fetch 01-nces-schools`; it detects the present file and skips the download.
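Reviewer sketch for the DEFLATE64 quirk noted in the NCES doc above: although stdlib `zipfile` cannot decompress method 9, it can still read the archive's central directory, so a downstream script can detect affected members up front and route them to `unzip -p`. The helper name is hypothetical, and this is not part of opendata-fetch (which, per its scope rule, only downloads).

```python
import zipfile
from pathlib import Path

# Zip compression method id for DEFLATE64; the stdlib defines no constant for it.
DEFLATE64 = 9


def members_needing_unzip(archive: Path) -> list[str]:
    """Return member names that stdlib zipfile can list but not decompress."""
    with zipfile.ZipFile(archive) as zf:
        return [
            info.filename
            for info in zf.infolist()
            if info.compress_type == DEFLATE64
        ]
```

If the returned list is non-empty, extract those members with the external `unzip` binary as the doc recommends; everything else in the archive can still go through `zipfile`.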
@@ -0,0 +1,106 @@
1
+ # EPA NPL Superfund Site Boundaries + Human Exposure (EPA)
2
+
3
+ EPA's National Priorities List (NPL) is the federal roster of the most contaminated hazardous-waste sites in the US (Superfund sites). Two files describe them together: a GeoJSON of site boundary polygons (with identity attributes such as EPA_ID, site name, state, county, status, last change date) and a JSON list of per-site human-exposure status codes. The two join on the site's EPA ID.
4
+
5
+ The boundaries file is published on EPA's ArcGIS Hub; the human-exposure file comes from EPA's Superfund Enterprise Management System (SEMS).
6
+
7
+ **opendata-fetch slug:** `02-epa-npl-superfund`
8
+ **Agency:** EPA
9
+ **File type:** geojson + json
10
+ **Approx. size:** NPL boundaries GeoJSON ~60 MB, Human Exposure JSON ~1.3 MB
11
+ **Update cadence:** rolling. The boundaries dataset is refreshed by EPA irregularly; the human-exposure JSON is updated roughly daily.
12
+
13
+ ## Download
14
+
15
+ opendata-fetch pulls two files at fixed URLs (no templated variable):
16
+
17
+ ```
18
+ https://opendata.arcgis.com/api/v3/datasets/d6e1591d9a424f1fa6d95a02095a06d7_0/downloads/data?format=geojson&spatialRefId=4326 -> npl_boundaries.geojson
19
+ https://www3.epa.gov/semsjson/Human_Exposure_Site_List.json -> human_exposure.json
20
+ ```
21
+
22
+ The boundaries URL embeds an ArcGIS dataset id (`d6e1591d9a424f1fa6d95a02095a06d7_0`). If EPA republishes the layer under a new id, edit the URL in `opendata_fetch/sources.toml`. There is no per-year variable for this source.
23
+
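For readers who want to reproduce the download outside the tool, the two fixed URLs above can be fetched with the standard library. This is a rough sketch, not opendata-fetch's own implementation: `fetch_all` and its `opener` parameter are hypothetical names, and the skip-if-present check only imitates the behaviour described under "Manual fallback".

```python
import shutil
from pathlib import Path
from urllib.request import urlopen

# Target filename -> fixed URL, as listed in the Download section.
SOURCES = {
    "npl_boundaries.geojson": (
        "https://opendata.arcgis.com/api/v3/datasets/"
        "d6e1591d9a424f1fa6d95a02095a06d7_0/downloads/data"
        "?format=geojson&spatialRefId=4326"
    ),
    "human_exposure.json": (
        "https://www3.epa.gov/semsjson/Human_Exposure_Site_List.json"
    ),
}


def fetch_all(dest_dir, sources=SOURCES, opener=urlopen):
    """Download each source into dest_dir, skipping files that already exist.

    `opener` is injectable so the function can be exercised without network
    access; it must return a file-like context manager.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, url in sources.items():
        target = dest / filename
        if target.exists():
            continue  # assumption: mimic the tool's skip-if-present behaviour
        with opener(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)  # stream; the GeoJSON is ~60 MB
        written.append(target)
    return written
```

If EPA republishes the layer under a new dataset id, only the first URL in `SOURCES` would need to change.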
24
+ ## Fields and files in this download
25
+
26
+ Captured from the files as served at audit time. Schemas may drift, since EPA refreshes both files on its own schedule (the boundaries irregularly, the human-exposure JSON roughly daily).
27
+
28
+ **`npl_boundaries.geojson`** — MultiPolygon/Polygon features, CRS EPSG:4326, ~2,406 features, 32 attribute fields:
29
+
30
+ | column | meaning |
31
+ |--------|---------|
32
+ | `OBJECTID` | Internal feature object ID. |
33
+ | `REGION_CODE` | EPA region (1-10). |
34
+ | `EPA_PROGRAM` | EPA program associated with the site. |
35
+ | `EPA_ID` | Site EPA/SEMS ID (join key; usually 12 characters). |
36
+ | `SITE_NAME` | Superfund site name. |
37
+ | `SITE_FEATURE_CLASS` | Feature classification grouping. |
38
+ | `SITE_FEATURE_TYPE` | Feature type (e.g. `Site Boundary`, `Comprehensive Site Area`); distinguishes multiple polygons per site. |
39
+ | `SITE_FEATURE_NAME` | Name of the individual feature. |
40
+ | `SITE_FEATURE_DESCRIPTION` | Free-text description of the feature. |
41
+ | `NPL_STATUS_CODE` | NPL status (F Final, D Deleted, P Proposed, etc.; see official documentation). |
42
+ | `FEDERAL_FACILITY_DETER_CODE` | Federal-facility determination code. |
43
+ | `LAST_CHANGE_DATE` | Date the feature record was last changed. |
44
+ | `ORIGINAL_CREATION_DATE` | Date the feature record was created. |
45
+ | `SITE_FEATURE_SOURCE` | Source of the feature geometry. |
46
+ | `STREET_ADDR_TXT` | Site street address. |
47
+ | `ADDR_COMMENT` | Address comment/note. |
48
+ | `CITY_NAME` | Site city. |
49
+ | `COUNTY` | Site county. |
50
+ | `STATE_CODE` | 2-letter state code. |
51
+ | `ZIP_CODE` | Site ZIP code. |
52
+ | `SITE_CONTACT_NAME` | Site contact name. |
53
+ | `PRIMARY_TELEPHONE_NUM` | Site contact phone number. |
54
+ | `SITE_CONTACT_EMAIL` | Site contact email. |
55
+ | `URL_ALIAS_TXT` | Site profile URL alias. |
56
+ | `FEATURE_INFO_URL` | URL with more information about the feature. |
57
+ | `FEATURE_INFO_URL_DESC` | Description of the info URL. |
58
+ | `GIS_AREA` | Feature area as computed in GIS. |
59
+ | `GIS_AREA_UNITS` | Units for `GIS_AREA`. |
60
+ | `PROJECTION` | Native projection of the source geometry. |
61
+ | `SF_GEOSPATIAL_DATA_DISCLAIMER` | Superfund geospatial data disclaimer text. |
62
+ | `SHAPE_Length` | Polygon perimeter length (geometry attribute). |
63
+ | `SHAPE_Area` | Polygon area (geometry attribute). |
64
+
65
+ **`human_exposure.json`** — a JSON object with top-level keys `success`, `data`, `meta`. `data` is a list (~1,905 records), one per site, each with 16 fields:
66
+
67
+ | column | meaning |
68
+ |--------|---------|
69
+ | `humanexposurepathdesc` | Human-exposure pathway status description (plain text). |
70
+ | `city` | Site city. |
71
+ | `nplstatus` | NPL status of the site. |
72
+ | `saaflag` | SAA flag (SEMS code; see EPA SEMS documentation). |
73
+ | `county` | Site county. |
74
+ | `fedfacilityflag` | Federal-facility flag. |
75
+ | `zipcode` | Site ZIP code. |
76
+ | `epaid` | Site EPA/SEMS ID (lowercase; join key to `EPA_ID`). |
77
+ | `regionid` | EPA region. |
78
+ | `sitename` | Site name. |
79
+ | `eibaselinesiteInd` | Environmental-indicator baseline-site indicator (SEMS code; see EPA SEMS documentation). |
80
+ | `siteid` | SEMS internal site ID. |
81
+ | `humexposurestscode` | Human-exposure status code (e.g. `HENC`, `HEUC`; see EPA SEMS documentation). |
82
+ | `state` | State name. |
83
+ | `state_code` | 2-letter state code. |
84
+ | `friendlyurl` | Public-facing site profile URL. |
85
+
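The envelope described above (`success`, `data`, `meta`, with the records under `data`) can be unpacked in a few lines. This is a sketch under stated assumptions: the function names are illustrative, and the sample record uses made-up placeholder values rather than real EPA data.

```python
import json


def load_exposure_records(text):
    """Parse the human-exposure payload and return the list under `data`."""
    payload = json.loads(text)
    records = payload["data"]
    if not isinstance(records, list):
        raise ValueError("expected `data` to be a list of site records")
    return records


def index_by_epa_id(records):
    """Map each record's `epaid` to the record, skipping empty IDs."""
    return {rec["epaid"]: rec for rec in records if rec.get("epaid")}


# Made-up sample in the documented shape (placeholder values, not EPA data).
SAMPLE = """
{
  "success": true,
  "meta": {},
  "data": [
    {"epaid": "XXD000000001", "sitename": "Example Site",
     "humexposurestscode": "HEUC", "state_code": "XX"}
  ]
}
"""

records = load_exposure_records(SAMPLE)
by_id = index_by_epa_id(records)
```

For the real file, pass the contents of `human_exposure.json` read as text instead of `SAMPLE`.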
86
+ ## Provenance & landing pages
87
+
88
+ - NPL Boundaries dataset (ArcGIS Hub): https://hub.arcgis.com/maps/EPA::npl-superfund-site-boundaries-epa-public-2022/explore
89
+ - ArcGIS Hub item page: https://www.arcgis.com/home/item.html?id=d6e1591d9a424f1fa6d95a02095a06d7
90
+ - EPA Superfund landing: https://www.epa.gov/superfund/superfund-national-priorities-list-npl
91
+ - EPA SEMS overview: https://www.epa.gov/enviro/sems-overview
92
+
93
+ ## Quirks & notes
94
+
95
+ - **Join key names differ:** the boundaries file names the field `EPA_ID` (uppercase, with underscore); the human-exposure file names it `epaid` (lowercase, no underscore). A downstream user joining the two must normalize the key name first.
96
+ - **The human-exposure JSON drifts on essentially every download.** EPA updates exposure status codes in place; the byte size can stay constant (observed at exactly 1,346,934 bytes across consecutive days) while the content (SHA256) changes. Treat content changes here as expected, not as corruption.
97
+ - **Multiple polygons per site:** the boundaries dataset has more features than unique sites. A single EPA_ID can have several feature polygons distinguished by `SITE_FEATURE_TYPE` (e.g., `Comprehensive Site Area`, `Site Boundary`, `Current Ground Boundary`). opendata-fetch ships the raw file with all of them; any dedup is a downstream choice.
98
+ - **Dirty `SITE_FEATURE_TYPE` values:** some rows carry a trailing `\r\n`, and some hold the literal string `<Null>` instead of a JSON null. FYI for downstream parsing.
99
+ - **Status codes (FYI):** `NPL_STATUS_CODE` includes F (Final), D (Deleted), N, P (Proposed), R, S, A, and null. Human-exposure codes: HENC (not under control, worst), HEPR, HEUC, HHPA, HEID, HEIC, blank (not assigned).
100
+ - Geometries are a mix of Polygon and MultiPolygon. At least one site (`RID09321243`) has an 11-character EPA_ID rather than the usual 12.
101
+
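Several of the quirks above (differing key names, multiple polygons per site, dirty `SITE_FEATURE_TYPE` values) come up together when joining the two files. The sketch below is one possible downstream treatment, not something opendata-fetch does: the function names are hypothetical, the output keeps one row per polygon, and upper-casing the ID values is a defensive assumption rather than a documented requirement.

```python
def clean_feature_type(value):
    """Strip stray whitespace (including a trailing CR/LF) and map the
    literal string '<Null>' or an empty string to None."""
    if value is None:
        return None
    cleaned = value.strip()
    if cleaned in ("", "<Null>"):
        return None
    return cleaned


def join_features_to_exposure(features, exposure_records):
    """Attach the human-exposure status code to each boundary feature.

    `features` is the GeoJSON `features` list; `exposure_records` is the
    list under `data` in the human-exposure JSON. Returns one row per
    polygon, so sites with several polygons appear several times.
    """
    by_id = {
        rec["epaid"].strip().upper(): rec
        for rec in exposure_records
        if rec.get("epaid")
    }
    rows = []
    for feature in features:
        props = feature["properties"]
        epa_id = (props.get("EPA_ID") or "").strip().upper()
        match = by_id.get(epa_id)
        rows.append(
            {
                "epa_id": epa_id,
                "feature_type": clean_feature_type(props.get("SITE_FEATURE_TYPE")),
                "exposure_code": match.get("humexposurestscode") if match else None,
            }
        )
    return rows
```

Keeping every polygon leaves the dedup decision (for example, preferring one `SITE_FEATURE_TYPE` per site) to the caller.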
102
+ ## Manual fallback
103
+
104
+ 1. **NPL Boundaries:** open the ArcGIS Hub item page above, click Download -> GeoJSON, save as `npl_boundaries.geojson` into the source's download directory.
105
+ 2. **Human Exposure:** open the human-exposure JSON URL from the Download section in a browser and save it as `human_exposure.json` in the same download directory.
106
+ 3. Re-run `opendata-fetch fetch 02-epa-npl-superfund`; it detects the files already in place and skips the downloads.
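
As noted under "Quirks & notes", the human-exposure JSON can change content while keeping the same byte size, so a size check alone cannot tell whether a saved copy differs from an earlier one. The sketch below compares SHA256 digests instead. `fingerprint` and `content_changed` are hypothetical helper names, and this is not a description of how opendata-fetch's own verification works.

```python
import hashlib


def fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, sha256_hex) for a file, reading it in chunks."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def content_changed(old, new):
    """Compare two fingerprints by digest only; equal sizes prove nothing."""
    return old[1] != new[1]
```

A changed digest for `human_exposure.json` is expected between downloads and does not by itself indicate a corrupt file.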