earthcatalog 0.2.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- earthcatalog-0.2.0/PKG-INFO +333 -0
- earthcatalog-0.2.0/README.md +266 -0
- earthcatalog-0.2.0/earthcatalog/__init__.py +164 -0
- earthcatalog-0.2.0/earthcatalog/async_http_client.py +1006 -0
- earthcatalog-0.2.0/earthcatalog/config.py +97 -0
- earthcatalog-0.2.0/earthcatalog/engines/__init__.py +308 -0
- earthcatalog-0.2.0/earthcatalog/engines/rustac_engine.py +142 -0
- earthcatalog-0.2.0/earthcatalog/engines/stac_geoparquet_engine.py +126 -0
- earthcatalog-0.2.0/earthcatalog/exceptions.py +471 -0
- earthcatalog-0.2.0/earthcatalog/grid_systems.py +1114 -0
- earthcatalog-0.2.0/earthcatalog/ingestion_pipeline.py +2281 -0
- earthcatalog-0.2.0/earthcatalog/input_readers.py +603 -0
- earthcatalog-0.2.0/earthcatalog/job_tracking.py +485 -0
- earthcatalog-0.2.0/earthcatalog/pipeline.py +606 -0
- earthcatalog-0.2.0/earthcatalog/schema_generator.py +911 -0
- earthcatalog-0.2.0/earthcatalog/spatial_resolver.py +1207 -0
- earthcatalog-0.2.0/earthcatalog/stac_hooks.py +754 -0
- earthcatalog-0.2.0/earthcatalog/statistics.py +677 -0
- earthcatalog-0.2.0/earthcatalog/storage_backends.py +548 -0
- earthcatalog-0.2.0/earthcatalog/tests/__init__.py +1 -0
- earthcatalog-0.2.0/earthcatalog/tests/conftest.py +76 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_all_grids.py +793 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_async_http.py +700 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_cli_and_storage.py +230 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_config.py +245 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_dask_integration.py +580 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_e2e_synthetic.py +1624 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_engines.py +272 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_exceptions.py +346 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_file_structure.py +245 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_input_readers.py +666 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_integration.py +200 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_integration_async.py +283 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_job_tracking.py +603 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_multi_file_input.py +336 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_passthrough_hook.py +196 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_pipeline.py +684 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_pipeline_components.py +665 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_schema_generator.py +506 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_spatial_resolver.py +413 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_stac_hooks.py +776 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_statistics.py +477 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_storage_backends.py +236 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_validation.py +435 -0
- earthcatalog-0.2.0/earthcatalog/tests/test_workers.py +653 -0
- earthcatalog-0.2.0/earthcatalog/validation.py +921 -0
- earthcatalog-0.2.0/earthcatalog/workers.py +682 -0
- earthcatalog-0.2.0/earthcatalog.egg-info/PKG-INFO +333 -0
- earthcatalog-0.2.0/earthcatalog.egg-info/SOURCES.txt +53 -0
- earthcatalog-0.2.0/earthcatalog.egg-info/dependency_links.txt +1 -0
- earthcatalog-0.2.0/earthcatalog.egg-info/entry_points.txt +3 -0
- earthcatalog-0.2.0/earthcatalog.egg-info/requires.txt +49 -0
- earthcatalog-0.2.0/earthcatalog.egg-info/top_level.txt +1 -0
- earthcatalog-0.2.0/pyproject.toml +197 -0
- earthcatalog-0.2.0/setup.cfg +4 -0
|
@@ -0,0 +1,333 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: earthcatalog
|
|
3
|
+
Version: 0.2.0
|
|
4
|
+
Summary: earthcatalog is a scalable STAC ingestion library for partitioned GeoParquet catalogs
|
|
5
|
+
Author-email: betolink <betolin@gmail.com>
|
|
6
|
+
License: MIT
|
|
7
|
+
Project-URL: Homepage, https://github.com/betolink/earthcatalog
|
|
8
|
+
Project-URL: Repository, https://github.com/betolink/earthcatalog
|
|
9
|
+
Project-URL: Issues, https://github.com/betolink/earthcatalog/issues
|
|
10
|
+
Keywords: stac,geoparquet,geospatial,distributed,ingestion,catalog
|
|
11
|
+
Classifier: Development Status :: 4 - Beta
|
|
12
|
+
Classifier: Intended Audience :: Developers
|
|
13
|
+
Classifier: Intended Audience :: Science/Research
|
|
14
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
15
|
+
Classifier: Programming Language :: Python :: 3
|
|
16
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
17
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
18
|
+
Classifier: Topic :: Scientific/Engineering :: GIS
|
|
19
|
+
Classifier: Topic :: Software Development :: Libraries :: Python Modules
|
|
20
|
+
Requires-Python: >=3.11
|
|
21
|
+
Description-Content-Type: text/markdown
|
|
22
|
+
Requires-Dist: pandas>=2.3.3
|
|
23
|
+
Requires-Dist: pyarrow>=16.1.0
|
|
24
|
+
Requires-Dist: geopandas>=1.1.0
|
|
25
|
+
Requires-Dist: shapely>=2.1.2
|
|
26
|
+
Requires-Dist: stac-geoparquet>=0.2.0
|
|
27
|
+
Requires-Dist: fsspec>=2025.10.0
|
|
28
|
+
Requires-Dist: requests>=2.31.0
|
|
29
|
+
Requires-Dist: fastparquet>=2024.11.0
|
|
30
|
+
Requires-Dist: obstore>=0.5.1
|
|
31
|
+
Requires-Dist: tqdm>=4.66.4
|
|
32
|
+
Requires-Dist: rustac[arrow]>=0.9.0
|
|
33
|
+
Requires-Dist: PyYAML>=6.0.0
|
|
34
|
+
Requires-Dist: h3>=3.9.0
|
|
35
|
+
Requires-Dist: s2sphere>=0.2.5
|
|
36
|
+
Requires-Dist: mgrs>=1.5.0
|
|
37
|
+
Requires-Dist: s3fs>=2025.1.0
|
|
38
|
+
Requires-Dist: aiohttp>=3.9.0
|
|
39
|
+
Requires-Dist: aiofiles>=23.0.0
|
|
40
|
+
Provides-Extra: dask
|
|
41
|
+
Requires-Dist: dask[distributed]>=2025.1.0; extra == "dask"
|
|
42
|
+
Provides-Extra: all
|
|
43
|
+
Requires-Dist: dask[distributed]>=2025.1.0; extra == "all"
|
|
44
|
+
Provides-Extra: dev
|
|
45
|
+
Requires-Dist: pytest>=8.0.0; extra == "dev"
|
|
46
|
+
Requires-Dist: pytest-cov>=5.0.0; extra == "dev"
|
|
47
|
+
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
|
|
48
|
+
Requires-Dist: aioresponses>=0.7.0; extra == "dev"
|
|
49
|
+
Requires-Dist: black>=24.0.0; extra == "dev"
|
|
50
|
+
Requires-Dist: ruff>=0.4.0; extra == "dev"
|
|
51
|
+
Requires-Dist: mypy>=1.0.0; extra == "dev"
|
|
52
|
+
Requires-Dist: types-requests>=2.31.0; extra == "dev"
|
|
53
|
+
Requires-Dist: types-aiofiles>=23.0.0; extra == "dev"
|
|
54
|
+
Requires-Dist: types-PyYAML>=6.0.0; extra == "dev"
|
|
55
|
+
Requires-Dist: jinja2>=3.1.0; extra == "dev"
|
|
56
|
+
Requires-Dist: pre-commit>=3.0.0; extra == "dev"
|
|
57
|
+
Provides-Extra: docs
|
|
58
|
+
Requires-Dist: mkdocs>=1.5.0; extra == "docs"
|
|
59
|
+
Requires-Dist: mkdocs-material>=9.0.0; extra == "docs"
|
|
60
|
+
Requires-Dist: mkdocs-mermaid2-plugin>=1.1.0; extra == "docs"
|
|
61
|
+
Requires-Dist: mkdocs-swagger-ui-tag>=0.6.0; extra == "docs"
|
|
62
|
+
Requires-Dist: mkdocs-gen-files>=0.5.0; extra == "docs"
|
|
63
|
+
Requires-Dist: mkdocs-git-revision-date-localized-plugin>=1.2.0; extra == "docs"
|
|
64
|
+
Requires-Dist: mkdocs-macros-plugin>=1.0.0; extra == "docs"
|
|
65
|
+
Requires-Dist: mike>=2.0.0; extra == "docs"
|
|
66
|
+
Requires-Dist: pymdown-extensions>=10.0.0; extra == "docs"
|
|
67
|
+
|
|
68
|
+
# EarthCatalog
|
|
69
|
+
|
|
70
|
+
<img src="docs/earthcatalog.png" alt="EarthCatalog Logo" width="200"/>
|
|
71
|
+
|
|
72
|
+
A library for processing STAC items into spatially partitioned GeoParquet catalogs.
|
|
73
|
+
|
|
74
|
+
## Why EarthCatalog?
|
|
75
|
+
|
|
76
|
+
**The Problem**: Working with massive collections of geospatial data (satellite imagery, drone surveys, IoT sensors) is challenging because:
|
|
77
|
+
|
|
78
|
+
- Traditional databases struggle with spatial queries at scale
|
|
79
|
+
- Files become too large to process efficiently
|
|
80
|
+
- Spatial overlap makes data organization complex
|
|
81
|
+
- Updates may require full rebuilds
|
|
82
|
+
|
|
83
|
+
**EarthCatalog** transforms STAC items into fast, spatially-partitioned GeoParquet catalogs that:
|
|
84
|
+
|
|
85
|
+
- **Eliminate full table scans** - Query only relevant spatial partitions using spatial hive-partition pruning first.
|
|
86
|
+
- **Scale to terabytes** - Each partition is independently manageable
|
|
87
|
+
- **Support incremental updates** - Add new data without rebuilding the whole catalog
|
|
88
|
+
- **Handle complex geometries** - Smart global partitioning for multi-region items
|
|
89
|
+
|
|
90
|
+
## Key Features
|
|
91
|
+
|
|
92
|
+
- **Smart Spatial Partitioning**: Multiple grid systems (H3, S2, UTM, MGRS, LatLon, Custom geojson)
|
|
93
|
+
- **Global Partition Schema**: Auto-routes large/complex geometries to global partitions
|
|
94
|
+
- **Temporal Binning**: Year, month, or day-based time partitioning
|
|
95
|
+
- **Distributed Processing**: Local multi-threading or Dask distributed
|
|
96
|
+
- **Incremental Updates**: Merge new data with existing partitions
|
|
97
|
+
|
|
98
|
+
## Quick Start
|
|
99
|
+
|
|
100
|
+
### Installation
|
|
101
|
+
|
|
102
|
+
```bash
|
|
103
|
+
pip install earthcatalog
|
|
104
|
+
|
|
105
|
+
# With distributed processing support
|
|
106
|
+
pip install "earthcatalog[dask]"
|
|
107
|
+
```
|
|
108
|
+
|
|
109
|
+
### Basic Usage
|
|
110
|
+
|
|
111
|
+
```bash
|
|
112
|
+
# Process STAC URLs into a spatial catalog
|
|
113
|
+
stac-ingest \
|
|
114
|
+
--input stac_urls.parquet \
|
|
115
|
+
--output ./catalog \
|
|
116
|
+
--scratch ./scratch \
|
|
117
|
+
--workers 4
|
|
118
|
+
|
|
119
|
+
# Generate schema metadata for efficient querying (enabled by default)
|
|
120
|
+
stac-ingest \
|
|
121
|
+
--input stac_urls.parquet \
|
|
122
|
+
--output ./catalog \
|
|
123
|
+
--scratch ./scratch \
|
|
124
|
+
--workers 4
|
|
125
|
+
```
|
|
126
|
+
|
|
127
|
+
### Example: Create Input Data
|
|
128
|
+
|
|
129
|
+
```python
|
|
130
|
+
import pandas as pd
|
|
131
|
+
|
|
132
|
+
# Sample STAC item URLs
|
|
133
|
+
urls = [
|
|
134
|
+
"https://earth-search.aws.element84.com/v1/collections/sentinel-2-l2a/items/S2A_20240101_123456",
|
|
135
|
+
"https://earth-search.aws.element84.com/v1/collections/landsat-8-c2-l2/items/LC08_20240103_345678",
|
|
136
|
+
]
|
|
137
|
+
|
|
138
|
+
df = pd.DataFrame({"url": urls})
|
|
139
|
+
df.to_parquet("stac_urls.parquet", index=False)
|
|
140
|
+
```
|
|
141
|
+
|
|
142
|
+
## Configuration Examples
|
|
143
|
+
|
|
144
|
+
```bash
|
|
145
|
+
# Use S2 grid with daily partitioning
|
|
146
|
+
stac-ingest --input urls.parquet --output ./catalog --scratch ./scratch \
|
|
147
|
+
--grid s2 --grid-resolution 13 --temporal-bin day
|
|
148
|
+
|
|
149
|
+
# Enable global partitioning with custom thresholds
|
|
150
|
+
stac-ingest --input urls.parquet --output ./catalog --scratch ./scratch \
|
|
151
|
+
--global-thresholds-file custom-thresholds.json
|
|
152
|
+
|
|
153
|
+
# Distributed processing with Dask
|
|
154
|
+
stac-ingest --input s3://bucket/urls.parquet --output s3://bucket/catalog \
|
|
155
|
+
--scratch s3://bucket/scratch --processor dask --workers 16
|
|
156
|
+
```
|
|
157
|
+
|
|
158
|
+
### Example: Efficient Spatial Queries
|
|
159
|
+
|
|
160
|
+
```python
|
|
161
|
+
# Traditional approach (slow - scans entire catalog)
|
|
162
|
+
import geopandas as gpd
|
|
163
|
+
from shapely.geometry import box
|
|
164
|
+
|
|
165
|
+
roi = box(-122.5, 37.7, -122.0, 38.0) # San Francisco area
|
|
166
|
+
df = gpd.read_parquet("catalog/**/*.parquet") # Reads EVERYTHING
|
|
167
|
+
results = df[df.intersects(roi)]
|
|
168
|
+
print(f"Found {len(results)} items (but scanned entire catalog)")
|
|
169
|
+
|
|
170
|
+
# EarthCatalog approach (fast - scans only relevant partitions)
|
|
171
|
+
from earthcatalog.spatial_resolver import spatial_resolver
|
|
172
|
+
import duckdb
|
|
173
|
+
|
|
174
|
+
resolver = spatial_resolver("catalog/catalog_schema.json")
|
|
175
|
+
partitions = resolver.resolve_partitions(roi)
|
|
176
|
+
paths = resolver.generate_query_paths(partitions)
|
|
177
|
+
|
|
178
|
+
result = duckdb.sql(f"SELECT * FROM read_parquet({paths})").df()
|
|
179
|
+
print(f"Found {len(result)} items (scanned only {len(partitions)} partitions)")
|
|
180
|
+
|
|
181
|
+
# Remote schema files (S3, GCS, Azure, HTTP) - requires fsspec
|
|
182
|
+
resolver = spatial_resolver("s3://my-bucket/catalog_schema.json", "s3://my-bucket/catalog/")
|
|
183
|
+
resolver = spatial_resolver("https://example.com/schema.json", "./local-catalog/")
|
|
184
|
+
```
|
|
185
|
+
|
|
186
|
+
## Output Structure
|
|
187
|
+
|
|
188
|
+
Uses Hive-style temporal partitioning for optimal query pruning in DuckDB, Athena, and Spark:
|
|
189
|
+
|
|
190
|
+
```
|
|
191
|
+
catalog/
|
|
192
|
+
├── {mission}/
|
|
193
|
+
│ └── partition=h3/
|
|
194
|
+
│ └── level=2/
|
|
195
|
+
│ ├── 8928308280fffff/
|
|
196
|
+
│ │ └── year=2024/
|
|
197
|
+
│ │ ├── month=01/
|
|
198
|
+
│ │ │ └── items.parquet # January 2024 items
|
|
199
|
+
│ │ └── month=02/
|
|
200
|
+
│ │ └── items.parquet
|
|
201
|
+
│ └── global/
|
|
202
|
+
│ └── year=2024/
|
|
203
|
+
│ └── month=01/
|
|
204
|
+
│ └── items.parquet # Large geometries spanning multiple cells
|
|
205
|
+
└── catalog_schema.json # Generated metadata for efficient querying (enabled by default)
|
|
206
|
+
```
|
|
207
|
+
|
|
208
|
+
## Schema Metadata and Efficient Querying
|
|
209
|
+
|
|
210
|
+
EarthCatalog generates comprehensive metadata about your catalog's partitioning scheme by default:
|
|
211
|
+
|
|
212
|
+
```bash
|
|
213
|
+
# Schema is generated by default
|
|
214
|
+
stac-ingest --input urls.parquet --output ./catalog --scratch ./scratch
|
|
215
|
+
|
|
216
|
+
# Use custom schema filename
|
|
217
|
+
stac-ingest --input urls.parquet --output ./catalog --scratch ./scratch \
|
|
218
|
+
--schema-filename my_catalog_schema.json
|
|
219
|
+
|
|
220
|
+
# Disable schema generation
|
|
221
|
+
stac-ingest --input urls.parquet --output ./catalog --scratch ./scratch \
|
|
222
|
+
--no-generate-schema
|
|
223
|
+
```
|
|
224
|
+
|
|
225
|
+
The generated schema includes:
|
|
226
|
+
|
|
227
|
+
- **Grid system details**: Type, resolution, cell sizes, coordinate system
|
|
228
|
+
- **Partition structure**: All spatial and temporal partitions created
|
|
229
|
+
- **Usage examples**: DuckDB queries for efficient partition pruning
|
|
230
|
+
- **Statistics**: Item counts, partition counts, processing info
|
|
231
|
+
|
|
232
|
+
### Automatic Global Partition Detection
|
|
233
|
+
|
|
234
|
+
The resolver intelligently includes the **global partition** when needed:
|
|
235
|
+
|
|
236
|
+
```python
|
|
237
|
+
# Threshold-based inclusion (queries spanning many cells include global)
|
|
238
|
+
large_area = box(-130, 30, -110, 50) # Multi-state region
|
|
239
|
+
partitions = resolver.resolve_partitions(large_area)
|
|
240
|
+
# Includes 'global' because query spans > threshold cells
|
|
241
|
+
|
|
242
|
+
# Geography-based inclusion (continental-scale areas include global)
|
|
243
|
+
continental = box(-180, -60, 180, 80) # Nearly global extent
|
|
244
|
+
partitions = resolver.resolve_partitions(continental)
|
|
245
|
+
# Includes 'global' because geometry area > large geometry threshold
|
|
246
|
+
|
|
247
|
+
# Manual control when needed
|
|
248
|
+
partitions_no_global = resolver.resolve_partitions(large_area, include_global=False)
|
|
249
|
+
partitions_force_global = resolver.resolve_partitions(small_area, include_global=True)
|
|
250
|
+
```
|
|
251
|
+
|
|
252
|
+
### Remote Schema Files
|
|
253
|
+
|
|
254
|
+
The `spatial_resolver()` function supports schema files stored in cloud storage or remote locations:
|
|
255
|
+
|
|
256
|
+
```python
|
|
257
|
+
from earthcatalog.spatial_resolver import spatial_resolver
|
|
258
|
+
|
|
259
|
+
# S3 (requires fsspec[s3])
|
|
260
|
+
resolver = spatial_resolver("s3://my-bucket/catalog_schema.json", "s3://my-bucket/catalog/")
|
|
261
|
+
|
|
262
|
+
# Google Cloud Storage (requires fsspec[gcs])
|
|
263
|
+
resolver = spatial_resolver("gs://my-bucket/catalog_schema.json", "gs://my-bucket/catalog/")
|
|
264
|
+
|
|
265
|
+
# Azure Blob Storage (requires fsspec[azure])
|
|
266
|
+
resolver = spatial_resolver("abfs://container/catalog_schema.json", "abfs://container/catalog/")
|
|
267
|
+
|
|
268
|
+
# HTTP/HTTPS
|
|
269
|
+
resolver = spatial_resolver("https://example.com/catalog_schema.json", "./local-catalog/")
|
|
270
|
+
|
|
271
|
+
# Mixed: Remote schema with local catalog
|
|
272
|
+
resolver = spatial_resolver("s3://bucket/schema.json", "/local/catalog/")
|
|
273
|
+
```
|
|
274
|
+
|
|
275
|
+
**Requirements:**
|
|
276
|
+
|
|
277
|
+
- Install fsspec with appropriate extras: `pip install fsspec[s3]`, `fsspec[gcs]`, `fsspec[azure]`
|
|
278
|
+
- The `catalog_path` parameter is required for remote schema files
|
|
279
|
+
- Authentication follows fsspec conventions (AWS credentials, service accounts, etc.)
|
|
280
|
+
|
|
281
|
+
### Grid-Specific Resolution
|
|
282
|
+
|
|
283
|
+
**Key Benefits:**
|
|
284
|
+
|
|
285
|
+
- **Automatic Resolution**: No need to manually calculate grid intersections
|
|
286
|
+
- **All Grid Systems**: Works with H3, S2, MGRS, UTM, LatLon, and custom GeoJSON
|
|
287
|
+
- **Configurable Overlap**: Control boundary handling and buffer zones
|
|
288
|
+
- **Performance**: Query only relevant partitions instead of full catalog scan
|
|
289
|
+
- **DuckDB Integration**: Generates ready-to-use file path patterns
|
|
290
|
+
|
|
291
|
+
## ⚡ Performance Benchmarks
|
|
292
|
+
|
|
293
|
+
**Query Performance Comparison** (San Francisco Bay Area query on global dataset):
|
|
294
|
+
|
|
295
|
+
| Metric | Without Pruning | With Spatial Resolution | Improvement |
|
|
296
|
+
|--------|-----------------|-------------------------|-------------|
|
|
297
|
+
| **Data Scanned** | 50GB+ | 6GB | **88.5% reduction** |
|
|
298
|
+
| **Query Time** | 45 seconds | 5.2 seconds | **8.7x faster** |
|
|
299
|
+
| **Memory Usage** | 12GB | 2.1GB | **82% reduction** |
|
|
300
|
+
| **Files Read** | 15,000+ | 1,200 | **92% fewer files** |
|
|
301
|
+
|
|
302
|
+
**Grid System Performance** (typical regional query):
|
|
303
|
+
|
|
304
|
+
- **H3 Resolution 6**: 8-12 cells → ~85-90% data reduction
|
|
305
|
+
- **MGRS 100km**: 1-4 zones → ~95-98% data reduction
|
|
306
|
+
- **Custom GeoJSON**: Variable based on tile design
|
|
307
|
+
|
|
308
|
+
## Documentation
|
|
309
|
+
|
|
310
|
+
- 📖 **[Full Documentation](docs/)** - Complete guides and API reference
|
|
311
|
+
- 🏁 **[Quick Start Guide](docs/quickstart.md)** - Get up and running in minutes
|
|
312
|
+
- ⚙️ **[Configuration Guide](docs/configuration.md)** - All configuration options
|
|
313
|
+
- 🌍 **[Global Partitioning](docs/concepts/grids/global-partitioning.md)** - Handle large/complex geometries
|
|
314
|
+
- 🔧 **[API Reference](docs/api-reference/)** - Python and CLI documentation
|
|
315
|
+
|
|
316
|
+
## Contributing
|
|
317
|
+
|
|
318
|
+
```bash
|
|
319
|
+
# Development setup
|
|
320
|
+
git clone https://github.com/betolink/earthcatalog.git
|
|
321
|
+
cd earthcatalog
|
|
322
|
+
pip install -e ".[dev]"
|
|
323
|
+
|
|
324
|
+
# Run tests
|
|
325
|
+
python -m pytest
|
|
326
|
+
|
|
327
|
+
# Format and lint
|
|
328
|
+
black earthcatalog/ && ruff check earthcatalog/
|
|
329
|
+
```
|
|
330
|
+
|
|
331
|
+
## License
|
|
332
|
+
|
|
333
|
+
MIT License - see LICENSE file for details.
|
|
@@ -0,0 +1,266 @@
|
|
|
1
|
+
# EarthCatalog
|
|
2
|
+
|
|
3
|
+
<img src="docs/earthcatalog.png" alt="EarthCatalog Logo" width="200"/>
|
|
4
|
+
|
|
5
|
+
A library for processing STAC items into spatially partitioned GeoParquet catalogs.
|
|
6
|
+
|
|
7
|
+
## Why EarthCatalog?
|
|
8
|
+
|
|
9
|
+
**The Problem**: Working with massive collections of geospatial data (satellite imagery, drone surveys, IoT sensors) is challenging because:
|
|
10
|
+
|
|
11
|
+
- Traditional databases struggle with spatial queries at scale
|
|
12
|
+
- Files become too large to process efficiently
|
|
13
|
+
- Spatial overlap makes data organization complex
|
|
14
|
+
- Updates may require full rebuilds
|
|
15
|
+
|
|
16
|
+
**EarthCatalog** transforms STAC items into fast, spatially-partitioned GeoParquet catalogs that:
|
|
17
|
+
|
|
18
|
+
- **Eliminate full table scans** - Query only relevant spatial partitions using spatial hive-partition pruning first.
|
|
19
|
+
- **Scale to terabytes** - Each partition is independently manageable
|
|
20
|
+
- **Support incremental updates** - Add new data without rebuilding the whole catalog
|
|
21
|
+
- **Handle complex geometries** - Smart global partitioning for multi-region items
|
|
22
|
+
|
|
23
|
+
## Key Features
|
|
24
|
+
|
|
25
|
+
- **Smart Spatial Partitioning**: Multiple grid systems (H3, S2, UTM, MGRS, LatLon, Custom geojson)
|
|
26
|
+
- **Global Partition Schema**: Auto-routes large/complex geometries to global partitions
|
|
27
|
+
- **Temporal Binning**: Year, month, or day-based time partitioning
|
|
28
|
+
- **Distributed Processing**: Local multi-threading or Dask distributed
|
|
29
|
+
- **Incremental Updates**: Merge new data with existing partitions
|
|
30
|
+
|
|
31
|
+
## Quick Start
|
|
32
|
+
|
|
33
|
+
### Installation
|
|
34
|
+
|
|
35
|
+
```bash
|
|
36
|
+
pip install earthcatalog
|
|
37
|
+
|
|
38
|
+
# With distributed processing support
|
|
39
|
+
pip install "earthcatalog[dask]"
|
|
40
|
+
```
|
|
41
|
+
|
|
42
|
+
### Basic Usage
|
|
43
|
+
|
|
44
|
+
```bash
|
|
45
|
+
# Process STAC URLs into a spatial catalog
|
|
46
|
+
stac-ingest \
|
|
47
|
+
--input stac_urls.parquet \
|
|
48
|
+
--output ./catalog \
|
|
49
|
+
--scratch ./scratch \
|
|
50
|
+
--workers 4
|
|
51
|
+
|
|
52
|
+
# Generate schema metadata for efficient querying (enabled by default)
|
|
53
|
+
stac-ingest \
|
|
54
|
+
--input stac_urls.parquet \
|
|
55
|
+
--output ./catalog \
|
|
56
|
+
--scratch ./scratch \
|
|
57
|
+
--workers 4
|
|
58
|
+
```
|
|
59
|
+
|
|
60
|
+
### Example: Create Input Data
|
|
61
|
+
|
|
62
|
+
```python
|
|
63
|
+
import pandas as pd
|
|
64
|
+
|
|
65
|
+
# Sample STAC item URLs
|
|
66
|
+
urls = [
|
|
67
|
+
"https://earth-search.aws.element84.com/v1/collections/sentinel-2-l2a/items/S2A_20240101_123456",
|
|
68
|
+
"https://earth-search.aws.element84.com/v1/collections/landsat-8-c2-l2/items/LC08_20240103_345678",
|
|
69
|
+
]
|
|
70
|
+
|
|
71
|
+
df = pd.DataFrame({"url": urls})
|
|
72
|
+
df.to_parquet("stac_urls.parquet", index=False)
|
|
73
|
+
```
|
|
74
|
+
|
|
75
|
+
## Configuration Examples
|
|
76
|
+
|
|
77
|
+
```bash
|
|
78
|
+
# Use S2 grid with daily partitioning
|
|
79
|
+
stac-ingest --input urls.parquet --output ./catalog --scratch ./scratch \
|
|
80
|
+
--grid s2 --grid-resolution 13 --temporal-bin day
|
|
81
|
+
|
|
82
|
+
# Enable global partitioning with custom thresholds
|
|
83
|
+
stac-ingest --input urls.parquet --output ./catalog --scratch ./scratch \
|
|
84
|
+
--global-thresholds-file custom-thresholds.json
|
|
85
|
+
|
|
86
|
+
# Distributed processing with Dask
|
|
87
|
+
stac-ingest --input s3://bucket/urls.parquet --output s3://bucket/catalog \
|
|
88
|
+
--scratch s3://bucket/scratch --processor dask --workers 16
|
|
89
|
+
```
|
|
90
|
+
|
|
91
|
+
### Example: Efficient Spatial Queries
|
|
92
|
+
|
|
93
|
+
```python
|
|
94
|
+
# Traditional approach (slow - scans entire catalog)
|
|
95
|
+
import geopandas as gpd
|
|
96
|
+
from shapely.geometry import box
|
|
97
|
+
|
|
98
|
+
roi = box(-122.5, 37.7, -122.0, 38.0) # San Francisco area
|
|
99
|
+
df = gpd.read_parquet("catalog/**/*.parquet") # Reads EVERYTHING
|
|
100
|
+
results = df[df.intersects(roi)]
|
|
101
|
+
print(f"Found {len(results)} items (but scanned entire catalog)")
|
|
102
|
+
|
|
103
|
+
# EarthCatalog approach (fast - scans only relevant partitions)
|
|
104
|
+
from earthcatalog.spatial_resolver import spatial_resolver
|
|
105
|
+
import duckdb
|
|
106
|
+
|
|
107
|
+
resolver = spatial_resolver("catalog/catalog_schema.json")
|
|
108
|
+
partitions = resolver.resolve_partitions(roi)
|
|
109
|
+
paths = resolver.generate_query_paths(partitions)
|
|
110
|
+
|
|
111
|
+
result = duckdb.sql(f"SELECT * FROM read_parquet({paths})").df()
|
|
112
|
+
print(f"Found {len(result)} items (scanned only {len(partitions)} partitions)")
|
|
113
|
+
|
|
114
|
+
# Remote schema files (S3, GCS, Azure, HTTP) - requires fsspec
|
|
115
|
+
resolver = spatial_resolver("s3://my-bucket/catalog_schema.json", "s3://my-bucket/catalog/")
|
|
116
|
+
resolver = spatial_resolver("https://example.com/schema.json", "./local-catalog/")
|
|
117
|
+
```
|
|
118
|
+
|
|
119
|
+
## Output Structure
|
|
120
|
+
|
|
121
|
+
Uses Hive-style temporal partitioning for optimal query pruning in DuckDB, Athena, and Spark:
|
|
122
|
+
|
|
123
|
+
```
|
|
124
|
+
catalog/
|
|
125
|
+
├── {mission}/
|
|
126
|
+
│ └── partition=h3/
|
|
127
|
+
│ └── level=2/
|
|
128
|
+
│ ├── 8928308280fffff/
|
|
129
|
+
│ │ └── year=2024/
|
|
130
|
+
│ │ ├── month=01/
|
|
131
|
+
│ │ │ └── items.parquet # January 2024 items
|
|
132
|
+
│ │ └── month=02/
|
|
133
|
+
│ │ └── items.parquet
|
|
134
|
+
│ └── global/
|
|
135
|
+
│ └── year=2024/
|
|
136
|
+
│ └── month=01/
|
|
137
|
+
│ └── items.parquet # Large geometries spanning multiple cells
|
|
138
|
+
└── catalog_schema.json # Generated metadata for efficient querying (enabled by default)
|
|
139
|
+
```
|
|
140
|
+
|
|
141
|
+
## Schema Metadata and Efficient Querying
|
|
142
|
+
|
|
143
|
+
EarthCatalog generates comprehensive metadata about your catalog's partitioning scheme by default:
|
|
144
|
+
|
|
145
|
+
```bash
|
|
146
|
+
# Schema is generated by default
|
|
147
|
+
stac-ingest --input urls.parquet --output ./catalog --scratch ./scratch
|
|
148
|
+
|
|
149
|
+
# Use custom schema filename
|
|
150
|
+
stac-ingest --input urls.parquet --output ./catalog --scratch ./scratch \
|
|
151
|
+
--schema-filename my_catalog_schema.json
|
|
152
|
+
|
|
153
|
+
# Disable schema generation
|
|
154
|
+
stac-ingest --input urls.parquet --output ./catalog --scratch ./scratch \
|
|
155
|
+
--no-generate-schema
|
|
156
|
+
```
|
|
157
|
+
|
|
158
|
+
The generated schema includes:
|
|
159
|
+
|
|
160
|
+
- **Grid system details**: Type, resolution, cell sizes, coordinate system
|
|
161
|
+
- **Partition structure**: All spatial and temporal partitions created
|
|
162
|
+
- **Usage examples**: DuckDB queries for efficient partition pruning
|
|
163
|
+
- **Statistics**: Item counts, partition counts, processing info
|
|
164
|
+
|
|
165
|
+
### Automatic Global Partition Detection
|
|
166
|
+
|
|
167
|
+
The resolver intelligently includes the **global partition** when needed:
|
|
168
|
+
|
|
169
|
+
```python
|
|
170
|
+
# Threshold-based inclusion (queries spanning many cells include global)
|
|
171
|
+
large_area = box(-130, 30, -110, 50) # Multi-state region
|
|
172
|
+
partitions = resolver.resolve_partitions(large_area)
|
|
173
|
+
# Includes 'global' because query spans > threshold cells
|
|
174
|
+
|
|
175
|
+
# Geography-based inclusion (continental-scale areas include global)
|
|
176
|
+
continental = box(-180, -60, 180, 80) # Nearly global extent
|
|
177
|
+
partitions = resolver.resolve_partitions(continental)
|
|
178
|
+
# Includes 'global' because geometry area > large geometry threshold
|
|
179
|
+
|
|
180
|
+
# Manual control when needed
|
|
181
|
+
partitions_no_global = resolver.resolve_partitions(large_area, include_global=False)
|
|
182
|
+
partitions_force_global = resolver.resolve_partitions(small_area, include_global=True)
|
|
183
|
+
```
|
|
184
|
+
|
|
185
|
+
### Remote Schema Files
|
|
186
|
+
|
|
187
|
+
The `spatial_resolver()` function supports schema files stored in cloud storage or remote locations:
|
|
188
|
+
|
|
189
|
+
```python
|
|
190
|
+
from earthcatalog.spatial_resolver import spatial_resolver
|
|
191
|
+
|
|
192
|
+
# S3 (requires fsspec[s3])
|
|
193
|
+
resolver = spatial_resolver("s3://my-bucket/catalog_schema.json", "s3://my-bucket/catalog/")
|
|
194
|
+
|
|
195
|
+
# Google Cloud Storage (requires fsspec[gcs])
|
|
196
|
+
resolver = spatial_resolver("gs://my-bucket/catalog_schema.json", "gs://my-bucket/catalog/")
|
|
197
|
+
|
|
198
|
+
# Azure Blob Storage (requires fsspec[azure])
|
|
199
|
+
resolver = spatial_resolver("abfs://container/catalog_schema.json", "abfs://container/catalog/")
|
|
200
|
+
|
|
201
|
+
# HTTP/HTTPS
|
|
202
|
+
resolver = spatial_resolver("https://example.com/catalog_schema.json", "./local-catalog/")
|
|
203
|
+
|
|
204
|
+
# Mixed: Remote schema with local catalog
|
|
205
|
+
resolver = spatial_resolver("s3://bucket/schema.json", "/local/catalog/")
|
|
206
|
+
```
|
|
207
|
+
|
|
208
|
+
**Requirements:**
|
|
209
|
+
|
|
210
|
+
- Install fsspec with appropriate extras: `pip install fsspec[s3]`, `fsspec[gcs]`, `fsspec[azure]`
|
|
211
|
+
- The `catalog_path` parameter is required for remote schema files
|
|
212
|
+
- Authentication follows fsspec conventions (AWS credentials, service accounts, etc.)
|
|
213
|
+
|
|
214
|
+
### Grid-Specific Resolution
|
|
215
|
+
|
|
216
|
+
**Key Benefits:**
|
|
217
|
+
|
|
218
|
+
- **Automatic Resolution**: No need to manually calculate grid intersections
|
|
219
|
+
- **All Grid Systems**: Works with H3, S2, MGRS, UTM, LatLon, and custom GeoJSON
|
|
220
|
+
- **Configurable Overlap**: Control boundary handling and buffer zones
|
|
221
|
+
- **Performance**: Query only relevant partitions instead of full catalog scan
|
|
222
|
+
- **DuckDB Integration**: Generates ready-to-use file path patterns
|
|
223
|
+
|
|
224
|
+
## ⚡ Performance Benchmarks
|
|
225
|
+
|
|
226
|
+
**Query Performance Comparison** (San Francisco Bay Area query on global dataset):
|
|
227
|
+
|
|
228
|
+
| Metric | Without Pruning | With Spatial Resolution | Improvement |
|
|
229
|
+
|--------|-----------------|-------------------------|-------------|
|
|
230
|
+
| **Data Scanned** | 50GB+ | 6GB | **88.5% reduction** |
|
|
231
|
+
| **Query Time** | 45 seconds | 5.2 seconds | **8.7x faster** |
|
|
232
|
+
| **Memory Usage** | 12GB | 2.1GB | **82% reduction** |
|
|
233
|
+
| **Files Read** | 15,000+ | 1,200 | **92% fewer files** |
|
|
234
|
+
|
|
235
|
+
**Grid System Performance** (typical regional query):
|
|
236
|
+
|
|
237
|
+
- **H3 Resolution 6**: 8-12 cells → ~85-90% data reduction
|
|
238
|
+
- **MGRS 100km**: 1-4 zones → ~95-98% data reduction
|
|
239
|
+
- **Custom GeoJSON**: Variable based on tile design
|
|
240
|
+
|
|
241
|
+
## Documentation
|
|
242
|
+
|
|
243
|
+
- 📖 **[Full Documentation](docs/)** - Complete guides and API reference
|
|
244
|
+
- 🏁 **[Quick Start Guide](docs/quickstart.md)** - Get up and running in minutes
|
|
245
|
+
- ⚙️ **[Configuration Guide](docs/configuration.md)** - All configuration options
|
|
246
|
+
- 🌍 **[Global Partitioning](docs/concepts/grids/global-partitioning.md)** - Handle large/complex geometries
|
|
247
|
+
- 🔧 **[API Reference](docs/api-reference/)** - Python and CLI documentation
|
|
248
|
+
|
|
249
|
+
## Contributing
|
|
250
|
+
|
|
251
|
+
```bash
|
|
252
|
+
# Development setup
|
|
253
|
+
git clone https://github.com/betolink/earthcatalog.git
|
|
254
|
+
cd earthcatalog
|
|
255
|
+
pip install -e ".[dev]"
|
|
256
|
+
|
|
257
|
+
# Run tests
|
|
258
|
+
python -m pytest
|
|
259
|
+
|
|
260
|
+
# Format and lint
|
|
261
|
+
black earthcatalog/ && ruff check earthcatalog/
|
|
262
|
+
```
|
|
263
|
+
|
|
264
|
+
## License
|
|
265
|
+
|
|
266
|
+
MIT License - see LICENSE file for details.
|