geoparquet-extractor 0.1.0

package/LICENSE ADDED
@@ -0,0 +1,24 @@
+ This is free and unencumbered software released into the public domain.
+
+ Anyone is free to copy, modify, publish, use, compile, sell, or
+ distribute this software, either in source code form or as a compiled
+ binary, for any purpose, commercial or non-commercial, and by any
+ means.
+
+ In jurisdictions that recognize copyright laws, the author or authors
+ of this software dedicate any and all copyright interest in the
+ software to the public domain. We make this dedication for the benefit
+ of the public at large and to the detriment of our heirs and
+ successors. We intend this dedication to be an overt act of
+ relinquishment in perpetuity of all present and future rights to this
+ software under copyright law.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+ IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ OTHER DEALINGS IN THE SOFTWARE.
+
+ For more information, please refer to <https://unlicense.org>
package/README.md ADDED
@@ -0,0 +1,222 @@
+ # geoparquet-extractor
+
+ [![npm version](https://img.shields.io/npm/v/geoparquet-extractor)](https://www.npmjs.com/package/geoparquet-extractor)
+ [![GitHub release](https://img.shields.io/github/v/release/ramSeraph/geoparquet_extractor)](https://github.com/ramSeraph/geoparquet_extractor/releases/latest)
+
+ Extract and convert spatial data from remote GeoParquet files in the browser. Supports bbox filtering, multiple output formats, and pluggable metadata providers.
+
+ > **Browser-only**: requires Origin Private File System (OPFS), Web Workers, and Web Locks APIs.
+
+ ## Installation
+
+ ```bash
+ npm install geoparquet-extractor
+ ```
+
+ ## Quick Start
+
+ ```javascript
+ import { GeoParquetExtractor, createDuckDBClient } from 'geoparquet-extractor';
+
+ // You initialize DuckDB yourself (see "DuckDB Setup" below)
+ import * as duckdb from 'duckdb-wasm-opfs-tempdir';
+
+ const db = /* your initialized AsyncDuckDB instance */;
+ const client = await createDuckDBClient(db, {
+   extensions: ['spatial', 'httpfs'],
+ });
+
+ const extractor = new GeoParquetExtractor({ duckdb: client });
+
+ await extractor.extract({
+   urls: ['https://example.com/data.parquet'],
+   bbox: [77.5, 12.9, 77.7, 13.1], // [minx, miny, maxx, maxy]
+   format: 'geoparquet',
+   baseName: 'my-data',
+   onProgress: (pct) => console.log(`${pct}%`),
+   onStatus: (msg) => console.log(msg),
+ });
+ ```
+
+ ## Features
+
+ - **9 output formats**: GeoParquet (v1.1 & v2.0), GeoPackage, Shapefile, CSV, GeoJSON, GeoJSONSeq, KML, DXF
+ - **Spatial filtering**: Bbox intersection with per-partition and per-row-group optimization
+ - **Pluggable metadata**: Override how partition URLs and bboxes are resolved
+ - **Extent visualization data**: Fetch partition/row-group bboxes as GeoJSON for map display
+ - **DuckDB-powered**: Spatial SQL queries via DuckDB WASM (you provide the instance)
+ - **Self-contained GeoPackage worker**: wa-sqlite-rtree is bundled into the worker, so no CDN is needed
+
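The per-partition part of the spatial filter can be pictured as a plain bbox intersection test over partition extents. The sketch below is an illustration only, not the library's implementation: `bboxIntersects` and `selectPartitions` are hypothetical helper names, the file names and coordinates are made up, and the extents object is assumed to follow the `{ filename: [minx, miny, maxx, maxy] }` shape documented for `MetadataProvider.getExtents`.

```javascript
// Hypothetical helpers, not part of geoparquet-extractor's API.
// Every bbox is [minx, miny, maxx, maxy].
function bboxIntersects(a, b) {
  // Two boxes overlap when they overlap on both the x and the y axis.
  return a[0] <= b[2] && a[2] >= b[0] && a[1] <= b[3] && a[3] >= b[1];
}

// Keep only the partitions whose extent overlaps the query bbox.
function selectPartitions(extents, queryBbox) {
  return Object.entries(extents)
    .filter(([, extent]) => bboxIntersects(extent, queryBbox))
    .map(([filename]) => filename);
}

// Made-up partition extents for illustration.
const extents = {
  'north.parquet': [77.0, 13.0, 78.0, 14.0],
  'south.parquet': [77.0, 11.0, 78.0, 12.0],
};

const selected = selectPartitions(extents, [77.5, 12.9, 77.7, 13.1]);
console.log(selected); // only 'north.parquet' overlaps the query bbox
```

Partitions that fail this test never need to be read, which is why a coarse extent check before the row-level query is worthwhile for remote files.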
+ ## Formats
+
+ | Format | Value | Extension | Notes |
+ |--------|-------|-----------|-------|
+ | GeoPackage | `geopackage` | `.gpkg` | Requires GeoPackage worker |
+ | GeoJSON | `geojson` | `.geojson` | FeatureCollection |
+ | GeoJSONSeq | `geojsonseq` | `.geojsonl` | Newline-delimited |
+ | GeoParquet v1.1 | `geoparquet` | `.parquet` | With Hilbert spatial sort |
+ | GeoParquet v2.0 | `geoparquet2` | `.parquet` | Native geometry encoding |
+ | CSV | `csv` | `.csv` | WKT geometry column |
+ | Shapefile | `shapefile` | `.shp` | 2 GB limit per component |
+ | KML | `kml` | `.kml` | XML format |
+ | DXF | `dxf` | `.dxf` | AutoCAD R14, UTM projection |
+
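For application code that needs to validate a `format` value or predict an output file name, the table can be transcribed into a lookup. This is a minimal sketch: `FORMAT_EXTENSIONS` and `outputFileName` are hypothetical names, and the library may name its outputs differently (a Shapefile, for instance, consists of several component files).

```javascript
// Format values and extensions transcribed from the Formats table.
// A Map avoids accidental hits on inherited object properties.
const FORMAT_EXTENSIONS = new Map([
  ['geopackage', '.gpkg'],
  ['geojson', '.geojson'],
  ['geojsonseq', '.geojsonl'],
  ['geoparquet', '.parquet'],
  ['geoparquet2', '.parquet'],
  ['csv', '.csv'],
  ['shapefile', '.shp'],
  ['kml', '.kml'],
  ['dxf', '.dxf'],
]);

// Hypothetical helper: build a file name for a given format value.
function outputFileName(baseName, format) {
  const ext = FORMAT_EXTENSIONS.get(format);
  if (ext === undefined) {
    throw new Error(`Unknown format: ${format}`);
  }
  return baseName + ext;
}

console.log(outputFileName('my-data', 'geopackage')); // my-data.gpkg
```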
+ ## DuckDB Setup
+
+ The library does NOT bundle DuckDB WASM. You initialize it yourself and pass it in:
+
+ ```javascript
+ import { createDuckDBClient } from 'geoparquet-extractor';
+ import * as duckdb from 'duckdb-wasm-opfs-tempdir';
+
+ // Standard duckdb-wasm-opfs-tempdir init
+ const MANUAL_BUNDLES = { /* your bundle config */ };
+ const bundle = await duckdb.selectBundle(MANUAL_BUNDLES);
+ const worker = new Worker(bundle.mainWorker);
+ const logger = new duckdb.ConsoleLogger();
+ const db = new duckdb.AsyncDuckDB(logger, worker);
+ await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
+
+ // Wrap it for the library
+ const client = await createDuckDBClient(db, {
+   extensions: ['spatial', 'httpfs'],
+ });
+ ```
+
+ ### Custom DuckDB Builds
+
+ The `duckdb-wasm-opfs-tempdir` package supports `SET temp_directory = 'opfs://...'` for large downloads that exceed browser memory limits. The library's `createDuckDBClient` adapter works with any DuckDB WASM build that provides `AsyncDuckDB`.
+
+ ## GeoPackage Worker
+
+ The GeoPackage format requires a Web Worker for wa-sqlite. The library ships a self-contained worker with wa-sqlite-rtree bundled in:
+
+ ```javascript
+ // Use one of the two options below, not both.
+
+ // Option 1: URL to hosted worker
+ const extractor = new GeoParquetExtractor({
+   duckdb: client,
+   gpkgWorkerUrl: '/workers/gpkg_worker.js',
+ });
+
+ // Option 2: Worker instance (the package's `gpkg-worker` entry resolves to dist/gpkg_worker.js)
+ const worker = new Worker(new URL('geoparquet-extractor/gpkg-worker', import.meta.url), { type: 'module' });
+ const extractor = new GeoParquetExtractor({
+   duckdb: client,
+   gpkgWorker: worker,
+ });
+ ```
+
+ > **Note**: The worker requires `wa-sqlite-async.wasm` to be served from the same directory as `gpkg_worker.js`. Both files are included in the `dist/` directory.
+
+ ## Custom Metadata Provider
+
+ Override how partition URLs and bboxes are resolved:
+
+ ```javascript
+ import { MetadataProvider, GeoParquetExtractor } from 'geoparquet-extractor';
+
+ class MyMetadataProvider extends MetadataProvider {
+   async getParquetUrls(sourceUrl) {
+     const meta = await fetch(sourceUrl + '.meta.json').then(r => r.json());
+     const baseUrl = sourceUrl.replace(/[^/]+$/, '');
+     return Object.keys(meta.extents).map(f => baseUrl + f);
+   }
+
+   async getExtents(sourceUrl) {
+     const meta = await fetch(sourceUrl + '.meta.json').then(r => r.json());
+     return meta.extents; // { "file.parquet": [minx, miny, maxx, maxy] }
+   }
+
+   async getBbox(sourceUrl, duckdb) {
+     const extents = await this.getExtents(sourceUrl);
+     // Compute overall bbox from all partition extents
+     const bbox = [Infinity, Infinity, -Infinity, -Infinity];
+     for (const ext of Object.values(extents)) {
+       bbox[0] = Math.min(bbox[0], ext[0]);
+       bbox[1] = Math.min(bbox[1], ext[1]);
+       bbox[2] = Math.max(bbox[2], ext[2]);
+       bbox[3] = Math.max(bbox[3], ext[3]);
+     }
+     return bbox;
+   }
+ }
+
+ const extractor = new GeoParquetExtractor({
+   duckdb: client,
+   metadataProvider: new MyMetadataProvider(),
+ });
+ ```
+
+ ## Extent Visualization
+
+ Fetch partition and row-group bboxes as GeoJSON for map display:
+
+ ```javascript
+ import { ExtentData, DefaultMetadataProvider } from 'geoparquet-extractor';
+
+ const extentData = new ExtentData({
+   metadataProvider: new DefaultMetadataProvider(),
+   duckdb: client,
+ });
+
+ const { dataExtents, rgExtents } = await extentData.fetchExtents({
+   sourceUrl: 'https://example.com/data.mosaic.json',
+   partitioned: true,
+ });
+
+ // Convert to GeoJSON for map rendering
+ const { polygons, labelPoints } = extentData.toGeoJSON(dataExtents);
+ // polygons: FeatureCollection of bbox rectangles
+ // labelPoints: FeatureCollection of label anchor points
+ ```
+
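To make "FeatureCollection of bbox rectangles" concrete, here is a rough sketch of how bboxes can be turned into GeoJSON polygons. It does not reproduce the library's `toGeoJSON`: the helper names `bboxToFeature` and `extentsToFeatureCollection`, the `name` property, and the assumption that the input has the `{ filename: [minx, miny, maxx, maxy] }` shape are all illustrative.

```javascript
// Hypothetical sketch; the library's toGeoJSON may attach different properties.
function bboxToFeature(name, [minx, miny, maxx, maxy]) {
  return {
    type: 'Feature',
    properties: { name },
    geometry: {
      type: 'Polygon',
      // One closed, counterclockwise ring: first and last positions match.
      coordinates: [[
        [minx, miny], [maxx, miny], [maxx, maxy], [minx, maxy], [minx, miny],
      ]],
    },
  };
}

// Wrap one rectangle per extent entry into a FeatureCollection.
function extentsToFeatureCollection(extents) {
  return {
    type: 'FeatureCollection',
    features: Object.entries(extents).map(([name, bbox]) => bboxToFeature(name, bbox)),
  };
}

const rectangles = extentsToFeatureCollection({
  'file.parquet': [77.5, 12.9, 77.7, 13.1],
});
console.log(JSON.stringify(rectangles.features[0].geometry.coordinates[0]));
```

A FeatureCollection like this can be handed to most web map libraries as a GeoJSON source.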
+ ## API
+
+ ### `GeoParquetExtractor`
+
+ Main orchestrator class.
+
+ - `constructor({ duckdb, metadataProvider?, gpkgWorkerUrl?, gpkgWorker?, memoryLimitMB? })`
+ - `prepare(options)` → Returns format handler for inspection before download
+ - `download(handler, { baseName, onProgress?, onStatus? })` → Execute download
+ - `extract(options)` → Convenience: prepare + download in one call
+ - `cancel()` → Cancel in-flight download
+ - `static cleanupOrphanedFiles()` → Clean up OPFS files from dead sessions
+ - `static getDownloadBaseName(sourceName, bbox)` → Generate suggested filename
+
+ ### `ExtentData`
+
+ Data-fetching for partition/row-group bboxes.
+
+ - `constructor({ metadataProvider, duckdb? })`
+ - `fetchExtents({ sourceUrl, partitioned?, includeRowGroups?, onStatus? })` → `{ dataExtents, rgExtents }`
+ - `toGeoJSON(extents)` → `{ polygons, labelPoints }`
+
+ ### `MetadataProvider`
+
+ Abstract base class. Override to customize metadata resolution.
+
+ - `getParquetUrls(sourceUrl)` → `string[]`
+ - `getExtents(sourceUrl)` → `{ filename: [minx, miny, maxx, maxy] }`
+ - `getBbox(sourceUrl, duckdb)` → `[minx, miny, maxx, maxy]`
+ - `getRowGroupBboxes(parquetUrl, duckdb)` → `{ rg_N: bbox }`
+ - `getRowGroupBboxesMulti(urls, duckdb)` → `{ filename: { rg_N: bbox } }`
+
+ ### `createDuckDBClient(db, options?)`
+
+ Wraps an `AsyncDuckDB` instance into the library's DuckDBClient interface.
+
+ ## CORS Proxy
+
+ If your parquet files need a CORS proxy:
+
+ ```javascript
+ import { setProxyUrl } from 'geoparquet-extractor';
+
+ // Set a custom proxy URL transformer
+ setProxyUrl((url) => `/proxy?url=${encodeURIComponent(url)}`);
+ ```
+
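The `encodeURIComponent` call in the transformer matters when the target URL carries its own query string: without it, the target's parameters would be read as parameters of the proxy request. The standalone sketch below applies the same transformer shape to a sample URL and shows that a proxy endpoint can recover the original. The `/proxy` path comes from the example above; `https://app.example` and the `?v=1` suffix are placeholders for illustration.

```javascript
// Same transformer shape as the one passed to setProxyUrl.
const toProxyUrl = (url) => `/proxy?url=${encodeURIComponent(url)}`;

const original = 'https://example.com/data.parquet?v=1';
const proxied = toProxyUrl(original);
console.log(proxied);

// On the proxy side, the original URL comes back intact from the `url` parameter.
// 'https://app.example' is a placeholder origin used only to parse the relative URL.
const recovered = new URL(proxied, 'https://app.example').searchParams.get('url');
console.log(recovered === original); // true
```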
+ ## License
+
+ [Unlicense](LICENSE): public domain.