PyPI: gbfs-toolkit 1.0.0 (tar.gz) version diff

Files changed (53)

gbfs_toolkit-1.0.0/LICENSE +32 -0
gbfs_toolkit-1.0.0/PKG-INFO +329 -0
gbfs_toolkit-1.0.0/README.md +282 -0
gbfs_toolkit-1.0.0/pyproject.toml +83 -0
gbfs_toolkit-1.0.0/setup.cfg +4 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/__init__.py +223 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/accessor.py +105 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/analysis.py +274 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/audit/__init__.py +50 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/audit/dynamic.py +94 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/audit/static.py +215 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/catalog.py +189 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/cli.py +67 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/cluster.py +348 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/datasets.py +80 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/diagnostics.py +32 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/errors.py +34 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/fetch.py +510 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/fleet.py +155 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/geo.py +269 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/geofencing.py +164 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/models.py +271 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/multimodal.py +84 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/normalize.py +362 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/osm.py +111 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/py.typed +0 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/stats.py +415 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit/timeseries.py +529 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit.egg-info/PKG-INFO +329 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit.egg-info/SOURCES.txt +51 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit.egg-info/dependency_links.txt +1 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit.egg-info/entry_points.txt +2 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit.egg-info/requires.txt +30 -0
gbfs_toolkit-1.0.0/src/gbfs_toolkit.egg-info/top_level.txt +1 -0
gbfs_toolkit-1.0.0/tests/test_analysis_geo.py +145 -0
gbfs_toolkit-1.0.0/tests/test_audit_static.py +92 -0
gbfs_toolkit-1.0.0/tests/test_cli.py +33 -0
gbfs_toolkit-1.0.0/tests/test_cluster.py +137 -0
gbfs_toolkit-1.0.0/tests/test_consolidation.py +100 -0
gbfs_toolkit-1.0.0/tests/test_ergonomics.py +85 -0
gbfs_toolkit-1.0.0/tests/test_fetch.py +295 -0
gbfs_toolkit-1.0.0/tests/test_fleet.py +121 -0
gbfs_toolkit-1.0.0/tests/test_from_projects.py +81 -0
gbfs_toolkit-1.0.0/tests/test_geofencing.py +135 -0
gbfs_toolkit-1.0.0/tests/test_hardening.py +157 -0
gbfs_toolkit-1.0.0/tests/test_multimodal.py +51 -0
gbfs_toolkit-1.0.0/tests/test_normalize_catalog.py +67 -0
gbfs_toolkit-1.0.0/tests/test_osm_surroundings.py +52 -0
gbfs_toolkit-1.0.0/tests/test_quality.py +324 -0
gbfs_toolkit-1.0.0/tests/test_research_helpers.py +205 -0
gbfs_toolkit-1.0.0/tests/test_robustness.py +225 -0
gbfs_toolkit-1.0.0/tests/test_stats.py +182 -0
gbfs_toolkit-1.0.0/tests/test_timeseries.py +111 -0

gbfs_toolkit-1.0.0/LICENSE ADDED

@@ -0,0 +1,32 @@
+MIT License
+Copyright (c) 2025-2026 Rohan Fossé and Gaël Pallares
+CESI LINEACT (EA 7527), Montpellier, France
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+----------------------------------------------------------------------
+This MIT licence applies to the source code in this repository
+(`audit_pipeline/`, `app/`, `notebooks/`, `paper/`).
+The data files under `catalogue/`, the Zenodo deposit and the Hugging
+Face dataset mirror are distributed under the Open Data Commons Open
+Database License (ODbL) v1.0; see LICENSE-DATA for the full text and
+attribution requirements.

gbfs_toolkit-1.0.0/PKG-INFO ADDED

@@ -0,0 +1,329 @@
+Metadata-Version: 2.4
+Name: gbfs-toolkit
+Version: 1.0.0
+Summary: Research-grade ingestion and semantic quality audit (A1–A7) for GBFS bike-share feeds
+Author: Gaël Pallares
+Author-email: Rohan Fossé <rfosse@cesi.fr>
+License: MIT
+Project-URL: Homepage, https://github.com/cycling-data-lab/gbfs-toolkit
+Project-URL: Repository, https://github.com/cycling-data-lab/gbfs-toolkit
+Keywords: GBFS,bike-sharing,shared mobility,data quality,semantic validation,open data
+Classifier: Development Status :: 5 - Production/Stable
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Scientific/Engineering :: GIS
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: numpy>=1.26
+Requires-Dist: scipy>=1.13
+Requires-Dist: pandas>=2.2
+Provides-Extra: fetch
+Requires-Dist: requests>=2.31; extra == "fetch"
+Provides-Extra: geo
+Requires-Dist: geopandas>=0.14; extra == "geo"
+Provides-Extra: osm
+Requires-Dist: geopandas>=0.14; extra == "osm"
+Provides-Extra: parquet
+Requires-Dist: pyarrow>=15.0; extra == "parquet"
+Provides-Extra: cluster
+Requires-Dist: scikit-learn>=1.4; extra == "cluster"
+Provides-Extra: dtw
+Requires-Dist: tslearn>=0.6; extra == "dtw"
+Provides-Extra: dev
+Requires-Dist: pytest>=8.0; extra == "dev"
+Requires-Dist: pytest-cov>=4.0; extra == "dev"
+Requires-Dist: ruff>=0.5; extra == "dev"
+Requires-Dist: requests>=2.31; extra == "dev"
+Requires-Dist: pyarrow>=15.0; extra == "dev"
+Requires-Dist: scikit-learn>=1.4; extra == "dev"
+Requires-Dist: geopandas>=0.14; extra == "dev"
+Dynamic: license-file
+# gbfs-toolkit
+[![CI](https://github.com/cycling-data-lab/gbfs-toolkit/actions/workflows/ci.yml/badge.svg)](https://github.com/cycling-data-lab/gbfs-toolkit/actions/workflows/ci.yml)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](./LICENSE)
+[![Python 3.10+](https://img.shields.io/badge/Python-3.10+-blue.svg)](https://www.python.org/)
+**Research-grade ingestion and *semantic* quality audit for GBFS bike-share feeds.**
+MobilityData's [`gbfs-validator`](https://github.com/MobilityData/gbfs-validator) checks
+that a feed is *syntactically* valid. `gbfs-toolkit` checks whether it is *semantically*
+trustworthy and analysis-ready — the **A1–A7 quality taxonomy** of Fossé & Pallares
+([`gbfs-audit-catalogue`](https://github.com/cycling-data-lab/gbfs-audit-catalogue)) — and
+normalises feeds into a **stable, version-independent data model** you can reuse across
+studies.
+## Why
+Every bike-share study re-implements the same plumbing — discover feeds, normalise GBFS
+1.x/2.x/3.x, and (the hard part) cope with the semantic defects the syntactic validator
+cannot see: placeholder capacities, phantom docks, transposed coordinates, out-of-perimeter
+stations. This package consolidates that into one tested interface so the audit is a verdict
+per station, not a re-run of someone's notebook.
+## Install
+```bash
+pip install gbfs-toolkit            # from PyPI
+pip install -e ".[dev]"            # from a local clone
+```
+Core depends only on numpy / scipy / pandas. Network discovery/fetch uses the optional
+`[fetch]` extra (`requests`).
+## Quick start
+```python
+import gbfs_toolkit as gb
+info, status = gb.load_example()          # bundled sample — no network needed
+av = info.gbfs.join_status(status)        # fluent .gbfs accessor (or gb.join_availability)
+clean = info.gbfs.drop_flagged()          # audit A1–A7 and keep the trustworthy stations
+av.gbfs.occupancy()                       # bikes / (bikes + docks), NaN-safe
+```
+From your own feed:
+```python
+import json
+import gbfs_toolkit as gb
+with open("station_information.json", encoding="utf-8") as fh:   # context manager closes the file
+    raw = json.load(fh)
+stations = gb.to_canonical_station_info(raw, system_id="velib")   # version-independent frame
+verdict  = gb.audit_static(stations)                              # A1–A7 per station
+clean    = stations[~verdict["flagged"].to_numpy()]              # quality filter in one line
+```
+Every function is also a `.gbfs` accessor method, and pure (so `df.pipe(gb.occupancy)` works).
+`gb.show_versions()` prints an environment report for bug reports.
+Command line (the semantic counterpart to `gbfs-validator`):
+```bash
+gbfs audit station_information.json --system-id velib --out verdict.csv
+```
+## The A1–A7 semantic taxonomy
+| Flag | Rule | Signature | Level |
+|---|---|---|---|
+| A1 | Out-of-domain inclusion | car-sharing advertised as bike-sharing | station |
+| A2 | Placeholder capacity | constant non-zero capacity across a whole system | system |
+| A3 | Structural over-capacity | free-floating fleet anchors | station |
+| A4 | Geospatial error | transposed coords / stations far from neighbours (3σ) | station |
+| A5 | Out-of-perimeter | system bounding box > 50,000 km² | system |
+| A6 | Zero-capacity dock | ≥1% of docked stations declare capacity = 0 | system |
+| A7 | Null capacity field | ≥50% of stations declare capacity = NaN | system |
+Thresholds match the published catalogue, so verdicts reproduce.
+## Canonical data model (the stable contract)
+Ingestion is normalised **once** into version-independent frames; audit and analysis then
+operate purely on these. Downstream code depends on these column names, never on raw GBFS
+JSON.
+- **StationInfo**: `system_id, station_id, name, lat, lon, capacity, station_type, is_virtual_station`
+- **StationStatus**: `system_id, station_id, num_bikes_available, num_docks_available, is_renting, is_returning, last_reported, fetched_at, gbfs_version`
+- **VehicleStatus**: `system_id, vehicle_id, vehicle_type_id, lat, lon, is_reserved, is_disabled, fetched_at, gbfs_version`
+- **AuditVerdict**: `system_id, station_id, A1…A7, flagged, reason`
+`last_reported` and `fetched_at` are tz-aware **UTC** timestamps (`datetime64[ns, UTC]`) so
+feeds from different cities merge unambiguously.
+## Daily ergonomics
+```python
+import gbfs_toolkit as gb
+# discover by city (you rarely know the system_id)
+cat   = gb.systems_catalog()
+paris = gb.filter_catalog(cat, country_code="FR", city="Paris")
+feed  = gb.GBFSFeed.from_url(url)
+feed.summary()                       # one-glance card: stations, bikes, staleness, version
+avail = feed.availability()          # bikes/docks + name/coords/capacity, one frame
+avail["state"] = gb.station_state(avail)          # empty / full / disabled / normal
+problems = gb.audit_dynamic(avail)                # negative counts, over-capacity, stale
+near  = gb.find_nearest_stations(48.85, 2.35, feed.station_information(), k=3)
+# many systems at once (threaded), broken feeds isolated as Exceptions
+feeds = gb.fetch_multiple(["velib", "bixi", "lyon"], max_workers=5)
+```
+## Longitudinal data lake
+Turn a stream of snapshots into an analysis-ready panel. The library owns the
+formatting / dedup / I/O; your orchestrator (cron, Airflow…) owns the polling loop.
+Requires the optional `[parquet]` extra (`pyarrow`).
+```python
+import gbfs_toolkit as gb
+# in your poller (every N minutes):
+gb.append_to_parquet(feed.station_status(), "lake/")   # Hive-partitioned by system_id/date
+# in your analysis:
+panel = gb.build_availability_panel("lake/", system_id="velib",
+                                    start_time="2026-06-01", resample_freq="5min")
+flow  = gb.calculate_net_flow(panel)   # Δ bikes/station per poll (observed flow only)
+```
+`build_availability_panel` filters partitions *before* loading (memory-bounded),
+de-duplicates redundant polls (same `station_id` + `last_reported`), and optionally
+resamples each station to a fixed cadence.
+## Station clustering (`[cluster]`)
+Three lenses on "which stations belong together" — spatial, topological, behavioural:
+```python
+gb.cluster_spatial(info, method="hdbscan")          # density zones (projected metres)
+gb.cluster_spectral(info, k=6)                       # network/topology groups
+gb.cluster_diurnal_profiles(panel, n_clusters=4)    # daily-rhythm typologies ⭐
+```
+`cluster_diurnal_profiles` turns the longitudinal panel into station **typologies** —
+e.g. "morning commuter origin" (full at night, empty by day) vs "recreational" — from each
+station's 24-hour occupancy profile (robust to irregular sampling). Modern options:
+auto-`k` by silhouette, shape clustering (`normalize="zscore"`), soft GMM, DTW
+(`method="dtw"`, extra `[dtw]`), weekday/weekend split. And `label_diurnal_typology`
+turns clusters into **named** types. The payoff of the data lake.
+## Multimodal — bikeshare ↔ transit
+```python
+import pandas as pd
+stops = pd.read_csv("gtfs/stops.txt")               # bring your own GTFS stops
+linked = gb.link_transit_stops(info, stops, radius_m=200)
+feeders = linked[linked["is_transit_feeder"]]       # first/last-mile docks near rail/bus
+```
+Pure spatial proximity on `GeoKDTree` (no transit API, no schedules) — `is_transit_feeder`,
+`nearest_stop_dist_m`, `n_transit_within`.
+## Station surroundings — what's around each dock (`[osm]`)
+```python
+# generic "what's nearby" — works for any point dataset (POIs, shops, …)
+gb.features_within(info, pois, radius_m=300, category_col="amenity")  # n_within, n_cafe, …
+# bring your own OSM frame (fetch it yourself, e.g. osmnx.features_from_point)
+# one-shot context: transit feeders + OSM features, in one frame
+ctx = gb.station_surroundings(info, transit=stops, osm=osm_gdf, radius_m=300)
+```
+The radius summarisation (counts + per-category breakdown + nearest distance) is the durable,
+tested core; data acquisition is **Bring Your Own GeoDataFrame** so the library never depends
+on a live Overpass endpoint. Routing / isochrones stay out of scope (use OSMnx / pandana).
+## Descriptive stats — the bikeshare `describe()`
+```python
+gb.system_profile(av)                       # stations, capacity, occupancy, % empty/full/…
+gb.compare_systems({"velib": av1, "bixi": av2})   # one comparison row per city
+gb.concentration_metrics(info)              # capacity Gini + top-decile hub share (equity)
+gb.coverage_stats(info, zones=zones)        # density, nearest-neighbour, Clark–Evans dispersion
+gb.availability_stats(panel)                # per-station: occupancy, peak hour, volatility
+```
+Standard spatial / inequality algorithms (numpy/scipy only, deterministic):
+```python
+gb.morans_i(info, "occupancy")              # spatial autocorrelation (+ z-score / p-value)
+gb.ripley_k(info, radii=[100, 250, 500])    # multi-scale clustering: L>0 clustered, <0 dispersed
+gb.lorenz_curve(info)                       # inequality curve to plot (Gini/Theil in concentration_metrics)
+```
+Readable, comparable summaries — strictly descriptive (no OD/trip inference). `system_profile`
+is a one-glance numeric card of a snapshot; `concentration_metrics` is an equity lens (kept
+*outside* the published A1–A7 audit, since it's a metric not a quality verdict);
+`availability_stats` turns a longitudinal panel into per-station scalars (pass a `target_tz`
+panel for local-time peaks).
+## Fleet reconciliation — where are the bikes, really?
+```python
+tally = gb.reconcile_fleet_state(status, vehicles)   # or feed.reconcile_fleet()
+tally["total_deployed"]        # on the street: stations + free-floating, overlap excluded
+tally["total_rentable"]        # available in stations + available free-floating
+tally["double_count_avoided"]  # vehicles a naive sum would have counted twice
+```
+GBFS reports the same fleet twice — aggregate docked counts in `station_status` and
+individual units (some parked at stations) in `vehicle_status`. Naively adding them
+double-counts every vehicle sitting at a dock. The reconciler excludes station-parked
+vehicles from the deployed total and surfaces the overlap instead of hiding it.
+## Geofencing / service areas (`[geo]`)
+```python
+zones = gb.to_canonical_geofencing(raw, system_id="lime")  # GeoDataFrame of operator polygons
+tagged = gb.zones_for_points(info, zones)                   # which zone each station sits in
+density = len(info) / gb.zone_area_km2(zones).sum()         # stations per km² of *real* service area
+no_park = tagged[tagged["station_parking"] == False]        # stations in park-restricted zones
+```
+For free-floating / hybrid systems the real footprint is the operator's polygons, not a
+convex hull of stations. `to_canonical_geofencing` parses `geofencing_zones.json` (v2.x
+`ride_allowed` and v3.x `ride_start/ride_end_allowed` reconciled), `zones_for_points` is the
+point-in-zone spatial join, and `zone_area_km2` reprojects to an equal-area CRS so density is
+metric and latitude-comparable. The full per-vehicle-type `rules` list is preserved.
+## Polite scraping & provenance (research-grade)
+```python
+session = gb.build_session()                 # pooled, retry/backoff on 429/5xx (default in fetch_multiple)
+resp = gb.fetch_feed_json(url, etag=prev_etag)   # conditional GET; raises GBFSNotModified on HTTP 304
+...
+gb.coverage_report(panel, expected_freq="5min")  # per-station uptime / longest gap (no imputation)
+gb.generate_manifest("lake/")                # SHA-256 per partition + summary → cite on Zenodo
+```
+Built for scrapers that run for months: retries/backoff, conditional GETs (skip unchanged
+snapshots), an offline catalogue cache, a `GBFSError` exception hierarchy, and provenance tools
+so a dataset is **citable and verifiable**. Missing data stays missing — `coverage_report`
+quantifies it rather than imputing.
+## Examples
+Runnable, end-to-end scripts live in [`examples/`](./examples) — auditing an unknown feed,
+cron-driven collection into a Parquet lake, longitudinal analysis (coverage, typologies,
+turnover), and a network equity/coverage report.
+## Roadmap
+- **v0.1** — canonical model, catalogue discovery, cross-version normalisation,
+  static audit (A1–A7), CLI.
+- **v0.2** — fetch/scrape (`GBFSFeed`, one-liners, `fetch_multiple`), dynamic audit
+  (D1–D3), `station_state`, geo (`GeoKDTree`, `find_nearest_stations`), schema hardening.
+- **v0.3 (this)** — longitudinal data lake: `append_to_parquet`,
+  `build_availability_panel`, `calculate_net_flow`.
+- **v0.4** — `cluster` (spatial / spectral / **diurnal profiles** + named typologies).
+- **v0.5** — `multimodal` (bikeshare ↔ transit feeders, BYOG GTFS).
+- **v0.6** — `osm` / surroundings: `features_within`, `station_surroundings`,
+  `enrich_with_osm` (BYOG infrastructure enrichment within a radius).
+- **v0.7** — hardening (nullable dtypes, dockless-aware A7, antimeridian A5,
+  mass-conservation net flow) + `geofencing` (service-area polygons, point-in-zone
+  joins, equal-area density), `fleet` reconciliation (docked ↔ free-floating dedup),
+  and parquet column/predicate pushdown for large panels.
+## Methodology & limitations
+[`METHODOLOGY.md`](./METHODOLOGY.md) documents the A1–A7 thresholds, the dynamic checks, the
+polling/aliasing limit on flows, and what the spatial statistics can and cannot claim — read it
+before building a study on the toolkit.
+## How to cite
+See [`CITATION.cff`](./CITATION.cff). The semantic taxonomy is from the
+`gbfs-audit-catalogue` dataset paper (Fossé & Pallares, 2026).
+## License
+[MIT](./LICENSE). Affiliated with [CESI LINEACT (EA 7527)](https://lineact.cesi.fr), Montpellier, France.

gbfs_toolkit-1.0.0/README.md ADDED

@@ -0,0 +1,282 @@
+# gbfs-toolkit
+[![CI](https://github.com/cycling-data-lab/gbfs-toolkit/actions/workflows/ci.yml/badge.svg)](https://github.com/cycling-data-lab/gbfs-toolkit/actions/workflows/ci.yml)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](./LICENSE)
+[![Python 3.10+](https://img.shields.io/badge/Python-3.10+-blue.svg)](https://www.python.org/)
+**Research-grade ingestion and *semantic* quality audit for GBFS bike-share feeds.**
+MobilityData's [`gbfs-validator`](https://github.com/MobilityData/gbfs-validator) checks
+that a feed is *syntactically* valid. `gbfs-toolkit` checks whether it is *semantically*
+trustworthy and analysis-ready — the **A1–A7 quality taxonomy** of Fossé & Pallares
+([`gbfs-audit-catalogue`](https://github.com/cycling-data-lab/gbfs-audit-catalogue)) — and
+normalises feeds into a **stable, version-independent data model** you can reuse across
+studies.
+## Why
+Every bike-share study re-implements the same plumbing — discover feeds, normalise GBFS
+1.x/2.x/3.x, and (the hard part) cope with the semantic defects the syntactic validator
+cannot see: placeholder capacities, phantom docks, transposed coordinates, out-of-perimeter
+stations. This package consolidates that into one tested interface so the audit is a verdict
+per station, not a re-run of someone's notebook.
+## Install
+```bash
+pip install gbfs-toolkit            # from PyPI
+pip install -e ".[dev]"            # from a local clone
+```
+Core depends only on numpy / scipy / pandas. Network discovery/fetch uses the optional
+`[fetch]` extra (`requests`).
+## Quick start
+```python
+import gbfs_toolkit as gb
+info, status = gb.load_example()          # bundled sample — no network needed
+av = info.gbfs.join_status(status)        # fluent .gbfs accessor (or gb.join_availability)
+clean = info.gbfs.drop_flagged()          # audit A1–A7 and keep the trustworthy stations
+av.gbfs.occupancy()                       # bikes / (bikes + docks), NaN-safe
+```
+From your own feed:
+```python
+import json
+import gbfs_toolkit as gb
+with open("station_information.json", encoding="utf-8") as fh:   # context manager closes the file
+    raw = json.load(fh)
+stations = gb.to_canonical_station_info(raw, system_id="velib")   # version-independent frame
+verdict  = gb.audit_static(stations)                              # A1–A7 per station
+clean    = stations[~verdict["flagged"].to_numpy()]              # quality filter in one line
+```
+Every function is also a `.gbfs` accessor method, and pure (so `df.pipe(gb.occupancy)` works).
+`gb.show_versions()` prints an environment report for bug reports.
+Command line (the semantic counterpart to `gbfs-validator`):
+```bash
+gbfs audit station_information.json --system-id velib --out verdict.csv
+```
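The occupancy ratio in the quick start, bikes / (bikes + docks), can be sketched without the toolkit. This is a minimal stdlib illustration, not the package's code: the name `occupancy_ratio` is hypothetical, and treating missing inputs or a zero total as NaN is an assumption about what "NaN-safe" means here.

```python
import math


def occupancy_ratio(bikes, docks):
    """Return bikes / (bikes + docks), or NaN when the ratio is undefined.

    Hypothetical helper: None/NaN inputs and a zero total count as undefined.
    """
    if bikes is None or docks is None:
        return math.nan
    if math.isnan(bikes) or math.isnan(docks):
        return math.nan
    total = bikes + docks
    if total <= 0:
        return math.nan
    return bikes / total


print(occupancy_ratio(3, 9))     # 0.25
print(occupancy_ratio(0, 0))     # nan
print(occupancy_ratio(None, 5))  # nan
```

Returning NaN rather than 0 for a station that reports no bikes and no docks keeps "unknown" distinct from "empty" in later averages.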
+## The A1–A7 semantic taxonomy
+| Flag | Rule | Signature | Level |
+|---|---|---|---|
+| A1 | Out-of-domain inclusion | car-sharing advertised as bike-sharing | station |
+| A2 | Placeholder capacity | constant non-zero capacity across a whole system | system |
+| A3 | Structural over-capacity | free-floating fleet anchors | station |
+| A4 | Geospatial error | transposed coords / stations far from neighbours (3σ) | station |
+| A5 | Out-of-perimeter | system bounding box > 50,000 km² | system |
+| A6 | Zero-capacity dock | ≥1% of docked stations declare capacity = 0 | system |
+| A7 | Null capacity field | ≥50% of stations declare capacity = NaN | system |
+Thresholds match the published catalogue, so verdicts reproduce.
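The system-level rows of the table (A2, A5, A6, A7) reduce to a few ratios and one area. The sketch below shows how those thresholds might be read. It is not the toolkit's `audit_static`: the function names are hypothetical, every station is treated as docked, the share denominators are the full station count, A2 is only raised for systems with more than one station, and the bounding box ignores the antimeridian. All of these are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def _is_null(cap):
    return cap is None or (isinstance(cap, float) and math.isnan(cap))


def system_capacity_flags(capacities):
    """Evaluate A2, A6 and A7 on one system's list of station capacities."""
    n = len(capacities)
    if n == 0:
        return {"A2": False, "A6": False, "A7": False}
    known = [c for c in capacities if not _is_null(c)]
    # A2: every station reports the same non-zero capacity (assumed: n > 1).
    a2 = len(known) == n and n > 1 and len(set(known)) == 1 and known[0] != 0
    # A6: at least 1% of stations declare capacity == 0.
    a6 = sum(1 for c in known if c == 0) / n >= 0.01
    # A7: at least 50% of stations have a null capacity.
    a7 = (n - len(known)) / n >= 0.50
    return {"A2": a2, "A6": a6, "A7": a7}


def bbox_area_km2(lats, lons):
    """Area of the lat/lon bounding box on a sphere (antimeridian not handled)."""
    phi1, phi2 = math.radians(min(lats)), math.radians(max(lats))
    dlam = math.radians(max(lons) - min(lons))
    return EARTH_RADIUS_KM ** 2 * (math.sin(phi2) - math.sin(phi1)) * dlam


def out_of_perimeter(lats, lons, limit_km2=50_000):
    """A5: the system's bounding box is larger than the limit."""
    return bbox_area_km2(lats, lons) > limit_km2


print(system_capacity_flags([20, 20, 20]))            # A2 only
print(out_of_perimeter([48.8, 48.9], [2.3, 2.4]))     # False
```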
+## Canonical data model (the stable contract)
+Ingestion is normalised **once** into version-independent frames; audit and analysis then
+operate purely on these. Downstream code depends on these column names, never on raw GBFS
+JSON.
+- **StationInfo**: `system_id, station_id, name, lat, lon, capacity, station_type, is_virtual_station`
+- **StationStatus**: `system_id, station_id, num_bikes_available, num_docks_available, is_renting, is_returning, last_reported, fetched_at, gbfs_version`
+- **VehicleStatus**: `system_id, vehicle_id, vehicle_type_id, lat, lon, is_reserved, is_disabled, fetched_at, gbfs_version`
+- **AuditVerdict**: `system_id, station_id, A1…A7, flagged, reason`
+`last_reported` and `fetched_at` are tz-aware **UTC** timestamps (`datetime64[ns, UTC]`) so
+feeds from different cities merge unambiguously.
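As far as the GBFS specification goes, 1.x and 2.x feeds publish timestamps as POSIX seconds while 3.x uses RFC 3339 strings, so a canonical UTC column needs to accept both. The sketch below shows one way to do that with the standard library. `to_utc` is a hypothetical name, and treating a string without an offset as UTC is an assumption.

```python
from datetime import datetime, timezone


def to_utc(value):
    """Convert a POSIX epoch (int/float) or an RFC 3339 string to tz-aware UTC."""
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    text = value.strip()
    if text.endswith(("Z", "z")):
        text = text[:-1] + "+00:00"   # fromisoformat() accepts "Z" only from Python 3.11
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive strings are UTC
    return parsed.astimezone(timezone.utc)


# The same instant, written with a local offset and in UTC:
print(to_utc("2026-06-01T14:00:00+02:00") == to_utc("2026-06-01T12:00:00Z"))  # True
```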
+## Daily ergonomics
+```python
+import gbfs_toolkit as gb
+# discover by city (you rarely know the system_id)
+cat   = gb.systems_catalog()
+paris = gb.filter_catalog(cat, country_code="FR", city="Paris")
+feed  = gb.GBFSFeed.from_url(url)
+feed.summary()                       # one-glance card: stations, bikes, staleness, version
+avail = feed.availability()          # bikes/docks + name/coords/capacity, one frame
+avail["state"] = gb.station_state(avail)          # empty / full / disabled / normal
+problems = gb.audit_dynamic(avail)                # negative counts, over-capacity, stale
+near  = gb.find_nearest_stations(48.85, 2.35, feed.station_information(), k=3)
+# many systems at once (threaded), broken feeds isolated as Exceptions
+feeds = gb.fetch_multiple(["velib", "bixi", "lyon"], max_workers=5)
+```
+## Longitudinal data lake
+Turn a stream of snapshots into an analysis-ready panel. The library owns the
+formatting / dedup / I/O; your orchestrator (cron, Airflow…) owns the polling loop.
+Requires the optional `[parquet]` extra (`pyarrow`).
+```python
+import gbfs_toolkit as gb
+# in your poller (every N minutes):
+gb.append_to_parquet(feed.station_status(), "lake/")   # Hive-partitioned by system_id/date
+# in your analysis:
+panel = gb.build_availability_panel("lake/", system_id="velib",
+                                    start_time="2026-06-01", resample_freq="5min")
+flow  = gb.calculate_net_flow(panel)   # Δ bikes/station per poll (observed flow only)
+```
+`build_availability_panel` filters partitions *before* loading (memory-bounded),
+de-duplicates redundant polls (same `station_id` + `last_reported`), and optionally
+resamples each station to a fixed cadence.
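The de-duplication key (`station_id` + `last_reported`) and the per-poll difference behind a net-flow column can be sketched on plain dictionaries. This is an illustration only: the toolkit works on DataFrames and Parquet partitions, and `dedupe_polls` and `net_flow` are hypothetical names.

```python
def dedupe_polls(rows):
    """Keep the first row for each (station_id, last_reported) pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["station_id"], row["last_reported"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def net_flow(rows):
    """Per-station change in num_bikes_available between consecutive polls."""
    last_count = {}
    flows = []
    for row in sorted(rows, key=lambda r: (r["station_id"], r["last_reported"])):
        sid = row["station_id"]
        if sid in last_count:
            delta = row["num_bikes_available"] - last_count[sid]
            flows.append((sid, row["last_reported"], delta))
        last_count[sid] = row["num_bikes_available"]
    return flows


polls = [
    {"station_id": "a", "last_reported": 100, "num_bikes_available": 5},
    {"station_id": "a", "last_reported": 100, "num_bikes_available": 5},  # redundant poll
    {"station_id": "a", "last_reported": 400, "num_bikes_available": 3},
    {"station_id": "b", "last_reported": 100, "num_bikes_available": 0},
    {"station_id": "b", "last_reported": 400, "num_bikes_available": 2},
]
print(net_flow(dedupe_polls(polls)))  # [('a', 400, -2), ('b', 400, 2)]
```

A rental and a return at the same station between two polls cancel out in this difference, which is why the result is observed flow rather than trip counts.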
+## Station clustering (`[cluster]`)
+Three lenses on "which stations belong together" — spatial, topological, behavioural:
+```python
+gb.cluster_spatial(info, method="hdbscan")          # density zones (projected metres)
+gb.cluster_spectral(info, k=6)                       # network/topology groups
+gb.cluster_diurnal_profiles(panel, n_clusters=4)    # daily-rhythm typologies ⭐
+```
+`cluster_diurnal_profiles` turns the longitudinal panel into station **typologies** —
+e.g. "morning commuter origin" (full at night, empty by day) vs "recreational" — from each
+station's 24-hour occupancy profile (robust to irregular sampling). Modern options:
+auto-`k` by silhouette, shape clustering (`normalize="zscore"`), soft GMM, DTW
+(`method="dtw"`, extra `[dtw]`), weekday/weekend split. And `label_diurnal_typology`
+turns clusters into **named** types. The payoff of the data lake.
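A 24-hour occupancy profile and the z-score "shape" normalisation can be sketched as follows. This covers only the feature-building step, not the clustering, and it is not the toolkit's code: `hourly_profile` and `zscore` are hypothetical names, hours are bucketed by the timestamp's own clock, and hours without samples are left as NaN by assumption.

```python
import math
from collections import defaultdict
from datetime import datetime, timezone
from statistics import fmean, pstdev


def hourly_profile(samples):
    """Mean occupancy per hour of day (0-23) from (datetime, occupancy) samples."""
    buckets = defaultdict(list)
    for ts, occ in samples:
        buckets[ts.hour].append(occ)
    return [fmean(buckets[h]) if buckets[h] else math.nan for h in range(24)]


def zscore(profile):
    """Standardise a profile so only its shape remains; NaN hours stay NaN."""
    observed = [v for v in profile if not math.isnan(v)]
    if not observed:
        return list(profile)
    mean, std = fmean(observed), pstdev(observed)
    if std == 0:
        return [math.nan if math.isnan(v) else 0.0 for v in profile]
    return [math.nan if math.isnan(v) else (v - mean) / std for v in profile]


samples = [
    (datetime(2026, 6, 1, 8, 5, tzinfo=timezone.utc), 0.9),
    (datetime(2026, 6, 2, 8, 40, tzinfo=timezone.utc), 0.7),
    (datetime(2026, 6, 1, 18, 0, tzinfo=timezone.utc), 0.2),
]
profile = hourly_profile(samples)   # irregular samples collapse into hourly means
shape = zscore(profile)
```

Averaging within each hour is what makes the profile tolerant of irregular sampling: two polls at 08:05 and 08:40 on different days land in the same bucket.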
+## Multimodal — bikeshare ↔ transit
+```python
+import pandas as pd
+stops = pd.read_csv("gtfs/stops.txt")               # bring your own GTFS stops
+linked = gb.link_transit_stops(info, stops, radius_m=200)
+feeders = linked[linked["is_transit_feeder"]]       # first/last-mile docks near rail/bus
+```
+Pure spatial proximity on `GeoKDTree` (no transit API, no schedules) — `is_transit_feeder`,
+`nearest_stop_dist_m`, `n_transit_within`.
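The three output columns can be illustrated with a brute-force great-circle search. This sketch swaps the `GeoKDTree` index named above for a plain loop, so it only suits small inputs. `haversine_m` and `link_stops` are hypothetical names, and the feeder rule (at least one stop within the radius) is an assumption.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def link_stops(station, stops, radius_m=200):
    """Summarise transit stops around one station given as (lat, lon)."""
    dists = [haversine_m(station[0], station[1], lat, lon) for lat, lon in stops]
    within = sum(1 for d in dists if d <= radius_m)
    return {
        "nearest_stop_dist_m": min(dists) if dists else math.nan,
        "n_transit_within": within,
        "is_transit_feeder": within > 0,
    }


# One stop roughly 111 m north of the station, another roughly 1.1 km away.
result = link_stops((48.85, 2.35), [(48.851, 2.35), (48.86, 2.35)])
print(result["n_transit_within"], result["is_transit_feeder"])  # 1 True
```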
+## Station surroundings — what's around each dock (`[osm]`)
+```python
+# generic "what's nearby" — works for any point dataset (POIs, shops, …)
+gb.features_within(info, pois, radius_m=300, category_col="amenity")  # n_within, n_cafe, …
+# bring your own OSM frame (fetch it yourself, e.g. osmnx.features_from_point)
+# one-shot context: transit feeders + OSM features, in one frame
+ctx = gb.station_surroundings(info, transit=stops, osm=osm_gdf, radius_m=300)
+```
+The radius summarisation (counts + per-category breakdown + nearest distance) is the durable,
+tested core; data acquisition is **Bring Your Own GeoDataFrame** so the library never depends
+on a live Overpass endpoint. Routing / isochrones stay out of scope (use OSMnx / pandana).
+## Descriptive stats — the bikeshare `describe()`
+```python
+gb.system_profile(av)                       # stations, capacity, occupancy, % empty/full/…
+gb.compare_systems({"velib": av1, "bixi": av2})   # one comparison row per city
+gb.concentration_metrics(info)              # capacity Gini + top-decile hub share (equity)
+gb.coverage_stats(info, zones=zones)        # density, nearest-neighbour, Clark–Evans dispersion
+gb.availability_stats(panel)                # per-station: occupancy, peak hour, volatility
+```
+Standard spatial / inequality algorithms (numpy/scipy only, deterministic):
+```python
+gb.morans_i(info, "occupancy")              # spatial autocorrelation (+ z-score / p-value)
+gb.ripley_k(info, radii=[100, 250, 500])    # multi-scale clustering: L>0 clustered, <0 dispersed
+gb.lorenz_curve(info)                       # inequality curve to plot (Gini/Theil in concentration_metrics)
+```
+Readable, comparable summaries — strictly descriptive (no OD/trip inference). `system_profile`
+is a one-glance numeric card of a snapshot; `concentration_metrics` is an equity lens (kept
+*outside* the published A1–A7 audit, since it's a metric not a quality verdict);
+`availability_stats` turns a longitudinal panel into per-station scalars (pass a `target_tz`
+panel for local-time peaks).
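The capacity Gini and the Lorenz curve are standard inequality measures and can be computed directly. The sketch below uses the textbook rank-weighted formula on a list of station capacities. `gini` and `lorenz_points` are hypothetical names, and the toolkit's own estimator (for example its handling of nulls) may differ.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = equal, towards 1 = concentrated)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def lorenz_points(values):
    """Cumulative (station share, capacity share) points, starting at (0, 0)."""
    xs = sorted(values)
    total = sum(xs)
    points, running = [(0.0, 0.0)], 0.0
    for i, x in enumerate(xs, start=1):
        running += x
        points.append((i / len(xs), running / total if total else 0.0))
    return points


print(gini([10, 10, 10, 10]))   # 0.0  (capacity spread evenly)
print(gini([0, 0, 0, 40]))      # 0.75 (one hub holds everything)
print(lorenz_points([1, 3]))    # [(0.0, 0.0), (0.5, 0.25), (1.0, 1.0)]
```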
+## Fleet reconciliation — where are the bikes, really?
+```python
+tally = gb.reconcile_fleet_state(status, vehicles)   # or feed.reconcile_fleet()
+tally["total_deployed"]        # on the street: stations + free-floating, overlap excluded
+tally["total_rentable"]        # available in stations + available free-floating
+tally["double_count_avoided"]  # vehicles a naive sum would have counted twice
+```
+GBFS reports the same fleet twice — aggregate docked counts in `station_status` and
+individual units (some parked at stations) in `vehicle_status`. Naively adding them
+double-counts every vehicle sitting at a dock. The reconciler excludes station-parked
+vehicles from the deployed total and surfaces the overlap instead of hiding it.
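The double-counting problem can be shown in a few lines. This is a simplified sketch and not `reconcile_fleet_state` itself: it assumes that a vehicle record carrying a non-empty `station_id` is already included in that station's `num_bikes_available`, and the function name `reconcile` and the `naive_total` key are hypothetical.

```python
def reconcile(station_rows, vehicle_rows):
    """Combine docked counts and per-vehicle records without double counting.

    Assumption: a vehicle with a non-empty "station_id" is already part of
    that station's num_bikes_available.
    """
    docked = sum(row["num_bikes_available"] for row in station_rows)
    at_station = [v for v in vehicle_rows if v.get("station_id")]
    free_floating = [v for v in vehicle_rows if not v.get("station_id")]
    return {
        "naive_total": docked + len(vehicle_rows),        # counts docked units twice
        "total_deployed": docked + len(free_floating),    # overlap excluded
        "double_count_avoided": len(at_station),
    }


stations = [{"num_bikes_available": 4}, {"num_bikes_available": 2}]
vehicles = [
    {"vehicle_id": "v1", "station_id": "s1"},   # parked at a dock
    {"vehicle_id": "v2"},                        # free-floating
    {"vehicle_id": "v3", "station_id": None},    # free-floating
]
print(reconcile(stations, vehicles))
# {'naive_total': 9, 'total_deployed': 8, 'double_count_avoided': 1}
```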
+## Geofencing / service areas (`[geo]`)
+```python
+zones = gb.to_canonical_geofencing(raw, system_id="lime")  # GeoDataFrame of operator polygons
+tagged = gb.zones_for_points(info, zones)                   # which zone each station sits in
+density = len(info) / gb.zone_area_km2(zones).sum()         # stations per km² of *real* service area
+no_park = tagged[tagged["station_parking"] == False]        # stations in park-restricted zones
+```
+For free-floating / hybrid systems the real footprint is the operator's polygons, not a
+convex hull of stations. `to_canonical_geofencing` parses `geofencing_zones.json` (v2.x
+`ride_allowed` and v3.x `ride_start/ride_end_allowed` reconciled), `zones_for_points` is the
+point-in-zone spatial join, and `zone_area_km2` reprojects to an equal-area CRS so density is
+metric and latitude-comparable. The full per-vehicle-type `rules` list is preserved.
+## Polite scraping & provenance (research-grade)
+```python
+session = gb.build_session()                 # pooled, retry/backoff on 429/5xx (default in fetch_multiple)
+resp = gb.fetch_feed_json(url, etag=prev_etag)   # conditional GET; raises GBFSNotModified on HTTP 304
+...
+gb.coverage_report(panel, expected_freq="5min")  # per-station uptime / longest gap (no imputation)
+gb.generate_manifest("lake/")                # SHA-256 per partition + summary → cite on Zenodo
+```
+Built for scrapers that run for months: retries/backoff, conditional GETs (skip unchanged
+snapshots), an offline catalogue cache, a `GBFSError` exception hierarchy, and provenance tools
+so a dataset is **citable and verifiable**. Missing data stays missing — `coverage_report`
+quantifies it rather than imputing.
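A per-file SHA-256 manifest of the kind described above can be built with the standard library. The sketch below is an assumption about the general technique, not the output format of `generate_manifest`: `sha256_file` and `build_manifest` are hypothetical names, and the manifest layout (relative path mapped to size and digest) is invented for illustration.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large partitions need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under root (relative POSIX path) to its size and SHA-256."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): {
            "bytes": p.stat().st_size,
            "sha256": sha256_file(p),
        }
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

Publishing such a manifest next to a dataset lets a reader recompute the digests and confirm that their copy matches the one that was cited.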
+## Examples
+Runnable, end-to-end scripts live in [`examples/`](./examples) — auditing an unknown feed,
+cron-driven collection into a Parquet lake, longitudinal analysis (coverage, typologies,
+turnover), and a network equity/coverage report.
+## Roadmap
+- **v0.1** — canonical model, catalogue discovery, cross-version normalisation,
+  static audit (A1–A7), CLI.
+- **v0.2** — fetch/scrape (`GBFSFeed`, one-liners, `fetch_multiple`), dynamic audit
+  (D1–D3), `station_state`, geo (`GeoKDTree`, `find_nearest_stations`), schema hardening.
+- **v0.3 (this)** — longitudinal data lake: `append_to_parquet`,
+  `build_availability_panel`, `calculate_net_flow`.
+- **v0.4** — `cluster` (spatial / spectral / **diurnal profiles** + named typologies).
+- **v0.5** — `multimodal` (bikeshare ↔ transit feeders, BYOG GTFS).
+- **v0.6** — `osm` / surroundings: `features_within`, `station_surroundings`,
+  `enrich_with_osm` (BYOG infrastructure enrichment within a radius).
+- **v0.7** — hardening (nullable dtypes, dockless-aware A7, antimeridian A5,
+  mass-conservation net flow) + `geofencing` (service-area polygons, point-in-zone
+  joins, equal-area density), `fleet` reconciliation (docked ↔ free-floating dedup),
+  and parquet column/predicate pushdown for large panels.
+## Methodology & limitations
+[`METHODOLOGY.md`](./METHODOLOGY.md) documents the A1–A7 thresholds, the dynamic checks, the
+polling/aliasing limit on flows, and what the spatial statistics can and cannot claim — read it
+before building a study on the toolkit.
+## How to cite
+See [`CITATION.cff`](./CITATION.cff). The semantic taxonomy is from the
+`gbfs-audit-catalogue` dataset paper (Fossé & Pallares, 2026).
+## License
+[MIT](./LICENSE). Affiliated with [CESI LINEACT (EA 7527)](https://lineact.cesi.fr), Montpellier, France.