views-frames 1.0.0__py3-none-any.whl

@@ -0,0 +1,624 @@
1
+ Metadata-Version: 2.4
2
+ Name: views-frames
3
+ Version: 1.0.0
4
+ Summary: The VIEWS platform data-contract layer: immutable array+identifier frames (numpy only, root of the dependency DAG).
5
+ Project-URL: Homepage, https://github.com/views-platform/views-frames
6
+ Project-URL: Repository, https://github.com/views-platform/views-frames
7
+ Project-URL: Changelog, https://github.com/views-platform/views-frames/blob/main/CHANGELOG.md
8
+ Author-email: Simon Polichinel von der Maase <simmaa@prio.org>
9
+ License-Expression: MIT
10
+ License-File: LICENSE
11
+ Requires-Python: >=3.10
12
+ Requires-Dist: numpy<3,>=1.26
13
+ Provides-Extra: arrow
14
+ Requires-Dist: pyarrow<20,>=14; extra == 'arrow'
15
+ Description-Content-Type: text/markdown
16
+
17
+ # views-frames
18
+
19
+ > The VIEWS platform's **data-contract layer**: small, stable, abstract array
20
+ > containers (`FeatureFrame`, `PredictionFrame`, and their anticipated siblings)
21
+ > that every other repo depends on and that depends on nothing internal.
22
+ >
23
+ > **Status:** **frozen — v1.0.0** (API freeze, ADR-018). This README is the design
24
+ > bible; the contract it specifies is realised in `src/views_frames/` (index, frames,
25
+ > io, conformance suite) plus the `src/views_frames_summarize/` sibling package
26
+ > (sample-axis summarization — `collapse`/MAP/HDI/quantiles + cross-level
27
+ > aggregation; ADR-017). The blocking design decisions are resolved (§13a) and
28
+ > ratified as ADRs 011–018; two consumer-review rounds (`perspectives/`) validated
29
+ > the design. Consumer adoption (re-export shims, pandas migration) is the owner's
30
+ > migration, not this repo's.
31
+ > If the code and this README disagree, that is a bug — reconcile before merging.
32
+
33
+ ---
34
+
35
+ ## 0. One-paragraph thesis
36
+
37
+ DataFrames (pandas/geopandas/polars) are a **boundary/interop and analysis**
38
+ format, not an internal data-handling representation. They do not belong as the
39
+ canonical transport type inside the VIEWS pipeline. The canonical transport is an
40
+ **array + spatiotemporal identifiers** value object — what we call a *frame*.
41
+ Two frames already exist, **duplicated and diverging** across repos
42
+ (`PredictionFrame` in views-pipeline-core, `FeatureFrame` in views-datafactory).
43
+ `views-frames` unifies them into one **leaf package** at the root of the
44
+ dependency graph: maximally stable, maximally abstract, numpy-only, depended on
45
+ by everyone, depending on nothing internal. It is the keystone that
46
+ de-duplicates the frames, breaks cross-repo dependency cycles, removes pandas
47
+ from internal transport, and gives arrays the label-alignment that today forces
48
+ pandas back into the hot path.
49
+
50
+ ---
51
+
52
+ ## 0a. Quickstart
53
+
54
+ Build a frame, summarize its sample axis, serialize it, and run the published
55
+ contract check. The full runnable script is [`examples/quickstart.py`](examples/quickstart.py)
56
+ (`uv run examples/quickstart.py`):
57
+
58
+ ```python
59
+ import numpy as np
60
+ from views_frames import PredictionFrame, SpatialLevel, SpatioTemporalIndex
61
+ from views_frames.conformance import assert_frame_contract
62
+ from views_frames_summarize import collapse, hdi, map_estimate
63
+
64
+ index = SpatioTemporalIndex(
65
+     time=np.array([1, 1, 2], dtype=np.int64),
66
+     unit=np.array([10, 11, 10], dtype=np.int32),
67
+     level=SpatialLevel.PGM,
68
+ )
69
+ pf = PredictionFrame(np.random.default_rng(0).gamma(2.0, 1.0, (3, 500)).astype("f4"), index)
70
+
71
+ mean = collapse(pf, np.mean) # (N, S) -> (N, 1) frame, statistic injected
72
+ mode = map_estimate(pf) # per-row MAP -> (N, 1) frame
73
+ band = hdi(pf, mass=0.9) # per-row 90% HDI -> (N, 2) index-aligned array
74
+
75
+ pf.save("/tmp/pf"); reloaded = PredictionFrame.load("/tmp/pf")
76
+ assert_frame_contract(pf) # the check a consumer runs in its own CI
77
+ ```
78
+
79
+ The leaf (`views_frames`) owns the immutable array+identifier contract and
80
+ alignment; the sibling (`views_frames_summarize`) owns the sample-axis statistics.
81
+ Both are numpy-only. For the subtler cm↔pgm surface — a time-varying
82
+ `(time, unit)→country` mapping, `cross_level_align`, and conservation-correct
83
+ `aggregate_distributions` (`HDI(sum) ≠ sum(HDI)`) — see
84
+ [`examples/cross_level.py`](examples/cross_level.py).
85
+
86
+ ---
87
+
88
+ ## 1. Why this package exists (the problems it kills)
89
+
90
+ Concrete, current pain — each item is a real, observed defect this package is
91
+ designed to resolve (register IDs are from views-pipeline-core's technical risk
92
+ register):
93
+
94
+ - **Duplicated, diverging twins.** `PredictionFrame`
95
+ (`views-pipeline-core/views_pipeline_core/data/prediction_frame.py`) and
96
+ `FeatureFrame`
97
+ (`views-datafactory/src/datafactory_adapters/feature_frame.py`) share a core
98
+ (`values: ndarray` + `identifiers: {time, unit}` + `save/load`) but are **not
99
+ near-1:1**: they diverge on ≥6 axes — sample-axis position, `feature_names` /
100
+ `metadata`, identifier NaN-check, `collapse` / `mmap`, save footprint, and the
101
+ pandas import (`PredictionFrame` still has one). They have two owners, two release
102
+ cadences, and **no shared base**. They will drift. (REP violation — reused
103
+ together, released apart.) The fix unifies the *shared index + protocols*, not
104
+ the classes — see §5 (Option C) and §13a.
105
+ - **Circular package dependency.** views-pipeline-core ↔ views-reporting form a
106
+ cycle (one direction declared, the other hidden behind `try/except ImportError`).
107
+ See views-reporting issue #113. A neutral leaf package both sides route their
108
+ data contract *through* breaks the cycle (ADP).
109
+ - **pandas leaks into internal transport.** The evaluation boundary still takes
110
+ `actual: pd.DataFrame, predictions: List[pd.DataFrame]`
111
+ (`modules/validation/adapter.py`); ingest returns `pd.DataFrame`; the
112
+ list-in-cell parquet encoding causes a measured ~33× memory blow-up (C-40,
113
+ C-66). The frame + a flat columnar disk format fix the scaling; a `TargetFrame`
114
+ fixes the eval boundary.
115
+ - **Observed in production (#181) — the thesis, measured.** A HydraNet eval run
116
+ (`main.py -r calibration -t -e -re`) is **OOM-killed (exit 137, ~16–18 GB)** in
117
+ the report tail; dropping the report flag → 2.4 GB (~7× less). A synthetic
118
+ micro-benchmark line-isolated it: the report builds **object-dtype** DataFrames
119
+ (list-in-cell `pred_{target}` + per-cell `np.array` actuals) over the **full
120
+ grid × full timeline** — **~50–160×** the dense float32 cost (~200–650 B/row vs
121
+ 4). The dense numpy compute is *small* (~0.3 GB); the cost is the object
122
+ representation. It scales with `n_posterior_samples` (the collapse step is what
123
+ first materializes the full-sample tensor). This is C-40/C-66 firing for real —
124
+ pipeline-core **C-186**, the **first observed-in-production member** of the
125
+ Data-Contract Gap cluster, and the live use-case that motivates this package.
126
+ A dense, collapsed array frame is the fix. See
127
+ `perspectives/from_views-pipeline-core_perspective.md`.
128
+ - **God-class data handler with leaked internals.** `_ViewsDataset`
129
+ (`data/handlers.py`, ~950 LOC, C-36) is consumed across three repos by reaching
130
+ into its **private** members (`_time_id`, `_entity_id`, `_get_entity_index`,
131
+ `.dataframe`, `.to_tensor`) at ~56 sites (C-135), and views-reporting even
132
+ **mutates** a core object across the repo boundary
133
+ (`pg_dataset.reconciled_dataframe = ...`,
134
+ `views-reporting/reconciliation/dataset_export.py:103,122`; C-184). Frames are
135
+ immutable value objects with a *published* interface — the opposite of this.
136
+ - **Evaluation outputs scattered, then mis-read.** A model's evaluation metrics
137
+ are written to a local `eval_*.parquet` *and* logged to wandb, with no typed
138
+ output container. views-reporting's evaluation report scrapes them back out of
139
+ wandb (`get_latest_run().summary`) and — because that returns the latest
140
+ *created* run, not the latest run *with* metrics — renders the wrong run:
141
+ **22/25 constituents showed "not calculated"** in a real ensemble report while
142
+ the scores sat in an earlier run (views-reporting's own register, C-48). A
143
+ first-class **`MetricFrame`** (§4.2) is the typed output form the report *could
144
+ adopt* instead of re-deriving from a mutable mirror — but `MetricFrame` is
145
+ **out of this leaf** (it is keyed `(target, step, unit)`, not a `(time, unit)`
146
+ frame; views-evaluation owns eval-output vocab). What this package provides is
147
+ the **substrate** for that cure (the typed, conformance-checked frame contract +
148
+ the extensible `FrameMetadata` header), not the cure itself. *(Exploratory; §4.2,
149
+ §13a.6.)*
150
+ - **Stable package, zero abstractions.** views-pipeline-core's `data/` is its
151
+ most depended-on (most stable) package yet contains no protocols/ABCs (C-165,
152
+ C-48). A stable component must be abstract (SAP). This package *is* the
153
+ abstraction.
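
The list-in-cell cost described in the bullets above can be approximated with a small measurement. This is a minimal sketch, not the benchmark from the #181 investigation: the helper names are hypothetical, it counts only CPython object overhead via `sys.getsizeof`, and the exact figures depend on the interpreter and on pandas/numpy versions.

```python
import sys

import numpy as np


def dense_bytes_per_row(n_rows: int, n_samples: int) -> float:
    """Bytes per row of a dense, contiguous float32 (N, S) array."""
    dense = np.zeros((n_rows, n_samples), dtype=np.float32)
    return dense.nbytes / n_rows


def list_in_cell_bytes_per_row(n_rows: int, n_samples: int) -> float:
    """Approximate bytes per row when each cell holds a Python list of floats."""
    cells = np.empty(n_rows, dtype=object)
    for i in range(n_rows):
        cells[i] = [float(i + s) for s in range(n_samples)]
    # Each cell costs one pointer in the object array, one list object,
    # and one boxed float object per sample.
    payload = sum(
        sys.getsizeof(cell) + sum(sys.getsizeof(x) for x in cell)
        for cell in cells
    )
    return (cells.nbytes + payload) / n_rows


for s in (1, 50):
    dense = dense_bytes_per_row(1_000, s)
    boxed = list_in_cell_bytes_per_row(1_000, s)
    print(f"S={s}: dense {dense:.0f} B/row, list-in-cell ~{boxed:.0f} B/row")
```

The ratio is largest for collapsed data (`S = 1`), where the fixed per-cell object overhead dominates the four bytes of actual payload.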
154
+
155
+ **The product is not "a numpy wrapper." The product is the identifier/alignment
156
+ contract** — the shared, versioned definition of "an array aligned to (time,
157
+ unit)" that every model, evaluator, reconciler, and report agrees on.
158
+
159
+ ---
160
+
161
+ ## 2. Position in the dependency graph (the whole point)
162
+
163
+ ```
164
+                       ┌───────────────────────┐
165
+                       │     views-frames      │  ← leaf / root of the DAG
166
+                       │ (numpy only, stable,  │    stable + abstract (SDP+SAP)
167
+                       │  abstract protocols)  │    depends on NOTHING internal
168
+                       └───────────▲───────────┘
169
+           ┌───────────────┬───────┴────────┬────────────────┐
170
+           │               │                │                │
171
+ views-pipeline-core  views-datafactory  views-evaluation  model repos
172
+   (orchestration)    (data production)     (metrics)      (hydranet, bayesian,
173
+           │               │                                stepshifter, r2darts2,
174
+           ▼               ▼                                baseline, lab00)
175
+ views-reporting / views-postprocessing   (consumers, downstream)
176
+ ```
177
+
178
+ **Rule:** every internal arrow points *toward* `views-frames`. `views-frames`
179
+ imports **no** `views_*` package, ever. If it ever needs to, the boundary is
180
+ wrong. This is what makes it impossible for the leaf to join a cycle (ADP) and what
181
+ makes it safe to depend on from everywhere (SDP).
182
+
183
+ > **Consumer perspectives.** A downstream repo's detailed view of how it uses
184
+ > these frames lives in `perspectives/from_<repo>_perspective.md`. The first is
185
+ > `perspectives/from_views-reporting_perspective.md` — the presentation layer that
186
+ > *consumes* `PredictionFrame`, `TargetFrame`, and `MetricFrame` and routes its
187
+ > data contract through this leaf (which is what breaks the
188
+ > views-pipeline-core ↔ views-reporting cycle, reporting issue **#113**).
189
+ >
190
+ > `perspectives/from_views-pipeline-core_perspective.md` is the **origin/orchestration**
191
+ > repo's view — not a pure downstream consumer but the repo that *owns these types
192
+ > today* (`PredictionFrame`, `_ViewsDataset`, the converter) and hands the contract
193
+ > off to this leaf. It carries the worked failure mode (#181 report-stage OOM,
194
+ > C-186) and the migration mechanics (it does most of README §10).
195
+
196
+ ---
197
+
198
+ ## 3. Hard constraints (non-negotiable; reject PRs that break these)
199
+
200
+ 1. **Dependencies:** `numpy` only, in the core. Optional extras may add
201
+ serialization deps **behind `io/` submodules** (`pyarrow` for the columnar
202
+ format), never in the core frame classes. **Never** import `pandas`,
203
+ `geopandas`, `polars`, `wandb`, `viewser`, `torch`, or any `views_*` package
204
+ from the core. (CRP: a model that wants a `PredictionFrame` must not
205
+ transitively install the pandas/reporting world.)
206
+ 2. **No application logic.** No fetching, no model code, no report rendering, no
207
+ reconciliation math, no wandb, no disk-path conventions beyond `save/load` of
208
+ the frame itself. Those are *adapters* and live in the consumer repos.
209
+ 3. **Immutable value objects.** A frame is validated at construction and then
210
+ treated as read-only. Operations (`collapse`, `select`, `with_metadata`)
211
+ **return new frames**; they never mutate in place. (Directly forbids the
212
+ C-184 cross-repo-mutation anti-pattern.) **Copy-vs-view:** structural and
213
+ metadata-only operations (`with_metadata`, contiguous `select`) return frames
214
+ that **share** the underlying `values` buffer (numpy view / zero-copy), and a
215
+ `mmap`-backed frame stays `mmap`-backed — a new frame must never copy a
216
+ multi-GB `values` buffer (that would reintroduce the §7 blow-up). Only a
217
+ reducing op (`collapse`) allocates, and only the reduced array. Pinned in the
218
+ conformance suite.
219
+ 4. **Fail loud at construction.** All invariants are checked in `__init__` and
220
+ raise `ValueError`/`TypeError` immediately — never return a half-valid object,
221
+ never log-and-continue. (Matches the platform's "Fail Loud and Proud" rule.)
222
+ 5. **dtype discipline.** `values` are `float32` (contiguous); identifier arrays
223
+ are integer dtype; **no `object` dtype, ever** (object/list-in-cell is the
224
+ thing that doesn't scale). Identifiers are complete (no NaN). The guarantee is
225
+ **structural, not temporal**: the leaf validates integer / length-N / no-NaN,
226
+ but `time` is an **opaque integer** — month_id epoch, range, and monotonicity
227
+ are a producer-adapter concern, never the leaf's (the leaf is epoch-agnostic).
228
+ 6. **One concept per file.** See §6. Multiple classes in one file is the
229
+ exception, justified only by genuine tight coupling.
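
Constraints 3 to 5 can be illustrated with a toy value object. The sketch below is an assumption-laden illustration, not the package's implementation: `TinyFrame` is a hypothetical class, and the real frames carry a `SpatioTemporalIndex` rather than bare `time`/`unit` arrays. It shows validation at construction, a read-only `values` buffer, and a `with_metadata` that returns a new object sharing the same buffer.

```python
from dataclasses import dataclass, field, replace
from types import MappingProxyType
from typing import Mapping

import numpy as np


@dataclass(frozen=True, eq=False)
class TinyFrame:  # hypothetical stand-in, not a views_frames class
    values: np.ndarray
    time: np.ndarray
    unit: np.ndarray
    metadata: Mapping = field(default_factory=dict)

    def __post_init__(self):
        # Fail loud: every invariant is checked before the object exists.
        if not isinstance(self.values, np.ndarray) or self.values.dtype != np.float32:
            raise TypeError("values must be a float32 ndarray")
        if self.values.ndim != 2 or self.values.shape[0] == 0:
            raise ValueError("values must be 2D with at least one row")
        for name in ("time", "unit"):
            ident = getattr(self, name)
            if not isinstance(ident, np.ndarray) or not np.issubdtype(ident.dtype, np.integer):
                raise TypeError(f"{name} must be an integer ndarray")
            if ident.shape != (self.values.shape[0],):
                raise ValueError(f"{name} must have length N")
        # A read-only view shares the buffer; nothing is copied.
        view = self.values.view()
        view.flags.writeable = False
        object.__setattr__(self, "values", view)
        object.__setattr__(self, "metadata", MappingProxyType(dict(self.metadata)))

    def with_metadata(self, **extra):
        """Return a new frame with merged metadata and the same values buffer."""
        return replace(self, metadata={**self.metadata, **extra})


frame = TinyFrame(
    np.ones((3, 4), dtype=np.float32),
    np.array([1, 1, 2]),
    np.array([10, 11, 10]),
)
tagged = frame.with_metadata(source="demo")
print(np.shares_memory(frame.values, tagged.values))
```

Attempting `frame.values[0, 0] = 2.0` raises `ValueError` because the view is read-only, and rebinding `frame.values` raises because the dataclass is frozen.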
230
+
231
+ ---
232
+
233
+ ## 4. The frame family
234
+
235
+ A *frame* = a numeric array whose first axis is **N rows**, each row carrying a
236
+ complete set of **spatiotemporal identifiers** `{time, unit}`, optionally with a
237
+ trailing **sample axis S** (posterior draws / ensemble members) and, for
238
+ multi-channel frames, a **feature/channel axis**.
239
+
240
+ ### 4.1 Existing (unify these first)
241
+
242
+ | Frame | Array shape | Extra fields | Semantics | Lives today in |
243
+ |---|---|---|---|---|
244
+ | **`FeatureFrame`** | `y_features: (N, F)` or `(N, F, S)` | `feature_names: list[str]` | model **inputs** (X) | views-datafactory |
245
+ | **`PredictionFrame`** | `y_pred: (N, S)` | — | model **outputs** (ŷ samples) | views-pipeline-core |
246
+
247
+ Existing `PredictionFrame` contract (preserve on migration): `float32`;
248
+ `REQUIRED_IDENTIFIERS = {"time", "unit"}`; validates 2D, `n_rows > 0`,
249
+ `sample_count >= 1`, identifiers present + length-N + no NaN; properties
250
+ `n_rows`, `sample_count`, `identifier_keys`; `collapse(method="arithmetic_mean")`
251
+ → new `(N, 1)` frame; `save(dir)` → `y_pred.npy` + `identifiers.npz`;
252
+ `load(dir, mmap=False)`. Existing `FeatureFrame` adds `feature_names`,
253
+ `metadata`, `n_features`, `is_sample`.
254
+
255
+ **Sample axis convention (decided, §13a).** The sample axis **S** is **always an
256
+ explicit trailing axis** (`S ≥ 1`): `PredictionFrame` is `(N, S)`, `FeatureFrame`
257
+ is `(N, F, S)`, `TargetFrame` is `(N, 1)`. `is_sample` is `S > 1`; `collapse`
258
+ reduces the trailing axis. One shape contract across the family — no `ndim`
259
+ branching. A corollary: relocating `PredictionFrame` is a **numpy-only rewrite of
260
+ its identifier validation, not a verbatim move** — today it imports pandas and
261
+ uses `pd.isna` for the NaN-check (§10.2).
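
The trailing-axis convention means one reduction works for every frame shape. The following is a minimal numpy sketch of that idea; `collapse_trailing` and `is_sample` are hypothetical free functions used for illustration and are not the `views_frames_summarize.collapse` API.

```python
import numpy as np


def collapse_trailing(values: np.ndarray, reducer=np.mean) -> np.ndarray:
    """Reduce the trailing sample axis but keep it as an explicit size-1 axis."""
    reduced = reducer(values, axis=-1, keepdims=True)
    return np.ascontiguousarray(reduced, dtype=np.float32)


def is_sample(values: np.ndarray) -> bool:
    """A frame carries samples when the trailing axis is longer than 1."""
    return values.shape[-1] > 1


pred = np.arange(12, dtype=np.float32).reshape(3, 4)  # (N=3, S=4)
feat = np.zeros((3, 2, 4), dtype=np.float32)          # (N=3, F=2, S=4)

print(collapse_trailing(pred).shape)   # (N, S) -> (N, 1)
print(collapse_trailing(feat).shape)   # (N, F, S) -> (N, F, 1)
```

Because the sample axis never disappears, downstream code can index `values[..., 0]` on a collapsed frame without checking `ndim` first.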
262
+
263
+ ### 4.2 Anticipated (design the base so these drop in via OCP, don't build all now)
264
+
265
+ | Frame | Array shape | Why we already know we need it | Priority |
266
+ |---|---|---|---|
267
+ | **`TargetFrame`** (a.k.a. `ActualsFrame`) | `y_true: (N, 1)` | The **evaluation boundary** still takes pandas actuals (`adapter.py`). A target frame makes eval array-native and kills that pandas dependency. Structurally `PredictionFrame` with `S=1`. | **next** |
268
+ | **`WeightFrame`** | `w: (N,)` or `(N, S)` | Weighted losses / weighted metrics. Same identifiers, different `values` meaning. | when weighting lands |
269
+ | **`MaskFrame`** | `mask: (N,)` bool | Partial-data / sparse-actuals evaluation (C-26 silent truncation). Marks which (time, unit) cells are present. | when partial eval lands |
270
+ | **`MetricFrame`** (a.k.a. `ScoreFrame`) | `(K, …)` keyed by `(target, step, unit)` | Evaluation **outputs** are currently scattered into wandb summaries + parquet. First-class array form. **views-reporting's eval report is the consumer of record** — today it scrapes wandb and renders the wrong run (its C-48; see `perspectives/from_views-reporting_perspective.md`). | exploratory |
271
+
272
+ **Already exists externally — do NOT rebuild:** `EvaluationFrame` lives in
273
+ `views-evaluation` (aligned pred×actual×(origin, step)). `views-frames` should
274
+ define the **identifier/index protocol it conforms to**, and views-evaluation
275
+ should adopt that protocol — not have its frame re-implemented here.
276
+
277
+ ### 4.3 The real shared primitive: `SpatioTemporalIndex`
278
+
279
+ Every frame is **array + identifiers**. The identifiers — `{time, unit}` (plus
280
+ the cm/pgm `SpatialLevel`) — and the **alignment/join logic over them** are the
281
+ genuinely reused core. Build this once:
282
+
283
+ - Fields: `time: int[N]`, `unit: int[N]`, `level: SpatialLevel` (cm/pgm), all
284
+ numpy, integer dtype, no NaN, length N.
285
+ - **Same-level operations (owned here, pure-numpy, no pandas):** `intersect`,
286
+ `reindex`, `is_superset_of`, `argsort`, `searchsorted`-based joins over
287
+ `(time, unit)` **at a single `SpatialLevel`**. **This is the label-alignment
288
+ that today drags pandas back in** — pred↔actual join, partial-overlap
289
+ evaluation, same-level reindex. This alignment logic lives in the leaf
290
+ unconditionally.
291
+ - **Cross-level operations (`cross_level_align`) — protocol here, data injected.**
292
+ The cm↔pgm **cross-level join** (country↔grid) is **not** a same-axis set op; it
293
+ is a one-to-many lookup against a `priogrid→country` mapping that is **injected**
294
+ by the consumer and **not embedded in the leaf** — the mapping is external,
295
+ viewser-sourced, and **time-varying** (a cell's country assignment changes by
296
+ month). The leaf owns the operation itself, `cross_level_align(index,
297
+ mapping)`, signature and logic alike; only the alignment data (the
298
+ mapping) is supplied by the consumer (or a separate reference package the leaf
299
+ does not depend on), never fetched or versioned here — embedding versioned domain
300
+ data would make the leaf change for data reasons and break §8 maximal stability.
301
+ This resolves the falsified "domain-free cross-level" claim
302
+ (`critiques/critique_02.md`); faoapi's producer-materialised metadata is the
303
+ existence proof (`perspectives/from_views-faoapi_perspective.md` §8.3).
304
+ - `SpatialLevel` (currently `views-pipeline-core/domain/spatial.py`) should move
305
+ here — it is a tiny, stable value object that *is* part of the identifier
306
+ vocabulary (it defines `index_names` and `entity_column`: cm→`country_id`,
307
+ pgm→`priogrid_id`). It carries the *labels*, never the cross-level *mapping*.
308
+ Owning it here ends the bare-string `"cm"`/`"pgm"` sprawl (C-38) and the
309
+ `_ViewsDataset` private `_entity_id` reads (C-135). Relocate it with the C-65
310
+ reversed index-tuple (must be time-first `(month_id, entity)`) and the
311
+ `priogrid_gid`/`priogrid_id` inconsistency **fixed, not ported**.
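
Both kinds of operation in the list above can be sketched in pure numpy. This is an illustration under stated assumptions, not the package's code: `pack_keys`, `intersect_rows`, and `lookup_country` are hypothetical names, identifiers are assumed non-negative with `unit < 2**32` and `time < 2**31`, and the mapping is passed in as three parallel arrays.

```python
import numpy as np


def pack_keys(time, unit) -> np.ndarray:
    """Pack (time, unit) pairs into one sortable int64 key per row."""
    time = np.asarray(time, dtype=np.int64)
    unit = np.asarray(unit, dtype=np.int64)
    if (time < 0).any() or (unit < 0).any():
        raise ValueError("identifiers must be non-negative")
    if (time >= 2**31).any() or (unit >= 2**32).any():
        raise ValueError("identifier out of range for 64-bit packing")
    return (time << 32) | unit


def intersect_rows(time_a, unit_a, time_b, unit_b):
    """Same-level join: row positions in A and B that share a (time, unit)."""
    _, rows_a, rows_b = np.intersect1d(
        pack_keys(time_a, unit_a), pack_keys(time_b, unit_b), return_indices=True
    )
    return rows_a, rows_b


def lookup_country(time, unit, map_time, map_unit, map_country) -> np.ndarray:
    """Cross-level lookup against an injected, time-varying mapping."""
    map_keys = pack_keys(map_time, map_unit)
    if map_keys.size == 0:
        raise KeyError("empty mapping")
    order = np.argsort(map_keys)
    sorted_keys = map_keys[order]
    keys = pack_keys(time, unit)
    pos = np.minimum(np.searchsorted(sorted_keys, keys), sorted_keys.size - 1)
    if not np.array_equal(sorted_keys[pos], keys):
        raise KeyError("mapping does not cover every (time, unit)")
    return np.asarray(map_country)[order][pos]


# Predictions cover four cells; actuals cover three, two of them shared.
rows_pred, rows_actual = intersect_rows(
    [1, 1, 2, 2], [10, 11, 10, 11],
    [2, 1, 3], [10, 11, 10],
)
# Cell 10 changes country between month 1 and month 2.
countries = lookup_country(
    [1, 2, 2], [10, 10, 11],
    map_time=[1, 1, 2, 2], map_unit=[10, 11, 10, 11], map_country=[1, 1, 2, 1],
)
print(rows_pred, rows_actual, countries)
```

Indexing the prediction values with `rows_pred` and the actuals with `rows_actual` yields two arrays aligned row for row, with no DataFrame involved. The mapping arrives as an argument, so the lookup function itself never changes when the reference data does.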
312
+
313
+ > Design heuristic: if two consumers disagree about how `(time, unit)` align **at
314
+ > the same level**, that disagreement belongs **here**, resolved once. If they
315
+ > disagree about *which country a cell belongs to*, that is domain reference data
316
+ > — it belongs to the consumer / producer, never the leaf.
317
+
318
+ ---
319
+
320
+ ## 5. Abstractions / Protocols (DIP, ISP, SAP, LSP)
321
+
322
+ The package exports **Protocols first, concretes second.** Consumers type against
323
+ the protocols (DIP); a concrete frame is an implementation detail.
324
+
325
+ Segregate the surface so no consumer depends on methods it does not use (ISP):
326
+
327
+ - **`SpatioTemporalIndexed`** — `identifiers`, `n_rows`, `index: SpatioTemporalIndex`.
328
+ (What a reconciler/aligner needs.)
329
+ - **`Sampled`** — `sample_count`, `is_sample` (the *structural* sample-axis facts).
330
+ Reduction over the sample axis lives in `views_frames_summarize`, not here (ADR-017).
331
+ - **`Persistable`** — `save(dir)`, `load(dir, mmap)`.
332
+ (What I/O needs — and *only* I/O.)
333
+ - **`Frame`** = the small composition the math layer needs: `values`, `index`,
334
+ `n_rows`. Nothing else.
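
The segregated surface maps naturally onto `typing.Protocol`. The sketch below is one possible shape, assuming property-style members; the member names come from the list above, but the exact definitions in `protocols.py` may differ, and `DemoPrediction` is a hypothetical class that exists only for this example.

```python
from typing import Protocol, runtime_checkable

import numpy as np


@runtime_checkable
class Sampled(Protocol):
    @property
    def sample_count(self) -> int: ...

    @property
    def is_sample(self) -> bool: ...


@runtime_checkable
class Frame(Protocol):
    @property
    def values(self) -> np.ndarray: ...

    @property
    def index(self) -> object: ...

    @property
    def n_rows(self) -> int: ...


class DemoPrediction:  # hypothetical; satisfies both protocols structurally
    def __init__(self, values: np.ndarray, index: object):
        self._values = values
        self._index = index

    @property
    def values(self) -> np.ndarray:
        return self._values

    @property
    def index(self) -> object:
        return self._index

    @property
    def n_rows(self) -> int:
        return self._values.shape[0]

    @property
    def sample_count(self) -> int:
        return self._values.shape[-1]

    @property
    def is_sample(self) -> bool:
        return self.sample_count > 1


def row_means(frame: Frame) -> np.ndarray:
    """A consumer typed against the protocol, not against a concrete class."""
    return frame.values.mean(axis=-1)


demo = DemoPrediction(np.ones((3, 5), dtype=np.float32), index=None)
print(isinstance(demo, Frame), isinstance(demo, Sampled), row_means(demo).shape)
```

`DemoPrediction` inherits from nothing, which is the point: conformance is structural, so sibling frames can satisfy the same protocols without a shared base class.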
335
+
336
+ **LSP + composition over inheritance:** `FeatureFrame`, `PredictionFrame`,
337
+ `TargetFrame`, … are **siblings, not a subtype chain.** Do **not** make one
338
+ inherit another. They share behavior by (a) satisfying the same Protocols and
339
+ (b) composing a `SpatioTemporalIndex` and a small internal validation helper —
340
+ **not** by extending a fat base class. A subtype must be substitutable wherever
341
+ its protocol is expected; that holds for protocol conformance, and it is exactly
342
+ what a `CMDataset`-style inheritance tree gets wrong. The cm/pgm distinction is a
343
+ **value** (`SpatialLevel`) carried by the index, never a class axis.
344
+
345
+ > Anti-pattern, explicitly banned: a `_BaseFrame` god-class that
346
+ > `FeatureFrame`/`PredictionFrame` extend and that accretes everyone's methods.
347
+ > That recreates `_ViewsDataset` (C-36). Keep the base a **Protocol**; share code
348
+ > by composition.
349
+ >
350
+ > **Unification model — Option C (decided, §13a).** v1 unifies **only** the shared
351
+ > `SpatioTemporalIndex` + `_validation` + protocols + `io/`; the frame classes are
352
+ > relocated as **separate sibling classes**, not merged. This captures the real
353
+ > reused core (the index) at the lowest churn and zero god-class risk. A composed,
354
+ > shared metadata header across frames (Option B) is a later upgrade *only if* a
355
+ > third frame proves the header is genuinely reused. A shared concrete base
356
+ > (Option A) is **rejected in writing**.
357
+
358
+ ---
359
+
360
+ ## 6. Physical layout (the repo must scream "data contracts")
361
+
362
+ ```
363
+ views-frames/
364
+ ├── README.md # this file (the design bible)
365
+ ├── pyproject.toml # numpy core; [arrow] optional extra for io/arrow
366
+ ├── LICENSE
367
+ ├── src/views_frames/ # the pure data contract (numpy only, depends on nothing)
368
+ │ ├── __init__.py # EXPLICIT re-exports only (no `import *`)
369
+ │ ├── index.py # SpatioTemporalIndex value object + alignment
370
+ │ ├── spatial_level.py # SpatialLevel enum (cm/pgm) — relocated here
371
+ │ ├── protocols.py # Frame / SpatioTemporalIndexed / Sampled / Persistable
372
+ │ ├── _validation.py # shared construction-time invariants (private helper)
373
+ │ ├── feature_frame.py # FeatureFrame ── one concept per file
374
+ │ ├── prediction_frame.py # PredictionFrame
375
+ │ ├── target_frame.py # TargetFrame
376
+ │ ├── conformance/ # the published contract suite consumers re-run (§9)
377
+ │ └── io/ # serialization adapters — SEPARATE from frames (SRP)
378
+ │   ├── __init__.py
379
+ │   ├── npz.py # native save()/load() (.npy + .npz)
380
+ │   └── arrow.py # flat columnar (.parquet) — the scalable disk format
381
+ ├── src/views_frames_summarize/ # sample-axis summarization OVER frames (ADR-017)
382
+ │ ├── __init__.py # depends on views_frames + numpy only; never the reverse
383
+ │ ├── collapse.py # collapse(frame, reducer) — generic point fold
384
+ │ ├── point.py # map_estimate (histogram MAP)
385
+ │ ├── interval.py # hdi, quantiles → arrays aligned to the frame index
386
+ │ └── aggregate.py # conservation-correct cross-level aggregation
387
+ └── tests/
388
+   ├── conformance/ # the published contract suite consumers re-run (see §9)
389
+   └── unit/
390
+ ```
391
+
392
+ Layout rules (these *are* the screaming-architecture requirements):
393
+
394
+ - **One main class/concept per file.** Multiple classes in a file is the
395
+ exception, allowed only for genuinely inseparable units.
396
+ - **Serialization is not the frame's job.** I/O adapters live under `io/`, import
397
+ the frame, and change for *their own* reasons (a new store format) — not when
398
+ the frame's schema changes (SRP + CCP). `PredictionFrameConverter`
399
+ (PF↔list-in-cell DataFrame, a pipeline-core boundary format) **stays in
400
+ pipeline-core**; it is an adapter, not a frame concern.
401
+ - **No dumping grounds.** A file accumulating loose helpers/types/constants/
402
+ classes means a boundary is wrong — split it. (`handlers.py`/`file.py`-style
403
+ 13-class files are the failure mode we are escaping.)
404
+ - **Explicit `__init__.py` re-exports** (named, not `import *`) so the public API
405
+ is statically analyzable.
406
+ - A new developer should infer every responsibility from the file tree without
407
+ reading bodies.
408
+
409
+ ---
410
+
411
+ ## 7. On-disk / serialization contract (where "doesn't scale" is actually decided)
412
+
413
+ The scaling failure in the platform today is the **list-in-cell `object`-dtype
414
+ DataFrame** (a cell holds a Python list of S samples) — measured ~33× blow-up
415
+ (C-40/C-66), and ~50–160× per-row over dense float32 in the #181 report-stage
416
+ investigation (C-186; `perspectives/from_views-pipeline-core_perspective.md`).
417
+ `views-frames` standardizes two scalable formats and **bans list-in-cell**:
418
+
419
+ - **Native (`io/npz.py`):** `values.npy` (contiguous float32) + `identifiers.npz`.
420
+ Supports `mmap` load so peak RAM = working set, not full array. (This is the
421
+ existing `PredictionFrame.save/load`; keep it.)
422
+ - **Interchange (`io/arrow.py`):** **flat columnar** parquet — one row per
423
+ `(time, unit[, sample])`, scalar cells only, zero-copy Arrow write. This is the
424
+ scalable replacement for the list-in-cell format and is what crosses to the
425
+ forecasts store / delivery. (Mirrors the existing `to_arrow_table()` path.)
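
The two formats above can be sketched with numpy alone. This is an illustrative approximation rather than the contents of `io/npz.py` or `io/arrow.py`: the function names are hypothetical, and the columnar step stops at a dict of flat arrays (the shape a parquet writer would receive) so that the example needs no `pyarrow`.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_native(directory, values, time, unit) -> None:
    """Write values.npy (contiguous float32) plus identifiers.npz."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    np.save(directory / "values.npy", np.ascontiguousarray(values, dtype=np.float32))
    np.savez(directory / "identifiers.npz", time=time, unit=unit)


def load_native(directory, mmap: bool = False):
    """Load the pair back; with mmap=True the values stay on disk until read."""
    directory = Path(directory)
    values = np.load(directory / "values.npy", mmap_mode="r" if mmap else None)
    with np.load(directory / "identifiers.npz") as ids:
        time, unit = ids["time"], ids["unit"]
    return values, time, unit


def to_flat_columns(values, time, unit) -> dict:
    """One row per (time, unit, sample), scalar cells only."""
    n_rows, n_samples = values.shape
    return {
        "time": np.repeat(time, n_samples),
        "unit": np.repeat(unit, n_samples),
        "sample": np.tile(np.arange(n_samples, dtype=np.int32), n_rows),
        "value": np.ascontiguousarray(values).reshape(-1),
    }


values = np.arange(6, dtype=np.float32).reshape(3, 2)
time = np.array([1, 1, 2], dtype=np.int64)
unit = np.array([10, 11, 10], dtype=np.int32)

with tempfile.TemporaryDirectory() as tmp:
    save_native(tmp, values, time, unit)
    loaded, loaded_time, loaded_unit = load_native(tmp, mmap=True)
    print(type(loaded).__name__, loaded.shape)
    del loaded  # release the memory map before the directory is removed

columns = to_flat_columns(values, time, unit)
print({name: col.tolist() for name, col in columns.items()})
```

Every column in the flat layout is a plain numeric array, so memory grows linearly with `N × S` instead of paying per-cell Python object overhead.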
426
+
427
+ The **boundary adapters** that convert a frame to a *pandas/views-forecasts*
428
+ representation (because those external stores mandate pandas) live in the
429
+ **consumer** repo, depend on `views-frames`, and are explicitly out of scope
430
+ here (CRP). `views-frames` makes the array authoritative; pandas becomes a thin
431
+ edge adapter, never the internal type.
432
+
433
+ ---
434
+
435
+ ## 8. Contract evolution & versioning (SemVer for a thing N repos import)
436
+
437
+ Because everyone depends on this, breakage is expensive — version it as a
438
+ **published contract**, not as app code:
439
+
440
+ - **MAJOR** (breaking): removing/renaming a field, changing a dtype or axis
441
+ meaning, adding a **required** identifier, tightening an invariant.
442
+ - **MINOR** (additive, back-compatible): a new frame type, a new **optional**
443
+ metadata key, a new method, a new `io/` format.
444
+ - **PATCH:** bug/doc fixes with identical contract.
445
+ - Adding a required identifier is the canonical breaking change — prefer optional
446
+ + a deprecation window. Provide a `from_legacy_*` shim path when a consumer
447
+ format changes.
448
+ - **SAP in practice:** if this package needs frequent MAJOR bumps, it is not
449
+ abstract/stable enough — push volatility *out* into consumer adapters.
450
+
451
+ ---
452
+
453
+ ## 9. Testing strategy (closes the cross-repo contract-test gap, C-30)
454
+
455
+ - **Conformance suite (`tests/conformance/`):** a *published*, importable set of
456
+ contract tests asserting the invariants of each Protocol (round-trip
457
+ save/load, identifier completeness, collapse semantics, alignment laws). Every
458
+ consumer repo runs it in CI against its own adapters. This is the missing
459
+ cross-repo contract test (C-30) and the safety net that lets the frames evolve
460
+ without silently breaking N repos.
461
+ - **Property tests** for `SpatioTemporalIndex` alignment (intersection is
462
+ commutative; align then collapse == collapse then align; etc.).
463
+ - **No mocks needed** — frames are pure value objects over numpy. If a test needs
464
+ a mock, the thing under test probably doesn't belong in this package.
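
The two alignment laws named above can be checked with seeded random data and no mocks. The sketch below is a stand-in for a real property test: it uses a hypothetical packed-key `intersect` helper instead of the `SpatioTemporalIndex` API, and a plain loop over seeds instead of a property-testing library.

```python
import numpy as np


def pack(time: np.ndarray, unit: np.ndarray) -> np.ndarray:
    """Pack non-negative (time, unit) pairs into one int64 key per row."""
    return (time.astype(np.int64) << 32) | unit.astype(np.int64)


def intersect(time_a, unit_a, time_b, unit_b):
    """Shared keys plus the row positions of each key in A and in B."""
    return np.intersect1d(pack(time_a, unit_a), pack(time_b, unit_b), return_indices=True)


def random_index(rng, n: int):
    """Draw n distinct (time, unit) pairs from a 6-month by 5-unit grid."""
    cells = rng.choice(30, size=n, replace=False)
    return cells // 5 + 1, cells % 5 + 10


def check_laws(seed: int) -> int:
    rng = np.random.default_rng(seed)
    time_a, unit_a = random_index(rng, 12)
    time_b, unit_b = random_index(rng, 9)

    # Law 1: intersection is commutative (same keys, same rows of A).
    keys_ab, rows_a, _ = intersect(time_a, unit_a, time_b, unit_b)
    keys_ba, _, rows_a_again = intersect(time_b, unit_b, time_a, unit_a)
    assert np.array_equal(keys_ab, keys_ba)
    assert np.array_equal(rows_a, rows_a_again)

    # Law 2: align then collapse equals collapse then align.
    values = rng.gamma(2.0, 1.0, (12, 50)).astype(np.float32)
    align_first = values[rows_a].mean(axis=1, keepdims=True)
    collapse_first = values.mean(axis=1, keepdims=True)[rows_a]
    assert np.allclose(align_first, collapse_first)
    return len(rows_a)


for seed in range(20):
    check_laws(seed)
print("alignment laws held for 20 seeds")
```

A consumer repo could run the same style of check against its own adapter output, which is the role the conformance suite plays.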
465
+
466
+ ---
467
+
468
+ ## 10. Migration / adoption plan (Strangler, not big-bang)
469
+
470
+ 1. **Stand up the package** with `SpatioTemporalIndex`, `protocols.py`,
471
+ `_validation.py`, and `io/npz.py`.
472
+ 2. **Relocate `PredictionFrame` here (contract-preserving, but _not_ verbatim).**
473
+ `PredictionFrame` today **imports pandas** and uses `pd.isna` for its identifier
474
+ NaN-check (`prediction_frame.py:5,68`); §3.1 forbids pandas in the core, so the
475
+ move is **not a verbatim copy — its identifier validation is rewritten
476
+ numpy-only** (the observable contract from §4.1 is preserved; the implementation
477
+ is not). Re-export from `views-pipeline-core/data/prediction_frame.py` as a thin
478
+ shim (`from views_frames import PredictionFrame`) so existing imports keep
479
+ working.
480
+ 3. **Unify `FeatureFrame`:** move datafactory's implementation here; datafactory
481
+ re-exports a shim. The twins now share `SpatioTemporalIndex` + validation.
482
+ 4. **Add `TargetFrame`** and migrate the evaluation adapter
483
+ (`modules/validation/adapter.py`) off pandas actuals — the highest-value early
484
+ win.
485
+ 5. **Relocate `SpatialLevel`** here; replace bare `"cm"`/`"pgm"` strings and
486
+ `_ViewsDataset._entity_id` reads with `index.level.entity_column`.
487
+ 6. **Add `io/arrow.py`**; point savers at the flat columnar format; retire
488
+ list-in-cell on the internal path (keep a boundary adapter only where an
489
+ external store mandates pandas).
490
+ 7. Consumers drop their direct `_ViewsDataset` private-internal access in favor
491
+ of the published frame/index protocols.
492
+
493
+ Each step is independently shippable and back-compatible via shims (REP/CCP: the
494
+ twins now release together; nothing changes that doesn't change together).
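
Step 2's numpy-only rewrite of the identifier check could look roughly like the following. This is a hedged sketch, not the code in `_validation.py`: `validate_identifier` is a hypothetical helper, and the choice of exception for each failure is an assumption guided by the `ValueError`/`TypeError` rule in §3.

```python
import numpy as np


def validate_identifier(name: str, array, n_rows: int) -> np.ndarray:
    """Check one identifier array without pandas: integer, length N, no NaN."""
    array = np.asarray(array)
    if array.dtype == object:
        raise TypeError(f"{name}: object dtype is not allowed")
    if np.issubdtype(array.dtype, np.floating):
        # np.isnan replaces pd.isna for the only dtype that can hold NaN here.
        if np.isnan(array).any():
            raise ValueError(f"{name}: identifiers must be complete (found NaN)")
        raise TypeError(f"{name}: expected integer dtype, got {array.dtype}")
    if not np.issubdtype(array.dtype, np.integer):
        raise TypeError(f"{name}: expected integer dtype, got {array.dtype}")
    if array.shape != (n_rows,):
        raise ValueError(f"{name}: expected shape ({n_rows},), got {array.shape}")
    return array


time = validate_identifier("time", np.array([1, 1, 2], dtype=np.int64), n_rows=3)
print(time.dtype, time.shape)
```

An integer array cannot contain NaN, so once the dtype check passes the completeness guarantee holds by construction. The explicit NaN branch exists only to give a clearer error when a producer hands over a float column.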
495
+
496
+ ---
497
+
498
+ ## 11. Scope boundaries — what does NOT live here
499
+
500
+ - **Adapters to pandas / views-forecasts / appwrite / parquet-store** → consumer
501
+ repos (pipeline-core, datafactory). External stores mandate pandas; that is an
502
+ *edge*, not the core.
503
+ - **`_ViewsDataset` (pandas↔tensor handler, densification)** → stays in
504
+ pipeline-core; it is heavy, pandas-bound, and a different stability tier.
505
+ - **Reconciliation math, model code, report rendering, wandb, viewser** → their
506
+ owning repos.
507
+ - **`EvaluationFrame`** → stays in views-evaluation; conform it to our index
508
+ protocol instead.
509
+
510
+ If something here starts needing pandas, a `views_*` import, or app logic, it is
511
+ in the wrong package — extract it to a consumer adapter.
512
+
513
+ ---
514
+
515
+ ## 12. Risk-register & decisions this resolves / informs
516
+
517
+ Resolves or directly addresses (views-pipeline-core register): **C-36**
518
+ (`_ViewsDataset` god class — frames replace its transport role with a published
519
+ interface), **C-40 / C-66** (list-in-cell memory blow-up — flat columnar +
520
+ arrays) and **C-186** (the #181 report-stage OOM — the first observed-in-production
521
+ instance of that blow-up), **C-48** (concrete dependencies → protocols), **C-135** (private-internal
522
+ cross-repo leakage → published interface), **C-164** (unwired `DataFetchStrategy`
523
+ — frames give the strategy a typed payload), **C-165** (stable package, zero
524
+ abstractions — this *is* the abstraction), **C-167** (reconciliation I/O has no
525
+ typed contract → frame I/O contract), **C-184** (cross-repo mutation of
526
+ `reconciled_dataframe` → immutable frames). Keystone for views-reporting **#113**
527
+ (circular dependency) and informs **D-28** (relocate reconciliation) and **D-33**
528
+ (collapse the `CMDataset/PGMDataset` hierarchy into a `SpatialLevel` value).
529
+
530
+ From the **views-reporting** consumer (its own register; see
531
+ `perspectives/from_views-reporting_perspective.md`) this package *forbids* its
532
+ **C-184** (the `reconciled_dataframe` mutation) and the reporting side of
533
+ **C-135** (private `_entity_id`/`_time_id` reads → published index protocol), and
534
+ *enables* fixing **C-48** (wandb eval scrape → a typed `MetricFrame`) and **C-44**
535
+ (undeclared wandb → isolated to one consumer adapter). It does **not** by itself
536
+ resolve **C-22** (viewser metadata fetch) or **C-27** (wandb runtime dependency) —
537
+ those remain consumer-side acquisition concerns; `views-frames` only gives their
538
+ output a typed home. (Note: reporting's **C-48** is distinct from the
539
+ pipeline-core **C-48** listed above — two registers, same number.)
540
+
541
+ ---
542
+
543
+ ## 13. Design decisions
544
+
545
+ ### 13a. Resolved (ratified 2026-06-21 — these were the blocking pre-code decisions)
546
+
547
+ 1. **Twin-unification model — Option C.** Unify only the shared
548
+ `SpatioTemporalIndex` + `_validation` + protocols + `io/`; relocate
549
+ `FeatureFrame`/`PredictionFrame` as **separate sibling classes**. Reject the
550
+ shared `_BaseFrame` (Option A); defer the composed header (Option B) until a
551
+ third frame proves it. See §5.
552
+ 2. **Sample axis — decided: always an explicit trailing axis (`S ≥ 1`).**
553
+ `PredictionFrame (N, S)`, `FeatureFrame (N, F, S)`, `TargetFrame (N, 1)`;
554
+ `is_sample` is `S > 1`; sample-axis reduction lives in
555
+ `views_frames_summarize` (point 7), not the leaf. One shape contract, no
556
+ `ndim` branching. See §4.1.
557
+ 3. **Metadata / identifier model — typed header + fixed identifiers.** `metadata`
558
+ is a **typed, optional-extensible header** (not a free-form dict — that re-opens
559
+ C-48 store-side and cannot be validated), carrying provenance (model, run_type,
560
+ timestamp, seed) and `feature_names`. Identifiers stay a fixed required
561
+ `{time, unit}` for v1; any future identifier (`step`, `origin`, `scenario`) is
562
+ added as **optional only** (MINOR), never required (a required identifier is the
563
+ §8 MAJOR break). This is the typed home for the C-48 / #178 run-identity cure.
564
+ 4. **Cross-level (cm↔pgm) alignment — leaf owns the protocol, consumer injects the
565
+ mapping.** Same-level alignment lives in the leaf; the cross-level country↔grid
566
+ join needs a viewser-sourced, time-varying `priogrid→country` **mapping** that
567
+ is **injected by the consumer and never embedded in the leaf**. The leaf owns
568
+ only `cross_level_align(index, mapping)`. See §4.3; resolves
569
+ `critiques/critique_02.md`.
570
+ 5. **`SpatialLevel` lives here, as identifier vocabulary only** — relocated with
571
+ the C-65 reversed index-tuple and the `priogrid_gid`/`priogrid_id`
572
+ inconsistency **fixed, not ported** (§4.3). It carries the level labels, never
573
+ the cross-level mapping or any unit values/ranges.
574
+ 6. **`MetricFrame` / `EvaluationFrame` — out of the leaf.** `EvaluationFrame` stays
575
+ in views-evaluation; `MetricFrame` is keyed `(target, step, unit)` and does not
576
+ satisfy the §4 frame definition, so it stays **out of the leaf** for v1 (it
577
+ may re-enter only if the index protocol is *deliberately* generalised to a
578
+ non-spatiotemporal key — a v2 decision). The leaf may define the *key/index
579
+ protocol* they conform to.
580
+ 7. **Sample-axis summarization is a sibling package, not the leaf (ADR-017, v0.2.0).**
581
+ `collapse`/MAP/HDI/quantiles and conservation-correct cross-level aggregation move
582
+ to `views_frames_summarize` (numpy-only, depends on `views_frames`, import-DAG
583
+ enforced). The leaf keeps only the *structural* `sample_count`/`is_sample`. This
584
+ de-duplicates the HDI/MAP logic faoapi and reporting each re-derive, and keeps the
585
+ leaf free of volatile statistics. (The older prose in §4.1/§5/§7/§9/§14 that lists
586
+ `collapse` as a frame method predates this and is superseded by ADR-017.)
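
The sample-axis decision (point 2) and the split between structural queries and reduction (point 7) can be illustrated on bare arrays. This is a minimal sketch, not the package's API: the functions `sample_count`, `is_sample`, and `collapse` below are hypothetical stand-ins that take plain numpy arrays rather than frames, and keeping a length-1 sample axis after reduction is an assumption made so that the output still satisfies the `S ≥ 1` shape contract.

```python
import numpy as np


def sample_count(values: np.ndarray) -> int:
    """Size of the trailing sample axis S (always present under this contract)."""
    return values.shape[-1]


def is_sample(values: np.ndarray) -> bool:
    """True when there is more than one draw per cell (S > 1)."""
    return sample_count(values) > 1


def collapse(values: np.ndarray, reducer=np.mean) -> np.ndarray:
    """Reduce the sample axis to S = 1, keeping it as an explicit axis."""
    return np.expand_dims(reducer(values, axis=-1), axis=-1)


rng = np.random.default_rng(0)
predictions = rng.normal(size=(4, 100))  # PredictionFrame-like: (N, S)
features = rng.normal(size=(4, 3, 1))    # FeatureFrame-like: (N, F, S) with S = 1
targets = np.zeros((4, 1))               # TargetFrame-like: (N, 1)

point = collapse(predictions)
print(point.shape, is_sample(predictions), is_sample(point))
```

Since the sample axis is always the last axis, none of the three helpers needs to branch on `ndim`; the same code handles `(N, S)` and `(N, F, S)` inputs.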
587
+
588
+ ### 13b. Still open (lower-stakes, resolve at/around first code)
589
+
590
+ 1. **Separate repo (this) vs. interim `views_pipeline_core/frames/` sub-package.**
591
+ This scaffold assumes the separate repo (the SDP/SAP/REP end state, and the only
592
+ thing that de-duplicates datafactory's `FeatureFrame`).
593
+ 2. **`TargetFrame` vs `ActualsFrame` naming** (and whether targets/actuals are one
594
+ type with a role flag).
595
+ 3. **Minimum numpy version / typed-array (nptyping vs bare) policy.**
596
+ 4. **Conformance-suite packaging** — it must ship as an importable artifact
597
+ (installable subpackage / pytest plugin) with a governed **conformance-floor**
598
+ version every consumer runs in CI regardless of its runtime pin (closes C-30
599
+ without the version-coordination paradox).
600
+ 5. **Owner + release cadence** — name the keystone's owner and the process for a
601
+ MAJOR bump that must land across N repos at once (governance is otherwise the
602
+ largest unaddressed cost for a leaf this many repos import).
603
+
604
+ ---
605
+
606
+ ## 14. Glossary
607
+
608
+ - **Frame:** an immutable value object = numeric array (first axis = N rows) +
609
+ complete spatiotemporal identifiers, optionally with a sample axis S.
610
+ - **Identifier:** a length-N integer array locating each row in space/time
611
+ (`time`, `unit`).
612
+ - **`SpatioTemporalIndex`:** the `{time, unit, level}` triple + pure-numpy
613
+ alignment logic; the genuinely reused primitive.
614
+ - **`SpatialLevel`:** cm (country-month) | pgm (PRIO-GRID-month); defines the unit
615
+ column.
616
+ - **Sample axis (S):** posterior draws / ensemble members; reduced by
617
+ `views_frames_summarize` (e.g. `collapse(frame, reducer)`), not the leaf.
618
+ - **list-in-cell:** the banned `object`-dtype encoding (a DataFrame cell holding a
619
+ Python list of samples); this is the encoding that actually fails to scale.
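
To make the glossary terms concrete, here is a minimal sketch of a frame as an immutable value object. The class `SketchFrame` is hypothetical and is not the `views_frames` implementation; it assumes a frozen dataclass with read-only numpy buffers, length-N integer identifiers, and a rejection of `object` dtype (the list-in-cell encoding).

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class SketchFrame:
    """Illustrative stand-in for a frame; not the views_frames implementation."""

    values: np.ndarray  # first axis = N rows, trailing axis = S samples
    time: np.ndarray    # length-N integer identifier
    unit: np.ndarray    # length-N integer identifier

    def __post_init__(self):
        n = self.values.shape[0]
        for name in ("time", "unit"):
            ident = getattr(self, name)
            if ident.shape != (n,):
                raise ValueError(f"{name} must have shape ({n},), got {ident.shape}")
            if not np.issubdtype(ident.dtype, np.integer):
                raise TypeError(f"{name} must be an integer array")
        if self.values.dtype == object:
            raise TypeError("object dtype (list-in-cell) is not allowed")
        # Copy, then lock the buffers so the frame cannot be mutated in place.
        for name in ("values", "time", "unit"):
            arr = np.array(getattr(self, name), copy=True)
            arr.setflags(write=False)
            object.__setattr__(self, name, arr)


frame = SketchFrame(
    values=np.zeros((3, 5)),          # N = 3 rows, S = 5 samples
    time=np.array([500, 500, 501]),
    unit=np.array([10, 20, 10]),
)
print(frame.values.shape, frame.values.flags.writeable)
```

Copying on construction means later changes to the caller's arrays do not leak into the frame, and the read-only flag turns an accidental in-place write into an immediate error.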
620
+
621
+ ---
622
+
623
+ *Build against this document. If the code and this README disagree, that is a bug
624
+ in one of them — reconcile before merging.*
@@ -0,0 +1,27 @@
1
+ views_frames/__init__.py,sha256=C4CKDCwVd1veYQvZnhBdNr2D7CIGRKbLP4ysR8-DIjY,1124
2
+ views_frames/_typing.py,sha256=YtOmRz1s8KVbnO_P94lAppKlTJWxAUIRD0t-9ieGc-Q,897
3
+ views_frames/_validation.py,sha256=0KvM8KAZdXwSC3Ew-uuaxSDvJ6q9V26DCkCgEOjJmw8,4388
4
+ views_frames/feature_frame.py,sha256=NzBDXQoppKImhmpssbLjTnPhqAKB3SAT4WakKoKbHw8,6842
5
+ views_frames/index.py,sha256=lAWH3lVG4V4qo8NLHiw8k6fKjc2bqAQEWi0FocAQd18,13465
6
+ views_frames/metadata.py,sha256=mPLQTDTS-xe996DOg9zq-UfOPKxuuZRW-E80_L9rQ9c,1272
7
+ views_frames/prediction_frame.py,sha256=nCY9c564GRjj83XoDoK6WgbsBb5V7kebDeoQs7p_SCY,5214
8
+ views_frames/protocols.py,sha256=bMeEALEbbplJF_8Oy0OSepkxgS7Y8vEqFhMTslKAt40,2331
9
+ views_frames/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
10
+ views_frames/spatial_level.py,sha256=FmMDEFVjSfCTBzEiaf0DoOxQYad_xCsQZCsyGIdjGjo,1288
11
+ views_frames/target_frame.py,sha256=DTZ2c1QQhP8RwD7frtWfeOPk5RE7MeTYBdbpv53Jfgc,4964
12
+ views_frames/conformance/__init__.py,sha256=A8xXgoxMuWeRiXVUyFRRWoV6fV1vXgiqiX30QO0e5r8,4777
13
+ views_frames/io/__init__.py,sha256=zrDpTcigF7D09-ZztZFI__eDIzpbTWap-iNI6PKdJoE,445
14
+ views_frames/io/arrow.py,sha256=TR1ONiSrSl377ovfyIv2o7qxjEs_bEdWyQq4BM9abzA,3288
15
+ views_frames/io/npz.py,sha256=LC30HLOvpc7racUdUgvexTECKOWmnw9VjJXd2ATsBOw,2075
16
+ views_frames_summarize/__init__.py,sha256=z-I_hkSEQ3S8nQy8yjY7Pz1Pj3qmrYz8NaSqBBPIYLo,1021
17
+ views_frames_summarize/_common.py,sha256=3ASSbwLQTWcD3eMd2jN353_E4nKqfPurWLVNVSfSVOY,2391
18
+ views_frames_summarize/aggregate.py,sha256=RRSxdARlc6UICGX1t0J735LuWVTkie-f9XsZZRq6fEo,3371
19
+ views_frames_summarize/collapse.py,sha256=KBaRoMnaDDhE4EIgrDNjW3FLGyLNAN7VuLu5e8ozCSw,1437
20
+ views_frames_summarize/conformance.py,sha256=IM97aW5aa9gMT2GDvwXPI3RV5BhGGQHic_WuRAgXKpo,1540
21
+ views_frames_summarize/interval.py,sha256=oBUZ5aHR78gdBB7qyHAzyC7dK2R7ipeyPHqaesFaP74,2502
22
+ views_frames_summarize/point.py,sha256=9HqWI9i4cAgMQqZ6sMe1W4Rtc8i4-09wXTG0kbrTuzE,4800
23
+ views_frames_summarize/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
24
+ views_frames-1.0.0.dist-info/METADATA,sha256=s7cxDsDfgxggRN44xMWD0tLocYhYAw-dQoqeAYo9T9s,36946
25
+ views_frames-1.0.0.dist-info/WHEEL,sha256=mffPy8wBnZQn2VnJUU5jE99KsxaSfiyMHV9Yt0aLVxs,87
26
+ views_frames-1.0.0.dist-info/licenses/LICENSE,sha256=Pd39JkiREFciWHbwg50y6drerp2JC7dmpJaVMVPYRdo,1087
27
+ views_frames-1.0.0.dist-info/RECORD,,
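
Each RECORD entry above has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe base64 encoding of the SHA-256 hash with trailing `=` padding removed. The sketch below shows how such a digest can be recomputed using only the standard library; the helper names `record_digest` and `parse_record_line` are hypothetical, and the parser ignores CSV quoting and the hash-less `RECORD,,` line for brevity.

```python
import base64
import hashlib


def record_digest(data: bytes) -> str:
    """Encode sha256(data) as RECORD does: URL-safe base64 without '=' padding."""
    digest = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def parse_record_line(line: str) -> tuple[str, str, str, int]:
    """Split 'path,algo=digest,size' into parts (no CSV quoting handled)."""
    path, hash_field, size = line.rsplit(",", 2)
    algorithm, digest = hash_field.split("=", 1)
    return path, algorithm, digest, int(size)


line = "views_frames/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0"
path, algorithm, digest, size = parse_record_line(line)
# py.typed is an empty marker file, so its digest is the hash of zero bytes.
print(algorithm, size, digest == record_digest(b""))  # prints: sha256 0 True
```

This also explains why both `py.typed` entries in the listing share one digest: each file is zero bytes long, so each hashes to the SHA-256 of empty input.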
@@ -0,0 +1,4 @@
1
+ Wheel-Version: 1.0
2
+ Generator: hatchling 1.30.1
3
+ Root-Is-Purelib: true
4
+ Tag: py3-none-any