jerry-thomas 0.0.5__tar.gz → 0.2.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {jerry_thomas-0.0.5/src/jerry_thomas.egg-info → jerry_thomas-0.2.0}/PKG-INFO +153 -53
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/README.md +147 -49
- jerry_thomas-0.2.0/pyproject.toml +83 -0
- jerry_thomas-0.2.0/src/datapipeline/analysis/vector_analyzer.py +696 -0
- jerry_thomas-0.2.0/src/datapipeline/cli/app.py +425 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/cli/commands/domain.py +2 -2
- jerry_thomas-0.2.0/src/datapipeline/cli/commands/inspect.py +169 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/cli/commands/link.py +48 -14
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/cli/commands/plugin.py +2 -2
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/cli/commands/run.py +47 -48
- jerry_thomas-0.2.0/src/datapipeline/cli/visual_source.py +32 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/cli/visuals.py +4 -31
- jerry_thomas-0.2.0/src/datapipeline/config/catalog.py +30 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/config/dataset/dataset.py +4 -7
- jerry_thomas-0.2.0/src/datapipeline/config/dataset/feature.py +13 -0
- jerry_thomas-0.2.0/src/datapipeline/config/dataset/loader.py +99 -0
- jerry_thomas-0.2.0/src/datapipeline/config/dataset/normalize.py +24 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/config/project.py +0 -2
- jerry_thomas-0.2.0/src/datapipeline/domain/feature.py +17 -0
- jerry_thomas-0.2.0/src/datapipeline/domain/record.py +28 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/domain/vector.py +4 -2
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/filters/filters.py +0 -1
- jerry_thomas-0.2.0/src/datapipeline/integrations/__init__.py +19 -0
- jerry_thomas-0.2.0/src/datapipeline/integrations/ml.py +319 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/mappers/synthetic/time.py +3 -3
- jerry_thomas-0.2.0/src/datapipeline/pipeline/pipelines.py +93 -0
- jerry_thomas-0.2.0/src/datapipeline/pipeline/stages.py +119 -0
- jerry_thomas-0.2.0/src/datapipeline/pipeline/utils/keygen.py +42 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/pipeline/utils/memory_sort.py +1 -1
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/pipeline/utils/ordering.py +0 -2
- jerry_thomas-0.2.0/src/datapipeline/pipeline/utils/transform_utils.py +55 -0
- jerry_thomas-0.2.0/src/datapipeline/plugins.py +21 -0
- jerry_thomas-0.2.0/src/datapipeline/registries/registries.py +15 -0
- jerry_thomas-0.2.0/src/datapipeline/registries/registry.py +28 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/services/bootstrap.py +50 -17
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/services/constants.py +2 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/services/factories.py +9 -5
- jerry_thomas-0.2.0/src/datapipeline/services/project_paths.py +75 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/services/scaffold/domain.py +6 -3
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/services/scaffold/mappers.py +2 -2
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/services/scaffold/plugin.py +5 -5
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/services/scaffold/source.py +15 -25
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/services/scaffold/templates.py +1 -5
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/sources/models/__init__.py +1 -3
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/sources/models/loader.py +1 -12
- jerry_thomas-0.2.0/src/datapipeline/sources/synthetic/time/parser.py +9 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/templates/plugin_skeleton/README.md +14 -11
- jerry_thomas-0.2.0/src/datapipeline/templates/plugin_skeleton/config/contracts/time_hour_sin.yaml +24 -0
- jerry_thomas-0.2.0/src/datapipeline/templates/plugin_skeleton/config/contracts/time_linear.yaml +23 -0
- jerry_thomas-0.2.0/src/datapipeline/templates/plugin_skeleton/config/datasets/default/dataset.yaml +29 -0
- {jerry_thomas-0.0.5/src/datapipeline/templates/plugin_skeleton/config → jerry_thomas-0.2.0/src/datapipeline/templates/plugin_skeleton/config/datasets/default}/project.yaml +3 -3
- {jerry_thomas-0.0.5/src/datapipeline/templates/plugin_skeleton/config/distilleries → jerry_thomas-0.2.0/src/datapipeline/templates/plugin_skeleton/config/sources}/time_ticks.yaml +2 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/templates/plugin_skeleton/pyproject.toml +2 -2
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/templates/stubs/dto.py.j2 +1 -2
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/templates/stubs/mapper.py.j2 +3 -4
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/templates/stubs/record.py.j2 +1 -1
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/templates/stubs/source.yaml.j2 +4 -0
- jerry_thomas-0.2.0/src/datapipeline/transforms/debug/identity.py +74 -0
- jerry_thomas-0.2.0/src/datapipeline/transforms/debug/lint.py +101 -0
- jerry_thomas-0.2.0/src/datapipeline/transforms/feature/model.py +12 -0
- jerry_thomas-0.0.5/src/datapipeline/transforms/transforms.py → jerry_thomas-0.2.0/src/datapipeline/transforms/feature/scaler.py +9 -67
- jerry_thomas-0.2.0/src/datapipeline/transforms/filter.py +57 -0
- jerry_thomas-0.2.0/src/datapipeline/transforms/record/floor_time.py +17 -0
- jerry_thomas-0.2.0/src/datapipeline/transforms/record/lag.py +18 -0
- jerry_thomas-0.2.0/src/datapipeline/transforms/sequence.py +84 -0
- jerry_thomas-0.2.0/src/datapipeline/transforms/stream/ensure_ticks.py +33 -0
- jerry_thomas-0.2.0/src/datapipeline/transforms/stream/fill.py +103 -0
- jerry_thomas-0.2.0/src/datapipeline/transforms/stream/granularity.py +92 -0
- jerry_thomas-0.2.0/src/datapipeline/transforms/utils.py +10 -0
- jerry_thomas-0.2.0/src/datapipeline/transforms/vector.py +226 -0
- jerry_thomas-0.2.0/src/datapipeline/transforms/vector_utils.py +84 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/utils/load.py +3 -1
- jerry_thomas-0.2.0/src/datapipeline/utils/paths.py +26 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/utils/time.py +6 -4
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0/src/jerry_thomas.egg-info}/PKG-INFO +153 -53
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/jerry_thomas.egg-info/SOURCES.txt +27 -11
- jerry_thomas-0.2.0/src/jerry_thomas.egg-info/entry_points.txt +39 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/jerry_thomas.egg-info/requires.txt +5 -2
- jerry_thomas-0.2.0/tests/test_config_pipeline.py +25 -0
- jerry_thomas-0.2.0/tests/test_regression_vectors.py +162 -0
- jerry_thomas-0.2.0/tests/test_transforms.py +189 -0
- jerry_thomas-0.2.0/tests/test_vector_analyzer.py +19 -0
- jerry_thomas-0.0.5/pyproject.toml +0 -92
- jerry_thomas-0.0.5/src/datapipeline/analysis/vector_analyzer.py +0 -49
- jerry_thomas-0.0.5/src/datapipeline/cli/app.py +0 -208
- jerry_thomas-0.0.5/src/datapipeline/cli/commands/analyze.py +0 -32
- jerry_thomas-0.0.5/src/datapipeline/cli/openers.py +0 -11
- jerry_thomas-0.0.5/src/datapipeline/config/catalog.py +0 -22
- jerry_thomas-0.0.5/src/datapipeline/config/dataset/feature.py +0 -24
- jerry_thomas-0.0.5/src/datapipeline/config/dataset/group_by.py +0 -31
- jerry_thomas-0.0.5/src/datapipeline/config/dataset/loader.py +0 -19
- jerry_thomas-0.0.5/src/datapipeline/config/dataset/normalize.py +0 -10
- jerry_thomas-0.0.5/src/datapipeline/domain/feature.py +0 -10
- jerry_thomas-0.0.5/src/datapipeline/domain/record.py +0 -20
- jerry_thomas-0.0.5/src/datapipeline/pipeline/pipelines.py +0 -46
- jerry_thomas-0.0.5/src/datapipeline/pipeline/stages.py +0 -64
- jerry_thomas-0.0.5/src/datapipeline/pipeline/utils/keygen.py +0 -20
- jerry_thomas-0.0.5/src/datapipeline/pipeline/utils/transform_utils.py +0 -120
- jerry_thomas-0.0.5/src/datapipeline/plugins.py +0 -7
- jerry_thomas-0.0.5/src/datapipeline/services/project_paths.py +0 -35
- jerry_thomas-0.0.5/src/datapipeline/sources/synthetic/time/parser.py +0 -9
- jerry_thomas-0.0.5/src/datapipeline/streams/canonical.py +0 -28
- jerry_thomas-0.0.5/src/datapipeline/streams/raw.py +0 -16
- jerry_thomas-0.0.5/src/datapipeline/templates/plugin_skeleton/config/contracts/time_hour_sin.yaml +0 -4
- jerry_thomas-0.0.5/src/datapipeline/templates/plugin_skeleton/config/contracts/time_linear.yaml +0 -4
- jerry_thomas-0.0.5/src/datapipeline/templates/plugin_skeleton/config/contracts/time_ticks.yaml +0 -2
- jerry_thomas-0.0.5/src/datapipeline/templates/plugin_skeleton/config/recipe.yaml +0 -17
- jerry_thomas-0.0.5/src/datapipeline/transforms/sequence.py +0 -31
- jerry_thomas-0.0.5/src/jerry_thomas.egg-info/entry_points.txt +0 -44
- jerry_thomas-0.0.5/tests/test_transforms.py +0 -76
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/LICENSE +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/setup.cfg +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/__init__.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/analysis/__init__.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/cli/commands/filter.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/cli/commands/list_.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/cli/commands/source.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/common/__init__.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/common/geo.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/config/__init__.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/domain/__init__.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/mappers/noop.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/parsers/identity.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/pipeline/__init__.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/services/entrypoints.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/services/paths.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/services/scaffold/__init__.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/services/scaffold/filter.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/sources/__init__.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/sources/composed_loader.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/sources/decoders.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/sources/factory.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/sources/models/base.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/sources/models/generator.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/sources/models/parser.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/sources/models/source.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/sources/models/synthetic.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/sources/synthetic/__init__.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/sources/synthetic/time/__init__.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/sources/synthetic/time/loader.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/sources/transports.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/templates/plugin_skeleton/src/{{PACKAGE_NAME}}/__init__.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/templates/stubs/filter.py.j2 +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/templates/stubs/loader_synthetic.py.j2 +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/templates/stubs/parser.py.j2 +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/templates/stubs/parser_custom.py.j2 +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/datapipeline/utils/__init__.py +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/jerry_thomas.egg-info/dependency_links.txt +0 -0
- {jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/src/jerry_thomas.egg-info/top_level.txt +0 -0
{jerry_thomas-0.0.5/src/jerry_thomas.egg-info → jerry_thomas-0.2.0}/PKG-INFO

@@ -1,22 +1,31 @@
 Metadata-Version: 2.4
 Name: jerry-thomas
-Version: 0.0.5
+Version: 0.2.0
 Summary: Jerry-Thomas: a stream-first, plugin-friendly data pipeline (mixology-themed CLI)
 Author: Anders Skott Lind
 License: MIT
-Requires-Python: >=3.
+Requires-Python: >=3.10
 Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: numpy<3.0,>=1.24
-Requires-Dist: pydantic>=
+Requires-Dist: pydantic>=2.0
 Requires-Dist: PyYAML>=5.4
 Requires-Dist: tqdm>=4.0
 Requires-Dist: jinja2>=3.0
-
+Provides-Extra: ml
+Requires-Dist: pandas>=2.0; extra == "ml"
+Requires-Dist: torch>=2.0; extra == "ml"
 Dynamic: license-file
 
 # Jerry Thomas
 
+Time‑Series First
+- This runtime is time‑series‑first. Every domain record must include a timezone‑aware `time` and a `value`.
+- Grouping is defined by time buckets only (`group_by.keys: [ { type: time, ... } ]`).
+- Feature streams are sorted by time; sequence transforms assume ordered series.
+- Categorical dimensions (e.g., station, zone, ticker) belong in `partition_by` so they become partitions of the same time series.
+- Non‑temporal grouping is not supported.
+
 Jerry Thomas turns the datapipeline runtime into a cocktail program. You still install the
 same Python package (`datapipeline`) and tap into the plugin architecture, but every CLI
 dance step nods to a craft bar. Declarative YAML menus describe projects, sources and
@@ -59,11 +68,29 @@ raw source → canonical stream → record stage → feature stage → vector st
 | `src/datapipeline/services` | Bootstrapping (project loading, YAML interpolation), runtime factories and scaffolding helpers for new bar tools (`services/bootstrap.py`, `services/factories.py`, `services/scaffold/plugin.py`). |
 | `src/datapipeline/pipeline` | Pure functions that build record/feature/vector iterators plus supporting utilities for ordering and transform wiring (`pipeline/pipelines.py`, `pipeline/utils/transform_utils.py`). |
 | `src/datapipeline/domain` | Data structures representing records, feature records and vectors coming off the line (`domain/record.py`, `domain/feature.py`, `domain/vector.py`). |
-| `src/datapipeline/transforms` & `src/datapipeline/filters` | Built-in transforms (lagging timestamps, sliding windows) and filter helpers exposed through entry points (`transforms/transforms.py`, `transforms/sequence.py`, `filters/filters.py`).
+| `src/datapipeline/transforms` & `src/datapipeline/filters` | Built-in transforms (lagging timestamps, scaling, sliding windows) and filter helpers exposed through entry points (`transforms/record.py`, `transforms/feature.py`, `transforms/sequence.py`, `filters/filters.py`). |
 | `src/datapipeline/sources/synthetic/time` | Example synthetic time-series loader/parser pair plus helper mappers for experimentation while the real spirits arrive (`sources/synthetic/time/loader.py`, `sources/synthetic/time/parser.py`, `mappers/synthetic/time.py`). |
 
 ---
 
+## Built-in DSL identifiers
+
+The YAML DSL resolves filters and transforms by entry-point name. These ship with the
+template out of the box:
+
+| Kind | Identifiers | Notes |
+| ----------------- | ----------------------------------------------------------------------------------------------- | ----- |
+| Filters | `eq`/`equals`, `ne`/`not_equal`, `lt`, `le`, `gt`, `ge`, `in`/`contains`, `nin`/`not_in` | Use as `- gt: { field: value }` or `- in: { field: [values...] }`. Synonyms map to the same implementation. |
+| Record transforms | `time_lag`, `drop_missing` | `time_lag` expects a duration string (e.g. `1h`), `drop_missing` removes `None`/`NaN` records. |
+| Feature transforms| `standard_scale` | Options: `with_mean`, `with_std`, optional `statistics`. |
+| Sequence transforms | `time_window`, `time_fill_mean`, `time_fill_median` | `time_window` builds sliding windows; the fill transforms impute missing values from running mean/median with optional `window`/`min_samples`. |
+| Vector transforms | `fill_history`, `fill_horizontal`, `fill_constant`, `drop_missing` | History fill uses prior buckets, horizontal fill aggregates sibling partitions, constant sets a default, and drop removes vectors below coverage thresholds. |
+
+Extend `pyproject.toml` with additional entry points to register custom logic under your
+own identifiers.
+
+---
+
 ## Opening the bar
 
 ### 1. Install the tools
@@ -86,17 +113,17 @@ python -c "import datapipeline; print('bar ready')"
 
 ### 2. Draft your bar book
 
-Create a `config/project.yaml` so the runtime knows where to find
-and the tasting menu. Globals are optional but handy for sharing
-interpolated into downstream YAML specs during bootstrap
+Create a `config/recipes/<name>/project.yaml` so the runtime knows where to find
+ingredients, infusions and the tasting menu. Globals are optional but handy for sharing
+values—they are interpolated into downstream YAML specs during bootstrap
 (`src/datapipeline/config/project.py`, `src/datapipeline/services/bootstrap.py`).
 
 ```yaml
 version: 1
 paths:
-  sources:
-  streams:
-  dataset:
+  sources: ../../sources
+  streams: ../../contracts
+  dataset: dataset.yaml
 globals:
   opening_time: "2024-01-01T16:00:00Z"
   last_call: "2024-01-02T02:00:00Z"
@@ -107,13 +134,13 @@ globals:
 
 ### 3. Stock the bottles (raw sources)
 
-Create `config/
+Create `config/sources/<alias>.yaml` files. Each must expose a `parser` and `loader`
 pointing at entry points plus any constructor arguments
 (`src/datapipeline/services/bootstrap.py`). Here is a synthetic clock source that feels
 like a drip of barrel-aged bitters:
 
 ```yaml
-# config/
+# config/sources/time_ticks.yaml
 parser:
   entrypoint: "synthetic.time"
   args: {}
@@ -145,7 +172,7 @@ mapper:
   mode: spritz
 ```
 
-The mapper uses the provided mode to create a new `
+The mapper uses the provided mode to create a new `TimeSeriesRecord` stream ready for the
 feature stage (`mappers/synthetic/time.py`).
 
 ### 5. Script the tasting menu (dataset)
@@ -155,28 +182,53 @@ are grouped (`src/datapipeline/config/dataset/dataset.py`). A minimal hourly men
 look like:
 
 ```yaml
-# config/
+# config/recipes/default/dataset.yaml
 group_by:
   keys:
     - type: time
       field: time
       resolution: 1h
 features:
-  -
-
-    partition_by: null
-    filters: []
+  - id: hour_spritz
+    stream: time.encode
     transforms:
-      -
+      - record:
+          transform: time_lag
+          args: 0h
+      - feature:
+          transform: standard_scale
+          with_mean: true
+          with_std: true
+      - sequence:
+          transform: time_window
+          size: 4
+          stride: 1
+      - sequence:
+          transform: time_fill_mean
+          window: 24
+          min_samples: 6
 ```
 
 Use the sample `dataset` template as a starting point if you prefer scaffolding before
-pouring concrete values. Group keys
-requested resolution)
-
-`
-
-
+pouring concrete values. Group keys now require explicit time bucketing (with automatic
+flooring to the requested resolution) so every pipeline is clock-driven. You can attach
+feature or sequence transforms—such as the sliding `TimeWindowTransformer` or the
+`time_fill_mean`/`time_fill_median` imputers—directly in the YAML by referencing their
+entry point names (`src/datapipeline/transforms/sequence.py`).
+
+When vectors are assembled you can optionally apply `vector_transforms` to enforce schema
+guarantees. The built-ins cover:
+
+- `fill_history` – use running means/medians from prior buckets (per partition) with
+  configurable window/minimum samples.
+- `fill_horizontal` – aggregate sibling partitions at the same timestamp (e.g. other
+  stations) using mean/median.
+- `fill_constant` – provide a constant default for missing features/partitions.
+- `drop_missing` – drop vectors that fall below a coverage threshold or omit required
+  features.
+
+Transforms accept either an explicit `expected` list or a manifest path to discover the
+full partition set (`build/partitions.json` produced by `jerry inspect partitions`).
 
 Once the book is ready, run the bootstrapper (the CLI does this automatically) to
 materialize all registered sources and streams
@@ -189,9 +241,9 @@ materialize all registered sources and streams
 ### Prep any station (with visuals)
 
 ```bash
-jerry prep pour --project config/project.yaml --limit 20
-jerry prep build --project config/project.yaml --limit 20
-jerry prep stir --project config/project.yaml --limit 20
+jerry prep pour --project config/datasets/default/project.yaml --limit 20
+jerry prep build --project config/datasets/default/project.yaml --limit 20
+jerry prep stir --project config/datasets/default/project.yaml --limit 20
 ```
 
 - `prep pour` shows the record-stage ingredients headed for each feature.
@@ -208,34 +260,79 @@ loaders. The CLI wires up `build_record_pipeline`, `build_feature_pipeline` and
 ### Serve the flights (production mode)
 
 ```bash
-jerry serve --project config/project.yaml --output print
-jerry serve --project config/project.yaml --output stream
-jerry serve --project config/project.yaml --output exports/batch.pt
+jerry serve --project config/datasets/default/project.yaml --output print
+jerry serve --project config/datasets/default/project.yaml --output stream
+jerry serve --project config/datasets/default/project.yaml --output exports/batch.pt
 ```
 
 Production mode skips the bar flair and focuses on throughput. `print` writes tasting
 notes to stdout, `stream` emits newline-delimited JSON (with values coerced to strings when
 necessary), and a `.pt` destination stores a pickle-compatible payload for later pours.
 
-
-
-
-
+## Funnel vectors into ML projects
+
+Data scientists rarely want to shell out to the CLI; they need a programmatic
+hand-off that plugs vectors straight into notebooks, feature stores or training
+loops. The `datapipeline.integrations` package wraps the existing iterator
+builders with ML-friendly adapters without pulling pandas or torch into the
+core runtime.
+
+```python
+from datapipeline.integrations import (
+    VectorAdapter,
+    dataframe_from_vectors,
+    iter_vector_rows,
+    torch_dataset,
+)
+
+# Bootstrap once and stream ready-to-use rows.
+adapter = VectorAdapter.from_project("config/project.yaml")
+for row in adapter.iter_rows(limit=32, flatten_sequences=True):
+    send_to_feature_store(row)
+
+# Helper functions cover ad-hoc jobs as well.
+rows = iter_vector_rows(
+    "config/project.yaml",
+    include_group=True,
+    group_format="mapping",
+    flatten_sequences=True,
+)
+
+# Optional extras materialize into common ML containers if installed.
+df = dataframe_from_vectors("config/project.yaml")  # Requires pandas
+dataset = torch_dataset("config/project.yaml", dtype=torch.float32)  # Requires torch
 ```
 
-
-
-
-
+Everything still flows through `build_vector_pipeline`; the integration layer
+normalizes group keys, optionally flattens sequence features and demonstrates
+how to turn the iterator into DataFrames or `torch.utils.data.Dataset`
+instances. ML teams can fork the same pattern for their own stacks—Spark, NumPy
+or feature store SDKs—without adding opinionated glue to the runtime itself.
+
+### Inspect the balance (vector quality)
+
+Use the inspect helpers for different outputs:
+
+- `jerry inspect report --project config/datasets/default/project.yaml` — print a
+  human-readable quality report (totals, keep/below lists, optional partition detail).
+- `jerry inspect coverage --project config/datasets/default/project.yaml` — persist the
+  coverage summary to `build/coverage.json` (keep/below feature and partition lists plus
  coverage percentages).
+- `jerry inspect matrix --project config/datasets/default/project.yaml --format html` —
+  export availability matrices (CSV or HTML) for deeper analysis.
+- `jerry inspect partitions --project config/datasets/default/project.yaml` — write the
+  observed partition manifest to `build/partitions.json` for use in configs.
+
+Note: `jerry prep taste` has been removed; use `jerry inspect report` and friends.
 
 ---
 
-## Extending the
+## Extending the CLI
 
 ### Scaffold a plugin package
 
 ```bash
-jerry
+jerry plugin init --name my_datapipeline --out .
 ```
 
 The generator copies a ready-made skeleton (pyproject, README, package directory) and
@@ -249,25 +346,29 @@ transforms.
 Use the CLI helpers to scaffold boilerplate code in your plugin workspace:
 
 ```bash
-jerry
-jerry
-jerry contract
+jerry source add --provider dmi --dataset metobs --transport fs --format csv
+jerry domain add --domain metobs
+jerry contract
 ```
 
-The
-YAML file in `config/
+The source command writes DTO/parser stubs, updates entry points and drops a matching
+YAML file in `config/sources/` pre-filled with composed-loader defaults for the chosen
 transport (`src/datapipeline/cli/app.py`, `src/datapipeline/services/scaffold/source.py`).
+`jerry domain add` now always scaffolds `TimeSeriesRecord` domains so every mapper carries
+an explicit timestamp alongside its value, and `jerry contract` wires that source/domain
+pair up for canonical stream generation.
 
 ### Add custom filters or transforms
 
 Register new functions/classes under the appropriate entry point group in your plugin’s
-`pyproject.toml`. The runtime resolves them through `load_ep`, applies record
-
-
+`pyproject.toml`. The runtime resolves them through `load_ep`, applies record filters first,
+then record/feature/sequence transforms in the order declared in the dataset config
+(`pyproject.toml`, `src/datapipeline/utils/load.py`,
 `src/datapipeline/pipeline/utils/transform_utils.py`). Built-in helpers cover common
 comparisons (including timezone-aware checks) and time-based transforms (lags, sliding
 windows) if you need quick wins (`src/datapipeline/filters/filters.py`,
-`src/datapipeline/transforms/
+`src/datapipeline/transforms/record.py`, `src/datapipeline/transforms/feature.py`,
+`src/datapipeline/transforms/sequence.py`).
 
 ### Prototype with synthetic time-series data
 
@@ -285,8 +386,7 @@ transform to build sliding-window feature flights without external datasets
 
 | Type | Description |
 | ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `
-| `TimeFeatureRecord` | A record with a timezone-aware `time` attribute, normalized to UTC to avoid boundary issues (`src/datapipeline/domain/record.py`). |
+| `TimeSeriesRecord` | Canonical record with `time` (tz-aware, normalized to UTC) and `value`; the pipeline treats streams as ordered series (`src/datapipeline/domain/record.py`).|
 | `FeatureRecord` | Links a record (or list of records from sequence transforms) to a `feature_id` and `group_key` (`src/datapipeline/domain/feature.py`). |
 | `Vector` | Final grouped payload: a mapping of feature IDs to scalars or ordered lists plus helper methods for shape/key access (`src/datapipeline/domain/vector.py`). |
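Two details are easy to miss in the ML hand-off snippet inside the diff above: the last line uses `torch.float32` without importing `torch`, and `send_to_feature_store` is a stand-in for user code. A self-contained sketch of the same flow, assuming the `datapipeline.integrations` helpers behave as the README describes and using the project path from the CLI examples; the pandas/torch lines need the new `ml` extra (e.g. `pip install "jerry-thomas[ml]"`):

```python
# Sketch only: helper names come from the README diff above; the project path and
# the print() stand-in for a feature-store write are assumptions, not package code.
import torch

from datapipeline.integrations import (
    dataframe_from_vectors,
    iter_vector_rows,
    torch_dataset,
)

PROJECT = "config/datasets/default/project.yaml"

# Stream flattened rows with the time-bucket group key attached as a mapping.
for row in iter_vector_rows(
    PROJECT,
    include_group=True,
    group_format="mapping",
    flatten_sequences=True,
):
    print(row)  # stand-in for send_to_feature_store(row)

df = dataframe_from_vectors(PROJECT)                   # needs pandas (ml extra)
dataset = torch_dataset(PROJECT, dtype=torch.float32)  # needs torch (ml extra)
```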
{jerry_thomas-0.0.5 → jerry_thomas-0.2.0}/README.md

@@ -1,5 +1,12 @@
 # Jerry Thomas
 
+Time‑Series First
+- This runtime is time‑series‑first. Every domain record must include a timezone‑aware `time` and a `value`.
+- Grouping is defined by time buckets only (`group_by.keys: [ { type: time, ... } ]`).
+- Feature streams are sorted by time; sequence transforms assume ordered series.
+- Categorical dimensions (e.g., station, zone, ticker) belong in `partition_by` so they become partitions of the same time series.
+- Non‑temporal grouping is not supported.
+
 Jerry Thomas turns the datapipeline runtime into a cocktail program. You still install the
 same Python package (`datapipeline`) and tap into the plugin architecture, but every CLI
 dance step nods to a craft bar. Declarative YAML menus describe projects, sources and
@@ -42,11 +49,29 @@ raw source → canonical stream → record stage → feature stage → vector st
 | `src/datapipeline/services` | Bootstrapping (project loading, YAML interpolation), runtime factories and scaffolding helpers for new bar tools (`services/bootstrap.py`, `services/factories.py`, `services/scaffold/plugin.py`). |
 | `src/datapipeline/pipeline` | Pure functions that build record/feature/vector iterators plus supporting utilities for ordering and transform wiring (`pipeline/pipelines.py`, `pipeline/utils/transform_utils.py`). |
 | `src/datapipeline/domain` | Data structures representing records, feature records and vectors coming off the line (`domain/record.py`, `domain/feature.py`, `domain/vector.py`). |
-| `src/datapipeline/transforms` & `src/datapipeline/filters` | Built-in transforms (lagging timestamps, sliding windows) and filter helpers exposed through entry points (`transforms/transforms.py`, `transforms/sequence.py`, `filters/filters.py`).
+| `src/datapipeline/transforms` & `src/datapipeline/filters` | Built-in transforms (lagging timestamps, scaling, sliding windows) and filter helpers exposed through entry points (`transforms/record.py`, `transforms/feature.py`, `transforms/sequence.py`, `filters/filters.py`). |
 | `src/datapipeline/sources/synthetic/time` | Example synthetic time-series loader/parser pair plus helper mappers for experimentation while the real spirits arrive (`sources/synthetic/time/loader.py`, `sources/synthetic/time/parser.py`, `mappers/synthetic/time.py`). |
 
 ---
 
+## Built-in DSL identifiers
+
+The YAML DSL resolves filters and transforms by entry-point name. These ship with the
+template out of the box:
+
+| Kind | Identifiers | Notes |
+| ----------------- | ----------------------------------------------------------------------------------------------- | ----- |
+| Filters | `eq`/`equals`, `ne`/`not_equal`, `lt`, `le`, `gt`, `ge`, `in`/`contains`, `nin`/`not_in` | Use as `- gt: { field: value }` or `- in: { field: [values...] }`. Synonyms map to the same implementation. |
+| Record transforms | `time_lag`, `drop_missing` | `time_lag` expects a duration string (e.g. `1h`), `drop_missing` removes `None`/`NaN` records. |
+| Feature transforms| `standard_scale` | Options: `with_mean`, `with_std`, optional `statistics`. |
+| Sequence transforms | `time_window`, `time_fill_mean`, `time_fill_median` | `time_window` builds sliding windows; the fill transforms impute missing values from running mean/median with optional `window`/`min_samples`. |
+| Vector transforms | `fill_history`, `fill_horizontal`, `fill_constant`, `drop_missing` | History fill uses prior buckets, horizontal fill aggregates sibling partitions, constant sets a default, and drop removes vectors below coverage thresholds. |
+
+Extend `pyproject.toml` with additional entry points to register custom logic under your
+own identifiers.
+
+---
+
 ## Opening the bar
 
 ### 1. Install the tools
@@ -69,17 +94,17 @@ python -c "import datapipeline; print('bar ready')"
 
 ### 2. Draft your bar book
 
-Create a `config/project.yaml` so the runtime knows where to find
-and the tasting menu. Globals are optional but handy for sharing
-interpolated into downstream YAML specs during bootstrap
+Create a `config/recipes/<name>/project.yaml` so the runtime knows where to find
+ingredients, infusions and the tasting menu. Globals are optional but handy for sharing
+values—they are interpolated into downstream YAML specs during bootstrap
 (`src/datapipeline/config/project.py`, `src/datapipeline/services/bootstrap.py`).
 
 ```yaml
 version: 1
 paths:
-  sources:
-  streams:
-  dataset:
+  sources: ../../sources
+  streams: ../../contracts
+  dataset: dataset.yaml
 globals:
   opening_time: "2024-01-01T16:00:00Z"
   last_call: "2024-01-02T02:00:00Z"
@@ -90,13 +115,13 @@ globals:
 
 ### 3. Stock the bottles (raw sources)
 
-Create `config/
+Create `config/sources/<alias>.yaml` files. Each must expose a `parser` and `loader`
 pointing at entry points plus any constructor arguments
 (`src/datapipeline/services/bootstrap.py`). Here is a synthetic clock source that feels
 like a drip of barrel-aged bitters:
 
 ```yaml
-# config/
+# config/sources/time_ticks.yaml
 parser:
   entrypoint: "synthetic.time"
   args: {}
@@ -128,7 +153,7 @@ mapper:
   mode: spritz
 ```
 
-The mapper uses the provided mode to create a new `
+The mapper uses the provided mode to create a new `TimeSeriesRecord` stream ready for the
 feature stage (`mappers/synthetic/time.py`).
 
 ### 5. Script the tasting menu (dataset)
@@ -138,28 +163,53 @@ are grouped (`src/datapipeline/config/dataset/dataset.py`). A minimal hourly men
 look like:
 
 ```yaml
-# config/
+# config/recipes/default/dataset.yaml
 group_by:
   keys:
     - type: time
       field: time
      resolution: 1h
 features:
-  -
-
-    partition_by: null
-    filters: []
+  - id: hour_spritz
+    stream: time.encode
     transforms:
-      -
+      - record:
+          transform: time_lag
+          args: 0h
+      - feature:
+          transform: standard_scale
+          with_mean: true
+          with_std: true
+      - sequence:
+          transform: time_window
+          size: 4
+          stride: 1
+      - sequence:
+          transform: time_fill_mean
+          window: 24
+          min_samples: 6
 ```
 
 Use the sample `dataset` template as a starting point if you prefer scaffolding before
-pouring concrete values. Group keys
-requested resolution)
-
-`
-
-
+pouring concrete values. Group keys now require explicit time bucketing (with automatic
+flooring to the requested resolution) so every pipeline is clock-driven. You can attach
+feature or sequence transforms—such as the sliding `TimeWindowTransformer` or the
+`time_fill_mean`/`time_fill_median` imputers—directly in the YAML by referencing their
+entry point names (`src/datapipeline/transforms/sequence.py`).
+
+When vectors are assembled you can optionally apply `vector_transforms` to enforce schema
+guarantees. The built-ins cover:
+
+- `fill_history` – use running means/medians from prior buckets (per partition) with
+  configurable window/minimum samples.
+- `fill_horizontal` – aggregate sibling partitions at the same timestamp (e.g. other
+  stations) using mean/median.
+- `fill_constant` – provide a constant default for missing features/partitions.
+- `drop_missing` – drop vectors that fall below a coverage threshold or omit required
+  features.
+
+Transforms accept either an explicit `expected` list or a manifest path to discover the
+full partition set (`build/partitions.json` produced by `jerry inspect partitions`).
 
 Once the book is ready, run the bootstrapper (the CLI does this automatically) to
 materialize all registered sources and streams
@@ -172,9 +222,9 @@ materialize all registered sources and streams
 ### Prep any station (with visuals)
 
 ```bash
-jerry prep pour --project config/project.yaml --limit 20
-jerry prep build --project config/project.yaml --limit 20
-jerry prep stir --project config/project.yaml --limit 20
+jerry prep pour --project config/datasets/default/project.yaml --limit 20
+jerry prep build --project config/datasets/default/project.yaml --limit 20
+jerry prep stir --project config/datasets/default/project.yaml --limit 20
 ```
 
 - `prep pour` shows the record-stage ingredients headed for each feature.
@@ -191,34 +241,79 @@ loaders. The CLI wires up `build_record_pipeline`, `build_feature_pipeline` and
 ### Serve the flights (production mode)
 
 ```bash
-jerry serve --project config/project.yaml --output print
-jerry serve --project config/project.yaml --output stream
-jerry serve --project config/project.yaml --output exports/batch.pt
+jerry serve --project config/datasets/default/project.yaml --output print
+jerry serve --project config/datasets/default/project.yaml --output stream
+jerry serve --project config/datasets/default/project.yaml --output exports/batch.pt
 ```
 
 Production mode skips the bar flair and focuses on throughput. `print` writes tasting
 notes to stdout, `stream` emits newline-delimited JSON (with values coerced to strings when
 necessary), and a `.pt` destination stores a pickle-compatible payload for later pours.
 
-
-
-
-
+## Funnel vectors into ML projects
+
+Data scientists rarely want to shell out to the CLI; they need a programmatic
+hand-off that plugs vectors straight into notebooks, feature stores or training
+loops. The `datapipeline.integrations` package wraps the existing iterator
+builders with ML-friendly adapters without pulling pandas or torch into the
+core runtime.
+
+```python
+from datapipeline.integrations import (
+    VectorAdapter,
+    dataframe_from_vectors,
+    iter_vector_rows,
+    torch_dataset,
+)
+
+# Bootstrap once and stream ready-to-use rows.
+adapter = VectorAdapter.from_project("config/project.yaml")
+for row in adapter.iter_rows(limit=32, flatten_sequences=True):
+    send_to_feature_store(row)
+
+# Helper functions cover ad-hoc jobs as well.
+rows = iter_vector_rows(
+    "config/project.yaml",
+    include_group=True,
+    group_format="mapping",
+    flatten_sequences=True,
+)
+
+# Optional extras materialize into common ML containers if installed.
+df = dataframe_from_vectors("config/project.yaml")  # Requires pandas
+dataset = torch_dataset("config/project.yaml", dtype=torch.float32)  # Requires torch
 ```
 
-
-
-
-
+Everything still flows through `build_vector_pipeline`; the integration layer
+normalizes group keys, optionally flattens sequence features and demonstrates
+how to turn the iterator into DataFrames or `torch.utils.data.Dataset`
+instances. ML teams can fork the same pattern for their own stacks—Spark, NumPy
+or feature store SDKs—without adding opinionated glue to the runtime itself.
+
+### Inspect the balance (vector quality)
+
+Use the inspect helpers for different outputs:
+
+- `jerry inspect report --project config/datasets/default/project.yaml` — print a
+  human-readable quality report (totals, keep/below lists, optional partition detail).
+- `jerry inspect coverage --project config/datasets/default/project.yaml` — persist the
+  coverage summary to `build/coverage.json` (keep/below feature and partition lists plus
+  coverage percentages).
+- `jerry inspect matrix --project config/datasets/default/project.yaml --format html` —
+  export availability matrices (CSV or HTML) for deeper analysis.
+- `jerry inspect partitions --project config/datasets/default/project.yaml` — write the
+  observed partition manifest to `build/partitions.json` for use in configs.
+
+Note: `jerry prep taste` has been removed; use `jerry inspect report` and friends.
 
 ---
 
-## Extending the
+## Extending the CLI
 
 ### Scaffold a plugin package
 
 ```bash
-jerry
+jerry plugin init --name my_datapipeline --out .
 ```
 
 The generator copies a ready-made skeleton (pyproject, README, package directory) and
@@ -232,25 +327,29 @@ transforms.
 Use the CLI helpers to scaffold boilerplate code in your plugin workspace:
 
 ```bash
-jerry
-jerry
-jerry contract
+jerry source add --provider dmi --dataset metobs --transport fs --format csv
+jerry domain add --domain metobs
+jerry contract
 ```
 
-The
-YAML file in `config/
+The source command writes DTO/parser stubs, updates entry points and drops a matching
+YAML file in `config/sources/` pre-filled with composed-loader defaults for the chosen
 transport (`src/datapipeline/cli/app.py`, `src/datapipeline/services/scaffold/source.py`).
+`jerry domain add` now always scaffolds `TimeSeriesRecord` domains so every mapper carries
+an explicit timestamp alongside its value, and `jerry contract` wires that source/domain
+pair up for canonical stream generation.
 
 ### Add custom filters or transforms
 
 Register new functions/classes under the appropriate entry point group in your plugin’s
-`pyproject.toml`. The runtime resolves them through `load_ep`, applies record
-
-
+`pyproject.toml`. The runtime resolves them through `load_ep`, applies record filters first,
+then record/feature/sequence transforms in the order declared in the dataset config
+(`pyproject.toml`, `src/datapipeline/utils/load.py`,
 `src/datapipeline/pipeline/utils/transform_utils.py`). Built-in helpers cover common
 comparisons (including timezone-aware checks) and time-based transforms (lags, sliding
 windows) if you need quick wins (`src/datapipeline/filters/filters.py`,
-`src/datapipeline/transforms/
+`src/datapipeline/transforms/record.py`, `src/datapipeline/transforms/feature.py`,
+`src/datapipeline/transforms/sequence.py`).
 
 ### Prototype with synthetic time-series data
 
@@ -268,8 +367,7 @@ transform to build sliding-window feature flights without external datasets
 
 | Type | Description |
 | ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `
-| `TimeFeatureRecord` | A record with a timezone-aware `time` attribute, normalized to UTC to avoid boundary issues (`src/datapipeline/domain/record.py`). |
+| `TimeSeriesRecord` | Canonical record with `time` (tz-aware, normalized to UTC) and `value`; the pipeline treats streams as ordered series (`src/datapipeline/domain/record.py`).|
 | `FeatureRecord` | Links a record (or list of records from sequence transforms) to a `feature_id` and `group_key` (`src/datapipeline/domain/feature.py`). |
 | `Vector` | Final grouped payload: a mapping of feature IDs to scalars or ordered lists plus helper methods for shape/key access (`src/datapipeline/domain/vector.py`). |
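The domain-type table that closes both diffs describes three shapes. A minimal illustrative sketch of those shapes follows; the real definitions live under `src/datapipeline/domain/`, and anything here beyond the fields the table names (`time`, `value`, `feature_id`, `group_key`) is an assumption:

```python
# Illustrative only: shapes inferred from the domain-type table, not the
# package's actual classes (see src/datapipeline/domain/ for those).
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Union

@dataclass
class TimeSeriesRecord:
    time: datetime  # timezone-aware; the table says it is normalized to UTC
    value: float

    def __post_init__(self) -> None:
        # Normalizing to UTC keeps hourly bucket keys stable across offsets.
        self.time = self.time.astimezone(timezone.utc)

@dataclass
class FeatureRecord:
    feature_id: str
    group_key: datetime  # the floored time bucket from group_by
    # A single record, or an ordered list produced by a sequence transform.
    records: Union[TimeSeriesRecord, List[TimeSeriesRecord]]

# A Vector maps feature IDs to scalars or ordered lists for one group key, e.g.:
vector_payload = {"hour_spritz": [0.1, 0.4, 0.9, 0.7]}
```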