podstack 1.3.12.tar.gz → 1.3.13.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {podstack-1.3.12 → podstack-1.3.13}/PKG-INFO +307 -8
- podstack-1.3.13/README.md +714 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack/registry/client.py +158 -38
- {podstack-1.3.12 → podstack-1.3.13}/podstack.egg-info/PKG-INFO +307 -8
- {podstack-1.3.12 → podstack-1.3.13}/pyproject.toml +1 -1
- podstack-1.3.12/README.md +0 -415
- {podstack-1.3.12 → podstack-1.3.13}/LICENSE +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack/__init__.py +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack/annotations.py +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack/client.py +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack/exceptions.py +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack/execution.py +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack/gpu_runner.py +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack/models.py +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack/notebook.py +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack/registry/__init__.py +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack/registry/autolog.py +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack/registry/exceptions.py +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack/registry/experiment.py +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack/registry/model.py +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack/registry/model_utils.py +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack.egg-info/SOURCES.txt +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack.egg-info/dependency_links.txt +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack.egg-info/requires.txt +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack.egg-info/top_level.txt +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack_gpu/__init__.py +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack_gpu/app.py +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack_gpu/exceptions.py +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack_gpu/image.py +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack_gpu/runner.py +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack_gpu/secret.py +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack_gpu/utils.py +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/podstack_gpu/volume.py +0 -0
- {podstack-1.3.12 → podstack-1.3.13}/setup.cfg +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: podstack
-Version: 1.3.12
+Version: 1.3.13
 Summary: Official Python SDK for Podstack GPU Notebook Platform
 Author-email: Podstack <support@podstack.ai>
 License-Expression: MIT
@@ -302,13 +302,13 @@ with registry.start_run(name="training-v1") as run:
     registry.log_artifact("model.pt", "model")
     registry.log_artifact("training_curves.png", "plots")
 
-    # Log dataset
-    registry.log_dataset(
-
-
-
-
-    )
+    # Log dataset provenance (first-class resource, deduped by content hash)
+    registry.log_dataset("imdb-reviews", path="data/imdb.csv", context="training")
+
+    # Or pass a DataFrame — schema and row/feature counts are auto-computed
+    import pandas as pd
+    df = pd.read_csv("data/imdb.csv")
+    registry.log_dataset("imdb-reviews", df=df, context="training")
 ```
 
 ### Log and Load Models
@@ -360,6 +360,305 @@ runs = registry.search_runs(
 )
 ```
 
+### Dataset Tracking & Lineage
+
+Podstack tracks datasets as first-class resources, linking them to runs and model versions so you can always answer *"what data was this model trained on?"*
+
+The lineage chain is:
+
+```
+Dataset(s) ──[logged to]──▶ Run ──[run_id]──▶ ModelVersion
+```
+
+#### `log_dataset()` — log a dataset to the active run
+
+```python
+dataset = registry.log_dataset(
+    name="imdb-reviews",    # required — human-readable name
+    path="data/imdb.csv",   # local path or URI (s3://, gcs://, https://)
+    context="training",     # "training" | "validation" | "test" (default: "training")
+)
+```
+
+The dataset is stored as a **project-level resource** and linked to the current run.
+Subsequent calls with the same file produce the same dataset record — no duplicates.
+
+**Auto-enrichment from a local file:**
+
+```python
+# SHA-256 digest is computed automatically for files ≤ 500 MB.
+# This enables deduplication across runs — if two runs use the exact
+# same file, they share one Dataset record in the registry.
+dataset = registry.log_dataset("imdb-reviews", path="data/imdb.csv")
+print(dataset.digest)  # "a3f2c1..." — hex SHA-256
+```
+
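For files above the 500 MB auto-hash cutoff, one option the diff leaves implicit is computing the digest yourself and passing it through the `digest` parameter documented in the table below. A sketch using Python's standard `hashlib`, assuming the registry's digest is a plain SHA-256 over the raw file bytes (the comments above suggest this, but the SDK's hashing code is not part of this diff); the dataset name and path are hypothetical example values:

```python
import hashlib

# Sketch: stream-hash a large file in 1 MiB chunks so the digest can be
# passed explicitly instead of relying on auto-hashing (≤ 500 MB only).
def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical dataset; `digest` is a documented log_dataset() parameter.
dataset = registry.log_dataset(
    "clickstream-2024",
    path="data/clickstream.parquet",
    digest=sha256_of("data/clickstream.parquet"),
)
```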
+**Auto-enrichment from a pandas DataFrame:**
+
+```python
+import pandas as pd
+
+df = pd.read_csv("data/imdb.csv")
+
+dataset = registry.log_dataset(
+    name="imdb-reviews",
+    df=df,
+    context="training",
+)
+# schema and profile are computed automatically:
+print(dataset.schema)   # {"text": "object", "label": "int64"}
+print(dataset.profile)  # {"num_rows": 50000, "num_features": 2}
+```
+
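The code that computes these values is not shown in this diff, but the documented fields map directly onto pandas metadata. A rough equivalent of what `schema` and `profile` end up containing (an illustration, not the SDK's implementation):

```python
import pandas as pd

df = pd.read_csv("data/imdb.csv")

# Column -> dtype mapping, shaped like the documented `schema` field
schema = {col: str(dtype) for col, dtype in df.dtypes.items()}

# Row/feature counts, shaped like the documented `profile` field
profile = {"num_rows": len(df), "num_features": df.shape[1]}
```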
+**Pass both `path` and `df`** to get digest dedup *and* schema inference:
+
+```python
+dataset = registry.log_dataset("imdb-reviews", path="data/imdb.csv", df=df)
+```
+
+**All parameters:**
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `name` | `str` | required | Human-readable dataset name |
+| `path` | `str` | `None` | Local file path or URI (`s3://`, `gcs://`, `https://`) |
+| `df` | `DataFrame` | `None` | pandas DataFrame — schema and profile auto-computed |
+| `context` | `str` | `"training"` | Role of the dataset: `"training"`, `"validation"`, or `"test"` |
+| `digest` | `str` | `None` | SHA-256 hex digest. Computed from `path` if not provided |
+| `source_type` | `str` | `"local"` | Storage backend: `"local"`, `"s3"`, `"gcs"`, `"url"` |
+| `tags` | `dict` | `None` | Arbitrary string key-value tags |
+
+**Returns:** `Dataset` object with fields:
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `id` | `str` | UUID of the dataset record |
+| `name` | `str` | Dataset name |
+| `digest` | `str` | SHA-256 hex digest (empty if not computed) |
+| `source_type` | `str` | Storage backend |
+| `source` | `str` | File path or URI |
+| `schema` | `dict` | Column → dtype mapping |
+| `profile` | `dict` | `num_rows`, `num_features`, and any other stats |
+| `tags` | `dict` | Tags dict |
+| `created_at` | `str` | ISO 8601 timestamp |
+
+**Via the `Run` object** (equivalent to calling `registry.log_dataset()`):
+
+```python
+with registry.start_run("training-v1") as run:
+    dataset = run.log_dataset("imdb-reviews", df=df, context="training")
+```
+
+#### Multiple datasets per run
+
+Log validation and test sets alongside the training set:
+
+```python
+with registry.start_run("bert-finetune") as run:
+    run.log_dataset("imdb-train", df=train_df, context="training")
+    run.log_dataset("imdb-val", df=val_df, context="validation")
+    run.log_dataset("imdb-test", df=test_df, context="test")
+```
+
+#### `get_run_datasets()` — retrieve datasets logged to a run
+
+Returns every `Dataset` object linked to a run, in the order they were logged.
+
+```python
+datasets = registry.get_run_datasets(run_id)
+```
+
+**Parameters:**
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `run_id` | `str` | ID of the run to query |
+
+**Returns:** `list[Dataset]` — the same `Dataset` objects returned by `log_dataset()`.
+
+**Fields on each `Dataset`:**
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `id` | `str` | UUID of the dataset record |
+| `name` | `str` | Human-readable name |
+| `digest` | `str` | SHA-256 hex digest (empty if not computed at log time) |
+| `source_type` | `str` | `"local"`, `"s3"`, `"gcs"`, or `"url"` |
+| `source` | `str` | File path or URI that was passed to `log_dataset()` |
+| `schema` | `dict` | Column → dtype mapping (e.g. `{"text": "object", "label": "int64"}`) |
+| `profile` | `dict` | Stats dict; contains `num_rows` and `num_features` whenever a DataFrame was passed |
+| `tags` | `dict` | Key-value tags |
+| `created_at` | `str` | ISO 8601 timestamp |
+
+**Examples:**
+
+```python
+from podstack import registry
+
+registry.init(api_key="...", project_id="...")
+
+datasets = registry.get_run_datasets("3a9f12c4-...")
+
+# Inspect each dataset
+for ds in datasets:
+    print(ds.name)
+    print(f"  source : {ds.source}")
+    print(f"  digest : {ds.digest[:16]}…")
+    print(f"  rows   : {ds.profile.get('num_rows', 'unknown')}")
+    print(f"  schema : {ds.schema}")
+```
+
+Checking datasets on a run you have in hand:
+
+```python
+with registry.start_run("training-v1") as run:
+    run.log_dataset("train", df=train_df, context="training")
+    run.log_dataset("val", df=val_df, context="validation")
+
+# After the run completes, retrieve everything that was logged
+datasets = registry.get_run_datasets(run.id)
+assert len(datasets) == 2
+```
+
+Verifying deduplication — the same physical file logged across two runs
+returns the same dataset ID:
+
+```python
+ds1 = registry.get_run_datasets(run_a.id)[0]
+ds2 = registry.get_run_datasets(run_b.id)[0]
+
+# Same file → same digest → same Dataset record
+assert ds1.id == ds2.id
+assert ds1.digest == ds2.digest
+```
+
+#### `get_model_lineage()` — trace a model back to its training data
+
+Returns the full provenance chain for every version of a registered model:
+which datasets each version was trained on, via which run.
+
+```python
+lineage = registry.get_model_lineage(model_id)
+```
+
+**Parameters:**
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `model_id` | `str` | ID of the registered model |
+
+**Returns:** `dict` with the following structure:
+
+```
+{
+    "model_id": str,
+    "versions": [
+        {
+            "version": int,        # version number (1, 2, 3 …)
+            "stage": str,          # "development" | "staging" | "production" | "archived"
+            "run_id": str,         # ID of the linked training run (empty if none)
+            "run_name": str,       # display name of the run
+            "datasets": [Dataset]  # list of Dataset dicts logged to that run
+        },
+        …
+    ]
+}
+```
+
+Each `datasets` entry has the same fields as a `Dataset` object
+(`id`, `name`, `digest`, `source_type`, `source`, `schema`, `profile`, `tags`, `created_at`).
+
+**Examples:**
+
+Basic iteration:
+
+```python
+from podstack import registry
+
+registry.init(api_key="...", project_id="...")
+
+model = registry.get_model("sentiment-bert")
+lineage = registry.get_model_lineage(model.id)
+
+for version in lineage["versions"]:
+    print(f"v{version['version']} · {version['stage']}")
+    print(f"  Run: {version['run_name']} ({version['run_id'][:8]}…)")
+    for ds in version["datasets"]:
+        rows = ds["profile"].get("num_rows", "?")
+        print(f"    └─ {ds['name']}  {rows} rows  sha256:{ds['digest'][:12]}…")
+```
+
+Example output:
+
+```
+v3 · production
+  Run: bert-finetune-v3 (3a9f12c4…)
+    └─ imdb-train  40000 rows  sha256:a3f2c1d8e9b0…
+    └─ imdb-val  5000 rows  sha256:7e4b2f1a0c3d…
+v2 · staging
+  Run: bert-finetune-v2 (8b2e77d1…)
+    └─ imdb-train  40000 rows  sha256:a3f2c1d8e9b0…
+v1 · archived
+  Run: bert-finetune-v1 (f1c3a0e2…)
+    └─ imdb-train  40000 rows  sha256:a3f2c1d8e9b0…
+```
+
+Finding every unique dataset ever used to train any version of a model:
+
+```python
+lineage = registry.get_model_lineage(model.id)
+seen = {}
+for version in lineage["versions"]:
+    for ds in version["datasets"]:
+        seen[ds["id"]] = ds  # dedup by ID
+
+unique_datasets = list(seen.values())
+print(f"{len(unique_datasets)} unique dataset(s) across all versions")
+```
+
+Checking whether the production version was trained on an approved dataset:
+
+```python
+APPROVED_DIGEST = "a3f2c1d8e9b0..."
+
+lineage = registry.get_model_lineage(model.id)
+prod = next(v for v in lineage["versions"] if v["stage"] == "production")
+
+approved = any(ds["digest"] == APPROVED_DIGEST for ds in prod["datasets"])
+print("Production model trained on approved data:", approved)
+```
+
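A related check the README does not spell out is comparing the training data behind two stages. It can be built from the same documented return structure; the sketch below is illustrative, and the `stage_digests` helper plus the one-version-per-stage assumption are not part of the SDK:

```python
# Sketch built on the documented lineage dict; assumes at most one
# version per stage and compares runs by their dataset digests.
def stage_digests(lineage: dict, stage: str) -> set:
    version = next(v for v in lineage["versions"] if v["stage"] == stage)
    return {ds["digest"] for ds in version["datasets"]}

lineage = registry.get_model_lineage(model.id)
if stage_digests(lineage, "staging") != stage_digests(lineage, "production"):
    print("staging candidate was trained on different data than production")
```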
+#### End-to-end example
+
+```python
+import pandas as pd
+from podstack import registry
+
+registry.init(api_key="...", project_id="...")
+registry.set_experiment("sentiment-analysis")
+
+# Load data
+train_df = pd.read_csv("data/train.csv")
+val_df = pd.read_csv("data/val.csv")
+
+with registry.start_run("bert-finetune-v3") as run:
+    # Log datasets — digest is auto-computed, schema inferred
+    run.log_dataset("imdb-train", path="data/train.csv", df=train_df, context="training")
+    run.log_dataset("imdb-val", path="data/val.csv", df=val_df, context="validation")
+
+    # Train
+    run.log_params({"lr": 2e-5, "epochs": 3})
+    run.log_metrics({"accuracy": 0.93, "f1": 0.92})
+
+    # Register and promote the model
+    registry.register_model("sentiment-bert", run_id=run.id)
+    registry.set_model_stage("sentiment-bert", version=3, stage="production")
+
+# Later — answer "what data trained v3?"
+model = registry.get_model("sentiment-bert")
+lineage = registry.get_model_lineage(model.id)
+```
+
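The example stops at the lineage query. Drilling into the result to answer the question concretely might look like the following, using only the fields in the documented return structure (a sketch, not part of the package README):

```python
# Continue from the example above: pull the datasets behind version 3.
v3 = next(v for v in lineage["versions"] if v["version"] == 3)
print(f"v3 was trained in run {v3['run_name']} on:")
for ds in v3["datasets"]:
    print(f"  {ds['name']}  ({ds['source']})  sha256:{ds['digest'][:12]}")
```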
 ### List and Browse
 
 ```python