PyPI - driftless - Versions diffs - 0.2.5__tar.gz → 0.2.6__tar.gz - Mend

driftless 0.2.5tar.gz → 0.2.6tar.gz

Files changed (88)

{driftless-0.2.5 → driftless-0.2.6}/CHANGELOG.md RENAMED

@@ -17,6 +17,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ---
+## [0.2.6] - 2026-07-01
+### Added
+- **P0.3 multi-seed tuning selection** — optional `migration.split_seed_count`
+  (1–5) averages tuning-split metrics across shuffle seeds when scoring repair
+  candidates; holdout validation still uses the primary `--seed` only.
+---
 ## [0.2.5] - 2026-07-01
 ### Added
@@ -132,8 +142,9 @@ First public release on [PyPI](https://pypi.org/project/driftless/0.1.0/).
 - **Docs** — project overview, repair algorithm spec, 2×2 migration methodology,
   Poetry + Dependabot product framing.
-[Unreleased]: https://github.com/driftless-dev/driftless/compare/v0.2.5...HEAD
-[0.2.5]: https://github.com/driftless-dev/driftless/releases/tag/v0.2.5
+[Unreleased]: https://github.com/driftless-dev/driftless/compare/v0.2.6...HEAD
+[0.2.6]: https://github.com/driftless-dev/driftless/releases/tag/v0.2.6
+[0.2.5]: https://github.com/driftless-dev/driftless/compare/v0.2.5...v0.2.6
 [0.2.4]: https://github.com/driftless-dev/driftless/compare/v0.2.4...v0.2.5
 [0.2.3]: https://github.com/driftless-dev/driftless/compare/v0.2.3...v0.2.4
 [0.2.2]: https://github.com/driftless-dev/driftless/compare/v0.2.2...v0.2.3

{driftless-0.2.5 → driftless-0.2.6}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: driftless
-Version: 0.2.5
+Version: 0.2.6
 Summary: Keep prompts in sync when model or eval data changes — Poetry-style lock regeneration, Dependabot-style PRs.
 Project-URL: Homepage, https://github.com/driftless-dev/driftless
 Project-URL: Repository, https://github.com/driftless-dev/driftless
@@ -96,7 +96,7 @@ optimizes against it, with your team owning the definition of "good":
 |---|---|
 | `init` | Scaffold a `driftless.yml`. |
 | `init-policy` | Scaffold a `.driftless/policy.yml` (when to migrate). |
-| `init-ci` | Scaffold `.github/workflows/` for scan, migrate, refine, poll, label audit, and judge check. |
+| `init-ci` | Scaffold `.github/workflows/` for scan, migrate, refine, poll, plan, label audit, and judge check. |
 | `scan` | Find probable LLM usage and at-risk models. |
 | `plan` | Discover at-risk workflows and apply the migration policy (CI triage). |
 | `plan --act` | Migrate + open a PR/issue for every actionable trigger (close the loop). |
@@ -129,11 +129,11 @@ propose it.
 ## GitHub-native usage
 A composite GitHub Action (`action.yml`) wraps the CLI so scans and migrations
-can run in CI. See `.github/workflows/` for a scheduled deprecation scan and a
-manually-triggered migration that opens a PR (or an issue when blocked).
+can run in CI. See `.github/workflows/` for a scheduled deprecation scan, weekly
+`plan --act` triage, and manually-triggered migration workflows.
 ```yaml
-- uses: driftless-dev/driftless@v0.2.5
+- uses: driftless-dev/driftless@v0.2.6
   with:
     command: scan
 ```

{driftless-0.2.5 → driftless-0.2.6}/README.md RENAMED

@@ -57,7 +57,7 @@ optimizes against it, with your team owning the definition of "good":
 |---|---|
 | `init` | Scaffold a `driftless.yml`. |
 | `init-policy` | Scaffold a `.driftless/policy.yml` (when to migrate). |
-| `init-ci` | Scaffold `.github/workflows/` for scan, migrate, refine, poll, label audit, and judge check. |
+| `init-ci` | Scaffold `.github/workflows/` for scan, migrate, refine, poll, plan, label audit, and judge check. |
 | `scan` | Find probable LLM usage and at-risk models. |
 | `plan` | Discover at-risk workflows and apply the migration policy (CI triage). |
 | `plan --act` | Migrate + open a PR/issue for every actionable trigger (close the loop). |
@@ -90,11 +90,11 @@ propose it.
 ## GitHub-native usage
 A composite GitHub Action (`action.yml`) wraps the CLI so scans and migrations
-can run in CI. See `.github/workflows/` for a scheduled deprecation scan and a
-manually-triggered migration that opens a PR (or an issue when blocked).
+can run in CI. See `.github/workflows/` for a scheduled deprecation scan, weekly
+`plan --act` triage, and manually-triggered migration workflows.
 ```yaml
-- uses: driftless-dev/driftless@v0.2.5
+- uses: driftless-dev/driftless@v0.2.6
   with:
     command: scan
 ```

{driftless-0.2.5 → driftless-0.2.6}/docs/RELEASE.md RENAMED

@@ -153,7 +153,7 @@ After a release, users can pin the composite Action by release tag
 (`action.yml` lives at the repo root — no `/action` path segment):
 ```yaml
-- uses: driftless-dev/driftless@v0.2.5
+- uses: driftless-dev/driftless@v0.2.6
   with:
     command: scan
 ```
@@ -161,9 +161,9 @@ After a release, users can pin the composite Action by release tag
 Or pin the PyPI package in the Action input:
 ```yaml
-- uses: driftless-dev/driftless@v0.2.5
+- uses: driftless-dev/driftless@v0.2.6
   with:
-    version: "==0.2.5"
+    version: "==0.2.6"
     command: migrate
 ```
@@ -171,7 +171,7 @@ Optionally maintain a floating **`v1`** tag on the latest stable minor release
 (point it at the current release tag after each publish):
 ```bash
-git tag -f v1 v0.2.5 && git push origin v1 --force
+git tag -f v1 v0.2.6 && git push origin v1 --force
 ```
 Update [`action.yml`](../action.yml) default `version` input when cutting releases.
@@ -213,7 +213,9 @@ In **Settings → Secrets and variables → Actions**, add:
 | `ANTHROPIC_API_KEY` | Live eval matrix job (`provider: anthropic`) |
 If a secret is missing, that provider job exits cleanly with a warning (CI stays
-green). When both are set, nightly runs append to
+green). On scheduled or manual runs, the **secrets-preflight** job writes a
+summary table to the workflow run so you can see which keys are configured.
+When both are set, nightly runs append to
 `.driftless/regression-metrics.jsonl` and check against
 `tests/fixtures/live_eval_baseline.json` with `--require-all`.

{driftless-0.2.5 → driftless-0.2.6}/site/docs.html RENAMED

@@ -428,7 +428,7 @@ driftless view -w support_classifier</code></pre>
     <span class="tok-k">runs-on</span>: ubuntu-latest
     <span class="tok-k">steps</span>:
       - <span class="tok-k">uses</span>: actions/checkout@v4
-      - <span class="tok-k">uses</span>: driftless-dev/driftless@v0.2.5
+      - <span class="tok-k">uses</span>: driftless-dev/driftless@v0.2.6
         <span class="tok-k">with</span>:
           <span class="tok-k">command</span>: <span class="tok-s">plan</span></code></pre>
         <p>A scheduled <code class="inline">plan</code> gates CI when a deprecated model needs attention; a manually-triggered <code class="inline">migrate</code> opens a PR (or an issue when blocked) with the evidence attached.</p>

{driftless-0.2.5 → driftless-0.2.6}/src/driftless/__init__.py RENAMED

@@ -1,3 +1,3 @@
 """driftless: Dependabot for LLM models."""
-__version__ = "0.2.5"
+__version__ = "0.2.6"

{driftless-0.2.5 → driftless-0.2.6}/src/driftless/contract.py RENAMED

@@ -324,6 +324,16 @@ class MigrationSpec(StrictModel):
     allow_business_logic_edits: bool = False
     max_iterations: int = 8
     holdout_required: bool = True
+    # When >1, average tuning-split metrics across this many shuffle seeds
+    # (seed, seed+1, …) when scoring repair candidates. Holdout uses ``seed`` only.
+    split_seed_count: int = 1
+    @field_validator("split_seed_count")
+    @classmethod
+    def _split_seed_count_range(cls, v: int) -> int:
+        if v < 1 or v > 5:
+            raise ValueError("migration.split_seed_count must be between 1 and 5")
+        return v
 class RepairSpec(StrictModel):
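
Reviewer note on the `split_seed_count` validator in the hunk above: the bounds check can be tried in isolation with a dependency-free stand-in. The sketch below assumes a plain dataclass named `MigrationSpecSketch` (hypothetical, not part of the package) in place of the pydantic `StrictModel` and `field_validator` that the diff actually uses, so it illustrates only the accepted range, not pydantic's error wrapping.

```python
from dataclasses import dataclass


@dataclass
class MigrationSpecSketch:
    """Hypothetical stand-in for MigrationSpec; not the package's pydantic model."""

    max_iterations: int = 8
    holdout_required: bool = True
    split_seed_count: int = 1

    def __post_init__(self) -> None:
        # Same bounds as the validator in the hunk: 1 through 5 inclusive.
        if self.split_seed_count < 1 or self.split_seed_count > 5:
            raise ValueError("migration.split_seed_count must be between 1 and 5")


def is_valid(count: int) -> bool:
    """Return True when the sketch accepts the given seed count."""
    try:
        MigrationSpecSketch(split_seed_count=count)
    except ValueError:
        return False
    return True


# Probe one value below the range, the whole range, and one value above it.
accepted = [c for c in range(0, 7) if is_valid(c)]
print(accepted)  # [1, 2, 3, 4, 5]
```

The test added in `tests/test_contract.py` in this release exercises only the lower bound (`split_seed_count: 0`); the probe above also covers the upper bound of 5.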

{driftless-0.2.5 → driftless-0.2.6}/src/driftless/engine.py RENAMED

@@ -30,10 +30,10 @@ from .calibrate import suggest_thresholds
 from .compare import ThresholdCheck, check_thresholds
 from .contract import ThresholdsSpec, Workflow
 from .errors import DriftlessError
-from .evaluation import Metrics, RecordRow, RunAnalysis, analyze
+from .evaluation import Metrics, RecordRow, RunAnalysis, analyze, average_metrics
 from .harness import run_workflow
 from .progress import log as progress_log
-from .splits import make_splits, materialize_inputs
+from .splits import Split, make_splits, materialize_inputs
 # --------------------------------------------------------------------------- #
@@ -336,6 +336,8 @@ class MigrationResult:
     message: str = ""
     # Frozen editable files at loop start — baseline for per-candidate diffs in reports/UI.
     original_editable_files: dict[str, str] = field(default_factory=dict)
+    # Shuffle seeds used for tuning (primary ``seed`` only when split_seed_count==1).
+    split_seeds_used: list[int] = field(default_factory=list)
     @property
     def succeeded(self) -> bool:
@@ -516,11 +518,19 @@ def run_migration(
         )
     split = make_splits(workflow, cwd=cwd, seed=seed)
+    split_seeds_used = list(range(seed, seed + mig.split_seed_count))
     size_warnings = assess_split_sizes(
         len(split.input_lines),
         len(split.holdout_idx),
         holdout_required=mig.holdout_required,
     )
+    if mig.split_seed_count > 1:
+        size_warnings.append(
+            f"Multi-seed tuning: candidate selection averages metrics across "
+            f"{mig.split_seed_count} shuffle seeds ({split_seeds_used[0]}.."
+            f"{split_seeds_used[-1]}); each candidate scoring multiplies tuning "
+            "workflow runs."
+        )
     use_ids = bool(workflow.eval.id_field) and split.gold is not None
@@ -532,18 +542,23 @@ def run_migration(
         return judge_evidence_samples(rows)
     def evaluate_on(
-        model: str, idx: list[int], files: dict[str, str] | None = None
+        model: str,
+        idx: list[int],
+        files: dict[str, str] | None = None,
+        *,
+        split_ref: Split | None = None,
     ) -> RunAnalysis:
+        sp = split_ref or split
         file_ctx = apply_files(files, cwd=cwd) if files else nullcontext()
-        idx_lines = split.lines_for(idx)
+        idx_lines = sp.lines_for(idx)
         with materialize_inputs(workflow, idx_lines, cwd=cwd):
             with file_ctx:
                 run = run_workflow(workflow, model, cwd=cwd)
-                if use_ids:
+                if use_ids and sp.gold_ids is not None:
                     return analyze(
                         workflow,
                         run,
-                        gold_by_id=split.gold_by_id_for(idx),
+                        gold_by_id=sp.gold_by_id_for(idx),
                         inputs=idx_lines,
                         judge=judge,
                         cwd=cwd,
@@ -551,12 +566,27 @@ def run_migration(
                 return analyze(
                     workflow,
                     run,
-                    gold_labels=split.gold_for(idx),
+                    gold_labels=sp.gold_for(idx),
                     inputs=idx_lines,
                     judge=judge,
                     cwd=cwd,
                 )
+    def evaluate_tuning(
+        model: str, files: dict[str, str] | None = None
+    ) -> RunAnalysis:
+        """Score on the tuning split; average across seeds when configured."""
+        if mig.split_seed_count <= 1:
+            return evaluate_on(model, split.tuning_idx, files)
+        tuning_splits = [make_splits(workflow, cwd=cwd, seed=s) for s in split_seeds_used]
+        analyses = [
+            evaluate_on(model, sp.tuning_idx, files, split_ref=sp) for sp in tuning_splits
+        ]
+        return RunAnalysis(
+            metrics=average_metrics([a.metrics for a in analyses]),
+            rows=analyses[0].rows,
+        )
     progress_log(
         f"migration: phase 1/3 — initial eval "
         f"({len(split.tuning_idx)} tuning examples, model={current})"
@@ -597,6 +627,7 @@ def run_migration(
                 holdout_checks=holdout_checks,
                 tuning_checks=naive_checks,
                 warnings=size_warnings,
+                split_seeds_used=split_seeds_used,
                 judge_agreement=judge_agreement_info,
                 judge_evidence=_judge_evidence(naive_analysis.rows),
                 message="naive model swap passes thresholds; only the model ID changes",
@@ -691,7 +722,7 @@ def run_migration(
             cand_size = _patch_diff_size(patch.files, original_editable)
             try:
                 validate_patch_scope(patch, workflow, cwd)
-                analysis = evaluate_on(target_model, split.tuning_idx, files=patch.files)
+                analysis = evaluate_tuning(target_model, files=patch.files)
             except DriftlessError as exc:
                 experiment_log.append(
                     AttemptRecord(
@@ -786,6 +817,7 @@ def run_migration(
                     experiment_log=experiment_log,
                     cluster_history=cluster_history,
                     warnings=size_warnings,
+                split_seeds_used=split_seeds_used,
                     judge_agreement=judge_agreement_info,
                     judge_evidence=_judge_evidence(best_analysis.rows),
                     original_editable_files=original_editable,
@@ -855,6 +887,7 @@ def run_migration(
             experiment_log=experiment_log,
             cluster_history=cluster_history,
             warnings=size_warnings,
+            split_seeds_used=split_seeds_used,
             suggested_thresholds=suggested,
             judge_agreement=judge_agreement_info,
             judge_evidence=_judge_evidence(best_analysis.rows),
@@ -887,6 +920,7 @@ def run_migration(
         experiment_log=experiment_log,
         cluster_history=cluster_history,
         warnings=size_warnings,
+        split_seeds_used=split_seeds_used,
         judge_agreement=judge_agreement_info,
         judge_evidence=_judge_evidence(best_analysis.rows),
         original_editable_files=original_editable,
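
Reviewer note on the `engine.py` changes above: the control flow of `evaluate_tuning` can be reduced to a few lines. This is a minimal sketch, assuming a hypothetical `score_for_seed` callable that stands in for "build the split for this seed, run the workflow, and return one headline metric"; the real function returns a `RunAnalysis` and averages a full `Metrics` object.

```python
from statistics import mean
from typing import Callable


def seeds_for(seed: int, split_seed_count: int) -> list[int]:
    """Same expression as in the hunk: seed, seed + 1, ..."""
    return list(range(seed, seed + split_seed_count))


def evaluate_tuning(
    score_for_seed: Callable[[int], float], seed: int, split_seed_count: int
) -> float:
    """Score on the primary seed only, or average across all configured seeds."""
    if split_seed_count <= 1:
        return score_for_seed(seed)
    return mean(score_for_seed(s) for s in seeds_for(seed, split_seed_count))


# Toy scorer (hypothetical): records which seeds were evaluated.
calls: list[int] = []


def toy_score(s: int) -> float:
    calls.append(s)
    return {1: 0.5, 2: 1.0}.get(s, 0.0)


avg = evaluate_tuning(toy_score, seed=1, split_seed_count=2)
print(avg)    # 0.75
print(calls)  # [1, 2]
```

The `calls` list shows why the diff appends a warning when `split_seed_count > 1`: each candidate is scored once per seed, so tuning workflow runs scale with the seed count. Two details of the real implementation are not modeled here: the averaged result keeps only the first seed's rows (`rows=analyses[0].rows`), and holdout validation is still evaluated against the split built from the primary `seed`.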

{driftless-0.2.5 → driftless-0.2.6}/src/driftless/evaluation.py RENAMED

@@ -96,6 +96,39 @@ class Metrics:
     scored: int = 0
+def average_metrics(items: list[Metrics]) -> Metrics:
+    """Mean of headline metrics across multiple eval runs (multi-seed tuning)."""
+    if not items:
+        raise ValueError("average_metrics requires at least one Metrics")
+    if len(items) == 1:
+        return items[0]
+    def _mean(vals: list[float | None]) -> float | None:
+        nums = [v for v in vals if v is not None]
+        return sum(nums) / len(nums) if nums else None
+    def _mean_int(vals: list[int]) -> int:
+        return int(round(sum(vals) / len(vals)))
+    costs = [m.total_cost for m in items if m.total_cost is not None]
+    return Metrics(
+        n=items[0].n,
+        schema_error_rate=_mean([m.schema_error_rate for m in items]),
+        refusal_rate=_mean([m.refusal_rate for m in items]) or 0.0,
+        accuracy=_mean([m.accuracy for m in items]),
+        precision=_mean([m.precision for m in items]),
+        recall=_mean([m.recall for m in items]),
+        f1=_mean([m.f1 for m in items]),
+        avg_latency_ms=_mean([m.avg_latency_ms for m in items]),
+        total_cost=sum(costs) if costs else None,
+        score=_mean([m.score for m in items]),
+        schema_errors=_mean_int([m.schema_errors for m in items]),
+        refusals=_mean_int([m.refusals for m in items]),
+        labeled=items[0].labeled,
+        scored=items[0].scored,
+    )
 def load_jsonl(path: Path) -> list[OutputRecord]:
     records: list[OutputRecord] = []
     with path.open(encoding="utf-8") as fh:
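
Reviewer note on `average_metrics` above: the fields are not all combined the same way. Rates and scores are averaged with `None` values skipped, `refusal_rate` falls back to `0.0`, `total_cost` is summed rather than averaged, and integer counts go through `int(round(...))`. The sketch below copies the shape of the two helpers from the hunk (renamed `mean_skip_none` and `mean_int` here for illustration) and runs them on made-up values.

```python
from typing import Optional


def mean_skip_none(vals: list[Optional[float]]) -> Optional[float]:
    """Same shape as `_mean` in the hunk: ignore None, return None if nothing is left."""
    nums = [v for v in vals if v is not None]
    return sum(nums) / len(nums) if nums else None


def mean_int(vals: list[int]) -> int:
    """Same shape as `_mean_int`: round the mean to an integer."""
    return int(round(sum(vals) / len(vals)))


# A metric missing from one run is averaged over the runs that have it.
f1 = mean_skip_none([0.5, None, 1.0])
print(f1)  # 0.75

# When every run lacks the value, the `or 0.0` fallback applies.
refusal_rate = mean_skip_none([None, None]) or 0.0
print(refusal_rate)  # 0.0

# Cost is a sum over runs, so it reflects the total spend of multi-seed scoring.
total_cost = sum([0.25, 0.5])
print(total_cost)  # 0.75

# Python's round() rounds halves to the nearest even integer.
print(mean_int([1, 2]))  # 2  (mean 1.5)
print(mean_int([2, 3]))  # 2  (mean 2.5)
```

Because `n`, `labeled`, and `scored` are taken from `items[0]`, the averaged `Metrics` describes the first seed's sample sizes; this matters only if splits for different seeds could differ in size.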

{driftless-0.2.5 → driftless-0.2.6}/src/driftless/report.py RENAMED

@@ -568,6 +568,7 @@ def result_to_dict(result: MigrationResult) -> dict:
         "experiment_log": [asdict(a) for a in result.experiment_log],
         "cluster_trajectory": cluster_trajectories(result.cluster_history),
         "warnings": result.warnings,
+        "split_seeds_used": result.split_seeds_used,
         "judge_agreement": asdict(result.judge_agreement) if result.judge_agreement else None,
         "judge_evidence": result.judge_evidence,
         "suggested_thresholds": result.suggested_thresholds,

{driftless-0.2.5 → driftless-0.2.6}/tests/test_contract.py RENAMED

@@ -63,6 +63,17 @@ def test_workflow_not_found():
         contract.workflow("missing")
+def test_split_seed_count_must_be_in_range():
+    with pytest.raises(Exception):
+        Workflow.model_validate(
+            {
+                "run": {"command": "true", "input_path": "i", "output_path": "o"},
+                "model": {"current": "m", "env_var": "M"},
+                "migration": {"split_seed_count": 0},
+            }
+        )
 def test_load_missing_contract(tmp_path: Path):
     with pytest.raises(ContractError):
         load_contract(tmp_path / "nope.yml")

{driftless-0.2.5 → driftless-0.2.6}/tests/test_engine.py RENAMED

@@ -207,3 +207,13 @@ def test_cluster_failures():
     assert kinds["refusal"].count == 1
     assert kinds["misclassification"].count == 2  # billing<-technical pair
     assert kinds["misclassification"].key == "billing -> technical"
+def test_multi_seed_tuning_still_passes(tmp_path: Path):
+    wf = _make_workflow(tmp_path)
+    wf.migration.split_seed_count = 2
+    result = run_migration("demo", wf, "weak", generator=StrictGen(), cwd=tmp_path, seed=1)
+    assert result.status == MigrationStatus.PASS
+    assert result.split_seeds_used == [1, 2]
+    assert any("Multi-seed tuning" in w for w in result.warnings)

{driftless-0.2.5 → driftless-0.2.6}/tests/test_evaluation.py RENAMED

@@ -316,3 +316,14 @@ def test_load_labels_by_id_rejects_duplicates(tmp_path: Path):
     p.write_text('{"id":"a","label":"x"}\n{"id":"a","label":"y"}\n')
     with pytest.raises(Exception):
         load_labels_by_id(p, "id", "label")
+def test_average_metrics_means_headline_fields():
+    from driftless.evaluation import Metrics, average_metrics
+    a = Metrics(n=10, schema_error_rate=0.2, refusal_rate=0.1, f1=0.8)
+    b = Metrics(n=10, schema_error_rate=0.0, refusal_rate=0.0, f1=1.0)
+    avg = average_metrics([a, b])
+    assert avg.f1 == pytest.approx(0.9)
+    assert avg.schema_error_rate == pytest.approx(0.1)
+    assert avg.refusal_rate == pytest.approx(0.05)

{driftless-0.2.5 → driftless-0.2.6}/tests/test_init_ci.py RENAMED

@@ -293,6 +293,46 @@ def test_label_audit_helpers():
     assert label_audit_paths(contract) == ["labels.jsonl", "in.jsonl"]
+def test_init_ci_scaffolds_plan_workflow(tmp_path, monkeypatch):
+    monkeypatch.chdir(tmp_path)
+    Path("driftless.yml").write_text(
+        """
+version: 1
+workflows:
+  smoke:
+    run:
+      command: echo ok
+      input_path: in.jsonl
+      output_path: out.jsonl
+    model:
+      current: gpt-4o-mini
+      env_var: MODEL
+    eval:
+      labels_path: labels.jsonl
+""".lstrip()
+    )
+    out = tmp_path / "workflows"
+    result = runner.invoke(
+        app,
+        [
+            "init-ci",
+            "--out-dir",
+            str(out),
+            "--no-scan",
+            "--no-migrate",
+            "--no-refine",
+            "--no-audit-labels",
+            "--plan",
+        ],
+    )
+    assert result.exit_code == 0
+    plan = (out / "driftless-plan-act.yml").read_text()
+    assert "command: plan" in plan
+    assert "--act" in plan
+    assert "GH_TOKEN" in plan
 def test_rendered_workflows_use_action_ref():
     ref = "driftless-dev/driftless@v9.9.9"
     assert ref in render_migrate_workflow(ref)

driftless-0.2.6/tests/test_splits.py ADDED

@@ -0,0 +1,27 @@
+"""Tests for tuning/holdout splits."""
+from driftless.contract import Workflow
+from driftless.splits import make_splits
+def _workflow() -> Workflow:
+    return Workflow.model_validate(
+        {
+            "run": {"command": "true", "input_path": "i.jsonl", "output_path": "o.jsonl"},
+            "model": {"current": "m", "env_var": "M"},
+            "eval": {"labels_path": "l.jsonl", "split": {"tuning": 0.5, "holdout": 0.5}},
+        }
+    )
+def test_different_seeds_produce_different_partitions(tmp_path):
+    lines = "\n".join(f'{{"id": {i}, "label": "a"}}' for i in range(20)) + "\n"
+    labels = "\n".join('{"id": ' + str(i) + ', "label": "a"}' for i in range(20)) + "\n"
+    (tmp_path / "i.jsonl").write_text(lines)
+    (tmp_path / "l.jsonl").write_text(labels)
+    wf = _workflow()
+    wf.eval.id_field = "id"
+    a = make_splits(wf, cwd=tmp_path, seed=0)
+    b = make_splits(wf, cwd=tmp_path, seed=1)
+    assert a.tuning_idx != b.tuning_idx
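
Reviewer note on the new split test above: multi-seed tuning relies on two properties, namely that a given seed always reproduces the same partition and that neighbouring seeds produce different ones. The diff does not show how `make_splits` shuffles, so the sketch below assumes a hypothetical splitter, `make_partition`, built on a private `random.Random(seed)` instance; it illustrates the properties and is not the package's implementation.

```python
import random


def make_partition(
    n: int, seed: int, tuning_fraction: float = 0.5
) -> tuple[list[int], list[int]]:
    """Hypothetical splitter: shuffle indices with a private RNG, then cut."""
    order = list(range(n))
    random.Random(seed).shuffle(order)  # private RNG; global random state is untouched
    cut = int(n * tuning_fraction)
    return order[:cut], order[cut:]


tuning_a, holdout_a = make_partition(20, seed=0)
tuning_a_again, _ = make_partition(20, seed=0)
tuning_b, _ = make_partition(20, seed=1)

# Same seed, same partition: runs are reproducible.
print(tuning_a == tuning_a_again)

# Tuning and holdout together cover every index exactly once.
print(sorted(tuning_a + holdout_a) == list(range(20)))

# A different seed yields a different tuning set.
print(tuning_a != tuning_b)
```

Under these assumptions, averaging over seeds `seed, seed + 1, ...` samples several distinct tuning subsets of the same data, which is what the multi-seed option is meant to exploit.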