PyPI - driftless - Versions diffs - 0.2.1__tar.gz → 0.2.5__tar.gz - Mend

driftless 0.2.1tar.gz → 0.2.5tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (93)

{driftless-0.2.1 → driftless-0.2.5}/CHANGELOG.md RENAMED

@@ -17,6 +17,58 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ---
+## [0.2.5] - 2026-07-01
+### Added
+- **`init-ci` label-audit workflow** — scaffold `driftless-label-audit.yml` (or
+  `-all` matrix) with `audit-labels --fail` on eval dataset path changes.
+- **`init-ci` judge-check workflow** — scaffold `driftless-judge-check.yml` when
+  `eval.judge.calibration_path` is set; uses `--enforce` when gate thresholds
+  are configured.
+---
+## [0.2.4] - 2026-07-01
+### Fixed
+- **`judge-check` gate output under CI** — emit gate status via plain stdout so Rich
+  TTY highlighting (when `GITHUB_ACTIONS=true`) does not break publish workflow tests.
+---
+## [0.2.3] - 2026-07-01
+### Fixed
+- **`judge-check` gate output** — print gate status with Rich markup disabled so
+  publish CI can assert on `max_mae` / `min_correlation` lines reliably.
+---
+## [0.2.2] - 2026-07-01
+### Added
+- **`driftless judge-check`** — measure judge↔human agreement on a calibration set;
+  `--enforce` applies the same gates as `migrate` / `compare`.
+- **`driftless audit-labels`** — find duplicate/near-duplicate inputs with disagreeing
+  gold labels; `--fail` for CI.
+- **Judge trust hardening** — optional `max_mae` / `min_correlation` gates on
+  judge-graded workflows; judge reliability and scoring evidence in migration reports.
+- **P0.1 expansion** — judge-graded regression scenario; live eval CI baseline
+  checks with `--require-all` and job summaries.
+- **`open-pr --create` integration tests** — mocked git/gh execution path coverage.
+- **`migrate` / `refine` label-audit preflight** — warn on label conflicts by default;
+  `--strict-label-audit` blocks; `--skip-label-audit` to silence.
+### Changed
+- Live eval workflow sets `DRIFTLESS_REGRESSION_METRICS` explicitly.
+---
 ## [0.2.1] - 2026-07-01
 ### Fixed
@@ -80,8 +132,12 @@ First public release on [PyPI](https://pypi.org/project/driftless/0.1.0/).
 - **Docs** — project overview, repair algorithm spec, 2×2 migration methodology,
   Poetry + Dependabot product framing.
-[Unreleased]: https://github.com/driftless-dev/driftless/compare/v0.2.1...HEAD
+[Unreleased]: https://github.com/driftless-dev/driftless/compare/v0.2.5...HEAD
+[0.2.5]: https://github.com/driftless-dev/driftless/releases/tag/v0.2.5
+[0.2.4]: https://github.com/driftless-dev/driftless/compare/v0.2.4...v0.2.5
+[0.2.3]: https://github.com/driftless-dev/driftless/compare/v0.2.3...v0.2.4
+[0.2.2]: https://github.com/driftless-dev/driftless/compare/v0.2.2...v0.2.3
 [0.2.1]: https://github.com/driftless-dev/driftless/releases/tag/v0.2.1
-[0.2.0]: https://github.com/driftless-dev/driftless/releases/tag/v0.2.0
+[0.2.0]: https://github.com/driftless-dev/driftless/compare/v0.2.0...v0.2.1
 [0.1.1]: https://github.com/driftless-dev/driftless/releases/tag/v0.1.1
 [0.1.0]: https://github.com/driftless-dev/driftless/releases/tag/v0.1.0
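
The 0.2.2 entry above introduces `max_mae` / `min_correlation` gates on judge↔human agreement. As a rough illustration of what such a check computes, here is a minimal sketch. It assumes Pearson correlation (the changelog does not say which correlation is used), and the names `Agreement`, `agreement`, and `gates_pass` are hypothetical, not the package's API.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Agreement:
    """Hypothetical container; field names echo those printed by judge-check."""
    n: int
    mean_abs_error: float
    correlation: float | None  # None when either score series is constant


def agreement(judge: list[float], human: list[float]) -> Agreement:
    """Compare judge scores with human scores on the same calibration records."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two equal-length, non-empty score lists")
    n = len(judge)
    mae = sum(abs(j - h) for j, h in zip(judge, human)) / n
    mean_j, mean_h = sum(judge) / n, sum(human) / n
    cov = sum((j - mean_j) * (h - mean_h) for j, h in zip(judge, human))
    var_j = sum((j - mean_j) ** 2 for j in judge)
    var_h = sum((h - mean_h) ** 2 for h in human)
    denom = math.sqrt(var_j * var_h)
    # Assumption: Pearson correlation, undefined (None) for a constant series.
    corr = cov / denom if denom > 0 else None
    return Agreement(n=n, mean_abs_error=mae, correlation=corr)


def gates_pass(
    a: Agreement,
    max_mae: float | None = None,
    min_correlation: float | None = None,
) -> bool:
    """A gate that is not configured (None) is skipped."""
    if max_mae is not None and a.mean_abs_error > max_mae:
        return False
    if min_correlation is not None:
        # An undefined correlation cannot satisfy a minimum.
        if a.correlation is None or a.correlation < min_correlation:
            return False
    return True
```

The comparison directions match the gate lines in the `judge-check` command further down in this diff: MAE must be at or below `max_mae`, and a missing correlation fails `min_correlation`.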

{driftless-0.2.1 → driftless-0.2.5}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: driftless
-Version: 0.2.1
+Version: 0.2.5
 Summary: Keep prompts in sync when model or eval data changes — Poetry-style lock regeneration, Dependabot-style PRs.
 Project-URL: Homepage, https://github.com/driftless-dev/driftless
 Project-URL: Repository, https://github.com/driftless-dev/driftless
@@ -87,6 +87,8 @@ optimizes against it, with your team owning the definition of "good":
   precision/recall/F1 against the gold record.
 - **`eval.judge`** — an LLM judge grades each free-form output against a rubric
   (with an optional human-scored calibration set for a judge-agreement check).
+  Run `driftless judge-check -w <workflow>` before optimizing; set
+  `max_mae` / `min_correlation` in the contract to gate `migrate` / `compare`.
 ## CLI
@@ -94,7 +96,7 @@ optimizes against it, with your team owning the definition of "good":
 |---|---|
 | `init` | Scaffold a `driftless.yml`. |
 | `init-policy` | Scaffold a `.driftless/policy.yml` (when to migrate). |
-| `init-ci` | Scaffold `.github/workflows/` for scan, migrate, refine, and poll. |
+| `init-ci` | Scaffold `.github/workflows/` for scan, migrate, refine, poll, label audit, and judge check. |
 | `scan` | Find probable LLM usage and at-risk models. |
 | `plan` | Discover at-risk workflows and apply the migration policy (CI triage). |
 | `plan --act` | Migrate + open a PR/issue for every actionable trigger (close the loop). |
@@ -102,9 +104,12 @@ optimizes against it, with your team owning the definition of "good":
 | `calibrate -w <w>` | Measure the baseline and suggest starting thresholds. |
 | `compare -w <w> --to <model>` | Baseline vs target scorecard. |
 | `migrate -w <w> --to <model>` | Repair + validate + produce migrated files. |
+| | Label-audit preflight warns on duplicate-label conflicts by default; `--strict-label-audit` blocks, `--skip-label-audit` silences. |
 | `refine -w <w>` | Re-optimize the prompt for a changed eval dataset (model pinned). |
 | `poll [--act]` | Detect external eval-dataset changes and refine on a meaningful change. |
 | `validate -w <w>` | Check the contract parses and the harness runs. |
+| `judge-check -w <w>` | Measure judge↔human agreement on a calibration set (`--enforce` to gate). |
+| `audit-labels -w <w>` | Find duplicate inputs with disagreeing gold labels (`--fail` for CI). |
 | `report` | Render the latest migration report. |
 | `view` | Open the optimization run viewer (charts + attempt log). |
 | `open-pr -w <w>` | Open a PR (or issue) from the latest migration result. |
@@ -128,7 +133,7 @@ can run in CI. See `.github/workflows/` for a scheduled deprecation scan and a
 manually-triggered migration that opens a PR (or an issue when blocked).
 ```yaml
-- uses: driftless-dev/driftless@v0.2.1
+- uses: driftless-dev/driftless@v0.2.5
   with:
     command: scan
 ```

{driftless-0.2.1 → driftless-0.2.5}/README.md RENAMED

@@ -48,6 +48,8 @@ optimizes against it, with your team owning the definition of "good":
   precision/recall/F1 against the gold record.
 - **`eval.judge`** — an LLM judge grades each free-form output against a rubric
   (with an optional human-scored calibration set for a judge-agreement check).
+  Run `driftless judge-check -w <workflow>` before optimizing; set
+  `max_mae` / `min_correlation` in the contract to gate `migrate` / `compare`.
 ## CLI
@@ -55,7 +57,7 @@ optimizes against it, with your team owning the definition of "good":
 |---|---|
 | `init` | Scaffold a `driftless.yml`. |
 | `init-policy` | Scaffold a `.driftless/policy.yml` (when to migrate). |
-| `init-ci` | Scaffold `.github/workflows/` for scan, migrate, refine, and poll. |
+| `init-ci` | Scaffold `.github/workflows/` for scan, migrate, refine, poll, label audit, and judge check. |
 | `scan` | Find probable LLM usage and at-risk models. |
 | `plan` | Discover at-risk workflows and apply the migration policy (CI triage). |
 | `plan --act` | Migrate + open a PR/issue for every actionable trigger (close the loop). |
@@ -63,9 +65,12 @@ optimizes against it, with your team owning the definition of "good":
 | `calibrate -w <w>` | Measure the baseline and suggest starting thresholds. |
 | `compare -w <w> --to <model>` | Baseline vs target scorecard. |
 | `migrate -w <w> --to <model>` | Repair + validate + produce migrated files. |
+| | Label-audit preflight warns on duplicate-label conflicts by default; `--strict-label-audit` blocks, `--skip-label-audit` silences. |
 | `refine -w <w>` | Re-optimize the prompt for a changed eval dataset (model pinned). |
 | `poll [--act]` | Detect external eval-dataset changes and refine on a meaningful change. |
 | `validate -w <w>` | Check the contract parses and the harness runs. |
+| `judge-check -w <w>` | Measure judge↔human agreement on a calibration set (`--enforce` to gate). |
+| `audit-labels -w <w>` | Find duplicate inputs with disagreeing gold labels (`--fail` for CI). |
 | `report` | Render the latest migration report. |
 | `view` | Open the optimization run viewer (charts + attempt log). |
 | `open-pr -w <w>` | Open a PR (or issue) from the latest migration result. |
@@ -89,7 +94,7 @@ can run in CI. See `.github/workflows/` for a scheduled deprecation scan and a
 manually-triggered migration that opens a PR (or an issue when blocked).
 ```yaml
-- uses: driftless-dev/driftless@v0.2.1
+- uses: driftless-dev/driftless@v0.2.5
   with:
     command: scan
 ```

{driftless-0.2.1 → driftless-0.2.5}/docs/RELEASE.md RENAMED

@@ -153,7 +153,7 @@ After a release, users can pin the composite Action by release tag
 (`action.yml` lives at the repo root — no `/action` path segment):
 ```yaml
-- uses: driftless-dev/driftless@v0.2.1
+- uses: driftless-dev/driftless@v0.2.5
   with:
     command: scan
 ```
@@ -161,13 +161,19 @@ After a release, users can pin the composite Action by release tag
 Or pin the PyPI package in the Action input:
 ```yaml
-- uses: driftless-dev/driftless@v0.2.1
+- uses: driftless-dev/driftless@v0.2.5
   with:
-    version: "==0.2.1"
+    version: "==0.2.5"
     command: migrate
 ```
-Optionally maintain a floating **`v1`** tag on the latest stable minor release.
+Optionally maintain a floating **`v1`** tag on the latest stable minor release
+(point it at the current release tag after each publish):
+```bash
+git tag -f v1 v0.2.5 && git push origin v1 --force
+```
 Update [`action.yml`](../action.yml) default `version` input when cutting releases.
 ---
@@ -188,3 +194,37 @@ Update [`action.yml`](../action.yml) default `version` input when cutting releas
 `0.1.0` was uploaded manually before Trusted Publishing was wired. Tags and
 GitHub Release for `v0.1.0` can be added retroactively for a clean history; PyPI
 already hosts that version.
+---
+## Maintainer: live optimizer eval (P0.1)
+The **migration-regression** workflow runs deterministic regression on every
+push/PR and a **live** LLM optimizer eval nightly (or on manual dispatch). The
+live job costs tokens and is opt-in via repository secrets.
+### Required secrets
+In **Settings → Secrets and variables → Actions**, add:
+| Secret | Used by |
+|---|---|
+| `OPENAI_API_KEY` | Live eval matrix job (`provider: openai`) |
+| `ANTHROPIC_API_KEY` | Live eval matrix job (`provider: anthropic`) |
+If a secret is missing, that provider job exits cleanly with a warning (CI stays
+green). When both are set, nightly runs append to
+`.driftless/regression-metrics.jsonl` and check against
+`tests/fixtures/live_eval_baseline.json` with `--require-all`.
+### Local reproduction
+```bash
+export DRIFTLESS_LIVE_EVAL=1
+export OPENAI_API_KEY=...
+pytest tests/test_migration_live.py -v -k openai
+python scripts/check_live_eval_metrics.py --provider openai --require-all
+```
+After a few stable nightly runs, tighten floors in `live_eval_baseline.json`
+(iterations ceiling, min F1/score).

{driftless-0.2.1 → driftless-0.2.5}/site/docs.html RENAMED

@@ -308,8 +308,11 @@ driftless open-pr -w support_classifier --create</code></pre>
             <tr><td><code>plan</code></td><td>Discover at-risk workflows and apply the migration policy (CI triage).</td></tr>
             <tr><td><code>configure &lt;workflow&gt;</code></td><td>Turn a detected workflow into a migration-ready contract.</td></tr>
             <tr><td><code>validate -w &lt;w&gt;</code></td><td>Check the contract parses and the harness runs.</td></tr>
+            <tr><td><code>audit-labels -w &lt;w&gt;</code></td><td>Find duplicate inputs with disagreeing gold labels (<code>--fail</code> for CI).</td></tr>
+            <tr><td><code>judge-check -w &lt;w&gt;</code></td><td>Measure judge↔human agreement on a calibration set (<code>--enforce</code> to gate).</td></tr>
             <tr><td><code>compare -w &lt;w&gt; --to &lt;model&gt;</code></td><td>Baseline vs. target scorecard + threshold checks.</td></tr>
             <tr><td><code>migrate -w &lt;w&gt; --to &lt;model&gt;</code></td><td>Repair + validate + produce migrated files.</td></tr>
+            <tr><td><code>refine -w &lt;w&gt;</code></td><td>Re-optimize the prompt for a changed dataset (model pinned).</td></tr>
             <tr><td><code>report [-w &lt;w&gt;]</code></td><td>Render the latest migration report(s).</td></tr>
             <tr><td><code>open-pr -w &lt;w&gt;</code></td><td>Open a PR (or issue) whose body is the evidence report: summary, scorecard, unified diffs, attempt log, holdout checks.</td></tr>
           </tbody>
@@ -318,6 +321,7 @@ driftless open-pr -w support_classifier --create</code></pre>
         <ul>
           <li><code class="inline">--generator llm|none</code> — the repair strategy (LLM-backed by default; <code class="inline">none</code> turns the loop into a dry analysis).</li>
           <li><code class="inline">--to &lt;model&gt;</code> — the target model to migrate to (otherwise the contract's candidates are used).</li>
+          <li><code class="inline">--strict-label-audit</code> — block when duplicate/near-duplicate inputs disagree on gold labels (warns by default).</li>
         </ul>
       </section>
@@ -424,7 +428,7 @@ driftless view -w support_classifier</code></pre>
     <span class="tok-k">runs-on</span>: ubuntu-latest
     <span class="tok-k">steps</span>:
       - <span class="tok-k">uses</span>: actions/checkout@v4
-      - <span class="tok-k">uses</span>: driftless-dev/driftless@v0.2.1
+      - <span class="tok-k">uses</span>: driftless-dev/driftless@v0.2.5
         <span class="tok-k">with</span>:
           <span class="tok-k">command</span>: <span class="tok-s">plan</span></code></pre>
         <p>A scheduled <code class="inline">plan</code> gates CI when a deprecated model needs attention; a manually-triggered <code class="inline">migrate</code> opens a PR (or an issue when blocked) with the evidence attached.</p>

{driftless-0.2.1 → driftless-0.2.5}/src/driftless/__init__.py RENAMED

@@ -1,3 +1,3 @@
 """driftless: Dependabot for LLM models."""
-__version__ = "0.2.1"
+__version__ = "0.2.5"

{driftless-0.2.1 → driftless-0.2.5}/src/driftless/cli.py RENAMED

@@ -136,6 +136,16 @@ def init_ci(
     plan: bool = typer.Option(
         False, "--plan/--no-plan", help="Scaffold scheduled plan --act workflow."
     ),
+    audit_labels: bool | None = typer.Option(
+        None,
+        "--audit-labels/--no-audit-labels",
+        help="Scaffold label-audit CI workflow (default: on if labels_path is set).",
+    ),
+    judge_check: bool | None = typer.Option(
+        None,
+        "--judge-check/--no-judge-check",
+        help="Scaffold judge-calibration CI workflow (default: on if calibration_path is set).",
+    ),
 ) -> None:
     """Scaffold GitHub Actions workflows wired to the driftless composite Action."""
     from .init_ci import CHECKLIST, scaffold_ci_from_path
@@ -151,6 +161,8 @@ def init_ci(
             include_refine=refine,
             include_poll=poll,
             include_plan=plan,
+            include_audit_labels=audit_labels,
+            include_judge_check=judge_check,
         )
     except DriftlessError as exc:
         _fail(exc)
@@ -355,6 +367,30 @@ def _preflight(wf: Workflow, target_model: str) -> None:
         err_console.print(f"[yellow]warning:[/] {pf.warning}")
+def _label_audit_preflight(
+    workflow_name: str,
+    wf: Workflow,
+    *,
+    skip: bool,
+    strict: bool,
+) -> None:
+    """Warn or block when duplicate inputs carry disagreeing gold labels."""
+    if skip or wf.eval.grading != "label" or not wf.eval.labels_path:
+        return
+    from .label_audit import audit_labels, format_audit_report
+    report = audit_labels(workflow_name, wf, cwd=Path.cwd())
+    if not report.has_conflicts:
+        return
+    text = format_audit_report(report)
+    if strict:
+        err_console.print(text)
+        raise typer.Exit(code=1)
+    err_console.print(f"[yellow]Label audit warning[/] — {report.conflict_groups[0].kind} conflicts detected")
+    err_console.print(f"[dim]{text}[/]")
+    err_console.print("[dim]re-run with --strict-label-audit to block, or --skip-label-audit to silence[/]")
 def _fmt(value: float | None, *, pct: bool = False) -> str:
     if value is None:
         return "[dim]n/a[/]"
@@ -812,6 +848,14 @@ def migrate(
         2, "--candidates", help="Candidate patches to propose per iteration "
         "(widened automatically when an iteration stalls).",
     ),
+    skip_label_audit: bool = typer.Option(
+        False, "--skip-label-audit", help="Skip duplicate-label preflight check."
+    ),
+    strict_label_audit: bool = typer.Option(
+        False,
+        "--strict-label-audit",
+        help="Block when duplicate/near-duplicate inputs disagree on gold labels.",
+    ),
 ) -> None:
     """Attempt a migration: repair editable files, validate on holdout, report."""
     from .engine import MigrationStatus, run_migration
@@ -820,6 +864,9 @@ def migrate(
     try:
         contract = load_contract(contract_path)
         wf = contract.workflow(workflow)
+        _label_audit_preflight(
+            workflow, wf, skip=skip_label_audit, strict=strict_label_audit
+        )
         _preflight(wf, to)
         gen = build_generator(
             generator,
@@ -916,6 +963,14 @@ def refine(
         2, "--candidates", help="Candidate patches to propose per iteration "
         "(widened automatically when an iteration stalls).",
     ),
+    skip_label_audit: bool = typer.Option(
+        False, "--skip-label-audit", help="Skip duplicate-label preflight check."
+    ),
+    strict_label_audit: bool = typer.Option(
+        False,
+        "--strict-label-audit",
+        help="Block when duplicate/near-duplicate inputs disagree on gold labels.",
+    ),
 ) -> None:
     """Re-optimize a prompt for a changed eval dataset (model stays pinned).
@@ -933,6 +988,9 @@ def refine(
     try:
         contract = load_contract(contract_path)
         wf = contract.workflow(workflow)
+        _label_audit_preflight(
+            workflow, wf, skip=skip_label_audit, strict=strict_label_audit
+        )
         gen = build_generator(
             generator,
             provider=generator_provider,
@@ -1191,6 +1249,121 @@ def open_pr(
         )
+@app.command(name="judge-check")
+def judge_check(
+    workflow: str = typer.Option(..., "--workflow", "-w"),
+    contract_path: Path = typer.Option(None, "--contract", help="Path to driftless.yml."),
+    enforce: bool = typer.Option(
+        False,
+        "--enforce",
+        help="Apply eval.judge max_mae/min_correlation gates (same as migrate/compare).",
+    ),
+) -> None:
+    """Measure LLM-judge agreement against a human calibration set."""
+    from .judges import build_judge, judge_agreement, require_judge_agreement
+    try:
+        contract = load_contract(contract_path)
+        wf = contract.workflow(workflow)
+    except DriftlessError as exc:
+        _fail(exc)
+        return
+    if wf.eval.grading != "judge" or wf.eval.judge is None:
+        _fail(
+            DriftlessError(
+                f"{workflow!r} is not judge-graded",
+                hint="add eval.judge to the workflow in driftless.yml",
+            )
+        )
+        return
+    spec = wf.eval.judge
+    if not spec.calibration_path:
+        _fail(
+            DriftlessError(
+                "eval.judge.calibration_path is not set",
+                hint="add a human-scored JSONL file for judge agreement",
+            )
+        )
+        return
+    judge = build_judge(spec)
+    try:
+        agreement = (
+            require_judge_agreement(judge, spec)
+            if enforce
+            else judge_agreement(judge, spec)
+        )
+    except DriftlessError as exc:
+        _fail(exc)
+        return
+    if agreement is None:
+        _fail(DriftlessError("calibration set is empty or produced no scores"))
+        return
+    console.print(f"[bold]{workflow}[/] — judge calibration check\n")
+    console.print(f"  records: {agreement.n}")
+    console.print(f"  MAE: {agreement.mean_abs_error:.3f}")
+    corr = f"{agreement.correlation:.3f}" if agreement.correlation is not None else "n/a"
+    console.print(f"  correlation: {corr}")
+    gate_bits: list[str] = []
+    if spec.max_mae is not None:
+        ok = agreement.mean_abs_error <= spec.max_mae
+        gate_bits.append(f"max_mae={spec.max_mae:g} ({'ok' if ok else 'FAIL'})")
+    if spec.min_correlation is not None:
+        ok = agreement.correlation is not None and agreement.correlation >= spec.min_correlation
+        gate_bits.append(f"min_correlation={spec.min_correlation:g} ({'ok' if ok else 'FAIL'})")
+    if gate_bits:
+        # Plain stdout — Rich highlight/markup breaks publish CI assertions on the
+        # gate status line when GITHUB_ACTIONS forces a TTY console.
+        typer.echo("  gates: " + ", ".join(gate_bits))
+    if enforce:
+        console.print(f"\n[green]gates passed[/] — {agreement.summary}")
+    else:
+        console.print(f"\n[dim]{agreement.summary}[/]")
+        if spec.max_mae is not None or spec.min_correlation is not None:
+            console.print("[dim]re-run with --enforce to apply contract gates[/]")
+@app.command(name="audit-labels")
+def audit_labels_cmd(
+    workflow: str = typer.Option(..., "--workflow", "-w"),
+    contract_path: Path = typer.Option(None, "--contract", help="Path to driftless.yml."),
+    near_threshold: float = typer.Option(
+        0.85, "--near-threshold", min=0.5, max=1.0,
+        help="Token Jaccard threshold for near-duplicate detection.",
+    ),
+    fail: bool = typer.Option(
+        False, "--fail", help="Exit non-zero when label conflicts are found.",
+    ),
+) -> None:
+    """Audit gold labels for duplicate inputs with disagreeing labels."""
+    from .label_audit import audit_labels, format_audit_report
+    try:
+        contract = load_contract(contract_path)
+        wf = contract.workflow(workflow)
+        report = audit_labels(
+            workflow, wf, cwd=Path.cwd(), near_threshold=near_threshold
+        )
+    except DriftlessError as exc:
+        _fail(exc)
+        return
+    text = format_audit_report(report)
+    if report.has_conflicts:
+        err_console.print(text)
+    else:
+        console.print(text)
+    if fail and report.has_conflicts:
+        raise typer.Exit(code=1)
 @app.command()
 def report(
     workflow: str = typer.Option(None, "--workflow", "-w", help="Workflow to show (default: all)."),
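
The `audit-labels` command above exposes a `--near-threshold` option described as a token Jaccard threshold (default 0.85), but the `label_audit` module itself is not part of this excerpt. The following is a minimal sketch of how such a check could work, assuming whitespace tokenization and a plain pairwise scan. The function names and the `kind` strings are hypothetical.

```python
from __future__ import annotations

from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token Jaccard similarity: |A ∩ B| / |A ∪ B| over lowercase word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty inputs are treated as identical
    return len(ta & tb) / len(ta | tb)


def find_conflicts(
    records: list[tuple[str, str]], near_threshold: float = 0.85
) -> list[tuple[int, int, str]]:
    """Return (i, j, kind) for pairs of similar inputs whose gold labels differ.

    `records` holds (input_text, gold_label) pairs. The pairwise scan is
    O(n^2), which is fine for a sketch but not for a large dataset.
    """
    conflicts: list[tuple[int, int, str]] = []
    for (i, (xi, yi)), (j, (xj, yj)) in combinations(enumerate(records), 2):
        if yi == yj:
            continue  # agreeing labels are never a conflict
        if xi == xj:
            conflicts.append((i, j, "duplicate"))
        elif jaccard(xi, xj) >= near_threshold:
            conflicts.append((i, j, "near-duplicate"))
    return conflicts
```

Under these assumptions, a CI wrapper in the spirit of `--fail` would exit non-zero whenever `find_conflicts` returns a non-empty list.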

{driftless-0.2.1 → driftless-0.2.5}/src/driftless/compare.py RENAMED

@@ -11,6 +11,7 @@ from __future__ import annotations
 import json
 from dataclasses import asdict, dataclass, field
 from pathlib import Path
+from typing import cast
 from .contract import ThresholdsSpec, Workflow
 from .errors import DriftlessError
@@ -195,6 +196,11 @@ def compare_models(
             )
         judge = build_judge(judge_spec)
+    if judge is not None and workflow.eval.judge is not None:
+        from .judges import Judge, require_judge_agreement
+        require_judge_agreement(cast(Judge, judge), workflow.eval.judge, cwd=cwd)
     progress_log(f"compare: baseline run ({current})...")
     baseline_run = run_workflow(workflow, current, cwd=cwd)
     baseline_metrics = evaluate(workflow, baseline_run, judge=judge, cwd=cwd)

{driftless-0.2.1 → driftless-0.2.5}/src/driftless/contract.py RENAMED

@@ -157,6 +157,18 @@ class JudgeSpec(StrictModel):
     # Optional path to human-scored records (carrying a numeric ``score``) for a
     # judge-reliability agreement check.
     calibration_path: str | None = None
+    # Optional gates (require ``calibration_path``). When set, ``migrate`` /
+    # ``compare`` / ``refine`` refuse to optimize against an untrusted judge.
+    max_mae: float | None = None
+    min_correlation: float | None = None
+    @model_validator(mode="after")
+    def _gates_need_calibration(self) -> "JudgeSpec":
+        if (self.max_mae is not None or self.min_correlation is not None) and not self.calibration_path:
+            raise ValueError(
+                "eval.judge.max_mae/min_correlation require calibration_path"
+            )
+        return self
     @field_validator("rubric")
     @classmethod
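
The `_gates_need_calibration` validator above rejects a contract that sets a gate without a calibration file. To show the behaviour without depending on the project's `StrictModel` base class, here is a stdlib-only analogue. `JudgeGates` is a hypothetical stand-in for `JudgeSpec`, using a dataclass `__post_init__` in place of the model validator.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class JudgeGates:
    """Hypothetical stand-in for the gate-related fields of JudgeSpec."""
    calibration_path: str | None = None
    max_mae: float | None = None
    min_correlation: float | None = None

    def __post_init__(self) -> None:
        # Same rule as the validator in the diff: a gate is meaningless
        # without human-scored records to measure agreement against.
        has_gate = self.max_mae is not None or self.min_correlation is not None
        if has_gate and not self.calibration_path:
            raise ValueError(
                "eval.judge.max_mae/min_correlation require calibration_path"
            )
```

With this sketch, `JudgeGates(calibration_path="cal.jsonl", max_mae=0.5)` constructs normally, while `JudgeGates(max_mae=0.5)` raises `ValueError` at construction time instead of failing later during a run.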

{driftless-0.2.1 → driftless-0.2.5}/src/driftless/engine.py RENAMED

@@ -327,6 +327,9 @@ class MigrationResult:
     experiment_log: list[AttemptRecord] = field(default_factory=list)
     cluster_history: list[list[FailureCluster]] = field(default_factory=list)
     warnings: list[str] = field(default_factory=list)
+    # Judge-graded workflows: calibration agreement + low-score rationales for reviewers.
+    judge_agreement: Any | None = None
+    judge_evidence: list[dict[str, Any]] = field(default_factory=list)
     # refine-only: thresholds derived from the achieved holdout metrics, for the
     # customer to accept/edit (the old dataset's thresholds are stale).
     suggested_thresholds: dict[str, float] = field(default_factory=dict)
@@ -478,6 +481,27 @@ def run_migration(
             )
         judge = build_judge(judge_spec)
+    judge_agreement_info = None
+    if judge is not None and workflow.eval.judge is not None:
+        from .judges import require_judge_agreement
+        try:
+            judge_agreement_info = require_judge_agreement(
+                judge, workflow.eval.judge, cwd=cwd
+            )
+        except DriftlessError as exc:
+            return MigrationResult(
+                workflow=workflow_name,
+                current_model=current,
+                target_model=target_model,
+                status=MigrationStatus.BLOCKED,
+                iterations=0,
+                baseline=Metrics(n=0, schema_error_rate=None, refusal_rate=0.0),
+                naive_target=Metrics(n=0, schema_error_rate=None, refusal_rate=0.0),
+                final=Metrics(n=0, schema_error_rate=None, refusal_rate=0.0),
+                message=str(exc),
+            )
     if not workflow.model.has_override():
         return MigrationResult(
             workflow=workflow_name,
@@ -500,6 +524,13 @@ def run_migration(
     use_ids = bool(workflow.eval.id_field) and split.gold is not None
+    def _judge_evidence(rows: list[RecordRow]) -> list[dict[str, Any]]:
+        if workflow.eval.grading != "judge":
+            return []
+        from .judges import judge_evidence_samples
+        return judge_evidence_samples(rows)
     def evaluate_on(
         model: str, idx: list[int], files: dict[str, str] | None = None
     ) -> RunAnalysis:
@@ -566,6 +597,8 @@ def run_migration(
                 holdout_checks=holdout_checks,
                 tuning_checks=naive_checks,
                 warnings=size_warnings,
+                judge_agreement=judge_agreement_info,
+                judge_evidence=_judge_evidence(naive_analysis.rows),
                 message="naive model swap passes thresholds; only the model ID changes",
             )
@@ -753,6 +786,8 @@ def run_migration(
                     experiment_log=experiment_log,
                     cluster_history=cluster_history,
                     warnings=size_warnings,
+                    judge_agreement=judge_agreement_info,
+                    judge_evidence=_judge_evidence(best_analysis.rows),
                     original_editable_files=original_editable,
                     message="migration passed tuning and holdout thresholds",
                 )
@@ -821,6 +856,8 @@ def run_migration(
             cluster_history=cluster_history,
             warnings=size_warnings,
             suggested_thresholds=suggested,
+            judge_agreement=judge_agreement_info,
+            judge_evidence=_judge_evidence(best_analysis.rows),
             original_editable_files=original_editable,
             message=message,
         )
@@ -850,6 +887,8 @@ def run_migration(
         experiment_log=experiment_log,
         cluster_history=cluster_history,
         warnings=size_warnings,
+        judge_agreement=judge_agreement_info,
+        judge_evidence=_judge_evidence(best_analysis.rows),
         original_editable_files=original_editable,
         message=message,
     )
