PyPI - driftless - Versions diffs - 0.2.5__tar.gz → 0.2.7__tar.gz - Mend

driftless 0.2.5tar.gz → 0.2.7tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (88)

{driftless-0.2.5 → driftless-0.2.7}/CHANGELOG.md RENAMED

@@ -17,6 +17,27 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ---
+## [0.2.7] - 2026-07-01
+### Added
+- **P0.3 per-class support floors** — warn when any class has fewer than five gold
+  examples on a split (`assess_class_support`); surfaced on `migrate` (tuning +
+  holdout), `compare` (baseline + target), CLI "Confidence caveats", and saved
+  compare JSON.
+---
+## [0.2.6] - 2026-07-01
+### Added
+- **P0.3 multi-seed tuning selection** — optional `migration.split_seed_count`
+  (1–5) averages tuning-split metrics across shuffle seeds when scoring repair
+  candidates; holdout validation still uses the primary `--seed` only.
+---
 ## [0.2.5] - 2026-07-01
 ### Added
@@ -132,8 +153,9 @@ First public release on [PyPI](https://pypi.org/project/driftless/0.1.0/).
 - **Docs** — project overview, repair algorithm spec, 2×2 migration methodology,
   Poetry + Dependabot product framing.
-[Unreleased]: https://github.com/driftless-dev/driftless/compare/v0.2.5...HEAD
-[0.2.5]: https://github.com/driftless-dev/driftless/releases/tag/v0.2.5
+[Unreleased]: https://github.com/driftless-dev/driftless/compare/v0.2.7...HEAD
+[0.2.7]: https://github.com/driftless-dev/driftless/releases/tag/v0.2.7
+[0.2.6]: https://github.com/driftless-dev/driftless/compare/v0.2.6...v0.2.7
 [0.2.4]: https://github.com/driftless-dev/driftless/compare/v0.2.4...v0.2.5
 [0.2.3]: https://github.com/driftless-dev/driftless/compare/v0.2.3...v0.2.4
 [0.2.2]: https://github.com/driftless-dev/driftless/compare/v0.2.2...v0.2.3

{driftless-0.2.5 → driftless-0.2.7}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: driftless
-Version: 0.2.5
+Version: 0.2.7
 Summary: Keep prompts in sync when model or eval data changes — Poetry-style lock regeneration, Dependabot-style PRs.
 Project-URL: Homepage, https://github.com/driftless-dev/driftless
 Project-URL: Repository, https://github.com/driftless-dev/driftless
@@ -96,7 +96,7 @@ optimizes against it, with your team owning the definition of "good":
 |---|---|
 | `init` | Scaffold a `driftless.yml`. |
 | `init-policy` | Scaffold a `.driftless/policy.yml` (when to migrate). |
-| `init-ci` | Scaffold `.github/workflows/` for scan, migrate, refine, poll, label audit, and judge check. |
+| `init-ci` | Scaffold `.github/workflows/` for scan, migrate, refine, poll, plan, label audit, and judge check. |
 | `scan` | Find probable LLM usage and at-risk models. |
 | `plan` | Discover at-risk workflows and apply the migration policy (CI triage). |
 | `plan --act` | Migrate + open a PR/issue for every actionable trigger (close the loop). |
@@ -129,11 +129,11 @@ propose it.
 ## GitHub-native usage
 A composite GitHub Action (`action.yml`) wraps the CLI so scans and migrations
-can run in CI. See `.github/workflows/` for a scheduled deprecation scan and a
-manually-triggered migration that opens a PR (or an issue when blocked).
+can run in CI. See `.github/workflows/` for a scheduled deprecation scan, weekly
+`plan --act` triage, and manually-triggered migration workflows.
 ```yaml
-- uses: driftless-dev/driftless@v0.2.5
+- uses: driftless-dev/driftless@v0.2.7
   with:
     command: scan
 ```

{driftless-0.2.5 → driftless-0.2.7}/README.md RENAMED

@@ -57,7 +57,7 @@ optimizes against it, with your team owning the definition of "good":
 |---|---|
 | `init` | Scaffold a `driftless.yml`. |
 | `init-policy` | Scaffold a `.driftless/policy.yml` (when to migrate). |
-| `init-ci` | Scaffold `.github/workflows/` for scan, migrate, refine, poll, label audit, and judge check. |
+| `init-ci` | Scaffold `.github/workflows/` for scan, migrate, refine, poll, plan, label audit, and judge check. |
 | `scan` | Find probable LLM usage and at-risk models. |
 | `plan` | Discover at-risk workflows and apply the migration policy (CI triage). |
 | `plan --act` | Migrate + open a PR/issue for every actionable trigger (close the loop). |
@@ -90,11 +90,11 @@ propose it.
 ## GitHub-native usage
 A composite GitHub Action (`action.yml`) wraps the CLI so scans and migrations
-can run in CI. See `.github/workflows/` for a scheduled deprecation scan and a
-manually-triggered migration that opens a PR (or an issue when blocked).
+can run in CI. See `.github/workflows/` for a scheduled deprecation scan, weekly
+`plan --act` triage, and manually-triggered migration workflows.
 ```yaml
-- uses: driftless-dev/driftless@v0.2.5
+- uses: driftless-dev/driftless@v0.2.7
   with:
     command: scan
 ```

{driftless-0.2.5 → driftless-0.2.7}/docs/RELEASE.md RENAMED

@@ -153,7 +153,7 @@ After a release, users can pin the composite Action by release tag
 (`action.yml` lives at the repo root — no `/action` path segment):
 ```yaml
-- uses: driftless-dev/driftless@v0.2.5
+- uses: driftless-dev/driftless@v0.2.7
   with:
     command: scan
 ```
@@ -161,9 +161,9 @@ After a release, users can pin the composite Action by release tag
 Or pin the PyPI package in the Action input:
 ```yaml
-- uses: driftless-dev/driftless@v0.2.5
+- uses: driftless-dev/driftless@v0.2.7
   with:
-    version: "==0.2.5"
+    version: "==0.2.7"
     command: migrate
 ```
@@ -171,7 +171,7 @@ Optionally maintain a floating **`v1`** tag on the latest stable minor release
 (point it at the current release tag after each publish):
 ```bash
-git tag -f v1 v0.2.5 && git push origin v1 --force
+git tag -f v1 v0.2.7 && git push origin v1 --force
 ```
 Update [`action.yml`](../action.yml) default `version` input when cutting releases.
@@ -213,7 +213,9 @@ In **Settings → Secrets and variables → Actions**, add:
 | `ANTHROPIC_API_KEY` | Live eval matrix job (`provider: anthropic`) |
 If a secret is missing, that provider job exits cleanly with a warning (CI stays
-green). When both are set, nightly runs append to
+green). On scheduled or manual runs, the **secrets-preflight** job writes a
+summary table to the workflow run so you can see which keys are configured.
+When both are set, nightly runs append to
 `.driftless/regression-metrics.jsonl` and check against
 `tests/fixtures/live_eval_baseline.json` with `--require-all`.

{driftless-0.2.5 → driftless-0.2.7}/site/docs.html RENAMED

@@ -428,7 +428,7 @@ driftless view -w support_classifier</code></pre>
     <span class="tok-k">runs-on</span>: ubuntu-latest
     <span class="tok-k">steps</span>:
       - <span class="tok-k">uses</span>: actions/checkout@v4
-      - <span class="tok-k">uses</span>: driftless-dev/driftless@v0.2.5
+      - <span class="tok-k">uses</span>: driftless-dev/driftless@v0.2.7
         <span class="tok-k">with</span>:
           <span class="tok-k">command</span>: <span class="tok-s">plan</span></code></pre>
         <p>A scheduled <code class="inline">plan</code> gates CI when a deprecated model needs attention; a manually-triggered <code class="inline">migrate</code> opens a PR (or an issue when blocked) with the evidence attached.</p>

{driftless-0.2.5 → driftless-0.2.7}/src/driftless/__init__.py RENAMED

@@ -1,3 +1,3 @@
 """driftless: Dependabot for LLM models."""
-__version__ = "0.2.5"
+__version__ = "0.2.7"

{driftless-0.2.5 → driftless-0.2.7}/src/driftless/cli.py RENAMED

@@ -446,6 +446,11 @@ def compare(
     console.print(_scorecard(comparison))
+    if comparison.warnings:
+        console.print("\n[bold yellow]Confidence caveats[/]:")
+        for w in comparison.warnings:
+            console.print(f"  • {w}")
     console.print("\n[bold]Thresholds[/] (target vs contract):")
     if not comparison.checks:
         console.print("  [dim]no thresholds configured[/]")

{driftless-0.2.5 → driftless-0.2.7}/src/driftless/compare.py RENAMED

@@ -15,7 +15,7 @@ from typing import cast
 from .contract import ThresholdsSpec, Workflow
 from .errors import DriftlessError
-from .evaluation import Metrics, evaluate
+from .evaluation import Metrics, assess_class_support, evaluate
 from .harness import run_workflow
 from .progress import log as progress_log
@@ -35,6 +35,7 @@ class Comparison:
     baseline: Metrics
     target: Metrics
     checks: list[ThresholdCheck] = field(default_factory=list)
+    warnings: list[str] = field(default_factory=list)
     @property
     def passed(self) -> bool:
@@ -218,6 +219,14 @@ def compare_models(
     )
     checks = check_thresholds(workflow.thresholds, baseline_metrics, target_metrics)
+    warnings: list[str] = []
+    for metrics, label in (
+        (baseline_metrics, "baseline"),
+        (target_metrics, "target"),
+    ):
+        for w in assess_class_support(metrics, context=f"{label} eval"):
+            if w not in warnings:
+                warnings.append(w)
     return Comparison(
         workflow=workflow_name,
@@ -226,6 +235,7 @@ def compare_models(
         baseline=baseline_metrics,
         target=target_metrics,
         checks=checks,
+        warnings=warnings,
     )
@@ -241,6 +251,7 @@ def save_comparison(comparison: Comparison, cwd: Path | None = None) -> Path:
         "baseline": asdict(comparison.baseline),
         "target": asdict(comparison.target),
         "checks": [asdict(c) for c in comparison.checks],
+        "warnings": comparison.warnings,
         "passed": comparison.passed,
     }
     out_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
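
The `compare_models` hunk above gathers support warnings for the baseline and target metrics and skips exact duplicates while preserving order. A minimal standalone sketch of that merge, assuming warnings are plain strings; `merge_warnings` is a hypothetical helper for illustration and is not part of driftless:

```python
from collections.abc import Iterable


def merge_warnings(*groups: Iterable[str]) -> list[str]:
    """Concatenate warning groups, keeping only first occurrences, in order."""
    merged: list[str] = []
    for group in groups:
        for warning in group:
            # A linear membership test is fine for a handful of warnings.
            if warning not in merged:
                merged.append(warning)
    return merged


# Made-up messages shaped like the ones in the diff.
baseline = ["Low per-class support on baseline eval: billing (4)"]
target = ["Low per-class support on target eval: billing (4)"]
print(merge_warnings(baseline, target, baseline))
```

Because the diff embeds the context label (`baseline eval` or `target eval`) in each message, warnings about the same class on both sides are different strings and both survive; the check only drops byte-identical repeats.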

{driftless-0.2.5 → driftless-0.2.7}/src/driftless/contract.py RENAMED

@@ -324,6 +324,16 @@ class MigrationSpec(StrictModel):
     allow_business_logic_edits: bool = False
     max_iterations: int = 8
     holdout_required: bool = True
+    # When >1, average tuning-split metrics across this many shuffle seeds
+    # (seed, seed+1, …) when scoring repair candidates. Holdout uses ``seed`` only.
+    split_seed_count: int = 1
+    @field_validator("split_seed_count")
+    @classmethod
+    def _split_seed_count_range(cls, v: int) -> int:
+        if v < 1 or v > 5:
+            raise ValueError("migration.split_seed_count must be between 1 and 5")
+        return v
 class RepairSpec(StrictModel):
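
The validator added above rejects any `split_seed_count` outside 1 to 5. The same bounds check can be sketched without the model framework used in the diff; `MigrationSettings` and `is_valid` below are hypothetical names, and a plain dataclass stands in for the real `MigrationSpec`:

```python
from dataclasses import dataclass


@dataclass
class MigrationSettings:
    """Hypothetical stand-in for the contract's migration block."""

    split_seed_count: int = 1

    def __post_init__(self) -> None:
        # Same bounds as the validator in the diff: 1 through 5 inclusive.
        if not 1 <= self.split_seed_count <= 5:
            raise ValueError("migration.split_seed_count must be between 1 and 5")


def is_valid(count: int) -> bool:
    """Return True when the count passes the range check."""
    try:
        MigrationSettings(split_seed_count=count)
    except ValueError:
        return False
    return True


print([n for n in range(0, 7) if is_valid(n)])  # [1, 2, 3, 4, 5]
```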

{driftless-0.2.5 → driftless-0.2.7}/src/driftless/engine.py RENAMED

@@ -30,10 +30,10 @@ from .calibrate import suggest_thresholds
 from .compare import ThresholdCheck, check_thresholds
 from .contract import ThresholdsSpec, Workflow
 from .errors import DriftlessError
-from .evaluation import Metrics, RecordRow, RunAnalysis, analyze
+from .evaluation import Metrics, RecordRow, RunAnalysis, analyze, average_metrics, assess_class_support
 from .harness import run_workflow
 from .progress import log as progress_log
-from .splits import make_splits, materialize_inputs
+from .splits import Split, make_splits, materialize_inputs
 # --------------------------------------------------------------------------- #
@@ -336,6 +336,8 @@ class MigrationResult:
     message: str = ""
     # Frozen editable files at loop start — baseline for per-candidate diffs in reports/UI.
     original_editable_files: dict[str, str] = field(default_factory=dict)
+    # Shuffle seeds used for tuning (primary ``seed`` only when split_seed_count==1).
+    split_seeds_used: list[int] = field(default_factory=list)
     @property
     def succeeded(self) -> bool:
@@ -516,11 +518,19 @@ def run_migration(
         )
     split = make_splits(workflow, cwd=cwd, seed=seed)
+    split_seeds_used = list(range(seed, seed + mig.split_seed_count))
     size_warnings = assess_split_sizes(
         len(split.input_lines),
         len(split.holdout_idx),
         holdout_required=mig.holdout_required,
     )
+    if mig.split_seed_count > 1:
+        size_warnings.append(
+            f"Multi-seed tuning: candidate selection averages metrics across "
+            f"{mig.split_seed_count} shuffle seeds ({split_seeds_used[0]}.."
+            f"{split_seeds_used[-1]}); each candidate scoring multiplies tuning "
+            "workflow runs."
+        )
     use_ids = bool(workflow.eval.id_field) and split.gold is not None
@@ -532,18 +542,23 @@ def run_migration(
         return judge_evidence_samples(rows)
     def evaluate_on(
-        model: str, idx: list[int], files: dict[str, str] | None = None
+        model: str,
+        idx: list[int],
+        files: dict[str, str] | None = None,
+        *,
+        split_ref: Split | None = None,
     ) -> RunAnalysis:
+        sp = split_ref or split
         file_ctx = apply_files(files, cwd=cwd) if files else nullcontext()
-        idx_lines = split.lines_for(idx)
+        idx_lines = sp.lines_for(idx)
         with materialize_inputs(workflow, idx_lines, cwd=cwd):
             with file_ctx:
                 run = run_workflow(workflow, model, cwd=cwd)
-                if use_ids:
+                if use_ids and sp.gold_ids is not None:
                     return analyze(
                         workflow,
                         run,
-                        gold_by_id=split.gold_by_id_for(idx),
+                        gold_by_id=sp.gold_by_id_for(idx),
                         inputs=idx_lines,
                         judge=judge,
                         cwd=cwd,
@@ -551,18 +566,34 @@ def run_migration(
                 return analyze(
                     workflow,
                     run,
-                    gold_labels=split.gold_for(idx),
+                    gold_labels=sp.gold_for(idx),
                     inputs=idx_lines,
                     judge=judge,
                     cwd=cwd,
                 )
+    def evaluate_tuning(
+        model: str, files: dict[str, str] | None = None
+    ) -> RunAnalysis:
+        """Score on the tuning split; average across seeds when configured."""
+        if mig.split_seed_count <= 1:
+            return evaluate_on(model, split.tuning_idx, files)
+        tuning_splits = [make_splits(workflow, cwd=cwd, seed=s) for s in split_seeds_used]
+        analyses = [
+            evaluate_on(model, sp.tuning_idx, files, split_ref=sp) for sp in tuning_splits
+        ]
+        return RunAnalysis(
+            metrics=average_metrics([a.metrics for a in analyses]),
+            rows=analyses[0].rows,
+        )
     progress_log(
         f"migration: phase 1/3 — initial eval "
         f"({len(split.tuning_idx)} tuning examples, model={current})"
     )
     progress_log("migration: phase 1/3 — baseline prompt on tuning split...")
     baseline_tuning = evaluate_on(current, split.tuning_idx).metrics
+    size_warnings.extend(assess_class_support(baseline_tuning, context="tuning split"))
     progress_log(f"migration: phase 1/3 — baseline F1={_fmt_f1(baseline_tuning.f1)}")
     progress_log("migration: phase 1/3 — current prompt on tuning split...")
     naive_analysis = evaluate_on(target_model, split.tuning_idx)
@@ -575,8 +606,15 @@ def run_migration(
         baseline_holdout = evaluate_on(current, split.holdout_idx).metrics
         holdout_metrics = evaluate_on(target_model, split.holdout_idx, files=files).metrics
         checks = check_thresholds(thresholds, baseline_holdout, holdout_metrics)
+        append_holdout_class_warnings(holdout_metrics)
         return all(c.passed for c in checks), holdout_metrics, checks
+    def append_holdout_class_warnings(holdout_metrics: Metrics | None) -> None:
+        if holdout_metrics is not None:
+            size_warnings.extend(
+                assess_class_support(holdout_metrics, context="holdout split")
+            )
     # Step: naive target already good? (migrate only -- in refine the model is
     # pinned, so the "naive target" is just the current prompt and there's no
     # model-only change to short-circuit on.)
@@ -597,6 +635,7 @@ def run_migration(
                 holdout_checks=holdout_checks,
                 tuning_checks=naive_checks,
                 warnings=size_warnings,
+                split_seeds_used=split_seeds_used,
                 judge_agreement=judge_agreement_info,
                 judge_evidence=_judge_evidence(naive_analysis.rows),
                 message="naive model swap passes thresholds; only the model ID changes",
@@ -691,7 +730,7 @@ def run_migration(
             cand_size = _patch_diff_size(patch.files, original_editable)
             try:
                 validate_patch_scope(patch, workflow, cwd)
-                analysis = evaluate_on(target_model, split.tuning_idx, files=patch.files)
+                analysis = evaluate_tuning(target_model, files=patch.files)
             except DriftlessError as exc:
                 experiment_log.append(
                     AttemptRecord(
@@ -786,6 +825,7 @@ def run_migration(
                     experiment_log=experiment_log,
                     cluster_history=cluster_history,
                     warnings=size_warnings,
+                split_seeds_used=split_seeds_used,
                     judge_agreement=judge_agreement_info,
                     judge_evidence=_judge_evidence(best_analysis.rows),
                     original_editable_files=original_editable,
@@ -826,6 +866,7 @@ def run_migration(
             refine_holdout_checks = check_thresholds(
                 ThresholdsSpec(), baseline_holdout, refine_holdout_metrics
             )
+            append_holdout_class_warnings(refine_holdout_metrics)
         basis = refine_holdout_metrics if refine_holdout_metrics is not None else best_metrics
         suggested = suggest_thresholds(basis)
@@ -855,6 +896,7 @@ def run_migration(
             experiment_log=experiment_log,
             cluster_history=cluster_history,
             warnings=size_warnings,
+            split_seeds_used=split_seeds_used,
             suggested_thresholds=suggested,
             judge_agreement=judge_agreement_info,
             judge_evidence=_judge_evidence(best_analysis.rows),
@@ -887,6 +929,7 @@ def run_migration(
         experiment_log=experiment_log,
         cluster_history=cluster_history,
         warnings=size_warnings,
+        split_seeds_used=split_seeds_used,
         judge_agreement=judge_agreement_info,
         judge_evidence=_judge_evidence(best_analysis.rows),
         original_editable_files=original_editable,
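
The engine changes above derive consecutive seeds from the primary one (`range(seed, seed + split_seed_count)`), rebuild the tuning split per seed, and average the resulting metrics when scoring a repair candidate. The sketch below illustrates that idea on toy data only. The split logic, the accuracy metric, and every function name are assumptions made for illustration; the real `make_splits` and `average_metrics` may behave differently.

```python
import random
from statistics import fmean


def seeds_for(seed: int, count: int) -> list[int]:
    """Consecutive seeds starting at the primary one, as in the diff."""
    return list(range(seed, seed + count))


def tuning_indices(n: int, seed: int, fraction: float = 0.5) -> list[int]:
    """Hypothetical split: shuffle 0..n-1 with a seeded RNG, keep a prefix."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order[: int(n * fraction)]


def accuracy(correct: list[bool], idx: list[int]) -> float:
    """Fraction of the selected examples that were answered correctly."""
    return sum(correct[i] for i in idx) / len(idx)


def multi_seed_score(correct: list[bool], seed: int, count: int) -> float:
    """Average the tuning-split score over `count` shuffle seeds."""
    scores = [
        accuracy(correct, tuning_indices(len(correct), s))
        for s in seeds_for(seed, count)
    ]
    return fmean(scores)


# Made-up per-example outcomes for one candidate prompt.
outcomes = [True] * 14 + [False] * 6
print(seeds_for(1, 2))
print(multi_seed_score(outcomes, seed=1, count=3))
```

In the diff, each extra seed costs another full workflow run per candidate, which is what the new "Multi-seed tuning" warning points out. Note also that `evaluate_tuning` averages only the metrics and keeps the per-record rows from the first seed's analysis.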

{driftless-0.2.5 → driftless-0.2.7}/src/driftless/evaluation.py RENAMED

@@ -74,6 +74,33 @@ class ClassMetrics:
     f1: float
+# Warn when macro-F1 aggregates classes with very few gold examples on a split.
+MIN_CLASS_SUPPORT = 5
+def assess_class_support(
+    metrics: Metrics,
+    *,
+    context: str,
+    min_support: int = MIN_CLASS_SUPPORT,
+) -> list[str]:
+    """Low-confidence warnings for rare classes in classification metrics."""
+    if metrics.f1 is None or not metrics.per_class or min_support <= 0:
+        return []
+    low = [
+        (name, cm.support)
+        for name, cm in sorted(metrics.per_class.items())
+        if 0 < cm.support < min_support
+    ]
+    if not low:
+        return []
+    bits = ", ".join(f"{name} ({n})" for name, n in low)
+    return [
+        f"Low per-class support on {context}: {bits} — each below {min_support} gold "
+        "examples. Macro-F1 may not reflect rare-class performance."
+    ]
 @dataclass
 class Metrics:
     n: int
@@ -96,6 +123,39 @@ class Metrics:
     scored: int = 0
+def average_metrics(items: list[Metrics]) -> Metrics:
+    """Mean of headline metrics across multiple eval runs (multi-seed tuning)."""
+    if not items:
+        raise ValueError("average_metrics requires at least one Metrics")
+    if len(items) == 1:
+        return items[0]
+    def _mean(vals: list[float | None]) -> float | None:
+        nums = [v for v in vals if v is not None]
+        return sum(nums) / len(nums) if nums else None
+    def _mean_int(vals: list[int]) -> int:
+        return int(round(sum(vals) / len(vals)))
+    costs = [m.total_cost for m in items if m.total_cost is not None]
+    return Metrics(
+        n=items[0].n,
+        schema_error_rate=_mean([m.schema_error_rate for m in items]),
+        refusal_rate=_mean([m.refusal_rate for m in items]) or 0.0,
+        accuracy=_mean([m.accuracy for m in items]),
+        precision=_mean([m.precision for m in items]),
+        recall=_mean([m.recall for m in items]),
+        f1=_mean([m.f1 for m in items]),
+        avg_latency_ms=_mean([m.avg_latency_ms for m in items]),
+        total_cost=sum(costs) if costs else None,
+        score=_mean([m.score for m in items]),
+        schema_errors=_mean_int([m.schema_errors for m in items]),
+        refusals=_mean_int([m.refusals for m in items]),
+        labeled=items[0].labeled,
+        scored=items[0].scored,
+    )
 def load_jsonl(path: Path) -> list[OutputRecord]:
     records: list[OutputRecord] = []
     with path.open(encoding="utf-8") as fh:
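
The support floor in `assess_class_support` can be reproduced on plain counts. This is a minimal sketch assuming per-class support is available as a `dict[str, int]`; `low_support_classes` and `support_warning` are hypothetical helpers, and the message is shortened relative to the one in the diff:

```python
MIN_CLASS_SUPPORT = 5  # same floor as the constant in the diff


def low_support_classes(
    support: dict[str, int], min_support: int = MIN_CLASS_SUPPORT
) -> list[tuple[str, int]]:
    """Classes with at least one gold example but fewer than the floor."""
    return [
        (name, n)
        for name, n in sorted(support.items())
        if 0 < n < min_support
    ]


def support_warning(support: dict[str, int], context: str) -> list[str]:
    """Return a single-item warning list, or an empty list when all is fine."""
    low = low_support_classes(support)
    if not low:
        return []
    bits = ", ".join(f"{name} ({n})" for name, n in low)
    return [f"Low per-class support on {context}: {bits}"]


# Made-up counts: "legal" has no gold examples on this split.
counts = {"technical": 8, "billing": 4, "legal": 0}
print(support_warning(counts, "tuning split"))
```

As in the diff, the condition `0 < support < min_support` means a class with zero gold examples on the split is not reported, and a class sitting exactly at the floor passes.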

{driftless-0.2.5 → driftless-0.2.7}/src/driftless/report.py RENAMED

@@ -568,6 +568,7 @@ def result_to_dict(result: MigrationResult) -> dict:
         "experiment_log": [asdict(a) for a in result.experiment_log],
         "cluster_trajectory": cluster_trajectories(result.cluster_history),
         "warnings": result.warnings,
+        "split_seeds_used": result.split_seeds_used,
         "judge_agreement": asdict(result.judge_agreement) if result.judge_agreement else None,
         "judge_evidence": result.judge_evidence,
         "suggested_thresholds": result.suggested_thresholds,

{driftless-0.2.5 → driftless-0.2.7}/tests/test_contract.py RENAMED

@@ -63,6 +63,17 @@ def test_workflow_not_found():
         contract.workflow("missing")
+def test_split_seed_count_must_be_in_range():
+    with pytest.raises(Exception):
+        Workflow.model_validate(
+            {
+                "run": {"command": "true", "input_path": "i", "output_path": "o"},
+                "model": {"current": "m", "env_var": "M"},
+                "migration": {"split_seed_count": 0},
+            }
+        )
 def test_load_missing_contract(tmp_path: Path):
     with pytest.raises(ContractError):
         load_contract(tmp_path / "nope.yml")

{driftless-0.2.5 → driftless-0.2.7}/tests/test_engine.py RENAMED

@@ -191,6 +191,7 @@ def test_small_dataset_run_carries_warning(tmp_path: Path):
     wf = _make_workflow(tmp_path)  # 6 examples -> below the min thresholds
     result = run_migration("demo", wf, "weak", generator=StrictGen(), cwd=tmp_path, seed=1)
     assert any("Small dataset" in w for w in result.warnings)
+    assert any("Low per-class support" in w for w in result.warnings)
 def test_cluster_failures():
@@ -207,3 +208,13 @@ def test_cluster_failures():
     assert kinds["refusal"].count == 1
     assert kinds["misclassification"].count == 2  # billing<-technical pair
     assert kinds["misclassification"].key == "billing -> technical"
+def test_multi_seed_tuning_still_passes(tmp_path: Path):
+    wf = _make_workflow(tmp_path)
+    wf.migration.split_seed_count = 2
+    result = run_migration("demo", wf, "weak", generator=StrictGen(), cwd=tmp_path, seed=1)
+    assert result.status == MigrationStatus.PASS
+    assert result.split_seeds_used == [1, 2]
+    assert any("Multi-seed tuning" in w for w in result.warnings)

{driftless-0.2.5 → driftless-0.2.7}/tests/test_evaluation.py RENAMED

@@ -309,6 +309,25 @@ def test_id_alignment_duplicate_output_id_raises(tmp_path: Path):
         evaluate(wf, run, cwd=tmp_path)
+def test_assess_class_support_flags_rare_classes():
+    from driftless.evaluation import ClassMetrics, Metrics, assess_class_support
+    metrics = Metrics(
+        n=12,
+        schema_error_rate=0.0,
+        refusal_rate=0.0,
+        f1=0.9,
+        per_class={
+            "billing": ClassMetrics(4, 1.0, 1.0, 1.0),
+            "technical": ClassMetrics(8, 0.9, 0.9, 0.9),
+        },
+    )
+    warnings = assess_class_support(metrics, context="tuning split")
+    assert len(warnings) == 1
+    assert "billing (4)" in warnings[0]
+    assert "tuning split" in warnings[0]
 def test_load_labels_by_id_rejects_duplicates(tmp_path: Path):
     from driftless.evaluation import load_labels_by_id
@@ -316,3 +335,14 @@ def test_load_labels_by_id_rejects_duplicates(tmp_path: Path):
     p.write_text('{"id":"a","label":"x"}\n{"id":"a","label":"y"}\n')
     with pytest.raises(Exception):
         load_labels_by_id(p, "id", "label")
+def test_average_metrics_means_headline_fields():
+    from driftless.evaluation import Metrics, average_metrics
+    a = Metrics(n=10, schema_error_rate=0.2, refusal_rate=0.1, f1=0.8)
+    b = Metrics(n=10, schema_error_rate=0.0, refusal_rate=0.0, f1=1.0)
+    avg = average_metrics([a, b])
+    assert avg.f1 == pytest.approx(0.9)
+    assert avg.schema_error_rate == pytest.approx(0.1)
+    assert avg.refusal_rate == pytest.approx(0.05)
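
`test_average_metrics_means_headline_fields` covers the case where every field is populated. Two details of the helpers inside `average_metrics` are easier to see in isolation: optional values are skipped rather than treated as zero, and integer counts go through Python's `round`, which rounds halves to the nearest even integer. The sketch below copies the shape of those helpers under hypothetical names:

```python
from typing import Optional


def mean_skipping_none(values: list[Optional[float]]) -> Optional[float]:
    """Mean of the non-None values, or None when nothing was measured."""
    nums = [v for v in values if v is not None]
    return sum(nums) / len(nums) if nums else None


def mean_count(values: list[int]) -> int:
    """Rounded mean of integer counts, using Python's built-in round()."""
    return int(round(sum(values) / len(values)))


print(mean_skipping_none([0.8, 1.0]))    # both runs reported a value
print(mean_skipping_none([0.8, None]))   # the missing run is ignored
print(mean_skipping_none([None, None]))  # nothing to average
print(mean_count([0, 1]))                # round(0.5) == 0
print(mean_count([1, 2]))                # round(1.5) == 2
```

One consequence worth keeping in mind when reading averaged results: a single schema error across two seeds averages to a count of 0, while the corresponding rate field still reflects it.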

{driftless-0.2.5 → driftless-0.2.7}/tests/test_init_ci.py RENAMED

@@ -293,6 +293,46 @@ def test_label_audit_helpers():
     assert label_audit_paths(contract) == ["labels.jsonl", "in.jsonl"]
+def test_init_ci_scaffolds_plan_workflow(tmp_path, monkeypatch):
+    monkeypatch.chdir(tmp_path)
+    Path("driftless.yml").write_text(
+        """
+version: 1
+workflows:
+  smoke:
+    run:
+      command: echo ok
+      input_path: in.jsonl
+      output_path: out.jsonl
+    model:
+      current: gpt-4o-mini
+      env_var: MODEL
+    eval:
+      labels_path: labels.jsonl
+""".lstrip()
+    )
+    out = tmp_path / "workflows"
+    result = runner.invoke(
+        app,
+        [
+            "init-ci",
+            "--out-dir",
+            str(out),
+            "--no-scan",
+            "--no-migrate",
+            "--no-refine",
+            "--no-audit-labels",
+            "--plan",
+        ],
+    )
+    assert result.exit_code == 0
+    plan = (out / "driftless-plan-act.yml").read_text()
+    assert "command: plan" in plan
+    assert "--act" in plan
+    assert "GH_TOKEN" in plan
 def test_rendered_workflows_use_action_ref():
     ref = "driftless-dev/driftless@v9.9.9"
     assert ref in render_migrate_workflow(ref)

driftless-0.2.7/tests/test_splits.py ADDED

@@ -0,0 +1,27 @@
+"""Tests for tuning/holdout splits."""
+from driftless.contract import Workflow
+from driftless.splits import make_splits
+def _workflow() -> Workflow:
+    return Workflow.model_validate(
+        {
+            "run": {"command": "true", "input_path": "i.jsonl", "output_path": "o.jsonl"},
+            "model": {"current": "m", "env_var": "M"},
+            "eval": {"labels_path": "l.jsonl", "split": {"tuning": 0.5, "holdout": 0.5}},
+        }
+    )
+def test_different_seeds_produce_different_partitions(tmp_path):
+    lines = "\n".join(f'{{"id": {i}, "label": "a"}}' for i in range(20)) + "\n"
+    labels = "\n".join('{"id": ' + str(i) + ', "label": "a"}' for i in range(20)) + "\n"
+    (tmp_path / "i.jsonl").write_text(lines)
+    (tmp_path / "l.jsonl").write_text(labels)
+    wf = _workflow()
+    wf.eval.id_field = "id"
+    a = make_splits(wf, cwd=tmp_path, seed=0)
+    b = make_splits(wf, cwd=tmp_path, seed=1)
+    assert a.tuning_idx != b.tuning_idx