PyPI - eval-toolkit - Versions diffs - 0.49.0__tar.gz → 1.0.0__tar.gz - Mend

eval-toolkit 0.49.0tar.gz → 1.0.0tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (184)

{eval_toolkit-0.49.0 → eval_toolkit-1.0.0}/.gitignore RENAMED

@@ -54,6 +54,14 @@ mutants/
 gate3-audit-prompt.md
 gate3-audit-report.md
 gate3-audit-round-*-report.md
+# R8-C10 audit fix: extend to cover the comprehensive-audit-* and
+# audit-verification-* naming conventions introduced at v0.50/v0.51.
+codex-comprehensive-audit-*-report.md
+codex-microaudit-*.md
+gemini-microaudit-*.md
+audit-gemini.md
+comprehensive-audit-codex.md
+audit-verification-*.md
 # Claude Code project settings (machine-local)
 .claude/

{eval_toolkit-0.49.0 → eval_toolkit-1.0.0}/CHANGELOG.md RENAMED

@@ -5,6 +5,323 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [1.0.0] — 2026-05-25 — Stability contract activates per ADR 0003
+v1.0 is a **stability-contract activation**, not a code delta from v0.51.
+Every fix that landed at v0.51 is what v1.0 ships; the new thing at v1.0
+is that the [ADR 0003](docs/source/adr/0003-stability-contract-and-gate3-methodology.md)
+Tier 1 / Tier 2 / Tier 3 stability contract becomes load-bearing.
+Breaking changes to Tier-1 surfaces after v1.0 require a major bump (v2.0).
+### Contract activation
+- **Tier 1 STRICT** — public-API signatures captured in
+  `tests/golden/public_api/snapshot.json`. Any signature drift bumps to v2.0.
+- **Tier 2 ADDITIVE** — the 9 strict Protocols (`Scorer`, `LeakageCheck`,
+  `Splitter`, `ThresholdSelector`, `DatasetLoader`, `MetricSpec`,
+  `MetaLearner`, `Probe`, `TextTransform`) + 1 opt-in (`Versioned`).
+  Method shapes are frozen; additive subprotocols + new Protocols allowed.
+- **Tier 3 FREE** — internal modules (prefixed `_`). Refactors don't
+  bump major.
+### Gate 3 audit closure (Rounds 5 → 10)
+The multi-LLM cross-review sequence is closed. Per ADR 0003, Gate 3
+substitutes Codex + Gemini + Claude independent reads for external
+academic peer review. Outcomes:
+- **Round 8** (against v0.50): 13 confirmed → fixed in v0.51; 3 refuted;
+  2 v1.x-deferred (R8-G3 custom exceptions; R8-G4 joblib OOM capping).
+- **Round 9** (against v0.51 RC): 6 confirmed of 10 source items + 3
+  third-audit findings in modules neither auditor cited. 2 candidate-
+  blocker items fixed in-PR before tag.
+- **Round 10** (micro-audit on R9 follow-on commit): 3 Codex confirmed →
+  fixed in v0.51; 1 accept-as-design (Gemini); 1 refuted (Gemini
+  Pattern-1 violation).
+Full ledger at `docs/source/audit_findings.md` Rounds 5 → 10. v1.0.1
+cleanup batch tracked at GH issue #76.
+### Carried-over deprecations
+The R8-C1 `DeprecationWarning` on multi-seed `evaluate_folded(seeds=...)`
+calls without an explicit `reseed_splitter` callback **persists past v1.0
+by design** (pre-v1.0 deprecation window is one minor; `DEPRECATION.md`
+requires ≥2 minors to close a cycle). Single-seed callers see no change;
+multi-seed callers should pass `reseed_splitter` for true seed variance.
+### Migration
+If your consumer is on v0.51, nothing changes. If on v0.50 or earlier,
+follow [`docs/source/migration/v0.51.md`](docs/source/migration/v0.51.md)
+for the actual migration steps. Downstream projects should pin
+`eval-toolkit>=1.0,<2.0` to opt into the stability contract.
+## [0.51.0] — 2026-05-24 — Round 8 rectification batch
+The 18-item rectification batch following the Round 8 multi-LLM audit
+(Codex + Gemini reports verified at
+`audit-verification-codex-gemini-v0.50.0.md`, 2026-05-24). Per Decision
+Y.2 + the staggered-pre-v1.0 plan, v0.51.0 is a BREAKING-allowed
+minor bundling all fixes before v1.0 tags. Round 9 audit runs against
+the v0.51 RC.
+**Audit outcome**: 13 confirmed → fixed in this release; 2 deferred
+(R8-G3 custom exceptions, R8-G4 joblib OOM capping) to v1.x as Tier-2
+additive; 3 refuted (R8-G2 cyclic-import framing; R8-G5 cherry-picked
+weak test; R8-V1 + R8-V2 over-confident Gemini validations). See
+`docs/source/audit_findings.md` Round 8 section for the full ledger.
+**Round 9 follow-on**: a Round 9 multi-LLM cross-review (Codex + Gemini)
+ran against the v0.51 RC pre-tag. Verified by Claude at
+`audit-verification-round-9-v0.51.0.md` (6 confirmed / 3 refuted / 1
+partial; plus 3 third-audit fixes in modules neither auditor cited).
+**Two third-audit findings + one source-report regression fix shipped
+in this RC pre-tag** (commit-graph below); the remaining 4 deferred
+items go to v1.0.1. See `audit_findings.md` Round 9 section for the
+full ledger.
+### Added (Round 9 follow-on)
+- **R9-F-sweep-1** (CANDIDATE v1.0 BLOCKER closed) — `_sweep.py:
+  _validate_scorer_output()` now validates scorer output is finite
+  (no NaN / +inf / -inf), not just shape. Pre-R9 follow-on, NaN/inf
+  scores passed R7-C's shape check and silently propagated into the
+  sweep DataFrame, then silently zeroed the ASR flag (NaN >= threshold
+  is False). Closes the "no silent failures" invariant gap R7-C
+  established for shape but didn't extend to finiteness. Brings sweep
+  validation to parity with `stacking.py`'s `_validate_fit_inputs` /
+  `_validate_predict_inputs`. Tier-2 additive — callers whose scorers
+  were silently producing NaN now get a clear `ValueError` with
+  diagnostic context.
+- **R9-F-bootstrap-1** — `bootstrap.bootstrap_ci(...)` emits a
+  `UserWarning` when scipy's BCa method degenerates (returns
+  `ci_low == ci_high == point` or non-finite bounds). Pre-R9, the
+  R8-C4(b) RNG bug spuriously varied bootstrap streams and could mask
+  BCa degeneracy on small-n + ceiling/floor-metric inputs; post-R8 with
+  correct RNG, the brittleness is exposed. Warning text recommends
+  `method='percentile'` as the safer fallback at small n. The default
+  remains `method='BCa'` (preserves bit-stability for non-degenerate
+  cases); auto-fallback is deferred to v1.0.1, pending user demand.
+- **R9-F-bootstrap-2** — `bootstrap.mde_from_ci(...)` now explicitly
+  rejects NaN CI width with `RuntimeError`. Pre-R9, NaN width
+  (possible when scipy BCa returns NaN bounds) bypassed the
+  `if width <= 0` check (NaN <= 0 is False in IEEE float) and
+  silently returned `MDEEstimate.mde = NaN`. Bundled with F-bootstrap-1.
+### Fixed (Round 10 follow-on)
+Pre-tag scoped Codex + Gemini micro-audit on `edadddc` surfaced 3
+Codex-confirmed findings (all fix-recommended / minor; no v1.0
+blockers). Verified by Claude; 1 Gemini accept-as-design + 1 Gemini
+refuted (Pattern-1 violation; calibration record in
+`audit_findings.md` Round 10 section). All 3 confirmed findings
+shipped in this RC pre-tag:
+- **R10-F1** — `protocols.Scorer.predict_proba` docstring + `_sweep.py`
+  error message clarification. Pre-R10, `_validate_scorer_output`'s
+  runtime error said "finite floats in [0, 1]" but the boundary check
+  only enforced finiteness (no range validation); the Scorer Protocol
+  docstring also lacked an explicit `[0, 1]` contract statement. R10-F1
+  extends the Protocol docstring to document calibrated-probability
+  semantics and rewords the sweep runtime message to drop the unenforced
+  `[0, 1]` claim. Range enforcement is intentionally deferred to a
+  future minor once consumer usage patterns clarify whether the
+  Protocol should be strict (`[0, 1]`) or permissive (ranking scores).
+- **R10-F2** — `tests/test_bootstrap_unit.py::test_bootstrap_ci_bca_degeneracy_emits_warning`
+  test predicate hardening. Pre-R10, the test's assertion block used
+  `if ci.ci_low == ci.ci_high == ci.point_estimate:` — but NaN==NaN is
+  False in IEEE float, so the assertions were silently skipped on the
+  current scipy fixture (which returns NaN bounds). The test passed
+  WITHOUT proving the warning fires for the common degeneracy mode.
+  R10-F2 mirrors the production predicate exactly:
+  `(not np.isfinite(low)) or (not np.isfinite(high)) or (low == high == point)`.
+  The assertion block now runs whenever ANY degeneracy mode fires.
+- **R10-F3** — `bootstrap.mde_from_ci` docstring update for the
+  R9-F-bootstrap-2 non-finite-width branch. Pre-R10, the Raises section
+  said "non-positive width" only; the implementation has also been
+  rejecting non-finite width since `edadddc` but the docstring lagged.
+  R10-F3 updates the Raises text to "non-positive or non-finite width"
+  and adds a 4-line note explaining the scipy-BCa NaN-bound motivation
+  so callers understand the new behavior is intentional, not incidental.
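Several of the entries above (R9-F-sweep-1, R9-F-bootstrap-2, R10-F2) trace back to the same IEEE 754 behavior: every ordered or equality comparison involving NaN evaluates to False. The sketch below reproduces the three predicates with the standard library only. The helper name `is_degenerate` and the sample values are illustrative assumptions, not code from eval-toolkit, although the predicate follows the one quoted in R10-F2 (with `math.isfinite` standing in for `np.isfinite`).

```python
import math

nan = math.nan

# R9-F-sweep-1: a NaN score never clears a threshold, so the flag is silently False.
nan_clears_threshold = nan >= 0.5

# R9-F-bootstrap-2: a NaN width slips past a `width <= 0` guard.
nan_width_rejected = nan <= 0

# R10-F2: chained equality on NaN bounds is False, so a guarded assertion block is skipped.
nan_bounds_equal = nan == nan == nan


def is_degenerate(low: float, high: float, point: float) -> bool:
    """Hypothetical helper following the predicate quoted in R10-F2."""
    return (not math.isfinite(low)) or (not math.isfinite(high)) or (low == high == point)


print(nan_clears_threshold, nan_width_rejected, nan_bounds_equal)  # False False False
print(is_degenerate(nan, nan, 0.9))   # True: non-finite bounds
print(is_degenerate(0.9, 0.9, 0.9))   # True: collapsed interval
print(is_degenerate(0.7, 0.95, 0.9))  # False: ordinary interval
```

Checking finiteness first, and only then comparing, is what lets a single predicate cover both the NaN-bound and the collapsed-interval degeneracy modes.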
+### Added
+- **R8-C6** — `calibration.reliability_curve(...)` and
+  `calibration.maximum_calibration_error(...)` now call
+  `_validate_calibrated_score(y_score)` BEFORE the sklearn dispatch.
+  Pre-v0.51 these functions silently accepted raw logits (any range);
+  sibling `metrics.expected_calibration_error*` variants already
+  validated input range via the same helper. Now symmetric — out-of-range
+  scores raise `ValueError` with the same actionable diagnostic. Also,
+  `calibration.fit_temperature(...)` now validates the `bounds` tuple
+  (finiteness, positivity, `lo < hi`) BEFORE forwarding to
+  `scipy.optimize.minimize_scalar` — cryptic optimizer errors replaced
+  with actionable input-validation errors.
+- **R8-F1** — `losses.RecallAtLowFPR.__init__(...)` now validates
+  `pos_weight > 0` at construction time, matching the sibling-kwarg
+  validators for `fpr_target` and `fpr_smoothing_beta`. Pre-v0.51
+  non-positive `pos_weight` produced degenerate-but-bounded loss
+  values silently.
+- **R8-F2** — `metric_specs.ece(n_bins=, strategy=)` factory now validates
+  `n_bins` eagerly at spec-construction time (matches the eager
+  `strategy` validation already present). Pre-v0.51 `n_bins`
+  validation was deferred to compute time.
+- **R8-F3** — `analysis.CsvPredictionReader.read_predictions(...)` now
+  detects missing CSV columns at read time and raises a
+  `ValueError(f"CSV file at {uri!r} is missing required column(s) ...")`
+  with the file path + available columns. Pre-v0.51 missing columns
+  were silently filled with empty strings, causing cryptic
+  `ValueError: invalid literal for int() with base 10: ''` downstream
+  in `load_prediction_arrays`'s dtype conversion. Root cause now
+  surfaces at the boundary.
+- **R8-C1** — `harness.evaluate_folded(...)` now accepts an optional
+  `reseed_splitter: Callable[[Splitter, int], Splitter] | None`
+  callback. When provided, each seed iteration calls
+  `reseed_splitter(splitter, seed)` to produce a fresh splitter for
+  that seed's fold iteration. Default `None` preserves the historical
+  behavior (the same splitter instance is reused across the seed loop,
+  so multi-seed × CV only varies the bootstrap RNG, not fold
+  partitions) AND emits a `DeprecationWarning` when `len(seeds) > 1`.
+  The warning persists past v1.0 because the pre-v1.0 deprecation
+  window (v0.51 → v1.0) is one minor and ADR 0003 / DEPRECATION.md
+  require ≥2 minors to close a cycle. Migration example::
+      from dataclasses import replace
+      evaluate_folded(
+          scorers, splitter, slice_,
+          seeds=(1, 2, 3),
+          reseed_splitter=lambda sp, s: replace(sp, seed=s),
+          ...
+      )
+  R8-C1 audit fix.
+### BREAKING
+- **R8-C2** — `SourceDisjointKFoldSplitter.iter_folds(...)` now caps
+  the fold count at `min(self.k, n_sources)` (matching
+  `get_n_splits(...)`). Pre-v0.51 the loop ran `range(self.k)` and
+  yielded EMPTY test partitions for the surplus folds when
+  `k > n_sources` while `get_n_splits` returned `min(k, n_sources)`
+  — the two methods silently disagreed on fold count. v0.51 caps both
+  at the same value AND emits a `UserWarning` when `k > n_sources` so
+  the caller knows the cap was applied. Callers that consumed the
+  surplus empty-test folds will see fewer iterations now; that was
+  the bug. (Probe-verified at
+  `audit-verification-codex-gemini-v0.50.0.md`.)
+- **R8-C3** — `thresholds.recall_at_fpr(...)` fallback semantics changed
+  when no threshold satisfies `target_fpr`. Pre-v0.51 the fallback set
+  `threshold = 1.0` and then computed `y_pred = (y_score >= 1.0)` —
+  inclusive comparator — which classified any negative-class sample
+  with score exactly 1.0 as predicted-positive. The probe
+  `recall_at_fpr(y=[0,1], scores=[1.0,1.0], target_fpr=0.0)` returned
+  `actual_fpr=1.0, fp=1` in silent violation of the function's
+  FPR-ceiling invariant. v0.51 returns a SENTINEL
+  `RecallAtFprResult(threshold=np.inf, recall=0.0, actual_fpr=0.0,
+  fp=0, tn=n_val_neg)` whenever the constraint is unsatisfiable.
+  Callers detect via `np.isinf(result.threshold)`. The
+  `actual_fpr ≤ target_fpr` invariant is now preserved by construction.
+  Migration: any caller filtering on `result.threshold` should add an
+  `np.isinf(...)` branch — pre-v0.51 the sentinel value was `1.0`.
+  (Verified at `audit-verification-codex-gemini-v0.50.0.md`.)
+- **R8-C4(a)** — `harness.evaluate(...)` with a `Generator`-typed `rng`
+  is now bit-stable across `n_jobs` values. Prior to v0.51, the same
+  `rng` object was attached to every `(slice, scorer)` work_unit;
+  joblib forked copies at the SAME generator state into N parallel
+  workers, so every worker used identical bootstrap sample streams —
+  silently producing non-independent CIs across `(slice, scorer)`
+  pairs in parallel mode and divergent results vs sequential mode.
+  The v0.51 implementation spawns one independent `SeedSequence` per
+  work unit at the dispatch boundary in `_score_all_slices` (depends
+  on the R8-C4(b) `spawn_seed_sequences` fix). Each pair now sees an
+  independent bootstrap stream; sequential (`n_jobs=1`) and parallel
+  (`n_jobs>1`) modes produce bit-identical CIs per the SPEC 7
+  contract at `docs/source/methodology/parallelism.md`. Integer `rng`
+  callers (the common case) are unaffected. (Verified by multi-slice
+  probe at `audit-verification-codex-gemini-v0.50.0.md`.)
+- **R8-C4(b)** — `eval_toolkit._rng.spawn_seed_sequences(rng, n)` now
+  respects `Generator` state. Prior to v0.51, the function extracted
+  the bit-generator's seed_seq and called `.spawn(n)` on it — so a
+  `Generator` advanced by prior draws produced the same children as a
+  fresh `Generator` with the same construction seed. The new
+  implementation draws `n` fresh entropy values FROM the generator
+  via `rng.integers(0, 2**63-1, size=n)` and wraps each in a
+  `SeedSequence`. Each call advances generator state, so repeated
+  calls on the same instance yield different children. This was the
+  root cause of bootstrap non-independence across `(slice, scorer)`
+  pairs in `harness.evaluate` — when the same `Generator` was shared
+  across bootstrap callsites, all callsites silently used the same
+  resample stream. (Verified probe at
+  `audit-verification-codex-gemini-v0.50.0.md`.)
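The R8-C3 entry above can be reproduced in a few lines. This is a minimal sketch in plain Python, not the toolkit's implementation: `confusion_at` is a hypothetical helper that applies the inclusive `score >= threshold` rule described in the changelog and counts false positives on the probe inputs `y=[0,1]`, `scores=[1.0,1.0]`.

```python
import math


def confusion_at(y_true, scores, threshold):
    """Hypothetical helper: predict positive when score >= threshold.

    Returns (fp, tn, fpr) computed over the negative-class samples.
    """
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    n_neg = fp + tn
    fpr = fp / n_neg if n_neg else 0.0
    return fp, tn, fpr


y_true = [0, 1]
scores = [1.0, 1.0]

# Pre-v0.51 fallback: threshold 1.0 with an inclusive comparator
# still flags the negative sample whose score is exactly 1.0.
old_fp, old_tn, old_fpr = confusion_at(y_true, scores, 1.0)

# v0.51 sentinel: no finite score satisfies `score >= inf`,
# so nothing is predicted positive and the FPR ceiling holds.
new_fp, new_tn, new_fpr = confusion_at(y_true, scores, math.inf)

print(old_fp, old_fpr)          # 1 1.0
print(new_fp, new_tn, new_fpr)  # 0 1 0.0
```

A caller that previously compared the returned threshold against `1.0` would test for infinity instead (`math.isinf` here, `np.isinf` in the changelog's wording).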
+## [0.50.0] — 2026-05-23 — SPEC 7 `rng` parameter adoption
+The SPEC 7 follow-up to v0.49.0. The `_rng.py` scaffold shipped at
+v0.49.0 (SeedLike + RNGLike type aliases per
+[Scientific Python SPEC 7](https://scientific-python.org/specs/spec-0007/))
+is now wired into every Tier-1 public function that consumes a NumPy RNG.
+### BREAKING
+**22 Tier-1 function signatures**: `seed: int = X` / `random_state: int | None` → `rng: RNGLike | SeedLike | None = X`. Pre-v1.0 SemVer-minor BREAKING (v0.34.0 precedent). Defaults preserved (still deterministic-by-default).
+Affected functions:
+- `bootstrap.py` (7 public + 1 private): `bootstrap_ci`, `paired_bootstrap_diff`, `paired_bootstrap_ece_diff`, `paired_bootstrap_op_point_diff`, `paired_mde`, `block_bootstrap_on_folds`, `cross_validate_metric`, `_bootstrap_t_ci`.
+- `metrics.py:1063`: `expected_calibration_error_debiased`.
+- `thresholds.py`: `selected_operating_point` + `_bootstrap_threshold_metric_cis`.
+- `analysis.py`: `bootstrap_metric_from_predictions`, `paired_diff_from_prediction_refs`.
+- `harness.py` (6 sites): `evaluate`, `evaluate_scorer_on_slice`, `_bootstrap_auc_ci`, `_evaluate_scores`, `_compute_paired_diffs`, `_score_all_slices`.
+- `scorecards.py`: `scorecard`, `_evaluate_spec`.
+- `stacking.py`: `LogisticStacker.random_state` → `LogisticStacker.rng` class-field rename (sklearn pass-through derives int at the boundary).
+**Body refactors**:
+- 4 SeedSequence.spawn() sites converted from `np.random.SeedSequence(seed).spawn(n)` to `rng.bit_generator.seed_seq.spawn(n)` (Option A — preserves existing worker SeedSequence signatures).
+- 2 sklearn-bridge sites in `cross_validate_metric` derive int from rng before passing to `StratifiedKFold`/`KFold(random_state=...)` (defensive across sklearn versions <1.4).
+- `LogisticStacker.fit` derives sklearn int from `self.rng` at the boundary.
+**Config schema** (Tier-2 additive): `evaluate()` config dict key `"seed"` → `"rng"`. Generator-typed input serializes as `repr(rng)`; int/None serialize as-is (backward-compatible for prior int-seed usage).
+### Added
+- **Docstrings**: NumPy-style parameter doc for every renamed function now references `rng : RNGLike | SeedLike | None` with explicit link to SPEC 7.
+- **STYLE.md §3a** + **ADR 0004 D4**: `rng` row flipped from "target convention; adopted in v0.50.0" → "**canonical** convention (adopted v0.50.0)".
+### Changed
+- **Test sweep** (~230+ test sites): `seed=X` → `rng=X` in test kwarg calls, EXCEPT in test files that test legitimate `seed`-as-int contexts (`test_adversarial.py` for Python `random.Random`, `test_seeds.py` for `set_global_seeds`, `test_splits*.py` for Splitter dataclass fields, `test_text_dedup*.py` for MinHashLSHStrategy class field).
+- **CHANGELOG header**: this release.
+### Exceptions to SPEC 7 (KEPT `seed:` — documented in STYLE.md §3a + ADR 0004 D4)
+- `seeds.set_global_seeds(seed: int)` — global-state setter, not per-function RNG.
+- `adversarial.py` dataclass fields + functional wrappers — use Python stdlib `random.Random(seed)`, not NumPy.
+- `splits.py` Splitter dataclass class-fields (`HoldoutSplitter.seed`, `StratifiedKFoldSplitter.seed`, etc.) — configuration storage, not user-facing RNG parameter.
+- `loaders.py:903` YAML config schema key — declarative; renaming would break consumer YAMLs.
+### Migration
+- Consumer (`prompt-injection-detection-submission`) lockstep: bump dep pin `>=0.49.0` → `>=0.50.0`; rename `seed=` → `rng=` on eval-toolkit-bound call sites (estimated 5-8 sites).
+- Bit-for-bit reproducibility preserved when migrating `seed=42` → `rng=42` (int seed is SeedLike; `np.random.default_rng(42)` is the canonical normalization).
+### Notes
+- Ships in parallel with Round 8 audit STOP-GATE (Decision Y.2); R8 briefing at commit `6f6839a`, awaiting Codex+Gemini reports.
+- Memory pattern captured at v0.49.0: pre-flight grep MUST cover `README.md`, `.doctest-modules`, and any config files (per `feedback_sybil_runs_readme.md`). Applied to v0.50.0 pre-flight.
 ## [0.49.0] — 2026-05-23 — Global naming-standards sweep + final cleanup before v1.0
 Final pre-v1.0 minor consolidating the naming-convention standardization

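The 0.50.0 migration note in the CHANGELOG hunk above says that replacing `seed=42` with `rng=42` keeps results bit-for-bit, with `np.random.default_rng(42)` as the canonical normalization. The sketch below illustrates this with NumPy alone, under the assumption that a function normalizes its `rng` argument through `np.random.default_rng`; `draw` is a hypothetical stand-in, not an eval-toolkit function. It also shows why a `Generator` argument behaves differently from an int: `default_rng` returns an existing `Generator` unaltered, so its state carries over between calls.

```python
import numpy as np


def draw(rng=None, n=3):
    """Hypothetical stand-in for a SPEC 7 style function taking `rng`."""
    # An int seeds a fresh Generator; an existing Generator passes through unchanged.
    gen = np.random.default_rng(rng)
    return gen.integers(0, 1_000_000, size=n)


# Int seed: every call builds a fresh Generator, so the draws repeat exactly.
same_a = draw(rng=42)
same_b = draw(rng=42)

# Generator: state advances across calls, so the second call continues the stream.
shared = np.random.default_rng(42)
first = draw(rng=shared)
second = draw(rng=shared)

print(np.array_equal(same_a, same_b))  # True
print(np.array_equal(first, same_a))   # True
print(np.array_equal(first, second))
```

The last comparison is left without an expected value on purpose: the second call continues the shared stream rather than restarting it, which is the stateful behavior the R8-C4 entries are concerned with.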
{eval_toolkit-0.49.0 → eval_toolkit-1.0.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: eval-toolkit
-Version: 0.49.0
+Version: 1.0.0
 Summary: Reusable evaluation contracts for binary classification: metrics, bootstrap CIs, calibration, artifacts, and evidence gates.
 Project-URL: Homepage, https://github.com/brandon-behring/eval-toolkit
 Project-URL: Documentation, https://brandon-behring.github.io/eval-toolkit/
@@ -114,7 +114,8 @@ format changes.
 │  gpu_info + leakage_report (NeurIPS-aligned)           │
 ├─ Tier 2 ─ Protocol-based orchestration ────────────────┤
 │  Scorer / SliceAwareScorer / LeakageCheck / Splitter   │
-│  ThresholdSelector / DatasetLoader / SimilarityStrategy│
+│  ThresholdSelector / DatasetLoader / MetricSpec        │
+│  MetaLearner / Probe / TextTransform (9 strict)        │
 │  Versioned (opt-in: per-object versions in manifest)   │
 ├─ Tier 1 ─ Functional core ─────────────────────────────┤
 │  pr_auc / roc_auc / ECE variants / Brier / bootstrap_ci│
@@ -129,69 +130,65 @@ run: capture the manifest.
 ## Documentation
-- **[Getting started](docs/getting-started.md)** — end-to-end
+- **[Getting started](docs/source/getting-started.md)** — end-to-end
   walkthrough for new users: install, define a Scorer, build slices,
   run `evaluate()`, persist results, add a claim, render a plot.
-- **[Methodology curriculum](docs/methodology/README.md)** — 16
+- **[Methodology curriculum](docs/source/methodology/README.md)** — 16
   chapters on splits, metrics, calibration, evidence gates,
   prediction artifacts, and more.
-- **[Schema reference](docs/schemas.md)** — field-by-field semantics
+- **[Schema reference](docs/source/schemas.md)** — field-by-field semantics
   for `results.v1.json`, `results_full.v1.json`, `manifest.v1.json`.
-- **[Migration guides](docs/MIGRATION.md)** — v0.6→v0.7, v0.7→v0.8,
-  v0.8→v0.9.
-- **[Extending](docs/extending.md)** — Protocol-by-Protocol guide for
+- **[Migration guides](docs/source/MIGRATION.md)** — per-version migration
+  hub (v0.7 onward).
+- **[Extending](docs/source/extending.md)** — Protocol-by-Protocol guide for
   custom Scorers, Splitters, LeakageChecks, ThresholdSelectors,
   DatasetLoaders, EvidenceGates.
-- **[Repo strategy](docs/repo-strategy.md)** — how the package is
-  organized, the 6-bucket target shape, and the checklist that
-  governs when to extract a sub-package into its own repo.
+- **[Repo strategy](docs/source/repo-strategy.md)** — how the package is
+  organized, the flat-module layout per ADR 0001, and the v2.0 trigger
+  criteria for any future subpackage split.
 ## Methodology
 What good binary-classification evaluation looks like, with each
 concern mapped to the toolkit primitive that operationalizes it.
-- [`docs/methodology/`](docs/methodology/README.md) — the curriculum
-  (16 chapters). Recommended reading order:
-  [`leakage`](docs/methodology/leakage.md) →
-  [`splits`](docs/methodology/splits.md) →
-  [`thresholds`](docs/methodology/thresholds.md) →
-  [`calibration`](docs/methodology/calibration.md) →
-  [`comparison`](docs/methodology/comparison.md) →
-  [`bootstrap`](docs/methodology/bootstrap.md) →
-  [`length_stratification`](docs/methodology/length_stratification.md) →
-  [`text_dedup`](docs/methodology/text_dedup.md) →
-  [`versioning`](docs/methodology/versioning.md) →
-  [`fairness`](docs/methodology/fairness.md) →
-  [`reproducibility`](docs/methodology/reproducibility.md) →
-  [`testing`](docs/methodology/testing.md) →
-  [`reading_list`](docs/methodology/reading_list.md).
-- [`docs/MIGRATION.md`](docs/MIGRATION.md) — per-version migration
-  guides (v0.6→v0.7, v0.7→v0.8).
-- [`docs/roadmap.md`](docs/roadmap.md) — forward-looking tracker;
-  v1.0.0 path; consumer gap-doc cross-links.
+- [`docs/source/methodology/`](docs/source/methodology/README.md) — the
+  curriculum (16 chapters). Recommended reading order:
+  [`leakage`](docs/source/methodology/leakage.md) →
+  [`splits`](docs/source/methodology/splits.md) →
+  [`thresholds`](docs/source/methodology/thresholds.md) →
+  [`calibration`](docs/source/methodology/calibration.md) →
+  [`comparison`](docs/source/methodology/comparison.md) →
+  [`bootstrap`](docs/source/methodology/bootstrap.md) →
+  [`length_stratification`](docs/source/methodology/length_stratification.md) →
+  [`text_dedup`](docs/source/methodology/text_dedup.md) →
+  [`versioning`](docs/source/methodology/versioning.md) →
+  [`fairness`](docs/source/methodology/fairness.md) →
+  [`reproducibility`](docs/source/methodology/reproducibility.md) →
+  [`testing`](docs/source/methodology/testing.md) →
+  [`reading_list`](docs/source/methodology/reading_list.md).
+- [`docs/source/MIGRATION.md`](docs/source/MIGRATION.md) — per-version
+  migration guides (v0.7 onward; v0.49 / v0.50 / v0.51 included as of
+  v0.51.0).
+- [`docs/source/roadmap.md`](docs/source/roadmap.md) — forward-looking
+  tracker; v1.0.0 path; consumer gap-doc cross-links.
 ## Extending eval-toolkit
 How to plug your own scorers / leakage checks / splitters / loaders /
 threshold selectors into the harness.
-- [`docs/extending.md`](docs/extending.md) — Protocol-by-Protocol
+- [`docs/source/extending.md`](docs/source/extending.md) — Protocol-by-Protocol
   guide, ~50-line full-harness recipe, project-layout pointer.
 ## Worked examples
-- [`docs/examples/prompt_injection_walkthrough.md`](docs/examples/prompt_injection_walkthrough.md)
-  — End-to-end prompt-injection eval on a synthetic OWASP LLM01:2025
-  fixture; cross-links to the
-  [showcase repo](https://github.com/brandon-behring/prompt_injection_classifier_showcase)
-  for the real Lakera PINT walkthrough.
-- [`docs/examples/pytorch_scorer_example.md`](docs/examples/pytorch_scorer_example.md)
-  — HuggingFace transformer + LoRA `Scorer` adapter (batched inference,
-  GPU/CPU placement, deterministic-mode setup).
-- [`docs/examples/claims_and_gates.md`](docs/examples/claims_and_gates.md)
-  — Composing reference + custom `EvidenceGate`s into a `ClaimSpec` and
-  running `evaluate_claims()` for release-time go/no-go checks.
+- [`docs/source/examples/`](docs/source/examples/index.md) — Sphinx /
+  MyST-NB executable notebooks covering: the evaluation harness,
+  metrics + bootstrap, calibration, claims-and-gates, leakage
+  detection, cross-corpus contamination scanning, character-injection
+  adversarial sweeps, callable-embedder dedup, and the activation-delta
+  probe.
 ## Install
@@ -233,12 +230,12 @@ print(f"ECE (10 bins): {expected_calibration_error(y, s, n_bins=10):.3f}")
 from eval_toolkit import bootstrap_ci, paired_bootstrap_diff
 from eval_toolkit.metrics import pr_auc
-ci = bootstrap_ci(y, s, pr_auc, n_resamples=1000, seed=42)
+ci = bootstrap_ci(y, s, pr_auc, n_resamples=1000, rng=42)
 print(f"PR-AUC: {ci.point_estimate:.3f}  95% CI: [{ci.ci_low:.3f}, {ci.ci_high:.3f}]")
 # Paired bootstrap on the lift between two scorers (s_baseline must be in [0, 1] too).
 s_baseline = np.clip(rng.normal(0.5, 0.3, size=200), 0, 1)
-diff = paired_bootstrap_diff(y, s_baseline, s, pr_auc, n_resamples=1000, seed=42)
+diff = paired_bootstrap_diff(y, s_baseline, s, pr_auc, n_resamples=1000, rng=42)
 print(f"Δ PR-AUC: {diff.delta:.3f}  overlaps zero: {diff.overlaps_zero}")
 ```
@@ -291,7 +288,7 @@ with tempfile.TemporaryDirectory() as run_dir:
 | `eval_toolkit.splits` | `Splitter` Protocol + 5 reference impls (holdout / stratified / group / source-disjoint / time-series) |
 | `eval_toolkit.loaders` | `DatasetLoader` Protocol + 4 reference impls (DataFrame / SingleSlice / ParquetGlob / HF datasets) with Croissant-compatible `describe()` |
 | `eval_toolkit.manifest` | `RunManifest` (NeurIPS-aligned) + source-role / guardrail metadata + `make_manifest` / `write_manifest` |
-| `eval_toolkit.claims` | `EvidenceGate` class (frozen dataclass: name + callable check + severity), reference gate factories (`required_metric_gate`, `minimum_slice_size_gate`, `metric_threshold_gate`, etc.), `evaluate_claims()`, and `ClaimReport` for claim-mode vs exploratory-mode checks. See [`docs/extending.md`](docs/extending.md) for writing custom gates and [`docs/examples/claims_and_gates.md`](docs/examples/claims_and_gates.md) for a worked end-to-end example. |
+| `eval_toolkit.claims` | `EvidenceGate` class (frozen dataclass: name + callable check + severity), reference gate factories (`required_metric_gate`, `minimum_slice_size_gate`, `metric_threshold_gate`, etc.), `evaluate_claims()`, and `ClaimReport` for claim-mode vs exploratory-mode checks. See [`docs/source/extending.md`](docs/source/extending.md) for writing custom gates and [`docs/source/examples/claims_and_gates.md`](docs/source/examples/claims_and_gates.md) for a worked end-to-end example. |
 | `eval_toolkit.text_dedup` | `SimilarityStrategy` Protocol + 5 strategies (TF-IDF / hash / embedding / Jaccard / MinHash-LSH); `near_dedup` / `cross_dedup` orchestrators |
 | `eval_toolkit.plotting` | PR curves, reliability diagrams, confusion matrices, score histograms, lift CIs |
 | `eval_toolkit.provenance` | File hashing, run-directory layout, figure metadata sidecar |

{eval_toolkit-0.49.0 → eval_toolkit-1.0.0}/README.md RENAMED

@@ -31,7 +31,8 @@ format changes.
 │  gpu_info + leakage_report (NeurIPS-aligned)           │
 ├─ Tier 2 ─ Protocol-based orchestration ────────────────┤
 │  Scorer / SliceAwareScorer / LeakageCheck / Splitter   │
-│  ThresholdSelector / DatasetLoader / SimilarityStrategy│
+│  ThresholdSelector / DatasetLoader / MetricSpec        │
+│  MetaLearner / Probe / TextTransform (9 strict)        │
 │  Versioned (opt-in: per-object versions in manifest)   │
 ├─ Tier 1 ─ Functional core ─────────────────────────────┤
 │  pr_auc / roc_auc / ECE variants / Brier / bootstrap_ci│
@@ -46,69 +47,65 @@ run: capture the manifest.
 ## Documentation
-- **[Getting started](docs/getting-started.md)** — end-to-end
+- **[Getting started](docs/source/getting-started.md)** — end-to-end
   walkthrough for new users: install, define a Scorer, build slices,
   run `evaluate()`, persist results, add a claim, render a plot.
-- **[Methodology curriculum](docs/methodology/README.md)** — 16
+- **[Methodology curriculum](docs/source/methodology/README.md)** — 16
   chapters on splits, metrics, calibration, evidence gates,
   prediction artifacts, and more.
-- **[Schema reference](docs/schemas.md)** — field-by-field semantics
+- **[Schema reference](docs/source/schemas.md)** — field-by-field semantics
   for `results.v1.json`, `results_full.v1.json`, `manifest.v1.json`.
-- **[Migration guides](docs/MIGRATION.md)** — v0.6→v0.7, v0.7→v0.8,
-  v0.8→v0.9.
-- **[Extending](docs/extending.md)** — Protocol-by-Protocol guide for
+- **[Migration guides](docs/source/MIGRATION.md)** — per-version migration
+  hub (v0.7 onward).
+- **[Extending](docs/source/extending.md)** — Protocol-by-Protocol guide for
   custom Scorers, Splitters, LeakageChecks, ThresholdSelectors,
   DatasetLoaders, EvidenceGates.
-- **[Repo strategy](docs/repo-strategy.md)** — how the package is
-  organized, the 6-bucket target shape, and the checklist that
-  governs when to extract a sub-package into its own repo.
+- **[Repo strategy](docs/source/repo-strategy.md)** — how the package is
+  organized, the flat-module layout per ADR 0001, and the v2.0 trigger
+  criteria for any future subpackage split.
 ## Methodology
 What good binary-classification evaluation looks like, with each
 concern mapped to the toolkit primitive that operationalizes it.
-- [`docs/methodology/`](docs/methodology/README.md) — the curriculum
-  (16 chapters). Recommended reading order:
-  [`leakage`](docs/methodology/leakage.md) →
-  [`splits`](docs/methodology/splits.md) →
-  [`thresholds`](docs/methodology/thresholds.md) →
-  [`calibration`](docs/methodology/calibration.md) →
-  [`comparison`](docs/methodology/comparison.md) →
-  [`bootstrap`](docs/methodology/bootstrap.md) →
-  [`length_stratification`](docs/methodology/length_stratification.md) →
-  [`text_dedup`](docs/methodology/text_dedup.md) →
-  [`versioning`](docs/methodology/versioning.md) →
-  [`fairness`](docs/methodology/fairness.md) →
-  [`reproducibility`](docs/methodology/reproducibility.md) →
-  [`testing`](docs/methodology/testing.md) →
-  [`reading_list`](docs/methodology/reading_list.md).
-- [`docs/MIGRATION.md`](docs/MIGRATION.md) — per-version migration
-  guides (v0.6→v0.7, v0.7→v0.8).
-- [`docs/roadmap.md`](docs/roadmap.md) — forward-looking tracker;
-  v1.0.0 path; consumer gap-doc cross-links.
+- [`docs/source/methodology/`](docs/source/methodology/README.md) — the
+  curriculum (16 chapters). Recommended reading order:
+  [`leakage`](docs/source/methodology/leakage.md) →
+  [`splits`](docs/source/methodology/splits.md) →
+  [`thresholds`](docs/source/methodology/thresholds.md) →
+  [`calibration`](docs/source/methodology/calibration.md) →
+  [`comparison`](docs/source/methodology/comparison.md) →
+  [`bootstrap`](docs/source/methodology/bootstrap.md) →
+  [`length_stratification`](docs/source/methodology/length_stratification.md) →
+  [`text_dedup`](docs/source/methodology/text_dedup.md) →
+  [`versioning`](docs/source/methodology/versioning.md) →
+  [`fairness`](docs/source/methodology/fairness.md) →
+  [`reproducibility`](docs/source/methodology/reproducibility.md) →
+  [`testing`](docs/source/methodology/testing.md) →
+  [`reading_list`](docs/source/methodology/reading_list.md).
+- [`docs/source/MIGRATION.md`](docs/source/MIGRATION.md) — per-version
+  migration guides (v0.7 onward; v0.49 / v0.50 / v0.51 included as of
+  v0.51.0).
+- [`docs/source/roadmap.md`](docs/source/roadmap.md) — forward-looking
+  tracker; v1.0.0 path; consumer gap-doc cross-links.
 ## Extending eval-toolkit
 How to plug your own scorers / leakage checks / splitters / loaders /
 threshold selectors into the harness.
-- [`docs/extending.md`](docs/extending.md) — Protocol-by-Protocol
+- [`docs/source/extending.md`](docs/source/extending.md) — Protocol-by-Protocol
   guide, ~50-line full-harness recipe, project-layout pointer.
 ## Worked examples
-- [`docs/examples/prompt_injection_walkthrough.md`](docs/examples/prompt_injection_walkthrough.md)
-  — End-to-end prompt-injection eval on a synthetic OWASP LLM01:2025
-  fixture; cross-links to the
-  [showcase repo](https://github.com/brandon-behring/prompt_injection_classifier_showcase)
-  for the real Lakera PINT walkthrough.
-- [`docs/examples/pytorch_scorer_example.md`](docs/examples/pytorch_scorer_example.md)
-  — HuggingFace transformer + LoRA `Scorer` adapter (batched inference,
-  GPU/CPU placement, deterministic-mode setup).
-- [`docs/examples/claims_and_gates.md`](docs/examples/claims_and_gates.md)
-  — Composing reference + custom `EvidenceGate`s into a `ClaimSpec` and
-  running `evaluate_claims()` for release-time go/no-go checks.
+- [`docs/source/examples/`](docs/source/examples/index.md) — Sphinx /
+  MyST-NB executable notebooks covering: the evaluation harness,
+  metrics + bootstrap, calibration, claims-and-gates, leakage
+  detection, cross-corpus contamination scanning, character-injection
+  adversarial sweeps, callable-embedder dedup, and the activation-delta
+  probe.
 ## Install
@@ -150,12 +147,12 @@ print(f"ECE (10 bins): {expected_calibration_error(y, s, n_bins=10):.3f}")
 from eval_toolkit import bootstrap_ci, paired_bootstrap_diff
 from eval_toolkit.metrics import pr_auc
-ci = bootstrap_ci(y, s, pr_auc, n_resamples=1000, seed=42)
+ci = bootstrap_ci(y, s, pr_auc, n_resamples=1000, rng=42)
 print(f"PR-AUC: {ci.point_estimate:.3f}  95% CI: [{ci.ci_low:.3f}, {ci.ci_high:.3f}]")
 # Paired bootstrap on the lift between two scorers (s_baseline must be in [0, 1] too).
 s_baseline = np.clip(rng.normal(0.5, 0.3, size=200), 0, 1)
-diff = paired_bootstrap_diff(y, s_baseline, s, pr_auc, n_resamples=1000, seed=42)
+diff = paired_bootstrap_diff(y, s_baseline, s, pr_auc, n_resamples=1000, rng=42)
 print(f"Δ PR-AUC: {diff.delta:.3f}  overlaps zero: {diff.overlaps_zero}")
 ```
@@ -208,7 +205,7 @@ with tempfile.TemporaryDirectory() as run_dir:
 | `eval_toolkit.splits` | `Splitter` Protocol + 5 reference impls (holdout / stratified / group / source-disjoint / time-series) |
 | `eval_toolkit.loaders` | `DatasetLoader` Protocol + 4 reference impls (DataFrame / SingleSlice / ParquetGlob / HF datasets) with Croissant-compatible `describe()` |
 | `eval_toolkit.manifest` | `RunManifest` (NeurIPS-aligned) + source-role / guardrail metadata + `make_manifest` / `write_manifest` |
-| `eval_toolkit.claims` | `EvidenceGate` class (frozen dataclass: name + callable check + severity), reference gate factories (`required_metric_gate`, `minimum_slice_size_gate`, `metric_threshold_gate`, etc.), `evaluate_claims()`, and `ClaimReport` for claim-mode vs exploratory-mode checks. See [`docs/extending.md`](docs/extending.md) for writing custom gates and [`docs/examples/claims_and_gates.md`](docs/examples/claims_and_gates.md) for a worked end-to-end example. |
+| `eval_toolkit.claims` | `EvidenceGate` class (frozen dataclass: name + callable check + severity), reference gate factories (`required_metric_gate`, `minimum_slice_size_gate`, `metric_threshold_gate`, etc.), `evaluate_claims()`, and `ClaimReport` for claim-mode vs exploratory-mode checks. See [`docs/source/extending.md`](docs/source/extending.md) for writing custom gates and [`docs/source/examples/claims_and_gates.md`](docs/source/examples/claims_and_gates.md) for a worked end-to-end example. |
 | `eval_toolkit.text_dedup` | `SimilarityStrategy` Protocol + 5 strategies (TF-IDF / hash / embedding / Jaccard / MinHash-LSH); `near_dedup` / `cross_dedup` orchestrators |
 | `eval_toolkit.plotting` | PR curves, reliability diagrams, confusion matrices, score histograms, lift CIs |
 | `eval_toolkit.provenance` | File hashing, run-directory layout, figure metadata sidecar |

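The README hunk above renames `seed=42` to `rng=42` in the `bootstrap_ci` and `paired_bootstrap_diff` calls. As a rough illustration of what a paired bootstrap on the lift between two scorers computes, here is a minimal NumPy-only sketch. It is not the toolkit's implementation: `brier`, `paired_bootstrap_delta`, and the returned tuple are hypothetical names chosen for this example, and a Brier score stands in for `pr_auc` so the block has no dependency beyond NumPy.

```python
import numpy as np


def brier(y_true, y_score):
    """Mean squared error between labels and scores (lower is better)."""
    return float(np.mean((y_score - y_true) ** 2))


def paired_bootstrap_delta(y, s_a, s_b, metric, n_resamples=1000, rng=None, alpha=0.05):
    """Hypothetical helper: percentile CI on metric(s_b) - metric(s_a).

    The same resampled row indices are applied to both scorers, which is
    what makes the bootstrap "paired".
    """
    gen = np.random.default_rng(rng)  # accepts None, int, SeedSequence, BitGenerator, Generator
    y, s_a, s_b = (np.asarray(a) for a in (y, s_a, s_b))
    n = len(y)
    deltas = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = gen.integers(0, n, size=n)  # sample rows with replacement
        deltas[i] = metric(y[idx], s_b[idx]) - metric(y[idx], s_a[idx])
    low, high = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    point = metric(y, s_b) - metric(y, s_a)
    overlaps_zero = bool(low <= 0.0 <= high)
    return point, float(low), float(high), overlaps_zero


# Synthetic data: an informative scorer versus a random baseline.
data_rng = np.random.default_rng(0)
y = data_rng.integers(0, 2, size=200)
s_good = np.clip(0.2 + 0.6 * y + data_rng.normal(0.0, 0.1, size=200), 0, 1)
s_base = data_rng.random(200)

point, low, high, overlaps = paired_bootstrap_delta(y, s_base, s_good, brier, rng=42)
print(f"delta Brier: {point:.3f}  95% CI: [{low:.3f}, {high:.3f}]  overlaps zero: {overlaps}")
```

Passing the same integer as `rng` twice reproduces the same interval, which is the property the `rng=42` argument in the README snippet relies on.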
{eval_toolkit-0.49.0 → eval_toolkit-1.0.0}/STYLE.md RENAMED Viewed

@@ -76,7 +76,7 @@ them; deviations need justification in the PR description.
 | `n_jobs` | Parallelism (joblib + sklearn convention) |
 | `ax` | Matplotlib axis (matplotlib convention) |
 | `metric` | Callable `(y_true, y_score) -> float` |
-| `rng` | RNG argument per [SPEC 7](https://scientific-python.org/specs/spec-0007/) — target convention; adopted in v0.50.0 |
+| `rng` | RNG argument per [SPEC 7](https://scientific-python.org/specs/spec-0007/) — **canonical** convention (adopted v0.50.0). Accepts `int`, `np.random.Generator`, `BitGenerator`, `SeedSequence`, or `None`. |
 The v0.50.0 SPEC 7 adoption preserves two `seed: int` exceptions:
 `set_global_seeds(seed: int)` (global-state setter, not per-function

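The `rng` row in the STYLE.md hunk above lists the accepted argument types (`int`, `np.random.Generator`, `BitGenerator`, `SeedSequence`, or `None`). A common way to support all of them is to pass the argument through `np.random.default_rng` once at the top of a function. The sketch below assumes that pattern; `draw` is a hypothetical function invented for the example, and nothing here is taken from the toolkit's source.

```python
import numpy as np


def draw(n, rng=None):
    """Hypothetical function following the SPEC 7 `rng` convention."""
    # default_rng normalizes every accepted input to a Generator.
    # An existing Generator is returned as-is, so its stream is shared.
    gen = np.random.default_rng(rng)
    return gen.random(n)


a = draw(3, rng=42)                          # int seed
b = draw(3, rng=np.random.SeedSequence(42))  # SeedSequence
c = draw(3, rng=np.random.PCG64(42))         # BitGenerator

shared = np.random.default_rng(42)           # caller-owned Generator
d = draw(3, rng=shared)
e = draw(3, rng=shared)                      # continues the same stream

print(np.array_equal(a, b), np.array_equal(a, c), np.array_equal(a, d))
print(np.array_equal(d, e))
```

The first three calls produce identical draws because each seeds a fresh PCG64 stream from the value 42. The last two calls differ from each other because a caller-supplied `Generator` advances its state between calls, which is the main behavioral difference from a bare `seed: int` argument.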