PyPI - eval-toolkit - Versions diffs - 0.48.0__tar.gz → 0.50.0__tar.gz - Mend

eval-toolkit 0.48.0__tar.gz → 0.50.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (182)

{eval_toolkit-0.48.0 → eval_toolkit-0.50.0}/CHANGELOG.md RENAMED

@@ -5,6 +5,169 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [0.50.0] — 2026-05-23 — SPEC 7 `rng` parameter adoption
+The SPEC 7 follow-up to v0.49.0. The `_rng.py` scaffold shipped at
+v0.49.0 (SeedLike + RNGLike type aliases per
+[Scientific Python SPEC 7](https://scientific-python.org/specs/spec-0007/))
+is now wired into every Tier-1 public function that consumes a NumPy RNG.
+### BREAKING
+**22 Tier-1 function signatures**: `seed: int = X` / `random_state: int | None` → `rng: RNGLike | SeedLike | None = X`. Pre-v1.0 SemVer-minor BREAKING (v0.34.0 precedent). Defaults preserved (still deterministic-by-default).
+Affected functions:
+- `bootstrap.py` (7 public + 1 private): `bootstrap_ci`, `paired_bootstrap_diff`, `paired_bootstrap_ece_diff`, `paired_bootstrap_op_point_diff`, `paired_mde`, `block_bootstrap_on_folds`, `cross_validate_metric`, `_bootstrap_t_ci`.
+- `metrics.py:1063`: `expected_calibration_error_debiased`.
+- `thresholds.py`: `selected_operating_point` + `_bootstrap_threshold_metric_cis`.
+- `analysis.py`: `bootstrap_metric_from_predictions`, `paired_diff_from_prediction_refs`.
+- `harness.py` (6 sites): `evaluate`, `evaluate_scorer_on_slice`, `_bootstrap_auc_ci`, `_evaluate_scores`, `_compute_paired_diffs`, `_score_all_slices`.
+- `scorecards.py`: `scorecard`, `_evaluate_spec`.
+- `stacking.py`: `LogisticStacker.random_state` → `LogisticStacker.rng` class-field rename (sklearn pass-through derives int at the boundary).
+**Body refactors**:
+- 4 SeedSequence.spawn() sites converted from `np.random.SeedSequence(seed).spawn(n)` to `rng.bit_generator.seed_seq.spawn(n)` (Option A — preserves existing worker SeedSequence signatures).
+- 2 sklearn-bridge sites in `cross_validate_metric` derive int from rng before passing to `StratifiedKFold`/`KFold(random_state=...)` (defensive across sklearn versions <1.4).
+- `LogisticStacker.fit` derives sklearn int from `self.rng` at the boundary.
+**Config schema** (Tier-2 additive): `evaluate()` config dict key `"seed"` → `"rng"`. Generator-typed input serializes as `repr(rng)`; int/None serialize as-is (backward-compatible for prior int-seed usage).
+### Added
+- **Docstrings**: NumPy-style parameter doc for every renamed function now references `rng : RNGLike | SeedLike | None` with explicit link to SPEC 7.
+- **STYLE.md §3a** + **ADR 0004 D4**: `rng` row flipped from "target convention; adopted in v0.50.0" → "**canonical** convention (adopted v0.50.0)".
+### Changed
+- **Test sweep** (~230+ test sites): `seed=X` → `rng=X` in test kwarg calls, EXCEPT in test files that test legitimate `seed`-as-int contexts (`test_adversarial.py` for Python `random.Random`, `test_seeds.py` for `set_global_seeds`, `test_splits*.py` for Splitter dataclass fields, `test_text_dedup*.py` for MinHashLSHStrategy class field).
+- **CHANGELOG header**: this release.
+### Exceptions to SPEC 7 (KEPT `seed:` — documented in STYLE.md §3a + ADR 0004 D4)
+- `seeds.set_global_seeds(seed: int)` — global-state setter, not per-function RNG.
+- `adversarial.py` dataclass fields + functional wrappers — use Python stdlib `random.Random(seed)`, not NumPy.
+- `splits.py` Splitter dataclass class-fields (`HoldoutSplitter.seed`, `StratifiedKFoldSplitter.seed`, etc.) — configuration storage, not user-facing RNG parameter.
+- `loaders.py:903` YAML config schema key — declarative; renaming would break consumer YAMLs.
+### Migration
+- Consumer (`prompt-injection-detection-submission`) lockstep: bump dep pin `>=0.49.0` → `>=0.50.0`; rename `seed=` → `rng=` on eval-toolkit-bound call sites (estimated 5-8 sites).
+- Bit-for-bit reproducibility preserved when migrating `seed=42` → `rng=42` (int seed is SeedLike; `np.random.default_rng(42)` is the canonical normalization).
+### Notes
+- Ships in parallel with Round 8 audit STOP-GATE (Decision Y.2); R8 briefing at commit `6f6839a`, awaiting Codex+Gemini reports.
+- Memory pattern captured at v0.49.0: pre-flight grep MUST cover `README.md`, `.doctest-modules`, and any config files (per `feedback_sybil_runs_readme.md`). Applied to v0.50.0 pre-flight.
+## [0.49.0] — 2026-05-23 — Global naming-standards sweep + final cleanup before v1.0
+Final pre-v1.0 minor consolidating the naming-convention standardization
+that locks the v1.0 Tier-1 contract. Audit + industry-research pass
+(PEP 8, scikit-learn, NumPy, Google Python Style Guide, Scientific
+Python SPEC 7) found the repo already 95-99% consistent; this release
+closes the small remaining gaps + documents the conventions as
+[ADR 0004](docs/source/adr/0004-naming-conventions.md). The SPEC 7
+``rng`` parameter convention is documented here and adopted in v0.50.0.
+### BREAKING
+Five Tier-1 renames for naming consistency (pre-v1.0; SemVer-minor per
+the v0.34.0 BREAKING-minor precedent). Single-consumer lockstep bump in
+``prompt-injection-detection-submission``; no deprecation aliases.
+- **``build_manifest`` → ``make_manifest``** (manifest.py). Aligns
+  with ``make_minilm_embedder`` / ``make_palette`` / ``make_run_dir``
+  factory pattern. ``build_*`` was the only outlier.
+- **``CaseRandomization`` → ``CaseInjection``** (adversarial.py).
+  Aligns with ``*Injection`` / ``*Substitution`` adversarial suffix
+  convention.
+- **``TokenSplitting`` → ``TokenSplittingInjection``** (adversarial.py).
+  Same rationale.
+- **``UnicodeNormalization`` → ``UnicodeNormalizationInjection``**
+  (adversarial.py). Same rationale.
+- **``eval_toolkit._scorecard.py`` → ``eval_toolkit.scorecards.py``**
+  (private → public module promotion). The 4 top-level symbols
+  (``scorecard``, ``Scorecard``, ``MetricSpec``, ``MetricResult``)
+  remain top-level Tier-1; the new public submodule path
+  ``from eval_toolkit.scorecards import Scorecard`` is now stable.
+  ``_scorecard.py`` is gone — old import paths raise
+  ``ModuleNotFoundError``. Per the asymmetric-promotion principle in
+  [ADR 0001](docs/source/adr/0001-flat-module-layout.md): promote
+  collection-of-types modules, keep single-function modules underscore
+  (``_sweep.py`` stays private).
+### Added
+- **[ADR 0004](docs/source/adr/0004-naming-conventions.md)** — Naming
+  conventions decision record with industry citations. Covers module
+  naming (singular vs plural), class suffixes by domain, function
+  verb-prefix conventions, canonical parameter list, fitted-attribute
+  trailing underscore (sklearn convention), TypeVar leading underscore
+  (Google convention), and the SPEC 7 ``rng`` parameter convention
+  (adopted in v0.50.0).
+- **STYLE.md** extended with §3a-d (parameter naming, class suffixes
+  by domain, module naming, asymmetric promotion), §4a-b
+  (fitted-attribute trailing underscore + TypeVar), §12 (75-col
+  docstring prose rule), §14 (test naming convention).
+- **CONTRIBUTING.md** cross-link to ADR 0004 + STYLE.md.
+- **[docs/source/api/strict_tier2_protocols.md](docs/source/api/strict_tier2_protocols.md)** —
+  new docs page enumerating the 9 strict Tier-2 Protocols + 1 opt-in
+  per [ADR 0003 §1](docs/source/adr/0003-stability-contract-and-gate3-methodology.md),
+  with canonical top-level import paths. Resolves #69's discoverability
+  concern without breaking the lightweight design intent of
+  ``eval_toolkit.protocols`` (per ``protocols.py:1-5``).
+- **``src/eval_toolkit/_rng.py``** — private module with SPEC 7 type
+  aliases (``SeedLike``, ``RNGLike``). Not yet referenced; scaffold for
+  the v0.50.0 SPEC 7 adoption.
+- **[ADR 0001](docs/source/adr/0001-flat-module-layout.md)** amendment
+  — added the asymmetric-promotion sub-rule (collection-of-types MAY
+  promote, single-function SHOULD stay underscore).
+### Changed
+- **Duplicate-type consolidation** (single source of truth):
+  - ``Versioned`` Protocol — canonical at ``protocols.py:64``; the
+    duplicate at ``leakage.py:82`` removed. Removed
+    ``"Versioned"`` from ``leakage.__all__``; previously-unused
+    ``from eval_toolkit.leakage import Versioned`` now raises
+    ``ImportError``. Use ``from eval_toolkit.protocols import Versioned``
+    or top-level ``from eval_toolkit import Versioned``.
+  - ``MetricStatus`` ``Literal`` — canonical at ``artifacts.py:30``; the
+    duplicate at ``scorecards.py:78`` removed; ``scorecards`` now
+    imports from ``artifacts``.
+- **[validation] optional extra** reclassified from "active deprecation
+  with removal target v0.33.0" → "permanent no-op kept for backward
+  compatibility." Hard removal would break consumer pip pins of the
+  form ``eval-toolkit[validation]`` for zero functional benefit
+  (R3 in DEPRECATION.md).
+- **Sphinx cross-references** updated from
+  ``eval_toolkit.leakage.Versioned`` → ``eval_toolkit.protocols.Versioned``
+  in ``manifest.py`` docstrings.
+### Deferred to v0.50.0
+- **SPEC 7 ``rng`` parameter adoption** across ~30 NumPy-RNG functions.
+  Scope deferred from v0.49.0 after the planning audit revealed the
+  full blast radius (~30 signature sites + 247 test kwarg sites +
+  7 internal helpers + SeedSequence/Generator/sklearn-bridge
+  conversions). Splitting matches the "one cleanup per minor" pattern
+  per [feedback_staggered_breaking_releases]. ``_rng.py`` ships in
+  v0.49.0 as the scaffold; v0.50.0 wires it into every applicable
+  function.
+### Notes
+- Round 8 audit STOP-GATE per Decision Y.2 — briefing committed at
+  v0.48.0 (commit ``6f6839a``); v0.49.0 ships in parallel since the
+  audit-trail synthesis confirmed R8 audits the existing contract
+  (does not prescribe new changes). Any R8 finding folds into v0.49.1
+  hotfix if needed.
+- Issue #69 closed by the new strict-Tier-2-Protocols docs page; see
+  ``docs/source/api/strict_tier2_protocols.md`` and the close
+  rationale on the issue itself.
 ## [0.48.0] — 2026-05-22 — Polish + audit-driven tightening before v1.0 (Round 7 follow-on + cross-API consistency + doc-execution gates)
 Third + final BREAKING minor of the staggered v0.45 → v0.46 → v0.46.1 → v0.47
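
Note on the 0.50.0 `seed` → `rng` migration entry above: the changelog states that bodies normalize via `np.random.default_rng(rng)` and that `seed=42` → `rng=42` stays bit-for-bit reproducible. A minimal sketch of that normalization pattern follows. `bootstrap_mean_ci` is a hypothetical helper written only for illustration; it is not eval-toolkit's `bootstrap_ci`, whose body is not part of this diff.

```python
import numpy as np


def bootstrap_mean_ci(values, n_resamples=1000, confidence=0.95, rng=0):
    """Percentile bootstrap CI for the mean.

    Hypothetical helper (not eval-toolkit code) that follows the SPEC 7
    convention: one ``rng`` parameter, normalized once at the top.
    """
    # default_rng accepts an int, a SeedSequence, a BitGenerator, a
    # Generator (returned unchanged), or None (fresh OS entropy).
    gen = np.random.default_rng(rng)
    values = np.asarray(values, dtype=float)
    # One row of resampled indices per bootstrap replicate.
    idx = gen.integers(0, len(values), size=(n_resamples, len(values)))
    means = values[idx].mean(axis=1)
    alpha = (1.0 - confidence) / 2.0
    low, high = np.quantile(means, [alpha, 1.0 - alpha])
    return float(low), float(high)


data = np.linspace(0.0, 1.0, 50)
ci_from_int = bootstrap_mean_ci(data, rng=42)
ci_from_generator = bootstrap_mean_ci(data, rng=np.random.default_rng(42))
ci_from_seed_sequence = bootstrap_mean_ci(data, rng=np.random.SeedSequence(42))
```

The three calls draw the same stream, because each input normalizes to a generator seeded from the same entropy. One caveat worth keeping in mind when migrating: a `Generator` passed in is consumed in place, so reusing the same `Generator` object across two calls gives different resamples, whereas passing the int `42` twice gives identical ones.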

{eval_toolkit-0.48.0 → eval_toolkit-0.50.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: eval-toolkit
-Version: 0.48.0
+Version: 0.50.0
 Summary: Reusable evaluation contracts for binary classification: metrics, bootstrap CIs, calibration, artifacts, and evidence gates.
 Project-URL: Homepage, https://github.com/brandon-behring/eval-toolkit
 Project-URL: Documentation, https://brandon-behring.github.io/eval-toolkit/
@@ -233,12 +233,12 @@ print(f"ECE (10 bins): {expected_calibration_error(y, s, n_bins=10):.3f}")
 from eval_toolkit import bootstrap_ci, paired_bootstrap_diff
 from eval_toolkit.metrics import pr_auc
-ci = bootstrap_ci(y, s, pr_auc, n_resamples=1000, seed=42)
+ci = bootstrap_ci(y, s, pr_auc, n_resamples=1000, rng=42)
 print(f"PR-AUC: {ci.point_estimate:.3f}  95% CI: [{ci.ci_low:.3f}, {ci.ci_high:.3f}]")
 # Paired bootstrap on the lift between two scorers (s_baseline must be in [0, 1] too).
 s_baseline = np.clip(rng.normal(0.5, 0.3, size=200), 0, 1)
-diff = paired_bootstrap_diff(y, s_baseline, s, pr_auc, n_resamples=1000, seed=42)
+diff = paired_bootstrap_diff(y, s_baseline, s, pr_auc, n_resamples=1000, rng=42)
 print(f"Δ PR-AUC: {diff.delta:.3f}  overlaps zero: {diff.overlaps_zero}")
 ```
@@ -261,13 +261,13 @@ print(f"NLL: {result['nll_pre']:.3f} -> {result['nll_post']:.3f}")
 ```python
 import tempfile
 from pathlib import Path
-from eval_toolkit import build_manifest, write_manifest
+from eval_toolkit import make_manifest, write_manifest
 with tempfile.TemporaryDirectory() as run_dir:
     # data_files: {name: path} → eval_toolkit hashes the files for you;
     # versioned: any object with a `version` attribute (e.g. a scorer or
     # leakage check) is captured by name → version in the manifest.
-    manifest = build_manifest(
+    manifest = make_manifest(
         run_id="quickstart-demo",
         config={"threshold_criterion": "max_f1", "seed": 42},
         seeds={"global": 42, "bootstrap": 42},
@@ -290,7 +290,7 @@ with tempfile.TemporaryDirectory() as run_dir:
 | `eval_toolkit.leakage` | `LeakageCheck` Protocol + 7 reference impls (exact / near / encoding-obfuscated / cross-split / label-conflict / group / temporal); `Versioned` opt-in Protocol |
 | `eval_toolkit.splits` | `Splitter` Protocol + 5 reference impls (holdout / stratified / group / source-disjoint / time-series) |
 | `eval_toolkit.loaders` | `DatasetLoader` Protocol + 4 reference impls (DataFrame / SingleSlice / ParquetGlob / HF datasets) with Croissant-compatible `describe()` |
-| `eval_toolkit.manifest` | `RunManifest` (NeurIPS-aligned) + source-role / guardrail metadata + `build_manifest` / `write_manifest` |
+| `eval_toolkit.manifest` | `RunManifest` (NeurIPS-aligned) + source-role / guardrail metadata + `make_manifest` / `write_manifest` |
 | `eval_toolkit.claims` | `EvidenceGate` class (frozen dataclass: name + callable check + severity), reference gate factories (`required_metric_gate`, `minimum_slice_size_gate`, `metric_threshold_gate`, etc.), `evaluate_claims()`, and `ClaimReport` for claim-mode vs exploratory-mode checks. See [`docs/extending.md`](docs/extending.md) for writing custom gates and [`docs/examples/claims_and_gates.md`](docs/examples/claims_and_gates.md) for a worked end-to-end example. |
 | `eval_toolkit.text_dedup` | `SimilarityStrategy` Protocol + 5 strategies (TF-IDF / hash / embedding / Jaccard / MinHash-LSH); `near_dedup` / `cross_dedup` orchestrators |
 | `eval_toolkit.plotting` | PR curves, reliability diagrams, confusion matrices, score histograms, lift CIs |

{eval_toolkit-0.48.0 → eval_toolkit-0.50.0}/README.md RENAMED

@@ -150,12 +150,12 @@ print(f"ECE (10 bins): {expected_calibration_error(y, s, n_bins=10):.3f}")
 from eval_toolkit import bootstrap_ci, paired_bootstrap_diff
 from eval_toolkit.metrics import pr_auc
-ci = bootstrap_ci(y, s, pr_auc, n_resamples=1000, seed=42)
+ci = bootstrap_ci(y, s, pr_auc, n_resamples=1000, rng=42)
 print(f"PR-AUC: {ci.point_estimate:.3f}  95% CI: [{ci.ci_low:.3f}, {ci.ci_high:.3f}]")
 # Paired bootstrap on the lift between two scorers (s_baseline must be in [0, 1] too).
 s_baseline = np.clip(rng.normal(0.5, 0.3, size=200), 0, 1)
-diff = paired_bootstrap_diff(y, s_baseline, s, pr_auc, n_resamples=1000, seed=42)
+diff = paired_bootstrap_diff(y, s_baseline, s, pr_auc, n_resamples=1000, rng=42)
 print(f"Δ PR-AUC: {diff.delta:.3f}  overlaps zero: {diff.overlaps_zero}")
 ```
@@ -178,13 +178,13 @@ print(f"NLL: {result['nll_pre']:.3f} -> {result['nll_post']:.3f}")
 ```python
 import tempfile
 from pathlib import Path
-from eval_toolkit import build_manifest, write_manifest
+from eval_toolkit import make_manifest, write_manifest
 with tempfile.TemporaryDirectory() as run_dir:
     # data_files: {name: path} → eval_toolkit hashes the files for you;
     # versioned: any object with a `version` attribute (e.g. a scorer or
     # leakage check) is captured by name → version in the manifest.
-    manifest = build_manifest(
+    manifest = make_manifest(
         run_id="quickstart-demo",
         config={"threshold_criterion": "max_f1", "seed": 42},
         seeds={"global": 42, "bootstrap": 42},
@@ -207,7 +207,7 @@ with tempfile.TemporaryDirectory() as run_dir:
 | `eval_toolkit.leakage` | `LeakageCheck` Protocol + 7 reference impls (exact / near / encoding-obfuscated / cross-split / label-conflict / group / temporal); `Versioned` opt-in Protocol |
 | `eval_toolkit.splits` | `Splitter` Protocol + 5 reference impls (holdout / stratified / group / source-disjoint / time-series) |
 | `eval_toolkit.loaders` | `DatasetLoader` Protocol + 4 reference impls (DataFrame / SingleSlice / ParquetGlob / HF datasets) with Croissant-compatible `describe()` |
-| `eval_toolkit.manifest` | `RunManifest` (NeurIPS-aligned) + source-role / guardrail metadata + `build_manifest` / `write_manifest` |
+| `eval_toolkit.manifest` | `RunManifest` (NeurIPS-aligned) + source-role / guardrail metadata + `make_manifest` / `write_manifest` |
 | `eval_toolkit.claims` | `EvidenceGate` class (frozen dataclass: name + callable check + severity), reference gate factories (`required_metric_gate`, `minimum_slice_size_gate`, `metric_threshold_gate`, etc.), `evaluate_claims()`, and `ClaimReport` for claim-mode vs exploratory-mode checks. See [`docs/extending.md`](docs/extending.md) for writing custom gates and [`docs/examples/claims_and_gates.md`](docs/examples/claims_and_gates.md) for a worked end-to-end example. |
 | `eval_toolkit.text_dedup` | `SimilarityStrategy` Protocol + 5 strategies (TF-IDF / hash / embedding / Jaccard / MinHash-LSH); `near_dedup` / `cross_dedup` orchestrators |
 | `eval_toolkit.plotting` | PR curves, reliability diagrams, confusion matrices, score histograms, lift CIs |

{eval_toolkit-0.48.0 → eval_toolkit-0.50.0}/STYLE.md RENAMED

@@ -36,6 +36,11 @@ Run via `make lint` (= `ruff check + black --check + mypy`) and `make test`.
 ## 3. Naming
+For the full decision record + industry-citations, see
+[ADR 0004 — Naming conventions](docs/source/adr/0004-naming-conventions.md).
+This section is the day-to-day quick reference; the ADR is the
+authoritative source.
 - Module names: `snake_case`, lowercase package (`eval_toolkit`).
 - Class names: `PascalCase`. Suffixes used in this repo:
   - `*Config` — frozen dataclass for settings
@@ -55,6 +60,68 @@ Run via `make lint` (= `ruff check + black --check + mypy`) and `make test`.
 - Mutation marking: not used. Mutating functions return `None` (Pythonic over
   Julia's `_inplace` suffix).
+### 3a. Parameter naming (canonical list, locked at v1.0)
+These names mean these things, everywhere. Future functions MUST use
+them; deviations need justification in the PR description.
+| Parameter | Meaning |
+|---|---|
+| `y_true` | Ground-truth labels (binary, shape `(n,)`) |
+| `y_score` | Continuous score / probability (shape `(n,)`) |
+| `y_pred` | Discrete prediction (threshold-dependent) |
+| `n_resamples` | Bootstrap iteration count |
+| `confidence` | Two-sided confidence level (0.95 default) |
+| `n_bins` | Binning count for calibration / ECE |
+| `n_jobs` | Parallelism (joblib + sklearn convention) |
+| `ax` | Matplotlib axis (matplotlib convention) |
+| `metric` | Callable `(y_true, y_score) -> float` |
+| `rng` | RNG argument per [SPEC 7](https://scientific-python.org/specs/spec-0007/) — **canonical** convention (adopted v0.50.0). Accepts `int`, `np.random.Generator`, `BitGenerator`, `SeedSequence`, or `None`. |
+The v0.50.0 SPEC 7 adoption preserves two `seed: int` exceptions:
+`set_global_seeds(seed: int)` (global-state setter, not per-function
+RNG; SPEC 7 doesn't apply) and adversarial dataclass fields (use Python
+`random.Random(seed)`; not NumPy-RNG, so SPEC 7's typing doesn't fit).
+### 3b. Class suffixes by domain
+Each suffix maps to a Protocol contract. Stay within the pattern:
+| Suffix | Domain | Protocol |
+|---|---|---|
+| `*Selector` | Threshold selection | `ThresholdSelector` |
+| `*Splitter` | Cross-validation splits | `Splitter` |
+| `*Check` | Leakage detection | `LeakageCheck` |
+| `*Loader` | Dataset loading | `DatasetLoader` |
+| `*Reader` | Prediction artifact reading | `PredictionReader` |
+| `*Variant` | Preprocessing variant | (functional API) |
+| `*Strategy` | Dedup similarity backend | `SimilarityStrategy` |
+| `*Injection` / `*Substitution` | Adversarial char-injection / -substitution | `TextTransform` |
+### 3c. Module naming (singular vs plural)
+- **Plural noun** for collection-of-types modules: `metrics`,
+  `loaders`, `protocols`, `losses`, `probes`, `splits`, `paths`,
+  `seeds`, `thresholds`, `artifacts`, `claims`, `embeddings`,
+  `scorecards`.
+- **Singular noun** for domain-concept modules: `harness`,
+  `bootstrap`, `manifest`, `calibration`, `leakage`, `analysis`,
+  `provenance`, `evidence`, `stacking`, `text_dedup`.
+- **Gerund** for process-domain modules: `preprocessing`.
+### 3d. Asymmetric module promotion (private → public)
+Collection-of-types private modules MAY be promoted to plural-public
+when they hold ≥2 user-relevant types. Single-function private
+modules SHOULD stay underscore. See
+[ADR 0001](docs/source/adr/0001-flat-module-layout.md) for the trigger
+analysis.
+Examples:
+- `_scorecard.py` (4 public exports) → `scorecards.py` at v0.49.0. ✓ promote.
+- `_sweep.py` (1 public function `sweep`) → stays `_sweep.py`. ✓ keep private.
 ## 4. Type hints
 - Every public function has fully typed parameters and return.
@@ -79,10 +146,13 @@ Run via `make lint` (= `ruff check + black --check + mypy`) and `make test`.
     for 4 reference impls.
   - `SimilarityStrategy` (`text_dedup.py`) — pluggable similarity backend for
     `near_dedup` / `cross_dedup` / `NearDuplicateCheck` / `CrossSplitLeakageCheck`.
-  - `Versioned` (`leakage.py`) — opt-in single-attribute Protocol; any Tier-2
-    implementation may expose `version: str`. `RunManifest.versioned_objects`
-    auto-collects them. Mirrors the `lm-evaluation-harness` task `VERSION`
-    pattern. See `docs/methodology/versioning.md`.
+  - `Versioned` (`protocols.py`) — opt-in single-attribute Protocol; any
+    Tier-2 implementation may expose `version: str`.
+    `RunManifest.versioned_objects` auto-collects them. Mirrors the
+    `lm-evaluation-harness` task `VERSION` pattern. See
+    `docs/methodology/versioning.md`. (Single source of truth at
+    `protocols.py:64` since v0.49.0; the duplicate previously in
+    `leakage.py:82` was removed.)
 - All seams are `@runtime_checkable` so callers can `isinstance(obj, Protocol)`.
 - Reference impls are `@dataclass(frozen=True, slots=True)` with config in the
   constructor (`TargetRecallSelector(recall=0.90)`) and the Protocol method as
@@ -90,6 +160,25 @@ Run via `make lint` (= `ruff check + black --check + mypy`) and `make test`.
 - `NamedTuple` for stable public records that benefit from positional access;
   frozen dataclasses with `slots=True` otherwise.
+### 4a. Fitted-attribute trailing underscore (sklearn convention)
+Estimator-style classes (`fit`/`predict` pattern) that store
+**learned-from-data attributes** use trailing underscore per scikit-learn
+convention: `coef_`, `classes_`, `n_features_in_`, `feature_importances_`.
+These attributes MUST NOT be set in `__init__` — set them only in `fit()`.
+Frozen reference-impl dataclasses (`@dataclass(frozen=True, slots=True)`)
+are **exempt** — they hold config, not fitted state.
+Current canonical example: `stacking.LogisticStacker`.
+### 4b. TypeVar naming
+Internal (private) `TypeVar`s use a leading underscore per Google Python
+Style Guide §3.19.10: `_T = TypeVar("_T")`. Public, constrained `TypeVar`s
+without the underscore are allowed only when explicitly part of an
+exported generic API.
 ## 5. Dataclasses
 1. **`slots=True` always** on repo-owned dataclasses. Catches typos at
@@ -220,6 +309,10 @@ def fit_temperature(val_logits, val_labels, bounds=(0.05, 20.0)):
 - **References** cites arXiv IDs / DOIs / journal cites.
 - For modules where doctests would be contrived (`plotting`, `harness`,
   `provenance`), Examples are optional.
+- **Docstring prose wraps at 75 cols** (numpydoc convention) so that
+  `help()` is readable in a terminal. Doctest code blocks inside the
+  docstring follow the 100-col Black rule (code stays comfortable in an
+  editor even though prose around it is narrower).
 ## 13. Comments
@@ -228,6 +321,12 @@ restate what the code says.
 ## 14. Tests
+- **File naming**: `tests/test_<module>.py` mirrors
+  `src/eval_toolkit/<module>.py`. Auxiliary tests per module use
+  suffixes (`test_<module>_props.py`, `test_<module>_validation.py`,
+  `test_<module>_golden.py`).
+- **Function naming**: `test_<thing_under_test>_<scenario>`. No
+  class-based test grouping unless fixtures truly demand it (rare).
 - **Markers**: `unit`, `property`, `smoke`, `golden`.
 - **Sklearn-reference + analytical** as the unit-test oracle where available.
 - **Hypothesis** required for math/stat invariants. Strategies use
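
Note on STYLE.md §4a above: the rule is that learned-from-data attributes carry a trailing underscore and are set only in `fit()`, never in `__init__`. A small sketch of what that looks like in practice follows. `MeanScoreThreshold` is a hypothetical class invented for this note; it is not `stacking.LogisticStacker`, whose source is not shown in this diff.

```python
import numpy as np


class MeanScoreThreshold:
    """Hypothetical fit/predict estimator illustrating the §4a rule."""

    def __init__(self, margin: float = 0.0) -> None:
        # Configuration only: no trailing underscore, nothing learned yet.
        self.margin = margin

    def fit(self, y_score):
        scores = np.asarray(y_score, dtype=float)
        # Fitted state: trailing underscore, created here and only here.
        self.threshold_ = float(scores.mean()) + self.margin
        self.n_samples_seen_ = int(scores.shape[0])
        return self

    def predict(self, y_score):
        if not hasattr(self, "threshold_"):
            raise RuntimeError("call fit() before predict()")
        scores = np.asarray(y_score, dtype=float)
        return (scores >= self.threshold_).astype(int)


estimator = MeanScoreThreshold()
unfitted_has_threshold = hasattr(estimator, "threshold_")  # False before fit()
y_pred = estimator.fit([0.1, 0.4, 0.7, 0.8]).predict([0.2, 0.9])
```

Because `threshold_` does not exist until `fit()` runs, a plain `hasattr` check is enough to tell a fitted instance from an unfitted one, which is the practical payoff of keeping these attributes out of `__init__`.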

{eval_toolkit-0.48.0 → eval_toolkit-0.50.0}/pyproject.toml RENAMED

@@ -74,15 +74,14 @@ probes = ["torch>=2.0", "transformers>=4.40"]
 # (granular extras — losses callers should not have to install the larger
 # transformers stack). Shares the torch version pin with [probes].
 losses = ["torch>=2.0"]
-# DEPRECATED (announced v0.30.1, removal v0.33.0).
+# NO-OP extra kept for backward compatibility (R3 at v0.49.0).
 #
-# Retained as a transitive no-op so `pip install eval-toolkit[validation]`
-# / dev still resolve cleanly. jsonschema moved to the base deps in
-# v0.16.0; this extra has been a no-op ever since. The 2-minor-version
-# window (v0.30.1 announce → v0.33.0 remove) matches the @deprecated
-# policy in docs/DEPRECATION.md. Extras can't trigger import-time
-# DeprecationWarnings, so the deprecation is documentation-only here +
-# in CHANGELOG ### Deprecated + docs/DEPRECATION.md.
+# jsonschema>=4.21 moved to base deps at v0.16.0; this extra has been a
+# no-op ever since. Originally announced as deprecated in v0.30.1 with
+# target removal at v0.33.0, but reclassified at v0.49.0 (R3 in
+# docs/DEPRECATION.md) as a permanent no-op — hard removal would break
+# consumer pip pins of the form `eval-toolkit[validation]` for zero
+# functional benefit. Retained indefinitely.
 validation = []
 # v0.31.0 docs site: Sphinx + pydata-sphinx-theme (replaces v0.28.0's
 # mkdocs-material). Migration drivers — pain points Q1 in the v0.31.0

{eval_toolkit-0.48.0 → eval_toolkit-0.50.0}/src/eval_toolkit/__init__.py RENAMED

@@ -38,15 +38,15 @@ _EXPORTS: dict[str, str] = {
     "ALL_TECHNIQUES": "eval_toolkit.adversarial",
     "BidiRTLInjection": "eval_toolkit.adversarial",
     "CORE_TECHNIQUES": "eval_toolkit.adversarial",
-    "CaseRandomization": "eval_toolkit.adversarial",
+    "CaseInjection": "eval_toolkit.adversarial",
     "DiacriticInjection": "eval_toolkit.adversarial",
     "HomoglyphSubstitution": "eval_toolkit.adversarial",
     "InvisibleCharsInjection": "eval_toolkit.adversarial",
     "PunctuationInjection": "eval_toolkit.adversarial",
     "SynonymSubstitution": "eval_toolkit.adversarial",
     "TagStrippingInjection": "eval_toolkit.adversarial",
-    "TokenSplitting": "eval_toolkit.adversarial",
-    "UnicodeNormalization": "eval_toolkit.adversarial",
+    "TokenSplittingInjection": "eval_toolkit.adversarial",
+    "UnicodeNormalizationInjection": "eval_toolkit.adversarial",
     "WhitespaceInjection": "eval_toolkit.adversarial",
     "ZeroWidthSpaceInjection": "eval_toolkit.adversarial",
     # CharacterInjectionStrategy + character_injection SimpleNamespace
@@ -202,7 +202,7 @@ _EXPORTS: dict[str, str] = {
     "MANIFEST_SCHEMA_VERSION": "eval_toolkit.manifest",
     "RunManifest": "eval_toolkit.manifest",
     "SourceRoleRecord": "eval_toolkit.manifest",
-    "build_manifest": "eval_toolkit.manifest",
+    "make_manifest": "eval_toolkit.manifest",
     "validate_source_roles": "eval_toolkit.manifest",
     "write_manifest": "eval_toolkit.manifest",
     # --- metrics ---
@@ -315,10 +315,10 @@ _EXPORTS: dict[str, str] = {
     "wilson_interval": "eval_toolkit.thresholds",
     "LogisticStacker": "eval_toolkit.stacking",
     "MetaLearner": "eval_toolkit.stacking",
-    "MetricResult": "eval_toolkit._scorecard",
-    "MetricSpec": "eval_toolkit._scorecard",
-    "Scorecard": "eval_toolkit._scorecard",
-    "scorecard": "eval_toolkit._scorecard",
+    "MetricResult": "eval_toolkit.scorecards",
+    "MetricSpec": "eval_toolkit.scorecards",
+    "Scorecard": "eval_toolkit.scorecards",
+    "scorecard": "eval_toolkit.scorecards",
     # --- sweep (top-level v0.47 unification — Decision K + Decision D) ---
     "sweep": "eval_toolkit._sweep",
 }
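
Note on the `_EXPORTS` table above: the hunks show only the name → module mapping, not the code that resolves it. One common way such a table is consumed is a module-level `__getattr__` (PEP 562) that imports the target module on first access. The sketch below assumes that mechanism; `make_lazy_module`, `demo_pkg`, and the stdlib entries in the demo table are illustrative stand-ins, not eval-toolkit code.

```python
import importlib
import types

# Demo table in the same shape as _EXPORTS: public name -> defining module.
# Stdlib targets are used so the sketch runs without eval-toolkit installed.
_DEMO_EXPORTS: dict[str, str] = {
    "sqrt": "math",
    "OrderedDict": "collections",
}


def make_lazy_module(name: str, exports: dict[str, str]) -> types.ModuleType:
    """Build a module whose attributes are imported on first access."""
    module = types.ModuleType(name)

    def __getattr__(attr: str):
        try:
            target = exports[attr]
        except KeyError:
            # Names absent from the table (for example, a renamed symbol)
            # surface as an ordinary AttributeError.
            raise AttributeError(
                f"module {name!r} has no attribute {attr!r}"
            ) from None
        value = getattr(importlib.import_module(target), attr)
        setattr(module, attr, value)  # cache so __getattr__ is not hit again
        return value

    # PEP 562: a function named __getattr__ in the module namespace is
    # called when normal attribute lookup fails.
    module.__getattr__ = __getattr__
    return module


lazy = make_lazy_module("demo_pkg", _DEMO_EXPORTS)
root = lazy.sqrt(16.0)
old_name_available = hasattr(lazy, "CaseRandomization")
```

Under this assumed mechanism, a rename in the table with no alias entry means the old name simply stops resolving, which is consistent with the changelog's "no deprecation aliases" statement for the v0.49.0 renames.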

eval_toolkit-0.50.0/src/eval_toolkit/_rng.py ADDED

@@ -0,0 +1,65 @@
+"""Private RNG-parameter type aliases per Scientific-Python SPEC 7.
+This module centralizes the type aliases used to annotate user-facing RNG
+parameters across the toolkit. Per `SPEC 7 — Seeding PRNG
+<https://scientific-python.org/specs/spec-0007/>`_ (Endorsed) eval-toolkit
+exposes a single canonical parameter name ``rng`` typed as
+``RNGLike | SeedLike | None`` on every function that consumes a NumPy
+``Generator``. Bodies normalize via ``np.random.default_rng(rng)``.
+This module is private (underscore prefix) so the aliases stay an
+implementation detail — public symbols use them only in their annotations.
+If a Tier-2 consumer ever needs them exposed for their own callsite type
+annotations, promote them via ``eval_toolkit.protocols`` per the
+asymmetric-promotion principle in ADR 0001 + STYLE.md §3d.
+Exceptions to the SPEC 7 convention — documented in STYLE.md §3a:
+- ``seeds.set_global_seeds(seed: int)`` — global-state setter, not a
+  per-function RNG parameter; SPEC 7 is scoped to per-function RNG inputs.
+- ``adversarial.*Injection`` / ``*Substitution`` / ``CaseInjection``
+  dataclass fields — they use Python's stdlib ``random.Random(seed)``,
+  not NumPy. SPEC 7's typing (``RNGLike = np.random.Generator | ...``) is
+  strictly NumPy-scoped.
+"""
+from __future__ import annotations
+from collections.abc import Sequence
+from typing import cast
+import numpy as np
+type SeedLike = int | np.integer | Sequence[int] | np.random.SeedSequence
+"""Anything that can seed a NumPy bit generator.
+Per SPEC 7, ``np.random.default_rng`` accepts any of these as a seed
+without further conversion. ``Sequence[int]`` is the entropy-vector form
+used by ``np.random.SeedSequence``.
+"""
+type RNGLike = np.random.Generator | np.random.BitGenerator
+"""An already-instantiated NumPy bit generator or generator wrapper.
+``np.random.default_rng(rng)`` is the identity function on
+``Generator`` inputs and lifts ``BitGenerator`` inputs into a
+``Generator`` — both forms compose cleanly.
+"""
+def spawn_seed_sequences(rng: RNGLike | SeedLike | None, n: int) -> list[np.random.SeedSequence]:
+    """Spawn ``n`` independent SeedSequences from any SPEC 7 ``rng`` input.
+    Normalizes the input to a ``Generator``, then extracts the underlying
+    ``SeedSequence`` via the bit-generator and spawns ``n`` children.
+    The cast satisfies mypy strict: the ``seed_seq`` attribute on a
+    concrete BitGenerator is a ``SeedSequence`` instance, but the type
+    stub on ``BitGenerator.seed_seq`` returns the abstract
+    ``ISeedSequence`` interface (which lacks ``spawn``).
+    Used by the bootstrap parallel workers (which take spawned
+    ``SeedSequence`` objects to seed their internal ``default_rng()`` calls).
+    """
+    gen = np.random.default_rng(rng)
+    seed_seq = cast(np.random.SeedSequence, gen.bit_generator.seed_seq)
+    return seed_seq.spawn(n)

{eval_toolkit-0.48.0 → eval_toolkit-0.50.0}/src/eval_toolkit/_version.py RENAMED

@@ -2,4 +2,4 @@
 __all__ = ["__version__"]
-__version__ = "0.48.0"
+__version__ = "0.50.0"
