PyPI - eval-toolkit - Versions diffs - 0.47.0__tar.gz → 0.49.0__tar.gz - Mend

eval-toolkit 0.47.0__tar.gz → 0.49.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (183)

{eval_toolkit-0.47.0 → eval_toolkit-0.49.0}/CHANGELOG.md RENAMED

@@ -5,6 +5,203 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [0.49.0] — 2026-05-23 — Global naming-standards sweep + final cleanup before v1.0
+Final pre-v1.0 minor consolidating the naming-convention standardization
+that locks the v1.0 Tier-1 contract. Audit + industry-research pass
+(PEP 8, scikit-learn, NumPy, Google Python Style Guide, Scientific
+Python SPEC 7) found the repo already 95-99% consistent; this release
+closes the small remaining gaps + documents the conventions as
+[ADR 0004](docs/source/adr/0004-naming-conventions.md). The SPEC 7
+``rng`` parameter convention is documented here and adopted in v0.50.0.
+### BREAKING
+Five Tier-1 renames for naming consistency (pre-v1.0; SemVer-minor per
+the v0.34.0 BREAKING-minor precedent). Single-consumer lockstep bump in
+``prompt-injection-detection-submission``; no deprecation aliases.
+- **``build_manifest`` → ``make_manifest``** (manifest.py). Aligns
+  with ``make_minilm_embedder`` / ``make_palette`` / ``make_run_dir``
+  factory pattern. ``build_*`` was the only outlier.
+- **``CaseRandomization`` → ``CaseInjection``** (adversarial.py).
+  Aligns with ``*Injection`` / ``*Substitution`` adversarial suffix
+  convention.
+- **``TokenSplitting`` → ``TokenSplittingInjection``** (adversarial.py).
+  Same rationale.
+- **``UnicodeNormalization`` → ``UnicodeNormalizationInjection``**
+  (adversarial.py). Same rationale.
+- **``eval_toolkit._scorecard.py`` → ``eval_toolkit.scorecards.py``**
+  (private → public module promotion). The 4 top-level symbols
+  (``scorecard``, ``Scorecard``, ``MetricSpec``, ``MetricResult``)
+  remain top-level Tier-1; the new public submodule path
+  ``from eval_toolkit.scorecards import Scorecard`` is now stable.
+  ``_scorecard.py`` is gone — old import paths raise
+  ``ModuleNotFoundError``. Per the asymmetric-promotion principle in
+  [ADR 0001](docs/source/adr/0001-flat-module-layout.md): promote
+  collection-of-types modules, keep single-function modules underscore
+  (``_sweep.py`` stays private).
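Because the five renames above ship without deprecation aliases, consumer code has to be updated in a single step. A minimal sketch of a text-level rename helper follows. The old and new names are copied from the changelog entries; `RENAMES` and `apply_renames` are hypothetical names introduced for this illustration and are not part of eval-toolkit.

```python
import re

# Old -> new identifiers, copied from the v0.49.0 BREAKING list.
RENAMES = {
    "build_manifest": "make_manifest",
    "CaseRandomization": "CaseInjection",
    "TokenSplitting": "TokenSplittingInjection",
    "UnicodeNormalization": "UnicodeNormalizationInjection",
    "eval_toolkit._scorecard": "eval_toolkit.scorecards",
}


def apply_renames(source: str) -> str:
    """Rewrite old identifiers in a source string (hypothetical helper)."""
    for old, new in RENAMES.items():
        # Word boundaries keep already-migrated names such as
        # TokenSplittingInjection from being rewritten a second time.
        pattern = rf"\b{re.escape(old)}\b"
        source = re.sub(pattern, new, source)
    return source


if __name__ == "__main__":
    before = "from eval_toolkit import build_manifest, CaseRandomization\n"
    print(apply_renames(before))
```

This works on raw text, so it also touches matching names inside strings and comments; a review of the resulting diff is still advisable.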
+### Added
+- **[ADR 0004](docs/source/adr/0004-naming-conventions.md)** — Naming
+  conventions decision record with industry citations. Covers module
+  naming (singular vs plural), class suffixes by domain, function
+  verb-prefix conventions, canonical parameter list, fitted-attribute
+  trailing underscore (sklearn convention), TypeVar leading underscore
+  (Google convention), and the SPEC 7 ``rng`` parameter convention
+  (adopted in v0.50.0).
+- **STYLE.md** extended with §3a-d (parameter naming, class suffixes
+  by domain, module naming, asymmetric promotion), §4a-b
+  (fitted-attribute trailing underscore + TypeVar), §12 (75-col
+  docstring prose rule), §14 (test naming convention).
+- **CONTRIBUTING.md** cross-link to ADR 0004 + STYLE.md.
+- **[docs/source/api/strict_tier2_protocols.md](docs/source/api/strict_tier2_protocols.md)** —
+  new docs page enumerating the 9 strict Tier-2 Protocols + 1 opt-in
+  per [ADR 0003 §1](docs/source/adr/0003-stability-contract-and-gate3-methodology.md),
+  with canonical top-level import paths. Resolves #69's discoverability
+  concern without breaking the lightweight design intent of
+  ``eval_toolkit.protocols`` (per ``protocols.py:1-5``).
+- **``src/eval_toolkit/_rng.py``** — private module with SPEC 7 type
+  aliases (``SeedLike``, ``RNGLike``). Not yet referenced; scaffold for
+  the v0.50.0 SPEC 7 adoption.
+- **[ADR 0001](docs/source/adr/0001-flat-module-layout.md)** amendment
+  — added the asymmetric-promotion sub-rule (collection-of-types MAY
+  promote, single-function SHOULD stay underscore).
+### Changed
+- **Duplicate-type consolidation** (single source of truth):
+  - ``Versioned`` Protocol — canonical at ``protocols.py:64``; the
+    duplicate at ``leakage.py:82`` removed. Removed
+    ``"Versioned"`` from ``leakage.__all__``; previously-unused
+    ``from eval_toolkit.leakage import Versioned`` now raises
+    ``ImportError``. Use ``from eval_toolkit.protocols import Versioned``
+    or top-level ``from eval_toolkit import Versioned``.
+  - ``MetricStatus`` ``Literal`` — canonical at ``artifacts.py:30``; the
+    duplicate at ``scorecards.py:78`` removed; ``scorecards`` now
+    imports from ``artifacts``.
+- **[validation] optional extra** reclassified from "active deprecation
+  with removal target v0.33.0" → "permanent no-op kept for backward
+  compatibility." Hard removal would break consumer pip pins of the
+  form ``eval-toolkit[validation]`` for zero functional benefit
+  (R3 in DEPRECATION.md).
+- **Sphinx cross-references** updated from
+  ``eval_toolkit.leakage.Versioned`` → ``eval_toolkit.protocols.Versioned``
+  in ``manifest.py`` docstrings.
+### Deferred to v0.50.0
+- **SPEC 7 ``rng`` parameter adoption** across ~30 NumPy-RNG functions.
+  Scope deferred from v0.49.0 after the planning audit revealed the
+  full blast radius (~30 signature sites + 247 test kwarg sites +
+  7 internal helpers + SeedSequence/Generator/sklearn-bridge
+  conversions). Splitting matches the "one cleanup per minor" pattern
+  per [feedback_staggered_breaking_releases]. ``_rng.py`` ships in
+  v0.49.0 as the scaffold; v0.50.0 wires it into every applicable
+  function.
+### Notes
+- Round 8 audit STOP-GATE per Decision Y.2 — briefing committed at
+  v0.48.0 (commit ``6f6839a``); v0.49.0 ships in parallel since the
+  audit-trail synthesis confirmed R8 audits the existing contract
+  (does not prescribe new changes). Any R8 finding folds into v0.49.1
+  hotfix if needed.
+- Issue #69 closed by the new strict-Tier-2-Protocols docs page; see
+  ``docs/source/api/strict_tier2_protocols.md`` and the close
+  rationale on the issue itself.
+## [0.48.0] — 2026-05-22 — Polish + audit-driven tightening before v1.0 (Round 7 follow-on + cross-API consistency + doc-execution gates)
+Third + final BREAKING minor of the staggered v0.45 → v0.46 → v0.46.1 → v0.47
+→ v0.48 → v1.0 release sequence (plan
+``~/.claude/plans/evaluate-all-the-work-twinkly-kite.md``, Step 4). Migration
+guide: ``docs/source/migration/v0.48.md``.
+Closes:
+- Round 7 audit STOP-GATE per Decision Y.2 (Codex R7-F1/F2/F3 + 6 Gemini
+  observations; see ``docs/source/audit_findings.md`` for the per-finding
+  ledger).
+- Audit-as-seed extensions surfaced during plan refinement: full
+  module-docstring sweep across ``src/eval_toolkit/``; expanded
+  ``.doctest-modules`` from 11 → 21 modules; comprehensive cross-API
+  shape-validation consistency sweep.
+- Round 5 §5E-prep packet-drift fixes (7 methodology documentation
+  corrections).
+After v0.48 observes ≥1 consumer cycle, the Round 8 audit STOP-GATE
+opens before ``v1.0.0`` tag.
+### BREAKING
+- **``BootstrapCI.to_dict()`` + ``PairedBootstrapCI.to_dict()`` schema
+  rewrite** (§5B). Pre-v0.48 hard-coded a ``"ci_95"`` key regardless of
+  the actual ``confidence`` field — the key contradicted the data.
+  v0.48 schema is self-describing:
+    Before: ``{"point_estimate": p, "ci_95": [l, h], "confidence": 0.95, ...}``
+    After:  ``{"point": p, "low": l, "high": h, "confidence": 0.95, ...}``
+  Migration: ``d["point_estimate"]`` → ``d["point"]``; ``d["ci_95"]``
+  → ``(d["low"], d["high"])``. Same rewrite for ``PairedBootstrapCI``.
+- **``sweep()`` schema grows by 1 column** (§5I, Decision R7-B option C).
+  New ``strategy_id`` column inserted between ``text_id`` and ``variant``
+  carries the canonical per-row identifier built from configured
+  kwargs. Callers indexing by column position must re-check offsets.
+- **``sweep()`` rejects duplicate ``strategy_id``** (§5I). Mirrors
+  R6-B's duplicate ``MetricSpec.name`` rejection in ``scorecard()``.
+- **``sweep()`` validates scorer output shape** (§5J, Decision R7-C).
+  Wrong-shape arrays from ``Scorer.predict_proba`` raise contextual
+  ``ValueError`` at the boundary. Pre-v0.48: silent truncation
+  (overlong), ``IndexError`` (short), or ``TypeError`` (matrix-shaped).
+- **``paired_bootstrap_op_point_diff()`` rejects ``val_y is test_y``**
+  (§5E-prep). The two-level bootstrap assumes disjoint val + test
+  partitions; passing the same array causes ~63.2% silent overlap.
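Consumers that persisted pre-v0.48 `to_dict()` output may want to upgrade stored dictionaries rather than regenerate them. The following is a sketch under the assumption that the only differences are the ones named in the Before/After lines above; `upgrade_ci_dict` is a hypothetical helper, not a function shipped by eval-toolkit.

```python
from typing import Any


def upgrade_ci_dict(old: dict[str, Any]) -> dict[str, Any]:
    """Convert a pre-v0.48 bootstrap-CI dict to the v0.48 key layout.

    Dicts that lack either legacy key are returned as an unchanged copy.
    """
    if "point_estimate" not in old or "ci_95" not in old:
        return dict(old)
    low, high = old["ci_95"]
    new: dict[str, Any] = {"point": old["point_estimate"], "low": low, "high": high}
    # Carry every other key (for example "confidence") through untouched.
    for key, value in old.items():
        if key not in ("point_estimate", "ci_95"):
            new[key] = value
    return new


if __name__ == "__main__":
    legacy = {"point_estimate": 0.81, "ci_95": [0.74, 0.88], "confidence": 0.90}
    print(upgrade_ci_dict(legacy))
```

The example input deliberately uses `confidence=0.90` to show the problem the rewrite addresses: the legacy key said `ci_95` even when the interval was not a 95% interval.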
+### Added
+- **``make pre-push``** Makefile target (§5L) running all 3 doc-
+  execution surfaces — Sybil-collected ``.md`` fences, MyST-NB example
+  notebooks, and in-source ``>>>`` docstring examples. Closes the
+  v0.47 Sub-PR 7 incident class.
+- **``nb_execution_raise_on_error = True``** in ``docs/source/conf.py``
+  (§5H, Decision R7-A). Docs CI now fails on notebook execution errors.
+- **``.doctest-modules`` expanded** from 11 → 21 modules (§5M).
+### Changed
+- **Cross-API shape-validation consistency** (§5N). Every public-API
+  surface with array inputs now validates shape + raises ``ValueError``
+  with context (rather than leaking low-level numpy/sklearn errors).
+- **Standardized ``ImportError`` messages** across lazy-extras (§5C).
+  Canonical template: ``"<feature> requires <pkg>. Install with: pip
+  install eval-toolkit[<extra>]"``.
+- **Pin-exact-key-set regression-guards** (§5A) for every dict-returning
+  metrics function. Audit revealed no drift; the tests pin existing
+  key sets so future drift fails CI loud.
+- **Docs polish** (§5K + §5E-prep): ``SynonymSubstitution`` whitelist
+  ``Notes``; ``Scorecard.to_pandas()`` dtype coercion ``Notes``;
+  ``CostSensitiveSelector`` calibrated-prior ``Warning``; ``cv_clt_ci``
+  docstring per Bayle et al. (2020) Theorem 3.1; ``methodology/parallelism.md``
+  post-v0.36 state; ``methodology/testing.md`` reference-equivalence-gap
+  framing; ``methodology/calibration.md`` 4-binary-adapter family;
+  ``methodology/bootstrap.md`` disjoint-split example; DeLong docs
+  aligned to shipped state (Decision U).
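The canonical `ImportError` template for lazy extras can be illustrated with a small guard function. This is a sketch of the pattern only; `require_extra` is a hypothetical name, and how eval-toolkit wires the message internally is not shown in this diff.

```python
import importlib
from types import ModuleType


def require_extra(module_name: str, feature: str, extra: str) -> ModuleType:
    """Import an optional dependency or fail with the templated message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Template from the changelog:
        # "<feature> requires <pkg>. Install with: pip install eval-toolkit[<extra>]"
        raise ImportError(
            f"{feature} requires {module_name}. "
            f"Install with: pip install eval-toolkit[{extra}]"
        ) from exc


if __name__ == "__main__":
    try:
        require_extra("no_such_pkg_for_demo", "probe training", "probes")
    except ImportError as err:
        print(err)
```

Chaining with `from exc` keeps the original import failure visible in the traceback while the user-facing message stays uniform.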
+### Fixed
+- **R7-F1**: 6 MyST-NB example notebooks (``docs/source/examples/*.md``)
+  migrated to v0.47 API; 4 module-level docstrings rewritten; 5
+  drifted ``docs/source/api/*.md`` autosummary lists corrected;
+  8 missing ``api/*.md`` pages created; roadmap "Sybil-validated
+  examples" wording corrected (§5G).
+- **ADR 0001** (flat-module layout) + **ADR 0003** (stability contract
+  + Gate 3 methodology) finalized for v1.0 (§5E + §5F).
+- **schemas.md** + **methodology/claims.md** + **getting-started.md**:
+  ``BootstrapCI`` schema references updated for the §5B rewrite.
 ## [0.47.0] — 2026-05-21 — Sweep unification + TextTransform + advanced-6 + cleanup + Round 6 follow-on
 Second BREAKING minor of the staggered v0.45 → v0.46 → v0.46.1 → v0.47 →

{eval_toolkit-0.47.0 → eval_toolkit-0.49.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: eval-toolkit
-Version: 0.47.0
+Version: 0.49.0
 Summary: Reusable evaluation contracts for binary classification: metrics, bootstrap CIs, calibration, artifacts, and evidence gates.
 Project-URL: Homepage, https://github.com/brandon-behring/eval-toolkit
 Project-URL: Documentation, https://brandon-behring.github.io/eval-toolkit/
@@ -261,13 +261,13 @@ print(f"NLL: {result['nll_pre']:.3f} -> {result['nll_post']:.3f}")
 ```python
 import tempfile
 from pathlib import Path
-from eval_toolkit import build_manifest, write_manifest
+from eval_toolkit import make_manifest, write_manifest
 with tempfile.TemporaryDirectory() as run_dir:
     # data_files: {name: path} → eval_toolkit hashes the files for you;
     # versioned: any object with a `version` attribute (e.g. a scorer or
     # leakage check) is captured by name → version in the manifest.
-    manifest = build_manifest(
+    manifest = make_manifest(
         run_id="quickstart-demo",
         config={"threshold_criterion": "max_f1", "seed": 42},
         seeds={"global": 42, "bootstrap": 42},
@@ -290,7 +290,7 @@ with tempfile.TemporaryDirectory() as run_dir:
 | `eval_toolkit.leakage` | `LeakageCheck` Protocol + 7 reference impls (exact / near / encoding-obfuscated / cross-split / label-conflict / group / temporal); `Versioned` opt-in Protocol |
 | `eval_toolkit.splits` | `Splitter` Protocol + 5 reference impls (holdout / stratified / group / source-disjoint / time-series) |
 | `eval_toolkit.loaders` | `DatasetLoader` Protocol + 4 reference impls (DataFrame / SingleSlice / ParquetGlob / HF datasets) with Croissant-compatible `describe()` |
-| `eval_toolkit.manifest` | `RunManifest` (NeurIPS-aligned) + source-role / guardrail metadata + `build_manifest` / `write_manifest` |
+| `eval_toolkit.manifest` | `RunManifest` (NeurIPS-aligned) + source-role / guardrail metadata + `make_manifest` / `write_manifest` |
 | `eval_toolkit.claims` | `EvidenceGate` class (frozen dataclass: name + callable check + severity), reference gate factories (`required_metric_gate`, `minimum_slice_size_gate`, `metric_threshold_gate`, etc.), `evaluate_claims()`, and `ClaimReport` for claim-mode vs exploratory-mode checks. See [`docs/extending.md`](docs/extending.md) for writing custom gates and [`docs/examples/claims_and_gates.md`](docs/examples/claims_and_gates.md) for a worked end-to-end example. |
 | `eval_toolkit.text_dedup` | `SimilarityStrategy` Protocol + 5 strategies (TF-IDF / hash / embedding / Jaccard / MinHash-LSH); `near_dedup` / `cross_dedup` orchestrators |
 | `eval_toolkit.plotting` | PR curves, reliability diagrams, confusion matrices, score histograms, lift CIs |

{eval_toolkit-0.47.0 → eval_toolkit-0.49.0}/README.md RENAMED

@@ -178,13 +178,13 @@ print(f"NLL: {result['nll_pre']:.3f} -> {result['nll_post']:.3f}")
 ```python
 import tempfile
 from pathlib import Path
-from eval_toolkit import build_manifest, write_manifest
+from eval_toolkit import make_manifest, write_manifest
 with tempfile.TemporaryDirectory() as run_dir:
     # data_files: {name: path} → eval_toolkit hashes the files for you;
     # versioned: any object with a `version` attribute (e.g. a scorer or
     # leakage check) is captured by name → version in the manifest.
-    manifest = build_manifest(
+    manifest = make_manifest(
         run_id="quickstart-demo",
         config={"threshold_criterion": "max_f1", "seed": 42},
         seeds={"global": 42, "bootstrap": 42},
@@ -207,7 +207,7 @@ with tempfile.TemporaryDirectory() as run_dir:
 | `eval_toolkit.leakage` | `LeakageCheck` Protocol + 7 reference impls (exact / near / encoding-obfuscated / cross-split / label-conflict / group / temporal); `Versioned` opt-in Protocol |
 | `eval_toolkit.splits` | `Splitter` Protocol + 5 reference impls (holdout / stratified / group / source-disjoint / time-series) |
 | `eval_toolkit.loaders` | `DatasetLoader` Protocol + 4 reference impls (DataFrame / SingleSlice / ParquetGlob / HF datasets) with Croissant-compatible `describe()` |
-| `eval_toolkit.manifest` | `RunManifest` (NeurIPS-aligned) + source-role / guardrail metadata + `build_manifest` / `write_manifest` |
+| `eval_toolkit.manifest` | `RunManifest` (NeurIPS-aligned) + source-role / guardrail metadata + `make_manifest` / `write_manifest` |
 | `eval_toolkit.claims` | `EvidenceGate` class (frozen dataclass: name + callable check + severity), reference gate factories (`required_metric_gate`, `minimum_slice_size_gate`, `metric_threshold_gate`, etc.), `evaluate_claims()`, and `ClaimReport` for claim-mode vs exploratory-mode checks. See [`docs/extending.md`](docs/extending.md) for writing custom gates and [`docs/examples/claims_and_gates.md`](docs/examples/claims_and_gates.md) for a worked end-to-end example. |
 | `eval_toolkit.text_dedup` | `SimilarityStrategy` Protocol + 5 strategies (TF-IDF / hash / embedding / Jaccard / MinHash-LSH); `near_dedup` / `cross_dedup` orchestrators |
 | `eval_toolkit.plotting` | PR curves, reliability diagrams, confusion matrices, score histograms, lift CIs |

{eval_toolkit-0.47.0 → eval_toolkit-0.49.0}/STYLE.md RENAMED

@@ -36,6 +36,11 @@ Run via `make lint` (= `ruff check + black --check + mypy`) and `make test`.
 ## 3. Naming
+For the full decision record + industry-citations, see
+[ADR 0004 — Naming conventions](docs/source/adr/0004-naming-conventions.md).
+This section is the day-to-day quick reference; the ADR is the
+authoritative source.
 - Module names: `snake_case`, lowercase package (`eval_toolkit`).
 - Class names: `PascalCase`. Suffixes used in this repo:
   - `*Config` — frozen dataclass for settings
@@ -55,6 +60,68 @@ Run via `make lint` (= `ruff check + black --check + mypy`) and `make test`.
 - Mutation marking: not used. Mutating functions return `None` (Pythonic over
   Julia's `_inplace` suffix).
+### 3a. Parameter naming (canonical list, locked at v1.0)
+These names mean these things, everywhere. Future functions MUST use
+them; deviations need justification in the PR description.
+| Parameter | Meaning |
+|---|---|
+| `y_true` | Ground-truth labels (binary, shape `(n,)`) |
+| `y_score` | Continuous score / probability (shape `(n,)`) |
+| `y_pred` | Discrete prediction (threshold-dependent) |
+| `n_resamples` | Bootstrap iteration count |
+| `confidence` | Two-sided confidence level (0.95 default) |
+| `n_bins` | Binning count for calibration / ECE |
+| `n_jobs` | Parallelism (joblib + sklearn convention) |
+| `ax` | Matplotlib axis (matplotlib convention) |
+| `metric` | Callable `(y_true, y_score) -> float` |
+| `rng` | RNG argument per [SPEC 7](https://scientific-python.org/specs/spec-0007/) — target convention; adopted in v0.50.0 |
+The v0.50.0 SPEC 7 adoption preserves two `seed: int` exceptions:
+`set_global_seeds(seed: int)` (global-state setter, not per-function
+RNG; SPEC 7 doesn't apply) and adversarial dataclass fields (use Python
+`random.Random(seed)`; not NumPy-RNG, so SPEC 7's typing doesn't fit).
+### 3b. Class suffixes by domain
+Each suffix maps to a Protocol contract. Stay within the pattern:
+| Suffix | Domain | Protocol |
+|---|---|---|
+| `*Selector` | Threshold selection | `ThresholdSelector` |
+| `*Splitter` | Cross-validation splits | `Splitter` |
+| `*Check` | Leakage detection | `LeakageCheck` |
+| `*Loader` | Dataset loading | `DatasetLoader` |
+| `*Reader` | Prediction artifact reading | `PredictionReader` |
+| `*Variant` | Preprocessing variant | (functional API) |
+| `*Strategy` | Dedup similarity backend | `SimilarityStrategy` |
+| `*Injection` / `*Substitution` | Adversarial char-injection / -substitution | `TextTransform` |
+### 3c. Module naming (singular vs plural)
+- **Plural noun** for collection-of-types modules: `metrics`,
+  `loaders`, `protocols`, `losses`, `probes`, `splits`, `paths`,
+  `seeds`, `thresholds`, `artifacts`, `claims`, `embeddings`,
+  `scorecards`.
+- **Singular noun** for domain-concept modules: `harness`,
+  `bootstrap`, `manifest`, `calibration`, `leakage`, `analysis`,
+  `provenance`, `evidence`, `stacking`, `text_dedup`.
+- **Gerund** for process-domain modules: `preprocessing`.
+### 3d. Asymmetric module promotion (private → public)
+Collection-of-types private modules MAY be promoted to plural-public
+when they hold ≥2 user-relevant types. Single-function private
+modules SHOULD stay underscore. See
+[ADR 0001](docs/source/adr/0001-flat-module-layout.md) for the trigger
+analysis.
+Examples:
+- `_scorecard.py` (4 public exports) → `scorecards.py` at v0.49.0. ✓ promote.
+- `_sweep.py` (1 public function `sweep`) → stays `_sweep.py`. ✓ keep private.
 ## 4. Type hints
 - Every public function has fully typed parameters and return.
@@ -79,10 +146,13 @@ Run via `make lint` (= `ruff check + black --check + mypy`) and `make test`.
     for 4 reference impls.
   - `SimilarityStrategy` (`text_dedup.py`) — pluggable similarity backend for
     `near_dedup` / `cross_dedup` / `NearDuplicateCheck` / `CrossSplitLeakageCheck`.
-  - `Versioned` (`leakage.py`) — opt-in single-attribute Protocol; any Tier-2
-    implementation may expose `version: str`. `RunManifest.versioned_objects`
-    auto-collects them. Mirrors the `lm-evaluation-harness` task `VERSION`
-    pattern. See `docs/methodology/versioning.md`.
+  - `Versioned` (`protocols.py`) — opt-in single-attribute Protocol; any
+    Tier-2 implementation may expose `version: str`.
+    `RunManifest.versioned_objects` auto-collects them. Mirrors the
+    `lm-evaluation-harness` task `VERSION` pattern. See
+    `docs/methodology/versioning.md`. (Single source of truth at
+    `protocols.py:64` since v0.49.0; the duplicate previously in
+    `leakage.py:82` was removed.)
 - All seams are `@runtime_checkable` so callers can `isinstance(obj, Protocol)`.
 - Reference impls are `@dataclass(frozen=True, slots=True)` with config in the
   constructor (`TargetRecallSelector(recall=0.90)`) and the Protocol method as
@@ -90,6 +160,25 @@ Run via `make lint` (= `ruff check + black --check + mypy`) and `make test`.
 - `NamedTuple` for stable public records that benefit from positional access;
   frozen dataclasses with `slots=True` otherwise.
+### 4a. Fitted-attribute trailing underscore (sklearn convention)
+Estimator-style classes (`fit`/`predict` pattern) that store
+**learned-from-data attributes** use trailing underscore per scikit-learn
+convention: `coef_`, `classes_`, `n_features_in_`, `feature_importances_`.
+These attributes MUST NOT be set in `__init__` — set them only in `fit()`.
+Frozen reference-impl dataclasses (`@dataclass(frozen=True, slots=True)`)
+are **exempt** — they hold config, not fitted state.
+Current canonical example: `stacking.LogisticStacker`.
+### 4b. TypeVar naming
+Internal (private) `TypeVar`s use a leading underscore per Google Python
+Style Guide §3.19.10: `_T = TypeVar("_T")`. Public, constrained `TypeVar`s
+without the underscore are allowed only when explicitly part of an
+exported generic API.
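The fitted-attribute rule in §4a can be sketched with a toy estimator. `MeanThresholdClassifier` is hypothetical and exists only for this illustration (the style guide's own canonical example is `stacking.LogisticStacker`, whose source is not part of this diff). Configuration is set in `__init__`; the learned value `threshold_` appears only after `fit()`.

```python
from collections.abc import Sequence


class MeanThresholdClassifier:
    """Hypothetical estimator: thresholds scores at the training mean."""

    def __init__(self, margin: float = 0.0) -> None:
        # Config only. No trailing-underscore attributes are set here.
        self.margin = margin

    def fit(self, y_score: Sequence[float]) -> "MeanThresholdClassifier":
        if len(y_score) == 0:
            raise ValueError("y_score must be non-empty")
        # Learned from data, hence the trailing underscore.
        self.threshold_ = sum(y_score) / len(y_score) + self.margin
        return self

    def predict(self, y_score: Sequence[float]) -> list[int]:
        if not hasattr(self, "threshold_"):
            raise RuntimeError("call fit() before predict()")
        return [int(score >= self.threshold_) for score in y_score]


if __name__ == "__main__":
    clf = MeanThresholdClassifier().fit([0.0, 1.0])
    print(clf.threshold_, clf.predict([0.2, 0.5, 0.9]))
```

A side benefit of the convention is that the presence of the underscore attribute doubles as an "is fitted" check, as `predict` does here.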
 ## 5. Dataclasses
 1. **`slots=True` always** on repo-owned dataclasses. Catches typos at
@@ -220,6 +309,10 @@ def fit_temperature(val_logits, val_labels, bounds=(0.05, 20.0)):
 - **References** cites arXiv IDs / DOIs / journal cites.
 - For modules where doctests would be contrived (`plotting`, `harness`,
   `provenance`), Examples are optional.
+- **Docstring prose wraps at 75 cols** (numpydoc convention) so that
+  `help()` is readable in a terminal. Doctest code blocks inside the
+  docstring follow the 100-col Black rule (code stays comfortable in an
+  editor even though prose around it is narrower).
 ## 13. Comments
@@ -228,6 +321,12 @@ restate what the code says.
 ## 14. Tests
+- **File naming**: `tests/test_<module>.py` mirrors
+  `src/eval_toolkit/<module>.py`. Auxiliary tests per module use
+  suffixes (`test_<module>_props.py`, `test_<module>_validation.py`,
+  `test_<module>_golden.py`).
+- **Function naming**: `test_<thing_under_test>_<scenario>`. No
+  class-based test grouping unless fixtures truly demand it (rare).
 - **Markers**: `unit`, `property`, `smoke`, `golden`.
 - **Sklearn-reference + analytical** as the unit-test oracle where available.
 - **Hypothesis** required for math/stat invariants. Strategies use

{eval_toolkit-0.47.0 → eval_toolkit-0.49.0}/pyproject.toml RENAMED

@@ -74,15 +74,14 @@ probes = ["torch>=2.0", "transformers>=4.40"]
 # (granular extras — losses callers should not have to install the larger
 # transformers stack). Shares the torch version pin with [probes].
 losses = ["torch>=2.0"]
-# DEPRECATED (announced v0.30.1, removal v0.33.0).
+# NO-OP extra kept for backward compatibility (R3 at v0.49.0).
 #
-# Retained as a transitive no-op so `pip install eval-toolkit[validation]`
-# / dev still resolve cleanly. jsonschema moved to the base deps in
-# v0.16.0; this extra has been a no-op ever since. The 2-minor-version
-# window (v0.30.1 announce → v0.33.0 remove) matches the @deprecated
-# policy in docs/DEPRECATION.md. Extras can't trigger import-time
-# DeprecationWarnings, so the deprecation is documentation-only here +
-# in CHANGELOG ### Deprecated + docs/DEPRECATION.md.
+# jsonschema>=4.21 moved to base deps at v0.16.0; this extra has been a
+# no-op ever since. Originally announced as deprecated in v0.30.1 with
+# target removal at v0.33.0, but reclassified at v0.49.0 (R3 in
+# docs/DEPRECATION.md) as a permanent no-op — hard removal would break
+# consumer pip pins of the form `eval-toolkit[validation]` for zero
+# functional benefit. Retained indefinitely.
 validation = []
 # v0.31.0 docs site: Sphinx + pydata-sphinx-theme (replaces v0.28.0's
 # mkdocs-material). Migration drivers — pain points Q1 in the v0.31.0

{eval_toolkit-0.47.0 → eval_toolkit-0.49.0}/src/eval_toolkit/__init__.py RENAMED

@@ -1,9 +1,12 @@
 """eval-toolkit — reusable evaluation contracts for binary classification.
-Public API remains available from ``eval_toolkit`` and from submodules:
+The v1.0 primary metric surface is :func:`~eval_toolkit.scorecard` plus the
+:mod:`~eval_toolkit.metric_specs` namespace (ADR 0002). Submodule paths
+remain available for scalar primitives and adapter authors:
-    from eval_toolkit import pr_auc, bootstrap_ci, BootstrapCI
-    from eval_toolkit.metrics import pr_auc
+    from eval_toolkit import scorecard, metric_specs as ms
+    from eval_toolkit import bootstrap_ci, BootstrapCI
+    from eval_toolkit.metrics import pr_auc  # internal API, ADR 0002
 The package root uses lazy exports so importing ``eval_toolkit`` does not
 eagerly import optional-heavy modules such as plotting, loaders, or harnesses.
@@ -35,15 +38,15 @@ _EXPORTS: dict[str, str] = {
     "ALL_TECHNIQUES": "eval_toolkit.adversarial",
     "BidiRTLInjection": "eval_toolkit.adversarial",
     "CORE_TECHNIQUES": "eval_toolkit.adversarial",
-    "CaseRandomization": "eval_toolkit.adversarial",
+    "CaseInjection": "eval_toolkit.adversarial",
     "DiacriticInjection": "eval_toolkit.adversarial",
     "HomoglyphSubstitution": "eval_toolkit.adversarial",
     "InvisibleCharsInjection": "eval_toolkit.adversarial",
     "PunctuationInjection": "eval_toolkit.adversarial",
     "SynonymSubstitution": "eval_toolkit.adversarial",
     "TagStrippingInjection": "eval_toolkit.adversarial",
-    "TokenSplitting": "eval_toolkit.adversarial",
-    "UnicodeNormalization": "eval_toolkit.adversarial",
+    "TokenSplittingInjection": "eval_toolkit.adversarial",
+    "UnicodeNormalizationInjection": "eval_toolkit.adversarial",
     "WhitespaceInjection": "eval_toolkit.adversarial",
     "ZeroWidthSpaceInjection": "eval_toolkit.adversarial",
     # CharacterInjectionStrategy + character_injection SimpleNamespace
@@ -199,7 +202,7 @@ _EXPORTS: dict[str, str] = {
     "MANIFEST_SCHEMA_VERSION": "eval_toolkit.manifest",
     "RunManifest": "eval_toolkit.manifest",
     "SourceRoleRecord": "eval_toolkit.manifest",
-    "build_manifest": "eval_toolkit.manifest",
+    "make_manifest": "eval_toolkit.manifest",
     "validate_source_roles": "eval_toolkit.manifest",
     "write_manifest": "eval_toolkit.manifest",
     # --- metrics ---
@@ -207,12 +210,15 @@ _EXPORTS: dict[str, str] = {
     "SINGLE_CLASS_INCOMPATIBLE_METRICS": "eval_toolkit.metrics",
     "ThresholdResult": "eval_toolkit.metrics",
     "brier_decomposition": "eval_toolkit.metrics",
-    # `brier_score`, `pr_auc`, `roc_auc`, and the 5 ECE variants removed from
-    # `_EXPORTS` at v0.46 (Decision L). They remain reachable at the top
-    # level via the `__getattr__` deprecation branch (emits
-    # `DeprecationWarning`; branch removed at v0.47) and via the metrics
-    # submodule (`from eval_toolkit.metrics import pr_auc` — internal API
-    # per ADR 0002, not part of the v1.0 stability contract).
+    # `brier_score`, `pr_auc`, `roc_auc`, and the 5 ECE variants were removed
+    # from `_EXPORTS` at v0.46 (Decision L); the v0.46 `__getattr__`
+    # deprecation branch that kept them reachable with `DeprecationWarning`
+    # was removed at v0.47. They now raise `AttributeError` at the top level.
+    # The metrics submodule (`from eval_toolkit.metrics import pr_auc`)
+    # remains the only stable import path for scalar primitives — internal
+    # API per ADR 0002, not part of the v1.0 stability contract. The
+    # `scorecard()` + `metric_specs` surface is the primary path going
+    # forward (`metric_specs.pr_auc`, `metric_specs.roc_auc`, etc.).
     "headline_metrics": "eval_toolkit.metrics",
     "is_metric_defined_for_slice": "eval_toolkit.metrics",
     "metrics_at_threshold": "eval_toolkit.metrics",
@@ -309,10 +315,10 @@ _EXPORTS: dict[str, str] = {
     "wilson_interval": "eval_toolkit.thresholds",
     "LogisticStacker": "eval_toolkit.stacking",
     "MetaLearner": "eval_toolkit.stacking",
-    "MetricResult": "eval_toolkit._scorecard",
-    "MetricSpec": "eval_toolkit._scorecard",
-    "Scorecard": "eval_toolkit._scorecard",
-    "scorecard": "eval_toolkit._scorecard",
+    "MetricResult": "eval_toolkit.scorecards",
+    "MetricSpec": "eval_toolkit.scorecards",
+    "Scorecard": "eval_toolkit.scorecards",
+    "scorecard": "eval_toolkit.scorecards",
     # --- sweep (top-level v0.47 unification — Decision K + Decision D) ---
     "sweep": "eval_toolkit._sweep",
 }

eval_toolkit-0.49.0/src/eval_toolkit/_rng.py ADDED

@@ -0,0 +1,46 @@
+"""Private RNG-parameter type aliases per Scientific-Python SPEC 7.
+This module centralizes the type aliases used to annotate user-facing RNG
+parameters across the toolkit. Per `SPEC 7 — Seeding PRNG
+<https://scientific-python.org/specs/spec-0007/>`_ (Endorsed) eval-toolkit
+exposes a single canonical parameter name ``rng`` typed as
+``RNGLike | SeedLike | None`` on every function that consumes a NumPy
+``Generator``. Bodies normalize via ``np.random.default_rng(rng)``.
+This module is private (underscore prefix) so the aliases stay an
+implementation detail — public symbols use them only in their annotations.
+If a Tier-2 consumer ever needs them exposed for their own callsite type
+annotations, promote them via ``eval_toolkit.protocols`` per the
+asymmetric-promotion principle in ADR 0001 + STYLE.md §3d.
+Exceptions to the SPEC 7 convention — documented in STYLE.md §3a:
+- ``seeds.set_global_seeds(seed: int)`` — global-state setter, not a
+  per-function RNG parameter; SPEC 7 is scoped to per-function RNG inputs.
+- ``adversarial.*Injection`` / ``*Substitution`` / ``CaseInjection``
+  dataclass fields — they use Python's stdlib ``random.Random(seed)``,
+  not NumPy. SPEC 7's typing (``RNGLike = np.random.Generator | ...``) is
+  strictly NumPy-scoped.
+"""
+from __future__ import annotations
+from collections.abc import Sequence
+import numpy as np
+type SeedLike = int | np.integer | Sequence[int] | np.random.SeedSequence
+"""Anything that can seed a NumPy bit generator.
+Per SPEC 7, ``np.random.default_rng`` accepts any of these as a seed
+without further conversion. ``Sequence[int]`` is the entropy-vector form
+used by ``np.random.SeedSequence``.
+"""
+type RNGLike = np.random.Generator | np.random.BitGenerator
+"""An already-instantiated NumPy bit generator or generator wrapper.
+``np.random.default_rng(rng)`` is the identity function on
+``Generator`` inputs and lifts ``BitGenerator`` inputs into a
+``Generator`` — both forms compose cleanly.
+"""
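The module docstring says function bodies normalize the `rng` argument via `np.random.default_rng(rng)`. A minimal sketch of a function following that convention is given below. `bootstrap_indices` is a hypothetical example function, and the aliases are restated with `typing.Union` (rather than the `type` statement used in the diff, which needs Python 3.12+) so the snippet runs on older interpreters.

```python
from collections.abc import Sequence
from typing import Optional, Union

import numpy as np

# Restated aliases; the shipped module defines them with `type` statements.
SeedLike = Union[int, np.integer, Sequence[int], np.random.SeedSequence]
RNGLike = Union[np.random.Generator, np.random.BitGenerator]


def bootstrap_indices(
    n: int,
    n_resamples: int,
    rng: Optional[Union[RNGLike, SeedLike]] = None,
) -> np.ndarray:
    """Draw resampling indices of shape (n_resamples, n) with replacement."""
    # One call handles None, seeds, BitGenerators, and existing Generators.
    generator = np.random.default_rng(rng)
    return generator.integers(0, n, size=(n_resamples, n))


if __name__ == "__main__":
    print(bootstrap_indices(5, 2, rng=42))
```

Passing an integer seed gives reproducible draws, while passing an existing `Generator` lets a caller thread one stream through several functions, since `default_rng` returns a `Generator` argument unaltered.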
