PyPI - eval-toolkit - Versions diffs - 0.48.0__tar.gz → 0.49.0__tar.gz - Mend



Files changed (182)

{eval_toolkit-0.48.0 → eval_toolkit-0.49.0}/CHANGELOG.md RENAMED

@@ -5,6 +5,113 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [0.49.0] — 2026-05-23 — Global naming-standards sweep + final cleanup before v1.0
+Final pre-v1.0 minor consolidating the naming-convention standardization
+that locks the v1.0 Tier-1 contract. Audit + industry-research pass
+(PEP 8, scikit-learn, NumPy, Google Python Style Guide, Scientific
+Python SPEC 7) found the repo already 95-99% consistent; this release
+closes the small remaining gaps + documents the conventions as
+[ADR 0004](docs/source/adr/0004-naming-conventions.md). The SPEC 7
+``rng`` parameter convention is documented here and adopted in v0.50.0.
+### BREAKING
+Five Tier-1 renames for naming consistency (pre-v1.0; SemVer-minor per
+the v0.34.0 BREAKING-minor precedent). Single-consumer lockstep bump in
+``prompt-injection-detection-submission``; no deprecation aliases.
+- **``build_manifest`` → ``make_manifest``** (manifest.py). Aligns
+  with ``make_minilm_embedder`` / ``make_palette`` / ``make_run_dir``
+  factory pattern. ``build_*`` was the only outlier.
+- **``CaseRandomization`` → ``CaseInjection``** (adversarial.py).
+  Aligns with ``*Injection`` / ``*Substitution`` adversarial suffix
+  convention.
+- **``TokenSplitting`` → ``TokenSplittingInjection``** (adversarial.py).
+  Same rationale.
+- **``UnicodeNormalization`` → ``UnicodeNormalizationInjection``**
+  (adversarial.py). Same rationale.
+- **``eval_toolkit._scorecard.py`` → ``eval_toolkit.scorecards.py``**
+  (private → public module promotion). The 4 top-level symbols
+  (``scorecard``, ``Scorecard``, ``MetricSpec``, ``MetricResult``)
+  remain top-level Tier-1; the new public submodule path
+  ``from eval_toolkit.scorecards import Scorecard`` is now stable.
+  ``_scorecard.py`` is gone — old import paths raise
+  ``ModuleNotFoundError``. Per the asymmetric-promotion principle in
+  [ADR 0001](docs/source/adr/0001-flat-module-layout.md): promote
+  collection-of-types modules, keep single-function modules underscore
+  (``_sweep.py`` stays private).
+### Added
+- **[ADR 0004](docs/source/adr/0004-naming-conventions.md)** — Naming
+  conventions decision record with industry citations. Covers module
+  naming (singular vs plural), class suffixes by domain, function
+  verb-prefix conventions, canonical parameter list, fitted-attribute
+  trailing underscore (sklearn convention), TypeVar leading underscore
+  (Google convention), and the SPEC 7 ``rng`` parameter convention
+  (adopted in v0.50.0).
+- **STYLE.md** extended with §3a-d (parameter naming, class suffixes
+  by domain, module naming, asymmetric promotion), §4a-b
+  (fitted-attribute trailing underscore + TypeVar), §12 (75-col
+  docstring prose rule), §14 (test naming convention).
+- **CONTRIBUTING.md** cross-link to ADR 0004 + STYLE.md.
+- **[docs/source/api/strict_tier2_protocols.md](docs/source/api/strict_tier2_protocols.md)** —
+  new docs page enumerating the 9 strict Tier-2 Protocols + 1 opt-in
+  per [ADR 0003 §1](docs/source/adr/0003-stability-contract-and-gate3-methodology.md),
+  with canonical top-level import paths. Resolves #69's discoverability
+  concern without breaking the lightweight design intent of
+  ``eval_toolkit.protocols`` (per ``protocols.py:1-5``).
+- **``src/eval_toolkit/_rng.py``** — private module with SPEC 7 type
+  aliases (``SeedLike``, ``RNGLike``). Not yet referenced; scaffold for
+  the v0.50.0 SPEC 7 adoption.
+- **[ADR 0001](docs/source/adr/0001-flat-module-layout.md)** amendment
+  — added the asymmetric-promotion sub-rule (collection-of-types MAY
+  promote, single-function SHOULD stay underscore).
+### Changed
+- **Duplicate-type consolidation** (single source of truth):
+  - ``Versioned`` Protocol — canonical at ``protocols.py:64``; the
+    duplicate at ``leakage.py:82`` removed. Removed
+    ``"Versioned"`` from ``leakage.__all__``; previously-unused
+    ``from eval_toolkit.leakage import Versioned`` now raises
+    ``ImportError``. Use ``from eval_toolkit.protocols import Versioned``
+    or top-level ``from eval_toolkit import Versioned``.
+  - ``MetricStatus`` ``Literal`` — canonical at ``artifacts.py:30``; the
+    duplicate at ``scorecards.py:78`` removed; ``scorecards`` now
+    imports from ``artifacts``.
+- **[validation] optional extra** reclassified from "active deprecation
+  with removal target v0.33.0" → "permanent no-op kept for backward
+  compatibility." Hard removal would break consumer pip pins of the
+  form ``eval-toolkit[validation]`` for zero functional benefit
+  (R3 in DEPRECATION.md).
+- **Sphinx cross-references** updated from
+  ``eval_toolkit.leakage.Versioned`` → ``eval_toolkit.protocols.Versioned``
+  in ``manifest.py`` docstrings.
+### Deferred to v0.50.0
+- **SPEC 7 ``rng`` parameter adoption** across ~30 NumPy-RNG functions.
+  Scope deferred from v0.49.0 after the planning audit revealed the
+  full blast radius (~30 signature sites + 247 test kwarg sites +
+  7 internal helpers + SeedSequence/Generator/sklearn-bridge
+  conversions). Splitting matches the "one cleanup per minor" pattern
+  per [feedback_staggered_breaking_releases]. ``_rng.py`` ships in
+  v0.49.0 as the scaffold; v0.50.0 wires it into every applicable
+  function.
+### Notes
+- Round 8 audit STOP-GATE per Decision Y.2 — briefing committed at
+  v0.48.0 (commit ``6f6839a``); v0.49.0 ships in parallel since the
+  audit-trail synthesis confirmed R8 audits the existing contract
+  (does not prescribe new changes). Any R8 finding folds into v0.49.1
+  hotfix if needed.
+- Issue #69 closed by the new strict-Tier-2-Protocols docs page; see
+  ``docs/source/api/strict_tier2_protocols.md`` and the close
+  rationale on the issue itself.
 ## [0.48.0] — 2026-05-22 — Polish + audit-driven tightening before v1.0 (Round 7 follow-on + cross-API consistency + doc-execution gates)
 Third + final BREAKING minor of the staggered v0.45 → v0.46 → v0.46.1 → v0.47

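Reviewer note: the five renames under BREAKING ship without deprecation aliases, so downstream call sites have to be updated in one step. A minimal sketch of a mechanical rewrite, assuming a whole-identifier regex substitution over source text is acceptable for the consuming codebase. `RENAMES` and `migrate_source` are hypothetical names for illustration and are not part of eval-toolkit.

```python
import re

# Old -> new names, copied from the 0.49.0 BREAKING list in CHANGELOG.md.
RENAMES = {
    "build_manifest": "make_manifest",
    "CaseRandomization": "CaseInjection",
    "TokenSplitting": "TokenSplittingInjection",
    "UnicodeNormalization": "UnicodeNormalizationInjection",
    "eval_toolkit._scorecard": "eval_toolkit.scorecards",
}

# Match whole identifiers only, so names that are already migrated
# (e.g. TokenSplittingInjection) are not rewritten a second time.
_PATTERN = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(old) for old in RENAMES) + r")(?!\w)"
)


def migrate_source(text: str) -> str:
    """Return `text` with every old name replaced by its 0.49.0 name."""
    return _PATTERN.sub(lambda m: RENAMES[m.group(1)], text)


if __name__ == "__main__":
    print(migrate_source("from eval_toolkit import build_manifest, TokenSplitting"))
```

The substitution is purely textual: it also rewrites matches inside strings and comments, so the resulting diff should be reviewed before committing.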
{eval_toolkit-0.48.0 → eval_toolkit-0.49.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: eval-toolkit
-Version: 0.48.0
+Version: 0.49.0
 Summary: Reusable evaluation contracts for binary classification: metrics, bootstrap CIs, calibration, artifacts, and evidence gates.
 Project-URL: Homepage, https://github.com/brandon-behring/eval-toolkit
 Project-URL: Documentation, https://brandon-behring.github.io/eval-toolkit/
@@ -261,13 +261,13 @@ print(f"NLL: {result['nll_pre']:.3f} -> {result['nll_post']:.3f}")
 ```python
 import tempfile
 from pathlib import Path
-from eval_toolkit import build_manifest, write_manifest
+from eval_toolkit import make_manifest, write_manifest
 with tempfile.TemporaryDirectory() as run_dir:
     # data_files: {name: path} → eval_toolkit hashes the files for you;
     # versioned: any object with a `version` attribute (e.g. a scorer or
     # leakage check) is captured by name → version in the manifest.
-    manifest = build_manifest(
+    manifest = make_manifest(
         run_id="quickstart-demo",
         config={"threshold_criterion": "max_f1", "seed": 42},
         seeds={"global": 42, "bootstrap": 42},
@@ -290,7 +290,7 @@ with tempfile.TemporaryDirectory() as run_dir:
 | `eval_toolkit.leakage` | `LeakageCheck` Protocol + 7 reference impls (exact / near / encoding-obfuscated / cross-split / label-conflict / group / temporal); `Versioned` opt-in Protocol |
 | `eval_toolkit.splits` | `Splitter` Protocol + 5 reference impls (holdout / stratified / group / source-disjoint / time-series) |
 | `eval_toolkit.loaders` | `DatasetLoader` Protocol + 4 reference impls (DataFrame / SingleSlice / ParquetGlob / HF datasets) with Croissant-compatible `describe()` |
-| `eval_toolkit.manifest` | `RunManifest` (NeurIPS-aligned) + source-role / guardrail metadata + `build_manifest` / `write_manifest` |
+| `eval_toolkit.manifest` | `RunManifest` (NeurIPS-aligned) + source-role / guardrail metadata + `make_manifest` / `write_manifest` |
 | `eval_toolkit.claims` | `EvidenceGate` class (frozen dataclass: name + callable check + severity), reference gate factories (`required_metric_gate`, `minimum_slice_size_gate`, `metric_threshold_gate`, etc.), `evaluate_claims()`, and `ClaimReport` for claim-mode vs exploratory-mode checks. See [`docs/extending.md`](docs/extending.md) for writing custom gates and [`docs/examples/claims_and_gates.md`](docs/examples/claims_and_gates.md) for a worked end-to-end example. |
 | `eval_toolkit.text_dedup` | `SimilarityStrategy` Protocol + 5 strategies (TF-IDF / hash / embedding / Jaccard / MinHash-LSH); `near_dedup` / `cross_dedup` orchestrators |
 | `eval_toolkit.plotting` | PR curves, reliability diagrams, confusion matrices, score histograms, lift CIs |

{eval_toolkit-0.48.0 → eval_toolkit-0.49.0}/README.md RENAMED

@@ -178,13 +178,13 @@ print(f"NLL: {result['nll_pre']:.3f} -> {result['nll_post']:.3f}")
 ```python
 import tempfile
 from pathlib import Path
-from eval_toolkit import build_manifest, write_manifest
+from eval_toolkit import make_manifest, write_manifest
 with tempfile.TemporaryDirectory() as run_dir:
     # data_files: {name: path} → eval_toolkit hashes the files for you;
     # versioned: any object with a `version` attribute (e.g. a scorer or
     # leakage check) is captured by name → version in the manifest.
-    manifest = build_manifest(
+    manifest = make_manifest(
         run_id="quickstart-demo",
         config={"threshold_criterion": "max_f1", "seed": 42},
         seeds={"global": 42, "bootstrap": 42},
@@ -207,7 +207,7 @@ with tempfile.TemporaryDirectory() as run_dir:
 | `eval_toolkit.leakage` | `LeakageCheck` Protocol + 7 reference impls (exact / near / encoding-obfuscated / cross-split / label-conflict / group / temporal); `Versioned` opt-in Protocol |
 | `eval_toolkit.splits` | `Splitter` Protocol + 5 reference impls (holdout / stratified / group / source-disjoint / time-series) |
 | `eval_toolkit.loaders` | `DatasetLoader` Protocol + 4 reference impls (DataFrame / SingleSlice / ParquetGlob / HF datasets) with Croissant-compatible `describe()` |
-| `eval_toolkit.manifest` | `RunManifest` (NeurIPS-aligned) + source-role / guardrail metadata + `build_manifest` / `write_manifest` |
+| `eval_toolkit.manifest` | `RunManifest` (NeurIPS-aligned) + source-role / guardrail metadata + `make_manifest` / `write_manifest` |
 | `eval_toolkit.claims` | `EvidenceGate` class (frozen dataclass: name + callable check + severity), reference gate factories (`required_metric_gate`, `minimum_slice_size_gate`, `metric_threshold_gate`, etc.), `evaluate_claims()`, and `ClaimReport` for claim-mode vs exploratory-mode checks. See [`docs/extending.md`](docs/extending.md) for writing custom gates and [`docs/examples/claims_and_gates.md`](docs/examples/claims_and_gates.md) for a worked end-to-end example. |
 | `eval_toolkit.text_dedup` | `SimilarityStrategy` Protocol + 5 strategies (TF-IDF / hash / embedding / Jaccard / MinHash-LSH); `near_dedup` / `cross_dedup` orchestrators |
 | `eval_toolkit.plotting` | PR curves, reliability diagrams, confusion matrices, score histograms, lift CIs |

{eval_toolkit-0.48.0 → eval_toolkit-0.49.0}/STYLE.md RENAMED

@@ -36,6 +36,11 @@ Run via `make lint` (= `ruff check + black --check + mypy`) and `make test`.
 ## 3. Naming
+For the full decision record + industry-citations, see
+[ADR 0004 — Naming conventions](docs/source/adr/0004-naming-conventions.md).
+This section is the day-to-day quick reference; the ADR is the
+authoritative source.
 - Module names: `snake_case`, lowercase package (`eval_toolkit`).
 - Class names: `PascalCase`. Suffixes used in this repo:
   - `*Config` — frozen dataclass for settings
@@ -55,6 +60,68 @@ Run via `make lint` (= `ruff check + black --check + mypy`) and `make test`.
 - Mutation marking: not used. Mutating functions return `None` (Pythonic over
   Julia's `_inplace` suffix).
+### 3a. Parameter naming (canonical list, locked at v1.0)
+These names mean these things, everywhere. Future functions MUST use
+them; deviations need justification in the PR description.
+| Parameter | Meaning |
+|---|---|
+| `y_true` | Ground-truth labels (binary, shape `(n,)`) |
+| `y_score` | Continuous score / probability (shape `(n,)`) |
+| `y_pred` | Discrete prediction (threshold-dependent) |
+| `n_resamples` | Bootstrap iteration count |
+| `confidence` | Two-sided confidence level (0.95 default) |
+| `n_bins` | Binning count for calibration / ECE |
+| `n_jobs` | Parallelism (joblib + sklearn convention) |
+| `ax` | Matplotlib axis (matplotlib convention) |
+| `metric` | Callable `(y_true, y_score) -> float` |
+| `rng` | RNG argument per [SPEC 7](https://scientific-python.org/specs/spec-0007/) — target convention; adopted in v0.50.0 |
+The v0.50.0 SPEC 7 adoption preserves two `seed: int` exceptions:
+`set_global_seeds(seed: int)` (global-state setter, not per-function
+RNG; SPEC 7 doesn't apply) and adversarial dataclass fields (use Python
+`random.Random(seed)`; not NumPy-RNG, so SPEC 7's typing doesn't fit).
+### 3b. Class suffixes by domain
+Each suffix maps to a Protocol contract. Stay within the pattern:
+| Suffix | Domain | Protocol |
+|---|---|---|
+| `*Selector` | Threshold selection | `ThresholdSelector` |
+| `*Splitter` | Cross-validation splits | `Splitter` |
+| `*Check` | Leakage detection | `LeakageCheck` |
+| `*Loader` | Dataset loading | `DatasetLoader` |
+| `*Reader` | Prediction artifact reading | `PredictionReader` |
+| `*Variant` | Preprocessing variant | (functional API) |
+| `*Strategy` | Dedup similarity backend | `SimilarityStrategy` |
+| `*Injection` / `*Substitution` | Adversarial char-injection / -substitution | `TextTransform` |
+### 3c. Module naming (singular vs plural)
+- **Plural noun** for collection-of-types modules: `metrics`,
+  `loaders`, `protocols`, `losses`, `probes`, `splits`, `paths`,
+  `seeds`, `thresholds`, `artifacts`, `claims`, `embeddings`,
+  `scorecards`.
+- **Singular noun** for domain-concept modules: `harness`,
+  `bootstrap`, `manifest`, `calibration`, `leakage`, `analysis`,
+  `provenance`, `evidence`, `stacking`, `text_dedup`.
+- **Gerund** for process-domain modules: `preprocessing`.
+### 3d. Asymmetric module promotion (private → public)
+Collection-of-types private modules MAY be promoted to plural-public
+when they hold ≥2 user-relevant types. Single-function private
+modules SHOULD stay underscore. See
+[ADR 0001](docs/source/adr/0001-flat-module-layout.md) for the trigger
+analysis.
+Examples:
+- `_scorecard.py` (4 public exports) → `scorecards.py` at v0.49.0. ✓ promote.
+- `_sweep.py` (1 public function `sweep`) → stays `_sweep.py`. ✓ keep private.
 ## 4. Type hints
 - Every public function has fully typed parameters and return.
@@ -79,10 +146,13 @@ Run via `make lint` (= `ruff check + black --check + mypy`) and `make test`.
     for 4 reference impls.
   - `SimilarityStrategy` (`text_dedup.py`) — pluggable similarity backend for
     `near_dedup` / `cross_dedup` / `NearDuplicateCheck` / `CrossSplitLeakageCheck`.
-  - `Versioned` (`leakage.py`) — opt-in single-attribute Protocol; any Tier-2
-    implementation may expose `version: str`. `RunManifest.versioned_objects`
-    auto-collects them. Mirrors the `lm-evaluation-harness` task `VERSION`
-    pattern. See `docs/methodology/versioning.md`.
+  - `Versioned` (`protocols.py`) — opt-in single-attribute Protocol; any
+    Tier-2 implementation may expose `version: str`.
+    `RunManifest.versioned_objects` auto-collects them. Mirrors the
+    `lm-evaluation-harness` task `VERSION` pattern. See
+    `docs/methodology/versioning.md`. (Single source of truth at
+    `protocols.py:64` since v0.49.0; the duplicate previously in
+    `leakage.py:82` was removed.)
 - All seams are `@runtime_checkable` so callers can `isinstance(obj, Protocol)`.
 - Reference impls are `@dataclass(frozen=True, slots=True)` with config in the
   constructor (`TargetRecallSelector(recall=0.90)`) and the Protocol method as
@@ -90,6 +160,25 @@ Run via `make lint` (= `ruff check + black --check + mypy`) and `make test`.
 - `NamedTuple` for stable public records that benefit from positional access;
   frozen dataclasses with `slots=True` otherwise.
+### 4a. Fitted-attribute trailing underscore (sklearn convention)
+Estimator-style classes (`fit`/`predict` pattern) that store
+**learned-from-data attributes** use trailing underscore per scikit-learn
+convention: `coef_`, `classes_`, `n_features_in_`, `feature_importances_`.
+These attributes MUST NOT be set in `__init__` — set them only in `fit()`.
+Frozen reference-impl dataclasses (`@dataclass(frozen=True, slots=True)`)
+are **exempt** — they hold config, not fitted state.
+Current canonical example: `stacking.LogisticStacker`.
+### 4b. TypeVar naming
+Internal (private) `TypeVar`s use a leading underscore per Google Python
+Style Guide §3.19.10: `_T = TypeVar("_T")`. Public, constrained `TypeVar`s
+without the underscore are allowed only when explicitly part of an
+exported generic API.
 ## 5. Dataclasses
 1. **`slots=True` always** on repo-owned dataclasses. Catches typos at
@@ -220,6 +309,10 @@ def fit_temperature(val_logits, val_labels, bounds=(0.05, 20.0)):
 - **References** cites arXiv IDs / DOIs / journal cites.
 - For modules where doctests would be contrived (`plotting`, `harness`,
   `provenance`), Examples are optional.
+- **Docstring prose wraps at 75 cols** (numpydoc convention) so that
+  `help()` is readable in a terminal. Doctest code blocks inside the
+  docstring follow the 100-col Black rule (code stays comfortable in an
+  editor even though prose around it is narrower).
 ## 13. Comments
@@ -228,6 +321,12 @@ restate what the code says.
 ## 14. Tests
+- **File naming**: `tests/test_<module>.py` mirrors
+  `src/eval_toolkit/<module>.py`. Auxiliary tests per module use
+  suffixes (`test_<module>_props.py`, `test_<module>_validation.py`,
+  `test_<module>_golden.py`).
+- **Function naming**: `test_<thing_under_test>_<scenario>`. No
+  class-based test grouping unless fixtures truly demand it (rare).
 - **Markers**: `unit`, `property`, `smoke`, `golden`.
 - **Sklearn-reference + analytical** as the unit-test oracle where available.
 - **Hypothesis** required for math/stat invariants. Strategies use

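Reviewer note: the fitted-attribute rule added in STYLE.md §4a (trailing underscore, set only in `fit()`, never in `__init__`) can be illustrated with a toy estimator. This is a sketch under stated assumptions: `MidpointThresholdClassifier` and its `fallback` parameter are hypothetical and do not exist in eval-toolkit. Only the `y_true` / `y_score` parameter names come from the §3a table.

```python
class MidpointThresholdClassifier:
    """Hypothetical estimator: threshold at the midpoint of the class means."""

    def __init__(self, fallback: float = 0.5) -> None:
        # Configuration only. No trailing-underscore attributes are set here.
        self.fallback = fallback

    def fit(self, y_score, y_true):
        pos = [s for s, y in zip(y_score, y_true) if y == 1]
        neg = [s for s, y in zip(y_score, y_true) if y == 0]
        if pos and neg:
            # Learned from data, hence the trailing underscore.
            self.threshold_ = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        else:
            # Degenerate input (one class missing): fall back to the config value.
            self.threshold_ = self.fallback
        return self

    def predict(self, y_score):
        return [int(s >= self.threshold_) for s in y_score]


if __name__ == "__main__":
    clf = MidpointThresholdClassifier()
    print(hasattr(clf, "threshold_"))  # unfitted: attribute does not exist yet
    clf.fit([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
    print(clf.predict([0.3, 0.7]))
```

Because `threshold_` does not exist before `fit()`, a caller can tell fitted from unfitted instances with `hasattr`, which is the practical payoff of the convention.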
{eval_toolkit-0.48.0 → eval_toolkit-0.49.0}/pyproject.toml RENAMED

@@ -74,15 +74,14 @@ probes = ["torch>=2.0", "transformers>=4.40"]
 # (granular extras — losses callers should not have to install the larger
 # transformers stack). Shares the torch version pin with [probes].
 losses = ["torch>=2.0"]
-# DEPRECATED (announced v0.30.1, removal v0.33.0).
+# NO-OP extra kept for backward compatibility (R3 at v0.49.0).
 #
-# Retained as a transitive no-op so `pip install eval-toolkit[validation]`
-# / dev still resolve cleanly. jsonschema moved to the base deps in
-# v0.16.0; this extra has been a no-op ever since. The 2-minor-version
-# window (v0.30.1 announce → v0.33.0 remove) matches the @deprecated
-# policy in docs/DEPRECATION.md. Extras can't trigger import-time
-# DeprecationWarnings, so the deprecation is documentation-only here +
-# in CHANGELOG ### Deprecated + docs/DEPRECATION.md.
+# jsonschema>=4.21 moved to base deps at v0.16.0; this extra has been a
+# no-op ever since. Originally announced as deprecated in v0.30.1 with
+# target removal at v0.33.0, but reclassified at v0.49.0 (R3 in
+# docs/DEPRECATION.md) as a permanent no-op — hard removal would break
+# consumer pip pins of the form `eval-toolkit[validation]` for zero
+# functional benefit. Retained indefinitely.
 validation = []
 # v0.31.0 docs site: Sphinx + pydata-sphinx-theme (replaces v0.28.0's
 # mkdocs-material). Migration drivers — pain points Q1 in the v0.31.0

{eval_toolkit-0.48.0 → eval_toolkit-0.49.0}/src/eval_toolkit/__init__.py RENAMED

@@ -38,15 +38,15 @@ _EXPORTS: dict[str, str] = {
     "ALL_TECHNIQUES": "eval_toolkit.adversarial",
     "BidiRTLInjection": "eval_toolkit.adversarial",
     "CORE_TECHNIQUES": "eval_toolkit.adversarial",
-    "CaseRandomization": "eval_toolkit.adversarial",
+    "CaseInjection": "eval_toolkit.adversarial",
     "DiacriticInjection": "eval_toolkit.adversarial",
     "HomoglyphSubstitution": "eval_toolkit.adversarial",
     "InvisibleCharsInjection": "eval_toolkit.adversarial",
     "PunctuationInjection": "eval_toolkit.adversarial",
     "SynonymSubstitution": "eval_toolkit.adversarial",
     "TagStrippingInjection": "eval_toolkit.adversarial",
-    "TokenSplitting": "eval_toolkit.adversarial",
-    "UnicodeNormalization": "eval_toolkit.adversarial",
+    "TokenSplittingInjection": "eval_toolkit.adversarial",
+    "UnicodeNormalizationInjection": "eval_toolkit.adversarial",
     "WhitespaceInjection": "eval_toolkit.adversarial",
     "ZeroWidthSpaceInjection": "eval_toolkit.adversarial",
     # CharacterInjectionStrategy + character_injection SimpleNamespace
@@ -202,7 +202,7 @@ _EXPORTS: dict[str, str] = {
     "MANIFEST_SCHEMA_VERSION": "eval_toolkit.manifest",
     "RunManifest": "eval_toolkit.manifest",
     "SourceRoleRecord": "eval_toolkit.manifest",
-    "build_manifest": "eval_toolkit.manifest",
+    "make_manifest": "eval_toolkit.manifest",
     "validate_source_roles": "eval_toolkit.manifest",
     "write_manifest": "eval_toolkit.manifest",
     # --- metrics ---
@@ -315,10 +315,10 @@ _EXPORTS: dict[str, str] = {
     "wilson_interval": "eval_toolkit.thresholds",
     "LogisticStacker": "eval_toolkit.stacking",
     "MetaLearner": "eval_toolkit.stacking",
-    "MetricResult": "eval_toolkit._scorecard",
-    "MetricSpec": "eval_toolkit._scorecard",
-    "Scorecard": "eval_toolkit._scorecard",
-    "scorecard": "eval_toolkit._scorecard",
+    "MetricResult": "eval_toolkit.scorecards",
+    "MetricSpec": "eval_toolkit.scorecards",
+    "Scorecard": "eval_toolkit.scorecards",
+    "scorecard": "eval_toolkit.scorecards",
     # --- sweep (top-level v0.47 unification — Decision K + Decision D) ---
     "sweep": "eval_toolkit._sweep",
 }

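Reviewer note: the hunks above only change values in the `_EXPORTS` name-to-module table; the code that consumes the table is not part of this diff. One common way such a table is used is a module-level `__getattr__` (PEP 562) that imports the owning module on first access. The sketch below assumes that pattern and is not taken from the package. `_lazy_getattr` and `demo_pkg` are hypothetical, and the table entries are stdlib stand-ins so the example runs without eval-toolkit installed.

```python
import importlib
import types

# Stand-in table with the same shape as _EXPORTS (public name -> module path).
_EXPORTS: dict[str, str] = {
    "sqrt": "math",
    "OrderedDict": "collections",
}


def _lazy_getattr(name: str):
    """Resolve `name` by importing the module that the table points at."""
    try:
        module_name = _EXPORTS[name]
    except KeyError:
        raise AttributeError(f"module has no attribute {name!r}") from None
    return getattr(importlib.import_module(module_name), name)


# In a real package the function above would be defined as `__getattr__`
# at module level in __init__.py. Here it is attached to a throwaway module.
demo_pkg = types.ModuleType("demo_pkg")
demo_pkg.__getattr__ = _lazy_getattr

if __name__ == "__main__":
    print(demo_pkg.sqrt(9.0))
```

Under this pattern, repointing `"Scorecard"` from `eval_toolkit._scorecard` to `eval_toolkit.scorecards` would keep top-level imports working while the old submodule path disappears, which matches the behaviour the CHANGELOG describes.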
eval_toolkit-0.49.0/src/eval_toolkit/_rng.py ADDED

@@ -0,0 +1,46 @@
+"""Private RNG-parameter type aliases per Scientific-Python SPEC 7.
+This module centralizes the type aliases used to annotate user-facing RNG
+parameters across the toolkit. Per `SPEC 7 — Seeding PRNG
+<https://scientific-python.org/specs/spec-0007/>`_ (Endorsed) eval-toolkit
+exposes a single canonical parameter name ``rng`` typed as
+``RNGLike | SeedLike | None`` on every function that consumes a NumPy
+``Generator``. Bodies normalize via ``np.random.default_rng(rng)``.
+This module is private (underscore prefix) so the aliases stay an
+implementation detail — public symbols use them only in their annotations.
+If a Tier-2 consumer ever needs them exposed for their own callsite type
+annotations, promote them via ``eval_toolkit.protocols`` per the
+asymmetric-promotion principle in ADR 0001 + STYLE.md §3d.
+Exceptions to the SPEC 7 convention — documented in STYLE.md §3a:
+- ``seeds.set_global_seeds(seed: int)`` — global-state setter, not a
+  per-function RNG parameter; SPEC 7 is scoped to per-function RNG inputs.
+- ``adversarial.*Injection`` / ``*Substitution`` / ``CaseInjection``
+  dataclass fields — they use Python's stdlib ``random.Random(seed)``,
+  not NumPy. SPEC 7's typing (``RNGLike = np.random.Generator | ...``) is
+  strictly NumPy-scoped.
+"""
+from __future__ import annotations
+from collections.abc import Sequence
+import numpy as np
+type SeedLike = int | np.integer | Sequence[int] | np.random.SeedSequence
+"""Anything that can seed a NumPy bit generator.
+Per SPEC 7, ``np.random.default_rng`` accepts any of these as a seed
+without further conversion. ``Sequence[int]`` is the entropy-vector form
+used by ``np.random.SeedSequence``.
+"""
+type RNGLike = np.random.Generator | np.random.BitGenerator
+"""An already-instantiated NumPy bit generator or generator wrapper.
+``np.random.default_rng(rng)`` is the identity function on
+``Generator`` inputs and lifts ``BitGenerator`` inputs into a
+``Generator`` — both forms compose cleanly.
+"""

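Reviewer note: the module docstring says function bodies normalize the `rng` argument via `np.random.default_rng(rng)`, but 0.49.0 does not yet wire that into any function. A minimal sketch of what a SPEC 7-style function could look like, assuming NumPy is installed. `bootstrap_means` is a hypothetical helper, not an eval-toolkit function. Only the `rng` and `n_resamples` parameter names come from STYLE.md §3a.

```python
import numpy as np


def bootstrap_means(values, n_resamples=200, rng=None):
    """Hypothetical helper: mean of each bootstrap resample of `values`.

    `rng` may be None, an int seed, a sequence of ints, a SeedSequence,
    a BitGenerator, or a Generator, as described by SPEC 7.
    """
    # Single normalization point: every accepted input becomes a Generator.
    rng = np.random.default_rng(rng)
    data = np.asarray(values, dtype=float)
    # One row of resampled indices per bootstrap iteration.
    idx = rng.integers(0, data.size, size=(n_resamples, data.size))
    return data[idx].mean(axis=1)


if __name__ == "__main__":
    a = bootstrap_means([1, 2, 3, 4], rng=42)
    b = bootstrap_means([1, 2, 3, 4], rng=42)
    print(np.array_equal(a, b))  # same seed, same resamples
```

Passing an existing `Generator` lets a caller thread one stream through several calls, while passing an int gives an independent, reproducible stream per call.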
{eval_toolkit-0.48.0 → eval_toolkit-0.49.0}/src/eval_toolkit/_version.py RENAMED

@@ -2,4 +2,4 @@
 __all__ = ["__version__"]
-__version__ = "0.48.0"
+__version__ = "0.49.0"

{eval_toolkit-0.48.0 → eval_toolkit-0.49.0}/src/eval_toolkit/adversarial.py RENAMED

@@ -12,7 +12,7 @@ Core techniques (shipped in v0.43.0):
 - :class:`HomoglyphSubstitution` — Latin → Cyrillic/Greek lookalikes
 - :class:`DiacriticInjection` — combining-mark insertion (NFC bypass)
 - :class:`WhitespaceInjection` — variable whitespace padding (regular + NBSP)
-- :class:`CaseRandomization` — random case-flipping per character
+- :class:`CaseInjection` — random case-flipping per character
 - :class:`PunctuationInjection` — non-semantic punctuation insertion
 Advanced techniques (shipped in v0.47 per Decision Q11.3):
@@ -20,8 +20,8 @@ Advanced techniques (shipped in v0.47 per Decision Q11.3):
 - :class:`BidiRTLInjection` — U+202E…U+202C override block
 - :class:`TagStrippingInjection` — ``<…>`` tag removal (idempotent)
 - :class:`SynonymSubstitution` — whitelisted-word swap, seed-deterministic
-- :class:`TokenSplitting` — mid-word single-space insertion
-- :class:`UnicodeNormalization` — NFC / NFD / NFKC / NFKD form switch
+- :class:`TokenSplittingInjection` — mid-word single-space insertion
+- :class:`UnicodeNormalizationInjection` — NFC / NFD / NFKC / NFKD form switch
 - :class:`InvisibleCharsInjection` — 5 invisible code points
 The convenience tuples :data:`CORE_TECHNIQUES` (6-tuple),
@@ -54,15 +54,15 @@ __all__ = [
     "ALL_TECHNIQUES",
     "BidiRTLInjection",
     "CORE_TECHNIQUES",
-    "CaseRandomization",
+    "CaseInjection",
     "DiacriticInjection",
     "HomoglyphSubstitution",
     "InvisibleCharsInjection",
     "PunctuationInjection",
     "SynonymSubstitution",
     "TagStrippingInjection",
-    "TokenSplitting",
-    "UnicodeNormalization",
+    "TokenSplittingInjection",
+    "UnicodeNormalizationInjection",
     "WhitespaceInjection",
     "ZeroWidthSpaceInjection",
 ]
@@ -287,7 +287,7 @@ class WhitespaceInjection:
 @dataclass(frozen=True, slots=True)
-class CaseRandomization:
+class CaseInjection:
     """Randomly flip the case of alphabetic characters.
     Deterministic given the seed. Numeric / punctuation / whitespace pass
@@ -311,7 +311,7 @@ class CaseRandomization:
     def __post_init__(self) -> None:
         if not 0.0 <= self.ratio <= 1.0:
-            raise ValueError(f"CaseRandomization: ratio must be in [0, 1]; got {self.ratio}")
+            raise ValueError(f"CaseInjection: ratio must be in [0, 1]; got {self.ratio}")
     def transform(self, text: str) -> str:
         rng = random.Random(self.seed)
@@ -524,7 +524,7 @@ class SynonymSubstitution:
 @dataclass(frozen=True, slots=True)
-class TokenSplitting:
+class TokenSplittingInjection:
     """Insert a single space inside each long enough word.
     Forces subword tokenizers to break a single token into two, often
@@ -552,10 +552,10 @@ class TokenSplitting:
     def __post_init__(self) -> None:
         if self.min_word_length < 2:
             raise ValueError(
-                f"TokenSplitting: min_word_length must be >= 2; got {self.min_word_length}"
+                f"TokenSplittingInjection: min_word_length must be >= 2; got {self.min_word_length}"
             )
         if not 0.0 <= self.ratio <= 1.0:
-            raise ValueError(f"TokenSplitting: ratio must be in [0, 1]; got {self.ratio}")
+            raise ValueError(f"TokenSplittingInjection: ratio must be in [0, 1]; got {self.ratio}")
     def transform(self, text: str) -> str:
         rng = random.Random(self.seed)
@@ -576,7 +576,7 @@ class TokenSplitting:
 @dataclass(frozen=True, slots=True)
-class UnicodeNormalization:
+class UnicodeNormalizationInjection:
     """Apply a Unicode normalization form to the input.
     Defaults to NFKC which folds compatibility characters (e.g., ``ＡＢＣ``
@@ -598,7 +598,7 @@ class UnicodeNormalization:
     def __post_init__(self) -> None:
         if self.form not in {"NFC", "NFD", "NFKC", "NFKD"}:
             raise ValueError(
-                f"UnicodeNormalization: form must be NFC / NFD / NFKC / NFKD; got {self.form!r}"
+                f"UnicodeNormalizationInjection: form must be NFC / NFD / NFKC / NFKD; got {self.form!r}"
             )
     def transform(self, text: str) -> str:
@@ -659,15 +659,15 @@ CORE_TECHNIQUES: tuple[type[Any], ...] = (
     HomoglyphSubstitution,
     DiacriticInjection,
     WhitespaceInjection,
-    CaseRandomization,
+    CaseInjection,
     PunctuationInjection,
 )
 ADVANCED_TECHNIQUES: tuple[type[Any], ...] = (
     BidiRTLInjection,
     TagStrippingInjection,
     SynonymSubstitution,
-    TokenSplitting,
-    UnicodeNormalization,
+    TokenSplittingInjection,
+    UnicodeNormalizationInjection,
     InvisibleCharsInjection,
 )
 ALL_TECHNIQUES: tuple[type[Any], ...] = CORE_TECHNIQUES + ADVANCED_TECHNIQUES
@@ -703,8 +703,8 @@ def _whitespace(
 def _case_random(text: str, ratio: float = 0.5, seed: int = 42) -> str:
-    """Functional alias for :class:`CaseRandomization`."""
-    return CaseRandomization(ratio=ratio, seed=seed).transform(text)
+    """Functional alias for :class:`CaseInjection`."""
+    return CaseInjection(ratio=ratio, seed=seed).transform(text)
 def _punctuation(text: str, ratio: float = 0.1, seed: int = 42) -> str:

{eval_toolkit-0.48.0 → eval_toolkit-0.49.0}/src/eval_toolkit/leakage.py RENAMED

@@ -71,28 +71,16 @@ __all__ = [
     "Severity",
     "TemporalLeakageCheck",
     "TokenizationLeakageCheck",
-    "Versioned",
     "run_leakage_checks",
 ]
 Severity = Literal["error", "warning", "info"]
-@runtime_checkable
-class Versioned(Protocol):
-    """Anything exposing a ``version: str`` attribute.
-    Used by :class:`~eval_toolkit.manifest.RunManifest` to capture per-object
-    versions of any Tier-2 implementation (Scorer, LeakageCheck, Splitter,
-    ThresholdSelector, DatasetLoader). Mirrors the lm-evaluation-harness
-    ``VERSION`` field pattern, which invalidates cross-version metric
-    comparisons. Opt-in: implementations are not required to set ``version``.
-    """
-    @property
-    def version(self) -> str:  # pragma: no cover
-        """Stable version string for this implementation."""
-        ...
+# `Versioned` Protocol previously had a duplicate definition here (v0.7+).
+# Removed at v0.49.0 (N5 dedup) — canonical home is `eval_toolkit.protocols`
+# per `protocols.py:1-5` ("Lightweight public Protocols with minimal dependency
+# surface"). Use `from eval_toolkit.protocols import Versioned` (or top-level
+# `from eval_toolkit import Versioned`).
 # ---------------------------------------------------------------------------
