
eval-toolkit 1.7.0.tar.gz → 1.9.0.tar.gz


Files changed (204)

eval_toolkit-1.9.0/.claude/agents/README.md ADDED

@@ -0,0 +1,54 @@
+# eval-toolkit review agents
+Repo-local Claude Code subagents that enforce the **judgment** the deterministic
+gates can't: SemVer impact, audit-validator architecture, silent failures,
+docstring conformance, and dogfood noise. They are **advisory** — `ruff` / `black` / `mypy` / `pytest` /
+coverage / the public-API snapshot remain the authoritative blocking gates. No
+agent re-runs or replaces them.
+## The agents
+| Agent | Catches | Authoritative source |
+|---|---|---|
+| `etk-audit-validator-reviewer` | Three-layer conformance (identity/scope/pairing), `_narrative` reuse, UTF-8 | ADR 0007 / 0005 / 0006, STYLE.md §5 |
+| `etk-api-stability-guardian` | Tier-1/2/3 SemVer class + public-API snapshot regen | ADR 0003, `tests/test_public_api.py`, STYLE.md §17 |
+| `etk-silent-failure-auditor` | NaN/inf finiteness gaps, swallowed exceptions, encoding/IO, non-diagnostic raises | STYLE.md §1 / §6 / §7 |
+| `etk-docstring-conformance-auditor` | NumPy sections, Raises↔code agreement, canonical param names, runnable Examples | STYLE.md §12 / §3a |
+| `etk-dogfood-noise-analyst` | Classifies consumer residuals (real / FP / edge × layer) | runner: `scripts/dogfood_audit.py` |
+## How to run
+You never need to remember the names. Either:
+- **Describe the task** — "review the changes I made to the citation validator" — and the main agent auto-routes by each agent's `description`; or
+- **Run `/review-eval`** — the one handle that fans them out and synthesizes one verdict.
+```
+/review-eval                      # diff mode: git diff main...HEAD
+/review-eval --pr 84              # review a GitHub PR diff
+/review-eval --audit              # full baseline sweep (whole files, no diff)
+/review-eval --audit api          # focused: just the public surface
+/review-eval --audit validators   # focused: just audit_*.py + _narrative.py
+/review-eval --audit docstrings   # focused: just public docstrings
+/review-eval --refute             # adversarial second pass (quote-or-reject)
+/review-eval --ledger             # persist a review entry under .claude/reviews/
+```
+Diff mode prints to the terminal; `--audit` also writes a machine-local entry
+under `.claude/reviews/` (gitignored).
+## Conventions baked in
+- **Read-back discipline** — every finding quotes code with `path:line`; no quote, no finding (counters validation-without-reading).
+- **High-confidence only** — plus a `suppressed N low-confidence` footer so nothing is silently dropped.
+- **Hybrid rubric** — a tight inlined checklist that defers to `STYLE.md` / the ADRs as the single source of truth.
+- **Structured verdict** — `PASS / CONCERNS / BLOCK`, per-agent and overall.
+`tests/test_claude_agents.py` guards these files against pointer-rot (frontmatter
+parses, `name` matches filename, every cited path exists).
+## Escalation
+For a full multi-round release audit (fan out N finders → dedup → adversarially
+verify → synthesize a ledger), use the **Workflow** tool, not a single subagent.
+`/review-eval --audit` is the lightweight precursor.

{eval_toolkit-1.7.0 → eval_toolkit-1.9.0}/.gitignore RENAMED

@@ -72,8 +72,12 @@ codex-comprehensive-audit-*.md
 # Contents have historical value but are not part of any release.
 .scratch/
-# Claude Code project settings (machine-local)
-.claude/
+# Claude Code: settings are machine-local; the review agents + commands are
+# shared, versioned deliverables (see .claude/agents/README.md). Review ledgers
+# written by `/review-eval --ledger` stay local under .claude/reviews/.
+.claude/*
+!.claude/agents/
+!.claude/commands/
 # mkdocs build output (Section E.1 v0.28.0)
 /site/

{eval_toolkit-1.7.0 → eval_toolkit-1.9.0}/CHANGELOG.md RENAMED

@@ -5,6 +5,197 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [Unreleased]
+## [1.9.0] — 2026-06-10 — resample distribution + silent-NaN hardening + UTF-8 batch (#93, #96, #97)
+### Fixed — pre-tag adversarial-review completion (silent-NaN gaps in `bootstrap_ci` itself)
+The pre-tag review panel (whole-repo re-audit + independent reviewers +
+self-refutation) found the #96 class surviving in `bootstrap_ci` — the
+most-used entry point, outside #96's enumerated scope:
+- **Studentized path**: a NaN outer statistic was accepted as a *valid*
+  resample, bypassing the 95%-valid gate and poisoning the pivots into a
+  silent all-NaN `BootstrapCI` with zero warnings. NaN/inf outer statistics
+  (and inner-jackknife LOO values) now count as degenerate draws.
+- **Non-finite CI bounds now raise for ANY method** — the degeneracy check
+  was gated to `method="BCa"`, so `percentile` returned NaN bounds with only
+  scipy's misdirecting `DegenerateDataWarning` (which always names BCa).
+  The finite BCa-collapse case (`ci_low == ci_high == point`) keeps the R9
+  `UserWarning` contract. Behavior change: BCa NaN-bounds previously
+  warned-and-returned; they now raise (the scorecard/harness per-cell
+  isolation converts this into an error/reason cell as before).
+- **Point estimate guarded**: a metric returning NaN on the full data
+  previously yielded `point_estimate=nan` beside a finite CI, silently.
+- `analysis.load_prediction_arrays`: labels are now domain-checked before
+  the int cast — `dtype=int` coercion silently **truncated** float labels
+  (`0.7 → 0`), flipping ground truth with in-domain values no downstream
+  gate could catch.
+- Docs: `sweep()` Raises now documents the reachable pandas `ImportError`
+  and its Returns lists the always-present `strategy_id` column; STYLE.md
+  Tier-2 quick reference corrected to the ten strict Protocols (v1.0.2
+  `SimilarityStrategy` promotion).
+- Tests: mutation-verified gap closed (per-stratum NaN filter pinned via a
+  non-NaN-propagating `combine`); `samples`↔quantile consistency pinned at
+  non-default confidence.
+### Added — resample-distribution exposure on the cluster bootstraps (#93)
+Tier-1 strictly-appended optional parameters, SemVer-MINOR per the
+[ADR 0003](docs/source/adr/0003-stability-contract-and-gate3-methodology.md)
+2026-06-10 amendment (#101) — backward-compatible; snapshot regenerated in
+the same commit.
+- **`cluster_bootstrap_ci(..., return_samples=True)`** and
+  **`stratified_cluster_bootstrap_ci(..., return_samples=True)`** attach the
+  post-filter bootstrap resample statistics to the result as
+  **`BootstrapCI.samples`** (read-only `numpy.ndarray`, the same array the
+  percentile bounds are computed from; `shape == (n_resamples_used,)`).
+  Distribution summaries such as the consumer's `frac_gt0`
+  (`float(np.mean(ci.samples > 0.0))`) are now derivable from the *same*
+  draws as the CI — previously structurally unrecoverable, blocking the two
+  remaining LODO call-site migrations (downstream DF-11).
+- `BootstrapCI` gains the trailing optional field `samples`
+  (default `None`, `compare=False`, `repr=False`): positional construction,
+  equality/hash semantics, and the **`to_dict()` schema are all unchanged**.
+  Note: `dataclasses.asdict()` (which ignores those flags) now includes a
+  `samples` key — consumers serializing via `asdict` instead of `to_dict()`
+  should drop it (an attached ndarray is not JSON-serializable).
+- scipy precedent: `BootstrapResult.bootstrap_distribution`. Honors the
+  n_jobs bit-for-bit reproducibility contract.
+### Fixed — silent-NaN hardening batch (#96)
+Finiteness guards across the numeric surface (STYLE §1 *never fail silently* /
+§7 validation boundary). All are new raises (or error statuses) on garbage
+input that previously produced silently-wrong results:
+- `cluster_bootstrap_ci` / `stratified_cluster_bootstrap_ci`: NaN/inf scores
+  now raise at the validation boundary (previously the per-stratum check was
+  shape-only and the result was a silent all-NaN `BootstrapCI`); a non-finite
+  point estimate raises with the got-value; resample draws where the statistic
+  (or `combine`) **returns** NaN/inf now count toward the >5% degenerate gate
+  instead of poisoning the quantile CI (previously only *raising* draws counted).
+- `paired_bootstrap_diff` / `paired_bootstrap_ece_diff` /
+  `paired_bootstrap_op_point_diff`: NaN resample deltas now count as degenerate
+  draws (same ≤5% tolerance as raising draws — pre-#96 a NaN CI made
+  `overlaps_zero` read `False`, i.e. silently "statistically significant");
+  a non-finite full-data Δ raises with the got-value; a non-finite-CI-bounds
+  raise remains in each constructor as a backstop (mirrors the BCa degeneracy
+  guard in `bootstrap_ci`).
+- `scorecard` bootstrap path: BCa-degenerate NaN CI bounds are no longer
+  attached to an `"ok"` cell — the CI is dropped and the reason recorded
+  (consistent with the existing "bootstrap unavailable" convention).
+- `eda.median_bandwidth`: non-finite input (NaN/inf) now raises at entry —
+  NaN bypassed the `sigma <= 0.0` check and escaped as a NaN bandwidth, and a
+  NaN row outside the `max_samples` subsample escaped entirely.
+- `eda.maximum_mean_discrepancy`: explicit `bandwidth` must be finite and > 0 —
+  `inf` yielded γ = 0 → all-ones Gram → MMD² = 0 → `p_value = 1.0` silently
+  reading "no shift".
+- `eda` PAD/MMD/kNN feature matrices are finiteness-checked at the boundary
+  (previously NaN embeddings died deep inside sklearn blaming internals).
+- `eda.class_lexical_association`: a `positive_label` matching no label (the
+  1-vs-`"1"` type-mismatch trap) or matching every label now raises listing the
+  observed label values, instead of returning a documented all-empty result
+  that read "no shortcut signal".
+- `scorecard`: a custom `MetricSpec.compute` returning NaN/inf now yields
+  `MetricResult(status="error", reason=...)` through the same path as a raising
+  compute — previously `status="ok"` with a NaN value.
+- `sweep`: a NaN `attack_threshold` now raises (it silently zeroed every `asr`
+  flag); `±inf` remains a documented unsatisfiable sentinel.
+- `analysis.JsonlPredictionReader`: a row missing (or `null` on) a declared
+  column key now fails fast with file + row + key context (the R8-F3 pattern
+  already applied to CSV headers) — previously a missing score coerced to NaN
+  deep in the metric computation and a missing label died as a bare `TypeError`.
+  A malformed JSON row now reports the actual file row (raw `json.JSONDecodeError`
+  always said "line 1"). `analysis.load_prediction_arrays` additionally rejects
+  non-finite loaded scores (a bare JSON `NaN` token or a CSV `"nan"` cell passes
+  per-row key checks) with file + column + row-index context.
+### Fixed — explicit UTF-8 encoding batch (#97)
+Windows (cp1252 locale codec) is the trigger; Linux/macOS hid all of these.
+Locked convention: always pass `encoding="utf-8"` on text-file IO.
+- `docs.render_files` **apply mode** read and wrote consumer markdown with the
+  locale codec — on cp1252 this silently and *cumulatively* corrupted
+  non-ASCII user content on every apply (the worst item in the batch).
+- All remaining text IO made explicit: `__main__` schema/payload reads
+  (RFC 8259 mandates UTF-8 for JSON), `analysis` CSV/JSONL prediction readers,
+  `config.from_yaml`, `artifacts` schema read + report write,
+  `plotting` sidecar write, `scripts/audit_raises_sections.py`.
+- `scripts/dogfood_audit.py`: a surface file skipped for `UnicodeDecodeError`
+  now emits a stderr warning with the path — previously it silently vanished
+  from the acceptance evidence.
+- **Detection locked out permanently**: ruff now enforces `PLW1514`
+  (implicit-encoding) across `src/`, `scripts/`, and `tests/` via
+  `preview = true` + `explicit-preview-rules = true` (only this rule gets
+  preview status; no other behavior changes).
+### Fixed
+- `audit_value_bindings.validate_reader_value_bindings` now raises a
+  diagnostic `ValueError` when a scanned file is not valid UTF-8, instead
+  of letting an unguarded `read_text(encoding="utf-8")` abort the run with
+  a bare `UnicodeDecodeError`. Documented in the function's `Raises` section.
+- `audit_sister_doc_concept_drift.validate_sister_doc_concept_drift` now
+  skips non-UTF-8 files with a `warnings.warn` instead of crashing — its
+  prior `except OSError` did not catch `UnicodeDecodeError` (a `ValueError`,
+  not an `OSError`), so a single non-UTF-8 byte aborted the whole scan.
+### Internal
+- `scripts/` is now covered by `ruff` / `black` / `mypy` across all runners
+  (`Makefile`, `ci.yml`, `.pre-commit-config.yaml`, `tox.ini`, `noxfile.py`).
+### Fixed — documentation/config consistency batch (2026-06-09 full-repo audit)
+- [ADR 0003](docs/source/adr/0003-stability-contract-and-gate3-methodology.md)
+  amended: records the v1.0.2 `SimilarityStrategy` promotion (strict
+  Tier-2 count 9 → 10) in the Tier-1 Protocol list, and replaces the
+  unimplemented `STRICT_DOCSTRINGS` plan with the actual contract
+  (docstring first lines remain pinned through v1.x).
+- `SimilarityStrategy` registered in
+  `tests/test_public_api.py::_TIER2_PROTOCOLS` — the R6-D fail-fast
+  list had lagged the v1.0.2 promotion.
+- `docs/source/roadmap.md` post-v1.0 section refreshed to the v1.8.0
+  state (was still "v1.0.1 is the next minor"; the referenced
+  `v1.0.1 cleanup` issue #76 closed at v1.0.2); broken repo-relative
+  link to a machine-local planning document removed.
+- STYLE.md §17 example updated — `pr_auc` left the top level at v0.46
+  (Decision L); the example now uses `scorecard`. README Tier-2 box
+  disambiguated (10 strict Protocols vs `SliceAwareScorer`/`Versioned`).
+- CONTRIBUTING.md: corrected the `[dev]`-extra claim (heavy optional
+  stacks `embeddings`/`transformers`/`probes`/`losses` are not
+  included) and documented the docs-extra requirement for `pre-push`.
+### Internal
+- `make test` now collects all three doc-execution surfaces — the
+  positional `tests` arg silently bypassed pyproject `testpaths`,
+  skipping the 161 README/docs Sybil doc tests (v0.47 §5L incident
+  class). `make install` installs `.[dev,docs]` so the sphinx
+  pre-push gate works on a fresh environment.
+- CI coverage step excludes `-m integration` (aligns ci.yml with the
+  pyproject marker contract and the Makefile coverage target).
+- tox/nox aligned with `requires-python = ">=3.13"`: py313-only
+  envlist/`PY_VERSIONS`, monte_carlo/benchmark/integration marker
+  exclusions added to their pytest commands, stale "private and
+  home-designed" framing removed; Makefile help text and §5H
+  notebook-gate comments updated to current reality.
+## [1.8.0] — 2026-06-04 — composite multi-stratum cluster bootstrap (#92)
+### Added — `bootstrap.stratified_cluster_bootstrap_ci` (composite multi-stratum cluster bootstrap)
+`eval_toolkit.bootstrap` ADDITIVE per [ADR 0003](docs/source/adr/0003-stability-contract-and-gate3-methodology.md) — backward-compatible. Generalises the v1.7.0 single-block `cluster_bootstrap_ci` to the shape leave-one-group-out transfer gaps actually take: a **composite statistic reduced over several independently-resampled cluster strata**.
+- **`stratified_cluster_bootstrap_ci(strata, per_stratum_metric, combine, *, resample_labels=(0,1), …)`** — `strata` is a mapping `{key: (y, score, groups)}` of independent resample-units (e.g. `seed`, `(carrier, seed)`, `(attack_type, seed)`); each bootstrap iteration resamples every stratum's `(label, group)` clusters, computes `per_stratum_metric` on each, and reduces the `{key: metric}` map with `combine` to one scalar (a seed-averaged ROC-AUC gap, a mean-over-carriers gap, a top−bottom per-type AUPRC contrast, …). Percentile `BootstrapCI` (`method="stratified_cluster_percentile"`). `cluster_bootstrap_ci` is the single-stratum, identity-reduce special case.
+- **Why:** the v1.7.0 single-block primitive could not express the **seed-averaging** that real LODO estimators do inside the bootstrap (`Gx = val − mean_seed(test_roc)`), so it did not actually fit the consumer portfolio's attack-type / carrier / dialect bootstraps. This is the correct primitive for them.
+- **Parallel + reproducible:** built on `parallel_map` + `spawn_seed_sequences` ⇒ `n_jobs` gives bit-for-bit-identical CIs; `n_jobs=-1` all cores.
+- Exported via `from eval_toolkit import stratified_cluster_bootstrap_ci`; `__all__` + `_EXPORTS` + doctest + n_jobs-reproducibility / seed-averaged / composite-statistic tests; mypy-strict clean.
 ## [1.7.0] — 2026-06-04 — label-stratified cluster bootstrap (#90, #91)
 ### Added — `bootstrap.cluster_bootstrap_ci` (label-stratified cluster bootstrap)

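The changelog note on the new `samples` field (`compare=False`, `repr=False`, yet still emitted by `dataclasses.asdict()`) can be reproduced with a stand-in dataclass. This is a minimal sketch, not the library's `BootstrapCI`: the class name `CIResult`, its scalar fields, and the `to_dict()` body are assumptions made for illustration, and a plain list stands in for the ndarray.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass(frozen=True)
class CIResult:
    """Hypothetical stand-in for a CI result; not eval-toolkit's BootstrapCI."""

    point_estimate: float
    ci_low: float
    ci_high: float
    # Trailing optional field: excluded from ==, hash, and repr.
    samples: Optional[list] = field(default=None, compare=False, repr=False)

    def to_dict(self) -> dict:
        # Assumed schema: scalar fields only, so the output stays JSON-friendly.
        return {
            "point_estimate": self.point_estimate,
            "ci_low": self.ci_low,
            "ci_high": self.ci_high,
        }


plain = CIResult(0.5, 0.4, 0.6)
with_samples = CIResult(0.5, 0.4, 0.6, samples=[0.45, 0.55])

# compare=False keeps equality and hashing blind to the attached draws.
equal_despite_samples = plain == with_samples
same_hash = hash(plain) == hash(with_samples)

# asdict() walks every field regardless of the compare/repr flags.
asdict_keys = sorted(asdict(with_samples))
to_dict_keys = sorted(with_samples.to_dict())
```

Under these assumptions, `asdict_keys` contains `samples` while `to_dict_keys` does not, which is the difference the changelog warns `asdict` users about.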
{eval_toolkit-1.7.0 → eval_toolkit-1.9.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: eval-toolkit
-Version: 1.7.0
+Version: 1.9.0
 Summary: Reusable evaluation contracts for binary classification: metrics, bootstrap CIs, calibration, artifacts, and evidence gates.
 Project-URL: Homepage, https://github.com/brandon-behring/eval-toolkit
 Project-URL: Documentation, https://brandon-behring.github.io/eval-toolkit/
@@ -116,11 +116,10 @@ format changes.
 │  manifest.json + seeds + git_sha + data_hashes +       │
 │  gpu_info + leakage_report (NeurIPS-aligned)           │
 ├─ Tier 2 ─ Protocol-based orchestration ────────────────┤
-│  Scorer / SliceAwareScorer / LeakageCheck / Splitter   │
-│  ThresholdSelector / DatasetLoader / MetricSpec        │
-│  MetaLearner / Probe / TextTransform /                 │
-│  SimilarityStrategy (10 strict)                        │
-│  Versioned (opt-in: per-object versions in manifest)   │
+│  Scorer / LeakageCheck / Splitter / ThresholdSelector  │
+│  DatasetLoader / MetricSpec / MetaLearner / Probe /    │
+│  TextTransform / SimilarityStrategy (10 strict)        │
+│  SliceAwareScorer / Versioned (outside the 10 strict)  │
 ├─ Tier 1 ─ Functional core ─────────────────────────────┤
 │  pr_auc / roc_auc / ECE variants / Brier / bootstrap_ci│
 │  paired_bootstrap_diff / cv_clt_ci / mde_from_ci       │

{eval_toolkit-1.7.0 → eval_toolkit-1.9.0}/README.md RENAMED

@@ -30,11 +30,10 @@ format changes.
 │  manifest.json + seeds + git_sha + data_hashes +       │
 │  gpu_info + leakage_report (NeurIPS-aligned)           │
 ├─ Tier 2 ─ Protocol-based orchestration ────────────────┤
-│  Scorer / SliceAwareScorer / LeakageCheck / Splitter   │
-│  ThresholdSelector / DatasetLoader / MetricSpec        │
-│  MetaLearner / Probe / TextTransform /                 │
-│  SimilarityStrategy (10 strict)                        │
-│  Versioned (opt-in: per-object versions in manifest)   │
+│  Scorer / LeakageCheck / Splitter / ThresholdSelector  │
+│  DatasetLoader / MetricSpec / MetaLearner / Probe /    │
+│  TextTransform / SimilarityStrategy (10 strict)        │
+│  SliceAwareScorer / Versioned (outside the 10 strict)  │
 ├─ Tier 1 ─ Functional core ─────────────────────────────┤
 │  pr_auc / roc_auc / ECE variants / Brier / bootstrap_ci│
 │  paired_bootstrap_diff / cv_clt_ci / mde_from_ci       │

{eval_toolkit-1.7.0 → eval_toolkit-1.9.0}/STYLE.md RENAMED

@@ -1,8 +1,8 @@
 # eval-toolkit — Coding Standards
-Self-contained standards for this repository. External readers do not need
-access to any other style document; everything required to contribute lives
-here.
+Self-contained quick reference for this repository. The ADRs
+(`docs/source/adr/`) are the authoritative source for the decisions summarized
+here; everything needed for day-to-day contribution lives in this file.
 ## 1. Foundational principles
@@ -27,7 +27,7 @@ here.
 | Formatter | `black`, line length 100 |
 | Linter | `ruff` with `select = ["E", "W", "F", "I", "N", "UP", "B", "SIM", "C4"]`, ignore `E501` (Black handles), `N803`/`N806` (math identifiers) |
 | Type checker | `mypy` strict (`disallow_untyped_defs`, `disallow_incomplete_defs`, `check_untyped_defs`, `no_implicit_optional`, `warn_redundant_casts`, `warn_unused_ignores`, `warn_no_return`, `strict_equality`, `warn_return_any`) |
-| Test runner | `pytest` with markers `unit`, `property`, `smoke`, `golden`; coverage floor `90%` |
+| Test runner | `pytest` with markers `unit`, `property`, `smoke`, `golden`; coverage floor `92%` |
 | Build backend | `hatchling` |
 | Env manager | `uv` (`uv venv` → `.venv/`; `uv pip install -e .[dev]`) |
 | Python | `>=3.13` (RunPod parity floor; py313 tool targets in pyproject.toml) |
@@ -130,7 +130,15 @@ Examples:
   required.
 - `from __future__ import annotations` only when forward refs require it.
 - `Protocol` only at "real seams" — where two or more concrete implementations
-  exist or are planned. Current seams (as of v0.8.0):
+  exist or are planned. The authoritative Tier-2-stable set is `_TIER2_PROTOCOLS` in
+`tests/test_public_api.py` plus
+[ADR 0003](docs/source/adr/0003-stability-contract-and-gate3-methodology.md):
+the ten strict Tier-2 Protocols are `Scorer`, `LeakageCheck`, `Splitter`,
+`ThresholdSelector`, `DatasetLoader`, `MetricSpec`, `MetaLearner`, `Probe`,
+`TextTransform`, and `SimilarityStrategy` (promoted 10th at v1.0.2, #76
+RC2). The seams below are illustrative detail — `SliceAwareScorer` is an
+opt-in subprotocol of `Scorer`, and `Versioned` is a real seam that is
+**not** in the Tier-2 frozenset:
   - `Scorer` + `SliceAwareScorer` (`harness.py`) — anything with
     `predict_proba(X) -> np.ndarray`. `SliceAwareScorer` adds opt-in
     `should_score_slice(name)` for cost-controlled skipping.
@@ -251,7 +259,11 @@ Local imports inside functions are allowed for:
 ## 11. Logging
 Use `logging` (library context — consumers configure handlers). Do not use
-`print` in `src/eval_toolkit/`.
+`print` in `src/eval_toolkit/`. Log levels: `DEBUG` for internal events; `INFO`
+only for the rare user-relevant harness progress signal; **`WARNING` is reserved
+for `warnings.warn(...)`, not `logger.warning(...)`**; and **`ERROR` must not
+appear in library code — raise an exception instead**. See CONTRIBUTING.md
+§Logging for the full rationale.
 ## 12. Docstrings
@@ -333,7 +345,7 @@ restate what the code says.
   `hypothesis.extra.numpy` for arrays.
 - **Golden tests** only for `docs.py`, where the output is the contract.
 - **Doctests** for math/algorithmic kernels.
-- **Coverage floor**: 90%.
+- **Coverage floor**: 92%.
 - **`assert` is fine in tests.**
 ## 15. Packaging
@@ -359,6 +371,9 @@ restate what the code says.
 - Every module declares `__all__`.
 - The package's `__init__.py` re-exports the public surface so both
-  `from eval_toolkit import pr_auc` and `from eval_toolkit.metrics import pr_auc`
-  work — matches sklearn/pandas/scipy convention.
+  `from eval_toolkit import scorecard` and
+  `from eval_toolkit.scorecards import scorecard` work — matches
+  sklearn/pandas/scipy convention. (Threshold-dependent scalar metrics
+  such as `pr_auc` left the top level at v0.46 Decision L — import
+  them from `eval_toolkit.metrics`.)
 - Private helpers are prefixed with `_` and not re-exported.

{eval_toolkit-1.7.0 → eval_toolkit-1.9.0}/pyproject.toml RENAMED

@@ -160,7 +160,12 @@ line-length = 100
 target-version = "py313"
 [tool.ruff.lint]
-select = ["E", "F", "W", "I", "N", "UP", "B", "SIM", "C4"]
+# preview + explicit-preview-rules: enable ONLY the explicitly selected
+# preview rules (PLW1514 implicit-encoding, #97) — no other preview-mode
+# behavior changes. Locks the Windows-cp1252 mojibake class out permanently.
+preview = true
+explicit-preview-rules = true
+select = ["E", "F", "W", "I", "N", "UP", "B", "SIM", "C4", "PLW1514"]
 ignore = [
     "E501",   # line length handled by black
     "N803",   # function arg lowercase — math kernels use π, T, etc. per Decision 14

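The failure class that `PLW1514` locks out can be shown without a Windows machine by decoding UTF-8 bytes as cp1252 explicitly. A minimal sketch; the file name and sample text are made up for illustration, and the explicit `encoding="cp1252"` read stands in for an implicit-encoding read on a cp1252 locale.

```python
import tempfile
from pathlib import Path

text = "café"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "note.md"  # hypothetical file name
    path.write_text(text, encoding="utf-8")
    # Explicit UTF-8 round-trips the content unchanged.
    good = path.read_text(encoding="utf-8")
    # Simulates what a locale-default read would do on cp1252.
    bad = path.read_text(encoding="cp1252")

# Saving the mis-decoded text as UTF-8 and mis-reading it again
# compounds the damage, which is the cumulative corruption case.
worse = bad.encode("utf-8").decode("cp1252")
```

Each mis-decoded round trip lengthens the string, so repeated apply runs keep degrading the same non-ASCII characters.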
{eval_toolkit-1.7.0 → eval_toolkit-1.9.0}/src/eval_toolkit/__init__.py RENAMED

@@ -140,6 +140,7 @@ _EXPORTS: dict[str, str] = {
     "paired_bootstrap_ece_diff": "eval_toolkit.bootstrap",
     "paired_bootstrap_op_point_diff": "eval_toolkit.bootstrap",
     "paired_mde": "eval_toolkit.bootstrap",
+    "stratified_cluster_bootstrap_ci": "eval_toolkit.bootstrap",
     # --- calibration ---
     "DEFAULT_FN_COST": "eval_toolkit.calibration",
     "DEFAULT_FP_COST": "eval_toolkit.calibration",

{eval_toolkit-1.7.0 → eval_toolkit-1.9.0}/src/eval_toolkit/__main__.py RENAMED

@@ -47,7 +47,7 @@ def _cmd_schemas_show(args: argparse.Namespace) -> int:
         else:
             print(f"unknown schema: {name}", file=sys.stderr)
             return 2
-    print(json.dumps(json.loads(candidate.read_text()), indent=2, sort_keys=True))
+    print(json.dumps(json.loads(candidate.read_text(encoding="utf-8")), indent=2, sort_keys=True))
     return 0
@@ -73,7 +73,7 @@ def _cmd_schemas_check(_args: argparse.Namespace) -> int:
     failures: list[str] = []
     for f in files:
         try:
-            schema = json.loads(f.read_text())
+            schema = json.loads(f.read_text(encoding="utf-8"))
             Draft202012Validator.check_schema(schema)
             print(f"  {f.name}: OK")
         except (json.JSONDecodeError, SchemaError) as exc:
@@ -109,8 +109,8 @@ def _cmd_validate(args: argparse.Namespace) -> int:
     if not file_path.exists():
         print(f"file not found: {args.file}", file=sys.stderr)
         return 2
-    schema = json.loads(schema_path.read_text())
-    payload = json.loads(file_path.read_text())
+    schema = json.loads(schema_path.read_text(encoding="utf-8"))
+    payload = json.loads(file_path.read_text(encoding="utf-8"))
     import jsonschema as _js  # noqa: PLC0415
     try:

{eval_toolkit-1.7.0 → eval_toolkit-1.9.0}/src/eval_toolkit/_sweep.py RENAMED

@@ -72,15 +72,18 @@ def sweep(
         **Required to materialize ``asr``** — the documented contract refuses
         a magic default threshold (cf. ``methodology/thresholds.md``).
         Ignored when ``scorer`` is ``None`` (with ``ValueError`` if passed
-        with ``scorer=None`` to surface the API misuse).
+        with ``scorer=None`` to surface the API misuse). Must not be NaN
+        (every ``asr`` flag would silently be ``False``); ``±inf`` is
+        accepted as a deliberately unsatisfiable sentinel.
     Returns
     -------
     pandas.DataFrame
         Columns vary by which optional kwargs are passed:
-        - Always: ``text_id`` (int), ``variant`` (str — from
-          ``strategy.name``), ``transformed_text`` (str).
+        - Always: ``text_id`` (int), ``strategy_id`` (str —
+          configured-instance identity, Decision R7-B), ``variant`` (str —
+          from ``strategy.name``), ``transformed_text`` (str).
         - With ``scorer``: also ``original_score`` (float) +
           ``transformed_score`` (float).
         - With ``scorer`` AND ``attack_threshold``: also ``asr`` (bool —
@@ -90,9 +93,12 @@ def sweep(
     Raises
     ------
+    ImportError
+        If pandas is not installed (install the ``dataframe`` extra).
     ValueError
         - If ``strategies`` is empty.
         - If ``attack_threshold`` is provided without ``scorer``.
+        - If ``attack_threshold`` is NaN.
         - If any strategy doesn't satisfy ``TextTransform`` structurally
           (typically a missing ``name`` attribute).
@@ -138,6 +144,13 @@ def sweep(
             "Either pass scorer=<scorer> + attack_threshold=<float>, "
             "or omit attack_threshold."
         )
+    # NaN comparisons are all False, so a NaN threshold would silently zero
+    # every asr flag. (±inf is semantically valid: an unsatisfiable sentinel.)
+    if attack_threshold is not None and np.isnan(attack_threshold):
+        raise ValueError(
+            "sweep(): attack_threshold is NaN — every asr flag would be False. "
+            "Pass a finite threshold (or ±inf as an unsatisfiable sentinel)."
+        )
     for i, strategy in enumerate(strategies):
         if not (hasattr(strategy, "name") and hasattr(strategy, "transform")):
             raise ValueError(
@@ -177,8 +190,11 @@ def sweep(
                 "transformed_text": transformed,
             }
             if scorer is not None:
-                assert original_scores is not None
-                assert transformed_scores is not None
+                if original_scores is None or transformed_scores is None:  # pragma: no cover
+                    raise RuntimeError(
+                        "sweep(): internal invariant violated — batch scores not "
+                        "materialized despite scorer being set"
+                    )
                 s_orig = float(original_scores[text_id])
                 s_adv = float(transformed_scores[text_id])
                 row["original_score"] = s_orig

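The new NaN check in `sweep()` exists because every ordered comparison against NaN is false. A minimal sketch with made-up scores; the `>=` flag rule is an assumption for illustration, since the hunks shown here do not include how `asr` is computed.

```python
import math

scores = [0.1, 0.5, 0.9]  # made-up transformed scores


def flags_at_or_above(threshold: float) -> list:
    # Assumed flag rule for illustration: score at or above the threshold.
    return [s >= threshold for s in scores]


nan_flags = flags_at_or_above(math.nan)  # NaN comparisons are always False
inf_flags = flags_at_or_above(math.inf)  # all False by ordinary ordering
finite_flags = flags_at_or_above(0.5)

# NaN is also "not below" every score, so it orders nothing at all.
nan_below = [s < math.nan for s in scores]
```

Both NaN and `+inf` give all-`False` flags under `>=`, but only NaN is also all-`False` under `<`. That is consistent with the diff treating `±inf` as a deliberate sentinel and NaN as an error.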
{eval_toolkit-1.7.0 → eval_toolkit-1.9.0}/src/eval_toolkit/_version.py RENAMED

@@ -2,4 +2,4 @@
 __all__ = ["__version__"]
-__version__ = "1.7.0"
+__version__ = "1.9.0"

{eval_toolkit-1.7.0 → eval_toolkit-1.9.0}/src/eval_toolkit/analysis.py RENAMED

@@ -66,7 +66,7 @@ class CsvPredictionReader:
         """
         wanted = set(columns.values())
         out: dict[str, list[object]] = {col: [] for col in wanted}
-        with Path(uri).open(newline="") as fh:
+        with Path(uri).open(newline="", encoding="utf-8") as fh:
             reader = csv.DictReader(fh)
             # R8-F3: validate the header up-front so missing columns
             # surface as a clear ValueError rather than as a cryptic
@@ -93,14 +93,40 @@ class JsonlPredictionReader:
         *,
         columns: Mapping[str, str],
     ) -> Mapping[str, Sequence[object]]:
-        """Read a local JSONL file."""
+        """Read a local JSONL file.
+        Raises
+        ------
+        ValueError
+            If any non-blank row is not valid JSON, or is missing (or has
+            ``null`` for) a key declared in the ``columns`` mapping.
+            Validated at read time — the R8-F3 pattern already applied to
+            CSV headers — so a missing ``score`` key surfaces with the file
+            path + row number instead of being coerced to NaN deep inside
+            the metric computation (or, for ``label``, dying as a
+            context-free ``TypeError``).
+        """
         wanted = set(columns.values())
         out: dict[str, list[object]] = {col: [] for col in wanted}
-        with Path(uri).open() as fh:
-            for line in fh:
+        with Path(uri).open(encoding="utf-8") as fh:
+            for line_no, line in enumerate(fh, start=1):
                 if not line.strip():
                     continue
-                row = json.loads(line)
+                try:
+                    row = json.loads(line)
+                except json.JSONDecodeError as exc:
+                    # json.loads on a single line always reports "line 1",
+                    # actively misdirecting on which file row is broken.
+                    raise ValueError(
+                        f"JSONL file at {uri!r} row {line_no} is not valid JSON: {exc}"
+                    ) from exc
+                missing = sorted(col for col in wanted if row.get(col) is None)
+                if missing:
+                    raise ValueError(
+                        f"JSONL file at {uri!r} row {line_no} is missing required "
+                        f"key(s) {missing} (or they are null); "
+                        f"available keys: {sorted(row)}"
+                    )
                 for col in wanted:
                     out[col].append(row.get(col))
         return out
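The comment about `json.loads` always reporting "line 1" can be confirmed with the standard library alone. A minimal sketch over an in-memory list of lines (the sample rows are made up); it mirrors the `enumerate(..., start=1)` pattern from the hunk above rather than calling the reader itself.

```python
import json

# Row 1 is valid, row 2 is blank, row 3 is malformed.
lines = ['{"label": 1, "score": 0.9}\n', "\n", '{"label": 0, "score": }\n']

reported_lineno = None
actual_row = None
for line_no, line in enumerate(lines, start=1):
    if not line.strip():
        continue  # blank rows are skipped but still counted
    try:
        json.loads(line)
    except json.JSONDecodeError as exc:
        reported_lineno = exc.lineno  # position inside the one-line string
        actual_row = line_no          # position inside the file
        break
```

Here the decoder's own `lineno` is 1 while the broken row is the third in the file, so only the enumerated counter identifies the row to fix.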
@@ -117,8 +143,13 @@ def load_prediction_arrays(
     ------
     ValueError
         If ``ref`` lacks a ``columns`` mapping, lacks a non-empty ``uri``,
-        or its ``columns`` mapping is missing the ``label`` / ``score``
-        keys (re-raised from :func:`_required_column`).
+        its ``columns`` mapping is missing the ``label`` / ``score`` keys
+        (re-raised from :func:`_required_column`), the loaded scores
+        contain non-finite values (a bare ``NaN`` token in JSONL or a
+        ``"nan"`` cell in CSV passes the readers' per-row key checks but
+        must not flow into metrics as a silent NaN), or the loaded labels
+        are not all in ``{0, 1}`` (an int cast would silently truncate
+        ``0.7 → 0``, flipping ground truth).
     """
     columns = ref.get("columns")
     if not isinstance(columns, Mapping):
@@ -131,8 +162,26 @@ def load_prediction_arrays(
     selected_reader = reader or _reader_for_ref(ref)
     reader_columns = {str(k): str(v) for k, v in columns.items() if isinstance(v, str)}
     table = selected_reader.read_predictions(uri, columns=reader_columns)
-    labels = np.asarray(table[label_col], dtype=int)
+    # Load labels as float first: np.asarray(..., dtype=int) silently
+    # TRUNCATES numeric non-integers (0.7 → 0), flipping ground truth with
+    # in-domain values no downstream gate can catch (v1.9.0 pre-tag review).
+    labels_raw = np.asarray(table[label_col], dtype=float)
+    bad_labels = ~np.isin(labels_raw, (0.0, 1.0))
+    if bad_labels.any():
+        first_bad = int(np.flatnonzero(bad_labels)[0])
+        raise ValueError(
+            f"prediction artifact at {uri!r} column {label_col!r} contains "
+            f"non-binary label(s); first bad value {labels_raw[first_bad]!r} "
+            f"at data row index {first_bad}"
+        )
+    labels = labels_raw.astype(int)
     scores = np.asarray(table[score_col], dtype=float)
+    if not np.isfinite(scores).all():
+        first_bad = int(np.flatnonzero(~np.isfinite(scores))[0])
+        raise ValueError(
+            f"prediction artifact at {uri!r} column {score_col!r} contains "
+            f"non-finite score(s) (NaN/inf); first at data row index {first_bad}"
+        )
     row_id_col = columns.get("row_id")
     hash_col = columns.get("content_hash")
     row_ids = tuple(str(v) for v in table.get(str(row_id_col), ())) if row_id_col else ()
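
The two array checks added to `load_prediction_arrays` can be illustrated on their own. This is a sketch under assumptions: `check_labels_and_scores` is a hypothetical helper, not part of the package, and the error messages are shortened. It demonstrates the truncation the new comment describes and the order of the checks (labels first, then scores).

```python
import numpy as np


def check_labels_and_scores(raw_labels, raw_scores):
    """Hypothetical helper mirroring the label and score checks in the hunk above."""
    # Load as float first so 0.7 stays 0.7 instead of being truncated to 0.
    labels_raw = np.asarray(raw_labels, dtype=float)
    bad_labels = ~np.isin(labels_raw, (0.0, 1.0))
    if bad_labels.any():
        first_bad = int(np.flatnonzero(bad_labels)[0])
        raise ValueError(f"non-binary label at data row index {first_bad}")
    scores = np.asarray(raw_scores, dtype=float)
    finite = np.isfinite(scores)
    if not finite.all():
        first_bad = int(np.flatnonzero(~finite)[0])
        raise ValueError(f"non-finite score at data row index {first_bad}")
    # Only now is the int cast safe: every value is exactly 0.0 or 1.0.
    return labels_raw.astype(int), scores


# The failure mode being guarded against: a direct int cast truncates silently.
truncated = np.asarray([0.7, 1.0], dtype=int)

labels, scores = check_labels_and_scores([0, 1, 1], [0.1, 0.8, 0.6])
```

A NaN label is also rejected by the first check, since NaN compares unequal to both `0.0` and `1.0`.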

{eval_toolkit-1.7.0 → eval_toolkit-1.9.0}/src/eval_toolkit/artifacts.py RENAMED

@@ -243,7 +243,10 @@ def write_json_strict(
     out_path = Path(path)
     out_path.parent.mkdir(parents=True, exist_ok=True)
     sanitized = sanitize_for_json(payload)
-    out_path.write_text(json.dumps(sanitized, indent=indent, sort_keys=sort_keys, allow_nan=False))
+    out_path.write_text(
+        json.dumps(sanitized, indent=indent, sort_keys=sort_keys, allow_nan=False),
+        encoding="utf-8",
+    )
     return out_path
@@ -258,7 +261,7 @@ def validate_payload(payload: object, schema_name: str) -> None:
     from jsonschema import Draft202012Validator  # type: ignore[import-untyped]
     schema_path = resources.files("eval_toolkit") / "schemas" / schema_name
-    schema = json.loads(schema_path.read_text())
+    schema = json.loads(schema_path.read_text(encoding="utf-8"))
     Draft202012Validator(schema).validate(sanitize_for_json(payload))
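
Both hunks in this file pin the text encoding instead of relying on the platform default, which `Path.write_text` and `Path.read_text` otherwise take from the locale. The sketch below is illustrative only: `write_json_utf8` is a hypothetical helper, and `ensure_ascii=False` is an assumption of this example (the diff does not pass it) so that a non-ASCII character actually reaches the file.

```python
import json
import tempfile
from pathlib import Path


def write_json_utf8(path, payload):
    """Hypothetical helper: strict JSON written with an explicit encoding."""
    out_path = Path(path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False is this sketch's assumption; it makes the encoding
    # choice observable by emitting non-ASCII characters unescaped.
    text = json.dumps(payload, sort_keys=True, allow_nan=False, ensure_ascii=False)
    out_path.write_text(text, encoding="utf-8")
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    written = write_json_utf8(Path(tmp) / "nested" / "report.json", {"name": "caf\u00e9"})
    raw_bytes = written.read_bytes()
    # Reading with the same explicit encoding round-trips on any platform.
    round_trip = json.loads(written.read_text(encoding="utf-8"))

# allow_nan=False makes json.dumps reject NaN instead of emitting a bare token.
try:
    json.dumps({"auc": float("nan")}, allow_nan=False)
    nan_rejected = False
except ValueError:
    nan_rejected = True
```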

{eval_toolkit-1.7.0 → eval_toolkit-1.9.0}/src/eval_toolkit/audit_sister_doc_concept_drift.py RENAMED

@@ -54,6 +54,7 @@ concept_drift v1.0.4).
 from __future__ import annotations
 import re
+import warnings
 from collections.abc import Callable, Sequence
 from dataclasses import dataclass
 from pathlib import Path
@@ -220,7 +221,14 @@ def validate_sister_doc_concept_drift(
     for path in files_resolved:
         try:
             file_texts[path] = path.read_text(encoding="utf-8")
-        except OSError:
+        except (OSError, UnicodeDecodeError) as exc:
+            # UnicodeDecodeError is a ValueError, not an OSError — without it a
+            # single non-UTF-8 byte would crash the whole scan. Skip unreadable
+            # or non-UTF-8 files, but warn so the skip is not silent (STYLE §1).
+            warnings.warn(
+                f"skipping unreadable file {path}: {exc}",
+                stacklevel=2,
+            )
             continue
     drift_clusters: list[DriftCluster] = []
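
The exception-hierarchy point in the new comment is easy to verify. Here is a small sketch, assuming a hypothetical `read_texts` helper and made-up file names: `UnicodeDecodeError` derives from `ValueError`, so an `except OSError` clause alone would let it propagate, while the widened clause skips the file and emits a warning.

```python
import tempfile
import warnings
from pathlib import Path


def read_texts(paths):
    """Hypothetical helper: read UTF-8 files, warning on and skipping bad ones."""
    texts = {}
    for path in paths:
        try:
            texts[path] = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError) as exc:
            # Skip the file, but make the skip visible to the caller.
            warnings.warn(f"skipping unreadable file {path}: {exc}", stacklevel=2)
            continue
    return texts


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.md"
    bad = Path(tmp) / "bad.md"
    absent = Path(tmp) / "absent.md"  # never created, so reading raises OSError
    good.write_text("hello", encoding="utf-8")
    bad.write_bytes(b"\xff\xfe not utf-8")  # 0xff is never valid in UTF-8
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        texts = read_texts([good, bad, absent])
    readable_names = sorted(p.name for p in texts)
    warning_count = len(caught)
```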
