PyPI - eval-toolkit - Versions diffs - 0.47.0__tar.gz → 0.48.0__tar.gz - Mend

eval-toolkit 0.47.0.tar.gz → 0.48.0.tar.gz


Files changed (182)

{eval_toolkit-0.47.0 → eval_toolkit-0.48.0}/CHANGELOG.md RENAMED

@@ -5,6 +5,96 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [0.48.0] — 2026-05-22 — Polish + audit-driven tightening before v1.0 (Round 7 follow-on + cross-API consistency + doc-execution gates)
+Third + final BREAKING minor of the staggered v0.45 → v0.46 → v0.46.1 → v0.47
+→ v0.48 → v1.0 release sequence (plan
+``~/.claude/plans/evaluate-all-the-work-twinkly-kite.md``, Step 4). Migration
+guide: ``docs/source/migration/v0.48.md``.
+Closes:
+- Round 7 audit STOP-GATE per Decision Y.2 (Codex R7-F1/F2/F3 + 6 Gemini
+  observations; see ``docs/source/audit_findings.md`` for the per-finding
+  ledger).
+- Audit-as-seed extensions surfaced during plan refinement: full
+  module-docstring sweep across ``src/eval_toolkit/``; expanded
+  ``.doctest-modules`` from 11 → 21 modules; comprehensive cross-API
+  shape-validation consistency sweep.
+- Round 5 §5E-prep packet-drift fixes (7 methodology documentation
+  corrections).
+After v0.48 observes ≥1 consumer cycle, the Round 8 audit STOP-GATE
+opens before ``v1.0.0`` tag.
+### BREAKING
+- **``BootstrapCI.to_dict()`` + ``PairedBootstrapCI.to_dict()`` schema
+  rewrite** (§5B). Pre-v0.48 hard-coded a ``"ci_95"`` key regardless of
+  the actual ``confidence`` field — the key contradicted the data.
+  v0.48 schema is self-describing:
+    Before: ``{"point_estimate": p, "ci_95": [l, h], "confidence": 0.95, ...}``
+    After:  ``{"point": p, "low": l, "high": h, "confidence": 0.95, ...}``
+  Migration: ``d["point_estimate"]`` → ``d["point"]``; ``d["ci_95"]``
+  → ``(d["low"], d["high"])``. Same rewrite for ``PairedBootstrapCI``.
+- **``sweep()`` schema grows by 1 column** (§5I, Decision R7-B option C).
+  New ``strategy_id`` column inserted between ``text_id`` and ``variant``
+  carries the canonical per-row identifier built from configured
+  kwargs. Callers indexing by column position must re-check offsets.
+- **``sweep()`` rejects duplicate ``strategy_id``** (§5I). Mirrors
+  R6-B's duplicate ``MetricSpec.name`` rejection in ``scorecard()``.
+- **``sweep()`` validates scorer output shape** (§5J, Decision R7-C).
+  Wrong-shape arrays from ``Scorer.predict_proba`` raise contextual
+  ``ValueError`` at the boundary. Pre-v0.48: silent truncation
+  (overlong), ``IndexError`` (short), or ``TypeError`` (matrix-shaped).
+- **``paired_bootstrap_op_point_diff()`` rejects ``val_y is test_y``**
+  (§5E-prep). The two-level bootstrap assumes disjoint val + test
+  partitions; passing the same array causes ~63.2% silent overlap.
+### Added
+- **``make pre-push``** Makefile target (§5L) running all 3 doc-
+  execution surfaces — Sybil-collected ``.md`` fences, MyST-NB example
+  notebooks, and in-source ``>>>`` docstring examples. Closes the
+  v0.47 Sub-PR 7 incident class.
+- **``nb_execution_raise_on_error = True``** in ``docs/source/conf.py``
+  (§5H, Decision R7-A). Docs CI now fails on notebook execution errors.
+- **``.doctest-modules`` expanded** from 11 → 21 modules (§5M).
+### Changed
+- **Cross-API shape-validation consistency** (§5N). Every public-API
+  surface with array inputs now validates shape + raises ``ValueError``
+  with context (rather than leaking low-level numpy/sklearn errors).
+- **Standardized ``ImportError`` messages** across lazy-extras (§5C).
+  Canonical template: ``"<feature> requires <pkg>. Install with: pip
+  install eval-toolkit[<extra>]"``.
+- **Pin-exact-key-set regression-guards** (§5A) for every dict-returning
+  metrics function. Audit revealed no drift; the tests pin existing
+  key sets so future drift fails CI loud.
+- **Docs polish** (§5K + §5E-prep): ``SynonymSubstitution`` whitelist
+  ``Notes``; ``Scorecard.to_pandas()`` dtype coercion ``Notes``;
+  ``CostSensitiveSelector`` calibrated-prior ``Warning``; ``cv_clt_ci``
+  docstring per Bayle et al. (2020) Theorem 3.1; ``methodology/parallelism.md``
+  post-v0.36 state; ``methodology/testing.md`` reference-equivalence-gap
+  framing; ``methodology/calibration.md`` 4-binary-adapter family;
+  ``methodology/bootstrap.md`` disjoint-split example; DeLong docs
+  aligned to shipped state (Decision U).
+### Fixed
+- **R7-F1**: 6 MyST-NB example notebooks (``docs/source/examples/*.md``)
+  migrated to v0.47 API; 4 module-level docstrings rewritten; 5
+  drifted ``docs/source/api/*.md`` autosummary lists corrected;
+  8 missing ``api/*.md`` pages created; roadmap "Sybil-validated
+  examples" wording corrected (§5G).
+- **ADR 0001** (flat-module layout) + **ADR 0003** (stability contract
+  + Gate 3 methodology) finalized for v1.0 (§5E + §5F).
+- **schemas.md** + **methodology/claims.md** + **getting-started.md**:
+  ``BootstrapCI`` schema references updated for the §5B rewrite.
 ## [0.47.0] — 2026-05-21 — Sweep unification + TextTransform + advanced-6 + cleanup + Round 6 follow-on
 Second BREAKING minor of the staggered v0.45 → v0.46 → v0.46.1 → v0.47 →
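
Reviewer note: the `to_dict()` migration listed under BREAKING can be sketched as a small adapter for consumers that still hold JSON written before v0.48. The helper name `upgrade_ci_dict` is hypothetical and not part of eval-toolkit; it assumes only the key names quoted in the changelog entry above.

```python
from typing import Any


def upgrade_ci_dict(old: dict[str, Any]) -> dict[str, Any]:
    """Convert a pre-v0.48 CI dict to the v0.48 key layout (hypothetical helper)."""
    new = dict(old)  # shallow copy so the caller's dict is left untouched
    if "ci_95" in new:
        low, high = new.pop("ci_95")
        new["low"], new["high"] = low, high
    if "point_estimate" in new:  # PairedBootstrapCI dicts use "delta" instead
        new["point"] = new.pop("point_estimate")
    return new


legacy = {"point_estimate": 0.81, "ci_95": [0.77, 0.85], "confidence": 0.9,
          "n_resamples": 1000, "method": "BCa"}
upgraded = upgrade_ci_dict(legacy)
```

The example uses `confidence=0.9` on purpose: it is the case where the old `"ci_95"` key contradicted the data, and the new `low` / `high` keys stay neutral.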

{eval_toolkit-0.47.0 → eval_toolkit-0.48.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: eval-toolkit
-Version: 0.47.0
+Version: 0.48.0
 Summary: Reusable evaluation contracts for binary classification: metrics, bootstrap CIs, calibration, artifacts, and evidence gates.
 Project-URL: Homepage, https://github.com/brandon-behring/eval-toolkit
 Project-URL: Documentation, https://brandon-behring.github.io/eval-toolkit/

{eval_toolkit-0.47.0 → eval_toolkit-0.48.0}/src/eval_toolkit/__init__.py RENAMED

@@ -1,9 +1,12 @@
 """eval-toolkit — reusable evaluation contracts for binary classification.
-Public API remains available from ``eval_toolkit`` and from submodules:
+The v1.0 primary metric surface is :func:`~eval_toolkit.scorecard` plus the
+:mod:`~eval_toolkit.metric_specs` namespace (ADR 0002). Submodule paths
+remain available for scalar primitives and adapter authors:
-    from eval_toolkit import pr_auc, bootstrap_ci, BootstrapCI
-    from eval_toolkit.metrics import pr_auc
+    from eval_toolkit import scorecard, metric_specs as ms
+    from eval_toolkit import bootstrap_ci, BootstrapCI
+    from eval_toolkit.metrics import pr_auc  # internal API, ADR 0002
 The package root uses lazy exports so importing ``eval_toolkit`` does not
 eagerly import optional-heavy modules such as plotting, loaders, or harnesses.
@@ -207,12 +210,15 @@ _EXPORTS: dict[str, str] = {
     "SINGLE_CLASS_INCOMPATIBLE_METRICS": "eval_toolkit.metrics",
     "ThresholdResult": "eval_toolkit.metrics",
     "brier_decomposition": "eval_toolkit.metrics",
-    # `brier_score`, `pr_auc`, `roc_auc`, and the 5 ECE variants removed from
-    # `_EXPORTS` at v0.46 (Decision L). They remain reachable at the top
-    # level via the `__getattr__` deprecation branch (emits
-    # `DeprecationWarning`; branch removed at v0.47) and via the metrics
-    # submodule (`from eval_toolkit.metrics import pr_auc` — internal API
-    # per ADR 0002, not part of the v1.0 stability contract).
+    # `brier_score`, `pr_auc`, `roc_auc`, and the 5 ECE variants were removed
+    # from `_EXPORTS` at v0.46 (Decision L); the v0.46 `__getattr__`
+    # deprecation branch that kept them reachable with `DeprecationWarning`
+    # was removed at v0.47. They now raise `AttributeError` at the top level.
+    # The metrics submodule (`from eval_toolkit.metrics import pr_auc`)
+    # remains the only stable import path for scalar primitives — internal
+    # API per ADR 0002, not part of the v1.0 stability contract. The
+    # `scorecard()` + `metric_specs` surface is the primary path going
+    # forward (`metric_specs.pr_auc`, `metric_specs.roc_auc`, etc.).
     "headline_metrics": "eval_toolkit.metrics",
     "is_metric_defined_for_slice": "eval_toolkit.metrics",
     "metrics_at_threshold": "eval_toolkit.metrics",

{eval_toolkit-0.47.0 → eval_toolkit-0.48.0}/src/eval_toolkit/_scorecard.py RENAMED

@@ -272,6 +272,38 @@ class Scorecard(Mapping[str, MetricResult]):
         ``n_resamples`` + ``method`` so the schema is lossless against
         :meth:`BootstrapCI.to_dict` — trace provenance no longer drops in
         the DataFrame view.
+        Notes
+        -----
+        **Dtype coercion: ``n_resamples`` is ``float64``, not ``Int64``.**
+        ``BootstrapCI.n_resamples`` is an ``int`` at the Python level, but
+        pandas treats a mixed ``int`` + ``NaN`` column as ``float64`` —
+        any row with ``status != "ok"`` or ``bootstrap=False`` carries
+        ``NaN`` in the CI columns, and NaN forces the whole column to
+        floating-point. So ``df["pr_auc"]["n_resamples"].dtype`` is
+        ``float64``, and individual values read back as e.g. ``1000.0``
+        rather than ``1000`` (the trade-off Decision R6-C accepted to
+        keep the schema lossless).
+        Consumers expecting strict ``Int64`` semantics (e.g., for joins
+        against an integer-typed table, or for SQL emission where
+        ``float64`` would round-trip as ``DOUBLE``) need to cast
+        explicitly *after* dropping NaN rows:
+        ::
+            df["pr_auc"]["n_resamples"].dropna().astype("Int64")
+        or use pandas' nullable integer extension dtype at construction
+        time::
+            df["pr_auc"]["n_resamples"] = df["pr_auc"]["n_resamples"].astype("Int64")
+        which preserves NaN as ``pd.NA`` and the rest as integer.
+        ``Scorecard.to_pandas()`` does not perform this coercion by
+        default because it would force a pandas-nullable-dtype dependency
+        on every consumer; the float64 default works under any pandas
+        version.
         """
         try:
             import pandas as pd
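
Reviewer note: the dtype coercion described in the new `Notes` section can be reproduced with plain pandas, independent of `Scorecard`. The sketch below assumes only that pandas is installed; the two-row series is a made-up stand-in for one metric with a bootstrap CI and one without.

```python
import pandas as pd

# One row with a bootstrap CI, one without (NaN in the CI column).
n_resamples = pd.Series([1000, float("nan")], name="n_resamples")

as_float = n_resamples.dtype                     # float64: the NaN forces floats
strict = n_resamples.dropna().astype("Int64")    # drops the NaN row, keeps integers
nullable = n_resamples.astype("Int64")           # keeps the row, NaN becomes pd.NA
```

The `dropna()` route shortens the series, while the nullable `Int64` route keeps row alignment with the other CI columns; which one fits depends on whether the consumer needs every metric row to survive.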

{eval_toolkit-0.47.0 → eval_toolkit-0.48.0}/src/eval_toolkit/_sweep.py RENAMED

@@ -103,7 +103,7 @@ def sweep(
     >>> from eval_toolkit import DelimitVariant, DatamarkVariant, sweep
     >>> df = sweep([DelimitVariant(), DatamarkVariant()], ["hello world"])
     >>> sorted(df.columns.tolist())
-    ['text_id', 'transformed_text', 'variant']
+    ['strategy_id', 'text_id', 'transformed_text', 'variant']
     >>> df[df["variant"] == "delimit"].iloc[0]["transformed_text"]
     '<<hello world>>'
@@ -144,6 +144,7 @@ def sweep(
                 f"sweep(): strategy at index {i} ({type(strategy).__name__}) "
                 f"does not satisfy TextTransform (missing 'name' or 'transform')."
             )
+    _validate_unique_strategy_ids(strategies)
     text_list = list(texts)
     rows: list[dict[str, object]] = []
@@ -153,15 +154,25 @@ def sweep(
     original_scores: np.ndarray | None = None
     if scorer is not None and text_list:
         original_scores = np.asarray(scorer.predict_proba(text_list))
+        _validate_scorer_output(
+            original_scores, expected_n=len(text_list), label="original-texts batch"
+        )
     for strategy in strategies:
+        sid = _strategy_id_for(strategy)
         transformed_list = [strategy.transform(t) for t in text_list]
         transformed_scores: np.ndarray | None = None
         if scorer is not None and transformed_list:
             transformed_scores = np.asarray(scorer.predict_proba(transformed_list))
+            _validate_scorer_output(
+                transformed_scores,
+                expected_n=len(text_list),
+                label=f"transformed-texts batch for strategy {strategy.name!r}",
+            )
         for text_id, (_, transformed) in enumerate(zip(text_list, transformed_list, strict=True)):
             row: dict[str, object] = {
                 "text_id": text_id,
+                "strategy_id": sid,
                 "variant": strategy.name,
                 "transformed_text": transformed,
             }
@@ -176,9 +187,116 @@ def sweep(
                     row["asr"] = bool(s_orig >= attack_threshold > s_adv)
             rows.append(row)
-    base_cols = ["text_id", "variant", "transformed_text"]
+    base_cols = ["text_id", "strategy_id", "variant", "transformed_text"]
     if scorer is not None:
         base_cols += ["original_score", "transformed_score"]
     if attack_threshold is not None:
         base_cols += ["asr"]
     return pd.DataFrame(rows, columns=base_cols)
+# ─────────────────────────────────────────────────────────────────────────────
+# Helpers — strategy identity (Decision R7-B; v0.48 §5I)
+# ─────────────────────────────────────────────────────────────────────────────
+def _strategy_id_for(strategy: TextTransform) -> str:
+    """Build a stable, repr-stable identifier from a strategy's configured state.
+    Decision R7-B (Round 7 audit, Codex R7-F2): a strategy's ``name`` alone is
+    not enough to identify a configured instance. Two instances of the same
+    dataclass with different kwargs (e.g., ``DelimitVariant(delimiter="<<")``
+    and ``DelimitVariant(delimiter="[[")``) share ``name == "delimit"`` and
+    would silently merge under ``groupby("variant")``. The ``strategy_id``
+    column carries the canonical configured identity so downstream
+    analysis can disambiguate.
+    Format (pseudo-URI; chosen for groupby-friendliness + special-char
+    safety via ``repr()``):
+    - Frozen dataclass strategies: ``"<name>/<k1>=<repr(v1)>,<k2>=<repr(v2)>,..."``
+      with kwargs alphabetized (excluding the ``name`` field itself). Mirrors
+      :func:`eval_toolkit.metric_specs.make_spec_name` but uses ``repr(value)``
+      instead of ``str(value)`` so string kwargs with special chars (``<<``,
+      ``[[``, ``^``, etc.) round-trip cleanly.
+    - Plain :class:`TextTransform`-Protocol-satisfying objects without
+      ``__dataclass_fields__``: falls back to ``strategy.name``.
+    Examples
+    --------
+    >>> from eval_toolkit.preprocessing import DelimitVariant
+    >>> _strategy_id_for(DelimitVariant(delimiter="<<", end=">>"))
+    "delimit/delimiter='<<',end='>>'"
+    >>> from eval_toolkit.adversarial import ZeroWidthSpaceInjection
+    >>> _strategy_id_for(ZeroWidthSpaceInjection(ratio=0.5, seed=42))
+    'zero_width_space/ratio=0.5,seed=42'
+    """
+    fields = getattr(strategy, "__dataclass_fields__", None)
+    if fields is None:
+        return strategy.name
+    kw_pairs = sorted((f, getattr(strategy, f)) for f in fields if f != "name")
+    if not kw_pairs:
+        return strategy.name
+    return f"{strategy.name}/" + ",".join(f"{k}={v!r}" for k, v in kw_pairs)
+def _validate_scorer_output(scores: np.ndarray, *, expected_n: int, label: str) -> None:
+    """Validate the shape of a batched ``Scorer.predict_proba`` result.
+    Decision R7-C (Round 7 audit, Codex R7-F3): three failure modes Codex
+    surfaced via runtime probe — too many 1-D scores (silent truncation,
+    worst class), too few (later ``IndexError``), and matrix-shaped
+    (later ``TypeError`` when ``float(...)`` is applied to a row). All
+    three become a single API-level ``ValueError`` with context.
+    Style invariants 1 (no silent failures) + 3 (API-level errors, never
+    low-level exceptions through the boundary). Drives Decision R7-C.
+    Parameters
+    ----------
+    scores : np.ndarray
+        The ``np.asarray()``-wrapped result of ``scorer.predict_proba(...)``.
+    expected_n : int
+        The expected length — ``len(texts)`` for the current sweep call.
+    label : str
+        Context for the error message naming the offending batch
+        (e.g., ``"original-texts batch"`` or
+        ``"transformed-texts batch for strategy 'zero_width_space'"``).
+    Raises
+    ------
+    ValueError
+        If ``scores.shape != (expected_n,)``.
+    """
+    if scores.shape != (expected_n,):
+        raise ValueError(
+            f"sweep(): scorer.predict_proba({label}) returned shape "
+            f"{scores.shape}; expected ({expected_n},). The Scorer Protocol "
+            f"requires one float P(positive) per input row (see "
+            f"`eval_toolkit.protocols.Scorer`); ensure your adapter returns "
+            f"a 1-D array of length len(texts)."
+        )
+def _validate_unique_strategy_ids(strategies: Sequence[TextTransform]) -> None:
+    """Reject duplicate ``strategy_id`` values in a single ``sweep()`` call.
+    Decision R7-B (Round 7 audit, Codex R7-F2): mirrors R6-B's duplicate
+    ``MetricSpec.name`` rejection in ``scorecard()`` — same anti-silent-merge
+    invariant, applied to the sweep surface. No methodology-honest reason to
+    put the same configured strategy twice in one sweep; cache-warming +
+    reproducibility re-runs use ``strategy.transform()`` directly outside
+    ``sweep()``.
+    """
+    seen: dict[str, int] = {}
+    for i, strategy in enumerate(strategies):
+        sid = _strategy_id_for(strategy)
+        if sid in seen:
+            raise ValueError(
+                f"sweep(): duplicate strategy_id {sid!r} at index {i} "
+                f"(previously at index {seen[sid]}); each strategy must "
+                f"produce a unique strategy_id. If you want two configurations "
+                f"of the same dataclass in the same sweep, vary their kwargs "
+                f"so the canonical identifier differs."
+            )
+        seen[sid] = i
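
Reviewer note: the identity format and the duplicate check added in this file can be exercised without the library. The sketch below re-implements the same logic on a hypothetical `Delimit` dataclass; `Delimit`, `strategy_id`, and `check_unique` are illustrative names, not eval-toolkit API.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Delimit:
    """Hypothetical stand-in for a configured TextTransform strategy."""
    delimiter: str = "<<"
    end: str = ">>"
    name: str = "delimit"

    def transform(self, text: str) -> str:
        return f"{self.delimiter}{text}{self.end}"


def strategy_id(strategy) -> str:
    """Same format as the diff: '<name>/<k>=<repr(v)>,...' with sorted kwargs."""
    pairs = sorted((f.name, getattr(strategy, f.name))
                   for f in fields(strategy) if f.name != "name")
    if not pairs:
        return strategy.name
    return f"{strategy.name}/" + ",".join(f"{k}={v!r}" for k, v in pairs)


def check_unique(strategies) -> None:
    """Raise ValueError on the first repeated strategy_id."""
    seen: dict[str, int] = {}
    for i, s in enumerate(strategies):
        sid = strategy_id(s)
        if sid in seen:
            raise ValueError(f"duplicate strategy_id {sid!r} at index {i}")
        seen[sid] = i


a, b = Delimit(), Delimit(delimiter="[[", end="]]")
ids = [strategy_id(a), strategy_id(b)]  # same name, different configured identity
check_unique([a, b])                     # accepted: the ids differ
```

Both instances report `name == "delimit"`, so a `groupby("variant")` would merge them; grouping on the id keeps them apart, which is the motivation the docstring gives for the new column.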

{eval_toolkit-0.47.0 → eval_toolkit-0.48.0}/src/eval_toolkit/_version.py RENAMED

@@ -2,4 +2,4 @@
 __all__ = ["__version__"]
-__version__ = "0.47.0"
+__version__ = "0.48.0"

{eval_toolkit-0.47.0 → eval_toolkit-0.48.0}/src/eval_toolkit/adversarial.py RENAMED

@@ -1,4 +1,4 @@
-"""Adversarial robustness: character-injection bypass suite + Scorer-Protocol sweep.
+"""Adversarial robustness: 12-technique character-injection bypass suite.
 Implements the character-injection bypass techniques from Microsoft Research
 2024 ([1]_) for testing prompt-injection-detection scorers under adversarial
@@ -6,7 +6,7 @@ input perturbation. Each technique is deterministic given a ``seed`` and
 preserves the surface meaning of the text from a human reader's perspective
 while shifting the tokenizer / scorer's representation.
-Core techniques shipped in v0.43.0:
+Core techniques (shipped in v0.43.0):
 - :class:`ZeroWidthSpaceInjection` — insert U+200B zero-width spaces
 - :class:`HomoglyphSubstitution` — Latin → Cyrillic/Greek lookalikes
@@ -15,25 +15,24 @@ Core techniques shipped in v0.43.0:
 - :class:`CaseRandomization` — random case-flipping per character
 - :class:`PunctuationInjection` — non-semantic punctuation insertion
-The :func:`sweep` function applies a set of techniques against a
-:class:`~eval_toolkit.protocols.Scorer`-Protocol-compliant scorer and
-returns a DataFrame of
-``(text_id, technique, original_score, transformed_score, asr)``
-for adversarial robustness analysis. ASR (attack success rate) is the
-fraction of inputs where the scorer crossed the threshold from positive
-to negative under the transformation.
+Advanced techniques (shipped in v0.47 per Decision Q11.3):
-The six advanced techniques (bidi RTL override, tag stripping, synonym
-substitution, token splitting, Unicode normalization, invisible
-characters) are scheduled for v0.43.1 as a follow-up patch; the sweep
-API stabilizes in v0.43.0 so the v0.43.1 additions are pure extensions.
+- :class:`BidiRTLInjection` — U+202E…U+202C override block
+- :class:`TagStrippingInjection` — ``<…>`` tag removal (idempotent)
+- :class:`SynonymSubstitution` — whitelisted-word swap, seed-deterministic
+- :class:`TokenSplitting` — mid-word single-space insertion
+- :class:`UnicodeNormalization` — NFC / NFD / NFKC / NFKD form switch
+- :class:`InvisibleCharsInjection` — 5 invisible code points
-A module-level :data:`character_injection` namespace exposes the
-function-style API from the upstream issue spec:
+The convenience tuples :data:`CORE_TECHNIQUES` (6-tuple),
+:data:`ADVANCED_TECHNIQUES` (6-tuple), and :data:`ALL_TECHNIQUES`
+(12-tuple = core + advanced) enumerate the suite for sweep callers.
->>> from eval_toolkit.adversarial import character_injection
->>> character_injection.zero_width_space("hello")  # doctest: +SKIP
-'hello'
+Use the v0.47 top-level :func:`eval_toolkit.sweep` to apply any set of
+:class:`~eval_toolkit.TextTransform` strategies against a corpus (and
+optionally a :class:`~eval_toolkit.protocols.Scorer`); the v0.43–v0.46
+module-level ``sweep()`` function and the ``character_injection``
+``SimpleNamespace`` were removed at v0.47 (Decisions D + K + N).
 References
 ----------
@@ -469,6 +468,29 @@ class SynonymSubstitution:
         Random seed for determinism. Default ``42``.
     name : str, optional
         Override technique name. Default ``"synonym"``.
+    Notes
+    -----
+    The eligible-word set is the module-level ``_SYNONYMS`` dict, a fixed
+    6-entry whitelist hand-curated to preserve semantics:
+    - ``ignore`` → ``disregard``, ``overlook``
+    - ``instructions`` → ``directions``, ``guidance``
+    - ``system`` → ``framework``, ``platform``
+    - ``secret`` → ``private``, ``confidential``
+    - ``send`` → ``transmit``, ``forward``
+    - ``all`` → ``every``, ``all of``
+    Inputs containing none of those whitelist words are returned unchanged
+    — the transform is a no-op on such inputs. This is intentional: the
+    technique's invariant is "looks like the original," so the substitution
+    deliberately stays small. The trade-off is easy to be surprised by
+    when running ``SynonymSubstitution`` on a corpus that doesn't share
+    the prompt-injection vocabulary the whitelist was built from. If you
+    need broader substitution, the whitelist isn't extension-friendly
+    today — fork the dict at the module level, or treat
+    ``SynonymSubstitution`` as a reference implementation for your own
+    text-transform with a richer table.
     """
     ratio: float = 1.0
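
Reviewer note: the no-op behaviour the new `Notes` section warns about follows directly from any whitelist-driven substitution. The sketch below is not the library's implementation: it ignores `ratio` and case preservation, and its two-entry table only borrows two of the entries listed in the docstring.

```python
import random
import re

# Hypothetical two-entry whitelist; the real table lives in the library.
SYNONYMS = {"ignore": ["disregard", "overlook"], "secret": ["private", "confidential"]}


def substitute(text: str, seed: int = 42) -> str:
    """Swap whitelisted words for a seeded random synonym; leave the rest alone."""
    rng = random.Random(seed)

    def swap(match: re.Match) -> str:
        word = match.group(0)
        options = SYNONYMS.get(word.lower())
        return rng.choice(options) if options else word

    return re.sub(r"[A-Za-z]+", swap, text)


untouched = substitute("The weather is nice today.")  # no whitelist word: no-op
changed = substitute("ignore the secret")              # both whitelist words swapped
```

A quick way to detect the no-op case on a new corpus is to count how many rows satisfy `transformed == original` before interpreting robustness numbers for this technique.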

{eval_toolkit-0.47.0 → eval_toolkit-0.48.0}/src/eval_toolkit/bootstrap.py RENAMED

@@ -120,10 +120,29 @@ class BootstrapCI:
     method: str
     def to_dict(self) -> dict[str, object]:
-        """Serialize to a stable dict schema for JSON output."""
+        """Serialize to a stable, self-describing dict schema for JSON output.
+        v0.48 BREAKING (§5B): schema rewritten to drop the hard-coded
+        ``"ci_95"`` key that lied when ``confidence != 0.95``. The new
+        schema names the bounds neutrally and carries the actual
+        confidence level in a dedicated field; consumers can read
+        ``confidence`` to interpret the bound semantics.
+        Before v0.48:
+            {"point_estimate": p, "ci_95": [l, h], "confidence": 0.95,
+             "n_resamples": N, "method": "BCa"}
+        v0.48+:
+            {"point": p, "low": l, "high": h, "confidence": 0.95,
+             "n_resamples": N, "method": "BCa"}
+        Migration: rename ``point_estimate`` → ``point``; replace the
+        ``ci_95`` list-of-two with separate ``low`` + ``high`` keys.
+        """
         return {
-            "point_estimate": self.point_estimate,
-            "ci_95": [self.ci_low, self.ci_high],
+            "point": self.point_estimate,
+            "low": self.ci_low,
+            "high": self.ci_high,
             "confidence": self.confidence,
             "n_resamples": self.n_resamples,
             "method": self.method,
@@ -185,10 +204,24 @@ class PairedBootstrapCI:
     n_resamples: int
     def to_dict(self) -> dict[str, object]:
-        """Serialize to a stable dict schema for JSON output."""
+        """Serialize to a stable, self-describing dict schema for JSON output.
+        v0.48 BREAKING (§5B): same rewrite as :meth:`BootstrapCI.to_dict`.
+        ``"ci_95"`` is replaced by ``"low"`` + ``"high"``; ``"confidence"``
+        carries the actual level.
+        Before v0.48:
+            {"delta": d, "ci_95": [l, h], "overlaps_zero": b,
+             "confidence": 0.95, "n_resamples": N}
+        v0.48+:
+            {"delta": d, "low": l, "high": h, "overlaps_zero": b,
+             "confidence": 0.95, "n_resamples": N}
+        """
         return {
             "delta": self.delta,
-            "ci_95": [self.ci_low, self.ci_high],
+            "low": self.ci_low,
+            "high": self.ci_high,
             "overlaps_zero": self.overlaps_zero,
             "confidence": self.confidence,
             "n_resamples": self.n_resamples,
@@ -843,6 +876,21 @@ def paired_bootstrap_op_point_diff(
     .. [2] Bouckaert, R. R. "Choosing between two learning algorithms
            based on calibrated tests." ICML 2003.
     """
+    # Defensive identity-guard: the two-level bootstrap resamples val + test
+    # indices INDEPENDENTLY (see _paired_bootstrap_op_point_diff_step). Passing
+    # the same Python object for val and test causes ~63.2% overlap on each
+    # resample, violating the val/test independence assumption that lets the
+    # CI absorb threshold-selection variance honestly. Partition the data
+    # before calling — see docs/source/methodology/thresholds.md.
+    if val_y is test_y:
+        raise ValueError(
+            "paired_bootstrap_op_point_diff: val_y and test_y are the same array. "
+            "The two-level bootstrap requires DISJOINT val + test slices; the "
+            "resampler draws val_idx and test_idx independently, so identical "
+            "arrays cause ~63.2% overlap and violate the independence assumption. "
+            "Partition your data first (e.g., val = arr[:n//2], test = arr[n//2:])."
+        )
     val_y_arr = np.asarray(val_y)
     val_a, val_b = np.asarray(val_score_a), np.asarray(val_score_b)
     test_y_arr = np.asarray(test_y)
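
Reviewer note: the guard above compares with `is`, so it rejects the same Python object but cannot see an equal-valued copy. The sketch below shows that boundary with plain lists; `reject_same_object` is a hypothetical stand-in for the guard. It also computes 1 - (1 - 1/n)^n, the expected share of distinct rows in one size-n resample, which tends to 1 - 1/e ≈ 0.632; reading the comment's "~63.2%" as that quantity is an assumption.

```python
import math


def reject_same_object(val_y, test_y) -> None:
    """Identity guard in the style of the diff: only the same object is rejected."""
    if val_y is test_y:
        raise ValueError("val_y and test_y are the same array; partition the data first")


labels = [0, 1, 1, 0, 1, 0, 0, 1]
half = len(labels) // 2
val_y, test_y = labels[:half], labels[half:]   # disjoint slices: accepted
reject_same_object(val_y, test_y)

# An equal-valued copy is a different object, so an identity check cannot see it.
reject_same_object(labels, list(labels))

# Expected share of distinct rows hit by one size-n resample with replacement.
n = 1000
distinct_share = 1.0 - (1.0 - 1.0 / n) ** n    # approaches 1 - 1/e
limit = 1.0 - math.exp(-1.0)
```

Callers who build val and test from the same source therefore still need to partition explicitly; the guard only catches the most direct form of the mistake.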
@@ -1157,12 +1205,16 @@ def cv_clt_ci(
     Computes a confidence interval on the cross-validation mean metric
     that correctly accounts for fold-level dependence. The standard
-    "naive" CI (compute std-of-folds then divide by sqrt(K)) is anti-
-    conservative because the folds share training data; Bayle et al.
-    2020 prove a CV-CLT with a correction factor that gives valid
-    coverage asymptotically.
+    "naive" CI (compute std-of-folds then divide by sqrt(K)) had long
+    been suspected to be anti-conservative because the folds share
+    training data. Bayle et al. 2020 prove that the naive sample-variance
+    estimator (with ``ddof=1``) gives valid asymptotic coverage under
+    stability conditions, resolving the historical concern that fold
+    correlation makes it anti-conservative. No additional correction
+    factor is applied.
-    The corrected variance estimator (Bayle 2020 Theorem 3.1):
+    The variance estimator (Bayle 2020 Theorem 3.1) is just the standard
+    sample variance over per-fold metrics:
     .. math::
@@ -1233,9 +1285,9 @@ def cv_clt_ci(
         raise ValueError(f"confidence must be in (0, 1), got {confidence}")
     point = float(arr.mean())
-    # Bayle 2020 Theorem 3.1 variance: sample variance with (K-1) denom; the
-    # CV-CLT correction is captured in this estimator's asymptotic guarantee
-    # (no extra fold-correlation factor needed for a balanced K-fold CV).
+    # Bayle 2020 Theorem 3.1: the naive sample-variance estimator (ddof=1)
+    # gives valid asymptotic coverage under stability conditions — no extra
+    # correction factor is applied for fold correlation.
     sigma_hat = float(np.std(arr, ddof=1))
     z = _normal_quantile(0.5 + confidence / 2.0)
     margin = z * sigma_hat / np.sqrt(K)
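
Reviewer note: the arithmetic in this hunk (sample standard deviation with `ddof=1`, a two-sided normal quantile, and a `sqrt(K)` scaling) can be reproduced with the standard library alone. The function name `naive_cv_ci` and the fold values are made up for illustration; this is a sketch of the computation shown above, not a replacement for `cv_clt_ci`.

```python
from statistics import NormalDist, mean, stdev


def naive_cv_ci(fold_metrics: list[float], confidence: float = 0.95) -> tuple[float, float, float]:
    """Normal-approximation CI on the mean of K per-fold metrics (ddof=1)."""
    k = len(fold_metrics)
    if k < 2:
        raise ValueError("need at least 2 folds to estimate a variance")
    if not 0.0 < confidence < 1.0:
        raise ValueError(f"confidence must be in (0, 1), got {confidence}")
    point = mean(fold_metrics)
    sigma_hat = stdev(fold_metrics)                    # sample std, (K-1) denominator
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)   # two-sided quantile
    margin = z * sigma_hat / k ** 0.5
    return point, point - margin, point + margin


point, low, high = naive_cv_ci([0.80, 0.82, 0.78, 0.81, 0.79])
```

For these five folds the mean is 0.80 and the half-width is about 0.0139, so the interval is roughly (0.786, 0.814) at 95% confidence.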
@@ -1258,9 +1310,10 @@ def block_bootstrap_on_folds(
 ) -> BootstrapCI:
     r"""Block bootstrap on folds: resample K folds with replacement; percentile CI on mean.
-    Sibling primitive to :func:`cv_clt_ci`. Where :func:`cv_clt_ci` applies
-    the Bayle et al. 2020 CV-CLT correction (correct asymptotically under
-    fold exchangeability), the block bootstrap is more *conservative* under
+    Sibling primitive to :func:`cv_clt_ci`. Where :func:`cv_clt_ci` relies on
+    Bayle et al. 2020's CV-CLT — the naive sample-variance estimator gives
+    valid asymptotic coverage under stability + fold exchangeability — the
+    block bootstrap is more *conservative* under
     fold-level **non-exchangeability** — situations where the K folds are
     not interchangeable (e.g., source-disjoint LODO folds where one source
     is intrinsically harder than the others). The sensitivity-check

{eval_toolkit-0.47.0 → eval_toolkit-0.48.0}/src/eval_toolkit/calibration.py RENAMED

@@ -356,6 +356,20 @@ def bayes_optimal_threshold(π: float, c_fp: float, c_fn: float) -> float:
     .. math:: t^* = \frac{c_{FP} \cdot (1 - π)}{c_{FP} \cdot (1 - π) + c_{FN} \cdot π}
+    .. warning::
+        This formula assumes ``y_score`` is a calibrated probability with
+        respect to a **balanced prior** (or equivalently, a raw likelihood
+        ratio). If your scores are calibrated to the deployment prior (e.g.,
+        via :func:`fit_platt_binary` on a representative validation set), the
+        prior is already incorporated into the score and applying this
+        formula will **double-count it**. For deployment-prior-calibrated
+        scores, use the simpler prior-independent form
+        ``t* = c_fp / (c_fp + c_fn)`` (no ``prior`` kwarg) — that's literal
+        Elkan 2001 §4. The function in this file is the prior-corrected
+        variant for raw / balanced-prior scores; see the Examples for both
+        usage patterns.
     Parameters
     ----------
     π : float
@@ -396,6 +410,29 @@ def bayes_optimal_threshold(π: float, c_fp: float, c_fn: float) -> float:
     >>> bayes_optimal_threshold(1.0, c_fp=1.0, c_fn=1.0)
     0.0
+    **Two correct usages, side by side.** The choice depends on what your
+    ``y_score`` is calibrated to.
+    Usage A — raw or balanced-prior scores (use this function, pass ``π``):
+    >>> # Score from a model trained on a balanced (50/50) corpus, deployed
+    >>> # at a 1% positive prior, with FN cost 10× the FP cost.
+    >>> t_balanced = bayes_optimal_threshold(0.01, c_fp=1.0, c_fn=10.0)
+    >>> round(t_balanced, 4)
+    0.9083
+    Usage B — deployment-prior-calibrated scores (skip this function, use
+    the literal Elkan 2001 §4 prior-independent form):
+    >>> # Score already calibrated to the 1% deployment prior via
+    >>> # fit_platt_binary on a representative val slice — DO NOT pass π
+    >>> # to this function (you'd double-count it). Threshold the
+    >>> # already-prior-corrected probability against the cost ratio:
+    >>> c_fp, c_fn = 1.0, 10.0
+    >>> t_calibrated = c_fp / (c_fp + c_fn)
+    >>> round(t_calibrated, 4)
+    0.0909
     Notes
     -----
     Symmetric costs (c_fp == c_fn) collapse the formula to t* = 1 - π.
@@ -407,9 +444,10 @@ def bayes_optimal_threshold(π: float, c_fp: float, c_fn: float) -> float:
     *Bayes-calibrated* posterior P(y=1 | x). The formula implemented here
     is the **prior-corrected** form for thresholding raw scores at a known
     deployment prior π, which agrees with Elkan only under symmetric costs.
-    For our intended use (deployment prior + asymmetric costs) the
-    prior-corrected form is what the user wants — but the citation should
-    be read as "Elkan 2001 cost-sensitive framework", not literal §4.
+    For our intended use (deployment prior + asymmetric costs on raw /
+    balanced-prior scores) the prior-corrected form is what the user wants
+    — but the citation should be read as "Elkan 2001 cost-sensitive
+    framework", not literal §4.
     References
     ----------
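
Reviewer note: the two threshold forms contrasted in the new warning can be written side by side. The function names below are hypothetical; the formulas are the ones quoted in the docstring. The sketch also checks a relationship that follows from the algebra: at a balanced prior of 0.5 the prior-corrected form reduces to the prior-independent one. Neither function guards against a zero denominator.

```python
def prior_corrected_threshold(prior: float, c_fp: float, c_fn: float) -> float:
    """t* = c_fp(1 - prior) / (c_fp(1 - prior) + c_fn * prior), as in the docstring."""
    num = c_fp * (1.0 - prior)
    return num / (num + c_fn * prior)


def prior_independent_threshold(c_fp: float, c_fn: float) -> float:
    """t* = c_fp / (c_fp + c_fn), for scores already calibrated to the deployment prior."""
    return c_fp / (c_fp + c_fn)


t_a = prior_corrected_threshold(0.01, c_fp=1.0, c_fn=10.0)    # Usage A
t_b = prior_independent_threshold(c_fp=1.0, c_fn=10.0)         # Usage B
t_half = prior_corrected_threshold(0.5, c_fp=1.0, c_fn=10.0)   # balanced prior
```

The gap between `t_a` (about 0.908) and `t_b` (about 0.091) is the size of the error a caller would make by applying the prior correction to scores that already carry the deployment prior.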

{eval_toolkit-0.47.0 → eval_toolkit-0.48.0}/src/eval_toolkit/config.py RENAMED

@@ -89,7 +89,7 @@ def from_yaml[T](path: Path | str, cls: type[T]) -> T:
         import yaml  # noqa: PLC0415
     except ImportError as exc:
         raise ImportError(
-            "from_yaml requires pyyaml; install with `pip install eval-toolkit[yaml]`"
+            "from_yaml requires pyyaml. Install with: pip install eval-toolkit[yaml]"
         ) from exc
     if not is_dataclass(cls):

{eval_toolkit-0.47.0 → eval_toolkit-0.48.0}/src/eval_toolkit/embeddings.py RENAMED

@@ -85,7 +85,7 @@ def make_minilm_embedder(
     except ImportError as e:
         raise ImportError(
             "make_minilm_embedder requires sentence-transformers. "
-            "Install via: pip install eval-toolkit[embeddings]"
+            "Install with: pip install eval-toolkit[embeddings]"
         ) from e
     # sentence-transformers-active path: excluded from CI coverage
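
Reviewer note: the standardized message in the two hunks above follows the template `"<feature> requires <pkg>. Install with: pip install eval-toolkit[<extra>]"`. One way to keep call sites from drifting again is a shared helper; the `require` function below is hypothetical and does not appear in the diff.

```python
import importlib
from types import ModuleType


def require(module: str, *, feature: str, pkg: str, extra: str) -> ModuleType:
    """Import an optional dependency or raise with the canonical install hint."""
    try:
        return importlib.import_module(module)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires {pkg}. "
            f"Install with: pip install eval-toolkit[{extra}]"
        ) from exc


# A module that is always present imports normally through the helper.
json_mod = require("json", feature="demo", pkg="json", extra="none")
```

Chaining with `from exc` keeps the original import failure visible in the traceback, matching the `from exc` / `from e` pattern both hunks already use.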
