PyPI - eval-toolkit - Versions diffs - 1.0.5__tar.gz → 1.2.0__tar.gz - Mend

eval-toolkit 1.0.5__tar.gz → 1.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (191)

{eval_toolkit-1.0.5 → eval_toolkit-1.2.0}/.gitignore RENAMED

@@ -62,6 +62,15 @@ gemini-microaudit-*.md
 audit-gemini.md
 comprehensive-audit-codex.md
 audit-verification-*.md
+# v1.1.0 (#80 cycle): broaden codex-comprehensive-audit-* to cover
+# non-suffixed briefing variants (e.g., codex-comprehensive-audit-v0.50.0.md);
+# the earlier pattern only covered the *-report.md form.
+codex-comprehensive-audit-*.md
+# Local scratch directory: ad-hoc dogfood scripts, pre-tag validation
+# runs, draft PR bodies, audit prompts. Intentionally untracked.
+# Contents have historical value but are not part of any release.
+.scratch/
 # Claude Code project settings (machine-local)
 .claude/

{eval_toolkit-1.0.5 → eval_toolkit-1.2.0}/CHANGELOG.md RENAMED

@@ -5,6 +5,299 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [1.2.0] — 2026-05-26 — `audit_value_bindings` context-aware noise reduction (consumer-feedback follow-on to #80)
+Tier-1 ADDITIVE per [ADR 0003](docs/source/adr/0003-stability-contract-and-gate3-methodology.md).
+Consumer-feedback follow-on after v1.1.0's adoption at
+`prompt-injection-detection-submission@v1.3.11`. The v1.1.0
+slice-axis fix achieved 62% noise reduction (96 → 36 warnings) on
+the consumer's writeup; the residual 36 were positional-heuristic
+limitations [ADR 0005](docs/source/adr/0005-structured-keys-for-audit-validators.md)
+named as "Future work (deferred)" for v1.2.0+. This release
+addresses 81% of that residual (36 → 7) via four context-aware
+extensions to `scope="narrative"`. Combined with v1.1.0,
+**93% total noise reduction** vs the pre-fix v1.0.5 baseline.
+### Added — `audit_value_bindings.py` context-aware narrative filters
+All four filters activate ONLY when `scope="narrative"`. Legacy
+`scope="all"` callers see zero behavior change (Tier-1 ADDITIVE).
+No new public kwargs; no signature drift; the keyword lists are
+hardcoded module-level `frozenset` constants. Issue [#80](https://github.com/brandon-behring/eval-toolkit/issues/80)'s
+acceptance criterion was ≤5 warnings; v1.2.0 hits 7 (close to the
+target; the remaining 7 are pure cross-detector list-grammar cases
+that require parser-level work — see "Out of scope" below).
+- **T1: Delta-context filter.** Suppresses values that are
+  comparative magnitudes rather than binding claims. Two
+  sub-filters:
+  - Sign-prefix skip: values immediately preceded by `+` or `-`
+    (signed-magnitude markers like `-0.071 AUPRC`,
+    `+0.073 lift`) are dropped.
+  - Delta-keyword skip: values that start within 30 chars AFTER
+    a delta-marker token are dropped; the window looks only
+    backward from the value. This before-only window
+    prevents mis-firing on prose like `"frozen probe's 0.515
+    (delta -0.132)"` where the `"delta"` token refers to the
+    following `-0.132`, not the preceding `0.515`.
+  Keyword list (`_DELTA_KEYWORDS`, hardcoded frozenset):
+  `delta`, `drop`, `drops`, `lift`, `lifts`, `gap`, `margin`,
+  `regresses`, `improves`, `beats`, `exceeds`, `trails`,
+  `underperforms`, `vs`, `versus`, `below`. Excluded:
+  `against`, `above`, `ahead`, `behind` (too ambiguous; common
+  comparison prepositions in legitimate binding prose).
+- **T2: Floor-context filter.** Suppresses values near random-
+  baseline / floor mentions. Window is asymmetric (50 chars
+  before, 5 chars after) because floor mentions canonically
+  precede the value (`"random AUPRC is 0.374"`).
+  Keyword list (`_FLOOR_KEYWORDS`): `random`, `floor`, `chance`,
+  `trivial`. Intentionally narrow — `baseline`, `prior`,
+  `majority` excluded because they have legitimate non-floor
+  senses (`"TF-IDF baseline"`, `"prior work"`). Multi-word
+  patterns like `"below the prevalence baseline of 0.374"` are
+  caught by T1's `"below"` keyword instead.
+- **T3: Consume-on-match within sentence.** After a value
+  produces a Match for `(detector, metric, slice)`, subsequent
+  values for the same canonical binding in the same sentence are
+  suppressed. Catches dense multi-detector enumerations like
+  `"AUPRC 0.556 vs 0.519"` where the second value is implicitly
+  a contrasting detector's binding (cross-detector inference
+  remains out of scope per ADR 0005 A4).
+- **T4: Sentence-boundary detector-pair reject.** When pairing a
+  detector mention with a value, if a sentence terminator (`.`,
+  `!`, `?`, `\n\n`) lies between them, the pair is rejected.
+  Sentence detection uses paragraph-aware abbreviation guarding
+  (`vs.`, `e.g.`, `i.e.`, `c.f.`, `etc.`, `cf.`, `fig.`,
+  `eq.`, `pp.`, `viz.`, `ca.` excluded; decimal numbers and
+  letter-dot-letter patterns also guarded). Single `\n` is a
+  soft break (markdown line-wrap, NOT a sentence boundary);
+  `\n\n` is hard.
+### Internal changes (no public API impact)
+- `_nearest_canonical_key()` now returns `(key, position)`
+  instead of just `key`. The position is needed for T4's
+  sentence-boundary check. The slice-pairing call site unpacks
+  and discards the position. Private helper; no consumer impact.
+- New private helpers: `_is_sentence_terminator_dot`,
+  `_sentence_boundary_positions`, `_sentence_id_of`,
+  `_crosses_sentence_boundary`, `_is_signed_value`,
+  `_has_keyword_in_window`, `_compile_keyword_pattern`. All
+  underscore-prefixed; Tier-3 FREE.
+### Dogfood evidence
+| Configuration | Warnings on `prompt-injection-detection-submission` HEAD | Reduction vs v1.0.5 baseline |
+|---|---|---|
+| v1.0.5 (legacy 2-tuple) | 95 | — |
+| v1.1.0 BindingKey + scope='narrative' (content-type filter only) | 23 | 76% |
+| **v1.2.0 + context filters (this release)** | **7** | **93%** |
+The 7 v1.2.0 residuals are all cross-detector list constructions
+(e.g., `"0.293 versus 0.364 for the frozen probe and 0.291 for
+TF-IDF + LR"` where the validator can't infer that 0.364 / 0.291
+belong to ProtectAI-v1 and TF-IDF respectively because they're
+introduced by `"and"` / `"for"` without an immediately-preceding
+detector mention). These require true list-grammar parsing
+(rejected for v1.x in ADR 0005 A4) and are tracked for v1.3.0+
+with their own ADR design review.
+### Consumer adoption path
+`prompt-injection-detection-submission` and other consumers using
+`scope="narrative"` get the v1.2.0 filters automatically with no
+code change. Consumers on `scope="all"` (default) continue with
+v1.1.0 behavior. Recommended consumer migration:
+1. Re-pin `eval-toolkit>=1.2.0,<2` (additive; no consumer code
+   change required).
+2. HARD-gate promotion is now credible: 7 residual warnings is
+   below the actionable threshold; consumer can promote
+   `audit_value_bindings` from SOFT to HARD bundled with
+   `audit_citation_alignment` per the v1.3.8 plan.
+### Tests
+36 in `tests/test_audit_value_bindings.py` (28 from v1.1.0 + 8
+new for T1–T4 + sentence-boundary helper unit test). All pass.
+Public API snapshot regenerated for `__version__` bump only (no
+signature changes beyond an inspect-formatting normalization on
+the `validate_reader_value_bindings` `bindings` annotation; same
+type semantically).
+### Out of scope (deferred)
+- **Cross-detector list-grammar parsing** — the 7 residual
+  warnings. Requires lookahead context-aware list parsing
+  (`"X scored Y vs Z for W and V for U"`). Track as a v1.3.0+
+  candidate; needs ADR design before implementation.
+- **Markdown AST parsing** (ADR 0005 A4) — v2.0 territory.
+- **`extra_*_keywords` kwargs** for runtime extension of the
+  hardcoded keyword lists — YAGNI for now (consumer's prose is
+  covered); add in a v1.2.x patch if concrete demand emerges.
+## [1.1.0] — 2026-05-26 — `audit_value_bindings` slice-aware matching via `BindingKey` (closes #80)
+Tier-1 ADDITIVE per [ADR 0003](docs/source/adr/0003-stability-contract-and-gate3-methodology.md).
+Closes [#80](https://github.com/brandon-behring/eval-toolkit/issues/80)
+— consumer-feedback structural fix surfaced by
+`brandon-behring/prompt-injection-detection-prototype@v1.3.9` (96
+warnings, ~95 false positives) where the pre-v1.1 2-tuple
+`(detector, metric)` canonical-binding identity could not
+disambiguate the same `(detector, metric)` across multiple slices
+(`direct_validation`, `pooled_ood`, paired-delta cells,
+random-floor mentions).
+Pending [ADR 0005](docs/source/adr/0005-structured-keys-for-audit-validators.md)
+codifies the underlying rule: "structured keys over positional
+tuples for canonical-identity types in audit validators."
+### Added
+- **`BindingKey`** (new public class, exported via the lazy
+  `_EXPORTS` resolver). Frozen dataclass with fields
+  `(detector: str, metric: str, slice: str = "any")`. Forward-
+  extensible: future identity axes (split, ci_kind, source_ref, ...)
+  can be added as defaulted fields without breaking the dict-key
+  schema. Avoids the recur-every-N-months schema-event pattern that
+  produced #80.
+- **`validate_reader_value_bindings(bindings=...)`** now accepts
+  three input shapes, normalized internally to `dict[BindingKey,
+  float]` via a per-key `_normalize_binding_key` adapter:
+  1. Canonical: `BindingKey(detector=..., metric=..., slice=...)`
+     (recommended for new consumer code).
+  2. Sugar 3-tuple: `(detector, metric, slice)` (concise dict
+     literal; issue #80's proposed schema).
+  3. Legacy 2-tuple: `(detector, metric)` (preserved; treated as
+     `slice="any"`). All pre-v1.1 consumer code continues to work
+     unchanged.
+  Mixed key shapes in a single dict are supported. Invalid key
+  shapes raise `TypeError` at the function boundary (loud failure,
+  not silent zero-match drift).
+- **New optional kwargs** on `validate_reader_value_bindings`:
+  - `slice_aliases: Mapping[str, Sequence[str]] | None = None` —
+    canonical-slice-name → regex-alternatives mapping, mirroring
+    `detector_aliases` / `metric_aliases`. Used when at least one
+    `BindingKey` has `slice != "any"`.
+  - `slice_window_chars: int = 240` — character window for slice
+    disambiguation (≈ 50 tokens at ~5 chars/token).
+- **Slice disambiguation** in the matching loop. When a binding
+  key has `slice != "any"`, the candidate value is paired with
+  the nearest slice mention (same last-before-first-after rule as
+  detector pairing). If no slice mention falls within
+  `slice_window_chars`, the triple is suppressed (warn-only;
+  counted in `unmatched_slice_count`). If the paired slice
+  differs from the binding's slice, the triple is silently skipped
+  (handled by the binding for the correct slice on its own loop
+  iteration).
+- **`ValueBindingsReport.unmatched_slice_count: int = 0`** — new
+  field surfacing the warn-only signal. Default `0` means full
+  backward compatibility for code that constructs reports manually.
+- **`scope: Literal["all", "narrative"] = "all"`** — content-type
+  filter on the validator. Default `"all"` preserves legacy v1.0.x
+  behavior. Setting `scope="narrative"` excludes from matching:
+  - Markdown table rows (lines starting with `|`)
+  - Bracketed expressions `[...]` (CI bounds, ranges)
+  - Fenced code blocks (triple-backtick)
+  This addresses the broader category of false positives that the
+  slice-axis fix alone could not (CI bounds being matched as point
+  estimates, table-cell metrics being cross-flagged, code-block
+  literals being treated as claims). Compatible with the
+  motivating misbinding bug class (V1.3.1 ADR-080) which was in
+  narrative prose — no recall loss.
+  Combined dogfood result on `prompt-injection-detection-submission`
+  HEAD: **76% noise reduction** (95 warnings at v1.0.5 2-tuple
+  baseline → 23 with v1.1.0 BindingKey + `scope="narrative"`). The
+  remaining 23 are positional-heuristic limitations (random-floor
+  mis-attribution across sentence boundaries; cross-detector
+  pairing in dense prose) not addressable without parser-level
+  work (future v1.2.0+ scope).
+### Changed (Tier-1 ADDITIVE — no breaking changes)
+- `_nearest_detector_key()` private helper renamed to
+  `_nearest_canonical_key()` (used now for both detector and slice
+  pairing; the body was already generic). Implementation detail;
+  no public API impact.
+- `pyproject.toml` `Development Status` classifier bumped
+  `4 - Beta` → `5 - Production/Stable` (post-v1.0 hygiene).
+- `.gitignore` now covers `.scratch/` and the broader
+  `codex-comprehensive-audit-*.md` pattern (was previously only
+  `*-report.md`).
+### Deprecation policy
+All three `bindings` input shapes remain accepted **indefinitely**
+through the v1.x line. `BindingKey` is the canonical/recommended
+shape per ADR 0005 + docstring guidance; tuples remain valid
+syntactic sugar with no `DeprecationWarning`. Formal deprecation
+deferred to a future v2.0 cleanup pass when there is concrete
+payoff. Consumers can migrate slot-by-slot at their own pace, or
+not at all (legacy 2-tuple semantics survive as `slice="any"`).
+### Consumer adoption path
+`prompt-injection-detection-submission` consumer-side script at
+`scripts/audit_value_bindings.py` can adopt by either:
+- Replacing the 2-tuple `BINDINGS` literal with 3-tuple keys
+  (smallest diff; issue body's proposal), OR
+- Migrating to `BindingKey(...)` for forward-extensibility (new
+  identity axes won't require re-touching the script).
+### Validator design philosophy (introduced this release)
+This release introduces a two-layer correctness model for audit
+validators, formalized in pending ADR 0005:
+1. **Identity correctness** — canonical measurements have
+   structured identity (`BindingKey`), not positional tuples.
+   Future axes added as defaulted fields without breaking the
+   schema.
+2. **Scope correctness** — the validator should only scan content
+   that is plausibly a binding claim. Narrative prose is.
+   Markdown tables, CI brackets, and code blocks aren't.
+   `scope="narrative"` is the v1.1.0 implementation of this rule;
+   it matches the lint-design convention from `ruff`/`mypy`/`bandit`
+   (scope predicates aren't optional in production-quality
+   linters).
+Both layers are now in place. Consumers writing typical research
+writeup prose with dense statistical tables should adopt
+`scope="narrative"` for a ~80% noise reduction relative to v1.0.5.
+### Known v1.1.0 limitations (residual after slice + narrative fixes)
+The remaining ~20% of v1.0.5 baseline noise is positional-
+heuristic limitations that the slice + narrative fixes cannot
+address:
+1. **"Random floor" / sub-clause values across sentence
+   boundaries** — prose like "X scored 0.291. The pooled OOD
+   random floor is 0.374" pairs 0.374 with the nearest preceding
+   detector (X), because the validator doesn't treat `.` as a
+   pairing boundary.
+2. **Multi-detector cross-pairing in dense prose** — prose like
+   "LoRA scored 0.293 [0.286, 0.301] versus 0.364 for the frozen
+   probe and 0.291 for TF-IDF + LR" pairs each value with the
+   text-order-nearest detector, which over-credits the second
+   detector in a list construction.
+Future work (v1.2.0+ candidate): sentence-boundary awareness
+(respecting `.` / `\n` as pairing boundaries) + better
+multi-detector list parsing.
+HARD-gate promotion of the consumer-side validator at v1.3.10+
+becomes credible at the ~80% reduction level achieved here.
+Remaining false positives can be suppressed via consumer-side
+filtering (e.g., excluding lines containing "random floor" or
+"versus") or accepted as known low-frequency noise.
 ## [1.0.5] — 2026-05-26 — publish workflow hardening (infrastructure-only)
 Tier-3 / infrastructure-only release. **No library code or public API

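The T1 (delta-context) and T2 (floor-context) filters described in the 1.2.0 CHANGELOG entry above can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: the keyword lists and window sizes (30 chars before for delta keywords; 50 before and 5 after for floor keywords) are taken from the CHANGELOG text, while the function names, the `VALUE_RE` pattern, and the whole-word matching strategy are assumptions made for this example.

```python
import re

# Keyword lists copied from the CHANGELOG entry; all names below are hypothetical.
DELTA_KEYWORDS = frozenset({
    "delta", "drop", "drops", "lift", "lifts", "gap", "margin",
    "regresses", "improves", "beats", "exceeds", "trails",
    "underperforms", "vs", "versus", "below",
})
FLOOR_KEYWORDS = frozenset({"random", "floor", "chance", "trivial"})

# Assumption: a candidate value is a plain decimal such as 0.374.
VALUE_RE = re.compile(r"\d+\.\d+")


def compile_keywords(keywords):
    """Build a case-insensitive whole-word alternation for the keywords."""
    escaped = sorted(map(re.escape, keywords), key=len, reverse=True)
    return re.compile(rf"\b(?:{'|'.join(escaped)})\b", re.IGNORECASE)


DELTA_RE = compile_keywords(DELTA_KEYWORDS)
FLOOR_RE = compile_keywords(FLOOR_KEYWORDS)


def is_signed(text, start):
    """True when the value is immediately preceded by '+' or '-'."""
    return start > 0 and text[start - 1] in "+-"


def has_keyword(pattern, text, start, end, before, after):
    """True when the pattern matches in the window around text[start:end]."""
    before_text = text[max(0, start - before):start]
    after_text = text[end:end + after]
    return bool(pattern.search(before_text) or pattern.search(after_text))


def candidate_values(text):
    """Return the decimal values that survive the T1 and T2 sketches."""
    kept = []
    for match in VALUE_RE.finditer(text):
        start, end = match.span()
        if is_signed(text, start):
            continue  # T1 sign-prefix skip
        if has_keyword(DELTA_RE, text, start, end, before=30, after=0):
            continue  # T1 delta-keyword skip (backward-looking window only)
        if has_keyword(FLOOR_RE, text, start, end, before=50, after=5):
            continue  # T2 floor-context skip (asymmetric window)
        kept.append(match.group())
    return kept


print(candidate_values("frozen probe's 0.515 (delta -0.132)"))
```

On the CHANGELOG's own example, `0.515` is kept because `delta` follows it rather than preceding it, and `-0.132` is dropped by the sign-prefix check.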
{eval_toolkit-1.0.5 → eval_toolkit-1.2.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: eval-toolkit
-Version: 1.0.5
+Version: 1.2.0
 Summary: Reusable evaluation contracts for binary classification: metrics, bootstrap CIs, calibration, artifacts, and evidence gates.
 Project-URL: Homepage, https://github.com/brandon-behring/eval-toolkit
 Project-URL: Documentation, https://brandon-behring.github.io/eval-toolkit/
@@ -11,7 +11,7 @@ Author: Brandon Behring
 License-Expression: MIT
 License-File: LICENSE
 Keywords: binary-classification,bootstrap,calibration,evaluation,machine-learning,metrics
-Classifier: Development Status :: 4 - Beta
+Classifier: Development Status :: 5 - Production/Stable
 Classifier: Intended Audience :: Developers
 Classifier: Intended Audience :: Science/Research
 Classifier: License :: OSI Approved :: MIT License

{eval_toolkit-1.0.5 → eval_toolkit-1.2.0}/docs/source/adr/README.md RENAMED

@@ -76,3 +76,5 @@ What would have to change for this decision to be reopened?
 | [0001](0001-flat-module-layout.md) | Flat single-file modules through v1.x | Accepted | 2026-05-21 |
 | [0002](0002-scorecard-as-primary-metric-surface.md) | `scorecard()` as the primary v1.0 metric surface | Accepted | 2026-05-21 |
 | [0003](0003-stability-contract-and-gate3-methodology.md) | v1.0 stability contract + Gate 3 methodology | Accepted | 2026-05-21 |
+| [0004](0004-naming-conventions.md) | Naming conventions for modules, classes, and parameters | Accepted | 2026-05-23 |
+| [0005](0005-structured-keys-for-audit-validators.md) | Structured keys over positional tuples for canonical-identity types in audit validators | Accepted | 2026-05-26 |

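ADR 0005's rule ("structured keys over positional tuples") and the 1.1.0 CHANGELOG's description of `BindingKey` suggest a shape along the following lines. This is a hedged sketch only: the field names and the `slice="any"` default come from the CHANGELOG, and the three accepted key shapes and the `TypeError` on invalid keys are as described there, but the helper names `normalize_binding_key` and `normalize_bindings` and their bodies are hypothetical stand-ins for the private `_normalize_binding_key` adapter, whose real code is not shown in this diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BindingKey:
    """Structured identity for a canonical measurement (fields per the CHANGELOG)."""
    detector: str
    metric: str
    slice: str = "any"


def normalize_binding_key(key):
    """Map any of the three documented key shapes to a BindingKey.

    Hypothetical helper: accepts a BindingKey, a 3-tuple, or a legacy 2-tuple.
    """
    if isinstance(key, BindingKey):
        return key
    if (
        isinstance(key, tuple)
        and len(key) in (2, 3)
        and all(isinstance(part, str) for part in key)
    ):
        # A 2-tuple falls back to the default slice="any".
        return BindingKey(*key)
    raise TypeError(f"unsupported binding key shape: {key!r}")


def normalize_bindings(bindings):
    """Normalize a mixed-shape mapping to dict[BindingKey, float]."""
    return {normalize_binding_key(k): float(v) for k, v in bindings.items()}


mixed = {
    ("lora", "auprc"): 0.293,                   # legacy 2-tuple
    ("lora", "auprc", "pooled_ood"): 0.301,     # sugar 3-tuple
    BindingKey("tfidf", "auprc"): 0.291,        # canonical
}
print(normalize_bindings(mixed))
```

Because the dataclass is frozen, instances are hashable and can serve directly as dictionary keys; a future identity axis could be added as another defaulted field without changing existing call sites.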
{eval_toolkit-1.0.5 → eval_toolkit-1.2.0}/pyproject.toml RENAMED

@@ -21,7 +21,7 @@ keywords = [
     "binary-classification",
 ]
 classifiers = [
-    "Development Status :: 4 - Beta",
+    "Development Status :: 5 - Production/Stable",
     "Intended Audience :: Developers",
     "Intended Audience :: Science/Research",
     "License :: OSI Approved :: MIT License",

{eval_toolkit-1.0.5 → eval_toolkit-1.2.0}/src/eval_toolkit/__init__.py RENAMED

@@ -63,6 +63,9 @@ _EXPORTS: dict[str, str] = {
     # --- audit_value_bindings ---
     # Flat-module per ADR 0001. Closes #71. Motivated by consumer V1.3.1
     # ADR-080 audit-fix finding (TF-IDF / LoRA 0.974 value mis-binding).
+    # BindingKey + slice-aware matching added v1.1.0 (closes #80;
+    # consumer-feedback structural fix per pending ADR 0005).
+    "BindingKey": "eval_toolkit.audit_value_bindings",
     "Match": "eval_toolkit.audit_value_bindings",
     "ValueBindingsReport": "eval_toolkit.audit_value_bindings",
     "Violation": "eval_toolkit.audit_value_bindings",

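The hunk above only shows the `_EXPORTS` name-to-module mapping; the resolver that consumes it is not part of this diff. One common way to build such a lazy resolver is a module-level `__getattr__` (PEP 562), sketched below under that assumption. The `make_lazy_getattr` helper, the `demo_pkg` module, and the stdlib targets in the demo mapping are hypothetical and chosen only so the example runs on its own.

```python
import importlib
import types

# Demo mapping in the same name -> module-path shape as _EXPORTS,
# pointed at stdlib modules so the sketch is self-contained.
DEMO_EXPORTS = {
    "OrderedDict": "collections",
    "sqrt": "math",
}


def make_lazy_getattr(exports, namespace):
    """Return a PEP 562 style __getattr__ that imports names on first access."""

    def __getattr__(name):
        try:
            module_path = exports[name]
        except KeyError:
            raise AttributeError(f"module has no attribute {name!r}") from None
        value = getattr(importlib.import_module(module_path), name)
        namespace[name] = value  # cache so later lookups bypass __getattr__
        return value

    return __getattr__


# Stand-in for a package __init__ module.
demo_pkg = types.ModuleType("demo_pkg")
demo_pkg.__getattr__ = make_lazy_getattr(DEMO_EXPORTS, demo_pkg.__dict__)

print(demo_pkg.sqrt(9.0))
```

With this pattern, adding a public name such as `BindingKey` is a one-line change to the mapping, and the defining module is imported only when the name is first requested.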
{eval_toolkit-1.0.5 → eval_toolkit-1.2.0}/src/eval_toolkit/_version.py RENAMED

@@ -2,4 +2,4 @@
 __all__ = ["__version__"]
-__version__ = "1.0.5"
+__version__ = "1.2.0"
