PyPI - eval-toolkit - Versions diffs - 0.40.0__tar.gz → 0.41.0__tar.gz - Mend

eval-toolkit 0.40.0 (tar.gz) → 0.41.0 (tar.gz)

Files changed (161)

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/CHANGELOG.md RENAMED Viewed

@@ -7,6 +7,84 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [0.41.0] — 2026-05-18 — Croissant end-to-end (closes #42, v1.0 Gate 4 MET)
+Closes v1.0 readiness Gate 4 — "Croissant interop verified end-to-end."
+`HFDatasetsLoader.describe()` now fetches per-file `sha256` hashes
+from HF Hub and exposes them in `distribution[].sha256`. The
+integration test (`tests/test_croissant_e2e.py`) downloads a real
+parquet shard from `stanfordnlp/sst2` and verifies the bytes hash
+bit-exactly to the value `describe()` reports.
+### Added
+- **`HFDatasetsLoader.describe()` Croissant + tree-API enrichment.**
+  When `fetch_remote_metadata=True` (default), the loader fetches from
+  two HF Hub endpoints:
+  - `/api/datasets/{repo}/croissant` — JSON-LD metadata (name,
+    description, license, citeAs, schema).
+  - `/api/datasets/{repo}/tree/refs%2Fconvert%2Fparquet?recursive=true`
+    — per-file `sha256` (read from each file's `lfs.oid` field — the
+    git-LFS content hash, equal to `sha256sum` of the raw bytes).
+  Caller-provided fields (`name=`, `cite_as=`, etc.) win over
+  Croissant fetches; Croissant fills only gaps. Network failures
+  degrade gracefully (warning emitted; sha256 empty as in pre-v0.41).
+- **`fetch_remote_metadata: bool = True`** constructor field on
+  `HFDatasetsLoader`. Set `False` for offline / unit-test paths.
+- **`tests/test_croissant_e2e.py`** — 5 integration tests against
+  live HF Hub:
+  1. `describe()` returns real `sha256:<64-hex>` per shard.
+  2. **Bit-exact verification**: download shard from `contentUrl`,
+     hash bytes, assert equals `describe()`'s sha256. This is the
+     literal v1.0 Gate 4 check.
+  3. Croissant metadata enriches name/citeAs/license/description.
+  4. Caller overrides win over remote.
+  5. `fetch_remote_metadata=False` preserves v0.40 behavior.
+  All pass against `stanfordnlp/sst2` (~3 MB train shard).
+- **New `integration` pytest marker** for network-dependent tests.
+  Excluded from `make coverage` (PR CI); runs explicitly via
+  `pytest -m integration`.
+### Why dual-sourced
+HF Hub's Croissant emitter currently fills `distribution[].sha256`
+with a placeholder URL pointing at MLCommons Croissant spec issue
+[#80](https://github.com/mlcommons/croissant/issues/80) ("In
+<Download>, check SHA256 or MD5"), which is **open**. The Croissant
+spec doesn't yet require per-file checksums from emitters; HF Hub is
+honest and punts the field. The authoritative hash IS available via
+HF Hub's tree API: `lfs.oid` is precisely sha256 of the file content
+(verified bit-exact via `sha256sum`).
+When MLCommons #80 resolves and HF Hub starts populating Croissant
+`sha256` with real values (which will equal the existing `lfs.oid`),
+the loader's source switches in ~5 LOC. Same downstream contract.
+### Documentation
+- `docs/source/methodology/reproducibility.md` §"Croissant
+  interoperability": replaces v0.7-era "subset" framing with the
+  end-to-end-verified narrative + dual-source rationale.
+- `docs/source/roadmap.md` §"v1.0.0 path":
+  - **Gate 2 (Protocol stability) ✅ MET** — v0.41 = minor 2 of 2
+    without Protocol shape edits (v0.40 fit_*_binary additives +
+    v0.41 HFDatasetsLoader enrichment leave Tier-2 Protocols
+    untouched).
+  - **Gate 4 (Croissant end-to-end) ✅ MET** — with dual-source caveat
+    documented; one-line migration path when MLCommons #80 resolves.
+### v1.0 readiness state after v0.41.0
+- Gate 1 (real consumer ≥1 review cycle on v0.7+): partial — consumer
+  pinned to v0.34.0; needs bump + cycle. **External**.
+- Gate 2 ✅ MET (v0.41 is minor 2 of 2 stable).
+- Gate 3 (methodology peer review): not met — needs external reader.
+  **External**.
+- Gate 4 ✅ MET — see this release.
+Two of four gates closed in-repo. The remaining two require external
+coordination (consumer review cycle, methodology peer reviewer).
 ## [0.40.0] — 2026-05-18 — fit_platt_binary + fit_beta_binary (closes #43)
 Completes the binary scalar-prob calibrator family started in v0.35.0
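
The changelog entry above states that each tree-API `lfs.oid` equals the SHA-256 of the file's raw bytes. That relationship can be illustrated offline. This is a minimal sketch, not the package's code: the entry dict is synthetic (shaped like the tree-API fields named in the changelog), and `entry_sha256` and `matches` are hypothetical helper names.

```python
import hashlib


def entry_sha256(entry: dict) -> str:
    """Return 'sha256:<hex>' from a tree-API-style entry, or '' if absent."""
    lfs = entry.get("lfs")
    if isinstance(lfs, dict):
        oid = lfs.get("oid")
        # A SHA-256 digest is 64 hex characters; anything else is ignored.
        if isinstance(oid, str) and len(oid) == 64:
            return f"sha256:{oid}"
    return ""


def matches(data: bytes, reported: str) -> bool:
    """True when the bytes hash to the reported 'sha256:<hex>' value."""
    return reported == f"sha256:{hashlib.sha256(data).hexdigest()}"


# Synthetic stand-in for one parquet shard and its tree-API entry.
shard_bytes = b"not a real parquet file"
fake_entry = {
    "type": "file",
    "path": "default/train/0000.parquet",  # hypothetical path
    "size": len(shard_bytes),
    "lfs": {"oid": hashlib.sha256(shard_bytes).hexdigest()},
}

reported = entry_sha256(fake_entry)
ok = matches(shard_bytes, reported)             # bytes agree with the entry
tampered = matches(shard_bytes + b"x", reported)  # one extra byte breaks it
```

In the real integration test the bytes come from the shard's `contentUrl`; here they are fabricated so the check runs without network access.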

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/PKG-INFO RENAMED Viewed

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: eval-toolkit
-Version: 0.40.0
+Version: 0.41.0
 Summary: Reusable evaluation contracts for binary classification: metrics, bootstrap CIs, calibration, artifacts, and evidence gates.
 Project-URL: Homepage, https://github.com/brandon-behring/eval-toolkit
 Project-URL: Documentation, https://brandon-behring.github.io/eval-toolkit/

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/pyproject.toml RENAMED Viewed

@@ -188,6 +188,7 @@ markers = [
     "slow: Tests > 2s (bootstrap-t studentized, multi-seed K-fold). Opt out with `pytest -m 'not slow'`.",
     "monte_carlo: Monte Carlo calibration suite (~14 min). Skipped in PR CI; runs only in the nightly-mc workflow via `-m monte_carlo`.",
     "benchmark: pytest-benchmark perf-regression tests on math kernels. Skipped in PR CI; runs in the nightly-benchmarks workflow via `-m benchmark`. Per v0.29.0 plan Tier γ #1.",
+    "integration: Network-dependent integration tests (HF Hub API, Croissant endpoints, etc.). Excluded from PR CI to avoid network-flake; runs in nightly. Opt in via `-m integration`. Added v0.41.0 (#42 Croissant Gate 4).",
 ]
 [tool.coverage.run]

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/_version.py RENAMED Viewed

@@ -2,4 +2,4 @@
 __all__ = ["__version__"]
-__version__ = "0.40.0"
+__version__ = "0.41.0"

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/loaders.py RENAMED Viewed

@@ -29,7 +29,9 @@ References
 from __future__ import annotations
 import glob as _glob
+import json as _json
 import logging
+import urllib.request as _urlrequest
 from collections.abc import Mapping, Sequence
 from dataclasses import dataclass
 from pathlib import Path
@@ -38,6 +40,24 @@ from typing import TYPE_CHECKING, Any, Protocol, cast, runtime_checkable
 from eval_toolkit.harness import EvalSlice
 from eval_toolkit.provenance import file_sha256
+_HF_HUB_BASE = "https://huggingface.co"
+_HF_FETCH_TIMEOUT_SEC = 15
+def _hf_get_json(path: str) -> Any:
+    """GET ``https://huggingface.co{path}`` and return parsed JSON.
+    Stdlib-only (no ``requests`` / ``huggingface_hub`` dep). Raises
+    ``OSError`` (urllib error) or ``ValueError`` (JSON decode) on
+    failure — callers catch both. The 15-second timeout caps any one
+    fetch so CI doesn't hang on a slow HF Hub.
+    """
+    url = f"{_HF_HUB_BASE}{path}"
+    req = _urlrequest.Request(url, headers={"User-Agent": "eval-toolkit"})
+    with _urlrequest.urlopen(req, timeout=_HF_FETCH_TIMEOUT_SEC) as resp:
+        return _json.loads(resp.read().decode("utf-8"))
 _logger = logging.getLogger(__name__)
 if TYPE_CHECKING:
@@ -367,10 +387,22 @@ class HFDatasetsLoader:
     raised at :meth:`load_splits` time with a clear install hint. This is
     intentional — eval-toolkit's core deps are numpy / scipy / sklearn only.
+    Since v0.41.0, :meth:`describe` enriches its output with per-file
+    ``sha256`` hashes fetched from the HF Hub tree API (the ``lfs.oid``
+    field), plus Croissant metadata fetched from HF Hub's Croissant
+    endpoint. The dual-source design is documented in
+    ``methodology/reproducibility.md`` §"Croissant interoperability";
+    in short: HF Hub's Croissant emitter currently punts the
+    ``distribution[].sha256`` field per MLCommons Croissant issue #80
+    (open), so we read the authoritative sha256 from the tree API's
+    ``lfs.oid`` instead. When #80 resolves and HF Hub starts populating
+    Croissant ``sha256`` with real values, the implementation collapses
+    to a single source.
     Parameters
     ----------
     repo_id : str
-        HuggingFace dataset repo, e.g. ``"deepset/prompt-injections"``.
+        HuggingFace dataset repo, e.g. ``"stanfordnlp/sst2"``.
     splits : sequence of str or None, optional
         Subset of HF splits to load. ``None`` = every split the repo defines.
     feature_col : str, optional
@@ -381,7 +413,12 @@ class HFDatasetsLoader:
     config_name : str or None, optional
         HF dataset config name (some datasets have multiple configs).
     name, description, cite_as, license, url : str, optional
-        Croissant metadata fields.
+        Croissant metadata overrides. If empty, :meth:`describe` will
+        fall back to fetching from HF Hub's Croissant endpoint.
+    fetch_remote_metadata : bool, optional
+        If ``True`` (default), :meth:`describe` fetches Croissant + tree
+        metadata from HF Hub. Set ``False`` to disable network calls
+        (useful for offline / unit testing).
     """
     repo_id: str
@@ -395,6 +432,7 @@ class HFDatasetsLoader:
     cite_as: str = ""
     license: str = ""
     url: str = ""
+    fetch_remote_metadata: bool = True
     def _load_dataset(self) -> Mapping[str, Any]:
         """Soft-import ``datasets`` and return the loaded DatasetDict.
@@ -447,20 +485,118 @@ class HFDatasetsLoader:
         return out
     def describe(self) -> dict[str, object]:
-        """Croissant-subset metadata pointing at the HF repo (no file hashes — HF caches)."""
-        return {
-            "name": self.name or self.repo_id,
-            "description": self.description,
-            "citeAs": self.cite_as,
-            "license": self.license,
-            "url": self.url or f"https://huggingface.co/datasets/{self.repo_id}",
-            "distribution": [
+        """Croissant-compatible metadata + per-file sha256 from HF Hub.
+        When ``fetch_remote_metadata=True`` (default), enriches the
+        baseline metadata with two HF Hub API fetches:
+        - **Croissant endpoint** (``/api/datasets/{repo}/croissant``) —
+          provides ``name``, ``description``, ``citeAs``, ``license``,
+          ``url`` defaults when the loader's fields are empty.
+        - **Tree API** (``/api/datasets/{repo}/tree/...?recursive=true``) —
+          provides per-file ``sha256`` (from ``lfs.oid``) and
+          ``contentSize`` for each parquet shard under the
+          ``refs/convert/parquet`` branch.
+        Network failures degrade gracefully (warning emitted; sha256
+        empty as in pre-v0.41 behavior). See class docstring for the
+        dual-source rationale (MLCommons Croissant issue #80).
+        """
+        remote_meta: dict[str, object] = {}
+        distribution: list[dict[str, object]] = []
+        if self.fetch_remote_metadata:
+            remote_meta = self._fetch_croissant_metadata_safe()
+            distribution = self._fetch_tree_distribution_safe()
+        # Caller-provided fields win; Croissant fills gaps.
+        def _pick(local: str, key: str) -> str:
+            if local:
+                return local
+            val = remote_meta.get(key)
+            return val if isinstance(val, str) else ""
+        if not distribution:
+            distribution = [
                 {
                     "name": f"hf:{self.repo_id}",
                     "contentUrl": f"https://huggingface.co/datasets/{self.repo_id}",
-                    "sha256": "",  # HF cache hash not exposed via the public API
+                    "sha256": "",
                     "contentSize": 0,
                 }
-            ],
+            ]
+        return {
+            "name": _pick(self.name, "name") or self.repo_id,
+            "description": _pick(self.description, "description"),
+            "citeAs": _pick(self.cite_as, "citeAs"),
+            "license": _pick(self.license, "license"),
+            "url": self.url or f"https://huggingface.co/datasets/{self.repo_id}",
+            "distribution": distribution,
             "config_name": self.config_name,
         }
+    def _fetch_croissant_metadata_safe(self) -> dict[str, object]:
+        """Fetch HF Hub Croissant JSON-LD; return empty dict on any failure."""
+        try:
+            data = _hf_get_json(f"/api/datasets/{self.repo_id}/croissant")
+            return data if isinstance(data, dict) else {}
+        except (OSError, ValueError) as exc:  # urllib.error.URLError, json.JSONDecodeError, etc.
+            _logger.warning(
+                "HFDatasetsLoader %s: Croissant fetch failed (%s); proceeding without",
+                self.repo_id,
+                exc,
+            )
+            return {}
+    def _fetch_tree_distribution_safe(self) -> list[dict[str, object]]:
+        """Fetch HF Hub tree API for the parquet-convert branch; return ``cr:FileObject`` entries.
+        Each entry carries ``sha256`` (from ``lfs.oid`` — the git-LFS
+        content hash, equal to ``sha256sum`` of the file content) and
+        ``contentSize`` (from the tree response's ``size`` field).
+        Falls back to an empty list on any failure — callers should
+        treat empty distribution as "no remote provenance available."
+        """
+        # HF stores native parquet (or auto-converts) under
+        # refs/convert/parquet; that's the canonical hash target.
+        path = f"/api/datasets/{self.repo_id}/tree/refs%2Fconvert%2Fparquet?recursive=true"
+        try:
+            entries = _hf_get_json(path)
+        except (OSError, ValueError) as exc:
+            _logger.warning(
+                "HFDatasetsLoader %s: tree-API fetch failed (%s); sha256 unavailable",
+                self.repo_id,
+                exc,
+            )
+            return []
+        if not isinstance(entries, list):
+            return []
+        out: list[dict[str, object]] = []
+        for entry in entries:
+            if not isinstance(entry, dict):
+                continue
+            if entry.get("type") != "file":
+                continue
+            path_val = entry.get("path", "")
+            if not isinstance(path_val, str) or not path_val.endswith(".parquet"):
+                continue
+            lfs = entry.get("lfs")
+            sha = ""
+            if isinstance(lfs, dict):
+                oid = lfs.get("oid")
+                if isinstance(oid, str) and len(oid) == 64:  # sha256 hex
+                    sha = f"sha256:{oid}"
+            size = entry.get("size", 0)
+            out.append(
+                {
+                    "name": path_val,
+                    "contentUrl": (
+                        f"https://huggingface.co/datasets/{self.repo_id}"
+                        f"/resolve/refs%2Fconvert%2Fparquet/{path_val}"
+                    ),
+                    "sha256": sha,
+                    "contentSize": int(size) if isinstance(size, (int, float)) else 0,
+                }
+            )
+        return out
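
The precedence rule in `describe()` (caller-provided fields win, Croissant fills gaps, `name` falls back to the repo id) can be shown in isolation. The sketch below is illustrative only: `merge_metadata` is a hypothetical function, and it assumes both inputs use the Croissant-style keys (`citeAs` rather than `cite_as`) to keep the example short.

```python
from __future__ import annotations


def merge_metadata(
    local: dict[str, str],
    remote: dict[str, object],
    repo_id: str,
) -> dict[str, str]:
    """Merge caller fields with remote metadata; caller values take priority."""

    def pick(key: str) -> str:
        value = local.get(key, "")
        if value:
            return value  # a non-empty caller value always wins
        candidate = remote.get(key)
        # Non-string remote values are dropped rather than coerced.
        return candidate if isinstance(candidate, str) else ""

    return {
        "name": pick("name") or repo_id,  # last-resort fallback
        "description": pick("description"),
        "citeAs": pick("citeAs"),
        "license": pick("license"),
    }


# Made-up remote payload for illustration; not real HF Hub output.
remote = {"name": "sst2", "license": "unknown", "description": ["not", "a", "string"]}
merged = merge_metadata({"name": "my-custom-name"}, remote, "stanfordnlp/sst2")
offline = merge_metadata({}, {}, "stanfordnlp/sst2")
```

With an empty `remote` dict, which is what a failed or disabled fetch yields, the result reduces to the caller's values plus the repo-id fallback.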

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/golden/public_api/snapshot.json RENAMED Viewed

@@ -553,7 +553,7 @@
       ],
       "doc_first_line": "Load a HuggingFace ``datasets`` repo as ``{split: EvalSlice}``.",
       "kind": "class",
-      "signature": "(repo_id: 'str', splits: 'Sequence[str] | None' = None, feature_col: 'str' = 'text', label_col: 'str' = 'label', strata_col: 'str | None' = None, config_name: 'str | None' = None, name: 'str' = '', description: 'str' = '', cite_as: 'str' = '', license: 'str' = '', url: 'str' = '') -> None"
+      "signature": "(repo_id: 'str', splits: 'Sequence[str] | None' = None, feature_col: 'str' = 'text', label_col: 'str' = 'label', strata_col: 'str | None' = None, config_name: 'str | None' = None, name: 'str' = '', description: 'str' = '', cite_as: 'str' = '', license: 'str' = '', url: 'str' = '', fetch_remote_metadata: 'bool' = True) -> None"
     },
     "HoldoutSplitter": {
       "bases": [
@@ -1036,7 +1036,7 @@
       "doc_first_line": "str(object='') -> str",
       "kind": "value",
       "type": "str",
-      "value": "'0.40.0'"
+      "value": "'0.41.0'"
     },
     "apply_operating_points": {
       "doc_first_line": "Apply fitted thresholds to a mixed-class or single-class target slice.",

eval_toolkit-0.41.0/tests/test_croissant_e2e.py ADDED Viewed

@@ -0,0 +1,145 @@
+"""End-to-end Croissant interop verification (v0.41.0, closes #42, v1.0 Gate 4).
+Verifies that ``HFDatasetsLoader.describe()`` returns per-file ``sha256``
+hashes that match the actual bytes of the underlying parquet shards on
+HF Hub.
+Background on the dual-source design (Croissant + tree API):
+- HF Hub's Croissant emitter (``/api/datasets/{repo}/croissant``) ships
+  metadata (name, license, citation, schema) but **does not** populate
+  per-file ``distribution[].sha256`` — instead, the field carries a
+  placeholder URL pointing at the MLCommons Croissant issue tracking
+  the eventual checksum addition (issue #80, open).
+- HF Hub's tree API (``/api/datasets/{repo}/tree/...``) exposes
+  ``lfs.oid`` per file: a 64-hex sha256 of the raw file content.
+- ``HFDatasetsLoader.describe()`` reads sha256 from the tree API today,
+  and will pick up Croissant's eventual sha256 with a one-line change
+  when #80 resolves (same downstream contract; same hash format).
+Tests are marked ``@pytest.mark.integration`` — network-dependent;
+excluded from PR CI via ``-m "not integration"`` in ``make coverage``.
+Run explicitly via ``pytest -m integration`` (nightly or local dev).
+"""
+from __future__ import annotations
+import hashlib
+import urllib.request
+from typing import Any
+import pytest
+from eval_toolkit.loaders import HFDatasetsLoader
+# Small public Croissant-compliant dataset. ~50 KB test split (1 parquet
+# shard). Pinned via repo_id; HF retains revisions, so even if the dataset
+# is updated the test only fails if HF re-shards (rare for popular
+# datasets) — which is a real signal we want to catch in nightly.
+_TEST_REPO_ID = "stanfordnlp/sst2"
+def _download_and_hash(url: str) -> str:
+    """GET the URL, return ``sha256:<hex>`` of the body."""
+    req = urllib.request.Request(url, headers={"User-Agent": "eval-toolkit-test"})
+    with urllib.request.urlopen(req, timeout=60) as resp:
+        body = resp.read()
+    return f"sha256:{hashlib.sha256(body).hexdigest()}"
+@pytest.mark.integration
+def test_hfdatasets_describe_returns_real_sha256_from_tree_api() -> None:
+    """``describe()`` populates per-file sha256 from HF Hub's tree API.
+    Closes the infrastructure half of v1.0 Gate 4: prove the loader can
+    surface authoritative file hashes from HF Hub.
+    """
+    loader = HFDatasetsLoader(repo_id=_TEST_REPO_ID)
+    desc = loader.describe()
+    distribution = desc["distribution"]
+    assert isinstance(distribution, list)
+    assert distribution, "expected at least one parquet shard in distribution[]"
+    # Every entry should have a real sha256 (64 hex chars after the prefix).
+    for entry in distribution:
+        sha = entry["sha256"]
+        assert isinstance(sha, str)
+        assert sha.startswith("sha256:"), f"unexpected hash format: {sha!r}"
+        hex_part = sha.removeprefix("sha256:")
+        assert len(hex_part) == 64, f"expected 64-hex sha256, got {len(hex_part)}"
+        assert all(c in "0123456789abcdef" for c in hex_part), f"non-hex: {hex_part!r}"
+@pytest.mark.integration
+def test_hfdatasets_describe_sha256_matches_actual_file_bytes() -> None:
+    """End-to-end Gate 4 verification: hash a downloaded shard, assert match.
+    For each shard in ``describe()['distribution']``, fetch the raw
+    parquet bytes from ``contentUrl`` and verify ``sha256(bytes) ==
+    entry['sha256']``. This proves the source-of-truth chain:
+    HF Hub tree API → ``describe()`` → real file content.
+    """
+    loader = HFDatasetsLoader(repo_id=_TEST_REPO_ID)
+    desc = loader.describe()
+    distribution = desc["distribution"]
+    assert isinstance(distribution, list)
+    # Only verify the first shard to keep CI cost bounded (sst2 train is
+    # ~3 MB; we just need one matched pair to prove the contract).
+    entry: dict[str, Any] = distribution[0]
+    content_url = entry["contentUrl"]
+    expected_sha = entry["sha256"]
+    assert content_url and expected_sha
+    actual_sha = _download_and_hash(content_url)
+    assert actual_sha == expected_sha, (
+        f"sha256 mismatch for {entry['name']!r}: "
+        f"describe() reported {expected_sha}, actual file hashed to {actual_sha}"
+    )
+@pytest.mark.integration
+def test_hfdatasets_describe_returns_croissant_metadata() -> None:
+    """``describe()`` enriches with Croissant metadata (name, license, citeAs).
+    Even though Croissant's ``distribution[].sha256`` is unusable today
+    (placeholder URL per MLCommons #80), the metadata fields are valid
+    and should pass through to ``describe()`` output.
+    """
+    loader = HFDatasetsLoader(repo_id=_TEST_REPO_ID)
+    desc = loader.describe()
+    # Either Croissant provided a non-empty name or we fell back to repo_id.
+    name = desc["name"]
+    assert isinstance(name, str) and name
+@pytest.mark.integration
+def test_hfdatasets_caller_overrides_win() -> None:
+    """Caller-provided fields override Croissant fetches.
+    Explicit ``name=...`` / ``cite_as=...`` are not overwritten by
+    remote metadata even when ``fetch_remote_metadata=True``.
+    """
+    loader = HFDatasetsLoader(
+        repo_id=_TEST_REPO_ID,
+        name="my-custom-name",
+        cite_as="my-citation",
+    )
+    desc = loader.describe()
+    assert desc["name"] == "my-custom-name"
+    assert desc["citeAs"] == "my-citation"
+@pytest.mark.integration
+def test_hfdatasets_fetch_remote_metadata_disabled_skips_network() -> None:
+    """``fetch_remote_metadata=False`` produces the v0.40-era empty-sha256 output."""
+    loader = HFDatasetsLoader(
+        repo_id=_TEST_REPO_ID,
+        fetch_remote_metadata=False,
+    )
+    desc = loader.describe()
+    distribution = desc["distribution"]
+    assert isinstance(distribution, list)
+    assert len(distribution) == 1
+    assert distribution[0]["sha256"] == ""
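
`_download_and_hash` above reads the whole response body into memory, which is fine for a small shard. For larger shards, one possible variation is to hash in fixed-size chunks. This is a sketch under that assumption, not part of the package: `hash_stream` and its `chunk_size` default are hypothetical.

```python
import hashlib
import io
from typing import BinaryIO


def hash_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return 'sha256:<hex>' of a binary stream, reading chunk_size bytes at a time."""
    digest = hashlib.sha256()
    while chunk := stream.read(chunk_size):
        digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"


# In-memory stand-in for a downloaded shard (about 3 MB of filler bytes).
payload = b"a" * 3_000_000
chunked = hash_stream(io.BytesIO(payload), chunk_size=4096)
whole = f"sha256:{hashlib.sha256(payload).hexdigest()}"
```

The object returned by `urllib.request.urlopen` supports `read(n)`, so it could be passed to a function like this in place of the `io.BytesIO` stand-in.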

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_loaders_coverage.py RENAMED Viewed

@@ -156,8 +156,12 @@ def test_hf_datasets_loader_subset_splits() -> None:
 @pytest.mark.unit
 def test_hf_datasets_loader_describe_uses_url_or_default() -> None:
-    """Describe(): url falls back to huggingface.co/datasets/<repo_id>."""
-    loader = HFDatasetsLoader(repo_id="dummy/example")
+    """Describe(): url falls back to huggingface.co/datasets/<repo_id>.
+    ``fetch_remote_metadata=False`` keeps this a unit test (no network).
+    The network-enabled path is exercised in test_croissant_e2e.py.
+    """
+    loader = HFDatasetsLoader(repo_id="dummy/example", fetch_remote_metadata=False)
     out = loader.describe()
     assert out["url"] == "https://huggingface.co/datasets/dummy/example"
     assert out["distribution"][0]["sha256"] == ""
@@ -165,6 +169,10 @@ def test_hf_datasets_loader_describe_uses_url_or_default() -> None:
 @pytest.mark.unit
 def test_hf_datasets_loader_describe_with_explicit_url() -> None:
-    loader = HFDatasetsLoader(repo_id="dummy/example", url="https://example.com")
+    loader = HFDatasetsLoader(
+        repo_id="dummy/example",
+        url="https://example.com",
+        fetch_remote_metadata=False,
+    )
     out = loader.describe()
     assert out["url"] == "https://example.com"

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/.gitignore RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/LICENSE RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/README.md RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/STYLE.md RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/docs/archive/README.md RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/docs/research/README.md RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/docs/research/datasets/README.md RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/docs/research/papers/data-integrity/README.md RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/docs/research/papers/eval-ecosystem/README.md RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/docs/research/papers/inference/README.md RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/docs/research/papers/prompt-injection/README.md RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/docs/source/methodology/README.md RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/__init__.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/__main__.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/_deprecated.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/_parallel.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/analysis.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/artifacts.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/bootstrap.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/calibration.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/claims.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/config.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/docs.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/embeddings.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/evidence.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/harness.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/leakage.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/manifest.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/metrics.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/operating_points.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/paths.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/plotting.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/protocols.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/provenance.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/py.typed RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/schemas/manifest.v1.json RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/schemas/manifest.v2.json RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/schemas/manifest.v3.json RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/schemas/results.v1.json RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/schemas/results_full.v1.json RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/seeds.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/splits.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/text_dedup.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/src/eval_toolkit/thresholds.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/baseline/test_plotting_visual/plot_bootstrap_distribution.png RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/baseline/test_plotting_visual/plot_confusion_matrix_grid.png RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/baseline/test_plotting_visual/plot_lift_ci.png RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/baseline/test_plotting_visual/plot_metric_bars.png RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/baseline/test_plotting_visual/plot_pareto_frontier.png RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/baseline/test_plotting_visual/plot_pr_curve.png RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/baseline/test_plotting_visual/plot_reliability_diagram.png RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/baseline/test_plotting_visual/plot_roc_curve.png RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/baseline/test_plotting_visual/plot_score_histograms.png RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/baseline/test_plotting_visual/plot_slice_metric_heatmap.png RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/benchmarks/__init__.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/benchmarks/test_kernel_benchmarks.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/conftest.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/golden/bootstrap_ci/cases.json RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/golden/data/dedup_holdout.jsonl RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/golden/data/dedup_holdout_expected.json RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/golden/data/dedup_holdout_provenance.md RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/golden/docs/expected.md RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/golden/docs/input.md RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/golden/docs/metrics.json RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/golden/test_dedup_holdout_calibration.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/strategies.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_analysis.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_artifacts.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_block_bootstrap_on_folds.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_bootstrap_calibration_mc.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_bootstrap_edge_cases.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_bootstrap_golden.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_bootstrap_njobs.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_bootstrap_props.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_bootstrap_research_grounded.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_bootstrap_unit.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_calibration_binary_adapters.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_calibration_bootstrap_chain.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_calibration_determinism.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_calibration_optimization_failures.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_calibration_props.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_calibration_research_grounded.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_calibration_unit.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_claims.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_claims_coverage.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_claims_props.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_cli.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_config.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_coverage_bootstrap.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_coverage_calibration.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_coverage_harness.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_coverage_metrics.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_coverage_plotting.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_dedup_split_leakage_chain.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_deprecations.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_docs_golden.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_docs_props.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_embeddings.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_evidence_validators.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_harness_edge_cases.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_harness_fault_injection.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_harness_folded.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_harness_internals.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_harness_metric_options.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_harness_parallelism.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_harness_smoke.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_import_boundaries.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_is_metric_defined_for_slice.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_leakage.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_leakage_error_paths.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_leakage_props.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_loaders.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_loaders_props.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_logging.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_manifest.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_manifest_contamination_round_trip.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_manifest_props.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_manifest_validation.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_metrics_props.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_metrics_stratified_subsets.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_metrics_unit.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_misc_coverage.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_numeric_edge_cases.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_operating_points.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_operating_points_props.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_parallel.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_paths.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_pipeline_e2e.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_plotting_edge.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_plotting_smoke.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_plotting_visual.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_protocol_conformance.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_provenance.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_public_api.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_recall_at_fpr.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_reference_equivalence.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_reproducibility_integration.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_schemas.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_seeds.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_splits.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_splits_leakage_integration.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_splits_props.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_text_dedup.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_text_dedup_coverage.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_text_dedup_props.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_text_dedup_strategies.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_thresholds.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_thresholds_constant_score.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_thresholds_coverage.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_thresholds_props.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_thresholds_research_grounded.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_tokenization_leakage_check.py RENAMED Viewed

File without changes

{eval_toolkit-0.40.0 → eval_toolkit-0.41.0}/tests/test_v09_contracts.py RENAMED Viewed

File without changes
