PyPI - proofbundle - Versions diffs - 0.7.0__tar.gz → 0.8.0__tar.gz - Mend

proofbundle 0.7.0.tar.gz → 0.8.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (44)

{proofbundle-0.7.0/src/proofbundle.egg-info → proofbundle-0.8.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: proofbundle
-Version: 0.7.0
+Version: 0.8.0
 Summary: Emit and verify portable cryptographic evidence bundles, offline: Ed25519 + RFC 6962 Merkle + optional SD-JWT.
 Author: Konrad Gruszka
 License: MIT
@@ -28,7 +28,7 @@ Provides-Extra: eval
 Requires-Dist: rfc8785>=0.1.4; extra == "eval"
 Provides-Extra: adapters
 Provides-Extra: inspect
-Requires-Dist: inspect_ai<0.4,>=0.3.100; extra == "inspect"
+Requires-Dist: inspect_ai<0.4,>=0.3.100; python_version >= "3.10" and extra == "inspect"
 Provides-Extra: dev
 Requires-Dist: pytest>=7; extra == "dev"
 Requires-Dist: ruff>=0.5; extra == "dev"
@@ -38,7 +38,7 @@ Requires-Dist: build>=1; extra == "dev"
 Requires-Dist: hypothesis>=6; extra == "dev"
 Requires-Dist: rfc8785>=0.1.4; extra == "dev"
 Requires-Dist: sd-jwt>=0.10; extra == "dev"
-Requires-Dist: inspect_ai<0.4,>=0.3.100; extra == "dev"
+Requires-Dist: inspect_ai<0.4,>=0.3.100; python_version >= "3.10" and extra == "dev"
 Dynamic: license-file
 <div align="center">
@@ -62,14 +62,15 @@ selectively disclosable credential. Pure Python, no server, no daemon, one JSON
 [![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
 [![SLSA build provenance](https://img.shields.io/badge/SLSA-build_provenance-D6248A.svg)](https://slsa.dev)
 [![PyPI attestations](https://img.shields.io/badge/PyPI-attestations_(PEP_740)-D6248A.svg)](https://pypi.org/project/proofbundle/)
-<!-- DOI badge placeholder: Zenodo is linked and archives each release. Add the Zenodo concept-DOI badge
-     here (and the DOI to CITATION.cff) once Zenodo assigns it — it does not exist at build time. -->
+<!-- DOI badge placeholder: enable Zenodo archiving for this repo, then add the Zenodo concept-DOI
+     badge here (and the DOI to CITATION.cff) once Zenodo assigns one on the next release. No DOI has
+     been assigned yet (no archived record exists at build time) — tracked in the human checklist. -->
 </div>
 **At a glance:** `proofbundle emit` signs and anchors a payload; `proofbundle
 verify` checks one self-contained `bundle.json` with three offline cryptographic
-checks → `OK` or `FAILED`. No network, no daemon, no own crypto. 63 tests.
+checks → `OK` or `FAILED`. No network, no daemon, no own crypto. 74 tests.
 ## Contents
@@ -78,6 +79,7 @@ checks → `OK` or `FAILED`. No network, no daemon, no own crypto. 63 tests.
 - [How it fits together](#how-it-fits-together)
 - [Install](#install)
 - [Quickstart](#quickstart)
+- [Demo](#demo--a-real-eval-log-to-a-verified-receipt-offline)
 - [Interoperability](#interoperability)
 - [Bundle format](#bundle-format-proofbundlev01)
 - [Eval receipts](#eval-receipts)
@@ -208,6 +210,21 @@ from proofbundle import verify_consistency
 verify_consistency(first_size, second_size, proof, first_root, second_root)  # -> bool
 ```
+## Demo — a real eval log to a verified receipt, offline
+```bash
+pip install "proofbundle[eval,inspect]"
+make demo          # or: bash scripts/demo.sh
+```
+`make demo` runs end-to-end with **no network, no API key, no GPU**: it takes genuine eval logs — an
+inspect_ai `mockllm/model` `.eval` log and an lm-evaluation-harness `--model dummy` `results.json`
+(committed under `tests/fixtures/`, generated offline) — turns each into a signed, Merkle-anchored
+proofbundle receipt, and verifies it to `=> OK`. The scores are random (a dummy model); the point is
+that the *artifact* is signed and offline-verifiable, with model and dataset kept as salted commitments.
+See [`examples/inspect_receipt.py`](examples/inspect_receipt.py) and
+[`examples/lm_eval_receipt.py`](examples/lm_eval_receipt.py).
 ## Interoperability
 proofbundle uses the same RFC 6962 / RFC 9162 Merkle primitive as
@@ -284,15 +301,29 @@ proofbundle show-eval receipt.json       # verify + print the claim (issuer-boun
 ```
 The claim format is specified in [EVAL_CLAIM.md](EVAL_CLAIM.md); the emit path uses
-RFC 8785 JCS canonicalization, the verify path stays dependency-free. **Honest scope:**
-a receipt proves `passed` against `threshold` and hides the model/dataset via salted
-commitments — it does **not** prove the evaluation was well designed or that the score
-itself is correct. Those are human judgements; what it removes is the need to simply
-trust the number.
+RFC 8785 JCS canonicalization, the verify path stays dependency-free.
+**Honesty guardrail (the exact scope).** A receipt attests the **authenticity and integrity** of a
+*claimed* result and its context — these exact bytes, signed by this key, anchored under this root, with
+model/dataset kept as salted commitments. It does **not** attest the **correctness of the computation**,
+and it cannot detect **cherry-picking** of the eval. Whether the eval was well designed, whether the
+suite measures what it claims, and whether the number was computed honestly are separate questions.
+Trusted-execution approaches such as [Attestable Audits](https://arxiv.org/abs/2506.23706) target
+computation-correctness with a different (hardware) trust model; proofbundle is the lightweight,
+hardware-free path to a portable, tamper-evident, selectively disclosable *result artifact*.
+**How this differs from a bare hash or a TEE.** A plain SHA-256 of a log commits to bytes but carries no
+signature, no tamper-evident anchor, and no selective disclosure (an attestation-exporter idea along
+those lines,
+[inspect_evals PR #1610](https://github.com/UKGovernmentBEIS/inspect_evals/pull/1610), was closed as
+belonging *a layer above* the framework — which is exactly where proofbundle sits). A TEE proves the
+computation ran untampered but needs specific hardware. proofbundle adds Ed25519 + RFC 6962 Merkle +
+SD-JWT selective disclosure over one portable file, offline.
 ### A verification layer for trustworthy eval logs
-The UK AISI inspect_ai team names an open gap ([arXiv:2507.06893](https://arxiv.org/abs/2507.06893)):
+The maintainers of inspect_evals (Arcadia Impact, funded by the UK AI Safety Institute) name an open
+gap ([arXiv:2507.06893](https://arxiv.org/abs/2507.06893)):
 a database of trustworthy evaluation results with proper provenance tracking. proofbundle is the
 missing **signature + selective-disclosure layer** for exactly that — complementary to metadata
 aggregation (Every Eval Ever) and documentation taxonomies (Eval Factsheets), not a competitor.
@@ -326,8 +357,10 @@ attestation — see [SECURITY.md](SECURITY.md).
 - **v0.5** — inspect_ai adapter (stable API), in-toto Statement v1 view, SD-JWT **issuance** (RFC 9901).
 - **v0.6** — a second eval adapter (lm-evaluation-harness, real format + provenance), INTEROP.md,
   CITATION.cff, PEP 740 attestations documented.
-- **v0.7 (current release)** — citability polish: ORCID in CITATION.cff, a Zenodo DOI placeholder
-  (assigned on release), and a draft in-toto ML-eval predicate proposal.
+- **v0.7** — citability polish (ORCID, Zenodo DOI placeholder, in-toto proposal draft); v0.7.1 hardened
+  verifier robustness + CI on Python 3.9 after a holistic review.
+- **v0.8 (current release)** — an offline `make demo` (real eval log -> signed receipt -> verified),
+  a sharpened honesty guardrail (authenticity/integrity, not computation-correctness), and outreach drafts.
 - **Deferred** (explicitly not yet built) — SD-JWT VC conformance + `vct` metadata,
   Key-Binding JWT, status lists / revocation, an official in-toto PR, DSSE / a full in-toto client.

{proofbundle-0.7.0 → proofbundle-0.8.0}/README.md RENAMED

@@ -19,14 +19,15 @@ selectively disclosable credential. Pure Python, no server, no daemon, one JSON
 [![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
 [![SLSA build provenance](https://img.shields.io/badge/SLSA-build_provenance-D6248A.svg)](https://slsa.dev)
 [![PyPI attestations](https://img.shields.io/badge/PyPI-attestations_(PEP_740)-D6248A.svg)](https://pypi.org/project/proofbundle/)
-<!-- DOI badge placeholder: Zenodo is linked and archives each release. Add the Zenodo concept-DOI badge
-     here (and the DOI to CITATION.cff) once Zenodo assigns it — it does not exist at build time. -->
+<!-- DOI badge placeholder: enable Zenodo archiving for this repo, then add the Zenodo concept-DOI
+     badge here (and the DOI to CITATION.cff) once Zenodo assigns one on the next release. No DOI has
+     been assigned yet (no archived record exists at build time) — tracked in the human checklist. -->
 </div>
 **At a glance:** `proofbundle emit` signs and anchors a payload; `proofbundle
 verify` checks one self-contained `bundle.json` with three offline cryptographic
-checks → `OK` or `FAILED`. No network, no daemon, no own crypto. 63 tests.
+checks → `OK` or `FAILED`. No network, no daemon, no own crypto. 74 tests.
 ## Contents
@@ -35,6 +36,7 @@ checks → `OK` or `FAILED`. No network, no daemon, no own crypto. 63 tests.
 - [How it fits together](#how-it-fits-together)
 - [Install](#install)
 - [Quickstart](#quickstart)
+- [Demo](#demo--a-real-eval-log-to-a-verified-receipt-offline)
 - [Interoperability](#interoperability)
 - [Bundle format](#bundle-format-proofbundlev01)
 - [Eval receipts](#eval-receipts)
@@ -165,6 +167,21 @@ from proofbundle import verify_consistency
 verify_consistency(first_size, second_size, proof, first_root, second_root)  # -> bool
 ```
+## Demo — a real eval log to a verified receipt, offline
+```bash
+pip install "proofbundle[eval,inspect]"
+make demo          # or: bash scripts/demo.sh
+```
+`make demo` runs end-to-end with **no network, no API key, no GPU**: it takes genuine eval logs — an
+inspect_ai `mockllm/model` `.eval` log and an lm-evaluation-harness `--model dummy` `results.json`
+(committed under `tests/fixtures/`, generated offline) — turns each into a signed, Merkle-anchored
+proofbundle receipt, and verifies it to `=> OK`. The scores are random (a dummy model); the point is
+that the *artifact* is signed and offline-verifiable, with model and dataset kept as salted commitments.
+See [`examples/inspect_receipt.py`](examples/inspect_receipt.py) and
+[`examples/lm_eval_receipt.py`](examples/lm_eval_receipt.py).
 ## Interoperability
 proofbundle uses the same RFC 6962 / RFC 9162 Merkle primitive as
@@ -241,15 +258,29 @@ proofbundle show-eval receipt.json       # verify + print the claim (issuer-boun
 ```
 The claim format is specified in [EVAL_CLAIM.md](EVAL_CLAIM.md); the emit path uses
-RFC 8785 JCS canonicalization, the verify path stays dependency-free. **Honest scope:**
-a receipt proves `passed` against `threshold` and hides the model/dataset via salted
-commitments — it does **not** prove the evaluation was well designed or that the score
-itself is correct. Those are human judgements; what it removes is the need to simply
-trust the number.
+RFC 8785 JCS canonicalization, the verify path stays dependency-free.
+**Honesty guardrail (the exact scope).** A receipt attests the **authenticity and integrity** of a
+*claimed* result and its context — these exact bytes, signed by this key, anchored under this root, with
+model/dataset kept as salted commitments. It does **not** attest the **correctness of the computation**,
+and it cannot detect **cherry-picking** of the eval. Whether the eval was well designed, whether the
+suite measures what it claims, and whether the number was computed honestly are separate questions.
+Trusted-execution approaches such as [Attestable Audits](https://arxiv.org/abs/2506.23706) target
+computation-correctness with a different (hardware) trust model; proofbundle is the lightweight,
+hardware-free path to a portable, tamper-evident, selectively disclosable *result artifact*.
+**How this differs from a bare hash or a TEE.** A plain SHA-256 of a log commits to bytes but carries no
+signature, no tamper-evident anchor, and no selective disclosure (an attestation-exporter idea along
+those lines,
+[inspect_evals PR #1610](https://github.com/UKGovernmentBEIS/inspect_evals/pull/1610), was closed as
+belonging *a layer above* the framework — which is exactly where proofbundle sits). A TEE proves the
+computation ran untampered but needs specific hardware. proofbundle adds Ed25519 + RFC 6962 Merkle +
+SD-JWT selective disclosure over one portable file, offline.
 ### A verification layer for trustworthy eval logs
-The UK AISI inspect_ai team names an open gap ([arXiv:2507.06893](https://arxiv.org/abs/2507.06893)):
+The maintainers of inspect_evals (Arcadia Impact, funded by the UK AI Safety Institute) name an open
+gap ([arXiv:2507.06893](https://arxiv.org/abs/2507.06893)):
 a database of trustworthy evaluation results with proper provenance tracking. proofbundle is the
 missing **signature + selective-disclosure layer** for exactly that — complementary to metadata
 aggregation (Every Eval Ever) and documentation taxonomies (Eval Factsheets), not a competitor.
@@ -283,8 +314,10 @@ attestation — see [SECURITY.md](SECURITY.md).
 - **v0.5** — inspect_ai adapter (stable API), in-toto Statement v1 view, SD-JWT **issuance** (RFC 9901).
 - **v0.6** — a second eval adapter (lm-evaluation-harness, real format + provenance), INTEROP.md,
   CITATION.cff, PEP 740 attestations documented.
-- **v0.7 (current release)** — citability polish: ORCID in CITATION.cff, a Zenodo DOI placeholder
-  (assigned on release), and a draft in-toto ML-eval predicate proposal.
+- **v0.7** — citability polish (ORCID, Zenodo DOI placeholder, in-toto proposal draft); v0.7.1 hardened
+  verifier robustness + CI on Python 3.9 after a holistic review.
+- **v0.8 (current release)** — an offline `make demo` (real eval log -> signed receipt -> verified),
+  a sharpened honesty guardrail (authenticity/integrity, not computation-correctness), and outreach drafts.
 - **Deferred** (explicitly not yet built) — SD-JWT VC conformance + `vct` metadata,
   Key-Binding JWT, status lists / revocation, an official in-toto PR, DSSE / a full in-toto client.

{proofbundle-0.7.0 → proofbundle-0.8.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "proofbundle"
-version = "0.7.0"
+version = "0.8.0"
 description = "Emit and verify portable cryptographic evidence bundles, offline: Ed25519 + RFC 6962 Merkle + optional SD-JWT."
 readme = "README.md"
 requires-python = ">=3.9"
@@ -47,10 +47,11 @@ eval = ["rfc8785>=0.1.4"]
 adapters = []
 # The inspect_ai adapter uses the STABLE read_eval_log API (lazy import). Pinned with an UPPER bound:
 # the .eval format + pydantic schema change between versions (inspect_ai issue 834), and the fixture
-# test is bound to this range. `pip install "proofbundle[inspect]"`.
-inspect = ["inspect_ai>=0.3.100,<0.4"]
+# test is bound to this range. inspect_ai requires Python >= 3.10, so the marker gates it out on 3.9
+# (base + [eval]/[sdjwt] still work on 3.9; the inspect adapter test skips there). Fixes the red 3.9 CI.
+inspect = ['inspect_ai>=0.3.100,<0.4; python_version >= "3.10"']
 dev = ["pytest>=7", "ruff>=0.5", "jsonschema>=4", "mypy>=1.8", "build>=1", "hypothesis>=6",
-       "rfc8785>=0.1.4", "sd-jwt>=0.10", "inspect_ai>=0.3.100,<0.4"]
+       "rfc8785>=0.1.4", "sd-jwt>=0.10", 'inspect_ai>=0.3.100,<0.4; python_version >= "3.10"']
 [project.urls]
 Homepage = "https://b7n0de.com"
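
Note on the `python_version >= "3.10"` marker above: environment markers compare `python_version` as a version, not as a string, which is why 3.9 is correctly treated as older than 3.10. A minimal stdlib-only sketch of that distinction follows; `marker_applies` is a hypothetical helper written for illustration and is not part of proofbundle or of any packaging library.

```python
import sys


def marker_applies(version_info=sys.version_info) -> bool:
    """Hypothetical stand-in for evaluating: python_version >= "3.10"."""
    # Compare (major, minor) as integers, the way version-aware marker
    # evaluation does; a plain string comparison would get 3.9 wrong.
    return tuple(version_info[:2]) >= (3, 10)


# Version-aware comparison: 3.9 is older than 3.10, so the extra is skipped.
print(marker_applies((3, 9)))    # False
print(marker_applies((3, 12)))   # True
# Naive string comparison orders "3.9" after "3.10", which would be wrong here.
print("3.9" >= "3.10")           # True
```

On an interpreter where the marker is false, installing `proofbundle[inspect]` simply omits inspect_ai, which matches the comment in the hunk about the adapter test skipping on 3.9.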

{proofbundle-0.7.0 → proofbundle-0.8.0}/src/proofbundle/__init__.py RENAMED

@@ -13,7 +13,7 @@ from .emit import emit_bundle, generate_signer
 from .errors import Check, ProofBundleError, VerificationResult
 from .merkle import verify_consistency, verify_inclusion
-__version__ = "0.7.0"
+__version__ = "0.8.0"
 __all__ = [
     "__version__",

{proofbundle-0.7.0 → proofbundle-0.8.0}/src/proofbundle/adapters/inspect_ai.py RENAMED

@@ -57,9 +57,23 @@ def from_inspect_ai_log(path, metric: str, *, comparator: str, threshold: str, t
     model_id = str(getattr(ev, "model", "unknown"))
     dataset = getattr(ev, "dataset", None)
     dataset_id = str(getattr(dataset, "name", None) or suite)
+    # Provenance parity with the lm-eval adapter: inspect_ai exposes the same run provenance for free.
+    provenance = {"harness": "inspect_ai"}
+    revision = getattr(ev, "revision", None)
+    commit = getattr(revision, "commit", None)
+    if commit:
+        provenance["git_hash"] = str(commit)
+    packages = getattr(ev, "packages", None) or {}
+    if isinstance(packages, dict) and packages.get("inspect_ai"):
+        provenance["harness_version"] = str(packages["inspect_ai"])
+    tv = getattr(ev, "task_version", None)
+    if tv is not None:
+        provenance["task_version"] = str(tv)
     return build_eval_claim(
         suite=suite, suite_version=str(getattr(ev, "task_version", "1")),
         metric=metric, comparator=comparator, threshold=threshold, score=repr(value),
         n=int(getattr(results, "total_samples", 0) or 0),
         model_id=model_id, dataset_id=dataset_id, issuer="", timestamp=timestamp,
-        model_salt=model_salt, dataset_salt=dataset_salt)
+        provenance=provenance, model_salt=model_salt, dataset_salt=dataset_salt)
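
The provenance block added in this hunk reads every field defensively, so a log that lacks a revision, a package map, or a task version still yields a valid (smaller) provenance dict. The sketch below replays that logic in isolation. `collect_provenance` is a hypothetical wrapper name, and the `SimpleNamespace` objects are stand-ins for whatever object the adapter actually receives; only the attribute names (`revision.commit`, `packages`, `task_version`) come from the diff.

```python
from types import SimpleNamespace


def collect_provenance(ev) -> dict:
    """Gather optional run provenance, skipping anything absent or empty."""
    provenance = {"harness": "inspect_ai"}
    # Nested getattr: a missing `revision` and a missing `commit` both give None.
    commit = getattr(getattr(ev, "revision", None), "commit", None)
    if commit:
        provenance["git_hash"] = str(commit)
    packages = getattr(ev, "packages", None) or {}
    if isinstance(packages, dict) and packages.get("inspect_ai"):
        provenance["harness_version"] = str(packages["inspect_ai"])
    # `is not None` (rather than truthiness) keeps a legitimate version of 0.
    task_version = getattr(ev, "task_version", None)
    if task_version is not None:
        provenance["task_version"] = str(task_version)
    return provenance


# Hypothetical stand-ins: one fully populated spec, one with nothing set.
full = SimpleNamespace(
    revision=SimpleNamespace(commit="abc1234"),
    packages={"inspect_ai": "0.3.100"},
    task_version=0,
)
bare = SimpleNamespace()

print(collect_provenance(full))
print(collect_provenance(bare))
```

The bare case returns only the `harness` key, which is the behaviour the hunk relies on when a log carries no git or package metadata.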

{proofbundle-0.7.0 → proofbundle-0.8.0}/src/proofbundle/bundle.py RENAMED

@@ -12,7 +12,11 @@ checks, fully offline and without any running log server:
 The verifier treats ``payload`` as opaque bytes: it proves *that these exact
 bytes were signed and anchored*, not what they mean. That keeps v0.1 small and
 correct. Turning a reproducible eval run into such a payload is the job of the
-emitter (see ``emit.py``, roadmap).
+eval-receipt emitter (see :mod:`proofbundle.evalclaim`, since v0.4).
+Malformed input (wrong types, missing or unknown fields) is rejected with a
+``BundleFormatError`` — never a raw traceback — so a caller gets the documented
+malformed exit code, not a crash.
 """
 from __future__ import annotations
@@ -30,6 +34,13 @@ __all__ = ["SCHEMA", "verify_bundle", "load_bundle"]
 SCHEMA = "proofbundle/v0.1"
+# Allowed keys per object — SPEC.md §3: a verifier MUST reject unknown fields (schema is
+# additionalProperties: false). Enforced here so the code matches its own normative spec.
+_TOP_KEYS = {"schema", "payload_b64", "signature", "merkle", "sd_jwt_vc"}
+_SIG_KEYS = {"alg", "public_key_b64", "sig_b64"}
+_MERKLE_KEYS = {"hash_alg", "leaf_index", "tree_size", "inclusion_proof_b64", "root_b64"}
+_SD_KEYS = {"compact", "issuer_public_key_b64"}
 def _b64d(value: str, field: str) -> bytes:
     try:
@@ -44,6 +55,27 @@ def _require(obj: dict, key: str, field: str):
     return obj[key]
+def _require_dict(obj, field: str) -> dict:
+    """The value must be a JSON object — a string/list/number is malformed, not a crash."""
+    if not isinstance(obj, dict):
+        raise BundleFormatError(f"field {field} must be a JSON object")
+    return obj
+def _require_int(obj: dict, key: str, field: str) -> int:
+    """The value must be a JSON integer — reject floats (SPEC §2) and non-numeric strings/None."""
+    val = _require(obj, key, field)
+    if isinstance(val, bool) or not isinstance(val, int):   # bool is an int subclass; a float/str/None is not
+        raise BundleFormatError(f"field {field} must be an integer, got {type(val).__name__}")
+    return val
+def _reject_unknown(obj: dict, allowed: set, field: str) -> None:
+    extra = set(obj) - allowed
+    if extra:
+        raise BundleFormatError(f"unknown field(s) in {field}: {sorted(extra)}")
 def load_bundle(path: str) -> dict:
     """Read and JSON-parse a bundle file."""
     with open(path, "r", encoding="utf-8") as handle:
@@ -60,12 +92,14 @@ def verify_bundle(bundle: Union[dict, str]) -> VerificationResult:
     schema = bundle.get("schema")
     if schema != SCHEMA:
         raise UnsupportedError(f"unsupported schema {schema!r}, expected {SCHEMA!r}")
+    _reject_unknown(bundle, _TOP_KEYS, "bundle")
     result = VerificationResult()
     payload = _b64d(_require(bundle, "payload_b64", "payload_b64"), "payload_b64")
     # 1. signature over the payload
-    sig = _require(bundle, "signature", "signature")
+    sig = _require_dict(_require(bundle, "signature", "signature"), "signature")
+    _reject_unknown(sig, _SIG_KEYS, "signature")
     alg = sig.get("alg")
     if alg != "ed25519":
         raise UnsupportedError(f"signature alg {alg!r} not supported in v0.1")
@@ -75,13 +109,17 @@ def verify_bundle(bundle: Union[dict, str]) -> VerificationResult:
     result.add("ed25519-signature", sig_ok, "payload signed by stated key" if sig_ok else "invalid signature")
     # 2. merkle inclusion of the payload
-    mk = _require(bundle, "merkle", "merkle")
+    mk = _require_dict(_require(bundle, "merkle", "merkle"), "merkle")
+    _reject_unknown(mk, _MERKLE_KEYS, "merkle")
     hash_alg = mk.get("hash_alg", "sha256-rfc6962")
     if hash_alg != "sha256-rfc6962":
         raise UnsupportedError(f"merkle hash_alg {hash_alg!r} not supported in v0.1")
-    leaf_index = int(_require(mk, "leaf_index", "merkle.leaf_index"))
-    tree_size = int(_require(mk, "tree_size", "merkle.tree_size"))
-    proof = [_b64d(p, "merkle.inclusion_proof_b64[]") for p in mk.get("inclusion_proof_b64", [])]
+    leaf_index = _require_int(mk, "leaf_index", "merkle.leaf_index")
+    tree_size = _require_int(mk, "tree_size", "merkle.tree_size")
+    proof_list = _require(mk, "inclusion_proof_b64", "merkle.inclusion_proof_b64")   # required per SPEC §5
+    if not isinstance(proof_list, list):
+        raise BundleFormatError("field merkle.inclusion_proof_b64 must be a list")
+    proof = [_b64d(p, "merkle.inclusion_proof_b64[]") for p in proof_list]
     root = _b64d(_require(mk, "root_b64", "merkle.root_b64"), "merkle.root_b64")
     incl_ok = merkle.verify_inclusion(payload, leaf_index, tree_size, proof, root)
     result.add(
@@ -93,6 +131,8 @@ def verify_bundle(bundle: Union[dict, str]) -> VerificationResult:
     # 3. optional SD-JWT selective disclosure credential
     sd = bundle.get("sd_jwt_vc")
     if sd is not None:
+        sd = _require_dict(sd, "sd_jwt_vc")
+        _reject_unknown(sd, _SD_KEYS, "sd_jwt_vc")
         compact = _require(sd, "compact", "sd_jwt_vc.compact")
         issuer_pub = None
         if sd.get("issuer_public_key_b64"):
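
The stricter parsing in this file replaces `int(...)` coercion with an explicit type check and adds an allow-list per object. The sketch below shows why both matter. It is a simplified, self-contained re-creation: `BundleFormatError` here is a local stand-in class, and `require_int`, `reject_unknown`, and `is_rejected` are illustrative names rather than the package's private helpers.

```python
class BundleFormatError(ValueError):
    """Local stand-in for the package's malformed-input error."""


def require_int(obj: dict, key: str) -> int:
    """Return obj[key] only if it is a real integer (not bool, float, or str)."""
    if key not in obj:
        raise BundleFormatError(f"missing field {key}")
    val = obj[key]
    # bool is a subclass of int in Python, so it must be excluded explicitly.
    if isinstance(val, bool) or not isinstance(val, int):
        raise BundleFormatError(f"field {key} must be an integer, got {type(val).__name__}")
    return val


def reject_unknown(obj: dict, allowed: set, field: str) -> None:
    """Raise if obj carries any key outside the allow-list."""
    extra = set(obj) - allowed
    if extra:
        raise BundleFormatError(f"unknown field(s) in {field}: {sorted(extra)}")


def is_rejected(check, *args) -> bool:
    """Run a check and report whether it raised BundleFormatError."""
    try:
        check(*args)
    except BundleFormatError:
        return True
    return False


# The old coercion silently accepted all of these as integers.
print(int("7"), int(7.9), int(True))

# The strict check accepts a real integer and rejects the look-alikes.
print(require_int({"tree_size": 8}, "tree_size"))
for bad in (True, 7.9, "7", None):
    print(repr(bad), is_rejected(require_int, {"tree_size": bad}, "tree_size"))

# Unknown keys are rejected instead of being ignored.
print(is_rejected(reject_unknown, {"alg": "ed25519", "note": "x"}, {"alg"}, "signature"))
```

Raising one documented error type for every malformed shape is what lets a caller map bad input to a single exit code instead of an arbitrary traceback.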

{proofbundle-0.7.0 → proofbundle-0.8.0}/src/proofbundle/evalclaim.py RENAMED

@@ -21,6 +21,7 @@ import base64
 import hashlib
 import json
 import os
+import re
 import unicodedata
 from typing import Optional, Sequence
@@ -34,6 +35,8 @@ EVAL_CLAIM_SCHEMA = "proofbundle/eval-claim/v0.1"
 COMMIT_ALG = "sha256-salted-v1"
 _COMPARATORS = {">=", ">", "<=", "<"}
 _MAX_SAFE_INT = 2 ** 53 - 1
+# The published eval-claim schema's decimal pattern for threshold/score (no exponent, no sign+, no spaces).
+_DECIMAL_RE = re.compile(r"^-?[0-9]+(\.[0-9]+)?$")
 # The exact key set of an eval claim; decode/validate reject anything else.
 _REQUIRED = {"schema", "suite", "suite_version", "metric", "comparator", "threshold",
              "passed", "n", "model_id_commit", "dataset_id_commit", "commit_alg", "issuer", "timestamp"}
@@ -103,7 +106,12 @@ def canonicalize(claim: dict) -> bytes:
     for the UTF-16 code-unit key sort + compact UTF-8 serialization.
     """
     _reject_non_jcs(claim)
-    import rfc8785  # noqa: PLC0415 — lazy: only the emit path pulls the JCS dependency
+    try:
+        import rfc8785  # noqa: PLC0415 — lazy: only the emit path pulls the JCS dependency
+    except ImportError as e:
+        raise EvalClaimError(
+            "emitting eval receipts needs an RFC 8785 canonicalizer — install with: "
+            "pip install \"proofbundle[eval]\"") from e
     try:
         return rfc8785.dumps(claim)
     except (rfc8785.FloatDomainError, rfc8785.IntegerDomainError, rfc8785.CanonicalizationError) as e:
@@ -137,14 +145,17 @@ def build_eval_claim(*, suite: str, suite_version: str, metric: str, comparator:
     """
     if comparator not in _COMPARATORS:
         raise EvalClaimError(f"comparator must be one of {sorted(_COMPARATORS)}")
+    # threshold/score must match the PUBLISHED schema's decimal pattern exactly — reject "1e2",
+    # "Infinity", "+5", " 5 " etc. that Decimal() would accept but jsonschema rejects (schema-conformance).
     for name, val in (("threshold", threshold), ("score", score)):
         if not isinstance(val, str):
             raise EvalClaimError(f"{name} must be a decimal STRING, not {type(val).__name__}")
-    from decimal import Decimal, InvalidOperation  # noqa: PLC0415
-    try:
-        s, t = Decimal(score), Decimal(threshold)
-    except InvalidOperation as e:
-        raise EvalClaimError(f"threshold/score are not valid decimals: {e}") from e
+        if not _DECIMAL_RE.match(val):
+            raise EvalClaimError(f"{name} must be a plain decimal string (^-?[0-9]+(\\.[0-9]+)?$), got {val!r}")
+    if not isinstance(n, int) or isinstance(n, bool) or n < 0 or n > _MAX_SAFE_INT:
+        raise EvalClaimError(f"n must be a non-negative integer <= 2**53-1, got {n!r}")
+    from decimal import Decimal  # noqa: PLC0415
+    s, t = Decimal(score), Decimal(threshold)
     passed = {">=": s >= t, ">": s > t, "<=": s <= t, "<": s < t}[comparator]
     m_salt = model_salt if model_salt is not None else os.urandom(16)
     d_salt = dataset_salt if dataset_salt is not None else os.urandom(16)
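
The switch from "anything `Decimal()` parses" to the schema's decimal pattern closes a real gap, because `Decimal()` is more permissive than the pattern. The sketch below compares the two on the inputs named in the hunk's comment. The regular expression is copied from the diff; `DECIMAL_RE` (without the leading underscore) and `decimal_accepts` are illustrative names.

```python
import re
from decimal import Decimal, InvalidOperation

# Same pattern as in the diff: optional minus, digits, optional fraction.
DECIMAL_RE = re.compile(r"^-?[0-9]+(\.[0-9]+)?$")


def decimal_accepts(text: str) -> bool:
    """Report whether Decimal() would parse the string at all."""
    try:
        Decimal(text)
    except InvalidOperation:
        return False
    return True


# Each row shows one input, whether Decimal() parses it, and whether the
# stricter schema pattern admits it.
for text in ["0.85", "-3", "1e2", "Infinity", "+5", " 5 ", "abc"]:
    by_decimal = decimal_accepts(text)
    by_pattern = bool(DECIMAL_RE.match(text))
    print(f"{text!r:12} Decimal={by_decimal!s:5} pattern={by_pattern}")
```

Only the first two inputs pass both checks. Validating against the pattern first means every string that later reaches `Decimal()` is already known to parse, which is presumably why the `try`/`except InvalidOperation` could be dropped in the new code.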

{proofbundle-0.7.0 → proofbundle-0.8.0}/src/proofbundle/intoto.py RENAMED

@@ -12,7 +12,7 @@ exists (deferred, see the roadmap).
 """
 from __future__ import annotations
-from typing import Optional
+from typing import Any, Optional
 STATEMENT_TYPE = "https://in-toto.io/Statement/v1"
 PREDICATE_TYPE = "https://b7n0de.com/proofbundle/eval-receipt/v0.1"
@@ -37,6 +37,21 @@ def to_intoto_statement(claim: dict, *, root_b64: Optional[str] = None,
     (e.g. {"name": "inspect_ai", "version": "0.3.217"}) is optional. The subject digest is the model
     commitment under a custom key (never `sha256`).
     """
+    predicate: dict[str, Any] = {
+        "verifier": {"id": VERIFIER_ID},
+        "evaluatedAt": claim["timestamp"],
+        "suite": claim["suite"],
+        "claims": [{
+            "metric": claim["metric"], "comparator": claim["comparator"],
+            "threshold": claim["threshold"], "passed": claim["passed"],
+        }],
+        "datasetCommit": claim.get("dataset_id_commit"),
+        "subject_digest_note": _SUBJECT_DIGEST_NOTE,
+    }
+    if harness:
+        predicate["harness"] = harness
+    if root_b64:
+        predicate["receipt"] = {"schema": "proofbundle/v0.1", "root_b64": root_b64}
     statement = {
         "_type": STATEMENT_TYPE,
         "subject": [{
@@ -44,20 +59,6 @@ def to_intoto_statement(claim: dict, *, root_b64: Optional[str] = None,
             "digest": {MODEL_COMMIT_DIGEST_KEY: _commit_hex(claim["model_id_commit"])},
         }],
         "predicateType": PREDICATE_TYPE,
-        "predicate": {
-            "verifier": {"id": VERIFIER_ID},
-            "evaluatedAt": claim["timestamp"],
-            "suite": claim["suite"],
-            "claims": [{
-                "metric": claim["metric"], "comparator": claim["comparator"],
-                "threshold": claim["threshold"], "passed": claim["passed"],
-            }],
-            "datasetCommit": claim.get("dataset_id_commit"),
-            "subject_digest_note": _SUBJECT_DIGEST_NOTE,
-        },
+        "predicate": predicate,
     }
-    if harness:
-        statement["predicate"]["harness"] = harness
-    if root_b64:
-        statement["predicate"]["receipt"] = {"schema": "proofbundle/v0.1", "root_b64": root_b64}
     return statement

{proofbundle-0.7.0 → proofbundle-0.8.0/src/proofbundle.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: proofbundle
-Version: 0.7.0
+Version: 0.8.0
 Summary: Emit and verify portable cryptographic evidence bundles, offline: Ed25519 + RFC 6962 Merkle + optional SD-JWT.
 Author: Konrad Gruszka
 License: MIT
@@ -28,7 +28,7 @@ Provides-Extra: eval
 Requires-Dist: rfc8785>=0.1.4; extra == "eval"
 Provides-Extra: adapters
 Provides-Extra: inspect
-Requires-Dist: inspect_ai<0.4,>=0.3.100; extra == "inspect"
+Requires-Dist: inspect_ai<0.4,>=0.3.100; python_version >= "3.10" and extra == "inspect"
 Provides-Extra: dev
 Requires-Dist: pytest>=7; extra == "dev"
 Requires-Dist: ruff>=0.5; extra == "dev"
@@ -38,7 +38,7 @@ Requires-Dist: build>=1; extra == "dev"
 Requires-Dist: hypothesis>=6; extra == "dev"
 Requires-Dist: rfc8785>=0.1.4; extra == "dev"
 Requires-Dist: sd-jwt>=0.10; extra == "dev"
-Requires-Dist: inspect_ai<0.4,>=0.3.100; extra == "dev"
+Requires-Dist: inspect_ai<0.4,>=0.3.100; python_version >= "3.10" and extra == "dev"
 Dynamic: license-file
 <div align="center">
@@ -62,14 +62,15 @@ selectively disclosable credential. Pure Python, no server, no daemon, one JSON
 [![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
 [![SLSA build provenance](https://img.shields.io/badge/SLSA-build_provenance-D6248A.svg)](https://slsa.dev)
 [![PyPI attestations](https://img.shields.io/badge/PyPI-attestations_(PEP_740)-D6248A.svg)](https://pypi.org/project/proofbundle/)
-<!-- DOI badge placeholder: Zenodo is linked and archives each release. Add the Zenodo concept-DOI badge
-     here (and the DOI to CITATION.cff) once Zenodo assigns it — it does not exist at build time. -->
+<!-- DOI badge placeholder: enable Zenodo archiving for this repo, then add the Zenodo concept-DOI
+     badge here (and the DOI to CITATION.cff) once Zenodo assigns one on the next release. No DOI has
+     been assigned yet (no archived record exists at build time) — tracked in the human checklist. -->
 </div>
 **At a glance:** `proofbundle emit` signs and anchors a payload; `proofbundle
 verify` checks one self-contained `bundle.json` with three offline cryptographic
-checks → `OK` or `FAILED`. No network, no daemon, no own crypto. 63 tests.
+checks → `OK` or `FAILED`. No network, no daemon, no own crypto. 74 tests.
 ## Contents
@@ -78,6 +79,7 @@ checks → `OK` or `FAILED`. No network, no daemon, no own crypto. 63 tests.
 - [How it fits together](#how-it-fits-together)
 - [Install](#install)
 - [Quickstart](#quickstart)
+- [Demo](#demo--a-real-eval-log-to-a-verified-receipt-offline)
 - [Interoperability](#interoperability)
 - [Bundle format](#bundle-format-proofbundlev01)
 - [Eval receipts](#eval-receipts)
@@ -208,6 +210,21 @@ from proofbundle import verify_consistency
 verify_consistency(first_size, second_size, proof, first_root, second_root)  # -> bool
 ```
+## Demo — a real eval log to a verified receipt, offline
+```bash
+pip install "proofbundle[eval,inspect]"
+make demo          # or: bash scripts/demo.sh
+```
+`make demo` runs end-to-end with **no network, no API key, no GPU**: it takes genuine eval logs — an
+inspect_ai `mockllm/model` `.eval` log and an lm-evaluation-harness `--model dummy` `results.json`
+(committed under `tests/fixtures/`, generated offline) — turns each into a signed, Merkle-anchored
+proofbundle receipt, and verifies it to `=> OK`. The scores are random (a dummy model); the point is
+that the *artifact* is signed and offline-verifiable, with model and dataset kept as salted commitments.
+See [`examples/inspect_receipt.py`](examples/inspect_receipt.py) and
+[`examples/lm_eval_receipt.py`](examples/lm_eval_receipt.py).
 ## Interoperability
 proofbundle uses the same RFC 6962 / RFC 9162 Merkle primitive as
@@ -284,15 +301,29 @@ proofbundle show-eval receipt.json       # verify + print the claim (issuer-boun
 ```
 The claim format is specified in [EVAL_CLAIM.md](EVAL_CLAIM.md); the emit path uses
-RFC 8785 JCS canonicalization, the verify path stays dependency-free. **Honest scope:**
-a receipt proves `passed` against `threshold` and hides the model/dataset via salted
-commitments — it does **not** prove the evaluation was well designed or that the score
-itself is correct. Those are human judgements; what it removes is the need to simply
-trust the number.
+RFC 8785 JCS canonicalization, the verify path stays dependency-free.
+**Honesty guardrail (the exact scope).** A receipt attests the **authenticity and integrity** of a
+*claimed* result and its context — these exact bytes, signed by this key, anchored under this root, with
+model/dataset kept as salted commitments. It does **not** attest the **correctness of the computation**,
+and it cannot detect **cherry-picking** of the eval. Whether the eval was well designed, whether the
+suite measures what it claims, and whether the number was computed honestly are separate questions.
+Trusted-execution approaches such as [Attestable Audits](https://arxiv.org/abs/2506.23706) target
+computation-correctness with a different (hardware) trust model; proofbundle is the lightweight,
+hardware-free path to a portable, tamper-evident, selectively disclosable *result artifact*.
+**How this differs from a bare hash or a TEE.** A plain SHA-256 of a log commits to bytes but carries no
+signature, no tamper-evident anchor, and no selective disclosure (an attestation-exporter idea along
+those lines,
+[inspect_evals PR #1610](https://github.com/UKGovernmentBEIS/inspect_evals/pull/1610), was closed as
+belonging *a layer above* the framework — which is exactly where proofbundle sits). A TEE proves the
+computation ran untampered but needs specific hardware. proofbundle adds Ed25519 + RFC 6962 Merkle +
+SD-JWT selective disclosure over one portable file, offline.
 ### A verification layer for trustworthy eval logs
-The UK AISI inspect_ai team names an open gap ([arXiv:2507.06893](https://arxiv.org/abs/2507.06893)):
+The maintainers of inspect_evals (Arcadia Impact, funded by the UK AI Safety Institute) name an open
+gap ([arXiv:2507.06893](https://arxiv.org/abs/2507.06893)):
 a database of trustworthy evaluation results with proper provenance tracking. proofbundle is the
 missing **signature + selective-disclosure layer** for exactly that — complementary to metadata
 aggregation (Every Eval Ever) and documentation taxonomies (Eval Factsheets), not a competitor.
@@ -326,8 +357,10 @@ attestation — see [SECURITY.md](SECURITY.md).
 - **v0.5** — inspect_ai adapter (stable API), in-toto Statement v1 view, SD-JWT **issuance** (RFC 9901).
 - **v0.6** — a second eval adapter (lm-evaluation-harness, real format + provenance), INTEROP.md,
   CITATION.cff, PEP 740 attestations documented.
-- **v0.7 (current release)** — citability polish: ORCID in CITATION.cff, a Zenodo DOI placeholder
-  (assigned on release), and a draft in-toto ML-eval predicate proposal.
+- **v0.7** — citability polish (ORCID, Zenodo DOI placeholder, in-toto proposal draft); v0.7.1 hardened
+  verifier robustness + CI on Python 3.9 after a holistic review.
+- **v0.8 (current release)** — an offline `make demo` (real eval log -> signed receipt -> verified),
+  a sharpened honesty guardrail (authenticity/integrity, not computation-correctness), and outreach drafts.
 - **Deferred** (explicitly not yet built) — SD-JWT VC conformance + `vct` metadata,
   Key-Binding JWT, status lists / revocation, an official in-toto PR, DSSE / a full in-toto client.
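
The README hunks above describe a receipt as "anchored under this root" using the RFC 6962 / RFC 9162 Merkle primitive. As a rough illustration of what that anchoring check involves, here is a minimal sketch of inclusion-proof verification following the algorithm in RFC 9162 section 2.1.3.2. The function names are illustrative only and are not taken from proofbundle's API; the package's own implementation may differ in structure and error handling.

```python
import hashlib
from typing import List


def leaf_hash(data: bytes) -> bytes:
    # RFC 6962 leaf hash: SHA-256 over a 0x00 prefix followed by the leaf bytes.
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # RFC 6962 interior node hash: SHA-256 over a 0x01 prefix and both children.
    return hashlib.sha256(b"\x01" + left + right).digest()


def verify_inclusion(leaf: bytes, leaf_index: int, tree_size: int,
                     path: List[bytes], root: bytes) -> bool:
    """Recompute the root from a leaf hash and its audit path (RFC 9162, 2.1.3.2)."""
    if not 0 <= leaf_index < tree_size:
        return False
    fn, sn = leaf_index, tree_size - 1
    r = leaf
    for p in path:
        if sn == 0:
            return False  # path is longer than the tree is deep
        if fn & 1 or fn == sn:
            r = node_hash(p, r)  # sibling is on the left
            if not fn & 1:
                # Skip levels where this node has no right sibling.
                while fn != 0 and not fn & 1:
                    fn >>= 1
                    sn >>= 1
        else:
            r = node_hash(r, p)  # sibling is on the right
        fn >>= 1
        sn >>= 1
    return sn == 0 and r == root


# A three-leaf tree: root = H(H(l0, l1), l2).
leaves = [leaf_hash(x) for x in (b"a", b"b", b"c")]
tree_root = node_hash(node_hash(leaves[0], leaves[1]), leaves[2])
print(verify_inclusion(leaves[2], 2, 3, [node_hash(leaves[0], leaves[1])], tree_root))
```

The 0x00 and 0x01 prefixes keep leaf hashes and interior-node hashes in separate domains, so a leaf can never be passed off as an interior node.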

{proofbundle-0.7.0 → proofbundle-0.8.0}/src/proofbundle.egg-info/SOURCES.txt RENAMED

@@ -24,11 +24,13 @@ src/proofbundle/adapters/inspect_ai.py
 src/proofbundle/adapters/lm_eval.py
 tests/test_adapters.py
 tests/test_bundle.py
+tests/test_bundle_robustness.py
 tests/test_cli.py
 tests/test_cli_eval.py
 tests/test_emit.py
 tests/test_eval_claim_schema.py
 tests/test_evalclaim.py
+tests/test_examples.py
 tests/test_intoto.py
 tests/test_merkle.py
 tests/test_merkle_property.py

{proofbundle-0.7.0 → proofbundle-0.8.0}/src/proofbundle.egg-info/requires.txt RENAMED

@@ -11,12 +11,16 @@ build>=1
 hypothesis>=6
 rfc8785>=0.1.4
 sd-jwt>=0.10
+[dev:python_version >= "3.10"]
 inspect_ai<0.4,>=0.3.100
 [eval]
 rfc8785>=0.1.4
 [inspect]
+[inspect:python_version >= "3.10"]
 inspect_ai<0.4,>=0.3.100
 [sdjwt]
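
The new `[dev:python_version >= "3.10"]` and `[inspect:python_version >= "3.10"]` headers in this hunk are how setuptools records a requirement that applies only when both an extra and an environment marker match. The sketch below reads that section syntax with the standard library only. `parse_requires_txt` is a hypothetical helper written for this note, not part of proofbundle or setuptools, and it does not evaluate the markers.

```python
def parse_requires_txt(text):
    """Group requirement lines by their (extra, marker) section header.

    Lines before any header are unconditional and land under (None, None).
    """
    sections = {}
    current = (None, None)
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # "[extra:marker]", "[extra]" or "[:marker]"
            extra, _, marker = line[1:-1].partition(":")
            current = (extra or None, marker.strip() or None)
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(line)
    return sections


SAMPLE = '''\
[inspect]

[inspect:python_version >= "3.10"]
inspect_ai<0.4,>=0.3.100
'''

for (extra, marker), reqs in parse_requires_txt(SAMPLE).items():
    print(extra, marker, reqs)
```

Read this way, the 0.8.0 change leaves the plain `[inspect]` section empty and moves `inspect_ai` under the marker-qualified section, so the extra still resolves on Python 3.9 but installs nothing there.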

{proofbundle-0.7.0 → proofbundle-0.8.0}/tests/test_adapters.py RENAMED

@@ -39,6 +39,8 @@ class TestAdapters(unittest.TestCase):
         self.assertEqual(claim["suite"], "safety_refusal_demo")
         self.assertTrue(claim["passed"])                    # accuracy 0.0 >= 0.00
         self.assertNotIn("mockllm/model", str(claim))       # model id only as salted commitment
+        self.assertEqual(claim["provenance"]["harness"], "inspect_ai")  # provenance parity with lm-eval
+        self.assertIn("harness_version", claim["provenance"])
     def test_inspect_ai_missing_metric_clear_error(self):
         from proofbundle.adapters.inspect_ai import InspectAdapterError
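
The `assertNotIn("mockllm/model", str(claim))` check in this hunk relies on the model identifier appearing in the claim only as a salted commitment. The diff does not show how proofbundle constructs that commitment (EVAL_CLAIM.md is the specification), so the sketch below is an assumption: it uses SHA-256 over `salt || value`, one common construction, with hypothetical helper names.

```python
import hashlib
import hmac


def commit(value: str, salt: bytes) -> str:
    # Assumed construction: SHA-256(salt || UTF-8 value). The real claim format may differ.
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


def opens_to(commitment: str, value: str, salt: bytes) -> bool:
    # Anyone later given the value and salt can confirm the commitment.
    return hmac.compare_digest(commit(value, salt), commitment)


salt = b"0" * 16  # fixed here for illustration; a real salt should be random
commitment = commit("mockllm/model", salt)
print("mockllm/model" in commitment)                  # the identifier itself is not embedded
print(opens_to(commitment, "mockllm/model", salt))    # the holder of the salt can open it
```

The salt matters because model and dataset names come from a small, guessable set; without it, a verifier could hash candidate names and compare.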

proofbundle-0.8.0/tests/test_bundle_robustness.py ADDED

@@ -0,0 +1,74 @@
+"""Malformed-input robustness of verify_bundle + build_eval_claim (holistic-review findings, 0.7.1).
+The verifier's contract is OK/FAILED/malformed — never a raw traceback. build_eval_claim must not emit a
+receipt that fails its own published schema. One red-test per finding."""
+
+import copy
+import unittest
+
+from proofbundle import verify_bundle
+from proofbundle.emit import emit_bundle, generate_signer
+from proofbundle.errors import BundleFormatError
+from proofbundle.evalclaim import EvalClaimError, build_eval_claim
+
+
+def _bundle():
+    return emit_bundle(b"payload", generate_signer())
+
+
+def _mut(mut):
+    b = copy.deepcopy(_bundle())
+    mut(b)
+    return b
+
+
+class TestBundleRobustness(unittest.TestCase):
+    def test_leaf_index_non_numeric_raises_format_error(self):   # D1
+        with self.assertRaises(BundleFormatError):
+            verify_bundle(_mut(lambda b: b["merkle"].__setitem__("leaf_index", "abc")))
+
+    def test_signature_non_object_raises_format_error(self):     # D2
+        with self.assertRaises(BundleFormatError):
+            verify_bundle(_mut(lambda b: b.__setitem__("signature", "notadict")))
+        with self.assertRaises(BundleFormatError):
+            verify_bundle(_mut(lambda b: b.__setitem__("merkle", ["x"])))
+
+    def test_tree_size_float_rejected(self):                     # D3 (SPEC §2: integers only)
+        with self.assertRaises(BundleFormatError):
+            verify_bundle(_mut(lambda b: b["merkle"].__setitem__("tree_size", 1.5)))
+
+    def test_missing_inclusion_proof_rejected(self):             # D4 (SPEC §5: required)
+        with self.assertRaises(BundleFormatError):
+            verify_bundle(_mut(lambda b: b["merkle"].pop("inclusion_proof_b64")))
+
+    def test_unknown_fields_rejected(self):                      # SPEC §3: additionalProperties false
+        with self.assertRaises(BundleFormatError):
+            verify_bundle(_mut(lambda b: b.__setitem__("evil", "x")))
+        with self.assertRaises(BundleFormatError):
+            verify_bundle(_mut(lambda b: b["signature"].__setitem__("evil", "x")))
+        with self.assertRaises(BundleFormatError):
+            verify_bundle(_mut(lambda b: b["merkle"].__setitem__("evil", "x")))
+
+    def test_well_formed_still_ok(self):                         # no false positive
+        self.assertTrue(verify_bundle(_bundle()).ok)
+
+
+class TestEvalClaimSchemaConformance(unittest.TestCase):
+    def _build(self, **kw):
+        base = dict(suite="s", suite_version="v1", metric="acc", comparator=">=", threshold="0.8",
+                    score="0.9", n=1, model_id="m", dataset_id="d", issuer="",
+                    timestamp="2026-07-01T12:00:00Z", model_salt=b"0" * 16, dataset_salt=b"1" * 16)
+        base.update(kw)
+        return build_eval_claim(**base)
+
+    def test_negative_n_rejected(self):                          # schema minimum 0
+        with self.assertRaises(EvalClaimError):
+            self._build(n=-5)
+
+    def test_exponent_and_sign_threshold_rejected(self):         # schema decimal pattern
+        for bad in ("1e2", "Infinity", "+5", " 0.9 "):
+            with self.assertRaises(EvalClaimError):
+                self._build(threshold=bad)
+
+    def test_plain_decimal_accepted(self):
+        claim, _ = self._build(threshold="0.80", score="0.92")
+        self.assertTrue(claim["passed"])
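
The last three tests pin down what counts as an acceptable `threshold` string: `"0.80"` passes, while `"1e2"`, `"Infinity"`, `"+5"` and `" 0.9 "` are rejected. The actual pattern in the published schema is not part of this diff, so the regular expression below is only an assumed stand-in that agrees with those five cases; in particular, rejecting a leading minus sign is a guess. The comparison via `decimal.Decimal` is likewise illustrative rather than a description of `build_eval_claim`.

```python
import re
from decimal import Decimal

# Assumed pattern: digits, optionally a dot and more digits. No sign, exponent or whitespace.
# [0-9] is used instead of \d so that non-ASCII digits do not match.
PLAIN_DECIMAL = re.compile(r"[0-9]+(?:\.[0-9]+)?")


def is_plain_decimal(text) -> bool:
    return isinstance(text, str) and PLAIN_DECIMAL.fullmatch(text) is not None


def passed(score: str, threshold: str) -> bool:
    # Compare as exact decimals so that "0.80" and "0.8" are equal and no float rounding occurs.
    if not (is_plain_decimal(score) and is_plain_decimal(threshold)):
        raise ValueError("score and threshold must be plain decimal strings")
    return Decimal(score) >= Decimal(threshold)


for candidate in ("0.80", "1e2", "Infinity", "+5", " 0.9 "):
    print(repr(candidate), is_plain_decimal(candidate))
print(passed("0.92", "0.80"))
```

Keeping the numbers as restricted decimal strings means the signed bytes have one spelling per value class, which a permissive parser such as `float()` would not guarantee.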

proofbundle-0.8.0/tests/test_examples.py ADDED

@@ -0,0 +1,28 @@
+"""The demo examples run end-to-end (real fixtures -> receipt -> verify). Covers `make demo` (Phase B)."""
+import importlib.util
+import sys
+import unittest
+from pathlib import Path
+
+REPO = Path(__file__).resolve().parents[1]
+
+
+def _run_example(name):
+    try:
+        import inspect_ai.log  # noqa: F401  (inspect example needs it)
+    except ImportError:
+        if name == "inspect_receipt":
+            raise unittest.SkipTest("inspect_ai not installed")
+    spec = importlib.util.spec_from_file_location(name, REPO / "examples" / f"{name}.py")
+    m = importlib.util.module_from_spec(spec)
+    sys.modules[name] = m
+    spec.loader.exec_module(m)
+    return m.main()
+
+
+class TestExamples(unittest.TestCase):
+    def test_lm_eval_receipt_example(self):
+        self.assertEqual(_run_example("lm_eval_receipt"), 0)
+
+    def test_inspect_receipt_example(self):
+        self.assertEqual(_run_example("inspect_receipt"), 0)