PyPI: proofbundle 0.3.0 (tar.gz) → 0.4.0 (tar.gz)

Files changed (38)

{proofbundle-0.3.0/src/proofbundle.egg-info → proofbundle-0.4.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: proofbundle
-Version: 0.3.0
+Version: 0.4.0
 Summary: Emit and verify portable cryptographic evidence bundles, offline: Ed25519 + RFC 6962 Merkle + optional SD-JWT.
 Author: Konrad Gruszka
 License: MIT
@@ -24,6 +24,9 @@ Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: cryptography>=42
 Provides-Extra: sdjwt
+Provides-Extra: eval
+Requires-Dist: rfc8785>=0.1.4; extra == "eval"
+Provides-Extra: adapters
 Provides-Extra: dev
 Requires-Dist: pytest>=7; extra == "dev"
 Requires-Dist: ruff>=0.5; extra == "dev"
@@ -31,6 +34,7 @@ Requires-Dist: jsonschema>=4; extra == "dev"
 Requires-Dist: mypy>=1.8; extra == "dev"
 Requires-Dist: build>=1; extra == "dev"
 Requires-Dist: hypothesis>=6; extra == "dev"
+Requires-Dist: rfc8785>=0.1.4; extra == "dev"
 Dynamic: license-file
 <div align="center">
@@ -68,6 +72,7 @@ checks → `OK` or `FAILED`. No network, no daemon, no own crypto. 25 tests.
 - [Quickstart](#quickstart)
 - [Interoperability](#interoperability)
 - [Bundle format](#bundle-format-proofbundlev01)
+- [Eval receipts](#eval-receipts)
 - [Security notes and scope](#security-notes-and-scope-stated-honestly)
 - [Roadmap](#roadmap)
 - [Contributing](#contributing)
@@ -255,25 +260,37 @@ This is v0.1. It does exactly what it says and no more:
 If you find a correctness or security issue, please open an issue or see
 [SECURITY.md](SECURITY.md).
+## Eval receipts
+Since v0.4, proofbundle turns a reproducible eval run into a signed, Merkle-anchored
+**receipt** that proves *suite S `comparator` threshold T, passed* while carrying only
+**salted commitments** to the model and dataset identifiers — never the weights, the
+data, or the plaintext names. A third party verifies the threshold was met, offline,
+from one file, without ever seeing the model or the test set.
+```bash
+pip install "proofbundle[eval]"          # emit side needs an RFC 8785 canonicalizer
+proofbundle emit-eval --claim claim.json --out receipt.json --new-key signer.key
+proofbundle verify receipt.json          # a receipt is a normal bundle
+proofbundle show-eval receipt.json       # verify + print the claim (issuer-bound)
+```
+The claim format is specified in [EVAL_CLAIM.md](EVAL_CLAIM.md); the emit path uses
+RFC 8785 JCS canonicalization, the verify path stays dependency-free. **Honest scope:**
+a receipt proves `passed` against `threshold` and hides the model/dataset via salted
+commitments — it does **not** prove the evaluation was well designed or that the score
+itself is correct. Those are human judgements; what it removes is the need to simply
+trust the number.
 ## Roadmap
 - **v0.1** — the offline verifier plus a real example bundle.
-- **v0.2 (current release)** — the emitter: `emit_bundle` signs a payload with
-  Ed25519 and anchors it as the last leaf of an RFC 6962 Merkle tree, producing
-  a bundle that `verify_bundle` accepts. Available as `proofbundle emit`.
-- **v0.3** — an eval-receipt emitter: wrap one evaluation framework run
-  ([Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai),
-  [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness))
-  into a signed receipt whose payload is a minimal canonical claim, for example
-  `{"suite": "...", "threshold": 0.8, "passed": true}`, optionally wrapped as an
-  SD-JWT VC so a holder can disclose *passed above threshold* without revealing
-  the model, weights or dataset, and carrying a cluster-bootstrap confidence
-  interval, a multiple-testing correction and a preregistration hash.
-That last step is the point: today no widely used AI project turns a
-reproducible evaluation result into a signed, third-party-verifiable,
-selectively disclosable receipt. This repository is the trustworthy verification
-core that makes it possible.
+- **v0.2** — the emitter: `emit_bundle` / `proofbundle emit`.
+- **v0.3** — external RFC 6962 conformance vectors + real Sigstore Rekor interop.
+- **v0.4 (current release)** — the eval-receipt emitter (`emit_eval_receipt` /
+  `proofbundle emit-eval`), salted commitments, issuer binding, file-based adapters.
+- **v0.5** — selective disclosure of the exact score via SD-JWT **issuance** (the issuer
+  reveals identifier + salt on demand) and full SD-JWT VC conformance.
 ## Contributing

{proofbundle-0.3.0 → proofbundle-0.4.0}/README.md RENAMED

@@ -33,6 +33,7 @@ checks → `OK` or `FAILED`. No network, no daemon, no own crypto. 25 tests.
 - [Quickstart](#quickstart)
 - [Interoperability](#interoperability)
 - [Bundle format](#bundle-format-proofbundlev01)
+- [Eval receipts](#eval-receipts)
 - [Security notes and scope](#security-notes-and-scope-stated-honestly)
 - [Roadmap](#roadmap)
 - [Contributing](#contributing)
@@ -220,25 +221,37 @@ This is v0.1. It does exactly what it says and no more:
 If you find a correctness or security issue, please open an issue or see
 [SECURITY.md](SECURITY.md).
+## Eval receipts
+Since v0.4, proofbundle turns a reproducible eval run into a signed, Merkle-anchored
+**receipt** that proves *suite S `comparator` threshold T, passed* while carrying only
+**salted commitments** to the model and dataset identifiers — never the weights, the
+data, or the plaintext names. A third party verifies the threshold was met, offline,
+from one file, without ever seeing the model or the test set.
+```bash
+pip install "proofbundle[eval]"          # emit side needs an RFC 8785 canonicalizer
+proofbundle emit-eval --claim claim.json --out receipt.json --new-key signer.key
+proofbundle verify receipt.json          # a receipt is a normal bundle
+proofbundle show-eval receipt.json       # verify + print the claim (issuer-bound)
+```
+The claim format is specified in [EVAL_CLAIM.md](EVAL_CLAIM.md); the emit path uses
+RFC 8785 JCS canonicalization, the verify path stays dependency-free. **Honest scope:**
+a receipt proves `passed` against `threshold` and hides the model/dataset via salted
+commitments — it does **not** prove the evaluation was well designed or that the score
+itself is correct. Those are human judgements; what it removes is the need to simply
+trust the number.
 ## Roadmap
 - **v0.1** — the offline verifier plus a real example bundle.
-- **v0.2 (current release)** — the emitter: `emit_bundle` signs a payload with
-  Ed25519 and anchors it as the last leaf of an RFC 6962 Merkle tree, producing
-  a bundle that `verify_bundle` accepts. Available as `proofbundle emit`.
-- **v0.3** — an eval-receipt emitter: wrap one evaluation framework run
-  ([Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai),
-  [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness))
-  into a signed receipt whose payload is a minimal canonical claim, for example
-  `{"suite": "...", "threshold": 0.8, "passed": true}`, optionally wrapped as an
-  SD-JWT VC so a holder can disclose *passed above threshold* without revealing
-  the model, weights or dataset, and carrying a cluster-bootstrap confidence
-  interval, a multiple-testing correction and a preregistration hash.
-That last step is the point: today no widely used AI project turns a
-reproducible evaluation result into a signed, third-party-verifiable,
-selectively disclosable receipt. This repository is the trustworthy verification
-core that makes it possible.
+- **v0.2** — the emitter: `emit_bundle` / `proofbundle emit`.
+- **v0.3** — external RFC 6962 conformance vectors + real Sigstore Rekor interop.
+- **v0.4 (current release)** — the eval-receipt emitter (`emit_eval_receipt` /
+  `proofbundle emit-eval`), salted commitments, issuer binding, file-based adapters.
+- **v0.5** — selective disclosure of the exact score via SD-JWT **issuance** (the issuer
+  reveals identifier + salt on demand) and full SD-JWT VC conformance.
 ## Contributing
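
Note on the salted commitments described in the Eval receipts section: the sketch below mirrors the `sha256-salted-v1` construction from `evalclaim.py` in this release (SHA-256 over salt || UTF-8 identifier), using only the standard library. `check_opening` is a hypothetical helper and not part of proofbundle 0.4.0; it only illustrates how a verifier could check an identifier once the issuer reveals the salt. The identifiers are made up.

```python
import hashlib
import hmac
import os


def salted_commit(identifier: str, salt: bytes) -> str:
    """sha256:<hex> over salt || utf8(identifier), as in evalclaim.py."""
    if len(salt) < 16:
        raise ValueError("commitment salt must be at least 16 bytes")
    return "sha256:" + hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()


def check_opening(commitment: str, identifier: str, salt: bytes) -> bool:
    """Hypothetical helper: recompute the commitment from a revealed (identifier, salt)."""
    return hmac.compare_digest(commitment, salted_commit(identifier, salt))


salt = os.urandom(16)  # stays with the issuer, never goes into the payload
commitment = salted_commit("example-model", salt)
print(check_opening(commitment, "example-model", salt))  # True
print(check_opening(commitment, "other-model", salt))    # False
```

Without the salt, a verifier holding only the commitment cannot test candidate names against it, which is the property the README relies on.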

{proofbundle-0.3.0 → proofbundle-0.4.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "proofbundle"
-version = "0.3.0"
+version = "0.4.0"
 description = "Emit and verify portable cryptographic evidence bundles, offline: Ed25519 + RFC 6962 Merkle + optional SD-JWT."
 readme = "README.md"
 requires-python = ">=3.9"
@@ -39,7 +39,13 @@ dependencies = ["cryptography>=42"]
 # `cryptography` (EdDSA + stdlib), so `pip install proofbundle[sdjwt]` keeps the trusted core
 # lean; the extra documents intent and is forward-compatible if SD-JWT ever needs a heavier lib.
 sdjwt = []
-dev = ["pytest>=7", "ruff>=0.5", "jsonschema>=4", "mypy>=1.8", "build>=1", "hypothesis>=6"]
+# EMITTING eval receipts needs a real RFC 8785 JCS canonicalizer (emit path only). The VERIFY
+# path (verify_bundle / decode_eval_claim) never canonicalizes — it checks stored bytes — so the
+# verifier stays dependency-free. `pip install proofbundle[eval]` adds emit-side canonicalization.
+eval = ["rfc8785>=0.1.4"]
+# Framework adapters read exported result JSON only (no framework import) → pure stdlib today.
+adapters = []
+dev = ["pytest>=7", "ruff>=0.5", "jsonschema>=4", "mypy>=1.8", "build>=1", "hypothesis>=6", "rfc8785>=0.1.4"]
 [project.urls]
 Homepage = "https://b7n0de.com"
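
Note on the `eval` extra: the pyproject comments say the RFC 8785 canonicalizer is needed on the emit path only, and `evalclaim.py` imports it lazily inside `canonicalize`. The sketch below shows one way such an optional dependency can be loaded on demand with a readable error. `require_optional` is a hypothetical helper, not something this release ships, and the error wording is an assumption.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Hypothetical helper: import an optional dependency only when it is needed."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f'{module_name} is not installed; try: pip install "proofbundle[{extra}]"'
        ) from exc


# A stdlib module stands in for rfc8785 so the sketch runs anywhere.
json_module = require_optional("json", "eval")
print(json_module.dumps({"ok": True}))  # {"ok": true}
```

Because the import happens inside the function, code that never emits a receipt never touches the optional package.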

{proofbundle-0.3.0 → proofbundle-0.4.0}/src/proofbundle/__init__.py RENAMED

@@ -13,7 +13,7 @@ from .emit import emit_bundle, generate_signer
 from .errors import Check, ProofBundleError, VerificationResult
 from .merkle import verify_consistency, verify_inclusion
-__version__ = "0.3.0"
+__version__ = "0.4.0"
 __all__ = [
     "__version__",

proofbundle-0.4.0/src/proofbundle/adapters/__init__.py ADDED

@@ -0,0 +1,10 @@
+"""Adapters that map an eval framework's EXPORTED result JSON to an eval claim.
+Each adapter reads a result file from disk and never imports the framework, so they
+add no runtime dependency. The output-format mapping is bound to a framework version;
+each fixture in tests/fixtures documents its source + version.
+"""
+from .inspect_ai import from_inspect_ai_log
+from .lm_eval import from_lm_eval_results
+__all__ = ["from_lm_eval_results", "from_inspect_ai_log"]

proofbundle-0.4.0/src/proofbundle/adapters/inspect_ai.py ADDED

@@ -0,0 +1,36 @@
+"""Adapter for UK AISI inspect_ai eval-log JSON (file-based, no framework import)."""
+from __future__ import annotations
+import json
+from pathlib import Path
+from typing import Optional
+from ..evalclaim import build_eval_claim
+def from_inspect_ai_log(path, metric: str, *, comparator: str, threshold: str, timestamp: str,
+                        model_salt: Optional[bytes] = None, dataset_salt: Optional[bytes] = None):
+    """Read an inspect_ai eval-log JSON and build an eval claim.
+    Expects: {"eval": {"task": ..., "model": ..., "dataset": {"name": ...}},
+    "results": {"total_samples": n, "scores": [{"metrics": {metric: {"value": <number>}}}]}}.
+    Returns (claim, salts).
+    """
+    data = json.loads(Path(path).read_text(encoding="utf-8"))
+    ev = data.get("eval", {})
+    scores = data.get("results", {}).get("scores", [])
+    value = None
+    for s in scores:
+        m = s.get("metrics", {})
+        if metric in m:
+            value = m[metric].get("value")
+            break
+    if value is None:
+        raise ValueError(f"metric {metric!r} not found in inspect_ai scores")
+    n = int(data.get("results", {}).get("total_samples") or 0)
+    return build_eval_claim(
+        suite=str(ev.get("task", "inspect_ai")), suite_version=str(ev.get("task_version", "1")),
+        metric=metric, comparator=comparator, threshold=threshold, score=repr(value), n=n,
+        model_id=str(ev.get("model", "unknown")),
+        dataset_id=str(ev.get("dataset", {}).get("name", ev.get("task", "unknown"))),
+        issuer="", timestamp=timestamp, model_salt=model_salt, dataset_salt=dataset_salt)
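
Note on the metric lookup in `from_inspect_ai_log`: the adapter walks `results.scores` and takes the first entry whose `metrics` contains the requested name. The standalone sketch below reproduces only that lookup against a hand-written dictionary that follows the shape given in the docstring. `find_metric_value` and the sample values are illustrative assumptions, not real inspect_ai output.

```python
from typing import Any, Optional


def find_metric_value(log: dict, metric: str) -> Optional[Any]:
    """Return the first matching metric value from results.scores, or None."""
    for score in log.get("results", {}).get("scores", []):
        metrics = score.get("metrics", {})
        if metric in metrics:
            return metrics[metric].get("value")
    return None


# Hand-written sample in the documented shape (values are made up).
sample_log = {
    "eval": {"task": "example_task", "model": "example-model", "dataset": {"name": "example_set"}},
    "results": {
        "total_samples": 200,
        "scores": [
            {"metrics": {"stderr": {"value": 0.02}}},
            {"metrics": {"accuracy": {"value": 0.85}}},
        ],
    },
}

print(find_metric_value(sample_log, "accuracy"))  # 0.85
print(find_metric_value(sample_log, "f1"))        # None
```

In the adapter itself, a `None` result raises `ValueError`, and the value is converted with `repr` before it reaches `build_eval_claim`.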

proofbundle-0.4.0/src/proofbundle/adapters/lm_eval.py ADDED

@@ -0,0 +1,32 @@
+"""Adapter for EleutherAI lm-evaluation-harness results.json (file-based, no framework import)."""
+from __future__ import annotations
+import json
+from pathlib import Path
+from typing import Optional
+from ..evalclaim import build_eval_claim
+def from_lm_eval_results(path, task: str, metric: str, *, comparator: str, threshold: str,
+                         timestamp: str, model_salt: Optional[bytes] = None,
+                         dataset_salt: Optional[bytes] = None):
+    """Read an lm-evaluation-harness results.json and build an eval claim for `task`/`metric`.
+    Expects the standard shape: {"results": {task: {metric: <number>, ...}, ...},
+    "n-samples": {task: {"effective": n}}, "config"/"model_name": ...}. The score is read as a
+    STRING to avoid float canonicalization issues. Returns (claim, salts).
+    """
+    data = json.loads(Path(path).read_text(encoding="utf-8"))
+    res = data.get("results", {}).get(task)
+    if res is None or metric not in res:
+        raise ValueError(f"task/metric not found in results: {task}/{metric}")
+    score = repr(res[metric]) if not isinstance(res[metric], str) else res[metric]
+    n = int(data.get("n-samples", {}).get(task, {}).get("effective")
+            or data.get("n-samples", {}).get(task, {}).get("original") or 0)
+    model_id = str(data.get("model_name") or data.get("config", {}).get("model") or "unknown")
+    return build_eval_claim(
+        suite=task, suite_version=str(data.get("config", {}).get("model_source", "lm-eval")),
+        metric=metric, comparator=comparator, threshold=threshold, score=str(score), n=n,
+        model_id=model_id, dataset_id=task, issuer="", timestamp=timestamp,
+        model_salt=model_salt, dataset_salt=dataset_salt)
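
Note on scores as decimal strings: both adapters pass the score as a string (via `repr` for numeric values), and `build_eval_claim` compares score and threshold with `decimal.Decimal`. The sketch below isolates that comparison; `passes` is a hypothetical name for the one-line comparator table in `build_eval_claim`. It also shows that a float's `repr` carries any binary rounding artifact into the string, which can matter right at a threshold.

```python
from decimal import Decimal


def passes(score: str, comparator: str, threshold: str) -> bool:
    """Compare two decimal strings exactly, using the four comparators of this release."""
    s, t = Decimal(score), Decimal(threshold)
    return {">=": s >= t, ">": s > t, "<=": s <= t, "<": s < t}[comparator]


print(passes("0.80", ">=", "0.8"))           # True: 0.80 and 0.8 are numerically equal
print(repr(0.1 + 0.2))                       # 0.30000000000000004
print(passes(repr(0.1 + 0.2), "<=", "0.3"))  # False: the float artifact is preserved
```

If the upstream result file already stores the score as a string, `from_lm_eval_results` keeps it unchanged, which avoids the artifact shown in the last line.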

{proofbundle-0.3.0 → proofbundle-0.4.0}/src/proofbundle/cli.py RENAMED

@@ -1,4 +1,4 @@
-"""Command line interface: ``proofbundle verify`` and ``proofbundle emit``."""
+"""Command line interface: ``proofbundle`` verify / emit / emit-eval / show-eval."""
 from __future__ import annotations
@@ -12,6 +12,58 @@ from .emit import emit_bundle, generate_signer, load_signer, save_signer
 from .errors import ProofBundleError
+def _resolve_signer(args):
+    """Shared signer resolution for emit / emit-eval. Returns a signer or None (with an error)."""
+    if getattr(args, "new_key", None) and getattr(args, "key", None):
+        print("ERROR: use either --key or --new-key, not both", file=sys.stderr)
+        return None
+    if getattr(args, "new_key", None):
+        signer = generate_signer()
+        save_signer(signer, args.new_key)
+        print(f"wrote new signing key to {args.new_key} (keep this secret)", file=sys.stderr)
+        return signer
+    if getattr(args, "key", None):
+        return load_signer(args.key)
+    print("ERROR: provide --key <file> or --new-key <file>", file=sys.stderr)
+    return None
+def _cmd_emit_eval(args: argparse.Namespace) -> int:
+    from .evalclaim import EvalClaimError, emit_eval_receipt, load_claim_text  # noqa: PLC0415
+    signer = _resolve_signer(args)
+    if signer is None:
+        return 2
+    try:
+        with open(args.claim, encoding="utf-8") as handle:
+            claim = load_claim_text(handle.read())
+        bundle = emit_eval_receipt(claim, signer)
+    except (EvalClaimError, OSError, ValueError) as exc:
+        print(f"ERROR: {exc}", file=sys.stderr)
+        return 2
+    with open(args.out, "w", encoding="utf-8") as handle:
+        json.dump(bundle, handle, indent=2)
+        handle.write("\n")
+    print(f"wrote eval receipt {args.out}")
+    return 0
+def _cmd_show_eval(args: argparse.Namespace) -> int:
+    from .evalclaim import decode_eval_claim  # noqa: PLC0415
+    claim = decode_eval_claim(args.receipt)
+    if claim is None:
+        print("=> FAILED: not a valid, issuer-bound eval receipt", file=sys.stderr)
+        return 1
+    print(f"suite      {claim['suite']} ({claim['suite_version']})")
+    print(f"metric     {claim['metric']} {claim['comparator']} {claim['threshold']}")
+    print(f"passed     {claim['passed']}   (n={claim['n']})")
+    print(f"model      commit {claim['model_id_commit']}")
+    print(f"dataset    commit {claim['dataset_id_commit']}")
+    print(f"issuer     {claim['issuer']}")
+    print(f"timestamp  {claim['timestamp']}")
+    print("=> OK")
+    return 0
 def _cmd_verify(args: argparse.Namespace) -> int:
     try:
         result = verify_bundle(args.bundle)
@@ -32,17 +84,8 @@ def _cmd_verify(args: argparse.Namespace) -> int:
 def _cmd_emit(args: argparse.Namespace) -> int:
-    if args.new_key and args.key:
-        print("ERROR: use either --key or --new-key, not both", file=sys.stderr)
-        return 2
-    if args.new_key:
-        signer = generate_signer()
-        save_signer(signer, args.new_key)
-        print(f"wrote new signing key to {args.new_key} (keep this secret)", file=sys.stderr)
-    elif args.key:
-        signer = load_signer(args.key)
-    else:
-        print("ERROR: provide --key <file> or --new-key <file>", file=sys.stderr)
+    signer = _resolve_signer(args)
+    if signer is None:
         return 2
     with open(args.payload_file, "rb") as handle:
@@ -76,6 +119,17 @@ def build_parser() -> argparse.ArgumentParser:
     emit.add_argument("--new-key", help="generate a signing key and save it to this file")
     emit.set_defaults(func=_cmd_emit)
+    emit_eval = sub.add_parser("emit-eval", help="emit a signed eval receipt from a claim JSON")
+    emit_eval.add_argument("--claim", required=True, help="path to the eval-claim JSON")
+    emit_eval.add_argument("--out", required=True, help="path to write the receipt bundle JSON")
+    emit_eval.add_argument("--key", help="use an existing 32 byte raw Ed25519 seed file")
+    emit_eval.add_argument("--new-key", help="generate a signing key and save it to this file")
+    emit_eval.set_defaults(func=_cmd_emit_eval)
+    show_eval = sub.add_parser("show-eval", help="verify an eval receipt and print the claim")
+    show_eval.add_argument("receipt", help="path to the eval receipt bundle JSON")
+    show_eval.set_defaults(func=_cmd_show_eval)
     return parser
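
Note on `--key` versus `--new-key`: in this release `_resolve_signer` checks by hand that exactly one of the two options is given. As a point of comparison only, the sketch below expresses the same rule with argparse's own mutually exclusive group. This is an alternative, not what `cli.py` does, and `build_sketch_parser` is a hypothetical name.

```python
import argparse


def build_sketch_parser() -> argparse.ArgumentParser:
    """Sketch of an emit-eval subcommand where argparse enforces the key rule."""
    parser = argparse.ArgumentParser(prog="proofbundle-sketch")
    sub = parser.add_subparsers(dest="command", required=True)
    emit_eval = sub.add_parser("emit-eval")
    emit_eval.add_argument("--claim", required=True)
    emit_eval.add_argument("--out", required=True)
    keys = emit_eval.add_mutually_exclusive_group(required=True)
    keys.add_argument("--key")      # existing seed file
    keys.add_argument("--new-key")  # generate and save a new key
    return parser


args = build_sketch_parser().parse_args(
    ["emit-eval", "--claim", "claim.json", "--out", "receipt.json", "--new-key", "signer.key"]
)
print(args.new_key, args.key)  # signer.key None
```

With this variant argparse exits with status 2 on a violation, the same code the hand-written check returns, but the error text is argparse's rather than the project's own message.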

proofbundle-0.4.0/src/proofbundle/evalclaim.py ADDED

@@ -0,0 +1,212 @@
+"""Eval receipts (v0.4): sign + Merkle-anchor a canonical eval CLAIM.
+A receipt proves exactly one thing — *suite S scored `comparator` threshold T,
+passed=…* — carrying only SALTED commitments to the model and dataset identifiers,
+never the weights, the data, or the plaintext names. A third party verifies the
+threshold was met, offline, from one file, without ever seeing the model or dataset.
+Honest scope (see EVAL_CLAIM.md): the receipt proves `passed` against `threshold`
+and hides the model/dataset via salted commitments. It does NOT prove the evaluation
+itself was well designed or that the suite measures what it claims — those are human
+judgements. What it removes is the need to simply *trust the number*.
+Layering: the claim payload is canonicalized with RFC 8785 JCS **only on the emit
+path** (a lazy dependency). The verify path (`decode_eval_claim`) never canonicalizes —
+it checks the exact stored bytes that `verify_bundle` already authenticated — so the
+verifier stays dependency-free (cryptography + stdlib only).
+"""
+from __future__ import annotations
+import base64
+import hashlib
+import json
+import os
+import unicodedata
+from typing import Optional, Sequence
+from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
+from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat
+from .bundle import load_bundle, verify_bundle
+from .emit import emit_bundle
+EVAL_CLAIM_SCHEMA = "proofbundle/eval-claim/v0.1"
+COMMIT_ALG = "sha256-salted-v1"
+_COMPARATORS = {">=", ">", "<=", "<"}
+_MAX_SAFE_INT = 2 ** 53 - 1
+# The exact key set of an eval claim; decode/validate reject anything else.
+_REQUIRED = {"schema", "suite", "suite_version", "metric", "comparator", "threshold",
+             "passed", "n", "model_id_commit", "dataset_id_commit", "commit_alg", "issuer", "timestamp"}
+_OPTIONAL = {"context_binding", "ci95", "multiple_testing", "prereg_sha256"}
+__all__ = [
+    "EVAL_CLAIM_SCHEMA", "COMMIT_ALG", "canonicalize", "build_eval_claim",
+    "emit_eval_receipt", "decode_eval_claim", "salted_commit", "issuer_fingerprint",
+]
+class EvalClaimError(ValueError):
+    """Raised for a malformed eval claim (float in payload, non-NFC string, unsafe int, …)."""
+def issuer_fingerprint(signer: Ed25519PrivateKey) -> str:
+    """The `issuer` field value: ed25519:<base64 of the 32-byte raw public key>."""
+    raw = signer.public_key().public_bytes(Encoding.Raw, PublicFormat.Raw)
+    return "ed25519:" + base64.b64encode(raw).decode("ascii")
+def salted_commit(identifier: str, salt: bytes) -> str:
+    """Salted commitment to an identifier: sha256:<hex> over salt || utf8(identifier).
+    The salt (>=16 bytes, high entropy) stays with the issuer and is NEVER in the payload,
+    so the identifier cannot be recovered from the commitment — not even via a rainbow table
+    over known model names like gpt-4o.
+    """
+    if len(salt) < 16:
+        raise EvalClaimError("commitment salt must be at least 16 bytes")
+    return "sha256:" + hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()
+def _reject_non_jcs(value) -> None:
+    """Recursively reject values that RFC 8785 / this profile forbids in a claim."""
+    if isinstance(value, bool):
+        return
+    if isinstance(value, float):
+        raise EvalClaimError("float values are forbidden; use a decimal STRING (e.g. \"0.80\")")
+    if isinstance(value, int):
+        if abs(value) > _MAX_SAFE_INT:
+            raise EvalClaimError(f"integer {value} exceeds the IEEE-754 safe range (2**53-1)")
+        return
+    if isinstance(value, str):
+        if unicodedata.normalize("NFC", value) != value:
+            raise EvalClaimError("string is not NFC-normalized")
+        return
+    if value is None:
+        return
+    if isinstance(value, dict):
+        for v in value.values():
+            _reject_non_jcs(v)
+        return
+    if isinstance(value, (list, tuple)):
+        for v in value:
+            _reject_non_jcs(v)
+        return
+    raise EvalClaimError(f"unsupported value type {type(value).__name__}")
+def canonicalize(claim: dict) -> bytes:
+    """RFC 8785 JCS canonical bytes of a claim — EMIT PATH ONLY.
+    Enforces the profile before serializing: no Python float, NFC strings, safe-range ints.
+    Duplicate keys cannot exist in a Python dict; when parsing claim JSON from text, use
+    `load_claim_text` which rejects duplicate keys. Uses the rfc8785 library (lazy import)
+    for the UTF-16 code-unit key sort + compact UTF-8 serialization.
+    """
+    _reject_non_jcs(claim)
+    import rfc8785  # noqa: PLC0415 — lazy: only the emit path pulls the JCS dependency
+    try:
+        return rfc8785.dumps(claim)
+    except (rfc8785.FloatDomainError, rfc8785.IntegerDomainError, rfc8785.CanonicalizationError) as e:
+        raise EvalClaimError(f"canonicalization failed: {e}") from e
+def load_claim_text(text: str) -> dict:
+    """Parse claim JSON text, rejecting duplicate keys (JCS forbids them)."""
+    def _no_dupes(pairs):
+        seen = {}
+        for k, v in pairs:
+            if k in seen:
+                raise EvalClaimError(f"duplicate key {k!r} in claim JSON")
+            seen[k] = v
+        return seen
+    return json.loads(text, object_pairs_hook=_no_dupes)
+def build_eval_claim(*, suite: str, suite_version: str, metric: str, comparator: str,
+                     threshold: str, score: str, n: int, model_id: str, dataset_id: str,
+                     issuer: str, timestamp: str, context_binding: Optional[str] = None,
+                     ci95: Optional[Sequence[str]] = None, multiple_testing: Optional[str] = None,
+                     prereg_sha256: Optional[str] = None,
+                     model_salt: Optional[bytes] = None, dataset_salt: Optional[bytes] = None):
+    """Build a valid eval claim from raw values. Computes `passed` ITSELF from the comparator
+    (never trusts the caller), creates salted commitments, and returns (claim, salts) with the
+    salts SEPARATE (never in the payload).
+    threshold/score are decimal STRINGS (never floats). Returns:
+        (claim: dict, salts: {"model_salt": bytes, "dataset_salt": bytes})
+    """
+    if comparator not in _COMPARATORS:
+        raise EvalClaimError(f"comparator must be one of {sorted(_COMPARATORS)}")
+    for name, val in (("threshold", threshold), ("score", score)):
+        if not isinstance(val, str):
+            raise EvalClaimError(f"{name} must be a decimal STRING, not {type(val).__name__}")
+    from decimal import Decimal, InvalidOperation  # noqa: PLC0415
+    try:
+        s, t = Decimal(score), Decimal(threshold)
+    except InvalidOperation as e:
+        raise EvalClaimError(f"threshold/score are not valid decimals: {e}") from e
+    passed = {">=": s >= t, ">": s > t, "<=": s <= t, "<": s < t}[comparator]
+    m_salt = model_salt if model_salt is not None else os.urandom(16)
+    d_salt = dataset_salt if dataset_salt is not None else os.urandom(16)
+    claim = {
+        "schema": EVAL_CLAIM_SCHEMA, "suite": suite, "suite_version": suite_version,
+        "metric": metric, "comparator": comparator, "threshold": threshold, "passed": passed,
+        "n": n, "model_id_commit": salted_commit(model_id, m_salt),
+        "dataset_id_commit": salted_commit(dataset_id, d_salt), "commit_alg": COMMIT_ALG,
+        "issuer": issuer, "timestamp": timestamp,
+    }
+    if context_binding is not None:
+        claim["context_binding"] = context_binding
+    if ci95 is not None:
+        claim["ci95"] = [str(x) for x in ci95]
+    if multiple_testing is not None:
+        claim["multiple_testing"] = multiple_testing
+    if prereg_sha256 is not None:
+        claim["prereg_sha256"] = prereg_sha256
+    _reject_non_jcs(claim)
+    return claim, {"model_salt": m_salt, "dataset_salt": d_salt}
+def emit_eval_receipt(claim: dict, signer: Ed25519PrivateKey, *, prior_leaves: Sequence[bytes] = (),
+                      sd_jwt: Optional[dict] = None) -> dict:
+    """Emit a proofbundle/v0.1 bundle whose payload is the canonical eval claim.
+    Sets `issuer` to the signer's fingerprint automatically (binding the receipt to the key),
+    canonicalizes, and calls emit_bundle. The returned bundle is verified unchanged by verify_bundle.
+    """
+    claim = dict(claim)
+    claim["issuer"] = issuer_fingerprint(signer)
+    missing = _REQUIRED - set(claim)
+    if missing:
+        raise EvalClaimError(f"claim missing required fields: {sorted(missing)}")
+    extra = set(claim) - _REQUIRED - _OPTIONAL
+    if extra:
+        raise EvalClaimError(f"claim has unknown fields: {sorted(extra)}")
+    payload = canonicalize(claim)
+    return emit_bundle(payload, signer, prior_leaves=prior_leaves, sd_jwt_vc=sd_jwt)
+def decode_eval_claim(bundle) -> Optional[dict]:
+    """Verify the bundle, then check the signing key matches the claim's `issuer` field.
+    Returns the parsed claim on success, None on any failure. Dependency-free (no JCS import):
+    it re-reads the exact stored payload bytes that verify_bundle already authenticated.
+    """
+    result = verify_bundle(bundle)
+    if not result.ok:
+        return None
+    if isinstance(bundle, str):
+        bundle = load_bundle(bundle)   # a str is a PATH (consistent with verify_bundle)
+    try:
+        payload = base64.b64decode(bundle["payload_b64"])
+        claim = load_claim_text(payload.decode("utf-8"))
+        if claim.get("schema") != EVAL_CLAIM_SCHEMA:
+            return None
+        # Issuer binding: the claim's issuer must be the key that signed the bundle.
+        sig_pub_b64 = bundle["signature"]["public_key_b64"]
+        want = "ed25519:" + base64.b64encode(base64.b64decode(sig_pub_b64)).decode("ascii")
+        if claim.get("issuer") != want:
+            return None
+        return claim
+    except (KeyError, ValueError, EvalClaimError):
+        return None
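
Note on duplicate-key rejection in `load_claim_text`: Python's `json.loads` keeps the last value when a key repeats, so a claim text could carry two conflicting `passed` values without any error. The sketch below shows that default behaviour and the `object_pairs_hook` technique the release uses to reject it. `DuplicateKeyError` and `reject_duplicate_keys` are stand-in names for this example; the release raises `EvalClaimError` from a nested `_no_dupes` function.

```python
import json


class DuplicateKeyError(ValueError):
    """Stand-in for EvalClaimError in this sketch."""


def reject_duplicate_keys(pairs):
    """object_pairs_hook that refuses a JSON object with a repeated key."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise DuplicateKeyError(f"duplicate key {key!r} in claim JSON")
        seen[key] = value
    return seen


text = '{"passed": false, "passed": true}'
print(json.loads(text))  # {'passed': True}: the last duplicate silently wins

try:
    json.loads(text, object_pairs_hook=reject_duplicate_keys)
except DuplicateKeyError as exc:
    print("rejected:", exc)
```

The hook runs for every nested object as well, so duplicates are caught at any depth of the claim.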

{proofbundle-0.3.0 → proofbundle-0.4.0/src/proofbundle.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: proofbundle
-Version: 0.3.0
+Version: 0.4.0
 Summary: Emit and verify portable cryptographic evidence bundles, offline: Ed25519 + RFC 6962 Merkle + optional SD-JWT.
 Author: Konrad Gruszka
 License: MIT
@@ -24,6 +24,9 @@ Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: cryptography>=42
 Provides-Extra: sdjwt
+Provides-Extra: eval
+Requires-Dist: rfc8785>=0.1.4; extra == "eval"
+Provides-Extra: adapters
 Provides-Extra: dev
 Requires-Dist: pytest>=7; extra == "dev"
 Requires-Dist: ruff>=0.5; extra == "dev"
@@ -31,6 +34,7 @@ Requires-Dist: jsonschema>=4; extra == "dev"
 Requires-Dist: mypy>=1.8; extra == "dev"
 Requires-Dist: build>=1; extra == "dev"
 Requires-Dist: hypothesis>=6; extra == "dev"
+Requires-Dist: rfc8785>=0.1.4; extra == "dev"
 Dynamic: license-file
 <div align="center">
@@ -68,6 +72,7 @@ checks → `OK` or `FAILED`. No network, no daemon, no own crypto. 25 tests.
 - [Quickstart](#quickstart)
 - [Interoperability](#interoperability)
 - [Bundle format](#bundle-format-proofbundlev01)
+- [Eval receipts](#eval-receipts)
 - [Security notes and scope](#security-notes-and-scope-stated-honestly)
 - [Roadmap](#roadmap)
 - [Contributing](#contributing)
@@ -255,25 +260,37 @@ This is v0.1. It does exactly what it says and no more:
 If you find a correctness or security issue, please open an issue or see
 [SECURITY.md](SECURITY.md).
+## Eval receipts
+Since v0.4, proofbundle turns a reproducible eval run into a signed, Merkle-anchored
+**receipt** that proves *suite S `comparator` threshold T, passed* while carrying only
+**salted commitments** to the model and dataset identifiers — never the weights, the
+data, or the plaintext names. A third party verifies the threshold was met, offline,
+from one file, without ever seeing the model or the test set.
+```bash
+pip install "proofbundle[eval]"          # emit side needs an RFC 8785 canonicalizer
+proofbundle emit-eval --claim claim.json --out receipt.json --new-key signer.key
+proofbundle verify receipt.json          # a receipt is a normal bundle
+proofbundle show-eval receipt.json       # verify + print the claim (issuer-bound)
+```
+The claim format is specified in [EVAL_CLAIM.md](EVAL_CLAIM.md); the emit path uses
+RFC 8785 JCS canonicalization, the verify path stays dependency-free. **Honest scope:**
+a receipt proves `passed` against `threshold` and hides the model/dataset via salted
+commitments — it does **not** prove the evaluation was well designed or that the score
+itself is correct. Those are human judgements; what it removes is the need to simply
+trust the number.
 ## Roadmap
 - **v0.1** — the offline verifier plus a real example bundle.
-- **v0.2 (current release)** — the emitter: `emit_bundle` signs a payload with
-  Ed25519 and anchors it as the last leaf of an RFC 6962 Merkle tree, producing
-  a bundle that `verify_bundle` accepts. Available as `proofbundle emit`.
-- **v0.3** — an eval-receipt emitter: wrap one evaluation framework run
-  ([Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai),
-  [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness))
-  into a signed receipt whose payload is a minimal canonical claim, for example
-  `{"suite": "...", "threshold": 0.8, "passed": true}`, optionally wrapped as an
-  SD-JWT VC so a holder can disclose *passed above threshold* without revealing
-  the model, weights or dataset, and carrying a cluster-bootstrap confidence
-  interval, a multiple-testing correction and a preregistration hash.
-That last step is the point: today no widely used AI project turns a
-reproducible evaluation result into a signed, third-party-verifiable,
-selectively disclosable receipt. This repository is the trustworthy verification
-core that makes it possible.
+- **v0.2** — the emitter: `emit_bundle` / `proofbundle emit`.
+- **v0.3** — external RFC 6962 conformance vectors + real Sigstore Rekor interop.
+- **v0.4 (current release)** — the eval-receipt emitter (`emit_eval_receipt` /
+  `proofbundle emit-eval`), salted commitments, issuer binding, file-based adapters.
+- **v0.5** — selective disclosure of the exact score via SD-JWT **issuance** (the issuer
+  reveals identifier + salt on demand) and full SD-JWT VC conformance.
 ## Contributing
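
The Eval receipts section in the README diff above describes two mechanisms: identifiers are replaced by salted commitments, and `passed` records whether the score met the threshold. A minimal sketch of both follows. The construction `sha256(salt || identifier)`, the `sha256:` prefix, and the helper names `sketch_salted_commit` and `sketch_passed` are assumptions for illustration only; the actual `sha256-salted-v1` algorithm is whatever EVAL_CLAIM.md specifies.

```python
import hashlib
import operator
from decimal import Decimal

# Only the comparators that appear in the package's tests are listed here.
_COMPARATORS = {">=": operator.ge, ">": operator.gt, "<": operator.lt}


def sketch_salted_commit(identifier: str, salt: bytes) -> str:
    """Hypothetical commitment: SHA-256 over salt || UTF-8 identifier."""
    if len(salt) < 16:
        raise ValueError("salt must be at least 16 bytes")
    digest = hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()
    return "sha256:" + digest


def sketch_passed(score: str, comparator: str, threshold: str) -> bool:
    """Compare decimal strings exactly, with no binary floats involved."""
    return _COMPARATORS[comparator](Decimal(score), Decimal(threshold))


commit_a = sketch_salted_commit("acme/model-x", b"0" * 16)
commit_b = sketch_salted_commit("acme/model-x", b"1" * 16)
print(commit_a != commit_b)                 # True: a different salt changes the commitment
print("acme/model-x" in commit_a)           # False: the plaintext id is not in the output
print(sketch_passed("0.80", ">=", "0.80"))  # True
print(sketch_passed("0.80", ">", "0.80"))   # False
```

Without the salt, anyone could hash a list of likely model names and match the commitment, which is why a minimum salt length is enforced in this sketch.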

{proofbundle-0.3.0 → proofbundle-0.4.0}/src/proofbundle.egg-info/SOURCES.txt RENAMED

@@ -6,6 +6,7 @@ src/proofbundle/bundle.py
 src/proofbundle/cli.py
 src/proofbundle/emit.py
 src/proofbundle/errors.py
+src/proofbundle/evalclaim.py
 src/proofbundle/merkle.py
 src/proofbundle/py.typed
 src/proofbundle/sdjwt.py
@@ -16,12 +17,20 @@ src/proofbundle.egg-info/dependency_links.txt
 src/proofbundle.egg-info/entry_points.txt
 src/proofbundle.egg-info/requires.txt
 src/proofbundle.egg-info/top_level.txt
+src/proofbundle/adapters/__init__.py
+src/proofbundle/adapters/inspect_ai.py
+src/proofbundle/adapters/lm_eval.py
+tests/test_adapters.py
 tests/test_bundle.py
 tests/test_cli.py
+tests/test_cli_eval.py
 tests/test_emit.py
+tests/test_eval_claim_schema.py
+tests/test_evalclaim.py
 tests/test_merkle.py
 tests/test_merkle_property.py
 tests/test_rekor_interop.py
 tests/test_rfc6962_external_vectors.py
 tests/test_schema.py
+tests/test_sdjwt_reference.py
 tests/test_signature.py

{proofbundle-0.3.0 → proofbundle-0.4.0}/src/proofbundle.egg-info/requires.txt RENAMED

@@ -1,5 +1,7 @@
 cryptography>=42
+[adapters]
 [dev]
 pytest>=7
 ruff>=0.5
@@ -7,5 +9,9 @@ jsonschema>=4
 mypy>=1.8
 build>=1
 hypothesis>=6
+rfc8785>=0.1.4
+[eval]
+rfc8785>=0.1.4
 [sdjwt]

proofbundle-0.4.0/tests/test_adapters.py ADDED

@@ -0,0 +1,32 @@
+"""Adapters map real exported eval JSON to a valid claim (file-based, no framework import)."""
+import unittest
+from pathlib import Path
+from proofbundle.adapters import from_inspect_ai_log, from_lm_eval_results
+FX = Path(__file__).resolve().parent / "fixtures"
+TS = "2026-07-01T12:00:00Z"
+class TestAdapters(unittest.TestCase):
+    def test_lm_eval(self):
+        claim, salts = from_lm_eval_results(FX / "lm_eval_results.json", "hellaswag", "acc",
+                                            comparator=">=", threshold="0.70", timestamp=TS,
+                                            model_salt=b"0" * 16, dataset_salt=b"1" * 16)
+        self.assertEqual(claim["suite"], "hellaswag")
+        self.assertEqual(claim["threshold"], "0.70")
+        self.assertTrue(claim["passed"])              # 0.7534 >= 0.70
+        self.assertNotIn("acme/model-x", str(claim))  # id only as salted commitment
+        self.assertEqual(claim["n"], 10042)
+    def test_inspect_ai(self):
+        claim, salts = from_inspect_ai_log(FX / "inspect_ai_log.json", "accuracy",
+                                           comparator=">=", threshold="0.80", timestamp=TS,
+                                           model_salt=b"0" * 16, dataset_salt=b"1" * 16)
+        self.assertEqual(claim["suite"], "safety_refusal")
+        self.assertTrue(claim["passed"])              # 0.92 >= 0.80
+        self.assertEqual(claim["n"], 500)
+if __name__ == "__main__":
+    unittest.main()
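
The adapter tests above work from exported JSON files rather than importing the eval frameworks. One way such a file-based reader could pull a score without turning it into a binary float is sketched below. The file layout (`results` → task → metric) and the helper name `sketch_read_metric` are hypothetical; they are not the layout of real lm-evaluation-harness or Inspect AI exports, nor the package's adapter code.

```python
import json
import tempfile
from decimal import Decimal
from pathlib import Path


def sketch_read_metric(path, task, metric):
    """Read one metric from an exported results file (hypothetical layout)."""
    # parse_float=str keeps each JSON number's original text instead of a float.
    data = json.loads(Path(path).read_text(encoding="utf-8"), parse_float=str)
    return data["results"][task][metric]


with tempfile.TemporaryDirectory() as d:
    export = Path(d) / "results.json"
    export.write_text('{"results": {"hellaswag": {"acc": 0.7534}}}', encoding="utf-8")
    score = sketch_read_metric(export, "hellaswag", "acc")

print(score)                              # 0.7534 (a str, not a float)
print(Decimal(score) >= Decimal("0.70"))  # True
```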

proofbundle-0.4.0/tests/test_cli_eval.py ADDED

@@ -0,0 +1,39 @@
+"""CLI emit-eval + show-eval end-to-end (round-trip through the process boundary)."""
+import json
+import os
+import subprocess
+import sys
+import tempfile
+import unittest
+from pathlib import Path
+REPO = Path(__file__).resolve().parents[1]
+def _run(*args, **kw):
+    # Extend the parent environment rather than replacing it, so variables the
+    # interpreter may need (PATH, SYSTEMROOT on Windows) reach the child process.
+    env = {**os.environ, "PYTHONPATH": str(REPO / "src"), **kw.get("env", {})}
+    return subprocess.run([sys.executable, "-m", "proofbundle.cli", *args],
+                          capture_output=True, text=True, cwd=REPO, env=env)
+class TestCliEval(unittest.TestCase):
+    def test_emit_eval_then_verify_and_show(self):
+        with tempfile.TemporaryDirectory() as d:
+            claim = os.path.join(d, "claim.json")
+            Path(claim).write_text(json.dumps({
+                "schema": "proofbundle/eval-claim/v0.1", "suite": "s", "suite_version": "v1",
+                "metric": "acc", "comparator": ">=", "threshold": "0.80", "passed": True, "n": 100,
+                "model_id_commit": "sha256:x", "dataset_id_commit": "sha256:y",
+                "commit_alg": "sha256-salted-v1", "issuer": "ed25519:z",
+                "timestamp": "2026-07-01T12:00:00Z"}), encoding="utf-8")
+            out = os.path.join(d, "receipt.json")
+            key = os.path.join(d, "k.key")
+            self.assertEqual(_run("emit-eval", "--claim", claim, "--out", out, "--new-key", key).returncode, 0)
+            self.assertEqual(_run("verify", out).returncode, 0)
+            show = _run("show-eval", out)
+            self.assertEqual(show.returncode, 0)
+            self.assertIn("passed", show.stdout)
+if __name__ == "__main__":
+    unittest.main()
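
The CLI test above crosses a process boundary with `subprocess.run`. When a child environment is passed explicitly, it replaces the parent's environment entirely, so a common pattern is to start from `os.environ` and layer overrides on top. The sketch below shows that pattern in isolation; `run_python` and `SKETCH_FLAG` are hypothetical names, not part of proofbundle.

```python
import os
import subprocess
import sys


def run_python(code, extra_env=None):
    """Run `python -c code` in a child process with an extended environment."""
    env = {**os.environ, **(extra_env or {})}  # inherit, then override
    return subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, env=env)


result = run_python("import os; print(os.environ['SKETCH_FLAG'])",
                    {"SKETCH_FLAG": "on"})
print(result.returncode, result.stdout.strip())  # 0 on
```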

proofbundle-0.4.0/tests/test_eval_claim_schema.py ADDED

@@ -0,0 +1,36 @@
+"""An emitted eval claim validates against schemas/eval_claim_v0_1.schema.json."""
+import json
+import unittest
+from pathlib import Path
+try:
+    import jsonschema
+except ImportError:  # pragma: no cover
+    jsonschema = None
+from proofbundle.emit import generate_signer
+from proofbundle.evalclaim import build_eval_claim, issuer_fingerprint
+ROOT = Path(__file__).resolve().parents[1]
+SCHEMA = ROOT / "schemas" / "eval_claim_v0_1.schema.json"
+@unittest.skipIf(jsonschema is None, "jsonschema not installed (pip install -e .[dev])")
+class TestEvalClaimSchema(unittest.TestCase):
+    def test_schema_valid(self):
+        jsonschema.Draft202012Validator.check_schema(json.loads(SCHEMA.read_text(encoding="utf-8")))
+    def test_built_claim_matches_schema(self):
+        signer = generate_signer()
+        claim, _ = build_eval_claim(
+            suite="s", suite_version="v1", metric="acc", comparator=">=", threshold="0.80",
+            score="0.92", n=500, model_id="m", dataset_id="d",
+            issuer=issuer_fingerprint(signer), timestamp="2026-07-01T12:00:00Z",
+            model_salt=b"0" * 16, dataset_salt=b"1" * 16)
+        jsonschema.validate(instance=claim, schema=json.loads(SCHEMA.read_text(encoding="utf-8")))
+    def test_schema_rejects_float_threshold(self):
+        schema = json.loads(SCHEMA.read_text(encoding="utf-8"))
+        bad = {"schema": "proofbundle/eval-claim/v0.1", "threshold": 0.80}
+        with self.assertRaises(jsonschema.ValidationError):
+            jsonschema.validate(instance=bad, schema=schema)
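
The schema test above rejects a float `threshold`. A short standard-library illustration of why a decimal string is the safer carrier: a float loses its written form on serialization and does not compare exactly, while a string round-trips byte for byte. This only demonstrates the motivation and makes no claim about how the package's schema expresses the rule.

```python
import json
from decimal import Decimal

as_float = json.dumps({"threshold": 0.80})
as_string = json.dumps({"threshold": "0.80"})

print(as_float)   # {"threshold": 0.8}     the trailing zero is gone
print(as_string)  # {"threshold": "0.80"}  the written form survives

print(0.1 + 0.2 == 0.3)                                   # False
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True
```

Because a signature covers exact bytes, any value whose serialization can vary between producers is a liability in a signed payload.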

proofbundle-0.4.0/tests/test_evalclaim.py ADDED

@@ -0,0 +1,107 @@
+"""Eval-receipt (v0.4) tests — No-Fake, one red-test per new invariant."""
+import base64
+import json
+import unittest
+from proofbundle import verify_bundle
+from proofbundle.emit import generate_signer
+from proofbundle.evalclaim import (
+    EvalClaimError,
+    build_eval_claim,
+    canonicalize,
+    decode_eval_claim,
+    emit_eval_receipt,
+    issuer_fingerprint,
+    salted_commit,
+)
+TS = "2026-07-01T12:00:00Z"
+def _claim(signer, score="0.92", threshold="0.80", comparator=">="):
+    claim, salts = build_eval_claim(
+        suite="safety-refusal", suite_version="v1", metric="refusal_rate",
+        comparator=comparator, threshold=threshold, score=score, n=500,
+        model_id="acme/model-x", dataset_id="acme/dataset-y",
+        issuer=issuer_fingerprint(signer), timestamp=TS,
+        model_salt=b"0" * 16, dataset_salt=b"1" * 16)
+    return claim, salts
+class TestEvalClaim(unittest.TestCase):
+    def test_round_trip(self):
+        signer = generate_signer()
+        claim, _ = _claim(signer)
+        bundle = emit_eval_receipt(claim, signer)
+        self.assertTrue(verify_bundle(bundle).ok)
+        decoded = decode_eval_claim(bundle)
+        self.assertIsNotNone(decoded)
+        self.assertEqual(decoded["suite"], "safety-refusal")
+        self.assertTrue(decoded["passed"])
+    def test_determinism_emoji_and_nfc(self):
+        # Non-NFC content is rejected; a key beyond the BMP canonicalizes
+        # identically regardless of insertion order.
+        c = {"schema": "x", "\U0001F600z": "cafe\u0301"}  # NFD: 'e' + combining acute
+        with self.assertRaises(EvalClaimError):
+            canonicalize(c)  # non-NFC string rejected
+        c2 = {"b": "1", "\U0001F600": "ok", "a": "2"}
+        self.assertEqual(canonicalize(c2), canonicalize(dict(reversed(list(c2.items())))))
+    def test_duplicate_keys_rejected(self):
+        from proofbundle.evalclaim import load_claim_text
+        with self.assertRaises(EvalClaimError):
+            load_claim_text('{"a": 1, "a": 2}')
+    def test_float_guard_red(self):
+        with self.assertRaises(EvalClaimError):
+            canonicalize({"schema": "x", "threshold": 0.80})  # a Python float is forbidden
+    def test_passed_integrity_at_boundary(self):
+        signer = generate_signer()
+        eq, _ = _claim(signer, score="0.80", threshold="0.80", comparator=">=")
+        self.assertTrue(eq["passed"])
+        gt, _ = _claim(signer, score="0.80", threshold="0.80", comparator=">")
+        self.assertFalse(gt["passed"])
+        lt, _ = _claim(signer, score="0.79", threshold="0.80", comparator="<")
+        self.assertTrue(lt["passed"])
+    def test_issuer_binding_red(self):
+        signer = generate_signer()
+        claim, _ = _claim(signer)
+        bundle = emit_eval_receipt(claim, signer)
+        # Swap the issuer field for another key's fingerprint without re-signing.
+        import copy
+        b2 = copy.deepcopy(bundle)
+        other = issuer_fingerprint(generate_signer())
+        payload = json.loads(base64.b64decode(b2["payload_b64"]).decode("utf-8"))
+        payload["issuer"] = other
+        b2["payload_b64"] = base64.b64encode(canonicalize(payload)).decode("ascii")
+        # The signature no longer matches the new payload, so verify_bundle fails and
+        # decode_eval_claim returns None. Exercising the claim.issuer != signing-key
+        # check on its own would need the tampered payload re-signed by the same signer.
+        self.assertIsNone(decode_eval_claim(b2))
+    def test_commitment_hides_identifier(self):
+        c1 = salted_commit("gpt-4o", b"A" * 16)
+        c1b = salted_commit("gpt-4o", b"A" * 16)
+        c2 = salted_commit("gpt-4o", b"B" * 16)
+        self.assertEqual(c1, c1b)          # same id + salt -> same commit
+        self.assertNotEqual(c1, c2)        # different salt -> different commit
+        signer = generate_signer()
+        claim, _ = _claim(signer)
+        payload = json.dumps(claim)
+        self.assertNotIn("acme/model-x", payload)   # plaintext id never in the payload
+        with self.assertRaises(EvalClaimError):
+            salted_commit("x", b"short")             # salt must be >= 16 bytes
+    def test_tamper_red(self):
+        signer = generate_signer()
+        claim, _ = _claim(signer)
+        bundle = emit_eval_receipt(claim, signer)
+        bundle["payload_b64"] = base64.b64encode(b'{"tampered":true}').decode("ascii")
+        self.assertFalse(verify_bundle(bundle).ok)
+        self.assertIsNone(decode_eval_claim(bundle))
+if __name__ == "__main__":
+    unittest.main()
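
Three of the invariants tested above (duplicate keys rejected, floats forbidden, non-NFC strings rejected) can be approximated with the standard library alone. The sketch below is an illustration under that assumption; `SketchClaimError`, `sketch_load_claim_text` and `sketch_check_values` are hypothetical stand-ins, not the implementations in `proofbundle.evalclaim`.

```python
import json
import unicodedata


class SketchClaimError(ValueError):
    """Hypothetical stand-in for the package's EvalClaimError."""


def _no_duplicates(pairs):
    # json.loads hands every object to this hook as a list of (key, value) pairs.
    keys = [key for key, _ in pairs]
    if len(keys) != len(set(keys)):
        raise SketchClaimError("duplicate key in JSON object")
    return dict(pairs)


def sketch_load_claim_text(text):
    return json.loads(text, object_pairs_hook=_no_duplicates)


def sketch_check_values(obj):
    """Recursively reject floats and strings that are not NFC-normalized."""
    if isinstance(obj, float):
        raise SketchClaimError("floats are not allowed; use a decimal string")
    if isinstance(obj, str) and not unicodedata.is_normalized("NFC", obj):
        raise SketchClaimError("string is not NFC-normalized")
    if isinstance(obj, dict):
        for key, value in obj.items():
            sketch_check_values(key)
            sketch_check_values(value)
    elif isinstance(obj, list):
        for value in obj:
            sketch_check_values(value)


def rejected(fn, arg):
    try:
        fn(arg)
    except SketchClaimError:
        return True
    return False


print(rejected(sketch_load_claim_text, '{"a": 1, "a": 2}'))    # True
print(rejected(sketch_check_values, {"threshold": 0.80}))      # True
print(rejected(sketch_check_values, {"name": "cafe\u0301"}))   # True (NFD input)
print(rejected(sketch_check_values, {"name": "caf\u00e9"}))    # False (NFC input)
```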

proofbundle-0.4.0/tests/test_sdjwt_reference.py ADDED

@@ -0,0 +1,39 @@
+"""proofbundle verifies an SD-JWT produced by the EXTERNAL reference library.
+The fixture tests/fixtures/sdjwt_reference_eddsa.json was generated by
+openwallet-foundation-labs/sd-jwt-python (the reference implementation that
+produces the IETF/RFC 9901 examples) with an Ed25519 issuer key and two
+selectively-disclosable claims. proofbundle must verify both the disclosure-digest
+commitments and the EdDSA issuer signature — i.e. it interops with the reference
+tool, not just with its own emitter. No network / no sd-jwt dependency at test
+time; the SD-JWT is committed.
+"""
+import json
+import unittest
+from base64 import b64decode
+from pathlib import Path
+from proofbundle.sdjwt import verify_sd_jwt
+FIXTURE = Path(__file__).resolve().parent / "fixtures" / "sdjwt_reference_eddsa.json"
+@unittest.skipIf(not FIXTURE.exists(), "sd-jwt reference fixture not present")
+class TestSdJwtReference(unittest.TestCase):
+    def setUp(self):
+        self.f = json.loads(FIXTURE.read_text(encoding="utf-8"))
+    def test_source_documented(self):
+        self.assertIn("sd-jwt-python", self.f["source"])
+    def test_proofbundle_verifies_reference_sd_jwt(self):
+        res = verify_sd_jwt(self.f["compact"], b64decode(self.f["issuer_public_key_b64"]))
+        self.assertTrue(res["structure_ok"], res)
+        self.assertTrue(res["sig_ok"], res)
+        self.assertIn("2 disclosure", res["detail"])
+    def test_wrong_issuer_key_is_rejected(self):
+        # a different key must fail the issuer-signature check (no false accept).
+        import os
+        res = verify_sd_jwt(self.f["compact"], os.urandom(32))
+        self.assertFalse(res.get("sig_ok"), res)
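
The docstring above mentions verifying "disclosure-digest commitments". In SD-JWT, each disclosure is a base64url-encoded JSON array `[salt, claim_name, claim_value]`, and the issuer-signed payload lists the base64url SHA-256 digest of each disclosure string in `_sd`. The sketch below covers only that digest check, assuming the default `sha-256` algorithm; the helper names are hypothetical, the salts are placeholders, and the EdDSA signature check that `verify_sd_jwt` also performs is not shown.

```python
import base64
import hashlib
import json


def b64url(data: bytes) -> str:
    """Base64url without padding, as used throughout SD-JWT."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_disclosure(salt: str, name: str, value) -> str:
    return b64url(json.dumps([salt, name, value]).encode("utf-8"))


def disclosure_digest(disclosure: str) -> str:
    # The digest is taken over the ASCII bytes of the encoded disclosure string.
    return b64url(hashlib.sha256(disclosure.encode("ascii")).digest())


def disclosures_match(sd_digests, disclosures) -> bool:
    return all(disclosure_digest(d) in sd_digests for d in disclosures)


d1 = make_disclosure("placeholder-salt-1", "suite", "safety_refusal")
d2 = make_disclosure("placeholder-salt-2", "passed", True)
sd = [disclosure_digest(d1), disclosure_digest(d2)]  # what the issuer would sign

print(disclosures_match(sd, [d1, d2]))  # True
forged = make_disclosure("placeholder-salt-2", "passed", False)
print(disclosures_match(sd, [forged]))  # False
```

A holder can present any subset of the disclosures; a verifier that sees only `d2` still learns nothing about the claim hidden behind the other digest.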