verifyhash 0.1.0
- package/LICENSE +201 -0
- package/README.md +883 -0
- package/cli/abi/ContributionRegistry.json +881 -0
- package/cli/agent.js +2173 -0
- package/cli/anchor-artifact.js +853 -0
- package/cli/anchor.js +400 -0
- package/cli/claim.js +881 -0
- package/cli/core/agent-commit.js +448 -0
- package/cli/core/agent-session.js +598 -0
- package/cli/core/anchor-binding.js +663 -0
- package/cli/core/attestation.js +580 -0
- package/cli/core/evidence-plans.js +495 -0
- package/cli/core/fixtures/evidence-plans/baseline.json +19 -0
- package/cli/core/fulfill-intake.js +1082 -0
- package/cli/core/go-live-preflight.js +481 -0
- package/cli/core/license.js +534 -0
- package/cli/core/manifest.js +243 -0
- package/cli/core/packetseal.js +591 -0
- package/cli/core/registryArtifact.js +49 -0
- package/cli/core/revocation.js +539 -0
- package/cli/core/rfc3161.js +389 -0
- package/cli/core/timestamp.js +482 -0
- package/cli/core/trust-asof.js +479 -0
- package/cli/dataset.js +2950 -0
- package/cli/evidence.js +2227 -0
- package/cli/fulfill-webhook-http.js +438 -0
- package/cli/git.js +220 -0
- package/cli/hash.js +550 -0
- package/cli/identity.js +1072 -0
- package/cli/journal-cli.js +1110 -0
- package/cli/journal-log.js +454 -0
- package/cli/journal.js +334 -0
- package/cli/lineage.js +447 -0
- package/cli/list.js +287 -0
- package/cli/parcel.js +1509 -0
- package/cli/proof.js +578 -0
- package/cli/prove.js +300 -0
- package/cli/receipt.js +631 -0
- package/cli/registry.js +331 -0
- package/cli/reputation.js +344 -0
- package/cli/revocation.js +495 -0
- package/cli/serve-verify-http.js +298 -0
- package/cli/serve-verify.js +333 -0
- package/cli/show.js +339 -0
- package/cli/verify.js +383 -0
- package/cli/vh.js +3927 -0
- package/docs/ADOPT.md +183 -0
- package/docs/ADOPTION.json +11 -0
- package/docs/AGENTTRACE.md +247 -0
- package/docs/ANCHORING.md +167 -0
- package/docs/AUDIT.md +55 -0
- package/docs/CONFORMANCE.md +107 -0
- package/docs/DATALEDGER.md +638 -0
- package/docs/DECIDE.md +47 -0
- package/docs/DECISIONS-PENDING.md +27 -0
- package/docs/DEPLOY-PUBLIC-SITE.md +301 -0
- package/docs/ENGINE-LEDGER.json +12 -0
- package/docs/EVIDENCE.md +519 -0
- package/docs/GO-LIVE.md +66 -0
- package/docs/IDENTITY.md +123 -0
- package/docs/INDEPENDENT-VERIFICATION.md +377 -0
- package/docs/INTEGRITY-JOURNAL.md +337 -0
- package/docs/KEY-LIFECYCLE.md +179 -0
- package/docs/LICENSING.md +46 -0
- package/docs/LINEAGE.md +307 -0
- package/docs/LOOP-AUDIT-2026-07-03.json +580 -0
- package/docs/LOOP-HARDENING-PLAN.md +44 -0
- package/docs/MERKLE-LEAVES.md +113 -0
- package/docs/METRICS.jsonl +31 -0
- package/docs/MORNING.md +204 -0
- package/docs/PILOT.md +444 -0
- package/docs/PROOFPARCEL.md +227 -0
- package/docs/PROOFS.md +262 -0
- package/docs/RECEIPTS.md +341 -0
- package/docs/REPUTATION.md +158 -0
- package/docs/SDK.md +301 -0
- package/docs/STRATEGY-ARCHIVE.md +5055 -0
- package/docs/SUPERVISOR-RUNBOOK.md +52 -0
- package/docs/TRUST-BOUNDARIES.md +335 -0
- package/docs/TRUSTLEDGER.md +1976 -0
- package/docs/USAGE-BUDGET.json +121 -0
- package/docs/VERIFY-SERVICE.md +168 -0
- package/index.js +160 -0
- package/package.json +41 -0
- package/trustledger/build-standalone.js +796 -0
- package/trustledger/cli.js +3179 -0
- package/trustledger/close.js +391 -0
- package/trustledger/corpus.js +159 -0
- package/trustledger/dist/BUILD-PROVENANCE.json +99 -0
- package/trustledger/dist/trustledger-standalone.html +6197 -0
- package/trustledger/dist/trustledger-standalone.html.sha256 +1 -0
- package/trustledger/door-core.js +442 -0
- package/trustledger/fixtures/bank.csv +7 -0
- package/trustledger/fixtures/bank.malformed.csv +3 -0
- package/trustledger/fixtures/bank.noalias.csv +5 -0
- package/trustledger/fixtures/bank.ofx +34 -0
- package/trustledger/fixtures/bank.real.csv +5 -0
- package/trustledger/fixtures/corpus/_shared/prior-close.json +22 -0
- package/trustledger/fixtures/corpus/bank-book-mismatch--benign-twin/inputs.json +14 -0
- package/trustledger/fixtures/corpus/bank-book-mismatch--benign-twin/meta.json +7 -0
- package/trustledger/fixtures/corpus/bank-book-mismatch--out-of-trust/inputs.json +14 -0
- package/trustledger/fixtures/corpus/bank-book-mismatch--out-of-trust/meta.json +7 -0
- package/trustledger/fixtures/corpus/continuity-break--benign-twin/inputs.json +15 -0
- package/trustledger/fixtures/corpus/continuity-break--benign-twin/meta.json +7 -0
- package/trustledger/fixtures/corpus/continuity-break--out-of-trust/inputs.json +15 -0
- package/trustledger/fixtures/corpus/continuity-break--out-of-trust/meta.json +7 -0
- package/trustledger/fixtures/corpus/negative-tenant-ledger--benign-twin/inputs.json +13 -0
- package/trustledger/fixtures/corpus/negative-tenant-ledger--benign-twin/meta.json +7 -0
- package/trustledger/fixtures/corpus/negative-tenant-ledger--out-of-trust/inputs.json +13 -0
- package/trustledger/fixtures/corpus/negative-tenant-ledger--out-of-trust/meta.json +7 -0
- package/trustledger/fixtures/corpus/owner-overdraw--benign-twin/inputs.json +15 -0
- package/trustledger/fixtures/corpus/owner-overdraw--benign-twin/meta.json +7 -0
- package/trustledger/fixtures/corpus/owner-overdraw--out-of-trust/inputs.json +15 -0
- package/trustledger/fixtures/corpus/owner-overdraw--out-of-trust/meta.json +7 -0
- package/trustledger/fixtures/corpus/security-deposit-segregation--benign-twin/inputs.json +16 -0
- package/trustledger/fixtures/corpus/security-deposit-segregation--benign-twin/meta.json +7 -0
- package/trustledger/fixtures/corpus/security-deposit-segregation--out-of-trust/inputs.json +13 -0
- package/trustledger/fixtures/corpus/security-deposit-segregation--out-of-trust/meta.json +7 -0
- package/trustledger/fixtures/corpus/subledger-out-of-balance--benign-twin/inputs.json +13 -0
- package/trustledger/fixtures/corpus/subledger-out-of-balance--benign-twin/meta.json +7 -0
- package/trustledger/fixtures/corpus/subledger-out-of-balance--out-of-trust/inputs.json +13 -0
- package/trustledger/fixtures/corpus/subledger-out-of-balance--out-of-trust/meta.json +7 -0
- package/trustledger/fixtures/e2e/bank.aliased.csv +4 -0
- package/trustledger/fixtures/e2e/bank.csv +4 -0
- package/trustledger/fixtures/e2e/bank.nsf.csv +4 -0
- package/trustledger/fixtures/e2e/quickbooks.csv +6 -0
- package/trustledger/fixtures/e2e/quickbooks.nsf.csv +8 -0
- package/trustledger/fixtures/e2e/rentroll.csv +6 -0
- package/trustledger/fixtures/e2e/rentroll.nsf.csv +8 -0
- package/trustledger/fixtures/e2e/rentroll.short.csv +5 -0
- package/trustledger/fixtures/plans/baseline.json +25 -0
- package/trustledger/fixtures/plans/price-binding.example.json +27 -0
- package/trustledger/fixtures/policy/ambiguous-deposit-example.json +12 -0
- package/trustledger/fixtures/policy/baseline.json +19 -0
- package/trustledger/fixtures/policy/ca-example.json +12 -0
- package/trustledger/fixtures/policy/negative-tenant-ledger-example.json +12 -0
- package/trustledger/fixtures/policy/owner-overdraw-example.json +12 -0
- package/trustledger/fixtures/quickbooks.csv +7 -0
- package/trustledger/fixtures/quickbooks.real.csv +5 -0
- package/trustledger/fixtures/rentroll.csv +6 -0
- package/trustledger/fixtures/rentroll.real.csv +4 -0
- package/trustledger/ingest.js +1163 -0
- package/trustledger/lib/policy-bundled-loader.js +44 -0
- package/trustledger/lib/sha256-vendored.js +227 -0
- package/trustledger/license.js +563 -0
- package/trustledger/match.js +551 -0
- package/trustledger/plans.js +551 -0
- package/trustledger/policy.js +398 -0
- package/trustledger/public/index.html +512 -0
- package/trustledger/reconcile.js +1486 -0
- package/trustledger/report.js +887 -0
- package/trustledger/seal.js +854 -0
- package/trustledger/server.js +391 -0
- package/trustledger/valueproof.js +350 -0
@@ -0,0 +1,638 @@

# DataLedger — verifiable AI training-data provenance

DataLedger turns a training dataset into a **reproducible, tamper-evident manifest** and a small set
of artifacts a data-provenance reviewer (enterprise due-diligence, EU AI Act technical documentation)
actually consumes. It runs on the same path-bound Merkle core as `vh hash`/`vh prove`, so every claim
it makes is independently re-derivable.

Every DataLedger command is **offline and needs NO network**, and every command except
`vh dataset sign` (which reads a key YOU supply) **needs NO private key**. You can hand a
manifest, a diff, a summary, or a single-file proof to a third party and they can re-derive the result
on an air-gapped machine with only the `vh` CLI — they do not have to trust your server, your build
machine, or you.

> **Read this first:** the trust posture below is the SAME wording carried in-band in every artifact
> (`cli/dataset.js` › `TRUST_NOTE` / `MEMBERSHIP_TRUST_NOTE`) and in
> [`docs/TRUST-BOUNDARIES.md`](TRUST-BOUNDARIES.md). Do not overclaim past it.

---

## What DataLedger PROVES

A DataLedger manifest commits to a Merkle root over the full set of `(relPath, content)` pairs in a
dataset — **file names AND bytes**. From that root and the manifest the following are
re-derivable by anyone, offline:

1. **Exactly which files a dataset contained — names and bytes.** The root commits to the complete
   `(relPath, content)` set. Any edit, rename, add, or remove changes the root. A manifest whose
   recomputed root (from the bytes on disk) matches its recorded root is byte-for-byte the dataset it
   claims to be — a hand-edited manifest root cannot fake a `MATCH`, because `vh dataset verify`
   re-derives the root from the actual file bytes, not from the manifest's recorded string.

2. **Offline set-membership of any one file.** `vh dataset prove` builds a Merkle proof that a single
   file (its `relPath` + bytes) was a leaf of the manifest's root, matched by **content** (not by the
   caller's filename). `vh dataset verify-proof` folds that proof back to the recorded root **purely
   offline** — no dataset copy, no manifest, no key, no network. A fabricated or altered file does not
   fold to the root and is REJECTED.

3. **The precise add / remove / change between two dataset versions.** `vh dataset diff` compares two
   manifests offline and reports `ADDED` (in B not A), `REMOVED` (in A not B), and `CHANGED` (same
   `relPath`, different content, old→new). A rename shows as `REMOVED`+`ADDED` because the path is bound
   into the leaf. This answers the most common auditor question — "what changed in the training data
   between model version N and N+1?" — without either dataset on disk.

4. **A provenance / license roll-up.** `vh dataset summary` aggregates the manifest into a histogram of
   the claimed `{source, license}` hints over the **trusted file set** (total `fileCount`, the root,
   counts of files per claimed license/source, and explicit buckets for files with no hint).
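
The first two guarantees above can be illustrated with a small path-bound Merkle tree. This is a minimal sketch under stated assumptions, not the encoding `vh` actually uses: the leaf layout, the `0x00`/`0x01` domain-separation prefixes (a convention borrowed from RFC 6962-style trees), the promote-the-odd-node tree shape, and the names `leafHash`, `merkleRoot`, `prove`, and `foldProof` are all hypothetical.

```javascript
const { createHash } = require("node:crypto");

// SHA-256 over the concatenation of the given buffers.
const sha256 = (...parts) => createHash("sha256").update(Buffer.concat(parts)).digest();

// ASSUMPTION: leaf = H(0x00 || len(relPath) || relPath || content). The real layout may differ.
function leafHash(relPath, content) {
  const p = Buffer.from(relPath, "utf8");
  const len = Buffer.alloc(4);
  len.writeUInt32BE(p.length);
  return sha256(Buffer.from([0x00]), len, p, Buffer.from(content));
}

// ASSUMPTION: interior node = H(0x01 || left || right); an unpaired node is promoted unchanged.
const nodeHash = (left, right) => sha256(Buffer.from([0x01]), left, right);

// Build every level of the tree, leaves first. Files are sorted by relPath so the
// root does not depend on directory-walk order.
function buildLevels(files) {
  if (files.length === 0) throw new Error("empty dataset");
  const sorted = [...files].sort((a, b) => (a.relPath < b.relPath ? -1 : a.relPath > b.relPath ? 1 : 0));
  const levels = [sorted.map((f) => leafHash(f.relPath, f.content))];
  while (levels[levels.length - 1].length > 1) {
    const prev = levels[levels.length - 1];
    const next = [];
    for (let i = 0; i < prev.length; i += 2) {
      next.push(i + 1 < prev.length ? nodeHash(prev[i], prev[i + 1]) : prev[i]);
    }
    levels.push(next);
  }
  return levels;
}

function merkleRoot(files) {
  const levels = buildLevels(files);
  return levels[levels.length - 1][0].toString("hex");
}

// Build a membership proof for (relPath, content); returns null if that exact pair is not a leaf.
function prove(files, relPath, content) {
  const levels = buildLevels(files);
  const leaf = leafHash(relPath, content);
  let index = levels[0].findIndex((h) => h.equals(leaf));
  if (index < 0) return null;
  const path = [];
  for (let depth = 0; depth < levels.length - 1; depth++) {
    const sibling = index ^ 1;
    if (sibling < levels[depth].length) {
      path.push({ hash: levels[depth][sibling].toString("hex"), left: sibling < index });
    }
    index >>= 1;
  }
  return { leaf: leaf.toString("hex"), path, root: levels[levels.length - 1][0].toString("hex") };
}

// Fold a leaf up the sibling path and compare with the recorded root. Needs only the proof.
function foldProof(proof, leafHex = proof.leaf) {
  let acc = Buffer.from(leafHex, "hex");
  for (const step of proof.path) {
    const sibling = Buffer.from(step.hash, "hex");
    acc = step.left ? nodeHash(sibling, acc) : nodeHash(acc, sibling);
  }
  return acc.toString("hex") === proof.root;
}
```

Because the path is hashed into the leaf, renaming a file changes its leaf and therefore the root, which is why a rename surfaces as `REMOVED`+`ADDED` in item 3. Folding needs only the proof object, so the check works without the dataset or the manifest.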

---

## What DataLedger does NOT prove (do not overclaim)

- **It is NOT a timestamp.** A manifest binds a file SET to a root; it says nothing about *when* the
  dataset existed. "Unaltered since date T" is a strictly stronger, time-anchored claim that needs the
  **human-owned signing / timestamp trust-root** — a `needs-human` step recorded in
  [`STRATEGY.md`](../STRATEGY.md) (the loop only BUILDS and locally TESTS; standing up a real signing
  key / timestamp anchor is a human action). Until that trust-root exists, never report or imply
  "unaltered since date T".

- **The `{source, license}` hints are UNTRUSTED self-asserted metadata.** Per-file `hints`
  (source/license) are recorded labeled as untrusted and are **NOT bound into the Merkle root** —
  editing a hint does not change the root, and the summary counts what the dataset **CLAIMS**, it does
  NOT verify any license or source is correct. `(no license hint)` means the manifest asserts nothing,
  NOT that the file is unlicensed.

- **Set-membership ≠ time / authorship / licensing.** A membership proof binds a file to a ROOT; it
  does NOT prove the file is unaltered since a date, who authored it, or under what license — that needs
  the same human-owned signing/timestamp trust-root above.

This is the same boundary the artifacts carry in-band, verbatim, so the caveats can never drift from
the code:

> The Merkle root commits to the full set of (relPath, content) pairs (names AND bytes): any edit, rename, add, or remove changes the root. Per-file `hints` (source/license) are UNTRUSTED, self-asserted metadata — they are NOT bound into the root and prove nothing.
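
To make the "counts what the dataset CLAIMS" point concrete, here is a minimal sketch of a license roll-up. It assumes a hypothetical per-file entry shape `{ relPath, hints }` and a hypothetical function name `licenseRollup`; the ordering rule is also an assumption, chosen only to keep the output stable. Nothing in it looks at file bytes or checks a license, which is exactly the limitation described above.

```javascript
// Tally files by their CLAIMED license. Hints are untrusted labels: this only counts them.
// ASSUMPTION: each manifest entry looks like { relPath, hints?: { license?, source? } }.
function licenseRollup(files) {
  const counts = new Map();
  let missing = 0; // files that assert nothing; kept separate from any claimed label
  for (const file of files) {
    const claimed = file.hints ? file.hints.license : undefined;
    if (typeof claimed === "string") {
      counts.set(claimed, (counts.get(claimed) || 0) + 1);
    } else {
      missing += 1;
    }
  }
  // ASSUMPTION: order by count (descending), then by label, so repeated runs agree.
  const claimed = [...counts.entries()].sort(
    (a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0)
  );
  return { fileCount: files.length, claimed, missing };
}
```

Keeping the missing-hint count in its own field, rather than under a string key, avoids confusing a file that asserts nothing with a file whose hint happens to be the literal text `(no license hint)`.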

---

## Workflow, end to end

> **Run it, don't just read it.** [`examples/run.js`](../examples/run.js) is the executable companion to
> the worked examples below: `node examples/run.js` drives this whole pipeline (`build → check --policy →
> verify → report → attest`, plus the ProofParcel side) against tiny committed sample data, offline and
> with no key, and prints a PASS/FAIL summary — including a deliberately FLAGGED policy violation and a
> caught tamper. It writes only to an OS temp dir, references (does not run) the human-gated sign/timestamp
> steps, and is test-gated by `test/cli.examples.test.js` so it can never drift from this prose. See
> [`examples/README.md`](../examples/README.md).

```
build → diff (between versions) → summary → check (the policy gate) → report (the filed deliverable) → attest (the signing-ready payload) → [human signs, P-3] → verify-attest (offline-verify a signed container) → prove (a single file) → verify-proof

# OR, for an INDEPENDENT timestamp (P-3 Option B):
attest → timestamp-request (the digest your TSA stamps) → [human obtains a token from a TSA] → timestamp-wrap (the token container) → verify-timestamp (offline-verify "an RFC-3161 TSA saw this by date T")
```

| Command | What it does | Offline? Key? Network? |
| --- | --- | --- |
| `vh dataset build <dir> --out <p>` | Write a tamper-evident manifest (Merkle root + per-file leaves; optional untrusted hints) | offline, no key, no network |
| `vh dataset verify <dir> --manifest <p>` | Re-derive the root from a fresh copy on disk + a per-file ADDED/REMOVED/CHANGED diff vs the manifest | offline, no key, no network |
| `vh dataset diff <manifestA> <manifestB>` | Compare two manifests; report the exact change set between versions | offline, no tree, no key, no network |
| `vh dataset summary <manifest>` | Provenance/license roll-up over the trusted file set | offline, no tree, no key, no network |
| `vh dataset check <manifest> --policy <p> [--json]` | GATE the manifest's self-asserted hints against a written license/source policy: PASS/FAIL + the exact violating files; CI-gateable exit 0/3 | offline, no tree, no key, no network |
| `vh dataset report <manifest> [--verify <dir>] [--policy <p>] [--json] [--out <p>]` | Consolidate identity + roll-up + (optional) verify verdict + (optional) policy verdict + caveats into ONE deterministic evidence document the reviewer files | offline, no key, no network |
| `vh dataset attest <manifest> [--json] [--out <p>]` | Emit the canonical, byte-deterministic UNSIGNED attestation payload (root + fileCount + manifestDigest) a human signing/timestamp trust-root will sign | offline, no key, no network |
| `vh dataset sign <manifest> --key-env <VAR>\|--key-file <p> [--out <p>] [--json]` | Sign the UNSIGNED attestation with a key YOU provisioned → the signed container `verify-attest` accepts. Read-only of YOUR key; never generates/persists/logs a key | offline, **caller-supplied key**, no network |
| `vh dataset verify-attest <signed> [--manifest <m>] [--signer <addr>] [--json]` | OFFLINE-verify a SIGNED attestation container: recover the signer, optionally pin the publisher (`--signer`) and bind to your manifest (`--manifest`); ACCEPTED/REJECTED with a CI-gateable exit 0/3 | offline, no key, no network |
| `vh dataset timestamp-request <manifest> [--out <p>] [--json]` | Emit the SHA-256 digest of the canonical attestation bytes — the exact `messageImprint` you submit to your RFC-3161 TSA | offline, no key, no network |
| `vh dataset timestamp-wrap <manifest> --token <p> [--out <p>] [--json]` | Wrap the TSA's returned RFC-3161 token into a verifiable `verifyhash.dataset-attestation-timestamped` container (binds it to the re-derived SHA-256 digest) | offline, no key, no network |
| `vh dataset verify-timestamp <container> [--manifest <m>] [--json]` | OFFLINE-verify a timestamped container: re-derive the digest, confirm the RFC-3161 token binds it, optionally bind to your manifest; ACCEPTED (with genTime / TSA serial / policy OID) or REJECTED; CI-gateable exit 0/3 | offline, no key, no network |
| `vh dataset prove --file <p> --manifest <m> --out <a>` | Build a portable set-membership proof for ONE file | offline, no key, no network |
| `vh dataset verify-proof <proof>` | Fold the membership proof back to the recorded root | purely offline, no dataset, no key, no network |
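
The `attest` and `timestamp-request` rows depend on one idea: the attestation payload must serialize to the same bytes every time, so that its SHA-256 digest is a stable `messageImprint`. The sketch below illustrates that idea only. The canonical form used here (flat JSON, keys sorted, no whitespace) is an assumption, as are the function names and the sample field values; the canonicalization `vh` actually applies is not specified in this document.

```javascript
const { createHash } = require("node:crypto");

// ASSUMPTION: canonical bytes = UTF-8 JSON of a flat object with keys in sorted order.
function canonicalBytes(payload) {
  const ordered = {};
  for (const key of Object.keys(payload).sort()) ordered[key] = payload[key];
  return Buffer.from(JSON.stringify(ordered), "utf8");
}

// The digest a timestamping authority would be asked to stamp (hex-encoded SHA-256).
function timestampDigest(payload) {
  return createHash("sha256").update(canonicalBytes(payload)).digest("hex");
}

// Hypothetical payload carrying the three fields the table names.
const payload = { root: "0xdef", fileCount: 1033, manifestDigest: "0x123" };
const sameFieldsOtherOrder = { manifestDigest: "0x123", fileCount: 1033, root: "0xdef" };

// Key order in memory must not affect the digest.
console.log(timestampDigest(payload) === timestampDigest(sameFieldsOtherOrder));
```

A verifier that re-derives the payload from a manifest and recomputes this digest can then compare it with the digest bound inside the TSA token, which is the check the `verify-timestamp` row describes.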

### Worked example

Manifest a dataset, then snapshot a later version and ask what changed.

```sh
# 1. BUILD a manifest of dataset version 1 (optionally attach untrusted source/license hints).
vh dataset build ./dataset-v1 --out v1.manifest.json --hints ./hints.json
# wrote v1.manifest.json root: 0xabc… fileCount: 1024

# 2. VERIFY a fresh copy on disk re-derives the same root, and localize any drift per file.
vh dataset verify ./dataset-v1 --manifest v1.manifest.json
# MATCH — recomputed root == manifest root (exit 0)

# 3. Later: build version 2's manifest, then DIFF the two versions OFFLINE (no datasets on disk).
vh dataset build ./dataset-v2 --out v2.manifest.json
vh dataset diff v1.manifest.json v2.manifest.json
# DIFFERENT
# ADDED: 12 files
# REMOVED: 3 files
# CHANGED: 5 files (relPath same, content differs) (exit 3)

# 4. SUMMARY: the provenance/license roll-up a reviewer reads.
vh dataset summary v2.manifest.json
# fileCount: 1033 root: 0xdef…
# licenses: { MIT: 900, CC-BY-4.0: 110, (no license hint): 23 }
# TRUST: the file SET is bound into the root; hints are UNTRUSTED self-asserted metadata.

# 5. PROVE one file was a member of v2, as a portable artifact…
vh dataset prove --file ./dataset-v2/img/0007.jpg --manifest v2.manifest.json --out 0007.proof.json
# MEMBER — wrote 0007.proof.json (exit 0)

# 6. …and the reviewer VERIFY-PROOFs it on an air-gapped machine — no dataset, no manifest, no key, no net.
vh dataset verify-proof 0007.proof.json
# CONFIRMED — leaf folds to the recorded root (exit 0)
```
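
The change set in step 3 can be computed from the two manifests alone. The following is a rough sketch of that comparison, assuming each manifest exposes a list of `{ relPath, leaf }` entries; the field name `leaf`, the function name `diffManifests`, and the `IDENTICAL` label are hypothetical.

```javascript
// Compare two manifests' file lists without touching either dataset.
// ASSUMPTION: each entry is { relPath, leaf }, where leaf is the per-file hash as a hex string.
function diffManifests(filesA, filesB) {
  const a = new Map(filesA.map((f) => [f.relPath, f.leaf]));
  const b = new Map(filesB.map((f) => [f.relPath, f.leaf]));
  const added = [];
  const removed = [];
  const changed = [];
  for (const [relPath, leaf] of b) {
    if (!a.has(relPath)) added.push(relPath); // in B, not in A
    else if (a.get(relPath) !== leaf) changed.push({ relPath, from: a.get(relPath), to: leaf });
  }
  for (const relPath of a.keys()) {
    if (!b.has(relPath)) removed.push(relPath); // in A, not in B
  }
  added.sort();
  removed.sort();
  changed.sort((x, y) => (x.relPath < y.relPath ? -1 : x.relPath > y.relPath ? 1 : 0));
  const identical = added.length + removed.length + changed.length === 0;
  // Exit convention from this document: 0 when identical, 3 when the versions differ.
  return { verdict: identical ? "IDENTICAL" : "DIFFERENT", added, removed, changed, exitCode: identical ? 0 : 3 };
}
```

A renamed file has a new `relPath`, so it appears once under `removed` and once under `added` rather than under `changed`.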

Exit codes are CI-friendly: `vh dataset verify` and `vh dataset diff` exit `3` on
mismatch/difference (so a pipeline can gate "the training set changed unexpectedly"), `0` on
match/identical; `vh dataset prove`/`verify-proof` exit `0` MEMBER/CONFIRMED, `3` non-member/rejected;
and `vh dataset check` exits `0` PASS / `3` FAIL (the policy gate, below).

---

## Policy compliance gate

`vh dataset summary` *describes* a dataset's license/source composition; `vh dataset check <manifest>
--policy <p>` **GATES** it. It is the difference between "a provenance report" and "a compliance control":
the control your pipeline runs on every change and the verdict your auditor files (an EU-AI-Act
technical-documentation / enterprise due-diligence packet). It answers the one question a compliance
reviewer and a CI job actually ask — **"does this training set VIOLATE our written policy?"** — as a
deterministic, OFFLINE PASS/FAIL with the exact list of which files broke which rule.

> **Trust posture, FIRST (the same wording the artifact carries in-band, verbatim).** The `{source,
> license}` hints checked here are **UNTRUSTED, self-asserted metadata that are NOT bound into the
> Merkle root.** A PASS means the dataset's self-asserted hints satisfy this policy —
> **NOT that the licenses are genuinely correct.** A `(no license hint)` file ASSERTS NOTHING (the `requireLicense` rule
> is the one that flags it). This NEVER verifies that any license or source is real. It is the same
> boundary every DataLedger artifact carries:
>
> > The Merkle root commits to the full set of (relPath, content) pairs (names AND bytes): any edit, rename, add, or remove changes the root. Per-file `hints` (source/license) are UNTRUSTED, self-asserted metadata — they are NOT bound into the root and prove nothing.

### The policy file

A policy is a small, versioned, strictly-validated JSON document. A corrupt, foreign, or malformed
policy is **rejected outright** (never half-accepted into a surprise verdict). Two fixed fields identify
it, then every RULE field is **optional and combinable**:

| Field | Required | Type | Meaning / match semantics |
| --- | --- | --- | --- |
| `kind` | yes | string | MUST be exactly `verifyhash.dataset-policy`. |
| `schemaVersion` | yes | number | MUST be a supported version (this build understands `1`). |
| `allowLicenses` | no | string[] | A file whose license hint is **NOT** in this list VIOLATES. A file with **no** license hint also violates (it is in no allowlist). |
| `denyLicenses` | no | string[] | A file whose license hint **IS** in this list VIOLATES. A file with **no** license hint does NOT violate (no value to match). |
| `allowSources` | no | string[] | Same as `allowLicenses`, on the `source` hint. |
| `denySources` | no | string[] | Same as `denyLicenses`, on the `source` hint. |
| `requireLicense` | no | boolean | When `true`, every file MUST carry a license hint; a `(no license hint)` file VIOLATES. This is the ONE rule that flags a missing hint. |
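
A strict reader for the table above might look like the following. This is a sketch of the "reject outright" behaviour only: the function name `validatePolicy`, the error messages, and the decision to reject unknown fields are assumptions, not the reader `vh` ships.

```javascript
const LIST_FIELDS = ["allowLicenses", "denyLicenses", "allowSources", "denySources"];
const KNOWN_FIELDS = new Set(["kind", "schemaVersion", "requireLicense", ...LIST_FIELDS]);

// Return the policy unchanged if it is well-formed; throw otherwise (never half-accept).
function validatePolicy(doc) {
  if (doc === null || typeof doc !== "object" || Array.isArray(doc)) {
    throw new Error("policy must be a JSON object");
  }
  if (doc.kind !== "verifyhash.dataset-policy") throw new Error("unexpected policy kind");
  if (doc.schemaVersion !== 1) throw new Error("unsupported schemaVersion");
  for (const key of Object.keys(doc)) {
    // ASSUMPTION: a strict reader refuses fields it does not understand.
    if (!KNOWN_FIELDS.has(key)) throw new Error(`unknown policy field: ${key}`);
  }
  for (const field of LIST_FIELDS) {
    if (field in doc) {
      const list = doc[field];
      if (!Array.isArray(list) || !list.every((v) => typeof v === "string")) {
        throw new Error(`${field} must be an array of strings`);
      }
    }
  }
  if ("requireLicense" in doc && typeof doc.requireLicense !== "boolean") {
    throw new Error("requireLicense must be a boolean");
  }
  return doc;
}
```

Throwing on the first problem, instead of dropping the bad field and continuing, is what keeps a typo such as `denyLicences` from silently turning a rule off.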

**Match semantics (so a verdict is reproducible).** A file's "license hint value" is its
`hints.license` string, or the **absence** of one (no `hints` at all, or `hints` with no `license`);
likewise for `hints.source`. All comparisons against the policy's lists are **CASE-SENSITIVE EXACT STRING
matches** — `"GPL-3.0"` matches only `"GPL-3.0"`, never `"gpl-3.0"` or `"GPL-3.0-or-later"`. A missing
hint is reported with the explicit `(no license hint)` / `(no source hint)` sentinel value, never a
literal string named that.

**The no-rules case.** A policy that declares **no rules** (no list fields, or only empty lists, and
`requireLicense` not `true`) is valid and **trivially PASSes** — every dataset satisfies a policy with no
constraints. `vh dataset check` says so explicitly (`rules evaluated: 0` + a NOTE) so a green check from
an empty policy can never be mistaken for a real gate.
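
The rule table, the match semantics, and the no-rules case fit in one pure function. The sketch below is an illustration under assumptions: the manifest entry shape `{ relPath, hints }`, the function names, and the choice to report a missing hint as the display string `(no license hint)` in `value` are hypothetical, and an empty list is treated as "rule not declared", following the no-rules paragraph.

```javascript
const NO_HINT = { license: "(no license hint)", source: "(no source hint)" };
const cmp = (a, b) => (a < b ? -1 : a > b ? 1 : 0);

// A hint value is the string under hints[key], or undefined when the file asserts nothing.
function hintValue(file, key) {
  const v = file.hints ? file.hints[key] : undefined;
  return typeof v === "string" ? v : undefined;
}

// Pure evaluation: same inputs, same output, no I/O.
function evaluatePolicy(files, policy) {
  const declared = (name) => Array.isArray(policy[name]) && policy[name].length > 0;
  const listRules = [
    { rule: "allowLicenses", key: "license", allow: true },
    { rule: "denyLicenses", key: "license", allow: false },
    { rule: "allowSources", key: "source", allow: true },
    { rule: "denySources", key: "source", allow: false },
  ].filter((r) => declared(r.rule));
  const requireLicense = policy.requireLicense === true;

  const violations = [];
  for (const file of files) {
    for (const { rule, key, allow } of listRules) {
      const value = hintValue(file, key);
      // Case-sensitive exact match; a missing hint is in no list.
      const listed = value !== undefined && policy[rule].includes(value);
      if (allow ? !listed : listed) {
        violations.push({ relPath: file.relPath, rule, value: value ?? NO_HINT[key] });
      }
    }
    if (requireLicense && hintValue(file, "license") === undefined) {
      violations.push({ relPath: file.relPath, rule: "requireLicense", value: NO_HINT.license });
    }
  }
  // Sort by relPath, then rule, so repeated runs give identical output.
  violations.sort((a, b) => cmp(a.relPath, b.relPath) || cmp(a.rule, b.rule));

  return {
    verdict: violations.length === 0 ? "PASS" : "FAIL",
    fileCount: files.length,
    rulesEvaluated: listRules.length + (requireLicense ? 1 : 0),
    violations,
  };
}
```

With no declared rules the loop never records a violation, so the result is `PASS` with `rulesEvaluated: 0`. A caller should surface that count, since a pass from an empty policy is not evidence of compliance.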

### `vh dataset check` — the gate

`vh dataset check <manifest> --policy <p>` reads the manifest via the SAME strict reader the other
commands use and the policy via its strict reader, then evaluates the manifest's **trusted file set**
against the policy in a **pure, deterministic** function (no tree, no provider, no key, no network) and
emits a verdict:

- **PASS / FAIL.** PASS when no file's self-asserted hints violate any rule; FAIL when at least one does.
- **The violating-file output.** On FAIL, one line per violation — the **file (relPath)**, the **rule it
  broke**, and the **offending hint value** — sorted by relPath then rule, so two runs over the same
  inputs produce byte-identical output. A single file that breaks two rules produces two lines.
- **The 0/3 exit contract a CI job gates on.** `vh dataset check` exits **`0` on PASS, `3` on FAIL** —
  the SAME data-divergence exit convention as `vh dataset verify`/`diff`, so all dataset gates share one
  contract. A missing/unreadable manifest or policy is a runtime error (exit `1`); a missing `--policy`
  is a usage error (exit `2`) — a gate with no policy must never silently pass. So a pipeline step is
  simply `vh dataset check ds.manifest.json --policy org-policy.json` and the build blocks on a non-zero
  exit.
- **`--json`.** Emits the machine object
  `{ verdict, fileCount, rulesEvaluated, violations: [{ relPath, rule, value }] }` for an ingestion
  pipeline. The `rule` strings are stable identifiers a consumer can gate on
  (`allowLicenses` / `denyLicenses` / `allowSources` / `denySources` / `requireLicense`).
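
An ingestion pipeline that receives the `--json` object can key its own logic off the stable `rule` identifiers. The sketch below tallies violations per rule from an already-parsed result; the function name `violationsByRule` and the sample object are hypothetical, and only the documented fields (`verdict`, `fileCount`, `rulesEvaluated`, `violations[].relPath/rule/value`) are read.

```javascript
// Group the violating relPaths by the rule they broke.
function violationsByRule(result) {
  const byRule = {};
  for (const { relPath, rule } of result.violations) {
    (byRule[rule] = byRule[rule] || []).push(relPath);
  }
  return byRule;
}

// Hypothetical parsed output of `vh dataset check ... --json`.
const sample = {
  verdict: "FAIL",
  fileCount: 1033,
  rulesEvaluated: 2,
  violations: [
    { relPath: "data/notes.txt", rule: "requireLicense", value: "(no license hint)" },
    { relPath: "src/vendored/lib.py", rule: "denyLicenses", value: "GPL-3.0" },
  ],
};

console.log(violationsByRule(sample));
```

A consumer could, for example, page a reviewer only when `denyLicenses` is non-empty while merely logging `requireLicense` gaps, without parsing the human-readable output.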

### `vh dataset report --policy` — embedding the verdict in the filed document

`vh dataset report <manifest> --policy <p>` folds the **SAME pure evaluator** `vh dataset check` runs
(verbatim — no re-implementation) into the filed evidence document as a **"Policy compliance" section**:
the verdict, the number of rules evaluated, and (on FAIL) the violating files. Because it reuses the same
evaluator, the report's PASS/FAIL **can never diverge** from `vh dataset check`'s for the same manifest +
policy. The section LEADS with the same UNTRUSTED-hints caveat as `vh dataset check`, so the embedded
verdict never implies a real license was checked.

`--policy` combines with `--verify` to make **one report invocation a complete CI gate**: with `--verify`
the exit is `0` MATCH / `3` MISMATCH; with `--policy` it is `0` PASS / `3` FAIL; with **both**, the report
exits `3` if **EITHER** the live-tree verify is a MISMATCH OR the policy is a FAIL, and `0` only when the
verify is MATCH **and** the policy is PASS — so a single command gates data integrity AND policy
compliance, and the buyer's filed document shows both verdicts.
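
The combined exit rule reduces to a small truth table. The sketch below encodes it, using `undefined` to mean "that flag was not given"; the function name is hypothetical, and the result for a report run with neither flag (treated here as `0`) is an assumption, since this section only describes the flagged cases.

```javascript
// verify: "MATCH" | "MISMATCH" | undefined   (undefined = no --verify)
// policy: "PASS"  | "FAIL"     | undefined   (undefined = no --policy)
function reportExitCode({ verify, policy } = {}) {
  // Either failing gate is enough to block; a gate that did not run cannot fail.
  return verify === "MISMATCH" || policy === "FAIL" ? 3 : 0;
}
```

Usage errors and unreadable inputs are outside this table; per the `vh dataset check` contract they surface as exits `2` and `1` respectively before any verdict exists.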
|
|
234
|
+
|
|
235
|
+
### Worked example — build with hints, write a policy, check, then embed in a report
|
|
236
|
+
|
|
237
|
+
```sh
|
|
238
|
+
# 1. BUILD a manifest, attaching the (UNTRUSTED, self-asserted) source/license hints per file.
|
|
239
|
+
vh dataset build ./dataset-v2 --out v2.manifest.json --hints ./hints.json
|
|
240
|
+
# wrote v2.manifest.json root: 0xdef… fileCount: 1033
|
|
241
|
+
|
|
242
|
+
# 2. WRITE a policy: a proprietary product that forbids copyleft and requires every file to be licensed.
|
|
243
|
+
cat > org-policy.json <<'JSON'
|
|
244
|
+
{
|
|
245
|
+
"kind": "verifyhash.dataset-policy",
|
|
246
|
+
"schemaVersion": 1,
|
|
247
|
+
"denyLicenses": ["GPL-3.0", "AGPL-3.0"],
|
|
248
|
+
"requireLicense": true
|
|
249
|
+
}
|
|
250
|
+
JSON
|
|
251
|
+
|
|
252
|
+
# 3. CHECK the dataset against the policy — the gate a CI job runs (exit 0 PASS / 3 FAIL).
|
|
253
|
+
vh dataset check v2.manifest.json --policy org-policy.json
|
|
254
|
+
# TRUST: the {source, license} hints checked here are UNTRUSTED, self-asserted metadata. …
|
|
255
|
+
# policy check: FAIL
|
|
256
|
+
# files: 1033
|
|
257
|
+
# rules evaluated: 2
|
|
258
|
+
# FAIL: 2 violations (each line: the file, the rule it broke, and the offending hint value):
|
|
259
|
+
# src/vendored/lib.py [denyLicenses] value: GPL-3.0
|
|
260
|
+
# data/notes.txt [requireLicense] value: (no license hint) (exit 3)
|
|
261
|
+
|
|
262
|
+
# 4. …or as a machine object for an ingestion pipeline:
|
|
263
|
+
vh dataset check v2.manifest.json --policy org-policy.json --json
|
|
264
|
+
# {"verdict":"FAIL","fileCount":1033,"rulesEvaluated":2,"violations":[…]}
|
|
265
|
+
|
|
266
|
+
# 5. EMBED the SAME verdict in the ONE document the reviewer files (and gate integrity + policy at once):
|
|
267
|
+
vh dataset report v2.manifest.json --verify ./dataset-v2 --policy org-policy.json --out evidence.md
|
|
268
|
+
# dataset report written: /abs/path/evidence.md (exit 3 if EITHER verify MISMATCH or policy FAIL)
|
|
269
|
+
```
|
|
270
|
+
|
|
271
|
+
> **What a PASS does and does NOT mean.** A PASS attests that the dataset's **self-asserted hints satisfy
|
|
272
|
+
> the policy** — the control your pipeline ran and the verdict your auditor files. It is **NOT** a claim
|
|
273
|
+
> that the licenses are genuinely correct, nor a timestamp ("unaltered since date T"). Those require the
|
|
274
|
+
> human-owned signing/timestamp trust-root (`needs-human`, P-3 in [`STRATEGY.md`](../STRATEGY.md)).
|
|
275
|
+
|
|
276
|
+
---
|
|
277
|
+
|
|
278
|
+
## The evidence report
|
|
279
|
+
|
|
280
|
+
A reviewer does not file three terminal outputs — they file **one document**. `vh dataset report
|
|
281
|
+
<manifest> [--verify <dir>] [--policy <p>] [--json] [--out <p>]` consolidates everything a manifest
|
|
282
|
+
already proves into a single deterministic artifact you attach to an EU-AI-Act technical-documentation
|
|
283
|
+
section or an enterprise data-provenance due-diligence packet.
|
|
284
|
+
|
|
285
|
+
It **invents no new math.** The dataset identity (root + fileCount) comes from the strict manifest read;
|
|
286
|
+
the provenance/license roll-up reuses the SAME aggregation `vh dataset summary` emits (identical
|
|
287
|
+
histogram order); the optional verification reuses `vh dataset verify` verbatim; the optional policy
|
|
288
|
+
verdict reuses the SAME pure evaluator `vh dataset check` runs (see [Policy compliance
|
|
289
|
+
gate](#policy-compliance-gate) above). So the report can never drift from the commands it consolidates.
|
|
290
|
+
|
|
291
|
+
What it consolidates, in a stable section order:
|
|
292
|
+
|
|
293
|
+
1. **Trust posture, FIRST.** The same in-band `TRUST_NOTE` (file SET bound into the root and trustworthy;
|
|
294
|
+
`{source, license}` hints UNTRUSTED and NOT bound into the root) plus the explicit no-overclaim line:
|
|
295
|
+
this report is NOT a timestamp — it does not prove the dataset is "unaltered since date T", nor
|
|
296
|
+
authorship/licensing.
|
|
297
|
+
2. **Dataset identity** — the Merkle `root` and `fileCount`.
|
|
298
|
+
3. **Verification status** — either the embedded `--verify` verdict, or a plain statement that NO
|
|
299
|
+
live-tree verification was performed (so the report never *implies* a verify that did not run).
|
|
300
|
+
4. **Policy compliance** — ONLY when `--policy` is given: the PASS/FAIL verdict, rules evaluated, and (on
|
|
301
|
+
FAIL) the violating files (relPath / rule / value), leading with the same UNTRUSTED-hints caveat as
|
|
302
|
+
`vh dataset check`. Omitted entirely without `--policy`, so the report never implies a gate that did
|
|
303
|
+
not run.
|
|
304
|
+
5. **Provenance / license roll-up** — the `{source, license}` histogram over the trusted file set.
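
The section order above can be sketched as a pure function. This is a minimal illustration, not the code in `cli/dataset.js`: the `model` shape, the `renderSections` name, and every heading string are hypothetical. Only the ordering and the "state it or omit it" rules come from the list above.

```javascript
// Hypothetical sketch: assemble report sections in the fixed order described above.
// `model` and all heading strings are illustrative assumptions, not the tool's real output.
function renderSections(model) {
  const sections = [];
  // 1. Trust posture always comes first.
  sections.push(`## Trust posture\n${model.trustNote}`);
  // 2. Dataset identity.
  sections.push(`## Dataset identity\nroot: ${model.root}\nfileCount: ${model.fileCount}`);
  // 3. Verification status: say so plainly when no verify ran.
  sections.push(
    model.verify
      ? `## Verification\n${model.verify.verdict}`
      : "## Verification\nNO live-tree verification was performed."
  );
  // 4. Policy compliance: omitted entirely unless a policy was evaluated.
  if (model.policy) {
    sections.push(`## Policy compliance\n${model.policy.verdict}`);
  }
  // 5. Provenance / license roll-up.
  sections.push(`## Provenance / license roll-up\n${model.histogramLines.join("\n")}`);
  return sections.join("\n\n") + "\n";
}
```

Because the function has no clock, locale, or environment input, the same model always yields the same string.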
|
|
305
|
+
|
|
306
|
+
**Deterministic Markdown vs `--json`.** The default human output is a Markdown document with a stable
|
|
307
|
+
section order and a histogram ordered by the same rule `vh dataset summary` uses, so two runs over the
|
|
308
|
+
same manifest produce **byte-identical Markdown** — suitable to attach to a filing and to diff in CI.
|
|
309
|
+
`--json` emits the same consolidated model as a machine object for an ingestion pipeline. `--out <p>`
|
|
310
|
+
writes the document to a caller-chosen explicit path (never silently the cwd) and prints the path it wrote.
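
To illustrate why a stable histogram order gives byte-identical output, here is a small sketch. The ordering rule used here (count descending, then source, then license) is an assumption made for the example; the rule `vh dataset summary` really uses is not stated in this section. `renderHistogram` is a hypothetical name.

```javascript
// Hypothetical sketch: render a {source, license} histogram in a stable order.
// ASSUMPTION: sort by count descending, then source, then license; the tool's actual rule may differ.
function renderHistogram(entries) {
  const counts = new Map();
  for (const { source, license } of entries) {
    const key = JSON.stringify([source, license]);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  const rows = [...counts].map(([key, count]) => {
    const [source, license] = JSON.parse(key);
    return { source, license, count };
  });
  // Plain code-unit comparison (not localeCompare), so the order never depends on the host locale.
  const cmp = (a, b) => (a < b ? -1 : a > b ? 1 : 0);
  rows.sort((a, b) => b.count - a.count || cmp(a.source, b.source) || cmp(a.license, b.license));
  return rows.map((r) => `| ${r.source} | ${r.license} | ${r.count} |`).join("\n") + "\n";
}
```

Feeding the same entries in a different order produces the same bytes, which is the property that makes the document diffable in CI.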
|
|
311
|
+
|
|
312
|
+
**The optional `--verify` status section.** Without `--verify`, the report documents the manifest's
|
|
313
|
+
*claimed* root and says so plainly. With `--verify <dir>` it re-derives the root from the live tree
|
|
314
|
+
(still offline — no network) and embeds the **MATCH/MISMATCH verdict** plus the per-file
|
|
315
|
+
ADDED/REMOVED/CHANGED localization; under `--verify` the command's exit code mirrors `vh dataset verify`
|
|
316
|
+
(`0` on MATCH, `3` on MISMATCH) so a pipeline can gate on it.
|
|
317
|
+
|
|
318
|
+
**The optional `--policy` compliance section.** With `--policy <p>` the report embeds the SAME PASS/FAIL
|
|
319
|
+
verdict `vh dataset check` produces (see above). Combined with `--verify`, ONE report invocation is a
|
|
320
|
+
complete CI gate: it exits `3` if **EITHER** the live-tree verify is a MISMATCH **OR** the policy is a
|
|
321
|
+
FAIL, and `0` only when both pass.
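
The combined gate reduces to a small truth table. The sketch below is illustrative only: `reportExitCode` and its parameter shapes are hypothetical, and treating a policy FAIL without `--verify` as exit `3` is an assumption, since the text above only spells out the combined case.

```javascript
// Exit codes taken from the text above: 0 = every requested gate passed, 3 = divergence.
const EXIT_OK = 0;
const EXIT_DIVERGENCE = 3;

// `verify` is "MATCH" | "MISMATCH" | null (null = --verify not given).
// `policy` is "PASS" | "FAIL" | null (null = --policy not given).
// These parameter shapes are illustrative assumptions.
function reportExitCode({ verify = null, policy = null } = {}) {
  const verifyFailed = verify === "MISMATCH";
  const policyFailed = policy === "FAIL";
  return verifyFailed || policyFailed ? EXIT_DIVERGENCE : EXIT_OK;
}
```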
|
|
322
|
+
|
|
323
|
+
```sh
|
|
324
|
+
# The single document a reviewer files (manifest-only — claims the manifest's root):
|
|
325
|
+
vh dataset report v2.manifest.json --out evidence.md
|
|
326
|
+
# dataset report written: /abs/path/evidence.md
|
|
327
|
+
|
|
328
|
+
# …or with a live-tree verdict embedded, gating on the recomputed-root match (exit 3 on drift):
|
|
329
|
+
vh dataset report v2.manifest.json --verify ./dataset-v2 --out evidence.md
|
|
330
|
+
|
|
331
|
+
# …or with the policy verdict embedded too, so ONE invocation gates integrity AND policy (exit 3 if either fails):
|
|
332
|
+
vh dataset report v2.manifest.json --verify ./dataset-v2 --policy org-policy.json --out evidence.md
|
|
333
|
+
|
|
334
|
+
# …or the machine form for an ingestion pipeline:
|
|
335
|
+
vh dataset report v2.manifest.json --json
|
|
336
|
+
```
|
|
337
|
+
|
|
338
|
+
> **What the reviewer files.** This Markdown (or its `--json` twin) IS the deliverable — the EU-AI-Act
|
|
339
|
+
> technical-documentation section / due-diligence evidence packet a buyer's compliance process is built
|
|
340
|
+
> around — not a transcript of three commands. It still claims nothing past the standing trust posture:
|
|
341
|
+
> no wall-clock "unaltered since date T", no truth of any `{source, license}` hint.
|
|
342
|
+
|
|
343
|
+
---
|
|
344
|
+
|
|
345
|
+
## Unsigned attestation payload
|
|
346
|
+
|
|
347
|
+
`vh dataset attest <manifest> [--json] [--out <p>]` emits the **canonical, byte-deterministic** payload a
|
|
348
|
+
human-owned signing/timestamp trust-root will sign. It is the bridge that turns the human step (P-3) from
|
|
349
|
+
"design and sign a payload" into "sign THIS exact file."
|
|
350
|
+
|
|
351
|
+
**What it commits to.** A small envelope binding the dataset IDENTITY:
|
|
352
|
+
|
|
353
|
+
- `root` — the manifest's Merkle root.
|
|
354
|
+
- `fileCount` — the number of files in the committed set.
|
|
355
|
+
- `manifestDigest` — `keccak256` over a canonical serialization of the manifest's committed file set
|
|
356
|
+
(each entry's root-committed `{relPath, contentHash, leaf}`, keys in fixed order, entries sorted by
|
|
357
|
+
`relPath`, no insignificant whitespace; the UNTRUSTED `hints` are excluded). Any edit/rename/add/remove
|
|
358
|
+
to the committed set changes the digest.
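
The serialization rules for `manifestDigest` can be sketched as follows. The exact byte layout (a JSON array of objects) is an assumption for illustration, and `canonicalFileSet` is a hypothetical name. Only the rules themselves (fixed key order, sort by `relPath`, no insignificant whitespace, `hints` excluded) come from the text above.

```javascript
// Hypothetical sketch of a canonical file-set serialization.
// ASSUMPTION: the JSON-array layout is illustrative; the real canonical form may differ.
function canonicalFileSet(entries) {
  const cmp = (a, b) => (a < b ? -1 : a > b ? 1 : 0);
  const committed = entries
    // Keep only the root-committed fields, in a fixed key order; `hints` are dropped.
    .map(({ relPath, contentHash, leaf }) => ({ relPath, contentHash, leaf }))
    .sort((a, b) => cmp(a.relPath, b.relPath));
  // JSON.stringify without an indent argument emits no insignificant whitespace.
  return JSON.stringify(committed);
}
```

The digest would then be `keccak256` over these bytes (for example via a library such as ethers); that step is left out to keep the sketch dependency-free. Reordering entries or editing `hints` leaves the string unchanged, while a rename changes it.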
|
|
359
|
+
|
|
360
|
+
The envelope serializes with a fixed top-level key order and no insignificant whitespace, so **two runs
|
|
361
|
+
over the same manifest produce identical bytes** — that determinism is exactly what makes "sign the
|
|
362
|
+
bytes" well-defined. `--json` emits those same canonical bytes (pipe it straight into a signer); `--out
|
|
363
|
+
<p>` writes them to a caller-chosen explicit path (never the cwd) and prints the path it wrote.
|
|
364
|
+
|
|
365
|
+
**It is UNSIGNED — and says so, in-band.** The envelope carries an explicit `signed: false` and a
|
|
366
|
+
`signature: null` slot, plus the standing caveat verbatim. The strict reader REJECTS any payload that
|
|
367
|
+
claims `signed: true` or a non-null `signature`, so this build can never be tricked into treating a
|
|
368
|
+
hand-edited envelope as if it were signed.
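
A strict reader of this kind can be sketched in a few lines. This is not the tool's reader: `readUnsignedAttestation` is a hypothetical name, and the sketch checks only the two fields named above (`signed`, `signature`), ignoring everything else a real reader would validate.

```javascript
// Hypothetical strict reader for the UNSIGNED envelope.
// Only the `signed` and `signature` field names come from the text; the rest is illustrative.
function readUnsignedAttestation(text) {
  const payload = JSON.parse(text);
  if (payload === null || typeof payload !== "object" || Array.isArray(payload)) {
    throw new Error("rejected: attestation must be a JSON object");
  }
  // Strict equality: a missing `signed` field is rejected as well, not treated as false.
  if (payload.signed !== false) {
    throw new Error("rejected: payload must carry signed: false");
  }
  if (payload.signature !== null) {
    throw new Error("rejected: payload must carry signature: null");
  }
  return payload;
}
```

Using strict `!== false` and `!== null` comparisons means an absent field fails closed instead of being read as "unsigned".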
|
|
369
|
+
|
|
370
|
+
```sh
|
|
371
|
+
# Emit the canonical UNSIGNED payload (the exact bytes a signer/timestamp service signs over):
|
|
372
|
+
vh dataset attest v2.manifest.json --out v2.attestation.json
|
|
373
|
+
# dataset attestation written: /abs/path/v2.attestation.json
|
|
374
|
+
|
|
375
|
+
# …or to stdout / piped into a signer:
|
|
376
|
+
vh dataset attest v2.manifest.json --json
|
|
377
|
+
```
|
|
378
|
+
|
|
379
|
+
> **Attaching a real signature/timestamp is the human-owned trust-root.** Standing up a real signing key
|
|
380
|
+
> or an external timestamp authority is a `needs-human` step recorded in
|
|
381
|
+
> [`STRATEGY.md`](../STRATEGY.md) as **P-3** — the loop only BUILDS and locally TESTS the UNSIGNED
|
|
382
|
+
> payload. Until a signature is attached, this payload proves only the same set-membership / identity the
|
|
383
|
+
> manifest already does — **NOT** that the dataset is unaltered since a date T. Do not overclaim past P-3.
|
|
384
|
+
|
|
385
|
+
### Signed attestation + verification
|
|
386
|
+
|
|
387
|
+
The UNSIGNED payload above is the bytes a publisher signs; this build also ships the **detached signature
|
|
388
|
+
container that WRAPS those bytes** and the **offline VERIFIER** a buyer runs to confirm a signature is
|
|
389
|
+
genuine — `vh dataset verify-attest`. Read the boundary FIRST, because it is exactly the standing dataset
|
|
390
|
+
trust posture plus one signing-specific line:
|
|
391
|
+
|
|
392
|
+
> **Trust posture, FIRST (reuses the standing dataset `TRUST_NOTE` verbatim).** A valid signature proves
|
|
393
|
+
> **the holder of `signer`'s key vouched for THIS dataset identity** (the embedded `root` / `fileCount` /
|
|
394
|
+
> `manifestDigest`). It does **NOT** by itself prove a trustworthy TIMESTAMP — there is still no "unaltered
|
|
395
|
+
> since a date T" unless the `scheme` is a timestamp authority, which is **still the human-owned trust-root,
|
|
396
|
+
> `needs-human`, P-3** in [`STRATEGY.md`](../STRATEGY.md). It does **NOT** validate that the dataset's
|
|
397
|
+
> `{source, license}` hints are genuinely correct (that is `vh dataset check`'s untrusted-hint caveat). It
|
|
398
|
+
> is the same boundary every DataLedger artifact carries:
|
|
399
|
+
>
|
|
400
|
+
> > The Merkle root commits to the full set of (relPath, content) pairs (names AND bytes): any edit, rename, add, or remove changes the root. Per-file `hints` (source/license) are UNTRUSTED, self-asserted metadata — they are NOT bound into the root and prove nothing.
|
|
401
|
+
|
|
402
|
+
> **CRITICAL — what this build ships, and what it does NOT.** This build ships the **FORMAT** (the
|
|
403
|
+
> signed-container schema below), the **VERIFIER** (`vh dataset verify-attest`), **AND the SIGNING command**
|
|
404
|
+
> (`vh dataset sign`, below) — all proved end-to-end in the test suite with **EPHEMERAL, throwaway
|
|
405
|
+
> `Wallet.createRandom()` keys generated in-process and never persisted**. **Provisioning a real signing key
|
|
406
|
+
> and choosing trust-root option A/B/C is still the human-owned trust-root, P-3** (`needs-human`,
|
|
407
|
+
> [`STRATEGY.md`](../STRATEGY.md)). The loop NEVER generates, holds, persists, or logs a real key — `vh
|
|
408
|
+
> dataset sign` reads a key the human provisioned OUTSIDE the loop, uses it in-process ONLY to sign, and
|
|
409
|
+
> discards it. Emitting/signing/verifying a signed container NEVER implies "unaltered since date T": a signed
|
|
410
|
+
> container says only "this key vouched for this dataset identity" — the trustworthy *timestamp* is the part
|
|
411
|
+
> P-3 still owns.
|
|
412
|
+
|
|
413
|
+
#### The signed-container schema (`verifyhash.dataset-attestation-signed`)
|
|
414
|
+
|
|
415
|
+
The signed container **WRAPS, never edits** the unsigned payload: it embeds the EXACT canonical UNSIGNED
|
|
416
|
+
bytes verbatim (byte-for-byte the string `vh dataset attest` emits, including its trailing newline) and
|
|
417
|
+
attaches a detached signature alongside. Every field, with a FIXED key order so the container is itself
|
|
418
|
+
byte-deterministic:
|
|
419
|
+
|
|
420
|
+
| Field | Type | Meaning |
|
|
421
|
+
| --- | --- | --- |
|
|
422
|
+
| `kind` | string | MUST be exactly `verifyhash.dataset-attestation-signed`. |
|
|
423
|
+
| `schemaVersion` | number | MUST be a supported version (this build understands `1`). |
|
|
424
|
+
| `note` | string | The standing in-band trust caveat (the dataset `TRUST_NOTE` + the signing-specific line above), carried verbatim so the caveats can never drift from the artifact. |
|
|
425
|
+
| `attestation` | string | **The EXACT canonical UNSIGNED bytes**, embedded as a string. Re-parsed and re-validated by the SAME unsigned reader on every read: it must STILL be strictly `signed: false` / `signature: null`, and must be byte-for-byte `serializeAttestation`'s output. |
|
|
426
|
+
| `signature` | object | The detached `{ scheme, signer, signature }` triple (below). |
|
|
427
|
+
| `signature.scheme` | string | The signature scheme. This build's `scheme` value is **`eip191-personal-sign`** — EIP-191 `personal_sign` over the EXACT embedded canonical bytes (a detached signature, deliberately NOT EIP-712, so the signed message IS the payload bytes verbatim with no separate domain/struct to drift from). |
|
|
428
|
+
| `signature.signer` | string | The CLAIMED `0x` signer address, **lowercase** (a checksummed/mixed-case address is rejected for byte-determinism — lowercase it first). |
|
|
429
|
+
| `signature.signature` | string | The detached signature: for `eip191-personal-sign`, a 65-byte `r‖s‖v` secp256k1 signature as a **lowercase** `0x`-hex string (130 hex chars). |
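
The field rules in the table translate into simple shape checks. The sketch below is an illustration under assumptions: `containerProblems` is a hypothetical helper, it reports field names rather than the tool's real error messages, and it does not re-validate the embedded `attestation` string or the content of `note`.

```javascript
const KIND = "verifyhash.dataset-attestation-signed";
const SCHEME = "eip191-personal-sign";
const ADDRESS_RE = /^0x[0-9a-f]{40}$/; // 20-byte address, lowercase hex only
const SIGNATURE_RE = /^0x[0-9a-f]{130}$/; // 65-byte signature, lowercase hex only

// Hypothetical helper: returns the names of fields that break the table's rules.
function containerProblems(c) {
  const problems = [];
  if (c.kind !== KIND) problems.push("kind");
  if (c.schemaVersion !== 1) problems.push("schemaVersion");
  if (typeof c.note !== "string") problems.push("note");
  if (typeof c.attestation !== "string") problems.push("attestation");
  const sig = c.signature || {};
  if (sig.scheme !== SCHEME) problems.push("signature.scheme");
  if (typeof sig.signer !== "string" || !ADDRESS_RE.test(sig.signer)) {
    problems.push("signature.signer");
  }
  if (typeof sig.signature !== "string" || !SIGNATURE_RE.test(sig.signature)) {
    problems.push("signature.signature");
  }
  return problems;
}
```

A checksummed (mixed-case) signer fails the lowercase-only pattern, matching the table's byte-determinism rule.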
|
|
430
|
+
|
|
431
|
+
**The wrap-don't-edit invariant.** The embedded `attestation` stays strictly `signed: false` /
|
|
432
|
+
`signature: null`: the strict reader re-parses it and runs the SAME `validateAttestation` the unsigned path
|
|
433
|
+
uses (which hard-rejects any `signed: true` / non-null `signature`), then requires the embedded string to
|
|
434
|
+
be byte-for-byte canonical. So a signed container can never smuggle in an edited or already-"signed"
|
|
435
|
+
payload — wrapping adds a vouch, it never edits the thing vouched for. T-15.2's strict UNSIGNED guarantee
|
|
436
|
+
is preserved unchanged.
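
One way to picture the invariant: re-parse the embedded string, confirm it is still unsigned, re-serialize it canonically, and require the result to equal the embedded string byte for byte. The sketch below assumes a key order and field set for the unsigned envelope. Both are hypothetical stand-ins, and `serializeSketch` is not the tool's `serializeAttestation`.

```javascript
// ASSUMPTION: this key order and field set stand in for the real canonical serializer.
const KEY_ORDER = ["root", "fileCount", "manifestDigest", "signed", "signature"];

function serializeSketch(payload) {
  const ordered = {};
  for (const key of KEY_ORDER) ordered[key] = payload[key];
  return JSON.stringify(ordered) + "\n"; // trailing newline, as the text describes
}

// Returns true only when the embedded string is still unsigned AND byte-for-byte canonical.
function embeddedIsCanonicalUnsigned(embedded) {
  let payload;
  try {
    payload = JSON.parse(embedded);
  } catch {
    return false;
  }
  if (payload === null || typeof payload !== "object") return false;
  if (KEY_ORDER.some((key) => !(key in payload))) return false;
  if (payload.signed !== false || payload.signature !== null) return false;
  // Extra keys, reordered keys, or added whitespace all make this comparison fail.
  return serializeSketch(payload) === embedded;
}
```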
|
|
437
|
+
|
|
438
|
+
#### `vh dataset sign` — the one-command signing leg (reads a key YOU provisioned)
|
|
439
|
+
|
|
440
|
+
`vh dataset sign <manifest> --key-env <VAR> | --key-file <path> [--out <p>] [--json]` is the **one command
|
|
441
|
+
that turns "a human has a key" into a signed container a buyer can verify.** It builds the UNSIGNED payload
|
|
442
|
+
exactly as `vh dataset attest` does (no re-implementation), constructs an in-process ethers `Wallet` from
|
|
443
|
+
the key YOU supply, signs the canonical bytes (`eip191-personal-sign`), and **wraps WITHOUT editing** the
|
|
444
|
+
payload into the `verifyhash.dataset-attestation-signed` container the existing `vh dataset verify-attest`
|
|
445
|
+
accepts. The result round-trips by construction.
|
|
446
|
+
|
|
447
|
+
**Key hygiene (load-bearing — the property that keeps this guardrail-safe).** `vh dataset sign` performs a
|
|
448
|
+
**read-only lookup of a key YOU provisioned outside this tool**; it **never generates, never persists, and never
|
|
449
|
+
logs (or echoes) a key**, and it is **OFFLINE — no provider, no network**. The key is read from EXACTLY ONE
|
|
450
|
+
of `--key-env <VAR>` (read `process.env[VAR]`) or `--key-file <path>` (a file you created), used in-process
|
|
451
|
+
ONLY to sign, then discarded. **Supplying neither source or both sources, a missing env var, an unreadable file, or a
|
|
452
|
+
malformed/all-zero key HARD-ERRORS before any signing**, with a message that names only the SOURCE (the env
|
|
453
|
+
var name or the file path) — **never the key material**. On success the output prints ONLY the PUBLIC signer
|
|
454
|
+
address, the output path, and the scheme. A usage error (no `<manifest>`, or not exactly one key source)
|
|
455
|
+
exits `2`; a runtime error (bad key, unreadable manifest) exits `1`.
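
The key-source rules can be sketched as a single resolver. This is a hedged illustration, not the tool's code: `resolveKey` and `SignError` are hypothetical names, the accepted key format (64 hex characters with an optional `0x` prefix) is an assumption, and mapping a missing env var or unreadable file to exit `1` is also an assumption, since the text assigns exit codes only to the cases it lists.

```javascript
const fs = require("node:fs");

// Error type carrying the exit code described above (2 = usage, 1 = runtime).
class SignError extends Error {
  constructor(message, exitCode) {
    super(message);
    this.exitCode = exitCode;
  }
}

// ASSUMPTION: a key is 32 bytes of hex with an optional 0x prefix; the real format may differ.
const KEY_RE = /^(0x)?[0-9a-fA-F]{64}$/;

function resolveKey({ keyEnv, keyFile }, env = process.env) {
  // Exactly one source must be given; neither or both is a usage error.
  if (Boolean(keyEnv) === Boolean(keyFile)) {
    throw new SignError("provide exactly one of --key-env or --key-file", 2);
  }
  const source = keyEnv ? `env var ${keyEnv}` : `file ${keyFile}`;
  let raw;
  if (keyEnv) {
    raw = env[keyEnv];
    if (raw === undefined) throw new SignError(`${source} is not set`, 1);
  } else {
    try {
      raw = fs.readFileSync(keyFile, "utf8");
    } catch {
      throw new SignError(`${source} is unreadable`, 1);
    }
  }
  const key = raw.trim();
  // Messages name only the source, never the key material.
  if (!KEY_RE.test(key)) throw new SignError(`${source} holds a malformed key`, 1);
  if (/^(0x)?0{64}$/.test(key)) throw new SignError(`${source} holds an all-zero key`, 1);
  return key;
}
```

Passing `env` as a parameter keeps the function testable without touching the real process environment.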
|
|
456
|
+
|
|
457
|
+
> **Trust posture (inherited verbatim — a signature is NOT a timestamp).** This is the SHARED in-band
|
|
458
|
+
> `SIGN_TRUST_NOTE` (`cli/dataset.js`), the same wording the `sign` command prints and the human reads, so
|
|
459
|
+
> the caveat can never drift from the code:
|
|
460
|
+
>
|
|
461
|
+
> > This signs the dataset IDENTITY (root, fileCount, manifestDigest) with the key YOU supplied. A self-managed key attests "the signer says so" — it is NOT an independent, trusted TIMESTAMP: "existed/unaltered since a date T" still needs the human-owned signing/timestamp trust-root (needs-human, P-3). The key must be one YOU provisioned OUTSIDE this tool.
|
|
462
|
+
>
|
|
463
|
+
> Trust-root options (B) and (C) buy an independent timestamp; option (A), a self-managed key, does not. It also still carries the standing
|
|
464
|
+
> dataset caveat verbatim:
|
|
465
|
+
>
|
|
466
|
+
> > The Merkle root commits to the full set of (relPath, content) pairs (names AND bytes): any edit, rename, add, or remove changes the root. Per-file `hints` (source/license) are UNTRUSTED, self-asserted metadata — they are NOT bound into the root and prove nothing.
|
|
467
|
+
|
|
468
|
+
```sh
|
|
469
|
+
# Sign the dataset attestation with a key YOU provisioned outside the loop (env var or key file).
|
|
470
|
+
# Reads YOUR key without modifying it; never generates/persists/logs a key; OFFLINE; no network.
|
|
471
|
+
vh dataset sign v2.manifest.json --key-env DATASET_SIGNING_KEY --out v2.attestation.signed.json
|
|
472
|
+
# TRUST: This signs the dataset IDENTITY … it is NOT an independent, trusted TIMESTAMP …
|
|
473
|
+
# signed by 0x<your public address>
|
|
474
|
+
# scheme: eip191-personal-sign
|
|
475
|
+
# signed attestation written: /abs/path/v2.attestation.signed.json
|
|
476
|
+
```
|
|
477
|
+
|
|
478
|
+
#### `vh dataset verify-attest` — the offline verifier
|
|
479
|
+
|
|
480
|
+
`vh dataset verify-attest <signed> [--manifest <m>] [--signer <addr>] [--json]` is **purely offline — no
|
|
481
|
+
tree walk, no provider, no key, no network.** It reads the container with the strict reader (a
|
|
482
|
+
malformed/edited/foreign container is rejected, never half-accepted), then runs up to three checks:
|
|
483
|
+
|
|
484
|
+
1. **Signature recovery (always).** Recover the signer from the embedded canonical bytes + signature
|
|
485
|
+
(`eip191-personal-sign` → ethers' `verifyMessage` over exactly those bytes) and confirm it equals the
|
|
486
|
+
container's CLAIMED `signer`. A signature that does not recover to the claimed signer — or a tampered,
|
|
487
|
+
unrecoverable signature — is a clean **REJECTED**, not a crash.
|
|
488
|
+
2. **`--signer <addr>` (optional publisher pin).** Confirm the RECOVERED signer equals the SPECIFIC
|
|
489
|
+
publisher address the buyer pinned — so a buyer pins WHO must have signed, not merely that someone did.
|
|
490
|
+
Accepts a checksummed or lowercase address.
|
|
491
|
+
3. **`--manifest <m>` (optional identity binding).** Recompute the canonical UNSIGNED bytes from the
|
|
492
|
+
buyer's OWN manifest via the EXISTING build path and require them byte-identical to the embedded
|
|
493
|
+
(signed-over) payload — proving the signature binds the dataset the buyer actually holds, not some other
|
|
494
|
+
set.
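
The three checks above can be sketched as one evaluation step. Signature recovery itself is left out: `recoveredSigner` is assumed to come from a verifier such as ethers' `verifyMessage`, with `null` standing for an unrecoverable signature. `evaluateChecks` and its input shape are hypothetical.

```javascript
// Hypothetical sketch of the three checks; inputs are assumed to be prepared by the caller.
function evaluateChecks({ recoveredSigner, claimedSigner, expectedSigner, embeddedBytes, manifestBytes }) {
  // Addresses compare case-insensitively, so a checksummed pin matches a lowercase signer.
  const same = (a, b) =>
    typeof a === "string" && typeof b === "string" && a.toLowerCase() === b.toLowerCase();
  return {
    // 1. Always: the recovered signer must equal the container's claimed signer.
    signatureMatchesSigner: same(recoveredSigner, claimedSigner),
    // 2. Only with --signer: null marks "not requested".
    signerMatchesExpected: expectedSigner === undefined ? null : same(recoveredSigner, expectedSigner),
    // 3. Only with --manifest: recomputed bytes must equal the embedded bytes exactly.
    manifestBindsAttestation: manifestBytes === undefined ? null : manifestBytes === embeddedBytes,
  };
}
```

Returning `null` for a check that was not requested keeps "did not run" distinguishable from "ran and failed".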
|
|
495
|
+
|
|
496
|
+
**The 0/3 exit contract a buyer's CI gates on.** The verdict is **ACCEPTED only when EVERY requested check
|
|
497
|
+
passes**; any failure is REJECTED. It exits **`0` on ACCEPTED, `3` on REJECTED** — the SAME
|
|
498
|
+
data-divergence convention as `vh dataset verify` / `diff` / `check`, so all dataset gates share one exit
|
|
499
|
+
contract. A usage error (e.g. missing `<signed>`) exits `2`; a missing/corrupt container or manifest is a
|
|
500
|
+
runtime error (exit `1`). `--json` emits the machine object
|
|
501
|
+
`{ verdict, accepted, recoveredSigner, claimedSigner, scheme, checks: { signatureMatchesSigner,
|
|
502
|
+
signerMatchesExpected, manifestBindsAttestation }, expectedSigner, manifestChecked, failedChecks }` — the
|
|
503
|
+
`checks.*` booleans are `null` for a check that was not requested (never a silent fail), and `failedChecks`
|
|
504
|
+
names the stable rule ids a consumer gates on. So a buyer's pipeline step is simply
|
|
505
|
+
`vh dataset verify-attest signed.json --signer 0x<ourPublishedAddr> --manifest ds.manifest.json` and the
|
|
506
|
+
build blocks on a non-zero exit.
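
On the consumer side, a pipeline that parses the `--json` object may want to insist that the checks it relies on actually ran, because `null` means "not requested" rather than "passed". The sketch below uses the field names from the object shape above; `requireChecks` is a hypothetical helper, not part of the CLI.

```javascript
// Consumer-side sketch: parse the --json object and require specific checks to be true.
function requireChecks(jsonText, required) {
  const result = JSON.parse(jsonText);
  if (result.accepted !== true) {
    return { ok: false, reason: `verdict ${result.verdict}` };
  }
  for (const name of required) {
    // null means "not requested": treat it as a gate failure rather than a silent pass.
    if (result.checks[name] !== true) {
      return { ok: false, reason: `check ${name} did not run or failed` };
    }
  }
  return { ok: true, reason: "all required checks passed" };
}
```

This guards against a pipeline step that accidentally drops `--signer` or `--manifest` and still sees an ACCEPTED verdict.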
|
|
507
|
+
|
|
508
|
+
#### Worked end-to-end example (attest → sign → verify-attest)
|
|
509
|
+
|
|
510
|
+
```sh
|
|
511
|
+
# 1. ATTEST: emit the canonical UNSIGNED bytes (the exact bytes the publisher signs over).
|
|
512
|
+
vh dataset attest v2.manifest.json --out v2.attestation.json
|
|
513
|
+
# dataset attestation written: /abs/path/v2.attestation.json
|
|
514
|
+
|
|
515
|
+
# 2. [HUMAN-OWNED, P-3 — PROVISION ONLY] Provision a real signing key OUTSIDE the loop (env var or key file),
|
|
516
|
+
# then SIGN with ONE command. The loop NEVER generates/persists/logs the key — `vh dataset sign` reads the
|
|
517
|
+
# key YOU provisioned, signs the canonical bytes (eip191-personal-sign), and wraps them WITHOUT editing the
|
|
518
|
+
# payload (it stays signed:false). Choosing/provisioning the key + trust-root option A/B/C is the P-3 part.
|
|
519
|
+
vh dataset sign v2.manifest.json --key-env DATASET_SIGNING_KEY --out v2.attestation.signed.json
|
|
520
|
+
# signed by 0x<your public address> scheme: eip191-personal-sign
|
|
521
|
+
# signed attestation written: /abs/path/v2.attestation.signed.json
|
|
522
|
+
# (In tests this signing step uses an EPHEMERAL throwaway Wallet.createRandom() key — never a real key.)
|
|
523
|
+
|
|
524
|
+
# 3. The BUYER VERIFIES offline — no key, no network — pinning WHO signed AND binding it to THEIR dataset:
|
|
525
|
+
vh dataset verify-attest v2.attestation.signed.json \
|
|
526
|
+
--signer 0x<the publisher's published address> --manifest ./my-copy.manifest.json
|
|
527
|
+
# TRUST: A valid signature proves the holder of `signer`'s key vouched for THIS dataset identity …
|
|
528
|
+
# verify-attest: ACCEPTED
|
|
529
|
+
# [PASS] signature recovers to the claimed signer
|
|
530
|
+
# [PASS] recovered signer matches the expected publisher (0x…)
|
|
531
|
+
# [PASS] the signature binds YOUR manifest (its canonical bytes are byte-identical to the signed payload)
|
|
532
|
+
# ACCEPTED: every requested check passed. (exit 0; exit 3 if ANY requested check FAILs)
|
|
533
|
+
```
|
|
534
|
+
|
|
535
|
+
> **Still bounded by P-3.** An ACCEPTED verdict proves the key-holder vouched for this dataset identity —
|
|
536
|
+
> it does **NOT** prove a trustworthy timestamp ("unaltered since date T") and does **NOT** validate any
|
|
537
|
+
> `{source, license}` hint. The trustworthy timestamp is the human-owned trust-root, `needs-human`, P-3 in
|
|
538
|
+
> [`STRATEGY.md`](../STRATEGY.md). This build ships the FORMAT, the VERIFIER, AND the `vh dataset sign`
|
|
539
|
+
> command (all proved with throwaway test keys); the human still owns PROVISIONING the key and choosing
|
|
540
|
+
> trust-root option A/B/C.
|
|
541
|
+
|
|
542
|
+
---
|
|
543
|
+
|
|
544
|
+
## The independent timestamp (P-3 Option B): an RFC-3161 TSA proves "existed by date T"
|
|
545
|
+
|
|
546
|
+
A self-managed signature (`sign` / `verify-attest`, Option A) attests only "the publisher **says so**". The
|
|
547
|
+
stronger claim a due-diligence / EU-AI-Act reviewer ultimately wants — "an **independent** third party saw
|
|
548
|
+
this exact dataset identity **by time T**" — is what **P-3 Option (B)** delivers: an
|
|
549
|
+
[RFC-3161](https://www.rfc-editor.org/rfc/rfc3161) **Time-Stamping Authority (TSA)** stamps a digest and
|
|
550
|
+
returns a signed `TimeStampToken`. This build ships the **FORMAT** (the
|
|
551
|
+
`verifyhash.dataset-attestation-timestamped` container) and the **OFFLINE VERIFIER**
|
|
552
|
+
(`vh dataset verify-timestamp`), proved end-to-end with **self-minted test tokens** (a test-only mock TSA
|
|
553
|
+
with an ephemeral key — **NEVER a real TSA**, exactly as the signing tests use `Wallet.createRandom()`).
|
|
554
|
+
Obtaining a real token is a **human/network step** (you pick a TSA you trust and call it).
|
|
555
|
+
|
|
556
|
+
The flow is **`timestamp-request` → (obtain a token from your TSA) → `timestamp-wrap` → `verify-timestamp`**:
|
|
557
|
+
|
|
558
|
+
```sh
|
|
559
|
+
# 1. REQUEST: emit the SHA-256 digest of the canonical attestation bytes — the EXACT messageImprint a TSA stamps.
|
|
560
|
+
vh dataset timestamp-request v2.manifest.json
|
|
561
|
+
# sha256 digest (the messageImprint to stamp): 34031ecf…439f
|
|
562
|
+
# To obtain an RFC-3161 timestamp token over this digest (a HUMAN/network step):
|
|
563
|
+
# openssl ts -query -digest 34031ecf…439f -sha256 -cert -out request.tsq
|
|
564
|
+
# # send request.tsq to your TSA -> response.tsr ; then:
|
|
565
|
+
# openssl ts -reply -in response.tsr -token_out -out token.der
|
|
566
|
+
|
|
567
|
+
# 2. [HUMAN-OWNED, P-3 Option B] Pick a TSA you trust and obtain a token over that digest (network step).
|
|
568
|
+
# The loop NEVER calls a TSA, holds no token, and generates none.
|
|
569
|
+
|
|
570
|
+
# 3. WRAP: bind the returned RFC-3161 token to the re-derived digest, WITHOUT editing the payload.
|
|
571
|
+
vh dataset timestamp-wrap v2.manifest.json --token token.der --out v2.attestation.timestamped.json
|
|
572
|
+
# timestamped: an INDEPENDENT TSA stamped this digest by genTime
|
|
573
|
+
# genTime (asserted by the TSA): 2026-01-01T00:00:00Z TSA serial: 2a policy OID: 1.2.3.4.5
|
|
574
|
+
|
|
575
|
+
# 4. The BUYER VERIFIES offline — no key, no network — and (optionally) binds it to THEIR dataset:
|
|
576
|
+
vh dataset verify-timestamp v2.attestation.timestamped.json --manifest ./my-copy.manifest.json
|
|
577
|
+
# TRUST: ACCEPTED means an RFC-3161 TSA asserted this exact dataset identity (digest) existed by genTime;
|
|
578
|
+
# this is as trustworthy as the TSA whose certificate YOU trust — this command does NOT validate the
|
|
579
|
+
# TSA's certificate chain (use `openssl ts -verify` / a CMS verifier for full PKI validation).
|
|
580
|
+
# verify-timestamp: ACCEPTED
|
|
581
|
+
# [PASS] the token binds sha256(canonical attestation bytes) under RFC-3161
|
|
582
|
+
# [PASS] the timestamp binds YOUR manifest
|
|
583
|
+
# ACCEPTED: an RFC-3161 TSA asserted this dataset identity existed by:
|
|
584
|
+
# genTime (ISO UTC): 2026-01-01T00:00:00Z
|
|
585
|
+
# TSA serialNumber: 2a (decimal 42)
|
|
586
|
+
# policy OID: 1.2.3.4.5 (exit 0; exit 3 if ANY requested check FAILs)
|
|
587
|
+
```
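
The digest in step 1 is a plain SHA-256 over the canonical attestation bytes, so it can be reproduced with Node's built-in `crypto` module. In the sketch below, `messageImprintHex` and `digestBinds` are hypothetical names; extracting the stamped digest from a DER-encoded token needs an ASN.1 parser and is not shown.

```javascript
const { createHash } = require("node:crypto");

// SHA-256 over the exact attestation bytes, as lowercase hex
// (the value a TSA would stamp as the messageImprint).
function messageImprintHex(attestationBytes) {
  return createHash("sha256").update(attestationBytes).digest("hex");
}

// Binding check: recompute the digest from the buyer's own bytes and compare it
// with the digest carried by the token (assumed to be already extracted as hex).
function digestBinds(attestationBytes, stampedDigestHex) {
  return messageImprintHex(attestationBytes) === stampedDigestHex.toLowerCase();
}
```

Any change to the attestation bytes, including a dropped trailing newline, yields a different digest and therefore a failed binding.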
|
|
588
|
+
|
|
589
|
+
> **The exact bounded trust claim (never overclaims).** ACCEPTED means **an RFC-3161 TSA asserted this exact
|
|
590
|
+
> dataset identity (the SHA-256 digest of the canonical attestation bytes) existed by `<genTime>`** — and
|
|
591
|
+
> this is **as trustworthy as the TSA whose certificate YOU trust**. `verify-timestamp` does **NOT** validate
|
|
592
|
+
> the TSA's X.509 certificate chain or the token's CMS signature — use your platform's CMS verifier
|
|
593
|
+
> (`openssl ts -verify`) for full PKI validation, exactly as Option A pins the signer ADDRESS out of band. A
|
|
594
|
+
> tampered token, a mismatched digest, or an edited embedded attestation **REJECTS** — never a false ACCEPT.
|
|
595
|
+
> Even so this is materially stronger than Option A: an **independent third party** (not the publisher)
|
|
596
|
+
> attests existence by `genTime`.
|
|
597
|
+
|
|
598
|
+
P-3 Option (B)'s human handoff therefore collapses to: **(1)** pick a TSA you trust; **(2)** run
|
|
599
|
+
`vh dataset timestamp-request` to get the digest; **(3)** obtain a token from your TSA over that digest;
|
|
600
|
+
**(4)** run `vh dataset timestamp-wrap` — **done**; buyers verify offline with `vh dataset verify-timestamp`.
|
|
601
|
+
|
|
602
|
+
---
|
|
603
|
+
|
|
604
|
+
## What an auditor / EU AI Act reviewer gets
|
|
605
|
+
|
|
606
|
+
A mapping from the reviewer's question to the command that produces the evidence:
|
|
607
|
+
|
|
608
|
+
| Reviewer's question | Command | Evidence produced |
|
|
609
|
+
| --- | --- | --- |
|
|
610
|
+
| "Exactly which files — names and bytes — did this dataset contain?" | `vh dataset build` | A manifest: a Merkle root over every `(relPath, content)` pair + per-file leaves |
|
|
611
|
+
| "Is this copy of the dataset byte-for-byte the one you manifested?" | `vh dataset verify` | Recomputed-root vs manifest-root verdict + a per-file ADDED/REMOVED/CHANGED localization |
|
|
612
|
+
| "What changed in the training data between model version N and N+1?" | `vh dataset diff` | The precise add/remove/change set between two manifests (offline) |
|
|
613
|
+
| "What is the provenance/license composition of the dataset?" | `vh dataset summary` | A `{source, license}` histogram over the trusted file set (claims, clearly labeled untrusted) |
|
|
614
|
+
| "Does this dataset VIOLATE our written license/source policy? (the control CI runs)" | `vh dataset check --policy` | A PASS/FAIL verdict + the exact violating files (relPath / rule / value); a CI-gateable exit code (0 PASS / 3 FAIL) over the dataset's self-asserted hints (clearly labeled untrusted) |
|
|
615
|
+
| "Give me ONE document to file in the technical-documentation / due-diligence packet." | `vh dataset report` | A single deterministic Markdown (or `--json`) document: dataset identity + the provenance/license roll-up + the standing trust caveats + an optional live-tree verify verdict + (with `--policy`) the embedded policy-compliance verdict |
|
|
616
|
+
| "Give me the exact bytes our publisher (or a timestamp authority) will sign over." | `vh dataset attest` | A canonical, byte-deterministic UNSIGNED attestation payload committing to `root` / `fileCount` / `manifestDigest` (the file a human signing/timestamp trust-root signs — see P-3) |
|
|
617
|
+
| "I provisioned a signing key — turn the attestation into a signed container in one command." | `vh dataset sign` | The `verifyhash.dataset-attestation-signed` container, signed (`eip191-personal-sign`) with the key YOU supplied (`--key-env`/`--key-file`), ready for any buyer to `verify-attest`. Reads your key without modifying it; never generates/persists/logs a key; offline. Attests the IDENTITY + "the signer says so" — NOT a timestamp (still P-3) |
|
|
618
|
+
| "A vendor handed me a 'signed by the publisher' attestation — confirm it is genuine and binds the dataset I hold." | `vh dataset verify-attest` | An OFFLINE ACCEPTED/REJECTED verdict: the signature recovers to the claimed signer, (with `--signer`) the recovered signer is the publisher I pinned, and (with `--manifest`) it binds MY dataset; CI-gateable exit 0/3. Proves the key-holder vouched for this dataset identity — NOT a timestamp (P-3) |
|
|
619
|
+
| "Prove an INDEPENDENT party saw this exact dataset by a date — `timestamp-request` → (TSA) → `timestamp-wrap`." | `vh dataset timestamp-request` / `vh dataset timestamp-wrap` | The SHA-256 digest a TSA stamps, then the `verifyhash.dataset-attestation-timestamped` container binding the returned RFC-3161 token to that digest. The loop never calls a TSA; obtaining the token is the human/network step (P-3 Option B) |
|
|
620
|
+
| "A vendor handed me a timestamped attestation — confirm an RFC-3161 TSA stamped the dataset I hold, and by WHEN." | `vh dataset verify-timestamp` | An OFFLINE ACCEPTED (with the asserted genTime / TSA serial / policy OID) or REJECTED verdict; with `--manifest` it binds the timestamp to MY dataset; CI-gateable exit 0/3. ACCEPTED == a TSA asserted this digest existed by genTime, to the strength of the TSA YOU trust — does NOT validate the TSA cert chain (use `openssl ts -verify`) |
|
|
621
|
+
| "Prove this specific record/file was actually in the dataset." | `vh dataset prove` → `vh dataset verify-proof` | A portable, offline-verifiable set-membership proof for one file |
|
|
622
|
+
|
|
623
|
+
What this mapping deliberately does NOT claim: a wall-clock "unaltered since date T", and the
|
|
624
|
+
truth of any `{source, license}` hint. Both require the human-owned signing/timestamp trust-root
|
|
625
|
+
(`needs-human`, see [`STRATEGY.md`](../STRATEGY.md)).
|
|
626
|
+
|
|
627
|
+
---
|
|
628
|
+
|
|
629
|
+
## See also
|
|
630
|
+
|
|
631
|
+
- [`docs/TRUST-BOUNDARIES.md`](TRUST-BOUNDARIES.md) — the full trust model the caveats above reuse.
|
|
632
|
+
- [`docs/MERKLE-LEAVES.md`](MERKLE-LEAVES.md) — the exact path-bound leaf/root construction the
|
|
633
|
+
manifest commits to (DataLedger reuses it unchanged).
|
|
634
|
+
- [`docs/PROOFS.md`](PROOFS.md) — the portable proof-artifact schema membership proofs reuse.
|
|
635
|
+
|
|
636
|
+
|
|
637
|
+
---
|
|
638
|
+
<sub>© 2026 verifyhash.com · Licensed under Apache-2.0 (SPDX-License-Identifier: Apache-2.0) — see the [LICENSE](https://verifyhash.com/LICENSE) and [NOTICE](https://verifyhash.com/NOTICE) served with this file.</sub>
|