verifyhash 0.1.0

Files changed (154)
  1. package/LICENSE +201 -0
  2. package/README.md +883 -0
  3. package/cli/abi/ContributionRegistry.json +881 -0
  4. package/cli/agent.js +2173 -0
  5. package/cli/anchor-artifact.js +853 -0
  6. package/cli/anchor.js +400 -0
  7. package/cli/claim.js +881 -0
  8. package/cli/core/agent-commit.js +448 -0
  9. package/cli/core/agent-session.js +598 -0
  10. package/cli/core/anchor-binding.js +663 -0
  11. package/cli/core/attestation.js +580 -0
  12. package/cli/core/evidence-plans.js +495 -0
  13. package/cli/core/fixtures/evidence-plans/baseline.json +19 -0
  14. package/cli/core/fulfill-intake.js +1082 -0
  15. package/cli/core/go-live-preflight.js +481 -0
  16. package/cli/core/license.js +534 -0
  17. package/cli/core/manifest.js +243 -0
  18. package/cli/core/packetseal.js +591 -0
  19. package/cli/core/registryArtifact.js +49 -0
  20. package/cli/core/revocation.js +539 -0
  21. package/cli/core/rfc3161.js +389 -0
  22. package/cli/core/timestamp.js +482 -0
  23. package/cli/core/trust-asof.js +479 -0
  24. package/cli/dataset.js +2950 -0
  25. package/cli/evidence.js +2227 -0
  26. package/cli/fulfill-webhook-http.js +438 -0
  27. package/cli/git.js +220 -0
  28. package/cli/hash.js +550 -0
  29. package/cli/identity.js +1072 -0
  30. package/cli/journal-cli.js +1110 -0
  31. package/cli/journal-log.js +454 -0
  32. package/cli/journal.js +334 -0
  33. package/cli/lineage.js +447 -0
  34. package/cli/list.js +287 -0
  35. package/cli/parcel.js +1509 -0
  36. package/cli/proof.js +578 -0
  37. package/cli/prove.js +300 -0
  38. package/cli/receipt.js +631 -0
  39. package/cli/registry.js +331 -0
  40. package/cli/reputation.js +344 -0
  41. package/cli/revocation.js +495 -0
  42. package/cli/serve-verify-http.js +298 -0
  43. package/cli/serve-verify.js +333 -0
  44. package/cli/show.js +339 -0
  45. package/cli/verify.js +383 -0
  46. package/cli/vh.js +3927 -0
  47. package/docs/ADOPT.md +183 -0
  48. package/docs/ADOPTION.json +11 -0
  49. package/docs/AGENTTRACE.md +247 -0
  50. package/docs/ANCHORING.md +167 -0
  51. package/docs/AUDIT.md +55 -0
  52. package/docs/CONFORMANCE.md +107 -0
  53. package/docs/DATALEDGER.md +638 -0
  54. package/docs/DECIDE.md +47 -0
  55. package/docs/DECISIONS-PENDING.md +27 -0
  56. package/docs/DEPLOY-PUBLIC-SITE.md +301 -0
  57. package/docs/ENGINE-LEDGER.json +12 -0
  58. package/docs/EVIDENCE.md +519 -0
  59. package/docs/GO-LIVE.md +66 -0
  60. package/docs/IDENTITY.md +123 -0
  61. package/docs/INDEPENDENT-VERIFICATION.md +377 -0
  62. package/docs/INTEGRITY-JOURNAL.md +337 -0
  63. package/docs/KEY-LIFECYCLE.md +179 -0
  64. package/docs/LICENSING.md +46 -0
  65. package/docs/LINEAGE.md +307 -0
  66. package/docs/LOOP-AUDIT-2026-07-03.json +580 -0
  67. package/docs/LOOP-HARDENING-PLAN.md +44 -0
  68. package/docs/MERKLE-LEAVES.md +113 -0
  69. package/docs/METRICS.jsonl +31 -0
  70. package/docs/MORNING.md +204 -0
  71. package/docs/PILOT.md +444 -0
  72. package/docs/PROOFPARCEL.md +227 -0
  73. package/docs/PROOFS.md +262 -0
  74. package/docs/RECEIPTS.md +341 -0
  75. package/docs/REPUTATION.md +158 -0
  76. package/docs/SDK.md +301 -0
  77. package/docs/STRATEGY-ARCHIVE.md +5055 -0
  78. package/docs/SUPERVISOR-RUNBOOK.md +52 -0
  79. package/docs/TRUST-BOUNDARIES.md +335 -0
  80. package/docs/TRUSTLEDGER.md +1976 -0
  81. package/docs/USAGE-BUDGET.json +121 -0
  82. package/docs/VERIFY-SERVICE.md +168 -0
  83. package/index.js +160 -0
  84. package/package.json +41 -0
  85. package/trustledger/build-standalone.js +796 -0
  86. package/trustledger/cli.js +3179 -0
  87. package/trustledger/close.js +391 -0
  88. package/trustledger/corpus.js +159 -0
  89. package/trustledger/dist/BUILD-PROVENANCE.json +99 -0
  90. package/trustledger/dist/trustledger-standalone.html +6197 -0
  91. package/trustledger/dist/trustledger-standalone.html.sha256 +1 -0
  92. package/trustledger/door-core.js +442 -0
  93. package/trustledger/fixtures/bank.csv +7 -0
  94. package/trustledger/fixtures/bank.malformed.csv +3 -0
  95. package/trustledger/fixtures/bank.noalias.csv +5 -0
  96. package/trustledger/fixtures/bank.ofx +34 -0
  97. package/trustledger/fixtures/bank.real.csv +5 -0
  98. package/trustledger/fixtures/corpus/_shared/prior-close.json +22 -0
  99. package/trustledger/fixtures/corpus/bank-book-mismatch--benign-twin/inputs.json +14 -0
  100. package/trustledger/fixtures/corpus/bank-book-mismatch--benign-twin/meta.json +7 -0
  101. package/trustledger/fixtures/corpus/bank-book-mismatch--out-of-trust/inputs.json +14 -0
  102. package/trustledger/fixtures/corpus/bank-book-mismatch--out-of-trust/meta.json +7 -0
  103. package/trustledger/fixtures/corpus/continuity-break--benign-twin/inputs.json +15 -0
  104. package/trustledger/fixtures/corpus/continuity-break--benign-twin/meta.json +7 -0
  105. package/trustledger/fixtures/corpus/continuity-break--out-of-trust/inputs.json +15 -0
  106. package/trustledger/fixtures/corpus/continuity-break--out-of-trust/meta.json +7 -0
  107. package/trustledger/fixtures/corpus/negative-tenant-ledger--benign-twin/inputs.json +13 -0
  108. package/trustledger/fixtures/corpus/negative-tenant-ledger--benign-twin/meta.json +7 -0
  109. package/trustledger/fixtures/corpus/negative-tenant-ledger--out-of-trust/inputs.json +13 -0
  110. package/trustledger/fixtures/corpus/negative-tenant-ledger--out-of-trust/meta.json +7 -0
  111. package/trustledger/fixtures/corpus/owner-overdraw--benign-twin/inputs.json +15 -0
  112. package/trustledger/fixtures/corpus/owner-overdraw--benign-twin/meta.json +7 -0
  113. package/trustledger/fixtures/corpus/owner-overdraw--out-of-trust/inputs.json +15 -0
  114. package/trustledger/fixtures/corpus/owner-overdraw--out-of-trust/meta.json +7 -0
  115. package/trustledger/fixtures/corpus/security-deposit-segregation--benign-twin/inputs.json +16 -0
  116. package/trustledger/fixtures/corpus/security-deposit-segregation--benign-twin/meta.json +7 -0
  117. package/trustledger/fixtures/corpus/security-deposit-segregation--out-of-trust/inputs.json +13 -0
  118. package/trustledger/fixtures/corpus/security-deposit-segregation--out-of-trust/meta.json +7 -0
  119. package/trustledger/fixtures/corpus/subledger-out-of-balance--benign-twin/inputs.json +13 -0
  120. package/trustledger/fixtures/corpus/subledger-out-of-balance--benign-twin/meta.json +7 -0
  121. package/trustledger/fixtures/corpus/subledger-out-of-balance--out-of-trust/inputs.json +13 -0
  122. package/trustledger/fixtures/corpus/subledger-out-of-balance--out-of-trust/meta.json +7 -0
  123. package/trustledger/fixtures/e2e/bank.aliased.csv +4 -0
  124. package/trustledger/fixtures/e2e/bank.csv +4 -0
  125. package/trustledger/fixtures/e2e/bank.nsf.csv +4 -0
  126. package/trustledger/fixtures/e2e/quickbooks.csv +6 -0
  127. package/trustledger/fixtures/e2e/quickbooks.nsf.csv +8 -0
  128. package/trustledger/fixtures/e2e/rentroll.csv +6 -0
  129. package/trustledger/fixtures/e2e/rentroll.nsf.csv +8 -0
  130. package/trustledger/fixtures/e2e/rentroll.short.csv +5 -0
  131. package/trustledger/fixtures/plans/baseline.json +25 -0
  132. package/trustledger/fixtures/plans/price-binding.example.json +27 -0
  133. package/trustledger/fixtures/policy/ambiguous-deposit-example.json +12 -0
  134. package/trustledger/fixtures/policy/baseline.json +19 -0
  135. package/trustledger/fixtures/policy/ca-example.json +12 -0
  136. package/trustledger/fixtures/policy/negative-tenant-ledger-example.json +12 -0
  137. package/trustledger/fixtures/policy/owner-overdraw-example.json +12 -0
  138. package/trustledger/fixtures/quickbooks.csv +7 -0
  139. package/trustledger/fixtures/quickbooks.real.csv +5 -0
  140. package/trustledger/fixtures/rentroll.csv +6 -0
  141. package/trustledger/fixtures/rentroll.real.csv +4 -0
  142. package/trustledger/ingest.js +1163 -0
  143. package/trustledger/lib/policy-bundled-loader.js +44 -0
  144. package/trustledger/lib/sha256-vendored.js +227 -0
  145. package/trustledger/license.js +563 -0
  146. package/trustledger/match.js +551 -0
  147. package/trustledger/plans.js +551 -0
  148. package/trustledger/policy.js +398 -0
  149. package/trustledger/public/index.html +512 -0
  150. package/trustledger/reconcile.js +1486 -0
  151. package/trustledger/report.js +887 -0
  152. package/trustledger/seal.js +854 -0
  153. package/trustledger/server.js +391 -0
  154. package/trustledger/valueproof.js +350 -0
@@ -0,0 +1,638 @@
1
+ # DataLedger — verifiable AI training-data provenance
2
+
3
+ DataLedger turns a training dataset into a **reproducible, tamper-evident manifest** and a small set
4
+ of artifacts a data-provenance reviewer (enterprise due-diligence, EU AI Act technical documentation)
5
+ actually consumes. It runs on the same path-bound Merkle core as `vh hash`/`vh prove`, so every claim
6
+ it makes is independently re-derivable.
7
+
8
+ Every DataLedger command is **offline, needs NO private key, and needs NO network**. You can hand a
9
+ manifest, a diff, a summary, or a single-file proof to a third party and they can re-derive the result
10
+ on an air-gapped machine with only the `vh` CLI — they do not have to trust your server, your build
11
+ machine, or you.
12
+
13
+ > **Read this first:** the trust posture below is the SAME wording carried in-band in every artifact
14
+ > (`cli/dataset.js` › `TRUST_NOTE` / `MEMBERSHIP_TRUST_NOTE`) and in
15
+ > [`docs/TRUST-BOUNDARIES.md`](TRUST-BOUNDARIES.md). Do not overclaim past it.
16
+
17
+ ---
18
+
19
+ ## What DataLedger PROVES
20
+
21
+ A DataLedger manifest commits to a Merkle root over the full set of `(relPath, content)` pairs in a
22
+ dataset — **file names AND bytes**. From that root and the manifest the following are
23
+ re-derivable by anyone, offline:
24
+
25
+ 1. **Exactly which files a dataset contained — names and bytes.** The root commits to the complete
26
+ `(relPath, content)` set. Any edit, rename, add, or remove changes the root. A manifest whose
27
+ recomputed root (from the bytes on disk) matches its recorded root is byte-for-byte the dataset it
28
+ claims to be — a hand-edited manifest root cannot fake a `MATCH`, because `vh dataset verify`
29
+ re-derives the root from the actual file bytes, not from the manifest's recorded string.
30
+
31
+ 2. **Offline set-membership of any one file.** `vh dataset prove` builds a Merkle proof that a single
32
+ file (its `relPath` + bytes) was a leaf of the manifest's root, matched by **content** (not by the
33
+ caller's filename). `vh dataset verify-proof` folds that proof back to the recorded root **purely
34
+ offline** — no dataset copy, no manifest, no key, no network. A fabricated or altered file does not
35
+ fold to the root and is REJECTED.
36
+
37
+ 3. **The precise add / remove / change between two dataset versions.** `vh dataset diff` compares two
38
+ manifests offline and reports `ADDED` (in B not A), `REMOVED` (in A not B), and `CHANGED` (same
39
+ `relPath`, different content, old→new). A rename shows as `REMOVED`+`ADDED` because the path is bound
40
+ into the leaf. This answers the most common auditor question — "what changed in the training data
41
+ between model version N and N+1?" — without either dataset on disk.
42
+
43
+ 4. **A provenance / license roll-up.** `vh dataset summary` aggregates the manifest into a histogram of
44
+ the claimed `{source, license}` hints over the **trusted file set** (total `fileCount`, the root,
45
+ counts of files per claimed license/source, and explicit buckets for files with no hint).
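As a rough illustration of why a path-bound root reacts to edits and renames alike, here is a minimal sketch. It is not verifyhash's actual encoding: SHA-256, the length-prefixed `relPath`, the `0x00`/`0x01` domain tags, sorting by `relPath`, and promoting an unpaired node are all assumptions made for this example.

```javascript
const { createHash } = require("node:crypto");

// ASSUMPTION: SHA-256 and every encoding detail below are illustrative only.
const sha256 = (...parts) => {
  const h = createHash("sha256");
  for (const p of parts) h.update(p);
  return h.digest();
};

// A leaf binds the path AND the bytes. The path is length-prefixed so that
// ("ab", "c") and ("a", "bc") can never hash to the same leaf.
function leafHash(relPath, content) {
  const pathBytes = Buffer.from(relPath, "utf8");
  const len = Buffer.alloc(4);
  len.writeUInt32BE(pathBytes.length);
  return sha256(Buffer.from([0x00]), len, pathBytes, content);
}

function merkleRoot(files) {
  // Sort by relPath so the root does not depend on directory-walk order.
  let level = [...files]
    .sort((a, b) => (a.relPath < b.relPath ? -1 : a.relPath > b.relPath ? 1 : 0))
    .map((f) => leafHash(f.relPath, f.content));
  if (level.length === 0) return sha256(Buffer.alloc(0));
  while (level.length > 1) {
    const next = [];
    for (let i = 0; i < level.length; i += 2) {
      // An unpaired last node is promoted unchanged to the next level.
      next.push(
        i + 1 < level.length
          ? sha256(Buffer.from([0x01]), level[i], level[i + 1])
          : level[i]
      );
    }
    level = next;
  }
  return level[0];
}

const v1 = [
  { relPath: "a.txt", content: Buffer.from("alpha") },
  { relPath: "b.txt", content: Buffer.from("beta") },
];
const renamed = [{ relPath: "a2.txt", content: Buffer.from("alpha") }, v1[1]];
const edited = [{ relPath: "a.txt", content: Buffer.from("alphA") }, v1[1]];

const hex = (files) => merkleRoot(files).toString("hex");
console.log(hex(v1) === hex([...v1].reverse())); // true: input order is irrelevant
console.log(hex(v1) !== hex(renamed)); // true: a rename changes the root
console.log(hex(v1) !== hex(edited)); // true: a byte edit changes the root
```

Because the path is hashed into the leaf, a rename is indistinguishable from removing one file and adding another, which matches the `REMOVED`+`ADDED` behaviour described for `vh dataset diff`.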
46
+
47
+ ---
48
+
49
+ ## What DataLedger does NOT prove (do not overclaim)
50
+
51
+ - **It is NOT a timestamp.** A manifest binds a file SET to a root; it says nothing about *when* the
52
+ dataset existed. "Unaltered since date T" is a strictly stronger, time-anchored claim that needs the
53
+ **human-owned signing / timestamp trust-root** — a `needs-human` step recorded in
54
+ [`STRATEGY.md`](../STRATEGY.md) (the loop only BUILDS and locally TESTS; standing up a real signing
55
+ key / timestamp anchor is a human action). Until that trust-root exists, never report or imply
56
+ "unaltered since date T".
57
+
58
+ - **The `{source, license}` hints are UNTRUSTED self-asserted metadata.** Per-file `hints`
59
+ (source/license) are recorded labeled as untrusted and are **NOT bound into the Merkle root** —
60
+ editing a hint does not change the root, and the summary counts what the dataset **CLAIMS**, it does
61
+ NOT verify any license or source is correct. `(no license hint)` means the manifest asserts nothing,
62
+ NOT that the file is unlicensed.
63
+
64
+ - **Set-membership ≠ time / authorship / licensing.** A membership proof binds a file to a ROOT; it
65
+ does NOT prove the file is unaltered since a date, who authored it, or under what license — that needs
66
+ the same human-owned signing/timestamp trust-root above.
67
+
68
+ This is the same boundary the artifacts carry in-band, verbatim, so the caveats can never drift from
69
+ the code:
70
+
71
+ > The Merkle root commits to the full set of (relPath, content) pairs (names AND bytes): any edit, rename, add, or remove changes the root. Per-file `hints` (source/license) are UNTRUSTED, self-asserted metadata — they are NOT bound into the root and prove nothing.
72
+
73
+ ---
74
+
75
+ ## Workflow, end to end
76
+
77
+ > **Run it, don't just read it.** [`examples/run.js`](../examples/run.js) is the executable companion to
78
+ > the worked examples below: `node examples/run.js` drives this whole pipeline (`build → check --policy →
79
+ > verify → report → attest`, plus the ProofParcel side) against tiny committed sample data, offline and
80
+ > with no key, and prints a PASS/FAIL summary — including a deliberately FLAGGED policy violation and a
81
+ > caught tamper. It writes only to an OS temp dir, references (does not run) the human-gated sign/timestamp
82
+ > steps, and is test-gated by `test/cli.examples.test.js` so it can never drift from this prose. See
83
+ > [`examples/README.md`](../examples/README.md).
84
+
85
+ ```
86
+ build → diff (between versions) → summary → check (the policy gate) → report (the filed deliverable) → attest (the signing-ready payload) → [human signs, P-3] → verify-attest (offline-verify a signed container) → prove (a single file) → verify-proof
87
+
88
+ # OR, for an INDEPENDENT timestamp (P-3 Option B):
89
+ attest → timestamp-request (the digest your TSA stamps) → [human obtains a token from a TSA] → timestamp-wrap (the token container) → verify-timestamp (offline-verify "an RFC-3161 TSA saw this by date T")
90
+ ```
91
+
92
+ | Command | What it does | Offline? Key? Network? |
93
+ | --- | --- | --- |
94
+ | `vh dataset build <dir> --out <p>` | Write a tamper-evident manifest (Merkle root + per-file leaves; optional untrusted hints) | offline, no key, no network |
95
+ | `vh dataset verify <dir> --manifest <p>` | Re-derive the root from a fresh copy on disk and report a per-file ADDED/REMOVED/CHANGED diff against the manifest | offline, no key, no network |
96
+ | `vh dataset diff <manifestA> <manifestB>` | Compare two manifests; report the exact change set between versions | offline, no tree, no key, no network |
97
+ | `vh dataset summary <manifest>` | Provenance/license roll-up over the trusted file set | offline, no tree, no key, no network |
98
+ | `vh dataset check <manifest> --policy <p> [--json]` | GATE the manifest's self-asserted hints against a written license/source policy: PASS/FAIL + the exact violating files; CI-gateable exit 0/3 | offline, no tree, no key, no network |
99
+ | `vh dataset report <manifest> [--verify <dir>] [--policy <p>] [--json] [--out <p>]` | Consolidate identity + roll-up + (optional) verify verdict + (optional) policy verdict + caveats into ONE deterministic evidence document the reviewer files | offline, no key, no network |
100
+ | `vh dataset attest <manifest> [--json] [--out <p>]` | Emit the canonical, byte-deterministic UNSIGNED attestation payload (root + fileCount + manifestDigest) a human signing/timestamp trust-root will sign | offline, no key, no network |
101
+ | `vh dataset sign <manifest> --key-env <VAR>\|--key-file <p> [--out <p>] [--json]` | Sign the UNSIGNED attestation with a key YOU provisioned → the signed container `verify-attest` accepts. Read-only of YOUR key; never generates/persists/logs a key | offline, **caller-supplied key**, no network |
102
+ | `vh dataset verify-attest <signed> [--manifest <m>] [--signer <addr>] [--json]` | OFFLINE-verify a SIGNED attestation container: recover the signer, optionally pin the publisher (`--signer`) and bind to your manifest (`--manifest`); ACCEPTED/REJECTED with a CI-gateable exit 0/3 | offline, no key, no network |
103
+ | `vh dataset timestamp-request <manifest> [--out <p>] [--json]` | Emit the SHA-256 digest of the canonical attestation bytes — the exact `messageImprint` you submit to your RFC-3161 TSA | offline, no key, no network |
104
+ | `vh dataset timestamp-wrap <manifest> --token <p> [--out <p>] [--json]` | Wrap the TSA's returned RFC-3161 token into a verifiable `verifyhash.dataset-attestation-timestamped` container (binds it to the re-derived SHA-256 digest) | offline, no key, no network |
105
+ | `vh dataset verify-timestamp <container> [--manifest <m>] [--json]` | OFFLINE-verify a timestamped container: re-derive the digest, confirm the RFC-3161 token binds it, optionally bind to your manifest; ACCEPTED (with genTime / TSA serial / policy OID) or REJECTED; CI-gateable exit 0/3 | offline, no key, no network |
106
+ | `vh dataset prove --file <p> --manifest <m> --out <a>` | Build a portable set-membership proof for ONE file | offline, no key, no network |
107
+ | `vh dataset verify-proof <proof>` | Fold the membership proof back to the recorded root | purely offline, no dataset, no key, no network |
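The `timestamp-request` row depends on the attestation bytes being canonical, so that the same payload always hashes to the same `messageImprint`. A hedged sketch of that idea follows. The sorted-key JSON encoding is an assumption standing in for whatever canonical form the tool really uses, and the payload values are hypothetical; only the field names `root`, `fileCount`, and `manifestDigest` come from the table above.

```javascript
const { createHash } = require("node:crypto");

// ASSUMPTION: sorted-key JSON stands in for the tool's canonical encoding.
function canonicalize(value) {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const body = Object.keys(value)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalize(value[k])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

// SHA-256 over the canonical bytes: the digest a TSA would be asked to stamp.
function messageImprint(attestation) {
  return createHash("sha256")
    .update(canonicalize(attestation), "utf8")
    .digest("hex");
}

// Hypothetical payload values, for illustration only.
const a = { root: "0xabc", fileCount: 2, manifestDigest: "0x123" };
const b = { manifestDigest: "0x123", fileCount: 2, root: "0xabc" };

console.log(messageImprint(a) === messageImprint(b)); // true: key order is irrelevant
console.log(messageImprint(a) === messageImprint({ ...a, fileCount: 3 })); // false
```

A byte-deterministic encoding is what lets a verifier re-derive the digest later and compare it with the one inside the RFC-3161 token.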
108
+
109
+ ### Worked example
110
+
111
+ Manifest a dataset, then snapshot a later version and ask what changed.
112
+
113
+ ```sh
114
+ # 1. BUILD a manifest of dataset version 1 (optionally attach untrusted source/license hints).
115
+ vh dataset build ./dataset-v1 --out v1.manifest.json --hints ./hints.json
116
+ # wrote v1.manifest.json root: 0xabc… fileCount: 1024
117
+
118
+ # 2. VERIFY a fresh copy on disk re-derives the same root, and localize any drift per file.
119
+ vh dataset verify ./dataset-v1 --manifest v1.manifest.json
120
+ # MATCH — recomputed root == manifest root (exit 0)
121
+
122
+ # 3. Later: build version 2's manifest, then DIFF the two versions OFFLINE (no datasets on disk).
123
+ vh dataset build ./dataset-v2 --out v2.manifest.json
124
+ vh dataset diff v1.manifest.json v2.manifest.json
125
+ # DIFFERENT
126
+ # ADDED: 12 files
127
+ # REMOVED: 3 files
128
+ # CHANGED: 5 files (relPath same, content differs) (exit 3)
129
+
130
+ # 4. SUMMARY: the provenance/license roll-up a reviewer reads.
131
+ vh dataset summary v2.manifest.json
132
+ # fileCount: 1033 root: 0xdef…
133
+ # licenses: { MIT: 900, CC-BY-4.0: 110, (no license hint): 23 }
134
+ # TRUST: the file SET is bound into the root; hints are UNTRUSTED self-asserted metadata.
135
+
136
+ # 5. PROVE one file was a member of v2, as a portable artifact…
137
+ vh dataset prove --file ./dataset-v2/img/0007.jpg --manifest v2.manifest.json --out 0007.proof.json
138
+ # MEMBER — wrote 0007.proof.json (exit 0)
139
+
140
+ # 6. …and the reviewer VERIFY-PROOFs it on an air-gapped machine — no dataset, no manifest, no key, no net.
141
+ vh dataset verify-proof 0007.proof.json
142
+ # CONFIRMED — leaf folds to the recorded root (exit 0)
143
+ ```
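Steps 5 and 6 above rely on a standard Merkle inclusion proof: the prover records the sibling hashes along the path to the root, and the verifier folds them back up. The sketch below shows that mechanism under stated assumptions. The proof shape `{ leaf, path, root }`, the `leaf\0`/`node\0` tags, and SHA-256 are all hypothetical and do not describe the real proof file format.

```javascript
const { createHash } = require("node:crypto");

const sha256 = (...parts) => {
  const h = createHash("sha256");
  for (const p of parts) h.update(p);
  return h.digest();
};
// ASSUMPTION: illustrative encodings, not the tool's real leaf or node rules.
const leafHash = (relPath, content) => sha256("leaf\0", relPath, "\0", content);
const nodeHash = (left, right) => sha256("node\0", left, right);

// Build every level of the tree, leaves first, root last.
function buildLevels(leaves) {
  const levels = [leaves];
  while (levels[levels.length - 1].length > 1) {
    const prev = levels[levels.length - 1];
    const next = [];
    for (let i = 0; i < prev.length; i += 2) {
      next.push(i + 1 < prev.length ? nodeHash(prev[i], prev[i + 1]) : prev[i]);
    }
    levels.push(next);
  }
  return levels;
}

// Collect the sibling at each level; an unpaired node contributes no step.
function prove(levels, index) {
  const path = [];
  for (let d = 0; d < levels.length - 1; d++) {
    const sibling = index ^ 1;
    if (sibling < levels[d].length) {
      path.push({
        hash: levels[d][sibling].toString("hex"),
        siblingOnLeft: sibling < index,
      });
    }
    index >>= 1;
  }
  return path;
}

// The verifier needs only the leaf and the path, never the dataset.
function fold(leafHex, path) {
  let acc = Buffer.from(leafHex, "hex");
  for (const step of path) {
    const sib = Buffer.from(step.hash, "hex");
    acc = step.siblingOnLeft ? nodeHash(sib, acc) : nodeHash(acc, sib);
  }
  return acc.toString("hex");
}

const files = [["a.txt", "alpha"], ["b.txt", "beta"], ["c.txt", "gamma"]];
const levels = buildLevels(files.map(([p, c]) => leafHash(p, c)));
const root = levels[levels.length - 1][0].toString("hex");

const proof = { leaf: levels[0][2].toString("hex"), path: prove(levels, 2), root };
console.log(fold(proof.leaf, proof.path) === proof.root); // true: member

const forgedLeaf = leafHash("c.txt", "gamm4").toString("hex");
console.log(fold(forgedLeaf, proof.path) === proof.root); // false: altered bytes
```

The proof grows with the tree depth rather than the dataset size, which is why a single-file artifact is enough for the reviewer.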
144
+
145
+ Exit codes are CI-friendly: `vh dataset verify` and `vh dataset diff` exit `3` on
146
+ mismatch/difference (so a pipeline can gate "the training set changed unexpectedly"), `0` on
147
+ match/identical; `vh dataset prove`/`verify-proof` exit `0` MEMBER/CONFIRMED, `3` non-member/rejected;
148
+ and `vh dataset check` exits `0` PASS / `3` FAIL (the policy gate, below).
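The `diff` half of that exit contract can be pictured as a set comparison keyed on `relPath`. The following is a sketch only: the manifest shape `{ files: [{ relPath, leaf }] }` and the `IDENTICAL` label are assumptions for illustration, while `DIFFERENT`, the three categories, and the `0`/`3` exit codes come from the text above.

```javascript
// ASSUMPTION: each manifest is modeled as { files: [{ relPath, leaf }] }.
function diffManifests(a, b) {
  const inA = new Map(a.files.map((f) => [f.relPath, f.leaf]));
  const inB = new Map(b.files.map((f) => [f.relPath, f.leaf]));

  const added = [...inB.keys()].filter((p) => !inA.has(p)).sort();
  const removed = [...inA.keys()].filter((p) => !inB.has(p)).sort();
  // Same relPath on both sides, but the committed leaf differs.
  const changed = [...inA.keys()]
    .filter((p) => inB.has(p) && inB.get(p) !== inA.get(p))
    .sort();

  const identical = added.length + removed.length + changed.length === 0;
  return {
    verdict: identical ? "IDENTICAL" : "DIFFERENT",
    added,
    removed,
    changed,
    exitCode: identical ? 0 : 3, // 3 lets a pipeline gate on unexpected change
  };
}

// Hypothetical leaf values; "c.txt" was renamed to "d.txt" with new content.
const v1 = { files: [
  { relPath: "a.txt", leaf: "11" },
  { relPath: "b.txt", leaf: "22" },
  { relPath: "c.txt", leaf: "33" },
] };
const v2 = { files: [
  { relPath: "a.txt", leaf: "11" },
  { relPath: "b.txt", leaf: "99" },
  { relPath: "d.txt", leaf: "44" },
] };

console.log(diffManifests(v1, v2));
```

Neither dataset needs to be on disk for this comparison, since everything it reads is already recorded in the two manifests.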
149
+
150
+ ---
151
+
152
+ ## Policy compliance gate
153
+
154
+ `vh dataset summary` *describes* a dataset's license/source composition; `vh dataset check <manifest>
155
+ --policy <p>` **GATES** it. It is the difference between "a provenance report" and "a compliance control":
156
+ the control your pipeline runs on every change and the verdict your auditor files (an EU-AI-Act
157
+ technical-documentation / enterprise due-diligence packet). It answers the one question a compliance
158
+ reviewer and a CI job actually ask — **"does this training set VIOLATE our written policy?"** — as a
159
+ deterministic, OFFLINE PASS/FAIL with the exact list of which files broke which rule.
160
+
161
+ > **Trust posture, FIRST (the same wording the artifact carries in-band, verbatim).** The `{source,
162
+ > license}` hints checked here are **UNTRUSTED, self-asserted metadata that are NOT bound into the
163
+ > Merkle root.** A PASS means the dataset's self-asserted hints satisfy this policy —
164
+ > **NOT that the licenses are genuinely correct.** A `(no license hint)` file ASSERTS NOTHING (the `requireLicense` rule
165
+ > is the one that flags it). This NEVER verifies that any license or source is real. It is the same
166
+ > boundary every DataLedger artifact carries:
167
+ >
168
+ > > The Merkle root commits to the full set of (relPath, content) pairs (names AND bytes): any edit, rename, add, or remove changes the root. Per-file `hints` (source/license) are UNTRUSTED, self-asserted metadata — they are NOT bound into the root and prove nothing.
169
+
170
+ ### The policy file
171
+
172
+ A policy is a small, versioned, strictly-validated JSON document. A corrupt, foreign, or malformed
173
+ policy is **rejected outright** (never half-accepted into a surprise verdict). Two fixed fields identify
174
+ it, then every RULE field is **optional and combinable**:
175
+
176
+ | Field | Required | Type | Meaning / match semantics |
177
+ | --- | --- | --- | --- |
178
+ | `kind` | yes | string | MUST be exactly `verifyhash.dataset-policy`. |
179
+ | `schemaVersion` | yes | number | MUST be a supported version (this build understands `1`). |
180
+ | `allowLicenses` | no | string[] | A file whose license hint is **NOT** in this list VIOLATES. A file with **no** license hint also violates (it is in no allowlist). |
181
+ | `denyLicenses` | no | string[] | A file whose license hint **IS** in this list VIOLATES. A file with **no** license hint does NOT violate (no value to match). |
182
+ | `allowSources` | no | string[] | Same as `allowLicenses`, on the `source` hint. |
183
+ | `denySources` | no | string[] | Same as `denyLicenses`, on the `source` hint. |
184
+ | `requireLicense` | no | boolean | When `true`, every file MUST carry a license hint; a `(no license hint)` file VIOLATES. This is the ONE rule that flags a missing hint. |
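The table above translates naturally into a strict validator that throws rather than guessing. This is a minimal sketch of such a reader, not the package's own: the function name `validatePolicy` and the error messages are hypothetical, and whether the real reader also rejects unknown fields is not stated here, so the sketch leaves them alone.

```javascript
const LIST_FIELDS = ["allowLicenses", "denyLicenses", "allowSources", "denySources"];

// Hypothetical strict reader: any malformed policy is rejected outright.
function validatePolicy(policy) {
  if (policy === null || typeof policy !== "object" || Array.isArray(policy)) {
    throw new Error("policy must be a JSON object");
  }
  if (policy.kind !== "verifyhash.dataset-policy") {
    throw new Error("policy.kind must be exactly verifyhash.dataset-policy");
  }
  if (policy.schemaVersion !== 1) {
    throw new Error("unsupported policy.schemaVersion");
  }
  for (const field of LIST_FIELDS) {
    const value = policy[field];
    if (value === undefined) continue; // every rule field is optional
    if (!Array.isArray(value) || !value.every((x) => typeof x === "string")) {
      throw new Error(`policy.${field} must be an array of strings`);
    }
  }
  if (policy.requireLicense !== undefined && typeof policy.requireLicense !== "boolean") {
    throw new Error("policy.requireLicense must be a boolean");
  }
  return policy;
}

const good = {
  kind: "verifyhash.dataset-policy",
  schemaVersion: 1,
  denyLicenses: ["GPL-3.0", "AGPL-3.0"],
  requireLicense: true,
};
console.log(validatePolicy(good) === good); // true

try {
  validatePolicy({ ...good, denyLicenses: "GPL-3.0" }); // a string, not string[]
} catch (err) {
  console.log(err.message); // policy.denyLicenses must be an array of strings
}
```

Failing loudly on a malformed list avoids the worst outcome for a gate: a policy that parses, evaluates zero rules, and reports a misleading PASS.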
185
+
186
+ **Match semantics (so a verdict is reproducible).** A file's "license hint value" is its
187
+ `hints.license` string, or the **absence** of one (no `hints` at all, or `hints` with no `license`);
188
+ likewise for `hints.source`. All comparisons against the policy's lists are **CASE-SENSITIVE EXACT STRING
189
+ matches** — `"GPL-3.0"` matches only `"GPL-3.0"`, never `"gpl-3.0"` or `"GPL-3.0-or-later"`. A missing
190
+ hint is reported with the explicit `(no license hint)` / `(no source hint)` sentinel value; the sentinel
191
+ marks absence and is never a literal hint string.
192
+
193
+ **The no-rules case.** A policy that declares **no rules** (no list fields, or only empty lists, and
194
+ `requireLicense` not `true`) is valid and **trivially PASSes** — every dataset satisfies a policy with no
195
+ constraints. `vh dataset check` says so explicitly (`rules evaluated: 0` + a NOTE) so a green check from
196
+ an empty policy can never be mistaken for a real gate.
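One way to read the no-rules case is that a rule counts as evaluated only when it can actually constrain something. The sketch below encodes that reading. The helper names and the wording of the NOTE are hypothetical, and treating an empty list as inactive follows the paragraph above rather than any confirmed implementation detail.

```javascript
const LIST_RULES = ["allowLicenses", "denyLicenses", "allowSources", "denySources"];

// A list rule is active only when it is a non-empty array;
// requireLicense is active only when it is exactly true.
function activeRules(policy) {
  const active = LIST_RULES.filter(
    (field) => Array.isArray(policy[field]) && policy[field].length > 0
  );
  if (policy.requireLicense === true) active.push("requireLicense");
  return active;
}

// Hypothetical rendering of the "rules evaluated" line.
function describe(policy) {
  const n = activeRules(policy).length;
  return n === 0
    ? "rules evaluated: 0 (NOTE: this policy declares no rules; PASS is trivial)"
    : `rules evaluated: ${n}`;
}

const base = { kind: "verifyhash.dataset-policy", schemaVersion: 1 };

console.log(describe(base));
console.log(describe({ ...base, allowLicenses: [], requireLicense: false }));
console.log(describe({ ...base, denyLicenses: ["GPL-3.0"], requireLicense: true }));
```

The first two policies report zero rules and the third reports two, which is the same count shown in the worked example for a `denyLicenses` plus `requireLicense` policy.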
197
+
198
+ ### `vh dataset check` — the gate
199
+
200
+ `vh dataset check <manifest> --policy <p>` reads the manifest via the SAME strict reader the other
201
+ commands use and the policy via its strict reader, then evaluates the manifest's **trusted file set**
202
+ against the policy in a **pure, deterministic** function (no tree, no provider, no key, no network) and
203
+ emits a verdict:
204
+
205
+ - **PASS / FAIL.** PASS when no file's self-asserted hints violate any rule; FAIL when at least one does.
206
+ - **The violating-file output.** On FAIL, one line per violation — the **file (relPath)**, the **rule it
207
+ broke**, and the **offending hint value** — sorted by relPath then rule, so two runs over the same
208
+ inputs produce byte-identical output. A single file that breaks two rules produces two lines.
209
+ - **The 0/3 exit contract a CI job gates on.** `vh dataset check` exits **`0` on PASS, `3` on FAIL** —
210
+ the SAME data-divergence exit convention as `vh dataset verify`/`diff`, so all dataset gates share one
211
+ contract. A missing/unreadable manifest or policy is a runtime error (exit `1`); a missing `--policy`
212
+ is a usage error (exit `2`) — a gate with no policy must never silently pass. So a pipeline step is
213
+ simply `vh dataset check ds.manifest.json --policy org-policy.json` and the build blocks on a non-zero
214
+ exit.
215
+ - **`--json`.** Emits the machine object
216
+ `{ verdict, fileCount, rulesEvaluated, violations: [{ relPath, rule, value }] }` for an ingestion
217
+ pipeline. The `rule` strings are stable identifiers a consumer can gate on
218
+ (`allowLicenses` / `denyLicenses` / `allowSources` / `denySources` / `requireLicense`).
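The behaviour listed above can be sketched as a single pure function. This is an illustration under assumptions, not the evaluator in `cli/dataset.js`: the manifest shape `{ files: [{ relPath, hints }] }`, the function name `checkPolicy`, and the plain code-unit string ordering used for sorting are all hypothetical. The rule names, sentinels, and the `--json` field names are taken from the text.

```javascript
const NO_LICENSE = "(no license hint)";
const NO_SOURCE = "(no source hint)";
const LIST_RULES = ["allowLicenses", "denyLicenses", "allowSources", "denySources"];

const isActive = (policy, field) =>
  Array.isArray(policy[field]) && policy[field].length > 0;
const cmp = (x, y) => (x < y ? -1 : x > y ? 1 : 0);

function checkPolicy(manifest, policy) {
  const violations = [];
  for (const { relPath, hints } of manifest.files) {
    const license = hints?.license; // undefined means "no license hint"
    const source = hints?.source;
    const flag = (rule, value) => violations.push({ relPath, rule, value });

    // Array.prototype.includes is an exact, case-sensitive match for strings,
    // and a missing (undefined) hint never matches a deny list.
    if (isActive(policy, "allowLicenses") && !policy.allowLicenses.includes(license))
      flag("allowLicenses", license ?? NO_LICENSE);
    if (isActive(policy, "denyLicenses") && policy.denyLicenses.includes(license))
      flag("denyLicenses", license);
    if (isActive(policy, "allowSources") && !policy.allowSources.includes(source))
      flag("allowSources", source ?? NO_SOURCE);
    if (isActive(policy, "denySources") && policy.denySources.includes(source))
      flag("denySources", source);
    if (policy.requireLicense === true && license === undefined)
      flag("requireLicense", NO_LICENSE);
  }
  // Sort by relPath, then rule, so repeated runs are byte-identical.
  violations.sort((a, b) => cmp(a.relPath, b.relPath) || cmp(a.rule, b.rule));

  const rulesEvaluated =
    LIST_RULES.filter((f) => isActive(policy, f)).length +
    (policy.requireLicense === true ? 1 : 0);
  return {
    verdict: violations.length === 0 ? "PASS" : "FAIL",
    fileCount: manifest.files.length,
    rulesEvaluated,
    violations,
  };
}

const manifest = { files: [
  { relPath: "src/vendored/lib.py", hints: { license: "GPL-3.0" } },
  { relPath: "data/notes.txt" },
  { relPath: "data/ok.txt", hints: { license: "MIT" } },
  { relPath: "data/lower.txt", hints: { license: "gpl-3.0" } }, // no match: case differs
] };
const policy = {
  kind: "verifyhash.dataset-policy",
  schemaVersion: 1,
  denyLicenses: ["GPL-3.0", "AGPL-3.0"],
  requireLicense: true,
};

const result = checkPolicy(manifest, policy);
console.log(result.verdict, result.rulesEvaluated); // FAIL 2
console.log(result.violations);
process.exitCode = result.verdict === "PASS" ? 0 : 3;
```

Keeping the evaluator free of I/O is what allows a report command to reuse it and arrive at the same verdict for the same inputs.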
219
+
220
+ ### `vh dataset report --policy` — embedding the verdict in the filed document
221
+
222
+ `vh dataset report <manifest> --policy <p>` folds the **SAME pure evaluator** `vh dataset check` runs
223
+ (verbatim — no re-implementation) into the filed evidence document as a **"Policy compliance" section**:
224
+ the verdict, the number of rules evaluated, and (on FAIL) the violating files. Because it reuses the same
225
+ evaluator, the report's PASS/FAIL **can never diverge** from `vh dataset check`'s for the same manifest +
226
+ policy. The section LEADS with the same UNTRUSTED-hints caveat as `vh dataset check`, so the embedded
227
+ verdict never implies a real license was checked.
228
+
229
+ `--policy` combines with `--verify` to make **one report invocation a complete CI gate**: with `--verify`
230
+ the exit is `0` MATCH / `3` MISMATCH; with `--policy` it is `0` PASS / `3` FAIL; with **both**, the report
231
+ exits `3` if **EITHER** the live-tree verify is a MISMATCH OR the policy is a FAIL, and `0` only when the
232
+ verify is MATCH **and** the policy is PASS — so a single command gates data integrity AND policy
233
+ compliance, and the buyer's filed document shows both verdicts.
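The combined exit rule is small enough to state as a function. In this sketch the name `reportExitCode` and the use of `undefined` for "flag not given" are assumptions; the verdict strings and the `0`/`3` codes are the ones described above.

```javascript
// verify: "MATCH" | "MISMATCH" | undefined (no --verify given)
// policy: "PASS"  | "FAIL"     | undefined (no --policy given)
function reportExitCode({ verify, policy }) {
  // Either failing verdict is enough to block; a gate that did not run
  // contributes nothing to the result.
  return verify === "MISMATCH" || policy === "FAIL" ? 3 : 0;
}

console.log(reportExitCode({ verify: "MATCH", policy: "PASS" })); // 0
console.log(reportExitCode({ verify: "MISMATCH", policy: "PASS" })); // 3
console.log(reportExitCode({ verify: "MATCH", policy: "FAIL" })); // 3
console.log(reportExitCode({ policy: "FAIL" })); // 3
console.log(reportExitCode({})); // 0
```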
234
+
235
+ ### Worked example — build with hints, write a policy, check, then embed in a report
236
+
237
+ ```sh
238
+ # 1. BUILD a manifest, attaching the (UNTRUSTED, self-asserted) source/license hints per file.
239
+ vh dataset build ./dataset-v2 --out v2.manifest.json --hints ./hints.json
240
+ # wrote v2.manifest.json root: 0xdef… fileCount: 1033
241
+
242
+ # 2. WRITE a policy: a proprietary product that forbids copyleft and requires every file to be licensed.
243
+ cat > org-policy.json <<'JSON'
244
+ {
245
+ "kind": "verifyhash.dataset-policy",
246
+ "schemaVersion": 1,
247
+ "denyLicenses": ["GPL-3.0", "AGPL-3.0"],
248
+ "requireLicense": true
249
+ }
250
+ JSON
251
+
252
+ # 3. CHECK the dataset against the policy — the gate a CI job runs (exit 0 PASS / 3 FAIL).
253
+ vh dataset check v2.manifest.json --policy org-policy.json
254
+ # TRUST: the {source, license} hints checked here are UNTRUSTED, self-asserted metadata. …
255
+ # policy check: FAIL
256
+ # files: 1033
257
+ # rules evaluated: 2
258
+ # FAIL: 2 violations (each line: the file, the rule it broke, and the offending hint value):
259
+ # data/notes.txt [requireLicense] value: (no license hint)
260
+ # src/vendored/lib.py [denyLicenses] value: GPL-3.0 (exit 3)
261
+
262
+ # 4. …or as a machine object for an ingestion pipeline:
263
+ vh dataset check v2.manifest.json --policy org-policy.json --json
264
+ # {"verdict":"FAIL","fileCount":1033,"rulesEvaluated":2,"violations":[…]}
265
+
266
+ # 5. EMBED the SAME verdict in the ONE document the reviewer files (and gate integrity + policy at once):
267
+ vh dataset report v2.manifest.json --verify ./dataset-v2 --policy org-policy.json --out evidence.md
268
+ # dataset report written: /abs/path/evidence.md (exit 3 if EITHER verify MISMATCH or policy FAIL)
269
+ ```
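On the consuming side, an ingestion pipeline might group the `--json` violations by their stable `rule` identifier and decide which rules block a release. The sketch below assumes only the documented object shape; the function name `gateOnCheck`, the notion of a caller-chosen list of blocking rules, and the sample JSON text are hypothetical.

```javascript
// Group violations by rule and report which caller-chosen rules were hit.
function gateOnCheck(jsonText, blockingRules) {
  const result = JSON.parse(jsonText);
  const byRule = {};
  for (const { relPath, rule } of result.violations) {
    if (!byRule[rule]) byRule[rule] = [];
    byRule[rule].push(relPath);
  }
  const blocking = blockingRules.filter((rule) => byRule[rule] !== undefined);
  return { verdict: result.verdict, byRule, blocking };
}

// Hypothetical captured output, shaped like the documented --json object.
const sample = JSON.stringify({
  verdict: "FAIL",
  fileCount: 1033,
  rulesEvaluated: 2,
  violations: [
    { relPath: "data/notes.txt", rule: "requireLicense", value: "(no license hint)" },
    { relPath: "src/vendored/lib.py", rule: "denyLicenses", value: "GPL-3.0" },
  ],
});

const outcome = gateOnCheck(sample, ["denyLicenses"]);
console.log(outcome.blocking); // [ 'denyLicenses' ]
console.log(outcome.byRule.requireLicense); // [ 'data/notes.txt' ]
```

Gating on `rule` rather than on message text keeps a consumer stable even if the human-readable output changes.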
270
+
271
+ > **What a PASS does and does NOT mean.** A PASS attests that the dataset's **self-asserted hints satisfy
272
+ > the policy** — the control your pipeline ran and the verdict your auditor files. It is **NOT** a claim
273
+ > that the licenses are genuinely correct, nor a timestamp ("unaltered since date T"). Those require the
274
+ > human-owned signing/timestamp trust-root (`needs-human`, P-3 in [`STRATEGY.md`](../STRATEGY.md)).
275
+
276
+ ---
277
+
278
+ ## The evidence report
279
+
280
+ A reviewer does not file three terminal outputs — they file **one document**. `vh dataset report
281
+ <manifest> [--verify <dir>] [--policy <p>] [--json] [--out <p>]` consolidates everything a manifest
282
+ already proves into a single deterministic artifact you attach to an EU-AI-Act technical-documentation
283
+ section or an enterprise data-provenance due-diligence packet.
284
+
285
+ It **invents no new math.** The dataset identity (root + fileCount) comes from the strict manifest read;
286
+ the provenance/license roll-up reuses the SAME aggregation `vh dataset summary` emits (identical
287
+ histogram order); the optional verification reuses `vh dataset verify` verbatim; the optional policy
288
+ verdict reuses the SAME pure evaluator `vh dataset check` runs (see [Policy compliance
289
+ gate](#policy-compliance-gate) above). So the report can never drift from the commands it consolidates.
290
+
291
+ What it consolidates, in a stable section order:
292
+
293
+ 1. **Trust posture, FIRST.** The same in-band `TRUST_NOTE` (file SET bound into the root and trustworthy;
294
+ `{source, license}` hints UNTRUSTED and NOT bound into the root) plus the explicit no-overclaim line:
295
+ this report is NOT a timestamp — it does not prove the dataset is "unaltered since date T", nor
296
+ authorship/licensing.
297
+ 2. **Dataset identity** — the Merkle `root` and `fileCount`.
298
+ 3. **Verification status** — either the embedded `--verify` verdict, or a plain statement that NO
299
+ live-tree verification was performed (so the report never *implies* a verify that did not run).
300
+ 4. **Policy compliance** — ONLY when `--policy` is given: the PASS/FAIL verdict, rules evaluated, and (on
301
+ FAIL) the violating files (relPath / rule / value), leading with the same UNTRUSTED-hints caveat as
302
+ `vh dataset check`. Omitted entirely without `--policy`, so the report never implies a gate that did
303
+ not run.
304
+ 5. **Provenance / license roll-up** — the `{source, license}` histogram over the trusted file set.
305
+
306
+ **Deterministic Markdown vs `--json`.** The default human output is a Markdown document with a stable
307
+ section order and a histogram ordered by the same rule `vh dataset summary` uses, so two runs over the
308
+ same manifest produce **byte-identical Markdown** — suitable to attach to a filing and to diff in CI.
309
+ `--json` emits the same consolidated model as a machine object for an ingestion pipeline. `--out <p>`
310
+ writes the document to a caller-chosen explicit path (never silently the cwd) and names the file.
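Byte-identical output depends on a total ordering of the histogram rows. The sketch below shows one way to get that. The ordering rule used here (count descending, then key ascending) is an assumption, since this section does not spell out the rule `vh dataset summary` applies, and the manifest shape and Markdown layout are likewise hypothetical.

```javascript
// ASSUMPTION: rows are ordered by count descending, then key ascending.
function licenseHistogram(files) {
  const counts = new Map();
  for (const f of files) {
    const key = f.hints?.license ?? "(no license hint)";
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return [...counts.entries()].sort(
    ([ka, ca], [kb, cb]) => cb - ca || (ka < kb ? -1 : ka > kb ? 1 : 0)
  );
}

// Hypothetical Markdown layout for the roll-up table.
function renderMarkdown(rows) {
  const lines = ["| license | files |", "| --- | --- |"];
  for (const [license, n] of rows) lines.push(`| ${license} | ${n} |`);
  return lines.join("\n") + "\n";
}

const files = [
  { relPath: "a.py", hints: { license: "MIT" } },
  { relPath: "b.py", hints: { license: "CC-BY-4.0" } },
  { relPath: "c.txt" },
  { relPath: "d.py", hints: { license: "MIT" } },
];

const first = renderMarkdown(licenseHistogram(files));
const second = renderMarkdown(licenseHistogram([...files].reverse()));
console.log(first === second); // true: input order does not affect the bytes
```

Breaking ties on the key matters: without it, two licenses with equal counts could swap places between runs and the rendered document would no longer diff cleanly.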
311
+
312
+ **The optional `--verify` status section.** Without `--verify`, the report documents the manifest's
313
+ *claimed* root and says so plainly. With `--verify <dir>` it re-derives the root from the live tree
314
+ (still offline — no network) and embeds the **MATCH/MISMATCH verdict** plus the per-file
315
+ ADDED/REMOVED/CHANGED localization; under `--verify` the command's exit code mirrors `vh dataset verify`
316
+ (`0` on MATCH, `3` on MISMATCH) so a pipeline can gate on it.
317
+
318
+ **The optional `--policy` compliance section.** With `--policy <p>` the report embeds the SAME PASS/FAIL
319
+ verdict `vh dataset check` produces (see above). Combined with `--verify`, ONE report invocation is a
320
+ complete CI gate: it exits `3` if **EITHER** the live-tree verify is a MISMATCH **OR** the policy is a
321
+ FAIL, and `0` only when both pass.
322
+
323
+ ```sh
324
+ # The single document a reviewer files (manifest-only — claims the manifest's root):
325
+ vh dataset report v2.manifest.json --out evidence.md
326
+ # dataset report written: /abs/path/evidence.md
327
+
328
+ # …or with a live-tree verdict embedded, gating on the recomputed-root match (exit 3 on drift):
329
+ vh dataset report v2.manifest.json --verify ./dataset-v2 --out evidence.md
330
+
331
+ # …or with the policy verdict embedded too, so ONE invocation gates integrity AND policy (exit 3 if either fails):
332
+ vh dataset report v2.manifest.json --verify ./dataset-v2 --policy org-policy.json --out evidence.md
333
+
334
+ # …or the machine form for an ingestion pipeline:
335
+ vh dataset report v2.manifest.json --json
336
+ ```
337
+
338
+ > **What the reviewer files.** This Markdown (or its `--json` twin) IS the deliverable — the EU-AI-Act
339
+ > technical-documentation section / due-diligence evidence packet a buyer's compliance process is built
340
+ > around — not a transcript of three commands. It still claims nothing past the standing trust posture:
341
+ > no wall-clock "unaltered since date T", no truth of any `{source, license}` hint.
342
+
343
+ ---
344
+
345
+ ## Unsigned attestation payload
346
+
347
+ `vh dataset attest <manifest> [--json] [--out <p>]` emits the **canonical, byte-deterministic** payload a
348
+ human-owned signing/timestamp trust-root will sign. It is the bridge that turns the human step (P-3) from
349
+ "design and sign a payload" into "sign THIS exact file."
350
+
351
+ **What it commits to.** A small envelope binding the dataset IDENTITY:
352
+
353
+ - `root` — the manifest's Merkle root.
354
+ - `fileCount` — the number of files in the committed set.
355
+ - `manifestDigest` — `keccak256` over a canonical serialization of the manifest's committed file set
356
+ (each entry's root-committed `{relPath, contentHash, leaf}`, keys in fixed order, entries sorted by
357
+ `relPath`, no insignificant whitespace; the UNTRUSTED `hints` are excluded). Any edit/rename/add/remove
358
+ to the committed set changes the digest.
359
+
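The canonicalization rules in the list above can be sketched as follows. This is an illustration only, not the tool's implementation: `canonicalFileSet`, `standInDigest`, and the sample entries are hypothetical, and SHA-256 from Node's built-in `crypto` module stands in for `keccak256` (which needs a third-party library), so the hex values will not match a real `manifestDigest`.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical helper: canonical string for the committed file set.
// Only the root-committed fields are kept, in a fixed key order; `hints` are dropped.
function canonicalFileSet(entries) {
  const committed = entries
    .map(({ relPath, contentHash, leaf }) => ({ relPath, contentHash, leaf }))
    // Sort by relPath with a plain code-unit comparison (locale-independent).
    .sort((x, y) => (x.relPath < y.relPath ? -1 : x.relPath > y.relPath ? 1 : 0));
  // JSON.stringify without an indent argument emits no insignificant whitespace.
  return JSON.stringify(committed);
}

// Stand-in digest: SHA-256 here, NOT the keccak256 the real tool uses.
function standInDigest(entries) {
  return "0x" + createHash("sha256").update(canonicalFileSet(entries), "utf8").digest("hex");
}

const setA = [
  { relPath: "b.txt", contentHash: "0x02", leaf: "0x0b", hints: { license: "MIT" } },
  { relPath: "a.txt", contentHash: "0x01", leaf: "0x0a", hints: { license: "CC0" } },
];
// Same committed set, listed in a different order and with different hints.
const setB = [
  { relPath: "a.txt", contentHash: "0x01", leaf: "0x0a" },
  { relPath: "b.txt", contentHash: "0x02", leaf: "0x0b", hints: { license: "GPL" } },
];
// Same paths, but one content hash was edited.
const setC = [
  { relPath: "a.txt", contentHash: "0x01", leaf: "0x0a" },
  { relPath: "b.txt", contentHash: "0xff", leaf: "0x0b" },
];

console.log(standInDigest(setA) === standInDigest(setB)); // true: order and hints do not matter
console.log(standInDigest(setA) === standInDigest(setC)); // false: a committed field changed
```

Sorting and fixing the key order before serializing is what makes the digest independent of how the manifest happened to list its entries.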
360
+ The envelope serializes with a fixed top-level key order and no insignificant whitespace, so **two runs
361
+ over the same manifest produce identical bytes** — that determinism is exactly what makes "sign the
362
+ bytes" well-defined. `--json` emits those same canonical bytes (pipe it straight into a signer); `--out
363
+ <p>` writes them to a caller-chosen explicit path (never the cwd) and prints the path it wrote.
364
+
365
+ **It is UNSIGNED — and says so, in-band.** The envelope carries an explicit `signed: false` and a
366
+ `signature: null` slot, plus the standing caveat verbatim. The strict reader REJECTS any payload that
367
+ claims `signed: true` or a non-null `signature`, so this build can never be tricked into treating a
368
+ hand-edited envelope as if it were signed.
369
+
370
+ ```sh
371
+ # Emit the canonical UNSIGNED payload (the exact bytes a signer/timestamp service signs over):
372
+ vh dataset attest v2.manifest.json --out v2.attestation.json
373
+ # dataset attestation written: /abs/path/v2.attestation.json
374
+
375
+ # …or to stdout / piped into a signer:
376
+ vh dataset attest v2.manifest.json --json
377
+ ```
378
+
379
+ > **Attaching a real signature/timestamp is the human-owned trust-root.** Standing up a real signing key
380
+ > or an external timestamp authority is a `needs-human` step recorded in
381
+ > [`STRATEGY.md`](../STRATEGY.md) as **P-3** — the loop only BUILDS and locally TESTS the UNSIGNED
382
+ > payload. Until a signature is attached, this payload proves only the same set-membership / identity the
383
+ > manifest already does — **NOT** that the dataset is unaltered since a date T. Do not overclaim past P-3.
384
+
385
+ ### Signed attestation + verification
386
+
387
+ The UNSIGNED payload above is the bytes a publisher signs; this build also ships the **detached signature
388
+ container that WRAPS those bytes** and the **offline VERIFIER** a buyer runs to confirm a signature is
389
+ genuine — `vh dataset verify-attest`. Read the boundary FIRST, because it is exactly the standing dataset
390
+ trust posture plus one signing-specific line:
391
+
392
+ > **Trust posture, FIRST (reuses the standing dataset `TRUST_NOTE` verbatim).** A valid signature proves
393
+ > **the holder of `signer`'s key vouched for THIS dataset identity** (the embedded `root` / `fileCount` /
394
+ > `manifestDigest`). It does **NOT** by itself prove a trustworthy TIMESTAMP — there is still no "unaltered
395
+ > since a date T" unless the `scheme` is a timestamp authority, which is **still the human-owned trust-root,
396
+ > `needs-human`, P-3** in [`STRATEGY.md`](../STRATEGY.md). It does **NOT** validate that the dataset's
397
+ > `{source, license}` hints are genuinely correct (that is `vh dataset check`'s untrusted-hint caveat). It
398
+ > is the same boundary every DataLedger artifact carries:
399
+ >
400
+ > > The Merkle root commits to the full set of (relPath, content) pairs (names AND bytes): any edit, rename, add, or remove changes the root. Per-file `hints` (source/license) are UNTRUSTED, self-asserted metadata — they are NOT bound into the root and prove nothing.
401
+
402
+ > **CRITICAL — what this build ships, and what it does NOT.** This build ships the **FORMAT** (the
403
+ > signed-container schema below), the **VERIFIER** (`vh dataset verify-attest`), **AND the SIGNING command**
404
+ > (`vh dataset sign`, below) — all proved end-to-end in the test suite with **EPHEMERAL, throwaway
405
+ > `Wallet.createRandom()` keys generated in-process and never persisted**. **Provisioning a real signing key
406
+ > and choosing trust-root option A/B/C is still the human-owned trust-root, P-3** (`needs-human`,
407
+ > [`STRATEGY.md`](../STRATEGY.md)). The loop NEVER generates, holds, persists, or logs a real key — `vh
408
+ > dataset sign` reads a key the human provisioned OUTSIDE the loop, uses it in-process ONLY to sign, and
409
+ > discards it. Emitting/signing/verifying a signed container NEVER implies "unaltered since date T": a signed
410
+ > container says only "this key vouched for this dataset identity" — the trustworthy *timestamp* is the part
411
+ > P-3 still owns.
412
+
413
+ #### The signed-container schema (`verifyhash.dataset-attestation-signed`)
414
+
415
+ The signed container **WRAPS, never edits** the unsigned payload: it embeds the EXACT canonical UNSIGNED
416
+ bytes verbatim (byte-for-byte the string `vh dataset attest` emits, including its trailing newline) and
417
+ attaches a detached signature alongside. Every field, with a FIXED key order so the container is itself
418
+ byte-deterministic:
419
+
420
+ | Field | Type | Meaning |
421
+ | --- | --- | --- |
422
+ | `kind` | string | MUST be exactly `verifyhash.dataset-attestation-signed`. |
423
+ | `schemaVersion` | number | MUST be a supported version (this build understands `1`). |
424
+ | `note` | string | The standing in-band trust caveat (the dataset `TRUST_NOTE` + the signing-specific line above), carried verbatim so the caveats can never drift from the artifact. |
425
+ | `attestation` | string | **The EXACT canonical UNSIGNED bytes**, embedded as a string. Re-parsed and re-validated by the SAME unsigned reader on every read: it must STILL be strictly `signed: false` / `signature: null`, and must be byte-for-byte `serializeAttestation`'s output. |
426
+ | `signature` | object | The detached `{ scheme, signer, signature }` triple (below). |
427
+ | `signature.scheme` | string | The signature scheme. This build's `scheme` value is **`eip191-personal-sign`** — EIP-191 `personal_sign` over the EXACT embedded canonical bytes (a detached signature, deliberately NOT EIP-712, so the signed message IS the payload bytes verbatim with no separate domain/struct to drift from). |
428
+ | `signature.signer` | string | The CLAIMED `0x` signer address, **lowercase** (a checksummed/mixed-case address is rejected for byte-determinism — lowercase it first). |
429
+ | `signature.signature` | string | The detached signature: for `eip191-personal-sign`, a 65-byte `r‖s‖v` secp256k1 signature as a **lowercase** `0x`-hex string (130 hex chars). |
430
+
431
+ **The wrap-don't-edit invariant.** The embedded `attestation` stays strictly `signed: false` /
432
+ `signature: null`: the strict reader re-parses it and runs the SAME `validateAttestation` the unsigned path
433
+ uses (which hard-rejects any `signed: true` / non-null `signature`), then requires the embedded string to
434
+ be byte-for-byte canonical. So a signed container can never smuggle in an edited or already-"signed"
435
+ payload — wrapping adds a vouch, it never edits the thing vouched for. T-15.2's strict UNSIGNED guarantee
436
+ is preserved unchanged.
437
+
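A consumer that wants to pre-screen a container before handing it to `vh dataset verify-attest` could check the structural rules from the table above along these lines. This is a rough sketch under stated assumptions: `checkSignedContainerShape` is a hypothetical helper, it performs no cryptography, and it does not re-validate the embedded `attestation` or the fixed key order the way the strict reader is described as doing.

```javascript
// Structural checks taken from the schema table; no signature recovery happens here.
const KIND = "verifyhash.dataset-attestation-signed";
const SCHEME = "eip191-personal-sign";

// Hypothetical helper: returns the names of fields that look wrong (empty = plausible shape).
function checkSignedContainerShape(c) {
  const problems = [];
  if (c.kind !== KIND) problems.push("kind");
  if (c.schemaVersion !== 1) problems.push("schemaVersion");
  if (typeof c.note !== "string") problems.push("note");
  if (typeof c.attestation !== "string") problems.push("attestation");
  const sig = c.signature || {};
  if (sig.scheme !== SCHEME) problems.push("signature.scheme");
  // Lowercase 0x address: 20 bytes = 40 hex chars.
  if (!/^0x[0-9a-f]{40}$/.test(sig.signer || "")) problems.push("signature.signer");
  // 65-byte r||s||v signature: 130 lowercase hex chars.
  if (!/^0x[0-9a-f]{130}$/.test(sig.signature || "")) problems.push("signature.signature");
  return problems;
}

// Placeholder values only; this is not a real signature.
const sample = {
  kind: KIND,
  schemaVersion: 1,
  note: "(trust caveat text)",
  attestation: "(canonical unsigned bytes)\n",
  signature: {
    scheme: SCHEME,
    signer: "0x" + "ab".repeat(20),
    signature: "0x" + "1c".repeat(65),
  },
};
const mixedCase = {
  ...sample,
  signature: { ...sample.signature, signer: "0x" + "AB".repeat(20) },
};

console.log(checkSignedContainerShape(sample)); // []
console.log(checkSignedContainerShape(mixedCase)); // [ 'signature.signer' ]
```

A shape check like this only filters obviously malformed input; the ACCEPTED/REJECTED verdict still has to come from the verifier.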
438
+ #### `vh dataset sign` — the one-command signing leg (reads a key YOU provisioned)
439
+
440
+ `vh dataset sign <manifest> --key-env <VAR> | --key-file <path> [--out <p>] [--json]` is the **one command
441
+ that turns "a human has a key" into a signed container a buyer can verify.** It builds the UNSIGNED payload
442
+ exactly as `vh dataset attest` does (no re-implementation), constructs an in-process ethers `Wallet` from
443
+ the key YOU supply, signs the canonical bytes (`eip191-personal-sign`), and **wraps WITHOUT editing** the
444
+ payload into the `verifyhash.dataset-attestation-signed` container the existing `vh dataset verify-attest`
445
+ accepts. The result round-trips by construction.
446
+
447
+ **Key hygiene (load-bearing — the property that keeps this guardrail-safe).** `vh dataset sign` performs a
448
+ **read-only load of a key YOU provisioned outside this tool**; it **never generates, never persists, and never
449
+ logs (or echoes) a key**, and it is **OFFLINE — no provider, no network**. The key is read from EXACTLY ONE
450
+ of `--key-env <VAR>` (read `process.env[VAR]`) or `--key-file <path>` (a file you created), used in-process
451
+ ONLY to sign, then discarded. **No key source, both key sources, a missing env var, an unreadable file, or a
452
+ malformed/all-zero key HARD-ERRORS before any signing**, with a message that names only the SOURCE (the env
453
+ var name or the file path) — **never the key material**. On success the output prints ONLY the PUBLIC signer
454
+ address, the output path, and the scheme. A usage error (no `<manifest>`, or not exactly one key source)
455
+ exits `2`; a runtime error (bad key, unreadable manifest) exits `1`.
456
+
457
+ > **Trust posture (inherited verbatim — a signature is NOT a timestamp).** This is the SHARED in-band
458
+ > `SIGN_TRUST_NOTE` (`cli/dataset.js`), the same wording the `sign` command prints and the human reads, so
459
+ > the caveat can never drift from the code:
460
+ >
461
+ > > This signs the dataset IDENTITY (root, fileCount, manifestDigest) with the key YOU supplied. A self-managed key attests "the signer says so" — it is NOT an independent, trusted TIMESTAMP: "existed/unaltered since a date T" still needs the human-owned signing/timestamp trust-root (needs-human, P-3). The key must be one YOU provisioned OUTSIDE this tool.
462
+ >
463
+ > The stronger trust-root options B and C buy an independent timestamp; option A (a self-managed key, as here) does not. It also still carries the standing
464
+ > dataset caveat verbatim:
465
+ >
466
+ > > The Merkle root commits to the full set of (relPath, content) pairs (names AND bytes): any edit, rename, add, or remove changes the root. Per-file `hints` (source/license) are UNTRUSTED, self-asserted metadata — they are NOT bound into the root and prove nothing.
467
+
468
+ ```sh
469
+ # Sign the dataset attestation with a key YOU provisioned outside the loop (env var or key file).
470
+ # Read-only of YOUR key; never generates/persists/logs a key; OFFLINE; no network.
471
+ vh dataset sign v2.manifest.json --key-env DATASET_SIGNING_KEY --out v2.attestation.signed.json
472
+ # TRUST: This signs the dataset IDENTITY … it is NOT an independent, trusted TIMESTAMP …
473
+ # signed by 0x<your public address>
474
+ # scheme: eip191-personal-sign
475
+ # signed attestation written: /abs/path/v2.attestation.signed.json
476
+ ```
477
+
478
+ #### `vh dataset verify-attest` — the offline verifier
479
+
480
+ `vh dataset verify-attest <signed> [--manifest <m>] [--signer <addr>] [--json]` is **purely offline — no
481
+ tree walk, no provider, no key, no network.** It reads the container with the strict reader (a
482
+ malformed/edited/foreign container is rejected, never half-accepted), then runs up to three checks:
483
+
484
+ 1. **Signature recovery (always).** Recover the signer from the embedded canonical bytes + signature
485
+ (`eip191-personal-sign` → ethers' `verifyMessage` over exactly those bytes) and confirm it equals the
486
+ container's CLAIMED `signer`. A signature that does not recover to the claimed signer — or a tampered,
487
+ unrecoverable signature — is a clean **REJECTED**, not a crash.
488
+ 2. **`--signer <addr>` (optional publisher pin).** Confirm the RECOVERED signer equals the SPECIFIC
489
+ publisher address the buyer pinned — so a buyer pins WHO must have signed, not merely that someone did.
490
+ Accepts a checksummed or lowercase address.
491
+ 3. **`--manifest <m>` (optional identity binding).** Recompute the canonical UNSIGNED bytes from the
492
+ buyer's OWN manifest via the EXISTING build path and require them byte-identical to the embedded
493
+ (signed-over) payload — proving the signature binds the dataset the buyer actually holds, not some other
494
+ set.
495
+
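For readers unfamiliar with check 1, the sketch below shows what "EIP-191 `personal_sign` over exactly those bytes" means at the byte level. It builds only the prefixed preimage that EIP-191 defines; hashing it with keccak256 and recovering the secp256k1 signer are left to a library (the text above names ethers' `verifyMessage`). The function name `personalSignPreimage` is hypothetical.

```javascript
// EIP-191 personal_sign preimage:
//   "\x19Ethereum Signed Message:\n" + <message length in bytes, decimal> + <message bytes>
function personalSignPreimage(message) {
  const body = Buffer.from(message, "utf8");
  // The length is the BYTE length of the message, not its character count.
  const prefix = Buffer.from("\x19Ethereum Signed Message:\n" + String(body.length), "utf8");
  return Buffer.concat([prefix, body]);
}

// A trailing newline is part of the signed bytes, so it changes both the length and the preimage.
console.log(personalSignPreimage("hi").equals(personalSignPreimage("hi\n"))); // false
```

Because the length and every byte of the message enter the preimage, any edit to the embedded canonical bytes (including the trailing newline) makes recovery yield a different address than the claimed `signer`.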
496
+ **The 0/3 exit contract a buyer's CI gates on.** The verdict is **ACCEPTED only when EVERY requested check
497
+ passes**; any failure is REJECTED. It exits **`0` on ACCEPTED, `3` on REJECTED** — the SAME
498
+ data-divergence convention as `vh dataset verify` / `diff` / `check`, so all dataset gates share one exit
499
+ contract. A usage error (e.g. missing `<signed>`) exits `2`; a missing/corrupt container or manifest is a
500
+ runtime error (exit `1`). `--json` emits the machine object
501
+ `{ verdict, accepted, recoveredSigner, claimedSigner, scheme, checks: { signatureMatchesSigner,
502
+ signerMatchesExpected, manifestBindsAttestation }, expectedSigner, manifestChecked, failedChecks }` — the
503
+ `checks.*` booleans are `null` for a check that was not requested (never a silent fail), and `failedChecks`
504
+ names the stable rule ids a consumer gates on. So a buyer's pipeline step is simply
505
+ `vh dataset verify-attest signed.json --signer 0x<ourPublishedAddr> --manifest ds.manifest.json` and the
506
+ build blocks on a non-zero exit.
507
+
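Because a check that was not requested is reported as `null` rather than `false`, a pipeline that consumes `--json` may want to insist that the checks it cares about were actually run. A minimal sketch of such a consumer-side gate follows; `gate` is a hypothetical helper, and it assumes `accepted` is a boolean and each `checks.*` value is `true`, `false`, or `null`.

```javascript
// Hypothetical consumer-side gate over the `--json` object described above.
// `required` lists the checks this pipeline insists were requested AND passed.
function gate(result, required) {
  const notRun = required.filter((name) => result.checks[name] === null);
  const failed = required.filter((name) => result.checks[name] === false);
  const ok = result.accepted === true && notRun.length === 0 && failed.length === 0;
  return { ok, notRun, failed };
}

// Illustrative object: ACCEPTED, but the run did not pass --signer.
const sampleResult = {
  verdict: "ACCEPTED",
  accepted: true,
  checks: {
    signatureMatchesSigner: true,
    signerMatchesExpected: null,
    manifestBindsAttestation: true,
  },
};

console.log(gate(sampleResult, ["signatureMatchesSigner", "signerMatchesExpected"]));
// ok is false here: the signer pin was never requested, so ACCEPTED alone is not enough
```

This guards against a pipeline step that silently drops `--signer` or `--manifest` and still sees exit `0`.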
508
+ #### Worked end-to-end example (attest → sign → verify-attest)
509
+
510
+ ```sh
511
+ # 1. ATTEST: emit the canonical UNSIGNED bytes (the exact bytes the publisher signs over).
512
+ vh dataset attest v2.manifest.json --out v2.attestation.json
513
+ # dataset attestation written: /abs/path/v2.attestation.json
514
+
515
+ # 2. [HUMAN-OWNED, P-3 — PROVISION ONLY] Provision a real signing key OUTSIDE the loop (env var or key file),
516
+ # then SIGN with ONE command. The loop NEVER generates/persists/logs the key — `vh dataset sign` reads the
517
+ # key YOU provisioned, signs the canonical bytes (eip191-personal-sign), and wraps them WITHOUT editing the
518
+ # payload (it stays signed:false). Choosing/provisioning the key + trust-root option A/B/C is the P-3 part.
519
+ vh dataset sign v2.manifest.json --key-env DATASET_SIGNING_KEY --out v2.attestation.signed.json
520
+ # signed by 0x<your public address> scheme: eip191-personal-sign
521
+ # signed attestation written: /abs/path/v2.attestation.signed.json
522
+ # (In tests this signing step uses an EPHEMERAL throwaway Wallet.createRandom() key — never a real key.)
523
+
524
+ # 3. The BUYER VERIFIES offline — no key, no network — pinning WHO signed AND binding it to THEIR dataset:
525
+ vh dataset verify-attest v2.attestation.signed.json \
526
+ --signer 0x<the publisher's published address> --manifest ./my-copy.manifest.json
527
+ # TRUST: A valid signature proves the holder of `signer`'s key vouched for THIS dataset identity …
528
+ # verify-attest: ACCEPTED
529
+ # [PASS] signature recovers to the claimed signer
530
+ # [PASS] recovered signer matches the expected publisher (0x…)
531
+ # [PASS] the signature binds YOUR manifest (its canonical bytes are byte-identical to the signed payload)
532
+ # ACCEPTED: every requested check passed. (exit 0; exit 3 if ANY requested check FAILs)
533
+ ```
534
+
535
+ > **Still bounded by P-3.** An ACCEPTED verdict proves the key-holder vouched for this dataset identity —
536
+ > it does **NOT** prove a trustworthy timestamp ("unaltered since date T") and does **NOT** validate any
537
+ > `{source, license}` hint. The trustworthy timestamp is the human-owned trust-root, `needs-human`, P-3 in
538
+ > [`STRATEGY.md`](../STRATEGY.md). This build ships the FORMAT, the VERIFIER, AND the `vh dataset sign`
539
+ > command (all proved with throwaway test keys); the human still owns PROVISIONING the key and choosing
540
+ > trust-root option A/B/C.
541
+
542
+ ---
543
+
544
+ ## The independent timestamp (P-3 Option B): an RFC-3161 TSA proves "existed by date T"
545
+
546
+ A self-managed signature (`sign` / `verify-attest`, Option A) attests only "the publisher **says so**". The
547
+ stronger claim a due-diligence / EU-AI-Act reviewer ultimately wants — "an **independent** third party saw
548
+ this exact dataset identity **by time T**" — is what **P-3 Option (B)** delivers: an
549
+ [RFC-3161](https://www.rfc-editor.org/rfc/rfc3161) **Time-Stamping Authority (TSA)** stamps a digest and
550
+ returns a signed `TimeStampToken`. This build ships the **FORMAT** (the
551
+ `verifyhash.dataset-attestation-timestamped` container) and the **OFFLINE VERIFIER**
552
+ (`vh dataset verify-timestamp`), proved end-to-end with **self-minted test tokens** (a test-only mock TSA
553
+ with an ephemeral key — **NEVER a real TSA**, exactly as the signing tests use `Wallet.createRandom()`).
554
+ Obtaining a real token is a **human/network step** (you pick a TSA you trust and call it).
555
+
556
+ The flow is **`timestamp-request` → (obtain a token from your TSA) → `timestamp-wrap` → `verify-timestamp`**:
557
+
558
+ ```sh
559
+ # 1. REQUEST: emit the SHA-256 digest of the canonical attestation bytes — the EXACT messageImprint a TSA stamps.
560
+ vh dataset timestamp-request v2.manifest.json
561
+ # sha256 digest (the messageImprint to stamp): 34031ecf…439f
562
+ # To obtain an RFC-3161 timestamp token over this digest (a HUMAN/network step):
563
+ # openssl ts -query -digest 34031ecf…439f -sha256 -cert -out request.tsq
564
+ # # send request.tsq to your TSA -> response.tsr ; then:
565
+ # openssl ts -reply -in response.tsr -token_out -out token.der
566
+
567
+ # 2. [HUMAN-OWNED, P-3 Option B] Pick a TSA you trust and obtain a token over that digest (network step).
568
+ # The loop NEVER calls a TSA, holds no token, and generates none.
569
+
570
+ # 3. WRAP: bind the returned RFC-3161 token to the re-derived digest, WITHOUT editing the payload.
571
+ vh dataset timestamp-wrap v2.manifest.json --token token.der --out v2.attestation.timestamped.json
572
+ # timestamped: an INDEPENDENT TSA stamped this digest by genTime
573
+ # genTime (asserted by the TSA): 2026-01-01T00:00:00Z TSA serial: 2a policy OID: 1.2.3.4.5
574
+
575
+ # 4. The BUYER VERIFIES offline — no key, no network — and (optionally) binds it to THEIR dataset:
576
+ vh dataset verify-timestamp v2.attestation.timestamped.json --manifest ./my-copy.manifest.json
577
+ # TRUST: ACCEPTED means an RFC-3161 TSA asserted this exact dataset identity (digest) existed by genTime;
578
+ # this is as trustworthy as the TSA whose certificate YOU trust — this command does NOT validate the
579
+ # TSA's certificate chain (use `openssl ts -verify` / a CMS verifier for full PKI validation).
580
+ # verify-timestamp: ACCEPTED
581
+ # [PASS] the token binds sha256(canonical attestation bytes) under RFC-3161
582
+ # [PASS] the timestamp binds YOUR manifest
583
+ # ACCEPTED: an RFC-3161 TSA asserted this dataset identity existed by:
584
+ # genTime (ISO UTC): 2026-01-01T00:00:00Z
585
+ # TSA serialNumber: 2a (decimal 42)
586
+ # policy OID: 1.2.3.4.5 (exit 0; exit 3 if ANY requested check FAILs)
587
+ ```
588
+
589
+ > **The exact bounded trust claim (never overclaims).** ACCEPTED means **an RFC-3161 TSA asserted this exact
590
+ > dataset identity (the SHA-256 digest of the canonical attestation bytes) existed by `<genTime>`** — and
591
+ > this is **as trustworthy as the TSA whose certificate YOU trust**. `verify-timestamp` does **NOT** validate
592
+ > the TSA's X.509 certificate chain or the token's CMS signature — use your platform's CMS verifier
593
+ > (`openssl ts -verify`) for full PKI validation, exactly as Option A pins the signer ADDRESS out of band. A
594
+ > tampered token, a mismatched digest, or an edited embedded attestation **REJECTS** — never a false ACCEPT.
595
+ > Even so this is materially stronger than Option A: an **independent third party** (not the publisher)
596
+ > attests existence by `genTime`.
597
+
598
+ P-3 Option (B)'s human handoff therefore collapses to: **(1)** pick a TSA you trust; **(2)** run
599
+ `vh dataset timestamp-request` to get the digest; **(3)** obtain a token from your TSA over that digest;
600
+ **(4)** run `vh dataset timestamp-wrap` — **done**; buyers verify offline with `vh dataset verify-timestamp`.
601
+
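The digest in step (2) is an ordinary SHA-256 over the canonical attestation bytes, so it can be recomputed independently with Node's built-in `crypto` module. The sketch below is illustrative: `messageImprintHex` and `imprintMatches` are hypothetical names, and reading the imprint out of a real RFC-3161 token (ASN.1 parsing) is out of scope here.

```javascript
const { createHash } = require("node:crypto");

// SHA-256 over the exact bytes, returned as lowercase hex.
function messageImprintHex(canonicalBytes) {
  return createHash("sha256").update(canonicalBytes).digest("hex");
}

// Hypothetical binding check: does an imprint (hex, taken from a token by other means)
// match the digest re-derived from the bytes the buyer holds?
function imprintMatches(canonicalBytes, tokenImprintHex) {
  return messageImprintHex(canonicalBytes) === tokenImprintHex.toLowerCase();
}

// Well-known SHA-256 test vector for the ASCII string "abc".
console.log(messageImprintHex(Buffer.from("abc", "utf8")));
// ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
```

Hashing the raw bytes (not a re-serialized object) matters: the imprint only matches if the attestation is byte-for-byte the canonical form.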
602
+ ---
603
+
604
+ ## What an auditor / EU AI Act reviewer gets
605
+
606
+ A mapping from the reviewer's question to the command that produces the evidence:
607
+
608
+ | Reviewer's question | Command | Evidence produced |
609
+ | --- | --- | --- |
610
+ | "Exactly which files — names and bytes — did this dataset contain?" | `vh dataset build` | A manifest: a Merkle root over every `(relPath, content)` pair + per-file leaves |
611
+ | "Is this copy of the dataset byte-for-byte the one you manifested?" | `vh dataset verify` | Recomputed-root vs manifest-root verdict + a per-file ADDED/REMOVED/CHANGED localization |
612
+ | "What changed in the training data between model version N and N+1?" | `vh dataset diff` | The precise add/remove/change set between two manifests (offline) |
613
+ | "What is the provenance/license composition of the dataset?" | `vh dataset summary` | A `{source, license}` histogram over the trusted file set (claims, clearly labeled untrusted) |
614
+ | "Does this dataset VIOLATE our written license/source policy? (the control CI runs)" | `vh dataset check --policy` | A PASS/FAIL verdict + the exact violating files (relPath / rule / value); a CI-gateable exit code (0 PASS / 3 FAIL) over the dataset's self-asserted hints (clearly labeled untrusted) |
615
+ | "Give me ONE document to file in the technical-documentation / due-diligence packet." | `vh dataset report` | A single deterministic Markdown (or `--json`) document: dataset identity + the provenance/license roll-up + the standing trust caveats + an optional live-tree verify verdict + (with `--policy`) the embedded policy-compliance verdict |
616
+ | "Give me the exact bytes our publisher (or a timestamp authority) will sign over." | `vh dataset attest` | A canonical, byte-deterministic UNSIGNED attestation payload committing to `root` / `fileCount` / `manifestDigest` (the file a human signing/timestamp trust-root signs — see P-3) |
617
+ | "I provisioned a signing key — turn the attestation into a signed container in one command." | `vh dataset sign` | The `verifyhash.dataset-attestation-signed` container, signed (`eip191-personal-sign`) with the key YOU supplied (`--key-env`/`--key-file`), ready for any buyer to `verify-attest`. Read-only of your key; never generates/persists/logs a key; offline. Attests the IDENTITY + "the signer says so" — NOT a timestamp (still P-3) |
618
+ | "A vendor handed me a 'signed by the publisher' attestation — confirm it is genuine and binds the dataset I hold." | `vh dataset verify-attest` | An OFFLINE ACCEPTED/REJECTED verdict: the signature recovers to the claimed signer, (with `--signer`) the recovered signer is the publisher I pinned, and (with `--manifest`) it binds MY dataset; CI-gateable exit 0/3. Proves the key-holder vouched for this dataset identity — NOT a timestamp (P-3) |
619
+ | "Prove an INDEPENDENT party saw this exact dataset by a date — `timestamp-request` → (TSA) → `timestamp-wrap`." | `vh dataset timestamp-request` / `vh dataset timestamp-wrap` | The SHA-256 digest a TSA stamps, then the `verifyhash.dataset-attestation-timestamped` container binding the returned RFC-3161 token to that digest. The loop never calls a TSA; obtaining the token is the human/network step (P-3 Option B) |
620
+ | "A vendor handed me a timestamped attestation — confirm an RFC-3161 TSA stamped the dataset I hold, and by WHEN." | `vh dataset verify-timestamp` | An OFFLINE ACCEPTED (with the asserted genTime / TSA serial / policy OID) or REJECTED verdict; with `--manifest` it binds the timestamp to MY dataset; CI-gateable exit 0/3. ACCEPTED == a TSA asserted this digest existed by genTime, to the strength of the TSA YOU trust — does NOT validate the TSA cert chain (use `openssl ts -verify`) |
621
+ | "Prove this specific record/file was actually in the dataset." | `vh dataset prove` → `vh dataset verify-proof` | A portable, offline-verifiable set-membership proof for one file |
622
+
623
+ What this mapping deliberately does NOT claim: a wall-clock "unaltered since date T", and the
624
+ truth of any `{source, license}` hint. Both require the human-owned signing/timestamp trust-root
625
+ (`needs-human`, see [`STRATEGY.md`](../STRATEGY.md)).
626
+
627
+ ---
628
+
629
+ ## See also
630
+
631
+ - [`docs/TRUST-BOUNDARIES.md`](TRUST-BOUNDARIES.md) — the full trust model the caveats above reuse.
632
+ - [`docs/MERKLE-LEAVES.md`](MERKLE-LEAVES.md) — the exact path-bound leaf/root construction the
633
+ manifest commits to (DataLedger reuses it unchanged).
634
+ - [`docs/PROOFS.md`](PROOFS.md) — the portable proof-artifact schema membership proofs reuse.
635
+
636
+
637
+ ---
638
+ <sub>© 2026 verifyhash.com · Licensed under Apache-2.0 (SPDX-License-Identifier: Apache-2.0) — see the [LICENSE](https://verifyhash.com/LICENSE) and [NOTICE](https://verifyhash.com/NOTICE) served with this file.</sub>