@dogfood-lab/study-swarm 1.1.0 → 1.3.0
@@ -0,0 +1,369 @@
+ {
+ "schema": "study-swarm.orchestration/v1",
+ "note": "Harness-emitted record of the Step-2 research agents for study-swarm-lock.dispatch.md. tool_schema pins the StructuredOutput contract each agent was bound to (the load-bearing, capturable surface); a full harness would pin the entire tool array. output_sha256 hashes the returned findings for drift detection (not determinism).",
+ "steps": [
+ {
+ "question_id": "Q1-replay-manifest",
+ "resolved_model": "claude-opus-4-8",
+ "prompt": "You are a research agent in a STUDY-SWARM (the research-grounded-advisor protocol) grounding the design of a new feature for the open-source repo `dogfood-lab/study-swarm`.\n\nTHE FEATURE — `dispatch.lock.json`: a per-dispatch lockfile that makes a study-swarm research dispatch BYTE-REPLAYABLE by pinning, per step:\n- the RESOLVED model id each research agent actually ran on (e.g. claude-opus-4-8, never an alias like \"opus\"),\n- the SHA-256 of the byte-exact agent prompt,\n- the SHA-256 of the tool JSONSchemas the agent had,\n- the external-verifier run/receipt id (e.g. a prism Ed25519 receipt id) and the receipt chain hash,\n- plus a top-level `lock_sha256` rollup over the whole lock (its content-address).\nThis implements the PIN_PER_STEP workflow standard (heritage: Snakemake 2012, Pegasus 2001).\n\nIMPLEMENTATION CONSTRAINTS (these shape which evidence is useful):\n- The CLI is ZERO-DEPENDENCY, NETWORK-FREE, DETERMINISTIC: SHA-256 via node:crypto, JSON I/O only. It makes NO model calls. The ORCHESTRATION HARNESS supplies the resolved models + byte-exact prompts + verifier run_id; the CLI only canonicalizes + hashes + validates them, and `lock --verify` re-derives the deterministic hashes and FAILS (exit 1) on drift.\n- Honest ceiling: pinning model+prompt+temp does NOT give bit-identical LLM outputs. The lock pins INPUTS byte-exact + records OUTPUT hashes for DRIFT DETECTION — \"replayable inputs + drift-detectable outputs\", NOT \"deterministic replay\".\n\nYOUR JOB: gather SPECIFIC, CITED, RETRIEVED evidence to answer ONE question. HARD RULES:\n- GROUND AT GENERATION TIME: use WebSearch and WebFetch to ACTUALLY RETRIEVE every source you cite THIS session. Cite ONLY sources you actually fetched. 
A claim you cannot ground in a fetched source is DROPPED, not invented.\n- Every finding needs: a one-sentence claim in your own words that MATCHES what the source actually says (do NOT overstate); author(s)/org; year; a RESOLVABLE identifier (arXiv:NNNN.NNNNN, a DOI, an RFC number, or a direct URL to the spec/paper — not a blog summary); the resolvable URL; whether you retrieved it; and a one-sentence DESIGN IMPLICATION for `dispatch.lock.json`.\n- Prefer specificity over breadth: 6-8 well-sourced, RETRIEVED findings beat 20 vague gestures. ~500-600 words of substance.\n- Set retrieved=false for anything you could not actually fetch — those will be dropped.\n\n========\nYOUR QUESTION (Q1-replay-manifest):\nQUESTION: How do reproducible-workflow and build/package systems structure a replay manifest, and how do they detect & surface DRIFT between the lock and a re-run?\nInvestigate (retrieve the actual papers/docs): Snakemake (Koster & Rahmann 2012, Bioinformatics DOI:10.1093/bioinformatics/bts480; and the 2021 F1000Research sustainable-data-analysis update); Pegasus (Deelman et al., Future Generation Computer Systems 2015 / the workflow provenance work); Nextflow (Di Tommaso et al. 2017, Nature Biotechnology DOI:10.1038/nbt.3820); ReproZip (Chirigati et al.); Nix / reproducible builds; and DEPENDENCY LOCKFILES (npm package-lock.json integrity, Cargo.lock, uv.lock, pip --require-hashes / PEP 665 / PEP 658). For each: WHAT is pinned (inputs, versions, content hashes), HOW it is hashed/content-addressed, and HOW drift is detected and surfaced (e.g. an integrity mismatch failing the install/CI). Map each to a concrete `dispatch.lock.json` field or to the `lock --verify` drift-detection behavior.\n\nReturn structured findings. Remember: retrieve-then-cite; drop what you cannot fetch.",
+ "tool_schema": {
+ "type": "object",
+ "additionalProperties": false,
+ "required": [
+ "question_id",
+ "findings",
+ "notes"
+ ],
+ "properties": {
+ "question_id": {
+ "type": "string"
+ },
+ "findings": {
+ "type": "array",
+ "items": {
+ "type": "object",
+ "additionalProperties": false,
+ "required": [
+ "claim",
+ "authors",
+ "year",
+ "identifier",
+ "url",
+ "retrieved",
+ "design_implication"
+ ],
+ "properties": {
+ "claim": {
+ "type": "string",
+ "description": "One-sentence finding in your own words, matching the source (do not overstate)."
+ },
+ "authors": {
+ "type": "string",
+ "description": "Author(s) or org, e.g. \"Koster & Rahmann\" or \"Rundgren et al.\" or \"OpenSSF\"."
+ },
+ "year": {
+ "type": "string"
+ },
+ "identifier": {
+ "type": "string",
+ "description": "arXiv:NNNN.NNNNN, a DOI (10.xxxx/...), an RFC number, or a direct URL."
+ },
+ "url": {
+ "type": "string",
+ "description": "A resolvable URL the existence oracle can fetch."
+ },
+ "retrieved": {
+ "type": "boolean",
+ "description": "true ONLY if you actually fetched this source this session."
+ },
+ "design_implication": {
+ "type": "string",
+ "description": "One sentence: implication for dispatch.lock.json."
+ }
+ }
+ }
+ },
+ "notes": {
+ "type": "string",
+ "description": "Coverage gaps, sources you could not fetch, dropped claims."
+ }
+ }
+ },
+ "schema_dialect": "https://json-schema.org/draft/2020-12/schema",
+ "output_sha256": "sha256-k5rkOOyAHGBCVacEk8LPULp+s5YHVzaUyCHJY4tD5jE="
+ },
+ {
+ "question_id": "Q2-canonicalization",
+ "resolved_model": "claude-opus-4-8",
+ "prompt": "You are a research agent in a STUDY-SWARM (the research-grounded-advisor protocol) grounding the design of a new feature for the open-source repo `dogfood-lab/study-swarm`.\n\nTHE FEATURE — `dispatch.lock.json`: a per-dispatch lockfile that makes a study-swarm research dispatch BYTE-REPLAYABLE by pinning, per step:\n- the RESOLVED model id each research agent actually ran on (e.g. claude-opus-4-8, never an alias like \"opus\"),\n- the SHA-256 of the byte-exact agent prompt,\n- the SHA-256 of the tool JSONSchemas the agent had,\n- the external-verifier run/receipt id (e.g. a prism Ed25519 receipt id) and the receipt chain hash,\n- plus a top-level `lock_sha256` rollup over the whole lock (its content-address).\nThis implements the PIN_PER_STEP workflow standard (heritage: Snakemake 2012, Pegasus 2001).\n\nIMPLEMENTATION CONSTRAINTS (these shape which evidence is useful):\n- The CLI is ZERO-DEPENDENCY, NETWORK-FREE, DETERMINISTIC: SHA-256 via node:crypto, JSON I/O only. It makes NO model calls. The ORCHESTRATION HARNESS supplies the resolved models + byte-exact prompts + verifier run_id; the CLI only canonicalizes + hashes + validates them, and `lock --verify` re-derives the deterministic hashes and FAILS (exit 1) on drift.\n- Honest ceiling: pinning model+prompt+temp does NOT give bit-identical LLM outputs. The lock pins INPUTS byte-exact + records OUTPUT hashes for DRIFT DETECTION — \"replayable inputs + drift-detectable outputs\", NOT \"deterministic replay\".\n\nYOUR JOB: gather SPECIFIC, CITED, RETRIEVED evidence to answer ONE question. HARD RULES:\n- GROUND AT GENERATION TIME: use WebSearch and WebFetch to ACTUALLY RETRIEVE every source you cite THIS session. Cite ONLY sources you actually fetched. 
A claim you cannot ground in a fetched source is DROPPED, not invented.\n- Every finding needs: a one-sentence claim in your own words that MATCHES what the source actually says (do NOT overstate); author(s)/org; year; a RESOLVABLE identifier (arXiv:NNNN.NNNNN, a DOI, an RFC number, or a direct URL to the spec/paper — not a blog summary); the resolvable URL; whether you retrieved it; and a one-sentence DESIGN IMPLICATION for `dispatch.lock.json`.\n- Prefer specificity over breadth: 6-8 well-sourced, RETRIEVED findings beat 20 vague gestures. ~500-600 words of substance.\n- Set retrieved=false for anything you could not actually fetch — those will be dropped.\n\n========\nYOUR QUESTION (Q2-canonicalization):\nQUESTION: What is the correct way to canonicalize structured (JSON) data so a hash is STABLE across platforms and re-serializations, and how should per-step hashes roll up to one dispatch hash?\nInvestigate (retrieve the actual specs): RFC 8785 JSON Canonicalization Scheme (Rundgren, Jordan & Erdtman 2020) — exactly what it normalizes (property ordering, number serialization per ECMAScript, Unicode/UTF-8, whitespace); JWS/JOSE canonical serialization (RFC 7515); Merkle trees / hash chains (Merkle, CRYPTO 1987, DOI:10.1007/3-540-48184-2_32) for per-step → rollup; and the concrete instability sources a WINDOWS-authored tool must defend against — CRLF vs LF, key/property ordering, Unicode normalization (NFC/NFD), floating-point/number formatting, trailing whitespace, BOM. Map each to how `lock_sha256`, `prompt_sha256`, and `tool_schema_sha256` must be computed so the SAME dispatch hashes IDENTICALLY on Windows, Linux, and macOS.\n\nReturn structured findings. Remember: retrieve-then-cite; drop what you cannot fetch.",
+ "tool_schema": {
+ "type": "object",
+ "additionalProperties": false,
+ "required": [
+ "question_id",
+ "findings",
+ "notes"
+ ],
+ "properties": {
+ "question_id": {
+ "type": "string"
+ },
+ "findings": {
+ "type": "array",
+ "items": {
+ "type": "object",
+ "additionalProperties": false,
+ "required": [
+ "claim",
+ "authors",
+ "year",
+ "identifier",
+ "url",
+ "retrieved",
+ "design_implication"
+ ],
+ "properties": {
+ "claim": {
+ "type": "string",
+ "description": "One-sentence finding in your own words, matching the source (do not overstate)."
+ },
+ "authors": {
+ "type": "string",
+ "description": "Author(s) or org, e.g. \"Koster & Rahmann\" or \"Rundgren et al.\" or \"OpenSSF\"."
+ },
+ "year": {
+ "type": "string"
+ },
+ "identifier": {
+ "type": "string",
+ "description": "arXiv:NNNN.NNNNN, a DOI (10.xxxx/...), an RFC number, or a direct URL."
+ },
+ "url": {
+ "type": "string",
+ "description": "A resolvable URL the existence oracle can fetch."
+ },
+ "retrieved": {
+ "type": "boolean",
+ "description": "true ONLY if you actually fetched this source this session."
+ },
+ "design_implication": {
+ "type": "string",
+ "description": "One sentence: implication for dispatch.lock.json."
+ }
+ }
+ }
+ },
+ "notes": {
+ "type": "string",
+ "description": "Coverage gaps, sources you could not fetch, dropped claims."
+ }
+ }
+ },
+ "schema_dialect": "https://json-schema.org/draft/2020-12/schema",
+ "output_sha256": "sha256-ymPE5lJqoygN0MpzqftgE1tYlyp7z+poCK2Z1Zj3mi8="
+ },
+ {
+ "question_id": "Q3-provenance-attestation",
+ "resolved_model": "claude-opus-4-8",
+ "prompt": "You are a research agent in a STUDY-SWARM (the research-grounded-advisor protocol) grounding the design of a new feature for the open-source repo `dogfood-lab/study-swarm`.\n\nTHE FEATURE — `dispatch.lock.json`: a per-dispatch lockfile that makes a study-swarm research dispatch BYTE-REPLAYABLE by pinning, per step:\n- the RESOLVED model id each research agent actually ran on (e.g. claude-opus-4-8, never an alias like \"opus\"),\n- the SHA-256 of the byte-exact agent prompt,\n- the SHA-256 of the tool JSONSchemas the agent had,\n- the external-verifier run/receipt id (e.g. a prism Ed25519 receipt id) and the receipt chain hash,\n- plus a top-level `lock_sha256` rollup over the whole lock (its content-address).\nThis implements the PIN_PER_STEP workflow standard (heritage: Snakemake 2012, Pegasus 2001).\n\nIMPLEMENTATION CONSTRAINTS (these shape which evidence is useful):\n- The CLI is ZERO-DEPENDENCY, NETWORK-FREE, DETERMINISTIC: SHA-256 via node:crypto, JSON I/O only. It makes NO model calls. The ORCHESTRATION HARNESS supplies the resolved models + byte-exact prompts + verifier run_id; the CLI only canonicalizes + hashes + validates them, and `lock --verify` re-derives the deterministic hashes and FAILS (exit 1) on drift.\n- Honest ceiling: pinning model+prompt+temp does NOT give bit-identical LLM outputs. The lock pins INPUTS byte-exact + records OUTPUT hashes for DRIFT DETECTION — \"replayable inputs + drift-detectable outputs\", NOT \"deterministic replay\".\n\nYOUR JOB: gather SPECIFIC, CITED, RETRIEVED evidence to answer ONE question. HARD RULES:\n- GROUND AT GENERATION TIME: use WebSearch and WebFetch to ACTUALLY RETRIEVE every source you cite THIS session. Cite ONLY sources you actually fetched. 
A claim you cannot ground in a fetched source is DROPPED, not invented.\n- Every finding needs: a one-sentence claim in your own words that MATCHES what the source actually says (do NOT overstate); author(s)/org; year; a RESOLVABLE identifier (arXiv:NNNN.NNNNN, a DOI, an RFC number, or a direct URL to the spec/paper — not a blog summary); the resolvable URL; whether you retrieved it; and a one-sentence DESIGN IMPLICATION for `dispatch.lock.json`.\n- Prefer specificity over breadth: 6-8 well-sourced, RETRIEVED findings beat 20 vague gestures. ~500-600 words of substance.\n- Set retrieved=false for anything you could not actually fetch — those will be dropped.\n\n========\nYOUR QUESTION (Q3-provenance-attestation):\nQUESTION: How do software supply-chain frameworks capture STEP-LEVEL provenance, and which parts map to pinning \"model + prompt + tool-schema + verifier receipt\" for one dispatch step?\nInvestigate (retrieve the actual papers/specs): in-toto (Torres-Arias, Afzali, Kuppusamy, Curtmola & Cappos 2019, USENIX Security — the link metadata + layout model); SLSA (the OpenSSF SLSA provenance levels + the provenance predicate schema); Sigstore (Newman, Meyers et al. 2022, ACM CCS DOI:10.1145/3548606.3560596 — keyless signing + Rekor transparency log) and the verifiability-vs-anti-forgery distinction; SCITT and/or C2PA if relevant; and W3C PROV / research-object reproducibility lineage. For each: what a step attestation records (materials/inputs, the step command/predicate, products/outputs, the actor/environment), how steps are chained, and what \"verifiable but not unforgeable\" means for an ephemeral local signing key. Map each to the per-step record SHAPE of `dispatch.lock.json` and to the \"harness EMITS the record, CLI CANONICALIZES+HASHES+VALIDATES it\" separation.\n\nReturn structured findings. Remember: retrieve-then-cite; drop what you cannot fetch.",
+ "tool_schema": {
+ "type": "object",
+ "additionalProperties": false,
+ "required": [
+ "question_id",
+ "findings",
+ "notes"
+ ],
+ "properties": {
+ "question_id": {
+ "type": "string"
+ },
+ "findings": {
+ "type": "array",
+ "items": {
+ "type": "object",
+ "additionalProperties": false,
+ "required": [
+ "claim",
+ "authors",
+ "year",
+ "identifier",
+ "url",
+ "retrieved",
+ "design_implication"
+ ],
+ "properties": {
+ "claim": {
+ "type": "string",
+ "description": "One-sentence finding in your own words, matching the source (do not overstate)."
+ },
+ "authors": {
+ "type": "string",
+ "description": "Author(s) or org, e.g. \"Koster & Rahmann\" or \"Rundgren et al.\" or \"OpenSSF\"."
+ },
+ "year": {
+ "type": "string"
+ },
+ "identifier": {
+ "type": "string",
+ "description": "arXiv:NNNN.NNNNN, a DOI (10.xxxx/...), an RFC number, or a direct URL."
+ },
+ "url": {
+ "type": "string",
+ "description": "A resolvable URL the existence oracle can fetch."
+ },
+ "retrieved": {
+ "type": "boolean",
+ "description": "true ONLY if you actually fetched this source this session."
+ },
+ "design_implication": {
+ "type": "string",
+ "description": "One sentence: implication for dispatch.lock.json."
+ }
+ }
+ }
+ },
+ "notes": {
+ "type": "string",
+ "description": "Coverage gaps, sources you could not fetch, dropped claims."
+ }
+ }
+ },
+ "schema_dialect": "https://json-schema.org/draft/2020-12/schema",
+ "output_sha256": "sha256-OSUAIhytytKihfFM1Y+p1BllUSTC+KEfb/NX86X+Kc0="
+ },
+ {
+ "question_id": "Q4-llm-determinism",
+ "resolved_model": "claude-opus-4-8",
+ "prompt": "You are a research agent in a STUDY-SWARM (the research-grounded-advisor protocol) grounding the design of a new feature for the open-source repo `dogfood-lab/study-swarm`.\n\nTHE FEATURE — `dispatch.lock.json`: a per-dispatch lockfile that makes a study-swarm research dispatch BYTE-REPLAYABLE by pinning, per step:\n- the RESOLVED model id each research agent actually ran on (e.g. claude-opus-4-8, never an alias like \"opus\"),\n- the SHA-256 of the byte-exact agent prompt,\n- the SHA-256 of the tool JSONSchemas the agent had,\n- the external-verifier run/receipt id (e.g. a prism Ed25519 receipt id) and the receipt chain hash,\n- plus a top-level `lock_sha256` rollup over the whole lock (its content-address).\nThis implements the PIN_PER_STEP workflow standard (heritage: Snakemake 2012, Pegasus 2001).\n\nIMPLEMENTATION CONSTRAINTS (these shape which evidence is useful):\n- The CLI is ZERO-DEPENDENCY, NETWORK-FREE, DETERMINISTIC: SHA-256 via node:crypto, JSON I/O only. It makes NO model calls. The ORCHESTRATION HARNESS supplies the resolved models + byte-exact prompts + verifier run_id; the CLI only canonicalizes + hashes + validates them, and `lock --verify` re-derives the deterministic hashes and FAILS (exit 1) on drift.\n- Honest ceiling: pinning model+prompt+temp does NOT give bit-identical LLM outputs. The lock pins INPUTS byte-exact + records OUTPUT hashes for DRIFT DETECTION — \"replayable inputs + drift-detectable outputs\", NOT \"deterministic replay\".\n\nYOUR JOB: gather SPECIFIC, CITED, RETRIEVED evidence to answer ONE question. HARD RULES:\n- GROUND AT GENERATION TIME: use WebSearch and WebFetch to ACTUALLY RETRIEVE every source you cite THIS session. Cite ONLY sources you actually fetched. 
A claim you cannot ground in a fetched source is DROPPED, not invented.\n- Every finding needs: a one-sentence claim in your own words that MATCHES what the source actually says (do NOT overstate); author(s)/org; year; a RESOLVABLE identifier (arXiv:NNNN.NNNNN, a DOI, an RFC number, or a direct URL to the spec/paper — not a blog summary); the resolvable URL; whether you retrieved it; and a one-sentence DESIGN IMPLICATION for `dispatch.lock.json`.\n- Prefer specificity over breadth: 6-8 well-sourced, RETRIEVED findings beat 20 vague gestures. ~500-600 words of substance.\n- Set retrieved=false for anything you could not actually fetch — those will be dropped.\n\n========\nYOUR QUESTION (Q4-llm-determinism):\nQUESTION: Can pinning model + prompt + temperature (+ seed) yield reproducible LLM OUTPUTS, or only reproducible INPUTS? Find the strongest EMPIRICAL evidence.\nInvestigate (retrieve the actual sources): nondeterminism even at temperature 0 / fixed seed from floating-point non-associativity + GPU kernel/reduction order + BATCH-SIZE / batching effects (Thinking Machines Lab — He et al. 2025, \"Defeating Nondeterminism in LLM Inference\", thinkingmachines.ai; and the batch-invariant-kernels / vLLM work); provider-side SILENT MODEL DRIFT over time (Chen, Zaharia & Zou 2023, \"How Is ChatGPT's Behavior Changing over Time?\", arXiv:2307.09009); MoE/expert-routing nondeterminism; and any work quantifying output variance under fixed decoding params (e.g. reproducibility-of-LLM-evaluations papers, Atil et al. or similar). This finding JUSTIFIES the honest-ceiling claim: pin INPUTS byte-exact + record OUTPUT hashes for DRIFT DETECTION; do NOT claim \"deterministic replay\". Give the strongest citations for exactly that framing.\n\nReturn structured findings. Remember: retrieve-then-cite; drop what you cannot fetch.",
+ "tool_schema": {
+ "type": "object",
+ "additionalProperties": false,
+ "required": [
+ "question_id",
+ "findings",
+ "notes"
+ ],
+ "properties": {
+ "question_id": {
+ "type": "string"
+ },
+ "findings": {
+ "type": "array",
+ "items": {
+ "type": "object",
+ "additionalProperties": false,
+ "required": [
+ "claim",
+ "authors",
+ "year",
+ "identifier",
+ "url",
+ "retrieved",
+ "design_implication"
+ ],
+ "properties": {
+ "claim": {
+ "type": "string",
+ "description": "One-sentence finding in your own words, matching the source (do not overstate)."
+ },
+ "authors": {
+ "type": "string",
+ "description": "Author(s) or org, e.g. \"Koster & Rahmann\" or \"Rundgren et al.\" or \"OpenSSF\"."
+ },
+ "year": {
+ "type": "string"
+ },
+ "identifier": {
+ "type": "string",
+ "description": "arXiv:NNNN.NNNNN, a DOI (10.xxxx/...), an RFC number, or a direct URL."
+ },
+ "url": {
+ "type": "string",
+ "description": "A resolvable URL the existence oracle can fetch."
+ },
+ "retrieved": {
+ "type": "boolean",
+ "description": "true ONLY if you actually fetched this source this session."
+ },
+ "design_implication": {
+ "type": "string",
+ "description": "One sentence: implication for dispatch.lock.json."
+ }
+ }
+ }
+ },
+ "notes": {
+ "type": "string",
+ "description": "Coverage gaps, sources you could not fetch, dropped claims."
+ }
+ }
+ },
+ "schema_dialect": "https://json-schema.org/draft/2020-12/schema",
+ "output_sha256": "sha256-aFDm9p4/94p97NJg/vWKbYXchv6N1b0swpIfYoep1iw="
+ },
+ {
+ "question_id": "Q5-tool-schema-drift",
+ "resolved_model": "claude-opus-4-8",
+ "prompt": "You are a research agent in a STUDY-SWARM (the research-grounded-advisor protocol) grounding the design of a new feature for the open-source repo `dogfood-lab/study-swarm`.\n\nTHE FEATURE — `dispatch.lock.json`: a per-dispatch lockfile that makes a study-swarm research dispatch BYTE-REPLAYABLE by pinning, per step:\n- the RESOLVED model id each research agent actually ran on (e.g. claude-opus-4-8, never an alias like \"opus\"),\n- the SHA-256 of the byte-exact agent prompt,\n- the SHA-256 of the tool JSONSchemas the agent had,\n- the external-verifier run/receipt id (e.g. a prism Ed25519 receipt id) and the receipt chain hash,\n- plus a top-level `lock_sha256` rollup over the whole lock (its content-address).\nThis implements the PIN_PER_STEP workflow standard (heritage: Snakemake 2012, Pegasus 2001).\n\nIMPLEMENTATION CONSTRAINTS (these shape which evidence is useful):\n- The CLI is ZERO-DEPENDENCY, NETWORK-FREE, DETERMINISTIC: SHA-256 via node:crypto, JSON I/O only. It makes NO model calls. The ORCHESTRATION HARNESS supplies the resolved models + byte-exact prompts + verifier run_id; the CLI only canonicalizes + hashes + validates them, and `lock --verify` re-derives the deterministic hashes and FAILS (exit 1) on drift.\n- Honest ceiling: pinning model+prompt+temp does NOT give bit-identical LLM outputs. The lock pins INPUTS byte-exact + records OUTPUT hashes for DRIFT DETECTION — \"replayable inputs + drift-detectable outputs\", NOT \"deterministic replay\".\n\nYOUR JOB: gather SPECIFIC, CITED, RETRIEVED evidence to answer ONE question. HARD RULES:\n- GROUND AT GENERATION TIME: use WebSearch and WebFetch to ACTUALLY RETRIEVE every source you cite THIS session. Cite ONLY sources you actually fetched. 
A claim you cannot ground in a fetched source is DROPPED, not invented.\n- Every finding needs: a one-sentence claim in your own words that MATCHES what the source actually says (do NOT overstate); author(s)/org; year; a RESOLVABLE identifier (arXiv:NNNN.NNNNN, a DOI, an RFC number, or a direct URL to the spec/paper — not a blog summary); the resolvable URL; whether you retrieved it; and a one-sentence DESIGN IMPLICATION for `dispatch.lock.json`.\n- Prefer specificity over breadth: 6-8 well-sourced, RETRIEVED findings beat 20 vague gestures. ~500-600 words of substance.\n- Set retrieved=false for anything you could not actually fetch — those will be dropped.\n\n========\nYOUR QUESTION (Q5-tool-schema-drift):\nQUESTION: How do LLM agent frameworks and tool/function-calling systems pin or version the TOOL/FUNCTION schemas an agent had, so a replay with the same prompt but a CHANGED tool surface is DETECTED? (This is the half PIN explicitly flags as missing.)\nInvestigate (retrieve the actual docs/specs/papers): OpenAI & Anthropic function-calling / tool-use schema definitions (JSON Schema for tool parameters); the Model Context Protocol (MCP) tool definition format and any capability negotiation / version field (modelcontextprotocol.io spec); JSON Schema canonicalization/hashing for API-contract drift; OpenAPI + Pact consumer-driven contract testing and \"schema drift\" detection in API tooling; and any agent-reproducibility / agent-eval work that captures the TOOL ENVIRONMENT as part of a run record. For each: how the tool surface is represented and how a change is surfaced as a failure. Map each to the `tool_schema_sha256` field — exactly WHAT to hash (the canonicalized tool JSONSchemas the agent was given) and HOW a changed tool surface surfaces as a `lock --verify` drift failure.\n\nReturn structured findings. Remember: retrieve-then-cite; drop what you cannot fetch.",
+ "tool_schema": {
+ "type": "object",
+ "additionalProperties": false,
+ "required": [
+ "question_id",
+ "findings",
+ "notes"
+ ],
+ "properties": {
+ "question_id": {
+ "type": "string"
+ },
+ "findings": {
+ "type": "array",
+ "items": {
+ "type": "object",
+ "additionalProperties": false,
+ "required": [
+ "claim",
+ "authors",
+ "year",
+ "identifier",
+ "url",
+ "retrieved",
+ "design_implication"
+ ],
+ "properties": {
+ "claim": {
+ "type": "string",
+ "description": "One-sentence finding in your own words, matching the source (do not overstate)."
+ },
+ "authors": {
+ "type": "string",
+ "description": "Author(s) or org, e.g. \"Koster & Rahmann\" or \"Rundgren et al.\" or \"OpenSSF\"."
+ },
+ "year": {
+ "type": "string"
+ },
+ "identifier": {
+ "type": "string",
+ "description": "arXiv:NNNN.NNNNN, a DOI (10.xxxx/...), an RFC number, or a direct URL."
+ },
+ "url": {
+ "type": "string",
+ "description": "A resolvable URL the existence oracle can fetch."
+ },
+ "retrieved": {
+ "type": "boolean",
+ "description": "true ONLY if you actually fetched this source this session."
+ },
+ "design_implication": {
+ "type": "string",
+ "description": "One sentence: implication for dispatch.lock.json."
+ }
+ }
+ }
+ },
+ "notes": {
+ "type": "string",
+ "description": "Coverage gaps, sources you could not fetch, dropped claims."
+ }
+ }
+ },
+ "schema_dialect": "https://json-schema.org/draft/2020-12/schema",
+ "output_sha256": "sha256-Y1lsUdKoplgdos0r4faN6UgxXX74D/Se0AVHCiUA9dM="
+ }
+ ],
+ "verification": {
+ "runner": "roleos verify-citations",
+ "runner_source": "role-os local clone E:/AI/role-os",
+ "tool": "prism verify --type citations",
+ "tool_version": "prism 1.6.0",
+ "verifier_model": "mistral-small:24b",
+ "verifier_family": "local",
+ "caller_family_excluded": "anthropic",
+ "verdict": "escalate",
+ "receipt_id": "prism-01kwbajx31dj9gcf5xn3cn5ydg",
+ "receipt_signature": "272c892124e3bc13a76b2674fa361b1d65aee6a588c74604cf4ae4e7c9440a8ba7888175b9ec1286fe87490121f694f64cd30adc3ffc0e1b31cd3365b7b38901",
+ "receipt_chain_sha256": "499b63905064a5e25fd1801c5530504c94742f2183c4d3c8eb545a20cfbb112e"
+ }
+ }
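The record's `note` says that `output_sha256` hashes the returned findings for drift detection, and that `tool_schema` pins the contract each agent was bound to. The diff does not show how study-swarm canonicalizes or hashes these values, so the following is only a minimal sketch under stated assumptions: the function names `canonicalize`, `sha256Tagged`, and `hasDrifted` are hypothetical, the key-sorting serializer is a simplified stand-in for a real scheme such as RFC 8785, and the `sha256-<base64>` shape is inferred from the values in the record. It is not expected to reproduce the hashes recorded above.

```javascript
// Hypothetical sketch; NOT the study-swarm implementation.
const { createHash } = require("node:crypto");

// Serialize a JSON value with object keys sorted recursively, so that key
// order does not affect the resulting hash. Assumption: a simplified stand-in
// for a full canonicalization scheme; it does not address number formatting
// edge cases or Unicode normalization.
function canonicalize(value) {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const keys = Object.keys(value).sort();
    const members = keys.map((k) => JSON.stringify(k) + ":" + canonicalize(value[k]));
    return "{" + members.join(",") + "}";
  }
  return JSON.stringify(value);
}

// Produce "sha256-" followed by the base64 SHA-256 digest of the UTF-8 text,
// matching the visible shape of the output_sha256 values in the record.
function sha256Tagged(text) {
  return "sha256-" + createHash("sha256").update(text, "utf8").digest("base64");
}

// Report drift when the recorded hash differs from a freshly derived one.
function hasDrifted(recordedHash, value) {
  return recordedHash !== sha256Tagged(canonicalize(value));
}
```

Under these assumptions, a verifier would recompute the hash from the stored value and treat any mismatch as drift, for example by exiting non-zero, in the spirit of the `lock --verify` behavior the prompts describe.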
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@dogfood-lab/study-swarm",
- "version": "1.1.0",
+ "version": "1.3.0",
  "description": "Ground design decisions in cited research, then verify every citation with a different model family before it becomes canon — a research-grounded design protocol, with a thin CLI.",
  "keywords": [
  "methodology",