@ara-commons/ara-skills 0.2.0 → 0.3.1

package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@ara-commons/ara-skills",
3
- "version": "0.2.0",
3
+ "version": "0.3.1",
4
4
  "description": "Install Agent-Native Research Artifact (ARA) skills — compiler, research-manager, rigor-reviewer — into Claude Code, Cursor, OpenCode, Gemini CLI, Codex, and more.",
5
5
  "type": "module",
6
6
  "bin": {
@@ -40,12 +40,12 @@
40
40
  "license": "MIT",
41
41
  "repository": {
42
42
  "type": "git",
43
- "url": "https://github.com/AmberLJC/Agent-Native-Research-Artifact.git",
43
+ "url": "https://github.com/ARA-Labs/Agent-Native-Research-Artifact.git",
44
44
  "directory": "packages/ara-skills"
45
45
  },
46
- "homepage": "https://github.com/AmberLJC/Agent-Native-Research-Artifact#readme",
46
+ "homepage": "https://github.com/ARA-Labs/Agent-Native-Research-Artifact#readme",
47
47
  "bugs": {
48
- "url": "https://github.com/AmberLJC/Agent-Native-Research-Artifact/issues"
48
+ "url": "https://github.com/ARA-Labs/Agent-Native-Research-Artifact/issues"
49
49
  },
50
50
  "engines": {
51
51
  "node": ">=18.0.0"
@@ -16,7 +16,7 @@ allowed-tools: Read, Write, Edit, Bash(python *|git clone *|ls *|mkdir *), Glob,
16
16
  metadata:
17
17
  author: ara-commons
18
18
  category: research-tooling
19
- version: "1.1.0"
19
+ version: "1.2.0"
20
20
  tags: [research, compilation, artifacts, knowledge-extraction]
21
21
  ---
22
22
 
@@ -120,14 +120,40 @@ For non-trivial figures (dense plots, log axes, multi-panel, anything needing re
120
120
  **Stage 2 — Cognitive Mapping**
121
121
  Map the atoms into `/logic/`:
122
122
  - **problem.md**: observations (with numbers) → gaps → key insight → assumptions
123
- - **claims.md**: falsifiable claims with proof pointers to experiment IDs (E01, E02…). Phrase each
124
- `Statement` at the strongest level the cited evidence directly supports; keep raw support in
125
- `Evidence basis` and broader synthesis in `Interpretation`. Don't upgrade a validation-metric
126
- result into a claim about training dynamics without training-side evidence.
123
+ - **claims.md**: falsifiable claims with proof pointers to experiment IDs (E01, E02…). A claim's job
124
+ is the **takeaway, not the record**. Before writing a `Statement`, distill: for each result,
125
+ ablation, or dead-end, ask what it *reveals*: the mechanism or relationship behind the number, the
126
+ WHY a reader would reuse, and make THAT the `Statement`. Look across results too, not one at a
127
+ time: where several experiments together reveal a relationship none shows alone — whether they
128
+ agree on it or differ in a way that reveals what bounds it — make THAT relationship the claim
129
+ (`Proof` spanning them, `Dependencies` the narrower claims it rests on), rather than settling for
130
+ one claim per experiment. The recipe name, run IDs, and numbers are
131
+ the evidence *for* the takeaway, not the takeaway itself: they live in `Evidence basis`/`Proof`,
132
+ referenced and never restated in the Statement. A `Statement`'s subject is a mechanism/relationship,
133
+ never a named recipe/config/run, and carries no run numbers, scores, step counts, or p-values. Bound
134
+ every Statement with a `Conditions` field (the regime + the untested boundary) and a substantive
135
+ `Falsification criteria` (about the system for a mechanism claim, about the benchmark's behavior for
136
+ a methodological one) — this accountability, not a narrowed sentence, is what keeps a generalized
137
+ claim honest. Don't upgrade a validation-metric result into a claim about training dynamics without
138
+ training-side evidence. Stating the mechanism a result reveals is the goal **even from a single
139
+ instance** — what you must NOT do is extrapolate it into a universal law beyond its regime, or
140
+ assert a distinction the design cannot disentangle; that limit goes in `Conditions` so the
141
+ `Statement` can still carry the mechanism rather than collapsing back to a recipe-and-number.
142
+ **Ground every load-bearing number in a claim like code** (the `# Grounding` discipline,
143
+ applied to numbers): before writing it, open its source and copy the matched line verbatim into a
144
+ `**Sources**` entry — `<value> ← <source ref> «matched line» [input]` for values that were set
145
+ (cite where they're defined), `[result]` for values a run produced (cite the log/output that
146
+ reports them). Never write a number from memory and back-fill a path; never carry a value over
147
+ from a dependency claim — re-open this claim's own source. A bare path with no «quote» is invalid;
148
+ if a source can't be opened this turn, write `[pending: …]` (an unverified path is fabrication,
149
+ worse than `[pending]`).
127
150
  - **concepts.md**: the paper's genuine technical terms, formally defined
128
151
  - **experiments.md**: declarative verification/analysis plans (NO exact numbers — directional
129
152
  only). "Experiment" generalizes to the field's way of testing a claim: an eval run, a statistical
130
- test, a proof obligation, a user study.
153
+ test, a proof obligation, a user study. Link each experiment to where its results are filed
154
+ (`Evidence`) and to what produced it (`Run`, including failed/ablated runs). Claims and experiments
155
+ are many-to-many — a claim that generalises across runs lists every experiment in its `Proof`;
156
+ don't mirror one experiment per claim.
131
157
  - **solution/**: the method layer — `constraints.md` (limitations/assumptions) is always present;
132
158
  beyond it, create the files the paper's content actually calls for (architecture, algorithm,
133
159
  method, study design, formalization, proofs, heuristics — whatever fits the work). You decide
@@ -144,8 +170,11 @@ whichever layer fits best, preserving the source's granularity. Never silently d
144
170
  or released form, *distinct from the prose that describes it*. `src/environment.md` is always
145
171
  required (reproducibility). Beyond it, one rule decides everything:
146
172
 
147
- > **Capture every concrete artifact the source actually contains, in its native form; never
148
- > re-encode a prose-only description as code.**
173
+ > **Represent every concrete artifact losslessly. When it persists in a linkable external store (a
174
+ > run database, a released/versioned repo), point to it — a comprehensive `src/artifacts.md` index,
175
+ > one link per artifact (every run, config, log, script), nothing aggregated or copied. Capture it
176
+ > into `src/execution/` only when it would otherwise be lost — code that lives solely inside the
177
+ > paper, or a source not externally persisted. Never re-encode a prose-only description as code.**
149
178
 
150
179
  A concrete artifact is real content the cognitive layer doesn't already hold — capture it (grounded
151
180
  in the real repo/files when provided), in whatever directory fits. But a method conveyed only in
@@ -210,7 +239,7 @@ Run ARA Seal Level 1. Check:
210
239
  - Mandatory-core dirs exist (`logic/`, `logic/solution/`, `src/`, `trace/`, `evidence/`) and all
211
240
  mandatory-core files exist and are non-empty
212
241
  - PAPER.md has valid frontmatter (title, authors, year) + a Layer Index
213
- - claims.md has C01+ blocks with Statement, Status, Falsification criteria, Proof
242
+ - claims.md has C01+ blocks with Statement, Conditions, Status, Falsification criteria, Proof; Conditions non-trivial
214
243
  - experiments.md has E01+ blocks with Verifies, Setup, Procedure, Expected outcome (no exact numbers)
215
244
  - concepts.md, related_work.md, constraints.md non-trivial; any heuristics blocks have Rationale,
216
245
  Sensitivity, Bounds
@@ -227,6 +256,18 @@ Run ARA Seal Level 1. Check:
227
256
  - **Cited locations verified** (Rule 15): every repo path/`file:line` exists and is in range;
228
257
  spot-check that trace `source_refs` and evidence `Source` actually contain the cited content; no
229
258
  repo fact transcribed from the paper without checking the real file
259
+ - **Statement is a takeaway, not a record** — its own dedicated FAIL pass, symmetric to the
260
+ number-sources pass: scan EVERY claim's `Statement`. It FAILS if the Statement's subject is a named
261
+ recipe/config/run, or if the Statement contains a run number, n-count, score, step/bin count, or
262
+ p-value. Such a claim is a leaderboard coordinate, not knowledge — the mechanism it reveals must
263
+ become the Statement and the numbers move to `Evidence basis`/`Proof`. Exhaustive, not spot-checked
264
+ - **Number sources bound** (claims & heuristics) — run this as its own dedicated pass, one job: for
265
+ *each* `**Sources**` entry, re-open the cited `file:line` (or trace `node:field`) and confirm the
266
+ verbatim «quote» is actually there and the number in the `Statement`/`Rationale` matches the value
267
+ inside the quote; `[input]` entries cite recipe scripts, `[result]` entries cite logs/trace (not
268
+ swapped). Exhaustive, not spot-checked. `[pending: …]` entries are allowed but listed for
269
+ follow-up; a bare path, a «quote» absent from the cited line, or a value that disagrees with its
270
+ quote FAILS
230
271
  - **Self-consistency**: ARA-authored derived numbers recompute; PAPER.md declared counts match the
231
272
  files; tree `evidence:` refs are claim IDs (C##), not observation IDs
232
273
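The "Number sources bound" pass in the hunk above could be sketched as a small checker. This is a minimal illustration under stated assumptions, not part of the package: the names `check_source_entry`, `ENTRY_RE`, and `PENDING_RE` are hypothetical, the source ref is assumed to take the `path:line` form, and cited files are passed in as an in-memory mapping rather than re-opened from disk. It does not check whether `[input]` and `[result]` tags point at the right kind of file.

```python
import re
from typing import Optional

# Assumed entry shape, following the template quoted in the diff:
#   <value> ← <path>:<line> «matched line» [input|result]
ENTRY_RE = re.compile(
    r"^(?P<value>\S+)\s+←\s+(?P<path>[^\s:]+):(?P<line>\d+)\s+"
    r"«(?P<quote>[^»]*)»\s+\[(?P<tag>input|result)\]$"
)
PENDING_RE = re.compile(r"\[pending:[^\]]*\]")


def check_source_entry(entry: str, files: dict[str, list[str]]) -> Optional[str]:
    """Return None on pass, 'pending' for follow-up, or a FAIL reason string."""
    entry = entry.strip()
    if PENDING_RE.search(entry):
        return "pending"  # allowed, but should be listed for follow-up
    m = ENTRY_RE.match(entry)
    if m is None:
        return "FAIL: malformed entry (bare path or missing «quote»/tag)"
    lines = files.get(m["path"])
    line_no = int(m["line"])
    if lines is None or not 1 <= line_no <= len(lines):
        return "FAIL: cited location does not exist or is out of range"
    if m["quote"] not in lines[line_no - 1]:
        return "FAIL: «quote» absent from the cited line"
    # The value must appear in the quote as a whole number token,
    # so that "0.1" does not match inside "0.15" or "10.1".
    value_re = rf"(?<![\d.]){re.escape(m['value'])}(?!\d)"
    if re.search(value_re, m["quote"]) is None:
        return "FAIL: value disagrees with its quote"
    return None
```

Running it over every `**Sources**` entry, rather than a sample, matches the "exhaustive, not spot-checked" wording of the pass.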
 
@@ -251,13 +292,13 @@ key stats (claims, experiments, concepts, tree nodes, evidence tables/figures).
251
292
  7. **"Not specified"**: if information is genuinely unavailable, write "Not specified in paper" — never guess
252
293
  8. **No fake source labels**: never call a derived subset `Table N`/`Figure N` unless it faithfully reproduces the original
253
294
  9. **No synthetic trace history**: don't invent decisions, dead ends, or experiments not explicit in the inputs; mark inferred trajectories as inferred or omit them
254
- 10. **Evidence-limited wording**: don't use stronger language than the evidence supports; separate observation from interpretation
295
+ 10. **Distill the takeaway, then bound it**: a `Statement` is the mechanism or relationship a result reveals — the reusable WHY — with the named recipe and its numbers demoted to `Evidence basis`/`Proof`, never restated in the sentence and never its subject. Keep it accountable by an explicit `Conditions` regime, a substantive `Falsification criteria` (about the system, or about the benchmark's behavior for a methodological claim), and grounded `Proof` — not by narrowing the sentence to a measured value. A single instance still licenses a mechanism `Statement`: what is forbidden is extrapolating it into a universal law beyond its regime, or asserting a distinction the design cannot disentangle — those limits go in `Conditions`, they do not shrink the Statement back to a recipe-and-number. Still separate observation from interpretation: the numbers stay in the evidence layer, reached via `Proof`/`Evidence basis`
255
296
  11. **Visual extraction is honest extraction**: read figures by looking; mark estimates `≈` with extraction method + confidence; never present a digitized estimate as exact, invent points for an unreadable figure, or turn a diagram into a fake data table
256
297
  12. **Complete, ordered evidence**: file EVERY numbered table and figure, in order — a systematic sweep, not a lucky sample — each as a markdown transcription PLUS a saved screenshot (`.png`). No early stopping; account for any object you don't file
257
298
  13. **Fit the file set to the paper, not the paper to a template**: only PAPER.md + the mandatory core are required. Beyond them, generate the files THIS work actually warrants and nothing it doesn't have. Never force inappropriate files (e.g. model-training configs onto an eval or theory paper)
258
- 14. **`src/` holds concrete artifacts, not re-encoded prose**: capture every concrete artifact the source actually contains, in its native form, grounded in real files. Two sides: (a) never fabricate a code stub from a prose-only method — it already lives in `logic/`, so a `.py` just duplicates it; (b) never drop a concrete artifact that does exist — a lone `environment.md` is wrong when the work has one
299
+ 14. **`src/` holds concrete artifacts, not re-encoded prose**: capture every concrete artifact the source actually contains, in its native form, grounded in real files. Three sides: (a) never fabricate a code stub from a prose-only method — it already lives in `logic/`, so a `.py` just duplicates it; (b) never drop a concrete artifact that does exist — a lone `environment.md` is wrong when the work has one; (c) when the work's artifacts **persist in a linkable external store** (a run database, a released or versioned repo), represent them as a **comprehensive pointer index** in `src/artifacts.md` — one link per artifact (every run, config, log, script), nothing aggregated into a vague bucket, nothing copied; a lossy subset-copy is the failure. **Transcribe real source into `src/execution/` only when it would otherwise be lost** — code that lives solely inside the paper, or a source not externally persisted (then `# Grounding: transcribed`, cite path). No implementation in the input → neither applies.
259
300
  15. **Source-bounded minimums**: any count or required field is a target, never a license to invent. If the source supports fewer, produce what is real and note the shortfall; for an unstated field write "Not specified in paper" rather than guessing
260
- 16. **Cite by verification, and ask on conflict**: a source reference (evidence `Source`, trace `source_refs`, claim `Proof`, a repo `file:line`/path) promises the cited location actually contains the claim — open it and confirm. Never transcribe a *description* of an artifact as a verified fact about it. **When the code repo and the paper disagree on a fact (line count, path, value, behavior), do NOT pick one silently — surface the conflict to the user and ask which source to follow.** If unverifiable and the user is unavailable, attribute it ("per §X") or omit. Carry a statistic's scope/denominator in its `Source`
301
+ 16. **Cite by verification, and ask on conflict**: a source reference (evidence `Source`, trace `source_refs`, claim `Proof`, a repo `file:line`/path) promises the cited location actually contains the claim — open it and confirm. Never transcribe a *description* of an artifact as a verified fact about it. **When the code repo and the paper disagree on a fact (line count, path, value, behavior), do NOT pick one silently — surface the conflict to the user and ask which source to follow.** If unverifiable and the user is unavailable, attribute it ("per §X") or omit. Carry a statistic's scope/denominator in its `Source`. **This extends to every load-bearing number in a claim/heuristic `Statement`/`Rationale`: it carries a `**Sources**` entry whose verbatim «quote» you opened and confirmed contains that value — a memory-filled value or a bare path is fabrication; use `[pending]` when you cannot open the source**
261
302
 
262
303
  ## Reference Files
263
304
 
@@ -159,20 +159,51 @@ Rule: if a filename includes a source label such as `table3` or `figure4`, it sh
159
159
 
160
160
  Each claim MUST have ALL fields:
161
161
  ```markdown
162
- ## C{NN}: {Short title}
163
- - **Statement**: {Precise, falsifiable assertion}
162
+ ## C{NN}: {generalized title — the takeaway, not a recipe/result name}
163
+ - **Statement**: {the generalized, mechanistic conclusion the evidence supports; subject = a mechanism/relationship, never a named recipe; carries NO run numbers}
164
+ - **Conditions**: {under what conditions it holds; the regime; the known untested boundary}
164
165
  - **Status**: {hypothesis|supported|refuted}
165
- - **Falsification criteria**: {What would disprove this}
166
+ - **Falsification criteria**: {a concrete observation that would disprove it — for a mechanism claim, about the system/world; for a methodological/regime claim, about the benchmark's behavior. Not a tautology or a re-run of the same gate}
166
167
  - **Proof**: [{experiment IDs: E01, E02}]
167
- - **Evidence basis**: {What the cited evidence directly shows}
168
- - **Interpretation**: {Optional broader reading that should not be confused with the raw evidence}
169
- - **Dependencies**: {other claim IDs, if any}
168
+ - **Evidence basis**: {what the cited evidence shows — point to it; do NOT restate run numbers in the Statement}
169
+ - **Dependencies**: {claim IDs this one rests on — the narrower claims a more general claim draws on, or a claim it corrects/refines; not mere shared setup; omit if it rests only on its own evidence}
170
170
  - **Tags**: {comma-separated keywords}
171
171
  ```
172
172
 
173
173
  Proof MUST reference experiment IDs from experiments.md.
174
174
  Each proofed experiment should in turn be backed by evidence files whose rows or measurements actually match the claim being asserted.
175
- `Statement` should stay at the strongest level directly supported by the cited evidence. Use `Interpretation` for broader synthesis.
175
+ `Statement` is the **generalized conclusion the evidence supports**: a mechanism or relationship,
176
+ not a restatement of run numbers. The claim is kept falsifiable and honest by `Conditions` (the
177
+ regime it holds in + the untested boundary) and a `Falsification criteria`, not by narrowing the
178
+ sentence to a single measured value. Numbers (n, scores, step counts, run IDs) live in the evidence
179
+ layer and are reached via `Proof`/`Evidence basis`, never pasted into `Statement`. `Conditions` is
180
+ mandatory: a generalized Statement with no Conditions is an unbounded slogan.
181
+
182
+ **Distill the mechanism; bound the reach.** Before writing a `Statement`, ask what the result
183
+ *reveals* — the mechanism or relationship a reader would reuse — and state that; the recipe and its
184
+ numbers are the evidence for it, not the claim, and never its subject. A single instance still
185
+ licenses a mechanism `Statement`; what is forbidden is extrapolating it into a universal law beyond
186
+ its regime, or asserting a distinction the design cannot disentangle. Put that boundary in
187
+ `Conditions` — it bounds *where* the claim holds and is not a license for the verb to over-reach.
188
+ `Conditions` carries the limits so the `Statement` can carry the mechanism.
189
+
190
+ **A claim's evidence may be one result or several read together.** Most claims distill what a single
191
+ result reveals; but where several experiments together reveal a relationship none shows alone —
192
+ whether they agree on it, or differ in a way that itself reveals what bounds or explains the
193
+ difference — that relationship is the claim. Write it as an ordinary `## C` block whose `Proof` lists
194
+ every experiment it draws on and whose `Dependencies` names the narrower claims it rests on; the same
195
+ distill-the-mechanism, bound-the-reach discipline applies. State the most general relationship the
196
+ evidence supports — bounded by `Conditions`, never asserted past what those experiments jointly show —
197
+ rather than settling for one claim per experiment. A claim need not be about the object under study:
198
+ a reusable relationship the work itself exposes, including in how it was run, is worth a claim.
199
+
200
+ **The attribution trap (the most common miss).** An ablation / leave-one-out that shows *which*
201
+ components dominate is the *evidence*, not the claim. A Statement that merely names the load-bearing
202
+ vs decorative components passes the no-numbers gate but is still a league table of *this* system.
203
+ Apply the **name-deletion test**: strike your system's component names from the Statement — if
204
+ nothing a stranger working on a different stack could reuse survives, you wrote attribution. State
205
+ instead what the ranking reveals about the *class* of system; the named components and their deltas
206
+ live in `Evidence basis`, reached via `Proof`.
176
207
 
177
208
  ---
178
209
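Part of the claim template above lends itself to a mechanical check. The sketch below is a rough heuristic, not the package's validator: `lint_claims`, `CLAIM_HEADER`, and `FIELD` are hypothetical names, field bodies are assumed to fit on one line, and "contains a number" is approximated as "contains any digit". Whether a Statement's subject is a named recipe, or survives the name-deletion test, still needs a human or model reading.

```python
import re

CLAIM_HEADER = re.compile(r"^## (C\d+):", re.MULTILINE)
# Matches single-line fields such as "- **Statement**: text"
FIELD = re.compile(r"^- \*\*(?P<name>[^*]+)\*\*:[ \t]*(?P<body>.*)$", re.MULTILINE)


def lint_claims(markdown: str) -> dict[str, list[str]]:
    """Map each claim ID to its list of problems (empty list = nothing flagged)."""
    problems: dict[str, list[str]] = {}
    headers = list(CLAIM_HEADER.finditer(markdown))
    for i, header in enumerate(headers):
        # A claim block runs from its header to the next claim header (or EOF).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(markdown)
        block = markdown[header.end():end]
        fields = {m["name"]: m["body"].strip() for m in FIELD.finditer(block)}
        issues: list[str] = []
        statement = fields.get("Statement", "")
        if not statement:
            issues.append("missing Statement")
        elif re.search(r"\d", statement):
            # Run numbers, scores, step counts, and p-values all contain digits.
            issues.append("Statement contains a number")
        if not fields.get("Conditions"):
            issues.append("missing or empty Conditions")
        problems[header[1]] = issues
    return problems
```

A digit test is deliberately strict: it will also flag harmless numerals, which is acceptable for a gate whose stated intent is that numbers live in `Evidence basis`/`Proof`.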
 
@@ -192,11 +223,15 @@ borrowed terms to reach 5 (Rule 14). One section per concept:
192
223
 
193
224
  ## logic/experiments.md
194
225
 
195
- ≥3 experiments. Declarative plans, NOT scripts. NO exact numerical results.
226
+ ≥3 experiments. Declarative plans, NOT scripts. NO exact numerical results. Experiments and claims
227
+ are **many-to-many**: one experiment may verify several claims, and a claim that generalises across
228
+ runs lists every experiment it draws on in its `Proof` — do not force a 1:1 claim↔experiment ledger.
196
229
 
197
230
  ```markdown
198
231
  ## E{NN}: {Short title}
199
- - **Verifies**: {claim IDs, e.g., C01, C02}
232
+ - **Verifies**: {claim IDs this run bears on — may be several}
233
+ - **Evidence**: {evidence file(s) where this run's results are recorded — `evidence/…`; "pending" if not yet filed}
234
+ - **Run**: {what produced this result — a `src/execution/` file (or other `src/` artifact) when captured, else a link/ref into the source repo or run database; give it for EVERY experiment, including failed or ablated runs}
200
235
  - **Setup**:
201
236
  - Model: {model name and size}
202
237
  - Hardware: {GPU type, count, memory}
@@ -286,22 +321,41 @@ field format:
286
321
 
287
322
  ## src/execution/{module}.py (when the work warrants it — grounded or absent)
288
323
 
289
- Present only when the source provides **concrete code-shaped content**: actual repo code, or
290
- explicit pseudocode/equations the paper prints. The stub captures the **novel mechanism** and must
291
- be grounded, never fabricated.
324
+ Capture here is the **fallback, not the default**: transcribe code into `src/execution/` only when it
325
+ would otherwise be **lost**: it exists solely inside the paper, or its source is not externally
326
+ persisted. When the work's code/runs **persist in a linkable external store** (a repo, a run
327
+ database), do NOT copy them here — index them comprehensively in `src/artifacts.md` (see below). When
328
+ capture IS the call: actual repo code → capture real runnable files in native form (transcribed); only
329
+ pseudocode/equations the paper prints → a reconstructed stub of the **novel mechanism**. Either way it
330
+ must be grounded — never fabricated.
331
+
332
+ When the input is a run database / repo of many experiment runs, index it **comprehensively** in
333
+ `src/artifacts.md`: a link for **every** run and artifact (the per-run logs — e.g. a `runs.jsonl`
334
+ already indexes each — plus every config, candidate, log, and script), nothing aggregated into a vague
335
+ bucket and nothing copied. Each experiment's `Run` field points at the relevant entries. A lossy
336
+ subset — only the winning run, or runs collapsed into a single directory link — is the failure.
292
337
 
293
338
  Every file declares its grounding on the first line:
294
339
  ```python
295
340
  # Grounding: transcribed — adapted from repo code; cite file:line in docstrings
296
341
  # Grounding: reconstructed — from explicit paper pseudocode/equations; cite §/eq
297
342
  ```
298
- Contents:
343
+ Contents depend on the grounding:
344
+
345
+ **`transcribed` (a real repo file is provided)** — copy it faithfully in native form: full function
346
+ bodies, the file's own imports (third-party deps included), and its real scaffolding (CLI/argparse,
347
+ logging, entrypoints) all kept as in the repo. Do NOT replace working code with
348
+ `NotImplementedError`, strip plumbing, or reduce to signatures-only — that mutates the artifact and
349
+ breaks the cited `file:line`. Add only the `# Grounding` line and source-citing docstrings; otherwise
350
+ leave the file as it is in the repo.
351
+
352
+ **`reconstructed` (only pseudocode/equations exist)** — build a minimal stub of the novel mechanism:
299
353
  - Typed function signatures using ONLY names/types the source states
300
- - Docstrings that cite the source (`§4.2`, `Eq. 3`, `repo: model.py:88`) — not paraphrases of this skill
354
+ - Docstrings that cite the source (`§4.2`, `Eq. 3`) — not paraphrases of this skill
301
355
  - Implementation logic ONLY where the source provides it; everything unspecified stays
302
356
  `raise NotImplementedError("Not specified in paper")` — never plausible filler
303
- - NO scaffolding (no argparse, logging, distributed wrappers)
304
- - Import only standard libraries + the field's core stack (torch/numpy, pandas/statsmodels, etc.)
357
+ - NO scaffolding (no argparse, logging, distributed wrappers); import only standard libraries + the
358
+ field's core stack (torch/numpy, pandas/statsmodels, etc.)
305
359
 
306
360
  Hard rule: do not invent API names, function bodies, constants, or hyperparameters. **If the paper
307
361
  describes the method only in prose (no code, no printed pseudocode), do NOT write a `.py` stub or
@@ -309,11 +363,18 @@ pseudo-code — that information already lives in `logic/solution/`, and re-enco
309
363
  duplicates it.** A concrete artifact that IS raw "code" — e.g. a prompt or template — is different:
310
364
  store it verbatim in `src/prompts/`, don't paraphrase it. A hollow invented API is a hallucination.
311
365
 
312
- ## src/artifacts.md (when the implementation is not a `.py` stub)
366
+ ## src/artifacts.md (the artifact index: comprehensive pointer file when the source persists externally)
367
+
368
+ `src/` must represent the implementation **losslessly**. When the work's artifacts **persist in a
369
+ linkable external store** (a repo, a run database, a released tool/dataset), `artifacts.md` is the
370
+ **comprehensive pointer index** — a link to **every** artifact (every run, config, log, script,
371
+ released binary, dataset), grounded in the real files, nothing aggregated into a vague bucket and
372
+ nothing copied. One block (or row) per artifact:
313
373
 
314
- `src/` must still represent the implementation. When the deliverable is a released tool, library,
315
- skill/specification, system, benchmark, or dataset rather than a code stub, describe the **real**
316
- artifacts here, grounded in the actual repo/files when a repo is provided. One block per artifact:
374
+ **Capture is the fallback, not the default.** Transcribe a file into `src/execution/` only when it
375
+ would otherwise be **lost**: code that lives solely inside the paper, or a source not externally
376
+ persisted. When the source persists and is linkable, point to it here; copying a lossy subset (only
377
+ the winner, or files collapsed into a single directory link) is the failure.
317
378
 
318
379
  ```markdown
319
380
  ## {Artifact name}
@@ -35,12 +35,15 @@ where present, they are non-trivial — there is no fixed list. Model-training f
35
35
 
36
36
  ### logic/claims.md
37
37
  - Has `## C\d+` blocks (at least one claim)
38
- - Contains `**Statement**`
38
+ - Contains `**Statement**` (the mechanism/takeaway a result reveals — subject is a mechanism/relationship, never a named recipe; no run numbers)
39
+ - Contains `**Conditions**` (non-trivial: the regime + the untested boundary)
40
+ - Contains `**Sources**`; every load-bearing number in a claim has a `Sources` entry carrying
41
+ a verbatim «quote» plus an `[input]`/`[result]` tag — no bare-path entries, no memory-filled numbers
39
42
  - Contains `**Status**`
40
- - Contains `**Falsification criteria**`
43
+ - Contains `**Falsification criteria**` (a substantive observation — about the system, or about the benchmark's behavior for a methodological claim — not a tautology or a re-run of a metric gate)
44
+ - `Statement` is the mechanism/takeaway a result reveals, not a record: a single instance may state the mechanism it reveals, but must not be extrapolated into a universal law beyond its regime, nor assert a distinction the design cannot disentangle — those limits live in `Conditions`
41
45
  - Contains `**Proof**`
42
46
  - Contains `**Evidence basis**`
43
- - Contains `**Interpretation**`
44
47
 
45
48
  ### logic/problem.md
46
49
  - Has `### O\d+` blocks (observations)
@@ -50,6 +53,8 @@ where present, they are non-trivial — there is no fixed list. Model-training f
50
53
  ### logic/experiments.md
51
54
  - Has `## E\d+` blocks (at least 3)
52
55
  - Contains `**Verifies**`
56
+ - Contains `**Evidence**` (link to where the run's results are filed, or "pending")
57
+ - Contains `**Run**` (what produced the run — a `src/execution/` file or a link/ref into the source repo/DB; failed/ablated runs are linked too, not omitted)
53
58
  - Contains `**Setup**`
54
59
  - Contains `**Procedure**`
55
60
  - Contains `**Expected outcome**` or `**Expected results**`
@@ -87,8 +92,10 @@ fewer passes with fewer; what fails is fabricated filler.
87
92
  - `src/execution/`: ≥1 `.py` file only when the work has implementable content (repo code / paper pseudocode / named interface). NOT mandatory otherwise; omitting it (with a note in `environment.md`) beats fabricating one.
88
93
  - `evidence/tables/`, `evidence/figures/`, or `evidence/proofs/`: contains the filed evidence (see §11)
89
94
 
90
- ### Implementation layer (`src/`) — captured, not re-encoded
91
- - Concrete artifacts that exist are captured in native form: prompts/templates verbatim in `src/prompts/`, real repo code/tools/skills via grounded `src/execution/` or `src/artifacts.md`, config values in `src/configs/`. A lone `environment.md` is wrong when such artifacts exist.
95
+ ### Implementation layer (`src/`) — indexed when external, captured when it'd be lost
96
+ - Concrete artifacts are represented losslessly: prompts/templates verbatim in `src/prompts/`, config values in `src/configs/`, and when the work's code/runs **persist in a linkable external store** — a **comprehensive pointer index** in `src/artifacts.md` linking every artifact. A lone `environment.md` is wrong when such artifacts exist.
97
+ - **Comprehensiveness** (external repo/run-database input): `src/artifacts.md` links **every** run and source file (per-run logs included — a `runs.jsonl` counts), nothing aggregated into a bare directory link or a "~N others" summary. FAIL on a lossy subset (only the winning run; real artifacts collapsed into a vague bucket).
98
+ - **Capture only when it'd be lost**: transcribe source into `src/execution/` (native form, `# Grounding: transcribed`, cite path) only when it exists solely inside the paper or its source is not externally persisted. Pointer-only is correct when the source persists; it FAILS only when the pointer would dangle (no persisted source).
92
99
  - Conversely, a prose-only method (no code, no prompt, no config values) is NOT re-encoded as a `.py` stub or pseudo-code — it lives in `logic/solution/`; a lone `environment.md` is correct here. FAIL on a `.py` stub manufactured from prose (it just duplicates the cognitive layer).
93
100
 
94
101
  ### Code grounding (each `src/execution/*.py`, when present)
@@ -147,11 +154,18 @@ For each file in `evidence/figures/*.md` specifically:
147
154
  ### Claim Proof → Experiment Resolution
148
155
  - Every `E\d+` in a claim's `**Proof**: [...]` must exist in experiments.md
149
156
  - Proof-linked experiments should have evidence files whose labels and row contents actually match the compared systems or measurements
150
- - Claim wording should be auditable against `Evidence basis`; broader language should be isolated to `Interpretation`
157
+ - Claim `Statement` is a generalized mechanism/relationship auditable against `Evidence basis`; its reach is bounded by `Conditions` and its run numbers live in the evidence layer (not pasted into the Statement)
151
158
 
152
159
  ### Experiment Verifies → Claim Resolution
153
160
  - Every `C\d+` in an experiment's `**Verifies**` must exist in claims.md
154
161
 
162
+ ### Experiment Evidence / Run → Resolution
163
+ - Every `evidence/…` path in an experiment's `**Evidence**` is a filed evidence file (or "pending")
164
+ - Every experiment carries a `**Run**` ref — an entry in the comprehensive `src/artifacts.md` index (or, in the capture-fallback case, a `src/execution/` file) that links the source location; failed/ablated runs are linked there, not dropped
165
+
166
+ ### Claim Dependencies → Claim Resolution
167
+ - Every `C\d+` in a claim's `**Dependencies**` must exist in claims.md (an unresolved ID FAILS)
168
+
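The three resolution rules above (Proof to experiments, Verifies to claims, Dependencies to claims) share one shape: collect declared IDs, then flag any cited ID that is not declared. A minimal sketch under assumptions: entries are declared with `## C01: ...` / `## E01: ...` headings, and the helper names `declaredIds` and `unresolvedRefs` are hypothetical, not part of the package.

```javascript
// Collect IDs declared as "## C01: ..." or "## E01: ..." headings (heading shape is assumed).
function declaredIds(markdown, prefix) {
  const heading = new RegExp(`^## (${prefix}\\d+)`, "gm");
  return new Set([...markdown.matchAll(heading)].map((m) => m[1]));
}

// Return every ID cited on a "**Field**: ..." line that is not in the known set.
function unresolvedRefs(markdown, field, prefix, known) {
  const fieldLine = new RegExp(`\\*\\*${field}\\*\\*:(.*)`, "g");
  const idPattern = new RegExp(`\\b${prefix}\\d+\\b`, "g");
  const missing = [];
  for (const match of markdown.matchAll(fieldLine)) {
    for (const id of match[1].match(idPattern) ?? []) {
      if (!known.has(id)) missing.push(id);
    }
  }
  return missing;
}

// Illustrative stand-ins for claims.md and experiments.md.
const claimsMd = [
  "## C01: Example claim",
  "- **Proof**: [E01, E07]",
  "- **Dependencies**: [C02]",
].join("\n");
const experimentsMd = ["## E01: Example experiment", "- **Verifies**: [C01]"].join("\n");

const claimIds = declaredIds(claimsMd, "C");
const experimentIds = declaredIds(experimentsMd, "E");

const report = {
  proof: unresolvedRefs(claimsMd, "Proof", "E", experimentIds),
  dependencies: unresolvedRefs(claimsMd, "Dependencies", "C", claimIds),
  verifies: unresolvedRefs(experimentsMd, "Verifies", "C", claimIds),
};
console.log(report);
```

In this sample `E07` and `C02` are cited but never declared, so both would be reported; `Verifies` resolves cleanly.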
155
169
  ### Heuristic Code Ref → File Resolution (only when heuristics.md + src/execution/ are both present)
156
170
  - Every `src/...` path in `**Code ref**: [...]` must be an existing file
157
171
 
@@ -171,6 +185,25 @@ For each file in `evidence/figures/*.md` specifically:
171
185
  - No fact ABOUT a repo artifact (line count, path, internal structure) is transcribed from the paper without checking the real file — when paper and repo disagree, the discrepancy is flagged, not silently resolved to the paper's number
172
186
  - Spot-check trace `source_refs` and evidence `**Source**` labels: the cited section/table/appendix actually contains the claimed content
173
187
  - A statistic carries its scope/denominator (N, population) in its `Source` — subset figures (e.g. "5 papers / 3,050 reqs") are not juxtaposed with full-corpus figures as if same-denominator
188
+ - **Claim Statements are takeaways** (exhaustive, not spot-checked — symmetric to the number-sources
189
+ pass): each `## C\d+` Statement FAILS if its subject is a named recipe/config/run, or if it
190
+ contains a run number, n-count, score, step/bin count, or p-value. The mechanism a result reveals
191
+ must be the subject; the numbers must live in `Evidence basis`/`Proof`
192
+ - **Attribution is not insight** (exhaustive, applied to each Statement that already passed the
193
+ takeaways gate above): a Statement also FAILS if it only identifies *which* named components of
194
+ this one system rank highest/lowest (load-bearing / dominant / decorative / inert / "largest
195
+ contributor") without stating what that ranking *reveals* — a relationship or mechanism a reader
196
+ could carry to a different system. **Operational tell: delete this system's component names from
197
+ the Statement; if no transferable relationship survives, it is attribution, not a mechanism — it
198
+ FAILS.** The fix is to state the generalization the ranking licenses; the named components and
199
+ their deltas stay in `Evidence basis`. This is the most common way a numerically-clean Statement
200
+ still fails the insight bar.
201
+ - **Claim/heuristic number sources** (exhaustive, not spot-checked): each `**Sources**` entry's cited
202
+ `file:line` (or trace `node:field`) exists, the verbatim «quote» is actually present there, and the
203
+ number in the `Statement`/`Rationale` matches the value inside that quote; `[input]` entries cite
204
+ recipe scripts and `[result]` entries cite run logs/trace (not swapped). A bare path with no «quote»,
205
+ a «quote» absent from the cited line, or a value that disagrees with its quote FAILS. `[pending: …]`
206
+ entries pass but are listed for follow-up — an unverified plausible path does not pass
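The number-sources pass above reduces to three checks per entry: the cited line exists, the quote is on that line, and the value appears inside the quote. A rough sketch of that logic, assuming entries have already been parsed into objects and files are available as arrays of lines; the `checkSource` name, the entry shape, and the sample log are all hypothetical.

```javascript
// Hypothetical in-memory stand-in for the artifact's files: path -> array of lines.
const files = {
  "runs/train.log": ["step 100 loss 0.52", "step 200 loss 0.41"],
};

// Check one Sources entry against the three failure modes named in the rule.
function checkSource(entry, fileMap) {
  if (entry.pending) return { status: "pending", reason: entry.pending };
  if (!entry.quote) return { status: "fail", reason: "bare path with no quote" };
  const lines = fileMap[entry.file];
  const cited = lines ? lines[entry.line - 1] : undefined; // line numbers are 1-based
  if (cited === undefined) return { status: "fail", reason: "cited file:line does not exist" };
  if (!cited.includes(entry.quote)) return { status: "fail", reason: "quote absent from cited line" };
  // Compare whole tokens so that "0.4" is not accepted for a quote containing "0.41".
  const tokens = entry.quote.split(/[\s,;=:()]+/);
  if (!tokens.includes(entry.value)) return { status: "fail", reason: "value disagrees with quote" };
  return { status: "pass" };
}

const good = { value: "0.41", file: "runs/train.log", line: 2, quote: "step 200 loss 0.41" };
const wrongValue = { ...good, value: "0.40" };
const wrongLine = { ...good, line: 1 };
const barePath = { value: "0.41", file: "runs/train.log", line: 2 };
const pending = { value: "0.9", pending: "eval log not located" };

for (const entry of [good, wrongValue, wrongLine, barePath, pending]) {
  console.log(entry.value, checkSource(entry, files));
}
```

`pending` entries are reported separately rather than passed or failed, matching the "listed for follow-up" wording. The `[input]`/`[result]` swap check is left out of this sketch.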
174
207
 
175
208
  ## 11. Evidence Ledger Completeness
176
209
 
@@ -15,7 +15,7 @@ argument-hint: "[optional: hint about what happened this turn]"
15
15
  allowed-tools: Read, Write, Edit, Glob, Grep
16
16
  metadata:
17
17
  author: ara-commons
18
- version: "2.2.0"
18
+ version: "2.4.0"
19
19
  tags: [research, process-recording, provenance, progressive-crystallization, knowledge-management]
20
20
  ---
21
21
 
@@ -128,8 +128,10 @@ When a signal fires for `O{XX}`:
128
128
 
129
129
  1. Read O{XX}'s `content`, `context`, `potential_type`, `provenance`, `bound_to`.
130
130
  2. Allocate the next ID for the target layer (read the target file first).
131
- 3. Construct a typed entry using the schema (see Schemas below). Carry forward
132
`provenance`. Verbal-affirmation upgrades `ai-suggested` → `user-revised` (or `user` if
131
+ 3. Construct a typed entry using the schema (see Schemas below). **Before any number enters a
132
`Statement`/`Rationale`, ground it per "Number grounding" below: open the source, copy the
133
+ matched line verbatim into `Sources`, then write the number as a copy of that quote.** Carry
134
+ forward `provenance`. Verbal-affirmation upgrades `ai-suggested` → `user-revised` (or `user` if
133
135
  reproduced verbatim). The other three signals do **not** upgrade provenance.
134
136
  4. Add fields: `Crystallized via: <signal>`, `From staging: O{XX}`.
135
137
  5. Establish forensic bindings (claim→proof, heuristic→code, decision→evidence). Use
@@ -137,6 +139,24 @@ When a signal fires for `O{XX}`:
137
139
  6. Update O{XX}: `promoted: true`, `promoted_to: <layer>:<id>`, `crystallized_via: <signal>`.
138
140
  **Do not delete the observation** — the trail from raw to typed is part of the record.
139
141
 
142
+ #### Number grounding (claims & heuristics)
143
+
144
+ Every load-bearing number in a `Statement` (or a heuristic's `Rationale`/`Sensitivity`/`Bounds`)
145
+ is grounded the way code is — transcribed from an open source, never written from memory:
146
+
147
+ 1. **Open before you write.** Before the number enters the prose, open its source and copy the
148
+ matched line *verbatim* into `Sources` (`<value> ← <source ref> «matched line» [input|result]`).
149
+ The number you then write in the prose is a copy of the value inside that quote — not a value
150
+ recalled and back-cited. An entry with a bare path and no «quote» is invalid.
151
+ 2. **Input vs result.** Tag each entry `[input]` (a value you set — cite the source that defines it)
152
+ or `[result]` (a value the run produced — cite the log/output that reports it). Don't cite a
153
+ measured outcome to the config meant to produce it, or vice versa.
154
+ 3. **No inheritance.** Re-open *this* claim's own source for every number; a value shared with a
155
+ dependency claim is re-verified here, never copied from the dependency's wording.
156
+ 4. **`[pending]` beats a guess.** Can't open or locate a source this turn? Write
157
+ `<value> ← [pending: what's missing]`. An unverified-but-plausible path is fabrication and is
158
+ worse than `[pending]`.
159
+
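The entry format in step 1 and the `[pending]` form in step 4 are regular enough to parse. A minimal sketch, assuming one entry per string and a source ref that contains no spaces; `parseSourceEntry` is a hypothetical helper and the sample entries are illustrative only.

```javascript
// Parse "<value> ← <source ref> «matched line» [input|result]" or "<value> ← [pending: reason]".
function parseSourceEntry(text) {
  const pending = text.match(/^(.+?)\s*←\s*\[pending:\s*(.+?)\]\s*$/);
  if (pending) return { value: pending[1], pending: pending[2] };
  const grounded = text.match(/^(.+?)\s*←\s*(\S+)\s*«(.+)»\s*\[(input|result)\]\s*$/);
  // Anything else (for example a bare path with no «quote») is treated as invalid.
  if (!grounded) return null;
  const [, value, ref, quote, kind] = grounded;
  return { value, ref, quote, kind };
}

console.log(parseSourceEntry("3e-4 ← configs/train.yaml:12 «lr: 3e-4» [input]"));
console.log(parseSourceEntry("0.87 ← [pending: eval log not located this turn]"));
console.log(parseSourceEntry("0.87 ← runs/eval.log:40"));
```

Returning `null` for the bare-path case mirrors the rule that an entry with no «quote» is invalid rather than merely incomplete.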
140
160
  #### Contradiction trigger
141
161
 
142
162
  When a new event contradicts something already staged or crystallized:
@@ -164,9 +184,17 @@ entries — staged observations belong to Stage 3. (History lives in the trace;
164
184
  1. **Status updates** — flip a claim's `Status` field when evidence warrants.
165
185
  2. **Content revisions** — rewrite a `Statement`, `Rationale`, or definition when new
166
186
  evidence narrows scope, terminology changed, or wording no longer matches what's
167
- actually supported.
187
+ actually supported. Keep `Statement` a generalized mechanism/relationship and sharpen
188
+ `Conditions` as the regime becomes clearer; new run numbers update `Proof`/`evidence`,
189
+ never the Statement. A rewrite re-grounds every number it now contains (Number grounding);
190
+ any changed value gets its own fresh `Sources` «quote», never a carried-over one.
168
191
  3. **Structural changes** — split a claim into two, merge duplicates, repair
169
- dependencies, rename ids when concepts are renamed.
192
+ dependencies, rename ids when concepts are renamed. Also **generalize**: when several
193
+ crystallized claims are together evidence for a more general relationship none states
194
+ alone, author a new claim whose `Dependencies` are those narrower claims and whose
195
+ `Proof` spans their evidence — keep the narrower claims in place; the new claim sits
196
+ above them, not instead of them (only when a signal this turn makes the relationship
197
+ evident — never a routine sweep).
170
198
  4. **Consistency pass** — scan for broken cross-references (claim cites C05 which no
171
199
  longer exists), terminology mismatch with `concepts.md`, dependency loops.
172
200
 
@@ -236,6 +264,8 @@ When a signal fires for entry `E` (claim, heuristic, or concept):
236
264
  new id for the spin-off, update all cross-references.
237
265
  - **Merge**: keep the lower id, mark the higher id as `withdrawn` with
238
266
  `Merged into: C{XX}`, redirect cross-references.
267
+ - **Generalize**: allocate a new id for the more general claim, set its `Dependencies`
268
+ to the narrower claims, and leave those claims in place (they remain its grounding).
239
269
  6. **Record full before/after in today's session record** under `logic_revisions:`
240
270
  (see schema below). This is the ONLY place the prior wording is preserved — the
241
271
  logic file does not keep it.
@@ -340,17 +370,34 @@ tree:
340
370
  ### Claim (`logic/claims.md`) — crystallized only
341
371
 
342
372
  ```markdown
343
- ## C{XX}: {title}
344
- - **Statement**: {current falsifiable assertion}
373
+ ## C{XX}: {generalized title — the takeaway, not a recipe name}
374
+ - **Statement**: {the generalized, mechanistic conclusion; subject = a mechanism/relationship, never a named recipe; carries NO run numbers}
375
+ - **Conditions**: {under what conditions it holds; the regime; the known untested boundary}
376
+ - **Sources**: [{one entry per load-bearing number in the claim (now in `Conditions`/`Proof`): `<value> ← <file:line | trace-node:field> «verbatim line copied from source» [input|result]`, or `<value> ← [pending: reason]`}] # see "Number grounding"; a bare path with no «quote» is invalid
345
377
  - **Status**: hypothesis | untested | testing | supported | weakened | refuted | withdrawn
346
378
  - **Provenance**: user | ai-suggested | user-revised
347
- - **Falsification criteria**: {what would disprove this}
348
- - **Proof**: [{evidence refs or "pending"}]
379
+ - **Falsification**: {a concrete observation that would disprove it — for a mechanism claim, about the system/world; for a methodological/regime claim, about the benchmark's behavior. NOT a tautology or a re-run of the same gate ("if the recipe fails the gate")}
380
+ - **Proof**: [{evidence refs (→ evidence/) or "pending"; run numbers/IDs/scores live HERE, not in Statement}]
349
381
  - **Dependencies**: [C{YY}, ...]
350
382
  - **Tags**: {comma-separated}
351
383
  - **Last revised**: YYYY-MM-DD (turn-id) # pointer back to the trace; absent until first revision
352
384
  ```
353
385
 
386
+ **The Statement is the generalized conclusion the evidence supports — a mechanism or relationship,
387
+ not a restatement of run numbers.** What keeps it falsifiable and honest is `Conditions` (the regime
388
+ it holds in + the untested boundary) plus a `Falsification`, not a narrowed sentence. Numbers (run
389
+ IDs, n, scores, step counts) belong in `Proof` → `evidence/` (grounded per Number grounding), never
390
+ in `Statement`. `Conditions` is mandatory: a generalized Statement with no Conditions is an unbounded
391
+ slogan.
392
+
393
+ **Calibrate the Statement to what the evidence actually separates.** Do not assert a distinction the
394
+ design cannot disentangle (confounded factors — e.g. matrix "shape" vs "role" when they co-vary), or
395
+ a law from a single instance. When that's the case, hedge in the Statement itself — name the
396
+ unseparated factors together, or say "shown once here" — rather than only burying it in `Conditions`.
397
+ `Conditions` bounds *where* the claim applies; it is not a license for the Statement's verb to
398
+ over-reach. The Statement/Conditions may be sharpened on a later turn (Stage 4 content revision) as
399
+ the mechanism becomes clearer — no new closure signal is needed.
400
+
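Two of the schema rules above lend themselves to a coarse lint: numbers should not appear in `Statement`, and `Conditions` is mandatory. The sketch below is a heuristic only, assuming the `## C{XX}` / `- **Field**:` layout from the schema; the `lintClaims` name, the regex, and the sample claims are hypothetical, and a digit pattern cannot judge whether a Statement is a real mechanism.

```javascript
// Coarse heuristic: a digit run not glued to a letter, digit, or hyphen ("0.41", "n=32"), so "L2" is ignored.
const NUMBER_IN_PROSE = /(?<![A-Za-z\d-])\d+(?:\.\d+)?%?/;

function lintClaims(claimsMarkdown) {
  const problems = [];
  // Split into one section per "## C.." heading.
  const sections = claimsMarkdown.split(/^(?=## C\d+)/m).filter((s) => /^## C\d+/.test(s));
  for (const section of sections) {
    const id = section.match(/^## (C\d+)/)[1];
    const statement = section.match(/^- \*\*Statement\*\*:[ \t]*(.*)$/m);
    const conditions = section.match(/^- \*\*Conditions\*\*:[ \t]*\S.*$/m);
    if (statement && NUMBER_IN_PROSE.test(statement[1])) problems.push(`${id}: number in Statement`);
    if (!conditions) problems.push(`${id}: missing Conditions`);
  }
  return problems;
}

// Illustrative claims only; none of this is taken from a real artifact.
const sampleClaims = [
  "## C01: Warmup length",
  "- **Statement**: Run 7 reached 0.41 loss with n=32 seeds.",
  "- **Conditions**: Small models; larger scales untested.",
  "## C02: Curvature and warmup",
  "- **Statement**: Longer warmup helps when early curvature is high.",
  "- **Conditions**: Adam-family optimizers; untested for SGD.",
  "## C03: L2 penalty",
  "- **Statement**: An L2 penalty stabilizes training under label noise.",
].join("\n");

const problems = lintClaims(sampleClaims);
console.log(problems);
```

The lookbehind keeps identifiers such as "L2" from being flagged, at the cost of missing some numbers; a reviewer would still read each flagged Statement.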
354
401
  Current-state snapshot only — no prior statements, no `From staging`/`Crystallized via`
355
402
  notes. Crystallization and every edit are recorded in the trace (`trace/sessions/…` under
356
403
  `logic_revisions:` with before/after; source observation stays in `staging/`; reasoning in
@@ -362,6 +409,7 @@ marker, not a resting state — see Stage 4.
362
409
  ```markdown
363
410
  ## H{XX}: {title}
364
411
  - **Rationale**: {current best explanation of why this works}
412
+ - **Sources**: [{one entry per load-bearing number in `Rationale`/`Sensitivity`/`Bounds`, same format as claims — see "Number grounding"}]
365
413
  - **Status**: active | weakened | retired
366
414
  - **Provenance**: user | ai-suggested | user-revised
367
415
  - **Sensitivity**: low | medium | high | unknown # "unknown" until the turn establishes it — never guess
@@ -72,7 +72,7 @@ What KIND of event is this?
72
72
  → ai-action [session record only]
73
73
 
74
74
  Interpretation (something asserted to be true / general)?
75
- Falsifiable assertion about the system?
75
+ Generalizable falsifiable assertion (a mechanism/relationship, bounded by conditions)?
76
76
  → STAGE as potential_type: claim
77
77
  Implementation rule with rationale?
78
78
  → STAGE as potential_type: heuristic
@@ -148,6 +148,7 @@ For each claim's Falsification criteria field:
148
148
 
149
149
  - **Over-claiming**: Does any Statement use universal scope markers ("all models", "any dataset", "state-of-the-art across all") while cited experiments cover only specific, narrow conditions? The gap must be substantial.
150
150
  - **Under-claiming**: Are there important experimental results present in evidence/ that are not captured by any claim? (Evidence without a corresponding claim.)
151
+ - **Attribution vs mechanism**: Does any Statement merely name *which* components of this one system rank highest/lowest (load-bearing, dominant, decorative, inert) without stating what that ranking *reveals*? Apply the name-deletion test — strike the system's component names; if no transferable relationship survives, the Statement is attribution, not insight. Flag as `major` (the claim is a league table of this system, not a reusable finding); suggest the generalization the ranking licenses.
151
152
  - **Assumption explicitness**: Are key assumptions stated in problem.md (Assumptions section) or constraints.md? Are there unstated assumptions implied by the experimental design?
152
153
  - **Generalization boundaries**: Does the artifact clearly state what the claims do NOT apply to? Check constraints.md and limitations in the exploration tree.
153
154
  - **Qualifier consistency**: When claims use hedging ("tends to", "in most cases"), is this consistent with the evidence strength?
@@ -65,6 +65,7 @@ resolution, field presence, YAML parsing) is handled entirely by Level 1.
65
65
  |-------|---------------|-----------------|
66
66
  | Over-claiming | Statement uses universal scope while evidence covers narrow conditions | critical if extreme, major if moderate |
67
67
  | Under-claiming | Evidence files or experiment results not captured by any claim | minor |
68
+ | Attribution vs mechanism | Statement names which components rank where (name-deletion test leaves nothing transferable) instead of what the ranking reveals | major |
68
69
  | Assumption explicitness | Key assumptions stated in problem.md or constraints.md | major if unstated assumptions affect validity |
69
70
  | Generalization boundaries | Artifact states what claims do NOT apply to | minor |
70
71
  | Qualifier consistency | Hedging language matches evidence strength | minor |
package/src/installer.js CHANGED
@@ -45,7 +45,7 @@ function writeLock(dir, data) {
45
45
  ensureDir(dir);
46
46
  fs.writeFileSync(
47
47
  file,
48
- JSON.stringify({ updatedAt: new Date().toISOString(), ...data }, null, 2)
48
+ JSON.stringify({ ...data, updatedAt: new Date().toISOString() }, null, 2)
49
49
  );
50
50
  }
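The reordering in `writeLock` matters because, in an object literal, a later property overwrites an earlier one with the same key. The sketch below illustrates only that language rule; whether callers ever pass a `data` object that already contains `updatedAt` is an assumption, and the sample values are made up.

```javascript
// Stand-in for lock data that already carries a timestamp (assumed scenario).
const data = { skills: ["compiler"], updatedAt: "2000-01-01T00:00:00.000Z" };
const now = "2024-06-01T12:00:00.000Z"; // fixed value in place of new Date().toISOString()

const before = { updatedAt: now, ...data }; // 0.2.0 order: data.updatedAt overwrites the fresh value
const after = { ...data, updatedAt: now }; // 0.3.1 order: the fresh value is written last and wins

console.log(before.updatedAt); // 2000-01-01T00:00:00.000Z
console.log(after.updatedAt); // 2024-06-01T12:00:00.000Z
```

Under that assumption, the 0.3.1 ordering guarantees the lock file records the time of the current write.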
51
51