@ara-commons/ara-skills 0.3.0 → 0.4.0
- package/README.md +3 -2
- package/package.json +3 -2
- package/skills/compiler/SKILL.md +63 -11
- package/skills/compiler/references/ara-schema.md +94 -22
- package/skills/compiler/references/validation-checklist.md +32 -8
- package/skills/research-manager/SKILL.md +33 -8
- package/skills/research-manager/references/event-taxonomy.md +1 -1
- package/skills/research-visualizer/SKILL.md +172 -0
- package/skills/research-visualizer/references/binding.md +245 -0
- package/skills/research-visualizer/references/parsing.md +211 -0
- package/skills/research-visualizer/references/trajectory-template.html +804 -0
- package/skills/rigor-reviewer/SKILL.md +1 -0
- package/skills/rigor-reviewer/references/review-dimensions.md +1 -0
- package/src/index.js +1 -1
package/README.md
CHANGED
|
@@ -1,12 +1,13 @@
|
|
|
1
1
|
# @ara-commons/ara-skills
|
|
2
2
|
|
|
3
|
-
One-command installer for the
|
|
3
|
+
One-command installer for the four **Agent-Native Research Artifact (ARA)** skills:
|
|
4
4
|
|
|
5
5
|
| Skill | Invoke | What it does |
|
|
6
6
|
|-------|--------|--------------|
|
|
7
7
|
| `compiler` | `/compiler <input>` | Convert a paper, repo, or notes into a complete ARA artifact |
|
|
8
8
|
| `research-manager` | `/research-manager` | Post-session recorder that captures decisions, dead ends, and claims |
|
|
9
9
|
| `rigor-reviewer` | `/rigor-reviewer <dir>` | ARA Seal Level 2 semantic epistemic review across six dimensions |
|
|
10
|
+
| `research-visualizer` | `/research-visualizer <dir>` | Render an ARA into one interactive, self-contained trajectory.html |
|
|
10
11
|
|
|
11
12
|
## Quick start
|
|
12
13
|
|
|
@@ -64,4 +65,4 @@ In dev mode the CLI reads skills from the sibling `../../skills/` directory. On
|
|
|
64
65
|
|
|
65
66
|
## Upstream source of truth
|
|
66
67
|
|
|
67
|
-
The
|
|
68
|
+
The four skill directories live at the repo root under `skills/`. Edit them there — never edit the copy inside this package, which is created on demand by `prepack`.
|
package/package.json
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@ara-commons/ara-skills",
|
|
3
|
-
"version": "0.
|
|
4
|
-
"description": "Install Agent-Native Research Artifact (ARA) skills — compiler, research-manager, rigor-reviewer — into Claude Code, Cursor, OpenCode, Gemini CLI, Codex, and more.",
|
|
3
|
+
"version": "0.4.0",
|
|
4
|
+
"description": "Install Agent-Native Research Artifact (ARA) skills — compiler, research-manager, rigor-reviewer, research-visualizer — into Claude Code, Cursor, OpenCode, Gemini CLI, Codex, and more.",
|
|
5
5
|
"type": "module",
|
|
6
6
|
"bin": {
|
|
7
7
|
"ara-skills": "./bin/cli.js"
|
|
@@ -33,6 +33,7 @@
|
|
|
33
33
|
"compiler",
|
|
34
34
|
"research-manager",
|
|
35
35
|
"rigor-reviewer",
|
|
36
|
+
"research-visualizer",
|
|
36
37
|
"llm",
|
|
37
38
|
"cli"
|
|
38
39
|
],
|
package/skills/compiler/SKILL.md
CHANGED
|
@@ -120,11 +120,26 @@ For non-trivial figures (dense plots, log axes, multi-panel, anything needing re
|
|
|
120
120
|
**Stage 2 — Cognitive Mapping**
|
|
121
121
|
Map the atoms into `/logic/`:
|
|
122
122
|
- **problem.md**: observations (with numbers) → gaps → key insight → assumptions
|
|
123
|
-
- **claims.md**: falsifiable claims with proof pointers to experiment IDs (E01, E02…).
|
|
124
|
-
|
|
125
|
-
|
|
126
|
-
|
|
127
|
-
|
|
123
|
+
- **claims.md**: falsifiable claims with proof pointers to experiment IDs (E01, E02…). A claim's job
|
|
124
|
+
is the **takeaway, not the record**. Before writing a `Statement`, distill: for each result,
|
|
125
|
+
ablation, or dead-end, ask what it *reveals* — the mechanism or relationship behind the number, the
|
|
126
|
+
WHY a reader would reuse — and make THAT the `Statement`. Look across results too, not one at a
|
|
127
|
+
time: where several experiments together reveal a relationship none shows alone — whether they
|
|
128
|
+
agree on it or differ in a way that reveals what bounds it — make THAT relationship the claim
|
|
129
|
+
(`Proof` spanning them, `Dependencies` the narrower claims it rests on), rather than settling for
|
|
130
|
+
one claim per experiment. The recipe name, run IDs, and numbers are
|
|
131
|
+
the evidence *for* the takeaway, not the takeaway itself: they live in `Evidence basis`/`Proof`,
|
|
132
|
+
referenced and never restated in the Statement. A `Statement`'s subject is a mechanism/relationship,
|
|
133
|
+
never a named recipe/config/run, and carries no run numbers, scores, step counts, or p-values. Bound
|
|
134
|
+
every Statement with a `Conditions` field (the regime + the untested boundary) and a substantive
|
|
135
|
+
`Falsification criteria` (about the system for a mechanism claim, about the benchmark's behavior for
|
|
136
|
+
a methodological one) — this accountability, not a narrowed sentence, is what keeps a generalized
|
|
137
|
+
claim honest. Don't upgrade a validation-metric result into a claim about training dynamics without
|
|
138
|
+
training-side evidence. Stating the mechanism a result reveals is the goal **even from a single
|
|
139
|
+
instance** — what you must NOT do is extrapolate it into a universal law beyond its regime, or
|
|
140
|
+
assert a distinction the design cannot disentangle; that limit goes in `Conditions` so the
|
|
141
|
+
`Statement` can still carry the mechanism rather than collapsing back to a recipe-and-number.
|
|
142
|
+
**Ground every load-bearing number in a claim like code** (the `# Grounding` discipline,
|
|
128
143
|
applied to numbers): before writing it, open its source and copy the matched line verbatim into a
|
|
129
144
|
`**Sources**` entry — `<value> ← <source ref> «matched line» [input]` for values that were set
|
|
130
145
|
(cite where they're defined), `[result]` for values a run produced (cite the log/output that
|
|
@@ -135,7 +150,10 @@ Map the atoms into `/logic/`:
|
|
|
135
150
|
- **concepts.md**: the paper's genuine technical terms, formally defined
|
|
136
151
|
- **experiments.md**: declarative verification/analysis plans (NO exact numbers — directional
|
|
137
152
|
only). "Experiment" generalizes to the field's way of testing a claim: an eval run, a statistical
|
|
138
|
-
test, a proof obligation, a user study.
|
|
153
|
+
test, a proof obligation, a user study. Link each experiment to where its results are filed
|
|
154
|
+
(`Evidence`) and to what produced it (`Run`, including failed/ablated runs). Claims and experiments
|
|
155
|
+
are many-to-many — a claim that generalises across runs lists every experiment in its `Proof`;
|
|
156
|
+
don't mirror one experiment per claim.
|
|
139
157
|
- **solution/**: the method layer — `constraints.md` (limitations/assumptions) is always present;
|
|
140
158
|
beyond it, create the files the paper's content actually calls for (architecture, algorithm,
|
|
141
159
|
method, study design, formalization, proofs, heuristics — whatever fits the work). You decide
|
|
@@ -152,8 +170,11 @@ whichever layer fits best, preserving the source's granularity. Never silently d
|
|
|
152
170
|
or released form, *distinct from the prose that describes it*. `src/environment.md` is always
|
|
153
171
|
required (reproducibility). Beyond it, one rule decides everything:
|
|
154
172
|
|
|
155
|
-
> **
|
|
156
|
-
>
|
|
173
|
+
> **Represent every concrete artifact losslessly. When it persists in a linkable external store (a
|
|
174
|
+
> run database, a released/versioned repo), point to it — a comprehensive `src/artifacts.md` index,
|
|
175
|
+
> one link per artifact (every run, config, log, script), nothing aggregated or copied. Capture it
|
|
176
|
+
> into `src/execution/` only when it would otherwise be lost — code that lives solely inside the
|
|
177
|
+
> paper, or a source not externally persisted. Never re-encode a prose-only description as code.**
|
|
157
178
|
|
|
158
179
|
A concrete artifact is real content the cognitive layer doesn't already hold — capture it (grounded
|
|
159
180
|
in the real repo/files when provided), in whatever directory fits. But a method conveyed only in
|
|
@@ -178,6 +199,28 @@ the source actually reveals — but the node count and types are **source-bounde
|
|
|
178
199
|
never invent a dead end, decision, or experiment to hit a number. A paper that hides its failures
|
|
179
200
|
yields a smaller, honest tree (Rule 9 wins).
|
|
180
201
|
|
|
202
|
+
**Optional per-node changed-code (enrichment for the Research Visualizer).** When the work is a
|
|
203
|
+
sequence of code edits and the scripts are resolvable at compile time, you MAY attach to an experiment
|
|
204
|
+
node the **unified diff** it represents — never required, omitted when unclear:
|
|
205
|
+
1. **Resolve node → representative variant — this link does NOT already exist; construct it.** From the
|
|
206
|
+
node's `source_refs` / its claims' cited `record_configs` → the run index (`runs.csv`/`runs.jsonl`)
|
|
207
|
+
row(s) whose family+purpose+bin match → the representative submitted script. Where this is empty or
|
|
208
|
+
ambiguous (most `decision`/`dead_end` nodes, or evidence that is only journal prose), **omit
|
|
209
|
+
`code_change`** — never guess a script.
|
|
210
|
+
2. **Resolve node → diff base** from the lineage you already reconstruct for `solution/*` (wave baseline
|
|
211
|
+
or immediate-parent variant).
|
|
212
|
+
3. **Index both scripts in `src/artifacts.md` under a stable anchor** (`A01`, `A02`, …) carrying real
|
|
213
|
+
path + sha256 + original location; compute the unified diff (variant vs base) and write it to a tracked
|
|
214
|
+
**`evidence/changes/<node-id>.diff.md`** sidecar (fenced ```diff, `**Source**` header citing the two
|
|
215
|
+
anchor ids). Set the node's `code_change: {base_artifact, variant_artifact, lang, diff_file}`. The whole
|
|
216
|
+
scripts stay pointers (Rule 14) — the diff is a derived, grounded view, like a `derived_subset` table.
|
|
217
|
+
4. **Store-absent ⇒ pointers, not a diff.** If the scripts don't resolve on disk (git-ignored store),
|
|
218
|
+
still record `code_change` with the anchor ids + a `note`, omit `diff_file` — the visualizer shows a
|
|
219
|
+
pointer chip. Expected, not a failure.
|
|
220
|
+
|
|
221
|
+
You MAY also attach `node.thinking` — the agent's deliberation — but **only verbatim** grounded
|
|
222
|
+
journal/decision text; never compose new prose. No verbatim rationale ⇒ leave it absent.
|
|
223
|
+
|
|
181
224
|
### Step 3: Generate Files
|
|
182
225
|
|
|
183
226
|
Write the mandatory core, then the additional files the paper warrants. See
|
|
@@ -218,7 +261,7 @@ Run ARA Seal Level 1. Check:
|
|
|
218
261
|
- Mandatory-core dirs exist (`logic/`, `logic/solution/`, `src/`, `trace/`, `evidence/`) and all
|
|
219
262
|
mandatory-core files exist and are non-empty
|
|
220
263
|
- PAPER.md has valid frontmatter (title, authors, year) + a Layer Index
|
|
221
|
-
- claims.md has C01+ blocks with Statement, Status, Falsification criteria, Proof
|
|
264
|
+
- claims.md has C01+ blocks with Statement, Conditions, Status, Falsification criteria, Proof; Conditions non-trivial
|
|
222
265
|
- experiments.md has E01+ blocks with Verifies, Setup, Procedure, Expected outcome (no exact numbers)
|
|
223
266
|
- concepts.md, related_work.md, constraints.md non-trivial; any heuristics blocks have Rationale,
|
|
224
267
|
Sensitivity, Bounds
|
|
@@ -228,6 +271,10 @@ Run ARA Seal Level 1. Check:
|
|
|
228
271
|
heuristic `Code ref` → a real `src/execution/` file (when both exist); tree `evidence:` → claim IDs
|
|
229
272
|
- Evidence: **every numbered table and figure is filed with BOTH a markdown file and a screenshot
|
|
230
273
|
(.png)**; numbered objects not filed are accounted for in `evidence/README.md` with a reason
|
|
274
|
+
- **Changed-code (only if emitted):** each `evidence/changes/<node>.diff.md` cites two `src/artifacts.md`
|
|
275
|
+
anchors (`base`/`variant`) that resolve; the diff is verbatim; the node's `code_change` points at the
|
|
276
|
+
sidecar via `diff_file` (or carries a `note` with no `diff_file` when the store was absent). Optional —
|
|
277
|
+
absent is fine; never invent a diff or a node→script mapping
|
|
231
278
|
- Evidence files have **Source** fields; figures declare Figure type / Extraction method / Reading
|
|
232
279
|
confidence; estimated readings marked `≈` (not `exact_from_labels`); diagrams/qualitative samples
|
|
233
280
|
carry a visual description, not a fabricated table
|
|
@@ -235,6 +282,11 @@ Run ARA Seal Level 1. Check:
|
|
|
235
282
|
- **Cited locations verified** (Rule 15): every repo path/`file:line` exists and is in range;
|
|
236
283
|
spot-check that trace `source_refs` and evidence `Source` actually contain the cited content; no
|
|
237
284
|
repo fact transcribed from the paper without checking the real file
|
|
285
|
+
- **Statement is a takeaway, not a record** — its own dedicated FAIL pass, symmetric to the
|
|
286
|
+
number-sources pass: scan EVERY claim's `Statement`. It FAILS if the Statement's subject is a named
|
|
287
|
+
recipe/config/run, or if the Statement contains a run number, n-count, score, step/bin count, or
|
|
288
|
+
p-value. Such a claim is a leaderboard coordinate, not knowledge — the mechanism it reveals must
|
|
289
|
+
become the Statement and the numbers move to `Evidence basis`/`Proof`. Exhaustive, not spot-checked
|
|
238
290
|
- **Number sources bound** (claims & heuristics) — run this as its own dedicated pass, one job: for
|
|
239
291
|
*each* `**Sources**` entry, re-open the cited `file:line` (or trace `node:field`) and confirm the
|
|
240
292
|
verbatim «quote» is actually there and the number in the `Statement`/`Rationale` matches the value
|
|
@@ -266,11 +318,11 @@ key stats (claims, experiments, concepts, tree nodes, evidence tables/figures).
|
|
|
266
318
|
7. **"Not specified"**: if information is genuinely unavailable, write "Not specified in paper" — never guess
|
|
267
319
|
8. **No fake source labels**: never call a derived subset `Table N`/`Figure N` unless it faithfully reproduces the original
|
|
268
320
|
9. **No synthetic trace history**: don't invent decisions, dead ends, or experiments not explicit in the inputs; mark inferred trajectories as inferred or omit them
|
|
269
|
-
10. **
|
|
321
|
+
10. **Distill the takeaway, then bound it**: a `Statement` is the mechanism or relationship a result reveals — the reusable WHY — with the named recipe and its numbers demoted to `Evidence basis`/`Proof`, never restated in the sentence and never its subject. Keep it accountable by an explicit `Conditions` regime, a substantive `Falsification criteria` (about the system, or about the benchmark's behavior for a methodological claim), and grounded `Proof` — not by narrowing the sentence to a measured value. A single instance still licenses a mechanism `Statement`: what is forbidden is extrapolating it into a universal law beyond its regime, or asserting a distinction the design cannot disentangle — those limits go in `Conditions`, they do not shrink the Statement back to a recipe-and-number. Still separate observation from interpretation: the numbers stay in the evidence layer, reached via `Proof`/`Evidence basis`
|
|
270
322
|
11. **Visual extraction is honest extraction**: read figures by looking; mark estimates `≈` with extraction method + confidence; never present a digitized estimate as exact, invent points for an unreadable figure, or turn a diagram into a fake data table
|
|
271
323
|
12. **Complete, ordered evidence**: file EVERY numbered table and figure, in order — a systematic sweep, not a lucky sample — each as a markdown transcription PLUS a saved screenshot (`.png`). No early stopping; account for any object you don't file
|
|
272
324
|
13. **Fit the file set to the paper, not the paper to a template**: only PAPER.md + the mandatory core are required. Beyond them, generate the files THIS work actually warrants and nothing it doesn't have. Never force inappropriate files (e.g. model-training configs onto an eval or theory paper)
|
|
273
|
-
14. **`src/` holds concrete artifacts, not re-encoded prose**: capture every concrete artifact the source actually contains, in its native form, grounded in real files. Three sides: (a) never fabricate a code stub from a prose-only method — it already lives in `logic/`, so a `.py` just duplicates it; (b) never drop a concrete artifact that does exist — a lone `environment.md` is wrong when the work has one; (c) when the
|
|
325
|
+
14. **`src/` holds concrete artifacts, not re-encoded prose**: capture every concrete artifact the source actually contains, in its native form, grounded in real files. Three sides: (a) never fabricate a code stub from a prose-only method — it already lives in `logic/`, so a `.py` just duplicates it; (b) never drop a concrete artifact that does exist — a lone `environment.md` is wrong when the work has one; (c) when the work's artifacts **persist in a linkable external store** (a run database, a released or versioned repo), represent them as a **comprehensive pointer index** in `src/artifacts.md` — one link per artifact (every run, config, log, script), nothing aggregated into a vague bucket, nothing copied; a lossy subset-copy is the failure. **Transcribe real source into `src/execution/` only when it would otherwise be lost** — code that lives solely inside the paper, or a source not externally persisted (then `# Grounding: transcribed`, cite path). No implementation in the input → neither applies.
|
|
274
326
|
15. **Source-bounded minimums**: any count or required field is a target, never a license to invent. If the source supports fewer, produce what is real and note the shortfall; for an unstated field write "Not specified in paper" rather than guessing
|
|
275
327
|
16. **Cite by verification, and ask on conflict**: a source reference (evidence `Source`, trace `source_refs`, claim `Proof`, a repo `file:line`/path) promises the cited location actually contains the claim — open it and confirm. Never transcribe a *description* of an artifact as a verified fact about it. **When the code repo and the paper disagree on a fact (line count, path, value, behavior), do NOT pick one silently — surface the conflict to the user and ask which source to follow.** If unverifiable and the user is unavailable, attribute it ("per §X") or omit. Carry a statistic's scope/denominator in its `Source`. **This extends to every load-bearing number in a claim/heuristic `Statement`/`Rationale`: it carries a `**Sources**` entry whose verbatim «quote» you opened and confirmed contains that value — a memory-filled value or a bare path is fabrication; use `[pending]` when you cannot open the source**
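The "Statement is a takeaway, not a record" pass described in this file could be approximated mechanically. The following is a minimal sketch, not part of the package: the function name `statement_violations`, the single-line `Statement` assumption, and the regex heuristics are all hypothetical, and a digit match is only a cue for human review (it would also flag a legitimate term such as "2D").

```python
import re

# Hypothetical heuristics; the skill itself describes this pass in prose only.
HEADER = re.compile(r"(?m)^## (C\d+)\b.*$")
STATEMENT = re.compile(r"(?m)^- \*\*Statement\*\*:\s*(.*)$")
HAS_DIGIT = re.compile(r"\d")
HAS_PVALUE = re.compile(r"\bp\s*[<=>]", re.IGNORECASE)


def statement_violations(claims_md: str) -> dict[str, list[str]]:
    """Map claim ID -> reasons its Statement looks like a record, not a takeaway."""
    violations: dict[str, list[str]] = {}
    headers = list(HEADER.finditer(claims_md))
    for i, header in enumerate(headers):
        # A claim block runs from its header to the next claim header (or EOF).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(claims_md)
        block = claims_md[header.end():end]
        match = STATEMENT.search(block)
        if match is None:
            violations[header.group(1)] = ["missing Statement"]
            continue
        text = match.group(1)
        reasons = []
        if HAS_DIGIT.search(text):
            reasons.append("contains a number")
        if HAS_PVALUE.search(text):
            reasons.append("contains a p-value")
        if reasons:
            violations[header.group(1)] = reasons
    return violations
```

The sketch scans every claim rather than a sample, matching the "exhaustive, not spot-checked" wording; it cannot judge whether a Statement's subject is a named recipe, which still needs a reader.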
|
|
276
328
|
|
package/skills/compiler/references/ara-schema.md
CHANGED
|
@@ -32,6 +32,7 @@ evidence/
|
|
|
32
32
|
tables/ # ✓ every numbered Table: tableN.md + tableN.png
|
|
33
33
|
figures/ # ✓ every numbered Figure: figureN.md + figureN.png
|
|
34
34
|
proofs/ # as warranted: derivations / proofs
|
|
35
|
+
changes/ # as warranted: per-node code-change unified diffs (Research Visualizer)
|
|
35
36
|
rubric/requirements.md # (Only if a rubric is provided)
|
|
36
37
|
```
|
|
37
38
|
|
|
@@ -159,20 +160,51 @@ Rule: if a filename includes a source label such as `table3` or `figure4`, it sh
|
|
|
159
160
|
|
|
160
161
|
Each claim MUST have ALL fields:
|
|
161
162
|
```markdown
|
|
162
|
-
## C{NN}: {
|
|
163
|
-
- **Statement**: {
|
|
163
|
+
## C{NN}: {generalized title — the takeaway, not a recipe/result name}
|
|
164
|
+
- **Statement**: {the generalized, mechanistic conclusion the evidence supports; subject = a mechanism/relationship, never a named recipe; carries NO run numbers}
|
|
165
|
+
- **Conditions**: {under what conditions it holds; the regime; the known untested boundary}
|
|
164
166
|
- **Status**: {hypothesis|supported|refuted}
|
|
165
|
-
- **Falsification criteria**: {
|
|
167
|
+
- **Falsification criteria**: {a concrete observation that would disprove it — for a mechanism claim, about the system/world; for a methodological/regime claim, about the benchmark's behavior. Not a tautology or a re-run of the same gate}
|
|
166
168
|
- **Proof**: [{experiment IDs: E01, E02}]
|
|
167
|
-
- **Evidence basis**: {
|
|
168
|
-
- **
|
|
169
|
-
- **Dependencies**: {other claim IDs, if any}
|
|
169
|
+
- **Evidence basis**: {what the cited evidence shows — point to it; do NOT restate run numbers in the Statement}
|
|
170
|
+
- **Dependencies**: {claim IDs this one rests on — the narrower claims a more general claim draws on, or a claim it corrects/refines; not mere shared setup; omit if it rests only on its own evidence}
|
|
170
171
|
- **Tags**: {comma-separated keywords}
|
|
171
172
|
```
|
|
172
173
|
|
|
173
174
|
Proof MUST reference experiment IDs from experiments.md.
|
|
174
175
|
Each proofed experiment should in turn be backed by evidence files whose rows or measurements actually match the claim being asserted.
|
|
175
|
-
`Statement`
|
|
176
|
+
`Statement` is the **generalized conclusion the evidence supports** — a mechanism or relationship,
|
|
177
|
+
not a restatement of run numbers. The claim is kept falsifiable and honest by `Conditions` (the
|
|
178
|
+
regime it holds in + the untested boundary) and a `Falsification criteria`, not by narrowing the
|
|
179
|
+
sentence to a single measured value. Numbers (n, scores, step counts, run IDs) live in the evidence
|
|
180
|
+
layer and are reached via `Proof`/`Evidence basis`, never pasted into `Statement`. `Conditions` is
|
|
181
|
+
mandatory: a generalized Statement with no Conditions is an unbounded slogan.
|
|
182
|
+
|
|
183
|
+
**Distill the mechanism; bound the reach.** Before writing a `Statement`, ask what the result
|
|
184
|
+
*reveals* — the mechanism or relationship a reader would reuse — and state that; the recipe and its
|
|
185
|
+
numbers are the evidence for it, not the claim, and never its subject. A single instance still
|
|
186
|
+
licenses a mechanism `Statement`; what is forbidden is extrapolating it into a universal law beyond
|
|
187
|
+
its regime, or asserting a distinction the design cannot disentangle. Put that boundary in
|
|
188
|
+
`Conditions` — it bounds *where* the claim holds and is not a license for the verb to over-reach.
|
|
189
|
+
`Conditions` carries the limits so the `Statement` can carry the mechanism.
|
|
190
|
+
|
|
191
|
+
**A claim's evidence may be one result or several read together.** Most claims distill what a single
|
|
192
|
+
result reveals; but where several experiments together reveal a relationship none shows alone —
|
|
193
|
+
whether they agree on it, or differ in a way that itself reveals what bounds or explains the
|
|
194
|
+
difference — that relationship is the claim. Write it as an ordinary `## C` block whose `Proof` lists
|
|
195
|
+
every experiment it draws on and whose `Dependencies` names the narrower claims it rests on; the same
|
|
196
|
+
distill-the-mechanism, bound-the-reach discipline applies. State the most general relationship the
|
|
197
|
+
evidence supports — bounded by `Conditions`, never asserted past what those experiments jointly show —
|
|
198
|
+
rather than settling for one claim per experiment. A claim need not be about the object under study:
|
|
199
|
+
a reusable relationship the work itself exposes, including in how it was run, is worth a claim.
|
|
200
|
+
|
|
201
|
+
**The attribution trap (the most common miss).** An ablation / leave-one-out that shows *which*
|
|
202
|
+
components dominate is the *evidence*, not the claim. A Statement that merely names the load-bearing
|
|
203
|
+
vs decorative components passes the no-numbers gate but is still a league table of *this* system.
|
|
204
|
+
Apply the **name-deletion test**: strike your system's component names from the Statement — if
|
|
205
|
+
nothing a stranger working on a different stack could reuse survives, you wrote attribution. State
|
|
206
|
+
instead what the ranking reveals about the *class* of system; the named components and their deltas
|
|
207
|
+
live in `Evidence basis`, reached via `Proof`.
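The requirement that `Proof` reference experiment IDs from experiments.md, read together with the many-to-many relationship between claims and experiments, could be checked with a small cross-reference pass. This is a sketch under stated assumptions, not the package's validator: `block_ids` and `unresolved_proofs` are hypothetical names, and `Proof` is assumed to sit on a single line as in the template.

```python
import re


def block_ids(markdown: str, prefix: str) -> list[str]:
    """Collect IDs such as C01 or E03 from '## C01: ...' style headers."""
    return re.findall(rf"(?m)^## ({prefix}\d+)\b", markdown)


def unresolved_proofs(claims_md: str, experiments_md: str) -> dict[str, list[str]]:
    """Map claim ID -> experiment IDs cited in Proof that experiments.md lacks."""
    known = set(block_ids(experiments_md, "E"))
    problems: dict[str, list[str]] = {}
    headers = list(re.finditer(r"(?m)^## (C\d+)\b.*$", claims_md))
    for i, header in enumerate(headers):
        end = headers[i + 1].start() if i + 1 < len(headers) else len(claims_md)
        block = claims_md[header.end():end]
        proof = re.search(r"(?m)^- \*\*Proof\*\*:\s*(.*)$", block)
        cited = re.findall(r"\bE\d+\b", proof.group(1)) if proof else []
        unknown = [exp for exp in cited if exp not in known]
        if not cited:
            problems[header.group(1)] = ["<no experiment cited>"]
        elif unknown:
            problems[header.group(1)] = unknown
    return problems
```

A claim may cite any number of experiments, so the check only asks that each cited ID resolves; it does not enforce a one-to-one ledger.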
|
|
176
208
|
|
|
177
209
|
---
|
|
178
210
|
|
|
@@ -192,11 +224,15 @@ borrowed terms to reach 5 (Rule 14). One section per concept:
|
|
|
192
224
|
|
|
193
225
|
## logic/experiments.md
|
|
194
226
|
|
|
195
|
-
≥3 experiments. Declarative plans, NOT scripts. NO exact numerical results.
|
|
227
|
+
≥3 experiments. Declarative plans, NOT scripts. NO exact numerical results. Experiments and claims
|
|
228
|
+
are **many-to-many**: one experiment may verify several claims, and a claim that generalises across
|
|
229
|
+
runs lists every experiment it draws on in its `Proof` — do not force a 1:1 claim↔experiment ledger.
|
|
196
230
|
|
|
197
231
|
```markdown
|
|
198
232
|
## E{NN}: {Short title}
|
|
199
|
-
- **Verifies**: {claim IDs
|
|
233
|
+
- **Verifies**: {claim IDs this run bears on — may be several}
|
|
234
|
+
- **Evidence**: {evidence file(s) where this run's results are recorded — `evidence/…`; "pending" if not yet filed}
|
|
235
|
+
- **Run**: {what produced this result — a `src/execution/` file (or other `src/` artifact) when captured, else a link/ref into the source repo or run database; give it for EVERY experiment, including failed or ablated runs}
|
|
200
236
|
- **Setup**:
|
|
201
237
|
- Model: {model name and size}
|
|
202
238
|
- Hardware: {GPU type, count, memory}
|
|
@@ -286,12 +322,20 @@ field format:
|
|
|
286
322
|
|
|
287
323
|
## src/execution/{module}.py (when the work warrants it — grounded or absent)
|
|
288
324
|
|
|
289
|
-
|
|
290
|
-
|
|
291
|
-
|
|
292
|
-
|
|
325
|
+
Capture here is the **fallback, not the default**: transcribe code into `src/execution/` only when it
|
|
326
|
+
would otherwise be **lost** — it exists solely inside the paper, or its source is not externally
|
|
327
|
+
persisted. When the work's code/runs **persist in a linkable external store** (a repo, a run
|
|
328
|
+
database), do NOT copy them here — index them comprehensively in `src/artifacts.md` (see below). When
|
|
329
|
+
capture IS the call: actual repo code → capture real runnable files in native form (transcribed); only
|
|
330
|
+
pseudocode/equations the paper prints → a reconstructed stub of the **novel mechanism**. Either way it
|
|
293
331
|
must be grounded — never fabricated.
|
|
294
332
|
|
|
333
|
+
When the input is a run database / repo of many experiment runs, index it **comprehensively** in
|
|
334
|
+
`src/artifacts.md`: a link for **every** run and artifact (the per-run logs — e.g. a `runs.jsonl`
|
|
335
|
+
already indexes each — plus every config, candidate, log, and script), nothing aggregated into a vague
|
|
336
|
+
bucket and nothing copied. Each experiment's `Run` field points at the relevant entries. A lossy
|
|
337
|
+
subset — only the winning run, or runs collapsed into a single directory link — is the failure.
|
|
338
|
+
|
|
295
339
|
Every file declares its grounding on the first line:
|
|
296
340
|
```python
|
|
297
341
|
# Grounding: transcribed — adapted from repo code; cite file:line in docstrings
|
|
@@ -320,20 +364,23 @@ pseudo-code — that information already lives in `logic/solution/`, and re-enco
|
|
|
320
364
|
duplicates it.** A concrete artifact that IS raw "code" — e.g. a prompt or template — is different:
|
|
321
365
|
store it verbatim in `src/prompts/`, don't paraphrase it. A hollow invented API is a hallucination.
|
|
322
366
|
|
|
323
|
-
## src/artifacts.md (
|
|
367
|
+
## src/artifacts.md (the artifact index — comprehensive pointer file when the source persists externally)
|
|
324
368
|
|
|
325
|
-
`src/` must
|
|
326
|
-
|
|
327
|
-
|
|
369
|
+
`src/` must represent the implementation **losslessly**. When the work's artifacts **persist in a
|
|
370
|
+
linkable external store** (a repo, a run database, a released tool/dataset), `artifacts.md` is the
|
|
371
|
+
**comprehensive pointer index** — a link to **every** artifact (every run, config, log, script,
|
|
372
|
+
released binary, dataset), grounded in the real files, nothing aggregated into a vague bucket and
|
|
373
|
+
nothing copied. One block (or row) per artifact:
|
|
328
374
|
|
|
329
|
-
**
|
|
330
|
-
|
|
331
|
-
|
|
332
|
-
|
|
333
|
-
location. Naming a real `.py`/`.js`/… file here instead of capturing it is a coverage failure.
|
|
375
|
+
**Capture is the fallback, not the default.** Transcribe a file into `src/execution/` only when it
|
|
376
|
+
would otherwise be **lost** — code that lives solely inside the paper, or a source not externally
|
|
377
|
+
persisted. When the source persists and is linkable, point to it here; copying a lossy subset (only
|
|
378
|
+
the winner, or files collapsed into a single directory link) is the failure.
|
|
334
379
|
|
|
335
380
|
```markdown
|
|
336
381
|
## {Artifact name}
|
|
382
|
+
- **Anchor**: {stable short id — `A01`, `A02`, … — so a trace node's `code_change` can reference this artifact by id; optional, but required for the Research Visualizer's changed-code diffs}
|
|
383
|
+
- **sha256**: {content hash of the file, when a code-change diff cites it}
|
|
337
384
|
- **File(s) in repo**: {real path(s), verified to exist}
|
|
338
385
|
- **Nature**: {what it is — tool / library / skill spec / system / dataset}
|
|
339
386
|
- **What it does / contains**: {grounded description}
|
|
@@ -379,6 +426,28 @@ Reproducibility for any field. For purely analytical work, state so explicitly.
|
|
|
379
426
|
|
|
380
427
|
---
|
|
381
428
|
|
|
429
|
+
## evidence/changes/{node-id}.diff.md (Research Visualizer changed-code diffs)
|
|
430
|
+
|
|
431
|
+
Per-experiment-node **unified diff** the step represents — a derived, grounded view (the whole scripts
|
|
432
|
+
stay pointers in `src/artifacts.md`; this is NOT a copy of the artifact). One tracked file per node that
|
|
433
|
+
has a resolvable code change:
|
|
434
|
+
|
|
435
|
+
```markdown
|
|
436
|
+
# Change {node-id}: {short description}
|
|
437
|
+
- **Base**: A01 (→ src/artifacts.md anchor; path + sha256 + original location live there)
|
|
438
|
+
- **Variant**: A07
|
|
439
|
+
- **Language**: python
|
|
440
|
+
|
|
441
|
+
<unified diff fenced as ```diff … ``` — verbatim from the real scripts>
|
|
442
|
+
```
|
|
443
|
+
|
|
444
|
+
Rules: cite the two artifacts **by anchor id**, never paste their paths/sha here (those live once in
|
|
445
|
+
`src/artifacts.md`). The diff text is verbatim. If the scripts can't be resolved at compile time
|
|
446
|
+
(git-ignored store), omit this file and set the node's `code_change.note` instead (the visualizer shows a
|
|
447
|
+
pointer chip, no diff). These sidecars MUST be tracked/committed (not swept by any store `.gitignore`).
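One way to produce such a sidecar is with Python's standard `difflib`, with `hashlib` supplying the content hash recorded in `src/artifacts.md`. The sketch below is illustrative only: `sha256_of` and `render_change_sidecar` are hypothetical helpers, and using the anchor ids as the diff's file labels is an assumption made so that no path appears in the sidecar.

```python
import difflib
import hashlib


def sha256_of(data: bytes) -> str:
    """Content hash to record next to the artifact's anchor in src/artifacts.md."""
    return hashlib.sha256(data).hexdigest()


def render_change_sidecar(
    node_id: str,
    description: str,
    base_anchor: str,
    variant_anchor: str,
    base_text: str,
    variant_text: str,
    lang: str = "python",
) -> str:
    """Render the body of evidence/changes/<node-id>.diff.md as a string."""
    diff_body = "".join(
        difflib.unified_diff(
            base_text.splitlines(keepends=True),
            variant_text.splitlines(keepends=True),
            fromfile=base_anchor,  # anchors, not paths, label the two sides
            tofile=variant_anchor,
        )
    )
    if diff_body and not diff_body.endswith("\n"):
        diff_body += "\n"  # keep the closing fence on its own line
    fence = "`" * 3
    return (
        f"# Change {node_id}: {description}\n"
        f"- **Base**: {base_anchor}\n"
        f"- **Variant**: {variant_anchor}\n"
        f"- **Language**: {lang}\n"
        "\n"
        f"{fence}diff\n{diff_body}{fence}\n"
    )
```

The diff text comes straight from the two script bodies, so it stays verbatim; when the scripts cannot be read, the sketch simply should not be called and the node keeps only its `note`.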
|
|
448
|
+
|
|
449
|
+
---
|
|
450
|
+
|
|
382
451
|
## evidence/tables/{file}.md (+ screenshot)
|
|
383
452
|
|
|
384
453
|
Every numbered table gets BOTH this markdown file AND a screenshot `tableN.png` (the rendered
|
|
@@ -431,6 +500,9 @@ tree:
|
|
|
431
500
|
source_refs: ["Table 2", "§4.1"] # recommended for explicit nodes
|
|
432
501
|
title: "{...}"
|
|
433
502
|
description: "{...}"
|
|
503
|
+
# OPTIONAL enrichment (Research Visualizer; omit when absent):
|
|
504
|
+
# thinking: "{verbatim agent deliberation — why it did/branched}"
|
|
505
|
+
# code_change: { base_artifact: A01, variant_artifact: A07, lang: python, diff_file: evidence/changes/N01.diff.md }
|
|
434
506
|
```
|
|
435
507
|
|
|
436
508
|
Rules:
|
package/skills/compiler/references/validation-checklist.md
CHANGED
|
@@ -35,14 +35,15 @@ where present, they are non-trivial — there is no fixed list. Model-training f
|
|
|
35
35
|
|
|
36
36
|
### logic/claims.md
|
|
37
37
|
- Has `## C\d+` blocks (at least one claim)
|
|
38
|
-
- Contains `**Statement**`
|
|
39
|
-
- Contains `**
|
|
38
|
+
- Contains `**Statement**` (the mechanism/takeaway a result reveals — subject is a mechanism/relationship, never a named recipe; no run numbers)
|
|
39
|
+
- Contains `**Conditions**` (non-trivial: the regime + the untested boundary)
|
|
40
|
+
- Contains `**Sources**`; every load-bearing number in a claim has a `Sources` entry carrying
|
|
40
41
|
a verbatim «quote» plus an `[input]`/`[result]` tag — no bare-path entries, no memory-filled numbers
|
|
41
42
|
- Contains `**Status**`
|
|
42
|
-
- Contains `**Falsification criteria**`
|
|
43
|
+
- Contains `**Falsification criteria**` (a substantive observation — about the system, or about the benchmark's behavior for a methodological claim — not a tautology or a re-run of a metric gate)
|
|
44
|
+
- `Statement` is the mechanism/takeaway a result reveals, not a record: a single instance may state the mechanism it reveals, but must not be extrapolated into a universal law beyond its regime, nor assert a distinction the design cannot disentangle — those limits live in `Conditions`
|
|
43
45
|
- Contains `**Proof**`
|
|
44
46
|
- Contains `**Evidence basis**`
|
|
45
|
-
- Contains `**Interpretation**`
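The field checks for `logic/claims.md` can be sketched as a small script. This is an illustrative check only, assuming each claim starts at a `## C<digits>` heading and each field label appears in bold exactly as listed in the updated checklist; the function name `missing_claim_fields` is hypothetical.

```python
import re

# Field labels taken from the updated checklist above.
REQUIRED_FIELDS = [
    "Statement", "Conditions", "Sources", "Status",
    "Falsification criteria", "Proof", "Evidence basis",
]


def missing_claim_fields(claims_md):
    """Map each claim id (e.g. "C01") to the required labels its block lacks."""
    # Splitting on the heading with a capture group yields
    # [preamble, id1, body1, id2, body2, ...].
    pieces = re.split(r"^## (C\d+)\b.*$", claims_md, flags=re.MULTILINE)
    report = {}
    for claim_id, body in zip(pieces[1::2], pieces[2::2]):
        report[claim_id] = [f for f in REQUIRED_FIELDS if f"**{f}**" not in body]
    return report


sample = """## C01: Complete claim
- **Statement**: x
- **Conditions**: y
- **Sources**: []
- **Status**: supported
- **Falsification criteria**: z
- **Proof**: [E01]
- **Evidence basis**: w

## C02: Incomplete claim
- **Statement**: x
- **Status**: untested
"""
report = missing_claim_fields(sample)
print(report)
```

An empty report means no `## C\d+` block was found, which fails the "at least one claim" check. The sketch only tests for presence of the labels; the qualitative requirements (non-trivial `Conditions`, substantive falsification criteria) still need a reader.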
|
|
46
47
|
|
|
47
48
|
### logic/problem.md
|
|
48
49
|
- Has `### O\d+` blocks (observations)
|
|
@@ -52,6 +53,8 @@ where present, they are non-trivial — there is no fixed list. Model-training f
|
|
|
52
53
|
### logic/experiments.md
|
|
53
54
|
- Has `## E\d+` blocks (at least 3)
|
|
54
55
|
- Contains `**Verifies**`
|
|
56
|
+
- Contains `**Evidence**` (link to where the run's results are filed, or "pending")
|
|
57
|
+
- Contains `**Run**` (what produced the run — a `src/execution/` file or a link/ref into the source repo/DB; failed/ablated runs are linked too, not omitted)
|
|
55
58
|
- Contains `**Setup**`
|
|
56
59
|
- Contains `**Procedure**`
|
|
57
60
|
- Contains `**Expected outcome**` or `**Expected results**`
|
|
@@ -89,9 +92,10 @@ fewer passes with fewer; what fails is fabricated filler.
|
|
|
89
92
|
- `src/execution/`: ≥1 `.py` file only when the work has implementable content (repo code / paper pseudocode / named interface). NOT mandatory otherwise; omitting it (with a note in `environment.md`) beats fabricating one.
|
|
90
93
|
- `evidence/tables/`, `evidence/figures/`, or `evidence/proofs/`: contains the filed evidence (see §11)
|
|
91
94
|
|
|
92
|
-
### Implementation layer (`src/`) —
|
|
93
|
-
- Concrete artifacts
|
|
94
|
-
- **
|
|
95
|
+
### Implementation layer (`src/`) — indexed when external, captured when it'd be lost
|
|
96
|
+
- Concrete artifacts are represented losslessly: prompts/templates verbatim in `src/prompts/`, config values in `src/configs/`, and — when the work's code/runs **persist in a linkable external store** — a **comprehensive pointer index** in `src/artifacts.md` linking every artifact. A lone `environment.md` is wrong when such artifacts exist.
|
|
97
|
+
- **Comprehensiveness** (external repo/run-database input): `src/artifacts.md` links **every** run and source file (per-run logs included — a `runs.jsonl` counts), nothing aggregated into a bare directory link or a "~N others" summary. FAIL on a lossy subset (only the winning run; real artifacts collapsed into a vague bucket).
|
|
98
|
+
- **Capture only when it'd be lost**: transcribe source into `src/execution/` (native form, `# Grounding: transcribed`, cite path) only when it exists solely inside the paper or its source is not externally persisted. Pointer-only is correct when the source persists; it FAILS only when the pointer would dangle (no persisted source).
|
|
95
99
|
- Conversely, a prose-only method (no code, no prompt, no config values) is NOT re-encoded as a `.py` stub or pseudo-code — it lives in `logic/solution/`; a lone `environment.md` is correct here. FAIL on a `.py` stub manufactured from prose (it just duplicates the cognitive layer).
|
|
96
100
|
|
|
97
101
|
### Code grounding (each `src/execution/*.py`, when present)
|
|
@@ -150,11 +154,18 @@ For each file in `evidence/figures/*.md` specifically:
|
|
|
150
154
|
### Claim Proof → Experiment Resolution
|
|
151
155
|
- Every `E\d+` in a claim's `**Proof**: [...]` must exist in experiments.md
|
|
152
156
|
- Proof-linked experiments should have evidence files whose labels and row contents actually match the compared systems or measurements
|
|
153
|
-
- Claim
|
|
157
|
+
- Claim `Statement` is a generalized mechanism/relationship auditable against `Evidence basis`; its reach is bounded by `Conditions` and its run numbers live in the evidence layer (not pasted into the Statement)
|
|
154
158
|
|
|
155
159
|
### Experiment Verifies → Claim Resolution
|
|
156
160
|
- Every `C\d+` in an experiment's `**Verifies**` must exist in claims.md
|
|
157
161
|
|
|
162
|
+
### Experiment Evidence / Run → Resolution
|
|
163
|
+
- Every `evidence/…` path in an experiment's `**Evidence**` is a filed evidence file (or "pending")
|
|
164
|
+
- Every experiment carries a `**Run**` ref — an entry in the comprehensive `src/artifacts.md` index (or, in the capture-fallback case, a `src/execution/` file) that links the source location; failed/ablated runs are linked there, not dropped
|
|
165
|
+
|
|
166
|
+
### Claim Dependencies → Claim Resolution
|
|
167
|
+
- Every `C\d+` in a claim's `**Dependencies**` must exist in claims.md (an unresolved ID FAILS)
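The three id-resolution checks above (Proof to experiment, Verifies to claim, Dependencies to claim) could be approximated as follows. This is a sketch under stated assumptions: ids are declared as `## C<digits>` / `## E<digits>` headings, and each field fits on a single line. All function names are hypothetical.

```python
import re


def defined_ids(markdown, prefix):
    """Collect ids declared as "## <prefix><digits>" headings."""
    return set(re.findall(rf"^## ({prefix}\d+)\b", markdown, flags=re.MULTILINE))


def referenced_ids(markdown, field, prefix):
    """Collect ids cited on any line that carries the bold field label."""
    found = set()
    for line in markdown.splitlines():
        if f"**{field}**" in line:
            found.update(re.findall(rf"\b{prefix}\d+\b", line))
    return found


def unresolved_references(claims_md, experiments_md):
    """Return the dangling ids for each of the three resolution checks."""
    claims = defined_ids(claims_md, "C")
    experiments = defined_ids(experiments_md, "E")
    return {
        "Proof": sorted(referenced_ids(claims_md, "Proof", "E") - experiments),
        "Verifies": sorted(referenced_ids(experiments_md, "Verifies", "C") - claims),
        "Dependencies": sorted(referenced_ids(claims_md, "Dependencies", "C") - claims),
    }


claims_md = "## C01: a\n- **Proof**: [E01, E03]\n- **Dependencies**: [C02]\n"
experiments_md = "## E01: run\n- **Verifies**: [C01, C09]\n"
dangling = unresolved_references(claims_md, experiments_md)
print(dangling)
```

Any non-empty list is a failure for the corresponding check. Multi-line field values would need a real parser rather than this line-based scan.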
|
|
168
|
+
|
|
158
169
|
### Heuristic Code Ref → File Resolution (only when heuristics.md + src/execution/ are both present)
|
|
159
170
|
- Every `src/...` path in `**Code ref**: [...]` must be an existing file
|
|
160
171
|
|
|
@@ -174,6 +185,19 @@ For each file in `evidence/figures/*.md` specifically:
|
|
|
174
185
|
- No fact ABOUT a repo artifact (line count, path, internal structure) is transcribed from the paper without checking the real file — when paper and repo disagree, the discrepancy is flagged, not silently resolved to the paper's number
|
|
175
186
|
- Spot-check trace `source_refs` and evidence `**Source**` labels: the cited section/table/appendix actually contains the claimed content
|
|
176
187
|
- A statistic carries its scope/denominator (N, population) in its `Source` — subset figures (e.g. "5 papers / 3,050 reqs") are not juxtaposed with full-corpus figures as if same-denominator
|
|
188
|
+
- **Claim Statements are takeaways** (exhaustive, not spot-checked — symmetric to the number-sources
|
|
189
|
+
pass): each `## C\d+` Statement FAILS if its subject is a named recipe/config/run, or if it
|
|
190
|
+
contains a run number, n-count, score, step/bin count, or p-value. The mechanism a result reveals
|
|
191
|
+
must be the subject; the numbers must live in `Evidence basis`/`Proof`
|
|
192
|
+
- **Attribution is not insight** (exhaustive, applied to each Statement that already passed the
|
|
193
|
+
takeaways gate above): a Statement also FAILS if it only identifies *which* named components of
|
|
194
|
+
this one system rank highest/lowest (load-bearing / dominant / decorative / inert / "largest
|
|
195
|
+
contributor") without stating what that ranking *reveals* — a relationship or mechanism a reader
|
|
196
|
+
could carry to a different system. **Operational tell: delete this system's component names from
|
|
197
|
+
the Statement; if no transferable relationship survives, it is attribution, not a mechanism — it
|
|
198
|
+
FAILS.** The fix is to state the generalization the ranking licenses; the named components and
|
|
199
|
+
their deltas stay in `Evidence basis`. This is the most common way a numerically-clean Statement
|
|
200
|
+
still fails the insight bar.
|
|
177
201
|
- **Claim/heuristic number sources** (exhaustive, not spot-checked): each `**Sources**` entry's cited
|
|
178
202
|
`file:line` (or trace `node:field`) exists, the verbatim «quote» is actually present there, and the
|
|
179
203
|
number in the `Statement`/`Rationale` matches the value inside that quote; `[input]` entries cite
|
|
@@ -15,7 +15,7 @@ argument-hint: "[optional: hint about what happened this turn]"
|
|
|
15
15
|
allowed-tools: Read, Write, Edit, Glob, Grep
|
|
16
16
|
metadata:
|
|
17
17
|
author: ara-commons
|
|
18
|
-
version: "2.
|
|
18
|
+
version: "2.4.0"
|
|
19
19
|
tags: [research, process-recording, provenance, progressive-crystallization, knowledge-management]
|
|
20
20
|
---
|
|
21
21
|
|
|
@@ -184,10 +184,17 @@ entries — staged observations belong to Stage 3. (History lives in the trace;
|
|
|
184
184
|
1. **Status updates** — flip a claim's `Status` field when evidence warrants.
|
|
185
185
|
2. **Content revisions** — rewrite a `Statement`, `Rationale`, or definition when new
|
|
186
186
|
evidence narrows scope, terminology changed, or wording no longer matches what's
|
|
187
|
-
actually supported.
|
|
187
|
+
actually supported. Keep `Statement` a generalized mechanism/relationship and sharpen
|
|
188
|
+
`Conditions` as the regime becomes clearer; new run numbers update `Proof`/`evidence`,
|
|
189
|
+
never the Statement. A rewrite re-grounds every number it now contains (Number grounding);
|
|
188
190
|
any changed value gets its own fresh `Sources` «quote», never a carried-over one.
|
|
189
191
|
3. **Structural changes** — split a claim into two, merge duplicates, repair
|
|
190
|
-
dependencies, rename ids when concepts are renamed.
|
|
192
|
+
dependencies, rename ids when concepts are renamed. Also **generalize**: when several
|
|
193
|
+
crystallized claims are together evidence for a more general relationship none states
|
|
194
|
+
alone, author a new claim whose `Dependencies` are those narrower claims and whose
|
|
195
|
+
`Proof` spans their evidence — keep the narrower claims in place; the new claim sits
|
|
196
|
+
above them, not instead of them (only when a signal this turn makes the relationship
|
|
197
|
+
evident — never a routine sweep).
|
|
191
198
|
4. **Consistency pass** — scan for broken cross-references (claim cites C05 which no
|
|
192
199
|
longer exists), terminology mismatch with `concepts.md`, dependency loops.
|
|
193
200
|
|
|
@@ -257,6 +264,8 @@ When a signal fires for entry `E` (claim, heuristic, or concept):
|
|
|
257
264
|
new id for the spin-off, update all cross-references.
|
|
258
265
|
- **Merge**: keep the lower id, mark the higher id as `withdrawn` with
|
|
259
266
|
`Merged into: C{XX}`, redirect cross-references.
|
|
267
|
+
- **Generalize**: allocate a new id for the more general claim, set its `Dependencies`
|
|
268
|
+
to the narrower claims, and leave those claims in place (they remain its grounding).
|
|
260
269
|
6. **Record full before/after in today's session record** under `logic_revisions:`
|
|
261
270
|
(see schema below). This is the ONLY place the prior wording is preserved — the
|
|
262
271
|
logic file does not keep it.
|
|
@@ -361,18 +370,34 @@ tree:
|
|
|
361
370
|
### Claim (`logic/claims.md`) — crystallized only
|
|
362
371
|
|
|
363
372
|
```markdown
|
|
364
|
-
## C{XX}: {title}
|
|
365
|
-
- **Statement**: {
|
|
366
|
-
- **
|
|
373
|
+
## C{XX}: {generalized title — the takeaway, not a recipe name}
|
|
374
|
+
- **Statement**: {the generalized, mechanistic conclusion; subject = a mechanism/relationship, never a named recipe; carries NO run numbers}
|
|
375
|
+
- **Conditions**: {under what conditions it holds; the regime; the known untested boundary}
|
|
376
|
+
- **Sources**: [{one entry per load-bearing number in the claim (now in `Conditions`/`Proof`): `<value> ← <file:line | trace-node:field> «verbatim line copied from source» [input|result]`, or `<value> ← [pending: reason]`}] # see "Number grounding"; a bare path with no «quote» is invalid
|
|
367
377
|
- **Status**: hypothesis | untested | testing | supported | weakened | refuted | withdrawn
|
|
368
378
|
- **Provenance**: user | ai-suggested | user-revised
|
|
369
|
-
- **Falsification
|
|
370
|
-
- **Proof**: [{evidence refs or "pending"}]
|
|
379
|
+
- **Falsification**: {a concrete observation that would disprove it — for a mechanism claim, about the system/world; for a methodological/regime claim, about the benchmark's behavior. NOT a tautology or a re-run of the same gate ("if the recipe fails the gate")}
|
|
380
|
+
- **Proof**: [{evidence refs (→ evidence/) or "pending"; run numbers/IDs/scores live HERE, not in Statement}]
|
|
371
381
|
- **Dependencies**: [C{YY}, ...]
|
|
372
382
|
- **Tags**: {comma-separated}
|
|
373
383
|
- **Last revised**: YYYY-MM-DD (turn-id) # pointer back to the trace; absent until first revision
|
|
374
384
|
```
|
|
375
385
|
|
|
386
|
+
**The Statement is the generalized conclusion the evidence supports — a mechanism or relationship,
|
|
387
|
+
not a restatement of run numbers.** What keeps it falsifiable and honest is `Conditions` (the regime
|
|
388
|
+
it holds in + the untested boundary) plus a `Falsification`, not a narrowed sentence. Numbers (run
|
|
389
|
+
IDs, n, scores, step counts) belong in `Proof` → `evidence/` (grounded per Number grounding), never
|
|
390
|
+
in `Statement`. `Conditions` is mandatory: a generalized Statement with no Conditions is an unbounded
|
|
391
|
+
slogan.
|
|
392
|
+
|
|
393
|
+
**Calibrate the Statement to what the evidence actually separates.** Do not assert a distinction the
|
|
394
|
+
design cannot disentangle (confounded factors — e.g. matrix "shape" vs "role" when they co-vary), or
|
|
395
|
+
a law from a single instance. When that's the case, hedge in the Statement itself — name the
|
|
396
|
+
unseparated factors together, or say "shown once here" — rather than only burying it in `Conditions`.
|
|
397
|
+
`Conditions` bounds *where* the claim applies; it is not a license for the Statement's verb to
|
|
398
|
+
over-reach. The Statement/Conditions may be sharpened on a later turn (Stage 4 content revision) as
|
|
399
|
+
the mechanism becomes clearer — no new closure signal is needed.
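The rule that numbers stay out of `Statement` can be approximated mechanically. The following is a deliberately crude heuristic, assuming each Statement sits on a single `- **Statement**: ...` line; it flags any digit, so it will also flag harmless tokens such as "2D" and cannot judge whether the subject is a mechanism. The function name `statements_with_numbers` is hypothetical.

```python
import re

# Matches a single-line Statement field and captures its text.
STATEMENT_RE = re.compile(r"^- \*\*Statement\*\*:[ \t]*(.*)$", re.MULTILINE)
DIGIT_RE = re.compile(r"\d")


def statements_with_numbers(claims_md):
    """Return the Statement texts that contain at least one digit."""
    return [s for s in STATEMENT_RE.findall(claims_md) if DIGIT_RE.search(s)]


claims_md = (
    "## C01: Warmup\n"
    "- **Statement**: Warmup stabilizes early training.\n"
    "## C02: Recipe B\n"
    "- **Statement**: Recipe B scored 0.91 on run 7.\n"
)
flagged = statements_with_numbers(claims_md)
print(flagged)
```

A flagged Statement is a candidate for moving its numbers into `Proof` and the evidence layer; the final call stays with the reviewer.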
|
|
400
|
+
|
|
376
401
|
Current-state snapshot only — no prior statements, no `From staging`/`Crystallized via`
|
|
377
402
|
notes. Crystallization and every edit are recorded in the trace (`trace/sessions/…` under
|
|
378
403
|
`logic_revisions:` with before/after; source observation stays in `staging/`; reasoning in
|
|
@@ -72,7 +72,7 @@ What KIND of event is this?
|
|
|
72
72
|
→ ai-action [session record only]
|
|
73
73
|
|
|
74
74
|
Interpretation (something asserted to be true / general)?
|
|
75
|
-
|
|
75
|
+
Generalizable falsifiable assertion (a mechanism/relationship, bounded by conditions)?
|
|
76
76
|
→ STAGE as potential_type: claim
|
|
77
77
|
Implementation rule with rationale?
|
|
78
78
|
→ STAGE as potential_type: heuristic
|