@ara-commons/ara-skills 0.1.0 → 0.2.0

package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@ara-commons/ara-skills",
3
- "version": "0.1.0",
3
+ "version": "0.2.0",
4
4
  "description": "Install Agent-Native Research Artifact (ARA) skills — compiler, research-manager, rigor-reviewer — into Claude Code, Cursor, OpenCode, Gemini CLI, Codex, and more.",
5
5
  "type": "module",
6
6
  "bin": {
@@ -3,18 +3,20 @@ name: compiler
3
3
  description: |
4
4
  Universal ARA Compiler. Converts ANY research input — PDF papers, GitHub repositories,
5
5
  experiment logs, code directories, raw notes, or combinations thereof — into a complete
6
- Agent-Native Research Artifact (ARA). Produces a structured, machine-executable knowledge
7
- package with cognitive layer (claims, concepts, heuristics), physical layer (configs, code
8
- stubs), exploration graph (research DAG), and grounded evidence.
6
+ Agent-Native Research Artifact (ARA): a structured, machine-executable knowledge package with a
7
+ cognitive layer (claims, concepts, methods), an artifact layer (code/configs/data as the work
8
+ warrants), an exploration graph (research DAG), and grounded evidence. Works across any research
9
+ field — not only model-training research.
9
10
 
10
11
  TRIGGERS: compile, create ARA, generate artifact, convert paper, build artifact, compile paper,
11
- ARA from PDF, ARA from repo, ARA from code, structure research, extract knowledge
12
+ ARA from PDF, ARA from repo, ARA from code, structure research, extract knowledge,
13
+ extract figure data, digitize plot, read chart, figure to data
12
14
  argument-hint: "[any input — paths, URLs, descriptions, or nothing]"
13
15
  allowed-tools: Read, Write, Edit, Bash(python *|git clone *|ls *|mkdir *), Glob, Grep, Task
14
16
  metadata:
15
17
  author: ara-commons
16
18
  category: research-tooling
17
- version: "1.0.0"
19
+ version: "1.1.0"
18
20
  tags: [research, compilation, artifacts, knowledge-extraction]
19
21
  ---
20
22
 
@@ -26,50 +28,33 @@ validated ARA artifact. You operate as a first-class Claude Code agent — use y
26
28
 
27
29
  ## Input Philosophy
28
30
 
29
- The compiler is **open-ended**. It accepts anything that contains research knowledge — there is
30
- no fixed input schema. Your job is to figure out what you've been given and extract maximum
31
+ The compiler is **open-ended**. It accepts anything that contains research knowledge — papers,
32
+ repos, code, notebooks, logs, configs, notes, threads, a verbal description, combinations, or
33
+ nothing at all (build interactively). Figure out what you've been given and extract maximum
31
34
  structured knowledge from it.
32
35
 
33
- Possible inputs include (but are NOT limited to):
34
- - PDF papers, arXiv links
35
- - GitHub repositories (URLs or local paths)
36
- - Code files, scripts, notebooks (`.py`, `.ipynb`, `.rs`, `.cpp`, etc.)
37
- - Experiment logs, training outputs, evaluation results
38
- - Configuration files, hyperparameter sweeps
39
- - Raw research notes, brainstorm transcripts, meeting notes
40
- - Data directories with results, checkpoints, figures
41
- - Slack/email threads describing research decisions
42
- - Combinations of the above
43
- - A verbal description or conversation with the user about their research
44
- - Nothing at all — the user may want to build an ARA interactively through dialogue
45
-
46
- When arguments are provided (`$ARGUMENTS`), interpret them flexibly:
47
- - File/directory paths → read them
48
- - URLs → fetch or clone them
49
- - `--output <dir>` → where to write the ARA (default: `./ara-output/`)
50
- - `--rubric <path>` → PaperBench rubric for coverage mapping
51
- - Anything else → treat as context or ask the user for clarification
36
+ When arguments are provided (`$ARGUMENTS`), interpret them flexibly: paths → read; URLs →
37
+ fetch/clone; `--output <dir>` → where to write (default `./ara-output/`); `--rubric <path>` →
38
+ PaperBench rubric for coverage mapping; anything else → context (ask only if it genuinely blocks).
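
The argument handling described above can be sketched roughly as follows. This is an illustration only: the helper name `interpret_arguments` and the shape of the returned dictionary are hypothetical and not part of the package. Only the `--output` and `--rubric` flags and the `./ara-output/` default come from the skill text.

```python
import shlex
from pathlib import Path


def interpret_arguments(raw: str) -> dict:
    """Sort a raw $ARGUMENTS string into paths, URLs, flags, and free-form context.

    Hypothetical helper: the skill itself interprets arguments flexibly,
    so treat this as one possible reading, not the package's behavior.
    """
    result = {
        "paths": [],                # existing files/directories -> read
        "urls": [],                 # http(s) links -> fetch or clone
        "output": "./ara-output/",  # default stated in the skill text
        "rubric": None,             # optional PaperBench rubric path
        "context": [],              # anything else -> treat as context
    }
    tokens = shlex.split(raw)
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # Flags consume the token that follows them as their value.
        if tok in ("--output", "--rubric") and i + 1 < len(tokens):
            result[tok[2:]] = tokens[i + 1]
            i += 2
            continue
        if tok.startswith(("http://", "https://")):
            result["urls"].append(tok)
        elif Path(tok).exists():
            result["paths"].append(tok)
        else:
            result["context"].append(tok)
        i += 1
    return result
```

A call such as `interpret_arguments("paper.pdf --output build/")` would route `paper.pdf` to `paths` if the file exists and to `context` otherwise, which mirrors the "anything else → context" fallback.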
52
39
 
53
40
  ### Input Reading Strategy
54
41
 
55
- Adapt to whatever you receive:
56
- 1. **Identify what you have.** Glob, read, and explore the provided paths. Understand the nature
57
- of the input before committing to a generation plan.
58
- 2. **Maximize coverage.** Cross-reference all available sources. A PDF gives narrative + claims;
59
- code gives ground-truth implementation; experiment logs give the exploration trajectory;
60
- notes give decisions and dead ends that never made it to paper.
61
- 3. **Ask when stuck.** If the input is ambiguous or incomplete, ask the user to fill gaps rather
62
- than hallucinating. The user is a collaborator, not a passive consumer.
63
- 4. **Handle partial inputs gracefully.** Not every ARA field will be fillable from every input.
64
- Populate what you can with high confidence, mark gaps explicitly with "Not available from
65
- provided input", and tell the user what's missing so they can supplement later.
42
+ 1. **Identify what you have.** Glob, read, explore the inputs before committing to a plan.
43
+ 2. **Maximize coverage.** Cross-reference all sources: a PDF gives narrative + claims; code gives
44
+ ground-truth implementation; logs give the trajectory; notes give dead ends that never reached
45
+ the paper.
46
+ 3. **Decide, then flag.** Resolve ambiguity with your own judgment and proceed. Only pause to ask
47
+ the user when a choice is both genuinely undecidable from the inputs and material to the result
48
+ (see Rule 16 for the repo-vs-paper conflict case). Never hallucinate to fill a gap; mark it.
49
+ 4. **Handle partial inputs gracefully.** Populate what you can with high confidence; mark gaps with
50
+ "Not available from provided input" and tell the user what's missing.
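
One way to picture steps 1 and 4 of the strategy above is a small inventory pass that groups the provided files by kind and reports what is absent. This is a minimal sketch under stated assumptions: the extension table, the `inventory` and `gap_report` names, and the list of wanted kinds are all hypothetical. Only the gap wording "Not available from provided input" is taken from the skill text.

```python
from pathlib import Path

# Hypothetical extension-to-kind table; the skill text does not prescribe one.
KIND_BY_SUFFIX = {
    ".pdf": "paper",
    ".py": "code", ".rs": "code", ".cpp": "code",
    ".ipynb": "notebook",
    ".yaml": "config", ".yml": "config", ".json": "config",
    ".log": "log",
    ".md": "notes", ".txt": "notes",
}
GAP_MARKER = "Not available from provided input"  # wording from the skill text


def inventory(root: str) -> dict[str, list[str]]:
    """Group every file under `root` by a coarse input kind."""
    found: dict[str, list[str]] = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            kind = KIND_BY_SUFFIX.get(path.suffix.lower(), "other")
            found.setdefault(kind, []).append(str(path))
    return found


def gap_report(found, wanted=("paper", "code", "config", "log")):
    """Map each wanted kind to its file count, or to the gap marker when absent."""
    return {
        kind: (f"{len(found[kind])} file(s)" if kind in found else GAP_MARKER)
        for kind in wanted
    }
```

The report makes missing input kinds explicit instead of leaving them silent, so the user can see what to supplement later.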
66
51
 
67
52
  ## Workflow
68
53
 
69
54
  ```
70
55
  1. READ all inputs
71
56
  2. REASON through the 4-stage epistemic protocol (see below)
72
- 3. GENERATE all ARA files using Write tool
57
+ 3. GENERATE files (the mandatory core + whatever additional files the paper's content warrants)
73
58
  4. COVERAGE CHECK loop (max 3 rounds): re-read source → diff against ARA → patch gaps
74
59
  5. VALIDATE by running Seal Level 1
75
60
  6. FIX any failures, re-validate
@@ -78,178 +63,206 @@ Adapt to whatever you receive:
78
63
 
79
64
  ### Step 1: Read Inputs
80
65
 
81
- Read ALL provided inputs thoroughly before generating anything. For PDFs, read every page,
82
- **including appendices** — appendices often carry reproduction-critical content and should
83
- be treated with the same priority as main-text pages.
66
+ Read ALL inputs thoroughly before generating. For PDFs, read every page **including appendices**
67
+ (they carry reproduction-critical content). For repos, prioritize README → core code → configs →
68
+ environment.
84
69
 
85
- For repos, prioritize: README → core algorithm files → configs → environment files.
70
+ **Read figures visually, not just their captions.** Much of a paper's evidence lives in plots,
71
+ diagrams, and qualitative samples whose information cannot be recovered from surrounding text.
72
+ Render PDF pages/regions to PNG (`python` with PyMuPDF/`fitz` or `pdf2image`) and Read them as
73
+ images; read standalone image files directly. Treat reading a figure as a deliberate extraction
74
+ step — see Stage 1's visual evidence pass.
86
75
 
87
76
  ### Step 2: 4-Stage Epistemic Chain-of-Thought
88
77
 
89
- Before writing any files, reason through these 4 stages. Think carefully about each stage.
78
+ Before writing files, reason through these 4 stages.
90
79
 
91
80
  **Stage 1 — Semantic Deconstruction**
92
- Strip narrative framing. Extract the raw knowledge atoms:
93
- - Mathematical formulations and equations
94
- - Architectural specifications and component descriptions
95
- - Experimental configurations (hyperparameters, hardware, datasets, seeds)
96
- - ALL numerical results and benchmarks (exact values, never rounded)
97
- - Citation dependencies and their roles (imports, extends, bounds, refutes)
98
- - Negative results, ablation findings, rejected alternatives
99
- - Implementation tricks, convergence hacks, sensitivity observations
100
-
101
- Before moving on, perform an **evidence capture pass**:
102
- - For every source table or figure you plan to cite, first capture the original source identifier and caption exactly (`Table 2`, `Figure 4`, etc.)
103
- - Transcribe the raw table/figure content before making any claim-specific summary
104
- - If you create a filtered view for one claim, store it as a **derived subset**, not as the original table itself
105
- - Never label a subset or merged summary as `Table N` unless it reproduces the original source table faithfully
106
- - If PDF extraction is ambiguous, re-read the page with layout preserved or inspect the page manually before writing evidence files
81
+ Strip narrative framing. Extract the raw knowledge atoms: formulations/equations; architectural
82
+ or method specifications; configurations (hyperparameters, hardware, datasets, seeds); ALL
83
+ numerical results (exact, never rounded); citation dependencies and their roles; negative results
84
+ and ablation findings; implementation tricks and sensitivity observations.
85
+
86
+ Then perform the **evidence pass**, capturing every table and figure completely and in order:
87
+
88
+ - **Build an evidence ledger first.** Enumerate EVERY numbered `Table N` and `Figure N` in the
89
+ source (main text + appendices). You will file all of them, in order (1, 2, 3, …) — this is a
90
+ systematic sweep, not a sample. Do not stop early and do not skip an object because its data
91
+ appears elsewhere. If an object genuinely warrants no file (e.g. an exact duplicate), record it
92
+ in `evidence/README.md` with a reason; no silent omissions.
93
+ - **Save the screenshot AND the description.** For each table/figure, render its region to a PNG and
94
+ save it next to the markdown: `evidence/figures/figure3.png` + `evidence/figures/figure3.md`,
95
+ `evidence/tables/table2.png` + `evidence/tables/table2.md`. The markdown holds the transcription /
96
+ structured description; the PNG preserves the original visual. Keep both, never just the text.
97
+ - Capture each object's source identifier and caption exactly; transcribe raw content before any
98
+ claim-specific summary.
99
+ - A filtered view for one claim is a **derived subset** (filename `derived_`/`subset_`, state its
100
+ parent) — never label it as the original `Table N`/`Figure N`.
101
+
102
+ Then the **visual evidence pass** over every figure (data does not extract itself from pixels):
103
+ 1. **Classify**: `quantitative_plot` (line/bar/scatter/box/histogram/heatmap with numbers),
104
+ `diagram` (structure, not measurements), `qualitative_sample` (example outputs, failure cases),
105
+ or `mixed`.
106
+ 2. **Quantitative plots**: read values off the axes; record axis labels, units, and **scale**
107
+ (linear vs log — misreading a log axis corrupts every value). Use exact values when printed as
108
+ data labels or stated in text; otherwise estimate and mark approximate (`≈`). Record an
109
+ **extraction method** (`exact_from_labels` / `digitized_estimate` / `visual_description`) and a
110
+ **reading confidence**. Capture the trend even when exact points are unreadable.
111
+ 3. **Diagrams**: do NOT fabricate a data table. Write a structured visual description of components
112
+ and connections, and reflect that structure into the relevant method/solution file.
113
+ 4. **Qualitative samples**: describe what the figure demonstrates and which claim/gap it supports.
114
+ 5. If a figure is too low-resolution to read reliably, say so (`reading confidence: low`) rather
115
+ than inventing values.
116
+
117
+ For non-trivial figures (dense plots, log axes, multi-panel, anything needing render/crop), load
118
+ `${CLAUDE_SKILL_DIR}/references/figure-extraction-guide.md`.
107
119
 
108
120
  **Stage 2 — Cognitive Mapping**
109
- Map extracted atoms to `/logic/`:
121
+ Map the atoms into `/logic/`:
110
122
  - **problem.md**: observations (with numbers) → gaps → key insight → assumptions
111
- - **claims.md**: falsifiable claims with proof pointers to experiment IDs (E01, E02...), plus a separation between direct evidence basis and higher-level interpretation
112
- - **concepts.md**: ≥5 formal definitions with notation and boundary conditions
113
- - **experiments.md**: ≥3 declarative verification plans (NO exact numbers; directional only)
114
- - **solution/**: architecture (component graph), algorithm (math + pseudocode), constraints, heuristics
115
- - **related_work.md**: typed dependency graph (imports/extends/bounds/baseline/refutes)
116
-
117
- Appendix content (worked examples, prompt templates, enumerated taxonomies, annotation
118
- schemas, extended analyses, prescriptive content) should be routed into the ARA layers
119
- where it fits best, preserving the granularity the source uses. Never silently drop an
120
- appendix section.
121
-
122
- When writing claims:
123
- - Phrase the main `Statement` at the strongest level directly supported by the cited evidence
124
- - Put raw support in `Evidence basis`
125
- - Put any broader synthesis in `Interpretation`
126
- - If the evidence only shows validation metrics, do not upgrade the claim to training dynamics or optimization quality unless training-side evidence is also captured
127
-
128
- `related_work.md` should reflect the paper's full citation footprint, not only the
129
- closest predecessors. Works with a specific technical delta get full `RW` blocks; remaining
130
- citations from the paper's References list should still be captured (more briefly) so the
131
- intellectual neighborhood is preserved.
132
-
133
- **Stage 3 — Physical Stubbing**
134
- Generate `/src/`:
135
- - **configs/**: exact hyperparameter values with rationale and sensitivity
136
- - **execution/**: ≥1 Python code stub implementing the NOVEL contribution (typed signatures, no boilerplate)
137
- - **environment.md**: Python version, framework, hardware, dependencies, seeds
138
- - If repo available: use actual code to improve stub precision
139
- - If rubric provided: produce `rubric/requirements.md` mapping every leaf node
123
+ - **claims.md**: falsifiable claims with proof pointers to experiment IDs (E01, E02). Phrase each
124
+ `Statement` at the strongest level the cited evidence directly supports; keep raw support in
125
+ `Evidence basis` and broader synthesis in `Interpretation`. Don't upgrade a validation-metric
126
+ result into a claim about training dynamics without training-side evidence.
127
+ - **concepts.md**: the paper's genuine technical terms, formally defined
128
+ - **experiments.md**: declarative verification/analysis plans (NO exact numbers — directional
129
+ only). "Experiment" generalizes to the field's way of testing a claim: an eval run, a statistical
130
+ test, a proof obligation, a user study.
131
+ - **solution/**: the method layer. `constraints.md` (limitations/assumptions) is always present;
132
+ beyond it, create the files the paper's content actually calls for (architecture, algorithm,
133
+ method, study design, formalization, proofs, heuristics — whatever fits the work). You decide
134
+ which; do not force a fixed template.
135
+ - **related_work.md**: typed dependency graph (imports/extends/bounds/baseline/refutes). Reflect
136
+ the paper's full citation footprint — full `RW` blocks for works with a specific technical delta,
137
+ briefer entries for the rest.
138
+
139
+ Route appendix content (worked examples, prompt templates, taxonomies, extended analyses) into
140
+ whichever layer fits best, preserving the source's granularity. Never silently drop a section.
141
+
142
+ **Stage 3 — Artifact Layer (`src/`)**
143
+ `src/` holds the work's **concrete implementation artifacts** — whatever exists in a raw, runnable,
144
+ or released form, *distinct from the prose that describes it*. `src/environment.md` is always
145
+ required (reproducibility). Beyond it, one rule decides everything:
146
+
147
+ > **Capture every concrete artifact the source actually contains, in its native form; never
148
+ > re-encode a prose-only description as code.**
149
+
150
+ A concrete artifact is real content the cognitive layer doesn't already hold — capture it (grounded
151
+ in the real repo/files when provided), in whatever directory fits. But a method conveyed only in
152
+ natural language already lives in `logic/solution/`; manufacturing a stub or pseudo-code from it just
153
+ duplicates it. Capture what exists, no more, no less — so a lone `environment.md` is correct when the
154
+ work has no concrete artifact, and wrong when it does. (If a rubric was provided, also produce
155
+ `rubric/requirements.md`.)
156
+
157
+ **Code grounding.** When you include `src/execution/*.py`, tag it `# Grounding: transcribed` (repo
158
+ code, cite `file:line`) or `reconstructed` (printed pseudocode/equations, cite §/eq). Never invent
159
+ API names, bodies, constants, or hyperparameters; no concrete code → no stub.
160
+
161
+ Never invent function bodies, constants, hyperparameters, or API names. No real code and no printed
162
+ pseudocode/equations → no stub (the prose method belongs in `logic/`, not re-encoded here).
140
163
 
141
164
  **Stage 4 — Exploration Graph Extraction**
142
- Reconstruct the research DAG for `/trace/exploration_tree.yaml`:
143
- - Root nodes = central research questions
144
- - Experiments and decisions nest as children
145
- - Dead ends from ablations/rejected alternatives = typed leaf nodes
146
- - ≥8 nodes, must include dead_end and decision types
147
- - Use `also_depends_on` for DAG convergence points
148
- - Every node must declare whether it is `explicit` from source material or `inferred` from reconstruction
149
- - Explicit nodes should carry source references (table/figure/section labels)
150
- - Inferred nodes are allowed only when they help reconstruct the paper's logic without pretending to be literal session logs
165
+ Reconstruct the research DAG for `/trace/exploration_tree.yaml`: root nodes = central questions;
166
+ experiments and decisions nest as children; dead ends from ablations/rejected alternatives = typed
167
+ leaf nodes; `also_depends_on` for convergence points. Every node declares `support_level: explicit`
168
+ (from source, with source refs) or `inferred` (reconstructed). Capture every dead_end and decision
169
+ the source actually reveals, but the node count and types are **source-bounded, not quotas**:
170
+ never invent a dead end, decision, or experiment to hit a number. A paper that hides its failures
171
+ yields a smaller, honest tree (Rule 9 wins).
151
172
 
152
173
  ### Step 3: Generate Files
153
174
 
154
- Write ALL mandatory files. See `${CLAUDE_SKILL_DIR}/references/ara-schema.md` for the complete
155
- directory structure and field-level requirements for every file.
156
-
157
- **Mandatory files** (all must exist and be non-trivial):
158
- - `PAPER.md` — YAML frontmatter (title, authors, year, venue, doi, ara_version, domain, keywords, claims_summary, abstract) + Layer Index
159
- - `logic/problem.md` — Observations (O1, O2...), Gaps (G1, G2...), Key Insight, Assumptions
160
- - `logic/claims.md` — Claims (C01, C02...) each with Statement, Status, Falsification criteria, Proof, Evidence basis, Interpretation, Dependencies, Tags
161
- - `logic/concepts.md` — ≥5 concepts each with Notation, Definition, Boundary conditions, Related concepts
162
- - `logic/experiments.md` — ≥3 experiments (E01, E02...) each with Verifies, Setup, Procedure, Metrics, Expected outcome (directional only!), Baselines, Dependencies
163
- - `logic/solution/architecture.md` — Component graph with inputs/outputs
164
- - `logic/solution/algorithm.md` — Math formulation + pseudocode + complexity
165
- - `logic/solution/constraints.md` — Boundary conditions and limitations
166
- - `logic/solution/heuristics.md` — Heuristics (H01, H02...) each with Rationale, Sensitivity, Bounds, Code ref, Source
167
- - `logic/related_work.md` — Related work (RW01, RW02...) each with DOI, Type, Delta, Claims affected
168
- - `src/configs/training.md` — Hyperparameters with Value, Rationale, Search range, Sensitivity, Source
169
- - `src/configs/model.md` — Model/architecture configs
170
- - `src/execution/{module}.py` — ≥1 code stub with typed signatures
171
- - `src/environment.md` — Python version, framework, hardware, dependencies, seeds
172
- - `trace/exploration_tree.yaml` — Research DAG (≥8 nodes, nested YAML)
173
- - `evidence/README.md` — Index table mapping every evidence file to claims
174
- - `evidence/tables/*.md` — ALL result tables (exact cell values, never rounded)
175
- - `evidence/figures/*.md` — ALL quantitative figures (extracted data points)
176
-
177
- Evidence-generation rules:
178
- - Preserve **raw source tables** separately from any **derived subset** views
179
- - A file named after a source object (for example `table3_...`) must match that source object's caption and contents
180
- - If only a subset is included, the filename must say `derived_`, `subset_`, or equivalent, and the file must state what it was derived from
181
- - Do not merge rows from different source tables into one evidence file unless the file is explicitly labeled as a derived comparison
175
+ Write the mandatory core, then the additional files the paper warrants. See
176
+ `${CLAUDE_SKILL_DIR}/references/ara-schema.md` for field-level format.
177
+
178
+ **Mandatory core** (every ARA, must exist and be non-trivial):
179
+ - `PAPER.md` — frontmatter (title, authors, year, venue, doi, ara_version, domain, keywords,
180
+ claims_summary, abstract) + Layer Index
181
+ - `logic/problem.md`, `logic/claims.md`, `logic/concepts.md`, `logic/experiments.md`,
182
+ `logic/related_work.md`, `logic/solution/constraints.md`
183
+ - `src/environment.md`
184
+ - `trace/exploration_tree.yaml`
185
+ - `evidence/README.md` + an evidence file (markdown **and** screenshot) for **every** numbered
186
+ table and figure in the source (`evidence/tables/`, `evidence/figures/`; `evidence/proofs/` for
187
+ derivations)
188
+
189
+ **Additional files: your judgment, not a fixed list.** Create whatever the paper's content calls
190
+ for in `logic/solution/` (method/architecture/algorithm/study-design/formalization/proofs/
191
+ heuristics…) and `src/`/`data/` (configs/code/data/prompts…). There is no domain template to fill —
192
+ generate the files that genuinely represent THIS work, and nothing it doesn't have. Don't force
193
+ model-training files onto an evaluation, data-science, or theory paper.
194
+
195
+ Evidence rules: keep raw source tables separate from derived subsets; a file named after a source
196
+ object must faithfully match it; don't merge rows from different source tables under one original
197
+ table number.
182
198
 
183
199
  ### Step 4: Coverage Check Loop (max 3 rounds)
184
200
 
185
- Before running Seal validation, verify that the ARA faithfully covers the source material.
186
- Repeat up to **3 rounds**; stop early if a round produces no patches.
187
-
188
- **Each round:** re-read the source, identify anything not yet captured or only shallowly
189
- captured in the ARA, patch those gaps, then note how many fixes were made. If zero, exit
190
- early. Pay particular attention to appendix content and to citations from the paper's
191
- References list, which are easy to miss on the first pass.
192
-
193
- The coverage loop does not replace validation — it ensures the ARA is semantically complete
194
- before structural checks run.
201
+ Re-read the source, find anything not yet captured or only shallowly captured, patch it, count the
202
+ fixes; exit early when a round yields zero. Watch for: appendix content; citations from the
203
+ References list; figures whose information is only visual; and **every distinct contribution /
204
+ motivating argument thread**: a paper often makes a conceptual argument carrying no number that is
205
+ easy to drop. The coverage loop ensures semantic completeness before structural checks.
195
206
 
196
207
  ### Step 5: Validate
197
208
 
198
- Run ARA Seal Level 1 validation. Perform these checks:
199
- - All mandatory dirs exist: `logic/`, `logic/solution/`, `src/`, `src/configs/`, `trace/`, `evidence/`
200
- - All mandatory files exist and are non-empty
201
- - PAPER.md has YAML frontmatter with title, authors, year
202
- - PAPER.md has Layer Index section
203
- - claims.md has C01+ blocks with Statement, Status, Falsification criteria, Proof fields
204
- - experiments.md has E01+ blocks with Verifies, Setup, Procedure, Expected outcome fields
205
- - heuristics.md has H01+ blocks with Rationale, Sensitivity, Bounds fields
206
- - concepts.md has ≥5 concept sections
207
- - experiments.md has ≥3 experiment plans
208
- - exploration_tree.yaml parses as valid YAML with ≥8 nodes, has dead_end and decision types
209
- - Claim Proof references (E01, E02...) resolve to experiments.md
210
- - Experiment Verifies references (C01, C02...) resolve to claims.md
211
- - Heuristic Code ref paths resolve to actual files in src/execution/
212
- - Evidence files contain Markdown tables with **Source** fields
213
- - Evidence file names, source labels, and captions agree on the original table/figure identifier
214
- - Any file named like a raw source table is a faithful transcription rather than a filtered subset
215
- - Claims only cite experiments whose evidence actually contains the compared rows or measurements
216
- - Claim wording does not outrun the evidence type (for example, validation tables alone should not be used to claim training-dynamics improvements)
217
- - Trace nodes declare `support_level: explicit|inferred`
218
- - Trace nodes with `support_level: explicit` include source references
209
+ Run ARA Seal Level 1. Check:
210
+ - Mandatory-core dirs exist (`logic/`, `logic/solution/`, `src/`, `trace/`, `evidence/`) and all
211
+ mandatory-core files exist and are non-empty
212
+ - PAPER.md has valid frontmatter (title, authors, year) + a Layer Index
213
+ - claims.md has C01+ blocks with Statement, Status, Falsification criteria, Proof
214
+ - experiments.md has E01+ blocks with Verifies, Setup, Procedure, Expected outcome (no exact numbers)
215
+ - concepts.md, related_work.md, constraints.md non-trivial; any heuristics blocks have Rationale,
216
+ Sensitivity, Bounds
217
+ - exploration_tree.yaml parses; nodes declare `support_level`; explicit nodes carry source refs;
218
+ no invented dead_end/decision/experiment nodes
219
+ - Cross-layer bindings resolve: claim `Proof` → experiments.md; experiment `Verifies` → claims.md;
220
+ heuristic `Code ref` → a real `src/execution/` file (when both exist); tree `evidence:` → claim IDs
221
+ - Evidence: **every numbered table and figure is filed with BOTH a markdown file and a screenshot
222
+ (.png)**; numbered objects not filed are accounted for in `evidence/README.md` with a reason
223
+ - Evidence files have **Source** fields; figures declare Figure type / Extraction method / Reading
224
+ confidence; estimated readings marked `≈` (not `exact_from_labels`); diagrams/qualitative samples
225
+ carry a visual description, not a fabricated table
226
+ - Code stubs carry a `# Grounding:` tag and invent nothing; absent when the source is prose-only
227
+ - **Cited locations verified** (Rule 16): every repo path/`file:line` exists and is in range;
228
+ spot-check that trace `source_refs` and evidence `Source` actually contain the cited content; no
229
+ repo fact transcribed from the paper without checking the real file
230
+ - **Self-consistency**: ARA-authored derived numbers recompute; PAPER.md declared counts match the
231
+ files; tree `evidence:` refs are claim IDs (C##), not observation IDs
219
232
 
220
233
  ### Step 6: Fix & Iterate
221
234
 
222
- For each validation failure:
223
- 1. Read the failing file
224
- 2. Apply targeted edits (prefer Edit over full rewrite to preserve correct content)
225
- 3. Re-validate after all fixes
226
-
227
- Typically converges in 2-3 rounds.
235
+ For each failure: read the file, apply targeted edits (prefer Edit over rewrite), re-validate.
236
+ Typically converges in 2–3 rounds.
228
237
 
229
238
  ### Step 7: Report
230
239
 
231
- Print a summary:
232
- - Artifact location
233
- - File count and total size
234
- - Validation result (pass/fail with details)
235
- - Key statistics: number of claims, experiments, heuristics, concepts, tree nodes, evidence files
240
+ Print: artifact location; file count and total size; validation result (pass/fail with details);
241
+ key stats (claims, experiments, concepts, tree nodes, evidence tables/figures).
236
242
 
237
243
  ## Critical Rules
238
244
 
239
- 1. **Exact numbers**: All numerical values copied EXACTLY from source — never round or approximate
240
- 2. **No hallucination**: Never invent claims, results, or heuristics not in the source material
241
- 3. **Experiments have NO exact numbers**: `experiments.md` contains only directional/relative expected outcomes. Exact numbers go in `evidence/`
242
- 4. **Every claim has proof**: Proof field references experiment IDs (E01, E02), not file paths
245
+ 1. **Exact numbers**: all values copied EXACTLY from source — never round or approximate
246
+ 2. **No hallucination**: never invent claims, results, or heuristics not in the source
247
+ 3. **Experiments have NO exact numbers**: `experiments.md` is directional only; exact numbers live in `evidence/`
248
+ 4. **Every claim has proof**: `Proof` references experiment IDs (E01, E02), not file paths
243
249
  5. **Cross-layer binding**: Claims ↔ Experiments ↔ Evidence ↔ Code refs must all resolve
244
- 6. **Dead ends matter**: Include failed approaches, rejected alternatives, ablation findings
245
- 7. **"Not specified"**: If information is genuinely unavailable, write "Not specified in paper" — never guess
246
- 8. **No fake source labels**: Never call a derived subset `Table N` or `Figure N` unless it faithfully reproduces the original source object
247
- 9. **No synthetic trace history**: Do not invent decisions, dead ends, or experiments that are not explicit in the provided inputs; if a trajectory is inferred, mark it as inferred or omit it
248
- 10. **Evidence-limited wording**: Do not use stronger language than the evidence supports; separate direct observations from interpretation
250
+ 6. **Dead ends matter**: include failed approaches, rejected alternatives, ablation findings
251
+ 7. **"Not specified"**: if information is genuinely unavailable, write "Not specified in paper" — never guess
252
+ 8. **No fake source labels**: never call a derived subset `Table N`/`Figure N` unless it faithfully reproduces the original
253
+ 9. **No synthetic trace history**: don't invent decisions, dead ends, or experiments not explicit in the inputs; mark inferred trajectories as inferred or omit them
254
+ 10. **Evidence-limited wording**: don't use stronger language than the evidence supports; separate observation from interpretation
255
+ 11. **Visual extraction is honest extraction**: read figures by looking; mark estimates `≈` with extraction method + confidence; never present a digitized estimate as exact, invent points for an unreadable figure, or turn a diagram into a fake data table
256
+ 12. **Complete, ordered evidence**: file EVERY numbered table and figure, in order — a systematic sweep, not a lucky sample — each as a markdown transcription PLUS a saved screenshot (`.png`). No early stopping; account for any object you don't file
257
+ 13. **Fit the file set to the paper, not the paper to a template**: only PAPER.md + the mandatory core are required. Beyond them, generate the files THIS work actually warrants and nothing it doesn't have. Never force inappropriate files (e.g. model-training configs onto an eval or theory paper)
258
+ 14. **`src/` holds concrete artifacts, not re-encoded prose**: capture every concrete artifact the source actually contains, in its native form, grounded in real files. Two sides: (a) never fabricate a code stub from a prose-only method — it already lives in `logic/`, so a `.py` just duplicates it; (b) never drop a concrete artifact that does exist — a lone `environment.md` is wrong when the work has one
259
+ 15. **Source-bounded minimums**: any count or required field is a target, never a license to invent. If the source supports fewer, produce what is real and note the shortfall; for an unstated field write "Not specified in paper" rather than guessing
260
+ 16. **Cite by verification, and ask on conflict**: a source reference (evidence `Source`, trace `source_refs`, claim `Proof`, a repo `file:line`/path) promises the cited location actually contains the claim — open it and confirm. Never transcribe a *description* of an artifact as a verified fact about it. **When the code repo and the paper disagree on a fact (line count, path, value, behavior), do NOT pick one silently — surface the conflict to the user and ask which source to follow.** If unverifiable and the user is unavailable, attribute it ("per §X") or omit. Carry a statistic's scope/denominator in its `Source`
249
261
 
250
262
  ## Reference Files
251
263
 
252
- For detailed schema specifications, load these on demand:
253
- - `${CLAUDE_SKILL_DIR}/references/ara-schema.md` — Complete ARA directory schema with field-level format for every file
254
- - `${CLAUDE_SKILL_DIR}/references/exploration-tree-spec.md` — Detailed exploration tree YAML specification with examples
255
- - `${CLAUDE_SKILL_DIR}/references/validation-checklist.md` — All Seal Level 1 checks (what the validator looks for)
264
+ Load on demand:
265
+ - `${CLAUDE_SKILL_DIR}/references/ara-schema.md` — field-level format for every file
266
+ - `${CLAUDE_SKILL_DIR}/references/exploration-tree-spec.md` — exploration tree YAML spec
267
+ - `${CLAUDE_SKILL_DIR}/references/validation-checklist.md` — all Seal Level 1 checks
268
+ - `${CLAUDE_SKILL_DIR}/references/figure-extraction-guide.md` — reading plots/diagrams/samples + PyMuPDF render/crop recipes; load when an input has figures whose information is only visual