@ara-commons/ara-skills 0.1.0 → 0.2.0

@@ -2,40 +2,43 @@
2
2
 
3
3
  ## Directory Structure
4
4
 
5
+ `✓` = mandatory core (always present). Everything else is created **only when the paper's content
6
+ warrants it** — there is no domain template to fill; you decide which method/artifact files
7
+ genuinely represent the work. The layout below is illustrative, not prescriptive.
8
+
5
9
  ```
6
- PAPER.md # Level 1: Root manifest + layer index
10
+ PAPER.md # Root manifest + layer index
7
11
  logic/
8
- problem.md # Why: observations → gaps → key insight
9
- claims.md # Falsifiable assertions
10
- concepts.md # All key technical terms (one ## per term)
11
- experiments.md # Declarative experiment plans (NOT scripts)
12
+ problem.md # Why: observations → gaps → key insight
13
+ claims.md # Falsifiable assertions
14
+ concepts.md # Key technical terms (one ## per term)
15
+ experiments.md # Declarative verification/analysis plans (NOT scripts)
12
16
  solution/
13
- architecture.md # System design + component graph
14
- algorithm.md # Math formulation + pseudocode
15
- constraints.md # Boundary conditions + limitations
16
- heuristics.md # Convergence tricks + rationale
17
- related_work.md # Typed dependency graph (RDO)
17
+ constraints.md # Boundary conditions + assumptions + limitations
18
+ <method files> # as warranted: architecture / algorithm / method /
19
+ # study_design / formalization / results / proofs /
20
+ # design / heuristics — whatever fits THIS work
21
+ related_work.md # Typed dependency graph (RDO)
18
22
  src/
19
- configs/
20
- training.md # Training hyperparameters with rationale
21
- model.md # Architecture/model configs
22
- execution/
23
- {module}.py # Minimal code stubs (core algorithm only)
24
- environment.md # Dependencies, hardware, seeds
23
+ environment.md # ✓ Data/software/hardware/protocols/seeds
24
+ configs/ # as warranted: hyperparameters / inference / deployment
25
+ execution/{module}.py # as warranted: grounded code stub (or absent — see below)
26
+ prompts/, ... # as warranted: prompt templates, etc.
27
+ data/ # as warranted: dataset.md + preprocessing.md
25
28
  trace/
26
- exploration_tree.yaml # Research DAG: nested YAML tree with typed nodes
29
+ exploration_tree.yaml # Research DAG: nested YAML tree with typed nodes
27
30
  evidence/
28
- README.md # Index mapping every evidence file to claims
29
- tables/ # Raw result tables (exact cell values)
30
- figures/ # Raw figure data (extracted data points)
31
- rubric/ # (Only if rubric provided)
32
- requirements.md # Leaf-level rubric requirements mapped to ARA files
31
+ README.md # Index mapping every evidence file to claims
32
+ tables/ # every numbered Table: tableN.md + tableN.png
33
+ figures/ # every numbered Figure: figureN.md + figureN.png
34
+ proofs/ # as warranted: derivations / proofs
35
+ rubric/requirements.md # (Only if a rubric is provided)
33
36
  ```
34
37
 
35
- Additional files or subdirectories may be created on demand when the source contains
36
- content that does not fit the standard layers (for example, appendix-sourced worked
37
- examples, prompt templates, or enumerated taxonomies). Place such content in the ARA
38
- layer where it best belongs.
38
+ Every numbered table and figure in the source gets BOTH a markdown file and a screenshot `.png`
39
+ (see the evidence specs below). Additional files/subdirectories may be created on demand for
40
+ content that doesn't fit the standard layers (appendix worked examples, prompt templates,
41
+ taxonomies) — place such content where it best belongs.
39
42
 
40
43
  ## Progressive Disclosure (3 Levels)
41
44
 
@@ -56,17 +59,15 @@ year: {year}
56
59
  venue: "{venue}"
57
60
  doi: "{DOI or arXiv ID}"
58
61
  ara_version: "1.0"
59
- domain: "{research domain}"
62
+ domain: "{research domain — free text}"
60
63
  keywords: [{5-10 keywords}]
61
64
  claims_summary:
62
- - "{one-line summary of main claim 1}"
63
- - "{one-line summary of main claim 2}"
64
- - "{one-line summary of main claim 3}"
65
+ - "{one-line summary of each main claim}"
65
66
  abstract: "{paper abstract}"
66
67
  ---
67
68
  ```
68
69
 
69
- Body MUST include a Layer Index — a table for each layer listing every file:
70
+ Body MUST include a Layer Index — a table for each layer listing every file actually generated:
70
71
 
71
72
  ```markdown
72
73
  # {Paper Title}
@@ -177,12 +178,13 @@ Each proofed experiment should in turn be backed by evidence files whose rows or
177
178
 
178
179
  ## logic/concepts.md
179
180
 
180
- ≥5 concepts. One section per concept:
181
+ Target ≥5 concepts, but capture the paper's *genuine* technical terms — don't pad with trivial or
182
+ borrowed terms to reach 5 (Rule 14). One section per concept:
181
183
  ```markdown
182
184
  ## {Term Name}
183
- - **Notation**: {LaTeX or symbolic notation}
185
+ - **Notation**: {LaTeX or symbolic notation, or "—" if none}
184
186
  - **Definition**: {Formal definition}
185
- - **Boundary conditions**: {When does this concept apply/not apply}
187
+ - **Boundary conditions**: {When it applies/not — or "Not specified in paper"}
186
188
  - **Related concepts**: {other concept names}
187
189
  ```
188
190
 
@@ -220,9 +222,9 @@ Component graph. For each component: name, purpose, inputs, outputs, interaction
220
222
  ## logic/solution/algorithm.md
221
223
 
222
224
  - Mathematical formulation (LaTeX)
223
- - Pseudocode
225
+ - Pseudocode (reconstruct only from the paper's stated algorithm; don't invent steps the paper omits)
224
226
  - Step-by-step explanation
225
- - Complexity analysis
227
+ - Complexity analysis — only if the paper states or clearly implies it; else "Not specified in paper"
226
228
 
227
229
  ## logic/solution/constraints.md
228
230
 
@@ -232,13 +234,15 @@ Component graph. For each component: name, purpose, inputs, outputs, interaction
232
234
 
233
235
  ## logic/solution/heuristics.md
234
236
 
235
- Each heuristic MUST have ALL fields:
237
+ Include only heuristics the paper actually states (implementation tricks, convergence hacks,
238
+ practical gotchas). If the paper presents none, `heuristics.md` may be empty/omitted — do not invent
239
+ tricks. Each heuristic present uses these fields; values come from the paper, else "Not specified in paper":
236
240
  ```markdown
237
241
  ## H{NN}: {Short description}
238
242
  - **Rationale**: {Why this trick is needed}
239
- - **Sensitivity**: {low|medium|high}
240
- - **Bounds**: {acceptable range or limits}
241
- - **Code ref**: [{path to src/execution/ file}]
243
+ - **Sensitivity**: {low|medium|high — or "Not specified in paper"}
244
+ - **Bounds**: {acceptable range or limits — or "Not specified in paper"}
245
+ - **Code ref**: [{path to src/execution/ file, or "Not specified in paper"}]
242
246
  - **Source**: {Section/table in the paper}
243
247
  ```
244
248
 
@@ -264,51 +268,111 @@ the paper's full citation footprint.
264
268
 
265
269
  ---
266
270
 
267
- ## src/configs/training.md
271
+ ## src/configs/{config}.md (when the work warrants it)
272
+
273
+ Name configs for what the work actually has — e.g. `training.md`/`model.md` for a trained model,
274
+ `inference.md` for an eval/prompting method, `deployment.md` for a system. Don't create
275
+ model-training configs for work that trained no model. All config files share one per-parameter
276
+ field format:
268
277
 
269
278
  ```markdown
270
279
  ## {Parameter name}
271
280
  - **Value**: {exact value}
272
- - **Rationale**: {why this value}
281
+ - **Rationale**: {why this value, or "Not specified in paper"}
273
282
  - **Search range**: {if mentioned}
274
- - **Sensitivity**: {low|medium|high}
283
+ - **Sensitivity**: {low|medium|high — or "Not specified in paper"}
275
284
  - **Source**: {section/table}
276
285
  ```
277
286
 
278
- ## src/configs/model.md
287
+ ## src/execution/{module}.py (when the work warrants it — grounded or absent)
279
288
 
280
- Same format as training.md for model/architecture configs.
289
+ Present only when the source provides **concrete code-shaped content**: actual repo code, or
290
+ explicit pseudocode/equations the paper prints. The stub captures the **novel mechanism** and must
291
+ be grounded — never fabricated.
281
292
 
282
- ## src/execution/{module}.py
283
-
284
- - Typed function signatures (input/output types, tensor shapes)
285
- - Docstrings explaining what each function does
286
- - Implementation logic for the NOVEL contribution
293
+ Every file declares its grounding on the first line:
294
+ ```python
295
+ # Grounding: transcribed — adapted from repo code; cite file:line in docstrings
296
+ # Grounding: reconstructed from explicit paper pseudocode/equations; cite §/eq
297
+ ```
298
+ Contents:
299
+ - Typed function signatures using ONLY names/types the source states
300
+ - Docstrings that cite the source (`§4.2`, `Eq. 3`, `repo: model.py:88`) — not paraphrases of this skill
301
+ - Implementation logic ONLY where the source provides it; everything unspecified stays
302
+ `raise NotImplementedError("Not specified in paper")` — never plausible filler
287
303
  - NO scaffolding (no argparse, logging, distributed wrappers)
288
- - Import only standard libraries + torch/numpy
304
+ - Import only standard libraries + the field's core stack (torch/numpy, pandas/statsmodels, etc.)
305
+
306
+ Hard rule: do not invent API names, function bodies, constants, or hyperparameters. **If the paper
307
+ describes the method only in prose (no code, no printed pseudocode), do NOT write a `.py` stub or
308
+ pseudo-code — that information already lives in `logic/solution/`, and re-encoding it as code merely
309
+ duplicates it.** A concrete artifact that IS raw "code" — e.g. a prompt or template — is different:
310
+ store it verbatim in `src/prompts/`, don't paraphrase it. A hollow invented API is a hallucination.
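To make the shape of a grounded stub concrete, here is a minimal sketch for a hypothetical paper. The function names, the section and equation numbers, and the formula are invented for illustration only; they are assumptions, not content from any real source.

```python
# Grounding: reconstructed from explicit paper pseudocode/equations; cite §/eq
# NOTE: hypothetical example. The "paper", its Eq. 3, §4.2 and §4.3 are
# invented purely to show the stub layout.
from typing import Sequence


def weighted_score(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of values (hypothetical Eq. 3, §4.2).

    The body is written out because the hypothetical source prints the equation.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights))


def select_weights(values: Sequence[float]) -> Sequence[float]:
    """Weight selection (hypothetical §4.3, described in prose only)."""
    # The hypothetical source gives no formula here, so nothing is filled in.
    raise NotImplementedError("Not specified in paper")
```

The second function shows the intended behaviour for anything the source leaves open: the signature is kept, the body is not guessed.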
311
+
312
+ ## src/artifacts.md (when the implementation is not a `.py` stub)
313
+
314
+ `src/` must still represent the implementation. When the deliverable is a released tool, library,
315
+ skill/specification, system, benchmark, or dataset rather than a code stub, describe the **real**
316
+ artifacts here — grounded in the actual repo/files when a repo is provided. One block per artifact:
317
+
318
+ ```markdown
319
+ ## {Artifact name}
320
+ - **File(s) in repo**: {real path(s), verified to exist}
321
+ - **Nature**: {what it is — tool / library / skill spec / system / dataset}
322
+ - **What it does / contains**: {grounded description}
323
+ - **How to use / run**: {entry point, command, or interface}
324
+ - **Claims supported**: {C## ids}
325
+ ```
326
+
327
+ Do not leave `src/` at just `environment.md` when the work clearly has an implementation (code,
328
+ configs, prompts, a released tool). Capture configs in `src/configs/`, prompts in `src/prompts/`,
329
+ and the rest here.
330
+
331
+ ## data/ (when the work is data-driven)
289
332
 
290
- ## src/environment.md
333
+ - `data/dataset.md` — provenance, source, size, licensing, consent/IRB/ethics, variables
334
+ - `data/preprocessing.md` — cleaning, normalization, QC, feature construction
335
+
336
+ ## src/environment.md (mandatory core)
337
+
338
+ Records what is needed to reproduce the work, in any field. For purely analytical work, state that explicitly.
291
339
 
292
340
  ```markdown
293
341
  # Environment
294
- - **Python**: {version}
295
- - **Framework**: {PyTorch version, etc.}
296
- - **Hardware**: {GPU type, count, memory}
342
+ - **Language/runtime**: {Python version, R version, proof assistant, or "analytical — none"}
343
+ - **Framework**: {PyTorch/pandas/statsmodels/... version, etc.}
344
+ - **Hardware**: {GPU/CPU type, count, memory — or "n/a"}
345
+ - **Data sources**: {datasets/cohorts with access info — for data-driven work}
297
346
  - **Key dependencies**: {list with versions}
347
+ - **Protocols**: {analysis protocol / preregistration / pipeline, if any}
298
348
  - **Random seeds**: {if specified}
299
349
  ```
300
350
 
351
+ ## evidence/proofs/{name}.md (for theory/derivation work)
352
+
353
+ ```markdown
354
+ # {Theorem/Lemma N}: {short title}
355
+ - **Source**: {Theorem N, Section X.Y}
356
+ - **Statement**: {formal statement}
357
+ - **Assumptions used**: {which assumptions from constraints.md}
358
+
359
+ ## Proof
360
+ {proof sketch or full derivation}
361
+ ```
362
+
301
363
  ---
302
364
 
303
- ## evidence/tables/{file}.md
365
+ ## evidence/tables/{file}.md (+ screenshot)
304
366
 
305
- Raw source-table transcription:
367
+ Every numbered table gets BOTH this markdown file AND a screenshot `tableN.png` (the rendered
368
+ region of the source) saved beside it. Raw source-table transcription:
306
369
 
307
370
  ```markdown
308
371
  # Table {N} - {Caption or short description}
309
372
 
310
373
  **Source**: Table {N} in {paper/report title}
311
374
  **Caption**: {verbatim or near-verbatim caption}
375
+ **Screenshot**: tableN.png
312
376
  **Extraction type**: raw_table
313
377
 
314
378
  | ... | ... |
@@ -389,21 +453,62 @@ ALL result tables, exact cell values:
389
453
  | exact | values | ... |
390
454
  ```
391
455
 
392
- ## evidence/figures/{name}.md
456
+ ## evidence/figures/{name}.md (+ screenshot)
457
+
458
+ ALL figures, read visually. Every numbered figure gets BOTH this markdown file AND a screenshot
459
+ `figureN.png` (the rendered region) saved beside it. Each file declares its type, extraction
460
+ method, and reading confidence so downstream layers know how trustworthy the contents are.
393
461
 
394
- ALL quantitative figures (not diagrams). Extract data points:
462
+ Shared header (all figure types):
395
463
  ```markdown
396
464
  # Figure N: {Title}
397
465
  - **Source**: Figure N, Section X.Y
398
- - **Caption**: "{caption}"
399
- - **Axes**: X = {label, units}, Y = {label, units}
466
+ - **Caption**: "{verbatim or near-verbatim caption}"
467
+ - **Screenshot**: figureN.png
468
+ - **Figure type**: {quantitative_plot | diagram | qualitative_sample | mixed}
469
+ - **Extraction method**: {exact_from_labels | digitized_estimate | visual_description}
470
+ - **Reading confidence**: {high | medium | low}
471
+ ```
472
+
473
+ ### quantitative_plot
474
+ Read values off the axes. Record axis scale — misreading a log axis corrupts every value.
475
+ ```markdown
476
+ - **Plot kind**: {line | bar | scatter | box | histogram | heatmap}
477
+ - **Axes**: X = {label, units, scale: linear|log}, Y = {label, units, scale: linear|log}
400
478
 
401
479
  | X | Y (Series A) | Y (Series B) | ... |
402
480
  |---|-------------|-------------|-----|
403
- | v | v | v | ... |
481
+ | v | v | v | ... |
482
+
483
+ ## Trend summary
484
+ {Directional reading that survives estimation error: monotonic/plateau/crossover at x≈..., variance bands, A vs B ordering.}
485
+ ```
486
+ - Use exact values only when shown as data labels or stated in text; otherwise mark readings approximate with `≈` and set extraction method to `digitized_estimate`.
487
+ - A `quantitative_plot` file MUST contain a data table OR an explicit statement that points were unreadable (with `reading confidence: low`) plus a usable trend summary.
488
+
489
+ ### diagram (architecture / pipeline / schematic)
490
+ Do NOT fabricate a data table. Capture structure, and mirror it into the relevant method/solution file.
491
+ ```markdown
492
+ ## Visual description
493
+ - **Components**: {boxes/modules with their labels}
494
+ - **Connections**: {arrows / data flow, source → target}
495
+ - **Annotations**: {shapes, colors, groupings that carry meaning}
496
+ - **What it conveys**: {the structural claim the diagram makes}
404
497
  ```
405
498
 
406
- Mark approximate readings with "≈".
499
+ ### qualitative_sample (example outputs, attention maps, failure cases)
500
+ ```markdown
501
+ ## Visual description
502
+ - **Shows**: {what the panel depicts}
503
+ - **Demonstrates**: {the qualitative point — e.g. failure mode, behavior, artifact}
504
+ - **Supports**: {claim ID(s) or gap ID(s) this is evidence for}
505
+ ```
506
+
507
+ Rules:
508
+ - Mark every estimated numeric reading with `≈`.
509
+ - Never present a `digitized_estimate` as an exact source value.
510
+ - Never convert a `diagram` or `qualitative_sample` into a numeric table it does not contain.
511
+ - Subset/derived figure views follow the same `derived_`/`subset_` naming and provenance rules as tables.
407
512
 
408
513
  ---
409
514
 
@@ -94,13 +94,12 @@ A change in research direction.
94
94
 
95
95
  1. **Nested YAML**: Children appear inline under parent node's `children` list
96
96
  2. **Valid DAG**: No cycles. All `also_depends_on` IDs must exist in the tree
97
- 3. **Minimum 8 nodes**: Cover the paper's key research trajectory
98
- 4. **Must include dead_end nodes**: At least 1 from ablations or rejected alternatives
99
- 5. **Must include decision nodes**: At least 1 documenting a design choice
100
- 6. **Every node has**: `id` (N01, N02...), `type`, `title`
101
- 7. **Every node has `support_level`**: `explicit` or `inferred`
102
- 8. **Explicit nodes should have `source_refs`**: table/figure/section references from the input material
103
- 9. **`also_depends_on`**: Only for DAG convergence (node has multiple parents beyond nesting)
97
+ 3. **Target ~8+ nodes** covering the paper's key trajectory — but source-bounded, not a quota. Never add filler nodes to hit the number (Rule 14).
98
+ 4. **dead_end / decision nodes**: include every one the paper actually reveals (ablations, rejected alternatives, stated design choices). If the paper exposes none, do NOT invent one — a smaller honest tree is correct (Rule 9). Mark reconstructed nodes `inferred`.
99
+ 5. **Every node has**: `id` (N01, N02...), `type`, `title`
100
+ 6. **Every node has `support_level`**: `explicit` or `inferred`
101
+ 7. **Explicit nodes should have `source_refs`**: table/figure/section references from the input material
102
+ 8. **`also_depends_on`**: Only for DAG convergence (node has multiple parents beyond nesting)
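The structural rules above can be checked mechanically. The sketch below assumes the tree has already been loaded into nested Python dicts (for example by a YAML parser) and that nodes carry the `id`, `type`, `title`, `support_level`, `children`, and `also_depends_on` keys named in the rules. The function names `walk` and `validate_tree` and the error strings are hypothetical.

```python
from typing import Any, Iterator, Optional

ALLOWED_SUPPORT = {"explicit", "inferred"}


def walk(root: dict[str, Any]) -> Iterator[tuple[dict[str, Any], Optional[str]]]:
    """Yield (node, parent_id) for every node in the nested tree."""
    stack = [(root, None)]
    while stack:
        node, parent_id = stack.pop()
        yield node, parent_id
        for child in node.get("children", []):
            stack.append((child, node.get("id")))


def validate_tree(root: dict[str, Any]) -> list[str]:
    """Return a list of rule violations; an empty list means the checks passed."""
    errors: list[str] = []
    nodes = list(walk(root))
    ids = [node.get("id") for node, _ in nodes]
    known = set(ids)
    if len(known) != len(ids):
        errors.append("duplicate node ids")

    # node id -> ids it depends on (nesting parent plus also_depends_on)
    edges: dict[str, list[str]] = {}
    for node, parent_id in nodes:
        nid = node.get("id") or "<missing id>"
        for field in ("id", "type", "title"):
            if not node.get(field):
                errors.append(f"{nid}: missing {field}")
        if node.get("support_level") not in ALLOWED_SUPPORT:
            errors.append(f"{nid}: support_level must be explicit or inferred")
        deps = list(node.get("also_depends_on", []))
        for dep in deps:
            if dep not in known:
                errors.append(f"{nid}: unknown dependency {dep}")
        edges[nid] = deps + ([parent_id] if parent_id else [])

    # Depth-first search; revisiting a node that is still "visiting" is a cycle.
    state: dict[str, str] = {}

    def has_cycle(nid: str) -> bool:
        if state.get(nid) == "done":
            return False
        if state.get(nid) == "visiting":
            return True
        state[nid] = "visiting"
        found = any(has_cycle(dep) for dep in edges.get(nid, []))
        state[nid] = "done"
        return found

    if any(has_cycle(nid) for nid in list(edges)):
        errors.append("cycle detected")
    return errors
```

The check deliberately enforces no node count: the rules treat the ~8 figure as a target bounded by the source, so only structure is validated here.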
104
103
 
105
104
  ## Extraction Strategy
106
105
 
@@ -0,0 +1,218 @@
1
+ # Figure Extraction Guide — Reading Plots, Diagrams, and Samples
2
+
3
+ Load this when an input contains figures whose information is not available as text. The goal
4
+ is to turn pixels into structured ARA evidence **honestly**: exact where the source is exact,
5
+ explicitly approximate where you are reading off a plot, and structural (not numeric) where the
6
+ figure is a diagram.
7
+
8
+ The governing rule (Critical Rule #11): read figures by looking at them, mark estimates as
9
+ estimates, and never fabricate a data table for a figure that does not contain one.
10
+
11
+ ---
12
+
13
+ ## 0. Decide whether you even need to crop
14
+
15
+ Try reading the figure from the rendered PDF page first — the Read tool renders PDF pages and
16
+ displays images visually. Only fall back to rendering/cropping (Section 2) when the figure is:
17
+ - too small or dense to read values reliably,
18
+ - one panel in a multi-panel figure you need to isolate,
19
+ - overlapping with text/other figures, or
20
+ - in a vector format you want at higher resolution.
21
+
22
+ Cropping is a means to *see better*, not a required step.
23
+
24
+ ---
25
+
26
+ ## 1. Classify before you read
27
+
28
+ | Type | What it carries | ARA destination | Do NOT |
29
+ |------|-----------------|-----------------|--------|
30
+ | `quantitative_plot` | numbers on axes (line/bar/scatter/box/hist/heatmap) | `evidence/figures/` data table + trend summary | invent points you cannot see |
31
+ | `diagram` | structure: components + connections | `evidence/figures/` visual description **and** `logic/solution/architecture.md` | build a numeric table |
32
+ | `qualitative_sample` | a demonstrated behavior/artifact | `evidence/figures/` visual description, tied to a claim/gap | claim measurements |
33
+ | `mixed` | several of the above in one figure | split per panel, classify each | collapse panels together |
34
+
35
+ If you are unsure, classify by asking "could I, in principle, read a number off an axis here?"
36
+ If no, it is not a `quantitative_plot`.
37
+
38
+ ---
39
+
40
+ ## 2. Rendering and cropping a figure (when needed)
41
+
42
+ The skill allows `Bash(python *)`. Prefer **PyMuPDF** (`fitz`) — no system dependencies, fast,
43
+ and lets you crop a sub-region. `pdf2image` is a fine alternative when you only need full pages, though it needs the Poppler utilities installed on the system.
44
+
45
+ **Save every render as the evidence screenshot.** The cropped PNG you produce for a table/figure
46
+ is not transient — save it into the artifact next to its markdown (`evidence/figures/figureN.png`,
47
+ `evidence/tables/tableN.png`). Crop to the object's region so the screenshot shows just that
48
+ table/figure. Every numbered table and figure must end up with a saved `.png`.
49
+
50
+ ### 2a. Render a whole page to PNG (PyMuPDF)
51
+
52
+ ```python
53
+ import fitz # PyMuPDF
54
+
55
+ doc = fitz.open("paper.pdf")
56
+ page = doc[6] # 0-indexed; page 7 in the PDF
57
+ pix = page.get_pixmap(dpi=200) # bump dpi for dense plots (200–300)
58
+ pix.save("page7.png")
59
+ ```
60
+
61
+ Then Read `page7.png` as an image.
62
+
63
+ ### 2b. Crop a single figure region (PyMuPDF)
64
+
65
+ Coordinates are in PDF points (72 pt = 1 inch), origin at the top-left of the page. Find the
66
+ rough box by eye from the full-page render, then crop with a `clip` rectangle:
67
+
68
+ ```python
69
+ import fitz
70
+
71
+ doc = fitz.open("paper.pdf")
72
+ page = doc[6]
73
+ # clip = (x0, y0, x1, y1) in points — the bounding box of the figure on the page
74
+ clip = fitz.Rect(60, 90, 540, 360)
75
+ pix = page.get_pixmap(dpi=300, clip=clip)
76
+ pix.save("fig4_cropped.png")
77
+ ```
78
+
79
+ Increase `dpi` if axis ticks or legends are still unreadable. Re-Read the crop and iterate.
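Because the full-page render is measured in pixels while `clip` is given in points, a box picked by eye from the PNG has to be rescaled by 72/dpi. A minimal sketch of that arithmetic, assuming the page was rendered at a known `dpi` with no rotation and with the page origin at (0, 0); the helper name `pixels_to_points` is hypothetical and not part of PyMuPDF.

```python
def pixels_to_points(box_px, dpi):
    """Convert an (x0, y0, x1, y1) pixel box from a page render into PDF points."""
    scale = 72.0 / dpi  # 72 points per inch divided by pixels per inch
    return tuple(round(value * scale, 2) for value in box_px)


# A box read off a 200-dpi full-page render, in pixels from the top-left corner.
clip_pts = pixels_to_points((167, 250, 1500, 1000), dpi=200)
print(clip_pts)
```

The resulting tuple can be passed on as `fitz.Rect(*clip_pts)`. Leave a small margin around the figure, since a box read by eye is only approximate.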
80
+
81
+ ### 2c. Full-page fallback (pdf2image)
82
+
83
+ ```python
84
+ from pdf2image import convert_from_path
85
+
86
+ pages = convert_from_path("paper.pdf", dpi=200, first_page=7, last_page=7)
87
+ pages[0].save("page7.png")
88
+ ```
89
+
90
+ ### 2d. Standalone image inputs
91
+
92
+ If given `.png`/`.jpg`/`.svg`/exported plots directly, Read them as-is. For `.svg`, the text
93
+ labels are often in the XML — `Grep` the file for axis labels and series names to corroborate
94
+ what you read visually.
95
+
96
+ ---
97
+
98
+ ## 3. Reading a quantitative plot
99
+
100
+ 1. **Axes first.** Record both axis labels, units, and **scale (linear vs log)**. A log axis
101
+ read as linear silently corrupts every value — check tick spacing (equal multiplicative
102
+ gaps ⇒ log).
103
+ 2. **Ranges and gridlines.** Note the axis min/max and any gridlines; they are your ruler.
104
+ 3. **Prefer printed values.** If the plot has data labels, or the text/caption states the key
105
+ numbers, use those and set `extraction method: exact_from_labels`.
106
+ 4. **Otherwise estimate.** Read each point against the gridlines, mark it `≈`, and set
107
+ `extraction method: digitized_estimate` with a `reading confidence`.
108
+ 5. **Always capture the trend.** Even when exact points are unreadable, the *shape* is real
109
+ evidence: monotonic? plateau? crossover (at what x)? which series is on top? variance bands?
110
+ 6. **Series and legend.** One column per series; name them exactly as the legend does.
111
+
112
+ Confidence rubric:
113
+ - `high` — clean axes, gridlines, few points, or printed labels
114
+ - `medium` — readable but interpolated between gridlines
115
+ - `low` — dense/overlapping/blurred; record the trend and say points are unreliable
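The tick-spacing check from step 1 can be made concrete. A minimal sketch, assuming you have typed in the labels of evenly spaced major ticks from one axis as numbers; the function name `guess_axis_scale` and the 5% tolerance are illustrative choices, not part of the ARA spec.

```python
import math


def guess_axis_scale(ticks, rel_tol=0.05):
    """Return 'linear', 'log', or 'unknown' from the labels of evenly spaced ticks."""
    if len(ticks) < 3:
        # Two ticks fit both a constant difference and a constant ratio.
        return "unknown"
    diffs = [b - a for a, b in zip(ticks, ticks[1:])]
    if all(math.isclose(d, diffs[0], rel_tol=rel_tol) for d in diffs):
        return "linear"  # equal additive gaps
    if all(t > 0 for t in ticks):
        ratios = [b / a for a, b in zip(ticks, ticks[1:])]
        if all(math.isclose(r, ratios[0], rel_tol=rel_tol) for r in ratios):
            return "log"  # equal multiplicative gaps
    return "unknown"


print(guess_axis_scale([0, 10, 20, 30]))
print(guess_axis_scale([1, 10, 100, 1000]))
```

An `unknown` result (for instance labels such as 1, 2, 5, 10) is a prompt to look at the axis again rather than to pick a scale by default.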
116
+
117
+ ### Worked example — line plot
118
+
119
+ Source: a 2-series accuracy-vs-epochs line plot, no data labels, linear axes.
120
+
121
+ ```markdown
122
+ # Figure 4: Validation accuracy vs. training epochs
123
+ - **Source**: Figure 4, Section 5.2
124
+ - **Caption**: "Validation accuracy over training for Ours vs. Baseline."
125
+ - **Figure type**: quantitative_plot
126
+ - **Extraction method**: digitized_estimate
127
+ - **Reading confidence**: medium
128
+ - **Plot kind**: line
129
+ - **Axes**: X = epoch (count, linear), Y = top-1 accuracy (%, linear)
130
+
131
+ | Epoch | Ours (%) | Baseline (%) |
132
+ |-------|----------|--------------|
133
+ | 10 | ≈62 | ≈58 |
134
+ | 30 | ≈74 | ≈66 |
135
+ | 50 | ≈78 | ≈69 |
136
+
137
+ ## Trend summary
138
+ Both rise monotonically and plateau by ~epoch 40. Ours is above Baseline at every read point;
139
+ the gap widens from ≈4 pts (epoch 10) to ≈9 pts (epoch 50). Exact endpoints unreadable — see
140
+ evidence/tables/ for any reported final numbers.
141
+ ```
142
+
143
+ > Note the discipline: the claim "Ours > Baseline, gap widens" is well supported even though
144
+ > every individual number is approximate. Put the directional fact in the claim's
145
+ > `Evidence basis`; do not promote "≈78%" into an exact result.
146
+
147
+ ---
148
+
149
+ ## 4. Reading a diagram
150
+
151
+ Do not build a data table. Capture structure, then mirror it into `architecture.md`.
152
+
153
+ ```markdown
154
+ # Figure 2: Model architecture
155
+ - **Source**: Figure 2, Section 3.1
156
+ - **Caption**: "Overview of the proposed two-stage encoder."
157
+ - **Figure type**: diagram
158
+ - **Extraction method**: visual_description
159
+ - **Reading confidence**: high
160
+
161
+ ## Visual description
162
+ - **Components**: Tokenizer → Stage-A encoder (6 blocks) → Cross-attn bridge → Stage-B decoder → Head
163
+ - **Connections**: residual skip from Stage-A output to Cross-attn bridge; dashed arrow = optional auxiliary loss path
164
+ - **Annotations**: blue boxes = trainable, grey = frozen; the bridge is the paper's novel block
165
+ - **What it conveys**: the contribution sits in the cross-attn bridge, not the encoders
166
+ ```
167
+
168
+ The component graph here becomes the backbone of `logic/solution/architecture.md`.
169
+
170
+ ---
171
+
172
+ ## 5. Reading a qualitative sample
173
+
174
+ ```markdown
175
+ # Figure 6: Failure cases on out-of-distribution inputs
176
+ - **Source**: Figure 6, Appendix C
177
+ - **Caption**: "Representative failures under distribution shift."
178
+ - **Figure type**: qualitative_sample
179
+ - **Extraction method**: visual_description
180
+ - **Reading confidence**: high
181
+
182
+ ## Visual description
183
+ - **Shows**: 4 input/output pairs where the model mislabels rotated objects
184
+ - **Demonstrates**: the rotation-sensitivity failure mode
185
+ - **Supports**: G2 (robustness gap), and is the qualitative basis behind C04's limitation clause
186
+ ```
187
+
188
+ No numbers — but this is genuine evidence for a gap/limitation and must be tied to a claim or gap ID.
189
+
190
+ ---
191
+
192
+ ## 6. Common traps
193
+
194
+ - **Log axes** read as linear — the single most damaging error. Check tick spacing every time.
195
+ - **Secondary (right-hand) Y-axis** — dual-axis plots have two scales; map each series to the
196
+ correct one.
197
+ - **Truncated / broken axes** (axis not starting at 0) — exaggerates differences; note it in
198
+ the trend summary so claims are not overstated.
199
+ - **Error bars / shaded bands** — capture them; they bound how strong a claim can be.
200
+ - **Color-only series distinction** — name series by legend text, not color, so the table is
201
+ unambiguous.
202
+ - **Stacked vs grouped bars** — stacked totals are cumulative; do not read a stacked segment as
203
+ an absolute value.
204
+ - **Subset panels** — a single panel pulled from a multi-panel figure is a derived view; name it
205
+ `derived_`/`subset_` and cite the parent figure, per the evidence naming rules.
206
+
207
+ ---
208
+
209
+ ## 7. Honesty checklist (before writing the figure file)
210
+
211
+ - [ ] Figure type classified, and the file matches it (plot ⇒ table+trend; diagram/sample ⇒ visual description)
212
+ - [ ] `Extraction method` and `Reading confidence` set, and consistent with the content
213
+ - [ ] Every estimated number marked `≈`; nothing estimated is labeled `exact_from_labels`
214
+ - [ ] Axis scale (linear/log) recorded for plots
215
+ - [ ] No fabricated table for a diagram or qualitative sample
216
+ - [ ] Unreadable figure stated as `reading confidence: low` with a trend summary, not invented points
217
+ - [ ] Diagram structure mirrored into `logic/solution/architecture.md`
218
+ - [ ] Qualitative sample tied to a claim or gap ID
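Several checklist items are mechanical enough to lint before a figure file is written. A rough sketch, assuming figure files follow the header and table layout shown in this guide; the function name `lint_figure_md` and its messages are hypothetical, and only a few of the items above are covered.

```python
import re


def _field(text, name):
    """Return the value of a '- **Name**: value' header line, or None."""
    match = re.search(rf"\*\*{re.escape(name)}\*\*:\s*(.+)", text)
    return match.group(1).strip() if match else None


def lint_figure_md(text):
    """Return a list of checklist violations found in one figure markdown file."""
    problems = []
    fig_type = _field(text, "Figure type")
    method = _field(text, "Extraction method")
    if _field(text, "Reading confidence") not in {"high", "medium", "low"}:
        problems.append("reading confidence missing or invalid")

    # Pipe-delimited lines form the data table: header, separator, then data rows.
    rows = [line for line in text.splitlines() if line.strip().startswith("|")]
    data_rows = rows[2:]

    if fig_type in {"diagram", "qualitative_sample"} and rows:
        problems.append(f"{fig_type} must not contain a data table")

    if fig_type == "quantitative_plot":
        axes = _field(text, "Axes") or ""
        if "linear" not in axes and "log" not in axes:
            problems.append("axis scale not recorded")
        if method == "digitized_estimate":
            for row in data_rows:
                cells = [c.strip() for c in row.strip().strip("|").split("|")]
                # First column is the X value; the rest are series readings.
                for cell in cells[1:]:
                    if re.search(r"\d", cell) and "≈" not in cell:
                        problems.append(f"unmarked estimate: {cell}")
    return problems
```

Run against the line-plot worked example from Section 3, this sketch should report nothing; removing one `≈` from its table should produce an `unmarked estimate` message.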