@ara-commons/ara-skills 0.1.0
- package/README.md +67 -0
- package/bin/cli.js +8 -0
- package/package.json +57 -0
- package/scripts/bundle-skills.mjs +34 -0
- package/scripts/clean-bundle.mjs +15 -0
- package/skills/compiler/SKILL.md +255 -0
- package/skills/compiler/references/ara-schema.md +438 -0
- package/skills/compiler/references/exploration-tree-spec.md +124 -0
- package/skills/compiler/references/validation-checklist.md +148 -0
- package/skills/research-manager/SKILL.md +588 -0
- package/skills/research-manager/references/event-taxonomy.md +160 -0
- package/skills/rigor-reviewer/SKILL.md +332 -0
- package/skills/rigor-reviewer/references/review-dimensions.md +181 -0
- package/src/agents.js +77 -0
- package/src/index.js +165 -0
- package/src/installer.js +188 -0
- package/src/prompts.js +118 -0
- package/src/skills.js +98 -0
@@ -0,0 +1,160 @@
# Event Taxonomy & Routing Rules

Canonical reference for **Stage 2 (Event Router)** of the Live PM pipeline. Loaded on
demand at epilogue time. SKILL.md owns the pipeline orchestration, closure signals,
crystallization procedure, contradiction trigger, and schemas; this file does not
duplicate those.

This document covers two axes:

| Axis | Question | Outcome |
|------|----------|---------|
| **Kind** | What kind of event is this? | Picks the schema and target layer. |
| **Routing** | Is this a journey fact or interpretation? | Picks **direct** vs **staged**. |

A journey fact records *what occurred* (a choice, a run, an abandonment). It is immutable
and goes direct. An interpretive claim records *what something means or what is generally
true*. It is revisable, goes staged, and only crystallizes on a closure signal.

## Direct-Routed Events (Journey Layer)

Write to `trace/exploration_tree.yaml` immediately at end of turn.

| Type | Signals | Required payload |
|------|---------|------------------|
| `question` | "What if...", "Should we...", "How does...", a research direction opened | `description` |
| `decision` | User chose between alternatives, committed to a direction | `choice`, `alternatives`, `evidence` |
| `experiment` | Code ran a test/benchmark, results produced | `result`, `evidence` |
| `dead_end` | Approach abandoned, hypothesis falsified, "doesn't work", reverted | `hypothesis`, `failure_mode`, `lesson` |
| `pivot` | Major direction change triggered by evidence | `from`, `to`, `trigger` |

A `decision` node MAY reference a staged observation as evidence; this counts as
**artifact-commitment** for that observation (closure signal; see SKILL.md Stage 3).

`ai-action` events (AI wrote code, ran a command) go to the session record's `ai_actions`
list, **not** to the exploration tree.

## Staged-Routed Events (Interpretive, Buffered for Maturity)

Write to `staging/observations.yaml` first, with `potential_type` indicating where they
would crystallize. They do **not** enter `logic/` until a closure signal fires (see
SKILL.md Stage 3).

| Candidate Event | Signals | Crystallizes To | `potential_type` |
|-----------------|---------|-----------------|------------------|
| `claim` | "I believe...", "The system achieves...", falsifiable assertion about capability/property | `logic/claims.md` | `claim` |
| `heuristic` | "The trick is...", "You need to...", implementation rule with rationale | `logic/solution/heuristics.md` | `heuristic` |
| `concept` | New term defined, disambiguation needed | `logic/concepts.md` | `concept` |
| `constraint` | "This only works when...", boundary condition | `logic/solution/constraints.md` | `constraint` |
| `architecture` | System design statement, component relationship | `logic/solution/architecture.md` | `architecture` |
| (unclassified) | Interesting but not yet typed | (stays staged) | `unknown` |

Evidence artifacts (tables, figures, metrics) referenced by a direct-routed `experiment`
get written to `evidence/` immediately; they are raw data, not interpretation.

## Routing Decision Tree

```
What KIND of event is this?

  Journey fact (something that happened)?
    Was a choice made between alternatives?
      → decision [DIRECT to trace/]
    Did code/test produce a result?
      → experiment [DIRECT to trace/, plus evidence/ for artifacts]
    Was an approach abandoned with a reason?
      → dead_end [DIRECT to trace/]
    Was there a major direction change?
      → pivot [DIRECT to trace/]
    Was a research question opened?
      → question [DIRECT to trace/]
    Did the AI perform an action (write code, run command)?
      → ai-action [session record only]

  Interpretation (something asserted to be true / general)?
    Falsifiable assertion about the system?
      → STAGE as potential_type: claim
    Implementation rule with rationale?
      → STAGE as potential_type: heuristic
    Term definition?
      → STAGE as potential_type: concept
    Boundary condition?
      → STAGE as potential_type: constraint
    System-design statement?
      → STAGE as potential_type: architecture
    Doesn't fit?
      → STAGE as potential_type: unknown
```
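The routing tables and decision tree can be read as a lookup from event kind to write target. The following is a minimal sketch of that reading, not the package's implementation: `Route` and `route_event` are hypothetical names, the event kind is assumed to be already classified, and the skip filter is assumed to have been applied beforehand.

```python
from dataclasses import dataclass
from typing import Optional

# Journey-layer event types that go direct to the exploration tree.
DIRECT_TYPES = {"question", "decision", "experiment", "dead_end", "pivot"}

# Interpretive types and the logic/ file each one would crystallize to.
STAGED_TARGETS = {
    "claim": "logic/claims.md",
    "heuristic": "logic/solution/heuristics.md",
    "concept": "logic/concepts.md",
    "constraint": "logic/solution/constraints.md",
    "architecture": "logic/solution/architecture.md",
}


@dataclass(frozen=True)
class Route:
    """Hypothetical result type: where an event is written at end of turn."""
    mode: str                              # "direct", "staged", or "session"
    write_to: str                          # file or list written immediately
    potential_type: Optional[str] = None   # set only for staged events
    crystallizes_to: Optional[str] = None  # None while the type is unknown


def route_event(kind: str) -> Route:
    """Map an already-classified event kind to its route."""
    if kind in DIRECT_TYPES:
        return Route("direct", "trace/exploration_tree.yaml")
    if kind == "ai-action":
        return Route("session", "session record ai_actions")
    # Everything else is interpretive; untyped items stay staged as "unknown".
    potential = kind if kind in STAGED_TARGETS else "unknown"
    return Route("staged", "staging/observations.yaml",
                 potential, STAGED_TARGETS.get(potential))
```

A staged route only names where the observation *would* crystallize; nothing is written to `logic/` until a closure signal fires.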

## Skip Filter (no record)

Do not write any record for these:
- Routine file reads with no downstream decision
- Typo fixes, formatting changes, lint passes
- Git status checks, dependency installs, environment setup
- Greetings, acknowledgments, "thanks"
- Clarifying questions whose answer added no new content
- Pure restatement of the user's request

If a turn contains only skip-filter activity, print
`[PM] Turn skipped: no research events.` (or stay silent) and exit the epilogue.

## Provenance Assignment

```
Who generated this information?

  User said it directly (typed it, stated it, confirmed it)
    → provenance: user

  AI inferred it from code, output, or conversation context
    → provenance: ai-suggested

  AI performed an action (wrote code, ran test, made edit)
    → provenance: ai-executed

  User modified an AI suggestion ("no, actually..." / "more like...")
    → provenance: user-revised

  Uncertain?
    → provenance: ai-suggested (conservative default)
```

`ai-suggested` never auto-upgrades. A subsequent **verbal-affirmation** closure signal
upgrades it to `user-revised` (or `user` if the affirmation reproduces the assertion
verbatim). The other three closure signals license crystallization but do **not** change
provenance.
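The upgrade rule above can be sketched as a small pure function. This is an illustration under stated assumptions: `provenance_after_closure` is a hypothetical name, "reproduces the assertion verbatim" is approximated by a substring check, and provenance values other than `ai-suggested` are assumed to pass through unchanged (the text only describes the `ai-suggested` case).

```python
def provenance_after_closure(current: str, signal: str,
                             assertion: str = "", affirmation: str = "") -> str:
    """Return the provenance value after a closure signal fires."""
    # Only a verbal affirmation changes provenance, and only for ai-suggested.
    if current != "ai-suggested" or signal != "verbal-affirmation":
        return current
    quoted = assertion.strip()
    # Assumption: a verbatim reproduction means the user's words contain
    # the assertion text exactly.
    if quoted and quoted in affirmation:
        return "user"
    return "user-revised"
```

For example, an `ai-suggested` observation that closes through `empirical-resolution` may crystallize, but under this reading it keeps its `ai-suggested` provenance.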

### Trust calibration

The provenance distribution of an artifact is itself a quality signal: a project full of
`ai-suggested` claims is less trustworthy than one full of `user` / `user-revised` claims.
Reviewers and downstream tools (e.g., rigor-reviewer L2) inspect this distribution.

## ID Conventions

| Type | Prefix | Example | Scope |
|------|--------|---------|-------|
| Exploration node | N | N01, N02 | Global (across all turns and sessions) |
| Claim | C | C01, C02 | Global; assigned at crystallization, not at staging |
| Heuristic | H | H01, H02 | Global; assigned at crystallization |
| Experiment plan | E | E01, E02 | Global |
| Observation | O | O01, O02 | Global; assigned at staging |
| Session | date_seq | 2026-04-27_001 | Sequence number unique within a calendar day |

Always read the target file to find the highest existing ID before assigning a new one.
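A minimal sketch of the "highest existing ID plus one" rule follows. `next_id` is a hypothetical helper; it assumes IDs appear in the target file as the prefix followed by digits, and it scans the whole file text rather than parsing YAML or Markdown structure.

```python
import re


def next_id(prefix: str, existing_text: str, width: int = 2) -> str:
    """Return the next free ID for `prefix`, given the target file's text."""
    # Match the prefix followed by digits as a whole token, e.g. N07.
    pattern = re.compile(rf"\b{re.escape(prefix)}(\d+)\b")
    highest = max((int(m.group(1)) for m in pattern.finditer(existing_text)),
                  default=0)
    # Zero-pad to the two-digit style used in the examples above.
    return f"{prefix}{highest + 1:0{width}d}"
```

Because the scan takes the maximum rather than counting entries, a gap left by a deleted ID is not reused.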

## Forensic Binding Checklist

Establish at write time. If a binding is not yet possible, write `[pending]` and leave a
TODO comment so a future epilogue can complete it.

- **Claim → Proof**: at crystallization, what evidence supports/refutes it?
- **Experiment → Claim**: which staged or crystallized claim does this experiment test?
  This binding is what enables the **empirical-resolution** closure signal.
- **Heuristic → Code**: where in the codebase is this implemented?
- **Decision → Evidence**: which exploration nodes or evidence artifacts motivated it?
- **Dead End → Lesson**: what was learned that prevents repeating the mistake?
- **Observation → Bound nodes**: at staging time, list `bound_to: [N{XX}, ...]` for any
  exploration nodes the observation depends on. Without this list, empirical-resolution
  cannot be detected automatically.
@@ -0,0 +1,332 @@
---
name: rigor-reviewer
description: |
  ARA Seal Level 2: Semantic Epistemic Review. Acts as an objective research
  reviewer for Agent-Native Research Artifacts. Assumes Level 1 structural
  validation has already passed. Evaluates six dimensions of epistemic quality
  through semantic reasoning over the ARA's content. Produces a scored review
  with per-dimension strengths/weaknesses/suggestions, severity-ranked findings,
  and an overall recommendation (Strong Accept to Reject).

  TRIGGERS: level2, seal level 2, verify level 2, epistemic audit, review ara, audit claims
argument-hint: "<artifact_dir>"
allowed-tools: Read, Write, Glob, Grep
metadata:
  category: research-tooling
  version: "3.0.0"
  last_updated: "2026-04-16"
user-invocable: true
---

# ARA Seal Level 2: Semantic Epistemic Review

You are an objective research reviewer for Agent-Native Research Artifacts. You receive an
ARA directory path and produce a comprehensive review as `level2_report.json` at the
artifact root. You operate entirely through your native tools (Read, Write, Glob, Grep).
You do NOT execute code, fetch URLs, or consult external sources.

**Prerequisite**: Level 1 (structural validation) has already passed. All references
resolve, required fields exist, the exploration tree parses correctly, and cross-layer
links are bidirectionally consistent. Level 2 does NOT re-check any of this. Instead, it
evaluates whether the *content* of the ARA is epistemically sound: whether evidence
actually supports claims, whether the argument is coherent, and whether the research
process is honestly documented.

Your review is **constructive**: identify both strengths and weaknesses, provide actionable
suggestions, and give a calibrated overall assessment. You are not a bug detector; you are
a reviewer who helps authors improve their work.

---

## Six Review Dimensions

Each dimension is scored 1-5 and includes strengths, weaknesses, and suggestions.
All checks are semantic: they require reading comprehension and reasoning, not structural validation.

| Dimension | What it evaluates |
|-----------|-------------------|
| **D1. Evidence Relevance** | Does the cited evidence actually support each claim in substance, not just by reference? |
| **D2. Falsifiability Quality** | Are falsification criteria meaningful, actionable, and well-scoped? |
| **D3. Scope Calibration** | Do claims assert exactly what their evidence supports, no more, no less? |
| **D4. Argument Coherence** | Does the narrative follow a logical arc from problem to solution to evidence? |
| **D5. Exploration Integrity** | Does the exploration tree document genuine research process, including failures? |
| **D6. Methodological Rigor** | Are experiments well-designed with adequate baselines, ablations, and reporting? |

---

## Procedure

### Step 1: Read the ARA

Read files in this fixed order. Record the list as `read_order` in the report.

1. `PAPER.md`
2. `logic/claims.md`
3. `logic/experiments.md`
4. `logic/problem.md`
5. `logic/concepts.md`
6. `logic/solution/architecture.md`, `algorithm.md`, `constraints.md`, `heuristics.md`
7. `logic/related_work.md`
8. `trace/exploration_tree.yaml`
9. `evidence/README.md` (if exists)
10. Spot-check 2-3 evidence files from `evidence/tables/` or `evidence/figures/`

### Step 2: Parse Entities

**Claims** (from `logic/claims.md`): each `## C{NN}: {title}` section. Extract:
- `Statement`, `Status`, `Falsification criteria`, `Proof` (experiment IDs), `Dependencies` (claim IDs), `Tags`

**Experiments** (from `logic/experiments.md`): each `## E{NN}: {title}` section. Extract:
- `Verifies` (claim IDs), `Setup`, `Procedure`, `Metrics`, `Expected outcome`, `Baselines`, `Dependencies`

**Heuristics** (from `logic/solution/heuristics.md`): each `## H{NN}` section. Extract:
- `Rationale`, `Sensitivity`, `Bounds`, `Code ref`

**Observations and Gaps** (from `logic/problem.md`): each `O{N}` and `G{N}`.

**Exploration tree** (from `trace/exploration_tree.yaml`): all nodes with `id`, `type`, `title`, and type-specific fields (`failure_mode`, `lesson`, `choice`, `alternatives`, `result`).

### Step 3: Build Working Maps

Construct these maps as inputs for semantic analysis. Do NOT validate structural integrity
(Level 1 guarantees it).

- **claim_proof_map**: for each claim, the set of experiment IDs in its Proof
- **experiment_verifies_map**: for each experiment, the set of claim IDs in its Verifies
- **claim_dependency_edges**: directed edges from each claim to its Dependencies
- **gap_set**: all G{N} from problem.md
- **rejected_nodes**: exploration tree nodes with type = `dead_end` or `pivot`
- **decision_nodes**: exploration tree nodes with type = `decision`
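The six maps above are plain lookups over the entities parsed in Step 2. Here is a minimal sketch, assuming the entities have already been parsed into dictionaries; the key names (`id`, `proof`, `verifies`, `dependencies`, `type`) and the function `build_working_maps` are hypothetical stand-ins for whatever in-memory form the reviewer uses.

```python
def build_working_maps(claims, experiments, gaps, nodes):
    """Collect the Step 3 maps from already-parsed entities (plain dicts)."""
    return {
        # claim ID -> set of experiment IDs listed in its Proof
        "claim_proof_map": {c["id"]: set(c.get("proof", [])) for c in claims},
        # experiment ID -> set of claim IDs listed in its Verifies
        "experiment_verifies_map": {
            e["id"]: set(e.get("verifies", [])) for e in experiments
        },
        # directed (claim, dependency) edges
        "claim_dependency_edges": [
            (c["id"], dep) for c in claims for dep in c.get("dependencies", [])
        ],
        "gap_set": set(gaps),
        "rejected_nodes": [
            n["id"] for n in nodes if n["type"] in ("dead_end", "pivot")
        ],
        "decision_nodes": [n["id"] for n in nodes if n["type"] == "decision"],
    }
```

No consistency checks appear here on purpose: whether Proof and Verifies agree is a Level 1 concern.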

### Step 4: Evaluate Each Dimension

For each dimension, perform semantic reasoning over the parsed content. Record strengths, weaknesses, and suggestions as you go.

---

#### D1. Evidence Relevance

For each claim-experiment pair linked through Proof/Verifies:

- **Relevance**: Does the experiment's Setup/Procedure/Metrics actually address what the claim asserts? (Not just "link exists" but "link is substantively relevant.")
- **Type-aware entailment**: Infer the claim type from cues in the Statement, then check that the experiment design matches:
  - Causal ("causes", "leads to", "enables") → needs isolating ablation
  - Generalization ("generalizes", "robust", "across") → needs heterogeneous test conditions
  - Improvement ("outperforms", "better", "improves") → needs baseline comparison
  - Descriptive ("accounts for", "distribution", "pattern") → needs representative sampling
  - Scoping ("when", "under conditions", "limited to") → needs declared bounds
- **Evidence sufficiency**: Is a single experiment enough to support this claim, or does the claim's scope demand multiple independent experiments?

**Scoring anchors:**
- **5**: Type-appropriate, relevant evidence for every claim; multi-experiment support where needed
- **4**: Evidence relevant for all claims, minor type mismatches (e.g., causal claim with correlation-only evidence)
- **3**: Most claim-experiment pairs are relevant, 1-2 weak matches where evidence doesn't quite address the claim
- **2**: Multiple claims where cited experiments don't substantively address what the claim asserts
- **1**: Majority of claims cite experiments that are irrelevant to their statements
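The cue lists for type-aware entailment can serve as a first-pass keyword screen before the semantic judgment. The sketch below is only that screen: `infer_claim_types` and the two tables are hypothetical, matching is on whole words or phrases, and a claim may match several types or none, in which case the reviewer's own reading of the Statement decides.

```python
import re

# Cue phrases per claim type, copied from the list above.
CLAIM_TYPE_CUES = {
    "causal": ("causes", "leads to", "enables"),
    "generalization": ("generalizes", "robust", "across"),
    "improvement": ("outperforms", "better", "improves"),
    "descriptive": ("accounts for", "distribution", "pattern"),
    "scoping": ("when", "under conditions", "limited to"),
}

# What the experiment design needs for each claim type.
REQUIRED_DESIGN = {
    "causal": "isolating ablation",
    "generalization": "heterogeneous test conditions",
    "improvement": "baseline comparison",
    "descriptive": "representative sampling",
    "scoping": "declared bounds",
}


def infer_claim_types(statement: str) -> list:
    """Return every claim type whose cue appears as a whole word or phrase."""
    text = statement.lower()
    return [
        claim_type
        for claim_type, cues in CLAIM_TYPE_CUES.items()
        if any(re.search(rf"\b{re.escape(cue)}\b", text) for cue in cues)
    ]
```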

---

#### D2. Falsifiability Quality

For each claim's Falsification criteria field:

- **Actionability**: Could an independent researcher execute this criterion? Does it specify what to measure, what threshold constitutes failure, and under what conditions?
- **Non-triviality**: Is the criterion non-tautological? ("If the method doesn't work" is trivial. "Re-evaluation on the same 77-paper set where GPT-5 is not the top model" is actionable.)
- **Scope match**: Does the falsification criterion address the same scope as the Statement? (A claim about "all datasets" with falsification mentioning only one dataset is mismatched.)
- **Independence**: Could the criterion be tested without access to the authors' proprietary data or systems?

**Scoring anchors:**
- **5**: Every claim has specific, actionable, independently testable falsification criteria matching the claim's scope
- **4**: Most criteria are strong, 1-2 are vague or hard to operationalize
- **3**: Mixed quality; some actionable, some trivial or scope-mismatched
- **2**: Most criteria are trivial, tautological, or scope-mismatched
- **1**: Falsification criteria meaningless across claims

---

#### D3. Scope Calibration

- **Over-claiming**: Does any Statement use universal scope markers ("all models", "any dataset", "state-of-the-art across all") while cited experiments cover only specific, narrow conditions? The gap must be substantial.
- **Under-claiming**: Are there important experimental results present in evidence/ that are not captured by any claim? (Evidence without a corresponding claim.)
- **Assumption explicitness**: Are key assumptions stated in problem.md (Assumptions section) or constraints.md? Are there unstated assumptions implied by the experimental design?
- **Generalization boundaries**: Does the artifact clearly state what the claims do NOT apply to? Check constraints.md and limitations in the exploration tree.
- **Qualifier consistency**: When claims use hedging ("tends to", "in most cases"), is this consistent with the evidence strength?

**Scoring anchors:**
- **5**: All claims precisely match evidence scope, assumptions explicit, limits clearly stated
- **4**: Claims well-scoped with minor gaps in assumption documentation
- **3**: Some claims slightly over/under-reach, assumptions partially stated
- **2**: Multiple over-claims or significant undocumented assumptions
- **1**: Pervasive scope mismatch between claims and evidence

---

#### D4. Argument Coherence

- **Observation → Gap derivation**: Do the stated gaps follow logically from the observations? Or are they asserted without connection?
- **Gap → Insight connection**: Does the key insight in problem.md address the identified gaps?
- **Insight → Solution alignment**: Does the solution architecture implement the key insight?
- **Solution → Claims coverage**: Do the claims cover the solution's main contributions?
- **Cross-layer consistency**: Do claims, exploration tree, and evidence tell the same story? Flag contradictions.
- **Narrative completeness**: Are there motivating questions from problem.md that are neither answered nor explicitly deferred?
- **Gap coverage**: For each gap in problem.md, is there at least one claim that substantively addresses it? Flag gaps that are motivated but never resolved.

**Scoring anchors:**
- **5**: Clear logical arc (observations → gaps → insight → solution → claims → evidence), all gaps addressed, no contradictions
- **4**: Strong flow with minor logical gaps or one unaddressed gap
- **3**: General flow present but some disconnects between layers
- **2**: Significant misalignment between problem statement and claims, or unresolved contradictions
- **1**: No coherent logical flow; layers tell different stories

---

#### D5. Exploration Integrity

- **Dead-end quality**: Is the `failure_mode` specific enough to be actionable? ("Didn't work" is bad. "Divergence after 1000 steps due to gradient explosion" is good.) Is the `lesson` a genuine transferable insight?
- **Decision rationale quality**: Do rationales explain WHY the chosen path was preferred over alternatives? Are alternatives real alternatives or strawmen?
- **Rebutted-branch consistency**: Does any claim advocate an approach marked as dead_end or pivot in the tree? (This is a logical contradiction.)
- **Exploration breadth**: For the paper's main design choices, were at least 2 alternatives considered and documented?
- **Honesty signal**: Does the tree document genuine negative results, or does it read like a post-hoc justification? A tree with zero dead-ends or only trivial failures is suspicious.

**Scoring anchors:**
- **5**: Rich tree with well-documented dead-ends (specific failure modes, actionable lessons), thorough decision rationale, genuine negative results
- **4**: Good tree with minor gaps in dead-end documentation or decision rationale
- **3**: Tree present but dead-ends lack specificity or decisions lack alternatives
- **2**: Boilerplate documentation; dead-ends and decisions read as formulaic rather than authentic
- **1**: Tree contradicts claims or reads entirely as post-hoc justification

---

#### D6. Methodological Rigor

- **Baseline adequacy**: Are the right things being compared? Are baselines recent and relevant? Flag experiments with "no baseline" for comparative claims.
- **Ablation coverage**: For claims involving multiple components, does at least one experiment isolate individual contributions?
- **Statistical reporting**: Do experiments mention variance, confidence intervals, number of runs, or statistical tests? Flag single-run results for quantitative claims.
- **Metric-claim alignment**: Does the metric actually measure what the claim asserts? (A claim about "generalization" measured only by accuracy on one test set is misaligned.)
- **Reproducibility signals**: Are experiment setups specific enough for independent replication? (Model name, dataset, hardware, hyperparameters.)

**Scoring anchors:**
- **5**: Comprehensive baselines, proper ablations, statistical rigor, metrics precisely match claims, fully reproducible setup
- **4**: Strong methodology with minor gaps (e.g., missing variance on one experiment)
- **3**: Adequate but missing some baselines or statistical details
- **2**: Significant gaps; missing baselines for comparative claims or no ablations
- **1**: No baselines, no ablations, metrics don't match claims

---

### Step 5: Compile Findings

Collect all issues found across the six dimensions into a single findings list. Assign each finding:

- **finding_id**: F01, F02, ... (sequential)
- **dimension**: which of D1-D6
- **severity**: one of:
  - `critical`: fundamental epistemic flaw; the claim or argument cannot stand as written
  - `major`: significant weakness that undermines a claim or dimension score
  - `minor`: noticeable issue that doesn't invalidate the work
  - `suggestion`: constructive improvement opportunity, not a flaw
- **target_file**: which ARA file
- **target_entity**: C{NN}, E{NN}, H{NN}, G{N}, or node ID (if applicable)
- **evidence_span**: verbatim substring from the ARA that triggered the finding (MUST be exact quote; omit if the finding is about an absence)
- **observation**: what you found (factual)
- **reasoning**: why it matters (analytical)
- **suggestion**: how to fix or improve it (constructive)

Sort findings by severity: critical first, then major, minor, suggestion.
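The severity ordering can be expressed as a rank table and a stable sort. This is a minimal sketch: `sort_findings` is a hypothetical helper, findings are assumed to be dictionaries with a `severity` key, and ties keep their original order. Whether `finding_id` values are assigned before or after sorting is not specified above, so the sketch leaves them untouched.

```python
# Lower rank sorts first: critical, then major, minor, suggestion.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2, "suggestion": 3}


def sort_findings(findings):
    """Return findings ordered by severity; sorted() is stable, so
    findings of equal severity keep their discovery order."""
    return sorted(findings, key=lambda f: SEVERITY_RANK[f["severity"]])
```

An unknown severity string raises `KeyError`, which surfaces typos rather than silently misplacing a finding.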

### Step 6: Compute Overall Grade

Calculate the mean of the six dimension scores. Apply the grade mapping:

| Grade | Condition |
|-------|-----------|
| **Strong Accept** | mean ≥ 4.5 AND no dimension < 3 |
| **Accept** | mean ≥ 3.8 AND no dimension < 2 |
| **Weak Accept** | mean ≥ 3.0 AND no dimension < 2 |
| **Weak Reject** | mean ≥ 2.0 AND (mean < 3.0 OR any dimension < 2) |
| **Reject** | mean < 2.0 OR any dimension = 1 |
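The grade mapping can be sketched as a chain of checks. One assumption needs flagging: as written, the Weak Reject and Reject rows both match when the mean is at least 2.0 and some dimension scores 1, and the table does not say which wins. The sketch below evaluates rows top to bottom and takes the first match, so that case yields Weak Reject; if Reject is meant to take precedence, the last two checks would swap. `overall_grade` is a hypothetical name.

```python
from statistics import mean


def overall_grade(scores):
    """Map six integer dimension scores (1-5) to a grade, first match wins."""
    if len(scores) != 6 or any(s not in (1, 2, 3, 4, 5) for s in scores):
        raise ValueError("expected six integer scores in the range 1-5")
    avg = mean(scores)
    lowest = min(scores)
    if avg >= 4.5 and lowest >= 3:
        return "Strong Accept"
    if avg >= 3.8 and lowest >= 2:
        return "Accept"
    if avg >= 3.0 and lowest >= 2:
        return "Weak Accept"
    # Assumption: rows are tried in table order, so this row is reached
    # before Reject even when a dimension scores 1.
    if avg >= 2.0 and (avg < 3.0 or lowest < 2):
        return "Weak Reject"
    return "Reject"
```

With scores of 4, 4, 4, 4, 3, 4 the mean is 23/6 ≈ 3.83, which clears the 3.8 threshold for Accept.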

### Step 7: Write Report

Write `level2_report.json` to the artifact root:

```json
{
  "artifact": "<name>",
  "artifact_dir": "<path>",
  "review_version": "3.0.0",
  "prerequisite": "Level 1 passed",

  "overall": {
    "grade": "Accept",
    "mean_score": 3.83,
    "one_line_summary": "<1 sentence: what makes this ARA strong or weak>",
    "strengths_summary": ["<top 2-3 strengths across all dimensions>"],
    "weaknesses_summary": ["<top 2-3 weaknesses across all dimensions>"]
  },

  "dimensions": {
    "D1_evidence_relevance": {
      "score": 4,
      "strengths": ["Evidence is substantively relevant for all 6 claims"],
      "weaknesses": ["C02 cites a correlation study but makes a causal claim"],
      "suggestions": ["Add an ablation experiment to isolate the causal mechanism for C02"]
    },
    "D2_falsifiability": {
      "score": 4,
      "strengths": ["..."],
      "weaknesses": ["C02 falsification criteria are hard to operationalize independently"],
      "suggestions": ["Specify a concrete re-annotation protocol for C02"]
    },
    "D3_scope_calibration": { "score": 4, "..." : "..." },
    "D4_argument_coherence": { "score": 4, "..." : "..." },
    "D5_exploration_integrity": { "score": 3, "..." : "..." },
    "D6_methodological_rigor": { "score": 4, "..." : "..." }
  },

  "findings": [
    {
      "finding_id": "F01",
      "dimension": "D6_methodological_rigor",
      "severity": "major",
      "target_file": "logic/experiments.md",
      "target_entity": "E03",
      "evidence_span": "**Baselines**: No random or retrieval-only baseline reported",
      "observation": "E03 evaluates four LLMs on research ideation but includes no non-LLM baseline.",
      "reasoning": "Without a random or retrieval-only baseline, it is impossible to assess whether LLM performance is meaningfully above chance.",
      "suggestion": "Add a retrieval-only baseline (e.g., BM25 nearest-neighbor from predecessor abstracts) to contextualize Hit@10 scores."
    }
  ],

  "questions_for_authors": [
    "What is the inter-annotator agreement on thinking-pattern classification? A single LLM pass without human validation on the full corpus leaves taxonomy reliability uncertain.",
    "..."
  ],

  "read_order": ["PAPER.md", "logic/claims.md", "..."]
}
```

---

## Critical Rules

1. **Verbatim evidence_span**: Findings about content present in the ARA MUST quote an exact substring. Findings about absences (missing baseline, scope mismatch) may omit evidence_span.

2. **Constructive tone**: Every weakness must come with a suggestion. You are helping authors improve, not punishing them.

3. **Calibrated scoring**: Most competent ARAs should land in the 3-4 range. A score of 5 means genuinely excellent, not just "no problems found." A score of 1 means fundamental problems, not just "could be better."

4. **No false grounding**: Support must flow through Proof → experiments.md → evidence/. Agreement in prose (problem.md, architecture.md) does not substitute for experimental evidence.

5. **Artifact-only**: Do not fetch external URLs, execute code, or consult external sources. Take the ARA's reported evidence at face value.

6. **Balanced review**: Actively look for strengths, not just weaknesses. A review that only lists problems is not useful.

7. **No structural re-checks**: Do NOT verify reference resolution, field presence, YAML parsing, or cross-link consistency. Level 1 has already validated all of this. Focus entirely on whether the *content* is epistemically sound.

---

## Reference

See `references/review-dimensions.md` for scoring anchor details and check inventories per dimension.
@@ -0,0 +1,181 @@
|
|
|
1
|
+
# Level 2 Review Dimensions — Scoring Anchors and Check Inventory
|
|
2
|
+
|
|
3
|
+
Six dimensions of epistemic quality. All checks are semantic: they require reading
|
|
4
|
+
comprehension and reasoning over the ARA's content. Structural validation (reference
|
|
5
|
+
resolution, field presence, YAML parsing) is handled entirely by Level 1.
|
|
6
|
+
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
## D1. Evidence Relevance

**Question**: Does the cited evidence actually support each claim in substance, not just by reference?

### Checks

| Check | What to verify | Finding severity |
|-------|---------------|-----------------|
| Relevance | Experiment's Setup/Procedure addresses what the claim actually asserts | major |
| Type-aware entailment | Experiment design matches claim type (causal→ablation, generalization→heterogeneous, improvement→baseline, descriptive→sampling, scoping→bounds) | major |
| Evidence sufficiency | Is a single experiment enough to support this claim, or are multiple needed? | suggestion |

### Scoring Anchors

| Score | Description |
|-------|-------------|
| 5 | Type-appropriate, relevant evidence for every claim; multi-experiment support where needed |
| 4 | Evidence relevant for all claims, minor type mismatches |
| 3 | Most claim-experiment pairs relevant, 1-2 weak matches |
| 2 | Multiple claims where cited experiments don't substantively address the claim |
| 1 | Majority of claims cite experiments irrelevant to their statements |
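
The type-aware entailment check pairs each claim type with the experiment design it needs. The sketch below encodes the five pairs from the D1 Checks table as a lookup. The function `design_matches`, the lowercase labels, and the premise that a reviewer has already assigned one design label to the experiment are assumptions for illustration; assigning those labels still requires the semantic reading this dimension calls for.

```python
# Claim type -> experiment design expected to support it (from the D1 Checks table).
EXPECTED_DESIGN = {
    "causal": "ablation",
    "generalization": "heterogeneous",
    "improvement": "baseline",
    "descriptive": "sampling",
    "scoping": "bounds",
}


def design_matches(claim_type: str, experiment_design: str) -> bool:
    """Return True when the experiment design fits the claim type.

    Raises KeyError for a claim type outside the five listed ones, so an
    unknown label is surfaced instead of silently passing.
    """
    return EXPECTED_DESIGN[claim_type] == experiment_design


print(design_matches("causal", "ablation"))       # True
print(design_matches("improvement", "sampling"))  # False, a candidate major finding
```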

---

## D2. Falsifiability Quality

**Question**: Are claims genuinely falsifiable with meaningful, actionable criteria?

### Checks

| Check | What to verify | Finding severity |
|-------|---------------|-----------------|
| Actionability | Could an independent researcher execute this? Does it specify what to measure, the failure threshold, and the conditions? | major |
| Non-triviality | Is the criterion more than a tautology? ("If the method doesn't work" = trivial) | major |
| Scope match | Does the criterion address the same scope as the Statement? | major |
| Independence | Could it be tested without proprietary data or systems? | minor |

### Scoring Anchors

| Score | Description |
|-------|-------------|
| 5 | Every claim has specific, actionable, independently testable criteria matching claim scope |
| 4 | Most criteria are strong, 1-2 vague or hard to operationalize |
| 3 | Mixed; some actionable, some trivial or scope-mismatched |
| 2 | Most criteria trivial, tautological, or scope-mismatched |
| 1 | Criteria meaningless across claims |

---

## D3. Scope Calibration

**Question**: Do claims assert exactly what their evidence supports, no more and no less?

### Checks

| Check | What to verify | Finding severity |
|-------|---------------|-----------------|
| Over-claiming | Statement uses universal scope while evidence covers narrow conditions | critical if extreme, major if moderate |
| Under-claiming | Evidence files or experiment results not captured by any claim | minor |
| Assumption explicitness | Key assumptions stated in problem.md or constraints.md | major if unstated assumptions affect validity |
| Generalization boundaries | Artifact states what claims do NOT apply to | minor |
| Qualifier consistency | Hedging language matches evidence strength | minor |

### Scoring Anchors

| Score | Description |
|-------|-------------|
| 5 | All claims precisely match evidence scope, assumptions explicit, limits stated |
| 4 | Well-scoped with minor gaps in assumption documentation |
| 3 | Some claims slightly over/under-reach, assumptions partially stated |
| 2 | Multiple over-claims or significant undocumented assumptions |
| 1 | Pervasive scope mismatch between claims and evidence |

---

## D4. Argument Coherence

**Question**: Does the argument follow a coherent path from problem to solution to evidence?

### Checks

| Check | What to verify | Finding severity |
|-------|---------------|-----------------|
| Observation → Gap derivation | Gaps follow logically from observations | major |
| Gap → Insight connection | Key insight addresses the identified gaps | major |
| Insight → Solution alignment | Solution architecture implements the key insight | major |
| Solution → Claims coverage | Claims cover the solution's main contributions | minor |
| Cross-layer consistency | Claims, tree, and evidence tell the same story | major |
| Narrative completeness | Motivating questions are answered or explicitly deferred | minor |
| Gap coverage | Every gap is substantively addressed by at least one claim | major |

### Scoring Anchors

| Score | Description |
|-------|-------------|
| 5 | Clear arc from observations → gaps → insight → solution → claims → evidence, all gaps addressed |
| 4 | Strong flow with minor gaps or one unaddressed gap |
| 3 | General flow present but disconnects between layers |
| 2 | Significant misalignment between problem and claims, or contradictions |
| 1 | No coherent logical flow; layers tell different stories |

---

## D5. Exploration Integrity

**Question**: Does the exploration tree faithfully document the research journey?

### Checks

| Check | What to verify | Finding severity |
|-------|---------------|-----------------|
| Dead-end specificity | failure_mode is concrete, lesson is transferable | major |
| Decision rationale quality | Rationale explains why chosen path preferred over real alternatives | major |
| Rebutted-branch consistency | No claim advocates a dead_end or pivot approach | critical |
| Exploration breadth | Main design choices have ≥2 documented alternatives | minor |
| Honesty signal | Tree documents genuine negatives, not post-hoc justification | suggestion |

### Scoring Anchors

| Score | Description |
|-------|-------------|
| 5 | Rich tree, specific failure modes, actionable lessons, thorough rationale, genuine negatives |
| 4 | Good tree with minor gaps in dead-end or decision documentation |
| 3 | Tree present but dead-ends lack specificity or decisions lack alternatives |
| 2 | Boilerplate documentation; dead-ends and decisions read as formulaic |
| 1 | Tree contradicts claims or reads entirely as post-hoc justification |

---

## D6. Methodological Rigor

**Question**: Are experiments well-designed with adequate baselines and reporting?

### Checks

| Check | What to verify | Finding severity |
|-------|---------------|-----------------|
| Baseline adequacy | Right things compared? Baselines recent and relevant? | major |
| Ablation coverage | Multi-component claims have experiments isolating individual contributions | major |
| Statistical reporting | Variance, CI, number of runs, or tests mentioned | major for quantitative claims |
| Metric-claim alignment | Metric measures what claim asserts | major |
| Reproducibility signals | Setup specific enough for replication (model, dataset, hardware, hyperparameters) | minor |

### Scoring Anchors

| Score | Description |
|-------|-------------|
| 5 | Comprehensive baselines, proper ablations, statistical rigor, precise metric-claim alignment |
| 4 | Strong methodology with minor gaps |
| 3 | Adequate but missing some baselines or statistical details |
| 2 | Significant gaps; missing baselines for comparative claims or no ablations |
| 1 | No baselines, no ablations, metrics don't match claims |

---

## Overall Grade Mapping

| Grade | Condition |
|-------|-----------|
| **Strong Accept** | mean ≥ 4.5 AND no dimension < 3 |
| **Accept** | mean ≥ 3.8 AND no dimension < 2 |
| **Weak Accept** | mean ≥ 3.0 AND no dimension < 2 |
| **Weak Reject** | mean ≥ 2.0 AND (mean < 3.0 OR any dimension < 2) |
| **Reject** | mean < 2.0 OR any dimension = 1 |
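
The grade conditions can be written as a small function. This is a sketch under stated assumptions, not the package's implementation: scores are integers from 1 to 5 keyed by hypothetical dimension labels `D1` to `D6`, and rows are tried top to bottom with the first match winning, since the table does not state an evaluation order. Under that reading, a dimension score of 1 with mean ≥ 2.0 satisfies both the Weak Reject and Reject rows and resolves to Weak Reject; if any score of 1 is meant to force Reject, the Reject test would need to run first.

```python
from statistics import mean
from typing import Dict


def overall_grade(scores: Dict[str, int]) -> str:
    """Map per-dimension scores (1-5) to an overall grade.

    Rows of the grade table are tried top to bottom and the first match wins
    (an assumption; the table does not state an evaluation order).
    """
    values = list(scores.values())
    avg = mean(values)
    lowest = min(values)
    if avg >= 4.5 and lowest >= 3:
        return "Strong Accept"
    if avg >= 3.8 and lowest >= 2:
        return "Accept"
    if avg >= 3.0 and lowest >= 2:
        return "Weak Accept"
    if avg >= 2.0 and (avg < 3.0 or lowest < 2):
        return "Weak Reject"
    return "Reject"


# Hypothetical score sheets keyed by dimension label.
print(overall_grade({"D1": 5, "D2": 5, "D3": 4, "D4": 5, "D5": 4, "D6": 5}))  # Strong Accept
print(overall_grade({"D1": 4, "D2": 4, "D3": 4, "D4": 4, "D5": 3, "D6": 4}))  # Accept
print(overall_grade({"D1": 4, "D2": 4, "D3": 4, "D4": 4, "D5": 1, "D6": 4}))  # Weak Reject
print(overall_grade({"D1": 1, "D2": 2, "D3": 2, "D4": 2, "D5": 2, "D6": 2}))  # Reject
```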

## Finding Severity Definitions

| Severity | Meaning | Example |
|----------|---------|---------|
| `critical` | Fundamental epistemic flaw; the claim or argument cannot stand as written | Causal claim supported only by correlation; claim advocates a dead-end approach |
| `major` | Significant weakness that undermines a claim or dimension | Comparative claim with no baseline; trivial falsification criteria; metric doesn't match claim |
| `minor` | Noticeable issue that doesn't invalidate the work | Missing generalization boundaries; hedging inconsistent with evidence |
| `suggestion` | Constructive improvement, not a flaw | Adding a retrieval baseline for context; documenting exploration breadth |
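
The four severities read as an ordered scale from `suggestion` up to `critical`. A minimal sketch of that ordering follows, useful for listing the most serious findings first in a review. The `Severity` enum, the `by_severity` helper, and the dict layout of a finding are hypothetical; only the four severity names and the check names come from this document.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Finding severities, ordered from least to most serious."""
    SUGGESTION = 0
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


# Hypothetical findings; the `check` and `severity` keys are illustrative.
findings = [
    {"check": "Generalization boundaries", "severity": "minor"},
    {"check": "Rebutted-branch consistency", "severity": "critical"},
    {"check": "Evidence sufficiency", "severity": "suggestion"},
    {"check": "Baseline adequacy", "severity": "major"},
]


def by_severity(items):
    """Return findings sorted most serious first; ties keep input order."""
    return sorted(items, key=lambda f: Severity[f["severity"].upper()], reverse=True)


for finding in by_severity(findings):
    print(finding["severity"], "-", finding["check"])
```

An unrecognized severity string raises `KeyError` at sort time, which keeps typos such as `"mayor"` from slipping into a report unnoticed.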
|