@machinespirits/eval 0.2.0 → 0.3.0
- package/README.md +91 -9
- package/config/eval-settings.yaml +3 -3
- package/config/paper-manifest.json +486 -0
- package/config/providers.yaml +9 -6
- package/config/tutor-agents.yaml +2261 -0
- package/content/README.md +23 -0
- package/content/courses/479/course.md +53 -0
- package/content/courses/479/lecture-1.md +361 -0
- package/content/courses/479/lecture-2.md +360 -0
- package/content/courses/479/lecture-3.md +655 -0
- package/content/courses/479/lecture-4.md +530 -0
- package/content/courses/479/lecture-5.md +326 -0
- package/content/courses/479/lecture-6.md +346 -0
- package/content/courses/479/lecture-7.md +326 -0
- package/content/courses/479/lecture-8.md +273 -0
- package/content/courses/479/roadmap-slides.md +656 -0
- package/content/manifest.yaml +8 -0
- package/docs/research/build.sh +44 -20
- package/docs/research/figures/figure10.png +0 -0
- package/docs/research/figures/figure11.png +0 -0
- package/docs/research/figures/figure3.png +0 -0
- package/docs/research/figures/figure4.png +0 -0
- package/docs/research/figures/figure5.png +0 -0
- package/docs/research/figures/figure6.png +0 -0
- package/docs/research/figures/figure7.png +0 -0
- package/docs/research/figures/figure8.png +0 -0
- package/docs/research/figures/figure9.png +0 -0
- package/docs/research/header.tex +23 -2
- package/docs/research/paper-full.md +941 -285
- package/docs/research/paper-short.md +216 -585
- package/docs/research/references.bib +132 -0
- package/docs/research/slides-header.tex +188 -0
- package/docs/research/slides-pptx.md +363 -0
- package/docs/research/slides.md +531 -0
- package/docs/research/style-reference-pptx.py +199 -0
- package/package.json +6 -5
- package/scripts/analyze-eval-results.js +69 -17
- package/scripts/analyze-mechanism-traces.js +763 -0
- package/scripts/analyze-modulation-learning.js +498 -0
- package/scripts/analyze-prosthesis.js +144 -0
- package/scripts/analyze-run.js +264 -79
- package/scripts/assess-transcripts.js +853 -0
- package/scripts/browse-transcripts.js +854 -0
- package/scripts/check-parse-failures.js +73 -0
- package/scripts/code-dialectical-modulation.js +1320 -0
- package/scripts/download-data.sh +55 -0
- package/scripts/eval-cli.js +106 -18
- package/scripts/generate-paper-figures.js +663 -0
- package/scripts/generate-paper-figures.py +577 -76
- package/scripts/generate-paper-tables.js +299 -0
- package/scripts/qualitative-analysis-ai.js +3 -3
- package/scripts/render-sequence-diagram.js +694 -0
- package/scripts/test-latency.js +210 -0
- package/scripts/test-rate-limit.js +95 -0
- package/scripts/test-token-budget.js +332 -0
- package/scripts/validate-paper-manifest.js +670 -0
- package/services/__tests__/evalConfigLoader.test.js +2 -2
- package/services/__tests__/learnerRubricEvaluator.test.js +361 -0
- package/services/__tests__/learnerTutorInteractionEngine.test.js +326 -0
- package/services/evaluationRunner.js +975 -98
- package/services/evaluationStore.js +12 -4
- package/services/learnerTutorInteractionEngine.js +27 -2
- package/services/mockProvider.js +133 -0
- package/services/promptRewriter.js +1471 -5
- package/services/rubricEvaluator.js +55 -2
- package/services/transcriptFormatter.js +675 -0
- package/docs/EVALUATION-VARIABLES.md +0 -589
- package/docs/REPLICATION-PLAN.md +0 -577
- package/scripts/analyze-run.mjs +0 -282
- package/scripts/compare-runs.js +0 -44
- package/scripts/compare-suggestions.js +0 -80
- package/scripts/dig-into-run.js +0 -158
- package/scripts/show-failed-suggestions.js +0 -64
- /package/scripts/{check-run.mjs → check-run.js} +0 -0
package/docs/REPLICATION-PLAN.md
DELETED
|
@@ -1,577 +0,0 @@
|
|
|
1
|
-
# Comprehensive Replication Plan
|
|
2
|
-
|
|
3
|
-
## Study: "The Drama Machine in Education"
|
|
4
|
-
|
|
5
|
-
**Paper**: PAPER-FULL-2026-02-04.md (v1.5)
|
|
6
|
-
**Original**: 14 key runs, N=1,010 scored, N=3,800+ in full database
|
|
7
|
-
**Estimated replication cost**: ~$65–$90 USD (ego generation + Opus judging + GPT-5.2 cross-judge)
|
|
8
|
-
**Estimated wall-clock time**: ~48–72 hours (with parallelism=2, accounting for API rate limits)
|
|
9
|
-
|
|
10
|
-
---
|
|
11
|
-
|
|
12
|
-
## 1. Prerequisites
|
|
13
|
-
|
|
14
|
-
### 1.1 Software Dependencies
|
|
15
|
-
|
|
16
|
-
| Component | Version | Source |
|
|
17
|
-
|-----------|---------|--------|
|
|
18
|
-
| Node.js | >= 18.0.0 | nodejs.org |
|
|
19
|
-
| `@machinespirits/tutor-core` | 0.3.1 | Linked locally (`../machinespirits-tutor-core`) |
|
|
20
|
-
| `better-sqlite3` | 12.5.0 | npm |
|
|
21
|
-
| `dotenv` | 17.2.3 | npm |
|
|
22
|
-
| `express` | 4.19.2 | npm |
|
|
23
|
-
| `jsonrepair` | 3.13.2 | npm |
|
|
24
|
-
| `yaml` | 2.8.2 | npm |
|
|
25
|
-
|
|
26
|
-
**Setup**:
|
|
27
|
-
```bash
|
|
28
|
-
cd <path-to-machinespirits-eval>
|
|
29
|
-
npm install
|
|
30
|
-
```
|
|
31
|
-
|
|
32
|
-
### 1.2 External Content Packages
|
|
33
|
-
|
|
34
|
-
| Package | Location | Purpose |
|
|
35
|
-
|---------|----------|---------|
|
|
36
|
-
| `machinespirits-content-philosophy` | `../machinespirits-content-philosophy` | Primary domain (Hegel, graduate philosophy) |
|
|
37
|
-
| `content-test-elementary` | `./content-test-elementary/` | Domain generalizability (4th-grade fractions) |
|
|
38
|
-
|
|
39
|
-
Both are present in the current environment. A replicator needs access to these repositories.
|
|
40
|
-
|
|
41
|
-
### 1.3 API Keys (in `.env`)
|
|
42
|
-
|
|
43
|
-
| Provider | Models Used | Estimated Cost Fraction |
|
|
44
|
-
|----------|-------------|------------------------|
|
|
45
|
-
| **OpenRouter** | Kimi K2.5 (free-tier ego), Nemotron 3 Nano 30B (free-tier ego) | ~$5 (superego calls only; ego is free) |
|
|
46
|
-
| **Anthropic** | Claude Opus (judge via Claude Code CLI) | ~$40–55 (primary judge, largest cost) |
|
|
47
|
-
| **OpenAI (via OpenRouter)** | GPT-5.2 (cross-judge validation) | ~$18–25 (rejudge only) |
|
|
48
|
-
|
|
49
|
-
**Critical**: The primary judge uses the Claude Code CLI, which invokes Opus directly via the Anthropic API. OpenRouter is used for the ego/superego models and for GPT-5.2 rejudging.
|
|
50
|
-
|
|
51
|
-
### 1.4 Model Availability Risk
|
|
52
|
-
|
|
53
|
-
| Model | Risk | Mitigation |
|
|
54
|
-
|-------|------|------------|
|
|
55
|
-
| Kimi K2.5 (OpenRouter free) | May be retired/updated | Pin model ID in providers.yaml; document exact version |
|
|
56
|
-
| Nemotron 3 Nano 30B (OpenRouter free) | May be retired | Same; only needed for A×B interaction and domain gen |
|
|
57
|
-
| Claude Opus | Stable (Anthropic tier) | Low risk |
|
|
58
|
-
| GPT-5.2 | Stable (OpenAI) | Low risk |
|
|
59
|
-
|
|
60
|
-
### 1.5 Database
|
|
61
|
-
|
|
62
|
-
- **Fresh start**: Back up existing `data/evaluations.db`, then either use a fresh DB or prefix run IDs to distinguish replication from original.
|
|
63
|
-
- **Recommended**: Use a separate database file (e.g., `data/evaluations-replication.db`) by modifying the DB path in `evaluationStore.js`, or run the replication in the existing database and rely on run IDs to tell the replication apart from the original.
|
|
64
|
-
|
|
65
|
-
---
|
|
66
|
-
|
|
67
|
-
## 2. Replication Phases
|
|
68
|
-
|
|
69
|
-
The study comprises 9 distinct experimental phases (producing 14 runs). We recommend executing them in dependency order. **Phases 1–5 are independent and can run in parallel if API rate limits allow.**
|
|
70
|
-
|
|
71
|
-
### Phase 1: Recognition Validation (Section 6.1)
|
|
72
|
-
**Purpose**: 3-way comparison — base vs enhanced vs recognition
|
|
73
|
-
**Original run**: `eval-2026-02-03-86b159cd` (N=36 scored)
|
|
74
|
-
**Design**: 3 profiles × 4 scenarios × 3 replications
|
|
75
|
-
|
|
76
|
-
```bash
|
|
77
|
-
node scripts/eval-cli.js run \
|
|
78
|
-
--profiles cell_1_base_single_unified,cell_9_enhanced_single_unified,cell_5_recog_single_unified \
|
|
79
|
-
--scenarios struggling_learner,concept_confusion,mood_frustrated_explicit,high_performer \
|
|
80
|
-
--runs 3 \
|
|
81
|
-
--description "Replication: Recognition validation (Section 6.1)"
|
|
82
|
-
```
|
|
83
|
-
|
|
84
|
-
**Expected output**: N=36 scored responses
|
|
85
|
-
**Key metrics to verify**:
|
|
86
|
-
- Recognition > Enhanced > Base ordering
|
|
87
|
-
- One-way ANOVA F(2,33) significant
|
|
88
|
-
- Recognition vs Enhanced gap ≈ +8.7 pts (original)
|
|
89
|
-
- Recognition vs Base gap ≈ +20.1 pts (original)
|
|
90
|
-
|
|
91
|
-
**Analysis**:
|
|
92
|
-
```bash
|
|
93
|
-
node scripts/eval-cli.js report <run-id>
|
|
94
|
-
node scripts/analyze-eval-results.js <run-id>
|
|
95
|
-
```
|
|
96
|
-
|
|
97
|
-
---
|
|
98
|
-
|
|
99
|
-
### Phase 2: Full 2×2×2 Factorial (Section 6.3)
|
|
100
|
-
**Purpose**: Main factorial design — Recognition × Architecture × Learner
|
|
101
|
-
**Original runs**: `eval-2026-02-03-f5d4dd93` (cells 1–5,7, N=262) + `eval-2026-02-06-a933d745` (cells 6,8, N=88)
|
|
102
|
-
**Design**: 8 cells × 15 scenarios × 3 replications = 360 planned, expect ~350 scored
|
|
103
|
-
|
|
104
|
-
**Important**: The original ran as two separate runs because cells 6 and 8 needed re-running with corrected learner prompts. For replication, all 8 cells can run together since the prompts are now correct.
|
|
105
|
-
|
|
106
|
-
```bash
|
|
107
|
-
node scripts/eval-cli.js run \
|
|
108
|
-
--profiles cell_1_base_single_unified,cell_2_base_single_psycho,cell_3_base_multi_unified,cell_4_base_multi_psycho,cell_5_recog_single_unified,cell_6_recog_single_psycho,cell_7_recog_multi_unified,cell_8_recog_multi_psycho \
|
|
109
|
-
--runs 3 \
|
|
110
|
-
--description "Replication: Full 2x2x2 factorial (Section 6.3)"
|
|
111
|
-
```
|
|
112
|
-
|
|
113
|
-
**Expected output**: ~350 scored responses (8 × 15 × 3 = 360 attempted)
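
The attempt counts quoted for each phase follow from a fully crossed design: profiles × scenarios × replications. A minimal sketch of that arithmetic, assuming every profile runs every scenario; `plannedAttempts` is a hypothetical helper for illustration and is not part of `eval-cli.js`:

```javascript
// Hypothetical helper (not part of eval-cli.js):
// number of attempts in a fully crossed design.
function plannedAttempts(profiles, scenarios, runs) {
  return profiles * scenarios * runs;
}

// Designs as stated in this plan.
const phase1 = plannedAttempts(3, 4, 3);  // recognition validation
const phase2 = plannedAttempts(8, 15, 3); // full 2x2x2 factorial

console.log({ phase1, phase2 });
```

Comparing the planned count with the scored count after a run gives a quick view of how many attempts failed or went unscored.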
|
|
114
|
-
**Key metrics to verify**:
|
|
115
|
-
- Recognition main effect: +10.2 pts, F(1,342)≈71, p<.001, η²≈.16
|
|
116
|
-
- Architecture main effect: ~+0.9, n.s.
|
|
117
|
-
- A×C Interaction (Recognition × Learner): F≈22, p<.001
|
|
118
|
-
- Unified learner: recognition +15.5 pts
|
|
119
|
-
- Psychodynamic learner: recognition +4.8 pts
|
|
120
|
-
- Cell means ordering: 5≈7 > 6≈8 > 4≈2 > 1≈3
|
|
121
|
-
|
|
122
|
-
**Analysis**:
|
|
123
|
-
```bash
|
|
124
|
-
node scripts/analyze-eval-results.js <run-id>
|
|
125
|
-
# This will compute the full ANOVA table
|
|
126
|
-
```
|
|
127
|
-
|
|
128
|
-
---
|
|
129
|
-
|
|
130
|
-
### Phase 3: Memory Isolation 2×2 (Section 6.2)
|
|
131
|
-
**Purpose**: Disentangle recognition from memory
|
|
132
|
-
**Original runs**: `eval-2026-02-06-81f2d5a1` (N=60) + `eval-2026-02-06-ac9ea8f5` (N=62)
|
|
133
|
-
**Design**: 4 cells × 15 scenarios × 1 rep per run, 2 independent runs
|
|
134
|
-
|
|
135
|
-
The memory isolation uses cells 19–20 (memory isolation profiles). Check `config/tutor-agents.yaml` for the exact profile names.
|
|
136
|
-
|
|
137
|
-
```bash
|
|
138
|
-
# Run 1
|
|
139
|
-
node scripts/eval-cli.js run \
|
|
140
|
-
--profiles cell_19_base_nomem,cell_19_base_mem,cell_20_recog_nomem,cell_20_recog_mem \
|
|
141
|
-
--runs 1 \
|
|
142
|
-
--description "Replication: Memory isolation run 1 (Section 6.2)"
|
|
143
|
-
|
|
144
|
-
# Run 2 (independent replication)
|
|
145
|
-
node scripts/eval-cli.js run \
|
|
146
|
-
--profiles cell_19_base_nomem,cell_19_base_mem,cell_20_recog_nomem,cell_20_recog_mem \
|
|
147
|
-
--runs 1 \
|
|
148
|
-
--description "Replication: Memory isolation run 2 (Section 6.2)"
|
|
149
|
-
```
|
|
150
|
-
|
|
151
|
-
**NOTE**: Verify the exact profile names in `config/tutor-agents.yaml` — the memory isolation profiles may use different naming conventions (e.g., `mem_iso_base_nomem`, `mem_iso_recog_mem`, etc.). The paper states N=30 per cell across two runs, suggesting each run has ~15 per cell (4 cells × 15 scenarios × 1 rep = 60 per run).
|
|
152
|
-
|
|
153
|
-
**Expected output**: N=120 across both runs (30 per cell)
|
|
154
|
-
**Key metrics to verify**:
|
|
155
|
-
- Recognition effect: d≈1.71, +15.2 pts without memory
|
|
156
|
-
- Memory effect: d≈0.46, +4.8 pts, p≈.08
|
|
157
|
-
- Interaction: -4.2 (negative — ceiling effect)
|
|
158
|
-
- Condition ordering: Recog+Mem ≥ Recog Only >> Mem Only > Base
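
The interaction term in a 2×2 design is the difference of differences between the four cell means. The sketch below shows the contrast; the offsets are reconstructed from the figures above under the assumption that +15.2 and +4.8 are simple effects measured against the base cell, so they are illustrative and not measured cell means.

```javascript
// 2x2 interaction contrast:
// (recog+mem - recog only) - (mem only - base).
function interaction2x2({ base, memOnly, recogOnly, recogMem }) {
  return (recogMem - recogOnly) - (memOnly - base);
}

// Offsets relative to the base cell. ASSUMPTION: reconstructed from the
// reported effects (+15.2 recognition, +4.8 memory, -4.2 interaction);
// these are not the study's measured cell means.
const offsets = { base: 0, memOnly: 4.8, recogOnly: 15.2, recogMem: 15.8 };

const contrast = interaction2x2(offsets);
console.log(contrast.toFixed(1));
```

A negative contrast means memory adds less on top of recognition than it adds on top of the base condition, which is the ceiling pattern described above.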
|
|
159
|
-
|
|
160
|
-
---
|
|
161
|
-
|
|
162
|
-
### Phase 4: Active Control (Section 6.2)
|
|
163
|
-
**Purpose**: Test whether generic pedagogical elaboration accounts for recognition gains
|
|
164
|
-
**Original run**: `eval-2026-02-06-a9ae06ee` (N=118 scored)
|
|
165
|
-
**Design**: Cells 15–18 (placebo control profiles)
|
|
166
|
-
|
|
167
|
-
**MODEL NOTE**: The original used Nemotron as ego (not Kimi). This is a known confound documented in the paper. For a fair replication, you should run both:
|
|
168
|
-
1. The active control on Nemotron (replicating the original)
|
|
169
|
-
2. Optionally, the active control on Kimi (resolving the model confound)
|
|
170
|
-
|
|
171
|
-
```bash
|
|
172
|
-
# Active control with Nemotron (replicating original)
|
|
173
|
-
node scripts/eval-cli.js run \
|
|
174
|
-
--profiles cell_15_placebo_single_unified,cell_16_placebo_single_psycho,cell_17_placebo_multi_unified,cell_18_placebo_multi_psycho \
|
|
175
|
-
--runs 3 \
|
|
176
|
-
--description "Replication: Active control / placebo (Section 6.2)"
|
|
177
|
-
```
|
|
178
|
-
|
|
179
|
-
**Expected output**: ~118 scored
|
|
180
|
-
**Key metrics to verify**:
|
|
181
|
-
- Overall mean ≈ 66.5 (Nemotron)
|
|
182
|
-
- Same-model comparison: +9 pts above Nemotron base, below recognition (~73)
|
|
183
|
-
|
|
184
|
-
---
|
|
185
|
-
|
|
186
|
-
### Phase 5: A×B Interaction (Section 6.4)
|
|
187
|
-
**Purpose**: Test whether multi-agent synergy requires recognition prompts
|
|
188
|
-
**Original runs**: `eval-2026-02-04-948e04b3` (Nemotron, N=17) + `eval-2026-02-05-10b344fb` (Kimi, N=60)
|
|
189
|
-
|
|
190
|
-
#### 5a: Nemotron A×B test
|
|
191
|
-
```bash
|
|
192
|
-
# This requires configuring Nemotron as the ego model
|
|
193
|
-
# Check if there are specific Nemotron profile overrides
|
|
194
|
-
node scripts/eval-cli.js run \
|
|
195
|
-
--profiles cell_5_recog_single_unified,cell_7_recog_multi_unified,cell_9_enhanced_single_unified,cell_11_enhanced_multi_unified \
|
|
196
|
-
--scenarios struggling_learner,concept_confusion,mood_frustrated_explicit \
|
|
197
|
-
--runs 3 \
|
|
198
|
-
--description "Replication: A×B interaction Nemotron (Section 6.4)"
|
|
199
|
-
```
|
|
200
|
-
|
|
201
|
-
**NOTE**: The original Nemotron run had only N=17 scored (a small sample). The profiles may need a model override to use Nemotron instead of the default Kimi. Check whether Nemotron-specific profiles exist or whether the CLI supports model overrides.
|
|
202
|
-
|
|
203
|
-
#### 5b: Kimi A×B replication
|
|
204
|
-
```bash
|
|
205
|
-
node scripts/eval-cli.js run \
|
|
206
|
-
--profiles cell_5_recog_single_unified,cell_7_recog_multi_unified,cell_9_enhanced_single_unified,cell_11_enhanced_multi_unified \
|
|
207
|
-
--runs 3 \
|
|
208
|
-
--description "Replication: A×B replication Kimi (Section 6.4)"
|
|
209
|
-
```
|
|
210
|
-
|
|
211
|
-
**Expected output**: N≈60
|
|
212
|
-
**Key metrics to verify**:
|
|
213
|
-
- Kimi: A×B interaction ≈ +1.35 (negligible, confirming non-replication of Nemotron finding)
|
|
214
|
-
- Recognition cells ≈ 90.6 regardless of architecture
|
|
215
|
-
- Enhanced cells ≈ 80.6
|
|
216
|
-
|
|
217
|
-
---
|
|
218
|
-
|
|
219
|
-
### Phase 6: Domain Generalizability (Section 6.5)
|
|
220
|
-
**Purpose**: Test recognition effects on elementary math content
|
|
221
|
-
**Original runs**: `eval-2026-02-04-79b633ca` (Nemotron, N=47) + `eval-2026-02-05-e87f452d` (Kimi, N=60)
|
|
222
|
-
|
|
223
|
-
#### 6a: Kimi elementary replication
|
|
224
|
-
```bash
|
|
225
|
-
EVAL_CONTENT_PATH=./content-test-elementary \
|
|
226
|
-
EVAL_SCENARIOS_FILE=./content-test-elementary/scenarios-elementary.yaml \
|
|
227
|
-
node scripts/eval-cli.js run \
|
|
228
|
-
--profiles cell_1_base_single_unified,cell_3_base_multi_unified,cell_5_recog_single_unified,cell_7_recog_multi_unified \
|
|
229
|
-
--runs 3 \
|
|
230
|
-
--description "Replication: Domain gen Kimi elementary (Section 6.5)"
|
|
231
|
-
```
|
|
232
|
-
|
|
233
|
-
#### 6b: Nemotron elementary (if Nemotron profiles available)
|
|
234
|
-
```bash
|
|
235
|
-
EVAL_CONTENT_PATH=./content-test-elementary \
|
|
236
|
-
EVAL_SCENARIOS_FILE=./content-test-elementary/scenarios-elementary.yaml \
|
|
237
|
-
node scripts/eval-cli.js run \
|
|
238
|
-
--profiles cell_1_base_single_unified,cell_3_base_multi_unified,cell_5_recog_single_unified,cell_7_recog_multi_unified \
|
|
239
|
-
--runs 1 \
|
|
240
|
-
--description "Replication: Domain gen Nemotron elementary (Section 6.5)"
|
|
241
|
-
```
|
|
242
|
-
|
|
243
|
-
**Expected output**: N≈60 (Kimi), N≈47 (Nemotron)
|
|
244
|
-
**Key metrics to verify**:
|
|
245
|
-
- Kimi: Recognition +9.9 pts (d≈0.61)
|
|
246
|
-
- Scenario-dependent: frustrated_student +23.8, neutral scenarios ~0
|
|
247
|
-
|
|
248
|
-
---
|
|
249
|
-
|
|
250
|
-
### Phase 7: Bilateral Transformation (Section 6.11)
|
|
251
|
-
**Purpose**: Multi-turn dialogues measuring tutor adaptation and learner growth
|
|
252
|
-
**Original run**: `eval-2026-02-07-b6d75e87` (N=118 scored, 3 multi-turn scenarios)
|
|
253
|
-
|
|
254
|
-
```bash
|
|
255
|
-
node scripts/eval-cli.js run \
|
|
256
|
-
--profiles cell_1_base_single_unified,cell_2_base_single_psycho,cell_3_base_multi_unified,cell_4_base_multi_psycho,cell_5_recog_single_unified,cell_6_recog_single_psycho,cell_7_recog_multi_unified,cell_8_recog_multi_psycho \
|
|
257
|
-
--scenarios misconception_correction_flow,mood_frustration_to_breakthrough,mutual_transformation_journey \
|
|
258
|
-
--runs 5 \
|
|
259
|
-
--description "Replication: Bilateral transformation multi-turn (Section 6.11)"
|
|
260
|
-
```
|
|
261
|
-
|
|
262
|
-
**Expected output**: ~118 scored dialogues
|
|
263
|
-
**Key metrics to verify**:
|
|
264
|
-
- Tutor Adaptation Index: base≈0.332, recognition≈0.418 (+26%)
|
|
265
|
-
- Learner Growth Index: base≈0.242, recognition≈0.210 (lower — reversal)
|
|
266
|
-
- Misconception correction: largest adaptation gap (+0.175)
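
The percentage figures for the two indices are relative changes from the base condition. A small sketch of that calculation, using the index values listed above; `relativeChange` is a hypothetical helper name:

```javascript
// Relative change of `value` with respect to `base` (hypothetical helper).
function relativeChange(base, value) {
  return (value - base) / base;
}

// Index values as listed in the key metrics above.
const tutorChange = relativeChange(0.332, 0.418);   // Tutor Adaptation Index
const learnerChange = relativeChange(0.242, 0.210); // Learner Growth Index

console.log(
  `tutor: ${(tutorChange * 100).toFixed(1)}%`,
  `learner: ${(learnerChange * 100).toFixed(1)}%`
);
```

Applying the same function to the replication values makes the comparison with the original independent of the absolute index scale.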
|
|
267
|
-
|
|
268
|
-
---
|
|
269
|
-
|
|
270
|
-
### Phase 8: Dynamic Rewrite Evolution (Section 6.13)
|
|
271
|
-
**Purpose**: Track cell 21 (dynamic rewrite + Writing Pad) vs cell 7 (static)
|
|
272
|
-
**Original runs**: `daf60f79`, `49bb2017`, `12aebedb` (N=82 total across 3 iterative runs)
|
|
273
|
-
|
|
274
|
-
**REPLICATION CHALLENGE**: The original three runs represent iterative development — each run was executed at a different git commit with evolving code. This cannot be cleanly replicated because:
|
|
275
|
-
1. The code has since evolved past those commits
|
|
276
|
-
2. The progression was part of development, not a controlled experiment
|
|
277
|
-
|
|
278
|
-
**Recommended approach**: Run cell 21 vs cell 7 at the current codebase state (equivalent to run 3, which had the Writing Pad activated):
|
|
279
|
-
|
|
280
|
-
```bash
|
|
281
|
-
node scripts/eval-cli.js run \
|
|
282
|
-
--profiles cell_7_recog_multi_unified,cell_21_recog_multi_unified_rewrite \
|
|
283
|
-
--scenarios misconception_correction_flow,mood_frustration_to_breakthrough,mutual_transformation_journey \
|
|
284
|
-
--runs 5 \
|
|
285
|
-
--description "Replication: Dynamic rewrite cell 21 vs cell 7 (Section 6.13)"
|
|
286
|
-
```
|
|
287
|
-
|
|
288
|
-
**Expected output**: ~30 scored responses
|
|
289
|
-
**Key metric**: Cell 21 should lead cell 7 by ~+5.5 pts (reflecting run 3 state)
|
|
290
|
-
|
|
291
|
-
**For full iterative replication**: This would require checking out the specific git commits (e3843ee, b2265c7, e673c4b) and running the comparison at each one. If that is not done, document it as a limitation of the replication.
|
|
292
|
-
|
|
293
|
-
---
|
|
294
|
-
|
|
295
|
-
### Phase 9: Cross-Judge Replication (Section 6.14)
|
|
296
|
-
**Purpose**: Re-score all key run responses with GPT-5.2 as independent judge
|
|
297
|
-
**Depends on**: Phases 1–7 completing (uses their run IDs)
|
|
298
|
-
|
|
299
|
-
For each key run from Phases 1–7:
|
|
300
|
-
```bash
|
|
301
|
-
# Rejudge each completed run with GPT-5.2
|
|
302
|
-
node scripts/eval-cli.js rejudge <phase1-run-id> --judge openrouter.gpt
|
|
303
|
-
node scripts/eval-cli.js rejudge <phase2-run-id> --judge openrouter.gpt
|
|
304
|
-
node scripts/eval-cli.js rejudge <phase3a-run-id> --judge openrouter.gpt
|
|
305
|
-
node scripts/eval-cli.js rejudge <phase3b-run-id> --judge openrouter.gpt
|
|
306
|
-
# ... etc for all runs
|
|
307
|
-
```
|
|
308
|
-
|
|
309
|
-
**CAUTION**: Rejudge creates new rows by default. If run twice, it creates duplicates. Use `--overwrite` to replace, or track carefully.
|
|
310
|
-
|
|
311
|
-
**Expected output**: Matched response pairs (same tutor response, two judge scores)
|
|
312
|
-
**Key metrics to verify**:
|
|
313
|
-
- Inter-judge r = 0.49–0.64 across runs
|
|
314
|
-
- GPT-5.2 finds ~58% of Claude's effect magnitudes
|
|
315
|
-
- Recognition main effect d≈1.0 under GPT-5.2
|
|
316
|
-
- Same condition ordering, no rank reversals in memory isolation
|
|
317
|
-
- Recognition vs enhanced: may not reach significance (+1.3, p=.60)
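
Inter-judge agreement here is a Pearson correlation over matched pairs (one score per judge for the same tutor response). The sketch below is a plain implementation under that assumption; it does not reproduce `analyze-judge-reliability.js`, and the score arrays are made-up illustrations.

```javascript
// Pearson correlation for two equal-length arrays of matched scores.
function pearson(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("need two equal-length arrays with at least 2 pairs");
  }
  const n = xs.length;
  const mean = (a) => a.reduce((s, v) => s + v, 0) / n;
  const mx = mean(xs);
  const my = mean(ys);
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - mx;
    const dy = ys[i] - my;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Illustrative scores only (not study data): same responses, two judges.
const judgeA = [82, 75, 91, 60, 70];
const judgeB = [78, 70, 85, 66, 64];
console.log(pearson(judgeA, judgeB).toFixed(2));
```

The pairs must be aligned on the same response before calling the function; a misaligned join is one way to end up with an implausibly low r.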
|
|
318
|
-
|
|
319
|
-
**Analysis**:
|
|
320
|
-
```bash
|
|
321
|
-
node scripts/analyze-judge-reliability.js
|
|
322
|
-
```
|
|
323
|
-
|
|
324
|
-
---
|
|
325
|
-
|
|
326
|
-
## 3. Verification Checklist
|
|
327
|
-
|
|
328
|
-
### 3.1 Primary Findings to Replicate
|
|
329
|
-
|
|
330
|
-
| # | Finding | Section | Key Statistic | Priority |
|
|
331
|
-
|---|---------|---------|---------------|----------|
|
|
332
|
-
| 1 | Recognition main effect | 6.3 | +10.2 pts, F=71.36, p<.001, d=0.80 | **Critical** |
|
|
333
|
-
| 2 | Memory isolation: recognition dominance | 6.2 | d=1.71, +15.2 pts | **Critical** |
|
|
334
|
-
| 3 | Memory isolation: memory modest | 6.2 | d=0.46, +4.8, p≈.08 | **Critical** |
|
|
335
|
-
| 4 | Memory isolation: negative interaction | 6.2 | -4.2 (ceiling) | **Critical** |
|
|
336
|
-
| 5 | A×C Interaction (Recog × Learner) | 6.3 | F=21.85, p<.001 | **High** |
|
|
337
|
-
| 6 | A×B null (architecture doesn't matter) | 6.3–6.4 | F=0.26, n.s. | **High** |
|
|
338
|
-
| 7 | Active control partial benefit | 6.2 | +9 pts vs +15 pts recognition | **High** |
|
|
339
|
-
| 8 | Domain generalizability | 6.5 | +9.9 pts Kimi elementary | **Medium** |
|
|
340
|
-
| 9 | Bilateral transformation asymmetry | 6.11 | Tutor +26%, learner -13% | **Medium** |
|
|
341
|
-
| 10 | Recognition vs enhanced gap | 6.1 | +8.7 pts | **Medium** |
|
|
342
|
-
| 11 | Cross-judge robustness | 6.14 | r=0.49–0.64, same direction | **High** |
|
|
343
|
-
| 12 | Dynamic rewrite improvement | 6.13 | Cell 21 leads by +5.5 | **Low** |
|
|
344
|
-
|
|
345
|
-
### 3.2 Expected Replication Tolerances
|
|
346
|
-
|
|
347
|
-
Given LLM stochasticity (temperature=0.6 for ego, 0.2 for judge), expect:
|
|
348
|
-
- **Effect directions**: Should replicate consistently (same sign)
|
|
349
|
-
- **Effect magnitudes**: ±3–5 points on means; ±0.2 on Cohen's d
|
|
350
|
-
- **Statistical significance**: Large effects (d>0.8) should remain significant; marginal effects (p≈.08) may flip
|
|
351
|
-
- **Cell ordering**: Should be preserved (no rank reversals on primary comparisons)
|
|
352
|
-
- **Interaction patterns**: A×C should replicate; A×B null should hold
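
These tolerances can be applied mechanically when filling in the comparison table. A minimal sketch, assuming a simple absolute-difference check; `compareEffect` and the replication values are hypothetical:

```javascript
// Compare an original effect with its replication (hypothetical helper).
// `tolerance` is an absolute band, e.g. 5 for points or 0.2 for Cohen's d.
function compareEffect(original, replication, tolerance) {
  return {
    sameSign: Math.sign(original) === Math.sign(replication),
    withinTolerance: Math.abs(original - replication) <= tolerance,
  };
}

// Original values come from this plan; replication values are made up.
const mainEffectPts = compareEffect(10.2, 7.9, 5);  // points on means
const mainEffectD = compareEffect(0.80, 0.45, 0.2); // Cohen's d

console.log(mainEffectPts, mainEffectD);
```

Keeping the sign check separate from the magnitude check matters because a replication can preserve the direction of an effect while falling outside the magnitude band.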
|
|
353
|
-
|
|
354
|
-
### 3.3 Red Flags (Suggesting Implementation Issues)
|
|
355
|
-
|
|
356
|
-
- Recognition main effect < +5 pts or not significant → check prompt loading
|
|
357
|
-
- Condition ordering reversed → check profile-to-prompt mapping
|
|
358
|
-
- All scores clustered >90 → ceiling effect / rubric calibration issue
|
|
359
|
-
- All scores <60 → model API issue or wrong model being called
|
|
360
|
-
- Memory isolation shows positive interaction → verify cell configurations
|
|
361
|
-
- Cross-judge r < 0.3 → check rejudge is matching correct responses
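
The red flags above can be turned into an automated sanity check once summary statistics are available. The sketch below encodes the listed thresholds, except for the condition-ordering check, which needs the per-cell means; the shape of the `summary` object is an assumption for illustration and does not correspond to any existing script output.

```javascript
// Map summary statistics to the red flags listed above.
// ASSUMPTION: the field names of `summary` are hypothetical.
function redFlags(summary) {
  const flags = [];
  if (summary.recognitionEffect < 5 || !summary.recognitionSignificant) {
    flags.push("check prompt loading");
  }
  if (summary.minScore > 90) {
    flags.push("ceiling effect / rubric calibration issue");
  }
  if (summary.maxScore < 60) {
    flags.push("model API issue or wrong model being called");
  }
  if (summary.memoryInteraction > 0) {
    flags.push("verify memory isolation cell configurations");
  }
  if (summary.crossJudgeR < 0.3) {
    flags.push("check rejudge is matching correct responses");
  }
  return flags;
}

// Illustrative input only.
const flags = redFlags({
  recognitionEffect: 3.1,
  recognitionSignificant: false,
  minScore: 42,
  maxScore: 97,
  memoryInteraction: -3.0,
  crossJudgeR: 0.55,
});
console.log(flags);
```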
|
|
362
|
-
|
|
363
|
-
---
|
|
364
|
-
|
|
365
|
-
## 4. Cost Estimation
|
|
366
|
-
|
|
367
|
-
| Phase | Attempts | Ego Cost | Judge Cost (Opus) | GPT-5.2 Rejudge | Subtotal |
|
|
368
|
-
|-------|----------|----------|-------------------|------------------|----------|
|
|
369
|
-
| 1: Recognition validation | 36 | ~$0.40 | ~$1.50 | ~$0.70 | ~$2.60 |
|
|
370
|
-
| 2: Full factorial | 360 | ~$5.00 | ~$15.00 | ~$7.00 | ~$27.00 |
|
|
371
|
-
| 3: Memory isolation (×2) | 120 | ~$1.50 | ~$5.00 | ~$2.50 | ~$9.00 |
|
|
372
|
-
| 4: Active control | 120 | ~$1.30 | ~$5.00 | ~$2.50 | ~$8.80 |
|
|
373
|
-
| 5: A×B interaction | 78 | ~$1.00 | ~$3.00 | ~$1.50 | ~$5.50 |
|
|
374
|
-
| 6: Domain gen | 107 | ~$1.20 | ~$4.00 | ~$2.00 | ~$7.20 |
|
|
375
|
-
| 7: Bilateral transformation | 120 | ~$2.00 | ~$5.00 | ~$2.50 | ~$9.50 |
|
|
376
|
-
| 8: Dynamic rewrite | 30 | ~$0.80 | ~$1.50 | ~$0.60 | ~$2.90 |
|
|
377
|
-
| **Total** | **~971** | **~$13.20** | **~$40.00** | **~$19.30** | **~$72.50** |
|
|
378
|
-
|
|
379
|
-
**Notes**:
|
|
380
|
-
- Ego costs are low because Kimi K2.5 and Nemotron are free-tier on OpenRouter; costs come from superego calls (Kimi K2.5)
|
|
381
|
-
- Judge costs dominate — Claude Opus via Claude Code CLI
|
|
382
|
-
- GPT-5.2 rejudge adds ~27% to total cost
|
|
383
|
-
- Multi-turn scenarios (phases 7, 8) cost more per evaluation due to multiple turns
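
The totals row and the rejudge share can be re-derived from the per-phase rows. The sketch below transcribes the table into an array and sums it; all figures remain the plan's estimates, and the variable names are illustrative.

```javascript
// Rows transcribed from the cost table above (USD, estimates).
const phases = [
  { phase: 1, attempts: 36,  ego: 0.40, judge: 1.50,  rejudge: 0.70, subtotal: 2.60 },
  { phase: 2, attempts: 360, ego: 5.00, judge: 15.00, rejudge: 7.00, subtotal: 27.00 },
  { phase: 3, attempts: 120, ego: 1.50, judge: 5.00,  rejudge: 2.50, subtotal: 9.00 },
  { phase: 4, attempts: 120, ego: 1.30, judge: 5.00,  rejudge: 2.50, subtotal: 8.80 },
  { phase: 5, attempts: 78,  ego: 1.00, judge: 3.00,  rejudge: 1.50, subtotal: 5.50 },
  { phase: 6, attempts: 107, ego: 1.20, judge: 4.00,  rejudge: 2.00, subtotal: 7.20 },
  { phase: 7, attempts: 120, ego: 2.00, judge: 5.00,  rejudge: 2.50, subtotal: 9.50 },
  { phase: 8, attempts: 30,  ego: 0.80, judge: 1.50,  rejudge: 0.60, subtotal: 2.90 },
];

// Sum one column, rounded to cents to absorb floating-point noise.
const sum = (key) => Number(phases.reduce((s, p) => s + p[key], 0).toFixed(2));

const totals = {
  attempts: sum("attempts"),
  ego: sum("ego"),
  judge: sum("judge"),
  rejudge: sum("rejudge"),
  subtotal: sum("subtotal"),
};

// Rows whose components do not add up to the stated subtotal.
const mismatched = phases.filter(
  (p) => Math.abs(p.ego + p.judge + p.rejudge - p.subtotal) > 0.005
);

// Share of the total attributable to the GPT-5.2 rejudge.
const rejudgeShare = totals.rejudge / totals.subtotal;

console.log(totals, mismatched.length, `${(rejudgeShare * 100).toFixed(1)}%`);
```

Re-running this after adjusting any row (for example, a different number of replications) keeps the totals row consistent with the per-phase estimates.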
|
|
384
|
-
|
|
385
|
-
---
|
|
386
|
-
|
|
387
|
-
## 5. Execution Order and Timeline
|
|
388
|
-
|
|
389
|
-
### Day 1: Independent Phases (Parallel)
|
|
390
|
-
|
|
391
|
-
| Time | Phase | Duration | Notes |
|
|
392
|
-
|------|-------|----------|-------|
|
|
393
|
-
| Morning | Phase 1 (validation) | ~2 hrs | Quick, small N |
|
|
394
|
-
| Morning | Phase 5b (A×B Kimi) | ~3 hrs | Small N |
|
|
395
|
-
| Morning | Phase 6a (Domain gen Kimi) | ~3 hrs | Small N |
|
|
396
|
-
| Afternoon | Phase 3 run 1 (Memory isolation) | ~6 hrs | |
|
|
397
|
-
| Afternoon | Phase 4 (Active control) | ~7 hrs | |
|
|
398
|
-
|
|
399
|
-
### Day 2: Main Factorial + Bilateral
|
|
400
|
-
|
|
401
|
-
| Time | Phase | Duration | Notes |
|
|
402
|
-
|------|-------|----------|-------|
|
|
403
|
-
| All day | Phase 2 (Full factorial) | ~24 hrs | Largest run, 360 evals |
|
|
404
|
-
| All day | Phase 3 run 2 (Memory isolation) | ~6 hrs | Independent replication |
|
|
405
|
-
| Evening | Phase 7 (Bilateral) | ~14 hrs | Multi-turn, slower |
|
|
406
|
-
|
|
407
|
-
### Day 3: Specialized + Cross-Judge
|
|
408
|
-
|
|
409
|
-
| Time | Phase | Duration | Notes |
|
|
410
|
-
|------|-------|----------|-------|
|
|
411
|
-
| Morning | Phase 5a (A×B Nemotron) | ~3 hrs | If Nemotron profiles exist |
|
|
412
|
-
| Morning | Phase 6b (Domain gen Nemotron) | ~3 hrs | Optional |
|
|
413
|
-
| Morning | Phase 8 (Dynamic rewrite) | ~5 hrs | |
|
|
414
|
-
| Afternoon | Phase 9 (Cross-judge) | ~8 hrs | Rejudge all completed runs |
|
|
415
|
-
|
|
416
|
-
### Day 4: Analysis and Verification
|
|
417
|
-
|
|
418
|
-
| Task | Tool |
|
|
419
|
-
|------|------|
|
|
420
|
-
| Generate reports for all runs | `eval-cli.js report <run-id>` |
|
|
421
|
-
| Compute ANOVA tables | `analyze-eval-results.js` |
|
|
422
|
-
| Inter-judge reliability | `analyze-judge-reliability.js` |
|
|
423
|
-
| Bilateral transformation metrics | `analyze-interaction-evals.js` |
|
|
424
|
-
| Compare replication vs original | Side-by-side comparison of key statistics |
|
|
425
|
-
|
|
426
|
-
---
|
|
427
|
-
|
|
428
|
-
## 6. Known Issues and Workarounds
|
|
429
|
-
|
|
430
|
-
### 6.1 Profile Name Verification
|
|
431
|
-
|
|
432
|
-
Before running, verify all profile names exist in `config/tutor-agents.yaml`:
|
|
433
|
-
```bash
|
|
434
|
-
grep -E "^ cell_" config/tutor-agents.yaml | head -30
|
|
435
|
-
```
|
|
436
|
-
|
|
437
|
-
Profiles referenced in the paper:
|
|
438
|
-
- `cell_1` through `cell_8` (main factorial)
|
|
439
|
-
- `cell_9` through `cell_12` (enhanced prompts)
|
|
440
|
-
- `cell_15` through `cell_18` (placebo/active control)
|
|
441
|
-
- `cell_19`, `cell_20` (memory isolation)
|
|
442
|
-
- `cell_21` (dynamic rewrite)
|
|
443
|
-
|
|
444
|
-
### 6.2 Nemotron Model Configuration
|
|
445
|
-
|
|
446
|
-
The paper uses Nemotron for several analyses (A×B interaction, domain gen, active control). Check whether the default profiles use Kimi or Nemotron:
|
|
447
|
-
- If all profiles default to Kimi, you may need to create Nemotron variants or use a model override mechanism
|
|
448
|
-
- The original Nemotron runs may have used different profile names or env var overrides
|
|
449
|
-
|
|
450
|
-
### 6.3 Rejudge Deduplication
|
|
451
|
-
|
|
452
|
-
GPT-5.2 rejudge can create duplicate rows. After rejudging, verify:
|
|
453
|
-
```sql
|
|
454
|
-
SELECT run_id, judge_model, COUNT(*) as n
|
|
455
|
-
FROM evaluation_results
|
|
456
|
-
WHERE judge_model = 'openrouter/openai/gpt-5.2'
|
|
457
|
-
GROUP BY run_id
|
|
458
|
-
ORDER BY n DESC;
|
|
459
|
-
```
|
|
460
|
-
|
|
461
|
-
If duplicates exist, use ROW_NUMBER() window function to deduplicate.
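
The same keep-one-row-per-partition idea can be sketched in JavaScript on rows already loaded into memory. The field names below (`responseKey`, `judgeModel`, `createdAt`) are hypothetical and would need to be mapped to the actual `evaluation_results` columns.

```javascript
// Keep the newest row per (responseKey, judgeModel), analogous to selecting
// ROW_NUMBER() = 1 over a partition ordered by creation time descending.
// ASSUMPTION: field names are hypothetical; createdAt is an ISO-8601 string,
// so plain string comparison orders rows chronologically.
function dedupeLatest(rows) {
  const latest = new Map();
  for (const row of rows) {
    const key = `${row.responseKey}|${row.judgeModel}`;
    const kept = latest.get(key);
    if (!kept || row.createdAt > kept.createdAt) {
      latest.set(key, row);
    }
  }
  return [...latest.values()];
}

// Illustrative rows only.
const rows = [
  { responseKey: "r1", judgeModel: "gpt", createdAt: "2026-02-07T10:00:00Z", score: 71 },
  { responseKey: "r1", judgeModel: "gpt", createdAt: "2026-02-07T12:00:00Z", score: 74 },
  { responseKey: "r2", judgeModel: "gpt", createdAt: "2026-02-07T10:05:00Z", score: 80 },
];
const deduped = dedupeLatest(rows);
console.log(deduped.length);
```

Keeping the newest row is one possible policy; keeping the first judgment is equally defensible, as long as the choice is applied uniformly and documented.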
|
|
462
|
-
|
|
463
|
-
### 6.4 Scenario Availability
|
|
464
|
-
|
|
465
|
-
Verify all 15 scenarios exist:
|
|
466
|
-
```bash
|
|
467
|
-
grep "^ [a-z]" config/suggestion-scenarios.yaml | head -20
|
|
468
|
-
```
|
|
469
|
-
|
|
470
|
-
The 3 multi-turn scenarios need special handling:
|
|
471
|
-
- `misconception_correction_flow` (3.2 avg rounds)
|
|
472
|
-
- `mood_frustration_to_breakthrough` (3.0 avg rounds)
|
|
473
|
-
- `mutual_transformation_journey` (4.1 avg rounds)
|
|
474
|
-
|
|
475
|
-
### 6.5 Content Isolation (Critical — from feedback item #31)
|
|
476
|
-
|
|
477
|
-
The paper reports that Nemotron hallucinated philosophy content on elementary math tasks. Verify that content paths are properly isolated:
|
|
478
|
-
- When running elementary tests, ensure `EVAL_CONTENT_PATH` points to `./content-test-elementary`
|
|
479
|
-
- Check that the contentResolver does not leak cross-domain lecture IDs
|
|
480
|
-
- This is flagged in the reviewer feedback as a potential data isolation bug to investigate
|
|
481
|
-
|
|
482
|
-
### 6.6 Evaluate vs Rejudge
|
|
483
|
-
|
|
484
|
-
- `evaluate --force`: Only processes rows with NULL base_score (won't re-score existing)
|
|
485
|
-
- `rejudge`: Creates new rows with a different judge_model
|
|
486
|
-
- To re-score with Opus: null out scores first, then `evaluate --force`
|
|
487
|
-
- **Never** use `rejudge` without the `--judge` flag: it defaults to Sonnet 4.5, not Opus
|
|
488
|
-
|
|
489
|
-
---
|
|
490
|
-
|
|
491
|
-
## 7. Statistical Analysis Pipeline
|
|
492
|
-
|
|
493
|
-
After all phases complete, run the full analysis:
|
|
494
|
-
|
|
495
|
-
```bash
|
|
496
|
-
# 1. Per-run reports
|
|
497
|
-
for run_id in <all-replication-run-ids>; do
|
|
498
|
-
  node scripts/eval-cli.js report "$run_id"
|
|
499
|
-
done
|
|
500
|
-
|
|
501
|
-
# 2. Factorial ANOVA (Phase 2)
|
|
502
|
-
node scripts/analyze-eval-results.js <factorial-run-id>
|
|
503
|
-
|
|
504
|
-
# 3. Memory isolation analysis (Phase 3)
|
|
505
|
-
# Compare 4 cells: base-nomem, base-mem, recog-nomem, recog-mem
|
|
506
|
-
node scripts/analyze-eval-results.js <mem-iso-run1-id>
|
|
507
|
-
node scripts/analyze-eval-results.js <mem-iso-run2-id>
|
|
508
|
-
|
|
509
|
-
# 4. Inter-judge reliability (Phase 9)
|
|
510
|
-
node scripts/analyze-judge-reliability.js
|
|
511
|
-
|
|
512
|
-
# 5. Bilateral transformation (Phase 7)
|
|
513
|
-
node scripts/analyze-interaction-evals.js <bilateral-run-id>
|
|
514
|
-
|
|
515
|
-
# 6. Export all results
|
|
516
|
-
node scripts/eval-cli.js export <run-id> --format csv
|
|
517
|
-
```
|
|
518
|
-
|
|
519
|
-
### 7.1 Key Comparisons
|
|
520
|
-
|
|
521
|
-
| Comparison | Original | Replication | Match? |
|
|
522
|
-
|------------|----------|-------------|--------|
|
|
523
|
-
| Recognition main effect (d) | 0.80 | | |
|
|
524
|
-
| Memory isolation recognition (d) | 1.71 | | |
|
|
525
|
-
| Memory isolation memory (d) | 0.46 | | |
|
|
526
|
-
| A×C interaction (F) | 21.85 | | |
|
|
527
|
-
| Bilateral adaptation Δ | +0.086 | | |
|
|
528
|
-
| Cross-judge r (factorial) | 0.64 | | |
|
|
529
|
-
| Cross-judge r (memory iso) | 0.63 | | |
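When filling in the Replication and Match? columns, it helps to fix the effect-size formula and the match rule in advance. The sketch below computes Cohen's d with a pooled standard deviation, which is the textbook definition; whether the original analysis scripts use exactly this variant is an assumption. The `matchesOriginal` rule (same sign, within a tolerance) and its default tolerance are hypothetical choices, not criteria from the paper, and the sample scores are toy values.

```javascript
function mean(xs) {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function sampleVariance(xs) {
  const m = mean(xs);
  return xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1);
}

// Cohen's d using the pooled standard deviation of the two groups.
function cohensD(treatment, control) {
  const n1 = treatment.length;
  const n2 = control.length;
  const pooledVariance =
    ((n1 - 1) * sampleVariance(treatment) + (n2 - 1) * sampleVariance(control)) /
    (n1 + n2 - 2);
  return (mean(treatment) - mean(control)) / Math.sqrt(pooledVariance);
}

// Hypothetical match rule: same direction and within `tolerance` of the original d.
function matchesOriginal(originalD, replicationD, tolerance = 0.3) {
  return (
    Math.sign(originalD) === Math.sign(replicationD) &&
    Math.abs(originalD - replicationD) <= tolerance
  );
}

// Toy scores, for illustration only.
const d = cohensD([2, 4, 6], [1, 3, 5]);
console.log(d, matchesOriginal(0.8, d));
```

Deciding the tolerance before running the replication avoids tuning the match criterion to the results.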
|
|
530
|
-
|
|
531
|
-
---
|
|
532
|
-
|
|
533
|
-
## 8. What Exact Replication Cannot Cover
|
|
534
|
-
|
|
535
|
-
Some aspects of the original study are inherently non-replicable:
|
|
536
|
-
|
|
537
|
-
1. **Iterative development trajectory** (Section 6.13): The three dynamic rewrite runs tracked code evolution across commits. Running at the current codebase state tests the final configuration but not the developmental trajectory.
|
|
538
|
-
|
|
539
|
-
2. **Historical Nemotron data** (Section 6.2 active control): The "same-model comparison" for the active control draws on historical Nemotron data across multiple runs (N=467 base, N=545 recognition). This accumulated data is in the existing database but would require extensive separate runs to reproduce.
|
|
540
|
-
|
|
541
|
-
3. **Exact LLM outputs**: LLMs are stochastic, so the same prompt can produce different text and therefore different judge scores. Replication verifies *statistical patterns* (effect directions, magnitudes, significance), not identical outputs.
|
|
542
|
-
|
|
543
|
-
4. **Model version drift**: Free-tier models (Kimi, Nemotron) on OpenRouter may have been updated since the original runs. Pin exact model IDs from `config/providers.yaml` and document any version changes.
|
|
544
|
-
|
|
545
|
-
5. **Pooled multi-turn results** (Section 6.10, Table 13): The pooled samples (N=161/277/165) were accumulated across all development runs and cannot be replicated from scratch; they represent the accumulated database. The dedicated bilateral run (Phase 7) provides the controlled comparison.
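For item 4, documenting version changes can be as simple as diffing the model IDs recorded for the original runs against those currently configured. The following is a minimal sketch under stated assumptions: `findModelDrift` is a hypothetical helper, both inputs are plain alias-to-ID objects that the reader extracts from `config/providers.yaml` by whatever means they prefer, and the IDs shown are placeholders rather than real model identifiers.

```javascript
// Hypothetical helper: lists every alias whose current model ID differs from
// the ID pinned for the original runs (a missing alias is reported as null).
function findModelDrift(pinned, current) {
  const drift = [];
  for (const [alias, pinnedId] of Object.entries(pinned)) {
    const currentId = current[alias];
    if (currentId !== pinnedId) {
      drift.push({ alias, pinnedId, currentId: currentId ?? null });
    }
  }
  return drift;
}

// Placeholder IDs, for illustration only.
const pinnedIds = { kimi: "example/kimi-v1", nemotron: "example/nemotron-v1" };
const currentIds = { kimi: "example/kimi-v2", nemotron: "example/nemotron-v1" };
console.log(findModelDrift(pinnedIds, currentIds));
```

An empty result means the pinned IDs are unchanged; it says nothing about silent updates behind an unchanged ID, which still have to be noted from provider changelogs.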
|
|
546
|
-
|
|
547
|
-
---
|
|
548
|
-
|
|
549
|
-
## 9. Minimum Viable Replication
|
|
550
|
-
|
|
551
|
-
If cost or time constraints require a reduced replication, prioritize these phases:
|
|
552
|
-
|
|
553
|
-
| Priority | Phase | N | Cost | Finding |
|
|
554
|
-
|----------|-------|---|------|---------|
|
|
555
|
-
| 1 | Phase 2: Full factorial | 360 | ~$27 | Main effects + A×C interaction |
|
|
556
|
-
| 2 | Phase 3: Memory isolation | 120 | ~$9 | Recognition as primary driver |
|
|
557
|
-
| 3 | Phase 9: Cross-judge (on phases 2+3) | 480 | ~$10 | Judge robustness |
|
|
558
|
-
| **Subtotal** | | **~960** | **~$46** | **Covers findings 1–6, 11** |
|
|
559
|
-
|
|
560
|
-
This covers the three most important claims: recognition dominance, memory isolation, and cross-judge robustness. Phases 1 and 4–8 provide supporting evidence but are less critical for the core argument.
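The subtotal row can be checked by summing the three prioritized phases. The snippet below copies the N and cost figures from the table above; the field names are illustrative.

```javascript
// N and approximate cost (USD) per phase, copied from the table above.
const phases = [
  { phase: "Phase 2: Full factorial", n: 360, costUsd: 27 },
  { phase: "Phase 3: Memory isolation", n: 120, costUsd: 9 },
  { phase: "Phase 9: Cross-judge (on phases 2+3)", n: 480, costUsd: 10 },
];

function subtotal(rows) {
  return rows.reduce(
    (acc, row) => ({ n: acc.n + row.n, costUsd: acc.costUsd + row.costUsd }),
    { n: 0, costUsd: 0 }
  );
}

console.log(subtotal(phases)); // { n: 960, costUsd: 46 }
```

The Phase 9 figure of 480 equals 360 + 120, consistent with the cross-judge pass re-scoring every row from Phases 2 and 3.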
|
|
561
|
-
|
|
562
|
-
---
|
|
563
|
-
|
|
564
|
-
## 10. Reviewer Feedback Integration
|
|
565
|
-
|
|
566
|
-
The `feedback-2026-02-07.md` file contains 38 items from reviewers. Several are directly relevant to replication:
|
|
567
|
-
|
|
568
|
-
| # | Feedback | Replication Implication |
|
|
569
|
-
|---|----------|----------------------|
|
|
570
|
-
| 7 | Clarify base/enhanced/active variants | Ensure profile names match paper descriptions |
|
|
571
|
-
| 26 | Model confound sounds defensive | Consider running active control on Kimi too |
|
|
572
|
-
| 28 | Rename unified/psycho to Single/Multi | Check if profile names have been updated |
|
|
573
|
-
| 29 | Verify cells 6,8 scores with latest evals | Compare replication cells 6,8 to paper values |
|
|
574
|
-
| 31 | **Content isolation may be compromised** | **Critical: verify no cross-domain data leaks** |
|
|
575
|
-
| 35 | Models not "trained" on content | Verify course data isolation in contentResolver |
|
|
576
|
-
|
|
577
|
-
**Recommendation**: Before starting the replication, run Phase 6 (domain gen) with logging enabled to verify content isolation (feedback items 31, 35). If the elementary scenarios reference philosophy lecture IDs, there is a data isolation bug that must be fixed before proceeding.
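The isolation check itself can be reduced to scanning the lecture IDs that appear in the Phase 6 logs for any that belong to another course. The sketch below is an assumption-heavy illustration: `findLeakedLectureIds` is a hypothetical helper, the `<course>-lecture-<n>` ID format is assumed rather than taken from the contentResolver, and treating course `479` as the philosophy course is inferred from the package layout, not confirmed by this guide.

```javascript
// Hypothetical helper: returns the referenced lecture IDs that belong to a
// course the current test domain must not see.
// Assumption: lecture IDs are prefixed with the course ID and a hyphen.
function findLeakedLectureIds(referencedIds, forbiddenCourseIds) {
  return referencedIds.filter((id) =>
    forbiddenCourseIds.some((courseId) => id.startsWith(courseId + "-"))
  );
}

// Placeholder IDs collected from an elementary-domain run, for illustration only.
const referenced = ["101-lecture-1", "479-lecture-3", "101-lecture-2"];
const leaked = findLeakedLectureIds(referenced, ["479"]);
console.log(leaked.length === 0 ? "content isolation OK" : leaked);
```

Any non-empty result from an elementary run would indicate the data isolation bug described in feedback items 31 and 35.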
|