@machinespirits/eval 0.2.0 → 0.3.0

Files changed (74)
  1. package/README.md +91 -9
  2. package/config/eval-settings.yaml +3 -3
  3. package/config/paper-manifest.json +486 -0
  4. package/config/providers.yaml +9 -6
  5. package/config/tutor-agents.yaml +2261 -0
  6. package/content/README.md +23 -0
  7. package/content/courses/479/course.md +53 -0
  8. package/content/courses/479/lecture-1.md +361 -0
  9. package/content/courses/479/lecture-2.md +360 -0
  10. package/content/courses/479/lecture-3.md +655 -0
  11. package/content/courses/479/lecture-4.md +530 -0
  12. package/content/courses/479/lecture-5.md +326 -0
  13. package/content/courses/479/lecture-6.md +346 -0
  14. package/content/courses/479/lecture-7.md +326 -0
  15. package/content/courses/479/lecture-8.md +273 -0
  16. package/content/courses/479/roadmap-slides.md +656 -0
  17. package/content/manifest.yaml +8 -0
  18. package/docs/research/build.sh +44 -20
  19. package/docs/research/figures/figure10.png +0 -0
  20. package/docs/research/figures/figure11.png +0 -0
  21. package/docs/research/figures/figure3.png +0 -0
  22. package/docs/research/figures/figure4.png +0 -0
  23. package/docs/research/figures/figure5.png +0 -0
  24. package/docs/research/figures/figure6.png +0 -0
  25. package/docs/research/figures/figure7.png +0 -0
  26. package/docs/research/figures/figure8.png +0 -0
  27. package/docs/research/figures/figure9.png +0 -0
  28. package/docs/research/header.tex +23 -2
  29. package/docs/research/paper-full.md +941 -285
  30. package/docs/research/paper-short.md +216 -585
  31. package/docs/research/references.bib +132 -0
  32. package/docs/research/slides-header.tex +188 -0
  33. package/docs/research/slides-pptx.md +363 -0
  34. package/docs/research/slides.md +531 -0
  35. package/docs/research/style-reference-pptx.py +199 -0
  36. package/package.json +6 -5
  37. package/scripts/analyze-eval-results.js +69 -17
  38. package/scripts/analyze-mechanism-traces.js +763 -0
  39. package/scripts/analyze-modulation-learning.js +498 -0
  40. package/scripts/analyze-prosthesis.js +144 -0
  41. package/scripts/analyze-run.js +264 -79
  42. package/scripts/assess-transcripts.js +853 -0
  43. package/scripts/browse-transcripts.js +854 -0
  44. package/scripts/check-parse-failures.js +73 -0
  45. package/scripts/code-dialectical-modulation.js +1320 -0
  46. package/scripts/download-data.sh +55 -0
  47. package/scripts/eval-cli.js +106 -18
  48. package/scripts/generate-paper-figures.js +663 -0
  49. package/scripts/generate-paper-figures.py +577 -76
  50. package/scripts/generate-paper-tables.js +299 -0
  51. package/scripts/qualitative-analysis-ai.js +3 -3
  52. package/scripts/render-sequence-diagram.js +694 -0
  53. package/scripts/test-latency.js +210 -0
  54. package/scripts/test-rate-limit.js +95 -0
  55. package/scripts/test-token-budget.js +332 -0
  56. package/scripts/validate-paper-manifest.js +670 -0
  57. package/services/__tests__/evalConfigLoader.test.js +2 -2
  58. package/services/__tests__/learnerRubricEvaluator.test.js +361 -0
  59. package/services/__tests__/learnerTutorInteractionEngine.test.js +326 -0
  60. package/services/evaluationRunner.js +975 -98
  61. package/services/evaluationStore.js +12 -4
  62. package/services/learnerTutorInteractionEngine.js +27 -2
  63. package/services/mockProvider.js +133 -0
  64. package/services/promptRewriter.js +1471 -5
  65. package/services/rubricEvaluator.js +55 -2
  66. package/services/transcriptFormatter.js +675 -0
  67. package/docs/EVALUATION-VARIABLES.md +0 -589
  68. package/docs/REPLICATION-PLAN.md +0 -577
  69. package/scripts/analyze-run.mjs +0 -282
  70. package/scripts/compare-runs.js +0 -44
  71. package/scripts/compare-suggestions.js +0 -80
  72. package/scripts/dig-into-run.js +0 -158
  73. package/scripts/show-failed-suggestions.js +0 -64
  74. /package/scripts/{check-run.mjs → check-run.js} +0 -0
@@ -1,577 +0,0 @@
1
- # Comprehensive Replication Plan
2
-
3
- ## Study: "The Drama Machine in Education"
4
-
5
- **Paper**: PAPER-FULL-2026-02-04.md (v1.5)
6
- **Original**: 14 key runs, N=1,010 scored, N=3,800+ in full database
7
- **Estimated replication cost**: ~$65–$90 USD (ego generation + Opus judging + GPT-5.2 cross-judge)
8
- **Estimated wall-clock time**: ~48–72 hours (with parallelism=2, accounting for API rate limits)
9
-
10
- ---
11
-
12
- ## 1. Prerequisites
13
-
14
- ### 1.1 Software Dependencies
15
-
16
- | Component | Version | Source |
17
- |-----------|---------|--------|
18
- | Node.js | >= 18.0.0 | nodejs.org |
19
- | `@machinespirits/tutor-core` | 0.3.1 | Linked locally (`../machinespirits-tutor-core`) |
20
- | `better-sqlite3` | 12.5.0 | npm |
21
- | `dotenv` | 17.2.3 | npm |
22
- | `express` | 4.19.2 | npm |
23
- | `jsonrepair` | 3.13.2 | npm |
24
- | `yaml` | 2.8.2 | npm |
25
-
26
- **Setup**:
27
- ```bash
28
- cd <path-to-machinespirits-eval>
29
- npm install
30
- ```
31
-
32
- ### 1.2 External Content Packages
33
-
34
- | Package | Location | Purpose |
35
- |---------|----------|---------|
36
- | `machinespirits-content-philosophy` | `../machinespirits-content-philosophy` | Primary domain (Hegel, graduate philosophy) |
37
- | `content-test-elementary` | `./content-test-elementary/` | Domain generalizability (4th-grade fractions) |
38
-
39
- Both are present in the current environment. A replicator needs access to these repositories.
40
-
41
- ### 1.3 API Keys (in `.env`)
42
-
43
- | Provider | Models Used | Estimated Cost Fraction |
44
- |----------|-------------|------------------------|
45
- | **OpenRouter** | Kimi K2.5 (free-tier ego), Nemotron 3 Nano 30B (free-tier ego) | ~$5 (superego calls only; ego is free) |
46
- | **Anthropic** | Claude Opus (judge via Claude Code CLI) | ~$40–55 (primary judge, largest cost) |
47
- | **OpenAI (via OpenRouter)** | GPT-5.2 (cross-judge validation) | ~$18–25 (rejudge only) |
48
-
49
- **Critical**: The primary judge uses the Claude Code CLI, which invokes Opus directly via the Anthropic API. OpenRouter is used for the ego/superego models and for GPT-5.2 rejudging.
50
-
51
- ### 1.4 Model Availability Risk
52
-
53
- | Model | Risk | Mitigation |
54
- |-------|------|------------|
55
- | Kimi K2.5 (OpenRouter free) | May be retired/updated | Pin model ID in providers.yaml; document exact version |
56
- | Nemotron 3 Nano 30B (OpenRouter free) | May be retired | Same; only needed for A×B interaction and domain gen |
57
- | Claude Opus | Stable (Anthropic tier) | Low risk |
58
- | GPT-5.2 | Stable (OpenAI) | Low risk |
59
-
60
- ### 1.5 Database
61
-
62
- - **Fresh start**: Back up existing `data/evaluations.db`, then either use a fresh DB or prefix run IDs to distinguish replication from original.
63
- - **Recommended**: Use a separate database file (e.g., `data/evaluations-replication.db`) by modifying the DB path in evaluationStore.js, or simply run the replication against the existing database and rely on run IDs to tell it apart from the original.
64
-
65
- ---
66
-
67
- ## 2. Replication Phases
68
-
69
- The study comprises 9 distinct experimental phases (producing 14 runs). We recommend executing them in dependency order. **Phases 1–5 are independent and can run in parallel if API rate limits allow.**
70
-
71
- ### Phase 1: Recognition Validation (Section 6.1)
72
- **Purpose**: 3-way comparison — base vs enhanced vs recognition
73
- **Original run**: `eval-2026-02-03-86b159cd` (N=36 scored)
74
- **Design**: 3 profiles × 4 scenarios × 3 replications
75
-
76
- ```bash
77
- node scripts/eval-cli.js run \
78
- --profiles cell_1_base_single_unified,cell_9_enhanced_single_unified,cell_5_recog_single_unified \
79
- --scenarios struggling_learner,concept_confusion,mood_frustrated_explicit,high_performer \
80
- --runs 3 \
81
- --description "Replication: Recognition validation (Section 6.1)"
82
- ```
83
-
84
- **Expected output**: N=36 scored responses
85
- **Key metrics to verify**:
86
- - Recognition > Enhanced > Base ordering
87
- - One-way ANOVA F(2,33) significant
88
- - Recognition vs Enhanced gap ≈ +8.7 pts (original)
89
- - Recognition vs Base gap ≈ +20.1 pts (original)
90
-
91
- **Analysis**:
92
- ```bash
93
- node scripts/eval-cli.js report <run-id>
94
- node scripts/analyze-eval-results.js <run-id>
95
- ```
96
-
97
- ---
98
-
99
- ### Phase 2: Full 2×2×2 Factorial (Section 6.3)
100
- **Purpose**: Main factorial design — Recognition × Architecture × Learner
101
- **Original runs**: `eval-2026-02-03-f5d4dd93` (cells 1–5,7, N=262) + `eval-2026-02-06-a933d745` (cells 6,8, N=88)
102
- **Design**: 8 cells × 15 scenarios × 3 replications = 360 planned, expect ~350 scored
103
-
104
- **Important**: The original ran as two separate runs because cells 6 and 8 needed re-running with corrected learner prompts. For replication, all 8 cells can run together since the prompts are now correct.
105
-
106
- ```bash
107
- node scripts/eval-cli.js run \
108
- --profiles cell_1_base_single_unified,cell_2_base_single_psycho,cell_3_base_multi_unified,cell_4_base_multi_psycho,cell_5_recog_single_unified,cell_6_recog_single_psycho,cell_7_recog_multi_unified,cell_8_recog_multi_psycho \
109
- --runs 3 \
110
- --description "Replication: Full 2x2x2 factorial (Section 6.3)"
111
- ```
112
-
113
- **Expected output**: ~350 scored responses (8 × 15 × 3 = 360 attempted)
114
- **Key metrics to verify**:
115
- - Recognition main effect: +10.2 pts, F(1,342)≈71, p<.001, η²≈.16
116
- - Architecture main effect: ~+0.9, n.s.
117
- - A×C Interaction (Recognition × Learner): F≈22, p<.001
118
- - Unified learner: recognition +15.5 pts
119
- - Psychodynamic learner: recognition +4.8 pts
120
- - Cell means ordering: 5≈7 > 6≈8 > 4≈2 > 1≈3
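The planned counts quoted for Phases 1 and 2 follow from multiplying profiles by scenarios by replications. A minimal sketch of that arithmetic, where `plannedAttempts` is a hypothetical helper and not part of the package:

```javascript
// Planned attempts = profiles × scenarios × replications.
// `plannedAttempts` is an illustrative helper, not an eval-cli function.
function plannedAttempts({ profiles, scenarios, runs }) {
  return profiles * scenarios * runs;
}

// Designs as stated in the plan for Phases 1 and 2.
const designs = {
  phase1: { profiles: 3, scenarios: 4, runs: 3 },
  phase2: { profiles: 8, scenarios: 15, runs: 3 },
};

const planned = Object.fromEntries(
  Object.entries(designs).map(([name, design]) => [name, plannedAttempts(design)])
);

console.log(planned); // { phase1: 36, phase2: 360 }
```

Comparing these planned counts against the scored N reported by a run is a quick way to spot dropped or failed evaluations.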
121
-
122
- **Analysis**:
123
- ```bash
124
- node scripts/analyze-eval-results.js <run-id>
125
- # This will compute the full ANOVA table
126
- ```
127
-
128
- ---
129
-
130
- ### Phase 3: Memory Isolation 2×2 (Section 6.2)
131
- **Purpose**: Disentangle recognition from memory
132
- **Original runs**: `eval-2026-02-06-81f2d5a1` (N=60) + `eval-2026-02-06-ac9ea8f5` (N=62)
133
- **Design**: 4 cells × 15 scenarios × 1 rep per run, 2 independent runs
134
-
135
- The memory isolation phase uses cells 19–20 (memory isolation profiles). Check `config/tutor-agents.yaml` for the exact profile names.
136
-
137
- ```bash
138
- # Run 1
139
- node scripts/eval-cli.js run \
140
- --profiles cell_19_base_nomem,cell_19_base_mem,cell_20_recog_nomem,cell_20_recog_mem \
141
- --runs 1 \
142
- --description "Replication: Memory isolation run 1 (Section 6.2)"
143
-
144
- # Run 2 (independent replication)
145
- node scripts/eval-cli.js run \
146
- --profiles cell_19_base_nomem,cell_19_base_mem,cell_20_recog_nomem,cell_20_recog_mem \
147
- --runs 1 \
148
- --description "Replication: Memory isolation run 2 (Section 6.2)"
149
- ```
150
-
151
- **NOTE**: Verify the exact profile names in `config/tutor-agents.yaml` — the memory isolation profiles may use different naming conventions (e.g., `mem_iso_base_nomem`, `mem_iso_recog_mem`, etc.). The paper states N=30 per cell across two runs, suggesting each run has ~15 per cell (4 cells × 15 scenarios × 1 rep = 60 per run).
152
-
153
- **Expected output**: N=120 across both runs (30 per cell)
154
- **Key metrics to verify**:
155
- - Recognition effect: d≈1.71, +15.2 pts without memory
156
- - Memory effect: d≈0.46, +4.8 pts, p≈.08
157
- - Interaction: -4.2 (negative — ceiling effect)
158
- - Condition ordering: Recog+Mem ≥ Recog Only >> Mem Only > Base
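The recognition, memory, and interaction terms in a 2×2 design can be derived from the four cell means. The sketch below shows one common way to compute them; the cell means are hypothetical placeholders chosen for easy arithmetic, not values from the study, and `twoByTwoEffects` is an illustrative helper.

```javascript
// Hypothetical cell means for a recognition × memory 2×2 (NOT study data).
const cellMeans = { base: 70, memOnly: 75, recogOnly: 85, recogMem: 86 };

// Main effects are averaged simple effects; the interaction is the
// difference between the recognition effect with and without memory.
function twoByTwoEffects({ base, memOnly, recogOnly, recogMem }) {
  const recogWithoutMem = recogOnly - base;
  const recogWithMem = recogMem - memOnly;
  const memWithoutRecog = memOnly - base;
  const memWithRecog = recogMem - recogOnly;
  return {
    recognitionMain: (recogWithoutMem + recogWithMem) / 2,
    memoryMain: (memWithoutRecog + memWithRecog) / 2,
    interaction: recogWithMem - recogWithoutMem,
  };
}

const effects = twoByTwoEffects(cellMeans);
console.log(effects); // { recognitionMain: 13, memoryMain: 3, interaction: -4 }
```

A negative interaction, as in this toy example, means recognition adds less when memory is already present, which is the pattern the plan attributes to a ceiling effect.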
159
-
160
- ---
161
-
162
- ### Phase 4: Active Control (Section 6.2)
163
- **Purpose**: Test whether generic pedagogical elaboration accounts for recognition gains
164
- **Original run**: `eval-2026-02-06-a9ae06ee` (N=118 scored)
165
- **Design**: Cells 15–18 (placebo control profiles)
166
-
167
- **MODEL NOTE**: The original used Nemotron as ego (not Kimi). This is a known confound documented in the paper. For a fair replication, run the first of the following and, optionally, the second:
168
- 1. The active control on Nemotron (replicating the original)
169
- 2. Optionally, the active control on Kimi (resolving the model confound)
170
-
171
- ```bash
172
- # Active control with Nemotron (replicating original)
173
- node scripts/eval-cli.js run \
174
- --profiles cell_15_placebo_single_unified,cell_16_placebo_single_psycho,cell_17_placebo_multi_unified,cell_18_placebo_multi_psycho \
175
- --runs 3 \
176
- --description "Replication: Active control / placebo (Section 6.2)"
177
- ```
178
-
179
- **Expected output**: ~118 scored
180
- **Key metrics to verify**:
181
- - Overall mean ≈ 66.5 (Nemotron)
182
- - Same-model comparison: +9 pts above Nemotron base, below recognition (~73)
183
-
184
- ---
185
-
186
- ### Phase 5: A×B Interaction (Section 6.4)
187
- **Purpose**: Test whether multi-agent synergy requires recognition prompts
188
- **Original runs**: `eval-2026-02-04-948e04b3` (Nemotron, N=17) + `eval-2026-02-05-10b344fb` (Kimi, N=60)
189
-
190
- #### 5a: Nemotron A×B test
191
- ```bash
192
- # This requires configuring Nemotron as the ego model
193
- # Check if there are specific Nemotron profile overrides
194
- node scripts/eval-cli.js run \
195
- --profiles cell_5_recog_single_unified,cell_7_recog_multi_unified,cell_9_enhanced_single_unified,cell_11_enhanced_multi_unified \
196
- --scenarios struggling_learner,concept_confusion,mood_frustrated_explicit \
197
- --runs 3 \
198
- --description "Replication: A×B interaction Nemotron (Section 6.4)"
199
- ```
200
-
201
- **NOTE**: The original Nemotron run had only N=17 scored (small sample). The profile may need a model override to use Nemotron instead of the default Kimi. Check whether there are Nemotron-specific profiles or whether the CLI supports model overrides.
202
-
203
- #### 5b: Kimi A×B replication
204
- ```bash
205
- node scripts/eval-cli.js run \
206
- --profiles cell_5_recog_single_unified,cell_7_recog_multi_unified,cell_9_enhanced_single_unified,cell_11_enhanced_multi_unified \
207
- --runs 3 \
208
- --description "Replication: A×B replication Kimi (Section 6.4)"
209
- ```
210
-
211
- **Expected output**: N≈60
212
- **Key metrics to verify**:
213
- - Kimi: A×B interaction ≈ +1.35 (negligible, confirming non-replication of Nemotron finding)
214
- - Recognition cells ≈ 90.6 regardless of architecture
215
- - Enhanced cells ≈ 80.6
216
-
217
- ---
218
-
219
- ### Phase 6: Domain Generalizability (Section 6.5)
220
- **Purpose**: Test recognition effects on elementary math content
221
- **Original runs**: `eval-2026-02-04-79b633ca` (Nemotron, N=47) + `eval-2026-02-05-e87f452d` (Kimi, N=60)
222
-
223
- #### 6a: Kimi elementary replication
224
- ```bash
225
- EVAL_CONTENT_PATH=./content-test-elementary \
226
- EVAL_SCENARIOS_FILE=./content-test-elementary/scenarios-elementary.yaml \
227
- node scripts/eval-cli.js run \
228
- --profiles cell_1_base_single_unified,cell_3_base_multi_unified,cell_5_recog_single_unified,cell_7_recog_multi_unified \
229
- --runs 3 \
230
- --description "Replication: Domain gen Kimi elementary (Section 6.5)"
231
- ```
232
-
233
- #### 6b: Nemotron elementary (if Nemotron profiles available)
234
- ```bash
235
- EVAL_CONTENT_PATH=./content-test-elementary \
236
- EVAL_SCENARIOS_FILE=./content-test-elementary/scenarios-elementary.yaml \
237
- node scripts/eval-cli.js run \
238
- --profiles cell_1_base_single_unified,cell_3_base_multi_unified,cell_5_recog_single_unified,cell_7_recog_multi_unified \
239
- --runs 1 \
240
- --description "Replication: Domain gen Nemotron elementary (Section 6.5)"
241
- ```
242
-
243
- **Expected output**: N≈60 (Kimi), N≈47 (Nemotron)
244
- **Key metrics to verify**:
245
- - Kimi: Recognition +9.9 pts (d≈0.61)
246
- - Scenario-dependent: frustrated_student +23.8, neutral scenarios ~0
247
-
248
- ---
249
-
250
- ### Phase 7: Bilateral Transformation (Section 6.11)
251
- **Purpose**: Multi-turn dialogues measuring tutor adaptation and learner growth
252
- **Original run**: `eval-2026-02-07-b6d75e87` (N=118 scored, 3 multi-turn scenarios)
253
-
254
- ```bash
255
- node scripts/eval-cli.js run \
256
- --profiles cell_1_base_single_unified,cell_2_base_single_psycho,cell_3_base_multi_unified,cell_4_base_multi_psycho,cell_5_recog_single_unified,cell_6_recog_single_psycho,cell_7_recog_multi_unified,cell_8_recog_multi_psycho \
257
- --scenarios misconception_correction_flow,mood_frustration_to_breakthrough,mutual_transformation_journey \
258
- --runs 1 \
259
- --description "Replication: Bilateral transformation multi-turn (Section 6.11)"
260
- ```
261
-
262
- **Expected output**: ~118 scored dialogues
263
- **Key metrics to verify**:
264
- - Tutor Adaptation Index: base≈0.332, recognition≈0.418 (+26%)
265
- - Learner Growth Index: base≈0.242, recognition≈0.210 (lower — reversal)
266
- - Misconception correction: largest adaptation gap (+0.175)
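The percentage changes attached to the two indices can be reproduced from the base and recognition values listed above. A small sketch, with `relativeChange` and `pct` as hypothetical helper names:

```javascript
// Relative change of a treatment value against a baseline.
function relativeChange(baseline, treatment) {
  return (treatment - baseline) / baseline;
}

// Format a fraction as a percentage with one decimal place.
const pct = (x) => `${(x * 100).toFixed(1)}%`;

const tutorChange = relativeChange(0.332, 0.418);   // Tutor Adaptation Index
const learnerChange = relativeChange(0.242, 0.210); // Learner Growth Index

console.log(pct(tutorChange), pct(learnerChange)); // 25.9% -13.2%
```

These match the rounded figures of +26% for the tutor and -13% for the learner.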
267
-
268
- ---
269
-
270
- ### Phase 8: Dynamic Rewrite Evolution (Section 6.13)
271
- **Purpose**: Track cell 21 (dynamic rewrite + Writing Pad) vs cell 7 (static)
272
- **Original runs**: `daf60f79`, `49bb2017`, `12aebedb` (N=82 total across 3 iterative runs)
273
-
274
- **REPLICATION CHALLENGE**: The original three runs represent iterative development — each run was executed at a different git commit with evolving code. This cannot be cleanly replicated because:
275
- 1. The code has since evolved past those commits
276
- 2. The progression was part of development, not a controlled experiment
277
-
278
- **Recommended approach**: Run cell 21 vs cell 7 at the current codebase state (equivalent to run 3, which had the Writing Pad activated):
279
-
280
- ```bash
281
- node scripts/eval-cli.js run \
282
- --profiles cell_7_recog_multi_unified,cell_21_recog_multi_unified_rewrite \
283
- --scenarios misconception_correction_flow,mood_frustration_to_breakthrough,mutual_transformation_journey \
284
- --runs 5 \
285
- --description "Replication: Dynamic rewrite cell 21 vs cell 7 (Section 6.13)"
286
- ```
287
-
288
- **Expected output**: ~30 scored responses
289
- **Key metric**: Cell 21 should lead cell 7 by ~+5.5 pts (reflecting run 3 state)
290
-
291
- **For full iterative replication**: This would require checking out the specific git commits (e3843ee, b2265c7, e673c4b) and running at each one; document this as a limitation of the replication.
292
-
293
- ---
294
-
295
- ### Phase 9: Cross-Judge Replication (Section 6.14)
296
- **Purpose**: Re-score all key run responses with GPT-5.2 as independent judge
297
- **Depends on**: Phases 1–7 completing (uses their run IDs)
298
-
299
- For each key run from Phases 1–7:
300
- ```bash
301
- # Rejudge each completed run with GPT-5.2
302
- node scripts/eval-cli.js rejudge <phase1-run-id> --judge openrouter.gpt
303
- node scripts/eval-cli.js rejudge <phase2-run-id> --judge openrouter.gpt
304
- node scripts/eval-cli.js rejudge <phase3a-run-id> --judge openrouter.gpt
305
- node scripts/eval-cli.js rejudge <phase3b-run-id> --judge openrouter.gpt
306
- # ... etc for all runs
307
- ```
308
-
309
- **CAUTION**: Rejudge creates new rows by default. If run twice, it creates duplicates. Use `--overwrite` to replace, or track carefully.
310
-
311
- **Expected output**: Matched response pairs (same tutor response, two judge scores)
312
- **Key metrics to verify**:
313
- - Inter-judge r = 0.49–0.64 across runs
314
- - GPT-5.2 finds ~58% of Claude's effect magnitudes
315
- - Recognition main effect d≈1.0 under GPT-5.2
316
- - Same condition ordering, no rank reversals in memory isolation
317
- - Recognition vs enhanced: may not reach significance (+1.3, p=.60)
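Inter-judge agreement of the kind listed above is typically a Pearson correlation over matched score pairs. The sketch below is a generic implementation, not the logic of `analyze-judge-reliability.js`, and the score arrays are made-up illustrations.

```javascript
// Pearson correlation over two equal-length arrays of matched scores.
function pearsonR(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error('need two arrays of equal length >= 2');
  }
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i += 1) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Hypothetical matched scores for the same responses under two judges.
const primaryJudge = [62, 71, 80, 88, 93];
const crossJudge = [58, 70, 69, 81, 84];
console.log(pearsonR(primaryJudge, crossJudge).toFixed(2));
```

Each index in the two arrays must refer to the same tutor response; mismatched pairing is one way to end up with a spuriously low r.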
318
-
319
- **Analysis**:
320
- ```bash
321
- node scripts/analyze-judge-reliability.js
322
- ```
323
-
324
- ---
325
-
326
- ## 3. Verification Checklist
327
-
328
- ### 3.1 Primary Findings to Replicate
329
-
330
- | # | Finding | Section | Key Statistic | Priority |
331
- |---|---------|---------|---------------|----------|
332
- | 1 | Recognition main effect | 6.3 | +10.2 pts, F=71.36, p<.001, d=0.80 | **Critical** |
333
- | 2 | Memory isolation: recognition dominance | 6.2 | d=1.71, +15.2 pts | **Critical** |
334
- | 3 | Memory isolation: memory modest | 6.2 | d=0.46, +4.8, p≈.08 | **Critical** |
335
- | 4 | Memory isolation: negative interaction | 6.2 | -4.2 (ceiling) | **Critical** |
336
- | 5 | A×C Interaction (Recog × Learner) | 6.3 | F=21.85, p<.001 | **High** |
337
- | 6 | A×B null (architecture doesn't matter) | 6.3–6.4 | F=0.26, n.s. | **High** |
338
- | 7 | Active control partial benefit | 6.2 | +9 pts vs +15 pts recognition | **High** |
339
- | 8 | Domain generalizability | 6.5 | +9.9 pts Kimi elementary | **Medium** |
340
- | 9 | Bilateral transformation asymmetry | 6.11 | Tutor +26%, learner -13% | **Medium** |
341
- | 10 | Recognition vs enhanced gap | 6.1 | +8.7 pts | **Medium** |
342
- | 11 | Cross-judge robustness | 6.14 | r=0.49–0.64, same direction | **High** |
343
- | 12 | Dynamic rewrite improvement | 6.13 | Cell 21 leads by +5.5 | **Low** |
344
-
345
- ### 3.2 Expected Replication Tolerances
346
-
347
- Given LLM stochasticity (temperature=0.6 for ego, 0.2 for judge), expect:
348
- - **Effect directions**: Should replicate consistently (same sign)
349
- - **Effect magnitudes**: ±3–5 points on means; ±0.2 on Cohen's d
350
- - **Statistical significance**: Large effects (d>0.8) should remain significant; marginal effects (p≈.08) may flip
351
- - **Cell ordering**: Should be preserved (no rank reversals on primary comparisons)
352
- - **Interaction patterns**: A×C should replicate; A×B null should hold
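These tolerances can be applied mechanically when filling in a comparison of original and replicated statistics. A minimal sketch, assuming the upper bound of 5 points for means and 0.2 for Cohen's d; `withinTolerance` is a hypothetical helper and the replication values are invented for illustration.

```javascript
// Compare a replicated statistic to the original: same sign, and within
// an absolute tolerance (5 points for means, 0.2 for Cohen's d).
function withinTolerance(original, replication, tolerance) {
  return {
    sameSign: Math.sign(original) === Math.sign(replication),
    closeEnough: Math.abs(original - replication) <= tolerance,
  };
}

// Original values come from the plan; replication values are hypothetical.
const meanCheck = withinTolerance(10.2, 7.9, 5);  // recognition main effect, pts
const dCheck = withinTolerance(0.8, 0.5, 0.2);    // recognition main effect, d

console.log(meanCheck, dCheck);
```

A result with `sameSign: true` but `closeEnough: false` would still count as a directional replication under the criteria above, only with a weaker or stronger magnitude than expected.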
353
-
354
- ### 3.3 Red Flags (Suggesting Implementation Issues)
355
-
356
- - Recognition main effect < +5 pts or not significant → check prompt loading
357
- - Condition ordering reversed → check profile-to-prompt mapping
358
- - All scores clustered >90 → ceiling effect / rubric calibration issue
359
- - All scores <60 → model API issue or wrong model being called
360
- - Memory isolation shows positive interaction → verify cell configurations
361
- - Cross-judge r < 0.3 → check rejudge is matching correct responses
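Several of these red flags are simple threshold checks and could be automated once a run's summary statistics are available. The sketch below covers four of them; the input shape and the `redFlags` name are assumptions for illustration, not an existing script in the package.

```javascript
// Return the red flags triggered by a run summary.
// Input fields are hypothetical; thresholds are taken from the list above.
function redFlags({ recognitionEffect, scores, crossJudgeR }) {
  const flags = [];
  if (recognitionEffect < 5) {
    flags.push('recognition effect below +5 pts: check prompt loading');
  }
  if (scores.length > 0 && scores.every((s) => s > 90)) {
    flags.push('all scores above 90: ceiling effect or rubric calibration');
  }
  if (scores.length > 0 && scores.every((s) => s < 60)) {
    flags.push('all scores below 60: model API issue or wrong model');
  }
  if (crossJudgeR < 0.3) {
    flags.push('cross-judge r below 0.3: check rejudge response matching');
  }
  return flags;
}

// Hypothetical healthy run: no flags expected.
const healthy = redFlags({ recognitionEffect: 10.2, scores: [72, 85, 91], crossJudgeR: 0.6 });
console.log(healthy.length); // 0
```

Significance, ordering reversals, and the sign of the memory interaction need the full analysis output and are left out of this sketch.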
362
-
363
- ---
364
-
365
- ## 4. Cost Estimation
366
-
367
- | Phase | Attempts | Ego Cost | Judge Cost (Opus) | GPT-5.2 Rejudge | Subtotal |
368
- |-------|----------|----------|-------------------|------------------|----------|
369
- | 1: Recognition validation | 36 | ~$0.40 | ~$1.50 | ~$0.70 | ~$2.60 |
370
- | 2: Full factorial | 360 | ~$5.00 | ~$15.00 | ~$7.00 | ~$27.00 |
371
- | 3: Memory isolation (×2) | 120 | ~$1.50 | ~$5.00 | ~$2.50 | ~$9.00 |
372
- | 4: Active control | 120 | ~$1.30 | ~$5.00 | ~$2.50 | ~$8.80 |
373
- | 5: A×B interaction | 78 | ~$1.00 | ~$3.00 | ~$1.50 | ~$5.50 |
374
- | 6: Domain gen | 107 | ~$1.20 | ~$4.00 | ~$2.00 | ~$7.20 |
375
- | 7: Bilateral transformation | 120 | ~$2.00 | ~$5.00 | ~$2.50 | ~$9.50 |
376
- | 8: Dynamic rewrite | 30 | ~$0.80 | ~$1.50 | ~$0.60 | ~$2.90 |
377
- | **Total** | **~971** | **~$13.20** | **~$40.00** | **~$19.30** | **~$72.50** |
378
-
379
- **Notes**:
380
- - Ego costs are low because Kimi K2.5 and Nemotron are free-tier on OpenRouter; costs come from superego calls (Kimi K2.5)
381
- - Judge costs dominate — Claude Opus via Claude Code CLI
382
- - GPT-5.2 rejudge accounts for ~27% of total cost
383
- - Multi-turn scenarios (phases 7, 8) cost more per evaluation due to multiple turns
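The totals row and the rejudge share can be checked by summing the per-phase figures from the table. A short sketch, with the rows transcribed from the table above:

```javascript
// Rows from the cost table: [phase, attempts, ego, judge, rejudge], USD.
const costRows = [
  ['1: Recognition validation', 36, 0.4, 1.5, 0.7],
  ['2: Full factorial', 360, 5.0, 15.0, 7.0],
  ['3: Memory isolation (x2)', 120, 1.5, 5.0, 2.5],
  ['4: Active control', 120, 1.3, 5.0, 2.5],
  ['5: AxB interaction', 78, 1.0, 3.0, 1.5],
  ['6: Domain gen', 107, 1.2, 4.0, 2.0],
  ['7: Bilateral transformation', 120, 2.0, 5.0, 2.5],
  ['8: Dynamic rewrite', 30, 0.8, 1.5, 0.6],
];

// Sum each column across phases.
const totals = costRows.reduce(
  (acc, [, attempts, ego, judge, rejudge]) => ({
    attempts: acc.attempts + attempts,
    ego: acc.ego + ego,
    judge: acc.judge + judge,
    rejudge: acc.rejudge + rejudge,
  }),
  { attempts: 0, ego: 0, judge: 0, rejudge: 0 }
);

const grandTotal = totals.ego + totals.judge + totals.rejudge;
const rejudgeShare = totals.rejudge / grandTotal;

console.log(totals.attempts, grandTotal.toFixed(2), `${Math.round(rejudgeShare * 100)}%`);
// 971 72.50 27%
```

The rejudge column is therefore roughly 27% of the overall estimate, with the Opus judge column making up the largest share.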
384
-
385
- ---
386
-
387
- ## 5. Execution Order and Timeline
388
-
389
- ### Day 1: Independent Phases (Parallel)
390
-
391
- | Time | Phase | Duration | Notes |
392
- |------|-------|----------|-------|
393
- | Morning | Phase 1 (validation) | ~2 hrs | Quick, small N |
394
- | Morning | Phase 5b (A×B Kimi) | ~3 hrs | Small N |
395
- | Morning | Phase 6a (Domain gen Kimi) | ~3 hrs | Small N |
396
- | Afternoon | Phase 3 run 1 (Memory isolation) | ~6 hrs | |
397
- | Afternoon | Phase 4 (Active control) | ~7 hrs | |
398
-
399
- ### Day 2: Main Factorial + Bilateral
400
-
401
- | Time | Phase | Duration | Notes |
402
- |------|-------|----------|-------|
403
- | All day | Phase 2 (Full factorial) | ~24 hrs | Largest run, 360 evals |
404
- | All day | Phase 3 run 2 (Memory isolation) | ~6 hrs | Independent replication |
405
- | Evening | Phase 7 (Bilateral) | ~14 hrs | Multi-turn, slower |
406
-
407
- ### Day 3: Specialized + Cross-Judge
408
-
409
- | Time | Phase | Duration | Notes |
410
- |------|-------|----------|-------|
411
- | Morning | Phase 5a (A×B Nemotron) | ~3 hrs | If Nemotron profiles exist |
412
- | Morning | Phase 6b (Domain gen Nemotron) | ~3 hrs | Optional |
413
- | Morning | Phase 8 (Dynamic rewrite) | ~5 hrs | |
414
- | Afternoon | Phase 9 (Cross-judge) | ~8 hrs | Rejudge all completed runs |
415
-
416
- ### Day 4: Analysis and Verification
417
-
418
- | Task | Tool |
419
- |------|------|
420
- | Generate reports for all runs | `eval-cli.js report <run-id>` |
421
- | Compute ANOVA tables | `analyze-eval-results.js` |
422
- | Inter-judge reliability | `analyze-judge-reliability.js` |
423
- | Bilateral transformation metrics | `analyze-interaction-evals.js` |
424
- | Compare replication vs original | Side-by-side comparison of key statistics |
425
-
426
- ---
427
-
428
- ## 6. Known Issues and Workarounds
429
-
430
- ### 6.1 Profile Name Verification
431
-
432
- Before running, verify all profile names exist in `config/tutor-agents.yaml`:
433
- ```bash
434
- grep -E "^ cell_" config/tutor-agents.yaml | head -30
435
- ```
436
-
437
- Profiles referenced in the paper:
438
- - `cell_1` through `cell_8` (main factorial)
439
- - `cell_9` through `cell_12` (enhanced prompts)
440
- - `cell_15` through `cell_18` (placebo/active control)
441
- - `cell_19`, `cell_20` (memory isolation)
442
- - `cell_21` (dynamic rewrite)
443
-
444
- ### 6.2 Nemotron Model Configuration
445
-
446
- The paper uses Nemotron for several analyses (A×B interaction, domain gen, active control). Check whether the default profiles use Kimi or Nemotron:
447
- - If all profiles default to Kimi, you may need to create Nemotron variants or use a model override mechanism
448
- - The original Nemotron runs may have used different profile names or env var overrides
449
-
450
- ### 6.3 Rejudge Deduplication
451
-
452
- GPT-5.2 rejudge can create duplicate rows. After rejudging, verify:
453
- ```sql
454
- SELECT run_id, judge_model, COUNT(*) as n
455
- FROM evaluation_results
456
- WHERE judge_model = 'openrouter/openai/gpt-5.2'
457
- GROUP BY run_id, judge_model
458
- ORDER BY n DESC;
459
- ```
460
-
461
- If duplicates exist, use the `ROW_NUMBER()` window function to deduplicate.
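The same keep-one-row-per-group idea can be sketched in plain JavaScript over rows already loaded from the database. This mirrors what `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY id DESC) = 1` would select. The `id` and `response_id` fields are hypothetical; the real `evaluation_results` schema may use different column names.

```javascript
// Keep only the newest row per key, analogous to filtering ROW_NUMBER() = 1
// when ordering by id descending within each partition.
function dedupeLatest(rows, keyOf) {
  const latest = new Map();
  for (const row of rows) {
    const key = keyOf(row);
    const kept = latest.get(key);
    if (!kept || row.id > kept.id) latest.set(key, row);
  }
  return [...latest.values()];
}

// Hypothetical rows: response r1 was rejudged twice by the same judge.
const judgedRows = [
  { id: 1, run_id: 'run-a', response_id: 'r1', judge_model: 'openrouter/openai/gpt-5.2' },
  { id: 2, run_id: 'run-a', response_id: 'r2', judge_model: 'openrouter/openai/gpt-5.2' },
  { id: 3, run_id: 'run-a', response_id: 'r1', judge_model: 'openrouter/openai/gpt-5.2' },
];

const unique = dedupeLatest(
  judgedRows,
  (r) => `${r.run_id}|${r.response_id}|${r.judge_model}`
);
console.log(unique.map((r) => r.id)); // [ 3, 2 ]
```

Keeping the newest row is one possible policy; keeping the oldest would only require flipping the comparison.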
462
-
463
- ### 6.4 Scenario Availability
464
-
465
- Verify all 15 scenarios exist:
466
- ```bash
467
- grep "^ [a-z]" config/suggestion-scenarios.yaml | head -20
468
- ```
469
-
470
- The 3 multi-turn scenarios need special handling:
471
- - `misconception_correction_flow` (3.2 avg rounds)
472
- - `mood_frustration_to_breakthrough` (3.0 avg rounds)
473
- - `mutual_transformation_journey` (4.1 avg rounds)
474
-
475
- ### 6.5 Content Isolation (Critical — from feedback item #31)
476
-
477
- The paper reports that Nemotron hallucinated philosophy content on elementary math tasks. Verify that content paths are properly isolated:
478
- - When running elementary tests, ensure `EVAL_CONTENT_PATH` points to `./content-test-elementary`
479
- - Check that the contentResolver does not leak cross-domain lecture IDs
480
- - This is flagged in the reviewer feedback as a potential data isolation bug to investigate
481
-
482
- ### 6.6 Evaluate vs Rejudge
483
-
484
- - `evaluate --force`: Only processes rows with NULL base_score (won't re-score existing)
485
- - `rejudge`: Creates new rows with a different judge_model
486
- - To re-score with Opus: null out scores first, then `evaluate --force`
487
- - **Never** use `rejudge` without the `--judge` flag; it defaults to Sonnet 4.5, not Opus
488
-
489
- ---
490
-
491
- ## 7. Statistical Analysis Pipeline
492
-
493
- After all phases complete, run the full analysis:
494
-
495
- ```bash
496
- # 1. Per-run reports
497
- for run_id in <all-replication-run-ids>; do
498
- node scripts/eval-cli.js report $run_id
499
- done
500
-
501
- # 2. Factorial ANOVA (Phase 2)
502
- node scripts/analyze-eval-results.js <factorial-run-id>
503
-
504
- # 3. Memory isolation analysis (Phase 3)
505
- # Compare 4 cells: base-nomem, base-mem, recog-nomem, recog-mem
506
- node scripts/analyze-eval-results.js <mem-iso-run1-id>
507
- node scripts/analyze-eval-results.js <mem-iso-run2-id>
508
-
509
- # 4. Inter-judge reliability (Phase 9)
510
- node scripts/analyze-judge-reliability.js
511
-
512
- # 5. Bilateral transformation (Phase 7)
513
- node scripts/analyze-interaction-evals.js <bilateral-run-id>
514
-
515
- # 6. Export all results
516
- node scripts/eval-cli.js export <run-id> --format csv
517
- ```
518
-
519
- ### 7.1 Key Comparisons
520
-
521
- | Comparison | Original | Replication | Match? |
522
- |------------|----------|-------------|--------|
523
- | Recognition main effect (d) | 0.80 | | |
524
- | Memory isolation recognition (d) | 1.71 | | |
525
- | Memory isolation memory (d) | 0.46 | | |
526
- | A×C interaction (F) | 21.85 | | |
527
- | Bilateral adaptation Δ | +0.086 | | |
528
- | Cross-judge r (factorial) | 0.64 | | |
529
- | Cross-judge r (memory iso) | 0.63 | | |
530
-
531
- ---
532
-
533
- ## 8. What Exact Replication Cannot Cover
534
-
535
- Some aspects of the original study are inherently non-replicable:
536
-
537
- 1. **Iterative development trajectory** (Section 6.13): The three dynamic rewrite runs tracked code evolution across commits. Running at the current codebase state tests the final configuration but not the developmental trajectory.
538
-
539
- 2. **Historical Nemotron data** (Section 6.2 active control): The "same-model comparison" for the active control draws on historical Nemotron data across multiple runs (N=467 base, N=545 recognition). This accumulated data is in the existing database but would require extensive separate runs to reproduce.
540
-
541
- 3. **Exact LLM outputs**: LLMs are stochastic. The same prompt will produce different text, leading to different judge scores. Replication verifies *statistical patterns* (effect directions, magnitudes, significance), not identical outputs.
542
-
543
- 4. **Model version drift**: Free-tier models (Kimi, Nemotron) on OpenRouter may have been updated since the original runs. Pin exact model IDs from `config/providers.yaml` and document any version changes.
544
-
545
- 5. **Pooled multi-turn results** (Section 6.10, Table 13): The N=161/277/165 pooled across all development runs cannot be replicated from scratch — they represent the accumulated database. The dedicated bilateral run (Phase 7) provides the controlled comparison.
546
-
547
- ---
548
-
549
- ## 9. Minimum Viable Replication
550
-
551
- If cost or time constraints require a reduced replication, prioritize these phases:
552
-
553
- | Priority | Phase | N | Cost | Finding |
554
- |----------|-------|---|------|---------|
555
- | 1 | Phase 2: Full factorial | 360 | ~$27 | Main effects + A×C interaction |
556
- | 2 | Phase 3: Memory isolation | 120 | ~$9 | Recognition as primary driver |
557
- | 3 | Phase 9: Cross-judge (on phases 2+3) | 480 | ~$10 | Judge robustness |
558
- | **Subtotal** | | **~960** | **~$46** | **Covers findings 1–6, 11** |
559
-
560
- This covers the three most important claims: recognition dominance, memory isolation, and cross-judge robustness. Phases 1, 4–8 provide supporting evidence but are less critical for the core argument.
561
-
562
- ---
563
-
564
- ## 10. Reviewer Feedback Integration
565
-
566
- The `feedback-2026-02-07.md` file contains 38 items from reviewers. Several are directly relevant to replication:
567
-
568
- | # | Feedback | Replication Implication |
569
- |---|----------|----------------------|
570
- | 7 | Clarify base/enhanced/active variants | Ensure profile names match paper descriptions |
571
- | 26 | Model confound sounds defensive | Consider running active control on Kimi too |
572
- | 28 | Rename unified/psycho to Single/Multi | Check if profile names have been updated |
573
- | 29 | Verify cells 6,8 scores with latest evals | Compare replication cells 6,8 to paper values |
574
- | 31 | **Content isolation may be compromised** | **Critical: verify no cross-domain data leaks** |
575
- | 35 | Models not "trained" on content | Verify course data isolation in contentResolver |
576
-
577
- **Recommendation**: Before starting replication, run Phase 6 (domain gen) first with logging enabled to verify content isolation (feedback items 31, 35). If the elementary scenarios reference philosophy lecture IDs, there is a data isolation bug that must be fixed before proceeding.