@machinespirits/eval 0.2.0 → 0.3.0

Files changed (74)
  1. package/README.md +91 -9
  2. package/config/eval-settings.yaml +3 -3
  3. package/config/paper-manifest.json +486 -0
  4. package/config/providers.yaml +9 -6
  5. package/config/tutor-agents.yaml +2261 -0
  6. package/content/README.md +23 -0
  7. package/content/courses/479/course.md +53 -0
  8. package/content/courses/479/lecture-1.md +361 -0
  9. package/content/courses/479/lecture-2.md +360 -0
  10. package/content/courses/479/lecture-3.md +655 -0
  11. package/content/courses/479/lecture-4.md +530 -0
  12. package/content/courses/479/lecture-5.md +326 -0
  13. package/content/courses/479/lecture-6.md +346 -0
  14. package/content/courses/479/lecture-7.md +326 -0
  15. package/content/courses/479/lecture-8.md +273 -0
  16. package/content/courses/479/roadmap-slides.md +656 -0
  17. package/content/manifest.yaml +8 -0
  18. package/docs/research/build.sh +44 -20
  19. package/docs/research/figures/figure10.png +0 -0
  20. package/docs/research/figures/figure11.png +0 -0
  21. package/docs/research/figures/figure3.png +0 -0
  22. package/docs/research/figures/figure4.png +0 -0
  23. package/docs/research/figures/figure5.png +0 -0
  24. package/docs/research/figures/figure6.png +0 -0
  25. package/docs/research/figures/figure7.png +0 -0
  26. package/docs/research/figures/figure8.png +0 -0
  27. package/docs/research/figures/figure9.png +0 -0
  28. package/docs/research/header.tex +23 -2
  29. package/docs/research/paper-full.md +941 -285
  30. package/docs/research/paper-short.md +216 -585
  31. package/docs/research/references.bib +132 -0
  32. package/docs/research/slides-header.tex +188 -0
  33. package/docs/research/slides-pptx.md +363 -0
  34. package/docs/research/slides.md +531 -0
  35. package/docs/research/style-reference-pptx.py +199 -0
  36. package/package.json +6 -5
  37. package/scripts/analyze-eval-results.js +69 -17
  38. package/scripts/analyze-mechanism-traces.js +763 -0
  39. package/scripts/analyze-modulation-learning.js +498 -0
  40. package/scripts/analyze-prosthesis.js +144 -0
  41. package/scripts/analyze-run.js +264 -79
  42. package/scripts/assess-transcripts.js +853 -0
  43. package/scripts/browse-transcripts.js +854 -0
  44. package/scripts/check-parse-failures.js +73 -0
  45. package/scripts/code-dialectical-modulation.js +1320 -0
  46. package/scripts/download-data.sh +55 -0
  47. package/scripts/eval-cli.js +106 -18
  48. package/scripts/generate-paper-figures.js +663 -0
  49. package/scripts/generate-paper-figures.py +577 -76
  50. package/scripts/generate-paper-tables.js +299 -0
  51. package/scripts/qualitative-analysis-ai.js +3 -3
  52. package/scripts/render-sequence-diagram.js +694 -0
  53. package/scripts/test-latency.js +210 -0
  54. package/scripts/test-rate-limit.js +95 -0
  55. package/scripts/test-token-budget.js +332 -0
  56. package/scripts/validate-paper-manifest.js +670 -0
  57. package/services/__tests__/evalConfigLoader.test.js +2 -2
  58. package/services/__tests__/learnerRubricEvaluator.test.js +361 -0
  59. package/services/__tests__/learnerTutorInteractionEngine.test.js +326 -0
  60. package/services/evaluationRunner.js +975 -98
  61. package/services/evaluationStore.js +12 -4
  62. package/services/learnerTutorInteractionEngine.js +27 -2
  63. package/services/mockProvider.js +133 -0
  64. package/services/promptRewriter.js +1471 -5
  65. package/services/rubricEvaluator.js +55 -2
  66. package/services/transcriptFormatter.js +675 -0
  67. package/docs/EVALUATION-VARIABLES.md +0 -589
  68. package/docs/REPLICATION-PLAN.md +0 -577
  69. package/scripts/analyze-run.mjs +0 -282
  70. package/scripts/compare-runs.js +0 -44
  71. package/scripts/compare-suggestions.js +0 -80
  72. package/scripts/dig-into-run.js +0 -158
  73. package/scripts/show-failed-suggestions.js +0 -64
  74. package/scripts/{check-run.mjs → check-run.js} +0 -0
package/docs/research/slides.md +531 -0
@@ -0,0 +1,531 @@
1
+ ---
2
+ title: "*Geist* in the Machine"
3
+ subtitle: "Mutual Recognition and Multiagent Architecture\\newline for Dialectical AI Tutoring"
4
+ author: "Liam Magee"
5
+ institute: "Education Policy, Organization and Leadership\\newline University of Illinois Urbana-Champaign"
6
+ date: "February 2026"
7
+ bibliography: references.bib
8
+ csl: apa.csl
9
+ theme: "metropolis"
10
+ aspectratio: 169
11
+ toc: false
12
+ header-includes:
13
+ - \input{slides-header.tex}
14
+ ---
15
+
16
+ ## The Problem
17
+
18
+ \vspace{0.5em}
19
+
20
+ Current AI tutoring treats learners as **knowledge deficits** to be filled.
21
+
22
+ \vspace{0.5em}
23
+
24
+ - Learner says something interesting $\rightarrow$ tutor redirects to curriculum
25
+ - Learner struggles $\rightarrow$ tutor simplifies or restates
26
+ - Learner resists $\rightarrow$ tutor notes "engagement metrics" and moves on
27
+
28
+ \vspace{0.5em}
29
+
30
+ \alert{The learner is never encountered as a subject.}
31
+
32
+ This maps onto Hegel's master--slave dialectic: the master (tutor) consumes the slave's (learner's) labor without genuine encounter.
33
+
34
+ ---
35
+
36
+ ## Hegel's Alternative: Mutual Recognition
37
+
38
+ \vspace{0.5em}
39
+
40
+ **Recognition** (*Anerkennung*): each party acknowledges the other as an autonomous consciousness whose understanding has intrinsic validity.
41
+
42
+ \vspace{0.5em}
43
+
44
+ :::::::::::::: {.columns}
45
+ ::: {.column width="50%"}
46
+
47
+ **What it is**
48
+
49
+ - A \alert{relational stance}
50
+ - How the tutor constitutes the learner
51
+ - Achievable without consciousness
52
+
53
+ :::
54
+ ::: {.column width="50%"}
55
+
56
+ **What it is not**
57
+
58
+ - Not agreement --- can disagree while recognizing
59
+ - Not affirmation --- "good job!" is not recognition
60
+ - Not a consciousness requirement
61
+
62
+ :::
63
+ ::::::::::::::
64
+
65
+ ---
66
+
67
+ ## The Drama Machine
68
+
69
+ \vspace{0.5em}
70
+
71
+ :::::::::::::: {.columns}
72
+ ::: {.column width="48%"}
73
+
74
+ \begin{block}{Ego (Response Generator)}
75
+ \begin{itemize}
76
+ \item Generates pedagogical suggestions
77
+ \item Has \textbf{final authority} over output
78
+ \item Can override or incorporate Superego feedback
79
+ \end{itemize}
80
+ \end{block}
81
+
82
+ :::
83
+ ::: {.column width="48%"}
84
+
85
+ \begin{block}{Superego (Internal Critic)}
86
+ \begin{itemize}
87
+ \item Evaluates Ego's draft
88
+ \item Checks pedagogical quality
89
+ \item Structured critique: approve / revise / reject
90
+ \end{itemize}
91
+ \end{block}
92
+
93
+ :::
94
+ ::::::::::::::
95
+
96
+ \vspace{0.5em}
97
+
98
+ \alert{Recognition prompts} add Hegelian theory to both Ego and Superego:
99
+
100
+ - *"Acknowledge the learner as an autonomous subject..."*
101
+ - *"Evaluate whether the response treats the learner's understanding as having intrinsic validity..."*
102
+
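The Ego/Superego loop described on this slide can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: `Critique`, `run_turn`, `stub_ego`, `stub_superego`, and the `max_rounds` limit are all hypothetical names and assumptions, and the stubs stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Critique:
    verdict: str          # one of "approve", "revise", "reject"
    feedback: str = ""


# Hypothetical signatures: the Ego drafts (optionally given a critique),
# the Superego critiques a draft.
Ego = Callable[[str, Optional[Critique]], str]
Superego = Callable[[str, str], Critique]


def run_turn(ego: Ego, superego: Superego, learner_message: str,
             max_rounds: int = 2) -> str:
    """Draft, critique, and redraft; the Ego's latest draft is always returned."""
    draft = ego(learner_message, None)
    for _ in range(max_rounds):
        critique = superego(learner_message, draft)
        if critique.verdict == "approve":
            break
        # "revise" or "reject": the Ego may incorporate or override the
        # feedback, but it keeps final authority over the output.
        draft = ego(learner_message, critique)
    return draft


def stub_ego(message: str, critique: Optional[Critique]) -> str:
    if critique is None:
        return "Complete the next lecture."
    return f"You raised '{message}'; let's build on that."


def stub_superego(message: str, draft: str) -> Critique:
    if message in draft:
        return Critique("approve")
    return Critique("revise", "Engage with what the learner actually said.")


reply = run_turn(stub_ego, stub_superego, "why does recognition matter?")
print(reply)
```

With these stubs the first draft is sent back for revision and the second is approved, so the returned reply quotes the learner's own message.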
103
+ ---
104
+
105
+ ## Phase 2: Advanced Mechanisms
106
+
107
+ \vspace{0.3em}
108
+
109
+ Nine architectural mechanisms tested beyond base Ego/Superego:
110
+
111
+ \vspace{0.3em}
112
+
113
+ \footnotesize
114
+
115
+ | Mechanism | What it does |
116
+ |:----------|:-------------|
117
+ | Self-reflection | Ego reviews own prior performance |
118
+ | Bidirectional profiling | Theory of Mind models of each party |
119
+ | Intersubjective recognition | Explicit other-awareness prompts |
120
+ | Combined (all three) | Full mechanism stack |
121
+ | Cross-turn superego memory | Superego retains conversation context |
122
+ | Prompt rewriting | Dynamic prompt evolution mid-dialogue |
123
+ | Quantitative disposition | Numeric stance tracking |
124
+ | Prompt erosion | Gradual prompt degradation test |
125
+
126
+ \normalsize
127
+
128
+ ---
129
+
130
+ ## Evaluation Design
131
+
132
+ \vspace{0.5em}
133
+
134
+ **37 evaluations**, N=3,383 primary scored responses
135
+
136
+ \vspace{0.5em}
137
+
138
+ :::::::::::::: {.columns}
139
+ ::: {.column width="55%"}
140
+
141
+ - **2\texttimes 2\texttimes 2 factorial** (N=350)
142
+ - Recognition \texttimes{} Architecture \texttimes{} Learner type
143
+ - **Memory isolation** (N=120)
144
+ - Disentangle recognition from episodic memory
145
+ - **Multi-model probe** (N=655)
146
+ - 5 ego models, architecture held constant
147
+
148
+ :::
149
+ ::: {.column width="45%"}
150
+
151
+ - **Dynamic learner tests** (N=660)
152
+ - Mechanisms with feedback-capable learners
153
+ - **Cross-judge replication** (N=977)
154
+ - GPT-5.2 independent validation
155
+ - **14-dimension rubric**
156
+ - Scored by Claude Opus 4.6
157
+
158
+ :::
159
+ ::::::::::::::
160
+
161
+ ---
162
+
163
+ ## Finding 1: Memory Isolation (The Definitive Finding)
164
+
165
+ \vspace{0.3em}
166
+
167
+ 2\texttimes 2 design (N=120, 30/cell) disentangles recognition from episodic memory:
168
+
169
+ \vspace{0.5em}
170
+
171
+ \centering
172
+
173
+ | | No Memory | Memory |
174
+ |:--|:-----------:|:--------:|
175
+ | **No Recognition** | 75.4 | 80.2 |
176
+ | **Recognition** | \alert{90.6} | \alert{91.2} |
177
+
178
+ \raggedright
179
+
180
+ \vspace{0.5em}
181
+
182
+ - **Recognition**: \alert{+15.2 pts}, d=1.71, p<.001
183
+ - **Memory**: +4.8 pts, d=0.46, n.s.
184
+ - **Interaction**: --4.2 pts (ceiling effect, not synergy)
185
+
186
+ Recognition alone accounts for nearly the entire improvement.
187
+
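As a check on the arithmetic, the three bullet values can be recomputed from the four cell means in the table. The sketch below assumes the bullets are simple effects taken at the no-memory and no-recognition baselines, which is what reproduces +15.2 and +4.8; the function and key names are illustrative only.

```python
# Cell means copied from the 2x2 table on this slide.
means = {
    ("base", "no_memory"): 75.4,
    ("base", "memory"): 80.2,
    ("recognition", "no_memory"): 90.6,
    ("recognition", "memory"): 91.2,
}


def effects_2x2(m):
    """Simple effects and the difference-of-differences interaction."""
    recognition = m[("recognition", "no_memory")] - m[("base", "no_memory")]
    memory = m[("base", "memory")] - m[("base", "no_memory")]
    memory_under_recognition = (
        m[("recognition", "memory")] - m[("recognition", "no_memory")]
    )
    interaction = memory_under_recognition - memory
    return {
        "recognition": round(recognition, 1),
        "memory": round(memory, 1),
        "interaction": round(interaction, 1),
    }


effects = effects_2x2(means)
print(effects)
```

The negative interaction reflects that memory adds 4.8 points without recognition but only 0.6 points with it.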
188
+ ---
189
+
190
+ ## Finding 2: Full Factorial (2\texttimes 2\texttimes 2)
191
+
192
+ \vspace{0.3em}
193
+
194
+ N=350, Kimi K2.5 ego, Opus 4.6 judge:
195
+
196
+ \vspace{0.3em}
197
+
198
+ \footnotesize
199
+
200
+ | Cell | Recog | Arch | Learner | M (SD) |
201
+ |:------:|:-------:|:------:|:---------:|:--------:|
202
+ | 1 | -- | Single | Single | 73.4 (16.2) |
203
+ | 2 | -- | Multi | Single | 69.9 (23.3) |
204
+ | 3 | -- | Single | Multi | 75.5 (15.2) |
205
+ | 4 | -- | Multi | Multi | 75.2 (18.1) |
206
+ | 5 | + | Single | Single | \alert{90.2} (7.1) |
207
+ | 6 | + | Multi | Single | \alert{83.9} (18.1) |
208
+ | 7 | + | Single | Multi | \alert{90.1} (7.1) |
209
+ | 8 | + | Multi | Multi | \alert{87.3} (10.3) |
210
+
211
+ \normalsize
212
+
213
+ **Recognition**: \alert{+14.4 pts}, F(1,342)=110.04, p<.001, $\eta^2$=.243, d=1.11
214
+
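The headline numbers can be sanity-checked from the table and the F statistic. This is a sketch under two assumptions: the +14.4 is treated as an unweighted difference of cell means (cell sizes are not given on the slide), and the reported $\eta^2$ is treated as partial eta squared, computed from F and its degrees of freedom. Function names are illustrative.

```python
# Cell means from the factorial table: cells 1-4 (no recognition), 5-8 (recognition).
base_cells = [73.4, 69.9, 75.5, 75.2]
recognition_cells = [90.2, 83.9, 90.1, 87.3]


def unweighted_main_effect(without, with_):
    """Difference between the unweighted averages of two sets of cell means."""
    return sum(with_) / len(with_) - sum(without) / len(without)


def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


effect = unweighted_main_effect(base_cells, recognition_cells)
eta_sq = partial_eta_squared(110.04, 1, 342)
print(round(effect, 1), round(eta_sq, 3))
```

Both values agree with the slide to the reported precision, though a weighted analysis with the actual cell sizes could differ slightly.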
215
+ ---
216
+
217
+ ## Finding 3: Architecture is Additive
218
+
219
+ Multi-model probe (N=655, 5 ego models):
220
+
221
+ \vspace{0.3em}
222
+
223
+ \footnotesize
224
+
225
+ | Model | Base | +Arch | +Recog | +Both | A\texttimes B |
226
+ |:-------|:------:|:-------:|:--------:|:-------:|:-----:|
227
+ | Kimi K2.5 | 73.4 | 75.5 | \alert{90.2} | 90.1 | +0.5 |
228
+ | Haiku | 78.2 | 81.9 | \alert{93.3} | 93.5 | --3.7 |
229
+ | DeepSeek-R1 | 71.1 | 71.3 | \alert{88.9} | 83.2 | --5.7 |
230
+ | GLM-4.7 | 63.9 | 62.2 | \alert{73.5} | 74.9 | +3.1 |
231
+ | Nemotron | 62.3 | 62.6 | \alert{78.2} | 72.5 | --5.7 |
232
+
233
+ \normalsize
234
+
235
+ - A\texttimes B interaction: --5.7 to +3.1 (mean --1.8) --- \alert{no synergy}
236
+ - Recognition range: +9.6 to +17.8 across all models
237
+
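The stated recognition range can be reproduced from the Base and +Recog columns of the table. The sketch below is illustrative and only recomputes that per-model difference; variable names are not from the package.

```python
# (Base, +Recog) means per ego model, copied from the table on this slide.
scores = {
    "Kimi K2.5": (73.4, 90.2),
    "Haiku": (78.2, 93.3),
    "DeepSeek-R1": (71.1, 88.9),
    "GLM-4.7": (63.9, 73.5),
    "Nemotron": (62.3, 78.2),
}

# Recognition delta per model: +Recog minus Base, rounded to one decimal.
deltas = {model: round(recog - base, 1) for model, (base, recog) in scores.items()}

smallest = min(deltas, key=deltas.get)
largest = max(deltas, key=deltas.get)
print(deltas)
print(smallest, deltas[smallest], largest, deltas[largest])
```

GLM-4.7 shows the smallest gain and DeepSeek-R1 the largest, matching the +9.6 to +17.8 range.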
238
+ ---
239
+
240
+ ## Finding 4: Domain Generalizability
241
+
242
+ Recognition effect across 6 tutorial domains (N=60):
243
+
244
+ \vspace{0.3em}
245
+
246
+ \footnotesize
247
+
248
+ | Domain | Base | Recog | $\Delta$ |
249
+ |:--------|:------:|:-------:|:---:|
250
+ | Climate science | 72.0 | 93.8 | \alert{+21.8} |
251
+ | Ethics | 72.3 | 89.3 | \alert{+17.0} |
252
+ | Mathematics | 73.0 | 89.2 | \alert{+16.2} |
253
+ | Philosophy | 75.2 | 89.7 | \alert{+14.5} |
254
+ | Machine learning | 78.0 | 91.5 | \alert{+13.5} |
255
+ | Poetry | 86.0 | 92.5 | +6.5 |
256
+
257
+ \normalsize
258
+
259
+ \vspace{0.3em}
260
+
261
+ Strong for conceptual domains (+14 to +22 pts). Weakest for poetry (+6.5) --- high baseline leaves less room for improvement.
262
+
263
+ ---
264
+
265
+ ## Finding 5: Scripted vs. Dynamic Learners
266
+
267
+ :::::::::::::: {.columns}
268
+ ::: {.column width="48%"}
269
+
270
+ \begin{alertblock}{Scripted learners}
271
+ \begin{itemize}
272
+ \item Pre-written responses
273
+ \item 9 mechanisms cluster within 2.4 pts
274
+ \item No differentiation --- noise floor
275
+ \end{itemize}
276
+ \end{alertblock}
277
+
278
+ :::
279
+ ::: {.column width="48%"}
280
+
281
+ \begin{exampleblock}{Dynamic learners}
282
+ \begin{itemize}
283
+ \item LLM-generated, ego/superego
284
+ \item Mechanisms spread 5+ pts
285
+ \item Recognition doubles: +7.6 $\rightarrow$ \textbf{+14.8}
286
+ \end{itemize}
287
+ \end{exampleblock}
288
+
289
+ :::
290
+ ::::::::::::::
291
+
292
+ \vspace{0.5em}
293
+
294
+ **Lesson**: Mechanism effects require genuine feedback loops to manifest.
295
+
296
+ ---
297
+
298
+ ## Finding 6: Dynamic Learner Mechanisms
299
+
300
+ Complete 2\texttimes 4 matrix (N=480, Haiku ego, dynamic learner):
301
+
302
+ \vspace{0.3em}
303
+
304
+ | Mechanism | Base | Recog | $\Delta$ |
305
+ |:-----------|:------:|:-------:|:---:|
306
+ | Self-reflection | 72.3 | 85.6 | +13.3 |
307
+ | Bidirectional profiling | 74.6 | \alert{88.8} | +14.2 |
308
+ | Intersubjective | 67.7 | 82.8 | +15.1 |
309
+ | Combined | 73.7 | 87.8 | +14.1 |
310
+
311
+ \vspace{0.3em}
312
+
313
+ - Variance collapses with added mechanisms (SD: 22.5 $\rightarrow$ 11.8)
314
+ - Recognition $\Delta$ stable (+13.3 to +15.1) regardless of mechanism
315
+ - Profiling = highest ceiling; intersubjective = lowest floor
316
+
317
+ ---
318
+
319
+ ## Finding 7: Cognitive Prosthesis Fails
320
+
321
+ Can a strong Superego (Kimi K2.5) compensate for a weak Ego (Nemotron)?
322
+
323
+ \vspace{0.5em}
324
+
325
+ :::::::::::::: {.columns}
326
+ ::: {.column width="55%"}
327
+
328
+ \begin{alertblock}{No.}
329
+ Full mechanism stack scores \textbf{49.5}:\\that's \alert{about 15 pts below} the Nemotron\\simple base (64.2)
330
+ \end{alertblock}
331
+
332
+ :::
333
+ ::: {.column width="45%"}
334
+
335
+ - Same mechanisms boost Haiku by **+20 pts**
336
+ - Static dims fine (spec 4.0)
337
+ - Dynamic dims fail (adaptation 1.8)
338
+ - Parse failures: 16--45\% of turns
339
+
340
+ :::
341
+ ::::::::::::::
342
+
343
+ \vspace{0.5em}
344
+
345
+ **Minimum ego capability threshold**: The mechanisms amplify what the Ego can already do --- they cannot substitute for missing capability.
346
+
347
+ ---
348
+
349
+ ## Finding 8: Cross-Judge Robustness
350
+
351
+ GPT-5.2 independently rejudged N=977 paired responses:
352
+
353
+ \vspace{0.3em}
354
+
355
+ | Finding | Claude | GPT-5.2 | Replicates? |
356
+ |:---------|:--------:|:---------:|:-------------:|
357
+ | Recognition (memory) | d=1.71 | d=1.54 | Yes |
358
+ | Memory effect | d=0.46 | d=0.49 | Yes (small) |
359
+ | Architecture effect | +2.6 | --0.2 | Yes (null) |
360
+ | Mechanism clustering | 2.8 pt | 4.4 pt | Yes (null) |
361
+
362
+ \vspace{0.3em}
363
+
364
+ - Inter-judge r = 0.44--0.64 (all p<.001)
365
+ - GPT-5.2 finds 37--59\% of Claude's effect magnitudes
366
+ - Always same direction --- \alert{no sign reversals}
367
+
368
+ ---
369
+
370
+ ## What Recognition Looks Like
371
+
372
+ \vspace{0.5em}
373
+
374
+ **Base tutor** to a struggling learner:
375
+
376
+ > "You left off at the neural networks section. Complete this lecture to maintain your learning streak."
377
+
378
+ \vspace{0.3em}
379
+
380
+ **Recognition tutor** to the same learner:
381
+
382
+ > "This is your third session --- you've persisted through quiz-479-3 three times, which signals you're wrestling with how recognition operates in the dialectic..."
383
+
384
+ \vspace{0.3em}
385
+
386
+ Three systematic changes:
387
+
388
+ 1. The ego \alert{listens to its internal critic} (superego feedback incorporated)
389
+ 2. The tutor \alert{builds on learner contributions} (not redirecting to curriculum)
390
+ 3. \alert{Mid-conversation strategy shifts} occur (30\% of recognition dialogues vs 0\% base)
391
+
392
+ ---
393
+
394
+ ## Dialectical Impasse: The Strongest Test
395
+
396
+ Three 5-turn scenarios with escalating resistance (N=24):
397
+
398
+ \vspace{0.3em}
399
+
400
+ - **Epistemic resistance** (Popperian critique): Recognition \alert{+43 pts}
401
+ - **Productive deadlock** (incompatible frameworks): Recognition \alert{+29 pts}
402
+ - **Affective shutdown** (emotional retreat): Recognition --1.1 (null)
403
+
404
+ \vspace{0.3em}
405
+
406
+ Resolution strategy coding ($\chi^2$=24.00, p<.001, V=1.000):
407
+
408
+ - **Base**: 12/12 withdraw from encounter entirely
409
+ - **Recognition**: 10/12 scaffolded reframing (*Aufhebung*), 1 mutual recognition, 1 domination
410
+
411
+ \vspace{0.3em}
412
+
413
+ The null on affective shutdown sharpens the claim: recognition's contribution is **epistemological**, not primarily affective.
414
+
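The $\chi^2$ and Cramér's V reported above follow directly from the resolution-strategy counts. The sketch below recomputes them in plain Python; the category labels are paraphrased from the slide, and the function name is illustrative. With expected counts as low as 0.5 in some cells, the chi-square approximation is rough, so the statistic is best read as descriptive.

```python
import math

# Observed resolution strategies per condition (12 dialogues each).
observed = {
    "base": {"withdrawal": 12, "scaffolded_reframing": 0,
             "mutual_recognition": 0, "domination": 0},
    "recognition": {"withdrawal": 0, "scaffolded_reframing": 10,
                    "mutual_recognition": 1, "domination": 1},
}


def chi_square_and_cramers_v(table):
    """Pearson chi-square and Cramér's V for a rows-by-columns count table."""
    rows = list(table)
    cols = list(next(iter(table.values())))
    n = sum(sum(counts.values()) for counts in table.values())
    row_totals = {r: sum(table[r].values()) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}

    chi2 = 0.0
    for r in rows:
        for c in cols:
            expected = row_totals[r] * col_totals[c] / n
            chi2 += (table[r][c] - expected) ** 2 / expected

    v = math.sqrt(chi2 / (n * (min(len(rows), len(cols)) - 1)))
    return chi2, v


chi2, cramers_v = chi_square_and_cramers_v(observed)
print(f"chi2={chi2:.2f}, V={cramers_v:.3f}")
```

V reaches its maximum of 1 because no resolution strategy appears in both conditions.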
415
+ ---
416
+
417
+ ## The Learner Superego Paradox
418
+
419
+ \vspace{0.5em}
420
+
421
+ Multi-agent learner architecture **hurts** learner quality (d=1.43, F=68.28, p<.001):
422
+
423
+ \vspace{0.3em}
424
+
425
+ - Designed to improve through internal self-critique
426
+ - Actually over-edits --- polishes away messy, authentic engagement
427
+ - Recognition partially rescues multi-agent learner (d=0.79, p=.004)
428
+
429
+ \vspace{0.5em}
430
+
431
+ \begin{exampleblock}{Hegelian interpretation}
432
+ External recognition from an Other is structurally more effective than internal self-critique. You cannot bootstrap genuine dialogue from a monologue.
433
+ \end{exampleblock}
434
+
435
+ ---
436
+
437
+ ## Practical Recommendations
438
+
439
+ \vspace{0.5em}
440
+
441
+ :::::::::::::: {.columns}
442
+ ::: {.column width="50%"}
443
+
444
+ 1. \alert{Add recognition prompts}
445
+ - Immediate +14 pt improvement
446
+ - No architecture changes needed
447
+
448
+ 2. **Architecture is optional**
449
+ - Modest additive benefit (+2 pts)
450
+
451
+ 3. **Use dynamic learners for testing**
452
+ - Scripted learners mask effects
453
+
454
+ :::
455
+ ::: {.column width="50%"}
456
+
457
+ 4. **Theory of Mind profiling**
458
+ - Best mechanism for ceiling performance
459
+
460
+ 5. **Token budgets can be cut 4--16x**
461
+ - No quality loss
462
+
463
+ 6. **Minimum ego capability matters**
464
+ - Mechanisms amplify, don't substitute
465
+
466
+ :::
467
+ ::::::::::::::
468
+
469
+ ---
470
+
471
+ ## Limitations
472
+
473
+ \vspace{0.5em}
474
+
475
+ 1. **Simulated learners, not humans** --- all "learners" are LLM agents
476
+ 2. **LLM-as-judge** --- Claude Opus evaluates (mitigated by GPT-5.2 cross-judge)
477
+ 3. **Single content domain** --- primarily philosophy of education
478
+ 4. **No longitudinal data** --- snapshots, not learning trajectories
479
+ 5. **Prompt-level intervention** --- recognition embedded in prompts, not weights
480
+ 6. **Small N per cell** --- 30 observations per condition in key experiments
481
+
482
+ ---
483
+
484
+ ## Conclusion
485
+
486
+ \vspace{0.5em}
487
+
488
+ **Recognition theory** produces robust, replicable improvements in AI tutoring quality:
489
+
490
+ - d=1.11 to d=1.71 depending on experiment
491
+ - Replicates across 5 models, 6 domains, 2 judges
492
+ - Survives all controls: memory isolation, prompt elaboration, token budget
493
+
494
+ \vspace{0.3em}
495
+
496
+ **Multi-agent architecture** contributes additively but modestly.
497
+
498
+ \vspace{0.5em}
499
+
500
+ \begin{block}{The Key Insight}
501
+ Philosophical theories of intersubjectivity can serve as productive design heuristics for AI systems. Recognition is better understood as an \alert{achievable relational stance} than a requirement for machine consciousness.
502
+ \end{block}
503
+
504
+ ---
505
+
506
+ ## Thank You
507
+
508
+ \vspace{2em}
509
+
510
+ \vspace{0.8em}
511
+
512
+ \normalsize
513
+
514
+ *Geist* in the Machine (v2.3.14)
515
+
516
+ \footnotesize
517
+
518
+ 37 evaluations | N=3,383 scored | 5 ego models | 2 judges
519
+
520
+ \vspace{1.5em}
521
+
522
+ \normalsize
523
+
524
+ Liam Magee
525
+
526
+ \footnotesize
527
+
528
+ Education Policy, Organization and Leadership
529
+
530
+ University of Illinois Urbana-Champaign
531
+