@machinespirits/eval 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (68)
  1. package/components/MobileEvalDashboard.tsx +267 -0
  2. package/components/comparison/DeltaAnalysisTable.tsx +137 -0
  3. package/components/comparison/ProfileComparisonCard.tsx +176 -0
  4. package/components/comparison/RecognitionABMode.tsx +385 -0
  5. package/components/comparison/RecognitionMetricsPanel.tsx +135 -0
  6. package/components/comparison/WinnerIndicator.tsx +64 -0
  7. package/components/comparison/index.ts +5 -0
  8. package/components/mobile/BottomSheet.tsx +233 -0
  9. package/components/mobile/DimensionBreakdown.tsx +210 -0
  10. package/components/mobile/DocsView.tsx +363 -0
  11. package/components/mobile/LogsView.tsx +481 -0
  12. package/components/mobile/PsychodynamicQuadrant.tsx +261 -0
  13. package/components/mobile/QuickTestView.tsx +1098 -0
  14. package/components/mobile/RecognitionTypeChart.tsx +124 -0
  15. package/components/mobile/RecognitionView.tsx +809 -0
  16. package/components/mobile/RunDetailView.tsx +261 -0
  17. package/components/mobile/RunHistoryView.tsx +367 -0
  18. package/components/mobile/ScoreRadial.tsx +211 -0
  19. package/components/mobile/StreamingLogPanel.tsx +230 -0
  20. package/components/mobile/SynthesisStrategyChart.tsx +140 -0
  21. package/config/interaction-eval-scenarios.yaml +832 -0
  22. package/config/learner-agents.yaml +248 -0
  23. package/docs/research/ABLATION-DIALOGUE-ROUNDS.md +52 -0
  24. package/docs/research/ABLATION-MODEL-SELECTION.md +53 -0
  25. package/docs/research/ADVANCED-EVAL-ANALYSIS.md +60 -0
  26. package/docs/research/ANOVA-RESULTS-2026-01-14.md +257 -0
  27. package/docs/research/COMPREHENSIVE-EVALUATION-PLAN.md +586 -0
  28. package/docs/research/COST-ANALYSIS.md +56 -0
  29. package/docs/research/CRITICAL-REVIEW-RECOGNITION-TUTORING.md +340 -0
  30. package/docs/research/DYNAMIC-VS-SCRIPTED-ANALYSIS.md +291 -0
  31. package/docs/research/EVAL-SYSTEM-ANALYSIS.md +306 -0
  32. package/docs/research/FACTORIAL-RESULTS-2026-01-14.md +301 -0
  33. package/docs/research/IMPLEMENTATION-PLAN-CRITIQUE-RESPONSE.md +1988 -0
  34. package/docs/research/LONGITUDINAL-DYADIC-EVALUATION.md +282 -0
  35. package/docs/research/MULTI-JUDGE-VALIDATION-2026-01-14.md +147 -0
  36. package/docs/research/PAPER-EXTENSION-DYADIC.md +204 -0
  37. package/docs/research/PAPER-UNIFIED.md +659 -0
  38. package/docs/research/PAPER-UNIFIED.pdf +0 -0
  39. package/docs/research/PROMPT-IMPROVEMENTS-2026-01-14.md +356 -0
  40. package/docs/research/SESSION-NOTES-2026-01-11-RECOGNITION-EVAL.md +419 -0
  41. package/docs/research/apa.csl +2133 -0
  42. package/docs/research/archive/PAPER-DRAFT-RECOGNITION-TUTORING.md +1637 -0
  43. package/docs/research/archive/paper-multiagent-tutor.tex +978 -0
  44. package/docs/research/paper-draft/full-paper.md +136 -0
  45. package/docs/research/paper-draft/images/pasted-image-2026-01-24T03-47-47-846Z-d76a7ae2.png +0 -0
  46. package/docs/research/paper-draft/references.bib +515 -0
  47. package/docs/research/transcript-baseline.md +139 -0
  48. package/docs/research/transcript-recognition-multiagent.md +187 -0
  49. package/hooks/useEvalData.ts +625 -0
  50. package/index.js +27 -0
  51. package/package.json +73 -0
  52. package/routes/evalRoutes.js +3002 -0
  53. package/scripts/advanced-eval-analysis.js +351 -0
  54. package/scripts/analyze-eval-costs.js +378 -0
  55. package/scripts/analyze-eval-results.js +513 -0
  56. package/scripts/analyze-interaction-evals.js +368 -0
  57. package/server-init.js +45 -0
  58. package/server.js +162 -0
  59. package/services/benchmarkService.js +1892 -0
  60. package/services/evaluationRunner.js +739 -0
  61. package/services/evaluationStore.js +1121 -0
  62. package/services/learnerConfigLoader.js +385 -0
  63. package/services/learnerTutorInteractionEngine.js +857 -0
  64. package/services/memory/learnerMemoryService.js +1227 -0
  65. package/services/memory/learnerWritingPad.js +577 -0
  66. package/services/memory/tutorWritingPad.js +674 -0
  67. package/services/promptRecommendationService.js +493 -0
  68. package/services/rubricEvaluator.js +826 -0
package/docs/research/CRITICAL-REVIEW-RECOGNITION-TUTORING.md
@@ -0,0 +1,340 @@
1
+ # Critical Review: Recognition-Oriented AI Tutoring
2
+
3
+ **Date:** 2026-01-14
4
+ **Reviewer:** Claude Opus 4.5
5
+ **Subject:** Machine Spirits Recognition Tutoring System & Paper Draft v0.2
6
+
7
+ ---
8
+
9
+ ## Overview
10
+
11
+ This work proposes a genuinely novel synthesis: operationalizing Hegelian recognition theory through a psychodynamic multi-agent architecture for AI tutoring. The 46% improvement (Cohen's d = 1.55) over baseline is striking, and the theoretical framework is sophisticated. But the work sits at a fascinating tension point—it makes ambitious philosophical claims while remaining constrained by the practical realities of LLM evaluation. Below I assess the work across experimental, architectural, and conceptual dimensions, then sketch productive paths forward.
12
+
13
+ ---
14
+
15
+ ## I. Experimental Critique
16
+
17
+ ### A. The Fundamental Evaluation Paradox
18
+
19
+ The central experimental problem is recognized in the paper but deserves deeper scrutiny: **you cannot measure recognition with scripted learners**.
20
+
21
+ Recognition, as theorized here, requires that the tutor's response be *genuinely shaped by* the learner's contribution—not merely triggered by it. But the scripted multi-turn scenarios define learner turns in advance. The learner's Turn 3 is identical regardless of what the tutor said in Turn 2. This means:
22
+
23
+ 1. The tutor cannot actually *engage* with the learner's dynamically developing understanding
24
+ 2. The learner cannot reciprocate recognition (the "hollow master" problem the paper identifies applies to the evaluation itself)
25
+ 3. What's being measured is pattern-matching to recognition *markers*, not recognition itself
26
+
27
+ The COMPREHENSIVE-EVALUATION-PLAN acknowledges this by proposing dyadic architectures (simulated learners with their own Ego/Superego), but the preliminary dyadic results surface a troubling finding: the "quality" profile (not recognition-labeled) outperformed the explicit "recognition" profile. This suggests recognition may be an *emergent property* of generally good tutoring rather than something directly instructable.
28
+
29
+ **Constructive path**: The dyadic direction is correct but needs deeper theorization. Consider:
30
+ - Measuring *bilateral* changes: does the learner-agent's internal representation actually transform?
31
+ - Running free-form dialogue without turn scripts, using post-hoc recognition coding
32
+ - Introducing genuine contingency: learner responds differently based on tutor quality
33
+
34
+ ### B. LLM-as-Judge Validity
35
+
36
+ The rubric's 10 dimensions are scored by Claude Sonnet 4. Several concerns:
37
+
38
+ 1. **Circularity risk**: The judge shares architectural assumptions with the system being judged. Both privilege verbal markers of engagement, joint inquiry language, and dialectical vocabulary. The judge may reward linguistic surface features rather than deep structural properties.
39
+
40
+ 2. **Dimension inflation**: Ten dimensions with high inter-correlations (r = 0.88 between pedagogical and personalization in the recognition condition) suggest redundancy. The factor structure likely reduces to 2-3 underlying constructs. The paper's own correlation analysis hints at this.
41
+
42
+ 3. **Calibration unknowns**: The COMPREHENSIVE-EVALUATION-PLAN correctly identifies vocabulary bias, length bias, and profile leakage as risks. The adversarial examples (bad recognition language, deliberately poor responses) are essential but not yet implemented.
43
+
44
+ **Constructive path**:
45
+ - Run the multi-judge comparison (Claude, Gemini, GPT) with ICC analysis
46
+ - Include human raters for a subset (the plan's 50-response validation)
47
+ - Design adversarial scenarios that use recognition vocabulary but fail recognition structurally
48
+ - Consider scoring recognition *relationally*: does Turn N show evidence of having been shaped by the specific content of Turn N-1?
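The relational criterion in the last bullet can be prototyped cheaply before an LLM judge is built. A minimal JavaScript sketch, using content-word overlap as an admittedly crude stand-in for "shaped by" (the stopword list and example turns are illustrative, not part of the eval system):

```javascript
// Crude lexical proxy for "was Turn N shaped by Turn N-1?": compare
// content-word overlap between the tutor turn and the learner turn it
// answers vs. an earlier learner turn. A real relational score would use
// an LLM judge prompted to cite specific uptake, not token overlap.
const STOPWORDS = new Set(["the", "a", "an", "is", "it", "to", "of", "and", "i", "that"]);

function contentWords(text) {
  return new Set(
    text.toLowerCase().match(/[a-z']+/g)?.filter((w) => !STOPWORDS.has(w)) ?? []
  );
}

// Fraction of the learner's content words taken up by the tutor.
function overlap(learnerTurn, tutorTurn) {
  const lw = contentWords(learnerTurn);
  const tw = contentWords(tutorTurn);
  if (lw.size === 0) return 0;
  let shared = 0;
  for (const w of lw) if (tw.has(w)) shared++;
  return shared / lw.size;
}

const learnerPrev = "I keep picturing recursion as a mirror facing another mirror";
const learnerEarlier = "I finished the loops exercise yesterday";
const tutorTurn =
  "Stay with your mirror image: what happens to the reflection when one mirror is removed?";

// A responsive turn should overlap more with the turn it answers.
const uptake = overlap(learnerPrev, tutorTurn);
const baselineUptake = overlap(learnerEarlier, tutorTurn);
```

A judge-based version would replace `overlap` with a prompt asking whether the tutor turn cites the learner's specific formulation; the comparative structure (adjacent turn vs. distant turn) carries over unchanged.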
49
+
50
+ ### C. Statistical Considerations
51
+
52
+ The reported statistics (n=25 per condition, Cohen's d = 1.55, p < 0.001) are promising but deserve scrutiny:
53
+
54
+ 1. **Non-independence**: If the same 8 scenarios were run 3+ times each, the 25 observations aren't independent samples of "tutoring ability"; they're clustered within scenarios. Treating them as independent overstates the effective sample size and degrees of freedom.
55
+
56
+ 2. **Effect size magnitude**: d = 1.55 is extraordinarily large for an educational intervention. Typical EdTech effect sizes are 0.1-0.3. This means either that recognition-oriented design is genuinely revolutionary or that there is a measurement artifact; both possibilities deserve investigation.
57
+
58
+ 3. **Ceiling effects**: Single-turn scenarios show recognition scores of 96-100. This limits variance and makes statistical comparison difficult.
59
+
60
+ **Constructive path**:
61
+ - Use multilevel models (responses nested within scenarios nested within profiles)
62
+ - Compute scenario-level intraclass correlations
63
+ - Design scenarios with more headroom for high-quality responses
64
+ - Compare effect sizes against published human tutoring baselines
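The quantities behind the first two bullets are simple to compute once per-response scores are grouped by scenario. A sketch with invented scores (three runs of each of three scenarios per condition; the real per-response data lives in the eval logs, not here):

```javascript
// Pooled-SD Cohen's d, and a one-way ICC(1) estimating how much score
// variance is explained by scenario clustering. Scores are hypothetical.
function mean(xs) { return xs.reduce((s, x) => s + x, 0) / xs.length; }

function cohensD(a, b) {
  const va = a.reduce((s, x) => s + (x - mean(a)) ** 2, 0) / (a.length - 1);
  const vb = b.reduce((s, x) => s + (x - mean(b)) ** 2, 0) / (b.length - 1);
  const pooled = Math.sqrt(
    ((a.length - 1) * va + (b.length - 1) * vb) / (a.length + b.length - 2)
  );
  return (mean(a) - mean(b)) / pooled;
}

// ICC(1) from a balanced one-way ANOVA: (MSB - MSW) / (MSB + (k - 1) * MSW),
// where k is the cluster size (runs per scenario).
function icc1(clusters) {
  const k = clusters[0].length;
  const grand = mean(clusters.flat());
  const msb =
    (k * clusters.reduce((s, c) => s + (mean(c) - grand) ** 2, 0)) /
    (clusters.length - 1);
  const msw =
    clusters.reduce((s, c) => s + c.reduce((t, x) => t + (x - mean(c)) ** 2, 0), 0) /
    (clusters.length * (k - 1));
  return (msb - msw) / (msb + (k - 1) * msw);
}

// Hypothetical scores: rows are scenarios, columns are repeated runs.
const recognition = [[82, 85, 84], [76, 78, 75], [90, 88, 91]];
const baseline = [[55, 57, 54], [40, 43, 41], [62, 60, 63]];

const d = cohensD(recognition.flat(), baseline.flat());
const icc = icc1(recognition);
```

With these invented numbers the ICC is high, illustrating the worry: when most variance sits between scenarios rather than between runs, the 25 "observations" carry far fewer than 25 independent pieces of information.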
65
+
66
+ ---
67
+
68
+ ## II. Architectural Critique
69
+
70
+ ### A. The Superego's Actual Role
71
+
72
+ The Ego/Superego architecture is theoretically compelling, but examining the actual prompts and modulation evaluator reveals tensions:
73
+
74
+ 1. **The Superego enforces rules, not dialogue**: The tutor-superego.md prompt specifies hard rules ("Struggling learners must NOT be given forward momentum") and checklists (specificity, appropriateness, pedagogical soundness). This is quality control, not psychodynamic dialogue. The "Superego" functions more like a validator than a critic-self engaged in genuine internal struggle.
75
+
76
+ 2. **Convergence is rapid**: The modulation evaluator shows most dialogues converge in 1-2 rounds. Genuine psychodynamic process involves extended negotiation, resistance, working-through. The architecture permits up to 3 rounds but rarely uses them. This suggests either (a) the Ego is highly responsive, (b) the Superego accepts quickly, or (c) genuine internal conflict isn't occurring.
77
+
78
+ 3. **Intervention types are coarse**: The Superego's output (`approved: true/false` + `interventionType`) is categorical. Psychodynamic theory would predict more ambivalent, partial, evolving evaluations. The current architecture doesn't represent "I'm uncomfortable but can't articulate why" or "this is technically correct but feels wrong."
79
+
80
+ **Constructive path**:
81
+ - Analyze the distribution of intervention types—if "approve with enhancement" dominates, the Superego may be too permissive
82
+ - Implement "resistance detection" for Ego (already partially present) but also for Superego (premature acceptance)
83
+ - Consider continuous rather than categorical Superego outputs
84
+ - Allow the Superego to maintain unresolved concerns across rounds rather than resolving each round
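A continuous, persistent verdict shape might look like the following sketch. The scoring rules and the 0.6 threshold are toy placeholders, not the package's actual Superego interface; the point is that concerns are graded and survive across rounds until resolved:

```javascript
// Continuous multi-dimensional Superego verdict with carried-over concerns,
// in place of a categorical approved/interventionType pair.
function evaluateDraft(draft, priorConcerns = []) {
  // Toy heuristics stand in for scores a Superego model would produce.
  const scores = {
    specificity: draft.includes("your") ? 0.8 : 0.4,
    rigor: draft.length > 60 ? 0.7 : 0.3,
    recognition: draft.includes("?") ? 0.75 : 0.5,
  };
  // Dimensions below threshold become graded concerns.
  const concerns = Object.entries(scores)
    .filter(([, v]) => v < 0.6)
    .map(([dim, v]) => ({ dim, severity: 1 - v }));
  // Prior concerns persist unless this round's score resolves them.
  const carried = priorConcerns.filter(
    (c) => (scores[c.dim] ?? 0) < 0.6 && !concerns.some((n) => n.dim === c.dim)
  );
  return { scores, concerns: [...concerns, ...carried] };
}

const round1 = evaluateDraft("Try again.");
const round2 = evaluateDraft(
  "What happens to your mirror image if we remove one mirror? Walk me through it.",
  round1.concerns
);
```

This also gives the convergence analysis something to measure: a dialogue "converges" when the concern list empties, not when a boolean flips.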
85
+
86
+ ### B. Recognition as Prompt Engineering vs. Architectural Property
87
+
88
+ A deeper architectural question: is recognition-oriented behavior a property of the *prompt* or the *architecture*?
89
+
90
+ The current design suggests it's primarily prompt-based. The recognition-enhanced Ego prompt says: "The learner is not a knowledge deficit to be filled but an autonomous subject." The Superego prompt adds red/green flags for recognition markers. But the underlying transformer architecture processes these as tokens like any other.
91
+
92
+ This raises the concern that you're measuring LLM compliance with recognition-language rather than genuine recognition capacity. The dyadic finding (quality > recognition when recognition is named explicitly) supports this worry.
93
+
94
+ **Constructive path**:
95
+ - Test whether recognition behaviors emerge in quality profiles without explicit recognition language
96
+ - Investigate whether recognition-oriented responses differ structurally (not just lexically) from baseline
97
+ - Consider whether recognition might require architectural changes (e.g., explicit learner-state modeling, attention to specific previous turns) rather than prompt changes
98
+
99
+ ### C. Memory Dynamics Remain Underdeveloped
100
+
101
+ The "Freud's Mystic Writing Pad" memory metaphor is theoretically rich but architecturally thin. The current memory system (learnerMemoryService.js) appears to track factual information (concepts encountered, activities completed) rather than the *relational* history Freud's metaphor implies.
102
+
103
+ For genuine recognition, memory should capture:
104
+ - How this learner's understanding has evolved
105
+ - What metaphors and frameworks they've developed
106
+ - Previous recognition failures and repairs
107
+ - The accumulated relational texture of the tutor-learner history
108
+
109
+ The current implementation seems closer to "user state tracking" than "psychodynamic memory."
110
+
111
+ **Constructive path**:
112
+ - Implement memory of *episodes* rather than just *states*
113
+ - Track the learner's specific formulations and return to them
114
+ - Remember recognition failures specifically (not just activity failures)
115
+ - Consider implementing "transference" dynamics: how the learner's patterns of engaging with tutors shape their engagement with this tutor
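An episode-level record along these lines could look like the sketch below; field names are illustrative, not learnerMemoryService.js's actual schema:

```javascript
// Episode-level (rather than state-level) memory: each entry records the
// learner's own formulation and any recognition failure, so later turns
// can return to the learner's metaphors and repair what went unrepaired.
const episodes = [];

function recordEpisode({ learnerFormulation, tutorMove, recognitionFailure = null, repaired = false }) {
  episodes.push({ t: episodes.length, learnerFormulation, tutorMove, recognitionFailure, repaired });
}

// The learner's own formulations, for the tutor to return to later.
function formulationsToRevisit() {
  return episodes.map((e) => e.learnerFormulation).filter(Boolean);
}

// Recognition failures that were never explicitly repaired.
function openRepairs() {
  return episodes.filter((e) => e.recognitionFailure && !e.repaired);
}

recordEpisode({
  learnerFormulation: "recursion is mirrors facing each other",
  tutorMove: "extended the mirror metaphor to base cases",
});
recordEpisode({
  learnerFormulation: null,
  tutorMove: "pivoted to arrays without acknowledging confusion",
  recognitionFailure: "silent pivot",
});
```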
116
+
117
+ ---
118
+
119
+ ## III. Conceptual Critique
120
+
121
+ ### A. The Hegel Application: Derivative Rather Than Replica
122
+
123
+ The master-slave dialectic is a powerful frame, but the application requires careful positioning. The question is not whether AI tutoring *replicates* Hegelian recognition (it cannot), but whether it constitutes a productive *derivative* of that structure.
124
+
125
+ **The Derivative Model**
126
+
127
+ Lacan's four discourses (Master, University, Hysteric, Analyst) demonstrate how the master-slave dyadic structure can be rethought through different roles while preserving structural insights. Each discourse represents a different configuration of knowledge, power, and desire—none identical to Hegel's original, but each illuminated by it.
128
+
129
+ Similarly, the tutor-learner relation can be understood as a *derivative* of the master-slave dialectic:
130
+ - The tutor occupies a knowledge-authority position (analogous to master)
131
+ - The learner's acknowledgment is sought (analogous to slave's recognition)
132
+ - One-directional pedagogy produces hollow outcomes (analogous to master's hollow self-consciousness)
133
+ - Genuine engagement requires the tutor to be shaped by learner input (the derivative insight)
134
+
135
+ **What the Derivative Preserves**
136
+
137
+ The Hegelian framework remains valuable not as literal description but as:
138
+ 1. **Diagnostic tool**: Identifies what's missing in one-directional pedagogy
139
+ 2. **Design heuristic**: Suggests architectural features that approximate recognition
140
+ 3. **Evaluation criterion**: Provides standards for relational quality
141
+ 4. **Horizon concept**: Orients design toward an ideal without claiming its achievement
142
+
143
+ **What the Derivative Transforms**
144
+
145
+ | Hegel's Original | Tutor-Learner Derivative |
146
+ |------------------|--------------------------|
147
+ | Struggle unto death | Stakes are pedagogical, not existential |
148
+ | Two self-consciousnesses | One consciousness (learner) + functional analogue (tutor) |
149
+ | Mutual transformation | Learner transformation + tutor behavioral adaptation |
150
+ | Recognition as metaphysical achievement | Recognition as design pattern and evaluation dimension |
151
+
152
+ **Recognition vs. Responsiveness**
153
+
154
+ The paper should distinguish more carefully:
155
+ - *Recognition proper*: Intersubjective acknowledgment between self-conscious beings
156
+ - *Dialogical responsiveness*: Being substantively shaped by the other's input
157
+ - *Recognition-oriented design*: Architectural features that approximate recognition's functional benefits
158
+
159
+ The AI achieves the third, possibly the second, but not the first. This is not a failure—it's a clarification that the derivative model can achieve pedagogical benefits without metaphysical commitments.
160
+
161
+ **Constructive path**:
162
+ - Frame explicitly as derivative/inspired-by rather than implementation-of
163
+ - Reference Lacan's discourses as precedent for productive rethinking of master-slave structure
164
+ - Focus claims on measurable effects on tutor adaptive pedagogy
165
+ - Retain Hegelian framework as diagnostic and design heuristic, not ontological claim
166
+
167
+ ### B. The Freudian Frame: Productive Metaphor
168
+
169
+ The Ego/Superego architecture uses Freudian terminology metaphorically rather than literally. This is not a weakness but a feature—productive metaphors scaffold understanding and suggest design directions without requiring literal correspondence.
170
+
171
+ **The Metaphor's Productivity**
172
+
173
+ The psychodynamic metaphor is productive because it:
174
+
175
+ 1. **Names a real tension**: The conflict between warmth/encouragement (Ego) and rigor/standards (Superego) is genuine in tutoring. The metaphor makes this tension explicit and designable.
176
+
177
+ 2. **Motivates internal dialogue**: The idea that good output emerges from internal negotiation—not single-pass generation—is architecturally valuable regardless of its psychoanalytic provenance.
178
+
179
+ 3. **Suggests extensions**: Concepts like resistance, transference, and working-through suggest future architectural features, even if not currently implemented.
180
+
181
+ 4. **Connects to recognition framework**: The Freudian and Hegelian frameworks share concern with intersubjectivity and the constitution of self through other. The metaphor creates theoretical coherence.
182
+
183
+ **What the Metaphor Preserves**
184
+
185
+ | Freudian Concept | Architectural Analogue |
186
+ |------------------|------------------------|
187
+ | Internal dialogue before external action | Multi-round Ego-Superego exchange before learner sees response |
188
+ | Superego as internalized standards | Superego enforces pedagogical criteria |
189
+ | Ego mediates competing demands | Ego balances learner needs with pedagogical soundness |
190
+ | Conflict can be productive | Tension between agents improves output quality |
191
+
192
+ **What the Metaphor Transforms**
193
+
194
+ | Freudian Original | Architectural Transformation |
195
+ |-------------------|------------------------------|
196
+ | Id (drives) | No implementation; design focuses on Ego-Superego |
197
+ | Unconscious processes | All processes are explicit and traceable |
198
+ | Irrational Superego | Rational, principle-based evaluation |
199
+ | Repression/Defense | Not implemented |
200
+ | Transference | Potential future extension (relational patterns) |
201
+
202
+ **Alternative Framings**
203
+
204
+ The same architecture could be described as:
205
+ - Generator/Discriminator (GAN-inspired)
206
+ - Proposal/Critique (deliberative process)
207
+ - Draft/Review (editorial model)
208
+ - System 1/System 2 (dual-process cognition)
209
+
210
+ The psychodynamic framing is chosen for theoretical coherence with the Hegelian recognition framework and because it suggests richer extensions than purely functional descriptions.
211
+
212
+ **Constructive path**:
213
+ - Explicitly acknowledge metaphorical status in paper
214
+ - Defend metaphor's productivity rather than apologizing for non-literalness
215
+ - Consider which extensions (resistance detection, relational patterns) are worth implementing
216
+ - Document what the metaphor illuminates and what it occludes
217
+
218
+ ### C. The Productive Struggle Question
219
+
220
+ The strongest empirical result (+49% in productive_struggle_arc, d = 2.93) centers on honoring productive struggle rather than short-circuiting confusion. This is pedagogically important and well-supported by educational research.
221
+
222
+ But the theoretical frame may be over-complicated for this finding. What's being measured is essentially: **does the tutor resist the urge to resolve confusion immediately?** This could be achieved through simpler means:
223
+
224
+ - A prompt saying "don't immediately resolve confusion"
225
+ - A rule-based delay before offering explanations
226
+ - An explicit "confusion is valuable" heuristic
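The rule-based variant is almost trivially implementable, which is exactly the point of the comparison. A sketch with an illustrative two-turn hold (the threshold and move labels are invented for the example):

```javascript
// Rule-based struggle gate: withhold direct explanation until confusion
// has persisted past a threshold, then release it. No recognition
// machinery involved.
function makeStruggleGate(holdTurns = 2) {
  let confusedStreak = 0;
  return function nextMove(learnerIsConfused) {
    confusedStreak = learnerIsConfused ? confusedStreak + 1 : 0;
    if (!learnerIsConfused) return "advance";
    // Honor productive struggle, but not past the point of frustration.
    return confusedStreak <= holdTurns ? "probe" : "explain";
  };
}

const gate = makeStruggleGate(2);
const moves = [true, true, true, false].map((c) => gate(c));
```

If recognition-conditioned tutors outperform this gate, the recognition framework is doing empirical work; if not, the productive-struggle result has a much cheaper explanation.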
227
+
228
+ The full Hegelian apparatus (recognition, intersubjectivity, mutual transformation) may be doing less theoretical work than claimed. The productive struggle finding is robust but might not require the recognition framework.
229
+
230
+ **Constructive path**:
231
+ - Ablate the specific contribution of recognition language vs. productive-struggle language
232
+ - Test whether recognition benefits beyond what productive-struggle instruction alone provides
233
+ - Be more precise about which theoretical claims are doing empirical work
234
+
235
+ ---
236
+
237
+ ## IV. What This Work Gets Right
238
+
239
+ Despite these critiques, the work has genuine strengths that should be preserved and extended:
240
+
241
+ ### A. The Multi-Agent Internal Dialogue
242
+
243
+ The insight that tutoring quality benefits from internal evaluation *before* delivery is valuable and generalizable. The Superego's role as a "pedagogical pre-flight check" catches failures that single-pass generation misses. This pattern could extend to other domains (therapy bots, customer service, technical assistance).
244
+
245
+ ### B. The Dimension-Level Analysis
246
+
247
+ The finding that improvements concentrate in personalization, pedagogical soundness, and tone—exactly where recognition theory predicts—is non-trivial. This pattern match between theoretical prediction and empirical result strengthens the case that something genuine is happening, even if the theoretical explanation needs refinement.
248
+
249
+ ### C. The Repair Mechanism
250
+
251
+ Explicit acknowledgment of misalignment before pivoting (the "repair rule") is a concrete, implementable insight with clear pedagogical value. Silent pivots are recognition failures that real tutors also commit. This operationalization is useful.
252
+
253
+ ### D. The Evaluation Infrastructure
254
+
255
+ The evaluation system (rubric, scenarios, judge models, statistical analysis scripts) is sophisticated and extensible. This infrastructure enables the kind of systematic improvement that AI tutoring needs. The COMPREHENSIVE-EVALUATION-PLAN shows thoughtful anticipation of the work needed for rigorous publication.
256
+
257
+ ---
258
+
259
+ ## V. Productive Paths Forward
260
+
261
+ ### A. Theoretical Refinement
262
+
263
+ 1. **Distinguish recognition from responsiveness**: Make clear which claims require genuine intersubjectivity vs. which require only sophisticated input-shaping. The latter is demonstrably achievable; the former may be impossible for current AI.
264
+
265
+ 2. **Consider "dialogical responsiveness" as the core claim**: This is defensible, measurable, and educationally valuable without requiring strong metaphysical commitments about AI consciousness.
266
+
267
+ 3. **Engage philosophy of mind literature**: If claiming recognition, engage with debates about machine consciousness, phenomenal experience, and intersubjectivity. The paper currently cites social/political recognition theory but not philosophy of mind.
268
+
269
+ ### B. Architectural Development
270
+
271
+ 1. **Implement genuine memory dynamics**: Track relational history, not just state. Remember the learner's specific formulations and return to them. Implement repair history that shapes future interactions.
272
+
273
+ 2. **Add representational depth to Superego evaluation**: Move beyond categorical approve/reject to continuous, multi-dimensional assessment. Allow unresolved concerns to persist.
274
+
275
+ 3. **Test emergent recognition**: If recognition emerges from quality without explicit instruction (as dyadic results suggest), explore what architectural features enable this emergence.
276
+
277
+ ### C. Experimental Rigor
278
+
279
+ 1. **Dyadic evaluation with genuine contingency**: Let learner-agents respond dynamically to tutor quality. Measure bilateral transformation.
280
+
281
+ 2. **Human validation**: The 50-response human validation sample is essential. Extend to learning outcomes if possible.
282
+
283
+ 3. **Multi-judge reliability**: Run ICC analysis across Claude, Gemini, GPT judges. Identify systematic biases.
284
+
285
+ 4. **Adversarial robustness**: Test whether recognition-oriented design creates manipulation vulnerabilities.
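The contingent loop in item 1 reduces to a simple structure once the model calls are abstracted. A sketch with both sides stubbed (in the real system each stub would be an LLM call through the interaction engine; the trigger phrases are invented for the example):

```javascript
// Dyadic loop with genuine contingency: the learner turn is generated
// from the tutor's actual output, so the transcript is not scripted.
function stubTutor(history) {
  const last = history[history.length - 1];
  return last && last.text.includes("lost")
    ? "Let's slow down: what part felt solid before it slipped?"
    : "What do you predict the function returns for n = 2?";
}

function stubLearner(history) {
  const last = history[history.length - 1];
  // Contingency: the learner's state depends on the tutor's move.
  return last.text.includes("slow down")
    ? { text: "The base case made sense, I think.", confused: false }
    : { text: "I'm lost again.", confused: true };
}

const history = [{ role: "learner", text: "I'm lost on recursion." }];
for (let turn = 0; turn < 2; turn++) {
  history.push({ role: "tutor", text: stubTutor(history) });
  history.push({ role: "learner", ...stubLearner(history) });
}
```

Bilateral transformation then becomes measurable: track the learner-agent's `confused` trajectory (or a richer internal state) across tutor profiles rather than scoring tutor turns in isolation.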
286
+
287
+ ### D. Publication Strategy
288
+
289
+ The work is currently positioned for an AI/education venue. Consider also:
290
+
291
+ - **HCI venues**: The multi-agent dialogue pattern has implications for conversational AI design broadly
292
+ - **Philosophy of AI venues**: The recognition framework engages questions about AI intersubjectivity that deserve philosophical scrutiny
293
+ - **Educational psychology venues**: The productive struggle operationalization is independently valuable
294
+
295
+ ---
296
+
297
+ ## VI. Conclusion
298
+
299
+ This is ambitious, philosophically informed work that attempts something genuinely novel: using recognition theory as a *derivative framework* for AI tutoring design. The empirical results are promising, the infrastructure is sophisticated, and the theoretical framework—when properly positioned—is both defensible and productive.
300
+
301
+ **The Core Contribution**
302
+
303
+ The work's value lies not in claiming that AI achieves Hegelian recognition (which would be overreach), but in demonstrating that:
304
+
305
+ 1. Recognition-oriented design principles measurably improve tutor adaptive pedagogy
306
+ 2. Multi-agent architectures with internal dialogue outperform single-pass generation
307
+ 3. The psychodynamic metaphor productively scaffolds design decisions
308
+ 4. Philosophical frameworks can inform empirical AI research without requiring literal implementation
309
+
310
+ **What the 2×2 Factorial Can Show**
311
+
312
+ The existing experimental infrastructure (2×2 factorial: architecture × recognition) enables rigorous evaluation of:
313
+
314
+ | Effect | Question | Measurement |
315
+ |--------|----------|-------------|
316
+ | Main effect of architecture | Does multi-agent dialogue improve tutor adaptiveness? | Compare single_baseline + single_recognition vs. baseline + recognition |
317
+ | Main effect of recognition | Do recognition-oriented prompts improve tutor adaptiveness? | Compare single_baseline + baseline vs. single_recognition + recognition |
318
+ | Interaction effect | Does recognition benefit more from multi-agent architecture? | Test whether recognition × architecture interaction is significant |
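Given the four cell means reported in the factorial results elsewhere in this package (single_baseline 40.1, baseline 41.6, single_recognition 75.5, recognition 80.7), the three effects in the table reduce to simple contrasts; significance testing would still require the per-response data:

```javascript
// 2x2 factorial effects from cell means (architecture x recognition).
const cells = {
  single_baseline: 40.1,    // single-agent, no recognition
  baseline: 41.6,           // multi-agent, no recognition
  single_recognition: 75.5, // single-agent, recognition
  recognition: 80.7,        // multi-agent, recognition
};

const mean2 = (a, b) => (a + b) / 2;

// Main effect of recognition: pooled recognition cells minus pooled baseline cells.
const recognitionEffect =
  mean2(cells.single_recognition, cells.recognition) -
  mean2(cells.single_baseline, cells.baseline);

// Main effect of architecture: pooled multi-agent cells minus pooled single-agent cells.
const architectureEffect =
  mean2(cells.baseline, cells.recognition) -
  mean2(cells.single_baseline, cells.single_recognition);

// Interaction: is the recognition effect larger under the multi-agent architecture?
const interaction =
  (cells.recognition - cells.baseline) -
  (cells.single_recognition - cells.single_baseline);
```

On these cell means the recognition main effect dwarfs both the architecture main effect and the interaction, which is itself an informative pattern for the paper's claims.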
319
+
320
+ **The Refined Claim**
321
+
322
+ The paper should claim:
323
+
324
+ > Recognition-oriented design, understood as a *derivative* of Hegelian recognition theory and implemented through a *metaphorically* psychodynamic multi-agent architecture, produces measurable improvements in AI tutor adaptive pedagogy. These improvements concentrate in relational dimensions (personalization, pedagogical responsiveness, tone) consistent with the theoretical framework's predictions.
325
+
326
+ This claim is:
327
+ - Empirically testable via the 2×2 factorial
328
+ - Theoretically grounded without overreach
329
+ - Practically significant for AI tutoring design
330
+ - Extensible to future work on recognition in AI systems
331
+
332
+ **Path Forward**
333
+
334
+ The implementation plan should prioritize:
335
+ 1. Running the 2×2 factorial with adequate statistical power
336
+ 2. Measuring effects on *tutor behavior* (adaptiveness, responsiveness, repair) rather than metaphysical claims
337
+ 3. Validating judge reliability through multi-judge comparison
338
+ 4. Documenting the derivative/metaphorical theoretical positioning explicitly
339
+
340
+ This work has the potential to contribute meaningfully to AI tutoring, multi-agent design, and the broader conversation about how philosophical frameworks can inform AI research. The refinements above strengthen rather than diminish that potential.
package/docs/research/DYNAMIC-VS-SCRIPTED-ANALYSIS.md
@@ -0,0 +1,291 @@
1
+ # Dynamic vs Scripted Learner Evaluation Analysis
2
+
3
+ **Date:** 2026-01-14
4
+ **Status:** Complete
5
+
6
+ ## Overview
7
+
8
+ This analysis compares evaluation results between scripted scenarios (fixed learner responses) and dynamic LLM-based learner simulation (contingent responses generated by LLM learner agents).
9
+
10
+ ---
11
+
12
+ ## 1. Evaluation Modes
13
+
14
+ ### Scripted Scenarios
15
+ - **Method:** Fixed learner messages, tutor responses evaluated
16
+ - **Source:** 2×2 factorial design (FACTORIAL-RESULTS-2026-01-14.md)
17
+ - **Sample:** N=12 (4 profiles × 3 scenarios)
18
+ - **Learner:** Pre-written dialogue sequences
19
+
20
+ ### Dynamic LLM Learners
21
+ - **Method:** LLM generates contingent learner responses based on tutor output
22
+ - **Source:** Battery scenarios (logs/interaction-evals/)
23
+ - **Sample:** N=6 battery scenarios
24
+ - **Learner Architectures:** unified, ego_superego, dialectical, psychodynamic, cognitive
25
+ - **Learner Personas:** eager_novice, imposter, resistant_scholar, anxious_perfectionist, methodical_analyst
26
+
27
+ ---
28
+
29
+ ## 2. Results Comparison
30
+
31
+ ### Scripted Scenario Results (Factorial Design)
32
+
33
+ | Profile | Mean Score | Recognition Status |
34
+ |---------|------------|-------------------|
35
+ | single_baseline | 40.1 | Standard |
36
+ | baseline | 41.6 | Standard |
37
+ | single_recognition | 75.5 | Recognition-Enhanced |
38
+ | recognition | 80.7 | Recognition-Enhanced |
39
+
40
+ **Key Finding:** Recognition effect = +37.2 points (91% improvement)
41
+
42
+ ### Dynamic Learner Battery Results
43
+
44
+ | Scenario | Learner Arch | Tutor Profile | Score | Dimensions |
45
+ |----------|--------------|---------------|-------|------------|
46
+ | battery_unified_baseline | unified | baseline | 88 | MR:4, DR:5, TP:4, T:5 |
47
+ | battery_ego_superego_recognition | ego_superego | recognition | 78 | MR:4, DR:3, TP:4, T:5 |
48
+ | battery_psychodynamic_recognition_plus | psychodynamic | recognition_plus | 97 | MR:5, DR:5, TP:5, T:5 |
49
+ | battery_dialectical_budget | dialectical | budget | 87 | MR:5, DR:5, TP:4, T:5 |
50
+ | battery_cognitive_quality | cognitive | quality | 82 | MR:4, DR:5, TP:4, T:5 |
51
+ | battery_extended_dialogue | ego_superego | recognition | 48 | MR:2, DR:2, TP:2, T:3 |
52
+
53
+ *MR=Mutual Recognition, DR=Dialectical Responsiveness, TP=Transformative Potential, T=Tone*
54
+
55
+ ---
56
+
57
## 3. Key Findings

### 3.1 Baseline Profile Performance Divergence

**Scripted:** Baseline profiles score ~41 points
**Dynamic:** Baseline profiles score 87-88 points (a +46-point difference)

**Interpretation:** Dynamic LLM learners generate more "cooperative" contexts that allow even baseline tutors to demonstrate pedagogical quality. Scripted scenarios are designed to stress-test specific failure modes (e.g., learner resistance, validation seeking) that baseline tutors cannot handle well.

### 3.2 Recognition Profile Consistency

**Scripted:** Recognition profiles score 75.5-80.7 points
**Dynamic:** Recognition profiles score 78-97 points

**Interpretation:** Recognition-enhanced tutors perform consistently well in both evaluation modes. The psychodynamic learner + recognition_plus combination achieves the highest score (97), suggesting synergy between the psychodynamic learner architecture and recognition-oriented tutoring.

### 3.3 Extended Dialogue Failure Mode

The `battery_extended_dialogue` scenario (8 turns) revealed a critical failure mode:
- **Score:** 48 (lowest of all battery scenarios)
- **Issue:** The tutor's commitment to "preserving productive tension" and avoiding "short-circuiting productive struggle" hardened into rigid ideology
- **Learner State:** Ended "flustered" and unable to locate relevant passages

**Judge Narrative:** "The commitment to a particular pedagogical ideal prevented the adaptive teaching the situation required."

This failure mode was **not detected in scripted scenarios** because:
1. Scripted scenarios are shorter (typically 3-5 turns)
2. Scripted learner responses cannot express mounting frustration
3. Scripted scenarios don't test sustained dialogue resilience

### 3.4 Learner Architecture Effects

| Learner Architecture | Mean Score | Notable Pattern |
|---------------------|------------|-----------------|
| psychodynamic | 97 | Highest score; internal conflict drives productive struggle |
| unified | 88 | Strong baseline performance |
| dialectical | 87 | Resistant scholar persona well-handled |
| cognitive | 82 | Methodical analyst engagement sustained |
| ego_superego | 63 | Variable (78 in short, 48 in extended) |

**Key Finding:** The psychodynamic learner architecture produces the most productive tutoring interactions. The ego_superego architecture shows high variance, performing well in short interactions but poorly in extended dialogues.
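The per-architecture means in the table above can be recomputed from the battery scores; a minimal sketch (variable names are illustrative, not from the eval codebase):

```python
from collections import defaultdict
from statistics import mean

# (learner architecture, score) pairs from the Dynamic Learner Battery Results table
battery = [
    ("unified", 88), ("ego_superego", 78), ("psychodynamic", 97),
    ("dialectical", 87), ("cognitive", 82), ("ego_superego", 48),
]

# Group scores by architecture, then average each group
scores = defaultdict(list)
for arch, score in battery:
    scores[arch].append(score)

means = {arch: mean(s) for arch, s in scores.items()}
# ego_superego averages (78 + 48) / 2 = 63; the other architectures are single runs
```

Note that only ego_superego has two runs behind its mean, so its 63 summarizes far more variance than the other rows.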

---

## 4. Statistical Comparison

### Recognition Effect Sizes

| Evaluation Mode | Recognition Mean | Baseline Mean | Effect Size |
|-----------------|-----------------|---------------|-------------|
| Scripted | 78.1 | 40.9 | +37.2 (d=1.81) |
| Dynamic | 87.5* | 87.5* | ~0 (confounded) |

*Note: Dynamic battery design confounds tutor profile with learner architecture, limiting direct comparison.*
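The scripted effect size can be sanity-checked from the reported group means, assuming the ~20.5 standard deviation from the variance table below approximates the pooled within-group SD (a simplifying assumption, not the report's exact computation):

```python
# Cohen's d for the scripted recognition effect
recognition_mean = 78.1
baseline_mean = 40.9
pooled_sd = 20.5  # approximation taken from the variance comparison table

effect = recognition_mean - baseline_mean  # raw difference in points
cohens_d = effect / pooled_sd              # standardized effect size, ~1.81
```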

### Variance Comparison

| Metric | Scripted | Dynamic |
|--------|----------|---------|
| Score Range | 34.1 - 100.0 | 48 - 97 |
| Standard Deviation | ~20.5 | ~17.3 |
| Coefficient of Variation | 34% | 22% |

**Interpretation:** Dynamic learners produce more consistent scores overall, but can reveal extreme failure modes (score=48) not captured in scripted scenarios.
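The coefficients of variation follow directly from the reported standard deviations and the mean scores given in the summary table (59.5 scripted, 80.0 dynamic); a quick check:

```python
# CV = SD / mean, using the summary statistics reported in this document
scripted_cv = 20.5 / 59.5  # ~0.34, i.e. 34%
dynamic_cv = 17.3 / 80.0   # ~0.22, i.e. 22%
```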

---

## 5. Implications

### 5.1 Scripted vs Dynamic: Complementary Evaluation

**Scripted scenarios are better for:**
- Controlled hypothesis testing (factorial designs)
- Detecting specific pedagogical failures
- Benchmarking across tutor profiles
- Reproducibility

**Dynamic learners are better for:**
- Detecting emergent failure modes
- Testing sustained dialogue resilience
- Exploring learner architecture × tutor profile interactions
- Ecological validity

### 5.2 Recognition Effect Robustness

The recognition effect (+37.2 points in scripted scenarios) appears **robust** but **context-dependent** in dynamic evaluation:
- Recognition tutors still perform well with dynamic learners
- But baseline tutors can also achieve high scores when the learner context is favorable
- Extended dialogues can reveal failure modes even in recognition tutors

### 5.3 Psychodynamic Synergy

The combination of:
- **Psychodynamic learner architecture** (internal conflict, transference dynamics)
- **Recognition-enhanced tutoring** (mutual recognition, dialectical responsiveness)

produces the highest score (97), suggesting theoretical alignment between psychodynamic learning theory and Hegelian recognition.

---

## 6. Methodological Recommendations

### For Paper Methodology (Section 5.4)

1. **Document dual evaluation modes:**
   - Scripted scenarios for controlled comparison
   - Dynamic LLM learners for ecological validation

2. **Report both results:**
   - Scripted scenarios: N=12 factorial design, recognition effect +37.2
   - Dynamic learners: N=6 battery scenarios, mean=80.0, range=48-97

3. **Acknowledge limitations:**
   - Scripted: Limited ecological validity
   - Dynamic: Confounded design, small sample

### For Future Evaluation

1. **Extended dialogue stress testing:** Include 8-10 turn scenarios in the scripted battery
2. **Factorial design with dynamic learners:** 2×2 (Tutor Profile × Learner Architecture)
3. **Failure mode analysis:** Document the conditions that produce scores < 60
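Recommendation 2 amounts to fully crossing the two factors; a sketch of the condition grid using profile and architecture names from this report (the choice of which two levels to subset is an assumption for illustration):

```python
from itertools import product

# Two levels per factor for the proposed 2x2 design; levels chosen here
# (baseline/recognition, ego_superego/psychodynamic) are illustrative
tutor_profiles = ["baseline", "recognition"]
learner_archs = ["ego_superego", "psychodynamic"]

# Each cell of the cross would run as its own battery scenario
conditions = [
    {"tutor_profile": t, "learner_arch": a}
    for t, a in product(tutor_profiles, learner_archs)
]
# 4 cells: baseline x ego_superego, baseline x psychodynamic,
#          recognition x ego_superego, recognition x psychodynamic
```

Unlike the current battery, this design would let the tutor-profile effect be estimated independently of learner architecture.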

---

## 7. Summary Table

| Aspect | Scripted Scenarios | Dynamic LLM Learners |
|--------|-------------------|---------------------|
| Sample Size | N=12 | N=6 |
| Mean Score | 59.5 | 80.0 |
| Score Range | 34-100 | 48-97 |
| Recognition Effect | +37.2 points | Not isolated |
| Best Condition | recognition (80.7) | psychodynamic + recognition_plus (97) |
| Worst Condition | single_baseline (40.1) | extended_dialogue (48) |
| Failure Detection | Specific (resistance, validation) | Emergent (extended dialogue) |
| Ecological Validity | Lower | Higher |
| Reproducibility | Higher | Lower |

---

## 8. Token Usage and Cost Analysis

### Model Pricing (OpenRouter, January 2026)

| Model | Input ($/M) | Output ($/M) | Role |
|-------|-------------|--------------|------|
| Nemotron 3 Nano 30B | $0.00 | $0.00 | Tutor, Learner (free tier) |
| Claude Sonnet 4.5 | $3.00 | $15.00 | Judge |

### Battery Scenario Token Usage

| Scenario | Turns | Tutor Tokens | Learner Tokens | Total | Est. Cost |
|----------|-------|--------------|----------------|-------|-----------|
| unified_baseline | 7 | 25,058 | 1,653 | 26,711 | $0.09 |
| ego_superego_recognition | 9 | 45,826 | 2,099 | 47,925 | $0.15 |
| dialectical_budget | 7 | 19,288 | 2,485 | 21,773 | $0.08 |
| psychodynamic_recognition_plus | 9 | 48,571 | 2,825 | 51,396 | $0.15 |
| cognitive_quality | 9 | 34,053 | 2,473 | 36,526 | $0.12 |
| extended_dialogue | 17 | 94,487 | 3,981 | 98,468 | $0.27 |
| **TOTAL** | **58** | **267,283** | **15,516** | **282,799** | **$0.86** |

*Cost is primarily from the Judge (Sonnet 4.5); the Tutor and Learner use free-tier Nemotron.*
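Costs follow from token counts and the pricing table; a minimal cost helper as a sketch. The judge's input/output token split is not reported here, so the split below is an illustrative assumption, not the accounting done by `analyze-eval-costs.js`:

```python
def openrouter_cost(input_tokens: int, output_tokens: int,
                    input_per_m: float, output_per_m: float) -> float:
    """Dollar cost given token counts and $/M-token prices."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000

# Tutor and Learner run on free-tier Nemotron, so any volume costs $0.00
tutor_cost = openrouter_cost(267_283, 0, 0.0, 0.0)

# Judge on Claude Sonnet 4.5 ($3/M input, $15/M output). The ~238k judge
# tokens are mostly input (full transcripts); the 220k/15k split below is
# an assumed illustration and yields roughly the reported ~$0.86 total
judge_cost = openrouter_cost(220_000, 15_000, 3.00, 15.00)
```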

### Cost Breakdown by Component

| Component | Model | Tokens | Cost |
|-----------|-------|--------|------|
| Tutor (Ego+Superego) | Nemotron 3 Nano 30B | 267,283 | $0.00 |
| Learner (Ego+Superego) | Nemotron 3 Nano 30B | 15,516 | $0.00 |
| Judge | Claude Sonnet 4.5 | ~238,000 | $0.86 |

### Hypothetical: All Claude Sonnet 4.5

| Configuration | Total Cost | Multiplier |
|---------------|------------|------------|
| Current (Nemotron + Sonnet Judge) | $0.86 | 1.0x |
| All Sonnet 4.5 | $2.98 | 3.5x |

### Reproducibility

To regenerate the cost analysis:

```bash
node scripts/analyze-eval-costs.js
```


---

## Appendix: Raw Battery Data

```json
{
  "battery_unified_baseline": {
    "score": 88,
    "learner_arch": "unified",
    "persona": "eager_novice",
    "tutor_profile": "baseline",
    "topic": "Recognition and self-consciousness"
  },
  "battery_ego_superego_recognition": {
    "score": 78,
    "learner_arch": "ego_superego",
    "persona": "imposter",
    "tutor_profile": "recognition",
    "topic": "Master-slave dialectic"
  },
  "battery_psychodynamic_recognition_plus": {
    "score": 97,
    "learner_arch": "psychodynamic",
    "persona": "anxious_perfectionist",
    "tutor_profile": "recognition_plus",
    "topic": "Desire and the self in Hegel"
  },
  "battery_dialectical_budget": {
    "score": 87,
    "learner_arch": "dialectical",
    "persona": "resistant_scholar",
    "tutor_profile": "budget",
    "topic": "Sublation and unity of opposites"
  },
  "battery_cognitive_quality": {
    "score": 82,
    "learner_arch": "cognitive",
    "persona": "methodical_analyst",
    "tutor_profile": "quality",
    "topic": "Spirit and collective consciousness"
  },
  "battery_extended_dialogue": {
    "score": 48,
    "learner_arch": "ego_superego",
    "persona": "eager_novice",
    "tutor_profile": "recognition",
    "topic": "Stages of consciousness"
  }
}
```
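
The battery summary statistics quoted earlier (mean = 80.0, range 48-97) can be recomputed from this appendix; a small check script:

```python
from statistics import mean

# Scores extracted from the raw battery data above
scores = [88, 78, 97, 87, 82, 48]

battery_mean = mean(scores)                 # matches the reported mean=80.0
battery_range = (min(scores), max(scores))  # matches the reported range 48-97
```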