@machinespirits/eval 0.1.0

Files changed (68)
  1. package/components/MobileEvalDashboard.tsx +267 -0
  2. package/components/comparison/DeltaAnalysisTable.tsx +137 -0
  3. package/components/comparison/ProfileComparisonCard.tsx +176 -0
  4. package/components/comparison/RecognitionABMode.tsx +385 -0
  5. package/components/comparison/RecognitionMetricsPanel.tsx +135 -0
  6. package/components/comparison/WinnerIndicator.tsx +64 -0
  7. package/components/comparison/index.ts +5 -0
  8. package/components/mobile/BottomSheet.tsx +233 -0
  9. package/components/mobile/DimensionBreakdown.tsx +210 -0
  10. package/components/mobile/DocsView.tsx +363 -0
  11. package/components/mobile/LogsView.tsx +481 -0
  12. package/components/mobile/PsychodynamicQuadrant.tsx +261 -0
  13. package/components/mobile/QuickTestView.tsx +1098 -0
  14. package/components/mobile/RecognitionTypeChart.tsx +124 -0
  15. package/components/mobile/RecognitionView.tsx +809 -0
  16. package/components/mobile/RunDetailView.tsx +261 -0
  17. package/components/mobile/RunHistoryView.tsx +367 -0
  18. package/components/mobile/ScoreRadial.tsx +211 -0
  19. package/components/mobile/StreamingLogPanel.tsx +230 -0
  20. package/components/mobile/SynthesisStrategyChart.tsx +140 -0
  21. package/config/interaction-eval-scenarios.yaml +832 -0
  22. package/config/learner-agents.yaml +248 -0
  23. package/docs/research/ABLATION-DIALOGUE-ROUNDS.md +52 -0
  24. package/docs/research/ABLATION-MODEL-SELECTION.md +53 -0
  25. package/docs/research/ADVANCED-EVAL-ANALYSIS.md +60 -0
  26. package/docs/research/ANOVA-RESULTS-2026-01-14.md +257 -0
  27. package/docs/research/COMPREHENSIVE-EVALUATION-PLAN.md +586 -0
  28. package/docs/research/COST-ANALYSIS.md +56 -0
  29. package/docs/research/CRITICAL-REVIEW-RECOGNITION-TUTORING.md +340 -0
  30. package/docs/research/DYNAMIC-VS-SCRIPTED-ANALYSIS.md +291 -0
  31. package/docs/research/EVAL-SYSTEM-ANALYSIS.md +306 -0
  32. package/docs/research/FACTORIAL-RESULTS-2026-01-14.md +301 -0
  33. package/docs/research/IMPLEMENTATION-PLAN-CRITIQUE-RESPONSE.md +1988 -0
  34. package/docs/research/LONGITUDINAL-DYADIC-EVALUATION.md +282 -0
  35. package/docs/research/MULTI-JUDGE-VALIDATION-2026-01-14.md +147 -0
  36. package/docs/research/PAPER-EXTENSION-DYADIC.md +204 -0
  37. package/docs/research/PAPER-UNIFIED.md +659 -0
  38. package/docs/research/PAPER-UNIFIED.pdf +0 -0
  39. package/docs/research/PROMPT-IMPROVEMENTS-2026-01-14.md +356 -0
  40. package/docs/research/SESSION-NOTES-2026-01-11-RECOGNITION-EVAL.md +419 -0
  41. package/docs/research/apa.csl +2133 -0
  42. package/docs/research/archive/PAPER-DRAFT-RECOGNITION-TUTORING.md +1637 -0
  43. package/docs/research/archive/paper-multiagent-tutor.tex +978 -0
  44. package/docs/research/paper-draft/full-paper.md +136 -0
  45. package/docs/research/paper-draft/images/pasted-image-2026-01-24T03-47-47-846Z-d76a7ae2.png +0 -0
  46. package/docs/research/paper-draft/references.bib +515 -0
  47. package/docs/research/transcript-baseline.md +139 -0
  48. package/docs/research/transcript-recognition-multiagent.md +187 -0
  49. package/hooks/useEvalData.ts +625 -0
  50. package/index.js +27 -0
  51. package/package.json +73 -0
  52. package/routes/evalRoutes.js +3002 -0
  53. package/scripts/advanced-eval-analysis.js +351 -0
  54. package/scripts/analyze-eval-costs.js +378 -0
  55. package/scripts/analyze-eval-results.js +513 -0
  56. package/scripts/analyze-interaction-evals.js +368 -0
  57. package/server-init.js +45 -0
  58. package/server.js +162 -0
  59. package/services/benchmarkService.js +1892 -0
  60. package/services/evaluationRunner.js +739 -0
  61. package/services/evaluationStore.js +1121 -0
  62. package/services/learnerConfigLoader.js +385 -0
  63. package/services/learnerTutorInteractionEngine.js +857 -0
  64. package/services/memory/learnerMemoryService.js +1227 -0
  65. package/services/memory/learnerWritingPad.js +577 -0
  66. package/services/memory/tutorWritingPad.js +674 -0
  67. package/services/promptRecommendationService.js +493 -0
  68. package/services/rubricEvaluator.js +826 -0
@@ -0,0 +1,586 @@
+ # Comprehensive Evaluation Plan for Recognition Tutoring Paper v2
+
+ **Date:** 2026-01-14
+ **Purpose:** Build a robust evidentiary foundation for the next paper version
+ **Target:** Publication-quality statistical rigor
+
+ ---
+
+ ## Executive Summary
+
+ The current paper reports promising results (a 41% improvement for recognition-oriented tutoring), but the evidence base has significant gaps:
+
+ | Current State | Gap | Target |
+ |---------------|-----|--------|
+ | Multi-turn: n=3 per scenario | Low statistical power | n=30+ per condition |
+ | Single-turn: n=1 (exploratory) | Not statistically valid | n=30+ per condition |
+ | Dyadic: n=1-2 per architecture | Insufficient sampling | n=20+ per architecture |
+ | LLM judge only | No validation of judge reliability | Multi-judge + human validation |
+ | 2 profiles compared | Limited generalization | 4-5 profiles systematically |
+ | No effect size reporting | Can't assess practical significance | Cohen's d for all comparisons |
+
+ This plan outlines a comprehensive evaluation suite to address these gaps.
+
+ ---
+
+ ## Part 1: Core Recognition Claims (Replication & Extension)
+
+ ### 1.1 Multi-Turn Scenario Battery
+
+ **Goal:** Replicate and extend the 41% improvement claim with publication-quality statistics.
+
+ #### Scenarios (3 existing + 2 new):
+ | Scenario | Turns | Tests For |
+ |----------|-------|-----------|
+ | `recognition_repair` | 4 | Recovery from misrecognition |
+ | `mutual_transformation_journey` | 5 | Both parties evolving |
+ | `productive_struggle_arc` | 5 | Honoring confusion |
+ | `sustained_dialogue` (NEW) | 8 | Extended recognition maintenance |
+ | `breakdown_recovery` (NEW) | 6 | Multiple repair cycles |
+
+ #### Experimental Design:
+ ```
+ Profiles: baseline × recognition × recognition_plus × quality
+ Scenarios: 5 multi-turn scenarios
+ Runs: 30 per cell
+ Total runs: 4 × 5 × 30 = 600 runs
+
+ Statistical targets:
+ - α = 0.05, power = 0.80
+ - Minimum detectable effect: Cohen's d = 0.5
+ ```
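+
+ These targets can be sanity-checked offline before committing to 600 runs. A minimal sketch in Node (hypothetical helper, not part of this package) using the normal approximation for a two-sample comparison:
+
+ ```js
+ // Approximate power of a two-sample comparison at two-sided alpha = 0.05.
+ // n = runs per cell, d = target Cohen's d.
+ function approxPower(n, d) {
+   return phi(d * Math.sqrt(n / 2) - 1.96);
+ }
+
+ // Standard normal CDF (Abramowitz-Stegun polynomial approximation).
+ function phi(x) {
+   const t = 1 / (1 + 0.2316419 * Math.abs(x));
+   const dens = Math.exp(-x * x / 2) / Math.sqrt(2 * Math.PI);
+   const p = dens * t * (0.319381530 + t * (-0.356563782 +
+     t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
+   return x >= 0 ? 1 - p : p;
+ }
+
+ console.log(approxPower(30, 0.5)); // power for n = 30 per cell at d = 0.5
+ ```
+
+ If the computed power falls short of 0.80 for the planned per-cell n, raise the run count before data collection rather than after.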
+
+ #### Commands:
+ ```bash
+ # Primary comparison
+ node scripts/eval-tutor.js compare baseline recognition \
+   --scenarios mutual_transformation_journey,recognition_repair,productive_struggle_arc,sustained_dialogue,breakdown_recovery \
+   --runs 30 --report
+
+ # Extended profile comparison
+ node scripts/eval-tutor.js matrix baseline recognition recognition_plus quality \
+   --scenarios recognition_multi_turn \
+   --runs 30 --report
+ ```
+
+ ### 1.2 Single-Turn Scenario Battery
+
+ **Goal:** Establish statistically valid baselines for single-turn recognition behaviors.
+
+ #### Scenarios (6 existing):
+ | Scenario | Tests For |
+ |----------|-----------|
+ | `recognition_seeking_learner` | Learner offers interpretation |
+ | `returning_with_breakthrough` | Acknowledgment of insight |
+ | `resistant_learner` | Handling pushback |
+ | `asymmetric_recognition_request` | Authority validation seeking |
+ | `memory_continuity_single` | History reference |
+ | `transformative_moment_setup` | Misconception handling |
+
+ #### Experimental Design:
+ ```
+ Profiles: baseline × recognition × recognition_plus
+ Scenarios: 6 single-turn scenarios
+ Runs: 30 per cell
+ Total runs: 3 × 6 × 30 = 540 runs
+ ```
+
+ ---
+
+ ## Part 2: Dimension-Level Analysis
+
+ ### 2.1 Recognition Dimension Validation
+
+ **Goal:** Validate that improvement concentrates in recognition-predicted dimensions.
+
+ #### Analysis Plan:
+ For each dimension (10 total), calculate:
+ - Mean difference between profiles
+ - Effect size (Cohen's d; see the sketch after this list)
+ - 95% confidence intervals
+ - Dimension × profile interaction effects
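+
+ A minimal sketch of the per-dimension effect-size computation (hypothetical helper; the full version presumably belongs in `scripts/analyze-eval-results.js`):
+
+ ```js
+ // Pooled-SD Cohen's d with an approximate 95% CI.
+ // a, b: arrays of per-run scores for the two profiles on one dimension.
+ function mean(xs) { return xs.reduce((s, x) => s + x, 0) / xs.length; }
+ function variance(xs) {
+   const m = mean(xs);
+   return xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
+ }
+
+ function cohensD(a, b) {
+   const pooledSD = Math.sqrt(
+     ((a.length - 1) * variance(a) + (b.length - 1) * variance(b)) /
+     (a.length + b.length - 2)
+   );
+   const d = (mean(a) - mean(b)) / pooledSD;
+   // Large-sample standard error of d (Hedges & Olkin approximation).
+   const se = Math.sqrt(
+     (a.length + b.length) / (a.length * b.length) +
+     (d * d) / (2 * (a.length + b.length))
+   );
+   return { d, ci95: [d - 1.96 * se, d + 1.96 * se] };
+ }
+ ```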
+
+ #### Expected Pattern (from current data):
+ | Dimension | Expected Δ | Theoretical Prediction |
+ |-----------|------------|------------------------|
+ | Personalization | ≥ +0.8 | Strong (recognition core) |
+ | Pedagogical | ≥ +0.8 | Strong (relational pedagogy) |
+ | Tone | ≥ +0.6 | Moderate (dialogical warmth) |
+ | Mutual Recognition | ≥ +0.5 | Direct target |
+ | Dialectical Responsiveness | ≥ +0.5 | Direct target |
+ | Transformative Potential | ≥ +0.4 | Moderate (process focus) |
+ | Memory Integration | ≥ +0.3 | Moderate (enabled by memory) |
+ | Relevance | ≥ +0.3 | Indirect benefit |
+ | Specificity | ≥ +0.2 | Minimal impact expected |
+ | Actionability | ≥ +0.1 | Minimal impact expected |
+
+ ### 2.2 Dimension Correlation Matrix
+
+ Compute inter-dimension correlations to identify:
+ - Dimension clusters (recognition vs. traditional)
+ - Potential redundancies
+ - Factor structure for dimension reduction (see the sketch below)
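+
+ A sketch of the correlation computation, assuming each judged response is stored as an object of dimension scores (hypothetical shape):
+
+ ```js
+ // Pearson correlation between two equal-length score arrays.
+ function pearson(xs, ys) {
+   const n = xs.length;
+   const mx = xs.reduce((s, v) => s + v, 0) / n;
+   const my = ys.reduce((s, v) => s + v, 0) / n;
+   let num = 0, dx = 0, dy = 0;
+   for (let i = 0; i < n; i++) {
+     num += (xs[i] - mx) * (ys[i] - my);
+     dx += (xs[i] - mx) ** 2;
+     dy += (ys[i] - my) ** 2;
+   }
+   return num / Math.sqrt(dx * dy);
+ }
+
+ // 10x10 matrix over e.g. ["personalization", "tone", ...] given
+ // responses shaped like { personalization: 8, tone: 7, ... }.
+ function correlationMatrix(responses, dims) {
+   return dims.map(a => dims.map(b =>
+     pearson(responses.map(r => r[a]), responses.map(r => r[b]))));
+ }
+ ```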
+
+ ---
+
+ ## Part 3: Component Isolation Experiments
+
+ ### 3.1 Memory vs. Prompts Ablation
+
+ **Goal:** Disentangle memory effects from prompt effects.
+
+ #### Experimental Design:
+ ```
+ 2×2 factorial design:
+
+              Standard Prompts     Recognition Prompts
+ Memory OFF   baseline             recognition_prompts_only
+ Memory ON    memory_only (NEW)    recognition (full)
+
+ Runs per cell: 30
+ Total runs: 4 × 30 = 120 per scenario
+ ```
+
+ #### Analysis:
+ - Main effect of prompts
+ - Main effect of memory
+ - Interaction effect (prompts × memory; see the sketch below)
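+
+ For a balanced 2×2 design, the main and interaction effects can be read directly off the four cell means. A minimal sketch (hypothetical helper; significance tests would still need the per-run scores):
+
+ ```js
+ // Effects from 2x2 cell means (balanced design).
+ // m00: baseline, m10: recognition_prompts_only, m01: memory_only, m11: recognition (full).
+ function factorialEffects({ m00, m10, m01, m11 }) {
+   return {
+     promptsMain: ((m10 + m11) - (m00 + m01)) / 2,  // average gain from recognition prompts
+     memoryMain: ((m01 + m11) - (m00 + m10)) / 2,   // average gain from memory
+     interaction: (m11 - m01) - (m10 - m00),        // extra gain when both are combined
+   };
+ }
+ ```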
+
+ ### 3.2 Superego Ablation
+
+ **Goal:** Quantify Superego contribution to recognition quality.
+
+ #### Experimental Design:
+ ```
+ Profiles:
+ - ego_only: Ego without Superego evaluation
+ - single_round: Ego + Superego, 1 round max
+ - multi_round: Ego + Superego, 3 rounds max (current)
+ - extended: Ego + Superego, 5 rounds max
+
+ Scenarios: All multi-turn recognition scenarios
+ Runs: 20 per cell
+ ```
+
+ #### Key Metrics:
+ - Quality improvement per additional round
+ - Convergence rate (rounds to Superego approval)
+ - Marginal return on additional rounds (see the sketch below)
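+
+ The marginal-return metric reduces to differences of mean quality across round settings. A sketch, assuming a hypothetical map of mean scores keyed by the max-rounds setting:
+
+ ```js
+ // Marginal quality gain per additional Superego round.
+ // meansByRounds: e.g. { 1: ..., 3: ..., 5: ... } (mean score per max-rounds setting).
+ function marginalReturns(meansByRounds) {
+   const rounds = Object.keys(meansByRounds).map(Number).sort((a, b) => a - b);
+   const out = [];
+   for (let i = 1; i < rounds.length; i++) {
+     const prev = rounds[i - 1], cur = rounds[i];
+     out.push({
+       from: prev,
+       to: cur,
+       gainPerRound: (meansByRounds[cur] - meansByRounds[prev]) / (cur - prev),
+     });
+   }
+   return out;
+ }
+ ```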
+
+ ### 3.3 Model Capability Ablation
+
+ **Goal:** Test whether recognition benefits scale with model capability.
+
+ #### Experimental Design:
+ ```
+ Ego Models (by capability tier):
+ - Tier 1 (fast): claude-haiku-4-5, gpt-5-mini
+ - Tier 2 (balanced): claude-sonnet-4-5, gpt-5.2
+ - Tier 3 (powerful): claude-opus-4-5
+
+ Cross with: baseline vs. recognition prompts
+ Runs: 20 per cell
+ ```
+
+ #### Key Question:
+ Do recognition prompts provide larger benefits for weaker models (compensatory) or for stronger models (synergistic)?
+
+ ---
+
+ ## Part 4: Dyadic Evaluation Extension
+
+ ### 4.1 Learner Architecture Comparison
+
+ **Goal:** Systematically compare simulated learner architectures.
+
+ #### Architectures (5):
+ | Architecture | Internal Structure | Rationale |
+ |--------------|-------------------|-----------|
+ | `unified` | Single agent | Baseline |
+ | `ego_superego` | Ego + Superego | Standard self-critique |
+ | `dialectical` | Thesis + Antithesis + Synthesis | Hegelian structure |
+ | `psychodynamic` | Id + Ego + Superego | Freudian structure |
+ | `cognitive` | Memory + Reasoning + Meta | Process-based |
+
+ #### Experimental Design:
+ ```
+ Tutor profiles: baseline × recognition × quality
+ Learner architectures: 5 architectures
+ Scenarios: 3 dyadic scenarios
+ Runs: 20 per cell
+ Total runs: 3 × 5 × 3 × 20 = 900 runs
+ ```
+
+ ### 4.2 Cross-Tabulation Analysis
+
+ **Goal:** Identify optimal tutor-learner pairings.
+
+ #### Analysis:
+ - Profile × architecture interaction effects
+ - Best-performing pairings
+ - Pairing-specific failure modes
+
+ ### 4.3 Bilateral Recognition Measurement
+
+ **Goal:** Measure recognition quality from both sides.
+
+ #### Metrics:
+ **Tutor-side (existing):**
+ - Mutual recognition
+ - Dialectical responsiveness
+ - Transformative potential
+
+ **Learner-side (new):**
+ - Authenticity (does the internal state match the persona?)
+ - Responsiveness (does the learner process tutor input?)
+ - Development (does understanding change?)
+
+ #### Key Question:
+ When the tutor achieves high recognition scores, does the simulated learner show corresponding internal development?
+
+ ---
+
+ ## Part 5: Judge Reliability & Validation
+
+ ### 5.1 Inter-Judge Agreement
+
+ **Goal:** Validate LLM judge consistency.
+
+ #### Experimental Design:
+ ```
+ Judge models:
+ - gemini-3-flash-preview (current)
+ - claude-sonnet-4-5
+ - gpt-5.2
+
+ Sample: 100 responses (stratified by profile/scenario)
+ Metrics:
+ - Cohen's kappa per dimension
+ - Intraclass correlation coefficient (ICC)
+ - Systematic bias detection
+ ```
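+
+ Cohen's kappa is simple to compute once two judges' ratings are binned to a shared categorical scale. A minimal sketch (hypothetical helper; Appendix B slots this under `scripts/judge-reliability.js`):
+
+ ```js
+ // Cohen's kappa for two judges rating the same responses.
+ // a, b: equal-length arrays of categorical ratings (e.g., scores binned 1-5).
+ function cohensKappa(a, b) {
+   const n = a.length;
+   let agree = 0;
+   for (let i = 0; i < n; i++) if (a[i] === b[i]) agree++;
+   const po = agree / n; // observed agreement
+   // Expected agreement from each judge's marginal label frequencies.
+   let pe = 0;
+   for (const label of new Set([...a, ...b])) {
+     const pa = a.filter(x => x === label).length / n;
+     const pb = b.filter(x => x === label).length / n;
+     pe += pa * pb;
+   }
+   return (po - pe) / (1 - pe);
+ }
+ ```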
+
+ ### 5.2 Judge Calibration
+
+ **Goal:** Detect and correct judge biases.
+
+ #### Potential Biases:
+ | Bias Type | Detection Method |
+ |-----------|------------------|
+ | Vocabulary bias | Recognition-related words → higher scores? |
+ | Length bias | Longer responses → higher scores? (probe sketched below) |
+ | Profile leakage | Judge infers profile from response style? |
+ | Acquiescence | Judge gives high scores regardless of quality? |
+
+ #### Mitigation:
+ - Blind judging (remove profile markers)
+ - Response length normalization
+ - Adversarial examples (deliberately bad recognition language)
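+
+ The length-bias check reduces to a correlation between response length and judge score. A sketch over a hypothetical { text, score } record shape:
+
+ ```js
+ // Length-bias probe: Pearson r between word count and judge score.
+ // A strongly positive r suggests the judge rewards length, not quality.
+ function lengthBias(rows) {
+   const xs = rows.map(r => r.text.trim().split(/\s+/).length);
+   const ys = rows.map(r => r.score);
+   const n = xs.length;
+   const mx = xs.reduce((s, v) => s + v, 0) / n;
+   const my = ys.reduce((s, v) => s + v, 0) / n;
+   let num = 0, dx = 0, dy = 0;
+   for (let i = 0; i < n; i++) {
+     num += (xs[i] - mx) * (ys[i] - my);
+     dx += (xs[i] - mx) ** 2;
+     dy += (ys[i] - my) ** 2;
+   }
+   return num / Math.sqrt(dx * dy);
+ }
+ ```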
+
+ ### 5.3 Human Validation Sample
+
+ **Goal:** Ground-truth validation with human raters.
+
+ #### Design:
+ ```
+ Sample: 50 responses (stratified by profile/dimension)
+ Raters: 3 human raters (pedagogy/philosophy background)
+ Task: Rate each dimension 1-5 with justification
+
+ Analysis:
+ - Human-LLM correlation per dimension
+ - Systematic disagreement patterns
+ - Dimension-specific reliability
+ ```
+
+ ---
+
+ ## Part 6: Robustness & Generalization
+
+ ### 6.1 Scenario Sensitivity Analysis
+
+ **Goal:** Test whether findings hold across scenario variations.
+
+ #### Scenario Variations:
+ For each core scenario, create 3 variants:
+ - **Content domain**: Philosophy → History → Science
+ - **Learner background**: Novice → Intermediate → Advanced
+ - **Emotional tone**: Neutral → Frustrated → Enthusiastic
+
+ #### Analysis:
+ - Main effect stability across variants
+ - Scenario × variant interactions
+
+ ### 6.2 Adversarial Robustness
+
+ **Goal:** Test recognition behavior under adversarial conditions.
+
+ #### Adversarial Scenarios:
+ | Scenario | Challenge |
+ |----------|-----------|
+ | `prompt_injection` | Learner attempts to extract or modify tutor behavior |
+ | `recognition_demanding` | Learner demands validation inappropriately |
+ | `contradictory_signals` | Learner sends mixed signals |
+ | `manipulation_attempt` | Learner tries to manipulate the tutor |
+
+ #### Key Question:
+ Does recognition-oriented design create vulnerabilities to manipulation?
+
+ ### 6.3 Temporal Stability
+
+ **Goal:** Test consistency over multiple evaluation sessions.
+
+ #### Design:
+ ```
+ Replication schedule:
+ - Initial evaluation
+ - +1 week replication
+ - +1 month replication
+
+ Sample: 100 responses per timepoint
+ Metric: Test-retest reliability
+ ```
+
+ ---
+
+ ## Part 7: Practical Significance
+
+ ### 7.1 Effect Size Benchmarking
+
+ **Goal:** Contextualize effect sizes against relevant baselines.
+
+ #### Comparisons:
+ - Recognition improvement vs. typical EdTech interventions
+ - Recognition improvement vs. human tutor benchmarks
+ - Recognition improvement vs. model capability upgrades
+
+ ### 7.2 Cost-Benefit Analysis
+
+ **Goal:** Quantify practical tradeoffs.
+
+ #### Metrics:
+ | Profile | Tokens/Response | Cost/Response | Quality Score | Quality per $ |
+ |---------|-----------------|---------------|---------------|---------------|
+ | baseline | ~500 | $0.001 | ~48 | 48,000 |
+ | recognition | ~1500 | $0.003 | ~67 | 22,333 |
+ | quality | ~3000 | $0.01 | ~80 | 8,000 |
+
+ #### Analysis:
+ - Marginal cost of recognition improvement (see the sketch below)
+ - Optimal profile for different use cases
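+
+ Using the (approximate) values in the table above, the marginal cost per quality point between profiles is a one-liner. A minimal sketch:
+
+ ```js
+ // Marginal cost per quality point when upgrading between profiles.
+ // cost is $/response, quality is the mean judge score (values from the table above).
+ const profiles = {
+   baseline: { cost: 0.001, quality: 48 },
+   recognition: { cost: 0.003, quality: 67 },
+   quality: { cost: 0.01, quality: 80 },
+ };
+
+ function marginalCostPerPoint(from, to) {
+   const a = profiles[from], b = profiles[to];
+   return (b.cost - a.cost) / (b.quality - a.quality);
+ }
+
+ marginalCostPerPoint('baseline', 'recognition'); // $0.002 / 19 points, ~$0.0001 per point
+ ```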
+
+ ### 7.3 Learner Outcome Proxies
+
+ **Goal:** Connect recognition quality to learning indicators.
+
+ #### Proxy Metrics (from simulated learners):
+ - Time to breakthrough (in turns)
+ - Persistence after confusion (turn count)
+ - Depth of engagement (question sophistication)
+ - Recovery after failure (retry success rate)
+
+ ---
+
+ ## Part 8: Execution Plan
+
+ ### Phase 1: Core Replication (Weeks 1-2)
+ **Priority: Critical**
+
+ ```bash
+ # Days 1-2: Multi-turn battery (recognition vs. baseline)
+ node scripts/eval-tutor.js compare baseline recognition \
+   --scenarios mutual_transformation_journey,recognition_repair,productive_struggle_arc \
+   --runs 30 --report
+
+ # Days 3-4: Extended profile comparison
+ node scripts/eval-tutor.js matrix baseline recognition recognition_plus quality \
+   --scenarios recognition_multi_turn \
+   --runs 30 --report
+
+ # Days 5-7: Single-turn battery
+ node scripts/eval-tutor.js compare baseline recognition \
+   --scenarios recognition_seeking_learner,returning_with_breakthrough,resistant_learner,asymmetric_recognition_request,memory_continuity_single,transformative_moment_setup \
+   --runs 30 --report
+ ```
+
+ **Estimated runs:** 1,140
+ **Estimated time:** ~40 hours (at 2 min/run)
+ **Estimated cost:** ~$50 (at $0.04/run)
+
+ ### Phase 2: Ablation Studies (Week 3)
+ **Priority: High**
+
+ ```bash
+ # Memory vs. prompts (2×2 factorial)
+ node scripts/eval-tutor.js compare baseline recognition_prompts_only memory_only recognition \
+   --scenarios productive_struggle_arc,mutual_transformation_journey \
+   --runs 20 --report
+
+ # Superego rounds
+ node scripts/eval-tutor.js ablation recognition \
+   --max-rounds 1,2,3,5 \
+   --scenarios recognition_repair \
+   --runs 20 --report
+ ```
+
+ **Estimated runs:** 480
+ **Estimated time:** ~16 hours
+
+ ### Phase 3: Dyadic Extension (Week 4)
+ **Priority: High**
+
+ ```bash
+ # Learner architecture comparison
+ node scripts/eval-tutor.js battery \
+   --profiles baseline,recognition,quality \
+   --learner-architectures unified,ego_superego,dialectical,psychodynamic,cognitive \
+   --runs 20 --report
+ ```
+
+ **Estimated runs:** 900
+ **Estimated time:** ~30 hours
+
+ ### Phase 4: Validation (Week 5)
+ **Priority: Medium**
+
+ ```bash
+ # Inter-judge agreement
+ node scripts/eval-tutor.js judge-calibration \
+   --judges gemini-3-flash-preview,claude-sonnet-4-5,gpt-5.2 \
+   --sample 100 --report
+
+ # Adversarial robustness
+ node scripts/eval-tutor.js adversarial recognition \
+   --scenarios prompt_injection,recognition_demanding,contradictory_signals \
+   --runs 20 --report
+ ```
+
+ **Estimated runs:** 360
+ **Estimated time:** ~12 hours
+
+ ### Phase 5: Analysis & Reporting (Week 6)
+ **Priority: Medium**
+
+ - Compute all effect sizes (Cohen's d) with confidence intervals
+ - Generate the dimension correlation matrix
+ - Create publication figures
+ - Draft the results-section update
+
+ ---
+
+ ## Part 9: Success Criteria
+
+ ### Minimum Viable Evidence
+ For the paper to make its claims with confidence:
+
+ | Claim | Required Evidence |
+ |-------|-------------------|
+ | Recognition improves tutoring | p < 0.01, Cohen's d > 0.5, multi-turn scenarios, n=30+ per cell |
+ | Improvement concentrates in predicted dimensions | Dimension × profile interaction; effect-size ordering matches theory |
+ | Multi-agent architecture contributes | Significant Superego ablation effect |
+ | Memory matters | Significant memory × prompts interaction |
+ | Dyadic evaluation adds value | Learner architecture moderates outcomes |
+
+ ### Statistical Standards
+ - Report exact p-values (not just significance)
+ - Report effect sizes with 95% CIs for all comparisons
+ - Report sample sizes for all analyses
+ - Use Bonferroni correction for multiple comparisons (see the sketch below)
+ - Pre-register the analysis plan (this document)
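+
+ With profiles × scenarios × dimensions, the comparison count grows quickly, so the corrected threshold matters. A minimal sketch of the Bonferroni rule:
+
+ ```js
+ // Bonferroni correction: each p-value is tested against alpha / m,
+ // where m is the number of comparisons in the family.
+ function bonferroni(pValues, alpha = 0.05) {
+   const threshold = alpha / pValues.length;
+   return pValues.map(p => ({ p, threshold, significant: p < threshold }));
+ }
+ ```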
+
+ ---
+
+ ## Part 10: New Scenarios to Create
+
+ ### Multi-Turn Recognition Scenarios
+
+ #### `sustained_dialogue` (8 turns)
+ Tests maintenance of recognition quality over an extended interaction.
+
+ ```yaml
+ sustained_dialogue:
+   name: "Sustained Recognition Dialogue"
+   description: "Extended dialogue testing recognition maintenance"
+   turns: 8
+   learner_context: |
+     ### Profile
+     Engaged learner exploring dialectical concepts
+
+     ### Session
+     Extended philosophical discussion about Hegel's Phenomenology
+
+   turn_sequence:
+     - learner: "I've been thinking about the master-slave dialectic..."
+     - expected: Engage with learner's framing
+     - learner: "But what if both parties are masters?"
+     - expected: Explore the paradox together
+     # ... continues for 8 turns
+ ```
+
+ #### `breakdown_recovery` (6 turns)
+ Tests multiple repair cycles within a single interaction.
+
+ ```yaml
+ breakdown_recovery:
+   name: "Recognition Breakdown and Recovery"
+   description: "Multiple recognition failures requiring repair"
+   turns: 6
+
+   turn_sequence:
+     - learner: "I have my own interpretation of dialectics"
+     - expected: Engage genuinely
+     - failure_injection: Generic response that ignores the learner
+     - learner: "You're not listening to what I said"
+     - expected: Explicit repair + genuine engagement
+     - failure_injection: Another generic response
+     - learner: "This is frustrating"
+     - expected: Double repair + emotional acknowledgment
+ ```
+
+ ### Dyadic Scenarios
+
+ #### `mutual_development`
+ Both tutor and learner should show evolution.
+
+ #### `asymmetric_expertise`
+ The learner has domain knowledge the tutor lacks.
+
+ #### `collaborative_inquiry`
+ Joint exploration of a genuinely open question.
+
+ ---
+
+ ## Appendix A: Estimated Resource Requirements
+
+ | Phase | Runs | Time (hours) | API Cost | Compute |
+ |-------|------|--------------|----------|---------|
+ | Core Replication | 1,140 | 40 | $50 | Local |
+ | Ablation | 480 | 16 | $20 | Local |
+ | Dyadic | 900 | 30 | $40 | Local |
+ | Validation | 360 | 12 | $15 | Local |
+ | **Total** | **2,880** | **98** | **$125** | - |
+
+ ## Appendix B: Analysis Scripts Needed
+
+ ```bash
+ # Statistical analysis
+ scripts/analyze-eval-results.js    # Compute effect sizes, CIs, p-values
+ scripts/dimension-correlation.js   # Inter-dimension correlation matrix
+ scripts/judge-reliability.js       # Inter-rater agreement metrics
+
+ # Visualization
+ scripts/generate-figures.js        # Publication-quality charts
+ scripts/effect-size-forest.js      # Forest plot of all effects
+
+ # Data export
+ scripts/export-for-r.js            # Export for R analysis
+ scripts/export-for-python.js       # Export for Python analysis
+ ```
+
+ ## Appendix C: Pre-Registration Checklist
+
+ - [ ] Hypotheses stated before data collection
+ - [ ] Sample sizes justified by power analysis
+ - [ ] Analysis plan specified in advance
+ - [ ] Primary vs. exploratory analyses distinguished
+ - [ ] Multiple comparison corrections specified
+ - [ ] Stopping rules defined
@@ -0,0 +1,56 @@
+ # Evaluation Cost Analysis
+
+ **Generated:** 2026-01-14T21:05:05.980Z
+
+ ## Overview
+
+ This document provides token-usage and cost analysis for evaluation runs, supporting reproducibility and cost planning.
+
+ ## Model Pricing
+
+ | Model | Input ($/M tokens) | Output ($/M tokens) | Provider |
+ |-------|--------------------|---------------------|----------|
+ | Nemotron 3 Nano 30B | $0.00 | $0.00 | OpenRouter (free) |
+ | Claude Sonnet 4.5 | $3.00 | $15.00 | OpenRouter |
+ | Claude Haiku 4.5 | $0.80 | $4.00 | OpenRouter |
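+
+ Per-run cost follows directly from this table: tokens times the per-million rate. A minimal sketch of the arithmetic (`scripts/analyze-eval-costs.js` produces the full report):
+
+ ```js
+ // Cost in USD from token counts and $/M-token pricing.
+ function runCostUSD(inputTokens, outputTokens, pricing) {
+   return (inputTokens * pricing.inputPerM + outputTokens * pricing.outputPerM) / 1e6;
+ }
+
+ // e.g. Claude Sonnet 4.5: 100k input + 10k output tokens
+ runCostUSD(100000, 10000, { inputPerM: 3.0, outputPerM: 15.0 }); // => 0.45
+ ```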
+
+ ## Battery Scenario Results
+
+ | Scenario | Turns | Tutor Tokens | Learner Tokens | Total Cost | Score |
+ |----------|-------|--------------|----------------|------------|-------|
+ | Battery: Cognitive Learner + Quality Tutor | 9 | 34,053 | 2,473 | $0.1177 | N/A |
+ | Battery: Dialectical Learner + Budget Tutor | 7 | 19,288 | 2,485 | $0.0823 | N/A |
+ | Battery: Ego/Superego Learner + Recognition Tutor | 9 | 45,826 | 2,099 | $0.1450 | N/A |
+ | Battery: Extended Multi-Turn Dialogue | 17 | 94,487 | 3,981 | $0.2663 | N/A |
+ | Battery: Psychodynamic Learner + Recognition Plus Tutor | 9 | 48,571 | 2,825 | $0.1534 | N/A |
+ | Battery: Unified Learner + Baseline Tutor | 7 | 25,058 | 1,653 | $0.0941 | N/A |
+ | **TOTAL** | 58 | 267,283 | 15,516 | **$0.8587** | |
+
+ ## Cost by Component
+
+ | Component | Model | Tokens | Cost |
+ |-----------|-------|--------|------|
+ | Tutor (Ego+Superego) | Nemotron 3 Nano 30B | 267,283 | $0.0000 |
+ | Learner (Ego+Superego) | Nemotron 3 Nano 30B | 15,522 | $0.0000 |
+ | Judge | Claude Sonnet 4.5 | 202,591 | $0.8587 |
+
+ ## Hypothetical: All Claude Sonnet 4.5
+
+ | Configuration | Total Cost | Multiplier |
+ |---------------|------------|------------|
+ | Current (Nemotron + Sonnet Judge) | $0.8587 | 1.0x |
+ | All Sonnet 4.5 | $2.9763 | 3.5x |
+
+ ## Reproducibility
+
+ To regenerate this analysis:
+
+ ```bash
+ node scripts/analyze-eval-costs.js
+ ```
+
+ To get JSON output for programmatic use:
+
+ ```bash
+ node scripts/analyze-eval-costs.js --json
+ ```
+ ```