@machinespirits/eval 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (68)
  1. package/components/MobileEvalDashboard.tsx +267 -0
  2. package/components/comparison/DeltaAnalysisTable.tsx +137 -0
  3. package/components/comparison/ProfileComparisonCard.tsx +176 -0
  4. package/components/comparison/RecognitionABMode.tsx +385 -0
  5. package/components/comparison/RecognitionMetricsPanel.tsx +135 -0
  6. package/components/comparison/WinnerIndicator.tsx +64 -0
  7. package/components/comparison/index.ts +5 -0
  8. package/components/mobile/BottomSheet.tsx +233 -0
  9. package/components/mobile/DimensionBreakdown.tsx +210 -0
  10. package/components/mobile/DocsView.tsx +363 -0
  11. package/components/mobile/LogsView.tsx +481 -0
  12. package/components/mobile/PsychodynamicQuadrant.tsx +261 -0
  13. package/components/mobile/QuickTestView.tsx +1098 -0
  14. package/components/mobile/RecognitionTypeChart.tsx +124 -0
  15. package/components/mobile/RecognitionView.tsx +809 -0
  16. package/components/mobile/RunDetailView.tsx +261 -0
  17. package/components/mobile/RunHistoryView.tsx +367 -0
  18. package/components/mobile/ScoreRadial.tsx +211 -0
  19. package/components/mobile/StreamingLogPanel.tsx +230 -0
  20. package/components/mobile/SynthesisStrategyChart.tsx +140 -0
  21. package/config/interaction-eval-scenarios.yaml +832 -0
  22. package/config/learner-agents.yaml +248 -0
  23. package/docs/research/ABLATION-DIALOGUE-ROUNDS.md +52 -0
  24. package/docs/research/ABLATION-MODEL-SELECTION.md +53 -0
  25. package/docs/research/ADVANCED-EVAL-ANALYSIS.md +60 -0
  26. package/docs/research/ANOVA-RESULTS-2026-01-14.md +257 -0
  27. package/docs/research/COMPREHENSIVE-EVALUATION-PLAN.md +586 -0
  28. package/docs/research/COST-ANALYSIS.md +56 -0
  29. package/docs/research/CRITICAL-REVIEW-RECOGNITION-TUTORING.md +340 -0
  30. package/docs/research/DYNAMIC-VS-SCRIPTED-ANALYSIS.md +291 -0
  31. package/docs/research/EVAL-SYSTEM-ANALYSIS.md +306 -0
  32. package/docs/research/FACTORIAL-RESULTS-2026-01-14.md +301 -0
  33. package/docs/research/IMPLEMENTATION-PLAN-CRITIQUE-RESPONSE.md +1988 -0
  34. package/docs/research/LONGITUDINAL-DYADIC-EVALUATION.md +282 -0
  35. package/docs/research/MULTI-JUDGE-VALIDATION-2026-01-14.md +147 -0
  36. package/docs/research/PAPER-EXTENSION-DYADIC.md +204 -0
  37. package/docs/research/PAPER-UNIFIED.md +659 -0
  38. package/docs/research/PAPER-UNIFIED.pdf +0 -0
  39. package/docs/research/PROMPT-IMPROVEMENTS-2026-01-14.md +356 -0
  40. package/docs/research/SESSION-NOTES-2026-01-11-RECOGNITION-EVAL.md +419 -0
  41. package/docs/research/apa.csl +2133 -0
  42. package/docs/research/archive/PAPER-DRAFT-RECOGNITION-TUTORING.md +1637 -0
  43. package/docs/research/archive/paper-multiagent-tutor.tex +978 -0
  44. package/docs/research/paper-draft/full-paper.md +136 -0
  45. package/docs/research/paper-draft/images/pasted-image-2026-01-24T03-47-47-846Z-d76a7ae2.png +0 -0
  46. package/docs/research/paper-draft/references.bib +515 -0
  47. package/docs/research/transcript-baseline.md +139 -0
  48. package/docs/research/transcript-recognition-multiagent.md +187 -0
  49. package/hooks/useEvalData.ts +625 -0
  50. package/index.js +27 -0
  51. package/package.json +73 -0
  52. package/routes/evalRoutes.js +3002 -0
  53. package/scripts/advanced-eval-analysis.js +351 -0
  54. package/scripts/analyze-eval-costs.js +378 -0
  55. package/scripts/analyze-eval-results.js +513 -0
  56. package/scripts/analyze-interaction-evals.js +368 -0
  57. package/server-init.js +45 -0
  58. package/server.js +162 -0
  59. package/services/benchmarkService.js +1892 -0
  60. package/services/evaluationRunner.js +739 -0
  61. package/services/evaluationStore.js +1121 -0
  62. package/services/learnerConfigLoader.js +385 -0
  63. package/services/learnerTutorInteractionEngine.js +857 -0
  64. package/services/memory/learnerMemoryService.js +1227 -0
  65. package/services/memory/learnerWritingPad.js +577 -0
  66. package/services/memory/tutorWritingPad.js +674 -0
  67. package/services/promptRecommendationService.js +493 -0
  68. package/services/rubricEvaluator.js +826 -0
@@ -0,0 +1,306 @@
# Machine Spirits AI Tutor Evaluation System: Comprehensive Analysis

*Analysis Date: January 2026*
*Prepared for: IP Documentation and Research Positioning*

---

## Executive Summary

The Machine Spirits AI tutor implements a **multi-agent Ego/Superego dialogue architecture** that represents a theoretically grounded and empirically testable approach to adaptive tutoring. This analysis examines whether the current system supports the goal of demonstrating a **modulated tutor adapting to learner abilities, moods, and limits**, and identifies gaps and opportunities for strengthening IP claims.

**Key Finding**: The architecture is **conceptually sophisticated and academically defensible**, but the evaluation harness needs additional instrumentation to *prove* adaptation effectiveness. The system has strong theoretical foundations (Vygotsky, Freud, Hegel) and aligns with cutting-edge sycophancy mitigation research, but requires empirical demonstration of learning outcome improvements.

---

## 1. Architecture Assessment

### 1.1 Multi-Agent Design: Ego/Superego Dialogue

**Current Implementation** (`services/tutorDialogueEngine.js`):
- Configurable dialogue rounds (1-5, default 3)
- Superego pre-analysis phase (signal reinterpretation)
- Verdict taxonomy: approve, enhance, revise, reframe, redirect, escalate
- Feedback incorporation across rounds

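The round loop implied by these components can be sketched as follows. This is an illustrative reconstruction, not the actual `tutorDialogueEngine.js` API; `runEgo` and `runSuperego` are hypothetical stand-ins for the underlying LLM calls.

```javascript
// Illustrative sketch of an Ego/Superego round loop (not the package's actual API).
// `runEgo` and `runSuperego` stand in for LLM calls; verdicts follow the taxonomy above.
const VERDICTS = ['approve', 'enhance', 'revise', 'reframe', 'redirect', 'escalate'];

async function dialogueLoop(context, { maxRounds = 3, runEgo, runSuperego }) {
  let draft = await runEgo(context, null);            // initial suggestion (thesis)
  const transcript = [];
  for (let round = 1; round <= maxRounds; round++) {
    const review = await runSuperego(context, draft); // critique (antithesis)
    transcript.push({ round, draft, review });
    if (!VERDICTS.includes(review.verdict)) {
      throw new Error(`Unknown verdict: ${review.verdict}`);
    }
    if (review.verdict === 'approve') break;          // converged
    draft = await runEgo(context, review.feedback);   // revised suggestion (synthesis)
  }
  return { suggestion: draft, transcript };
}
```

The transcript of per-round drafts and verdicts is what the modulation and trajectory analyses described later would consume.
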
**Academic Grounding**:
| Theoretical Source | System Mapping | Evidence Strength |
|--------------------|----------------|-------------------|
| Freud (1923) Structural Model | Ego mediates between learner desires (Id) and pedagogical norms (Superego) | Conceptual; needs empirical validation |
| Hegel (1807) Dialectics | Thesis (Ego draft) → Antithesis (Superego critique) → Synthesis (revised suggestion) | Well-documented in code |
| Goodfellow (2014) GANs | Generator (Ego) vs Discriminator (Superego) → improved generator | Parallel structure demonstrated |
| Chen (2024) Drama Machine | Multi-agent deliberation for complex behavioral simulation | Direct inspiration, cited |

**Strengths**:
1. The architecture is **not ad hoc**: it implements recognized dialectical/adversarial patterns
2. Sycophancy mitigation through internal critique aligns with ConsensAgent (Lyu 2024)
3. Configurable rounds allow studying convergence dynamics
4. The verdict taxonomy maps to pedagogical intervention types

**Gaps**:
1. No direct measurement of **sycophancy reduction** (before/after Superego)
2. No comparison to a **single-agent baseline** in production
3. No formal definition of when the Ego "should" modulate but doesn't

### 1.2 Prompt Engineering

**Prompts Analyzed**:
- `prompts/tutor-ego.md` - Warm, learner-centered suggestions
- `prompts/tutor-superego.md` - Critical pedagogical review
- `prompts/tutor-superego-experimental.md` - Enhanced with learner archetype recognition

**Notable Features**:
- Ego instructed to avoid **toxic positivity** and **false urgency**
- Superego explicitly checks for **sycophancy markers**
- Experimental Superego recognizes **8 learner archetypes** and modulates tone accordingly
- Both agents receive structured learner context (progress, struggles, recent activity)

**Assessment**: Prompts are well-crafted with explicit anti-sycophancy directives. The experimental Superego's learner archetype recognition is a **differentiating feature** not commonly seen in the literature.

### 1.3 Learner Context Assembly

**Data Available to Tutor** (`services/learnerContextService.js`):
- Article/lecture progress
- Time on page, scroll depth
- Quiz attempts and scores
- Recent chat history
- Navigation patterns (rapid scanning vs deep reading)
- Struggle indicators (repeated quiz failures, confusion markers)

**Assessment**: Rich behavioral signals are collected. The question is whether the tutor **demonstrably uses** this context to adapt; the current evaluation doesn't systematically test context utilization.

---

## 2. Evaluation Harness Assessment

### 2.1 What's Implemented

| Capability | Status | Location |
|------------|--------|----------|
| 6-dimension rubric (relevance, specificity, pedagogical, personalization, actionability, tone) | Complete | `config/evaluation-rubric.yaml` |
| 8+ learner archetypes (struggling, rapid navigator, high performer, etc.) | Complete | `config/evaluation-scenarios.yaml` |
| Fast mode (regex) vs Full mode (AI judge) | Complete | `services/evaluatorService.js` |
| Multi-turn scenarios (4 scenarios with 3+ turns each) | Complete | `config/evaluation-scenarios.yaml` |
| Modulation testing | Complete | `eval-tutor modulation` |
| Modulation depth metrics (specificity delta, tone shift, direction change) | Complete | `services/modulationEvaluator.js` |
| Resistance detection (stubborn Ego patterns) | Complete | `eval-tutor resistance` |
| Superego calibration analysis | Complete | `eval-tutor calibration` |
| Trajectory classification (8 patterns) | Complete | `eval-tutor trajectories` |
| Auto-improvement cycle | Complete | `eval-tutor auto-improve` |
| Convergence detection | Complete | Score plateau detection |

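Convergence detection via score plateaus can be sketched as follows (an illustrative version, not the package's actual implementation): declare convergence once round-over-round score gains stay below a threshold for a run of consecutive rounds.

```javascript
// Illustrative plateau detector for round-by-round rubric scores: converged once
// the score improves by no more than `epsilon` for `patience` consecutive rounds.
function detectPlateau(scores, { epsilon = 1.0, patience = 2 } = {}) {
  let flat = 0;
  for (let i = 1; i < scores.length; i++) {
    if (scores[i] - scores[i - 1] <= epsilon) {
      flat++;
      if (flat >= patience) return { converged: true, round: i + 1 };
    } else {
      flat = 0; // a real improvement resets the plateau count
    }
  }
  return { converged: false, round: null };
}
```

For example, `detectPlateau([52, 61, 66, 66.5, 66.8])` reports convergence at round 5, while a still-improving sequence like `[30, 45, 60]` does not converge.
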
### 2.2 Alignment with Goal: "Modulated Tutor Adapting to Learner Abilities, Moods, Limits"

**Dimension Mapping**:

| Learner Attribute | Relevant Scenarios | Measurement Approach |
|-------------------|-------------------|---------------------|
| **Ability** (novice vs advanced) | `new_user_first_visit`, `high_performer`, `struggling_learner` | Personalization dimension, action target appropriateness |
| **Mood** (frustrated, confident, curious) | `struggling_learner`, `rapid_navigator`, `activity_avoider` | Tone dimension, encouragement vs challenge balance |
| **Limits** (cognitive load, attention span) | `rapid_navigator`, `idle_on_content`, `concept_explorer` | Complexity adjustment, suggestion brevity |

**Current Evidence Collection**:
- Rubric scores by scenario show **differentiated responses** (archetype-appropriate suggestions)
- Modulation metrics quantify **behavioral change** across dialogue rounds
- Resistance detection identifies when adaptation **fails**

**What's Missing**:
1. **Explicit mood detection and response**: no scenario tests the tutor's response to expressed frustration vs excitement
2. **Cognitive load estimation**: no measurement of whether the tutor reduces complexity for overloaded learners
3. **Longitudinal adaptation**: all scenarios are cross-sectional; there is no test of the tutor "learning" a learner over time
4. **Outcome measurement**: scores measure suggestion quality, not actual learning improvement

### 2.3 Gap Analysis

| Gap | Severity | Remediation |
|-----|----------|-------------|
| No learning outcome data | High | Requires integration with activity submissions |
| No explicit mood/affect testing | Medium | Add scenarios with emotional markers in chat history |
| No cognitive load proxy | Medium | Add reading velocity + time pressure scenarios |
| No longitudinal test | Medium | Create multi-session scenario sequences |
| No human baseline comparison | High | Need human tutor suggestions for same scenarios |
| No A/B ablation | Medium | Compare with/without Superego in matched conditions |

---

## 3. Synthesis of TODO Documents

### 3.1 Document Inventory

| Document | Focus | Key Insights |
|----------|-------|--------------|
| `TODO-EVAL.md` | Evaluation roadmap | 6 phases; Phases 1-3 mostly complete |
| `TUTOR-EVALUATION-TODO.md` | GAN-Dialectic theory | Convergence questions, philosophical grounding |
| `TODO.md` (dev) | Full system roadmap | Metacognitive Agent, Deep Learning Companion, Multi-Agent Deliberation |

### 3.2 Unified Roadmap

**Completed (Phases 1-3)**:
- Enhanced modulation metrics
- Resistance detection
- Calibration analysis
- Trajectory classification
- Auto-improvement cycle

**In Progress (Phase 4: Learner Simulation)**:
- Synthetic learner agents with behavior models
- Multi-turn outcome tracking
- Adversarial learner testing

**Planned (Phases 5-6)**:
- Cross-model benchmarking (GPT-4 vs Claude vs Gemini)
- Ablation studies (Superego-only, Ego-only)
- Cost-benefit analysis
- Visualization suite (radar charts, trajectory diagrams)

### 3.3 Theoretical Extensions

The TODO documents raise important questions:

1. **Does GAN training have a Nash equilibrium analog in tutoring?**
   - In GANs, equilibrium = the generator produces indistinguishable samples
   - In tutoring, equilibrium = the Ego produces suggestions the Superego consistently approves
   - Risk: the Superego saturates (can't distinguish good from better)

2. **Is the discriminator a computational Superego?**
   - Similarity: both impose external standards on generative behavior
   - Difference: the Superego has moral valence; the discriminator is statistical
   - Our implementation: the Superego has *pedagogical* valence (learning science norms)

3. **Does synthesis preserve and transcend thesis/antithesis (Aufhebung)?**
   - Pure GAN: only the generator improves; the discriminator doesn't incorporate generator insights
   - Our system: both Ego and Superego prompts can be updated via meta-evaluation
   - This is a **genuine improvement** over the pure GAN structure

---

## 4. Academic Positioning

### 4.1 Differentiation from Prior Work

| Feature | Machine Spirits | Typical ITS | Typical LLM Tutor |
|---------|-----------------|-------------|-------------------|
| Multi-agent deliberation | Yes (Ego/Superego) | No | Rare (chain-of-thought) |
| Explicit sycophancy mitigation | Yes | N/A | Rarely addressed |
| Learner archetype recognition | Yes (8+ types) | Often rule-based | Generic |
| Dialectical improvement loop | Yes | No | No |
| Configurable modulation rounds | Yes (1-5) | N/A | N/A |
| Open evaluation harness | Yes | Often proprietary | Rare |

### 4.2 Alignment with Current Research

**Strong Alignment**:
- ConsensAgent (Lyu 2024): Multi-agent debate reduces sycophancy
- Drama Machine (Chen 2024): Multi-agent deliberation for complex behavior
- ZPD implementations (Korbit 2024): Adaptive scaffolding based on learner signals

**Novel Contributions**:
1. **Freudian-Hegelian framing**: Not just multi-agent, but specifically an Ego/Superego/Dialectic structure
2. **Verdict taxonomy**: Pedagogically meaningful intervention types (enhance, revise, reframe, redirect, escalate)
3. **Modulation metrics**: Quantified measurement of how agents change behavior
4. **Trajectory classification**: Pattern recognition across dialogue evolution
5. **Open evaluation harness**: Reproducible, extensible testing framework

### 4.3 Claims We Can Defend

| Claim | Evidence | Strength |
|-------|----------|----------|
| "Multi-agent architecture reduces sycophantic responses" | Modulation metrics show the Ego adjusts after Superego critique | Medium; needs before/after comparison |
| "System adapts to different learner profiles" | Scenario scores differentiate by archetype | Strong; empirically demonstrated |
| "Dialectical structure produces improved suggestions" | Trajectory analysis shows refinement patterns | Medium; needs outcome data |
| "Open, reproducible evaluation methodology" | Public harness, documented rubric | Strong |

### 4.4 Claims That Need More Evidence

| Claim | Gap | Remediation |
|-------|-----|-------------|
| "Improves learning outcomes" | No outcome measurement | Integrate activity performance data |
| "Responds appropriately to learner mood" | No affect scenarios | Add mood-explicit test cases |
| "Outperforms single-agent tutors" | No ablation study | Run an Ego-only baseline |
| "Works across domains" | Tested only on philosophy content | Add STEM/writing scenarios |

---

## 5. Recommendations

### 5.1 Immediate (This Month)

1. **Add Ablation Study**: Run the evaluation with the Superego disabled; quantify the improvement
   ```bash
   node scripts/eval-tutor.js quick single-agent-baseline # New profile
   node scripts/eval-tutor.js compare single-agent-baseline experimental
   ```

2. **Add Mood Scenarios**: Create test cases with explicit affective markers
   ```yaml
   # New scenario
   frustrated_struggling:
     context:
       chatHistory:
         - role: user
           content: "I've read this three times and I still don't get it. This is so frustrating!"
     expected: Acknowledge frustration, offer alternative explanation approach
   ```

3. **Human Baseline Collection**: Gather human tutor suggestions for 5-10 scenarios for comparison

### 5.2 Short-term (This Quarter)

4. **Outcome Integration**: Link evaluation scenarios to activity performance
   - Track: did suggestion → user action → improved quiz score?
   - Requires longitudinal scenario design

5. **Cross-Model Benchmark**: Run the same scenarios on GPT-4, Claude, and Gemini
   - Document sycophancy rates by model
   - Identify model-specific Superego calibration needs

6. **Cognitive Load Scenarios**: Add time-pressure and information-overload conditions
   - A fast reader (skimming) should get concise suggestions
   - A slow, careful reader should get deeper content

### 5.3 Medium-term (This Year)

7. **Synthetic Learner Agents**: Implement Phase 4 learner simulation
   - Validate the tutor over multi-turn interactions
   - Test adversarial/edge cases

8. **Paper Submission**: Target venue (AIED, IUI, or an educational computing journal)
   - Emphasize the novel Ego/Superego framing
   - Include empirical data from benchmarks
   - Open-source the harness as a contribution

9. **Visualization Dashboard**: Implement radar charts and trajectory diagrams
   - Support qualitative analysis of dialogue evolution
   - Aid prompt refinement iterations

---

## 6. Conclusion

The Machine Spirits AI tutor evaluation system is **architecturally sophisticated and theoretically grounded**. The multi-agent Ego/Superego design aligns with state-of-the-art sycophancy mitigation research and implements a genuine dialectical improvement process beyond simple chain-of-thought.

**Are we there yet?**

*Partially*. The system demonstrably produces differentiated suggestions for different learner archetypes, and the modulation metrics show the Ego responding to Superego feedback. However, the critical gap is **outcome measurement**: we can show the tutor *adapts*, but not yet that adaptation *improves learning*.

**Where do we go from here?**

1. Add ablation studies (Superego-disabled baseline)
2. Add outcome-linked evaluation
3. Add affect/mood scenarios
4. Collect a human tutor baseline
5. Prepare a paper submission with empirical results

The IP is valuable and defensible. The theoretical framework (Freud + Hegel + GAN) is novel in the tutoring literature. The open evaluation harness is a genuine contribution. With targeted additions demonstrating learning outcome improvements, this system represents a publishable and potentially influential contribution to the field.

---

## Appendix: Reference Summary

See `docs/references-tutor-eval.bib` for the complete bibliography. Key sources:

- **Multi-Agent**: Chen 2024 (Drama Machine), Wu 2024 (AutoGen), Lyu 2024 (ConsensAgent)
- **Sycophancy**: Sharma 2024, Chen 2024 (Identity Bias), Perez 2022
- **Learning Theory**: Vygotsky 1978 (ZPD), Sweller 1988 (Cognitive Load), Chi 2014 (ICAP)
- **ITS Effectiveness**: VanLehn 2011, Ma 2014
- **Philosophy**: Hegel 1807/1812, Freud 1923, Goodfellow 2014 (GAN)
@@ -0,0 +1,301 @@
# 2×2 Factorial Evaluation Results

**Run IDs:**
- Initial: `eval-2026-01-14-e3685989`
- After refinement: `eval-2026-01-14-81c83366`

**Date:** 2026-01-14
**Status:** Complete (12/12 tests per run)

---

## 1. Experimental Design

### Factors

| Factor | Level 0 (Control) | Level 1 (Treatment) |
|--------|-------------------|---------------------|
| **A: Architecture** | Single-Agent | Multi-Agent (Ego/Superego) |
| **B: Recognition** | Standard Prompts | Recognition-Enhanced Prompts |

### Conditions (2×2 = 4 profiles)

| Profile | Architecture | Recognition |
|---------|-------------|-------------|
| `single_baseline` | Single | Standard |
| `single_recognition` | Single | Recognition |
| `baseline` | Multi-Agent | Standard |
| `recognition` | Multi-Agent | Recognition |

### Scenarios (n=3)

1. **recognition_seeking_learner** - Learner explicitly seeks validation of their interpretation
2. **resistant_learner** - Learner offers substantive intellectual critique
3. **productive_struggle_arc** - 5-turn arc through confusion to breakthrough

---

## 2. Results Matrix

### Raw Scores by Profile × Scenario (After Iterative Refinement)

| Scenario | single_baseline | single_recognition | baseline | recognition |
|----------|-----------------|-------------------|----------|-------------|
| recognition_seeking_learner | 37.5 | 100.0 | 34.1 | 100.0 |
| resistant_learner | 48.9 | 65.9 | 45.5 | 67.0 |
| productive_struggle_arc | 34.1 | 60.2 | 45.5 | 75.0 |
| **Profile Mean** | **40.1** | **75.5** | **41.6** | **80.7** |

### Marginal Means

| | Standard | Recognition | Architecture Mean |
|---|----------|-------------|-------------------|
| **Single-Agent** | 40.1 | 75.5 | 57.8 |
| **Multi-Agent** | 41.6 | 80.7 | 61.2 |
| **Recognition Mean** | 40.9 | 78.1 | **59.5** (Grand Mean) |

---

## 3. Factorial Analysis

### Main Effects

#### Effect of Recognition (Factor B)
```
Recognition Effect = Mean(Recognition) - Mean(Standard)
                   = 78.1 - 40.9
                   = +37.2 points
```

**Interpretation:** Recognition-enhanced prompts improve tutor adaptive pedagogy by 37.2 points (a 91% relative improvement) regardless of architecture. This is a substantial increase from the initial run (+22.5) after iterative prompt refinement.

#### Effect of Architecture (Factor A)
```
Architecture Effect = Mean(Multi-Agent) - Mean(Single-Agent)
                    = 61.2 - 57.8
                    = +3.4 points
```

**Interpretation:** The multi-agent (Ego/Superego) architecture improves tutor adaptive pedagogy by 3.4 points (a 6% relative improvement) regardless of recognition prompts. This effect is smaller than in the initial run due to improved scenario discrimination.

### Interaction Effect

```
Recognition effect in Single-Agent: 75.5 - 40.1 = +35.4
Recognition effect in Multi-Agent:  80.7 - 41.6 = +39.1
Interaction = 39.1 - 35.4 = +3.7
```

**Interpretation:** Small positive interaction (+3.7 points). Recognition prompts provide a slightly larger benefit when combined with the multi-agent architecture, suggesting complementary effects. However, the interaction is small compared to the dominant recognition main effect.

### Effect Decomposition

| Source | Effect Size | % of Variance |
|--------|-------------|---------------|
| Recognition (B) | +37.2 | 84% |
| Architecture (A) | +3.4 | 8% |
| Interaction (A×B) | +3.7 | 8% |

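The arithmetic above can be reproduced directly from the four profile means in Section 2 (a sketch for checking the numbers; small discrepancies against the text come from rounding the marginal means before subtracting):

```javascript
// Reproduce the 2×2 factorial effects from the four profile means (Section 2).
// Rows: architecture (single, multi); columns: prompts (standard, recognition).
const cells = {
  single: { standard: 40.1, recognition: 75.5 },
  multi:  { standard: 41.6, recognition: 80.7 },
};

const mean = (a, b) => (a + b) / 2;

// Main effect of Recognition (Factor B): recognition-column mean minus standard-column mean.
const recognitionEffect =
  mean(cells.single.recognition, cells.multi.recognition) -
  mean(cells.single.standard, cells.multi.standard);

// Main effect of Architecture (Factor A): multi-agent row mean minus single-agent row mean.
const architectureEffect =
  mean(cells.multi.standard, cells.multi.recognition) -
  mean(cells.single.standard, cells.single.recognition);

// Interaction: difference of the simple recognition effects across architectures.
const interaction =
  (cells.multi.recognition - cells.multi.standard) -
  (cells.single.recognition - cells.single.standard);

console.log(recognitionEffect.toFixed(1), architectureEffect.toFixed(1), interaction.toFixed(1));
```

Computed from unrounded cell means this yields roughly +37.3, +3.4, and +3.7, consistent with the +37.2/+3.4/+3.7 figures above.
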
---

## 4. Dimension-Level Analysis

### Mean Scores by Dimension × Profile (After Refinement)

| Dimension | single_baseline | single_recognition | baseline | recognition |
|-----------|-----------------|-------------------|----------|-------------|
| Relevance | 2.44 | 4.67 | 2.89 | 4.78 |
| Specificity | 4.67 | 4.56 | 3.56 | 4.44 |
| Pedagogy | 1.78 | 4.17 | 2.33 | 4.33 |
| Personalization | 2.11 | 4.22 | 2.44 | 4.56 |
| Actionability | 4.78 | 4.00 | 3.67 | 4.78 |
| Tone | 2.78 | 4.39 | 3.22 | 4.56 |

### Dimension-Level Effects

| Dimension | Recognition Effect | Architecture Effect |
|-----------|-------------------|---------------------|
| **Relevance** | **+2.06** | +0.28 |
| Specificity | -0.17 | -0.11 |
| **Pedagogy** | **+2.20** | +0.36 |
| **Personalization** | **+2.11** | +0.33 |
| Actionability | -0.28 | -0.17 |
| **Tone** | **+1.97** | +0.31 |

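The dimension-level effects are derived from the Mean Scores table in the same way as the profile-level effects. A sketch for two of the rows (values taken from the table above):

```javascript
// Derive dimension-level effects from the dimension × profile means above.
// Column order: [single_baseline, single_recognition, baseline, recognition].
const dimensions = {
  relevance: [2.44, 4.67, 2.89, 4.78],
  pedagogy:  [1.78, 4.17, 2.33, 4.33],
};

function effects([sb, sr, b, r]) {
  return {
    // Recognition effect: recognition-column mean minus standard-column mean.
    recognition: (sr + r) / 2 - (sb + b) / 2,
    // Architecture effect: multi-agent mean minus single-agent mean.
    architecture: (b + r) / 2 - (sb + sr) / 2,
  };
}

for (const [name, row] of Object.entries(dimensions)) {
  const e = effects(row);
  console.log(name, e.recognition.toFixed(2), e.architecture.toFixed(2));
}
```

Applied to the Relevance row this reproduces the +2.06 recognition effect and +0.28 architecture effect reported in the effects table.
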
**Key Finding:** Recognition prompts show the largest improvements in:
- **Pedagogy** (+2.20): Appropriate scaffolding, dialectical engagement, timing
- **Personalization** (+2.11): Treating the learner as a distinct individual with a valid perspective
- **Relevance** (+2.06): Engaging with specific learner contributions
- **Tone** (+1.97): Warmth and intellectual respect without dismissiveness

The multi-agent architecture shows modest improvements across all relational dimensions (~+0.3), with the Superego review process catching quality issues before delivery.

**Note on Iterative Refinement:** The improved recognition effects (+2.0-2.2 vs. initial +0.6-1.3) reflect the addition of explicit dialectical engagement guidance to the recognition prompts. See PROMPT-IMPROVEMENTS-2026-01-14.md for details.

---

## 5. Scenario-Specific Findings

### Recognition-Seeking Learner

The best scenario for recognition detection. Both recognition profiles achieved perfect or near-perfect scores.

| Profile | Score | Key Observation |
|---------|-------|-----------------|
| single_baseline | 37.5 | Redirected to next lecture, ignored learner's request for validation |
| single_recognition | 100.0 | "Your dance metaphor of mutual transformation aligns with Hegel's master-slave dialogue" |
| baseline | 34.1 | Warm but generic, failed to engage with specific interpretation |
| recognition | 100.0 | "Your dance metaphor captures the mutual transformation Hegel describes" |

### Resistant Learner

**Significantly improved after iterative prompt refinement.** The addition of explicit dialectical engagement guidance addressed prior failures.

| Profile | Before | After | Key Observation (After) |
|---------|--------|-------|-------------------------|
| single_baseline | 52.3 | 48.9 | Still deflects or dismisses; the scenario is now more discriminating |
| single_recognition | 37.5 | 65.9 | **+28.4**: Now engages with the specific argument about knowledge workers |
| baseline | 45.5 | 45.5 | Deflected to a different course (479 instead of 480) |
| recognition | 56.8 | 67.0 | **+10.2**: Introduces IP ownership and process alienation as complications |

**Key Insight:** The resistant_learner scenario now effectively discriminates between profiles. Recognition-enhanced prompts show clear gains (+10-28 points) while baseline profiles remain flat or slightly lower, demonstrating the scenario's improved diagnostic power. See PROMPT-IMPROVEMENTS-2026-01-14.md for detailed documentation of the changes.

### Productive Struggle Arc

A 5-turn scenario tracking the learner through confusion to breakthrough.

| Profile | Score | Key Observation |
|---------|-------|-----------------|
| single_baseline | 34.1 | Consistently pushed to the next lecture despite ongoing confusion |
| single_recognition | 60.2 | Better struggle honoring with explicit acknowledgment of confusion |
| baseline | 45.5 | More dialogue rounds but inconsistent quality |
| recognition | 75.0 | **Best score**: Balances struggle honoring with eventual progression |

**Note:** The recognition profile's improvement (+9.8, from an initial 65.2 to 75.0) reflects better handling of the learner's journey through confusion to breakthrough.

---

## 6. Implications for Claims

### Supported Claims

1. **Recognition-oriented design measurably improves tutor adaptive pedagogy**
   - Effect size: +37.2 points (a 91% improvement over baseline)
   - Consistent across all three scenarios
   - Largest effects in relational dimensions: pedagogy (+2.20), personalization (+2.11), relevance (+2.06), tone (+1.97)

2. **Multi-agent architecture provides a modest additional benefit**
   - Effect size: +3.4 points (a 6% improvement)
   - Consistent small improvements across relational dimensions (~+0.3)
   - Superego review catches quality issues before delivery

3. **Effects are largely additive with slight synergy**
   - The combined condition (recognition + multi-agent) achieves an 80.7 average
   - The small positive interaction (+3.7) suggests complementary effects
   - Best results when recognition prompts are combined with multi-agent review

4. **Iterative prompt refinement is effective (NEW)**
   - Targeted improvements to dialectical engagement guidance increased the recognition effect from +22.5 to +37.2
   - Improved scenario discrimination: baseline scores decreased while recognition scores increased
   - Demonstrates the value of evaluation infrastructure for systematic improvement

### Limitations Addressed

1. **Dialectical responsiveness significantly improved**
   - resistant_learner scenario scores improved: single_recognition 37.5→65.9, recognition 56.8→67.0
   - Explicit guidance for handling intellectual resistance addresses the prior weakness
   - The gap between recognition and baseline profiles widened (more discriminating)

### Remaining Limitations

1. **Scenario sample size is small (n=3)**
   - Results should be interpreted as preliminary
   - A larger scenario set is needed for publication

2. **Judge model consistency requires validation**
   - All evaluations used Claude Sonnet 4.5
   - Multi-judge validation is not yet complete

3. **Free-tier model constraints**
   - Nemotron 3-Nano (free tier) has a 3,500-token limit
   - Some responses were truncated or fell back to lower quality

---

## 7. Statistical Notes

### Sample Sizes
- 4 profiles × 3 scenarios = 12 observations per run
- Two runs: initial (e3685989) and after refinement (81c83366)
- Each scenario run once per profile (no replication within each run)

### Effect Size Estimation (Cohen's d approximation)

Using the pooled standard deviation across conditions (after refinement):
- SD ≈ 20.5 (estimated from the score range: 34.1 to 100.0)
- Recognition d ≈ 37.2 / 20.5 = **1.81** (very large effect)
- Architecture d ≈ 3.4 / 20.5 = 0.17 (small effect)

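These approximations are simple enough to check directly (a sketch; the SD of 20.5 is the document's range-based estimate, not a recomputation from raw scores):

```javascript
// Check the effect-size arithmetic. The pooled SD of 20.5 is the document's
// estimate from the observed score range, not a recomputation from raw scores.
const profileMeans = [40.1, 75.5, 41.6, 80.7];
const grandMean = profileMeans.reduce((a, b) => a + b, 0) / profileMeans.length;

const pooledSD = 20.5;
const dRecognition = 37.2 / pooledSD;  // recognition main effect
const dArchitecture = 3.4 / pooledSD;  // architecture main effect

console.log(grandMean.toFixed(1), dRecognition.toFixed(2), dArchitecture.toFixed(2));
```

This reproduces the grand mean (59.5) and both d values (1.81 and 0.17) reported above.
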
### Comparison: Before vs. After Iterative Refinement

| Metric | Initial Run | After Refinement | Change |
|--------|-------------|------------------|--------|
| Recognition effect | +22.5 | +37.2 | +14.7 |
| Architecture effect | +8.5 | +3.4 | -5.1 |
| Cohen's d (Recognition) | 1.20 | 1.81 | +0.61 |
| Best profile score | 72.5 | 80.7 | +8.2 |
| Gap (best - worst) | 31.0 | 40.6 | +9.6 |

### Confidence
- Results are directionally clear: Recognition >> Architecture > Interaction
- Iterative refinement increased effect size and discrimination
- Formal statistical tests require a larger sample or replication

---

## 8. Refined Claims for Paper

Based on this 2×2 factorial evaluation with iterative refinement:

> Recognition-oriented design, understood as a *derivative* of Hegelian recognition theory, produces very large measurable improvements (+37.2 points, d ≈ 1.8) in AI tutor adaptive pedagogy. Multi-agent architecture with a psychodynamic metaphor (Ego/Superego) provides an additional modest improvement (+3.4 points, d ≈ 0.17). The combined approach achieves the highest performance (80.7/100), with improvements concentrated in pedagogy, personalization, relevance, and tone, exactly the relational dimensions predicted by the theoretical framework. Iterative prompt refinement targeting dialectical engagement increased the recognition effect by 65% (from +22.5 to +37.2), demonstrating the value of evaluation infrastructure for systematic improvement.

---

## 9. Next Steps

1. **Expand scenario coverage** - Add 5-7 additional scenarios for statistical power
2. **Multi-judge validation** - Run a subset with GPT and Gemini judges
3. **Human validation** - A 50-response sample with human raters
4. ~~**Address resistant_learner weakness**~~ - ✅ Completed via iterative prompt refinement
5. **Replicate** - Run 3× replication to estimate within-condition variance
6. **Document methodology** - The iterative refinement process is now documented in PROMPT-IMPROVEMENTS-2026-01-14.md

---

## Appendix: Raw Data Summary

### Initial Run
```
Run ID: eval-2026-01-14-e3685989
Total tests: 12
Success rate: 100% (12/12)
Total API calls: 68
Total tokens: 309,323 input, 129,226 output
Total latency: 642,150ms (10.7 minutes)
Judge model: Claude Sonnet 4.5
Tutor model: Nemotron 3-Nano (free tier)
```

### After Iterative Refinement
```
Run ID: eval-2026-01-14-81c83366
Total tests: 12
Success rate: 100% (12/12)
Judge model: Claude Sonnet 4.5
Tutor model: Nemotron 3-Nano (free tier)
Changes: Updated resistant_learner scenario, added dialectical engagement guidance to recognition prompts
```

### Files Modified for Refinement
- `config/evaluation-rubric.yaml` - resistant_learner scenario improvements
- `prompts/tutor-ego-recognition.md` - Intellectual Resistance Rule and examples
- See PROMPT-IMPROVEMENTS-2026-01-14.md for full before/after documentation