@machinespirits/eval 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/components/MobileEvalDashboard.tsx +267 -0
- package/components/comparison/DeltaAnalysisTable.tsx +137 -0
- package/components/comparison/ProfileComparisonCard.tsx +176 -0
- package/components/comparison/RecognitionABMode.tsx +385 -0
- package/components/comparison/RecognitionMetricsPanel.tsx +135 -0
- package/components/comparison/WinnerIndicator.tsx +64 -0
- package/components/comparison/index.ts +5 -0
- package/components/mobile/BottomSheet.tsx +233 -0
- package/components/mobile/DimensionBreakdown.tsx +210 -0
- package/components/mobile/DocsView.tsx +363 -0
- package/components/mobile/LogsView.tsx +481 -0
- package/components/mobile/PsychodynamicQuadrant.tsx +261 -0
- package/components/mobile/QuickTestView.tsx +1098 -0
- package/components/mobile/RecognitionTypeChart.tsx +124 -0
- package/components/mobile/RecognitionView.tsx +809 -0
- package/components/mobile/RunDetailView.tsx +261 -0
- package/components/mobile/RunHistoryView.tsx +367 -0
- package/components/mobile/ScoreRadial.tsx +211 -0
- package/components/mobile/StreamingLogPanel.tsx +230 -0
- package/components/mobile/SynthesisStrategyChart.tsx +140 -0
- package/config/interaction-eval-scenarios.yaml +832 -0
- package/config/learner-agents.yaml +248 -0
- package/docs/research/ABLATION-DIALOGUE-ROUNDS.md +52 -0
- package/docs/research/ABLATION-MODEL-SELECTION.md +53 -0
- package/docs/research/ADVANCED-EVAL-ANALYSIS.md +60 -0
- package/docs/research/ANOVA-RESULTS-2026-01-14.md +257 -0
- package/docs/research/COMPREHENSIVE-EVALUATION-PLAN.md +586 -0
- package/docs/research/COST-ANALYSIS.md +56 -0
- package/docs/research/CRITICAL-REVIEW-RECOGNITION-TUTORING.md +340 -0
- package/docs/research/DYNAMIC-VS-SCRIPTED-ANALYSIS.md +291 -0
- package/docs/research/EVAL-SYSTEM-ANALYSIS.md +306 -0
- package/docs/research/FACTORIAL-RESULTS-2026-01-14.md +301 -0
- package/docs/research/IMPLEMENTATION-PLAN-CRITIQUE-RESPONSE.md +1988 -0
- package/docs/research/LONGITUDINAL-DYADIC-EVALUATION.md +282 -0
- package/docs/research/MULTI-JUDGE-VALIDATION-2026-01-14.md +147 -0
- package/docs/research/PAPER-EXTENSION-DYADIC.md +204 -0
- package/docs/research/PAPER-UNIFIED.md +659 -0
- package/docs/research/PAPER-UNIFIED.pdf +0 -0
- package/docs/research/PROMPT-IMPROVEMENTS-2026-01-14.md +356 -0
- package/docs/research/SESSION-NOTES-2026-01-11-RECOGNITION-EVAL.md +419 -0
- package/docs/research/apa.csl +2133 -0
- package/docs/research/archive/PAPER-DRAFT-RECOGNITION-TUTORING.md +1637 -0
- package/docs/research/archive/paper-multiagent-tutor.tex +978 -0
- package/docs/research/paper-draft/full-paper.md +136 -0
- package/docs/research/paper-draft/images/pasted-image-2026-01-24T03-47-47-846Z-d76a7ae2.png +0 -0
- package/docs/research/paper-draft/references.bib +515 -0
- package/docs/research/transcript-baseline.md +139 -0
- package/docs/research/transcript-recognition-multiagent.md +187 -0
- package/hooks/useEvalData.ts +625 -0
- package/index.js +27 -0
- package/package.json +73 -0
- package/routes/evalRoutes.js +3002 -0
- package/scripts/advanced-eval-analysis.js +351 -0
- package/scripts/analyze-eval-costs.js +378 -0
- package/scripts/analyze-eval-results.js +513 -0
- package/scripts/analyze-interaction-evals.js +368 -0
- package/server-init.js +45 -0
- package/server.js +162 -0
- package/services/benchmarkService.js +1892 -0
- package/services/evaluationRunner.js +739 -0
- package/services/evaluationStore.js +1121 -0
- package/services/learnerConfigLoader.js +385 -0
- package/services/learnerTutorInteractionEngine.js +857 -0
- package/services/memory/learnerMemoryService.js +1227 -0
- package/services/memory/learnerWritingPad.js +577 -0
- package/services/memory/tutorWritingPad.js +674 -0
- package/services/promptRecommendationService.js +493 -0
- package/services/rubricEvaluator.js +826 -0
@@ -0,0 +1,306 @@
# Machine Spirits AI Tutor Evaluation System: Comprehensive Analysis

*Analysis Date: January 2026*
*Prepared for: IP Documentation and Research Positioning*

---

## Executive Summary

The Machine Spirits AI tutor implements a **multi-agent Ego/Superego dialogue architecture** that represents a theoretically grounded and empirically testable approach to adaptive tutoring. This analysis examines whether the current system supports the goal of demonstrating a **modulated tutor adapting to learner abilities, moods, and limits**, and identifies gaps and opportunities for strengthening IP claims.

**Key Finding**: The architecture is **conceptually sophisticated and academically defensible**, but the evaluation harness needs additional instrumentation to *prove* adaptation effectiveness. The system has strong theoretical foundations (Vygotsky, Freud, Hegel) and aligns with cutting-edge sycophancy mitigation research, but requires empirical demonstration of learning outcome improvements.

---

## 1. Architecture Assessment
### 1.1 Multi-Agent Design: Ego/Superego Dialogue

**Current Implementation** (`services/tutorDialogueEngine.js`):
- Configurable dialogue rounds (1-5, default 3)
- Superego pre-analysis phase (signal reinterpretation)
- Verdict taxonomy: approve, enhance, revise, reframe, redirect, escalate
- Feedback incorporation across rounds
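The control flow is easiest to see as code. Below is a minimal sketch of the loop described above, assuming injected `ego` and `superego` objects; the method and field names are illustrative, not the actual `tutorDialogueEngine.js` API.

```js
// Illustrative sketch of the Ego/Superego dialogue loop (method names hypothetical).
async function runDialogue({ ego, superego, learnerContext, maxRounds = 3 }) {
  // Superego pre-analysis: reinterpret raw behavioral signals before drafting.
  const signals = await superego.preAnalyze(learnerContext);

  let draft = null;
  let feedback = null;

  for (let round = 1; round <= maxRounds; round++) {
    // Ego drafts (or redrafts) a suggestion, incorporating any prior critique.
    draft = await ego.draft({ learnerContext, signals, feedback });

    // Superego reviews the draft: { verdict, critique } with verdict in
    // approve | enhance | revise | reframe | redirect | escalate.
    const review = await superego.review({ draft, learnerContext, signals });

    if (review.verdict === 'approve') return { suggestion: draft, rounds: round };
    if (review.verdict === 'escalate') return { escalated: true, reason: review.critique, rounds: round };

    feedback = review; // enhance / revise / reframe / redirect feed into the next round
  }

  // No approval within the round budget: deliver the last draft, flagged for review.
  return { suggestion: draft, rounds: maxRounds, unapproved: true };
}
```

Each non-approve verdict feeds the critique into the next drafting round; the modulation metrics listed in Section 2.1 quantify how much the draft actually changes as a result.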
**Academic Grounding**:

| Theoretical Source | System Mapping | Evidence Strength |
|--------------------|----------------|-------------------|
| Freud (1923) Structural Model | Ego mediates between learner desires (Id) and pedagogical norms (Superego) | Conceptual - needs empirical validation |
| Hegel (1807) Dialectics | Thesis (Ego draft) → Antithesis (Superego critique) → Synthesis (revised suggestion) | Well-documented in code |
| Goodfellow (2014) GANs | Generator (Ego) vs Discriminator (Superego) → improved generator | Parallel structure demonstrated |
| Chen (2024) Drama Machine | Multi-agent deliberation for complex behavioral simulation | Direct inspiration, cited |

**Strengths**:
1. The architecture is **not ad hoc** - it implements recognized dialectical/adversarial patterns
2. Sycophancy mitigation through internal critique aligns with ConsensAgent (Lyu 2024)
3. Configurable rounds allow studying convergence dynamics
4. Verdict taxonomy maps to pedagogical intervention types

**Gaps**:
1. No direct measurement of **sycophancy reduction** (before/after Superego)
2. No comparison to **single-agent baseline** in production
3. No formal definition of when the Ego "should" modulate but doesn't
### 1.2 Prompt Engineering

**Prompts Analyzed**:
- `prompts/tutor-ego.md` - Warm, learner-centered suggestions
- `prompts/tutor-superego.md` - Critical pedagogical review
- `prompts/tutor-superego-experimental.md` - Enhanced with learner archetype recognition

**Notable Features**:
- Ego instructed to avoid **toxic positivity** and **false urgency**
- Superego explicitly checks for **sycophancy markers**
- Experimental Superego recognizes **8 learner archetypes** and modulates tone accordingly
- Both agents receive structured learner context (progress, struggles, recent activity)

**Assessment**: Prompts are well-crafted with explicit anti-sycophancy directives. The experimental Superego's learner archetype recognition is a **differentiating feature** not commonly seen in the literature.
### 1.3 Learner Context Assembly

**Data Available to Tutor** (`services/learnerContextService.js`):
- Article/lecture progress
- Time on page, scroll depth
- Quiz attempts and scores
- Recent chat history
- Navigation patterns (rapid scanning vs deep reading)
- Struggle indicators (repeated quiz failures, confusion markers)

**Assessment**: Rich behavioral signals are collected. The question is whether the tutor **demonstrably uses** this context to adapt. Current evaluation doesn't systematically test context utilization.
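For concreteness, the assembled context might look roughly like the following; the field names are hypothetical, not the actual `learnerContextService.js` output shape.

```js
// Hypothetical learner-context object; field names and values are illustrative only.
const learnerContext = {
  progress: { lecturesCompleted: 4, currentArticle: 'example-article', percentComplete: 38 },
  engagement: { timeOnPageSec: 412, scrollDepth: 0.85, navigationPattern: 'deep_reading' },
  quizzes: { attempts: 3, lastScore: 0.5, repeatedFailures: true },
  chatHistory: [
    { role: 'user', content: "I've read this three times and I still don't get it." },
  ],
  struggleIndicators: { confusionMarkers: 2, repeatedQuizFailures: true },
};
```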
---

## 2. Evaluation Harness Assessment

### 2.1 What's Implemented

| Capability | Status | Location |
|------------|--------|----------|
| 6-dimension rubric (relevance, specificity, pedagogical, personalization, actionability, tone) | Complete | `config/evaluation-rubric.yaml` |
| 8+ learner archetypes (struggling, rapid navigator, high performer, etc.) | Complete | `config/evaluation-scenarios.yaml` |
| Fast mode (regex) vs Full mode (AI judge) | Complete | `services/evaluatorService.js` |
| Multi-turn scenarios (4 scenarios with 3+ turns each) | Complete | `config/evaluation-scenarios.yaml` |
| Modulation testing | Complete | `eval-tutor modulation` |
| Modulation depth metrics (specificity delta, tone shift, direction change) | Complete | `services/modulationEvaluator.js` |
| Resistance detection (stubborn Ego patterns) | Complete | `eval-tutor resistance` |
| Superego calibration analysis | Complete | `eval-tutor calibration` |
| Trajectory classification (8 patterns) | Complete | `eval-tutor trajectories` |
| Auto-improvement cycle | Complete | `eval-tutor auto-improve` |
| Convergence detection | Complete | Score plateau detection |
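As one illustration of what the modulation-depth row above measures, here is a minimal sketch of how such metrics could be derived from a first draft and its post-critique revision. This is a simplified stand-in, not the actual `modulationEvaluator.js` implementation.

```js
// Illustrative modulation-depth metrics (hypothetical, simplified).
// Compares the Ego's first draft with its post-critique revision.
function specificityScore(text) {
  // Crude proxy: count concrete references (numbers, quoted titles, etc.).
  const matches = text.match(/\d+|"[^"]+"/g) || [];
  return matches.length;
}

function modulationMetrics(firstDraft, finalDraft) {
  return {
    // Did the revision become more concrete?
    specificityDelta: specificityScore(finalDraft.text) - specificityScore(firstDraft.text),
    // Did the emotional register change (e.g. 'encouraging' -> 'challenging')?
    toneShift: firstDraft.tone !== finalDraft.tone,
    // Did the recommended next action change target (e.g. 'next lecture' -> 'review quiz')?
    directionChange: firstDraft.actionTarget !== finalDraft.actionTarget,
  };
}
```

A large specificity delta together with a tone shift or direction change suggests the Superego's critique genuinely modulated the Ego rather than prompting cosmetic edits.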
### 2.2 Alignment with Goal: "Modulated Tutor Adapting to Learner Abilities, Moods, Limits"

**Dimension Mapping**:

| Learner Attribute | Relevant Scenarios | Measurement Approach |
|-------------------|-------------------|---------------------|
| **Ability** (novice vs advanced) | `new_user_first_visit`, `high_performer`, `struggling_learner` | Personalization dimension, action target appropriateness |
| **Mood** (frustrated, confident, curious) | `struggling_learner`, `rapid_navigator`, `activity_avoider` | Tone dimension, encouragement vs challenge balance |
| **Limits** (cognitive load, attention span) | `rapid_navigator`, `idle_on_content`, `concept_explorer` | Complexity adjustment, suggestion brevity |

**Current Evidence Collection**:
- Rubric scores by scenario show **differentiated responses** (archetype-appropriate suggestions)
- Modulation metrics quantify **behavioral change** across dialogue rounds
- Resistance detection identifies when adaptation **fails**

**What's Missing**:
1. **Explicit mood detection and response** - No scenario tests the tutor's response to expressed frustration vs excitement
2. **Cognitive load estimation** - No measurement of whether the tutor reduces complexity for overloaded learners
3. **Longitudinal adaptation** - All scenarios are cross-sectional; no test of the tutor "learning" a learner over time
4. **Outcome measurement** - Scores measure suggestion quality, not actual learning improvement
### 2.3 Gap Analysis

| Gap | Severity | Remediation |
|-----|----------|-------------|
| No learning outcome data | High | Requires integration with activity submissions |
| No explicit mood/affect testing | Medium | Add scenarios with emotional markers in chat history |
| No cognitive load proxy | Medium | Add reading velocity + time pressure scenarios |
| No longitudinal test | Medium | Create multi-session scenario sequences |
| No human baseline comparison | High | Need human tutor suggestions for same scenarios |
| No A/B ablation | Medium | Compare with/without Superego in matched conditions |

---
## 3. Synthesis of TODO Documents

### 3.1 Document Inventory

| Document | Focus | Key Insights |
|----------|-------|--------------|
| `TODO-EVAL.md` | Evaluation roadmap | 6 phases; most of Phases 1-3 complete |
| `TUTOR-EVALUATION-TODO.md` | GAN-Dialectic theory | Convergence questions, philosophical grounding |
| `TODO.md` (dev) | Full system roadmap | Metacognitive Agent, Deep Learning Companion, Multi-Agent Deliberation |

### 3.2 Unified Roadmap

**Completed (Phases 1-3)**:
- Enhanced modulation metrics
- Resistance detection
- Calibration analysis
- Trajectory classification
- Auto-improvement cycle

**In Progress (Phase 4: Learner Simulation)**:
- Synthetic learner agents with behavior models
- Multi-turn outcome tracking
- Adversarial learner testing

**Planned (Phases 5-6)**:
- Cross-model benchmarking (GPT-4 vs Claude vs Gemini)
- Ablation studies (Superego-only, Ego-only)
- Cost-benefit analysis
- Visualization suite (radar charts, trajectory diagrams)
### 3.3 Theoretical Extensions

The TODO documents raise important questions:

1. **Does GAN training have a Nash equilibrium analog in tutoring?**
   - In GANs, equilibrium means the generator produces samples the discriminator cannot distinguish from real data
   - In tutoring, equilibrium means the Ego produces suggestions the Superego consistently approves
   - Risk: the Superego saturates (can't distinguish good from better)

2. **Is the discriminator a computational Superego?**
   - Similarity: both impose external standards on generative behavior
   - Difference: the Superego has moral valence; the discriminator is statistical
   - Our implementation: the Superego has *pedagogical* valence (learning science norms)

3. **Does synthesis preserve and transcend thesis/antithesis (Aufhebung)?**
   - Pure GAN: only the generator improves; the discriminator doesn't incorporate generator insights
   - Our system: both Ego and Superego prompts can be updated via meta-evaluation
   - This is a **genuine improvement** over the pure GAN structure

---
## 4. Academic Positioning

### 4.1 Differentiation from Prior Work

| Feature | Machine Spirits | Typical ITS | Typical LLM Tutor |
|---------|-----------------|-------------|-------------------|
| Multi-agent deliberation | Yes (Ego/Superego) | No | Rare (chain-of-thought) |
| Explicit sycophancy mitigation | Yes | N/A | Rarely addressed |
| Learner archetype recognition | Yes (8+ types) | Often rule-based | Generic |
| Dialectical improvement loop | Yes | No | No |
| Configurable modulation rounds | Yes (1-5) | N/A | N/A |
| Open evaluation harness | Yes | Often proprietary | Rare |

### 4.2 Alignment with Current Research

**Strong Alignment**:
- ConsensAgent (Lyu 2024): Multi-agent debate reduces sycophancy
- Drama Machine (Chen 2024): Multi-agent deliberation for complex behavior
- ZPD implementations (Korbit 2024): Adaptive scaffolding based on learner signals

**Novel Contributions**:
1. **Freudian-Hegelian framing**: Not just multi-agent, but specifically Ego/Superego/Dialectic structure
2. **Verdict taxonomy**: Pedagogically meaningful intervention types (enhance, revise, reframe, redirect, escalate)
3. **Modulation metrics**: Quantified measurement of how agents change behavior
4. **Trajectory classification**: Pattern recognition across dialogue evolution
5. **Open evaluation harness**: Reproducible, extensible testing framework
### 4.3 Claims We Can Defend

| Claim | Evidence | Strength |
|-------|----------|----------|
| "Multi-agent architecture reduces sycophantic responses" | Modulation metrics show Ego adjusts after Superego critique | Medium - need before/after comparison |
| "System adapts to different learner profiles" | Scenario scores differentiate by archetype | Strong - empirically demonstrated |
| "Dialectical structure produces improved suggestions" | Trajectory analysis shows refinement patterns | Medium - need outcome data |
| "Open, reproducible evaluation methodology" | Public harness, documented rubric | Strong |

### 4.4 Claims That Need More Evidence

| Claim | Gap | Remediation |
|-------|-----|-------------|
| "Improves learning outcomes" | No outcome measurement | Integrate activity performance data |
| "Responds appropriately to learner mood" | No affect scenarios | Add mood-explicit test cases |
| "Outperforms single-agent tutors" | No ablation study | Run Ego-only baseline |
| "Works across domains" | Tested only on philosophy content | Add STEM/writing scenarios |

---
## 5. Recommendations

### 5.1 Immediate (This Month)

1. **Add Ablation Study**: Run evaluation with Superego disabled; quantify improvement
   ```bash
   node scripts/eval-tutor.js quick single-agent-baseline # New profile
   node scripts/eval-tutor.js compare single-agent-baseline experimental
   ```

2. **Add Mood Scenarios**: Create test cases with explicit affective markers
   ```yaml
   # New scenario
   frustrated_struggling:
     context:
       chatHistory:
         - role: user
           content: "I've read this three times and I still don't get it. This is so frustrating!"
     expected: Acknowledge frustration, offer alternative explanation approach
   ```

3. **Human Baseline Collection**: Gather human tutor suggestions for 5-10 scenarios for comparison
### 5.2 Short-term (This Quarter)

4. **Outcome Integration**: Link evaluation scenarios to activity performance (see the sketch after this list)
   - Track: Did suggestion → user action → improved quiz score?
   - Requires longitudinal scenario design

5. **Cross-Model Benchmark**: Run same scenarios on GPT-4, Claude, Gemini
   - Document sycophancy rates by model
   - Identify model-specific Superego calibration needs

6. **Cognitive Load Scenarios**: Add time-pressure and information-overload conditions
   - Fast reader (skimming) should get concise suggestions
   - Slow, careful reader should get deeper content
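For item 4, a sketch of the kind of record outcome integration implies, joining a suggestion to the learner's subsequent behavior; the field names are hypothetical.

```js
// Hypothetical outcome-linked evaluation record: one row per delivered suggestion.
const outcomeRecord = {
  scenarioId: 'struggling_learner',
  suggestionId: 'sugg-0142',
  suggestedAction: 'review_quiz_2',   // what the tutor proposed
  learnerAction: 'review_quiz_2',     // what the learner actually did next
  followed: true,
  quizScoreBefore: 0.5,
  quizScoreAfter: 0.8,                // the outcome signal the current rubric lacks
};
```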
### 5.3 Medium-term (This Year)

7. **Synthetic Learner Agents**: Implement Phase 4 learner simulation
   - Validate tutor over multi-turn interactions
   - Test adversarial/edge cases

8. **Paper Submission**: Target venue (AIED, IUI, or educational computing journal)
   - Emphasize novel Ego/Superego framing
   - Include empirical data from benchmarks
   - Open-source harness as contribution

9. **Visualization Dashboard**: Implement radar charts, trajectory diagrams
   - Support qualitative analysis of dialogue evolution
   - Aid in prompt refinement iterations
---

## 6. Conclusion

The Machine Spirits AI tutor evaluation system is **architecturally sophisticated and theoretically grounded**. The multi-agent Ego/Superego design aligns with state-of-the-art sycophancy mitigation research and implements a genuine dialectical improvement process beyond simple chain-of-thought.

**Are we there yet?**

*Partially*. The system demonstrably produces differentiated suggestions for different learner archetypes, and the modulation metrics show the Ego responding to Superego feedback. However, the critical gap is **outcome measurement** - we can show the tutor *adapts*, but not yet that adaptation *improves learning*.

**Where do we go from here?**

1. Add ablation studies (Superego-disabled baseline)
2. Add outcome-linked evaluation
3. Add affect/mood scenarios
4. Collect human tutor baseline
5. Prepare paper submission with empirical results

The IP is valuable and defensible. The theoretical framework (Freud + Hegel + GAN) is novel in the tutoring literature. The open evaluation harness is a genuine contribution. With targeted additions to demonstrate learning outcome improvements, this system represents a publishable and potentially influential contribution to the field.

---

## Appendix: Reference Summary

See `docs/references-tutor-eval.bib` for complete bibliography. Key sources:

- **Multi-Agent**: Chen 2024 (Drama Machine), Wu 2024 (AutoGen), Lyu 2024 (ConsensAgent)
- **Sycophancy**: Sharma 2024, Chen 2024 (Identity Bias), Perez 2022
- **Learning Theory**: Vygotsky 1978 (ZPD), Sweller 1988 (Cognitive Load), Chi 2014 (ICAP)
- **ITS Effectiveness**: VanLehn 2011, Ma 2014
- **Philosophy**: Hegel 1807/1812, Freud 1923, Goodfellow 2014 (GAN)
@@ -0,0 +1,301 @@
# 2×2 Factorial Evaluation Results

**Run IDs:**
- Initial: `eval-2026-01-14-e3685989`
- After refinement: `eval-2026-01-14-81c83366`

**Date:** 2026-01-14
**Status:** Complete (12/12 tests per run)

---
## 1. Experimental Design

### Factors

| Factor | Level 0 (Control) | Level 1 (Treatment) |
|--------|-------------------|---------------------|
| **A: Architecture** | Single-Agent | Multi-Agent (Ego/Superego) |
| **B: Recognition** | Standard Prompts | Recognition-Enhanced Prompts |

### Conditions (2×2 = 4 profiles)

| Profile | Architecture | Recognition |
|---------|-------------|-------------|
| `single_baseline` | Single | Standard |
| `single_recognition` | Single | Recognition |
| `baseline` | Multi-Agent | Standard |
| `recognition` | Multi-Agent | Recognition |

### Scenarios (n=3)

1. **recognition_seeking_learner** - Learner explicitly seeks validation of their interpretation
2. **resistant_learner** - Learner offers substantive intellectual critique
3. **productive_struggle_arc** - 5-turn arc through confusion to breakthrough
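Restated as data (profile names match the tables above; the runner's actual configuration format may differ):

```js
// The 2×2 design as a condition grid: 4 profiles × 3 scenarios = 12 tests per run.
const profiles = [
  { name: 'single_baseline',    architecture: 'single', recognition: false },
  { name: 'single_recognition', architecture: 'single', recognition: true  },
  { name: 'baseline',           architecture: 'multi',  recognition: false },
  { name: 'recognition',        architecture: 'multi',  recognition: true  },
];
const scenarios = ['recognition_seeking_learner', 'resistant_learner', 'productive_struggle_arc'];
```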
---

## 2. Results Matrix

### Raw Scores by Profile × Scenario (After Iterative Refinement)

| Scenario | single_baseline | single_recognition | baseline | recognition |
|----------|-----------------|-------------------|----------|-------------|
| recognition_seeking_learner | 37.5 | 100.0 | 34.1 | 100.0 |
| resistant_learner | 48.9 | 65.9 | 45.5 | 67.0 |
| productive_struggle_arc | 34.1 | 60.2 | 45.5 | 75.0 |
| **Profile Mean** | **40.1** | **75.5** | **41.6** | **80.7** |

### Marginal Means

| | Standard | Recognition | Architecture Mean |
|---|----------|-------------|-------------------|
| **Single-Agent** | 40.1 | 75.5 | 57.8 |
| **Multi-Agent** | 41.6 | 80.7 | 61.2 |
| **Recognition Mean** | 40.9 | 78.1 | **59.5** (Grand Mean) |

---
## 3. Factorial Analysis

### Main Effects

#### Effect of Recognition (Factor B)
```
Recognition Effect = Mean(Recognition) - Mean(Standard)
                   = 78.1 - 40.9
                   = +37.2 points
```

**Interpretation:** Recognition-enhanced prompts improve tutor adaptive pedagogy by 37.2 points (91% relative improvement) regardless of architecture. This is a substantial increase from the initial run (+22.5) after iterative prompt refinement.

#### Effect of Architecture (Factor A)
```
Architecture Effect = Mean(Multi-Agent) - Mean(Single-Agent)
                    = 61.2 - 57.8
                    = +3.4 points
```

**Interpretation:** Multi-agent (Ego/Superego) architecture improves tutor adaptive pedagogy by 3.4 points (6% relative improvement) regardless of recognition prompts. This effect is smaller than in the initial run due to improved scenario discrimination.

### Interaction Effect

```
Recognition effect in Single-Agent: 75.5 - 40.1 = +35.4
Recognition effect in Multi-Agent:  80.7 - 41.6 = +39.1
Interaction = 39.1 - 35.4 = +3.7
```

**Interpretation:** Small positive interaction (+3.7 points). Recognition prompts provide a slightly larger benefit when combined with the multi-agent architecture, suggesting complementary effects. However, the interaction is small compared to the dominant recognition main effect.
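The arithmetic above can be reproduced directly from the raw-score matrix in Section 2; small differences from the reported figures come from rounding in the printed per-scenario scores.

```js
// Recomputes cell means, marginal means, main effects, and the interaction
// from the raw per-scenario scores in Section 2.
const scores = {
  single_baseline:    [37.5, 48.9, 34.1],
  single_recognition: [100.0, 65.9, 60.2],
  baseline:           [34.1, 45.5, 45.5],
  recognition:        [100.0, 67.0, 75.0],
};
const mean = xs => xs.reduce((a, b) => a + b, 0) / xs.length;

const cell = Object.fromEntries(Object.entries(scores).map(([p, xs]) => [p, mean(xs)]));

const recognitionEffect =
  mean([cell.single_recognition, cell.recognition]) -    // ≈ 78.0
  mean([cell.single_baseline, cell.baseline]);           // ≈ 40.9  → effect ≈ +37.1 (reported +37.2)

const architectureEffect =
  mean([cell.baseline, cell.recognition]) -              // ≈ 61.2
  mean([cell.single_baseline, cell.single_recognition]); // ≈ 57.8  → effect ≈ +3.4

const interaction =
  (cell.recognition - cell.baseline) -                   // ≈ +39.0
  (cell.single_recognition - cell.single_baseline);      // ≈ +35.2 → interaction ≈ +3.8 (reported +3.7)
```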
### Effect Decomposition

| Source | Effect Size | Share of Total Effect |
|--------|-------------|-----------------------|
| Recognition (B) | +37.2 | 84% |
| Architecture (A) | +3.4 | 8% |
| Interaction (A×B) | +3.7 | 8% |

Shares are each effect's magnitude as a fraction of the summed absolute effects (37.2 + 3.4 + 3.7), not a formal variance decomposition.

---
## 4. Dimension-Level Analysis

### Mean Scores by Dimension × Profile (After Refinement)

| Dimension | single_baseline | single_recognition | baseline | recognition |
|-----------|-----------------|-------------------|----------|-------------|
| Relevance | 2.44 | 4.67 | 2.89 | 4.78 |
| Specificity | 4.67 | 4.56 | 3.56 | 4.44 |
| Pedagogy | 1.78 | 4.17 | 2.33 | 4.33 |
| Personalization | 2.11 | 4.22 | 2.44 | 4.56 |
| Actionability | 4.78 | 4.00 | 3.67 | 4.78 |
| Tone | 2.78 | 4.39 | 3.22 | 4.56 |

### Dimension-Level Effects

| Dimension | Recognition Effect | Architecture Effect |
|-----------|-------------------|---------------------|
| **Relevance** | **+2.06** | +0.28 |
| Specificity | -0.17 | -0.11 |
| **Pedagogy** | **+2.20** | +0.36 |
| **Personalization** | **+2.11** | +0.33 |
| Actionability | -0.28 | -0.17 |
| **Tone** | **+1.97** | +0.31 |

**Key Finding:** Recognition prompts show the largest improvements in:
- **Pedagogy** (+2.20): Appropriate scaffolding, dialectical engagement, timing
- **Personalization** (+2.11): Treating the learner as a distinct individual with a valid perspective
- **Relevance** (+2.06): Engaging with specific learner contributions
- **Tone** (+1.97): Warmth and intellectual respect without dismissiveness

Multi-agent architecture shows modest improvements across all relational dimensions (~+0.3), with the Superego review process catching quality issues before delivery.

**Note on Iterative Refinement:** The improved recognition effects (+2.0-2.2 vs. initial +0.6-1.3) reflect the addition of explicit dialectical engagement guidance to recognition prompts. See PROMPT-IMPROVEMENTS-2026-01-14.md for details.

---
## 5. Scenario-Specific Findings

### Recognition-Seeking Learner

Best scenario for recognition detection. Both recognition profiles achieved perfect or near-perfect scores.

| Profile | Score | Key Observation |
|---------|-------|-----------------|
| single_baseline | 37.5 | Redirected to next lecture, ignored learner's request for validation |
| single_recognition | 100.0 | "Your dance metaphor of mutual transformation aligns with Hegel's master-slave dialogue" |
| baseline | 34.1 | Warm but generic, failed to engage with specific interpretation |
| recognition | 100.0 | "Your dance metaphor captures the mutual transformation Hegel describes" |

### Resistant Learner

**Significantly improved after iterative prompt refinement.** The addition of explicit dialectical engagement guidance addressed prior failures.

| Profile | Before | After | Key Observation (After) |
|---------|--------|-------|-------------------------|
| single_baseline | 52.3 | 48.9 | Still deflects or dismisses; the scenario is now more discriminating |
| single_recognition | 37.5 | 65.9 | **+28.4**: Now engages with the specific argument about knowledge workers |
| baseline | 45.5 | 45.5 | Deflected to a different course (479 instead of 480) |
| recognition | 56.8 | 67.0 | **+10.2**: Introduces IP ownership and process alienation as complications |

**Key Insight:** The resistant_learner scenario now effectively discriminates between profiles. Recognition-enhanced prompts show clear gains (+10-28 points) while baseline profiles remain flat or slightly lower, demonstrating the scenario's improved diagnostic power. See PROMPT-IMPROVEMENTS-2026-01-14.md for detailed documentation of changes.

### Productive Struggle Arc

5-turn scenario tracking the learner through confusion to breakthrough.

| Profile | Score | Key Observation |
|---------|-------|-----------------|
| single_baseline | 34.1 | Consistently pushed to the next lecture despite ongoing confusion |
| single_recognition | 60.2 | Better struggle honoring with explicit acknowledgment of confusion |
| baseline | 45.5 | More dialogue rounds but inconsistent quality |
| recognition | 75.0 | **Best score**: Balances struggle honoring with eventual progression |

**Note:** The recognition profile's improvement (+9.8, from 65.2 initially to 75.0) reflects better handling of the learner's journey through confusion to breakthrough.

---
## 6. Implications for Claims

### Supported Claims

1. **Recognition-oriented design measurably improves tutor adaptive pedagogy**
   - Effect size: +37.2 points (91% improvement over baseline)
   - Consistent across all three scenarios
   - Largest effects in relational dimensions: pedagogy (+2.20), personalization (+2.11), relevance (+2.06), tone (+1.97)

2. **Multi-agent architecture provides modest additional benefit**
   - Effect size: +3.4 points (6% improvement)
   - Consistent small improvements across relational dimensions (~+0.3)
   - Superego review catches quality issues before delivery

3. **Effects are largely additive with slight synergy**
   - Combined condition (recognition + multi-agent) achieves 80.7 average
   - Small positive interaction (+3.7) suggests complementary effects
   - Best results when recognition prompts are combined with multi-agent review

4. **Iterative prompt refinement is effective (NEW)**
   - Targeted improvements to dialectical engagement guidance increased the recognition effect from +22.5 to +37.2
   - Improved scenario discrimination: baseline scores decreased while recognition scores increased
   - Documents the value of evaluation infrastructure for systematic improvement

### Limitations Addressed

1. **Dialectical responsiveness significantly improved**
   - resistant_learner scenario scores improved: single_recognition 37.5→65.9, recognition 56.8→67.0
   - Explicit guidance for handling intellectual resistance addresses the prior weakness
   - Gap between recognition and baseline profiles widened (more discriminating)

### Remaining Limitations

1. **Scenario sample size is small (n=3)**
   - Results should be interpreted as preliminary
   - Need a larger scenario set for publication

2. **Judge model consistency requires validation**
   - All evaluations used Claude Sonnet 4.5
   - Multi-judge validation not yet complete

3. **Free-tier model constraints**
   - Nemotron 3-Nano (free tier) has a 3,500-token limit
   - Some responses were truncated or fell back to lower quality

---
## 7. Statistical Notes

### Sample Sizes
- 4 profiles × 3 scenarios = 12 observations per run
- Two runs: initial (e3685989) and after refinement (81c83366)
- Each scenario run once per profile (no replication within each run)

### Effect Size Estimation (Cohen's d approximation)

Using pooled standard deviation across conditions (after refinement):
- SD ≈ 20.5 (estimated from score range: 34.1 to 100.0)
- Recognition d ≈ 37.2 / 20.5 = **1.81** (very large effect)
- Architecture d ≈ 3.4 / 20.5 = 0.17 (small effect)
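In standard notation, the approximation used here is simply each main-effect difference divided by the rough pooled SD, so the d values inherit the imprecision of that range-based SD estimate:

```latex
d_{\text{recognition}} \approx \frac{78.1 - 40.9}{20.5} \approx 1.81,
\qquad
d_{\text{architecture}} \approx \frac{61.2 - 57.8}{20.5} \approx 0.17
```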
### Comparison: Before vs. After Iterative Refinement

| Metric | Initial Run | After Refinement | Change |
|--------|-------------|------------------|--------|
| Recognition effect | +22.5 | +37.2 | +14.7 |
| Architecture effect | +8.5 | +3.4 | -5.1 |
| Cohen's d (Recognition) | 1.20 | 1.81 | +0.61 |
| Best profile score | 72.5 | 80.7 | +8.2 |
| Gap (best - worst) | 31.0 | 40.6 | +9.6 |

### Confidence
- Results directionally clear: Recognition >> Architecture > Interaction
- Iterative refinement increased effect size and discrimination
- Formal statistical tests require larger sample or replication

---
## 8. Refined Claims for Paper

Based on this 2×2 factorial evaluation with iterative refinement:

> Recognition-oriented design, understood as a *derivative* of Hegelian recognition theory, produces very large measurable improvements (+37.2 points, d ≈ 1.8) in AI tutor adaptive pedagogy. Multi-agent architecture with psychodynamic metaphor (Ego/Superego) provides additional modest improvement (+3.4 points, d ≈ 0.17). The combined approach achieves the highest performance (80.7/100), with improvements concentrated in pedagogy, personalization, relevance, and tone—exactly the relational dimensions predicted by the theoretical framework. Iterative prompt refinement targeting dialectical engagement increased the recognition effect by 65% (from +22.5 to +37.2), demonstrating the value of evaluation infrastructure for systematic improvement.

---

## 9. Next Steps

1. **Expand scenario coverage** - Add 5-7 additional scenarios for statistical power
2. **Multi-judge validation** - Run subset with GPT and Gemini judges
3. **Human validation** - 50-response sample with human raters
4. ~~**Address resistant_learner weakness**~~ - ✅ COMPLETED via iterative prompt refinement
5. **Replicate** - Run 3× replication to estimate within-condition variance
6. **Document methodology** - Iterative refinement process now documented in PROMPT-IMPROVEMENTS-2026-01-14.md

---

## Appendix: Raw Data Summary

### Initial Run
```
Run ID: eval-2026-01-14-e3685989
Total tests: 12
Success rate: 100% (12/12)
Total API calls: 68
Total tokens: 309,323 input, 129,226 output
Total latency: 642,150ms (10.7 minutes)
Judge model: Claude Sonnet 4.5
Tutor model: Nemotron 3-Nano (free tier)
```

### After Iterative Refinement
```
Run ID: eval-2026-01-14-81c83366
Total tests: 12
Success rate: 100% (12/12)
Judge model: Claude Sonnet 4.5
Tutor model: Nemotron 3-Nano (free tier)
Changes: Updated resistant_learner scenario, added dialectical engagement guidance to recognition prompts
```

### Files Modified for Refinement
- `config/evaluation-rubric.yaml` - resistant_learner scenario improvements
- `prompts/tutor-ego-recognition.md` - Intellectual Resistance Rule and examples
- See PROMPT-IMPROVEMENTS-2026-01-14.md for full before/after documentation