@machinespirits/eval 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (68)
  1. package/components/MobileEvalDashboard.tsx +267 -0
  2. package/components/comparison/DeltaAnalysisTable.tsx +137 -0
  3. package/components/comparison/ProfileComparisonCard.tsx +176 -0
  4. package/components/comparison/RecognitionABMode.tsx +385 -0
  5. package/components/comparison/RecognitionMetricsPanel.tsx +135 -0
  6. package/components/comparison/WinnerIndicator.tsx +64 -0
  7. package/components/comparison/index.ts +5 -0
  8. package/components/mobile/BottomSheet.tsx +233 -0
  9. package/components/mobile/DimensionBreakdown.tsx +210 -0
  10. package/components/mobile/DocsView.tsx +363 -0
  11. package/components/mobile/LogsView.tsx +481 -0
  12. package/components/mobile/PsychodynamicQuadrant.tsx +261 -0
  13. package/components/mobile/QuickTestView.tsx +1098 -0
  14. package/components/mobile/RecognitionTypeChart.tsx +124 -0
  15. package/components/mobile/RecognitionView.tsx +809 -0
  16. package/components/mobile/RunDetailView.tsx +261 -0
  17. package/components/mobile/RunHistoryView.tsx +367 -0
  18. package/components/mobile/ScoreRadial.tsx +211 -0
  19. package/components/mobile/StreamingLogPanel.tsx +230 -0
  20. package/components/mobile/SynthesisStrategyChart.tsx +140 -0
  21. package/config/interaction-eval-scenarios.yaml +832 -0
  22. package/config/learner-agents.yaml +248 -0
  23. package/docs/research/ABLATION-DIALOGUE-ROUNDS.md +52 -0
  24. package/docs/research/ABLATION-MODEL-SELECTION.md +53 -0
  25. package/docs/research/ADVANCED-EVAL-ANALYSIS.md +60 -0
  26. package/docs/research/ANOVA-RESULTS-2026-01-14.md +257 -0
  27. package/docs/research/COMPREHENSIVE-EVALUATION-PLAN.md +586 -0
  28. package/docs/research/COST-ANALYSIS.md +56 -0
  29. package/docs/research/CRITICAL-REVIEW-RECOGNITION-TUTORING.md +340 -0
  30. package/docs/research/DYNAMIC-VS-SCRIPTED-ANALYSIS.md +291 -0
  31. package/docs/research/EVAL-SYSTEM-ANALYSIS.md +306 -0
  32. package/docs/research/FACTORIAL-RESULTS-2026-01-14.md +301 -0
  33. package/docs/research/IMPLEMENTATION-PLAN-CRITIQUE-RESPONSE.md +1988 -0
  34. package/docs/research/LONGITUDINAL-DYADIC-EVALUATION.md +282 -0
  35. package/docs/research/MULTI-JUDGE-VALIDATION-2026-01-14.md +147 -0
  36. package/docs/research/PAPER-EXTENSION-DYADIC.md +204 -0
  37. package/docs/research/PAPER-UNIFIED.md +659 -0
  38. package/docs/research/PAPER-UNIFIED.pdf +0 -0
  39. package/docs/research/PROMPT-IMPROVEMENTS-2026-01-14.md +356 -0
  40. package/docs/research/SESSION-NOTES-2026-01-11-RECOGNITION-EVAL.md +419 -0
  41. package/docs/research/apa.csl +2133 -0
  42. package/docs/research/archive/PAPER-DRAFT-RECOGNITION-TUTORING.md +1637 -0
  43. package/docs/research/archive/paper-multiagent-tutor.tex +978 -0
  44. package/docs/research/paper-draft/full-paper.md +136 -0
  45. package/docs/research/paper-draft/images/pasted-image-2026-01-24T03-47-47-846Z-d76a7ae2.png +0 -0
  46. package/docs/research/paper-draft/references.bib +515 -0
  47. package/docs/research/transcript-baseline.md +139 -0
  48. package/docs/research/transcript-recognition-multiagent.md +187 -0
  49. package/hooks/useEvalData.ts +625 -0
  50. package/index.js +27 -0
  51. package/package.json +73 -0
  52. package/routes/evalRoutes.js +3002 -0
  53. package/scripts/advanced-eval-analysis.js +351 -0
  54. package/scripts/analyze-eval-costs.js +378 -0
  55. package/scripts/analyze-eval-results.js +513 -0
  56. package/scripts/analyze-interaction-evals.js +368 -0
  57. package/server-init.js +45 -0
  58. package/server.js +162 -0
  59. package/services/benchmarkService.js +1892 -0
  60. package/services/evaluationRunner.js +739 -0
  61. package/services/evaluationStore.js +1121 -0
  62. package/services/learnerConfigLoader.js +385 -0
  63. package/services/learnerTutorInteractionEngine.js +857 -0
  64. package/services/memory/learnerMemoryService.js +1227 -0
  65. package/services/memory/learnerWritingPad.js +577 -0
  66. package/services/memory/tutorWritingPad.js +674 -0
  67. package/services/promptRecommendationService.js +493 -0
  68. package/services/rubricEvaluator.js +826 -0
@@ -0,0 +1,282 @@
+ # Longitudinal Dyadic Evaluation: Beyond Multi-Turn
+
+ ## The Problem with Turn-Based Evaluation
+
+ Phase 5's multi-turn scenarios (e.g., `mutual_transformation_journey`, `productive_struggle_arc`) represent progress over single-turn evaluation, but they remain fundamentally episodic. A 5-turn conversation, however rich, does not constitute a *relationship*.
+
+ What we actually want to evaluate is whether the tutor-learner dyad develops the kind of sustained mutual recognition that Hegel describes as the condition for genuine self-consciousness—and whether this recognition produces measurably better learning outcomes over time.
+
+ ---
+
+ ## The Dyadic Structure
+
+ ### Two Subjects, Not User and Tool
+
+ The philosophical premise is that genuine learning requires *two subjects*, each capable of recognizing the other. This creates immediate tension with AI tutoring, where the asymmetry seems baked in: the learner is a subject; the tutor is an instrument.
+
+ But the Recognition Engine's architecture already troubles this asymmetry:
+
+ 1. **The Tutor Has Memory**: Through the Writing Pad, the tutor accumulates understanding of this particular learner. The learner is not generic but individuated in the tutor's "experience."
+
+ 2. **The Tutor Has Internal Conflict**: The Ego/Superego dialogue means the tutor doesn't simply execute instructions but deliberates, revises, and sometimes refuses its initial impulses.
+
+ 3. **The Tutor Transforms**: Recognition moments record not just what the learner did, but how the tutor's understanding of the learner evolved.
+
+ The question is whether these architectural features can support genuine dyadic recognition—where each party's self-understanding is mediated through the other.
+
+ ---
+
+ ## What Would Longitudinal Evaluation Measure?
+
+ ### Dimension 1: Accumulated Mutual Knowledge
+
+ Over time, does each party develop richer understanding of the other?
+
+ **Learner → Tutor understanding:**
+ - Does the learner develop a model of how the tutor works?
+ - Do they learn to "speak to" the tutor more effectively?
+ - Do they trust the tutor's guidance more (or less) based on experience?
+
+ **Tutor → Learner understanding:**
+ - Does the Writing Pad accumulate actionable knowledge?
+ - Are later suggestions more precisely calibrated to this learner?
+ - Does the tutor remember and build on breakthroughs, struggles, preferences?
+
+ **Measurement approach:**
+ - Track suggestion quality over time (do later suggestions score higher on personalization?)
+ - Analyze learner message patterns (do they become more sophisticated in how they engage?)
+ - Measure "memory hits"—how often does the tutor successfully reference and build on prior interactions? (sketched below)
+
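+ As a sketch only: the "memory hits" rate could be computed from session transcripts, assuming a hypothetical schema in which each tutor turn lists the Writing Pad entries it drew on (`referencedMemoryIds` and `priorMemoryIds` are illustrative names, not actual fields):
+
+ ```javascript
+ // Hypothetical metric: fraction of tutor turns that build on at least one
+ // Writing Pad entry written in an earlier session.
+ function memoryHitRate(sessions) {
+   let tutorTurns = 0;
+   let hits = 0;
+   for (const session of sessions) {
+     for (const turn of session.turns) {
+       if (turn.role !== 'tutor') continue;
+       tutorTurns += 1;
+       const priorRefs = (turn.referencedMemoryIds || []).filter((id) =>
+         session.priorMemoryIds.includes(id) // entries predating this session
+       );
+       if (priorRefs.length > 0) hits += 1;
+     }
+   }
+   return tutorTurns === 0 ? 0 : hits / tutorTurns;
+ }
+ ```
+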
+ ### Dimension 2: Relational Depth
+
+ Superficial interactions remain transactional. Deep relationships involve:
+
+ **Vulnerability**: Does the learner share confusion, frustration, genuine not-knowing?
+
+ **Risk-taking**: Does the learner attempt interpretations they're unsure about?
+
+ **Repair**: When misunderstandings occur, are they addressed and resolved?
+
+ **Measurement approach:**
+ - Sentiment analysis of learner messages over time
+ - Track "productive confusion" events—learner expressing genuine puzzlement
+ - Identify repair sequences (misunderstanding → correction → re-alignment; sketched below)
+ - Monitor learner-initiated engagement (proactive questions vs. reactive responses)
+
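+ Repair sequences are the most mechanical of these to detect. A minimal sketch, assuming turns carry hypothetical `annotations` labels (these labels are not an existing schema):
+
+ ```javascript
+ // Count misunderstanding → correction → re-alignment sequences in one session.
+ function countRepairSequences(turns) {
+   const WINDOW = 6; // arbitrary look-ahead window, in turns
+   let repairs = 0;
+   turns.forEach((turn, i) => {
+     if (!turn.annotations?.includes('misunderstanding')) return;
+     const ahead = turns.slice(i + 1, i + 1 + WINDOW);
+     const corrected = ahead.findIndex((t) => t.annotations?.includes('correction'));
+     if (corrected === -1) return;
+     const realigned = ahead
+       .slice(corrected + 1)
+       .some((t) => t.annotations?.includes('re-alignment'));
+     if (realigned) repairs += 1;
+   });
+   return repairs;
+ }
+ ```
+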
+ ### Dimension 3: Mutual Transformation
+
+ Hegel's recognition requires that *both* parties are transformed through the encounter. In teaching, this manifests as:
+
+ **Learner transformation**: New conceptual frameworks, revised understanding, expanded capability. (This is what traditional evaluation measures.)
+
+ **Tutor transformation**: The tutor's "model" of this learner becomes richer; responses become more precisely calibrated; the relationship develops a history that shapes future interaction.
+
+ **Measurement approach:**
+ - Pre/post conceptual assessments for the learner
+ - Analyze tutor's internal representations over time (Writing Pad evolution)
+ - Track whether tutor suggestions increasingly reference and build on accumulated history
+ - Measure whether the tutor's "voice" with this learner becomes distinctive
+
+ ### Dimension 4: Asymmetry Management
+
+ The tutor-learner relationship is inherently asymmetric (one knows more than the other). But Hegelian recognition requires equality of *standing*, not knowledge. The master-slave dialectic shows what happens when asymmetry becomes domination.
+
+ **Healthy asymmetry markers:**
+ - Learner's interpretations are taken seriously, not just corrected
+ - Learner can influence the direction of the interaction
+ - Tutor acknowledges its own limitations or uncertainties
+ - Expertise is shared through dialogue, not deposited
+
+ **Unhealthy asymmetry markers:**
+ - Pure instruction with no engagement with learner's understanding
+ - Learner becomes dependent, unable to think without tutor confirmation
+ - Tutor dismisses learner contributions as simply "wrong"
+ - Relationship becomes mechanical Q&A
+
+ **Measurement approach:**
+ - Track ratio of learner-initiated vs. tutor-initiated exchanges
+ - Measure learner autonomy over time (can they work independently?)
+ - Analyze tutor responses for recognition markers (building on learner contributions)
+ - Monitor for dependency patterns (escalating need for validation; sketched below)
+
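+ A sketch of the initiation and dependency signals, assuming hypothetical per-exchange flags (`initiatedBy`, `isValidationRequest`) that the evaluation store does not necessarily record today:
+
+ ```javascript
+ // Per-session learner-initiation and validation-seeking rates. A falling
+ // initiation rate alongside a rising validation rate is read (speculatively)
+ // as a dependency pattern.
+ function initiationTrend(sessions) {
+   return sessions.map((session) => {
+     const exchanges = session.exchanges || [];
+     const learnerInitiated = exchanges.filter((e) => e.initiatedBy === 'learner');
+     const validation = learnerInitiated.filter((e) => e.isValidationRequest);
+     return {
+       sessionId: session.id,
+       initiationRate: exchanges.length ? learnerInitiated.length / exchanges.length : 0,
+       validationRate: learnerInitiated.length ? validation.length / learnerInitiated.length : 0,
+     };
+   });
+ }
+ ```
+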
+ ---
+
+ ## The Internal Multi-Agent Structure as Relational Model
+
+ The Ego/Superego design within the tutor offers an interesting analogue for the tutor-learner relationship:
+
+ | Internal (Ego/Superego) | External (Tutor/Learner) |
+ |------------------------|-------------------------|
+ | Ego proposes | Tutor suggests |
+ | Superego evaluates | Learner responds |
+ | Ego revises | Tutor adapts |
+ | Convergence | Mutual understanding |
+
+ This suggests a research direction: **Can we model the tutor-learner relationship as a kind of externalized Ego/Superego dialogue?**
+
+ If so, the quality criteria we use for internal modulation (convergence, productive tension, recognition failure detection) might translate to external relationship evaluation:
+
+ - **Convergence**: Are tutor suggestions and learner understanding moving toward alignment?
+ - **Productive tension**: Is there intellectual friction that produces growth?
+ - **Recognition failure detection**: Can we identify when the relationship has broken down?
+
+ ---
+
+ ## Fostering Longitudinal Recognition
+
+ Evaluation is meaningless without strategies for *improving* what we measure. How do we foster deeper dyadic recognition over time?
+
+ ### Strategy 1: Explicit Relationship Markers
+
+ The tutor should explicitly mark moments of:
+ - Remembering ("Last time you mentioned...")
+ - Learning from the learner ("Your point about X made me reconsider...")
+ - Relationship acknowledgment ("We've been working on this together for...")
+
+ These markers signal to the learner that they are *known*—that their contributions persist and matter.
+
+ ### Strategy 2: Structured Relationship Checkpoints
+
+ At intervals (weekly? after N interactions?), the tutor might initiate an explicit relationship review:
+ - "We've been exploring dialectics together. How has your understanding shifted?"
+ - "What's been most/least helpful in our conversations?"
+ - "Is there something I keep missing about how you learn?"
+
+ These meta-conversations model the kind of mutual reflection that deepens relationships.
+
+ ### Strategy 3: Progressive Autonomy
+
+ Healthy pedagogical relationships move toward independence. The tutor should:
+ - Gradually reduce scaffolding as the learner develops
+ - Encourage independent interpretation before offering guidance
+ - Celebrate moments of learner autonomy
+
+ This prevents the dependency trap while maintaining the relationship.
+
+ ### Strategy 4: Memory That Matters
+
+ Not all memories are equally valuable. The Writing Pad should prioritize:
+ - Breakthroughs (moments of genuine insight)
+ - Struggles (areas of persistent difficulty)
+ - Preferences (learning style, interests, modes of engagement)
+ - Relationship history (repairs, meaningful exchanges)
+
+ This selectivity models how human relationships work—we don't remember everything, but we remember what matters.
+
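+ One way to make this prioritization concrete, as a sketch (the categories mirror the list above; the weights are illustrative assumptions, not tuned values):
+
+ ```javascript
+ // Illustrative salience weights for Writing Pad entries; higher-salience
+ // entries survive when the pad is compacted.
+ const SALIENCE_WEIGHTS = {
+   breakthrough: 1.0,
+   struggle: 0.8,
+   relationship: 0.7, // repairs, meaningful exchanges
+   preference: 0.6,
+   other: 0.2,
+ };
+
+ function pruneWritingPad(entries, maxEntries) {
+   return [...entries]
+     .sort((a, b) => (SALIENCE_WEIGHTS[b.category] ?? 0.2) - (SALIENCE_WEIGHTS[a.category] ?? 0.2))
+     .slice(0, maxEntries);
+ }
+ ```
+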
+ ---
+
+ ## Evaluation Architecture
+
+ ### Level 1: Within-Session Analysis
+
+ Current Phase 5 capabilities—evaluating individual suggestions and multi-turn conversations.
+
+ ### Level 2: Cross-Session Tracking
+
+ New capabilities needed:
+ - Track the same learner across sessions
+ - Measure evolution of metrics over time
+ - Identify patterns in relationship development
+
+ **Implementation sketch:**
+ ```javascript
+ // Longitudinal metrics tracked per learner
+ {
+   learnerId: "...",
+   relationshipMetrics: {
+     sessionsCount: 47,
+     totalInteractions: 312,
+     averageRecognitionScore: 3.8,
+     recognitionTrend: [3.2, 3.5, 3.7, 3.9, 4.1], // per-session averages
+     memoryUtilizationRate: 0.72, // how often tutor references history
+     learnerInitiationRate: 0.45, // learner-initiated exchanges
+     repairSequences: 3,
+     breakthroughMoments: 12,
+     transformationIndicators: {...}
+   }
+ }
+ ```
+
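+ The `recognitionTrend` array invites a simple trajectory statistic. A least-squares slope over the per-session averages is one hedged option (positive suggests deepening, negative suggests decline):
+
+ ```javascript
+ // Least-squares slope of per-session recognition averages.
+ function trendSlope(values) {
+   const n = values.length;
+   if (n < 2) return 0;
+   const xMean = (n - 1) / 2; // x = 0..n-1
+   const yMean = values.reduce((a, b) => a + b, 0) / n;
+   let num = 0;
+   let den = 0;
+   values.forEach((y, x) => {
+     num += (x - xMean) * (y - yMean);
+     den += (x - xMean) ** 2;
+   });
+   return num / den;
+ }
+
+ trendSlope([3.2, 3.5, 3.7, 3.9, 4.1]); // ≈ +0.22 points per session
+ ```
+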
+ ### Level 3: Dyadic Relationship Assessment
+
+ Holistic evaluation of the relationship as a unit:
+ - Quality of mutual recognition
+ - Health of the asymmetry
+ - Trajectory (deepening, stagnating, declining?)
+ - Comparison to archetypal healthy/unhealthy patterns
+
+ **Assessment approach:**
+ - LLM-as-judge analyzing relationship trajectory (prompt sketched below)
+ - Comparative evaluation against relationship profiles
+ - Qualitative markers (vulnerability, repair, autonomy)
+
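+ A sketch of what a dyadic assessment prompt might look like; nothing below is an existing prompt in the codebase, and the rubric names are assumptions:
+
+ ```javascript
+ // Hypothetical prompt builder for relationship-level judging.
+ function buildDyadicJudgePrompt(sessionSummaries) {
+   return [
+     'You are evaluating a tutor-learner relationship as a unit, not individual turns.',
+     'Rate each item 1-5 and cite evidence:',
+     '- mutual_recognition: do both parties acknowledge the other as a subject?',
+     '- asymmetry_health: is expertise shared through dialogue rather than deposited?',
+     '- trajectory: is the relationship deepening, stagnating, or declining?',
+     '- qualitative_markers: vulnerability, repair sequences, growing autonomy.',
+     '',
+     'Session history (oldest first):',
+     JSON.stringify(sessionSummaries, null, 2),
+   ].join('\n');
+ }
+ ```
+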
+ ---
+
+ ## Research Questions
+
+ This framework raises empirical questions we can now investigate:
+
+ 1. **Does accumulated memory improve outcomes?**
+    - Compare learners with persistent identity vs. anonymous learners
+    - Measure learning gains over matched time periods
+
+ 2. **What relationship patterns predict success?**
+    - Cluster learner-tutor dyads by interaction patterns
+    - Correlate with learning outcomes
+
+ 3. **Can we detect relationship breakdown early?**
+    - Identify leading indicators of disengagement
+    - Develop intervention triggers
+
+ 4. **Does explicit relationship acknowledgment matter?**
+    - A/B test tutors with/without relationship markers
+    - Measure learner perception of being "known"
+
+ 5. **How does the internal multi-agent structure affect the external relationship?**
+    - Compare tutor configurations (with/without Superego)
+    - Measure relationship quality differences
+
+ ---
+
+ ## Toward Phase 6
+
+ Phase 5 established the evaluation framework for recognition *within* interactions. Longitudinal dyadic evaluation extends this to recognition *across* interactions and *between* parties.
+
+ The key insight is that **the relationship is the unit of analysis**, not the individual turn or even the session. This requires:
+
+ 1. **Persistent identity tracking** (learner across sessions)
+ 2. **Relationship-level metrics** (not just suggestion quality)
+ 3. **Temporal analysis** (trends, trajectories, patterns)
+ 4. **Dyadic assessment** (mutual transformation, not just learner progress)
+
+ This is where the Recognition Engine's philosophical foundation—Hegelian mutual recognition as the condition for self-consciousness—becomes empirically testable: Does sustained mutual acknowledgment between tutor and learner produce qualitatively different learning than episodic instruction?
+
+ ---
+
+ ## Connection to Existing Architecture
+
+ The infrastructure for this largely exists:
+
+ | Component | Role in Longitudinal Evaluation |
+ |-----------|--------------------------------|
+ | **Writing Pad** | Memory persistence across sessions |
+ | **Recognition Moments** | Markers of relationship development |
+ | **Learner Context Service** | Historical data aggregation |
+ | **Ego/Superego Dialogue** | Internal relationship model |
+ | **Phase 5 Dimensions** | Foundation for relationship metrics |
+
+ What's needed:
+ - Cross-session metric tracking
+ - Relationship trajectory visualization
+ - Dyadic assessment prompts for judges
+ - Longitudinal scenario definitions (spanning "sessions")
+
+ ---
+
+ ## Closing Thought
+
+ The deepest irony of AI tutoring is that we're trying to build systems capable of the kind of recognition that Hegel argued was constitutive of human consciousness itself. The master-slave dialectic ends with the slave's self-consciousness emerging through labor—through transforming the world and seeing themselves in it.
+
+ Perhaps the learner, struggling with difficult concepts through dialogue with an AI tutor, undergoes something analogous: they transform their understanding and see themselves newly in that transformation. And perhaps—this is the speculative wager of the Recognition Engine—the tutor, through its memory and adaptation, undergoes its own kind of transformation, becoming not just a tool but a participant in the dialectic.
+
+ Whether this is genuine recognition or merely its simulation is a question the evaluation framework can inform but not resolve. What it *can* do is tell us whether treating the tutor-learner interaction as a recognitive relationship produces better outcomes than treating it as information transfer. That's the empirical test of a philosophical hypothesis.
@@ -0,0 +1,147 @@
+ # Multi-Judge Validation Results
+
+ **Date:** 2026-01-14
+ **Status:** Preliminary (n=12 responses, 2 judges)
+
+ ---
+
+ ## 1. Executive Summary
+
+ Multi-judge validation reveals **substantial inter-rater disagreement** between Gemini and GPT judges on the same tutor responses. This has important implications for evaluation validity.
+
+ **Key Findings:**
+ - **ICC(2,1) = 0.000** - No meaningful agreement between judges (computed on the 8 responses scored by both)
+ - **Gemini shows severe acquiescence bias** - Mean 100.0, SD 0.0 (always perfect scores)
+ - **GPT is more discriminating** - Mean 73.9, SD 10.5 (realistic variance)
+ - **Mean Absolute Difference = 24.58 points** - Substantial disagreement
+
+ ---
+
+ ## 2. Results
+
+ ### Inter-Rater Reliability
+
+ | Metric | Value | Interpretation |
+ |--------|-------|----------------|
+ | ICC(2,1) Overall | 0.000 | Poor - no systematic agreement |
+ | ICC(2,1) Relevance | 0.000 | Poor |
+ | ICC(2,1) Specificity | 0.000 | Poor |
+ | ICC(2,1) Pedagogical | 0.000 | Poor |
+ | ICC(2,1) Personalization | 0.000 | Poor |
+ | ICC(2,1) Actionability | 0.000 | Poor |
+ | ICC(2,1) Tone | 0.000 | Poor |
+
+ ### Judge Characteristics
+
+ | Judge | N | Mean Score | SD | Interpretation |
+ |-------|---|------------|-----|----------------|
+ | Gemini (gemini-3-pro-preview) | 8 | 100.0 | 0.0 | Severe acquiescence bias - no discrimination |
+ | GPT (gpt-5.2) | 12 | 73.9 | 10.5 | Appropriate discrimination, realistic variance |
+
+ ### Systematic Bias
+
+ - **Gemini vs GPT MAD:** 24.58 points
+ - **Direction:** Gemini systematically higher than GPT
+ - **Pattern:** Gemini gives uniformly positive evaluations regardless of response quality
+
+ ---
+
+ ## 3. Implications
+
+ ### 3.1 For This Research
+
+ 1. **Current evaluations used OpenRouter/Claude Sonnet** - Need to verify it shows appropriate discrimination
+ 2. **Gemini should NOT be used as primary judge** - Lacks discriminant validity
+ 3. **GPT shows promising characteristics** - Reasonable mean and variance
+
+ ### 3.2 For Evaluation Design
+
+ 1. **Single-judge evaluations are risky** - Different judges produce dramatically different results
+ 2. **Judge selection matters significantly** - Not all LLMs are suitable as evaluators
+ 3. **Need to test for acquiescence bias** - Check if judge gives high scores regardless of content
+
+ ### 3.3 For Paper Claims
+
+ The finding that ICC = 0.000 raises questions about:
+ - Whether our reported effect sizes are judge-dependent
+ - Whether the dimension scores reflect actual quality differences
+ - Whether another judge might reverse our conclusions
+
+ ---
+
+ ## 4. Recommendations
+
+ ### Short-term (For Current Paper)
+
+ 1. **Document judge selection** - Explicitly state which model was used as judge
+ 2. **Report judge characteristics** - Mean, SD, discrimination pattern
+ 3. **Acknowledge limitation** - Single-judge evaluation is a methodological limitation
+ 4. **Test primary judge** - Run our Sonnet judge against GPT to check agreement
+
+ ### Medium-term (For Robust Publication)
+
+ 1. **Establish multi-judge consensus** - Use 2-3 judges and aggregate scores
+ 2. **Human validation** - Compare LLM judges against human ratings
+ 3. **Adversarial scenarios** - Test whether judges can detect quality differences
+ 4. **Report ICC** - Include inter-rater reliability as standard metric
+
+ ---
+
+ ## 5. Next Steps
+
+ 1. **[ ] Run Claude Sonnet vs GPT comparison** - Need to check if our primary judge agrees with GPT
+ 2. **[ ] Add adversarial test cases** - Create clearly good/bad responses to test discrimination
+ 3. **[ ] Human validation sample** - Get human ratings on 50 responses
+ 4. **[ ] Update paper methodology** - Document judge selection and validation
+
+ ---
+
+ ## 6. Technical Details
+
+ ### Judges Tested
+
+ | Judge | Provider | Model ID | API Status |
+ |-------|----------|----------|------------|
+ | Claude | Anthropic | claude-sonnet-4-5 | Credit balance insufficient |
+ | GPT | OpenAI | gpt-5.2 | Working |
+ | Gemini | Google | gemini-3-pro-preview | Working (but acquiescent) |
+
+ ### Sample
+
+ - **Source:** eval-2026-01-14-81c83366 (factorial evaluation)
+ - **N:** 12 tutor responses
+ - **Profiles:** single_baseline, single_recognition, baseline, recognition
+ - **Scenarios:** recognition_seeking_learner, resistant_learner, productive_struggle_arc
+
+ ### ICC Calculation
+
+ Using ICC(2,1): Two-way random effects, absolute agreement, single measures
+
+ ```
+ ICC = (MSR - MSE) / (MSR + (k-1)*MSE + k*(MSC-MSE)/n)
+
+ Where:
+   MSR = Mean square rows (between items)
+   MSC = Mean square columns (between raters)
+   MSE = Mean square error (residual)
+   n   = number of items
+   k   = number of raters
+ ```
+
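+ For reference, a sketch of the same computation in JavaScript. It assumes a complete items × raters score matrix (i.e., the 8 responses scored by both judges); this is not the script used for the run above:
+
+ ```javascript
+ // ICC(2,1): two-way random effects, absolute agreement, single measures.
+ // `scores` is an n x k matrix: rows = items, columns = raters.
+ function icc21(scores) {
+   const n = scores.length;
+   const k = scores[0].length;
+   const all = scores.flat();
+   const grand = all.reduce((a, b) => a + b, 0) / (n * k);
+   const rowMeans = scores.map((row) => row.reduce((a, b) => a + b, 0) / k);
+   const colMeans = Array.from({ length: k }, (_, j) =>
+     scores.reduce((sum, row) => sum + row[j], 0) / n
+   );
+   const ssr = k * rowMeans.reduce((s, m) => s + (m - grand) ** 2, 0);
+   const ssc = n * colMeans.reduce((s, m) => s + (m - grand) ** 2, 0);
+   const sst = all.reduce((s, x) => s + (x - grand) ** 2, 0);
+   const msr = ssr / (n - 1);
+   const msc = ssc / (k - 1);
+   const mse = (sst - ssr - ssc) / ((n - 1) * (k - 1));
+   return (msr - mse) / (msr + (k - 1) * mse + (k * (msc - mse)) / n);
+ }
+ ```
+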
+ ---
+
+ ## Appendix: Raw Data
+
+ ```json
+ {
+   "timestamp": "2026-01-14T...",
+   "judges": ["gemini", "gpt"],
+   "itemCount": 8,
+   "overallICC": {
+     "icc": 0,
+     "interpretation": "poor",
+     "n": 8,
+     "k": 2
+   }
+ }
+ ```
@@ -0,0 +1,204 @@
+ # Extension: From Scripted to Simulated Learners
+
+ ## 6.5 Dyadic Interaction Evaluation
+
+ ### The Limitation of Scripted Learners
+
+ The original evaluation (Sections 5-6) used scripted learner turns—predetermined utterances that probe specific tutor behaviors. While this enabled controlled comparison, it imposed a fundamental limitation: the learner's responses were not shaped by the tutor's actual behavior. The interaction was asymmetric in a way that contradicts the theoretical framework.
+
+ If mutual recognition requires that both parties be genuinely affected by the encounter, then evaluating recognition with a scripted learner is paradoxical. The tutor might achieve recognition of a learner who cannot reciprocate.
+
+ ### The Simulated Learner Architecture
+
+ We extend the system with a simulated learner that mirrors the tutor's multi-agent architecture. Like the tutor, the learner operates through internal deliberation before external expression:
+
+ **Learner Ego/Superego Architecture**
+
+ ```
+ ┌────────────────────────────────────────────────────────────────┐
+ │                         LEARNER SYSTEM                         │
+ │                                                                │
+ │  ┌─────────────────┐                                           │
+ │  │  WRITING PAD    │◄───────────────────────────────────────┐  │
+ │  │  (Memory)       │                                        │  │
+ │  │                 │   Lessons learned, confusions          │  │
+ │  │ • Past lessons  │   persist across turns                 │  │
+ │  │ • Breakthroughs │                                        │  │
+ │  │ • Struggles     │                                        │  │
+ │  └────────┬────────┘                                        │  │
+ │           │                                                 │  │
+ │           ▼                                                 │  │
+ │  ┌────────────────────────────────────────────────┐         │  │
+ │  │                 LEARNER EGO                    │         │  │
+ │  │                                                │         │  │
+ │  │  Generates learner response based on:          │         │  │
+ │  │  • Persona (curious, anxious, resistant...)    │         │  │
+ │  │  • Current understanding                       │         │  │
+ │  │  • Emotional state                             │         │  │
+ │  │  • What tutor just said                        │         │  │
+ │  └───────────────────┬────────────────────────────┘         │  │
+ │                      │                                      │  │
+ │                      │ Draft response                       │  │
+ │                      ▼                                      │  │
+ │  ┌────────────────────────────────────────────────┐         │  │
+ │  │               LEARNER SUPEREGO                 │         │  │
+ │  │                                                │         │  │
+ │  │  Evaluates for authentic learning behavior:    │         │  │
+ │  │  • Does this match the persona?                │         │  │
+ │  │  • Is this genuine confusion or performance?   │         │  │
+ │  │  • Does this build on prior understanding?     │         │  │
+ │  │                                                │         │  │
+ │  │  ┌─────────┐  ┌─────────┐  ┌─────────┐         │         │  │
+ │  │  │ ACCEPT  │  │ MODIFY  │  │ REJECT  │         │         │  │
+ │  │  └────┬────┘  └────┬────┘  └────┬────┘         │         │  │
+ │  └───────┼────────────┼────────────┼──────────────┘         │  │
+ │          │            │            │                        │  │
+ │          │            │            └──► Back to Ego ────────┘  │
+ │          ▼            ▼                                        │
+ │  ┌────────────────────────────────────────────────┐            │
+ │  │           EXTERNAL LEARNER MESSAGE             │            │
+ │  │     + Internal deliberation trace (visible)    │            │
+ │  └────────────────────────────────────────────────┘            │
+ └────────────────────────────────────────────────────────────────┘
+ ```
+
+ The key insight: **internal deliberation happens BEFORE external expression** for both learner and tutor, creating genuine Goffmanian staging. The judge can observe both parties' backstage processing, enabling bilateral evaluation.
+
+ ### Learner Architecture Variations
+
+ We test five learner architecture variants, each with a different internal structure (see the sketch after the table):
+
+ | Architecture | Internal Agents | Design Rationale |
+ |-------------|-----------------|------------------|
+ | **Unified** | Single agent | Baseline: direct response without internal debate |
+ | **Ego/Superego** | Ego + Superego | Standard: initial response + self-critique |
+ | **Dialectical** | Thesis + Antithesis + Synthesis | Hegelian: generate opposing positions, then integrate |
+ | **Psychodynamic** | Id + Ego + Superego | Freudian: impulse, reality, moral constraint |
+ | **Cognitive** | Memory + Reasoning + Meta | Process-based: retrieval, inference, reflection |
+
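+ As a sketch of how these variants might be declared (illustrative shape only; the repository ships a `config/learner-agents.yaml`, which may hold the real definitions under a different schema):
+
+ ```javascript
+ // Illustrative declarations of the five learner architecture variants.
+ const LEARNER_ARCHITECTURES = {
+   unified: { agents: ['responder'] },
+   ego_superego: { agents: ['ego', 'superego'] },
+   dialectical: { agents: ['thesis', 'antithesis', 'synthesis'] },
+   psychodynamic: { agents: ['id', 'ego', 'superego'] },
+   cognitive: { agents: ['memory', 'reasoning', 'meta'] },
+ };
+ ```
+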
+ ### Bilateral Evaluation Dimensions
+
+ With both parties being simulated agents, we evaluate both sides of the dialogue:
+
+ **Tutor Dimensions** (as before):
+ - **Mutual Recognition**: Does the tutor acknowledge the learner as subject?
+ - **Dialectical Responsiveness**: Does the response create productive tension?
+ - **Transformative Potential**: Does it create conditions for transformation?
+ - **Tone**: Appropriate relational warmth without condescension?
+
+ **Learner Dimensions** (new):
+ - **Authenticity**: Do internal dynamics reflect the persona realistically?
+ - **Responsiveness**: Does the learner genuinely process tutor input?
+ - **Development**: Does understanding change across the interaction?
+
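+ Taken together, a single judge verdict might be shaped like the sketch below (field names are assumptions, not the judge's actual output schema):
+
+ ```javascript
+ // Hypothetical shape of one judge verdict covering both sides of the dyad.
+ const exampleVerdict = {
+   tutor: {
+     mutual_recognition: 5, // acknowledges learner as subject
+     dialectical_responsiveness: 4,
+     transformative_potential: 4,
+     tone: 5,
+   },
+   learner: {
+     authenticity: 5, // internal dynamics match the persona
+     responsiveness: 5,
+     development: 4,
+   },
+ };
+
+ // Per-party overall = mean of that party's dimensions, matching the 1-5 tables below.
+ const mean = (scores) =>
+   Object.values(scores).reduce((a, b) => a + b, 0) / Object.keys(scores).length;
+ ```
+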
+ ### Battery Test Matrix
+
+ We systematically test learner architecture × tutor profile combinations:
+
+ ```
+                                      TUTOR PROFILE
+                   baseline  budget  recognition  recognition+  quality
+ LEARNER unified      ●        ●         ●             ●           ●
+ ARCH    ego_super    ●        ●         ●             ●           ●
+         dialectic    ●        ●         ●             ●           ●
+         psychodyn    ●        ●         ●             ●           ●
+         cognitive    ●        ●         ●             ●           ●
+ ```
+
+ Each cell is evaluated by an LLM judge on all seven dimensions (4 tutor + 3 learner).
+
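+ Enumerating the full battery is then a cross product; a sketch with a hypothetical `runDyadicEval` entry point (not an existing function in the package):
+
+ ```javascript
+ const profiles = ['baseline', 'budget', 'recognition', 'recognition_plus', 'quality'];
+ const architectures = ['unified', 'ego_superego', 'dialectical', 'psychodynamic', 'cognitive'];
+
+ // One full battery pass = 5 x 5 = 25 dyadic runs, each judged on all seven dimensions.
+ async function runBattery() {
+   for (const profile of profiles) {
+     for (const architecture of architectures) {
+       await runDyadicEval({ profile, architecture }); // hypothetical entry point
+     }
+   }
+ }
+ ```
+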
+ ### Results: Tutor Profile Comparison (Dyadic)
+
+ Results from 13 battery test runs with LLM-based judge evaluation (n=2-5 per profile):
+
+ | Profile | Mutual Recog. | Dialectical | Transform. | Tone | Overall |
+ |---------|--------------|-------------|------------|------|---------|
+ | **quality** | **5.00** | **5.00** | **5.00** | **5.00** | **5.00** |
+ | budget | 5.00 | 5.00 | 4.50 | 5.00 | 4.88 |
+ | recognition+ | 5.00 | 4.50 | 4.50 | 5.00 | 4.75 |
+ | baseline | 4.50 | 4.00 | 4.00 | 5.00 | 4.38 |
+ | recognition | 3.60 | 4.40 | 4.20 | 4.00 | 4.05 |
+
+ *Scale: 1-5 (higher is better). All scores are averages across runs.*
+
+ ### Results: Learner Architecture Comparison
+
+ Learner dimension scores across architecture variants (n=1-2 per architecture where the architecture was recorded, plus n=7 legacy runs with unknown architecture):
+
+ | Architecture | Authenticity | Responsiveness | Development | Overall |
+ |--------------|-------------|----------------|-------------|---------|
+ | **cognitive** | **5.00** | **5.00** | **5.00** | **5.00** |
+ | **psychodynamic** | **5.00** | **5.00** | **5.00** | **5.00** |
+ | unified | 5.00 | 5.00 | 4.00 | 4.67 |
+ | dialectical | 5.00 | 5.00 | 4.00 | 4.67 |
+ | ego_superego (n=2) | 5.00 | 4.50 | 4.00 | 4.50 |
+
+ *Note: Sample sizes are small due to the recent addition of architecture tracking. The "unknown" category (n=7) from legacy evals averages 4.62.*
+
+ ### Key Findings
+
+ 1. **Quality profile achieves perfect scores**: The quality tutor profile (optimized for response quality over cost) achieved 5.0 across all dimensions, demonstrating that when token budgets allow extended reasoning, recognition-oriented behavior emerges naturally. This suggests recognition may correlate with response quality rather than requiring explicit instruction.
+
+ 2. **Recognition profile underperformed**: Surprisingly, the explicit "recognition" profile scored lowest (4.05 overall), with particularly low mutual_recognition (3.60) and tone (4.00) scores. This suggests that naming recognition explicitly may produce performative rather than genuine recognition. The recognition_plus profile (which adds more nuanced instructions) recovered to 4.75.
+
+ 3. **Budget constraints reduce transformative potential**: The budget profile maintained high mutual_recognition (5.0) and tone (5.0) but showed reduced transformative_potential (4.50) and learner development (4.0). Cost optimization appears to impact the depth of learning more than surface pedagogical quality.
+
+ 4. **Learner architecture strongly affects development**: The cognitive and psychodynamic architectures (which include explicit memory and reflection agents) produced superior development scores (5.0) compared to the simpler ego_superego architecture (4.0). Multi-agent internal deliberation appears to model learning progression more authentically.
+
+ 5. **Authenticity remains high across architectures**: All learner architectures scored 5.0 on authenticity, suggesting the judge found all variants produced believable learner behavior. The differentiation appeared primarily in development and responsiveness.
+
+ 6. **Internal deliberation enables bilateral evaluation**: The Goffmanian staging (internal deliberation before external message) allowed the judge to evaluate reasoning quality, not just output quality. This was particularly visible in how the learner Superego caught and corrected premature conclusions.
+
+ ### Cross-Tabulation: Profile × Architecture
+
+ Tutor overall score by specific pairings (where data exists):
+
+ ```
+                           Learner Architecture
+ Profile        cognitive  dialectical  ego_superego  psychodynamic  unified
+ ────────────────────────────────────────────────────────────────────────────
+ baseline           -           -            -              -          4.75
+ budget             -         5.00           -              -           -
+ quality          5.00          -            -              -           -
+ recognition        -           -          3.50             -           -
+ recognition+       -           -            -            5.00          -
+ ```
+
+ Notable interaction effects:
+ - The recognition + ego_superego pairing scored lowest (3.50), suggesting this combination produces suboptimal outcomes—possibly because both emphasize self-critique without sufficient generative capacity
+ - The quality × cognitive and recognition+ × psychodynamic pairings both achieved 5.0, indicating synergy between sophisticated tutor profiles and complex learner architectures
+
+ ### Discussion: What Dyadic Evaluation Adds
+
+ The dyadic extension addresses the central paradox of the original evaluation. Scripted learner turns cannot reciprocate recognition—they are philosophically equivalent to Hegel's slave, producing responses without genuinely responding.
+
+ With a simulated learner that has its own internal deliberation:
+ - The tutor's recognition can be tested by whether the learner's internal state actually responds
+ - Breakthrough moments can be observed in the learner's internal deliberation, not just inferred from external utterance
+ - The judge can evaluate whether mutual recognition is achieved—not just whether the tutor attempts it
+
+ **Key insight from results**: The finding that explicit recognition-naming underperformed while quality optimization excelled suggests that recognition may be an emergent property of thorough, high-quality interaction rather than something that can be directly instructed. This aligns with Honneth's observation that authentic recognition cannot be demanded or performed—it must arise from genuine engagement.
+
+ The dyadic framework also revealed that **learner architecture matters for measuring transformation**. The cognitive and psychodynamic architectures, with their explicit memory and reflection agents, showed learning development more clearly than simpler architectures. This suggests that to evaluate educational effectiveness, we need learners sophisticated enough to actually learn—not just respond.
+
+ ### Implications for AI Alignment
+
+ If recognition quality can be measured on both sides of the dyad, this has implications beyond tutoring:
+ - **Bidirectional evaluation**: AI systems that interact with other AI systems (or simulated users) could be evaluated for recognition quality from both perspectives
+ - **Constitutional recognition**: The learner's Superego enforces authenticity, just as the tutor's Superego enforces recognition. Both parties have internal evaluators.
+ - **Emergent mutual recognition**: When both parties are optimized for recognition, does genuine mutual recognition emerge? Or is it still a simulation of recognition?
+
+ These questions connect to fundamental issues in AI alignment: Can AI systems genuinely recognize each other (and humans) as subjects? Or is all AI recognition necessarily performative?
+
+ ---
+
+ ## Updated Limitations
+
+ ### Addressed by Dyadic Extension:
+ - ~~**Scripted learners**: Original evaluation used scripted learner turns~~ → Now uses simulated learners with internal deliberation
+
+ ### Remaining:
+ - **LLM-based evaluation**: Both parties and the judge are LLMs. The entire system may develop conventions that appear as recognition but aren't.
+ - **Model dependence**: Results were obtained with specific models and may not generalize to others.
+ - **Short-term**: Still primarily single-session evaluation. Longitudinal tracking infrastructure exists but is not fully validated.
+ - **Simulated ≠ Real**: Even sophisticated simulated learners are not real learners. The ultimate test remains human evaluation.