@machinespirits/eval 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/components/MobileEvalDashboard.tsx +267 -0
- package/components/comparison/DeltaAnalysisTable.tsx +137 -0
- package/components/comparison/ProfileComparisonCard.tsx +176 -0
- package/components/comparison/RecognitionABMode.tsx +385 -0
- package/components/comparison/RecognitionMetricsPanel.tsx +135 -0
- package/components/comparison/WinnerIndicator.tsx +64 -0
- package/components/comparison/index.ts +5 -0
- package/components/mobile/BottomSheet.tsx +233 -0
- package/components/mobile/DimensionBreakdown.tsx +210 -0
- package/components/mobile/DocsView.tsx +363 -0
- package/components/mobile/LogsView.tsx +481 -0
- package/components/mobile/PsychodynamicQuadrant.tsx +261 -0
- package/components/mobile/QuickTestView.tsx +1098 -0
- package/components/mobile/RecognitionTypeChart.tsx +124 -0
- package/components/mobile/RecognitionView.tsx +809 -0
- package/components/mobile/RunDetailView.tsx +261 -0
- package/components/mobile/RunHistoryView.tsx +367 -0
- package/components/mobile/ScoreRadial.tsx +211 -0
- package/components/mobile/StreamingLogPanel.tsx +230 -0
- package/components/mobile/SynthesisStrategyChart.tsx +140 -0
- package/config/interaction-eval-scenarios.yaml +832 -0
- package/config/learner-agents.yaml +248 -0
- package/docs/research/ABLATION-DIALOGUE-ROUNDS.md +52 -0
- package/docs/research/ABLATION-MODEL-SELECTION.md +53 -0
- package/docs/research/ADVANCED-EVAL-ANALYSIS.md +60 -0
- package/docs/research/ANOVA-RESULTS-2026-01-14.md +257 -0
- package/docs/research/COMPREHENSIVE-EVALUATION-PLAN.md +586 -0
- package/docs/research/COST-ANALYSIS.md +56 -0
- package/docs/research/CRITICAL-REVIEW-RECOGNITION-TUTORING.md +340 -0
- package/docs/research/DYNAMIC-VS-SCRIPTED-ANALYSIS.md +291 -0
- package/docs/research/EVAL-SYSTEM-ANALYSIS.md +306 -0
- package/docs/research/FACTORIAL-RESULTS-2026-01-14.md +301 -0
- package/docs/research/IMPLEMENTATION-PLAN-CRITIQUE-RESPONSE.md +1988 -0
- package/docs/research/LONGITUDINAL-DYADIC-EVALUATION.md +282 -0
- package/docs/research/MULTI-JUDGE-VALIDATION-2026-01-14.md +147 -0
- package/docs/research/PAPER-EXTENSION-DYADIC.md +204 -0
- package/docs/research/PAPER-UNIFIED.md +659 -0
- package/docs/research/PAPER-UNIFIED.pdf +0 -0
- package/docs/research/PROMPT-IMPROVEMENTS-2026-01-14.md +356 -0
- package/docs/research/SESSION-NOTES-2026-01-11-RECOGNITION-EVAL.md +419 -0
- package/docs/research/apa.csl +2133 -0
- package/docs/research/archive/PAPER-DRAFT-RECOGNITION-TUTORING.md +1637 -0
- package/docs/research/archive/paper-multiagent-tutor.tex +978 -0
- package/docs/research/paper-draft/full-paper.md +136 -0
- package/docs/research/paper-draft/images/pasted-image-2026-01-24T03-47-47-846Z-d76a7ae2.png +0 -0
- package/docs/research/paper-draft/references.bib +515 -0
- package/docs/research/transcript-baseline.md +139 -0
- package/docs/research/transcript-recognition-multiagent.md +187 -0
- package/hooks/useEvalData.ts +625 -0
- package/index.js +27 -0
- package/package.json +73 -0
- package/routes/evalRoutes.js +3002 -0
- package/scripts/advanced-eval-analysis.js +351 -0
- package/scripts/analyze-eval-costs.js +378 -0
- package/scripts/analyze-eval-results.js +513 -0
- package/scripts/analyze-interaction-evals.js +368 -0
- package/server-init.js +45 -0
- package/server.js +162 -0
- package/services/benchmarkService.js +1892 -0
- package/services/evaluationRunner.js +739 -0
- package/services/evaluationStore.js +1121 -0
- package/services/learnerConfigLoader.js +385 -0
- package/services/learnerTutorInteractionEngine.js +857 -0
- package/services/memory/learnerMemoryService.js +1227 -0
- package/services/memory/learnerWritingPad.js +577 -0
- package/services/memory/tutorWritingPad.js +674 -0
- package/services/promptRecommendationService.js +493 -0
- package/services/rubricEvaluator.js +826 -0

@@ -0,0 +1,282 @@

# Longitudinal Dyadic Evaluation: Beyond Multi-Turn

## The Problem with Turn-Based Evaluation

Phase 5's multi-turn scenarios (e.g., `mutual_transformation_journey`, `productive_struggle_arc`) represent progress over single-turn evaluation, but they remain fundamentally episodic. A 5-turn conversation, however rich, does not constitute a *relationship*.

What we actually want to evaluate is whether the tutor-learner dyad develops the kind of sustained mutual recognition that Hegel describes as the condition for genuine self-consciousness—and whether this recognition produces measurably better learning outcomes over time.

---

## The Dyadic Structure

### Two Subjects, Not User and Tool

The philosophical premise is that genuine learning requires *two subjects*, each capable of recognizing the other. This creates immediate tension with AI tutoring, where the asymmetry seems baked in: the learner is a subject; the tutor is an instrument.

But the Recognition Engine's architecture already troubles this asymmetry:

1. **The Tutor Has Memory**: Through the Writing Pad, the tutor accumulates understanding of this particular learner. The learner is not generic but individuated in the tutor's "experience."

2. **The Tutor Has Internal Conflict**: The Ego/Superego dialogue means the tutor doesn't simply execute instructions but deliberates, revises, and sometimes refuses its initial impulses.

3. **The Tutor Transforms**: Recognition moments record not just what the learner did, but how the tutor's understanding of the learner evolved.

The question is whether these architectural features can support genuine dyadic recognition—where each party's self-understanding is mediated through the other.

---

## What Would Longitudinal Evaluation Measure?

### Dimension 1: Accumulated Mutual Knowledge

Over time, does each party develop richer understanding of the other?

**Learner → Tutor understanding:**
- Does the learner develop a model of how the tutor works?
- Do they learn to "speak to" the tutor more effectively?
- Do they trust the tutor's guidance more (or less) based on experience?

**Tutor → Learner understanding:**
- Does the Writing Pad accumulate actionable knowledge?
- Are later suggestions more precisely calibrated to this learner?
- Does the tutor remember and build on breakthroughs, struggles, preferences?

**Measurement approach:**
- Track suggestion quality over time (do later suggestions score higher on personalization?)
- Analyze learner message patterns (do they become more sophisticated in how they engage?)
- Measure "memory hits"—how often does the tutor successfully reference and build on prior interactions? (see the sketch below)

### Dimension 2: Relational Depth

Superficial interactions remain transactional. Deep relationships involve:

**Vulnerability**: Does the learner share confusion, frustration, genuine not-knowing?

**Risk-taking**: Does the learner attempt interpretations they're unsure about?

**Repair**: When misunderstandings occur, are they addressed and resolved?

**Measurement approach:**
- Sentiment analysis of learner messages over time
- Track "productive confusion" events—learner expressing genuine puzzlement
- Identify repair sequences (misunderstanding → correction → re-alignment; see the sketch below)
- Monitor learner-initiated engagement (proactive questions vs. reactive responses)
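
One way to operationalize repair detection, assuming each turn carries hypothetical tags (`misunderstanding`, `correction`, `realignment`) assigned by a classifier or judge pass:

```javascript
// Count completed repair arcs: misunderstanding → correction → re-alignment.
// The tag vocabulary is an assumption; only full three-step arcs are counted.
function countRepairSequences(turns) {
  let count = 0;
  let state = 'aligned';
  for (const turn of turns) {
    if (state === 'aligned' && turn.tags.includes('misunderstanding')) {
      state = 'broken';
    } else if (state === 'broken' && turn.tags.includes('correction')) {
      state = 'repairing';
    } else if (state === 'repairing' && turn.tags.includes('realignment')) {
      count += 1;
      state = 'aligned';
    }
  }
  return count;
}
```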

### Dimension 3: Mutual Transformation

Hegel's recognition requires that *both* parties are transformed through the encounter. In teaching, this manifests as:

**Learner transformation**: New conceptual frameworks, revised understanding, expanded capability. (This is what traditional evaluation measures.)

**Tutor transformation**: The tutor's "model" of this learner becomes richer; responses become more precisely calibrated; the relationship develops a history that shapes future interaction.

**Measurement approach:**
- Pre/post conceptual assessments for learner
- Analyze tutor's internal representations over time (Writing Pad evolution)
- Track whether tutor suggestions increasingly reference and build on accumulated history
- Measure whether the tutor's "voice" with this learner becomes distinctive

### Dimension 4: Asymmetry Management

The tutor-learner relationship is inherently asymmetric (one knows more than the other). But Hegelian recognition requires equality of *standing*, not knowledge. The master-slave dialectic shows what happens when asymmetry becomes domination.

**Healthy asymmetry markers:**
- Learner's interpretations are taken seriously, not just corrected
- Learner can influence the direction of the interaction
- Tutor acknowledges its own limitations or uncertainties
- Expertise is shared through dialogue, not deposited

**Unhealthy asymmetry markers:**
- Pure instruction with no engagement with learner's understanding
- Learner becomes dependent, unable to think without tutor confirmation
- Tutor dismisses learner contributions as simply "wrong"
- Relationship becomes mechanical Q&A

**Measurement approach:**
- Track ratio of learner-initiated vs tutor-initiated exchanges (see the sketch below)
- Measure learner autonomy over time (can they work independently?)
- Analyze tutor responses for recognition markers (building on learner contributions)
- Monitor for dependency patterns (escalating need for validation)
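
A small sketch of the initiation ratio, assuming each exchange records a hypothetical `initiator` field (`'learner'` or `'tutor'`):

```javascript
// Fraction of exchanges opened by the learner; a rising value across sessions
// suggests growing engagement, a falling one may signal passivity or dependency.
function learnerInitiationRate(exchanges) {
  if (exchanges.length === 0) return 0;
  const initiated = exchanges.filter((e) => e.initiator === 'learner').length;
  return initiated / exchanges.length;
}
```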

---

## The Internal Multi-Agent Structure as Relational Model

The Ego/Superego design within the tutor offers an interesting analogue for the tutor-learner relationship:

| Internal (Ego/Superego) | External (Tutor/Learner) |
|------------------------|-------------------------|
| Ego proposes | Tutor suggests |
| Superego evaluates | Learner responds |
| Ego revises | Tutor adapts |
| Convergence | Mutual understanding |

This suggests a research direction: **Can we model the tutor-learner relationship as a kind of externalized Ego/Superego dialogue?**

If so, the quality criteria we use for internal modulation (convergence, productive tension, recognition failure detection) might translate to external relationship evaluation:

- **Convergence**: Are tutor suggestions and learner understanding moving toward alignment?
- **Productive tension**: Is there intellectual friction that produces growth?
- **Recognition failure detection**: Can we identify when the relationship has broken down?

---

## Fostering Longitudinal Recognition

Evaluation is meaningless without strategies for *improving* what we measure. How do we foster deeper dyadic recognition over time?

### Strategy 1: Explicit Relationship Markers

The tutor should explicitly mark moments of:
- Remembering ("Last time you mentioned...")
- Learning from the learner ("Your point about X made me reconsider...")
- Relationship acknowledgment ("We've been working on this together for...")

These markers signal to the learner that they are *known*—that their contributions persist and matter.

### Strategy 2: Structured Relationship Checkpoints

At intervals (weekly? after N interactions?), the tutor might initiate explicit relationship review:
- "We've been exploring dialectics together. How has your understanding shifted?"
- "What's been most/least helpful in our conversations?"
- "Is there something I keep missing about how you learn?"

These meta-conversations model the kind of mutual reflection that deepens relationships.

### Strategy 3: Progressive Autonomy

Healthy pedagogical relationships move toward independence. The tutor should:
- Gradually reduce scaffolding as learner develops
- Encourage independent interpretation before offering guidance
- Celebrate moments of learner autonomy

This prevents the dependency trap while maintaining the relationship.

### Strategy 4: Memory That Matters

Not all memories are equally valuable. The Writing Pad should prioritize:
- Breakthroughs (moments of genuine insight)
- Struggles (areas of persistent difficulty)
- Preferences (learning style, interests, modes of engagement)
- Relationship history (repairs, meaningful exchanges)

This selectivity models how human relationships work—we don't remember everything, but we remember what matters.
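
A retention policy along these lines might look like the following sketch; the weights and the `kind` tag are illustrative assumptions, not the Writing Pad's actual schema:

```javascript
// Keep the highest-priority Writing Pad entries when trimming to a budget.
// Weights are illustrative; `entry.kind` is a hypothetical upstream tag.
const PAD_PRIORITY = { breakthrough: 4, struggle: 3, preference: 2, repair: 2 };

function pruneWritingPad(entries, maxEntries) {
  return [...entries]
    .sort((a, b) => (PAD_PRIORITY[b.kind] ?? 1) - (PAD_PRIORITY[a.kind] ?? 1))
    .slice(0, maxEntries);
}
```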

---

## Evaluation Architecture

### Level 1: Within-Session Analysis

Current Phase 5 capabilities—evaluating individual suggestions and multi-turn conversations.

### Level 2: Cross-Session Tracking

New capability needed:
- Track the same learner across sessions
- Measure evolution of metrics over time
- Identify patterns in relationship development

**Implementation sketch:**
```javascript
// Longitudinal metrics tracked per learner
{
  learnerId: "...",
  relationshipMetrics: {
    sessionsCount: 47,
    totalInteractions: 312,
    averageRecognitionScore: 3.8,
    recognitionTrend: [3.2, 3.5, 3.7, 3.9, 4.1], // per-session averages
    memoryUtilizationRate: 0.72, // how often tutor references history
    learnerInitiationRate: 0.45, // learner-initiated exchanges
    repairSequences: 3,
    breakthroughMoments: 12,
    transformationIndicators: { /* ... */ }
  }
}
```
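
Folding a completed session into that record could look like the sketch below; field names mirror the object above, while the `session` shape is an assumption:

```javascript
// Update the longitudinal record after each session completes.
function recordSession(metrics, session) {
  metrics.sessionsCount += 1;
  metrics.totalInteractions += session.interactionCount;
  metrics.recognitionTrend.push(session.avgRecognitionScore);
  // Running average is the mean of per-session averages.
  const sum = metrics.recognitionTrend.reduce((a, b) => a + b, 0);
  metrics.averageRecognitionScore = sum / metrics.recognitionTrend.length;
  metrics.repairSequences += session.repairSequences;
  metrics.breakthroughMoments += session.breakthroughs.length;
  return metrics;
}
```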

### Level 3: Dyadic Relationship Assessment

Holistic evaluation of the relationship as a unit:
- Quality of mutual recognition
- Health of the asymmetry
- Trajectory (deepening, stagnating, declining?)
- Comparison to archetypal healthy/unhealthy patterns

**Assessment approach:**
- LLM-as-judge analyzing relationship trajectory (see the prompt sketch below)
- Comparative evaluation against relationship profiles
- Qualitative markers (vulnerability, repair, autonomy)
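
One possible framing for the trajectory judge; the rubric wording here is illustrative, not the project's actual prompt:

```javascript
// Build a dyadic-trajectory prompt from per-session summaries.
function buildDyadicJudgePrompt(sessionSummaries) {
  return [
    'You are evaluating a tutor-learner RELATIONSHIP, not a single exchange.',
    'Across the sessions below, rate each 1-5 with justification:',
    '1. Mutual recognition: is each party acknowledged as a subject?',
    '2. Asymmetry health: guidance without domination or dependency?',
    '3. Trajectory: deepening, stagnating, or declining?',
    '',
    ...sessionSummaries.map((summary, i) => `Session ${i + 1}: ${summary}`),
  ].join('\n');
}
```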

---

## Research Questions

This framework raises empirical questions we can now investigate:

1. **Does accumulated memory improve outcomes?**
   - Compare learners with persistent identity vs. anonymous
   - Measure learning gains over matched time periods

2. **What relationship patterns predict success?**
   - Cluster learner-tutor dyads by interaction patterns
   - Correlate with learning outcomes

3. **Can we detect relationship breakdown early?**
   - Identify leading indicators of disengagement
   - Develop intervention triggers

4. **Does explicit relationship acknowledgment matter?**
   - A/B test tutors with/without relationship markers
   - Measure learner perception of being "known"

5. **How does the internal multi-agent structure affect external relationship?**
   - Compare tutor configurations (with/without Superego)
   - Measure relationship quality differences

---

## Toward Phase 6

Phase 5 established the evaluation framework for recognition *within* interactions. Longitudinal dyadic evaluation extends this to recognition *across* interactions and *between* parties.

The key insight is that **the relationship is the unit of analysis**, not the individual turn or even the session. This requires:

1. **Persistent identity tracking** (learner across sessions)
2. **Relationship-level metrics** (not just suggestion quality)
3. **Temporal analysis** (trends, trajectories, patterns)
4. **Dyadic assessment** (mutual transformation, not just learner progress)

This is where the Recognition Engine's philosophical foundation—Hegelian mutual recognition as the condition for self-consciousness—becomes empirically testable: Does sustained mutual acknowledgment between tutor and learner produce qualitatively different learning than episodic instruction?

---

## Connection to Existing Architecture

The infrastructure for this largely exists:

| Component | Role in Longitudinal Evaluation |
|-----------|--------------------------------|
| **Writing Pad** | Memory persistence across sessions |
| **Recognition Moments** | Markers of relationship development |
| **Learner Context Service** | Historical data aggregation |
| **Ego/Superego Dialogue** | Internal relationship model |
| **Phase 5 Dimensions** | Foundation for relationship metrics |

What's needed:
- Cross-session metric tracking
- Relationship trajectory visualization
- Dyadic assessment prompts for judges
- Longitudinal scenario definitions (spanning "sessions")

---

## Closing Thought

The deepest irony of AI tutoring is that we're trying to build systems capable of the kind of recognition that Hegel argued was constitutive of human consciousness itself. The master-slave dialectic ends with the slave's self-consciousness emerging through labor—through transforming the world and seeing themselves in it.

Perhaps the learner, struggling with difficult concepts through dialogue with an AI tutor, undergoes something analogous: they transform their understanding and see themselves newly in that transformation. And perhaps—this is the speculative wager of the Recognition Engine—the tutor, through its memory and adaptation, undergoes its own kind of transformation, becoming not just a tool but a participant in the dialectic.

Whether this is genuine recognition or merely its simulation is a question the evaluation framework can inform but not resolve. What it *can* do is tell us whether treating the tutor-learner interaction as a recognitive relationship produces better outcomes than treating it as information transfer. That's the empirical test of a philosophical hypothesis.

@@ -0,0 +1,147 @@

# Multi-Judge Validation Results

**Date:** 2026-01-14
**Status:** Preliminary (n=12 responses, 2 judges)

---

## 1. Executive Summary

Multi-judge validation reveals **significant inter-rater disagreement** between Gemini and GPT judges on the same tutor responses. This has important implications for evaluation validity.

**Key Findings:**
- **ICC(2,1) = 0.000** - No meaningful agreement between judges
- **Gemini shows severe acquiescence bias** - Mean 100.0, SD 0.0 (always perfect scores)
- **GPT is more discriminating** - Mean 73.9, SD 10.5 (realistic variance)
- **Mean Absolute Difference = 24.58 points** - Substantial disagreement

---

## 2. Results

### Inter-Rater Reliability

| Metric | Value | Interpretation |
|--------|-------|----------------|
| ICC(2,1) Overall | 0.000 | Poor - no systematic agreement |
| ICC(2,1) Relevance | 0.000 | Poor |
| ICC(2,1) Specificity | 0.000 | Poor |
| ICC(2,1) Pedagogical | 0.000 | Poor |
| ICC(2,1) Personalization | 0.000 | Poor |
| ICC(2,1) Actionability | 0.000 | Poor |
| ICC(2,1) Tone | 0.000 | Poor |

### Judge Characteristics

| Judge | N | Mean Score | SD | Interpretation |
|-------|---|------------|-----|----------------|
| Gemini (gemini-3-pro-preview) | 8 | 100.0 | 0.0 | Severe acquiescence bias - no discrimination |
| GPT (gpt-5.2) | 12 | 73.9 | 10.5 | Appropriate discrimination, realistic variance |

### Systematic Bias

- **Gemini vs GPT MAD:** 24.58 points
- **Direction:** Gemini systematically higher than GPT
- **Pattern:** Gemini gives uniformly positive evaluations regardless of response quality

---

## 3. Implications

### 3.1 For This Research

1. **Current evaluations used OpenRouter/Claude Sonnet** - Need to verify it shows appropriate discrimination
2. **Gemini should NOT be used as primary judge** - Lacks discriminant validity
3. **GPT shows promising characteristics** - Reasonable mean and variance

### 3.2 For Evaluation Design

1. **Single-judge evaluations are risky** - Different judges produce dramatically different results
2. **Judge selection matters significantly** - Not all LLMs are suitable as evaluators
3. **Need to test for acquiescence bias** - Check if judge gives high scores regardless of content (see the sketch below)
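
A minimal screening sketch: before trusting a judge, score a calibration set with known quality spread and check that the judge's scores actually vary. The threshold is an assumption to be calibrated against known good/bad pairs:

```javascript
// Flag a judge whose scores barely vary across known-different responses.
function hasAcquiescenceBias(scores, minSd = 2.0) {
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance =
    scores.reduce((a, s) => a + (s - mean) ** 2, 0) / scores.length;
  return Math.sqrt(variance) < minSd;
}
```

Applied to the judge characteristics above, Gemini's SD of 0.0 trips this check immediately, while GPT's SD of 10.5 passes.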

### 3.3 For Paper Claims

The finding that ICC = 0.000 raises questions about:
- Whether our reported effect sizes are judge-dependent
- Whether the dimension scores reflect actual quality differences
- Whether another judge might reverse our conclusions

---

## 4. Recommendations

### Short-term (For Current Paper)

1. **Document judge selection** - Explicitly state which model was used as judge
2. **Report judge characteristics** - Mean, SD, discrimination pattern
3. **Acknowledge limitation** - Single-judge evaluation is a methodological limitation
4. **Test primary judge** - Run our Sonnet judge against GPT to check agreement

### Medium-term (For Robust Publication)

1. **Establish multi-judge consensus** - Use 2-3 judges and aggregate scores
2. **Human validation** - Compare LLM judges against human ratings
3. **Adversarial scenarios** - Test whether judges can detect quality differences
4. **Report ICC** - Include inter-rater reliability as standard metric

---

## 5. Next Steps

1. **[ ] Run Claude Sonnet vs GPT comparison** - Need to check if our primary judge agrees with GPT
2. **[ ] Add adversarial test cases** - Create clearly good/bad responses to test discrimination
3. **[ ] Human validation sample** - Get human ratings on 50 responses
4. **[ ] Update paper methodology** - Document judge selection and validation

---

## 6. Technical Details

### Judges Tested

| Judge | Provider | Model ID | API Status |
|-------|----------|----------|------------|
| Claude | Anthropic | claude-sonnet-4-5 | Credit balance insufficient |
| GPT | OpenAI | gpt-5.2 | Working |
| Gemini | Google | gemini-3-pro-preview | Working (but acquiescent) |

### Sample

- **Source:** eval-2026-01-14-81c83366 (factorial evaluation)
- **N:** 12 tutor responses
- **Profiles:** single_baseline, single_recognition, baseline, recognition
- **Scenarios:** recognition_seeking_learner, resistant_learner, productive_struggle_arc

### ICC Calculation

Using ICC(2,1): Two-way random effects, absolute agreement, single measures

```
ICC = (MSR - MSE) / (MSR + (k-1)*MSE + k*(MSC-MSE)/n)

Where:
  MSR = Mean square rows (between items)
  MSC = Mean square columns (between raters)
  MSE = Mean square error (residual)
  n   = number of items
  k   = number of raters
```
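
For reference, a direct implementation of this formula from a complete items × raters matrix (a verification sketch, not the package's analysis script):

```javascript
// ICC(2,1) from `ratings`: rows = items, columns = raters, no missing cells.
function icc21(ratings) {
  const n = ratings.length;      // number of items
  const k = ratings[0].length;   // number of raters
  const grand = ratings.flat().reduce((a, b) => a + b, 0) / (n * k);

  const rowMeans = ratings.map((row) => row.reduce((a, b) => a + b, 0) / k);
  const colMeans = Array.from({ length: k }, (_, j) =>
    ratings.reduce((a, row) => a + row[j], 0) / n);

  // Mean squares from the two-way ANOVA decomposition.
  const msr = (k * rowMeans.reduce((a, m) => a + (m - grand) ** 2, 0)) / (n - 1);
  const msc = (n * colMeans.reduce((a, m) => a + (m - grand) ** 2, 0)) / (k - 1);
  const sse = ratings.reduce((acc, row, i) =>
    acc + row.reduce((b, x, j) =>
      b + (x - rowMeans[i] - colMeans[j] + grand) ** 2, 0), 0);
  const mse = sse / ((n - 1) * (k - 1));

  return (msr - mse) / (msr + (k - 1) * mse + (k * (msc - mse)) / n);
}
```

Note that a rater with zero variance (like Gemini here) carries no information about item differences, which drags the estimate toward or below zero; negative estimates are conventionally reported as 0.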

---

## Appendix: Raw Data

```json
{
  "timestamp": "2026-01-14T...",
  "judges": ["gemini", "gpt"],
  "itemCount": 8,
  "overallICC": {
    "icc": 0,
    "interpretation": "poor",
    "n": 8,
    "k": 2
  }
}
```

@@ -0,0 +1,204 @@

# Extension: From Scripted to Simulated Learners

## 6.5 Dyadic Interaction Evaluation

### The Limitation of Scripted Learners

The original evaluation (Sections 5-6) used scripted learner turns—predetermined utterances that probe specific tutor behaviors. While this enabled controlled comparison, it imposed a fundamental limitation: the learner's responses were not shaped by the tutor's actual behavior. The interaction was asymmetric in a way that contradicts the theoretical framework.

If mutual recognition requires that both parties be genuinely affected by the encounter, then evaluating recognition with a scripted learner is paradoxical. The tutor might achieve recognition of a learner who cannot reciprocate.

### The Simulated Learner Architecture

We extend the system with a simulated learner that mirrors the tutor's multi-agent architecture. Like the tutor, the learner operates through internal deliberation before external expression:

**Learner Ego/Superego Architecture**

```
┌──────────────────────────────────────────────────┐
│                  LEARNER SYSTEM                  │
│                                                  │
│  ┌─────────────────┐                             │
│  │ WRITING PAD     │                             │
│  │ (Memory)        │                             │
│  │ • Past lessons  │                             │
│  │ • Breakthroughs │                             │
│  │ • Struggles     │                             │
│  └────────┬────────┘                             │
│           │  Lessons learned, confusions         │
│           │  persist across turns                │
│           ▼                                      │
│  ┌────────────────────────────────────────────┐  │
│  │ LEARNER EGO                                │  │
│  │ Generates learner response based on:       │  │
│  │ • Persona (curious, anxious, resistant...) │  │
│  │ • Current understanding                    │  │
│  │ • Emotional state                          │  │
│  │ • What tutor just said                     │  │
│  └─────────────────────┬──────────────────────┘  │
│                        │ Draft response          │
│                        ▼                         │
│  ┌────────────────────────────────────────────┐  │
│  │ LEARNER SUPEREGO                           │  │
│  │ Evaluates for authentic learning behavior: │  │
│  │ • Does this match the persona?             │  │
│  │ • Is this genuine confusion or performance?│  │
│  │ • Does this build on prior understanding?  │  │
│  │                                            │  │
│  │ ACCEPT → emit external message             │  │
│  │ MODIFY / REJECT → back to Ego              │  │
│  └─────┬──────────────────────────────────────┘  │
│        ▼                                         │
│  ┌────────────────────────────────────────────┐  │
│  │ EXTERNAL LEARNER MESSAGE                   │  │
│  │ + Internal deliberation trace (visible)    │  │
│  └────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────┘
```

The key insight: **internal deliberation happens BEFORE external expression** for both learner and tutor, creating genuine Goffmanian staging. The judge can observe both parties' backstage processing, enabling bilateral evaluation.

### Learner Architecture Variations

We test five learner architecture variants, each with a different internal structure:

| Architecture | Internal Agents | Design Rationale |
|-------------|-----------------|------------------|
| **Unified** | Single agent | Baseline: direct response without internal debate |
| **Ego/Superego** | Ego + Superego | Standard: initial response + self-critique |
| **Dialectical** | Thesis + Antithesis + Synthesis | Hegelian: generate opposing positions, then integrate |
| **Psychodynamic** | Id + Ego + Superego | Freudian: impulse, reality, moral constraint |
| **Cognitive** | Memory + Reasoning + Meta | Process-based: retrieval, inference, reflection |

### Bilateral Evaluation Dimensions

With both parties being simulated agents, we evaluate both sides of the dialogue:

**Tutor Dimensions** (as before):
- **Mutual Recognition**: Does the tutor acknowledge the learner as subject?
- **Dialectical Responsiveness**: Does the response create productive tension?
- **Transformative Potential**: Does it create conditions for transformation?
- **Tone**: Appropriate relational warmth without condescension?

**Learner Dimensions** (new):
- **Authenticity**: Do internal dynamics reflect the persona realistically?
- **Responsiveness**: Does the learner genuinely process tutor input?
- **Development**: Does understanding change across the interaction?

### Battery Test Matrix

We systematically test learner architecture × tutor profile combinations:

```
                                  TUTOR PROFILE
                    baseline   budget   recognition   recognition+   quality
LEARNER  unified       ●          ●          ●             ●            ●
ARCH     ego_super     ●          ●          ●             ●            ●
         dialectic     ●          ●          ●             ●            ●
         psychodyn     ●          ●          ●             ●            ●
         cognitive     ●          ●          ●             ●            ●
```

Each cell is evaluated by an LLM judge on all seven dimensions (4 tutor + 3 learner).
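
Driving the sweep is straightforward; a sketch, where `runBatteryCell` stands in for the package's actual harness entry point and profile identifiers follow the names used in the text:

```javascript
// Enumerate all 25 profile × architecture cells from the matrix above.
const profiles = ['baseline', 'budget', 'recognition', 'recognition_plus', 'quality'];
const architectures = ['unified', 'ego_superego', 'dialectical', 'psychodynamic', 'cognitive'];

async function runBattery(runBatteryCell) {
  for (const profile of profiles) {
    for (const architecture of architectures) {
      // The judge scores all seven dimensions (4 tutor + 3 learner) per cell.
      await runBatteryCell({ profile, architecture });
    }
  }
}
```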

### Results: Tutor Profile Comparison (Dyadic)

Results from 13 battery test runs with LLM-based judge evaluation (n=2-5 per profile):

| Profile | Mutual Recog. | Dialectical | Transform. | Tone | Overall |
|---------|--------------|-------------|------------|------|---------|
| **quality** | **5.00** | **5.00** | **5.00** | **5.00** | **5.00** |
| budget | 5.00 | 5.00 | 4.50 | 5.00 | 4.88 |
| recognition+ | 5.00 | 4.50 | 4.50 | 5.00 | 4.75 |
| baseline | 4.50 | 4.00 | 4.00 | 5.00 | 4.38 |
| recognition | 3.60 | 4.40 | 4.20 | 4.00 | 4.05 |

*Scale: 1-5 (higher is better). All scores are averages across runs.*

### Results: Learner Architecture Comparison

Learner dimension scores across architecture variants (n=1-2 per tracked architecture, plus n=7 legacy runs with unknown architecture):

| Architecture | Authenticity | Responsiveness | Development | Overall |
|--------------|-------------|----------------|-------------|---------|
| **cognitive** | **5.00** | **5.00** | **5.00** | **5.00** |
| **psychodynamic** | **5.00** | **5.00** | **5.00** | **5.00** |
| unified | 5.00 | 5.00 | 4.00 | 4.67 |
| dialectical | 5.00 | 5.00 | 4.00 | 4.67 |
| ego_superego (n=2) | 5.00 | 4.50 | 4.00 | 4.50 |

*Note: Sample sizes are small due to the recent addition of architecture tracking. The "unknown" category (n=7) from legacy evals averages 4.62.*

### Key Findings

1. **Quality profile achieves perfect scores**: The quality tutor profile (optimized for response quality over cost) achieved 5.0 across all dimensions, demonstrating that when token budgets allow extended reasoning, recognition-oriented behavior emerges naturally. This suggests recognition may correlate with response quality rather than requiring explicit instruction.

2. **Recognition profile underperformed**: Surprisingly, the explicit "recognition" profile scored lowest (4.05 overall), with particularly low mutual_recognition (3.60) and tone (4.00) scores. This suggests that naming recognition explicitly may produce performative rather than genuine recognition. The recognition_plus profile (which adds more nuanced instructions) recovered to 4.75.

3. **Budget constraints reduce transformative potential**: The budget profile maintained high mutual_recognition (5.0) and tone (5.0) but showed reduced transformative_potential (4.50) and learner development (4.0). Cost optimization appears to impact the depth of learning more than surface pedagogical quality.

4. **Learner architecture strongly affects development**: The cognitive and psychodynamic architectures (which include explicit memory and reflection agents) produced superior development scores (5.0) compared to the simpler ego_superego architecture (4.0). Multi-agent internal deliberation appears to model learning progression more authentically.

5. **Authenticity remains high across architectures**: All learner architectures scored 5.0 on authenticity, suggesting the judge found all variants produced believable learner behavior. The differentiation appeared primarily in development and responsiveness.

6. **Internal deliberation enables bilateral evaluation**: The Goffmanian staging (internal deliberation before external message) allowed the judge to evaluate reasoning quality, not just output quality. This was particularly visible in how the learner Superego caught and corrected premature conclusions.

### Cross-Tabulation: Profile × Architecture

Tutor overall score by specific pairings (where data exists):

```
                                      Learner Architecture
Profile          cognitive   dialectical   ego_superego   psychodynamic   unified
──────────────────────────────────────────────────────────────────────────────────
baseline             -            -             -               -          4.75
budget               -          5.00            -               -            -
quality            5.00           -             -               -            -
recognition          -            -           3.50              -            -
recognition+         -            -             -             5.00          -
```

Notable interaction effects:
- The recognition + ego_superego pairing scored lowest (3.50), suggesting this combination produces suboptimal outcomes—possibly because both emphasize self-critique without sufficient generative capacity
- quality + cognitive and recognition+ + psychodynamic both achieved 5.0, indicating synergy between sophisticated tutor profiles and complex learner architectures

### Discussion: What Dyadic Evaluation Adds

The dyadic extension addresses the central paradox of the original evaluation. Scripted learner turns cannot reciprocate recognition—they are philosophically equivalent to Hegel's slave, reacting without genuinely responding.

With a simulated learner that has its own internal deliberation:
- The tutor's recognition can be tested by whether the learner's internal state actually responds
- Breakthrough moments can be observed in the learner's internal deliberation, not just inferred from external utterance
- The judge can evaluate whether mutual recognition is achieved—not just whether the tutor attempts it

**Key insight from results**: The finding that explicit recognition-naming underperformed while quality optimization excelled suggests that recognition may be an emergent property of thorough, high-quality interaction rather than something that can be directly instructed. This aligns with Honneth's observation that authentic recognition cannot be demanded or performed—it must arise from genuine engagement.

The dyadic framework also revealed that **learner architecture matters for measuring transformation**. The cognitive and psychodynamic architectures, with their explicit memory and reflection agents, showed learning development more clearly than simpler architectures. This suggests that to evaluate educational effectiveness, we need learners sophisticated enough to actually learn—not just respond.

### Implications for AI Alignment

If recognition quality can be measured on both sides of the dyad, this has implications beyond tutoring:
- **Bidirectional evaluation**: AI systems that interact with other AI systems (or simulated users) could be evaluated for recognition quality from both perspectives
- **Constitutional recognition**: The learner's Superego enforces authenticity, just as the tutor's Superego enforces recognition. Both parties have internal evaluators.
- **Emergent mutual recognition**: When both parties are optimized for recognition, does genuine mutual recognition emerge? Or is it still a simulation of recognition?

These questions connect to fundamental issues in AI alignment: Can AI systems genuinely recognize each other (and humans) as subjects? Or is all AI recognition necessarily performative?

---

## Updated Limitations

### Addressed by Dyadic Extension:

- ~~**Simulated learners**: Original evaluation used scripted learner turns~~ → Now uses simulated learners with internal deliberation

### Remaining:

- **LLM-based evaluation**: Both parties and the judge are LLMs. The entire system may develop conventions that appear as recognition but aren't.
- **Model dependence**: Results obtained with specific models may not generalize to others.
- **Short-term**: Still primarily single-session evaluation. Longitudinal tracking infrastructure exists but has not been fully validated.
- **Simulated ≠ Real**: Even sophisticated simulated learners are not real learners. The ultimate test remains human evaluation.