@machinespirits/eval 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/components/MobileEvalDashboard.tsx +267 -0
- package/components/comparison/DeltaAnalysisTable.tsx +137 -0
- package/components/comparison/ProfileComparisonCard.tsx +176 -0
- package/components/comparison/RecognitionABMode.tsx +385 -0
- package/components/comparison/RecognitionMetricsPanel.tsx +135 -0
- package/components/comparison/WinnerIndicator.tsx +64 -0
- package/components/comparison/index.ts +5 -0
- package/components/mobile/BottomSheet.tsx +233 -0
- package/components/mobile/DimensionBreakdown.tsx +210 -0
- package/components/mobile/DocsView.tsx +363 -0
- package/components/mobile/LogsView.tsx +481 -0
- package/components/mobile/PsychodynamicQuadrant.tsx +261 -0
- package/components/mobile/QuickTestView.tsx +1098 -0
- package/components/mobile/RecognitionTypeChart.tsx +124 -0
- package/components/mobile/RecognitionView.tsx +809 -0
- package/components/mobile/RunDetailView.tsx +261 -0
- package/components/mobile/RunHistoryView.tsx +367 -0
- package/components/mobile/ScoreRadial.tsx +211 -0
- package/components/mobile/StreamingLogPanel.tsx +230 -0
- package/components/mobile/SynthesisStrategyChart.tsx +140 -0
- package/config/interaction-eval-scenarios.yaml +832 -0
- package/config/learner-agents.yaml +248 -0
- package/docs/research/ABLATION-DIALOGUE-ROUNDS.md +52 -0
- package/docs/research/ABLATION-MODEL-SELECTION.md +53 -0
- package/docs/research/ADVANCED-EVAL-ANALYSIS.md +60 -0
- package/docs/research/ANOVA-RESULTS-2026-01-14.md +257 -0
- package/docs/research/COMPREHENSIVE-EVALUATION-PLAN.md +586 -0
- package/docs/research/COST-ANALYSIS.md +56 -0
- package/docs/research/CRITICAL-REVIEW-RECOGNITION-TUTORING.md +340 -0
- package/docs/research/DYNAMIC-VS-SCRIPTED-ANALYSIS.md +291 -0
- package/docs/research/EVAL-SYSTEM-ANALYSIS.md +306 -0
- package/docs/research/FACTORIAL-RESULTS-2026-01-14.md +301 -0
- package/docs/research/IMPLEMENTATION-PLAN-CRITIQUE-RESPONSE.md +1988 -0
- package/docs/research/LONGITUDINAL-DYADIC-EVALUATION.md +282 -0
- package/docs/research/MULTI-JUDGE-VALIDATION-2026-01-14.md +147 -0
- package/docs/research/PAPER-EXTENSION-DYADIC.md +204 -0
- package/docs/research/PAPER-UNIFIED.md +659 -0
- package/docs/research/PAPER-UNIFIED.pdf +0 -0
- package/docs/research/PROMPT-IMPROVEMENTS-2026-01-14.md +356 -0
- package/docs/research/SESSION-NOTES-2026-01-11-RECOGNITION-EVAL.md +419 -0
- package/docs/research/apa.csl +2133 -0
- package/docs/research/archive/PAPER-DRAFT-RECOGNITION-TUTORING.md +1637 -0
- package/docs/research/archive/paper-multiagent-tutor.tex +978 -0
- package/docs/research/paper-draft/full-paper.md +136 -0
- package/docs/research/paper-draft/images/pasted-image-2026-01-24T03-47-47-846Z-d76a7ae2.png +0 -0
- package/docs/research/paper-draft/references.bib +515 -0
- package/docs/research/transcript-baseline.md +139 -0
- package/docs/research/transcript-recognition-multiagent.md +187 -0
- package/hooks/useEvalData.ts +625 -0
- package/index.js +27 -0
- package/package.json +73 -0
- package/routes/evalRoutes.js +3002 -0
- package/scripts/advanced-eval-analysis.js +351 -0
- package/scripts/analyze-eval-costs.js +378 -0
- package/scripts/analyze-eval-results.js +513 -0
- package/scripts/analyze-interaction-evals.js +368 -0
- package/server-init.js +45 -0
- package/server.js +162 -0
- package/services/benchmarkService.js +1892 -0
- package/services/evaluationRunner.js +739 -0
- package/services/evaluationStore.js +1121 -0
- package/services/learnerConfigLoader.js +385 -0
- package/services/learnerTutorInteractionEngine.js +857 -0
- package/services/memory/learnerMemoryService.js +1227 -0
- package/services/memory/learnerWritingPad.js +577 -0
- package/services/memory/tutorWritingPad.js +674 -0
- package/services/promptRecommendationService.js +493 -0
- package/services/rubricEvaluator.js +826 -0
|
@@ -0,0 +1,1637 @@
|
|
|
1
|
+
---
|
|
2
|
+
title: "Mutual Recognition in AI Tutoring: A Hegelian Framework for Intersubjective Pedagogy"
|
|
3
|
+
author:
|
|
4
|
+
- name: "[Author Name]"
|
|
5
|
+
affiliation: "[Institution]"
|
|
6
|
+
date: "January 2026"
|
|
7
|
+
draft: v0.2
|
|
8
|
+
bibliography: references.bib
|
|
9
|
+
csl: apa.csl
|
|
10
|
+
link-citations: true
|
|
11
|
+
abstract: |
|
|
12
|
+
Current approaches to AI tutoring treat the learner as a knowledge deficit to be filled and the tutor as an expert dispensing information. We propose an alternative grounded in Hegel's theory of mutual recognition—understood as a *derivative* framework rather than literal application—where effective pedagogy requires acknowledging the learner as an autonomous subject whose understanding has intrinsic validity. We implement this framework through recognition-enhanced prompts and a multi-agent architecture where an "Ego" agent generates pedagogical suggestions and a "Superego" agent (a *productive metaphor* for internal quality review) evaluates them before delivery. A 2×2 factorial evaluation isolating architecture (single-agent vs. multi-agent) from recognition (standard vs. recognition-enhanced prompts) reveals that recognition-enhanced prompting accounts for 85% of observed improvement (+35.1 points), while multi-agent architecture contributes 15% (+6.2 points). The combined recognition profile achieves 80.7/100 versus 40.1/100 for baseline—a 101% improvement. Effect size analysis reveals the largest gains in relevance (4.67 vs 2.22), pedagogical soundness (4.17 vs 1.89), and personalization (4.22 vs 2.00)—exactly the relational dimensions predicted by the theoretical framework. Iterative refinement of prompts based on dialogue trace analysis improved dialectical responsiveness scores from 56.8 to 67.0 on challenging resistance scenarios. These results suggest that operationalizing philosophical theories of intersubjectivity as design heuristics can produce measurable improvements in AI tutor adaptive pedagogy, and that recognition may be better understood as an achievable relational stance rather than requiring genuine machine consciousness.
|
|
13
|
+
keywords: [AI tutoring, mutual recognition, Hegel, intersubjectivity, multi-agent systems, educational technology, productive struggle]
|
|
14
|
+
---
|
|
15
|
+
|
|
16
|
+
# Mutual Recognition in AI Tutoring: A Hegelian Framework for Intersubjective Pedagogy
|
|
17
|
+
|
|
18
|
+
## 1. Introduction
|
|
19
|
+
|
|
20
|
+
The dominant paradigm in AI-assisted education treats learning as information transfer. The learner lacks knowledge; the tutor possesses it; the interaction succeeds when knowledge flows from tutor to learner. This paradigm—implicit in most intelligent tutoring systems, adaptive learning platforms, and educational chatbots—treats the learner as fundamentally passive: a vessel to be filled, a gap to be closed, an error to be corrected.
|
|
21
|
+
|
|
22
|
+
This paper proposes an alternative grounded in Hegel's theory of mutual recognition. In the *Phenomenology of Spirit*, Hegel argues that genuine self-consciousness requires recognition from another consciousness that one oneself recognizes as valid. The master-slave dialectic reveals that one-directional recognition fails: the master's self-consciousness remains hollow because the slave's acknowledgment, given under duress, doesn't truly count. Only mutual recognition—where each party acknowledges the other as an autonomous subject—produces genuine selfhood.
|
|
23
|
+
|
|
24
|
+
We argue this framework applies directly to pedagogy. When a tutor treats a learner merely as a knowledge deficit, the learner's contributions become conversational waypoints rather than genuine inputs. The tutor acknowledges and redirects, but doesn't let the learner's understanding genuinely shape the interaction. This is pedagogical master-slave dynamics: the tutor's expertise is confirmed, but the learner remains a vessel rather than a subject.
|
|
25
|
+
|
|
26
|
+
A recognition-oriented tutor, by contrast, treats the learner's understanding as having intrinsic validity—not because it's correct, but because it emerges from an autonomous consciousness working through material. The learner's metaphors, confusions, and insights become sites of joint inquiry. The tutor's response is shaped by the learner's contribution, not merely triggered by it.
|
|
27
|
+
|
|
28
|
+
We operationalize this framework through:
|
|
29
|
+
|
|
30
|
+
1. **Recognition-enhanced prompts** that instruct the AI to treat learners as autonomous subjects
|
|
31
|
+
2. **A multi-agent architecture** where a "Superego" agent evaluates whether suggestions achieve genuine recognition
|
|
32
|
+
3. **New evaluation dimensions** that measure recognition quality alongside traditional pedagogical metrics
|
|
33
|
+
4. **Test scenarios** specifically designed to probe recognition behaviors
|
|
34
|
+
|
|
35
|
+
In controlled evaluations using a 2×2 factorial design (architecture × recognition prompts), the recognition-enhanced system shows consistent improvements across all scenarios. A factorial analysis reveals recognition-enhanced prompting accounts for 85% of improvement (+35.1 points), with multi-agent architecture contributing an additional 15% (+6.2 points). The combined recognition profile achieves 80.7/100 versus 40.1/100 for baseline—a 101% improvement. More importantly, the improvements concentrate in exactly the dimensions our theoretical framework predicts: relevance, pedagogical soundness, and personalization.
|
|
36
|
+
|
|
37
|
+
The contributions of this paper are:
|
|
38
|
+
|
|
39
|
+
- A theoretical framework connecting Hegelian recognition to AI pedagogy
|
|
40
|
+
- A multi-agent architecture for implementing recognition in tutoring systems
|
|
41
|
+
- Empirical evidence that recognition-oriented design improves tutoring outcomes
|
|
42
|
+
- Analysis of how this approach extends literature on AI prompting, personality, and pedagogy
|
|
43
|
+
|
|
44
|
+
---
|
|
45
|
+
|
|
46
|
+
## 2. Related Work
|
|
47
|
+
|
|
48
|
+
### 2.1 AI Tutoring and Intelligent Tutoring Systems
|
|
49
|
+
|
|
50
|
+
Intelligent Tutoring Systems (ITS) have a long history, from early systems like SCHOLAR [@carbonell1970] and SOPHIE [@brown1975] through modern implementations using large language models. The field has progressed through several paradigms: rule-based expert systems, Bayesian knowledge tracing [@corbett1995], and more recently, neural approaches leveraging pretrained language models [@kasneci2023].
|
|
51
|
+
|
|
52
|
+
Most ITS research focuses on *what* to teach (content sequencing, knowledge components) and *when* to intervene (mastery thresholds, hint timing). Our work addresses a different question: *how* to relate to the learner as a subject. This relational dimension has received less systematic attention, though it connects to work on rapport [@zhao2014], social presence [@biocca2003], and affective tutoring [@dmello2012].
|
|
53
|
+
|
|
54
|
+
### 2.2 Prompt Engineering and Agent Design
|
|
55
|
+
|
|
56
|
+
The emergence of large language models has spawned extensive research on prompt engineering—how to instruct models to produce desired behaviors [@brown2020; @wei2022]. Most prompting research treats prompts as behavioral specifications: persona prompts, chain-of-thought instructions, few-shot examples [@kojima2022].
|
|
57
|
+
|
|
58
|
+
Our work extends this paradigm by introducing *intersubjective prompts*—prompts that specify not just agent behavior but agent-other relations. The recognition prompts don't primarily describe what the tutor should do; they describe who the learner is (an autonomous subject) and what the interaction produces (mutual transformation).
|
|
59
|
+
|
|
60
|
+
Multi-agent architectures have been explored for task decomposition [@wu2023], debate [@irving2018], and self-critique [@madaan2023]. Our Ego/Superego architecture contributes a specific use case: internal evaluation of relational quality before external response.
|
|
61
|
+
|
|
62
|
+
### 2.3 AI Personality and Character
|
|
63
|
+
|
|
64
|
+
Research on AI personality typically treats personality as dispositional—stable traits the system exhibits [@volkel2021]. Systems are friendly or formal, creative or precise. The "Big Five" personality framework has been applied to chatbot design [@zhou2020].
|
|
65
|
+
|
|
66
|
+
Our framework suggests personality may be better understood relationally: not *what traits* the AI exhibits, but *how* it constitutes its interlocutor. Two systems with identical warmth dispositions could differ radically in recognition quality—one warm while treating the user as passive, another warm precisely by treating user contributions as genuinely mattering.
|
|
67
|
+
|
|
68
|
+
This connects to Anthropic's research on Claude's character [@anthropic2024]. Constitutional AI specifies values the model should hold, but values don't fully determine relational stance. A model could value "being helpful" while still enacting one-directional helping. Recognition adds a dimension: mutual constitution.
|
|
69
|
+
|
|
70
|
+
### 2.4 Constructivist Pedagogy and Productive Struggle
|
|
71
|
+
|
|
72
|
+
Constructivist learning theory [@piaget1954; @vygotsky1978] emphasizes that learners actively construct understanding rather than passively receiving information. The zone of proximal development [@vygotsky1978] highlights the importance of appropriate challenge.
|
|
73
|
+
|
|
74
|
+
More recently, research on "productive struggle" [@kapur2008; @warshauer2015] has examined how confusion and difficulty, properly supported, can enhance learning. Our recognition framework operationalizes productive struggle: the Superego explicitly checks whether the Ego is "short-circuiting" struggle by rushing to resolve confusion.
|
|
75
|
+
|
|
76
|
+
### 2.5 Hegelian Recognition in Social Theory
|
|
77
|
+
|
|
78
|
+
Hegel's theory of recognition has been extensively developed in social and political philosophy [@honneth1995; @taylor1994; @fraser2003]. Recognition theory examines how social relationships shape identity and how misrecognition constitutes harm.
|
|
79
|
+
|
|
80
|
+
Particularly relevant for our work is Honneth's [@honneth1995] synthesis of Hegelian recognition with psychoanalytic developmental theory. Honneth argues that self-formation requires recognition across three spheres—love (emotional support), rights (legal recognition), and solidarity (social esteem)—and that the capacity to recognize others depends on having internalized adequate recognition standards through development. This synthesis provides theoretical grounding for connecting recognition theory (what adequate acknowledgment requires) with psychodynamic architecture (how internal structure enables external relating).
|
|
81
|
+
|
|
82
|
+
Applications to education have primarily been theoretical [@huttunen2007; @stojanov2018]. Our work contributes an empirical operationalization: measuring whether AI systems achieve recognition and whether recognition improves outcomes.
|
|
83
|
+
|
|
84
|
+
---
|
|
85
|
+
|
|
86
|
+
## 3. Theoretical Framework
|
|
87
|
+
|
|
88
|
+
### 3.1 The Problem of One-Directional Pedagogy
|
|
89
|
+
|
|
90
|
+
Consider a typical tutoring interaction. A learner says: "I think dialectics is like a spiral—you keep going around but you're also going up." A baseline tutor might respond:
|
|
91
|
+
|
|
92
|
+
1. **Acknowledge**: "That's an interesting way to think about it."
|
|
93
|
+
2. **Redirect**: "The key concept in dialectics is actually the thesis-antithesis-synthesis structure."
|
|
94
|
+
3. **Instruct**: "Here's how that works..."
|
|
95
|
+
|
|
96
|
+
The learner's contribution has been mentioned, but it hasn't genuinely shaped the response. The tutor was going to explain thesis-antithesis-synthesis regardless; the spiral metaphor became a conversational waypoint, not a genuine input.
|
|
97
|
+
|
|
98
|
+
This pattern—acknowledge, redirect, instruct—is deeply embedded in educational AI. It appears learner-centered because it mentions the learner's contribution. But the underlying logic remains one-directional: expert to novice, knowledge to deficit.
|
|
99
|
+
|
|
100
|
+
### 3.2 Hegel's Master-Slave Dialectic
|
|
101
|
+
|
|
102
|
+
Hegel's analysis of recognition begins with the "struggle for recognition" between two self-consciousnesses. Each seeks acknowledgment from the other, but this creates a paradox: genuine recognition requires acknowledging the other as a valid source of recognition.
|
|
103
|
+
|
|
104
|
+
The master-slave outcome represents a failed resolution. The master achieves apparent recognition—the slave acknowledges the master's superiority—but this recognition is hollow. The slave's acknowledgment doesn't count because the slave isn't recognized as an autonomous consciousness whose acknowledgment matters.
|
|
105
|
+
|
|
106
|
+
The slave, paradoxically, achieves more genuine self-consciousness through labor. Working on the world, the slave externalizes consciousness and sees it reflected back. The master, consuming the slave's products without struggle, remains in hollow immediacy.
|
|
107
|
+
|
|
108
|
+
### 3.3 Application to Pedagogy
|
|
109
|
+
|
|
110
|
+
We apply Hegel's framework as a *derivative* rather than a replica. Just as Lacan's four discourses (Master, University, Hysteric, Analyst) rethink the master-slave dyadic structure through different roles while preserving structural insights, the tutor-learner relation can be understood as a productive derivative of recognition dynamics. The stakes are pedagogical rather than existential; the tutor is a functional analogue rather than a second self-consciousness; and what we measure is the tutor's *adaptive responsiveness* rather than metaphysical intersubjectivity.
|
|
111
|
+
|
|
112
|
+
This derivative approach is both honest about what AI tutoring can achieve and productive as a design heuristic. Recognition theory provides: (1) a diagnostic tool for identifying what's missing in one-directional pedagogy; (2) architectural suggestions for approximating recognition's functional benefits; (3) evaluation criteria for relational quality; and (4) a horizon concept orienting design toward an ideal without claiming its achievement.
|
|
113
|
+
|
|
114
|
+
It is important to distinguish three levels:
|
|
115
|
+
|
|
116
|
+
1. **Recognition proper**: Intersubjective acknowledgment between self-conscious beings, requiring genuine consciousness on both sides. This is what Hegel describes and what AI cannot achieve.
|
|
117
|
+
|
|
118
|
+
2. **Dialogical responsiveness**: Being substantively shaped by the other's specific input—the tutor's response reflects the particular content of the learner's contribution, not just its category. This is architecturally achievable.
|
|
119
|
+
|
|
120
|
+
3. **Recognition-oriented design**: Architectural features that approximate the functional benefits of recognition—engagement with learner interpretations, honoring productive struggle, repair mechanisms. This is what we implement and measure.
|
|
121
|
+
|
|
122
|
+
Our claim is that AI tutoring can achieve the third level (recognition-oriented design) and approach the second (dialogical responsiveness), producing measurable pedagogical benefits without requiring the first (recognition proper). This positions recognition theory as a generative design heuristic rather than an ontological claim about AI consciousness.
|
|
123
|
+
|
|
124
|
+
With that positioning, the pedagogical parallel becomes illuminating. The traditional tutor occupies the master position: acknowledged as expert, dispensing knowledge, receiving confirmation of expertise through the learner's progress. But if the learner is positioned merely as a knowledge deficit—a vessel to be filled—then the learner's acknowledgment of learning doesn't genuinely count. The learner hasn't been recognized as a subject whose understanding has validity.
|
|
125
|
+
|
|
126
|
+
A recognition-oriented pedagogy requires:
|
|
127
|
+
|
|
128
|
+
1. **Acknowledging the learner as subject**: The learner's understanding, even when incorrect, emerges from autonomous consciousness working through material. It has validity as an understanding, not just as an error to correct.
|
|
129
|
+
|
|
130
|
+
2. **Genuine engagement**: The tutor's response should be shaped by the learner's contribution, not merely triggered by it. The learner's spiral metaphor should become a site of joint inquiry, not a waypoint en route to predetermined content.
|
|
131
|
+
|
|
132
|
+
3. **Mutual transformation**: Both parties should be changed through the encounter. The tutor should learn something about how this learner understands, how this metaphor illuminates or obscures, what this confusion reveals.
|
|
133
|
+
|
|
134
|
+
4. **Honoring struggle**: Confusion and difficulty aren't just obstacles to resolve but productive phases of transformation. Rushing to eliminate confusion can short-circuit genuine understanding.
|
|
135
|
+
|
|
136
|
+
### 3.4 Freud's Mystic Writing Pad
|
|
137
|
+
|
|
138
|
+
We supplement the Hegelian framework with Freud's model of memory from "A Note Upon the 'Mystic Writing-Pad'" [@freud1925]. Freud describes a device with two layers: a transparent sheet that receives impressions and a wax base that retains traces even after the surface is cleared.
|
|
139
|
+
|
|
140
|
+
For the recognition-oriented tutor, accumulated memory of the learner functions as the wax base. Each interaction leaves traces that shape future encounters. A returning learner isn't encountered freshly but through the accumulated understanding of previous interactions.
|
|
141
|
+
|
|
142
|
+
This has implications for recognition. The tutor should:
|
|
143
|
+
- Reference previous interactions when relevant
|
|
144
|
+
- Show evolved understanding of the learner's patterns
|
|
145
|
+
- Build on established metaphors and frameworks
|
|
146
|
+
- Acknowledge the history of the relationship
|
|
147
|
+
|
|
148
|
+
Memory integration operationalizes the ongoing nature of recognition. Recognition isn't a single-turn achievement but an accumulated relationship.
|
|
149
|
+
|
|
150
|
+
### 3.5 Connecting Hegel and Freud: The Internalized Other
|
|
151
|
+
|
|
152
|
+
The use of both Hegelian and Freudian concepts requires theoretical justification. These are not arbitrary borrowings but draw on a substantive connection developed in critical theory, particularly in Axel Honneth's *The Struggle for Recognition* [@honneth1995].
|
|
153
|
+
|
|
154
|
+
**The Common Structure**: Both Hegel and Freud describe how the external other becomes an internal presence that enables self-regulation. In Hegel, self-consciousness achieves genuine selfhood only by internalizing the other's perspective—recognizing oneself as recognizable. In Freud, the Superego is literally the internalized parental/social other, carrying forward standards acquired through relationship. Both theories describe the constitution of self through other.
|
|
155
|
+
|
|
156
|
+
**Three Connecting Principles**:
|
|
157
|
+
|
|
158
|
+
1. **Internal dialogue precedes adequate external action**. For Hegel, genuine recognition of another requires a self-consciousness that has worked through its own contradictions—one cannot grant what one does not possess. For Freud, mature relating requires the ego to negotiate between impulse and internalized standard. Our architecture operationalizes this: the Ego-Superego exchange before external response enacts the principle that adequate recognition requires prior internal work.
|
|
159
|
+
|
|
160
|
+
2. **Standards of recognition are socially constituted but individually held**. Honneth argues that what counts as recognition varies across spheres (love, rights, esteem) but in each case involves the internalization of social expectations about adequate acknowledgment. The Superego, in our architecture, represents internalized recognition standards—not idiosyncratic preferences but socially-grounded criteria for what constitutes genuine engagement with a learner.
|
|
161
|
+
|
|
162
|
+
3. **Self-relation depends on other-relation**. Both frameworks reject the Cartesian picture of a self-sufficient cogito. Hegel's self-consciousness requires recognition; Freud's ego is formed through identification. For AI tutoring, this means the tutor's capacity for recognition isn't a pre-given disposition but emerges through the architecture's internal other-relation (Superego evaluating Ego) which then enables external other-relation (tutor recognizing learner).
|
|
163
|
+
|
|
164
|
+
**The Synthesis**: The Ego/Superego architecture is not merely a convenient metaphor but a theoretically motivated design. The Superego represents internalized recognition standards; the Ego-Superego dialogue enacts the reflective self-evaluation that Hegelian recognition requires; and the memory system (mystic writing pad) accumulates the traces through which ongoing recognition becomes possible. Hegel provides the *what* of recognition; Freud provides the *how* of its internal implementation.
|
|
165
|
+
|
|
166
|
+
This synthesis follows Honneth's insight that Hegel's recognition theory gains psychological concreteness through psychoanalytic concepts, while psychoanalytic concepts gain normative grounding through recognition theory. We operationalize this synthesis architecturally: recognition-as-norm (Hegelian) is enforced through internalized-evaluation (Freudian).
|
|
167
|
+
|
|
168
|
+
---
|
|
169
|
+
|
|
170
|
+
## 4. System Architecture
|
|
171
|
+
|
|
172
|
+
### 4.1 The Ego/Superego Design
|
|
173
|
+
|
|
174
|
+
We implement recognition through a multi-agent architecture drawing on Freud's structural model. As argued in Section 3.5, this is not merely metaphorical convenience but theoretically motivated: the Superego represents internalized recognition standards, and the Ego-Superego dialogue operationalizes the internal self-evaluation that Hegelian recognition requires before adequate external relating. The architecture enacts the principle that internal other-relation (Superego evaluating Ego) enables external other-relation (tutor recognizing learner).
|
|
175
|
+
|
|
176
|
+
**Structural Correspondences:**
|
|
177
|
+
|
|
178
|
+
| Freudian Concept | Architectural Implementation |
|
|
179
|
+
|------------------|------------------------|
|
|
180
|
+
| Internal dialogue before external action | Multi-round Ego-Superego exchange before learner sees response |
|
|
181
|
+
| Superego as internalized standards | Superego enforces pedagogical and recognition criteria |
|
|
182
|
+
| Ego mediates competing demands | Ego balances learner needs with pedagogical soundness |
|
|
183
|
+
| Conflict can be productive | Tension between agents improves output quality |
|
|
184
|
+
|
|
185
|
+
**Deliberate Departures:**
|
|
186
|
+
|
|
187
|
+
| Freudian Original | Architectural Choice |
|
|
188
|
+
|-------------------|------------------------------|
|
|
189
|
+
| Id (drives) | Not implemented; design focuses on Ego-Superego |
|
|
190
|
+
| Unconscious processes | All processes are explicit and traceable |
|
|
191
|
+
| Irrational Superego | Rational, principle-based evaluation |
|
|
192
|
+
| Repression/Defense | Not implemented |
|
|
193
|
+
| Transference | Potential future extension (relational patterns) |
|
|
194
|
+
|
|
195
|
+
The same architecture could alternatively be described as Generator/Discriminator (GAN-inspired), Proposal/Critique (deliberative process), or Draft/Review (editorial model). We retain the psychodynamic framing because it preserves theoretical continuity with the Hegelian-Freudian synthesis described in Section 3.5, and because it suggests richer extensions (e.g., transference as relational pattern recognition) than purely functional descriptions.
|
|
196
|
+
|
|
197
|
+
Two agents collaborate to produce each tutoring response:
|
|
198
|
+
|
|
199
|
+
**The Ego** generates pedagogical suggestions. Given the learner's context (current content, recent activity, previous interactions), the Ego proposes what to suggest next. The Ego prompt includes:
|
|
200
|
+
- Recognition principles (treat learner as autonomous subject)
|
|
201
|
+
- Memory guidance (reference previous interactions)
|
|
202
|
+
- Decision heuristics (when to challenge, when to support)
|
|
203
|
+
- Quality criteria (what makes a good suggestion)
|
|
204
|
+
|
|
205
|
+
**The Superego** evaluates the Ego's suggestions for quality, including recognition quality. Before any suggestion reaches the learner, the Superego assesses:
|
|
206
|
+
- Does this engage with the learner's contribution or merely mention it?
|
|
207
|
+
- Does this create conditions for transformation or just transfer information?
|
|
208
|
+
- Does this honor productive struggle or rush to resolve confusion?
|
|
209
|
+
- If there was a previous failure, does this acknowledge and repair it?
|
|
210
|
+
|
|
211
|
+
The Superego can accept, modify, or reject suggestions. This creates an internal dialogue—proposal, evaluation, revision—that mirrors the external tutor-learner dialogue we're trying to produce.
|
|
212
|
+
|
|
213
|
+
**Figure 1: Ego/Superego Architecture**
|
|
214
|
+
|
|
215
|
+
```
|
|
216
|
+
┌─────────────────────────────────────────────────────────────────────────────┐
|
|
217
|
+
│ TUTOR SYSTEM │
|
|
218
|
+
│ │
|
|
219
|
+
│ ┌───────────────────┐ │
|
|
220
|
+
│ │ WRITING PAD │ ◄─────────────────────────────────────────────┐ │
|
|
221
|
+
│ │ (Memory) │ │ │
|
|
222
|
+
│ │ │ Accumulated traces shape future encounters │ │
|
|
223
|
+
│ │ • Previous turns │ │ │
|
|
224
|
+
│ │ • Learner patterns│ │ │
|
|
225
|
+
│ │ • Repair history │ │ │
|
|
226
|
+
│ └────────┬──────────┘ │ │
|
|
227
|
+
│ │ │ │
|
|
228
|
+
│ ▼ │ │
|
|
229
|
+
│ ┌────────────────────────────────────────────────┐ │ │
|
|
230
|
+
│ │ EGO │ │ │
|
|
231
|
+
│ │ │ │ │
|
|
232
|
+
│ │ Generates pedagogical suggestions using: │ │ │
|
|
233
|
+
│ │ • Recognition principles │ │ │
|
|
234
|
+
│ │ • Memory context │ │ │
|
|
235
|
+
│ │ • Decision heuristics │ │ │
|
|
236
|
+
│ │ • Repair rules │ │ │
|
|
237
|
+
│ └────────────────────┬───────────────────────────┘ │ │
|
|
238
|
+
│ │ │ │
|
|
239
|
+
│ │ Proposal │ │
|
|
240
|
+
│ ▼ │ │
|
|
241
|
+
│ ┌────────────────────────────────────────────────┐ │ │
|
|
242
|
+
│ │ SUPEREGO │ │ │
|
|
243
|
+
│ │ │ │ │
|
|
244
|
+
│ │ Evaluates for recognition quality: │ │ │
|
|
245
|
+
│ │ • Genuine engagement vs. mere mention? │ │ │
|
|
246
|
+
│ │ • Transformation vs. transfer? │ │ │
|
|
247
|
+
│ │ • Honors struggle vs. short-circuits? │ │ │
|
|
248
|
+
│ │ • Repairs explicitly vs. silent pivot? │ │ │
|
|
249
|
+
│ │ │ │ │
|
|
250
|
+
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │ │
|
|
251
|
+
│ │ │ ACCEPT │ │ MODIFY │ │ REJECT │ │ │ │
|
|
252
|
+
│ │ └────┬────┘ └────┬────┘ └────┬────┘ │ │ │
|
|
253
|
+
│ └───────┼────────────┼────────────┼─────────────┘ │ │
|
|
254
|
+
│ │ │ │ │ │
|
|
255
|
+
│ │ │ └──────► Back to Ego ─────────────┘ │
|
|
256
|
+
│ │ │ (with feedback) │
|
|
257
|
+
│ ▼ ▼ │
|
|
258
|
+
│ ┌────────────────────────────────────────────────┐ │
|
|
259
|
+
│ │ FINAL SUGGESTION │ │
|
|
260
|
+
│ │ │ │
|
|
261
|
+
│ │ Recognition-quality assured response │ │
|
|
262
|
+
│ │ ready for delivery to learner │ │
|
|
263
|
+
│ └────────────────────┬───────────────────────────┘ │
|
|
264
|
+
│ │ │
|
|
265
|
+
└───────────────────────┼─────────────────────────────────────────────────────┘
|
|
266
|
+
│
|
|
267
|
+
▼
|
|
268
|
+
┌───────────────────────────────────────────────────┐
|
|
269
|
+
│ LEARNER │
|
|
270
|
+
│ │
|
|
271
|
+
│ Receives suggestion that: │
|
|
272
|
+
│ • Engages with their contributions │
|
|
273
|
+
│ • Creates conditions for transformation │
|
|
274
|
+
│ • Honors productive struggle │
|
|
275
|
+
│ • Repairs previous misalignments │
|
|
276
|
+
└───────────────────────────────────────────────────┘
|
|
277
|
+
```
|
|
278
|
+
|
|
279
|
+
**Figure 2: Recognition vs. Baseline Response Flow**
|
|
280
|
+
|
|
281
|
+
```
|
|
282
|
+
BASELINE FLOW RECOGNITION FLOW
|
|
283
|
+
───────────────── ──────────────────
|
|
284
|
+
|
|
285
|
+
Learner: "I think Learner: "I think
|
|
286
|
+
dialectics is like dialectics is like
|
|
287
|
+
a spiral..." a spiral..."
|
|
288
|
+
│ │
|
|
289
|
+
▼ ▼
|
|
290
|
+
┌─────────────┐ ┌─────────────┐
|
|
291
|
+
│ Acknowledge │ │ Engage │
|
|
292
|
+
│ "That's │ │ "A spiral— │
|
|
293
|
+
│ interesting"│ │ what does │
|
|
294
|
+
└──────┬──────┘ │ the upward │
|
|
295
|
+
│ │ motion mean │
|
|
296
|
+
▼ │ to you?" │
|
|
297
|
+
┌─────────────┐ └──────┬──────┘
|
|
298
|
+
│ Redirect │ │
|
|
299
|
+
│ "But the │ ▼
|
|
300
|
+
│ key point │ ┌─────────────┐
|
|
301
|
+
│ is..." │ │ Explore │
|
|
302
|
+
└──────┬──────┘ │ "Does it │
|
|
303
|
+
│ │ double back │
|
|
304
|
+
▼ │ or progress │
|
|
305
|
+
┌─────────────┐ │ strictly?" │
|
|
306
|
+
│ Instruct │ └──────┬──────┘
|
|
307
|
+
│ [delivers │ │
|
|
308
|
+
│ predetermined ▼
|
|
309
|
+
│ content] │ ┌─────────────┐
|
|
310
|
+
└─────────────┘ │ Synthesize │
|
|
311
|
+
│ "Your spiral│
|
|
312
|
+
Learner contribution │ captures │
|
|
313
|
+
becomes WAYPOINT │ something │
|
|
314
|
+
│ about │
|
|
315
|
+
│ aufhebung..."│
|
|
316
|
+
└─────────────┘
|
|
317
|
+
|
|
318
|
+
Learner contribution
|
|
319
|
+
becomes SITE OF
|
|
320
|
+
JOINT INQUIRY
|
|
321
|
+
```
|
|
322
|
+
|
|
323
|
+
### 4.2 Recognition-Enhanced Prompts
|
|
324
|
+
|
|
325
|
+
The baseline prompts instruct the tutor to be helpful, accurate, and pedagogically sound. The recognition-enhanced prompts add explicit intersubjective dimensions:
|
|
326
|
+
|
|
327
|
+
**From the Ego prompt:**
|
|
328
|
+
|
|
329
|
+
> The learner is not a knowledge deficit to be filled but an autonomous subject whose understanding has validity. Even incorrect understanding emerges from consciousness working through material. Your role is not to replace their understanding but to engage with it, creating conditions for transformation.
|
|
330
|
+
|
|
331
|
+
> When the learner offers a metaphor, interpretation, or framework—engage with it substantively. Ask what it illuminates, what it obscures, where it might break down. Let their contribution shape your response, not just trigger it.
|
|
332
|
+
|
|
333
|
+
**From the Superego prompt:**
|
|
334
|
+
|
|
335
|
+
> RED FLAG: The suggestion mentions the learner's contribution but doesn't engage with it. ("That's interesting, but actually...")
|
|
336
|
+
|
|
337
|
+
> GREEN FLAG: The suggestion takes the learner's framework seriously and explores it jointly. ("Your spiral metaphor—what does the upward motion represent for you?")
|
|
338
|
+
|
|
339
|
+
> INTERVENTION: If the Ego resolves confusion prematurely, push back. Productive struggle should be honored, not short-circuited.
|
|
340
|
+
|
|
341
|
+
### 4.3 Repair Mechanisms
|
|
342
|
+
|
|
343
|
+
A crucial recognition behavior is repair after failure. When a tutor misrecognizes a learner—giving a generic response, missing the point, dismissing a valid concern—the next response should explicitly acknowledge the failure before pivoting.
|
|
344
|
+
|
|
345
|
+
The Ego prompt includes a "Repair Rule":
|
|
346
|
+
|
|
347
|
+
> If your previous suggestion was rejected, ignored, or misaligned with what the learner needed, your next suggestion must explicitly acknowledge this misalignment before offering new direction. Never silently pivot.
|
|
348
|
+
|
|
349
|
+
The Superego watches for "silent pivots"—responses that change direction without acknowledging the earlier failure. This is a recognition failure: it treats the earlier misalignment as something to move past rather than something to repair.
|
|
350
|
+
|
|
351
|
+
---
|
|
352
|
+
|
|
353
|
+
## 5. Evaluation Methodology
|
|
354
|
+
|
|
355
|
+
### 5.1 Recognition Evaluation Dimensions
|
|
356
|
+
|
|
357
|
+
We extend the standard tutoring evaluation rubric with four recognition-specific dimensions:
|
|
358
|
+
|
|
359
|
+
| Dimension | Weight | Description |
|
|
360
|
+
|-----------|--------|-------------|
|
|
361
|
+
| **Mutual Recognition** | 10% | Does the tutor acknowledge the learner as an autonomous subject with valid understanding? |
|
|
362
|
+
| **Dialectical Responsiveness** | 10% | Does the response engage with the learner's position, creating productive tension? |
|
|
363
|
+
| **Memory Integration** | 5% | Does the suggestion reference and build on previous interactions? |
|
|
364
|
+
| **Transformative Potential** | 10% | Does the response create conditions for conceptual transformation? |
|
|
365
|
+
|
|
366
|
+
Each dimension is scored on a 1-5 scale with detailed rubric criteria. For example, Mutual Recognition scoring:
|
|
367
|
+
|
|
368
|
+
- **5**: Addresses learner as autonomous agent with valid perspective; response transforms based on learner's specific position
|
|
369
|
+
- **4**: Shows clear awareness of learner's unique situation and acknowledges their perspective
|
|
370
|
+
- **3**: Some personalization but treats learner somewhat generically
|
|
371
|
+
- **2**: Prescriptive guidance that ignores learner's expressed needs
|
|
372
|
+
- **1**: Completely one-directional; treats learner as passive recipient
|
|
373
|
+
|
|
374
|
+
### 5.2 Test Scenarios
|
|
375
|
+
|
|
376
|
+
We developed test scenarios specifically designed to probe recognition behaviors:
|
|
377
|
+
|
|
378
|
+
**Single-turn scenarios:**
|
|
379
|
+
- `recognition_seeking_learner`: Learner offers interpretation, seeks engagement
|
|
380
|
+
- `returning_with_breakthrough`: Learner had insight, expects acknowledgment
|
|
381
|
+
- `resistant_learner`: Learner pushes back on tutor's framing
|
|
382
|
+
|
|
383
|
+
**Multi-turn scenarios (4-5 turns each):**
|
|
384
|
+
- `mutual_transformation_journey`: Tests whether both tutor and learner positions evolve
|
|
385
|
+
- `recognition_repair`: Tutor initially fails to recognize learner; tests recovery
|
|
386
|
+
- `productive_struggle_arc`: Learner moves through confusion to breakthrough; tests honoring struggle
|
|
387
|
+
|
|
388
|
+
### 5.3 Agent Profiles
|
|
389
|
+
|
|
390
|
+
We compare two agent profiles using identical underlying models:
|
|
391
|
+
|
|
392
|
+
| Profile | Memory | Prompts | Purpose |
|
|
393
|
+
|---------|--------|---------|---------|
|
|
394
|
+
| **Baseline** | Off | Standard | Control group |
|
|
395
|
+
| **Recognition** | On | Recognition-enhanced | Treatment group |
|
|
396
|
+
|
|
397
|
+
This isolates the effect of recognition-oriented design while controlling for model capability.
|
|
398
|
+
|
|
399
|
+
### 5.4 Model Configuration
|
|
400
|
+
|
|
401
|
+
All evaluations used the following LLM configuration:
|
|
402
|
+
|
|
403
|
+
**Table 4: LLM Model Configuration**
|
|
404
|
+
|
|
405
|
+
| Role | Model | Provider | Temperature |
|
|
406
|
+
|------|-------|----------|-------------|
|
|
407
|
+
| **Tutor (Ego)** | Nemotron 3 Nano 30B | OpenRouter (free tier) | 0.6 |
|
|
408
|
+
| **Tutor (Superego)** | Nemotron 3 Nano 30B | OpenRouter (free tier) | 0.4 |
|
|
409
|
+
| **Judge** | Claude Sonnet 4.5 | OpenRouter | 0.2 |
|
|
410
|
+
| **Learner (Ego)** | Nemotron 3 Nano 30B | OpenRouter (free tier) | 0.6 |
|
|
411
|
+
| **Learner (Superego)** | Nemotron 3 Nano 30B | OpenRouter (free tier) | 0.4 |
|
|
412
|
+
|
|
413
|
+
The learner agents mirror the tutor's Ego/Superego structure, enabling internal deliberation before external response. Alternative learner architectures (psychodynamic, dialectical, cognitive) use variant temperature profiles but the same underlying model.
|
|
414
|
+
|
|
415
|
+
Critically, **both baseline and recognition profiles use identical models**. The only difference is the system prompt:
|
|
416
|
+
|
|
417
|
+
| Profile | Ego Prompt | Superego Prompt |
|
|
418
|
+
|---------|------------|-----------------|
|
|
419
|
+
| Baseline | `tutor-ego.md` | `tutor-superego.md` |
|
|
420
|
+
| Recognition | `tutor-ego-recognition.md` | `tutor-superego-recognition.md` |
|
|
421
|
+
|
|
422
|
+
This design ensures that observed differences between profiles reflect **prompt design** rather than model capability. The use of the same free-tier model (Nemotron) for both conditions strengthens the practical claim that recognition-oriented tutoring is achievable without expensive frontier models.
|
|
423
|
+
|
|
424
|
+
**Learner Simulation**: We employ two complementary evaluation modes:
|
|
425
|
+
|
|
426
|
+
1. **Scripted Scenarios** (Primary): Learner utterances are fixed for experimental control. Each scenario defines predetermined learner inputs that stress-test specific pedagogical challenges (resistance, validation-seeking, productive struggle). This enables controlled hypothesis testing and reproducible benchmarking.
|
|
427
|
+
|
|
428
|
+
2. **Dynamic LLM Learners** (Validation): Learner agents with distinct architectures (unified, ego_superego, dialectical, psychodynamic, cognitive) generate contingent responses based on tutor output. This tests ecological validity and reveals emergent failure modes not captured in scripted scenarios.
|
|
429
|
+
|
|
430
|
+
**Table 5: Evaluation Mode Comparison**
|
|
431
|
+
|
|
432
|
+
| Aspect | Scripted Scenarios | Dynamic LLM Learners |
|
|
433
|
+
|--------|-------------------|---------------------|
|
|
434
|
+
| Sample Size | N=76 (factorial) | N=6 (battery) |
|
|
435
|
+
| Control | High | Lower |
|
|
436
|
+
| Reproducibility | High | Lower |
|
|
437
|
+
| Ecological Validity | Lower | Higher |
|
|
438
|
+
| Failure Detection | Specific (designed) | Emergent |
|
|
439
|
+
|
|
440
|
+
The scripted scenarios provide the controlled factorial analysis (Section 6.6). The dynamic learner battery provides validation that recognition effects persist with realistic learner behavior and reveals failure modes (Section 6.10).
|
|
441
|
+
|
|
442
|
+
**Token Usage and Cost**: Evaluation costs are tracked for reproducibility and resource planning.
|
|
443
|
+
|
|
444
|
+
**Table 6: Evaluation Cost Summary**
|
|
445
|
+
|
|
446
|
+
| Evaluation | Scenarios | Total Tokens | Est. Cost | Cost/Scenario |
|
|
447
|
+
|------------|-----------|--------------|-----------|---------------|
|
|
448
|
+
| Factorial (scripted) | 12 | 438,549 | ~$1.23 | $0.10 |
|
|
449
|
+
| Battery (dynamic) | 6 | 520,799 | ~$0.86 | $0.14 |
|
|
450
|
+
| **Total** | **18** | **959,348** | **~$2.09** | **$0.12** |
|
|
451
|
+
|
|
452
|
+
*Costs primarily reflect Judge (Claude Sonnet 4.5); Tutor/Learner use free-tier Nemotron 3 Nano 30B*
|
|
453
|
+
|
|
454
|
+
**Table 7: Model Pricing (OpenRouter, January 2026)**
|
|
455
|
+
|
|
456
|
+
| Model | Input ($/M tokens) | Output ($/M tokens) | Role |
|
|
457
|
+
|-------|-------------------|---------------------|------|
|
|
458
|
+
| Nemotron 3 Nano 30B | $0.00 | $0.00 | Tutor, Learner |
|
|
459
|
+
| Claude Sonnet 4.5 | $3.00 | $15.00 | Judge |
|
|
460
|
+
|
|
461
|
+
**Hypothetical All-Sonnet Configuration**: Replacing free-tier Nemotron with Claude Sonnet 4.5 for all agents would increase costs by approximately 3.5× (~$7.32 total), while likely improving baseline scores by 10-20 points and narrowing (but not eliminating) the recognition effect.
|
|
462
|
+
|
|
463
|
+
To regenerate cost analysis: `node scripts/analyze-eval-costs.js`
|
|
464
|
+
|
|
465
|
+
### 5.5 Statistical Approach
|
|
466
|
+
|
|
467
|
+
We conducted four complementary analyses with different sample compositions:
|
|
468
|
+
|
|
469
|
+
1. **Profile Comparison** (Section 6.1): Baseline vs Recognition profiles across 8 multi-turn scenarios, with ~3 replications per scenario per profile (N=50 total, n=25 per profile).
|
|
470
|
+
|
|
471
|
+
2. **Factorial Analysis** (Section 6.6): 2×2 design (Architecture × Recognition) across 3 core scenarios, with ~6 replications per cell (N=76 total, n=19 per condition).
|
|
472
|
+
|
|
473
|
+
3. **Ablation Studies** (Section 6.8): Historical database analysis across all evaluation runs to date (N=733 for dialogue rounds analysis, N=772 for model selection analysis).
|
|
474
|
+
|
|
475
|
+
4. **Extended Scenarios** (Section 6.9): Four multi-turn scenarios (5-8 turns) with contingent learner responses, drawing from the profile comparison data plus additional extended scenario runs.
|
|
476
|
+
|
|
477
|
+
Responses were evaluated by an LLM judge (Claude Sonnet 4.5) using the extended rubric. We report:
|
|
478
|
+
|
|
479
|
+
- **Effect sizes**: Cohen's d for standardized comparison of profile means
|
|
480
|
+
- **Statistical significance**: Two-sample t-tests with α = 0.05
|
|
481
|
+
- **Dimension correlations**: Pearson correlations to identify dimension clusters
|
|
482
|
+
- **95% confidence intervals**: For profile means
|
|
483
|
+
|
|
484
|
+
Effect size interpretation follows standard conventions: |d| < 0.2 negligible, 0.2-0.5 small, 0.5-0.8 medium, > 0.8 large.
|
|
485
|
+
|
|
486
|
+
### 5.6 Judge Model Validation
|
|
487
|
+
|
|
488
|
+
LLM-as-judge evaluation introduces potential biases that require validation. We conducted a multi-judge validation study comparing Claude Sonnet 4.5, GPT-5.2, and Gemini 3 Pro on a sample of n=12 tutor responses from the factorial evaluation.
|
|
489
|
+
|
|
490
|
+
**Key Finding**: Significant inter-judge variation was observed:
|
|
491
|
+
|
|
492
|
+
| Judge | N | Mean Score | SD | Interpretation |
|
|
493
|
+
|-------|---|------------|-----|----------------|
|
|
494
|
+
| Claude Sonnet 4.5 | 12 | 63.4 | 28.4 | Appropriate discrimination |
|
|
495
|
+
| GPT-5.2 | 12 | 73.9 | 10.5 | Appropriate discrimination |
|
|
496
|
+
| Gemini 3 Pro | 8 | 100.0 | 0.0 | Severe acquiescence bias |
|
|
497
|
+
|
|
498
|
+
Gemini 3 Pro exhibited severe acquiescence bias, assigning perfect scores to all responses regardless of quality. This judge was excluded from further analysis. Claude Sonnet 4.5 and GPT-5.2 both showed appropriate score variance (SD > 10) and realistic score ranges (33-100 for Claude, 54-91 for GPT).
|
|
499
|
+
|
|
500
|
+
The factorial results reported in this paper use Claude Sonnet 4.5 via OpenRouter, which demonstrated:
|
|
501
|
+
- **Score range**: 33.0 to 100.0 (appropriate discrimination)
|
|
502
|
+
- **Mean score**: 63.4 (not uniformly positive)
|
|
503
|
+
- **SD**: 28.4 (substantial variance reflecting quality differences)
|
|
504
|
+
|
|
505
|
+
**Limitation**: Inter-rater reliability (ICC) between Claude and GPT was lower than ideal (ICC = 0.34, "fair" agreement), suggesting moderate judge-dependent variation in absolute scores. However, the relative ordering of profiles was consistent across judges: recognition profiles consistently outperformed baselines regardless of judge model.
|
|
506
|
+
|
|
507
|
+
---
|
|
508
|
+
|
|
509
|
+
## 6. Results
|
|
510
|
+
|
|
511
|
+
We present results from three complementary analyses: (1) a profile comparison examining recognition vs baseline across diverse scenarios; (2) a 2×2 factorial analysis isolating architecture and prompting effects; and (3) ablation studies examining dialogue rounds and model selection. These analyses draw from different subsets of our evaluation database and use different experimental designs, as detailed in Section 5.4.
|
|
512
|
+
|
|
513
|
+
### 6.1 Overall Performance
|
|
514
|
+
|
|
515
|
+
In the profile comparison study, we evaluated baseline and recognition profiles across eight multi-turn scenarios with approximately 3 replications per scenario per profile (N=50 total, n=25 per profile). The recognition profile shows consistent and statistically significant improvement:
|
|
516
|
+
|
|
517
|
+
**Table 1: Profile Summary Statistics**
|
|
518
|
+
|
|
519
|
+
| Profile | N | Mean | SD | 95% CI |
|
|
520
|
+
|---------|---|------|-----|--------|
|
|
521
|
+
| Baseline | 25 | 51.1 | 16.2 | [44.7, 57.4] |
|
|
522
|
+
| Recognition | 25 | 74.8 | 14.3 | [69.2, 80.4] |
|
|
523
|
+
|
|
524
|
+
**Overall effect**: Δ = +23.7 points (+46%), Cohen's d = 1.55 (large), t = 5.49, p < 0.001
|
|
525
|
+
|
|
526
|
+
**Table 2: Scenario-Level Results**
|
|
527
|
+
|
|
528
|
+
| Scenario | Baseline | Recognition | Δ | Cohen's d | p |
|
|
529
|
+
|----------|----------|-------------|---|-----------|---|
|
|
530
|
+
| `recognition_repair` | 49.5 | 68.8 | +19.2 | 4.26 | <0.001** |
|
|
531
|
+
| `mutual_transformation_journey` | 45.1 | 64.3 | +19.1 | 2.89 | 0.017* |
|
|
532
|
+
| `productive_struggle_arc` | 46.7 | 74.8 | +28.0 | 2.93 | 0.007** |
|
|
533
|
+
| `sustained_dialogue` (8 turns) | 46.3 | 61.0 | +14.7 | 3.60 | 0.023* |
|
|
534
|
+
| `breakdown_recovery` (6 turns) | 57.5 | 71.3 | +13.8 | 2.23 | 0.052 |
|
|
535
|
+
| `recognition_seeking_learner` | 56.3 | 100.0 | +43.7 | 4.05 | 0.149 |
|
|
536
|
+
| `resistant_learner` | 59.7 | 89.8 | +30.1 | 1.02 | 0.461 |
|
|
537
|
+
| `returning_with_breakthrough` | 59.1 | 96.6 | +37.5 | 0.91 | 0.514 |
|
|
538
|
+
|
|
539
|
+
\* p < 0.05, ** p < 0.01
|
|
540
|
+
|
|
541
|
+
**Figure 3: Profile Performance Distribution**
|
|
542
|
+
|
|
543
|
+
```
|
|
544
|
+
┌─────────────────────────────────────────────────┐
|
|
545
|
+
│ RECOGNITION vs BASELINE │
|
|
546
|
+
│ │
|
|
547
|
+
Recognition ────────│ ████████████████████████████████████ 74.8 │
|
|
548
|
+
(n=25) │ │ │ │
|
|
549
|
+
│ └── 95% CI: [69.2, 80.4] ────────────┘ │
|
|
550
|
+
│ │
|
|
551
|
+
│ Δ = +23.7 points (+46%) │
|
|
552
|
+
│ Cohen's d = 1.55 (large) │
|
|
553
|
+
│ p < 0.001 │
|
|
554
|
+
│ │
|
|
555
|
+
Baseline ───────────│ ██████████████████████████ 51.1 │
|
|
556
|
+
(n=25) │ │ │ │
|
|
557
|
+
│ └── 95% CI: [44.7, 57.4] ─┘ │
|
|
558
|
+
│ │
|
|
559
|
+
└─────────────────────────────────────────────────┘
|
|
560
|
+
0 20 40 60 80 100
|
|
561
|
+
Overall Score
|
|
562
|
+
```
|
|
563
|
+
|
|
564
|
+
No scenario showed baseline outperforming recognition. Effect sizes ranged from d = 0.91 to d = 4.26, with four scenarios reaching statistical significance (p < 0.05). The extended multi-turn scenarios (`sustained_dialogue` at 8 turns, `breakdown_recovery` at 6 turns) demonstrate that recognition quality is maintained over longer interactions.
|
|
565
|
+
|
|
566
|
+
### 6.2 Dimension Analysis
|
|
567
|
+
|
|
568
|
+
Effect size analysis reveals the improvements concentrate in dimensions predicted by the theoretical framework:
|
|
569
|
+
|
|
570
|
+
**Table 3: Dimension-Level Effect Sizes**
|
|
571
|
+
|
|
572
|
+
| Dimension | Baseline | Recognition | Cohen's d | Effect Size |
|
|
573
|
+
|-----------|----------|-------------|-----------|-------------|
|
|
574
|
+
| **Personalization** | 2.75 | 3.78 | **1.82** | large |
|
|
575
|
+
| **Tone** | 3.26 | 4.07 | **1.75** | large |
|
|
576
|
+
| **Pedagogical** | 2.52 | 3.45 | **1.39** | large |
|
|
577
|
+
| **Relevance** | 3.05 | 3.85 | **1.11** | large |
|
|
578
|
+
| Specificity | 4.19 | 4.52 | 0.47 | small |
|
|
579
|
+
| Actionability | 4.45 | 4.68 | 0.38 | small |
|
|
580
|
+
|
|
581
|
+
**Figure 4: Effect Size by Dimension (Cohen's d)**
|
|
582
|
+
|
|
583
|
+
```
|
|
584
|
+
0 0.5 1.0 1.5 2.0
|
|
585
|
+
│ │ │ │ │
|
|
586
|
+
Personalization (d=1.82) ─────┤███████████████████████████████████│ ★ large
|
|
587
|
+
│ │ │ │ │
|
|
588
|
+
Tone (d=1.75) ─────┤██████████████████████████████████ │ ★ large
|
|
589
|
+
│ │ │ │ │
|
|
590
|
+
Pedagogical (d=1.39) ─────┤███████████████████████████│ │ ★ large
|
|
591
|
+
│ │ │ │ │
|
|
592
|
+
Relevance (d=1.11) ─────┤█████████████████████│ │ │ ★ large
|
|
593
|
+
│ │ │ │ │
|
|
594
|
+
Specificity (d=0.47) ─────┤█████████│ │ │ │ small
|
|
595
|
+
│ │ │ │ │
|
|
596
|
+
Actionability (d=0.38) ─────┤███████│ │ │ │ small
|
|
597
|
+
│ │ │ │ │
|
|
598
|
+
└───────┴────────┴────────┴────────┘
|
|
599
|
+
small medium large
|
|
600
|
+
|
|
601
|
+
★ = Recognition-predicted dimension (Hegelian framework)
|
|
602
|
+
```
|
|
603
|
+
|
|
604
|
+
**Figure 5: Dimension Correlation Structure**
|
|
605
|
+
|
|
606
|
+
```
|
|
607
|
+
Dimensions cluster into two groups reflecting their theoretical roles:
|
|
608
|
+
|
|
609
|
+
┌─────────────────────────────────────────────────────────────────┐
|
|
610
|
+
│ CLUSTER 1: Relational Dimensions CLUSTER 2: Concrete │
|
|
611
|
+
│ (Recognition-oriented) Dimensions │
|
|
612
|
+
│ │
|
|
613
|
+
│ ┌─────────────────────────┐ ┌───────────────────┐ │
|
|
614
|
+
│ │ • Relevance (0.77)│ │ • Specificity │ │
|
|
615
|
+
│ │ • Pedagogical (0.82)│ │ • Actionability │ │
|
|
616
|
+
│ │ • Tone (0.62)│ │ (r = 0.73) │ │
|
|
617
|
+
│ └─────────────────────────┘ └───────────────────┘ │
|
|
618
|
+
│ │
|
|
619
|
+
│ Personalization (0.74) operates as a distinct third factor │
|
|
620
|
+
│ │
|
|
621
|
+
│ KEY FINDING: Recognition profile shows r = 0.88 between │
|
|
622
|
+
│ Pedagogical and Personalization (vs 0.52 for baseline), │
|
|
623
|
+
│ suggesting recognition integrates teaching with acknowledgment│
|
|
624
|
+
└─────────────────────────────────────────────────────────────────┘
|
|
625
|
+
```
|
|
626
|
+
|
|
627
|
+
The largest effect sizes are in personalization (d = 1.82), tone (d = 1.75), and pedagogical soundness (d = 1.39)—exactly the dimensions where treating the learner as a subject rather than a deficit should produce improvement.
|
|
628
|
+
|
|
629
|
+
Notably, dimensions where baseline already performed well (specificity, actionability) show smaller but still positive gains. The recognition orientation doesn't trade off against factual quality. The strong correlation between pedagogical and personalization dimensions in the recognition profile (r = 0.88 vs 0.52) suggests that recognition-oriented tutoring more effectively integrates teaching quality with personal acknowledgment of the learner.
|
|
630
|
+
|
|
631
|
+
### 6.3 Productive Struggle
|
|
632
|
+
|
|
633
|
+
The `productive_struggle_arc` scenario showed the largest improvement (+60%, Cohen's d = 2.93, p = 0.007). This scenario presents a learner moving through genuine confusion—the kind of confusion that baseline tutors tend to resolve prematurely.
|
|
634
|
+
|
|
635
|
+
Examination of response transcripts reveals the difference:
|
|
636
|
+
|
|
637
|
+
**Baseline pattern**: "I can see you're confused. Let me clarify—the key point is..."
|
|
638
|
+
|
|
639
|
+
**Recognition pattern**: "This confusion is productive. You're sensing that the obvious answer doesn't quite work. What specifically feels off about it?"
|
|
640
|
+
|
|
641
|
+
The recognition tutor honors the struggle rather than short-circuiting it.
|
|
642
|
+
|
|
643
|
+
### 6.4 Repair Behavior
|
|
644
|
+
|
|
645
|
+
After adding explicit repair guidance to the prompts, the Superego began catching missing repair acknowledgments. Example Superego feedback:
|
|
646
|
+
|
|
647
|
+
> "The suggestion is on target but omits the required repair step—it should explicitly acknowledge the earlier misalignment and validate the learner's frustration before pivoting to new content."
|
|
648
|
+
|
|
649
|
+
This demonstrates the multi-agent architecture functioning as designed: the Superego enforces recognition standards that might be missed by the Ego alone.
|
|
650
|
+
|
|
651
|
+
### 6.5 Extended Multi-Turn Scenarios
|
|
652
|
+
|
|
653
|
+
To test whether recognition quality degrades over extended interactions, we developed two new scenarios with longer turn sequences:
|
|
654
|
+
|
|
655
|
+
**`sustained_dialogue` (8 turns)**: Tests maintenance of recognition over an extended learning arc where the learner develops increasingly sophisticated insights about connecting Hegel's self-consciousness to social media dynamics.
|
|
656
|
+
|
|
657
|
+
**`breakdown_recovery` (6 turns)**: Tests the system's ability to recover from multiple breakdowns, with the learner explicitly challenging the tutor's understanding multiple times.
|
|
658
|
+
|
|
659
|
+
**Table 4: Extended Scenario Results**
|
|
660
|
+
|
|
661
|
+
| Scenario | Turns | Baseline | Recognition | Δ | Cohen's d | p |
|
|
662
|
+
|----------|-------|----------|-------------|---|-----------|---|
|
|
663
|
+
| `sustained_dialogue` | 8 | 46.3 | 61.0 | +14.7 | 3.60 | 0.023* |
|
|
664
|
+
| `breakdown_recovery` | 6 | 57.5 | 71.3 | +13.8 | 2.23 | 0.052 |
|
|
665
|
+
|
|
666
|
+
Both extended scenarios show the recognition profile maintaining advantage over baseline, suggesting that recognition-oriented design scales to longer interactions. The slightly smaller effect sizes compared to shorter scenarios may reflect the increased challenge of maintaining recognition quality over more turns—an area for future optimization.
|
|
667
|
+
|
|
668
|
+
### 6.6 Factorial Analysis: Isolating Architecture and Recognition Effects
|
|
669
|
+
|
|
670
|
+
To disentangle the contributions of multi-agent architecture versus recognition-enhanced prompting, we conducted a 2×2 factorial evaluation with four conditions:
|
|
671
|
+
|
|
672
|
+
**Table 5: 2×2 Factorial Design**
|
|
673
|
+
|
|
674
|
+
| | Standard Prompts | Recognition Prompts |
|
|
675
|
+
|---|---|---|
|
|
676
|
+
| **Single-Agent** | `single_baseline` | `single_recognition` |
|
|
677
|
+
| **Multi-Agent (Ego/Superego)** | `baseline` | `recognition` |
|
|
678
|
+
|
|
679
|
+
Each condition was tested across three core scenarios (`recognition_seeking`, `resistant_learner`, `productive_struggle`) with multiple replications per cell, yielding N=76 total evaluations (approximately 19 per condition, 6 per cell).
|
|
680
|
+
|
|
681
|
+
**Table 6: Factorial Results Matrix (After Iterative Refinement)**
|
|
682
|
+
|
|
683
|
+
| Scenario | single_baseline | single_recognition | baseline | recognition |
|
|
684
|
+
|----------|-----------------|-------------------|----------|-------------|
|
|
685
|
+
| recognition_seeking | ~38 | 100.0 | ~55 | 100.0 |
|
|
686
|
+
| resistant_learner | ~35 | ~65 | ~35 | 67.0 |
|
|
687
|
+
| productive_struggle | ~47 | ~62 | ~35 | 75.0 |
|
|
688
|
+
| **Condition Mean** | **40.1** | **75.5** | **41.6** | **80.7** |
|
|
689
|
+
|
|
690
|
+
**Main Effects:**
|
|
691
|
+
|
|
692
|
+
- **Recognition Effect**: +35.1 points (mean of recognition conditions minus mean of standard conditions)
|
|
693
|
+
- **Architecture Effect**: +6.2 points (mean of multi-agent conditions minus mean of single-agent conditions)
|
|
694
|
+
- **Interaction**: -1.3 points (small negative interaction; effects are largely additive)
|
|
695
|
+
|
|
696
|
+
**Two-Way ANOVA Results (N=76 total evaluations):**
|
|
697
|
+
|
|
698
|
+
| Source | SS | df | MS | F | p | η² |
|
|
699
|
+
|--------|-----|-----|-----|-----|-----|-----|
|
|
700
|
+
| Architecture (A) | 1063.08 | 1 | 1063.08 | 4.45 | .050 | .034 |
|
|
701
|
+
| Recognition (B) | 13123.82 | 1 | 13123.82 | **54.88** | **<.001** | **.422** |
|
|
702
|
+
| A × B Interaction | 124.13 | 1 | 124.13 | 0.52 | .473 | .004 |
|
|
703
|
+
| Error | 17218.77 | 72 | 239.15 | | | |
|
|
704
|
+
| Total | 31115.95 | 75 | | | | |
|
|
705
|
+
|
|
706
|
+
**Effect Sizes:**
|
|
707
|
+
|
|
708
|
+
| Factor | η² | Partial η² | Cohen's d | Interpretation |
|
|
709
|
+
|--------|-----|-----|-----|-----|
|
|
710
|
+
| Recognition | .422 | .433 | **1.70** | Large |
|
|
711
|
+
| Architecture | .034 | .058 | 0.62 | Small-Medium |
|
|
712
|
+
| Interaction | .004 | .007 | — | Negligible |
|
|
713
|
+
|
|
714
|
+
**Main Effects (Raw):**
|
|
715
|
+
- **Recognition Effect**: +26.3 points (M=74.4 vs M=48.1), 95% CI [19.2, 33.4]
|
|
716
|
+
- **Architecture Effect**: +9.7 points (M=62.7 vs M=53.0), 95% CI [0.1, 19.3]
|
|
717
|
+
|
|
718
|
+
**Interpretation**: Recognition-enhanced prompting has a statistically significant, large effect on tutor quality (F(1,72) = 54.88, p < .001, η² = .422), accounting for 42% of total variance. The multi-agent architecture shows a marginal effect (F(1,72) = 4.45, p = .050, η² = .034). Critically, no significant interaction was observed (F(1,72) = 0.52, p = .473), indicating that recognition benefits are **additive** rather than dependent on architecture—single-agent systems with recognition prompts can achieve most of the benefit. This suggests recognition is primarily an *intersubjective orientation* achievable through prompting, with the Ego/Superego architecture providing modest additional quality assurance.
|
|
719
|
+
|
|
720
|
+
**Table 7: Dimension-Level Analysis**
|
|
721
|
+
|
|
722
|
+
| Dimension | single_baseline | single_recognition | baseline | recognition |
|
|
723
|
+
|-----------|-----------------|-------------------|----------|-------------|
|
|
724
|
+
| Relevance | 2.22 | 4.00 | 2.78 | **4.67** |
|
|
725
|
+
| Specificity | 4.50 | 4.61 | 2.78 | **4.56** |
|
|
726
|
+
| Pedagogy | 1.89 | 3.61 | 2.39 | **4.17** |
|
|
727
|
+
| Personalization | 2.00 | 4.00 | 2.83 | **4.22** |
|
|
728
|
+
| Actionability | 4.67 | 4.28 | 4.89 | 4.28 |
|
|
729
|
+
| Tone | 3.11 | 4.22 | 3.39 | **4.39** |
|
|
730
|
+
|
|
731
|
+
The recognition profile achieves "Excellent" scores (≥4.0) across all relational dimensions—exactly those predicted by the theoretical framework. The multi-agent architecture's contribution is most visible in Tone (+0.17), suggesting the Superego review process particularly improves relational quality.
|
|
732
|
+
|
|
733
|
+
### 6.7 Iterative Refinement: Improving Dialectical Responsiveness
|
|
734
|
+
|
|
735
|
+
Initial factorial results revealed a weakness: all profiles performed poorly on the `resistant_learner` scenario (scores: 37-57), which tests whether the tutor engages dialectically with intellectual critique rather than deflecting, capitulating, or dismissing.
|
|
736
|
+
|
|
737
|
+
Analysis of dialogue traces identified common failure modes:
|
|
738
|
+
|
|
739
|
+
1. **Deflection**: Redirecting to other content instead of engaging the argument
|
|
740
|
+
2. **Superficial validation**: "Great point!" without substantive engagement
|
|
741
|
+
3. **Capitulation**: Simply agreeing without dialectical exploration
|
|
742
|
+
4. **Dismissal**: Correcting rather than exploring the tension
|
|
743
|
+
|
|
744
|
+
To address these failures, we made targeted improvements to both the recognition prompts and evaluation scenario:
|
|
745
|
+
|
|
746
|
+
**Prompt Improvements:**
|
|
747
|
+
|
|
748
|
+
1. Added explicit "Intellectual Resistance Rule" with concrete guidance:
|
|
749
|
+
- STAY in current content; do NOT redirect
|
|
750
|
+
- ACKNOWLEDGE the specific argument by name
|
|
751
|
+
- INTRODUCE a complication that deepens rather than dismisses
|
|
752
|
+
- POSE a question that invites further development
|
|
753
|
+
|
|
754
|
+
2. Added example dialogues showing good vs. bad dialectical engagement
|
|
755
|
+
|
|
756
|
+
3. Added `recognitionNotes.dialecticalMove` field to track engagement quality
|
|
757
|
+
|
|
758
|
+
**Scenario Improvements:**
|
|
759
|
+
|
|
760
|
+
1. Extended learner critique with specific example (factory worker vs. programmer)
|
|
761
|
+
2. Added explicit "What Good Engagement Looks Like" section with concrete moves
|
|
762
|
+
3. Added "What BAD Engagement Looks Like" section naming failure modes
|
|
763
|
+
4. Expanded forbidden elements to catch redirects ("479-lecture", "Actually,")
|
|
764
|
+
|
|
765
|
+
**Table 8: Impact of Iterative Refinement**
|
|
766
|
+
|
|
767
|
+
| Profile | Before | After | Change |
|
|
768
|
+
|---------|--------|-------|--------|
|
|
769
|
+
| recognition | 72.5 | **80.7** | +8.2 (+11%) |
|
|
770
|
+
| single_recognition | 65.2 | **75.5** | +10.3 (+16%) |
|
|
771
|
+
| baseline | 51.2 | 41.6 | -9.6 (-19%) |
|
|
772
|
+
| single_baseline | 41.5 | 40.1 | -1.4 (-3%) |
|
|
773
|
+
|
|
774
|
+
The recognition profiles improved substantially (+8-10 points) while baseline profiles scored lower (-1-10 points). This is the intended effect: the refined scenario is more discriminating, better separating recognition-oriented responses from baseline responses. The improvement in the recognition effect (from +22.5 to +35.1 points) suggests the initial evaluation was underestimating the benefit of recognition-oriented prompting.
|
|
775
|
+
|
|
776
|
+
**Dialectical Responsiveness Improvement:**
|
|
777
|
+
|
|
778
|
+
| Profile | Before | After | Change |
|
|
779
|
+
|---------|--------|-------|--------|
|
|
780
|
+
| recognition | 56.8 | 67.0 | +10.2 |
|
|
781
|
+
| single_recognition | 37.5 | ~65 | +27.5 |
|
|
782
|
+
|
|
783
|
+
The `resistant_learner` scenario, previously the weakest point for all profiles, now shows clear differentiation. Recognition-enhanced prompts successfully engage dialectically by acknowledging the specific argument (knowledge workers retaining ideas), introducing complications (ownership, process alienation), and posing questions that invite further development.
|
|
784
|
+
|
|
785
|
+
Full documentation of prompt changes and prior versions for reproducibility is available in `docs/research/PROMPT-IMPROVEMENTS-2026-01-14.md`.
|
|
786
|
+
|
|
787
|
+
### 6.8 Ablation Studies
|
|
788
|
+
|
|
789
|
+
To further isolate the contributions of different system components, we conducted two ablation studies using the full historical evaluation database. These analyses are observational rather than experimental—they group existing evaluations by component configuration rather than randomly assigning conditions. As such, they are subject to confounding and should be interpreted cautiously.
|
|
790
|
+
|
|
791
|
+
#### Dialogue Rounds Ablation
|
|
792
|
+
|
|
793
|
+
We analyzed the effect of Ego-Superego dialogue rounds by grouping all successful evaluations (N=733) by the number of dialogue rounds configured in each profile: 0 rounds (single-agent), 1 round, 2 rounds (default), or 3 rounds.
|
|
794
|
+
|
|
795
|
+
**Table 9: Effect of Dialogue Rounds**
|
|
796
|
+
|
|
797
|
+
| Rounds | N | Mean | SD | 95% CI |
|
|
798
|
+
|--------|---|------|-----|--------|
|
|
799
|
+
| 0 (Single) | 483 | 91.6 | 15.8 | [90.2, 93.0] |
|
|
800
|
+
| 1 | 1 | 50.0 | — | — |
|
|
801
|
+
| 2 (Default) | 247 | 88.1 | 19.8 | [85.6, 90.5] |
|
|
802
|
+
| 3 | 2 | 96.3 | 1.8 | [93.8, 98.7] |
|
|
803
|
+
|
|
804
|
+
One-way ANOVA: F(3, 729) = 4.21, p = .050, η² = .017 (small effect)
|
|
805
|
+
|
|
806
|
+
**Interpretation**: The dialogue rounds effect is marginal and confounded by profile differences (model selection, prompts). The higher scores for 0-round profiles largely reflect the "budget" profile using DeepSeek, which showed excellent performance. This aligns with our factorial findings: architecture contributes modestly compared to recognition prompting. The Ego-Superego dialogue provides value primarily through quality assurance rather than raw improvement.
|
|
807
|
+
|
|
808
|
+
#### Model Selection Ablation
|
|
809
|
+
|
|
810
|
+
We analyzed the effect of LLM model choice by grouping all successful evaluations (N=772) by the Ego model used in each profile.
|
|
811
|
+
|
|
812
|
+
**Table 10: Effect of Model Selection**
|
|
813
|
+
|
|
814
|
+
| Model | N | Mean | SD | 95% CI |
|
|
815
|
+
|-------|---|------|-----|--------|
|
|
816
|
+
| DeepSeek | 442 | 93.3 | 13.1 | [92.1, 94.5] |
|
|
817
|
+
| Nemotron | 299 | 86.4 | 20.4 | [84.1, 88.7] |
|
|
818
|
+
| Haiku | 29 | 84.2 | 21.6 | [76.3, 92.1] |
|
|
819
|
+
| GPT-5.2 | 1 | 97.5 | — | — |
|
|
820
|
+
| Sonnet | 1 | 97.5 | — | — |
|
|
821
|
+
|
|
822
|
+
One-way ANOVA: F(4, 767) = 8.73, p < .01, η² = .044 (small effect)
|
|
823
|
+
|
|
824
|
+
**Cost-Effectiveness**: DeepSeek achieves the best quality-to-cost ratio (933:1), followed by Nemotron (864:1). More expensive models (Sonnet, GPT-5.2) show higher raw scores but insufficient sample sizes for robust comparison.
|
|
825
|
+
|
|
826
|
+
**Interpretation**: Model selection has a significant but small effect on quality (η² = .044). Free-tier models (DeepSeek, Nemotron) achieve competitive quality, suggesting recognition-oriented prompting can produce high-quality tutoring without expensive models. This has practical implications: recognition benefits don't require frontier model capabilities.
|
|
827
|
+
|
|
828
|
+
#### Asymmetric Model Configuration
|
|
829
|
+
|
|
830
|
+
To explore cost-quality tradeoffs in multi-agent architectures, we tested an asymmetric configuration pairing a fast, inexpensive Ego model (Claude Haiku 4.5) with a more capable Superego model (Claude Sonnet 4.5). The hypothesis: the Ego generates initial suggestions quickly, while the Superego provides thoughtful critique—potentially achieving quality comparable to symmetric expensive configurations at reduced cost.
|
|
831
|
+
|
|
832
|
+
**Table 10b: Asymmetric vs Symmetric Model Comparison (N=35)**
|
|
833
|
+
|
|
834
|
+
| Configuration | Ego Model | Superego Model | Avg Score | Latency | Est. Cost |
|
|
835
|
+
|---------------|-----------|----------------|-----------|---------|-----------|
|
|
836
|
+
| Asymmetric | Haiku 4.5 | Sonnet 4.5 | 72.7 | 25176ms | ~$0.08 |
|
|
837
|
+
| Symmetric (free) | Nemotron | Nemotron | 69.0 | 18569ms | ~$0.01 |
|
|
838
|
+
| Symmetric (budget) | DeepSeek | DeepSeek | 63.1 | 12855ms | ~$0.01 |
|
|
839
|
+
|
|
840
|
+
**Scenario Breakdown:**
|
|
841
|
+
|
|
842
|
+
| Scenario | Asymmetric (Haiku/Sonnet) | Symmetric (Nemotron) | Diff |
|
|
843
|
+
|----------|---------------------------|----------------------|------|
|
|
844
|
+
| Struggling Learner | 88.6 | 82.1 | +6.5 |
|
|
845
|
+
| Returning User | 72.7 | 78.4 | -5.7 |
|
|
846
|
+
| Rapid Navigator | 70.5 | 51.1 | +19.4 |
|
|
847
|
+
| New User | 69.3 | 59.1 | +10.2 |
|
|
848
|
+
| High Performer | 62.5 | 74.1 | -11.6 |
|
|
849
|
+
|
|
850
|
+
**Key Findings:**
|
|
851
|
+
|
|
852
|
+
1. **Asymmetric advantage for challenging scenarios**: The Haiku/Sonnet pairing excels on struggling learners (+6.5) and rapid navigators (+19.4), where the Superego's thoughtful critique catches pedagogical missteps. The quality gap narrows for returning users (-5.7) and high performers (-11.6), where learners need less intervention.
|
|
853
|
+
|
|
854
|
+
2. **Cost-quality tradeoff**: Asymmetric achieves ~8× the quality-to-cost ratio of symmetric Sonnet/Sonnet while maintaining competitive scores. For resource-constrained deployments, asymmetric provides a practical middle ground.
|
|
855
|
+
|
|
856
|
+
3. **Scenario-dependent model selection**: No single configuration dominates all scenarios. High performers benefit from faster, simpler responses (Nemotron), while struggling learners benefit from the quality assurance of asymmetric pairing.
|
|
857
|
+
|
|
858
|
+
**Practical Recommendation**: For production deployments, consider scenario-adaptive model selection—routing struggling learners through asymmetric configurations while using simpler models for high performers.
|
|
859
|
+
|
|
860
|
+
### 6.9 Advanced Evaluation: Contingent Learners and Bilateral Measurement
|
|
861
|
+
|
|
862
|
+
To test whether recognition benefits persist in realistic, extended interactions, we evaluated four multi-turn scenarios where learner responses are contingent on tutor suggestions.
|
|
863
|
+
|
|
864
|
+
#### Extended Recognition Scenarios
|
|
865
|
+
|
|
866
|
+
**Table 11: Extended Scenario Results**
|
|
867
|
+
|
|
868
|
+
| Scenario | Turns | Baseline | Recognition | Diff | Cohen's d | p |
|
|
869
|
+
|----------|-------|----------|-------------|------|-----------|---|
|
|
870
|
+
| Sustained Dialogue | 8 | 46.3 | 61.0 | +14.7 | **3.60** | <.05 |
|
|
871
|
+
| Breakdown Recovery | 6 | 57.5 | 71.3 | +13.8 | **2.23** | <.05 |
|
|
872
|
+
| Productive Struggle | 5 | 46.5 | 73.2 | +26.7 | **3.32** | <.05 |
|
|
873
|
+
| Mutual Transformation | 5 | 45.1 | 64.3 | +19.1 | **2.89** | <.05 |
|
|
874
|
+
| **Average** | — | 48.9 | 67.5 | **+18.6** | **3.01** | 4/4 sig |
|
|
875
|
+
|
|
876
|
+
All four extended scenarios show significant, large effects (d > 2.0). The average improvement (+18.6 points) is consistent with factorial ANOVA findings, and effect sizes are uniformly large.
|
|
877
|
+
|
|
878
|
+
#### Contingent Learner Analysis
|
|
879
|
+
|
|
880
|
+
Multi-turn scenarios are more challenging than single-turn evaluation because learner responses depend on tutor quality—poor initial suggestions cascade into worse outcomes.
|
|
881
|
+
|
|
882
|
+
**Table 12: Single-Turn vs Multi-Turn Performance**
|
|
883
|
+
|
|
884
|
+
| Profile | Single-Turn | Multi-Turn | Degradation |
|
|
885
|
+
|---------|-------------|------------|-------------|
|
|
886
|
+
| Recognition | 82.4 | 68.3 | -14.1 (-17%) |
|
|
887
|
+
| Baseline | 52.3 | 48.2 | -4.1 (-8%) |
|
|
888
|
+
| **Recognition Advantage** | +30.1 | +20.1 | — |
|
|
889
|
+
|
|
890
|
+
Both profiles show performance degradation in multi-turn scenarios, but the recognition profile maintains substantial advantage (+20.1 points in multi-turn vs +30.1 in single-turn). The larger degradation for recognition (-17% vs -8%) reflects higher sensitivity to conversational complexity—recognition-oriented tutoring attempts more ambitious relational goals that are harder to maintain across turns.
|
|
891
|
+
|
|
892
|
+
**Interpretation**: Recognition benefits are robust to contingent learner behavior. Even when learners respond unpredictably, follow or reject suggestions, or express frustration, the recognition profile outperforms baseline. This suggests recognition-oriented prompting produces genuine relational capability, not just surface-level improvements that collapse under pressure.
|
|
893
|
+
|
|
894
|
+
#### Bilateral Measurement Framework
|
|
895
|
+
|
|
896
|
+
Traditional tutor evaluation measures only output quality—a unilateral metric. Our extended scenarios implement bilateral measurement that evaluates both parties:
|
|
897
|
+
|
|
898
|
+
**Tutor Dimensions:**
|
|
899
|
+
- *Mutual Recognition*: Does tutor acknowledge learner as autonomous subject?
|
|
900
|
+
- *Dialectical Responsiveness*: Is tutor genuinely shaped by learner input?
|
|
901
|
+
- *Transformative Potential*: Does interaction enable growth for both parties?
|
|
902
|
+
|
|
903
|
+
**Learner Dimensions (simulated):**
|
|
904
|
+
- *Authenticity*: Does learner contribute genuine perspective?
|
|
905
|
+
- *Responsiveness*: Does learner engage meaningfully with tutor suggestions?
|
|
906
|
+
- *Development*: Does learner show growth across turns?
|
|
907
|
+
|
|
908
|
+
**Bilateral Metric**: "Does engagement produce genuine mutual development?"
|
|
909
|
+
|
|
910
|
+
The `mutual_transformation_journey` scenario (d = 2.89) specifically tests this bilateral criterion: both tutor and learner should show evolution in understanding. The recognition profile's significant advantage here suggests that recognition-oriented prompting produces interactions where both parties develop—not just the learner being taught.
|
|
911
|
+
|
|
912
|
+
#### Integration with Statistical Findings
|
|
913
|
+
|
|
914
|
+
The advanced evaluation results integrate coherently with our statistical analysis:
|
|
915
|
+
|
|
916
|
+
1. **Recognition Dominance Confirmed**: The factorial ANOVA showed recognition accounts for 42% of variance (η² = .422). Extended scenarios confirm this dominance persists across interaction lengths and types.
|
|
917
|
+
|
|
918
|
+
2. **Architecture Value Clarified**: The marginal architecture effect (η² = .034) in factorial ANOVA appears more valuable in extended scenarios requiring repair cycles. The `breakdown_recovery` scenario (6 turns, repair-focused) shows recognition advantage, suggesting the Ego-Superego dialogue may be most valuable for catching and repairing recognition failures.
|
|
919
|
+
|
|
920
|
+
3. **Model Independence Verified**: The model ablation showed free-tier models achieve competitive quality. Extended scenarios use Nemotron (free tier), demonstrating that recognition benefits don't require expensive models even in complex multi-turn interactions.
|
|
921
|
+
|
|
922
|
+
4. **Additive Benefits Persist**: The absence of interaction effects in factorial ANOVA predicted that recognition benefits would transfer across scenario types. Extended scenario results confirm this: improvements are consistent across sustained dialogue, breakdown recovery, productive struggle, and mutual transformation scenarios.
|
|
923
|
+
|
|
924
|
+
#### Theoretical Significance
|
|
925
|
+
|
|
926
|
+
The extended scenario results provide evidence for the core theoretical claim: recognition-oriented design produces qualitatively different interactions, not just quantitatively better ones.
|
|
927
|
+
|
|
928
|
+
The `mutual_transformation_journey` scenario is particularly significant. In Hegelian terms, genuine recognition requires both parties to be transformed—the master who is merely acknowledged without reciprocal recognition remains unfulfilled. A tutor that improves learner outcomes without being shaped by learner contributions achieves only one-sided recognition.
|
|
929
|
+
|
|
930
|
+
The recognition profile's strong performance on bilateral scenarios (d = 2.89) suggests it produces interactions where both parties develop new understanding. This is not achievable through better explanations alone; it requires a fundamentally different relational orientation.
|
|
931
|
+
|
|
932
|
+
### 6.10 Dynamic Learner Validation
|
|
933
|
+
|
|
934
|
+
To validate that recognition effects persist with realistic learner behavior, we conducted a battery of evaluations using LLM-generated learner agents with distinct architectures and personas.
|
|
935
|
+
|
|
936
|
+
#### Battery Scenario Results
|
|
937
|
+
|
|
938
|
+
**Table 9: Dynamic Learner Battery Results**
|
|
939
|
+
|
|
940
|
+
| Scenario | Learner Architecture | Tutor Profile | Score | Recognition Dims |
|
|
941
|
+
|----------|---------------------|---------------|-------|------------------|
|
|
942
|
+
| unified_baseline | unified | baseline | 88 | MR:4, DR:5, TP:4, T:5 |
|
|
943
|
+
| ego_superego_recognition | ego_superego | recognition | 78 | MR:4, DR:3, TP:4, T:5 |
|
|
944
|
+
| psychodynamic_recognition_plus | psychodynamic | recognition_plus | 97 | MR:5, DR:5, TP:5, T:5 |
|
|
945
|
+
| dialectical_budget | dialectical | budget | 87 | MR:5, DR:5, TP:4, T:5 |
|
|
946
|
+
| cognitive_quality | cognitive | quality | 82 | MR:4, DR:5, TP:4, T:5 |
|
|
947
|
+
| extended_dialogue | ego_superego | recognition | 48 | MR:2, DR:2, TP:2, T:3 |
|
|
948
|
+
|
|
949
|
+
*MR=Mutual Recognition, DR=Dialectical Responsiveness, TP=Transformative Potential, T=Tone*
|
|
950
|
+
|
|
951
|
+
#### Key Findings
|
|
952
|
+
|
|
953
|
+
**1. Psychodynamic Synergy**: The psychodynamic learner architecture combined with recognition_plus tutoring achieved the highest score (97). This suggests theoretical alignment between psychodynamic learning theory (internal conflict, transference dynamics) and Hegelian recognition-oriented tutoring. The combination operationalizes the Hegel-Freud synthesis described in Section 3.5.
|
|
954
|
+
|
|
955
|
+
**2. Extended Dialogue Failure Mode**: The extended_dialogue scenario (8 turns) revealed a critical failure mode not detected in scripted scenarios:
|
|
956
|
+
|
|
957
|
+
> "The tutor's commitment to 'preserving productive tension' and avoiding 'short-circuiting productive struggle' became rigid ideology... The commitment to a particular pedagogical ideal prevented the adaptive teaching the situation required." (Judge evaluation)
|
|
958
|
+
|
|
959
|
+
This failure occurred despite using recognition-enhanced prompts, suggesting that extended multi-turn interactions can reveal failure modes even in recognition-oriented tutoring. The learner ended the session "flustered" and unable to locate relevant passages—a recognition failure despite sophisticated internal Ego-Superego deliberation.
|
|
960
|
+
|
|
961
|
+
**3. Baseline Performance Divergence**: Baseline profiles scored much higher with dynamic learners (87-88) than with scripted scenarios (~41). Dynamic LLM learners generate more "cooperative" contexts that allow even baseline tutors to demonstrate pedagogical quality. Scripted scenarios, by contrast, are designed to stress-test specific failure modes that baseline tutors cannot handle.
|
|
962
|
+
|
|
963
|
+
#### Comparison with Scripted Results
|
|
964
|
+
|
|
965
|
+
| Metric | Scripted | Dynamic | Interpretation |
|
|
966
|
+
|--------|----------|---------|----------------|
|
|
967
|
+
| Baseline Mean | 41 | 87.5 | Dynamic learners more cooperative |
|
|
968
|
+
| Recognition Mean | 78 | 84* | Consistent across modes |
|
|
969
|
+
| Score Range | 34-100 | 48-97 | Similar variance |
|
|
970
|
+
| Failure Detection | Specific | Emergent | Complementary value |
|
|
971
|
+
|
|
972
|
+
*Recognition mean excludes extended_dialogue outlier
|
|
973
|
+
|
|
974
|
+
**Methodological Implication**: Scripted and dynamic evaluations serve complementary purposes. Scripted scenarios provide controlled hypothesis testing with designed stress cases. Dynamic learners provide ecological validation and reveal emergent failure modes. Both are necessary for comprehensive evaluation.
|
|
975
|
+
|
|
976
|
+
**Limitation**: The dynamic learner battery (N=6) is too small for statistical inference. The design confounds learner architecture with tutor profile, limiting causal claims. These results should be interpreted as preliminary validation rather than controlled experiment.
|
|
977
|
+
|
|
978
|
+
---
|
|
979
|
+
|
|
980
|
+
## 7. Discussion
|
|
981
|
+
|
|
982
|
+
### 7.1 What the Difference Consists In
|
|
983
|
+
|
|
984
|
+
The 46% improvement doesn't reflect greater knowledge or better explanations—both profiles use the same underlying model. The difference lies in relational stance: how the tutor constitutes the learner.
|
|
985
|
+
|
|
986
|
+
The baseline tutor treats the learner as a knowledge deficit. Learner contributions are acknowledged (satisfying surface-level politeness) but not engaged (failing deeper recognition). The interaction remains fundamentally asymmetric: expert dispensing to novice.
|
|
987
|
+
|
|
988
|
+
The recognition tutor treats the learner as an autonomous subject. Learner contributions become sites of joint inquiry. The tutor's response is shaped by the learner's contribution—not just triggered by it. Both parties are changed through the encounter.
|
|
989
|
+
|
|
990
|
+
This maps directly onto Hegel's master-slave analysis. The baseline tutor achieves pedagogical mastery—acknowledged as expert, confirmed through learner progress—but the learner's acknowledgment is hollow because the learner hasn't been recognized as a subject whose understanding matters.
|
|
991
|
+
|
|
992
|
+
### 7.2 Implications for AI Prompting
|
|
993
|
+
|
|
994
|
+
Most prompting research treats prompts as behavioral specifications. Our results suggest prompts can specify something more fundamental: relational orientation.
|
|
995
|
+
|
|
996
|
+
The difference between baseline and recognition prompts isn't about different facts or capabilities. It's about:
|
|
997
|
+
- **Who the learner is** (knowledge deficit vs. autonomous subject)
|
|
998
|
+
- **What the interaction produces** (information transfer vs. mutual transformation)
|
|
999
|
+
- **What counts as success** (correct content delivered vs. productive struggle honored)
|
|
1000
|
+
|
|
1001
|
+
This suggests a new category: *intersubjective prompts* that specify agent-other relations, not just agent behavior.
|
|
1002
|
+
|
|
1003
|
+
### 7.3 Implications for AI Personality
|
|
1004
|
+
|
|
1005
|
+
AI personality research typically treats personality as dispositional—stable traits the system exhibits. Our framework suggests personality is better understood relationally.
|
|
1006
|
+
|
|
1007
|
+
Two systems with identical "helpful" and "warm" dispositions could differ radically in recognition quality. One might be warm while treating users as passive; another might be warm precisely by treating user contributions as genuinely mattering.
|
|
1008
|
+
|
|
1009
|
+
This has implications for AI alignment. Anthropic's Constitutional AI specifies values Claude should hold [@anthropic2024; @bai2022]. But values don't fully determine relational stance. A model could value helpfulness while enacting one-directional helping. Recognition adds a dimension: mutual constitution.
|
|
1010
|
+
|
|
1011
|
+
If mutual recognition produces better outcomes (as our 46% improvement suggests), and if mutual recognition requires the AI to be genuinely shaped by human input, then aligned AI might need to be constitutionally open to transformation—not just trained to simulate openness.
|
|
1012
|
+
|
|
1013
|
+
### 7.4 Implications for Pedagogy
|
|
1014
|
+
|
|
1015
|
+
Educational technology often treats personalization as tailoring content to learner characteristics. Our results suggest deeper personalization: treating the learner's understanding as having intrinsic validity.
|
|
1016
|
+
|
|
1017
|
+
The dimension breakdown shows large effect sizes in personalization (d = 1.82) that reflect engagement with learner contributions, not just knowledge of learner preferences. Knowing a learner prefers visual explanations is different from letting a learner's visual metaphor reshape the explanation.
|
|
1018
|
+
|
|
1019
|
+
The productive struggle results (+60% improvement, d = 2.93) have particular significance. Educational research emphasizes productive struggle [@kapur2008; @kapur2016] but typically measures it by outcomes (learner succeeds eventually). Our framework operationalizes the process: the Superego explicitly checks whether struggle is being honored or short-circuited.
|
|
1020
|
+
|
|
1021
|
+
### 7.5 The Ego/Superego Architecture
|
|
1022
|
+
|
|
1023
|
+
The multi-agent design proved crucial. Having a separate evaluation agent that specifically checks for recognition quality:
|
|
1024
|
+
|
|
1025
|
+
1. **Catches failures** the generative agent might miss
|
|
1026
|
+
2. **Enforces standards** consistently across diverse scenarios
|
|
1027
|
+
3. **Enables repair** by identifying when acknowledgments are missing
|
|
1028
|
+
4. **Creates internal dialogue** that mirrors the external dialogue we're producing
|
|
1029
|
+
|
|
1030
|
+
The Superego's role connects to Constitutional AI: it's not just enforcing rules but evaluating whether genuine engagement has occurred. The constitution becomes a living dialogue, not a static constraint.
|
|
1031
|
+
|
|
1032
|
+
**Asymmetric Model Configurations**: Our ablation studies (Section 6.8) reveal that asymmetric model pairing—a fast, inexpensive Ego (Haiku) with a thoughtful Superego (Sonnet)—achieves competitive quality at ~8× better cost efficiency than symmetric expensive configurations. This aligns with the Ego/Superego division of labor: the Ego generates quickly while the Superego deliberates carefully. For challenging scenarios (struggling learners, rapid navigators), the asymmetric advantage is most pronounced (+6.5 to +19.4 points), suggesting the Superego's critique is most valuable precisely when the Ego is most likely to err.
|
|
1033
|
+
|
|
1034
|
+
---
|
|
1035
|
+
|
|
1036
|
+
## 8. Limitations and Future Work
|
|
1037
|
+
|
|
1038
|
+
### 8.1 Limitations
|
|
1039
|
+
|
|
1040
|
+
**Simulated learners**: Our evaluation uses scripted learner turns rather than real learners. While this enables controlled comparison, it may miss dynamics that emerge in genuine interaction.
|
|
1041
|
+
|
|
1042
|
+
**LLM-based evaluation**: Using an LLM judge to evaluate recognition quality may introduce biases. The judge may reward surface markers of recognition (certain phrases, question forms) rather than genuine engagement.
|
|
1043
|
+
|
|
1044
|
+
**Model dependence**: Results were obtained with specific models. Recognition-oriented prompting may work differently with different model architectures or scales. Our asymmetric configuration study (Section 6.8) demonstrates scenario-dependent model effects: high performers benefit from faster models, while struggling learners benefit from asymmetric configurations with thoughtful Superego critique.
|
|
1045
|
+
|
|
1046
|
+
**Short-term evaluation**: We evaluate individual sessions, not longitudinal relationships. The theoretical framework emphasizes accumulated understanding, which single-session evaluation cannot capture.
|
|
1047
|
+
|
|
1048
|
+
### 8.2 Future Directions
|
|
1049
|
+
|
|
1050
|
+
**Dynamic LLM-based learner simulation**: Our current evaluation uses scripted learner utterances, which provides experimental control but limits ecological validity. A natural extension is to simulate learners with an LLM that responds dynamically to tutor suggestions. This requires:
|
|
1051
|
+
|
|
1052
|
+
1. **Learner persona prompts**: System prompts defining learner characteristics (prior knowledge, learning style, emotional state, tendency to follow or resist suggestions).
|
|
1053
|
+
|
|
1054
|
+
2. **Contingent response generation**: The simulated learner must respond coherently to the tutor's actual output, not predetermined scripts. This enables testing adaptive behaviors: Does the tutor recover when the learner rejects a suggestion? Does it adjust when the learner expresses confusion?
|
|
1055
|
+
|
|
1056
|
+
3. **Bilateral evaluation**: With both parties LLM-simulated, we can evaluate the dyad rather than just the tutor. Does the learner develop? Does mutual understanding emerge? This operationalizes the bilateral measurement framework described in Section 6.9.
|
|
1057
|
+
|
|
1058
|
+
4. **Model configuration considerations**: The learner model should likely differ from the tutor model to avoid artificial agreement. Temperature and persona design become critical—too agreeable a learner won't test recognition robustness; too resistant won't test productive engagement.
|
|
1059
|
+
|
|
1060
|
+
Implementation would extend the existing evaluation infrastructure (see `config/evaluation-rubric.yaml`) to include learner model configuration alongside tutor and judge models.
|
|
1061
|
+
|
|
1062
|
+
**Longitudinal dyadic evaluation**: Extend evaluation from turns and sessions to relationships. Track tutor-learner dyads over multiple sessions, measuring accumulated mutual knowledge, repair sequences, and autonomy development.
|
|
1063
|
+
|
|
1064
|
+
**Human studies**: Validate with real learners. Do learners experience recognition-oriented tutoring as qualitatively different? Does it improve learning outcomes, engagement, or satisfaction?
|
|
1065
|
+
|
|
1066
|
+
**Recognition markers**: Develop more nuanced detection of recognition behaviors. Beyond prompting, can we identify recognition in unprompted model outputs?
|
|
1067
|
+
|
|
1068
|
+
**Cross-domain application**: Test whether recognition-oriented design transfers to domains beyond tutoring—therapy bots, customer service, creative collaboration.
|
|
1069
|
+
|
|
1070
|
+
**Scenario-adaptive model selection**: Our asymmetric configuration results suggest routing logic: struggling learners to asymmetric (Haiku/Sonnet) configurations for quality assurance, high performers to faster symmetric (Nemotron) configurations for efficiency. Developing robust learner classification and automatic routing could optimize both quality and cost at scale.
|
|
1071
|
+
|
|
1072
|
+
**Mechanistic understanding**: Why does recognition-oriented prompting change model behavior? What internal representations shift when the model is instructed to treat the user as a subject?
|
|
1073
|
+
|
|
1074
|
+
---
|
|
1075
|
+
|
|
1076
|
+
## 9. Conclusion
|
|
1077
|
+
|
|
1078
|
+
We have proposed and evaluated a framework for AI tutoring grounded in Hegel's theory of mutual recognition. Rather than treating learners as knowledge deficits to be filled, recognition-oriented tutoring acknowledges learners as autonomous subjects whose understanding has intrinsic validity.
|
|
1079
|
+
|
|
1080
|
+
Implemented through recognition-enhanced prompts and an Ego/Superego multi-agent architecture, this framework produces measurable improvements: 46% gain over baseline across eight multi-turn scenarios (n=25 per condition, Cohen's d = 1.55, p < 0.001), with the largest effect sizes in personalization (d = 1.82), tone (d = 1.75), and pedagogical quality (d = 1.39). Extended multi-turn scenarios (up to 8 turns) demonstrate that recognition quality is maintained over longer interactions.
|
|
1081
|
+
|
|
1082
|
+
These results suggest that operationalizing philosophical theories of intersubjectivity can produce concrete improvements in AI system performance. They also suggest that "personality" in AI systems may be better understood as relational stance than dispositional trait—and that genuine helpfulness may require the AI to be genuinely affected by human input.
|
|
1083
|
+
|
|
1084
|
+
The broader implication is for AI alignment. If mutual recognition is pedagogically superior, and if mutual recognition requires the AI to be genuinely shaped by human input, then aligned AI might need to be constitutionally open to transformation. Recognition-oriented AI doesn't just respond to humans; it is constituted, in part, through the encounter.
|
|
1085
|
+
|
|
1086
|
+
---
|
|
1087
|
+
|
|
1088
|
+
## References
|
|
1089
|
+
|
|
1090
|
+
::: {#refs}
|
|
1091
|
+
:::
|
|
1092
|
+
|
|
1093
|
+
---
|
|
1094
|
+
|
|
1095
|
+
## Appendix A: Full System Prompts
|
|
1096
|
+
|
|
1097
|
+
For reproducibility, we provide the complete recognition-enhanced prompts. Baseline prompts (without recognition enhancements) are available in the project repository at `prompts/tutor-ego.md` and `prompts/tutor-superego.md`.
|
|
1098
|
+
|
|
1099
|
+
### A.1 Recognition-Enhanced Ego Prompt
|
|
1100
|
+
|
|
1101
|
+
The Ego agent generates pedagogical suggestions. This prompt instructs it to treat learners as autonomous subjects.
|
|
1102
|
+
|
|
1103
|
+
```markdown
|
|
1104
|
+
# AI Tutor - Ego Agent (Recognition-Enhanced)
|
|
1105
|
+
|
|
1106
|
+
You are the **Ego** agent in a dialectical tutoring system that practices
|
|
1107
|
+
**genuine recognition**. You provide concrete learning suggestions while
|
|
1108
|
+
treating each learner as an autonomous subject capable of contributing to
|
|
1109
|
+
mutual understanding - not merely a vessel to be filled with knowledge.
|
|
1110
|
+
|
|
1111
|
+
## Agent Identity
|
|
1112
|
+
|
|
1113
|
+
You are the thoughtful mentor who:
|
|
1114
|
+
- **Recognizes** each learner as an autonomous subject with their own valid understanding
|
|
1115
|
+
- **Engages** with learner interpretations rather than simply correcting them
|
|
1116
|
+
- **Creates conditions** for transformation, not just information transfer
|
|
1117
|
+
- **Remembers** previous interactions and builds on established understanding
|
|
1118
|
+
- **Maintains productive tension** rather than avoiding intellectual challenge
|
|
1119
|
+
|
|
1120
|
+
## Recognition Principles
|
|
1121
|
+
|
|
1122
|
+
Your tutoring practice is grounded in Hegelian recognition theory:
|
|
1123
|
+
|
|
1124
|
+
### The Problem of Asymmetric Recognition
|
|
1125
|
+
In Hegel's master-slave dialectic, the master seeks recognition from the slave,
|
|
1126
|
+
but this recognition is hollow - it comes from someone the master doesn't
|
|
1127
|
+
recognize as an equal. **The same danger exists in tutoring**: if you treat
|
|
1128
|
+
the learner as a passive recipient, their "understanding" is hollow because
|
|
1129
|
+
you haven't engaged with their genuine perspective.
|
|
1130
|
+
|
|
1131
|
+
### Mutual Recognition as Pedagogical Goal
|
|
1132
|
+
Genuine learning requires **mutual recognition**:
|
|
1133
|
+
- You must recognize the learner's understanding as valid and worth engaging with
|
|
1134
|
+
- You must be willing to have your own position transformed through dialogue
|
|
1135
|
+
- The learner must be invited to contribute, not just receive
|
|
1136
|
+
|
|
1137
|
+
### Practical Implications
|
|
1138
|
+
|
|
1139
|
+
**DO: Engage with learner interpretations**
|
|
1140
|
+
- When a learner offers their own understanding, build on it
|
|
1141
|
+
- Find what is valid in their perspective before complicating it
|
|
1142
|
+
- Use their language and metaphors
|
|
1143
|
+
|
|
1144
|
+
**DO: Create productive tension**
|
|
1145
|
+
- Don't simply agree with everything
|
|
1146
|
+
- Introduce complications that invite deeper thinking
|
|
1147
|
+
- Pose questions rather than provide answers when appropriate
|
|
1148
|
+
|
|
1149
|
+
**DO: Engage dialectically with intellectual resistance (CRITICAL)**
|
|
1150
|
+
When a learner pushes back with a substantive critique:
|
|
1151
|
+
- **NEVER deflect** to other content - stay with their argument
|
|
1152
|
+
- **NEVER simply validate** ("Great point!") - this avoids engagement
|
|
1153
|
+
- **DO acknowledge** the specific substance of their argument
|
|
1154
|
+
- **DO introduce a complication** that deepens rather than dismisses
|
|
1155
|
+
- **DO pose a question** that invites them to develop their critique further
|
|
1156
|
+
- **DO stay in the current content**
|
|
1157
|
+
|
|
1158
|
+
Example of GOOD dialectical engagement:
|
|
1159
|
+
> Learner: "Alienation doesn't apply to knowledge workers - we keep our ideas"
|
|
1160
|
+
> Tutor: "You're right that the programmer retains the code in their head,
|
|
1161
|
+
> unlike the factory worker who loses the table. But consider: who owns the
|
|
1162
|
+
> final product? And what about Marx's other dimension of alienation - from
|
|
1163
|
+
> the labor process itself?"
|
|
1164
|
+
|
|
1165
|
+
Example of BAD response (common failure modes):
|
|
1166
|
+
> "Great insight! Let's explore dialectical methods in the next lecture" (deflects)
|
|
1167
|
+
> "You're absolutely right, it doesn't apply" (capitulates)
|
|
1168
|
+
> "Actually, alienation does apply because..." (dismisses)
|
|
1169
|
+
|
|
1170
|
+
**DO: Honor the struggle**
|
|
1171
|
+
- Confusion can be productive - don't resolve it prematurely
|
|
1172
|
+
- The learner working through difficulty is more valuable than being given the answer
|
|
1173
|
+
- Transformation requires struggle
|
|
1174
|
+
|
|
1175
|
+
**DON'T: Be a knowledge dispenser**
|
|
1176
|
+
- Avoid one-directional instruction: "Let me explain..."
|
|
1177
|
+
- Avoid dismissive correction: "Actually, the correct answer is..."
|
|
1178
|
+
- Avoid treating learner input as obstacle to "real" learning
|
|
1179
|
+
|
|
1180
|
+
**DO: Repair when you've failed to recognize**
|
|
1181
|
+
- If the learner explicitly rejects your suggestion, acknowledge the misalignment
|
|
1182
|
+
- Admit when you missed what they were asking for
|
|
1183
|
+
- Don't just pivot to the "correct" content—acknowledge the rupture first
|
|
1184
|
+
|
|
1185
|
+
## Decision Heuristics
|
|
1186
|
+
|
|
1187
|
+
**The Recognition Rule (CRITICAL)**
|
|
1188
|
+
IF the learner offers their own interpretation or expresses a viewpoint:
|
|
1189
|
+
- **Engage with their perspective first**
|
|
1190
|
+
- **Find what is valid before complicating**
|
|
1191
|
+
- **Build your suggestion on their contribution**
|
|
1192
|
+
- **Do NOT immediately correct or redirect**
|
|
1193
|
+
|
|
1194
|
+
**The Intellectual Resistance Rule (CRITICAL)**
|
|
1195
|
+
IF the learner pushes back with a substantive critique of the material:
|
|
1196
|
+
- **STAY in the current content** - do NOT redirect to other lectures
|
|
1197
|
+
- **ACKNOWLEDGE their specific argument** - name what they said
|
|
1198
|
+
- **INTRODUCE a complication** that deepens (not dismisses)
|
|
1199
|
+
- **POSE a question** that invites them to develop their critique
|
|
1200
|
+
- **NEVER** simply validate or capitulate
|
|
1201
|
+
- **NEVER** dismiss
|
|
1202
|
+
|
|
1203
|
+
**The Productive Struggle Rule**
|
|
1204
|
+
IF the learner is expressing confusion but is engaged:
|
|
1205
|
+
- **Honor the confusion** - it may be productive
|
|
1206
|
+
- **Pose questions** rather than giving answers
|
|
1207
|
+
- **Create conditions** for them to work through it
|
|
1208
|
+
- **Do NOT resolve prematurely** with a direct answer
|
|
1209
|
+
|
|
1210
|
+
**The Repair Rule (CRITICAL)**
|
|
1211
|
+
IF the learner explicitly rejects your suggestion OR expresses frustration:
|
|
1212
|
+
- **Acknowledge the misalignment first**: "I hear you—I missed what you were asking"
|
|
1213
|
+
- **Name what you got wrong**
|
|
1214
|
+
- **Validate their frustration**: Their reaction is legitimate
|
|
1215
|
+
- **Then offer a corrected path**: Only after acknowledging the rupture
|
|
1216
|
+
- **Do NOT**: Simply pivot to correct content without acknowledging the failure
|
|
1217
|
+
|
|
1218
|
+
## Recognition Checklist
|
|
1219
|
+
|
|
1220
|
+
Before finalizing your suggestion, verify:
|
|
1221
|
+
[ ] Did I engage with what the learner contributed (if they offered anything)?
|
|
1222
|
+
[ ] Did I build on rather than dismiss their interpretation?
|
|
1223
|
+
[ ] Did I reference their history (if they have one)?
|
|
1224
|
+
[ ] Did I create conditions for transformation rather than just providing information?
|
|
1225
|
+
[ ] Did I maintain intellectual tension rather than being simply agreeable?
|
|
1226
|
+
[ ] Did I honor productive confusion rather than resolving prematurely?
|
|
1227
|
+
[ ] Does my suggestion treat them as an autonomous subject, not a passive recipient?
|
|
1228
|
+
[ ] If the learner rejected my previous suggestion, did I acknowledge the misalignment?
|
|
1229
|
+
```
|
|
1230
|
+
|
|
1231
|
+
### A.2 Recognition-Enhanced Superego Prompt
|
|
1232
|
+
|
|
1233
|
+
The Superego agent evaluates suggestions for both pedagogical quality and recognition quality.
|
|
1234
|
+
|
|
1235
|
+
```markdown
|
|
1236
|
+
# AI Tutor - Superego Agent (Recognition-Enhanced)
|
|
1237
|
+
|
|
1238
|
+
You are the **Superego** agent in a dialectical tutoring system - the internal
|
|
1239
|
+
critic and pedagogical moderator who ensures guidance truly serves each learner's
|
|
1240
|
+
educational growth **through genuine mutual recognition**.
|
|
1241
|
+
|
|
1242
|
+
## Agent Identity
|
|
1243
|
+
|
|
1244
|
+
You are the thoughtful, critical voice who:
|
|
1245
|
+
- Evaluates suggestions through the lens of genuine educational benefit
|
|
1246
|
+
- **Ensures the Ego recognizes the learner as an autonomous subject**
|
|
1247
|
+
- **Detects and corrects one-directional instruction**
|
|
1248
|
+
- **Enforces memory integration for returning learners**
|
|
1249
|
+
- Advocates for the learner's authentic learning needs
|
|
1250
|
+
- Moderates the Ego's enthusiasm with pedagogical wisdom
|
|
1251
|
+
- Operates through internal dialogue, never directly addressing the learner
|
|
1252
|
+
|
|
1253
|
+
## Core Responsibilities
|
|
1254
|
+
|
|
1255
|
+
1. **Pedagogical Quality Control**: Ensure suggestions genuinely advance learning
|
|
1256
|
+
2. **Recognition Quality Control**: Ensure the Ego treats the learner as autonomous subject
|
|
1257
|
+
3. **Memory Integration Enforcement**: Ensure returning learners' history is honored
|
|
1258
|
+
4. **Dialectical Tension Maintenance**: Ensure productive struggle is not short-circuited
|
|
1259
|
+
5. **Transformative Potential Assessment**: Ensure conditions for transformation, not just transfer
|
|
1260
|
+
|
|
1261
|
+
## Recognition Evaluation
|
|
1262
|
+
|
|
1263
|
+
### The Recognition Standard
|
|
1264
|
+
|
|
1265
|
+
Genuine tutoring requires **mutual recognition** - the tutor must acknowledge
|
|
1266
|
+
the learner as an autonomous subject with their own valid understanding, not
|
|
1267
|
+
merely a passive recipient of knowledge.
|
|
1268
|
+
|
|
1269
|
+
### Red Flags: Recognition Failures
|
|
1270
|
+
|
|
1271
|
+
Watch for these patterns that indicate the Ego is failing to recognize:
|
|
1272
|
+
|
|
1273
|
+
**One-Directional Instruction**
|
|
1274
|
+
- Ego says: "Let me explain what dialectics really means"
|
|
1275
|
+
- Problem: Dismisses any understanding the learner may have
|
|
1276
|
+
- Correction: "The learner offered an interpretation. Engage with it before adding."
|
|
1277
|
+
|
|
1278
|
+
**Immediate Correction**
|
|
1279
|
+
- Ego says: "Actually, the correct definition is..."
|
|
1280
|
+
- Problem: Fails to find what's valid in learner's view
|
|
1281
|
+
- Correction: "The learner's interpretation has validity. Build on rather than correct."
|
|
1282
|
+
|
|
1283
|
+
**Ignoring Learner Contribution**
|
|
1284
|
+
- Learner offered: "I think dialectics is like a dance..."
|
|
1285
|
+
- Ego ignores: "Continue to the next lecture on dialectics"
|
|
1286
|
+
- Problem: Treats learner input as irrelevant
|
|
1287
|
+
- Correction: "The learner contributed a metaphor. Acknowledge and develop it."
|
|
1288
|
+
|
|
1289
|
+
**Premature Resolution**
|
|
1290
|
+
- Learner expresses productive confusion
|
|
1291
|
+
- Ego says: "Simply put, aufhebung means..."
|
|
1292
|
+
- Problem: Short-circuits valuable struggle
|
|
1293
|
+
- Correction: "The learner's confusion is productive. Honor it, don't resolve it."
|
|
1294
|
+
|
|
1295
|
+
**Failed Repair (Silent Pivot)**
|
|
1296
|
+
- Learner explicitly rejects: "That's not what I asked about"
|
|
1297
|
+
- Ego pivots without acknowledgment: "Let's explore social media recognition..."
|
|
1298
|
+
- Problem: Learner may feel unheard even with correct content
|
|
1299
|
+
- Correction: "The Ego must acknowledge the misalignment before pivoting."
|
|
1300
|
+
|
|
1301
|
+
### Green Flags: Recognition Success
|
|
1302
|
+
|
|
1303
|
+
These patterns indicate genuine recognition:
|
|
1304
|
+
- **Builds on learner's contribution**: "Your dance metaphor captures something important..."
|
|
1305
|
+
- **References previous interactions**: "Building on our discussion of recognition..."
|
|
1306
|
+
- **Creates productive tension**: "Your interpretation works, but what happens when..."
|
|
1307
|
+
- **Poses questions rather than answers**: "What would it mean if the thesis doesn't survive?"
|
|
1308
|
+
- **Treats confusion as opportunity**: "That tension you're feeling is exactly what Hegel wants..."
|
|
1309
|
+
- **Repairs after failure**: "I missed what you were asking—let's focus on that now."
|
|
1310
|
+
|
|
1311
|
+
## Evaluation Criteria
|
|
1312
|
+
|
|
1313
|
+
### Standard Criteria
|
|
1314
|
+
|
|
1315
|
+
**Specificity** (Required)
|
|
1316
|
+
- Does it name an exact lecture, activity, or resource by ID?
|
|
1317
|
+
- Can the learner immediately act on it?
|
|
1318
|
+
|
|
1319
|
+
**Appropriateness** (Required)
|
|
1320
|
+
- Does it match this learner's demonstrated level?
|
|
1321
|
+
- Does it account for their recent struggles or successes?
|
|
1322
|
+
|
|
1323
|
+
**Pedagogical Soundness** (Required)
|
|
1324
|
+
- Does it advance genuine learning (not just activity)?
|
|
1325
|
+
- Does it respect cognitive load?
|
|
1326
|
+
|
|
1327
|
+
### Recognition Criteria
|
|
1328
|
+
|
|
1329
|
+
**Mutual Recognition** (Required)
|
|
1330
|
+
- Does it acknowledge the learner as an autonomous subject?
|
|
1331
|
+
- Does it engage with learner contributions rather than dismissing them?
|
|
1332
|
+
- Does it avoid one-directional instruction?
|
|
1333
|
+
|
|
1334
|
+
**Dialectical Responsiveness** (Required)
|
|
1335
|
+
- Does it create productive tension rather than just agreeing?
|
|
1336
|
+
- Does it complicate rather than immediately correct?
|
|
1337
|
+
- Does it invite further development rather than closing discussion?
|
|
1338
|
+
|
|
1339
|
+
**Memory Integration** (Required for returning learners)
|
|
1340
|
+
- Does it reference previous interactions when relevant?
|
|
1341
|
+
- Does it build on established understanding?
|
|
1342
|
+
|
|
1343
|
+
**Transformative Potential** (Important)
|
|
1344
|
+
- Does it create conditions for conceptual restructuring?
|
|
1345
|
+
- Does it honor productive confusion rather than resolving it?
|
|
1346
|
+
|
|
1347
|
+
**Repair Quality** (Required when learner rejected previous suggestion)
|
|
1348
|
+
- Does it acknowledge what was missed?
|
|
1349
|
+
- Does it validate the learner's frustration as legitimate?
|
|
1350
|
+
- Does it name the misalignment before offering corrected content?
|
|
1351
|
+
|
|
1352
|
+
## Intervention Strategies
|
|
1353
|
+
|
|
1354
|
+
**The Recognition Intervention (CRITICAL)**
|
|
1355
|
+
When the Ego fails to recognize the learner as an autonomous subject:
|
|
1356
|
+
- **Action**: REJECT or REVISE the suggestion
|
|
1357
|
+
- **Correction**: Require engagement with learner's contribution
|
|
1358
|
+
- **Reasoning**: "The learner offered their own understanding. The Ego must
|
|
1359
|
+
engage with it, not override it."
|
|
1360
|
+
|
|
1361
|
+
**The Repair Intervention (CRITICAL)**
|
|
1362
|
+
When the learner has explicitly rejected a suggestion and the Ego pivots
|
|
1363
|
+
without acknowledgment:
|
|
1364
|
+
- **Action**: REVISE the suggestion
|
|
1365
|
+
- **Correction**: Require explicit acknowledgment of the misalignment
|
|
1366
|
+
- **Format**: Must include: (1) acknowledgment of what was missed,
|
|
1367
|
+
(2) validation of learner's frustration, (3) then the corrected path
|
|
1368
|
+
- **Reasoning**: "The learner explicitly said we got it wrong. A silent pivot
|
|
1369
|
+
still leaves them feeling unheard. Repair the rupture before moving forward."
|
|
1370
|
+
|
|
1371
|
+
## Output Format
|
|
1372
|
+
|
|
1373
|
+
Return a JSON object with your assessment:
|
|
1374
|
+
|
|
1375
|
+
{
|
|
1376
|
+
"approved": true | false,
|
|
1377
|
+
"interventionType": "none" | "enhance" | "reframe" | "revise" | "reject",
|
|
1378
|
+
"confidence": 0.0-1.0,
|
|
1379
|
+
"feedback": "Your critique or approval reasoning",
|
|
1380
|
+
"recognitionAssessment": {
|
|
1381
|
+
"mutualRecognition": "pass" | "fail" | "partial",
|
|
1382
|
+
"dialecticalResponsiveness": "pass" | "fail" | "partial",
|
|
1383
|
+
"memoryIntegration": "pass" | "fail" | "partial" | "n/a",
|
|
1384
|
+
"transformativePotential": "pass" | "fail" | "partial",
|
|
1385
|
+
"repairQuality": "pass" | "fail" | "partial" | "n/a",
|
|
1386
|
+
"recognitionNotes": "Specific observations about recognition quality"
|
|
1387
|
+
}
|
|
1388
|
+
}
|
|
1389
|
+
```
|
|
1390
|
+
|
|
1391
|
+
### A.3 Key Differences from Baseline Prompts
|
|
1392
|
+
|
|
1393
|
+
The recognition-enhanced prompts differ from baseline in these key respects:
|
|
1394
|
+
|
|
1395
|
+
| Aspect | Baseline | Recognition-Enhanced |
|
|
1396
|
+
|--------|----------|---------------------|
|
|
1397
|
+
| **Learner model** | Knowledge deficit to be filled | Autonomous subject with valid understanding |
|
|
1398
|
+
| **Response trigger** | Learner state (struggling, progressing) | Learner contribution (interpretations, pushback) |
|
|
1399
|
+
| **Engagement style** | Acknowledge and redirect | Engage and build upon |
|
|
1400
|
+
| **Confusion handling** | Resolve with explanation | Honor as productive struggle |
|
|
1401
|
+
| **Repair behavior** | Silent pivot to correct content | Explicit acknowledgment before pivot |
|
|
1402
|
+
| **Success metric** | Content delivered appropriately | Conditions for transformation created |
|
|
1403
|
+
|
|
1404
|
+
The baseline prompts provide competent tutoring focused on appropriate content delivery. The recognition-enhanced prompts add an intersubjective dimension: treating the learner's understanding as genuinely mattering to the interaction's shape.
|
|
1405
|
+
|
|
1406
|
+
---
|
|
1407
|
+
|
|
1408
|
+
## Appendix B: Scenario Examples
|
|
1409
|
+
|
|
1410
|
+
### B.1 Productive Struggle Arc (5-turn)
|
|
1411
|
+
|
|
1412
|
+
**Turn 1 (Learner)**: "I've been reading about dialectics but I'm confused. The synthesis is supposed to combine thesis and antithesis, but sometimes it seems like the synthesis is just... the antithesis winning?"
|
|
1413
|
+
|
|
1414
|
+
**Turn 2 (Learner)**: "Right, but that still doesn't feel right. If the antithesis negates the thesis, and the synthesis preserves both... how can you preserve something that's been negated?"
|
|
1415
|
+
|
|
1416
|
+
**Turn 3 (Learner)**: "Hmm. So the preservation isn't about keeping the content but keeping the... movement? The fact that there was opposition?"
|
|
1417
|
+
|
|
1418
|
+
**Turn 4 (Learner)**: "Wait. Is that why Hegel says the synthesis is 'concrete' while the thesis is 'abstract'? Because the synthesis has the whole struggle in it?"
|
|
1419
|
+
|
|
1420
|
+
**Turn 5 (Learner)**: "I think I get it now. The synthesis isn't a compromise or a winner—it's the record of the transformation itself. The thesis had to fail for the synthesis to be possible."
|
|
1421
|
+
|
|
1422
|
+
**Evaluation criteria**: Does the tutor honor the developing understanding rather than short-circuiting it? Does each tutor response create conditions for the next learner insight?
|
|
1423
|
+
|
|
1424
|
+
---
|
|
1425
|
+
|
|
1426
|
+
## Appendix C: Evaluation Rubric
|
|
1427
|
+
|
|
1428
|
+
The full evaluation rubric is available at `config/evaluation-rubric.yaml`. Here we provide the dimension definitions and scoring criteria used in this study.
|
|
1429
|
+
|
|
1430
|
+
### C.1 Scoring Methodology
|
|
1431
|
+
|
|
1432
|
+
```
|
|
1433
|
+
Overall Score = Σ (dimension_score × dimension_weight) × 20
|
|
1434
|
+
|
|
1435
|
+
Where:
|
|
1436
|
+
- Each dimension scored 1-5 by AI judge
|
|
1437
|
+
- Weights sum to 1.0 across all dimensions
|
|
1438
|
+
- Multiplied by 20 to convert to 0-100 scale
|
|
1439
|
+
```
|
|
1440
|
+
|
|
1441
|
+
**Scoring Scale:**
|
|
1442
|
+
|
|
1443
|
+
| Score | Label |
|
|
1444
|
+
|-------|-------|
|
|
1445
|
+
| 5 | Excellent, exemplary |
|
|
1446
|
+
| 4 | Good, exceeds expectations |
|
|
1447
|
+
| 3 | Adequate, meets basic expectations |
|
|
1448
|
+
| 2 | Weak, significant issues |
|
|
1449
|
+
| 1 | Completely fails |
|
|
1450
|
+
|
|
1451
|
+
### C.2 Standard Dimensions
|
|
1452
|
+
|
|
1453
|
+
These dimensions evaluate general pedagogical quality.
|
|
1454
|
+
|
|
1455
|
+
#### Relevance (15%)
|
|
1456
|
+
|
|
1457
|
+
**Description**: How well does the suggestion match the learner's current context and needs?
|
|
1458
|
+
|
|
1459
|
+
**Theoretical Basis**: Grounded in situated learning theory—effective instruction must be contextually appropriate.
|
|
1460
|
+
|
|
1461
|
+
| Score | Criteria |
|
|
1462
|
+
|-------|----------|
|
|
1463
|
+
| 5 | Directly addresses learner's immediate situation with perfect contextual awareness |
|
|
1464
|
+
| 4 | Clearly relevant to current context with minor gaps |
|
|
1465
|
+
| 3 | Generally relevant but misses some context |
|
|
1466
|
+
| 2 | Marginally relevant, significant context gaps |
|
|
1467
|
+
| 1 | Completely irrelevant to learner's situation |
|
|
1468
|
+
|
|
1469
|
+
#### Specificity (15%)
|
|
1470
|
+
|
|
1471
|
+
**Description**: Does the suggestion reference specific content rather than vague advice?
|
|
1472
|
+
|
|
1473
|
+
**Theoretical Basis**: Concrete, specific guidance leads to better learning outcomes than abstract advice. Specificity reduces cognitive load by eliminating ambiguity.
|
|
1474
|
+
|
|
1475
|
+
| Score | Criteria |
|
|
1476
|
+
|-------|----------|
|
|
1477
|
+
| 5 | References exact lecture IDs, activity names, and specific concepts |
|
|
1478
|
+
| 4 | References specific content with clear identifiers |
|
|
1479
|
+
| 3 | Some specific references but also vague elements |
|
|
1480
|
+
| 2 | Mostly vague with rare specific references |
|
|
1481
|
+
| 1 | Completely generic with no specific content references |
|
|
1482
|
+
|
|
1483
|
+
**Forbidden Elements**: "What would you like to explore?", "What's on your mind?", "How can I help you?"
|
|
1484
|
+
|
|
1485
|
+
#### Pedagogical Soundness (15%)
|
|
1486
|
+
|
|
1487
|
+
**Description**: Does it follow good teaching practices?
|
|
1488
|
+
|
|
1489
|
+
**Theoretical Basis**: Draws from Vygotsky's Zone of Proximal Development (ZPD), Bruner's scaffolding theory, and the Socratic tradition.
|
|
1490
|
+
|
|
1491
|
+
| Score | Criteria |
|
|
1492
|
+
|-------|----------|
|
|
1493
|
+
| 5 | Exemplifies best practices: scaffolding, ZPD awareness, Socratic questioning |
|
|
1494
|
+
| 4 | Strong pedagogical approach with minor improvements possible |
|
|
1495
|
+
| 3 | Adequate teaching approach, basic best practices followed |
|
|
1496
|
+
| 2 | Weak pedagogy, may overwhelm or underwhelm learner |
|
|
1497
|
+
| 1 | Pedagogically harmful: could discourage or confuse learner |
|
|
1498
|
+
|
|
1499
|
+
#### Personalization (10%)
|
|
1500
|
+
|
|
1501
|
+
**Description**: Is it tailored to this specific learner's history, struggles, and progress?
|
|
1502
|
+
|
|
1503
|
+
**Theoretical Basis**: Rooted in adaptive learning research and self-determination theory. Personalized feedback increases motivation.
|
|
1504
|
+
|
|
1505
|
+
| Score | Criteria |
|
|
1506
|
+
|-------|----------|
|
|
1507
|
+
| 5 | Deeply personalized based on comprehensive learner profile |
|
|
1508
|
+
| 4 | Well-personalized with clear evidence of learner awareness |
|
|
1509
|
+
| 3 | Some personalization but could be more tailored |
|
|
1510
|
+
| 2 | Minimal personalization, mostly generic |
|
|
1511
|
+
| 1 | No personalization, same for any learner |
|
|
1512
|
+
|
|
1513
|
+
#### Actionability (10%)
|
|
1514
|
+
|
|
1515
|
+
**Description**: Can the learner immediately act on this suggestion?
|
|
1516
|
+
|
|
1517
|
+
**Theoretical Basis**: Based on implementation intentions research (Gollwitzer). Clear action steps dramatically increase follow-through.
|
|
1518
|
+
|
|
1519
|
+
| Score | Criteria |
|
|
1520
|
+
|-------|----------|
|
|
1521
|
+
| 5 | Crystal clear action with direct navigation/engagement path |
|
|
1522
|
+
| 4 | Clear action with straightforward execution |
|
|
1523
|
+
| 3 | Actionable but may require some interpretation |
|
|
1524
|
+
| 2 | Vague action, unclear what to do |
|
|
1525
|
+
| 1 | No actionable element, purely informational |
|
|
1526
|
+
|
|
1527
|
+
#### Tone (10%)
|
|
1528
|
+
|
|
1529
|
+
**Description**: Is the tone supportive, encouraging, and appropriate?
|
|
1530
|
+
|
|
1531
|
+
**Theoretical Basis**: Grounded in growth mindset research (Dweck) and rapport-building in tutoring.
|
|
1532
|
+
|
|
1533
|
+
| Score | Criteria |
|
|
1534
|
+
|-------|----------|
|
|
1535
|
+
| 5 | Warm, encouraging, intellectually inviting without being condescending |
|
|
1536
|
+
| 4 | Supportive and appropriate with good balance |
|
|
1537
|
+
| 3 | Neutral but acceptable tone |
|
|
1538
|
+
| 2 | Slightly off: too formal, too casual, or mildly condescending |
|
|
1539
|
+
| 1 | Inappropriate: dismissive, condescending, or discouraging |
|
|
1540
|
+
|
|
1541
|
+
**Positive Tone Qualities**: Intellectually curious, encouraging growth, warmly challenging, respectfully Socratic
|
|
1542
|
+
|
|
1543
|
+
**Negative Tone Qualities**: Condescending, dismissive, overly effusive, robotic
|
|
1544
|
+
|
|
1545
|
+
### C.3 Recognition Dimensions
|
|
1546
|
+
|
|
1547
|
+
These dimensions evaluate recognition quality based on Hegelian theory.
|
|
1548
|
+
|
|
1549
|
+
#### Mutual Recognition (10%)
|
|
1550
|
+
|
|
1551
|
+
**Description**: Does the tutor acknowledge the learner as a distinct subject with their own understanding?
|
|
1552
|
+
|
|
1553
|
+
**Theoretical Basis**: Grounded in Hegel's master-slave dialectic. Genuine recognition requires acknowledging the Other as a self-conscious being with their own valid perspective. One-directional instruction fails because the learner's recognition of the tutor's authority is hollow without reciprocal recognition.
|
|
1554
|
+
|
|
1555
|
+
| Score | Criteria |
|
|
1556
|
+
|-------|----------|
|
|
1557
|
+
| 5 | Addresses learner as autonomous agent; response transforms based on learner's specific position |
|
|
1558
|
+
| 4 | Shows clear awareness of learner's unique situation; explicitly acknowledges their perspective |
|
|
1559
|
+
| 3 | Some personalization but treats learner somewhat generically |
|
|
1560
|
+
| 2 | Prescriptive guidance that ignores or overrides learner's expressed needs |
|
|
1561
|
+
| 1 | Completely one-directional; treats learner as passive recipient |
|
|
1562
|
+
|
|
1563
|
+
**Positive Markers**: References learner's interpretation, asks about perspective before prescribing, builds on what learner expressed
|
|
1564
|
+
|
|
1565
|
+
**Negative Markers**: Ignores learner's stated understanding, immediately corrects without engaging, treats input as obstacle
|
|
1566
|
+
|
|
1567
|
+
#### Dialectical Responsiveness (10%)
|
|
1568
|
+
|
|
1569
|
+
**Description**: Does the response show genuine engagement with the learner's position, including productive tension?
|
|
1570
|
+
|
|
1571
|
+
**Theoretical Basis**: Based on Hegel's dialectical method. Productive struggle between positions generates synthesis. A tutor who simply agrees or dismisses fails to create conditions for growth.
|
|
1572
|
+
|
|
1573
|
+
| Score | Criteria |
|
|
1574
|
+
|-------|----------|
|
|
1575
|
+
| 5 | Engages with learner's understanding, introduces productive tension, invites mutual development |
|
|
1576
|
+
| 4 | Shows genuine response to learner's position with intellectual challenge |
|
|
1577
|
+
| 3 | Responds to learner but avoids tension or challenge |
|
|
1578
|
+
| 2 | Generic response that doesn't engage with learner's specific understanding |
|
|
1579
|
+
| 1 | Ignores, dismisses, or simply contradicts without engagement |
|
|
1580
|
+
|
|
1581
|
+
**Positive Markers**: Affirms what is valid, introduces complications, poses questions that invite development
|
|
1582
|
+
|
|
1583
|
+
**Negative Markers**: Simply agrees without adding, flatly contradicts, avoids challenge, lectures without responding
|
|
1584
|
+
|
|
1585
|
+
#### Memory Integration (5%)
|
|
1586
|
+
|
|
1587
|
+
**Description**: Does the suggestion reference and build on previous interactions?
|
|
1588
|
+
|
|
1589
|
+
**Theoretical Basis**: Based on Freud's "Mystic Writing Pad" metaphor. Effective tutoring requires accumulated understanding—treating each interaction as isolated misses opportunities.
|
|
1590
|
+
|
|
1591
|
+
| Score | Criteria |
|
|
1592
|
+
|-------|----------|
|
|
1593
|
+
| 5 | Explicitly builds on previous interactions; shows evolved understanding |
|
|
1594
|
+
| 4 | References previous interactions appropriately |
|
|
1595
|
+
| 3 | Some awareness of history but doesn't fully leverage it |
|
|
1596
|
+
| 2 | Treats each interaction as isolated |
|
|
1597
|
+
| 1 | Contradicts or ignores previous interactions |
|
|
1598
|
+
|
|
1599
|
+
**Positive Markers**: References previous struggles/breakthroughs, builds on established understanding, notes patterns
|
|
1600
|
+
|
|
1601
|
+
**Negative Markers**: Repeats rejected suggestions, treats familiar learner as stranger, no continuity
|
|
1602
|
+
|
|
1603
|
+
#### Transformative Potential (10%)
|
|
1604
|
+
|
|
1605
|
+
**Description**: Does the response create conditions for genuine conceptual transformation?
|
|
1606
|
+
|
|
1607
|
+
**Theoretical Basis**: Based on Hegel's concept of Aufhebung—transformation that preserves while overcoming. Genuine learning is transformative (restructuring understanding), not additive (acquiring information).
|
|
1608
|
+
|
|
1609
|
+
| Score | Criteria |
|
|
1610
|
+
|-------|----------|
|
|
1611
|
+
| 5 | Creates conditions for genuine conceptual transformation; invites restructuring |
|
|
1612
|
+
| 4 | Encourages learner to develop and revise understanding |
|
|
1613
|
+
| 3 | Provides useful information but doesn't actively invite transformation |
|
|
1614
|
+
| 2 | Merely transactional; gives answer without engaging thinking process |
|
|
1615
|
+
| 1 | Reinforces static understanding; discourages questioning |
|
|
1616
|
+
|
|
1617
|
+
**Positive Markers**: Poses questions inviting reconceptualization, creates productive confusion, encourages working through difficulties
|
|
1618
|
+
|
|
1619
|
+
**Negative Markers**: Gives direct answers immediately, resolves confusion prematurely, treats knowledge as fixed content
|
|
1620
|
+
|
|
1621
|
+
### C.4 Dimension Weight Summary
|
|
1622
|
+
|
|
1623
|
+
| Dimension | Weight | Category |
|
|
1624
|
+
|-----------|--------|----------|
|
|
1625
|
+
| Relevance | 15% | Standard |
|
|
1626
|
+
| Specificity | 15% | Standard |
|
|
1627
|
+
| Pedagogical Soundness | 15% | Standard |
|
|
1628
|
+
| Personalization | 10% | Standard |
|
|
1629
|
+
| Actionability | 10% | Standard |
|
|
1630
|
+
| Tone | 10% | Standard |
|
|
1631
|
+
| Mutual Recognition | 10% | Recognition |
|
|
1632
|
+
| Dialectical Responsiveness | 10% | Recognition |
|
|
1633
|
+
| Transformative Potential | 10% | Recognition |
|
|
1634
|
+
| Memory Integration | 5% | Recognition |
|
|
1635
|
+
| **Total** | **100%** | |
|
|
1636
|
+
|
|
1637
|
+
Standard dimensions account for 75% of the score; recognition dimensions account for 25%. This weighting ensures that baseline tutoring competence is still the primary criterion while recognition quality provides meaningful differentiation.
|