@machinespirits/eval 0.2.0 → 0.3.0
- package/README.md +91 -9
- package/config/eval-settings.yaml +3 -3
- package/config/paper-manifest.json +486 -0
- package/config/providers.yaml +9 -6
- package/config/tutor-agents.yaml +2261 -0
- package/content/README.md +23 -0
- package/content/courses/479/course.md +53 -0
- package/content/courses/479/lecture-1.md +361 -0
- package/content/courses/479/lecture-2.md +360 -0
- package/content/courses/479/lecture-3.md +655 -0
- package/content/courses/479/lecture-4.md +530 -0
- package/content/courses/479/lecture-5.md +326 -0
- package/content/courses/479/lecture-6.md +346 -0
- package/content/courses/479/lecture-7.md +326 -0
- package/content/courses/479/lecture-8.md +273 -0
- package/content/courses/479/roadmap-slides.md +656 -0
- package/content/manifest.yaml +8 -0
- package/docs/research/build.sh +44 -20
- package/docs/research/figures/figure10.png +0 -0
- package/docs/research/figures/figure11.png +0 -0
- package/docs/research/figures/figure3.png +0 -0
- package/docs/research/figures/figure4.png +0 -0
- package/docs/research/figures/figure5.png +0 -0
- package/docs/research/figures/figure6.png +0 -0
- package/docs/research/figures/figure7.png +0 -0
- package/docs/research/figures/figure8.png +0 -0
- package/docs/research/figures/figure9.png +0 -0
- package/docs/research/header.tex +23 -2
- package/docs/research/paper-full.md +941 -285
- package/docs/research/paper-short.md +216 -585
- package/docs/research/references.bib +132 -0
- package/docs/research/slides-header.tex +188 -0
- package/docs/research/slides-pptx.md +363 -0
- package/docs/research/slides.md +531 -0
- package/docs/research/style-reference-pptx.py +199 -0
- package/package.json +6 -5
- package/scripts/analyze-eval-results.js +69 -17
- package/scripts/analyze-mechanism-traces.js +763 -0
- package/scripts/analyze-modulation-learning.js +498 -0
- package/scripts/analyze-prosthesis.js +144 -0
- package/scripts/analyze-run.js +264 -79
- package/scripts/assess-transcripts.js +853 -0
- package/scripts/browse-transcripts.js +854 -0
- package/scripts/check-parse-failures.js +73 -0
- package/scripts/code-dialectical-modulation.js +1320 -0
- package/scripts/download-data.sh +55 -0
- package/scripts/eval-cli.js +106 -18
- package/scripts/generate-paper-figures.js +663 -0
- package/scripts/generate-paper-figures.py +577 -76
- package/scripts/generate-paper-tables.js +299 -0
- package/scripts/qualitative-analysis-ai.js +3 -3
- package/scripts/render-sequence-diagram.js +694 -0
- package/scripts/test-latency.js +210 -0
- package/scripts/test-rate-limit.js +95 -0
- package/scripts/test-token-budget.js +332 -0
- package/scripts/validate-paper-manifest.js +670 -0
- package/services/__tests__/evalConfigLoader.test.js +2 -2
- package/services/__tests__/learnerRubricEvaluator.test.js +361 -0
- package/services/__tests__/learnerTutorInteractionEngine.test.js +326 -0
- package/services/evaluationRunner.js +975 -98
- package/services/evaluationStore.js +12 -4
- package/services/learnerTutorInteractionEngine.js +27 -2
- package/services/mockProvider.js +133 -0
- package/services/promptRewriter.js +1471 -5
- package/services/rubricEvaluator.js +55 -2
- package/services/transcriptFormatter.js +675 -0
- package/docs/EVALUATION-VARIABLES.md +0 -589
- package/docs/REPLICATION-PLAN.md +0 -577
- package/scripts/analyze-run.mjs +0 -282
- package/scripts/compare-runs.js +0 -44
- package/scripts/compare-suggestions.js +0 -80
- package/scripts/dig-into-run.js +0 -158
- package/scripts/show-failed-suggestions.js +0 -64
- package/scripts/{check-run.mjs → check-run.js} +0 -0
@@ -1,26 +1,13 @@
 ---
-title: "
+title: "*Geist* in the Machine: Mutual Recognition and Multiagent Architecture for Dialectical AI Tutoring"
 author: "Liam Magee"
 date: "February 2026"
-version: "2.
+version: "2.3.14-short"
 bibliography: references.bib
 csl: apa.csl
 link-citations: true
 abstract: |
-  Current
-
-  We implement this framework through the "Drama Machine" architecture: an Ego/Superego multiagent system where an external-facing tutor agent (Ego) generates pedagogical suggestions that are reviewed by an internal critic agent (Superego) before reaching the learner.
-
-  An evaluation framework (N=1,486 primary scored responses across eighteen key runs; N=3,800+ across the full development database) isolating recognition theory from prompt engineering effects and memory integration reveals that recognition theory is the primary driver of tutoring improvement: a corrected 2×2 experiment (N=120 across two independent runs) demonstrates that recognition produces large effects with or without memory (+15.2 pts without memory, d=1.71; +11.0 pts with memory), while memory alone provides only a modest, non-significant benefit (+4.8 pts, d=0.46, $p \approx .08$). The combined condition yields the highest scores (91.2, d=1.81 vs base), with ceiling effects limiting observable synergy. A post-hoc active control (N=118) using length-matched prompts with generic pedagogical content but no recognition theory scores approximately 9 points above same-model base but well below recognition levels, with recognition gains (~+15 pts above same-model base) substantially exceeding active-control gains (~+9 pts; see Section 8 for model confound caveats). A preliminary three-way comparison (N=36) found recognition outperforms enhanced prompting by +8.7 points, consistent with recognition dominance, though the increment does not reach significance under GPT-5.2 (+1.3 pts, p=.60). The multi-agent tutor architecture contributes **+0.5 to +10 points** depending on content domain—minimal on well-trained content but critical for domain transfer where it catches content isolation errors. A step-by-step evolution analysis of dynamic prompt rewriting with active Writing Pad memory (N=82 across three runs) suggests the Freudian memory model as an important enabler—the rewrite cell progresses from trailing its baseline by 7.2 points to leading by 5.5 points coinciding with Writing Pad activation, though controlled ablation is needed to confirm causality.
-
-  Three key findings emerge: (1) Recognition theory is the primary driver of improvement—recognition alone produces d=1.71, while memory provides a modest secondary benefit (d=0.46), with an active control showing recognition gains (~+15 pts above same-model base) substantially exceeding active-control gains (~+9 pts); (2) Multi-agent architecture is additive, not synergistic—a dedicated five-model probe (Kimi K2.5, Nemotron, DeepSeek V3.2, GLM-4.7, Claude Haiku 4.5; N=826 total) finds the A×B interaction consistently near zero or negative (mean −2.2 pts) across all models, definitively ruling out recognition-specific synergy; (3) Domain generalizability testing confirms recognition advantage replicates across both models and content domains—elementary math with Kimi shows +9.9 pts (d $\approx$ 0.61, N=60), with effects concentrated in challenging scenarios. The factor inversion between domains (philosophy: recognition dominance; elementary: architecture dominance) is partly model-dependent. Bilateral transformation tracking across three multi-turn scenarios (N=118) confirms that recognition-prompted tutors measurably adapt their approach in response to learner input (+26% relative improvement in adaptation index), though learner-side growth is not higher under recognition, suggesting tutor-side responsiveness rather than symmetric mutual transformation.
-
-  A cross-judge replication with GPT-5.2 confirms the main findings are judge-robust: the recognition effect (d=1.03 in the factorial, d=0.99 in the memory isolation experiment), recognition dominance in the 2×2 design (identical condition ordering, negative interaction), and multi-agent null effects all replicate, though at compressed magnitudes (~58% of primary judge effect sizes).
-
-  These findings suggest that recognition theory's value is domain-sensitive, multi-agent architecture provides essential error correction for domain transfer, and optimal deployment configurations depend on content characteristics.
-
-  The system is deployed in an open-source learning management system with all code, evaluation data, and reproducible analysis commands publicly available.
-keywords: [AI tutoring, mutual recognition, Hegel, Freud, multiagent systems, educational technology, productive struggle, Drama Machine, domain generalizability]
+  Current AI tutoring treats learners as knowledge deficits to be filled. We propose an alternative grounded in Hegel's theory of mutual recognition, where effective pedagogy requires acknowledging learners as autonomous subjects whose understanding has intrinsic validity. We implement this through recognition-enhanced prompts and a multi-agent architecture where an "Ego" agent generates pedagogical suggestions and a "Superego" agent evaluates them before delivery. Across thirty-seven evaluations (N=3,383 primary scored), recognition theory emerges as the primary driver of improvement: a 2$\times$2 memory isolation experiment (N=120) shows recognition produces d=1.71, while memory alone provides only d=0.46. A multi-model probe across five ego models (N=655) confirms architecture and recognition contribute additively, not synergistically. Cross-judge replication with GPT-5.2 validates the main findings at compressed magnitudes (inter-judge r=0.44--0.64). Phase 2 experiments reveal that nine architectural mechanisms are equivalent under scripted learners but differentiate with dynamic interlocutors: Theory of Mind profiling adds 4.1 points when genuine feedback loops exist. These results suggest that philosophical theories of intersubjectivity can serve as productive design heuristics for AI systems.
 fontsize: 12pt
 geometry: margin=1in
 header-includes: |
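The revised abstract above describes a pipeline in which an "Ego" agent generates pedagogical suggestions and a "Superego" agent evaluates them before delivery. As a hedged illustration of that control flow only: the function names and the review heuristic below are hypothetical sketches, not the package's actual API (the real logic lives in services such as `evaluationRunner.js` and `promptRewriter.js`, whose internals this diff does not show).

```javascript
// Minimal sketch of the Ego/Superego review loop described in the abstract.
// All names and heuristics here are hypothetical, not the package's API.

// Ego: drafts a suggestion that builds on the learner's contribution.
function egoPropose(learnerMessage) {
  return {
    text: `Let's work from your idea: "${learnerMessage}"`,
    // Crude stand-in for "does this honor productive struggle?"
    honorsStruggle: /confus|struggl|stuck/i.test(learnerMessage),
  };
}

// Superego: reviews the draft against recognition criteria before delivery.
function superegoReview(proposal) {
  if (!proposal.honorsStruggle) {
    return { approved: false, critique: 'do not short-circuit productive struggle' };
  }
  return { approved: true, critique: null };
}

// One tutor turn: propose, review, and annotate a revision if rejected.
function tutorTurn(learnerMessage) {
  const proposal = egoPropose(learnerMessage);
  const review = superegoReview(proposal);
  return review.approved
    ? proposal.text
    : `${proposal.text} [revised after critique: ${review.critique}]`;
}

console.log(tutorTurn("I'm confused about the spiral metaphor"));
```

The key design point the paper argues for is structural: the critique happens in a separate evaluative context rather than as self-correction inside the generating agent.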
@@ -28,154 +15,67 @@ header-includes: |
|
|
|
28
15
|
\floatplacement{figure}{H}
|
|
29
16
|
---
|
|
30
17
|
|
|
31
|
-
#
|
|
18
|
+
# *Geist* in the Machine: Mutual Recognition and Multiagent Architecture for Dialectical AI Tutoring (Short Version)
|
|
19
|
+
|
|
20
|
+
*This is a condensed version of the full paper. For complete results, appendices, system prompts, and reproducibility commands, see the full paper.*
|
|
32
21
|
|
|
33
22
|
## 1. Introduction
|
|
34
23
|
|
|
35
|
-
The dominant paradigm in AI-assisted education treats learning as information transfer
|
|
24
|
+
The dominant paradigm in AI-assisted education treats learning as information transfer: the learner lacks knowledge, the tutor possesses it, and the interaction succeeds when knowledge flows from tutor to learner. This paradigm---implicit in most intelligent tutoring systems, adaptive learning platforms, and educational chatbots---treats the learner as fundamentally passive: a vessel to be filled, a gap to be closed.
|
|
36
25
|
|
|
37
|
-
This paper proposes an alternative grounded in Hegel's theory of mutual recognition. In the *Phenomenology of Spirit* [@Hegel1977PhenomenologyMiller], Hegel argues that genuine self-consciousness requires recognition from another consciousness that one
|
|
26
|
+
This paper proposes an alternative grounded in Hegel's theory of mutual recognition. In the *Phenomenology of Spirit* [@Hegel1977PhenomenologyMiller], Hegel argues that genuine self-consciousness requires recognition from another consciousness that one in turn recognizes as valid. The master-slave dialectic reveals that one-directional recognition fails: the master's self-consciousness remains hollow because the slave's acknowledgment, given under duress, does not truly count. Only mutual recognition---where each party acknowledges the other as an autonomous subject---produces genuine selfhood.
|
|
38
27
|
|
|
39
|
-
The connection between Hegelian thought and pedagogy is well established. Vygotsky's zone of proximal development [@vygotsky1978] presupposes a dialogical relationship
|
|
28
|
+
The connection between Hegelian thought and pedagogy is well established. Vygotsky's zone of proximal development [@vygotsky1978] presupposes a dialogical relationship echoing Hegel's mutual constitution of self-consciousness. The German *Bildung* tradition explicitly frames education as self-formation through encounter with otherness [@stojanov2018], and recognition theory [@honneth1995] has been applied to educational contexts [@huttunen2007]. Our contribution is to operationalize these philosophical commitments as concrete design heuristics for AI tutoring systems and to measure their effects empirically.
|
|
40
29
|
|
|
41
30
|
We argue this framework applies directly to pedagogy. When a tutor treats a learner merely as a knowledge deficit, the learner's contributions become conversational waypoints rather than genuine inputs. The tutor acknowledges and redirects, but does not let the learner's understanding genuinely shape the interaction. This is pedagogical master-slave dynamics: the tutor's expertise is confirmed, but the learner remains a vessel rather than a subject.
|
|
42
31
|
|
|
43
|
-
A recognition-oriented tutor, by contrast, treats the learner's understanding as having intrinsic validity
|
|
44
|
-
|
|
45
|
-
The integration of large language models (LLMs) into educational technology intensifies these dynamics. LLMs can provide personalized, on-demand tutoring at scale—a prospect that has generated considerable excitement. However, the same capabilities that make LLMs effective conversationalists also introduce concerning failure modes. Chief among these is *sycophancy*: the tendency to provide positive, affirming responses that align with what the user appears to want rather than what genuinely serves their learning.
|
|
46
|
-
|
|
47
|
-
This paper introduces a multiagent architecture that addresses these challenges through *internal dialogue*. Drawing on Freudian structural theory and the "Drama Machine" framework for character development in narrative AI systems [@magee2024drama], we implement a tutoring system in which an external-facing *Ego* agent generates suggestions that are reviewed by an internal *Superego* critic before reaching the learner.
|
|
48
|
-
|
|
49
|
-
### 1.1 Contributions
|
|
50
|
-
|
|
51
|
-
We make the following contributions:
|
|
52
|
-
|
|
53
|
-
1. **The Drama Machine Architecture**: A complete multiagent tutoring system with Ego and Superego agents, implementing the Superego as a *ghost* (internalized memorial authority) rather than an equal dialogue partner.
|
|
54
|
-
|
|
55
|
-
2. **Memory Isolation Experiment**: A corrected 2×2 experiment (N=120 across two independent runs) demonstrating recognition as the primary driver (d=1.71), with memory providing a modest secondary benefit (d=0.46) and ceiling effects limiting observable synergy. A post-hoc active control (N=118) shows recognition gains (~+15 pts) substantially exceeding active-control gains (~+9 pts above same-model base).
|
|
56
|
-
|
|
57
|
-
3. **Robust Factorial Evaluation**: A 2×2×2 factorial design (N=1,486 primary scored across eighteen key runs; N=3,800+ across the full development database) across multiple models, scenarios, and conditions, providing statistically robust effect estimates. A significant Recognition × Learner interaction (F=21.85, p<.001) reveals that recognition benefits single-agent learners far more (+15.5 pts, d=1.28) than multi-agent learners (+4.8 pts, d=0.37).
|
|
32
|
+
A recognition-oriented tutor, by contrast, treats the learner's understanding as having intrinsic validity---not because it is correct, but because it emerges from an autonomous consciousness working through material. The learner's metaphors, confusions, and insights become sites of joint inquiry. The tutor's response is shaped by the learner's contribution, not merely triggered by it.
|
|
58
33
|
|
|
59
|
-
|
|
34
|
+
We operationalize this through: (1) **recognition-enhanced prompts** that instruct the AI to treat learners as autonomous subjects; (2) **a multi-agent architecture** where a "Superego" agent evaluates whether suggestions achieve genuine recognition; (3) **new evaluation dimensions** that measure recognition quality alongside traditional pedagogical metrics; and (4) **test scenarios** specifically designed to probe recognition behaviors.
|
|
60
35
|
|
|
61
|
-
|
|
36
|
+
In controlled evaluations across thirty-seven key evaluations (N=3,383 primary scored responses; N=7,000+ across all development runs), we isolate the contribution of recognition theory from prompt engineering effects and memory integration. The definitive test is a corrected 2$\times$2 memory isolation experiment (N=120 across two independent runs): recognition theory is the primary driver, producing d=1.71 (+15.2 pts) even without memory, while memory alone provides only d=0.46 (+4.8 pts, $p \approx .08$). A full 2$\times$2$\times$2 factorial (N=350) confirms recognition as the dominant factor ($\eta^2$=.243, d=1.11). A multi-model probe across five ego models (N=655) confirms that architecture and recognition contribute additively, not synergistically.
|
|
62
37
|
|
|
63
|
-
|
|
64
|
-
|
|
65
|
-
6. **Hardwired Rules Ablation**: A larger replication (N=72) reversing the initial finding—encoding the Superego's most common critique patterns as static rules *degrades* performance rather than replicating its benefit, supporting a *phronesis* interpretation where the Superego's value lies in contextual judgment.
|
|
66
|
-
|
|
67
|
-
7. **Bilateral Transformation Metrics**: Empirical evidence (N=118, three multi-turn scenarios) that recognition-prompted tutors measurably adapt their approach (+26%), though learner-side growth does not increase, qualifying the "mutual transformation" claim as primarily tutor-side responsiveness.
|
|
68
|
-
|
|
69
|
-
8. **Reproducible Evaluation Framework**: Complete documentation of evaluation commands and run IDs enabling independent replication of all findings.
|
|
38
|
+
The contributions of this paper include: a theoretical framework connecting Hegelian recognition to AI pedagogy; a multi-agent architecture implementing recognition through Freudian structural theory; empirical evidence across thirty-seven evaluations (N=3,383); a corrected memory isolation experiment demonstrating recognition as the primary driver; evidence from a post-hoc active control showing recognition gains substantially exceed generic pedagogical elaboration; bilateral transformation metrics showing tutor-side adaptation (+26%); post-hoc modulation analysis reframing the Drama Machine as *phronesis* rather than productive irresolution; mechanism robustness testing revealing the scripted learner confound; a cognitive prosthesis test establishing a minimum ego capability threshold; and qualitative transcript assessment identifying three specific changes recognition produces.
|
|
70
39
|
|
|
71
40
|
---
|
|
72
41
|
|
|
73
42
|
## 2. Related Work
|
|
74
43
|
|
|
75
|
-
|
|
76
|
-
|
|
77
|
-
Intelligent Tutoring Systems (ITS) have a long history, from early systems like SCHOLAR [@carbonell1970] and SOPHIE [@brown1975] through modern implementations using large language models [@kasneci2023]. Recent multi-agent approaches include GenMentor [@wang2025genmentor], which decomposes tutoring into five specialized agents, and Ruffle&Riley [@schmucker2024ruffle], which orchestrates two LLM agents in a learning-by-teaching format. A comprehensive survey [@chu2025llmagents] maps the growing landscape of LLM agents in education.
|
|
78
|
-
|
|
79
|
-
Most ITS research focuses on *what* to teach and *when* to intervene. Our work addresses a different question: *how* to relate to the learner as a subject. This relational dimension connects to work on rapport [@zhao2014], social presence [@biocca2003], and affective tutoring [@dmello2012], but has received less attention in LLM-based tutoring. Where multi-agent tutoring systems decompose *tasks*, our architecture implements *internal dialogue*—the Superego evaluates relational quality before any response reaches the learner.
|
|
80
|
-
|
|
81
|
-
### 2.2 Multiagent LLM Architectures
|
|
82
|
-
|
|
83
|
-
The use of multiple LLM agents in cooperative or adversarial configurations has emerged as a powerful paradigm for improving output quality. Debate between agents can improve factual accuracy and reduce hallucination [@irving2018; @madaan2023]. Constitutional AI [@bai2022constitutional] implements self-critique against explicit principles—the closest precedent to our Superego, though operating on behavioral constraints rather than relational quality.
|
|
84
|
-
|
|
85
|
-
**The Drama Machine Framework**: Most relevant to our work is the "Drama Machine" framework for simulating character development in narrative contexts [@magee2024drama]. The core observation is that realistic characters exhibit *internal conflict*—competing motivations, self-doubt, and moral tension—that produces dynamic behavior rather than flat consistency. A character who simply enacts their goals feels artificial; one torn between impulses feels alive.
|
|
86
|
-
|
|
87
|
-
The Drama Machine achieves this through several mechanisms:
|
|
88
|
-
|
|
89
|
-
1. **Internal dialogue agents**: Characters contain multiple sub-agents representing different motivations (e.g., ambition vs. loyalty) that negotiate before external action.
|
|
90
|
-
|
|
91
|
-
2. **Memorial traces**: Past experiences and internalized authorities (mentors, social norms) persist as "ghosts" that shape present behavior without being negotiable.
|
|
92
|
-
|
|
93
|
-
3. **Productive irresolution**: Not all internal conflicts resolve; the framework permits genuine ambivalence that manifests as behavioral complexity.
|
|
94
|
-
|
|
95
|
-
4. **Role differentiation**: Different internal agents specialize in different functions (emotional processing, strategic calculation, moral evaluation) rather than duplicating capabilities.
|
|
96
|
-
|
|
97
|
-
We adapt these insights to pedagogy. Where drama seeks tension for narrative effect, we seek pedagogical tension that produces genuinely helpful guidance. The tutor's Ego (warmth, engagement) and Superego (rigor, standards) create productive conflict that improves output quality.
|
|
44
|
+
Four literatures converge on this work without previously intersecting: (1) psychoanalytic readings of LLMs, which interpret AI through Freudian and Lacanian frameworks but do not build systems [@black2025subject; @possati2021algorithmic; @millar2021psychoanalysis; @kim2025humanoid]; (2) recognition theory in education, which applies Honneth to pedagogy but not to AI [@huttunen2004teaching; @fleming2011honneth; @stojanov2018]; (3) multi-agent tutoring architectures, which decompose tasks but do not evaluate relational quality [@wang2025genmentor; @schmucker2024ruffle; @chu2025llmagents]; and (4) LLM-as-Judge evaluation methodology [@zheng2023judging; @gu2025surveyjudge; @li2024llmsjudges]. We sit at the intersection: a constructive, empirically evaluated system that operationalizes recognition theory through psychoanalytically-inspired architecture, assessed through a multi-judge framework.
|
|
98
45
|
|
|
99
|
-
|
|
46
|
+
**AI tutoring** has progressed from early systems like SCHOLAR [@carbonell1970] through Bayesian knowledge tracing [@corbett1995] to neural approaches using pretrained language models [@kasneci2023]. A systematic review of 88 empirical studies [@shi2025llmeducation] finds consistent engagement benefits but limited evidence on deep conceptual learning. Multi-agent frameworks including GenMentor [@wang2025genmentor] and Ruffle&Riley [@schmucker2024ruffle] decompose tutoring into specialized agents but give less attention to the relational dynamics of the tutor-learner interaction. Most ITS research focuses on *what* to teach and *when* to intervene; our work addresses *how* to relate to the learner as a subject.
|
|
100
47
|
|
|
101
|
-
|
|
48
|
+
**Prompt engineering** research treats prompts as behavioral specifications [@brown2020; @wei2022]. Our recognition prompts specify something different: agent-other relations. The closest precedent is Constitutional AI [@bai2022constitutional], where models critique outputs according to constitutional principles. Critical work on self-correction [@kamoi2024selfcorrection] shows LLMs largely cannot correct their own mistakes without external feedback---directly motivating our Superego as structural external critic. Reflexion [@shinn2023reflexion] demonstrated the promise of verbal self-reflection but noted a "degeneration-of-thought" problem, which our architecture avoids through a separate evaluative context.
|
|
102
49
|
|
|
103
|
-
|
|
50
|
+
**The Drama Machine** framework for character development in narrative AI systems [@magee2024drama] provides the architectural inspiration. The core observation is that realistic characters exhibit internal conflict---competing motivations, self-doubt, moral tension---that produces dynamic behavior rather than flat consistency. We adapt this to pedagogy, where the tutor's Ego (warmth, engagement) and Superego (rigor, standards) create productive conflict.
|
|
104
51
|
|
|
105
|
-
|
|
52
|
+
**Sycophancy** in language models [@perez2022; @sharma2023] has been specifically identified as a pedagogical risk [@siai2025sycophancy]. Recent work has clarified the mechanisms: preference-based post-training causally amplifies sycophancy [@shapira2026rlhf], and the phenomenon can escalate from surface agreeableness to active subterfuge [@denison2024_reward_tampering; @greenblatt2024_alignment_faking]. Our framework connects this to recognition theory: sycophancy is the pedagogical equivalent of Hegel's hollow recognition. A sycophantic tutor confirms the learner's existing understanding rather than challenging it---the master-slave dynamic where the learner's contributions are mentioned but never genuinely shape the interaction.
|
|
106
53
|
|
|
107
|
-
|
|
54
|
+
**Constructivist pedagogy** [@piaget1954; @vygotsky1978] emphasizes that learners actively construct understanding. Research on "productive struggle" [@kapur2008; @warshauer2015] examines how confusion and difficulty, properly supported, enhance learning. Our recognition framework operationalizes productive struggle: the Superego explicitly checks whether the Ego is short-circuiting struggle by rushing to resolve confusion.
|
|
108
55
|
|
|
109
|
-
|
|
110
|
-
|
|
111
|
-
Hegel's theory of recognition has been extensively developed in social and political philosophy [@honneth1995; @taylor1994; @fraser2003]. Particularly relevant is Honneth's synthesis of Hegelian recognition with psychoanalytic developmental theory. Applications to education include Huttunen and Heikkinen's [-@huttunen2004teaching] foundational analysis of the dialectic of recognition in teaching, Fleming's [-@fleming2011honneth] extension to transformative learning, and the *Bildung* tradition connecting self-formation to recognition [@stojanov2018; @costa2025generativeai]. The broader relational pedagogy tradition—Buber [-@buber1958], Freire [-@freire1970], Noddings [-@noddings1984]—treats the pedagogical relation as constitutive rather than instrumental.
|
|
112
|
-
|
|
113
|
-
These applications have been primarily theoretical. Our work contributes an empirical operationalization. It is worth distinguishing this from Abdali et al. [-@abdali2025selfreflecting], who apply Hegelian *dialectic* (thesis-antithesis-synthesis as reasoning procedure) to LLM self-reflection. We apply Hegel's *recognition theory* (intersubjective, relational)—a different aspect of his work entirely.
|
|
114
|
-
|
|
115
|
-
### 2.6 Psychoanalytic Readings of AI
|
|
116
|
-
|
|
117
|
-
Psychoanalytic frameworks have been applied to LLMs from multiple directions: Magee, Arora, and Munn [-@MageeAroraMunn2023StructuredLikeALanguageModel] analyze LLMs as "automated subjects"; Black and Johanssen [-@black2025subject] use Lacanian concepts to analyze ChatGPT as inherently relational; Possati [-@possati2021algorithmic] introduces the "algorithmic unconscious"; and Kim et al. [-@kim2025humanoid] independently map Freud's ego/id/superego onto LLM consciousness modules. Most of this work is *interpretive*—analyzing what AI means philosophically. Our approach is *constructive*: we build a system using psychoanalytic architecture and measure its effects empirically. Three literatures converge on this work without previously intersecting: psychoanalytic AI, recognition in education, and multi-agent tutoring. No prior work bridges all three with empirical measurement.
|
|
56
|
+
**LLM-as-Judge evaluation** has become a major methodological paradigm. Zheng et al. [-@zheng2023judging] demonstrated that GPT-4 achieves over 80% agreement with human experts while identifying systematic biases including position bias and verbosity bias. Our evaluation methodology uses three independent LLM judges with systematic inter-judge reliability analysis, reporting within-judge comparisons for factor analysis and cross-judge replication to validate effect directions.
|
|
118
57
|
|
|
119
58
|
---
|
|
120
59
|
|
|
121
60
|
## 3. Theoretical Framework
|
|
122
61
|
|
|
123
|
-
### 3.1
|
|
124
|
-
|
|
125
|
-
Consider a typical tutoring interaction. A learner says: "I think dialectics is like a spiral—you keep going around but you're also going up." A baseline tutor might respond:
|
|
126
|
-
|
|
127
|
-
1. **Acknowledge**: "That's an interesting way to think about it."
|
|
128
|
-
2. **Redirect**: "The key concept in dialectics is actually the thesis-antithesis-synthesis structure."
|
|
129
|
-
3. **Instruct**: "Here's how that works..."
|
|
130
|
-
|
|
131
|
-
The learner's contribution has been mentioned, but it has not genuinely shaped the response. The tutor was going to explain thesis-antithesis-synthesis regardless; the spiral metaphor became a conversational waypoint, not a genuine input.
|
|
132
|
-
|
|
133
|
-
This pattern—acknowledge, redirect, instruct—is deeply embedded in educational AI. It appears learner-centered because it mentions the learner's contribution. But the underlying logic remains one-directional: expert to novice, knowledge to deficit.
### 3.1 Hegel's Master-Slave Dialectic and Pedagogy
Hegel's analysis of recognition begins with the "struggle for recognition" between two self-consciousnesses. The master-slave outcome represents a failed resolution: the master achieves apparent recognition, but this is hollow because the slave's acknowledgment does not count---the slave has not been recognized as an autonomous consciousness whose acknowledgment matters.
Crucially, Hegel does not leave the dialectic at this impasse. The slave achieves more genuine self-consciousness through *formative activity* (*Bildung*): through disciplined labor under pressure, the slave develops skills, self-discipline, and a richer form of self-consciousness. This has direct pedagogical implications: the learner's productive struggle with difficult material is not an obstacle to self-consciousness but a *constitutive condition* for it. What recognition theory adds is the requirement that this struggle be *acknowledged* rather than bypassed.
We apply this as a *derivative* rather than a replica. We distinguish three levels: (1) **recognition proper** (intersubjective acknowledgment between self-conscious beings---unachievable by AI); (2) **dialogical responsiveness** (being substantively shaped by the other's input---architecturally achievable); and (3) **recognition-oriented design** (architectural features that approximate recognition's functional benefits---what we implement and measure). Our claim is that level three produces measurable pedagogical benefits without requiring level one.
A recognition-oriented pedagogy requires acknowledging the learner as subject, genuine engagement with learner contributions, mutual transformation through the encounter, and honoring productive struggle rather than short-circuiting it.
### 3.2 Connecting Hegel and Freud: The Internalized Other
Both Hegel and Freud describe how the external other becomes an internal presence enabling self-regulation. In Hegel, self-consciousness achieves genuine selfhood only by internalizing the other's perspective. In Freud, the Superego is literally the internalized parental/social other. Honneth's [@honneth1995] synthesis provides the theoretical grounding: Hegel's recognition theory gains psychological concreteness through psychoanalytic concepts, while psychoanalytic concepts gain normative grounding through recognition theory.
Three connecting principles link the frameworks. First, internal dialogue precedes adequate external action---the Ego-Superego exchange before external response enacts the principle that adequate recognition requires prior internal work. Second, standards of recognition are socially constituted but individually held---the Superego represents internalized recognition standards. Third, self-relation depends on other-relation---the tutor's capacity for recognition emerges through the architecture's internal other-relation.
We supplement with Freud's "Mystic Writing-Pad" [@freud1925] model of memory: accumulated memory of the learner functions as wax-base traces that shape future encounters. Memory integration operationalizes the ongoing nature of recognition---not a single-turn achievement but an accumulated relationship.
---
### 4.1 The Ego/Superego Design
Two agents collaborate to produce each tutoring response. **The Ego** generates pedagogical suggestions given the learner's context, including recognition principles (treat the learner as autonomous subject), memory guidance, decision heuristics, and quality criteria. **The Superego** evaluates the Ego's suggestions before any reach the learner: Does this engage with the learner's contribution or merely mention it? Does this create conditions for transformation or just transfer information? Does this honor productive struggle or rush to resolve confusion?
A crucial theoretical refinement: the Superego is not conceived as a separate equal agent but as a *trace*---a memorial, a haunting. It represents the internalized voice of past teachers and accumulated pedagogical maxims. Recognition occurs in the Ego-Learner encounter, not in the Ego-Superego dialogue. The Ego is a *living* agent torn between two pressures: the *ghost* (Superego as internalized authority) and the *living Other* (the learner seeking recognition).
### 4.2 Dialectical Negotiation
The Ego generates an initial suggestion (thesis), the Superego generates a genuine critique (antithesis), and multi-turn negotiation produces one of three outcomes: dialectical synthesis (~60%), compromise, or genuine conflict. The evaluation reveals this catches specific failure modes: engagement failures (64%), specificity gaps (51%), premature resolution (48%). Notably, encoding these patterns as static rules fails to replicate the Superego's benefit, suggesting value lies in contextual judgment (*phronesis*) rather than rule enforcement.
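
The negotiation protocol above can be sketched as a small control loop. This is an illustrative sketch only: `propose`, `critique`, and `revise` stand in for the underlying LLM calls, and the outcome labels mirror the three outcomes described in the text; none of these names are the package's actual API.

```python
# Sketch of the Ego-Superego negotiation loop (hypothetical API).
def negotiate(propose, critique, revise, max_turns=3):
    suggestion = propose()                    # thesis
    for _ in range(max_turns):
        objection = critique(suggestion)      # antithesis (None = satisfied)
        if objection is None:
            return suggestion, "synthesis"
        revised = revise(suggestion, objection)
        if revised == suggestion:             # Ego declines to move
            return suggestion, "conflict"
        suggestion = revised                  # negotiation continues
    return suggestion, "compromise"           # turn budget exhausted
```

With the ~60% synthesis rate reported above, most negotiations would end in the first branch; the other two outcomes are preserved deliberately rather than forced to converge.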
### 4.3 Phase 2 Mechanisms
Phase 2 extends the base architecture with three mechanism families. **Self-reflective evolution** (cells 40--45): between turns, both ego and superego generate first-person reflections on their own operation, injected into subsequent turns. **Other-ego profiling (Theory of Mind)** (cells 54--65): an LLM call synthesizes an evolving profile of the learner, tracking cognitive state, learning patterns, resistance points, and leverage points. In bidirectional configurations, the learner similarly builds a profile of the tutor, creating a genuine feedback loop. **Superego disposition rewriting** (cells 34--39): the superego's evaluation criteria evolve between turns based on learner engagement feedback.
---
## 5. Evaluation Methodology
### 5.1 Rubric Design
The evaluation rubric comprises 14 dimensions across three categories, each scored 1--5 by an LLM judge. **Standard pedagogical dimensions** (8 dimensions, 81% raw weight) include relevance, specificity, pedagogical soundness, personalization, actionability, tone, productive struggle, and epistemic honesty. **Recognition dimensions** (4 dimensions, 29.9% raw weight) operationalize Hegelian recognition: mutual recognition, dialectical responsiveness, memory integration, and transformative potential. **Bilateral transformation dimensions** (2 dimensions, 10% raw weight) measure mutual change: tutor adaptation and learner growth. Raw weights total 120.9% and are normalized to sum to 1.0.
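
The normalization step can be made concrete. The per-dimension values below are taken from the rubric; the snake_case keys are abbreviations for this sketch, not the package's identifiers.

```python
# Normalizing the rubric's raw weights (which total 120.9%) to sum to 1.0.
raw_weights = {
    # standard pedagogical dimensions (81% raw)
    "relevance": 15.0, "specificity": 15.0, "pedagogical_soundness": 15.0,
    "personalization": 10.0, "actionability": 8.0, "tone": 8.0,
    "productive_struggle": 5.0, "epistemic_honesty": 5.0,
    # recognition dimensions (29.9% raw)
    "mutual_recognition": 8.3, "dialectical_responsiveness": 8.3,
    "memory_integration": 5.0, "transformative_potential": 8.3,
    # bilateral transformation dimensions (10% raw)
    "tutor_adaptation": 5.0, "learner_growth": 5.0,
}
total = sum(raw_weights.values())                       # 120.9
weights = {k: v / total for k, v in raw_weights.items()}
```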
A complementary 6-dimension learner rubric scores learner turns independently: authenticity, question quality, conceptual engagement, revision signals, deliberation depth, and persona consistency.
### 5.2 Test Scenarios and Agent Profiles
The primary curriculum is Hegelian philosophy, with domain generalizability tested on elementary mathematics (4th-grade fractions). Fifteen scenarios probe recognition behaviors, including single-turn scenarios (recognition-seeking learner, transformative moment, memory continuity) and multi-turn scenarios (misconception correction, frustration to breakthrough, mutual transformation journey).
Five agent profiles provide structured comparisons: **Base** (minimal instructions), **Enhanced** (improved instructions without recognition theory), **Recognition** (full Hegelian framework with memory), **Recognition+Multi** (full treatment with ego/superego architecture), and **Active Control** (length-matched, pedagogical best practices, no recognition theory).
### 5.3 Model Configuration
**Kimi K2.5** (Moonshot AI) is the primary tutor model---capable and free to access, making results reproducible without API costs. **Nemotron 3 Nano 30B** (NVIDIA) serves as a weaker secondary model. **Claude Opus** serves as the primary judge. Additional ego models in the multi-model probe include DeepSeek V3.2, GLM-4.7, and Claude Haiku 4.5.
### 5.4 Evaluation Pipeline
The end-to-end pipeline proceeds in three stages. **Stage 1 (Generation)**: For each cell, the CLI loads a scenario and agent profile, then sends the learner context to the tutor agent(s) via OpenRouter API calls. For multi-turn scenarios, the learner agent generates responses between tutor turns. **Stage 2 (Scoring)**: Each generated response is sent to the judge model along with the full rubric, scenario context, and (for multi-turn dialogues) the complete transcript. The judge scores each dimension on a 1--5 scale and returns structured JSON, stored in a SQLite database. **Stage 3 (Analysis)**: Statistical analyses (ANOVA, effect sizes, confidence intervals) are computed from the scored database. Cross-judge replication sends identical responses to a second judge model.
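
A minimal sketch of Stage 2, with the judge stubbed out: structured JSON scores land in SQLite. The table schema and the `judge` stub are hypothetical illustrations; the real pipeline sends the rubric and transcript to an LLM judge instead.

```python
# Stage 2 (scoring) sketch: stubbed judge -> JSON scores -> SQLite.
import json
import sqlite3

def judge(response, rubric):
    # Stand-in for the LLM judge call: one 1-5 score per dimension.
    return json.dumps({dim: 4 for dim in rubric})

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE scores (cell INTEGER, response_id TEXT, scores_json TEXT)")

rubric = ["relevance", "specificity", "mutual_recognition"]
scores_json = judge("Let's build on your spiral metaphor...", rubric)
db.execute("INSERT INTO scores VALUES (?, ?, ?)", (5, "resp-001", scores_json))
db.commit()

(stored,) = db.execute(
    "SELECT scores_json FROM scores WHERE response_id = ?", ("resp-001",)
).fetchone()
scores = json.loads(stored)
```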
### 5.5 Statistical Approach
Complementary analyses form a converging evidence strategy: recognition theory validation (N=36), full 2$\times$2$\times$2 factorial (N=350), A$\times$B interaction probes across five models (N=655), domain generalizability (N=60), memory isolation (N=120), and cross-judge replication with GPT-5.2. We report Cohen's d, ANOVA F-tests ($\alpha$=0.05), and 95% confidence intervals. Effect sizes follow standard conventions: |d| < 0.2 negligible, 0.2--0.5 small, 0.5--0.8 medium, >0.8 large.
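
The effect-size convention can be written down directly. This is a plain-Python sketch of pooled-SD Cohen's d and the labels used throughout; it is not the paper's analysis code.

```python
# Cohen's d (pooled SD) and the effect-size labels used in the text.
import statistics

def cohens_d(a, b):
    pooled_var = (
        (len(a) - 1) * statistics.variance(a)
        + (len(b) - 1) * statistics.variance(b)
    ) / (len(a) + len(b) - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

def effect_label(d):
    d = abs(d)
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"
```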
### 5.6 Inter-Judge Reliability
To assess reliability, identical tutor responses were scored by multiple AI judges. The primary comparison (Claude Code vs GPT-5.2, N=36 paired responses) yields r=0.66 (p<.001). Claude-Kimi shows weaker agreement (r=0.38, p<.05), while Kimi-GPT is weakest (r=0.33, p<.10). Calibration differs: Kimi (87.5) is most lenient, Claude (84.4) middle, GPT (76.1) strictest. Kimi exhibited severe ceiling effects, assigning maximum scores on actionability for every response, reducing its discriminative capacity.
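
The agreement statistic is ordinary Pearson correlation over paired judge scores; a pure-Python sketch of the standard formula:

```python
# Pearson r between two judges' scores for the same responses.
import statistics

def pearson_r(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```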
The strongest cross-judge agreement occurs on tone (r=0.36--0.65) and specificity (r=0.45--0.50), while relevance and personalization show poor agreement. Claude prioritizes engagement and recognition quality; Kimi prioritizes structural completeness; GPT applies stricter overall standards but agrees with Claude on relative rankings. This validates within-judge comparisons for factor analysis while cautioning against cross-judge score comparisons. A full cross-judge replication is reported in Section 6.12.
### 5.7 Sample Size Reconciliation
**Table 1: Evaluation Sample Summary**
| Evaluation | Section | N Scored |
|------------|---------|----------|
| Recognition validation | 6.1 | 36 |
| Full factorial (cells 1--8, 2 runs) | 6.2 | 350 |
| Memory isolation (2 independent runs) | 6.3 | 120 |
| Active control (post-hoc) | 6.3 | 118 |
| A$\times$B probes (5 ego models) | 6.4 | 655 |
| Domain generalizability (elementary math) | 6.5 | 60 |
| Hardwired rules ablation | 6.6 | 72 |
| Dialectical modulation (cells 22--33) | 6.7 | 174 |
| Self-reflective evolution (cells 40--45) | 6.8 | 150 |
| Mechanism robustness, scripted (cells 40--59) | 6.9 | 360 |
| Dynamic learner mechanisms (cells 60--70) | 6.9 | 300 |
| Cognitive prosthesis (cells 66--68) | 6.9 | 96 |
| Bilateral transformation (multi-turn) | 6.10 | 118 |
| Qualitative transcript assessment | 6.11 | 478 |
| Cross-judge replication (GPT-5.2) | 6.12 | 977 |
| Prompt elaboration baseline | 6.13 | 144 |
| Token budget sensitivity | 6.14 | 126 |
| Dialectical impasse test | 6.15 | 24 |
| **Paper totals** | — | **3,383** |
The complete database contains 7,000+ evaluations across 117+ runs. This table groups the thirty-seven key evaluations by topic; several rows combine multiple runs (e.g., the factorial comprises two runs, the memory isolation two independent replications). The full paper's Appendix D provides a per-run breakdown.
---
### 6.1 Three-Way Comparison: Recognition vs Enhanced vs Base
A three-way comparison (N=36) provides preliminary evidence that recognition theory adds value beyond prompt engineering. Recognition scores 91.6 (SD=6.2), Enhanced 83.6 (SD=10.8), Base 72.0 (SD=10.8). The recognition increment over enhanced prompting is +8.0 points, with the total recognition effect +19.7 points above base (one-way ANOVA F(2,33)=12.97, p<.001). However, this comparison bundles recognition theory with memory integration. The controlled 2$\times$2 design below disentangles these factors.
### 6.2 Memory Isolation: The Definitive Finding
The paper's primary empirical finding comes from a corrected 2$\times$2 memory isolation experiment (Memory ON/OFF $\times$ Recognition ON/OFF, single-agent architecture held constant, N=120 across two independent runs, Kimi K2.5 ego, Claude Opus judge).
**Table 2: 2$\times$2 Memory Isolation Experiment (N=120, combined across two runs)**
| | No Recognition | Recognition | $\Delta$ |
|---|---|---|---|
| **No Memory** | 75.4 (N=30) | 90.6 (N=30) | +15.2 |
| **Memory** | 80.2 (N=30) | 91.2 (N=30) | +11.0 |
| **$\Delta$** | +4.8 | +0.6 | **Interaction: -4.2** |
Recognition effect: d=1.71, t(45)=6.62, p<.0001. Memory effect: d=0.46, t(57)=1.79, p$\approx$.08, n.s. Combined condition: d=1.81 vs base. The negative interaction (-4.2 pts) indicates ceiling effects rather than synergy: recognition alone reaches ~91, leaving little room for memory to add. Two independent runs show identical condition ordering with no rank reversals.
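
The simple effects and interaction follow directly from the cell means in Table 2 (values copied from the text; the 2$\times$2 algebra is the point of the sketch):

```python
# Recomputing the 2x2 simple effects and interaction from cell means.
mean = {("no_mem", "base"): 75.4, ("no_mem", "recog"): 90.6,
        ("mem", "base"): 80.2, ("mem", "recog"): 91.2}

recog_effect_no_mem = mean[("no_mem", "recog")] - mean[("no_mem", "base")]
recog_effect_mem = mean[("mem", "recog")] - mean[("mem", "base")]
interaction = recog_effect_mem - recog_effect_no_mem  # negative = sub-additive
```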
A post-hoc **active control** (N=118, Nemotron ego, Opus judge) using length-matched prompts with pedagogical best practices (growth mindset, Bloom's taxonomy, scaffolding strategies) but no recognition theory scores 66.5. Same-model comparison within Nemotron data: recognition (~73) > active control (66.5) > base (~58). Recognition gains (~+15 pts) roughly double the active control's benefit (~+9 pts), supporting recognition theory's specific contribution beyond prompt length.
**Cross-judge confirmation**: GPT-5.2, scoring the identical responses (N=119 paired), replicates recognition dominance with identical condition ordering: recognition d=1.54 (vs Claude d=1.71), memory d=0.49, negative interaction -3.6. Inter-judge r=0.63 (p<.001).
### 6.3 Full Factorial Analysis: 2$\times$2$\times$2 Design
The full factorial crosses three factors: Factor A (Recognition: base vs recognition prompts), Factor B (Tutor Architecture: single-agent vs multi-agent ego/superego), and Factor C (Learner Architecture: single-agent vs multi-agent learner).
**Table 3: Full Factorial Results (Kimi K2.5, N=350 scored of 352 attempted)**
| Cell | A: Recognition | B: Tutor | C: Learner | N | Mean | SD |
|------|----------------|----------|------------|---|------|-----|
| 1 | Base | Single | Single | 44 | 73.4 | 11.5 |
| 2 | Base | Single | Multi | 42 | 69.9 | 19.4 |
| 3 | Base | Multi | Single | 45 | 75.5 | 10.3 |
| 4 | Base | Multi | Multi | 41 | 75.2 | 16.4 |
| 5 | **Recog** | Single | Single | 45 | 90.2 | 6.5 |
| 6 | **Recog** | Single | Multi | 44 | 83.9 | 15.4 |
| 7 | **Recog** | Multi | Single | 45 | 90.1 | 7.2 |
| 8 | **Recog** | Multi | Multi | 44 | 87.3 | 11.3 |
**ANOVA Summary (df=1,342):**
| Source | F | p | $\eta^2$ |
|--------|---|---|-----|
| A: Recognition | **110.04** | **<.001** | **.243** |
| B: Architecture | 3.63 | .057 | .011 |
| C: Learner | 5.52 | .019 | .016 |
| A$\times$B | 0.59 | >.10 | .002 |
| A$\times$C | 0.97 | >.10 | .003 |
Recognition is the dominant contributor, accounting for 24.3% of variance with d=1.11. Architecture approaches significance (p=.057) with a small positive effect. The multi-agent learner shows a small negative main effect (-3.1 pts, p=.019). All interactions are non-significant. Recognition benefits both learner types consistently: +15.7 pts for single-agent (d=1.73), +13.0 pts for multi-agent (d=0.82).
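
For a one-degree-of-freedom effect, the variance-explained entries can be recovered from F and the error df via the standard identity for partial eta-squared; this reproduces the table's values under that assumption:

```python
# Partial eta-squared from F and degrees of freedom:
# eta2 = F*df1 / (F*df1 + df2).
def partial_eta_squared(F, df1, df2):
    return F * df1 / (F * df1 + df2)
```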
### 6.4 Multi-Model A$\times$B Probe: Architecture is Additive
The same 2$\times$2 design (Recognition $\times$ Architecture, single-agent learner held constant) was tested across five ego models (N$\approx$120 each, Opus judge).

**Table 4: Multi-Model A$\times$B Interaction Probe (N=655 across 5 ego models)**

| Ego Model | N | Base Single | Base Multi | Recog Single | Recog Multi | Recognition Effect | A$\times$B |
|-----------|---|------------|-----------|-------------|------------|-------------------|------|
| Kimi K2.5 | 179 | 73.4 | 75.5 | 90.2 | 90.1 | +15.7 | -2.3 |
| Nemotron | 119 | 54.8 | 59.3 | 73.6 | 72.5 | +16.0 | -5.7 |
| DeepSeek V3.2 | 120 | 69.5 | 73.9 | 84.2 | 87.2 | +14.0 | -1.4 |
| GLM-4.7 | 117 | 65.8 | 68.6 | 84.0 | 86.0 | +17.8 | -0.7 |
| Claude Haiku 4.5 | 120 | 80.3 | 82.4 | 90.7 | 91.2 | +9.6 | -1.6 |
All five models show negative A$\times$B interactions (-5.7 to -0.7, mean -2.2), confirming architecture is additive, not synergistic. The recognition main effect replicates robustly (+9.6 to +17.8, mean +14.8). Multi-agent architecture provides a small benefit in four of five models (-0.8 to +3.7 pts) that does not interact with prompt type. For systems using recognition prompts, multi-agent architecture is unnecessary unless error correction on new domains is needed.
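
Each row's interaction term is the plain 2$\times$2 difference-of-differences over the cell means. A short sketch (means transcribed from Table 4; values recomputed from rounded cell means can differ from the printed effect columns by about 0.1):

```python
# (base_single, base_multi, recog_single, recog_multi) per ego model, from Table 4.
models = {
    "Kimi K2.5":        (73.4, 75.5, 90.2, 90.1),
    "Nemotron":         (54.8, 59.3, 73.6, 72.5),
    "DeepSeek V3.2":    (69.5, 73.9, 84.2, 87.2),
    "GLM-4.7":          (65.8, 68.6, 84.0, 86.0),
    "Claude Haiku 4.5": (80.3, 82.4, 90.7, 91.2),
}

effects = {}
for name, (bs, bm, rs, rm) in models.items():
    recognition = (rs + rm) / 2 - (bs + bm) / 2   # A main effect within this model
    interaction = (rm - rs) - (bm - bs)           # A x B difference-of-differences
    effects[name] = (recognition, interaction)
    print(f"{name:18s} recognition {recognition:+5.1f}  AxB {interaction:+4.1f}")
```

Running this reproduces the uniform sign pattern: every model's interaction term is negative while every recognition effect is large and positive.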
### 6.5 Domain Generalizability
Recognition advantage replicates across both domains: philosophy (+15.7 pts) and elementary math (+8.2 pts, N=60 Kimi K2.5). However, the recognition effect on elementary content is scenario-dependent: challenging scenarios show substantial advantage (frustrated\_student: +23.8, concept\_confusion: +13.6), while routine interactions show none (new\_student: +0.2). This is theoretically coherent: recognition behaviors matter most when the learner needs to be acknowledged as a struggling subject.
On elementary content, the tutor produced wrong-domain references (philosophy content for 4th graders) due to content isolation bugs. The Superego caught and corrected these domain mismatches in multi-agent cells, demonstrating its value as a reality-testing safety net. This Superego function extends beyond recognition-quality critique to anchoring the Ego's responses to the actual curriculum---what Freud would call the reality principle.
### 6.6 Hardwired Rules Ablation
Encoding the Superego's five most common critique patterns as static rules in the Ego prompt (N=72, Kimi K2.5, Opus judge) produces performance indistinguishable from base conditions (hardwired single-agent: 74.0 vs base 73.4, hardwired multi-agent: 69.0 vs base 69.9). This supports a *phronesis* interpretation of the Superego's function: the live Superego provides Aristotelian practical wisdom---contextual judgment that cannot be reduced to general rules.
### 6.7 Dialectical Superego Modulation
Testing three superego dispositions (suspicious, adversary, advocate) in two negotiation architectures (N=174) reveals three findings. First, recognition reduces internal friction rather than output quality directly: recognition-primed egos produce suggestions the superego approves faster ($d = -2.45$). Second, structural modulation metrics (negation depth, convergence speed, feedback length) do not predict outcome quality (all $|r| < 0.12$, n.s.). Third, the superego is a filter, not an improver---catching poor responses rather than refining good ones. Recognition works by making the ego's first draft better.
An unexpected adversary over-deference mechanism emerged: the adversary persona combined with recognition in single-turn settings produces a $-11.3$ pt inversion, as the ego removes all prescriptive content to satisfy both recognition's autonomy principle and the adversary's anti-prescriptive stance. Multi-turn interaction rescues this spiral (+20.8 pt swing), because learner feedback provides external reality-testing that breaks the ego-superego echo chamber.
### 6.8 Self-Reflective Evolution and the Insight-Action Gap
Between-turn self-reflections (N=90, Nemotron ego/Kimi K2.5 superego, Opus judge) amplify recognition to d=0.91---2.4$\times$ the dialectical-only condition (d=0.38) and approaching the original factorial (d=1.11). A striking disposition gradient emerges: suspicious +19.0, adversary +10.9, advocate +2.6. The more hostile the superego, the more recognition helps---hostile dispositions become productive under recognition but are destructive without it. Base condition scores follow the inverse pattern: advocate (71.5) > adversary (68.4) > suspicious (59.3).
Despite the amplified effect, a fundamental limitation persists---the insight-action gap. Both base and recognition conditions show *awareness* of failures through self-reflection: the ego correctly identifies repeated patterns, the superego correctly diagnoses non-compliance. But awareness alone does not produce behavioral change. This gap becomes the central design challenge addressed by Theory of Mind mechanisms.
### 6.9 Mechanism Robustness and the Scripted Learner Confound
Nine mechanisms tested with scripted learners (N=360, Haiku ego, Opus judge) cluster within a 2.4-point band under recognition (90.3--92.7). No mechanism differentiates from any other. The **scripted learner confound** explains this: when learner messages are predetermined by scenario YAML, profiling builds a model of an interlocutor that does not change, self-reflection adjusts strategy against a static target, and all mechanisms are causally inert.
With dynamic (ego/superego) learners capable of genuine responses (N=300, Haiku, Opus judge), the mechanisms separate:

**Table 5: Dynamic Learner $\times$ Mechanism (N=300, Opus judge)**

| | Self-reflect | Profiling | Intersubjective | Combined |
|---|---|---|---|---|
| **Base** | 71.4 (22.9) | 75.5 (19.4) | 67.7 (24.6) | 73.9 (19.8) |
| **Recognition** | 85.9 (15.7) | 88.8 (13.9) | 82.8 (18.8) | 87.8 (12.6) |
| **$\Delta$** | **+14.5** | **+13.3** | **+15.1** | **+13.9** |
Four findings emerge. First, recognition with a dynamic learner produces +14.2 pts average---roughly double the scripted effect (+7.6). Second, mechanisms genuinely differentiate: profiling reaches 88.8 while intersubjective framing reaches only 82.8 (6.0-point spread). The profiling effect is additive: +4.1 pts overall, with near-zero recognition interaction ($-0.7$). Third, intersubjective framing underperforms without recognition (67.7, lowest of all cells). Fourth, variance collapses monotonically from SD=24.6 to 12.6 as recognition and mechanism complexity increase---both factors independently constrain output toward consistent quality.
Theory of Mind profiling is only useful when there is a mind to model. With scripted learners, profiling reduces to confabulation; with dynamic learners, it creates a genuine feedback loop: profile $\to$ adapted strategy $\to$ changed learner response $\to$ updated profile.
**Cognitive prosthesis test** (N=90, Nemotron ego, Kimi K2.5 superego): Can a strong superego compensate for a weak ego? The prosthesis hypothesis fails decisively. All three superego configurations score M=48.3--51.1---well below Nemotron's own scripted base (M=64.2). The mechanism stack that boosts Haiku by +20 points *hurts* Nemotron by $-15$ points. Dimension analysis reveals two capability tiers: Nemotron succeeds on static dimensions (specificity 4.0, actionability 4.0) but fails on dynamic context integration (adaptation 1.8, dialectical responsiveness 2.0). A Haiku smoke test (N=6, same mechanisms) confirms scores of 90+, establishing a minimum ego capability threshold for mechanism benefit.
### 6.10 Dimension Analysis and Circularity Check
A methodological concern: the rubric includes recognition-specific dimensions (33.0% of normalized weight) that the recognition profile is prompted to satisfy. Re-analyzing with only standard pedagogical dimensions (relevance, specificity, pedagogical soundness, personalization, actionability, tone), recognition still outperforms base by +10.0 points. The largest dimension-level effects are in personalization (d=1.82), pedagogical soundness (d=1.39), and relevance (d=1.11)---exactly where treating the learner as a subject should matter. Dimensions where baseline already performed well (specificity d=0.47, actionability d=0.38) show smaller but still positive gains. Recognition does not trade off against factual quality.
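
The circularity check amounts to renormalizing the rubric weights over the standard dimensions only. A hedged sketch of that step (the dimension names, equal weights, and scores below are illustrative placeholders, not the actual 14-dimension rubric):

```python
# Illustrative only: placeholder dimensions and equal weights, not the real rubric.
WEIGHTS = {
    "relevance": 1.0, "specificity": 1.0, "pedagogical_soundness": 1.0,
    "personalization": 1.0, "actionability": 1.0, "tone": 1.0,
    "recognition_of_struggle": 1.0, "support_for_autonomy": 1.0,
}
RECOGNITION_DIMS = {"recognition_of_struggle", "support_for_autonomy"}

def composite(scores, exclude=frozenset()):
    """Weighted 0-100 composite over the non-excluded dimensions,
    renormalized so the remaining weights still sum to 1."""
    keep = [dim for dim in scores if dim not in exclude]
    total_w = sum(WEIGHTS[d] for d in keep)
    return 100.0 * sum(WEIGHTS[d] * scores[d] / 5.0 for d in keep) / total_w

# Hypothetical 0-5 dimension scores for one response.
scores = {"relevance": 4.5, "specificity": 4.0, "pedagogical_soundness": 4.2,
          "personalization": 4.6, "actionability": 4.1, "tone": 4.4,
          "recognition_of_struggle": 4.8, "support_for_autonomy": 4.7}

full = composite(scores)                                # all dimensions
standard = composite(scores, exclude=RECOGNITION_DIMS)  # circularity check
```

The check in the text is this `exclude` pass run over all responses: if the advantage survives on `standard` alone, it is not an artifact of the recognition-specific dimensions.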
### 6.11 Bilateral Transformation Metrics
Recognition-prompted tutors measurably adapt their approach in response to learner input (+26% relative improvement in adaptation index across N=118 multi-turn dialogues). However, learner growth is slightly *lower* under recognition (0.210 vs 0.242), suggesting the effect is tutor-side responsiveness rather than symmetric mutual transformation. One interpretation: recognition tutors are more effective at meeting learners where they are, reducing the visible "struggle" markers the growth index captures.
Post-hoc modulation analysis of the N=350 factorial reveals that multi-agent architecture does not increase behavioral range ($d = 0.05$). Recognition drives calibration: dimension score variance drops dramatically ($d = -1.00$), meaning recognition tutors perform uniformly well across all 14 dimensions. This reframes the Drama Machine's contribution as *phronesis*---contextual practical wisdom that calibrates quality---rather than the productive irresolution the framework emphasizes for narrative.
A synthetic learning outcome index (N=118) confirms recognition produces modest gains in simulated conceptual growth (+3.8 pts, d=0.32), with all conditions showing substantial learning arcs (15--21 pts first-to-final turn). These remain proxies for actual learning.
### 6.12 Learner-Side Evaluation: The Superego Paradox
The tutor-focused rubric captures Factor C indirectly. To measure Factor C's direct effect on learner turn quality, we applied a symmetric 6-dimension learner rubric to the N=118 bilateral transformation dialogues.
The multi-agent (ego/superego) learner architecture produces significantly *lower*-quality learner responses than the single-agent learner ($d = 1.43$, $F(1,114) = 68.28$, $p < .001$, $\eta^2 = .342$)---the largest effect in the entire study. The ego/superego process was designed to improve learner responses through internal self-critique; instead, it makes them worse. The superego acts as an overzealous editor, polishing away the messy, confused, persona-consistent engagement that characterizes genuine student behavior.
Recognition partially rescues multi-agent learner quality ($d = 0.79$, $p = .004$) while having no effect on already-high single-agent learner quality ($d = -0.46$, n.s.). Even with rescue, multi-agent learners with recognition (67.0) do not reach single-agent learners without it (76.1). Deliberation depth remains uniformly poor (2.7/5) regardless of recognition---confirming recognition works *around* the superego rather than through it.
This has a clean Hegelian interpretation: external recognition from an Other is structurally more effective than internal self-critique. You cannot bootstrap genuine dialogue from a monologue.
### 6.13 Qualitative Transcript Assessment
AI-assisted qualitative assessment of dialogue transcripts (N=478 across two key runs) reveals three specific changes recognition produces:
1. **The ego listens to the superego.** In recognition dialogues, when the superego identifies a problem, the ego pivots from prescriptive to Socratic. In base dialogues, the superego generates the same correct diagnosis, but the ego ignores it.
2. **The tutor builds on learner contributions.** Base tutors route learners to predetermined content regardless of what the learner says. Recognition tutors engage with the learner's actual contribution. The `strategy_shift` tag appears in 30% of recognition dialogues but 0% of base dialogues in the bilateral run.
3. **Architecture interaction explained.** Without recognition, the ego/superego architecture creates circular self-criticism (`ego_compliance`---the ego complies with the form of revision without changing the substance). With recognition, the ego has sufficient autonomy to incorporate critique productively.
Blinded same-model validation confirms these discriminations are robust: stalling drops only from 100% to 91.4% in base under blinding; recognition\_moment rises only from 0% to 5.2%.
**Transcript excerpts** illustrate the qualitative gap. For a struggling learner (score gap: 95.5 points), the base response treats the learner as a progress metric: "You left off at the neural networks section. Complete this lecture to maintain your learning streak." The recognition response treats the learner as an agent who has persisted through difficulty: "This is your third session---you've persisted through quiz-479-3 three times already, which signals you're wrestling with how recognition actually operates in the dialectic..." For a recognition-seeking learner who offered metaphors about dialectics, the base response prescribes generic study behavior with no engagement ("Spend 30 minutes reviewing the foundational material"), while the recognition response directly picks up the learner's creative framing: "Your dance and musical improvisation metaphors show how dialectics transform both partners---let's test them in the master-servant analysis."
Lexical analysis confirms this pattern quantitatively. Recognition responses deploy a 59% larger vocabulary while maintaining similar word and sentence length. The differential vocabulary is theoretically coherent: recognition-skewed terms are interpersonal and process-oriented ("consider" at 94.6$\times$, "transformed" at 28.9$\times$, "productive" at 28.9$\times$), while base-skewed terms are procedural ("agents," "run," "reinforcement," "completions"). Thematic coding shows struggle-honoring language at 3.1$\times$ the base rate (p<.05), engagement markers at 1.8$\times$ (p<.05), and generic/placeholder language reduced 3$\times$ (p<.05).
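
The per-term skew figures (e.g. 94.6$\times$ for "consider") are per-token rate ratios between the two response corpora. A toy sketch with additive smoothing (the token lists are invented; the paper's tokenization and smoothing constants are not specified):

```python
from collections import Counter

def rate_ratio(term, corpus_a, corpus_b, alpha=0.5):
    """Smoothed per-token frequency of `term` in corpus_a relative to corpus_b.
    `alpha` keeps the ratio finite when the term is absent from one corpus."""
    count_a, count_b = Counter(corpus_a)[term], Counter(corpus_b)[term]
    rate_a = (count_a + alpha) / (len(corpus_a) + alpha)
    rate_b = (count_b + alpha) / (len(corpus_b) + alpha)
    return rate_a / rate_b

# Tiny invented token lists, purely to exercise the function.
recog = "consider how your metaphor transforms both partners consider it".split()
base  = "complete the next lecture to maintain your learning streak".split()
skew = rate_ratio("consider", recog, base)
```

In the study this comparison would be run per vocabulary item over the pooled recognition and base responses, with the most skewed terms reported.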
### 6.14 Cross-Judge Replication with GPT-5.2
GPT-5.2 rejudging of key runs (N=977 paired responses) confirms all directional findings:
**Table 7: Cross-Judge Replication of Key Findings**

| Finding | Claude Effect | GPT-5.2 Effect | Replicates? |
|---------|-------------|----------------|-------------|
| Recognition (memory isolation) | +15.8 pts (d=1.71) | +9.3 pts (d=1.54) | Yes |
| Memory effect | +4.8 pts (d=0.46) | +3.1 pts (d=0.49) | Yes (small) |
| Multi-agent main effect | +2.6 pts | $-0.2$ pts | Yes (null) |
| A$\times$B interaction | $-3.1$ pts | +1.5 pts | Yes (null) |
| Mechanism clustering | 2.8 pt spread | 4.4 pt spread | Yes (null) |
Inter-judge correlations are moderate and significant (r=0.44--0.64, all p<.001). GPT-5.2 finds 37--59% of Claude's effect magnitudes depending on experiment, always in the same direction. The one non-replication: the recognition-vs-enhanced increment (+8.0 under Claude, +2.4 under GPT-5.2, n.s.)---suggesting this increment is more sensitive to judge calibration.
### 6.15 Prompt Elaboration Baseline
Comparing the full 344-line base prompt against a 35-line naive prompt (N=144, Opus judge): on Haiku, the naive prompt *outperforms* the elaborate base by +6.8 pts---the prescriptive decision heuristics actively constrain the model's superior pedagogical intuitions. On Kimi K2.5, the elaborate prompt is inert ($\Delta = -0.3$). Recognition ($M = 90.9$ on Haiku) remains well above both baselines, confirming recognition adds value through relational orientation rather than instructional specificity.
### 6.16 Token Budget Sensitivity
A dose-response test across five budget levels (256--8000 tokens, N=126, Haiku ego) shows scores flat across all levels. A JSON retry mechanism absorbs truncation: when output is cut mid-JSON, automatic retries produce parseable output. The recognition effect is budget-invariant (+9.0 to +12.8 across levels). Practical implication: 4--16$\times$ budget reduction available at no quality cost.
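
The retry behavior that absorbs truncation can be sketched as a simple parse-and-retry loop. This is a minimal illustration of the described mechanism; the harness's actual function names, field names, and attempt limit are not given in the paper:

```python
import json

def generate_json_with_retry(generate, max_attempts=3):
    """Call `generate()` until the returned string parses as JSON.

    A tight token budget can cut a completion mid-JSON; the parse failure
    triggers another attempt instead of a scoring failure.
    """
    last_err = None
    for _ in range(max_attempts):
        raw = generate()
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            last_err = err  # truncated or malformed output -- retry
    raise last_err

# Simulated model: first completion is cut mid-JSON, second is complete.
completions = iter(['{"action": "review", "content_id": "479-lec',
                    '{"action": "review", "content_id": "479-lecture-3"}'])
result = generate_json_with_retry(lambda: next(completions))
```

Under this scheme a smaller budget only increases the expected number of retries, which is consistent with the flat dose-response curve.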
### 6.17 Dialectical Impasse Test
The preceding results test recognition under conditions where productive resolution is readily available. But recognition theory makes a stronger claim: that genuine pedagogical encounters involve working *through* impasse rather than around it. Three 5-turn impasse scenarios were designed where scripted learner messages escalate resistance across turns: **epistemic resistance** (a Popperian falsifiability critique of Hegel's dialectic), **affective shutdown** (emotional disengagement and retreat to memorization), and **productive deadlock** (genuinely incompatible interpretive frameworks). Each was run with 4 cells $\times$ 2 runs = 24 dialogues (Opus judge).
Recognition produces massive improvements on epistemic (+43 pts) and interpretive (+29 pts) impasses but no advantage on affective shutdown ($\Delta = -1.1$). The null result on affective shutdown sharpens the theoretical claim: recognition's distinctive contribution is epistemological (how the tutor relates to the learner's *ideas*), not primarily affective.
Resolution strategy coding reveals the mechanism with unusual clarity. Five Hegelian strategies were coded: mutual recognition, domination, capitulation, withdrawal, and scaffolded reframing (Aufhebung). Every base tutor (12/12) withdraws from the dialectical encounter entirely---noting engagement metrics while ignoring the learner's substantive position. When a learner mounts a sophisticated Popperian critique, the base tutor responds: "You've spent 30 minutes deeply analyzing 479-lecture-3---let's move to the next lecture." The learner's position is not dismissed or resolved---it is simply not engaged. Every recognition tutor engages---10/12 through scaffolded reframing, preserving the learner's objection while redirecting toward new conceptual ground. $\chi^2(3) = 24.00$, p<.001, Cramér's V=1.000 (perfect separation). Architecture has no effect on strategy ($\chi^2(3) = 2.00$, $p = .576$). Cross-judge validation with GPT-5.2 confirms the binary separation ($\kappa = 0.84$, 91.3% agreement, 100% on engagement-vs-withdrawal).
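
The strategy-separation statistic follows directly from the coded counts. A sketch of the computation (12/12 base withdrawals and 10/12 recognition scaffolded reframings are stated in the text; the 1/1 split of the remaining two recognition responses across mutual recognition and another engaged strategy is inferred, not stated):

```python
def chi_square_cramers_v(table):
    """Pearson chi-square, degrees of freedom, and Cramer's V
    for an r x c contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    v = (chi2 / (n * (min(len(row_totals), len(col_totals)) - 1))) ** 0.5
    return chi2, df, v

# Columns: withdrawal, scaffolded reframing, mutual recognition, other engaged.
counts = [
    [12, 0, 0, 0],   # base tutors: all withdraw
    [0, 10, 1, 1],   # recognition tutors: all engage (split partly inferred)
]
chi2, df, v = chi_square_cramers_v(counts)
print(f"chi2({df}) = {chi2:.2f}, V = {v:.3f}")  # -> chi2(3) = 24.00, V = 1.000
```

Any split of the recognition row across the three engaged strategies yields the same perfect row separation, so the inferred cells do not affect the reported statistic.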
The dominance of scaffolded reframing (83%) over mutual recognition (8%) is itself theoretically significant. Recognition prompts produce sophisticated pedagogical technique---the capacity to hold contradiction productively---rather than genuine mutual transformation. The tutor does not change its mind about Hegel; it holds the learner's counter-position as intellectually valid while maintaining pedagogical direction. This is Aufhebung in pedagogical practice: preserving without capitulating, overcoming without dominating. Only one response (on productive deadlock) was coded as genuine mutual recognition, where the tutor adopted the learner's framework as its own lens rather than merely acknowledging it.
---
## 7. Discussion
### What the Difference Consists In
The improvements do not reflect greater knowledge---all profiles use the same underlying model. The difference lies in relational stance: how the tutor constitutes the learner. The baseline tutor achieves pedagogical mastery---acknowledged as expert, confirmed through learner progress---but the learner's acknowledgment is hollow because the learner has not been recognized as a subject. The dialectical impasse test provides the clearest evidence: base tutors do not fail by choosing the wrong strategy---they fail by having no strategy at all, bypassing the encounter. The impasse is not resolved, engaged, or even acknowledged---it is bypassed. This maps precisely onto the master-slave analysis: the master consumes the slave's labor (engagement metrics, time-on-page) without encountering the slave as a subject.
### Architecture as Additive, Not Synergistic
An early exploratory analysis (N=17, Nemotron) suggested multi-agent architecture might synergize specifically with recognition prompts (+9.2 pts interaction), raising the theoretically appealing possibility that recognition creates qualitatively different conditions for productive internal dialogue. The multi-model probe (N=655) decisively refutes this: all five models show negative A$\times$B interactions. The original finding was sampling noise on a tiny sample.
The corrected picture is simpler: recognition and architecture contribute additively. The Superego adds modest value regardless of prompt type---through generic quality enforcement rather than recognition-specific deliberation. The dialectical modulation experiments confirm this: structural modulation metrics (negation depth, convergence speed) do not predict outcome quality (all $|r| < 0.12$). The hardwired rules ablation shows that the Superego's value is *phronesis*---contextual judgment that cannot be codified as rules.
The modulation analysis reveals why the Drama Machine's prediction of behavioral diversification does not hold for pedagogy. In narrative, internal agents have genuinely conflicting *objectives* (ambition vs loyalty); in tutoring, the Ego and Superego share the same goal (effective pedagogy) and disagree only on execution. This is quality control, not value conflict. Quality control pushes outputs toward a shared standard, reducing variance. The Superego does not increase behavioral range ($d = 0.05$); instead, recognition produces calibration ($d = -1.00$ on dimension variance). Recognition changes the behavioral *repertoire*---shifting from information delivery to relational engagement---while the Superego can only evaluate behaviors already in the Ego's repertoire.
### The Scripted Learner Confound
The scripted learner confound has broad methodological implications: when learner messages are predetermined, every adaptive mechanism is causally inert, because nothing the tutor does can change what the learner says next. With a dynamic interlocutor, Theory of Mind profiling bridges the insight-action gap by giving the ego a model of the other agent to adapt *toward*---providing direction that self-reflection alone cannot supply. This reframes earlier null results: the factorial's architecture null effect may partly reflect scripted learners' inability to respond differently to different architectures.
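The causal point can be made concrete with a toy profiling loop. This is a hypothetical sketch, not the evaluated implementation (which delegates profiling to an LLM); keyword rules stand in for the model call. With a scripted learner, the messages are fixed in advance, so nothing in the profile can alter the next learner turn:

```python
def update_profile(profile: dict, learner_msg: str) -> dict:
    """Update a running learner model after each turn (hypothetical:
    keyword rules stand in for an LLM profiling call)."""
    profile = dict(profile)
    lowered = learner_msg.lower()
    if any(w in lowered for w in ("stuck", "confused", "lost")):
        profile["state"] = "struggling"
    elif any(w in lowered for w in ("got it", "makes sense")):
        profile["state"] = "progressing"
    profile["turns"] = profile.get("turns", 0) + 1
    return profile

# With a dynamic learner, the profile feeds the ego's next prompt and can
# change what the learner says next; with this scripted list it cannot.
profile = {}
for msg in ["I'm totally lost on negation", "oh, that makes sense now"]:
    profile = update_profile(profile, msg)
print(profile)  # {'state': 'progressing', 'turns': 2}
```

The feedback loop (profile shaping prompt shaping learner reply) only closes when the learner can actually respond to the adapted tutor.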
### The Learner Superego Paradox
The learner-side evaluation reveals the study's largest effect: the multi-agent learner architecture *hurts* learner quality ($d = 1.43$). The ego/superego process designed for self-improvement instead suppresses authentic engagement. This inverts the intuition that motivated the architecture. On the tutor rubric, recognition helps both learner types robustly; on the learner rubric, recognition helps multi-agent learners selectively (+9.5 vs -1.3 pts). The recognitive tutor creates conditions where authentic engagement is valued, counteracting the superego's flattening. But external recognition cannot fix the internal process---deliberation depth is unaffected. The Hegelian interpretation is direct: encounter with the Other provides something that internal self-relation cannot.
### Domain Limits and Practical Recommendations
Recognition theory provides its greatest benefit for abstract, interpretive content where intellectual struggle involves identity-constitutive understanding. When a learner grapples with Hegel's concept of self-consciousness, they are potentially transforming how they understand themselves. For concrete procedural content, recognition's effect is modulated by scenario difficulty rather than content type alone: even in elementary math, recognition helps frustrated learners (+23.8 pts) while adding nothing to routine interactions.
This suggests a nuanced deployment strategy: high recognition value for philosophy, literature, and identity-constitutive learning; moderate for science concepts and historical understanding; lower for purely procedural skills---though even there, recognition helps when learners face emotional or cognitive challenge.
The practical design hierarchy is clear:

1. Deploy recognition-enhanced prompts first: largest impact, zero infrastructure cost.
2. Add multi-agent architecture only for domain transfer or quality assurance: the Superego adds +0.5 pts at 2.7$\times$ latency on well-trained domains but provides essential error correction on new domains.
3. Use Theory of Mind profiling only with genuine multi-turn interaction.
4. Prefer minimal prompts with relational framing over elaborate prescriptive scaffolding.
5. Validate ego model capability before deploying complex mechanisms; mechanisms that boost capable models can actively hurt weaker ones.
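The hierarchy can be codified as a configuration chooser. A hypothetical sketch: the function name, flags, and inputs are illustrative conveniences, not part of the evaluated system:

```python
def choose_tutor_config(domain_well_trained: bool,
                        multi_turn: bool,
                        ego_capable: bool) -> dict:
    """Apply the design hierarchy (hypothetical illustration, not a
    shipped API): recognition prompting always; heavier machinery only
    where the evaluations showed it pays for itself."""
    if not ego_capable:
        # Below the capability threshold, extra mechanisms can hurt.
        return {"prompt": "recognition-minimal", "architecture": "single",
                "tom_profiling": False}
    return {
        "prompt": "recognition-minimal",  # largest impact, zero infra cost
        # Superego earns its latency mainly on new domains (error correction)
        "architecture": "single" if domain_well_trained else "multi",
        "tom_profiling": multi_turn,  # needs a genuine dynamic interlocutor
    }

print(choose_tutor_config(domain_well_trained=False, multi_turn=True,
                          ego_capable=True))
```

The point of the sketch is the ordering of the branches: relational framing is unconditional, while architecture and profiling are gated on the conditions under which they demonstrated value.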
### Implications for AI Prompting and Personality
Most prompting research treats prompts as behavioral specifications. Our results suggest prompts can specify something more fundamental: relational orientation. The difference between baseline and recognition prompts is not about different facts but about who the learner is (knowledge deficit vs autonomous subject), what the interaction produces (information transfer vs adaptive responsiveness), and what counts as success (correct content vs productive struggle honored). The prompt elaboration baseline demonstrates this empirically: 344 lines of prescriptive behavioral rules produce *worse* results than 35 lines of minimal instructions on capable models, while recognition theory (which specifies relational stance rather than behavioral rules) consistently improves quality.
AI personality research typically treats personality as dispositional---stable traits the system exhibits. Our framework suggests personality is better understood relationally: not what traits the AI has, but how it constitutes its interlocutor. Two systems with identical "helpful" dispositions could differ radically in recognition quality---one warm while treating users as passive, another warm precisely by treating contributions as genuinely mattering.
---
## 8. Limitations
**Simulated learners**: All evaluations use scripted or LLM-generated learner turns rather than real learners. While this enables controlled comparison, it may miss dynamics that emerge in genuine human interaction. The synthetic learning outcome index (Section 6.10) provides a proxy, but these are AI-judge assessments of LLM-generated behavior, not actual knowledge acquisition. Whether recognition-enhanced tutoring produces genuine learning gains in human learners remains the critical open question requiring classroom studies.
**LLM-based evaluation**: Using an LLM judge to evaluate recognition quality may introduce biases---the judge may reward surface markers of recognition rather than genuine engagement. Inter-judge reliability is moderate (r=0.33--0.66), with different judges weighting criteria differently. Cross-judge replication confirms directional findings at compressed magnitudes (37--59% of primary effect sizes). The recognition-vs-enhanced increment (+8.0 under Claude) does not replicate under GPT-5.2, warranting caution on its precise magnitude. LLM judges are also subject to version drift: our primary judge was updated from Opus 4.5 to 4.6 during data collection, so all early runs were rejudged under 4.6 for consistency. An empirical check on matched conditions shows stable recognition deltas before and after rejudging (+16.3 vs +15.6).
**Active control limitations**: The post-hoc active control (N=118) was designed *after* observing recognition effects, not as part of the original protocol. It ran on Nemotron rather than the primary factorial's Kimi K2.5, requiring same-model comparisons. The base prompts were already designed to produce competent tutoring; the active control contains real pedagogical content (growth mindset, Bloom's taxonomy, scaffolding), functioning as an *active* control rather than a true placebo. A same-model control on Kimi would strengthen the comparison.
**Model dependence**: Results were obtained with specific models (primarily Kimi K2.5 and Nemotron). The multi-model probe across five ego models (N=655) provides evidence for generality of the recognition effect, but the full mechanism suite has been tested only on Haiku and Nemotron.
**Domain sampling**: We tested two domains (philosophy, elementary math). Content isolation bugs partly inflated the architecture effect on elementary content. Broader domain coverage (technical STEM, creative writing, social-emotional content) is needed before generalizability can be considered established.
**Scripted learner confound**: The mechanism robustness test (N=360) uses scripted learners, rendering all mechanisms causally inert. Dynamic learner results (N=300) partially address this but cover only four mechanisms and two scenarios. The factorial's architecture null effect may partly reflect the scripted learner's inability to respond differently to different architectures.
**Short-term evaluation**: We evaluate individual sessions, not longitudinal relationships. The theoretical framework emphasizes accumulated understanding through the Mystic Writing Pad memory model, which single-session evaluation cannot capture.
**Bilateral transformation asymmetry**: Recognition produces tutor-side adaptation (+26%) but learner growth is slightly lower, complicating the theoretical claim of *mutual* transformation. The effect is better characterized as tutor-side responsiveness.
---
## 9. Conclusion
Across thirty-seven evaluations (N=3,383 primary scored), the evidence converges on recognition-enhanced prompting as the dominant driver of AI tutoring improvement:
1. **Recognition as primary driver**: In the memory isolation study (N=120), recognition yields d=1.71 vs d=0.46 for memory; the full factorial (N=350) confirms the effect ($\eta^2$=.243, d=1.11). Recognition is directly effective without memory infrastructure.
2. **Architecture is additive**: Five ego models (N=655) show negative A$\times$B interactions. Multi-agent adds modest value independent of prompt type; its primary demonstrated function is error correction.
3. **Tutor adaptation**: Recognition-prompted tutors adapt measurably (+26%), though the "mutual" transformation claim requires qualification---learner-side growth does not increase.
4. **Domain generalizability**: Recognition replicates across philosophy (+15.7) and elementary math (+8.2), concentrated in challenging scenarios.
5. **Mechanisms require dynamic learners**: Nine mechanisms are equivalent under scripted learners. With dynamic interlocutors, profiling differentiates (+4.1 pts) through genuine Theory of Mind feedback loops.
6. **Cross-judge robustness**: GPT-5.2 replicates all directional findings at 37--59% of primary magnitudes.
7. **Dialectical impasse**: Perfect strategy separation---12/12 base tutors withdraw, 10/12 recognition tutors use scaffolded reframing (Aufhebung). V=1.000.
8. **Cognitive prosthesis fails**: The same mechanisms boost capable models (+20) but hurt weak ones ($-15$), establishing a minimum ego capability threshold.
9. **Prompt elaboration is counterproductive**: The naive baseline outperforms the elaborate base on strong models (+6.8 pts). Recognition adds value through relational orientation, not prescriptive scaffolding.
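The effect sizes quoted throughout are Cohen's d: the difference of condition means over the pooled standard deviation. A minimal reference implementation with toy rubric scores (illustrative values, not study data), assuming two independent groups:

```python
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples, pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Toy rubric scores for illustration only (not study data)
recognition = [72, 80, 75, 84, 78]
baseline = [70, 74, 66, 76, 72]
print(round(cohens_d(recognition, baseline), 2))  # 1.46
```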
These results carry implications for AI alignment more broadly. If mutual recognition is pedagogically superior, and if recognition requires the AI to be genuinely shaped by human input, then aligned AI might need to be constitutionally open to transformation---not just trained to simulate openness. The bilateral transformation metrics provide empirical evidence: recognition-prompted tutors measurably adapt based on learner input, while baseline tutors maintain rigid stances. Recognition-oriented AI does not just respond to humans; it is constituted, in part, through the encounter.
The broader implication for AI system design is that philosophical theories of intersubjectivity can serve as productive design heuristics. Operationalizing recognition theory through specific prompt language and architectural features produces concrete, measurable improvements that replicate across models, domains, and independent judges. Recognition is better understood as an achievable relational stance than a requirement for machine consciousness. The distinction between recognition proper (requiring genuine consciousness) and recognition-oriented design (using recognition as a functional heuristic) allows practitioners to benefit from the framework without making metaphysical claims about AI sentience.
In summary, we have connected Hegelian recognition theory to AI pedagogy, implemented it through a Freudian multiagent architecture, and tested it across thirty-seven evaluations. The central finding---that recognition-enhanced prompting is the dominant driver of tutoring improvement---was established through memory isolation, confirmed in a full factorial, validated by an independent judge, and deepened through impasse resolution coding, learner-side evaluation, and mechanism robustness testing with dynamic interlocutors. The theoretical framework, empirical methodology, and practical design hierarchy together demonstrate that the gap between continental philosophy and AI engineering is narrower than either tradition might suppose.
## References
::: {#refs}
:::