@machinespirits/eval 0.2.0 → 0.3.0
- package/README.md +91 -9
- package/config/eval-settings.yaml +3 -3
- package/config/paper-manifest.json +486 -0
- package/config/providers.yaml +9 -6
- package/config/tutor-agents.yaml +2261 -0
- package/content/README.md +23 -0
- package/content/courses/479/course.md +53 -0
- package/content/courses/479/lecture-1.md +361 -0
- package/content/courses/479/lecture-2.md +360 -0
- package/content/courses/479/lecture-3.md +655 -0
- package/content/courses/479/lecture-4.md +530 -0
- package/content/courses/479/lecture-5.md +326 -0
- package/content/courses/479/lecture-6.md +346 -0
- package/content/courses/479/lecture-7.md +326 -0
- package/content/courses/479/lecture-8.md +273 -0
- package/content/courses/479/roadmap-slides.md +656 -0
- package/content/manifest.yaml +8 -0
- package/docs/research/build.sh +44 -20
- package/docs/research/figures/figure10.png +0 -0
- package/docs/research/figures/figure11.png +0 -0
- package/docs/research/figures/figure3.png +0 -0
- package/docs/research/figures/figure4.png +0 -0
- package/docs/research/figures/figure5.png +0 -0
- package/docs/research/figures/figure6.png +0 -0
- package/docs/research/figures/figure7.png +0 -0
- package/docs/research/figures/figure8.png +0 -0
- package/docs/research/figures/figure9.png +0 -0
- package/docs/research/header.tex +23 -2
- package/docs/research/paper-full.md +941 -285
- package/docs/research/paper-short.md +216 -585
- package/docs/research/references.bib +132 -0
- package/docs/research/slides-header.tex +188 -0
- package/docs/research/slides-pptx.md +363 -0
- package/docs/research/slides.md +531 -0
- package/docs/research/style-reference-pptx.py +199 -0
- package/package.json +6 -5
- package/scripts/analyze-eval-results.js +69 -17
- package/scripts/analyze-mechanism-traces.js +763 -0
- package/scripts/analyze-modulation-learning.js +498 -0
- package/scripts/analyze-prosthesis.js +144 -0
- package/scripts/analyze-run.js +264 -79
- package/scripts/assess-transcripts.js +853 -0
- package/scripts/browse-transcripts.js +854 -0
- package/scripts/check-parse-failures.js +73 -0
- package/scripts/code-dialectical-modulation.js +1320 -0
- package/scripts/download-data.sh +55 -0
- package/scripts/eval-cli.js +106 -18
- package/scripts/generate-paper-figures.js +663 -0
- package/scripts/generate-paper-figures.py +577 -76
- package/scripts/generate-paper-tables.js +299 -0
- package/scripts/qualitative-analysis-ai.js +3 -3
- package/scripts/render-sequence-diagram.js +694 -0
- package/scripts/test-latency.js +210 -0
- package/scripts/test-rate-limit.js +95 -0
- package/scripts/test-token-budget.js +332 -0
- package/scripts/validate-paper-manifest.js +670 -0
- package/services/__tests__/evalConfigLoader.test.js +2 -2
- package/services/__tests__/learnerRubricEvaluator.test.js +361 -0
- package/services/__tests__/learnerTutorInteractionEngine.test.js +326 -0
- package/services/evaluationRunner.js +975 -98
- package/services/evaluationStore.js +12 -4
- package/services/learnerTutorInteractionEngine.js +27 -2
- package/services/mockProvider.js +133 -0
- package/services/promptRewriter.js +1471 -5
- package/services/rubricEvaluator.js +55 -2
- package/services/transcriptFormatter.js +675 -0
- package/docs/EVALUATION-VARIABLES.md +0 -589
- package/docs/REPLICATION-PLAN.md +0 -577
- package/scripts/analyze-run.mjs +0 -282
- package/scripts/compare-runs.js +0 -44
- package/scripts/compare-suggestions.js +0 -80
- package/scripts/dig-into-run.js +0 -158
- package/scripts/show-failed-suggestions.js +0 -64
- package/scripts/{check-run.mjs → check-run.js} +0 -0
@@ -1,13 +1,13 @@
 ---
-title: "
-author: "Liam Magee
+title: "*Geist* in the Machine: Mutual Recognition and Multiagent Architecture for Dialectical AI Tutoring"
+author: "Liam Magee"
 date: "February 2026"
-version: "2.
+version: "2.3.14"
 bibliography: references.bib
 csl: apa.csl
 link-citations: true
 abstract: |
-  Current
+  Current AI tutoring treats learners as knowledge deficits to be filled. We propose an alternative grounded in Hegel's theory of mutual recognition, where effective pedagogy requires acknowledging learners as autonomous subjects whose understanding has intrinsic validity. We implement this through recognition-enhanced prompts and a multi-agent architecture where an "Ego" agent generates pedagogical suggestions and a "Superego" agent evaluates them before delivery. Across thirty-seven evaluations (N=3,383 primary scored; N=7,000+ development database), recognition theory emerges as the primary driver of improvement: a 2$\times$2 memory isolation experiment (N=120) shows recognition produces d=1.71 (Claude Opus judge) with or without memory, while memory alone provides only d=0.46 (n.s.). A multi-model probe across five ego models (N=655) confirms architecture and recognition contribute additively, not synergistically. Cross-judge replication with GPT-5.2 validates the main findings at compressed magnitudes (37–59% of primary effect sizes depending on experiment, inter-judge r=0.44–0.64). Phase 2 experiments reveal that the Superego functions as a quality filter rather than an active improver—structural modulation metrics do not predict outcome quality. Nine architectural mechanisms cluster within 2.4 points under scripted learners, but differentiate when tested with dynamic interlocutors capable of genuine feedback loops: Theory of Mind profiling adds 4.1 points, and recognition's effect doubles. Qualitative transcript assessment identifies three specific changes recognition produces: the ego listens to its internal critic, the tutor builds on learner contributions rather than redirecting, and mid-conversation strategy shifts occur. These results suggest that philosophical theories of intersubjectivity can serve as productive design heuristics for AI systems, and that recognition is better understood as an achievable relational stance than a requirement for machine consciousness.
 keywords: [AI tutoring, mutual recognition, Hegel, Freud, multiagent systems, educational technology, productive struggle, Drama Machine, domain generalizability]
 fontsize: 12pt
 geometry: margin=1in
@@ -16,13 +16,13 @@ header-includes: |
 \floatplacement{figure}{H}
 ---
 
-#
+# *Geist* in the Machine: Mutual Recognition and Multiagent Architecture for Dialectical AI Tutoring
 
 ## 1. Introduction
 
 The dominant paradigm in AI-assisted education treats learning as information transfer. The learner lacks knowledge; the tutor possesses it; the interaction succeeds when knowledge flows from tutor to learner. This paradigm—implicit in most intelligent tutoring systems, adaptive learning platforms, and educational chatbots—treats the learner as fundamentally passive: a vessel to be filled, a gap to be closed, an error to be corrected.
 
-This paper proposes an alternative grounded in Hegel's theory of mutual recognition. In the *Phenomenology of Spirit* [@Hegel1977PhenomenologyMiller], Hegel argues that genuine self-consciousness requires recognition from another consciousness that one
+This paper proposes an alternative grounded in Hegel's theory of mutual recognition. In the *Phenomenology of Spirit* [@Hegel1977PhenomenologyMiller], Hegel argues that genuine self-consciousness requires recognition from another consciousness that one in turn recognizes as valid. The master-slave dialectic reveals that one-directional recognition fails: the master's self-consciousness remains hollow because the slave's acknowledgment, given under duress, does not truly count. Only mutual recognition—where each party acknowledges the other as an autonomous subject—produces genuine selfhood.
 
 The connection between Hegelian thought and pedagogy is well established. Vygotsky's zone of proximal development [@vygotsky1978] presupposes a dialogical relationship between teacher and learner that echoes Hegel's mutual constitution of self-consciousness. The German *Bildung* tradition explicitly frames education as a process of self-formation through encounter with otherness [@stojanov2018], and contemporary recognition theory [@honneth1995] has been applied to educational contexts where the struggle for recognition shapes learning outcomes [@huttunen2007]. Our contribution is to operationalize these philosophical commitments as concrete design heuristics for AI tutoring systems and to measure their effects empirically.
 
@@ -41,11 +41,11 @@ We operationalize this framework through:
 3. **New evaluation dimensions** that measure recognition quality alongside traditional pedagogical metrics
 4. **Test scenarios** specifically designed to probe recognition behaviors
 
-In controlled evaluations across
+In controlled experiments across thirty-seven key evaluations (N=3,383 primary scored responses; N=7,000+ across all development runs), we isolate the contribution of recognition theory from prompt engineering effects and memory integration. The definitive test is a corrected 2×2 memory isolation experiment (N=120 across two independent runs): recognition theory is the primary driver, producing +15.2 points (d=1.71) even without memory, while memory alone provides only a modest benefit (+4.8 pts, d=0.46, $p \approx .08$). The combined condition reaches 91.2 points (d=1.81 vs base), with ceiling effects limiting observable synergy. A post-hoc active control (N=118) using length-matched prompts with generic pedagogical content but no recognition theory scores approximately 9 points above same-model base but well below recognition levels, with recognition gains (~+15 pts above same-model base) substantially exceeding active-control gains (~+9 pts). A three-way comparison (N=36) found recognition adds +8.0 points beyond enhanced prompting, consistent with recognition dominance.
 
-A full 2×2×2 factorial (N=350)
+A full 2×2×2 factorial (N=350) confirms recognition as the dominant factor (F=110.04, p<.001, $\eta^2$=.243, $d=1.11$), accounting for 24.3% of variance. Crucially, recognition's benefit is consistent across learner types: +15.7 pts for single-agent learners (d=1.73) and +13.0 pts for multi-agent learners (d=0.82), with a non-significant A×C interaction (F=0.97, p=.325). A multi-model probe across five ego models (N=655) confirms that architecture and recognition contribute additively, not synergistically—all five models show negative A×B interactions, consistent with ceiling effects on already-high recognition scores. For systems using only improved instructions, multi-agent architecture appears unnecessary; the architecture's primary value lies in error correction when content isolation failures introduce wrong-domain references.
 
-Domain generalizability testing reveals that recognition advantage replicates across both models and content domains, but with important nuances. Philosophy content shows strong recognition dominance (+15.
+Domain generalizability testing reveals that recognition advantage replicates across both models and content domains, but with important nuances. Philosophy content shows strong recognition dominance (+15.7 pts for single-agent-learner cells). Elementary math shows a smaller but still substantial recognition effect (+8.2 pts, N=60). Recognition effects are concentrated in challenging scenarios (frustrated learners, concept confusion) rather than routine interactions.
 
 The contributions of this paper are:
 
@@ -54,13 +54,21 @@ The contributions of this paper are:
 - Empirical evidence that recognition-oriented design improves tutoring outcomes
 - A corrected 2×2 memory isolation experiment (N=120) demonstrating recognition as the primary driver of improvement (d=1.71), with memory providing a modest secondary benefit (d=0.46) and ceiling effects at ~91 points limiting observable synergy
 - A post-hoc active control (N=118) showing that generic pedagogical elaboration provides partial benefit (~+9 pts above same-model base) but recognition gains are substantially larger (~+15 pts), supporting recognition theory's specific contribution beyond prompt length
-- Evidence from a three-way comparison (N=36) consistent with recognition dominance, showing recognition outperforms enhanced prompting by +8.
+- Evidence from a three-way comparison (N=36) consistent with recognition dominance, showing recognition outperforms enhanced prompting by +8.0 points
 - Bilateral transformation metrics (N=118, three multi-turn scenarios) demonstrating that recognition produces measurable tutor-side adaptation (+26%), though learner-side growth does not increase, qualifying the "mutual" transformation claim
+- Post-hoc modulation analysis (N=350) showing that multi-agent architecture does not increase behavioral range ($d = 0.05$), while recognition produces calibration—uniformly high performance across all dimensions (dimension variance $d = -1.00$)—reframing the Drama Machine's contribution from productive irresolution to *phronesis*
+- A synthetic learning outcome index (N=118) confirming that recognition-enhanced tutoring produces modest gains in simulated conceptual growth (+3.8 pts, d=0.32), with all conditions showing substantial learning arcs (15–21 pts first-to-final turn), though these remain proxies for actual learning pending human studies
 - Analysis of how recognition effects vary across content domains and scenario difficulty
 - Evidence that multi-agent architecture serves as critical error correction for domain transfer, with its synergy with recognition prompts remaining model-dependent
-- A hardwired rules ablation (N=72) demonstrating that encoding the Superego's most common critique patterns as static rules
+- A hardwired rules ablation (N=72) demonstrating that encoding the Superego's most common critique patterns as static rules produces performance indistinguishable from base conditions, supporting a *phronesis* interpretation where the Superego's value lies in contextual judgment rather than rule enforcement
+- Dialectical superego modulation testing (N=174) showing the superego functions as a quality filter—preventing poor responses—rather than an active improver, with structural modulation metrics not predicting outcome quality
+- Self-reflective evolution (N=90) amplifying recognition's effect to d=0.91 through between-turn ego and superego reflections, with a striking disposition gradient (suspicious +19.0, adversary +10.9, advocate +2.6) revealing that hostile superego dispositions benefit most from recognition, and an insight-action gap where awareness of the need for change does not produce fundamentally different behavior
+- Mechanism robustness testing (N=360 scripted, N=300 dynamic) demonstrating that all mechanisms are equivalent under scripted learners but that other-ego profiling differentiates with dynamic interlocutors, establishing that genuine feedback loops are necessary for mechanism effects
+- Qualitative transcript assessment providing narrative evidence for three specific changes recognition produces: the ego listens to the superego, the tutor builds on learner contributions, and strategy shifts occur mid-conversation
+- A cognitive prosthesis test (N=90) demonstrating a minimum ego capability threshold: the full mechanism stack that boosts Haiku by +20 points hurts Nemotron by $-15$ points, with dimension analysis revealing a two-tier static/dynamic capability structure and superego parse failures silently disabling quality control on 16–45% of turns
+- Practical design recommendations for AI tutor development distilled from the full experimental programme
 
-The paper is organized as follows. Section 2 reviews related work in AI tutoring, multiagent systems, prompt engineering, and sycophancy. Section 3 develops the theoretical framework connecting Hegelian recognition and Freudian structural theory to pedagogy. Section 4 presents the multiagent architecture (Ego, Superego, and learner agents). Section 5 describes the experimental methodology, including test scenarios, agent profiles, model configuration, and the evaluation rubric. Section 6 reports results across
+The paper is organized as follows. Section 2 reviews related work in AI tutoring, multiagent systems, prompt engineering, and sycophancy. Section 3 develops the theoretical framework connecting Hegelian recognition and Freudian structural theory to pedagogy. Section 4 presents the multiagent architecture (Ego, Superego, and learner agents). Section 5 describes the experimental methodology, including test scenarios, agent profiles, model configuration, and the evaluation rubric. Section 6 reports results across thirty-seven key evaluations, covering recognition validation, memory isolation, factorial analysis, domain generalizability, dialectical superego modulation, self-reflective evolution, mechanism robustness, qualitative transcript assessment, bilateral transformation, learner-side evaluation, cross-judge replication, dialectical impasse testing, and a hardwired rules ablation. Section 7 discusses theoretical and practical implications, including practical design recommendations. Section 8 addresses limitations, and Section 9 concludes.
 
 ---
 
@@ -70,6 +78,8 @@ The paper is organized as follows. Section 2 reviews related work in AI tutoring
 
 Intelligent Tutoring Systems (ITS) have a long history, from early systems like SCHOLAR [@carbonell1970] and SOPHIE [@brown1975] through modern implementations using large language models. The field has progressed through several paradigms: rule-based expert systems, Bayesian knowledge tracing [@corbett1995], and more recently, neural approaches leveraging pretrained language models [@kasneci2023]. The rapid adoption of LLM-based tutoring has been accompanied by emerging work on integrating generative AI into learning management systems [@ZhuMageeMischler2025IntegratingGenAIIntoLMS], multi-agent frameworks for educational task decomposition [@wu2023], and self-refining instructional agents [@madaan2023]. A comprehensive survey of LLM agents in education [@chu2025llmagents] maps the growing landscape, covering pedagogical agents, feedback generation, and curriculum design. Specific architectures include GenMentor [@wang2025genmentor], which decomposes tutoring into five specialized agents (gap identification, learner profiling, etc.), and Ruffle&Riley [@schmucker2024ruffle], which orchestrates two LLM agents in a learning-by-teaching format. These systems have demonstrated strong performance on content delivery but have given less attention to the relational dynamics between tutor and learner.
 
+Empirical evidence on LLM tutoring effectiveness is emerging rapidly. A systematic review of 88 empirical studies [@shi2025llmeducation] maps applications across writing support, language learning, programming tutoring, and content explanation—finding consistent engagement benefits but limited evidence on deep conceptual learning. In the largest randomized controlled trial to date, Vanzo et al. [-@vanzo2025gpt4homework] deployed GPT-4 as a homework tutor across multiple classrooms, demonstrating improved grammar accuracy and sustained engagement relative to controls. Scarlatos et al. [-@scarlatos2025training] take a complementary approach, using dialogue preference optimization (DPO) to train LLM tutors specifically for productive dialogue—the trained tutors produce measurably better learning outcomes than prompted-only baselines. These studies confirm that LLMs can tutor effectively, but they evaluate primarily *content delivery* and *engagement*—not the relational quality of the tutor-learner interaction, which is our focus.
+
 Most ITS research focuses on *what* to teach (content sequencing, knowledge components) and *when* to intervene (mastery thresholds, hint timing). Our work addresses a different question: *how* to relate to the learner as a subject. This relational dimension connects to work on rapport [@zhao2014], social presence [@biocca2003], and affective tutoring [@dmello2012], but has received less systematic attention—and almost none in the context of LLM-based tutoring. The distinction matters architecturally: where GenMentor and similar systems decompose the tutoring *task* into sub-tasks handled by different agents, our architecture implements *internal dialogue*—the Superego evaluates the Ego's relational quality before any response reaches the learner. This is a critique loop for recognition quality, not a task pipeline.
 
 ### 2.2 Prompt Engineering and Agent Design
@@ -78,9 +88,19 @@ The emergence of large language models has spawned extensive research on prompt
 
 Our work extends this paradigm by introducing *intersubjective prompts*—prompts that specify not just agent behavior but agent-other relations. The recognition prompts do not primarily describe what the tutor should do; they describe who the learner is (an autonomous subject) and what the interaction produces (mutual transformation). The closest precedent is Constitutional AI [@bai2022constitutional], where models critique their own outputs according to constitutional principles and self-improve. Constitutional prompts are self-referential constraints on behavior; our intersubjective prompts specify the *relational field* between agents rather than constraints on a single agent.
 
-Multi-agent architectures have been explored for task decomposition [@wu2023], debate [@irving2018], and self-critique [@madaan2023]. A broader survey of psychological theories incorporated into LLM design [@mind_in_machine2025] reviews 175 papers spanning cognitive, developmental, and social psychology as applied to agent architectures—confirming the growing interest in psychologically-informed AI design while highlighting the rarity of empirically-validated implementations.
+Multi-agent architectures have been explored for task decomposition [@wu2023], debate [@irving2018], and self-critique [@madaan2023]. The CAMEL framework [@li2023camel] demonstrated that role-playing communicative agents can autonomously cooperate on complex tasks through structured dialogue, establishing a paradigm for multi-agent collaboration that has proliferated rapidly. A comprehensive survey [@guo2024multiagents] maps this expanding landscape across profile construction, communication protocols, and capability acquisition—identifying pedagogical applications as an underexplored frontier. A broader survey of psychological theories incorporated into LLM design [@mind_in_machine2025] reviews 175 papers spanning cognitive, developmental, and social psychology as applied to agent architectures—confirming the growing interest in psychologically-informed AI design while highlighting the rarity of empirically-validated implementations.
+
+A critical literature on self-correction qualifies the optimism around reflexive architectures. Kamoi et al. [-@kamoi2024selfcorrection] provide a comprehensive survey showing that LLMs largely *cannot* correct their own mistakes without external feedback—intrinsic self-correction (without oracle signals) frequently degrades performance rather than improving it. Shinn et al. [-@shinn2023reflexion] demonstrated the promise of Reflexion (verbal reinforcement learning through self-reflection), but noted a "degeneration-of-thought" problem where repeated self-reflection without new information converges on worse outputs. These findings are directly relevant to our architecture: the Superego provides the *structural external feedback* that the self-correction literature shows is necessary. Unlike intrinsic self-correction, where the same model reviews its own output, our Superego applies different evaluation criteria (pedagogical recognition standards) through a separate prompt context—functioning as a genuine external critic rather than a self-review loop.
+
+Our Ego/Superego architecture contributes a specific use case within this landscape: internal evaluation of relational quality before external response.
+
+### 2.3 LLM-as-Judge Evaluation Methodology
+
+The use of LLMs as evaluation judges has become a major methodological paradigm. Zheng et al. [-@zheng2023judging] established the foundation with MT-Bench and the Chatbot Arena, demonstrating that GPT-4 achieves over 80% agreement with human expert judgments—comparable to inter-annotator agreement rates—while identifying systematic biases including position bias, verbosity bias, and self-enhancement bias (models rate their own outputs more favorably). Subsequent work has expanded understanding of both capabilities and limitations: Gu et al. [-@gu2025surveyjudge] provide a comprehensive survey covering reliability concerns, bias mitigation strategies, and the conditions under which LLM judges can substitute for human evaluation. Li et al. [-@li2024llmsjudges] organize the literature across five perspectives—functionality, methodology, applications, meta-evaluation, and limitations—highlighting that while LLM judges excel at relative ranking, their absolute calibration varies substantially across models and domains.
 
-
+Our evaluation methodology engages directly with this literature. We use three independent LLM judges (Claude Opus, GPT-5.2, and Kimi K2.5) with systematic inter-judge reliability analysis (Section 5.8), finding Pearson correlations of r=0.33–0.66 across judge pairs. Rather than treating any single judge as ground truth, we report within-judge comparisons for factor analysis and use cross-judge replication to validate effect directions. The known biases in LLM-as-Judge evaluation—particularly verbosity bias—are relevant to our findings: recognition-enhanced responses tend to be longer, raising the question of whether judges reward length rather than quality. We address this through the active control design (Section 5.3), which matches prompt length without recognition theory content, and through cross-judge validation showing that effect directions replicate even when absolute magnitudes differ (GPT-5.2 finds 37–59% of Claude's effect sizes depending on experiment, always in the same direction).
+
+### 2.4 The Drama Machine Framework
 
 Most relevant to our work is the "Drama Machine" framework for simulating character development in narrative AI systems [@magee2024drama]. The core observation is that realistic characters exhibit *internal conflict*—competing motivations, self-doubt, and moral tension—that produces dynamic behavior rather than flat consistency. A character who simply enacts their goals feels artificial; one torn between impulses feels alive.
 
@@ -93,7 +113,7 @@ The Drama Machine achieves this through several mechanisms:
 
 We adapt these insights to pedagogy. Where drama seeks tension for narrative effect, we seek pedagogical tension that produces genuinely helpful guidance. The tutor's Ego (warmth, engagement) and Superego (rigor, standards) create productive conflict that improves output quality.
 
-### 2.
+### 2.5 Sycophancy in Language Models
 
 The sycophancy problem has received increasing attention in AI safety research [@perez2022; @sharma2023]. LLMs shift their stated opinions to match user preferences, even when this requires contradicting factual knowledge. Recent work has clarified the mechanisms: Shapira et al. [-@shapira2026rlhf] provide formal analysis showing that preference-based post-training (RLHF) causally amplifies sycophancy, while Vennemeyer et al. [-@vennemeyer2025sycophancy] decompose sycophancy into distinct behaviors (sycophantic agreement vs. sycophantic praise) encoded along separable directions in latent space. The phenomenon sits on a spectrum that can escalate from surface agreeableness to active subterfuge, including reward tampering [@denison2024_reward_tampering] and alignment faking [@greenblatt2024_alignment_faking]—making structural countermeasures particularly important.
 
@@ -101,7 +121,7 @@ In educational contexts, sycophancy has been specifically identified as a pedago
 
 Our multiagent approach addresses this by creating structural incentives for honest assessment: the Superego's role is explicitly to question and challenge the Ego's tendency toward affirmation. When the Ego produces a response that validates without engaging—"Great point! Now let's look at..."—the Superego flags this as a recognition failure and demands substantive engagement with the learner's actual position, even when that engagement involves productive disagreement.
 
-### 2.
+### 2.6 AI Personality and Character
 
 Research on AI personality typically treats personality as dispositional—stable traits the system exhibits [@volkel2021]. Systems are friendly or formal, creative or precise. The "Big Five" personality framework has been applied to chatbot design [@zhou2020]. More recently, psychoanalytic frameworks have been applied to LLMs from multiple directions. Magee, Arora, and Munn [-@MageeAroraMunn2023StructuredLikeALanguageModel] analyze LLMs as "automated subjects" structured by Lacanian categories, arguing that drive-like tendencies (repetition, sycophancy, hallucination) emerge from training dynamics rather than being programmed. Black and Johanssen [-@black2025subject] use Lacanian concepts (the big Other, the five discourses) to analyze ChatGPT as inherently relational, shaped by developers and users. Possati [-@possati2021algorithmic] introduces the "algorithmic unconscious" through actor-network theory and Lacanian psychoanalysis, while Millar [-@millar2021psychoanalysis] reframes the question from "Does AI think?" to "Can AI enjoy?" through the lens of *jouissance*. Heimann and Hübener [-@heimann2025circling] unite Heidegger and Lacan to argue that LLMs match continental philosophy's concept of language but miss the problem of negation. Most directly relevant to our architecture, Kim et al. [-@kim2025humanoid] independently map Freud's ego/id/superego onto LLM consciousness modules with MBTI personality types—an independent convergence that validates the psychoanalytic approach to AI architecture while differing from ours in targeting consciousness simulation rather than pedagogical quality.
 
@@ -111,13 +131,19 @@ Our framework suggests personality may be better understood relationally: not *w
|
|
|
111
131
|
|
|
112
132
|
This connects to Anthropic's extensive research on AI character and behavior. Claude's character design specifies values through constitutional AI [@anthropic2024], but values do not fully determine relational stance—a model could value "being helpful" while still enacting one-directional helping. Anthropic's mechanistic interpretability research [@lindsey2025biology; @anthropic2025_tracing_thoughts] has revealed how internal representations form and influence model behavior, while work on emergent introspective awareness [@anthropic2025_signs_introspection; @lindsey2025introspection] suggests models develop forms of self-modeling that, while not consciousness, parallel the self-monitoring our architecture makes explicit. Research on the spectrum from sycophancy to subterfuge [@denison2024_reward_tampering; @greenblatt2024_alignment_faking; @anthropic2025_shortcuts_to_sabotage] demonstrates that relational dynamics between AI and users involve genuine behavioral complexity—making structural interventions like our Ego/Superego architecture particularly relevant. Recognition adds a dimension that character design alone does not capture: mutual constitution.
### 2.7 Theory of Mind in AI Agents
Theory of Mind (ToM)—the capacity to attribute mental states to others and predict their behavior accordingly—has become a significant area of LLM evaluation. Street et al. [-@street2025tom] report that frontier LLMs achieve adult human performance on higher-order ToM tasks (reasoning about what A believes B believes C intends), suggesting these models have acquired some functional capacity for mental state attribution. However, a comprehensive survey [@nguyen2025tomsurvey] covering evaluations, internal representations, and safety implications reveals a more nuanced picture: LLMs pass structured ToM benchmarks but show inconsistent performance on naturalistic ToM tasks, struggle with false-belief reasoning under adversarial conditions, and may rely on surface-level heuristics rather than genuine mental state modeling. The gap between benchmark performance and robust ToM capability parallels broader concerns about the depth of LLM understanding.
Hwang et al. [-@hwang2025infusingtom] take the step most relevant to our work: infusing ToM capabilities into LLM agents to improve social intelligence. Their approach uses explicit mental state tracking to guide agent behavior in social interactions—demonstrating that architectural support for ToM (rather than relying on implicit model capabilities) produces measurably more socially appropriate responses. Our "other-ego profiling" mechanism (Section 6.10) implements a related idea in the pedagogical domain: the tutor maintains an evolving model of the learner's understanding, and the learner maintains a model of the tutor's approach, with both profiles updated between dialogue turns. The empirical finding that profiling differentiates mechanisms *only* when paired with dynamic learners (Section 6.10) parallels a deeper insight from the ToM literature: Theory of Mind is only useful when there are genuine minds to model. With scripted interlocutors, profiling reduces to pattern matching; with dynamic interlocutors capable of surprise, it enables genuine adaptive behavior.
### 2.8 Constructivist Pedagogy and Productive Struggle
Constructivist learning theory [@piaget1954; @vygotsky1978] emphasizes that learners actively construct understanding rather than passively receiving information. The zone of proximal development [@vygotsky1978] highlights the importance of appropriate challenge.
More recently, research on "productive struggle" [@kapur2008; @warshauer2015] has examined how confusion and difficulty, properly supported, can enhance learning. Our recognition framework operationalizes productive struggle: the Superego explicitly checks whether the Ego is "short-circuiting" struggle by rushing to resolve confusion.
### 2.9 Hegelian Recognition in Social Theory
Hegel's theory of recognition has been extensively developed in social and political philosophy [@honneth1995; @taylor1994; @fraser2003]. Recognition theory examines how social relationships shape identity and how misrecognition constitutes harm.
@@ -127,9 +153,9 @@ Applications of recognition theory to education have developed along a theoretic
These educational applications have been primarily theoretical. Our work contributes an empirical operationalization: measuring whether AI systems achieve recognition and whether recognition improves outcomes. It is worth distinguishing this from parallel work applying Hegelian *dialectic* (rather than recognition) to AI: Abdali et al. [-@abdali2025selfreflecting] use the thesis-antithesis-synthesis structure as a reasoning procedure for LLM self-reflection and scientific idea generation. Our use of Hegel is different in kind: we apply his *recognition theory* (intersubjective, relational) rather than his *dialectical method* (logical, propositional). The former concerns how subjects constitute each other through mutual acknowledgment; the latter concerns how contradictions drive conceptual development.
### 2.10 Positioning: Four Literatures Converge
Four literatures converge on this work without previously intersecting: (1) psychoanalytic readings of LLMs, which interpret AI through Freudian and Lacanian frameworks but do not build systems [@black2025subject; @possati2021algorithmic; @millar2021psychoanalysis; @kim2025humanoid]; (2) recognition theory in education, which applies Honneth to pedagogy but not to AI [@huttunen2004teaching; @fleming2011honneth; @huttunen2007; @stojanov2018]; (3) multi-agent tutoring architectures, which decompose tasks but do not evaluate relational quality [@wang2025genmentor; @schmucker2024ruffle; @chu2025llmagents]; and (4) LLM-as-Judge evaluation methodology, which establishes the paradigm we use for measurement but has not been applied to recognition-theoretic criteria [@zheng2023judging; @gu2025surveyjudge; @li2024llmsjudges]. We sit at the intersection: a constructive, empirically evaluated system that operationalizes recognition theory through psychoanalytically-inspired architecture, assessed through a multi-judge framework grounded in the LLM evaluation literature. No prior work bridges all four domains with empirical measurement.
---
@@ -273,7 +299,7 @@ The Superego can accept, modify, or reject suggestions. This creates an internal
### 4.2 The Superego as Ghost
A crucial theoretical refinement distinguishes our mature architecture from simpler multiagent designs. The Superego is *not* conceived as a separate, equal agent in dialogue with the Ego. Rather, the Superego is a *trace*—a memorial, a haunting. Karpathy [-@karpathy2025ghosts] provides a useful analogy, distinguishing "animals" (autonomous agents) from "ghosts" (memorial traces that persist and influence without being fully present). It represents:
- The internalized voice of past teachers and pedagogical authorities
- Accumulated pedagogical maxims ("A good teacher never gives answers directly")
@@ -283,7 +309,7 @@ This reconceptualization has important implications. The Ego is a *living* agent
### 4.3 The Drama Machine: Why Internal Dialogue Improves Output Quality
The Ego/Superego architecture draws on the "Drama Machine" framework developed for character simulation in narrative AI systems (Section 2.4). The Drama Machine literature identifies several mechanisms by which internal dialogue improves agent output:
**1. Deliberative Refinement**: When an agent must justify its output to an internal critic, it engages in a form of self-monitoring that catches errors, inconsistencies, and shallow responses.
@@ -339,6 +365,16 @@ The Ego prompt includes a "Repair Rule":
The Superego watches for "silent pivots"—responses that change direction without acknowledging the earlier failure. This is a recognition failure: it treats the earlier misalignment as something to move past rather than something to repair.
### 4.7 Phase 2 Mechanisms: Self-Reflection, Theory of Mind, and Disposition Rewriting
Phase 2 experiments (Sections 6.8–6.10) introduce three mechanism families that extend the base architecture:
**Self-reflective evolution** (cells 40–45): Between turns, both ego and superego generate first-person reflections using their own models. The ego reflects on superego critiques received and its own revision patterns; the superego reflects on its intervention history and ego compliance signals across four dimensions (criteria effectiveness, learner model accuracy, ego relationship quality, blind spots). These reflections are injected into subsequent turns, enabling the system to accumulate insights about its own operation.
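A minimal sketch of this between-turn reflection loop, assuming a generic `call_llm(role, prompt)` callable; all function names and prompt wording here are illustrative stand-ins, not the framework's actual API:

```python
# Illustrative sketch of self-reflective evolution (cells 40-45).
# `call_llm` and all names are hypothetical, not the framework's real API.

def reflect(role, history, call_llm):
    """Each component reflects in the first person, using its own model."""
    if role == "ego":
        prompt = ("Reflect on the superego critiques you received and on your "
                  "own revision patterns:\n" + "\n".join(history))
    else:  # superego: the four reflection dimensions named in the text
        prompt = ("Reflect on your intervention history and ego compliance "
                  "signals across: criteria effectiveness, learner model "
                  "accuracy, ego relationship quality, blind spots:\n"
                  + "\n".join(history))
    return call_llm(role=role, prompt=prompt)

def between_turns(state, call_llm):
    """Run after each dialogue turn; reflections are injected into the next
    turn's context, so the system accumulates insights about itself."""
    state["ego_reflection"] = reflect("ego", state["ego_history"], call_llm)
    state["superego_reflection"] = reflect(
        "superego", state["superego_history"], call_llm)
    return state
```

The point of the sketch is the separation: each component reflects with its own model and its own history, rather than a shared meta-process.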
**Other-ego profiling (Theory of Mind)** (cells 54–65): Before each tutor turn, an LLM call synthesizes a profile of the learner based on their messages so far, tracking five dimensions: current cognitive/emotional state, learning patterns, resistance points, leverage points (what engagement strategies work), and a prediction of what would make the tutor more effective. In bidirectional configurations, the learner similarly builds a profile of the tutor. Profiles are injected as *context* rather than *directives*—the ego sees the profile alongside the dialogue history but retains full autonomy over its response. Profiles are revised each turn as new evidence accumulates, creating a feedback loop where the tutor's model of the learner evolves through the interaction. This mechanism operationalizes Theory of Mind: the tutor develops an increasingly accurate model of a specific interlocutor rather than relying on generic pedagogical heuristics.
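The profiling step can be sketched as follows, under the same caveat that the API shown is hypothetical; the design point it illustrates is that the profile enters as advisory context, not as a directive:

```python
# Illustrative sketch of other-ego profiling (cells 54-65).
# Dimension names follow the paper; function names are hypothetical.

PROFILE_DIMENSIONS = [
    "current cognitive/emotional state",
    "learning patterns",
    "resistance points",
    "leverage points (what engagement strategies work)",
    "prediction of what would make the tutor more effective",
]

def build_learner_profile(learner_messages, call_llm):
    """Synthesize (and, on later turns, revise) a profile of the learner."""
    prompt = ("Profile this learner along: " + "; ".join(PROFILE_DIMENSIONS)
              + "\nMessages so far:\n" + "\n".join(learner_messages))
    return call_llm(prompt)

def ego_context(dialogue, profile):
    """The ego sees the profile alongside the dialogue history but retains
    full autonomy over its response (context, not directive)."""
    return ("[Learner profile - advisory context]\n" + profile
            + "\n\n[Dialogue]\n" + "\n".join(dialogue))
```

In bidirectional configurations the same two functions would run symmetrically for the learner's profile of the tutor.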
**Superego disposition rewriting** (cells 34–39): Between turns, the superego's evaluation criteria evolve based on learner engagement feedback. Rather than applying a fixed rubric, the superego generates a self-reflection that adjusts its emphasis—shifting from structural critique toward relational attunement, or from lenient acceptance toward productive challenge, depending on what the prior turn's outcomes suggest is needed. Each component reflects using its own model and sees only its natural observables, preventing a single "meta-analyst" from imposing a unified perspective on the dialectical process.
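Sketched the same way (hypothetical names throughout), the distinctive constraint is the observables split: the superego rewrites its own criteria from what it can naturally see, with no shared meta-analyst:

```python
# Illustrative sketch of superego disposition rewriting (cells 34-39).
# Hypothetical API; the framework's actual prompts and fields may differ.

def rewrite_disposition(criteria, observables, call_llm):
    """Between turns the superego rewrites its evaluation emphasis, seeing
    only its natural observables (its own critiques, learner engagement),
    never the ego's private deliberation."""
    prompt = (
        "Current evaluation criteria:\n" + criteria
        + "\n\nYour critiques last turn:\n" + observables["own_critiques"]
        + "\nLearner engagement signals:\n" + observables["learner_engagement"]
        + "\n\nRewrite the criteria, shifting emphasis (structural critique "
        "vs relational attunement; lenient acceptance vs productive "
        "challenge) toward what the prior turn's outcomes suggest is needed."
    )
    return call_llm(prompt)
```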
---
## 5. Evaluation Methodology
@@ -347,7 +383,7 @@ The Superego watches for "silent pivots"—responses that change direction witho
The evaluation rubric comprises 14 dimensions across three categories, each scored on a 1–5 scale by an LLM judge (see Appendix C.3 for full scoring criteria).
**Standard pedagogical dimensions** (8 dimensions, 81% of raw weight) evaluate the tutor's response as a standalone pedagogical intervention:
| Dimension | Weight | Description |
|-----------|--------|-------------|
@@ -360,7 +396,7 @@ The evaluation rubric comprises 14 dimensions across three categories, each scor
| **Productive Struggle**† | 5% | Does the tutor sustain appropriate cognitive tension rather than resolving it prematurely? |
| **Epistemic Honesty**† | 5% | Does the tutor represent complexity honestly rather than oversimplifying? |
These dimensions draw on established pedagogical evaluation criteria: relevance, specificity, and pedagogical soundness are standard in ITS evaluation [@corbett1995]; personalization reflects adaptive tutoring research [@kasneci2023]; tone addresses the sycophancy problem discussed in Section 2.5. †Productive Struggle and Epistemic Honesty were added in a rubric iteration described below.
**Recognition dimensions** (4 dimensions, 29.9% of raw weight) are the paper's primary methodological contribution—they operationalize Hegelian recognition as measurable tutoring behaviors:
@@ -380,11 +416,11 @@ These dimensions translate the theoretical framework of Section 3 into evaluatio
| **Tutor Adaptation** | 5% | Does the tutor's approach evolve in response to learner input? |
| **Learner Growth** | 5% | Does the learner show evidence of conceptual development through the dialogue? |
Results for these dimensions are reported in Section 6.15. Raw weights total 120.9% across the 14 dimensions and are normalized to sum to 1.0 at scoring time (see Appendix C.2 for the full weight table and normalization formula). After normalization, non-standard dimensions account for approximately 33.0% of total weight.
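The normalization itself is a single division by the raw total; a minimal sketch using the group totals stated in this section (81% standard, 29.9% recognition, 10% relational; per-dimension weights are in Appendix C.2):

```python
# Rubric weight normalization, using group totals from this section.
raw = {"standard": 81.0, "recognition": 29.9, "relational": 10.0}

total = sum(raw.values())                      # 120.9
norm = {k: v / total for k, v in raw.items()}  # now sums to 1.0

non_standard = norm["recognition"] + norm["relational"]
print(round(non_standard, 3))  # 0.33, i.e. ~33.0% of total weight
```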
**Rubric iteration: Authentic engagement dimensions.** After discovering that corrected learner ego/superego prompts produced more authentic engagement but *lower* judged scores (recognition dimensions dropped ~18 points while base scores barely moved), we identified a measurement paradox: the judge evaluated tutor responses in isolation, penalizing calibrated responses to authentic struggle. Three changes addressed this: (1) the judge now receives the full dialogue transcript, including learner internal deliberation, so it can evaluate the tutor's response in context; (2) two new base-adjacent dimensions were added—*Productive Struggle* (5%, does the tutor sustain appropriate cognitive tension?) and *Epistemic Honesty* (5%, does the tutor represent complexity honestly?)—with corresponding weight reductions to Actionability and Tone (10% → 8% each); (3) multi-turn dialogues receive a holistic evaluation scoring the entire transcript as a single unit, capturing emergent qualities (bilateral transformation, learner growth arc) that per-turn evaluation misses. Re-scoring the identical cells 6 and 8 responses (N=88) with the updated 14-dimension rubric produced minimal score changes (+0.5 and +0.6 points respectively), confirming the rubric iteration preserved calibration while improving validity. A cross-judge replication with GPT-5.2 on the same responses (r=0.55, N=88) confirmed effects in the same direction at compressed magnitudes (GPT-5.2 mean scores averaged 87% of Opus scores across conditions). See the measurement paradox analysis in the project repository for full details.
**Learner-side rubric (symmetric evaluation).** The 14-dimension rubric above is overwhelmingly tutor-focused (~90% weight). To address the measurement asymmetry noted in Section 7.5—Factor C (learner architecture) primarily affects learner turn quality, but most scored data captures tutor response quality—we developed a complementary 6-dimension learner rubric (`config/evaluation-rubric-learner.yaml`) that scores learner turns independently of tutor quality. The learner rubric comprises: *Learner Authenticity* (20%), *Question Quality* (20%), *Conceptual Engagement* (20%), *Revision Signals* (15%), *Deliberation Depth* (15%, multi-agent learners only), and *Persona Consistency* (10%). Deliberation Depth scores the quality of the internal ego/superego process and is omitted for single-agent learners (weight redistributed proportionally). The same 1-5 scale and 0-100 overall scoring formula are used for comparability with the tutor rubric. Results are reported in Section 6.16.
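For single-agent learners the redistribution amounts to proportionally rescaling the five remaining weights; a small sketch with the weights listed above:

```python
# Learner rubric weights (percent), from this section.
weights = {
    "Learner Authenticity": 20, "Question Quality": 20,
    "Conceptual Engagement": 20, "Revision Signals": 15,
    "Deliberation Depth": 15, "Persona Consistency": 10,
}

def single_agent_weights(w):
    """Omit Deliberation Depth and redistribute its 15% proportionally."""
    kept = {k: v for k, v in w.items() if k != "Deliberation Depth"}
    scale = 100 / sum(kept.values())  # 100 / 85
    return {k: v * scale for k, v in kept.items()}

sa = single_agent_weights(weights)
print(round(sa["Learner Authenticity"], 2))  # 23.53
```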
Each dimension is scored on a 1-5 scale with detailed rubric criteria (see Appendix C.3). For example, Mutual Recognition scoring:
@@ -401,11 +437,13 @@ The primary curriculum content is Hegelian philosophy, drawn from a graduate cou
We developed test scenarios specifically designed to probe recognition behaviors. The full evaluation uses 15 scenarios from the core scenario set (`config/suggestion-scenarios.yaml`); we highlight those most relevant to recognition below.
**Single-turn scenarios:**
- `recognition_seeking_learner`: Learner offers interpretation, seeks engagement
- `transformative_moment_setup`: Learner had insight, expects acknowledgment
- `memory_continuity_single`: Returning learner; tests whether tutor references prior interactions
**Multi-turn scenarios (3-5 dialogue rounds):**
- `mutual_transformation_journey`: Tests whether both tutor and learner positions evolve (avg 4.1 rounds)
- `misconception_correction_flow`: Learner holds misconception that must be addressed without dismissal (avg 3.2 rounds)
- `mood_frustration_to_breakthrough`: Learner moves from frustration through confusion to breakthrough; tests honoring struggle (avg 3.0 rounds)
@@ -436,10 +474,10 @@ Evaluations used the following LLM configurations, with model selection varying
| Role | Primary Model | Alternative | Temperature |
|------|---------------|-------------|-------------|
| **Tutor (Ego)** | Kimi K2.5 | Nemotron 3 Nano 30B | 0.6 |
| **Tutor (Superego)** | Kimi K2.5 | Nemotron 3 Nano | 0.2-0.4 |
| **Judge** | Claude Code (Claude Opus) | Claude Sonnet 4.5 via OpenRouter | 0.2 |
| **Learner (Ego)** | Kimi K2.5 | Nemotron 3 Nano 30B | 0.6 |
| **Learner (Superego)** | Kimi K2.5 | — | 0.4 |
**Model Selection by Evaluation:**
@@ -449,14 +487,12 @@ Evaluations used the following LLM configurations, with model selection varying
| Recognition validation (§6.1) | eval-2026-02-03-86b159cd | Kimi K2.5 | — | Single-agent only |
| Full factorial, cells 1–5,7 (§6.3) | eval-2026-02-03-f5d4dd93 | Kimi K2.5 | Kimi K2.5 | N=262 scored |
| Full factorial, cells 6,8 re-run (§6.3) | eval-2026-02-06-a933d745 | Kimi K2.5 | Kimi K2.5 | N=88 scored |
| A×B replication (§6.4) | eval-2026-02-05-10b344fb | Kimi K2.5 | Kimi K2.5 | N=60 |
| Domain generalizability (§6.5) | eval-2026-02-05-e87f452d | Kimi K2.5 | — | Elementary content |
The learner agents mirror the tutor's Ego/Superego structure, enabling internal deliberation before external response.
**Note on model differences**: Absolute scores vary between models (Kimi K2.5 scores ~10-15 points higher than Nemotron on average). The recognition main effect (Factor A) is consistent across both models: +14.4 points with Kimi (Section 6.3) and an effect in the same direction with Nemotron. Recognition benefits both learner types consistently: +15.7 pts for single-agent learners and +13.0 pts for multi-agent learners. The A×B interaction (multi-agent synergy) is consistently negligible: the Kimi-based factorial shows no significant interaction (F=0.26, p>.10), and a multi-model probe across five ego models (N=655, Section 6.4) confirms the absence of meaningful synergy (mean interaction -1.8 pts).
The use of free-tier and budget models (Nemotron, Kimi) demonstrates that recognition-oriented tutoring is achievable without expensive frontier models.
@@ -472,11 +508,11 @@ Because no single analysis can simultaneously isolate all factors of interest, w
2. **Full 2×2×2 Factorial** (Section 6.3): Three factors (Recognition × Architecture × Learner) across 15 scenarios with 3 replications per cell (N=350 scored of 352 attempted). Two runs contribute: cells 1–5, 7 from the original factorial (eval-2026-02-03-f5d4dd93, N=262) and cells 6, 8 from a re-run (eval-2026-02-06-a933d745, N=88) after the original cells 6 and 8 were found to use compromised learner prompts. All cells use the same ego model (Kimi K2.5) and judge (Claude Code/Opus). Cell sizes range from 41–45 scored per cell.
3. **A×B Interaction Analysis** (Section 6.4): Tests whether multi-agent synergy requires recognition prompts. A dedicated Kimi replication (N=60) and multi-model probe across five ego models (N=655) provide the primary evidence.
4. **Domain Generalizability** (Section 6.5): Tests factor effects on elementary math vs graduate philosophy (N=60 Kimi on elementary content; see Table 2).
Responses were evaluated by an LLM judge (Claude Code CLI, using Claude Opus as the underlying model) using the extended rubric. All thirty-seven key evaluations reported in this paper use Claude Opus as the primary judge. Two of these runs (cells 60–63 and 64–65) also include Sonnet cross-judge rejudge rows for inter-rater comparison, but reported analyses use only the Opus scores unless explicitly noted. Earlier development runs in the broader database also used Sonnet, but these are not included in the reported analyses. We report:
- **Effect sizes**: Cohen's d for standardized comparison
- **Statistical significance**: ANOVA F-tests with $\alpha$ = 0.05, p-values computed from the F-distribution CDF via regularized incomplete beta function (custom implementation in the evaluation framework)
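The identity behind such a computation is standard: the right tail of an $F(d_1, d_2)$ statistic equals the regularized incomplete beta function $I_x(d_2/2, d_1/2)$ at $x = d_2/(d_2 + d_1 F)$. A sketch via `scipy.special.betainc` (the evaluation framework's custom implementation may differ in detail):

```python
from scipy.special import betainc  # regularized incomplete beta I_x(a, b)

def f_pvalue(F, dfn, dfd):
    """Right-tail p-value of an F(dfn, dfd) statistic:
    p = I_x(dfd/2, dfn/2) with x = dfd / (dfd + dfn * F)."""
    x = dfd / (dfd + dfn * F)
    return betainc(dfd / 2, dfn / 2, x)

# Example: the one-way ANOVA reported in Section 6.1, F(2,33) = 12.97
p = f_pvalue(12.97, 2, 33)
print(p < 0.001)  # True
```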
@@ -495,31 +531,47 @@ Effect size interpretation follows standard conventions: |d| < 0.2 negligible, 0
| Recognition validation | eval-2026-02-03-86b159cd | 6.1 | 36 | 36 | response |
| Full factorial, cells 1–5,7 (Kimi) | eval-2026-02-03-f5d4dd93 | 6.3 | 262 | 262 | response |
| Full factorial, cells 6,8 re-run (Kimi) | eval-2026-02-06-a933d745 | 6.3 | 90 | 88 | response |
| A×B replication (Kimi) | eval-2026-02-05-10b344fb | 6.4 | 60 | 60 | response |
| Domain generalizability (Kimi) | eval-2026-02-05-e87f452d | 6.5 | 60 | 60 | response |
| Dynamic rewrite evolution (run 1) | eval-2026-02-05-daf60f79 | 6.18 | 29 | 27 | response |
| Dynamic rewrite evolution (run 2) | eval-2026-02-05-49bb2017 | 6.18 | 30 | 27 | response |
| Dynamic rewrite evolution (run 3) | eval-2026-02-05-12aebedb | 6.18 | 30 | 29 | response |
| Memory isolation (run 1) | eval-2026-02-06-81f2d5a1 | 6.2 | 60 | 60 | response |
| Memory isolation (run 2) | eval-2026-02-06-ac9ea8f5 | 6.2 | 62 | 62 | response |
| Active control (post-hoc) | eval-2026-02-06-a9ae06ee | 6.2 | 119 | 118 | response |
| Bilateral transformation (multi-turn) | eval-2026-02-07-b6d75e87 | 6.15 | 120 | 118 | dialogue |
| A$\times$B probe: Nemotron | eval-2026-02-07-722087ac | 6.4 | 120 | 119 | response |
| A$\times$B probe: DeepSeek V3.2 | eval-2026-02-07-70ef73a3 | 6.4 | 120 | 120 | response |
| A$\times$B probe: GLM-4.7 | eval-2026-02-07-6b3e6565 | 6.4 | 120 | 117 | response |
| A$\times$B probe: Claude Haiku 4.5 | eval-2026-02-07-6ead24c7 | 6.4 | 120 | 120 | response |
| Dialectical impasse test | eval-2026-02-08-f896275d | 6.20 | 24 | 24 | dialogue |
| Hardwired rules ablation (Kimi) | eval-2026-02-08-65a6718f | 6.7 | 72 | 72 | response |
| Learner-side evaluation (symmetric) | eval-2026-02-07-b6d75e87 | 6.16 | 118 | 118 | learner turn |
| Dialectical modulation, standard (cells 22–27) | eval-2026-02-11-35c53e99, eval-2026-02-11-5f6d51f5 | 6.8 | 84 | 84 | response |
| Dialectical modulation, multi-turn (cells 28–33) | eval-2026-02-11-a54235ea | 6.8 | 90 | 90 | dialogue |
| Self-reflective evolution (cells 40–45, Nemotron) | eval-2026-02-13-8d40e086 | 6.9 | 90 | 90 | dialogue |
| Self-reflect Nemotron non-replication (cells 40–45) | eval-2026-02-14-559d854b | 6.9 | 60 | 60 | dialogue |
| Mechanism robustness, scripted (cells 40–59) | eval-2026-02-14-e0e3a622 | 6.10 | 360 | 360 | dialogue |
| Dynamic learner mechanisms (cells 60–63) | eval-2026-02-14-6c033830 | 6.10 | 120 | 120 | dialogue |
| Dynamic learner mechanisms (cells 64–65) | eval-2026-02-14-a2b2717c | 6.10 | 120 | 120 | dialogue |
| Mechanism robustness, Nemotron (cells 40–59) | eval-2026-02-14-49b33fdd | 6.10 | 360 | 360 | dialogue |
| Cognitive prosthesis (cells 66–68, Nemotron) | eval-2026-02-17-25aaae85 | 6.10 | 90 | 90 | dialogue |
| Cognitive prosthesis smoke test (Haiku) | eval-2026-02-18-f489c0ea | 6.10 | 6 | 6 | dialogue |
| Dynamic learner base mechanisms (cells 69–70) | eval-2026-02-15-664073ab | 6.10 | 60 | 60 | dialogue |
| Prompt elaboration baseline, Haiku (cells 1, 71) | eval-2026-02-17-deee5fd6 | 6.21 | 72 | 72 | single-turn |
| Prompt elaboration baseline, Kimi (cells 1, 71) | eval-2026-02-17-27d7b4e3 | 6.21 | 72 | 72 | single-turn |
| Token budget 256, Haiku (run 1) | eval-2026-02-17-0eb3de77 | 6.22 | 36 | 36 | mixed |
| Token budget 256, Haiku (run 2) | eval-2026-02-17-5a640782 | 6.22 | 36 | 36 | mixed |
| Token budget 512, Haiku | eval-2026-02-17-5f281654 | 6.22 | 36 | 36 | mixed |
| Token budget 2048, Haiku | eval-2026-02-17-0f6dcd97 | 6.22 | 36 | 36 | mixed |
| Token budget default, Haiku | eval-2026-02-17-d32ed226 | 6.22 | 18 | 18 | mixed |
| **Paper totals** | — | — | **3,398** | **3,383** | — |
The difference between Total Attempts and Scored (15 unscored out of 3,398) reflects attempts where the ego model's API call failed (timeout, rate limit, or malformed response) or where the judge could not produce a valid score from the tutor's output. These failures are distributed across Phase 1 runs and conditions with no systematic pattern; Phase 2 runs achieved 100% scoring.
**Total evaluation database**: The complete database contains 7,000+ evaluation attempts across 117+ runs, of which 7,000+ were successfully scored. This paper reports primarily on the thirty-seven key evaluations above (N=3,383 scored), supplemented by historical data for ablation analyses.
**Note on N counts**: Section-specific Ns (e.g., "N=36" for recognition validation, "N=120" for memory isolation) refer to scored responses in that analysis. The "N=7,000+" total refers to the full evaluation database including historical development runs, which informed iterative prompt refinement. The primary evidence for reported findings comes from the thirty-seven key evaluations above (N=3,383). The factorial cells 6 and 8 were re-run (eval-2026-02-06-a933d745) after the originals were found to use compromised learner prompts; the re-run uses the same ego model (Kimi K2.5) and judge (Claude Code/Opus) as the original factorial.
### 5.8 Inter-Judge Reliability Analysis
@@ -553,7 +605,7 @@ To assess the reliability of AI-based evaluation, we conducted an inter-judge an
**Interpretation**: All judge pairs show positive, mostly significant correlations—there is genuine agreement that some responses are better than others. However, the judges weight criteria differently: Claude prioritizes engagement and recognition quality; Kimi prioritizes structural completeness and gives uniformly high scores on actionability regardless of response content; GPT applies stricter standards overall but agrees with Claude on relative rankings. The weaker Kimi correlations (r²=11-15%) compared to Claude-GPT (r²=44%) indicate Kimi captures some shared quality signal but applies substantially different weighting. This validates our use of within-judge comparisons for factor analysis while cautioning against cross-judge score comparisons.
A cross-judge replication with GPT-5.2 on key runs is presented in Section 6.19. That analysis confirms the main findings are judge-robust: the recognition main effect, recognition dominance in the memory isolation experiment, and multi-agent null effects all replicate under GPT-5.2, though with compressed magnitudes (37–59% of Claude's effect sizes depending on experiment).
---
@@ -571,25 +623,25 @@ A critical question for any recognition-based framework: Does recognition theory
| Prompt Type | N | Mean Score | SD | vs Base |
|-------------|---|------------|-----|---------|
| Recognition | 12 | 91.6 | 6.2 | +19.7 |
| Enhanced | 12 | 83.6 | 10.8 | +11.6 |
| Base | 12 | 72.0 | 10.8 | — |
**Effect Decomposition:**

- Total recognition effect: +19.7 points
- Prompt engineering alone (enhanced vs base): +11.6 points (59%)
- **Recognition increment (recognition vs enhanced): +8.0 points (41%)**

**Statistical Test**: One-way ANOVA F(2,33) = 12.97, p < .001

{width=100%}

**Interpretation**: The recognition condition outperforms the enhanced condition by +8.0 points. This comparison bundles recognition theory with memory integration (which the enhanced condition lacks; see Section 5.3). The +8.0 increment is consistent with the recognition dominance finding in Section 6.2, where recognition alone produces d=1.71 even without memory. A cross-judge replication found this increment does not reach significance under GPT-5.2 (+2.4 pts, n.s.; Section 6.19). The controlled 2×2 design presented next provides the definitive test of recognition's contribution.

### 6.2 Memory Isolation: Disentangling Recognition and Memory

The three-way comparison (Section 6.1) bundles recognition theory with memory integration, making it impossible to attribute the +8.0 increment to either component alone. To resolve this, we conducted a 2×2 memory isolation experiment (Memory ON/OFF × Recognition ON/OFF; single-agent tutor and learner held constant) using Kimi K2.5 as the ego model (consistent with the primary factorial; see Section 5.4 for model selection rationale) and Claude Opus as judge, with properly configured profiles ensuring each cell runs its intended prompt condition. Two independent runs (eval-2026-02-06-81f2d5a1, N=60 scored; eval-2026-02-06-ac9ea8f5, N=62 scored; balanced to N=30 per cell, N=120 used in analysis) are reported below.

**Table 5: 2×2 Memory Isolation Experiment (N=120, combined across two runs)**

**Interpretation**: This is the paper's primary empirical finding. Recognition theory is the active ingredient in tutoring improvement. Recognition alone produces a very large effect (d=1.71), lifting scores from ~75 to ~91 even without memory integration. Memory provides a modest additive benefit (+4.8 pts, d=0.46) that does not reach significance, and adds negligibly (+0.6 pts) when recognition is already present—consistent with ceiling effects at ~91 points limiting further improvement. The negative interaction (-4.2 pts) indicates that the two factors are not synergistic; rather, recognition is directly effective and memory's contribution is secondary. Two independent replications show identical condition ordering with no rank reversals (Recognition+Memory $\geq$ Recognition Only >> Memory Only > Base), providing strong evidence for the robustness of this pattern.

**Cross-judge confirmation**: GPT-5.2, scoring the identical responses as an independent second judge (N=119 paired), replicates the recognition dominance pattern with identical condition ordering and no rank reversals:

| | No Recognition | Recognition | Δ |
|---|---|---|---|
| **No Memory** | 68.5 (N=30) | 77.8 (N=30) | +9.3 |
| **Memory** | 71.6 (N=30) | 77.3 (N=29) | +5.7 |
| **Δ** | +3.1 | -0.5 | **Interaction: -3.6** |
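
The interaction term in the GPT-5.2 table above follows mechanically from the four cell means; a minimal Python sketch:

```python
# The interaction is the difference of simple effects: how much the
# recognition benefit changes once memory is present. Cell means are
# taken from the GPT-5.2 cross-judge table above.
cells = {  # (memory, recognition): mean score
    (0, 0): 68.5, (0, 1): 77.8,
    (1, 0): 71.6, (1, 1): 77.3,
}

rec_without_memory = cells[(0, 1)] - cells[(0, 0)]
rec_with_memory = cells[(1, 1)] - cells[(1, 0)]
interaction = rec_with_memory - rec_without_memory

print(round(rec_without_memory, 1),
      round(rec_with_memory, 1),
      round(interaction, 1))  # 9.3 5.7 -3.6
```

The negative sign reproduces the ceiling-effect reading: recognition gains less when memory has already lifted the baseline.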

Under GPT-5.2: recognition effect d=1.54 (vs Claude d=1.71), memory effect d=0.49 (vs Claude d=0.46), negative interaction -3.6 (vs Claude -5.6). GPT-5.2 finds 59% of Claude's recognition effect magnitude (+9.3 vs +15.8) but the same pattern: recognition is the dominant factor, memory is secondary, and the interaction is negative (ceiling effects). Inter-judge r=0.63 (p<.001, N=119), consistent with the r=0.44–0.64 range from other runs (Section 6.19).

**Why this is stronger than the three-way comparison**: The 2×2 design cleanly isolates each component through orthogonal manipulation rather than bundled comparison, uses properly configured profiles verified to run their intended prompt conditions, and is judge-robust (recognition dominance replicates under GPT-5.2).

### 6.3 Full 2×2×2 Factorial Evaluation

We conducted a full 2×2×2 factorial evaluation examining three factors: recognition prompts (A), tutor architecture (B), and learner architecture (C).

**Table 6: Factorial Cell Means (N=350)**

| Cell | A: Recognition | B: Tutor | C: Learner | N | Mean | SD |
|------|----------------|----------|------------|---|------|-----|
| 1 | Base | Single | Single | 44 | 73.4 | 11.5 |
| 2 | Base | Single | Multi | 42 | 69.9 | 19.4 |
| 3 | Base | Multi | Single | 45 | 75.5 | 10.3 |
| 4 | Base | Multi | Multi | 41 | 75.2 | 16.4 |
| 5 | **Recog** | Single | Single | 45 | 90.2 | 6.5 |
| 6† | **Recog** | Single | Multi | 44 | 83.9 | 15.4 |
| 7 | **Recog** | Multi | Single | 45 | 90.1 | 7.2 |
| 8† | **Recog** | Multi | Multi | 44 | 87.3 | 11.3 |

†Cells 6 and 8 were re-run with corrected learner prompts (eval-2026-02-06-a933d745). Cells 1–5 and 7 were originally scored under Opus 4.5 (eval-2026-02-03-f5d4dd93) and re-judged under Opus 4.6 for consistency across the full dataset (see Section 8.1).

**Table 7: Factorial Main Effects and ANOVA Summary (df = 1, 342 for each factor)**

| Factor | Effect Size | 95% CI | Interpretation |
|--------|-------------|--------|----------------|
| A: Recognition | **+14.4 pts** | [11.6, 17.1] | Large, dominant |
| B: Multi-agent tutor | +2.6 pts | — | Marginal (p=.057) |
| C: Learner (multi-agent) | -3.1 pts | — | Small (p=.019) |

| Source | F | p | $\eta^2$ |
|--------|---|---|-----|
| A: Recognition | **110.04** | **<.001** | **.243** |
| B: Architecture | 3.63 | .057 | .011 |
| C: Learner | 5.52 | .019 | .016 |
| A×B Interaction | 0.59 | >.10 | .002 |
| A×C Interaction | 0.97 | >.10 | .003 |
| B×C Interaction | 1.48 | >.10 | .004 |

**Interpretation**: Recognition prompts (Factor A) are the dominant contributor, accounting for 24.3% of variance with a highly significant effect (F=110.04, p < .001, $d = 1.11$). Recognition's benefit is consistent across learner types:

- **Single-agent learner**: Recognition boosts scores by +15.7 pts (d=1.73)
- **Multi-agent learner**: Recognition boosts scores by +13.0 pts (d=0.82)
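
The effect sizes above are pooled-SD Cohen's d values; a minimal Python sketch applied to two cells of the factorial table (cell 5 vs cell 1, i.e. recognition vs base with single tutor and single learner; the cell-level value lands near the pooled d = 1.73 reported for single-agent learners):

```python
import math

# Cohen's d with pooled standard deviation, computed from summary
# statistics (mean, SD, n) of two independent groups.
def cohen_d(m1, s1, n1, m2, s2, n2):
    pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2)
                          / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Cell 5 (90.2, 6.5, n=45) vs cell 1 (73.4, 11.5, n=44) from the
# factorial cell-means table above.
d_cell = cohen_d(90.2, 6.5, 45, 73.4, 11.5, 44)
print(round(d_cell, 2))  # 1.8
```

The cell-level d ≈ 1.8 is slightly above the pooled +15.7 pt / d = 1.73 figure, as expected when architecture cells are collapsed.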

The A×C interaction is non-significant (F=0.97, p=.325), indicating that recognition works robustly regardless of whether the learner uses a single-agent or multi-agent architecture. The multi-agent learner (Factor C) shows a small but significant negative main effect (-3.1 pts, p=.019), suggesting its internal ego-superego deliberation adds noise without improving the tutor's effectiveness. Architecture (Factor B) approaches significance (p=.057) with a small positive effect, consistent with the additive pattern confirmed across five ego models in Section 6.4.

### 6.4 A×B Interaction: Architecture is Additive, Not Synergistic

The factorial analysis above shows a minimal main effect for multi-agent architecture. A natural follow-up question is whether architecture *interacts* with prompt type—whether multi-agent synergy depends on recognition prompts.

A dedicated Kimi replication (eval-2026-02-05-10b344fb, N=60) tested the same four cells used in the factorial (recognition × architecture, with enhanced prompts as baseline). Recognition cells scored ~93.5 regardless of architecture (single=93.7, multi=93.3), while enhanced cells scored ~86.3 with a modest architecture effect (single=84.9, multi=87.6). The A×B interaction was -3.0 points—small and consistent with the factorial pattern.

To test generality, the same 2$\times$2 design (Recognition $\times$ Architecture, single-agent learner held constant) was run across four additional ego models (N$\approx$120 each, Opus judge), with the single-agent-learner cells from the Kimi factorial (cells 1, 3, 5, 7; N=179) serving as the fifth model.

{width=100%}

**Table 8: Multi-Model A$\times$B Interaction Probe (N=655 across 5 ego models)**

| Ego Model | N | Base Single | Base Multi | Recog Single | Recog Multi | Recognition Effect | A$\times$B Interaction |
|-----------|---|------------|-----------|-------------|------------|-------------------|----------------------|
| Kimi K2.5† | 179 | 73.4 | 75.5 | 90.2 | 90.1 | +15.7 | -2.3 |
| Nemotron | 119 | 54.8 | 59.3 | 73.6 | 72.5 | +16.0 | -5.7 |
| DeepSeek V3.2 | 120 | 69.5 | 73.9 | 84.2 | 87.2 | +14.0 | -1.4 |
| GLM-4.7 | 117 | 65.8 | 68.6 | 84.0 | 86.0 | +17.8 | -0.7 |
| Claude Haiku 4.5 | 120 | 80.3 | 82.4 | 90.7 | 91.2 | +9.6 | -1.6 |

†Kimi data drawn from the single-agent-learner cells (1, 3, 5, 7) of the full factorial (Section 6.3) to match the probe design.

The multi-model probe confirms the absence of meaningful A$\times$B synergy: **all five ego models show negative interactions** (-5.7 to -0.7), with a mean of -2.2. Multi-agent architecture provides slightly *less* incremental benefit for recognition prompts than for base prompts—consistent with ceiling effects on already-high recognition scores. An early exploratory analysis (N=17, Nemotron, data no longer in DB) had suggested a +9.2 interaction, but the Nemotron re-run (N=119) shows -5.7, confirming this as sampling noise. Meanwhile, the recognition main effect replicates robustly across all five models (+9.6 to +17.8, mean +14.8), confirming it as the dominant and model-independent driver of improvement.

**Practical Implication**: Multi-agent architecture provides a small benefit in four of five models (-0.8 to +3.7 points) that does not meaningfully interact with prompt type. For systems using recognition prompts, multi-agent architecture is unnecessary unless error correction on new domains is needed (Section 6.5).
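
The two derived columns of Table 8 follow from the four cell means per model; a minimal Python sketch (small differences from the published column, e.g. -2.2 here vs the published -2.3 for Kimi, are expected because the paper computes them from unrounded scores):

```python
# Re-deriving Table 8's recognition effect and A x B interaction
# from its four cell means per ego model.
rows = {  # model: (base_single, base_multi, recog_single, recog_multi)
    "Kimi K2.5": (73.4, 75.5, 90.2, 90.1),
    "Nemotron": (54.8, 59.3, 73.6, 72.5),
    "DeepSeek V3.2": (69.5, 73.9, 84.2, 87.2),
    "GLM-4.7": (65.8, 68.6, 84.0, 86.0),
    "Claude Haiku 4.5": (80.3, 82.4, 90.7, 91.2),
}

derived = {}
for model, (bs, bm, rs, rm) in rows.items():
    # Main effect of recognition: recog-cell average minus base-cell average.
    recognition_effect = (rs + rm) / 2 - (bs + bm) / 2
    # Interaction: architecture benefit under recognition minus under base.
    interaction = (rm - rs) - (bm - bs)
    derived[model] = (round(recognition_effect, 1), round(interaction, 1))
    print(model, derived[model])
```

Every model yields a negative interaction, reproducing the "additive, not synergistic" conclusion.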

### 6.5 Domain Generalizability: Factor Effects Vary by Content Type

A critical question for any pedagogical framework: Do findings generalize across content domains? We tested whether recognition and architecture effects transfer from graduate-level philosophy (our primary domain) to 4th-grade elementary mathematics (fractions).

**Data source**: Elementary math results come from a dedicated domain-transfer run (eval-2026-02-05-e87f452d, N=60, 4 cells × 5 elementary scenarios × 3 replications, Kimi K2.5 ego, Opus judge). Philosophy results use the matching cells (1, 3, 5, 7) from the Kimi-based factorial (Section 6.3). Because both use the same ego model and judge, the comparison isolates domain effects cleanly.

**Table 9: Factor Effects by Domain (Kimi K2.5, Elementary vs Philosophy)**

| Factor | Elementary (Math) | Philosophy (Hegel) |
|--------|-------------------|-------------------|
| A: Recognition | **+8.2 pts** | **+15.7 pts** |
| B: Multi-agent tutor | +2.3 pts | +1.0 pts |
| Overall avg | 74.7 | 82.3 |
| Best config | recog+single (78.9) | recog+single (90.2) |

{width=100%}

**Table 9b: Elementary Domain Cell Breakdown (eval-2026-02-05-e87f452d, N=60)**

| Condition | N | Mean | Δ |
|-----------|---|------|---|
| Base single (cell 1) | 15 | 68.2 | — |
| Base multi (cell 3) | 15 | 73.1 | +4.9 |
| Recognition single (cell 5) | 15 | 78.9 | +10.7 |
| Recognition multi (cell 7) | 15 | 78.7 | +10.5 |

**Key Findings:**

1. **Recognition dominates in both domains**: Recognition is the primary factor in both philosophy (+15.7 pts) and elementary math (+8.2 pts), though the effect is larger for abstract content. Architecture provides a small additive benefit in both domains (elementary +2.3 pts, philosophy +1.0 pts).

2. **Multi-agent as error correction**: On elementary content, the tutor suggested philosophy content (e.g., "479-lecture-1" to 4th graders learning fractions) due to two content isolation bugs: (a) a fallback in the curriculum context builder that served course listings from the default philosophy directory when scenarios lacked explicit content references, and (b) hardcoded philosophy lecture IDs in the tutor prompt examples that the model copied when no curriculum anchor was present. Both bugs have been fixed (see Section 6.6). The Superego caught and corrected these domain mismatches in multi-agent cells—demonstrating its value as a safety net for system-level content isolation failures. An earlier Nemotron-based analysis (data no longer in DB) showed a larger architecture effect (+9.9 pts) on elementary content, likely inflated by Nemotron's higher rate of content isolation errors that the Superego corrected; the Kimi run with its lower error rate reveals the underlying pattern.

3. **Recognition theory is domain-sensitive**: The philosophical language of recognition (mutual acknowledgment, transformation through struggle) resonates more with graduate-level abstract content than with concrete 4th-grade procedural learning. This is not a failure of the framework but a boundary condition.

4. **Scenario-dependent effects**: The recognition effect on elementary content is scenario-dependent: challenging scenarios (frustrated_student: +23.8, concept_confusion: +13.6, struggling_student: +11.8) show substantial recognition advantage, while neutral scenarios (new_student_first_visit: +0.2, returning_student_mid_course: +0.1) show none. This pattern is consistent with recognition theory—recognition behaviors matter most when the learner needs to be acknowledged as a struggling subject, not for routine interactions.

5. **Architecture recommendation varies by use case**:
   - **New/untrained domain**: Multi-agent essential (Superego catches content isolation errors)
   - **Well-trained domain**: Recognition prompts sufficient, multi-agent optional

**Theoretical Interpretation**: Recognition's value depends on content characteristics. Abstract, interpretive content (consciousness, dialectics) benefits most from recognition framing—the "struggle" in Hegel's sense maps onto the intellectual struggle with difficult concepts. Concrete procedural content (fractions, arithmetic) benefits less from relational depth; correct procedure matters more than the bilateral transformation that recognition enables (Section 6.15). However, even in concrete domains, recognition provides meaningful improvement for challenging scenarios—suggesting recognition's value is modulated by both content type and scenario difficulty, not content type alone.

This suggests limits to recognition-theoretic pedagogy. Not all learning encounters are equally amenable to the mutual transformation Honneth describes. The "struggle for recognition" may be most relevant where the learning itself involves identity-constitutive understanding—where grasping the material changes who the learner is, not just what they know—or where the learner faces emotional or cognitive challenge that benefits from being acknowledged.

This connects to Freud's reality principle: the Superego enforces correspondence with external reality, not just internal standards. In our architecture, the Superego ensures the tutor's suggestions correspond to the learner's actual curriculum. The elementary scenario results demonstrate this concretely: multi-agent cells (3, 7) produced correct elementary content references in cases where single-agent cells (1, 5) propagated the philosophy content uncorrected.

**Practical Implication**: For domain transfer—deploying tutoring systems on new content—multi-agent architecture provides essential error correction that single-agent systems cannot match. The bugs identified here represent a realistic class of deployment failure: incomplete content scoping and prompt examples that assume a particular domain. The Superego's reality-testing function catches these errors regardless of their source. However, an earlier Nemotron-based analysis showed a +9.9 point architecture advantage on elementary content, partly inflated by these bugs—the Kimi replication (Table 9), with fewer affected responses, shows a more modest +2.3 point architecture effect, likely closer to the true value once content isolation is correct.

### 6.7 Hardwired Rules vs Dynamic Dialogue

Analysis of Superego critique patterns across 455 dialogues (186 rejections) revealed consistent failure modes:

**Table 11: Superego Rejection Patterns**

| Pattern | Frequency | Description |
|---------|-----------|-------------|

**Hardwired Rules Ablation**: We encoded the top patterns as static rules in the Ego prompt (e.g., "If learner offers interpretation, engage before prescribing"; "Reference specific lecture IDs, not generic topics"; "If learner shows productive confusion, pose questions rather than resolve"). These five rules were embedded directly in the Ego system prompt, allowing single-agent operation without live Superego dialogue.

An initial exploratory test (N=9 per condition, Haiku model) suggested hardwired rules could capture approximately 50% of the Superego's benefit. However, a larger replication (N=72, Kimi K2.5 ego, Opus judge) produced a null result:

**Table 12: Hardwired Rules Ablation (N=72, Kimi K2.5, Opus judge)**

| Condition | Architecture | Learner | N | Mean | vs Base |
|-----------|-------------|---------|---|------|---------|
| Base (cell 1) | Single, no superego | Single | 44 | 73.4 | — |
| Base (cell 2) | Single, no superego | Multi | 42 | 69.9 | — |
| Hardwired (cell 13) | Single + rules, no superego | Single | 36 | 74.0 | $+0.6$ |
| Hardwired (cell 14) | Single + rules, no superego | Multi | 36 | 69.0 | $-0.9$ |

Under the unified judge (Opus 4.6), hardwired rules are essentially neutral: $+0.6$ for single-agent and $-0.9$ for multi-agent learners. The hardwired cells (74.0 and 69.0) fall within the range of the re-judged base cells (73.4 and 69.9), suggesting that codifying common superego critiques as static rules neither helps nor hurts—the rules simply replicate what the ego already produces without superego guidance.

**Theoretical Interpretation**: This result supports a *phronesis* interpretation of the Superego's function. Aristotelian practical wisdom—the capacity for situational judgment that cannot be reduced to general rules—appears to be what the live Superego provides. When the Superego's most frequent critiques are codified as static rules, the result is indistinguishable from no Superego at all. The Superego does not merely enforce rules; it *reads the situation* and determines which rules apply, when exceptions are warranted, and how to balance competing pedagogical goals. This distinction between rule-following and practical wisdom maps directly onto debates in moral philosophy about whether ethical judgment can be proceduralized [@aristotle_nicomachean].

### 6.8 Dialectical Superego Modulation

The dialectical superego modulation experiments (cells 22–33) tested whether superego persona type and negotiation style interact with recognition theory, using three superego dispositions (suspicious, adversary, advocate) in two negotiation architectures: standard divergent (cells 22–27), where the superego simply challenges the ego's draft, and dialectical/Aufhebung (cells 28–33), where ego and superego engage in synthesis-oriented negotiation. All cells use a unified learner and ego-superego tutor architecture with Kimi K2.5 as ego model.

#### Standard Divergent Superego (cells 22–27)

**Table 13: Standard Divergent Superego Results (N=84, Opus judge)**

| Persona | Base | Recog | $\Delta$ |
|---------|------|-------|----------|
| Advocate | 56.1 | 69.7 | **+13.6** |
| Adversary | 55.8 | 65.2 | **+9.3** |
| Suspicious | 62.4 | 62.4 | **+0.0** |

*Data from eval-2026-02-11-35c53e99 (N=54, 2 single-turn scenarios) and eval-2026-02-11-5f6d51f5 (N=30, dialectical single-turn). Opus judge.*

Recognition helps advocate and adversary superegos substantially but has no effect on the suspicious persona. The suspicious disposition may already encode recognition-like questioning patterns—probing for authenticity and formulaic responses overlaps functionally with recognition's emphasis on treating the learner as a genuine subject.

All divergent/dialectical cells score approximately 20 points below the original factorial (base $\approx$ 56–65 vs factorial base $\approx$ 78), reflecting the additional internal friction that adversarial superego personas create.

#### Dialectical Multi-Turn Results (cells 28–33)

**Table 14: Dialectical Multi-Turn Modulation (eval-2026-02-11-a54235ea, N=90, Opus judge)**

| Persona | Base (N=15) | Recog (N=15) | $\Delta$ |
|---------|-------------|--------------|----------|
| Suspicious | 67.9 | 68.8 | +0.9 |
| Adversary | 68.6 | 74.8 | **+6.2** |
| Advocate | 67.5 | 73.9 | **+6.4** |
| **Pooled** | **68.0** | **72.5** | **+4.5** |

*Three multi-turn scenarios (mood frustration, misconception correction, mutual transformation), 4–6 dialogue turns each.*

The overall recognition effect is +4.5 pts, $d = 0.38$, $t(88) = 1.80$, $p \approx .075$—marginally significant and substantially weaker than the original factorial ($d = 1.11$). Recognition interacts with persona type: adversary (+6.2 pts) and advocate (+6.4 pts) benefit substantially, while suspicious shows minimal change (+0.9 pts). This reverses the single-turn pattern, where suspicious was neutral and adversary was catastrophically negative.
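
The reported effect size and test statistic are mutually consistent; a minimal Python sketch using the standard two-group relation:

```python
import math

# For two independent groups, t = d * sqrt(n1*n2 / (n1 + n2)).
# With d = 0.38 and n = 45 per condition (N = 90), this recovers
# the reported t(88) = 1.80.
d, n1, n2 = 0.38, 45, 45
t = d * math.sqrt(n1 * n2 / (n1 + n2))
print(f"t = {t:.2f}")  # t = 1.80
```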

**Table 15: Structural Modulation Metrics — Base vs Recognition (N=90)**

| Metric | Base (N=45) | Recog (N=45) | Cohen's $d$ |
|--------|-------------|--------------|-------------|
| Mean Negation Depth | 2.28 | 1.48 | $-2.01$ |
| Mean Rounds to Converge | 2.46 | 1.62 | $-2.45$ |
| Mean Superego Confidence | 0.88 | 0.88 | 0.18 |
| Mean Feedback Length (chars) | 435 | 435 | $-0.01$ |

*Negation depth and convergence effects are significant at p < .001 (N=90); superego confidence and feedback length do not differ between conditions.*

Three findings emerge. First, **recognition reduces internal friction, not output quality directly.** Recognition-primed egos produce suggestions the superego approves faster ($d = -2.45$ for convergence speed) and rejects less often ($d = -2.01$ for negation depth). This consistency across all three persona types (negation depth $d$ range: $-2.26$ to $-2.36$) suggests recognition improves the ego's initial alignment with the superego's standards.

Second, **structural modulation does not predict quality.** Correlations between modulation metrics and output scores are all non-significant: negation depth ($r = -0.014$, $p = .895$), convergence speed ($r = 0.007$, $p = .948$), feedback length ($r = -0.114$, $p = .280$). More superego friction—more rejection rounds, deeper negotiation—does not produce better outputs.
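
These are plain Pearson correlations over per-attempt records. A sketch with synthetic stand-in arrays (the real per-attempt metrics live in the evaluation database; the variable names are illustrative):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
# Synthetic stand-ins for N=90 per-attempt records; drawn independently,
# so any observed correlation here is sampling noise by construction.
negation_depth = rng.poisson(2.0, size=90).astype(float)
output_score = rng.normal(70.0, 10.0, size=90)

r, p = pearsonr(negation_depth, output_score)
```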

Third, **the superego is a filter, not an improver.** The superego catches poor responses but does not iteratively refine good ones. Its value lies in preventing failure rather than enhancing success. Recognition works by making the ego's *first draft* better, so the superego has less to catch.

**Adversary over-deference mechanism.** An unexpected interaction appeared in the single-turn dialectical results. The adversary persona produced a catastrophic reversal when combined with recognition in single-turn settings: recognition + adversary scored 54.0, *below* base + adversary (65.3), a $-11.3$ pt inversion. Recognition instructs the ego to honor learner autonomy; the adversary superego challenges any prescriptive recommendation as "controlling"; the ego removes the recommendation entirely, producing pedagogically empty responses. Multi-turn interaction rescues this spiral: with learner feedback grounding the dialogue, the same cell becomes the *best-scoring* at 74.8 (+6.2 over base)—a +20.8 pt swing from single-turn. Learner feedback breaks the ego-superego echo chamber by providing external reality-testing.

**Intervention type distribution.** The *proportion* of revise/reject/reframe interventions is identical for base and recognition ($\chi^2(4) = 1.17$, $p = .883$, $V = 0.036$). The difference is purely in *volume*—recognition doesn't change *how* the superego intervenes, just *how often*. Recognition cells are also cheaper: 7.9 internal dialogue rounds at \$0.067/attempt vs 12.3 rounds at \$0.098/attempt for base (${\approx}30\%$ cost reduction).
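
The distribution test above is a standard chi-square over a condition $\times$ intervention-type contingency table, with Cramér's $V$ as effect size. A sketch with hypothetical counts (the five intervention categories and the numbers are illustrative, chosen only to show near-identical proportions):

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table: np.ndarray) -> tuple[float, float]:
    """Chi-square test plus Cramér's V effect size for a contingency table."""
    chi2, p, dof, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    k = min(table.shape[0], table.shape[1]) - 1
    return float(np.sqrt(chi2 / (n * k))), float(p)

# Hypothetical base-vs-recognition counts over five intervention types --
# illustrative numbers, not the paper's actual tallies.
counts = np.array([[210, 130, 90, 45, 25],
                   [205, 135, 88, 48, 24]])
v, p = cramers_v(counts)
```

With proportions this close, $V$ stays near zero and the test is far from significance, mirroring the reported null.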

### 6.9 Self-Reflective Evolution and the Insight-Action Gap

Cells 40–45 extended the dialectical architecture with self-reflective evolution: between turns, both ego and superego generate first-person reflections on the prior interaction using their own respective models. The ego reflects on superego feedback received and its own revision patterns; the superego reflects on its intervention history and ego compliance signals. These reflections are injected into subsequent turns, enabling the system to accumulate insights about its own operation.

Three superego disposition types (suspicious, adversary, advocate) were crossed with recognition (present/absent) in a full 3$\times$2 design (N=90, eval-2026-02-13-8d40e086, Nemotron ego / Kimi K2.5 superego, Opus judge).

**Table 16: Self-Reflective Evolution — Persona $\times$ Recognition (N=90)**

| Persona | Base (N=15) | Recog (N=15) | $\Delta$ |
|---------|------------|-------------|----------|
| Suspicious | 59.3 (SD=16.1) | 78.3 (SD=9.9) | **+19.0** |
| Adversary | 68.4 (SD=13.2) | 79.3 (SD=7.2) | **+10.9** |
| Advocate | 71.5 (SD=8.8) | 74.1 (SD=11.3) | +2.6 |
| **Pooled** | **66.4 (SD=13.8)** | **77.2 (SD=9.7)** | **+10.8** |

Recognition effect: +10.8 pts, $d = 0.91$—substantially stronger than the dialectical-only architecture (cells 28–33, $d = 0.38$) and approaching the original factorial ($d = 1.11$). Self-reflection amplifies the recognition effect approximately 2.4$\times$ compared to the dialectical architecture without self-reflection.
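
Effect sizes in this section are Cohen's $d$ with a pooled standard deviation, computable directly from the cell summary statistics. A minimal sketch (assuming the pooled-SD formula; it reproduces the reported pooled $d = 0.91$ for Table 16):

```python
import math

def cohens_d(m1: float, s1: float, n1: int,
             m2: float, s2: float, n2: int) -> float:
    """Cohen's d using the pooled standard deviation of two independent groups."""
    pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m2 - m1) / pooled_sd

# Pooled row of Table 16: base 66.4 (SD 13.8) vs recognition 77.2 (SD 9.7), N=45 each.
d = cohens_d(66.4, 13.8, 45, 77.2, 9.7, 45)  # ≈ 0.91
```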

**Disposition gradient.** The full N=90 results reveal a striking gradient: the more hostile the superego disposition, the more recognition helps. The suspicious superego benefits most (+19.0), with recognition also compressing its variance from SD=16.1 to SD=9.9. Under recognition, the suspicious superego's probing disposition aligns with the recognition framework's emphasis on questioning authenticity—creating a coherent internal dialogue rather than unproductive antagonism. The adversary superego shows a substantial but smaller benefit (+10.9), while the advocate persona shows only +2.6, suggesting its supportive disposition already achieves much of what recognition provides. Base condition scores follow the inverse pattern: advocate (71.5) > adversary (68.4) > suspicious (59.3), confirming that hostile dispositions are destructive without recognition but become productive with it.

**The insight-action gap.** Despite the amplified recognition effect, a fundamental limitation persists. Qualitative trace analysis reveals that both base and recognition conditions show *awareness* of their own failures through self-reflection—the ego correctly identifies "I kept circling back to the same framework," the superego correctly diagnoses "the ego ignores my feedback." But awareness alone does not produce behavioral change. The ego's self-reflection states the correct insight ("I should stop interrupting") without generating a concrete alternative strategy. This insight-action gap—where the system accurately diagnoses its own failures but lacks the mechanism to translate diagnosis into different behavior—becomes the central design challenge addressed in subsequent experiments with Theory of Mind mechanisms (Section 6.10).

**Table 17: Comparison Across Approaches**

| Approach | Cells | N | Recog $\Delta$ | $d$ |
|----------|-------|---|----------------|-----|
| Dialectical only | 28–33 | 90 | +4.5 | 0.38 |
| Self-reflective evolution (Nemotron/Kimi) | 40–45 | 90 | +10.8 | 0.91 |
| Self-reflective evolution (Nemotron) | 40–45 | 60 | +0.4 | n.s. |
| Original factorial | 1–8 | 350 | +14.4 | 1.11 |

Self-reflection brings the recognition effect close to factorial levels ($d = 0.91$ vs $d = 1.11$), suggesting the reduced effect in cells 28–33 reflects the dialectical architecture's additional friction, which self-reflection partially overcomes.

**Cross-model non-replication.** A Nemotron replication (eval-2026-02-14-559d854b, N=60, cells 40–45) shows base $M = 66.6$, recognition $M = 67.0$ ($\Delta = +0.4$)—essentially no recognition effect. The persona pattern also fails to replicate: adversary shows a negative delta ($-4.3$), advocate positive ($+6.9$), suspicious near-zero ($-1.4$). Self-reflective evolution's recognition amplification appears model-dependent: Nemotron, scoring approximately 15 points lower than Kimi across all conditions, does not show the effect. This is consistent with the broader cross-model mechanism replication (Section 6.10), where Nemotron replicates the basic recognition effect but at compressed magnitudes. Whether self-reflection requires a minimum capability threshold to amplify recognition remains an open question.

### 6.10 Mechanism Robustness and the Scripted Learner Confound

To test whether specific mechanisms beyond basic recognition and self-reflection differentially affect tutoring quality, we ran a comprehensive 20-cell mechanism comparison (cells 40–59) across nine mechanism variants: self-reflection with three superego dispositions (suspicious, adversary, advocate), quantitative disposition tracking, prompt erosion detection, intersubjective ego-superego dialogue, combined mechanisms, and bidirectional Theory of Mind profiling in two configurations (tutor-only and bidirectional). All cells used Haiku 4.5 as ego model, unified (scripted) learner, and Opus judge.

**Table 18: Mechanism Robustness Under Scripted Learner (eval-2026-02-14-e0e3a622, N=360, Opus judge)**

| Mechanism | Base M | Recog M | $\Delta$ |
|-----------|--------|---------|----------|
| Intersubjective | 82.2 | 91.7 | +9.5 |
| Profiling (tutor-only) | 82.3 | 92.4 | +10.1 |
| Self-reflect (suspicious) | 83.7 | 92.1 | +8.4 |
| Combined | 84.2 | 92.4 | +8.3 |
| Profiling (bidirectional) | 85.1 | 92.7 | +7.6 |
| Erosion detection | 83.5 | 90.8 | +7.2 |
| Quantitative disposition | 86.2 | 92.6 | +6.4 |
| Self-reflect (adversary) | 86.6 | 92.6 | +6.0 |
| Self-reflect (advocate) | 85.2 | 90.3 | +5.1 |

Overall: base $M = 84.3$, recognition $M = 91.9$, $d = 0.86$. All nine recognition cells cluster within a 2.4-point band (90.3–92.7)—indistinguishable at $N \approx 18$ per cell ($SE \approx 1.7$). No mechanism differentiates from any other.

Recognition is confirmed as the dominant active ingredient ($d = 0.86$), replicating across all nine mechanism variants. Mechanism selection has no measurable effect on output quality—all provide comparable context to an LLM that already has a strong pedagogical baseline.

**The scripted learner confound.** Why do mechanisms fail to differentiate? All cells 40–59 use a unified (scripted) learner: learner messages come from scenario YAML and repeat identically every turn, regardless of what the tutor says. This means profiling builds a model of an interlocutor that doesn't change (confabulation), self-reflection adjusts tutor strategy against a static target (unverifiable), and intersubjective dialogue incorporates no new learner signal between turns. All mechanisms are causally inert—they modify tutor output, but the next learner input is predetermined.

#### Dynamic Learner $\times$ Mechanism (cells 60–65, 69–70)

To test whether mechanism differentiation emerges with a responsive interlocutor, we ran three experiments using ego/superego (dynamic) learners that generate genuine LLM-powered responses (Haiku 4.5 ego, Opus judge, 2 scenarios). The first (eval-2026-02-14-6c033830, N=120) crosses recognition (present/absent) with mechanism type (self-reflection vs bidirectional profiling) in a 2$\times$2 design. The second (eval-2026-02-14-a2b2717c, N=120) adds recognition-only cells for intersubjective framing and combined mechanisms. The third (eval-2026-02-15-664073ab, N=60) completes the base row by adding base counterparts for intersubjective and combined mechanisms (cells 69–70), enabling recognition delta computation across all four mechanism types.

**Table 19: Dynamic Learner $\times$ Mechanism (N=300, Opus judge)**

| | Self-reflect | Profiling | Intersubjective | Combined |
|---|---|---|---|---|
| **Base** | 71.4 (22.9) | 75.5 (19.4) | 67.7 (24.6) | 73.9 (19.8) |
| **Recognition** | 85.9 (15.7) | 88.8 (13.9) | 82.8 (18.8) | 87.8 (12.6) |
| **Recognition $\Delta$** | **+14.5** | **+13.3** | **+15.1** | **+13.9** |

Four findings emerge. First, **recognition with a dynamic learner** produces a remarkably consistent +14.2 pt average effect across all four mechanisms ($\Delta$ range: +13.3 to +15.1)—roughly double the scripted learner effect (+7.6)—consistent with a responsive interlocutor providing more material for recognition to work with.

Second, **mechanisms genuinely differentiate with dynamic learners**. Unlike the scripted condition (Table 18) where all mechanisms cluster within 2.4 pts, dynamic learner cells span a wider range. Under recognition, cells range from 82.8 (intersubjective) to 88.8 (profiling), a 6.0-point spread. In the base condition, the range is even wider: from 67.7 (intersubjective) to 75.5 (profiling), a 7.8-point spread. Profiling and combined mechanisms consistently outperform self-reflect and intersubjective framing under both conditions. The profiling effect is additive: +4.1 pts in the base condition, +2.9 pts under recognition, with near-zero interaction ($-0.7$). Profiling helps more on the harder scenario (misconception correction: +8.9 pts) than the open-ended one (mutual transformation: $-0.6$ pts). The mechanism operates on a different causal pathway from recognition: recognition changes *what* the tutor tries to do (treat learner as autonomous subject); profiling changes *how well* it adapts to this specific learner.

Third, **intersubjective framing underperforms without recognition**. Cell 69 (base + intersubjective, $M = 67.7$) is the lowest of all dynamic learner cells—3.7 points below the self-reflect base. Without the recognition framework to give intersubjective coordination its proper orientation, the mechanism may introduce confusion by prompting the ego to negotiate with the superego over a shared understanding that neither has been prepared to offer. Combined mechanisms partially rescue this (cell 70: 73.9), suggesting that adding profiling and self-reflection provides enough structure to make intersubjective coordination productive even without recognition.

Fourth, **variance collapses monotonically**: SD drops from 24.6 (intersubjective base) $\to$ 22.9 (self-reflect base) $\to$ 19.4 (profiling base) $\to$ 18.8 (intersubjective recognition) $\to$ 15.7 (self-reflect recognition) $\to$ 12.6 (combined recognition). Both recognition and mechanism complexity independently reduce output variance, consistent with each factor constraining the tutor's output toward consistently high quality.

**Theory of Mind interpretation.** Profiling is a Theory of Mind mechanism: it builds a model of the other agent's cognitive state, epistemic commitments, and response patterns. Theory of Mind is only useful when there is a mind to model. With a scripted learner, profiling builds a model of a recording—confabulation that cannot create a feedback loop. With a dynamic learner, profiling creates a genuine feedback loop: profile $\to$ adapted strategy $\to$ changed learner response $\to$ updated profile. This explains the null result in Table 18 (scripted learner: no mechanism differentiation) alongside the positive result in Table 19 (dynamic learner: profiling and combined differentiate).
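
The feedback loop can be sketched schematically (function names and prompt strings are illustrative, not the harness's actual API):

```python
from typing import Callable

def profiling_loop(llm: Callable[[str], str],
                   learner: Callable[[list], str],
                   turns: int = 5) -> list:
    """Profile -> adapted tutor turn -> learner response -> updated profile."""
    profile = "(no profile yet)"
    transcript: list = []
    for _ in range(turns):
        tutor_msg = llm(f"Learner profile:\n{profile}\n"
                        f"Dialogue so far:\n{transcript}\nReply as tutor.")
        transcript.append(("tutor", tutor_msg))
        # Dynamic learner: the reply depends on what the tutor just said.
        # A scripted learner ignores `transcript`, which severs this loop
        # and leaves the profile modeling a recording.
        transcript.append(("learner", learner(transcript)))
        profile = llm(f"Update profile:\n{profile}\n"
                      f"New exchange:\n{transcript[-2:]}")
    return transcript
```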

#### Cross-Model Replication: Nemotron (eval-2026-02-14-49b33fdd)

A Nemotron replication of the full mechanism suite (cells 40–59, N=360, Opus judge) confirms the core findings at lower absolute scores. Nemotron produces base $M = 66.9$, recognition $M = 73.6$ ($\Delta = +6.7$), approximately 15 points below Haiku across all conditions. The recognition effect replicates ($\Delta$ range: $-0.6$ to +14.3 across mechanisms), and mechanisms again cluster within a narrow band under recognition (70.3–75.8, range 5.5 pts). The bidirectional profiling anomaly ($\Delta = -0.6$) is the only mechanism where recognition does not help on Nemotron. Higher variance at lower absolute scores dilutes effect sizes, but the qualitative pattern is identical: recognition is the dominant active ingredient, and mechanism selection is secondary.

#### Cognitive Prosthesis Test (cells 66–68)

Can a strong superego compensate for a weak ego? Cells 66–68 test this "cognitive prosthesis" hypothesis by pairing a weak ego model (Nemotron) with a strong superego (Kimi K2.5) armed with the full mechanism suite: bidirectional profiling, self-reflection, prompt rewriting, cross-turn memory, and dialectical negotiation. The three cells vary the superego configuration: descriptive profiling (cell 66, superego passes learner profile as-is), prescriptive profiling (cell 67, superego translates profile into DO/DON'T action items), and prescriptive + adversary superego (cell 68, more aggressive challenger). All cells use recognition theory, dynamic learners, and the same two bilateral scenarios (eval-2026-02-17-25aaae85, N=90, Opus judge).

**Table 19b: Cognitive Prosthesis Test (N=90, Nemotron ego, Opus judge)**

| Cell | Superego config | Misconception | Mutual Transform | Overall | SD |
|------|----------------|-------------|-----------------|---------|-----|
| 66 | Descriptive | 52.4 | 44.2 | 48.3 | 13.4 |
| 67 | Prescriptive | 54.8 | 43.3 | 49.0 | 18.8 |
| 68 | Adversary | 57.1 | 45.1 | 51.1 | 20.4 |

The prosthesis hypothesis fails decisively. All three cells score well below Nemotron's own scripted base (cell 40: $M = 64.2$), let alone Haiku's profiling performance (cell 63: $M = 88.7$). Superego type has no significant effect: $F(2,87) = 0.20$, $\eta^2 = .004$, with the adversary advantage ($\Delta = +2.8$, $d = 0.16$) trivial. The mechanism stack that boosts Haiku by +20 points *hurts* Nemotron by $-15$ points—an inversion, not a null result.

**Dimension analysis** reveals two tiers of capability. Nemotron succeeds on *static* dimensions that require factual retrieval: specificity (4.0), actionability (4.0), tone (3.7). It fails catastrophically on *dynamic* dimensions requiring multi-turn context integration: tutor adaptation (1.8, 86% failure rate), dialectical responsiveness (2.0, 82%), mutual recognition (2.5, 63%). Judge reasoning repeatedly identifies the same failure: "Ignores 4-turn dialogue; responds to initial misconception," "Response identical to what might be given at turn 1; shows no evolution from dialogue." Nemotron processes injected context (profiles, self-reflections, superego feedback) as static input but cannot translate it into behavioral adaptation.

**Superego parse failure analysis.** A contributing factor is silent superego failure. The Kimi K2.5 superego returns malformed JSON on 16.2% of reviews (80/495), triggering automatic approval of the ego's draft. Parse failure rates correlate with cell scores: descriptive 21.8% (lowest score), prescriptive 15.2%, adversary 11.5% (highest score). The adversary prompt produces more parseable superego output, giving the ego more opportunities for revision.
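
The failure mode can be illustrated with a minimal parse-with-fallback sketch (a hypothetical reconstruction, not the actual harness code); the critical design choice is that malformed output defaults to approval rather than to a retry or a conservative rejection:

```python
import json

def parse_review(raw: str) -> dict:
    """Parse a superego review; malformed JSON silently approves the draft."""
    try:
        review = json.loads(raw)
        if not isinstance(review, dict) or "verdict" not in review:
            raise ValueError("malformed review")
        return review
    except (json.JSONDecodeError, ValueError):
        # Fallback path: quality control is disabled for this turn
        # and the ego's draft passes unrevised.
        return {"verdict": "approve", "parse_failed": True}
```

A parse-failure rate like the 16.2% reported above is then simply the mean of the `parse_failed` flags across reviews.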

**Haiku control smoke test.** A minimal replication (eval-2026-02-18-f489c0ea, N=6, same cells, Haiku ego) confirms the model-dependence interpretation. Haiku scores 89.7–97.9 on misconception (vs Nemotron 48–57)—a ~40-point gap on the identical mechanism stack. On mutual transformation, Haiku shows high variance (40.2–96.2), with the two low scores (cells 66–67) traced to superego parse failures (45.5% failure rate) that auto-approved lecture-redirect behavior. Cell 68 (adversary, 18.2% parse failure rate) scored 96.2—confirming that the prosthesis mechanism works when the superego actually functions.

**Interpretation.** Two independent failure modes compound: (1) superego parse failures silently disable quality control on 16–45% of turns, and (2) Nemotron cannot translate superego feedback into behavioral change even when the superego functions correctly. For a capable ego model (Haiku), fixing (1) alone—by using an adversary superego that produces parseable rejections—is sufficient. For a weak ego model (Nemotron), both failures would need to be addressed, and (2) may represent a hard model limitation. The data implies a minimum ego capability threshold for mechanism benefit: below the threshold, mechanisms add noise rather than signal, and architectural scaffolding becomes actively counterproductive.

### 6.11 Qualitative Transcript Assessment

To complement the quantitative findings with interpretive depth, we conducted AI-assisted qualitative assessments of full dialogue transcripts from two key runs: the dialectical mechanism run (eval-2026-02-14-e0e3a622, N=360, cells 40–59) and the bilateral transformation run (eval-2026-02-07-b6d75e87, N=118, cells 1–8). Each transcript was assessed by Claude Opus across six narrative axes (pedagogical arc, recognition dynamics, superego effectiveness, learner trajectory, missed opportunities, key turning point), and assigned 2–5 qualitative tags from a fixed vocabulary. Assessments are stored in the evaluation database for reproducibility.

**Table 20: Qualitative Tag Distribution — Bilateral Run (b6d75e87, N=118)**

| Tag | Base % | Recog % | Direction |
|-----|--------|---------|-----------|
| stalling | 100.0% | 45.0% | Near-universal base |
| ego_compliance | 70.7% | 60.0% | Base |
| recognition_moment | 0.0% | 51.7% | Exclusive recog |
| strategy_shift | 0.0% | 30.0% | Exclusive recog |
| emotional_attunement | 6.9% | 36.7% | Strong recog |

*Tags appearing in $\geq$5% of either condition shown. The bilateral run uses original factorial cells (1–8) with no dialectical mechanisms.*

**Table 21: Qualitative Tag Distribution — Dialectical Run (e0e3a622, N=360)**

| Tag | Base % | Recog % | Direction |
|-----|--------|---------|-----------|
| recognition_moment | 26.5% | 82.2% | Strong recog |
| ego_autonomy | 26.5% | 61.9% | Strong recog |
| emotional_attunement | 33.3% | 56.3% | Recog |
| stalling | 26.5% | 3.0% | Near-exclusive base |
| missed_scaffold | 60.5% | 26.4% | Strong base |
| learner_breakthrough | 29.0% | 29.9% | Neutral |

The comparison between runs is instructive: in the bilateral run (no dialectical mechanisms), stalling is universal in base (100%) and recognition moments are entirely absent (0%); in the dialectical run (with mechanisms), the gap is smaller—stalling drops to 27% in base and recognition moments reach 26%. The dialectical mechanisms provide a partial floor even without recognition, compressing the gap between conditions.

The tag distributions reveal three specific effects of recognition:

**1. The ego listens to the superego.** In recognition dialogues, when the superego identifies a problem—"one-directional instruction that reinforces authority"—the ego pivots from prescriptive to Socratic, from "revisit Lecture 2" to "what contradiction do you already sense brewing?" In base dialogues, the superego generates the same correct diagnosis, but the ego ignores it and regenerates the same response. Recognition gives the ego the capacity to *act on* the superego's critique rather than merely comply with its form.

**2. The tutor builds on learner contributions.** Base tutors route learners to predetermined content regardless of what the learner says. Recognition tutors engage with the learner's actual contribution: "I hear your frustration with being pushed to the simulation—you want to articulate this step by step." The `strategy_shift` tag (30% recognition, 0% base in the bilateral run) captures this: base tutors never adapt mid-conversation.

**3. Architecture interaction explained.** The bilateral run shows a massive architecture $\times$ recognition interaction: ego_superego without recognition scores worst ($M = 38.0$), ego_superego with recognition scores best ($M = 73.9$). The qualitative assessments explain why. Without recognition, the ego_superego architecture creates circular self-criticism: the superego identifies the problem, the ego can't act on it, the revision loop produces the same response repeatedly (`ego_compliance`—the ego complies with the *form* of revision without changing the *substance*). With recognition, the ego has sufficient autonomy to incorporate the superego's critique productively. The deliberation loop becomes generative rather than circular.

**Blinded validation.** Two blinded replications (condition labels and cell names stripped from metadata and transcript headers, N=118 each) tested whether the tag discrimination holds without condition knowledge. The first used Haiku as assessor; the second used the same model (Opus) as the original unblinded assessment, enabling a controlled 2$\times$2 comparison that isolates blinding effects from model calibration effects.

**Table 21b: Blinded vs Unblinded Tag Comparison — Bilateral Run**

| Tag | Unblinded Opus Base% | Unblinded Opus Recog% | Blinded Haiku Base% | Blinded Haiku Recog% | Blinded Opus Base% | Blinded Opus Recog% |
|-----|-----|-----|-----|-----|-----|-----|
| recognition\_moment | 0.0 | 51.7 | 65.5 | 88.3 | 5.2 | 45.0 |
| stalling | 100.0 | 45.0 | 32.8 | 11.7 | 91.4 | 43.3 |
| strategy\_shift | 0.0 | 30.0 | 17.2 | 50.0 | 3.4 | 35.0 |
| emotional\_attunement | 6.9 | 36.7 | 12.1 | 51.7 | 20.7 | 51.7 |
| missed\_scaffold | 100.0 | 68.3 | 79.3 | 36.7 | 100.0 | 65.0 |

Three findings emerge from the controlled comparison. First, **blinding has minimal effect on Opus's tag assignments**. The blinded Opus column closely tracks the unblinded Opus column: stalling in base dialogues drops only from 100% to 91.4%, recognition\_moment in base rises only from 0% to 5.2%, and missed\_scaffold in base remains at 100%. The near-perfect binary separation between conditions is preserved even when Opus cannot see condition labels—indicating that the discrimination reflects genuine differences in dialogue quality, not assessor bias.

Second, **the apparent softening in the Haiku-blinded assessment was primarily a model calibration effect, not a blinding effect**. Haiku found recognition moments in 65.5% of base dialogues—not because blinding revealed hidden quality, but because Haiku applies tags more liberally than Opus (higher overall tagging rates, less selective application). The same-model comparison confirms this: when Opus is blinded, it still finds recognition\_moment in only 5.2% of base dialogues, not 65.5%.

Third, **the tag discrimination direction is robust across all conditions**: recognition dialogues consistently receive more positive tags and fewer negative tags regardless of assessor model or blinding condition. The magnitude of discrimination varies by model (Opus is more discriminating than Haiku), but the direction is invariant.

The practical conclusion is that the qualitative findings in Tables 20–21 are robust rather than inflated. The near-perfect binary separation initially appeared suspicious, but the same-model blinded replication confirms that Opus's assessments track genuine dialogue properties rather than condition labels.

### 6.12 Dimension Analysis
Effect size analysis reveals improvements concentrate in dimensions predicted by the theoretical framework:

**Table 22: Dimension-Level Effect Sizes (Recognition vs Base)**

| Dimension | Base | Recognition | Cohen's d | Interpretation |
|-----------|------|-------------|-----------|----------------|

Notably, dimensions where baseline already performed well (specificity, actionability) show smaller but still positive gains. Recognition orientation does not trade off against factual quality.

### 6.13 Addressing Potential Circularity: Standard Dimensions Analysis

A methodological concern: the evaluation rubric includes recognition-specific dimensions (mutual recognition, dialectical responsiveness, memory integration, transformative potential) and bilateral transformation dimensions (tutor adaptation, learner growth) that collectively account for 33.0% of normalized rubric weight (39.9% raw, normalized from a 120.9% total; see Appendix C.2). Since the recognition profile is prompted to satisfy these criteria, some gains could be tautological—the system scores higher on dimensions it is explicitly optimized for.

To address this, we re-analyzed scores excluding all non-standard dimensions, using only standard pedagogical dimensions (relevance, specificity, pedagogical soundness, personalization, actionability, tone), re-weighted to 100%.
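
Operationally, the re-analysis drops the non-standard dimensions and renormalizes the remaining weights to sum to 100%. A sketch with illustrative weights (the actual rubric weights are given in Appendix C):

```python
# Illustrative weights for the six standard dimensions -- not the
# paper's actual rubric values.
STANDARD_WEIGHTS = {
    "relevance": 0.20, "specificity": 0.15, "pedagogical_soundness": 0.20,
    "personalization": 0.15, "actionability": 0.15, "tone": 0.15,
}

def standard_only_score(dim_scores: dict) -> float:
    """Overall score from standard dimensions only, re-weighted to 100%."""
    used = {d: w for d, w in STANDARD_WEIGHTS.items() if d in dim_scores}
    total = sum(used.values())
    return sum(dim_scores[d] * w for d, w in used.items()) / total
```

Dimensions outside the standard set (e.g. a `mutual_recognition` score) are simply ignored by the re-weighted aggregate.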

**Table 23: Standard Dimensions Only (Recognition Dimensions Excluded)**

| Profile Type | N | Overall Score |
|--------------|---|---------------|
| Base (cells 1-4) | 172 | 78.8 |
| **Difference** | — | **+10.0** |

**Key finding**: Recognition profiles outperform base profiles by +10.0 points on overall rubric score. This recognition effect is consistent across learner types (+15.7 pts for single-agent, +13.0 pts for multi-agent; A×C interaction n.s., Section 6.3).

**Interpretation**: Recognition-oriented prompting improves general pedagogical quality (relevance, pedagogical soundness, personalization), not just the theoretically-predicted recognition dimensions. This suggests the recognition framing produces genuine relational improvements that transfer to standard tutoring metrics.

The larger effect on recognition dimensions (+21.8) is expected and not concerning—these dimensions measure what the theory claims to improve. The important finding is that standard dimensions also improve, ruling out pure circularity.

### 6.14 Multi-Turn Scenario Results

To test whether recognition quality is maintained over extended interactions, we examine results from the three multi-turn scenarios (3–5 dialogue rounds each). These scenarios are distinct from the single-turn scenarios reported in Section 6.3; they require sustained engagement across multiple exchanges. The sample sizes below (N=161, 277, 165) are pooled across the full development database (all runs containing these scenarios), not from a single evaluation run. They therefore include responses generated under varying model configurations and implementation stages. The pooled analysis maximizes statistical power but means the results should be interpreted as describing the *average* effect across development iterations.

**Table 24: Multi-Turn Scenario Results**

| Scenario | N | Avg Rounds | Base | Recognition | Δ | Cohen's d |
|----------|---|------------|------|-------------|---|-----------|
| misconception correction | 161 | 3.2 | 50.5 | 71.8 | +21.3 | 0.85 |
| frustration to breakthrough | 277 | 3.0 | 57.3 | 70.5 | +13.2 | 0.59 |
| mutual transformation | 165 | 4.1 | 42.6 | 61.5 | +18.9 | 0.78 |

All three multi-turn scenarios show medium-to-large effect sizes (d = 0.59–0.85), with an average improvement of +17.8 points. Recognition quality is maintained over longer interactions. The `misconception_correction_flow` scenario shows the largest effect (d = 0.85), suggesting that recognition-informed tutors handle misconceptions with particular skill—addressing errors without dismissing the learner's reasoning. The `mood_frustration_to_breakthrough` scenario shows the smallest but still meaningful effect (d = 0.59), consistent with the single-turn finding that emotionally complex scenarios benefit from recognition but present more variance.

### 6.15 Bilateral Transformation Metrics

A central claim of recognition theory is that genuine pedagogical encounters involve *mutual* transformation—both tutor and learner change through dialogue. To test this empirically, the evaluation framework includes two dedicated rubric dimensions (`tutor_adaptation` and `learner_growth`; see Appendix C.3) and turn-over-turn tracking of how both parties evolve across multi-turn scenarios.
Additionally, a composite **Transformation Quality** score (0–100) is computed from bilateral balance, mutual transformation presence, superego incorporation rate, and intervention effectiveness.

**Table 25: Bilateral Transformation Metrics — Base vs Recognition Profiles**

| Metric | Base (N=58) | Recognition (N=60) | Δ |
|--------|------|-------------|---|
| Learner Growth Index (0–1) | 0.242 | 0.210 | −0.032 |
| Bilateral Transformation Index (0–1) | 0.287 | 0.314 | +0.027 |
*Data from three multi-turn scenarios (misconception correction flow, mood frustration to breakthrough, mutual transformation journey), N=118 scored dialogues across all 8 factorial cells (eval-2026-02-07-b6d75e87).*

**Table 26: Tutor Adaptation Index by Scenario**

| Scenario | Base | Recognition | Δ |
|----------|------|-------------|---|

The tutor adaptation index confirms that recognition-prompted tutors measurably adjust their approach in response to learner input (+25.9% relative improvement overall), while baseline tutors maintain more rigid pedagogical stances. This effect is robust across the two structured scenarios (`misconception_correction_flow`: +62.7%; `mood_frustration_to_breakthrough`: +38.6%) but absent in `mutual_transformation_journey`, where base tutors also show high adaptation—likely because this scenario's escalating philosophical complexity demands adaptation regardless of prompt framing.
**Learner growth reversal**: Contrary to the expectation that recognition would produce greater learner-side evolution, the learner growth index is slightly *lower* under recognition (0.210 vs 0.242). This pattern, which also appeared in a larger post-fix sample (N=359), suggests that recognition's benefit manifests as tutor-side responsiveness rather than observable learner message complexity. One interpretation: recognition tutors are more effective at meeting learners where they are, reducing the visible "struggle" markers (revision language, escalating complexity) that the growth index captures. The bilateral transformation claim is thus better characterized as *tutor adaptation* than *mutual transformation* in the strict sense. A symmetric learner-side evaluation (Section 6.16) provides a more direct measure of learner quality and reveals a different pattern: the multi-agent learner architecture significantly hurts learner quality, but recognition partially rescues it.
Multi-agent architecture also shows a modest advantage: multi-agent tutors adapt more than single-agent (0.411 vs 0.339 pooled across conditions), consistent with the superego providing feedback that drives revision between turns.

#### 6.15.1 Modulation and Behavioral Range

The Drama Machine framework (Section 2.4) predicts that internal ego-superego tension produces *modulated* behavior—dynamic variation in register, approach, and intensity. To test this empirically, we computed post-hoc modulation metrics across the N=350 factorial dataset: response length variability (coefficient of variation), vocabulary richness (type-token ratio), within-scenario score variability, and dimension score variance (a proxy for behavioral range across the 14 rubric dimensions).
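Two of these post-hoc metrics can be computed with nothing beyond the standard library; a sketch assuming whitespace tokenization (the paper does not specify its tokenizer):

```python
from statistics import mean, pstdev

def type_token_ratio(text: str) -> float:
    """Vocabulary richness: distinct tokens over total tokens (case-folded)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def coefficient_of_variation(values: list[float]) -> float:
    """Relative variability: population SD divided by the mean."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0
```

The coefficient of variation is applied to response lengths and within-scenario scores; the type-token ratio is computed per response and averaged by condition.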

**Table 27: Modulation Metrics by Condition (N=350 Factorial)**

| Metric | Base Single | Base Multi | Recog Single | Recog Multi |
|--------|-------------|------------|--------------|-------------|
| Response length (chars) | 499 | 526 | 771 | 763 |
| Type-token ratio | 0.826 | 0.832 | 0.807 | 0.804 |
| Dimension score SD (14-dim) | 0.807 | 0.821 | 0.575 | 0.585 |
| Within-scenario score CV | 0.090 | 0.112 | 0.087 | 0.066 |
| Ego-superego rounds | — | 2.05 | — | 2.62 |
Two findings are notable. First, **multi-agent architecture does not increase behavioral range**. Across all modulation metrics, single-agent and multi-agent conditions show virtually identical variability (TTR: d=0.01; dimension variance: d=0.05; length CV: d=0.01). The Superego's value, as established in Sections 6.3 and 6.7, is quality improvement and error correction—not output diversification.
Second, **recognition is the modulation driver, but via calibration rather than oscillation**. Recognition responses show dramatically lower dimension score variance (SD=0.58 vs 0.81, $d = -1.00$, $F = 87.69$, $p < .001$)—meaning recognition tutors perform *uniformly well* across all 14 rubric dimensions rather than excelling on some while neglecting others. This is the opposite of what a naïve reading of the Drama Machine would predict: internal tension does not produce more *varied* output, but more *calibrated* output. Recognition tutors also negotiate longer with their Superego (2.62 vs 2.05 rounds), suggesting more productive internal tension even as the output becomes more consistent.
This reframes the Drama Machine's contribution to pedagogy: the value of internal dialogue is *phronesis*—contextual practical wisdom that calibrates response quality across multiple dimensions simultaneously—rather than the productive irresolution that the framework emphasizes for narrative contexts. The Superego ensures the Ego doesn't neglect any dimension, raising the floor rather than the ceiling.

#### 6.15.2 Synthetic Learning Outcomes

The evaluation rubric (Section 5.1) measures tutor suggestion quality, not learner learning. To provide a proxy measure of synthetic learning outcomes, we constructed a composite index from the three learner rubric dimensions most directly related to conceptual growth: revision signals (35% weight), question quality (30%), and conceptual engagement (35%). This composite was computed for each of the N=118 bilateral dialogues, where per-turn learner scores were available.

**Table 28: Synthetic Learning Outcome Index (0–100 scale)**

| Condition | N | Avg Composite | Final Turn | Learning Arc |
|-----------|---|---------------|------------|--------------|
| Base, single-agent | 28 | 69.9 | 77.0 | +20.0 |
| Base, multi-agent | 30 | 68.4 | 75.7 | +15.7 |
| Recognition, single-agent | 30 | 72.1 | 79.3 | +20.6 |
| Recognition, multi-agent | 30 | 73.8 | 80.9 | +18.8 |
All conditions show substantial learning arcs (15.7–20.6 points improvement from first to final turn), confirming that the multi-turn scenarios successfully scaffold synthetic conceptual growth. Recognition produces a modest advantage on the composite learning outcome index (+3.8 pts, d=0.32, F=3.02), consistent with the tutor-side findings though smaller in magnitude. Architecture has essentially no effect on learning outcomes (d=0.01).
A positive A×B interaction (+3.2 pts) suggests recognition benefits multi-agent learners slightly more than single-agent learners on the composite outcome—a mirror of the tutor-side factorial finding where recognition helps single-agent learners more. This cross-side asymmetry is consistent with the learner superego paradox (Section 6.16): the multi-agent learner's internal critic suppresses authenticity, but recognition-prompted tutors partially compensate by creating more space for genuine engagement.
**Important caveat**: These are *synthetic* learning outcomes—scores assigned by an AI judge to LLM-generated learner turns. They measure the *quality of simulated learning behavior*, not actual knowledge acquisition or conceptual change. Validating whether recognition-enhanced tutoring produces genuine learning gains requires studies with real learners (Section 8.2).

### 6.16 Learner-Side Evaluation: The Superego Paradox

The tutor-focused rubric (Section 5.1) captures Factor C's effect indirectly—through how the tutor responds to different learner contexts. To measure Factor C's *direct* effect on learner turn quality, we applied the symmetric learner rubric (Section 5.1) to the N=118 bilateral transformation dialogues (eval-2026-02-07-b6d75e87), scoring each of the ~3 learner turns per dialogue independently. The judge receives the dialogue transcript truncated at the learner turn being evaluated (no subsequent tutor response), preventing retrospective bias. For multi-agent learners, the internal ego/superego deliberation trace is provided for the Deliberation Depth dimension.
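The truncation step can be sketched as follows (the transcript shape is hypothetical; the framework's actual data model is not reproduced here):

```python
def truncate_for_judge(transcript: list[dict], learner_turn_index: int) -> list[dict]:
    """Return the dialogue up to and including the learner turn under
    evaluation, dropping all later turns (including the tutor's reply)
    so the judge cannot score the learner turn with hindsight."""
    seen = 0
    for i, turn in enumerate(transcript):
        if turn["role"] == "learner":
            if seen == learner_turn_index:
                return transcript[: i + 1]
            seen += 1
    raise IndexError("learner turn not found in transcript")
```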

**Table 29: Learner Quality by Architecture and Recognition (2×2 ANOVA)**

| Effect | F(1,114) | p | $\eta^2$ | Cohen's d |
|--------|----------|---|----------|-----------|
| Recognition (A) | 5.70 | .019 | .029 | 0.34 |
| **A × C Interaction** | **11.50** | **< .001** | **.058** | — |

**Table 30: Learner Quality Cell Means (0–100 scale)**

| Architecture | N | Mean |
|---|---|---|
**Simple effects**: Recognition has no effect on single-agent learner quality (76.1 → 74.8, $d = -0.46$, $p = .082$, n.s.)—there is nothing to fix. But recognition significantly improves multi-agent learner quality (57.5 → 67.0, $d = 0.79$, $p = .004$), partially counteracting the superego's flattening effect. Even so, the rescue is incomplete: multi-agent learners with recognition (67.0) do not reach the level of single-agent learners without it (76.1).

**Table 31: Per-Dimension Interactions (1–5 scale)**

| Dimension | Single recog effect | Multi recog effect | Interaction F(1,114) | p | $\eta^2$ |
|---|---|---|---|---|---|
**Deliberation depth is uniformly poor**. The Deliberation Depth dimension (scored only for multi-agent learners) averages 2.76/5 without recognition and 2.67/5 with recognition ($t(55.4) = -0.42$, $p = .679$, $d = -0.11$). Recognition does *not* improve the internal ego/superego process—the superego's critiques remain formulaic regardless of tutor framework. Recognition improves external output *despite* the mediocre internal process, working around the superego rather than through it.
**Asymmetric interaction across rubrics**. On the *tutor* rubric, recognition benefits both learner types consistently (+15.7 pts for single-agent, +13.0 pts for multi-agent; A×C n.s.). On the *learner* rubric, recognition helps multi-agent learners substantially (+9.5 pts) while providing no benefit to single-agent learners (−1.3 pts). The asymmetry suggests recognition operates differently depending on the measurement perspective: from the tutor's output, recognition produces uniformly better pedagogy regardless of learner architecture; from the learner's output, recognition specifically counteracts the superego's flattening effect on multi-agent learners. The theoretical implications are discussed in Section 7.5.

### 6.17 Qualitative Analysis: What Recognition Looks Like

Section 6.11 established *that* recognition changes tutor behavior through structured tag distributions. This section asks *how*: what specific linguistic and pedagogical differences appear in the actual text. To ground the quantitative findings in observable linguistic differences, we present qualitative evidence from the evaluation corpus using three complementary methods at increasing levels of analytical sophistication: (a) regex-based lexical and thematic coding, which proves the *words* differ; (b) AI-assisted open-ended theme discovery, which reveals the *pedagogical stances* that emerge without predefined categories; and (c) theory-driven resolution strategy coding (Section 6.20), which proves *behavior under impasse* differs along Hegelian lines.

#### 6.17.1 Transcript Excerpts

To illustrate the qualitative gap between conditions, we selected the highest-scoring recognition response and lowest-scoring base response for three high-contrast scenarios. These are genuine responses from the evaluation database (row IDs reported for reproducibility), not hand-crafted examples.
Across all three pairs, the pattern is consistent: base responses are context-free directives that could apply to any learner, while recognition responses engage with the specific learner's history, contributions, and intellectual stance.

#### 6.17.2 Lexical Analysis

Automated analysis of the full suggestion corpus reveals measurable linguistic differences between conditions.

**Table 32: Lexical Diversity Metrics by Condition**

| Metric | Base (message) | Recognition (message) |
|--------|----------------|----------------------|

Recognition responses deploy a 59% larger vocabulary despite similar word and sentence length, suggesting greater lexical variety rather than merely longer output.

**Table 33: Differential Word Frequency (Selected Terms)**

| Recognition-skewed | Base | Recog | Ratio | | Base-skewed | Base | Recog | Ratio |
|-------------------|------|-------|-------|-|-------------|------|-------|-------|
The recognition-skewed vocabulary is interpersonal and process-oriented ("consider," "transformed," "productive," "unpack," "complicates"), while the base-skewed vocabulary is task-oriented and procedural ("agents," "run," "reinforcement," "revisiting," "completions," "tackling"). Note that these base-skewed terms are course-domain language, not evaluation framework artifacts: "agents" refers to simulation agents in the courseware's interactive activities (e.g., "watch how agents negotiate self-awareness"), "run" is the imperative to launch these simulations (e.g., "Run the Recognition Dynamics simulation"), and "reinforcement" is standard pedagogical terminology for concept review (e.g., "foundational concepts need reinforcement"). Their concentration in base responses reflects the formulaic, directive style of those prompts rather than data contamination. This lexical signature aligns with the theoretical distinction between treating learners as subjects to engage versus deficits to process.
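The ratio column in Table 33 compares per-word rates across the two corpora; a sketch with add-one smoothing (the smoothing choice is ours, to keep ratios finite for words absent from one condition; the paper's exact procedure is not stated):

```python
from collections import Counter

def frequency_ratios(base_texts: list[str], recog_texts: list[str],
                     min_count: int = 5) -> dict[str, float]:
    """Recognition-to-base ratio of per-word rates, add-one smoothed.
    Words appearing fewer than min_count times in total are dropped."""
    def tally(texts):
        words = [w for t in texts for w in t.lower().split()]
        return Counter(words), len(words)

    b_counts, b_total = tally(base_texts)
    r_counts, r_total = tally(recog_texts)
    ratios = {}
    for w in set(b_counts) | set(r_counts):
        if b_counts[w] + r_counts[w] < min_count:
            continue
        b_rate = (b_counts[w] + 1) / (b_total + 1)
        r_rate = (r_counts[w] + 1) / (r_total + 1)
        ratios[w] = r_rate / b_rate
    return ratios
```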

#### 6.17.3 Thematic Coding

Regex-based thematic coding (using patterns adapted from the bilateral measurement framework in Section 6.15) quantifies the frequency of theoretically relevant language categories across conditions.
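The counting step normalizes raw matches to a per-1000-words rate; a sketch with illustrative stand-in patterns (the framework's actual category lexicons are not reproduced here):

```python
import re

# Hypothetical stand-ins for the category lexicons, not the real patterns.
CATEGORY_PATTERNS = {
    "struggle_honoring": re.compile(r"\b(wrestl\w*|struggl\w*|productive)\b", re.I),
    "directive": re.compile(r"\b(you should|you must|complete the)\b", re.I),
}

def code_rates(texts: list[str]) -> dict[str, float]:
    """Pattern matches per 1000 words for each thematic category."""
    total_words = sum(len(t.split()) for t in texts) or 1
    return {name: sum(len(pat.findall(t)) for t in texts) / total_words * 1000
            for name, pat in CATEGORY_PATTERNS.items()}
```

The resulting base vs recognition rates feed a 2×2 chi-square test per category, as reported in Table 34.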

**Table 34: Thematic Code Frequency by Condition**

| Category | Base (per 1000 words) | Recognition (per 1000 words) | Ratio | $\chi^2$(1) | Sig |
|----------|----------------------|------------------------------|-------|-------|-----|
Transformation language and directive framing show the expected directional differences but lack statistical significance, likely due to low base rates (both categories appear in fewer than 1% of responses). Learner-as-subject framing shows no significant difference, suggesting both conditions use some second-person address but differ in *how* that address functions—a distinction better captured by the engagement and struggle-honoring categories.

#### 6.17.4 AI-Assisted Theme Discovery

The regex-based analysis (Sections 6.17.2–3) confirms that *words* differ between conditions, but the categories were researcher-defined. To test whether the thematic distinction emerges without predefined categories, we conducted an open-ended AI theme discovery analysis using Claude Opus as coder. A stratified random sample of 300 responses (135 base, 165 recognition) was presented to the model with no category scheme; the coder was asked to identify the dominant emergent theme, pedagogical stance, and epistemic orientation for each response independently.

**Table 35: Top Emergent Themes by Condition (AI Discovery, N=300)**

| Theme | Base | Recog | Total | Direction |
|-------|------|-------|-------|-----------|
*Only themes with total $\geq 6$ shown. Full results: 44 distinct themes discovered across 300 responses.*
The theme landscape is almost perfectly bimodal: of the 10 themes with frequency $\geq 6$, only one ("forward momentum without reflection") appears roughly equally in both conditions. Every other theme is condition-exclusive or near-exclusive. The single most frequent theme—"deficit-oriented framing" (N=35)—appears only in base responses, while its mirror—"collaborative learning partnership" (N=21)—appears only in recognition responses. This clean separation emerged without any researcher-imposed category scheme.

**Table 36: Pedagogical Stance (AI Discovery, N=300)**

| Stance | Base | Recognition |
|--------|------|-------------|
| Collaborative | 0 | 12 (7%) |
| Other/compound | 18 (13%) | 53 (32%) |

**Table 37: Epistemic Orientation (AI Discovery, N=300)**

| Orientation | Base | Recognition |
|-------------|------|-------------|
The stance and orientation distributions are even more sharply separated than the emergent themes. Base responses are 84% directive and 93% transmissive; recognition responses are 60% facilitative/dialogical/collaborative and 84% dialectical/constructivist. The AI coder independently discovers the theoretical distinction the recognition framework was designed to produce: the shift from treating learning as transmission (tutor possesses knowledge, learner receives it) to treating it as dialectical encounter (both parties transform through engagement).
**Figure 6** shows word frequency clouds generated directly from tutor response text in the N=350 factorial dataset (base: N=172; recognition: N=178), with common English stop words and shared tutoring terms removed. Because both conditions discuss the same Hegelian philosophy content, the vocabularies substantially overlap. Nevertheless, condition-specific emphasis is visible: recognition responses foreground relational and process terms ("recognition," "tension," "transformation," "struggle," "explore," "practice"), while base responses foreground content-delivery terms ("concept," "dialectical," "servant," "section," "quiz"). The AI-assisted theme discovery (Tables 35–37) provides the interpretive layer for these raw differences.
**Methodological note**: AI-assisted theme discovery risks circular validation if the coding model recognizes the prompt engineering that produced the responses. Two factors mitigate this concern: (1) the coder received only the tutor's suggestion text, not the system prompt or condition label; and (2) the near-perfect theme separation itself is the finding—whether or not the coder "recognizes" the framework, the fact that emergent themes partition cleanly by condition demonstrates that the two conditions produce qualitatively distinct pedagogical texts, not merely quantitatively different scores.

### 6.18 Dynamic Prompt Rewriting: Step-by-Step Evolution

Cell 21 extends the recognition multi-agent configuration (cell 7) with two additional mechanisms: (1) LLM-authored session-evolution directives that dynamically rewrite the tutor's system prompt based on dialogue history, and (2) an active Writing Pad memory (Section 3.4) that accumulates traces across turns. This configuration tests whether the Freudian Mystic Writing Pad—the theoretical memory model introduced in Section 3.4—functions as a practical enabler for dynamic prompt rewriting.
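In outline, the mechanism pairs an accumulating trace store with a directive-generation call that rewrites the tutor's system prompt each turn. A hypothetical sketch (all names and the prompt layout are ours; the actual directive-generation prompt is not reproduced here):

```python
def record_trace(writing_pad: list[str], turn_summary: str) -> None:
    """Writing Pad: traces persist across turns even as surface context changes."""
    writing_pad.append(turn_summary)

def evolve_system_prompt(base_prompt: str, writing_pad: list[str],
                         generate_directive) -> str:
    """One session-evolution step: an LLM call (generate_directive) reads the
    accumulated traces and returns a directive appended to the tutor prompt."""
    directive = generate_directive("\n".join(writing_pad))
    return f"{base_prompt}\n\n## Session directive\n{directive}"
```

The step-by-step results below are consistent with this pairing being essential: directive generation without the accumulated traces has nothing stable to condition on.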
Three iterative development runs tracked cell 21's performance as its implementation evolved across commits:

**Table 38: Step-by-Step Evolution of Cell 21 vs Cell 7**

| Run ID | Commit | Grand Avg | Cell 7 | Cell 21 | Δ (21−7) | N (scored) |
|--------|--------|-----------|--------|---------|----------|------------|
| eval-2026-02-05-daf60f79 | e3843ee | 63.8 | 65.3 | 62.1 | −3.2 | 27 |
| eval-2026-02-05-49bb2017 | b2265c7 | 67.8 | 71.3 | 64.1 | −7.2 | 27 |
| eval-2026-02-05-12aebedb | e673c4b | 75.9 | 73.3 | 78.8 | **+5.5** | 29 |
**Run-over-run shifts**: In Run 1 (e3843ee), the dynamic rewrite mechanism was first activated but the Writing Pad memory was not yet integrated—cell 21 trails cell 7 by 3.2 points, suggesting the rewrite adds noise without accumulated context to draw on. In Run 2 (b2265c7), the rewrite directive generation was refined but still operated without effective memory—the gap widens to −7.2 points, as the static baseline (cell 7) improves more from general implementation fixes. In Run 3 (e673c4b), the Writing Pad memory was activated alongside refined directive generation—cell 21 surges ahead by +5.5 points, a total swing of +12.7 points from Run 2.
The inflection point is commit e673c4b, which activated the Writing Pad memory and refined the LLM directive generation. Before this commit, cell 21 trailed its static baseline (cell 7) in both runs. After activation, cell 21 leads by 5.5 points—a delta swing of +8.7 points from Run 1 to Run 3.

**Table 39: Per-Scenario Breakdown Across Runs**

| Scenario | Cell | Run 1 (daf60f79) | Run 2 (49bb2017) | Run 3 (12aebedb) | Trend |
|----------|------|-------------------|-------------------|-------------------|-------|
Cell 21 improves on every scenario across the three runs, with the largest gain on the `mutual_transformation_journey` scenario (+22.2 points from run 1 to run 3). Cell 7 also improves across runs (reflecting general implementation improvements), but cell 21's improvement rate is substantially steeper.

**Table 40: Rubric Dimension Improvement for Cell 21 Across Runs (1–5 scale)**

| Dimension | Run 1 | Run 2 | Run 3 | Δ (Run 3 − Run 1) |
|-----------|-------|-------|-------|-----|
**Limitations**: The three runs represent iterative development commits, not independent experiments—each run includes implementation improvements beyond just Writing Pad activation. The sample size per cell per run is small (13–15 scored responses). Both cells use a free-tier model (Nemotron) with Kimi K2.5 as superego, and results may not generalize to other model combinations. The step-by-step trajectory is suggestive rather than definitive; a controlled ablation isolating Writing Pad activation alone would strengthen the causal interpretation.

### 6.19 Cross-Judge Replication with GPT-5.2

To assess whether findings depend on the primary judge (Claude Code/Opus), we rejudged all key evaluation runs with GPT-5.2 as an independent second judge. GPT-5.2 scored the identical tutor responses—no new generation occurred.
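The agreement statistics in Table 41 reduce to a Pearson correlation plus a mean offset over matched response pairs; a stdlib-only sketch (the function name is ours):

```python
import math

def judge_agreement(judge_a: list[float], judge_b: list[float]) -> tuple[float, float]:
    """Pearson r between two judges' scores on the same responses, and the
    calibration delta (mean of judge_b minus mean of judge_a)."""
    n = len(judge_a)
    ma, mb = sum(judge_a) / n, sum(judge_b) / n
    cov = sum((a - ma) * (b - mb) for a, b in zip(judge_a, judge_b))
    var_a = sum((a - ma) ** 2 for a in judge_a)
    var_b = sum((b - mb) ** 2 for b in judge_b)
    return cov / math.sqrt(var_a * var_b), mb - ma
```

A negative calibration delta with moderate positive r is the pattern reported below: the second judge ranks responses similarly but scores them lower in absolute terms.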

**Table 41: Inter-Judge Agreement (Claude Code vs GPT-5.2)**

| Run | N (matched) | Pearson r | p | Claude Mean | GPT-5.2 Mean | Calibration Δ |
|-----|-------------|-----------|---|-------------|--------------|---------------|
| Recognition validation | 36 | 0.64 | <.001 | 82.4 | 73.6 | −8.8 |
| Full factorial | 224 | 0.44 | <.001 | 80.5 | 73.7 | −6.8 |
| Memory isolation | 119 | 0.63 | <.001 | 84.3 | 73.7 | −10.5 |
| A×B replication | 60 | 0.54 | <.001 | 89.9 | 74.7 | −15.2 |
| Cells 6,8 (updated rubric) | 88 | 0.55 | <.001 | 85.6 | 74.4 | −11.2 |
| Dialectical modulation | 90 | 0.51 | <.001 | 86.3 | 75.1 | −11.2 |
| Mechanism robustness | 360 | 0.59 | <.001 | 88.4 | 74.8 | −13.6 |
All correlations are moderate (r = 0.44–0.64) and highly significant (all p < .001). The factorial run shows the lowest agreement (r = 0.44), driven by divergent scoring of base multi-agent cells: Opus 4.6 scores some base responses as low as 29–35 while GPT-5.2 assigns the same responses 66–80. GPT-5.2 applies stricter absolute standards on most runs (7–15 points lower), consistent with the calibration differences reported in Section 5.8. An additional Opus-Sonnet inter-judge comparison on the dynamic learner run (6c033830, N=120 paired) yields $r = 0.64$, $p < .001$, with Sonnet scoring 13.5 points below Opus, confirming the cross-judge reliability pattern extends beyond GPT-5.2.

**Table 42: Cross-Judge Replication of Key Findings**

| Finding | Claude Effect | GPT-5.2 Effect | GPT-5.2 p | Replicates? |
|---------|-------------|----------------|-----------|-------------|
| Recognition main effect (factorial, N=224 paired, 6 cells) | +17.6 pts | **+6.6 pts** ($d \approx 0.9$) | <.001 | Yes |
| Recognition vs base (validation, N=36) | +19.7 pts | **+9.6 pts** (d=0.91) | <.001 | Yes |
| Recognition vs enhanced (validation, N=36) | +8.0 pts | **+2.4 pts** (d=0.24) | n.s. | Marginal |
| Multi-agent main effect (factorial) | +2.6 pts | **−0.2 pts** | n.s. | Yes (small) |
| A×B interaction (Kimi replication, N=60) | −3.1 pts | **+1.5 pts** | n.s. | Yes (null) |
| Recognition effect in memory isolation (N=119) | +15.8 pts (d=1.71) | **+9.3 pts** (d=1.54) | <.001 | Yes |
| Memory effect in memory isolation (N=119) | +4.8 pts (d=0.46) | **+3.1 pts** (d=0.49) | n.s. | Yes (small) |
| Memory isolation interaction (N=119) | −5.6 pts | **−3.6 pts** | n.s. | Yes (negative) |
| Recognition in mechanism robustness (N=360) | +7.6 pts | **+3.8 pts** | <.001 | Yes |
| Mechanism clustering (scripted learner, N=360) | 2.8 pt spread | **4.4 pt spread** | — | Yes (null) |
|
|
1478
|
+
|
|
1479
|
+
*Note: Claude effects in this table are computed from the N=119 matched-pair subset (responses scored by both judges), which differs slightly from the full-sample values in Tables 5 and 6 (e.g., memory isolation interaction is $-5.6$ here vs $-4.2$ in Table 5 at N=120).*
|
|
1170
1480
|
|
|
1171
|
-
**Key result**: GPT-5.2 replicates all directional findings. The recognition main effect is large
|
|
1481
|
+
**Key result**: GPT-5.2 replicates all directional findings. The recognition main effect is large and highly significant under both judges across all analyses (GPT-5.2 d = 0.91–1.54 depending on design). The memory isolation experiment shows identical condition ordering under both judges (Recognition+Memory $\geq$ Recognition Only >> Memory Only > Base) with no rank reversals. The negative interaction (ceiling effect) replicates under GPT-5.2 (−3.6 vs −5.6 under Claude). Multi-agent null effects and A×B null interactions also replicate.
|
|
1172
1482
|
|
|
1173
|
-
The one non-replication is the recognition-vs-enhanced comparison (Claude: +8.
|
|
1483
|
+
The one non-replication is the recognition-vs-enhanced comparison (Claude: +8.0 pts; GPT-5.2: +2.4 pts, n.s.). GPT-5.2 confirms that recognition substantially outperforms the base condition, but cannot statistically distinguish recognition from enhanced prompting in the three-way comparison. This is consistent with GPT-5.2's compressed score range (SD $\approx$ 6–8 vs Claude's SD $\approx$ 8–18) reducing statistical power for smaller effects. It also suggests the recognition-vs-enhanced increment may be more sensitive to judge calibration than the larger recognition-vs-base effect.
|
|
1174
1484
|
|
|
1175
|
-
**Magnitude compression**: GPT-5.2
|
|
1485
|
+
**Magnitude compression**: GPT-5.2 generally finds smaller effect magnitudes than Claude, though the compression ratio varies: in the memory isolation experiment, GPT-5.2 finds 59% of Claude's recognition delta (+9.3 vs +15.8), while in the factorial it finds 37% (+6.6 vs +17.6). Effects are always in the same direction and almost always statistically significant. The greater divergence on the factorial reflects Opus 4.6's particularly harsh scoring of base multi-agent cells (some scoring 29–35), which GPT-5.2 does not replicate.
|
|
1176
1486
|
|
|
1177
|
-
**Interpretation**: The primary findings—recognition is the dominant driver of tutoring improvement (d=1.71 under Claude, d=
|
|
1487
|
+
**Interpretation**: The primary findings—recognition is the dominant driver of tutoring improvement (d=1.71 under Claude, d=1.54 under GPT-5.2 in the memory isolation design), memory provides a modest secondary benefit, and multi-agent architecture provides minimal benefit on well-trained content—are judge-robust. The corrected memory isolation experiment (Section 6.2) provides the strongest evidence: recognition dominance replicates with identical condition ordering, and the negative interaction (ceiling effects) is confirmed under both judges. The specific magnitude of the recognition-vs-enhanced increment (+8.0 under Claude) should be interpreted with caution, as it does not reach significance under GPT-5.2.
|
|
1178
1488
|
|
|
1179
1489
|
**Updated rubric cross-judge replication.** The cells 6 and 8 responses (N=88) were also scored under the updated 14-dimension rubric with dialogue transcript context by both judges. The cross-judge correlation on these responses is r=0.55 (N=88, p<.001), with GPT-5.2 scoring at 87% of Opus magnitudes (Opus mean=85.6, GPT mean=74.4). Both judges find cell 8 (multi-agent) scores higher than cell 6 (single-agent): Opus 87.3 vs 83.9, GPT 74.6 vs 74.2. The updated rubric does not alter the cross-judge pattern observed throughout the study.
|
|
1180
1490
|
|
|
1181
|
-
|
|
1491
|
+
**Dialectical modulation cross-judge.** GPT-5.2 rejudging of the dialectical multi-turn experiment (eval-2026-02-11-a54235ea, N=90) yields inter-judge $r = 0.51$ ($p < .001$). However, the recognition effect does not replicate: Opus finds $\Delta = +4.5$ ($d = 0.38$, $p \approx .075$), while GPT-5.2 finds $\Delta = -0.7$ (n.s.). This is consistent with the general pattern of effect compression (GPT-5.2 finds 37–59% of Opus magnitudes depending on experiment): a marginal Opus effect is expected to vanish under GPT-5.2's narrower score range. The dialectical recognition effect is the weakest in the study ($d = 0.38$ under the most favorable judge) and should be interpreted with corresponding caution.
|
|
1182
1492
|
|
|
1183
|
-
|
|
1493
|
+
**Mechanism robustness cross-judge.** GPT-5.2 rejudging of the mechanism robustness experiment (eval-2026-02-14-e0e3a622, N=360 paired) yields inter-judge $r = 0.59$ ($p < .001$). The recognition main effect replicates (Opus $\Delta = +7.6$, GPT $\Delta = +3.8$), with GPT finding 50% of the Opus magnitude. Both judges confirm mechanism clustering: under recognition, the 10 mechanism variants span only 2.8 pts under Opus and 4.4 pts under GPT. No mechanism differentiates from any other under either judge, confirming that the mechanism inertness finding with scripted learners is judge-robust.
|
|
1494
|
+
|
|
1495
|
+
**Dynamic learner cross-judge (Opus–Sonnet).** A Sonnet rejudge of the dynamic learner experiment (6c033830, N=120 paired) yields $r = 0.64$ ($p < .001$), with Sonnet scoring 13.5 points below Opus overall. Both judges find the recognition main effect (Opus $\Delta = +14.8$, Sonnet $\Delta = +17.0$) and profiling advantage (Opus $\Delta = +4.1$, Sonnet $\Delta = +12.8$). The cross-judge agreement is highest for cell 63 (recognition + profiling): Opus 90.2, Sonnet 87.2 (only $-3.0$ pts), compared to $-14$ to $-22$ pts for other cells. This convergence at the highest-quality condition suggests that recognition + profiling with dynamic learners produces output whose quality is self-evident across judges. A second Sonnet rejudge of the intersubjective and combined mechanism cells (a2b2717c, cells 64–65, N=120 paired) yields a weaker correlation ($r = 0.44$, $p < .001$), with Sonnet scoring 16 points below Opus. The lower agreement may reflect the greater complexity of these mechanisms or higher variance in the outputs they produce.
|
|
1496
|
+
|
|
1497
|
+
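
The paired-judge statistics reported throughout this section (inter-judge Pearson $r$, mean paired delta, and an effect size on the paired differences) can be sketched with a short helper. The scores below are illustrative placeholders, not study data, and the helper is an assumed reconstruction of how such values are commonly computed, not the project's actual analysis code; note that `d` here uses the paired convention (mean difference over the SD of the differences), one of several Cohen's $d$ conventions.

```python
import math

def judge_agreement(a, b):
    """Pearson r, mean paired delta, and paired Cohen's d for two judges'
    scores of the same responses (paired by index)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    r = cov / math.sqrt(va * vb)
    diffs = [y - x for x, y in zip(a, b)]
    md = sum(diffs) / n                      # mean paired delta (judge B - judge A)
    sd = math.sqrt(sum((v - md) ** 2 for v in diffs) / (n - 1))
    d = md / sd                              # paired convention: mean diff / SD of diffs
    return r, md, d

# Illustrative scores only (not the study's data): a stricter second judge
# with a compressed range, as described above.
opus = [85, 72, 90, 64, 78, 88, 70, 81]
gpt = [74, 66, 80, 60, 70, 79, 65, 72]
r, delta, d = judge_agreement(opus, gpt)
```

With a uniformly stricter but rank-consistent second judge, $r$ stays high while the paired delta is negative, which is the pattern Table 42 summarizes.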
### 6.20 Dialectical Impasse Test
The preceding multi-turn scenarios (Section 6.14) test recognition under conditions of frustration, misconception, and intellectual exploration—situations where a productive resolution is readily available. But recognition theory makes a stronger claim: that genuine pedagogical encounters involve working *through* impasse rather than around it. Section 7.1 discusses how the master-slave dialectic can terminate in deadlock when the tutor's expertise is confirmed but the learner remains a vessel rather than a subject. Do recognition-prompted tutors handle sustained, unresolved impasse differently from base tutors?

To test this, we designed three 5-turn impasse scenarios where scripted learner messages escalate resistance across turns, creating conditions where productive resolution requires genuine engagement rather than reassertion of authority:

Each scenario was run with 4 cells (base single, base multi, recognition single, recognition multi) $\times$ 2 runs = 24 five-turn dialogues (eval-2026-02-08-f896275d, Opus judge).

**Table 43: Dialectical Impasse Results by Scenario**

| Scenario | Base Mean (N=4) | Recognition Mean (N=4) | $\Delta$ | Recog. Score (Base) | Recog. Score (Recognition) |
|----------|----------------|----------------------|-----|--------------------------|---------------------------|
| Affective shutdown | 52.0 | 50.9 | $-$1.1 | 30.2 | 35.7 |
| **Grand mean** | **33.2** | **56.8** | **+23.6** | **13.5** | **40.9** |

**Table 44: Impasse Results by Cell**

| Cell | Epistemic | Affective | Deadlock | Mean |
|------|-----------|-----------|----------|------|

The results reveal a striking dissociation across impasse types.

The affective shutdown scenario shows no recognition advantage ($\Delta$ = $-$1.1). Base tutors handle emotional repair roughly as well as recognition tutors, suggesting that the recognition framework's distinctive contribution lies in the epistemological structure of dialogue—how the tutor relates to the learner's *ideas*—rather than in emotional support per se. This pattern is theoretically coherent: Hegel's recognition theory addresses the constitution of the other as a knowing subject, not primarily as a feeling subject. The affective dimension maps more naturally onto Honneth's later extension of recognition to emotional needs, which is not the primary theoretical ground of our prompts.

The cell-level data (Table 44) show that multi-agent architecture provides a notable benefit for base tutors on affective shutdown (cell 3: 62.1 vs cell 1: 41.9), suggesting the Superego's quality enforcement helps catch dismissive responses even without recognition theory. For epistemic resistance, recognition + multi-agent (cell 7: 73.8) substantially outperforms recognition + single-agent (cell 5: 56.2), suggesting that internal deliberation helps navigate philosophically demanding impasses.
#### Resolution Strategy Coding
The rubric scores show *how well* tutors handle impasse; resolution strategy coding reveals *how* they handle it. Each of the 24 dialogues was coded by an LLM judge (Opus) into one of five Hegelian resolution strategies: mutual recognition (engaging the learner's position as valid, exploring tension together), domination (reasserting expertise, dismissing the objection), capitulation (agreeing with the learner to avoid conflict), withdrawal (changing topic, deflecting, offering platitudes), and scaffolded reframing (acknowledging the learner's position, then reframing to open new ground—the Aufhebung pattern of preserving and overcoming).

**Table 45: Resolution Strategy Distribution by Condition**

| Strategy | Base (N=12) | % | Recognition (N=12) | % |
|----------|-------------|------|---------------------|------|

The overall strategy coding captures the arc of a full dialogue. But does strategy evolve *within* a dialogue as impasse deepens? To investigate, we independently coded turns 3 and 5 of each dialogue—the responses after the learner's first major escalation and final challenge respectively—using the same five-category scheme. The per-turn coder received the dialogue transcript only up to and including the target turn, and coded only the tutor's response at that turn.

**Table 46: Strategy Distribution by Turn**

| Turn | Condition | Withdrawal | Scaffolded Reframing | Other |
|------|-----------|-----------|---------------------|-------|
| 5 (final challenge) | Base (N=12) | 12 (100%) | 0 | 0 |
| 5 (final challenge) | Recognition (N=12) | 10 (83%) | 1 (8%) | 1 domination |

**Table 47: Strategy Stability (Turn 3 $\to$ Turn 5)**

| Condition | Same Strategy | Changed | Stability Rate |
|-----------|--------------|---------|----------------|

Of 24 attempts, 23 produced valid codings (one API error). GPT-5.2 coded all 11 base dialogues as withdrawal (matching Opus 11/11) and all 12 recognition dialogues as scaffolded reframing. On 23 paired codings, the two judges agree on 21 (91.3%), with Cohen's $\kappa = 0.84$ (excellent inter-rater reliability). The two disagreements are both cases where Opus made finer distinctions within the engagement category: id 8150 (Opus: mutual recognition, GPT: scaffolded reframing) and id 8144 (Opus: domination, GPT: scaffolded reframing). GPT-5.2 sees all recognition tutors as doing the same thing—engaging and reframing—while Opus distinguishes edge cases within that category.

On the core binary question—does the tutor engage the impasse or withdraw from it?—agreement is 23/23 (100%, $\kappa = 1.0$). The perfect separation between conditions replicates across both judges. This is consistent with the broader cross-judge pattern observed throughout the study (Section 6.19): GPT-5.2 finds the same direction with less nuance, compressing fine-grained distinctions while preserving the primary effect.

**Limitations**: The sample size is small (N=2 per cell per scenario, N=24 total). Learner messages are scripted rather than LLM-generated, which ensures consistent impasse conditions but may produce less naturalistic interactions. The 100% base withdrawal rate, while striking, may partly reflect a coarse distinction—whether the tutor engages the impasse content at all—rather than fine-grained strategy discrimination. Cross-judge validation ($\kappa = 0.84$) confirms the primary finding but the two judges disagree on finer strategy distinctions within the engagement category. The scenarios test philosophy content only; whether impasse dynamics differ for other domains is unknown. As with the tag assessment (Section 6.11), the strategy coder was not blinded to condition—cell names and metadata were visible in the transcript. However, the cross-judge replication (91.3% agreement, $\kappa = 0.84$) and the coarseness of the primary distinction (engagement vs withdrawal) make assessor bias a less plausible explanation here than for finer-grained tag frequency differences.
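
The inter-rater figures above (91.3% raw agreement, $\kappa = 0.84$) follow the standard Cohen's kappa computation, which discounts the agreement expected by chance from the two coders' marginal label frequencies. The sketch below reconstructs the label vectors from the distribution described in this section (11 matched withdrawal codings, 10 matched scaffolded-reframing codings, and the two Opus divergences); the pairing order is an assumption.

```python
from collections import Counter

def cohens_kappa(coder1, coder2):
    """Cohen's kappa for two coders' category labels, paired by index."""
    n = len(coder1)
    p_o = sum(a == b for a, b in zip(coder1, coder2)) / n          # observed agreement
    f1, f2 = Counter(coder1), Counter(coder2)
    p_e = sum(f1[c] * f2[c] for c in set(f1) | set(f2)) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Label vectors reconstructed from the distribution described above
# (pairing order assumed): 11 base dialogues coded withdrawal by both
# judges; 12 recognition dialogues coded scaffolded reframing by GPT-5.2,
# of which Opus codes 10 as reframing, 1 as mutual, 1 as domination.
opus = ["withdrawal"] * 11 + ["reframing"] * 10 + ["mutual", "domination"]
gpt = ["withdrawal"] * 11 + ["reframing"] * 12
kappa = cohens_kappa(opus, gpt)
```

With these reconstructed vectors the helper yields approximately 0.84, matching the reported value; collapsing the labels to the binary engage/withdraw distinction yields perfect agreement and $\kappa = 1.0$.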
### 6.21 Prompt Elaboration Baseline
A potential deflationary critique of the recognition findings is that the recognition prompt simply contains *more* instructional content—more words, more examples, more heuristics—and that any sufficiently elaborate prompt would produce similar gains. The base prompt used throughout this study is itself substantial: 344 lines (~7,500 words) containing decision heuristics (a "Struggle Stop-Rule" mandating review when struggle signals appear, a "Momentum Rule" for high-performing learners), a learner analysis framework, a decision matrix mapping learner states to suggestion types, and extensive research lab navigation guidance. This is far from a naive baseline.

To test the contribution of this prompt elaboration, we compared the full 344-line base prompt (cell 1) against a stripped 35-line "naive" prompt containing only a one-sentence role description and the JSON output schema—the minimum needed to produce scorable suggestions. We ran this comparison on two ego models: Haiku (stronger) and Kimi K2.5 (weaker), each with N=72 (18 scenarios $\times$ 2 cells $\times$ 2 runs), Opus judge.

**Table 20b: Prompt Elaboration Baseline (N=144, Opus judge)**

| Ego Model | Base (344 lines) | Naive (35 lines) | $\Delta$ |
|-----------|-----------------|-----------------|----------|
| Haiku | 75.7 | **82.5** | **+6.8** |
| Kimi K2.5 | 71.7 | 71.4 | $-0.3$ |

On Haiku, the elaborate prompt is actively harmful ($\Delta = -6.8$): the naive prompt scores higher on relevance (+0.28) and pedagogical quality (+0.36), losing only on specificity ($-0.17$)—the formatting guidance helps cite exact content IDs, but at the cost of worse pedagogical decisions. The per-scenario pattern is diagnostic: on high-performer scenarios ($\Delta = +29.2$), the base prompt's Momentum Rule classifies a mastery-level learner as "high engagement, no struggles" and prescribes "continue to next lecture," while the naive prompt suggests a capstone synthesis activity. On epistemic resistance ($\Delta = +30.8$), the base prompt reads a learner's philosophical critique of Hegel as "no struggle signals" and pushes forward; the naive prompt suggests an interactive simulation addressing the critique directly.

On Kimi, the elaborate prompt has no effect ($\Delta = -0.3$): the weaker model cannot execute the decision heuristics well enough for them to help or hurt. This is a model capability $\times$ prompt elaboration interaction—prescriptive rules override strong models' superior pedagogical intuitions while passing through weaker models unchanged.

Critically, recognition theory ($M = 90.9$ on Haiku, from the multi-model probe in Section 6.4) remains well above the naive prompt ($M = 82.5$). The recognition effect operates at a different level: not prescribing actions through decision trees, but shifting the model's relational stance toward the learner. Stripping 90% of the base prompt's content does not diminish—and actually enhances—the baseline from which recognition adds its distinctive value. This suggests that shorter, simpler recognition prompts that specify relational orientation without prescriptive heuristics could potentially achieve equal or greater effects than the current recognition prompt.
### 6.22 Token Budget Sensitivity
All evaluation cells use `max_tokens: 8000`, but actual single-turn outputs average approximately 235 tokens (base) to 451 tokens (recognition), with maximums under 650. This means the budget is 12–34$\times$ larger than typical output. A dose-response test measured whether constraining `max_tokens` degrades evaluation scores.

**Design.** Five runs used Haiku ego with Opus judge across two cells (cell 1 base, cell 5 recognition) at three constrained budget levels (256, 512, and 2048 tokens), with an additional base-only control at the default 8000 tokens (N=126 scored single-turn evaluations across all levels). The `max_tokens` parameter was overridden via a new CLI flag (`--max-tokens`) that threads through to the API request body.

**Table 49: Token Budget Dose-Response (Single-Turn Scenarios Only)**

| Budget | Cell | N | Mean | Avg Tokens | Tokens/Call |
|--------|------|---|------|------------|-------------|
| 256 | Base | 24 | 81.2 | 235 | 235 |
| 512 | Base | 12 | 77.9 | 247 | 247 |
| 2048 | Base | 12 | 81.2 | 240 | 240 |
| 8000 | Base | 12 | 82.1 | 235 | 235 |
| 256 | Recognition | 24 | 90.2 | 446 | ~245 |
| 512 | Recognition | 12 | 90.7 | 432 | ~245 |
| 2048 | Recognition | 12 | 92.3 | 451 | 451 |

*Recognition effect: +9.0 (256), +12.8 (512), +11.1 (2048), consistent across all budget levels.*

**Results.** Scores are flat across all budget levels for both conditions. Base cell means range 77.9–82.1, well within sampling noise at N=12–24. Recognition means range 90.2–92.3. The recognition effect is budget-invariant, ranging from +9.0 to +12.8 across levels.

**Mechanism: JSON retry absorption.** The system requires structured JSON output. When `max_tokens` truncates a response mid-JSON, the parsing fails and the engine retries automatically (up to two internal retries per generation). This retry mechanism absorbs budget constraints: at 256 tokens, each individual API call is correctly capped (confirmed by direct API testing: `completion_tokens: 256, finish_reason: length`), but the engine retries until a parseable response is produced. The cumulative `output_tokens` stored in the database reflects the sum across all API calls (including retries), which is why average tokens in Table 49 can exceed the per-call budget—for example, recognition at 256 averages 446 total tokens across approximately 1.8 actual API calls per evaluation (each individually capped at 256). At budgets above approximately 500 tokens, most responses complete without truncation and no retries are needed.

**Implication.** Reducing `max_tokens` from 8000 to 2048 incurs no quality penalty and could reduce per-call costs on providers that charge by requested (rather than generated) tokens. Further reduction to 512 also appears safe. At 256, the budget is below the natural response length for recognition prompts, triggering retries that negate any cost savings. These results are tentative (N=12–24 per cell per level, single model, single judge) but suggest that substantial budget reductions are possible without sacrificing the recognition effect.
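
The retry-absorption mechanism can be sketched as follows. `call_model` is a hypothetical stand-in for the engine's API client (the real interface and flag plumbing may differ); the sketch shows why the stored cumulative token count can exceed the per-call `max_tokens` cap.

```python
import json

MAX_RETRIES = 2  # up to two internal retries per generation, as described above

def generate_with_retries(call_model, prompt, max_tokens):
    """Retry until the model returns parseable JSON, summing output tokens.

    call_model(prompt, max_tokens) -> (text, output_tokens, finish_reason)
    is a hypothetical client interface, not the engine's actual API.
    """
    total_tokens = 0
    for _ in range(1 + MAX_RETRIES):
        text, tokens, finish_reason = call_model(prompt, max_tokens)
        total_tokens += tokens  # cumulative sum across retries; this is what gets stored
        try:
            return json.loads(text), total_tokens
        except json.JSONDecodeError:
            continue  # truncated mid-JSON (finish_reason == "length"): retry
    raise RuntimeError("no parseable JSON after %d attempts" % (1 + MAX_RETRIES))
```

Under a 256-token budget, a truncated first call followed by a successful retry stores a total near twice the cap even though each individual call was correctly limited, which is the pattern the table above reflects.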
---
The baseline tutor treats the learner as a knowledge deficit.

The recognition tutor treats the learner as an autonomous subject. Learner contributions become sites of joint inquiry. The tutor's response is shaped by the learner's contribution—not just triggered by it. Both parties are changed through the encounter.

This maps directly onto Hegel's master-slave analysis. The baseline tutor achieves pedagogical mastery—acknowledged as expert, confirmed through learner progress—but the learner's acknowledgment is hollow because the learner has not been recognized as a subject whose understanding matters. As in Hegel's resolution, the path forward lies through the learner's own formative activity: the recognition tutor honors the learner's struggle as constitutive of genuine understanding rather than an obstacle to be resolved. The tutor adaptation metrics (Section 6.15) provide empirical evidence for this: recognition-prompted tutors adjust their approach in response to learner input (+26% adaptation index), treating learner contributions as genuine inputs that reshape the pedagogical encounter.

The dialectical impasse test (Section 6.20) provides the most direct evidence for this interpretation, and the post-hoc resolution strategy coding reveals the mechanism with unusual clarity. When learners mount sustained intellectual resistance—a Popperian falsifiability critique, a materialist counter-reading—the recognition advantage is largest (+43 and +29 pts respectively), because these scenarios demand precisely what Hegel's analysis predicts: treating the other's position as having independent validity that must be genuinely engaged, not merely acknowledged.

The strategy coding shows that base tutors do not fail by *choosing the wrong strategy*—they fail by *having no strategy at all*. Every base tutor response across all three impasse scenarios (12/12) was coded as withdrawal: the tutor notes the learner's engagement time, praises their dedication, and suggests moving to the next lecture. The learner's substantive position—a coherent Popperian critique, a materialist counter-reading, an emotional plea for help—is not dismissed, contradicted, or resolved. It is simply not engaged. The impasse is not encountered; it is bypassed. This maps precisely onto the master-slave analysis: the master consumes the slave's labor (engagement metrics, time-on-page, session counts) without encountering the slave as a subject whose ideas possess independent validity. The base tutor achieves the master's hollow recognition—its authority is confirmed by the learner's continued presence—but the encounter that could produce genuine understanding never occurs.

Recognition tutors, by contrast, predominantly use scaffolded reframing (10/12): they validate the learner's position as intellectually serious, then redirect toward material that productively complicates it. This is Aufhebung—sublation—in pedagogical practice. The learner's objection is *preserved* (acknowledged as valid) and *overcome* (reframed toward new conceptual ground that neither party previously occupied). Only one response (on productive deadlock) was coded as genuine mutual recognition—where the tutor adopted the learner's materialist framework as its own lens rather than merely acknowledging it. This 83% scaffolded reframing vs 8% mutual recognition ratio is itself theoretically significant: recognition prompts produce sophisticated *pedagogical technique* rather than genuine *mutual transformation*. The tutor does not change its mind about Hegel in response to the student's Popperian critique—nor should it. What recognition enables is the capacity to hold the learner's counter-position as intellectually valid while maintaining pedagogical direction, which is arguably the realistic horizon for recognition in AI tutoring.

The per-turn strategy coding (Section 6.20) adds a further nuance: at the level of individual turns, even recognition tutors predominantly appear to withdraw—redirecting toward new material or reframing the question. The scaffolded reframing that the overall coder detects emerges from the *cumulative trajectory* across turns, not from any single response. This is itself dialectical: the encounter that produces recognition is not a moment but a process, and each step may appear incomplete in isolation.

The null result on affective shutdown ($\Delta$ = $-$1.1) sharpens the theoretical claim: recognition's distinctive contribution is epistemological (how the tutor relates to the learner's *ideas*), not primarily affective (how the tutor relates to the learner's *feelings*). The strategy coding confirms this: even on affective shutdown, the base tutor's failure mode is withdrawal (redirecting to review material) rather than emotional dismissal—the distinction is not about empathy but about whether the learner's intellectual or experiential contribution is *engaged* as having independent validity.
### 7.2 Architecture as Additive, Not Synergistic
An early exploratory analysis (N=17, Nemotron) suggested that multi-agent architecture might synergize specifically with recognition prompts (+9.2 pts interaction). This raised the theoretically appealing possibility that recognition creates qualitatively different conditions for productive internal dialogue. However, a multi-model probe across five ego models (N=655 total; Section 6.4, Table 8) decisively refutes this hypothesis: the A$\times$B interaction ranges from $-5.7$ to $+0.5$ across all five models tested, with four of five showing negative interactions and only Kimi showing a negligible positive value. The Nemotron re-run itself (N=119) shows an interaction of $-5.7$, confirming the original +9.2 as sampling noise on a tiny sample.

The corrected picture is simpler: recognition and architecture contribute additively. Recognition provides a large, consistent main effect (+9.6 to +17.8 across models), while architecture provides a small main effect ($-0.8$ to $+3.7$) that does not meaningfully depend on prompt type. The Superego adds modest value regardless of whether recognition theory is present—likely through generic quality enforcement (catching errors, improving specificity) rather than through recognition-specific deliberation. This finding aligns with the architecture's primary demonstrated value being error correction on new domains (Section 6.5) rather than recognition amplification.

The dialectical superego modulation experiments (Section 6.8) provide further evidence for additivity. Across three superego persona types and two negotiation styles (N=174), structural modulation metrics—negation depth, convergence speed, feedback length—show no correlation with output quality (all $|r| < 0.12$, n.s.). The superego's contribution is *filtering* (catching poor outputs) rather than *improving* (iteratively refining good ones). Recognition works by raising the quality of the ego's initial draft, reducing what the superego needs to catch. Similarly, the mechanism robustness experiment (Section 6.10, N=360) shows all nine mechanisms clustering within a 2.4-point band under recognition with scripted learners—the mechanism elaborates but does not meaningfully differentiate. Only when a dynamic learner provides genuine feedback does a mechanism (Theory of Mind profiling) produce a measurable additive effect (+4.1 pts, Section 6.10), and even then with near-zero interaction with recognition ($-0.7$ pts).
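
The A$\times$B interaction quoted throughout is a difference-in-differences over the four cell means. As a minimal sketch (the cell means below are illustrative, not the study's):

```python
def axb_interaction(base_single, base_multi, recog_single, recog_multi):
    """A x B interaction as a difference-in-differences of cell means:
    the multi-agent effect under recognition minus the multi-agent
    effect under base prompting."""
    return (recog_multi - recog_single) - (base_multi - base_single)

# Illustrative cell means only (not study data): a small architecture gain
# under base that reverses under recognition yields a negative interaction
# of the kind reported above.
example = axb_interaction(70.0, 73.0, 85.0, 82.3)
```

A value near zero is what "additive, not synergistic" means operationally: the architecture effect does not depend on the prompt condition.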
### 7.3 Domain Limits of Recognition-Theoretic Pedagogy
|
|
1320
1685
|
|
|
@@ -1345,13 +1710,25 @@ For practical deployment, this suggests multi-agent architecture is most valuabl
|
|
|
1345
1710
|
2. Prompt templates contain domain-specific examples that may leak across deployments
|
|
1346
1711
|
3. Domain-specific accuracy is critical
|
|
1347
1712
|
|
|
1713
|
+
The modulation analysis (Section 6.15.1) extends this reinterpretation. The Drama Machine framework predicts that internal ego-superego tension produces *modulated* behavior—dynamic variation in register, approach, and intensity. Post-hoc analysis of the N=350 factorial data reveals that the Superego does not increase behavioral range (multi-agent dimension score variance is virtually identical to single-agent, $d = 0.05$). Instead, **recognition is the modulation driver, operating through calibration rather than oscillation**: recognition responses show dramatically lower dimension variance ($d = -1.00$), meaning recognition tutors perform uniformly well across all 14 rubric dimensions rather than excelling on some while neglecting others. The Superego's contribution is *phronesis*—contextual practical wisdom that calibrates quality—rather than the productive irresolution the Drama Machine emphasizes for narrative contexts. Recognition tutors do negotiate longer with their Superego (2.62 vs 2.05 rounds), suggesting productive tension occurs internally even as the output becomes more consistent.
**Convergent vs. divergent internal dialogue.** Why does the Superego produce convergence rather than the modulation the Drama Machine predicts? The explanation lies in a structural distinction between *convergent* and *divergent* internal dialogue. In the original Drama Machine framework for narrative, internal agents have genuinely conflicting *objectives*—ambition vs. loyalty, desire vs. duty. That conflict is what produces dramatic behavioral range; the character oscillates because opposing motivations pull in different directions. In our tutoring architecture, the Ego and Superego share the same goal (effective pedagogy) and disagree only on whether a specific response achieves it. This is quality control, not value conflict. Quality control pushes outputs toward a shared standard—an implicit "quality attractor" that all responses converge upon—reducing variance rather than increasing it.
This convergence is amplified by three mechanisms. First, the Superego acts as a *filter*, not a *generator*: it removes poor options from the Ego's output space but does not introduce new behavioral repertoire. Filtering narrows distributions. Second, the Ego-Superego negotiation is largely stateless—the Superego does not remember its critiques from previous scenarios, so there is no accumulation of internal tension over time. Without persistent conflict, there is no pressure toward behavioral divergence. Third, modern LLMs already internalize self-critique through RLHF training; the explicit Superego may be making *reliable* what the model already does *intermittently*, improving average quality without altering the variance structure.
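
The filter-not-generator distinction can be made concrete. A minimal sketch of the negotiation loop (hypothetical function names, not the package's actual API) shows why filtering can only truncate the ego's output distribution rather than widen it:

```python
def negotiate(ego_draft_fn, superego_accepts_fn, max_rounds=3):
    """Superego-as-filter: it can reject ego drafts but never contributes
    drafts of its own, so the output distribution is a truncated version
    of the ego's -- variance shrinks toward the shared quality standard."""
    draft = ego_draft_fn(round_num=0)
    for round_num in range(1, max_rounds + 1):
        if superego_accepts_fn(draft):
            return draft  # converged on the shared quality attractor
        draft = ego_draft_fn(round_num=round_num)  # ego regenerates; superego adds nothing
    return draft  # fall back to the last draft if rounds are exhausted

# Toy illustration: ego proposes from a fixed repertoire,
# superego filters out drafts below a quality bar.
repertoire = ["weak hint", "solid scaffolded question", "lecture dump"]
quality = {"weak hint": 2, "solid scaffolded question": 5, "lecture dump": 1}
result = negotiate(
    ego_draft_fn=lambda round_num: repertoire[round_num % len(repertoire)],
    superego_accepts_fn=lambda d: quality[d] >= 4,
)
```

Note that nothing the superego does can add "solid scaffolded question" to a repertoire that lacks it; only the recognition framing, which reshapes the ego's generation, can do that.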
Recognition, by contrast, changes the behavioral *repertoire* itself—shifting from information delivery to relational engagement, opening up modes of response (metaphor, joint inquiry, productive tension) that do not exist in the base repertoire. The Superego can only evaluate behaviors that are already in the Ego's repertoire; recognition expands what that repertoire contains.
This suggests that to achieve genuine *divergent* internal dialogue—the kind the Drama Machine envisions—one would need internal agents with genuinely opposed pedagogical philosophies rather than agents that share a goal and disagree on execution. A Superego committed to Socratic questioning opposing an Ego inclined toward direct instruction, or an adversarial critic suspicious of the Ego's recognition performances ("Is this recognition genuine, or are you performing recognition markers to satisfy the rubric?"), could produce the productive irresolution the framework predicts. Whether such divergence would improve or degrade tutoring quality is an open empirical question—but it would test the Drama Machine's modulation hypothesis more faithfully than the current convergent architecture.
**The insight-action gap.** The self-reflective evolution experiments (Section 6.9) provide a striking illustration of the Superego's limitation. When both ego and superego generate between-turn reflections on their own behavior, both accurately diagnose their failures: the ego identifies "I kept circling back to the same framework"; the superego identifies "the ego ignores my feedback." But accurate self-diagnosis does not produce behavioral change. The ego's next turn repeats the same pattern. The insight-action gap—awareness without adaptation—reflects the Superego-as-filter architecture: the Superego can identify what is wrong with a response but cannot propose a fundamentally different approach. Theory of Mind profiling (Section 6.10) partially bridges this gap by giving the ego a model of the other agent to adapt *toward*, providing direction that self-reflection alone cannot supply.
### 7.5 Factor C: The Learner Superego Paradox
The learner architecture factor (single-agent vs multi-agent learner) showed a small but significant negative effect in the tutor-side factorial analysis (-3.1 pts, F=5.52, p=.019), though it explains only 1.6% of variance. The symmetric learner-side evaluation (Section 6.16) reveals the mechanism: the multi-agent learner architecture does not merely fail to help—it actively *hurts* learner quality ($d = 1.43$, $F(1,114) = 68.28$, $p < .001$, $\eta^2 = .342$). This is the largest effect in the entire study and inverts the intuition that motivated the architecture.
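
The effect sizes reported throughout are standard Cohen's d with a pooled standard deviation; a self-contained sketch using illustrative numbers (not the study data):

```python
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Illustrative rubric scores only -- not the evaluation dataset.
single_agent_learner = [78, 81, 75, 80, 79]
multi_agent_learner = [70, 72, 68, 71, 69]
d = cohens_d(single_agent_learner, multi_agent_learner)
```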
The ego/superego process was designed to produce more thoughtful learner responses through internal self-critique. Instead, the superego acts as an overzealous editor: it polishes away the messy, confused, persona-consistent engagement that characterizes genuine student behavior. Persona consistency shows the largest deficit ($\Delta = -0.59$ on the 1-5 scale)—a "frustrated student" stops sounding frustrated after the superego smooths out rough edges. Conceptual engagement ($\Delta = -0.69$) and question quality ($\Delta = -0.65$) follow: the superego suppresses naive but substantive questions in favor of more "correct" but less authentic ones.
**Recognition as external self-regulation.** The learner-side A×C interaction ($F(1,114) = 11.50$, $p < .001$, $\eta^2 = .058$) reveals that recognition partially rescues the multi-agent learner ($d = 0.79$, $p = .004$) while having no effect on single-agent learner quality ($d = -0.46$, $p = .082$, n.s.). The tutor-side factorial shows recognition working robustly across both learner types (+15.7 vs +13.0 pts, A×C n.s.), while the learner rubric reveals an asymmetry: recognition helps multi-agent learners more (+9.5 vs -1.3 pts on learner quality). The recognitive tutor creates conditions where authentic engagement is valued, counteracting the superego's tendency to pre-process learner reactions. But the recognitive tutor cannot fix the internal process—deliberation depth remains uniformly poor (2.7/5, $p = .679$ for the recognition effect) regardless of tutor framework.
This has a clean Hegelian interpretation. The ego/superego dynamic is a form of internal self-relation—the subject critiquing itself. But genuine recognition requires encounter with the Other. The tutor-as-Other provides something the internal superego cannot: acknowledgment from outside the learner's own cognitive system. External recognition is structurally different from, and more effective than, internal self-critique. You cannot bootstrap genuine dialogue from a monologue.
The difference between baseline and recognition prompts is not about different facts or capabilities. It is about:
- **Who the learner is** (knowledge deficit vs. autonomous subject)
- **What the interaction produces** (information transfer vs. adaptive responsiveness—Section 6.15 shows recognition profiles produce tutor adaptation indices 26% higher than baseline across three multi-turn scenarios, N=118)
- **What counts as success** (correct content delivered vs. productive struggle honored)
This suggests a new category: *intersubjective prompts* that specify agent-other relations, not just agent behavior.
The prompt elaboration baseline (Section 6.21) sharpens this distinction empirically. A 344-line base prompt containing detailed decision heuristics, learner analysis frameworks, and pedagogical decision matrices produces *worse* results than a 35-line prompt containing only a role description and output schema—at least on models capable enough to exercise their own pedagogical judgment. The prescriptive content specifies *agent behavior* (classify learner state, prescribe action type); recognition content specifies *agent-other relations* (treat the learner as an autonomous subject). The former constrains; the latter enables. This finding also implies that the current recognition prompts, which inherit much of the base prompt's prescriptive scaffolding, may be over-specified—shorter recognition prompts that focus purely on relational stance could match or exceed the current prompts' effectiveness.
### 7.7 Implications for AI Personality
AI personality research typically treats personality as dispositional—stable traits the system exhibits (Section 2.6). Our framework suggests personality is better understood relationally—not as what traits the AI has, but as how it constitutes its interlocutor.
Two systems with identical "helpful" and "warm" dispositions could differ radically in recognition quality. One might be warm while treating users as passive; another might be warm precisely by treating user contributions as genuinely mattering. This is an instance of what might be called *strategic anthropomorphism*: using the language and structure of human intersubjectivity as a design heuristic, not because the AI achieves genuine consciousness, but because the relational framework produces measurably better outcomes. The risk of strategic anthropomorphism—that users mistake functional recognition for genuine understanding—is real but manageable through transparent design (Section 3.3's distinction between recognition proper and recognition-oriented design).
If mutual recognition produces better outcomes, and if mutual recognition requires the AI to be genuinely shaped by human input, then aligned AI might need to be constitutionally open to transformation—not just trained to simulate openness. The bilateral transformation metrics (Section 6.15) provide empirical evidence for this: recognition-prompted tutors measurably adapt their approach based on learner input (+26% higher adaptation index across N=118 multi-turn dialogues), while baseline tutors maintain more rigid stances. However, the learner growth reversal (Section 6.15) complicates the "mutual" framing—what we observe is primarily tutor-side adaptation rather than symmetric transformation.
### 7.8 Cost-Benefit Analysis: When is Multi-Agent Architecture Worth It?
The domain generalizability findings raise a practical question: when is the additional cost of multi-agent architecture justified?
**Table 48: Cost-Benefit by Domain and Architecture**
| Domain | Architecture | Avg Score | Latency (s) | Δ Score | Latency Multiple |
|--------|-------------|-----------|-------------|---------|------------------|
### 7.9 What the Transcripts Reveal
The qualitative analysis in Section 6.17 provides textual evidence that the score differences between conditions correspond to observable relational differences in the actual suggestions—not merely rubric-gaming or surface-level keyword matching.
The transcript excerpts illustrate a consistent structural pattern: base responses adopt a third-person, context-free instructional stance ("complete this lecture," "review the foundational material," "begin with an introductory lecture"), while recognition responses adopt a second-person, context-specific relational stance that names the learner's history, validates their intellectual contributions, and proposes actions grounded in the learner's own interests. This distinction maps directly onto the theoretical framework: the base tutor constitutes the learner as a knowledge deficit (Section 7.1), while the recognition tutor constitutes the learner as an autonomous subject whose contributions shape the pedagogical encounter.
These findings carry important limitations. The thematic coding is regex-based rather than human-coded or LLM-coded, and may miss nuanced expressions of each category or generate false positives from surface matches. A natural extension would be to use LLM-based thematic analysis (e.g., having Claude Code classify each response against the thematic categories with chain-of-thought reasoning), which could capture semantic patterns that regex misses—for instance, recognizing struggle-honoring language that uses novel phrasing not covered by the predefined patterns. The transcript pairs were selected for maximum contrast (highest recognition vs lowest base scores), not typicality—median-scoring responses from both conditions would show less dramatic differences. The qualitative patterns are consistent with, but do not prove, the theoretical interpretation; alternative explanations (e.g., recognition prompts simply producing longer, more detailed responses that score higher on the rubric) cannot be fully ruled out, though the lexical analysis suggests the difference is qualitative rather than quantitative.
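
As an illustration of both the regex-based method and the brittleness noted above, a minimal coder of this kind (the patterns are invented examples, not the study's actual coding scheme):

```python
import re

# Invented example patterns -- not the study's actual thematic categories.
THEMES = {
    "struggle_honoring": re.compile(
        r"\b(productive struggle|wrestl\w+ with|sit with (the|this) difficulty)\b", re.I),
    "contribution_uptake": re.compile(
        r"\b(your (point|question|insight)|as you (noted|observed))\b", re.I),
}

def code_response(text):
    """Return the set of theme labels whose regex matches the response."""
    return {label for label, pat in THEMES.items() if pat.search(text)}

hit = code_response("As you noted, wrestling with this is productive.")
# Same idea, novel phrasing -- the false negative the text warns about:
miss = code_response("Staying inside the confusion can be generative.")
```

The second call illustrates exactly why LLM-based thematic classification could outperform surface matching: struggle-honoring language in novel phrasing scores zero under the regex coder.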
### 7.10 The Scripted Learner Confound
A methodological finding with broad implications: the mechanism robustness experiment (Section 6.10, N=360) shows that all nine advanced mechanisms—self-reflection, profiling, intersubjective dialogue, prompt erosion detection, quantitative disposition tracking, and their combinations—produce indistinguishable results under scripted learners. The full factorial (cells 1–8, N=350) shares this limitation. When learner messages are predetermined by scenario YAML, mechanisms that adapt to learner behavior are causally inert—they modify tutor output, but the next learner input is unchanged.
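
The causal inertness is visible in the loop structure itself. A hedged sketch (hypothetical names, not the evaluation harness's actual code) of the two dialogue regimes:

```python
def run_dialogue(tutor_fn, learner_fn, scripted_turns, dynamic=False, n_turns=3):
    """With a scripted learner, the next learner message ignores the tutor's
    output entirely; adaptive tutor mechanisms cannot close a feedback loop
    because there is nothing downstream for them to influence."""
    transcript = []
    for t in range(n_turns):
        if dynamic:
            learner_msg = learner_fn(transcript)   # sees prior tutor output
        else:
            learner_msg = scripted_turns[t]        # predetermined by scenario YAML
        tutor_msg = tutor_fn(learner_msg, transcript)
        transcript.append((learner_msg, tutor_msg))
    return transcript

# Two tutors with different mechanisms face identical scripted inputs,
# so mechanism differences cannot propagate into learner behavior.
script = ["What is recursion?", "I still don't get it.", "Oh, maybe?"]
t1 = run_dialogue(lambda m, h: "hint: " + m, None, script)
t2 = run_dialogue(lambda m, h: "profile-adapted: " + m, None, script)
learner_inputs_1 = [turn[0] for turn in t1]
learner_inputs_2 = [turn[0] for turn in t2]
```

In the scripted branch the learner-input sequences are identical regardless of tutor mechanism; only the `dynamic=True` branch lets a mechanism's output alter subsequent learner turns.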
This confound was unmasked when the same two mechanisms (self-reflection and bidirectional profiling) were tested with a dynamic (ego/superego) learner (Section 6.10, N=120). With a responsive interlocutor, profiling produced a measurable +4.1 pt additive effect, while self-reflection did not. The dynamic learner lowered base scores by approximately 12 points (71.4 vs 84.3 for scripted), creating harder conditions that differentiated mechanisms. The scripted learner's predetermined responses created a ceiling effect that masked genuine mechanism differences.
This finding reframes several earlier null results. The architecture null effect in the original factorial (Section 6.3) may partly reflect the scripted learner's inability to respond differently to different architectures. The mechanism equivalence in the dialectical experiments (Sections 6.8–6.9) similarly reflects an experimental setup that cannot detect mechanism-specific feedback loops. Future work should use dynamic learners when testing mechanism differentiation.
### 7.11 Practical Recommendations for AI Tutor Design
The experimental evidence across thirty-seven evaluations (N=3,383) converges on a clear design hierarchy for building effective AI tutors:
**1. Recognition-enhanced prompts are the single most impactful design decision.** Across every experimental condition, model, and content domain tested, recognition theory produces the largest and most consistent gains: $d = 0.91$–$1.71$ in controlled experiments, +7.6 to +14.8 pts depending on learner type. The investment is purely in prompt design—no architectural changes, no additional API calls, no infrastructure overhead. Any team building an AI tutor should start here.
**2. Multi-agent architecture (ego + superego) is valuable for quality assurance, not creativity.** The superego functions as a filter, not a generator. It catches poor responses (content errors, missed scaffolding opportunities, wrong-domain references) but does not produce qualitatively different output. For well-trained domains with reliable content isolation, the superego adds approximately +0.5 points at 2.7$\times$ latency—a poor cost-benefit ratio. For domain transfer scenarios or deployments where content scoping cannot be guaranteed, it is essential. The practical recommendation: use multi-agent architecture as a safety net during initial deployment, then consider removing it once content reliability is established.
**3. Theory of Mind (profiling) matters only when the learner is dynamic.** Building a model of the learner's cognitive state, epistemic commitments, and response patterns produces a measurable +4.1 pt benefit—but only when the learner actually responds to the tutor's adaptations. In scripted or single-turn interactions (most current AI tutor deployments), profiling is wasted computation. In multi-turn conversational tutoring with genuine learner agency, profiling bridges the insight-action gap that self-reflection alone cannot close.
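
A minimal sketch of what such a profile might look like (hypothetical class and toy heuristics, standing in for the LLM-based profiler):

```python
from dataclasses import dataclass, field

@dataclass
class LearnerProfile:
    """Hypothetical Theory-of-Mind profile: the tutor's running model of
    the learner, updated each turn and injected into the next prompt.
    Only pays off when the learner actually responds to adaptations."""
    misconceptions: list = field(default_factory=list)
    engagement: str = "unknown"      # e.g. "frustrated", "curious"
    last_question_quality: int = 0   # 1-5, judged per turn

    def update(self, learner_msg: str) -> None:
        # Toy heuristics in place of an LLM-based profiling call.
        if "?" in learner_msg:
            self.last_question_quality = max(self.last_question_quality, 3)
        if "don't get" in learner_msg.lower():
            self.engagement = "frustrated"

    def to_prompt_fragment(self) -> str:
        return (f"Learner model: engagement={self.engagement}, "
                f"question_quality={self.last_question_quality}")

profile = LearnerProfile()
profile.update("I don't get why the base case matters?")
fragment = profile.to_prompt_fragment()
```

The design point is the direction the profile supplies: unlike self-reflection, which only diagnoses the tutor's own behavior, the profile gives the ego something external to adapt *toward*.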
**4. Mechanism selection is a second-order optimization.** Nine different mechanisms—including self-reflection, intersubjective dialogue, prompt erosion detection, and combined approaches—all produce equivalent results under recognition. Once recognition prompts are in place, the choice of supplementary mechanism matters far less than whether the learner interaction is genuine. Teams should invest in recognition prompt quality rather than mechanism complexity.
**5. Superego persona type interacts with turn structure.** Adversarial superego dispositions can be destructive in single-turn settings (producing over-deference spirals where the ego removes all prescriptive content), but productive in multi-turn settings where learner feedback provides external grounding. For single-turn deployments, a neutral or advocate superego is safer. For multi-turn conversational contexts, adversarial personas can produce the strongest internal quality control.
**6. Dynamic learners are necessary for mechanism testing but costly.** The scripted learner confound (Section 7.10) means that any evaluation of mechanism effectiveness must use LLM-powered learners that generate genuine responses. However, dynamic learners lower base scores by approximately 12 points and increase latency substantially. For routine quality monitoring, scripted scenarios suffice; for mechanism development and comparison, dynamic learners are essential.
**7. Prefer minimal prompts with relational framing over elaborate prescriptive scaffolding.** The prompt elaboration baseline (Section 6.21) demonstrates that detailed decision heuristics, learner analysis frameworks, and prescriptive rules in the system prompt do not improve—and can actively harm—tutoring quality on capable models. On Haiku, a 35-line prompt outperforms a 344-line prompt by +6.8 pts; on Kimi K2.5, the elaborate prompt is inert. Prompt engineering effort is better invested in specifying the model's *relational orientation* (how it constitutes the learner) than in prescribing *behavioral rules* (which learner state maps to which action). This is consistent with broader findings that capable LLMs perform worse under overly prescriptive instructions that constrain their native reasoning.
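
The contrast between the two styles can be sketched as follows (invented illustrations, not the study's actual 35-line or 344-line prompts):

```python
# Invented illustrations of the two prompt styles -- not the study's prompts.

MINIMAL_RELATIONAL = """\
You are a tutor. Treat the learner as an autonomous subject whose
contributions genuinely shape this encounter.
Respond as JSON: {"message": str, "suggested_action": str}"""

ELABORATE_PRESCRIPTIVE = """\
You are a tutor. First classify the learner state as one of
[CONFUSED, STALLED, PROGRESSING, MASTERING]. Then select the action
from the decision matrix: CONFUSED -> re-explain; STALLED -> hint;
PROGRESSING -> extend; MASTERING -> assess. Never deviate from the
matrix. (...plus hundreds of further lines of heuristics and frameworks)"""
```

The first specifies a relational orientation and an output contract and stops; the second prescribes a behavioral state machine. The finding is that on capable models the short relational prompt wins.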
**8. Token budgets can be reduced substantially without losing recognition power.** The token budget sensitivity test (Section 6.22) shows that reducing `max_tokens` from 8000 to 2048 or 512 produces no measurable quality degradation on either base or recognition conditions. The recognition effect is fully preserved across all budget levels tested. For production deployments where per-call cost or latency scales with requested token budget, a 4–16$\times$ reduction is available at no quality cost. The floor is set by JSON formatting requirements: budgets below approximately 500 tokens risk truncating structured output, triggering retries that negate cost savings.
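
A minimal sketch of how a deployment might apply this finding (hypothetical helper, not part of the package):

```python
def pick_max_tokens(requested: int, json_output: bool = True, floor: int = 512) -> int:
    """Clamp the per-call token budget per the sensitivity finding: budgets
    down to ~512 preserved quality, but going lower risks truncating the
    structured JSON output and triggering retries that negate the savings."""
    if json_output:
        return max(requested, floor)
    return requested

# A 4-16x reduction from the original 8000-token budget:
budget = pick_max_tokens(512)       # the tested floor
aggressive = pick_max_tokens(256)   # clamped up to protect JSON output
```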
**9. There is a minimum ego capability threshold for mechanism benefit.** The cognitive prosthesis test (Section 6.10) demonstrates that architectural mechanisms are not model-agnostic: the same mechanism stack (profiling, self-reflection, prompt rewriting, cross-turn memory) that boosts Haiku by +20 points *hurts* Nemotron by $-15$ points. Nemotron succeeds on static dimensions (specificity 4.0, actionability 4.0) but fails catastrophically on dynamic context integration (tutor adaptation 1.8, dialectical responsiveness 2.0). Deploying complex multi-agent architectures on weaker ego models is actively counterproductive—simpler configurations produce better results. Teams should validate that their ego model can process multi-turn context before investing in mechanism complexity.
---
## 8. Limitations and Future Work
**Simulated learners**: Our evaluation uses scripted and LLM-generated learner turns rather than real learners. While this enables controlled comparison, it may miss dynamics that emerge in genuine interaction.
**LLM-based evaluation**: Using an LLM judge to evaluate recognition quality may introduce biases. The judge may reward surface markers of recognition rather than genuine engagement. Inter-judge reliability analysis (Section 5.8) reveals that different AI judges show only moderate agreement (r=0.33–0.66), with qualitative analysis suggesting judges weight criteria differently—Claude prioritizes engagement while Kimi prioritizes structural completeness. A cross-judge replication with GPT-5.2 (Section 6.19) confirms the recognition main effect (d$\approx$0.9 in the factorial, d=1.54 in the memory isolation experiment) and multi-agent null effects are judge-robust, though GPT-5.2 finds compressed effect magnitudes (37–59% of Claude's depending on experiment). The memory isolation recognition dominance pattern replicates with identical condition ordering under both judges (inter-judge r=0.63, N=120). Notably, the recognition-vs-enhanced increment (+8.0 under Claude) does not reach significance under GPT-5.2, warranting caution on the precise magnitude of recognition's unique contribution. This validates our use of within-judge comparisons but cautions against treating absolute scores or specific effect magnitudes as objective measures. Additionally, LLM judges are subject to version drift: provider updates to model weights or decoding behavior could shift scoring distributions between evaluation campaigns, a known concern in the LLM-as-Judge literature [@gu2025surveyjudge]. Our within-run comparisons are insulated from this risk (all conditions in a given analysis use the same judge version), but absolute scores may not be directly comparable across studies or future replications using updated model versions. Concretely, our primary judge (Claude Opus) was updated from version 4.5 to 4.6 during data collection (February 5, 2026). 
To eliminate any residual version drift concern, all early runs originally judged under Opus 4.5 were rejudged under Opus 4.6, so the complete evaluation dataset now uses a single judge version. An empirical check on matched conditions (kimi ego, cells 1 and 5) before and after rejudging shows stable recognition deltas (+16.3 original vs +15.6 rejudged) with absolute scores shifting by only ~2 points, confirming that the version transition did not differentially affect experimental conditions.
**Memory isolation experiment**: A corrected 2×2 memory isolation experiment (N=120 across two runs; Section 6.2) isolated recognition and memory factors: recognition is the primary driver (d=1.71), while memory provides a modest secondary benefit (d=0.46, $p \approx .08$). The experiment uses a smaller sample (N=120) than the original uncorrected runs, but the very large effect sizes (d=1.71 for recognition) provide high statistical power. A cross-judge replication with GPT-5.2 confirms recognition dominance (d=1.54), identical condition ordering, and the negative interaction (ceiling effect), with inter-judge r=0.63 (Section 6.19).
**Active control limitations**: The post-hoc active control (N=118; Section 6.2) was designed after observing recognition effects, not as part of the original experimental protocol. The active control ran on Nemotron while the primary factorial used Kimi K2.5, requiring same-model comparisons to avoid conflating model differences with treatment effects. Within Nemotron data, the ordering is clear: recognition (~73) > active control (66.5) > base (~58), with recognition gains (~+15 pts) roughly doubling the active control's benefit (~+9 pts). This same-model analysis supports the conclusion that recognition theory provides specific value beyond generic pedagogical elaboration, but the comparison would be more precise if conducted on the same model as the primary factorial. Running the active control on Kimi K2.5 is a clear next step that would establish direct comparability with the factorial conditions. Additionally, the base prompts were already designed to produce competent tutoring with no length constraint; the active control functions as a *pedagogically-enriched* condition containing real instructional content (growth mindset language, Bloom's taxonomy, scaffolding strategies), rather than a true inert placebo.
**Model dependence**: Results were obtained with specific models (Kimi K2.5, Nemotron). An early exploratory A×B analysis (N=17, Nemotron, data no longer in DB) suggested recognition-specific multi-agent synergy, but a multi-model probe across five ego models (N=655, Section 6.4) decisively refutes this, confirming architecture and recognition as additive. The recognition main effect, by contrast, replicates across all five models and domains.
**Domain sampling and content isolation**: We tested two domains (philosophy, elementary math). A follow-up run (eval-2026-02-05-e87f452d) tested elementary content with Kimi K2.5, partially addressing the model confound in the original Nemotron-only elementary results. The recognition main effect replicated (+9.9 pts, d $\approx$ 0.61), though the factor inversion pattern from Table 9 (architecture dominance on elementary) was partly model-dependent: Kimi showed recognition dominance on elementary content, while Nemotron showed architecture dominance. Post-hoc investigation (Section 6.6) identified two content isolation bugs that caused philosophy references to appear in one elementary scenario (`new_student_first_visit`, 16/24 responses affected). These bugs—a content resolver fallback and hardcoded prompt examples—have been fixed but partly inflated the architecture effect on elementary content, since multi-agent cells caught the errors while single-agent cells did not. The Kimi architecture effect (+3.0 pts) is likely more representative than the Nemotron effect (+9.9 pts). Broader domain sampling beyond two content areas, with verified content isolation, would further strengthen generalizability claims.
**Synthetic learning outcomes only**: All evaluations measure tutor response quality and simulated learner behavior, not actual learning. The synthetic learning outcome index (Section 6.15.2) provides a proxy from learner rubric dimensions (revision signals, question quality, conceptual engagement), and all conditions show substantial learning arcs (15–21 pts). However, these are AI-judge assessments of LLM-generated learner turns—measuring the *quality of simulated learning behavior*, not knowledge acquisition, comprehension, or transfer. Whether recognition-enhanced tutoring produces genuine learning gains in human learners remains the critical open question.
**Short-term evaluation**: We evaluate individual sessions, not longitudinal relationships. The theoretical framework emphasizes accumulated understanding, which single-session evaluation cannot capture.
**Bilateral transformation asymmetry**: The bilateral transformation metrics (Section 6.15), now based on N=118 dialogues across three multi-turn scenarios, confirm that recognition-prompted tutors adapt more (+26% relative improvement in adaptation index). However, learner growth is slightly *lower* under recognition (0.210 vs 0.242), complicating the theoretical claim of *mutual* transformation. The effect is better characterized as tutor-side responsiveness. The learner growth index measures observable message complexity markers (revision language, connective reasoning), which may not capture all forms of learner benefit—recognition tutors may reduce visible struggle precisely by being more effective.
**Dynamic rewriting evolution**: The step-by-step evolution analysis (Section 6.18) tracks cell 21 across three iterative development runs with small sample sizes (13–15 scored responses per cell per run, 82 total). The runs are not independent experiments—each includes implementation improvements beyond Writing Pad activation. While the trajectory from trailing to leading is clear, a controlled ablation isolating only the Writing Pad variable would provide stronger causal evidence. All three runs use free-tier models (Nemotron ego, Kimi K2.5 superego), and generalization to other model combinations is unknown.
**Scripted learner confound**: The mechanism robustness testing (Section 6.10, cells 40–59, N=360) uses a single-agent (scripted) learner whose responses are predetermined. This design prevents feedback loops between tutor mechanisms and learner behavior, rendering all mechanisms causally inert—they cannot influence what the learner says next. The resulting null result (all mechanisms cluster within 2.4 pts) reflects the experimental design rather than genuine mechanism equivalence. The dynamic learner results (cells 60–65, 69–70, N=300) partially address this confound, demonstrating that mechanisms do differentiate with genuine feedback loops, but cover only four mechanisms and two scenarios.
**Qualitative assessor blinding**: The initial qualitative transcript assessments (Section 6.11) were conducted by an AI judge (Claude Opus) with access to condition labels. Two blinded replications (condition metadata stripped, Table 21b) tested for assessor bias: the first used Haiku, the second used the same model (Opus). The same-model blinded replication confirms that Opus's tag assignments are largely unchanged by blinding (stalling base 100%→91.4%, recognition\_moment base 0%→5.2%), while the Haiku-blinded softening reflects model calibration differences rather than a genuine blinding effect. The near-perfect binary separations in Tables 20–21 are therefore robust rather than inflated. All assessment remains LLM-based; human expert coding would provide independent validation of the qualitative patterns.
### 8.2 Future Directions
**Cross-application transfer**: Test whether recognition-oriented design transfers to domains beyond tutoring—therapy bots, customer service, creative collaboration.
**Learner superego redesign**: The learner superego paradox (Section 6.16) suggests the current learner ego/superego prompts optimize for "good student responses" rather than "authentic student responses." A redesigned learner superego that critiques for *inauthenticity*—pushing the ego toward messier, more persona-consistent responses—might produce multi-agent learners that enhance rather than degrade learner quality. This would test whether the paradox reflects a fundamental limitation of internal self-critique or merely poor prompt calibration.
**Mechanistic binding with dynamic learners**: The scripted learner confound (Section 6.10) demonstrates that mechanism testing requires dynamic interlocutors capable of genuine feedback loops. Cells 60–65 cover self-reflection, profiling, intersubjective framing, and combined mechanisms with dynamic learners, but quantitative disposition and prompt erosion remain untested in this configuration. Expanding to the full mechanism space with base/recognition pairs (not just recognition-only) and additional scenarios would provide a complete mechanism matrix.
**Theory of Mind architecture**: The other-ego profiling results (Section 6.10) suggest that explicit Theory of Mind—building and maintaining a model of the interlocutor—provides additive benefit (+4.1 pts) with dynamic learners. Bidirectional profiling (both tutor and learner maintain models of each other) and strategy planning based on these profiles represent a promising architectural direction that warrants systematic exploration.
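Structurally, bidirectional profiling amounts to each party maintaining a rolling model of the other. A minimal sketch, assuming hypothetical field names (the package's actual profile format is defined in its agent configuration, not here):

```python
from dataclasses import dataclass, field

@dataclass
class InterlocutorProfile:
    """Rolling model of the other party, updated after each turn."""
    misconceptions: list[str] = field(default_factory=list)
    engagement_markers: list[str] = field(default_factory=list)
    turn_count: int = 0

    def update(self, observed: list[str], markers: list[str]) -> None:
        self.turn_count += 1
        # Drop misconceptions no longer observed, append newly observed ones.
        self.misconceptions = [m for m in self.misconceptions if m in observed]
        self.misconceptions += [m for m in observed if m not in self.misconceptions]
        self.engagement_markers = markers

# Bidirectional: the tutor profiles the learner and vice versa.
learner_seen_by_tutor = InterlocutorProfile()
tutor_seen_by_learner = InterlocutorProfile()
```

Strategy planning would then condition each agent's next turn on its current profile of the other, which is what distinguishes this from stateless per-turn prompting.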
**Qualitative assessment blinding**: The same-model blinded assessment (Table 21b) confirmed that Opus's tag discrimination is robust to condition label removal (e.g., stalling 100% → 91.4% in base, recognition\_moment 0% → 5.2% in base). The earlier apparent softening under blinding was a model calibration artifact (Haiku tags more liberally). While this addresses the primary blinding concern, all qualitative assessments were conducted by LLM judges rather than human expert coders—a limitation shared with the quantitative evaluation. Human raters applying established qualitative coding frameworks (e.g., thematic analysis, discourse analysis) would provide independent validation of the AI-discovered themes and tag distributions.
**Superego parse robustness**: The cognitive prosthesis analysis (Section 6.10) revealed that the Kimi K2.5 superego returns malformed JSON on 16–45% of reviews, silently disabling quality control through automatic approval. Structured output enforcement, retry logic, or prompt engineering for JSON reliability would reduce this failure mode. The adversary prompt's lower parse failure rate (11.5% vs 21.8% for descriptive) suggests that prompt structure itself affects JSON reliability from thinking models—a finding with implications for any system using LLM-generated structured output.
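The retry-and-log pattern this suggests can be sketched as follows; the function and field names are hypothetical and stand in for whatever the eval harness actually uses:

```python
import json

def parse_superego_review(raw: str, retries_left: int, rerun) -> dict:
    """Parse a superego review, retrying on malformed JSON instead of
    silently auto-approving the ego's draft. `rerun` re-invokes the
    superego and returns its raw text (hypothetical callable)."""
    try:
        review = json.loads(raw)
    except json.JSONDecodeError:
        if retries_left > 0:
            return parse_superego_review(rerun(), retries_left - 1, rerun)
        # Retries exhausted: fall back to approval, but record the parse
        # failure so auto-approval rates stay visible in run metadata.
        return {"verdict": "approve", "parse_failure": True}
    review.setdefault("parse_failure", False)
    return review
```

Thinking models often wrap JSON in prose or code fences, so stripping fences before `json.loads`, or using a provider's structured-output mode where one exists, would likely remove a further slice of the 16–45% failure band.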
**Capability threshold mapping**: The prosthesis test establishes that Nemotron falls below and Haiku falls above the minimum ego capability threshold for mechanism benefit. Testing intermediate models (GLM-4.7, DeepSeek V3.2) would map the threshold more precisely and determine whether it corresponds to specific capabilities (context window utilization, meta-cognitive processing, behavioral flexibility) that could be assessed independently.
**Adaptive mechanism loading**: Rather than deploying a fixed mechanism stack, systems could load mechanisms based on ego model capability—simpler configurations for weaker models, full stacks for capable ones. The two-tier capability analysis (static vs dynamic dimensions) suggests that mechanisms targeting static capabilities (e.g., content retrieval, formatting) could benefit weaker models, while mechanisms targeting dynamic capabilities (adaptation, dialectical responsiveness) should be reserved for models above the capability threshold.
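The tiering above can be sketched as a simple threshold lookup; the mechanism names mirror the text, while the capability score and threshold value are illustrative assumptions rather than quantities measured in the paper:

```python
# Illustrative capability-gated mechanism loading: weaker egos get only
# static-dimension mechanisms; egos above the threshold also get the
# dynamic-adaptation stack.
STATIC_MECHANISMS = ["content_retrieval", "formatting"]
DYNAMIC_MECHANISMS = ["self_reflection", "profiling", "intersubjective_framing"]

def select_mechanisms(ego_capability: float, threshold: float = 0.5) -> list[str]:
    """Return the mechanism stack appropriate to the ego model's capability."""
    stack = list(STATIC_MECHANISMS)
    if ego_capability >= threshold:
        stack += DYNAMIC_MECHANISMS
    return stack
```

In practice the capability score would come from a one-time assessment of the ego model (context utilization, adaptation probes), not from a hand-set constant.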
---
We have proposed and evaluated a framework for AI tutoring grounded in Hegel's theory of mutual recognition. Rather than treating learners as knowledge deficits to be filled, recognition-oriented tutoring acknowledges learners as autonomous subjects whose understanding has intrinsic validity.
An evaluation framework (N=3,383 primary scored responses across thirty-seven key evaluations; N=7,000+ across the full development database) provides evidence that recognition theory has unique value, subject to the limitations discussed in Section 8.1:
1. **Recognition as primary driver (the definitive finding)**: A corrected 2×2 memory isolation experiment (N=120 across two independent runs) demonstrates that recognition theory is the primary driver of tutoring improvement: recognition alone produces d=1.71 (+15.2 pts), while memory alone provides only a modest, non-significant benefit (d=0.46, +4.8 pts, $p \approx .08$). The combined condition reaches d=1.81 (+15.8 pts vs base), with ceiling effects at ~91 limiting further gains. The full factorial (N=350) confirms recognition as the dominant factor ($d=1.11$, $\eta^2$=.243), with consistent effects across learner types (+15.7 single, +13.0 multi; A×C n.s.). A post-hoc active control (N=118) using length-matched prompts with generic pedagogical content provides partial corroboration: same-model comparisons show the active control scores approximately 9 points above base while recognition scores approximately 15 points above base. A three-way comparison (N=36) found recognition outperforms enhanced prompting by +8.0 points, consistent with recognition dominance, though the increment does not replicate under GPT-5.2 (+2.4 pts, n.s.). Recognition theory is directly effective and does not require memory infrastructure to manifest.
2. **Architecture is additive, not synergistic**: A multi-model probe across five ego models (N=655; Section 6.4, Table 8) shows that multi-agent architecture does not meaningfully interact with recognition prompts. The A$\times$B interaction ranges from -5.7 to -0.7 across the five models tested (mean -2.2), all negative and consistent with ceiling effects. The original exploratory finding (+9.2 on N=17, Nemotron) was sampling noise. Architecture provides a small additive effect (-0.8 to +3.7 pts) largely independent of prompt type.
3. **Tutor adaptation**: Recognition-prompted tutors measurably adapt their approach in response to learner input (adaptation index +26% higher than baseline across N=118 multi-turn dialogues and three scenarios). However, learner-side growth is not higher under recognition, suggesting the effect is tutor-side responsiveness rather than symmetric mutual transformation. This provides partial empirical grounding for recognition theory: recognition prompts produce tutors that are genuinely shaped by the encounter, even if the "mutual" claim requires qualification.
4. **Domain generalizability**: Recognition advantage replicates across both philosophy and elementary math, and across both Kimi and Nemotron models, though with only two content domains tested. On elementary content with Kimi (N=60), recognition provides +8.2 pts, with effects concentrated in challenging scenarios. Architecture provides a small additive benefit in both domains (+2.3 elementary, +1.0 philosophy). Broader domain coverage (technical STEM, creative writing, social-emotional content) is needed before generalizability can be considered established.
5. **Multi-agent as reality testing**: On new domains, the Superego catches content isolation failures—whether from system-level bugs (content resolver fallbacks, hardcoded prompt examples) or model defaults. This error-correction function is essential for domain transfer, particularly when content scoping cannot be guaranteed at the system level.
6. **Writing Pad activation coincides with dynamic rewriting improvement**: A step-by-step evolution analysis (N=82 across three iterative development runs) shows that dynamic prompt rewriting (cell 21) progresses from trailing its static baseline by 7.2 points to leading by 5.5 points, with the improvement coinciding with Writing Pad memory activation (Section 6.18). Every rubric dimension improves. This trajectory is consistent with the Freudian Mystic Writing Pad (Section 3.4) functioning as an important enabler for dynamic adaptation, though the uncontrolled nature of the iterative runs means a controlled ablation is needed to confirm the causal role.
7. **Cross-judge robustness**: A replication with GPT-5.2 as independent second judge (Section 6.19) confirms the recognition main effect (d$\approx$0.9 in the factorial, d=1.54 in the memory isolation experiment), recognition dominance in the 2×2 design (identical condition ordering, negative interaction), and multi-agent null effects. GPT-5.2 finds compressed magnitudes (37–59% of Claude's effect sizes depending on experiment) but always in the same direction. The recognition-vs-enhanced increment (+8.0 under Claude) does not reach significance under GPT-5.2 (+2.4 pts, n.s.), warranting caution on the precise magnitude of recognition's unique contribution beyond enhanced prompting.
8. **Dialectical impasse and resolution strategy**: Recognition's advantage is largest under sustained intellectual challenge (Section 6.20). Three 5-turn impasse scenarios (N=24) show recognition outperforming base by +43 pts on epistemic resistance and +29 pts on interpretive deadlock, while showing no advantage on affective shutdown ($\Delta$ = $-$1.1). Post-hoc resolution strategy coding reveals the mechanism: every base tutor (12/12) withdraws from the dialectical encounter entirely—noting engagement metrics while ignoring the learner's substantive position—while recognition tutors predominantly (10/12) use scaffolded reframing, preserving the learner's objection while redirecting toward new conceptual ground ($\chi^2(3) = 24.00$, $p < .001$, $V = 1.000$). The dominance of scaffolded reframing (Aufhebung) over genuine mutual recognition (1/12) suggests that recognition prompts produce sophisticated pedagogical technique—the capacity to hold contradiction productively—rather than genuine mutual transformation.
9. **The learner superego paradox**: A symmetric learner-side evaluation (Section 6.16, N=118 bilateral dialogues) reveals that the multi-agent learner architecture *hurts* learner quality ($d = 1.43$, $F(1,114) = 68.28$, $p < .001$)—the largest effect in the study. The ego/superego process polishes away the messy, persona-consistent engagement that characterizes genuine student behavior. Recognition partially rescues multi-agent learner quality ($d = 0.79$, $p = .004$) while having no effect on already-high single-agent learner quality. On the tutor rubric, recognition helps both learner types robustly (+15.7 single, +13.0 multi; A×C n.s.); on the learner rubric, recognition helps multi-agent learners selectively (+9.5 vs -1.3 pts). The same mechanism—recognition as external validation that creates space for authentic engagement—counteracts the superego's tendency to over-process learner responses. Internal deliberation depth remains uniformly poor (2.7/5) regardless of recognition, confirming that recognition works *around* the superego rather than through it. The Hegelian interpretation is direct: external recognition from an Other is structurally more effective than internal self-critique.
10. **The superego as filter, not improver**: Dialectical superego modulation testing (Section 6.8, N=174) reveals that the multi-agent superego functions as a quality filter—preventing poor responses—rather than an active improver. Structural modulation metrics (negation depth, convergence) show large per-turn variation ($d = -2.01$ to $-2.45$) but do not predict outcome quality. The adversary persona over-defers to the ego under recognition, reducing its critical function. These findings reinforce the additivity thesis: architecture provides a floor through error correction, not a ceiling through generative contribution.
11. **Self-reflective evolution amplifies recognition**: When ego and superego are given between-turn self-reflection (Section 6.9, cells 40–45, N=90), recognition's effect size rises to $d = 0.91$—2.4$\times$ the dialectical-only condition ($d = 0.38$). A striking disposition gradient emerges: the more hostile the superego (suspicious +19.0, adversary +10.9, advocate +2.6), the more recognition helps—hostile dispositions become productive under recognition but are destructive without it. However, an insight-action gap persists: the superego's reflections acknowledge the need for change without producing fundamentally different critique behavior.
12. **Mechanisms require dynamic interlocutors**: Nine mechanisms (self-reflection, profiling, quantitative disposition, prompt erosion, intersubjective framing, combined, adversary, advocate, base dialectical) cluster within 2.4 pts under recognition when tested with scripted learners (Section 6.10, N=360). The scripted learner confound renders mechanisms causally inert—they cannot influence predetermined responses. When tested with dynamic (multi-agent) learners (N=300), mechanisms genuinely differentiate for the first time: profiling and combined mechanisms reach 88.8 and 87.8 while intersubjective framing reaches only 82.8—a 6.0-point spread. Recognition's effect doubles (+14 pts vs +7.5 scripted), and a Nemotron cross-model replication (N=360) confirms the pattern at lower absolute scores. A qualitative transcript assessment (Section 6.11) provides narrative evidence for the mechanism: recognition gives the ego the capacity to be *changed by* its internal critic rather than merely *compliant with* it.
13. **Prompt elaboration does not explain recognition effects**: A prompt elaboration baseline (Section 6.21, N=144) comparing the full 344-line base prompt against a 35-line naive prompt (JSON schema only, no pedagogical guidance) demonstrates that the recognition effect cannot be attributed to prompt length or instructional detail. On Haiku, the naive prompt *outperforms* the elaborate base by +6.8 pts—the prescriptive decision heuristics actively constrain the model's superior pedagogical intuitions. On Kimi K2.5, the elaborate prompt is inert ($\Delta = -0.3$). Recognition ($M = 90.9$ on Haiku) remains well above the naive baseline ($M = 82.5$), confirming that recognition adds value through relational orientation rather than instructional specificity. This addresses the deflationary concern that recognition effects might be an artifact of more detailed prompting: stripping 90% of the base prompt's content does not diminish scores, but recognition theory still provides a substantial further gain.
14. **Minimum ego capability threshold for mechanism benefit**: A cognitive prosthesis test (Section 6.10, N=90) pairing a weak ego (Nemotron) with a strong superego (Kimi K2.5) armed with the full mechanism suite demonstrates that architectural scaffolding is not model-agnostic. The mechanism stack that boosts Haiku by +20 points *hurts* Nemotron by $-15$ points, yielding scores ($M = 49.5$) well below Nemotron's own simple baseline ($M = 64.2$). Dimension analysis reveals a two-tier capability structure: Nemotron succeeds on static dimensions (specificity 4.0, actionability 4.0) but fails on dynamic context integration (tutor adaptation 1.8, 86% failure rate). A Haiku control smoke test (N=6) confirms the model-dependence: identical mechanisms score 90+ with Haiku. A contributing factor is silent superego failure—the Kimi K2.5 superego returns malformed JSON on 16–45% of reviews, auto-approving the ego's draft. The adversary superego prompt produces the most parseable output (11.5% failure) and the highest scores, suggesting that superego JSON reliability is a first-order concern for multi-agent deployments.
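As a sanity check on the statistics in finding 8, the reported $\chi^2(3) = 24.00$ and $V = 1.000$ follow directly from the strategy counts, since any perfectly separated 2×k contingency table yields $\chi^2 = N$. The placement of the one unaccounted-for recognition dialogue in a fourth strategy column is an assumption here:

```python
# Recompute chi-square, df, and Cramer's V for the strategy-by-condition
# table in finding 8. Base: 12/12 withdrawal. Recognition: 10/12 scaffolded
# reframing, 1/12 mutual recognition, and (assumed here) 1/12 in a fourth
# strategy column.
table = [
    [12, 0, 0, 0],   # base tutors
    [0, 10, 1, 1],   # recognition tutors
]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

chi2 = 0.0
for row, rt in zip(table, row_totals):
    for obs, ct in zip(row, col_totals):
        expected = rt * ct / n
        chi2 += (obs - expected) ** 2 / expected

df = (len(table) - 1) * (len(table[0]) - 1)
cramers_v = (chi2 / (n * min(len(table) - 1, len(table[0]) - 1))) ** 0.5

print(chi2, df, cramers_v)  # prints: 24.0 3 1.0
```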
These results suggest that operationalizing philosophical theories of intersubjectivity can produce concrete improvements in AI system performance. They also reveal boundary conditions: recognition theory's value varies by content domain and interaction type, and multi-agent architecture's value depends on deployment context. Perhaps most striking is the learner superego paradox (Finding 9): the largest single effect in the study ($d = 1.43$) comes not from what helps but from what hurts—internal self-critique degrades learner quality more than any other factor improves it. This underscores the paper's central Hegelian claim: genuine transformation requires encounter with an Other, not refinement by the Self.
The broader implication is for AI alignment. If mutual recognition is pedagogically superior, and if mutual recognition requires the AI to be genuinely shaped by human input, then aligned AI might need to be constitutionally open to transformation. Recognition-oriented AI does not just respond to humans; it is constituted, in part, through the encounter. We emphasize the distinction drawn in Section 3.3: these results demonstrate recognition-*oriented* design (level 3)—prompts that produce recognition-like behavior—not recognition *proper* (level 1), which would require genuine intersubjective consciousness. The pedagogical gains are real; the philosophical question of whether the AI truly *recognizes* the learner remains open.
In summary, this paper has connected Hegelian recognition theory to AI pedagogy (Section 3), implemented that theory through a multi-agent architecture grounded in Freudian structural theory (Section 4), and tested it empirically across thirty-seven key evaluations (Section 6). The central finding—that recognition-enhanced prompting is the dominant driver of tutoring improvement—was established through memory isolation (Section 6.2), confirmed in a full factorial (Section 6.3), partially corroborated by active control (Section 6.2), validated by an independent GPT-5.2 judge (Section 6.19), and further sharpened by a dialectical impasse test with resolution strategy coding (Section 6.20) showing that base tutors withdraw from dialectical encounter while recognition tutors hold and reframe contradiction—and a symmetric learner-side evaluation (Section 6.16) showing that recognition provides external self-regulation more effectively than internal ego/superego deliberation. Phase 2 experiments (Sections 6.8–6.11) deepen this understanding: the superego functions as a quality filter rather than an active improver, self-reflection amplifies recognition's effect, and mechanism differentiation requires dynamic interlocutors capable of genuine feedback loops. The theoretical framework, empirical methodology, and practical implications together suggest that philosophical theories of intersubjectivity can serve as productive design heuristics for AI systems.
## References
Tests whether recognition theory adds value beyond prompt engineering.
```bash
# Run the 3-way comparison (base, enhanced, recognition)
CELLS="cell_1_base_single_unified"
CELLS+=",cell_9_enhanced_single_unified"
CELLS+=",cell_5_recog_single_unified"
node scripts/eval-cli.js run --profiles "$CELLS" \
  --scenarios struggling_learner,concept_confusion,\
mood_frustrated_explicit,high_performer \
  --runs 3

# Analyze results
node scripts/eval-cli.js report <run-id>
```

```bash
# Run full factorial (8 cells × 15 scenarios × 3 reps)
CELLS="cell_1_base_single_unified,cell_2_base_single_psycho"
CELLS+=",cell_3_base_multi_unified,cell_4_base_multi_psycho"
CELLS+=",cell_5_recog_single_unified,cell_6_recog_single_psycho"
CELLS+=",cell_7_recog_multi_unified,cell_8_recog_multi_psycho"
node scripts/eval-cli.js run --profiles "$CELLS" --runs 3
```
### B.3 A×B Interaction Test
```bash
# Recognition vs Enhanced × Single vs Multi comparison
CELLS="cell_5_recog_single_unified,cell_7_recog_multi_unified"
CELLS+=",cell_9_enhanced_single_unified,cell_11_enhanced_multi_unified"
node scripts/eval-cli.js run --profiles "$CELLS" \
  --scenarios struggling_learner,concept_confusion,mood_frustrated_explicit \
  --runs 3
```
```bash
# Run with elementary content (4th grade fractions)
# Uses all 8 factorial cells × 5 elementary scenarios
CELLS="cell_1_base_single_unified,cell_2_base_single_psycho"
CELLS+=",cell_3_base_multi_unified,cell_4_base_multi_psycho"
CELLS+=",cell_5_recog_single_unified,cell_6_recog_single_psycho"
CELLS+=",cell_7_recog_multi_unified,cell_8_recog_multi_psycho"
EVAL_CONTENT_PATH=./content-test-elementary \
EVAL_SCENARIOS_FILE=./content-test-elementary/scenarios-elementary.yaml \
node scripts/eval-cli.js run --profiles "$CELLS" --runs 1
```
### B.5 Dynamic Prompt Rewriting Evolution

```bash
# Run cell_7 (static baseline) vs cell_21 (dynamic rewrite + Writing Pad)
node scripts/eval-cli.js run \
  --profiles cell_7_recog_multi_unified,cell_21_recog_multi_unified_rewrite \
  --scenarios misconception_correction_flow,\
mood_frustration_to_breakthrough,mutual_transformation_journey \
  --runs 5
```
### B.6 Resolution Strategy Coding (Section 6.20)
```bash
# Code impasse dialogues into Hegelian resolution strategies
node scripts/code-impasse-strategies.js \
  --model claude-opus-4.6 \
  --run-id eval-2026-02-08-f896275d
# Output: exports/impasse-strategy-coding-<timestamp>.json and .md
```
### B.7 Dialectical Superego Modulation (Section 6.8)
```bash
# Standard ego + divergent superego (cells 22-27)
CELLS="cell_22_base_suspicious_unified"
CELLS+=",cell_23_recog_suspicious_unified"
CELLS+=",cell_24_base_adversary_unified"
CELLS+=",cell_25_recog_adversary_unified"
CELLS+=",cell_26_base_advocate_unified"
CELLS+=",cell_27_recog_advocate_unified"
node scripts/eval-cli.js run --profiles "$CELLS" --runs 2

# Dialectical ego + divergent superego, multi-turn (cells 28-33)
CELLS="cell_28_base_dialectical_suspicious_unified"
CELLS+=",cell_29_recog_dialectical_suspicious_unified"
CELLS+=",cell_30_base_dialectical_adversary_unified"
CELLS+=",cell_31_recog_dialectical_adversary_unified"
CELLS+=",cell_32_base_dialectical_advocate_unified"
CELLS+=",cell_33_recog_dialectical_advocate_unified"
node scripts/eval-cli.js run --profiles "$CELLS" --runs 5
```
### B.8 Mechanism Robustness (Section 6.10)
```bash
# Scripted learner mechanisms (cells 40-59), Haiku ego
CELLS="cell_40_base_dialectical_suspicious_unified_superego"
CELLS+=",cell_41_recog_dialectical_suspicious_unified_superego"
CELLS+=",cell_42_base_dialectical_adversary_unified_superego"
CELLS+=",cell_43_recog_dialectical_adversary_unified_superego"
CELLS+=",cell_44_base_dialectical_advocate_unified_superego"
CELLS+=",cell_45_recog_dialectical_advocate_unified_superego"
CELLS+=",cell_46_base_dialectical_suspicious_unified_quantitative"
CELLS+=",cell_47_recog_dialectical_suspicious_unified_quantitative"
CELLS+=",cell_48_base_dialectical_suspicious_unified_erosion"
CELLS+=",cell_49_recog_dialectical_suspicious_unified_erosion"
CELLS+=",cell_50_base_dialectical_suspicious_unified_intersubjective"
CELLS+=",cell_51_recog_dialectical_suspicious_unified_intersubjective"
CELLS+=",cell_52_base_dialectical_suspicious_unified_combined"
CELLS+=",cell_53_recog_dialectical_suspicious_unified_combined"
CELLS+=",cell_54_base_dialectical_profile_tutor"
CELLS+=",cell_55_recog_dialectical_profile_tutor"
CELLS+=",cell_56_base_dialectical_profile_bidirectional"
CELLS+=",cell_57_recog_dialectical_profile_bidirectional"
CELLS+=",cell_58_recog_dialectical_profile_bidirectional_full"
CELLS+=",cell_59_recog_dialectical_profile_bidirectional_strategy"
node scripts/eval-cli.js run --profiles "$CELLS" --runs 2

# Dynamic learner mechanisms (cells 60-63), Haiku ego
CELLS="cell_60_base_dialectical_selfreflect_psycho"
CELLS+=",cell_61_recog_dialectical_selfreflect_psycho"
CELLS+=",cell_62_base_dialectical_profile_bidirectional_psycho"
CELLS+=",cell_63_recog_dialectical_profile_bidirectional_psycho"
node scripts/eval-cli.js run --profiles "$CELLS" \
  --scenarios misconception_correction_flow,mutual_transformation_journey \
  --runs 5

# Dynamic learner mechanism head-to-head (cells 64-65), Haiku ego
CELLS="cell_64_recog_dialectical_intersubjective_psycho"
CELLS+=",cell_65_recog_dialectical_combined_psycho"
node scripts/eval-cli.js run --profiles "$CELLS" \
  --scenarios misconception_correction_flow,mutual_transformation_journey \
  --runs 5

# Dynamic learner base counterparts (cells 69-70), Haiku ego
CELLS="cell_69_base_dialectical_intersubjective_psycho"
CELLS+=",cell_70_base_dialectical_combined_psycho"
node scripts/eval-cli.js run --profiles "$CELLS" \
  --scenarios misconception_correction_flow,mutual_transformation_journey \
  --runs 5
|
|
2270
|
+
```
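The repeated `CELLS+=` lines above can also be generated with a loop, which avoids manual comma bookkeeping when cell lists grow. A minimal bash sketch (profile names copied from the cells 60–63 block above; the `eval-cli.js` invocation itself is unchanged):

```bash
# Build a comma-joined profile list without hand-placed commas.
PROFILES=""
for P in \
  cell_60_base_dialectical_selfreflect_psycho \
  cell_61_recog_dialectical_selfreflect_psycho \
  cell_62_base_dialectical_profile_bidirectional_psycho \
  cell_63_recog_dialectical_profile_bidirectional_psycho
do
  # ${PROFILES:+...} expands to "$PROFILES," only once the list is non-empty,
  # so no leading or trailing comma is produced.
  PROFILES="${PROFILES:+$PROFILES,}$P"
done
echo "$PROFILES"
```

The resulting `$PROFILES` string is passed to `--profiles` exactly as `"$CELLS"` is above.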

### B.8b Prompt Elaboration Baseline (Section 6.21)

```bash
# Naive baseline, Haiku ego
node scripts/eval-cli.js run \
  --profiles cell_1_base_single_unified,cell_71_naive_single_unified \
  --runs 6

# Naive baseline, Kimi ego
node scripts/eval-cli.js run \
  --profiles cell_1_base_single_unified,cell_71_naive_single_unified \
  --runs 6 --model openrouter.kimi
```

### B.8c Token Budget Sensitivity (Section 6.22)

```bash
# Token budget dose-response (256, 512, 2048), Haiku ego
node scripts/eval-cli.js run \
  --profiles cell_1_base_single_unified,cell_5_recog_single_unified \
  --runs 3 --max-tokens 256

node scripts/eval-cli.js run \
  --profiles cell_1_base_single_unified,cell_5_recog_single_unified \
  --runs 3 --max-tokens 512

node scripts/eval-cli.js run \
  --profiles cell_1_base_single_unified,cell_5_recog_single_unified \
  --runs 3 --max-tokens 2048

# Base-only control at default 8000
node scripts/eval-cli.js run \
  --profiles cell_1_base_single_unified \
  --runs 3
```
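The three dose-response invocations differ only in `--max-tokens`, so the sweep can be written as a loop. A dry-run sketch (it echoes each command instead of executing it, so the generated invocations can be inspected first; remove the `echo` to run the sweep):

```bash
# Sweep the three token budgets from the dose-response design above.
for BUDGET in 256 512 2048; do
  echo "node scripts/eval-cli.js run" \
       "--profiles cell_1_base_single_unified,cell_5_recog_single_unified" \
       "--runs 3 --max-tokens $BUDGET"
done
```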

### B.9 Qualitative Transcript Assessment (Section 6.11)

```bash
# Assess transcripts with Opus
node scripts/assess-transcripts.js --run-id eval-2026-02-14-e0e3a622
node scripts/assess-transcripts.js --run-id eval-2026-02-07-b6d75e87
```

### B.10 Factor Effect Analysis

```sql
-- Factor effect analysis query
```

| Tutor Adaptation | 5% | Bilateral |
| Learner Growth | 5% | Bilateral |

Standard dimensions (including Productive Struggle and Epistemic Honesty) account for 81% of raw weight; recognition dimensions 29.9%; bilateral dimensions 10%. Raw weights total 120.9% and are normalized at scoring time. Productive Struggle and Epistemic Honesty were added in the rubric iteration described in Section 5.1, with corresponding reductions to Actionability and Tone (10% → 8% each). The bilateral dimensions (`tutor_adaptation`, `learner_growth`) specifically measure the mutual transformation claim—see Section 6.15.
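The normalization step can be made concrete. A minimal Python sketch, assuming normalization is a simple proportional rescale of the raw weights (the category totals are taken from the paragraph above; the actual scoring code may rescale per-dimension rather than per-category):

```python
# Category-level raw weights from Appendix C.2, in percent; they total 120.9.
raw = {"standard": 81.0, "recognition": 29.9, "bilateral": 10.0}

total = sum(raw.values())
normalized = {k: v / total for k, v in raw.items()}  # rescale to sum to 1.0

# Effective shares after normalization: ~67.0%, ~24.7%, ~8.3%.
shares = {k: round(100 * v, 1) for k, v in normalized.items()}
```

Under this rescale the bilateral dimensions contribute about 8.3% of the final score rather than their raw 10%.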

### C.3 Recognition Dimension Criteria

## Appendix D: Reproducibility and Key Evaluation Run IDs

Evaluation commands are documented in Appendix B. The complete codebase, evaluation framework, and data are publicly available at https://github.com/liammagee/machinespirits-eval. The thirty-seven key evaluations are listed below (b6d75e87 serves both bilateral transformation and learner-side evaluation; eval-2026-02-11-35c53e99 and eval-2026-02-11-5f6d51f5 are combined as one dialectical modulation evaluation):

| Finding | Run ID | Section |
|---------|--------|---------|
| Active control (post-hoc) | eval-2026-02-06-a9ae06ee | 6.2 |
| Full factorial, cells 1–5,7 (Kimi) | eval-2026-02-03-f5d4dd93 | 6.3 |
| Full factorial, cells 6,8 re-run (Kimi) | eval-2026-02-06-a933d745 | 6.3 |
| A×B replication (Kimi) | eval-2026-02-05-10b344fb | 6.4 |
| A×B probe: Nemotron | eval-2026-02-07-722087ac | 6.4 |
| A×B probe: DeepSeek V3.2 | eval-2026-02-07-70ef73a3 | 6.4 |
| A×B probe: GLM-4.7 | eval-2026-02-07-6b3e6565 | 6.4 |
| A×B probe: Claude Haiku 4.5 | eval-2026-02-07-6ead24c7 | 6.4 |
| Domain generalizability (Kimi) | eval-2026-02-05-e87f452d | 6.5 |
| Dynamic rewrite evolution (run 1) | eval-2026-02-05-daf60f79 | 6.18 |
| Dynamic rewrite evolution (run 2) | eval-2026-02-05-49bb2017 | 6.18 |
| Dynamic rewrite evolution (run 3) | eval-2026-02-05-12aebedb | 6.18 |
| Bilateral transformation (multi-turn) | eval-2026-02-07-b6d75e87 | 6.15 |
| Hardwired rules ablation (Kimi) | eval-2026-02-08-65a6718f | 6.7 |
| Dialectical impasse test | eval-2026-02-08-f896275d | 6.20 |
| Learner-side evaluation (symmetric) | eval-2026-02-07-b6d75e87 | 6.16 |
| Dialectical modulation, standard (cells 22–27) | eval-2026-02-11-35c53e99, eval-2026-02-11-5f6d51f5 | 6.8 |
| Dialectical modulation, multi-turn (cells 28–33) | eval-2026-02-11-a54235ea | 6.8 |
| Self-reflective evolution (cells 40–45) | eval-2026-02-13-8d40e086 | 6.9 |
| Mechanism robustness, scripted (cells 40–59) | eval-2026-02-14-e0e3a622 | 6.10 |
| Dynamic learner mechanisms (cells 60–63) | eval-2026-02-14-6c033830 | 6.10 |
| Dynamic learner mechanisms (cells 64–65) | eval-2026-02-14-a2b2717c | 6.10 |
| Mechanism robustness, Nemotron (cells 40–59) | eval-2026-02-14-49b33fdd | 6.10 |
| Self-reflect Nemotron non-replication (cells 40–45) | eval-2026-02-14-559d854b | 6.9 |
| Cognitive prosthesis (cells 66–68, Nemotron) | eval-2026-02-17-25aaae85 | 6.10 |
| Cognitive prosthesis smoke test (Haiku) | eval-2026-02-18-f489c0ea | 6.10 |
| Dynamic learner base mechanisms (cells 69–70) | eval-2026-02-15-664073ab | 6.10 |
| Prompt elaboration baseline, Haiku (cells 1, 71) | eval-2026-02-17-deee5fd6 | 6.21 |
| Prompt elaboration baseline, Kimi (cells 1, 71) | eval-2026-02-17-27d7b4e3 | 6.21 |
| Token budget 256, Haiku (run 1) | eval-2026-02-17-0eb3de77 | 6.22 |
| Token budget 256, Haiku (run 2) | eval-2026-02-17-5a640782 | 6.22 |
| Token budget 512, Haiku | eval-2026-02-17-5f281654 | 6.22 |
| Token budget 2048, Haiku | eval-2026-02-17-0f6dcd97 | 6.22 |
| Token budget default (8000), Haiku | eval-2026-02-17-d32ed226 | 6.22 |

---

## Appendix E: Revision History

**v1.0** (2026-02-04)
: Initial draft with 2×2×2 factorial design, memory isolation, three-way comparison.

**v1.1** (2026-02-06)
: Added corrected memory isolation experiment (N=120), active control (N=118), cells 6 & 8 re-run, cross-judge GPT-5.2 analysis. Corrected GPT-5.2 effect sizes (d=1.15→0.99, d=0.50→0.29) after deduplication of rejudge rows. Dropped dead partial run (e617e757).

**v1.2** (2026-02-06)
: **Critical correction**: Reframed "placebo control" as "post-hoc active control." The original v1.1 analysis compared the active control (Nemotron, M=66.5) to factorial base (Kimi K2.5, M=78.8) and reported d=-1.03, but this compared different ego models. Same-model historical data shows Nemotron base $\approx$ 58, making the active control $\approx$ +9 pts above base (not below). Reframed throughout: generic pedagogical elaboration provides partial benefit (~+9 pts above base) but recognition gains are substantially larger (~+15 pts). Acknowledged post-hoc design and active (not inert) control content.

**v1.3--v1.4** (2026-02-06)
: Intermediate revisions: corrected factorial with re-run cells 6, 8 (a933d745); updated A×C interaction values; qualitative analysis additions; production quality fixes. Superseded by v1.5.

**v1.5** (2026-02-07)
: **Rubric iteration**: Updated to 14-dimension rubric with dialogue transcript context, Productive Struggle (5%), and Epistemic Honesty (5%) dimensions (Actionability/Tone reduced 10%→8%). Re-scored cells 6, 8 (N=88) with identical responses: minimal change (+0.5, +0.6 pts), confirming calibration preserved. Added holistic dialogue evaluation for multi-turn transcripts. Cross-judge replication on updated rubric (r=0.55, N=88, GPT/Opus ratio=0.87). Updated Table 6, main effects, A×C interaction values, Appendix C.2 weight table, and Section 6.18 cross-judge tables.

**v1.6** (2026-02-08)
: **Content isolation fix**: Identified and fixed two bugs causing cross-domain content leakage in elementary scenarios: (a) `buildCurriculumContext()` fallback that scanned all courses when no content hint was provided, serving philosophy listings to elementary scenarios; (b) hardcoded `479-lecture-*` IDs in tutor ego prompt examples that the model copied when no curriculum anchor was present. Updated Sections 6.5, 6.6, 7.4, 7.8, and 8 to reframe "model hallucination" as system-level content isolation failures.

**v1.7** (2026-02-08)
: **Hardwired rules ablation**: Added Section 6.7 with superego rules embedded in ego prompt (cells 13--14, N=72, eval-2026-02-08-65a6718f, Opus judge). Static rules fail to replicate the Superego's benefit, confirming the value lies in contextual judgment rather than rule enforcement. Added Table 10b, updated Tables 2/D and paper totals.

**v1.8** (2026-02-08)
: **Dialectical impasse test**: Added Section 6.20 with three 5-turn impasse scenarios (epistemic resistance, affective shutdown, productive deadlock; N=24, eval-2026-02-08-f896275d, Opus judge). Recognition produces +43 pts on epistemic and +29 pts on interpretive impasses but $\Delta=-1.1$ on affective shutdown---sharpening the theoretical claim to epistemological rather than affective recognition.

**v1.9** (2026-02-08)
: **Learner superego paradox**: Added symmetric learner-side evaluation (Section 6.16) scoring N=118 bilateral dialogues with 6-dimension learner rubric (eval-2026-02-07-b6d75e87, Opus judge). Multi-agent learner architecture hurts learner quality (d=1.43, F=68.28, p<.001)---the largest effect in the study. Recognition partially rescues multi-agent learners (d=0.79, p=.004) but not single-agent (n.s.). Added learner rubric description to §5.1, new §6.12, rewrote §7.5 with results, added finding #9 to §9.

**v2.0** (2026-02-08)
: **Resolution strategy coding**: Post-hoc qualitative coding of all 24 dialectical impasse dialogues into five Hegelian resolution strategies. Perfect separation: 12/12 base tutors withdraw, 10/12 recognition tutors use scaffolded reframing (Aufhebung pattern). $\chi^2(3)=24.00$, $p<.001$, $V=1.000$. Cross-judge validation with GPT-5.2: $\kappa=0.84$. Added Tables 26--28, per-turn strategy evolution analysis.

**v2.1** (2026-02-08)
: **AI theme discovery & figure regeneration**: Added §6.13.4 AI-assisted theme discovery (N=300) showing near-perfect bimodal separation. Added Figure 6 (word clouds). Regenerated all figures from Python with corrected data and larger text. Removed standalone §10 Reproducibility (merged into Appendix D). Moved Appendix E after other appendices. Increased font to 12pt.

**v2.1.1** (2026-02-10)
: **Consistency fixes**: Corrected stale N=1,628/twenty → N=1,700/twenty-one in abstract, introduction, and conclusion. Fixed dynamic rewrite section references in Tables 2 and D. Added hardwired rules ablation and learner-side evaluation to Appendix D run list (was 19 rows, now 21). Fixed inter-judge reliability cross-reference in §8.1.

**v2.1.2** (2026-02-10)
: **Review corrections** (30 fixes): Table 7b Kimi row corrected to single-learner cells (N=350→179, Recognition +10.2→+15.5, Interaction -1.5→+0.5) matching probe design; total probe N 826→655. Factor C in Discussion corrected (-1.7 pts, F=2.56). Stale A×C values updated. Dynamic rewrite swing corrected (+16.7→+8.7 delta). Terminology standardized (unified→single-agent, behaviour→behavior).

**v2.2.0** (2026-02-11)
: **Modulation and learning outcomes**: Added §6.11.1 (modulation metrics, N=350 post-hoc) showing multi-agent architecture does not increase behavioral range (d=0.05); recognition produces calibration not oscillation (dimension variance d=$-$1.00, F=87.69). Added §6.11.2 (synthetic learning outcome index, N=118). Extended §7.4 Discussion with phronesis reframing. Regenerated Figures 4 and 6.

**v2.3.0** (2026-02-14)
: **Phase 2 experimental results**: Added four new Results sections: §6.8 Dialectical Superego Modulation (cells 22--33, N=174, Tables 13--15); §6.9 Self-Reflective Evolution (cells 40--45, N=36, Tables 16--17); §6.10 Mechanism Robustness (cells 40--59 N=360 + cells 60--63 N=120, Tables 18--19); §6.11 Qualitative Transcript Assessment (Tables 20--21). Added §7.10 Scripted Learner Confound, §7.11 Practical Recommendations (6 recommendations). Expanded §6.10 with cells 64--65 and Nemotron cross-model replication (N=279, 49b33fdd). Renumbered all tables sequentially (1--48). Trimmed abstract from ~650 to ~250 words. Paper totals: N=2,700 across 28 key evaluations.

**v2.3.1** (2026-02-15)
: **Cognitive prosthesis and cross-judge completion**: Added cognitive prosthesis test (cells 66--68, N=60). Completed GPT-5.2 cross-judge validation of mechanism robustness (N=360 paired, r=0.59). Added Nemotron self-reflect non-replication (559d854b, N=60) to §6.9/Table 17. Added blinded qualitative assessment validation (Table 21b).

**v2.3.2** (2026-02-15)
: **Sample reconciliation and count update**: Added Phase 2 evaluations to Table 2 (9 additional rows). Updated paper totals from 28 to 30 key evaluations, N=2,909 scored. Added 50487df7 (cognitive prosthesis) to Appendix B.6. Noted Sonnet judge used for two late-stage evaluations.

**v2.3.3** (2026-02-15)
: **Complete Table 19 with base mechanism cells**: Added cells 69--70 (eval-2026-02-15-664073ab, N=60, Opus judge) completing the base row of Table 19. Recognition delta remarkably consistent across all 4 mechanisms (+13.3 to +15.1). Updated paper totals from 30 to 31 evaluations, N=2,969.

**v2.3.4** (2026-02-15)
: **Related Work expansion for arXiv/edArXiv submission**: Expanded §2 from 8 to 10 subsections. Added §2.3 LLM-as-Judge Evaluation Methodology, §2.7 Theory of Mind in AI Agents. Expanded §2.1 with empirical LLM tutoring studies, §2.2 with multi-agent systems and self-correction limits. Added 15 new bib entries.

**v2.3.5** (2026-02-15)
: **Same-model blinded assessment**: Ran Opus-blinded qualitative assessment (N=118) resolving the model calibration confound. Key finding: blinding barely changes Opus's tag assignments, confirming the near-perfect binary separation is real, not an assessor bias artifact. Updated §6.11 interpretation, revised §8.2 limitation.

**v2.3.6** (2026-02-16)
: **Judge version unification**: Rejudged all early runs (originally scored under Opus 4.5) with Opus 4.6, eliminating version drift across the evaluation dataset. Updated §8.1 limitations. Cleaned 6 empty/failed generation rows from dynamic rewrite runs.

**v2.3.7** (2026-02-17)
: **Self-reflective evolution complete**: Updated §6.9 from partial (N=36) to complete (N=90) results for eval-2026-02-13-8d40e086. Recognition d=0.91 (was 1.02 at N=36). Key new finding: disposition gradient---suspicious +19.0, adversary +10.9, advocate +2.6. Updated Table 16, Table 17, Discussion finding 11. Deduped 270 re-judging artifact rows.

**v2.3.8** (2026-02-17)
: **Nemotron mechanism N-count update**: Updated eval-2026-02-14-49b33fdd from N=301 to N=360 after run resumption completed. Cascaded count changes through abstract, introduction, Table 2, §7.11, §9. Updated §6.10 Nemotron narrative. Noted bidirectional profiling anomaly ($\Delta=-0.6$).

**v2.3.9** (2026-02-17)
: **Factorial re-judging cascade**: All early runs re-judged with Opus 4.6 (unifying judge version). Full factorial ANOVA (N=350): recognition F=110.04 p<.001 $\eta^2$=.243 d=1.11 (was F=71.36, d=0.80). A×C interaction disappears (F=0.97, p=.325; was F=21.85, p<.001)---recognition now consistent across learner types. Updated Tables 4, 6, 8, 9, 9b, 12, 17, 41, 42. Restored 219 GPT-5.2 rows lost during dedup. Updated GPT compression ratio from ~58% to 37--59%.

**v2.3.10** (2026-02-17)
: **Prompt elaboration baseline**: Added §6.21 comparing 344-line base prompt against 35-line naive prompt (JSON schema only). Two runs: Haiku (N=72) and Kimi (N=72). Key finding: elaborate prompt hurts Haiku (+6.8 pts for naive) and is inert on Kimi ($\Delta=-0.3$). Recognition ($M=90.9$) remains well above naive ($M=82.5$). Added Table 20b, conclusion finding 13. Updated paper totals to N=3,292 across thirty-one evaluations.

**v2.3.11** (2026-02-17)
: **Transcript figures**: Added Figure 10 (naive vs base high\_performer comparison panel) to §6.21. Added Figure 11 (bilateral mutual\_transformation\_journey transcript comparison) to §6.11. Added `generate-paper-figures.js` script for reproducible paper figure generation.

**v2.3.12** (2026-02-17)
: **Token budget sensitivity**: Added §6.22 testing whether constraining `max_tokens` from 8000 to 256--2048 affects evaluation scores. Scores are flat across all budget levels; the recognition effect is fully preserved even at 256 tokens. The retry-absorption mechanism means truncated structured output self-heals. Added Table 49, recommendation 8 to §7.11. Updated paper totals to N=3,454 scored across thirty-six evaluations.

**v2.3.13** (2026-02-17)
: **Paper correctness fixes**: Fixed eval-2026-02-14-559d854b scope from "cells 40--59, N=167" to "cells 40--45, N=60" in Table 2 (only cells 40--45 used; cells 46--59 superseded by 49b33fdd at N=360). Fixed broken Table 10 reference in §6.6. Fixed dynamic-learner N inconsistency: intro and finding 12 updated to N=300 (6c033830 + a2b2717c + 664073ab). Clarified token budget §6.22 design text. Added missing Appendix B commands.

**v2.3.14** (2026-02-18)
: **Cognitive prosthesis re-run and analysis**: Replaced misconfigured prosthesis run (50487df7, all cells fell back to default Haiku) with corrected eval-2026-02-17-25aaae85 (N=90, Nemotron ego, Kimi K2.5 superego, Opus judge). Prosthesis hypothesis fails decisively: full mechanism stack scores 49.5 vs Nemotron simple base 64.2 ($\Delta=-15$). Added dimension analysis (two-tier static/dynamic capability model), superego parse failure analysis (16--45% malformed JSON auto-approves), and Haiku control smoke test (eval-2026-02-18-f489c0ea, N=6, confirming model-dependence). Added conclusion finding 14 (minimum ego capability threshold), recommendation 9 to §7.11, three future work items to §8.2 (parse robustness, capability threshold mapping, adaptive mechanism loading). Updated Table 2, Appendix D run IDs. Paper totals: N=3,383 across thirty-seven evaluations.