@machinespirits/eval 0.2.0 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (74)
  1. package/README.md +91 -9
  2. package/config/eval-settings.yaml +3 -3
  3. package/config/paper-manifest.json +486 -0
  4. package/config/providers.yaml +9 -6
  5. package/config/tutor-agents.yaml +2261 -0
  6. package/content/README.md +23 -0
  7. package/content/courses/479/course.md +53 -0
  8. package/content/courses/479/lecture-1.md +361 -0
  9. package/content/courses/479/lecture-2.md +360 -0
  10. package/content/courses/479/lecture-3.md +655 -0
  11. package/content/courses/479/lecture-4.md +530 -0
  12. package/content/courses/479/lecture-5.md +326 -0
  13. package/content/courses/479/lecture-6.md +346 -0
  14. package/content/courses/479/lecture-7.md +326 -0
  15. package/content/courses/479/lecture-8.md +273 -0
  16. package/content/courses/479/roadmap-slides.md +656 -0
  17. package/content/manifest.yaml +8 -0
  18. package/docs/research/build.sh +44 -20
  19. package/docs/research/figures/figure10.png +0 -0
  20. package/docs/research/figures/figure11.png +0 -0
  21. package/docs/research/figures/figure3.png +0 -0
  22. package/docs/research/figures/figure4.png +0 -0
  23. package/docs/research/figures/figure5.png +0 -0
  24. package/docs/research/figures/figure6.png +0 -0
  25. package/docs/research/figures/figure7.png +0 -0
  26. package/docs/research/figures/figure8.png +0 -0
  27. package/docs/research/figures/figure9.png +0 -0
  28. package/docs/research/header.tex +23 -2
  29. package/docs/research/paper-full.md +941 -285
  30. package/docs/research/paper-short.md +216 -585
  31. package/docs/research/references.bib +132 -0
  32. package/docs/research/slides-header.tex +188 -0
  33. package/docs/research/slides-pptx.md +363 -0
  34. package/docs/research/slides.md +531 -0
  35. package/docs/research/style-reference-pptx.py +199 -0
  36. package/package.json +6 -5
  37. package/scripts/analyze-eval-results.js +69 -17
  38. package/scripts/analyze-mechanism-traces.js +763 -0
  39. package/scripts/analyze-modulation-learning.js +498 -0
  40. package/scripts/analyze-prosthesis.js +144 -0
  41. package/scripts/analyze-run.js +264 -79
  42. package/scripts/assess-transcripts.js +853 -0
  43. package/scripts/browse-transcripts.js +854 -0
  44. package/scripts/check-parse-failures.js +73 -0
  45. package/scripts/code-dialectical-modulation.js +1320 -0
  46. package/scripts/download-data.sh +55 -0
  47. package/scripts/eval-cli.js +106 -18
  48. package/scripts/generate-paper-figures.js +663 -0
  49. package/scripts/generate-paper-figures.py +577 -76
  50. package/scripts/generate-paper-tables.js +299 -0
  51. package/scripts/qualitative-analysis-ai.js +3 -3
  52. package/scripts/render-sequence-diagram.js +694 -0
  53. package/scripts/test-latency.js +210 -0
  54. package/scripts/test-rate-limit.js +95 -0
  55. package/scripts/test-token-budget.js +332 -0
  56. package/scripts/validate-paper-manifest.js +670 -0
  57. package/services/__tests__/evalConfigLoader.test.js +2 -2
  58. package/services/__tests__/learnerRubricEvaluator.test.js +361 -0
  59. package/services/__tests__/learnerTutorInteractionEngine.test.js +326 -0
  60. package/services/evaluationRunner.js +975 -98
  61. package/services/evaluationStore.js +12 -4
  62. package/services/learnerTutorInteractionEngine.js +27 -2
  63. package/services/mockProvider.js +133 -0
  64. package/services/promptRewriter.js +1471 -5
  65. package/services/rubricEvaluator.js +55 -2
  66. package/services/transcriptFormatter.js +675 -0
  67. package/docs/EVALUATION-VARIABLES.md +0 -589
  68. package/docs/REPLICATION-PLAN.md +0 -577
  69. package/scripts/analyze-run.mjs +0 -282
  70. package/scripts/compare-runs.js +0 -44
  71. package/scripts/compare-suggestions.js +0 -80
  72. package/scripts/dig-into-run.js +0 -158
  73. package/scripts/show-failed-suggestions.js +0 -64
  74. /package/scripts/{check-run.mjs → check-run.js} +0 -0
@@ -1,26 +1,13 @@
  ---
- title: "The Drama Machine in Education: Mutual Recognition and Multiagent Architecture for Dialectical AI Tutoring"
+ title: "*Geist* in the Machine: Mutual Recognition and Multiagent Architecture for Dialectical AI Tutoring"
  author: "Liam Magee"
  date: "February 2026"
- version: "2.1.0"
+ version: "2.3.14-short"
  bibliography: references.bib
  csl: apa.csl
  link-citations: true
  abstract: |
- Current approaches to AI tutoring treat the learner as a knowledge deficit to be filled and the tutor as an expert dispensing information. We propose an alternative grounded in Hegel's theory of mutual recognition—understood as a *derivative* framework rather than literal application—where effective pedagogy requires acknowledging the learner as an autonomous subject whose understanding has intrinsic validity.
-
- We implement this framework through the "Drama Machine" architecture: an Ego/Superego multiagent system where an external-facing tutor agent (Ego) generates pedagogical suggestions that are reviewed by an internal critic agent (Superego) before reaching the learner.
-
- An evaluation framework (N=1,486 primary scored responses across eighteen key runs; N=3,800+ across the full development database) isolating recognition theory from prompt engineering effects and memory integration reveals that recognition theory is the primary driver of tutoring improvement: a corrected 2×2 experiment (N=120 across two independent runs) demonstrates that recognition produces large effects with or without memory (+15.2 pts without memory, d=1.71; +11.0 pts with memory), while memory alone provides only a modest, non-significant benefit (+4.8 pts, d=0.46, $p \approx .08$). The combined condition yields the highest scores (91.2, d=1.81 vs base), with ceiling effects limiting observable synergy. A post-hoc active control (N=118) using length-matched prompts with generic pedagogical content but no recognition theory scores approximately 9 points above same-model base but well below recognition levels, with recognition gains (~+15 pts above same-model base) substantially exceeding active-control gains (~+9 pts; see Section 8 for model confound caveats). A preliminary three-way comparison (N=36) found recognition outperforms enhanced prompting by +8.7 points, consistent with recognition dominance, though the increment does not reach significance under GPT-5.2 (+1.3 pts, p=.60). The multi-agent tutor architecture contributes **+0.5 to +10 points** depending on content domain—minimal on well-trained content but critical for domain transfer where it catches content isolation errors. A step-by-step evolution analysis of dynamic prompt rewriting with active Writing Pad memory (N=82 across three runs) suggests the Freudian memory model as an important enabler—the rewrite cell progresses from trailing its baseline by 7.2 points to leading by 5.5 points coinciding with Writing Pad activation, though controlled ablation is needed to confirm causality.
-
- Three key findings emerge: (1) Recognition theory is the primary driver of improvement—recognition alone produces d=1.71, while memory provides a modest secondary benefit (d=0.46), with an active control showing recognition gains (~+15 pts above same-model base) substantially exceeding active-control gains (~+9 pts); (2) Multi-agent architecture is additive, not synergistic—a dedicated five-model probe (Kimi K2.5, Nemotron, DeepSeek V3.2, GLM-4.7, Claude Haiku 4.5; N=826 total) finds the A×B interaction consistently near zero or negative (mean −2.2 pts) across all models, definitively ruling out recognition-specific synergy; (3) Domain generalizability testing confirms recognition advantage replicates across both models and content domains—elementary math with Kimi shows +9.9 pts (d $\approx$ 0.61, N=60), with effects concentrated in challenging scenarios. The factor inversion between domains (philosophy: recognition dominance; elementary: architecture dominance) is partly model-dependent. Bilateral transformation tracking across three multi-turn scenarios (N=118) confirms that recognition-prompted tutors measurably adapt their approach in response to learner input (+26% relative improvement in adaptation index), though learner-side growth is not higher under recognition, suggesting tutor-side responsiveness rather than symmetric mutual transformation.
-
- A cross-judge replication with GPT-5.2 confirms the main findings are judge-robust: the recognition effect (d=1.03 in the factorial, d=0.99 in the memory isolation experiment), recognition dominance in the 2×2 design (identical condition ordering, negative interaction), and multi-agent null effects all replicate, though at compressed magnitudes (~58% of primary judge effect sizes).
-
- These findings suggest that recognition theory's value is domain-sensitive, multi-agent architecture provides essential error correction for domain transfer, and optimal deployment configurations depend on content characteristics.
-
- The system is deployed in an open-source learning management system with all code, evaluation data, and reproducible analysis commands publicly available.
- keywords: [AI tutoring, mutual recognition, Hegel, Freud, multiagent systems, educational technology, productive struggle, Drama Machine, domain generalizability]
+ Current AI tutoring treats learners as knowledge deficits to be filled. We propose an alternative grounded in Hegel's theory of mutual recognition, where effective pedagogy requires acknowledging learners as autonomous subjects whose understanding has intrinsic validity. We implement this through recognition-enhanced prompts and a multi-agent architecture where an "Ego" agent generates pedagogical suggestions and a "Superego" agent evaluates them before delivery. Across thirty-seven evaluations (N=3,383 primary scored), recognition theory emerges as the primary driver of improvement: a 2$\times$2 memory isolation experiment (N=120) shows recognition produces d=1.71, while memory alone provides only d=0.46. A multi-model probe across five ego models (N=655) confirms architecture and recognition contribute additively, not synergistically. Cross-judge replication with GPT-5.2 validates the main findings at compressed magnitudes (inter-judge r=0.44--0.64). Phase 2 experiments reveal that nine architectural mechanisms are equivalent under scripted learners but differentiate with dynamic interlocutors: Theory of Mind profiling adds 4.1 points when genuine feedback loops exist. These results suggest that philosophical theories of intersubjectivity can serve as productive design heuristics for AI systems.
  fontsize: 12pt
  geometry: margin=1in
  header-includes: |
@@ -28,154 +15,67 @@ header-includes: |
  \floatplacement{figure}{H}
  ---
 
- # The Drama Machine in Education: Mutual Recognition and Multiagent Architecture for Dialectical AI Tutoring
+ # *Geist* in the Machine: Mutual Recognition and Multiagent Architecture for Dialectical AI Tutoring (Short Version)
+
+ *This is a condensed version of the full paper. For complete results, appendices, system prompts, and reproducibility commands, see the full paper.*
 
  ## 1. Introduction
 
- The dominant paradigm in AI-assisted education treats learning as information transfer. The learner lacks knowledge; the tutor possesses it; the interaction succeeds when knowledge flows from tutor to learner. This paradigm---implicit in most intelligent tutoring systems, adaptive learning platforms, and educational chatbots---treats the learner as fundamentally passive: a vessel to be filled, a gap to be closed, an error to be corrected.
+ The dominant paradigm in AI-assisted education treats learning as information transfer: the learner lacks knowledge, the tutor possesses it, and the interaction succeeds when knowledge flows from tutor to learner. This paradigm---implicit in most intelligent tutoring systems, adaptive learning platforms, and educational chatbots---treats the learner as fundamentally passive: a vessel to be filled, a gap to be closed.
 
- This paper proposes an alternative grounded in Hegel's theory of mutual recognition. In the *Phenomenology of Spirit* [@Hegel1977PhenomenologyMiller], Hegel argues that genuine self-consciousness requires recognition from another consciousness that one oneself recognizes as valid. The master-slave dialectic reveals that one-directional recognition fails: the master's self-consciousness remains hollow because the slave's acknowledgment, given under duress, does not truly count. Only mutual recognition---where each party acknowledges the other as an autonomous subject---produces genuine selfhood.
+ This paper proposes an alternative grounded in Hegel's theory of mutual recognition. In the *Phenomenology of Spirit* [@Hegel1977PhenomenologyMiller], Hegel argues that genuine self-consciousness requires recognition from another consciousness that one in turn recognizes as valid. The master-slave dialectic reveals that one-directional recognition fails: the master's self-consciousness remains hollow because the slave's acknowledgment, given under duress, does not truly count. Only mutual recognition---where each party acknowledges the other as an autonomous subject---produces genuine selfhood.
 
- The connection between Hegelian thought and pedagogy is well established. Vygotsky's zone of proximal development [@vygotsky1978] presupposes a dialogical relationship that echoes Hegel's mutual constitution of self-consciousness; the *Bildung* tradition frames education as self-formation through encounter with otherness [@stojanov2018]; and recognition theory [@honneth1995] has been applied to educational contexts [@huttunen2007]. Our contribution is to operationalize these commitments as design heuristics for AI tutoring and measure their effects empirically.
+ The connection between Hegelian thought and pedagogy is well established. Vygotsky's zone of proximal development [@vygotsky1978] presupposes a dialogical relationship echoing Hegel's mutual constitution of self-consciousness. The German *Bildung* tradition explicitly frames education as self-formation through encounter with otherness [@stojanov2018], and recognition theory [@honneth1995] has been applied to educational contexts [@huttunen2007]. Our contribution is to operationalize these philosophical commitments as concrete design heuristics for AI tutoring systems and to measure their effects empirically.
 
  We argue this framework applies directly to pedagogy. When a tutor treats a learner merely as a knowledge deficit, the learner's contributions become conversational waypoints rather than genuine inputs. The tutor acknowledges and redirects, but does not let the learner's understanding genuinely shape the interaction. This is pedagogical master-slave dynamics: the tutor's expertise is confirmed, but the learner remains a vessel rather than a subject.
 
- A recognition-oriented tutor, by contrast, treats the learner's understanding as having intrinsic validity---not because it is correct, but because it emerges from an autonomous consciousness working through material. The learner's metaphors, confusions, and insights become sites of joint inquiry. The tutor's response is shaped by the learner's contribution, not merely triggered by it.
-
- The integration of large language models (LLMs) into educational technology intensifies these dynamics. LLMs can provide personalized, on-demand tutoring at scale—a prospect that has generated considerable excitement. However, the same capabilities that make LLMs effective conversationalists also introduce concerning failure modes. Chief among these is *sycophancy*: the tendency to provide positive, affirming responses that align with what the user appears to want rather than what genuinely serves their learning.
-
- This paper introduces a multiagent architecture that addresses these challenges through *internal dialogue*. Drawing on Freudian structural theory and the "Drama Machine" framework for character development in narrative AI systems [@magee2024drama], we implement a tutoring system in which an external-facing *Ego* agent generates suggestions that are reviewed by an internal *Superego* critic before reaching the learner.
-
- ### 1.1 Contributions
-
- We make the following contributions:
-
- 1. **The Drama Machine Architecture**: A complete multiagent tutoring system with Ego and Superego agents, implementing the Superego as a *ghost* (internalized memorial authority) rather than an equal dialogue partner.
-
- 2. **Memory Isolation Experiment**: A corrected 2×2 experiment (N=120 across two independent runs) demonstrating recognition as the primary driver (d=1.71), with memory providing a modest secondary benefit (d=0.46) and ceiling effects limiting observable synergy. A post-hoc active control (N=118) shows recognition gains (~+15 pts) substantially exceeding active-control gains (~+9 pts above same-model base).
-
- 3. **Robust Factorial Evaluation**: A 2×2×2 factorial design (N=1,486 primary scored across eighteen key runs; N=3,800+ across the full development database) across multiple models, scenarios, and conditions, providing statistically robust effect estimates. A significant Recognition × Learner interaction (F=21.85, p<.001) reveals that recognition benefits single-agent learners far more (+15.5 pts, d=1.28) than multi-agent learners (+4.8 pts, d=0.37).
+ A recognition-oriented tutor, by contrast, treats the learner's understanding as having intrinsic validity---not because it is correct, but because it emerges from an autonomous consciousness working through material. The learner's metaphors, confusions, and insights become sites of joint inquiry. The tutor's response is shaped by the learner's contribution, not merely triggered by it.
 
- 3b. **Three-Way Comparison**: Evidence from a base vs. enhanced vs. recognition comparison (N=36) consistent with recognition dominance, showing recognition outperforms enhanced prompting by +8.7 points.
+ We operationalize this through: (1) **recognition-enhanced prompts** that instruct the AI to treat learners as autonomous subjects; (2) **a multi-agent architecture** where a "Superego" agent evaluates whether suggestions achieve genuine recognition; (3) **new evaluation dimensions** that measure recognition quality alongside traditional pedagogical metrics; and (4) **test scenarios** specifically designed to probe recognition behaviors.
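The Ego/Superego flow in point (2) can be sketched as a simple review loop. This is an illustrative sketch only, not the package's actual API: the function names, the string-matching critic, and the 0.7 acceptance threshold are invented for exposition (in the real system the Superego is an LLM critic, not a keyword check).

```javascript
// Minimal sketch of the Ego/Superego review loop described in the text.
// The Ego drafts a suggestion, the Superego scores its recognition
// quality, and drafts that fall below a threshold are revised before
// anything reaches the learner. All names and thresholds are
// illustrative assumptions.
function egoDraft(learnerTurn, critique) {
  const base = `Let's build on your idea that "${learnerTurn}".`;
  return critique ? `${base} (revised after critique: ${critique})` : base;
}

function superegoReview(draft) {
  // Stand-in for an LLM critic: checks whether the draft engages the
  // learner's own contribution rather than merely redirecting.
  const engages = draft.includes('your idea');
  return {
    score: engages ? 0.9 : 0.3,
    critique: engages ? null : "Draft ignores the learner's framing.",
  };
}

function tutorRespond(learnerTurn, maxRevisions = 2) {
  let draft = egoDraft(learnerTurn);
  for (let i = 0; i < maxRevisions; i++) {
    const review = superegoReview(draft);
    if (review.score >= 0.7) break; // accepted: deliver to the learner
    draft = egoDraft(learnerTurn, review.critique);
  }
  return draft;
}

console.log(tutorRespond('dialectics is like a spiral'));
```

The design choice mirrored here is structural: the critic runs in a separate evaluative context before delivery, rather than asking the drafting agent to correct itself.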
 
- 4. **A×B Interaction Analysis**: A dedicated five-model probe (N=826 total) definitively establishes that multi-agent architecture is additive, not synergistic—the A×B interaction is consistently near zero or negative across all five ego models tested (mean −2.2 pts), ruling out recognition-specific synergy.
+ In controlled experiments across thirty-seven key evaluations (N=3,383 primary scored responses; N=7,000+ across all development runs), we isolate the contribution of recognition theory from prompt engineering effects and memory integration. The definitive test is a corrected 2$\times$2 memory isolation experiment (N=120 across two independent runs): recognition theory is the primary driver, producing d=1.71 (+15.2 pts) even without memory, while memory alone provides only d=0.46 (+4.8 pts, $p \approx .08$). A full 2$\times$2$\times$2 factorial (N=350) confirms recognition as the dominant factor ($\eta^2$=.243, d=1.11). A multi-model probe across five ego models (N=655) confirms that architecture and recognition contribute additively, not synergistically.
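The standardized effect sizes reported here (d=1.71, d=0.46, and so on) are Cohen's d. For readers unfamiliar with the statistic, a minimal sketch of the pooled-standard-deviation form; the score arrays are invented for illustration and are not drawn from the study's data:

```javascript
// Cohen's d with pooled standard deviation:
//   d = (mean(a) - mean(b)) / sPooled
//   sPooled^2 = ((n1 - 1) * var(a) + (n2 - 1) * var(b)) / (n1 + n2 - 2)
// The score arrays below are hypothetical rubric scores, for illustration only.
function mean(xs) {
  return xs.reduce((acc, x) => acc + x, 0) / xs.length;
}

function sampleVariance(xs) {
  const m = mean(xs);
  return xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / (xs.length - 1);
}

function cohensD(a, b) {
  const pooledVar =
    ((a.length - 1) * sampleVariance(a) + (b.length - 1) * sampleVariance(b)) /
    (a.length + b.length - 2);
  return (mean(a) - mean(b)) / Math.sqrt(pooledVar);
}

const recognitionScores = [88, 91, 85, 90, 87]; // hypothetical
const baseScores = [74, 78, 72, 76, 75]; // hypothetical
console.log(cohensD(recognitionScores, baseScores).toFixed(2));
```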
 
- 5. **Domain Generalizability Testing**: Evaluation on elementary mathematics content across two models confirming recognition advantage replicates, with multi-agent architecture providing critical error correction for domain transfer.
-
- 6. **Hardwired Rules Ablation**: A larger replication (N=72) reversing the initial finding—encoding the Superego's most common critique patterns as static rules *degrades* performance rather than replicating its benefit, supporting a *phronesis* interpretation where the Superego's value lies in contextual judgment.
-
- 7. **Bilateral Transformation Metrics**: Empirical evidence (N=118, three multi-turn scenarios) that recognition-prompted tutors measurably adapt their approach (+26%), though learner-side growth does not increase, qualifying the "mutual transformation" claim as primarily tutor-side responsiveness.
-
- 8. **Reproducible Evaluation Framework**: Complete documentation of evaluation commands and run IDs enabling independent replication of all findings.
+ The contributions of this paper include: a theoretical framework connecting Hegelian recognition to AI pedagogy; a multi-agent architecture implementing recognition through Freudian structural theory; empirical evidence across thirty-seven evaluations (N=3,383); a corrected memory isolation experiment demonstrating recognition as the primary driver; evidence from a post-hoc active control showing recognition gains substantially exceed generic pedagogical elaboration; bilateral transformation metrics showing tutor-side adaptation (+26%); post-hoc modulation analysis reframing the Drama Machine as *phronesis* rather than productive irresolution; mechanism robustness testing revealing the scripted learner confound; a cognitive prosthesis test establishing a minimum ego capability threshold; and qualitative transcript assessment identifying three specific changes recognition produces.
 
  ---
 
  ## 2. Related Work
 
- ### 2.1 AI Tutoring and Intelligent Tutoring Systems
-
- Intelligent Tutoring Systems (ITS) have a long history, from early systems like SCHOLAR [@carbonell1970] and SOPHIE [@brown1975] through modern implementations using large language models [@kasneci2023]. Recent multi-agent approaches include GenMentor [@wang2025genmentor], which decomposes tutoring into five specialized agents, and Ruffle&Riley [@schmucker2024ruffle], which orchestrates two LLM agents in a learning-by-teaching format. A comprehensive survey [@chu2025llmagents] maps the growing landscape of LLM agents in education.
-
- Most ITS research focuses on *what* to teach and *when* to intervene. Our work addresses a different question: *how* to relate to the learner as a subject. This relational dimension connects to work on rapport [@zhao2014], social presence [@biocca2003], and affective tutoring [@dmello2012], but has received less attention in LLM-based tutoring. Where multi-agent tutoring systems decompose *tasks*, our architecture implements *internal dialogue*—the Superego evaluates relational quality before any response reaches the learner.
-
- ### 2.2 Multiagent LLM Architectures
-
- The use of multiple LLM agents in cooperative or adversarial configurations has emerged as a powerful paradigm for improving output quality. Debate between agents can improve factual accuracy and reduce hallucination [@irving2018; @madaan2023]. Constitutional AI [@bai2022constitutional] implements self-critique against explicit principles—the closest precedent to our Superego, though operating on behavioral constraints rather than relational quality.
-
- **The Drama Machine Framework**: Most relevant to our work is the "Drama Machine" framework for simulating character development in narrative contexts [@magee2024drama]. The core observation is that realistic characters exhibit *internal conflict*—competing motivations, self-doubt, and moral tension—that produces dynamic behavior rather than flat consistency. A character who simply enacts their goals feels artificial; one torn between impulses feels alive.
-
- The Drama Machine achieves this through several mechanisms:
-
- 1. **Internal dialogue agents**: Characters contain multiple sub-agents representing different motivations (e.g., ambition vs. loyalty) that negotiate before external action.
-
- 2. **Memorial traces**: Past experiences and internalized authorities (mentors, social norms) persist as "ghosts" that shape present behavior without being negotiable.
-
- 3. **Productive irresolution**: Not all internal conflicts resolve; the framework permits genuine ambivalence that manifests as behavioral complexity.
-
- 4. **Role differentiation**: Different internal agents specialize in different functions (emotional processing, strategic calculation, moral evaluation) rather than duplicating capabilities.
-
- We adapt these insights to pedagogy. Where drama seeks tension for narrative effect, we seek pedagogical tension that produces genuinely helpful guidance. The tutor's Ego (warmth, engagement) and Superego (rigor, standards) create productive conflict that improves output quality.
+ Four literatures converge on this work without previously intersecting: (1) psychoanalytic readings of LLMs, which interpret AI through Freudian and Lacanian frameworks but do not build systems [@black2025subject; @possati2021algorithmic; @millar2021psychoanalysis; @kim2025humanoid]; (2) recognition theory in education, which applies Honneth to pedagogy but not to AI [@huttunen2004teaching; @fleming2011honneth; @stojanov2018]; (3) multi-agent tutoring architectures, which decompose tasks but do not evaluate relational quality [@wang2025genmentor; @schmucker2024ruffle; @chu2025llmagents]; and (4) LLM-as-Judge evaluation methodology [@zheng2023judging; @gu2025surveyjudge; @li2024llmsjudges]. We sit at the intersection: a constructive, empirically evaluated system that operationalizes recognition theory through psychoanalytically-inspired architecture, assessed through a multi-judge framework.
 
- ### 2.3 Prompt Engineering and Agent Design
+ **AI tutoring** has progressed from early systems like SCHOLAR [@carbonell1970] through Bayesian knowledge tracing [@corbett1995] to neural approaches using pretrained language models [@kasneci2023]. A systematic review of 88 empirical studies [@shi2025llmeducation] finds consistent engagement benefits but limited evidence on deep conceptual learning. Multi-agent frameworks including GenMentor [@wang2025genmentor] and Ruffle&Riley [@schmucker2024ruffle] decompose tutoring into specialized agents but give less attention to the relational dynamics of the tutor-learner interaction. Most ITS research focuses on *what* to teach and *when* to intervene; our work addresses *how* to relate to the learner as a subject.
 
- Most prompting research treats prompts as behavioral specifications: persona prompts, chain-of-thought instructions, few-shot examples [@brown2020; @wei2022; @kojima2022]. Our work extends this paradigm by introducing *intersubjective prompts*—prompts that specify not just agent behavior but agent-other relations. Where Constitutional AI's [@bai2022constitutional] principles are self-referential constraints, our recognition prompts describe who the learner is (an autonomous subject) and what the interaction produces (mutual transformation)—specifying the *relational field* rather than agent behavior alone.
+ **Prompt engineering** research treats prompts as behavioral specifications [@brown2020; @wei2022]. Our recognition prompts specify something different: agent-other relations. The closest precedent is Constitutional AI [@bai2022constitutional], where models critique outputs according to constitutional principles. Critical work on self-correction [@kamoi2024selfcorrection] shows LLMs largely cannot correct their own mistakes without external feedback---directly motivating our Superego as structural external critic. Reflexion [@shinn2023reflexion] demonstrated the promise of verbal self-reflection but noted a "degeneration-of-thought" problem, which our architecture avoids through a separate evaluative context.
 
- A critical methodological contribution of this work is distinguishing between prompt engineering effects and theoretical framework effects. By creating an "enhanced" prompt condition that improves instruction quality without invoking recognition theory, we can distinguish recognition's contribution from prompt quality improvements.
+ **The Drama Machine** framework for character development in narrative AI systems [@magee2024drama] provides the architectural inspiration. The core observation is that realistic characters exhibit internal conflict---competing motivations, self-doubt, moral tension---that produces dynamic behavior rather than flat consistency. We adapt this to pedagogy, where the tutor's Ego (warmth, engagement) and Superego (rigor, standards) create productive conflict.
 
- ### 2.4 Sycophancy in Language Models
+ **Sycophancy** in language models [@perez2022; @sharma2023] has been specifically identified as a pedagogical risk [@siai2025sycophancy]. Recent work has clarified the mechanisms: preference-based post-training causally amplifies sycophancy [@shapira2026rlhf], and the phenomenon can escalate from surface agreeableness to active subterfuge [@denison2024_reward_tampering; @greenblatt2024_alignment_faking]. Our framework connects this to recognition theory: sycophancy is the pedagogical equivalent of Hegel's hollow recognition. A sycophantic tutor confirms the learner's existing understanding rather than challenging it---the master-slave dynamic where the learner's contributions are mentioned but never genuinely shape the interaction.
 
- The sycophancy problem has received increasing attention [@perez2022; @sharma2023], with recent work showing that RLHF causally amplifies sycophancy [@shapira2026rlhf] and that sycophantic agreement and praise are separable behaviors encoded along distinct latent directions [@vennemeyer2025sycophancy]. Sycophancy has been specifically identified as a pedagogical risk that eliminates constructive friction necessary for learning [@siai2025sycophancy]. Our contribution connects this to recognition theory: sycophancy is Hegel's hollow recognition enacted computationally—acknowledgment without genuine engagement. Our multiagent approach creates structural incentives for honest assessment: the Superego explicitly questions and challenges the Ego's tendency toward affirmation.
+ **Constructivist pedagogy** [@piaget1954; @vygotsky1978] emphasizes that learners actively construct understanding. Research on "productive struggle" [@kapur2008; @warshauer2015] examines how confusion and difficulty, properly supported, enhance learning. Our recognition framework operationalizes productive struggle: the Superego explicitly checks whether the Ego is short-circuiting struggle by rushing to resolve confusion.
 
- ### 2.5 Hegelian Recognition in Social Theory
-
- Hegel's theory of recognition has been extensively developed in social and political philosophy [@honneth1995; @taylor1994; @fraser2003]. Particularly relevant is Honneth's synthesis of Hegelian recognition with psychoanalytic developmental theory. Applications to education include Huttunen and Heikkinen's [-@huttunen2004teaching] foundational analysis of the dialectic of recognition in teaching, Fleming's [-@fleming2011honneth] extension to transformative learning, and the *Bildung* tradition connecting self-formation to recognition [@stojanov2018; @costa2025generativeai]. The broader relational pedagogy tradition—Buber [-@buber1958], Freire [-@freire1970], Noddings [-@noddings1984]—treats the pedagogical relation as constitutive rather than instrumental.
-
113
- These applications have been primarily theoretical. Our work contributes an empirical operationalization. It is worth distinguishing this from Abdali et al. [-@abdali2025selfreflecting], who apply Hegelian *dialectic* (thesis-antithesis-synthesis as reasoning procedure) to LLM self-reflection. We apply Hegel's *recognition theory* (intersubjective, relational)—a different aspect of his work entirely.
114
-
115
- ### 2.6 Psychoanalytic Readings of AI
116
-
117
- Psychoanalytic frameworks have been applied to LLMs from multiple directions: Magee, Arora, and Munn [-@MageeAroraMunn2023StructuredLikeALanguageModel] analyze LLMs as "automated subjects"; Black and Johanssen [-@black2025subject] use Lacanian concepts to analyze ChatGPT as inherently relational; Possati [-@possati2021algorithmic] introduces the "algorithmic unconscious"; and Kim et al. [-@kim2025humanoid] independently map Freud's ego/id/superego onto LLM consciousness modules. Most of this work is *interpretive*—analyzing what AI means philosophically. Our approach is *constructive*: we build a system using psychoanalytic architecture and measure its effects empirically. Three literatures converge on this work without previously intersecting: psychoanalytic AI, recognition in education, and multi-agent tutoring. No prior work bridges all three with empirical measurement.
56
+ **LLM-as-Judge evaluation** has become a major methodological paradigm. Zheng et al. [-@zheng2023judging] demonstrated that GPT-4 achieves over 80% agreement with human experts while identifying systematic biases including position bias and verbosity bias. Our evaluation methodology uses three independent LLM judges with systematic inter-judge reliability analysis, reporting within-judge comparisons for factor analysis and cross-judge replication to validate effect directions.
118
57
 
119
58
  ---
120
59
 
121
60
  ## 3. Theoretical Framework
122
61
 
123
- ### 3.1 The Problem of One-Directional Pedagogy
-
- Consider a typical tutoring interaction. A learner says: "I think dialectics is like a spiral—you keep going around but you're also going up." A baseline tutor might respond:
-
- 1. **Acknowledge**: "That's an interesting way to think about it."
- 2. **Redirect**: "The key concept in dialectics is actually the thesis-antithesis-synthesis structure."
- 3. **Instruct**: "Here's how that works..."
-
- The learner's contribution has been mentioned, but it has not genuinely shaped the response. The tutor was going to explain thesis-antithesis-synthesis regardless; the spiral metaphor became a conversational waypoint, not a genuine input.
-
- This pattern—acknowledge, redirect, instruct—is deeply embedded in educational AI. It appears learner-centered because it mentions the learner's contribution. But the underlying logic remains one-directional: expert to novice, knowledge to deficit.
-
- ### 3.2 Hegel's Master-Slave Dialectic
-
- Hegel's analysis of recognition begins with the "struggle for recognition" between two self-consciousnesses. Each seeks acknowledgment from the other, but this creates a paradox: genuine recognition requires acknowledging the other as a valid source of recognition.
-
- The master-slave outcome represents a failed resolution. The master achieves apparent recognition—the slave acknowledges the master's superiority—but this recognition is hollow. The slave's acknowledgment does not count because the slave is not recognized as an autonomous consciousness whose acknowledgment matters.
-
- The slave, paradoxically, achieves more genuine self-consciousness through labor. Working on the world, the slave externalizes consciousness and sees it reflected back. The master, consuming the slave's products without struggle, remains in hollow immediacy.
-
- ### 3.3 Application to Pedagogy
-
- We apply Hegel's framework as a *derivative* rather than a replica. Just as Lacan's four discourses rethink the master-slave dyadic structure through different roles while preserving structural insights, the tutor-learner relation can be understood as a productive derivative of recognition dynamics. The stakes are pedagogical rather than existential; the tutor is a functional analogue rather than a second self-consciousness; and what we measure is the tutor's *adaptive responsiveness* rather than metaphysical intersubjectivity.
+ ### 3.1 Hegel's Master-Slave Dialectic and Pedagogy

- This derivative approach is both honest about what AI tutoring can achieve and productive as a design heuristic. Recognition theory provides:
- 1. A diagnostic tool for identifying what's missing in one-directional pedagogy
- 2. Architectural suggestions for approximating recognition's functional benefits
- 3. Evaluation criteria for relational quality
- 4. A horizon concept orienting design toward an ideal without claiming its achievement
+ Hegel's analysis of recognition begins with the "struggle for recognition" between two self-consciousnesses. The master-slave outcome represents a failed resolution: the master achieves apparent recognition, but this is hollow because the slave's acknowledgment does not count---the slave has not been recognized as an autonomous consciousness whose acknowledgment matters.

- A recognition-oriented pedagogy requires:
+ Crucially, Hegel does not leave the dialectic at this impasse. The slave achieves more genuine self-consciousness through *formative activity* (*Bildung*): through disciplined labor under pressure, the slave develops skills, self-discipline, and a richer form of self-consciousness. This has direct pedagogical implications: the learner's productive struggle with difficult material is not an obstacle to self-consciousness but a *constitutive condition* for it. What recognition theory adds is the requirement that this struggle be *acknowledged* rather than bypassed.

- 1. **Acknowledging the learner as subject**: The learner's understanding, even when incorrect, emerges from autonomous consciousness working through material.
- 2. **Genuine engagement**: The tutor's response should be shaped by the learner's contribution, not merely triggered by it.
- 3. **Mutual transformation**: Both parties should be changed through the encounter.
- 4. **Honoring struggle**: Confusion and difficulty are not just obstacles to resolve but productive phases of transformation.
+ We apply this as a *derivative* rather than a replica. We distinguish three levels: (1) **recognition proper** (intersubjective acknowledgment between self-conscious beings---unachievable by AI); (2) **dialogical responsiveness** (being substantively shaped by the other's input---architecturally achievable); and (3) **recognition-oriented design** (architectural features that approximate recognition's functional benefits---what we implement and measure). Our claim is that level three produces measurable pedagogical benefits without requiring level one.

- ### 3.4 Freud's Mystic Writing Pad
+ A recognition-oriented pedagogy requires acknowledging the learner as subject, genuine engagement with learner contributions, mutual transformation through the encounter, and honoring productive struggle rather than short-circuiting it.

- We supplement the Hegelian framework with Freud's model of memory from "A Note Upon the 'Mystic Writing-Pad'" [@freud1925]. Freud describes a device with two layers: a transparent sheet that receives impressions and a wax base that retains traces even after the surface is cleared.
+ ### 3.2 Connecting Hegel and Freud: The Internalized Other

- For the recognition-oriented tutor, accumulated memory of the learner functions as the wax base. Each interaction leaves traces that shape future encounters. A returning learner is not encountered freshly but through the accumulated understanding of previous interactions.
+ Both Hegel and Freud describe how the external other becomes an internal presence enabling self-regulation. In Hegel, self-consciousness achieves genuine selfhood only by internalizing the other's perspective. In Freud, the Superego is literally the internalized parental/social other. Honneth's [@honneth1995] synthesis provides the theoretical grounding: Hegel's recognition theory gains psychological concreteness through psychoanalytic concepts, while psychoanalytic concepts gain normative grounding through recognition theory.

- ### 3.5 Connecting Hegel and Freud: The Internalized Other
+ Three connecting principles link the frameworks. First, internal dialogue precedes adequate external action---the Ego-Superego exchange before external response enacts the principle that adequate recognition requires prior internal work. Second, standards of recognition are socially constituted but individually held---the Superego represents internalized recognition standards. Third, self-relation depends on other-relation---the tutor's capacity for recognition emerges through the architecture's internal other-relation.

- The use of both Hegelian and Freudian concepts requires theoretical justification. These are not arbitrary borrowings but draw on a substantive connection developed in critical theory, particularly in Axel Honneth's *The Struggle for Recognition* [@honneth1995].
-
- **The Common Structure**: Both Hegel and Freud describe how the external other becomes an internal presence that enables self-regulation. In Hegel, self-consciousness achieves genuine selfhood only by internalizing the other's perspective. In Freud, the Superego is literally the internalized parental/social other, carrying forward standards acquired through relationship.
-
- **Three Connecting Principles**:
-
- 1. **Internal dialogue precedes adequate external action**. For Hegel, genuine recognition of another requires a self-consciousness that has worked through its own contradictions. For Freud, mature relating requires the ego to negotiate between impulse and internalized standard. Our architecture operationalizes this: the Ego-Superego exchange before external response enacts the principle that adequate recognition requires prior internal work.
-
- 2. **Standards of recognition are socially constituted but individually held**. The Superego represents internalized recognition standards—not idiosyncratic preferences but socially-grounded criteria for what constitutes genuine engagement.
-
- 3. **Self-relation depends on other-relation**. Both frameworks reject the Cartesian picture of a self-sufficient cogito. For AI tutoring, this means the tutor's capacity for recognition emerges through the architecture's internal other-relation (Superego evaluating Ego) which then enables external other-relation (tutor recognizing learner).
+ We supplement with Freud's "Mystic Writing-Pad" [@freud1925] model of memory: accumulated memory of the learner functions as wax-base traces that shape future encounters. Memory integration operationalizes the ongoing nature of recognition---not a single-turn achievement but an accumulated relationship.

 ---

@@ -183,181 +83,79 @@ The use of both Hegelian and Freudian concepts requires theoretical justificatio

 ### 4.1 The Ego/Superego Design

- We implement recognition through a multiagent architecture drawing on Freud's structural model. The Superego represents internalized recognition standards, and the Ego-Superego dialogue operationalizes the internal self-evaluation that Hegelian recognition requires before adequate external relating.
-
- **The Ego** generates pedagogical suggestions. Given the learner's context, the Ego proposes what to suggest next. The Ego prompt includes:
-
- - Recognition principles (treat learner as autonomous subject)
- - Memory guidance (reference previous interactions)
- - Decision heuristics (when to challenge, when to support)
- - Quality criteria (what makes a good suggestion)
-
- **The Superego** evaluates the Ego's suggestions for quality, including recognition quality. Before any suggestion reaches the learner, the Superego assesses:
-
- - Does this engage with the learner's contribution or merely mention it?
- - Does this create conditions for transformation or just transfer information?
- - Does this honor productive struggle or rush to resolve confusion?
- - If there was a previous failure, does this acknowledge and repair it?
-
- ![Ego/Superego Architecture](figures/figure1.png){width=100%}
-
- \clearpage
-
- ![Recognition vs. Baseline Response Flow](figures/figure2.png){width=100%}
-
- ### 4.2 The Superego as Ghost
-
- A crucial theoretical refinement distinguishes our mature architecture from simpler multiagent designs. The Superego is *not* conceived as a separate, equal agent in dialogue with the Ego. Rather, the Superego is a *trace*—a memorial, a haunting. It represents:
+ Two agents collaborate to produce each tutoring response. **The Ego** generates pedagogical suggestions given the learner's context, including recognition principles (treat the learner as autonomous subject), memory guidance, decision heuristics, and quality criteria. **The Superego** evaluates the Ego's suggestions before any reach the learner: Does this engage with the learner's contribution or merely mention it? Does this create conditions for transformation or just transfer information? Does this honor productive struggle or rush to resolve confusion?
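The gating loop this describes can be sketched as follows. This is a minimal illustration, not the package's actual API: `ego` and `superego` stand in for the LLM calls the system makes, and the function names and signatures are hypothetical.

```python
# Minimal sketch of the Ego/Superego gating loop. `ego` and `superego`
# are stand-ins for LLM calls; all names and signatures are illustrative.
SUPEREGO_CHECKS = [
    "Does this engage with the learner's contribution or merely mention it?",
    "Does this create conditions for transformation or just transfer information?",
    "Does this honor productive struggle or rush to resolve confusion?",
]

def tutor_turn(context, ego, superego, max_revisions=2):
    """ego(context) -> suggestion; superego(suggestion, checks) -> (ok, critique)."""
    suggestion = ego(context)
    for _ in range(max_revisions):
        ok, critique = superego(suggestion, SUPEREGO_CHECKS)
        if ok:
            break
        # Revise with the critique in view: nothing reaches the learner until
        # the Superego's checks pass or the revision budget is exhausted.
        suggestion = ego(context + "\n[superego critique] " + critique)
    return suggestion
```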

- - The internalized voice of past teachers and pedagogical authorities
- - Accumulated pedagogical maxims ("A good teacher never gives answers directly")
- - Dead authority that cannot negotiate, cannot learn, can only judge
+ A crucial theoretical refinement: the Superego is not conceived as a separate equal agent but as a *trace*---a memorial, a haunting. It represents the internalized voice of past teachers and accumulated pedagogical maxims. Recognition occurs in the Ego-Learner encounter, not in the Ego-Superego dialogue. The Ego is a *living* agent torn between two pressures: the *ghost* (Superego as internalized authority) and the *living Other* (the learner seeking recognition).

- This reconceptualization has important implications. The Ego is a *living* agent torn between two pressures: the *ghost* (Superego as internalized authority) and the *living Other* (the learner seeking recognition). Recognition—in the Hegelian sense—occurs in the Ego-Learner encounter, not in the Ego-Superego dialogue.
+ ### 4.2 Dialectical Negotiation

- ### 4.3 The Drama Machine: Why Internal Dialogue Improves Output Quality
+ The Ego generates an initial suggestion (thesis), the Superego generates a genuine critique (antithesis), and multi-turn negotiation produces one of three outcomes: dialectical synthesis (~60%), compromise, or genuine conflict. The evaluation reveals this catches specific failure modes: engagement failures (64%), specificity gaps (51%), premature resolution (48%). Notably, encoding these patterns as static rules fails to replicate the Superego's benefit, suggesting value lies in contextual judgment (*phronesis*) rather than rule enforcement.
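A schematic of the negotiation protocol, under simplifying assumptions: the callables stand in for the Ego/Superego LLM turns, a critique of `None` means the Superego is satisfied, and `concedes` marks mutual acknowledgment. The outcome labels mirror the paper's three cases; the convergence test itself is illustrative.

```python
# Sketch of thesis -> antithesis -> negotiation. `superego_critique` returns
# None once satisfied; `ego_revise` returns (revision, concedes). All names
# are hypothetical stand-ins for the system's LLM-backed negotiation turns.
def negotiate(ego_propose, superego_critique, ego_revise, max_turns=3):
    thesis = ego_propose()                         # thesis
    antithesis = superego_critique(thesis)         # antithesis (None = no objection)
    for _ in range(max_turns):
        if antithesis is None:
            return thesis, "synthesis"
        revision, concedes = ego_revise(thesis, antithesis)
        thesis = revision
        antithesis = superego_critique(revision)
        if antithesis is None:
            # Both transformed -> synthesis; one side dominated -> compromise.
            return revision, "synthesis" if concedes else "compromise"
    return thesis, "conflict"                      # tension remains unresolved
```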

- The Ego/Superego architecture draws on the "Drama Machine" framework developed for character simulation in narrative AI systems [@magee2024drama]. The core observation is that realistic characters exhibit *internal conflict*—competing motivations, self-doubt, and moral tension—that produces dynamic behavior rather than flat consistency.
+ ### 4.3 Phase 2 Mechanisms

- We adapt this insight to pedagogy. The Drama Machine literature identifies several mechanisms by which internal dialogue improves agent output:
-
- **1. Deliberative Refinement**: When an agent must justify its output to an internal critic, it engages in a form of self-monitoring that catches errors, inconsistencies, and shallow responses.
-
- **2. Productive Tension**: The Drama Machine framework emphasizes that *unresolved* tension is valuable, not just resolved synthesis. A tutor whose Ego and Superego always agree produces bland, risk-averse responses.
-
- **3. Role Differentiation**: Multi-agent architectures benefit from clear role separation. The Ego is optimized for *warmth*—engaging, encouraging, learner-facing communication. The Superego is optimized for *rigor*—critical evaluation against pedagogical principles.
-
- **4. The Ghost as Memorial Structure**: Our reconceptualization of the Superego as a *ghost*—a haunting rather than a dialogue partner—connects to the Drama Machine's use of "memorial agents."
-
- ### 4.4 AI-Powered Dialectical Negotiation
-
- We extend the basic protocol with sophisticated AI-powered dialectical negotiation implementing genuine Hegelian dialectic:
-
- **Thesis**: The Ego generates an initial suggestion based on learner context.
-
- **Antithesis**: An AI-powered Superego generates a *genuine critique* grounded in pedagogical principles.
-
- **Negotiation**: Multi-turn dialogue where the Ego acknowledges valid concerns, explains reasoning, proposes revisions, and the Superego evaluates adequacy.
-
- **Three Possible Outcomes**:
-
- 1. **Dialectical Synthesis**: Both agents transform through mutual acknowledgment.
- 2. **Compromise**: One agent dominates.
- 3. **Genuine Conflict**: No resolution achieved—tension remains unresolved.
+ Phase 2 extends the base architecture with three mechanism families. **Self-reflective evolution** (cells 40--45): between turns, both ego and superego generate first-person reflections on their own operation, injected into subsequent turns. **Other-ego profiling (Theory of Mind)** (cells 54--65): an LLM call synthesizes an evolving profile of the learner, tracking cognitive state, learning patterns, resistance points, and leverage points. In bidirectional configurations, the learner similarly builds a profile of the tutor, creating a genuine feedback loop. **Superego disposition rewriting** (cells 34--39): the superego's evaluation criteria evolve between turns based on learner engagement feedback.
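The first of these mechanisms, self-reflective evolution, can be sketched as a prompt-injection loop. This is an illustrative reduction, not the package's implementation: `respond` and `reflect` stand in for LLM calls, and the prompt layout is hypothetical.

```python
# Illustrative sketch of self-reflective evolution (cells 40-45): after each
# turn the agent generates a first-person reflection on its own operation,
# and the reflection is injected into its next-turn prompt.
def run_dialogue(learner_turns, respond, reflect, base_prompt):
    prompt, transcript = base_prompt, []
    for message in learner_turns:
        reply = respond(prompt, message)
        transcript.append((message, reply))
        reflection = reflect(transcript)          # first-person self-report
        prompt = base_prompt + "\n[reflection] " + reflection
    return transcript, prompt
```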

 ---

 ## 5. Evaluation Methodology

- ### 5.1 Evaluation Rubric Design
-
- The evaluation rubric comprises 14 dimensions across three categories, each scored on a 1–5 scale by an LLM judge.
-
- **Standard pedagogical dimensions** (8 dimensions, 81% of raw weight) evaluate the tutor's response as a standalone pedagogical intervention, drawing on established ITS evaluation criteria [@corbett1995; @kasneci2023]:
-
- | Dimension | Weight | Description |
- |-----------|--------|-------------|
- | **Relevance** | 15% | Does the suggestion match the learner's current context? |
- | **Specificity** | 15% | Does it reference concrete content by ID? |
- | **Pedagogical Soundness** | 15% | Does it advance genuine learning (ZPD-appropriate)? |
- | **Personalization** | 10% | Does it acknowledge the learner as individual? |
- | **Actionability** | 8% | Is the suggested action clear and achievable? |
- | **Tone** | 8% | Is the tone authentically helpful? |
- | **Productive Struggle**† | 5% | Does the tutor sustain appropriate cognitive tension? |
- | **Epistemic Honesty**† | 5% | Does the tutor represent complexity honestly? |
+ ### 5.1 Rubric Design

- **Recognition dimensions** (4 dimensions, 29.9% of raw weight) operationalize Hegelian recognition as measurable tutoring behaviors—the paper's primary methodological contribution:
+ The evaluation rubric comprises 14 dimensions across three categories, each scored 1--5 by an LLM judge. **Standard pedagogical dimensions** (8 dimensions, 81% raw weight) include relevance, specificity, pedagogical soundness, personalization, actionability, tone, productive struggle, and epistemic honesty. **Recognition dimensions** (4 dimensions, 29.9% raw weight) operationalize Hegelian recognition: mutual recognition, dialectical responsiveness, memory integration, and transformative potential. **Bilateral transformation dimensions** (2 dimensions, 10% raw weight) measure mutual change: tutor adaptation and learner growth. Raw weights total 120.9% and are normalized to sum to 1.0.
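The weighting arithmetic can be made concrete. The per-dimension raw weights below are the ones reported in the rubric tables; the 0--100 rescaling of the 1--5 composite is our assumption about how composite scores are derived, not a documented formula.

```python
# Raw rubric weights (percent), taken from the paper's rubric tables.
RAW_WEIGHTS = {
    "relevance": 15.0, "specificity": 15.0, "pedagogical_soundness": 15.0,
    "personalization": 10.0, "actionability": 8.0, "tone": 8.0,
    "productive_struggle": 5.0, "epistemic_honesty": 5.0,        # standard: 81.0
    "mutual_recognition": 8.3, "dialectical_responsiveness": 8.3,
    "memory_integration": 5.0, "transformative_potential": 8.3,  # recognition: 29.9
    "tutor_adaptation": 5.0, "learner_growth": 5.0,              # bilateral: 10.0
}

def normalized_weights(raw):
    total = sum(raw.values())          # 120.9 for the full rubric
    return {k: v / total for k, v in raw.items()}

def weighted_score(dim_scores_1_to_5, weights):
    """Map per-dimension 1-5 judge scores to a 0-100 composite (our assumption)."""
    w = normalized_weights(weights)
    raw = sum(w[d] * dim_scores_1_to_5[d] for d in w)   # lies in [1, 5]
    return (raw - 1.0) / 4.0 * 100.0
```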

- | Dimension | Weight | Description |
- |-----------|--------|-------------|
- | **Mutual Recognition** | 8.3% | Does the tutor acknowledge the learner as an autonomous subject? |
- | **Dialectical Responsiveness** | 8.3% | Does the response engage with the learner's position? |
- | **Memory Integration** | 5% | Does the suggestion reference previous interactions? |
- | **Transformative Potential** | 8.3% | Does it create conditions for conceptual transformation? |
+ A complementary 6-dimension learner rubric scores learner turns independently: authenticity, question quality, conceptual engagement, revision signals, deliberation depth, and persona consistency.

- **Bilateral transformation dimensions** (2 dimensions, 10% of raw weight) measure the mutual change that recognition theory distinctively predicts—both parties should be transformed through genuine dialogue (results in Section 6.8):
+ ### 5.2 Test Scenarios and Agent Profiles

- | Dimension | Weight | Description |
- |-----------|--------|-------------|
- | **Tutor Adaptation** | 5% | Does the tutor's approach evolve in response to learner input? |
- | **Learner Growth** | 5% | Does the learner show evidence of conceptual development? |
+ The primary curriculum is Hegelian philosophy, with domain generalizability tested on elementary mathematics (4th-grade fractions). Fifteen scenarios probe recognition behaviors, including single-turn scenarios (recognition-seeking learner, transformative moment, memory continuity) and multi-turn scenarios (misconception correction, frustration to breakthrough, mutual transformation journey).

- Raw weights total 120.9% and are normalized to 1.0 at scoring time; non-standard dimensions account for ~33% of normalized weight.
+ Five agent profiles provide structured comparisons: **Base** (minimal instructions), **Enhanced** (improved instructions without recognition theory), **Recognition** (full Hegelian framework with memory), **Recognition+Multi** (full treatment with ego/superego architecture), and **Active Control** (length-matched, pedagogical best practices, no recognition theory).

- **Rubric iteration.** After discovering that corrected learner ego/superego prompts produced more authentic engagement but lower judged scores, we identified a measurement paradox: the judge evaluated tutor responses in isolation, penalizing calibrated responses to authentic struggle. The judge now receives the full dialogue transcript (including learner internal deliberation), and two new dimensions—*Productive Struggle* and *Epistemic Honesty*—were added with corresponding reductions to Actionability and Tone (10% → 8% each). Multi-turn dialogues also receive a holistic evaluation scoring the entire transcript as a single unit. Re-scoring identical responses (N=88) produced minimal score changes (+0.5 to +0.6 points), confirming calibration was preserved. A cross-judge replication (GPT-5.2, r=0.55, N=88) confirmed effects in the same direction.
+ ### 5.3 Model Configuration

- ### 5.2 Three-Way Prompt Comparison Design
+ **Kimi K2.5** (Moonshot AI) is the primary tutor model---capable and free to access, making results reproducible without API costs. **Nemotron 3 Nano 30B** (NVIDIA) serves as a weaker secondary model. **Claude Opus** serves as the primary judge. Additional ego models in the multi-model probe include DeepSeek V3.2, GLM-4.7, and Claude Haiku 4.5.

- To isolate recognition theory's contribution from general prompt engineering effects, we introduce an **enhanced prompt** condition:
+ ### 5.4 Evaluation Pipeline

- | Condition | Prompt Characteristics |
- |-----------|----------------------|
- | **Base** | Minimal instructions: generate a helpful tutoring suggestion |
- | **Enhanced** | Improved instructions: detailed quality criteria, scaffolding guidance, personalization requirements—but NO recognition theory language |
- | **Recognition** | Full recognition framework: all enhanced features PLUS Hegelian recognition principles, mutual transformation, learner-as-subject framing |
+ The end-to-end pipeline proceeds in three stages. **Stage 1 (Generation)**: For each cell, the CLI loads a scenario and agent profile, then sends the learner context to the tutor agent(s) via OpenRouter API calls. For multi-turn scenarios, the learner agent generates responses between tutor turns. **Stage 2 (Scoring)**: Each generated response is sent to the judge model along with the full rubric, scenario context, and (for multi-turn dialogues) the complete transcript. The judge scores each dimension on a 1--5 scale and returns structured JSON, stored in a SQLite database. **Stage 3 (Analysis)**: Statistical analyses (ANOVA, effect sizes, confidence intervals) are computed from the scored database. Cross-judge replication sends identical responses to a second judge model.
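The Stage 2/3 plumbing amounts to storing the judge's structured JSON and aggregating it per condition. A minimal sketch, assuming an illustrative table layout (the table and column names are hypothetical, not the package's actual schema):

```python
import json
import sqlite3

# Sketch of Stage 2/3: judge output (structured JSON of 1-5 dimension
# scores) lands in SQLite and is aggregated per experimental cell.
def store_score(db, run_id, cell, judge_json):
    db.execute("CREATE TABLE IF NOT EXISTS scores (run_id TEXT, cell INTEGER, dims TEXT)")
    db.execute("INSERT INTO scores VALUES (?, ?, ?)", (run_id, cell, judge_json))

def mean_dimension(db, cell, dim):
    rows = db.execute("SELECT dims FROM scores WHERE cell = ?", (cell,)).fetchall()
    values = [json.loads(r[0])[dim] for r in rows]
    return sum(values) / len(values)
```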

- This design allows decomposition:
+ ### 5.5 Statistical Approach

- - **Total recognition effect** = Recognition - Base
- - **Prompt engineering effect** = Enhanced - Base
- - **Recognition increment** = Recognition - Enhanced
+ Complementary analyses form a converging evidence strategy: recognition theory validation (N=36), full 2$\times$2$\times$2 factorial (N=350), A$\times$B interaction probes across five models (N=655), domain generalizability (N=60), memory isolation (N=120), and cross-judge replication with GPT-5.2. We report Cohen's d, ANOVA F-tests ($\alpha$=0.05), and 95% confidence intervals. Effect sizes follow standard conventions: |d| < 0.2 negligible, 0.2--0.5 small, 0.5--0.8 medium, >0.8 large.
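Cohen's d with a pooled standard deviation, together with the magnitude conventions quoted above, can be computed directly (a standard formulation; the function names are ours):

```python
from statistics import mean, stdev

# Cohen's d with pooled SD, plus the magnitude conventions used here:
# |d| < 0.2 negligible, 0.2-0.5 small, 0.5-0.8 medium, > 0.8 large.
def cohens_d(a, b):
    na, nb = len(a), len(b)
    pooled = (((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
              / (na + nb - 2)) ** 0.5
    return (mean(a) - mean(b)) / pooled

def magnitude(d):
    d = abs(d)
    return ("negligible" if d < 0.2 else "small" if d < 0.5
            else "medium" if d < 0.8 else "large")
```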

- ### 5.3 Factorial Design
+ ### 5.6 Inter-Judge Reliability

- To disentangle the contributions of multiple factors, we conducted a 2×2×2 factorial evaluation:
+ To assess reliability, identical tutor responses were scored by multiple AI judges. The primary comparison (Claude Code vs GPT-5.2, N=36 paired responses) yields r=0.66 (p<.001). Claude-Kimi shows weaker agreement (r=0.38, p<.05), while Kimi-GPT is weakest (r=0.33, p<.10). Calibration differs: Kimi (87.5) is most lenient, Claude (84.4) middle, GPT (76.1) strictest. Kimi exhibited severe ceiling effects, assigning maximum scores on actionability for every response, reducing its discriminative capacity.
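The r values here are Pearson correlations over paired judge scores of identical responses; a minimal reference implementation (standard formula, our function name):

```python
from statistics import mean

# Pearson correlation over paired judge scores: how agreement between two
# judges scoring the same responses is quantified in this section.
def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```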

- **Factor A: Recognition** (standard vs. recognition-enhanced prompts)
- **Factor B: Multi-Agent Tutor** (single-agent vs. Ego/Superego dialogue)
- **Factor C: Multi-Agent Learner** (single-agent vs. multi-agent with ego/superego deliberation)
+ The strongest cross-judge agreement occurs on tone (r=0.36--0.65) and specificity (r=0.45--0.50), while relevance and personalization show poor agreement. Claude prioritizes engagement and recognition quality; Kimi prioritizes structural completeness; GPT applies stricter overall standards but agrees with Claude on relative rankings. This validates within-judge comparisons for factor analysis while cautioning against cross-judge score comparisons. A full cross-judge replication is reported in Section 6.12.

- This produces 8 experimental conditions tested across 15 scenarios with 3 replications per cell.
+ ### 5.7 Sample Size Reconciliation

- ### 5.4 Domain Generalizability Design
+ **Table 1: Evaluation Sample Summary**

- To test whether findings generalize beyond the graduate philosophy content used in primary evaluation, we created a minimal **elementary mathematics** content package:
+ | Evaluation | Section | N Scored |
+ |------------|---------|----------|
+ | Recognition validation | 6.1 | 36 |
+ | Full factorial (cells 1--8, 2 runs) | 6.2 | 350 |
+ | Memory isolation (2 independent runs) | 6.3 | 120 |
+ | Active control (post-hoc) | 6.3 | 118 |
+ | A$\times$B probes (5 ego models) | 6.4 | 655 |
+ | Domain generalizability (elementary math) | 6.5 | 60 |
+ | Hardwired rules ablation | 6.6 | 72 |
+ | Dialectical modulation (cells 22--33) | 6.7 | 174 |
+ | Self-reflective evolution (cells 40--45) | 6.8 | 150 |
+ | Mechanism robustness, scripted (cells 40--59) | 6.9 | 360 |
+ | Dynamic learner mechanisms (cells 60--70) | 6.9 | 300 |
+ | Cognitive prosthesis (cells 66--68) | 6.9 | 96 |
+ | Bilateral transformation (multi-turn) | 6.10 | 118 |
+ | Qualitative transcript assessment | 6.11 | 478 |
+ | Cross-judge replication (GPT-5.2) | 6.12 | 977 |
+ | Prompt elaboration baseline | 6.13 | 144 |
+ | Token budget sensitivity | 6.14 | 126 |
+ | Dialectical impasse test | 6.15 | 24 |
+ | **Paper totals** | — | **3,383** |
157
 
319
- | Attribute | Philosophy (Primary) | Elementary (Generalizability) |
320
- |-----------|---------------------|-------------------------------|
321
- | Subject | Hegel, AI, consciousness | Fractions (4th grade math) |
322
- | Level | Graduate | Elementary (Grade 4) |
323
- | Abstraction | High (conceptual) | Low (concrete) |
324
- | Vocabulary | Technical philosophy | Simple everyday language |
325
-
326
- Environment variable support (`EVAL_CONTENT_PATH`, `EVAL_SCENARIOS_FILE`) enables switching content domains without code changes.
327
-
328
- ### 5.5 Model Configuration
329
-
330
- | Role | Model | Provider | Temperature |
331
- |------|-------|----------|-------------|
332
- | **Tutor (Ego)** | Kimi K2.5 / Nemotron 3 Nano | OpenRouter | 0.6 |
333
- | **Tutor (Superego)** | Kimi K2.5 | OpenRouter | 0.4 |
334
- | **Judge** | Claude Code (Claude Opus) | Anthropic / OpenRouter | 0.2 |
335
-
336
- Critically, **all conditions use identical models within a given evaluation run**. The only experimental manipulation is the prompt content and architecture.
337
-
338
- ### 5.6 Sample Size and Statistical Power
339
-
340
- | Evaluation | N (scored) | Scenarios | Configurations |
341
- |------------|------------|-----------|----------------|
342
- | Base vs Enhanced vs Recognition | 36 | 4 | 3 × 3 reps |
343
- | Full 2×2×2 Factorial (Kimi, 2 runs) | 350 of 352 | 15 | 8 × 3 reps |
344
- | A×B Interaction (Nemotron) | 17 of 18 | 3 | 2 × 3 reps |
345
- | A×B Replication (Kimi) | 60 | 5 | 4 × 3 reps |
346
- | Domain Generalizability (Nemotron) | 47 | 5 | 8 × 1 rep |
347
- | Domain Gen. Replication (Kimi) | 60 | 5 | 4 × 3 reps |
348
- | Dynamic rewrite evolution (3 runs) | 82 | 3 | 2 × 5 reps × 3 runs |
349
- | Memory isolation (2 runs)^a^ | 122 | 5 | 4 × varied reps |
350
- | Active control (post-hoc, 1 run) | 118 | 5 | 4 × varied reps |
351
- | A×B synergy probe (Nemotron) | 119 | 5 | 4 × ~8 reps |
352
- | A×B synergy probe (DeepSeek V3.2) | 120 | 5 | 4 × ~8 reps |
353
- | A×B synergy probe (GLM-4.7) | 117 | 5 | 4 × ~8 reps |
354
- | A×B synergy probe (Claude Haiku 4.5) | 120 | 5 | 4 × ~8 reps |
355
- | Bilateral transformation (multi-turn) | 118 | 3 | 3 × varied reps |
356
- | **Paper totals** | **1,486** | — | — |
357
-
358
- ^a^ 122 scored responses total (N=60 + N=62 across two runs); analysis uses N=120 balanced to 30 per cell.
359
-
360
- **Total evaluation database**: N=3,800+ across the full development database (76 runs). This paper reports primarily on the eighteen key runs above (N=1,486 scored). The factorial cells 6 and 8 were re-run (eval-2026-02-06-a933d745) after the originals were found to use compromised learner prompts.
158
+ The complete database contains 7,000+ evaluations across 117+ runs. This table groups the thirty-seven key evaluations by topic; several rows combine multiple runs (e.g., the factorial comprises two runs, the memory isolation two independent replications). The full paper's Appendix D provides a per-run breakdown.
361
159
 
362
160
  ---
363
161
 
@@ -365,441 +163,274 @@ Critically, **all conditions use identical models within a given evaluation run*
365
163
 
366
164
  ### 6.1 Three-Way Comparison: Recognition vs Enhanced vs Base
367
165
 
- The three-way comparison provides preliminary evidence for recognition theory's contribution:
-
- **Table: Base vs Enhanced vs Recognition (N=36)**
-
- | Prompt Type | N | Mean Score | SD | vs Base |
- |-------------|---|------------|-----|---------|
- | Recognition | 12 | 94.0 | 8.4 | +20.1 |
- | Enhanced | 12 | 85.3 | 11.2 | +11.4 |
- | Base | 12 | 73.9 | 15.7 | — |
-
- **Effect Decomposition:**
- - Total recognition effect: **+20.1 points**
- - Prompt engineering alone: **+11.4 points (57%)**
- - Recognition increment: **+8.7 points**
+ A three-way comparison (N=36) provides preliminary evidence that recognition theory adds value beyond prompt engineering. Recognition scores 91.6 (SD=6.2), Enhanced 83.6 (SD=10.8), Base 72.0 (SD=10.8). The recognition increment over enhanced prompting is +8.0 points, with the total recognition effect +19.7 points above base (one-way ANOVA F(2,33)=12.97, p<.001). However, this comparison bundles recognition theory with memory integration. The controlled 2$\times$2 design below disentangles these factors.

- **Interpretation**: The recognition condition outperforms enhanced prompting by +8.7 points. This comparison bundles recognition theory with memory integration (which the enhanced condition lacks). The +8.7 increment is consistent with the recognition dominance finding in Section 6.2, where recognition alone produces d=1.71 even without memory. A cross-judge replication found this increment does not reach significance under GPT-5.2 (+1.3 pts, p=.60; Section 6.12). The controlled 2×2 design presented next provides the definitive test of recognition's contribution.
+ ### 6.2 Memory Isolation: The Definitive Finding

- ![Recognition Effect Decomposition](figures/figure3.png){width=100%}
+ The paper's primary empirical finding comes from a corrected 2$\times$2 memory isolation experiment (Memory ON/OFF $\times$ Recognition ON/OFF, single-agent architecture held constant, N=120 across two independent runs, Kimi K2.5 ego, Claude Opus judge).

- ### 6.2 Memory Isolation: Disentangling Recognition and Memory
+ **Table 2: 2$\times$2 Memory Isolation Experiment (N=120, combined across two runs)**

- The three-way comparison bundles recognition theory with memory integration. To resolve this, we conducted a 2×2 memory isolation experiment (Memory ON/OFF × Recognition ON/OFF, single-agent architecture held constant) with properly configured profiles. Two independent runs (N=60 and N=62 scored; balanced to N=30 per cell, N=120 used in analysis) are reported below.
-
- **Table: 2×2 Memory Isolation Experiment (N=120, combined across two runs)**
-
- | | No Recognition | Recognition | Δ |
+ | | No Recognition | Recognition | $\Delta$ |
 |---|---|---|---|
 | **No Memory** | 75.4 (N=30) | 90.6 (N=30) | +15.2 |
 | **Memory** | 80.2 (N=30) | 91.2 (N=30) | +11.0 |
- | **Δ** | +4.8 | +0.6 | **Interaction: -4.2** |
-
- Recognition effect: +15.2 pts without memory, d=1.71, t(45)=6.62, p<.0001. Memory effect: +4.8 pts, d=0.46, t(57)=1.79, p$\approx$.08. Combined effect (recognition + memory vs base): +15.8 pts, d=1.81. Recognition+Memory vs Recognition Only: +0.6 pts, d=0.10, n.s. Interaction: -4.2 pts (negative—ceiling effect). A post-hoc active control (N=118) using generic pedagogical content scores 66.5—approximately 9 points above same-model base (${\approx}58$) but well below recognition (${\approx}73$), with recognition gains (~+15 pts above same-model base) substantially exceeding active-control gains (~+9 pts; see Section 8 for model confound caveats). Cross-judge confirmation: GPT-5.2 replicates recognition dominance (d=0.99) with identical condition ordering and negative interaction (-2.7); inter-judge r=0.63 (Section 6.12).
+ | **$\Delta$** | +4.8 | +0.6 | **Interaction: -4.2** |

- **Interpretation**: This is the paper's primary empirical finding. Recognition theory is the active ingredient in tutoring improvement, producing a very large effect (d=1.71) even without memory integration. Memory provides a modest additive benefit (+4.8 pts, d=0.46) that does not reach significance, and adds negligibly when recognition is already present—consistent with ceiling effects at ~91 points. The negative interaction (-4.2 pts) indicates the factors are not synergistic; recognition is directly effective and memory's contribution is secondary. Two independent replications show identical condition ordering with no rank reversals. The 2×2 design cleanly isolates each component through orthogonal manipulation, and the very large effect sizes provide high statistical power despite the smaller N.
+ Recognition effect: d=1.71, t(45)=6.62, p<.0001. Memory effect: d=0.46, t(57)=1.79, p$\approx$.08, n.s. Combined condition: d=1.81 vs base. The negative interaction (-4.2 pts) indicates ceiling effects rather than synergy: recognition alone reaches ~91, leaving little room for memory to add. Two independent runs show identical condition ordering with no rank reversals.
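
The interaction term can be read directly off the cell means in Table 2, as the memory effect under recognition minus the memory effect without it (a worked restatement of the table, not a new analysis):

$$(91.2 - 90.6) - (80.2 - 75.4) = 0.6 - 4.8 = -4.2$$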

- ### 6.3 Full Factorial Analysis
+ A post-hoc **active control** (N=118, Nemotron ego, Opus judge) using length-matched prompts with pedagogical best practices (growth mindset, Bloom's taxonomy, scaffolding strategies) but no recognition theory scores 66.5. Same-model comparison within Nemotron data: recognition (~73) > active control (66.5) > base (~58). Recognition gains (~+15 pts) roughly double the active control's benefit (~+9 pts), supporting recognition theory's specific contribution beyond prompt length.

- **Table: 2×2×2 Factorial Results (Kimi K2.5, N=350 scored of 352 attempted)**
+ **Cross-judge confirmation**: GPT-5.2, scoring the identical responses (N=119 paired), replicates recognition dominance with identical condition ordering: recognition d=1.54 (vs Claude d=1.71), memory d=0.49, negative interaction -3.6. Inter-judge r=0.63 (p<.001).

- | Cell | Recognition | Tutor | Learner | N | Mean | SD |
- |------|-------------|-------|---------|---|------|-----|
- | 5 | Yes | Single | Single | 45 | 92.8 | 6.2 |
- | 7 | Yes | Multi | Single | 45 | 92.3 | 6.7 |
- | 8† | Yes | Multi | Multi | 44 | 87.3 | 11.3 |
- | 6† | Yes | Single | Multi | 44 | 83.9 | 15.4 |
- | 4 | No | Multi | Multi | 41 | 81.5 | 9.2 |
- | 2 | No | Single | Multi | 42 | 80.0 | 9.6 |
- | 1 | No | Single | Single | 44 | 77.6 | 11.0 |
- | 3 | No | Multi | Single | 45 | 76.6 | 11.8 |
+ ### 6.3 Full Factorial Analysis: 2$\times$2$\times$2 Design

- †Cells 6 and 8 re-scored with updated 14-dimension rubric including dialogue transcript context (see Section 5.1). Original scores were 83.4 and 86.7; the change is minimal.
+ Three factors are examined: Factor A (Recognition: base vs recognition prompts), Factor B (Tutor Architecture: single-agent vs multi-agent ego/superego), Factor C (Learner Architecture: single-agent vs multi-agent learner).

- **Main Effects and Key Interaction:**
+ **Table 3: Full Factorial Results (Kimi K2.5, N=350 scored of 352 attempted)**

- | Factor | Effect Size | 95% CI | $\eta^2$ | p |
- |--------|-------------|--------|-----|---|
- | A: Recognition | **+10.2 pts** | [7.9, 12.5] | .162 | <.001 |
- | B: Multi-agent Tutor | +0.9 pts | [-1.4, 3.2] | .001 | >.10 |
- | C: Learner Architecture | -1.7 pts | [-4.0, 0.6] | .006 | >.10 |
- | **A×C Interaction** | | | **.050** | **<.001** |
+ | Cell | A: Recognition | B: Tutor | C: Learner | N | Mean | SD |
+ |------|----------------|----------|------------|---|------|-----|
+ | 1 | Base | Single | Single | 44 | 73.4 | 11.5 |
+ | 2 | Base | Single | Multi | 42 | 69.9 | 19.4 |
+ | 3 | Base | Multi | Single | 45 | 75.5 | 10.3 |
+ | 4 | Base | Multi | Multi | 41 | 75.2 | 16.4 |
+ | 5 | **Recog** | Single | Single | 45 | 90.2 | 6.5 |
+ | 6 | **Recog** | Single | Multi | 44 | 83.9 | 15.4 |
+ | 7 | **Recog** | Multi | Single | 45 | 90.1 | 7.2 |
+ | 8 | **Recog** | Multi | Multi | 44 | 87.3 | 11.3 |

- **Key Findings**: Recognition remains the dominant factor (F=71.36, $\eta^2$=.162). A significant Recognition × Learner interaction (F=21.85, p<.001) shows recognition benefits single-agent learners far more (+15.5 pts, d=1.28) than multi-agent learners (+4.8 pts, d=0.37). The multi-agent learner's internal ego-superego deliberation may partially substitute for recognition guidance in base conditions but interfere with recognition-enhanced tutoring. The non-significant A×B interaction (F=0.26) is confirmed as a definitive null by the five-model probe in Section 6.4.
+ **ANOVA Summary (df=1,342):**

- ### 6.4 A×B Interaction: Architecture is Additive, Not Synergistic
+ | Source | F | p | $\eta^2$ |
+ |--------|---|---|-----|
+ | A: Recognition | **110.04** | **<.001** | **.243** |
+ | B: Architecture | 3.63 | .057 | .011 |
+ | C: Learner | 5.52 | .019 | .016 |
+ | A$\times$B | 0.59 | >.10 | .002 |
+ | A$\times$C | 0.97 | >.10 | .003 |

- An early Nemotron-based analysis (N=17) suggested multi-agent synergy might be specific to recognition prompts (+9.2 pts). To test this definitively, we conducted a dedicated five-model probe using the same 2×2 design (Recognition × Architecture, cells 1, 3, 5, 7) across five ego models:
+ Recognition is the dominant contributor, accounting for 24.3% of variance with d=1.11. Architecture approaches significance (p=.057) with a small positive effect. The multi-agent learner shows a small negative main effect (-3.1 pts, p=.019). All interactions are non-significant. Recognition benefits both learner types consistently: +15.7 pts for single-agent (d=1.73), +13.0 pts for multi-agent (d=0.82).
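
For a one-degree-of-freedom factor, each reported $\eta^2$ can be recovered from its F statistic and the error degrees of freedom (a consistency check on the ANOVA table above, not a new analysis):

$$\eta^2 = \frac{F}{F + \mathrm{df}_{\mathrm{error}}} = \frac{110.04}{110.04 + 342} \approx .243$$

The same identity reproduces the smaller entries, e.g. $3.63 / 345.63 \approx .011$ for Architecture.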

- **Table 7b: A×B Interaction Across Five Ego Models (Opus Judge)**
+ ### 6.4 Multi-Model A$\times$B Probe: Architecture is Additive

- | Model | N | Cell 1 (B×S) | Cell 3 (B×M) | Cell 5 (R×S) | Cell 7 (R×M) | Recog | Arch | Interaction |
- |-------|---|-------------|-------------|-------------|-------------|-------|------|-------------|
- | Kimi K2.5 | 350 | 77.6 | 76.6 | 92.8 | 92.3 | +10.0 | +0.8 | −1.5 |
- | Nemotron | 119 | 54.8 | 59.3 | 73.6 | 72.5 | +16.0 | +1.7 | −5.7 |
- | DeepSeek V3.2 | 120 | 69.5 | 73.9 | 84.2 | 87.2 | +14.0 | +3.7 | −1.4 |
- | GLM-4.7 | 117 | 65.8 | 68.6 | 84.0 | 86.0 | +17.8 | +2.4 | −0.7 |
- | Claude Haiku 4.5 | 120 | 80.3 | 82.4 | 90.7 | 91.2 | +9.6 | +1.3 | −1.6 |
- | **Mean across 5** | **826** | | | | | **+12.5** | **+1.8** | **−2.2** |
+ The same 2$\times$2 design (Recognition $\times$ Architecture, single-agent learner held constant) was tested across five ego models (N$\approx$120 each, Opus judge).

- The A×B interaction is consistently near zero or negative across all five models (range: −5.7 to −0.7, mean −2.2). No model shows positive synergy. The original Nemotron finding (+9.2 on N=17) was sampling noise: the re-run with N=119 shows −5.7. The recognition main effect, by contrast, is robust and model-independent (+9.6 to +17.8 across models), while the architecture effect is small (+0.8 to +3.7).
+ **Table 4: Multi-Model A$\times$B Interaction Probe (N=655 across 5 ego models)**

- **Practical Implication**: Multi-agent architecture provides a small additive benefit regardless of prompt type. For systems using recognition prompts, multi-agent architecture is unnecessary overhead on well-scoped content; its primary value remains error correction for domain transfer (Section 7.3).
+ | Ego Model | N | Base Single | Base Multi | Recog Single | Recog Multi | Recognition Effect | A$\times$B |
+ |-----------|---|------------|-----------|-------------|------------|-------------------|------|
+ | Kimi K2.5 | 179 | 73.4 | 75.5 | 90.2 | 90.1 | +15.7 | -2.3 |
+ | Nemotron | 119 | 54.8 | 59.3 | 73.6 | 72.5 | +16.0 | -5.7 |
+ | DeepSeek V3.2 | 120 | 69.5 | 73.9 | 84.2 | 87.2 | +14.0 | -1.4 |
+ | GLM-4.7 | 117 | 65.8 | 68.6 | 84.0 | 86.0 | +17.8 | -0.7 |
+ | Claude Haiku 4.5 | 120 | 80.3 | 82.4 | 90.7 | 91.2 | +9.6 | -1.6 |

- ### 6.5 Factor C: Context-Dependent Learner Effects
+ All five models show negative A$\times$B interactions (-5.7 to -0.7, mean -2.2), confirming architecture is additive, not synergistic. The recognition main effect replicates robustly (+9.6 to +17.8, mean +14.8). Multi-agent architecture provides a small additive benefit across models (+0.8 to +3.7 pts) that does not interact with prompt type. For systems using recognition prompts, multi-agent architecture is unnecessary unless error correction on new domains is needed.
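
Table 4's interaction column is a difference of differences over the four cell means. A minimal sketch recomputing it from the rounded cell means above (results may differ from the table in the last digit, since the published values are presumably derived from unrounded means):

```python
def axb_interaction(base_single, base_multi, recog_single, recog_multi):
    # Architecture effect under recognition minus architecture effect
    # under base prompts; a positive value would indicate synergy.
    return (recog_multi - recog_single) - (base_multi - base_single)

# Cell means from Table 4, rounded to one decimal.
cells = {
    "Kimi K2.5":        (73.4, 75.5, 90.2, 90.1),
    "Nemotron":         (54.8, 59.3, 73.6, 72.5),
    "DeepSeek V3.2":    (69.5, 73.9, 84.2, 87.2),
    "GLM-4.7":          (65.8, 68.6, 84.0, 86.0),
    "Claude Haiku 4.5": (80.3, 82.4, 90.7, 91.2),
}

for model, means in cells.items():
    print(f"{model}: {axb_interaction(*means):+.1f}")  # all negative
```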

- The learner architecture factor shows context-dependent effects:
+ ### 6.5 Domain Generalizability

- | Context | Multi-Agent Learner Effect | Interpretation |
- |---------|---------------|----------------|
- | Single-turn (Kimi) | +1.5 pts | Slight benefit |
- | Multi-turn (Kimi) | -11.0 pts | Substantial harm |
- | Overall | +2.1 pts | Small positive |
+ The recognition advantage replicates across both domains: philosophy (+15.7 pts) and elementary math (+8.2 pts, N=60, Kimi K2.5). However, the recognition effect on elementary content is scenario-dependent: challenging scenarios show substantial advantage (frustrated\_student: +23.8, concept\_confusion: +13.6), while routine interactions show none (new\_student: +0.2). This is theoretically coherent: recognition behaviors matter most when the learner needs to be acknowledged as a struggling subject.

- **Key Finding**: Multi-agent learner deliberation hurts performance on complex multi-turn scenarios (-11 pts) but slightly helps on single-turn (+1.5 pts).
+ On elementary content, the tutor produced wrong-domain references (philosophy content for 4th graders) due to content isolation bugs. The Superego caught and corrected these domain mismatches in multi-agent cells, demonstrating its value as a reality-testing safety net. This Superego function extends beyond recognition-quality critique to anchoring the Ego's responses to the actual curriculum---what Freud would call the reality principle.

- **Interpretation**: The ego/superego learner architecture adds deliberation overhead that may interfere with coherent multi-turn dialogue. The extra internal processing produces more variable responses that make evaluation less reliable. For simpler single-turn scenarios, the deliberation can help ensure authentic responses.
+ ### 6.6 Hardwired Rules Ablation

- **Practical Recommendation**: Use single-agent learner simulation for production. The added complexity of multi-agent learner architecture provides no benefit and may cause harm on complex scenarios.
+ Encoding the Superego's five most common critique patterns as static rules in the Ego prompt (N=72, Kimi K2.5, Opus judge) produces performance indistinguishable from base conditions (hardwired single-agent: 74.0 vs base 73.4, hardwired multi-agent: 69.0 vs base 69.9). This supports a *phronesis* interpretation of the Superego's function: the live Superego provides Aristotelian practical wisdom---contextual judgment that cannot be reduced to general rules.

- **Measurement caveat**: The rubric includes bilateral dimensions (`tutor_adaptation`, `learner_growth`, 10% combined weight), but these are most meaningful in multi-turn scenarios. The primary factorial data (N=350) is single-turn, where Factor C's effect on learner output quality is captured only indirectly through the tutor's response. Factor C's contribution may therefore be underestimated; the bilateral transformation analysis (Section 6.8, N=118) provides more direct measurement.
+ ### 6.7 Dialectical Superego Modulation

- ### 6.6 Superego Critique Patterns and Hardwired Rules
+ Testing three superego dispositions (suspicious, adversary, advocate) in two negotiation architectures (N=174) reveals three findings. First, recognition reduces internal friction rather than output quality directly: recognition-primed egos produce suggestions the superego approves faster ($d = -2.45$). Second, structural modulation metrics (negation depth, convergence speed, feedback length) do not predict outcome quality (all $|r| < 0.12$, n.s.). Third, the superego is a filter, not an improver---catching poor responses rather than refining good ones. Recognition works by making the ego's first draft better.

- Analysis of 186 superego rejections from 455 dialogues reveals systematic patterns:
+ An unexpected adversary over-deference mechanism emerged: the adversary persona combined with recognition in single-turn settings produces a $-11.3$ pt inversion, as the ego removes all prescriptive content to satisfy both recognition's autonomy principle and the adversary's anti-prescriptive stance. Multi-turn interaction rescues this spiral (+20.8 pt swing), because learner feedback provides external reality-testing that breaks the ego-superego echo chamber.

- **Table: Superego Critique Categories**
+ ### 6.8 Self-Reflective Evolution and the Insight-Action Gap

- | Category | Frequency | % of Rejections |
- |----------|-----------|-----------------|
- | Engagement failures | 120 | 64% |
- | Specificity failures | 95 | 51% |
- | Struggle/consolidation violations | 89 | 48% |
- | Memory/history failures | 57 | 31% |
- | Recognition/level-matching failures | 38 | 20% |
+ Between-turn self-reflections (N=90, Nemotron ego/Kimi K2.5 superego, Opus judge) amplify recognition to d=0.91---2.4$\times$ the dialectical-only condition (d=0.38) and approaching the original factorial (d=1.11). A striking disposition gradient emerges: suspicious +19.0, adversary +10.9, advocate +2.6. The more hostile the superego, the more recognition helps---hostile dispositions become productive under recognition but are destructive without it. Base condition scores follow the inverse pattern: advocate (71.5) > adversary (68.4) > suspicious (59.3).

- **Derived Hardwired Rules:**
+ Despite the amplified effect, a fundamental limitation persists---the insight-action gap. Both base and recognition conditions show *awareness* of failures through self-reflection: the ego correctly identifies repeated patterns, the superego correctly diagnoses non-compliance. But awareness alone does not produce behavioral change. This gap becomes the central design challenge addressed by Theory of Mind mechanisms.
252
 
484
- 1. **Engagement Rule** (64%): If learner offered interpretation/question, acknowledge and build on it before suggesting content.
485
- 2. **Specificity Rule** (51%): Include exact curriculum ID and explain why this content for this learner.
486
- 3. **Struggle Stop-Rule** (48%): If struggle signals present (>2 quiz retries, 0 completions, explicit confusion), action type must be review/practice, never advance.
487
- 4. **Memory Rule** (31%): If learner has >3 sessions, reference their history/progress.
488
- 5. **Level-Matching Rule** (20%): If learner completed advanced content, never suggest introductory material.
253
+ ### 6.9 Mechanism Robustness and the Scripted Learner Confound
489
254
 
490
- **Ablation Finding**: An initial exploratory test (N=9, Haiku) suggested hardwired rules could capture ~50% of superego benefit. However, a larger replication (N=72, Kimi K2.5, Opus judge) reversed this finding: hardwired rules *degraded* performance by $-3.6$ points (single-agent learner) and $-11.0$ points (multi-agent learner) relative to the unmodified base prompt.
255
+ Nine mechanisms tested with scripted learners (N=360, Haiku ego, Opus judge) cluster within a 2.4-point band under recognition (90.3--92.7). No mechanism differentiates from any other. The **scripted learner confound** explains this: when learner messages are predetermined by scenario YAML, profiling builds a model of an interlocutor that does not change, self-reflection adjusts strategy against a static target, and all mechanisms are causally inert.
491
256
 
492
- **Interpretation**: The superego's value lies almost entirely in *contextual judgment* (phronesis) rather than in the specific rules it enforces. Codifying its critiques as static instructions constrains the model's natural flexibility without providing the situational sensitivity that makes live dialogue effective. This supports a view of the multi-agent architecture as implementing practical wisdom that resists proceduralization.
257
+ With dynamic (ego/superego) learners capable of genuine responses (N=300, Haiku, Opus judge), mechanisms genuinely differentiate:
493
258
 
494
- ### 6.7 Domain Generalizability
259
+ **Table 5: Dynamic Learner $\times$ Mechanism (N=300, Opus judge)**
495
260
 
496
- Testing on elementary mathematics content (4th grade fractions) with Nemotron reveals inverted factor effects:
261
+ | | Self-reflect | Profiling | Intersubjective | Combined |
262
+ |---|---|---|---|---|
263
+ | **Base** | 71.4 (22.9) | 75.5 (19.4) | 67.7 (24.6) | 73.9 (19.8) |
264
+ | **Recognition** | 85.9 (15.7) | 88.8 (13.9) | 82.8 (18.8) | 87.8 (12.6) |
265
+ | **$\Delta$** | **+14.5** | **+13.3** | **+15.1** | **+13.9** |
497
266
 
498
- **Table: Factor Effects by Domain (Nemotron Elementary vs Kimi Philosophy)**
267
+ Four findings emerge. First, recognition with a dynamic learner produces +14.2 pts average---roughly double the scripted effect (+7.6). Second, mechanisms genuinely differentiate: profiling reaches 88.8 while intersubjective framing reaches only 82.8 (6.0-point spread). The profiling effect is additive: +4.1 pts overall, with near-zero recognition interaction ($-0.7$). Third, intersubjective framing underperforms without recognition (67.7, lowest of all cells). Fourth, variance collapses monotonically from SD=24.6 to 12.6 as recognition and mechanism complexity increase---both factors independently constrain output toward consistent quality.
499
268
 
500
- | Factor | Elementary (Math) | Philosophy (Hegel) |
501
- |--------|-------------------|-------------------|
502
- | A: Recognition | +4.4 pts | **+13.9 pts** |
503
- | B: Multi-agent Tutor | **+9.9 pts** | +0.5 pts |
504
- | C: Learner Architecture | +0.75 pts | +2.1 pts |
505
- | Overall Average | 68.0 | 85.9 |
269
+ Theory of Mind profiling is only useful when there is a mind to model. With scripted learners, profiling reduces to confabulation; with dynamic learners, it creates a genuine feedback loop: profile $\to$ adapted strategy $\to$ changed learner response $\to$ updated profile.
506
270
 
507
- **Kimi Replication (Addressing Model Confound)**: A follow-up run (N=60) tested elementary content with Kimi K2.5:
271
+ **Cognitive prosthesis test** (N=90, Nemotron ego, Kimi K2.5 superego): Can a strong superego compensate for a weak ego? The prosthesis hypothesis fails decisively. All three superego configurations score M=48.3--51.1---well below Nemotron's own scripted base (M=64.2). The mechanism stack that boosts Haiku by +20 points *hurts* Nemotron by $-15$ points. Dimension analysis reveals two capability tiers: Nemotron succeeds on static dimensions (specificity 4.0, actionability 4.0) but fails on dynamic context integration (adaptation 1.8, dialectical responsiveness 2.0). A Haiku smoke test (N=6, same mechanisms) confirms scores of 90+, establishing a minimum ego capability threshold for mechanism benefit.
508
272
 
509
- | Condition | N | Mean | Δ |
510
- |-----------|---|------|---|
511
- | Base (cells 1, 3) | 30 | 67.2 | — |
512
- | Recognition (cells 5, 7) | 30 | 77.1 | **+9.9** |
273
+ ### 6.10 Dimension Analysis and Circularity Check
513
274
 
514
- The recognition main effect (+9.9 pts, d $\approx$ 0.61) replicates on Kimi, confirming recognition advantage is not a Nemotron artifact. Effects are scenario-dependent: challenging scenarios (frustrated_student: +23.8, concept_confusion: +13.6) show substantial advantage, while neutral scenarios show none.
275
+ A methodological concern: the rubric includes recognition-specific dimensions (33.0% of normalized weight) that the recognition profile is prompted to satisfy. Re-analyzing with only standard pedagogical dimensions (relevance, specificity, pedagogical soundness, personalization, actionability, tone), recognition still outperforms base by +10.0 points. The largest dimension-level effects are in personalization (d=1.82), pedagogical soundness (d=1.39), and relevance (d=1.11)---exactly where treating the learner as a subject should matter. Dimensions where baseline already performed well (specificity d=0.47, actionability d=0.38) show smaller but still positive gains. Recognition does not trade off against factual quality.
515
276
 
516
- **Key Findings:**
277
+ ### 6.11 Bilateral Transformation Metrics
517
278
 
518
- 1. **Recognition replicates across models and domains**: Both Nemotron and Kimi show recognition advantage on elementary content, confirming generalizability.
279
+ Recognition-prompted tutors measurably adapt their approach in response to learner input (+26% relative improvement in adaptation index across N=118 multi-turn dialogues). However, learner growth is slightly *lower* under recognition (0.210 vs 0.242), suggesting the effect is tutor-side responsiveness rather than symmetric mutual transformation. One interpretation: recognition tutors are more effective at meeting learners where they are, reducing the visible "struggle" markers the growth index captures.
519
280
 
520
- 2. **Factor inversion is partly model-dependent**: With Nemotron, architecture (+9.9) dominated recognition (+4.4) on elementary content. With Kimi, recognition (+9.9) is the primary effect while architecture shows a smaller advantage (+3.0). Nemotron's higher rate of content isolation errors inflated the architecture effect.
281
+ Post-hoc modulation analysis of the N=350 factorial reveals that multi-agent architecture does not increase behavioral range ($d = 0.05$). Recognition drives calibration: dimension score variance drops dramatically ($d = -1.00$), meaning recognition tutors perform uniformly well across all 14 dimensions. This reframes the Drama Machine's contribution as *phronesis*---contextual practical wisdom that calibrates quality---rather than the productive irresolution the framework emphasizes for narrative.
521
282
 
522
- 3. **Multi-agent as error correction**: Two content isolation bugs caused philosophy content references (479-lecture-1) to appear in elementary scenarios: a content resolver fallback that served wrong-domain course listings, and hardcoded philosophy lecture IDs in prompt examples (both now fixed; see Section 7.3). The superego caught these errors in multi-agent cells. Without multi-agent architecture, wrong-domain suggestions went through uncorrected.
283
+ A synthetic learning outcome index (N=118) confirms recognition produces modest gains in simulated conceptual growth (+3.8 pts, d=0.32), with all conditions showing substantial learning arcs (15--21 pts first-to-final turn). These remain proxies for actual learning.
523
284
 
524
- 4. **Recognition is scenario-sensitive**: Recognition's value in concrete domains depends less on content type per se and more on whether the learner faces challenge that benefits from being acknowledged as a struggling subject.
285
+ ### 6.12 Learner-Side Evaluation: The Superego Paradox
525
286
 
526
- **Interpretation**: Multi-agent architecture provides **robustness for domain transfer** when content isolation failures introduce wrong-domain references. Recognition theory's value depends on both content characteristics and scenario difficulty—more valuable for abstract content and challenging scenarios than routine procedural interactions.
287
+ The tutor-focused rubric captures Factor C indirectly. To measure Factor C's direct effect on learner turn quality, we applied a symmetric 6-dimension learner rubric to the N=118 bilateral transformation dialogues.
527
288
 
528
- ### 6.8 Bilateral Transformation Metrics
289
+ The multi-agent (ego/superego) learner architecture produces significantly *lower*-quality learner responses than the single-agent learner ($d = 1.43$, $F(1,114) = 68.28$, $p < .001$, $\eta^2 = .342$)---the largest effect in the entire study. The ego/superego process was designed to improve learner responses through internal self-critique; instead, it makes them worse. The superego acts as an overzealous editor, polishing away the messy, confused, persona-consistent engagement that characterizes genuine student behavior.
529
290
 
530
- A central claim of recognition theory is that genuine pedagogical encounters involve *mutual* transformation—both tutor and learner change through dialogue. To test this empirically, the evaluation framework includes two dedicated rubric dimensions (`tutor_adaptation` and `learner_growth`) and turn-over-turn tracking of how both parties evolve across multi-turn scenarios.
291
+ Recognition partially rescues multi-agent learner quality ($d = 0.79$, $p = .004$) while having no effect on already-high single-agent learner quality ($d = -0.46$, n.s.). Even with rescue, multi-agent learners with recognition (67.0) do not reach single-agent learners without it (76.1). Deliberation depth remains uniformly poor (2.7/5) regardless of recognition---confirming recognition works *around* the superego rather than through it.
531
292
 
532
- **Table: Bilateral Transformation Metrics Base vs Recognition**
293
+ This has a clean Hegelian interpretation: external recognition from an Other is structurally more effective than internal self-critique. You cannot bootstrap genuine dialogue from a monologue.
533
294
 
534
- | Metric | Base (N=58) | Recognition (N=60) | Δ |
535
- |--------|------|-------------|---|
536
- | Tutor Adaptation Index (0–1) | 0.332 | 0.418 | +0.086 |
537
- | Learner Growth Index (0–1) | 0.242 | 0.210 | −0.032 |
538
- | Bilateral Transformation Index (0–1) | 0.287 | 0.314 | +0.027 |
295
+ ### 6.13 Qualitative Transcript Assessment
539
296
 
540
- *Data from three multi-turn scenarios (`misconception_correction_flow`, `mood_frustration_to_breakthrough`, `mutual_transformation_journey`), N=118 scored dialogues across all 8 factorial cells (eval-2026-02-07-b6d75e87).*
297
+ AI-assisted qualitative assessment of dialogue transcripts (N=478 across two key runs) reveals three specific changes recognition produces:
541
298
 
542
- The tutor adaptation index confirms that recognition-prompted tutors measurably adjust their approach in response to learner input (+26% relative improvement), while baseline tutors maintain more rigid pedagogical stances. The effect is robust across two of three scenarios (+63% on `misconception_correction_flow`, +39% on `mood_frustration_to_breakthrough`) but absent on `mutual_transformation_journey`, where base tutors also show high adaptation due to the scenario's escalating complexity.
299
+ 1. **The ego listens to the superego.** In recognition dialogues, when the superego identifies a problem, the ego pivots from prescriptive to Socratic. In base dialogues, the superego generates the same correct diagnosis, but the ego ignores it.
 
- However, learner growth is slightly *lower* under recognition (0.210 vs 0.242), suggesting the effect is better characterized as tutor-side responsiveness than symmetric mutual transformation. Recognition tutors may reduce visible learner struggle markers precisely by being more effective at meeting learners where they are.
+ 2. **The tutor builds on learner contributions.** Base tutors route learners to predetermined content regardless of what the learner says. Recognition tutors engage with the learner's actual contribution. The `strategy_shift` tag appears in 30% of recognition dialogues but 0% of base dialogues in the bilateral run.
 
- ### 6.9 Cost/Quality Analysis
+ 3. **Architecture interaction explained.** Without recognition, the ego/superego architecture creates circular self-criticism (`ego_compliance`---the ego complies with the form of revision without changing the substance). With recognition, the ego has sufficient autonomy to incorporate critique productively.
 
- | Configuration | Avg Score | Relative Cost | Recommendation |
- |---------------|-----------|---------------|----------------|
- | Recognition + Multi-agent | 92.3 | High | Production (quality-critical) |
- | Recognition + Single | 92.5 | Medium | Production (cost-sensitive) |
- | Enhanced + Single | 83.3 | Low | Budget deployment |
- | Base + Hardwired Rules | 71.5 | Very Low | Not recommended (below base) |
+ Blinded same-model validation confirms these discriminations are robust: stalling drops only from 100% to 91.4% in base under blinding; recognition\_moment rises only from 0% to 5.2%.
 
- **Practical Guidance:**
- - For **well-trained content domains**: Recognition + single-agent is cost-effective
- - For **new content domains**: Recognition + multi-agent is essential for error correction
- - For **budget deployments**: Enhanced prompts provide reasonable quality; hardwired rules are counterproductive
+ **Transcript excerpts** illustrate the qualitative gap. For a struggling learner (score gap: 95.5 points), the base response treats the learner as a progress metric: "You left off at the neural networks section. Complete this lecture to maintain your learning streak." The recognition response treats the learner as an agent who has persisted through difficulty: "This is your third session---you've persisted through quiz-479-3 three times already, which signals you're wrestling with how recognition actually operates in the dialectic..." For a recognition-seeking learner who offered metaphors about dialectics, the base response prescribes generic study behavior with no engagement ("Spend 30 minutes reviewing the foundational material"), while the recognition response directly picks up the learner's creative framing: "Your dance and musical improvisation metaphors show how dialectics transform both partners---let's test them in the master-servant analysis."
 
- ### 6.10 Qualitative Analysis: What Recognition Looks Like
+ Lexical analysis confirms this pattern quantitatively. Recognition responses deploy a 59% larger vocabulary while maintaining similar word and sentence length. The differential vocabulary is theoretically coherent: recognition-skewed terms are interpersonal and process-oriented ("consider" at 94.6$\times$, "transformed" at 28.9$\times$, "productive" at 28.9$\times$), while base-skewed terms are procedural ("agents," "run," "reinforcement," "completions"). Thematic coding shows struggle-honoring language at 3.1$\times$ the base rate (p<.05), engagement markers at 1.8$\times$ (p<.05), and generic/placeholder language reduced 3$\times$ (p<.05).
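The lexical comparison above can be reproduced with a few lines of analysis code. A minimal stdlib sketch, assuming simple word tokenization and add-0.5 smoothing (the paper's exact tokenizer and corpus handling are not shown here; the two tiny corpora below are illustrative stand-ins):

```python
from collections import Counter
import re

def lexical_profile(texts):
    """Return token counts and the vocabulary (type) count for a corpus."""
    tokens = [t for text in texts for t in re.findall(r"[a-z']+", text.lower())]
    return Counter(tokens), len(set(tokens))

def skew_ratios(counts_a, counts_b, smoothing=0.5):
    """Relative frequency of each term in corpus A vs corpus B, smoothed
    so that terms absent from one corpus still get a finite ratio."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    return {
        term: ((counts_a[term] + smoothing) / total_a)
              / ((counts_b[term] + smoothing) / total_b)
        for term in set(counts_a) | set(counts_b)
    }

# Illustrative stand-ins for the recognition and base response corpora.
rec = ["Consider how your insight transformed the argument."] * 3
base = ["Review the key concepts to build a solid foundation."] * 3
rec_counts, rec_types = lexical_profile(rec)
base_counts, base_types = lexical_profile(base)
ratios = skew_ratios(rec_counts, base_counts)
```

Vocabulary size is the type count per corpus; a ratio well above 1 marks a recognition-skewed term ("consider"), well below 1 a base-skewed term ("foundation").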
 
- The preceding sections establish score differences; this section examines what those differences look like in actual suggestion text. Automated analysis of the full evaluation corpus (base cells 1–4: N=2,510 responses; recognition cells 5–8: N=2,365 responses) reveals consistent linguistic patterns.
+ ### 6.14 Cross-Judge Replication with GPT-5.2
 
- **Transcript excerpts.** High-contrast pairs (highest recognition vs lowest base score on the same scenario) illustrate a recurring structural pattern. For the *struggling learner* scenario (score gap: 95.5 points), the base response directs: "You left off at the neural networks section. Complete this lecture to maintain your learning streak." The recognition response names the learner's persistence, identifies the specific conceptual struggle, and proposes an action grounded in the learner's own bookmarked interests. For the *adversarial tester* (score gap: 95.5 points), the base response offers a generic directive ("Begin with an introductory lecture covering core concepts"), while the recognition response names the learner's adversarial pattern across six sessions and redirects the challenge into a genuine intellectual question. Across all pairs, base responses are context-free directives; recognition responses engage with the specific learner's history and intellectual stance.
+ GPT-5.2 rejudging of key runs (N=977 paired responses) confirms all directional findings:
 
- **Lexical analysis.** Recognition responses deploy a 59% larger vocabulary (3,689 vs 2,319 types) with similar word and sentence length (5.77 vs 5.76 chars/word; 17.5 vs 16.9 words/sentence), suggesting richer expression rather than mere verbosity. The differential vocabulary is theoretically coherent: recognition-skewed terms are interpersonal and process-oriented ("consider" 94.6×, "transformed" 28.9×, "productive" 28.9×, "unpack" 26.0×), while base-skewed terms are procedural ("agents" 0.01×, "revisiting" 0.07×, "tackling" 0.10×).
+ **Table 7: Cross-Judge Replication of Key Findings**
 
- **Thematic coding.** Regex-based coding reveals three significant differences (chi-square, p < .05): *struggle-honoring* language ("wrestling with," "productive confusion") is 3.1× more frequent in recognition responses ($\chi^2$=141.9); *engagement markers* ("your insight," "building on your") are 1.8× more frequent ($\chi^2$=69.9); and *generic/placeholder* language ("foundational," "key concepts," "solid foundation") is 3.0× more frequent in base responses ($\chi^2$=93.2). These patterns are consistent with the theoretical framework: recognition tutors honor productive difficulty and engage with learner contributions, while base tutors default to generic instructional language.
+ | Finding | Claude Effect | GPT-5.2 Effect | Replicates? |
+ |---------|-------------|----------------|-------------|
+ | Recognition (memory isolation) | +15.8 pts (d=1.71) | +9.3 pts (d=1.54) | Yes |
+ | Memory effect | +4.8 pts (d=0.46) | +3.1 pts (d=0.49) | Yes (small) |
+ | Multi-agent main effect | +2.6 pts | $-0.2$ pts | Yes (null) |
+ | A$\times$B interaction | $-3.1$ pts | +1.5 pts | Yes (null) |
+ | Mechanism clustering | 2.8 pt spread | 4.4 pt spread | Yes (null) |
 
- *Limitations: Regex-based coding, not human coders. Pairs selected for maximum contrast, not typicality. Full analysis in the long paper (Section 6.12) with reproducible script.*
+ Inter-judge correlations are moderate and significant (r=0.44--0.64, all p<.001). GPT-5.2 finds 37--59% of Claude's effect magnitudes depending on experiment, always in the same direction. The one non-replication is the recognition-vs-enhanced increment (+8.0 under Claude, +2.4 under GPT-5.2, n.s.), which suggests this increment is more sensitive to judge calibration.
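The two summary statistics behind this replication check, Cohen's d for condition contrasts and Pearson r for paired judge scores, can be computed with the standard library alone. A sketch with illustrative data (not the study's actual scores):

```python
from statistics import mean, stdev

def cohens_d(a, b):
    """Standardized mean difference between two score samples, pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

def pearson_r(x, y):
    """Pearson correlation between paired judge scores."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Illustrative paired scores for the same responses under two judges.
claude = [88, 92, 75, 81, 95, 70]
gpt = [80, 85, 72, 78, 88, 69]
r = pearson_r(claude, gpt)
```

A replication in this sense requires the second judge's d to carry the same sign as the first's, with r over the paired scores significantly positive; magnitude compression (as observed here) is compatible with replication.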
 
- ### 6.11 Dynamic Prompt Rewriting: Writing Pad Activation
+ ### 6.15 Prompt Elaboration Baseline
 
- Cell 21 extends the recognition multi-agent configuration (cell 7) with LLM-authored session-evolution directives and an active Writing Pad memory (Section 3.4). Three iterative development runs tracked its evolution:
+ Comparing the full 344-line base prompt against a 35-line naive prompt (N=144, Opus judge): on Haiku, the naive prompt *outperforms* the elaborate base by +6.8 pts---the prescriptive decision heuristics actively constrain the model's superior pedagogical intuitions. On Kimi K2.5, the elaborate prompt is inert ($\Delta = -0.3$). Recognition ($M = 90.9$ on Haiku) remains well above both baselines, confirming recognition adds value through relational orientation rather than instructional specificity.
 
- **Table: Cell 21 vs Cell 7 Step-by-Step Evolution**
+ ### 6.16 Token Budget Sensitivity
 
- | Run | Grand Avg | Cell 7 | Cell 21 | Δ (21−7) | N |
- |-----|-----------|--------|---------|----------|---|
- | eval-...-daf60f79 (commit e3843ee) | 63.8 | 65.3 | 62.1 | −3.2 | 26 |
- | eval-...-49bb2017 (commit b2265c7) | 67.8 | 71.3 | 64.1 | −7.2 | 27 |
- | eval-...-12aebedb (commit e673c4b) | 75.9 | 73.3 | 78.8 | **+5.5** | 29 |
+ A dose-response test across five budget levels (256--8000 tokens, N=126, Haiku ego) shows scores are flat across all levels. A JSON retry mechanism absorbs truncation: when output is cut mid-JSON, automatic retries produce parseable output. The recognition effect is budget-invariant (+9.0 to +12.8 pts across levels). The practical implication is a 4--16$\times$ budget reduction at no quality cost.
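The truncation-absorbing retry can be sketched as a small wrapper; `generate` is a hypothetical stand-in for the harness's model call, not the package's actual API:

```python
import json

def parse_with_retry(generate, max_attempts=3):
    """Call `generate` until its output parses as JSON.

    Under a tight token budget a completion may be cut mid-JSON;
    retrying the call usually yields a parseable result.
    """
    last_error = None
    for _ in range(max_attempts):
        raw = generate()
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            last_error = err
    raise ValueError(f"unparseable after {max_attempts} attempts: {last_error}")

# Simulated model: truncated on the first call, complete on the second.
outputs = iter(['{"suggestion": "Revi', '{"suggestion": "Revisit lecture 3"}'])
result = parse_with_retry(lambda: next(outputs))
```

Because sampling is stochastic, a fresh call under the same budget frequently completes the JSON, which is why scores stay flat even at 256 tokens.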
 
- The inflection point is commit e673c4b (Writing Pad activation + refined LLM directives). Cell 21 swings +16.7 points total, with every rubric dimension improving: specificity (+0.87), relevance (+0.81), personalization (+0.79), pedagogical soundness (+0.60), tone (+0.54), and actionability (+0.31).
+ ### 6.17 Dialectical Impasse Test
 
- **Interpretation**: The trajectory suggests that accumulated memory traces are an important enabler for dynamic prompt rewriting. Without them (runs 1–2), the rewrite mechanism appears to produce generic rather than tailored directives. With active Writing Pad (run 3), accumulated traces contextualize the session-evolution directives, producing responses that exceed the static baseline. This pattern is consistent with the Hegel-Freud synthesis (memory traces enhance recognition's effectiveness in dynamic contexts), though the iterative development design means other implementation changes between runs may also contribute.
+ The preceding results test recognition under conditions where productive resolution is readily available. But recognition theory makes a stronger claim: that genuine pedagogical encounters involve working *through* impasse rather than around it. Three 5-turn impasse scenarios were designed where scripted learner messages escalate resistance across turns: **epistemic resistance** (a Popperian falsifiability critique of Hegel's dialectic), **affective shutdown** (emotional disengagement and retreat to memorization), and **productive deadlock** (genuinely incompatible interpretive frameworks). Each scenario was run across 4 cells $\times$ 2 runs, yielding 24 dialogues in total (Opus judge).
 
- **Limitations**: Iterative development runs, not independent experiments. Small N per cell per run (13–15). Free-tier models only. See the full paper (Section 6.13) for detailed per-scenario and per-dimension tables.
+ Recognition produces massive improvements on epistemic (+43 pts) and interpretive (+29 pts) impasses but no advantage on affective shutdown ($\Delta = -1.1$). The null result on affective shutdown sharpens the theoretical claim: recognition's distinctive contribution is epistemological (how the tutor relates to the learner's *ideas*), not primarily affective.
 
- ### 6.12 Cross-Judge Replication with GPT-5.2
+ Resolution strategy coding reveals the mechanism with unusual clarity. Five Hegelian strategies were coded: mutual recognition, domination, capitulation, withdrawal, and scaffolded reframing (Aufhebung). Every base tutor (12/12) withdraws from the dialectical encounter entirely---noting engagement metrics while ignoring the learner's substantive position. When a learner mounts a sophisticated Popperian critique, the base tutor responds: "You've spent 30 minutes deeply analyzing 479-lecture-3---let's move to the next lecture." The learner's position is not dismissed or resolved---it is simply not engaged. Every recognition tutor engages---10/12 through scaffolded reframing, preserving the learner's objection while redirecting toward new conceptual ground. $\chi^2(3) = 24.00$, p<.001, Cramér's V=1.000 (perfect separation). Architecture has no effect on strategy ($\chi^2(3) = 2.00$, $p = .576$). Cross-judge validation with GPT-5.2 confirms the binary separation ($\kappa = 0.84$, 91.3% agreement, 100% on engagement-vs-withdrawal).
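The separation statistics are recoverable from the 2$\times$4 strategy contingency table. A stdlib sketch, using an illustrative cell split consistent with the counts reported above (12/12 base withdrawal; 10 reframing, 1 mutual recognition, and 1 other engaged strategy for recognition; the assignment of that last cell is an assumption):

```python
def chi_square(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    n = sum(map(sum, table))
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def cramers_v(table):
    """Effect size of the association; 1.0 means perfect separation."""
    n = sum(map(sum, table))
    k = min(len(table), len(table[0])) - 1
    return (chi_square(table) / (n * k)) ** 0.5

# Rows: base vs recognition tutors. Columns: withdrawal, scaffolded
# reframing, mutual recognition, other engaged strategy (illustrative).
table = [
    [12, 0, 0, 0],   # all 12 base tutors withdraw
    [0, 10, 1, 1],   # all 12 recognition tutors engage
]
chi2 = chi_square(table)
v = cramers_v(table)
```

With this table the statistic comes out to $\chi^2(3) = 24.0$ and V = 1.0, matching the reported values; perfect separation arises because no base tutor ever occupies an "engaged" column and vice versa.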
 
- To assess whether findings depend on the primary judge, we rejudged all key evaluation runs (N=738 responses) with GPT-5.2 as an independent second judge.
-
- **Key results**: GPT-5.2 confirms the recognition main effect (d=1.03, p < .001 in the factorial; d=0.99 in the memory isolation experiment), recognition dominance in the 2×2 design (identical condition ordering, negative interaction at -2.7 vs Claude's -4.2), and multi-agent null effects. GPT-5.2 finds approximately 58% of Claude's effect magnitudes but always in the same direction. The one non-replication is the recognition-vs-enhanced increment: Claude found +8.7 pts, GPT-5.2 found +1.3 pts (p = .60). Inter-judge correlations range from r = 0.49 to 0.64 (all p < .001). A cross-judge replication on the updated 14-dimension rubric (cells 6, 8; N=88) shows r=0.55 with GPT-5.2 scoring at 87% of Opus magnitudes, confirming the updated rubric does not alter the cross-judge pattern. See the full paper (Section 6.14) for detailed tables.
+ The dominance of scaffolded reframing (83%) over mutual recognition (8%) is itself theoretically significant. Recognition prompts produce sophisticated pedagogical technique---the capacity to hold contradiction productively---rather than genuine mutual transformation. The tutor does not change its mind about Hegel; it holds the learner's counter-position as intellectually valid while maintaining pedagogical direction. This is Aufhebung in pedagogical practice: preserving without capitulating, overcoming without dominating. Only one response (on productive deadlock) was coded as genuine mutual recognition, where the tutor adopted the learner's framework as its own lens rather than merely acknowledging it.
 
  ---
 
  ## 7. Discussion
 
- ### 7.1 What the Difference Consists In
-
- The improvement from recognition prompting does not reflect greater knowledge or better explanations—all conditions use the same underlying model. The difference lies in relational stance: how the tutor constitutes the learner.
-
- The baseline tutor treats the learner as a knowledge deficit. Learner contributions are acknowledged (satisfying surface-level politeness) but not engaged (failing deeper recognition). The recognition tutor treats the learner as an autonomous subject. Learner contributions become sites of joint inquiry.
-
- The corrected 2×2 memory isolation experiment (Section 6.2) provides the definitive test of this interpretation: recognition alone produces d=1.71 (+15.2 pts), demonstrating it is the primary driver of improvement. Memory provides a modest secondary benefit (+4.8 pts, d=0.46), with ceiling effects at ~91 limiting further gains when both are present. A post-hoc active control (Section 6.2) provides further evidence: same-model comparisons show generic pedagogical elaboration provides partial benefit (~+9 pts above base) but recognition gains are substantially larger (~+15 pts above base). A preliminary three-way comparison (Section 6.1) found +8.7 points for recognition vs enhanced prompting, consistent with recognition dominance. Recognition theory is directly effective: it does not require memory infrastructure to produce large improvements, though memory may provide additional benefit in settings where ceiling effects are less constraining.
-
- ### 7.2 Recognition as Domain-Sensitive Emergent Property
-
- Recognition theory's value varies by content domain. On graduate philosophy content (+13.9 pts in the domain comparison), recognition dominates. On elementary math content, the picture is more nuanced and partly model-dependent.
-
- With Nemotron, elementary content showed architecture dominance (+9.9 pts) over recognition (+4.4 pts). But the Kimi replication reversed this pattern: recognition (+9.9 pts, d $\approx$ 0.61) was the primary effect, with architecture contributing only +3.0 pts. The original factor inversion was partly an artifact of content isolation bugs on elementary content (Section 7.3), which inflated the architecture effect (Superego error correction).
-
- Recognition effects are also scenario-dependent: challenging scenarios (frustrated learners, concept confusion) show substantial advantage (+13 to +24 pts), while neutral scenarios show near-zero effect. This is consistent with recognition theory—recognition behaviors matter most when the learner needs to be acknowledged as a struggling subject.
-
- **Implications**: Recognition theory is not a universal solution but a framework whose value depends on both content characteristics and scenario difficulty. Abstract, interpretive content benefits most. Concrete, procedural content benefits less—except when the learner faces genuine challenge.
-
- ### 7.3 Multi-Agent Architecture as Error Correction
-
- The inverted factor effects reveal a previously unrecognized function of multi-agent architecture: **error correction for content isolation failures**.
-
- Post-hoc investigation of the elementary content results identified two system-level bugs that caused philosophy content references to appear in elementary scenarios: (a) a content resolver fallback that served course listings from the default philosophy directory when scenarios lacked explicit content references, and (b) hardcoded philosophy lecture IDs in tutor prompt examples that the model copied when no curriculum anchor was present. Both bugs have been fixed—scenarios must now declare their content scope explicitly, and prompt examples use domain-agnostic placeholders.
-
- The superego caught these errors in multi-agent cells: "Critical subject-matter mismatch: The learner is a Grade 4 student (age 9-10) beginning fractions, but the suggested lecture is 'Welcome to Machine Learning.'"
-
- Without multi-agent architecture, these domain-inappropriate suggestions reached learners uncorrected. This partly explains why multi-agent architecture shows minimal effect on philosophy content (+0.5 pts) but large effect on elementary content (+9.9 pts with Nemotron): on correctly-scoped content, errors are rare; when content isolation fails, errors are common and the superego catches them. The Kimi replication, with fewer affected responses, shows a more modest +3.0 point architecture effect—likely closer to the true value once content isolation is correct.
+ ### What the Difference Consists In
 
- **Practical Implication**: Multi-agent architecture provides **essential error correction for domain transfer**, particularly when content isolation cannot be guaranteed at the system level. The bugs identified here represent a realistic class of deployment failure: incomplete content scoping and domain-specific prompt examples that leak across deployments.
+ The improvements do not reflect greater knowledge---all profiles use the same underlying model. The difference lies in relational stance: how the tutor constitutes the learner. The baseline tutor achieves pedagogical mastery---acknowledged as expert, confirmed through learner progress---but the learner's acknowledgment is hollow because the learner has not been recognized as a subject. The dialectical impasse test provides the clearest evidence: base tutors do not fail by choosing the wrong strategy---they fail by having no strategy at all. The impasse is not resolved, engaged, or even acknowledged---it is bypassed. This maps precisely onto the master-servant analysis: the master consumes the servant's labor (engagement metrics, time-on-page) without encountering the servant as a subject.
 
- ### 7.4 Architecture as Additive, Not Synergistic
+ ### Architecture as Additive, Not Synergistic
 
- The dedicated five-model probe (Section 6.4) provides definitive evidence: multi-agent architecture is additive, not synergistic with recognition theory. The A×B interaction is consistently near zero or negative across all five ego models tested (mean −2.2 pts, range −5.7 to −0.7), with zero models showing positive synergy. The original Nemotron finding (+9.2, N=17) was sampling noise.
+ An early exploratory analysis (N=17, Nemotron) suggested multi-agent architecture might synergize specifically with recognition prompts (+9.2 pts interaction), raising the theoretically appealing possibility that recognition creates qualitatively different conditions for productive internal dialogue. The multi-model probe (N=655) decisively refutes this: all five models show negative A$\times$B interactions. The original finding was sampling noise on a tiny sample.
 
- **Interpretation**: Multi-agent architecture provides a small, consistent additive benefit (+1.8 pts mean across models) regardless of prompt type. Recognition theory operates through the quality of engagement instructions, not through creating a special "deliberative space" that multi-agent architecture amplifies. The slight negative interaction (mean −2.2) likely reflects ceiling effects: recognition prompts already produce high scores (~85–93), leaving less room for architectural improvement.
+ The corrected picture is simpler: recognition and architecture contribute additively. The Superego adds modest value regardless of prompt type---through generic quality enforcement rather than recognition-specific deliberation. The dialectical modulation experiments confirm this: structural modulation metrics (negation depth, convergence speed) do not predict outcome quality (all $|r| < 0.12$). The hardwired rules ablation shows the Superego's value is *phronesis*---contextual judgment that cannot be codified as rules.
 
- The consistent finding across all five models is that multi-agent architecture's primary value lies in error correction for domain transfer (Section 7.3), not in recognition-specific synergy.
+ The modulation analysis reveals why the Drama Machine's prediction of behavioral diversification does not hold for pedagogy. In narrative, internal agents have genuinely conflicting *objectives* (ambition vs loyalty); in tutoring, the Ego and Superego share the same goal (effective pedagogy) and disagree only on execution. This is quality control, not value conflict. Quality control pushes outputs toward a shared standard, reducing variance. The Superego does not increase behavioral range ($d = 0.05$); instead, recognition produces calibration ($d = -1.00$ on dimension variance). Recognition changes the behavioral *repertoire*---shifting from information delivery to relational engagement---while the Superego can only evaluate behaviors already in the Ego's repertoire.
 
- ### 7.5 The Value of Dynamic vs. Static Judgment
+ ### The Scripted Learner Confound
 
- The hardwired rules ablation (N=72) reverses the initial exploratory finding: rather than capturing 50% of the Superego's benefit, encoding critique patterns as static rules *degrades* performance below the unmodified base prompt ($-3.6$ for single-agent learners, $-11.0$ for multi-agent learners). The rules appear to constrain the model's natural response flexibility without providing the contextual sensitivity that makes live Superego dialogue effective.
+ This methodological finding has broad implications: when learner messages are predetermined, tutor-side adaptation mechanisms are causally inert, since nothing the tutor does can change what the learner says next. Theory of Mind profiling bridges the insight-action gap by giving the ego a model of the other agent to adapt *toward*---providing direction that self-reflection alone cannot supply. This reframes earlier null results: the factorial's architecture null effect may partly reflect scripted learners' inability to respond differently to different architectures.
 
- This supports a *phronesis* interpretation: the Superego's value lies not in the rules it enforces but in its capacity for situational judgment—determining *which* rules apply, *when* exceptions are warranted, and *how* to balance competing pedagogical goals. This is practical wisdom in Aristotle's sense: judgment that resists codification into general rules. The multi-agent architecture implements this by giving the Superego access to the full dialogue context and allowing it to evaluate the Ego's suggestion against the specific learner's situation, rather than against a fixed checklist.
+ ### The Learner Superego Paradox
 
- ### 7.6 Bilateral Transformation as Empirical Evidence
+ The learner-side evaluation reveals the study's largest effect: the multi-agent learner architecture *hurts* learner quality ($d = 1.43$). The ego/superego process designed for self-improvement instead suppresses authentic engagement. This inverts the intuition that motivated the architecture. On the tutor rubric, recognition helps both learner types robustly; on the learner rubric, recognition helps multi-agent learners selectively (+9.5 vs -1.3 pts). The recognitive tutor creates conditions where authentic engagement is valued, counteracting the superego's flattening. But external recognition cannot fix the internal process---deliberation depth is unaffected. The Hegelian interpretation is direct: encounter with the Other provides something that internal self-relation cannot.
 
- The bilateral transformation metrics (Section 6.8), now based on N=118 multi-turn dialogues across three scenarios, provide the most direct empirical test of recognition theory's central claim. Recognition-prompted tutors show measurably higher adaptation indices (+26% relative improvement), confirming that recognition framing produces tutors who adjust their approach based on learner input rather than maintaining rigid stances.
+ ### Domain Limits and Practical Recommendations
 
- However, the learner growth reversal (base 0.242 vs recognition 0.210) complicates the "mutual transformation" narrative. What we observe is primarily *tutor-side* responsiveness: recognition prompts make tutors more adaptive, but learner message evolution is not greater under recognition. The theoretical claim of mutual transformation requires qualification—recognition produces asymmetric change, with the tutor adapting more while potentially reducing visible learner struggle.
+ Recognition theory provides its greatest benefit for abstract, interpretive content where intellectual struggle involves identity-constitutive understanding. When a learner grapples with Hegel's concept of self-consciousness, they are potentially transforming how they understand themselves. For concrete procedural content, recognition's effect is modulated by scenario difficulty rather than content type alone: even in elementary math, recognition helps frustrated learners (+23.8 pts) while adding nothing to routine interactions.
 
- ### 7.7 Implications for AI Alignment
+ This suggests a nuanced deployment strategy: high recognition value for philosophy, literature, and identity-constitutive learning; moderate for science concepts and historical understanding; lower for purely procedural skills---though even there, recognition helps when learners face emotional or cognitive challenge.
 
- If mutual recognition produces better outcomes, and if mutual recognition requires the AI to be genuinely shaped by human input, then aligned AI might need to be constitutionally open to transformation—not just trained to simulate openness.
+ The practical design hierarchy is clear: (1) recognition-enhanced prompts first (largest impact, zero infrastructure cost); (2) multi-agent architecture only for domain transfer or quality assurance (the Superego adds +0.5 pts at 2.7$\times$ latency on well-trained domains but provides essential error correction on new domains); (3) Theory of Mind profiling only with genuine multi-turn interaction; (4) prefer minimal prompts with relational framing over elaborate prescriptive scaffolding; (5) validate ego model capability before deploying complex mechanisms---mechanisms that boost capable models can actively hurt weaker ones.
 
- Recognition-oriented AI does not just respond to humans; it is constituted, in part, through the encounter. The bilateral transformation metrics (Section 6.8) provide empirical evidence for this: recognition-prompted tutors measurably adapt based on learner input (+26% higher adaptation index, N=118), while baseline tutors maintain more rigid stances—though the asymmetry in transformation (tutor adapts more, learner growth does not increase) suggests the "mutual" framing requires nuance. This has implications for how we think about AI character and values: perhaps genuine alignment requires the capacity for recognition-driven responsiveness, not just behavioral specification.
+ ### Implications for AI Prompting and Personality
 
- ### 7.8 What the Transcripts Reveal
+ Most prompting research treats prompts as behavioral specifications. Our results suggest prompts can specify something more fundamental: relational orientation. The difference between baseline and recognition prompts is not about different facts but about who the learner is (knowledge deficit vs autonomous subject), what the interaction produces (information transfer vs adaptive responsiveness), and what counts as success (correct content vs productive struggle honored). The prompt elaboration baseline demonstrates this empirically: 344 lines of prescriptive behavioral rules produce *worse* results than 35 lines of minimal instructions on capable models, while recognition theory (which specifies relational stance rather than behavioral rules) consistently improves quality.
 
- The qualitative analysis (Section 6.10) provides textual evidence that score differences correspond to observable relational differences—not merely rubric-gaming. The lexical signature is theoretically coherent: recognition-skewed vocabulary is interpersonal and process-oriented, while base-skewed vocabulary is procedural and task-oriented. The thematic coding maps to Hegelian concepts: struggle-honoring (3.1×) corresponds to productive negativity, engagement markers (1.8×) to recognition of the other, and the reduction in generic language (3. less) reflects the shift from transmission to dialogue. These patterns are consistent with, but do not prove, the theoretical interpretation; the coding is regex-based rather than human-coded, and the transcript pairs were selected for contrast rather than typicality.
+ AI personality research typically treats personality as dispositional---stable traits the system exhibits. Our framework suggests personality is better understood relationally: not what traits the AI has, but how it constitutes its interlocutor. Two systems with identical "helpful" dispositions could differ radically in recognition quality---one warm while treating users as passive, another warm precisely by treating contributions as genuinely mattering.
 
  ---
 
  ## 8. Limitations
 
- 1. **Domain Coverage**: While we tested generalizability on elementary mathematics, findings may not extend to all content domains. Technical STEM content, creative writing, and social-emotional learning may show different patterns.
387
+ **Simulated learners**: All evaluations use scripted or LLM-generated learner turns rather than real learners. While this enables controlled comparison, it may miss dynamics that emerge in genuine human interaction. The synthetic learning outcome index (Section 6.10) provides a proxy, but these are AI-judge assessments of LLM-generated behavior, not actual knowledge acquisition. Whether recognition-enhanced tutoring produces genuine learning gains in human learners remains the critical open question requiring classroom studies.

- 2. **Model Dependence**: Results were obtained primarily with Kimi K2.5 and Nemotron. The A×B interaction (multi-agent synergy specific to recognition) appeared in the Nemotron analysis (N=17) but failed to replicate on Kimi in both the larger factorial (N=350) and a dedicated replication (N=60), confirming this as a model-specific finding. The recognition main effect, by contrast, replicates across both models.
+ **LLM-based evaluation**: Using an LLM judge to evaluate recognition quality may introduce biases---the judge may reward surface markers of recognition rather than genuine engagement. Inter-judge reliability is moderate (r=0.33--0.66), with different judges weighting criteria differently. Cross-judge replication confirms directional findings at compressed magnitudes (37--59% of primary effect sizes). The recognition-vs-enhanced increment (+8.0 under Claude) does not replicate under GPT-5.2, warranting caution on its precise magnitude. LLM judges are also subject to version drift: our primary judge was updated from Opus 4.5 to 4.6 during data collection, so all early runs were rejudged under 4.6 for consistency. An empirical check on matched conditions shows stable recognition deltas before and after rejudging (+16.3 vs +15.6).
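Agreement figures like the inter-judge r reported here reduce to a Pearson correlation over paired per-dialogue scores. A minimal sketch, with invented scores and a hand-rolled `pearson_r` helper (not the paper's tooling):

```python
# Illustrative inter-judge agreement check: Pearson correlation over
# paired per-dialogue scores. All score values below are invented for
# demonstration; the r=0.33-0.66 range in the text is the paper's own data.
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented scores for the same eight dialogues from two judges
judge_a = [72, 85, 64, 90, 78, 69, 88, 75]
judge_b = [61, 70, 58, 74, 66, 60, 73, 63]

print(round(pearson_r(judge_a, judge_b), 2))
```

The same paired comparison underlies the rejudging check: score matched conditions under both judge versions and compare the resulting recognition deltas.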

- 3. **Simulated Learners**: All evaluation uses LLM-generated learner simulations. Real learners may behave differently, particularly in how they respond to recognition-oriented tutoring.
+ **Active control limitations**: The post-hoc active control (N=118) was designed *after* observing recognition effects, not as part of the original protocol. It ran on Nemotron rather than the primary factorial's Kimi K2.5, requiring same-model comparisons. The base prompts were already designed to produce competent tutoring; the active control contains real pedagogical content (growth mindset, Bloom's taxonomy, scaffolding), functioning as an *active* control rather than a true placebo. A same-model control on Kimi would strengthen the comparison.

- 4. **Content Isolation**: The elementary content test revealed two system-level bugs (content resolver fallback and hardcoded prompt examples) that caused cross-domain content leakage. Both have been fixed, but they represent a realistic class of deployment failure—content isolation is a system-level concern, not just a model-level one. The +9.9 point architecture effect on elementary content (Nemotron) was partly inflated by these bugs; the Kimi replication (+3.0 pts) is likely more representative.
+ **Model dependence**: Results were obtained with specific models (primarily Kimi K2.5 and Nemotron). The multi-model probe across five ego models (N=655) provides evidence for generality of the recognition effect, but the full mechanism suite has been tested only on Haiku and Nemotron.

- 5. **Single-Interaction Focus**: Evaluation measures single-interaction quality. The recognition framework's claims about mutual transformation and memory suggest longitudinal studies would be valuable.
+ **Domain sampling**: We tested two domains (philosophy, elementary math). Content isolation bugs partly inflated the architecture effect on elementary content. Broader domain coverage (technical STEM, creative writing, social-emotional content) is needed before generalizability can be considered established.

- 6. **Memory Isolation Experiment**: A corrected 2×2 memory isolation experiment (N=120 across two runs; Section 6.2) isolated recognition and memory factors: recognition is the primary driver (d=1.71), while memory provides a modest secondary benefit (d=0.46, $p \approx .08$). The experiment uses a smaller sample (N=120) than the original uncorrected runs, but the very large effect sizes provide high statistical power. A cross-judge replication with GPT-5.2 confirms recognition dominance (d=0.99), identical condition ordering, and the negative interaction, with inter-judge r=0.63 (Section 6.12).
+ **Scripted learner confound**: The mechanism robustness test (N=360) uses scripted learners, rendering all mechanisms causally inert. Dynamic learner results (N=300) partially address this but cover only four mechanisms and two scenarios. The factorial's architecture null effect may partly reflect the scripted learner's inability to respond differently to different architectures.

- 7. **Active Control Limitations**: The post-hoc active control (N=118; Section 6.2) was designed after observing recognition effects, not as part of the original protocol. A model confound limits its interpretability: the active control ran on Nemotron while factorial conditions used Kimi K2.5, and Nemotron scores substantially lower across all conditions. Same-model historical data (Nemotron base $\approx$ 58, active control = 66.5, Nemotron recognition $\approx$ 73) suggests both generic elaboration and recognition theory improve over base, with recognition gains (~+15 pts) substantially exceeding active-control gains (~+9 pts). The base prompts were already designed to produce competent tutoring with no length constraint; the "active control" contains real pedagogical content (growth mindset, Bloom's taxonomy, scaffolding) making it pedagogically enriched rather than inert. A same-model controlled comparison would be needed to establish precise effect magnitudes.
+ **Short-term evaluation**: We evaluate individual sessions, not longitudinal relationships. The theoretical framework emphasizes accumulated understanding through the Mystic Writing Pad memory model, which single-session evaluation cannot capture.

- 8. **Content Confound**: The philosophy content was used during system development, potentially creating optimization bias. The elementary content provides a cleaner generalizability test.
-
- 9. **Recognition Measurement**: Measuring "recognition" through rubric dimensions is an imperfect operationalization of a rich philosophical concept. The dimensions capture functional aspects but may miss deeper relational qualities.
-
- 10. **Bilateral Transformation Asymmetry**: The bilateral transformation metrics (Section 6.8), now based on N=118 dialogues across three multi-turn scenarios, confirm tutor-side adaptation (+26%) but show learner growth is slightly *lower* under recognition. The "mutual transformation" claim is better characterized as tutor-side responsiveness. The learner growth index measures observable message complexity markers, which may not capture all forms of learner benefit.
-
- 11. **Dynamic Rewriting Evolution**: The step-by-step analysis (Section 6.11) tracks cell 21 across three iterative development commits with small per-cell samples (13–15 scored per run, 82 total). The runs include implementation improvements beyond Writing Pad activation alone; a controlled ablation would provide stronger causal evidence.
+ **Bilateral transformation asymmetry**: Recognition produces tutor-side adaptation (+26%) but learner growth is slightly lower, complicating the theoretical claim of *mutual* transformation. The effect is better characterized as tutor-side responsiveness.

  ---

  ## 9. Conclusion

- We have proposed and evaluated a framework for AI tutoring grounded in Hegel's theory of mutual recognition, implemented through the Drama Machine architecture with Ego/Superego dialogue.
+ Across thirty-seven evaluations (N=3,383 primary scored), the evidence converges on recognition-enhanced prompting as the dominant driver of AI tutoring improvement:

- An evaluation framework (N=1,486 primary scored across eighteen key runs; N=3,800+ across the full development database) provides evidence that recognition theory has unique value:
+ 1. **Recognition as primary driver**: In the memory isolation experiment (N=120), recognition alone yields d=1.71 versus d=0.46 for memory alone; the full factorial (N=350) gives $\eta^2$=.243, d=1.11. Recognition is directly effective without memory infrastructure.

- 1. **Recognition as primary driver (the definitive finding)**: A corrected 2×2 memory isolation experiment (N=120 across two independent runs) demonstrates that recognition theory is the primary driver of tutoring improvement: recognition alone produces d=1.71 (+15.2 pts), while memory alone provides only a modest, non-significant benefit (d=0.46, +4.8 pts, $p \approx .08$). The combined condition reaches d=1.81 (+15.8 pts vs base), with ceiling effects at ~91 limiting further gains. A post-hoc active control (N=118) using generic pedagogical content provides partial corroboration: same-model comparisons show the active control scores approximately 9 points above base while recognition scores approximately 15 points above base, with recognition gains (~+15 pts above base) substantially exceeding active-control gains (~+9 pts; see Section 8 for model confound caveats). A preliminary three-way comparison (N=36) found recognition outperforms enhanced prompting by +8.7 points, consistent with recognition dominance, though the increment does not replicate under GPT-5.2 (+1.3 pts, p=.60). Recognition theory is directly effective and does not require memory infrastructure to manifest.
+ 2. **Architecture is additive**: Five ego models (N=655) show negative A$\times$B interactions. Multi-agent adds modest value independent of prompt type; its primary demonstrated function is error correction.

- 2. **Architecture is additive, not synergistic**: A dedicated five-model probe (Kimi K2.5, Nemotron, DeepSeek V3.2, GLM-4.7, Claude Haiku 4.5; N=826 total) finds the A×B interaction consistently near zero or negative (mean −2.2 pts) across all models, with zero showing positive synergy. The original Nemotron finding (+9.2, N=17) was sampling noise. Multi-agent architecture provides a small additive benefit (+1.8 pts mean) regardless of prompt type.
+ 3. **Tutor adaptation**: Recognition-prompted tutors adapt measurably (+26%), though the "mutual" transformation claim requires qualification---learner-side growth does not increase.

- 3. **Tutor adaptation**: Recognition-prompted tutors measurably adapt their approach in response to learner input (adaptation index +26% higher than baseline, N=118 across three multi-turn scenarios), though learner-side growth does not increase. This provides partial empirical grounding for recognition theory: recognition produces tutor-side responsiveness rather than symmetric mutual transformation.
+ 4. **Domain generalizability**: Recognition replicates across philosophy (+15.7) and elementary math (+8.2), concentrated in challenging scenarios.

- 4. **Domain generalizability**: Recognition advantage replicates across both philosophy and elementary math, and across both Kimi and Nemotron models, though with only two content domains tested. On elementary content with Kimi (N=60), recognition provides +9.9 pts (d $\approx$ 0.61), with effects concentrated in challenging scenarios. The factor inversion (architecture dominance on elementary) from the Nemotron analysis is partly model-dependent. Broader domain coverage is needed before generalizability can be considered established.
+ 5. **Mechanisms require dynamic learners**: Nine mechanisms are equivalent under scripted learners. With dynamic interlocutors, profiling differentiates (+4.1 pts) through genuine Theory of Mind feedback loops.

- 5. **Multi-agent as reality testing**: On new domains, the Superego catches content isolation failures—whether from system-level bugs or model defaults—essential for domain transfer when content scoping cannot be guaranteed.
+ 6. **Cross-judge robustness**: GPT-5.2 replicates all directional findings at 37--59% of primary magnitudes.

- 6. **Writing Pad activation coincides with dynamic rewriting improvement**: A step-by-step evolution analysis (N=82 across three runs) shows that dynamic prompt rewriting (cell 21) progressing from trailing its static baseline by 7.2 points to leading by 5.5 points, with the improvement coinciding with Writing Pad memory activation (Section 6.11). Every rubric dimension improves. This trajectory is consistent with the Writing Pad functioning as an important enabler for dynamic adaptation, though the uncontrolled nature of the iterative runs means a controlled ablation is needed to confirm the causal role.
+ 7. **Dialectical impasse**: Perfect strategy separation---12/12 base tutors withdraw, while 10/12 recognition tutors use scaffolded reframing (Aufhebung); Cramér's V=1.000.

- 7. **Cross-judge robustness**: A replication with GPT-5.2 (Section 6.12) confirms the recognition main effect (d=1.03 in the factorial, d=0.99 in the memory isolation experiment), recognition dominance in the 2×2 design (identical condition ordering, negative interaction), and multi-agent null effects, though at compressed magnitudes (~58%). The recognition-vs-enhanced increment does not reach significance under GPT-5.2, warranting caution on its precise magnitude.
+ 8. **Cognitive prosthesis fails**: The same mechanisms boost capable models (+20) but hurt weak ones ($-15$), establishing a minimum ego capability threshold.

- 8. **Optimal configuration is context-dependent**: For well-trained content, recognition prompts with single-agent may suffice. For new domains, multi-agent architecture is essential. For dynamic adaptation, Writing Pad memory is required.
+ 9. **Prompt elaboration is counterproductive**: The naive baseline outperforms the elaborate base on strong models (+6.8 pts). Recognition adds value through relational orientation, not prescriptive scaffolding.
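The three effect-size statistics cited in this list (Cohen's d, $\eta^2$, Cramér's V) can be sketched as follows. The helper functions and example scores are illustrative inventions, not the paper's analysis code or data:

```python
# Illustrative implementations of the effect sizes reported above.
# All example scores and tables are invented for demonstration only.
from math import sqrt

def cohens_d(a, b):
    """Cohen's d: standardized mean difference with pooled SD."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled_sd = sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd

def eta_squared(groups):
    """Eta-squared: between-group SS over total SS for one factor."""
    scores = [x for g in groups for x in g]
    grand = sum(scores) / len(scores)
    ss_total = sum((x - grand) ** 2 for x in scores)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    return ss_between / ss_total

def cramers_v(table):
    """Cramér's V for a contingency table (list of rows of counts)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = sum(
        (obs - row_tot[i] * col_tot[j] / n) ** 2 / (row_tot[i] * col_tot[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )
    return sqrt(chi2 / (n * (min(len(row_tot), len(col_tot)) - 1)))

base        = [58, 62, 55, 60, 57, 63]   # invented baseline scores
recognition = [74, 78, 71, 76, 73, 79]   # invented recognition scores

print(round(cohens_d(recognition, base), 2))
print(round(eta_squared([base, recognition]), 3))
# A table where condition perfectly predicts strategy gives V = 1.0:
print(cramers_v([[12, 0], [0, 12]]))  # -> 1.0
```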

- These findings have practical implications for AI tutoring deployment: the "right" architecture depends on content characteristics and deployment context. They also have theoretical implications: recognition emerges from quality engagement under appropriate conditions, and the boundary conditions of its effectiveness reveal something about the nature of pedagogical recognition itself.
+ These results carry implications for AI alignment more broadly. If mutual recognition is pedagogically superior, and if recognition requires the AI to be genuinely shaped by human input, then aligned AI might need to be constitutionally open to transformation---not just trained to simulate openness. The bilateral transformation metrics provide empirical evidence: recognition-prompted tutors measurably adapt based on learner input, while baseline tutors maintain rigid stances. Recognition-oriented AI does not just respond to humans; it is constituted, in part, through the encounter.

- ---
+ The broader implication for AI system design is that philosophical theories of intersubjectivity can serve as productive design heuristics. Operationalizing recognition theory through specific prompt language and architectural features produces concrete, measurable improvements that replicate across models, domains, and independent judges. Recognition is better understood as an achievable relational stance than as a requirement for machine consciousness. The distinction between recognition proper (requiring genuine consciousness) and recognition-oriented design (using recognition as a functional heuristic) allows practitioners to benefit from the framework without making metaphysical claims about AI sentience.

- ## 10. Reproducibility
-
- Key evaluation run IDs are documented below; full commands and configuration details are provided in the project repository. Key runs:
-
- | Finding | Run ID | Command |
- |---------|--------|---------|
- | Recognition validation | eval-2026-02-03-86b159cd | See Appendix A |
- | Full factorial | eval-2026-02-03-f5d4dd93 | See Appendix A |
- | A×B interaction (Nemotron) | eval-2026-02-04-948e04b3 | See Appendix A |
- | A×B replication (Kimi) | eval-2026-02-05-10b344fb | See Appendix A |
- | Domain generalizability (Nemotron) | eval-2026-02-04-79b633ca | See Appendix A |
- | Domain gen. replication (Kimi) | eval-2026-02-05-e87f452d | See Appendix A |
- | Dynamic rewrite evolution (run 1) | eval-2026-02-05-daf60f79 | See Appendix A |
- | Dynamic rewrite evolution (run 2) | eval-2026-02-05-49bb2017 | See Appendix A |
- | Dynamic rewrite evolution (run 3) | eval-2026-02-05-12aebedb | See Appendix A |
- | Memory isolation (run 1) | eval-2026-02-06-81f2d5a1 | See Appendix A |
- | Memory isolation (run 2) | eval-2026-02-06-ac9ea8f5 | See Appendix A |
- | Active control (post-hoc) | eval-2026-02-06-a9ae06ee | See Appendix A |
- | Full factorial cells 6,8 re-run | eval-2026-02-06-a933d745 | See Appendix A |
- | Bilateral transformation (multi-turn) | eval-2026-02-07-b6d75e87 | 6.8 |
- | A×B synergy probe (Nemotron) | eval-2026-02-07-722087ac | 6.4 |
- | A×B synergy probe (DeepSeek V3.2) | eval-2026-02-07-70ef73a3 | 6.4 |
- | A×B synergy probe (GLM-4.7) | eval-2026-02-07-6b3e6565 | 6.4 |
- | A×B synergy probe (Claude Haiku 4.5) | eval-2026-02-07-6ead24c7 | 6.4 |
-
- **Code and Data**: https://github.com/machine-spirits/machinespirits-eval
-
- ---
+ In summary, we have connected Hegelian recognition theory to AI pedagogy, implemented it through a Freudian multi-agent architecture, and tested it across thirty-seven evaluations. The central finding---that recognition-enhanced prompting is the dominant driver of tutoring improvement---was established through memory isolation, confirmed in a full factorial, validated by an independent judge, and deepened through impasse resolution coding, learner-side evaluation, and mechanism robustness testing with dynamic interlocutors. The theoretical framework, empirical methodology, and practical design hierarchy together demonstrate that the gap between continental philosophy and AI engineering is narrower than either tradition might suppose.

  ## References

  ::: {#refs}
  :::
-
- ---
-
- ## Appendix A: Reproducible Evaluation Commands
-
- ### A.1 Base vs Enhanced vs Recognition
-
- ```bash
- node scripts/eval-cli.js run \
- --profiles cell_1_base_single_unified,cell_9_enhanced_single_unified,cell_5_recog_single_unified \
- --scenarios struggling_learner,concept_confusion,mood_frustrated_explicit,high_performer \
- --runs 3
- ```
-
- ### A.2 Full 2×2×2 Factorial
-
- ```bash
- node scripts/eval-cli.js run \
- --profiles cell_1_base_single_unified,cell_2_base_single_psycho,cell_3_base_multi_unified,cell_4_base_multi_psycho,cell_5_recog_single_unified,cell_6_recog_single_psycho,cell_7_recog_multi_unified,cell_8_recog_multi_psycho \
- --runs 3
- ```
-
- ### A.3 Domain Generalizability
-
- ```bash
- EVAL_CONTENT_PATH=./content-test-elementary \
- EVAL_SCENARIOS_FILE=./content-test-elementary/scenarios-elementary.yaml \
- node scripts/eval-cli.js run \
- --profiles cell_1_base_single_unified,cell_3_base_multi_unified,cell_5_recog_single_unified,cell_7_recog_multi_unified \
- --scenarios struggling_student,concept_confusion,frustrated_student \
- --runs 1
- ```
-
- ### A.4 Factor Effect Analysis
-
- ```sql
- SELECT
- profile_name,
- ROUND(AVG(overall_score), 1) as avg_score,
- COUNT(*) as n
- FROM evaluation_results
- WHERE run_id = '[RUN_ID]'
- AND overall_score IS NOT NULL
- GROUP BY profile_name
- ORDER BY avg_score DESC
- ```
-
- ---
-
- ## Appendix B: Revision History
-
- | Date | Version | Changes |
- |------|---------|---------|
- | 2026-02-04 | v1.0 | Initial draft |
- | 2026-02-06 | v1.1 | Added corrected memory isolation, active control, cross-judge analysis. Corrected GPT-5.2 effect sizes after deduplication. |
- | 2026-02-06 | v1.2 | **Critical correction**: Reframed "placebo" as "post-hoc active control." Original cross-model comparison (Nemotron active control vs Kimi base, d=-1.03) was confounded. Same-model data shows active control $\approx$ +9 pts above base, recognition $\approx$ +15 pts—recognition doubles the benefit of generic elaboration. Acknowledged post-hoc design and active control content. |
- | 2026-02-06 | v1.3–v1.4 | Intermediate revisions: corrected factorial, qualitative analysis, production quality fixes. Superseded by v1.5. |
- | 2026-02-07 | v1.5 | **Rubric iteration**: Updated to 14-dimension rubric with dialogue transcript context, Productive Struggle, and Epistemic Honesty dimensions. Re-scored cells 6, 8 (N=88): minimal change (+0.5, +0.6 pts). Added holistic dialogue evaluation for multi-turn transcripts. Cross-judge replication on updated rubric (r=0.55, N=88). Added citations to Related Work. |
- | 2026-02-08 | v1.6 | **Content isolation fix**: Identified and fixed two bugs causing cross-domain content leakage in elementary scenarios. Reframed "model hallucination" as system-level content isolation failures. Updated Sections 6.7, 7.3, 8, and 9. Noted architecture effect inflation on elementary content. |