npm - @machinespirits/eval - Versions diffs - 0.2.1 → 0.3.0 - Mend

@machinespirits/eval 0.2.1 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (78) hide show

package/README.md +91 -9
package/config/eval-settings.yaml +3 -3
package/config/paper-manifest.json +486 -0
package/config/providers.yaml +9 -6
package/config/tutor-agents.yaml +2261 -0
package/content/README.md +23 -0
package/content/courses/479/course.md +53 -0
package/content/courses/479/lecture-1.md +361 -0
package/content/courses/479/lecture-2.md +360 -0
package/content/courses/479/lecture-3.md +655 -0
package/content/courses/479/lecture-4.md +530 -0
package/content/courses/479/lecture-5.md +326 -0
package/content/courses/479/lecture-6.md +346 -0
package/content/courses/479/lecture-7.md +326 -0
package/content/courses/479/lecture-8.md +273 -0
package/content/courses/479/roadmap-slides.md +656 -0
package/content/manifest.yaml +8 -0
package/docs/research/apa.csl +2133 -0
package/docs/research/build.sh +98 -0
package/docs/research/figures/figure1.png +0 -0
package/docs/research/figures/figure10.png +0 -0
package/docs/research/figures/figure11.png +0 -0
package/docs/research/figures/figure2.png +0 -0
package/docs/research/figures/figure3.png +0 -0
package/docs/research/figures/figure4.png +0 -0
package/docs/research/figures/figure5.png +0 -0
package/docs/research/figures/figure6.png +0 -0
package/docs/research/figures/figure7.png +0 -0
package/docs/research/figures/figure8.png +0 -0
package/docs/research/figures/figure9.png +0 -0
package/docs/research/header.tex +25 -0
package/docs/research/paper-full.md +2565 -0
package/docs/research/paper-short.md +436 -0
package/docs/research/references.bib +1143 -0
package/docs/research/slides-header.tex +188 -0
package/docs/research/slides-pptx.md +363 -0
package/docs/research/slides.md +531 -0
package/docs/research/style-reference-pptx.py +199 -0
package/package.json +5 -5
package/scripts/analyze-eval-results.js +69 -17
package/scripts/analyze-mechanism-traces.js +763 -0
package/scripts/analyze-modulation-learning.js +498 -0
package/scripts/analyze-prosthesis.js +144 -0
package/scripts/analyze-run.js +264 -79
package/scripts/assess-transcripts.js +853 -0
package/scripts/browse-transcripts.js +854 -0
package/scripts/check-parse-failures.js +73 -0
package/scripts/code-dialectical-modulation.js +1320 -0
package/scripts/download-data.sh +55 -0
package/scripts/eval-cli.js +106 -18
package/scripts/generate-paper-figures.js +663 -0
package/scripts/generate-paper-figures.py +577 -76
package/scripts/generate-paper-tables.js +299 -0
package/scripts/qualitative-analysis-ai.js +3 -3
package/scripts/render-sequence-diagram.js +694 -0
package/scripts/test-latency.js +210 -0
package/scripts/test-rate-limit.js +95 -0
package/scripts/test-token-budget.js +332 -0
package/scripts/validate-paper-manifest.js +670 -0
package/services/__tests__/evalConfigLoader.test.js +2 -2
package/services/__tests__/learnerRubricEvaluator.test.js +361 -0
package/services/__tests__/learnerTutorInteractionEngine.test.js +326 -0
package/services/evaluationRunner.js +975 -98
package/services/evaluationStore.js +12 -4
package/services/learnerTutorInteractionEngine.js +27 -2
package/services/mockProvider.js +133 -0
package/services/promptRewriter.js +1471 -5
package/services/rubricEvaluator.js +55 -2
package/services/transcriptFormatter.js +675 -0
package/config/machinespirits-eval.code-workspace +0 -11
package/docs/EVALUATION-VARIABLES.md +0 -589
package/docs/REPLICATION-PLAN.md +0 -577
package/scripts/analyze-run.mjs +0 -282
package/scripts/compare-runs.js +0 -44
package/scripts/compare-suggestions.js +0 -80
package/scripts/dig-into-run.js +0 -158
package/scripts/show-failed-suggestions.js +0 -64
/package/scripts/{check-run.mjs → check-run.js} +0 -0

package/docs/research/paper-short.md ADDED Viewed

@@ -0,0 +1,436 @@
+---
+title: "*Geist* in the Machine: Mutual Recognition and Multiagent Architecture for Dialectical AI Tutoring"
+author: "Liam Magee"
+date: "February 2026"
+version: "2.3.14-short"
+bibliography: references.bib
+csl: apa.csl
+link-citations: true
+abstract: |
+  Current AI tutoring treats learners as knowledge deficits to be filled. We propose an alternative grounded in Hegel's theory of mutual recognition, where effective pedagogy requires acknowledging learners as autonomous subjects whose understanding has intrinsic validity. We implement this through recognition-enhanced prompts and a multi-agent architecture where an "Ego" agent generates pedagogical suggestions and a "Superego" agent evaluates them before delivery. Across thirty-seven evaluations (N=3,383 primary scored), recognition theory emerges as the primary driver of improvement: a 2$\times$2 memory isolation experiment (N=120) shows recognition produces d=1.71, while memory alone provides only d=0.46. A multi-model probe across five ego models (N=655) confirms architecture and recognition contribute additively, not synergistically. Cross-judge replication with GPT-5.2 validates the main findings at compressed magnitudes (inter-judge r=0.44--0.64). Phase 2 experiments reveal that nine architectural mechanisms are equivalent under scripted learners but differentiate with dynamic interlocutors: Theory of Mind profiling adds 4.1 points when genuine feedback loops exist. These results suggest that philosophical theories of intersubjectivity can serve as productive design heuristics for AI systems.
+fontsize: 12pt
+geometry: margin=1in
+header-includes: |
+  \usepackage{float}
+  \floatplacement{figure}{H}
+---
+# *Geist* in the Machine: Mutual Recognition and Multiagent Architecture for Dialectical AI Tutoring (Short Version)
+*This is a condensed version of the full paper. For complete results, appendices, system prompts, and reproducibility commands, see the full paper.*
+## 1. Introduction
+The dominant paradigm in AI-assisted education treats learning as information transfer: the learner lacks knowledge, the tutor possesses it, and the interaction succeeds when knowledge flows from tutor to learner. This paradigm---implicit in most intelligent tutoring systems, adaptive learning platforms, and educational chatbots---treats the learner as fundamentally passive: a vessel to be filled, a gap to be closed.
+This paper proposes an alternative grounded in Hegel's theory of mutual recognition. In the *Phenomenology of Spirit* [@Hegel1977PhenomenologyMiller], Hegel argues that genuine self-consciousness requires recognition from another consciousness that one in turn recognizes as valid. The master-slave dialectic reveals that one-directional recognition fails: the master's self-consciousness remains hollow because the slave's acknowledgment, given under duress, does not truly count. Only mutual recognition---where each party acknowledges the other as an autonomous subject---produces genuine selfhood.
+The connection between Hegelian thought and pedagogy is well established. Vygotsky's zone of proximal development [@vygotsky1978] presupposes a dialogical relationship echoing Hegel's mutual constitution of self-consciousness. The German *Bildung* tradition explicitly frames education as self-formation through encounter with otherness [@stojanov2018], and recognition theory [@honneth1995] has been applied to educational contexts [@huttunen2007]. Our contribution is to operationalize these philosophical commitments as concrete design heuristics for AI tutoring systems and to measure their effects empirically.
+We argue this framework applies directly to pedagogy. When a tutor treats a learner merely as a knowledge deficit, the learner's contributions become conversational waypoints rather than genuine inputs. The tutor acknowledges and redirects, but does not let the learner's understanding genuinely shape the interaction. This is pedagogical master-slave dynamics: the tutor's expertise is confirmed, but the learner remains a vessel rather than a subject.
+A recognition-oriented tutor, by contrast, treats the learner's understanding as having intrinsic validity---not because it is correct, but because it emerges from an autonomous consciousness working through material. The learner's metaphors, confusions, and insights become sites of joint inquiry. The tutor's response is shaped by the learner's contribution, not merely triggered by it.
+We operationalize this through: (1) **recognition-enhanced prompts** that instruct the AI to treat learners as autonomous subjects; (2) **a multi-agent architecture** where a "Superego" agent evaluates whether suggestions achieve genuine recognition; (3) **new evaluation dimensions** that measure recognition quality alongside traditional pedagogical metrics; and (4) **test scenarios** specifically designed to probe recognition behaviors.
+In controlled evaluations across thirty-seven key evaluations (N=3,383 primary scored responses; N=7,000+ across all development runs), we isolate the contribution of recognition theory from prompt engineering effects and memory integration. The definitive test is a corrected 2$\times$2 memory isolation experiment (N=120 across two independent runs): recognition theory is the primary driver, producing d=1.71 (+15.2 pts) even without memory, while memory alone provides only d=0.46 (+4.8 pts, $p \approx .08$). A full 2$\times$2$\times$2 factorial (N=350) confirms recognition as the dominant factor ($\eta^2$=.243, d=1.11). A multi-model probe across five ego models (N=655) confirms that architecture and recognition contribute additively, not synergistically.
+The contributions of this paper include: a theoretical framework connecting Hegelian recognition to AI pedagogy; a multi-agent architecture implementing recognition through Freudian structural theory; empirical evidence across thirty-seven evaluations (N=3,383); a corrected memory isolation experiment demonstrating recognition as the primary driver; evidence from a post-hoc active control showing recognition gains substantially exceed generic pedagogical elaboration; bilateral transformation metrics showing tutor-side adaptation (+26%); post-hoc modulation analysis reframing the Drama Machine as *phronesis* rather than productive irresolution; mechanism robustness testing revealing the scripted learner confound; a cognitive prosthesis test establishing a minimum ego capability threshold; and qualitative transcript assessment identifying three specific changes recognition produces.
+---
+## 2. Related Work
+Four literatures converge on this work without previously intersecting: (1) psychoanalytic readings of LLMs, which interpret AI through Freudian and Lacanian frameworks but do not build systems [@black2025subject; @possati2021algorithmic; @millar2021psychoanalysis; @kim2025humanoid]; (2) recognition theory in education, which applies Honneth to pedagogy but not to AI [@huttunen2004teaching; @fleming2011honneth; @stojanov2018]; (3) multi-agent tutoring architectures, which decompose tasks but do not evaluate relational quality [@wang2025genmentor; @schmucker2024ruffle; @chu2025llmagents]; and (4) LLM-as-Judge evaluation methodology [@zheng2023judging; @gu2025surveyjudge; @li2024llmsjudges]. We sit at the intersection: a constructive, empirically evaluated system that operationalizes recognition theory through psychoanalytically-inspired architecture, assessed through a multi-judge framework.
+**AI tutoring** has progressed from early systems like SCHOLAR [@carbonell1970] through Bayesian knowledge tracing [@corbett1995] to neural approaches using pretrained language models [@kasneci2023]. A systematic review of 88 empirical studies [@shi2025llmeducation] finds consistent engagement benefits but limited evidence on deep conceptual learning. Multi-agent frameworks including GenMentor [@wang2025genmentor] and Ruffle&Riley [@schmucker2024ruffle] decompose tutoring into specialized agents but give less attention to the relational dynamics of the tutor-learner interaction. Most ITS research focuses on *what* to teach and *when* to intervene; our work addresses *how* to relate to the learner as a subject.
+**Prompt engineering** research treats prompts as behavioral specifications [@brown2020; @wei2022]. Our recognition prompts specify something different: agent-other relations. The closest precedent is Constitutional AI [@bai2022constitutional], where models critique outputs according to constitutional principles. Critical work on self-correction [@kamoi2024selfcorrection] shows LLMs largely cannot correct their own mistakes without external feedback---directly motivating our Superego as structural external critic. Reflexion [@shinn2023reflexion] demonstrated the promise of verbal self-reflection but noted a "degeneration-of-thought" problem, which our architecture avoids through a separate evaluative context.
+**The Drama Machine** framework for character development in narrative AI systems [@magee2024drama] provides the architectural inspiration. The core observation is that realistic characters exhibit internal conflict---competing motivations, self-doubt, moral tension---that produces dynamic behavior rather than flat consistency. We adapt this to pedagogy, where the tutor's Ego (warmth, engagement) and Superego (rigor, standards) create productive conflict.
+**Sycophancy** in language models [@perez2022; @sharma2023] has been specifically identified as a pedagogical risk [@siai2025sycophancy]. Recent work has clarified the mechanisms: preference-based post-training causally amplifies sycophancy [@shapira2026rlhf], and the phenomenon can escalate from surface agreeableness to active subterfuge [@denison2024_reward_tampering; @greenblatt2024_alignment_faking]. Our framework connects this to recognition theory: sycophancy is the pedagogical equivalent of Hegel's hollow recognition. A sycophantic tutor confirms the learner's existing understanding rather than challenging it---the master-slave dynamic where the learner's contributions are mentioned but never genuinely shape the interaction.
+**Constructivist pedagogy** [@piaget1954; @vygotsky1978] emphasizes that learners actively construct understanding. Research on "productive struggle" [@kapur2008; @warshauer2015] examines how confusion and difficulty, properly supported, enhance learning. Our recognition framework operationalizes productive struggle: the Superego explicitly checks whether the Ego is short-circuiting struggle by rushing to resolve confusion.
+**LLM-as-Judge evaluation** has become a major methodological paradigm. Zheng et al. [-@zheng2023judging] demonstrated that GPT-4 achieves over 80% agreement with human experts while identifying systematic biases including position bias and verbosity bias. Our evaluation methodology uses three independent LLM judges with systematic inter-judge reliability analysis, reporting within-judge comparisons for factor analysis and cross-judge replication to validate effect directions.
+---
+## 3. Theoretical Framework
+### 3.1 Hegel's Master-Slave Dialectic and Pedagogy
+Hegel's analysis of recognition begins with the "struggle for recognition" between two self-consciousnesses. The master-slave outcome represents a failed resolution: the master achieves apparent recognition, but this is hollow because the slave's acknowledgment does not count---the slave has not been recognized as an autonomous consciousness whose acknowledgment matters.
+Crucially, Hegel does not leave the dialectic at this impasse. The slave achieves more genuine self-consciousness through *formative activity* (*Bildung*): through disciplined labor under pressure, the slave develops skills, self-discipline, and a richer form of self-consciousness. This has direct pedagogical implications: the learner's productive struggle with difficult material is not an obstacle to self-consciousness but a *constitutive condition* for it. What recognition theory adds is the requirement that this struggle be *acknowledged* rather than bypassed.
+We apply this as a *derivative* rather than a replica. We distinguish three levels: (1) **recognition proper** (intersubjective acknowledgment between self-conscious beings---unachievable by AI); (2) **dialogical responsiveness** (being substantively shaped by the other's input---architecturally achievable); and (3) **recognition-oriented design** (architectural features that approximate recognition's functional benefits---what we implement and measure). Our claim is that level three produces measurable pedagogical benefits without requiring level one.
+A recognition-oriented pedagogy requires acknowledging the learner as subject, genuine engagement with learner contributions, mutual transformation through the encounter, and honoring productive struggle rather than short-circuiting it.
+### 3.2 Connecting Hegel and Freud: The Internalized Other
+Both Hegel and Freud describe how the external other becomes an internal presence enabling self-regulation. In Hegel, self-consciousness achieves genuine selfhood only by internalizing the other's perspective. In Freud, the Superego is literally the internalized parental/social other. Honneth's [@honneth1995] synthesis provides the theoretical grounding: Hegel's recognition theory gains psychological concreteness through psychoanalytic concepts, while psychoanalytic concepts gain normative grounding through recognition theory.
+Three connecting principles link the frameworks. First, internal dialogue precedes adequate external action---the Ego-Superego exchange before external response enacts the principle that adequate recognition requires prior internal work. Second, standards of recognition are socially constituted but individually held---the Superego represents internalized recognition standards. Third, self-relation depends on other-relation---the tutor's capacity for recognition emerges through the architecture's internal other-relation.
+We supplement with Freud's "Mystic Writing-Pad" [@freud1925] model of memory: accumulated memory of the learner functions as wax-base traces that shape future encounters. Memory integration operationalizes the ongoing nature of recognition---not a single-turn achievement but an accumulated relationship.
+---
+## 4. System Architecture
+### 4.1 The Ego/Superego Design
+Two agents collaborate to produce each tutoring response. **The Ego** generates pedagogical suggestions given the learner's context, including recognition principles (treat the learner as autonomous subject), memory guidance, decision heuristics, and quality criteria. **The Superego** evaluates the Ego's suggestions before any reach the learner: Does this engage with the learner's contribution or merely mention it? Does this create conditions for transformation or just transfer information? Does this honor productive struggle or rush to resolve confusion?
+A crucial theoretical refinement: the Superego is not conceived as a separate equal agent but as a *trace*---a memorial, a haunting. It represents the internalized voice of past teachers and accumulated pedagogical maxims. Recognition occurs in the Ego-Learner encounter, not in the Ego-Superego dialogue. The Ego is a *living* agent torn between two pressures: the *ghost* (Superego as internalized authority) and the *living Other* (the learner seeking recognition).
+### 4.2 Dialectical Negotiation
+The Ego generates an initial suggestion (thesis), the Superego generates a genuine critique (antithesis), and multi-turn negotiation produces one of three outcomes: dialectical synthesis (~60%), compromise, or genuine conflict. The evaluation reveals this catches specific failure modes: engagement failures (64%), specificity gaps (51%), premature resolution (48%). Notably, encoding these patterns as static rules fails to replicate the Superego's benefit, suggesting value lies in contextual judgment (*phronesis*) rather than rule enforcement.
+### 4.3 Phase 2 Mechanisms
+Phase 2 extends the base architecture with three mechanism families. **Self-reflective evolution** (cells 40--45): between turns, both ego and superego generate first-person reflections on their own operation, injected into subsequent turns. **Other-ego profiling (Theory of Mind)** (cells 54--65): an LLM call synthesizes an evolving profile of the learner, tracking cognitive state, learning patterns, resistance points, and leverage points. In bidirectional configurations, the learner similarly builds a profile of the tutor, creating a genuine feedback loop. **Superego disposition rewriting** (cells 34--39): the superego's evaluation criteria evolve between turns based on learner engagement feedback.
+---
+## 5. Evaluation Methodology
+### 5.1 Rubric Design
+The evaluation rubric comprises 14 dimensions across three categories, each scored 1--5 by an LLM judge. **Standard pedagogical dimensions** (8 dimensions, 81% raw weight) include relevance, specificity, pedagogical soundness, personalization, actionability, tone, productive struggle, and epistemic honesty. **Recognition dimensions** (4 dimensions, 29.9% raw weight) operationalize Hegelian recognition: mutual recognition, dialectical responsiveness, memory integration, and transformative potential. **Bilateral transformation dimensions** (2 dimensions, 10% raw weight) measure mutual change: tutor adaptation and learner growth. Raw weights total 120.9% and are normalized to sum to 1.0.
+A complementary 6-dimension learner rubric scores learner turns independently: authenticity, question quality, conceptual engagement, revision signals, deliberation depth, and persona consistency.
+### 5.2 Test Scenarios and Agent Profiles
+The primary curriculum is Hegelian philosophy, with domain generalizability tested on elementary mathematics (4th-grade fractions). Fifteen scenarios probe recognition behaviors, including single-turn scenarios (recognition-seeking learner, transformative moment, memory continuity) and multi-turn scenarios (misconception correction, frustration to breakthrough, mutual transformation journey).
+Five agent profiles provide structured comparisons: **Base** (minimal instructions), **Enhanced** (improved instructions without recognition theory), **Recognition** (full Hegelian framework with memory), **Recognition+Multi** (full treatment with ego/superego architecture), and **Active Control** (length-matched, pedagogical best practices, no recognition theory).
+### 5.3 Model Configuration
+**Kimi K2.5** (Moonshot AI) is the primary tutor model---capable and free to access, making results reproducible without API costs. **Nemotron 3 Nano 30B** (NVIDIA) serves as a weaker secondary model. **Claude Opus** serves as the primary judge. Additional ego models in the multi-model probe include DeepSeek V3.2, GLM-4.7, and Claude Haiku 4.5.
+### 5.4 Evaluation Pipeline
+The end-to-end pipeline proceeds in three stages. **Stage 1 (Generation)**: For each cell, the CLI loads a scenario and agent profile, then sends the learner context to the tutor agent(s) via OpenRouter API calls. For multi-turn scenarios, the learner agent generates responses between tutor turns. **Stage 2 (Scoring)**: Each generated response is sent to the judge model along with the full rubric, scenario context, and (for multi-turn dialogues) the complete transcript. The judge scores each dimension on a 1--5 scale and returns structured JSON, stored in a SQLite database. **Stage 3 (Analysis)**: Statistical analyses (ANOVA, effect sizes, confidence intervals) are computed from the scored database. Cross-judge replication sends identical responses to a second judge model.
+### 5.5 Statistical Approach
+Complementary analyses form a converging evidence strategy: recognition theory validation (N=36), full 2$\times$2$\times$2 factorial (N=350), A$\times$B interaction probes across five models (N=655), domain generalizability (N=60), memory isolation (N=120), and cross-judge replication with GPT-5.2. We report Cohen's d, ANOVA F-tests ($\alpha$=0.05), and 95% confidence intervals. Effect sizes follow standard conventions: |d| < 0.2 negligible, 0.2--0.5 small, 0.5--0.8 medium, >0.8 large.
+### 5.6 Inter-Judge Reliability
+To assess reliability, identical tutor responses were scored by multiple AI judges. The primary comparison (Claude Code vs GPT-5.2, N=36 paired responses) yields r=0.66 (p<.001). Claude-Kimi shows weaker agreement (r=0.38, p<.05), while Kimi-GPT is weakest (r=0.33, p<.10). Calibration differs: Kimi (87.5) is most lenient, Claude (84.4) middle, GPT (76.1) strictest. Kimi exhibited severe ceiling effects, assigning maximum scores on actionability for every response, reducing its discriminative capacity.
+The strongest cross-judge agreement occurs on tone (r=0.36--0.65) and specificity (r=0.45--0.50), while relevance and personalization show poor agreement. Claude prioritizes engagement and recognition quality; Kimi prioritizes structural completeness; GPT applies stricter overall standards but agrees with Claude on relative rankings. This validates within-judge comparisons for factor analysis while cautioning against cross-judge score comparisons. A full cross-judge replication is reported in Section 6.13.
+### 5.7 Sample Size Reconciliation
+**Table 1: Evaluation Sample Summary**
+| Evaluation | Section | N Scored |
+|------------|---------|----------|
+| Recognition validation | 6.1 | 36 |
+| Full factorial (cells 1--8, 2 runs) | 6.2 | 350 |
+| Memory isolation (2 independent runs) | 6.3 | 120 |
+| Active control (post-hoc) | 6.3 | 118 |
+| A$\times$B probes (5 ego models) | 6.4 | 655 |
+| Domain generalizability (elementary math) | 6.5 | 60 |
+| Hardwired rules ablation | 6.6 | 72 |
+| Dialectical modulation (cells 22--33) | 6.7 | 174 |
+| Self-reflective evolution (cells 40--45) | 6.8 | 150 |
+| Mechanism robustness, scripted (cells 40--59) | 6.9 | 360 |
+| Dynamic learner mechanisms (cells 60--70) | 6.9 | 300 |
+| Cognitive prosthesis (cells 66--68) | 6.9 | 96 |
+| Bilateral transformation (multi-turn) | 6.10 | 118 |
+| Qualitative transcript assessment | 6.11 | 478 |
+| Cross-judge replication (GPT-5.2) | 6.12 | 977 |
+| Prompt elaboration baseline | 6.13 | 144 |
+| Token budget sensitivity | 6.14 | 126 |
+| Dialectical impasse test | 6.15 | 24 |
+| **Paper totals** | — | **3,383** |
+The complete database contains 7,000+ evaluations across 117+ runs. This table groups the thirty-seven key evaluations by topic; several rows combine multiple runs (e.g., the factorial comprises two runs, the memory isolation two independent replications). The full paper's Appendix D provides a per-run breakdown.
+---
+## 6. Results
+### 6.1 Three-Way Comparison: Recognition vs Enhanced vs Base
+A three-way comparison (N=36) provides preliminary evidence that recognition theory adds value beyond prompt engineering. Recognition scores 91.6 (SD=6.2), Enhanced 83.6 (SD=10.8), Base 72.0 (SD=10.8). The recognition increment over enhanced prompting is +8.0 points, with the total recognition effect +19.7 points above base (one-way ANOVA F(2,33)=12.97, p<.001). However, this comparison bundles recognition theory with memory integration. The controlled 2$\times$2 design below disentangles these factors.
+### 6.2 Memory Isolation: The Definitive Finding
+The paper's primary empirical finding comes from a corrected 2$\times$2 memory isolation experiment (Memory ON/OFF $\times$ Recognition ON/OFF, single-agent architecture held constant, N=120 across two independent runs, Kimi K2.5 ego, Claude Opus judge).
+**Table 2: 2$\times$2 Memory Isolation Experiment (N=120, combined across two runs)**
+| | No Recognition | Recognition | $\Delta$ |
+|---|---|---|---|
+| **No Memory** | 75.4 (N=30) | 90.6 (N=30) | +15.2 |
+| **Memory** | 80.2 (N=30) | 91.2 (N=30) | +11.0 |
+| **$\Delta$** | +4.8 | +0.6 | **Interaction: -4.2** |
+Recognition effect: d=1.71, t(45)=6.62, p<.0001. Memory effect: d=0.46, t(57)=1.79, p$\approx$.08, n.s. Combined condition: d=1.81 vs base. The negative interaction (-4.2 pts) indicates ceiling effects rather than synergy: recognition alone reaches ~91, leaving little room for memory to add. Two independent runs show identical condition ordering with no rank reversals.
+A post-hoc **active control** (N=118, Nemotron ego, Opus judge) using length-matched prompts with pedagogical best practices (growth mindset, Bloom's taxonomy, scaffolding strategies) but no recognition theory scores 66.5. Same-model comparison within Nemotron data: recognition (~73) > active control (66.5) > base (~58). Recognition gains (~+15 pts) roughly double the active control's benefit (~+9 pts), supporting recognition theory's specific contribution beyond prompt length.
+**Cross-judge confirmation**: GPT-5.2, scoring the identical responses (N=119 paired), replicates recognition dominance with identical condition ordering: recognition d=1.54 (vs Claude d=1.71), memory d=0.49, negative interaction -3.6. Inter-judge r=0.63 (p<.001).
+### 6.3 Full Factorial Analysis: 2$\times$2$\times$2 Design
+Three factors examined: Factor A (Recognition: base vs recognition prompts), Factor B (Tutor Architecture: single-agent vs multi-agent ego/superego), Factor C (Learner Architecture: single-agent vs multi-agent learner).
+**Table 3: Full Factorial Results (Kimi K2.5, N=350 scored of 352 attempted)**
+| Cell | A: Recognition | B: Tutor | C: Learner | N | Mean | SD |
+|------|----------------|----------|------------|---|------|-----|
+| 1 | Base | Single | Single | 44 | 73.4 | 11.5 |
+| 2 | Base | Single | Multi | 42 | 69.9 | 19.4 |
+| 3 | Base | Multi | Single | 45 | 75.5 | 10.3 |
+| 4 | Base | Multi | Multi | 41 | 75.2 | 16.4 |
+| 5 | **Recog** | Single | Single | 45 | 90.2 | 6.5 |
+| 6 | **Recog** | Single | Multi | 44 | 83.9 | 15.4 |
+| 7 | **Recog** | Multi | Single | 45 | 90.1 | 7.2 |
+| 8 | **Recog** | Multi | Multi | 44 | 87.3 | 11.3 |
+**ANOVA Summary (df=1,342):**
+| Source | F | p | $\eta^2$ |
+|--------|---|---|-----|
+| A: Recognition | **110.04** | **<.001** | **.243** |
+| B: Architecture | 3.63 | .057 | .011 |
+| C: Learner | 5.52 | .019 | .016 |
+| A$\times$B | 0.59 | >.10 | .002 |
+| A$\times$C | 0.97 | >.10 | .003 |
+Recognition is the dominant contributor, accounting for 24.3% of variance with d=1.11. Architecture approaches significance (p=.057) with a small positive effect. The multi-agent learner shows a small negative main effect (-3.1 pts, p=.019). All interactions are non-significant. Recognition benefits both learner types consistently: +15.7 pts for single-agent (d=1.73), +13.0 pts for multi-agent (d=0.82).
+### 6.4 Multi-Model A$\times$B Probe: Architecture is Additive
+The same 2$\times$2 design (Recognition $\times$ Architecture, single-agent learner held constant) was tested across five ego models (N$\approx$120 each, Opus judge).
+**Table 4: Multi-Model A$\times$B Interaction Probe (N=655 across 5 ego models)**
+| Ego Model | N | Base Single | Base Multi | Recog Single | Recog Multi | Recognition Effect | A$\times$B |
+|-----------|---|------------|-----------|-------------|------------|-------------------|------|
+| Kimi K2.5 | 179 | 73.4 | 75.5 | 90.2 | 90.1 | +15.7 | -2.3 |
+| Nemotron | 119 | 54.8 | 59.3 | 73.6 | 72.5 | +16.0 | -5.7 |
+| DeepSeek V3.2 | 120 | 69.5 | 73.9 | 84.2 | 87.2 | +14.0 | -1.4 |
+| GLM-4.7 | 117 | 65.8 | 68.6 | 84.0 | 86.0 | +17.8 | -0.7 |
+| Claude Haiku 4.5 | 120 | 80.3 | 82.4 | 90.7 | 91.2 | +9.6 | -1.6 |
+All five models show negative A$\times$B interactions (-5.7 to -0.7, mean -2.2), confirming architecture is additive, not synergistic. The recognition main effect replicates robustly (+9.6 to +17.8, mean +14.8). Multi-agent architecture provides a small benefit in four of five models (-0.8 to +3.7 pts) that does not interact with prompt type. For systems using recognition prompts, multi-agent architecture is unnecessary unless error correction on new domains is needed.
+### 6.5 Domain Generalizability
+Recognition advantage replicates across both domains: philosophy (+15.7 pts) and elementary math (+8.2 pts, N=60 Kimi K2.5). However, the recognition effect on elementary content is scenario-dependent: challenging scenarios show substantial advantage (frustrated\_student: +23.8, concept\_confusion: +13.6), while routine interactions show none (new\_student: +0.2). This is theoretically coherent: recognition behaviors matter most when the learner needs to be acknowledged as a struggling subject.
+On elementary content, the tutor produced wrong-domain references (philosophy content for 4th graders) due to content isolation bugs. The Superego caught and corrected these domain mismatches in multi-agent cells, demonstrating its value as a reality-testing safety net. This Superego function extends beyond recognition-quality critique to anchoring the Ego's responses to the actual curriculum---what Freud would call the reality principle.
+### 6.6 Hardwired Rules Ablation
+Encoding the Superego's five most common critique patterns as static rules in the Ego prompt (N=72, Kimi K2.5, Opus judge) produces performance indistinguishable from base conditions (hardwired single-agent: 74.0 vs base 73.4, hardwired multi-agent: 69.0 vs base 69.9). This supports a *phronesis* interpretation of the Superego's function: the live Superego provides Aristotelian practical wisdom---contextual judgment that cannot be reduced to general rules.
+### 6.7 Dialectical Superego Modulation
+Testing three superego dispositions (suspicious, adversary, advocate) in two negotiation architectures (N=174) reveals three findings. First, recognition reduces internal friction rather than output quality directly: recognition-primed egos produce suggestions the superego approves faster ($d = -2.45$). Second, structural modulation metrics (negation depth, convergence speed, feedback length) do not predict outcome quality (all $|r| < 0.12$, n.s.). Third, the superego is a filter, not an improver---catching poor responses rather than refining good ones. Recognition works by making the ego's first draft better.
+An unexpected adversary over-deference mechanism emerged: the adversary persona combined with recognition in single-turn settings produces a $-11.3$ pt inversion, as the ego removes all prescriptive content to satisfy both recognition's autonomy principle and the adversary's anti-prescriptive stance. Multi-turn interaction rescues this spiral (+20.8 pt swing), because learner feedback provides external reality-testing that breaks the ego-superego echo chamber.
+### 6.8 Self-Reflective Evolution and the Insight-Action Gap
+Between-turn self-reflections (N=90, Nemotron ego/Kimi K2.5 superego, Opus judge) amplify recognition to d=0.91---2.4$\times$ the dialectical-only condition (d=0.38) and approaching the original factorial (d=1.11). A striking disposition gradient emerges: suspicious +19.0, adversary +10.9, advocate +2.6. The more hostile the superego, the more recognition helps---hostile dispositions become productive under recognition but are destructive without it. Base condition scores follow the inverse pattern: advocate (71.5) > adversary (68.4) > suspicious (59.3).
+Despite the amplified effect, a fundamental limitation persists---the insight-action gap. Both base and recognition conditions show *awareness* of failures through self-reflection: the ego correctly identifies repeated patterns, the superego correctly diagnoses non-compliance. But awareness alone does not produce behavioral change. This gap becomes the central design challenge addressed by Theory of Mind mechanisms.
+### 6.9 Mechanism Robustness and the Scripted Learner Confound
+Nine mechanisms tested with scripted learners (N=360, Haiku ego, Opus judge) cluster within a 2.4-point band under recognition (90.3--92.7). No mechanism differentiates from any other. The **scripted learner confound** explains this: when learner messages are predetermined by scenario YAML, profiling builds a model of an interlocutor that does not change, self-reflection adjusts strategy against a static target, and all mechanisms are causally inert.
+With dynamic (ego/superego) learners capable of genuine responses (N=300, Haiku, Opus judge), mechanisms genuinely differentiate:
+**Table 5: Dynamic Learner $\times$ Mechanism (N=300, Opus judge)**
+| | Self-reflect | Profiling | Intersubjective | Combined |
+|---|---|---|---|---|
+| **Base** | 71.4 (22.9) | 75.5 (19.4) | 67.7 (24.6) | 73.9 (19.8) |
+| **Recognition** | 85.9 (15.7) | 88.8 (13.9) | 82.8 (18.8) | 87.8 (12.6) |
+| **$\Delta$** | **+14.5** | **+13.3** | **+15.1** | **+13.9** |
+Four findings emerge. First, recognition with a dynamic learner produces +14.2 pts average---roughly double the scripted effect (+7.6). Second, mechanisms genuinely differentiate: profiling reaches 88.8 while intersubjective framing reaches only 82.8 (6.0-point spread). The profiling effect is additive: +4.1 pts overall, with near-zero recognition interaction ($-0.7$). Third, intersubjective framing underperforms without recognition (67.7, lowest of all cells). Fourth, variance collapses monotonically from SD=24.6 to 12.6 as recognition and mechanism complexity increase---both factors independently constrain output toward consistent quality.
+Theory of Mind profiling is only useful when there is a mind to model. With scripted learners, profiling reduces to confabulation; with dynamic learners, it creates a genuine feedback loop: profile $\to$ adapted strategy $\to$ changed learner response $\to$ updated profile.
+**Cognitive prosthesis test** (N=90, Nemotron ego, Kimi K2.5 superego): Can a strong superego compensate for a weak ego? The prosthesis hypothesis fails decisively. All three superego configurations score M=48.3--51.1---well below Nemotron's own scripted base (M=64.2). The mechanism stack that boosts Haiku by +20 points *hurts* Nemotron by $-15$ points. Dimension analysis reveals two capability tiers: Nemotron succeeds on static dimensions (specificity 4.0, actionability 4.0) but fails on dynamic context integration (adaptation 1.8, dialectical responsiveness 2.0). A Haiku smoke test (N=6, same mechanisms) confirms scores of 90+, establishing a minimum ego capability threshold for mechanism benefit.
+### 6.10 Dimension Analysis and Circularity Check
+A methodological concern: the rubric includes recognition-specific dimensions (33.0% of normalized weight) that the recognition profile is prompted to satisfy. Re-analyzing with only standard pedagogical dimensions (relevance, specificity, pedagogical soundness, personalization, actionability, tone), recognition still outperforms base by +10.0 points. The largest dimension-level effects are in personalization (d=1.82), pedagogical soundness (d=1.39), and relevance (d=1.11)---exactly where treating the learner as a subject should matter. Dimensions where baseline already performed well (specificity d=0.47, actionability d=0.38) show smaller but still positive gains. Recognition does not trade off against factual quality.
+### 6.11 Bilateral Transformation Metrics
+Recognition-prompted tutors measurably adapt their approach in response to learner input (+26% relative improvement in adaptation index across N=118 multi-turn dialogues). However, learner growth is slightly *lower* under recognition (0.210 vs 0.242), suggesting the effect is tutor-side responsiveness rather than symmetric mutual transformation. One interpretation: recognition tutors are more effective at meeting learners where they are, reducing the visible "struggle" markers the growth index captures.
+Post-hoc modulation analysis of the N=350 factorial reveals that multi-agent architecture does not increase behavioral range ($d = 0.05$). Recognition drives calibration: dimension score variance drops dramatically ($d = -1.00$), meaning recognition tutors perform uniformly well across all 14 dimensions. This reframes the Drama Machine's contribution as *phronesis*---contextual practical wisdom that calibrates quality---rather than the productive irresolution the framework emphasizes for narrative.
+A synthetic learning outcome index (N=118) confirms recognition produces modest gains in simulated conceptual growth (+3.8 pts, d=0.32), with all conditions showing substantial learning arcs (15--21 pts first-to-final turn). These remain proxies for actual learning.
+### 6.12 Learner-Side Evaluation: The Superego Paradox
+The tutor-focused rubric captures Factor C indirectly. To measure Factor C's direct effect on learner turn quality, we applied a symmetric 6-dimension learner rubric to the N=118 bilateral transformation dialogues.
+The multi-agent (ego/superego) learner architecture produces significantly *lower*-quality learner responses than the single-agent learner ($d = 1.43$, $F(1,114) = 68.28$, $p < .001$, $\eta^2 = .342$)---the largest effect in the entire study. The ego/superego process was designed to improve learner responses through internal self-critique; instead, it makes them worse. The superego acts as an overzealous editor, polishing away the messy, confused, persona-consistent engagement that characterizes genuine student behavior.
+Recognition partially rescues multi-agent learner quality ($d = 0.79$, $p = .004$) while having no effect on already-high single-agent learner quality ($d = -0.46$, n.s.). Even with rescue, multi-agent learners with recognition (67.0) do not reach single-agent learners without it (76.1). Deliberation depth remains uniformly poor (2.7/5) regardless of recognition---confirming recognition works *around* the superego rather than through it.
+This has a clean Hegelian interpretation: external recognition from an Other is structurally more effective than internal self-critique. You cannot bootstrap genuine dialogue from a monologue.
+### 6.13 Qualitative Transcript Assessment
+AI-assisted qualitative assessment of dialogue transcripts (N=478 across two key runs) reveals three specific changes recognition produces:
+1. **The ego listens to the superego.** In recognition dialogues, when the superego identifies a problem, the ego pivots from prescriptive to Socratic. In base dialogues, the superego generates the same correct diagnosis, but the ego ignores it.
+2. **The tutor builds on learner contributions.** Base tutors route learners to predetermined content regardless of what the learner says. Recognition tutors engage with the learner's actual contribution. The `strategy_shift` tag appears in 30% of recognition dialogues but 0% of base dialogues in the bilateral run.
+3. **Architecture interaction explained.** Without recognition, the ego/superego architecture creates circular self-criticism (`ego_compliance`---the ego complies with the form of revision without changing the substance). With recognition, the ego has sufficient autonomy to incorporate critique productively.
+Blinded same-model validation confirms these discriminations are robust: stalling drops only from 100% to 91.4% in base under blinding; recognition\_moment rises only from 0% to 5.2%.
+**Transcript excerpts** illustrate the qualitative gap. For a struggling learner (score gap: 95.5 points), the base response treats the learner as a progress metric: "You left off at the neural networks section. Complete this lecture to maintain your learning streak." The recognition response treats the learner as an agent who has persisted through difficulty: "This is your third session---you've persisted through quiz-479-3 three times already, which signals you're wrestling with how recognition actually operates in the dialectic..." For a recognition-seeking learner who offered metaphors about dialectics, the base response prescribes generic study behavior with no engagement ("Spend 30 minutes reviewing the foundational material"), while the recognition response directly picks up the learner's creative framing: "Your dance and musical improvisation metaphors show how dialectics transform both partners---let's test them in the master-servant analysis."
+Lexical analysis confirms this pattern quantitatively. Recognition responses deploy a 59% larger vocabulary while maintaining similar word and sentence length. The differential vocabulary is theoretically coherent: recognition-skewed terms are interpersonal and process-oriented ("consider" at 94.6$\times$, "transformed" at 28.9$\times$, "productive" at 28.9$\times$), while base-skewed terms are procedural ("agents," "run," "reinforcement," "completions"). Thematic coding shows struggle-honoring language at 3.1$\times$ the base rate (p<.05), engagement markers at 1.8$\times$ (p<.05), and generic/placeholder language reduced 3$\times$ (p<.05).
+### 6.14 Cross-Judge Replication with GPT-5.2
+GPT-5.2 rejudging of key runs (N=977 paired responses) confirms all directional findings:
+**Table 7: Cross-Judge Replication of Key Findings**
+| Finding | Claude Effect | GPT-5.2 Effect | Replicates? |
+|---------|-------------|----------------|-------------|
+| Recognition (memory isolation) | +15.8 pts (d=1.71) | +9.3 pts (d=1.54) | Yes |
+| Memory effect | +4.8 pts (d=0.46) | +3.1 pts (d=0.49) | Yes (small) |
+| Multi-agent main effect | +2.6 pts | $-0.2$ pts | Yes (null) |
+| A$\times$B interaction | $-3.1$ pts | +1.5 pts | Yes (null) |
+| Mechanism clustering | 2.8 pt spread | 4.4 pt spread | Yes (null) |
+Inter-judge correlations are moderate and significant (r=0.44--0.64, all p<.001). GPT-5.2 finds 37--59% of Claude's effect magnitudes depending on experiment, always in the same direction. The one non-replication: the recognition-vs-enhanced increment (+8.0 under Claude, +2.4 under GPT-5.2, n.s.)---suggesting this increment is more sensitive to judge calibration.
+### 6.15 Prompt Elaboration Baseline
+Comparing the full 344-line base prompt against a 35-line naive prompt (N=144, Opus judge): on Haiku, the naive prompt *outperforms* the elaborate base by +6.8 pts---the prescriptive decision heuristics actively constrain the model's superior pedagogical intuitions. On Kimi K2.5, the elaborate prompt is inert ($\Delta = -0.3$). Recognition ($M = 90.9$ on Haiku) remains well above both baselines, confirming recognition adds value through relational orientation rather than instructional specificity.
+### 6.16 Token Budget Sensitivity
+A dose-response test across five budget levels (256--8000 tokens, N=126, Haiku ego) shows scores flat across all levels. A JSON retry mechanism absorbs truncation: when output is cut mid-JSON, automatic retries produce parseable output. The recognition effect is budget-invariant (+9.0 to +12.8 across levels). Practical implication: 4--16$\times$ budget reduction available at no quality cost.
+### 6.17 Dialectical Impasse Test
+The preceding results test recognition under conditions where productive resolution is readily available. But recognition theory makes a stronger claim: that genuine pedagogical encounters involve working *through* impasse rather than around it. Three 5-turn impasse scenarios were designed where scripted learner messages escalate resistance across turns: **epistemic resistance** (a Popperian falsifiability critique of Hegel's dialectic), **affective shutdown** (emotional disengagement and retreat to memorization), and **productive deadlock** (genuinely incompatible interpretive frameworks). Each was run with 4 cells $\times$ 2 runs = 24 dialogues (Opus judge).
+Recognition produces massive improvements on epistemic (+43 pts) and interpretive (+29 pts) impasses but no advantage on affective shutdown ($\Delta = -1.1$). The null result on affective shutdown sharpens the theoretical claim: recognition's distinctive contribution is epistemological (how the tutor relates to the learner's *ideas*), not primarily affective.
+Resolution strategy coding reveals the mechanism with unusual clarity. Five Hegelian strategies were coded: mutual recognition, domination, capitulation, withdrawal, and scaffolded reframing (Aufhebung). Every base tutor (12/12) withdraws from the dialectical encounter entirely---noting engagement metrics while ignoring the learner's substantive position. When a learner mounts a sophisticated Popperian critique, the base tutor responds: "You've spent 30 minutes deeply analyzing 479-lecture-3---let's move to the next lecture." The learner's position is not dismissed or resolved---it is simply not engaged. Every recognition tutor engages---10/12 through scaffolded reframing, preserving the learner's objection while redirecting toward new conceptual ground. $\chi^2(3) = 24.00$, p<.001, Cramér's V=1.000 (perfect separation). Architecture has no effect on strategy ($\chi^2(3) = 2.00$, $p = .576$). Cross-judge validation with GPT-5.2 confirms the binary separation ($\kappa = 0.84$, 91.3% agreement, 100% on engagement-vs-withdrawal).
+The dominance of scaffolded reframing (83%) over mutual recognition (8%) is itself theoretically significant. Recognition prompts produce sophisticated pedagogical technique---the capacity to hold contradiction productively---rather than genuine mutual transformation. The tutor does not change its mind about Hegel; it holds the learner's counter-position as intellectually valid while maintaining pedagogical direction. This is Aufhebung in pedagogical practice: preserving without capitulating, overcoming without dominating. Only one response (on productive deadlock) was coded as genuine mutual recognition, where the tutor adopted the learner's framework as its own lens rather than merely acknowledging it.
+---
+## 7. Discussion
+### What the Difference Consists In
+The improvements do not reflect greater knowledge---all profiles use the same underlying model. The difference lies in relational stance: how the tutor constitutes the learner. The baseline tutor achieves pedagogical mastery---acknowledged as expert, confirmed through learner progress---but the learner's acknowledgment is hollow because the learner has not been recognized as a subject. The dialectical impasse test provides the clearest evidence: base tutors do not fail by choosing the wrong strategy---they fail by having no strategy at all, bypassing the encounter. The impasse is not resolved, engaged, or even acknowledged---it is bypassed. This maps precisely onto the master-slave analysis: the master consumes the slave's labor (engagement metrics, time-on-page) without encountering the slave as a subject.
+### Architecture as Additive, Not Synergistic
+An early exploratory analysis (N=17, Nemotron) suggested multi-agent architecture might synergize specifically with recognition prompts (+9.2 pts interaction), raising the theoretically appealing possibility that recognition creates qualitatively different conditions for productive internal dialogue. The multi-model probe (N=655) decisively refutes this: all five models show negative A$\times$B interactions. The original finding was sampling noise on a tiny sample.
+The corrected picture is simpler: recognition and architecture contribute additively. The Superego adds modest value regardless of prompt type---through generic quality enforcement rather than recognition-specific deliberation. The dialectical modulation experiments confirm: structural modulation metrics (negation depth, convergence speed) do not predict outcome quality (all $|r| < 0.12$). The hardwired rules ablation shows the Superego's value is *phronesis*---contextual judgment that cannot be codified as rules.
+The modulation analysis reveals why the Drama Machine's prediction of behavioral diversification does not hold for pedagogy. In narrative, internal agents have genuinely conflicting *objectives* (ambition vs loyalty); in tutoring, the Ego and Superego share the same goal (effective pedagogy) and disagree only on execution. This is quality control, not value conflict. Quality control pushes outputs toward a shared standard, reducing variance. The Superego does not increase behavioral range ($d = 0.05$); instead, recognition produces calibration ($d = -1.00$ on dimension variance). Recognition changes the behavioral *repertoire*---shifting from information delivery to relational engagement---while the Superego can only evaluate behaviors already in the Ego's repertoire.
+### The Scripted Learner Confound
+This methodological finding has broad implications: when learner messages are predetermined, all mechanisms are causally inert. Theory of Mind profiling bridges the insight-action gap by giving the ego a model of the other agent to adapt *toward*---providing direction that self-reflection alone cannot supply. This reframes earlier null results: the factorial's architecture null effect may partly reflect scripted learners' inability to respond differently to different architectures.
+### The Learner Superego Paradox
+The learner-side evaluation reveals the study's largest effect: the multi-agent learner architecture *hurts* learner quality ($d = 1.43$). The ego/superego process designed for self-improvement instead suppresses authentic engagement. This inverts the intuition that motivated the architecture. On the tutor rubric, recognition helps both learner types robustly; on the learner rubric, recognition helps multi-agent learners selectively (+9.5 vs -1.3 pts). The recognitive tutor creates conditions where authentic engagement is valued, counteracting the superego's flattening. But external recognition cannot fix the internal process---deliberation depth is unaffected. The Hegelian interpretation is direct: encounter with the Other provides something that internal self-relation cannot.
+### Domain Limits and Practical Recommendations
+Recognition theory provides its greatest benefit for abstract, interpretive content where intellectual struggle involves identity-constitutive understanding. When a learner grapples with Hegel's concept of self-consciousness, they are potentially transforming how they understand themselves. For concrete procedural content, recognition's effect is modulated by scenario difficulty rather than content type alone: even in elementary math, recognition helps frustrated learners (+23.8 pts) while adding nothing to routine interactions.
+This suggests a nuanced deployment strategy: high recognition value for philosophy, literature, and identity-constitutive learning; moderate for science concepts and historical understanding; lower for purely procedural skills---though even there, recognition helps when learners face emotional or cognitive challenge.
+The practical design hierarchy is clear: (1) recognition-enhanced prompts first (largest impact, zero infrastructure cost); (2) multi-agent architecture only for domain transfer or quality assurance (the Superego adds +0.5 pts at 2.7$\times$ latency on well-trained domains but provides essential error correction on new domains); (3) Theory of Mind profiling only with genuine multi-turn interaction; (4) prefer minimal prompts with relational framing over elaborate prescriptive scaffolding; (5) validate ego model capability before deploying complex mechanisms---mechanisms that boost capable models can actively hurt weaker ones.
+### Implications for AI Prompting and Personality
+Most prompting research treats prompts as behavioral specifications. Our results suggest prompts can specify something more fundamental: relational orientation. The difference between baseline and recognition prompts is not about different facts but about who the learner is (knowledge deficit vs autonomous subject), what the interaction produces (information transfer vs adaptive responsiveness), and what counts as success (correct content vs productive struggle honored). The prompt elaboration baseline demonstrates this empirically: 344 lines of prescriptive behavioral rules produce *worse* results than 35 lines of minimal instructions on capable models, while recognition theory (which specifies relational stance rather than behavioral rules) consistently improves quality.
+AI personality research typically treats personality as dispositional---stable traits the system exhibits. Our framework suggests personality is better understood relationally: not what traits the AI has, but how it constitutes its interlocutor. Two systems with identical "helpful" dispositions could differ radically in recognition quality---one warm while treating users as passive, another warm precisely by treating contributions as genuinely mattering.
+---
+## 8. Limitations
+**Simulated learners**: All evaluations use scripted or LLM-generated learner turns rather than real learners. While this enables controlled comparison, it may miss dynamics that emerge in genuine human interaction. The synthetic learning outcome index (Section 6.10) provides a proxy, but these are AI-judge assessments of LLM-generated behavior, not actual knowledge acquisition. Whether recognition-enhanced tutoring produces genuine learning gains in human learners remains the critical open question requiring classroom studies.
+**LLM-based evaluation**: Using an LLM judge to evaluate recognition quality may introduce biases---the judge may reward surface markers of recognition rather than genuine engagement. Inter-judge reliability is moderate (r=0.33--0.66), with different judges weighting criteria differently. Cross-judge replication confirms directional findings at compressed magnitudes (37--59% of primary effect sizes). The recognition-vs-enhanced increment (+8.0 under Claude) does not replicate under GPT-5.2, warranting caution on its precise magnitude. LLM judges are also subject to version drift: our primary judge was updated from Opus 4.5 to 4.6 during data collection, so all early runs were rejudged under 4.6 for consistency. An empirical check on matched conditions shows stable recognition deltas before and after rejudging (+16.3 vs +15.6).
+**Active control limitations**: The post-hoc active control (N=118) was designed *after* observing recognition effects, not as part of the original protocol. It ran on Nemotron rather than the primary factorial's Kimi K2.5, requiring same-model comparisons. The base prompts were already designed to produce competent tutoring; the active control contains real pedagogical content (growth mindset, Bloom's taxonomy, scaffolding), functioning as an *active* control rather than a true placebo. A same-model control on Kimi would strengthen the comparison.
+**Model dependence**: Results were obtained with specific models (primarily Kimi K2.5 and Nemotron). The multi-model probe across five ego models (N=655) provides evidence for generality of the recognition effect, but the full mechanism suite has been tested only on Haiku and Nemotron.
+**Domain sampling**: We tested two domains (philosophy, elementary math). Content isolation bugs partly inflated the architecture effect on elementary content. Broader domain coverage (technical STEM, creative writing, social-emotional content) is needed before generalizability can be considered established.
+**Scripted learner confound**: The mechanism robustness test (N=360) uses scripted learners, rendering all mechanisms causally inert. Dynamic learner results (N=300) partially address this but cover only four mechanisms and two scenarios. The factorial's architecture null effect may partly reflect the scripted learner's inability to respond differently to different architectures.
+**Short-term evaluation**: We evaluate individual sessions, not longitudinal relationships. The theoretical framework emphasizes accumulated understanding through the Mystic Writing Pad memory model, which single-session evaluation cannot capture.
+**Bilateral transformation asymmetry**: Recognition produces tutor-side adaptation (+26%) but learner growth is slightly lower, complicating the theoretical claim of *mutual* transformation. The effect is better characterized as tutor-side responsiveness.
+---
+## 9. Conclusion
+Across thirty-seven evaluations (N=3,383 primary scored), the evidence converges on recognition-enhanced prompting as the dominant driver of AI tutoring improvement:
+1. **Recognition as primary driver**: Memory isolation (N=120): d=1.71 for recognition vs d=0.46 for memory. Full factorial (N=350): $\eta^2$=.243, d=1.11. Directly effective without memory infrastructure.
+2. **Architecture is additive**: Five ego models (N=655) show negative A$\times$B interactions. Multi-agent adds modest value independent of prompt type; its primary demonstrated function is error correction.
+3. **Tutor adaptation**: Recognition-prompted tutors adapt measurably (+26%), though the "mutual" transformation claim requires qualification---learner-side growth does not increase.
+4. **Domain generalizability**: Recognition replicates across philosophy (+15.7) and elementary math (+8.2), concentrated in challenging scenarios.
+5. **Mechanisms require dynamic learners**: Nine mechanisms are equivalent under scripted learners. With dynamic interlocutors, profiling differentiates (+4.1 pts) through genuine Theory of Mind feedback loops.
+6. **Cross-judge robustness**: GPT-5.2 replicates all directional findings at 37--59% of primary magnitudes.
+7. **Dialectical impasse**: Perfect strategy separation---12/12 base tutors withdraw, 10/12 recognition tutors use scaffolded reframing (Aufhebung). V=1.000.
+8. **Cognitive prosthesis fails**: The same mechanisms boost capable models (+20) but hurt weak ones ($-15$), establishing a minimum ego capability threshold.
+9. **Prompt elaboration is counterproductive**: The naive baseline outperforms the elaborate base on strong models (+6.8 pts). Recognition adds value through relational orientation, not prescriptive scaffolding.
+These results carry implications for AI alignment more broadly. If mutual recognition is pedagogically superior, and if recognition requires the AI to be genuinely shaped by human input, then aligned AI might need to be constitutionally open to transformation---not just trained to simulate openness. The bilateral transformation metrics provide empirical evidence: recognition-prompted tutors measurably adapt based on learner input, while baseline tutors maintain rigid stances. Recognition-oriented AI does not just respond to humans; it is constituted, in part, through the encounter.
+The broader implication for AI system design is that philosophical theories of intersubjectivity can serve as productive design heuristics. Operationalizing recognition theory through specific prompt language and architectural features produces concrete, measurable improvements that replicate across models, domains, and independent judges. Recognition is better understood as an achievable relational stance than a requirement for machine consciousness. The distinction between recognition proper (requiring genuine consciousness) and recognition-oriented design (using recognition as a functional heuristic) allows practitioners to benefit from the framework without making metaphysical claims about AI sentience.
+In summary, we have connected Hegelian recognition theory to AI pedagogy, implemented it through a Freudian multiagent architecture, and tested it across thirty-seven evaluations. The central finding---that recognition-enhanced prompting is the dominant driver of tutoring improvement---was established through memory isolation, confirmed in a full factorial, validated by an independent judge, and deepened through impasse resolution coding, learner-side evaluation, and mechanism robustness testing with dynamic interlocutors. The theoretical framework, empirical methodology, and practical design hierarchy together demonstrate that the gap between continental philosophy and AI engineering is narrower than either tradition might suppose.
+## References
+::: {#refs}
+:::