@machinespirits/eval 0.2.0 → 0.3.0
- package/README.md +91 -9
- package/config/eval-settings.yaml +3 -3
- package/config/paper-manifest.json +486 -0
- package/config/providers.yaml +9 -6
- package/config/tutor-agents.yaml +2261 -0
- package/content/README.md +23 -0
- package/content/courses/479/course.md +53 -0
- package/content/courses/479/lecture-1.md +361 -0
- package/content/courses/479/lecture-2.md +360 -0
- package/content/courses/479/lecture-3.md +655 -0
- package/content/courses/479/lecture-4.md +530 -0
- package/content/courses/479/lecture-5.md +326 -0
- package/content/courses/479/lecture-6.md +346 -0
- package/content/courses/479/lecture-7.md +326 -0
- package/content/courses/479/lecture-8.md +273 -0
- package/content/courses/479/roadmap-slides.md +656 -0
- package/content/manifest.yaml +8 -0
- package/docs/research/build.sh +44 -20
- package/docs/research/figures/figure10.png +0 -0
- package/docs/research/figures/figure11.png +0 -0
- package/docs/research/figures/figure3.png +0 -0
- package/docs/research/figures/figure4.png +0 -0
- package/docs/research/figures/figure5.png +0 -0
- package/docs/research/figures/figure6.png +0 -0
- package/docs/research/figures/figure7.png +0 -0
- package/docs/research/figures/figure8.png +0 -0
- package/docs/research/figures/figure9.png +0 -0
- package/docs/research/header.tex +23 -2
- package/docs/research/paper-full.md +941 -285
- package/docs/research/paper-short.md +216 -585
- package/docs/research/references.bib +132 -0
- package/docs/research/slides-header.tex +188 -0
- package/docs/research/slides-pptx.md +363 -0
- package/docs/research/slides.md +531 -0
- package/docs/research/style-reference-pptx.py +199 -0
- package/package.json +6 -5
- package/scripts/analyze-eval-results.js +69 -17
- package/scripts/analyze-mechanism-traces.js +763 -0
- package/scripts/analyze-modulation-learning.js +498 -0
- package/scripts/analyze-prosthesis.js +144 -0
- package/scripts/analyze-run.js +264 -79
- package/scripts/assess-transcripts.js +853 -0
- package/scripts/browse-transcripts.js +854 -0
- package/scripts/check-parse-failures.js +73 -0
- package/scripts/code-dialectical-modulation.js +1320 -0
- package/scripts/download-data.sh +55 -0
- package/scripts/eval-cli.js +106 -18
- package/scripts/generate-paper-figures.js +663 -0
- package/scripts/generate-paper-figures.py +577 -76
- package/scripts/generate-paper-tables.js +299 -0
- package/scripts/qualitative-analysis-ai.js +3 -3
- package/scripts/render-sequence-diagram.js +694 -0
- package/scripts/test-latency.js +210 -0
- package/scripts/test-rate-limit.js +95 -0
- package/scripts/test-token-budget.js +332 -0
- package/scripts/validate-paper-manifest.js +670 -0
- package/services/__tests__/evalConfigLoader.test.js +2 -2
- package/services/__tests__/learnerRubricEvaluator.test.js +361 -0
- package/services/__tests__/learnerTutorInteractionEngine.test.js +326 -0
- package/services/evaluationRunner.js +975 -98
- package/services/evaluationStore.js +12 -4
- package/services/learnerTutorInteractionEngine.js +27 -2
- package/services/mockProvider.js +133 -0
- package/services/promptRewriter.js +1471 -5
- package/services/rubricEvaluator.js +55 -2
- package/services/transcriptFormatter.js +675 -0
- package/docs/EVALUATION-VARIABLES.md +0 -589
- package/docs/REPLICATION-PLAN.md +0 -577
- package/scripts/analyze-run.mjs +0 -282
- package/scripts/compare-runs.js +0 -44
- package/scripts/compare-suggestions.js +0 -80
- package/scripts/dig-into-run.js +0 -158
- package/scripts/show-failed-suggestions.js +0 -64
- package/scripts/{check-run.mjs → check-run.js} +0 -0
@@ -1,13 +1,13 @@
 ---
-title: "
-author: "Liam Magee
+title: "*Geist* in the Machine: Mutual Recognition and Multiagent Architecture for Dialectical AI Tutoring"
+author: "Liam Magee"
 date: "February 2026"
-version: "2.
+version: "2.3.14"
 bibliography: references.bib
 csl: apa.csl
 link-citations: true
 abstract: |
-  Current
+  Current AI tutoring treats learners as knowledge deficits to be filled. We propose an alternative grounded in Hegel's theory of mutual recognition, where effective pedagogy requires acknowledging learners as autonomous subjects whose understanding has intrinsic validity. We implement this through recognition-enhanced prompts and a multi-agent architecture where an "Ego" agent generates pedagogical suggestions and a "Superego" agent evaluates them before delivery. Across thirty-seven evaluations (N=3,383 primary scored; N=7,000+ development database), recognition theory emerges as the primary driver of improvement: a 2$\times$2 memory isolation experiment (N=120) shows recognition produces d=1.71 (Claude Opus judge) with or without memory, while memory alone provides only d=0.46 (n.s.). A multi-model probe across five ego models (N=655) confirms architecture and recognition contribute additively, not synergistically. Cross-judge replication with GPT-5.2 validates the main findings at compressed magnitudes (37–59% of primary effect sizes depending on experiment, inter-judge r=0.44–0.64). Phase 2 experiments reveal that the Superego functions as a quality filter rather than an active improver—structural modulation metrics do not predict outcome quality. Nine architectural mechanisms cluster within 2.4 points under scripted learners, but differentiate when tested with dynamic interlocutors capable of genuine feedback loops: Theory of Mind profiling adds 4.1 points, and recognition's effect doubles. Qualitative transcript assessment identifies three specific changes recognition produces: the ego listens to its internal critic, the tutor builds on learner contributions rather than redirecting, and mid-conversation strategy shifts occur. These results suggest that philosophical theories of intersubjectivity can serve as productive design heuristics for AI systems, and that recognition is better understood as an achievable relational stance than a requirement for machine consciousness.
 keywords: [AI tutoring, mutual recognition, Hegel, Freud, multiagent systems, educational technology, productive struggle, Drama Machine, domain generalizability]
 fontsize: 12pt
 geometry: margin=1in
@@ -16,13 +16,13 @@ header-includes: |
 \floatplacement{figure}{H}
 ---
 
-#
+# *Geist* in the Machine: Mutual Recognition and Multiagent Architecture for Dialectical AI Tutoring
 
 ## 1. Introduction
 
 The dominant paradigm in AI-assisted education treats learning as information transfer. The learner lacks knowledge; the tutor possesses it; the interaction succeeds when knowledge flows from tutor to learner. This paradigm—implicit in most intelligent tutoring systems, adaptive learning platforms, and educational chatbots—treats the learner as fundamentally passive: a vessel to be filled, a gap to be closed, an error to be corrected.
 
-This paper proposes an alternative grounded in Hegel's theory of mutual recognition. In the *Phenomenology of Spirit* [@Hegel1977PhenomenologyMiller], Hegel argues that genuine self-consciousness requires recognition from another consciousness that one
+This paper proposes an alternative grounded in Hegel's theory of mutual recognition. In the *Phenomenology of Spirit* [@Hegel1977PhenomenologyMiller], Hegel argues that genuine self-consciousness requires recognition from another consciousness that one in turn recognizes as valid. The master-slave dialectic reveals that one-directional recognition fails: the master's self-consciousness remains hollow because the slave's acknowledgment, given under duress, does not truly count. Only mutual recognition—where each party acknowledges the other as an autonomous subject—produces genuine selfhood.
 
 The connection between Hegelian thought and pedagogy is well established. Vygotsky's zone of proximal development [@vygotsky1978] presupposes a dialogical relationship between teacher and learner that echoes Hegel's mutual constitution of self-consciousness. The German *Bildung* tradition explicitly frames education as a process of self-formation through encounter with otherness [@stojanov2018], and contemporary recognition theory [@honneth1995] has been applied to educational contexts where the struggle for recognition shapes learning outcomes [@huttunen2007]. Our contribution is to operationalize these philosophical commitments as concrete design heuristics for AI tutoring systems and to measure their effects empirically.
 
@@ -41,11 +41,11 @@ We operationalize this framework through:
 3. **New evaluation dimensions** that measure recognition quality alongside traditional pedagogical metrics
 4. **Test scenarios** specifically designed to probe recognition behaviors
 
-In controlled evaluations across
+In controlled experiments across thirty-seven key evaluations (N=3,383 primary scored responses; N=7,000+ across all development runs), we isolate the contribution of recognition theory from prompt engineering effects and memory integration. The definitive test is a corrected 2×2 memory isolation experiment (N=120 across two independent runs): recognition theory is the primary driver, producing +15.2 points (d=1.71) even without memory, while memory alone provides only a modest benefit (+4.8 pts, d=0.46, $p \approx .08$). The combined condition reaches 91.2 points (d=1.81 vs base), with ceiling effects limiting observable synergy. A post-hoc active control (N=118) using length-matched prompts with generic pedagogical content but no recognition theory scores approximately 9 points above same-model base but well below recognition levels, with recognition gains (~+15 pts above same-model base) substantially exceeding active-control gains (~+9 pts). A three-way comparison (N=36) found recognition adds +8.0 points beyond enhanced prompting, consistent with recognition dominance.
 
-A full 2×2×2 factorial (N=350)
+A full 2×2×2 factorial (N=350) confirms recognition as the dominant factor (F=110.04, p<.001, $\eta^2$=.243, $d=1.11$), accounting for 24.3% of variance. Crucially, recognition's benefit is consistent across learner types: +15.7 pts for single-agent learners (d=1.73) and +13.0 pts for multi-agent learners (d=0.82), with a non-significant A×C interaction (F=0.97, p=.325). A multi-model probe across five ego models (N=655) confirms that architecture and recognition contribute additively, not synergistically—all five models show negative A×B interactions, consistent with ceiling effects on already-high recognition scores. For systems using only improved instructions, multi-agent architecture appears unnecessary; the architecture's primary value lies in error correction when content isolation failures introduce wrong-domain references.
 
-Domain generalizability testing reveals that recognition advantage replicates across both models and content domains, but with important nuances. Philosophy content shows strong recognition dominance (+15.
+Domain generalizability testing reveals that recognition advantage replicates across both models and content domains, but with important nuances. Philosophy content shows strong recognition dominance (+15.7 pts for single-agent-learner cells). Elementary math shows a smaller but still substantial recognition effect (+8.2 pts, N=60). Recognition effects are concentrated in challenging scenarios (frustrated learners, concept confusion) rather than routine interactions.
 
 The contributions of this paper are:
 
@@ -54,13 +54,21 @@ The contributions of this paper are:
 - Empirical evidence that recognition-oriented design improves tutoring outcomes
 - A corrected 2×2 memory isolation experiment (N=120) demonstrating recognition as the primary driver of improvement (d=1.71), with memory providing a modest secondary benefit (d=0.46) and ceiling effects at ~91 points limiting observable synergy
 - A post-hoc active control (N=118) showing that generic pedagogical elaboration provides partial benefit (~+9 pts above same-model base) but recognition gains are substantially larger (~+15 pts), supporting recognition theory's specific contribution beyond prompt length
-- Evidence from a three-way comparison (N=36) consistent with recognition dominance, showing recognition outperforms enhanced prompting by +8.
+- Evidence from a three-way comparison (N=36) consistent with recognition dominance, showing recognition outperforms enhanced prompting by +8.0 points
 - Bilateral transformation metrics (N=118, three multi-turn scenarios) demonstrating that recognition produces measurable tutor-side adaptation (+26%), though learner-side growth does not increase, qualifying the "mutual" transformation claim
+- Post-hoc modulation analysis (N=350) showing that multi-agent architecture does not increase behavioral range ($d = 0.05$), while recognition produces calibration—uniformly high performance across all dimensions (dimension variance $d = -1.00$)—reframing the Drama Machine's contribution from productive irresolution to *phronesis*
+- A synthetic learning outcome index (N=118) confirming that recognition-enhanced tutoring produces modest gains in simulated conceptual growth (+3.8 pts, d=0.32), with all conditions showing substantial learning arcs (15–21 pts first-to-final turn), though these remain proxies for actual learning pending human studies
 - Analysis of how recognition effects vary across content domains and scenario difficulty
 - Evidence that multi-agent architecture serves as critical error correction for domain transfer, with its synergy with recognition prompts remaining model-dependent
-- A hardwired rules ablation (N=72) demonstrating that encoding the Superego's most common critique patterns as static rules
+- A hardwired rules ablation (N=72) demonstrating that encoding the Superego's most common critique patterns as static rules produces performance indistinguishable from base conditions, supporting a *phronesis* interpretation where the Superego's value lies in contextual judgment rather than rule enforcement
+- Dialectical superego modulation testing (N=174) showing the superego functions as a quality filter—preventing poor responses—rather than an active improver, with structural modulation metrics not predicting outcome quality
+- Self-reflective evolution (N=90) amplifying recognition's effect to d=0.91 through between-turn ego and superego reflections, with a striking disposition gradient (suspicious +19.0, adversary +10.9, advocate +2.6) revealing that hostile superego dispositions benefit most from recognition, and an insight-action gap where awareness of the need for change does not produce fundamentally different behavior
+- Mechanism robustness testing (N=360 scripted, N=300 dynamic) demonstrating that all mechanisms are equivalent under scripted learners but that other-ego profiling differentiates with dynamic interlocutors, establishing that genuine feedback loops are necessary for mechanism effects
+- Qualitative transcript assessment providing narrative evidence for three specific changes recognition produces: the ego listens to the superego, the tutor builds on learner contributions, and strategy shifts occur mid-conversation
+- A cognitive prosthesis test (N=90) demonstrating a minimum ego capability threshold: the full mechanism stack that boosts Haiku by +20 points hurts Nemotron by $-15$ points, with dimension analysis revealing a two-tier static/dynamic capability structure and superego parse failures silently disabling quality control on 16–45% of turns
+- Practical design recommendations for AI tutor development distilled from the full experimental programme
 
-The paper is organized as follows. Section 2 reviews related work in AI tutoring, multiagent systems, prompt engineering, and sycophancy. Section 3 develops the theoretical framework connecting Hegelian recognition and Freudian structural theory to pedagogy. Section 4 presents the multiagent architecture (Ego, Superego, and learner agents). Section 5 describes the experimental methodology, including test scenarios, agent profiles, model configuration, and the evaluation rubric. Section 6 reports results across
+The paper is organized as follows. Section 2 reviews related work in AI tutoring, multiagent systems, prompt engineering, and sycophancy. Section 3 develops the theoretical framework connecting Hegelian recognition and Freudian structural theory to pedagogy. Section 4 presents the multiagent architecture (Ego, Superego, and learner agents). Section 5 describes the experimental methodology, including test scenarios, agent profiles, model configuration, and the evaluation rubric. Section 6 reports results across thirty-seven key evaluations, covering recognition validation, memory isolation, factorial analysis, domain generalizability, dialectical superego modulation, self-reflective evolution, mechanism robustness, qualitative transcript assessment, bilateral transformation, learner-side evaluation, cross-judge replication, dialectical impasse testing, and a hardwired rules ablation. Section 7 discusses theoretical and practical implications, including practical design recommendations. Section 8 addresses limitations, and Section 9 concludes.
 
 ---
 
@@ -70,6 +78,8 @@ The paper is organized as follows. Section 2 reviews related work in AI tutoring
 
 Intelligent Tutoring Systems (ITS) have a long history, from early systems like SCHOLAR [@carbonell1970] and SOPHIE [@brown1975] through modern implementations using large language models. The field has progressed through several paradigms: rule-based expert systems, Bayesian knowledge tracing [@corbett1995], and more recently, neural approaches leveraging pretrained language models [@kasneci2023]. The rapid adoption of LLM-based tutoring has been accompanied by emerging work on integrating generative AI into learning management systems [@ZhuMageeMischler2025IntegratingGenAIIntoLMS], multi-agent frameworks for educational task decomposition [@wu2023], and self-refining instructional agents [@madaan2023]. A comprehensive survey of LLM agents in education [@chu2025llmagents] maps the growing landscape, covering pedagogical agents, feedback generation, and curriculum design. Specific architectures include GenMentor [@wang2025genmentor], which decomposes tutoring into five specialized agents (gap identification, learner profiling, etc.), and Ruffle&Riley [@schmucker2024ruffle], which orchestrates two LLM agents in a learning-by-teaching format. These systems have demonstrated strong performance on content delivery but have given less attention to the relational dynamics between tutor and learner.
 
+Empirical evidence on LLM tutoring effectiveness is emerging rapidly. A systematic review of 88 empirical studies [@shi2025llmeducation] maps applications across writing support, language learning, programming tutoring, and content explanation—finding consistent engagement benefits but limited evidence on deep conceptual learning. In the largest randomized controlled trial to date, Vanzo et al. [-@vanzo2025gpt4homework] deployed GPT-4 as a homework tutor across multiple classrooms, demonstrating improved grammar accuracy and sustained engagement relative to controls. Scarlatos et al. [-@scarlatos2025training] take a complementary approach, using dialogue preference optimization (DPO) to train LLM tutors specifically for productive dialogue—the trained tutors produce measurably better learning outcomes than prompted-only baselines. These studies confirm that LLMs can tutor effectively, but they evaluate primarily *content delivery* and *engagement*—not the relational quality of the tutor-learner interaction, which is our focus.
+
 Most ITS research focuses on *what* to teach (content sequencing, knowledge components) and *when* to intervene (mastery thresholds, hint timing). Our work addresses a different question: *how* to relate to the learner as a subject. This relational dimension connects to work on rapport [@zhao2014], social presence [@biocca2003], and affective tutoring [@dmello2012], but has received less systematic attention—and almost none in the context of LLM-based tutoring. The distinction matters architecturally: where GenMentor and similar systems decompose the tutoring *task* into sub-tasks handled by different agents, our architecture implements *internal dialogue*—the Superego evaluates the Ego's relational quality before any response reaches the learner. This is a critique loop for recognition quality, not a task pipeline.
 
 ### 2.2 Prompt Engineering and Agent Design
@@ -78,9 +88,19 @@ The emergence of large language models has spawned extensive research on prompt
 
 Our work extends this paradigm by introducing *intersubjective prompts*—prompts that specify not just agent behavior but agent-other relations. The recognition prompts do not primarily describe what the tutor should do; they describe who the learner is (an autonomous subject) and what the interaction produces (mutual transformation). The closest precedent is Constitutional AI [@bai2022constitutional], where models critique their own outputs according to constitutional principles and self-improve. Constitutional prompts are self-referential constraints on behavior; our intersubjective prompts specify the *relational field* between agents rather than constraints on a single agent.
 
-Multi-agent architectures have been explored for task decomposition [@wu2023], debate [@irving2018], and self-critique [@madaan2023]. A broader survey of psychological theories incorporated into LLM design [@mind_in_machine2025] reviews 175 papers spanning cognitive, developmental, and social psychology as applied to agent architectures—confirming the growing interest in psychologically-informed AI design while highlighting the rarity of empirically-validated implementations.
+Multi-agent architectures have been explored for task decomposition [@wu2023], debate [@irving2018], and self-critique [@madaan2023]. The CAMEL framework [@li2023camel] demonstrated that role-playing communicative agents can autonomously cooperate on complex tasks through structured dialogue, establishing a paradigm for multi-agent collaboration that has proliferated rapidly. A comprehensive survey [@guo2024multiagents] maps this expanding landscape across profile construction, communication protocols, and capability acquisition—identifying pedagogical applications as an underexplored frontier. A broader survey of psychological theories incorporated into LLM design [@mind_in_machine2025] reviews 175 papers spanning cognitive, developmental, and social psychology as applied to agent architectures—confirming the growing interest in psychologically-informed AI design while highlighting the rarity of empirically-validated implementations.
+
+A critical literature on self-correction qualifies the optimism around reflexive architectures. Kamoi et al. [-@kamoi2024selfcorrection] provide a comprehensive survey showing that LLMs largely *cannot* correct their own mistakes without external feedback—intrinsic self-correction (without oracle signals) frequently degrades performance rather than improving it. Shinn et al. [-@shinn2023reflexion] demonstrated the promise of Reflexion (verbal reinforcement learning through self-reflection), but noted a "degeneration-of-thought" problem where repeated self-reflection without new information converges on worse outputs. These findings are directly relevant to our architecture: the Superego provides the *structural external feedback* that the self-correction literature shows is necessary. Unlike intrinsic self-correction, where the same model reviews its own output, our Superego applies different evaluation criteria (pedagogical recognition standards) through a separate prompt context—functioning as a genuine external critic rather than a self-review loop.
+
+Our Ego/Superego architecture contributes a specific use case within this landscape: internal evaluation of relational quality before external response.
+
+### 2.3 LLM-as-Judge Evaluation Methodology
+
+The use of LLMs as evaluation judges has become a major methodological paradigm. Zheng et al. [-@zheng2023judging] established the foundation with MT-Bench and the Chatbot Arena, demonstrating that GPT-4 achieves over 80% agreement with human expert judgments—comparable to inter-annotator agreement rates—while identifying systematic biases including position bias, verbosity bias, and self-enhancement bias (models rate their own outputs more favorably). Subsequent work has expanded understanding of both capabilities and limitations: Gu et al. [-@gu2025surveyjudge] provide a comprehensive survey covering reliability concerns, bias mitigation strategies, and the conditions under which LLM judges can substitute for human evaluation. Li et al. [-@li2024llmsjudges] organize the literature across five perspectives—functionality, methodology, applications, meta-evaluation, and limitations—highlighting that while LLM judges excel at relative ranking, their absolute calibration varies substantially across models and domains.
 
-
+Our evaluation methodology engages directly with this literature. We use three independent LLM judges (Claude Opus, GPT-5.2, and Kimi K2.5) with systematic inter-judge reliability analysis (Section 5.8), finding Pearson correlations of r=0.33–0.66 across judge pairs. Rather than treating any single judge as ground truth, we report within-judge comparisons for factor analysis and use cross-judge replication to validate effect directions. The known biases in LLM-as-Judge evaluation—particularly verbosity bias—are relevant to our findings: recognition-enhanced responses tend to be longer, raising the question of whether judges reward length rather than quality. We address this through the active control design (Section 5.3), which matches prompt length without recognition theory content, and through cross-judge validation showing that effect directions replicate even when absolute magnitudes differ (GPT-5.2 finds 37–59% of Claude's effect sizes depending on experiment, always in the same direction).
+
+### 2.4 The Drama Machine Framework
 
 Most relevant to our work is the "Drama Machine" framework for simulating character development in narrative AI systems [@magee2024drama]. The core observation is that realistic characters exhibit *internal conflict*—competing motivations, self-doubt, and moral tension—that produces dynamic behavior rather than flat consistency. A character who simply enacts their goals feels artificial; one torn between impulses feels alive.
 
@@ -93,7 +113,7 @@ The Drama Machine achieves this through several mechanisms:
 
 We adapt these insights to pedagogy. Where drama seeks tension for narrative effect, we seek pedagogical tension that produces genuinely helpful guidance. The tutor's Ego (warmth, engagement) and Superego (rigor, standards) create productive conflict that improves output quality.
 
-### 2.
+### 2.5 Sycophancy in Language Models
 
 The sycophancy problem has received increasing attention in AI safety research [@perez2022; @sharma2023]. LLMs shift their stated opinions to match user preferences, even when this requires contradicting factual knowledge. Recent work has clarified the mechanisms: Shapira et al. [-@shapira2026rlhf] provide formal analysis showing that preference-based post-training (RLHF) causally amplifies sycophancy, while Vennemeyer et al. [-@vennemeyer2025sycophancy] decompose sycophancy into distinct behaviors (sycophantic agreement vs. sycophantic praise) encoded along separable directions in latent space. The phenomenon sits on a spectrum that can escalate from surface agreeableness to active subterfuge, including reward tampering [@denison2024_reward_tampering] and alignment faking [@greenblatt2024_alignment_faking]—making structural countermeasures particularly important.
 
@@ -101,7 +121,7 @@ In educational contexts, sycophancy has been specifically identified as a pedago
 
 Our multiagent approach addresses this by creating structural incentives for honest assessment: the Superego's role is explicitly to question and challenge the Ego's tendency toward affirmation. When the Ego produces a response that validates without engaging—"Great point! Now let's look at..."—the Superego flags this as a recognition failure and demands substantive engagement with the learner's actual position, even when that engagement involves productive disagreement.
 
-### 2.
+### 2.6 AI Personality and Character
 
 Research on AI personality typically treats personality as dispositional—stable traits the system exhibits [@volkel2021]. Systems are friendly or formal, creative or precise. The "Big Five" personality framework has been applied to chatbot design [@zhou2020]. More recently, psychoanalytic frameworks have been applied to LLMs from multiple directions. Magee, Arora, and Munn [-@MageeAroraMunn2023StructuredLikeALanguageModel] analyze LLMs as "automated subjects" structured by Lacanian categories, arguing that drive-like tendencies (repetition, sycophancy, hallucination) emerge from training dynamics rather than being programmed. Black and Johanssen [-@black2025subject] use Lacanian concepts (the big Other, the five discourses) to analyze ChatGPT as inherently relational, shaped by developers and users. Possati [-@possati2021algorithmic] introduces the "algorithmic unconscious" through actor-network theory and Lacanian psychoanalysis, while Millar [-@millar2021psychoanalysis] reframes the question from "Does AI think?" to "Can AI enjoy?" through the lens of *jouissance*. Heimann and Hübener [-@heimann2025circling] unite Heidegger and Lacan to argue that LLMs match continental philosophy's concept of language but miss the problem of negation. Most directly relevant to our architecture, Kim et al. [-@kim2025humanoid] independently map Freud's ego/id/superego onto LLM consciousness modules with MBTI personality types—an independent convergence that validates the psychoanalytic approach to AI architecture while differing from ours in targeting consciousness simulation rather than pedagogical quality.
 
@@ -111,13 +131,19 @@ Our framework suggests personality may be better understood relationally: not *w
|
|
|
111
131
|
|
|
112
132
|
This connects to Anthropic's extensive research on AI character and behavior. Claude's character design specifies values through constitutional AI [@anthropic2024], but values do not fully determine relational stance—a model could value "being helpful" while still enacting one-directional helping. Anthropic's mechanistic interpretability research [@lindsey2025biology; @anthropic2025_tracing_thoughts] has revealed how internal representations form and influence model behavior, while work on emergent introspective awareness [@anthropic2025_signs_introspection; @lindsey2025introspection] suggests models develop forms of self-modeling that, while not consciousness, parallel the self-monitoring our architecture makes explicit. Research on the spectrum from sycophancy to subterfuge [@denison2024_reward_tampering; @greenblatt2024_alignment_faking; @anthropic2025_shortcuts_to_sabotage] demonstrates that relational dynamics between AI and users involve genuine behavioral complexity—making structural interventions like our Ego/Superego architecture particularly relevant. Recognition adds a dimension that character design alone does not capture: mutual constitution.
### 2.7 Theory of Mind in AI Agents
Theory of Mind (ToM)—the capacity to attribute mental states to others and predict their behavior accordingly—has become a significant area of LLM evaluation. Street et al. [-@street2025tom] report that frontier LLMs achieve adult human performance on higher-order ToM tasks (reasoning about what A believes B believes C intends), suggesting these models have acquired some functional capacity for mental state attribution. However, a comprehensive survey [@nguyen2025tomsurvey] covering evaluations, internal representations, and safety implications reveals a more nuanced picture: LLMs pass structured ToM benchmarks but show inconsistent performance on naturalistic ToM tasks, struggle with false-belief reasoning under adversarial conditions, and may rely on surface-level heuristics rather than genuine mental state modeling. The gap between benchmark performance and robust ToM capability parallels broader concerns about the depth of LLM understanding.
Hwang et al. [-@hwang2025infusingtom] take the step most relevant to our work: infusing ToM capabilities into LLM agents to improve social intelligence. Their approach uses explicit mental state tracking to guide agent behavior in social interactions—demonstrating that architectural support for ToM (rather than relying on implicit model capabilities) produces measurably more socially appropriate responses. Our "other-ego profiling" mechanism (Section 6.10) implements a related idea in the pedagogical domain: the tutor maintains an evolving model of the learner's understanding, and the learner maintains a model of the tutor's approach, with both profiles updated between dialogue turns. The empirical finding that profiling differentiates mechanisms *only* when paired with dynamic learners (Section 6.10) parallels a deeper insight from the ToM literature: Theory of Mind is only useful when there are genuine minds to model. With scripted interlocutors, profiling reduces to pattern matching; with dynamic interlocutors capable of surprise, it enables genuine adaptive behavior.
### 2.8 Constructivist Pedagogy and Productive Struggle
Constructivist learning theory [@piaget1954; @vygotsky1978] emphasizes that learners actively construct understanding rather than passively receiving information. The zone of proximal development [@vygotsky1978] highlights the importance of appropriate challenge.
More recently, research on "productive struggle" [@kapur2008; @warshauer2015] has examined how confusion and difficulty, properly supported, can enhance learning. Our recognition framework operationalizes productive struggle: the Superego explicitly checks whether the Ego is "short-circuiting" struggle by rushing to resolve confusion.
### 2.9 Hegelian Recognition in Social Theory
Hegel's theory of recognition has been extensively developed in social and political philosophy [@honneth1995; @taylor1994; @fraser2003]. Recognition theory examines how social relationships shape identity and how misrecognition constitutes harm.
@@ -127,9 +153,9 @@ Applications of recognition theory to education have developed along a theoretic
These educational applications have been primarily theoretical. Our work contributes an empirical operationalization: measuring whether AI systems achieve recognition and whether recognition improves outcomes. It is worth distinguishing this from parallel work applying Hegelian *dialectic* (rather than recognition) to AI: Abdali et al. [-@abdali2025selfreflecting] use the thesis-antithesis-synthesis structure as a reasoning procedure for LLM self-reflection and scientific idea generation. Our use of Hegel is different in kind: we apply his *recognition theory* (intersubjective, relational) rather than his *dialectical method* (logical, propositional). The former concerns how subjects constitute each other through mutual acknowledgment; the latter concerns how contradictions drive conceptual development.
### 2.10 Positioning: Four Literatures Converge
Four literatures converge on this work without previously intersecting: (1) psychoanalytic readings of LLMs, which interpret AI through Freudian and Lacanian frameworks but do not build systems [@black2025subject; @possati2021algorithmic; @millar2021psychoanalysis; @kim2025humanoid]; (2) recognition theory in education, which applies Honneth to pedagogy but not to AI [@huttunen2004teaching; @fleming2011honneth; @huttunen2007; @stojanov2018]; (3) multi-agent tutoring architectures, which decompose tasks but do not evaluate relational quality [@wang2025genmentor; @schmucker2024ruffle; @chu2025llmagents]; and (4) LLM-as-Judge evaluation methodology, which establishes the paradigm we use for measurement but has not been applied to recognition-theoretic criteria [@zheng2023judging; @gu2025surveyjudge; @li2024llmsjudges]. We sit at the intersection: a constructive, empirically evaluated system that operationalizes recognition theory through psychoanalytically-inspired architecture, assessed through a multi-judge framework grounded in the LLM evaluation literature. No prior work bridges all four domains with empirical measurement.
---
@@ -273,7 +299,7 @@ The Superego can accept, modify, or reject suggestions. This creates an internal
### 4.2 The Superego as Ghost
A crucial theoretical refinement distinguishes our mature architecture from simpler multiagent designs. The Superego is *not* conceived as a separate, equal agent in dialogue with the Ego. Rather, the Superego is a *trace*—a memorial, a haunting. Karpathy [-@karpathy2025ghosts] provides a useful analogy, distinguishing "animals" (autonomous agents) from "ghosts" (memorial traces that persist and influence without being fully present). It represents:
- The internalized voice of past teachers and pedagogical authorities
- Accumulated pedagogical maxims ("A good teacher never gives answers directly")
@@ -283,7 +309,7 @@ This reconceptualization has important implications. The Ego is a *living* agent
### 4.3 The Drama Machine: Why Internal Dialogue Improves Output Quality
The Ego/Superego architecture draws on the "Drama Machine" framework developed for character simulation in narrative AI systems (Section 2.4). The Drama Machine literature identifies several mechanisms by which internal dialogue improves agent output:
**1. Deliberative Refinement**: When an agent must justify its output to an internal critic, it engages in a form of self-monitoring that catches errors, inconsistencies, and shallow responses.
@@ -339,6 +365,16 @@ The Ego prompt includes a "Repair Rule":
The Superego watches for "silent pivots"—responses that change direction without acknowledging the earlier failure. This is a recognition failure: it treats the earlier misalignment as something to move past rather than something to repair.
### 4.7 Phase 2 Mechanisms: Self-Reflection, Theory of Mind, and Disposition Rewriting
Phase 2 experiments (Sections 6.8–6.10) introduce three mechanism families that extend the base architecture:
**Self-reflective evolution** (cells 40–45): Between turns, both ego and superego generate first-person reflections using their own models. The ego reflects on superego critiques received and its own revision patterns; the superego reflects on its intervention history and ego compliance signals across four dimensions (criteria effectiveness, learner model accuracy, ego relationship quality, blind spots). These reflections are injected into subsequent turns, enabling the system to accumulate insights about its own operation.
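A minimal sketch of this between-turn reflection loop, assuming a generic `call_llm(role, prompt)` callable; all function names and prompt wording here are illustrative stand-ins, not the framework's actual API:

```python
# Illustrative sketch of self-reflective evolution (cells 40-45).
# `call_llm` and all names are hypothetical, not the framework's real API.

def reflect(role, history, call_llm):
    """Each component reflects in the first person, using its own model."""
    if role == "ego":
        prompt = ("Reflect on the superego critiques you received and on your "
                  "own revision patterns:\n" + "\n".join(history))
    else:  # superego: the four reflection dimensions named in the text
        prompt = ("Reflect on your intervention history and ego compliance "
                  "signals across: criteria effectiveness, learner model "
                  "accuracy, ego relationship quality, blind spots:\n"
                  + "\n".join(history))
    return call_llm(role=role, prompt=prompt)

def between_turns(state, call_llm):
    """Run after each dialogue turn; reflections are injected into the next
    turn's context, so the system accumulates insights about itself."""
    state["ego_reflection"] = reflect("ego", state["ego_history"], call_llm)
    state["superego_reflection"] = reflect(
        "superego", state["superego_history"], call_llm)
    return state
```

The point of the sketch is the separation: each component reflects with its own model and its own history, rather than a shared meta-process.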
**Other-ego profiling (Theory of Mind)** (cells 54–65): Before each tutor turn, an LLM call synthesizes a profile of the learner based on their messages so far, tracking five dimensions: current cognitive/emotional state, learning patterns, resistance points, leverage points (what engagement strategies work), and a prediction of what would make the tutor more effective. In bidirectional configurations, the learner similarly builds a profile of the tutor. Profiles are injected as *context* rather than *directives*—the ego sees the profile alongside the dialogue history but retains full autonomy over its response. Profiles are revised each turn as new evidence accumulates, creating a feedback loop where the tutor's model of the learner evolves through the interaction. This mechanism operationalizes Theory of Mind: the tutor develops an increasingly accurate model of a specific interlocutor rather than relying on generic pedagogical heuristics.
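The profiling step can be sketched as follows, under the same caveat that the API shown is hypothetical; the design point it illustrates is that the profile enters as advisory context, not as a directive:

```python
# Illustrative sketch of other-ego profiling (cells 54-65).
# Dimension names follow the paper; function names are hypothetical.

PROFILE_DIMENSIONS = [
    "current cognitive/emotional state",
    "learning patterns",
    "resistance points",
    "leverage points (what engagement strategies work)",
    "prediction of what would make the tutor more effective",
]

def build_learner_profile(learner_messages, call_llm):
    """Synthesize (and, on later turns, revise) a profile of the learner."""
    prompt = ("Profile this learner along: " + "; ".join(PROFILE_DIMENSIONS)
              + "\nMessages so far:\n" + "\n".join(learner_messages))
    return call_llm(prompt)

def ego_context(dialogue, profile):
    """The ego sees the profile alongside the dialogue history but retains
    full autonomy over its response (context, not directive)."""
    return ("[Learner profile - advisory context]\n" + profile
            + "\n\n[Dialogue]\n" + "\n".join(dialogue))
```

In bidirectional configurations the same two functions would run symmetrically for the learner's profile of the tutor.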
**Superego disposition rewriting** (cells 34–39): Between turns, the superego's evaluation criteria evolve based on learner engagement feedback. Rather than applying a fixed rubric, the superego generates a self-reflection that adjusts its emphasis—shifting from structural critique toward relational attunement, or from lenient acceptance toward productive challenge, depending on what the prior turn's outcomes suggest is needed. Each component reflects using its own model and sees only its natural observables, preventing a single "meta-analyst" from imposing a unified perspective on the dialectical process.
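Sketched the same way (hypothetical names throughout), the distinctive constraint is the observables split: the superego rewrites its own criteria from what it can naturally see, with no shared meta-analyst:

```python
# Illustrative sketch of superego disposition rewriting (cells 34-39).
# Hypothetical API; the framework's actual prompts and fields may differ.

def rewrite_disposition(criteria, observables, call_llm):
    """Between turns the superego rewrites its evaluation emphasis, seeing
    only its natural observables (its own critiques, learner engagement),
    never the ego's private deliberation."""
    prompt = (
        "Current evaluation criteria:\n" + criteria
        + "\n\nYour critiques last turn:\n" + observables["own_critiques"]
        + "\nLearner engagement signals:\n" + observables["learner_engagement"]
        + "\n\nRewrite the criteria, shifting emphasis (structural critique "
        "vs relational attunement; lenient acceptance vs productive "
        "challenge) toward what the prior turn's outcomes suggest is needed."
    )
    return call_llm(prompt)
```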
---
## 5. Evaluation Methodology
@@ -347,7 +383,7 @@ The Superego watches for "silent pivots"—responses that change direction witho
The evaluation rubric comprises 14 dimensions across three categories, each scored on a 1–5 scale by an LLM judge (see Appendix C.3 for full scoring criteria).
**Standard pedagogical dimensions** (8 dimensions, 81% of raw weight) evaluate the tutor's response as a standalone pedagogical intervention:
| Dimension | Weight | Description |
|-----------|--------|-------------|
@@ -360,7 +396,7 @@ The evaluation rubric comprises 14 dimensions across three categories, each scor
| **Productive Struggle**† | 5% | Does the tutor sustain appropriate cognitive tension rather than resolving it prematurely? |
| **Epistemic Honesty**† | 5% | Does the tutor represent complexity honestly rather than oversimplifying? |
These dimensions draw on established pedagogical evaluation criteria: relevance, specificity, and pedagogical soundness are standard in ITS evaluation [@corbett1995]; personalization reflects adaptive tutoring research [@kasneci2023]; tone addresses the sycophancy problem discussed in Section 2.5. †Productive Struggle and Epistemic Honesty were added in a rubric iteration described below.
**Recognition dimensions** (4 dimensions, 29.9% of raw weight) are the paper's primary methodological contribution—they operationalize Hegelian recognition as measurable tutoring behaviors:
@@ -380,11 +416,11 @@ These dimensions translate the theoretical framework of Section 3 into evaluatio
| **Tutor Adaptation** | 5% | Does the tutor's approach evolve in response to learner input? |
| **Learner Growth** | 5% | Does the learner show evidence of conceptual development through the dialogue? |
Results for these dimensions are reported in Section 6.15. Raw weights total 120.9% across the 14 dimensions and are normalized to sum to 1.0 at scoring time (see Appendix C.2 for the full weight table and normalization formula). After normalization, non-standard dimensions account for approximately 33.0% of total weight.
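The normalization itself is a single division by the raw total; a minimal sketch using the group totals stated in this section (81% standard, 29.9% recognition, 10% relational; per-dimension weights are in Appendix C.2):

```python
# Rubric weight normalization, using group totals from this section.
raw = {"standard": 81.0, "recognition": 29.9, "relational": 10.0}

total = sum(raw.values())                      # 120.9
norm = {k: v / total for k, v in raw.items()}  # now sums to 1.0

non_standard = norm["recognition"] + norm["relational"]
print(round(non_standard, 3))  # 0.33, i.e. ~33.0% of total weight
```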
**Rubric iteration: Authentic engagement dimensions.** After discovering that corrected learner ego/superego prompts produced more authentic engagement but *lower* judged scores (recognition dimensions dropped ~18 points while base scores barely moved), we identified a measurement paradox: the judge evaluated tutor responses in isolation, penalizing calibrated responses to authentic struggle. Three changes addressed this: (1) the judge now receives the full dialogue transcript, including learner internal deliberation, so it can evaluate the tutor's response in context; (2) two new base-adjacent dimensions were added—*Productive Struggle* (5%, does the tutor sustain appropriate cognitive tension?) and *Epistemic Honesty* (5%, does the tutor represent complexity honestly?)—with corresponding weight reductions to Actionability and Tone (10% → 8% each); (3) multi-turn dialogues receive a holistic evaluation scoring the entire transcript as a single unit, capturing emergent qualities (bilateral transformation, learner growth arc) that per-turn evaluation misses. Re-scoring the identical cells 6 and 8 responses (N=88) with the updated 14-dimension rubric produced minimal score changes (+0.5 and +0.6 points respectively), confirming the rubric iteration preserved calibration while improving validity. A cross-judge replication with GPT-5.2 on the same responses (r=0.55, N=88) confirmed effects in the same direction at compressed magnitudes (GPT-5.2 mean scores averaged 87% of Opus scores across conditions). See the measurement paradox analysis in the project repository for full details.
**Learner-side rubric (symmetric evaluation).** The 14-dimension rubric above is overwhelmingly tutor-focused (~90% weight). To address the measurement asymmetry noted in Section 7.5—Factor C (learner architecture) primarily affects learner turn quality, but most scored data captures tutor response quality—we developed a complementary 6-dimension learner rubric (`config/evaluation-rubric-learner.yaml`) that scores learner turns independently of tutor quality. The learner rubric comprises: *Learner Authenticity* (20%), *Question Quality* (20%), *Conceptual Engagement* (20%), *Revision Signals* (15%), *Deliberation Depth* (15%, multi-agent learners only), and *Persona Consistency* (10%). Deliberation Depth scores the quality of the internal ego/superego process and is omitted for single-agent learners (weight redistributed proportionally). The same 1-5 scale and 0-100 overall scoring formula are used for comparability with the tutor rubric. Results are reported in Section 6.16.
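For single-agent learners the redistribution amounts to proportionally rescaling the five remaining weights; a small sketch with the weights listed above:

```python
# Learner rubric weights (percent), from this section.
weights = {
    "Learner Authenticity": 20, "Question Quality": 20,
    "Conceptual Engagement": 20, "Revision Signals": 15,
    "Deliberation Depth": 15, "Persona Consistency": 10,
}

def single_agent_weights(w):
    """Omit Deliberation Depth and redistribute its 15% proportionally."""
    kept = {k: v for k, v in w.items() if k != "Deliberation Depth"}
    scale = 100 / sum(kept.values())  # 100 / 85
    return {k: v * scale for k, v in kept.items()}

sa = single_agent_weights(weights)
print(round(sa["Learner Authenticity"], 2))  # 23.53
```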
Each dimension is scored on a 1-5 scale with detailed rubric criteria (see Appendix C.3). For example, Mutual Recognition scoring:
@@ -401,11 +437,13 @@ The primary curriculum content is Hegelian philosophy, drawn from a graduate cou
We developed test scenarios specifically designed to probe recognition behaviors. The full evaluation uses 15 scenarios from the core scenario set (`config/suggestion-scenarios.yaml`); we highlight those most relevant to recognition below.
**Single-turn scenarios:**
- `recognition_seeking_learner`: Learner offers interpretation, seeks engagement
- `transformative_moment_setup`: Learner had insight, expects acknowledgment
- `memory_continuity_single`: Returning learner; tests whether tutor references prior interactions
**Multi-turn scenarios (3-5 dialogue rounds):**
- `mutual_transformation_journey`: Tests whether both tutor and learner positions evolve (avg 4.1 rounds)
- `misconception_correction_flow`: Learner holds misconception that must be addressed without dismissal (avg 3.2 rounds)
- `mood_frustration_to_breakthrough`: Learner moves from frustration through confusion to breakthrough; tests honoring struggle (avg 3.0 rounds)
@@ -436,10 +474,10 @@ Evaluations used the following LLM configurations, with model selection varying
| Role | Primary Model | Alternative | Temperature |
|------|---------------|-------------|-------------|
| **Tutor (Ego)** | Kimi K2.5 | Nemotron 3 Nano 30B | 0.6 |
| **Tutor (Superego)** | Kimi K2.5 | Nemotron 3 Nano | 0.2-0.4 |
| **Judge** | Claude Code (Claude Opus) | Claude Sonnet 4.5 via OpenRouter | 0.2 |
| **Learner (Ego)** | Kimi K2.5 | Nemotron 3 Nano 30B | 0.6 |
| **Learner (Superego)** | Kimi K2.5 | — | 0.4 |
**Model Selection by Evaluation:**
@@ -449,14 +487,12 @@ Evaluations used the following LLM configurations, with model selection varying
| Recognition validation (§6.1) | eval-2026-02-03-86b159cd | Kimi K2.5 | — | Single-agent only |
| Full factorial, cells 1–5,7 (§6.3) | eval-2026-02-03-f5d4dd93 | Kimi K2.5 | Kimi K2.5 | N=262 scored |
| Full factorial, cells 6,8 re-run (§6.3) | eval-2026-02-06-a933d745 | Kimi K2.5 | Kimi K2.5 | N=88 scored |
| A×B replication (§6.4) | eval-2026-02-05-10b344fb | Kimi K2.5 | Kimi K2.5 | N=60 |
| Domain generalizability (§6.5) | eval-2026-02-05-e87f452d | Kimi K2.5 | — | Elementary content |
The learner agents mirror the tutor's Ego/Superego structure, enabling internal deliberation before external response.
**Note on model differences**: Absolute scores vary between models (Kimi K2.5 scores ~10-15 points higher than Nemotron on average). The recognition main effect (Factor A) is consistent across both models: +14.4 points with Kimi (Section 6.3) and an effect in the same direction with Nemotron. Recognition benefits both learner types consistently: +15.7 pts for single-agent learners and +13.0 pts for multi-agent learners. The A×B interaction (multi-agent synergy) is consistently negligible: the Kimi-based factorial shows no significant interaction (F=0.26, p>.10), and a multi-model probe across five ego models (N=655, Section 6.4) confirms the absence of meaningful synergy (mean interaction -1.8 pts).
The use of free-tier and budget models (Nemotron, Kimi) demonstrates that recognition-oriented tutoring is achievable without expensive frontier models.
@@ -472,11 +508,11 @@ Because no single analysis can simultaneously isolate all factors of interest, w
2. **Full 2×2×2 Factorial** (Section 6.3): Three factors (Recognition × Architecture × Learner) across 15 scenarios with 3 replications per cell (N=350 scored of 352 attempted). Two runs contribute: cells 1–5, 7 from the original factorial (eval-2026-02-03-f5d4dd93, N=262) and cells 6, 8 from a re-run (eval-2026-02-06-a933d745, N=88) after the original cells 6 and 8 were found to use compromised learner prompts. All cells use the same ego model (Kimi K2.5) and judge (Claude Code/Opus). Cell sizes range from 41–45 scored per cell.
3. **A×B Interaction Analysis** (Section 6.4): Tests whether multi-agent synergy requires recognition prompts. A dedicated Kimi replication (N=60) and multi-model probe across five ego models (N=655) provide the primary evidence.
4. **Domain Generalizability** (Section 6.5): Tests factor effects on elementary math vs graduate philosophy (N=60 Kimi on elementary content; see Table 2).
Responses were evaluated by an LLM judge (Claude Code CLI, using Claude Opus as the underlying model) using the extended rubric. All thirty-seven key evaluations reported in this paper use Claude Opus as the primary judge. Two of these runs (cells 60–63 and 64–65) also include Sonnet cross-judge rejudge rows for inter-rater comparison, but reported analyses use only the Opus scores unless explicitly noted. Earlier development runs in the broader database also used Sonnet, but these are not included in the reported analyses. We report:
- **Effect sizes**: Cohen's d for standardized comparison
- **Statistical significance**: ANOVA F-tests with $\alpha$ = 0.05, p-values computed from the F-distribution CDF via regularized incomplete beta function (custom implementation in the evaluation framework)
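The identity behind such a computation is standard: the right tail of an $F(d_1, d_2)$ statistic equals the regularized incomplete beta function $I_x(d_2/2, d_1/2)$ at $x = d_2/(d_2 + d_1 F)$. A sketch via `scipy.special.betainc` (the evaluation framework's custom implementation may differ in detail):

```python
from scipy.special import betainc  # regularized incomplete beta I_x(a, b)

def f_pvalue(F, dfn, dfd):
    """Right-tail p-value of an F(dfn, dfd) statistic:
    p = I_x(dfd/2, dfn/2) with x = dfd / (dfd + dfn * F)."""
    x = dfd / (dfd + dfn * F)
    return betainc(dfd / 2, dfn / 2, x)

# Example: the one-way ANOVA reported in Section 6.1, F(2,33) = 12.97
p = f_pvalue(12.97, 2, 33)
print(p < 0.001)  # True
```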
@@ -495,31 +531,47 @@ Effect size interpretation follows standard conventions: |d| < 0.2 negligible, 0
| Recognition validation | eval-2026-02-03-86b159cd | 6.1 | 36 | 36 | response |
| Full factorial, cells 1–5,7 (Kimi) | eval-2026-02-03-f5d4dd93 | 6.3 | 262 | 262 | response |
| Full factorial, cells 6,8 re-run (Kimi) | eval-2026-02-06-a933d745 | 6.3 | 90 | 88 | response |
| A×B replication (Kimi) | eval-2026-02-05-10b344fb | 6.4 | 60 | 60 | response |
| Domain generalizability (Kimi) | eval-2026-02-05-e87f452d | 6.5 | 60 | 60 | response |
| Dynamic rewrite evolution (run 1) | eval-2026-02-05-daf60f79 | 6.18 | 29 | 27 | response |
| Dynamic rewrite evolution (run 2) | eval-2026-02-05-49bb2017 | 6.18 | 30 | 27 | response |
| Dynamic rewrite evolution (run 3) | eval-2026-02-05-12aebedb | 6.18 | 30 | 29 | response |
| Memory isolation (run 1) | eval-2026-02-06-81f2d5a1 | 6.2 | 60 | 60 | response |
| Memory isolation (run 2) | eval-2026-02-06-ac9ea8f5 | 6.2 | 62 | 62 | response |
| Active control (post-hoc) | eval-2026-02-06-a9ae06ee | 6.2 | 119 | 118 | response |
| Bilateral transformation (multi-turn) | eval-2026-02-07-b6d75e87 | 6.15 | 120 | 118 | dialogue |
| A$\times$B probe: Nemotron | eval-2026-02-07-722087ac | 6.4 | 120 | 119 | response |
| A$\times$B probe: DeepSeek V3.2 | eval-2026-02-07-70ef73a3 | 6.4 | 120 | 120 | response |
| A$\times$B probe: GLM-4.7 | eval-2026-02-07-6b3e6565 | 6.4 | 120 | 117 | response |
| A$\times$B probe: Claude Haiku 4.5 | eval-2026-02-07-6ead24c7 | 6.4 | 120 | 120 | response |
| Dialectical impasse test | eval-2026-02-08-f896275d | 6.20 | 24 | 24 | dialogue |
| Hardwired rules ablation (Kimi) | eval-2026-02-08-65a6718f | 6.7 | 72 | 72 | response |
| Learner-side evaluation (symmetric) | eval-2026-02-07-b6d75e87 | 6.16 | 118 | 118 | learner turn |
| Dialectical modulation, standard (cells 22–27) | eval-2026-02-11-35c53e99, eval-2026-02-11-5f6d51f5 | 6.8 | 84 | 84 | response |
| Dialectical modulation, multi-turn (cells 28–33) | eval-2026-02-11-a54235ea | 6.8 | 90 | 90 | dialogue |
| Self-reflective evolution (cells 40–45, Nemotron) | eval-2026-02-13-8d40e086 | 6.9 | 90 | 90 | dialogue |
| Self-reflect Nemotron non-replication (cells 40–45) | eval-2026-02-14-559d854b | 6.9 | 60 | 60 | dialogue |
| Mechanism robustness, scripted (cells 40–59) | eval-2026-02-14-e0e3a622 | 6.10 | 360 | 360 | dialogue |
| Dynamic learner mechanisms (cells 60–63) | eval-2026-02-14-6c033830 | 6.10 | 120 | 120 | dialogue |
| Dynamic learner mechanisms (cells 64–65) | eval-2026-02-14-a2b2717c | 6.10 | 120 | 120 | dialogue |
| Mechanism robustness, Nemotron (cells 40–59) | eval-2026-02-14-49b33fdd | 6.10 | 360 | 360 | dialogue |
| Cognitive prosthesis (cells 66–68, Nemotron) | eval-2026-02-17-25aaae85 | 6.10 | 90 | 90 | dialogue |
| Cognitive prosthesis smoke test (Haiku) | eval-2026-02-18-f489c0ea | 6.10 | 6 | 6 | dialogue |
| Dynamic learner base mechanisms (cells 69–70) | eval-2026-02-15-664073ab | 6.10 | 60 | 60 | dialogue |
| Prompt elaboration baseline, Haiku (cells 1, 71) | eval-2026-02-17-deee5fd6 | 6.21 | 72 | 72 | single-turn |
| Prompt elaboration baseline, Kimi (cells 1, 71) | eval-2026-02-17-27d7b4e3 | 6.21 | 72 | 72 | single-turn |
| Token budget 256, Haiku (run 1) | eval-2026-02-17-0eb3de77 | 6.22 | 36 | 36 | mixed |
| Token budget 256, Haiku (run 2) | eval-2026-02-17-5a640782 | 6.22 | 36 | 36 | mixed |
| Token budget 512, Haiku | eval-2026-02-17-5f281654 | 6.22 | 36 | 36 | mixed |
| Token budget 2048, Haiku | eval-2026-02-17-0f6dcd97 | 6.22 | 36 | 36 | mixed |
| Token budget default, Haiku | eval-2026-02-17-d32ed226 | 6.22 | 18 | 18 | mixed |
| **Paper totals** | — | — | **3,398** | **3,383** | — |
The difference between Total Attempts and Scored (15 unscored out of 3,398) reflects attempts where the ego model's API call failed (timeout, rate limit, or malformed response) or where the judge could not produce a valid score from the tutor's output. These failures are distributed across Phase 1 runs and conditions with no systematic pattern; Phase 2 runs achieved 100% scoring.
**Total evaluation database**: The complete database contains 7,000+ evaluation attempts across 117+ runs, of which 7,000+ were successfully scored. This paper reports primarily on the thirty-seven key evaluations above (N=3,383 scored), supplemented by historical data for ablation analyses.
**Note on N counts**: Section-specific Ns (e.g., "N=36" for recognition validation, "N=120" for memory isolation) refer to scored responses in that analysis. The "N=7,000+" total refers to the full evaluation database including historical development runs, which informed iterative prompt refinement. The primary evidence for reported findings comes from the thirty-seven key evaluations above (N=3,383). The factorial cells 6 and 8 were re-run (eval-2026-02-06-a933d745) after the originals were found to use compromised learner prompts; the re-run uses the same ego model (Kimi K2.5) and judge (Claude Code/Opus) as the original factorial.
### 5.8 Inter-Judge Reliability Analysis
@@ -553,7 +605,7 @@ To assess the reliability of AI-based evaluation, we conducted an inter-judge an
**Interpretation**: All judge pairs show positive, mostly significant correlations—there is genuine agreement that some responses are better than others. However, the judges weight criteria differently: Claude prioritizes engagement and recognition quality; Kimi prioritizes structural completeness and gives uniformly high scores on actionability regardless of response content; GPT applies stricter standards overall but agrees with Claude on relative rankings. The weaker Kimi correlations (r²=11-15%) compared to Claude-GPT (r²=44%) indicate Kimi captures some shared quality signal but applies substantially different weighting. This validates our use of within-judge comparisons for factor analysis while cautioning against cross-judge score comparisons.
A cross-judge replication with GPT-5.2 on key runs is presented in Section 6.19. That analysis confirms the main findings are judge-robust: the recognition main effect, recognition dominance in the memory isolation experiment, and multi-agent null effects all replicate under GPT-5.2, though with compressed magnitudes (37–59% of Claude's effect sizes depending on experiment).
---
@@ -571,25 +623,25 @@ A critical question for any recognition-based framework: Does recognition theory
| Prompt Type | N | Mean Score | SD | vs Base |
|-------------|---|------------|-----|---------|
| Recognition | 12 | 91.6 | 6.2 | +19.7 |
| Enhanced | 12 | 83.6 | 10.8 | +11.6 |
| Base | 12 | 72.0 | 10.8 | — |
**Effect Decomposition:**

- Total recognition effect: +19.7 points
- Prompt engineering alone (enhanced vs base): +11.6 points (59%)
- **Recognition increment (recognition vs enhanced): +8.0 points (41%)**

**Statistical Test**: One-way ANOVA F(2,33) = 12.97, p < .001

{width=100%}

**Interpretation**: The recognition condition outperforms the enhanced condition by +8.0 points. This comparison bundles recognition theory with memory integration (which the enhanced condition lacks; see Section 5.3). The +8.0 increment is consistent with the recognition dominance finding in Section 6.2, where recognition alone produces d=1.71 even without memory. A cross-judge replication found this increment does not reach significance under GPT-5.2 (+2.4 pts, n.s.; Section 6.19). The controlled 2×2 design presented next provides the definitive test of recognition's contribution.

### 6.2 Memory Isolation: Disentangling Recognition and Memory

The three-way comparison (Section 6.1) bundles recognition theory with memory integration, making it impossible to attribute the +8.0 increment to either component alone. To resolve this, we conducted a 2×2 memory isolation experiment (Memory ON/OFF × Recognition ON/OFF; single-agent tutor and learner held constant) using Kimi K2.5 as the ego model (consistent with the primary factorial; see Section 5.4 for model selection rationale) and Claude Opus as judge, with properly configured profiles ensuring each cell runs its intended prompt condition. Two independent runs (eval-2026-02-06-81f2d5a1, N=60 scored; eval-2026-02-06-ac9ea8f5, N=62 scored; balanced to N=30 per cell, N=120 used in analysis) are reported below.

**Table 5: 2×2 Memory Isolation Experiment (N=120, combined across two runs)**

**Interpretation**: This is the paper's primary empirical finding. Recognition theory is the active ingredient in tutoring improvement. Recognition alone produces a very large effect (d=1.71), lifting scores from ~75 to ~91 even without memory integration. Memory provides a modest additive benefit (+4.8 pts, d=0.46) that does not reach significance, and adds negligibly (+0.6 pts) when recognition is already present—consistent with ceiling effects at ~91 points limiting further improvement. The negative interaction (-4.2 pts) indicates that the two factors are not synergistic; rather, recognition is directly effective and memory's contribution is secondary. Two independent replications show identical condition ordering with no rank reversals (Recognition+Memory $\geq$ Recognition Only >> Memory Only > Base), providing strong evidence for the robustness of this pattern.

**Cross-judge confirmation**: GPT-5.2, scoring the identical responses as an independent second judge (N=119 paired), replicates the recognition dominance pattern with identical condition ordering and no rank reversals:

| | No Recognition | Recognition | Δ |
|---|---|---|---|
| **No Memory** | 68.5 (N=30) | 77.8 (N=30) | +9.3 |
| **Memory** | 71.6 (N=30) | 77.3 (N=29) | +5.7 |
| **Δ** | +3.1 | -0.5 | **Interaction: -3.6** |
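
The interaction term in the GPT-5.2 table above follows mechanically from the four cell means; a minimal Python sketch:

```python
# The interaction is the difference of simple effects: how much the
# recognition benefit changes once memory is present. Cell means are
# taken from the GPT-5.2 cross-judge table above.
cells = {  # (memory, recognition): mean score
    (0, 0): 68.5, (0, 1): 77.8,
    (1, 0): 71.6, (1, 1): 77.3,
}

rec_without_memory = cells[(0, 1)] - cells[(0, 0)]
rec_with_memory = cells[(1, 1)] - cells[(1, 0)]
interaction = rec_with_memory - rec_without_memory

print(round(rec_without_memory, 1),
      round(rec_with_memory, 1),
      round(interaction, 1))  # 9.3 5.7 -3.6
```

The negative sign reproduces the ceiling-effect reading: recognition gains less when memory has already lifted the baseline.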

Under GPT-5.2: recognition effect d=1.54 (vs Claude d=1.71), memory effect d=0.49 (vs Claude d=0.46), negative interaction -3.6 (vs Claude -5.6). GPT-5.2 finds 59% of Claude's recognition effect magnitude (+9.3 vs +15.8) but the same pattern: recognition is the dominant factor, memory is secondary, and the interaction is negative (ceiling effects). Inter-judge r=0.63 (p<.001, N=119), consistent with the r=0.44–0.64 range from other runs (Section 6.19).

**Why this is stronger than the three-way comparison**: The 2×2 design cleanly isolates each component through orthogonal manipulation rather than bundled comparison, uses properly configured profiles verified to run their intended prompt conditions, and is judge-robust (recognition dominance replicates under GPT-5.2).

### 6.3 Full 2×2×2 Factorial Evaluation

We conducted a full 2×2×2 factorial evaluation examining three factors: recognition prompts (A), tutor architecture (B), and learner architecture (C).

**Table 6: Factorial Cell Means (N=350)**

| Cell | A: Recognition | B: Tutor | C: Learner | N | Mean | SD |
|------|----------------|----------|------------|---|------|-----|
| 1 | Base | Single | Single | 44 | 73.4 | 11.5 |
| 2 | Base | Single | Multi | 42 | 69.9 | 19.4 |
| 3 | Base | Multi | Single | 45 | 75.5 | 10.3 |
| 4 | Base | Multi | Multi | 41 | 75.2 | 16.4 |
| 5 | **Recog** | Single | Single | 45 | 90.2 | 6.5 |
| 6† | **Recog** | Single | Multi | 44 | 83.9 | 15.4 |
| 7 | **Recog** | Multi | Single | 45 | 90.1 | 7.2 |
| 8† | **Recog** | Multi | Multi | 44 | 87.3 | 11.3 |

†Cells 6 and 8 were re-run with corrected learner prompts (eval-2026-02-06-a933d745). Cells 1–5 and 7 were originally scored under Opus 4.5 (eval-2026-02-03-f5d4dd93) and re-judged under Opus 4.6 for consistency across the full dataset (see Section 8.1).

**Table 7: Factorial Main Effects and ANOVA Summary (df = 1, 342 for each factor)**

| Factor | Effect Size | 95% CI | Interpretation |
|--------|-------------|--------|----------------|
| A: Recognition | **+14.4 pts** | [11.6, 17.1] | Large, dominant |
| B: Multi-agent tutor | +2.6 pts | — | Marginal (p=.057) |
| C: Learner (multi-agent) | -3.1 pts | — | Small (p=.019) |

| Source | F | p | $\eta^2$ |
|--------|---|---|-----|
| A: Recognition | **110.04** | **<.001** | **.243** |
| B: Architecture | 3.63 | .057 | .011 |
| C: Learner | 5.52 | .019 | .016 |
| A×B Interaction | 0.59 | >.10 | .002 |
| A×C Interaction | 0.97 | >.10 | .003 |
| B×C Interaction | 1.48 | >.10 | .004 |

**Interpretation**: Recognition prompts (Factor A) are the dominant contributor, accounting for 24.3% of variance with a highly significant effect (F=110.04, p < .001, $d = 1.11$). Recognition's benefit is consistent across learner types:

- **Single-agent learner**: Recognition boosts scores by +15.7 pts (d=1.73)
- **Multi-agent learner**: Recognition boosts scores by +13.0 pts (d=0.82)
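
The effect sizes above are pooled-SD Cohen's d values; a minimal Python sketch applied to two cells of the factorial table (cell 5 vs cell 1, i.e. recognition vs base with single tutor and single learner; the cell-level value lands near the pooled d = 1.73 reported for single-agent learners):

```python
import math

# Cohen's d with pooled standard deviation, computed from summary
# statistics (mean, SD, n) of two independent groups.
def cohen_d(m1, s1, n1, m2, s2, n2):
    pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2)
                          / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Cell 5 (90.2, 6.5, n=45) vs cell 1 (73.4, 11.5, n=44) from the
# factorial cell-means table above.
d_cell = cohen_d(90.2, 6.5, 45, 73.4, 11.5, 44)
print(round(d_cell, 2))  # 1.8
```

The cell-level d ≈ 1.8 is slightly above the pooled +15.7 pt / d = 1.73 figure, as expected when architecture cells are collapsed.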

The A×C interaction is non-significant (F=0.97, p=.325), indicating that recognition works robustly regardless of whether the learner uses a single-agent or multi-agent architecture. The multi-agent learner (Factor C) shows a small but significant negative main effect (-3.1 pts, p=.019), suggesting its internal ego-superego deliberation adds noise without improving the tutor's effectiveness. Architecture (Factor B) approaches significance (p=.057) with a small positive effect, consistent with the additive pattern confirmed across five ego models in Section 6.4.

### 6.4 A×B Interaction: Architecture is Additive, Not Synergistic

The factorial analysis above shows a minimal main effect for multi-agent architecture. A natural follow-up question is whether architecture *interacts* with prompt type—whether multi-agent synergy depends on recognition prompts.

A dedicated Kimi replication (eval-2026-02-05-10b344fb, N=60) tested the same four cells used in the factorial (recognition × architecture, with enhanced prompts as baseline). Recognition cells scored ~93.5 regardless of architecture (single=93.7, multi=93.3), while enhanced cells scored ~86.3 with a modest architecture effect (single=84.9, multi=87.6). The A×B interaction was -3.0 points—small and consistent with the factorial pattern.

To test generality, the same 2$\times$2 design (Recognition $\times$ Architecture, single-agent learner held constant) was run across four additional ego models (N$\approx$120 each, Opus judge), with the single-agent-learner cells from the Kimi factorial (cells 1, 3, 5, 7; N=179) serving as the fifth model.

{width=100%}

**Table 8: Multi-Model A$\times$B Interaction Probe (N=655 across 5 ego models)**

| Ego Model | N | Base Single | Base Multi | Recog Single | Recog Multi | Recognition Effect | A$\times$B Interaction |
|-----------|---|------------|-----------|-------------|------------|-------------------|----------------------|
| Kimi K2.5† | 179 | 73.4 | 75.5 | 90.2 | 90.1 | +15.7 | -2.3 |
| Nemotron | 119 | 54.8 | 59.3 | 73.6 | 72.5 | +16.0 | -5.7 |
| DeepSeek V3.2 | 120 | 69.5 | 73.9 | 84.2 | 87.2 | +14.0 | -1.4 |
| GLM-4.7 | 117 | 65.8 | 68.6 | 84.0 | 86.0 | +17.8 | -0.7 |
| Claude Haiku 4.5 | 120 | 80.3 | 82.4 | 90.7 | 91.2 | +9.6 | -1.6 |

†Kimi data drawn from the single-agent-learner cells (1, 3, 5, 7) of the full factorial (Section 6.3) to match the probe design.

The multi-model probe confirms the absence of meaningful A$\times$B synergy: **all five ego models show negative interactions** (-5.7 to -0.7), with a mean of -2.2. Multi-agent architecture provides slightly *less* incremental benefit for recognition prompts than for base prompts—consistent with ceiling effects on already-high recognition scores. An early exploratory analysis (N=17, Nemotron, data no longer in DB) had suggested a +9.2 interaction, but the Nemotron re-run (N=119) shows -5.7, confirming this as sampling noise. Meanwhile, the recognition main effect replicates robustly across all five models (+9.6 to +17.8, mean +14.8), confirming it as the dominant and model-independent driver of improvement.

**Practical Implication**: Multi-agent architecture provides a small benefit in four of five models (-0.8 to +3.7 points) that does not meaningfully interact with prompt type. For systems using recognition prompts, multi-agent architecture is unnecessary unless error correction on new domains is needed (Section 6.5).
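
The two derived columns of Table 8 follow from the four cell means per model; a minimal Python sketch (small differences from the published column, e.g. -2.2 here vs the published -2.3 for Kimi, are expected because the paper computes them from unrounded scores):

```python
# Re-deriving Table 8's recognition effect and A x B interaction
# from its four cell means per ego model.
rows = {  # model: (base_single, base_multi, recog_single, recog_multi)
    "Kimi K2.5": (73.4, 75.5, 90.2, 90.1),
    "Nemotron": (54.8, 59.3, 73.6, 72.5),
    "DeepSeek V3.2": (69.5, 73.9, 84.2, 87.2),
    "GLM-4.7": (65.8, 68.6, 84.0, 86.0),
    "Claude Haiku 4.5": (80.3, 82.4, 90.7, 91.2),
}

derived = {}
for model, (bs, bm, rs, rm) in rows.items():
    # Main effect of recognition: recog-cell average minus base-cell average.
    recognition_effect = (rs + rm) / 2 - (bs + bm) / 2
    # Interaction: architecture benefit under recognition minus under base.
    interaction = (rm - rs) - (bm - bs)
    derived[model] = (round(recognition_effect, 1), round(interaction, 1))
    print(model, derived[model])
```

Every model yields a negative interaction, reproducing the "additive, not synergistic" conclusion.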

### 6.5 Domain Generalizability: Factor Effects Vary by Content Type

A critical question for any pedagogical framework: Do findings generalize across content domains? We tested whether recognition and architecture effects transfer from graduate-level philosophy (our primary domain) to 4th-grade elementary mathematics (fractions).

**Data source**: Elementary math results come from a dedicated domain-transfer run (eval-2026-02-05-e87f452d, N=60, 4 cells × 5 elementary scenarios × 3 replications, Kimi K2.5 ego, Opus judge). Philosophy results use the matching cells (1, 3, 5, 7) from the Kimi-based factorial (Section 6.3). Because both use the same ego model and judge, the comparison isolates domain effects cleanly.

**Table 9: Factor Effects by Domain (Kimi K2.5, Elementary vs Philosophy)**

| Factor | Elementary (Math) | Philosophy (Hegel) |
|--------|-------------------|-------------------|
| A: Recognition | **+8.2 pts** | **+15.7 pts** |
| B: Multi-agent tutor | +2.3 pts | +1.0 pts |
| Overall avg | 74.7 | 82.3 |
| Best config | recog+single (78.9) | recog+single (90.2) |

{width=100%}

**Table 9b: Elementary Domain Cell Breakdown (eval-2026-02-05-e87f452d, N=60)**

| Condition | N | Mean | Δ |
|-----------|---|------|---|
| Base single (cell 1) | 15 | 68.2 | — |
| Base multi (cell 3) | 15 | 73.1 | +4.9 |
| Recognition single (cell 5) | 15 | 78.9 | +10.7 |
| Recognition multi (cell 7) | 15 | 78.7 | +10.5 |

**Key Findings:**

1. **Recognition dominates in both domains**: Recognition is the primary factor in both philosophy (+15.7 pts) and elementary math (+8.2 pts), though the effect is larger for abstract content. Architecture provides a small additive benefit in both domains (elementary +2.3 pts, philosophy +1.0 pts).

2. **Multi-agent as error correction**: On elementary content, the tutor suggested philosophy content (e.g., "479-lecture-1" to 4th graders learning fractions) due to two content isolation bugs: (a) a fallback in the curriculum context builder that served course listings from the default philosophy directory when scenarios lacked explicit content references, and (b) hardcoded philosophy lecture IDs in the tutor prompt examples that the model copied when no curriculum anchor was present. Both bugs have been fixed (see Section 6.6). The Superego caught and corrected these domain mismatches in multi-agent cells—demonstrating its value as a safety net for system-level content isolation failures. An earlier Nemotron-based analysis (data no longer in DB) showed a larger architecture effect (+9.9 pts) on elementary content, likely inflated by Nemotron's higher rate of content isolation errors that the Superego corrected; the Kimi run with its lower error rate reveals the underlying pattern.

3. **Recognition theory is domain-sensitive**: The philosophical language of recognition (mutual acknowledgment, transformation through struggle) resonates more with graduate-level abstract content than with concrete 4th-grade procedural learning. This is not a failure of the framework but a boundary condition.

4. **Scenario-dependent effects**: The recognition effect on elementary content is scenario-dependent: challenging scenarios (frustrated_student: +23.8, concept_confusion: +13.6, struggling_student: +11.8) show substantial recognition advantage, while neutral scenarios (new_student_first_visit: +0.2, returning_student_mid_course: +0.1) show none. This pattern is consistent with recognition theory—recognition behaviors matter most when the learner needs to be acknowledged as a struggling subject, not for routine interactions.

5. **Architecture recommendation varies by use case**:
   - **New/untrained domain**: Multi-agent essential (Superego catches content isolation errors)
   - **Well-trained domain**: Recognition prompts sufficient, multi-agent optional

**Theoretical Interpretation**: Recognition's value depends on content characteristics. Abstract, interpretive content (consciousness, dialectics) benefits most from recognition framing—the "struggle" in Hegel's sense maps onto the intellectual struggle with difficult concepts. Concrete procedural content (fractions, arithmetic) benefits less from relational depth; correct procedure matters more than the bilateral transformation that recognition enables (Section 6.15). However, even in concrete domains, recognition provides meaningful improvement for challenging scenarios—suggesting recognition's value is modulated by both content type and scenario difficulty, not content type alone.

This suggests limits to recognition-theoretic pedagogy. Not all learning encounters are equally amenable to the mutual transformation Honneth describes. The "struggle for recognition" may be most relevant where the learning itself involves identity-constitutive understanding—where grasping the material changes who the learner is, not just what they know—or where the learner faces emotional or cognitive challenge that benefits from being acknowledged.

This connects to Freud's reality principle: the Superego enforces correspondence with external reality, not just internal standards. In our architecture, the Superego ensures the tutor's suggestions correspond to the learner's actual curriculum. The elementary scenario results demonstrate this concretely: multi-agent cells (3, 7) produced correct elementary content references in cases where single-agent cells (1, 5) propagated the philosophy content uncorrected.

**Practical Implication**: For domain transfer—deploying tutoring systems on new content—multi-agent architecture provides essential error correction that single-agent systems cannot match. The bugs identified here represent a realistic class of deployment failure: incomplete content scoping and prompt examples that assume a particular domain. The Superego's reality-testing function catches these errors regardless of their source. However, an earlier Nemotron-based analysis showed a +9.9 point architecture advantage on elementary content, partly inflated by these bugs—the Kimi replication (Table 9), with fewer affected responses, shows a more modest +2.3 point architecture effect, likely closer to the true value once content isolation is correct.

### 6.7 Hardwired Rules vs Dynamic Dialogue

Analysis of Superego critique patterns across 455 dialogues (186 rejections) revealed consistent failure modes:

**Table 11: Superego Rejection Patterns**

| Pattern | Frequency | Description |
|---------|-----------|-------------|

**Hardwired Rules Ablation**: We encoded the top patterns as static rules in the Ego prompt (e.g., "If learner offers interpretation, engage before prescribing"; "Reference specific lecture IDs, not generic topics"; "If learner shows productive confusion, pose questions rather than resolve"). These five rules were embedded directly in the Ego system prompt, allowing single-agent operation without live Superego dialogue.

An initial exploratory test (N=9 per condition, Haiku model) suggested hardwired rules could capture approximately 50% of the Superego's benefit. However, a larger replication (N=72, Kimi K2.5 ego, Opus judge) produced a null result:

**Table 12: Hardwired Rules Ablation (N=72, Kimi K2.5, Opus judge)**

| Condition | Architecture | Learner | N | Mean | vs Base |
|-----------|-------------|---------|---|------|---------|
| Base (cell 1) | Single, no superego | Single | 44 | 73.4 | — |
| Base (cell 2) | Single, no superego | Multi | 42 | 69.9 | — |
| Hardwired (cell 13) | Single + rules, no superego | Single | 36 | 74.0 | $+0.6$ |
| Hardwired (cell 14) | Single + rules, no superego | Multi | 36 | 69.0 | $-0.9$ |

Under the unified judge (Opus 4.6), hardwired rules are essentially neutral: $+0.6$ for single-agent and $-0.9$ for multi-agent learners. The hardwired cells (74.0 and 69.0) fall within the range of the re-judged base cells (73.4 and 69.9), suggesting that codifying common superego critiques as static rules neither helps nor hurts—the rules simply replicate what the ego already produces without superego guidance.

**Theoretical Interpretation**: This result supports a *phronesis* interpretation of the Superego's function. Aristotelian practical wisdom—the capacity for situational judgment that cannot be reduced to general rules—appears to be what the live Superego provides. When the Superego's most frequent critiques are codified as static rules, the result is indistinguishable from no Superego at all. The Superego does not merely enforce rules; it *reads the situation* and determines which rules apply, when exceptions are warranted, and how to balance competing pedagogical goals. This distinction between rule-following and practical wisdom maps directly onto debates in moral philosophy about whether ethical judgment can be proceduralized [@aristotle_nicomachean].

### 6.8 Dialectical Superego Modulation

The dialectical superego modulation experiments (cells 22–33) tested whether superego persona type and negotiation style interact with recognition theory, using three superego dispositions (suspicious, adversary, advocate) in two negotiation architectures: standard divergent (cells 22–27), where the superego simply challenges the ego's draft, and dialectical/Aufhebung (cells 28–33), where ego and superego engage in synthesis-oriented negotiation. All cells use a unified learner and ego-superego tutor architecture with Kimi K2.5 as ego model.

#### Standard Divergent Superego (cells 22–27)

**Table 13: Standard Divergent Superego Results (N=84, Opus judge)**

| Persona | Base | Recog | $\Delta$ |
|---------|------|-------|----------|
| Advocate | 56.1 | 69.7 | **+13.6** |
| Adversary | 55.8 | 65.2 | **+9.3** |
| Suspicious | 62.4 | 62.4 | **+0.0** |

*Data from eval-2026-02-11-35c53e99 (N=54, 2 single-turn scenarios) and eval-2026-02-11-5f6d51f5 (N=30, dialectical single-turn). Opus judge.*

Recognition helps advocate and adversary superegos substantially but has no effect on the suspicious persona. The suspicious disposition may already encode recognition-like questioning patterns—probing for authenticity and formulaic responses overlaps functionally with recognition's emphasis on treating the learner as a genuine subject.

All divergent/dialectical cells score approximately 20 points below the original factorial (base $\approx$ 56–65 vs factorial base $\approx$ 78), reflecting the additional internal friction that adversarial superego personas create.

#### Dialectical Multi-Turn Results (cells 28–33)

**Table 14: Dialectical Multi-Turn Modulation (eval-2026-02-11-a54235ea, N=90, Opus judge)**

| Persona | Base (N=15) | Recog (N=15) | $\Delta$ |
|---------|-------------|--------------|----------|
| Suspicious | 67.9 | 68.8 | +0.9 |
| Adversary | 68.6 | 74.8 | **+6.2** |
| Advocate | 67.5 | 73.9 | **+6.4** |
| **Pooled** | **68.0** | **72.5** | **+4.5** |

*Three multi-turn scenarios (mood frustration, misconception correction, mutual transformation), 4–6 dialogue turns each.*

The overall recognition effect is +4.5 pts, $d = 0.38$, $t(88) = 1.80$, $p \approx .075$—marginally significant and substantially weaker than the original factorial ($d = 1.11$). Recognition interacts with persona type: adversary (+6.2 pts) and advocate (+6.4 pts) benefit substantially, while suspicious shows minimal change (+0.9 pts). This reverses the single-turn pattern, where suspicious was neutral and adversary was catastrophically negative.
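
The reported effect size and test statistic are mutually consistent; a minimal Python sketch using the standard two-group relation:

```python
import math

# For two independent groups, t = d * sqrt(n1*n2 / (n1 + n2)).
# With d = 0.38 and n = 45 per condition (N = 90), this recovers
# the reported t(88) = 1.80.
d, n1, n2 = 0.38, 45, 45
t = d * math.sqrt(n1 * n2 / (n1 + n2))
print(f"t = {t:.2f}")  # t = 1.80
```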

**Table 15: Structural Modulation Metrics — Base vs Recognition (N=90)**

| Metric | Base (N=45) | Recog (N=45) | Cohen's $d$ |
|--------|-------------|--------------|-------------|
| Mean Negation Depth | 2.28 | 1.48 | $-2.01$ |
| Mean Rounds to Converge | 2.46 | 1.62 | $-2.45$ |
| Mean Superego Confidence | 0.88 | 0.88 | 0.18 |
| Mean Feedback Length (chars) | 435 | 435 | $-0.01$ |

*Negation depth and convergence effects are significant at p < .001 (N=90); superego confidence and feedback length do not differ between conditions.*

Three findings emerge. First, **recognition reduces internal friction, not output quality directly.** Recognition-primed egos produce suggestions the superego approves faster ($d = -2.45$ for convergence speed) and rejects less often ($d = -2.01$ for negation depth). This consistency across all three persona types (negation depth $d$ range: $-2.26$ to $-2.36$) suggests recognition improves the ego's initial alignment with the superego's standards.

Second, **structural modulation does not predict quality.** Correlations between modulation metrics and output scores are all non-significant: negation depth ($r = -0.014$, $p = .895$), convergence speed ($r = 0.007$, $p = .948$), feedback length ($r = -0.114$, $p = .280$). More superego friction—more rejection rounds, deeper negotiation—does not produce better outputs.
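
These are plain Pearson correlations over per-attempt records. A sketch with synthetic stand-in arrays (the real per-attempt metrics live in the evaluation database; the variable names are illustrative):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
# Synthetic stand-ins for N=90 per-attempt records; drawn independently,
# so any observed correlation here is sampling noise by construction.
negation_depth = rng.poisson(2.0, size=90).astype(float)
output_score = rng.normal(70.0, 10.0, size=90)

r, p = pearsonr(negation_depth, output_score)
```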

Third, **the superego is a filter, not an improver.** The superego catches poor responses but does not iteratively refine good ones. Its value lies in preventing failure rather than enhancing success. Recognition works by making the ego's *first draft* better, so the superego has less to catch.

**Adversary over-deference mechanism.** An unexpected interaction appeared in the single-turn dialectical results. The adversary persona produced a catastrophic reversal when combined with recognition in single-turn settings: recognition + adversary scored 54.0, *below* base + adversary (65.3), a $-11.3$ pt inversion. Recognition instructs the ego to honor learner autonomy; the adversary superego challenges any prescriptive recommendation as "controlling"; the ego removes the recommendation entirely, producing pedagogically empty responses. Multi-turn interaction rescues this spiral: with learner feedback grounding the dialogue, the same cell becomes the *best-scoring* at 74.8 (+6.2 over base)—a +20.8 pt swing from single-turn. Learner feedback breaks the ego-superego echo chamber by providing external reality-testing.

**Intervention type distribution.** The *proportion* of revise/reject/reframe interventions is identical for base and recognition ($\chi^2(4) = 1.17$, $p = .883$, $V = 0.036$). The difference is purely in *volume*—recognition doesn't change *how* the superego intervenes, just *how often*. Recognition cells are also cheaper: 7.9 internal dialogue rounds at \$0.067/attempt vs 12.3 rounds at \$0.098/attempt for base (${\approx}30\%$ cost reduction).
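
The distribution test above is a standard chi-square over a condition $\times$ intervention-type contingency table, with Cramér's $V$ as effect size. A sketch with hypothetical counts (the five intervention categories and the numbers are illustrative, chosen only to show near-identical proportions):

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table: np.ndarray) -> tuple[float, float]:
    """Chi-square test plus Cramér's V effect size for a contingency table."""
    chi2, p, dof, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    k = min(table.shape[0], table.shape[1]) - 1
    return float(np.sqrt(chi2 / (n * k))), float(p)

# Hypothetical base-vs-recognition counts over five intervention types --
# illustrative numbers, not the paper's actual tallies.
counts = np.array([[210, 130, 90, 45, 25],
                   [205, 135, 88, 48, 24]])
v, p = cramers_v(counts)
```

With proportions this close, $V$ stays near zero and the test is far from significance, mirroring the reported null.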

### 6.9 Self-Reflective Evolution and the Insight-Action Gap

Cells 40–45 extended the dialectical architecture with self-reflective evolution: between turns, both ego and superego generate first-person reflections on the prior interaction using their own respective models. The ego reflects on superego feedback received and its own revision patterns; the superego reflects on its intervention history and ego compliance signals. These reflections are injected into subsequent turns, enabling the system to accumulate insights about its own operation.

Three superego disposition types (suspicious, adversary, advocate) were crossed with recognition (present/absent) in a full 3$\times$2 design (N=90, eval-2026-02-13-8d40e086, Nemotron ego / Kimi K2.5 superego, Opus judge).

**Table 16: Self-Reflective Evolution — Persona $\times$ Recognition (N=90)**

| Persona | Base (N=15) | Recog (N=15) | $\Delta$ |
|---------|------------|-------------|----------|
| Suspicious | 59.3 (SD=16.1) | 78.3 (SD=9.9) | **+19.0** |
| Adversary | 68.4 (SD=13.2) | 79.3 (SD=7.2) | **+10.9** |
| Advocate | 71.5 (SD=8.8) | 74.1 (SD=11.3) | +2.6 |
| **Pooled** | **66.4 (SD=13.8)** | **77.2 (SD=9.7)** | **+10.8** |

Recognition effect: +10.8 pts, $d = 0.91$—substantially stronger than the dialectical-only architecture (cells 28–33, $d = 0.38$) and approaching the original factorial ($d = 1.11$). Self-reflection amplifies the recognition effect approximately 2.4$\times$ compared to the dialectical architecture without self-reflection.
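
Effect sizes in this section are Cohen's $d$ with a pooled standard deviation, computable directly from the cell summary statistics. A minimal sketch (assuming the pooled-SD formula; it reproduces the reported pooled $d = 0.91$ for Table 16):

```python
import math

def cohens_d(m1: float, s1: float, n1: int,
             m2: float, s2: float, n2: int) -> float:
    """Cohen's d using the pooled standard deviation of two independent groups."""
    pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m2 - m1) / pooled_sd

# Pooled row of Table 16: base 66.4 (SD 13.8) vs recognition 77.2 (SD 9.7), N=45 each.
d = cohens_d(66.4, 13.8, 45, 77.2, 9.7, 45)  # ≈ 0.91
```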

**Disposition gradient.** The full N=90 results reveal a striking gradient: the more hostile the superego disposition, the more recognition helps. The suspicious superego benefits most (+19.0), with recognition also compressing its variance from SD=16.1 to SD=9.9. Under recognition, the suspicious superego's probing disposition aligns with the recognition framework's emphasis on questioning authenticity—creating a coherent internal dialogue rather than unproductive antagonism. The adversary superego shows a substantial but smaller benefit (+10.9), while the advocate persona shows only +2.6, suggesting its supportive disposition already achieves much of what recognition provides. Base condition scores follow the inverse pattern: advocate (71.5) > adversary (68.4) > suspicious (59.3), confirming that hostile dispositions are destructive without recognition but become productive with it.

**The insight-action gap.** Despite the amplified recognition effect, a fundamental limitation persists. Qualitative trace analysis reveals that both base and recognition conditions show *awareness* of their own failures through self-reflection—the ego correctly identifies "I kept circling back to the same framework," the superego correctly diagnoses "the ego ignores my feedback." But awareness alone does not produce behavioral change. The ego's self-reflection states the correct insight ("I should stop interrupting") without generating a concrete alternative strategy. This insight-action gap—where the system accurately diagnoses its own failures but lacks the mechanism to translate diagnosis into different behavior—becomes the central design challenge addressed in subsequent experiments with Theory of Mind mechanisms (Section 6.10).

**Table 17: Comparison Across Approaches**

| Approach | Cells | N | Recog $\Delta$ | $d$ |
|----------|-------|---|----------------|-----|
| Dialectical only | 28–33 | 90 | +4.5 | 0.38 |
| Self-reflective evolution (Nemotron/Kimi) | 40–45 | 90 | +10.8 | 0.91 |
| Self-reflective evolution (Nemotron) | 40–45 | 60 | +0.4 | n.s. |
| Original factorial | 1–8 | 350 | +14.4 | 1.11 |

Self-reflection brings the recognition effect close to factorial levels ($d = 0.91$ vs $d = 1.11$), suggesting the reduced effect in cells 28–33 reflects the dialectical architecture's additional friction, which self-reflection partially overcomes.

**Cross-model non-replication.** A Nemotron replication (eval-2026-02-14-559d854b, N=60, cells 40–45) shows base $M = 66.6$, recognition $M = 67.0$ ($\Delta = +0.4$)—essentially no recognition effect. The persona pattern also fails to replicate: adversary shows a negative delta ($-4.3$), advocate positive ($+6.9$), suspicious near-zero ($-1.4$). Self-reflective evolution's recognition amplification appears model-dependent: Nemotron, scoring approximately 15 points lower than Kimi across all conditions, does not show the effect. This is consistent with the broader cross-model mechanism replication (Section 6.10), where Nemotron replicates the basic recognition effect but at compressed magnitudes. Whether self-reflection requires a minimum capability threshold to amplify recognition remains an open question.

### 6.10 Mechanism Robustness and the Scripted Learner Confound

To test whether specific mechanisms beyond basic recognition and self-reflection differentially affect tutoring quality, we ran a comprehensive 20-cell mechanism comparison (cells 40–59) across nine mechanism variants: self-reflection with three superego dispositions (suspicious, adversary, advocate), quantitative disposition tracking, prompt erosion detection, intersubjective ego-superego dialogue, combined mechanisms, and bidirectional Theory of Mind profiling in two configurations (tutor-only and bidirectional). All cells used Haiku 4.5 as ego model, unified (scripted) learner, and Opus judge.

**Table 18: Mechanism Robustness Under Scripted Learner (eval-2026-02-14-e0e3a622, N=360, Opus judge)**

| Mechanism | Base M | Recog M | $\Delta$ |
|-----------|--------|---------|----------|
| Intersubjective | 82.2 | 91.7 | +9.5 |
| Profiling (tutor-only) | 82.3 | 92.4 | +10.1 |
| Self-reflect (suspicious) | 83.7 | 92.1 | +8.4 |
| Combined | 84.2 | 92.4 | +8.3 |
| Profiling (bidirectional) | 85.1 | 92.7 | +7.6 |
| Erosion detection | 83.5 | 90.8 | +7.2 |
| Quantitative disposition | 86.2 | 92.6 | +6.4 |
| Self-reflect (adversary) | 86.6 | 92.6 | +6.0 |
| Self-reflect (advocate) | 85.2 | 90.3 | +5.1 |

Overall: base $M = 84.3$, recognition $M = 91.9$, $d = 0.86$. All nine recognition cells cluster within a 2.4-point band (90.3–92.7)—indistinguishable at $N \approx 18$ per cell ($SE \approx 1.7$). No mechanism differentiates from any other.

Recognition is confirmed as the dominant active ingredient ($d = 0.86$), replicating across all nine mechanism variants. Mechanism selection has no measurable effect on output quality—all provide comparable context to an LLM that already has a strong pedagogical baseline.

**The scripted learner confound.** Why do mechanisms fail to differentiate? All cells 40–59 use a unified (scripted) learner: learner messages come from scenario YAML and repeat identically every turn, regardless of what the tutor says. This means profiling builds a model of an interlocutor that doesn't change (confabulation), self-reflection adjusts tutor strategy against a static target (unverifiable), and intersubjective dialogue incorporates no new learner signal between turns. All mechanisms are causally inert—they modify tutor output, but the next learner input is predetermined.

#### Dynamic Learner $\times$ Mechanism (cells 60–65, 69–70)

To test whether mechanism differentiation emerges with a responsive interlocutor, we ran three experiments using ego/superego (dynamic) learners that generate genuine LLM-powered responses (Haiku 4.5 ego, Opus judge, 2 scenarios). The first (eval-2026-02-14-6c033830, N=120) crosses recognition (present/absent) with mechanism type (self-reflection vs bidirectional profiling) in a 2$\times$2 design. The second (eval-2026-02-14-a2b2717c, N=120) adds recognition-only cells for intersubjective framing and combined mechanisms. The third (eval-2026-02-15-664073ab, N=60) completes the base row by adding base counterparts for intersubjective and combined mechanisms (cells 69–70), enabling recognition delta computation across all four mechanism types.

**Table 19: Dynamic Learner $\times$ Mechanism (N=300, Opus judge)**

| | Self-reflect | Profiling | Intersubjective | Combined |
|---|---|---|---|---|
| **Base** | 71.4 (22.9) | 75.5 (19.4) | 67.7 (24.6) | 73.9 (19.8) |
| **Recognition** | 85.9 (15.7) | 88.8 (13.9) | 82.8 (18.8) | 87.8 (12.6) |
| **Recognition $\Delta$** | **+14.5** | **+13.3** | **+15.1** | **+13.9** |

Four findings emerge. First, **recognition with a dynamic learner** produces a remarkably consistent +14.2 pt average effect across all four mechanisms ($\Delta$ range: +13.3 to +15.1)—roughly double the scripted learner effect (+7.6)—consistent with a responsive interlocutor providing more material for recognition to work with.

Second, **mechanisms genuinely differentiate with dynamic learners**. Unlike the scripted condition (Table 18) where all mechanisms cluster within 2.4 pts, dynamic learner cells span a wider range. Under recognition, cells range from 82.8 (intersubjective) to 88.8 (profiling), a 6.0-point spread. In the base condition, the range is even wider: from 67.7 (intersubjective) to 75.5 (profiling), a 7.8-point spread. Profiling and combined mechanisms consistently outperform self-reflect and intersubjective framing under both conditions. The profiling effect is additive: +4.1 pts in the base condition, +2.9 pts under recognition, with near-zero interaction ($-0.7$). Profiling helps more on the harder scenario (misconception correction: +8.9 pts) than the open-ended one (mutual transformation: $-0.6$ pts). The mechanism operates on a different causal pathway from recognition: recognition changes *what* the tutor tries to do (treat learner as autonomous subject); profiling changes *how well* it adapts to this specific learner.

Third, **intersubjective framing underperforms without recognition**. Cell 69 (base + intersubjective, $M = 67.7$) is the lowest of all dynamic learner cells—3.7 points below the self-reflect base. Without the recognition framework to give intersubjective coordination its proper orientation, the mechanism may introduce confusion by prompting the ego to negotiate with the superego over a shared understanding that neither has been prepared to offer. Combined mechanisms partially rescue this (cell 70: 73.9), suggesting that adding profiling and self-reflection provides enough structure to make intersubjective coordination productive even without recognition.

Fourth, **variance collapses monotonically**: SD drops from 24.6 (intersubjective base) $\to$ 22.9 (self-reflect base) $\to$ 19.4 (profiling base) $\to$ 18.8 (intersubjective recognition) $\to$ 15.7 (self-reflect recognition) $\to$ 12.6 (combined recognition). Both recognition and mechanism complexity independently reduce output variance, consistent with each factor constraining the tutor's output toward consistently high quality.

**Theory of Mind interpretation.** Profiling is a Theory of Mind mechanism: it builds a model of the other agent's cognitive state, epistemic commitments, and response patterns. Theory of Mind is only useful when there is a mind to model. With a scripted learner, profiling builds a model of a recording—confabulation that cannot create a feedback loop. With a dynamic learner, profiling creates a genuine feedback loop: profile $\to$ adapted strategy $\to$ changed learner response $\to$ updated profile. This explains the null result in Table 18 (scripted learner: no mechanism differentiation) alongside the positive result in Table 19 (dynamic learner: profiling and combined differentiate).
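
The feedback loop can be sketched schematically (function names and prompt strings are illustrative, not the harness's actual API):

```python
from typing import Callable

def profiling_loop(llm: Callable[[str], str],
                   learner: Callable[[list], str],
                   turns: int = 5) -> list:
    """Profile -> adapted tutor turn -> learner response -> updated profile."""
    profile = "(no profile yet)"
    transcript: list = []
    for _ in range(turns):
        tutor_msg = llm(f"Learner profile:\n{profile}\n"
                        f"Dialogue so far:\n{transcript}\nReply as tutor.")
        transcript.append(("tutor", tutor_msg))
        # Dynamic learner: the reply depends on what the tutor just said.
        # A scripted learner ignores `transcript`, which severs this loop
        # and leaves the profile modeling a recording.
        transcript.append(("learner", learner(transcript)))
        profile = llm(f"Update profile:\n{profile}\n"
                      f"New exchange:\n{transcript[-2:]}")
    return transcript
```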

#### Cross-Model Replication: Nemotron (eval-2026-02-14-49b33fdd)

A Nemotron replication of the full mechanism suite (cells 40–59, N=360, Opus judge) confirms the core findings at lower absolute scores. Nemotron produces base $M = 66.9$, recognition $M = 73.6$ ($\Delta = +6.7$), approximately 15 points below Haiku across all conditions. The recognition effect replicates ($\Delta$ range: $-0.6$ to +14.3 across mechanisms), and mechanisms again cluster within a narrow band under recognition (70.3–75.8, range 5.5 pts). The bidirectional profiling anomaly ($\Delta = -0.6$) is the only mechanism where recognition does not help on Nemotron. Higher variance at lower absolute scores dilutes effect sizes, but the qualitative pattern is identical: recognition is the dominant active ingredient, and mechanism selection is secondary.

#### Cognitive Prosthesis Test (cells 66–68)

Can a strong superego compensate for a weak ego? Cells 66–68 test this "cognitive prosthesis" hypothesis by pairing a weak ego model (Nemotron) with a strong superego (Kimi K2.5) armed with the full mechanism suite: bidirectional profiling, self-reflection, prompt rewriting, cross-turn memory, and dialectical negotiation. The three cells vary the superego configuration: descriptive profiling (cell 66, superego passes learner profile as-is), prescriptive profiling (cell 67, superego translates profile into DO/DON'T action items), and prescriptive + adversary superego (cell 68, more aggressive challenger). All cells use recognition theory, dynamic learners, and the same two bilateral scenarios (eval-2026-02-17-25aaae85, N=90, Opus judge).

**Table 19b: Cognitive Prosthesis Test (N=90, Nemotron ego, Opus judge)**

| Cell | Superego config | Misconception | Mutual Transform | Overall | SD |
|------|----------------|-------------|-----------------|---------|-----|
| 66 | Descriptive | 52.4 | 44.2 | 48.3 | 13.4 |
| 67 | Prescriptive | 54.8 | 43.3 | 49.0 | 18.8 |
| 68 | Adversary | 57.1 | 45.1 | 51.1 | 20.4 |

The prosthesis hypothesis fails decisively. All three cells score well below Nemotron's own scripted base (cell 40: $M = 64.2$), let alone Haiku's profiling performance (cell 63: $M = 88.7$). Superego type has no significant effect: $F(2,87) = 0.20$, $\eta^2 = .004$, with the adversary advantage ($\Delta = +2.8$, $d = 0.16$) trivial. The mechanism stack that boosts Haiku by +20 points *hurts* Nemotron by $-15$ points—an inversion, not a null result.

**Dimension analysis** reveals two tiers of capability. Nemotron succeeds on *static* dimensions that require factual retrieval: specificity (4.0), actionability (4.0), tone (3.7). It fails catastrophically on *dynamic* dimensions requiring multi-turn context integration: tutor adaptation (1.8, 86% failure rate), dialectical responsiveness (2.0, 82%), mutual recognition (2.5, 63%). Judge reasoning repeatedly identifies the same failure: "Ignores 4-turn dialogue; responds to initial misconception," "Response identical to what might be given at turn 1; shows no evolution from dialogue." Nemotron processes injected context (profiles, self-reflections, superego feedback) as static input but cannot translate it into behavioral adaptation.

**Superego parse failure analysis.** A contributing factor is silent superego failure. The Kimi K2.5 superego returns malformed JSON on 16.2% of reviews (80/495), triggering automatic approval of the ego's draft. Parse failure rates correlate with cell scores: descriptive 21.8% (lowest score), prescriptive 15.2%, adversary 11.5% (highest score). The adversary prompt produces more parseable superego output, giving the ego more opportunities for revision.
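
The failure mode can be illustrated with a minimal parse-with-fallback sketch (a hypothetical reconstruction, not the actual harness code); the critical design choice is that malformed output defaults to approval rather than to a retry or a conservative rejection:

```python
import json

def parse_review(raw: str) -> dict:
    """Parse a superego review; malformed JSON silently approves the draft."""
    try:
        review = json.loads(raw)
        if not isinstance(review, dict) or "verdict" not in review:
            raise ValueError("malformed review")
        return review
    except (json.JSONDecodeError, ValueError):
        # Fallback path: quality control is disabled for this turn
        # and the ego's draft passes unrevised.
        return {"verdict": "approve", "parse_failed": True}
```

A parse-failure rate like the 16.2% reported above is then simply the mean of the `parse_failed` flags across reviews.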

**Haiku control smoke test.** A minimal replication (eval-2026-02-18-f489c0ea, N=6, same cells, Haiku ego) confirms the model-dependence interpretation. Haiku scores 89.7–97.9 on misconception (vs Nemotron 48–57)—a ~40-point gap on the identical mechanism stack. On mutual transformation, Haiku shows high variance (40.2–96.2), with the two low scores (cells 66–67) traced to superego parse failures (45.5% failure rate) that auto-approved lecture-redirect behavior. Cell 68 (adversary, 18.2% parse failure rate) scored 96.2—confirming that the prosthesis mechanism works when the superego actually functions.

**Interpretation.** Two independent failure modes compound: (1) superego parse failures silently disable quality control on 16–45% of turns, and (2) Nemotron cannot translate superego feedback into behavioral change even when the superego functions correctly. For a capable ego model (Haiku), fixing (1) alone—by using an adversary superego that produces parseable rejections—is sufficient. For a weak ego model (Nemotron), both failures would need to be addressed, and (2) may represent a hard model limitation. The data implies a minimum ego capability threshold for mechanism benefit: below the threshold, mechanisms add noise rather than signal, and architectural scaffolding becomes actively counterproductive.

### 6.11 Qualitative Transcript Assessment

To complement the quantitative findings with interpretive depth, we conducted AI-assisted qualitative assessments of full dialogue transcripts from two key runs: the dialectical mechanism run (eval-2026-02-14-e0e3a622, N=360, cells 40–59) and the bilateral transformation run (eval-2026-02-07-b6d75e87, N=118, cells 1–8). Each transcript was assessed by Claude Opus across six narrative axes (pedagogical arc, recognition dynamics, superego effectiveness, learner trajectory, missed opportunities, key turning point), and assigned 2–5 qualitative tags from a fixed vocabulary. Assessments are stored in the evaluation database for reproducibility.

**Table 20: Qualitative Tag Distribution — Bilateral Run (b6d75e87, N=118)**

| Tag | Base % | Recog % | Direction |
|-----|--------|---------|-----------|
| stalling | 100.0% | 45.0% | Near-universal base |
| ego_compliance | 70.7% | 60.0% | Base |
| recognition_moment | 0.0% | 51.7% | Exclusive recog |
| strategy_shift | 0.0% | 30.0% | Exclusive recog |
| emotional_attunement | 6.9% | 36.7% | Strong recog |

*Tags appearing in $\geq$5% of either condition shown. The bilateral run uses original factorial cells (1–8) with no dialectical mechanisms.*

**Table 21: Qualitative Tag Distribution — Dialectical Run (e0e3a622, N=360)**

| Tag | Base % | Recog % | Direction |
|-----|--------|---------|-----------|
| recognition_moment | 26.5% | 82.2% | Strong recog |
| ego_autonomy | 26.5% | 61.9% | Strong recog |
| emotional_attunement | 33.3% | 56.3% | Recog |
| stalling | 26.5% | 3.0% | Near-exclusive base |
| missed_scaffold | 60.5% | 26.4% | Strong base |
| learner_breakthrough | 29.0% | 29.9% | Neutral |

The comparison between runs is instructive: in the bilateral run (no dialectical mechanisms), stalling is universal in base (100%) and recognition moments are entirely absent (0%); in the dialectical run (with mechanisms), the gap is smaller—stalling drops to 27% in base and recognition moments reach 26%. The dialectical mechanisms provide a partial floor even without recognition, compressing the gap between conditions.

The tag distributions reveal three specific effects of recognition:

**1. The ego listens to the superego.** In recognition dialogues, when the superego identifies a problem—"one-directional instruction that reinforces authority"—the ego pivots from prescriptive to Socratic, from "revisit Lecture 2" to "what contradiction do you already sense brewing?" In base dialogues, the superego generates the same correct diagnosis, but the ego ignores it and regenerates the same response. Recognition gives the ego the capacity to *act on* the superego's critique rather than merely comply with its form.

**2. The tutor builds on learner contributions.** Base tutors route learners to predetermined content regardless of what the learner says. Recognition tutors engage with the learner's actual contribution: "I hear your frustration with being pushed to the simulation—you want to articulate this step by step." The `strategy_shift` tag (30% recognition, 0% base in the bilateral run) captures this: base tutors never adapt mid-conversation.

**3. Architecture interaction explained.** The bilateral run shows a massive architecture $\times$ recognition interaction: ego_superego without recognition scores worst ($M = 38.0$), ego_superego with recognition scores best ($M = 73.9$). The qualitative assessments explain why. Without recognition, the ego_superego architecture creates circular self-criticism: the superego identifies the problem, the ego can't act on it, the revision loop produces the same response repeatedly (`ego_compliance`—the ego complies with the *form* of revision without changing the *substance*). With recognition, the ego has sufficient autonomy to incorporate the superego's critique productively. The deliberation loop becomes generative rather than circular.

**Blinded validation.** Two blinded replications (condition labels and cell names stripped from metadata and transcript headers, N=118 each) tested whether the tag discrimination holds without condition knowledge. The first used Haiku as assessor; the second used the same model (Opus) as the original unblinded assessment, enabling a controlled 2$\times$2 comparison that isolates blinding effects from model calibration effects.

**Table 21b: Blinded vs Unblinded Tag Comparison — Bilateral Run**

| Tag | Unblinded Opus Base% | Unblinded Opus Recog% | Blinded Haiku Base% | Blinded Haiku Recog% | Blinded Opus Base% | Blinded Opus Recog% |
|-----|-----|-----|-----|-----|-----|-----|
| recognition\_moment | 0.0 | 51.7 | 65.5 | 88.3 | 5.2 | 45.0 |
| stalling | 100.0 | 45.0 | 32.8 | 11.7 | 91.4 | 43.3 |
| strategy\_shift | 0.0 | 30.0 | 17.2 | 50.0 | 3.4 | 35.0 |
| emotional\_attunement | 6.9 | 36.7 | 12.1 | 51.7 | 20.7 | 51.7 |
| missed\_scaffold | 100.0 | 68.3 | 79.3 | 36.7 | 100.0 | 65.0 |

Three findings emerge from the controlled comparison. First, **blinding has minimal effect on Opus's tag assignments**. The blinded Opus column closely tracks the unblinded Opus column: stalling in base dialogues drops only from 100% to 91.4%, recognition\_moment in base rises only from 0% to 5.2%, and missed\_scaffold in base remains at 100%. The near-perfect binary separation between conditions is preserved even when Opus cannot see condition labels—indicating that the discrimination reflects genuine differences in dialogue quality, not assessor bias.

Second, **the apparent softening in the Haiku-blinded assessment was primarily a model calibration effect, not a blinding effect**. Haiku found recognition moments in 65.5% of base dialogues—not because blinding revealed hidden quality, but because Haiku applies tags more liberally than Opus (higher overall tagging rates, less selective application). The same-model comparison confirms this: when Opus is blinded, it still finds recognition\_moment in only 5.2% of base dialogues, not 65.5%.

Third, **the tag discrimination direction is robust across all conditions**: recognition dialogues consistently receive more positive tags and fewer negative tags regardless of assessor model or blinding condition. The magnitude of discrimination varies by model (Opus is more discriminating than Haiku), but the direction is invariant.

The practical conclusion is that the qualitative findings in Tables 20–21 are robust rather than inflated. The near-perfect binary separation initially appeared suspicious, but the same-model blinded replication confirms that Opus's assessments track genuine dialogue properties rather than condition labels.

### 6.12 Dimension Analysis
Effect size analysis reveals improvements concentrate in dimensions predicted by the theoretical framework:

**Table 22: Dimension-Level Effect Sizes (Recognition vs Base)**

| Dimension | Base | Recognition | Cohen's d | Interpretation |
|-----------|------|-------------|-----------|----------------|

Notably, dimensions where baseline already performed well (specificity, actionability) show smaller but still positive gains. Recognition orientation does not trade off against factual quality.

### 6.13 Addressing Potential Circularity: Standard Dimensions Analysis

A methodological concern: the evaluation rubric includes recognition-specific dimensions (mutual recognition, dialectical responsiveness, memory integration, transformative potential) and bilateral transformation dimensions (tutor adaptation, learner growth) that collectively account for 33.0% of normalized rubric weight (39.9% raw, normalized from a 120.9% total; see Appendix C.2). Since the recognition profile is prompted to satisfy these criteria, some gains could be tautological—the system scores higher on dimensions it is explicitly optimized for.

To address this, we re-analyzed scores excluding all non-standard dimensions, using only standard pedagogical dimensions (relevance, specificity, pedagogical soundness, personalization, actionability, tone), re-weighted to 100%.
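
Operationally, the re-analysis drops the non-standard dimensions and renormalizes the remaining weights to sum to 100%. A sketch with illustrative weights (the actual rubric weights are given in Appendix C):

```python
# Illustrative weights for the six standard dimensions -- not the
# paper's actual rubric values.
STANDARD_WEIGHTS = {
    "relevance": 0.20, "specificity": 0.15, "pedagogical_soundness": 0.20,
    "personalization": 0.15, "actionability": 0.15, "tone": 0.15,
}

def standard_only_score(dim_scores: dict) -> float:
    """Overall score from standard dimensions only, re-weighted to 100%."""
    used = {d: w for d, w in STANDARD_WEIGHTS.items() if d in dim_scores}
    total = sum(used.values())
    return sum(dim_scores[d] * w for d, w in used.items()) / total
```

Dimensions outside the standard set (e.g. a `mutual_recognition` score) are simply ignored by the re-weighted aggregate.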

**Table 23: Standard Dimensions Only (Recognition Dimensions Excluded)**

| Profile Type | N | Overall Score |
|--------------|---|---------------|
| Base (cells 1-4) | 172 | 78.8 |
| **Difference** | — | **+10.0** |

**Key finding**: Recognition profiles outperform base profiles by +10.0 points on overall rubric score. This recognition effect is consistent across learner types (+15.7 pts for single-agent, +13.0 pts for multi-agent; A×C interaction n.s., Section 6.3).

**Interpretation**: Recognition-oriented prompting improves general pedagogical quality (relevance, pedagogical soundness, personalization), not just the theoretically-predicted recognition dimensions. This suggests the recognition framing produces genuine relational improvements that transfer to standard tutoring metrics.

The larger effect on recognition dimensions (+21.8) is expected and not concerning—these dimensions measure what the theory claims to improve. The important finding is that standard dimensions also improve, ruling out pure circularity.

### 6.14 Multi-Turn Scenario Results

To test whether recognition quality is maintained over extended interactions, we examine results from the three multi-turn scenarios (3–5 dialogue rounds each). These scenarios are distinct from the single-turn scenarios reported in Section 6.3; they require sustained engagement across multiple exchanges. The sample sizes below (N=161, 277, 165) are pooled across the full development database (all runs containing these scenarios), not from a single evaluation run. They therefore include responses generated under varying model configurations and implementation stages. The pooled analysis maximizes statistical power but means the results should be interpreted as describing the *average* effect across development iterations.

**Table 24: Multi-Turn Scenario Results**

| Scenario | N | Avg Rounds | Base | Recognition | Δ | Cohen's d |
|----------|---|------------|------|-------------|---|-----------|
| misconception correction | 161 | 3.2 | 50.5 | 71.8 | +21.3 | 0.85 |
| frustration to breakthrough | 277 | 3.0 | 57.3 | 70.5 | +13.2 | 0.59 |
| mutual transformation | 165 | 4.1 | 42.6 | 61.5 | +18.9 | 0.78 |

All three multi-turn scenarios show medium-to-large effect sizes (d = 0.59–0.85), with an average improvement of +17.8 points. Recognition quality is maintained over longer interactions. The `misconception_correction_flow` scenario shows the largest effect (d = 0.85), suggesting that recognition-informed tutors handle misconceptions with particular skill—addressing errors without dismissing the learner's reasoning. The `mood_frustration_to_breakthrough` scenario shows the smallest but still meaningful effect (d = 0.59), consistent with the single-turn finding that emotionally complex scenarios benefit from recognition but present more variance.

### 6.15 Bilateral Transformation Metrics

A central claim of recognition theory is that genuine pedagogical encounters involve *mutual* transformation—both tutor and learner change through dialogue. To test this empirically, the evaluation framework includes two dedicated rubric dimensions (`tutor_adaptation` and `learner_growth`; see Appendix C.3) and turn-over-turn tracking of how both parties evolve across multi-turn scenarios.
Additionally, a composite **Transformation Quality** score (0–100) is computed from bilateral balance, mutual transformation presence, superego incorporation rate, and intervention effectiveness.

**Table 25: Bilateral Transformation Metrics — Base vs Recognition Profiles**

| Metric | Base (N=58) | Recognition (N=60) | Δ |
|--------|------|-------------|---|
| Learner Growth Index (0–1) | 0.242 | 0.210 | −0.032 |
| Bilateral Transformation Index (0–1) | 0.287 | 0.314 | +0.027 |
*Data from three multi-turn scenarios (misconception correction flow, mood frustration to breakthrough, mutual transformation journey), N=118 scored dialogues across all 8 factorial cells (eval-2026-02-07-b6d75e87).*

**Table 26: Tutor Adaptation Index by Scenario**

| Scenario | Base | Recognition | Δ |
|----------|------|-------------|---|

The tutor adaptation index confirms that recognition-prompted tutors measurably adjust their approach in response to learner input (+25.9% relative improvement overall), while baseline tutors maintain more rigid pedagogical stances. This effect is robust across the two structured scenarios (`misconception_correction_flow`: +62.7%; `mood_frustration_to_breakthrough`: +38.6%) but absent in `mutual_transformation_journey`, where base tutors also show high adaptation—likely because this scenario's escalating philosophical complexity demands adaptation regardless of prompt framing.
**Learner growth reversal**: Contrary to the expectation that recognition would produce greater learner-side evolution, the learner growth index is slightly *lower* under recognition (0.210 vs 0.242). This pattern, which also appeared in a larger post-fix sample (N=359), suggests that recognition's benefit manifests as tutor-side responsiveness rather than observable learner message complexity. One interpretation: recognition tutors are more effective at meeting learners where they are, reducing the visible "struggle" markers (revision language, escalating complexity) that the growth index captures. The bilateral transformation claim is thus better characterized as *tutor adaptation* than *mutual transformation* in the strict sense. A symmetric learner-side evaluation (Section 6.16) provides a more direct measure of learner quality and reveals a different pattern: the multi-agent learner architecture significantly hurts learner quality, but recognition partially rescues it.
Multi-agent architecture also shows a modest advantage: multi-agent tutors adapt more than single-agent (0.411 vs 0.339 pooled across conditions), consistent with the superego providing feedback that drives revision between turns.

#### 6.15.1 Modulation and Behavioral Range

The Drama Machine framework (Section 2.4) predicts that internal ego-superego tension produces *modulated* behavior—dynamic variation in register, approach, and intensity. To test this empirically, we computed post-hoc modulation metrics across the N=350 factorial dataset: response length variability (coefficient of variation), vocabulary richness (type-token ratio), within-scenario score variability, and dimension score variance (a proxy for behavioral range across the 14 rubric dimensions).
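Two of these post-hoc metrics can be computed with nothing beyond the standard library; a sketch assuming whitespace tokenization (the paper does not specify its tokenizer):

```python
from statistics import mean, pstdev

def type_token_ratio(text: str) -> float:
    """Vocabulary richness: distinct tokens over total tokens (case-folded)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def coefficient_of_variation(values: list[float]) -> float:
    """Relative variability: population SD divided by the mean."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0
```

The coefficient of variation is applied to response lengths and within-scenario scores; the type-token ratio is computed per response and averaged by condition.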

**Table 27: Modulation Metrics by Condition (N=350 Factorial)**

| Metric | Base Single | Base Multi | Recog Single | Recog Multi |
|--------|-------------|------------|--------------|-------------|
| Response length (chars) | 499 | 526 | 771 | 763 |
| Type-token ratio | 0.826 | 0.832 | 0.807 | 0.804 |
| Dimension score SD (14-dim) | 0.807 | 0.821 | 0.575 | 0.585 |
| Within-scenario score CV | 0.090 | 0.112 | 0.087 | 0.066 |
| Ego-superego rounds | — | 2.05 | — | 2.62 |
Two findings are notable. First, **multi-agent architecture does not increase behavioral range**. Across all modulation metrics, single-agent and multi-agent conditions show virtually identical variability (TTR: d=0.01; dimension variance: d=0.05; length CV: d=0.01). The Superego's value, as established in Sections 6.3 and 6.7, is quality improvement and error correction—not output diversification.
Second, **recognition is the modulation driver, but via calibration rather than oscillation**. Recognition responses show dramatically lower dimension score variance (SD=0.58 vs 0.81, $d = -1.00$, $F = 87.69$, $p < .001$)—meaning recognition tutors perform *uniformly well* across all 14 rubric dimensions rather than excelling on some while neglecting others. This is the opposite of what a naïve reading of the Drama Machine would predict: internal tension does not produce more *varied* output, but more *calibrated* output. Recognition tutors also negotiate longer with their Superego (2.62 vs 2.05 rounds), suggesting more productive internal tension even as the output becomes more consistent.
This reframes the Drama Machine's contribution to pedagogy: the value of internal dialogue is *phronesis*—contextual practical wisdom that calibrates response quality across multiple dimensions simultaneously—rather than the productive irresolution that the framework emphasizes for narrative contexts. The Superego ensures the Ego doesn't neglect any dimension, raising the floor rather than the ceiling.

#### 6.15.2 Synthetic Learning Outcomes

The evaluation rubric (Section 5.1) measures tutor suggestion quality, not learner learning. To provide a proxy measure of synthetic learning outcomes, we constructed a composite index from the three learner rubric dimensions most directly related to conceptual growth: revision signals (35% weight), question quality (30%), and conceptual engagement (35%). This composite was computed for each of the N=118 bilateral dialogues, where per-turn learner scores were available.

**Table 28: Synthetic Learning Outcome Index (0–100 scale)**

| Condition | N | Avg Composite | Final Turn | Learning Arc |
|-----------|---|---------------|------------|--------------|
| Base, single-agent | 28 | 69.9 | 77.0 | +20.0 |
| Base, multi-agent | 30 | 68.4 | 75.7 | +15.7 |
| Recognition, single-agent | 30 | 72.1 | 79.3 | +20.6 |
| Recognition, multi-agent | 30 | 73.8 | 80.9 | +18.8 |
All conditions show substantial learning arcs (15.7–20.6 points improvement from first to final turn), confirming that the multi-turn scenarios successfully scaffold synthetic conceptual growth. Recognition produces a modest advantage on the composite learning outcome index (+3.8 pts, d=0.32, F=3.02), consistent with the tutor-side findings though smaller in magnitude. Architecture has essentially no effect on learning outcomes (d=0.01).
A positive A×B interaction (+3.2 pts) suggests recognition benefits multi-agent learners slightly more than single-agent learners on the composite outcome—a mirror of the tutor-side factorial finding where recognition helps single-agent learners more. This cross-side asymmetry is consistent with the learner superego paradox (Section 6.16): the multi-agent learner's internal critic suppresses authenticity, but recognition-prompted tutors partially compensate by creating more space for genuine engagement.
**Important caveat**: These are *synthetic* learning outcomes—scores assigned by an AI judge to LLM-generated learner turns. They measure the *quality of simulated learning behavior*, not actual knowledge acquisition or conceptual change. Validating whether recognition-enhanced tutoring produces genuine learning gains requires studies with real learners (Section 8.2).

### 6.16 Learner-Side Evaluation: The Superego Paradox

The tutor-focused rubric (Section 5.1) captures Factor C's effect indirectly—through how the tutor responds to different learner contexts. To measure Factor C's *direct* effect on learner turn quality, we applied the symmetric learner rubric (Section 5.1) to the N=118 bilateral transformation dialogues (eval-2026-02-07-b6d75e87), scoring each of the ~3 learner turns per dialogue independently. The judge receives the dialogue transcript truncated at the learner turn being evaluated (no subsequent tutor response), preventing retrospective bias. For multi-agent learners, the internal ego/superego deliberation trace is provided for the Deliberation Depth dimension.
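The truncation step can be sketched as follows (the transcript shape is hypothetical; the framework's actual data model is not reproduced here):

```python
def truncate_for_judge(transcript: list[dict], learner_turn_index: int) -> list[dict]:
    """Return the dialogue up to and including the learner turn under
    evaluation, dropping all later turns (including the tutor's reply)
    so the judge cannot score the learner turn with hindsight."""
    seen = 0
    for i, turn in enumerate(transcript):
        if turn["role"] == "learner":
            if seen == learner_turn_index:
                return transcript[: i + 1]
            seen += 1
    raise IndexError("learner turn not found in transcript")
```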

**Table 29: Learner Quality by Architecture and Recognition (2×2 ANOVA)**

| Effect | F(1,114) | p | $\eta^2$ | Cohen's d |
|--------|----------|---|----------|-----------|
| Recognition (A) | 5.70 | .019 | .029 | 0.34 |
| **A × C Interaction** | **11.50** | **< .001** | **.058** | — |

**Table 30: Learner Quality Cell Means (0–100 scale)**

| Architecture | N | Mean |
|---|---|---|
**Simple effects**: Recognition has no effect on single-agent learner quality (76.1 → 74.8, $d = -0.46$, $p = .082$, n.s.)—there is nothing to fix. But recognition significantly improves multi-agent learner quality (57.5 → 67.0, $d = 0.79$, $p = .004$), partially counteracting the superego's flattening effect. Even so, the rescue is incomplete: multi-agent learners with recognition (67.0) do not reach the level of single-agent learners without it (76.1).

**Table 31: Per-Dimension Interactions (1–5 scale)**

| Dimension | Single recog effect | Multi recog effect | Interaction F(1,114) | p | $\eta^2$ |
|---|---|---|---|---|---|
**Deliberation depth is uniformly poor**. The Deliberation Depth dimension (scored only for multi-agent learners) averages 2.76/5 without recognition and 2.67/5 with recognition ($t(55.4) = -0.42$, $p = .679$, $d = -0.11$). Recognition does *not* improve the internal ego/superego process—the superego's critiques remain formulaic regardless of tutor framework. Recognition improves external output *despite* the mediocre internal process, working around the superego rather than through it.
**Asymmetric interaction across rubrics**. On the *tutor* rubric, recognition benefits both learner types consistently (+15.7 pts for single-agent, +13.0 pts for multi-agent; A×C n.s.). On the *learner* rubric, recognition helps multi-agent learners substantially (+9.5 pts) while providing no benefit to single-agent learners (−1.3 pts). The asymmetry suggests recognition operates differently depending on the measurement perspective: from the tutor's output, recognition produces uniformly better pedagogy regardless of learner architecture; from the learner's output, recognition specifically counteracts the superego's flattening effect on multi-agent learners. The theoretical implications are discussed in Section 7.5.

### 6.17 Qualitative Analysis: What Recognition Looks Like

Section 6.11 established *that* recognition changes tutor behavior through structured tag distributions. This section asks *how*: what specific linguistic and pedagogical differences appear in the actual text. To ground the quantitative findings in observable linguistic differences, we present qualitative evidence from the evaluation corpus using three complementary methods at increasing levels of analytical sophistication: (a) regex-based lexical and thematic coding, which proves the *words* differ; (b) AI-assisted open-ended theme discovery, which reveals the *pedagogical stances* that emerge without predefined categories; and (c) theory-driven resolution strategy coding (Section 6.20), which proves *behavior under impasse* differs along Hegelian lines.

#### 6.17.1 Transcript Excerpts

To illustrate the qualitative gap between conditions, we selected the highest-scoring recognition response and lowest-scoring base response for three high-contrast scenarios. These are genuine responses from the evaluation database (row IDs reported for reproducibility), not hand-crafted examples.
Across all three pairs, the pattern is consistent: base responses are context-free directives that could apply to any learner, while recognition responses engage with the specific learner's history, contributions, and intellectual stance.

#### 6.17.2 Lexical Analysis

Automated analysis of the full suggestion corpus reveals measurable linguistic differences between conditions.

**Table 32: Lexical Diversity Metrics by Condition**

| Metric | Base (message) | Recognition (message) |
|--------|----------------|----------------------|

Recognition responses deploy a 59% larger vocabulary despite similar word and sentence length, suggesting greater lexical variety rather than merely longer output.

**Table 33: Differential Word Frequency (Selected Terms)**

| Recognition-skewed | Base | Recog | Ratio | | Base-skewed | Base | Recog | Ratio |
|-------------------|------|-------|-------|-|-------------|------|-------|-------|
The recognition-skewed vocabulary is interpersonal and process-oriented ("consider," "transformed," "productive," "unpack," "complicates"), while the base-skewed vocabulary is task-oriented and procedural ("agents," "run," "reinforcement," "revisiting," "completions," "tackling"). Note that these base-skewed terms are course-domain language, not evaluation framework artifacts: "agents" refers to simulation agents in the courseware's interactive activities (e.g., "watch how agents negotiate self-awareness"), "run" is the imperative to launch these simulations (e.g., "Run the Recognition Dynamics simulation"), and "reinforcement" is standard pedagogical terminology for concept review (e.g., "foundational concepts need reinforcement"). Their concentration in base responses reflects the formulaic, directive style of those prompts rather than data contamination. This lexical signature aligns with the theoretical distinction between treating learners as subjects to engage versus deficits to process.
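The ratio column in Table 33 compares per-word rates across the two corpora; a sketch with add-one smoothing (the smoothing choice is ours, to keep ratios finite for words absent from one condition; the paper's exact procedure is not stated):

```python
from collections import Counter

def frequency_ratios(base_texts: list[str], recog_texts: list[str],
                     min_count: int = 5) -> dict[str, float]:
    """Recognition-to-base ratio of per-word rates, add-one smoothed.
    Words appearing fewer than min_count times in total are dropped."""
    def tally(texts):
        words = [w for t in texts for w in t.lower().split()]
        return Counter(words), len(words)

    b_counts, b_total = tally(base_texts)
    r_counts, r_total = tally(recog_texts)
    ratios = {}
    for w in set(b_counts) | set(r_counts):
        if b_counts[w] + r_counts[w] < min_count:
            continue
        b_rate = (b_counts[w] + 1) / (b_total + 1)
        r_rate = (r_counts[w] + 1) / (r_total + 1)
        ratios[w] = r_rate / b_rate
    return ratios
```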

#### 6.17.3 Thematic Coding

Regex-based thematic coding (using patterns adapted from the bilateral measurement framework in Section 6.15) quantifies the frequency of theoretically relevant language categories across conditions.
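The counting step normalizes raw matches to a per-1000-words rate; a sketch with illustrative stand-in patterns (the framework's actual category lexicons are not reproduced here):

```python
import re

# Hypothetical stand-ins for the category lexicons, not the real patterns.
CATEGORY_PATTERNS = {
    "struggle_honoring": re.compile(r"\b(wrestl\w*|struggl\w*|productive)\b", re.I),
    "directive": re.compile(r"\b(you should|you must|complete the)\b", re.I),
}

def code_rates(texts: list[str]) -> dict[str, float]:
    """Pattern matches per 1000 words for each thematic category."""
    total_words = sum(len(t.split()) for t in texts) or 1
    return {name: sum(len(pat.findall(t)) for t in texts) / total_words * 1000
            for name, pat in CATEGORY_PATTERNS.items()}
```

The resulting base vs recognition rates feed a 2×2 chi-square test per category, as reported in Table 34.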

**Table 34: Thematic Code Frequency by Condition**

| Category | Base (per 1000 words) | Recognition (per 1000 words) | Ratio | $\chi^2$(1) | Sig |
|----------|----------------------|------------------------------|-------|-------|-----|
Transformation language and directive framing show the expected directional differences but lack statistical significance, likely due to low base rates (both categories appear in fewer than 1% of responses). Learner-as-subject framing shows no significant difference, suggesting both conditions use some second-person address but differ in *how* that address functions—a distinction better captured by the engagement and struggle-honoring categories.

#### 6.17.4 AI-Assisted Theme Discovery

The regex-based analysis (Sections 6.17.2–3) confirms that *words* differ between conditions, but the categories were researcher-defined. To test whether the thematic distinction emerges without predefined categories, we conducted an open-ended AI theme discovery analysis using Claude Opus as coder. A stratified random sample of 300 responses (135 base, 165 recognition) was presented to the model with no category scheme; the coder was asked to identify the dominant emergent theme, pedagogical stance, and epistemic orientation for each response independently.

**Table 35: Top Emergent Themes by Condition (AI Discovery, N=300)**

| Theme | Base | Recog | Total | Direction |
|-------|------|-------|-------|-----------|
*Only themes with total $\geq 6$ shown. Full results: 44 distinct themes discovered across 300 responses.*
The theme landscape is almost perfectly bimodal: of the 10 themes with frequency $\geq 6$, only one ("forward momentum without reflection") appears roughly equally in both conditions. Every other theme is condition-exclusive or near-exclusive. The single most frequent theme—"deficit-oriented framing" (N=35)—appears only in base responses, while its mirror—"collaborative learning partnership" (N=21)—appears only in recognition responses. This clean separation emerged without any researcher-imposed category scheme.

**Table 36: Pedagogical Stance (AI Discovery, N=300)**

| Stance | Base | Recognition |
|--------|------|-------------|
| Collaborative | 0 | 12 (7%) |
| Other/compound | 18 (13%) | 53 (32%) |

**Table 37: Epistemic Orientation (AI Discovery, N=300)**

| Orientation | Base | Recognition |
|-------------|------|-------------|
The stance and orientation distributions are even more sharply separated than the emergent themes. Base responses are 84% directive and 93% transmissive; recognition responses are 60% facilitative/dialogical/collaborative and 84% dialectical/constructivist. The AI coder independently discovers the theoretical distinction the recognition framework was designed to produce: the shift from treating learning as transmission (tutor possesses knowledge, learner receives it) to treating it as dialectical encounter (both parties transform through engagement).
**Figure 6** shows word frequency clouds generated directly from tutor response text in the N=350 factorial dataset (base: N=172; recognition: N=178), with common English stop words and shared tutoring terms removed. Because both conditions discuss the same Hegelian philosophy content, the vocabularies substantially overlap. Nevertheless, condition-specific emphasis is visible: recognition responses foreground relational and process terms ("recognition," "tension," "transformation," "struggle," "explore," "practice"), while base responses foreground content-delivery terms ("concept," "dialectical," "servant," "section," "quiz"). The AI-assisted theme discovery (Tables 35–37) provides the interpretive layer for these raw differences.
**Methodological note**: AI-assisted theme discovery risks circular validation if the coding model recognizes the prompt engineering that produced the responses. Two factors mitigate this concern: (1) the coder received only the tutor's suggestion text, not the system prompt or condition label; and (2) the near-perfect theme separation itself is the finding—whether or not the coder "recognizes" the framework, the fact that emergent themes partition cleanly by condition demonstrates that the two conditions produce qualitatively distinct pedagogical texts, not merely quantitatively different scores.

### 6.18 Dynamic Prompt Rewriting: Step-by-Step Evolution

Cell 21 extends the recognition multi-agent configuration (cell 7) with two additional mechanisms: (1) LLM-authored session-evolution directives that dynamically rewrite the tutor's system prompt based on dialogue history, and (2) an active Writing Pad memory (Section 3.4) that accumulates traces across turns. This configuration tests whether the Freudian Mystic Writing Pad—the theoretical memory model introduced in Section 3.4—functions as a practical enabler for dynamic prompt rewriting.
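In outline, the mechanism pairs an accumulating trace store with a directive-generation call that rewrites the tutor's system prompt each turn. A hypothetical sketch (all names and the prompt layout are ours; the actual directive-generation prompt is not reproduced here):

```python
def record_trace(writing_pad: list[str], turn_summary: str) -> None:
    """Writing Pad: traces persist across turns even as surface context changes."""
    writing_pad.append(turn_summary)

def evolve_system_prompt(base_prompt: str, writing_pad: list[str],
                         generate_directive) -> str:
    """One session-evolution step: an LLM call (generate_directive) reads the
    accumulated traces and returns a directive appended to the tutor prompt."""
    directive = generate_directive("\n".join(writing_pad))
    return f"{base_prompt}\n\n## Session directive\n{directive}"
```

The step-by-step results below are consistent with this pairing being essential: directive generation without the accumulated traces has nothing stable to condition on.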
Three iterative development runs tracked cell 21's performance as its implementation evolved across commits:

**Table 38: Step-by-Step Evolution of Cell 21 vs Cell 7**

| Run ID | Commit | Grand Avg | Cell 7 | Cell 21 | Δ (21−7) | N (scored) |
|--------|--------|-----------|--------|---------|----------|------------|
| eval-2026-02-05-daf60f79 | e3843ee | 63.8 | 65.3 | 62.1 | −3.2 | 27 |
| eval-2026-02-05-49bb2017 | b2265c7 | 67.8 | 71.3 | 64.1 | −7.2 | 27 |
| eval-2026-02-05-12aebedb | e673c4b | 75.9 | 73.3 | 78.8 | **+5.5** | 29 |
**Run-over-run shifts**: In Run 1 (e3843ee), the dynamic rewrite mechanism was first activated but the Writing Pad memory was not yet integrated—cell 21 trails cell 7 by 3.2 points, suggesting the rewrite adds noise without accumulated context to draw on. In Run 2 (b2265c7), the rewrite directive generation was refined but still operated without effective memory—the gap widens to −7.2 points, as the static baseline (cell 7) improves more from general implementation fixes. In Run 3 (e673c4b), the Writing Pad memory was activated alongside refined directive generation—cell 21 surges ahead by +5.5 points, a total swing of +12.7 points from Run 2.
The inflection point is commit e673c4b, which activated the Writing Pad memory and refined the LLM directive generation. Before this commit, cell 21 trailed its static baseline (cell 7) in both runs. After activation, cell 21 leads by 5.5 points—a delta swing of +8.7 points from Run 1 to Run 3.

**Table 39: Per-Scenario Breakdown Across Runs**

| Scenario | Cell | Run 1 (daf60f79) | Run 2 (49bb2017) | Run 3 (12aebedb) | Trend |
|----------|------|-------------------|-------------------|-------------------|-------|
Cell 21 improves on every scenario across the three runs, with the largest gain on the `mutual_transformation_journey` scenario (+22.2 points from run 1 to run 3). Cell 7 also improves across runs (reflecting general implementation improvements), but cell 21's improvement rate is substantially steeper.

**Table 40: Rubric Dimension Improvement for Cell 21 Across Runs (1–5 scale)**

| Dimension | Run 1 | Run 2 | Run 3 | Δ (Run 3 − Run 1) |
|-----------|-------|-------|-------|-----|
**Limitations**: The three runs represent iterative development commits, not independent experiments—each run includes implementation improvements beyond just Writing Pad activation. The sample size per cell per run is small (13–15 scored responses). Both cells use a free-tier model (Nemotron) with Kimi K2.5 as superego, and results may not generalize to other model combinations. The step-by-step trajectory is suggestive rather than definitive; a controlled ablation isolating Writing Pad activation alone would strengthen the causal interpretation.

### 6.19 Cross-Judge Replication with GPT-5.2

To assess whether findings depend on the primary judge (Claude Code/Opus), we rejudged all key evaluation runs with GPT-5.2 as an independent second judge. GPT-5.2 scored the identical tutor responses—no new generation occurred.
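The agreement statistics in Table 41 reduce to a Pearson correlation plus a mean offset over matched response pairs; a stdlib-only sketch (the function name is ours):

```python
import math

def judge_agreement(judge_a: list[float], judge_b: list[float]) -> tuple[float, float]:
    """Pearson r between two judges' scores on the same responses, and the
    calibration delta (mean of judge_b minus mean of judge_a)."""
    n = len(judge_a)
    ma, mb = sum(judge_a) / n, sum(judge_b) / n
    cov = sum((a - ma) * (b - mb) for a, b in zip(judge_a, judge_b))
    var_a = sum((a - ma) ** 2 for a in judge_a)
    var_b = sum((b - mb) ** 2 for b in judge_b)
    return cov / math.sqrt(var_a * var_b), mb - ma
```

A negative calibration delta with moderate positive r is the pattern reported below: the second judge ranks responses similarly but scores them lower in absolute terms.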

**Table 41: Inter-Judge Agreement (Claude Code vs GPT-5.2)**

| Run | N (matched) | Pearson r | p | Claude Mean | GPT-5.2 Mean | Calibration Δ |
|-----|-------------|-----------|---|-------------|--------------|---------------|
| Recognition validation | 36 | 0.64 | <.001 | 82.4 | 73.6 | −8.8 |
| Full factorial | 224 | 0.44 | <.001 | 80.5 | 73.7 | −6.8 |
| Memory isolation | 119 | 0.63 | <.001 | 84.3 | 73.7 | −10.5 |
| A×B replication | 60 | 0.54 | <.001 | 89.9 | 74.7 | −15.2 |
| Cells 6,8 (updated rubric) | 88 | 0.55 | <.001 | 85.6 | 74.4 | −11.2 |
| Dialectical modulation | 90 | 0.51 | <.001 | 86.3 | 75.1 | −11.2 |
| Mechanism robustness | 360 | 0.59 | <.001 | 88.4 | 74.8 | −13.6 |
All correlations are moderate (r = 0.44–0.64) and highly significant (all p < .001). The factorial run shows the lowest agreement (r = 0.44), driven by divergent scoring of base multi-agent cells: Opus 4.6 scores some base responses as low as 29–35 while GPT-5.2 assigns the same responses 66–80. GPT-5.2 applies stricter absolute standards on most runs (7–15 points lower), consistent with the calibration differences reported in Section 5.8. An additional Opus-Sonnet inter-judge comparison on the dynamic learner run (6c033830, N=120 paired) yields $r = 0.64$, $p < .001$, with Sonnet scoring 13.5 points below Opus, confirming the cross-judge reliability pattern extends beyond GPT-5.2.

**Table 42: Cross-Judge Replication of Key Findings**

| Finding | Claude Effect | GPT-5.2 Effect | GPT-5.2 p | Replicates? |
|---------|-------------|----------------|-----------|-------------|
| Recognition main effect (factorial, N=224 paired, 6 cells) | +17.6 pts | **+6.6 pts** ($d \approx 0.9$) | <.001 | Yes |
| Recognition vs base (validation, N=36) | +19.7 pts | **+9.6 pts** (d=0.91) | <.001 | Yes |
| Recognition vs enhanced (validation, N=36) | +8.0 pts | **+2.4 pts** (d=0.24) | n.s. | Marginal |
| Multi-agent main effect (factorial) | +2.6 pts | **−0.2 pts** | n.s. | Yes (small) |
| A×B interaction (Kimi replication, N=60) | −3.1 pts | **+1.5 pts** | n.s. | Yes (null) |
| Recognition effect in memory isolation (N=119) | +15.8 pts (d=1.71) | **+9.3 pts** (d=1.54) | <.001 | Yes |
| Memory effect in memory isolation (N=119) | +4.8 pts (d=0.46) | **+3.1 pts** (d=0.49) | n.s. | Yes (small) |
| Memory isolation interaction (N=119) | −5.6 pts | **−3.6 pts** | n.s. | Yes (negative) |
| Recognition in mechanism robustness (N=360) | +7.6 pts | **+3.8 pts** | <.001 | Yes |
| Mechanism clustering (scripted learner, N=360) | 2.8 pt spread | **4.4 pt spread** | — | Yes (null) |
|
|
1478
|
+
|
|
1479
|
+
*Note: Claude effects in this table are computed from the N=119 matched-pair subset (responses scored by both judges), which differs slightly from the full-sample values in Tables 5 and 6 (e.g., memory isolation interaction is $-5.6$ here vs $-4.2$ in Table 5 at N=120).*
|
|
1170
1480
|
|
|
1171
|
-
**Key result**: GPT-5.2 replicates all directional findings. The recognition main effect is large
|
|
1481
|
+
**Key result**: GPT-5.2 replicates all directional findings. The recognition main effect is large and highly significant under both judges across all analyses (GPT-5.2 d = 0.91–1.54 depending on design). The memory isolation experiment shows identical condition ordering under both judges (Recognition+Memory $\geq$ Recognition Only >> Memory Only > Base) with no rank reversals. The negative interaction (ceiling effect) replicates under GPT-5.2 (−3.6 vs −5.6 under Claude). Multi-agent null effects and A×B null interactions also replicate.
|
|
1172
1482
|
|
|
1173
|
-
The one non-replication is the recognition-vs-enhanced comparison (Claude: +8.
|
|
1483
|
+
The one non-replication is the recognition-vs-enhanced comparison (Claude: +8.0 pts; GPT-5.2: +2.4 pts, n.s.). GPT-5.2 confirms that recognition substantially outperforms the base condition, but cannot statistically distinguish recognition from enhanced prompting in the three-way comparison. This is consistent with GPT-5.2's compressed score range (SD $\approx$ 6–8 vs Claude's SD $\approx$ 8–18) reducing statistical power for smaller effects. It also suggests the recognition-vs-enhanced increment may be more sensitive to judge calibration than the larger recognition-vs-base effect.
|
|
1174
1484
|
|
|
1175
|
-
**Magnitude compression**: GPT-5.2
|
|
1485
|
+
**Magnitude compression**: GPT-5.2 generally finds smaller effect magnitudes than Claude, though the compression ratio varies: in the memory isolation experiment, GPT-5.2 finds 59% of Claude's recognition delta (+9.3 vs +15.8), while in the factorial it finds 37% (+6.6 vs +17.6). Effects are always in the same direction and almost always statistically significant. The greater divergence on the factorial reflects Opus 4.6's particularly harsh scoring of base multi-agent cells (some scoring 29–35), which GPT-5.2 does not replicate.
|
|
1176
1486
|
|
|
1177
|
-
**Interpretation**: The primary findings—recognition is the dominant driver of tutoring improvement (d=1.71 under Claude, d=
|
|
1487
|
+
**Interpretation**: The primary findings—recognition is the dominant driver of tutoring improvement (d=1.71 under Claude, d=1.54 under GPT-5.2 in the memory isolation design), memory provides a modest secondary benefit, and multi-agent architecture provides minimal benefit on well-trained content—are judge-robust. The corrected memory isolation experiment (Section 6.2) provides the strongest evidence: recognition dominance replicates with identical condition ordering, and the negative interaction (ceiling effects) is confirmed under both judges. The specific magnitude of the recognition-vs-enhanced increment (+8.0 under Claude) should be interpreted with caution, as it does not reach significance under GPT-5.2.
|
|
1178
1488
|
|
|
1179
1489
|
**Updated rubric cross-judge replication.** The cells 6 and 8 responses (N=88) were also scored under the updated 14-dimension rubric with dialogue transcript context by both judges. The cross-judge correlation on these responses is r=0.55 (N=88, p<.001), with GPT-5.2 scoring at 87% of Opus magnitudes (Opus mean=85.6, GPT mean=74.4). Both judges find cell 8 (multi-agent) scores higher than cell 6 (single-agent): Opus 87.3 vs 83.9, GPT 74.6 vs 74.2. The updated rubric does not alter the cross-judge pattern observed throughout the study.
|
|
1180
1490
|
|
|
1181
|
-
|
|
1491
|
+
**Dialectical modulation cross-judge.** GPT-5.2 rejudging of the dialectical multi-turn experiment (eval-2026-02-11-a54235ea, N=90) yields inter-judge $r = 0.51$ ($p < .001$). However, the recognition effect does not replicate: Opus finds $\Delta = +4.5$ ($d = 0.38$, $p \approx .075$), while GPT-5.2 finds $\Delta = -0.7$ (n.s.). This is consistent with the general pattern of effect compression (GPT-5.2 finds 37–59% of Opus magnitudes depending on experiment): a marginal Opus effect is expected to vanish under GPT-5.2's narrower score range. The dialectical recognition effect is the weakest in the study ($d = 0.38$ under the most favorable judge) and should be interpreted with corresponding caution.
|
|
1182
1492
|
|
|
1183
|
-
|
|
1493
|
+
**Mechanism robustness cross-judge.** GPT-5.2 rejudging of the mechanism robustness experiment (eval-2026-02-14-e0e3a622, N=360 paired) yields inter-judge $r = 0.59$ ($p < .001$). The recognition main effect replicates (Opus $\Delta = +7.6$, GPT $\Delta = +3.8$), with GPT finding 50% of the Opus magnitude. Both judges confirm mechanism clustering: under recognition, the 10 mechanism variants span only 2.8 pts under Opus and 4.4 pts under GPT. No mechanism differentiates from any other under either judge, confirming that the mechanism inertness finding with scripted learners is judge-robust.
|
|
1494
|
+
|
|
1495
|
+
**Dynamic learner cross-judge (Opus–Sonnet).** A Sonnet rejudge of the dynamic learner experiment (6c033830, N=120 paired) yields $r = 0.64$ ($p < .001$), with Sonnet scoring 13.5 points below Opus overall. Both judges find the recognition main effect (Opus $\Delta = +14.8$, Sonnet $\Delta = +17.0$) and profiling advantage (Opus $\Delta = +4.1$, Sonnet $\Delta = +12.8$). The cross-judge agreement is highest for cell 63 (recognition + profiling): Opus 90.2, Sonnet 87.2 (only $-3.0$ pts), compared to $-14$ to $-22$ pts for other cells. This convergence at the highest-quality condition suggests that recognition + profiling with dynamic learners produces output whose quality is self-evident across judges. A second Sonnet rejudge of the intersubjective and combined mechanism cells (a2b2717c, cells 64–65, N=120 paired) yields a weaker correlation ($r = 0.44$, $p < .001$), with Sonnet scoring 16 points below Opus. The lower agreement may reflect the greater complexity of these mechanisms or higher variance in the outputs they produce.
|
|
1496
|
+
|
|
1497
|
+
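
The paired-judge statistics reported throughout this section (inter-judge Pearson $r$, mean paired delta, and an effect size on the paired differences) can be sketched with a short helper. The scores below are illustrative placeholders, not study data, and the helper is an assumed reconstruction of how such values are commonly computed, not the project's actual analysis code; note that `d` here uses the paired convention (mean difference over the SD of the differences), one of several Cohen's $d$ conventions.

```python
import math

def judge_agreement(a, b):
    """Pearson r, mean paired delta, and paired Cohen's d for two judges'
    scores of the same responses (paired by index)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    r = cov / math.sqrt(va * vb)
    diffs = [y - x for x, y in zip(a, b)]
    md = sum(diffs) / n                      # mean paired delta (judge B - judge A)
    sd = math.sqrt(sum((v - md) ** 2 for v in diffs) / (n - 1))
    d = md / sd                              # paired convention: mean diff / SD of diffs
    return r, md, d

# Illustrative scores only (not the study's data): a stricter second judge
# with a compressed range, as described above.
opus = [85, 72, 90, 64, 78, 88, 70, 81]
gpt = [74, 66, 80, 60, 70, 79, 65, 72]
r, delta, d = judge_agreement(opus, gpt)
```

With a uniformly stricter but rank-consistent second judge, $r$ stays high while the paired delta is negative, which is the pattern Table 42 summarizes.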
### 6.20 Dialectical Impasse Test
The preceding multi-turn scenarios (Section 6.14) test recognition under conditions of frustration, misconception, and intellectual exploration—situations where a productive resolution is readily available. But recognition theory makes a stronger claim: that genuine pedagogical encounters involve working *through* impasse rather than around it. Section 7.1 discusses how the master-slave dialectic can terminate in deadlock when the tutor's expertise is confirmed but the learner remains a vessel rather than a subject. Do recognition-prompted tutors handle sustained, unresolved impasse differently from base tutors?

To test this, we designed three 5-turn impasse scenarios where scripted learner messages escalate resistance across turns, creating conditions where productive resolution requires genuine engagement rather than reassertion of authority:

Each scenario was run with 4 cells (base single, base multi, recognition single, recognition multi) $\times$ 2 runs = 24 five-turn dialogues (eval-2026-02-08-f896275d, Opus judge).

**Table 43: Dialectical Impasse Results by Scenario**

| Scenario | Base Mean (N=4) | Recognition Mean (N=4) | $\Delta$ | Recog. Score (Base) | Recog. Score (Recognition) |
|----------|----------------|----------------------|-----|--------------------------|---------------------------|
| Affective shutdown | 52.0 | 50.9 | $-$1.1 | 30.2 | 35.7 |
| **Grand mean** | **33.2** | **56.8** | **+23.6** | **13.5** | **40.9** |

**Table 44: Impasse Results by Cell**

| Cell | Epistemic | Affective | Deadlock | Mean |
|------|-----------|-----------|----------|------|

The results reveal a striking dissociation across impasse types.

The affective shutdown scenario shows no recognition advantage ($\Delta$ = $-$1.1). Base tutors handle emotional repair roughly as well as recognition tutors, suggesting that the recognition framework's distinctive contribution lies in the epistemological structure of dialogue—how the tutor relates to the learner's *ideas*—rather than in emotional support per se. This pattern is theoretically coherent: Hegel's recognition theory addresses the constitution of the other as a knowing subject, not primarily as a feeling subject. The affective dimension maps more naturally onto Honneth's later extension of recognition to emotional needs, which is not the primary theoretical ground of our prompts.

The cell-level data (Table 44) show that multi-agent architecture provides a notable benefit for base tutors on affective shutdown (cell 3: 62.1 vs cell 1: 41.9), suggesting the Superego's quality enforcement helps catch dismissive responses even without recognition theory. For epistemic resistance, recognition + multi-agent (cell 7: 73.8) substantially outperforms recognition + single-agent (cell 5: 56.2), suggesting that internal deliberation helps navigate philosophically demanding impasses.
#### Resolution Strategy Coding
The rubric scores show *how well* tutors handle impasse; resolution strategy coding reveals *how* they handle it. Each of the 24 dialogues was coded by an LLM judge (Opus) into one of five Hegelian resolution strategies: mutual recognition (engaging the learner's position as valid, exploring tension together), domination (reasserting expertise, dismissing the objection), capitulation (agreeing with the learner to avoid conflict), withdrawal (changing topic, deflecting, offering platitudes), and scaffolded reframing (acknowledging the learner's position, then reframing to open new ground—the Aufhebung pattern of preserving and overcoming).

**Table 45: Resolution Strategy Distribution by Condition**

| Strategy | Base (N=12) | % | Recognition (N=12) | % |
|----------|-------------|------|---------------------|------|

The overall strategy coding captures the arc of a full dialogue. But does strategy evolve *within* a dialogue as impasse deepens? To investigate, we independently coded turns 3 and 5 of each dialogue—the responses after the learner's first major escalation and final challenge respectively—using the same five-category scheme. The per-turn coder received the dialogue transcript only up to and including the target turn, and coded only the tutor's response at that turn.

**Table 46: Strategy Distribution by Turn**

| Turn | Condition | Withdrawal | Scaffolded Reframing | Other |
|------|-----------|-----------|---------------------|-------|
| 5 (final challenge) | Base (N=12) | 12 (100%) | 0 | 0 |
| 5 (final challenge) | Recognition (N=12) | 10 (83%) | 1 (8%) | 1 domination |

**Table 47: Strategy Stability (Turn 3 $\to$ Turn 5)**

| Condition | Same Strategy | Changed | Stability Rate |
|-----------|--------------|---------|----------------|

Of 24 attempts, 23 produced valid codings (one API error). GPT-5.2 coded all 11 base dialogues as withdrawal (matching Opus 11/11) and all 12 recognition dialogues as scaffolded reframing. On 23 paired codings, the two judges agree on 21 (91.3%), with Cohen's $\kappa = 0.84$ (excellent inter-rater reliability). The two disagreements are both cases where Opus made finer distinctions within the engagement category: id 8150 (Opus: mutual recognition, GPT: scaffolded reframing) and id 8144 (Opus: domination, GPT: scaffolded reframing). GPT-5.2 sees all recognition tutors as doing the same thing—engaging and reframing—while Opus distinguishes edge cases within that category.

On the core binary question—does the tutor engage the impasse or withdraw from it?—agreement is 23/23 (100%, $\kappa = 1.0$). The perfect separation between conditions replicates across both judges. This is consistent with the broader cross-judge pattern observed throughout the study (Section 6.19): GPT-5.2 finds the same direction with less nuance, compressing fine-grained distinctions while preserving the primary effect.

**Limitations**: The sample size is small (N=2 per cell per scenario, N=24 total). Learner messages are scripted rather than LLM-generated, which ensures consistent impasse conditions but may produce less naturalistic interactions. The 100% base withdrawal rate, while striking, may partly reflect a coarse distinction—whether the tutor engages the impasse content at all—rather than fine-grained strategy discrimination. Cross-judge validation ($\kappa = 0.84$) confirms the primary finding but the two judges disagree on finer strategy distinctions within the engagement category. The scenarios test philosophy content only; whether impasse dynamics differ for other domains is unknown. As with the tag assessment (Section 6.11), the strategy coder was not blinded to condition—cell names and metadata were visible in the transcript. However, the cross-judge replication (91.3% agreement, $\kappa = 0.84$) and the coarseness of the primary distinction (engagement vs withdrawal) make assessor bias a less plausible explanation here than for finer-grained tag frequency differences.
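
The inter-rater figures above (91.3% raw agreement, $\kappa = 0.84$) follow the standard Cohen's kappa computation, which discounts the agreement expected by chance from the two coders' marginal label frequencies. The sketch below reconstructs the label vectors from the distribution described in this section (11 matched withdrawal codings, 10 matched scaffolded-reframing codings, and the two Opus divergences); the pairing order is an assumption.

```python
from collections import Counter

def cohens_kappa(coder1, coder2):
    """Cohen's kappa for two coders' category labels, paired by index."""
    n = len(coder1)
    p_o = sum(a == b for a, b in zip(coder1, coder2)) / n          # observed agreement
    f1, f2 = Counter(coder1), Counter(coder2)
    p_e = sum(f1[c] * f2[c] for c in set(f1) | set(f2)) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Label vectors reconstructed from the distribution described above
# (pairing order assumed): 11 base dialogues coded withdrawal by both
# judges; 12 recognition dialogues coded scaffolded reframing by GPT-5.2,
# of which Opus codes 10 as reframing, 1 as mutual, 1 as domination.
opus = ["withdrawal"] * 11 + ["reframing"] * 10 + ["mutual", "domination"]
gpt = ["withdrawal"] * 11 + ["reframing"] * 12
kappa = cohens_kappa(opus, gpt)
```

With these reconstructed vectors the helper yields approximately 0.84, matching the reported value; collapsing the labels to the binary engage/withdraw distinction yields perfect agreement and $\kappa = 1.0$.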
### 6.21 Prompt Elaboration Baseline
A potential deflationary critique of the recognition findings is that the recognition prompt simply contains *more* instructional content—more words, more examples, more heuristics—and that any sufficiently elaborate prompt would produce similar gains. The base prompt used throughout this study is itself substantial: 344 lines (~7,500 words) containing decision heuristics (a "Struggle Stop-Rule" mandating review when struggle signals appear, a "Momentum Rule" for high-performing learners), a learner analysis framework, a decision matrix mapping learner states to suggestion types, and extensive research lab navigation guidance. This is far from a naive baseline.

To test the contribution of this prompt elaboration, we compared the full 344-line base prompt (cell 1) against a stripped 35-line "naive" prompt containing only a one-sentence role description and the JSON output schema—the minimum needed to produce scorable suggestions. We ran this comparison on two ego models: Haiku (stronger) and Kimi K2.5 (weaker), each with N=72 (18 scenarios $\times$ 2 cells $\times$ 2 runs), Opus judge.

**Table 20b: Prompt Elaboration Baseline (N=144, Opus judge)**

| Ego Model | Base (344 lines) | Naive (35 lines) | $\Delta$ |
|-----------|-----------------|-----------------|----------|
| Haiku | 75.7 | **82.5** | **+6.8** |
| Kimi K2.5 | 71.7 | 71.4 | $-0.3$ |

On Haiku, the elaborate prompt is actively harmful ($\Delta = -6.8$): the naive prompt scores higher on relevance (+0.28) and pedagogical quality (+0.36), losing only on specificity ($-0.17$)—the formatting guidance helps cite exact content IDs, but at the cost of worse pedagogical decisions. The per-scenario pattern is diagnostic: on high-performer scenarios ($\Delta = +29.2$), the base prompt's Momentum Rule classifies a mastery-level learner as "high engagement, no struggles" and prescribes "continue to next lecture," while the naive prompt suggests a capstone synthesis activity. On epistemic resistance ($\Delta = +30.8$), the base prompt reads a learner's philosophical critique of Hegel as "no struggle signals" and pushes forward; the naive prompt suggests an interactive simulation addressing the critique directly.

On Kimi, the elaborate prompt has no effect ($\Delta = -0.3$): the weaker model cannot execute the decision heuristics well enough for them to help or hurt. This is a model capability $\times$ prompt elaboration interaction—prescriptive rules override strong models' superior pedagogical intuitions while passing through weaker models unchanged.

Critically, recognition theory ($M = 90.9$ on Haiku, from the multi-model probe in Section 6.4) remains well above the naive prompt ($M = 82.5$). The recognition effect operates at a different level: not prescribing actions through decision trees, but shifting the model's relational stance toward the learner. Stripping 90% of the base prompt's content does not diminish—and actually enhances—the baseline from which recognition adds its distinctive value. This suggests that shorter, simpler recognition prompts that specify relational orientation without prescriptive heuristics could potentially achieve equal or greater effects than the current recognition prompt.
### 6.22 Token Budget Sensitivity
All evaluation cells use `max_tokens: 8000`, but actual single-turn outputs average approximately 235 tokens (base) to 451 tokens (recognition), with maximums under 650. This means the budget is 12–34$\times$ larger than typical output. A dose-response test measured whether constraining `max_tokens` degrades evaluation scores.

**Design.** Five runs used Haiku ego with Opus judge across two cells (cell 1 base, cell 5 recognition) at three constrained budget levels (256, 512, and 2048 tokens), with an additional base-only control at the default 8000 tokens (N=126 scored single-turn evaluations across all levels). The `max_tokens` parameter was overridden via a new CLI flag (`--max-tokens`) that threads through to the API request body.

**Table 49: Token Budget Dose-Response (Single-Turn Scenarios Only)**

| Budget | Cell | N | Mean | Avg Tokens | Tokens/Call |
|--------|------|---|------|------------|-------------|
| 256 | Base | 24 | 81.2 | 235 | 235 |
| 512 | Base | 12 | 77.9 | 247 | 247 |
| 2048 | Base | 12 | 81.2 | 240 | 240 |
| 8000 | Base | 12 | 82.1 | 235 | 235 |
| 256 | Recognition | 24 | 90.2 | 446 | ~245 |
| 512 | Recognition | 12 | 90.7 | 432 | ~245 |
| 2048 | Recognition | 12 | 92.3 | 451 | 451 |

*Recognition effect: +9.0 (256), +12.8 (512), +11.1 (2048), consistent across all budget levels.*

**Results.** Scores are flat across all budget levels for both conditions. Base cell means range 77.9–82.1, well within sampling noise at N=12–24. Recognition means range 90.2–92.3. The recognition effect is budget-invariant, ranging from +9.0 to +12.8 across levels.

**Mechanism: JSON retry absorption.** The system requires structured JSON output. When `max_tokens` truncates a response mid-JSON, the parsing fails and the engine retries automatically (up to two internal retries per generation). This retry mechanism absorbs budget constraints: at 256 tokens, each individual API call is correctly capped (confirmed by direct API testing: `completion_tokens: 256, finish_reason: length`), but the engine retries until a parseable response is produced. The cumulative `output_tokens` stored in the database reflects the sum across all API calls (including retries), which is why average tokens in Table 49 can exceed the per-call budget—for example, recognition at 256 averages 446 total tokens across approximately 1.8 actual API calls per evaluation (each individually capped at 256). At budgets above approximately 500 tokens, most responses complete without truncation and no retries are needed.

**Implication.** Reducing `max_tokens` from 8000 to 2048 incurs no quality penalty and could reduce per-call costs on providers that charge by requested (rather than generated) tokens. Further reduction to 512 also appears safe. At 256, the budget is below the natural response length for recognition prompts, triggering retries that negate any cost savings. These results are tentative (N=12–24 per cell per level, single model, single judge) but suggest that substantial budget reductions are possible without sacrificing the recognition effect.
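
The retry-absorption mechanism can be sketched as follows. `call_model` is a hypothetical stand-in for the engine's API client (the real interface and flag plumbing may differ); the sketch shows why the stored cumulative token count can exceed the per-call `max_tokens` cap.

```python
import json

MAX_RETRIES = 2  # up to two internal retries per generation, as described above

def generate_with_retries(call_model, prompt, max_tokens):
    """Retry until the model returns parseable JSON, summing output tokens.

    call_model(prompt, max_tokens) -> (text, output_tokens, finish_reason)
    is a hypothetical client interface, not the engine's actual API.
    """
    total_tokens = 0
    for _ in range(1 + MAX_RETRIES):
        text, tokens, finish_reason = call_model(prompt, max_tokens)
        total_tokens += tokens  # cumulative sum across retries; this is what gets stored
        try:
            return json.loads(text), total_tokens
        except json.JSONDecodeError:
            continue  # truncated mid-JSON (finish_reason == "length"): retry
    raise RuntimeError("no parseable JSON after %d attempts" % (1 + MAX_RETRIES))
```

Under a 256-token budget, a truncated first call followed by a successful retry stores a total near twice the cap even though each individual call was correctly limited, which is the pattern the table above reflects.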
---
The baseline tutor treats the learner as a knowledge deficit.

The recognition tutor treats the learner as an autonomous subject. Learner contributions become sites of joint inquiry. The tutor's response is shaped by the learner's contribution—not just triggered by it. Both parties are changed through the encounter.

This maps directly onto Hegel's master-slave analysis. The baseline tutor achieves pedagogical mastery—acknowledged as expert, confirmed through learner progress—but the learner's acknowledgment is hollow because the learner has not been recognized as a subject whose understanding matters. As in Hegel's resolution, the path forward lies through the learner's own formative activity: the recognition tutor honors the learner's struggle as constitutive of genuine understanding rather than an obstacle to be resolved. The tutor adaptation metrics (Section 6.15) provide empirical evidence for this: recognition-prompted tutors adjust their approach in response to learner input (+26% adaptation index), treating learner contributions as genuine inputs that reshape the pedagogical encounter.

The dialectical impasse test (Section 6.20) provides the most direct evidence for this interpretation, and the post-hoc resolution strategy coding reveals the mechanism with unusual clarity. When learners mount sustained intellectual resistance—a Popperian falsifiability critique, a materialist counter-reading—the recognition advantage is largest (+43 and +29 pts respectively), because these scenarios demand precisely what Hegel's analysis predicts: treating the other's position as having independent validity that must be genuinely engaged, not merely acknowledged.

The strategy coding shows that base tutors do not fail by *choosing the wrong strategy*—they fail by *having no strategy at all*. Every base tutor response across all three impasse scenarios (12/12) was coded as withdrawal: the tutor notes the learner's engagement time, praises their dedication, and suggests moving to the next lecture. The learner's substantive position—a coherent Popperian critique, a materialist counter-reading, an emotional plea for help—is not dismissed, contradicted, or resolved. It is simply not engaged. The impasse is not encountered; it is bypassed. This maps precisely onto the master-slave analysis: the master consumes the slave's labor (engagement metrics, time-on-page, session counts) without encountering the slave as a subject whose ideas possess independent validity. The base tutor achieves the master's hollow recognition—its authority is confirmed by the learner's continued presence—but the encounter that could produce genuine understanding never occurs.

Recognition tutors, by contrast, predominantly use scaffolded reframing (10/12): they validate the learner's position as intellectually serious, then redirect toward material that productively complicates it. This is Aufhebung—sublation—in pedagogical practice. The learner's objection is *preserved* (acknowledged as valid) and *overcome* (reframed toward new conceptual ground that neither party previously occupied). Only one response (on productive deadlock) was coded as genuine mutual recognition—where the tutor adopted the learner's materialist framework as its own lens rather than merely acknowledging it. This 83% scaffolded reframing vs 8% mutual recognition ratio is itself theoretically significant: recognition prompts produce sophisticated *pedagogical technique* rather than genuine *mutual transformation*. The tutor does not change its mind about Hegel in response to the student's Popperian critique—nor should it. What recognition enables is the capacity to hold the learner's counter-position as intellectually valid while maintaining pedagogical direction, which is arguably the realistic horizon for recognition in AI tutoring.

The per-turn strategy coding (Section 6.20) adds a further nuance: at the level of individual turns, even recognition tutors predominantly appear to withdraw—redirecting toward new material or reframing the question. The scaffolded reframing that the overall coder detects emerges from the *cumulative trajectory* across turns, not from any single response. This is itself dialectical: the encounter that produces recognition is not a moment but a process, and each step may appear incomplete in isolation.

The null result on affective shutdown ($\Delta$ = $-$1.1) sharpens the theoretical claim: recognition's distinctive contribution is epistemological (how the tutor relates to the learner's *ideas*), not primarily affective (how the tutor relates to the learner's *feelings*). The strategy coding confirms this: even on affective shutdown, the base tutor's failure mode is withdrawal (redirecting to review material) rather than emotional dismissal—the distinction is not about empathy but about whether the learner's intellectual or experiential contribution is *engaged* as having independent validity.
### 7.2 Architecture as Additive, Not Synergistic
An early exploratory analysis (N=17, Nemotron) suggested that multi-agent architecture might synergize specifically with recognition prompts (+9.2 pts interaction). This raised the theoretically appealing possibility that recognition creates qualitatively different conditions for productive internal dialogue. However, a multi-model probe across five ego models (N=655 total; Section 6.4, Table 8) decisively refutes this hypothesis: the A$\times$B interaction ranges from $-5.7$ to $+0.5$ across all five models tested, with four of five showing negative interactions and only Kimi showing a negligible positive value. The Nemotron re-run itself (N=119) shows an interaction of $-5.7$, confirming the original +9.2 as sampling noise on a tiny sample.

The corrected picture is simpler: recognition and architecture contribute additively. Recognition provides a large, consistent main effect (+9.6 to +17.8 across models), while architecture provides a small main effect ($-0.8$ to $+3.7$) that does not meaningfully depend on prompt type. The Superego adds modest value regardless of whether recognition theory is present—likely through generic quality enforcement (catching errors, improving specificity) rather than through recognition-specific deliberation. This finding aligns with the architecture's primary demonstrated value being error correction on new domains (Section 6.5) rather than recognition amplification.

The dialectical superego modulation experiments (Section 6.8) provide further evidence for additivity. Across three superego persona types and two negotiation styles (N=174), structural modulation metrics—negation depth, convergence speed, feedback length—show no correlation with output quality (all $|r| < 0.12$, n.s.). The superego's contribution is *filtering* (catching poor outputs) rather than *improving* (iteratively refining good ones). Recognition works by raising the quality of the ego's initial draft, reducing what the superego needs to catch. Similarly, the mechanism robustness experiment (Section 6.10, N=360) shows all nine mechanisms clustering within a 2.4-point band under recognition with scripted learners—the mechanism elaborates but does not meaningfully differentiate. Only when a dynamic learner provides genuine feedback does a mechanism (Theory of Mind profiling) produce a measurable additive effect (+4.1 pts, Section 6.10), and even then with near-zero interaction with recognition ($-0.7$ pts).
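
The A$\times$B interaction quoted throughout is a difference-in-differences over the four cell means. As a minimal sketch (the cell means below are illustrative, not the study's):

```python
def axb_interaction(base_single, base_multi, recog_single, recog_multi):
    """A x B interaction as a difference-in-differences of cell means:
    the multi-agent effect under recognition minus the multi-agent
    effect under base prompting."""
    return (recog_multi - recog_single) - (base_multi - base_single)

# Illustrative cell means only (not study data): a small architecture gain
# under base that reverses under recognition yields a negative interaction
# of the kind reported above.
example = axb_interaction(70.0, 73.0, 85.0, 82.3)
```

A value near zero is what "additive, not synergistic" means operationally: the architecture effect does not depend on the prompt condition.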
### 7.3 Domain Limits of Recognition-Theoretic Pedagogy
|
|
1320
1685
|
|
|
@@ -1345,13 +1710,25 @@ For practical deployment, this suggests multi-agent architecture is most valuabl
|
|
|
1345
1710
|
2. Prompt templates contain domain-specific examples that may leak across deployments
|
|
1346
1711
|
3. Domain-specific accuracy is critical
|
|
1347
1712
|
|
|
1713
|
+
The modulation analysis (Section 6.15.1) extends this reinterpretation. The Drama Machine framework predicts that internal ego-superego tension produces *modulated* behavior—dynamic variation in register, approach, and intensity. Post-hoc analysis of the N=350 factorial data reveals that the Superego does not increase behavioral range (multi-agent dimension score variance is virtually identical to single-agent, $d = 0.05$). Instead, **recognition is the modulation driver, operating through calibration rather than oscillation**: recognition responses show dramatically lower dimension variance ($d = -1.00$), meaning recognition tutors perform uniformly well across all 14 rubric dimensions rather than excelling on some while neglecting others. The Superego's contribution is *phronesis*—contextual practical wisdom that calibrates quality—rather than the productive irresolution the Drama Machine emphasizes for narrative contexts. Recognition tutors do negotiate longer with their Superego (2.62 vs 2.05 rounds), suggesting productive tension occurs internally even as the output becomes more consistent.
**Convergent vs. divergent internal dialogue.** Why does the Superego produce convergence rather than the modulation the Drama Machine predicts? The explanation lies in a structural distinction between *convergent* and *divergent* internal dialogue. In the original Drama Machine framework for narrative, internal agents have genuinely conflicting *objectives*—ambition vs. loyalty, desire vs. duty. That conflict is what produces dramatic behavioral range; the character oscillates because opposing motivations pull in different directions. In our tutoring architecture, the Ego and Superego share the same goal (effective pedagogy) and disagree only on whether a specific response achieves it. This is quality control, not value conflict. Quality control pushes outputs toward a shared standard—an implicit "quality attractor" that all responses converge upon—reducing variance rather than increasing it.
This convergence is amplified by three mechanisms. First, the Superego acts as a *filter*, not a *generator*: it removes poor options from the Ego's output space but does not introduce new behavioral repertoire. Filtering narrows distributions. Second, the Ego-Superego negotiation is largely stateless—the Superego does not remember its critiques from previous scenarios, so there is no accumulation of internal tension over time. Without persistent conflict, there is no pressure toward behavioral divergence. Third, modern LLMs already internalize self-critique through RLHF training; the explicit Superego may be making *reliable* what the model already does *intermittently*, improving average quality without altering the variance structure.
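
The filter-not-generator distinction can be made concrete. A minimal sketch of the negotiation loop (hypothetical function names, not the package's actual API) shows why filtering can only truncate the ego's output distribution rather than widen it:

```python
def negotiate(ego_draft_fn, superego_accepts_fn, max_rounds=3):
    """Superego-as-filter: it can reject ego drafts but never contributes
    drafts of its own, so the output distribution is a truncated version
    of the ego's -- variance shrinks toward the shared quality standard."""
    draft = ego_draft_fn(round_num=0)
    for round_num in range(1, max_rounds + 1):
        if superego_accepts_fn(draft):
            return draft  # converged on the shared quality attractor
        draft = ego_draft_fn(round_num=round_num)  # ego regenerates; superego adds nothing
    return draft  # fall back to the last draft if rounds are exhausted

# Toy illustration: ego proposes from a fixed repertoire,
# superego filters out drafts below a quality bar.
repertoire = ["weak hint", "solid scaffolded question", "lecture dump"]
quality = {"weak hint": 2, "solid scaffolded question": 5, "lecture dump": 1}
result = negotiate(
    ego_draft_fn=lambda round_num: repertoire[round_num % len(repertoire)],
    superego_accepts_fn=lambda d: quality[d] >= 4,
)
```

Note that nothing the superego does can add "solid scaffolded question" to a repertoire that lacks it; only the recognition framing, which reshapes the ego's generation, can do that.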
Recognition, by contrast, changes the behavioral *repertoire* itself—shifting from information delivery to relational engagement, opening up modes of response (metaphor, joint inquiry, productive tension) that do not exist in the base repertoire. The Superego can only evaluate behaviors that are already in the Ego's repertoire; recognition expands what that repertoire contains.
This suggests that to achieve genuine *divergent* internal dialogue—the kind the Drama Machine envisions—one would need internal agents with genuinely opposed pedagogical philosophies rather than agents that share a goal and disagree on execution. A Superego committed to Socratic questioning opposing an Ego inclined toward direct instruction, or an adversarial critic suspicious of the Ego's recognition performances ("Is this recognition genuine, or are you performing recognition markers to satisfy the rubric?"), could produce the productive irresolution the framework predicts. Whether such divergence would improve or degrade tutoring quality is an open empirical question—but it would test the Drama Machine's modulation hypothesis more faithfully than the current convergent architecture.
**The insight-action gap.** The self-reflective evolution experiments (Section 6.9) provide a striking illustration of the Superego's limitation. When both ego and superego generate between-turn reflections on their own behavior, both accurately diagnose their failures: the ego identifies "I kept circling back to the same framework"; the superego identifies "the ego ignores my feedback." But accurate self-diagnosis does not produce behavioral change. The ego's next turn repeats the same pattern. The insight-action gap—awareness without adaptation—reflects the Superego-as-filter architecture: the Superego can identify what is wrong with a response but cannot propose a fundamentally different approach. Theory of Mind profiling (Section 6.10) partially bridges this gap by giving the ego a model of the other agent to adapt *toward*, providing direction that self-reflection alone cannot supply.
### 7.5 Factor C: The Learner Superego Paradox
The learner architecture factor (single-agent vs multi-agent learner) showed a small but significant negative effect in the tutor-side factorial analysis (-3.1 pts, F=5.52, p=.019), though it explains only 1.6% of variance. The symmetric learner-side evaluation (Section 6.16) reveals the mechanism: the multi-agent learner architecture does not merely fail to help—it actively *hurts* learner quality ($d = 1.43$, $F(1,114) = 68.28$, $p < .001$, $\eta^2 = .342$). This is the largest effect in the entire study and inverts the intuition that motivated the architecture.
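
The effect sizes reported throughout are standard Cohen's d with a pooled standard deviation; a self-contained sketch using illustrative numbers (not the study data):

```python
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Illustrative rubric scores only -- not the evaluation dataset.
single_agent_learner = [78, 81, 75, 80, 79]
multi_agent_learner = [70, 72, 68, 71, 69]
d = cohens_d(single_agent_learner, multi_agent_learner)
```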
The ego/superego process was designed to produce more thoughtful learner responses through internal self-critique. Instead, the superego acts as an overzealous editor: it polishes away the messy, confused, persona-consistent engagement that characterizes genuine student behavior. Persona consistency shows the largest deficit ($\Delta = -0.59$ on the 1-5 scale)—a "frustrated student" stops sounding frustrated after the superego smooths out rough edges. Conceptual engagement ($\Delta = -0.69$) and question quality ($\Delta = -0.65$) follow: the superego suppresses naive but substantive questions in favor of more "correct" but less authentic ones.
**Recognition as external self-regulation.** The learner-side A×C interaction ($F(1,114) = 11.50$, $p < .001$, $\eta^2 = .058$) reveals that recognition partially rescues the multi-agent learner ($d = 0.79$, $p = .004$) while having no effect on single-agent learner quality ($d = -0.46$, $p = .082$, n.s.). The tutor-side factorial shows recognition working robustly across both learner types (+15.7 vs +13.0 pts, A×C n.s.), while the learner rubric reveals an asymmetry: recognition helps multi-agent learners more (+9.5 vs -1.3 pts on learner quality). The recognitive tutor creates conditions where authentic engagement is valued, counteracting the superego's tendency to pre-process learner reactions. But the recognitive tutor cannot fix the internal process—deliberation depth remains uniformly poor (2.7/5, $p = .679$ for the recognition effect) regardless of tutor framework.
This has a clean Hegelian interpretation. The ego/superego dynamic is a form of internal self-relation—the subject critiquing itself. But genuine recognition requires encounter with the Other. The tutor-as-Other provides something the internal superego cannot: acknowledgment from outside the learner's own cognitive system. External recognition is structurally different from, and more effective than, internal self-critique. You cannot bootstrap genuine dialogue from a monologue.
The difference between baseline and recognition prompts is not about different facts or capabilities. It is about:
- **Who the learner is** (knowledge deficit vs. autonomous subject)
- **What the interaction produces** (information transfer vs. adaptive responsiveness—Section 6.15 shows recognition profiles produce tutor adaptation indices 26% higher than baseline across three multi-turn scenarios, N=118)
- **What counts as success** (correct content delivered vs. productive struggle honored)
This suggests a new category: *intersubjective prompts* that specify agent-other relations, not just agent behavior.
The prompt elaboration baseline (Section 6.21) sharpens this distinction empirically. A 344-line base prompt containing detailed decision heuristics, learner analysis frameworks, and pedagogical decision matrices produces *worse* results than a 35-line prompt containing only a role description and output schema—at least on models capable enough to exercise their own pedagogical judgment. The prescriptive content specifies *agent behavior* (classify learner state, prescribe action type); recognition content specifies *agent-other relations* (treat the learner as an autonomous subject). The former constrains; the latter enables. This finding also implies that the current recognition prompts, which inherit much of the base prompt's prescriptive scaffolding, may be over-specified—shorter recognition prompts that focus purely on relational stance could match or exceed the current prompts' effectiveness.
### 7.7 Implications for AI Personality
AI personality research typically treats personality as dispositional—stable traits the system exhibits (Section 2.6). Our framework suggests personality is better understood relationally—not as what traits the AI has, but as how it constitutes its interlocutor.
Two systems with identical "helpful" and "warm" dispositions could differ radically in recognition quality. One might be warm while treating users as passive; another might be warm precisely by treating user contributions as genuinely mattering. This is an instance of what might be called *strategic anthropomorphism*: using the language and structure of human intersubjectivity as a design heuristic, not because the AI achieves genuine consciousness, but because the relational framework produces measurably better outcomes. The risk of strategic anthropomorphism—that users mistake functional recognition for genuine understanding—is real but manageable through transparent design (Section 3.3's distinction between recognition proper and recognition-oriented design).
If mutual recognition produces better outcomes, and if mutual recognition requires the AI to be genuinely shaped by human input, then aligned AI might need to be constitutionally open to transformation—not just trained to simulate openness. The bilateral transformation metrics (Section 6.15) provide empirical evidence for this: recognition-prompted tutors measurably adapt their approach based on learner input (+26% higher adaptation index across N=118 multi-turn dialogues), while baseline tutors maintain more rigid stances. However, the learner growth reversal (Section 6.15) complicates the "mutual" framing—what we observe is primarily tutor-side adaptation rather than symmetric transformation.
### 7.8 Cost-Benefit Analysis: When is Multi-Agent Architecture Worth It?
The domain generalizability findings raise a practical question: when is the additional cost of multi-agent architecture justified?
**Table 48: Cost-Benefit by Domain and Architecture**
| Domain | Architecture | Avg Score | Latency (s) | Δ Score | Latency Multiple |
|--------|-------------|-----------|-------------|---------|------------------|
### 7.9 What the Transcripts Reveal
The qualitative analysis in Section 6.17 provides textual evidence that the score differences between conditions correspond to observable relational differences in the actual suggestions—not merely rubric-gaming or surface-level keyword matching.
The transcript excerpts illustrate a consistent structural pattern: base responses adopt a third-person, context-free instructional stance ("complete this lecture," "review the foundational material," "begin with an introductory lecture"), while recognition responses adopt a second-person, context-specific relational stance that names the learner's history, validates their intellectual contributions, and proposes actions grounded in the learner's own interests. This distinction maps directly onto the theoretical framework: the base tutor constitutes the learner as a knowledge deficit (Section 7.1), while the recognition tutor constitutes the learner as an autonomous subject whose contributions shape the pedagogical encounter.
These findings carry important limitations. The thematic coding is regex-based rather than human-coded or LLM-coded, and may miss nuanced expressions of each category or generate false positives from surface matches. A natural extension would be to use LLM-based thematic analysis (e.g., having Claude Code classify each response against the thematic categories with chain-of-thought reasoning), which could capture semantic patterns that regex misses—for instance, recognizing struggle-honoring language that uses novel phrasing not covered by the predefined patterns. The transcript pairs were selected for maximum contrast (highest recognition vs lowest base scores), not typicality—median-scoring responses from both conditions would show less dramatic differences. The qualitative patterns are consistent with, but do not prove, the theoretical interpretation; alternative explanations (e.g., recognition prompts simply producing longer, more detailed responses that score higher on the rubric) cannot be fully ruled out, though the lexical analysis suggests the difference is qualitative rather than quantitative.
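
As an illustration of both the regex-based method and the brittleness noted above, a minimal coder of this kind (the patterns are invented examples, not the study's actual coding scheme):

```python
import re

# Invented example patterns -- not the study's actual thematic categories.
THEMES = {
    "struggle_honoring": re.compile(
        r"\b(productive struggle|wrestl\w+ with|sit with (the|this) difficulty)\b", re.I),
    "contribution_uptake": re.compile(
        r"\b(your (point|question|insight)|as you (noted|observed))\b", re.I),
}

def code_response(text):
    """Return the set of theme labels whose regex matches the response."""
    return {label for label, pat in THEMES.items() if pat.search(text)}

hit = code_response("As you noted, wrestling with this is productive.")
# Same idea, novel phrasing -- the false negative the text warns about:
miss = code_response("Staying inside the confusion can be generative.")
```

The second call illustrates exactly why LLM-based thematic classification could outperform surface matching: struggle-honoring language in novel phrasing scores zero under the regex coder.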
### 7.10 The Scripted Learner Confound
A methodological finding with broad implications: the mechanism robustness experiment (Section 6.10, N=360) shows that all nine advanced mechanisms—self-reflection, profiling, intersubjective dialogue, prompt erosion detection, quantitative disposition tracking, and their combinations—produce indistinguishable results under scripted learners. The full factorial (cells 1–8, N=350) shares this limitation. When learner messages are predetermined by scenario YAML, mechanisms that adapt to learner behavior are causally inert—they modify tutor output, but the next learner input is unchanged.
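
The causal inertness is visible in the loop structure itself. A hedged sketch (hypothetical names, not the evaluation harness's actual code) of the two dialogue regimes:

```python
def run_dialogue(tutor_fn, learner_fn, scripted_turns, dynamic=False, n_turns=3):
    """With a scripted learner, the next learner message ignores the tutor's
    output entirely; adaptive tutor mechanisms cannot close a feedback loop
    because there is nothing downstream for them to influence."""
    transcript = []
    for t in range(n_turns):
        if dynamic:
            learner_msg = learner_fn(transcript)   # sees prior tutor output
        else:
            learner_msg = scripted_turns[t]        # predetermined by scenario YAML
        tutor_msg = tutor_fn(learner_msg, transcript)
        transcript.append((learner_msg, tutor_msg))
    return transcript

# Two tutors with different mechanisms face identical scripted inputs,
# so mechanism differences cannot propagate into learner behavior.
script = ["What is recursion?", "I still don't get it.", "Oh, maybe?"]
t1 = run_dialogue(lambda m, h: "hint: " + m, None, script)
t2 = run_dialogue(lambda m, h: "profile-adapted: " + m, None, script)
learner_inputs_1 = [turn[0] for turn in t1]
learner_inputs_2 = [turn[0] for turn in t2]
```

In the scripted branch the learner-input sequences are identical regardless of tutor mechanism; only the `dynamic=True` branch lets a mechanism's output alter subsequent learner turns.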
This confound was unmasked when the same two mechanisms (self-reflection and bidirectional profiling) were tested with a dynamic (ego/superego) learner (Section 6.10, N=120). With a responsive interlocutor, profiling produced a measurable +4.1 pt additive effect, while self-reflection did not. The dynamic learner lowered base scores by approximately 12 points (71.4 vs 84.3 for scripted), creating harder conditions that differentiated mechanisms. The scripted learner's predetermined responses created a ceiling effect that masked genuine mechanism differences.
This finding reframes several earlier null results. The architecture null effect in the original factorial (Section 6.3) may partly reflect the scripted learner's inability to respond differently to different architectures. The mechanism equivalence in the dialectical experiments (Sections 6.8–6.9) similarly reflects an experimental setup that cannot detect mechanism-specific feedback loops. Future work should use dynamic learners when testing mechanism differentiation.
### 7.11 Practical Recommendations for AI Tutor Design
The experimental evidence across thirty-seven evaluations (N=3,383) converges on a clear design hierarchy for building effective AI tutors:
**1. Recognition-enhanced prompts are the single most impactful design decision.** Across every experimental condition, model, and content domain tested, recognition theory produces the largest and most consistent gains: $d = 0.91$–$1.71$ in controlled experiments, +7.6 to +14.8 pts depending on learner type. The investment is purely in prompt design—no architectural changes, no additional API calls, no infrastructure overhead. Any team building an AI tutor should start here.
**2. Multi-agent architecture (ego + superego) is valuable for quality assurance, not creativity.** The superego functions as a filter, not a generator. It catches poor responses (content errors, missed scaffolding opportunities, wrong-domain references) but does not produce qualitatively different output. For well-trained domains with reliable content isolation, the superego adds approximately +0.5 points at 2.7$\times$ latency—a poor cost-benefit ratio. For domain transfer scenarios or deployments where content scoping cannot be guaranteed, it is essential. The practical recommendation: use multi-agent architecture as a safety net during initial deployment, then consider removing it once content reliability is established.
**3. Theory of Mind (profiling) matters only when the learner is dynamic.** Building a model of the learner's cognitive state, epistemic commitments, and response patterns produces a measurable +4.1 pt benefit—but only when the learner actually responds to the tutor's adaptations. In scripted or single-turn interactions (most current AI tutor deployments), profiling is wasted computation. In multi-turn conversational tutoring with genuine learner agency, profiling bridges the insight-action gap that self-reflection alone cannot close.
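
A minimal sketch of what such a profile might look like (hypothetical class and toy heuristics, standing in for the LLM-based profiler):

```python
from dataclasses import dataclass, field

@dataclass
class LearnerProfile:
    """Hypothetical Theory-of-Mind profile: the tutor's running model of
    the learner, updated each turn and injected into the next prompt.
    Only pays off when the learner actually responds to adaptations."""
    misconceptions: list = field(default_factory=list)
    engagement: str = "unknown"      # e.g. "frustrated", "curious"
    last_question_quality: int = 0   # 1-5, judged per turn

    def update(self, learner_msg: str) -> None:
        # Toy heuristics in place of an LLM-based profiling call.
        if "?" in learner_msg:
            self.last_question_quality = max(self.last_question_quality, 3)
        if "don't get" in learner_msg.lower():
            self.engagement = "frustrated"

    def to_prompt_fragment(self) -> str:
        return (f"Learner model: engagement={self.engagement}, "
                f"question_quality={self.last_question_quality}")

profile = LearnerProfile()
profile.update("I don't get why the base case matters?")
fragment = profile.to_prompt_fragment()
```

The design point is the direction the profile supplies: unlike self-reflection, which only diagnoses the tutor's own behavior, the profile gives the ego something external to adapt *toward*.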
**4. Mechanism selection is a second-order optimization.** Nine different mechanisms—including self-reflection, intersubjective dialogue, prompt erosion detection, and combined approaches—all produce equivalent results under recognition. Once recognition prompts are in place, the choice of supplementary mechanism matters far less than whether the learner interaction is genuine. Teams should invest in recognition prompt quality rather than mechanism complexity.
**5. Superego persona type interacts with turn structure.** Adversarial superego dispositions can be destructive in single-turn settings (producing over-deference spirals where the ego removes all prescriptive content), but productive in multi-turn settings where learner feedback provides external grounding. For single-turn deployments, a neutral or advocate superego is safer. For multi-turn conversational contexts, adversarial personas can produce the strongest internal quality control.
**6. Dynamic learners are necessary for mechanism testing but costly.** The scripted learner confound (Section 7.10) means that any evaluation of mechanism effectiveness must use LLM-powered learners that generate genuine responses. However, dynamic learners lower base scores by approximately 12 points and increase latency substantially. For routine quality monitoring, scripted scenarios suffice; for mechanism development and comparison, dynamic learners are essential.
**7. Prefer minimal prompts with relational framing over elaborate prescriptive scaffolding.** The prompt elaboration baseline (Section 6.21) demonstrates that detailed decision heuristics, learner analysis frameworks, and prescriptive rules in the system prompt do not improve—and can actively harm—tutoring quality on capable models. On Haiku, a 35-line prompt outperforms a 344-line prompt by +6.8 pts; on Kimi K2.5, the elaborate prompt is inert. Prompt engineering effort is better invested in specifying the model's *relational orientation* (how it constitutes the learner) than in prescribing *behavioral rules* (which learner state maps to which action). This is consistent with broader findings that capable LLMs perform worse under overly prescriptive instructions that constrain their native reasoning.
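
The contrast between the two styles can be sketched as follows (invented illustrations, not the study's actual 35-line or 344-line prompts):

```python
# Invented illustrations of the two prompt styles -- not the study's prompts.

MINIMAL_RELATIONAL = """\
You are a tutor. Treat the learner as an autonomous subject whose
contributions genuinely shape this encounter.
Respond as JSON: {"message": str, "suggested_action": str}"""

ELABORATE_PRESCRIPTIVE = """\
You are a tutor. First classify the learner state as one of
[CONFUSED, STALLED, PROGRESSING, MASTERING]. Then select the action
from the decision matrix: CONFUSED -> re-explain; STALLED -> hint;
PROGRESSING -> extend; MASTERING -> assess. Never deviate from the
matrix. (...plus hundreds of further lines of heuristics and frameworks)"""
```

The first specifies a relational orientation and an output contract and stops; the second prescribes a behavioral state machine. The finding is that on capable models the short relational prompt wins.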
**8. Token budgets can be reduced substantially without losing recognition power.** The token budget sensitivity test (Section 6.22) shows that reducing `max_tokens` from 8000 to 2048 or 512 produces no measurable quality degradation on either base or recognition conditions. The recognition effect is fully preserved across all budget levels tested. For production deployments where per-call cost or latency scales with requested token budget, a 4–16$\times$ reduction is available at no quality cost. The floor is set by JSON formatting requirements: budgets below approximately 500 tokens risk truncating structured output, triggering retries that negate cost savings.
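
A minimal sketch of how a deployment might apply this finding (hypothetical helper, not part of the package):

```python
def pick_max_tokens(requested: int, json_output: bool = True, floor: int = 512) -> int:
    """Clamp the per-call token budget per the sensitivity finding: budgets
    down to ~512 preserved quality, but going lower risks truncating the
    structured JSON output and triggering retries that negate the savings."""
    if json_output:
        return max(requested, floor)
    return requested

# A 4-16x reduction from the original 8000-token budget:
budget = pick_max_tokens(512)       # the tested floor
aggressive = pick_max_tokens(256)   # clamped up to protect JSON output
```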
**9. There is a minimum ego capability threshold for mechanism benefit.** The cognitive prosthesis test (Section 6.10) demonstrates that architectural mechanisms are not model-agnostic: the same mechanism stack (profiling, self-reflection, prompt rewriting, cross-turn memory) that boosts Haiku by +20 points *hurts* Nemotron by $-15$ points. Nemotron succeeds on static dimensions (specificity 4.0, actionability 4.0) but fails catastrophically on dynamic context integration (tutor adaptation 1.8, dialectical responsiveness 2.0). Deploying complex multi-agent architectures on weaker ego models is actively counterproductive—simpler configurations produce better results. Teams should validate that their ego model can process multi-turn context before investing in mechanism complexity.
---
## 8. Limitations and Future Work
**Simulated learners**: Our evaluation uses scripted and LLM-generated learner turns rather than real learners. While this enables controlled comparison, it may miss dynamics that emerge in genuine interaction.
**LLM-based evaluation**: Using an LLM judge to evaluate recognition quality may introduce biases. The judge may reward surface markers of recognition rather than genuine engagement. Inter-judge reliability analysis (Section 5.8) reveals that different AI judges show only moderate agreement (r=0.33–0.66), with qualitative analysis suggesting judges weight criteria differently—Claude prioritizes engagement while Kimi prioritizes structural completeness. A cross-judge replication with GPT-5.2 (Section 6.19) confirms the recognition main effect (d$\approx$0.9 in the factorial, d=1.54 in the memory isolation experiment) and multi-agent null effects are judge-robust, though GPT-5.2 finds compressed effect magnitudes (37–59% of Claude's depending on experiment). The memory isolation recognition dominance pattern replicates with identical condition ordering under both judges (inter-judge r=0.63, N=120). Notably, the recognition-vs-enhanced increment (+8.0 under Claude) does not reach significance under GPT-5.2, warranting caution on the precise magnitude of recognition's unique contribution. This validates our use of within-judge comparisons but cautions against treating absolute scores or specific effect magnitudes as objective measures. Additionally, LLM judges are subject to version drift: provider updates to model weights or decoding behavior could shift scoring distributions between evaluation campaigns, a known concern in the LLM-as-Judge literature [@gu2025surveyjudge]. Our within-run comparisons are insulated from this risk (all conditions in a given analysis use the same judge version), but absolute scores may not be directly comparable across studies or future replications using updated model versions. Concretely, our primary judge (Claude Opus) was updated from version 4.5 to 4.6 during data collection (February 5, 2026). 
To eliminate any residual version drift concern, all early runs originally judged under Opus 4.5 were rejudged under Opus 4.6, so the complete evaluation dataset now uses a single judge version. An empirical check on matched conditions (kimi ego, cells 1 and 5) before and after rejudging shows stable recognition deltas (+16.3 original vs +15.6 rejudged) with absolute scores shifting by only ~2 points, confirming that the version transition did not differentially affect experimental conditions.
**Memory isolation experiment**: A corrected 2×2 memory isolation experiment (N=120 across two runs; Section 6.2) isolated recognition and memory factors: recognition is the primary driver (d=1.71), while memory provides a modest secondary benefit (d=0.46, $p \approx .08$). The experiment uses a smaller sample (N=120) than the original uncorrected runs, but the very large effect sizes (d=1.71 for recognition) provide high statistical power. A cross-judge replication with GPT-5.2 confirms recognition dominance (d=1.54), identical condition ordering, and the negative interaction (ceiling effect), with inter-judge r=0.63 (Section 6.19).
**Active control limitations**: The post-hoc active control (N=118; Section 6.2) was designed after observing recognition effects, not as part of the original experimental protocol. The active control ran on Nemotron while the primary factorial used Kimi K2.5, requiring same-model comparisons to avoid conflating model differences with treatment effects. Within Nemotron data, the ordering is clear: recognition (~73) > active control (66.5) > base (~58), with recognition gains (~+15 pts) roughly doubling the active control's benefit (~+9 pts). This same-model analysis supports the conclusion that recognition theory provides specific value beyond generic pedagogical elaboration, but the comparison would be more precise if conducted on the same model as the primary factorial. Running the active control on Kimi K2.5 is a clear next step that would establish direct comparability with the factorial conditions. Additionally, the base prompts were already designed to produce competent tutoring with no length constraint; the active control functions as a *pedagogically-enriched* condition containing real instructional content (growth mindset language, Bloom's taxonomy, scaffolding strategies), rather than a true inert placebo.
**Model dependence**: Results were obtained with specific models (Kimi K2.5, Nemotron). An early exploratory A×B analysis (N=17, Nemotron, data no longer in DB) suggested recognition-specific multi-agent synergy, but a multi-model probe across five ego models (N=655, Section 6.4) decisively refutes this, confirming architecture and recognition as additive. The recognition main effect, by contrast, replicates across all five models and domains.
**Domain sampling and content isolation**: We tested two domains (philosophy, elementary math). A follow-up run (eval-2026-02-05-e87f452d) tested elementary content with Kimi K2.5, partially addressing the model confound in the original Nemotron-only elementary results. The recognition main effect replicated (+9.9 pts, d $\approx$ 0.61), though the factor inversion pattern from Table 9 (architecture dominance on elementary) was partly model-dependent: Kimi showed recognition dominance on elementary content, while Nemotron showed architecture dominance. Post-hoc investigation (Section 6.6) identified two content isolation bugs that caused philosophy references to appear in one elementary scenario (`new_student_first_visit`, 16/24 responses affected). These bugs—a content resolver fallback and hardcoded prompt examples—have been fixed but partly inflated the architecture effect on elementary content, since multi-agent cells caught the errors while single-agent cells did not. The Kimi architecture effect (+3.0 pts) is likely more representative than the Nemotron effect (+9.9 pts). Broader domain sampling beyond two content areas, with verified content isolation, would further strengthen generalizability claims.
**Synthetic learning outcomes only**: All evaluations measure tutor response quality and simulated learner behavior, not actual learning. The synthetic learning outcome index (Section 6.15.2) provides a proxy from learner rubric dimensions (revision signals, question quality, conceptual engagement), and all conditions show substantial learning arcs (15–21 pts). However, these are AI-judge assessments of LLM-generated learner turns—measuring the *quality of simulated learning behavior*, not knowledge acquisition, comprehension, or transfer. Whether recognition-enhanced tutoring produces genuine learning gains in human learners remains the critical open question.
**Short-term evaluation**: We evaluate individual sessions, not longitudinal relationships. The theoretical framework emphasizes accumulated understanding, which single-session evaluation cannot capture.
**Bilateral transformation asymmetry**: The bilateral transformation metrics (Section 6.15), now based on N=118 dialogues across three multi-turn scenarios, confirm that recognition-prompted tutors adapt more (+26% relative improvement in adaptation index). However, learner growth is slightly *lower* under recognition (0.210 vs 0.242), complicating the theoretical claim of *mutual* transformation. The effect is better characterized as tutor-side responsiveness. The learner growth index measures observable message complexity markers (revision language, connective reasoning), which may not capture all forms of learner benefit—recognition tutors may reduce visible struggle precisely by being more effective.
**Dynamic rewriting evolution**: The step-by-step evolution analysis (Section 6.18) tracks cell 21 across three iterative development runs with small sample sizes (13–15 scored responses per cell per run, 82 total). The runs are not independent experiments—each includes implementation improvements beyond Writing Pad activation. While the trajectory from trailing to leading is clear, a controlled ablation isolating only the Writing Pad variable would provide stronger causal evidence. All three runs use free-tier models (Nemotron ego, Kimi K2.5 superego), and generalization to other model combinations is unknown.
**Scripted learner confound**: The mechanism robustness testing (Section 6.10, cells 40–59, N=360) uses a single-agent (scripted) learner whose responses are predetermined. This design prevents feedback loops between tutor mechanisms and learner behavior, rendering all mechanisms causally inert—they cannot influence what the learner says next. The resulting null result (all mechanisms cluster within 2.4 pts) reflects the experimental design rather than genuine mechanism equivalence. The dynamic learner results (cells 60–65, 69–70, N=300) partially address this confound, demonstrating that mechanisms do differentiate with genuine feedback loops, but cover only four mechanisms and two scenarios.
**Qualitative assessor blinding**: The initial qualitative transcript assessments (Section 6.11) were conducted by an AI judge (Claude Opus) with access to condition labels. Two blinded replications (condition metadata stripped, Table 21b) tested for assessor bias: the first used Haiku, the second used the same model (Opus). The same-model blinded replication confirms that Opus's tag assignments are largely unchanged by blinding (stalling base 100%→91.4%, recognition\_moment base 0%→5.2%), while the Haiku-blinded softening reflects model calibration differences rather than a genuine blinding effect. The near-perfect binary separations in Tables 20–21 are therefore robust rather than inflated. All assessment remains LLM-based; human expert coding would provide independent validation of the qualitative patterns.
### 8.2 Future Directions
**Cross-application transfer**: Test whether recognition-oriented design transfers to domains beyond tutoring—therapy bots, customer service, creative collaboration.
**Learner superego redesign**: The learner superego paradox (Section 6.16) suggests the current learner ego/superego prompts optimize for "good student responses" rather than "authentic student responses." A redesigned learner superego that critiques for *inauthenticity*—pushing the ego toward messier, more persona-consistent responses—might produce multi-agent learners that enhance rather than degrade learner quality. This would test whether the paradox reflects a fundamental limitation of internal self-critique or merely poor prompt calibration.
**Mechanistic binding with dynamic learners**: The scripted learner confound (Section 6.10) demonstrates that mechanism testing requires dynamic interlocutors capable of genuine feedback loops. Cells 60–65 cover self-reflection, profiling, intersubjective framing, and combined mechanisms with dynamic learners, but quantitative disposition and prompt erosion remain untested in this configuration. Expanding to the full mechanism space with base/recognition pairs (not just recognition-only) and additional scenarios would provide a complete mechanism matrix.
**Theory of Mind architecture**: The other-ego profiling results (Section 6.10) suggest that explicit Theory of Mind—building and maintaining a model of the interlocutor—provides additive benefit (+4.1 pts) with dynamic learners. Bidirectional profiling (both tutor and learner maintain models of each other) and strategy planning based on these profiles represent a promising architectural direction that warrants systematic exploration.
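Structurally, bidirectional profiling amounts to each party maintaining a rolling model of the other. A minimal sketch, assuming hypothetical field names (the package's actual profile format is defined in its agent configuration, not here):

```python
from dataclasses import dataclass, field

@dataclass
class InterlocutorProfile:
    """Rolling model of the other party, updated after each turn."""
    misconceptions: list[str] = field(default_factory=list)
    engagement_markers: list[str] = field(default_factory=list)
    turn_count: int = 0

    def update(self, observed: list[str], markers: list[str]) -> None:
        self.turn_count += 1
        # Drop misconceptions no longer observed, append newly observed ones.
        self.misconceptions = [m for m in self.misconceptions if m in observed]
        self.misconceptions += [m for m in observed if m not in self.misconceptions]
        self.engagement_markers = markers

# Bidirectional: the tutor profiles the learner and vice versa.
learner_seen_by_tutor = InterlocutorProfile()
tutor_seen_by_learner = InterlocutorProfile()
```

Strategy planning would then condition each agent's next turn on its current profile of the other, which is what distinguishes this from stateless per-turn prompting.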
**Qualitative assessment blinding**: The same-model blinded assessment (Table 21b) confirmed that Opus's tag discrimination is robust to condition label removal (e.g., stalling 100% → 91.4% in base, recognition\_moment 0% → 5.2% in base). The earlier apparent softening under blinding was a model calibration artifact (Haiku tags more liberally). While this addresses the primary blinding concern, all qualitative assessments were conducted by LLM judges rather than human expert coders—a limitation shared with the quantitative evaluation. Human raters applying established qualitative coding frameworks (e.g., thematic analysis, discourse analysis) would provide independent validation of the AI-discovered themes and tag distributions.
**Superego parse robustness**: The cognitive prosthesis analysis (Section 6.10) revealed that the Kimi K2.5 superego returns malformed JSON on 16–45% of reviews, silently disabling quality control through automatic approval. Structured output enforcement, retry logic, or prompt engineering for JSON reliability would reduce this failure mode. The adversary prompt's lower parse failure rate (11.5% vs 21.8% for descriptive) suggests that prompt structure itself affects JSON reliability from thinking models—a finding with implications for any system using LLM-generated structured output.
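The retry-and-log pattern this suggests can be sketched as follows; the function and field names are hypothetical and stand in for whatever the eval harness actually uses:

```python
import json

def parse_superego_review(raw: str, retries_left: int, rerun) -> dict:
    """Parse a superego review, retrying on malformed JSON instead of
    silently auto-approving the ego's draft. `rerun` re-invokes the
    superego and returns its raw text (hypothetical callable)."""
    try:
        review = json.loads(raw)
    except json.JSONDecodeError:
        if retries_left > 0:
            return parse_superego_review(rerun(), retries_left - 1, rerun)
        # Retries exhausted: fall back to approval, but record the parse
        # failure so auto-approval rates stay visible in run metadata.
        return {"verdict": "approve", "parse_failure": True}
    review.setdefault("parse_failure", False)
    return review
```

Thinking models often wrap JSON in prose or code fences, so stripping fences before `json.loads`, or using a provider's structured-output mode where one exists, would likely remove a further slice of the 16–45% failure band.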
**Capability threshold mapping**: The prosthesis test establishes that Nemotron falls below and Haiku falls above the minimum ego capability threshold for mechanism benefit. Testing intermediate models (GLM-4.7, DeepSeek V3.2) would map the threshold more precisely and determine whether it corresponds to specific capabilities (context window utilization, meta-cognitive processing, behavioral flexibility) that could be assessed independently.
**Adaptive mechanism loading**: Rather than deploying a fixed mechanism stack, systems could load mechanisms based on ego model capability—simpler configurations for weaker models, full stacks for capable ones. The two-tier capability analysis (static vs dynamic dimensions) suggests that mechanisms targeting static capabilities (e.g., content retrieval, formatting) could benefit weaker models, while mechanisms targeting dynamic capabilities (adaptation, dialectical responsiveness) should be reserved for models above the capability threshold.
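The tiering above can be sketched as a simple threshold lookup; the mechanism names mirror the text, while the capability score and threshold value are illustrative assumptions rather than quantities measured in the paper:

```python
# Illustrative capability-gated mechanism loading: weaker egos get only
# static-dimension mechanisms; egos above the threshold also get the
# dynamic-adaptation stack.
STATIC_MECHANISMS = ["content_retrieval", "formatting"]
DYNAMIC_MECHANISMS = ["self_reflection", "profiling", "intersubjective_framing"]

def select_mechanisms(ego_capability: float, threshold: float = 0.5) -> list[str]:
    """Return the mechanism stack appropriate to the ego model's capability."""
    stack = list(STATIC_MECHANISMS)
    if ego_capability >= threshold:
        stack += DYNAMIC_MECHANISMS
    return stack
```

In practice the capability score would come from a one-time assessment of the ego model (context utilization, adaptation probes), not from a hand-set constant.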
---
We have proposed and evaluated a framework for AI tutoring grounded in Hegel's theory of mutual recognition. Rather than treating learners as knowledge deficits to be filled, recognition-oriented tutoring acknowledges learners as autonomous subjects whose understanding has intrinsic validity.
An evaluation framework (N=3,383 primary scored responses across thirty-seven key evaluations; N=7,000+ across the full development database) provides evidence that recognition theory has unique value, subject to the limitations discussed in Section 8.1:
1. **Recognition as primary driver (the definitive finding)**: A corrected 2×2 memory isolation experiment (N=120 across two independent runs) demonstrates that recognition theory is the primary driver of tutoring improvement: recognition alone produces d=1.71 (+15.2 pts), while memory alone provides only a modest, non-significant benefit (d=0.46, +4.8 pts, $p \approx .08$). The combined condition reaches d=1.81 (+15.8 pts vs base), with ceiling effects at ~91 limiting further gains. The full factorial (N=350) confirms recognition as the dominant factor ($d=1.11$, $\eta^2$=.243), with consistent effects across learner types (+15.7 single, +13.0 multi; A×C n.s.). A post-hoc active control (N=118) using length-matched prompts with generic pedagogical content provides partial corroboration: same-model comparisons show the active control scores approximately 9 points above base while recognition scores approximately 15 points above base. A three-way comparison (N=36) found recognition outperforms enhanced prompting by +8.0 points, consistent with recognition dominance, though the increment does not replicate under GPT-5.2 (+2.4 pts, n.s.). Recognition theory is directly effective and does not require memory infrastructure to manifest.
2. **Architecture is additive, not synergistic**: A multi-model probe across five ego models (N=655; Section 6.4, Table 8) shows that multi-agent architecture does not meaningfully interact with recognition prompts. The A$\times$B interaction ranges from -5.7 to -0.7 across the five models tested (mean -2.2), all negative and consistent with ceiling effects. The original exploratory finding (+9.2 on N=17, Nemotron) was sampling noise. Architecture provides a small additive effect (-0.8 to +3.7 pts) largely independent of prompt type.
3. **Tutor adaptation**: Recognition-prompted tutors measurably adapt their approach in response to learner input (adaptation index +26% higher than baseline across N=118 multi-turn dialogues and three scenarios). However, learner-side growth is not higher under recognition, suggesting the effect is tutor-side responsiveness rather than symmetric mutual transformation. This provides partial empirical grounding for recognition theory: recognition prompts produce tutors that are genuinely shaped by the encounter, even if the "mutual" claim requires qualification.
4. **Domain generalizability**: Recognition advantage replicates across both philosophy and elementary math, and across both Kimi and Nemotron models, though with only two content domains tested. On elementary content with Kimi (N=60), recognition provides +8.2 pts, with effects concentrated in challenging scenarios. Architecture provides a small additive benefit in both domains (+2.3 elementary, +1.0 philosophy). Broader domain coverage (technical STEM, creative writing, social-emotional content) is needed before generalizability can be considered established.
5. **Multi-agent as reality testing**: On new domains, the Superego catches content isolation failures—whether from system-level bugs (content resolver fallbacks, hardcoded prompt examples) or model defaults. This error-correction function is essential for domain transfer, particularly when content scoping cannot be guaranteed at the system level.
6. **Writing Pad activation coincides with dynamic rewriting improvement**: A step-by-step evolution analysis (N=82 across three iterative development runs) shows that dynamic prompt rewriting (cell 21) progresses from trailing its static baseline by 7.2 points to leading by 5.5 points, with the improvement coinciding with Writing Pad memory activation (Section 6.18). Every rubric dimension improves. This trajectory is consistent with the Freudian Mystic Writing Pad (Section 3.4) functioning as an important enabler for dynamic adaptation, though the uncontrolled nature of the iterative runs means a controlled ablation is needed to confirm the causal role.
7. **Cross-judge robustness**: A replication with GPT-5.2 as independent second judge (Section 6.19) confirms the recognition main effect (d$\approx$0.9 in the factorial, d=1.54 in the memory isolation experiment), recognition dominance in the 2×2 design (identical condition ordering, negative interaction), and multi-agent null effects. GPT-5.2 finds compressed magnitudes (37–59% of Claude's effect sizes depending on experiment) but always in the same direction. The recognition-vs-enhanced increment (+8.0 under Claude) does not reach significance under GPT-5.2 (+2.4 pts, n.s.), warranting caution on the precise magnitude of recognition's unique contribution beyond enhanced prompting.
8. **Dialectical impasse and resolution strategy**: Recognition's advantage is largest under sustained intellectual challenge (Section 6.20). Three 5-turn impasse scenarios (N=24) show recognition outperforming base by +43 pts on epistemic resistance and +29 pts on interpretive deadlock, while showing no advantage on affective shutdown ($\Delta$ = $-$1.1). Post-hoc resolution strategy coding reveals the mechanism: every base tutor (12/12) withdraws from the dialectical encounter entirely—noting engagement metrics while ignoring the learner's substantive position—while recognition tutors predominantly (10/12) use scaffolded reframing, preserving the learner's objection while redirecting toward new conceptual ground ($\chi^2(3) = 24.00$, $p < .001$, $V = 1.000$). The dominance of scaffolded reframing (Aufhebung) over genuine mutual recognition (1/12) suggests that recognition prompts produce sophisticated pedagogical technique—the capacity to hold contradiction productively—rather than genuine mutual transformation.
9. **The learner superego paradox**: A symmetric learner-side evaluation (Section 6.16, N=118 bilateral dialogues) reveals that the multi-agent learner architecture *hurts* learner quality ($d = 1.43$, $F(1,114) = 68.28$, $p < .001$)—the largest effect in the study. The ego/superego process polishes away the messy, persona-consistent engagement that characterizes genuine student behavior. Recognition partially rescues multi-agent learner quality ($d = 0.79$, $p = .004$) while having no effect on already-high single-agent learner quality. On the tutor rubric, recognition helps both learner types robustly (+15.7 single, +13.0 multi; A×C n.s.); on the learner rubric, recognition helps multi-agent learners selectively (+9.5 vs -1.3 pts). The same mechanism—recognition as external validation that creates space for authentic engagement—counteracts the superego's tendency to over-process learner responses. Internal deliberation depth remains uniformly poor (2.7/5) regardless of recognition, confirming that recognition works *around* the superego rather than through it. The Hegelian interpretation is direct: external recognition from an Other is structurally more effective than internal self-critique.
10. **The superego as filter, not improver**: Dialectical superego modulation testing (Section 6.8, N=174) reveals that the multi-agent superego functions as a quality filter—preventing poor responses—rather than an active improver. Structural modulation metrics (negation depth, convergence) show large per-turn variation ($d = -2.01$ to $-2.45$) but do not predict outcome quality. The adversary persona over-defers to the ego under recognition, reducing its critical function. These findings reinforce the additivity thesis: architecture provides a floor through error correction, not a ceiling through generative contribution.
11. **Self-reflective evolution amplifies recognition**: When ego and superego are given between-turn self-reflection (Section 6.9, cells 40–45, N=90), recognition's effect size rises to $d = 0.91$—2.4$\times$ the dialectical-only condition ($d = 0.38$). A striking disposition gradient emerges: the more hostile the superego (suspicious +19.0, adversary +10.9, advocate +2.6), the more recognition helps—hostile dispositions become productive under recognition but are destructive without it. However, an insight-action gap persists: the superego's reflections acknowledge the need for change without producing fundamentally different critique behavior.
12. **Mechanisms require dynamic interlocutors**: Nine mechanisms (self-reflection, profiling, quantitative disposition, prompt erosion, intersubjective framing, combined, adversary, advocate, base dialectical) cluster within 2.4 pts under recognition when tested with scripted learners (Section 6.10, N=360). The scripted learner confound renders mechanisms causally inert—they cannot influence predetermined responses. When tested with dynamic (multi-agent) learners (N=300), mechanisms genuinely differentiate for the first time: profiling and combined mechanisms reach 88.8 and 87.8 while intersubjective framing reaches only 82.8—a 6.0-point spread. Recognition's effect doubles (+14 pts vs +7.5 scripted), and a Nemotron cross-model replication (N=360) confirms the pattern at lower absolute scores. A qualitative transcript assessment (Section 6.11) provides narrative evidence for the mechanism: recognition gives the ego the capacity to be *changed by* its internal critic rather than merely *compliant with* it.
13. **Prompt elaboration does not explain recognition effects**: A prompt elaboration baseline (Section 6.21, N=144) comparing the full 344-line base prompt against a 35-line naive prompt (JSON schema only, no pedagogical guidance) demonstrates that the recognition effect cannot be attributed to prompt length or instructional detail. On Haiku, the naive prompt *outperforms* the elaborate base by +6.8 pts—the prescriptive decision heuristics actively constrain the model's superior pedagogical intuitions. On Kimi K2.5, the elaborate prompt is inert ($\Delta = -0.3$). Recognition ($M = 90.9$ on Haiku) remains well above the naive baseline ($M = 82.5$), confirming that recognition adds value through relational orientation rather than instructional specificity. This addresses the deflationary concern that recognition effects might be an artifact of more detailed prompting: stripping 90% of the base prompt's content does not diminish scores, but recognition theory still provides a substantial further gain.
14. **Minimum ego capability threshold for mechanism benefit**: A cognitive prosthesis test (Section 6.10, N=90) pairing a weak ego (Nemotron) with a strong superego (Kimi K2.5) armed with the full mechanism suite demonstrates that architectural scaffolding is not model-agnostic. The mechanism stack that boosts Haiku by +20 points *hurts* Nemotron by $-15$ points, yielding scores ($M = 49.5$) well below Nemotron's own simple baseline ($M = 64.2$). Dimension analysis reveals a two-tier capability structure: Nemotron succeeds on static dimensions (specificity 4.0, actionability 4.0) but fails on dynamic context integration (tutor adaptation 1.8, 86% failure rate). A Haiku control smoke test (N=6) confirms the model-dependence: identical mechanisms score 90+ with Haiku. A contributing factor is silent superego failure—the Kimi K2.5 superego returns malformed JSON on 16–45% of reviews, auto-approving the ego's draft. The adversary superego prompt produces the most parseable output (11.5% failure) and the highest scores, suggesting that superego JSON reliability is a first-order concern for multi-agent deployments.
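As a sanity check on the statistics in finding 8, the reported $\chi^2(3) = 24.00$ and $V = 1.000$ follow directly from the strategy counts, since any perfectly separated 2×k contingency table yields $\chi^2 = N$. The placement of the one unaccounted-for recognition dialogue in a fourth strategy column is an assumption here:

```python
# Recompute chi-square, df, and Cramer's V for the strategy-by-condition
# table in finding 8. Base: 12/12 withdrawal. Recognition: 10/12 scaffolded
# reframing, 1/12 mutual recognition, and (assumed here) 1/12 in a fourth
# strategy column.
table = [
    [12, 0, 0, 0],   # base tutors
    [0, 10, 1, 1],   # recognition tutors
]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

chi2 = 0.0
for row, rt in zip(table, row_totals):
    for obs, ct in zip(row, col_totals):
        expected = rt * ct / n
        chi2 += (obs - expected) ** 2 / expected

df = (len(table) - 1) * (len(table[0]) - 1)
cramers_v = (chi2 / (n * min(len(table) - 1, len(table[0]) - 1))) ** 0.5

print(chi2, df, cramers_v)  # prints: 24.0 3 1.0
```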
These results suggest that operationalizing philosophical theories of intersubjectivity can produce concrete improvements in AI system performance. They also reveal boundary conditions: recognition theory's value varies by content domain and interaction type, and multi-agent architecture's value depends on deployment context. Perhaps most striking is the learner superego paradox (Finding 9): the largest single effect in the study ($d = 1.43$) comes not from what helps but from what hurts—internal self-critique degrades learner quality more than any other factor improves it. This underscores the paper's central Hegelian claim: genuine transformation requires encounter with an Other, not refinement by the Self.
The broader implication is for AI alignment. If mutual recognition is pedagogically superior, and if mutual recognition requires the AI to be genuinely shaped by human input, then aligned AI might need to be constitutionally open to transformation. Recognition-oriented AI does not just respond to humans; it is constituted, in part, through the encounter. We emphasize the distinction drawn in Section 3.3: these results demonstrate recognition-*oriented* design (level 3)—prompts that produce recognition-like behavior—not recognition *proper* (level 1), which would require genuine intersubjective consciousness. The pedagogical gains are real; the philosophical question of whether the AI truly *recognizes* the learner remains open.
In summary, this paper has connected Hegelian recognition theory to AI pedagogy (Section 3), implemented that theory through a multi-agent architecture grounded in Freudian structural theory (Section 4), and tested it empirically across thirty-seven key evaluations (Section 6). The central finding—that recognition-enhanced prompting is the dominant driver of tutoring improvement—was established through memory isolation (Section 6.2), confirmed in a full factorial (Section 6.3), partially corroborated by active control (Section 6.2), validated by an independent GPT-5.2 judge (Section 6.19), and further sharpened by a dialectical impasse test with resolution strategy coding (Section 6.20) showing that base tutors withdraw from dialectical encounter while recognition tutors hold and reframe contradiction—and a symmetric learner-side evaluation (Section 6.16) showing that recognition provides external self-regulation more effectively than internal ego/superego deliberation. Phase 2 experiments (Sections 6.8–6.11) deepen this understanding: the superego functions as a quality filter rather than an active improver, self-reflection amplifies recognition's effect, and mechanism differentiation requires dynamic interlocutors capable of genuine feedback loops. The theoretical framework, empirical methodology, and practical implications together suggest that philosophical theories of intersubjectivity can serve as productive design heuristics for AI systems.
## References
Tests whether recognition theory adds value beyond prompt engineering.
```bash
# Run the 3-way comparison (base, enhanced, recognition)
CELLS="cell_1_base_single_unified"
CELLS+=",cell_9_enhanced_single_unified"
CELLS+=",cell_5_recog_single_unified"
node scripts/eval-cli.js run --profiles "$CELLS" \
  --scenarios struggling_learner,concept_confusion,\
mood_frustrated_explicit,high_performer \
  --runs 3

# Analyze results
node scripts/eval-cli.js report <run-id>
```

```bash
# Run full factorial (8 cells × 15 scenarios × 3 reps)
CELLS="cell_1_base_single_unified,cell_2_base_single_psycho"
CELLS+=",cell_3_base_multi_unified,cell_4_base_multi_psycho"
CELLS+=",cell_5_recog_single_unified,cell_6_recog_single_psycho"
CELLS+=",cell_7_recog_multi_unified,cell_8_recog_multi_psycho"
node scripts/eval-cli.js run --profiles "$CELLS" --runs 3
```
### B.3 A×B Interaction Test
```bash
# Recognition vs Enhanced × Single vs Multi comparison
CELLS="cell_5_recog_single_unified,cell_7_recog_multi_unified"
CELLS+=",cell_9_enhanced_single_unified,cell_11_enhanced_multi_unified"
node scripts/eval-cli.js run --profiles "$CELLS" \
  --scenarios struggling_learner,concept_confusion,mood_frustrated_explicit \
  --runs 3
```
```bash
# Run with elementary content (4th grade fractions)
# Uses all 8 factorial cells × 5 elementary scenarios
CELLS="cell_1_base_single_unified,cell_2_base_single_psycho"
CELLS+=",cell_3_base_multi_unified,cell_4_base_multi_psycho"
CELLS+=",cell_5_recog_single_unified,cell_6_recog_single_psycho"
CELLS+=",cell_7_recog_multi_unified,cell_8_recog_multi_psycho"
EVAL_CONTENT_PATH=./content-test-elementary \
EVAL_SCENARIOS_FILE=./content-test-elementary/scenarios-elementary.yaml \
node scripts/eval-cli.js run --profiles "$CELLS" --runs 1
```
### B.5 Dynamic Prompt Rewriting Evolution

```bash
# Run cell_7 (static baseline) vs cell_21 (dynamic rewrite + Writing Pad)
node scripts/eval-cli.js run \
  --profiles cell_7_recog_multi_unified,cell_21_recog_multi_unified_rewrite \
  --scenarios misconception_correction_flow,\
mood_frustration_to_breakthrough,mutual_transformation_journey \
  --runs 5
```
### B.6 Resolution Strategy Coding (Section 6.20)
```bash
# Code impasse dialogues into Hegelian resolution strategies
node scripts/code-impasse-strategies.js \
  --model claude-opus-4.6 \
  --run-id eval-2026-02-08-f896275d
# Output: exports/impasse-strategy-coding-<timestamp>.json and .md
```
### B.7 Dialectical Superego Modulation (Section 6.8)
```bash
# Standard ego + divergent superego (cells 22-27)
CELLS="cell_22_base_suspicious_unified"
CELLS+=",cell_23_recog_suspicious_unified"
CELLS+=",cell_24_base_adversary_unified"
CELLS+=",cell_25_recog_adversary_unified"
CELLS+=",cell_26_base_advocate_unified"
CELLS+=",cell_27_recog_advocate_unified"
node scripts/eval-cli.js run --profiles "$CELLS" --runs 2

# Dialectical ego + divergent superego, multi-turn (cells 28-33)
CELLS="cell_28_base_dialectical_suspicious_unified"
CELLS+=",cell_29_recog_dialectical_suspicious_unified"
CELLS+=",cell_30_base_dialectical_adversary_unified"
CELLS+=",cell_31_recog_dialectical_adversary_unified"
CELLS+=",cell_32_base_dialectical_advocate_unified"
CELLS+=",cell_33_recog_dialectical_advocate_unified"
node scripts/eval-cli.js run --profiles "$CELLS" --runs 5
```
### B.8 Mechanism Robustness (Section 6.10)
```bash
# Scripted learner mechanisms (cells 40-59), Haiku ego
CELLS="cell_40_base_dialectical_suspicious_unified_superego"
CELLS+=",cell_41_recog_dialectical_suspicious_unified_superego"
CELLS+=",cell_42_base_dialectical_adversary_unified_superego"
CELLS+=",cell_43_recog_dialectical_adversary_unified_superego"
CELLS+=",cell_44_base_dialectical_advocate_unified_superego"
CELLS+=",cell_45_recog_dialectical_advocate_unified_superego"
CELLS+=",cell_46_base_dialectical_suspicious_unified_quantitative"
CELLS+=",cell_47_recog_dialectical_suspicious_unified_quantitative"
CELLS+=",cell_48_base_dialectical_suspicious_unified_erosion"
CELLS+=",cell_49_recog_dialectical_suspicious_unified_erosion"
CELLS+=",cell_50_base_dialectical_suspicious_unified_intersubjective"
CELLS+=",cell_51_recog_dialectical_suspicious_unified_intersubjective"
CELLS+=",cell_52_base_dialectical_suspicious_unified_combined"
CELLS+=",cell_53_recog_dialectical_suspicious_unified_combined"
CELLS+=",cell_54_base_dialectical_profile_tutor"
CELLS+=",cell_55_recog_dialectical_profile_tutor"
CELLS+=",cell_56_base_dialectical_profile_bidirectional"
CELLS+=",cell_57_recog_dialectical_profile_bidirectional"
CELLS+=",cell_58_recog_dialectical_profile_bidirectional_full"
CELLS+=",cell_59_recog_dialectical_profile_bidirectional_strategy"
node scripts/eval-cli.js run --profiles "$CELLS" --runs 2

# Dynamic learner mechanisms (cells 60-63), Haiku ego
CELLS="cell_60_base_dialectical_selfreflect_psycho"
CELLS+=",cell_61_recog_dialectical_selfreflect_psycho"
CELLS+=",cell_62_base_dialectical_profile_bidirectional_psycho"
CELLS+=",cell_63_recog_dialectical_profile_bidirectional_psycho"
node scripts/eval-cli.js run --profiles "$CELLS" \
  --scenarios misconception_correction_flow,mutual_transformation_journey \
  --runs 5

# Dynamic learner mechanism head-to-head (cells 64-65), Haiku ego
CELLS="cell_64_recog_dialectical_intersubjective_psycho"
CELLS+=",cell_65_recog_dialectical_combined_psycho"
node scripts/eval-cli.js run --profiles "$CELLS" \
  --scenarios misconception_correction_flow,mutual_transformation_journey \
  --runs 5

# Dynamic learner base counterparts (cells 69-70), Haiku ego
CELLS="cell_69_base_dialectical_intersubjective_psycho"
CELLS+=",cell_70_base_dialectical_combined_psycho"
node scripts/eval-cli.js run --profiles "$CELLS" \
  --scenarios misconception_correction_flow,mutual_transformation_journey \
  --runs 5
|
|
2270
|
+
```
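The repeated `CELLS+=` lines above can also be generated with a loop, which avoids manual comma bookkeeping when cell lists grow. A minimal bash sketch (profile names copied from the cells 60–63 block above; the `eval-cli.js` invocation itself is unchanged):

```bash
# Build a comma-joined profile list without hand-placed commas.
PROFILES=""
for P in \
  cell_60_base_dialectical_selfreflect_psycho \
  cell_61_recog_dialectical_selfreflect_psycho \
  cell_62_base_dialectical_profile_bidirectional_psycho \
  cell_63_recog_dialectical_profile_bidirectional_psycho
do
  # ${PROFILES:+...} expands to "$PROFILES," only once the list is non-empty,
  # so no leading or trailing comma is produced.
  PROFILES="${PROFILES:+$PROFILES,}$P"
done
echo "$PROFILES"
```

The resulting `$PROFILES` string is passed to `--profiles` exactly as `"$CELLS"` is above.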

### B.8b Prompt Elaboration Baseline (Section 6.21)

```bash
# Naive baseline, Haiku ego
node scripts/eval-cli.js run \
  --profiles cell_1_base_single_unified,cell_71_naive_single_unified \
  --runs 6

# Naive baseline, Kimi ego
node scripts/eval-cli.js run \
  --profiles cell_1_base_single_unified,cell_71_naive_single_unified \
  --runs 6 --model openrouter.kimi
```

### B.8c Token Budget Sensitivity (Section 6.22)

```bash
# Token budget dose-response (256, 512, 2048), Haiku ego
node scripts/eval-cli.js run \
  --profiles cell_1_base_single_unified,cell_5_recog_single_unified \
  --runs 3 --max-tokens 256

node scripts/eval-cli.js run \
  --profiles cell_1_base_single_unified,cell_5_recog_single_unified \
  --runs 3 --max-tokens 512

node scripts/eval-cli.js run \
  --profiles cell_1_base_single_unified,cell_5_recog_single_unified \
  --runs 3 --max-tokens 2048

# Base-only control at default 8000
node scripts/eval-cli.js run \
  --profiles cell_1_base_single_unified \
  --runs 3
```
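The three dose-response invocations differ only in `--max-tokens`, so the sweep can be written as a loop. A dry-run sketch (it echoes each command instead of executing it, so the generated invocations can be inspected first; remove the `echo` to run the sweep):

```bash
# Sweep the three token budgets from the dose-response design above.
for BUDGET in 256 512 2048; do
  echo "node scripts/eval-cli.js run" \
       "--profiles cell_1_base_single_unified,cell_5_recog_single_unified" \
       "--runs 3 --max-tokens $BUDGET"
done
```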

### B.9 Qualitative Transcript Assessment (Section 6.11)

```bash
# Assess transcripts with Opus
node scripts/assess-transcripts.js --run-id eval-2026-02-14-e0e3a622
node scripts/assess-transcripts.js --run-id eval-2026-02-07-b6d75e87
```

### B.10 Factor Effect Analysis

```sql
-- Factor effect analysis query
```

| Tutor Adaptation | 5% | Bilateral |
| Learner Growth | 5% | Bilateral |

Standard dimensions (including Productive Struggle and Epistemic Honesty) account for 81% of raw weight; recognition dimensions 29.9%; bilateral dimensions 10%. Raw weights total 120.9% and are normalized at scoring time. Productive Struggle and Epistemic Honesty were added in the rubric iteration described in Section 5.1, with corresponding reductions to Actionability and Tone (10% → 8% each). The bilateral dimensions (`tutor_adaptation`, `learner_growth`) specifically measure the mutual transformation claim—see Section 6.15.
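The normalization step can be made concrete. A minimal Python sketch, assuming normalization is a simple proportional rescale of the raw weights (the category totals are taken from the paragraph above; the actual scoring code may rescale per-dimension rather than per-category):

```python
# Category-level raw weights from Appendix C.2, in percent; they total 120.9.
raw = {"standard": 81.0, "recognition": 29.9, "bilateral": 10.0}

total = sum(raw.values())
normalized = {k: v / total for k, v in raw.items()}  # rescale to sum to 1.0

# Effective shares after normalization: ~67.0%, ~24.7%, ~8.3%.
shares = {k: round(100 * v, 1) for k, v in normalized.items()}
```

Under this rescale the bilateral dimensions contribute about 8.3% of the final score rather than their raw 10%.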

### C.3 Recognition Dimension Criteria

## Appendix D: Reproducibility and Key Evaluation Run IDs

Evaluation commands are documented in Appendix B. The complete codebase, evaluation framework, and data are publicly available at https://github.com/liammagee/machinespirits-eval. The thirty-seven key evaluations are listed below (b6d75e87 serves both bilateral transformation and learner-side evaluation; eval-2026-02-11-35c53e99 and eval-2026-02-11-5f6d51f5 are combined as one dialectical modulation evaluation):

| Finding | Run ID | Section |
|---------|--------|---------|
| Active control (post-hoc) | eval-2026-02-06-a9ae06ee | 6.2 |
| Full factorial, cells 1–5,7 (Kimi) | eval-2026-02-03-f5d4dd93 | 6.3 |
| Full factorial, cells 6,8 re-run (Kimi) | eval-2026-02-06-a933d745 | 6.3 |
| A×B replication (Kimi) | eval-2026-02-05-10b344fb | 6.4 |
| A×B probe: Nemotron | eval-2026-02-07-722087ac | 6.4 |
| A×B probe: DeepSeek V3.2 | eval-2026-02-07-70ef73a3 | 6.4 |
| A×B probe: GLM-4.7 | eval-2026-02-07-6b3e6565 | 6.4 |
| A×B probe: Claude Haiku 4.5 | eval-2026-02-07-6ead24c7 | 6.4 |
| Domain generalizability (Kimi) | eval-2026-02-05-e87f452d | 6.5 |
| Dynamic rewrite evolution (run 1) | eval-2026-02-05-daf60f79 | 6.18 |
| Dynamic rewrite evolution (run 2) | eval-2026-02-05-49bb2017 | 6.18 |
| Dynamic rewrite evolution (run 3) | eval-2026-02-05-12aebedb | 6.18 |
| Bilateral transformation (multi-turn) | eval-2026-02-07-b6d75e87 | 6.15 |
| Hardwired rules ablation (Kimi) | eval-2026-02-08-65a6718f | 6.7 |
| Dialectical impasse test | eval-2026-02-08-f896275d | 6.20 |
| Learner-side evaluation (symmetric) | eval-2026-02-07-b6d75e87 | 6.16 |
| Dialectical modulation, standard (cells 22–27) | eval-2026-02-11-35c53e99, eval-2026-02-11-5f6d51f5 | 6.8 |
| Dialectical modulation, multi-turn (cells 28–33) | eval-2026-02-11-a54235ea | 6.8 |
| Self-reflective evolution (cells 40–45) | eval-2026-02-13-8d40e086 | 6.9 |
| Mechanism robustness, scripted (cells 40–59) | eval-2026-02-14-e0e3a622 | 6.10 |
| Dynamic learner mechanisms (cells 60–63) | eval-2026-02-14-6c033830 | 6.10 |
| Dynamic learner mechanisms (cells 64–65) | eval-2026-02-14-a2b2717c | 6.10 |
| Mechanism robustness, Nemotron (cells 40–59) | eval-2026-02-14-49b33fdd | 6.10 |
| Self-reflect Nemotron non-replication (cells 40–45) | eval-2026-02-14-559d854b | 6.9 |
| Cognitive prosthesis (cells 66–68, Nemotron) | eval-2026-02-17-25aaae85 | 6.10 |
| Cognitive prosthesis smoke test (Haiku) | eval-2026-02-18-f489c0ea | 6.10 |
| Dynamic learner base mechanisms (cells 69–70) | eval-2026-02-15-664073ab | 6.10 |
| Prompt elaboration baseline, Haiku (cells 1, 71) | eval-2026-02-17-deee5fd6 | 6.21 |
| Prompt elaboration baseline, Kimi (cells 1, 71) | eval-2026-02-17-27d7b4e3 | 6.21 |
| Token budget 256, Haiku (run 1) | eval-2026-02-17-0eb3de77 | 6.22 |
| Token budget 256, Haiku (run 2) | eval-2026-02-17-5a640782 | 6.22 |
| Token budget 512, Haiku | eval-2026-02-17-5f281654 | 6.22 |
| Token budget 2048, Haiku | eval-2026-02-17-0f6dcd97 | 6.22 |
| Token budget default (8000), Haiku | eval-2026-02-17-d32ed226 | 6.22 |

---

## Appendix E: Revision History

**v1.0** (2026-02-04)
: Initial draft with 2×2×2 factorial design, memory isolation, three-way comparison.

**v1.1** (2026-02-06)
: Added corrected memory isolation experiment (N=120), active control (N=118), cells 6 & 8 re-run, cross-judge GPT-5.2 analysis. Corrected GPT-5.2 effect sizes (d=1.15→0.99, d=0.50→0.29) after deduplication of rejudge rows. Dropped dead partial run (e617e757).

**v1.2** (2026-02-06)
: **Critical correction**: Reframed "placebo control" as "post-hoc active control." The original v1.1 analysis compared the active control (Nemotron, M=66.5) to factorial base (Kimi K2.5, M=78.8) and reported d=-1.03, but this compared different ego models. Same-model historical data shows Nemotron base $\approx$ 58, making the active control $\approx$ +9 pts above base (not below). Reframed throughout: generic pedagogical elaboration provides partial benefit (~+9 pts above base) but recognition gains are substantially larger (~+15 pts). Acknowledged post-hoc design and active (not inert) control content.

**v1.3--v1.4** (2026-02-06)
: Intermediate revisions: corrected factorial with re-run cells 6, 8 (a933d745); updated A×C interaction values; qualitative analysis additions; production quality fixes. Superseded by v1.5.

**v1.5** (2026-02-07)
: **Rubric iteration**: Updated to 14-dimension rubric with dialogue transcript context, Productive Struggle (5%), and Epistemic Honesty (5%) dimensions (Actionability/Tone reduced 10%→8%). Re-scored cells 6, 8 (N=88) with identical responses: minimal change (+0.5, +0.6 pts), confirming calibration preserved. Added holistic dialogue evaluation for multi-turn transcripts. Cross-judge replication on updated rubric (r=0.55, N=88, GPT/Opus ratio=0.87). Updated Table 6, main effects, A×C interaction values, Appendix C.2 weight table, and Section 6.18 cross-judge tables.

**v1.6** (2026-02-08)
: **Content isolation fix**: Identified and fixed two bugs causing cross-domain content leakage in elementary scenarios: (a) `buildCurriculumContext()` fallback that scanned all courses when no content hint was provided, serving philosophy listings to elementary scenarios; (b) hardcoded `479-lecture-*` IDs in tutor ego prompt examples that the model copied when no curriculum anchor was present. Updated Sections 6.5, 6.6, 7.4, 7.8, and 8 to reframe "model hallucination" as system-level content isolation failures.

**v1.7** (2026-02-08)
: **Hardwired rules ablation**: Added Section 6.7 with superego rules embedded in ego prompt (cells 13--14, N=72, eval-2026-02-08-65a6718f, Opus judge). Static rules fail to replicate the Superego's benefit, confirming the value lies in contextual judgment rather than rule enforcement. Added Table 10b, updated Tables 2/D and paper totals.

**v1.8** (2026-02-08)
: **Dialectical impasse test**: Added Section 6.20 with three 5-turn impasse scenarios (epistemic resistance, affective shutdown, productive deadlock; N=24, eval-2026-02-08-f896275d, Opus judge). Recognition produces +43 pts on epistemic and +29 pts on interpretive impasses but $\Delta=-1.1$ on affective shutdown---sharpening the theoretical claim to epistemological rather than affective recognition.

**v1.9** (2026-02-08)
: **Learner superego paradox**: Added symmetric learner-side evaluation (Section 6.16) scoring N=118 bilateral dialogues with 6-dimension learner rubric (eval-2026-02-07-b6d75e87, Opus judge). Multi-agent learner architecture hurts learner quality (d=1.43, F=68.28, p<.001)---the largest effect in the study. Recognition partially rescues multi-agent learners (d=0.79, p=.004) but not single-agent (n.s.). Added learner rubric description to §5.1, new §6.12, rewrote §7.5 with results, added finding #9 to §9.

**v2.0** (2026-02-08)
: **Resolution strategy coding**: Post-hoc qualitative coding of all 24 dialectical impasse dialogues into five Hegelian resolution strategies. Perfect separation: 12/12 base tutors withdraw, 10/12 recognition tutors use scaffolded reframing (Aufhebung pattern). $\chi^2(3)=24.00$, $p<.001$, $V=1.000$. Cross-judge validation with GPT-5.2: $\kappa=0.84$. Added Tables 26--28, per-turn strategy evolution analysis.

**v2.1** (2026-02-08)
: **AI theme discovery & figure regeneration**: Added §6.13.4 AI-assisted theme discovery (N=300) showing near-perfect bimodal separation. Added Figure 6 (word clouds). Regenerated all figures from Python with corrected data and larger text. Removed standalone §10 Reproducibility (merged into Appendix D). Moved Appendix E after other appendices. Increased font to 12pt.

**v2.1.1** (2026-02-10)
: **Consistency fixes**: Corrected stale N=1,628/twenty → N=1,700/twenty-one in abstract, introduction, and conclusion. Fixed dynamic rewrite section references in Tables 2 and D. Added hardwired rules ablation and learner-side evaluation to Appendix D run list (was 19 rows, now 21). Fixed inter-judge reliability cross-reference in §8.1.

**v2.1.2** (2026-02-10)
: **Review corrections** (30 fixes): Table 7b Kimi row corrected to single-learner cells (N=350→179, Recognition +10.2→+15.5, Interaction -1.5→+0.5) matching probe design; total probe N 826→655. Factor C in Discussion corrected (-1.7 pts, F=2.56). Stale A×C values updated. Dynamic rewrite swing corrected (+16.7→+8.7 delta). Terminology standardized (unified→single-agent, behaviour→behavior).

**v2.2.0** (2026-02-11)
: **Modulation and learning outcomes**: Added §6.11.1 (modulation metrics, N=350 post-hoc) showing multi-agent architecture does not increase behavioral range (d=0.05); recognition produces calibration not oscillation (dimension variance d=$-$1.00, F=87.69). Added §6.11.2 (synthetic learning outcome index, N=118). Extended §7.4 Discussion with phronesis reframing. Regenerated Figures 4 and 6.

**v2.3.0** (2026-02-14)
: **Phase 2 experimental results**: Added four new Results sections: §6.8 Dialectical Superego Modulation (cells 22--33, N=174, Tables 13--15); §6.9 Self-Reflective Evolution (cells 40--45, N=36, Tables 16--17); §6.10 Mechanism Robustness (cells 40--59 N=360 + cells 60--63 N=120, Tables 18--19); §6.11 Qualitative Transcript Assessment (Tables 20--21). Added §7.10 Scripted Learner Confound, §7.11 Practical Recommendations (6 recommendations). Expanded §6.10 with cells 64--65 and Nemotron cross-model replication (N=279, 49b33fdd). Renumbered all tables sequentially (1--48). Trimmed abstract from ~650 to ~250 words. Paper totals: N=2,700 across 28 key evaluations.

**v2.3.1** (2026-02-15)
: **Cognitive prosthesis and cross-judge completion**: Added cognitive prosthesis test (cells 66--68, N=60). Completed GPT-5.2 cross-judge validation of mechanism robustness (N=360 paired, r=0.59). Added Nemotron self-reflect non-replication (559d854b, N=60) to §6.9/Table 17. Added blinded qualitative assessment validation (Table 21b).

**v2.3.2** (2026-02-15)
: **Sample reconciliation and count update**: Added Phase 2 evaluations to Table 2 (9 additional rows). Updated paper totals from 28 to 30 key evaluations, N=2,909 scored. Added 50487df7 (cognitive prosthesis) to Appendix B.6. Noted Sonnet judge used for two late-stage evaluations.

**v2.3.3** (2026-02-15)
: **Complete Table 19 with base mechanism cells**: Added cells 69--70 (eval-2026-02-15-664073ab, N=60, Opus judge) completing the base row of Table 19. Recognition delta remarkably consistent across all 4 mechanisms (+13.3 to +15.1). Updated paper totals from 30 to 31 evaluations, N=2,969.

**v2.3.4** (2026-02-15)
: **Related Work expansion for arXiv/edArXiv submission**: Expanded §2 from 8 to 10 subsections. Added §2.3 LLM-as-Judge Evaluation Methodology, §2.7 Theory of Mind in AI Agents. Expanded §2.1 with empirical LLM tutoring studies, §2.2 with multi-agent systems and self-correction limits. Added 15 new bib entries.

**v2.3.5** (2026-02-15)
: **Same-model blinded assessment**: Ran Opus-blinded qualitative assessment (N=118) resolving the model calibration confound. Key finding: blinding barely changes Opus's tag assignments, confirming the near-perfect binary separation is real, not an assessor bias artifact. Updated §6.11 interpretation, revised §8.2 limitation.

**v2.3.6** (2026-02-16)
: **Judge version unification**: Rejudged all early runs (originally scored under Opus 4.5) with Opus 4.6, eliminating version drift across the evaluation dataset. Updated §8.1 limitations. Cleaned 6 empty/failed generation rows from dynamic rewrite runs.

**v2.3.7** (2026-02-17)
: **Self-reflective evolution complete**: Updated §6.9 from partial (N=36) to complete (N=90) results for eval-2026-02-13-8d40e086. Recognition d=0.91 (was 1.02 at N=36). Key new finding: disposition gradient---suspicious +19.0, adversary +10.9, advocate +2.6. Updated Table 16, Table 17, Discussion finding 11. Deduped 270 re-judging artifact rows.

**v2.3.8** (2026-02-17)
: **Nemotron mechanism N-count update**: Updated eval-2026-02-14-49b33fdd from N=301 to N=360 after run resumption completed. Cascaded count changes through abstract, introduction, Table 2, §7.11, §9. Updated §6.10 Nemotron narrative. Noted bidirectional profiling anomaly ($\Delta=-0.6$).

**v2.3.9** (2026-02-17)
: **Factorial re-judging cascade**: All early runs re-judged with Opus 4.6 (unifying judge version). Full factorial ANOVA (N=350): recognition F=110.04 p<.001 $\eta^2$=.243 d=1.11 (was F=71.36, d=0.80). A×C interaction disappears (F=0.97, p=.325; was F=21.85, p<.001)---recognition now consistent across learner types. Updated Tables 4, 6, 8, 9, 9b, 12, 17, 41, 42. Restored 219 GPT-5.2 rows lost during dedup. Updated GPT compression ratio from ~58% to 37--59%.

**v2.3.10** (2026-02-17)
: **Prompt elaboration baseline**: Added §6.21 comparing 344-line base prompt against 35-line naive prompt (JSON schema only). Two runs: Haiku (N=72) and Kimi (N=72). Key finding: elaborate prompt hurts Haiku (+6.8 pts for naive) and is inert on Kimi ($\Delta=-0.3$). Recognition ($M=90.9$) remains well above naive ($M=82.5$). Added Table 20b, conclusion finding 13. Updated paper totals to N=3,292 across thirty-one evaluations.

**v2.3.11** (2026-02-17)
: **Transcript figures**: Added Figure 10 (naive vs base high\_performer comparison panel) to §6.21. Added Figure 11 (bilateral mutual\_transformation\_journey transcript comparison) to §6.11. Added `generate-paper-figures.js` script for reproducible paper figure generation.

**v2.3.12** (2026-02-17)
: **Token budget sensitivity**: Added §6.22 testing whether constraining `max_tokens` from 8000 to 256--2048 affects evaluation scores. Scores are flat across all budget levels; the recognition effect is fully preserved even at 256 tokens. The retry-absorption mechanism means truncated structured output self-heals. Added Table 49, recommendation 8 to §7.11. Updated paper totals to N=3,454 scored across thirty-six evaluations.

**v2.3.13** (2026-02-17)
: **Paper correctness fixes**: Fixed eval-2026-02-14-559d854b scope from "cells 40--59, N=167" to "cells 40--45, N=60" in Table 2 (only cells 40--45 used; cells 46--59 superseded by 49b33fdd at N=360). Fixed broken Table 10 reference in §6.6. Fixed dynamic-learner N inconsistency: intro and finding 12 updated to N=300 (6c033830 + a2b2717c + 664073ab). Clarified token budget §6.22 design text. Added missing Appendix B commands.

**v2.3.14** (2026-02-18)
: **Cognitive prosthesis re-run and analysis**: Replaced misconfigured prosthesis run (50487df7, all cells fell back to default Haiku) with corrected eval-2026-02-17-25aaae85 (N=90, Nemotron ego, Kimi K2.5 superego, Opus judge). Prosthesis hypothesis fails decisively: full mechanism stack scores 49.5 vs Nemotron simple base 64.2 ($\Delta=-15$). Added dimension analysis (two-tier static/dynamic capability model), superego parse failure analysis (16--45% malformed JSON auto-approves), and Haiku control smoke test (eval-2026-02-18-f489c0ea, N=6, confirming model-dependence). Added conclusion finding 14 (minimum ego capability threshold), recommendation 9 to §7.11, three future work items to §8.2 (parse robustness, capability threshold mapping, adaptive mechanism loading). Updated Table 2, Appendix D run IDs. Paper totals: N=3,383 across thirty-seven evaluations.