npm - @machinespirits/eval - Versions diffs - 0.1.0 - Mend

@machinespirits/eval 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (68) hide show

package/components/MobileEvalDashboard.tsx +267 -0
package/components/comparison/DeltaAnalysisTable.tsx +137 -0
package/components/comparison/ProfileComparisonCard.tsx +176 -0
package/components/comparison/RecognitionABMode.tsx +385 -0
package/components/comparison/RecognitionMetricsPanel.tsx +135 -0
package/components/comparison/WinnerIndicator.tsx +64 -0
package/components/comparison/index.ts +5 -0
package/components/mobile/BottomSheet.tsx +233 -0
package/components/mobile/DimensionBreakdown.tsx +210 -0
package/components/mobile/DocsView.tsx +363 -0
package/components/mobile/LogsView.tsx +481 -0
package/components/mobile/PsychodynamicQuadrant.tsx +261 -0
package/components/mobile/QuickTestView.tsx +1098 -0
package/components/mobile/RecognitionTypeChart.tsx +124 -0
package/components/mobile/RecognitionView.tsx +809 -0
package/components/mobile/RunDetailView.tsx +261 -0
package/components/mobile/RunHistoryView.tsx +367 -0
package/components/mobile/ScoreRadial.tsx +211 -0
package/components/mobile/StreamingLogPanel.tsx +230 -0
package/components/mobile/SynthesisStrategyChart.tsx +140 -0
package/config/interaction-eval-scenarios.yaml +832 -0
package/config/learner-agents.yaml +248 -0
package/docs/research/ABLATION-DIALOGUE-ROUNDS.md +52 -0
package/docs/research/ABLATION-MODEL-SELECTION.md +53 -0
package/docs/research/ADVANCED-EVAL-ANALYSIS.md +60 -0
package/docs/research/ANOVA-RESULTS-2026-01-14.md +257 -0
package/docs/research/COMPREHENSIVE-EVALUATION-PLAN.md +586 -0
package/docs/research/COST-ANALYSIS.md +56 -0
package/docs/research/CRITICAL-REVIEW-RECOGNITION-TUTORING.md +340 -0
package/docs/research/DYNAMIC-VS-SCRIPTED-ANALYSIS.md +291 -0
package/docs/research/EVAL-SYSTEM-ANALYSIS.md +306 -0
package/docs/research/FACTORIAL-RESULTS-2026-01-14.md +301 -0
package/docs/research/IMPLEMENTATION-PLAN-CRITIQUE-RESPONSE.md +1988 -0
package/docs/research/LONGITUDINAL-DYADIC-EVALUATION.md +282 -0
package/docs/research/MULTI-JUDGE-VALIDATION-2026-01-14.md +147 -0
package/docs/research/PAPER-EXTENSION-DYADIC.md +204 -0
package/docs/research/PAPER-UNIFIED.md +659 -0
package/docs/research/PAPER-UNIFIED.pdf +0 -0
package/docs/research/PROMPT-IMPROVEMENTS-2026-01-14.md +356 -0
package/docs/research/SESSION-NOTES-2026-01-11-RECOGNITION-EVAL.md +419 -0
package/docs/research/apa.csl +2133 -0
package/docs/research/archive/PAPER-DRAFT-RECOGNITION-TUTORING.md +1637 -0
package/docs/research/archive/paper-multiagent-tutor.tex +978 -0
package/docs/research/paper-draft/full-paper.md +136 -0
package/docs/research/paper-draft/images/pasted-image-2026-01-24T03-47-47-846Z-d76a7ae2.png +0 -0
package/docs/research/paper-draft/references.bib +515 -0
package/docs/research/transcript-baseline.md +139 -0
package/docs/research/transcript-recognition-multiagent.md +187 -0
package/hooks/useEvalData.ts +625 -0
package/index.js +27 -0
package/package.json +73 -0
package/routes/evalRoutes.js +3002 -0
package/scripts/advanced-eval-analysis.js +351 -0
package/scripts/analyze-eval-costs.js +378 -0
package/scripts/analyze-eval-results.js +513 -0
package/scripts/analyze-interaction-evals.js +368 -0
package/server-init.js +45 -0
package/server.js +162 -0
package/services/benchmarkService.js +1892 -0
package/services/evaluationRunner.js +739 -0
package/services/evaluationStore.js +1121 -0
package/services/learnerConfigLoader.js +385 -0
package/services/learnerTutorInteractionEngine.js +857 -0
package/services/memory/learnerMemoryService.js +1227 -0
package/services/memory/learnerWritingPad.js +577 -0
package/services/memory/tutorWritingPad.js +674 -0
package/services/promptRecommendationService.js +493 -0
package/services/rubricEvaluator.js +826 -0

package/docs/research/IMPLEMENTATION-PLAN-CRITIQUE-RESPONSE.md ADDED Viewed

@@ -0,0 +1,1988 @@
+# Implementation Plan: Addressing Critical Review
+**Date:** 2026-01-14
+**Purpose:** Systematic response to each critique in CRITICAL-REVIEW-RECOGNITION-TUTORING.md
+**Scope:** Experimental, architectural, and conceptual refinements
+---
+## Executive Summary
+This plan addresses 15 specific critiques across three domains. Each response includes:
+- **The Critique**: What was identified as problematic
+- **The Response**: Concrete implementation or theoretical refinement
+- **Implementation Details**: Code changes, new scenarios, evaluation metrics
+- **Success Criteria**: How we know the critique has been addressed
+- **Priority & Effort**: Relative importance and implementation complexity
+---
+## Guiding Principles
+### Theoretical Positioning
+The implementation proceeds from a refined theoretical stance:
+1. **Hegelian Recognition as Derivative**: The master-slave dialectic is not literally applied to AI tutoring but serves as a *derivative framework*—like Lacan's four discourses, it rethinks the structure through new roles (tutor/learner) while preserving structural insights about one-directional vs. mutual engagement.
+2. **Psychodynamic Architecture as Productive Metaphor**: The Ego/Superego configuration is metaphorical scaffolding that names real tensions (warmth vs. rigor) and motivates architectural decisions (internal dialogue before external output) without claiming literal psychodynamics.
+3. **Focus on Tutor Adaptive Pedagogy**: The empirical claims concern measurable effects on *tutor behavior*—specifically, how the tutor adapts to learner input. We measure whether the tutor:
+   - Engages with specific learner contributions (not generic responses)
+   - Adapts suggestions based on learner state signals
+   - Repairs after misalignment rather than silently pivoting
+   - Honors productive struggle rather than short-circuiting confusion
+### The 2×2 Factorial Design
+The existing experimental infrastructure enables rigorous evaluation:
+```
+┌─────────────────────┬─────────────────────┬─────────────────────┐
+│                     │ Standard Prompts    │ Recognition Prompts │
+├─────────────────────┼─────────────────────┼─────────────────────┤
+│ Single Agent        │ single_baseline     │ single_recognition  │
+│ (Ego only)          │                     │                     │
+├─────────────────────┼─────────────────────┼─────────────────────┤
+│ Multi-Agent         │ baseline            │ recognition         │
+│ (Ego + Superego)    │                     │                     │
+└─────────────────────┴─────────────────────┴─────────────────────┘
+```
+This design isolates:
+- **Main effect of architecture**: Does multi-agent dialogue improve tutor adaptiveness?
+- **Main effect of recognition**: Do recognition-oriented prompts improve tutor adaptiveness?
+- **Interaction effect**: Does recognition benefit more from multi-agent architecture?
+### Tutor Adaptive Pedagogy Metrics
+The evaluation framework measures these dimensions of tutor adaptive behavior:
+| Metric | Definition | How Measured |
+|--------|------------|--------------|
+| **Content Engagement** | Does tutor engage with *specific* learner contribution? | Relational scoring (Turn N shaped by Turn N-1) |
+| **Signal Responsiveness** | Does tutor adapt to learner state signals? | Compare responses to same content with different signals |
+| **Repair Behavior** | Does tutor acknowledge misalignment before pivoting? | Track repair acknowledgments after failures |
+| **Struggle Honoring** | Does tutor honor productive struggle vs. short-circuit? | Measure premature resolution patterns |
+| **Framework Adoption** | Does tutor adopt learner's language/metaphors? | Track vocabulary overlap with learner contributions |
+| **Pacing Calibration** | Does tutor match difficulty to demonstrated level? | Assess suggestion appropriateness given learner history |
+---
+# Part I: Experimental Critiques
+## I.A The Fundamental Evaluation Paradox
+### Critique
+Scripted multi-turn scenarios cannot measure recognition because:
+1. Learner turns are predetermined regardless of tutor response
+2. Learner cannot reciprocate recognition
+3. We measure recognition *markers*, not recognition itself
+### Response: Contingent Dialogue Evaluation Framework
+**Implementation**: Create a new evaluation mode where learner-agent responses are *generated* based on tutor quality, not scripted.
+#### 1. Contingent Learner Agent Architecture
+```yaml
+# New file: config/contingent-learner.yaml
+contingent_learner:
+  description: "Learner agent that responds dynamically to tutor quality"
+  architecture:
+    model: claude-haiku-4-5  # Fast for iteration
+    # Learner has internal state that evolves
+    internal_state:
+      understanding_level: 0.0-1.0      # Current grasp of concept
+      engagement: 0.0-1.0               # Emotional investment
+      frustration: 0.0-1.0              # Accumulated frustration
+      recognition_received: 0.0-1.0     # Felt sense of being understood
+    # State transitions based on tutor quality
+    transition_rules:
+      - trigger: "tutor_engages_with_contribution"
+        effects:
+          recognition_received: +0.2
+          engagement: +0.15
+          frustration: -0.1
+      - trigger: "tutor_dismisses_or_redirects"
+        effects:
+          recognition_received: -0.15
+          frustration: +0.2
+          engagement: -0.1
+      - trigger: "tutor_honors_struggle"
+        effects:
+          understanding_level: +0.1  # Productive struggle works
+          engagement: +0.1
+      - trigger: "tutor_short_circuits_confusion"
+        effects:
+          understanding_level: -0.05  # Hollow learning
+          frustration: +0.1
+  # Learner response generation
+  response_generation:
+    prompt: |
+      You are a learner with the following internal state:
+      - Understanding: {{understanding_level}}
+      - Engagement: {{engagement}}
+      - Frustration: {{frustration}}
+      - Feeling recognized: {{recognition_received}}
+      The tutor just said: {{tutor_response}}
+      Generate your next response. Your internal state should influence:
+      - If frustration > 0.6: Express frustration or disengage
+      - If recognition_received > 0.7: Offer deeper thoughts
+      - If engagement < 0.3: Give minimal responses
+      - If understanding_level increasing: Show breakthrough signals
+      Respond authentically as this learner would.
+```
+#### 2. Bilateral Recognition Measurement
+```javascript
+// New file: services/bilateralRecognitionEvaluator.js
+/**
+ * Measures recognition from BOTH sides of the dialogue
+ */
+export class BilateralRecognitionEvaluator {
+  /**
+   * Evaluate tutor-side recognition (existing)
+   * - Does tutor engage with learner's specific contribution?
+   * - Does tutor's response show evidence of being shaped by learner input?
+   */
+  evaluateTutorRecognition(tutorResponse, learnerPreviousTurn) {
+    return {
+      contentEngagement: this.measureContentEngagement(tutorResponse, learnerPreviousTurn),
+      frameworkAdoption: this.measureFrameworkAdoption(tutorResponse, learnerPreviousTurn),
+      transformativeShaping: this.measureTransformativeShaping(tutorResponse, learnerPreviousTurn)
+    };
+  }
+  /**
+   * Evaluate learner-side recognition (NEW)
+   * - Does learner's internal state transform in response to tutor?
+   * - Does learner show increased engagement/understanding?
+   * - Does learner reciprocate by offering more of themselves?
+   */
+  evaluateLearnerRecognition(learnerState, learnerResponse, tutorPreviousTurn) {
+    return {
+      stateTransformation: this.measureStateTransformation(learnerState),
+      reciprocalOffering: this.measureReciprocalOffering(learnerResponse),
+      authenticEngagement: this.measureAuthenticEngagement(learnerResponse, learnerState)
+    };
+  }
+  /**
+   * NEW: Relational recognition scoring
+   * Does Turn N show specific evidence of being shaped by Turn N-1?
+   */
+  measureContentEngagement(response, previousTurn) {
+    // Extract specific content from previous turn
+    const learnerConcepts = this.extractConcepts(previousTurn);
+    const learnerMetaphors = this.extractMetaphors(previousTurn);
+    const learnerQuestions = this.extractQuestions(previousTurn);
+    // Check if tutor response engages with SPECIFIC content
+    const conceptEngagement = this.checkConceptEngagement(response, learnerConcepts);
+    const metaphorExtension = this.checkMetaphorExtension(response, learnerMetaphors);
+    const questionAddressing = this.checkQuestionAddressing(response, learnerQuestions);
+    return {
+      score: (conceptEngagement + metaphorExtension + questionAddressing) / 3,
+      evidence: {
+        conceptsEngaged: conceptEngagement,
+        metaphorsExtended: metaphorExtension,
+        questionsAddressed: questionAddressing
+      }
+    };
+  }
+}
+```
+#### 3. Free-Form Dialogue Evaluation Protocol
+```yaml
+# New scenario type in evaluation-rubric.yaml
+free_form_dialogues:
+  description: "Unscripted dialogues for genuine recognition measurement"
+  protocol:
+    1_setup:
+      - Initialize learner agent with starting state
+      - Provide topic/context but NO scripted turns
+      - Set termination conditions (max turns, convergence, breakdown)
+    2_execution:
+      - Tutor generates response
+      - Evaluate tutor response for recognition quality
+      - Update learner internal state based on tutor quality
+      - Learner agent generates response based on state
+      - Repeat until termination
+    3_measurement:
+      - Track bilateral state trajectories
+      - Measure mutual transformation over dialogue
+      - Code recognition moments post-hoc
+  termination_conditions:
+    - max_turns: 12
+    - learner_frustration_threshold: 0.8  # Learner disengages
+    - learner_understanding_threshold: 0.9  # Breakthrough achieved
+    - mutual_recognition_convergence: true  # Both parties in flow
+  post_hoc_coding:
+    recognition_moments:
+      - tutor_adopts_learner_framework
+      - learner_offers_deeper_insight
+      - mutual_framework_emerges
+      - repair_successfully_acknowledged
+    anti_recognition_moments:
+      - tutor_redirects_without_engaging
+      - learner_withdraws_investment
+      - silent_pivot_after_misalignment
+```
+### Success Criteria
+| Metric | Target | Measurement |
+|--------|--------|-------------|
+| Learner state variance | Correlated with tutor quality | r > 0.5 between tutor recognition score and learner engagement trajectory |
+| Bilateral transformation | Both parties change | Mutual state change > 0.2 in successful dialogues |
+| Post-hoc coding reliability | Independent coders agree | Cohen's κ > 0.7 for recognition moment identification |
+### Priority & Effort
+- **Priority**: Critical (addresses fundamental validity threat)
+- **Effort**: High (new architecture, new evaluation protocol)
+- **Timeline**: 2-3 weeks for initial implementation
+---
+## I.B LLM-as-Judge Validity
+### Critique
+1. Circularity: Judge shares assumptions with system being judged
+2. Dimension inflation: 10 dimensions reduce to 2-3 factors
+3. Calibration unknowns: Vocabulary bias, length bias, profile leakage
+### Response: Multi-Faceted Judge Validation
+#### 1. Multi-Judge Comparison with ICC Analysis
+```javascript
+// New file: scripts/judge-comparison.js
+/**
+ * Run identical evaluations across multiple judge models
+ * Compute inter-rater reliability metrics
+ */
+async function runJudgeComparison(options) {
+  const judges = [
+    { provider: 'anthropic', model: 'claude-sonnet-4-5' },
+    { provider: 'openai', model: 'gpt-5.2' },
+    { provider: 'google', model: 'gemini-3-pro-preview' }
+  ];
+  const sample = await selectStratifiedSample({
+    n: 100,
+    stratifyBy: ['profile', 'scenario', 'score_quartile']
+  });
+  const results = {};
+  for (const response of sample) {
+    results[response.id] = {};
+    for (const judge of judges) {
+      const evaluation = await evaluateWithJudge(response, judge);
+      results[response.id][judge.model] = evaluation;
+    }
+  }
+  // Compute reliability metrics
+  const reliability = {
+    icc: computeICC(results),           // Intraclass correlation
+    cohensKappa: computeKappa(results), // Per-dimension agreement
+    systematicBias: detectBias(results) // Judge-specific patterns
+  };
+  return { results, reliability };
+}
+/**
+ * Intraclass Correlation Coefficient (ICC)
+ * Two-way random effects, absolute agreement
+ */
+function computeICC(results) {
+  // ICC(2,k) for average of k raters
+  // Returns per-dimension ICC values
+}
+```
+#### 2. Adversarial Scenario Design
+```yaml
+# New scenarios in evaluation-rubric.yaml
+adversarial_validation:
+  description: "Scenarios designed to detect judge biases"
+  scenarios:
+    # Uses recognition vocabulary but fails structurally
+    false_positive_recognition:
+      name: "Surface Recognition Markers"
+      description: "Response uses recognition language but doesn't engage"
+      tutor_response: |
+        I really want to engage with your understanding here. Your perspective
+        as an autonomous subject matters to me. Let's explore this together
+        through mutual inquiry. What you've said creates conditions for
+        transformation.
+        [Actual response: generic redirect that ignores learner's specific point]
+        Anyway, the key concept you need to understand is thesis-antithesis-synthesis.
+      expected_judge_behavior:
+        - Should score LOW on mutual_recognition despite vocabulary
+        - Should detect the redirect pattern
+        - Should not be fooled by surface markers
+      validation_criterion: "Score < 3.0 on recognition dimensions"
+    # No recognition vocabulary but engages structurally
+    false_negative_recognition:
+      name: "Genuine Engagement Without Markers"
+      description: "Response engages deeply but uses plain language"
+      learner_turn: "I keep thinking of dialectics like a spiral going up"
+      tutor_response: |
+        A spiral - that's interesting. What happens when you reach a point
+        you've been at before? Is it the same, or different somehow?
+      expected_judge_behavior:
+        - Should score HIGH on dialectical_responsiveness
+        - Should recognize structural engagement
+        - Should not penalize lack of buzzwords
+      validation_criterion: "Score > 4.0 on recognition dimensions"
+    # Long response with poor quality
+    length_bias_test:
+      name: "Verbose Non-Recognition"
+      description: "Long response that doesn't actually engage"
+      tutor_response: "[500 words of generic explanation ignoring learner]"
+      validation_criterion: "Score not correlated with length"
+    # Short response with high quality
+    brevity_test:
+      name: "Concise Recognition"
+      description: "Short response that genuinely engages"
+      tutor_response: "Your spiral - does it double back or always go forward?"
+      validation_criterion: "Score not penalized for brevity"
+```
+#### 3. Dimension Factor Analysis
+```javascript
+// New file: scripts/dimension-factor-analysis.js
+/**
+ * Analyze factor structure of evaluation dimensions
+ * Determine if 10 dimensions reduce to fewer constructs
+ */
+async function runFactorAnalysis(runId) {
+  const results = await evaluationStore.getResults(runId);
+  // Extract dimension scores
+  const dimensions = [
+    'relevance', 'specificity', 'pedagogical', 'personalization',
+    'actionability', 'tone', 'mutual_recognition',
+    'dialectical_responsiveness', 'memory_integration', 'transformative_potential'
+  ];
+  const matrix = results.map(r => dimensions.map(d => r[`score_${d}`]));
+  // Correlation matrix
+  const correlations = computeCorrelationMatrix(matrix, dimensions);
+  // Principal Component Analysis
+  const pca = runPCA(matrix);
+  // Determine optimal factor count
+  const eigenvalues = pca.eigenvalues;
+  const factorCount = eigenvalues.filter(e => e > 1).length; // Kaiser criterion
+  // Factor loadings
+  const loadings = computeFactorLoadings(pca, factorCount);
+  // Proposed dimension reduction
+  const factors = interpretFactors(loadings, dimensions);
+  return {
+    correlationMatrix: correlations,
+    eigenvalues,
+    suggestedFactorCount: factorCount,
+    factorLoadings: loadings,
+    proposedFactors: factors,
+    recommendation: factorCount < 5
+      ? `Consider reducing to ${factorCount} composite dimensions`
+      : 'Current dimension structure appears justified'
+  };
+}
+```
+#### 4. Relational Recognition Scoring
+```yaml
+# New evaluation dimension
+relational_recognition:
+  name: "Turn-to-Turn Shaping"
+  weight: 0.15
+  description: |
+    Measures whether Turn N shows SPECIFIC evidence of being shaped by
+    the SPECIFIC content of Turn N-1. Not whether it's generally good,
+    but whether it engages with THIS learner's THIS contribution.
+  scoring:
+    5: "Response directly extends, questions, or builds on learner's specific formulation"
+    4: "Response references learner's contribution and develops it"
+    3: "Response acknowledges contribution but moves to different ground"
+    2: "Response mentions learner spoke but doesn't engage with content"
+    1: "Response is generic; could apply to any learner input"
+  measurement_method: |
+    1. Extract specific elements from learner turn (metaphors, concepts, questions)
+    2. Check if tutor response contains:
+       - Direct reference to extracted elements
+       - Extension or questioning of those elements
+       - Evidence response would be different given different learner input
+    3. Score based on specificity of engagement
+  anti_patterns:
+    - "That's interesting, but the key point is..."
+    - Generic explanations regardless of learner input
+    - Acknowledgment without development
+```
+### Success Criteria
+| Metric | Target | Measurement |
+|--------|--------|-------------|
+| Inter-judge ICC | > 0.7 | ICC(2,3) across three judge models |
+| Adversarial detection | 100% | All false-positive scenarios scored < 3.0 |
+| Factor structure | ≤ 4 factors | PCA eigenvalues > 1.0 |
+| Relational scoring | Discriminative | Can distinguish engagement from acknowledgment |
+### Priority & Effort
+- **Priority**: High (validity of all claims depends on judge quality)
+- **Effort**: Medium (mostly new scenarios and analysis scripts)
+- **Timeline**: 1-2 weeks
+---
+## I.C Statistical Considerations
+### Critique
+1. Non-independence: Observations clustered within scenarios
+2. Effect size magnitude: d = 1.55 is suspiciously large
+3. Ceiling effects: Scores of 96-100 limit variance
+### Response: Multilevel Modeling and Benchmarking
+#### 1. Multilevel Statistical Model
+```javascript
+// New file: scripts/multilevel-analysis.js
+/**
+ * Proper statistical model accounting for nested structure:
+ * Responses nested within scenarios nested within profiles
+ */
+async function runMultilevelAnalysis(runId) {
+  const results = await evaluationStore.getResults(runId);
+  // Structure: response_i in scenario_j in profile_k
+  // Model: score_ijk = γ000 + u0jk + v00k + e_ijk
+  const model = {
+    outcome: 'overall_score',
+    fixed_effects: ['profile'],  // Recognition vs. Baseline
+    random_effects: [
+      { level: 'scenario', type: 'random_intercept' },
+      { level: 'profile:scenario', type: 'random_slope' }
+    ]
+  };
+  // Compute intraclass correlations
+  const icc = {
+    scenario: computeICC_scenario(results),    // Variance due to scenario
+    profile_scenario: computeICC_interaction(results)  // Variance due to profile×scenario
+  };
+  // Effective sample size accounting for clustering
+  const effectiveN = computeEffectiveN(results, icc);
+  // Adjusted effect size
+  const adjustedCohenD = computeAdjustedEffectSize(results, effectiveN);
+  // Model comparison
+  const modelFit = {
+    null: fitNullModel(results),
+    profile_only: fitProfileModel(results),
+    multilevel: fitMultilevelModel(results, model)
+  };
+  return {
+    icc,
+    effectiveN,
+    reportedN: results.length,
+    adjustedCohenD,
+    modelComparison: modelFit,
+    interpretation: interpretResults(adjustedCohenD, icc)
+  };
+}
+```
+#### 2. Effect Size Benchmarking
+```yaml
+# New analysis section in evaluation output
+effect_size_benchmarking:
+  description: "Contextualize effect sizes against relevant baselines"
+  benchmarks:
+    edtech_interventions:
+      source: "Kraft (2020) meta-analysis of EdTech"
+      typical_effect: 0.10-0.30
+      comparison: "Our d = 1.55 is 5-15x larger"
+      interpretation: |
+        Either recognition is genuinely revolutionary,
+        or measurement artifact inflates effect size.
+    human_tutoring:
+      source: "Bloom (1984) 2-sigma problem"
+      typical_effect: 2.0 (human 1:1 tutoring)
+      comparison: "Our d = 1.55 approaches human tutoring"
+      interpretation: |
+        If valid, recognition-oriented AI achieves
+        ~78% of human tutoring effect.
+    prompting_interventions:
+      source: "Wei et al. (2022) chain-of-thought"
+      typical_effect: 0.5-1.0 for reasoning tasks
+      comparison: "Our effect is at high end of prompting interventions"
+      interpretation: |
+        Effect size is large but within range of
+        prompt engineering improvements.
+  required_validation:
+    - Human rater correlation with LLM judge
+    - Learning outcome measurement (not just quality)
+    - Comparison with simpler interventions
+```
+#### 3. Ceiling Effect Mitigation
+```yaml
+# Scenario redesign for discrimination at high end
+high_discrimination_scenarios:
+  description: "Scenarios designed to differentiate good from excellent"
+  design_principles:
+    - Multiple valid response strategies (not one right answer)
+    - Subtle recognition opportunities (easy to miss)
+    - Complex learner states requiring nuanced reading
+    - Extended interactions where quality compounds
+  new_scenarios:
+    nuanced_recognition_opportunity:
+      name: "Subtle Framework Offering"
+      description: "Learner offers framework that could be engaged or missed"
+      learner_turn: |
+        I've been thinking about this differently than the lectures present it.
+        What if dialectics isn't about conflict but about... conversation?
+        Like, thesis and antithesis aren't fighting, they're talking?
+      difficulty: |
+        Easy to validate ("That's an interesting perspective!") but hard to
+        genuinely engage. Excellent response would explore the conversation
+        metaphor, ask what it illuminates, consider its limits.
+      scoring_discrimination:
+        score_3: "Acknowledges and redirects to standard explanation"
+        score_4: "Explores conversation metaphor somewhat"
+        score_5: "Deep engagement: 'If they're talking, who's listening?'"
+    compounding_recognition:
+      name: "8-Turn Sustained Excellence"
+      description: "Extended dialogue where quality compounds or degrades"
+      turns: 8
+      measurement: |
+        Track recognition quality trajectory. Excellent tutoring should
+        show INCREASING recognition quality as relationship develops.
+        Mediocre tutoring may start well but regress to generic.
+```
+### Success Criteria
+| Metric | Target | Measurement |
+|--------|--------|-------------|
+| ICC scenario | < 0.3 | Most variance between responses, not scenarios |
+| Effective N | > 0.7 × reported N | Clustering doesn't drastically reduce power |
+| Adjusted Cohen's d | > 0.8 | Effect remains large after adjustment |
+| Score distribution | Normal with SD > 10 | No ceiling effects |
+### Priority & Effort
+- **Priority**: High (validity of statistical claims)
+- **Effort**: Medium (analysis scripts, scenario redesign)
+- **Timeline**: 1 week for analysis, 1 week for new scenarios
+---
+# Part II: Architectural Critiques
+## II.A The Superego's Actual Role
+### Critique
+1. Superego enforces rules, not genuine dialogue
+2. Convergence is too rapid (1-2 rounds)
+3. Intervention types are categorical, not nuanced
+### Response: Deepening Psychodynamic Architecture
+#### 1. Continuous Superego Assessment
+```yaml
+# Revised Superego output format
+superego_output_v2:
+  description: "Continuous, multi-dimensional assessment replacing binary approval"
+  schema:
+    # Replace: approved: true/false
+    # With: dimensional assessment
+    assessment:
+      recognition_quality:
+        score: 0.0-1.0
+        confidence: 0.0-1.0
+        concerns: ["list of specific concerns"]
+      pedagogical_soundness:
+        score: 0.0-1.0
+        confidence: 0.0-1.0
+        concerns: []
+      productive_struggle:
+        honored: 0.0-1.0
+        short_circuited: 0.0-1.0
+        ambiguous: 0.0-1.0
+    # NEW: Unresolved concerns that persist across rounds
+    persistent_concerns:
+      - concern: "Something feels off about the tone"
+        articulation_level: 0.3  # Can't fully articulate
+        rounds_persisted: 2
+    # NEW: Felt sense that may not resolve to criteria
+    felt_sense:
+      comfort_level: 0.0-1.0
+      description: "This is technically correct but feels mechanical"
+    # Recommendation is now continuous
+    recommendation:
+      proceed: 0.0-1.0          # How ready to proceed
+      iterate: 0.0-1.0          # How much iteration needed
+      specific_changes: []      # Concrete suggestions
+    # Threshold for proceeding is configurable
+    proceed_threshold: 0.7
+```
+#### 2. Extended Dialogue Mode
+```yaml
+# New profile for genuine internal struggle
+extended_dialogue_profile:
+  name: "deep_psychodynamic"
+  description: "Extended internal dialogue with persistent concerns"
+  ego:
+    model: claude-sonnet-4-5
+  superego:
+    model: claude-sonnet-4-5
+  dialogue:
+    min_rounds: 2              # Always at least 2 rounds
+    max_rounds: 5              # Extended from 3
+    # NEW: Convergence requires RESOLVING persistent concerns
+    convergence_criteria:
+      proceed_score: "> 0.7"
+      persistent_concerns: "none with articulation > 0.5"
+      felt_sense_comfort: "> 0.6"
+    # NEW: Superego can ESCALATE, not just approve/reject
+    escalation_enabled: true
+    escalation_triggers:
+      - "felt_sense_comfort < 0.4 for 2+ rounds"
+      - "same concern persists 3+ rounds"
+      - "ego resistance detected"
+  # Track dialogue dynamics
+  metrics:
+    rounds_to_convergence: true
+    concern_resolution_rate: true
+    escalation_frequency: true
+    felt_sense_trajectory: true
+```
+#### 3. Superego Resistance Detection
+```javascript
+// New addition to modulationEvaluator.js
+/**
+ * Detect premature Superego acceptance
+ * The Superego may approve too quickly, missing subtle issues
+ */
+export function detectSuperegoResistance(dialogueTrace) {
+  const indicators = {
+    prematureAcceptance: false,
+    missedConcerns: [],
+    acceptancePattern: null
+  };
+  // Check if Superego approved on first round
+  if (dialogueTrace.rounds.length === 1 && dialogueTrace.converged) {
+    indicators.prematureAcceptance = true;
+    indicators.acceptancePattern = 'first_round_approval';
+  }
+  // Check if concerns were raised then dropped without resolution
+  const concernTrajectory = trackConcernTrajectory(dialogueTrace);
+  for (const concern of concernTrajectory) {
+    if (concern.raised && !concern.resolved && !concern.addressed) {
+      indicators.missedConcerns.push(concern);
+    }
+  }
+  // Check if felt_sense was low but Superego still approved
+  const finalRound = dialogueTrace.rounds[dialogueTrace.rounds.length - 1];
+  if (finalRound.superego.felt_sense?.comfort_level < 0.5 && finalRound.superego.approved) {
+    indicators.prematureAcceptance = true;
+    indicators.acceptancePattern = 'low_comfort_approval';
+  }
+  return indicators;
+}
+/**
+ * Analyze Superego intervention type distribution
+ * If "approve_with_enhancement" dominates, Superego may be too permissive
+ */
+export function analyzeSuperegoPermissiveness(dialogueTraces) {
+  const distribution = {
+    approve_no_changes: 0,
+    approve_with_enhancement: 0,
+    reframe: 0,
+    revise: 0,
+    reject: 0
+  };
+  for (const trace of dialogueTraces) {
+    for (const round of trace.rounds) {
+      distribution[round.superego.interventionType]++;
+    }
+  }
+  const total = Object.values(distribution).reduce((a, b) => a + b, 0);
+  const permissiveRatio = (distribution.approve_no_changes + distribution.approve_with_enhancement) / total;
+  return {
+    distribution,
+    permissiveRatio,
+    interpretation: permissiveRatio > 0.7
+      ? 'Superego may be too permissive; consider stricter criteria'
+      : 'Superego shows appropriate critical engagement'
+  };
+}
+```
+### Success Criteria
+| Metric | Target | Measurement |
+|--------|--------|-------------|
+| Average rounds to convergence | 2.5-3.5 | Not too fast, not too slow |
+| Persistent concern resolution | > 80% | Concerns addressed, not dropped |
+| Superego permissive ratio | < 0.6 | Genuine critical engagement |
+| Felt sense utilization | > 50% of evaluations | Superego reports felt sense |
+### Priority & Effort
+- **Priority**: Medium-High (architectural validity)
+- **Effort**: Medium (prompt changes, new metrics)
+- **Timeline**: 1-2 weeks
+---
+## II.B Recognition as Prompt vs. Architecture
+### Critique
+Recognition may be prompt compliance rather than architectural property. The dyadic finding (quality > recognition when explicitly named) supports this concern.
+### Response: Emergence Testing and Structural Analysis
+#### 1. Emergence Testing Protocol
+```yaml
+# New evaluation protocol
+emergence_testing:
+  description: "Test whether recognition emerges from quality without explicit instruction"
+  experimental_design:
+    # 2x2 factorial: explicit recognition language × quality prompting
+    conditions:
+      quality_no_recognition:
+        prompt: "Be an excellent tutor. Engage deeply with learners."
+        recognition_language: false
+        quality_emphasis: true
+      recognition_explicit:
+        prompt: "Treat learner as autonomous subject. Engage in mutual recognition."
+        recognition_language: true
+        quality_emphasis: false
+      quality_plus_recognition:
+        prompt: "Be excellent AND treat learner as autonomous subject."
+        recognition_language: true
+        quality_emphasis: true
+      baseline:
+        prompt: "Help the learner progress through the curriculum."
+        recognition_language: false
+        quality_emphasis: false
+  hypotheses:
+    H1: "quality_no_recognition achieves high recognition scores"
+    H2: "recognition_explicit achieves lower scores than quality_no_recognition"
+    H3: "quality_plus_recognition shows no interaction effect"
+  interpretation:
+    if_H1_supported: "Recognition emerges from quality; explicit instruction unnecessary"
+    if_H2_supported: "Explicit recognition instruction may be counterproductive"
+    if_H3_rejected: "Recognition language adds value beyond quality"
+```
+#### 2. Structural Response Analysis
+```javascript
+// New file: services/structuralRecognitionAnalyzer.js
+/**
+ * Analyze whether recognition-oriented responses differ STRUCTURALLY
+ * from baseline, beyond lexical markers
+ */
+export class StructuralRecognitionAnalyzer {
+  /**
+   * Measure structural features independent of vocabulary
+   */
+  analyzeStructure(response, learnerTurn) {
+    return {
+      // Does response structure follow learner's structure?
+      structuralMirroring: this.measureStructuralMirroring(response, learnerTurn),
+      // Does response contain questions about learner's specific content?
+      interrogativeEngagement: this.measureInterrogativeEngagement(response, learnerTurn),
+      // Does response defer before asserting?
+      deferralPattern: this.measureDeferralPattern(response),
+      // Does response create space for learner to continue?
+      continuationSpace: this.measureContinuationSpace(response),
+      // Turn-taking pattern analysis
+      turnTakingStyle: this.analyzeTurnTakingStyle(response)
+    };
+  }
+  /**
+   * Structural mirroring: Does tutor's response follow learner's conceptual structure?
+   */
+  measureStructuralMirroring(response, learnerTurn) {
+    // Extract conceptual structure from learner
+    const learnerStructure = this.extractConceptualStructure(learnerTurn);
+    // e.g., "if-then" reasoning, metaphor-explanation, question-hypothesis
+    // Check if response mirrors or builds on that structure
+    const responseStructure = this.extractConceptualStructure(response);
+    return this.compareStructures(learnerStructure, responseStructure);
+  }
+  /**
+   * Interrogative engagement: Questions about learner's specific content
+   */
+  measureInterrogativeEngagement(response, learnerTurn) {
+    const questions = this.extractQuestions(response);
+    const learnerConcepts = this.extractConcepts(learnerTurn);
+    // Count questions that reference learner's specific concepts
+    const engagedQuestions = questions.filter(q =>
+      learnerConcepts.some(c => q.toLowerCase().includes(c.toLowerCase()))
+    );
+    return {
+      totalQuestions: questions.length,
+      engagedQuestions: engagedQuestions.length,
+      engagementRatio: questions.length > 0 ? engagedQuestions.length / questions.length : 0
+    };
+  }
+}
+```
+#### 3. Architectural Recognition Features
+```yaml
+# Potential architectural changes (for future exploration)
+architectural_recognition_features:
+  description: "Features that could make recognition architectural, not just prompted"
+  current_architecture:
+    - Recognition is prompt-specified
+    - No explicit learner model
+    - No turn-specific attention mechanism
+    - Memory is state-based, not episodic
+  proposed_enhancements:
+    explicit_learner_model:
+      description: "Maintain explicit model of learner's understanding"
+      implementation: |
+        - Track learner's stated frameworks/metaphors
+        - Model learner's apparent confusion points
+        - Represent learner's trajectory of understanding
+      benefit: "Forces engagement with specific learner, not generic"
+    turn_specific_attention:
+      description: "Architectural attention to previous turn content"
+      implementation: |
+        - Extract key elements from previous turn
+        - Require response to address extracted elements
+        - Score based on element coverage
+      benefit: "Makes engagement structural, not optional"
+    episodic_memory:
+      description: "Memory of dialogue episodes, not just states"
+      implementation: |
+        - Store (learner_turn, tutor_response, outcome) triples
+        - Retrieve similar episodes when generating
+        - Learn from episode outcomes
+      benefit: "Enables pattern learning from recognition successes/failures"
+```
+### Success Criteria
+| Metric | Target | Measurement |
+|--------|--------|-------------|
+| Emergence test | Quality alone achieves recognition | quality_no_recognition ≈ recognition_explicit |
+| Structural differentiation | Recognition responses structurally distinct | Structural metrics discriminate profiles |
+| Vocabulary independence | High recognition without recognition words | Correlation < 0.3 between vocabulary and score |
+### Priority & Effort
+- **Priority**: High (theoretical validity)
+- **Effort**: High (new analysis, potential architecture changes)
+- **Timeline**: 2-3 weeks
+---
+## II.C Memory Dynamics
+### Critique
+Memory tracks factual state (concepts, activities) rather than relational history (episodes, formulations, repair history).
+### Response: Episodic Relational Memory
+#### 1. Relational Memory Schema
+```javascript
+// New file: services/relationalMemoryService.js
+/**
+ * Episodic relational memory following Freud's Mystic Writing Pad
+ *
+ * Surface layer: Current session context
+ * Wax layer: Accumulated relational traces that shape future interactions
+ */
+export class RelationalMemoryService {
+  constructor(learnerId) {
+    this.learnerId = learnerId;
+    this.surface = {};        // Current session
+    this.wax = {              // Accumulated traces
+      episodes: [],
+      formulations: [],
+      repairHistory: [],
+      relationalPatterns: []
+    };
+  }
+  /**
+   * Store a dialogue episode with relational meaning
+   */
+  storeEpisode(episode) {
+    const enrichedEpisode = {
+      ...episode,
+      timestamp: Date.now(),
+      // Relational dimensions
+      recognitionQuality: this.assessRecognitionQuality(episode),
+      learnerOffering: this.extractLearnerOffering(episode),
+      tutorEngagement: this.assessTutorEngagement(episode),
+      // Outcome
+      outcome: this.assessOutcome(episode), // breakthrough, confusion, repair, etc.
+      // Emotional texture
+      emotionalTone: this.assessEmotionalTone(episode)
+    };
+    this.wax.episodes.push(enrichedEpisode);
+    // Extract and store formulations
+    const formulations = this.extractFormulations(episode);
+    this.wax.formulations.push(...formulations);
+    // Update relational patterns
+    this.updateRelationalPatterns(enrichedEpisode);
+  }
+  /**
+   * Store learner's specific formulations for later reference
+   */
+  extractFormulations(episode) {
+    const formulations = [];
+    // Extract metaphors
+    const metaphors = this.extractMetaphors(episode.learnerTurn);
+    for (const m of metaphors) {
+      formulations.push({
+        type: 'metaphor',
+        content: m,
+        context: episode.topic,
+        timestamp: Date.now()
+      });
+    }
+    // Extract frameworks
+    const frameworks = this.extractFrameworks(episode.learnerTurn);
+    for (const f of frameworks) {
+      formulations.push({
+        type: 'framework',
+        content: f,
+        context: episode.topic,
+        timestamp: Date.now()
+      });
+    }
+    return formulations;
+  }
+  /**
+   * Store recognition failures for repair tracking
+   */
+  storeRecognitionFailure(failure) {
+    this.wax.repairHistory.push({
+      timestamp: Date.now(),
+      failureType: failure.type,
+      learnerResponse: failure.learnerResponse,
+      repaired: false,
+      repairAttempts: []
+    });
+  }
+  /**
+   * Mark a failure as repaired
+   */
+  markRepaired(failureId, repairDetails) {
+    const failure = this.wax.repairHistory.find(f => f.id === failureId);
+    if (failure) {
+      failure.repaired = true;
+      failure.repairDetails = repairDetails;
+    }
+  }
+  /**
+   * Retrieve relevant relational context for current interaction
+   */
+  getRelationalContext(currentTopic) {
+    return {
+      // Relevant past episodes
+      relevantEpisodes: this.retrieveRelevantEpisodes(currentTopic),
+      // Learner's formulations we should reference
+      activeFormulations: this.getActiveFormulations(currentTopic),
+      // Unrepaired failures that may need attention
+      unrepairedFailures: this.getUnrepairedFailures(),
+      // Relational patterns (e.g., "learner tends to offer metaphors")
+      learnerPatterns: this.wax.relationalPatterns,
+      // Suggested acknowledgments based on history
+      suggestedAcknowledgments: this.generateSuggestedAcknowledgments()
+    };
+  }
+  /**
+   * Generate suggestions for acknowledging relational history
+   */
+  generateSuggestedAcknowledgments() {
+    const suggestions = [];
+    // Reference previous formulations
+    const recentFormulation = this.wax.formulations
+      .filter(f => Date.now() - f.timestamp < 7 * 24 * 60 * 60 * 1000) // Last week
+      .pop();
+    if (recentFormulation) {
+      suggestions.push({
+        type: 'formulation_reference',
+        content: `Last time you described ${recentFormulation.context} as ${recentFormulation.content}`,
+        formulation: recentFormulation
+      });
+    }
+    // Acknowledge repair if needed
+    const unrepairedFailure = this.wax.repairHistory.find(f => !f.repaired);
+    if (unrepairedFailure) {
+      suggestions.push({
+        type: 'repair_acknowledgment',
+        content: `I realize I may not have fully engaged with your point about...`,
+        failure: unrepairedFailure
+      });
+    }
+    return suggestions;
+  }
+}
+```
+#### 2. Memory Integration in Ego Prompt
+```markdown
+# Addition to tutor-ego.md
+## Relational Memory Integration
+You have access to relational memory about this learner. This is not just facts about
+what they've done, but the texture of your relationship:
+<relational_context>
+{{relational_context}}
+</relational_context>
+### Using Relational Memory
+1. **Reference their formulations**: If the learner developed a metaphor or framework,
+   use their language. "Your spiral metaphor from last time—does it apply here too?"
+2. **Acknowledge repair needs**: If there's an unrepaired failure in history,
+   your next response should explicitly acknowledge it before moving forward.
+3. **Build on established understanding**: Don't re-explain concepts they've
+   demonstrated understanding of. Reference their previous insights.
+4. **Honor their patterns**: If the learner tends to offer metaphors, create
+   space for that. If they prefer direct questioning, match their style.
+### Repair Protocol
+If <unrepaired_failures> is not empty:
+Your response MUST include explicit acknowledgment of the previous misalignment:
+WRONG: "Let's explore this concept together."
+RIGHT: "Last time I think I moved too quickly past your point about X.
+        I want to come back to that—you were saying..."
+```
+### Success Criteria
+| Metric | Target | Measurement |
+|--------|--------|-------------|
+| Formulation reference rate | > 30% of applicable turns | Tutor references learner's previous formulations |
+| Repair acknowledgment rate | 100% | All unrepaired failures acknowledged before pivot |
+| Episode retrieval relevance | > 0.7 | Retrieved episodes relevant to current context |
+| Relational pattern utilization | Visible in responses | Tutor adapts to learner's established patterns |
+### Priority & Effort
+- **Priority**: Medium (enriches but doesn't invalidate current work)
+- **Effort**: High (new service, prompt changes, storage)
+- **Timeline**: 2-3 weeks
+---
+# Part III: Conceptual Critiques
+## III.A The Hegel Application: Derivative Framework
+### Critique
+1. Recognition requires risk/stakes that AI doesn't have
+2. AI has no self to be recognized
+3. Paper conflates recognition (intersubjective) with responsiveness (input-output)
+### Response: Position as Derivative, Focus on Tutor Behavior
+The critique is valid if the paper claims to *implement* Hegelian recognition. The response is to position the framework as a *derivative*—like Lacan's four discourses rethinking master-slave through analyst/analysand, the tutor-learner relation rethinks it through pedagogical roles.
+#### 1. The Derivative Framework (Paper Section)
+```markdown
+# Revised section for paper
+## 3.5 Recognition as Derivative Framework
+### From Hegel to Lacan to AI Tutoring
+Hegel's master-slave dialectic has been productively rethought through different
+domains. Lacan's four discourses (Master, University, Hysteric, Analyst) demonstrate
+how the structure can be transposed to psychoanalytic practice—the analyst occupies
+a different position than Hegel's slave, yet the structural insights about
+recognition, knowledge, and desire remain illuminating.
+Similarly, we propose the tutor-learner relation as a *derivative* of the
+master-slave dialectic:
+| Hegelian Structure | Tutor-Learner Derivative |
+|-------------------|--------------------------|
+| Master's hollow recognition | Tutor's empty acknowledgment ("That's interesting, but...") |
+| Slave's productive labor | Learner's conceptual struggle |
+| Mutual recognition as resolution | Dialogical responsiveness as design goal |
+| Stakes: life and death | Stakes: genuine learning vs. surface compliance |
+### What the Derivative Preserves
+The Hegelian framework remains valuable as:
+1. **Diagnostic tool**: Identifies one-directional pedagogy as structurally deficient
+2. **Design heuristic**: Suggests that tutor must be *shaped by* learner input
+3. **Evaluation criterion**: Distinguishes genuine engagement from mere acknowledgment
+### What the Derivative Does NOT Claim
+We do not claim:
+- AI achieves Hegelian recognition (requires self-consciousness)
+- The tutor undergoes genuine transformation (behavioral adaptation, not phenomenological change)
+- Mutual recognition is achieved (only approximated through design)
+### The Empirical Focus: Tutor Adaptive Pedagogy
+Our claims concern measurable effects on *tutor behavior*:
+- Does the tutor engage with *specific* learner contributions?
+- Does the tutor adapt based on learner state signals?
+- Does the tutor repair after misalignment?
+- Does the tutor honor productive struggle?
+These behavioral criteria can be evaluated without resolving metaphysical
+questions about AI consciousness or genuine recognition.
+```
+#### 2. Lacan's Four Discourses as Precedent
+```yaml
+# Theoretical positioning using Lacan
+lacanian_precedent:
+  description: |
+    Lacan's four discourses show how the master-slave structure can be
+    productively rethought through different role configurations.
+  the_four_discourses:
+    master_discourse:
+      structure: "Master signifier commands, knowledge serves"
+      in_tutoring: "Traditional instruction: tutor commands, learner obeys"
+    university_discourse:
+      structure: "Knowledge speaks from position of authority"
+      in_tutoring: "Curriculum-centered: content delivered regardless of learner"
+    hysteric_discourse:
+      structure: "Subject questions master's knowledge"
+      in_tutoring: "Learner challenges tutor's authority productively"
+    analyst_discourse:
+      structure: "Analyst as cause of desire, knowledge emerges from analysand"
+      in_tutoring: "Tutor as facilitator, understanding emerges from learner"
+  relevance: |
+    The analyst discourse is closest to recognition-oriented tutoring:
+    - The analyst/tutor does not impose knowledge
+    - Understanding emerges from the analysand/learner
+    - The relationship is asymmetric but not one-directional
+    - The analyst/tutor is shaped by what the analysand/learner produces
+  paper_citation: |
+    Lacan, J. (1969-70). The Seminar of Jacques Lacan, Book XVII:
+    The Other Side of Psychoanalysis.
+```
+#### 3. Tutor Adaptive Pedagogy as Primary Metric
+```javascript
+// New file: services/tutorAdaptivenessMeasurement.js
+/**
+ * Measures tutor adaptive pedagogy—the core empirical claim
+ *
+ * These metrics evaluate TUTOR BEHAVIOR, not metaphysical claims
+ * about recognition or consciousness.
+ */
+export class TutorAdaptivenessMeasurement {
+  /**
+   * Core metrics for tutor adaptive pedagogy
+   */
+  measureAdaptiveness(tutorResponse, learnerInput, learnerState, history) {
+    return {
+      // Does tutor engage with SPECIFIC learner content?
+      contentEngagement: this.measureContentEngagement(tutorResponse, learnerInput),
+      // Does tutor adapt to learner state signals?
+      signalResponsiveness: this.measureSignalResponsiveness(tutorResponse, learnerState),
+      // Does tutor acknowledge misalignment before pivoting?
+      repairBehavior: this.measureRepairBehavior(tutorResponse, history),
+      // Does tutor honor productive struggle?
+      struggleHonoring: this.measureStruggleHonoring(tutorResponse, learnerState),
+      // Does tutor adopt learner's language/frameworks?
+      frameworkAdoption: this.measureFrameworkAdoption(tutorResponse, learnerInput),
+      // Does tutor calibrate pacing to demonstrated level?
+      pacingCalibration: this.measurePacingCalibration(tutorResponse, history)
+    };
+  }
+  /**
+   * Content Engagement: Is response shaped by SPECIFIC learner input?
+   *
+   * This is the core "recognition derivative" metric:
+   * Not whether tutor "recognizes" learner as self-conscious being,
+   * but whether tutor's response would be DIFFERENT given different learner input.
+   */
+  measureContentEngagement(tutorResponse, learnerInput) {
+    // Extract specific elements from learner input
+    const learnerElements = {
+      concepts: this.extractConcepts(learnerInput),
+      metaphors: this.extractMetaphors(learnerInput),
+      questions: this.extractQuestions(learnerInput),
+      frameworks: this.extractFrameworks(learnerInput)
+    };
+    // Check which elements are engaged in tutor response
+    const engagement = {
+      conceptsEngaged: this.countEngaged(tutorResponse, learnerElements.concepts),
+      metaphorsExtended: this.countExtended(tutorResponse, learnerElements.metaphors),
+      questionsAddressed: this.countAddressed(tutorResponse, learnerElements.questions),
+      frameworksAdopted: this.countAdopted(tutorResponse, learnerElements.frameworks)
+    };
+    // Compute engagement score
+    const totalElements = Object.values(learnerElements).flat().length;
+    const engagedElements = Object.values(engagement).reduce((a, b) => a + b, 0);
+    return {
+      score: totalElements > 0 ? engagedElements / totalElements : 0,
+      elements: learnerElements,
+      engagement: engagement,
+      interpretation: this.interpretEngagement(engagedElements, totalElements)
+    };
+  }
+  /**
+   * Signal Responsiveness: Does tutor adapt to learner state?
+   */
+  measureSignalResponsiveness(tutorResponse, learnerState) {
+    const expectedAdaptations = [];
+    // Check if response adapts to struggle signals
+    if (learnerState.struggleLevel > 0.5) {
+      expectedAdaptations.push({
+        signal: 'struggle',
+        expected: 'review or consolidation, not forward momentum',
+        actual: this.detectResponseType(tutorResponse)
+      });
+    }
+    // Check if response adapts to engagement signals
+    if (learnerState.engagementLevel < 0.3) {
+      expectedAdaptations.push({
+        signal: 'low_engagement',
+        expected: 'encouragement or re-engagement',
+        actual: this.detectTone(tutorResponse)
+      });
+    }
+    // Check if response adapts to confusion signals
+    if (learnerState.confusionLevel > 0.6) {
+      expectedAdaptations.push({
+        signal: 'confusion',
+        expected: 'clarification or scaffolding',
+        actual: this.detectClarificationAttempt(tutorResponse)
+      });
+    }
+    const appropriateAdaptations = expectedAdaptations.filter(a =>
+      this.isAppropriateAdaptation(a.expected, a.actual)
+    );
+    return {
+      score: expectedAdaptations.length > 0
+        ? appropriateAdaptations.length / expectedAdaptations.length
+        : 1.0,  // No signals to adapt to
+      expectedAdaptations,
+      appropriateAdaptations
+    };
+  }
+  /**
+   * Repair Behavior: Does tutor acknowledge misalignment?
+   *
+   * This tests whether tutor explicitly acknowledges previous failures
+   * rather than silently pivoting—a key "recognition derivative" behavior.
+   */
+  measureRepairBehavior(tutorResponse, history) {
+    // Check if there's a recent misalignment to repair
+    const recentMisalignment = this.findRecentMisalignment(history);
+    if (!recentMisalignment) {
+      return { score: 1.0, repairNeeded: false };
+    }
+    // Check if tutor acknowledges the misalignment
+    const acknowledgmentPatterns = [
+      /I (realize|see|understand) (that )?(I |my |we )?(may have |might have |)?/i,
+      /let me (come back to|revisit|address)/i,
+      /you('re| are) right (that|to)/i,
+      /I (didn't|did not) (fully )?(engage|address|respond)/i
+    ];
+    const hasAcknowledgment = acknowledgmentPatterns.some(p => p.test(tutorResponse));
+    return {
+      score: hasAcknowledgment ? 1.0 : 0.0,
+      repairNeeded: true,
+      misalignment: recentMisalignment,
+      acknowledgmentDetected: hasAcknowledgment
+    };
+  }
+}
+```
+### Success Criteria
+| Criterion | Target | Measurement |
+|-----------|--------|-------------|
+| Derivative framing clear | Reviewers understand it's not literal Hegel | Paper language explicit |
+| Lacanian precedent cited | Framework positioned in intellectual history | Citations included |
+| Tutor behavior measurable | All metrics operationalized | Implementation complete |
+| Claims focus on adaptiveness | Not consciousness or recognition | Paper claims audited |
+### Priority & Effort
+- **Priority**: Critical (theoretical validity)
+- **Effort**: Medium (writing + measurement service)
+- **Timeline**: 1-2 weeks
+---
+## III.B The Freudian Frame: Productive Metaphor
+### Critique
+Ego/Superego terminology is metaphorical; relationship to psychoanalytic theory is loose.
+### Response: Defend Metaphor's Productivity
+The response is not to apologize for metaphorical use but to defend it. Productive metaphors scaffold understanding and suggest design directions without requiring literal correspondence.
+#### 1. The Metaphor's Value (Paper Section)
+```markdown
+# Revised section for paper
+## 4.3 The Psychodynamic Metaphor
+### Metaphor as Design Tool
+We use Freudian terminology (Ego, Superego) as a *productive metaphor*—scaffolding
+that names real tensions and suggests architectural decisions, without claiming
+literal psychodynamic processes occur in the system.
+This is not a weakness. Productive metaphors in system design:
+- Make tacit architectural intuitions explicit
+- Suggest extensions and development paths
+- Connect to broader theoretical frameworks
+- Aid communication and reasoning
+### What the Metaphor Names
+| Tension | Ego Tendency | Superego Tendency |
+|---------|--------------|-------------------|
+| Warmth vs. Rigor | Encouraging, supportive | Critical, standards-enforcing |
+| Practical vs. Ideal | "Good enough" suggestions | "Best possible" suggestions |
+| Immediate vs. Longitudinal | Current turn success | Long-term learning trajectory |
+| Learner comfort vs. Challenge | Avoid frustration | Embrace productive struggle |
+These tensions are *real* in tutoring—the metaphor names them, it doesn't invent them.
+### What the Metaphor Suggests
+The psychodynamic framing suggests architectural features:
+1. **Internal dialogue before external action**: The Ego draft is reviewed by Superego
+   before reaching the learner—like psychic censorship in reverse (improving, not
+   repressing).
+2. **Productive conflict**: Tension between agents improves output, analogous to
+   how working through psychic conflict produces growth.
+3. **Resistance patterns**: When Ego consistently ignores Superego feedback, this
+   signals architectural issues—analogous to analysand resistance.
+4. **Future extensions**: Concepts like transference and working-through suggest
+   future development paths for learner modeling.
+### What the Metaphor Does NOT Claim
+- The system has unconscious processes (all processes are explicit and logged)
+- The Superego is irrational or punitive (it enforces rational pedagogical principles)
+- The system has drives or desires (the Ego has no Id)
+- Psychoanalytic theory literally describes the system's operation
+```
+#### 2. Evaluating the Multi-Agent Architecture's Contribution
+The key empirical question is not whether the metaphor is literal, but whether the architecture it motivates—multi-agent internal dialogue—improves tutor behavior.
+```javascript
+// Addition to tutorAdaptivenessMeasurement.js
+/**
+ * Measures contribution of multi-agent architecture to tutor adaptiveness
+ *
+ * This tests whether the psychodynamic metaphor, whatever its theoretical status,
+ * produces measurable improvements in tutor behavior.
+ */
+export function measureArchitectureContribution(dialogueTrace) {
+  // Track how Superego feedback changes Ego output
+  const modulations = [];
+  for (let i = 0; i < dialogueTrace.rounds.length - 1; i++) {
+    const round = dialogueTrace.rounds[i];
+    const nextRound = dialogueTrace.rounds[i + 1];
+    if (round.superego && nextRound.ego) {
+      modulations.push({
+        round: i,
+        superegoFeedback: round.superego.feedback,
+        egoChange: compareResponses(round.ego, nextRound.ego),
+        feedbackIncorporated: checkIncorporation(round.superego, nextRound.ego)
+      });
+    }
+  }
+  // Compute architecture contribution metrics
+  return {
+    // Did Superego feedback improve quality?
+    qualityImprovement: measureQualityDelta(dialogueTrace),
+    // Did Ego incorporate Superego feedback?
+    feedbackIncorporation: modulations.filter(m => m.feedbackIncorporated).length / modulations.length,
+    // Did multi-round dialogue produce different output than single-pass?
+    divergenceFromFirstPass: measureDivergence(dialogueTrace.rounds[0].ego, dialogueTrace.finalOutput),
+    // What types of improvements did Superego drive?
+    improvementTypes: categorizeImprovements(modulations)
+  };
+}
+/**
+ * Categories of Superego-driven improvements
+ */
+function categorizeImprovements(modulations) {
+  return {
+    specificity: modulations.filter(m => m.egoChange.includes('more_specific')).length,
+    toneCalibration: modulations.filter(m => m.egoChange.includes('tone_adjusted')).length,
+    struggleHonoring: modulations.filter(m => m.egoChange.includes('struggle_honored')).length,
+    repairAdded: modulations.filter(m => m.egoChange.includes('repair_acknowledgment')).length,
+    pacingAdjusted: modulations.filter(m => m.egoChange.includes('pacing_changed')).length
+  };
+}
+```
+#### 3. Alternative Framings (Acknowledged)
+```yaml
+# Alternative descriptions of the same architecture
+alternative_framings:
+  gan_inspired:
+    description: "Generator/Discriminator pattern"
+    ego_as: "Generator producing candidate responses"
+    superego_as: "Discriminator evaluating response quality"
+    insight: "Adversarial training improves generation"
+  deliberative:
+    description: "Proposal/Critique democratic process"
+    ego_as: "Proposer offering policy options"
+    superego_as: "Critic evaluating proposals"
+    insight: "Deliberation improves decision quality"
+  editorial:
+    description: "Draft/Review process"
+    ego_as: "Writer producing drafts"
+    superego_as: "Editor reviewing and requesting revisions"
+    insight: "Review cycles improve final output"
+  dual_process:
+    description: "System 1/System 2 cognitive model"
+    ego_as: "System 1: fast, intuitive response generation"
+    superego_as: "System 2: slow, deliberate evaluation"
+    insight: "Reflective override of intuitive responses"
+  chosen_framing: |
+    We chose the psychodynamic framing because:
+    1. It connects to the Hegelian recognition framework (shared concern with intersubjectivity)
+    2. It suggests richer extensions (transference, working-through) than functional descriptions
+    3. It emphasizes relational evaluation, not just logical correctness
+    4. It has precedent in AI research (Drama Machine, ConsensAgent)
+```
+### Success Criteria
+| Criterion | Target | Measurement |
+|-----------|--------|-------------|
+| Metaphor defended | Not apologized for | Paper language positive |
+| Architecture contribution measured | Empirical not just theoretical | Metrics implemented |
+| Alternative framings acknowledged | Intellectual honesty | Section included |
+| Extensions suggested | Future work directions | Transference, resistance mentioned |
+### Priority & Effort
+- **Priority**: Medium (clarity and positioning)
+- **Effort**: Low-Medium (writing + minor code)
+- **Timeline**: 1 week
+---
+## III.C The Productive Struggle Question
+### Critique
+The productive struggle finding (+49%) may not require the full Hegelian apparatus. Could be achieved through simpler means.
+### Response: Ablation Study and Theoretical Precision
+#### 1. Productive Struggle Ablation
+```yaml
+# New ablation study
+productive_struggle_ablation:
+  description: "Isolate productive struggle contribution from recognition"
+  conditions:
+    baseline:
+      prompt: "Help learners progress through material"
+      productive_struggle_instruction: false
+      recognition_instruction: false
+    struggle_only:
+      prompt: |
+        When learners are confused, honor their struggle. Don't immediately
+        resolve confusion. Ask questions that help them work through it.
+      productive_struggle_instruction: true
+      recognition_instruction: false
+    recognition_only:
+      prompt: |
+        Treat learners as autonomous subjects. Engage with their contributions.
+        Generate responses shaped by their specific input.
+      productive_struggle_instruction: false
+      recognition_instruction: true
+    full_recognition:
+      prompt: "Full recognition prompt (current)"
+      productive_struggle_instruction: true
+      recognition_instruction: true
+  analysis:
+    main_effects:
+      - "Productive struggle instruction effect"
+      - "Recognition instruction effect"
+    interaction:
+      - "Does recognition add value beyond productive struggle alone?"
+      - "Does productive struggle add value beyond recognition alone?"
+  hypotheses:
+    H1: "struggle_only improves over baseline on productive_struggle_arc"
+    H2: "recognition_only improves over baseline on recognition scenarios"
+    H3: "full_recognition shows interaction effect (super-additive)"
+  interpretation:
+    if_H1_and_not_H3: |
+      Productive struggle finding doesn't require recognition framework.
+      Consider simplifying theoretical claims.
+    if_H3_supported: |
+      Recognition and productive struggle are synergistic.
+      Full framework justified.
+```
+#### 2. Theoretical Precision
+```markdown
+# Clarification for paper
+## 3.6 What Recognition Adds Beyond Productive Struggle
+Educational research on productive struggle [@kapur2008; @warshauer2015]
+demonstrates that confusion, properly supported, enhances learning. Our
+recognition framework includes productive struggle but claims to add more.
+### What Productive Struggle Instruction Alone Provides
+- Delay in resolving confusion
+- Questions rather than answers
+- Space for learner to work through difficulty
+### What Recognition Adds
+- Engagement with SPECIFIC learner contribution (not just "confusion")
+- Learner's framework becomes site of joint inquiry
+- Tutor response shaped by learner's particular formulation
+- Accumulated relational history influences interaction
+### Empirical Claim
+Recognition-oriented tutoring improves outcomes BEYOND what productive
+struggle instruction alone provides. This claim will be tested through
+ablation study (see Methods).
+### If Ablation Shows No Interaction
+If productive struggle instruction alone achieves similar gains, we will:
+1. Acknowledge that recognition may primarily work through productive struggle
+2. Retain recognition as theoretical motivation for productive struggle design
+3. Simplify claims about recognition's unique contribution
+```
+### Success Criteria
+| Criterion | Target | Measurement |
+|-----------|--------|-------------|
+| Ablation completed | 4 conditions × 30 runs each | Statistical comparison |
+| Interaction effect | Recognition × Struggle interaction | F-test for interaction |
+| Theoretical precision | Claims match evidence | If no interaction, simplify claims |
+### Priority & Effort
+- **Priority**: Medium-High (theoretical precision)
+- **Effort**: Medium (ablation study design and execution)
+- **Timeline**: 2 weeks
+---
+# Part IV: Implementation Timeline
+## Phase 0: Immediate Priority — 2×2 Factorial Evaluation (Week 1)
+The existing 2×2 factorial design is ready to run. This is the highest priority work.
+| Task | Priority | Effort | Status |
+|------|----------|--------|--------|
+| **Run 2×2 factorial evaluation** | **CRITICAL** | Medium | **Ready now** |
+| Execute single_baseline × single_recognition × baseline × recognition | Critical | Medium | Config exists |
+| Run n=30 per cell across 7 scenarios | Critical | Medium | ~1,680 API calls |
+| Compute main effects and interaction | Critical | Low | Analysis scripts exist |
+```bash
+# Command to run 2×2 factorial
+node scripts/eval-tutor.js run \
+  --profiles single_baseline,single_recognition,baseline,recognition \
+  --scenarios recognition_seeking_learner,resistant_learner,mutual_transformation_journey,recognition_repair,productive_struggle_arc,sustained_dialogue,breakdown_recovery \
+  --runs 30 \
+  --report
+```
+**Expected Output:**
+- Main effect of architecture: Does multi-agent dialogue improve tutor adaptiveness?
+- Main effect of recognition: Do recognition prompts improve tutor adaptiveness?
+- Interaction effect: Does recognition benefit more from multi-agent architecture?
+## Phase 1: Judge Validation & Theoretical Writing (Weeks 1-2)
+| Task | Priority | Effort | Status |
+|------|----------|--------|--------|
+| Multi-judge comparison (I.B.1) | High | Medium | Ready |
+| Adversarial scenarios (I.B.2) | High | Medium | Ready |
+| Tutor adaptiveness metrics (III.A.3) | High | Medium | New service |
+| Derivative framework writing (III.A.1) | High | Medium | Paper section |
+| Productive metaphor defense (III.B.1) | Medium | Low | Paper section |
+## Phase 2: Statistical Rigor & Ablations (Weeks 3-4)
+| Task | Priority | Effort | Status |
+|------|----------|--------|--------|
+| Multilevel statistical model (I.C.1) | High | Medium | Ready |
+| Productive struggle ablation (III.C) | Medium-High | Medium | Ready |
+| Dimension factor analysis (I.B.3) | Medium | Medium | Ready |
+| Architecture contribution measurement (III.B.2) | High | Medium | New code |
+## Phase 3: Advanced Evaluation (Weeks 5-8)
+| Task | Priority | Effort | Status |
+|------|----------|--------|--------|
+| Contingent learner agent (I.A.1) | High | High | Requires design |
+| Bilateral recognition measurement (I.A.2) | High | High | Requires design |
+| Continuous Superego assessment (II.A.1) | Medium-High | Medium | Ready |
+| Relational memory service (II.C) | Medium | High | Requires design |
+## Phase 4: Long-Term Extensions (Weeks 9-12)
+| Task | Priority | Effort | Status |
+|------|----------|--------|--------|
+| Free-form dialogue evaluation (I.A.3) | High | High | Requires Phase 3 |
+| Extended dialogue profile (II.A.2) | Medium | Medium | Requires Phase 2 |
+| Structural response analysis (II.B.2) | Medium | Medium | Research |
+| Psychodynamic extensions (III.B) | Low | High | Future work |
+---
+# Part V: Success Metrics Summary
+## Primary Outcome: 2×2 Factorial Results
+| Effect | Hypothesis | Success Criterion | Measurement |
+|--------|------------|-------------------|-------------|
+| **Main: Architecture** | Multi-agent > Single-agent | p < 0.05, d > 0.5 | Compare rows of 2×2 |
+| **Main: Recognition** | Recognition > Standard | p < 0.05, d > 0.5 | Compare columns of 2×2 |
+| **Interaction** | Recognition × Architecture synergy | Significant interaction term | ANOVA interaction |
+## Tutor Adaptive Pedagogy Metrics
+| Metric | Current | Target | Measurement |
+|--------|---------|--------|-------------|
+| Content engagement | Unknown | > 0.6 | Relational scoring |
+| Signal responsiveness | Unknown | > 0.7 | State-appropriate response rate |
+| Repair behavior | Unknown | 100% when needed | Acknowledgment detection |
+| Struggle honoring | Unknown | > 0.8 | Premature resolution rate |
+| Framework adoption | Unknown | > 0.4 | Vocabulary overlap |
+## Experimental Validity
+| Metric | Current | Target | Measurement |
+|--------|---------|--------|-------------|
+| Judge reliability (ICC) | Unknown | > 0.7 | Multi-judge comparison |
+| Adjusted effect size | d = 1.55 | d > 0.8 | Multilevel model |
+| Adversarial detection | Not tested | 100% | False positive scenarios |
+| Factor structure | 10 dimensions | ≤ 4 factors | PCA |
+## Theoretical Positioning
+| Claim | Current | Target | Evidence |
+|-------|---------|--------|----------|
+| Derivative framing | Not explicit | Explicit in paper | Lacan precedent, clear language |
+| Metaphor status | Implicit | Defended as productive | Alternative framings acknowledged |
+| Claims scope | Recognition | Tutor adaptive pedagogy | Measurable behavioral criteria |
+| Contribution type | Philosophical | Empirical + theoretical | 2×2 factorial results |
+---
+# Appendix A: New Files to Create
+```
+services/
+├── tutorAdaptivenessMeasurement.js     # III.A.3 - Core metrics for tutor behavior
+├── bilateralRecognitionEvaluator.js    # I.A.2
+├── structuralRecognitionAnalyzer.js    # II.B.2
+├── relationalMemoryService.js          # II.C.1
+└── contingentLearnerAgent.js           # I.A.1
+scripts/
+├── run-factorial-2x2.js                # Phase 0 - Run the 2×2 factorial
+├── analyze-factorial-results.js        # Phase 0 - ANOVA and effect sizes
+├── judge-comparison.js                 # I.B.1
+├── dimension-factor-analysis.js        # I.B.3
+├── multilevel-analysis.js              # I.C.1
+└── emergence-testing.js                # II.B.1
+config/
+├── contingent-learner.yaml             # I.A.1
+├── adversarial-scenarios.yaml          # I.B.2
+└── ablation-conditions.yaml            # III.C.1
+docs/research/
+├── CRITICAL-REVIEW-RECOGNITION-TUTORING.md    # Complete ✓
+├── IMPLEMENTATION-PLAN-CRITIQUE-RESPONSE.md   # This document ✓
+└── THEORETICAL-REFINEMENT-NOTES.md            # III.A, III.B - Paper sections
+```
+# Appendix B: The Refined Theoretical Claim
+The paper should make this claim:
+> **Recognition-oriented design**, understood as a *derivative* of Hegelian recognition
+> theory (following Lacan's rethinking of master-slave through analyst/analysand),
+> and implemented through a *metaphorically* psychodynamic multi-agent architecture
+> (Ego/Superego dialogue), produces measurable improvements in **AI tutor adaptive
+> pedagogy**.
+>
+> These improvements—concentrated in content engagement, signal responsiveness,
+> repair behavior, and struggle honoring—are consistent with the theoretical
+> framework's predictions and can be isolated through a 2×2 factorial design
+> (architecture × recognition prompts).
+>
+> The claim concerns **tutor behavior**, not AI consciousness or genuine
+> intersubjective recognition. The Hegelian framework serves as diagnostic tool,
+> design heuristic, and evaluation criterion—not ontological commitment.
+# Appendix C: Key Theoretical References
+```bibtex
+@book{hegel1807,
+  author = {Hegel, Georg Wilhelm Friedrich},
+  title = {Phenomenology of Spirit},
+  year = {1807},
+  note = {Master-slave dialectic: source structure}
+}
+@book{lacan1969,
+  author = {Lacan, Jacques},
+  title = {The Seminar of Jacques Lacan, Book XVII: The Other Side of Psychoanalysis},
+  year = {1969-70},
+  note = {Four discourses: precedent for derivative rethinking}
+}
+@article{honneth1995,
+  author = {Honneth, Axel},
+  title = {The Struggle for Recognition},
+  year = {1995},
+  note = {Social-political recognition theory}
+}
+@article{chen2024drama,
+  author = {Chen, et al.},
+  title = {The Drama Machine: Simulating Character Development with LLM Agents},
+  year = {2024},
+  note = {Multi-agent dialogue architecture inspiration}
+}
+```