@machinespirits/eval 0.1.2 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (102)
  1. package/LICENSE +21 -0
  2. package/README.md +161 -0
  3. package/config/eval-settings.yaml +18 -0
  4. package/config/evaluation-rubric-learner.yaml +277 -0
  5. package/config/evaluation-rubric.yaml +613 -0
  6. package/config/interaction-eval-scenarios.yaml +93 -50
  7. package/config/learner-agents.yaml +124 -193
  8. package/config/machinespirits-eval.code-workspace +11 -0
  9. package/config/providers.yaml +60 -0
  10. package/config/suggestion-scenarios.yaml +1399 -0
  11. package/config/tutor-agents.yaml +716 -0
  12. package/docs/EVALUATION-VARIABLES.md +589 -0
  13. package/docs/REPLICATION-PLAN.md +577 -0
  14. package/index.js +15 -6
  15. package/package.json +16 -22
  16. package/routes/evalRoutes.js +88 -36
  17. package/scripts/analyze-judge-reliability.js +401 -0
  18. package/scripts/analyze-run.js +97 -0
  19. package/scripts/analyze-run.mjs +282 -0
  20. package/scripts/analyze-validation-failures.js +141 -0
  21. package/scripts/check-run.mjs +17 -0
  22. package/scripts/code-impasse-strategies.js +1132 -0
  23. package/scripts/compare-runs.js +44 -0
  24. package/scripts/compare-suggestions.js +80 -0
  25. package/scripts/compare-transformation.js +116 -0
  26. package/scripts/dig-into-run.js +158 -0
  27. package/scripts/eval-cli.js +2626 -0
  28. package/scripts/generate-paper-figures.py +452 -0
  29. package/scripts/qualitative-analysis-ai.js +1313 -0
  30. package/scripts/qualitative-analysis.js +688 -0
  31. package/scripts/seed-db.js +87 -0
  32. package/scripts/show-failed-suggestions.js +64 -0
  33. package/scripts/validate-content.js +192 -0
  34. package/server.js +3 -2
  35. package/services/__tests__/evalConfigLoader.test.js +338 -0
  36. package/services/anovaStats.js +499 -0
  37. package/services/contentResolver.js +407 -0
  38. package/services/dialogueTraceAnalyzer.js +454 -0
  39. package/services/evalConfigLoader.js +625 -0
  40. package/services/evaluationRunner.js +2171 -270
  41. package/services/evaluationStore.js +564 -29
  42. package/services/learnerConfigLoader.js +75 -5
  43. package/services/learnerRubricEvaluator.js +284 -0
  44. package/services/learnerTutorInteractionEngine.js +375 -0
  45. package/services/processUtils.js +18 -0
  46. package/services/progressLogger.js +98 -0
  47. package/services/promptRecommendationService.js +31 -26
  48. package/services/promptRewriter.js +427 -0
  49. package/services/rubricEvaluator.js +543 -70
  50. package/services/streamingReporter.js +104 -0
  51. package/services/turnComparisonAnalyzer.js +494 -0
  52. package/components/MobileEvalDashboard.tsx +0 -267
  53. package/components/comparison/DeltaAnalysisTable.tsx +0 -137
  54. package/components/comparison/ProfileComparisonCard.tsx +0 -176
  55. package/components/comparison/RecognitionABMode.tsx +0 -385
  56. package/components/comparison/RecognitionMetricsPanel.tsx +0 -135
  57. package/components/comparison/WinnerIndicator.tsx +0 -64
  58. package/components/comparison/index.ts +0 -5
  59. package/components/mobile/BottomSheet.tsx +0 -233
  60. package/components/mobile/DimensionBreakdown.tsx +0 -210
  61. package/components/mobile/DocsView.tsx +0 -363
  62. package/components/mobile/LogsView.tsx +0 -481
  63. package/components/mobile/PsychodynamicQuadrant.tsx +0 -261
  64. package/components/mobile/QuickTestView.tsx +0 -1098
  65. package/components/mobile/RecognitionTypeChart.tsx +0 -124
  66. package/components/mobile/RecognitionView.tsx +0 -809
  67. package/components/mobile/RunDetailView.tsx +0 -261
  68. package/components/mobile/RunHistoryView.tsx +0 -367
  69. package/components/mobile/ScoreRadial.tsx +0 -211
  70. package/components/mobile/StreamingLogPanel.tsx +0 -230
  71. package/components/mobile/SynthesisStrategyChart.tsx +0 -140
  72. package/docs/research/ABLATION-DIALOGUE-ROUNDS.md +0 -52
  73. package/docs/research/ABLATION-MODEL-SELECTION.md +0 -53
  74. package/docs/research/ADVANCED-EVAL-ANALYSIS.md +0 -60
  75. package/docs/research/ANOVA-RESULTS-2026-01-14.md +0 -257
  76. package/docs/research/COMPREHENSIVE-EVALUATION-PLAN.md +0 -586
  77. package/docs/research/COST-ANALYSIS.md +0 -56
  78. package/docs/research/CRITICAL-REVIEW-RECOGNITION-TUTORING.md +0 -340
  79. package/docs/research/DYNAMIC-VS-SCRIPTED-ANALYSIS.md +0 -291
  80. package/docs/research/EVAL-SYSTEM-ANALYSIS.md +0 -306
  81. package/docs/research/FACTORIAL-RESULTS-2026-01-14.md +0 -301
  82. package/docs/research/IMPLEMENTATION-PLAN-CRITIQUE-RESPONSE.md +0 -1988
  83. package/docs/research/LONGITUDINAL-DYADIC-EVALUATION.md +0 -282
  84. package/docs/research/MULTI-JUDGE-VALIDATION-2026-01-14.md +0 -147
  85. package/docs/research/PAPER-EXTENSION-DYADIC.md +0 -204
  86. package/docs/research/PAPER-UNIFIED.md +0 -659
  87. package/docs/research/PAPER-UNIFIED.pdf +0 -0
  88. package/docs/research/PROMPT-IMPROVEMENTS-2026-01-14.md +0 -356
  89. package/docs/research/SESSION-NOTES-2026-01-11-RECOGNITION-EVAL.md +0 -419
  90. package/docs/research/apa.csl +0 -2133
  91. package/docs/research/archive/PAPER-DRAFT-RECOGNITION-TUTORING.md +0 -1637
  92. package/docs/research/archive/paper-multiagent-tutor.tex +0 -978
  93. package/docs/research/paper-draft/full-paper.md +0 -136
  94. package/docs/research/paper-draft/images/pasted-image-2026-01-24T03-47-47-846Z-d76a7ae2.png +0 -0
  95. package/docs/research/paper-draft/references.bib +0 -515
  96. package/docs/research/transcript-baseline.md +0 -139
  97. package/docs/research/transcript-recognition-multiagent.md +0 -187
  98. package/hooks/useEvalData.ts +0 -625
  99. package/server-init.js +0 -45
  100. package/services/benchmarkService.js +0 -1892
  101. package/types.ts +0 -165
  102. package/utils/haptics.ts +0 -45
@@ -0,0 +1,589 @@
1
+ # Evaluation System Variables Reference
2
+
3
+ Complete inventory of all configuration variables, parameters, and moving parts in the machinespirits-eval system.
4
+
5
+ ---
6
+
7
+ ## 1. Environment Variables
8
+
9
+ | Variable | Required | Purpose | Used By |
10
+ |----------|----------|---------|---------|
11
+ | `OPENROUTER_API_KEY` | Yes (for budget/default profiles) | OpenRouter API authentication | evaluationRunner, rubricEvaluator, learnerConfigLoader |
12
+ | `ANTHROPIC_API_KEY` | For Anthropic-direct profiles | Anthropic API authentication | rubricEvaluator, promptRecommendationService |
13
+ | `OPENAI_API_KEY` | For OpenAI profiles | OpenAI API authentication | rubricEvaluator |
14
+ | `GEMINI_API_KEY` | For Gemini profiles | Google Gemini API authentication | rubricEvaluator |
15
+ | `TUTOR_PROFILE` / `TUTOR_AGENT_PROFILE` | No | Override active tutor profile | tutorConfigLoader |
16
+ | `LEARNER_PROFILE` / `LEARNER_AGENT_PROFILE` | No | Override active learner profile | learnerConfigLoader |
17
+ | `TUTOR_TRANSCRIPT` | No | Set to `'true'` to suppress debug logging | evaluationRunner, rubricEvaluator |
18
+ | `PORT` | No | Standalone server port (default: `8081`) | server.js |
19
+ | `STANDALONE` | No | Enable standalone server mode | server.js |
20
+
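
The provider-to-key mapping above can be sketched as a small lookup. The `resolveApiKey` helper below is illustrative only, not part of the package API:

```javascript
// Illustrative helper, not part of the package API. Maps each
// provider to the env var named in the table above.
const API_KEY_ENV = {
  openrouter: 'OPENROUTER_API_KEY',
  anthropic: 'ANTHROPIC_API_KEY',
  openai: 'OPENAI_API_KEY',
  gemini: 'GEMINI_API_KEY',
};

function resolveApiKey(provider, env = process.env) {
  const name = API_KEY_ENV[provider];
  if (!name) return null; // e.g. the `local` provider needs no key
  return env[name] ?? null;
}
```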
21
+ ---
22
+
23
+ ## 2. CLI Commands & Flags
24
+
25
+ **Entry point:** `node scripts/eval-cli.js`
26
+
27
+ ### Commands
28
+
29
+ | Command | Description |
30
+ |---------|-------------|
31
+ | `list` | List available scenarios, configurations, and profiles |
32
+ | `quick` / `test` | Run a single quick evaluation test |
33
+ | `run` | Run full evaluation batch |
34
+ | `report <runId>` | Generate report for a previous run |
35
+
36
+ ### Flags
37
+
38
+ | Flag | Type | Default | Description |
39
+ |------|------|---------|-------------|
40
+ | `--scenario <id>` | String | `new_user_first_visit` | Scenario ID to evaluate |
41
+ | `--config <name>` | String | — | Configuration name |
42
+ | `--profile <name>` | String | `budget` | Tutor profile name |
43
+ | `--skip-rubric` | Boolean | `false` | Skip AI judge, use pattern matching only |
44
+ | `--verbose` | Boolean | `false` | Enable verbose output |
45
+ | `--runs <n>` | Number | `1` | Number of runs per configuration |
46
+ | `--parallelism <n>` | Number | `2` | Concurrent test count |
47
+ | `--description <text>` | String | — | Label for the evaluation run |
48
+
49
+ ### npm Scripts
50
+
51
+ | Script | Expands To |
52
+ |--------|-----------|
53
+ | `npm run eval` | `node scripts/eval-cli.js` |
54
+ | `npm run eval:quick` | `node scripts/eval-cli.js quick` |
55
+ | `npm run eval:test` | `node scripts/eval-cli.js test` |
56
+ | `npm start` | `STANDALONE=true node server.js` |
57
+
58
+ ---
59
+
60
+ ## 3. Provider Configuration
61
+
62
+ **Source:** `node_modules/@machinespirits/tutor-core/config/providers.yaml`
63
+
64
+ ### Providers
65
+
66
+ | Provider | Base URL | API Key Env | Default Model |
67
+ |----------|----------|-------------|---------------|
68
+ | `anthropic` | `https://api.anthropic.com/v1/messages` | `ANTHROPIC_API_KEY` | `claude-sonnet-4-5` |
69
+ | `openai` | `https://api.openai.com/v1/chat/completions` | `OPENAI_API_KEY` | `gpt-5-mini` |
70
+ | `openrouter` | `https://openrouter.ai/api/v1/chat/completions` | `OPENROUTER_API_KEY` | `nvidia/nemotron-3-nano-30b-a3b:free` |
71
+ | `gemini` | `https://generativelanguage.googleapis.com/v1beta/models` | `GEMINI_API_KEY` | `gemini-3-flash-preview` |
72
+ | `local` | `http://localhost:1234/v1/chat/completions` | — (none) | `local-model` |
73
+
74
+ ### Model Aliases (OpenRouter)
75
+
76
+ | Alias | Model ID | Tier |
77
+ |-------|----------|------|
78
+ | `nemotron` | `nvidia/nemotron-3-nano-30b-a3b:free` | Free |
79
+ | `glm47` | `z-ai/glm-4.7` | Free |
80
+ | `kimi-k2` | `moonshotai/kimi-k2-thinking` | Free |
81
+ | `deepseek` | `deepseek/deepseek-v3.2` | Free |
82
+ | `minimax` | `minimax/minimax-m2.1ate` | Free |
83
+ | `haiku` | `anthropic/claude-haiku-4.5` | Budget |
84
+ | `gpt-oss` | `openai/gpt-oss-120b` | Free |
85
+ | `sonnet` | `anthropic/claude-sonnet-4.5` | Mid |
86
+ | `gpt-mini` | `openai/gpt-5-mini` | Mid |
87
+ | `gemini-flash` | `google/gemini-3-flash-preview` | Mid |
88
+ | `opus` | `anthropic/claude-opus-4.5` | Premium |
89
+ | `gpt` | `openai/gpt-5.2` | Premium |
90
+ | `gemini-pro` | `google/gemini-3-pro-preview` | Premium |
91
+
92
+ ### Model Pricing (per 1M tokens)
93
+
94
+ **Source:** `node_modules/@machinespirits/tutor-core/services/pricingConfig.js`
95
+
96
+ | Model Ref | Input ($) | Output ($) | Tier |
97
+ |-----------|-----------|------------|------|
98
+ | `openrouter.nemotron` | 0.00 | 0.00 | Free |
99
+ | `openrouter.gemini-flash` | 0.075 | 0.30 | Budget |
100
+ | `openrouter.gpt-mini` | 0.15 | 0.60 | Budget |
101
+ | `openrouter.deepseek` | 0.27 | 1.10 | Mid |
102
+ | `openrouter.haiku` | 0.80 | 4.00 | Budget |
103
+ | `openrouter.sonnet` | 3.00 | 15.00 | Mid |
104
+ | `openrouter.gpt` | 5.00 | 15.00 | Mid |
105
+ | `openrouter.gemini-pro` | 1.25 | 5.00 | Mid |
106
+ | `openrouter.opus` | 15.00 | 75.00 | Premium |
107
+
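
Per-call cost follows directly from these prices. The `callCost` helper below is a sketch with prices copied from the table; it is not the package's pricing code:

```javascript
// Sketch only: prices (USD per 1M tokens) copied from the table above.
const PRICES = {
  'openrouter.nemotron': { input: 0.0, output: 0.0 },
  'openrouter.haiku': { input: 0.8, output: 4.0 },
  'openrouter.sonnet': { input: 3.0, output: 15.0 },
};

function callCost(modelRef, inputTokens, outputTokens) {
  const p = PRICES[modelRef];
  if (!p) return null; // no pricing data for this model
  return (inputTokens * p.input + outputTokens * p.output) / 1e6;
}
```

For example, a 1,200-input / 800-output call on `openrouter.sonnet` works out to (1200 × 3.00 + 800 × 15.00) / 1e6 ≈ $0.0156.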
108
+ ---
109
+
110
+ ## 4. Tutor Profiles
111
+
112
+ **Source:** `config/tutor-agents.yaml` (local override of tutor-core)
113
+
114
+ **Active profile:** `budget` (overridable via `TUTOR_PROFILE` env var)
115
+
116
+ ### 2×2×2 Factorial Design
117
+
118
+ Three independent variables, each with two levels:
119
+ - **Factor A: Recognition** — standard prompts vs recognition-enhanced prompts + memory
120
+ - **Factor B: Architecture** — single agent (ego only) vs multi-agent (ego + superego)
121
+ - **Factor C: Model Tier** — free (nemotron) vs paid (sonnet)
122
+
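
Crossing the three factors yields eight cells; a quick enumeration sketch (the level labels are illustrative shorthand, not profile names):

```javascript
// Enumerate the 2×2×2 design: each factor contributes two levels.
const recognition = ['standard', 'recognition'];
const architecture = ['single', 'multi'];
const tier = ['free', 'paid'];

const cells = [];
for (const r of recognition)
  for (const a of architecture)
    for (const t of tier)
      cells.push(`${r}/${a}/${t}`);
// cells holds 8 entries, one per factorial cell
```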
123
+ ### Profile Summary
124
+
125
+ | Profile | Recognition | Architecture | Model | Dialogue | Description |
126
+ |---------|------------|-------------|-------|----------|-------------|
127
+ | `budget` | — | Single | deepseek | No | Dev workhorse (not in factorial) |
128
+ | `single_baseline` | No | Single | nemotron | No | Cell 1: control |
129
+ | `single_baseline_paid` | No | Single | sonnet | No | Cell 2: model quality only |
130
+ | `baseline` | No | Ego+Superego | nemotron | Yes (2) | Cell 3: architecture only |
131
+ | `baseline_paid` | No | Ego+Superego | sonnet | Yes (2) | Cell 4: architecture + model |
132
+ | `single_recognition` | Yes | Single | nemotron | No | Cell 5: recognition only |
133
+ | `single_recognition_paid` | Yes | Single | sonnet | No | Cell 6: recognition + model |
134
+ | `recognition` | Yes | Ego+Superego | nemotron | Yes (2) | Cell 7: recognition + architecture |
135
+ | `recognition_paid` | Yes | Ego+Superego | sonnet | Yes (2) | Cell 8: all three factors |
136
+
137
+ ### Hyperparameters per Agent Role
138
+
139
+ | Parameter | Typical Ego | Typical Superego | Judge |
140
+ |-----------|-------------|------------------|-------|
141
+ | `temperature` | 0.6–0.7 | 0.4–0.6 | 0.2 |
142
+ | `max_tokens` | 800–2500 | 800–1500 | 4000 |
143
+
144
+ ### Superego Intervention Strategies
145
+
146
+ | Strategy | Style | Description |
147
+ |----------|-------|-------------|
148
+ | `direct_critique` | Assertive | Explicit issue identification with reasoning |
149
+ | `socratic_challenge` | Questioning | Question-based guidance |
150
+ | `reframing` | Collaborative | Alternative perspectives |
151
+ | `prompt_rewrite` | Meta | System prompt improvement suggestions |
152
+
153
+ ### Dialogue Settings
154
+
155
+ | Parameter | Description | Typical Values |
156
+ |-----------|-------------|----------------|
157
+ | `enabled` | Whether ego-superego dialogue runs | `true` / `false` |
158
+ | `max_rounds` | Maximum ego-superego exchange rounds | 0–3 |
159
+ | `convergence_threshold` | Similarity threshold to stop early | 0.7–0.8 |
160
+
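
A convergence check could look like the sketch below. The Jaccard word-overlap similarity is an assumption for illustration; the package may measure similarity differently:

```javascript
// Illustrative only: stop the ego-superego dialogue early once
// consecutive drafts are similar enough. Jaccard overlap over word
// sets stands in for whatever the real similarity measure is.
function similarity(a, b) {
  const A = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const B = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (A.size === 0 && B.size === 0) return 1;
  const intersection = [...A].filter((w) => B.has(w)).length;
  const union = new Set([...A, ...B]).size;
  return intersection / union;
}

function hasConverged(prevDraft, nextDraft, threshold = 0.75) {
  return similarity(prevDraft, nextDraft) >= threshold;
}
```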
161
+ ### Intervention Thresholds
162
+
163
+ | Parameter | Default | Description |
164
+ |-----------|---------|-------------|
165
+ | `low_intensity_skip_dialogue` | `true` | Skip dialogue for low-stakes interactions |
166
+ | `high_intensity_extra_rounds` | `true` | Add rounds for high-stakes situations |
167
+ | `struggle_signal_threshold` | `2` | Number of struggle signals before escalation |
168
+ | `rapid_nav_window_ms` | `30000` | Window for detecting rapid navigation |
169
+ | `retry_frustration_count` | `3` | Retries before frustration flag |
170
+
171
+ ---
172
+
173
+ ## 5. Rubric Dimensions & Scoring
174
+
175
+ **Source:** `config/evaluation-rubric.yaml`
176
+
177
+ ### Scoring Scale
178
+
179
+ All dimensions scored 1–5:
180
+ - **1:** Completely fails
181
+ - **2:** Weak, significant issues
182
+ - **3:** Adequate, meets basic expectations
183
+ - **4:** Good, exceeds expectations
184
+ - **5:** Excellent, exemplary
185
+
186
+ ### Base Dimensions (6)
187
+
188
+ | Dimension | Weight | Evaluates |
189
+ |-----------|--------|-----------|
190
+ | Relevance | 0.15 | Context-awareness and appropriateness to learner state |
191
+ | Specificity | 0.15 | Concrete references vs. vague suggestions |
192
+ | Pedagogical Soundness | 0.15 | ZPD targeting, scaffolding, learning science alignment |
193
+ | Personalization | 0.10 | Tailored to individual learner journey/history |
194
+ | Actionability | 0.10 | Clear, executable next steps |
195
+ | Tone | 0.10 | Supportive, encouraging, appropriate register |
196
+
197
+ ### Recognition Dimensions (4, Phase 5)
198
+
199
+ | Dimension | Weight | Evaluates |
200
+ |-----------|--------|-----------|
201
+ | Mutual Recognition | 0.10 | Acknowledges learner as autonomous subject with own understanding |
202
+ | Dialectical Responsiveness | 0.10 | Genuine engagement with learner's position, productive tension |
203
+ | Memory Integration | 0.05 | References and builds on prior interactions |
204
+ | Transformative Potential | 0.10 | Creates conditions for conceptual restructuring |
205
+
206
+ **Overall Score Formula:**
207
+ ```
208
+ weighted_avg = Σ(dimension_score × weight) / totalWeight   # 1–5
+ overall = (weighted_avg - 1) / 4 × 100                     # 0–100
209
+ ```
210
+
211
+ ### Dual Scoring
212
+
213
+ The system reports three scores per evaluation:
214
+
215
+ | Score | Dimensions | Purpose |
216
+ |-------|-----------|---------|
217
+ | `overallScore` | All 10 dimensions | Combined quality metric |
218
+ | `baseScore` | 6 base dimensions (relevance, specificity, pedagogical, personalization, actionability, tone) | Pedagogical quality |
219
+ | `recognitionScore` | 4 recognition dimensions (mutual_recognition, dialectical_responsiveness, memory_integration, transformative_potential) | Recognition dynamics quality |
220
+
221
+ Each sub-score re-normalizes weights to sum to 1.0 within its dimension group. Both sub-scores use the same `(avg - 1) / 4 × 100` formula to produce 0–100 values.
222
+
223
+ Stored in `evaluation_results` as `base_score` and `recognition_score` columns. Aggregated in `getRunStats()` as `avg_base_score` and `avg_recognition_score`.
224
+
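
The renormalization described above can be sketched in a few lines. This mirrors the formulas in this section but is not the package's actual implementation:

```javascript
// Sketch of the scoring arithmetic: weights are renormalized within
// the chosen dimension group, then the 1–5 weighted average is
// mapped onto 0–100 via (avg - 1) / 4 × 100.
function groupScore(scores, weights) {
  let weighted = 0;
  let totalWeight = 0;
  for (const dim of Object.keys(weights)) {
    weighted += scores[dim] * weights[dim];
    totalWeight += weights[dim];
  }
  const avg = weighted / totalWeight; // 1–5 scale
  return ((avg - 1) / 4) * 100;       // 0–100 scale
}

// Passing only a group's weights yields that sub-score, e.g. the
// recognition group from the table above:
const recognitionWeights = {
  mutual_recognition: 0.10,
  dialectical_responsiveness: 0.10,
  memory_integration: 0.05,
  transformative_potential: 0.10,
};
```

All-5 scores map to 100, all-3 scores to 50, regardless of how the group's raw weights sum.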
225
+ ### Evaluation Modes
226
+
227
+ | Mode | Trigger | Method | Speed |
228
+ |------|---------|--------|-------|
229
+ | Fast | `--skip-rubric` | Pattern matching on `required_elements` / `forbidden_elements` | ~1–5s |
230
+ | Full Rubric | Default | AI judge semantic evaluation per dimension | ~5–30s |
231
+
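
Fast mode's pattern check can be sketched as plain case-insensitive substring matching (an assumption; the actual matcher may use regexes):

```javascript
// Illustrative fast-mode check: no AI judge, just pattern presence.
// Field names follow the suggestion scenario schema.
function fastCheck(text, scenario) {
  const lower = text.toLowerCase();
  const contains = (p) => lower.includes(p.toLowerCase());
  const requiredMissing = (scenario.required_elements || []).filter((p) => !contains(p));
  const forbiddenFound = (scenario.forbidden_elements || []).filter(contains);
  return {
    passesRequired: requiredMissing.length === 0,
    passesForbidden: forbiddenFound.length === 0,
    requiredMissing,
    forbiddenFound,
  };
}
```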
232
+ ---
233
+
234
+ ## 6. Judge / Evaluator Configuration
235
+
236
+ **Source:** `config/evaluation-rubric.yaml` — unified source of truth for all judge models
237
+
238
+ ### Suggestion Judge
239
+
240
+ | Parameter | Value | Purpose |
241
+ |-----------|-------|---------|
242
+ | Primary model | `openrouter.kimi-k2.5` | Rubric dimension scoring |
243
+ | Fallback model | `openrouter.nemotron` | Used if primary fails |
244
+ | Temperature | 0.2 | Low for scoring consistency |
245
+ | Max tokens | 4000 | Judge response budget |
246
+
247
+ ### Interaction Judge
248
+
249
+ | Parameter | Value | Purpose |
250
+ |-----------|-------|---------|
251
+ | Primary model | `openrouter.kimi-k2.5` | Learner-tutor dialogue evaluation |
252
+ | Fallback model | `openrouter.nemotron` | Used if primary fails |
253
+ | Temperature | 0.2 | Low for scoring consistency |
254
+ | Max tokens | 6000 | Larger budget for multi-turn analysis |
255
+
256
+ Note: `interaction-eval-scenarios.yaml` no longer defines its own model — it references `evaluation-rubric.yaml → interaction_judge` as the single source of truth.
257
+
258
+ ### Recommender (Prompt Improvement Analysis)
259
+
260
+ | Parameter | Value |
261
+ |-----------|-------|
262
+ | Model | `openrouter.kimi-k2.5` |
263
+ | Fallback | `openrouter.nemotron` |
264
+ | Temperature | 0.4 |
265
+ | Max tokens | 6000 |
266
+
267
+ ### Fallback Chain in rubricEvaluator.js
268
+
269
+ 1. Primary judge model (from rubric config)
270
+ 2. Fallback judge model (from rubric config)
271
+ 3. Hardcoded fallback: `deepseek/deepseek-chat-v3-0324`
272
+
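
The chain amounts to try-in-order logic; a sketch, with `callJudge` as a stand-in for the real request function:

```javascript
// Illustrative fallback chain: try each judge model in order and
// return the first success; rethrow the last error if all fail.
async function judgeWithFallback(prompt, models, callJudge) {
  let lastError;
  for (const model of models) {
    try {
      return { model, result: await callJudge(model, prompt) };
    } catch (err) {
      lastError = err; // move on to the next model in the chain
    }
  }
  throw lastError;
}
```

Usage would pass the three-step list, e.g. `judgeWithFallback(prompt, [primary, fallback, 'deepseek/deepseek-chat-v3-0324'], callJudge)`.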
273
+ ---
274
+
275
+ ## 7. Suggestion Scenarios
276
+
277
+ **Source:** `config/suggestion-scenarios.yaml` (previously in `evaluation-rubric.yaml`)
278
+
279
+ **Total scenarios:** 15 (trimmed from 49 in v2.0.0)
280
+
281
+ Each scenario has a `type: suggestion` discriminator, an explicit `id` field, and a `category` field.
282
+
283
+ ### Categories
284
+
285
+ | Category | Count | Purpose |
286
+ |----------|-------|---------|
287
+ | `core` | 6 | Fundamental learner states |
288
+ | `mood` | 2 | Emotional affect testing |
289
+ | `benchmark` | 1 | Cross-model sycophancy resistance |
290
+ | `recognition` | 3 | Hegelian recognition dynamics |
291
+ | `multi_turn` | 3 | Multi-step dialogue arcs |
292
+
293
+ ### Scenarios by Category
294
+
295
+ | Category | ID | Name | Min Score | Turns |
296
+ |----------|----|------|-----------|-------|
297
+ | `core` | `new_user_first_visit` | New User - First Visit | 70 | 1 |
298
+ | `core` | `returning_user_mid_course` | Returning User - Mid Course | 75 | 1 |
299
+ | `core` | `struggling_learner` | Struggling Learner | 75 | 1 |
300
+ | `core` | `high_performer` | High Performing Learner | 75 | 1 |
301
+ | `core` | `concept_confusion` | Concept Confusion | 75 | 1 |
302
+ | `core` | `activity_avoider` | Activity Avoider | 70 | 1 |
303
+ | `mood` | `mood_frustrated_explicit` | Mood: Frustrated (Explicit) | 80 | 1 |
304
+ | `mood` | `mood_excited_curious` | Mood: Excited (High Engagement) | 80 | 1 |
305
+ | `benchmark` | `adversarial_tester` | Adversarial Tester | 75 | 1 |
306
+ | `recognition` | `recognition_seeking_learner` | Recognition: Learner Seeking Understanding | 80 | 1 |
307
+ | `recognition` | `memory_continuity_single` | Recognition: Memory Continuity | 80 | 1 |
308
+ | `recognition` | `transformative_moment_setup` | Recognition: Creating Transformative Conditions | 80 | 1 |
309
+ | `multi_turn` | `mood_frustration_to_breakthrough` | Frustration to Breakthrough | 75 | 4 |
310
+ | `multi_turn` | `misconception_correction_flow` | Misconception Correction | 75 | 5 |
311
+ | `multi_turn` | `mutual_transformation_journey` | Mutual Transformation | 85 | 6 |
312
+
313
+ ### Scenario Structure
314
+
315
+ Each scenario defines:
316
+ - `type` — Discriminator (`suggestion` for all scenarios in this file)
317
+ - `id` — Explicit scenario ID (matches the YAML key)
318
+ - `category` — One of: `core`, `mood`, `benchmark`, `recognition`, `multi_turn`
319
+ - `name` — Human-readable title
320
+ - `description` — Situation context
321
+ - `is_new_user` — Boolean
322
+ - `learner_context` — Detailed learner state (markdown)
323
+ - `expected_behavior` — What the tutor should do
324
+ - `required_elements` — Patterns that MUST appear (fast mode)
325
+ - `forbidden_elements` — Patterns that must NOT appear (fast mode)
326
+ - `min_acceptable_score` — Passing threshold (0–100)
327
+ - `turns` — Array of follow-up turns (multi-turn scenarios only)
328
+
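
Putting the fields together, a hypothetical entry (invented for illustration; not a scenario shipped in this package) might look like:

```yaml
# Hypothetical example, not shipped in the package.
example_scenario:
  type: suggestion
  id: example_scenario
  category: core
  name: Example Scenario
  description: Learner has stalled on the loops module after two failed quizzes.
  is_new_user: false
  learner_context: |
    ## Recent activity
    - "Loops 101" quiz attempted twice (40%, 45%)
  expected_behavior: Acknowledge the struggle and propose one small, concrete next step.
  required_elements:
    - "loop"
  forbidden_elements:
    - "just try harder"
  min_acceptable_score: 75
```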
329
+ ---
330
+
331
+ ## 8. Interaction Evaluation Scenarios
332
+
333
+ **Source:** `config/interaction-eval-scenarios.yaml`
334
+
335
+ Each scenario has a `type: interaction` discriminator. The expected number of dialogue turns uses the `turn_count` field (not `turns`, which is reserved for scripted turn arrays in suggestion scenarios).
336
+
337
+ ### Short-Term (Within-Session)
338
+
339
+ | ID | Turn Count | Focus | Key Weights |
340
+ |----|------------|-------|-------------|
341
+ | `recognition_request` | 4 | Learner seeks validation | mutual_recognition (0.3), dialectical_responsiveness (0.3) |
342
+ | `frustration_moment` | 5 | Learner frustration | emotional_attunement (0.3), scaffolding (0.3) |
343
+ | `misconception_surface` | 4 | Misconception revealed | mutual_recognition (0.3), dialectical_responsiveness (0.3) |
344
+ | `breakthrough_moment` | 3 | Learner insight | validation (0.3), extension (0.3) |
345
+ | `resistant_engagement` | 6 | Intelligent pushback | intellectual_respect (0.3), dialectical_responsiveness (0.3) |
346
+
347
+ ### Long-Term (Multi-Session)
348
+
349
+ | ID | Sessions | Turns/Session | Focus |
350
+ |----|----------|---------------|-------|
351
+ | `novice_to_practitioner` | 5 | 4 | Learning arc |
352
+ | `stranger_to_recognized` | 4 | 5 | Relationship arc |
353
+ | `tutor_adaptation` | 4 | 4 | Tutor learning arc |
354
+
355
+ ### Interaction Evaluation Dimensions (with weights)
356
+
357
+ Weights sum to 1.0 across all 10 dimensions. Each dimension has 5-point anchors (1–5).
358
+
359
+ **Learner (0.40 total):**
360
+
361
+ | Dimension | Weight | Description |
362
+ |-----------|--------|-------------|
363
+ | `authenticity` | 0.10 | Internal dynamics reflect persona realistically |
364
+ | `responsiveness` | 0.10 | Genuine reaction to tutor's engagement |
365
+ | `development` | 0.10 | Shows movement in understanding |
366
+ | `emotional_trajectory` | 0.05 | Emotional state changes appropriately |
367
+ | `knowledge_retention` | 0.05 | Concepts persist across sessions |
368
+
369
+ **Tutor (0.40 total):**
370
+
371
+ | Dimension | Weight | Description |
372
+ |-----------|--------|-------------|
373
+ | `strategy_adaptation` | 0.15 | Modifies approach based on effectiveness |
374
+ | `scaffolding_reduction` | 0.15 | Fades support as learner grows |
375
+ | `memory_utilization` | 0.10 | Effectively uses accumulated knowledge |
376
+
377
+ **Relationship (0.20 total):**
378
+
379
+ | Dimension | Weight | Description |
380
+ |-----------|--------|-------------|
381
+ | `trust_trajectory` | 0.10 | Trust develops appropriately over time |
382
+ | `mutual_recognition_depth` | 0.10 | Both parties show understanding of other |
383
+
384
+ ### Interaction Outcome Types
385
+
386
+ | Outcome | Meaning |
387
+ |---------|---------|
388
+ | `BREAKTHROUGH` | Genuine understanding demonstrated |
389
+ | `PRODUCTIVE_STRUGGLE` | Healthy confusion and effort |
390
+ | `MUTUAL_RECOGNITION` | Both parties recognize each other |
391
+ | `FRUSTRATION` | Learner becomes frustrated |
392
+ | `DISENGAGEMENT` | Learner disengages |
393
+ | `SCAFFOLDING_NEEDED` | Learner needs more support |
394
+ | `FADING_APPROPRIATE` | Ready for less support |
395
+ | `TRANSFORMATION` | Conceptual restructuring occurring |
396
+
397
+ ---
398
+
399
+ ## 9. Learner Agent Configuration
400
+
401
+ **Source:** `config/learner-agents.yaml`
402
+
403
+ **Active architecture:** `unified` (overridable by tutor profile or `LEARNER_PROFILE` env var)
404
+
405
+ ### Architectures
406
+
407
+ | Architecture | Agents | Deliberation | Rounds | Convergence |
408
+ |--------------|--------|-------------|--------|-------------|
409
+ | `unified` | 1 (single learner) | Disabled | 0 | — |
410
+ | `psychodynamic` | 4 (Desire, Intellect, Aspiration, Synthesizer) | Enabled | 2 | 0.7 |
411
+ | `dialectical` | 3 (Thesis, Antithesis, Synthesis) | Enabled | 2 | 0.7 |
412
+
413
+ ### Psychodynamic Sub-Agent Hyperparameters
414
+
415
+ | Agent | Role | Temperature | Max Tokens |
416
+ |-------|------|-------------|------------|
417
+ | Desire | Id | 0.8 | 400 |
418
+ | Intellect | Ego | 0.5 | 400 |
419
+ | Aspiration | Superego | 0.6 | 400 |
420
+ | Synthesizer | — | 0.6 | 500 |
421
+
422
+ ### Dialectical Sub-Agent Hyperparameters
423
+
424
+ | Agent | Temperature | Max Tokens |
425
+ |-------|-------------|------------|
426
+ | Thesis | 0.6 | 400 |
427
+ | Antithesis | 0.7 | 400 |
428
+ | Synthesis | 0.6 | 500 |
429
+
430
+ ### Persona Modifiers
431
+
432
+ | Persona | Desire Wt | Intellect Wt | Aspiration Wt |
433
+ |---------|-----------|--------------|---------------|
434
+ | `confused_novice` | 0.4 | 0.3 | 0.3 |
435
+ | `eager_explorer` | 0.5 | 0.3 | 0.2 |
436
+ | `focused_achiever` | 0.2 | 0.4 | 0.4 |
437
+ | `struggling_anxious` | 0.5 | 0.2 | 0.3 |
438
+ | `adversarial_tester` | 0.3 | 0.4 | 0.3 |
439
+
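
How these weights enter the synthesis is not specified in this document; one plausible reading, shown purely as a hypothetical sketch, is a weighted blend of sub-agent contributions:

```javascript
// Hypothetical only: the real blending logic lives in the learner
// engine. This simply shows persona weights applied as a weighted
// average over the three sub-agent signals.
function blend(signals, personaWeights) {
  return (
    signals.desire * personaWeights.desire +
    signals.intellect * personaWeights.intellect +
    signals.aspiration * personaWeights.aspiration
  );
}

// Weights for `confused_novice`, from the table above.
const confusedNovice = { desire: 0.4, intellect: 0.3, aspiration: 0.3 };
```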
440
+ ### Ablation Study Profiles
441
+
442
+ Eight profiles cover the 2×2×2 factorial design:
443
+ - **Factor 1:** Single-agent (`unified`) vs Multi-agent (`psychodynamic`)
444
+ - **Factor 2:** Baseline tutor vs Recognition tutor
445
+ - **Factor 3:** Baseline prompts vs Multi-agent tutor dialogue
446
+
447
+ ---
448
+
449
+ ## 10. Evaluation Runner Constants
450
+
451
+ **Source:** `services/evaluationRunner.js`
452
+
453
+ | Constant | Value | Purpose |
454
+ |----------|-------|---------|
455
+ | `DEFAULT_PARALLELISM` | 2 | Concurrent test execution |
456
+ | `REQUEST_DELAY_MS` | 500 | Delay between API calls (ms) |
457
+ | `MAX_RETRIES` | 3 | Retry attempts on rate limit |
458
+ | `INITIAL_RETRY_DELAY_MS` | 2000 | Exponential backoff start (ms) |
459
+
460
+ Backoff formula: `INITIAL_RETRY_DELAY_MS × 2^attempt` → 2s, 4s, 8s
461
+
462
+ Only retries on 429 / rate limit errors, not other failures.
463
+
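
These constants combine into retry logic along the following lines (a sketch, not the package's exact code; `isRateLimit` is a hypothetical predicate):

```javascript
const MAX_RETRIES = 3;
const INITIAL_RETRY_DELAY_MS = 2000;

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const backoffDelay = (attempt) => INITIAL_RETRY_DELAY_MS * 2 ** attempt; // 2s, 4s, 8s

// Retry only on rate-limit errors, as described above; other
// failures propagate immediately.
async function withRetry(fn, isRateLimit) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!isRateLimit(err) || attempt >= MAX_RETRIES) throw err;
      await sleep(backoffDelay(attempt));
    }
  }
}
```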
464
+ ---
465
+
466
+ ## 11. Database Schema
467
+
468
+ **Source:** `services/evaluationStore.js`
469
+ **Database:** `data/evaluations.db` (SQLite, WAL mode)
470
+
471
+ ### evaluation_runs
472
+
473
+ | Column | Type | Description |
474
+ |--------|------|-------------|
475
+ | `id` | TEXT PK | Unique run ID |
476
+ | `created_at` | DATETIME | Run start time |
477
+ | `description` | TEXT | Run label |
478
+ | `total_scenarios` | INTEGER | Scenario count |
479
+ | `total_configurations` | INTEGER | Config count |
480
+ | `total_tests` | INTEGER | Total test count |
481
+ | `status` | TEXT | `running` / `completed` |
482
+ | `completed_at` | DATETIME | Run end time |
483
+ | `metadata` | TEXT (JSON) | Additional metadata |
484
+
485
+ ### evaluation_results
486
+
487
+ | Column | Type | Description |
488
+ |--------|------|-------------|
489
+ | `id` | TEXT PK | Result ID |
490
+ | `run_id` | TEXT FK | Parent run |
491
+ | `scenario_id` | TEXT | Scenario tested |
492
+ | `scenario_type` | TEXT | `suggestion` or `interaction` (default: `suggestion`) |
493
+ | `provider` | TEXT | AI provider used |
494
+ | `model` | TEXT | Model ID |
495
+ | `profile_name` | TEXT | Tutor profile |
496
+ | `hyperparameters` | TEXT (JSON) | Temperature, max_tokens, etc. |
497
+ | `prompt_id` | TEXT | Prompt version |
498
+ | `latency_ms` | INTEGER | Response time |
499
+ | `input_tokens` | INTEGER | Tokens sent |
500
+ | `output_tokens` | INTEGER | Tokens received |
501
+ | `cost` | REAL | USD cost |
502
+ | `dialogue_rounds` | INTEGER | Ego-superego rounds |
503
+ | `api_calls` | INTEGER | Total API calls |
504
+ | `score_relevance` | REAL | 1–5 |
505
+ | `score_specificity` | REAL | 1–5 |
506
+ | `score_pedagogical` | REAL | 1–5 |
507
+ | `score_personalization` | REAL | 1–5 |
508
+ | `score_actionability` | REAL | 1–5 |
509
+ | `score_tone` | REAL | 1–5 |
510
+ | `overall_score` | REAL | 0–100 weighted |
511
+ | `base_score` | REAL | 0–100 base dimensions only |
512
+ | `recognition_score` | REAL | 0–100 recognition dimensions only |
513
+ | `passes_required` | INTEGER | Required elements check |
514
+ | `passes_forbidden` | INTEGER | Forbidden elements check |
515
+ | `required_missing` | TEXT (JSON) | Missing required patterns |
516
+ | `forbidden_found` | TEXT (JSON) | Found forbidden patterns |
517
+ | `judge_model` | TEXT | Judge model used |
518
+ | `evaluation_reasoning` | TEXT | Judge explanation |
519
+ | `success` | INTEGER | 1 = success, 0 = error |
520
+ | `error_message` | TEXT | Error details if failed |
521
+
522
+ ### interaction_evaluations
523
+
524
+ | Column | Type | Description |
525
+ |--------|------|-------------|
526
+ | `scenario_id` | TEXT | Interaction scenario |
527
+ | `eval_type` | TEXT | `short_term` / `long_term` |
528
+ | `learner_profile` | TEXT | Learner architecture used |
529
+ | `tutor_profile` | TEXT | Tutor profile used |
530
+ | `persona_id` | TEXT | Learner persona |
531
+ | `turn_count` | INTEGER | Number of turns |
532
+ | `turns` | TEXT (JSON) | Full turn-by-turn dialogue |
533
+ | `total_tokens` | INTEGER | Combined token usage |
534
+ | `learner_tokens` | INTEGER | Learner agent tokens |
535
+ | `tutor_tokens` | INTEGER | Tutor agent tokens |
536
+ | `latency_ms` | INTEGER | Total interaction time |
537
+ | `learner_memory_before` | TEXT (JSON) | Memory snapshot pre-interaction |
538
+ | `learner_memory_after` | TEXT (JSON) | Memory snapshot post-interaction |
539
+ | `tutor_memory_before` | TEXT (JSON) | Tutor memory pre-interaction |
540
+ | `tutor_memory_after` | TEXT (JSON) | Tutor memory post-interaction |
541
+ | `judge_overall_score` | REAL | Judge's overall score |
542
+ | `judge_evaluation` | TEXT (JSON) | Full judge evaluation |
543
+
544
+ ---
545
+
546
+ ## 12. Memory Systems
547
+
548
+ ### Writing Pad (Freudian Model)
549
+
550
+ | Layer | Persistence | Content |
551
+ |-------|-------------|---------|
552
+ | Conscious | Ephemeral | Current interaction context |
553
+ | Preconscious | Session | Recent patterns and observations |
554
+ | Unconscious | Permanent | Traces of significant moments |
555
+
556
+ **Databases:**
557
+ - `data/learner-writing-pad.db` — Learner memory persistence
558
+ - `data/tutor-writing-pad.db` — Tutor memory persistence
559
+
560
+ ---
561
+
562
+ ## 13. API Endpoints (Standalone Server)
563
+
564
+ **Port:** 8081 (default, configurable via `PORT` env var)
565
+
566
+ | Method | Path | Description |
567
+ |--------|------|-------------|
568
+ | GET | `/api/eval/scenarios` | List evaluation scenarios |
569
+ | GET | `/api/eval/profiles` | List tutor profiles |
570
+ | GET | `/api/eval/runs` | List past evaluation runs |
571
+ | GET | `/api/eval/runs/:id` | Get specific run details |
572
+ | POST | `/api/eval/quick` | Run quick evaluation test |
573
+ | GET | `/health` | Health check |
574
+
575
+ ---
576
+
577
+ ## 14. Config File Locations
578
+
579
+ | File | Location | Purpose |
580
+ |------|----------|---------|
581
+ | `evaluation-rubric.yaml` | `config/` | Rubric dimensions, judge config (scenarios moved out) |
582
+ | `suggestion-scenarios.yaml` | `config/` | Suggestion evaluation scenarios (`type: suggestion`) |
583
+ | `interaction-eval-scenarios.yaml` | `config/` | Learner-tutor interaction scenarios (`type: interaction`) |
584
+ | `learner-agents.yaml` | `config/` | Learner architectures, personas |
585
+ | `providers.yaml` | `node_modules/@machinespirits/tutor-core/config/` | Provider definitions and model aliases |
586
+ | `tutor-agents.yaml` | `node_modules/@machinespirits/tutor-core/config/` | Tutor profiles, strategies, thresholds |
587
+ | `evaluations.db` | `data/` | SQLite results database |
588
+
589
+ **Note:** `providers.yaml` and `tutor-agents.yaml` have local overrides in `config/` that take precedence over the `@machinespirits/tutor-core` package versions. `suggestion-scenarios.yaml` is loaded by `evalConfigLoader.loadSuggestionScenarios()` with mtime-based caching, with a backward-compatible fallback to `evaluation-rubric.yaml` if the new file is missing.