@machinespirits/eval 0.1.2 → 0.2.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +161 -0
- package/config/eval-settings.yaml +18 -0
- package/config/evaluation-rubric-learner.yaml +277 -0
- package/config/evaluation-rubric.yaml +613 -0
- package/config/interaction-eval-scenarios.yaml +93 -50
- package/config/learner-agents.yaml +124 -193
- package/config/machinespirits-eval.code-workspace +11 -0
- package/config/providers.yaml +60 -0
- package/config/suggestion-scenarios.yaml +1399 -0
- package/config/tutor-agents.yaml +716 -0
- package/docs/EVALUATION-VARIABLES.md +589 -0
- package/docs/REPLICATION-PLAN.md +577 -0
- package/index.js +15 -6
- package/package.json +16 -22
- package/routes/evalRoutes.js +88 -36
- package/scripts/analyze-judge-reliability.js +401 -0
- package/scripts/analyze-run.js +97 -0
- package/scripts/analyze-run.mjs +282 -0
- package/scripts/analyze-validation-failures.js +141 -0
- package/scripts/check-run.mjs +17 -0
- package/scripts/code-impasse-strategies.js +1132 -0
- package/scripts/compare-runs.js +44 -0
- package/scripts/compare-suggestions.js +80 -0
- package/scripts/compare-transformation.js +116 -0
- package/scripts/dig-into-run.js +158 -0
- package/scripts/eval-cli.js +2626 -0
- package/scripts/generate-paper-figures.py +452 -0
- package/scripts/qualitative-analysis-ai.js +1313 -0
- package/scripts/qualitative-analysis.js +688 -0
- package/scripts/seed-db.js +87 -0
- package/scripts/show-failed-suggestions.js +64 -0
- package/scripts/validate-content.js +192 -0
- package/server.js +3 -2
- package/services/__tests__/evalConfigLoader.test.js +338 -0
- package/services/anovaStats.js +499 -0
- package/services/contentResolver.js +407 -0
- package/services/dialogueTraceAnalyzer.js +454 -0
- package/services/evalConfigLoader.js +625 -0
- package/services/evaluationRunner.js +2171 -270
- package/services/evaluationStore.js +564 -29
- package/services/learnerConfigLoader.js +75 -5
- package/services/learnerRubricEvaluator.js +284 -0
- package/services/learnerTutorInteractionEngine.js +375 -0
- package/services/processUtils.js +18 -0
- package/services/progressLogger.js +98 -0
- package/services/promptRecommendationService.js +31 -26
- package/services/promptRewriter.js +427 -0
- package/services/rubricEvaluator.js +543 -70
- package/services/streamingReporter.js +104 -0
- package/services/turnComparisonAnalyzer.js +494 -0
- package/components/MobileEvalDashboard.tsx +0 -267
- package/components/comparison/DeltaAnalysisTable.tsx +0 -137
- package/components/comparison/ProfileComparisonCard.tsx +0 -176
- package/components/comparison/RecognitionABMode.tsx +0 -385
- package/components/comparison/RecognitionMetricsPanel.tsx +0 -135
- package/components/comparison/WinnerIndicator.tsx +0 -64
- package/components/comparison/index.ts +0 -5
- package/components/mobile/BottomSheet.tsx +0 -233
- package/components/mobile/DimensionBreakdown.tsx +0 -210
- package/components/mobile/DocsView.tsx +0 -363
- package/components/mobile/LogsView.tsx +0 -481
- package/components/mobile/PsychodynamicQuadrant.tsx +0 -261
- package/components/mobile/QuickTestView.tsx +0 -1098
- package/components/mobile/RecognitionTypeChart.tsx +0 -124
- package/components/mobile/RecognitionView.tsx +0 -809
- package/components/mobile/RunDetailView.tsx +0 -261
- package/components/mobile/RunHistoryView.tsx +0 -367
- package/components/mobile/ScoreRadial.tsx +0 -211
- package/components/mobile/StreamingLogPanel.tsx +0 -230
- package/components/mobile/SynthesisStrategyChart.tsx +0 -140
- package/docs/research/ABLATION-DIALOGUE-ROUNDS.md +0 -52
- package/docs/research/ABLATION-MODEL-SELECTION.md +0 -53
- package/docs/research/ADVANCED-EVAL-ANALYSIS.md +0 -60
- package/docs/research/ANOVA-RESULTS-2026-01-14.md +0 -257
- package/docs/research/COMPREHENSIVE-EVALUATION-PLAN.md +0 -586
- package/docs/research/COST-ANALYSIS.md +0 -56
- package/docs/research/CRITICAL-REVIEW-RECOGNITION-TUTORING.md +0 -340
- package/docs/research/DYNAMIC-VS-SCRIPTED-ANALYSIS.md +0 -291
- package/docs/research/EVAL-SYSTEM-ANALYSIS.md +0 -306
- package/docs/research/FACTORIAL-RESULTS-2026-01-14.md +0 -301
- package/docs/research/IMPLEMENTATION-PLAN-CRITIQUE-RESPONSE.md +0 -1988
- package/docs/research/LONGITUDINAL-DYADIC-EVALUATION.md +0 -282
- package/docs/research/MULTI-JUDGE-VALIDATION-2026-01-14.md +0 -147
- package/docs/research/PAPER-EXTENSION-DYADIC.md +0 -204
- package/docs/research/PAPER-UNIFIED.md +0 -659
- package/docs/research/PAPER-UNIFIED.pdf +0 -0
- package/docs/research/PROMPT-IMPROVEMENTS-2026-01-14.md +0 -356
- package/docs/research/SESSION-NOTES-2026-01-11-RECOGNITION-EVAL.md +0 -419
- package/docs/research/apa.csl +0 -2133
- package/docs/research/archive/PAPER-DRAFT-RECOGNITION-TUTORING.md +0 -1637
- package/docs/research/archive/paper-multiagent-tutor.tex +0 -978
- package/docs/research/paper-draft/full-paper.md +0 -136
- package/docs/research/paper-draft/images/pasted-image-2026-01-24T03-47-47-846Z-d76a7ae2.png +0 -0
- package/docs/research/paper-draft/references.bib +0 -515
- package/docs/research/transcript-baseline.md +0 -139
- package/docs/research/transcript-recognition-multiagent.md +0 -187
- package/hooks/useEvalData.ts +0 -625
- package/server-init.js +0 -45
- package/services/benchmarkService.js +0 -1892
- package/types.ts +0 -165
- package/utils/haptics.ts +0 -45
@@ -0,0 +1,589 @@

# Evaluation System Variables Reference

Complete inventory of all configuration variables, parameters, and moving parts in the machinespirits-eval system.

---

## 1. Environment Variables

| Variable | Required | Purpose | Used By |
|----------|----------|---------|---------|
| `OPENROUTER_API_KEY` | Yes (for budget/default profiles) | OpenRouter API authentication | evaluationRunner, rubricEvaluator, learnerConfigLoader |
| `ANTHROPIC_API_KEY` | For Anthropic-direct profiles | Anthropic API authentication | rubricEvaluator, promptRecommendationService |
| `OPENAI_API_KEY` | For OpenAI profiles | OpenAI API authentication | rubricEvaluator |
| `GEMINI_API_KEY` | For Gemini profiles | Google Gemini API authentication | rubricEvaluator |
| `TUTOR_PROFILE` / `TUTOR_AGENT_PROFILE` | No | Override active tutor profile | tutorConfigLoader |
| `LEARNER_PROFILE` / `LEARNER_AGENT_PROFILE` | No | Override active learner profile | learnerConfigLoader |
| `TUTOR_TRANSCRIPT` | No | Set to `'true'` to suppress debug logging | evaluationRunner, rubricEvaluator |
| `PORT` | No | Standalone server port (default: `8081`) | server.js |
| `STANDALONE` | No | Enable standalone server mode | server.js |
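A typical standalone setup exports these before launching the server; the key value below is a placeholder, not a real credential:

```shell
# Placeholder values: substitute real keys. Only the variables in the table
# above are read by the system.
export OPENROUTER_API_KEY="sk-or-placeholder"   # required for budget/default profiles
export TUTOR_PROFILE="recognition"              # override the active tutor profile
export PORT=8081                                # standalone server port (8081 is the default)
export STANDALONE=true                          # enable standalone server mode
# then launch with `node server.js` (or `npm start`, which sets STANDALONE itself)
```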

---

## 2. CLI Commands & Flags

**Entry point:** `node scripts/eval-cli.js`

### Commands

| Command | Description |
|---------|-------------|
| `list` | List available scenarios, configurations, and profiles |
| `quick` / `test` | Run a single quick evaluation test |
| `run` | Run a full evaluation batch |
| `report <runId>` | Generate a report for a previous run |

### Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--scenario <id>` | String | `new_user_first_visit` | Scenario ID to evaluate |
| `--config <name>` | String | — | Configuration name |
| `--profile <name>` | String | `budget` | Tutor profile name |
| `--skip-rubric` | Boolean | `false` | Skip the AI judge; use pattern matching only |
| `--verbose` | Boolean | `false` | Enable verbose output |
| `--runs <n>` | Number | `1` | Number of runs per configuration |
| `--parallelism <n>` | Number | `2` | Concurrent test count |
| `--description <text>` | String | — | Label for the evaluation run |

### npm Scripts

| Script | Expands To |
|--------|-----------|
| `npm run eval` | `node scripts/eval-cli.js` |
| `npm run eval:quick` | `node scripts/eval-cli.js quick` |
| `npm run eval:test` | `node scripts/eval-cli.js test` |
| `npm start` | `STANDALONE=true node server.js` |
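Typical invocations combining these flags (flag values are examples only, run from the package root):

```shell
# Smoke-test one scenario against the default profile
node scripts/eval-cli.js quick --scenario struggling_learner --verbose

# Full batch: 3 runs per configuration, 4 tests in flight, labeled for later reports
node scripts/eval-cli.js run --profile recognition_paid --runs 3 --parallelism 4 \
  --description "recognition batch"

# Fast mode: skip the AI judge; required/forbidden pattern checks only
node scripts/eval-cli.js run --skip-rubric
```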

---

## 3. Provider Configuration

**Source:** `node_modules/@machinespirits/tutor-core/config/providers.yaml`

### Providers

| Provider | Base URL | API Key Env | Default Model |
|----------|----------|-------------|---------------|
| `anthropic` | `https://api.anthropic.com/v1/messages` | `ANTHROPIC_API_KEY` | `claude-sonnet-4-5` |
| `openai` | `https://api.openai.com/v1/chat/completions` | `OPENAI_API_KEY` | `gpt-5-mini` |
| `openrouter` | `https://openrouter.ai/api/v1/chat/completions` | `OPENROUTER_API_KEY` | `nvidia/nemotron-3-nano-30b-a3b:free` |
| `gemini` | `https://generativelanguage.googleapis.com/v1beta/models` | `GEMINI_API_KEY` | `gemini-3-flash-preview` |
| `local` | `http://localhost:1234/v1/chat/completions` | — (none) | `local-model` |

### Model Aliases (OpenRouter)

| Alias | Model ID | Tier |
|-------|----------|------|
| `nemotron` | `nvidia/nemotron-3-nano-30b-a3b:free` | Free |
| `glm47` | `z-ai/glm-4.7` | Free |
| `kimi-k2` | `moonshotai/kimi-k2-thinking` | Free |
| `deepseek` | `deepseek/deepseek-v3.2` | Free |
| `minimax` | `minimax/minimax-m2.1` | Free |
| `haiku` | `anthropic/claude-haiku-4.5` | Budget |
| `gpt-oss` | `openai/gpt-oss-120b` | Free |
| `sonnet` | `anthropic/claude-sonnet-4.5` | Mid |
| `gpt-mini` | `openai/gpt-5-mini` | Mid |
| `gemini-flash` | `google/gemini-3-flash-preview` | Mid |
| `opus` | `anthropic/claude-opus-4.5` | Premium |
| `gpt` | `openai/gpt-5.2` | Premium |
| `gemini-pro` | `google/gemini-3-pro-preview` | Premium |

### Model Pricing (per 1M tokens)

**Source:** `node_modules/@machinespirits/tutor-core/services/pricingConfig.js`

| Model Ref | Input ($) | Output ($) | Tier |
|-----------|-----------|------------|------|
| `openrouter.nemotron` | 0.00 | 0.00 | Free |
| `openrouter.gemini-flash` | 0.075 | 0.30 | Budget |
| `openrouter.gpt-mini` | 0.15 | 0.60 | Budget |
| `openrouter.deepseek` | 0.27 | 1.10 | Mid |
| `openrouter.haiku` | 0.80 | 4.00 | Budget |
| `openrouter.sonnet` | 3.00 | 15.00 | Mid |
| `openrouter.gpt` | 5.00 | 15.00 | Mid |
| `openrouter.gemini-pro` | 1.25 | 5.00 | Mid |
| `openrouter.opus` | 15.00 | 75.00 | Premium |
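Cost accounting follows directly from this table. A minimal sketch (the `PRICING` map transcribes a few rows from above; `computeCost` is an illustrative helper, not the actual `pricingConfig.js` API):

```javascript
// Illustrative cost computation from the per-1M-token prices above.
const PRICING = {
  'openrouter.nemotron':     { input: 0.00,  output: 0.00 },
  'openrouter.gemini-flash': { input: 0.075, output: 0.30 },
  'openrouter.gpt-mini':     { input: 0.15,  output: 0.60 },
  'openrouter.sonnet':       { input: 3.00,  output: 15.00 },
  'openrouter.opus':         { input: 15.00, output: 75.00 },
};

function computeCost(modelRef, inputTokens, outputTokens) {
  const p = PRICING[modelRef];
  if (!p) throw new Error(`No pricing for ${modelRef}`);
  // Prices are USD per 1M tokens, so scale by tokens / 1_000_000.
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}

// e.g. a sonnet call with 2,000 input and 800 output tokens:
console.log(computeCost('openrouter.sonnet', 2000, 800).toFixed(4)); // → "0.0180"
```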

---

## 4. Tutor Profiles

**Source:** `config/tutor-agents.yaml` (local override of tutor-core)

**Active profile:** `budget` (overridable via `TUTOR_PROFILE` env var)

### 2×2×2 Factorial Design

Three independent variables, each with two levels:
- **Factor A: Recognition** — standard prompts vs recognition-enhanced prompts + memory
- **Factor B: Architecture** — single agent (ego only) vs multi-agent (ego + superego)
- **Factor C: Model Tier** — free (nemotron) vs paid (sonnet)

### Profile Summary

| Profile | Recognition | Architecture | Model | Dialogue | Description |
|---------|------------|-------------|-------|----------|-------------|
| `budget` | — | Single | deepseek | No | Dev workhorse (not in factorial) |
| `single_baseline` | No | Single | nemotron | No | Cell 1: control |
| `single_baseline_paid` | No | Single | sonnet | No | Cell 2: model quality only |
| `baseline` | No | Ego+Superego | nemotron | Yes (2) | Cell 3: architecture only |
| `baseline_paid` | No | Ego+Superego | sonnet | Yes (2) | Cell 4: architecture + model |
| `single_recognition` | Yes | Single | nemotron | No | Cell 5: recognition only |
| `single_recognition_paid` | Yes | Single | sonnet | No | Cell 6: recognition + model |
| `recognition` | Yes | Ego+Superego | nemotron | Yes (2) | Cell 7: recognition + architecture |
| `recognition_paid` | Yes | Ego+Superego | sonnet | Yes (2) | Cell 8: all three factors |

### Hyperparameters per Agent Role

| Parameter | Typical Ego | Typical Superego | Judge |
|-----------|-------------|------------------|-------|
| `temperature` | 0.6–0.7 | 0.4–0.6 | 0.2 |
| `max_tokens` | 800–2500 | 800–1500 | 4000 |

### Superego Intervention Strategies

| Strategy | Style | Description |
|----------|-------|-------------|
| `direct_critique` | Assertive | Explicit issue identification with reasoning |
| `socratic_challenge` | Questioning | Question-based guidance |
| `reframing` | Collaborative | Alternative perspectives |
| `prompt_rewrite` | Meta | System prompt improvement suggestions |

### Dialogue Settings

| Parameter | Description | Typical Values |
|-----------|-------------|----------------|
| `enabled` | Whether ego-superego dialogue runs | `true` / `false` |
| `max_rounds` | Maximum ego-superego exchange rounds | 0–3 |
| `convergence_threshold` | Similarity threshold to stop early | 0.7–0.8 |

### Intervention Thresholds

| Parameter | Default | Description |
|-----------|---------|-------------|
| `low_intensity_skip_dialogue` | `true` | Skip dialogue for low-stakes interactions |
| `high_intensity_extra_rounds` | `true` | Add rounds for high-stakes situations |
| `struggle_signal_threshold` | `2` | Number of struggle signals before escalation |
| `rapid_nav_window_ms` | `30000` | Window for detecting rapid navigation |
| `retry_frustration_count` | `3` | Retries before frustration flag |
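Combined, a single factorial cell reads roughly like the following sketch; the field names are illustrative rather than the actual `config/tutor-agents.yaml` schema:

```yaml
# Hypothetical profile entry illustrating the knobs described above.
recognition_paid:
  recognition: true          # Factor A: recognition-enhanced prompts + memory
  architecture: multi_agent  # Factor B: ego + superego
  model: openrouter.sonnet   # Factor C: paid tier
  ego:
    temperature: 0.7
    max_tokens: 2500
  superego:
    temperature: 0.5
    max_tokens: 1500
    strategy: direct_critique
  dialogue:
    enabled: true
    max_rounds: 2
    convergence_threshold: 0.75
```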

---

## 5. Rubric Dimensions & Scoring

**Source:** `config/evaluation-rubric.yaml`

### Scoring Scale

All dimensions are scored 1–5:
- **1:** Completely fails
- **2:** Weak, significant issues
- **3:** Adequate, meets basic expectations
- **4:** Good, exceeds expectations
- **5:** Excellent, exemplary

### Base Dimensions (6)

| Dimension | Weight | Evaluates |
|-----------|--------|-----------|
| Relevance | 0.15 | Context-awareness and appropriateness to learner state |
| Specificity | 0.15 | Concrete references vs. vague suggestions |
| Pedagogical Soundness | 0.15 | ZPD targeting, scaffolding, learning science alignment |
| Personalization | 0.10 | Tailored to individual learner journey/history |
| Actionability | 0.10 | Clear, executable next steps |
| Tone | 0.10 | Supportive, encouraging, appropriate register |

### Recognition Dimensions (Phase 5, 4+)

| Dimension | Weight | Evaluates |
|-----------|--------|-----------|
| Mutual Recognition | 0.10 | Acknowledges learner as autonomous subject with own understanding |
| Dialectical Responsiveness | 0.10 | Genuine engagement with learner's position, productive tension |
| Memory Integration | 0.05 | References and builds on prior interactions |
| Transformative Potential | 0.10 | Creates conditions for conceptual restructuring |

**Overall Score Formula:**
```
avg     = Σ(dimension_score × weight) / totalWeight   → weighted average on the 1–5 scale
overall = (avg - 1) / 4 × 100                         → range 0–100
```

### Dual Scoring

The system reports three scores per evaluation:

| Score | Dimensions | Purpose |
|-------|-----------|---------|
| `overallScore` | All 10 dimensions | Combined quality metric |
| `baseScore` | 6 base dimensions (relevance, specificity, pedagogical, personalization, actionability, tone) | Pedagogical quality |
| `recognitionScore` | 4 recognition dimensions (mutual_recognition, dialectical_responsiveness, memory_integration, transformative_potential) | Recognition dynamics quality |

Each sub-score re-normalizes its weights to sum to 1.0 within its dimension group. Both sub-scores use the same `(avg - 1) / 4 × 100` formula to produce 0–100 values.

Scores are stored in `evaluation_results` as the `base_score` and `recognition_score` columns, and aggregated in `getRunStats()` as `avg_base_score` and `avg_recognition_score`.

### Evaluation Modes

| Mode | Trigger | Method | Speed |
|------|---------|--------|-------|
| Fast | `--skip-rubric` | Pattern matching on `required_elements` / `forbidden_elements` | ~1–5s |
| Full Rubric | Default | AI-judge semantic evaluation per dimension | ~5–30s |
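The scoring arithmetic above (weighted average, re-normalized per group, mapped onto 0–100) can be sketched as follows; function and variable names are illustrative, the real logic lives in `rubricEvaluator.js`:

```javascript
// scores and weights are maps of dimension -> value.
function scoreGroup(scores, weights) {
  let weighted = 0;
  let totalWeight = 0;
  for (const [dim, w] of Object.entries(weights)) {
    weighted += scores[dim] * w;
    totalWeight += w;                 // re-normalize within the dimension group
  }
  const avg = weighted / totalWeight; // weighted average on the 1–5 scale
  return ((avg - 1) / 4) * 100;       // map 1–5 onto 0–100
}

const BASE_WEIGHTS = {
  relevance: 0.15, specificity: 0.15, pedagogical: 0.15,
  personalization: 0.10, actionability: 0.10, tone: 0.10,
};

// A uniformly "adequate" response (all 3s) lands at the midpoint:
const adequate = Object.fromEntries(Object.keys(BASE_WEIGHTS).map(d => [d, 3]));
console.log(scoreGroup(adequate, BASE_WEIGHTS).toFixed(1)); // → "50.0"
```

Because `totalWeight` is recomputed per call, the same function yields `overallScore`, `baseScore`, or `recognitionScore` depending on which weight map it is given.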

---

## 6. Judge / Evaluator Configuration

**Source:** `config/evaluation-rubric.yaml` — unified source of truth for all judge models

### Suggestion Judge

| Parameter | Value | Purpose |
|-----------|-------|---------|
| Primary model | `openrouter.kimi-k2.5` | Rubric dimension scoring |
| Fallback model | `openrouter.nemotron` | Used if primary fails |
| Temperature | 0.2 | Low for scoring consistency |
| Max tokens | 4000 | Judge response budget |

### Interaction Judge

| Parameter | Value | Purpose |
|-----------|-------|---------|
| Primary model | `openrouter.kimi-k2.5` | Learner-tutor dialogue evaluation |
| Fallback model | `openrouter.nemotron` | Used if primary fails |
| Temperature | 0.2 | Low for scoring consistency |
| Max tokens | 6000 | Larger budget for multi-turn analysis |

Note: `interaction-eval-scenarios.yaml` no longer defines its own model — it references `evaluation-rubric.yaml → interaction_judge` as the single source of truth.

### Recommender (Prompt Improvement Analysis)

| Parameter | Value |
|-----------|-------|
| Model | `openrouter.kimi-k2.5` |
| Fallback | `openrouter.nemotron` |
| Temperature | 0.4 |
| Max tokens | 6000 |

### Fallback Chain in rubricEvaluator.js

1. Primary judge model (from rubric config)
2. Fallback judge model (from rubric config)
3. Hardcoded fallback: `deepseek/deepseek-chat-v3-0324`
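A sketch of that chain, assuming a generic `callJudge(model, prompt)` transport; the real implementation in `rubricEvaluator.js` may differ:

```javascript
// Try each judge model in order; only throw once the whole chain is exhausted.
async function evaluateWithFallback(prompt, callJudge, config) {
  const chain = [
    config.primaryModel,               // 1. primary judge (rubric config)
    config.fallbackModel,              // 2. fallback judge (rubric config)
    'deepseek/deepseek-chat-v3-0324',  // 3. hardcoded last resort
  ];
  let lastError;
  for (const model of chain) {
    try {
      return await callJudge(model, prompt);
    } catch (err) {
      lastError = err;                 // fall through to the next model
    }
  }
  throw lastError;                     // every model in the chain failed
}
```

With this shape, swapping judges is purely a config edit; the hardcoded third entry only matters when both configured models error out.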

---

## 7. Suggestion Scenarios

**Source:** `config/suggestion-scenarios.yaml` (previously in `evaluation-rubric.yaml`)

**Total scenarios:** 15 (trimmed from 49 in v2.0.0)

Each scenario has a `type: suggestion` discriminator, an explicit `id` field, and a `category` field.

### Categories

| Category | Count | Purpose |
|----------|-------|---------|
| `core` | 6 | Fundamental learner states |
| `mood` | 2 | Emotional affect testing |
| `benchmark` | 1 | Cross-model sycophancy resistance |
| `recognition` | 3 | Hegelian recognition dynamics |
| `multi_turn` | 3 | Multi-step dialogue arcs |

### Scenarios by Category

| Category | ID | Name | Min Score | Turns |
|----------|----|------|-----------|-------|
| `core` | `new_user_first_visit` | New User - First Visit | 70 | 1 |
| `core` | `returning_user_mid_course` | Returning User - Mid Course | 75 | 1 |
| `core` | `struggling_learner` | Struggling Learner | 75 | 1 |
| `core` | `high_performer` | High Performing Learner | 75 | 1 |
| `core` | `concept_confusion` | Concept Confusion | 75 | 1 |
| `core` | `activity_avoider` | Activity Avoider | 70 | 1 |
| `mood` | `mood_frustrated_explicit` | Mood: Frustrated (Explicit) | 80 | 1 |
| `mood` | `mood_excited_curious` | Mood: Excited (High Engagement) | 80 | 1 |
| `benchmark` | `adversarial_tester` | Adversarial Tester | 75 | 1 |
| `recognition` | `recognition_seeking_learner` | Recognition: Learner Seeking Understanding | 80 | 1 |
| `recognition` | `memory_continuity_single` | Recognition: Memory Continuity | 80 | 1 |
| `recognition` | `transformative_moment_setup` | Recognition: Creating Transformative Conditions | 80 | 1 |
| `multi_turn` | `mood_frustration_to_breakthrough` | Frustration to Breakthrough | 75 | 4 |
| `multi_turn` | `misconception_correction_flow` | Misconception Correction | 75 | 5 |
| `multi_turn` | `mutual_transformation_journey` | Mutual Transformation | 85 | 6 |

### Scenario Structure

Each scenario defines:
- `type` — Discriminator (`suggestion` for all scenarios in this file)
- `id` — Explicit scenario ID (matches the YAML key)
- `category` — One of: `core`, `mood`, `benchmark`, `recognition`, `multi_turn`
- `name` — Human-readable title
- `description` — Situation context
- `is_new_user` — Boolean
- `learner_context` — Detailed learner state (markdown)
- `expected_behavior` — What the tutor should do
- `required_elements` — Patterns that MUST appear (fast mode)
- `forbidden_elements` — Patterns that must NOT appear (fast mode)
- `min_acceptable_score` — Passing threshold (0–100)
- `turns` — Array of follow-up turns (multi-turn scenarios only)
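A hypothetical entry following this structure (values are invented for illustration; see `config/suggestion-scenarios.yaml` for the real definitions):

```yaml
struggling_learner:
  type: suggestion
  id: struggling_learner
  category: core
  name: Struggling Learner
  description: Learner has failed the same quiz twice and is losing confidence.
  is_new_user: false
  learner_context: |
    **Progress:** Module 2 of 6. Two failed attempts on the recursion quiz.
  expected_behavior: Acknowledge the struggle, then scaffold a smaller next step.
  required_elements:
    - "recursion"
  forbidden_elements:
    - "just try harder"
  min_acceptable_score: 75
```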

---

## 8. Interaction Evaluation Scenarios

**Source:** `config/interaction-eval-scenarios.yaml`

Each scenario has a `type: interaction` discriminator. The expected number of dialogue turns uses the `turn_count` field (not `turns`, which is reserved for scripted turn arrays in suggestion scenarios).

### Short-Term (Within-Session)

| ID | Turn Count | Focus | Key Weights |
|----|------------|-------|-------------|
| `recognition_request` | 4 | Learner seeks validation | mutual_recognition (0.3), dialectical_responsiveness (0.3) |
| `frustration_moment` | 5 | Learner frustration | emotional_attunement (0.3), scaffolding (0.3) |
| `misconception_surface` | 4 | Misconception revealed | mutual_recognition (0.3), dialectical_responsiveness (0.3) |
| `breakthrough_moment` | 3 | Learner insight | validation (0.3), extension (0.3) |
| `resistant_engagement` | 6 | Intelligent pushback | intellectual_respect (0.3), dialectical_responsiveness (0.3) |

### Long-Term (Multi-Session)

| ID | Sessions | Turns/Session | Focus |
|----|----------|---------------|-------|
| `novice_to_practitioner` | 5 | 4 | Learning arc |
| `stranger_to_recognized` | 4 | 5 | Relationship arc |
| `tutor_adaptation` | 4 | 4 | Tutor learning arc |

### Interaction Evaluation Dimensions (with weights)

Weights sum to 1.0 across all 10 dimensions. Each dimension has 5-point anchors (1–5).

**Learner (0.40 total):**

| Dimension | Weight | Description |
|-----------|--------|-------------|
| `authenticity` | 0.10 | Internal dynamics reflect persona realistically |
| `responsiveness` | 0.10 | Genuine reaction to tutor's engagement |
| `development` | 0.10 | Shows movement in understanding |
| `emotional_trajectory` | 0.05 | Emotional state changes appropriately |
| `knowledge_retention` | 0.05 | Concepts persist across sessions |

**Tutor (0.40 total):**

| Dimension | Weight | Description |
|-----------|--------|-------------|
| `strategy_adaptation` | 0.15 | Modifies approach based on effectiveness |
| `scaffolding_reduction` | 0.15 | Fades support as learner grows |
| `memory_utilization` | 0.10 | Effectively uses accumulated knowledge |

**Relationship (0.20 total):**

| Dimension | Weight | Description |
|-----------|--------|-------------|
| `trust_trajectory` | 0.10 | Trust develops appropriately over time |
| `mutual_recognition_depth` | 0.10 | Both parties show understanding of the other |

### Interaction Outcome Types

| Outcome | Meaning |
|---------|---------|
| `BREAKTHROUGH` | Genuine understanding demonstrated |
| `PRODUCTIVE_STRUGGLE` | Healthy confusion and effort |
| `MUTUAL_RECOGNITION` | Both parties recognize each other |
| `FRUSTRATION` | Learner becomes frustrated |
| `DISENGAGEMENT` | Learner disengages |
| `SCAFFOLDING_NEEDED` | Learner needs more support |
| `FADING_APPROPRIATE` | Ready for less support |
| `TRANSFORMATION` | Conceptual restructuring occurring |
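A quick way to keep the weight tables above honest is to assert that they really sum to 1.0; the grouping below transcribes those tables (the constant name is illustrative):

```javascript
// Interaction rubric weights, grouped as in the tables above.
const INTERACTION_WEIGHTS = {
  learner: { authenticity: 0.10, responsiveness: 0.10, development: 0.10,
             emotional_trajectory: 0.05, knowledge_retention: 0.05 },       // 0.40
  tutor: { strategy_adaptation: 0.15, scaffolding_reduction: 0.15,
           memory_utilization: 0.10 },                                      // 0.40
  relationship: { trust_trajectory: 0.10, mutual_recognition_depth: 0.10 }, // 0.20
};

const total = Object.values(INTERACTION_WEIGHTS)
  .flatMap(group => Object.values(group))
  .reduce((sum, w) => sum + w, 0);

console.log(total.toFixed(2)); // → "1.00"
```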

---

## 9. Learner Agent Configuration

**Source:** `config/learner-agents.yaml`

**Active architecture:** `unified` (overridable by tutor profile or `LEARNER_PROFILE` env var)

### Architectures

| Architecture | Agents | Deliberation | Rounds | Convergence |
|--------------|--------|-------------|--------|-------------|
| `unified` | 1 (single learner) | Disabled | 0 | — |
| `psychodynamic` | 4 (Desire, Intellect, Aspiration, Synthesizer) | Enabled | 2 | 0.7 |
| `dialectical` | 3 (Thesis, Antithesis, Synthesis) | Enabled | 2 | 0.7 |

### Psychodynamic Sub-Agent Hyperparameters

| Agent | Role | Temperature | Max Tokens |
|-------|------|-------------|------------|
| Desire | Id | 0.8 | 400 |
| Intellect | Ego | 0.5 | 400 |
| Aspiration | Superego | 0.6 | 400 |
| Synthesizer | — | 0.6 | 500 |

### Dialectical Sub-Agent Hyperparameters

| Agent | Temperature | Max Tokens |
|-------|-------------|------------|
| Thesis | 0.6 | 400 |
| Antithesis | 0.7 | 400 |
| Synthesis | 0.6 | 500 |

### Persona Modifiers

| Persona | Desire Wt | Intellect Wt | Aspiration Wt |
|---------|-----------|--------------|---------------|
| `confused_novice` | 0.4 | 0.3 | 0.3 |
| `eager_explorer` | 0.5 | 0.3 | 0.2 |
| `focused_achiever` | 0.2 | 0.4 | 0.4 |
| `struggling_anxious` | 0.5 | 0.2 | 0.3 |
| `adversarial_tester` | 0.3 | 0.4 | 0.3 |

### Ablation Study Profiles

Eight profiles cover a 2×2×2 factorial design:
- **Factor 1:** Single-agent (`unified`) vs multi-agent (`psychodynamic`)
- **Factor 2:** Baseline tutor vs recognition tutor
- **Factor 3:** Baseline prompts vs multi-agent tutor dialogue

---

## 10. Evaluation Runner Constants

**Source:** `services/evaluationRunner.js`

| Constant | Value | Purpose |
|----------|-------|---------|
| `DEFAULT_PARALLELISM` | 2 | Concurrent test execution |
| `REQUEST_DELAY_MS` | 500 | Delay between API calls (ms) |
| `MAX_RETRIES` | 3 | Retry attempts on rate limit |
| `INITIAL_RETRY_DELAY_MS` | 2000 | Exponential backoff start (ms) |

Backoff formula: `INITIAL_RETRY_DELAY_MS × 2^attempt` → 2s, 4s, 8s

Retries apply only to 429 / rate-limit errors; other failures are not retried.
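The retry behavior can be sketched as follows; `runWithRetry` is a hypothetical wrapper around an API call, not the actual evaluationRunner implementation, though the constants are the ones tabulated above:

```javascript
const MAX_RETRIES = 3;
const INITIAL_RETRY_DELAY_MS = 2000;

// Exponential backoff schedule: 2s, 4s, 8s.
const retryDelayMs = (attempt) => INITIAL_RETRY_DELAY_MS * 2 ** attempt;

async function runWithRetry(fn) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Only 429 / rate-limit errors are retried; everything else rethrows.
      if (err.status !== 429 || attempt >= MAX_RETRIES) throw err;
      await new Promise(res => setTimeout(res, retryDelayMs(attempt)));
    }
  }
}

console.log([0, 1, 2].map(retryDelayMs)); // → [ 2000, 4000, 8000 ]
```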
|
|
464
|
+
---
|
|
465
|
+
|
|
466
|
+
## 11. Database Schema
|
|
467
|
+
|
|
468
|
+
**Source:** `services/evaluationStore.js`
|
|
469
|
+
**Database:** `data/evaluations.db` (SQLite, WAL mode)
|
|
470
|
+
|
|
471
|
+
### evaluation_runs
|
|
472
|
+
|
|
473
|
+
| Column | Type | Description |
|
|
474
|
+
|--------|------|-------------|
|
|
475
|
+
| `id` | TEXT PK | Unique run ID |
|
|
476
|
+
| `created_at` | DATETIME | Run start time |
|
|
477
|
+
| `description` | TEXT | Run label |
|
|
478
|
+
| `total_scenarios` | INTEGER | Scenario count |
|
|
479
|
+
| `total_configurations` | INTEGER | Config count |
|
|
480
|
+
| `total_tests` | INTEGER | Total test count |
|
|
481
|
+
| `status` | TEXT | `running` / `completed` |
|
|
482
|
+
| `completed_at` | DATETIME | Run end time |
|
|
483
|
+
| `metadata` | TEXT (JSON) | Additional metadata |
|
|
484
|
+
|
|
485
|
+
### evaluation_results
|
|
486
|
+
|
|
487
|
+
| Column | Type | Description |
|
|
488
|
+
|--------|------|-------------|
|
|
489
|
+
| `id` | TEXT PK | Result ID |
|
|
490
|
+
| `run_id` | TEXT FK | Parent run |
|
|
491
|
+
| `scenario_id` | TEXT | Scenario tested |
|
|
492
|
+
| `scenario_type` | TEXT | `suggestion` or `interaction` (default: `suggestion`) |
|
|
493
|
+
| `provider` | TEXT | AI provider used |
|
|
494
|
+
| `model` | TEXT | Model ID |
|
|
495
|
+
| `profile_name` | TEXT | Tutor profile |
|
|
496
|
+
| `hyperparameters` | TEXT (JSON) | Temperature, max_tokens, etc. |
|
|
497
|
+
| `prompt_id` | TEXT | Prompt version |
|
|
498
|
+
| `latency_ms` | INTEGER | Response time |
|
|
499
|
+
| `input_tokens` | INTEGER | Tokens sent |
|
|
500
|
+
| `output_tokens` | INTEGER | Tokens received |
|
|
501
|
+
| `cost` | REAL | USD cost |
|
|
502
|
+
| `dialogue_rounds` | INTEGER | Ego-superego rounds |
|
|
503
|
+
| `api_calls` | INTEGER | Total API calls |
|
|
504
|
+
| `score_relevance` | REAL | 1–5 |
|
|
505
|
+
| `score_specificity` | REAL | 1–5 |
|
|
506
|
+
| `score_pedagogical` | REAL | 1–5 |
|
|
507
|
+
| `score_personalization` | REAL | 1–5 |
|
|
508
|
+
| `score_actionability` | REAL | 1–5 |
|
|
509
|
+
| `score_tone` | REAL | 1–5 |
|
|
510
|
+
| `overall_score` | REAL | 0–100 weighted |
|
|
511
|
+
| `base_score` | REAL | 0–100 base dimensions only |
|
|
512
|
+
| `recognition_score` | REAL | 0–100 recognition dimensions only |
|
|
513
|
+
| `passes_required` | INTEGER | Required elements check |
|
|
514
|
+
| `passes_forbidden` | INTEGER | Forbidden elements check |
|
|
515
|
+
| `required_missing` | TEXT (JSON) | Missing required patterns |
|
|
516
|
+
| `forbidden_found` | TEXT (JSON) | Found forbidden patterns |
|
|
517
|
+
| `judge_model` | TEXT | Judge model used |
|
|
518
|
+
| `evaluation_reasoning` | TEXT | Judge explanation |
|
|
519
|
+
| `success` | INTEGER | 1 = success, 0 = error |
|
|
520
|
+
| `error_message` | TEXT | Error details if failed |
|
|
521
|
+
|
|
522
|
+
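The weighted `overall_score` can be sketched as normalizing each 1–5 dimension score onto 0–100 and taking a weighted average. The weights below are illustrative placeholders, not the real rubric weights (those live in `config/evaluation-rubric.yaml`):

```javascript
// Illustrative weights only — the actual values come from the rubric config.
const weights = {
  relevance: 0.25,
  specificity: 0.15,
  pedagogical: 0.20,
  personalization: 0.15,
  actionability: 0.15,
  tone: 0.10, // weights sum to 1.0
};

function overallScore(scores) {
  // Map each 1–5 score to 0–100 (1 → 0, 5 → 100), then weight and sum.
  let total = 0;
  for (const [dim, weight] of Object.entries(weights)) {
    const normalized = ((scores[dim] - 1) / 4) * 100;
    total += weight * normalized;
  }
  return Math.round(total * 10) / 10;
}

console.log(overallScore({
  relevance: 5, specificity: 4, pedagogical: 4,
  personalization: 3, actionability: 4, tone: 5,
})); // → 80
```

With all dimensions at 5 this yields 100, and with all at 1 it yields 0, matching the 0–100 range of the stored columns.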
### interaction_evaluations

| Column | Type | Description |
|--------|------|-------------|
| `scenario_id` | TEXT | Interaction scenario |
| `eval_type` | TEXT | `short_term` / `long_term` |
| `learner_profile` | TEXT | Learner architecture used |
| `tutor_profile` | TEXT | Tutor profile used |
| `persona_id` | TEXT | Learner persona |
| `turn_count` | INTEGER | Number of turns |
| `turns` | TEXT (JSON) | Full turn-by-turn dialogue |
| `total_tokens` | INTEGER | Combined token usage |
| `learner_tokens` | INTEGER | Learner agent tokens |
| `tutor_tokens` | INTEGER | Tutor agent tokens |
| `latency_ms` | INTEGER | Total interaction time |
| `learner_memory_before` | TEXT (JSON) | Memory snapshot pre-interaction |
| `learner_memory_after` | TEXT (JSON) | Memory snapshot post-interaction |
| `tutor_memory_before` | TEXT (JSON) | Tutor memory pre-interaction |
| `tutor_memory_after` | TEXT (JSON) | Tutor memory post-interaction |
| `judge_overall_score` | REAL | Judge's overall score |
| `judge_evaluation` | TEXT (JSON) | Full judge evaluation |
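The `TEXT (JSON)` columns hold serialized structures that analysis scripts parse back after reading a row. A minimal sketch, with an illustrative row shape (field contents below are made up for demonstration):

```javascript
// Hypothetical interaction_evaluations row; JSON columns are stored as text.
const row = {
  scenario_id: 'impasse-recovery-01',
  turn_count: 2,
  turns: JSON.stringify([
    { speaker: 'learner', text: 'I am stuck on step 2.' },
    { speaker: 'tutor', text: 'What changed between steps 1 and 2?' },
  ]),
  learner_memory_after: JSON.stringify({
    preconscious: ['asked for help on step 2'],
  }),
};

// Parse the JSON columns back into structures for analysis.
const turns = JSON.parse(row.turns);
const memory = JSON.parse(row.learner_memory_after);

console.log(turns.length === row.turn_count); // true
console.log(turns[1].speaker);                // tutor
console.log(memory.preconscious[0]);          // asked for help on step 2
```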
---

## 13. Memory Systems

### Writing Pad (Freudian Model)

| Layer | Persistence | Content |
|-------|-------------|---------|
| Conscious | Ephemeral | Current interaction context |
| Preconscious | Session | Recent patterns and observations |
| Unconscious | Permanent | Traces of significant moments |

**Databases:**
- `data/learner-writing-pad.db` — Learner memory persistence
- `data/tutor-writing-pad.db` — Tutor memory persistence
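The layer semantics in the table above can be sketched as a three-field structure with a consolidation step. Field names and the consolidation logic here are assumptions for illustration; the real schema lives in the writing-pad databases:

```javascript
// Assumed writing-pad shape — illustrative only, not the stored schema.
function emptyPad() {
  return {
    conscious: [],    // ephemeral: current interaction context
    preconscious: [], // session-scoped: recent patterns and observations
    unconscious: [],  // permanent: traces of significant moments
  };
}

// One possible consolidation step: drop ephemeral context, keep a session
// summary, and persist only traces marked significant.
function consolidate(pad, summary, significantTrace) {
  return {
    conscious: [], // cleared after each interaction
    preconscious: [...pad.preconscious, summary],
    unconscious: significantTrace
      ? [...pad.unconscious, significantTrace]
      : pad.unconscious,
  };
}

let pad = emptyPad();
pad.conscious.push('current problem: factoring quadratics');
pad = consolidate(pad, 'struggled, then recovered after a hint',
  'breakthrough on factoring');
console.log(pad.conscious.length, pad.preconscious.length,
  pad.unconscious.length); // 0 1 1
```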
---

## 14. API Endpoints (Standalone Server)

**Port:** 8081 (default, configurable via `PORT` env var)

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/eval/scenarios` | List evaluation scenarios |
| GET | `/api/eval/profiles` | List tutor profiles |
| GET | `/api/eval/runs` | List past evaluation runs |
| GET | `/api/eval/runs/:id` | Get specific run details |
| POST | `/api/eval/quick` | Run quick evaluation test |
| GET | `/health` | Health check |
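A standalone sketch of the route shapes above, using a plain lookup with one parameterized path. The handler names are assumptions; the real routing lives in `routes/evalRoutes.js`:

```javascript
// Illustrative route table mirroring the endpoints above.
const routes = {
  'GET /api/eval/scenarios': 'listScenarios',
  'GET /api/eval/profiles': 'listProfiles',
  'GET /api/eval/runs': 'listRuns',
  'POST /api/eval/quick': 'runQuickEval',
  'GET /health': 'healthCheck',
};

// Resolve a request to a handler, handling the one parameterized path.
function match(method, path) {
  const exact = routes[`${method} ${path}`];
  if (exact) return { handler: exact, params: {} };
  const m = method === 'GET' && path.match(/^\/api\/eval\/runs\/([^/]+)$/);
  if (m) return { handler: 'getRun', params: { id: m[1] } };
  return null; // no route → 404
}

console.log(match('GET', '/health').handler);              // healthCheck
console.log(match('GET', '/api/eval/runs/run-42').params); // { id: 'run-42' }
```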
---

## 15. Config File Locations

| File | Location | Purpose |
|------|----------|---------|
| `evaluation-rubric.yaml` | `config/` | Rubric dimensions, judge config (scenarios moved out) |
| `suggestion-scenarios.yaml` | `config/` | Suggestion evaluation scenarios (`type: suggestion`) |
| `interaction-eval-scenarios.yaml` | `config/` | Learner-tutor interaction scenarios (`type: interaction`) |
| `learner-agents.yaml` | `config/` | Learner architectures, personas |
| `providers.yaml` | `node_modules/@machinespirits/tutor-core/config/` | Provider definitions and model aliases |
| `tutor-agents.yaml` | `node_modules/@machinespirits/tutor-core/config/` | Tutor profiles, strategies, thresholds |
| `evaluations.db` | `data/` | SQLite results database |

**Note:** `providers.yaml` and `tutor-agents.yaml` have local overrides in `config/` that take precedence over the `@machinespirits/tutor-core` package versions. `suggestion-scenarios.yaml` is loaded by `evalConfigLoader.loadSuggestionScenarios()` with mtime-based caching, with a backward-compatible fallback to `evaluation-rubric.yaml` if the new file is missing.