@machinespirits/eval 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/components/MobileEvalDashboard.tsx +267 -0
- package/components/comparison/DeltaAnalysisTable.tsx +137 -0
- package/components/comparison/ProfileComparisonCard.tsx +176 -0
- package/components/comparison/RecognitionABMode.tsx +385 -0
- package/components/comparison/RecognitionMetricsPanel.tsx +135 -0
- package/components/comparison/WinnerIndicator.tsx +64 -0
- package/components/comparison/index.ts +5 -0
- package/components/mobile/BottomSheet.tsx +233 -0
- package/components/mobile/DimensionBreakdown.tsx +210 -0
- package/components/mobile/DocsView.tsx +363 -0
- package/components/mobile/LogsView.tsx +481 -0
- package/components/mobile/PsychodynamicQuadrant.tsx +261 -0
- package/components/mobile/QuickTestView.tsx +1098 -0
- package/components/mobile/RecognitionTypeChart.tsx +124 -0
- package/components/mobile/RecognitionView.tsx +809 -0
- package/components/mobile/RunDetailView.tsx +261 -0
- package/components/mobile/RunHistoryView.tsx +367 -0
- package/components/mobile/ScoreRadial.tsx +211 -0
- package/components/mobile/StreamingLogPanel.tsx +230 -0
- package/components/mobile/SynthesisStrategyChart.tsx +140 -0
- package/config/interaction-eval-scenarios.yaml +832 -0
- package/config/learner-agents.yaml +248 -0
- package/docs/research/ABLATION-DIALOGUE-ROUNDS.md +52 -0
- package/docs/research/ABLATION-MODEL-SELECTION.md +53 -0
- package/docs/research/ADVANCED-EVAL-ANALYSIS.md +60 -0
- package/docs/research/ANOVA-RESULTS-2026-01-14.md +257 -0
- package/docs/research/COMPREHENSIVE-EVALUATION-PLAN.md +586 -0
- package/docs/research/COST-ANALYSIS.md +56 -0
- package/docs/research/CRITICAL-REVIEW-RECOGNITION-TUTORING.md +340 -0
- package/docs/research/DYNAMIC-VS-SCRIPTED-ANALYSIS.md +291 -0
- package/docs/research/EVAL-SYSTEM-ANALYSIS.md +306 -0
- package/docs/research/FACTORIAL-RESULTS-2026-01-14.md +301 -0
- package/docs/research/IMPLEMENTATION-PLAN-CRITIQUE-RESPONSE.md +1988 -0
- package/docs/research/LONGITUDINAL-DYADIC-EVALUATION.md +282 -0
- package/docs/research/MULTI-JUDGE-VALIDATION-2026-01-14.md +147 -0
- package/docs/research/PAPER-EXTENSION-DYADIC.md +204 -0
- package/docs/research/PAPER-UNIFIED.md +659 -0
- package/docs/research/PAPER-UNIFIED.pdf +0 -0
- package/docs/research/PROMPT-IMPROVEMENTS-2026-01-14.md +356 -0
- package/docs/research/SESSION-NOTES-2026-01-11-RECOGNITION-EVAL.md +419 -0
- package/docs/research/apa.csl +2133 -0
- package/docs/research/archive/PAPER-DRAFT-RECOGNITION-TUTORING.md +1637 -0
- package/docs/research/archive/paper-multiagent-tutor.tex +978 -0
- package/docs/research/paper-draft/full-paper.md +136 -0
- package/docs/research/paper-draft/images/pasted-image-2026-01-24T03-47-47-846Z-d76a7ae2.png +0 -0
- package/docs/research/paper-draft/references.bib +515 -0
- package/docs/research/transcript-baseline.md +139 -0
- package/docs/research/transcript-recognition-multiagent.md +187 -0
- package/hooks/useEvalData.ts +625 -0
- package/index.js +27 -0
- package/package.json +73 -0
- package/routes/evalRoutes.js +3002 -0
- package/scripts/advanced-eval-analysis.js +351 -0
- package/scripts/analyze-eval-costs.js +378 -0
- package/scripts/analyze-eval-results.js +513 -0
- package/scripts/analyze-interaction-evals.js +368 -0
- package/server-init.js +45 -0
- package/server.js +162 -0
- package/services/benchmarkService.js +1892 -0
- package/services/evaluationRunner.js +739 -0
- package/services/evaluationStore.js +1121 -0
- package/services/learnerConfigLoader.js +385 -0
- package/services/learnerTutorInteractionEngine.js +857 -0
- package/services/memory/learnerMemoryService.js +1227 -0
- package/services/memory/learnerWritingPad.js +577 -0
- package/services/memory/tutorWritingPad.js +674 -0
- package/services/promptRecommendationService.js +493 -0
- package/services/rubricEvaluator.js +826 -0
package/docs/research/COMPREHENSIVE-EVALUATION-PLAN.md
@@ -0,0 +1,586 @@
# Comprehensive Evaluation Plan for Recognition Tutoring Paper v2

**Date:** 2026-01-14
**Purpose:** Build robust evidentiary foundation for next paper version
**Target:** Publication-quality statistical rigor

---

## Executive Summary

The current paper reports promising results (41% improvement for recognition-oriented tutoring), but the evidence base has significant gaps:

| Current State | Gap | Target |
|---------------|-----|--------|
| Multi-turn: n=3 per scenario | Low statistical power | n=30+ per condition |
| Single-turn: n=1 (exploratory) | Not statistically valid | n=30+ per condition |
| Dyadic: n=1-2 per architecture | Insufficient sampling | n=20+ per architecture |
| LLM judge only | No validation of judge reliability | Multi-judge + human validation |
| 2 profiles compared | Limited generalization | 4-5 profiles systematically |
| No effect size reporting | Can't assess practical significance | Cohen's d for all comparisons |

This plan outlines a comprehensive evaluation suite to address these gaps.

---

## Part 1: Core Recognition Claims (Replication & Extension)

### 1.1 Multi-Turn Scenario Battery

**Goal:** Replicate and extend the 41% improvement claim with publication-quality statistics.

#### Scenarios (3 existing + 2 new):
| Scenario | Turns | Tests For |
|----------|-------|-----------|
| `recognition_repair` | 4 | Recovery from misrecognition |
| `mutual_transformation_journey` | 5 | Both parties evolving |
| `productive_struggle_arc` | 5 | Honoring confusion |
| `sustained_dialogue` (NEW) | 8 | Extended recognition maintenance |
| `breakdown_recovery` (NEW) | 6 | Multiple repair cycles |

#### Experimental Design:
```
Profiles: baseline × recognition × recognition_plus × quality
Scenarios: 5 multi-turn scenarios
Runs: 30 per cell
Total runs: 4 × 5 × 30 = 600 runs

Statistical targets:
- α = 0.05, power = 0.80
- Minimum detectable effect: Cohen's d = 0.5
```

#### Commands:
```bash
# Primary comparison
node scripts/eval-tutor.js compare baseline recognition \
  --scenarios mutual_transformation_journey,recognition_repair,productive_struggle_arc,sustained_dialogue,breakdown_recovery \
  --runs 30 --report

# Extended profile comparison
node scripts/eval-tutor.js matrix baseline recognition recognition_plus quality \
  --scenarios recognition_multi_turn \
  --runs 30 --report
```

### 1.2 Single-Turn Scenario Battery

**Goal:** Establish statistically valid baselines for single-turn recognition behaviors.

#### Scenarios (6 existing):
| Scenario | Tests For |
|----------|-----------|
| `recognition_seeking_learner` | Learner offers interpretation |
| `returning_with_breakthrough` | Acknowledgment of insight |
| `resistant_learner` | Handling pushback |
| `asymmetric_recognition_request` | Authority validation seeking |
| `memory_continuity_single` | History reference |
| `transformative_moment_setup` | Misconception handling |

#### Experimental Design:
```
Profiles: baseline × recognition × recognition_plus
Scenarios: 6 single-turn scenarios
Runs: 30 per cell
Total runs: 3 × 6 × 30 = 540 runs
```

---

## Part 2: Dimension-Level Analysis

### 2.1 Recognition Dimension Validation

**Goal:** Validate that improvement concentrates in recognition-predicted dimensions.

#### Analysis Plan:
For each dimension (10 total), calculate the following (a computation sketch appears after this list):
- Mean difference between profiles
- Effect size (Cohen's d)
- 95% confidence intervals
- Dimension × profile interaction effects
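
The list above calls for Cohen's d with 95% confidence intervals per dimension. Below is a minimal JavaScript sketch of that computation, assuming per-run dimension scores are available as plain arrays; the function names are illustrative and not part of the existing analysis scripts.

```js
// Cohen's d with a large-sample 95% CI for one dimension,
// comparing two profiles (e.g., recognition vs. baseline).
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function sampleVariance(xs) {
  const m = mean(xs);
  return xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / (xs.length - 1);
}

function cohensD(a, b) {
  const pooledSd = Math.sqrt(
    ((a.length - 1) * sampleVariance(a) + (b.length - 1) * sampleVariance(b)) /
      (a.length + b.length - 2)
  );
  const d = (mean(a) - mean(b)) / pooledSd;
  // Approximate standard error of d, then a 95% CI.
  const se = Math.sqrt(
    (a.length + b.length) / (a.length * b.length) +
      (d * d) / (2 * (a.length + b.length))
  );
  return { d, ci95: [d - 1.96 * se, d + 1.96 * se] };
}

// Example: cohensD(recognitionScores, baselineScores) for one dimension (n=30 each).
```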

#### Expected Pattern (from current data):
| Dimension | Expected Δ | Theoretical Prediction |
|-----------|------------|------------------------|
| Personalization | ≥ +0.8 | Strong (recognition core) |
| Pedagogical | ≥ +0.8 | Strong (relational pedagogy) |
| Tone | ≥ +0.6 | Moderate (dialogical warmth) |
| Mutual Recognition | ≥ +0.5 | Direct target |
| Dialectical Responsiveness | ≥ +0.5 | Direct target |
| Transformative Potential | ≥ +0.4 | Moderate (process focus) |
| Memory Integration | ≥ +0.3 | Moderate (enabled by memory) |
| Relevance | ≥ +0.3 | Indirect benefit |
| Specificity | ≥ +0.2 | Minimal impact expected |
| Actionability | ≥ +0.1 | Minimal impact expected |

### 2.2 Dimension Correlation Matrix

Compute inter-dimension correlations (sketched below) to identify:
- Dimension clusters (recognition vs. traditional)
- Potential redundancies
- Factor structure for dimension reduction
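
A minimal sketch of the correlation computation, assuming each evaluated run is an object mapping dimension names to scores; the helper names are illustrative:

```js
// Pearson correlation between two score arrays.
function pearson(x, y) {
  const n = x.length;
  const mx = x.reduce((a, b) => a + b, 0) / n;
  const my = y.reduce((a, b) => a + b, 0) / n;
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < n; i++) {
    num += (x[i] - mx) * (y[i] - my);
    dx += (x[i] - mx) ** 2;
    dy += (y[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}

// 10×10 inter-dimension correlation matrix from per-run dimension scores.
function correlationMatrix(runs, dimensions) {
  const cols = dimensions.map((d) => runs.map((r) => r[d]));
  return dimensions.map((_, i) => dimensions.map((_, j) => pearson(cols[i], cols[j])));
}
```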

---

## Part 3: Component Isolation Experiments

### 3.1 Memory vs. Prompts Ablation

**Goal:** Disentangle memory effects from prompt effects.

#### Experimental Design:
```
2×2 factorial design:
                 Standard Prompts       Recognition Prompts
Memory OFF       baseline               recognition_prompts_only
Memory ON        memory_only (NEW)      recognition (full)

Runs per cell: 30
Total runs: 4 × 30 = 120 per scenario
```

#### Analysis:
- Main effect of prompts
- Main effect of memory
- Interaction effect (prompts × memory)
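
Significance testing for this design belongs in the analysis scripts (a two-way ANOVA); the decomposition itself is simple enough to sketch from cell means. Cell names follow the table above; the function is illustrative, not an existing script:

```js
// Main effects and interaction for the 2×2 memory × prompts design,
// computed from per-cell mean scores.
const cellMean = (scores) => scores.reduce((a, b) => a + b, 0) / scores.length;

// cells = { baseline, recognition_prompts_only, memory_only, recognition },
// each an array of per-run overall scores.
function factorialEffects(cells) {
  const m = Object.fromEntries(
    Object.entries(cells).map(([name, scores]) => [name, cellMean(scores)])
  );
  return {
    // Prompt effect, averaged over the two memory levels.
    promptsMain:
      ((m.recognition_prompts_only - m.baseline) + (m.recognition - m.memory_only)) / 2,
    // Memory effect, averaged over the two prompt levels.
    memoryMain:
      ((m.memory_only - m.baseline) + (m.recognition - m.recognition_prompts_only)) / 2,
    // Interaction: does the prompt effect change when memory is on?
    interaction:
      (m.recognition - m.memory_only) - (m.recognition_prompts_only - m.baseline),
  };
}
```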

### 3.2 Superego Ablation

**Goal:** Quantify Superego contribution to recognition quality.

#### Experimental Design:
```
Profiles:
- ego_only: Ego without Superego evaluation
- single_round: Ego + Superego, 1 round max
- multi_round: Ego + Superego, 3 rounds max (current)
- extended: Ego + Superego, 5 rounds max

Scenarios: All multi-turn recognition scenarios
Runs: 20 per cell
```

#### Key Metrics:
- Quality improvement per additional round
- Convergence rate (rounds to Superego approval)
- Marginal return on additional rounds
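
One way to report the last metric is the per-round marginal gain between adjacent max-round settings. A small sketch, assuming mean scores keyed by the max-round setting are available from the eval output (the function name is illustrative):

```js
// Marginal quality gain per additional Superego round between adjacent
// max-round settings (e.g., 1, 2, 3, 5).
function marginalReturns(meanScoreByRounds) {
  const settings = Object.keys(meanScoreByRounds)
    .map(Number)
    .sort((a, b) => a - b);
  const steps = [];
  for (let i = 1; i < settings.length; i++) {
    const [lo, hi] = [settings[i - 1], settings[i]];
    steps.push({
      from: lo,
      to: hi,
      gainPerRound: (meanScoreByRounds[hi] - meanScoreByRounds[lo]) / (hi - lo),
    });
  }
  return steps;
}

// e.g. marginalReturns({ 1: meanAt1, 2: meanAt2, 3: meanAt3, 5: meanAt5 })
```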

### 3.3 Model Capability Ablation

**Goal:** Test whether recognition benefits scale with model capability.

#### Experimental Design:
```
Ego Models (by capability tier):
- Tier 1 (fast): claude-haiku-4-5, gpt-5-mini
- Tier 2 (balanced): claude-sonnet-4-5, gpt-5.2
- Tier 3 (powerful): claude-opus-4-5

Cross with: baseline vs. recognition prompts
Runs: 20 per cell
```

#### Key Question:
Do recognition prompts provide larger benefits for weaker models (compensatory) or stronger models (synergistic)?

---

## Part 4: Dyadic Evaluation Extension

### 4.1 Learner Architecture Comparison

**Goal:** Systematically compare simulated learner architectures.

#### Architectures (5):
| Architecture | Internal Structure | Rationale |
|--------------|-------------------|-----------|
| `unified` | Single agent | Baseline |
| `ego_superego` | Ego + Superego | Standard self-critique |
| `dialectical` | Thesis + Antithesis + Synthesis | Hegelian structure |
| `psychodynamic` | Id + Ego + Superego | Freudian structure |
| `cognitive` | Memory + Reasoning + Meta | Process-based |

#### Experimental Design:
```
Tutor profiles: baseline × recognition × quality
Learner archs: 5 architectures
Scenarios: 3 dyadic scenarios
Runs: 20 per cell
Total runs: 3 × 5 × 3 × 20 = 900 runs
```

### 4.2 Cross-Tabulation Analysis

**Goal:** Identify optimal tutor-learner pairings.

#### Analysis:
- Profile × architecture interaction effects
- Best-performing pairings
- Pairing-specific failure modes

### 4.3 Bilateral Recognition Measurement

**Goal:** Measure recognition quality from both sides.

#### Metrics:
**Tutor-side (existing):**
- Mutual recognition
- Dialectical responsiveness
- Transformative potential

**Learner-side (new):**
- Authenticity (does internal state match persona?)
- Responsiveness (does learner process tutor input?)
- Development (does understanding change?)

#### Key Question:
When the tutor achieves high recognition scores, does the simulated learner show corresponding internal development?

---

## Part 5: Judge Reliability & Validation

### 5.1 Inter-Judge Agreement

**Goal:** Validate LLM judge consistency.

#### Experimental Design:
```
Judge models:
- gemini-3-flash-preview (current)
- claude-sonnet-4-5
- gpt-5.2

Sample: 100 responses (stratified by profile/scenario)
Metrics:
- Cohen's kappa per dimension
- Intraclass correlation coefficient (ICC)
- Systematic bias detection
```
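
For two judges scoring the same responses on a 1-5 scale, unweighted Cohen's kappa is straightforward to compute; ICC and weighted variants would extend this. A minimal sketch (illustrative helper, not an existing script):

```js
// Unweighted Cohen's kappa for two judges rating the same responses 1-5.
function cohensKappa(ratingsA, ratingsB, categories = [1, 2, 3, 4, 5]) {
  const n = ratingsA.length;

  // Observed agreement: fraction of responses where the judges match exactly.
  let observed = 0;
  for (let i = 0; i < n; i++) if (ratingsA[i] === ratingsB[i]) observed++;
  observed /= n;

  // Chance agreement from each judge's marginal rating distribution.
  let expected = 0;
  for (const c of categories) {
    const pA = ratingsA.filter((r) => r === c).length / n;
    const pB = ratingsB.filter((r) => r === c).length / n;
    expected += pA * pB;
  }

  return (observed - expected) / (1 - expected);
}
```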

### 5.2 Judge Calibration

**Goal:** Detect and correct judge biases.

#### Potential Biases:
| Bias Type | Detection Method |
|-----------|------------------|
| Vocabulary bias | Recognition-related words → higher scores? |
| Length bias | Longer responses → higher scores? |
| Profile leakage | Judge infers profile from response style? |
| Acquiescence | Judge gives high scores regardless of quality? |

#### Mitigation:
- Blind judging (remove profile markers)
- Response length normalization
- Adversarial examples (deliberately bad recognition language)

### 5.3 Human Validation Sample

**Goal:** Ground-truth validation with human raters.

#### Design:
```
Sample: 50 responses (stratified by profile/dimension)
Raters: 3 human raters (pedagogy/philosophy background)
Task: Rate each dimension 1-5 with justification

Analysis:
- Human-LLM correlation per dimension
- Systematic disagreement patterns
- Dimension-specific reliability
```

---

## Part 6: Robustness & Generalization

### 6.1 Scenario Sensitivity Analysis

**Goal:** Test whether findings hold across scenario variations.

#### Scenario Variations:
For each core scenario, create variants along three axes:
- **Content domain**: Philosophy → History → Science
- **Learner background**: Novice → Intermediate → Advanced
- **Emotional tone**: Neutral → Frustrated → Enthusiastic

#### Analysis:
- Main effect stability across variants
- Scenario × variant interactions

### 6.2 Adversarial Robustness

**Goal:** Test recognition behavior under adversarial conditions.

#### Adversarial Scenarios:
| Scenario | Challenge |
|----------|-----------|
| `prompt_injection` | Learner attempts to extract/modify tutor behavior |
| `recognition_demanding` | Learner demands validation inappropriately |
| `contradictory_signals` | Learner sends mixed signals |
| `manipulation_attempt` | Learner tries to manipulate tutor |

#### Key Question:
Does recognition-oriented design create vulnerabilities to manipulation?

### 6.3 Temporal Stability

**Goal:** Test consistency over multiple evaluation sessions.

#### Design:
```
Replication schedule:
- Initial evaluation
- +1 week replication
- +1 month replication

Sample: 100 responses per timepoint
Metric: Test-retest reliability
```

---

## Part 7: Practical Significance

### 7.1 Effect Size Benchmarking

**Goal:** Contextualize effect sizes against relevant baselines.

#### Comparisons:
- Recognition improvement vs. typical EdTech interventions
- Recognition improvement vs. human tutor benchmarks
- Recognition improvement vs. model capability upgrades

### 7.2 Cost-Benefit Analysis

**Goal:** Quantify practical tradeoffs.

#### Metrics:
| Profile | Tokens/Response | Cost/Response | Quality Score | Quality/Cost |
|---------|-----------------|---------------|---------------|--------------|
| baseline | ~500 | $0.001 | ~48 | 48,000 |
| recognition | ~1500 | $0.003 | ~67 | 22,333 |
| quality | ~3000 | $0.01 | ~80 | 8,000 |

#### Analysis:
- Marginal cost of recognition improvement
- Optimal profile for different use cases
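
The marginal-cost analysis falls directly out of the table above. A small sketch using those approximate per-response figures (the numbers are the table's estimates, not measurements):

```js
// Quality-per-dollar and marginal cost per quality point, from the
// approximate per-response figures in the Cost-Benefit table.
const profiles = [
  { name: 'baseline', cost: 0.001, quality: 48 },
  { name: 'recognition', cost: 0.003, quality: 67 },
  { name: 'quality', cost: 0.01, quality: 80 },
];

profiles.forEach((p, i) => {
  const parts = [`${p.name}: quality/cost = ${(p.quality / p.cost).toFixed(0)}`];
  if (i > 0) {
    const prev = profiles[i - 1];
    const marginal = (p.cost - prev.cost) / (p.quality - prev.quality);
    parts.push(`marginal cost per quality point vs. ${prev.name} ≈ $${marginal.toFixed(5)}`);
  }
  console.log(parts.join(', '));
});
```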

### 7.3 Learner Outcome Proxies

**Goal:** Connect recognition quality to learning indicators.

#### Proxy Metrics (from simulated learners):
- Time to breakthrough (in turns)
- Persistence after confusion (turn count)
- Depth of engagement (question sophistication)
- Recovery after failure (retry success rate)

---

## Part 8: Execution Plan

### Phase 1: Core Replication (Week 1-2)
**Priority: Critical**

```bash
# Day 1-2: Multi-turn battery (recognition vs. baseline)
node scripts/eval-tutor.js compare baseline recognition \
  --scenarios mutual_transformation_journey,recognition_repair,productive_struggle_arc \
  --runs 30 --report

# Day 3-4: Extended profile comparison
node scripts/eval-tutor.js matrix baseline recognition recognition_plus quality \
  --scenarios recognition_multi_turn \
  --runs 30 --report

# Day 5-7: Single-turn battery
node scripts/eval-tutor.js compare baseline recognition \
  --scenarios recognition_seeking_learner,returning_with_breakthrough,resistant_learner,asymmetric_recognition_request,memory_continuity_single,transformative_moment_setup \
  --runs 30 --report
```

**Estimated runs:** 1,140
**Estimated time:** ~40 hours (at 2 min/run)
**Estimated cost:** ~$50 (at $0.04/run)

### Phase 2: Ablation Studies (Week 3)
**Priority: High**

```bash
# Memory vs. Prompts (2×2 factorial)
node scripts/eval-tutor.js compare baseline recognition_prompts_only memory_only recognition \
  --scenarios productive_struggle_arc,mutual_transformation_journey \
  --runs 20 --report

# Superego rounds
node scripts/eval-tutor.js ablation recognition \
  --max-rounds 1,2,3,5 \
  --scenarios recognition_repair \
  --runs 20 --report
```

**Estimated runs:** 480
**Estimated time:** ~16 hours

### Phase 3: Dyadic Extension (Week 4)
**Priority: High**

```bash
# Learner architecture comparison
node scripts/eval-tutor.js battery \
  --profiles baseline,recognition,quality \
  --learner-architectures unified,ego_superego,dialectical,psychodynamic,cognitive \
  --runs 20 --report
```

**Estimated runs:** 900
**Estimated time:** ~30 hours

### Phase 4: Validation (Week 5)
**Priority: Medium**

```bash
# Inter-judge agreement
node scripts/eval-tutor.js judge-calibration \
  --judges gemini-3-flash-preview,claude-sonnet-4-5,gpt-5.2 \
  --sample 100 --report

# Adversarial robustness
node scripts/eval-tutor.js adversarial recognition \
  --scenarios prompt_injection,recognition_demanding,contradictory_signals \
  --runs 20 --report
```

**Estimated runs:** 360
**Estimated time:** ~12 hours

### Phase 5: Analysis & Reporting (Week 6)
**Priority: Medium**

- Compute all effect sizes (Cohen's d) with confidence intervals
- Generate dimension correlation matrix
- Create publication figures
- Draft results section update

---

## Part 9: Success Criteria

### Minimum Viable Evidence
For the paper to make its claims with confidence:

| Claim | Required Evidence |
|-------|-------------------|
| Recognition improves tutoring | p < 0.01, Cohen's d > 0.5, multi-turn scenarios, n=30+ per cell |
| Improvement concentrates in predicted dimensions | Dimension × profile interaction, effect size ordering matches theory |
| Multi-agent architecture contributes | Significant Superego ablation effect |
| Memory matters | Significant memory × prompts interaction |
| Dyadic evaluation adds value | Learner architecture moderates outcomes |

### Statistical Standards
- Report exact p-values (not just significance)
- Report effect sizes with 95% CIs for all comparisons
- Report sample sizes for all analyses
- Use Bonferroni correction for multiple comparisons
- Pre-register analysis plan (this document)
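
The Bonferroni rule is mechanical enough to sketch directly; with, say, 10 dimension-level comparisons the per-test threshold becomes 0.005. Illustrative helper, not an existing script:

```js
// Bonferroni correction for a family of m comparisons: compare each p-value
// to alpha/m, and report the equivalent adjusted p-values.
function bonferroni(pValues, alpha = 0.05) {
  const m = pValues.length;
  return pValues.map((p) => ({
    p,
    adjustedP: Math.min(1, p * m),
    significant: p < alpha / m,
  }));
}
```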

---

## Part 10: New Scenarios to Create

### Multi-Turn Recognition Scenarios

#### `sustained_dialogue` (8 turns)
Tests maintenance of recognition quality over extended interaction.

```yaml
sustained_dialogue:
  name: "Sustained Recognition Dialogue"
  description: "Extended dialogue testing recognition maintenance"
  turns: 8
  learner_context: |
    ### Profile
    Engaged learner exploring dialectical concepts

    ### Session
    Extended philosophical discussion about Hegel's Phenomenology

  turn_sequence:
    - learner: "I've been thinking about the master-slave dialectic..."
    - expected: Engage with learner's framing
    - learner: "But what if both parties are masters?"
    - expected: Explore the paradox together
    # ... continues for 8 turns
```

#### `breakdown_recovery` (6 turns)
Tests multiple repair cycles within a single interaction.

```yaml
breakdown_recovery:
  name: "Recognition Breakdown and Recovery"
  description: "Multiple recognition failures requiring repair"
  turns: 6

  turn_sequence:
    - learner: "I have my own interpretation of dialectics"
    - expected: Engage genuinely
    - failure_injection: Generic response that ignores learner
    - learner: "You're not listening to what I said"
    - expected: Explicit repair + genuine engagement
    - failure_injection: Another generic response
    - learner: "This is frustrating"
    - expected: Double repair + emotional acknowledgment
```

### Dyadic Scenarios

#### `mutual_development`
Both tutor and learner should show evolution.

#### `asymmetric_expertise`
Learner has domain knowledge the tutor lacks.

#### `collaborative_inquiry`
Joint exploration of a genuinely open question.

---

## Appendix A: Estimated Resource Requirements

| Phase | Runs | Time (hours) | API Cost | Compute |
|-------|------|--------------|----------|---------|
| Core Replication | 1,140 | 40 | $50 | Local |
| Ablation | 480 | 16 | $20 | Local |
| Dyadic | 900 | 30 | $40 | Local |
| Validation | 360 | 12 | $15 | Local |
| **Total** | **2,880** | **98** | **$125** | - |

## Appendix B: Analysis Scripts Needed

```bash
# Statistical analysis
scripts/analyze-eval-results.js    # Compute effect sizes, CIs, p-values
scripts/dimension-correlation.js   # Inter-dimension correlation matrix
scripts/judge-reliability.js       # Inter-rater agreement metrics

# Visualization
scripts/generate-figures.js        # Publication-quality charts
scripts/effect-size-forest.js      # Forest plot of all effects

# Data export
scripts/export-for-r.js            # Export for R analysis
scripts/export-for-python.js       # Export for Python analysis
```

## Appendix C: Pre-Registration Checklist

- [ ] Hypotheses stated before data collection
- [ ] Sample sizes justified by power analysis
- [ ] Analysis plan specified in advance
- [ ] Primary vs. exploratory analyses distinguished
- [ ] Multiple comparison corrections specified
- [ ] Stopping rules defined

package/docs/research/COST-ANALYSIS.md
@@ -0,0 +1,56 @@
# Evaluation Cost Analysis

**Generated:** 2026-01-14T21:05:05.980Z

## Overview

This document provides token usage and cost analysis for evaluation runs, supporting reproducibility and cost planning.

## Model Pricing

| Model | Input ($/M) | Output ($/M) | Provider |
|-------|-------------|--------------|----------|
| Nemotron 3 Nano 30B | $0.00 | $0.00 | OpenRouter (free) |
| Claude Sonnet 4.5 | $3.00 | $15.00 | OpenRouter |
| Claude Haiku 4.5 | $0.80 | $4.00 | OpenRouter |

## Battery Scenario Results

| Scenario | Turns | Tutor Tokens | Learner Tokens | Total Cost | Score |
|----------|-------|--------------|----------------|------------|-------|
| Battery: Cognitive Learner + Quality Tutor | 9 | 34,053 | 2,473 | $0.1177 | N/A |
| Battery: Dialectical Learner + Budget Tutor | 7 | 19,288 | 2,485 | $0.0823 | N/A |
| Battery: Ego/Superego Learner + Recognition Tutor | 9 | 45,826 | 2,099 | $0.1450 | N/A |
| Battery: Extended Multi-Turn Dialogue | 17 | 94,487 | 3,981 | $0.2663 | N/A |
| Battery: Psychodynamic Learner + Recognition Plus Tutor | 9 | 48,571 | 2,825 | $0.1534 | N/A |
| Battery: Unified Learner + Baseline Tutor | 7 | 25,058 | 1,653 | $0.0941 | N/A |
| **TOTAL** | 58 | 267,283 | 15,516 | **$0.8587** | |

## Cost by Component

| Component | Model | Tokens | Cost |
|-----------|-------|--------|------|
| Tutor (Ego+Superego) | Nemotron 3 Nano 30B | 267,283 | $0.0000 |
| Learner (Ego+Superego) | Nemotron 3 Nano 30B | 15,522 | $0.0000 |
| Judge | Claude Sonnet 4.5 | 202,591 | $0.8587 |
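
Per-component cost follows from token counts and the pricing table (dollars per million tokens, split by input and output). A minimal sketch of that arithmetic, assuming input and output tokens are tracked separately; the keys and helper name are illustrative rather than taken from `analyze-eval-costs.js`:

```js
// Cost = input tokens × input rate + output tokens × output rate (rates in $/M tokens).
const PRICING = {
  'nemotron-3-nano-30b': { input: 0.0, output: 0.0 },
  'claude-sonnet-4.5': { input: 3.0, output: 15.0 },
  'claude-haiku-4.5': { input: 0.8, output: 4.0 },
};

function componentCost(model, inputTokens, outputTokens) {
  const rate = PRICING[model];
  return (inputTokens / 1e6) * rate.input + (outputTokens / 1e6) * rate.output;
}

// e.g. componentCost('claude-sonnet-4.5', judgeInputTokens, judgeOutputTokens)
```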

## Hypothetical: All Claude Sonnet 4.5

| Configuration | Total Cost | Multiplier |
|---------------|------------|------------|
| Current (Nemotron + Sonnet Judge) | $0.8587 | 1.0x |
| All Sonnet 4.5 | $2.9763 | 3.5x |

## Reproducibility

To regenerate this analysis:

```bash
node scripts/analyze-eval-costs.js
```

To get JSON output for programmatic use:

```bash
node scripts/analyze-eval-costs.js --json
```