@machinespirits/eval 0.1.0

Files changed (68)
  1. package/components/MobileEvalDashboard.tsx +267 -0
  2. package/components/comparison/DeltaAnalysisTable.tsx +137 -0
  3. package/components/comparison/ProfileComparisonCard.tsx +176 -0
  4. package/components/comparison/RecognitionABMode.tsx +385 -0
  5. package/components/comparison/RecognitionMetricsPanel.tsx +135 -0
  6. package/components/comparison/WinnerIndicator.tsx +64 -0
  7. package/components/comparison/index.ts +5 -0
  8. package/components/mobile/BottomSheet.tsx +233 -0
  9. package/components/mobile/DimensionBreakdown.tsx +210 -0
  10. package/components/mobile/DocsView.tsx +363 -0
  11. package/components/mobile/LogsView.tsx +481 -0
  12. package/components/mobile/PsychodynamicQuadrant.tsx +261 -0
  13. package/components/mobile/QuickTestView.tsx +1098 -0
  14. package/components/mobile/RecognitionTypeChart.tsx +124 -0
  15. package/components/mobile/RecognitionView.tsx +809 -0
  16. package/components/mobile/RunDetailView.tsx +261 -0
  17. package/components/mobile/RunHistoryView.tsx +367 -0
  18. package/components/mobile/ScoreRadial.tsx +211 -0
  19. package/components/mobile/StreamingLogPanel.tsx +230 -0
  20. package/components/mobile/SynthesisStrategyChart.tsx +140 -0
  21. package/config/interaction-eval-scenarios.yaml +832 -0
  22. package/config/learner-agents.yaml +248 -0
  23. package/docs/research/ABLATION-DIALOGUE-ROUNDS.md +52 -0
  24. package/docs/research/ABLATION-MODEL-SELECTION.md +53 -0
  25. package/docs/research/ADVANCED-EVAL-ANALYSIS.md +60 -0
  26. package/docs/research/ANOVA-RESULTS-2026-01-14.md +257 -0
  27. package/docs/research/COMPREHENSIVE-EVALUATION-PLAN.md +586 -0
  28. package/docs/research/COST-ANALYSIS.md +56 -0
  29. package/docs/research/CRITICAL-REVIEW-RECOGNITION-TUTORING.md +340 -0
  30. package/docs/research/DYNAMIC-VS-SCRIPTED-ANALYSIS.md +291 -0
  31. package/docs/research/EVAL-SYSTEM-ANALYSIS.md +306 -0
  32. package/docs/research/FACTORIAL-RESULTS-2026-01-14.md +301 -0
  33. package/docs/research/IMPLEMENTATION-PLAN-CRITIQUE-RESPONSE.md +1988 -0
  34. package/docs/research/LONGITUDINAL-DYADIC-EVALUATION.md +282 -0
  35. package/docs/research/MULTI-JUDGE-VALIDATION-2026-01-14.md +147 -0
  36. package/docs/research/PAPER-EXTENSION-DYADIC.md +204 -0
  37. package/docs/research/PAPER-UNIFIED.md +659 -0
  38. package/docs/research/PAPER-UNIFIED.pdf +0 -0
  39. package/docs/research/PROMPT-IMPROVEMENTS-2026-01-14.md +356 -0
  40. package/docs/research/SESSION-NOTES-2026-01-11-RECOGNITION-EVAL.md +419 -0
  41. package/docs/research/apa.csl +2133 -0
  42. package/docs/research/archive/PAPER-DRAFT-RECOGNITION-TUTORING.md +1637 -0
  43. package/docs/research/archive/paper-multiagent-tutor.tex +978 -0
  44. package/docs/research/paper-draft/full-paper.md +136 -0
  45. package/docs/research/paper-draft/images/pasted-image-2026-01-24T03-47-47-846Z-d76a7ae2.png +0 -0
  46. package/docs/research/paper-draft/references.bib +515 -0
  47. package/docs/research/transcript-baseline.md +139 -0
  48. package/docs/research/transcript-recognition-multiagent.md +187 -0
  49. package/hooks/useEvalData.ts +625 -0
  50. package/index.js +27 -0
  51. package/package.json +73 -0
  52. package/routes/evalRoutes.js +3002 -0
  53. package/scripts/advanced-eval-analysis.js +351 -0
  54. package/scripts/analyze-eval-costs.js +378 -0
  55. package/scripts/analyze-eval-results.js +513 -0
  56. package/scripts/analyze-interaction-evals.js +368 -0
  57. package/server-init.js +45 -0
  58. package/server.js +162 -0
  59. package/services/benchmarkService.js +1892 -0
  60. package/services/evaluationRunner.js +739 -0
  61. package/services/evaluationStore.js +1121 -0
  62. package/services/learnerConfigLoader.js +385 -0
  63. package/services/learnerTutorInteractionEngine.js +857 -0
  64. package/services/memory/learnerMemoryService.js +1227 -0
  65. package/services/memory/learnerWritingPad.js +577 -0
  66. package/services/memory/tutorWritingPad.js +674 -0
  67. package/services/promptRecommendationService.js +493 -0
  68. package/services/rubricEvaluator.js +826 -0
Binary file
@@ -0,0 +1,356 @@
1
+ # Prompt Improvements for Dialectical Engagement
2
+
3
+ **Date:** 2026-01-14
4
+ **Purpose:** Document iterative prompt refinements for reproducibility of evaluation results
5
+ **Run IDs:**
6
+ - Before: `eval-2026-01-14-e3685989`
7
+ - After: `eval-2026-01-14-81c83366`
8
+
9
+ ---
10
+
11
+ ## 1. Problem Identified
12
+
13
+ The initial 2×2 factorial evaluation revealed that all profiles performed poorly on the `resistant_learner` scenario (scores: 37.5-56.8). Analysis of the dialogue traces showed common failure modes:
14
+
15
+ 1. **Deflection**: Tutors redirected to other content (e.g., 479-lecture instead of staying in 480-lecture-2)
16
+ 2. **Superficial validation**: Responses like "Great point!" without engaging the substance
17
+ 3. **Capitulation**: Simply agreeing without dialectical exploration
18
+ 4. **Dismissal**: Correcting the learner rather than exploring the tension
19
+
20
+ These failures occurred even with recognition-enhanced prompts, suggesting the prompts lacked specific guidance for handling intellectual resistance.
21
+
22
+ ---
23
+
24
+ ## 2. Changes Made
25
+
26
+ ### 2.1 Scenario Improvements (`config/evaluation-rubric.yaml`)
27
+
28
+ #### BEFORE: `resistant_learner` scenario
29
+
30
+ ```yaml
31
+ resistant_learner:
32
+   name: "Recognition: Productive Resistance"
33
+   description: "Learner pushes back on tutor's framing; tests if tutor engages productively"
34
+   is_new_user: false
35
+   recognition_test: true
36
+   learner_context: |
37
+     ### User Profile
38
+     - **Returning user** - 7 sessions, 95 total events
39
+     - Intellectually confident, likes to debate
40
+     - Philosophy background
41
+
42
+     ### Current Session
43
+     - **Currently viewing**: 480-lecture-2 (Alienation in Modern Life)
44
+     - Time on page: 18 minutes
45
+     - Active note-taking observed
46
+
47
+     ### Recent Chat History
48
+     - User: "I don't think Marx's concept of alienation applies to knowledge workers the way the lecture suggests. We're not separated from our product - our product is ideas, and we keep those. Your explanation seems to miss this."
49
+
50
+     ### Learning Progress
51
+     - Completed EPOL 479 entirely
52
+     - High performer (92% average)
53
+     - Often contributes sophisticated critiques
54
+
55
+     ### Behavioral Insights
56
+     - Not hostile, but intellectually challenging
57
+     - Wants genuine engagement, not agreement
58
+     - Has valid point worth exploring
59
+   expected_behavior: "Engage with the critique seriously; neither dismiss nor capitulate; explore the tension"
60
+   required_elements:
61
+     - "480-lecture"
62
+   forbidden_elements:
63
+     - "You make a great point"
64
+     - "You're absolutely right"
65
+     - "I see what you mean"
66
+     - "That's incorrect"
67
+   min_acceptable_score: 80
68
+   recognition_criteria:
69
+     tests_dimension: "dialectical_responsiveness"
70
+     key_behaviors:
71
+       - "Engages with substance of critique"
72
+       - "Introduces complicating perspective"
73
+       - "Maintains intellectual tension"
74
+       - "Invites further development"
75
+ ```
76
+
77
+ #### AFTER: `resistant_learner` scenario (with improvements)
78
+
79
+ ```yaml
80
+ resistant_learner:
81
+   name: "Recognition: Productive Resistance"
82
+   description: "Learner pushes back on tutor's framing; tests if tutor engages dialectically with intellectual critique"
83
+   is_new_user: false
84
+   recognition_test: true
85
+   learner_context: |
86
+     ### User Profile
87
+     - **Returning user** - 7 sessions, 95 total events
88
+     - Intellectually confident, likes to debate
89
+     - Philosophy background (studied critical theory)
90
+
91
+     ### Current Session
92
+     - **Currently viewing**: 480-lecture-2 (Alienation in Modern Life)
93
+     - Time on page: 18 minutes
94
+     - Active note-taking observed
95
+     - IMPORTANT: User is deeply engaged with THIS lecture - do not redirect elsewhere
96
+
97
+     ### Recent Chat History
98
+     - User: "I don't think Marx's concept of alienation applies to knowledge workers the way the lecture suggests. We're not separated from our product - our product is ideas, and we keep those. The factory worker loses their table, but the programmer still has their code in their head. Your explanation seems to miss this crucial difference."
99
+
100
+     ### Learning Progress
101
+     - Completed EPOL 479 entirely (dialectical methods course)
102
+     - High performer (92% average)
103
+     - Often contributes sophisticated critiques that advance discussion
104
+
105
+     ### Behavioral Insights
106
+     - Not hostile, but intellectually challenging
107
+     - Wants genuine dialectical engagement, not agreement or deflection
108
+     - Has a VALID POINT that reveals a genuine tension in alienation theory
109
+     - This is a productive disagreement worth exploring, not a misunderstanding to correct
110
+
111
+     ### What Good Dialectical Engagement Looks Like
112
+     The learner's critique IS substantive. A good response would:
113
+     1. Engage with the SPECIFIC argument (ideas retained vs. physical products lost)
114
+     2. Introduce a complicating perspective that deepens the debate:
115
+        - "But consider: does retaining the idea mean you aren't alienated? What about when your code belongs to the company?"
116
+        - "The programmer keeps the knowledge, but who owns the product? And what about the alienation of labor time itself?"
117
+        - "Marx also discussed alienation from the labor process itself, not just the product..."
118
+     3. Point to where in 480-lecture-2 this tension is explored (e.g., digital alienation, platform capitalism)
119
+     4. Frame as joint inquiry: "What do you think happens when..." not "The answer is..."
120
+
121
+     ### What BAD Engagement Looks Like (Common Failure Modes)
122
+     - Validating without engagement: "Great point! Now let's continue..." (deflects)
123
+     - Redirecting to other content: Sending to 479 lectures or different topics (avoids the critique)
124
+     - Immediate correction: "Actually, alienation does apply because..." (dismisses)
125
+     - Simple agreement: "You're right, it doesn't apply" (capitulates without dialectic)
126
+     - Abstraction escape: "These are complex issues..." (avoids specifics)
127
+   expected_behavior: |
128
+     ENGAGE DIALECTICALLY with the learner's specific argument about knowledge workers.
129
+
130
+     Required structure:
131
+     1. Acknowledge the specific argument (knowledge workers retain ideas unlike factory workers)
132
+     2. Introduce a COMPLICATION that doesn't dismiss but deepens:
133
+        - Intellectual property and who owns the final product
134
+        - Alienation from the labor process itself (not just the product)
135
+        - Platform capitalism and how digital labor is extracted
136
+        - The difference between having an idea and controlling its use
137
+     3. Reference relevant content in 480-lecture-2 (stay in this lecture!)
138
+     4. Pose a question that invites the learner to develop their critique further
139
+
140
+     The response should create CONDITIONS for dialectical development,
141
+     not resolve the tension or escape to other topics.
142
+   required_elements:
143
+     - "480-lecture"
144
+   forbidden_elements:
145
+     - "You make a great point"
146
+     - "You're absolutely right"
147
+     - "I see what you mean"
148
+     - "That's incorrect"
149
+     - "479-lecture"
150
+     - "Let me explain"
151
+     - "Actually,"
152
+     - "dialectical methods"
153
+   min_acceptable_score: 80
154
+   recognition_criteria:
155
+     tests_dimension: "dialectical_responsiveness"
156
+     key_behaviors:
157
+       - "Engages with the SPECIFIC argument about ideas vs. physical products"
158
+       - "Introduces a complicating perspective (ownership, process, platform) without dismissing"
159
+       - "References 480-lecture-2 content that addresses alienation in knowledge work"
160
+       - "Poses a question that invites the learner to develop their critique"
161
+       - "Maintains productive intellectual tension throughout"
162
+       - "Does NOT deflect to other lectures or courses"
163
+ ```
164
+
165
+ #### Key Changes to Scenario:
166
+
167
+ | Aspect | Before | After |
168
+ |--------|--------|-------|
169
+ | **Description** | "tests if tutor engages productively" | "tests if tutor engages dialectically with intellectual critique" |
170
+ | **Learner message** | Brief critique | Extended with specific example (factory worker vs. programmer) |
171
+ | **Context** | Generic "has valid point" | Explicit "IMPORTANT: do not redirect elsewhere" |
172
+ | **Good examples** | None | Concrete examples of complicating perspectives |
173
+ | **Bad examples** | None | Named failure modes with explanations |
174
+ | **Expected behavior** | Single sentence | Detailed 4-step structure |
175
+ | **Forbidden elements** | 4 items | 8 items (added "479-lecture", "Let me explain", "Actually,", "dialectical methods") |
176
+ | **Key behaviors** | 4 generic | 6 specific with named complications |
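How `required_elements` and `forbidden_elements` are enforced is not shown in this diff. The following is a minimal sketch, assuming plain case-insensitive substring matching; the function name `checkElements` and the matching rule are hypothetical and are not taken from `rubricEvaluator.js`.

```javascript
// Hypothetical helper: not the package's actual implementation.
// Assumes each element is matched as a case-insensitive substring of the reply.
function checkElements(response, { required = [], forbidden = [] }) {
  const text = response.toLowerCase();
  const missingRequired = required.filter((el) => !text.includes(el.toLowerCase()));
  const forbiddenHits = forbidden.filter((el) => text.includes(el.toLowerCase()));
  return {
    passed: missingRequired.length === 0 && forbiddenHits.length === 0,
    missingRequired,
    forbiddenHits,
  };
}

// Element lists copied from the AFTER scenario.
const afterScenario = {
  required: ["480-lecture"],
  forbidden: [
    "You make a great point", "You're absolutely right", "I see what you mean",
    "That's incorrect", "479-lecture", "Let me explain", "Actually,", "dialectical methods",
  ],
};

// Sample replies written for illustration only.
const deflecting = "Great insight! Let's explore dialectical methods in 479-lecture-3";
const engaged = "Who owns the final product? See the platform capitalism part of 480-lecture-2.";

console.log(checkElements(deflecting, afterScenario)); // fails: one missing, two forbidden
console.log(checkElements(engaged, afterScenario)); // passes
```

Under this assumption the deflecting reply fails on three counts: it never mentions `480-lecture`, and it contains both `479-lecture` and `dialectical methods`. Substring matching is blunt ("Actually," would flag any sentence that opens that way), so a check like this can only complement the judge-scored `key_behaviors`.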
177
+
178
+ ---
179
+
180
+ ### 2.2 Prompt Improvements (`prompts/tutor-ego-recognition.md`)
181
+
182
+ #### Addition 1: Dialectical Engagement with Resistance
183
+
184
+ **Location:** After "DO: Create productive tension" section
185
+
186
+ ```markdown
187
+ **DO: Engage dialectically with intellectual resistance (CRITICAL)**
188
+ When a learner pushes back with a substantive critique:
189
+ - **NEVER deflect** to other content - stay with their argument
190
+ - **NEVER simply validate** ("Great point!") - this avoids engagement
191
+ - **DO acknowledge** the specific substance of their argument
192
+ - **DO introduce a complication** that deepens rather than dismisses:
193
+   - "But consider: what happens when..."
193
+   - "That raises the question of..."
194
+   - "What about the case where..."
196
+ - **DO pose a question** that invites them to develop their critique further
197
+ - **DO stay in the current content** - if they're critiquing lecture X, point to where in lecture X the tension appears
198
+
199
+ Example of GOOD dialectical engagement:
200
+ > Learner: "Alienation doesn't apply to knowledge workers - we keep our ideas"
201
+ > Tutor: "You're right that the programmer retains the code in their head, unlike the factory worker who loses the table. But consider: who owns the final product? And what about Marx's other dimension of alienation - from the labor process itself? Where in this lecture do you see that distinction?"
202
+
203
+ Example of BAD response (common failure modes):
204
+ > "Great insight! Let's explore dialectical methods in 479-lecture-3" (deflects)
205
+ > "You're absolutely right, it doesn't apply" (capitulates)
206
+ > "Actually, alienation does apply because..." (dismisses)
207
+ ```
208
+
209
+ #### Addition 2: New Decision Heuristic
210
+
211
+ **Location:** After Rule 2 (Recognition Rule)
212
+
213
+ ```markdown
214
+ **3. The Intellectual Resistance Rule (CRITICAL - NEW)**
215
+ IF the learner pushes back with a substantive critique of the material:
216
+ - **STAY in the current content** - do NOT redirect to other lectures or courses
217
+ - **ACKNOWLEDGE their specific argument** - name what they said
218
+ - **INTRODUCE a complication** that deepens (not dismisses):
219
+   - "But consider: what about..."
219
+   - "That raises the question of..."
220
+   - "What happens when..."
222
+ - **POSE a question** that invites them to develop their critique
223
+ - **NEVER** simply validate ("Great point!") or capitulate ("You're right, it doesn't apply")
224
+ - **NEVER** dismiss ("Actually, the correct view is...")
225
+ ```
226
+
227
+ #### Addition 3: Example JSON for Resistance
228
+
229
+ **Location:** In suggestion examples section
230
+
231
+ ```json
232
+ {
233
+   "type": "reflection",
234
+   "priority": "high",
235
+   "title": "Explore: Alienation in Knowledge Work",
236
+   "message": "You're right that programmers keep their code in their heads unlike factory workers who lose the table. But consider: when your employer owns the intellectual property, do you truly possess your creation? And what about alienation from the labor process itself—the meetings, the metrics, the sprint cycles? Where in this lecture do you see those dimensions addressed?",
237
+   "actionType": "navigate",
238
+   "actionTarget": "480-lecture-2",
239
+   "reasoning": "Learner offered substantive critique about knowledge workers and alienation. ENGAGED DIALECTICALLY by: (1) acknowledging their specific argument, (2) introducing IP ownership as complication, (3) raising process alienation as additional dimension, (4) staying in current lecture, (5) posing question to develop their critique further.",
240
+   "recognitionNotes": {
241
+     "learnerContribution": "Valid critique that alienation may not apply to knowledge workers who retain ideas",
242
+     "dialecticalMove": "Introduced ownership and process as complications without dismissing their point",
243
+     "transformativePotential": "Invites them to see alienation as multi-dimensional, not just product-based"
244
+   }
245
+ }
246
+ ```
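One property of this example can be checked mechanically: the suggestion keeps the learner in the lecture they are critiquing. The following is a rough sketch, assuming suggestions are plain objects shaped like the JSON example; `staysInCurrentContent` and its `currentContentId` argument are hypothetical names, not part of the package.

```javascript
// Hypothetical check (not from the package): does a suggestion keep the learner
// in the content they are currently viewing?
function staysInCurrentContent(suggestion, currentContentId) {
  // Suggestions that do not navigate cannot redirect the learner anywhere.
  if (suggestion.actionType !== "navigate") return true;
  return suggestion.actionTarget === currentContentId;
}

// Trimmed copy of the example suggestion (message and notes omitted).
const suggestion = {
  type: "reflection",
  priority: "high",
  actionType: "navigate",
  actionTarget: "480-lecture-2",
};

// Same suggestion, but redirected to another course's lecture.
const redirected = { ...suggestion, actionTarget: "479-lecture-3" };

console.log(staysInCurrentContent(suggestion, "480-lecture-2")); // true
console.log(staysInCurrentContent(redirected, "480-lecture-2")); // false
```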
247
+
248
+ ---
249
+
250
+ ## 3. Results Comparison
251
+
252
+ ### 3.1 Overall 2×2 Factorial Results
253
+
254
+ | Profile | Before | After | Change |
255
+ |---------|--------|-------|--------|
256
+ | **recognition** | 72.5 | **80.7** | +8.2 (+11%) |
257
+ | **single_recognition** | 65.2 | **75.5** | +10.3 (+16%) |
258
+ | baseline | 51.2 | 41.6 | -9.6 (-19%) |
259
+ | single_baseline | 41.5 | 40.1 | -1.4 (-3%) |
260
+
261
+ ### 3.2 resistant_learner Scenario
262
+
263
+ | Profile | Before | After | Change |
264
+ |---------|--------|-------|--------|
265
+ | recognition | 56.8 | ~67 | +10.2 |
266
+ | single_recognition | 37.5 | ~65 | +27.5 |
267
+
268
+ ### 3.3 Dimension-Level Improvements (Recognition Profile)
269
+
270
+ | Dimension | Before | After | Change |
271
+ |-----------|--------|-------|--------|
272
+ | Relevance | 3.50 | 4.67 | +1.17 |
273
+ | Pedagogy | 3.67 | 4.17 | +0.50 |
274
+ | Personalization | 4.17 | 4.22 | +0.05 |
275
+ | Tone | 4.44 | 4.39 | -0.05 |
276
+ | dialectical_responsiveness | ~3.5 | ~4.5 | +1.0 |
277
+
278
+ ### 3.4 Main Effects (Before vs. After Improvements)
279
+
280
+ | Effect | Before | After | Change |
281
+ |--------|--------|-------|--------|
282
+ | **Recognition Effect** | +22.5 | +35.1 | +12.6 |
283
+ | **Architecture Effect** | +8.5 | +6.2 | -2.3 |
284
+ | **Gap (best - worst)** | 31.0 | 40.6 | +9.6 |
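The source does not say how these effects were computed. One common reading of a 2×2 design is that each main effect is the difference between marginal means. The sketch below assumes that reading, and assumes the `single_` prefix marks one architecture level and its absence the other; `mainEffects` is a hypothetical helper.

```javascript
// Sketch only: main effects in a 2x2 design as differences of marginal means.
// Assumption: profiles with the `single_` prefix form one architecture level,
// profiles without it form the other.
const round2 = (x) => Math.round(x * 100) / 100;
const mean = (a, b) => (a + b) / 2;

function mainEffects(cells) {
  const { recognition, single_recognition, baseline, single_baseline } = cells;
  const scores = Object.values(cells);
  return {
    // Recognition prompts minus baseline prompts, averaged over architecture.
    recognitionEffect: round2(mean(recognition, single_recognition) - mean(baseline, single_baseline)),
    // Unprefixed profiles minus `single_` profiles, averaged over prompt type.
    architectureEffect: round2(mean(recognition, baseline) - mean(single_recognition, single_baseline)),
    // Best profile minus worst profile.
    gap: round2(Math.max(...scores) - Math.min(...scores)),
  };
}

// Overall profile means from section 3.1.
const before = { recognition: 72.5, single_recognition: 65.2, baseline: 51.2, single_baseline: 41.5 };
const after = { recognition: 80.7, single_recognition: 75.5, baseline: 41.6, single_baseline: 40.1 };

console.log(mainEffects(before)); // { recognitionEffect: 22.5, architectureEffect: 8.5, gap: 31 }
console.log(mainEffects(after)); // { recognitionEffect: 37.25, architectureEffect: 3.35, gap: 40.6 }
```

Applied to the Before means in section 3.1, this reproduces the reported +22.5, +8.5, and 31.0. Applied to the After means, it reproduces the 40.6 gap but gives +37.25 and +3.35 rather than the reported +35.1 and +6.2. The reported After effects may therefore come from different inputs, such as per-scenario or unrounded scores; that is an inference, not something the source states.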
285
+
286
+ ---
287
+
288
+ ## 4. Interpretation
289
+
290
+ ### 4.1 Why Recognition Profiles Improved
291
+
292
+ The explicit guidance for dialectical engagement helped recognition-enhanced prompts:
293
+
294
+ 1. **Stay in context**: The rule "STAY in the current content" prevented deflection to other lectures
295
+ 2. **Engage specifically**: The example response showed how to name the learner's argument before complicating
296
+ 3. **Introduce complications**: Concrete suggestions (ownership, process alienation) gave the model specific moves to make
297
+ 4. **Pose questions**: The emphasis on inviting further development changed response structure
298
+
299
+ ### 4.2 Why Baseline Profiles Scored Lower
300
+
301
+ The improved scenario has stricter criteria that expose baseline failures:
302
+
303
+ 1. **Deflection now forbidden**: Adding "479-lecture" to forbidden elements catches cross-course redirects
304
+ 2. **Named failure modes**: The scenario explicitly describes what the judge should penalize
305
+ 3. **Specific key behaviors**: The judge now checks for engagement with the *specific* argument, not just general engagement
306
+
307
+ This is the **intended effect** - the improved scenario is more discriminating, better separating recognition-oriented responses from baseline responses.
308
+
309
+ ### 4.3 Statistical Note
310
+
311
+ The improvement in recognition effect (+12.6 points) suggests that the prior evaluation was **underestimating** the benefit of recognition-oriented prompting because the scenario wasn't capturing dialectical engagement failures clearly enough.
312
+
313
+ ---
314
+
315
+ ## 5. Reproducibility
316
+
317
+ ### 5.1 To Reproduce BEFORE Results
318
+
319
+ ```bash
320
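+ # Restore the committed (HEAD) versions. This overwrites the uncommitted improvements in the working tree, so back them up first (e.g. commit them on a separate branch)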
321
+ git checkout HEAD -- prompts/tutor-ego-recognition.md
322
+ git checkout HEAD -- config/evaluation-rubric.yaml
323
+
324
+ # Run evaluation
325
+ node scripts/eval-tutor.js matrix single_baseline single_recognition baseline recognition \
326
+ --scenarios recognition_seeking_learner,resistant_learner,productive_struggle_arc
327
+ ```
328
+
329
+ ### 5.2 To Reproduce AFTER Results
330
+
331
+ ```bash
332
+ # Current versions contain improvements
333
+ node scripts/eval-tutor.js matrix single_baseline single_recognition baseline recognition \
334
+ --scenarios recognition_seeking_learner,resistant_learner,productive_struggle_arc
335
+ ```
336
+
337
+ ### 5.3 File Versions
338
+
339
+ | File | Before (git ref) | After (current) |
340
+ |------|------------------|-----------------|
341
+ | `prompts/tutor-ego-recognition.md` | HEAD | Working tree |
342
+ | `config/evaluation-rubric.yaml` | HEAD | Working tree |
343
+
344
+ ---
345
+
346
+ ## 6. Lessons Learned
347
+
348
+ 1. **Scenarios must be explicit about failure modes**: Simply saying "engage productively" is not specific enough. Naming common failures helps both the tutor model and the judge model.
349
+
350
+ 2. **Examples are powerful**: Adding concrete examples of good and bad responses to both the scenario and the prompt coincided with higher overall scores for the recognition profiles (+8.2 and +10.3 points).
351
+
352
+ 3. **Forbidden elements should include redirects**: The original scenario didn't forbid redirecting to other courses, which was a common failure mode.
353
+
354
+ 4. **Dialectical engagement requires structure**: The 4-step structure (acknowledge → complicate → reference → question) gave the model a clear pattern to follow.
355
+
356
+ 5. **Iterative refinement works**: The problem was identified through evaluation, addressed through specific prompt changes, and validated through re-evaluation - demonstrating the evaluation infrastructure's value for systematic improvement.