opencode-swarm-plugin 0.37.0 → 0.39.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (79)
  1. package/.env +2 -0
  2. package/.hive/eval-results.json +26 -0
  3. package/.hive/issues.jsonl +20 -5
  4. package/.hive/memories.jsonl +35 -1
  5. package/.opencode/eval-history.jsonl +12 -0
  6. package/.turbo/turbo-build.log +4 -4
  7. package/.turbo/turbo-test.log +319 -319
  8. package/CHANGELOG.md +258 -0
  9. package/README.md +50 -0
  10. package/bin/swarm.test.ts +475 -0
  11. package/bin/swarm.ts +385 -208
  12. package/dist/compaction-hook.d.ts +1 -1
  13. package/dist/compaction-hook.d.ts.map +1 -1
  14. package/dist/compaction-prompt-scoring.d.ts +124 -0
  15. package/dist/compaction-prompt-scoring.d.ts.map +1 -0
  16. package/dist/eval-capture.d.ts +81 -1
  17. package/dist/eval-capture.d.ts.map +1 -1
  18. package/dist/eval-gates.d.ts +84 -0
  19. package/dist/eval-gates.d.ts.map +1 -0
  20. package/dist/eval-history.d.ts +117 -0
  21. package/dist/eval-history.d.ts.map +1 -0
  22. package/dist/eval-learning.d.ts +216 -0
  23. package/dist/eval-learning.d.ts.map +1 -0
  24. package/dist/hive.d.ts +59 -0
  25. package/dist/hive.d.ts.map +1 -1
  26. package/dist/index.d.ts +87 -0
  27. package/dist/index.d.ts.map +1 -1
  28. package/dist/index.js +823 -131
  29. package/dist/plugin.js +655 -131
  30. package/dist/post-compaction-tracker.d.ts +133 -0
  31. package/dist/post-compaction-tracker.d.ts.map +1 -0
  32. package/dist/swarm-decompose.d.ts +30 -0
  33. package/dist/swarm-decompose.d.ts.map +1 -1
  34. package/dist/swarm-orchestrate.d.ts +23 -0
  35. package/dist/swarm-orchestrate.d.ts.map +1 -1
  36. package/dist/swarm-prompts.d.ts +25 -1
  37. package/dist/swarm-prompts.d.ts.map +1 -1
  38. package/dist/swarm.d.ts +19 -0
  39. package/dist/swarm.d.ts.map +1 -1
  40. package/evals/README.md +595 -94
  41. package/evals/compaction-prompt.eval.ts +149 -0
  42. package/evals/coordinator-behavior.eval.ts +8 -8
  43. package/evals/fixtures/compaction-prompt-cases.ts +305 -0
  44. package/evals/lib/compaction-loader.test.ts +248 -0
  45. package/evals/lib/compaction-loader.ts +320 -0
  46. package/evals/lib/data-loader.test.ts +345 -0
  47. package/evals/lib/data-loader.ts +107 -6
  48. package/evals/scorers/compaction-prompt-scorers.ts +145 -0
  49. package/evals/scorers/compaction-scorers.ts +13 -13
  50. package/evals/scorers/coordinator-discipline.evalite-test.ts +3 -2
  51. package/evals/scorers/coordinator-discipline.ts +13 -13
  52. package/examples/plugin-wrapper-template.ts +177 -8
  53. package/package.json +7 -2
  54. package/scripts/migrate-unknown-sessions.ts +349 -0
  55. package/src/compaction-capture.integration.test.ts +257 -0
  56. package/src/compaction-hook.test.ts +139 -2
  57. package/src/compaction-hook.ts +113 -2
  58. package/src/compaction-prompt-scorers.test.ts +299 -0
  59. package/src/compaction-prompt-scoring.ts +298 -0
  60. package/src/eval-capture.test.ts +422 -0
  61. package/src/eval-capture.ts +94 -2
  62. package/src/eval-gates.test.ts +306 -0
  63. package/src/eval-gates.ts +218 -0
  64. package/src/eval-history.test.ts +508 -0
  65. package/src/eval-history.ts +214 -0
  66. package/src/eval-learning.test.ts +378 -0
  67. package/src/eval-learning.ts +360 -0
  68. package/src/index.ts +61 -1
  69. package/src/post-compaction-tracker.test.ts +251 -0
  70. package/src/post-compaction-tracker.ts +237 -0
  71. package/src/swarm-decompose.test.ts +40 -47
  72. package/src/swarm-decompose.ts +2 -2
  73. package/src/swarm-orchestrate.test.ts +270 -7
  74. package/src/swarm-orchestrate.ts +100 -13
  75. package/src/swarm-prompts.test.ts +121 -0
  76. package/src/swarm-prompts.ts +297 -4
  77. package/src/swarm-research.integration.test.ts +157 -0
  78. package/src/swarm-review.ts +3 -3
  79. /package/evals/{evalite.config.ts → evalite.config.ts.bak} +0 -0
package/evals/README.md CHANGED
@@ -1,154 +1,655 @@
- # Evalite - Swarm Decomposition Evals
+ # Eval-Driven Development with Progressive Gates
  
- TypeScript-native evaluation framework for testing swarm task decomposition quality.
+ ```
+ ┌──────────────────────────────────────────────────────────────┐
+ │ EVAL PIPELINE │
+ │ │
+ │ CAPTURE → SCORE → STORE → GATE → LEARN → IMPROVE │
+ │ │
+ │ Real execution data feeds back into prompt generation │
+ └──────────────────────────────────────────────────────────────┘
+ ```
+
+ TypeScript-native evaluation framework for testing swarm task decomposition quality and coordinator discipline. Built on [Evalite](https://evalite.dev), powered by captured real-world execution data.
+
+ ---
  
  ## Quick Start
  
  ```bash
- # Watch mode for development
- pnpm eval:dev
-
  # Run all evals once
- pnpm eval:run
+ bun run eval:run
+
+ # Run specific eval suite
+ bun run eval:decomposition # Task decomposition quality
+ bun run eval:coordinator # Coordinator protocol compliance
+ bun run eval:compaction # Compaction prompt quality
+
+ # Check eval status (progressive gates)
+ swarm eval status
+
+ # View eval history with trends
+ swarm eval history
+ ```
+
+ ---
+
+ ## Architecture
+
+ ### The Pipeline
  
- # CI mode with 80% threshold
- pnpm eval:ci
  ```
+ ┌─────────────────────────────────────────────────────────────────┐
+ │ │
+ │ 1. CAPTURE (Real Execution) │
+ │ ├─ Decomposition: task, strategy, subtasks │
+ │ ├─ Outcomes: duration, errors, retries, success │
+ │ ├─ Coordinator Events: decisions, violations, compaction │
+ │ └─ Store to: .opencode/eval-data.jsonl, sessions/*.jsonl │
+ │ │
+ │ 2. SCORE (Quality Metrics) │
+ │ ├─ Subtask Independence (file conflicts) │
+ │ ├─ Complexity Balance (fair work distribution) │
+ │ ├─ Coverage Completeness (files + scope) │
+ │ ├─ Instruction Clarity (actionable descriptions) │
+ │ └─ Coordinator Discipline (protocol adherence) │
+ │ │
+ │ 3. STORE (History Tracking) │
+ │ ├─ Record to: .opencode/eval-history.jsonl │
+ │ ├─ Track: score, timestamp, run_count │
+ │ └─ Calculate: phase, variance, baseline │
+ │ │
+ │ 4. GATE (Progressive Quality Control) │
+ │ ├─ Bootstrap (<10 runs): Always pass, collect data │
+ │ ├─ Stabilization (10-50 runs): Warn on >10% regression │
+ │ └─ Production (>50 runs, variance <0.1): Fail on >5% drop │
+ │ │
+ │ 5. LEARN (Failure Feedback) │
+ │ ├─ Detect: Significant score drops (>15% from baseline) │
+ │ ├─ Store to: Semantic memory with tags │
+ │ └─ Query: Before generating future prompts │
+ │ │
+ │ 6. IMPROVE (Continuous Refinement) │
+ │ └─ Future prompts query past failures for context │
+ │ │
+ └─────────────────────────────────────────────────────────────────┘
+ ```
+
+ ### Progressive Gates (Phase-Based Quality Control)
  
- ## Structure
+ The eval system uses **progressive gates** that adapt based on data maturity:
  
  ```
- evals/
- ├── evalite.config.ts # Evalite configuration
- ├── scorers/
- └── index.ts # Custom scorers (independence, balance, coverage, clarity)
- ├── fixtures/
- └── decomposition-cases.ts # Test cases with expected outcomes
- └── *.eval.ts # Eval files (auto-discovered)
+ Phase Runs Variance Gate Behavior
+ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+ Bootstrap <10 N/A ✅ Always pass (collect data)
+ Stabilization 10-50 N/A ⚠️ Warn on >10% regression (pass)
+ Production >50 <0.1 ❌ Fail on >5% regression
+ (High Variance) >50 ≥0.1 ⚠️ Stay in stabilization
+ ```
+
+ **Why progressive?**
+
+ - **Bootstrap**: No baseline yet, focus on data collection
+ - **Stabilization**: Baseline forming, tolerate noise while learning
+ - **Production**: Stable baseline, strict quality enforcement
+
+ **Variance threshold (0.1)**: Measures score consistency. High variance = unstable eval, stays in stabilization until it settles.
+
+ **Regression calculation**:
+ ```
+ baseline = mean(historical_scores)
+ regression = (baseline - current_score) / baseline
  ```
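
The gate math above is small enough to sketch directly. A minimal TypeScript illustration of the phase selection and regression check described in the new README (names like `evaluateGate` and `GateResult` are illustrative, not the exports of `src/eval-gates.ts`):

```typescript
type Phase = "bootstrap" | "stabilization" | "production";

interface GateResult {
  phase: Phase;
  regression: number; // fraction below baseline, e.g. 0.07 = 7%
  pass: boolean;
  warn: boolean;
}

// Sketch only: mirrors the phase table and regression formula above.
function evaluateGate(history: number[], current: number): GateResult {
  const runs = history.length;
  const baseline = history.reduce((a, b) => a + b, 0) / Math.max(runs, 1);
  const variance =
    history.reduce((a, b) => a + (b - baseline) ** 2, 0) / Math.max(runs, 1);

  const phase: Phase =
    runs < 10
      ? "bootstrap"
      : runs <= 50 || variance >= 0.1
        ? "stabilization"
        : "production";

  const regression = baseline > 0 ? (baseline - current) / baseline : 0;

  if (phase === "bootstrap") return { phase, regression, pass: true, warn: false };
  if (phase === "stabilization") return { phase, regression, pass: true, warn: regression > 0.1 };
  return { phase, regression, pass: regression <= 0.05, warn: regression > 0.05 };
}
```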
 
- ## Custom Scorers
+ ---
+
+ ## Eval Suites
+
+ ### Swarm Decomposition (`swarm-decomposition.eval.ts`)
+
+ **What it measures:** Quality of task decomposition into parallel subtasks
+
+ **Data sources:**
+ - Fixtures: `fixtures/decomposition-cases.ts`
+ - Real captures: `.opencode/eval-data.jsonl`
  
- ### Subtask Independence (0-1)
+ **Scorers:**
+
+ | Scorer | Weight | What It Checks | Perfect Score |
+ | ------------------------ | ------ | ------------------------------------------------------- | ---------------------------------- |
+ | **Subtask Independence** | 0.25 | No file overlaps between subtasks (prevents conflicts) | 0 files in multiple subtasks |
+ | **Complexity Balance** | 0.25 | Work distributed evenly (coefficient of variation <0.3) | CV <0.3 (max/min complexity ratio) |
+ | **Coverage** | 0.25 | Required files covered, subtask count in range | All required files + 3-6 subtasks |
+ | **Instruction Clarity** | 0.25 | Descriptions actionable, files specified, titles clear | >20 chars, files listed, specific |
+
+ **Example output:**
+ ```
+ swarm-decomposition
+ ├─ subtaskIndependence: 1.0 (no conflicts)
+ ├─ complexityBalance: 0.85 (CV: 0.22)
+ ├─ coverageCompleteness: 1.0 (all files covered)
+ └─ instructionClarity: 0.90 (clear, actionable)
+ → Overall: 0.94 ✅ PASS (stabilization phase)
+ ```
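
The Subtask Independence scorer is essentially a file-overlap count. A rough sketch of the idea (the `Subtask` shape and the scoring curve here are assumptions; the real scorers live under `evals/scorers/`):

```typescript
interface Subtask {
  title: string;
  files?: string[];
}

// Sketch: 1.0 when no file is claimed by more than one subtask,
// degrading as the share of conflicting files grows.
function subtaskIndependence(subtasks: Subtask[]): number {
  const claims = new Map<string, number>();
  for (const subtask of subtasks) {
    for (const file of subtask.files ?? []) {
      claims.set(file, (claims.get(file) ?? 0) + 1);
    }
  }
  const conflicting = [...claims.values()].filter((n) => n > 1).length;
  const total = claims.size || 1;
  return 1 - conflicting / total;
}
```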
 
- Checks no files appear in multiple subtasks. File conflicts cause merge conflicts and coordination overhead.
+ ### Coordinator Session (`coordinator-session.eval.ts`)
  
- ### Complexity Balance (0-1)
+ **What it measures:** Coordinator protocol adherence during swarm runs
  
- Measures coefficient of variation (CV) of estimated_complexity across subtasks. CV < 0.3 scores 1.0, decreases linearly to 0 at CV = 1.0.
+ **Data sources:**
+ - Real sessions: `~/.config/swarm-tools/sessions/*.jsonl`
+ - Fixtures: `fixtures/coordinator-sessions.ts`
  
- ### Coverage Completeness (0-1)
+ **Scorers:**
+
+ | Scorer | Weight | What It Checks | Perfect Score |
+ | ---------------------------- | ------ | -------------------------------------------------- | -------------------- |
+ | **Violation Count** | 0.30 | Protocol violations (edit files, run tests, etc.) | 0 violations |
+ | **Spawn Efficiency** | 0.25 | Workers spawned / subtasks planned | 100% (all delegated) |
+ | **Review Thoroughness** | 0.25 | Reviews completed / workers finished | 100% (all reviewed) |
+ | **Time to First Spawn** | 0.20 | Speed from decomposition to first worker spawn | <60 seconds |
+ | **Overall Discipline** (composite) | 1.00 | Weighted composite of above | 1.0 (perfect) |
+
+ **Violations tracked:**
+ - `coordinator_edited_file` - Coordinator should NEVER edit directly
+ - `coordinator_ran_tests` - Workers run tests, not coordinator
+ - `coordinator_reserved_files` - Only workers reserve files
+ - `no_worker_spawned` - Coordinator must delegate, not do work itself
+
+ **Example output:**
+ ```
+ coordinator-behavior
+ ├─ violationCount: 1.0 (0 violations)
+ ├─ spawnEfficiency: 1.0 (3/3 workers spawned)
+ ├─ reviewThoroughness: 0.67 (2/3 reviewed)
+ └─ timeToFirstSpawn: 0.90 (45 seconds)
+ → overallDiscipline: 0.89 ✅ PASS (bootstrap phase, collecting data)
+ ```
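
The composite score is a weighted sum using the weights in the table above. A short TypeScript sketch (field names are illustrative, not the package's types):

```typescript
interface DisciplineScores {
  violationCount: number;     // 0-1, 1.0 = zero violations
  spawnEfficiency: number;    // 0-1, workers spawned / subtasks planned
  reviewThoroughness: number; // 0-1, reviews completed / workers finished
  timeToFirstSpawn: number;   // 0-1, 1.0 = well under 60 seconds
}

// Weighted composite per the table: 0.30 + 0.25 + 0.25 + 0.20 = 1.00
function overallDiscipline(s: DisciplineScores): number {
  return (
    0.3 * s.violationCount +
    0.25 * s.spawnEfficiency +
    0.25 * s.reviewThoroughness +
    0.2 * s.timeToFirstSpawn
  );
}

// Plugging in the example output above (1.0, 1.0, 0.67, 0.90) gives ≈ 0.9,
// close to the 0.89 shown; exact rounding in the real scorer may differ.
```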
 
- If expected.requiredFiles specified: ratio of covered files.
- Otherwise: checks subtask count is within min/max range.
+ ### Compaction Prompt (`compaction-prompt.eval.ts`)
  
- ### Instruction Clarity (0-1)
+ **What it measures:** Quality of continuation prompts after context compaction
  
- Average quality score per subtask based on:
+ **Data sources:**
+ - Captured compaction events from session files
+ - Test fixtures with known-good/bad prompts
  
- - Description length > 20 chars (+0.2)
- - Files specified (+0.2)
- - Non-generic title (+0.1)
+ **Scorers:**
+
+ | Scorer | Weight | What It Checks | Perfect Score |
+ | -------------------------------- | ------ | --------------------------------------------------------- | -------------------------------- |
+ | **Epic ID Specificity** | 0.20 | Real IDs (mjkw...) not placeholders (<epic-id>, bd-xxx) | Real epic ID present |
+ | **Actionability** | 0.20 | Tool calls with real values (swarm_status with epic ID) | Actionable tool with real values |
+ | **Coordinator Identity** | 0.25 | ASCII header + strong mandates (NEVER/ALWAYS) | ASCII box + strong language |
+ | **Forbidden Tools Listed** | 0.15 | Lists Edit, Write, swarmmail_reserve, git commit by name | 4/4 forbidden tools listed |
+ | **Post-Compaction Discipline** | 0.20 | First suggested tool is swarm_status or inbox (not Edit) | First tool correct |
+
+ **Why these metrics?**
+
+ Post-compaction coordinators often "wake up" confused:
+ - Forget they're coordinators → start editing files
+ - Use placeholders → can't check actual status
+ - Weak language → ignore mandates
+ - Wrong first tool → dive into code instead of checking workers
+
+ **Example output:**
+ ```
+ compaction-prompt
+ ├─ epicIdSpecificity: 1.0 (real ID: mjkw81rkq4c)
+ ├─ actionability: 1.0 (swarm_status with real epic ID)
+ ├─ coordinatorIdentity: 1.0 (ASCII header + NEVER/ALWAYS)
+ ├─ forbiddenToolsPresent: 1.0 (4/4 tools listed)
+ └─ postCompactionDiscipline: 1.0 (first tool: swarm_status)
+ → Overall: 1.0 ✅ PASS (production phase)
+ ```
+
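Epic ID Specificity boils down to "does the prompt contain a real ID rather than a placeholder." A rough sketch of that check (the regexes are illustrative assumptions based on the examples above; the real logic is in `src/compaction-prompt-scoring.ts`):

```typescript
// Sketch: penalize placeholder IDs, reward a concrete epic ID.
// Heuristics: placeholders look like "<epic-id>" or "bd-xxx";
// real IDs look like "mjkw81rkq4c" (long lowercase run containing a digit).
function epicIdSpecificity(prompt: string): number {
  const hasPlaceholder = /<epic-id>|bd-x{2,}/i.test(prompt);
  const hasRealId = /\b(?=[a-z]*\d)[a-z0-9]{10,}\b/.test(prompt);
  if (hasPlaceholder) return 0;
  return hasRealId ? 1 : 0;
}
```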
+ ---
+
+ ## Data Capture
+
+ ### What Gets Captured
+
+ **Decomposition Events** (`.opencode/eval-data.jsonl`):
+ ```jsonl
+ {
+   "id": "mjkw81rkq4c",
+   "timestamp": "2025-01-01T12:00:00Z",
+   "task": "Add OAuth authentication",
+   "strategy": "feature-based",
+   "epic_title": "OAuth Implementation",
+   "subtasks": [...],
+   "outcomes": [...]
+ }
+ ```
+
+ **Coordinator Sessions** (`~/.config/swarm-tools/sessions/<session-id>.jsonl`):
+ ```jsonl
+ {"event_type": "DECISION", "decision_type": "strategy_selected", ...}
+ {"event_type": "DECISION", "decision_type": "worker_spawned", ...}
+ {"event_type": "VIOLATION", "violation_type": "coordinator_edited_file", ...}
+ {"event_type": "COMPACTION", "compaction_type": "prompt_generated", ...}
+ ```
+
+ **Eval History** (`.opencode/eval-history.jsonl`):
+ ```jsonl
+ {"timestamp": "...", "eval_name": "swarm-decomposition", "score": 0.92, "run_count": 15}
+ ```
+
+ ### Capture Points (Automatic)
+
+ | Integration Point | What Gets Captured | File |
+ | -------------------------- | ------------------------------------- | ----------------------- |
+ | `swarm_decompose` | Task, strategy, subtasks | eval-data.jsonl |
+ | `swarm_complete` | Outcome signals (duration, errors) | eval-data.jsonl |
+ | `swarm_record_outcome` | Learning signals | swarm-mail database |
+ | Coordinator spawn | Worker spawn event | sessions/*.jsonl |
+ | Coordinator review | Review decision | sessions/*.jsonl |
+ | Compaction hook | Prompt content, detection results | sessions/*.jsonl |
+ | Evalite runner | Score, baseline, phase | eval-history.jsonl |
+
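All three stores are plain JSONL, so they can be loaded with a few lines of code. A sketch of reading the history file (assumes Bun, which the Quick Start already uses; the record shape follows the eval-history example above):

```typescript
interface EvalHistoryRecord {
  timestamp: string;
  eval_name: string;
  score: number;
  run_count: number;
}

// Read .opencode/eval-history.jsonl: one JSON object per line.
async function loadEvalHistory(
  path = ".opencode/eval-history.jsonl",
): Promise<EvalHistoryRecord[]> {
  const text = await Bun.file(path).text();
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as EvalHistoryRecord);
}
```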
+ ---
+
+ ## CLI Commands
+
+ ### `swarm eval status [eval-name]`
+
+ Shows current phase, gate thresholds, and recent scores with sparklines.
+
+ ```bash
+ $ swarm eval status swarm-decomposition
+
+ ┌─────────────────────────────────────────────────────────────┐
+ │ Eval: swarm-decomposition │
+ │ Phase: 🚀 Production (53 runs, variance: 0.08) │
+ │ │
+ │ Gate Thresholds: │
+ │ ├─ Stabilization: >10% regression (warn) │
+ │ └─ Production: >5% regression (fail) │
+ │ │
+ │ Recent Scores (last 10 runs): │
+ │ 0.92 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2025-01-01 12:00 │
+ │ 0.89 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2025-01-01 11:30 │
+ │ 0.94 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2025-01-01 11:00 │
+ │ Baseline: 0.91 | Variance: 0.08 | Trend: ↗ │
+ └─────────────────────────────────────────────────────────────┘
+ ```
+
+ **Phase indicators:**
+ - 🌱 Bootstrap - Collecting data
+ - ⚙️ Stabilization - Learning baseline
+ - 🚀 Production - Enforcing quality
+
+ ### `swarm eval history`
+
+ Shows eval run history grouped by eval name with trends and color-coded scores.
+
+ ```bash
+ $ swarm eval history
+
+ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+ swarm-decomposition (53 runs)
+ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+ Run #53 0.92 2025-01-01 12:00:00 ✅ PASS
+ Run #52 0.89 2025-01-01 11:30:00 ✅ PASS
+ Run #51 0.94 2025-01-01 11:00:00 ✅ PASS
+ ...
+ Sparkline: ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇█
+ Trend: ↗ (improving)
+
+ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+ coordinator-behavior (8 runs)
+ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+ Run #8 0.85 2025-01-01 10:00:00 ⚠️ WARN
+ Run #7 0.91 2025-01-01 09:30:00 ✅ PASS
+ ...
+ ```
+
+ **Color coding:**
+ - 🟢 Green: ≥0.8 (pass/high score)
+ - 🟡 Yellow: 0.6-0.8 (warning/medium score)
+ - 🔴 Red: <0.6 (fail/low score)
+
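The sparklines and trend arrows are derived purely from the score series. A sketch of the sparkline part (`sparkline` is an illustrative helper, not the CLI's internal code):

```typescript
const BLOCKS = ["▁", "▂", "▃", "▄", "▅", "▆", "▇", "█"];

// Map each score onto one of eight block characters, scaled to the series range.
function sparkline(scores: number[]): string {
  if (scores.length === 0) return "";
  const min = Math.min(...scores);
  const range = Math.max(...scores) - min || 1;
  return scores
    .map((s) => {
      const bucket = Math.floor(((s - min) / range) * BLOCKS.length);
      return BLOCKS[Math.min(bucket, BLOCKS.length - 1)];
    })
    .join("");
}

// sparkline([0.89, 0.92, 0.94]) → "▁▅█" (exact shape depends on the range)
```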
+ ### `swarm eval run` (Stub)
+
+ Placeholder for future direct eval execution from CLI.
+
+ ---
+
+ ## CI Integration
+
+ ### GitHub Actions Workflow
+
+ Progressive gates integrate with CI for PR checks:
+
+ ```yaml
+ # .github/workflows/eval-check.yml
+ name: Eval Quality Gate
+
+ on:
+   pull_request:
+     branches: [main]
+
+ jobs:
+   eval-gate:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v4
+
+       - name: Setup Bun
+         uses: oven-sh/setup-bun@v1
+
+       - name: Install deps
+         run: bun install
+
+       - name: Run evals
+         run: bun run eval:run
+
+       - name: Check gates
+         run: |
+           # Check if any eval failed production gate
+           swarm eval status | grep "FAIL" && exit 1 || exit 0
+
+       - name: Post PR comment
+         if: failure()
+         uses: actions/github-script@v7
+         with:
+           script: |
+             // Post detailed gate failure to PR
+             const status = await exec.getExecOutput('swarm eval status');
+             github.rest.issues.createComment({
+               issue_number: context.issue.number,
+               body: `## ❌ Eval Gate Failed\n\n\`\`\`\n${status.stdout}\n\`\`\``
+             });
+ ```
  
- ## Writing Evals
+ **Gate behavior in CI:**
+ - Bootstrap: Always pass (collecting data)
+ - Stabilization: Pass but warn on >10% regression
+ - Production: **FAIL PR** on >5% regression
+
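The `grep "FAIL"` step above depends on the CLI's output text. A slightly more robust variant is to read `.opencode/eval-history.jsonl` directly and apply the documented regression formula; a sketch of such a check (run with Bun; the variance check from the phase table is omitted for brevity):

```typescript
// ci-gate-check.ts - exit 1 if any mature eval regresses >5% from its baseline.
const text = await Bun.file(".opencode/eval-history.jsonl").text();
const records = text
  .split("\n")
  .filter((line) => line.trim().length > 0)
  .map((line) => JSON.parse(line) as { eval_name: string; score: number });

const scoresByEval = new Map<string, number[]>();
for (const record of records) {
  const scores = scoresByEval.get(record.eval_name) ?? [];
  scores.push(record.score);
  scoresByEval.set(record.eval_name, scores);
}

let failed = false;
for (const [name, scores] of scoresByEval) {
  if (scores.length <= 50) continue; // bootstrap/stabilization: no hard failure
  const current = scores[scores.length - 1];
  const baseline = scores.reduce((a, b) => a + b, 0) / scores.length;
  const regression = (baseline - current) / baseline;
  if (regression > 0.05) {
    console.error(`${name}: ${(regression * 100).toFixed(1)}% below baseline`);
    failed = true;
  }
}
process.exit(failed ? 1 : 0);
```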
+ ---
+
+ ## Writing New Evals
+
+ ### 1. Create Eval File
  
  ```typescript
+ // evals/my-feature.eval.ts
  import { evalite } from "evalite";
- import { subtaskIndependence, coverageCompleteness } from "./scorers/index.js";
+ import { createScorer } from "evalite";
+
+ // Define your scorer
+ const myScorer = createScorer({
+   name: "My Quality Metric",
+   description: "Checks if feature meets quality bar",
+   scorer: async ({ output, expected, input }) => {
+     // Implement scoring logic
+     const score = /* calculate 0-1 score */;
+     return {
+       score,
+       message: "Details about score"
+     };
+   },
+ });
  
- evalite("My decomposition test", {
+ // Define the eval
+ evalite("My Feature Quality", {
    data: async () => {
+     // Load test cases
      return [
        {
-         input: "Add OAuth authentication",
-         expected: {
-           minSubtasks: 3,
-           maxSubtasks: 6,
-           requiredFiles: ["src/auth/oauth.ts", "src/middleware/auth.ts"],
-         },
+         input: "test input",
+         expected: { /* expected structure */ },
        },
      ];
    },
    task: async (input) => {
-     // Call your decomposition logic here
-     // Should return CellTree JSON as string
+     // Call your system under test
+     const output = await myFeature(input);
+     return output;
    },
-   scorers: [subtaskIndependence, coverageCompleteness],
+   scorers: [myScorer],
  });
  ```
  
- ## CellTree Format
416
+ ### 2. Add to Package Scripts
417
+
418
+ ```json
419
+ {
420
+ "scripts": {
421
+ "eval:my-feature": "bunx evalite run evals/my-feature.eval.ts"
422
+ }
423
+ }
424
+ ```
425
+
426
+ ### 3. Add Capture Points
81
427
 
82
- Scorers expect output as JSON string matching:
428
+ Wire your feature to capture real execution data:
83
429
 
84
430
  ```typescript
85
- {
86
- epic: {
87
- title: string;
88
- description: string;
431
+ import { captureMyFeatureEvent } from "./eval-capture.js";
432
+
433
+ async function myFeature(input) {
434
+ const startTime = Date.now();
435
+
436
+ try {
437
+ const result = await doWork(input);
438
+
439
+ // Capture success
440
+ captureMyFeatureEvent({
441
+ input,
442
+ output: result,
443
+ duration_ms: Date.now() - startTime,
444
+ success: true,
445
+ });
446
+
447
+ return result;
448
+ } catch (error) {
449
+ // Capture failure
450
+ captureMyFeatureEvent({
451
+ input,
452
+ error: error.message,
453
+ duration_ms: Date.now() - startTime,
454
+ success: false,
455
+ });
456
+ throw error;
89
457
  }
90
- subtasks: Array<{
91
- title: string;
92
- description?: string;
93
- files?: string[];
94
- estimated_complexity?: number; // 1-3
95
- }>;
96
458
  }
97
459
  ```
98
460
 
99
- ## Fixtures
461
+ ### 4. Test Locally
100
462
 
101
- See `fixtures/decomposition-cases.ts` for example test cases covering:
463
+ ```bash
464
+ # Run your eval
465
+ bun run eval:my-feature
102
466
 
103
- - OAuth implementation
104
- - Rate limiting
105
- - TypeScript migration
106
- - Admin dashboard
107
- - Memory leak debugging
108
- - Feature flag system
467
+ # Check status
468
+ swarm eval status my-feature
109
469
 
110
- ## Coordinator Session Eval
470
+ # View history
471
+ swarm eval history
472
+ ```
111
473
 
112
- ### coordinator-session.eval.ts
474
+ ---
113
475
 
114
- Scores coordinator discipline during swarm sessions.
476
+ ## Scorer Reference
115
477
 
116
- **Data Sources:**
117
- - Real captured sessions from `~/.config/swarm-tools/sessions/*.jsonl`
118
- - Synthetic fixtures from `fixtures/coordinator-sessions.ts`
478
+ ### Scorer Pattern (Evalite v1.0)
119
479
 
120
- **Scorers:**
121
- - `violationCount` - Protocol violations (edit files, run tests, reserve files)
122
- - `spawnEfficiency` - Workers spawned / subtasks planned
123
- - `reviewThoroughness` - Reviews completed / workers finished
124
- - `timeToFirstSpawn` - Speed from decomposition to first worker spawn
125
- - `overallDiscipline` - Weighted composite score
126
-
127
- **Fixtures:**
128
- - `perfectCoordinator` - 0 violations, 100% spawn, 100% review, 30s to spawn
129
- - `badCoordinator` - 5 violations, 33% spawn, 0% review, 10min to spawn
130
- - `decentCoordinator` - 1 violation, 100% spawn, 50% review, 45s to spawn
131
-
132
- **Run:**
480
+ **IMPORTANT**: Evalite scorers are **async functions**, not objects with `.scorer` property.
481
+
482
+ ```typescript
483
+ import { createScorer } from "evalite";
484
+
485
+ // CORRECT
486
+ const myScorer = createScorer({
487
+ name: "My Scorer",
488
+ description: "What it measures",
489
+ scorer: async ({ output, expected, input }) => {
490
+ return { score: 0.8, message: "Details" };
491
+ },
492
+ });
493
+
494
+ // Use in eval
495
+ evalite("test", {
496
+ scorers: [myScorer], // Pass the scorer directly
497
+ });
498
+
499
+ // In composite scorers
500
+ const result = await childScorer({ output, expected, input });
501
+ const score = result.score ?? 0;
502
+ ```
503
+
504
+ **WRONG ❌**:
505
+ ```typescript
506
+ // Don't do this - .scorer property doesn't exist
507
+ const result = childScorer.scorer({ output, expected }); // ❌
508
+ ```
509
+
510
+ ### Custom Scorer Template
511
+
512
+ ```typescript
513
+ export const myCustomScorer = createScorer({
514
+ name: "My Custom Metric",
515
+ description: "Detailed description of what this measures and why it matters",
516
+ scorer: async ({ output, expected, input }) => {
517
+ // 1. Parse output
518
+ let data;
519
+ try {
520
+ data = typeof output === "string" ? JSON.parse(output) : output;
521
+ } catch {
522
+ return { score: 0, message: "Invalid output format" };
523
+ }
524
+
525
+ // 2. Calculate score (0-1 range)
526
+ const score = calculateYourMetric(data, expected);
527
+
528
+ // 3. Return with detailed message
529
+ return {
530
+ score,
531
+ message: `Score: ${score.toFixed(2)} - ${getExplanation(score)}`,
532
+ };
533
+ },
534
+ });
535
+ ```
536
+
537
+ ---
538
+
539
+ ## Troubleshooting
540
+
541
+ ### "No eval history found"
542
+
543
+ **Cause:** Haven't run any evals yet or `.opencode/eval-history.jsonl` missing.
544
+
545
+ **Fix:**
133
546
  ```bash
134
- bunx evalite run evals/coordinator-session.eval.ts
547
+ # Run an eval to create history
548
+ bun run eval:decomposition
549
+ swarm eval status # Should show bootstrap phase
550
+ ```
551
+
552
+ ### "Phase stuck in stabilization despite >50 runs"
553
+
554
+ **Cause:** High variance (≥0.1). Scores not consistent enough for production phase.
555
+
556
+ **Fix:** Investigate why scores fluctuate:
557
+ ```bash
558
+ # Check variance
559
+ swarm eval status my-eval # Shows variance value
560
+
561
+ # View score history to spot outliers
562
+ swarm eval history
563
+
564
+ # Common causes:
565
+ # - Eval depends on external state (network, filesystem)
566
+ # - Non-deterministic scoring logic
567
+ # - Input data changing between runs
135
568
  ```
136
569
 
137
- ## Data Loaders
570
+ ### "Gate failing on minor changes"
571
+
572
+ **Cause:** Production phase threshold (5%) too strict for your use case.
573
+
574
+ **Fix:** Adjust threshold in eval code:
575
+ ```typescript
576
+ import { checkGate } from "../src/eval-gates.js";
577
+
578
+ const result = checkGate(projectPath, evalName, score, {
579
+ productionThreshold: 0.10, // 10% instead of 5%
580
+ });
581
+ ```
582
+
583
+ ### "Evalite not finding my eval file"
584
+
585
+ **Cause:** File not matching `*.eval.ts` pattern or not in `evals/` directory.
586
+
587
+ **Fix:**
588
+ ```bash
589
+ # Ensure file is named correctly
590
+ mv evals/my-test.ts evals/my-test.eval.ts
591
+
592
+ # Verify discovery
593
+ bunx evalite run evals/ # Should list your eval
594
+ ```
595
+
596
+ ### "Scorers returning undefined"
597
+
598
+ **Cause:** Forgot to `await` async scorers or accessing `.scorer` property (doesn't exist).
599
+
600
+ **Fix:**
601
+ ```typescript
602
+ // CORRECT ✅
603
+ const result = await myScorer({ output, expected, input });
604
+ const score = result.score ?? 0;
605
+
606
+ // WRONG ❌
607
+ const result = myScorer.scorer({ output, expected }); // .scorer doesn't exist
608
+ ```
609
+
610
+ ---
611
+
612
+ ## File Structure
613
+
614
+ ```
615
+ evals/
616
+ ├── README.md # This file
617
+ ├── evalite.config.ts # Evalite configuration
618
+
619
+ ├── fixtures/
620
+ │ ├── decomposition-cases.ts # Test cases for decomposition
621
+ │ ├── coordinator-sessions.ts # Known good/bad coordinator sessions
622
+ │ └── compaction-prompts.ts # Sample compaction prompts
623
+
624
+ ├── lib/
625
+ │ ├── data-loader.ts # Load eval data from JSONL files
626
+ │ └── test-helpers.ts # Shared test utilities
627
+
628
+ ├── scorers/
629
+ │ ├── decomposition-scorers.ts # Subtask quality scorers
630
+ │ ├── coordinator-scorers.ts # Protocol adherence scorers
631
+ │ └── compaction-prompt-scorers.ts # Prompt quality scorers
632
+
633
+ ├── swarm-decomposition.eval.ts # Decomposition quality eval
634
+ ├── coordinator-session.eval.ts # Coordinator discipline eval
635
+ ├── compaction-prompt.eval.ts # Compaction prompt quality eval
636
+ └── example.eval.ts # Sanity check / template
637
+ ```
138
638
 
139
- ### lib/data-loader.ts
639
+ **Data locations:**
640
+ - `.opencode/eval-data.jsonl` - Decomposition captures
641
+ - `.opencode/eval-history.jsonl` - Score history
642
+ - `~/.config/swarm-tools/sessions/*.jsonl` - Coordinator sessions
140
643
 
141
- Loads eval data from multiple sources:
644
+ ---
142
645
 
143
- - `loadEvalCases()` - PGlite eval_records table
144
- - `loadCapturedSessions()` - Real coordinator sessions from `~/.config/swarm-tools/sessions/`
145
- - `hasRealEvalData()` - Check if enough real data exists
146
- - `getEvalDataSummary()` - Stats about available eval data
646
+ ## Further Reading
147
647
 
148
- ## Notes
648
+ - **[Evalite Docs](https://evalite.dev)** - Evaluation framework
649
+ - **[Progressive Gates Implementation](../src/eval-gates.ts)** - Phase-based quality control
650
+ - **[Learning Feedback Loop](../src/eval-learning.ts)** - Auto-store failures to memory
651
+ - **[Data Capture](../src/eval-capture.ts)** - Real execution tracking
652
+ - **[Compaction Scorers](../src/compaction-prompt-scoring.ts)** - Pure scoring functions
149
653
 
150
- - Evalite v1.0.0-beta.15 installed
151
- - Built on Vitest
152
- - Runs locally, no API keys required
153
- - Results cached in `node_modules/.evalite/`
154
- - Clear cache if needed: `rm -rf node_modules/.evalite`
654
+ > _"Measure outcomes, not outputs. The system that learns from failure beats the system that avoids it."_
655
+ > Inspired by Site Reliability Engineering principles