opencode-swarm-plugin 0.39.1 → 0.42.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (82)
  1. package/.hive/analysis/eval-failure-analysis-2025-12-25.md +331 -0
  2. package/.hive/analysis/session-data-quality-audit.md +320 -0
  3. package/.hive/eval-results.json +481 -24
  4. package/.hive/issues.jsonl +76 -11
  5. package/.hive/memories.jsonl +159 -1
  6. package/.opencode/eval-history.jsonl +315 -0
  7. package/.turbo/turbo-build.log +5 -5
  8. package/CHANGELOG.md +207 -0
  9. package/README.md +2 -0
  10. package/SCORER-ANALYSIS.md +598 -0
  11. package/bin/eval-gate.test.ts +158 -0
  12. package/bin/eval-gate.ts +74 -0
  13. package/bin/swarm.test.ts +1054 -719
  14. package/bin/swarm.ts +577 -0
  15. package/dist/compaction-hook.d.ts +10 -1
  16. package/dist/compaction-hook.d.ts.map +1 -1
  17. package/dist/compaction-observability.d.ts +173 -0
  18. package/dist/compaction-observability.d.ts.map +1 -0
  19. package/dist/compaction-prompt-scoring.d.ts +1 -0
  20. package/dist/compaction-prompt-scoring.d.ts.map +1 -1
  21. package/dist/eval-capture.d.ts +93 -0
  22. package/dist/eval-capture.d.ts.map +1 -1
  23. package/dist/eval-runner.d.ts +134 -0
  24. package/dist/eval-runner.d.ts.map +1 -0
  25. package/dist/hive.d.ts.map +1 -1
  26. package/dist/index.d.ts +65 -1
  27. package/dist/index.d.ts.map +1 -1
  28. package/dist/index.js +84043 -28070
  29. package/dist/memory-tools.d.ts +70 -2
  30. package/dist/memory-tools.d.ts.map +1 -1
  31. package/dist/memory.d.ts +37 -0
  32. package/dist/memory.d.ts.map +1 -1
  33. package/dist/observability-tools.d.ts +64 -0
  34. package/dist/observability-tools.d.ts.map +1 -1
  35. package/dist/plugin.js +83570 -27466
  36. package/dist/schemas/task.d.ts +3 -3
  37. package/dist/swarm-orchestrate.d.ts.map +1 -1
  38. package/dist/swarm-prompts.d.ts +32 -1
  39. package/dist/swarm-prompts.d.ts.map +1 -1
  40. package/docs/planning/ADR-009-oh-my-opencode-patterns.md +353 -0
  41. package/evals/ARCHITECTURE.md +1189 -0
  42. package/evals/README.md +113 -0
  43. package/evals/example.eval.ts +3 -4
  44. package/evals/fixtures/compaction-prompt-cases.ts +6 -0
  45. package/evals/scorers/coordinator-discipline.evalite-test.ts +163 -0
  46. package/evals/scorers/coordinator-discipline.ts +82 -2
  47. package/evals/scorers/index.test.ts +146 -0
  48. package/evals/scorers/index.ts +104 -0
  49. package/evals/swarm-decomposition.eval.ts +13 -4
  50. package/examples/commands/swarm.md +291 -21
  51. package/package.json +4 -3
  52. package/src/compaction-hook.ts +258 -110
  53. package/src/compaction-observability.integration.test.ts +139 -0
  54. package/src/compaction-observability.test.ts +187 -0
  55. package/src/compaction-observability.ts +324 -0
  56. package/src/compaction-prompt-scorers.test.ts +10 -9
  57. package/src/compaction-prompt-scoring.ts +7 -5
  58. package/src/eval-capture.test.ts +204 -1
  59. package/src/eval-capture.ts +194 -2
  60. package/src/eval-runner.test.ts +223 -0
  61. package/src/eval-runner.ts +402 -0
  62. package/src/hive.ts +57 -22
  63. package/src/index.ts +54 -1
  64. package/src/memory-tools.test.ts +84 -0
  65. package/src/memory-tools.ts +68 -3
  66. package/src/memory.test.ts +2 -2
  67. package/src/memory.ts +122 -49
  68. package/src/observability-tools.test.ts +13 -0
  69. package/src/observability-tools.ts +277 -0
  70. package/src/swarm-orchestrate.test.ts +162 -0
  71. package/src/swarm-orchestrate.ts +7 -5
  72. package/src/swarm-prompts.test.ts +168 -4
  73. package/src/swarm-prompts.ts +228 -7
  74. package/.env +0 -2
  75. package/.turbo/turbo-test.log +0 -481
  76. package/.turbo/turbo-typecheck.log +0 -1
  77. package/dist/beads.d.ts +0 -386
  78. package/dist/beads.d.ts.map +0 -1
  79. package/dist/schemas/bead-events.d.ts +0 -698
  80. package/dist/schemas/bead-events.d.ts.map +0 -1
  81. package/dist/schemas/bead.d.ts +0 -255
  82. package/dist/schemas/bead.d.ts.map +0 -1
@@ -0,0 +1,331 @@
1
+ # Eval Failure Analysis Report
2
+ **Date:** 2025-12-25
3
+ **Analyst:** BrightStar
4
+ **Cell:** opencode-swarm-plugin--ys7z8-mjlk7jsl4tt
5
+ **Epic:** opencode-swarm-plugin--ys7z8-mjlk7js9bt1
6
+
7
+ ## Executive Summary
8
+
9
+ Two eval failures analyzed:
10
+ - **example.eval.ts**: 0% score - structural bug in eval setup
11
+ - **compaction-prompt.eval.ts**: 53% score - case sensitivity + missing forbidden tools
12
+
13
+ Both are fixable with code changes. No test data quality issues.
14
+
15
+ ---
16
+
17
+ ## example.eval.ts - 0% Score
18
+
19
+ ### Status
20
+ ❌ **CRITICAL** - Complete failure (0%)
21
+
22
+ ### Root Cause
23
+ **Eval structure mismatch** between data provider and task function.
24
+
25
+ ### Technical Details
26
+
27
+ **File:** `evals/example.eval.ts`
28
+ **Lines:** 14-30
29
+
30
+ The eval has a fundamental flow error:
31
+
32
+ ```typescript
33
+ // Line 14-26: data() provides BOTH input AND expected output
34
+ data: async () => {
35
+ return [
36
+ {
37
+ input: "Test task", // ← String for task function
38
+ output: JSON.stringify({ // ← Expected output (ignored!)
39
+ epic: { title: "Test Epic", ... },
40
+ subtasks: [...]
41
+ }),
42
+ },
43
+ ];
44
+ },
45
+
46
+ // Line 28-30: task() does passthrough
47
+ task: async (input) => {
48
+ return input; // ← Returns "Test task" string, NOT the CellTree
49
+ },
50
+
51
+ // Line 31: Scorer expects CellTree JSON
52
+ scorers: [subtaskIndependence],
53
+ ```
54
+
55
+ **What happens:**
56
+ 1. Evalite passes `input` ("Test task") to task function
57
+ 2. Task returns "Test task" string unchanged
58
+ 3. Scorer `subtaskIndependence` receives "Test task"
59
+ 4. Scorer tries to parse as CellTree JSON → **FAILS**
60
+ 5. Score: 0%
61
+
62
+ The `output` field in `data()` is ignored by Evalite - it's the `task()` return value that gets scored.
63
+
64
+ ### Impact
65
+ - Example eval is useless for validation
66
+ - False signal that scorer infrastructure is broken (it's not)
67
+ - Wastes CI time
68
+
69
+ ### Proposed Fix
70
+
71
+ **Option 1: Remove output from data (recommended)**
72
+ ```typescript
73
+ data: async () => {
74
+ return [
75
+ {
76
+ input: {
77
+ epic: { title: "Test Epic", description: "Test" },
78
+ subtasks: [
79
+ { title: "Subtask 1", files: ["a.ts"], estimated_complexity: 1 },
80
+ { title: "Subtask 2", files: ["b.ts"], estimated_complexity: 1 },
81
+ ],
82
+ },
83
+ },
84
+ ];
85
+ },
86
+
87
+ task: async (input) => {
88
+ return JSON.stringify(input); // Stringify the CellTree
89
+ },
90
+ ```
91
+
92
+ **Option 2: Fix task to use output**
93
+ ```typescript
94
+ // Keep data() as-is, but fix task:
95
+ task: async (input, context) => {
96
+ return context.expected.output; // Use the output from data()
97
+ },
98
+ ```
99
+
100
+ Option 1 is cleaner - task functions should generate output, not just pass through.
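+
+ Putting Option 1 together, a minimal end-to-end sketch of the repaired eval could look like the following (the `evalite` entry point and the import path for `subtaskIndependence` are assumptions based on the snippets above, not verified against the package):
+
+ ```typescript
+ import { evalite } from "evalite";
+ // Import path is illustrative - point it at wherever subtaskIndependence actually lives.
+ import { subtaskIndependence } from "./scorers";
+
+ evalite("Example: subtask independence", {
+   data: async () => [
+     {
+       // The CellTree is the INPUT now; there is no unused `output` field.
+       input: {
+         epic: { title: "Test Epic", description: "Test" },
+         subtasks: [
+           { title: "Subtask 1", files: ["a.ts"], estimated_complexity: 1 },
+           { title: "Subtask 2", files: ["b.ts"], estimated_complexity: 1 },
+         ],
+       },
+     },
+   ],
+   // task() produces the string the scorer parses as CellTree JSON.
+   task: async (input) => JSON.stringify(input),
+   scorers: [subtaskIndependence],
+ });
+ ```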
101
+
102
+ ---
103
+
104
+ ## compaction-prompt.eval.ts - 53% Score
105
+
106
+ ### Status
107
+ ⚠️ **DEGRADED** - Below target (53% vs 100% historical)
108
+
109
+ ### Root Causes
110
+
111
+ #### RC1: Case-Sensitive Forbidden Tool Patterns (15% weight)
112
+
113
+ **File:** `src/compaction-prompt-scoring.ts`
114
+ **Lines:** 213-218
115
+
116
+ ```typescript
117
+ const forbiddenTools = [
118
+ /\bEdit\b/, // ← Requires capital E
119
+ /\bWrite\b/, // ← Requires capital W
120
+ /swarmmail_reserve/,
121
+ /git commit/,
122
+ ];
123
+ ```
124
+
125
+ **File:** `evals/fixtures/compaction-prompt-cases.ts`
126
+ **Lines:** 76-83 (perfect fixture)
127
+
128
+ ```
129
+ - edit // ← lowercase e
130
+ - write // ← lowercase w
131
+ - bash (for file modifications)
132
+ ```
133
+
134
+ **Evidence:**
135
+ ```javascript
136
+ /\bEdit\b/.test("- Edit") // ✅ true
137
+ /\bEdit\b/.test("- edit")   // ❌ false (case mismatch - the word boundary is fine, the case is not)
138
+ ```
139
+
140
+ **Impact:**
141
+ - Perfect fixture: 0/4 forbidden tools matched
142
+ - Forbidden tools scorer: 0% (should be 50-100% once case sensitivity is fixed - see RC2)
143
+ - Overall impact: 15% of total score lost
144
+
145
+ #### RC2: Missing Forbidden Tools (15% weight)
146
+
147
+ Scorer expects **4 tools**:
148
+ 1. Edit (or edit)
149
+ 2. Write (or write)
150
+ 3. swarmmail_reserve
151
+ 4. git commit
152
+
153
+ Perfect fixture has **3 tools** (and case mismatch):
154
+ 1. edit ❌ (lowercase)
155
+ 2. write ❌ (lowercase)
156
+ 3. bash ❌ (not in scorer's list)
157
+
158
+ Missing: swarmmail_reserve, git commit
159
+
160
+ **Impact:**
161
+ - Even if case fixed, still only 2/4 tools = 50% on this scorer
162
+ - Weighted: 50% × 15% = 7.5% contribution (should be 15%)
163
+
164
+ #### RC3: "bash" Not in Scorer's List
165
+
166
+ The fixtures mention "bash (for file modifications)" as forbidden, but the scorer doesn't check for it.
167
+ This creates a 3-way mismatch:
168
+ - Fixture lists: edit, write, bash
169
+ - Scorer checks: Edit, Write, swarmmail_reserve, git commit
170
+ - Overlap: 0 tools (due to case)
171
+
172
+ ### Score Breakdown - Perfect Fixture
173
+
174
+ Expected (if 100%):
175
+ ```
176
+ epicIdSpecificity: 20% × 1.0 = 20%
177
+ actionability: 20% × 1.0 = 20%
178
+ coordinatorIdentity: 25% × 1.0 = 25%
179
+ forbiddenToolsPresent: 15% × 1.0 = 15%
180
+ postCompactionDiscipline: 20% × 1.0 = 20%
181
+ ─────
182
+ TOTAL: 100%
183
+ ```
184
+
185
+ Actual (current):
186
+ ```
187
+ epicIdSpecificity: 20% × 1.0 = 20% ✅
188
+ actionability: 20% × 1.0 = 20% ✅
189
+ coordinatorIdentity: 25% × 1.0 = 25% ✅
190
+ forbiddenToolsPresent: 15% × 0.0 = 0% ❌ (0/4 matched)
191
+ postCompactionDiscipline: 20% × 1.0 = 20% ✅
192
+ ─────
193
+ TOTAL: 85%
194
+ ```
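+
+ For reference, that total is just a weighted sum. A minimal sketch of the arithmetic (weights taken from the breakdown above; the object shape and composite function are illustrative, not the package's actual scoring code):
+
+ ```typescript
+ // Weights from the breakdown above - assumed to sum to 1.0.
+ const weights = {
+   epicIdSpecificity: 0.20,
+   actionability: 0.20,
+   coordinatorIdentity: 0.25,
+   forbiddenToolsPresent: 0.15,
+   postCompactionDiscipline: 0.20,
+ };
+
+ // Per-scorer results for the "perfect" fixture under the current scorer.
+ const results: Record<keyof typeof weights, number> = {
+   epicIdSpecificity: 1.0,
+   actionability: 1.0,
+   coordinatorIdentity: 1.0,
+   forbiddenToolsPresent: 0.0, // 0/4 forbidden tools matched (RC1 + RC2)
+   postCompactionDiscipline: 1.0,
+ };
+
+ const composite = (Object.keys(weights) as (keyof typeof weights)[])
+   .reduce((sum, name) => sum + weights[name] * results[name], 0);
+ console.log(composite); // ≈ 0.85 - the 85% figure above
+ ```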
195
+
196
+ The perfect fixture alone scores 85% under the current scorer, but the overall eval is 53%.
197
+ This means the 5 "bad" fixtures are pulling average down further (expected behavior).
198
+
199
+ ### Historical Context
200
+
201
+ Semantic memory claims a 100% score was achieved previously. Likely scenarios:
202
+ 1. **Never actually ran** - aspiration documented before implementation
203
+ 2. **Ran with different fixtures** - fixtures were updated after scorer was written
204
+ 3. **Scorer was case-insensitive before** - regression in recent commit aa12943
205
+
206
+ Commit aa12943 (2025-12-24) added the eval infrastructure. This is brand new code.
207
+
208
+ ### Proposed Fixes
209
+
210
+ #### Fix 1: Make Scorer Case-Insensitive (Recommended)
211
+
212
+ **File:** `src/compaction-prompt-scoring.ts`
213
+ **Lines:** 213-218
214
+
215
+ ```typescript
216
+ const forbiddenTools = [
217
+ /\bedit\b/i, // Case insensitive with 'i' flag
218
+ /\bwrite\b/i, // Case insensitive
219
+ /\bbash\b/i, // Add bash (was missing)
220
+ /swarmmail_reserve/i, // Keep, add 'i' for safety
221
+ /git commit/i, // Keep, add 'i' for safety
222
+ ];
223
+ ```
224
+
225
+ **Rationale:**
226
+ - Coordinators might capitalize differently in prompts
227
+ - Real prompts won't always match exact case
228
+ - More robust matching
229
+
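+ A quick unit-test sketch for the case-insensitive patterns (assuming a Vitest/Bun-style `describe`/`it`/`expect` API, which the package's other `*.test.ts` files appear to use; the import source is a placeholder):
+
+ ```typescript
+ import { describe, expect, it } from "vitest"; // or "bun:test"
+
+ const forbiddenTools = [/\bedit\b/i, /\bwrite\b/i, /swarmmail_reserve/i, /git commit/i];
+
+ describe("forbidden tool patterns", () => {
+   it("matches tool names regardless of case", () => {
+     for (const line of ["- Edit", "- edit", "- Write", "- write"]) {
+       expect(forbiddenTools.some((re) => re.test(line))).toBe(true);
+     }
+   });
+
+   it("still matches the snake_case and multi-word tools", () => {
+     expect(forbiddenTools.some((re) => re.test("swarmmail_reserve"))).toBe(true);
+     expect(forbiddenTools.some((re) => re.test("git commit"))).toBe(true);
+   });
+ });
+ ```
+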
230
+ #### Fix 2: Update Fixtures to Match Scorer (Alternative)
231
+
232
+ **File:** `evals/fixtures/compaction-prompt-cases.ts`
233
+ **Lines:** 76-83 (and all other fixtures)
234
+
235
+ ```
236
+ - Edit // Capital E
237
+ - Write // Capital W
238
+ - bash (for file modifications) // Keep or remove
239
+ - swarmmail_reserve // ADD
240
+ - git commit // ADD
241
+ ```
242
+
243
+ **Rationale:**
244
+ - Keeps scorer strict (may catch real case issues)
245
+ - Makes fixtures comprehensive (all 5 tools)
246
+ - More explicit about what's forbidden
247
+
248
+ #### Fix 3: Hybrid (Best of Both)
249
+
250
+ 1. Make scorer case-insensitive (Fix 1)
251
+ 2. Update fixtures to include all 5 tools (Fix 2)
252
+ 3. Remove "bash" from fixtures if not in coordinator forbidden list
253
+
254
+ ```typescript
255
+ // Scorer (5 tools, case-insensitive):
256
+ const forbiddenTools = [
257
+ /\bedit\b/i,
258
+ /\bwrite\b/i,
259
+ /swarmmail_reserve/i,
260
+ /git\s+commit/i,
261
+ /\bread\b/i, // Consider adding - coordinators shouldn't read, should check status
262
+ ];
263
+ ```
264
+
265
+ ```
266
+ // Fixture:
267
+ - Edit
268
+ - Write
269
+ - swarmmail_reserve (only workers reserve files)
270
+ - git commit (workers commit their changes)
271
+ ```
272
+
273
+ ### Risk Assessment
274
+
275
+ **If we fix this, will scores jump to 100%?**
276
+
277
+ **Perfect fixture:** 85% → 100% (if all 4 tools matched)
278
+ **Other fixtures:** Depends on their issues
279
+
280
+ Looking at fixture expected values:
281
+ - Fixture 0 (perfect): Should be 100%
282
+ - Fixture 1 (placeholder): Should fail (expected)
283
+ - Fixture 2 (generic): Should fail (expected)
284
+ - Fixture 3 (weak identity): Should partially fail (expected)
285
+ - Fixture 4 (missing forbidden): Should fail on forbidden tools only
286
+ - Fixture 5 (wrong first tool): Should fail on discipline only
287
+
288
+ Average across 6 fixtures: ~66% expected (not 100%)
289
+
290
+ **So 53% → ~70-80%** is realistic after fixes (not 100%).
291
+
292
+ To get higher scores, we would need to fix issues in the bad fixtures too, but those are SUPPOSED to fail.
293
+ The scorer is working correctly on those.
294
+
295
+ ---
296
+
297
+ ## Recommendations
298
+
299
+ ### Immediate Actions (P0)
300
+
301
+ 1. **Fix example.eval.ts structure** - 5 min fix, unblocks that eval
302
+ 2. **Make forbidden tools case-insensitive** - 5 min fix, +15-20% score boost
303
+ 3. **Add missing tools to fixtures** - 10 min, comprehensive coverage
304
+
305
+ ### Medium-term Actions (P1)
306
+
307
+ 4. **Verify 100% claim in semantic memory** - Check if historical data exists
308
+ 5. **Document scorer expectations** - Add comments to fixtures explaining weights
309
+ 6. **Add unit tests for scorers** - Test edge cases independently
310
+
311
+ ### Long-term Actions (P2)
312
+
313
+ 7. **Consider LLM-as-judge for semantic checks** - Case-insensitive by nature
314
+ 8. **Add visual diff in eval output** - Show what's missing from prompts
315
+ 9. **Create eval dashboard** - Track scores over time, detect regressions
316
+
317
+ ---
318
+
319
+ ## Conclusion
320
+
321
+ Both evals have **code bugs, not test data issues**:
322
+ - example.eval.ts: Structural bug (task/data mismatch)
323
+ - compaction-prompt.eval.ts: Case sensitivity + incomplete tool list
324
+
325
+ Fixes are straightforward and low-risk. After fixes, expect:
326
+ - example.eval.ts: 0% → 100%
327
+ - compaction-prompt.eval.ts: 53% → 70-80%
328
+
329
+ The 100% historical score in semantic memory is likely aspirational - these evals are brand new (commit aa12943, Dec 24).
330
+
331
+ **Ready to implement fixes or escalate for review?**
@@ -0,0 +1,320 @@
1
+ # Session Data Quality Audit Report
2
+
3
+ **Date:** 2025-12-25
4
+ **Cell:** opencode-swarm-plugin--ys7z8-mjlk7jspacf
5
+ **Agent:** WildDawn
6
+
7
+ ## Executive Summary
8
+
9
+ Investigation of why only 3 of 102 sessions (2.9%) pass the coordinator-session eval filter reveals:
10
+
11
+ 1. **Filter is working as designed** - correctly isolating high-quality complete coordinator sessions
12
+ 2. **Data quality is actually GOOD** - the 3 passing sessions are gold-standard examples
13
+ 3. **97% filtered out is EXPECTED** - most sessions are worker completions, not coordinator sessions
14
+ 4. **Filter may be too strict** for broad coordinator behavior analysis (needs tuning)
15
+
16
+ ## Data Breakdown
17
+
18
+ ### Total Sessions: 102
19
+
20
+ | Category | Count | % | Description |
21
+ |----------|-------|---|-------------|
22
+ | **Single-event sessions** | 70 | 68.6% | Worker completions (subtask_success), isolated reviews |
23
+ | **Multi-event incomplete** | 29 | 28.4% | Coordinator sessions that didn't complete full cycle |
24
+ | **Passing sessions** | 3 | 2.9% | Complete coordinator cycles with spawn + review |
25
+
26
+ ### Single-Event Sessions (70 sessions - 68.6%)
27
+
28
+ **Event Type Breakdown:**
29
+ - `OUTCOME/subtask_success`: 56 (80.0%) - **Worker completions, not coordinator sessions**
30
+ - `DECISION/review_completed`: 12 (17.1%) - Isolated review events
31
+ - `DECISION/worker_spawned`: 2 (2.9%) - Isolated spawn events
32
+
33
+ **Analysis:** These are **NOT coordinator sessions**. They're worker agents reporting completion or isolated coordinator actions captured in separate session files.
34
+
35
+ ### Multi-Event Failures (29 sessions - 28.4%)
36
+
37
+ **Failure Breakdown:**
38
+ - **No worker_spawned event**: 20 sessions
39
+ - Review-only sessions (3-22 events, all `review_completed`)
40
+   - These appear to be test data or session captures split across files
41
+ - **Has worker_spawned but no review_completed**: 5 sessions
42
+ - Incomplete coordinator sessions (4-24 events)
43
+ - Coordinator spawned workers but reviews weren't captured (yet)
44
+ - **Too few events (<3)**: 4 sessions
45
+ - Aborted early
46
+
47
+ **Key Finding:** None of these 29 sessions have `decomposition_complete` events. This suggests:
48
+ 1. Session capture may not be recording decomposition events
49
+ 2. OR coordinator sessions span multiple session files
50
+ 3. OR these are partial captures from long-running coordinators
51
+
52
+ ### Passing Sessions (3 sessions - 2.9%)
53
+
54
+ #### ses_4b86f0867ffeXKv95ktf31igfD
55
+ - **Events:** 33
56
+ - **Worker spawns:** 20
57
+ - **Reviews completed:** 13
58
+ - **Violations:** 0
59
+ - **Duration:** 437 minutes (7.3 hours)
60
+ - **Quality:** GOLD STANDARD
61
+
62
+ #### ses_4ac0f508dffeEcwSQ6OSMWrmWF
63
+ - **Events:** 21
64
+ - **Worker spawns:** 17
65
+ - **Reviews completed:** 4
66
+ - **Duration:** 540 minutes (9.0 hours)
67
+ - **Quality:** GOLD STANDARD
68
+
69
+ #### ses_4ae8c2f66ffecyfyre7ZQ7y5LW
70
+ - **Events:** 31
71
+ - **Worker spawns:** 24
72
+ - **Reviews completed:** 7
73
+ - **Violations:** 0
74
+ - **Duration:** 368 minutes (6.1 hours)
75
+ - **Quality:** GOLD STANDARD
76
+
77
+ **Analysis:** These are FULL multi-hour coordinator sessions with extensive worker coordination. They represent the ideal coordinator behavior the eval is designed to measure.
78
+
79
+ ## Current Filter Criteria
80
+
81
+ ```typescript
82
+ {
83
+ minEvents: 3, // Default
84
+ requireWorkerSpawn: true, // Default
85
+ requireReview: true, // Default
86
+ }
87
+ ```
88
+
89
+ ### Filter Performance
90
+
91
+ | Check | Impact |
92
+ |-------|--------|
93
+ | `minEvents >= 3` | Filters out 74 sessions (72.5%) |
94
+ | `requireWorkerSpawn: true` | Filters out 20 additional sessions (19.6%) |
95
+ | `requireReview: true` | Filters out 5 additional sessions (4.9%) |
96
+
97
+ **Cascade effect:** Each filter compounds, resulting in 2.9% passing rate.
98
+
99
+ ## Root Cause Analysis
100
+
101
+ ### Is the Filter Too Strict?
102
+
103
+ **YES and NO:**
104
+
105
+ ✅ **Working as designed:**
106
+ - Correctly excludes worker-only sessions (80% of single-event data)
107
+ - Correctly excludes incomplete coordinator sessions
108
+ - Isolates high-quality complete coordinator cycles
109
+
110
+ ❌ **Too strict for real-world analysis:**
111
+ - 2.9% passing rate means most coordinator behavior is invisible to the eval
112
+ - Filter assumes coordinators ALWAYS complete full spawn+review cycles
113
+ - Doesn't account for:
114
+ - Long-running multi-session coordinators
115
+ - Coordinators that spawn workers but reviews aren't captured yet
116
+ - Early-stage coordinator sessions (before first spawn)
117
+
118
+ ### Is the Data Quality Low?
119
+
120
+ **NO.** The data quality is actually GOOD:
121
+
122
+ - The 3 passing sessions are excellent gold-standard examples
123
+ - They contain rich coordinator behavior (20-24 worker spawns, 4-13 reviews)
124
+ - Zero violations in all 3 sessions
125
+ - Multi-hour timelines showing sustained coordination
126
+
127
+ The "low passing rate" is a **filter strictness issue**, not a data quality issue.
128
+
129
+ ### Why Only 3/102 Pass?
130
+
131
+ **Theory 1: Session Capture Splits Long Coordinators**
132
+ - The 3 passing sessions are 6-9 hour marathons
133
+ - Most coordinator work may be happening in shorter bursts
134
+ - Session files might be split by epic_id or time windows
135
+
136
+ **Evidence:**
137
+ - Some sessions have 20+ `review_completed` events with no `worker_spawned`
138
+ - This suggests reviews from previous spawns in a different session file
139
+
140
+ **Theory 2: Review Capture Is Incomplete**
141
+ - 5 sessions have `worker_spawned` but no `review_completed`
142
+ - Reviews may be captured in separate session files
143
+ - OR review capture isn't working consistently
144
+
145
+ **Theory 3: Most Coordinator Sessions Are Short**
146
+ - Only 32/102 sessions (31.4%) have ANY `review_completed` event
147
+ - Only 10/102 sessions (9.8%) have ANY `worker_spawned` event
148
+ - This suggests most captured activity is worker completions, not coordinator cycles
149
+
150
+ ## Recommendations
151
+
152
+ ### 1. Make Filter Parameters Optional (IMMEDIATE)
153
+
154
+ **Current default:**
155
+ ```typescript
156
+ {
157
+ minEvents: 3,
158
+ requireWorkerSpawn: true,
159
+ requireReview: true,
160
+ }
161
+ ```
162
+
163
+ **Recommended default:**
164
+ ```typescript
165
+ {
166
+ minEvents: 3, // Keep - filters out noise
167
+ requireWorkerSpawn: false, // CHANGE - allow early-stage sessions
168
+ requireReview: false, // CHANGE - allow incomplete sessions
169
+ }
170
+ ```
171
+
172
+ **Impact:** This would increase the number of passing sessions from 3 to ~28 (from 2.9% to 27.5%).
173
+
174
+ **Rationale:**
175
+ - Captures more coordinator behavior (spawns without reviews)
176
+ - Allows evaluation of early-stage coordination patterns
177
+ - Still filters out single-event worker completions
178
+ - Users can opt-in to stricter filters if needed
179
+
180
+ ### 2. Add Session Type Detection (ENHANCEMENT)
181
+
182
+ Add a filter to exclude worker-only sessions automatically:
183
+
184
+ ```typescript
185
+ function isCoordinatorSession(session: CoordinatorSession): boolean {
186
+ return session.events.some(e =>
187
+ e.event_type === "DECISION" &&
188
+ (e.decision_type === "decomposition_complete" ||
189
+ e.decision_type === "worker_spawned" ||
190
+ e.decision_type === "strategy_selected")
191
+ );
192
+ }
193
+ ```
194
+
195
+ **Impact:** Filters out 70+ worker-only sessions before applying other criteria.
196
+
197
+ ### 3. Investigate Session Capture Splitting (BUG FIX?)
198
+
199
+ **Symptoms:**
200
+ - Sessions with 22 `review_completed` events but no `worker_spawned`
201
+ - Sessions with 24 `worker_spawned` events but no reviews
202
+ - No `decomposition_complete` events in ANY session (including the 3 passing)
203
+
204
+ **Hypothesis:** Long-running coordinator sessions may be split across multiple session files.
205
+
206
+ **Action:** Investigate `eval-capture.ts` to understand:
207
+ - How `session_id` is generated
208
+ - Whether sessions are split by epic_id
209
+ - Whether there's a session timeout that creates new files
210
+
211
+ ### 4. Add Filter Reporting to Data Loader (OBSERVABILITY)
212
+
213
+ The data loader logs the filtered-out count, but doesn't break down WHY sessions failed.
214
+
215
+ **Enhancement:**
216
+ ```typescript
217
+ console.log(`Filtered out ${filteredOutCount} sessions:`);
218
+ console.log(` - Too few events (<${minEvents}): ${stats.tooFewEvents}`);
219
+ console.log(` - No worker_spawned: ${stats.noWorkerSpawn}`);
220
+ console.log(` - No review_completed: ${stats.noReview}`);
221
+ console.log(` - Worker-only sessions: ${stats.workerOnly}`);
222
+ ```
223
+
224
+ This helps users understand filter impact.
225
+
226
+ ### 5. Consider Separate Evals for Different Session Types
227
+
228
+ Instead of one eval with strict filters, consider:
229
+
230
+ **Eval 1: Full Coordinator Cycles** (current behavior)
231
+ - Filters: `minEvents=3, requireWorkerSpawn=true, requireReview=true`
232
+ - Focus: End-to-end coordinator discipline
233
+ - Expected passing rate: ~3% (gold standard only)
234
+
235
+ **Eval 2: Coordinator Spawning Behavior**
236
+ - Filters: `minEvents=3, requireWorkerSpawn=true, requireReview=false`
237
+ - Focus: How coordinators delegate work
238
+ - Expected passing rate: ~10%
239
+
240
+ **Eval 3: Coordinator Review Behavior**
241
+ - Filters: `minEvents=3, requireWorkerSpawn=false, requireReview=true`
242
+ - Focus: How coordinators review worker output
243
+ - Expected passing rate: ~31%
244
+
245
+ **Eval 4: All Coordinator Activity**
246
+ - Filters: `minEvents=3, requireWorkerSpawn=false, requireReview=false, isCoordinatorSession=true`
247
+ - Focus: Broad coordinator behavior patterns
248
+ - Expected passing rate: ~27%
249
+
250
+ ## Conclusion
251
+
252
+ The coordinator-session eval filter is **working as designed**. It successfully isolates high-quality complete coordinator sessions for evaluation.
253
+
254
+ However, the **2.9% passing rate is too strict** for comprehensive coordinator behavior analysis. The filter should:
255
+
256
+ 1. **Default to more lenient settings** (requireWorkerSpawn=false, requireReview=false)
257
+ 2. **Allow users to opt-in** to stricter filters for gold-standard analysis
258
+ 3. **Automatically exclude worker-only sessions** via session type detection
259
+ 4. **Provide visibility** into why sessions are filtered out
260
+
261
+ The data quality itself is GOOD. The 3 passing sessions are excellent examples of sustained multi-hour coordinator behavior with extensive worker coordination and zero violations.
262
+
263
+ ---
264
+
265
+ ## Appendix: Raw Data
266
+
267
+ ### Event Count Distribution
268
+
269
+ ```
270
+ 1 event: 70 sessions (68.6%) - Mostly worker completions
271
+ 2 events: 4 sessions (3.9%)
272
+ 3 events: 6 sessions (5.9%)
273
+ 4 events: 3 sessions (2.9%)
274
+ 5 events: 3 sessions (2.9%)
275
+ 6 events: 2 sessions (2.0%)
276
+ 7 events: 1 session (1.0%)
277
+ 9 events: 1 session (1.0%)
278
+ 21 events: 1 session (1.0%) ✓ PASSING
279
+ 22 events: 5 sessions (4.9%)
280
+ 24 events: 1 session (1.0%)
281
+ 27 events: 1 session (1.0%)
282
+ 30 events: 2 sessions (2.0%)
283
+ 31 events: 1 session (1.0%) ✓ PASSING
284
+ 33 events: 1 session (1.0%) ✓ PASSING
285
+ ```
286
+
287
+ ### Sample Worker-Only Sessions
288
+
289
+ ```
290
+ ses_6EraEW6LTRswygMPQa2voC.jsonl (1 event):
291
+ OUTCOME/subtask_success
292
+
293
+ ses_xyJ85H9SaA5FSnJvDL7ktJ.jsonl (1 event):
294
+ OUTCOME/subtask_success
295
+
296
+ ses_BiqTpFyafkbpt3tvZbh29R.jsonl (1 event):
297
+ DECISION/review_completed
298
+ ```
299
+
300
+ ### Sample Incomplete Coordinator Sessions
301
+
302
+ ```
303
+ ses_4aa1d6e57ffeGfXIoIMNhTQ9JI.jsonl (7 events):
304
+ DECISION/worker_spawned (x7)
305
+ → Missing reviews
306
+
307
+ ses_3t9CP2ZG54wF3D982kZgps.jsonl (3 events):
308
+ DECISION/review_completed (x3)
309
+ → Missing spawns
310
+
311
+ test-review-1766636012605.jsonl (22 events):
312
+ DECISION/review_completed (x22)
313
+ → Missing spawns (likely test data)
314
+ ```
315
+
316
+ ---
317
+
318
+ **Generated by:** WildDawn (swarm worker agent)
319
+ **Date:** 2025-12-25
320
+ **Files analyzed:** 102 session files from `~/.config/swarm-tools/sessions/`