opencode-swarm-plugin 0.40.0 → 0.42.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.hive/analysis/eval-failure-analysis-2025-12-25.md +331 -0
- package/.hive/analysis/session-data-quality-audit.md +320 -0
- package/.hive/eval-results.json +481 -24
- package/.hive/issues.jsonl +67 -16
- package/.hive/memories.jsonl +159 -1
- package/.opencode/eval-history.jsonl +315 -0
- package/.turbo/turbo-build.log +5 -5
- package/CHANGELOG.md +165 -0
- package/README.md +2 -0
- package/SCORER-ANALYSIS.md +598 -0
- package/bin/eval-gate.test.ts +158 -0
- package/bin/eval-gate.ts +74 -0
- package/bin/swarm.serve.test.ts +46 -0
- package/bin/swarm.test.ts +661 -732
- package/bin/swarm.ts +335 -0
- package/dist/compaction-hook.d.ts +7 -5
- package/dist/compaction-hook.d.ts.map +1 -1
- package/dist/compaction-prompt-scoring.d.ts +1 -0
- package/dist/compaction-prompt-scoring.d.ts.map +1 -1
- package/dist/eval-runner.d.ts +134 -0
- package/dist/eval-runner.d.ts.map +1 -0
- package/dist/hive.d.ts.map +1 -1
- package/dist/index.d.ts +29 -0
- package/dist/index.d.ts.map +1 -1
- package/dist/index.js +99741 -58858
- package/dist/memory-tools.d.ts +70 -2
- package/dist/memory-tools.d.ts.map +1 -1
- package/dist/memory.d.ts +37 -0
- package/dist/memory.d.ts.map +1 -1
- package/dist/observability-tools.d.ts +64 -0
- package/dist/observability-tools.d.ts.map +1 -1
- package/dist/plugin.js +99356 -58318
- package/dist/swarm-orchestrate.d.ts.map +1 -1
- package/dist/swarm-prompts.d.ts +32 -1
- package/dist/swarm-prompts.d.ts.map +1 -1
- package/docs/planning/ADR-009-oh-my-opencode-patterns.md +353 -0
- package/evals/ARCHITECTURE.md +1189 -0
- package/evals/example.eval.ts +3 -4
- package/evals/fixtures/compaction-prompt-cases.ts +6 -0
- package/evals/scorers/coordinator-discipline.evalite-test.ts +1 -162
- package/evals/scorers/coordinator-discipline.ts +0 -323
- package/evals/swarm-decomposition.eval.ts +4 -2
- package/package.json +4 -3
- package/src/compaction-prompt-scorers.test.ts +185 -9
- package/src/compaction-prompt-scoring.ts +7 -5
- package/src/eval-runner.test.ts +128 -1
- package/src/eval-runner.ts +46 -0
- package/src/hive.ts +43 -42
- package/src/memory-tools.test.ts +84 -0
- package/src/memory-tools.ts +68 -3
- package/src/memory.test.ts +2 -112
- package/src/memory.ts +88 -49
- package/src/observability-tools.test.ts +13 -0
- package/src/observability-tools.ts +277 -0
- package/src/swarm-orchestrate.test.ts +162 -0
- package/src/swarm-orchestrate.ts +7 -5
- package/src/swarm-prompts.test.ts +168 -4
- package/src/swarm-prompts.ts +228 -7
- package/.env +0 -2
- package/.turbo/turbo-test.log +0 -481
- package/.turbo/turbo-typecheck.log +0 -1
package/.hive/analysis/eval-failure-analysis-2025-12-25.md
@@ -0,0 +1,331 @@

# Eval Failure Analysis Report

**Date:** 2025-12-25
**Analyst:** BrightStar
**Cell:** opencode-swarm-plugin--ys7z8-mjlk7jsl4tt
**Epic:** opencode-swarm-plugin--ys7z8-mjlk7js9bt1

## Executive Summary

Two eval failures analyzed:

- **example.eval.ts**: 0% score - structural bug in eval setup
- **compaction-prompt.eval.ts**: 53% score - case sensitivity + missing forbidden tools

Both are fixable with code changes. There are no test data quality issues.

---

## example.eval.ts - 0% Score

### Status

❌ **CRITICAL** - Complete failure (0%)

### Root Cause

**Eval structure mismatch** between the data provider and the task function.

### Technical Details

**File:** `evals/example.eval.ts`
**Lines:** 14-30

The eval has a fundamental flow error:

```typescript
// Lines 14-26: data() provides BOTH input AND expected output
data: async () => {
  return [
    {
      input: "Test task",        // ← String for task function
      output: JSON.stringify({   // ← Expected output (ignored!)
        epic: { title: "Test Epic", ... },
        subtasks: [...]
      }),
    },
  ];
},

// Lines 28-30: task() does passthrough
task: async (input) => {
  return input; // ← Returns "Test task" string, NOT the CellTree
},

// Line 31: Scorer expects CellTree JSON
scorers: [subtaskIndependence],
```

**What happens:**

1. Evalite passes `input` ("Test task") to the task function
2. The task returns the "Test task" string unchanged
3. The `subtaskIndependence` scorer receives "Test task"
4. The scorer tries to parse it as CellTree JSON → **FAILS**
5. Score: 0%

The `output` field in `data()` is ignored by Evalite - it's the `task()` return value that gets scored.
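
Why the score collapses to 0 rather than partially: a scorer that expects CellTree JSON has nothing to grade when parsing a plain string throws. A minimal sketch of that failure mode, assuming a defensive scorer wrapper (illustrative only, not the actual `subtaskIndependence` implementation):

```typescript
// Illustrative sketch - the real subtaskIndependence internals may differ;
// this just shows why a non-JSON task output bottoms out at 0.
function scoreSubtaskIndependence(taskOutput: string): number {
  let tree: { subtasks?: Array<{ files?: string[] }> };
  try {
    tree = JSON.parse(taskOutput); // "Test task" is not JSON → throws here
  } catch {
    return 0; // unparseable output scores 0
  }
  if (!tree.subtasks?.length) return 0;
  // ...independence checks over tree.subtasks would follow here
  return 1;
}

scoreSubtaskIndependence("Test task");                        // 0 - what the eval currently produces
scoreSubtaskIndependence(JSON.stringify({ subtasks: [{}] })); // reaches the real checks
```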

### Impact

- The example eval is useless for validation
- False signal that the scorer infrastructure is broken (it's not)
- Wastes CI time

### Proposed Fix

**Option 1: Remove output from data (recommended)**

```typescript
data: async () => {
  return [
    {
      input: {
        epic: { title: "Test Epic", description: "Test" },
        subtasks: [
          { title: "Subtask 1", files: ["a.ts"], estimated_complexity: 1 },
          { title: "Subtask 2", files: ["b.ts"], estimated_complexity: 1 },
        ],
      },
    },
  ];
},

task: async (input) => {
  return JSON.stringify(input); // Stringify the CellTree
},
```

**Option 2: Fix task to use output**

```typescript
// Keep data() as-is, but fix task:
task: async (input, context) => {
  return context.expected.output; // Use the output from data()
},
```

Option 1 is cleaner - task functions should generate output, not just pass through.

---

## compaction-prompt.eval.ts - 53% Score

### Status

⚠️ **DEGRADED** - Below target (53% vs 100% historical)

### Root Causes

#### RC1: Case-Sensitive Forbidden Tool Patterns (15% weight)

**File:** `src/compaction-prompt-scoring.ts`
**Lines:** 213-218

```typescript
const forbiddenTools = [
  /\bEdit\b/,  // ← Requires capital E
  /\bWrite\b/, // ← Requires capital W
  /swarmmail_reserve/,
  /git commit/,
];
```

**File:** `evals/fixtures/compaction-prompt-cases.ts`
**Lines:** 76-83 (perfect fixture)

```
- edit   // ← lowercase e
- write  // ← lowercase w
- bash (for file modifications)
```

**Evidence:**

```javascript
/\bEdit\b/.test("- Edit") // ✅ true
/\bEdit\b/.test("- edit") // ❌ false (word boundary + case)
```

**Impact:**

- Perfect fixture: 0/4 forbidden tools matched
- Forbidden tools scorer: 0% (should be 75-100%)
- Overall impact: 15% of the total score lost

#### RC2: Missing Forbidden Tools (15% weight)

The scorer expects **4 tools**:

1. Edit (or edit)
2. Write (or write)
3. swarmmail_reserve
4. git commit

The perfect fixture has **3 tools** (and a case mismatch):

1. edit ❌ (lowercase)
2. write ❌ (lowercase)
3. bash ❌ (not in the scorer's list)

Missing: swarmmail_reserve, git commit

**Impact:**

- Even with the case fixed, only 2/4 tools match = 50% on this scorer
- Weighted: 50% × 15% = 7.5% contribution (should be 15%)

#### RC3: "bash" Not in Scorer's List

The fixtures list "bash (for file modifications)" as forbidden, but the scorer doesn't check for it.
This creates a three-way mismatch:

- Fixture lists: edit, write, bash
- Scorer checks: Edit, Write, swarmmail_reserve, git commit
- Overlap: 0 tools (due to case)

### Score Breakdown - Perfect Fixture

Expected (if 100%):

```
epicIdSpecificity:        20% × 1.0 = 20%
actionability:            20% × 1.0 = 20%
coordinatorIdentity:      25% × 1.0 = 25%
forbiddenToolsPresent:    15% × 1.0 = 15%
postCompactionDiscipline: 20% × 1.0 = 20%
                                     ─────
TOTAL:                               100%
```

Actual (current):

```
epicIdSpecificity:        20% × 1.0 = 20% ✅
actionability:            20% × 1.0 = 20% ✅
coordinatorIdentity:      25% × 1.0 = 25% ✅
forbiddenToolsPresent:    15% × 0.0 =  0% ❌ (0/4 matched)
postCompactionDiscipline: 20% × 1.0 = 20% ✅
                                     ─────
TOTAL:                                85%
```

The perfect fixture alone should score 85%, but the overall eval is 53%.
This means the 5 "bad" fixtures are pulling the average down further (expected behavior).
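
For reference, a fixture's total is just the weight-scaled sum of its scorer results, and the eval score is the mean across fixtures. A minimal sketch of that arithmetic, using the weights above; the scorer-result shape is an assumption for illustration, not the evalite API:

```typescript
// Weighted scoring math described in the breakdown above.
const weights = {
  epicIdSpecificity: 0.20,
  actionability: 0.20,
  coordinatorIdentity: 0.25,
  forbiddenToolsPresent: 0.15,
  postCompactionDiscipline: 0.20,
} as const;

type ScorerName = keyof typeof weights;
type FixtureScores = Record<ScorerName, number>; // each scorer returns 0..1

// Weighted total for one fixture (0..1).
function fixtureTotal(scores: FixtureScores): number {
  return (Object.keys(weights) as ScorerName[]).reduce(
    (sum, name) => sum + weights[name] * scores[name],
    0,
  );
}

// Perfect fixture with the forbidden-tools scorer at 0 → 0.85, matching the table.
const perfectFixture: FixtureScores = {
  epicIdSpecificity: 1,
  actionability: 1,
  coordinatorIdentity: 1,
  forbiddenToolsPresent: 0,
  postCompactionDiscipline: 1,
};
console.log(fixtureTotal(perfectFixture)); // 0.85

// The overall eval score averages all fixtures, so the low-scoring
// "bad" fixtures pull the 85% down toward the observed 53%.
function evalScore(fixtures: FixtureScores[]): number {
  return fixtures.reduce((s, f) => s + fixtureTotal(f), 0) / fixtures.length;
}
```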

### Historical Context

Semantic memory claims this eval previously scored 100%. Likely scenarios:

1. **Never actually ran** - the aspiration was documented before implementation
2. **Ran with different fixtures** - fixtures were updated after the scorer was written
3. **Scorer was case-insensitive before** - regression in recent commit aa12943

Commit aa12943 (2025-12-24) added the eval infrastructure. This is brand-new code.

### Proposed Fixes

#### Fix 1: Make Scorer Case-Insensitive (Recommended)

**File:** `src/compaction-prompt-scoring.ts`
**Lines:** 213-218

```typescript
const forbiddenTools = [
  /\bedit\b/i,          // Case-insensitive with 'i' flag
  /\bwrite\b/i,         // Case-insensitive
  /\bbash\b/i,          // Add bash (was missing)
  /swarmmail_reserve/i, // Keep, add 'i' for safety
  /git commit/i,        // Keep, add 'i' for safety
];
```

**Rationale:**

- Coordinators might capitalize differently in prompts
- Real prompts won't always match the exact case
- More robust matching
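
A quick check, in the same spirit as the evidence snippet above, that the `i`-flagged patterns match the lowercase fixture lines quoted in this report as well as capitalized variants:

```typescript
// Verify the proposed case-insensitive patterns against fixture-style lines.
const forbiddenTools = [/\bedit\b/i, /\bwrite\b/i, /\bbash\b/i, /swarmmail_reserve/i, /git commit/i];

const fixtureLines = ["- edit", "- write", "- bash (for file modifications)", "- Edit", "- Write"];

for (const line of fixtureLines) {
  const matched = forbiddenTools.some((re) => re.test(line));
  console.log(line, "→", matched); // every line matches once the 'i' flag is added
}
```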

#### Fix 2: Update Fixtures to Match Scorer (Alternative)

**File:** `evals/fixtures/compaction-prompt-cases.ts`
**Lines:** 76-83 (and all other fixtures)

```
- Edit                          // Capital E
- Write                         // Capital W
- bash (for file modifications) // Keep or remove
- swarmmail_reserve             // ADD
- git commit                    // ADD
```

**Rationale:**

- Keeps the scorer strict (may catch real case issues)
- Makes the fixtures comprehensive (all 5 tools)
- More explicit about what's forbidden

#### Fix 3: Hybrid (Best of Both)

1. Make the scorer case-insensitive (Fix 1)
2. Update the fixtures to include all 5 tools (Fix 2)
3. Remove "bash" from the fixtures if it's not in the coordinator forbidden list

```typescript
// Scorer (5 tools, case-insensitive):
const forbiddenTools = [
  /\bedit\b/i,
  /\bwrite\b/i,
  /swarmmail_reserve/i,
  /git\s+commit/i,
  /\bread\b/i, // Consider adding - coordinators shouldn't read, they should check status
];
```

```
// Fixture:
- Edit
- Write
- swarmmail_reserve (only workers reserve files)
- git commit (workers commit their changes)
```

### Risk Assessment

**If we fix this, will scores jump to 100%?**

**Perfect fixture:** 85% → 100% (if all 4 tools matched)
**Other fixtures:** Depends on their issues

Looking at the fixtures' expected values:

- Fixture 0 (perfect): Should be 100%
- Fixture 1 (placeholder): Should fail (expected)
- Fixture 2 (generic): Should fail (expected)
- Fixture 3 (weak identity): Should partially fail (expected)
- Fixture 4 (missing forbidden): Should fail on forbidden tools only
- Fixture 5 (wrong first tool): Should fail on discipline only

Average across 6 fixtures: ~66% expected (not 100%)

**So 53% → ~70-80%** is realistic after the fixes (not 100%).

To get higher scores we would need to fix issues in the bad fixtures too, but those are SUPPOSED to fail.
The scorer is working correctly on those.

---

## Recommendations

### Immediate Actions (P0)

1. **Fix example.eval.ts structure** - 5 min fix, unblocks that eval
2. **Make forbidden tools case-insensitive** - 5 min fix, +15-20% score boost
3. **Add missing tools to fixtures** - 10 min, comprehensive coverage

### Medium-term Actions (P1)

4. **Verify the 100% claim in semantic memory** - Check whether historical data exists
5. **Document scorer expectations** - Add comments to fixtures explaining weights
6. **Add unit tests for scorers** - Test edge cases independently

### Long-term Actions (P2)

7. **Consider LLM-as-judge for semantic checks** - Case-insensitive by nature
8. **Add a visual diff in eval output** - Show what's missing from prompts
9. **Create an eval dashboard** - Track scores over time, detect regressions

---

## Conclusion

Both evals have **code bugs, not test data issues**:

- example.eval.ts: Structural bug (task/data mismatch)
- compaction-prompt.eval.ts: Case sensitivity + incomplete tool list

The fixes are straightforward and low-risk. After the fixes, expect:

- example.eval.ts: 0% → 100%
- compaction-prompt.eval.ts: 53% → 70-80%

The 100% historical score in semantic memory is likely aspirational - these evals are brand new (commit aa12943, Dec 24).

**Ready to implement fixes or escalate for review?**

package/.hive/analysis/session-data-quality-audit.md
@@ -0,0 +1,320 @@

# Session Data Quality Audit Report

**Date:** 2025-12-25
**Cell:** opencode-swarm-plugin--ys7z8-mjlk7jspacf
**Agent:** WildDawn

## Executive Summary

Investigation of why only 3 of 102 sessions (2.9%) pass the coordinator-session eval filter reveals:

1. **The filter is working as designed** - it correctly isolates high-quality complete coordinator sessions
2. **Data quality is actually GOOD** - the 3 passing sessions are gold-standard examples
3. **97% filtered out is EXPECTED** - most sessions are worker completions, not coordinator sessions
4. **The filter may be too strict** for broad coordinator behavior analysis (needs tuning)

## Data Breakdown

### Total Sessions: 102

| Category | Count | % | Description |
|----------|-------|---|-------------|
| **Single-event sessions** | 70 | 68.6% | Worker completions (subtask_success), isolated reviews |
| **Multi-event incomplete** | 29 | 28.4% | Coordinator sessions that didn't complete the full cycle |
| **Passing sessions** | 3 | 2.9% | Complete coordinator cycles with spawn + review |

### Single-Event Sessions (70 sessions - 68.6%)

**Event Type Breakdown:**

- `OUTCOME/subtask_success`: 56 (80.0%) - **Worker completions, not coordinator sessions**
- `DECISION/review_completed`: 12 (17.1%) - Isolated review events
- `DECISION/worker_spawned`: 2 (2.9%) - Isolated spawn events

**Analysis:** These are **NOT coordinator sessions**. They're worker agents reporting completion, or isolated coordinator actions captured in separate session files.

### Multi-Event Failures (29 sessions - 28.4%)

**Failure Breakdown:**

- **No worker_spawned event**: 20 sessions
  - Review-only sessions (3-22 events, all `review_completed`)
  - Appears to be test data or session capture split across files
- **Has worker_spawned but no review_completed**: 5 sessions
  - Incomplete coordinator sessions (4-24 events)
  - The coordinator spawned workers, but the reviews weren't captured (yet)
- **Too few events (<3)**: 4 sessions
  - Aborted early

**Key Finding:** None of these 29 sessions have `decomposition_complete` events. This suggests:

1. Session capture may not be recording decomposition events
2. OR coordinator sessions span multiple session files
3. OR these are partial captures from long-running coordinators

### Passing Sessions (3 sessions - 2.9%)

#### ses_4b86f0867ffeXKv95ktf31igfD
- **Events:** 33
- **Worker spawns:** 20
- **Reviews completed:** 13
- **Violations:** 0
- **Duration:** 437 minutes (7.3 hours)
- **Quality:** GOLD STANDARD

#### ses_4ac0f508dffeEcwSQ6OSMWrmWF
- **Events:** 21
- **Worker spawns:** 17
- **Reviews completed:** 4
- **Duration:** 540 minutes (9.0 hours)
- **Quality:** GOLD STANDARD

#### ses_4ae8c2f66ffecyfyre7ZQ7y5LW
- **Events:** 31
- **Worker spawns:** 24
- **Reviews completed:** 7
- **Violations:** 0
- **Duration:** 368 minutes (6.1 hours)
- **Quality:** GOLD STANDARD

**Analysis:** These are FULL multi-hour coordinator sessions with extensive worker coordination. They represent the ideal coordinator behavior the eval is designed to measure.

## Current Filter Criteria

```typescript
{
  minEvents: 3,             // Default
  requireWorkerSpawn: true, // Default
  requireReview: true,      // Default
}
```

### Filter Performance

| Check | Impact |
|-------|--------|
| `minEvents >= 3` | Filters out 74 sessions (72.5%) |
| `requireWorkerSpawn: true` | Filters out 20 additional sessions (19.6%) |
| `requireReview: true` | Filters out 5 additional sessions (4.9%) |

**Cascade effect:** Each filter compounds, resulting in the 2.9% passing rate.
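
A sketch of how that cascade plays out in code. The session and event field names below are assumptions inferred from this report, not the actual data-loader types in `eval-capture.ts`:

```typescript
// Illustrative cascading filter matching the criteria above.
interface CapturedEvent {
  event_type: "DECISION" | "OUTCOME";
  decision_type?: string; // e.g. "worker_spawned", "review_completed"
}

interface CapturedSession {
  session_id: string;
  events: CapturedEvent[];
}

interface FilterOptions {
  minEvents: number;
  requireWorkerSpawn: boolean;
  requireReview: boolean;
}

function passesFilter(session: CapturedSession, opts: FilterOptions): boolean {
  if (session.events.length < opts.minEvents) return false;
  const has = (type: string) =>
    session.events.some((e) => e.decision_type === type);
  if (opts.requireWorkerSpawn && !has("worker_spawned")) return false;
  if (opts.requireReview && !has("review_completed")) return false;
  return true;
}

// With the current defaults, each check compounds: 102 → 28 → 8 → 3 sessions.
const defaults: FilterOptions = { minEvents: 3, requireWorkerSpawn: true, requireReview: true };
```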

## Root Cause Analysis

### Is the Filter Too Strict?

**YES and NO:**

✅ **Working as designed:**
- Correctly excludes worker-only sessions (80% of single-event data)
- Correctly excludes incomplete coordinator sessions
- Isolates high-quality complete coordinator cycles

❌ **Too strict for real-world analysis:**
- A 2.9% passing rate means most coordinator behavior is invisible to the eval
- The filter assumes coordinators ALWAYS complete full spawn+review cycles
- Doesn't account for:
  - Long-running multi-session coordinators
  - Coordinators that spawn workers whose reviews aren't captured yet
  - Early-stage coordinator sessions (before the first spawn)

### Is the Data Quality Low?

**NO.** The data quality is actually GOOD:

- The 3 passing sessions are excellent gold-standard examples
- They contain rich coordinator behavior (20-24 worker spawns, 4-13 reviews)
- Zero violations in all 3 sessions
- Multi-hour timelines showing sustained coordination

The "low passing rate" is a **filter strictness issue**, not a data quality issue.

### Why Only 3/102 Pass?

**Theory 1: Session Capture Splits Long Coordinators**
- The 3 passing sessions are 6-9 hour marathons
- Most coordinator work may be happening in shorter bursts
- Session files might be split by epic_id or time windows

**Evidence:**
- Some sessions have 20+ `review_completed` events with no `worker_spawned`
- This suggests the reviews belong to spawns recorded in a different session file

**Theory 2: Review Capture Is Incomplete**
- 5 sessions have `worker_spawned` but no `review_completed`
- Reviews may be captured in separate session files
- OR review capture isn't working consistently

**Theory 3: Most Coordinator Sessions Are Short**
- Only 32/102 sessions (31.4%) have ANY `review_completed` event
- Only 10/102 sessions (9.8%) have ANY `worker_spawned` event
- This suggests most captured activity is worker completions, not coordinator cycles

## Recommendations

### 1. Make Filter Parameters Optional (IMMEDIATE)

**Current default:**
```typescript
{
  minEvents: 3,
  requireWorkerSpawn: true,
  requireReview: true,
}
```

**Recommended default:**
```typescript
{
  minEvents: 3,              // Keep - filters out noise
  requireWorkerSpawn: false, // CHANGE - allow early-stage sessions
  requireReview: false,      // CHANGE - allow incomplete sessions
}
```

**Impact:** This would increase the passing rate from 3 to ~28 sessions (from 2.9% to 27.5%).

**Rationale:**
- Captures more coordinator behavior (spawns without reviews)
- Allows evaluation of early-stage coordination patterns
- Still filters out single-event worker completions
- Users can opt in to stricter filters if needed (see the sketch below)
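
A minimal usage sketch of that opt-in model. The `loadCoordinatorSessions` entry point and its option names are hypothetical, introduced only to illustrate the lenient-default / strict-opt-in split; the real loader API may differ:

```typescript
// Hypothetical loader signature for illustration only.
declare function loadCoordinatorSessions(opts: {
  minEvents: number;
  requireWorkerSpawn: boolean;
  requireReview: boolean;
}): Promise<unknown[]>;

// Lenient default: broad coordinator behavior (~28 of 102 sessions).
const broad = await loadCoordinatorSessions({
  minEvents: 3,
  requireWorkerSpawn: false,
  requireReview: false,
});

// Opt-in strict mode: gold-standard complete cycles only (~3 of 102 sessions).
const goldStandard = await loadCoordinatorSessions({
  minEvents: 3,
  requireWorkerSpawn: true,
  requireReview: true,
});
```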

### 2. Add Session Type Detection (ENHANCEMENT)

Add a filter to exclude worker-only sessions automatically:

```typescript
function isCoordinatorSession(session: CoordinatorSession): boolean {
  return session.events.some(e =>
    e.event_type === "DECISION" &&
    (e.decision_type === "decomposition_complete" ||
     e.decision_type === "worker_spawned" ||
     e.decision_type === "strategy_selected")
  );
}
```

**Impact:** Filters out 70+ worker-only sessions before the other criteria are applied.

### 3. Investigate Session Capture Splitting (BUG FIX?)

**Symptoms:**
- Sessions with 22 `review_completed` events but no `worker_spawned`
- Sessions with 24 `worker_spawned` events but no reviews
- No `decomposition_complete` events in ANY session (including the 3 passing)

**Hypothesis:** Long-running coordinator sessions may be split across multiple session files.

**Action:** Investigate `eval-capture.ts` to understand:
- How `session_id` is generated
- Whether sessions are split by epic_id
- Whether there's a session timeout that creates new files

### 4. Add Filter Reporting to Data Loader (OBSERVABILITY)

The data loader logs the filtered-out count, but it doesn't break down WHY sessions failed.

**Enhancement:**
```typescript
console.log(`Filtered out ${filteredOutCount} sessions:`);
console.log(`  - Too few events (<${minEvents}): ${stats.tooFewEvents}`);
console.log(`  - No worker_spawned: ${stats.noWorkerSpawn}`);
console.log(`  - No review_completed: ${stats.noReview}`);
console.log(`  - Worker-only sessions: ${stats.workerOnly}`);
```

This helps users understand the filter's impact.

### 5. Consider Separate Evals for Different Session Types

Instead of one eval with strict filters, consider the presets below (a config sketch follows this list):

**Eval 1: Full Coordinator Cycles** (current behavior)
- Filters: `minEvents=3, requireWorkerSpawn=true, requireReview=true`
- Focus: End-to-end coordinator discipline
- Expected passing rate: ~3% (gold standard only)

**Eval 2: Coordinator Spawning Behavior**
- Filters: `minEvents=3, requireWorkerSpawn=true, requireReview=false`
- Focus: How coordinators delegate work
- Expected passing rate: ~10%

**Eval 3: Coordinator Review Behavior**
- Filters: `minEvents=3, requireWorkerSpawn=false, requireReview=true`
- Focus: How coordinators review worker output
- Expected passing rate: ~31%

**Eval 4: All Coordinator Activity**
- Filters: `minEvents=3, requireWorkerSpawn=false, requireReview=false, isCoordinatorSession=true`
- Focus: Broad coordinator behavior patterns
- Expected passing rate: ~27%
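
One way to encode those four presets, sketched with the filter option names used throughout this report plus an optional coordinator-only flag; the preset names and the `EvalPreset` shape are illustrative, and wiring a preset into an eval file would depend on the real loader API:

```typescript
// Illustrative preset table for the four proposed evals.
interface EvalPreset {
  name: string;
  minEvents: number;
  requireWorkerSpawn: boolean;
  requireReview: boolean;
  coordinatorOnly?: boolean; // apply isCoordinatorSession() before the other checks
}

const evalPresets: EvalPreset[] = [
  { name: "full-coordinator-cycles",  minEvents: 3, requireWorkerSpawn: true,  requireReview: true },
  { name: "coordinator-spawning",     minEvents: 3, requireWorkerSpawn: true,  requireReview: false },
  { name: "coordinator-review",       minEvents: 3, requireWorkerSpawn: false, requireReview: true },
  { name: "all-coordinator-activity", minEvents: 3, requireWorkerSpawn: false, requireReview: false, coordinatorOnly: true },
];
```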

## Conclusion

The coordinator-session eval filter is **working as designed**. It successfully isolates high-quality complete coordinator sessions for evaluation.

However, at a **2.9% passing rate the filter is too strict** for comprehensive coordinator behavior analysis. The filter should:

1. **Default to more lenient settings** (requireWorkerSpawn=false, requireReview=false)
2. **Allow users to opt in** to stricter filters for gold-standard analysis
3. **Automatically exclude worker-only sessions** via session type detection
4. **Provide visibility** into why sessions are filtered out

The data quality itself is GOOD. The 3 passing sessions are excellent examples of sustained multi-hour coordinator behavior with extensive worker coordination and zero violations.

---

## Appendix: Raw Data

### Event Count Distribution

```
 1 event:   70 sessions (68.6%) - Mostly worker completions
 2 events:   4 sessions (3.9%)
 3 events:   6 sessions (5.9%)
 4 events:   3 sessions (2.9%)
 5 events:   3 sessions (2.9%)
 6 events:   2 sessions (2.0%)
 7 events:   1 session  (1.0%)
 9 events:   1 session  (1.0%)
21 events:   1 session  (1.0%) ✓ PASSING
22 events:   5 sessions (4.9%)
24 events:   1 session  (1.0%)
27 events:   1 session  (1.0%)
30 events:   2 sessions (2.0%)
31 events:   1 session  (1.0%) ✓ PASSING
33 events:   1 session  (1.0%) ✓ PASSING
```

### Sample Worker-Only Sessions

```
ses_6EraEW6LTRswygMPQa2voC.jsonl (1 event):
  OUTCOME/subtask_success

ses_xyJ85H9SaA5FSnJvDL7ktJ.jsonl (1 event):
  OUTCOME/subtask_success

ses_BiqTpFyafkbpt3tvZbh29R.jsonl (1 event):
  DECISION/review_completed
```

### Sample Incomplete Coordinator Sessions

```
ses_4aa1d6e57ffeGfXIoIMNhTQ9JI.jsonl (7 events):
  DECISION/worker_spawned (x7)
  → Missing reviews

ses_3t9CP2ZG54wF3D982kZgps.jsonl (3 events):
  DECISION/review_completed (x3)
  → Missing spawns

test-review-1766636012605.jsonl (22 events):
  DECISION/review_completed (x22)
  → Missing spawns (likely test data)
```

---

**Generated by:** WildDawn (swarm worker agent)
**Date:** 2025-12-25
**Files analyzed:** 102 session files from `~/.config/swarm-tools/sessions/`