opencode-swarm-plugin 0.44.0 → 0.44.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/swarm.serve.test.ts +6 -4
- package/bin/swarm.ts +16 -10
- package/dist/compaction-prompt-scoring.js +139 -0
- package/dist/eval-capture.js +12811 -0
- package/dist/hive.d.ts.map +1 -1
- package/dist/index.js +7644 -62599
- package/dist/plugin.js +23766 -78721
- package/dist/swarm-orchestrate.d.ts.map +1 -1
- package/dist/swarm-prompts.d.ts.map +1 -1
- package/dist/swarm-review.d.ts.map +1 -1
- package/package.json +17 -5
- package/.changeset/swarm-insights-data-layer.md +0 -63
- package/.hive/analysis/eval-failure-analysis-2025-12-25.md +0 -331
- package/.hive/analysis/session-data-quality-audit.md +0 -320
- package/.hive/eval-results.json +0 -483
- package/.hive/issues.jsonl +0 -138
- package/.hive/memories.jsonl +0 -729
- package/.opencode/eval-history.jsonl +0 -327
- package/.turbo/turbo-build.log +0 -9
- package/CHANGELOG.md +0 -2286
- package/SCORER-ANALYSIS.md +0 -598
- package/docs/analysis/subagent-coordination-patterns.md +0 -902
- package/docs/analysis-socratic-planner-pattern.md +0 -504
- package/docs/planning/ADR-001-monorepo-structure.md +0 -171
- package/docs/planning/ADR-002-package-extraction.md +0 -393
- package/docs/planning/ADR-003-performance-improvements.md +0 -451
- package/docs/planning/ADR-004-message-queue-features.md +0 -187
- package/docs/planning/ADR-005-devtools-observability.md +0 -202
- package/docs/planning/ADR-007-swarm-enhancements-worktree-review.md +0 -168
- package/docs/planning/ADR-008-worker-handoff-protocol.md +0 -293
- package/docs/planning/ADR-009-oh-my-opencode-patterns.md +0 -353
- package/docs/planning/ADR-010-cass-inhousing.md +0 -1215
- package/docs/planning/ROADMAP.md +0 -368
- package/docs/semantic-memory-cli-syntax.md +0 -123
- package/docs/swarm-mail-architecture.md +0 -1147
- package/docs/testing/context-recovery-test.md +0 -470
- package/evals/ARCHITECTURE.md +0 -1189
- package/evals/README.md +0 -768
- package/evals/compaction-prompt.eval.ts +0 -149
- package/evals/compaction-resumption.eval.ts +0 -289
- package/evals/coordinator-behavior.eval.ts +0 -307
- package/evals/coordinator-session.eval.ts +0 -154
- package/evals/evalite.config.ts.bak +0 -15
- package/evals/example.eval.ts +0 -31
- package/evals/fixtures/cass-baseline.ts +0 -217
- package/evals/fixtures/compaction-cases.ts +0 -350
- package/evals/fixtures/compaction-prompt-cases.ts +0 -311
- package/evals/fixtures/coordinator-sessions.ts +0 -328
- package/evals/fixtures/decomposition-cases.ts +0 -105
- package/evals/lib/compaction-loader.test.ts +0 -248
- package/evals/lib/compaction-loader.ts +0 -320
- package/evals/lib/data-loader.evalite-test.ts +0 -289
- package/evals/lib/data-loader.test.ts +0 -345
- package/evals/lib/data-loader.ts +0 -281
- package/evals/lib/llm.ts +0 -115
- package/evals/scorers/compaction-prompt-scorers.ts +0 -145
- package/evals/scorers/compaction-scorers.ts +0 -305
- package/evals/scorers/coordinator-discipline.evalite-test.ts +0 -539
- package/evals/scorers/coordinator-discipline.ts +0 -325
- package/evals/scorers/index.test.ts +0 -146
- package/evals/scorers/index.ts +0 -328
- package/evals/scorers/outcome-scorers.evalite-test.ts +0 -27
- package/evals/scorers/outcome-scorers.ts +0 -349
- package/evals/swarm-decomposition.eval.ts +0 -121
- package/examples/commands/swarm.md +0 -745
- package/examples/plugin-wrapper-template.ts +0 -2515
- package/examples/skills/hive-workflow/SKILL.md +0 -212
- package/examples/skills/skill-creator/SKILL.md +0 -223
- package/examples/skills/swarm-coordination/SKILL.md +0 -292
- package/global-skills/cli-builder/SKILL.md +0 -344
- package/global-skills/cli-builder/references/advanced-patterns.md +0 -244
- package/global-skills/learning-systems/SKILL.md +0 -644
- package/global-skills/skill-creator/LICENSE.txt +0 -202
- package/global-skills/skill-creator/SKILL.md +0 -352
- package/global-skills/skill-creator/references/output-patterns.md +0 -82
- package/global-skills/skill-creator/references/workflows.md +0 -28
- package/global-skills/swarm-coordination/SKILL.md +0 -995
- package/global-skills/swarm-coordination/references/coordinator-patterns.md +0 -235
- package/global-skills/swarm-coordination/references/strategies.md +0 -138
- package/global-skills/system-design/SKILL.md +0 -213
- package/global-skills/testing-patterns/SKILL.md +0 -430
- package/global-skills/testing-patterns/references/dependency-breaking-catalog.md +0 -586
- package/opencode-swarm-plugin-0.30.7.tgz +0 -0
- package/opencode-swarm-plugin-0.31.0.tgz +0 -0
- package/scripts/cleanup-test-memories.ts +0 -346
- package/scripts/init-skill.ts +0 -222
- package/scripts/migrate-unknown-sessions.ts +0 -349
- package/scripts/validate-skill.ts +0 -204
- package/src/agent-mail.ts +0 -1724
- package/src/anti-patterns.test.ts +0 -1167
- package/src/anti-patterns.ts +0 -448
- package/src/compaction-capture.integration.test.ts +0 -257
- package/src/compaction-hook.test.ts +0 -838
- package/src/compaction-hook.ts +0 -1204
- package/src/compaction-observability.integration.test.ts +0 -139
- package/src/compaction-observability.test.ts +0 -187
- package/src/compaction-observability.ts +0 -324
- package/src/compaction-prompt-scorers.test.ts +0 -475
- package/src/compaction-prompt-scoring.ts +0 -300
- package/src/contributor-tools.test.ts +0 -133
- package/src/contributor-tools.ts +0 -201
- package/src/dashboard.test.ts +0 -611
- package/src/dashboard.ts +0 -462
- package/src/error-enrichment.test.ts +0 -403
- package/src/error-enrichment.ts +0 -219
- package/src/eval-capture.test.ts +0 -1015
- package/src/eval-capture.ts +0 -929
- package/src/eval-gates.test.ts +0 -306
- package/src/eval-gates.ts +0 -218
- package/src/eval-history.test.ts +0 -508
- package/src/eval-history.ts +0 -214
- package/src/eval-learning.test.ts +0 -378
- package/src/eval-learning.ts +0 -360
- package/src/eval-runner.test.ts +0 -223
- package/src/eval-runner.ts +0 -402
- package/src/export-tools.test.ts +0 -476
- package/src/export-tools.ts +0 -257
- package/src/hive.integration.test.ts +0 -2241
- package/src/hive.ts +0 -1628
- package/src/index.ts +0 -940
- package/src/learning.integration.test.ts +0 -1815
- package/src/learning.ts +0 -1079
- package/src/logger.test.ts +0 -189
- package/src/logger.ts +0 -135
- package/src/mandate-promotion.test.ts +0 -473
- package/src/mandate-promotion.ts +0 -239
- package/src/mandate-storage.integration.test.ts +0 -601
- package/src/mandate-storage.test.ts +0 -578
- package/src/mandate-storage.ts +0 -794
- package/src/mandates.ts +0 -540
- package/src/memory-tools.test.ts +0 -195
- package/src/memory-tools.ts +0 -344
- package/src/memory.integration.test.ts +0 -334
- package/src/memory.test.ts +0 -158
- package/src/memory.ts +0 -527
- package/src/model-selection.test.ts +0 -188
- package/src/model-selection.ts +0 -68
- package/src/observability-tools.test.ts +0 -359
- package/src/observability-tools.ts +0 -871
- package/src/output-guardrails.test.ts +0 -438
- package/src/output-guardrails.ts +0 -381
- package/src/pattern-maturity.test.ts +0 -1160
- package/src/pattern-maturity.ts +0 -525
- package/src/planning-guardrails.test.ts +0 -491
- package/src/planning-guardrails.ts +0 -438
- package/src/plugin.ts +0 -23
- package/src/post-compaction-tracker.test.ts +0 -251
- package/src/post-compaction-tracker.ts +0 -237
- package/src/query-tools.test.ts +0 -636
- package/src/query-tools.ts +0 -324
- package/src/rate-limiter.integration.test.ts +0 -466
- package/src/rate-limiter.ts +0 -774
- package/src/replay-tools.test.ts +0 -496
- package/src/replay-tools.ts +0 -240
- package/src/repo-crawl.integration.test.ts +0 -441
- package/src/repo-crawl.ts +0 -610
- package/src/schemas/cell-events.test.ts +0 -347
- package/src/schemas/cell-events.ts +0 -807
- package/src/schemas/cell.ts +0 -257
- package/src/schemas/evaluation.ts +0 -166
- package/src/schemas/index.test.ts +0 -199
- package/src/schemas/index.ts +0 -286
- package/src/schemas/mandate.ts +0 -232
- package/src/schemas/swarm-context.ts +0 -115
- package/src/schemas/task.ts +0 -161
- package/src/schemas/worker-handoff.test.ts +0 -302
- package/src/schemas/worker-handoff.ts +0 -131
- package/src/sessions/agent-discovery.test.ts +0 -137
- package/src/sessions/agent-discovery.ts +0 -112
- package/src/sessions/index.ts +0 -15
- package/src/skills.integration.test.ts +0 -1192
- package/src/skills.test.ts +0 -643
- package/src/skills.ts +0 -1549
- package/src/storage.integration.test.ts +0 -341
- package/src/storage.ts +0 -884
- package/src/structured.integration.test.ts +0 -817
- package/src/structured.test.ts +0 -1046
- package/src/structured.ts +0 -762
- package/src/swarm-decompose.test.ts +0 -188
- package/src/swarm-decompose.ts +0 -1302
- package/src/swarm-deferred.integration.test.ts +0 -157
- package/src/swarm-deferred.test.ts +0 -38
- package/src/swarm-insights.test.ts +0 -214
- package/src/swarm-insights.ts +0 -459
- package/src/swarm-mail.integration.test.ts +0 -970
- package/src/swarm-mail.ts +0 -739
- package/src/swarm-orchestrate.integration.test.ts +0 -282
- package/src/swarm-orchestrate.test.ts +0 -548
- package/src/swarm-orchestrate.ts +0 -3084
- package/src/swarm-prompts.test.ts +0 -1270
- package/src/swarm-prompts.ts +0 -2077
- package/src/swarm-research.integration.test.ts +0 -701
- package/src/swarm-research.test.ts +0 -698
- package/src/swarm-research.ts +0 -472
- package/src/swarm-review.integration.test.ts +0 -285
- package/src/swarm-review.test.ts +0 -879
- package/src/swarm-review.ts +0 -709
- package/src/swarm-strategies.ts +0 -407
- package/src/swarm-worktree.test.ts +0 -501
- package/src/swarm-worktree.ts +0 -575
- package/src/swarm.integration.test.ts +0 -2377
- package/src/swarm.ts +0 -38
- package/src/tool-adapter.integration.test.ts +0 -1221
- package/src/tool-availability.ts +0 -461
- package/tsconfig.json +0 -28
@@ -1,320 +0,0 @@
-# Session Data Quality Audit Report
-
-**Date:** 2025-12-25
-**Cell:** opencode-swarm-plugin--ys7z8-mjlk7jspacf
-**Agent:** WildDawn
-
-## Executive Summary
-
-Investigation of why only 3 of 102 sessions (2.9%) pass the coordinator-session eval filter reveals:
-
-1. **Filter is working as designed** - correctly isolating high-quality complete coordinator sessions
-2. **Data quality is actually GOOD** - the 3 passing sessions are gold-standard examples
-3. **97% filtered out is EXPECTED** - most sessions are worker completions, not coordinator sessions
-4. **Filter may be too strict** for broad coordinator behavior analysis (needs tuning)
-
-## Data Breakdown
-
-### Total Sessions: 102
-
-| Category | Count | % | Description |
-|----------|-------|---|-------------|
-| **Single-event sessions** | 70 | 68.6% | Worker completions (subtask_success), isolated reviews |
-| **Multi-event incomplete** | 29 | 28.4% | Coordinator sessions that didn't complete full cycle |
-| **Passing sessions** | 3 | 2.9% | Complete coordinator cycles with spawn + review |
-
-### Single-Event Sessions (70 sessions - 68.6%)
-
-**Event Type Breakdown:**
-- `OUTCOME/subtask_success`: 56 (80.0%) - **Worker completions, not coordinator sessions**
-- `DECISION/review_completed`: 12 (17.1%) - Isolated review events
-- `DECISION/worker_spawned`: 2 (2.9%) - Isolated spawn events
-
-**Analysis:** These are **NOT coordinator sessions**. They're worker agents reporting completion or isolated coordinator actions captured in separate session files.
-
-### Multi-Event Failures (29 sessions - 28.4%)
-
-**Failure Breakdown:**
-- **No worker_spawned event**: 20 sessions
-  - Review-only sessions (3-22 events, all `review_completed`)
-  - Appears to be test data or session capture split across files
-- **Has worker_spawned but no review_completed**: 5 sessions
-  - Incomplete coordinator sessions (4-24 events)
-  - Coordinator spawned workers but reviews weren't captured (yet)
-- **Too few events (<3)**: 4 sessions
-  - Aborted early
-
-**Key Finding:** None of these 29 sessions have `decomposition_complete` events. This suggests:
-1. Session capture may not be recording decomposition events
-2. OR coordinator sessions span multiple session files
-3. OR these are partial captures from long-running coordinators
-
-### Passing Sessions (3 sessions - 2.9%)
-
-#### ses_4b86f0867ffeXKv95ktf31igfD
-- **Events:** 33
-- **Worker spawns:** 20
-- **Reviews completed:** 13
-- **Violations:** 0
-- **Duration:** 437 minutes (7.3 hours)
-- **Quality:** GOLD STANDARD
-
-#### ses_4ac0f508dffeEcwSQ6OSMWrmWF
-- **Events:** 21
-- **Worker spawns:** 17
-- **Reviews completed:** 4
-- **Duration:** 540 minutes (9.0 hours)
-- **Quality:** GOLD STANDARD
-
-#### ses_4ae8c2f66ffecyfyre7ZQ7y5LW
-- **Events:** 31
-- **Worker spawns:** 24
-- **Reviews completed:** 7
-- **Violations:** 0
-- **Duration:** 368 minutes (6.1 hours)
-- **Quality:** GOLD STANDARD
-
-**Analysis:** These are FULL multi-hour coordinator sessions with extensive worker coordination. They represent the ideal coordinator behavior the eval is designed to measure.
-
-## Current Filter Criteria
-
-```typescript
-{
-  minEvents: 3, // Default
-  requireWorkerSpawn: true, // Default
-  requireReview: true, // Default
-}
-```
-
-### Filter Performance
-
-| Check | Impact |
-|-------|--------|
-| `minEvents >= 3` | Filters out 74 sessions (72.5%) |
-| `requireWorkerSpawn: true` | Filters out 20 additional sessions (19.6%) |
-| `requireReview: true` | Filters out 5 additional sessions (4.9%) |
-
-**Cascade effect:** Each filter compounds, resulting in 2.9% passing rate.
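
The cascade described in the deleted report can be sketched as a chain of checks in which each session is attributed to the first check it fails. This is only an illustrative sketch, not the data loader's code: the `SessionSummary` shape and the `firstFailedCheck` helper are hypothetical, and only the three option names and their default values come from the report.

```typescript
// Hypothetical, simplified session shape; the real CoordinatorSession type is not shown in this diff.
interface SessionSummary {
  eventCount: number;
  hasWorkerSpawn: boolean;
  hasReview: boolean;
}

interface FilterOptions {
  minEvents: number;
  requireWorkerSpawn: boolean;
  requireReview: boolean;
}

// Default values as listed in the report.
const DEFAULT_FILTER: FilterOptions = {
  minEvents: 3,
  requireWorkerSpawn: true,
  requireReview: true,
};

type FailedCheck = "minEvents" | "requireWorkerSpawn" | "requireReview";

// Returns the first check a session fails, or null if it passes every check.
function firstFailedCheck(
  session: SessionSummary,
  opts: FilterOptions = DEFAULT_FILTER,
): FailedCheck | null {
  if (session.eventCount < opts.minEvents) return "minEvents";
  if (opts.requireWorkerSpawn && !session.hasWorkerSpawn) return "requireWorkerSpawn";
  if (opts.requireReview && !session.hasReview) return "requireReview";
  return null;
}
```

Attributing each session to its first failing check is what makes the table rows additive: 74 + 20 + 5 = 99 sessions filtered, leaving 3 of 102.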
-
-## Root Cause Analysis
-
-### Is the Filter Too Strict?
-
-**YES and NO:**
-
-✅ **Working as designed:**
-- Correctly excludes worker-only sessions (80% of single-event data)
-- Correctly excludes incomplete coordinator sessions
-- Isolates high-quality complete coordinator cycles
-
-❌ **Too strict for real-world analysis:**
-- 2.9% passing rate means most coordinator behavior is invisible to the eval
-- Filter assumes coordinators ALWAYS complete full spawn+review cycles
-- Doesn't account for:
-  - Long-running multi-session coordinators
-  - Coordinators that spawn workers but reviews aren't captured yet
-  - Early-stage coordinator sessions (before first spawn)
-
-### Is the Data Quality Low?
-
-**NO.** The data quality is actually GOOD:
-
-- The 3 passing sessions are excellent gold-standard examples
-- They contain rich coordinator behavior (20-24 worker spawns, 4-13 reviews)
-- Zero violations in all 3 sessions
-- Multi-hour timelines showing sustained coordination
-
-The "low passing rate" is a **filter strictness issue**, not a data quality issue.
-
-### Why Only 3/102 Pass?
-
-**Theory 1: Session Capture Splits Long Coordinators**
-- The 3 passing sessions are 6-9 hour marathons
-- Most coordinator work may be happening in shorter bursts
-- Session files might be split by epic_id or time windows
-
-**Evidence:**
-- Some sessions have 20+ `review_completed` events with no `worker_spawned`
-- This suggests reviews from previous spawns in a different session file
-
-**Theory 2: Review Capture Is Incomplete**
-- 5 sessions have `worker_spawned` but no `review_completed`
-- Reviews may be captured in separate session files
-- OR review capture isn't working consistently
-
-**Theory 3: Most Coordinator Sessions Are Short**
-- Only 32/102 sessions (31.4%) have ANY `review_completed` event
-- Only 10/102 sessions (9.8%) have ANY `worker_spawned` event
-- This suggests most captured activity is worker completions, not coordinator cycles
-
-## Recommendations
-
-### 1. Make Filter Parameters Optional (IMMEDIATE)
-
-**Current default:**
-```typescript
-{
-  minEvents: 3,
-  requireWorkerSpawn: true,
-  requireReview: true,
-}
-```
-
-**Recommended default:**
-```typescript
-{
-  minEvents: 3, // Keep - filters out noise
-  requireWorkerSpawn: false, // CHANGE - allow early-stage sessions
-  requireReview: false, // CHANGE - allow incomplete sessions
-}
-```
-
-**Impact:** This would increase passing rate from 3 to ~28 sessions (from 2.9% to 27.5%).
-
-**Rationale:**
-- Captures more coordinator behavior (spawns without reviews)
-- Allows evaluation of early-stage coordination patterns
-- Still filters out single-event worker completions
-- Users can opt-in to stricter filters if needed
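
One way the "lenient by default, strict on request" recommendation could look in code is sketched here. The `resolveFilterOptions` helper and the `SessionFilterOptions` name are hypothetical and are not taken from the package; only the option names and the recommended values come from the report.

```typescript
interface SessionFilterOptions {
  minEvents: number;
  requireWorkerSpawn: boolean;
  requireReview: boolean;
}

// Recommended defaults from the report: keep minEvents, relax the two require* flags.
const RECOMMENDED_DEFAULTS: SessionFilterOptions = {
  minEvents: 3,
  requireWorkerSpawn: false,
  requireReview: false,
};

// Hypothetical helper: callers pass only the options they want to tighten.
function resolveFilterOptions(
  overrides: Partial<SessionFilterOptions> = {},
): SessionFilterOptions {
  return { ...RECOMMENDED_DEFAULTS, ...overrides };
}

// Opting back in to the strict filter for gold-standard analysis.
const strict = resolveFilterOptions({ requireWorkerSpawn: true, requireReview: true });
```

Under this sketch, a caller that passes nothing gets the lenient behavior, and the strict filter remains one explicit call away.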
-
-### 2. Add Session Type Detection (ENHANCEMENT)
-
-Add a filter to exclude worker-only sessions automatically:
-
-```typescript
-function isCoordinatorSession(session: CoordinatorSession): boolean {
-  return session.events.some(e =>
-    e.event_type === "DECISION" &&
-    (e.decision_type === "decomposition_complete" ||
-     e.decision_type === "worker_spawned" ||
-     e.decision_type === "strategy_selected")
-  );
-}
-```
-
-**Impact:** Filters out 70+ worker-only sessions before applying other criteria.
-
-### 3. Investigate Session Capture Splitting (BUG FIX?)
-
-**Symptoms:**
-- Sessions with 22 `review_completed` events but no `worker_spawned`
-- Sessions with 24 `worker_spawned` events but no reviews
-- No `decomposition_complete` events in ANY session (including the 3 passing)
-
-**Hypothesis:** Long-running coordinator sessions may be split across multiple session files.
-
-**Action:** Investigate `eval-capture.ts` to understand:
-- How `session_id` is generated
-- Whether sessions are split by epic_id
-- Whether there's a session timeout that creates new files
-
-### 4. Add Filter Reporting to Data Loader (OBSERVABILITY)
-
-The data loader logs filtered-out count, but doesn't break down WHY sessions failed.
-
-**Enhancement:**
-```typescript
-console.log(`Filtered out ${filteredOutCount} sessions:`);
-console.log(` - Too few events (<${minEvents}): ${stats.tooFewEvents}`);
-console.log(` - No worker_spawned: ${stats.noWorkerSpawn}`);
-console.log(` - No review_completed: ${stats.noReview}`);
-console.log(` - Worker-only sessions: ${stats.workerOnly}`);
-```
-
-This helps users understand filter impact.
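
The report does not show how the `stats` object would be populated. The following is a minimal sketch under stated assumptions: the `SessionFlags` shape and the `collectFilterStats` helper are hypothetical, and each rejection reason is counted independently, so a single session may contribute to more than one row.

```typescript
// Hypothetical, simplified session shape for illustration only.
interface SessionFlags {
  eventCount: number;
  hasWorkerSpawn: boolean;
  hasReview: boolean;
  isCoordinator: boolean; // e.g. the result of a session-type check
}

// Field names mirror the ones used in the logging enhancement.
interface FilterStats {
  tooFewEvents: number;
  noWorkerSpawn: number;
  noReview: number;
  workerOnly: number;
}

// Tallies every reason a session would be rejected.
// Assumption: reasons are not mutually exclusive, so the rows can sum to more than the session count.
function collectFilterStats(sessions: SessionFlags[], minEvents = 3): FilterStats {
  const stats: FilterStats = { tooFewEvents: 0, noWorkerSpawn: 0, noReview: 0, workerOnly: 0 };
  for (const s of sessions) {
    if (s.eventCount < minEvents) stats.tooFewEvents++;
    if (!s.hasWorkerSpawn) stats.noWorkerSpawn++;
    if (!s.hasReview) stats.noReview++;
    if (!s.isCoordinator) stats.workerOnly++;
  }
  return stats;
}
```

If the breakdown is instead meant to sum exactly to the filtered-out count, each session would need to be attributed to a single reason, for example the first check it fails.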
-
-### 5. Consider Separate Evals for Different Session Types
-
-Instead of one eval with strict filters, consider:
-
-**Eval 1: Full Coordinator Cycles** (current behavior)
-- Filters: `minEvents=3, requireWorkerSpawn=true, requireReview=true`
-- Focus: End-to-end coordinator discipline
-- Expected passing rate: ~3% (gold standard only)
-
-**Eval 2: Coordinator Spawning Behavior**
-- Filters: `minEvents=3, requireWorkerSpawn=true, requireReview=false`
-- Focus: How coordinators delegate work
-- Expected passing rate: ~10%
-
-**Eval 3: Coordinator Review Behavior**
-- Filters: `minEvents=3, requireWorkerSpawn=false, requireReview=true`
-- Focus: How coordinators review worker output
-- Expected passing rate: ~31%
-
-**Eval 4: All Coordinator Activity**
-- Filters: `minEvents=3, requireWorkerSpawn=false, requireReview=false, isCoordinatorSession=true`
-- Focus: Broad coordinator behavior patterns
-- Expected passing rate: ~27%
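
The four proposed evals differ only in their filter settings, so they could be expressed as named presets over one options type. This is a sketch, not the package's configuration format: the preset names and the `coordinatorOnly` field (standing in for `isCoordinatorSession=true`) are hypothetical, while the filter values are the ones listed for each eval.

```typescript
interface EvalFilter {
  minEvents: number;
  requireWorkerSpawn: boolean;
  requireReview: boolean;
  coordinatorOnly?: boolean; // stands in for isCoordinatorSession=true in Eval 4
}

// Hypothetical preset names, one per proposed eval.
type PresetName = "fullCycle" | "spawning" | "review" | "allActivity";

const EVAL_PRESETS: Record<PresetName, EvalFilter> = {
  // Eval 1: end-to-end coordinator discipline (current behavior).
  fullCycle: { minEvents: 3, requireWorkerSpawn: true, requireReview: true },
  // Eval 2: how coordinators delegate work.
  spawning: { minEvents: 3, requireWorkerSpawn: true, requireReview: false },
  // Eval 3: how coordinators review worker output.
  review: { minEvents: 3, requireWorkerSpawn: false, requireReview: true },
  // Eval 4: broad coordinator behavior patterns.
  allActivity: {
    minEvents: 3,
    requireWorkerSpawn: false,
    requireReview: false,
    coordinatorOnly: true,
  },
};
```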
-
-## Conclusion
-
-The coordinator-session eval filter is **working as designed**. It successfully isolates high-quality complete coordinator sessions for evaluation.
-
-However, the **2.9% passing rate is too strict** for comprehensive coordinator behavior analysis. The filter should:
-
-1. **Default to more lenient settings** (requireWorkerSpawn=false, requireReview=false)
-2. **Allow users to opt-in** to stricter filters for gold-standard analysis
-3. **Automatically exclude worker-only sessions** via session type detection
-4. **Provide visibility** into why sessions are filtered out
-
-The data quality itself is GOOD. The 3 passing sessions are excellent examples of sustained multi-hour coordinator behavior with extensive worker coordination and zero violations.
-
----
-
-## Appendix: Raw Data
-
-### Event Count Distribution
-
-```
-1 event: 70 sessions (68.6%) - Mostly worker completions
-2 events: 4 sessions (3.9%)
-3 events: 6 sessions (5.9%)
-4 events: 3 sessions (2.9%)
-5 events: 3 sessions (2.9%)
-6 events: 2 sessions (2.0%)
-7 events: 1 session (1.0%)
-9 events: 1 session (1.0%)
-21 events: 1 session (1.0%) ✓ PASSING
-22 events: 5 sessions (4.9%)
-24 events: 1 session (1.0%)
-27 events: 1 session (1.0%)
-30 events: 2 sessions (2.0%)
-31 events: 1 session (1.0%) ✓ PASSING
-33 events: 1 session (1.0%) ✓ PASSING
-```
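
The `minEvents` figures quoted in the report can be re-derived from this distribution. The sketch below copies the histogram from the appendix; the variable and function names are illustrative only.

```typescript
// Event-count histogram copied from the appendix: [events per session, number of sessions].
const histogram: Array<[number, number]> = [
  [1, 70], [2, 4], [3, 6], [4, 3], [5, 3], [6, 2], [7, 1], [9, 1],
  [21, 1], [22, 5], [24, 1], [27, 1], [30, 2], [31, 1], [33, 1],
];

// Total number of sessions across all buckets.
const totalSessions = histogram.reduce((sum, [, sessions]) => sum + sessions, 0);

// Number of sessions that a given minEvents threshold would reject on its own.
function sessionsBelow(minEvents: number): number {
  return histogram
    .filter(([events]) => events < minEvents)
    .reduce((sum, [, sessions]) => sum + sessions, 0);
}

const filteredByMinEvents = sessionsBelow(3); // the 1-event and 2-event buckets
const remainingSessions = totalSessions - filteredByMinEvents;
```

With these numbers the buckets sum to 102 sessions, `minEvents: 3` alone removes 70 + 4 = 74 of them, and 28 remain, which matches the "~28 sessions (27.5%)" estimate given for the relaxed defaults.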
-
-### Sample Worker-Only Sessions
-
-```
-ses_6EraEW6LTRswygMPQa2voC.jsonl (1 event):
-  OUTCOME/subtask_success
-
-ses_xyJ85H9SaA5FSnJvDL7ktJ.jsonl (1 event):
-  OUTCOME/subtask_success
-
-ses_BiqTpFyafkbpt3tvZbh29R.jsonl (1 event):
-  DECISION/review_completed
-```
-
-### Sample Incomplete Coordinator Sessions
-
-```
-ses_4aa1d6e57ffeGfXIoIMNhTQ9JI.jsonl (7 events):
-  DECISION/worker_spawned (x7)
-  → Missing reviews
-
-ses_3t9CP2ZG54wF3D982kZgps.jsonl (3 events):
-  DECISION/review_completed (x3)
-  → Missing spawns
-
-test-review-1766636012605.jsonl (22 events):
-  DECISION/review_completed (x22)
-  → Missing spawns (likely test data)
-```
-
----
-
-**Generated by:** WildDawn (swarm worker agent)
-**Date:** 2025-12-25
-**Files analyzed:** 102 session files from `~/.config/swarm-tools/sessions/`