opencode-swarm-plugin 0.37.0 → 0.39.1
- package/.env +2 -0
- package/.hive/eval-results.json +26 -0
- package/.hive/issues.jsonl +20 -5
- package/.hive/memories.jsonl +35 -1
- package/.opencode/eval-history.jsonl +12 -0
- package/.turbo/turbo-build.log +4 -4
- package/.turbo/turbo-test.log +319 -319
- package/CHANGELOG.md +258 -0
- package/README.md +50 -0
- package/bin/swarm.test.ts +475 -0
- package/bin/swarm.ts +385 -208
- package/dist/compaction-hook.d.ts +1 -1
- package/dist/compaction-hook.d.ts.map +1 -1
- package/dist/compaction-prompt-scoring.d.ts +124 -0
- package/dist/compaction-prompt-scoring.d.ts.map +1 -0
- package/dist/eval-capture.d.ts +81 -1
- package/dist/eval-capture.d.ts.map +1 -1
- package/dist/eval-gates.d.ts +84 -0
- package/dist/eval-gates.d.ts.map +1 -0
- package/dist/eval-history.d.ts +117 -0
- package/dist/eval-history.d.ts.map +1 -0
- package/dist/eval-learning.d.ts +216 -0
- package/dist/eval-learning.d.ts.map +1 -0
- package/dist/hive.d.ts +59 -0
- package/dist/hive.d.ts.map +1 -1
- package/dist/index.d.ts +87 -0
- package/dist/index.d.ts.map +1 -1
- package/dist/index.js +823 -131
- package/dist/plugin.js +655 -131
- package/dist/post-compaction-tracker.d.ts +133 -0
- package/dist/post-compaction-tracker.d.ts.map +1 -0
- package/dist/swarm-decompose.d.ts +30 -0
- package/dist/swarm-decompose.d.ts.map +1 -1
- package/dist/swarm-orchestrate.d.ts +23 -0
- package/dist/swarm-orchestrate.d.ts.map +1 -1
- package/dist/swarm-prompts.d.ts +25 -1
- package/dist/swarm-prompts.d.ts.map +1 -1
- package/dist/swarm.d.ts +19 -0
- package/dist/swarm.d.ts.map +1 -1
- package/evals/README.md +595 -94
- package/evals/compaction-prompt.eval.ts +149 -0
- package/evals/coordinator-behavior.eval.ts +8 -8
- package/evals/fixtures/compaction-prompt-cases.ts +305 -0
- package/evals/lib/compaction-loader.test.ts +248 -0
- package/evals/lib/compaction-loader.ts +320 -0
- package/evals/lib/data-loader.test.ts +345 -0
- package/evals/lib/data-loader.ts +107 -6
- package/evals/scorers/compaction-prompt-scorers.ts +145 -0
- package/evals/scorers/compaction-scorers.ts +13 -13
- package/evals/scorers/coordinator-discipline.evalite-test.ts +3 -2
- package/evals/scorers/coordinator-discipline.ts +13 -13
- package/examples/plugin-wrapper-template.ts +177 -8
- package/package.json +7 -2
- package/scripts/migrate-unknown-sessions.ts +349 -0
- package/src/compaction-capture.integration.test.ts +257 -0
- package/src/compaction-hook.test.ts +139 -2
- package/src/compaction-hook.ts +113 -2
- package/src/compaction-prompt-scorers.test.ts +299 -0
- package/src/compaction-prompt-scoring.ts +298 -0
- package/src/eval-capture.test.ts +422 -0
- package/src/eval-capture.ts +94 -2
- package/src/eval-gates.test.ts +306 -0
- package/src/eval-gates.ts +218 -0
- package/src/eval-history.test.ts +508 -0
- package/src/eval-history.ts +214 -0
- package/src/eval-learning.test.ts +378 -0
- package/src/eval-learning.ts +360 -0
- package/src/index.ts +61 -1
- package/src/post-compaction-tracker.test.ts +251 -0
- package/src/post-compaction-tracker.ts +237 -0
- package/src/swarm-decompose.test.ts +40 -47
- package/src/swarm-decompose.ts +2 -2
- package/src/swarm-orchestrate.test.ts +270 -7
- package/src/swarm-orchestrate.ts +100 -13
- package/src/swarm-prompts.test.ts +121 -0
- package/src/swarm-prompts.ts +297 -4
- package/src/swarm-research.integration.test.ts +157 -0
- package/src/swarm-review.ts +3 -3
- /package/evals/{evalite.config.ts → evalite.config.ts.bak} +0 -0
package/CHANGELOG.md
CHANGED
@@ -1,5 +1,263 @@
 # opencode-swarm-plugin
 
+## 0.39.1
+
+### Patch Changes
+
+- [`19a6557`](https://github.com/joelhooks/swarm-tools/commit/19a6557cee9878858e7f61e2aba86b37a3ec10ad) Thanks [@joelhooks](https://github.com/joelhooks)! - ## 🐝 Eval Quality Gates: Signal Over Noise
+
+The eval system now filters coordinator sessions to focus on high-quality data.
+
+**Problem:** 67 of 82 captured sessions had <3 events - noise from aborted runs, test pokes, and incomplete swarms. This diluted eval scores and made metrics unreliable.
+
+**Solution:** Quality filters applied BEFORE sampling:
+
+| Filter | Default | Purpose |
+| -------------------- | ------- | --------------------------------- |
+| `minEvents` | 3 | Skip incomplete/aborted sessions |
+| `requireWorkerSpawn` | true | Ensure coordinator delegated work |
+| `requireReview` | true | Ensure full swarm lifecycle |
+
+**Impact:**
+
+- Filters 93 noisy sessions automatically
+- Overall eval score: 63% → 71% (true signal, not diluted)
+- Coordinator discipline: 47% → 57% (accurate measurement)
+
+**Usage:**
+
+```typescript
+// Default: high-quality sessions only
+const sessions = await loadCapturedSessions();
+
+// Override for specific analysis
+const allSessions = await loadCapturedSessions({
+  minEvents: 1,
+  requireWorkerSpawn: false,
+  requireReview: false,
+});
+```
+
+Includes 7 unit tests covering filter logic and edge cases.
+
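The filter table in the 0.39.1 entry can be read as a predicate applied to each session before sampling. The sketch below is one possible reading, not the package's implementation: the `CapturedSession` and `SessionEvent` shapes, the event type strings, and the `filterSessions` helper are all hypothetical. Only the three option names and their defaults come from the changelog.

```typescript
// Hypothetical shapes for illustration; the package's real schemas are not shown in this diff.
interface SessionEvent {
  event_type: string; // assumed names, e.g. "worker_spawned", "review_completed"
}

interface CapturedSession {
  session_id: string;
  events: SessionEvent[];
}

interface QualityFilters {
  minEvents: number;
  requireWorkerSpawn: boolean;
  requireReview: boolean;
}

// Defaults copied from the changelog table.
const DEFAULT_FILTERS: QualityFilters = {
  minEvents: 3,
  requireWorkerSpawn: true,
  requireReview: true,
};

// Apply the filters before any sampling step, so noisy sessions never reach the scorers.
function filterSessions(
  sessions: CapturedSession[],
  overrides: Partial<QualityFilters> = {},
): CapturedSession[] {
  const filters = { ...DEFAULT_FILTERS, ...overrides };
  return sessions.filter((session) => {
    if (session.events.length < filters.minEvents) return false;
    const types = new Set(session.events.map((e) => e.event_type));
    if (filters.requireWorkerSpawn && !types.has("worker_spawned")) return false;
    if (filters.requireReview && !types.has("review_completed")) return false;
    return true;
  });
}

const sampleSessions: CapturedSession[] = [
  { session_id: "aborted", events: [{ event_type: "decomposition_started" }] },
  {
    session_id: "complete",
    events: [
      { event_type: "decomposition_started" },
      { event_type: "worker_spawned" },
      { event_type: "review_completed" },
    ],
  },
];

// Default filters keep only the complete session; relaxed filters keep both.
console.log(filterSessions(sampleSessions).map((s) => s.session_id));
console.log(
  filterSessions(sampleSessions, {
    minEvents: 1,
    requireWorkerSpawn: false,
    requireReview: false,
  }).length,
);
```

Merging overrides into a defaults object mirrors the call shape shown in the changelog's usage example, where omitting an option keeps its default.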
+## 0.39.0
+
+### Minor Changes
+
+- [`aa12943`](https://github.com/joelhooks/swarm-tools/commit/aa12943f3edc8d5e23878b22f44073e4c71367c5) Thanks [@joelhooks](https://github.com/joelhooks)! - ## 🐝 Eval-Driven Development: The System That Scores Itself
+
+> "What gets measured gets managed." — Peter Drucker
+> "What gets scored gets improved." — The Swarm
+
+The plugin now evaluates its own output quality through a progressive gate system. Every compaction prompt gets scored, tracked, and learned from. Regressions become impossible to ignore.
+
+### The Pipeline
+
+```
+CAPTURE → SCORE → STORE → GATE → LEARN → IMPROVE
+   ↑                                      ↓
+   └──────────────────────────────────────┘
+```
+
+### What's New
+
+**Event Capture** (5 integration points)
+
+- `detection_triggered` - When compaction is detected
+- `prompt_generated` - Full LLM prompt captured
+- `context_injected` - Final content before injection
+- All events stored to `~/.config/swarm-tools/sessions/{session_id}.jsonl`
+
+**5 Compaction Prompt Scorers**
+
+- `epicIdSpecificity` - Real IDs, not placeholders (20%)
+- `actionability` - Specific tool calls with values (20%)
+- `coordinatorIdentity` - ASCII header + mandates (25%)
+- `forbiddenToolsPresent` - Lists what NOT to do (15%)
+- `postCompactionDiscipline` - First tool is correct (20%)
+
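The five percentages listed for the scorers sum to 100%, which suggests a weighted composite. The sketch below assumes each scorer returns a value in [0, 1] and that the overall score is a plain weighted sum; neither assumption is confirmed by this diff, and `compositeScore` is a hypothetical helper name.

```typescript
// Weights taken from the changelog list; they sum to 1.0.
const SCORER_WEIGHTS = {
  epicIdSpecificity: 0.2,
  actionability: 0.2,
  coordinatorIdentity: 0.25,
  forbiddenToolsPresent: 0.15,
  postCompactionDiscipline: 0.2,
} as const;

type ScorerName = keyof typeof SCORER_WEIGHTS;

// Assumption: each individual scorer yields a number between 0 and 1.
function compositeScore(scores: Record<ScorerName, number>): number {
  let total = 0;
  for (const name of Object.keys(SCORER_WEIGHTS) as ScorerName[]) {
    // Clamp defensively so one out-of-range scorer cannot skew the composite.
    const clamped = Math.min(1, Math.max(0, scores[name]));
    total += SCORER_WEIGHTS[name] * clamped;
  }
  return total;
}

// Example: a prompt with a strong identity header but placeholder epic IDs.
console.log(
  compositeScore({
    epicIdSpecificity: 0,
    actionability: 1,
    coordinatorIdentity: 1,
    forbiddenToolsPresent: 1,
    postCompactionDiscipline: 1,
  }),
);
```

Under this reading, a prompt that fails only `epicIdSpecificity` would land near 0.8, above both thresholds listed in the changelog's gate table.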
+**Progressive Gates**
+| Phase | Threshold | Behavior |
+|-------|-----------|----------|
+| Bootstrap | N/A | Always pass, building baseline |
+| Stabilization | 0.6 | Warn but pass |
+| Production | 0.7 | Fail CI on regression |
+
+**CLI Commands**
+
+```bash
+swarm eval status # Current phase, thresholds, scores
+swarm eval history # Trends with sparklines ▁▂▃▄▅▆▇█
+swarm eval run [--ci] # Execute evals, gate check
+```
+
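The `swarm eval history` comment mentions sparklines built from the eight block characters. How the CLI actually scales scores is not shown in this diff; the sketch below uses ordinary min-max scaling, and the `sparkline` function name is an assumption.

```typescript
// The eight block characters shown in the changelog, lowest to highest.
const BARS = ["▁", "▂", "▃", "▄", "▅", "▆", "▇", "█"];

// Map each value onto one of the eight bars using min-max scaling.
function sparkline(values: number[]): string {
  if (values.length === 0) return "";
  const min = Math.min(...values);
  const max = Math.max(...values);
  const range = max - min;
  return values
    .map((value) => {
      // A flat series has no range; render it as the lowest bar throughout.
      const index =
        range === 0 ? 0 : Math.round(((value - min) / range) * (BARS.length - 1));
      return BARS[index];
    })
    .join("");
}

console.log(sparkline([0.63, 0.65, 0.62, 0.68, 0.71]));
```

Min-max scaling exaggerates small movements in a narrow score band, which suits trend spotting but should not be read as absolute magnitude.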
+**CI Integration**
+
+- Runs after tests pass
+- Posts results as PR comment with emoji status
+- Only fails in production phase with actual regression
+
+**Learning Feedback Loop**
+
+- Significant score drops auto-stored to semantic memory
+- Future agents learn from past failures
+- Pattern maturity tracking
+
+### Breaking Changes
+
+None. All new functionality is additive.
+
+### Files Changed
+
+- `src/eval-capture.ts` - Event capture with Zod schemas
+- `src/eval-gates.ts` - Progressive gate logic
+- `src/eval-history.ts` - Score tracking over time
+- `src/eval-learning.ts` - Failure-to-learning extraction
+- `src/compaction-prompt-scoring.ts` - 5 pure scoring functions
+- `evals/compaction-prompt.eval.ts` - Evalite integration
+- `bin/swarm.ts` - CLI commands
+- `.github/workflows/ci.yml` - CI integration
+
+### Test Coverage
+
+- 422 new tests for eval-capture
+- 48 CLI tests
+- 7 integration tests for capture wiring
+- All existing tests still passing
+
+### Patch Changes
+
+- Updated dependencies [[`aa12943`](https://github.com/joelhooks/swarm-tools/commit/aa12943f3edc8d5e23878b22f44073e4c71367c5)]:
+  - swarm-mail@1.5.2
+
+## 0.38.0
+
+### Minor Changes
+
+- [`41a1965`](https://github.com/joelhooks/swarm-tools/commit/41a19657b252eb1c7a7dc82bc59ab13589e8758f) Thanks [@joelhooks](https://github.com/joelhooks)! - ## 🐝 Coordinators Now Delegate Research to Workers
+
+Coordinators finally know their place. They orchestrate, they don't fetch.
+
+**The Problem:**
+Coordinators were calling `repo-crawl_file`, `webfetch`, `context7_*` directly, burning expensive Sonnet context on raw file contents instead of spawning researcher workers.
+
+**The Fix:**
+
+### Forbidden Tools Section
+
+COORDINATOR_PROMPT now explicitly lists tools coordinators must NEVER call:
+
+- `repo-crawl_*`, `repo-autopsy_*` - repository fetching
+- `webfetch`, `fetch_fetch` - web fetching
+- `context7_*` - library documentation
+- `pdf-brain_search`, `pdf-brain_read` - knowledge base
+
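The forbidden list mixes exact tool names with `*` prefix patterns. The changelog only says the prompt lists these tools; it does not say the plugin checks them in code. As an illustration of how such a list could be matched, for instance inside a discipline scorer, here is a small sketch. `isForbiddenForCoordinator` is a hypothetical helper, and the tool names in the example calls are used only as sample strings.

```typescript
// Patterns copied from the changelog; a trailing "*" means "any tool with this prefix".
const FORBIDDEN_PATTERNS: readonly string[] = [
  "repo-crawl_*",
  "repo-autopsy_*",
  "webfetch",
  "fetch_fetch",
  "context7_*",
  "pdf-brain_search",
  "pdf-brain_read",
];

// Returns true when a tool name matches an exact entry or a prefix pattern.
function isForbiddenForCoordinator(toolName: string): boolean {
  return FORBIDDEN_PATTERNS.some((pattern) =>
    pattern.endsWith("*")
      ? toolName.startsWith(pattern.slice(0, -1))
      : toolName === pattern,
  );
}

console.log(isForbiddenForCoordinator("repo-crawl_file"));
console.log(isForbiddenForCoordinator("swarm_spawn_researcher"));
```

Exact entries stay exact here, so a sibling tool that shares a prefix with `pdf-brain_search` is not flagged unless it is listed.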
+### Phase 1.5: Research Phase
+
+New workflow phase between Initialize and Knowledge Gathering:
+
+```
+swarm_spawn_researcher(
+  research_id="research-nextjs-cache",
+  tech_stack=["Next.js 16 Cache Components"],
+  project_path="/path/to/project"
+)
+```
+
+### Strong Coordinator Identity Post-Compaction
+
+When context compacts, the resuming agent now sees:
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│ 🐝 YOU ARE THE COORDINATOR 🐝 │
+│ NOT A WORKER. NOT AN IMPLEMENTER. │
+│ YOU ORCHESTRATE. │
+└─────────────────────────────────────────────────────────────┘
+```
+
+### runResearchPhase Returns Spawn Instructions
+
+```typescript
+const result = await runResearchPhase(task, projectPath);
+// result.spawn_instructions = [
+//   { research_id, tech, prompt, subagent_type: "swarm/researcher" }
+// ]
+```
+
+**32+ new tests, all 425 passing.**
+
+- [`b06f69b`](https://github.com/joelhooks/swarm-tools/commit/b06f69bc3db099c14f712585d88b42c801123d01) Thanks [@joelhooks](https://github.com/joelhooks)! - ## 🔬 Eval Capture Pipeline: Complete
+
+> "The purpose of computing is insight, not numbers." — Richard Hamming
+
+Wire all eval-capture functions into the swarm execution path, enabling ground-truth collection from real swarm executions.
+
+**What changed:**
+
+| Function | Wired Into | Purpose |
+| ------------------------- | ------------------------------ | ---------------------------------- |
+| `captureDecomposition()` | `swarm_validate_decomposition` | Records task → subtasks mapping |
+| `captureSubtaskOutcome()` | `swarm_complete` | Records per-subtask execution data |
+| `finalizeEvalRecord()` | `swarm_record_outcome` | Computes aggregate metrics |
+
+**New npm scripts:**
+
+```bash
+bun run eval:run # Run all evals
+bun run eval:decomposition # Decomposition quality
+bun run eval:coordinator # Coordinator discipline
+```
+
+**Data flow:**
+
+```
+swarm_decompose → captureDecomposition → .opencode/eval-data.jsonl
+↓
+swarm_complete → captureSubtaskOutcome → updates record with outcomes
+↓
+swarm_record_outcome → finalizeEvalRecord → computes scope_accuracy, time_balance
+↓
+evalite → reads JSONL → scores decomposition quality
+```
+
+**Why it matters:**
+
+- Enables data-driven decomposition strategy selection
+- Tracks which strategies work for which task types
+- Provides ground truth for Evalite evals
+- Foundation for learning from swarm outcomes
+
+**Key discovery:** New cell ID format doesn't follow `epicId.subtaskNum` pattern. Must use `cell.parent_id` to get epic ID for subtasks.
+
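The key discovery above implies that splitting a subtask ID on `.` is no longer a reliable way to find its epic. A minimal sketch of the `parent_id` lookup follows. The `Cell` shape and `resolveEpicId` helper are hypothetical, the sample IDs are invented, and treating a cell without a parent as its own epic is an assumption made for illustration.

```typescript
// Hypothetical cell shape; only `parent_id` is named in the changelog.
interface Cell {
  id: string;
  parent_id?: string | null;
}

// Resolve the epic for a cell without parsing the ID string.
// Assumption: a cell with no parent is itself the epic.
function resolveEpicId(cell: Cell): string {
  return cell.parent_id ?? cell.id;
}

// Invented sample IDs: neither follows an `epicId.subtaskNum` layout.
const epic: Cell = { id: "cell-a1b2c3" };
const subtask: Cell = { id: "cell-d4e5f6", parent_id: "cell-a1b2c3" };

console.log(resolveEpicId(subtask) === resolveEpicId(epic));
```

Keying outcome records by the resolved epic ID lets `captureSubtaskOutcome()`-style updates find the right decomposition record regardless of how subtask IDs are generated.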
+### Patch Changes
+
+- [`56e5d4c`](https://github.com/joelhooks/swarm-tools/commit/56e5d4c5ac96ddd2184d12c63e163bb9c291fb69) Thanks [@joelhooks](https://github.com/joelhooks)! - ## 🔬 Eval Capture Pipeline: Phase 1
+
+> "The first step toward wisdom is getting things right. The second step is getting them wrong in interesting ways." — Marvin Minsky
+
+Wire `captureDecomposition()` into `swarm_validate_decomposition` to record decomposition inputs/outputs for evaluation.
+
+**What changed:**
+
+- `swarm_validate_decomposition` now calls `captureDecomposition()` after successful validation
+- Captures: epicId, projectPath, task, context, strategy, epicTitle, subtasks
+- Data persisted to `.opencode/eval-data.jsonl` for Evalite consumption
+
+**Why it matters:**
+
+- Enables ground-truth collection from real swarm executions
+- Foundation for decomposition quality evals
+- Tracks what strategies work for which task types
+
+**Tests added:**
+
+- Verifies `captureDecomposition` called with correct params on success
+- Verifies NOT called on validation failure
+- Handles optional context/description fields
+
+**Next:** Wire `captureSubtaskOutcome()` and `finalizeEvalRecord()` to complete the pipeline.
+
 ## 0.37.0
 
 ### Minor Changes
package/README.md
CHANGED
@@ -231,6 +231,56 @@ bun test
 bun run typecheck
 ```
 
+### Evaluation Pipeline
+
+Test decomposition quality and coordinator discipline with **Evalite** (TypeScript-native eval framework):
+
+```bash
+# Run all evals
+bun run eval:run
+
+# Run specific suites
+bun run eval:decomposition # Task decomposition quality
+bun run eval:coordinator # Coordinator protocol compliance
+bun run eval:compaction # Compaction prompt quality
+
+# Check eval status (progressive gates)
+swarm eval status [eval-name]
+
+# View history with trends
+swarm eval history
+```
+
+**Progressive Gates:**
+
+```
+Phase           Runs    Gate Behavior
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+Bootstrap       <10     ✅ Always pass (collect data)
+Stabilization   10-50   ⚠️ Warn on >10% regression
+Production      >50     ❌ Fail on >5% regression
+```
+
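The gate table above maps a run count to a phase and a regression size to an outcome. The sketch below encodes that table directly. It assumes regression is measured as a relative drop from a baseline score, and it leaves open how the baseline is computed, since the README excerpt does not say. `phaseForRuns` and `checkGate` are hypothetical names, not exports confirmed by this diff.

```typescript
type Phase = "bootstrap" | "stabilization" | "production";
type GateStatus = "pass" | "warn" | "fail";

interface GateResult {
  phase: Phase;
  status: GateStatus;
  regression: number; // relative drop from baseline, 0 when the score held or improved
}

// Run-count boundaries from the README table: <10, 10-50, >50.
function phaseForRuns(runCount: number): Phase {
  if (runCount < 10) return "bootstrap";
  if (runCount <= 50) return "stabilization";
  return "production";
}

// Assumption: "regression" means (baseline - current) / baseline.
function checkGate(runCount: number, baseline: number, current: number): GateResult {
  const phase = phaseForRuns(runCount);
  const regression =
    baseline > 0 ? Math.max(0, (baseline - current) / baseline) : 0;

  let status: GateStatus = "pass";
  if (phase === "stabilization" && regression > 0.1) status = "warn";
  if (phase === "production" && regression > 0.05) status = "fail";
  return { phase, status, regression };
}

// The same 12.5% drop is ignored, warned about, or fatal depending on the phase.
console.log(checkGate(5, 0.8, 0.7).status);
console.log(checkGate(20, 0.8, 0.7).status);
console.log(checkGate(60, 0.8, 0.7).status);
```

Only the production phase returns `fail`, which matches the changelog's note that CI fails only in production with an actual regression.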
+**What gets evaluated:**
+
+| Eval Suite | Measures | Data Source |
+| --------------------- | ------------------------------------------------------------- | ------------------------------------------------ |
+| `swarm-decomposition` | Subtask independence, complexity balance, coverage, clarity | Fixtures + `.opencode/eval-data.jsonl` |
+| `coordinator-session` | Violation count, spawn efficiency, review thoroughness | `~/.config/swarm-tools/sessions/*.jsonl` |
+| `compaction-prompt` | ID specificity, actionability, identity, forbidden tools | Session compaction events |
+
+**Learning Feedback Loop:**
+
+When eval scores drop >15% from baseline, failure context is automatically stored to semantic memory. Future prompts query these learnings for context.
+
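The 15% rule above can be sketched as a function that compares the latest score with a baseline and returns a record only when the drop is large enough. Everything here beyond the 15% threshold is an assumption: the baseline is taken as the mean of prior runs for the same eval, the drop is relative, and the `EvalRun`, `Learning`, and `extractLearning` names are hypothetical. Storing the result to semantic memory is out of scope for the sketch.

```typescript
interface EvalRun {
  evalName: string;
  score: number;
}

// Hypothetical record that a caller could hand to a memory store.
interface Learning {
  evalName: string;
  baseline: number;
  current: number;
  drop: number; // relative drop, e.g. 0.25 for 25%
  note: string;
}

function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Returns a learning only when the latest score fell more than `threshold` below baseline.
function extractLearning(
  history: EvalRun[],
  latest: EvalRun,
  threshold = 0.15,
): Learning | null {
  const prior = history.filter((run) => run.evalName === latest.evalName);
  const baseline = mean(prior.map((run) => run.score));
  if (baseline <= 0) return null; // no usable baseline yet

  const drop = (baseline - latest.score) / baseline;
  if (drop <= threshold) return null;

  return {
    evalName: latest.evalName,
    baseline,
    current: latest.score,
    drop,
    note: `${latest.evalName} dropped ${(drop * 100).toFixed(1)}% from baseline ${baseline.toFixed(2)}`,
  };
}

const priorRuns: EvalRun[] = [
  { evalName: "compaction-prompt", score: 0.8 },
  { evalName: "compaction-prompt", score: 0.8 },
];

console.log(extractLearning(priorRuns, { evalName: "compaction-prompt", score: 0.6 })?.note);
```

Returning `null` for small drops keeps the memory store free of routine score jitter, which is the point of having a threshold at all.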
+**Data capture locations:**
+- Decomposition inputs/outputs: `.opencode/eval-data.jsonl`
+- Eval history: `.opencode/eval-history.jsonl`
+- Coordinator sessions: `~/.config/swarm-tools/sessions/*.jsonl`
+- Subtask outcomes: swarm-mail database
+
+See **[evals/README.md](./evals/README.md)** for full architecture, scorer details, CI integration, and how to write new evals.
+
 ---
 
 ## CLI