opencode-swarm-plugin 0.37.0 → 0.39.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (79)
  1. package/.env +2 -0
  2. package/.hive/eval-results.json +26 -0
  3. package/.hive/issues.jsonl +20 -5
  4. package/.hive/memories.jsonl +35 -1
  5. package/.opencode/eval-history.jsonl +12 -0
  6. package/.turbo/turbo-build.log +4 -4
  7. package/.turbo/turbo-test.log +319 -319
  8. package/CHANGELOG.md +258 -0
  9. package/README.md +50 -0
  10. package/bin/swarm.test.ts +475 -0
  11. package/bin/swarm.ts +385 -208
  12. package/dist/compaction-hook.d.ts +1 -1
  13. package/dist/compaction-hook.d.ts.map +1 -1
  14. package/dist/compaction-prompt-scoring.d.ts +124 -0
  15. package/dist/compaction-prompt-scoring.d.ts.map +1 -0
  16. package/dist/eval-capture.d.ts +81 -1
  17. package/dist/eval-capture.d.ts.map +1 -1
  18. package/dist/eval-gates.d.ts +84 -0
  19. package/dist/eval-gates.d.ts.map +1 -0
  20. package/dist/eval-history.d.ts +117 -0
  21. package/dist/eval-history.d.ts.map +1 -0
  22. package/dist/eval-learning.d.ts +216 -0
  23. package/dist/eval-learning.d.ts.map +1 -0
  24. package/dist/hive.d.ts +59 -0
  25. package/dist/hive.d.ts.map +1 -1
  26. package/dist/index.d.ts +87 -0
  27. package/dist/index.d.ts.map +1 -1
  28. package/dist/index.js +823 -131
  29. package/dist/plugin.js +655 -131
  30. package/dist/post-compaction-tracker.d.ts +133 -0
  31. package/dist/post-compaction-tracker.d.ts.map +1 -0
  32. package/dist/swarm-decompose.d.ts +30 -0
  33. package/dist/swarm-decompose.d.ts.map +1 -1
  34. package/dist/swarm-orchestrate.d.ts +23 -0
  35. package/dist/swarm-orchestrate.d.ts.map +1 -1
  36. package/dist/swarm-prompts.d.ts +25 -1
  37. package/dist/swarm-prompts.d.ts.map +1 -1
  38. package/dist/swarm.d.ts +19 -0
  39. package/dist/swarm.d.ts.map +1 -1
  40. package/evals/README.md +595 -94
  41. package/evals/compaction-prompt.eval.ts +149 -0
  42. package/evals/coordinator-behavior.eval.ts +8 -8
  43. package/evals/fixtures/compaction-prompt-cases.ts +305 -0
  44. package/evals/lib/compaction-loader.test.ts +248 -0
  45. package/evals/lib/compaction-loader.ts +320 -0
  46. package/evals/lib/data-loader.test.ts +345 -0
  47. package/evals/lib/data-loader.ts +107 -6
  48. package/evals/scorers/compaction-prompt-scorers.ts +145 -0
  49. package/evals/scorers/compaction-scorers.ts +13 -13
  50. package/evals/scorers/coordinator-discipline.evalite-test.ts +3 -2
  51. package/evals/scorers/coordinator-discipline.ts +13 -13
  52. package/examples/plugin-wrapper-template.ts +177 -8
  53. package/package.json +7 -2
  54. package/scripts/migrate-unknown-sessions.ts +349 -0
  55. package/src/compaction-capture.integration.test.ts +257 -0
  56. package/src/compaction-hook.test.ts +139 -2
  57. package/src/compaction-hook.ts +113 -2
  58. package/src/compaction-prompt-scorers.test.ts +299 -0
  59. package/src/compaction-prompt-scoring.ts +298 -0
  60. package/src/eval-capture.test.ts +422 -0
  61. package/src/eval-capture.ts +94 -2
  62. package/src/eval-gates.test.ts +306 -0
  63. package/src/eval-gates.ts +218 -0
  64. package/src/eval-history.test.ts +508 -0
  65. package/src/eval-history.ts +214 -0
  66. package/src/eval-learning.test.ts +378 -0
  67. package/src/eval-learning.ts +360 -0
  68. package/src/index.ts +61 -1
  69. package/src/post-compaction-tracker.test.ts +251 -0
  70. package/src/post-compaction-tracker.ts +237 -0
  71. package/src/swarm-decompose.test.ts +40 -47
  72. package/src/swarm-decompose.ts +2 -2
  73. package/src/swarm-orchestrate.test.ts +270 -7
  74. package/src/swarm-orchestrate.ts +100 -13
  75. package/src/swarm-prompts.test.ts +121 -0
  76. package/src/swarm-prompts.ts +297 -4
  77. package/src/swarm-research.integration.test.ts +157 -0
  78. package/src/swarm-review.ts +3 -3
  79. package/evals/{evalite.config.ts → evalite.config.ts.bak} +0 -0
package/CHANGELOG.md CHANGED
@@ -1,5 +1,263 @@
  # opencode-swarm-plugin
 
+ ## 0.39.1
+
+ ### Patch Changes
+
+ - [`19a6557`](https://github.com/joelhooks/swarm-tools/commit/19a6557cee9878858e7f61e2aba86b37a3ec10ad) Thanks [@joelhooks](https://github.com/joelhooks)! - ## 🐝 Eval Quality Gates: Signal Over Noise
+
+ The eval system now filters coordinator sessions to focus on high-quality data.
+
+ **Problem:** 67 of 82 captured sessions had <3 events - noise from aborted runs, test pokes, and incomplete swarms. This diluted eval scores and made metrics unreliable.
+
+ **Solution:** Quality filters applied BEFORE sampling:
+
+ | Filter | Default | Purpose |
+ | -------------------- | ------- | --------------------------------- |
+ | `minEvents` | 3 | Skip incomplete/aborted sessions |
+ | `requireWorkerSpawn` | true | Ensure coordinator delegated work |
+ | `requireReview` | true | Ensure full swarm lifecycle |
+
+ **Impact:**
+
+ - Filters 93 noisy sessions automatically
+ - Overall eval score: 63% → 71% (true signal, not diluted)
+ - Coordinator discipline: 47% → 57% (accurate measurement)
+
+ **Usage:**
+
+ ```typescript
+ // Default: high-quality sessions only
+ const sessions = await loadCapturedSessions();
+
+ // Override for specific analysis
+ const allSessions = await loadCapturedSessions({
+   minEvents: 1,
+   requireWorkerSpawn: false,
+   requireReview: false,
+ });
+ ```
+
+ Includes 7 unit tests covering filter logic and edge cases.
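Reviewer note: the filter table in this changelog entry can be read as a simple predicate applied to each session before sampling. The following is a minimal sketch of that idea, not the plugin's implementation. The `CapturedSession` and `SessionEvent` shapes, the event type names `worker_spawned` and `review_completed`, and the helpers `passesQualityFilters` and `filterSessions` are all assumptions made for illustration; only the three filter names and their defaults come from the changelog.

```typescript
// Hypothetical shapes: the plugin's real session and event schemas may differ.
interface SessionEvent {
  type: string;
}

interface CapturedSession {
  sessionId: string;
  events: SessionEvent[];
}

// Filter names and defaults come from the changelog table.
interface QualityFilters {
  minEvents: number;
  requireWorkerSpawn: boolean;
  requireReview: boolean;
}

const DEFAULT_FILTERS: QualityFilters = {
  minEvents: 3,
  requireWorkerSpawn: true,
  requireReview: true,
};

// "worker_spawned" and "review_completed" are assumed event type names.
function passesQualityFilters(
  session: CapturedSession,
  overrides: Partial<QualityFilters> = {},
): boolean {
  const filters = { ...DEFAULT_FILTERS, ...overrides };
  const hasEvent = (type: string) => session.events.some((e) => e.type === type);

  if (session.events.length < filters.minEvents) return false;
  if (filters.requireWorkerSpawn && !hasEvent("worker_spawned")) return false;
  if (filters.requireReview && !hasEvent("review_completed")) return false;
  return true;
}

// Filtering runs before any sampling, so noisy sessions never reach the scorers.
function filterSessions(
  sessions: CapturedSession[],
  overrides: Partial<QualityFilters> = {},
): CapturedSession[] {
  return sessions.filter((s) => passesQualityFilters(s, overrides));
}
```

Passing `{ minEvents: 1, requireWorkerSpawn: false, requireReview: false }` as the overrides mirrors the "override for specific analysis" call in the changelog and lets every non-empty session through.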
+
+ ## 0.39.0
+
+ ### Minor Changes
+
+ - [`aa12943`](https://github.com/joelhooks/swarm-tools/commit/aa12943f3edc8d5e23878b22f44073e4c71367c5) Thanks [@joelhooks](https://github.com/joelhooks)! - ## 🐝 Eval-Driven Development: The System That Scores Itself
+
+ > "What gets measured gets managed." — Peter Drucker
+ > "What gets scored gets improved." — The Swarm
+
+ The plugin now evaluates its own output quality through a progressive gate system. Every compaction prompt gets scored, tracked, and learned from. Regressions become impossible to ignore.
+
+ ### The Pipeline
+
+ ```
+ CAPTURE → SCORE → STORE → GATE → LEARN → IMPROVE
+    ↑                                      ↓
+    └──────────────────────────────────────┘
+ ```
+
+ ### What's New
+
+ **Event Capture** (5 integration points)
+
+ - `detection_triggered` - When compaction is detected
+ - `prompt_generated` - Full LLM prompt captured
+ - `context_injected` - Final content before injection
+ - All events stored to `~/.config/swarm-tools/sessions/{session_id}.jsonl`
+
+ **5 Compaction Prompt Scorers**
+
+ - `epicIdSpecificity` - Real IDs, not placeholders (20%)
+ - `actionability` - Specific tool calls with values (20%)
+ - `coordinatorIdentity` - ASCII header + mandates (25%)
+ - `forbiddenToolsPresent` - Lists what NOT to do (15%)
+ - `postCompactionDiscipline` - First tool is correct (20%)
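Reviewer note: the five percentages listed for the scorers sum to 100%, which suggests they act as weights in a combined score. The changelog does not say how the scores are combined, so the sketch below assumes a plain weighted sum over per-scorer values in [0, 1]. `SCORER_WEIGHTS` and `compositeScore` are hypothetical names; only the scorer names and percentages are taken from the list above.

```typescript
// Scorer names and weights are taken from the changelog list.
const SCORER_WEIGHTS = {
  epicIdSpecificity: 0.2,
  actionability: 0.2,
  coordinatorIdentity: 0.25,
  forbiddenToolsPresent: 0.15,
  postCompactionDiscipline: 0.2,
} as const;

type ScorerName = keyof typeof SCORER_WEIGHTS;
type ScorerResults = Record<ScorerName, number>;

// Assumption: each scorer yields a value in [0, 1] and the composite is a weighted sum.
function compositeScore(results: ScorerResults): number {
  let total = 0;
  for (const name of Object.keys(SCORER_WEIGHTS) as ScorerName[]) {
    // Clamp defensively so one out-of-range scorer cannot dominate the total.
    const clamped = Math.min(1, Math.max(0, results[name]));
    total += SCORER_WEIGHTS[name] * clamped;
  }
  return total;
}
```

Under this assumption, a prompt that only gets the coordinator identity header right would score 0.25 overall, well below either gate threshold mentioned in this release.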
+
+ **Progressive Gates**
+ | Phase | Threshold | Behavior |
+ |-------|-----------|----------|
+ | Bootstrap | N/A | Always pass, building baseline |
+ | Stabilization | 0.6 | Warn but pass |
+ | Production | 0.7 | Fail CI on regression |
+
+ **CLI Commands**
+
+ ```bash
+ swarm eval status      # Current phase, thresholds, scores
+ swarm eval history     # Trends with sparklines ▁▂▃▄▅▆▇█
+ swarm eval run [--ci]  # Execute evals, gate check
+ ```
+
+ **CI Integration**
+
+ - Runs after tests pass
+ - Posts results as PR comment with emoji status
+ - Only fails in production phase with actual regression
+
+ **Learning Feedback Loop**
+
+ - Significant score drops auto-stored to semantic memory
+ - Future agents learn from past failures
+ - Pattern maturity tracking
+
+ ### Breaking Changes
+
+ None. All new functionality is additive.
+
+ ### Files Changed
+
+ - `src/eval-capture.ts` - Event capture with Zod schemas
+ - `src/eval-gates.ts` - Progressive gate logic
+ - `src/eval-history.ts` - Score tracking over time
+ - `src/eval-learning.ts` - Failure-to-learning extraction
+ - `src/compaction-prompt-scoring.ts` - 5 pure scoring functions
+ - `evals/compaction-prompt.eval.ts` - Evalite integration
+ - `bin/swarm.ts` - CLI commands
+ - `.github/workflows/ci.yml` - CI integration
+
+ ### Test Coverage
+
+ - 422 new tests for eval-capture
+ - 48 CLI tests
+ - 7 integration tests for capture wiring
+ - All existing tests still passing
+
+ ### Patch Changes
+
+ - Updated dependencies [[`aa12943`](https://github.com/joelhooks/swarm-tools/commit/aa12943f3edc8d5e23878b22f44073e4c71367c5)]:
+   - swarm-mail@1.5.2
+
+ ## 0.38.0
+
+ ### Minor Changes
+
+ - [`41a1965`](https://github.com/joelhooks/swarm-tools/commit/41a19657b252eb1c7a7dc82bc59ab13589e8758f) Thanks [@joelhooks](https://github.com/joelhooks)! - ## 🐝 Coordinators Now Delegate Research to Workers
+
+ Coordinators finally know their place. They orchestrate, they don't fetch.
+
+ **The Problem:**
+ Coordinators were calling `repo-crawl_file`, `webfetch`, `context7_*` directly, burning expensive Sonnet context on raw file contents instead of spawning researcher workers.
+
+ **The Fix:**
+
+ ### Forbidden Tools Section
+
+ COORDINATOR_PROMPT now explicitly lists tools coordinators must NEVER call:
+
+ - `repo-crawl_*`, `repo-autopsy_*` - repository fetching
+ - `webfetch`, `fetch_fetch` - web fetching
+ - `context7_*` - library documentation
+ - `pdf-brain_search`, `pdf-brain_read` - knowledge base
+
+ ### Phase 1.5: Research Phase
+
+ New workflow phase between Initialize and Knowledge Gathering:
+
+ ```
+ swarm_spawn_researcher(
+   research_id="research-nextjs-cache",
+   tech_stack=["Next.js 16 Cache Components"],
+   project_path="/path/to/project"
+ )
+ ```
+
+ ### Strong Coordinator Identity Post-Compaction
+
+ When context compacts, the resuming agent now sees:
+
+ ```
+ ┌─────────────────────────────────────────────────────────────┐
+ │                🐝 YOU ARE THE COORDINATOR 🐝                │
+ │              NOT A WORKER. NOT AN IMPLEMENTER.              │
+ │                      YOU ORCHESTRATE.                       │
+ └─────────────────────────────────────────────────────────────┘
+ ```
+
+ ### runResearchPhase Returns Spawn Instructions
+
+ ```typescript
+ const result = await runResearchPhase(task, projectPath);
+ // result.spawn_instructions = [
+ //   { research_id, tech, prompt, subagent_type: "swarm/researcher" }
+ // ]
+ ```
+
+ **32+ new tests, all 425 passing.**
+
+ - [`b06f69b`](https://github.com/joelhooks/swarm-tools/commit/b06f69bc3db099c14f712585d88b42c801123d01) Thanks [@joelhooks](https://github.com/joelhooks)! - ## 🔬 Eval Capture Pipeline: Complete
+
+ > "The purpose of computing is insight, not numbers." — Richard Hamming
+
+ Wire all eval-capture functions into the swarm execution path, enabling ground-truth collection from real swarm executions.
+
+ **What changed:**
+
+ | Function | Wired Into | Purpose |
+ | ------------------------- | ------------------------------ | ---------------------------------- |
+ | `captureDecomposition()` | `swarm_validate_decomposition` | Records task → subtasks mapping |
+ | `captureSubtaskOutcome()` | `swarm_complete` | Records per-subtask execution data |
+ | `finalizeEvalRecord()` | `swarm_record_outcome` | Computes aggregate metrics |
+
+ **New npm scripts:**
+
+ ```bash
+ bun run eval:run            # Run all evals
+ bun run eval:decomposition  # Decomposition quality
+ bun run eval:coordinator    # Coordinator discipline
+ ```
+
+ **Data flow:**
+
+ ```
+ swarm_decompose → captureDecomposition → .opencode/eval-data.jsonl
+
+ swarm_complete → captureSubtaskOutcome → updates record with outcomes
+
+ swarm_record_outcome → finalizeEvalRecord → computes scope_accuracy, time_balance
+
+ evalite → reads JSONL → scores decomposition quality
+ ```
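Reviewer note: the last step of the data flow above has an eval runner reading `.opencode/eval-data.jsonl`. A minimal sketch of that read step follows, assuming only the usual JSONL framing of one JSON object per line. The `EvalRecord` shape and the `parseJsonl` helper are hypothetical and are not taken from the package.

```typescript
// Hypothetical record shape; only the JSONL framing (one JSON object per line) is assumed.
interface EvalRecord {
  id: string;
  [key: string]: unknown;
}

function parseJsonl(text: string): EvalRecord[] {
  const records: EvalRecord[] = [];
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // tolerate blank lines and a trailing newline
    try {
      records.push(JSON.parse(trimmed) as EvalRecord);
    } catch {
      // Skip lines that are not valid JSON, e.g. a partially written final line.
    }
  }
  return records;
}
```

Skipping malformed lines rather than throwing is a design choice for this sketch: a capture file that is appended to during a run may end in a half-written record, and one bad line should not discard the rest of the data.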
+
+ **Why it matters:**
+
+ - Enables data-driven decomposition strategy selection
+ - Tracks which strategies work for which task types
+ - Provides ground truth for Evalite evals
+ - Foundation for learning from swarm outcomes
+
+ **Key discovery:** New cell ID format doesn't follow `epicId.subtaskNum` pattern. Must use `cell.parent_id` to get epic ID for subtasks.
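Reviewer note: the key discovery above means an epic ID can no longer be derived by splitting a subtask ID on `.`. A minimal sketch of the lookup it describes is shown here. The `Cell` shape and the `resolveEpicId` helper are assumptions for illustration; only the `parent_id` field is named in the changelog.

```typescript
// Hypothetical cell shape; only `parent_id` is named in the changelog.
interface Cell {
  id: string;
  parent_id?: string | null;
}

// Assumption: a cell without a parent is itself the epic.
function resolveEpicId(cell: Cell): string {
  return cell.parent_id ?? cell.id;
}
```

For a subtask such as `{ id: "cell-abc", parent_id: "epic-xyz" }` this returns the parent's ID regardless of how the subtask's own ID is formatted.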
+
+ ### Patch Changes
+
+ - [`56e5d4c`](https://github.com/joelhooks/swarm-tools/commit/56e5d4c5ac96ddd2184d12c63e163bb9c291fb69) Thanks [@joelhooks](https://github.com/joelhooks)! - ## 🔬 Eval Capture Pipeline: Phase 1
+
+ > "The first step toward wisdom is getting things right. The second step is getting them wrong in interesting ways." — Marvin Minsky
+
+ Wire `captureDecomposition()` into `swarm_validate_decomposition` to record decomposition inputs/outputs for evaluation.
+
+ **What changed:**
+
+ - `swarm_validate_decomposition` now calls `captureDecomposition()` after successful validation
+ - Captures: epicId, projectPath, task, context, strategy, epicTitle, subtasks
+ - Data persisted to `.opencode/eval-data.jsonl` for Evalite consumption
+
+ **Why it matters:**
+
+ - Enables ground-truth collection from real swarm executions
+ - Foundation for decomposition quality evals
+ - Tracks what strategies work for which task types
+
+ **Tests added:**
+
+ - Verifies `captureDecomposition` called with correct params on success
+ - Verifies NOT called on validation failure
+ - Handles optional context/description fields
+
+ **Next:** Wire `captureSubtaskOutcome()` and `finalizeEvalRecord()` to complete the pipeline.
+
  ## 0.37.0
 
  ### Minor Changes
package/README.md CHANGED
@@ -231,6 +231,56 @@ bun test
  bun run typecheck
  ```
 
+ ### Evaluation Pipeline
+
+ Test decomposition quality and coordinator discipline with **Evalite** (TypeScript-native eval framework):
+
+ ```bash
+ # Run all evals
+ bun run eval:run
+
+ # Run specific suites
+ bun run eval:decomposition  # Task decomposition quality
+ bun run eval:coordinator    # Coordinator protocol compliance
+ bun run eval:compaction     # Compaction prompt quality
+
+ # Check eval status (progressive gates)
+ swarm eval status [eval-name]
+
+ # View history with trends
+ swarm eval history
+ ```
+
+ **Progressive Gates:**
+
+ ```
+ Phase           Runs    Gate Behavior
+ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+ Bootstrap       <10     ✅ Always pass (collect data)
+ Stabilization   10-50   ⚠️ Warn on >10% regression
+ Production      >50     ❌ Fail on >5% regression
+ ```
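Reviewer note: the gate table in this README hunk maps a run count to a phase and a regression tolerance. The sketch below shows one way that decision could be expressed. It is not the code in `src/eval-gates.ts`. The names `phaseForRunCount`, `relativeRegression`, and `checkGate` are hypothetical, and treating "regression" as the relative drop from a baseline score is an assumption, since the README does not define how the baseline is computed.

```typescript
type Phase = "bootstrap" | "stabilization" | "production";
type GateStatus = "pass" | "warn" | "fail";

// Run-count boundaries follow the README table: <10, 10-50, >50.
function phaseForRunCount(runs: number): Phase {
  if (runs < 10) return "bootstrap";
  if (runs <= 50) return "stabilization";
  return "production";
}

// Assumption: regression is the relative drop from a baseline score.
function relativeRegression(baseline: number, current: number): number {
  if (baseline <= 0) return 0; // no meaningful baseline yet
  return Math.max(0, (baseline - current) / baseline);
}

function checkGate(runs: number, baseline: number, current: number): GateStatus {
  const phase = phaseForRunCount(runs);
  const regression = relativeRegression(baseline, current);

  if (phase === "stabilization" && regression > 0.10) return "warn";
  if (phase === "production" && regression > 0.05) return "fail";
  return "pass"; // bootstrap always passes; other phases pass within tolerance
}
```

With these assumptions, the same drop from 0.80 to 0.70 passes silently during bootstrap, produces a warning at 20 runs, and fails the gate at 60 runs.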
+
+ **What gets evaluated:**
+
+ | Eval Suite | Measures | Data Source |
+ | --------------------- | ------------------------------------------------------------- | ------------------------------------------------ |
+ | `swarm-decomposition` | Subtask independence, complexity balance, coverage, clarity | Fixtures + `.opencode/eval-data.jsonl` |
+ | `coordinator-session` | Violation count, spawn efficiency, review thoroughness | `~/.config/swarm-tools/sessions/*.jsonl` |
+ | `compaction-prompt` | ID specificity, actionability, identity, forbidden tools | Session compaction events |
+
+ **Learning Feedback Loop:**
+
+ When eval scores drop >15% from baseline, failure context is automatically stored to semantic memory. Future prompts query these learnings for context.
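Reviewer note: the 15% trigger described above can be sketched as a small pure function that decides whether a score drop is worth recording. This is an illustration only. The `FailureLearning` shape and the `extractLearning` helper are hypothetical, the drop is assumed to be measured relative to the baseline, and the actual write to semantic memory is deliberately left out because its API is not shown in this diff.

```typescript
// Hypothetical shape of what gets stored; the plugin's actual memory format is not shown here.
interface FailureLearning {
  evalName: string;
  baseline: number;
  current: number;
  dropPercent: number;
}

const DROP_THRESHOLD = 0.15; // ">15% from baseline", per the README

// Assumption: the drop is measured relative to the baseline score.
function extractLearning(
  evalName: string,
  baseline: number,
  current: number,
): FailureLearning | null {
  if (baseline <= 0) return null; // nothing to compare against
  const drop = (baseline - current) / baseline;
  if (drop <= DROP_THRESHOLD) return null; // within tolerance, nothing to record
  return { evalName, baseline, current, dropPercent: Math.round(drop * 100) };
}
```

A caller would store the returned record only when it is non-null, which keeps the threshold decision separate from the storage mechanism.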
+
+ **Data capture locations:**
+ - Decomposition inputs/outputs: `.opencode/eval-data.jsonl`
+ - Eval history: `.opencode/eval-history.jsonl`
+ - Coordinator sessions: `~/.config/swarm-tools/sessions/*.jsonl`
+ - Subtask outcomes: swarm-mail database
+
+ See **[evals/README.md](./evals/README.md)** for full architecture, scorer details, CI integration, and how to write new evals.
+
  ---
 
  ## CLI