opencode-swarm-plugin 0.43.0 → 0.44.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/cass.characterization.test.ts +422 -0
- package/bin/swarm.serve.test.ts +6 -4
- package/bin/swarm.test.ts +68 -0
- package/bin/swarm.ts +81 -8
- package/dist/compaction-prompt-scoring.js +139 -0
- package/dist/contributor-tools.d.ts +42 -0
- package/dist/contributor-tools.d.ts.map +1 -0
- package/dist/eval-capture.js +12811 -0
- package/dist/hive.d.ts.map +1 -1
- package/dist/index.d.ts +12 -0
- package/dist/index.d.ts.map +1 -1
- package/dist/index.js +7728 -62590
- package/dist/plugin.js +23833 -78695
- package/dist/sessions/agent-discovery.d.ts +59 -0
- package/dist/sessions/agent-discovery.d.ts.map +1 -0
- package/dist/sessions/index.d.ts +10 -0
- package/dist/sessions/index.d.ts.map +1 -0
- package/dist/swarm-orchestrate.d.ts.map +1 -1
- package/dist/swarm-prompts.d.ts.map +1 -1
- package/dist/swarm-review.d.ts.map +1 -1
- package/package.json +17 -5
- package/.changeset/swarm-insights-data-layer.md +0 -63
- package/.hive/analysis/eval-failure-analysis-2025-12-25.md +0 -331
- package/.hive/analysis/session-data-quality-audit.md +0 -320
- package/.hive/eval-results.json +0 -483
- package/.hive/issues.jsonl +0 -138
- package/.hive/memories.jsonl +0 -729
- package/.opencode/eval-history.jsonl +0 -327
- package/.turbo/turbo-build.log +0 -9
- package/CHANGELOG.md +0 -2255
- package/SCORER-ANALYSIS.md +0 -598
- package/docs/analysis/subagent-coordination-patterns.md +0 -902
- package/docs/analysis-socratic-planner-pattern.md +0 -504
- package/docs/planning/ADR-001-monorepo-structure.md +0 -171
- package/docs/planning/ADR-002-package-extraction.md +0 -393
- package/docs/planning/ADR-003-performance-improvements.md +0 -451
- package/docs/planning/ADR-004-message-queue-features.md +0 -187
- package/docs/planning/ADR-005-devtools-observability.md +0 -202
- package/docs/planning/ADR-007-swarm-enhancements-worktree-review.md +0 -168
- package/docs/planning/ADR-008-worker-handoff-protocol.md +0 -293
- package/docs/planning/ADR-009-oh-my-opencode-patterns.md +0 -353
- package/docs/planning/ROADMAP.md +0 -368
- package/docs/semantic-memory-cli-syntax.md +0 -123
- package/docs/swarm-mail-architecture.md +0 -1147
- package/docs/testing/context-recovery-test.md +0 -470
- package/evals/ARCHITECTURE.md +0 -1189
- package/evals/README.md +0 -768
- package/evals/compaction-prompt.eval.ts +0 -149
- package/evals/compaction-resumption.eval.ts +0 -289
- package/evals/coordinator-behavior.eval.ts +0 -307
- package/evals/coordinator-session.eval.ts +0 -154
- package/evals/evalite.config.ts.bak +0 -15
- package/evals/example.eval.ts +0 -31
- package/evals/fixtures/compaction-cases.ts +0 -350
- package/evals/fixtures/compaction-prompt-cases.ts +0 -311
- package/evals/fixtures/coordinator-sessions.ts +0 -328
- package/evals/fixtures/decomposition-cases.ts +0 -105
- package/evals/lib/compaction-loader.test.ts +0 -248
- package/evals/lib/compaction-loader.ts +0 -320
- package/evals/lib/data-loader.evalite-test.ts +0 -289
- package/evals/lib/data-loader.test.ts +0 -345
- package/evals/lib/data-loader.ts +0 -281
- package/evals/lib/llm.ts +0 -115
- package/evals/scorers/compaction-prompt-scorers.ts +0 -145
- package/evals/scorers/compaction-scorers.ts +0 -305
- package/evals/scorers/coordinator-discipline.evalite-test.ts +0 -539
- package/evals/scorers/coordinator-discipline.ts +0 -325
- package/evals/scorers/index.test.ts +0 -146
- package/evals/scorers/index.ts +0 -328
- package/evals/scorers/outcome-scorers.evalite-test.ts +0 -27
- package/evals/scorers/outcome-scorers.ts +0 -349
- package/evals/swarm-decomposition.eval.ts +0 -121
- package/examples/commands/swarm.md +0 -745
- package/examples/plugin-wrapper-template.ts +0 -2426
- package/examples/skills/hive-workflow/SKILL.md +0 -212
- package/examples/skills/skill-creator/SKILL.md +0 -223
- package/examples/skills/swarm-coordination/SKILL.md +0 -292
- package/global-skills/cli-builder/SKILL.md +0 -344
- package/global-skills/cli-builder/references/advanced-patterns.md +0 -244
- package/global-skills/learning-systems/SKILL.md +0 -644
- package/global-skills/skill-creator/LICENSE.txt +0 -202
- package/global-skills/skill-creator/SKILL.md +0 -352
- package/global-skills/skill-creator/references/output-patterns.md +0 -82
- package/global-skills/skill-creator/references/workflows.md +0 -28
- package/global-skills/swarm-coordination/SKILL.md +0 -995
- package/global-skills/swarm-coordination/references/coordinator-patterns.md +0 -235
- package/global-skills/swarm-coordination/references/strategies.md +0 -138
- package/global-skills/system-design/SKILL.md +0 -213
- package/global-skills/testing-patterns/SKILL.md +0 -430
- package/global-skills/testing-patterns/references/dependency-breaking-catalog.md +0 -586
- package/opencode-swarm-plugin-0.30.7.tgz +0 -0
- package/opencode-swarm-plugin-0.31.0.tgz +0 -0
- package/scripts/cleanup-test-memories.ts +0 -346
- package/scripts/init-skill.ts +0 -222
- package/scripts/migrate-unknown-sessions.ts +0 -349
- package/scripts/validate-skill.ts +0 -204
- package/src/agent-mail.ts +0 -1724
- package/src/anti-patterns.test.ts +0 -1167
- package/src/anti-patterns.ts +0 -448
- package/src/compaction-capture.integration.test.ts +0 -257
- package/src/compaction-hook.test.ts +0 -838
- package/src/compaction-hook.ts +0 -1204
- package/src/compaction-observability.integration.test.ts +0 -139
- package/src/compaction-observability.test.ts +0 -187
- package/src/compaction-observability.ts +0 -324
- package/src/compaction-prompt-scorers.test.ts +0 -475
- package/src/compaction-prompt-scoring.ts +0 -300
- package/src/dashboard.test.ts +0 -611
- package/src/dashboard.ts +0 -462
- package/src/error-enrichment.test.ts +0 -403
- package/src/error-enrichment.ts +0 -219
- package/src/eval-capture.test.ts +0 -1015
- package/src/eval-capture.ts +0 -929
- package/src/eval-gates.test.ts +0 -306
- package/src/eval-gates.ts +0 -218
- package/src/eval-history.test.ts +0 -508
- package/src/eval-history.ts +0 -214
- package/src/eval-learning.test.ts +0 -378
- package/src/eval-learning.ts +0 -360
- package/src/eval-runner.test.ts +0 -223
- package/src/eval-runner.ts +0 -402
- package/src/export-tools.test.ts +0 -476
- package/src/export-tools.ts +0 -257
- package/src/hive.integration.test.ts +0 -2241
- package/src/hive.ts +0 -1628
- package/src/index.ts +0 -935
- package/src/learning.integration.test.ts +0 -1815
- package/src/learning.ts +0 -1079
- package/src/logger.test.ts +0 -189
- package/src/logger.ts +0 -135
- package/src/mandate-promotion.test.ts +0 -473
- package/src/mandate-promotion.ts +0 -239
- package/src/mandate-storage.integration.test.ts +0 -601
- package/src/mandate-storage.test.ts +0 -578
- package/src/mandate-storage.ts +0 -794
- package/src/mandates.ts +0 -540
- package/src/memory-tools.test.ts +0 -195
- package/src/memory-tools.ts +0 -344
- package/src/memory.integration.test.ts +0 -334
- package/src/memory.test.ts +0 -158
- package/src/memory.ts +0 -527
- package/src/model-selection.test.ts +0 -188
- package/src/model-selection.ts +0 -68
- package/src/observability-tools.test.ts +0 -359
- package/src/observability-tools.ts +0 -871
- package/src/output-guardrails.test.ts +0 -438
- package/src/output-guardrails.ts +0 -381
- package/src/pattern-maturity.test.ts +0 -1160
- package/src/pattern-maturity.ts +0 -525
- package/src/planning-guardrails.test.ts +0 -491
- package/src/planning-guardrails.ts +0 -438
- package/src/plugin.ts +0 -23
- package/src/post-compaction-tracker.test.ts +0 -251
- package/src/post-compaction-tracker.ts +0 -237
- package/src/query-tools.test.ts +0 -636
- package/src/query-tools.ts +0 -324
- package/src/rate-limiter.integration.test.ts +0 -466
- package/src/rate-limiter.ts +0 -774
- package/src/replay-tools.test.ts +0 -496
- package/src/replay-tools.ts +0 -240
- package/src/repo-crawl.integration.test.ts +0 -441
- package/src/repo-crawl.ts +0 -610
- package/src/schemas/cell-events.test.ts +0 -347
- package/src/schemas/cell-events.ts +0 -807
- package/src/schemas/cell.ts +0 -257
- package/src/schemas/evaluation.ts +0 -166
- package/src/schemas/index.test.ts +0 -199
- package/src/schemas/index.ts +0 -286
- package/src/schemas/mandate.ts +0 -232
- package/src/schemas/swarm-context.ts +0 -115
- package/src/schemas/task.ts +0 -161
- package/src/schemas/worker-handoff.test.ts +0 -302
- package/src/schemas/worker-handoff.ts +0 -131
- package/src/skills.integration.test.ts +0 -1192
- package/src/skills.test.ts +0 -643
- package/src/skills.ts +0 -1549
- package/src/storage.integration.test.ts +0 -341
- package/src/storage.ts +0 -884
- package/src/structured.integration.test.ts +0 -817
- package/src/structured.test.ts +0 -1046
- package/src/structured.ts +0 -762
- package/src/swarm-decompose.test.ts +0 -188
- package/src/swarm-decompose.ts +0 -1302
- package/src/swarm-deferred.integration.test.ts +0 -157
- package/src/swarm-deferred.test.ts +0 -38
- package/src/swarm-insights.test.ts +0 -214
- package/src/swarm-insights.ts +0 -459
- package/src/swarm-mail.integration.test.ts +0 -970
- package/src/swarm-mail.ts +0 -739
- package/src/swarm-orchestrate.integration.test.ts +0 -282
- package/src/swarm-orchestrate.test.ts +0 -548
- package/src/swarm-orchestrate.ts +0 -3084
- package/src/swarm-prompts.test.ts +0 -1270
- package/src/swarm-prompts.ts +0 -2077
- package/src/swarm-research.integration.test.ts +0 -701
- package/src/swarm-research.test.ts +0 -698
- package/src/swarm-research.ts +0 -472
- package/src/swarm-review.integration.test.ts +0 -285
- package/src/swarm-review.test.ts +0 -879
- package/src/swarm-review.ts +0 -709
- package/src/swarm-strategies.ts +0 -407
- package/src/swarm-worktree.test.ts +0 -501
- package/src/swarm-worktree.ts +0 -575
- package/src/swarm.integration.test.ts +0 -2377
- package/src/swarm.ts +0 -38
- package/src/tool-adapter.integration.test.ts +0 -1221
- package/src/tool-availability.ts +0 -461
- package/tsconfig.json +0 -28
package/SCORER-ANALYSIS.md
DELETED
@@ -1,598 +0,0 @@
# Scorer Implementation Analysis

**Date:** 2025-12-25
**Cell:** opencode-swarm-plugin--ys7z8-mjlk7jsrvls
**Scope:** All scorer implementations in `evals/scorers/`

```
┌────────────────────────────────────────────────────────────┐
│                                                            │
│   📊 SCORER AUDIT REPORT                                   │
│   ═══════════════════════                                  │
│                                                            │
│   Files Analyzed:                                          │
│   • index.ts (primary scorers)                             │
│   • coordinator-discipline.ts (11 scorers)                 │
│   • compaction-scorers.ts (5 scorers)                      │
│   • outcome-scorers.ts (5 scorers)                         │
│                                                            │
│   Total Scorers: 24                                        │
│   Composite Scorers: 3                                     │
│   LLM-as-Judge: 1                                          │
│                                                            │
└────────────────────────────────────────────────────────────┘
```

---

## Executive Summary

**Overall Assessment:** ✅ Scorers are well-implemented with correct API usage. Found **3 critical issues** and **5 optimization opportunities**.

**Eval Performance Context:**
- compaction-prompt: 53% (LOW - needs investigation)
- coordinator-behavior: 77% (GOOD)
- coordinator-session: 66% (FAIR)
- compaction-resumption: 93% (EXCELLENT)
- swarm-decomposition: 70% (GOOD)
- example: 0% (expected - sanity check)

---

## 🔴 CRITICAL ISSUES

### 1. **UNUSED SCORERS - Dead Code**

**Severity:** HIGH
**Impact:** Wasted development effort, misleading test coverage

#### Scorers Defined But Never Used in Evals

| Scorer | File | Lines | Status |
|--------|------|-------|--------|
| `researcherSpawnRate` | coordinator-discipline.ts | 345-378 | ❌ NEVER USED |
| `skillLoadingRate` | coordinator-discipline.ts | 388-421 | ❌ NEVER USED |
| `inboxMonitoringRate` | coordinator-discipline.ts | 433-484 | ❌ NEVER USED |
| `blockerResponseTime` | coordinator-discipline.ts | 499-588 | ❌ NEVER USED |

**Evidence:**
```bash
grep -r "researcherSpawnRate\|skillLoadingRate\|inboxMonitoringRate\|blockerResponseTime" evals/*.eval.ts
# → No matches
```

**Why This Matters:**
- These scorers represent ~250 lines of code (~38% of coordinator-discipline.ts)
- Tests exist for them but they don't influence eval results
- Maintenance burden without benefit
- Misleading signal that these metrics are being measured

**Recommendation:**
1. **EITHER** add these scorers to the `coordinator-session.eval.ts` scorers array
2. **OR** remove them and their tests to reduce noise

**Probable Intent:**
These scorers were likely prototypes for expanded coordinator metrics but never integrated. The current 5-scorer set (violations, spawn, review, speed, reviewEfficiency) is sufficient for protocol adherence.

---

### 2. **reviewEfficiency vs reviewThoroughness - Potential Redundancy**

**Severity:** MEDIUM
**Impact:** Confusing metrics, potential double-penalization

#### What They Measure

| Scorer | Metric | Scoring |
|--------|--------|---------|
| `reviewThoroughness` | reviews / finished_workers | 0-1 (completeness) |
| `reviewEfficiency` | reviews / spawned_workers | penalizes >2:1 ratio |

**The Problem:**
```typescript
// Scenario: 2 workers spawned, 2 finished, 4 reviews completed

// reviewThoroughness: 4/2 = 2.0 → clipped to 1.0 (perfect!)
// reviewEfficiency: 4/2 = 2.0 → 0.5 (threshold penalty)

// These contradict each other
```

**Why This Exists:**
- `reviewThoroughness` added early to ensure coordinators review worker output
- `reviewEfficiency` added later to prevent over-reviewing (context waste)
- Both measure review behavior but from different angles

**Current Usage:**
- `coordinator-session.eval.ts` uses BOTH in scorers array
- `overallDiscipline` composite uses only `reviewThoroughness` (not efficiency)

**Recommendation:**
1. **Short-term:** Document that these are intentionally complementary (thoroughness=quality gate, efficiency=resource optimization)
2. **Long-term:** Consider composite "reviewQuality" scorer that balances both:
```typescript
// Perfect: 1:1 ratio (one review per finished worker)
// Good: 0.8-1.2 ratio
// Bad: <0.5 or >2.0 ratio
```
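A minimal sketch of what that composite could look like. The `{ score, message }` result shape and the review/worker counts passed in are assumptions drawn from the scorers described in this report, not the shipped types:

```typescript
// Hypothetical reviewQuality composite: rewards ~1 review per finished worker,
// penalizes both under- and over-reviewing. Names and result shape are assumptions.
interface ScoreResult {
  score: number;
  message: string;
}

function reviewQuality(reviews: number, finishedWorkers: number): ScoreResult {
  if (finishedWorkers === 0) {
    // Nothing finished yet, so there is nothing to review - stay neutral.
    return { score: 0.5, message: "No finished workers to review" };
  }

  const ratio = reviews / finishedWorkers;

  // Bands mirror the comment block above: 0.8-1.2 is good, <0.5 or >2.0 is bad,
  // with a linear ramp in between on either side.
  let score: number;
  if (ratio >= 0.8 && ratio <= 1.2) {
    score = 1.0;
  } else if (ratio < 0.5 || ratio > 2.0) {
    score = 0.0;
  } else {
    score = ratio < 0.8 ? (ratio - 0.5) / 0.3 : (2.0 - ratio) / 0.8;
  }

  return {
    score,
    message: `${reviews} reviews for ${finishedWorkers} finished workers (ratio ${ratio.toFixed(2)})`,
  };
}
```

Whether such a composite replaces the two existing scorers or sits alongside them is the same keep-or-merge decision flagged above.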

---

### 3. **Arbitrary Normalization Thresholds**

**Severity:** LOW
**Impact:** Scores may not reflect reality, hard to tune

#### timeToFirstSpawn Thresholds

```typescript
const EXCELLENT_MS = 60_000;  // < 60s = 1.0 (why 60s?)
const POOR_MS = 300_000;      // > 300s = 0.0 (why 5min?)
```

**Question:** Are these evidence-based or arbitrary?

**From Real Data:** We don't know - no analysis of actual coordinator spawn times.

**Recommendation:**
1. Add comment with rationale: "Based on X coordinator sessions, median spawn time is Y"
2. OR make thresholds configurable via expected values
3. OR use percentile-based normalization from real data

#### blockerResponseTime Thresholds

```typescript
const EXCELLENT_MS = 5 * 60 * 1000;  // 5 min
const POOR_MS = 15 * 60 * 1000;      // 15 min
```

**Same Issue:** No evidence these thresholds match real coordinator response patterns.

**Deeper Problem:**
```typescript
// This scorer matches blockers to resolutions by subtask_id
const resolution = resolutions.find(
  (r) => (r.payload as any).subtask_id === subtaskId
);

// BUT: If coordinator resolves blocker by reassigning task,
// the subtask_id might change. This would miss the resolution.
```

---

## ⚠️ CALIBRATION ISSUES

### 1. **Composite Scorer Weight Inconsistency**

#### Current Weights

**overallDiscipline** (coordinator-discipline.ts:603):
```typescript
const weights = {
  violations: 0.3, // 30% - "most critical"
  spawn: 0.25,     // 25%
  review: 0.25,    // 25%
  speed: 0.2,      // 20%
};
```

**compactionQuality** (compaction-scorers.ts:260):
```typescript
const weights = {
  confidence: 0.25, // 25%
  injection: 0.25,  // 25%
  required: 0.3,    // 30% - "most critical"
  forbidden: 0.2,   // 20%
};
```

**overallCoordinatorBehavior** (coordinator-behavior.eval.ts:196):
```typescript
const score =
  (toolsResult.score ?? 0) * 0.3 +
  (avoidsResult.score ?? 0) * 0.4 + // 40% - "most important"
  (mindsetResult.score ?? 0) * 0.3;
```

**Pattern:** Each composite prioritizes different metrics, which is GOOD (domain-specific), but...

**Issue:** No documentation of WHY these weights were chosen.

**Recommendation:**
Add comments explaining weight rationale:
```typescript
// Weights based on failure impact:
// - Violations (30%): Breaking protocol causes immediate harm
// - Spawn (25%): Delegation is core coordinator job
// - Review (25%): Quality gate prevents bad work propagating
// - Speed (20%): Optimization, not correctness
```

---

### 2. **Binary vs Gradient Scoring Philosophy**

#### Binary Scorers (0 or 1 only)

- `subtaskIndependence` - either conflicts exist or they don't
- `executionSuccess` - either all succeeded or not
- `noRework` - either rework detected or not

#### Gradient Scorers (0-1 continuous)

- `timeBalance` - ratio-based
- `scopeAccuracy` - percentage-based
- `instructionClarity` - heuristic-based

#### LLM-as-Judge (0-1 via scoring prompt)

- `decompositionCoherence` - Claude Haiku scores 0-100, normalized to 0-1

**Question:** Should all outcome scorers be gradient, or is binary appropriate?

**Trade-off:**
- **Binary:** Clear pass/fail, easy to reason about, motivates fixes
- **Gradient:** More nuanced, rewards partial success, better for learning

**Current Mix:** Seems reasonable. Binary for critical invariants (no conflicts, no rework), gradient for optimization metrics (balance, accuracy).

**Recommendation:** Document this philosophy in scorer file headers.

---

## ✅ WELL-CALIBRATED PATTERNS

### 1. **Fallback Strategy Consistency**

From semantic memory:
> "When no baseline exists, prefer realistic fallback (1.0 if delegation happened) over arbitrary 0.5"

**Good Example - spawnEfficiency (lines 98-108):**
```typescript
if (!decomp) {
  // Fallback: if workers were spawned but no decomp event, assume they're doing work
  if (spawned > 0) {
    return {
      score: 1.0, // Optimistic - work is happening
      message: `${spawned} workers spawned (no decomposition event)`,
    };
  }
  return {
    score: 0,
    message: "No decomposition event found",
  };
}
```

**Rationale:** Workers spawned = delegation happened = good coordinator behavior. Not penalizing missing instrumentation.

**Contrast - decompositionCoherence fallback (lines 321-325):**
```typescript
} catch (error) {
  // Don't fail the eval if judge fails - return neutral score
  return {
    score: 0.5, // Neutral - can't determine quality
    message: `LLM judge error: ${error instanceof Error ? error.message : String(error)}`,
  };
}
```

**Rationale:** LLM judge failure = unknown quality, not good or bad. Neutral 0.5 prevents biasing results.

**Consistency:** ✅ Both fallbacks match their semantic context.

---

### 2. **Test Coverage Philosophy**

#### Unit Tests (Bun test)
- **coordinator-discipline.evalite-test.ts** - Full functional tests with synthetic fixtures
- **outcome-scorers.evalite-test.ts** - Export verification only (integration tested via evalite)

#### Integration Tests (Evalite)
- **coordinator-session.eval.ts** - Real captured sessions + synthetic fixtures
- **swarm-decomposition.eval.ts** - Real LLM calls + fixtures

**Pattern:** Scorers with complex logic get unit tests. Simple scorers get integration tests only.

**Trade-off:**
- **Pro:** Faster iteration for complex scorers
- **Con:** No unit tests for outcome scorers (harder to debug failures)

**Recommendation:** Add characterization tests for outcome scorers (snapshot actual scores for known inputs).
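A minimal characterization-test sketch under Bun's test runner. The file path, the assumed `scopeAccuracy` export, and the fixture shape are illustrative assumptions, not the repo's actual layout:

```typescript
// evals/scorers/outcome-scorers.characterization.test.ts (hypothetical path)
import { test, expect } from "bun:test";
import { scopeAccuracy } from "./outcome-scorers"; // assumed export

test("scopeAccuracy is stable for a known-good decomposition", async () => {
  // Frozen fixture: the exact shape is an assumption standing in for a captured session.
  const knownGoodOutput = {
    subtasks: [
      { id: "a", files: ["src/a.ts"], touched: ["src/a.ts"] },
      { id: "b", files: ["src/b.ts"], touched: ["src/b.ts"] },
    ],
  };

  const result = await scopeAccuracy({ output: knownGoodOutput });

  // Snapshot the score so any behavior change in the scorer shows up as a diff.
  expect(result.score).toMatchSnapshot();
});
```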

---

## 📊 SCORER USAGE MATRIX

| Scorer | coordinator-session | swarm-decomposition | coordinator-behavior | compaction-resumption | compaction-prompt |
|--------|---------------------|---------------------|----------------------|-----------------------|-------------------|
| **violationCount** | ✅ | - | - | - | - |
| **spawnEfficiency** | ✅ | - | - | - | - |
| **reviewThoroughness** | ✅ | - | - | - | - |
| **reviewEfficiency** | ✅ | - | - | - | - |
| **timeToFirstSpawn** | ✅ | - | - | - | - |
| **overallDiscipline** | ✅ | - | - | - | - |
| **researcherSpawnRate** | ❌ | - | - | - | - |
| **skillLoadingRate** | ❌ | - | - | - | - |
| **inboxMonitoringRate** | ❌ | - | - | - | - |
| **blockerResponseTime** | ❌ | - | - | - | - |
| **subtaskIndependence** | - | ✅ | - | - | - |
| **coverageCompleteness** | - | ✅ | - | - | - |
| **instructionClarity** | - | ✅ | - | - | - |
| **decompositionCoherence** | - | ✅ | - | - | - |
| **mentionsCoordinatorTools** | - | - | ✅ | - | - |
| **avoidsWorkerBehaviors** | - | - | ✅ | - | - |
| **coordinatorMindset** | - | - | ✅ | - | - |
| **overallCoordinatorBehavior** | - | - | ✅ | - | - |
| **confidenceAccuracy** | - | - | - | ✅ | - |
| **contextInjectionCorrectness** | - | - | - | ✅ | - |
| **requiredPatternsPresent** | - | - | - | ✅ | - |
| **forbiddenPatternsAbsent** | - | - | - | ✅ | - |
| **compactionQuality** | - | - | - | ✅ | - |
| **compaction-prompt scorers** | - | - | - | - | ✅ |
| **outcome scorers** | - | - | - | - | - |

**Note:** Outcome scorers not used in any current eval (waiting for real execution data).

---

## 🎯 RECOMMENDATIONS

### Immediate (Pre-Ship)

1. **DECIDE:** Keep or remove unused coordinator scorers
   - If keeping: Add to coordinator-session.eval.ts
   - If removing: Delete scorers + tests, update exports

2. **DOCUMENT:** Add weight rationale comments to composite scorers

3. **CLARIFY:** Add docstring to reviewEfficiency explaining relationship with reviewThoroughness

### Short-term (Next Sprint)

4. **CALIBRATE:** Gather real coordinator session data, validate normalization thresholds
   - Run 20+ real coordinator sessions
   - Plot distribution of spawn times, blocker response times
   - Adjust EXCELLENT_MS/POOR_MS based on percentiles

5. **TEST:** Add characterization tests for outcome scorers
   ```typescript
   test("scopeAccuracy with known input", () => {
     const result = scopeAccuracy({ output: knownGoodOutput, ... });
     expect(result.score).toMatchSnapshot();
   });
   ```

6. **INVESTIGATE:** Why is compaction-prompt eval at 53%?
   - Review fixtures in `compaction-prompt-cases.ts`
   - Check if scorers are too strict or fixtures are wrong
   - This is the LOWEST-performing eval (red flag)

### Long-term (Future Iterations)

7. **REFACTOR:** Consider `reviewQuality` composite that balances thoroughness + efficiency

8. **ENHANCE:** Add percentile-based normalization for time-based scorers (a fuller sketch follows this list)
   ```typescript
   function normalizeTime(valueMs: number, p50: number, p95: number): number {
     // Values at p50 = 0.5, values at p95 = 0.0
     // Self-calibrating from real data
   }
   ```

9. **INTEGRATE:** Use outcome scorers once real swarm execution data exists
   - Currently no eval uses executionSuccess, timeBalance, scopeAccuracy, scopeDrift, noRework
   - These are outcome-based (require actual subtask execution)
   - Valuable for learning which decomposition strategies work
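Expanding recommendation 8 into a runnable sketch: the percentile math is standard, while deriving `p50`/`p95` from captured session timings (and ramping from 1.0 down to 0.5 below `p50`) is an assumption about how the calibration would work, not existing package behavior.

```typescript
// Standard nearest-rank percentile over sorted samples.
function percentile(sortedMs: number[], p: number): number {
  if (sortedMs.length === 0) return 0;
  const idx = Math.min(sortedMs.length - 1, Math.floor((p / 100) * sortedMs.length));
  return sortedMs[idx];
}

// Matches the stub's intent: a value at p50 scores 0.5, at p95 (or worse) 0.0.
// The stub leaves the region below p50 unspecified; this sketch assumes a
// linear ramp from 1.0 at 0ms down to 0.5 at p50.
function normalizeTime(valueMs: number, p50: number, p95: number): number {
  if (valueMs >= p95) return 0.0;
  if (valueMs >= p50) return 0.5 * (1 - (valueMs - p50) / (p95 - p50));
  return 1 - 0.5 * (valueMs / p50);
}

// Usage sketch: timings would come from captured coordinator sessions.
const observedSpawnTimesMs = [42_000, 55_000, 61_000, 90_000, 240_000].sort((a, b) => a - b);
const p50 = percentile(observedSpawnTimesMs, 50);
const p95 = percentile(observedSpawnTimesMs, 95);
console.log(normalizeTime(120_000, p50, p95)); // score for a 2-minute spawn time
```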

---

## 📈 SCORING PHILOSOPHY PATTERNS

### Pattern 1: "Perfect or Penalty" (Binary with Partial Credit)

**Example:** `instructionClarity` (index.ts:174-228)
```typescript
let score = 0.5; // baseline
if (subtask.description && subtask.description.length > 20) score += 0.2;
if (subtask.files && subtask.files.length > 0) score += 0.2;
if (!isGeneric) score += 0.1;
return Math.min(1.0, score);
```

**Philosophy:** Start at baseline, add points for quality signals, cap at 1.0

**Pro:** Rewards partial quality improvements
**Con:** Arbitrary baseline and increments

---

### Pattern 2: "Ratio Normalization" (Continuous Gradient)

**Example:** `timeBalance` (outcome-scorers.ts:73-141)
```typescript
const ratio = maxDuration / minDuration;
if (ratio < 2.0) score = 1.0;       // well balanced
else if (ratio < 4.0) score = 0.5;  // moderately balanced
else score = 0.0;                   // poorly balanced
```

**Philosophy:** Define thresholds for quality bands, linear interpolation between

**Pro:** Clear expectations, easy to reason about
**Con:** Threshold choices are subjective

---

### Pattern 3: "LLM-as-Judge" (Delegated Evaluation)

**Example:** `decompositionCoherence` (index.ts:245-328)
```typescript
const { text } = await generateText({
  model: gateway(JUDGE_MODEL),
  prompt: `Evaluate on these criteria (be harsh)...
1. INDEPENDENCE (25%)
2. SCOPE (25%)
3. COMPLETENESS (25%)
4. CLARITY (25%)
Return ONLY valid JSON: {"score": <0-100>, "issues": [...]}`,
});
```

**Philosophy:** Use LLM for nuanced evaluation humans/heuristics can't capture

**Pro:** Catches semantic issues (hidden dependencies, ambiguous scope)
**Con:** Non-deterministic, slower, requires API key, costs money

---

### Pattern 4: "Composite Weighted Average"

**Example:** `overallDiscipline` (coordinator-discipline.ts:603-648)
```typescript
const totalScore =
  (scores.violations.score ?? 0) * weights.violations +
  (scores.spawn.score ?? 0) * weights.spawn +
  (scores.review.score ?? 0) * weights.review +
  (scores.speed.score ?? 0) * weights.speed;
```

**Philosophy:** Combine multiple signals with domain-specific weights

**Pro:** Single metric for "overall quality", weights encode priorities
**Con:** Weights are subjective, hides individual metric details

---

## 🔬 DEEP DIVE: compaction-prompt 53% Score

**Context:** This is the LOWEST-performing eval. Needs investigation.

**Hypothesis 1:** Scorers are too strict
- Check if perfect fixture actually scores 100% (has dedicated eval for this)
- If perfect scores <100%, scorers have bugs

**Hypothesis 2:** Fixtures are wrong
- Fixtures might not represent actual good prompts
- Need to compare against real coordinator resumption prompts

**Hypothesis 3:** Real implementation doesn't match fixture assumptions
- Fixtures assume certain prompt structure
- Actual implementation may have evolved differently

**Next Steps:**
1. Run `Perfect Prompt Scores 100%` eval and check results
2. If it scores <100%, debug scorer logic
3. If it scores 100%, review other fixture expected values

---

## 💡 INSIGHTS FROM SEMANTIC MEMORY

### 1. Evalite API Pattern (from memory c2bb8f11)

```typescript
// CORRECT: Scorers are async functions
const result = await childScorer({ output, expected, input });
const score = result.score ?? 0;

// WRONG: .scorer property doesn't exist
const result = childScorer.scorer({ output, expected }); // ❌
```

✅ All current scorers follow correct pattern.

---

### 2. Garbage Input Handling (from memory b0ef27d5)

> "When LLM receives garbage input, it correctly scores it 0 - this is the RIGHT behavior, not an error."

**Application:** `decompositionCoherence` should NOT return 0.5 fallback for parse errors. Should let LLM judge garbage as garbage.

**Current Implementation:** ❌ Returns 0.5 on error (line 324)

**Recommendation:** Distinguish between:
- **LLM error** (API failure) → 0.5 fallback (can't judge)
- **Parse error** (invalid JSON output) → Pass raw output to LLM, let it judge as low quality
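A sketch of that split. `judgeDecomposition` stands in for the `generateText` call shown in Pattern 3 and is an assumption, not a real helper in the package; the point is which failure mode lands in which branch:

```typescript
// Only an API/transport failure earns the neutral 0.5; garbage output under
// test is passed straight to the judge and scored as garbage.
async function scoreDecomposition(
  rawOutput: string,
  judgeDecomposition: (text: string) => Promise<number>, // returns 0-1 (assumed shape)
): Promise<{ score: number; message: string }> {
  // Deliberately no JSON pre-validation here: if rawOutput is malformed,
  // the judge sees it as-is and should score it near 0.
  try {
    const score = await judgeDecomposition(rawOutput);
    return { score, message: "LLM judge score" };
  } catch (error) {
    // We could not judge at all (network, auth, rate limit) - stay neutral.
    return {
      score: 0.5,
      message: `LLM judge error: ${error instanceof Error ? error.message : String(error)}`,
    };
  }
}
```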

---

### 3. Epic ID Pattern (from memory ba964b81)

> "Epic ID pattern is mjkw + 7 base36 chars = 11 chars total"

**Application:** `forbiddenPatternsAbsent` checks for "bd-xxx" placeholders, but should also check for other placeholder patterns:
- `<epic>`, `<path>`, `placeholder`, `YOUR_EPIC_ID`, etc.

**Current Implementation:** ✅ Already checks these (compaction-scorers.ts:200)
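For illustration only, a placeholder screen in that spirit might look like the following; the real check lives in compaction-scorers.ts and may differ:

```typescript
// Illustrative sketch, not the shipped forbiddenPatternsAbsent implementation.
const PLACEHOLDER_PATTERNS: RegExp[] = [
  /\bbd-xxx\b/i,      // literal "bd-xxx" placeholder id
  /<epic>|<path>/i,   // unresolved template slots
  /\bplaceholder\b/i,
  /\bYOUR_EPIC_ID\b/,
];

function forbiddenPlaceholdersAbsent(prompt: string): { score: number; message: string } {
  const hits = PLACEHOLDER_PATTERNS.filter((re) => re.test(prompt));
  return hits.length === 0
    ? { score: 1, message: "No placeholder patterns found" }
    : { score: 0, message: `Placeholder patterns present: ${hits.map(String).join(", ")}` };
}
```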

---

## 🎨 ASCII ART SCORING DISTRIBUTION

```
SCORER USAGE HEAT MAP
═══════════════════════

coordinator-session    ██████  (6 scorers)
swarm-decomposition    ████    (4 scorers)
coordinator-behavior   ████    (4 scorers)
compaction-resumption  █████   (5 scorers)
compaction-prompt      █████   (5 scorers)

UNUSED SCORERS: 🗑️  (4 scorers, 250 LOC)
```

---

## 📋 ACTION ITEMS

### Critical (Do First)
- [ ] **Decide fate of unused scorers** (remove or integrate)
- [ ] **Investigate compaction-prompt 53% score** (lowest eval)
- [ ] **Add weight rationale comments** to composite scorers

### High Priority
- [ ] **Document reviewEfficiency vs reviewThoroughness** relationship
- [ ] **Validate normalization thresholds** with real data
- [ ] **Add characterization tests** for outcome scorers

### Medium Priority
- [ ] **Consider reviewQuality composite** (balances thorough + efficient)
- [ ] **Enhance blockerResponseTime** matching logic (handle reassignments)
- [ ] **Document binary vs gradient scoring philosophy** in file headers

### Low Priority
- [ ] **Refactor garbage input handling** in decompositionCoherence
- [ ] **Add percentile-based normalization** for time scorers
- [ ] **Create scorer usage dashboard** (track which scorers impact results)

---

## 🏆 CONCLUSION

**Overall Quality:** 🟢 GOOD

**Strengths:**
- Correct Evalite API usage (no `.scorer` property bugs)
- Thoughtful fallback strategies (realistic vs neutral)
- Good separation of concerns (discipline, outcome, compaction)
- LLM-as-judge for complex evaluation

**Weaknesses:**
- 4 unused scorers (38% dead code in coordinator-discipline.ts)
- Arbitrary normalization thresholds (no evidence-based calibration)
- Undocumented weight rationale (composite scorers)
- Lowest eval score (compaction-prompt 53%) not investigated

**Priority:** Focus on **removing unused scorers** and **investigating compaction-prompt failure** before shipping.

---

**Analysis by:** CoolOcean
**Cell:** opencode-swarm-plugin--ys7z8-mjlk7jsrvls
**Epic:** opencode-swarm-plugin--ys7z8-mjlk7js9bt1
**Timestamp:** 2025-12-25T17:30:00Z