opencodekit 0.18.3 → 0.18.5
- package/dist/index.js +407 -17
- package/dist/template/.opencode/.version +1 -1
- package/dist/template/.opencode/AGENTS.md +13 -1
- package/dist/template/.opencode/agent/build.md +4 -1
- package/dist/template/.opencode/agent/explore.md +5 -35
- package/dist/template/.opencode/command/verify.md +63 -12
- package/dist/template/.opencode/memory/research/benchmark-framework.md +162 -0
- package/dist/template/.opencode/memory/research/effectiveness-audit.md +213 -0
- package/dist/template/.opencode/memory.db +0 -0
- package/dist/template/.opencode/memory.db-shm +0 -0
- package/dist/template/.opencode/memory.db-wal +0 -0
- package/dist/template/.opencode/opencode.json +1429 -1678
- package/dist/template/.opencode/package.json +1 -1
- package/dist/template/.opencode/plugin/lib/memory-helpers.ts +3 -129
- package/dist/template/.opencode/plugin/lib/memory-hooks.ts +4 -60
- package/dist/template/.opencode/plugin/memory.ts +0 -3
- package/dist/template/.opencode/skill/agent-teams/SKILL.md +16 -1
- package/dist/template/.opencode/skill/beads/SKILL.md +22 -0
- package/dist/template/.opencode/skill/brainstorming/SKILL.md +28 -0
- package/dist/template/.opencode/skill/code-navigation/SKILL.md +130 -0
- package/dist/template/.opencode/skill/condition-based-waiting/SKILL.md +12 -0
- package/dist/template/.opencode/skill/context-management/SKILL.md +122 -113
- package/dist/template/.opencode/skill/defense-in-depth/SKILL.md +20 -0
- package/dist/template/.opencode/skill/design-system-audit/SKILL.md +113 -112
- package/dist/template/.opencode/skill/dispatching-parallel-agents/SKILL.md +8 -0
- package/dist/template/.opencode/skill/executing-plans/SKILL.md +7 -0
- package/dist/template/.opencode/skill/memory-system/SKILL.md +50 -266
- package/dist/template/.opencode/skill/mockup-to-code/SKILL.md +21 -6
- package/dist/template/.opencode/skill/receiving-code-review/SKILL.md +8 -0
- package/dist/template/.opencode/skill/requesting-code-review/SKILL.md +242 -105
- package/dist/template/.opencode/skill/root-cause-tracing/SKILL.md +15 -0
- package/dist/template/.opencode/skill/session-management/SKILL.md +4 -103
- package/dist/template/.opencode/skill/subagent-driven-development/SKILL.md +23 -2
- package/dist/template/.opencode/skill/swarm-coordination/SKILL.md +17 -1
- package/dist/template/.opencode/skill/systematic-debugging/SKILL.md +21 -0
- package/dist/template/.opencode/skill/tool-priority/SKILL.md +34 -16
- package/dist/template/.opencode/skill/ui-ux-research/SKILL.md +5 -127
- package/dist/template/.opencode/skill/verification-before-completion/SKILL.md +36 -0
- package/dist/template/.opencode/skill/verification-before-completion/references/VERIFICATION_PROTOCOL.md +133 -29
- package/dist/template/.opencode/skill/visual-analysis/SKILL.md +20 -7
- package/dist/template/.opencode/skill/writing-plans/SKILL.md +7 -0
- package/dist/template/.opencode/tool/context7.ts +9 -1
- package/dist/template/.opencode/tool/grepsearch.ts +9 -1
- package/package.json +1 -1
package/dist/template/.opencode/command/verify.md
@@ -1,6 +1,6 @@
 ---
 description: Verify implementation completeness, correctness, and coherence
-argument-hint: "<bead-id> [--quick] [--full] [--fix]"
+argument-hint: "<bead-id> [--quick] [--full] [--fix] [--no-cache]"
 agent: review
 ---
 
@@ -17,12 +17,13 @@ skill({ name: "verification-before-completion" });
 
 ## Parse Arguments
 
-| Argument
-|
-| `<bead-id>`
-| `--quick`
-| `--full`
-| `--fix`
+| Argument | Default | Description |
+| ------------ | -------- | ---------------------------------------------- |
+| `<bead-id>` | required | The bead to verify |
+| `--quick` | false | Gates only, skip coherence check |
+| `--full` | false | Force full verification mode (non-incremental) |
+| `--fix` | false | Auto-fix lint/format issues |
+| `--no-cache` | false | Bypass verification cache, force fresh run |
 
 ## Determine Input Type
 
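The argument table added to `verify.md` in the hunk above could be parsed along the following lines. This is only a sketch: `VerifyArgs` and `parseVerifyArgs` are hypothetical names that do not appear in the package, and rejecting unknown flags is an assumption rather than documented behavior.

```typescript
// Hypothetical shape for the parsed /verify arguments (not part of the package).
interface VerifyArgs {
  beadId: string;
  quick: boolean;
  full: boolean;
  fix: boolean;
  noCache: boolean;
}

// Parse a raw argument string such as "bd-42 --quick --no-cache".
// Every flag defaults to false, matching the Default column of the table.
function parseVerifyArgs(input: string): VerifyArgs {
  const tokens = input.trim().split(/\s+/).filter((t) => t.length > 0);
  const args: VerifyArgs = { beadId: "", quick: false, full: false, fix: false, noCache: false };
  for (const token of tokens) {
    switch (token) {
      case "--quick":
        args.quick = true;
        break;
      case "--full":
        args.full = true;
        break;
      case "--fix":
        args.fix = true;
        break;
      case "--no-cache":
        args.noCache = true;
        break;
      default:
        // Assumption: unknown flags and a second positional argument are errors.
        if (token.startsWith("--")) throw new Error(`Unknown flag: ${token}`);
        if (args.beadId !== "") throw new Error(`Unexpected extra argument: ${token}`);
        args.beadId = token;
    }
  }
  if (args.beadId === "") throw new Error("<bead-id> is required");
  return args;
}
```

Keeping the parser strict makes a mistyped flag fail loudly instead of silently running a different verification mode.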
@@ -39,6 +40,32 @@ skill({ name: "verification-before-completion" });
 - **Run the gates**: Build, test, lint, typecheck are non-negotiable
 - **Use project conventions**: Check `package.json` scripts first
 
+## Phase 0: Check Verification Cache
+
+Before running any gates, check if a recent verification is still valid:
+
+```bash
+# Compute current state fingerprint (commit hash + full diff + untracked files)
+CURRENT_STAMP=$(printf '%s\n%s\n%s' \
+"$(git rev-parse HEAD)" \
+"$(git diff HEAD -- '*.ts' '*.tsx' '*.js' '*.jsx')" \
+"$(git ls-files --others --exclude-standard -- '*.ts' '*.tsx' '*.js' '*.jsx' | xargs cat 2>/dev/null)" \
+| shasum -a 256 | cut -d' ' -f1)
+LAST_STAMP=$(tail -1 .beads/verify.log 2>/dev/null | awk '{print $1}')
+```
+
+| Condition | Action |
+| ----------------------------------------- | ------------------------------------------------------ |
+| `--no-cache` or `--full` | Skip cache check, run fresh |
+| `CURRENT_STAMP == LAST_STAMP` | Report **cached PASS**, skip to Phase 2 (completeness) |
+| `CURRENT_STAMP != LAST_STAMP` or no cache | Run gates normally |
+
+When cache hits, report:
+
+```text
+Verification: cached PASS (no changes since <timestamp from verify.log>)
+```
+
 ## Phase 1: Gather Context
 
 ```bash
@@ -66,10 +93,34 @@ Extract all requirements/tasks from the PRD and verify each is implemented:
 
 Follow the [Verification Protocol](../skill/verification-before-completion/references/VERIFICATION_PROTOCOL.md):
 
-
-
-
-
+**Default: incremental mode** (changed files only, parallel gates).
+
+| Mode | When | Behavior |
+| ----------- | ----------------------------------------- | -------------------------------- |
+| Incremental | Default, <20 changed files | Lint changed files, test changed |
+| Full | `--full` flag, >20 changed files, or ship | Lint all, test all |
+
+**Execution order:**
+
+1. **Parallel**: typecheck + lint (simultaneously)
+2. **Sequential** (after parallel passes): test, then build (ship only)
+
+Report results with mode column:
+
+```text
+| Gate | Status | Mode | Time |
+|-----------|--------|-------------|--------|
+| Typecheck | PASS | full | 2.1s |
+| Lint | PASS | incremental | 0.3s |
+| Test | PASS | incremental | 1.2s |
+| Build | SKIP | — | — |
+```
+
+**After all gates pass**, record to verification cache:
+
+```bash
+echo "$CURRENT_STAMP $(date -u +%Y-%m-%dT%H:%M:%SZ) PASS" >> .beads/verify.log
+```
 
 If `--fix` flag provided, run the project's auto-fix command (e.g., `npm run lint:fix`, `ruff check --fix`, `cargo clippy --fix`).
 
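The Phase 0 cache table, the incremental/full mode table, and the execution order added to `verify.md` can be read together as a small decision procedure. The sketch below is one possible reading, not the package's implementation: `PlanInput`, `cacheAction`, `verificationMode`, and `gateOrder` are hypothetical names, and the `ship` field is an assumed way of modelling "invoked as part of shipping". The table leaves exactly 20 changed files unspecified, so treating it as full mode is also an assumption.

```typescript
type CacheAction = "run-fresh" | "cached-pass" | "run-gates";
type Mode = "incremental" | "full";

// Hypothetical inputs to the decision; none of these names come from the package.
interface PlanInput {
  noCache: boolean;
  full: boolean;
  ship: boolean; // assumption: true when verification runs as part of shipping
  currentStamp: string; // fingerprint of the current working tree
  lastStamp: string | null; // null when .beads/verify.log has no usable entry
  changedFiles: number;
}

// Phase 0 table: flags bypass the cache, a matching stamp is a cached PASS.
function cacheAction(p: PlanInput): CacheAction {
  if (p.noCache || p.full) return "run-fresh";
  if (p.lastStamp !== null && p.lastStamp === p.currentStamp) return "cached-pass";
  return "run-gates";
}

// Mode table: "<20" is incremental and ">20" is full.
// Assumption: exactly 20 changed files is treated as full.
function verificationMode(p: PlanInput): Mode {
  if (p.full || p.ship || p.changedFiles >= 20) return "full";
  return "incremental";
}

// Execution order: each inner array is a stage whose gates may run in parallel.
// Build is appended only when shipping.
function gateOrder(p: PlanInput): string[][] {
  const sequential: string[][] = p.ship ? [["test"], ["build"]] : [["test"]];
  return [["typecheck", "lint"], ...sequential];
}
```

Returning stages as nested arrays keeps the "parallel first, then sequential" rule explicit, so a runner can await each stage before starting the next.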
@@ -93,7 +144,7 @@ Output:
 
 1. **Result**: READY TO SHIP / NEEDS WORK / BLOCKED
 2. **Completeness**: score and status
-3. **Correctness**: gate results
+3. **Correctness**: gate results (with mode column)
 4. **Coherence**: contradictions found (if not --quick)
 5. **Blocking issues** to fix before shipping
 6. **Next step**: `/ship $ARGUMENTS` if ready, or list fixes needed
package/dist/template/.opencode/memory/research/benchmark-framework.md
@@ -0,0 +1,162 @@
+---
+purpose: Scoring rubric for evaluating template agent effectiveness
+updated: 2026-03-08
+based-on: tilth research (measurable pattern adoption improvements)
+---
+
+# Agent Effectiveness Benchmark Framework
+
+## Purpose
+
+Evaluate whether skills, tools, and commands in the OpenCodeKit template actually help AI agents perform better. Based on tilth's methodology: they measured accuracy, cost/correct answer, and tool adoption rates to prove what works.
+
+## Scoring Dimensions
+
+7 dimensions, each scored 0–2. Max score: **14**.
+
+### 1. Trigger Clarity (WHEN/SKIP)
+
+Does the description clearly specify when to load AND when NOT to?
+
+| Score | Criteria |
+| ----- | ------------------------------------------- |
+| 0 | Vague or missing trigger conditions |
+| 1 | Has WHEN but not WHEN NOT (or vice versa) |
+| 2 | Clear WHEN and WHEN NOT (SKIP) binary gates |
+
+**Why it matters:** tilth found explicit WHEN/SKIP gates are the single most effective pattern for correct tool routing. Without them, agents either over-load (waste tokens) or under-load (miss relevant skills).
+
+### 2. "Replaces X" Framing
+
+Does it explicitly state what behavior, tool, or workflow it replaces?
+
+| Score | Criteria |
+| ----- | ---------------------------------------------- |
+| 0 | No replacement framing |
+| 1 | Implied replacement or "better than X" |
+| 2 | Explicit "Replaces X" statement in description |
+
+**Why it matters:** tilth measured +36 percentage points adoption on Haiku when tool descriptions included "Replaces X" framing. Models route better when they know what's superseded.
+
+### 3. Concrete Examples
+
+Does it provide working code with actual tool calls, not just prose?
+
+| Score | Criteria |
+| ----- | ----------------------------------------------------------------------- |
+| 0 | No examples |
+| 1 | Prose descriptions or generic prompt templates |
+| 2 | Working code examples with actual tool calls / before-after comparisons |
+
+**Why it matters:** Models follow examples more reliably than instructions. Prompt templates ("Analyze this image: [attach]") score 1, not 2 — they lack tool integration.
+
+### 4. Anti-Patterns
+
+Does it show what NOT to do?
+
+| Score | Criteria |
+| ----- | -------------------------------------------------------------- |
+| 0 | No anti-patterns section |
+| 1 | Brief "don't do X" mentions |
+| 2 | Wrong/right comparison table or detailed anti-patterns section |
+
+**Why it matters:** Failure prevention is as valuable as success instruction. tilth's evidence-based feature removal (disabling `--map` because 62% of losing tasks used it) proves tracking what fails matters.
+
+### 5. Verification Integration
+
+Does it reference or require verification steps?
+
+| Score | Criteria |
+| ----- | -------------------------------------------------------------- |
+| 0 | No mention of verification |
+| 1 | Mentions verification in passing |
+| 2 | Integrates verification steps into workflow (run X, confirm Y) |
+
+**Why it matters:** Skills that don't include verification produce unverified outputs. The build loop is perceive → create → **verify** → ship.
+
+### 6. Token Efficiency
+
+Is the token cost proportional to value delivered?
+
+| Score | Criteria |
+| ----- | ------------------------------------------------------------------------------ |
+| 0 | >2500 tokens with low value density (filler, repetition, obvious instructions) |
+| 1 | Reasonable size OR moderate value density |
+| 2 | <1500 tokens with high value density, OR larger with proportional density |
+
+**Why it matters:** Every loaded skill consumes context budget. A 4000-token skill that could be 1500 tokens is actively harmful — it displaces working memory.
+
+### 7. Cross-References
+
+Does it link to related skills for next steps?
+
+| Score | Criteria |
+| ----- | ---------------------------------------------------------------------------- |
+| 0 | No references to other skills |
+| 1 | Mentions related skills in text |
+| 2 | Clear "Related Skills" table or "Next Phase" with skill loading instructions |
+
+**Why it matters:** Skills that exist in isolation force agents to discover connections. Explicit connections reduce routing failures.
+
+## Score Interpretation
+
+| Range | Tier | Meaning |
+| ----- | ---------- | ----------------------------------------------------------- |
+| 12–14 | Exemplary | Ready to ship — high adoption, measurable value |
+| 8–11 | Adequate | Functional but missing patterns that would improve adoption |
+| 4–7 | Needs Work | Significant gaps — may load but produce suboptimal results |
+| 0–3 | Poor | Should be rewritten or merged into another skill |
+
+## Category Assessment
+
+Beyond individual scoring, evaluate each skill's **category fit**:
+
+| Category | Expected Traits |
+| -------------------- | -------------------------------------------------------------------------------- |
+| Core Workflow | Loaded frequently, high token ROI, tight integration with other core skills |
+| Planning & Lifecycle | Clear phase transitions, handoff points between skills |
+| Debugging & Quality | Real examples from actual debugging sessions, measurable impact |
+| Code Review | Severity levels, actionable findings format |
+| Design & UI | Visual reference integration, component breakdown |
+| Agent Orchestration | Parallelism rules, coordination protocols |
+| External Integration | API examples, auth handling, error patterns |
+| Platform Specific | Version-pinned APIs, migration guidance |
+| Meta Skills | Self-referential consistency (does the skill-about-skills follow its own rules?) |
+
+## Audit Process
+
+1. **Inventory** — List all skills with token size
+2. **Sample** — Read representative skills from each category
+3. **Score** — Apply 7 dimensions to each sampled skill
+4. **Classify** — Assign tier and category
+5. **Identify** — Flag overlaps, dead weight, and upgrade candidates
+6. **Prioritize** — Rank improvements by impact (core skills first)
+
+## Effectiveness Signals (Observable)
+
+Beyond the rubric, track these runtime signals when possible:
+
+| Signal | Indicates |
+| ------------------------------------------ | ------------------------------------------------------------ |
+| Skill loaded but instructions not followed | Trigger too broad OR instructions too vague |
+| Skill never loaded despite relevant tasks | Trigger too narrow OR description doesn't match task framing |
+| Agent re-reads files after skill search | Skill examples insufficient — agent needs more context |
+| Verification skipped after skill workflow | Skill doesn't integrate verification |
+| Agent loads 5+ skills simultaneously | Skills too granular — should be merged |
+
+## Template-Level Metrics
+
+For the overall template (all skills + tools + commands):
+
+| Metric | Target | Current |
+| ----------------------------- | ------ | ------- |
+| Core skills at Exemplary tier | 100% | (audit) |
+| No skills at Poor tier | 0 | (audit) |
+| Average token cost per skill | <1500 | (audit) |
+| Skills with WHEN/SKIP gates | 100% | (audit) |
+| Skills with anti-patterns | >75% | (audit) |
+| Overlap/redundancy pairs | 0 | (audit) |
+
+---
+
+_Apply this framework during effectiveness audits. Update scoring criteria as new evidence emerges._
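The rubric in `benchmark-framework.md` reduces to summing seven 0 to 2 scores and mapping the total onto the Score Interpretation ranges. A minimal sketch of that arithmetic follows; `RubricScores`, `totalScore`, and `tierFor` are hypothetical names, and the sample scores are illustrative values rather than results from the package.

```typescript
type DimensionScore = 0 | 1 | 2;
type Tier = "Exemplary" | "Adequate" | "Needs Work" | "Poor";

// One field per rubric dimension, in the order the framework lists them.
interface RubricScores {
  trigger: DimensionScore;
  replaces: DimensionScore;
  examples: DimensionScore;
  antiPatterns: DimensionScore;
  verification: DimensionScore;
  tokenEfficiency: DimensionScore;
  crossReferences: DimensionScore;
}

// Sum of the seven dimensions; the maximum is 14.
function totalScore(s: RubricScores): number {
  return (
    s.trigger + s.replaces + s.examples + s.antiPatterns +
    s.verification + s.tokenEfficiency + s.crossReferences
  );
}

// Map a total onto the Score Interpretation ranges (12-14, 8-11, 4-7, 0-3).
function tierFor(total: number): Tier {
  if (!Number.isInteger(total) || total < 0 || total > 14) {
    throw new RangeError(`Total must be an integer from 0 to 14, got ${total}`);
  }
  if (total >= 12) return "Exemplary";
  if (total >= 8) return "Adequate";
  if (total >= 4) return "Needs Work";
  return "Poor";
}

// Illustrative scores for an imaginary skill.
const sample: RubricScores = {
  trigger: 2, replaces: 0, examples: 2, antiPatterns: 1,
  verification: 2, tokenEfficiency: 1, crossReferences: 1,
};
console.log(totalScore(sample), tierFor(totalScore(sample)));
```

Typing each dimension as `0 | 1 | 2` lets the compiler reject out-of-range scores before any total is computed.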
package/dist/template/.opencode/memory/research/effectiveness-audit.md
@@ -0,0 +1,213 @@
+---
+purpose: Systematic effectiveness audit of all template skills, tools, and commands
+updated: 2026-03-08
+framework: benchmark-framework.md (7 dimensions, 0-2 each, max 14)
+---
+
+# Effectiveness Audit — OpenCodeKit Template
+
+## Methodology
+
+Scored 25+ skills, 2 tools, 18 commands using the benchmark framework.
+Dimensions: **T**rigger clarity, **R**eplaces X, **E**xamples, **A**nti-patterns, **V**erification, **Tok**en efficiency, **X**-references.
+Scale: 0=missing, 1=partial, 2=strong. Max: 14.
+
+## Summary
+
+| Metric | Value |
+| ------------------ | -------- |
+| Total skills | 73 |
+| Reviewed in detail | 25 |
+| Exemplary (12-14) | 5 (20%) |
+| Adequate (8-11) | 10 (40%) |
+| Needs Work (4-7) | 8 (32%) |
+| Poor (0-3) | 2 (8%) |
+| Custom tools | 2 |
+| Commands | 18 |
+
+## Tier 1: Exemplary (12-14)
+
+Skills ready to ship — high adoption, measurable value.
+
+| Skill | T | R | E | A | V | Tok | X | Total | Tokens | Notes |
+| ------------------------------ | --- | --- | --- | --- | --- | --- | --- | ------ | ------ | ----------------------------------------------------------------------- |
+| structured-edit | 2 | 1 | 2 | 2 | 2 | 2 | 2 | **13** | ~1.3k | Gold standard. 5-step protocol, Red Flags, BAD/GOOD examples, quick ref |
+| code-navigation | 2 | 2 | 2 | 2 | 0 | 2 | 2 | **12** | ~1.2k | 7 patterns, tilth comparison, cost awareness, right/wrong examples |
+| verification-before-completion | 2 | 0 | 2 | 2 | 2 | 2 | 1 | **11** | ~1.6k | Iron Law, rationalization prevention, smart verification |
+| tool-priority | 2 | 2 | 2 | 2 | 0 | 1 | 2 | **11** | ~3.3k | "Replaces X" on all tools, tilth section, LSP 9-op table |
+| requesting-code-review | 2 | 0 | 2 | 2 | 2 | 1 | 2 | **11** | ~2.5k | 3 review depths, 5 reviewer prompts, synthesis checklist |
+
+### What makes these work
+
+1. **Right/wrong examples** — Every exemplary skill shows incorrect then correct approach
+2. **Tables over prose** — Decision tables, comparison tables, common mistakes tables
+3. **Integrated verification** — structured-edit Step 5 (CONFIRM), verification-before-completion Iron Law
+4. **Quick reference blocks** — structured-edit and tool-priority both end with copy-pasteable references
+5. **"Replaces X" framing** — code-navigation and tool-priority explicitly state what they supersede
+
+## Tier 2: Adequate (8-11)
+
+Functional but missing patterns that would improve adoption.
+
+| Skill | T | R | E | A | V | Tok | X | Total | Tokens | Gap |
+| --------------------------- | --- | --- | --- | --- | --- | --- | --- | ------ | ------ | ---------------------------------------------------- |
+| dispatching-parallel-agents | 2 | 0 | 2 | 2 | 2 | 2 | 0 | **10** | ~1.4k | No "Replaces X", no cross-refs |
+| executing-plans | 2 | 0 | 2 | 1 | 2 | 1 | 2 | **10** | ~1.5k | No "Replaces X" |
+| agent-teams | 2 | 0 | 2 | 2 | 1 | 1 | 1 | **9** | ~2.1k | No "Replaces X", could be more token-efficient |
+| condition-based-waiting | 2 | 1 | 2 | 2 | 0 | 2 | 0 | **9** | ~868 | No verification step, no cross-refs |
+| root-cause-tracing | 2 | 0 | 2 | 0 | 1 | 2 | 1 | **8** | ~1.2k | No anti-patterns, no "Replaces X" |
+| writing-plans | 2 | 0 | 2 | 1 | 1 | 1 | 1 | **8** | ~2.0k | No "Replaces X", could trim |
+| beads | 2 | 0 | 2 | 0 | 0 | 2 | 2 | **8** | ~1.2k | No anti-patterns, no verification |
+| receiving-code-review | 2 | 0 | 2 | 2 | 1 | 1 | 0 | **8** | ~1.7k | No "Replaces X", no cross-refs |
+| defense-in-depth | 2 | 0 | 2 | 0 | 0 | 2 | 1 | **7** | ~1.0k | No anti-patterns, no verification |
+| systematic-debugging | 2 | 0 | 2 | 1 | 0 | 1 | 0 | **6** | ~1.6k | Border case — no verification, limited anti-patterns |
+
+### Common gaps in this tier
+
+1. **No "Replaces X"** — 9/10 adequate skills lack replacement framing
+2. **Missing verification** — 6/10 don't integrate verification steps
+3. **No anti-patterns** — 5/10 lack anti-pattern sections
+4. **No cross-references** — 4/10 are isolated (no links to related skills)
+
+## Tier 3: Needs Work (4-7)
+
+Significant gaps — may load but produce suboptimal results.
+
+| Skill | T | R | E | A | V | Tok | X | Total | Tokens | Issue |
+| --------------------------- | --- | --- | --- | --- | --- | --- | --- | ----- | ------ | ---------------------------------------------- |
+| context-management | 2 | 0 | 2 | 1 | 0 | 1 | 0 | **6** | ~1.7k | Overlaps with DCP system prompts |
+| session-management | 2 | 0 | 1 | 1 | 0 | 2 | 0 | **6** | ~848 | Generic, no tool examples |
+| swarm-coordination | 2 | 0 | 1 | 1 | 0 | 1 | 1 | **6** | ~1.8k | Partially complete, missing examples |
+| memory-system | 2 | 0 | 2 | 0 | 0 | 1 | 0 | **5** | ~2.4k | Token-heavy, no anti-patterns, no verification |
+| brainstorming | 2 | 0 | 0 | 0 | 0 | 2 | 1 | **5** | ~832 | No examples, no anti-patterns |
+| mockup-to-code | 2 | 0 | 1 | 0 | 0 | 2 | 0 | **5** | ~794 | Prompt templates only |
+| subagent-driven-development | 2 | 0 | 1 | 0 | 0 | 2 | 0 | **5** | ~1.2k | No anti-patterns, no verification |
+| visual-analysis | 2 | 0 | 1 | 0 | 0 | 2 | 0 | **5** | ~705 | Prompt templates only |
+
+### Common issues
+
+1. **Prompt-template-only pattern** — mockup-to-code, visual-analysis give templates without tool integration
+2. **No anti-patterns** — 7/8 lack anti-pattern sections entirely
+3. **No verification** — 8/8 don't integrate verification
+4. **No examples** — brainstorming has zero code examples
+
+## Tier 4: Poor (0-3)
+
+Should be rewritten or merged.
+
+| Skill | T | R | E | A | V | Tok | X | Total | Tokens | Action |
+| ------------------- | --- | --- | --- | --- | --- | --- | --- | ----- | ------ | ------------------------------------------------------------ |
+| ui-ux-research | 2 | 0 | 1 | 0 | 0 | 2 | 0 | **5** | ~609 | Merge into design-system-audit or rewrite with tool examples |
+| design-system-audit | 2 | 0 | 1 | 0 | 0 | 2 | 0 | **5** | ~527 | Merge with ui-ux-research or add substance |
+
+_Note: These scored 5 (Needs Work) on the rubric but are categorized as effective tier 4 because they consist entirely of prompt templates with no actionable tool integration, anti-patterns, or verification — making them the least effective in practice._
+
+## Not Reviewed (Estimated by Category)
+
+These 48 skills were not read in detail. Estimates based on YAML description, size, and category patterns.
+
+### Platform-Specific (likely Adequate if domain is relevant)
+
+- swiftui-expert-skill (~4.2k tokens) — Largest skill, likely good depth
+- swift-concurrency, core-data-expert — Domain-specific
+- react-best-practices, supabase-postgres-best-practices — Framework-specific
+
+### External Integrations (varies)
+
+- resend, cloudflare, supabase, polar, jira, figma, stitch, v0, v1-run, mqdh
+- These are MCP connector skills — effectiveness depends on API coverage
+
+### Meta Skills
+
+- skill-creator, writing-skills, testing-skills-with-subagents, sharing-skills, using-skills
+- Self-referential — should follow their own rules
+
+### Browser/Automation
+
+- playwright, playwriter, agent-browser, chrome-devtools
+
+### Context/Lifecycle
+
+- compaction, context-engineering, context-initialization, gemini-large-context
+- development-lifecycle, prd, prd-task
+- finishing-a-development-branch, using-git-worktrees
+- deep-research, source-code-research, opensrc, augment-context-engine
+- beads-bridge, ralph, index-knowledge, obsidian, pdf-extract
+- accessibility-audit, web-design-guidelines, frontend-design
+
+## Tools Audit
+
+| Tool | T | R | E | A | V | Tok | X | Total | Tokens | Notes |
+| ---------- | --- | --- | --- | --- | --- | --- | --- | ----- | ------ | ------------------------------------------------------- |
+| context7 | 2 | 2 | 2 | 0 | 0 | 1 | 0 | **7** | ~1.4k | Has "Replaces X" + WHEN/SKIP. Missing anti-patterns |
+| grepsearch | 1 | 2 | 2 | 0 | 0 | 2 | 0 | **7** | ~946 | Has "Replaces X". Missing full SKIP gate, anti-patterns |
+
+### Tool recommendations
+
+- Add anti-patterns to both tool descriptions (common misuse patterns)
+- context7: Add "SKIP: Internal code (use tilth/grep)" explicitly
+- grepsearch: Add full WHEN/SKIP binary gate
+
+## Commands Assessment
+
+18 commands total. Commands evaluated on: clear trigger, actionable steps, verification integration, error guidance.
+
+| Command | Category | Quality | Notes |
+| --------------------------- | -------- | ------- | ------------------------------------------------ |
+| lfg | Workflow | High | Full chain orchestration |
+| ship | Workflow | High | Clear gates and verification |
+| plan | Planning | High | Structured output |
+| verify | Quality | High | Recently improved (incremental, parallel, cache) |
+| compound | Learning | High | Extracts learnings |
+| start/resume/handoff | Session | Medium | Functional but could cross-ref more |
+| status | Info | Medium | |
+| pr | Git | Medium | |
+| review-codebase | Quality | Medium | |
+| research | Research | Medium | |
+| design/ui-review | Design | Low | Prompt-template style |
+| init/init-user/init-context | Setup | High | Well-tested |
+| create | Meta | Medium | |
+
+## Overlap Analysis
+
+| Pair | Overlap | Recommendation |
+| -------------------------------------------------------------- | ------------------------------------ | ----------------------------------------------------- |
+| context-management ↔ compaction | Both manage context size | Merge or clearly differentiate |
+| agent-teams ↔ swarm-coordination ↔ dispatching-parallel-agents | All handle parallel agents | Create decision tree in agent-teams, reference others |
+| session-management ↔ context-management | Both track context thresholds | Merge session into context-management |
+| ui-ux-research ↔ design-system-audit ↔ visual-analysis | All design-focused prompt templates | Consolidate into one design-audit skill |
+| beads ↔ beads-bridge | Bridge extends beads for multi-agent | Clear but should be documented in beads |
+| structured-edit ↔ code-navigation | Both about code manipulation | Cross-reference each other |
+
+## Top 10 Improvement Priorities
+
+Ranked by impact (core skills first, high-frequency usage).
+
+| # | Action | Target | Impact |
+| --- | -------------------------------------------------------------------------- | ----------------- | --------------------------- |
+| 1 | Add "Replaces X" to top 10 skills | All tier 2 skills | +adoption (tilth: +36pp) |
+| 2 | Add anti-patterns to beads, defense-in-depth, root-cause-tracing | Core debugging | +failure prevention |
+| 3 | Add verification steps to condition-based-waiting, defense-in-depth, beads | Core workflow | +correctness |
+| 4 | Consolidate context-management + session-management | Context skills | -redundancy, -token cost |
+| 5 | Consolidate ui-ux-research + design-system-audit + visual-analysis | Design skills | -3 weak skills → 1 adequate |
+| 6 | Rewrite brainstorming with concrete examples | Planning | +actionability |
+| 7 | Add cross-references to isolated skills (6 skills) | Various | +routing |
+| 8 | Trim memory-system from 2.4k to ~1.5k tokens | Core | +token efficiency |
+| 9 | Add "Replaces X" to tools (context7 SKIP gate, grepsearch WHEN gate) | Tools | +routing |
+| 10 | Audit remaining 48 un-reviewed skills | All | Full coverage |
+
+## Template-Level Metrics
+
+| Metric | Target | Current | Status |
+| ----------------------------- | ------ | ---------------- | ---------- |
+| Core skills at Exemplary tier | 100% | 50% (5/10 core) | Needs work |
+| No skills at Poor tier | 0 | 2 | Needs work |
+| Average token cost per skill | <1500 | ~1.5k (reviewed) | Borderline |
+| Skills with WHEN/SKIP gates | 100% | 100% (reviewed) | PASS |
+| Skills with anti-patterns | >75% | 44% (11/25) | Needs work |
+| Overlap/redundancy pairs | 0 | 6 pairs | Needs work |
+
+---
+
+_Next: Apply improvement priorities starting with #1 (add "Replaces X" to tier 2 skills)._
+_Re-audit after changes to measure improvement._
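When re-auditing, the tier tables in `effectiveness-audit.md` can be checked mechanically: recompute each row's total from its seven scores and compare the section a skill is listed under with the tier its total implies under the rubric ranges. The sketch below does that for a sample of rows copied from the tables in this file. `AuditRow`, `tierFor`, and `inconsistentRows` are hypothetical names, not part of the package.

```typescript
type Tier = "Exemplary" | "Adequate" | "Needs Work" | "Poor";

interface AuditRow {
  skill: string;
  scores: number[]; // T, R, E, A, V, Tok, X as printed in the table
  total: number; // Total column as printed
  listed: Tier; // tier section the row appears under
}

// Rubric ranges from benchmark-framework.md: 12-14, 8-11, 4-7, 0-3.
function tierFor(total: number): Tier {
  if (total >= 12) return "Exemplary";
  if (total >= 8) return "Adequate";
  if (total >= 4) return "Needs Work";
  return "Poor";
}

// Return the skills whose printed total does not match the sum of their scores,
// or whose total falls outside the range of the section they are listed in.
function inconsistentRows(rows: AuditRow[]): string[] {
  return rows
    .filter((r) => {
      const sum = r.scores.reduce((acc, n) => acc + n, 0);
      return sum !== r.total || tierFor(r.total) !== r.listed;
    })
    .map((r) => r.skill);
}

// A sample of rows copied from the tier tables in this file.
const sampleRows: AuditRow[] = [
  { skill: "structured-edit", scores: [2, 1, 2, 2, 2, 2, 2], total: 13, listed: "Exemplary" },
  { skill: "code-navigation", scores: [2, 2, 2, 2, 0, 2, 2], total: 12, listed: "Exemplary" },
  { skill: "verification-before-completion", scores: [2, 0, 2, 2, 2, 2, 1], total: 11, listed: "Exemplary" },
  { skill: "tool-priority", scores: [2, 2, 2, 2, 0, 1, 2], total: 11, listed: "Exemplary" },
  { skill: "requesting-code-review", scores: [2, 0, 2, 2, 2, 1, 2], total: 11, listed: "Exemplary" },
  { skill: "dispatching-parallel-agents", scores: [2, 0, 2, 2, 2, 2, 0], total: 10, listed: "Adequate" },
  { skill: "defense-in-depth", scores: [2, 0, 2, 0, 0, 2, 1], total: 7, listed: "Adequate" },
  { skill: "systematic-debugging", scores: [2, 0, 2, 1, 0, 1, 0], total: 6, listed: "Adequate" },
  { skill: "ui-ux-research", scores: [2, 0, 1, 0, 0, 2, 0], total: 5, listed: "Poor" },
  { skill: "design-system-audit", scores: [2, 0, 1, 0, 0, 2, 0], total: 5, listed: "Poor" },
];

console.log(inconsistentRows(sampleRows));
```

For this sample every printed total matches its scores, and the flagged rows are the ones whose total sits outside their section's range: the three Tier 1 rows totalling 11, the two Tier 2 rows totalling 7 and 6, and the two Tier 4 rows totalling 5. The audit's own note explains the Tier 4 placement as a deliberate override of the rubric.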
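Binary file package/dist/template/.opencode/memory.db
Binary file package/dist/template/.opencode/memory.db-shm
Binary file package/dist/template/.opencode/memory.db-wal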