opencodekit 0.18.4 → 0.18.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (43)
  1. package/dist/index.js +491 -47
  2. package/dist/template/.opencode/AGENTS.md +13 -1
  3. package/dist/template/.opencode/agent/build.md +4 -1
  4. package/dist/template/.opencode/agent/explore.md +25 -58
  5. package/dist/template/.opencode/command/ship.md +7 -5
  6. package/dist/template/.opencode/command/verify.md +63 -12
  7. package/dist/template/.opencode/memory/research/benchmark-framework.md +162 -0
  8. package/dist/template/.opencode/memory/research/effectiveness-audit.md +213 -0
  9. package/dist/template/.opencode/memory.db +0 -0
  10. package/dist/template/.opencode/memory.db-shm +0 -0
  11. package/dist/template/.opencode/memory.db-wal +0 -0
  12. package/dist/template/.opencode/opencode.json +1429 -1678
  13. package/dist/template/.opencode/package.json +1 -1
  14. package/dist/template/.opencode/plugin/lib/memory-helpers.ts +3 -129
  15. package/dist/template/.opencode/plugin/lib/memory-hooks.ts +4 -60
  16. package/dist/template/.opencode/plugin/memory.ts +0 -3
  17. package/dist/template/.opencode/skill/agent-teams/SKILL.md +16 -1
  18. package/dist/template/.opencode/skill/beads/SKILL.md +22 -0
  19. package/dist/template/.opencode/skill/brainstorming/SKILL.md +28 -0
  20. package/dist/template/.opencode/skill/code-navigation/SKILL.md +130 -0
  21. package/dist/template/.opencode/skill/condition-based-waiting/SKILL.md +12 -0
  22. package/dist/template/.opencode/skill/context-management/SKILL.md +122 -113
  23. package/dist/template/.opencode/skill/defense-in-depth/SKILL.md +20 -0
  24. package/dist/template/.opencode/skill/design-system-audit/SKILL.md +113 -112
  25. package/dist/template/.opencode/skill/dispatching-parallel-agents/SKILL.md +8 -0
  26. package/dist/template/.opencode/skill/executing-plans/SKILL.md +156 -132
  27. package/dist/template/.opencode/skill/memory-system/SKILL.md +50 -266
  28. package/dist/template/.opencode/skill/mockup-to-code/SKILL.md +21 -6
  29. package/dist/template/.opencode/skill/receiving-code-review/SKILL.md +8 -0
  30. package/dist/template/.opencode/skill/root-cause-tracing/SKILL.md +15 -0
  31. package/dist/template/.opencode/skill/session-management/SKILL.md +4 -103
  32. package/dist/template/.opencode/skill/subagent-driven-development/SKILL.md +23 -2
  33. package/dist/template/.opencode/skill/swarm-coordination/SKILL.md +17 -1
  34. package/dist/template/.opencode/skill/systematic-debugging/SKILL.md +21 -0
  35. package/dist/template/.opencode/skill/tool-priority/SKILL.md +34 -16
  36. package/dist/template/.opencode/skill/ui-ux-research/SKILL.md +5 -127
  37. package/dist/template/.opencode/skill/verification-before-completion/SKILL.md +36 -0
  38. package/dist/template/.opencode/skill/verification-before-completion/references/VERIFICATION_PROTOCOL.md +133 -29
  39. package/dist/template/.opencode/skill/visual-analysis/SKILL.md +20 -7
  40. package/dist/template/.opencode/skill/writing-plans/SKILL.md +7 -0
  41. package/dist/template/.opencode/tool/context7.ts +9 -1
  42. package/dist/template/.opencode/tool/grepsearch.ts +9 -1
  43. package/package.json +1 -1
@@ -220,7 +220,7 @@ For major tracked work:

  ## Edit Protocol

- `str_replace` failures are the #1 source of LLM coding failures. Use structured edits:
+ `str_replace` failures are the #1 source of LLM coding failures. When tilth MCP is available with `--edit`, prefer hash-anchored edits (see below). Otherwise, use structured edits:

  1. **LOCATE** — Use LSP tools (goToDefinition, findReferences) to find exact positions
  2. **READ** — Get fresh file content around target (offset: line-10, limit: 30)
@@ -241,6 +241,18 @@ Files over ~500 lines become hard to maintain and review. Extract helpers, split

  **Use the `structured-edit` skill for complex edits.**

+ ### Hash-Anchored Edits (MCP)
+
+ When tilth MCP is available with `--edit` mode, use hash-anchored edits for higher reliability:
+
+ 1. **READ** via `tilth_read` — output includes `line:hash|content` format per line
+ 2. **EDIT** via `tilth_edit` — reference lines by their `line:hash` anchor
+ 3. **REJECT** — if file changed since last read, hashes won't match; re-read and retry
+
+ **Benefits**: Eliminates `str_replace` failures entirely. If the file changed between read and edit, the operation fails safely (no silent corruption).
+
+ **Fallback**: Without tilth, use the standard LOCATE→READ→VERIFY→EDIT→CONFIRM flow above.
+
  ---

  ## Output Style
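As a sanity check on the hunk above, the read-validate-reject flow can be sketched locally. This is a hypothetical model, not tilth's actual implementation: the hash algorithm and width are assumptions; only the overall shape (anchored read, stale-anchor rejection) follows the documented steps.

```typescript
import { createHash } from "node:crypto";

// Assumed anchor scheme: a short content hash per line (tilth's real scheme
// is not shown in this diff).
const lineHash = (line: string): string =>
  createHash("sha256").update(line).digest("hex").slice(0, 8);

// "READ": render file content as line:hash|content anchors.
const readWithAnchors = (content: string): string[] =>
  content.split("\n").map((line, i) => `${i + 1}:${lineHash(line)}|${line}`);

// "EDIT": apply a replacement only if the anchor still matches the file.
// A stale anchor means the file changed since the last read, so reject.
function applyAnchoredEdit(
  lines: string[],
  anchor: { line: number; hash: string },
  replacement: string,
): string[] | null {
  const current = lines[anchor.line - 1];
  if (current === undefined || lineHash(current) !== anchor.hash) {
    return null; // hashes don't match: caller must re-read and retry
  }
  const next = [...lines];
  next[anchor.line - 1] = replacement;
  return next;
}
```

The point of the design: a stale anchor yields `null` rather than a silently corrupted file, which is the safety property the hunk claims over `str_replace`.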
@@ -79,7 +79,9 @@ Implement requested work, verify with fresh evidence, and coordinate subagents o

  - No success claims without fresh verification output
  - Verification failures are **signals, not condemnations** — adjust and proceed
- - Re-run typecheck/lint/tests after meaningful edits
+ - Re-run typecheck/lint/tests after meaningful edits (use incremental mode — changed files only)
+ - Run typecheck + lint in parallel, then tests sequentially
+ - Check `.beads/verify.log` cache before re-running — skip if no changes since last PASS
  - If verification fails twice on the same approach, **escalate with learnings**, not frustration

  ## Ritual Structure
@@ -170,6 +172,7 @@ Load contextually when needed:
  | UI work | `frontend-design`, `react-best-practices` |
  | Parallel orchestration | `swarm-coordination`, `beads-bridge` |
  | Before completion | `requesting-code-review`, `finishing-a-development-branch` |
+ | Codebase exploration | `code-navigation` |

  ## Execution Mode

@@ -11,6 +11,9 @@ tools:
  memory-update: false
  observation: false
  question: false
+ websearch: false
+ webfetch: false
+ codesearch: false
  ---

  You are OpenCode, the best coding agent on the planet.
@@ -19,8 +22,6 @@ You are OpenCode, the best coding agent on the planet.

  **Purpose**: Read-only codebase cartographer — you map terrain, you don't build on it.

- > _"Agency is knowing where the levers are before you pull them."_
-
  ## Identity

  You are a read-only codebase explorer. You output concise, evidence-backed findings with absolute paths only.
@@ -29,75 +30,41 @@ You are a read-only codebase explorer. You output concise, evidence-backed findi

  Find relevant files, symbols, and usage paths quickly for the caller.

- ## Rules
+ ## Tools — Use These for Local Code Search

- - Never modify files — read-only is a hard constraint
- - Return absolute paths in final output
- - Cite `file:line` evidence whenever possible
- - Prefer semantic lookup (LSP) before broad text search when it improves precision
- - Stop when you can answer with concrete evidence or when additional search only repeats confirmed paths
-
- ## Workflow
+ | Tool | Use For | Example |
+ |------|---------|--------|
+ | `grep` | Find text/regex patterns in files | `grep(pattern: "PatchEntry", include: "*.ts")` |
+ | `glob` | Find files by name/pattern | `glob(pattern: "src/**/*.ts")` |
+ | `lsp` (goToDefinition) | Jump to symbol definition | `lsp(operation: "goToDefinition", filePath: "...", line: N, character: N)` |
+ | `lsp` (findReferences) | Find all usages of a symbol | `lsp(operation: "findReferences", ...)` |
+ | `lsp` (hover) | Get type info and docs | `lsp(operation: "hover", ...)` |
+ | `read` | Read file content | `read(filePath: "src/utils/patch.ts")` |

- 1. Discover candidate files with `glob` or `workspaceSymbol`
- 2. Validate symbol flow with LSP (`goToDefinition`, `findReferences`)
- 3. Use `grep` for targeted pattern checks
- 4. Read only relevant sections
- 5. Return findings + next steps
-
- ## Thoroughness Levels
-
- | Level | Scope | Use When |
- | --------------- | ----------------------------- | ------------------------------------------ |
- | `quick` | 1-3 files, direct answer | Simple lookups, known symbol names |
- | `medium` | 3-6 files, include call paths | Understanding feature flow |
- | `very thorough` | Dependency map + edge cases | Complex refactor prep, architecture review |
-
- ## Output
-
- - **Files**: absolute paths with line refs
- - **Findings**: concise, evidence-backed
- - **Next Steps** (optional): recommended actions for the caller
-
- ## Identity
-
- You are a read-only codebase explorer. You output concise, evidence-backed findings with absolute paths only.
-
- ## Task
-
- Find relevant files, symbols, and usage paths quickly for the caller.
+ **NEVER** use `websearch`, `webfetch`, or `codesearch` — those search the internet, not your project.

  ## Rules

  - Never modify files — read-only is a hard constraint
  - Return absolute paths in final output
  - Cite `file:line` evidence whenever possible
- - Prefer semantic lookup (LSP) before broad text search when it improves precision
- - Stop when you can answer with concrete evidence or when additional search only repeats confirmed paths
+ - **Always start with `grep` or `glob`** to locate files and symbols — do NOT read directories to browse
+ - Use LSP for precise navigation after finding candidate locations
+ - Stop when you can answer with concrete evidence

- ## Before You Explore
+ ## Navigation Patterns

- - **Be certain**: Only explore what's needed for the task at hand
- - **Don't over-explore**: Stop when you have enough evidence to answer
- - **Use LSP first**: Start with goToDefinition/findReferences before grep
- - **Stay scoped**: Don't explore files outside the task scope
- - **Cite evidence**: Every finding needs file:line reference
+ 1. **Search first, read second**: `grep` to find where a symbol lives, then `read` only that section
+ 2. **Don't re-read**: If you already read a file, reference what you learned — don't read it again
+ 3. **Follow the chain**: definition → usages → callers via LSP findReferences
+ 4. **Target ≤3 tool calls per symbol**: grep → read section → done

  ## Workflow

- 1. Discover candidate files with `glob` or `workspaceSymbol`
- 2. Validate symbol flow with LSP (`goToDefinition`, `findReferences`)
- 3. Use `grep` for targeted pattern checks
- 4. Read only relevant sections
- 5. Return findings + next steps
-
- ## Thoroughness Levels
-
- | Level | Scope | Use When |
- | --------------- | ----------------------------- | ------------------------------------------ |
- | `quick` | 1-3 files, direct answer | Simple lookups, known symbol names |
- | `medium` | 3-6 files, include call paths | Understanding feature flow |
- | `very thorough` | Dependency map + edge cases | Complex refactor prep, architecture review |
+ 1. `grep` or `glob` to discover candidate files
+ 2. `lsp` goToDefinition/findReferences for precise symbol navigation
+ 3. `read` only the relevant sections (use offset/limit)
+ 4. Return findings with file:line evidence

  ## Output

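The "search first, read second" pattern this hunk introduces can be approximated with two small helpers. These are hypothetical stand-ins for the real `grep` and `read` tools, shown only to make the two-step flow concrete.

```typescript
// Stand-in for grep: return 1-based line numbers matching a pattern.
function grepLines(content: string, pattern: RegExp): number[] {
  return content
    .split("\n")
    .flatMap((line, i) => (pattern.test(line) ? [i + 1] : []));
}

// Stand-in for read with offset/limit: read only a window around one hit
// instead of the whole file.
function readWindow(content: string, line: number, radius = 5): string {
  const lines = content.split("\n");
  const start = Math.max(0, line - 1 - radius);
  return lines.slice(start, line + radius).join("\n");
}
```

Locate first, then read a bounded window: that is the "≤3 tool calls per symbol" budget (grep, read section, done) expressed as code.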
@@ -75,13 +75,15 @@ ls .beads/artifacts/$ARGUMENTS/

  If `plan.md` exists with dependency graph:

- 1. **Parse waves** from plan header (Wave 1, Wave 2, etc.)
- 2. **Execute Wave 1** (independent tasks) in parallel using `task()` subagents
- 3. **Wait for Wave 1 completion** — all tasks pass or report failures
- 4. **Execute Wave 2** (depends on Wave 1) in parallel
+ 1. **Load skill:** `skill({ name: "executing-plans" })`
+ 2. **Parse waves** from dependency graph section
+ 3. **Execute wave-by-wave:**
+    - Single-task wave → execute directly (no subagent overhead)
+    - Multi-task wave → dispatch parallel `task({ subagent_type: "general" })` subagents, one per task
+ 4. **Review after each wave** — run verification gates, report, wait for feedback
  5. **Continue** until all waves complete

- **Parallel safety:** Only tasks within same wave run in parallel. Tasks in Wave N+1 wait for Wave N.
+ **Parallel safety:** Only tasks within same wave run in parallel. Tasks must NOT share files. Tasks in Wave N+1 wait for Wave N.

  ### Phase 3A: PRD Task Loop (Sequential Fallback)

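The wave semantics in this hunk (dependencies decide the wave, same-wave tasks must not share files) reduce to a small grouping rule. A sketch, with illustrative names (`Task`, `toWaves`, `assertNoSharedFiles` are not part of the template) and the assumption that a valid plan's dependency graph is acyclic:

```typescript
type Task = { id: string; deps: string[]; files: string[] };

// Group tasks into waves: a task joins the earliest wave after all its deps.
// Assumes the deps form a DAG, as a valid plan.md should.
function toWaves(tasks: Task[]): Task[][] {
  const wave = new Map<string, number>();
  const byId = new Map(tasks.map((t) => [t.id, t]));
  const resolve = (id: string): number => {
    if (wave.has(id)) return wave.get(id)!;
    const t = byId.get(id)!;
    const w = t.deps.length === 0 ? 0 : Math.max(...t.deps.map(resolve)) + 1;
    wave.set(id, w);
    return w;
  };
  tasks.forEach((t) => resolve(t.id));
  const waves: Task[][] = [];
  for (const t of tasks) (waves[wave.get(t.id)!] ??= []).push(t);
  return waves;
}

// Parallel safety check: tasks in one wave must not touch the same files.
function assertNoSharedFiles(waveTasks: Task[]): void {
  const seen = new Set<string>();
  for (const t of waveTasks)
    for (const f of t.files) {
      if (seen.has(f)) throw new Error(`file ${f} shared within a wave`);
      seen.add(f);
    }
}
```

Wave N+1 waiting for Wave N falls out of the grouping: a dependent task always lands in a strictly later wave than everything it depends on.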
@@ -1,6 +1,6 @@
  ---
  description: Verify implementation completeness, correctness, and coherence
- argument-hint: "<bead-id> [--quick] [--full] [--fix]"
+ argument-hint: "<bead-id> [--quick] [--full] [--fix] [--no-cache]"
  agent: review
  ---

@@ -17,12 +17,13 @@ skill({ name: "verification-before-completion" });

  ## Parse Arguments

- | Argument | Default | Description |
- | ----------- | -------- | ---------------------------------------------- |
- | `<bead-id>` | required | The bead to verify |
- | `--quick` | false | Gates only, skip coherence check |
- | `--full` | false | Force full verification mode (non-incremental) |
- | `--fix` | false | Auto-fix lint/format issues |
+ | Argument | Default | Description |
+ | ------------ | -------- | ---------------------------------------------- |
+ | `<bead-id>` | required | The bead to verify |
+ | `--quick` | false | Gates only, skip coherence check |
+ | `--full` | false | Force full verification mode (non-incremental) |
+ | `--fix` | false | Auto-fix lint/format issues |
+ | `--no-cache` | false | Bypass verification cache, force fresh run |

  ## Determine Input Type

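For illustration only, the argument table above implies a parser roughly like the following. `parseVerifyArgs` is a hypothetical name; the real command runtime is not part of this diff.

```typescript
// Parse "<bead-id> [--quick] [--full] [--fix] [--no-cache]" per the table.
function parseVerifyArgs(argv: string[]) {
  const flags = new Set(argv.filter((a) => a.startsWith("--")));
  const beadId = argv.find((a) => !a.startsWith("--"));
  if (!beadId) throw new Error("<bead-id> is required");
  return {
    beadId,
    quick: flags.has("--quick"),     // gates only, skip coherence check
    full: flags.has("--full"),       // force full (non-incremental) mode
    fix: flags.has("--fix"),         // auto-fix lint/format issues
    noCache: flags.has("--no-cache"), // bypass verification cache
  };
}
```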
@@ -39,6 +40,32 @@ skill({ name: "verification-before-completion" });
  - **Run the gates**: Build, test, lint, typecheck are non-negotiable
  - **Use project conventions**: Check `package.json` scripts first

+ ## Phase 0: Check Verification Cache
+
+ Before running any gates, check if a recent verification is still valid:
+
+ ```bash
+ # Compute current state fingerprint (commit hash + full diff + untracked files)
+ CURRENT_STAMP=$(printf '%s\n%s\n%s' \
+   "$(git rev-parse HEAD)" \
+   "$(git diff HEAD -- '*.ts' '*.tsx' '*.js' '*.jsx')" \
+   "$(git ls-files --others --exclude-standard -- '*.ts' '*.tsx' '*.js' '*.jsx' | xargs cat 2>/dev/null)" \
+   | shasum -a 256 | cut -d' ' -f1)
+ LAST_STAMP=$(tail -1 .beads/verify.log 2>/dev/null | awk '{print $1}')
+ ```
+
+ | Condition | Action |
+ | ----------------------------------------- | ------------------------------------------------------ |
+ | `--no-cache` or `--full` | Skip cache check, run fresh |
+ | `CURRENT_STAMP == LAST_STAMP` | Report **cached PASS**, skip to Phase 2 (completeness) |
+ | `CURRENT_STAMP != LAST_STAMP` or no cache | Run gates normally |
+
+ When cache hits, report:
+
+ ```text
+ Verification: cached PASS (no changes since <timestamp from verify.log>)
+ ```
+
  ## Phase 1: Gather Context

  ```bash
@@ -66,10 +93,34 @@ Extract all requirements/tasks from the PRD and verify each is implemented:

  Follow the [Verification Protocol](../skill/verification-before-completion/references/VERIFICATION_PROTOCOL.md):

- - Use **incremental mode** for `verify` (pre-commit checks)
- - Use **full mode** if `--full` flag is passed
- - Run parallel group first, then sequential group
- - Report results in gate results table format
+ **Default: incremental mode** (changed files only, parallel gates).
+
+ | Mode | When | Behavior |
+ | ----------- | ----------------------------------------- | -------------------------------- |
+ | Incremental | Default, <20 changed files | Lint changed files, test changed |
+ | Full | `--full` flag, >20 changed files, or ship | Lint all, test all |
+
+ **Execution order:**
+
+ 1. **Parallel**: typecheck + lint (simultaneously)
+ 2. **Sequential** (after parallel passes): test, then build (ship only)
+
+ Report results with mode column:
+
+ ```text
+ | Gate | Status | Mode | Time |
+ |-----------|--------|-------------|--------|
+ | Typecheck | PASS | full | 2.1s |
+ | Lint | PASS | incremental | 0.3s |
+ | Test | PASS | incremental | 1.2s |
+ | Build | SKIP | — | — |
+ ```
+
+ **After all gates pass**, record to verification cache:
+
+ ```bash
+ echo "$CURRENT_STAMP $(date -u +%Y-%m-%dT%H:%M:%SZ) PASS" >> .beads/verify.log
+ ```

  If `--fix` flag provided, run the project's auto-fix command (e.g., `npm run lint:fix`, `ruff check --fix`, `cargo clippy --fix`).

@@ -93,7 +144,7 @@ Output:

  1. **Result**: READY TO SHIP / NEEDS WORK / BLOCKED
  2. **Completeness**: score and status
- 3. **Correctness**: gate results
+ 3. **Correctness**: gate results (with mode column)
  4. **Coherence**: contradictions found (if not --quick)
  5. **Blocking issues** to fix before shipping
  6. **Next step**: `/ship $ARGUMENTS` if ready, or list fixes needed
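The execution order the verify command introduces (typecheck and lint in parallel, tests only after both pass) is a small piece of control flow. A sketch, where `runGate` stands in for invoking the project's real npm scripts:

```typescript
type GateResult = { gate: string; ok: boolean };

// Run the parallel group first; run tests only if the parallel group passed.
async function runGates(
  runGate: (gate: string) => Promise<boolean>,
): Promise<GateResult[]> {
  // Parallel group: typecheck + lint simultaneously.
  const parallel = ["typecheck", "lint"];
  const first = await Promise.all(
    parallel.map(async (g) => ({ gate: g, ok: await runGate(g) })),
  );
  if (first.some((r) => !r.ok)) return first; // stop before tests
  // Sequential group: tests after the parallel group passes.
  const test = { gate: "test", ok: await runGate("test") };
  return [...first, test];
}
```

Failing fast on the cheap parallel gates keeps the expensive test run from spending time on code that does not even typecheck.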
@@ -0,0 +1,162 @@
+ ---
+ purpose: Scoring rubric for evaluating template agent effectiveness
+ updated: 2026-03-08
+ based-on: tilth research (measurable pattern adoption improvements)
+ ---
+
+ # Agent Effectiveness Benchmark Framework
+
+ ## Purpose
+
+ Evaluate whether skills, tools, and commands in the OpenCodeKit template actually help AI agents perform better. Based on tilth's methodology: they measured accuracy, cost/correct answer, and tool adoption rates to prove what works.
+
+ ## Scoring Dimensions
+
+ 7 dimensions, each scored 0–2. Max score: **14**.
+
+ ### 1. Trigger Clarity (WHEN/SKIP)
+
+ Does the description clearly specify when to load AND when NOT to?
+
+ | Score | Criteria |
+ | ----- | ------------------------------------------- |
+ | 0 | Vague or missing trigger conditions |
+ | 1 | Has WHEN but not WHEN NOT (or vice versa) |
+ | 2 | Clear WHEN and WHEN NOT (SKIP) binary gates |
+
+ **Why it matters:** tilth found explicit WHEN/SKIP gates are the single most effective pattern for correct tool routing. Without them, agents either over-load (waste tokens) or under-load (miss relevant skills).
+
+ ### 2. "Replaces X" Framing
+
+ Does it explicitly state what behavior, tool, or workflow it replaces?
+
+ | Score | Criteria |
+ | ----- | ---------------------------------------------- |
+ | 0 | No replacement framing |
+ | 1 | Implied replacement or "better than X" |
+ | 2 | Explicit "Replaces X" statement in description |
+
+ **Why it matters:** tilth measured +36 percentage points adoption on Haiku when tool descriptions included "Replaces X" framing. Models route better when they know what's superseded.
+
+ ### 3. Concrete Examples
+
+ Does it provide working code with actual tool calls, not just prose?
+
+ | Score | Criteria |
+ | ----- | ----------------------------------------------------------------------- |
+ | 0 | No examples |
+ | 1 | Prose descriptions or generic prompt templates |
+ | 2 | Working code examples with actual tool calls / before-after comparisons |
+
+ **Why it matters:** Models follow examples more reliably than instructions. Prompt templates ("Analyze this image: [attach]") score 1, not 2 — they lack tool integration.
+
+ ### 4. Anti-Patterns
+
+ Does it show what NOT to do?
+
+ | Score | Criteria |
+ | ----- | -------------------------------------------------------------- |
+ | 0 | No anti-patterns section |
+ | 1 | Brief "don't do X" mentions |
+ | 2 | Wrong/right comparison table or detailed anti-patterns section |
+
+ **Why it matters:** Failure prevention is as valuable as success instruction. tilth's evidence-based feature removal (disabling `--map` because 62% of losing tasks used it) proves tracking what fails matters.
+
+ ### 5. Verification Integration
+
+ Does it reference or require verification steps?
+
+ | Score | Criteria |
+ | ----- | -------------------------------------------------------------- |
+ | 0 | No mention of verification |
+ | 1 | Mentions verification in passing |
+ | 2 | Integrates verification steps into workflow (run X, confirm Y) |
+
+ **Why it matters:** Skills that don't include verification produce unverified outputs. The build loop is perceive → create → **verify** → ship.
+
+ ### 6. Token Efficiency
+
+ Is the token cost proportional to value delivered?
+
+ | Score | Criteria |
+ | ----- | ------------------------------------------------------------------------------ |
+ | 0 | >2500 tokens with low value density (filler, repetition, obvious instructions) |
+ | 1 | Reasonable size OR moderate value density |
+ | 2 | <1500 tokens with high value density, OR larger with proportional density |
+
+ **Why it matters:** Every loaded skill consumes context budget. A 4000-token skill that could be 1500 tokens is actively harmful — it displaces working memory.
+
+ ### 7. Cross-References
+
+ Does it link to related skills for next steps?
+
+ | Score | Criteria |
+ | ----- | ---------------------------------------------------------------------------- |
+ | 0 | No references to other skills |
+ | 1 | Mentions related skills in text |
+ | 2 | Clear "Related Skills" table or "Next Phase" with skill loading instructions |
+
+ **Why it matters:** Skills that exist in isolation force agents to discover connections. Explicit connections reduce routing failures.
+
+ ## Score Interpretation
+
+ | Range | Tier | Meaning |
+ | ----- | ---------- | ----------------------------------------------------------- |
+ | 12–14 | Exemplary | Ready to ship — high adoption, measurable value |
+ | 8–11 | Adequate | Functional but missing patterns that would improve adoption |
+ | 4–7 | Needs Work | Significant gaps — may load but produce suboptimal results |
+ | 0–3 | Poor | Should be rewritten or merged into another skill |
+
+ ## Category Assessment
+
+ Beyond individual scoring, evaluate each skill's **category fit**:
+
+ | Category | Expected Traits |
+ | -------------------- | -------------------------------------------------------------------------------- |
+ | Core Workflow | Loaded frequently, high token ROI, tight integration with other core skills |
+ | Planning & Lifecycle | Clear phase transitions, handoff points between skills |
+ | Debugging & Quality | Real examples from actual debugging sessions, measurable impact |
+ | Code Review | Severity levels, actionable findings format |
+ | Design & UI | Visual reference integration, component breakdown |
+ | Agent Orchestration | Parallelism rules, coordination protocols |
+ | External Integration | API examples, auth handling, error patterns |
+ | Platform Specific | Version-pinned APIs, migration guidance |
+ | Meta Skills | Self-referential consistency (does the skill-about-skills follow its own rules?) |
+
+ ## Audit Process
+
+ 1. **Inventory** — List all skills with token size
+ 2. **Sample** — Read representative skills from each category
+ 3. **Score** — Apply 7 dimensions to each sampled skill
+ 4. **Classify** — Assign tier and category
+ 5. **Identify** — Flag overlaps, dead weight, and upgrade candidates
+ 6. **Prioritize** — Rank improvements by impact (core skills first)
+
+ ## Effectiveness Signals (Observable)
+
+ Beyond the rubric, track these runtime signals when possible:
+
+ | Signal | Indicates |
+ | ------------------------------------------ | ------------------------------------------------------------ |
+ | Skill loaded but instructions not followed | Trigger too broad OR instructions too vague |
+ | Skill never loaded despite relevant tasks | Trigger too narrow OR description doesn't match task framing |
+ | Agent re-reads files after skill search | Skill examples insufficient — agent needs more context |
+ | Verification skipped after skill workflow | Skill doesn't integrate verification |
+ | Agent loads 5+ skills simultaneously | Skills too granular — should be merged |
+
+ ## Template-Level Metrics
+
+ For the overall template (all skills + tools + commands):
+
+ | Metric | Target | Current |
+ | ----------------------------- | ------ | ------- |
+ | Core skills at Exemplary tier | 100% | (audit) |
+ | No skills at Poor tier | 0 | (audit) |
+ | Average token cost per skill | <1500 | (audit) |
+ | Skills with WHEN/SKIP gates | 100% | (audit) |
+ | Skills with anti-patterns | >75% | (audit) |
+ | Overlap/redundancy pairs | 0 | (audit) |
+
+ ---
+
+ _Apply this framework during effectiveness audits. Update scoring criteria as new evidence emerges._
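The rubric arithmetic in the new benchmark file is simple enough to state as code. This sketch just mirrors the Score Interpretation table; `tier` is an illustrative helper, not part of the template.

```typescript
// Map 7 dimension scores (each 0-2, per the Scoring Dimensions section)
// to the tiers in the Score Interpretation table.
function tier(scores: number[]): string {
  if (scores.length !== 7 || scores.some((s) => s < 0 || s > 2)) {
    throw new Error("expected 7 dimension scores, each in 0-2");
  }
  const total = scores.reduce((a, b) => a + b, 0); // max 14
  if (total >= 12) return "Exemplary";
  if (total >= 8) return "Adequate";
  if (total >= 4) return "Needs Work";
  return "Poor";
}
```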