@kennethsolomon/shipkit 3.10.2 → 3.11.0

Files changed (30)
  1. package/README.md +92 -4
  2. package/commands/sk/context-budget.md +5 -0
  3. package/commands/sk/eval.md +5 -0
  4. package/commands/sk/health.md +5 -0
  5. package/commands/sk/help.md +32 -8
  6. package/commands/sk/learn.md +5 -0
  7. package/commands/sk/resume-session.md +5 -0
  8. package/commands/sk/safety-guard.md +5 -0
  9. package/commands/sk/save-session.md +5 -0
  10. package/commands/sk/set-profile.md +8 -0
  11. package/package.json +1 -1
  12. package/skills/sk:brainstorming/SKILL.md +13 -0
  13. package/skills/sk:context-budget/SKILL.md +126 -0
  14. package/skills/sk:eval/SKILL.md +188 -0
  15. package/skills/sk:health/SKILL.md +146 -0
  16. package/skills/sk:learn/SKILL.md +138 -0
  17. package/skills/sk:resume-session/SKILL.md +95 -0
  18. package/skills/sk:safety-guard/SKILL.md +134 -0
  19. package/skills/sk:save-session/SKILL.md +84 -0
  20. package/skills/sk:setup-claude/SKILL.md +39 -2
  21. package/skills/sk:setup-claude/templates/.claude/settings.json.template +110 -26
  22. package/skills/sk:setup-claude/templates/CLAUDE.md.template +8 -1
  23. package/skills/sk:setup-claude/templates/hooks/config-protection.sh +71 -0
  24. package/skills/sk:setup-claude/templates/hooks/console-log-warning.sh +42 -0
  25. package/skills/sk:setup-claude/templates/hooks/cost-tracker.sh +26 -0
  26. package/skills/sk:setup-claude/templates/hooks/post-edit-format.sh +53 -0
  27. package/skills/sk:setup-claude/templates/hooks/safety-guard.sh +72 -0
  28. package/skills/sk:setup-claude/templates/hooks/suggest-compact.sh +35 -0
  29. package/skills/sk:setup-optimizer/SKILL.md +59 -8
  30. package/skills/sk:start/SKILL.md +25 -0
package/README.md CHANGED
@@ -48,6 +48,44 @@ That's it. `/sk:setup-claude` creates your project scaffolding: planning files,
48
48
 
49
49
  `/sk:start` is the recommended entry point — it classifies your task and routes you to the optimal flow automatically. You can also jump directly to `/sk:brainstorm`, `/sk:debug`, or any other flow entry point.
50
50
 
51
+ ### Updating ShipKit
52
+
53
+ ```bash
54
+ # Update the package
55
+ npm install -g @kennethsolomon/shipkit && shipkit
56
+
57
+ # Then in each project, update CLAUDE.md + deploy new hooks:
58
+ /sk:setup-optimizer
59
+ ```
60
+
61
+ `shipkit` re-installs all skills and commands globally. `/sk:setup-optimizer` updates each project's CLAUDE.md with new commands and deploys any missing hooks.
62
+
63
+ ---
64
+
65
+ ## Lifecycle Hooks
66
+
67
+ `/sk:setup-claude` installs lifecycle hooks that automate common tasks. Core hooks are always installed; enhanced hooks are opt-in.
68
+
69
+ **Core hooks (always installed):**
70
+ | Hook | Event | What it does |
71
+ |------|-------|-------------|
72
+ | `session-start` | SessionStart | Loads branch, recent commits, tech debt, code health |
73
+ | `session-stop` | Stop | Logs session accomplishments to `tasks/progress.md` |
74
+ | `pre-compact` | PreCompact | Saves git state before context compression |
75
+ | `validate-commit` | PreToolUse (git commit) | Validates conventional commit format, detects secrets |
76
+ | `validate-push` | PreToolUse (git push) | Warns before pushing to protected branches |
77
+ | `log-agent` | SubagentStart | Logs sub-agent invocations to `tasks/agent-audit.log` |
78
+
79
+ **Enhanced hooks (opt-in via `/sk:setup-claude` or `/sk:setup-optimizer`):**
80
+ | Hook | Event | What it does |
81
+ |------|-------|-------------|
82
+ | `config-protection` | PreToolUse (Edit/Write) | Blocks modifications to linter/formatter configs |
83
+ | `post-edit-format` | PostToolUse (Edit) | Auto-formats with Biome/Prettier/Pint/gofmt after edits |
84
+ | `console-log-warning` | Stop | Warns about `console.log`, `dd()`, `var_dump()` in modified files |
85
+ | `suggest-compact` | PreToolUse (Edit/Write) | Suggests `/compact` after 50+ tool calls |
86
+ | `cost-tracker` | Stop | Logs session metadata to `.claude/sessions/cost-log.jsonl` |
87
+ | `safety-guard` | PreToolUse (Bash/Edit/Write) | Enforces `/sk:safety-guard` freeze/careful mode |
88
+
51
89
  ---
52
90
 
53
91
  ## Pick Your Flow
@@ -166,15 +204,56 @@ Pre-existing issues are logged to `tasks/tech-debt.md` — not fixed inline.
166
204
 
167
205
  Use these anytime — they're not part of any workflow.
168
206
 
207
+ ### Intelligence
208
+
209
+ | Command | Usage | What it does |
210
+ |---------|-------|-------------|
211
+ | `/sk:learn` | `/sk:learn` | Extract reusable patterns from the session with confidence scoring (0.3-0.9) |
212
+ | `/sk:learn` | `/sk:learn --list` | Show all learned patterns |
213
+ | `/sk:context-budget` | `/sk:context-budget` | Audit token consumption across skills, agents, MCP tools, CLAUDE.md |
214
+ | `/sk:context-budget` | `/sk:context-budget --verbose` | Per-file token breakdown |
215
+ | `/sk:health` | `/sk:health` | Scorecard across 7 categories (0-70): tools, context, gates, memory, evals, security, cost |
216
+ | `/sk:eval` | `/sk:eval define auth` | Define eval criteria before coding |
217
+ | `/sk:eval` | `/sk:eval check auth` | Run evals during implementation |
218
+ | `/sk:eval` | `/sk:eval report` | Summary of all eval results with pass@k metrics |
219
+
220
+ ### Session Management
221
+
222
+ | Command | Usage | What it does |
223
+ |---------|-------|-------------|
224
+ | `/sk:save-session` | `/sk:save-session` | Save branch, task, progress, open questions to `.claude/sessions/` |
225
+ | `/sk:save-session` | `/sk:save-session --name "auth-flow"` | Save with a custom name |
226
+ | `/sk:resume-session` | `/sk:resume-session` | List saved sessions and pick one to restore |
227
+ | `/sk:resume-session` | `/sk:resume-session --latest` | Auto-pick most recent session |
228
+ | `/sk:context` | `/sk:context` | Load all project context (automatic via hooks on session start) |
229
+
230
+ ### Safety
231
+
232
+ | Command | Usage | What it does |
233
+ |---------|-------|-------------|
234
+ | `/sk:safety-guard` | `/sk:safety-guard careful` | Block destructive commands (rm -rf, force push, etc.) |
235
+ | `/sk:safety-guard` | `/sk:safety-guard freeze --dir src/` | Lock edits to `src/` only |
236
+ | `/sk:safety-guard` | `/sk:safety-guard guard --dir src/` | Both careful + freeze combined |
237
+ | `/sk:safety-guard` | `/sk:safety-guard off` | Disable all guards |
238
+ | `/sk:safety-guard` | `/sk:safety-guard status` | Show current mode + blocked action count |
239
+
240
+ ### Code Quality
241
+
169
242
  | Command | When to use |
170
243
  |---------|------------|
171
244
  | `/sk:scope-check` | Mid-implementation — detect scope creep (On Track / Minor / Significant / Out of Control) |
172
245
  | `/sk:retro` | After shipping — analyze velocity, blockers, patterns, generate action items |
246
+ | `/sk:seo-audit` | Web projects — SEO audit with source + dev server scanning |
247
+
248
+ ### Documentation & Setup
249
+
250
+ | Command | When to use |
251
+ |---------|------------|
173
252
  | `/sk:reverse-doc` | Inherited codebase — generate architecture/design docs from existing code |
253
+ | `/sk:setup-optimizer` | Maintenance — diagnose, update workflow, deploy hooks, enrich CLAUDE.md |
254
+ | `/sk:mvp` | New idea — generate a complete MVP app from a single prompt |
174
255
  | `/sk:status` | Quick view of workflow and task status |
175
256
  | `/sk:dashboard` | Visual Kanban board across all git worktrees |
176
- | `/sk:mvp` | Generate a complete MVP app from a single idea prompt |
177
- | `/sk:seo-audit` | SEO audit for web projects |
178
257
 
179
258
  ---
180
259
 
@@ -193,7 +272,7 @@ Use these anytime — they're not part of any workflow.
193
272
  ## All Commands
194
273
 
195
274
  <details>
196
- <summary><strong>38 commands</strong> — click to expand</summary>
275
+ <summary><strong>51 commands</strong> — click to expand</summary>
197
276
 
198
277
  | Command | Purpose |
199
278
  |---------|---------|
@@ -205,33 +284,42 @@ Use these anytime — they're not part of any workflow.
205
284
  | `/sk:change` | Handle mid-workflow requirement changes |
206
285
  | `/sk:config` | View/edit project config |
207
286
  | `/sk:context` | Load project context (automatic via hooks) |
287
+ | `/sk:context-budget` | Audit context window token consumption |
208
288
  | `/sk:dashboard` | Live Kanban board — sk:dashboard across worktrees |
209
289
  | `/sk:debug` | Structured bug investigation |
210
290
  | `/sk:e2e` | E2E Tests — behavioral verification |
291
+ | `/sk:eval` | Define, run, and report evals for agent reliability |
211
292
  | `/sk:execute-plan` | Execute plan checkboxes in batches |
212
293
  | `/sk:fast-track` | Small changes — skip planning, keep gates |
213
294
  | `/sk:features` | Sync feature specs with codebase |
214
295
  | `/sk:finish-feature` | Changelog + PR |
215
296
  | `/sk:frontend-design` | UI mockup + optional Pencil visual design |
216
297
  | `/sk:gates` | All quality gates in parallel batches |
298
+ | `/sk:health` | Harness self-audit scorecard |
217
299
  | `/sk:help` | Show all commands |
218
300
  | `/sk:hotfix` | Emergency fix workflow |
219
301
  | `/sk:laravel-init` | Configure existing Laravel project |
220
302
  | `/sk:laravel-new` | Scaffold fresh Laravel app |
303
+ | `/sk:learn` | Extract reusable patterns from sessions |
221
304
  | `/sk:lint` | Auto-detect and run all linters |
222
305
  | `/sk:mvp` | Generate MVP app from a prompt |
223
306
  | `/sk:perf` | Performance audit |
224
307
  | `/sk:plan` | Create/refresh planning files |
225
308
  | `/sk:release` | Version bump + tag (`--android` / `--ios` for store audit) |
309
+ | `/sk:resume-session` | Resume a previously saved session |
226
310
  | `/sk:retro` | Post-ship retrospective |
227
311
  | `/sk:reverse-doc` | Generate docs from existing code |
228
312
  | `/sk:review` | 7-dimension code review |
313
+ | `/sk:safety-guard` | Protect against destructive ops |
314
+ | `/sk:save-session` | Save session state for continuity |
229
315
  | `/sk:schema-migrate` | Database schema change analysis |
230
316
  | `/sk:scope-check` | Detect scope creep mid-implementation |
231
317
  | `/sk:security-check` | OWASP security audit |
232
- | `/sk:seo-audit` | sk:seo-audit for web projects |
318
+ | `/sk:seo-audit` | SEO audit for web projects |
233
319
  | `/sk:set-profile` | Switch model routing profile |
234
320
  | `/sk:setup-claude` | Bootstrap project scaffolding |
321
+ | `/sk:setup-optimizer` | Diagnose + update workflow + deploy hooks + enrich CLAUDE.md |
322
+ | `/sk:skill-creator` | Create or improve skills |
235
323
  | `/sk:smart-commit` | Conventional commit with approval |
236
324
  | `/sk:start` | Smart entry point — classifies task, routes to optimal flow |
237
325
  | `/sk:status` | Show workflow + task status |
@@ -0,0 +1,5 @@
1
+ ---
2
+ description: "Audit context window token consumption and find optimization opportunities."
3
+ ---
4
+
5
+ Use the `sk:context-budget` skill to inventory all components consuming context tokens (agents, skills, rules, MCP tools, CLAUDE.md), classify usage frequency, detect bloat, and recommend top 3 optimizations with estimated token savings.
@@ -0,0 +1,5 @@
1
+ ---
2
+ description: "Define, run, and report on evaluations for agent reliability and code quality."
3
+ ---
4
+
5
+ Use the `sk:eval` skill to define eval criteria before coding (`define`), verify during implementation (`check`), and summarize results after shipping (`report`). Supports code-based, model-based, and human graders with pass@k and pass^k metrics.
@@ -0,0 +1,5 @@
1
+ ---
2
+ description: "Run harness self-audit and produce a health scorecard."
3
+ ---
4
+
5
+ Use the `sk:health` skill to score your ShipKit setup across 7 categories (Tool Coverage, Context Efficiency, Quality Gates, Memory Persistence, Eval Coverage, Security Guardrails, Cost Efficiency). Produces a 0-70 scorecard with concrete findings and top 3 actions.
@@ -65,35 +65,55 @@ Requirements change mid-workflow? Run `/sk:change` — it classifies the scope a
65
65
  |---------|-------------|
66
66
  | `/sk:accessibility` | WCAG 2.1 AA audit on frontend code |
67
67
  | `/sk:api-design` | Design REST/GraphQL contracts before implementation |
68
- | `/sk:brainstorm` | Explore requirements, no code |
68
+ | `/sk:autopilot` | Hands-free workflow — auto-skip, auto-advance, auto-commit |
69
+ | `/sk:brainstorm` | Explore requirements and design (includes search-first research) |
69
70
  | `/sk:branch` | Create branch from current task |
70
- | `/sk:change` | Handle mid-workflow requirement change — re-enter at the right step |
71
+ | `/sk:change` | Handle mid-workflow requirement change |
72
+ | `/sk:config` | View and edit project config |
73
+ | `/sk:context` | Load project context (automatic via hooks) |
74
+ | `/sk:context-budget` | Audit context window token consumption and find savings |
75
+ | `/sk:dashboard` | Read-only workflow Kanban board |
71
76
  | `/sk:debug` | Structured bug investigation |
77
+ | `/sk:e2e` | E2E behavioral verification |
78
+ | `/sk:eval` | Define, run, and report evals for agent reliability |
72
79
  | `/sk:execute-plan` | Implement plan in batches |
80
+ | `/sk:fast-track` | Small changes — skip planning, keep gates |
73
81
  | `/sk:features` | Sync docs/sk:features/ specs with codebase |
74
82
  | `/sk:finish-feature` | Changelog + PR creation |
75
- | `/sk:frontend-design` | UI mockup + design spec before implementation. Add `--pencil` to also generate a Pencil visual mockup (requires Pencil app + MCP) |
83
+ | `/sk:frontend-design` | UI mockup + optional Pencil visual mockup |
84
+ | `/sk:gates` | All quality gates in parallel batches |
85
+ | `/sk:health` | Harness self-audit scorecard (7 categories, 0-70) |
76
86
  | `/sk:hotfix` | Emergency fix workflow (skips design/TDD) |
77
87
  | `/sk:laravel-init` | Configure existing Laravel project |
78
88
  | `/sk:laravel-new` | Scaffold new Laravel project |
89
+ | `/sk:learn` | Extract reusable patterns from sessions |
79
90
  | `/sk:lint` | Auto-detect and run all linters |
91
+ | `/sk:mvp` | Generate MVP app from a prompt |
80
92
  | `/sk:perf` | Performance audit |
81
93
  | `/sk:plan` | Create/refresh task planning files |
82
- | `/sk:release` | Automate releases: bump version, update CHANGELOG, create tag, push to GitHub. Use --android and/or --ios flags for App Store / Play Store readiness audit |
83
- | `/sk:review` | Blast-radius-aware self-review of branch changes |
94
+ | `/sk:release` | Version bump + tag (`--android` / `--ios` for store audit) |
95
+ | `/sk:resume-session` | Resume a previously saved session |
96
+ | `/sk:retro` | Post-ship retrospective |
97
+ | `/sk:reverse-doc` | Generate docs from existing code |
98
+ | `/sk:review` | 7-dimension self-review of branch changes |
99
+ | `/sk:safety-guard` | Protect against destructive ops (careful/freeze/guard) |
100
+ | `/sk:save-session` | Save session state for cross-session continuity |
84
101
  | `/sk:schema-migrate` | Multi-ORM schema change analysis |
102
+ | `/sk:scope-check` | Detect scope creep mid-implementation |
85
103
  | `/sk:security-check` | OWASP security audit |
104
+ | `/sk:seo-audit` | SEO audit for web projects |
105
+ | `/sk:set-profile` | Switch model routing profile |
86
106
  | `/sk:setup-claude` | Bootstrap project scaffolding |
87
- | `/sk:setup-optimizer` | Enrich CLAUDE.md by scanning codebase |
107
+ | `/sk:setup-optimizer` | Diagnose + update workflow + enrich CLAUDE.md |
88
108
  | `/sk:skill-creator` | Create or improve skills |
89
109
  | `/sk:smart-commit` | Conventional commit with approval |
110
+ | `/sk:start` | Smart entry point — classifies task, routes to optimal flow |
90
111
  | `/sk:status` | Show workflow and task status |
112
+ | `/sk:team` | Parallel domain agents for full-stack tasks |
91
113
  | `/sk:test` | Auto-detect and verify all tests pass |
92
114
  | `/sk:update-task` | Mark task done, log completion |
93
115
  | `/sk:write-plan` | Write plan to `tasks/todo.md` |
94
116
  | `/sk:write-tests` | TDD: write failing tests first |
95
- | `/sk:config` | View and edit project config |
96
- | `/sk:set-profile` | Switch model routing profile |
97
117
 
98
118
  ---
99
119
 
@@ -113,9 +133,13 @@ ShipKit routes each skill to the right model automatically. Set once per project
113
133
  | brainstorm, write-plan, debug, execute-plan, review | opus | opus | sonnet | sonnet |
114
134
  | write-tests, frontend-design, api-design, security-check | opus | sonnet | sonnet | sonnet |
115
135
  | change | opus | sonnet | sonnet | sonnet |
136
+ | autopilot, team | opus | opus | sonnet | sonnet |
116
137
  | perf, schema-migrate, accessibility | opus | sonnet | sonnet | haiku |
138
+ | eval | sonnet | sonnet | sonnet | haiku |
117
139
  | lint, test | sonnet | sonnet | haiku | haiku |
118
140
  | smart-commit, branch, update-task | haiku | haiku | haiku | haiku |
141
+ | start, learn, context-budget, health | haiku | haiku | haiku | haiku |
142
+ | save-session, resume-session, safety-guard | haiku | haiku | haiku | haiku |
119
143
 
120
144
  `opus` = inherit (uses your current session model).
121
145
  Config lives in `.shipkit/config.json` — per project, gitignored by default.
@@ -0,0 +1,5 @@
1
+ ---
2
+ description: "Extract reusable patterns from the current session into learned instincts."
3
+ ---
4
+
5
+ Use the `sk:learn` skill to analyze the current session for extractable patterns (error resolutions, debugging techniques, workarounds, project conventions). Patterns are saved with confidence scoring and can be promoted from project-scoped to global.
@@ -0,0 +1,5 @@
1
+ ---
2
+ description: "Resume a previously saved session with full context restoration."
3
+ ---
4
+
5
+ Use the `sk:resume-session` skill to list available saved sessions from `.claude/sessions/`, select one, and restore its context (branch, task state, progress, open questions, next steps) into the current conversation.
@@ -0,0 +1,5 @@
1
+ ---
2
+ description: "Protect against destructive operations with careful, freeze, and guard modes."
3
+ ---
4
+
5
+ Use the `sk:safety-guard` skill to activate protection modes: `careful` (block destructive commands), `freeze --dir <path>` (lock edits to a directory), `guard --dir <path>` (both), `off` (disable), or `status` (show current mode).
@@ -0,0 +1,5 @@
1
+ ---
2
+ description: "Save current session state for cross-session continuity."
3
+ ---
4
+
5
+ Use the `sk:save-session` skill to persist the current session state (branch, task, progress, findings, open questions) to `.claude/sessions/` for resumption in a future conversation. Essential for EPIC-scope multi-session workflows.
@@ -30,6 +30,10 @@ Valid profiles: `full-sail` · `quality` · `balanced` · `budget`
30
30
  | smart-commit, branch, update-task | haiku | haiku | haiku | haiku |
31
31
  | autopilot, team | opus | opus | sonnet | sonnet |
32
32
  | start | haiku | haiku | haiku | haiku |
33
+ | learn, context-budget, health | haiku | haiku | haiku | haiku |
34
+ | save-session, resume-session | haiku | haiku | haiku | haiku |
35
+ | safety-guard | haiku | haiku | haiku | haiku |
36
+ | eval | sonnet | sonnet | sonnet | haiku |
33
37
 
34
38
  Note: `opus` = inherit (uses the current session model). Switch to Opus 4.5 in your session to get the full benefit.
35
39
 
@@ -70,6 +74,10 @@ Model assignments for this project:
70
74
  smart-commit, branch, update-task → haiku
71
75
  autopilot, team → <model>
72
76
  start → haiku
77
+ learn, context-budget, health → haiku
78
+ save-session, resume-session → haiku
79
+ safety-guard → haiku
80
+ eval → <model>
73
81
 
74
82
  Run /sk:config to see all settings or make further changes.
75
83
  ```
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@kennethsolomon/shipkit",
3
- "version": "3.10.2",
3
+ "version": "3.11.0",
4
4
  "description": "A structured workflow toolkit for Claude Code.",
5
5
  "keywords": [
6
6
  "claude",
@@ -74,6 +74,19 @@ digraph brainstorming {
74
74
  - Only one question per message - if a topic needs more exploration, break it into multiple questions
75
75
  - Focus on understanding: purpose, constraints, success criteria
76
76
 
77
+ **Search-First Research (before proposing approaches):**
78
+ Before proposing custom solutions, check if the problem is already solved:
79
+ 1. **Grep codebase** — does similar functionality already exist in this repo?
80
+ 2. **Check package registries** — is there a well-maintained package for this? (npm, PyPI, Packagist, crates.io)
81
+ 3. **Check existing skills** — does a ShipKit skill or MCP server already handle this?
82
+
83
+ Decision matrix:
84
+ - **Adopt** — existing solution covers 90%+ of requirements → use it directly
85
+ - **Extend** — existing solution covers 60-90% → extend or wrap it
86
+ - **Build custom** — nothing suitable exists → build from scratch (informed by what was found)
87
+
88
+ If a suitable package or existing solution is found, include it as one of the approaches.
89
+
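The decision matrix above can be read as a simple threshold function. The sketch below is illustrative only: `research_decision` and the idea of expressing coverage as a fraction are assumptions, and treating anything under 60% as "nothing suitable" is one interpretation of a gap the skill text leaves open.

```python
def research_decision(coverage: float) -> str:
    """Map requirement coverage (0.0 to 1.0) of an existing solution to a decision.

    Hypothetical helper: the thresholds come from the decision matrix,
    but the function name and the fractional input are assumptions.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0.0 and 1.0")
    if coverage >= 0.9:
        return "adopt"         # covers 90%+ of requirements: use it directly
    if coverage >= 0.6:
        return "extend"        # covers 60-90%: extend or wrap it
    return "build custom"      # assumed: below 60% counts as nothing suitable


print(research_decision(0.95))
print(research_decision(0.70))
print(research_decision(0.20))
```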
77
90
  **Exploring approaches:**
78
91
  - Propose 2-3 different approaches with trade-offs
79
92
  - Present options conversationally with your recommendation and reasoning
@@ -0,0 +1,126 @@
1
+ ---
2
+ name: sk:context-budget
3
+ description: "Audit context window token consumption and find optimization opportunities."
4
+ ---
5
+
6
+ # /sk:context-budget — Token Consumption Audit
7
+
8
+ Audits all components that consume context window tokens — agents, skills, rules, MCP tools, CLAUDE.md — and identifies optimization opportunities.
9
+
10
+ ## Usage
11
+
12
+ ```
13
+ /sk:context-budget # standard audit
14
+ /sk:context-budget --verbose # per-file breakdown
15
+ ```
16
+
17
+ ## Model Routing
18
+
19
+ Read `.shipkit/config.json` from the project root if it exists.
20
+
21
+ | Profile | Model |
22
+ |---------|-------|
23
+ | `full-sail` | haiku |
24
+ | `quality` | haiku |
25
+ | `balanced` | haiku |
26
+ | `budget` | haiku |
27
+
28
+ > Counting and classification is lightweight — haiku is sufficient.
29
+
30
+ ## How It Works
31
+
32
+ ### Phase 1: Inventory
33
+
34
+ Scan and count token estimates for every loaded component:
35
+
36
+ | Component | Location | Token Estimation |
37
+ |-----------|----------|------------------|
38
+ | CLAUDE.md | `CLAUDE.md` | `words * 1.3` |
39
+ | Global CLAUDE.md | `~/.claude/CLAUDE.md` | `words * 1.3` |
40
+ | Skills | `skills/*/SKILL.md` | `words * 1.3` |
41
+ | Commands | `commands/**/*.md` | `words * 1.3` |
42
+ | Agents | `.claude/agents/*.md` | `words * 1.3` |
43
+ | Rules | `.claude/rules/*.md` | `words * 1.3` |
44
+ | MCP tool schemas | count tools * ~500 tokens each | `tool_count * 500` |
45
+ | Hooks | `.claude/hooks/*.sh` (minimal overhead) | `words * 1.3` |
46
+
47
+ **Token estimation formula:**
48
+ - Prose/markdown: `word_count * 1.3`
49
+ - Code blocks: `char_count / 4`
50
+ - MCP tool schemas: ~500 tokens per tool definition
51
+
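The estimation formula above could be applied roughly as follows. This is a minimal sketch, not the skill's actual implementation: the function names are hypothetical, and it assumes code blocks are delimited by triple-backtick fences and that the fence markers themselves count as code characters.

```python
import re

# Build the fence marker programmatically so this example can sit inside a fenced block.
FENCE_MARK = "`" * 3
# Assumption: code blocks are triple-backtick fenced, as in the SKILL.md files.
FENCE = re.compile(FENCE_MARK + r".*?" + FENCE_MARK, re.DOTALL)


def estimate_tokens(markdown: str) -> int:
    """Estimate tokens: fenced code at chars / 4, remaining prose at words * 1.3."""
    code_chars = sum(len(block) for block in FENCE.findall(markdown))
    prose = FENCE.sub(" ", markdown)          # drop code so it is not counted twice
    prose_words = len(prose.split())
    return round(prose_words * 1.3 + code_chars / 4)


def estimate_mcp_tokens(tool_count: int, per_tool: int = 500) -> int:
    """Estimate MCP schema overhead at roughly 500 tokens per tool definition."""
    return tool_count * per_tool


print(estimate_tokens("Audit all components that consume context window tokens"))
print(estimate_mcp_tokens(30))
```

These are word-count heuristics, so the results are only approximations of what a real tokenizer would report.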
52
+ ### Phase 2: Classify Usage Frequency
53
+
54
+ For each component, classify how often it's actually needed:
55
+
56
+ | Classification | Meaning | Action |
57
+ |---------------|---------|--------|
58
+ | **Always** | Loaded every session, always relevant | Keep as-is |
59
+ | **Sometimes** | Relevant to specific task types | Consider conditional loading |
60
+ | **Rarely** | Edge case, rarely triggered | Candidate for removal/extraction |
61
+
62
+ Classification heuristics:
63
+ - Skills used in the workflow (brainstorm, write-tests, gates, etc.) → Always
64
+ - Skills triggered by keywords (frontend-design, api-design) → Sometimes
65
+ - Niche skills (seo-audit, schema-migrate) → Rarely
66
+ - MCP tools: if >20 tools on one server → flag as over-subscribed
67
+
68
+ ### Phase 3: Detect Issues
69
+
70
+ Flag these common problems:
71
+
72
+ 1. **Bloated agents** — agent descriptions >200 lines
73
+ 2. **Bloated skills** — skill definitions >400 lines
74
+ 3. **Bloated rules** — rule files >100 lines
75
+ 4. **MCP over-subscription** — servers with >20 tools (each costs ~500 tokens)
76
+ 5. **CLI-wrapping MCPs** — MCP servers that just wrap CLI tools (overhead > benefit)
77
+ 6. **Duplicate content** — same instructions in CLAUDE.md AND skill files
78
+ 7. **CLAUDE.md bloat** — CLAUDE.md >200 lines (target: 200 lines or fewer)
79
+ 8. **Unused components** — skills/agents never referenced in workflow
80
+
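The size-based checks in the list above (items 1 to 4 and 7) lend themselves to a small threshold table. The following sketch assumes line counts and MCP tool counts have already been collected. The names `LINE_LIMITS`, `flag_line_counts`, and `flag_mcp_servers` are hypothetical and not part of ShipKit.

```python
# Thresholds taken from the issue list; the dict keys are illustrative labels.
LINE_LIMITS = {"agent": 200, "skill": 400, "rule": 100, "claude_md": 200}
MCP_TOOL_LIMIT = 20
TOKENS_PER_MCP_TOOL = 500


def flag_line_counts(files: list[tuple[str, str, int]]) -> list[str]:
    """Flag (kind, name, line_count) entries that exceed the limit for their kind."""
    issues = []
    for kind, name, lines in files:
        limit = LINE_LIMITS.get(kind)
        if limit is not None and lines > limit:
            issues.append(f"{kind} {name} has {lines} lines (limit {limit})")
    return issues


def flag_mcp_servers(servers: dict[str, int]) -> list[str]:
    """Flag MCP servers that expose more than MCP_TOOL_LIMIT tools."""
    return [
        f"MCP server {name} has {count} tools (~{count * TOKENS_PER_MCP_TOOL:,} tokens)"
        for name, count in servers.items()
        if count > MCP_TOOL_LIMIT
    ]


print(flag_line_counts([("agent", "perf-auditor", 220), ("rule", "style", 40)]))
print(flag_mcp_servers({"playwright": 28, "small-server": 5}))
```

The remaining checks (CLI-wrapping MCPs, duplicate content, unused components) need judgment about content rather than a simple count, so they are left out of this sketch.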
81
+ ### Phase 4: Report
82
+
83
+ Output a structured report:
84
+
85
+ ```
86
+ === Context Budget Audit ===
87
+
88
+ Component Breakdown:
89
+ CLAUDE.md ~1,200 tokens
90
+ Global CLAUDE.md ~800 tokens
91
+ Skills (42 files) ~18,000 tokens
92
+ Commands (35 files) ~8,000 tokens
93
+ Agents (8 files) ~3,200 tokens
94
+ Rules (5 files) ~1,500 tokens
95
+ MCP tools (3 servers) ~15,000 tokens (30 tools)
96
+ ─────────────────────────────────
97
+ Total overhead: ~47,700 tokens
98
+
99
+ Context window: 200,000 tokens
100
+ Overhead: 47,700 tokens (23.8%)
101
+ Available for work: 152,300 tokens
102
+
103
+ Issues Found:
104
+ [HIGH] MCP server "playwright" has 28 tools (~14,000 tokens)
105
+ [MEDIUM] Skill sk:frontend-design is 380 lines (~500 tokens)
106
+ [LOW] Agent perf-auditor has 220 lines (~290 tokens)
107
+
108
+ Top 3 Optimizations:
109
+ 1. Remove unused MCP tools from playwright (save ~7,000 tokens)
110
+ 2. Consolidate duplicate workflow instructions (save ~1,200 tokens)
111
+ 3. Trim agent descriptions to <150 lines (save ~400 tokens)
112
+
113
+ Potential savings: ~8,600 tokens (18% reduction)
114
+ ```
115
+
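The totals in the sample report follow from simple arithmetic over the component estimates. As a rough sketch, assuming a 200,000-token context window as in the example (`summarize_overhead` is a hypothetical name):

```python
def summarize_overhead(components: dict[str, int], context_window: int = 200_000) -> dict[str, float]:
    """Sum per-component token estimates and relate them to the context window."""
    total = sum(components.values())
    return {
        "total": total,
        "percent": 100 * total / context_window,   # share of the window used as overhead
        "available": context_window - total,       # tokens left for actual work
    }


# Figures copied from the sample report.
sample = {
    "CLAUDE.md": 1_200,
    "Global CLAUDE.md": 800,
    "Skills": 18_000,
    "Commands": 8_000,
    "Agents": 3_200,
    "Rules": 1_500,
    "MCP tools": 15_000,
}
summary = summarize_overhead(sample)
print(summary["total"], summary["available"])
```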
116
+ ### --verbose Mode
117
+
118
+ Adds per-file token breakdown:
119
+
120
+ ```
121
+ Skills Breakdown:
122
+ sk:autopilot/SKILL.md ~620 tokens
123
+ sk:brainstorming/SKILL.md ~480 tokens
124
+ sk:gates/SKILL.md ~440 tokens
125
+ ...
126
+ ```
@@ -0,0 +1,188 @@
1
+ ---
2
+ name: sk:eval
3
+ description: "Define, run, and report on evaluations for agent reliability and code quality."
4
+ ---
5
+
6
+ # /sk:eval — Eval-Driven Development
7
+
8
+ A formal evaluation framework for measuring agent reliability and code quality. Define evals before coding, check during implementation, and report after shipping.
9
+
10
+ ## Usage
11
+
12
+ ```
13
+ /sk:eval define <feature> # create eval definition
14
+ /sk:eval check <feature> # run evals against current state
15
+ /sk:eval report # summary of all eval results
16
+ /sk:eval list # show all defined evals
17
+ ```
18
+
19
+ ## Model Routing
20
+
21
+ Read `.shipkit/config.json` from the project root if it exists.
22
+
23
+ | Profile | Model |
24
+ |---------|-------|
25
+ | `full-sail` | sonnet |
26
+ | `quality` | sonnet |
27
+ | `balanced` | sonnet |
28
+ | `budget` | haiku |
29
+
30
+ > Eval analysis needs reasoning for model-based graders — sonnet for balanced+.
31
+
32
+ ## Eval Types
33
+
34
+ ### Capability Evals
35
+
36
+ Test whether Claude can accomplish something new:
37
+
38
+ - "Can it generate a valid migration from a schema description?"
39
+ - "Can it write a test that covers all edge cases?"
40
+ - "Can it refactor without changing behavior?"
41
+
42
+ ### Regression Evals
43
+
44
+ Ensure changes don't break existing behavior:
45
+
46
+ - "Does the login flow still work after auth refactor?"
47
+ - "Do all API endpoints still return correct status codes?"
48
+ - "Are all existing tests still passing?"
49
+
50
+ ## Grader Types
51
+
52
+ ### Code-Based (Deterministic)
53
+
54
+ Graded by running commands — pass/fail:
55
+
56
+ ```yaml
57
+ grader: code
58
+ checks:
59
+ - command: "npm test"
60
+ expect: exit_code_0
61
+ - command: "grep -r 'TODO' src/"
62
+ expect: no_output
63
+ - command: "npx tsc --noEmit"
64
+ expect: exit_code_0
65
+ ```
66
+
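One way a code-based grader like the YAML above might be evaluated is sketched below. The skill does not specify an implementation, so the function names are hypothetical. The sketch supports only the two `expect` values shown (`exit_code_0` and `no_output`) and assumes commands run through the system shell.

```python
import subprocess


def expectation_met(expect: str, exit_code: int, stdout: str) -> bool:
    """Evaluate one `expect` value against a finished command's exit code and stdout."""
    if expect == "exit_code_0":
        return exit_code == 0
    if expect == "no_output":
        return stdout.strip() == ""
    raise ValueError(f"unknown expectation: {expect!r}")


def run_check(command: str, expect: str) -> bool:
    """Run a shell command and grade it. Only run commands you trust."""
    done = subprocess.run(command, shell=True, capture_output=True, text=True)
    return expectation_met(expect, done.returncode, done.stdout)


def grade(checks: list[dict[str, str]]) -> bool:
    """A code-based eval passes only if every check passes."""
    return all(run_check(c["command"], c["expect"]) for c in checks)
```

Note that `grep` exits non-zero when it finds nothing, so a check such as `grep -r 'TODO' src/` has to be graded on output rather than on exit code, which is what `no_output` does here.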
67
+ ### Model-Based (LLM-as-Judge)
68
+
69
+ Graded by an LLM against a rubric — scored 1-5:
70
+
71
+ ```yaml
72
+ grader: model
73
+ rubric: |
74
+ Score the implementation on:
75
+ 1. Correctness — does it solve the stated problem?
76
+ 2. Completeness — are all edge cases handled?
77
+ 3. Code quality — is it readable and maintainable?
78
+ 4. Security — are there any vulnerabilities?
79
+ 5. Performance — any obvious inefficiencies?
80
+ threshold: 4.0
81
+ ```
82
+
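The rubric above gives five criteria scored 1-5 and a `threshold` of 4.0, but it does not say how the scores are combined. A minimal sketch, assuming the threshold applies to the mean of the per-criterion scores (`model_grade_passes` is a hypothetical name):

```python
from statistics import mean


def model_grade_passes(scores: dict[str, int], threshold: float = 4.0) -> bool:
    """Return True if the mean rubric score meets the threshold.

    Assumption: scores are aggregated by a simple mean. The skill text only
    states the 1-5 scale and the threshold value.
    """
    if not scores:
        raise ValueError("at least one rubric score is required")
    for criterion, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{criterion} score {score} is outside 1-5")
    return mean(scores.values()) >= threshold


print(model_grade_passes({
    "correctness": 5,
    "completeness": 4,
    "code_quality": 4,
    "security": 4,
    "performance": 3,
}))
```

A stricter policy, such as requiring every criterion to meet the threshold individually, would be an equally plausible reading.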
83
+ ### Human (Manual Review)
84
+
85
+ Flagged for human review — generates a checklist:
86
+
87
+ ```yaml
88
+ grader: human
89
+ checklist:
90
+ - "UI renders correctly on mobile"
91
+ - "Error messages are user-friendly"
92
+ - "Animation feels smooth (60fps)"
93
+ ```
94
+
95
+ ## Metrics
96
+
97
+ ### pass@k
98
+
99
+ At least 1 success in k attempts. Used for capability evals where some variance is expected.
100
+
101
+ ```
102
+ pass@3: Run the eval 3 times. Pass if at least 1 succeeds.
103
+ ```
104
+
105
+ ### pass^k
106
+
107
+ ALL k attempts must succeed. Used for regression evals where consistency is required.
108
+
109
+ ```
110
+ pass^3: Run the eval 3 times. Pass only if all 3 succeed.
111
+ ```
112
+
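The two metrics differ only in how the k attempt results are combined. A small sketch under the definitions given above (the function names are illustrative, and an empty result list is treated as a failure by assumption):

```python
from typing import Sequence


def pass_at_k(results: Sequence[bool]) -> bool:
    """pass@k: at least one of the k attempts succeeded."""
    return any(results)


def pass_hat_k(results: Sequence[bool]) -> bool:
    """pass^k: all k attempts succeeded (zero attempts counts as a failure)."""
    return len(results) > 0 and all(results)


attempts = [False, True, False]   # three runs of the same eval
print(pass_at_k(attempts))        # True: one run succeeded
print(pass_hat_k(attempts))       # False: not every run succeeded
```

This is the plain pass/fail reading used in the skill, not a statistical estimate of success probability over many samples.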
113
+ ## Storage
114
+
115
+ ### Eval Definition
116
+
117
+ Stored in `.claude/evals/[feature].md`:
118
+
119
+ ```markdown
120
+ ---
121
+ feature: user-authentication
122
+ type: capability
123
+ grader: code
124
+ created: 2026-03-25
125
+ pass_metric: pass@1
126
+ ---
127
+
128
+ ## Description
129
+ Verify the OAuth2 login flow works end-to-end.
130
+
131
+ ## Checks
132
+ - [ ] `npm test -- --grep "auth"` passes
133
+ - [ ] `curl -s localhost:3000/auth/google` returns 302
134
+ - [ ] `grep -r "hardcoded.*secret" src/` returns nothing
135
+
136
+ ## History
137
+ | Date | Result | Score | Notes |
138
+ |------|--------|-------|-------|
139
+ ```
140
+
141
+ ### Eval Results
142
+
143
+ Appended to `.claude/evals/[feature].log`:
144
+
145
+ ```
146
+ [2026-03-25T10:30:00Z] PASS — pass@1 (1/1 succeeded)
147
+ check_1: npm test (exit 0) ✓
148
+ check_2: curl auth redirect (302) ✓
149
+ check_3: no hardcoded secrets ✓
150
+ ```
151
+
152
+ ## Workflow Integration
153
+
154
+ ### Before Coding (define)
155
+
156
+ ```
157
+ /sk:eval define user-authentication
158
+ ```
159
+
160
+ Creates the eval definition with checks derived from the task requirements.
161
+
162
+ ### During Implementation (check)
163
+
164
+ ```
165
+ /sk:eval check user-authentication
166
+ ```
167
+
168
+ Runs all checks and reports pass/fail. Use during step 5 (Write Tests + Implement) to verify progress.
169
+
170
+ ### After Shipping (report)
171
+
172
+ ```
173
+ /sk:eval report
174
+ ```
175
+
176
+ Summary of all evals:
177
+
178
+ ```
179
+ === Eval Report ===
180
+
181
+ user-authentication PASS pass@1 (3 checks, 3 passed)
182
+ api-v2-endpoints PASS pass^3 (5 checks, 5 passed x3)
183
+ queue-reliability FAIL pass@3 (2 checks, 0/3 succeeded)
184
+
185
+ Overall: 2/3 passing (67%)
186
+
187
+ Action: queue-reliability needs investigation
188
+ ```