@kennethsolomon/shipkit 3.10.2 → 3.11.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (30)

package/README.md +92 -4
package/commands/sk/context-budget.md +5 -0
package/commands/sk/eval.md +5 -0
package/commands/sk/health.md +5 -0
package/commands/sk/help.md +32 -8
package/commands/sk/learn.md +5 -0
package/commands/sk/resume-session.md +5 -0
package/commands/sk/safety-guard.md +5 -0
package/commands/sk/save-session.md +5 -0
package/commands/sk/set-profile.md +8 -0
package/package.json +1 -1
package/skills/sk:brainstorming/SKILL.md +13 -0
package/skills/sk:context-budget/SKILL.md +126 -0
package/skills/sk:eval/SKILL.md +188 -0
package/skills/sk:health/SKILL.md +146 -0
package/skills/sk:learn/SKILL.md +138 -0
package/skills/sk:resume-session/SKILL.md +95 -0
package/skills/sk:safety-guard/SKILL.md +134 -0
package/skills/sk:save-session/SKILL.md +84 -0
package/skills/sk:setup-claude/SKILL.md +39 -2
package/skills/sk:setup-claude/templates/.claude/settings.json.template +110 -26
package/skills/sk:setup-claude/templates/CLAUDE.md.template +8 -1
package/skills/sk:setup-claude/templates/hooks/config-protection.sh +71 -0
package/skills/sk:setup-claude/templates/hooks/console-log-warning.sh +42 -0
package/skills/sk:setup-claude/templates/hooks/cost-tracker.sh +26 -0
package/skills/sk:setup-claude/templates/hooks/post-edit-format.sh +53 -0
package/skills/sk:setup-claude/templates/hooks/safety-guard.sh +72 -0
package/skills/sk:setup-claude/templates/hooks/suggest-compact.sh +35 -0
package/skills/sk:setup-optimizer/SKILL.md +59 -8
package/skills/sk:start/SKILL.md +25 -0

package/README.md CHANGED

@@ -48,6 +48,44 @@ That's it. `/sk:setup-claude` creates your project scaffolding: planning files,
 `/sk:start` is the recommended entry point — it classifies your task and routes you to the optimal flow automatically. You can also jump directly to `/sk:brainstorm`, `/sk:debug`, or any other flow entry point.
+### Updating ShipKit
+```bash
+# Update the package
+npm install -g @kennethsolomon/shipkit && shipkit
+# Then in each project, update CLAUDE.md + deploy new hooks:
+/sk:setup-optimizer
+```
+`shipkit` re-installs all skills and commands globally. `/sk:setup-optimizer` updates each project's CLAUDE.md with new commands and deploys any missing hooks.
+---
+## Lifecycle Hooks
+`/sk:setup-claude` installs lifecycle hooks that automate common tasks. Core hooks are always installed; enhanced hooks are opt-in.
+**Core hooks (always installed):**
+| Hook | Event | What it does |
+|------|-------|-------------|
+| `session-start` | SessionStart | Loads branch, recent commits, tech debt, code health |
+| `session-stop` | Stop | Logs session accomplishments to `tasks/progress.md` |
+| `pre-compact` | PreCompact | Saves git state before context compression |
+| `validate-commit` | PreToolUse (git commit) | Validates conventional commit format, detects secrets |
+| `validate-push` | PreToolUse (git push) | Warns before pushing to protected branches |
+| `log-agent` | SubagentStart | Logs sub-agent invocations to `tasks/agent-audit.log` |
+**Enhanced hooks (opt-in via `/sk:setup-claude` or `/sk:setup-optimizer`):**
+| Hook | Event | What it does |
+|------|-------|-------------|
+| `config-protection` | PreToolUse (Edit/Write) | Blocks modifications to linter/formatter configs |
+| `post-edit-format` | PostToolUse (Edit) | Auto-formats with Biome/Prettier/Pint/gofmt after edits |
+| `console-log-warning` | Stop | Warns about `console.log`, `dd()`, `var_dump()` in modified files |
+| `suggest-compact` | PreToolUse (Edit/Write) | Suggests `/compact` after 50+ tool calls |
+| `cost-tracker` | Stop | Logs session metadata to `.claude/sessions/cost-log.jsonl` |
+| `safety-guard` | PreToolUse (Bash/Edit/Write) | Enforces `/sk:safety-guard` freeze/careful mode |
 ---
 ## Pick Your Flow
@@ -166,15 +204,56 @@ Pre-existing issues are logged to `tasks/tech-debt.md` — not fixed inline.
 Use these anytime — they're not part of any workflow.
+### Intelligence
+| Command | Usage | What it does |
+|---------|-------|-------------|
+| `/sk:learn` | `/sk:learn` | Extract reusable patterns from the session with confidence scoring (0.3-0.9) |
+| `/sk:learn` | `/sk:learn --list` | Show all learned patterns |
+| `/sk:context-budget` | `/sk:context-budget` | Audit token consumption across skills, agents, MCP tools, CLAUDE.md |
+| `/sk:context-budget` | `/sk:context-budget --verbose` | Per-file token breakdown |
+| `/sk:health` | `/sk:health` | Scorecard across 7 categories (0-70): tools, context, gates, memory, evals, security, cost |
+| `/sk:eval` | `/sk:eval define auth` | Define eval criteria before coding |
+| `/sk:eval` | `/sk:eval check auth` | Run evals during implementation |
+| `/sk:eval` | `/sk:eval report` | Summary of all eval results with pass@k metrics |
+### Session Management
+| Command | Usage | What it does |
+|---------|-------|-------------|
+| `/sk:save-session` | `/sk:save-session` | Save branch, task, progress, open questions to `.claude/sessions/` |
+| `/sk:save-session` | `/sk:save-session --name "auth-flow"` | Save with a custom name |
+| `/sk:resume-session` | `/sk:resume-session` | List saved sessions and pick one to restore |
+| `/sk:resume-session` | `/sk:resume-session --latest` | Auto-pick most recent session |
+| `/sk:context` | `/sk:context` | Load all project context (automatic via hooks on session start) |
+### Safety
+| Command | Usage | What it does |
+|---------|-------|-------------|
+| `/sk:safety-guard` | `/sk:safety-guard careful` | Block destructive commands (rm -rf, force push, etc.) |
+| `/sk:safety-guard` | `/sk:safety-guard freeze --dir src/` | Lock edits to `src/` only |
+| `/sk:safety-guard` | `/sk:safety-guard guard --dir src/` | Both careful + freeze combined |
+| `/sk:safety-guard` | `/sk:safety-guard off` | Disable all guards |
+| `/sk:safety-guard` | `/sk:safety-guard status` | Show current mode + blocked action count |
+### Code Quality
 | Command | When to use |
 |---------|------------|
 | `/sk:scope-check` | Mid-implementation — detect scope creep (On Track / Minor / Significant / Out of Control) |
 | `/sk:retro` | After shipping — analyze velocity, blockers, patterns, generate action items |
+| `/sk:seo-audit` | Web projects — SEO audit with source + dev server scanning |
+### Documentation & Setup
+| Command | When to use |
+|---------|------------|
 | `/sk:reverse-doc` | Inherited codebase — generate architecture/design docs from existing code |
+| `/sk:setup-optimizer` | Maintenance — diagnose, update workflow, deploy hooks, enrich CLAUDE.md |
+| `/sk:mvp` | New idea — generate a complete MVP app from a single prompt |
 | `/sk:status` | Quick view of workflow and task status |
 | `/sk:dashboard` | Visual Kanban board across all git worktrees |
-| `/sk:mvp` | Generate a complete MVP app from a single idea prompt |
-| `/sk:seo-audit` | SEO audit for web projects |
 ---
@@ -193,7 +272,7 @@ Use these anytime — they're not part of any workflow.
 ## All Commands
 <details>
-<summary><strong>38 commands</strong> — click to expand</summary>
+<summary><strong>51 commands</strong> — click to expand</summary>
 | Command | Purpose |
 |---------|---------|
@@ -205,33 +284,42 @@ Use these anytime — they're not part of any workflow.
 | `/sk:change` | Handle mid-workflow requirement changes |
 | `/sk:config` | View/edit project config |
 | `/sk:context` | Load project context (automatic via hooks) |
+| `/sk:context-budget` | Audit context window token consumption |
 | `/sk:dashboard` | Live Kanban board — sk:dashboard across worktrees |
 | `/sk:debug` | Structured bug investigation |
 | `/sk:e2e` | E2E Tests — behavioral verification |
+| `/sk:eval` | Define, run, and report evals for agent reliability |
 | `/sk:execute-plan` | Execute plan checkboxes in batches |
 | `/sk:fast-track` | Small changes — skip planning, keep gates |
 | `/sk:features` | Sync feature specs with codebase |
 | `/sk:finish-feature` | Changelog + PR |
 | `/sk:frontend-design` | UI mockup + optional Pencil visual design |
 | `/sk:gates` | All quality gates in parallel batches |
+| `/sk:health` | Harness self-audit scorecard |
 | `/sk:help` | Show all commands |
 | `/sk:hotfix` | Emergency fix workflow |
 | `/sk:laravel-init` | Configure existing Laravel project |
 | `/sk:laravel-new` | Scaffold fresh Laravel app |
+| `/sk:learn` | Extract reusable patterns from sessions |
 | `/sk:lint` | Auto-detect and run all linters |
 | `/sk:mvp` | Generate MVP app from a prompt |
 | `/sk:perf` | Performance audit |
 | `/sk:plan` | Create/refresh planning files |
 | `/sk:release` | Version bump + tag (`--android` / `--ios` for store audit) |
+| `/sk:resume-session` | Resume a previously saved session |
 | `/sk:retro` | Post-ship retrospective |
 | `/sk:reverse-doc` | Generate docs from existing code |
 | `/sk:review` | 7-dimension code review |
+| `/sk:safety-guard` | Protect against destructive ops |
+| `/sk:save-session` | Save session state for continuity |
 | `/sk:schema-migrate` | Database schema change analysis |
 | `/sk:scope-check` | Detect scope creep mid-implementation |
 | `/sk:security-check` | OWASP security audit |
-| `/sk:seo-audit` | sk:seo-audit for web projects |
+| `/sk:seo-audit` | SEO audit for web projects |
 | `/sk:set-profile` | Switch model routing profile |
 | `/sk:setup-claude` | Bootstrap project scaffolding |
+| `/sk:setup-optimizer` | Diagnose + update workflow + deploy hooks + enrich CLAUDE.md |
+| `/sk:skill-creator` | Create or improve skills |
 | `/sk:smart-commit` | Conventional commit with approval |
 | `/sk:start` | Smart entry point — classifies task, routes to optimal flow |
 | `/sk:status` | Show workflow + task status |

package/commands/sk/context-budget.md ADDED

@@ -0,0 +1,5 @@
+---
+description: "Audit context window token consumption and find optimization opportunities."
+---
+Use the `sk:context-budget` skill to inventory all components consuming context tokens (agents, skills, rules, MCP tools, CLAUDE.md), classify usage frequency, detect bloat, and recommend top 3 optimizations with estimated token savings.

package/commands/sk/eval.md ADDED

@@ -0,0 +1,5 @@
+---
+description: "Define, run, and report on evaluations for agent reliability and code quality."
+---
+Use the `sk:eval` skill to define eval criteria before coding (`define`), verify during implementation (`check`), and summarize results after shipping (`report`). Supports code-based, model-based, and human graders with pass@k and pass^k metrics.

package/commands/sk/health.md ADDED

@@ -0,0 +1,5 @@
+---
+description: "Run harness self-audit and produce a health scorecard."
+---
+Use the `sk:health` skill to score your ShipKit setup across 7 categories (Tool Coverage, Context Efficiency, Quality Gates, Memory Persistence, Eval Coverage, Security Guardrails, Cost Efficiency). Produces a 0-70 scorecard with concrete findings and top 3 actions.
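
As a rough illustration of the scorecard described above, the sketch below tallies seven category scores into a 0-70 total and picks the three weakest categories as candidates for the "top 3 actions". It assumes each category is scored 0-10 (the description gives only the 0-70 total), and `total_score` and `weakest` are hypothetical names, not part of ShipKit.

```python
# Hypothetical sketch of a 7-category health scorecard.
# Assumption: each category is scored 0-10, giving a 0-70 total.
CATEGORIES = [
    "Tool Coverage",
    "Context Efficiency",
    "Quality Gates",
    "Memory Persistence",
    "Eval Coverage",
    "Security Guardrails",
    "Cost Efficiency",
]


def total_score(scores: dict[str, int]) -> int:
    """Validate the per-category scores and return their sum (0-70)."""
    missing = [name for name in CATEGORIES if name not in scores]
    if missing:
        raise ValueError(f"missing categories: {missing}")
    for name in CATEGORIES:
        if not 0 <= scores[name] <= 10:
            raise ValueError(f"{name} must be in 0..10, got {scores[name]}")
    return sum(scores[name] for name in CATEGORIES)


def weakest(scores: dict[str, int], n: int = 3) -> list[str]:
    """Return the n lowest-scoring categories, lowest first."""
    return sorted(CATEGORIES, key=lambda name: scores[name])[:n]
```

How the real skill weights categories or chooses its actions is not shown in this diff, so treat the ranking rule as a placeholder.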

package/commands/sk/help.md CHANGED

@@ -65,35 +65,55 @@ Requirements change mid-workflow? Run `/sk:change` — it classifies the scope a
 |---------|-------------|
 | `/sk:accessibility` | WCAG 2.1 AA audit on frontend code |
 | `/sk:api-design` | Design REST/GraphQL contracts before implementation |
-| `/sk:brainstorm` | Explore requirements, no code |
+| `/sk:autopilot` | Hands-free workflow — auto-skip, auto-advance, auto-commit |
+| `/sk:brainstorm` | Explore requirements and design (includes search-first research) |
 | `/sk:branch` | Create branch from current task |
-| `/sk:change` | Handle mid-workflow requirement change — re-enter at the right step |
+| `/sk:change` | Handle mid-workflow requirement change |
+| `/sk:config` | View and edit project config |
+| `/sk:context` | Load project context (automatic via hooks) |
+| `/sk:context-budget` | Audit context window token consumption and find savings |
+| `/sk:dashboard` | Read-only workflow Kanban board |
 | `/sk:debug` | Structured bug investigation |
+| `/sk:e2e` | E2E behavioral verification |
+| `/sk:eval` | Define, run, and report evals for agent reliability |
 | `/sk:execute-plan` | Implement plan in batches |
+| `/sk:fast-track` | Small changes — skip planning, keep gates |
 | `/sk:features` | Sync docs/sk:features/ specs with codebase |
 | `/sk:finish-feature` | Changelog + PR creation |
-| `/sk:frontend-design` | UI mockup + design spec before implementation. Add `--pencil` to also generate a Pencil visual mockup (requires Pencil app + MCP) |
+| `/sk:frontend-design` | UI mockup + optional Pencil visual mockup |
+| `/sk:gates` | All quality gates in parallel batches |
+| `/sk:health` | Harness self-audit scorecard (7 categories, 0-70) |
 | `/sk:hotfix` | Emergency fix workflow (skips design/TDD) |
 | `/sk:laravel-init` | Configure existing Laravel project |
 | `/sk:laravel-new` | Scaffold new Laravel project |
+| `/sk:learn` | Extract reusable patterns from sessions |
 | `/sk:lint` | Auto-detect and run all linters |
+| `/sk:mvp` | Generate MVP app from a prompt |
 | `/sk:perf` | Performance audit |
 | `/sk:plan` | Create/refresh task planning files |
-| `/sk:release` | Automate releases: bump version, update CHANGELOG, create tag, push to GitHub. Use --android and/or --ios flags for App Store / Play Store readiness audit |
-| `/sk:review` | Blast-radius-aware self-review of branch changes |
+| `/sk:release` | Version bump + tag (`--android` / `--ios` for store audit) |
+| `/sk:resume-session` | Resume a previously saved session |
+| `/sk:retro` | Post-ship retrospective |
+| `/sk:reverse-doc` | Generate docs from existing code |
+| `/sk:review` | 7-dimension self-review of branch changes |
+| `/sk:safety-guard` | Protect against destructive ops (careful/freeze/guard) |
+| `/sk:save-session` | Save session state for cross-session continuity |
 | `/sk:schema-migrate` | Multi-ORM schema change analysis |
+| `/sk:scope-check` | Detect scope creep mid-implementation |
 | `/sk:security-check` | OWASP security audit |
+| `/sk:seo-audit` | SEO audit for web projects |
+| `/sk:set-profile` | Switch model routing profile |
 | `/sk:setup-claude` | Bootstrap project scaffolding |
-| `/sk:setup-optimizer` | Enrich CLAUDE.md by scanning codebase |
+| `/sk:setup-optimizer` | Diagnose + update workflow + enrich CLAUDE.md |
 | `/sk:skill-creator` | Create or improve skills |
 | `/sk:smart-commit` | Conventional commit with approval |
+| `/sk:start` | Smart entry point — classifies task, routes to optimal flow |
 | `/sk:status` | Show workflow and task status |
+| `/sk:team` | Parallel domain agents for full-stack tasks |
 | `/sk:test` | Auto-detect and verify all tests pass |
 | `/sk:update-task` | Mark task done, log completion |
 | `/sk:write-plan` | Write plan to `tasks/todo.md` |
 | `/sk:write-tests` | TDD: write failing tests first |
-| `/sk:config` | View and edit project config |
-| `/sk:set-profile` | Switch model routing profile |
 ---
@@ -113,9 +133,13 @@ ShipKit routes each skill to the right model automatically. Set once per project
 | brainstorm, write-plan, debug, execute-plan, review | opus | opus | sonnet | sonnet |
 | write-tests, frontend-design, api-design, security-check | opus | sonnet | sonnet | sonnet |
 | change | opus | sonnet | sonnet | sonnet |
+| autopilot, team | opus | opus | sonnet | sonnet |
 | perf, schema-migrate, accessibility | opus | sonnet | sonnet | haiku |
+| eval | sonnet | sonnet | sonnet | haiku |
 | lint, test | sonnet | sonnet | haiku | haiku |
 | smart-commit, branch, update-task | haiku | haiku | haiku | haiku |
+| start, learn, context-budget, health | haiku | haiku | haiku | haiku |
+| save-session, resume-session, safety-guard | haiku | haiku | haiku | haiku |
 `opus` = inherit (uses your current session model).
 Config lives in `.shipkit/config.json` — per project, gitignored by default.

package/commands/sk/learn.md ADDED

@@ -0,0 +1,5 @@
+---
+description: "Extract reusable patterns from the current session into learned instincts."
+---
+Use the `sk:learn` skill to analyze the current session for extractable patterns (error resolutions, debugging techniques, workarounds, project conventions). Patterns are saved with confidence scoring and can be promoted from project-scoped to global.

package/commands/sk/resume-session.md ADDED

@@ -0,0 +1,5 @@
+---
+description: "Resume a previously saved session with full context restoration."
+---
+Use the `sk:resume-session` skill to list available saved sessions from `.claude/sessions/`, select one, and restore its context (branch, task state, progress, open questions, next steps) into the current conversation.

package/commands/sk/safety-guard.md ADDED

@@ -0,0 +1,5 @@
+---
+description: "Protect against destructive operations with careful, freeze, and guard modes."
+---
+Use the `sk:safety-guard` skill to activate protection modes: `careful` (block destructive commands), `freeze --dir <path>` (lock edits to a directory), `guard --dir <path>` (both), `off` (disable), or `status` (show current mode).

package/commands/sk/save-session.md ADDED

@@ -0,0 +1,5 @@
+---
+description: "Save current session state for cross-session continuity."
+---
+Use the `sk:save-session` skill to persist the current session state (branch, task, progress, findings, open questions) to `.claude/sessions/` for resumption in a future conversation. Essential for EPIC-scope multi-session workflows.

package/commands/sk/set-profile.md CHANGED

@@ -30,6 +30,10 @@ Valid profiles: `full-sail` · `quality` · `balanced` · `budget`
 | smart-commit, branch, update-task | haiku | haiku | haiku | haiku |
 | autopilot, team | opus | opus | sonnet | sonnet |
 | start | haiku | haiku | haiku | haiku |
+| learn, context-budget, health | haiku | haiku | haiku | haiku |
+| save-session, resume-session | haiku | haiku | haiku | haiku |
+| safety-guard | haiku | haiku | haiku | haiku |
+| eval | sonnet | sonnet | sonnet | haiku |
 Note: `opus` = inherit (uses the current session model). Switch to Opus 4.5 in your session to get the full benefit.
@@ -70,6 +74,10 @@ Model assignments for this project:
   smart-commit, branch, update-task → haiku
   autopilot, team → <model>
   start → haiku
+  learn, context-budget, health → haiku
+  save-session, resume-session → haiku
+  safety-guard → haiku
+  eval → <model>
 Run /sk:config to see all settings or make further changes.
 ```
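
The routing rows added in this file can be read as a lookup from (skill, profile) to model. The sketch below encodes only the rows this diff adds. The `ROUTING` table and `resolve_model` are hypothetical names for illustration, and the sketch does not read `.shipkit/config.json`, whose schema is not shown here.

```python
# Hypothetical lookup over the routing rows added in 3.11.0.
# Column order follows the "Valid profiles" line: full-sail, quality, balanced, budget.
PROFILES = ("full-sail", "quality", "balanced", "budget")

ROUTING = {
    "learn": ("haiku", "haiku", "haiku", "haiku"),
    "context-budget": ("haiku", "haiku", "haiku", "haiku"),
    "health": ("haiku", "haiku", "haiku", "haiku"),
    "save-session": ("haiku", "haiku", "haiku", "haiku"),
    "resume-session": ("haiku", "haiku", "haiku", "haiku"),
    "safety-guard": ("haiku", "haiku", "haiku", "haiku"),
    "eval": ("sonnet", "sonnet", "sonnet", "haiku"),
}


def resolve_model(skill: str, profile: str) -> str:
    """Return the model name for a skill under the given profile."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile}")
    if skill not in ROUTING:
        raise KeyError(f"no routing row for skill: {skill}")
    return ROUTING[skill][PROFILES.index(profile)]
```

Of the new skills, only `eval` varies by profile, dropping from sonnet to haiku under `budget`.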

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@kennethsolomon/shipkit",
-  "version": "3.10.2",
+  "version": "3.11.0",
   "description": "A structured workflow toolkit for Claude Code.",
   "keywords": [
     "claude",

package/skills/sk:brainstorming/SKILL.md CHANGED

@@ -74,6 +74,19 @@ digraph brainstorming {
 - Only one question per message - if a topic needs more exploration, break it into multiple questions
 - Focus on understanding: purpose, constraints, success criteria
+**Search-First Research (before proposing approaches):**
+Before proposing custom solutions, check if the problem is already solved:
+1. **Grep codebase** — does similar functionality already exist in this repo?
+2. **Check package registries** — is there a well-maintained package for this? (npm, PyPI, Packagist, crates.io)
+3. **Check existing skills** — does a ShipKit skill or MCP server already handle this?
+Decision matrix:
+- **Adopt** — existing solution covers 90%+ of requirements → use it directly
+- **Extend** — existing solution covers 60-90% → extend or wrap it
+- **Build custom** — nothing suitable exists → build from scratch (informed by what was found)
+If a suitable package or existing solution is found, include it as one of the approaches.
 **Exploring approaches:**
 - Propose 2-3 different approaches with trade-offs
 - Present options conversationally with your recommendation and reasoning
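
The Adopt / Extend / Build decision matrix added in this hunk can be sketched as a threshold function. The `decide` function and its fractional `coverage` argument are hypothetical. The skill states no rule for coverage below 60%, so mapping that range to "build custom" is an assumption.

```python
# Hypothetical sketch of the search-first decision matrix.
def decide(coverage: float) -> str:
    """Map requirement coverage (0.0-1.0) of an existing solution to a decision."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0.0 and 1.0")
    if coverage >= 0.90:
        return "adopt"         # covers 90%+ of requirements: use it directly
    if coverage >= 0.60:
        return "extend"        # covers 60-90%: extend or wrap it
    return "build custom"      # assumption: below 60% counts as nothing suitable
```

For example, `decide(0.75)` yields `"extend"`. Estimating the coverage figure is still a judgment call made during brainstorming.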

package/skills/sk:context-budget/SKILL.md ADDED

@@ -0,0 +1,126 @@
+---
+name: sk:context-budget
+description: "Audit context window token consumption and find optimization opportunities."
+---
+# /sk:context-budget — Token Consumption Audit
+Audits all components that consume context window tokens — agents, skills, rules, MCP tools, CLAUDE.md — and identifies optimization opportunities.
+## Usage
+```
+/sk:context-budget              # standard audit
+/sk:context-budget --verbose    # per-file breakdown
+```
+## Model Routing
+Read `.shipkit/config.json` from the project root if it exists.
+| Profile | Model |
+|---------|-------|
+| `full-sail` | haiku |
+| `quality` | haiku |
+| `balanced` | haiku |
+| `budget` | haiku |
+> Counting and classification is lightweight — haiku is sufficient.
+## How It Works
+### Phase 1: Inventory
+Scan and count token estimates for every loaded component:
+| Component | Location | Token Estimation |
+|-----------|----------|------------------|
+| CLAUDE.md | `CLAUDE.md` | `words * 1.3` |
+| Global CLAUDE.md | `~/.claude/CLAUDE.md` | `words * 1.3` |
+| Skills | `skills/*/SKILL.md` | `words * 1.3` |
+| Commands | `commands/**/*.md` | `words * 1.3` |
+| Agents | `.claude/agents/*.md` | `words * 1.3` |
+| Rules | `.claude/rules/*.md` | `words * 1.3` |
+| MCP tool schemas | count tools * ~500 tokens each | `tool_count * 500` |
+| Hooks | `.claude/hooks/*.sh` (minimal overhead) | `words * 1.3` |
+**Token estimation formula:**
+- Prose/markdown: `word_count * 1.3`
+- Code blocks: `char_count / 4`
+- MCP tool schemas: ~500 tokens per tool definition
+### Phase 2: Classify Usage Frequency
+For each component, classify how often it's actually needed:
+| Classification | Meaning | Action |
+|---------------|---------|--------|
+| **Always** | Loaded every session, always relevant | Keep as-is |
+| **Sometimes** | Relevant to specific task types | Consider conditional loading |
+| **Rarely** | Edge case, rarely triggered | Candidate for removal/extraction |
+Classification heuristics:
+- Skills used in the workflow (brainstorm, write-tests, gates, etc.) → Always
+- Skills triggered by keywords (frontend-design, api-design) → Sometimes
+- Niche skills (seo-audit, schema-migrate) → Rarely
+- MCP tools: if >20 tools on one server → flag as over-subscribed
+### Phase 3: Detect Issues
+Flag these common problems:
+1. **Bloated agents** — agent descriptions >200 lines
+2. **Bloated skills** — skill definitions >400 lines
+3. **Bloated rules** — rule files >100 lines
+4. **MCP over-subscription** — servers with >20 tools (each costs ~500 tokens)
+5. **CLI-wrapping MCPs** — MCP servers that just wrap CLI tools (overhead > benefit)
+6. **Duplicate content** — same instructions in CLAUDE.md AND skill files
+7. **CLAUDE.md bloat** — CLAUDE.md >200 lines (the target)
+8. **Unused components** — skills/agents never referenced in workflow
+### Phase 4: Report
+Output a structured report:
+```
+=== Context Budget Audit ===
+Component Breakdown:
+  CLAUDE.md              ~1,200 tokens
+  Global CLAUDE.md         ~800 tokens
+  Skills (42 files)     ~18,000 tokens
+  Commands (35 files)    ~8,000 tokens
+  Agents (8 files)       ~3,200 tokens
+  Rules (5 files)        ~1,500 tokens
+  MCP tools (3 servers)  ~15,000 tokens (30 tools)
+  ─────────────────────────────────
+  Total overhead:        ~47,700 tokens
+Context window:          200,000 tokens
+Overhead:                 47,700 tokens (23.8%)
+Available for work:      152,300 tokens
+Issues Found:
+  [HIGH]   MCP server "playwright" has 28 tools (~14,000 tokens)
+  [MEDIUM] Skill sk:frontend-design is 380 lines (~500 tokens)
+  [LOW]    Agent perf-auditor has 220 lines (~290 tokens)
+Top 3 Optimizations:
+  1. Remove unused MCP tools from playwright (save ~7,000 tokens)
+  2. Consolidate duplicate workflow instructions (save ~1,200 tokens)
+  3. Trim agent descriptions to <150 lines (save ~400 tokens)
+  Potential savings: ~8,600 tokens (18% reduction)
+```
+### --verbose Mode
+Adds per-file token breakdown:
+```
+Skills Breakdown:
+  sk:autopilot/SKILL.md        ~620 tokens
+  sk:brainstorming/SKILL.md    ~480 tokens
+  sk:gates/SKILL.md            ~440 tokens
+  ...
+```

package/skills/sk:eval/SKILL.md ADDED

@@ -0,0 +1,188 @@
+---
+name: sk:eval
+description: "Define, run, and report on evaluations for agent reliability and code quality."
+---
+# /sk:eval — Eval-Driven Development
+A formal evaluation framework for measuring agent reliability and code quality. Define evals before coding, check during implementation, and report after shipping.
+## Usage
+```
+/sk:eval define <feature>    # create eval definition
+/sk:eval check <feature>     # run evals against current state
+/sk:eval report              # summary of all eval results
+/sk:eval list                # show all defined evals
+```
+## Model Routing
+Read `.shipkit/config.json` from the project root if it exists.
+| Profile | Model |
+|---------|-------|
+| `full-sail` | sonnet |
+| `quality` | sonnet |
+| `balanced` | sonnet |
+| `budget` | haiku |
+> Eval analysis needs reasoning for model-based graders — sonnet for balanced+.
+## Eval Types
+### Capability Evals
+Test whether Claude can accomplish something new:
+- "Can it generate a valid migration from a schema description?"
+- "Can it write a test that covers all edge cases?"
+- "Can it refactor without changing behavior?"
+### Regression Evals
+Ensure changes don't break existing behavior:
+- "Does the login flow still work after auth refactor?"
+- "Do all API endpoints still return correct status codes?"
+- "Are all existing tests still passing?"
+## Grader Types
+### Code-Based (Deterministic)
+Graded by running commands — pass/fail:
+```yaml
+grader: code
+checks:
+  - command: "npm test"
+    expect: exit_code_0
+  - command: "grep -r 'TODO' src/"
+    expect: no_output
+  - command: "npx tsc --noEmit"
+    expect: exit_code_0
+```
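
A code-based grader of this shape could be run with a small harness like the one below. This is a minimal sketch, not ShipKit's implementation. The meanings of `exit_code_0` and `no_output` are inferred from their names, and `run_check` and `grade` are hypothetical helpers.

```python
import subprocess


def run_check(command: str, expect: str) -> bool:
    """Run one shell command and compare the outcome with the expectation."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    if expect == "exit_code_0":
        # Assumed meaning: the command exits with status 0.
        return result.returncode == 0
    if expect == "no_output":
        # Assumed meaning: nothing on stdout; the exit code is ignored,
        # since grep exits 1 when it finds no matches.
        return result.stdout.strip() == ""
    raise ValueError(f"unknown expectation: {expect}")


def grade(checks: list[dict]) -> bool:
    """A deterministic grader passes only if every check passes."""
    return all(run_check(c["command"], c["expect"]) for c in checks)
```

Passing the command through a shell mirrors how the YAML writes commands as plain strings. Run only checks from a trusted eval definition, since the strings are executed as written.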
+### Model-Based (LLM-as-Judge)
+Graded by an LLM against a rubric — scored 1-5:
+```yaml
+grader: model
+rubric: |
+  Score the implementation on:
+  1. Correctness — does it solve the stated problem?
+  2. Completeness — are all edge cases handled?
+  3. Code quality — is it readable and maintainable?
+  4. Security — are there any vulnerabilities?
+  5. Performance — any obvious inefficiencies?
+threshold: 4.0
+```
+### Human (Manual Review)
+Flagged for human review — generates a checklist:
+```yaml
+grader: human
+checklist:
+  - "UI renders correctly on mobile"
+  - "Error messages are user-friendly"
+  - "Animation feels smooth (60fps)"
+```
+## Metrics
+### pass@k
+At least 1 success in k attempts. Used for capability evals where some variance is expected.
+```
+pass@3: Run the eval 3 times. Pass if at least 1 succeeds.
+```
+### pass^k
+ALL k attempts must succeed. Used for regression evals where consistency is required.
+```
+pass^3: Run the eval 3 times. Pass only if all 3 succeed.
+```
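
The two metrics defined above reduce to "any" and "all" over the results of k attempts. The sketch below follows the definitions as written in this skill (an empirical pass/fail over k runs, not a probabilistic estimate). `pass_at_k` and `pass_hat_k` are hypothetical names.

```python
# Hypothetical sketch of the pass@k and pass^k metrics as defined in this skill.
def pass_at_k(results: list[bool]) -> bool:
    """pass@k: at least one of the k attempts succeeded."""
    if not results:
        raise ValueError("need at least one attempt")
    return any(results)


def pass_hat_k(results: list[bool]) -> bool:
    """pass^k: all k attempts succeeded."""
    if not results:
        raise ValueError("need at least one attempt")
    return all(results)
```

For three attempts with one success, `pass_at_k([False, True, False])` passes while `pass_hat_k` on the same list fails. That asymmetry is why the skill suggests pass@k for capability evals and pass^k for regression evals.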
+## Storage
+### Eval Definition
+Stored in `.claude/evals/[feature].md`:
+```markdown
+---
+feature: user-authentication
+type: capability
+grader: code
+created: 2026-03-25
+pass_metric: pass@1
+---
+## Description
+Verify the OAuth2 login flow works end-to-end.
+## Checks
+- [ ] `npm test -- --grep "auth"` passes
+- [ ] `curl -s localhost:3000/auth/google` returns 302
+- [ ] `grep -r "hardcoded.*secret" src/` returns nothing
+## History
+| Date | Result | Score | Notes |
+|------|--------|-------|-------|
+```
+### Eval Results
+Appended to `.claude/evals/[feature].log`:
+```
+[2026-03-25T10:30:00Z] PASS — pass@1 (1/1 succeeded)
+  check_1: npm test (exit 0) ✓
+  check_2: curl auth redirect (302) ✓
+  check_3: no hardcoded secrets ✓
+```
+## Workflow Integration
+### Before Coding (define)
+```
+/sk:eval define user-authentication
+```
+Creates the eval definition with checks derived from the task requirements.
+### During Implementation (check)
+```
+/sk:eval check user-authentication
+```
+Runs all checks and reports pass/fail. Use during step 5 (Write Tests + Implement) to verify progress.
+### After Shipping (report)
+```
+/sk:eval report
+```
+Summary of all evals:
+```
+=== Eval Report ===
+  user-authentication    PASS  pass@1  (3 checks, 3 passed)
+  api-v2-endpoints       PASS  pass^3  (5 checks, 5 passed x3)
+  queue-reliability      FAIL  pass@3  (2 checks, 0/3 succeeded)
+  Overall: 2/3 passing (67%)
+  Action: queue-reliability needs investigation
+```