@kennethsolomon/shipkit 3.10.1 → 3.11.0

Files changed (57)
  1. package/README.md +121 -49
  2. package/commands/sk/autopilot.md +2 -2
  3. package/commands/sk/context-budget.md +5 -0
  4. package/commands/sk/eval.md +5 -0
  5. package/commands/sk/health.md +5 -0
  6. package/commands/sk/help.md +32 -8
  7. package/commands/sk/learn.md +5 -0
  8. package/commands/sk/resume-session.md +5 -0
  9. package/commands/sk/safety-guard.md +5 -0
  10. package/commands/sk/save-session.md +5 -0
  11. package/commands/sk/security-check.md +2 -2
  12. package/commands/sk/set-profile.md +8 -0
  13. package/commands/sk/status.md +4 -9
  14. package/package.json +1 -1
  15. package/skills/sk:accessibility/SKILL.md +10 -1
  16. package/skills/sk:autopilot/SKILL.md +26 -45
  17. package/skills/sk:brainstorming/SKILL.md +13 -0
  18. package/skills/sk:context/SKILL.md +11 -15
  19. package/skills/sk:context-budget/SKILL.md +126 -0
  20. package/skills/sk:dashboard/SKILL.md +3 -4
  21. package/skills/sk:dashboard/server.js +0 -65
  22. package/skills/sk:e2e/SKILL.md +3 -3
  23. package/skills/sk:eval/SKILL.md +188 -0
  24. package/skills/sk:fast-track/SKILL.md +0 -9
  25. package/skills/sk:frontend-design/SKILL.md +232 -0
  26. package/skills/sk:gates/SKILL.md +2 -3
  27. package/skills/sk:health/SKILL.md +146 -0
  28. package/skills/sk:learn/SKILL.md +138 -0
  29. package/skills/sk:lint/SKILL.md +3 -3
  30. package/skills/sk:perf/SKILL.md +3 -3
  31. package/skills/sk:resume-session/SKILL.md +95 -0
  32. package/skills/sk:retro/SKILL.md +1 -2
  33. package/skills/sk:review/SKILL.md +2 -2
  34. package/skills/sk:safety-guard/SKILL.md +134 -0
  35. package/skills/sk:save-session/SKILL.md +84 -0
  36. package/skills/sk:setup-claude/SKILL.md +40 -4
  37. package/skills/sk:setup-claude/scripts/__pycache__/apply_setup_claude.cpython-314.pyc +0 -0
  38. package/skills/sk:setup-claude/scripts/apply_setup_claude.py +0 -1
  39. package/skills/sk:setup-claude/templates/.claude/settings.json.template +110 -26
  40. package/skills/sk:setup-claude/templates/.claude/statusline.sh +1 -15
  41. package/skills/sk:setup-claude/templates/CLAUDE.md.template +69 -138
  42. package/skills/sk:setup-claude/templates/commands/brainstorm.md.template +2 -13
  43. package/skills/sk:setup-claude/templates/hooks/config-protection.sh +71 -0
  44. package/skills/sk:setup-claude/templates/hooks/console-log-warning.sh +42 -0
  45. package/skills/sk:setup-claude/templates/hooks/cost-tracker.sh +26 -0
  46. package/skills/sk:setup-claude/templates/hooks/post-edit-format.sh +53 -0
  47. package/skills/sk:setup-claude/templates/hooks/pre-compact.sh +1 -12
  48. package/skills/sk:setup-claude/templates/hooks/safety-guard.sh +72 -0
  49. package/skills/sk:setup-claude/templates/hooks/session-start.sh +0 -11
  50. package/skills/sk:setup-claude/templates/hooks/session-stop.sh +0 -7
  51. package/skills/sk:setup-claude/templates/hooks/suggest-compact.sh +35 -0
  52. package/skills/sk:setup-claude/tests/__pycache__/test_apply_setup_claude.cpython-314.pyc +0 -0
  53. package/skills/sk:setup-claude/tests/test_apply_setup_claude.py +2 -33
  54. package/skills/sk:setup-optimizer/SKILL.md +68 -15
  55. package/skills/sk:start/SKILL.md +34 -11
  56. package/skills/sk:test/SKILL.md +3 -3
  57. package/skills/sk:setup-claude/templates/tasks/workflow-status.md.template +0 -28
@@ -1,13 +1,13 @@
1
1
  ---
2
2
  name: sk:autopilot
3
- description: Hands-free workflow — runs all 21 steps with auto-skip, auto-advance, auto-commit. Stops only for direction approval, 3-strike failures, and PR push.
3
+ description: Hands-free workflow — runs all 8 steps with auto-skip, auto-advance, auto-commit. Stops only for direction approval, 3-strike failures, and PR push.
4
4
  user_invocable: true
5
5
  allowed_tools: Read, Write, Bash, Glob, Grep, Agent, Skill
6
6
  ---
7
7
 
8
8
  # Autopilot Mode
9
9
 
10
- Hands-free workflow that executes all 21 steps of the ShipIt workflow with minimal interruptions. Same quality gates, same fix loops, same 100% coverage — just fewer stops.
10
+ Hands-free workflow that executes all 8 steps of the ShipIt workflow with minimal interruptions. Same quality gates, same fix loops, same 100% coverage — just fewer stops.
11
11
 
12
12
  ## When to Use
13
13
 
@@ -23,30 +23,19 @@ Hands-free workflow that executes all 21 steps of the ShipIt workflow with minim
23
23
 
24
24
  ## Quality Guarantee
25
25
 
26
- Autopilot runs the EXACT same 21 steps as manual mode:
26
+ Autopilot runs the EXACT same 8 steps as manual mode:
27
27
  - ALL quality gates enforced (lint, test, security, perf, review, e2e)
28
- - ALL fix-commit-rerun loops active
28
+ - ALL fix-rerun loops active
29
29
  - 100% test coverage required on new code
30
30
  - 0 security issues required
31
31
  - The ONLY difference: auto-advance between steps instead of stopping
32
32
 
33
33
  ## Steps
34
34
 
35
- ### 0. Reset Tracker
35
+ ### 1. Load Context + Brainstorm + Direction Approval (STOP — requires user input)
36
36
 
37
- Read `tasks/workflow-status.md`. If it has done/skipped steps from a different task, auto-reset all steps to `not yet`.
38
-
39
- ### 1. Load Context (auto — no prompt)
40
-
41
- - Read `tasks/todo.md`
42
- - Read `tasks/lessons.md` (apply all active lessons as constraints)
43
- - Read `tasks/findings.md` (if exists)
44
- - Read `tasks/tech-debt.md` (if exists)
45
-
46
- ### 2. Brainstorm + Direction Approval (STOP — requires user input)
47
-
48
- Run brainstorm internally:
49
- - Explore the codebase (3 parallel Explore agents)
37
+ - Read `tasks/todo.md`, `tasks/lessons.md`, `tasks/findings.md`, `tasks/tech-debt.md`
38
+ - Run brainstorm internally (3 parallel Explore agents)
50
39
  - Propose 2-3 approaches with trade-offs
51
40
 
52
41
  **Present ONE direction summary and ask:**
@@ -57,40 +46,33 @@ Run brainstorm internally:
57
46
 
58
47
  Wait for explicit `y` before continuing. This is the ONLY planning stop.
59
48
 
49
+ ### 2. Design (auto-skip if no frontend/API keywords)
50
+
51
+ Run `/sk:frontend-design` or `/sk:api-design` if applicable. Auto-skip if no frontend/API keywords detected. Log: `Auto-skipped: Design ([reason])`
52
+
60
53
  ### 3. Plan (auto-advance)
61
54
 
62
- Write the implementation plan to `tasks/todo.md`. Do NOT ask for plan approval — the direction approval in step 2 covers this.
55
+ Write the implementation plan to `tasks/todo.md`. Do NOT ask for plan approval — the direction approval in step 1 covers this.
63
56
 
64
57
  ### 4. Branch (auto-advance)
65
58
 
66
59
  Create feature branch auto-named from the task. Do NOT ask for confirmation.
67
60
 
68
- ### 5. Auto-Skip Detection
61
+ ### 5. Write Tests + Implement (auto-advance)
69
62
 
70
- Scan `tasks/todo.md` for frontend/backend/database keywords. For each optional step:
71
- - **Design (step 4)**: auto-skip if no frontend keywords
72
- - **Accessibility (step 5)**: auto-skip if no frontend keywords
73
- - **Migrate (step 8)**: auto-skip if no database keywords
74
- - **Performance (step 15)**: auto-skip if no frontend AND no database keywords
63
+ - Run `/sk:write-tests` (TDD red phase)
64
+ - Run `/sk:schema-migrate` if database keywords detected
65
+ - Run `/sk:execute-plan` (TDD green phase)
66
+ - Auto-advance when done
75
67
 
76
- Log each auto-skip: `Auto-skipped: [Step Name] ([reason])`
77
-
78
- ### 6. Write Tests (auto-advance)
79
-
80
- Write failing tests based on the plan (TDD red phase). Auto-advance when done.
81
-
82
- ### 7. Implement (auto-advance)
83
-
84
- Execute the plan — make failing tests pass. Use wave-based sub-agents for parallel work where possible.
85
-
86
- ### 8. Commit (auto-commit)
68
+ ### 6. Commit (auto-commit)
87
69
 
88
70
  Auto-commit with conventional commit format. Do NOT ask for commit message approval.
89
71
  Format: `type(scope): description`
90
72
 
91
- ### 9. Gates (auto-advance on clean pass)
73
+ ### 7. Gates (auto-advance on clean pass)
92
74
 
93
- Run all quality gates. Use `/sk:gates` if available, otherwise run sequentially:
75
+ Run all quality gates via `/sk:gates`:
94
76
  1. Lint + dep audit
95
77
  2. Test (100% coverage)
96
78
  3. Security (0 issues)
@@ -98,9 +80,9 @@ Run all quality gates. Use `/sk:gates` if available, otherwise run sequentially:
98
80
  5. Review + simplify
99
81
  6. E2E
100
82
 
101
- Each gate auto-fixes and re-runs internally. Auto-advance to next gate on clean pass.
83
+ Each gate auto-fixes and re-runs internally. Squash gate commits: one commit per gate pass.
102
84
 
103
- ### 10. PR Push (STOP — requires user confirmation)
85
+ ### 8. PR Push (STOP — requires user confirmation)
104
86
 
105
87
  **This is the second mandatory stop.** Present:
106
88
  > "All gates passed. Ready to create PR.
@@ -110,11 +92,10 @@ Each gate auto-fixes and re-runs internally. Auto-advance to next gate on clean
110
92
 
111
93
  Wait for explicit confirmation — pushing is visible to others.
112
94
 
113
- ### 11. Finalize (auto-advance)
114
-
95
+ After confirmation:
115
96
  - Create PR
116
97
  - Sync features (`/sk:features`)
117
- - Ask about release (step 21 is never auto-skipped)
98
+ - Ask about release (never auto-skipped)
118
99
 
119
100
  ## 3-Strike Protocol
120
101
 
@@ -128,9 +109,9 @@ If any step fails 3 times:
128
109
 
129
110
  | Stop | When | Why |
130
111
  |------|------|-----|
131
- | Direction approval | After brainstorm (step 2) | User must approve the approach |
112
+ | Direction approval | After brainstorm (step 1) | User must approve the approach |
132
113
  | 3-strike failure | Any step fails 3x | Needs human judgment |
133
- | PR push | Before creating PR (step 10) | Visible to others — always confirm |
114
+ | PR push | Before creating PR (step 8) | Visible to others — always confirm |
134
115
 
135
116
  Everything else auto-advances.
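
The 3-strike stop in the table above can be pictured as a bounded retry loop. The sketch below is only an illustration, not ShipKit's implementation: `runWithStrikes`, the `step` callback, and the returned status strings are hypothetical names, and it assumes a step reports success by returning `true`.

```javascript
// Hypothetical sketch of a 3-strike loop: retry a step up to three times,
// then stop and hand control back to a human.
const MAX_STRIKES = 3;

// `step` is an assumed callback: it returns true on success, and returns
// false or throws on failure. It receives the 1-based attempt number.
function runWithStrikes(step, maxStrikes = MAX_STRIKES) {
  for (let attempt = 1; attempt <= maxStrikes; attempt++) {
    let ok = false;
    try {
      ok = step(attempt) === true;
    } catch (err) {
      ok = false; // a thrown error counts as one strike
    }
    if (ok) return { status: "passed", attempts: attempt };
  }
  // All strikes used: autopilot would stop here instead of auto-advancing.
  return { status: "stopped", attempts: maxStrikes };
}
```

A caller would auto-advance on `"passed"` and surface the failure to the user on `"stopped"`.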
136
117
 
@@ -74,6 +74,19 @@ digraph brainstorming {
74
74
  - Only one question per message - if a topic needs more exploration, break it into multiple questions
75
75
  - Focus on understanding: purpose, constraints, success criteria
76
76
 
77
+ **Search-First Research (before proposing approaches):**
78
+ Before proposing custom solutions, check if the problem is already solved:
79
+ 1. **Grep codebase** — does similar functionality already exist in this repo?
80
+ 2. **Check package registries** — is there a well-maintained package for this? (npm, PyPI, Packagist, crates.io)
81
+ 3. **Check existing skills** — does a ShipKit skill or MCP server already handle this?
82
+
83
+ Decision matrix:
84
+ - **Adopt** — existing solution covers 90%+ of requirements → use it directly
85
+ - **Extend** — existing solution covers 60-90% → extend or wrap it
86
+ - **Build custom** — nothing suitable exists → build from scratch (informed by what was found)
87
+
88
+ If a suitable package or existing solution is found, include it as one of the approaches.
89
+
77
90
  **Exploring approaches:**
78
91
  - Propose 2-3 different approaches with trade-offs
79
92
  - Present options conversationally with your recommendation and reasoning
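
The Adopt / Extend / Build custom decision matrix added in this hunk maps a coverage estimate to a choice. A minimal sketch, assuming coverage is expressed as a fraction between 0 and 1; `chooseStrategy` and its return labels are hypothetical, and treating anything under 60% as "nothing suitable exists" is an assumption the skill text does not state.

```javascript
// Hypothetical helper: maps an estimated requirements-coverage fraction (0..1)
// to the Adopt / Extend / Build custom decision from the skill text.
function chooseStrategy(coverage) {
  if (typeof coverage !== "number" || Number.isNaN(coverage) || coverage < 0 || coverage > 1) {
    throw new RangeError("coverage must be a number between 0 and 1");
  }
  if (coverage >= 0.9) return "adopt";  // covers 90%+ of requirements
  if (coverage >= 0.6) return "extend"; // covers 60-90%: extend or wrap
  return "build-custom";                // assumption: below 60% means nothing suitable
}
```

In practice the coverage figure is a judgment call made during research, so the thresholds act as a guide rather than a measurement.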
@@ -26,21 +26,19 @@ Load all project context files into the conversation and output a formatted sess
26
26
  | # | File | What to Extract |
27
27
  |---|------|-----------------|
28
28
  | 1 | `tasks/todo.md` | Task name (from `# TODO —` heading), milestone progress, count of `- [x]` (done) vs `- [ ]` (pending) checkboxes |
29
- | 2 | `tasks/workflow-status.md` | Current step (row with `>> next <<`), step name, command to run |
30
- | 3 | `tasks/progress.md` | Last 5 entries only (most recent work). If file is large, read only the last 50 lines. |
31
- | 4 | `tasks/findings.md` | Current decisions, chosen approach, open questions |
32
- | 5 | `tasks/lessons.md` | All active lessons read in full, apply as constraints for this session |
33
- | 6 | `docs/decisions.md` | If exists: last 3 ADR entries. If missing: note "no decisions log yet" |
34
- | 7 | `docs/vision.md` | If exists: product name + value proposition. If missing: note "no vision.md found" |
35
- | 8 | `tasks/tech-debt.md` | If exists: count entries with no `Resolved:` line (unresolved), highest severity among unresolved |
29
+ | 2 | `tasks/progress.md` | Last 5 entries only (most recent work). If file is large, read only the last 50 lines. |
30
+ | 3 | `tasks/findings.md` | Current decisions, chosen approach, open questions |
31
+ | 4 | `tasks/lessons.md` | All active lessons — read in full, apply as constraints for this session |
32
+ | 5 | `docs/decisions.md` | If exists: last 3 ADR entries. If missing: note "no decisions log yet" |
33
+ | 6 | `docs/vision.md` | If exists: product name + value proposition. If missing: note "no vision.md found" |
34
+ | 7 | `tasks/tech-debt.md` | If exists: count entries with no `Resolved:` line (unresolved), highest severity among unresolved |
36
35
 
37
36
  ### Reading Strategy
38
37
 
39
- - Read files 1-5 first (these are the core context).
40
- - Files 6-7 are optional — check if they exist before reading.
38
+ - Read files 1-4 first (these are the core context).
39
+ - Files 5-6 are optional — check if they exist before reading.
41
40
  - For `tasks/progress.md`: only read the last 50 lines to avoid loading a huge file.
42
41
  - If `tasks/todo.md` is missing: the project has no active task.
43
- - If `tasks/workflow-status.md` is missing: the workflow hasn't started.
44
42
 
45
43
  ---
46
44
 
@@ -54,9 +52,8 @@ After reading all files, output this session brief:
54
52
  ╚══════════════════════════════════════════╝
55
53
  Branch: [current git branch]
56
54
  Task: [task name from todo.md, or "No active task"]
57
- Step: [step #] [step name] run `/sk:[command]`
55
+ Progress: [N done] / [M total] checkboxes in todo.md
58
56
  Last done: [last progress.md entry summary, 1 line]
59
- Pending: [N] checkboxes remaining in todo.md
60
57
  Lessons: [count] active — [most critical 1-liner from lessons.md]
61
58
  Open Qs: [open questions from findings.md, or "none"]
62
59
  Tech Debt: [N] unresolved — highest: [severity] ([file:line])
@@ -68,9 +65,8 @@ Product: [value prop from vision.md, or "no vision.md found"]
68
65
 
69
66
  - **Branch:** Run `git branch --show-current` to get the current branch name.
70
67
  - **Task:** Extract from the first `# TODO —` line in `tasks/todo.md`. If the file doesn't exist or all checkboxes are done, show "No active task — ready to start fresh".
71
- - **Step:** Find the row containing `>> next <<` in `tasks/workflow-status.md`. Extract step number, name, and command. If no `>> next <<` found, show "Workflow complete" or "Not started".
68
+ - **Progress:** Count `- [x]` (done) and `- [ ]` (pending) lines in `tasks/todo.md`. Stop counting at the first `## Verification`, `## Acceptance Criteria`, or `## Risks` heading (these are meta-sections, not tasks). Show `N done / M total`.
72
69
  - **Last done:** The most recent entry from `tasks/progress.md`. Summarize in one line.
73
- - **Pending:** Count `- [ ]` lines in `tasks/todo.md`. Stop counting at the first `## Verification`, `## Acceptance Criteria`, or `## Risks` heading (these are meta-sections, not tasks).
74
70
  - **Lessons:** Count `### [` headings in `tasks/lessons.md` (each lesson starts with `### [YYYY-MM-DD]`). Show the count + the **Prevention:** line from the most recent lesson.
75
71
  - **Open Qs:** Check for an "## Open Questions" section in `tasks/findings.md`. List them or say "none".
76
72
  - **Tech Debt:** Read `tasks/tech-debt.md` if it exists. Count entries that have no `Resolved:` line — each entry starts with `### [`. For unresolved entries, find the highest severity. Show `N unresolved — highest: [severity] ([file])`. If file missing or 0 unresolved, show `none`.
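
The Progress rule in this hunk (count `- [x]` and `- [ ]` lines, stop at the first meta-section heading) could be sketched as follows. This is an illustration under stated assumptions, not the code in `server.js`: `countProgress` and `formatProgress` are hypothetical names, and counting indented (nested) checkboxes is an assumption.

```javascript
// Headings that end the task list, taken from the Progress rule in the skill text.
const STOP_HEADINGS = ["## Verification", "## Acceptance Criteria", "## Risks"];

// Hypothetical helper: counts done and pending checkboxes in todo.md content.
function countProgress(markdown) {
  let done = 0;
  let pending = 0;
  for (const rawLine of markdown.split("\n")) {
    const line = rawLine.trim(); // assumption: nested checkboxes are counted too
    if (STOP_HEADINGS.some((heading) => line.startsWith(heading))) break;
    if (line.startsWith("- [x]")) done++;
    else if (line.startsWith("- [ ]")) pending++;
  }
  return { done, total: done + pending };
}

// Formats the counts in the `N done / M total` shape used by the session brief.
function formatProgress({ done, total }) {
  return `${done} done / ${total} total`;
}
```

Checkboxes that appear after a stop heading are ignored, so acceptance criteria do not inflate the pending count.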
@@ -93,7 +89,7 @@ After outputting the session brief:
93
89
  | Scenario | Behavior |
94
90
  |----------|----------|
95
91
  | No `tasks/todo.md` | Show "No active task — ready to start fresh" |
96
- | No `tasks/workflow-status.md` | Show "Workflow not started" for Step field |
92
+ | All checkboxes done in todo.md | Show "Task complete — 0 pending" for Progress field |
97
93
  | No `tasks/progress.md` | Show "No progress logged yet" for Last done |
98
94
  | No `tasks/findings.md` | Show "none" for Open Qs |
99
95
  | No `tasks/lessons.md` | Show "0 active" for Lessons |
@@ -0,0 +1,126 @@
1
+ ---
2
+ name: sk:context-budget
3
+ description: "Audit context window token consumption and find optimization opportunities."
4
+ ---
5
+
6
+ # /sk:context-budget — Token Consumption Audit
7
+
8
+ Audits all components that consume context window tokens — agents, skills, rules, MCP tools, CLAUDE.md — and identifies optimization opportunities.
9
+
10
+ ## Usage
11
+
12
+ ```
13
+ /sk:context-budget # standard audit
14
+ /sk:context-budget --verbose # per-file breakdown
15
+ ```
16
+
17
+ ## Model Routing
18
+
19
+ Read `.shipkit/config.json` from the project root if it exists.
20
+
21
+ | Profile | Model |
22
+ |---------|-------|
23
+ | `full-sail` | haiku |
24
+ | `quality` | haiku |
25
+ | `balanced` | haiku |
26
+ | `budget` | haiku |
27
+
28
+ > Counting and classification is lightweight — haiku is sufficient.
29
+
30
+ ## How It Works
31
+
32
+ ### Phase 1: Inventory
33
+
34
+ Scan and count token estimates for every loaded component:
35
+
36
+ | Component | Location | Token Estimation |
37
+ |-----------|----------|------------------|
38
+ | CLAUDE.md | `CLAUDE.md` | `words * 1.3` |
39
+ | Global CLAUDE.md | `~/.claude/CLAUDE.md` | `words * 1.3` |
40
+ | Skills | `skills/*/SKILL.md` | `words * 1.3` |
41
+ | Commands | `commands/**/*.md` | `words * 1.3` |
42
+ | Agents | `.claude/agents/*.md` | `words * 1.3` |
43
+ | Rules | `.claude/rules/*.md` | `words * 1.3` |
44
+ | MCP tool schemas | count tools * ~500 tokens each | `tool_count * 500` |
45
+ | Hooks | `.claude/hooks/*.sh` (minimal overhead) | `words * 1.3` |
46
+
47
+ **Token estimation formula:**
48
+ - Prose/markdown: `word_count * 1.3`
49
+ - Code blocks: `char_count / 4`
50
+ - MCP tool schemas: ~500 tokens per tool definition
51
+
52
+ ### Phase 2: Classify Usage Frequency
53
+
54
+ For each component, classify how often it's actually needed:
55
+
56
+ | Classification | Meaning | Action |
57
+ |---------------|---------|--------|
58
+ | **Always** | Loaded every session, always relevant | Keep as-is |
59
+ | **Sometimes** | Relevant to specific task types | Consider conditional loading |
60
+ | **Rarely** | Edge case, rarely triggered | Candidate for removal/extraction |
61
+
62
+ Classification heuristics:
63
+ - Skills used in the workflow (brainstorm, write-tests, gates, etc.) → Always
64
+ - Skills triggered by keywords (frontend-design, api-design) → Sometimes
65
+ - Niche skills (seo-audit, schema-migrate) → Rarely
66
+ - MCP tools: if >20 tools on one server → flag as over-subscribed
67
+
68
+ ### Phase 3: Detect Issues
69
+
70
+ Flag these common problems:
71
+
72
+ 1. **Bloated agents** — agent descriptions >200 lines
73
+ 2. **Bloated skills** — skill definitions >400 lines
74
+ 3. **Bloated rules** — rule files >100 lines
75
+ 4. **MCP over-subscription** — servers with >20 tools (each costs ~500 tokens)
76
+ 5. **CLI-wrapping MCPs** — MCP servers that just wrap CLI tools (overhead > benefit)
77
+ 6. **Duplicate content** — same instructions in CLAUDE.md AND skill files
78
+ 7. **CLAUDE.md bloat** — CLAUDE.md >200 lines (exceeds the 200-line target)
79
+ 8. **Unused components** — skills/agents never referenced in workflow
80
+
81
+ ### Phase 4: Report
82
+
83
+ Output a structured report:
84
+
85
+ ```
86
+ === Context Budget Audit ===
87
+
88
+ Component Breakdown:
89
+ CLAUDE.md ~1,200 tokens
90
+ Global CLAUDE.md ~800 tokens
91
+ Skills (42 files) ~18,000 tokens
92
+ Commands (35 files) ~8,000 tokens
93
+ Agents (8 files) ~3,200 tokens
94
+ Rules (5 files) ~1,500 tokens
95
+ MCP tools (3 servers) ~15,000 tokens (30 tools)
96
+ ─────────────────────────────────
97
+ Total overhead: ~47,700 tokens
98
+
99
+ Context window: 200,000 tokens
100
+ Overhead: 47,700 tokens (23.8%)
101
+ Available for work: 152,300 tokens
102
+
103
+ Issues Found:
104
+ [HIGH] MCP server "playwright" has 28 tools (~14,000 tokens)
105
+ [MEDIUM] Skill sk:frontend-design is 380 lines (~500 tokens)
106
+ [LOW] Agent perf-auditor has 220 lines (~290 tokens)
107
+
108
+ Top 3 Optimizations:
109
+ 1. Remove unused MCP tools from playwright (save ~7,000 tokens)
110
+ 2. Consolidate duplicate workflow instructions (save ~1,200 tokens)
111
+ 3. Trim agent descriptions to <150 lines (save ~400 tokens)
112
+
113
+ Potential savings: ~8,600 tokens (18% reduction)
114
+ ```
115
+
116
+ ### --verbose Mode
117
+
118
+ Adds per-file token breakdown:
119
+
120
+ ```
121
+ Skills Breakdown:
122
+ sk:autopilot/SKILL.md ~620 tokens
123
+ sk:brainstorming/SKILL.md ~480 tokens
124
+ sk:gates/SKILL.md ~440 tokens
125
+ ...
126
+ ```
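
The Phase 1 estimation formulas in this skill are simple enough to sketch directly. The helper names below are hypothetical, and rounding to whole tokens is an assumption; the constants (1.3 tokens per word, 4 characters per token, about 500 tokens per MCP tool) come from the skill text.

```javascript
// Constants from the "Token estimation formula" list in the skill text.
const TOKENS_PER_WORD = 1.3;
const CHARS_PER_TOKEN = 4;
const TOKENS_PER_MCP_TOOL = 500;

// Prose/markdown: word_count * 1.3 (rounded to whole tokens, an assumption).
function estimateProseTokens(text) {
  const words = text.trim().split(/\s+/).filter(Boolean);
  return Math.round(words.length * TOKENS_PER_WORD);
}

// Code blocks: char_count / 4.
function estimateCodeTokens(code) {
  return Math.round(code.length / CHARS_PER_TOKEN);
}

// MCP tool schemas: roughly 500 tokens per tool definition.
function estimateMcpTokens(toolCount) {
  return toolCount * TOKENS_PER_MCP_TOOL;
}

// Sums per-component estimates and reports what is left of the context window.
function summarizeBudget(componentTokens, contextWindow) {
  const overhead = componentTokens.reduce((sum, tokens) => sum + tokens, 0);
  return { overhead, available: contextWindow - overhead };
}
```

Feeding the sample report's component figures into `summarizeBudget` with a 200,000-token window reproduces its totals of 47,700 tokens of overhead and 152,300 available.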
@@ -29,9 +29,8 @@ PORT=4000 node skills/sk:dashboard/server.js
29
29
  ## What It Shows
30
30
 
31
31
  - **Swimlanes per worktree** — one row per worktree discovered via `git worktree list`
32
- - **Phase timeline** — workflow steps laid out as columns (Read, Explore, Plan, Branch, Tests, Implement, Lint, Verify, Security, Review, E2E, Finalize)
33
- - **Status indicators** — done, skipped, partial, in-progress, not yet
34
- - **Progress bars** — percentage of steps completed per worktree
32
+ - **Phase timeline** — workflow steps laid out as columns (Explore, Design, Plan, Branch, Tests+Implement, Commit, Gates, Finalize)
33
+ - **Progress bars** — percentage of todo.md checkboxes completed per worktree
35
34
  - **Current task** — the active task name from `tasks/todo.md`
36
35
 
37
36
  ## Architecture
@@ -39,7 +38,7 @@ PORT=4000 node skills/sk:dashboard/server.js
39
38
  Zero-dependency Node.js server. Uses only built-in modules (`http`, `fs`, `path`, `child_process`).
40
39
 
41
40
  - `server.js` serves the dashboard HTML and exposes `/api/status`
42
- - `/api/status` reads `tasks/workflow-status.md` and `tasks/todo.md` from each worktree, parses step statuses, and returns JSON
41
+ - `/api/status` reads `tasks/todo.md` from each worktree, parses checkbox progress, and returns JSON
43
42
  - `dashboard.html` is a single-file UI (HTML + embedded CSS + JS) that polls `/api/status` every 3 seconds
44
43
  - Worktree discovery via `git worktree list`
45
44
 
@@ -7,9 +7,6 @@ const { execSync } = require("child_process");
7
7
  const PORT =
8
8
  parseInt(process.argv.find((_, i, a) => a[i - 1] === "--port") || process.env.PORT, 10) || 3333;
9
9
 
10
- const HARD_GATES = new Set([12, 14, 16, 20, 22]);
11
- const OPTIONALS = new Set([4, 5, 8, 18, 27]);
12
-
13
10
  function stripMd(s) {
14
11
  return (s || "").replace(/\*\*/g, "").replace(/`/g, "").trim();
15
12
  }
@@ -34,53 +31,6 @@ function discoverWorktrees() {
34
31
  }
35
32
  }
36
33
 
37
- function parseWorkflowStatus(worktreePath) {
38
- const filePath = path.join(worktreePath, "tasks", "workflow-status.md");
39
- try {
40
- const lines = fs.readFileSync(filePath, "utf8").split("\n");
41
-
42
- let headerFound = false;
43
- let separatorSkipped = false;
44
- const steps = [];
45
-
46
- for (const line of lines) {
47
- if (!headerFound) {
48
- if (line.includes("| # |")) headerFound = true;
49
- continue;
50
- }
51
- if (!separatorSkipped) {
52
- separatorSkipped = true;
53
- continue;
54
- }
55
- const cells = line.split("|").slice(1, -1).map((c) => c.trim());
56
- if (cells.length < 3) continue;
57
-
58
- const number = parseInt(cells[0], 10);
59
- if (isNaN(number)) continue;
60
-
61
- const rawStep = stripMd(cells[1]);
62
- const cmdMatch = rawStep.match(/\((.+?)\)/);
63
- const command = cmdMatch ? cmdMatch[1].trim() : "";
64
- const name = rawStep.replace(/\s*\(.+?\)\s*/, "").trim();
65
-
66
- steps.push({
67
- number,
68
- name,
69
- command,
70
- status: stripMd(cells[2]),
71
- notes: stripMd(cells[3]),
72
- isHardGate: HARD_GATES.has(number),
73
- isOptional: OPTIONALS.has(number),
74
- });
75
- }
76
- return steps;
77
- } catch (err) {
78
- if (err.code === "ENOENT") return [];
79
- process.stderr.write(`Error parsing workflow-status.md: ${err.message}\n`);
80
- return [];
81
- }
82
- }
83
-
84
34
  const STOP_HEADERS = new Set(["Verification", "Acceptance Criteria", "Risks", "Change Log", "Summary"]);
85
35
 
86
36
  function parseTodo(worktreePath) {
@@ -137,18 +87,8 @@ function parseTodo(worktreePath) {
137
87
  function buildStatus() {
138
88
  const worktrees = discoverWorktrees();
139
89
  return worktrees.map((wt) => {
140
- const steps = parseWorkflowStatus(wt.path);
141
90
  const todo = parseTodo(wt.path);
142
91
 
143
- let currentStep = 0;
144
- let totalDone = 0;
145
- let totalSkipped = 0;
146
- for (const s of steps) {
147
- if (s.status === ">> next <<") currentStep = s.number;
148
- if (s.status === "done") totalDone++;
149
- if (s.status === "skipped") totalSkipped++;
150
- }
151
-
152
92
  return {
153
93
  path: wt.path,
154
94
  branch: wt.branch,
@@ -156,11 +96,6 @@ function buildStatus() {
156
96
  todosDone: todo.todosDone,
157
97
  todosTotal: todo.todosTotal,
158
98
  todoItems: todo.todoItems,
159
- currentStep,
160
- totalDone,
161
- totalSkipped,
162
- totalSteps: steps.length,
163
- steps,
164
99
  };
165
100
  });
166
101
  }
@@ -184,17 +184,17 @@ If any fail → apply Fix & Retest Protocol.
184
184
 
185
185
  When this gate requires a fix, classify it before committing:
186
186
 
187
- **a. Style/config/wording change** (CSS tweak, copy change, selector fix) → auto-commit with `fix(e2e): resolve failing E2E scenarios` and re-run `/sk:e2e`. Do not ask the user.
187
+ **a. Style/config/wording change** (CSS tweak, copy change, selector fix) → include in the gate's squash commit and re-run `/sk:e2e`. Do not ask the user.
188
188
 
189
189
  **b. Logic change** (new branch, modified condition, new data path, query change, new function, API change) → trigger protocol:
190
190
  1. Update or add failing unit tests for the new behavior
191
191
  2. Re-run `/sk:test` — must pass at 100% coverage
192
- 3. Auto-commit tests + fix together with `fix(e2e): [description]`.
192
+ 3. Commit tests + fix together with `fix(e2e): [description]`.
193
193
  4. Re-run `/sk:e2e` from scratch
194
194
 
195
195
  **Exception:** Formatter auto-fixes are never logic changes — bypass protocol automatically.
196
196
 
197
- Gates own their commits — the fix-commit-rerun loop is fully internal. No manual commit step needed after this gate.
197
+ > Squash gate commits — collect all fixes for the pass, then one commit: `fix(e2e): resolve failing E2E scenarios`. Do not commit after each individual fix.
198
198
 
199
199
  **This gate cannot be skipped.** All scenarios must pass before proceeding to `/sk:update-task`.
200
200
 
@@ -0,0 +1,188 @@
1
+ ---
2
+ name: sk:eval
3
+ description: "Define, run, and report on evaluations for agent reliability and code quality."
4
+ ---
5
+
6
+ # /sk:eval — Eval-Driven Development
7
+
8
+ A formal evaluation framework for measuring agent reliability and code quality. Define evals before coding, check during implementation, and report after shipping.
9
+
10
+ ## Usage
11
+
12
+ ```
13
+ /sk:eval define <feature> # create eval definition
14
+ /sk:eval check <feature> # run evals against current state
15
+ /sk:eval report # summary of all eval results
16
+ /sk:eval list # show all defined evals
17
+ ```
18
+
19
+ ## Model Routing
20
+
21
+ Read `.shipkit/config.json` from the project root if it exists.
22
+
23
+ | Profile | Model |
24
+ |---------|-------|
25
+ | `full-sail` | sonnet |
26
+ | `quality` | sonnet |
27
+ | `balanced` | sonnet |
28
+ | `budget` | haiku |
29
+
30
+ > Eval analysis needs reasoning for model-based graders — sonnet for balanced+.
31
+
32
+ ## Eval Types
33
+
34
+ ### Capability Evals
35
+
36
+ Test whether Claude can accomplish something new:
37
+
38
+ - "Can it generate a valid migration from a schema description?"
39
+ - "Can it write a test that covers all edge cases?"
40
+ - "Can it refactor without changing behavior?"
41
+
42
+ ### Regression Evals
43
+
44
+ Ensure changes don't break existing behavior:
45
+
46
+ - "Does the login flow still work after auth refactor?"
47
+ - "Do all API endpoints still return correct status codes?"
48
+ - "Are all existing tests still passing?"
49
+
50
+ ## Grader Types
51
+
52
+ ### Code-Based (Deterministic)
53
+
54
+ Graded by running commands — pass/fail:
55
+
56
+ ```yaml
57
+ grader: code
58
+ checks:
59
+ - command: "npm test"
60
+ expect: exit_code_0
61
+ - command: "grep -r 'TODO' src/"
62
+ expect: no_output
63
+ - command: "npx tsc --noEmit"
64
+ expect: exit_code_0
65
+ ```
66
+
67
+ ### Model-Based (LLM-as-Judge)
68
+
69
+ Graded by an LLM against a rubric — scored 1-5:
70
+
71
+ ```yaml
72
+ grader: model
73
+ rubric: |
74
+ Score the implementation on:
75
+ 1. Correctness — does it solve the stated problem?
76
+ 2. Completeness — are all edge cases handled?
77
+ 3. Code quality — is it readable and maintainable?
78
+ 4. Security — are there any vulnerabilities?
79
+ 5. Performance — any obvious inefficiencies?
80
+ threshold: 4.0
81
+ ```
82
+
83
+ ### Human (Manual Review)
84
+
85
+ Flagged for human review — generates a checklist:
86
+
87
+ ```yaml
88
+ grader: human
89
+ checklist:
90
+ - "UI renders correctly on mobile"
91
+ - "Error messages are user-friendly"
92
+ - "Animation feels smooth (60fps)"
93
+ ```
94
+
95
+ ## Metrics
96
+
97
+ ### pass@k
98
+
99
+ At least 1 success in k attempts. Used for capability evals where some variance is expected.
100
+
101
+ ```
102
+ pass@3: Run the eval 3 times. Pass if at least 1 succeeds.
103
+ ```
104
+
105
+ ### pass^k
106
+
107
+ ALL k attempts must succeed. Used for regression evals where consistency is required.
108
+
109
+ ```
110
+ pass^3: Run the eval 3 times. Pass only if all 3 succeed.
111
+ ```
112
+
113
+ ## Storage
114
+
115
+ ### Eval Definition
116
+
117
+ Stored in `.claude/evals/[feature].md`:
118
+
119
+ ```markdown
120
+ ---
121
+ feature: user-authentication
122
+ type: capability
123
+ grader: code
124
+ created: 2026-03-25
125
+ pass_metric: pass@1
126
+ ---
127
+
128
+ ## Description
129
+ Verify the OAuth2 login flow works end-to-end.
130
+
131
+ ## Checks
132
+ - [ ] `npm test -- --grep "auth"` passes
133
+ - [ ] `curl -s localhost:3000/auth/google` returns 302
134
+ - [ ] `grep -r "hardcoded.*secret" src/` returns nothing
135
+
136
+ ## History
137
+ | Date | Result | Score | Notes |
138
+ |------|--------|-------|-------|
139
+ ```
140
+
141
+ ### Eval Results
142
+
143
+ Appended to `.claude/evals/[feature].log`:
144
+
145
+ ```
146
+ [2026-03-25T10:30:00Z] PASS — pass@1 (1/1 succeeded)
147
+ check_1: npm test (exit 0) ✓
148
+ check_2: curl auth redirect (302) ✓
149
+ check_3: no hardcoded secrets ✓
150
+ ```
151
+
152
+ ## Workflow Integration
153
+
154
+ ### Before Coding (define)
155
+
156
+ ```
157
+ /sk:eval define user-authentication
158
+ ```
159
+
160
+ Creates the eval definition with checks derived from the task requirements.
161
+
162
+ ### During Implementation (check)
163
+
164
+ ```
165
+ /sk:eval check user-authentication
166
+ ```
167
+
168
+ Runs all checks and reports pass/fail. Use during step 5 (Write Tests + Implement) to verify progress.
169
+
170
+ ### After Shipping (report)
171
+
172
+ ```
173
+ /sk:eval report
174
+ ```
175
+
176
+ Summary of all evals:
177
+
178
+ ```
179
+ === Eval Report ===
180
+
181
+ user-authentication PASS pass@1 (3 checks, 3 passed)
182
+ api-v2-endpoints PASS pass^3 (5 checks, 5 passed x3)
183
+ queue-reliability FAIL pass@3 (2 checks, 0/3 succeeded)
184
+
185
+ Overall: 2/3 passing (67%)
186
+
187
+ Action: queue-reliability needs investigation
188
+ ```
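
The difference between the `pass@k` and `pass^k` metrics defined in this skill comes down to "any attempt succeeds" versus "every attempt succeeds". A minimal sketch, assuming each attempt has already been reduced to a boolean; `evaluateMetric` is a hypothetical name and the metric string parsing is an assumption about how values such as `pass@3` might be read.

```javascript
// Hypothetical helper: applies a metric string such as "pass@3" or "pass^3"
// to an array of per-attempt boolean results.
function evaluateMetric(metric, results) {
  const match = /^pass([@^])(\d+)$/.exec(metric);
  if (!match) throw new Error(`unrecognized metric: ${metric}`);
  const k = Number(match[2]);
  if (k < 1 || results.length !== k) {
    throw new Error(`expected ${k} attempt result(s), got ${results.length}`);
  }
  return match[1] === "@"
    ? results.some((ok) => ok === true)   // pass@k: at least 1 success in k attempts
    : results.every((ok) => ok === true); // pass^k: all k attempts must succeed
}
```

For example, `evaluateMetric("pass@3", [false, true, false])` passes, while the same results under `pass^3` fail, which is why the stricter metric suits regression evals.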