@kennethsolomon/shipkit 3.10.1 → 3.11.0

Files changed (57)

package/README.md +121 -49
package/commands/sk/autopilot.md +2 -2
package/commands/sk/context-budget.md +5 -0
package/commands/sk/eval.md +5 -0
package/commands/sk/health.md +5 -0
package/commands/sk/help.md +32 -8
package/commands/sk/learn.md +5 -0
package/commands/sk/resume-session.md +5 -0
package/commands/sk/safety-guard.md +5 -0
package/commands/sk/save-session.md +5 -0
package/commands/sk/security-check.md +2 -2
package/commands/sk/set-profile.md +8 -0
package/commands/sk/status.md +4 -9
package/package.json +1 -1
package/skills/sk:accessibility/SKILL.md +10 -1
package/skills/sk:autopilot/SKILL.md +26 -45
package/skills/sk:brainstorming/SKILL.md +13 -0
package/skills/sk:context/SKILL.md +11 -15
package/skills/sk:context-budget/SKILL.md +126 -0
package/skills/sk:dashboard/SKILL.md +3 -4
package/skills/sk:dashboard/server.js +0 -65
package/skills/sk:e2e/SKILL.md +3 -3
package/skills/sk:eval/SKILL.md +188 -0
package/skills/sk:fast-track/SKILL.md +0 -9
package/skills/sk:frontend-design/SKILL.md +232 -0
package/skills/sk:gates/SKILL.md +2 -3
package/skills/sk:health/SKILL.md +146 -0
package/skills/sk:learn/SKILL.md +138 -0
package/skills/sk:lint/SKILL.md +3 -3
package/skills/sk:perf/SKILL.md +3 -3
package/skills/sk:resume-session/SKILL.md +95 -0
package/skills/sk:retro/SKILL.md +1 -2
package/skills/sk:review/SKILL.md +2 -2
package/skills/sk:safety-guard/SKILL.md +134 -0
package/skills/sk:save-session/SKILL.md +84 -0
package/skills/sk:setup-claude/SKILL.md +40 -4
package/skills/sk:setup-claude/scripts/__pycache__/apply_setup_claude.cpython-314.pyc +0 -0
package/skills/sk:setup-claude/scripts/apply_setup_claude.py +0 -1
package/skills/sk:setup-claude/templates/.claude/settings.json.template +110 -26
package/skills/sk:setup-claude/templates/.claude/statusline.sh +1 -15
package/skills/sk:setup-claude/templates/CLAUDE.md.template +69 -138
package/skills/sk:setup-claude/templates/commands/brainstorm.md.template +2 -13
package/skills/sk:setup-claude/templates/hooks/config-protection.sh +71 -0
package/skills/sk:setup-claude/templates/hooks/console-log-warning.sh +42 -0
package/skills/sk:setup-claude/templates/hooks/cost-tracker.sh +26 -0
package/skills/sk:setup-claude/templates/hooks/post-edit-format.sh +53 -0
package/skills/sk:setup-claude/templates/hooks/pre-compact.sh +1 -12
package/skills/sk:setup-claude/templates/hooks/safety-guard.sh +72 -0
package/skills/sk:setup-claude/templates/hooks/session-start.sh +0 -11
package/skills/sk:setup-claude/templates/hooks/session-stop.sh +0 -7
package/skills/sk:setup-claude/templates/hooks/suggest-compact.sh +35 -0
package/skills/sk:setup-claude/tests/__pycache__/test_apply_setup_claude.cpython-314.pyc +0 -0
package/skills/sk:setup-claude/tests/test_apply_setup_claude.py +2 -33
package/skills/sk:setup-optimizer/SKILL.md +68 -15
package/skills/sk:start/SKILL.md +34 -11
package/skills/sk:test/SKILL.md +3 -3
package/skills/sk:setup-claude/templates/tasks/workflow-status.md.template +0 -28

package/skills/sk:autopilot/SKILL.md CHANGED

@@ -1,13 +1,13 @@
 ---
 name: sk:autopilot
-description: Hands-free workflow — runs all 21 steps with auto-skip, auto-advance, auto-commit. Stops only for direction approval, 3-strike failures, and PR push.
+description: Hands-free workflow — runs all 8 steps with auto-skip, auto-advance, auto-commit. Stops only for direction approval, 3-strike failures, and PR push.
 user_invocable: true
 allowed_tools: Read, Write, Bash, Glob, Grep, Agent, Skill
 ---
 # Autopilot Mode
-Hands-free workflow that executes all 21 steps of the ShipIt workflow with minimal interruptions. Same quality gates, same fix loops, same 100% coverage — just fewer stops.
+Hands-free workflow that executes all 8 steps of the ShipIt workflow with minimal interruptions. Same quality gates, same fix loops, same 100% coverage — just fewer stops.
 ## When to Use
@@ -23,30 +23,19 @@ Hands-free workflow that executes all 21 steps of the ShipIt workflow with minim
 ## Quality Guarantee
-Autopilot runs the EXACT same 21 steps as manual mode:
+Autopilot runs the EXACT same 8 steps as manual mode:
 - ALL quality gates enforced (lint, test, security, perf, review, e2e)
-- ALL fix-commit-rerun loops active
+- ALL fix-rerun loops active
 - 100% test coverage required on new code
 - 0 security issues required
 - The ONLY difference: auto-advance between steps instead of stopping
 ## Steps
-### 0. Reset Tracker
+### 1. Load Context + Brainstorm + Direction Approval (STOP — requires user input)
-Read `tasks/workflow-status.md`. If it has done/skipped steps from a different task, auto-reset all steps to `not yet`.
-### 1. Load Context (auto — no prompt)
-- Read `tasks/todo.md`
-- Read `tasks/lessons.md` (apply all active lessons as constraints)
-- Read `tasks/findings.md` (if exists)
-- Read `tasks/tech-debt.md` (if exists)
-### 2. Brainstorm + Direction Approval (STOP — requires user input)
-Run brainstorm internally:
-- Explore the codebase (3 parallel Explore agents)
+- Read `tasks/todo.md`, `tasks/lessons.md`, `tasks/findings.md`, `tasks/tech-debt.md`
+- Run brainstorm internally (3 parallel Explore agents)
 - Propose 2-3 approaches with trade-offs
 **Present ONE direction summary and ask:**
@@ -57,40 +46,33 @@ Run brainstorm internally:
 Wait for explicit `y` before continuing. This is the ONLY planning stop.
+### 2. Design (auto-skip if no frontend/API keywords)
+Run `/sk:frontend-design` or `/sk:api-design` if applicable. Auto-skip if no frontend/API keywords detected. Log: `Auto-skipped: Design ([reason])`
 ### 3. Plan (auto-advance)
-Write the implementation plan to `tasks/todo.md`. Do NOT ask for plan approval — the direction approval in step 2 covers this.
+Write the implementation plan to `tasks/todo.md`. Do NOT ask for plan approval — the direction approval in step 1 covers this.
 ### 4. Branch (auto-advance)
 Create feature branch auto-named from the task. Do NOT ask for confirmation.
-### 5. Auto-Skip Detection
+### 5. Write Tests + Implement (auto-advance)
-Scan `tasks/todo.md` for frontend/backend/database keywords. For each optional step:
-- **Design (step 4)**: auto-skip if no frontend keywords
-- **Accessibility (step 5)**: auto-skip if no frontend keywords
-- **Migrate (step 8)**: auto-skip if no database keywords
-- **Performance (step 15)**: auto-skip if no frontend AND no database keywords
+- Run `/sk:write-tests` (TDD red phase)
+- Run `/sk:schema-migrate` if database keywords detected
+- Run `/sk:execute-plan` (TDD green phase)
+- Auto-advance when done
-Log each auto-skip: `Auto-skipped: [Step Name] ([reason])`
-### 6. Write Tests (auto-advance)
-Write failing tests based on the plan (TDD red phase). Auto-advance when done.
-### 7. Implement (auto-advance)
-Execute the plan — make failing tests pass. Use wave-based sub-agents for parallel work where possible.
-### 8. Commit (auto-commit)
+### 6. Commit (auto-commit)
 Auto-commit with conventional commit format. Do NOT ask for commit message approval.
 Format: `type(scope): description`
-### 9. Gates (auto-advance on clean pass)
+### 7. Gates (auto-advance on clean pass)
-Run all quality gates. Use `/sk:gates` if available, otherwise run sequentially:
+Run all quality gates via `/sk:gates`:
 1. Lint + dep audit
 2. Test (100% coverage)
 3. Security (0 issues)
@@ -98,9 +80,9 @@ Run all quality gates. Use `/sk:gates` if available, otherwise run sequentially:
 5. Review + simplify
 6. E2E
-Each gate auto-fixes and re-runs internally. Auto-advance to next gate on clean pass.
+Each gate auto-fixes and re-runs internally. Squash gate commits — one commit per gate pass.
-### 10. PR Push (STOP — requires user confirmation)
+### 8. PR Push (STOP — requires user confirmation)
 **This is the second mandatory stop.** Present:
 > "All gates passed. Ready to create PR.
@@ -110,11 +92,10 @@ Each gate auto-fixes and re-runs internally. Auto-advance to next gate on clean
 Wait for explicit confirmation — pushing is visible to others.
-### 11. Finalize (auto-advance)
+After confirmation:
 - Create PR
 - Sync features (`/sk:features`)
-- Ask about release (step 21 is never auto-skipped)
+- Ask about release (never auto-skipped)
 ## 3-Strike Protocol
@@ -128,9 +109,9 @@ If any step fails 3 times:
 | Stop | When | Why |
 |------|------|-----|
-| Direction approval | After brainstorm (step 2) | User must approve the approach |
+| Direction approval | After brainstorm (step 1) | User must approve the approach |
 | 3-strike failure | Any step fails 3x | Needs human judgment |
-| PR push | Before creating PR (step 10) | Visible to others — always confirm |
+| PR push | Before creating PR (step 8) | Visible to others — always confirm |
 Everything else auto-advances.
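
The commit step above only names the format `type(scope): description`. As a rough illustration of what a check against that format could look like, here is a minimal sketch. The helper name `parseCommitMessage` and the exact regular expression are assumptions made for this example, not part of the skill.

```javascript
// Hypothetical helper: checks a message against `type(scope): description`.
// The pattern is an assumption; the skill states the format but no validation rules.
const COMMIT_PATTERN = /^([a-z]+)\(([^)\s]+)\): (\S.*)$/;

function parseCommitMessage(message) {
  const match = COMMIT_PATTERN.exec(message);
  if (!match) return null; // message does not follow the format
  const [, type, scope, description] = match;
  return { type, scope, description };
}

console.log(parseCommitMessage("fix(e2e): resolve failing E2E scenarios"));
console.log(parseCommitMessage("fixed some stuff")); // null
```

The sample message is the one used by the e2e gate elsewhere in this diff; a stricter check might also restrict `type` to a fixed list.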

package/skills/sk:brainstorming/SKILL.md CHANGED

@@ -74,6 +74,19 @@ digraph brainstorming {
 - Only one question per message - if a topic needs more exploration, break it into multiple questions
 - Focus on understanding: purpose, constraints, success criteria
+**Search-First Research (before proposing approaches):**
+Before proposing custom solutions, check if the problem is already solved:
+1. **Grep codebase** — does similar functionality already exist in this repo?
+2. **Check package registries** — is there a well-maintained package for this? (npm, PyPI, Packagist, crates.io)
+3. **Check existing skills** — does a ShipKit skill or MCP server already handle this?
+Decision matrix:
+- **Adopt** — existing solution covers 90%+ of requirements → use it directly
+- **Extend** — existing solution covers 60-90% → extend or wrap it
+- **Build custom** — nothing suitable exists → build from scratch (informed by what was found)
+If a suitable package or existing solution is found, include it as one of the approaches.
 **Exploring approaches:**
 - Propose 2-3 different approaches with trade-offs
 - Present options conversationally with your recommendation and reasoning
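
The Adopt / Extend / Build decision matrix added in this hunk can be read as a simple threshold function. The sketch below assumes that requirement coverage is available as a fraction between 0 and 1 (the skill does not say how it is estimated) and that anything under 60% counts as "nothing suitable", which the skill does not state explicitly. The function name `chooseApproach` is hypothetical.

```javascript
// Hypothetical mapping of the decision matrix to thresholds.
// `coverage` is an estimated fraction (0..1) of requirements an existing solution covers.
function chooseApproach(coverage) {
  if (coverage >= 0.9) return "adopt";  // 90%+ covered: use it directly
  if (coverage >= 0.6) return "extend"; // 60-90% covered: extend or wrap it
  return "build-custom";                // assumed: below 60% is treated as nothing suitable
}

console.log(chooseApproach(0.95)); // adopt
console.log(chooseApproach(0.75)); // extend
console.log(chooseApproach(0.3));  // build-custom
```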

package/skills/sk:context/SKILL.md CHANGED

@@ -26,21 +26,19 @@ Load all project context files into the conversation and output a formatted sess
 | # | File | What to Extract |
 |---|------|-----------------|
 | 1 | `tasks/todo.md` | Task name (from `# TODO —` heading), milestone progress, count of `- [x]` (done) vs `- [ ]` (pending) checkboxes |
-| 2 | `tasks/workflow-status.md` | Current step (row with `>> next <<`), step name, command to run |
-| 3 | `tasks/progress.md` | Last 5 entries only (most recent work). If file is large, read only the last 50 lines. |
-| 4 | `tasks/findings.md` | Current decisions, chosen approach, open questions |
-| 5 | `tasks/lessons.md` | All active lessons — read in full, apply as constraints for this session |
-| 6 | `docs/decisions.md` | If exists: last 3 ADR entries. If missing: note "no decisions log yet" |
-| 7 | `docs/vision.md` | If exists: product name + value proposition. If missing: note "no vision.md found" |
-| 8 | `tasks/tech-debt.md` | If exists: count entries with no `Resolved:` line (unresolved), highest severity among unresolved |
+| 2 | `tasks/progress.md` | Last 5 entries only (most recent work). If file is large, read only the last 50 lines. |
+| 3 | `tasks/findings.md` | Current decisions, chosen approach, open questions |
+| 4 | `tasks/lessons.md` | All active lessons — read in full, apply as constraints for this session |
+| 5 | `docs/decisions.md` | If exists: last 3 ADR entries. If missing: note "no decisions log yet" |
+| 6 | `docs/vision.md` | If exists: product name + value proposition. If missing: note "no vision.md found" |
+| 7 | `tasks/tech-debt.md` | If exists: count entries with no `Resolved:` line (unresolved), highest severity among unresolved |
 ### Reading Strategy
-- Read files 1-5 first (these are the core context).
-- Files 6-7 are optional — check if they exist before reading.
+- Read files 1-4 first (these are the core context).
+- Files 5-6 are optional — check if they exist before reading.
 - For `tasks/progress.md`: only read the last 50 lines to avoid loading a huge file.
 - If `tasks/todo.md` is missing: the project has no active task.
-- If `tasks/workflow-status.md` is missing: the workflow hasn't started.
 ---
@@ -54,9 +52,8 @@ After reading all files, output this session brief:
 ╚══════════════════════════════════════════╝
 Branch:     [current git branch]
 Task:       [task name from todo.md, or "No active task"]
-Step:       [step #] [step name] → run `/sk:[command]`
+Progress:   [N done] / [M total] checkboxes in todo.md
 Last done:  [last progress.md entry summary, 1 line]
-Pending:    [N] checkboxes remaining in todo.md
 Lessons:    [count] active — [most critical 1-liner from lessons.md]
 Open Qs:    [open questions from findings.md, or "none"]
 Tech Debt:  [N] unresolved — highest: [severity] ([file:line])
@@ -68,9 +65,8 @@ Product:    [value prop from vision.md, or "no vision.md found"]
 - **Branch:** Run `git branch --show-current` to get the current branch name.
 - **Task:** Extract from the first `# TODO —` line in `tasks/todo.md`. If the file doesn't exist or all checkboxes are done, show "No active task — ready to start fresh".
-- **Step:** Find the row containing `>> next <<` in `tasks/workflow-status.md`. Extract step number, name, and command. If no `>> next <<` found, show "Workflow complete" or "Not started".
+- **Progress:** Count `- [x]` (done) and `- [ ]` (pending) lines in `tasks/todo.md`. Stop counting at the first `## Verification`, `## Acceptance Criteria`, or `## Risks` heading (these are meta-sections, not tasks). Show `N done / M total`.
 - **Last done:** The most recent entry from `tasks/progress.md`. Summarize in one line.
-- **Pending:** Count `- [ ]` lines in `tasks/todo.md`. Stop counting at the first `## Verification`, `## Acceptance Criteria`, or `## Risks` heading (these are meta-sections, not tasks).
 - **Lessons:** Count `### [` headings in `tasks/lessons.md` (each lesson starts with `### [YYYY-MM-DD]`). Show the count + the **Prevention:** line from the most recent lesson.
 - **Open Qs:** Check for an "## Open Questions" section in `tasks/findings.md`. List them or say "none".
 - **Tech Debt:** Read `tasks/tech-debt.md` if it exists. Count entries that have no `Resolved:` line — each entry starts with `### [`. For unresolved entries, find the highest severity. Show `N unresolved — highest: [severity] ([file])`. If file missing or 0 unresolved, show `none`.
@@ -93,7 +89,7 @@ After outputting the session brief:
 | Scenario | Behavior |
 |----------|----------|
 | No `tasks/todo.md` | Show "No active task — ready to start fresh" |
-| No `tasks/workflow-status.md` | Show "Workflow not started" for Step field |
+| All checkboxes done in todo.md | Show "Task complete — 0 pending" for Progress field |
 | No `tasks/progress.md` | Show "No progress logged yet" for Last done |
 | No `tasks/findings.md` | Show "none" for Open Qs |
 | No `tasks/lessons.md` | Show "0 active" for Lessons |
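
The new **Progress** field replaces the old step lookup with a checkbox count. A minimal sketch of that counting rule follows, assuming the todo file is already loaded as a string; `countProgress` and `STOP_HEADINGS` are names invented for this example, and the package's own `parseTodo` in `server.js` may differ.

```javascript
// Headings named in the skill as meta-sections; counting stops at the first one.
const STOP_HEADINGS = ["## Verification", "## Acceptance Criteria", "## Risks"];

// Hypothetical helper: counts `- [x]` and `- [ ]` lines in a todo.md string.
function countProgress(todoMarkdown) {
  let done = 0;
  let pending = 0;
  for (const rawLine of todoMarkdown.split("\n")) {
    const line = rawLine.trim();
    if (STOP_HEADINGS.some((heading) => line.startsWith(heading))) break;
    if (line.startsWith("- [x]")) done++;
    else if (line.startsWith("- [ ]")) pending++;
  }
  return { done, total: done + pending };
}

const sample = [
  "- [x] Write failing tests",
  "- [ ] Implement handler",
  "## Verification",
  "- [ ] Manual smoke test", // ignored: comes after a stop heading
].join("\n");

const { done, total } = countProgress(sample);
console.log(`${done} done / ${total} total`); // 1 done / 2 total
```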

package/skills/sk:context-budget/SKILL.md ADDED

@@ -0,0 +1,126 @@
+---
+name: sk:context-budget
+description: "Audit context window token consumption and find optimization opportunities."
+---
+# /sk:context-budget — Token Consumption Audit
+Audits all components that consume context window tokens — agents, skills, rules, MCP tools, CLAUDE.md — and identifies optimization opportunities.
+## Usage
+```
+/sk:context-budget              # standard audit
+/sk:context-budget --verbose    # per-file breakdown
+```
+## Model Routing
+Read `.shipkit/config.json` from the project root if it exists.
+| Profile | Model |
+|---------|-------|
+| `full-sail` | haiku |
+| `quality` | haiku |
+| `balanced` | haiku |
+| `budget` | haiku |
+> Counting and classification is lightweight — haiku is sufficient.
+## How It Works
+### Phase 1: Inventory
+Scan and count token estimates for every loaded component:
+| Component | Location | Token Estimation |
+|-----------|----------|------------------|
+| CLAUDE.md | `CLAUDE.md` | `words * 1.3` |
+| Global CLAUDE.md | `~/.claude/CLAUDE.md` | `words * 1.3` |
+| Skills | `skills/*/SKILL.md` | `words * 1.3` |
+| Commands | `commands/**/*.md` | `words * 1.3` |
+| Agents | `.claude/agents/*.md` | `words * 1.3` |
+| Rules | `.claude/rules/*.md` | `words * 1.3` |
+| MCP tool schemas | count tools * ~500 tokens each | `tool_count * 500` |
+| Hooks | `.claude/hooks/*.sh` (minimal overhead) | `words * 1.3` |
+**Token estimation formula:**
+- Prose/markdown: `word_count * 1.3`
+- Code blocks: `char_count / 4`
+- MCP tool schemas: ~500 tokens per tool definition
+### Phase 2: Classify Usage Frequency
+For each component, classify how often it's actually needed:
+| Classification | Meaning | Action |
+|---------------|---------|--------|
+| **Always** | Loaded every session, always relevant | Keep as-is |
+| **Sometimes** | Relevant to specific task types | Consider conditional loading |
+| **Rarely** | Edge case, rarely triggered | Candidate for removal/extraction |
+Classification heuristics:
+- Skills used in the workflow (brainstorm, write-tests, gates, etc.) → Always
+- Skills triggered by keywords (frontend-design, api-design) → Sometimes
+- Niche skills (seo-audit, schema-migrate) → Rarely
+- MCP tools: if >20 tools on one server → flag as over-subscribed
+### Phase 3: Detect Issues
+Flag these common problems:
+1. **Bloated agents** — agent descriptions >200 lines
+2. **Bloated skills** — skill definitions >400 lines
+3. **Bloated rules** — rule files >100 lines
+4. **MCP over-subscription** — servers with >20 tools (each costs ~500 tokens)
+5. **CLI-wrapping MCPs** — MCP servers that just wrap CLI tools (overhead > benefit)
+6. **Duplicate content** — same instructions in CLAUDE.md AND skill files
+7. **CLAUDE.md bloat** — CLAUDE.md >200 lines (the target)
+8. **Unused components** — skills/agents never referenced in workflow
+### Phase 4: Report
+Output a structured report:
+```
+=== Context Budget Audit ===
+Component Breakdown:
+  CLAUDE.md              ~1,200 tokens
+  Global CLAUDE.md         ~800 tokens
+  Skills (42 files)     ~18,000 tokens
+  Commands (35 files)    ~8,000 tokens
+  Agents (8 files)       ~3,200 tokens
+  Rules (5 files)        ~1,500 tokens
+  MCP tools (3 servers)  ~15,000 tokens (30 tools)
+  ─────────────────────────────────
+  Total overhead:        ~47,700 tokens
+Context window:          200,000 tokens
+Overhead:                 47,700 tokens (23.8%)
+Available for work:      152,300 tokens
+Issues Found:
+  [HIGH]   MCP server "playwright" has 28 tools (~14,000 tokens)
+  [MEDIUM] Skill sk:frontend-design is 380 lines (~500 tokens)
+  [LOW]    Agent perf-auditor has 220 lines (~290 tokens)
+Top 3 Optimizations:
+  1. Remove unused MCP tools from playwright (save ~7,000 tokens)
+  2. Consolidate duplicate workflow instructions (save ~1,200 tokens)
+  3. Trim agent descriptions to <150 lines (save ~400 tokens)
+  Potential savings: ~8,600 tokens (18% reduction)
+```
+### --verbose Mode
+Adds per-file token breakdown:
+```
+Skills Breakdown:
+  sk:autopilot/SKILL.md        ~620 tokens
+  sk:brainstorming/SKILL.md    ~480 tokens
+  sk:gates/SKILL.md            ~440 tokens
+  ...
+```
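
The token estimation formula in Phase 1 (`word_count * 1.3` for prose, `char_count / 4` for code blocks, about 500 tokens per MCP tool) could be sketched as below. The function names are hypothetical, and two details are assumptions: fence lines themselves are not counted, and newline characters inside code blocks are ignored.

```javascript
// Built at runtime so this example contains no literal fence inside its own code.
const FENCE = "`".repeat(3);

// Hypothetical estimator following the skill's stated formula.
function estimateTokens(markdown) {
  let proseWords = 0;
  let codeChars = 0;
  let inFence = false;
  for (const line of markdown.split("\n")) {
    if (line.trim().startsWith(FENCE)) {
      inFence = !inFence; // toggle on opening and closing fences
      continue;
    }
    if (inFence) codeChars += line.length;
    else proseWords += line.split(/\s+/).filter(Boolean).length;
  }
  return Math.round(proseWords * 1.3 + codeChars / 4);
}

// MCP tool schemas: roughly 500 tokens per tool definition.
function estimateMcpTokens(toolCount) {
  return toolCount * 500;
}

const doc = ["one two three four five six seven eight nine ten", FENCE, "abcdefgh", FENCE].join("\n");
console.log(estimateTokens(doc));   // 15 (10 words * 1.3 + 8 chars / 4)
console.log(estimateMcpTokens(28)); // 14000
```

The second call reproduces the "28 tools (~14,000 tokens)" figure from the sample report.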

package/skills/sk:dashboard/SKILL.md CHANGED

@@ -29,9 +29,8 @@ PORT=4000 node skills/sk:dashboard/server.js
 ## What It Shows
 - **Swimlanes per worktree** — one row per worktree discovered via `git worktree list`
-- **Phase timeline** — workflow steps laid out as columns (Read, Explore, Plan, Branch, Tests, Implement, Lint, Verify, Security, Review, E2E, Finalize)
-- **Status indicators** — done, skipped, partial, in-progress, not yet
-- **Progress bars** — percentage of steps completed per worktree
+- **Phase timeline** — workflow steps laid out as columns (Explore, Design, Plan, Branch, Tests+Implement, Commit, Gates, Finalize)
+- **Progress bars** — percentage of todo.md checkboxes completed per worktree
 - **Current task** — the active task name from `tasks/todo.md`
 ## Architecture
@@ -39,7 +38,7 @@ PORT=4000 node skills/sk:dashboard/server.js
 Zero-dependency Node.js server. Uses only built-in modules (`http`, `fs`, `path`, `child_process`).
 - `server.js` serves the dashboard HTML and exposes `/api/status`
-- `/api/status` reads `tasks/workflow-status.md` and `tasks/todo.md` from each worktree, parses step statuses, and returns JSON
+- `/api/status` reads `tasks/todo.md` from each worktree, parses checkbox progress, and returns JSON
 - `dashboard.html` is a single-file UI (HTML + embedded CSS + JS) that polls `/api/status` every 3 seconds
 - Worktree discovery via `git worktree list`

package/skills/sk:dashboard/server.js CHANGED

@@ -7,9 +7,6 @@ const { execSync } = require("child_process");
 const PORT =
   parseInt(process.argv.find((_, i, a) => a[i - 1] === "--port") || process.env.PORT, 10) || 3333;
-const HARD_GATES = new Set([12, 14, 16, 20, 22]);
-const OPTIONALS = new Set([4, 5, 8, 18, 27]);
 function stripMd(s) {
   return (s || "").replace(/\*\*/g, "").replace(/`/g, "").trim();
 }
@@ -34,53 +31,6 @@ function discoverWorktrees() {
   }
 }
-function parseWorkflowStatus(worktreePath) {
-  const filePath = path.join(worktreePath, "tasks", "workflow-status.md");
-  try {
-    const lines = fs.readFileSync(filePath, "utf8").split("\n");
-    let headerFound = false;
-    let separatorSkipped = false;
-    const steps = [];
-    for (const line of lines) {
-      if (!headerFound) {
-        if (line.includes("| # |")) headerFound = true;
-        continue;
-      }
-      if (!separatorSkipped) {
-        separatorSkipped = true;
-        continue;
-      }
-      const cells = line.split("|").slice(1, -1).map((c) => c.trim());
-      if (cells.length < 3) continue;
-      const number = parseInt(cells[0], 10);
-      if (isNaN(number)) continue;
-      const rawStep = stripMd(cells[1]);
-      const cmdMatch = rawStep.match(/\((.+?)\)/);
-      const command = cmdMatch ? cmdMatch[1].trim() : "";
-      const name = rawStep.replace(/\s*\(.+?\)\s*/, "").trim();
-      steps.push({
-        number,
-        name,
-        command,
-        status: stripMd(cells[2]),
-        notes: stripMd(cells[3]),
-        isHardGate: HARD_GATES.has(number),
-        isOptional: OPTIONALS.has(number),
-      });
-    }
-    return steps;
-  } catch (err) {
-    if (err.code === "ENOENT") return [];
-    process.stderr.write(`Error parsing workflow-status.md: ${err.message}\n`);
-    return [];
-  }
-}
 const STOP_HEADERS = new Set(["Verification", "Acceptance Criteria", "Risks", "Change Log", "Summary"]);
 function parseTodo(worktreePath) {
@@ -137,18 +87,8 @@ function parseTodo(worktreePath) {
 function buildStatus() {
   const worktrees = discoverWorktrees();
   return worktrees.map((wt) => {
-    const steps = parseWorkflowStatus(wt.path);
     const todo = parseTodo(wt.path);
-    let currentStep = 0;
-    let totalDone = 0;
-    let totalSkipped = 0;
-    for (const s of steps) {
-      if (s.status === ">> next <<") currentStep = s.number;
-      if (s.status === "done") totalDone++;
-      if (s.status === "skipped") totalSkipped++;
-    }
     return {
       path: wt.path,
       branch: wt.branch,
@@ -156,11 +96,6 @@ function buildStatus() {
       todosDone: todo.todosDone,
       todosTotal: todo.todosTotal,
       todoItems: todo.todoItems,
-      currentStep,
-      totalDone,
-      totalSkipped,
-      totalSteps: steps.length,
-      steps,
     };
   });
 }
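
With the step fields removed, each entry returned by `buildStatus()` carries only todo progress (`todosDone`, `todosTotal`, `todoItems`). The dashboard UI is not part of this diff, so the following is only a guess at how a progress-bar percentage might be derived from those fields; `progressPercent` is a hypothetical name.

```javascript
// Hypothetical consumer of one buildStatus() entry.
// Only `todosDone` and `todosTotal` are read; both field names appear in the diff.
function progressPercent(worktreeStatus) {
  const { todosDone, todosTotal } = worktreeStatus;
  if (!todosTotal) return 0; // no checkboxes found: show an empty bar
  return Math.round((todosDone / todosTotal) * 100);
}

// Illustrative entry; the path and branch values are made up.
const entry = { path: "/tmp/example-worktree", branch: "feature/example", todosDone: 3, todosTotal: 4 };
console.log(progressPercent(entry)); // 75
```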

package/skills/sk:e2e/SKILL.md CHANGED

@@ -184,17 +184,17 @@ If any fail → apply Fix & Retest Protocol.
 When this gate requires a fix, classify it before committing:
-**a. Style/config/wording change** (CSS tweak, copy change, selector fix) → auto-commit with `fix(e2e): resolve failing E2E scenarios` and re-run `/sk:e2e`. Do not ask the user.
+**a. Style/config/wording change** (CSS tweak, copy change, selector fix) → include in the gate's squash commit and re-run `/sk:e2e`. Do not ask the user.
 **b. Logic change** (new branch, modified condition, new data path, query change, new function, API change) → trigger protocol:
 1. Update or add failing unit tests for the new behavior
 2. Re-run `/sk:test` — must pass at 100% coverage
-3. Auto-commit tests + fix together with `fix(e2e): [description]`.
+3. Commit tests + fix together with `fix(e2e): [description]`.
 4. Re-run `/sk:e2e` from scratch
 **Exception:** Formatter auto-fixes are never logic changes — bypass protocol automatically.
-Gates own their commits — the fix-commit-rerun loop is fully internal. No manual commit step needed after this gate.
+> Squash gate commits — collect all fixes for the pass, then one commit: `fix(e2e): resolve failing E2E scenarios`. Do not commit after each individual fix.
 **This gate cannot be skipped.** All scenarios must pass before proceeding to `/sk:update-task`.

package/skills/sk:eval/SKILL.md ADDED

@@ -0,0 +1,188 @@
+---
+name: sk:eval
+description: "Define, run, and report on evaluations for agent reliability and code quality."
+---
+# /sk:eval — Eval-Driven Development
+A formal evaluation framework for measuring agent reliability and code quality. Define evals before coding, check during implementation, and report after shipping.
+## Usage
+```
+/sk:eval define <feature>    # create eval definition
+/sk:eval check <feature>     # run evals against current state
+/sk:eval report              # summary of all eval results
+/sk:eval list                # show all defined evals
+```
+## Model Routing
+Read `.shipkit/config.json` from the project root if it exists.
+| Profile | Model |
+|---------|-------|
+| `full-sail` | sonnet |
+| `quality` | sonnet |
+| `balanced` | sonnet |
+| `budget` | haiku |
+> Eval analysis needs reasoning for model-based graders — sonnet for balanced+.
+## Eval Types
+### Capability Evals
+Test whether Claude can accomplish something new:
+- "Can it generate a valid migration from a schema description?"
+- "Can it write a test that covers all edge cases?"
+- "Can it refactor without changing behavior?"
+### Regression Evals
+Ensure changes don't break existing behavior:
+- "Does the login flow still work after auth refactor?"
+- "Do all API endpoints still return correct status codes?"
+- "Are all existing tests still passing?"
+## Grader Types
+### Code-Based (Deterministic)
+Graded by running commands — pass/fail:
+```yaml
+grader: code
+checks:
+  - command: "npm test"
+    expect: exit_code_0
+  - command: "grep -r 'TODO' src/"
+    expect: no_output
+  - command: "npx tsc --noEmit"
+    expect: exit_code_0
+```
+### Model-Based (LLM-as-Judge)
+Graded by an LLM against a rubric — scored 1-5:
+```yaml
+grader: model
+rubric: |
+  Score the implementation on:
+  1. Correctness — does it solve the stated problem?
+  2. Completeness — are all edge cases handled?
+  3. Code quality — is it readable and maintainable?
+  4. Security — are there any vulnerabilities?
+  5. Performance — any obvious inefficiencies?
+threshold: 4.0
+```
+### Human (Manual Review)
+Flagged for human review — generates a checklist:
+```yaml
+grader: human
+checklist:
+  - "UI renders correctly on mobile"
+  - "Error messages are user-friendly"
+  - "Animation feels smooth (60fps)"
+```
+## Metrics
+### pass@k
+At least 1 success in k attempts. Used for capability evals where some variance is expected.
+```
+pass@3: Run the eval 3 times. Pass if at least 1 succeeds.
+```
+### pass^k
+ALL k attempts must succeed. Used for regression evals where consistency is required.
+```
+pass^3: Run the eval 3 times. Pass only if all 3 succeed.
+```
+## Storage
+### Eval Definition
+Stored in `.claude/evals/[feature].md`:
+```markdown
+---
+feature: user-authentication
+type: capability
+grader: code
+created: 2026-03-25
+pass_metric: pass@1
+---
+## Description
+Verify the OAuth2 login flow works end-to-end.
+## Checks
+- [ ] `npm test -- --grep "auth"` passes
+- [ ] `curl -s localhost:3000/auth/google` returns 302
+- [ ] `grep -r "hardcoded.*secret" src/` returns nothing
+## History
+| Date | Result | Score | Notes |
+|------|--------|-------|-------|
+```
+### Eval Results
+Appended to `.claude/evals/[feature].log`:
+```
+[2026-03-25T10:30:00Z] PASS — pass@1 (1/1 succeeded)
+  check_1: npm test (exit 0) ✓
+  check_2: curl auth redirect (302) ✓
+  check_3: no hardcoded secrets ✓
+```
+## Workflow Integration
+### Before Coding (define)
+```
+/sk:eval define user-authentication
+```
+Creates the eval definition with checks derived from the task requirements.
+### During Implementation (check)
+```
+/sk:eval check user-authentication
+```
+Runs all checks and reports pass/fail. Use during step 5 (Write Tests + Implement) to verify progress.
+### After Shipping (report)
+```
+/sk:eval report
+```
+Summary of all evals:
+```
+=== Eval Report ===
+  user-authentication    PASS  pass@1  (3 checks, 3 passed)
+  api-v2-endpoints       PASS  pass^3  (5 checks, 5 passed x3)
+  queue-reliability      FAIL  pass@3  (2 checks, 0/3 succeeded)
+  Overall: 2/3 passing (67%)
+  Action: queue-reliability needs investigation
+```
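
The two metrics defined in the new eval skill differ only in how per-attempt results are combined: `pass@k` needs at least one success, `pass^k` needs all of them. A minimal sketch, assuming each attempt is recorded as a boolean; the function names are hypothetical, and treating an empty run as a failure is an assumption.

```javascript
// pass@k: at least one of the k attempts succeeded.
function passAtK(results) {
  return results.some(Boolean);
}

// pass^k: every one of the k attempts succeeded.
// Assumption: zero attempts counts as a failure rather than a vacuous pass.
function passAllK(results) {
  return results.length > 0 && results.every(Boolean);
}

console.log(passAtK([false, true, false]));  // true
console.log(passAllK([true, true, false]));  // false
console.log(passAllK([true, true, true]));   // true
```

Under this reading, the `queue-reliability` row in the sample report fails `pass@3` because none of its three attempts succeeded.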