@kennethsolomon/shipkit 3.10.1 → 3.11.0
- package/README.md +121 -49
- package/commands/sk/autopilot.md +2 -2
- package/commands/sk/context-budget.md +5 -0
- package/commands/sk/eval.md +5 -0
- package/commands/sk/health.md +5 -0
- package/commands/sk/help.md +32 -8
- package/commands/sk/learn.md +5 -0
- package/commands/sk/resume-session.md +5 -0
- package/commands/sk/safety-guard.md +5 -0
- package/commands/sk/save-session.md +5 -0
- package/commands/sk/security-check.md +2 -2
- package/commands/sk/set-profile.md +8 -0
- package/commands/sk/status.md +4 -9
- package/package.json +1 -1
- package/skills/sk:accessibility/SKILL.md +10 -1
- package/skills/sk:autopilot/SKILL.md +26 -45
- package/skills/sk:brainstorming/SKILL.md +13 -0
- package/skills/sk:context/SKILL.md +11 -15
- package/skills/sk:context-budget/SKILL.md +126 -0
- package/skills/sk:dashboard/SKILL.md +3 -4
- package/skills/sk:dashboard/server.js +0 -65
- package/skills/sk:e2e/SKILL.md +3 -3
- package/skills/sk:eval/SKILL.md +188 -0
- package/skills/sk:fast-track/SKILL.md +0 -9
- package/skills/sk:frontend-design/SKILL.md +232 -0
- package/skills/sk:gates/SKILL.md +2 -3
- package/skills/sk:health/SKILL.md +146 -0
- package/skills/sk:learn/SKILL.md +138 -0
- package/skills/sk:lint/SKILL.md +3 -3
- package/skills/sk:perf/SKILL.md +3 -3
- package/skills/sk:resume-session/SKILL.md +95 -0
- package/skills/sk:retro/SKILL.md +1 -2
- package/skills/sk:review/SKILL.md +2 -2
- package/skills/sk:safety-guard/SKILL.md +134 -0
- package/skills/sk:save-session/SKILL.md +84 -0
- package/skills/sk:setup-claude/SKILL.md +40 -4
- package/skills/sk:setup-claude/scripts/__pycache__/apply_setup_claude.cpython-314.pyc +0 -0
- package/skills/sk:setup-claude/scripts/apply_setup_claude.py +0 -1
- package/skills/sk:setup-claude/templates/.claude/settings.json.template +110 -26
- package/skills/sk:setup-claude/templates/.claude/statusline.sh +1 -15
- package/skills/sk:setup-claude/templates/CLAUDE.md.template +69 -138
- package/skills/sk:setup-claude/templates/commands/brainstorm.md.template +2 -13
- package/skills/sk:setup-claude/templates/hooks/config-protection.sh +71 -0
- package/skills/sk:setup-claude/templates/hooks/console-log-warning.sh +42 -0
- package/skills/sk:setup-claude/templates/hooks/cost-tracker.sh +26 -0
- package/skills/sk:setup-claude/templates/hooks/post-edit-format.sh +53 -0
- package/skills/sk:setup-claude/templates/hooks/pre-compact.sh +1 -12
- package/skills/sk:setup-claude/templates/hooks/safety-guard.sh +72 -0
- package/skills/sk:setup-claude/templates/hooks/session-start.sh +0 -11
- package/skills/sk:setup-claude/templates/hooks/session-stop.sh +0 -7
- package/skills/sk:setup-claude/templates/hooks/suggest-compact.sh +35 -0
- package/skills/sk:setup-claude/tests/__pycache__/test_apply_setup_claude.cpython-314.pyc +0 -0
- package/skills/sk:setup-claude/tests/test_apply_setup_claude.py +2 -33
- package/skills/sk:setup-optimizer/SKILL.md +68 -15
- package/skills/sk:start/SKILL.md +34 -11
- package/skills/sk:test/SKILL.md +3 -3
- package/skills/sk:setup-claude/templates/tasks/workflow-status.md.template +0 -28
@@ -1,13 +1,13 @@
 ---
 name: sk:autopilot
-description: Hands-free workflow — runs all
+description: Hands-free workflow — runs all 8 steps with auto-skip, auto-advance, auto-commit. Stops only for direction approval, 3-strike failures, and PR push.
 user_invocable: true
 allowed_tools: Read, Write, Bash, Glob, Grep, Agent, Skill
 ---

 # Autopilot Mode

-Hands-free workflow that executes all
+Hands-free workflow that executes all 8 steps of the ShipIt workflow with minimal interruptions. Same quality gates, same fix loops, same 100% coverage — just fewer stops.

 ## When to Use

@@ -23,30 +23,19 @@ Hands-free workflow that executes all 21 steps of the ShipIt workflow with minim

 ## Quality Guarantee

-Autopilot runs the EXACT same
+Autopilot runs the EXACT same 8 steps as manual mode:
 - ALL quality gates enforced (lint, test, security, perf, review, e2e)
-- ALL fix-
+- ALL fix-rerun loops active
 - 100% test coverage required on new code
 - 0 security issues required
 - The ONLY difference: auto-advance between steps instead of stopping

 ## Steps

-###
+### 1. Load Context + Brainstorm + Direction Approval (STOP — requires user input)

-Read `tasks/
-
-### 1. Load Context (auto — no prompt)
-
-- Read `tasks/todo.md`
-- Read `tasks/lessons.md` (apply all active lessons as constraints)
-- Read `tasks/findings.md` (if exists)
-- Read `tasks/tech-debt.md` (if exists)
-
-### 2. Brainstorm + Direction Approval (STOP — requires user input)
-
-Run brainstorm internally:
-- Explore the codebase (3 parallel Explore agents)
+- Read `tasks/todo.md`, `tasks/lessons.md`, `tasks/findings.md`, `tasks/tech-debt.md`
+- Run brainstorm internally (3 parallel Explore agents)
 - Propose 2-3 approaches with trade-offs

 **Present ONE direction summary and ask:**
@@ -57,40 +46,33 @@ Run brainstorm internally:

 Wait for explicit `y` before continuing. This is the ONLY planning stop.

+### 2. Design (auto-skip if no frontend/API keywords)
+
+Run `/sk:frontend-design` or `/sk:api-design` if applicable. Auto-skip if no frontend/API keywords detected. Log: `Auto-skipped: Design ([reason])`
+
 ### 3. Plan (auto-advance)

-Write the implementation plan to `tasks/todo.md`. Do NOT ask for plan approval — the direction approval in step
+Write the implementation plan to `tasks/todo.md`. Do NOT ask for plan approval — the direction approval in step 1 covers this.

 ### 4. Branch (auto-advance)

 Create feature branch auto-named from the task. Do NOT ask for confirmation.

-### 5.
+### 5. Write Tests + Implement (auto-advance)

-
--
--
--
-- **Performance (step 15)**: auto-skip if no frontend AND no database keywords
+- Run `/sk:write-tests` (TDD red phase)
+- Run `/sk:schema-migrate` if database keywords detected
+- Run `/sk:execute-plan` (TDD green phase)
+- Auto-advance when done

-
-
-### 6. Write Tests (auto-advance)
-
-Write failing tests based on the plan (TDD red phase). Auto-advance when done.
-
-### 7. Implement (auto-advance)
-
-Execute the plan — make failing tests pass. Use wave-based sub-agents for parallel work where possible.
-
-### 8. Commit (auto-commit)
+### 6. Commit (auto-commit)

 Auto-commit with conventional commit format. Do NOT ask for commit message approval.
 Format: `type(scope): description`

-###
+### 7. Gates (auto-advance on clean pass)

-Run all quality gates
+Run all quality gates via `/sk:gates`:
 1. Lint + dep audit
 2. Test (100% coverage)
 3. Security (0 issues)
@@ -98,9 +80,9 @@ Run all quality gates. Use `/sk:gates` if available, otherwise run sequentially:
 5. Review + simplify
 6. E2E

-Each gate auto-fixes and re-runs internally.
+Each gate auto-fixes and re-runs internally. Squash gate commits — one commit per gate pass.

-###
+### 8. PR Push (STOP — requires user confirmation)

 **This is the second mandatory stop.** Present:
 > "All gates passed. Ready to create PR.
@@ -110,11 +92,10 @@ Each gate auto-fixes and re-runs internally. Auto-advance to next gate on clean

 Wait for explicit confirmation — pushing is visible to others.

-
-
+After confirmation:
 - Create PR
 - Sync features (`/sk:features`)
-- Ask about release (
+- Ask about release (never auto-skipped)

 ## 3-Strike Protocol

@@ -128,9 +109,9 @@ If any step fails 3 times:

 | Stop | When | Why |
 |------|------|-----|
-| Direction approval | After brainstorm (step
+| Direction approval | After brainstorm (step 1) | User must approve the approach |
 | 3-strike failure | Any step fails 3x | Needs human judgment |
-| PR push | Before creating PR (step
+| PR push | Before creating PR (step 8) | Visible to others — always confirm |

 Everything else auto-advances.

|
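The autopilot commit step above names the subject format `type(scope): description` without showing how a subject could be checked against it. The following is a minimal sketch of such a check. The helper name `parseCommitSubject` and the exact character rules in the regular expression are assumptions made for illustration; they are not taken from ShipKit.

```javascript
// Hypothetical helper (not part of ShipKit): splits a commit subject written
// as `type(scope): description` into its three parts.
// Assumptions: type is lowercase letters only, scope has no spaces or
// parentheses, and the description is non-empty.
const COMMIT_SUBJECT = /^([a-z]+)\(([^()\s]+)\): (\S.*)$/;

function parseCommitSubject(subject) {
  const match = COMMIT_SUBJECT.exec(subject);
  if (!match) return null; // subject does not follow the format
  const [, type, scope, description] = match;
  return { type, scope, description };
}
```

For example, `parseCommitSubject("fix(e2e): resolve failing E2E scenarios")` yields the type `fix` and the scope `e2e`, while a subject such as `update stuff` yields `null`.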
@@ -74,6 +74,19 @@ digraph brainstorming {
 - Only one question per message - if a topic needs more exploration, break it into multiple questions
 - Focus on understanding: purpose, constraints, success criteria

+**Search-First Research (before proposing approaches):**
+Before proposing custom solutions, check if the problem is already solved:
+1. **Grep codebase** — does similar functionality already exist in this repo?
+2. **Check package registries** — is there a well-maintained package for this? (npm, PyPI, Packagist, crates.io)
+3. **Check existing skills** — does a ShipKit skill or MCP server already handle this?
+
+Decision matrix:
+- **Adopt** — existing solution covers 90%+ of requirements → use it directly
+- **Extend** — existing solution covers 60-90% → extend or wrap it
+- **Build custom** — nothing suitable exists → build from scratch (informed by what was found)
+
+If a suitable package or existing solution is found, include it as one of the approaches.
+
 **Exploring approaches:**
 - Propose 2-3 different approaches with trade-offs
 - Present options conversationally with your recommendation and reasoning
|
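The Adopt / Extend / Build custom decision matrix added in the hunk above can be read as a threshold function over how much of the requirements an existing solution covers. The sketch below is one possible reading. The function name `chooseApproach` and the 0-to-1 coverage scale are assumptions, and so is the tie-breaking: the source lists 90% under both Adopt and Extend and does not say what happens below 60%, so this version treats exactly 90% as Adopt and anything under 60% as Build custom.

```javascript
// Hypothetical helper (not part of ShipKit): maps requirement coverage of an
// existing solution, given as a fraction from 0 to 1, to a recommendation.
function chooseApproach(coverage) {
  if (!Number.isFinite(coverage) || coverage < 0 || coverage > 1) {
    throw new RangeError("coverage must be a number between 0 and 1");
  }
  if (coverage >= 0.9) return "adopt"; // covers 90%+ of requirements
  if (coverage >= 0.6) return "extend"; // covers 60-90%: extend or wrap it
  return "build-custom"; // assumption: below 60% counts as nothing suitable
}
```

A candidate package judged to cover three quarters of the requirements would therefore be proposed as an "extend" approach alongside the custom options.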
@@ -26,21 +26,19 @@ Load all project context files into the conversation and output a formatted sess
 | # | File | What to Extract |
 |---|------|-----------------|
 | 1 | `tasks/todo.md` | Task name (from `# TODO —` heading), milestone progress, count of `- [x]` (done) vs `- [ ]` (pending) checkboxes |
-| 2 | `tasks/
-| 3 | `tasks/
-| 4 | `tasks/
-| 5 | `
-| 6 | `docs/
-| 7 | `
-| 8 | `tasks/tech-debt.md` | If exists: count entries with no `Resolved:` line (unresolved), highest severity among unresolved |
+| 2 | `tasks/progress.md` | Last 5 entries only (most recent work). If file is large, read only the last 50 lines. |
+| 3 | `tasks/findings.md` | Current decisions, chosen approach, open questions |
+| 4 | `tasks/lessons.md` | All active lessons — read in full, apply as constraints for this session |
+| 5 | `docs/decisions.md` | If exists: last 3 ADR entries. If missing: note "no decisions log yet" |
+| 6 | `docs/vision.md` | If exists: product name + value proposition. If missing: note "no vision.md found" |
+| 7 | `tasks/tech-debt.md` | If exists: count entries with no `Resolved:` line (unresolved), highest severity among unresolved |

 ### Reading Strategy

-- Read files 1-
-- Files 6
+- Read files 1-4 first (these are the core context).
+- Files 5-6 are optional — check if they exist before reading.
 - For `tasks/progress.md`: only read the last 50 lines to avoid loading a huge file.
 - If `tasks/todo.md` is missing: the project has no active task.
-- If `tasks/workflow-status.md` is missing: the workflow hasn't started.

 ---

@@ -54,9 +52,8 @@ After reading all files, output this session brief:
 ╚══════════════════════════════════════════╝
 Branch: [current git branch]
 Task: [task name from todo.md, or "No active task"]
-
+Progress: [N done] / [M total] checkboxes in todo.md
 Last done: [last progress.md entry summary, 1 line]
-Pending: [N] checkboxes remaining in todo.md
 Lessons: [count] active — [most critical 1-liner from lessons.md]
 Open Qs: [open questions from findings.md, or "none"]
 Tech Debt: [N] unresolved — highest: [severity] ([file:line])
@@ -68,9 +65,8 @@ Product: [value prop from vision.md, or "no vision.md found"]

 - **Branch:** Run `git branch --show-current` to get the current branch name.
 - **Task:** Extract from the first `# TODO —` line in `tasks/todo.md`. If the file doesn't exist or all checkboxes are done, show "No active task — ready to start fresh".
-- **
+- **Progress:** Count `- [x]` (done) and `- [ ]` (pending) lines in `tasks/todo.md`. Stop counting at the first `## Verification`, `## Acceptance Criteria`, or `## Risks` heading (these are meta-sections, not tasks). Show `N done / M total`.
 - **Last done:** The most recent entry from `tasks/progress.md`. Summarize in one line.
-- **Pending:** Count `- [ ]` lines in `tasks/todo.md`. Stop counting at the first `## Verification`, `## Acceptance Criteria`, or `## Risks` heading (these are meta-sections, not tasks).
 - **Lessons:** Count `### [` headings in `tasks/lessons.md` (each lesson starts with `### [YYYY-MM-DD]`). Show the count + the **Prevention:** line from the most recent lesson.
 - **Open Qs:** Check for an "## Open Questions" section in `tasks/findings.md`. List them or say "none".
 - **Tech Debt:** Read `tasks/tech-debt.md` if it exists. Count entries that have no `Resolved:` line — each entry starts with `### [`. For unresolved entries, find the highest severity. Show `N unresolved — highest: [severity] ([file])`. If file missing or 0 unresolved, show `none`.
@@ -93,7 +89,7 @@ After outputting the session brief:
 | Scenario | Behavior |
 |----------|----------|
 | No `tasks/todo.md` | Show "No active task — ready to start fresh" |
-|
+| All checkboxes done in todo.md | Show "Task complete — 0 pending" for Progress field |
 | No `tasks/progress.md` | Show "No progress logged yet" for Last done |
 | No `tasks/findings.md` | Show "none" for Open Qs |
 | No `tasks/lessons.md` | Show "0 active" for Lessons |
|
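The new **Progress** field in the hunks above is defined by a counting rule: tally `- [x]` and `- [ ]` lines in `tasks/todo.md` and stop at the first `## Verification`, `## Acceptance Criteria`, or `## Risks` heading. A minimal sketch of that rule follows. The function name `countTodoProgress` is hypothetical, and matching an uppercase `[X]` as done is an assumption; the package's own `parseTodo` in `server.js` is not shown in full in this diff and may differ (its `STOP_HEADERS` set also lists `Change Log` and `Summary`).

```javascript
// Hypothetical sketch of the Progress rule from the SKILL.md text above.
// Only the three stop headings named in that text are used here.
const STOP_HEADINGS = ["## Verification", "## Acceptance Criteria", "## Risks"];

function countTodoProgress(markdown) {
  let done = 0;
  let pending = 0;
  for (const rawLine of markdown.split("\n")) {
    const line = rawLine.trim();
    // Meta-sections are not tasks: stop counting at the first one.
    if (STOP_HEADINGS.some((heading) => line.startsWith(heading))) break;
    if (/^- \[x\]/i.test(line)) done++; // assumption: "[X]" also counts as done
    else if (/^- \[ \]/.test(line)) pending++;
  }
  return { done, total: done + pending };
}
```

The result maps directly onto the `N done / M total` display; a file in which every counted checkbox is done gives `done === total`.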
@@ -0,0 +1,126 @@
+---
+name: sk:context-budget
+description: "Audit context window token consumption and find optimization opportunities."
+---
+
+# /sk:context-budget — Token Consumption Audit
+
+Audits all components that consume context window tokens — agents, skills, rules, MCP tools, CLAUDE.md — and identifies optimization opportunities.
+
+## Usage
+
+```
+/sk:context-budget # standard audit
+/sk:context-budget --verbose # per-file breakdown
+```
+
+## Model Routing
+
+Read `.shipkit/config.json` from the project root if it exists.
+
+| Profile | Model |
+|---------|-------|
+| `full-sail` | haiku |
+| `quality` | haiku |
+| `balanced` | haiku |
+| `budget` | haiku |
+
+> Counting and classification is lightweight — haiku is sufficient.
+
+## How It Works
+
+### Phase 1: Inventory
+
+Scan and count token estimates for every loaded component:
+
+| Component | Location | Token Estimation |
+|-----------|----------|------------------|
+| CLAUDE.md | `CLAUDE.md` | `words * 1.3` |
+| Global CLAUDE.md | `~/.claude/CLAUDE.md` | `words * 1.3` |
+| Skills | `skills/*/SKILL.md` | `words * 1.3` |
+| Commands | `commands/**/*.md` | `words * 1.3` |
+| Agents | `.claude/agents/*.md` | `words * 1.3` |
+| Rules | `.claude/rules/*.md` | `words * 1.3` |
+| MCP tool schemas | count tools * ~500 tokens each | `tool_count * 500` |
+| Hooks | `.claude/hooks/*.sh` (minimal overhead) | `words * 1.3` |
+
+**Token estimation formula:**
+- Prose/markdown: `word_count * 1.3`
+- Code blocks: `char_count / 4`
+- MCP tool schemas: ~500 tokens per tool definition
+
+### Phase 2: Classify Usage Frequency
+
+For each component, classify how often it's actually needed:
+
+| Classification | Meaning | Action |
+|---------------|---------|--------|
+| **Always** | Loaded every session, always relevant | Keep as-is |
+| **Sometimes** | Relevant to specific task types | Consider conditional loading |
+| **Rarely** | Edge case, rarely triggered | Candidate for removal/extraction |
+
+Classification heuristics:
+- Skills used in the workflow (brainstorm, write-tests, gates, etc.) → Always
+- Skills triggered by keywords (frontend-design, api-design) → Sometimes
+- Niche skills (seo-audit, schema-migrate) → Rarely
+- MCP tools: if >20 tools on one server → flag as over-subscribed
+
+### Phase 3: Detect Issues
+
+Flag these common problems:
+
+1. **Bloated agents** — agent descriptions >200 lines
+2. **Bloated skills** — skill definitions >400 lines
+3. **Bloated rules** — rule files >100 lines
+4. **MCP over-subscription** — servers with >20 tools (each costs ~500 tokens)
+5. **CLI-wrapping MCPs** — MCP servers that just wrap CLI tools (overhead > benefit)
+6. **Duplicate content** — same instructions in CLAUDE.md AND skill files
+7. **CLAUDE.md bloat** — CLAUDE.md >200 lines (the target)
+8. **Unused components** — skills/agents never referenced in workflow
+
+### Phase 4: Report
+
+Output a structured report:
+
+```
+=== Context Budget Audit ===
+
+Component Breakdown:
+CLAUDE.md ~1,200 tokens
+Global CLAUDE.md ~800 tokens
+Skills (42 files) ~18,000 tokens
+Commands (35 files) ~8,000 tokens
+Agents (8 files) ~3,200 tokens
+Rules (5 files) ~1,500 tokens
+MCP tools (3 servers) ~15,000 tokens (30 tools)
+─────────────────────────────────
+Total overhead: ~47,700 tokens
+
+Context window: 200,000 tokens
+Overhead: 47,700 tokens (23.8%)
+Available for work: 152,300 tokens
+
+Issues Found:
+[HIGH] MCP server "playwright" has 28 tools (~14,000 tokens)
+[MEDIUM] Skill sk:frontend-design is 380 lines (~500 tokens)
+[LOW] Agent perf-auditor has 220 lines (~290 tokens)
+
+Top 3 Optimizations:
+1. Remove unused MCP tools from playwright (save ~7,000 tokens)
+2. Consolidate duplicate workflow instructions (save ~1,200 tokens)
+3. Trim agent descriptions to <150 lines (save ~400 tokens)
+
+Potential savings: ~8,600 tokens (18% reduction)
+```
+
+### --verbose Mode
+
+Adds per-file token breakdown:
+
+```
+Skills Breakdown:
+sk:autopilot/SKILL.md ~620 tokens
+sk:brainstorming/SKILL.md ~480 tokens
+sk:gates/SKILL.md ~440 tokens
+...
+```
|
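The token estimation formula in the new `sk:context-budget` skill above combines three rules: `word_count * 1.3` for prose, `char_count / 4` for code blocks, and roughly 500 tokens per MCP tool definition. The sketch below applies those three rules as written. The function name `estimateTokens`, its argument shape, and the final rounding are assumptions for illustration; the skill itself is a prompt file and ships no such function.

```javascript
// Hypothetical sketch of the skill's estimation rules (a heuristic, not a
// real tokenizer). All parameters are optional and default to "nothing".
function estimateTokens({ prose = "", code = "", mcpToolCount = 0 } = {}) {
  const trimmed = prose.trim();
  const wordCount = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
  const proseTokens = wordCount * 1.3; // prose/markdown: word_count * 1.3
  const codeTokens = code.length / 4; // code blocks: char_count / 4
  const mcpTokens = mcpToolCount * 500; // ~500 tokens per tool definition
  return Math.round(proseTokens + codeTokens + mcpTokens);
}
```

With 30 MCP tools and no other input this gives 15,000 tokens, which matches the "MCP tools (3 servers) ~15,000 tokens (30 tools)" line in the sample report.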
@@ -29,9 +29,8 @@ PORT=4000 node skills/sk:dashboard/server.js
 ## What It Shows

 - **Swimlanes per worktree** — one row per worktree discovered via `git worktree list`
-- **Phase timeline** — workflow steps laid out as columns (
-- **
-- **Progress bars** — percentage of steps completed per worktree
+- **Phase timeline** — workflow steps laid out as columns (Explore, Design, Plan, Branch, Tests+Implement, Commit, Gates, Finalize)
+- **Progress bars** — percentage of todo.md checkboxes completed per worktree
 - **Current task** — the active task name from `tasks/todo.md`

 ## Architecture
@@ -39,7 +38,7 @@ PORT=4000 node skills/sk:dashboard/server.js
 Zero-dependency Node.js server. Uses only built-in modules (`http`, `fs`, `path`, `child_process`).

 - `server.js` serves the dashboard HTML and exposes `/api/status`
-- `/api/status` reads `tasks/
+- `/api/status` reads `tasks/todo.md` from each worktree, parses checkbox progress, and returns JSON
 - `dashboard.html` is a single-file UI (HTML + embedded CSS + JS) that polls `/api/status` every 3 seconds
 - Worktree discovery via `git worktree list`

@@ -7,9 +7,6 @@ const { execSync } = require("child_process");
 const PORT =
   parseInt(process.argv.find((_, i, a) => a[i - 1] === "--port") || process.env.PORT, 10) || 3333;

-const HARD_GATES = new Set([12, 14, 16, 20, 22]);
-const OPTIONALS = new Set([4, 5, 8, 18, 27]);
-
 function stripMd(s) {
   return (s || "").replace(/\*\*/g, "").replace(/`/g, "").trim();
 }
@@ -34,53 +31,6 @@ function discoverWorktrees() {
   }
 }

-function parseWorkflowStatus(worktreePath) {
-  const filePath = path.join(worktreePath, "tasks", "workflow-status.md");
-  try {
-    const lines = fs.readFileSync(filePath, "utf8").split("\n");
-
-    let headerFound = false;
-    let separatorSkipped = false;
-    const steps = [];
-
-    for (const line of lines) {
-      if (!headerFound) {
-        if (line.includes("| # |")) headerFound = true;
-        continue;
-      }
-      if (!separatorSkipped) {
-        separatorSkipped = true;
-        continue;
-      }
-      const cells = line.split("|").slice(1, -1).map((c) => c.trim());
-      if (cells.length < 3) continue;
-
-      const number = parseInt(cells[0], 10);
-      if (isNaN(number)) continue;
-
-      const rawStep = stripMd(cells[1]);
-      const cmdMatch = rawStep.match(/\((.+?)\)/);
-      const command = cmdMatch ? cmdMatch[1].trim() : "";
-      const name = rawStep.replace(/\s*\(.+?\)\s*/, "").trim();
-
-      steps.push({
-        number,
-        name,
-        command,
-        status: stripMd(cells[2]),
-        notes: stripMd(cells[3]),
-        isHardGate: HARD_GATES.has(number),
-        isOptional: OPTIONALS.has(number),
-      });
-    }
-    return steps;
-  } catch (err) {
-    if (err.code === "ENOENT") return [];
-    process.stderr.write(`Error parsing workflow-status.md: ${err.message}\n`);
-    return [];
-  }
-}
-
 const STOP_HEADERS = new Set(["Verification", "Acceptance Criteria", "Risks", "Change Log", "Summary"]);

 function parseTodo(worktreePath) {
@@ -137,18 +87,8 @@ function parseTodo(worktreePath) {
 function buildStatus() {
   const worktrees = discoverWorktrees();
   return worktrees.map((wt) => {
-    const steps = parseWorkflowStatus(wt.path);
     const todo = parseTodo(wt.path);

-    let currentStep = 0;
-    let totalDone = 0;
-    let totalSkipped = 0;
-    for (const s of steps) {
-      if (s.status === ">> next <<") currentStep = s.number;
-      if (s.status === "done") totalDone++;
-      if (s.status === "skipped") totalSkipped++;
-    }
-
     return {
       path: wt.path,
       branch: wt.branch,
@@ -156,11 +96,6 @@ function buildStatus() {
       todosDone: todo.todosDone,
       todosTotal: todo.todosTotal,
       todoItems: todo.todoItems,
-      currentStep,
-      totalDone,
-      totalSkipped,
-      totalSteps: steps.length,
-      steps,
     };
   });
 }
package/skills/sk:e2e/SKILL.md
CHANGED
@@ -184,17 +184,17 @@ If any fail → apply Fix & Retest Protocol.

 When this gate requires a fix, classify it before committing:

-**a. Style/config/wording change** (CSS tweak, copy change, selector fix) →
+**a. Style/config/wording change** (CSS tweak, copy change, selector fix) → include in the gate's squash commit and re-run `/sk:e2e`. Do not ask the user.

 **b. Logic change** (new branch, modified condition, new data path, query change, new function, API change) → trigger protocol:
 1. Update or add failing unit tests for the new behavior
 2. Re-run `/sk:test` — must pass at 100% coverage
-3.
+3. Commit tests + fix together with `fix(e2e): [description]`.
 4. Re-run `/sk:e2e` from scratch

 **Exception:** Formatter auto-fixes are never logic changes — bypass protocol automatically.

-
+> Squash gate commits — collect all fixes for the pass, then one commit: `fix(e2e): resolve failing E2E scenarios`. Do not commit after each individual fix.

 **This gate cannot be skipped.** All scenarios must pass before proceeding to `/sk:update-task`.

@@ -0,0 +1,188 @@
+---
+name: sk:eval
+description: "Define, run, and report on evaluations for agent reliability and code quality."
+---
+
+# /sk:eval — Eval-Driven Development
+
+A formal evaluation framework for measuring agent reliability and code quality. Define evals before coding, check during implementation, and report after shipping.
+
+## Usage
+
+```
+/sk:eval define <feature> # create eval definition
+/sk:eval check <feature> # run evals against current state
+/sk:eval report # summary of all eval results
+/sk:eval list # show all defined evals
+```
+
+## Model Routing
+
+Read `.shipkit/config.json` from the project root if it exists.
+
+| Profile | Model |
+|---------|-------|
+| `full-sail` | sonnet |
+| `quality` | sonnet |
+| `balanced` | sonnet |
+| `budget` | haiku |
+
+> Eval analysis needs reasoning for model-based graders — sonnet for balanced+.
+
+## Eval Types
+
+### Capability Evals
+
+Test whether Claude can accomplish something new:
+
+- "Can it generate a valid migration from a schema description?"
+- "Can it write a test that covers all edge cases?"
+- "Can it refactor without changing behavior?"
+
+### Regression Evals
+
+Ensure changes don't break existing behavior:
+
+- "Does the login flow still work after auth refactor?"
+- "Do all API endpoints still return correct status codes?"
+- "Are all existing tests still passing?"
+
+## Grader Types
+
+### Code-Based (Deterministic)
+
+Graded by running commands — pass/fail:
+
+```yaml
+grader: code
+checks:
+  - command: "npm test"
+    expect: exit_code_0
+  - command: "grep -r 'TODO' src/"
+    expect: no_output
+  - command: "npx tsc --noEmit"
+    expect: exit_code_0
+```
+
+### Model-Based (LLM-as-Judge)
+
+Graded by an LLM against a rubric — scored 1-5:
+
+```yaml
+grader: model
+rubric: |
+  Score the implementation on:
+  1. Correctness — does it solve the stated problem?
+  2. Completeness — are all edge cases handled?
+  3. Code quality — is it readable and maintainable?
+  4. Security — are there any vulnerabilities?
+  5. Performance — any obvious inefficiencies?
+threshold: 4.0
+```
+
+### Human (Manual Review)
+
+Flagged for human review — generates a checklist:
+
+```yaml
+grader: human
+checklist:
+  - "UI renders correctly on mobile"
+  - "Error messages are user-friendly"
+  - "Animation feels smooth (60fps)"
+```
+
+## Metrics
+
+### pass@k
+
+At least 1 success in k attempts. Used for capability evals where some variance is expected.
+
+```
+pass@3: Run the eval 3 times. Pass if at least 1 succeeds.
+```
+
+### pass^k
+
+ALL k attempts must succeed. Used for regression evals where consistency is required.
+
+```
+pass^3: Run the eval 3 times. Pass only if all 3 succeed.
+```
+
+## Storage
+
+### Eval Definition
+
+Stored in `.claude/evals/[feature].md`:
+
+```markdown
+---
+feature: user-authentication
+type: capability
+grader: code
+created: 2026-03-25
+pass_metric: pass@1
+---
+
+## Description
+Verify the OAuth2 login flow works end-to-end.
+
+## Checks
+- [ ] `npm test -- --grep "auth"` passes
+- [ ] `curl -s localhost:3000/auth/google` returns 302
+- [ ] `grep -r "hardcoded.*secret" src/` returns nothing
+
+## History
+| Date | Result | Score | Notes |
+|------|--------|-------|-------|
+```
+
+### Eval Results
+
+Appended to `.claude/evals/[feature].log`:
+
+```
+[2026-03-25T10:30:00Z] PASS — pass@1 (1/1 succeeded)
+check_1: npm test (exit 0) ✓
+check_2: curl auth redirect (302) ✓
+check_3: no hardcoded secrets ✓
+```
+
+## Workflow Integration
+
+### Before Coding (define)
+
+```
+/sk:eval define user-authentication
+```
+
+Creates the eval definition with checks derived from the task requirements.
+
+### During Implementation (check)
+
+```
+/sk:eval check user-authentication
+```
+
+Runs all checks and reports pass/fail. Use during step 5 (Write Tests + Implement) to verify progress.
+
+### After Shipping (report)
+
+```
+/sk:eval report
+```
+
+Summary of all evals:
+
+```
+=== Eval Report ===
+
+user-authentication PASS pass@1 (3 checks, 3 passed)
+api-v2-endpoints PASS pass^3 (5 checks, 5 passed x3)
+queue-reliability FAIL pass@3 (2 checks, 0/3 succeeded)
+
+Overall: 2/3 passing (67%)
+
+Action: queue-reliability needs investigation
+```
|
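The two metrics defined in the new `sk:eval` skill above differ only in how the k attempt results are combined: `pass@k` passes if at least one attempt succeeds, `pass^k` only if all of them do. The sketch below scores a list of attempt outcomes against a metric string such as `pass@3` or `pass^3`. The function name `scoreAttempts` and the shape of its return value are assumptions; how the skill actually runs and records attempts is not specified in the diff.

```javascript
// Hypothetical sketch: applies the pass@k / pass^k definitions from the skill
// text to an array of per-attempt booleans (true = attempt succeeded).
function scoreAttempts(metric, results) {
  const parsed = /^pass([@^])(\d+)$/.exec(metric);
  if (!parsed) throw new Error(`unknown metric: ${metric}`);
  const k = Number(parsed[2]);
  if (k < 1 || results.length !== k) {
    throw new Error(`${metric} expects exactly ${k} attempt result(s)`);
  }
  const succeeded = results.filter(Boolean).length;
  const passed =
    parsed[1] === "@"
      ? succeeded >= 1 // pass@k: at least 1 success in k attempts
      : succeeded === k; // pass^k: all k attempts must succeed
  return { passed, succeeded, attempts: k };
}
```

The same three outcomes, two failures followed by one success, pass under `pass@3` and fail under `pass^3`, which is why the skill suggests the first for capability evals and the second for regression evals.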