codebyplan 1.13.53 → 1.13.55

Files changed (84)
  1. package/dist/cli.js +1364 -352
  2. package/package.json +1 -1
  3. package/templates/agents/cbp-database-agent.md +1 -1
  4. package/templates/agents/cbp-e2e-maestro.md +1 -1
  5. package/templates/agents/cbp-e2e-playwright.md +24 -16
  6. package/templates/agents/cbp-e2e-tauri.md +1 -1
  7. package/templates/agents/cbp-e2e-vscode.md +1 -1
  8. package/templates/agents/cbp-e2e-xcuitest.md +1 -1
  9. package/templates/agents/cbp-improve-claude.md +2 -2
  10. package/templates/agents/{cbp-round-executor.md → cbp-round-builder.md} +23 -23
  11. package/templates/agents/{cbp-task-planner.md → cbp-round-planner.md} +26 -25
  12. package/templates/agents/cbp-security-agent.md +1 -1
  13. package/templates/agents/cbp-stripe-agent.md +2 -2
  14. package/templates/agents/cbp-testing-qa-agent.md +11 -11
  15. package/templates/agents/cbp-verify-reviewer.md +236 -0
  16. package/templates/context/architecture-map.md +4 -4
  17. package/templates/context/mcp-docs.md +57 -11
  18. package/templates/context/testing/e2e.md +9 -9
  19. package/templates/github-workflows/ci.yml +58 -0
  20. package/templates/hooks/cbp-skill-context-guard.sh +1 -1
  21. package/templates/hooks/cbp-test-hooks.sh +9 -9
  22. package/templates/hooks/validate-structure-lengths.sh +1 -1
  23. package/templates/hooks/validate-structure-patterns.sh +1 -1
  24. package/templates/rules/README.md +1 -2
  25. package/templates/rules/agent-claim-verification.md +1 -1
  26. package/templates/rules/context-file-loading.md +10 -10
  27. package/templates/rules/development-workflow.md +73 -0
  28. package/templates/rules/e2e-mandatory.md +8 -8
  29. package/templates/rules/execution-proof.md +70 -0
  30. package/templates/rules/model-invocation-convention.md +2 -2
  31. package/templates/rules/parallel-waves.md +11 -11
  32. package/templates/rules/spawn-failure-is-gate-failure.md +76 -0
  33. package/templates/rules/task-routing-recommendation.md +1 -1
  34. package/templates/rules/todo-backend.md +3 -3
  35. package/templates/rules/two-tier-ci.md +63 -0
  36. package/templates/settings.project.base.json +8 -10
  37. package/templates/skills/cbp-build-cc-mode/SKILL.md +1 -1
  38. package/templates/skills/cbp-build-cc-settings/reference/cbp-permission-policy.md +7 -7
  39. package/templates/skills/cbp-build-cc-skill/SKILL.md +1 -1
  40. package/templates/skills/cbp-build-cc-skill/reference/cbp-quality.md +2 -2
  41. package/templates/skills/cbp-build-cc-skill/reference/fork-eligibility.md +11 -14
  42. package/templates/skills/cbp-checkpoint-check/SKILL.md +2 -2
  43. package/templates/skills/cbp-checkpoint-create/SKILL.md +16 -1
  44. package/templates/skills/cbp-checkpoint-update/SKILL.md +3 -3
  45. package/templates/skills/cbp-clear-continue/SKILL.md +2 -2
  46. package/templates/skills/cbp-clear-prep/SKILL.md +3 -3
  47. package/templates/skills/{cbp-task-complete → cbp-finalize}/SKILL.md +25 -29
  48. package/templates/skills/{cbp-task-complete → cbp-finalize}/reference/checkpoint-done-branching.md +1 -1
  49. package/templates/skills/{cbp-task-complete → cbp-finalize}/reference/next-step-heuristic.md +1 -1
  50. package/templates/skills/cbp-frontend-design/SKILL.md +1 -1
  51. package/templates/skills/cbp-frontend-ui/SKILL.md +7 -7
  52. package/templates/skills/cbp-git-commit/SKILL.md +3 -3
  53. package/templates/skills/cbp-merge-main/SKILL.md +4 -4
  54. package/templates/skills/{cbp-round-execute → cbp-round-build}/SKILL.md +93 -75
  55. package/templates/skills/cbp-round-complete/SKILL.md +15 -14
  56. package/templates/skills/cbp-round-plan/SKILL.md +344 -0
  57. package/templates/skills/cbp-session-end/SKILL.md +1 -1
  58. package/templates/skills/cbp-ship-main/SKILL.md +3 -2
  59. package/templates/skills/cbp-standalone-task-check/SKILL.md +10 -9
  60. package/templates/skills/cbp-standalone-task-complete/SKILL.md +12 -13
  61. package/templates/skills/cbp-standalone-task-create/SKILL.md +16 -9
  62. package/templates/skills/cbp-standalone-task-start/SKILL.md +9 -5
  63. package/templates/skills/cbp-standalone-task-testing/SKILL.md +5 -5
  64. package/templates/skills/cbp-task-create/SKILL.md +6 -7
  65. package/templates/skills/cbp-task-start/SKILL.md +8 -8
  66. package/templates/skills/cbp-todo/SKILL.md +6 -8
  67. package/templates/skills/cbp-verify/SKILL.md +146 -0
  68. package/templates/skills/cbp-verify/reference/deterministic-gates.md +114 -0
  69. package/templates/skills/{cbp-round-end → cbp-verify}/reference/findings-presentation.md +16 -12
  70. package/templates/skills/cbp-verify/reference/round-scope.md +62 -0
  71. package/templates/skills/cbp-verify/reference/task-scope.md +71 -0
  72. package/templates/agents/cbp-improve-round.md +0 -283
  73. package/templates/agents/cbp-task-check.md +0 -217
  74. package/templates/skills/cbp-round-check/SKILL.md +0 -134
  75. package/templates/skills/cbp-round-end/SKILL.md +0 -173
  76. package/templates/skills/cbp-round-end/reference/inline-fallback.md +0 -35
  77. package/templates/skills/cbp-round-execute/reference/inline-fallback.md +0 -55
  78. package/templates/skills/cbp-round-input/SKILL.md +0 -197
  79. package/templates/skills/cbp-round-start/SKILL.md +0 -261
  80. package/templates/skills/cbp-round-update/SKILL.md +0 -120
  81. package/templates/skills/cbp-ship/templates/workflow-eas-submit.yml +0 -53
  82. package/templates/skills/cbp-ship/templates/workflow-vsce-publish.yml +0 -31
  83. package/templates/skills/cbp-task-check/SKILL.md +0 -172
  84. package/templates/skills/cbp-task-testing/SKILL.md +0 -279
@@ -1,279 +0,0 @@
1
- ---
2
- scope: org-shared
3
- name: cbp-task-testing
4
- description: Run comprehensive task-level testing after /cbp-task-check passes
5
- argument-hint: [chk-task]
6
- triggers: [cbp-task-complete, cbp-round-input]
7
- effort: xhigh
8
- ---
9
-
10
- # Task Testing Command
11
-
12
- Comprehensive task-level testing — runs all automated tests and walks the user through manual testing one-by-one. Distinct from round-level testing (`testing-qa-agent`): this tests the **entire delivered feature holistically** after all rounds are complete. Runs inline — no sub-agent.
13
-
14
- ## When Used
15
-
16
- - After `/cbp-task-check` passes with READY verdict (auto-triggered)
17
- - Before `/cbp-task-complete`
18
- - **Never skippable**
19
-
20
- ## Scope vs Round-Level Validation
21
-
22
- Per-wave `testing-qa-agent` runs inside `/cbp-round-execute` Step 5. This skill adds the cross-cutting layer that is only visible across the full task diff: whole-repo lint, whole-repo typecheck, full test suite, `pnpm audit` (via `codebyplan check --scope task --json`), and full-diff security scan — each run once here, not per-round.
23
-
24
- ## Instructions
25
-
26
- ### Step 1: Parse `$ARGUMENTS`
27
-
28
- Parse the argument using the canonical chk-task-round notation (see `.claude/rules/notation-consistency.md`):
29
-
30
- | Shape | Regex | Resolves to |
31
- |-------|-------|-------------|
32
- | `{chk}-{task}` (e.g. `108-1`) | `^[0-9]+-[0-9]+$` | Checkpoint-bound: CHK-{chk} TASK-{task} |
33
- | _(empty)_ | — | Resolve from local state per Step 1.5/2 (MCP `get_current_task` break-glass) — the active in-progress task |
34
- | `{task}` (bare number) | — | **Error**: "Use /cbp-standalone-task-testing {N} instead — bare numbers no longer route to standalone tasks." |
35
-
36
- Anything else is malformed — surface this error and stop:
37
-
38
- ```
39
- task-testing: invalid argument `{value}`. Expected:
40
- 108-1 → CHK-108 TASK-1 (checkpoint-bound)
41
- (empty) → active in-progress task
42
-
43
- For standalone tasks, use `/cbp-standalone-task-testing {N}`.
44
- For a specific round, use `/cbp-round-update 108-1-2`.
45
- ```
46
-
47
- Error cases: `108-1-2` (that is round-update's shape), `abc`, `108-`, `-1`, `108--1`, anything with whitespace or non-numeric characters.
48
-
49
- #### Worked examples
50
-
51
- - `task-testing 108-1` → CHK-108 TASK-1
52
- - `task-testing` (no arg) → active in-progress task via `get_current_task`
53
- - `task-testing 45` → error: "Use /cbp-standalone-task-testing 45 instead — bare numbers no longer route to standalone tasks."
54
- - `task-testing 108-1-2` → error: "use `/cbp-round-update 108-1-2`"
55
- - `task-testing abc` → error: malformed
56
-
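The Step 1 parsing rules in the removed skill can be sketched as a small classifier. This is an illustrative sketch only: the `ParsedArg` type, the function name, and the two extra patterns (bare number and three-part round shape) are assumptions made for the example. Only the `^[0-9]+-[0-9]+$` regex comes from the table above.

```typescript
// Hypothetical result type for illustration; not part of the codebyplan CLI.
type ParsedArg =
  | { kind: "checkpoint-task"; chk: number; task: number }
  | { kind: "active-task" }
  | { kind: "error"; message: string };

const CHK_TASK = /^[0-9]+-[0-9]+$/; // regex taken from the Step 1 table
const BARE_NUMBER = /^[0-9]+$/; // assumed pattern for the bare `{task}` shape
const CHK_TASK_ROUND = /^[0-9]+-[0-9]+-[0-9]+$/; // assumed pattern for round-update's shape

function parseTaskTestingArg(raw: string): ParsedArg {
  // Empty argument: resolve the active in-progress task later (Step 1.5).
  if (raw === "") return { kind: "active-task" };

  if (CHK_TASK.test(raw)) {
    const [chk, task] = raw.split("-").map(Number);
    return { kind: "checkpoint-task", chk, task };
  }

  // Bare numbers are rejected and redirected to the standalone skill.
  if (BARE_NUMBER.test(raw)) {
    return {
      kind: "error",
      message: `Use /cbp-standalone-task-testing ${raw} instead`,
    };
  }

  // Three-part notation belongs to round-update.
  if (CHK_TASK_ROUND.test(raw)) {
    return { kind: "error", message: `use /cbp-round-update ${raw}` };
  }

  // Everything else (letters, whitespace, dangling or doubled dashes) is malformed.
  return { kind: "error", message: `task-testing: invalid argument \`${raw}\`` };
}
```

The input is deliberately not trimmed, so any value containing whitespace falls through to the malformed branch, matching the error cases listed above.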
57
- ### Step 1.5: Get Current Task
58
-
59
- Given the parse from Step 1:
60
-
61
- | Parse | Resolution path |
62
- |-------|-----------------|
63
- | `{chk}-{task}` | Read `.codebyplan/state/checkpoints/*.json` → filter `number === {chk}`. Read `.codebyplan/state/checkpoints/<id>/tasks/*.json` → filter `number === {task}`. If missing/stale, run `npx codebyplan sync` once and re-read. Break-glass fallback: MCP `get_checkpoints`/`get_tasks` when state dir absent and sync fails. |
64
- | _(empty)_ | Read `.codebyplan/state/todos.json` → find the active in-progress task. If missing/stale, run `npx codebyplan sync` once and re-read. Break-glass fallback: MCP `get_current_task(repo_id)` when state dir absent and sync fails. |
65
-
66
- If no in-progress task, show error and stop.
67
-
68
- ### Step 2: Verify All Rounds Complete
69
-
70
- Read `.codebyplan/state/checkpoints/<checkpointId>/tasks/<taskId>/rounds/*.json` (local-first). If missing/stale, run `npx codebyplan sync` once and re-read. Break-glass fallback: MCP `get_rounds(task_id)` when state dir absent and sync fails. Verify all rounds are `completed`. If any still `in_progress`:
71
-
72
- ```
73
- ## Cannot Run Task Testing
74
-
75
- TASK-[N] has an active round (Round [N]). Complete it first:
76
- - Run `/cbp-round-update` to finish the round
77
- ```
78
-
79
- Stop.
80
-
81
- ### Step 3: Verify `/cbp-task-check` Passed
82
-
83
- Check `task.context.check_verdict`: must exist and have `verdict = "READY"`. Otherwise:
84
-
85
- ```
86
- ## Cannot Run Task Testing
87
-
88
- `/cbp-task-check` has not passed yet. Run `/cbp-task-check` first.
89
- ```
90
-
91
- Stop.
92
-
93
- ### Step 4: Aggregate Files Changed
94
-
95
- Collect all `files_changed` from all rounds, deduplicate (latest action per path wins). Skip deleted files for file-reading in Step 5.
96
-
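A minimal sketch of the Step 4 aggregation, assuming each round record carries a `number` and a `files_changed` array of `{ path, action }` entries. The field names other than `files_changed`, and the action values, are assumptions for illustration; the removed skill does not spell out the record shape.

```typescript
// Hypothetical shapes for illustration; the real round JSON may differ.
interface FileChange {
  path: string;
  action: "created" | "modified" | "deleted";
}

interface Round {
  number: number;
  files_changed: FileChange[];
}

// Deduplicate by path; the entry from the latest round wins.
function aggregateFilesChanged(rounds: Round[]): FileChange[] {
  const latest = new Map<string, FileChange>();
  const ordered = [...rounds].sort((a, b) => a.number - b.number);
  for (const round of ordered) {
    for (const change of round.files_changed) {
      latest.set(change.path, change); // later rounds overwrite earlier ones
    }
  }
  return Array.from(latest.values());
}

// Paths to read in Step 5: everything whose final action is not a deletion.
function filesToRead(aggregated: FileChange[]): string[] {
  return aggregated.filter((c) => c.action !== "deleted").map((c) => c.path);
}
```

Sorting by round number before folding makes "latest action per path wins" independent of the order in which the round files were read from disk.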
97
- ### Step 5: Read ALL Final Changed Files
98
-
99
- Read every non-deleted file in the aggregated list. Understand the complete delivered work across all rounds. Build a mental model of what was built and how it connects.
100
-
101
- ### Step 6: Run Comprehensive Automated Testing
102
-
103
- Capture stdout and stderr for each check.
104
-
105
- **Hard-fail tests** (block completion):
106
-
107
- Run the unified check matrix:
108
-
109
- ```bash
110
- codebyplan check --scope task --json
111
- ```
112
-
113
- Capture the JSON result. The runner is **whole-repo + baseline**: it runs `turbo run lint|typecheck|test` across every package and diffs each per-package result against the committed `.check-baseline.json`, so only NEW per-package failures fail a check. Five checks run for `--scope task`: `gate6` (sibling-identity parity — ALWAYS hard-fail, never baselined), `lint`, `typecheck`, `tests`, and `audit` (`audit.new_failures` lists new GHSA advisory ids not in the allowlist). A baselined check's `status` is `pass` when its `new_failures` array is empty even if the underlying command exited non-zero. If `any_failed === true` (or `hard_fail_checks` is non-empty), this is a hard fail — surface each failing result's `stdout`/`stderr`/`new_failures` and stop.
114
-
115
- For each result entry, record: `category` (from `result.check`), `status` (from `result.status`), `details`, `stdout` (from `result.stdout`), `stderr` (from `result.stderr`), and `new_failures` (from `result.new_failures` — the newly-failing packages / new GHSA ids; the field is omitted/`undefined` for `gate6`, not `null`).
116
-
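The hard-fail decision and the per-result recording described above might look like the following sketch. The top-level `results` array name and the exact TypeScript types are assumptions; only the field names `any_failed`, `hard_fail_checks`, `check`, `status`, `stdout`, `stderr`, and `new_failures` appear in the removed text. The `details` field is left out because the text does not say where it comes from.

```typescript
// Assumed shape of the `codebyplan check --scope task --json` output.
interface CheckResult {
  check: string; // e.g. "gate6", "lint", "typecheck", "tests", "audit"
  status: string; // "pass" when new_failures is empty for baselined checks
  stdout: string;
  stderr: string;
  new_failures?: string[]; // omitted (undefined) for gate6
}

interface CheckReport {
  any_failed: boolean;
  hard_fail_checks?: string[];
  results: CheckResult[]; // hypothetical property name
}

interface RecordedCheck {
  category: string;
  status: string;
  stdout: string;
  stderr: string;
  new_failures: string[];
}

// Hard fail when any_failed is true or hard_fail_checks is non-empty.
function isHardFail(report: CheckReport): boolean {
  return report.any_failed === true || (report.hard_fail_checks ?? []).length > 0;
}

// Map each result entry onto the record described in Step 6.
function recordChecks(report: CheckReport): RecordedCheck[] {
  return report.results.map((r) => ({
    category: r.check,
    status: r.status,
    stdout: r.stdout,
    stderr: r.stderr,
    new_failures: r.new_failures ?? [], // normalise the omitted gate6 field
  }));
}
```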
117
- Additional hard-fail checks (not part of the runner):
118
-
119
- | Category | Command | Condition |
120
- | ----------------------- | ------------------------------- | -------------------------------- |
121
- | Per-package E2E | `pnpm --filter <pkg> e2e:test` | UI files in aggregated_files |
122
- | Full-diff security scan | inline grep or `security-agent` | Always |
123
-
124
- Per-file lint + format are enforced by `lint-format-on-edit.sh` hook per edit. This step catches cross-package issues invisible to per-wave checks.
125
-
126
- **Soft tests** (report, don't block):
127
-
128
- | Category | Method | Condition |
129
- | ---------- | ----------------------------------------- | ---------------------------- |
130
- | Visual | Screenshot compare via `e2e:visual-check` | UI work + dev server running |
131
- | API Health | `curl` health endpoint | API routes changed |
132
-
133
- #### Step 6.x: Autonomous Sim Screenshot Validation (mobile / on-device)
134
-
135
- For mobile rounds (Maestro / XCUITest / Tauri-mobile) where unit tests passed but the round touched component-mount code paths (custom hooks, prop signatures, conditional renders, navigation tabs), unit-test green is NOT sufficient evidence that the screen mounts at runtime. Use the autonomous sim screenshot loop to catch runtime crashes invisible to mocked unit tests.
136
-
137
- **Procedure** (when iOS Simulator is the target — adapt the screenshot command for Android/Tauri equivalents):
138
-
139
- 1. Confirm the target screen's default state via `Read` of its parent component or store.
140
- 2. If the screen is normally gated behind a tab/route the simulator isn't currently on, temporarily flip the gating state's default value at the screen's entry point (`useState(initialTab) → useState('targetTab')` or equivalent) so the screen mounts on the next reload.
141
- 3. Trigger a Fast Refresh by saving a touched file, then capture: `xcrun simctl io booted screenshot /tmp/codebyplan-task-testing-{task}-{state}.png`.
142
- 4. Read the screenshot via the multimodal Read tool. Confirm the screen rendered (vs blank/crash/red-box error overlay).
143
- 5. Revert the state-default flip from step 2. Confirm the file diff is empty (`git diff <path>`) before proceeding.
144
-
145
- **When to use**: round modifies hook usage, prop signatures, component-tree shape, or store subscriptions on a screen whose unit tests mock the data layer. Skip when the modified code path has no UI surface (pure utilities, server actions).
146
-
147
- **When NOT to use**: don't flip state defaults on the main app entry / auth gate / feature-flag boundaries — the revert risk is too high. Use storybook or a dedicated `__dev__` tab if the screen has cross-cutting state.
148
-
149
- Record the result in `task_testing_output.autonomous_sim_check`:
150
-
151
- ```yaml
152
- autonomous_sim_check:
153
- screen: "<screen-name>"
154
- status: "rendered" | "crashed" | "blank"
155
- screenshot_path: "/tmp/codebyplan-task-testing-..."
156
- state_flip_reverted: true
157
- ```
158
-
159
- This technique uniquely catches Rules-of-Hooks violations and prop-shape mismatches that mocked unit tests cannot detect — the runtime hook scheduler is the only oracle.
160
-
161
- ### Step 6.5: Cross-Round Code Review
162
-
163
- Round-level code review runs per-round via `improve-round` at `/cbp-round-end`. This step adds the cross-round holistic layer — things only visible once all rounds are aggregated.
164
-
165
- Inline review (no sub-agent) across the aggregated files read in Step 5. Check:
166
-
167
- | Concern | What to Look For |
168
- | ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
169
- | Leftover debug | `console.log`, `debugger`, commented-out blocks, `TODO`/`FIXME` added during this task |
170
- | Cross-round duplication | Same helper/logic written independently in 2+ rounds — candidate for extraction |
171
- | Convention drift | One round introduces a pattern (error handling, naming, file layout) that contradicts a pattern established in an earlier round of the same task |
172
- | Incomplete follow-through | A round added a type/field/table column that later rounds never consume |
173
- | Orphaned additions | Exports or utilities added in an early round with no callers after later rounds refactored past them |
174
-
175
- For each finding, record: `{category, file, description, severity: 'low'|'medium'|'high', suggested_fix}`.
176
-
177
- Findings with severity `medium` or `high` feed the Step 9 problem classification. `low` findings are recorded in `task_testing_output` for the record but do not block.
178
-
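As a sketch of the severity rule above, findings can be split into those that feed Step 9 and those that are only recorded. The `Finding` fields mirror the record shape given in Step 6.5; the function and property names are hypothetical.

```typescript
type Severity = "low" | "medium" | "high";

interface Finding {
  category: string;
  file: string;
  description: string;
  severity: Severity;
  suggested_fix: string;
}

// medium/high go on to problem classification; low is recorded but never blocks.
function partitionFindings(findings: Finding[]): {
  forClassification: Finding[];
  recordOnly: Finding[];
} {
  return {
    forClassification: findings.filter((f) => f.severity !== "low"),
    recordOnly: findings.filter((f) => f.severity === "low"),
  };
}
```

Both groups would still be written to `task_testing_output`; the split only decides which findings can block.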
179
- If any finding points to a need that exceeds task scope (e.g. a utility worth extracting for the wider codebase, a convention the repo should adopt globally), route per `immediate-issue-capture.md` "How to Capture" — default to a NEW TASK in the current checkpoint, not a standalone task. Standalone routing applies only when the finding is genuinely off-axis from every active checkpoint AND the user has confirmed standalone routing.
180
-
181
- ### Step 7: Separate Claude-Testable vs User-Testable
182
-
183
- **Claude handles automatically** (Step 6): build, types, unit tests, E2E tests, visual, API health.
184
-
185
- **User must verify** (requires human judgment):
186
-
187
- - Visual appearance quality (does it look good?)
188
- - UX flow (is the interaction intuitive?)
189
- - Business logic correctness (does it do the right thing?)
190
- - Edge cases (unusual inputs, boundary conditions)
191
- - Cross-browser / real-device behavior
192
- - Content accuracy (text, labels, messages)
193
-
194
- Generate user test items based on: task requirements, changed files, round context.
195
-
196
- ### Step 8: User Testing Walkthrough
197
-
198
- Present all user-testable items as a **single checklist in one `AskUserQuestion` prompt**. Do not ask one question per item — the batched format is preferred.
199
-
200
- Format the question so every item is visible in the checklist, with a single overall answer (e.g., "all pass", "minor issues", "major issues"). Provide the description, how-to-test steps, and expected result per item inside the question body. If the user reports mixed results, collect the specifics in a follow-up.
201
-
202
- Record the aggregate response and any per-item notes.
203
-
204
- ### Step 9: Classify Problems
205
-
206
- Collect failures from automated tests (Step 6), cross-round code review (Step 6.5, medium+), and user tests (Step 8). Classify:
207
-
208
- - **Minor** (round-fixable): styling, small bugs, missing edge cases, localized duplication
209
- - **Major** (new-task-worthy): architectural issues, missing features, fundamental design problems, convention drift that spans multiple files
210
-
211
- ### Step 10: Save Results
212
-
213
- `codebyplan task update --id <taskId> --checkpoint-id <checkpointId> --context '<json>'` (CLI write-through: local state file + REST), merging `task_testing_output` into the existing context object. Break-glass fallback: MCP `update_task` when CLI is unavailable.
214
-
215
- ```ts
216
- // context payload to merge:
217
- {
218
- task_testing_output: {
219
- claude_tests: [...],
220
- cross_round_code_findings: [...], // from Step 6.5
221
- user_tests: [...],
222
- problems_found: [...],
223
- all_passed: boolean,
224
- summary: { total, passed, failed, pending }
225
- }
226
- }
227
- ```
228
-
229
- ### Step 11: Route Based on Results
230
-
231
- **ALL PASS:**
232
-
233
- ```
234
- All tests passed for TASK-[N]. Routing to task-complete...
235
- ```
236
-
237
- Invoke `cbp-task-complete` via the Skill tool. `cbp-task-complete` is `ask`-tier — the harness
238
- permission prompt IS the human gate; the user confirms (or declines) before task commit,
239
- merge-main, and completion.
240
-
241
- **Minor problems found:**
242
-
243
- Invoking `cbp-round-input` to address the minor issues found during testing...
244
-
245
- Invoke `cbp-round-input` via the Skill tool. `cbp-round-input` is `allow`-tier — it auto-fires
246
- silently.
247
-
248
- **Major problems found:**
249
-
250
- ---
251
-
252
- **Next:**
253
- Run `/cbp-task-create` to:
254
-
255
- - Create a new task for the identified issues
256
-
257
- ---
258
-
259
- Waiting for user to run `/cbp-task-create`.
260
-
261
- **User wants re-test:** Suggest re-running `/cbp-task-testing`.
262
-
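The Step 11 routing can be summarised as a small decision function. This is a sketch under assumptions: the `Problem` and `Route` types are hypothetical, and the text does not say what happens when minor and major problems occur together, so the example assumes major takes precedence.

```typescript
type ProblemClass = "minor" | "major";

interface Problem {
  description: string;
  classification: ProblemClass;
}

type Route =
  | { action: "invoke-skill"; skill: "cbp-task-complete" | "cbp-round-input" }
  | { action: "await-user"; command: "/cbp-task-create" };

function routeAfterTesting(problems: Problem[]): Route {
  // ALL PASS: hand off to task-complete (ask-tier, so the user still confirms).
  if (problems.length === 0) {
    return { action: "invoke-skill", skill: "cbp-task-complete" };
  }
  // Assumption: any major problem outranks minor ones and waits for the user.
  if (problems.some((p) => p.classification === "major")) {
    return { action: "await-user", command: "/cbp-task-create" };
  }
  // Only minor problems: round-input fires automatically (allow-tier).
  return { action: "invoke-skill", skill: "cbp-round-input" };
}
```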
263
- ## Key Rules
264
-
265
- - **Never skippable** — mandatory before `/cbp-task-complete`
266
- - **Must loop until everything passes** — problems must be addressed
267
- - **No file changes** — testing only, never edit
268
- - **Batch user tests** — present all user-testable items in a single `AskUserQuestion` checklist; never one-per-question
269
- - **Read actual files** — do not rely on metadata alone
270
- - **Run actual commands** — capture real stdout/stderr
271
- - **Checkpoint-bound only** — for standalone tasks use `/cbp-standalone-task-testing`
272
-
273
- ## Integration
274
-
275
- - **Reads**: `.codebyplan/state/checkpoints/*.json`, `checkpoints/<id>/tasks/*.json`, `checkpoints/<id>/tasks/<id>/rounds/*.json`, `todos.json` (local-first; `npx codebyplan sync` on miss; MCP `get_current_task`/`get_rounds` break-glass), plus all aggregated files
276
- - **Writes**: `codebyplan task update` (CLI write-through; MCP `update_task` break-glass)
277
- - **Triggers**: `cbp-task-complete` (auto via Skill tool, when ALL PASS — `ask`-tier, permission prompt IS the human gate); `cbp-round-input` (auto via Skill tool, on minor problems — `allow`-tier, fires silently)
278
- - **Triggered by**: `cbp-task-check` auto-triggers this skill via Skill tool on READY verdict; `cbp-task-testing` is `allow`-tier and fires silently (no permission prompt)
279
- - **ci.json awareness**: `codebyplan check --scope task --json` is turbo-native — it runs `turbo run lint|typecheck|test` directly and does NOT read `.codebyplan/ci.json`. ci.json command resolution (via `npx codebyplan ci resolve <category> [--platform <slug>]`) is used by non-check consumers (`cbp-testing-qa-agent`, `cbp-security-agent`, `cbp-standalone-task-testing`, `cbp-checkpoint-check`), with a central-default fallback ensuring exit 0 even when ci.json is absent.