codebyplan 1.13.52 → 1.13.54

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (92)
  1. package/dist/cli.js +3226 -897
  2. package/package.json +1 -1
  3. package/templates/agents/cbp-database-agent.md +1 -1
  4. package/templates/agents/cbp-e2e-maestro.md +1 -1
  5. package/templates/agents/cbp-e2e-playwright.md +24 -16
  6. package/templates/agents/cbp-e2e-tauri.md +1 -1
  7. package/templates/agents/cbp-e2e-vscode.md +1 -1
  8. package/templates/agents/cbp-e2e-xcuitest.md +1 -1
  9. package/templates/agents/cbp-improve-claude.md +2 -2
  10. package/templates/agents/{cbp-round-executor.md → cbp-round-builder.md} +23 -23
  11. package/templates/agents/{cbp-task-planner.md → cbp-round-planner.md} +26 -25
  12. package/templates/agents/cbp-security-agent.md +10 -2
  13. package/templates/agents/cbp-stripe-agent.md +2 -2
  14. package/templates/agents/cbp-testing-qa-agent.md +34 -20
  15. package/templates/agents/cbp-verify-reviewer.md +236 -0
  16. package/templates/context/architecture-map.md +4 -4
  17. package/templates/context/mcp-docs.md +57 -11
  18. package/templates/context/testing/e2e.md +9 -9
  19. package/templates/github-workflows/ci.yml +104 -0
  20. package/templates/github-workflows/publish.yml +8 -27
  21. package/templates/github-workflows/release-desktop.yml +215 -0
  22. package/templates/hooks/cbp-skill-context-guard.sh +1 -1
  23. package/templates/hooks/cbp-test-hooks.sh +9 -9
  24. package/templates/hooks/validate-structure-lengths.sh +1 -1
  25. package/templates/hooks/validate-structure-patterns.sh +1 -1
  26. package/templates/rules/README.md +1 -2
  27. package/templates/rules/agent-claim-verification.md +1 -1
  28. package/templates/rules/context-file-loading.md +10 -10
  29. package/templates/rules/development-workflow.md +73 -0
  30. package/templates/rules/e2e-mandatory.md +8 -8
  31. package/templates/rules/execution-proof.md +70 -0
  32. package/templates/rules/model-invocation-convention.md +2 -2
  33. package/templates/rules/parallel-waves.md +11 -11
  34. package/templates/rules/spawn-failure-is-gate-failure.md +76 -0
  35. package/templates/rules/task-routing-recommendation.md +1 -1
  36. package/templates/rules/todo-backend.md +3 -3
  37. package/templates/rules/two-tier-ci.md +63 -0
  38. package/templates/settings.project.base.json +15 -11
  39. package/templates/skills/cbp-build-cc-mode/SKILL.md +1 -1
  40. package/templates/skills/cbp-build-cc-settings/reference/cbp-permission-policy.md +7 -7
  41. package/templates/skills/cbp-build-cc-skill/SKILL.md +1 -1
  42. package/templates/skills/cbp-build-cc-skill/reference/cbp-quality.md +2 -2
  43. package/templates/skills/cbp-build-cc-skill/reference/fork-eligibility.md +11 -14
  44. package/templates/skills/cbp-checkpoint-check/SKILL.md +11 -3
  45. package/templates/skills/cbp-checkpoint-create/SKILL.md +16 -1
  46. package/templates/skills/cbp-checkpoint-end/SKILL.md +5 -1
  47. package/templates/skills/cbp-checkpoint-update/SKILL.md +3 -3
  48. package/templates/skills/cbp-clear-continue/SKILL.md +2 -2
  49. package/templates/skills/cbp-clear-prep/SKILL.md +3 -3
  50. package/templates/skills/{cbp-task-complete → cbp-finalize}/SKILL.md +25 -29
  51. package/templates/skills/{cbp-task-complete → cbp-finalize}/reference/checkpoint-done-branching.md +1 -1
  52. package/templates/skills/{cbp-task-complete → cbp-finalize}/reference/next-step-heuristic.md +1 -1
  53. package/templates/skills/cbp-frontend-design/SKILL.md +1 -1
  54. package/templates/skills/cbp-frontend-ui/SKILL.md +7 -7
  55. package/templates/skills/cbp-git-commit/SKILL.md +3 -3
  56. package/templates/skills/cbp-merge-main/SKILL.md +4 -4
  57. package/templates/skills/{cbp-round-execute → cbp-round-build}/SKILL.md +93 -75
  58. package/templates/skills/cbp-round-complete/SKILL.md +15 -14
  59. package/templates/skills/cbp-round-plan/SKILL.md +344 -0
  60. package/templates/skills/cbp-session-end/SKILL.md +1 -1
  61. package/templates/skills/cbp-setup-cd/SKILL.md +291 -0
  62. package/templates/skills/cbp-setup-cd/reference/github-actions-cd.md +231 -0
  63. package/templates/skills/cbp-setup-ci/SKILL.md +175 -0
  64. package/templates/skills/cbp-setup-ci/reference/github-actions.md +100 -0
  65. package/templates/skills/cbp-ship/SKILL.md +21 -0
  66. package/templates/skills/cbp-ship-main/SKILL.md +3 -2
  67. package/templates/skills/cbp-standalone-task-check/SKILL.md +10 -9
  68. package/templates/skills/cbp-standalone-task-complete/SKILL.md +12 -13
  69. package/templates/skills/cbp-standalone-task-create/SKILL.md +16 -9
  70. package/templates/skills/cbp-standalone-task-start/SKILL.md +9 -5
  71. package/templates/skills/cbp-standalone-task-testing/SKILL.md +16 -7
  72. package/templates/skills/cbp-task-create/SKILL.md +6 -7
  73. package/templates/skills/cbp-task-start/SKILL.md +8 -8
  74. package/templates/skills/cbp-todo/SKILL.md +6 -8
  75. package/templates/skills/cbp-verify/SKILL.md +146 -0
  76. package/templates/skills/cbp-verify/reference/deterministic-gates.md +114 -0
  77. package/templates/skills/{cbp-round-end → cbp-verify}/reference/findings-presentation.md +16 -12
  78. package/templates/skills/cbp-verify/reference/round-scope.md +62 -0
  79. package/templates/skills/cbp-verify/reference/task-scope.md +71 -0
  80. package/templates/agents/cbp-improve-round.md +0 -283
  81. package/templates/agents/cbp-task-check.md +0 -217
  82. package/templates/skills/cbp-round-check/SKILL.md +0 -132
  83. package/templates/skills/cbp-round-end/SKILL.md +0 -173
  84. package/templates/skills/cbp-round-end/reference/inline-fallback.md +0 -35
  85. package/templates/skills/cbp-round-execute/reference/inline-fallback.md +0 -55
  86. package/templates/skills/cbp-round-input/SKILL.md +0 -197
  87. package/templates/skills/cbp-round-start/SKILL.md +0 -261
  88. package/templates/skills/cbp-round-update/SKILL.md +0 -120
  89. package/templates/skills/cbp-ship/templates/workflow-eas-submit.yml +0 -53
  90. package/templates/skills/cbp-ship/templates/workflow-vsce-publish.yml +0 -31
  91. package/templates/skills/cbp-task-check/SKILL.md +0 -172
  92. package/templates/skills/cbp-task-testing/SKILL.md +0 -277
@@ -1,31 +0,0 @@
1
- name: VS Code Marketplace publish
2
-
3
- on:
4
- push:
5
- tags:
6
- - 'vscode-v*.*.*'
7
- workflow_dispatch:
8
-
9
- jobs:
10
- publish:
11
- runs-on: ubuntu-latest
12
- steps:
13
- - uses: actions/checkout@v4
14
-
15
- - uses: pnpm/action-setup@v3
16
- with:
17
- version: 10
18
-
19
- - uses: actions/setup-node@v4
20
- with:
21
- node-version: 22
22
- cache: pnpm
23
-
24
- - run: pnpm install --frozen-lockfile
25
- - run: pnpm --filter REPLACE_WITH_EXT_NAME build
26
- - name: Publish to Marketplace
27
- env:
28
- VSCE_PAT: ${{ secrets.VSCE_PAT }}
29
- run: |
30
- cd apps/vscode
31
- npx @vscode/vsce publish --no-yarn
@@ -1,172 +0,0 @@
1
- ---
2
- name: cbp-task-check
3
- description: AI production review for the current task
4
- argument-hint: [chk-task]
5
- triggers: [cbp-task-testing, cbp-round-input]
6
- effort: high
7
- ---
8
-
9
- # Task Check Command
10
-
11
- AI-driven production readiness review. Spawns the `cbp-task-check` agent for thorough verification including user satisfaction discussion. This command is a thin orchestrator — the agent does the heavy lifting. It is the **cross-round double-check**: rounds already own per-round QA (debug scan, security grep, audit, per-app build/lint/types), so this layer focuses on holistic concerns visible only across the full task diff — requirements traceability, checkpoint alignment, shippability, holistic code review, and scope drift — never re-running per-round checks.
12
-
13
- ## Inline-Fallback for Spawn Failure
14
-
15
- If the `cbp-task-check` agent spawn fails for any reason (`API Error: Extra usage required`, monthly Agent usage cap, provider 5xx, rate limit, context overflow), the orchestrator MUST follow the canonical inline-fallback procedure documented in `skills/cbp-round-end/SKILL.md` "Inline-fallback for any spawn failure".
16
-
17
- Procedure summary (pointer back to canonical):
18
-
19
- 1. Detect the failure class from the error string; record `round.context.task_check_findings.spawn_failure = { class, error_message, decided_at }`.
20
- 2. Walk the agent's documented Phase 1-10 checklist inline using `Read` / `Grep` / `Bash` / MCP `get_*` tools — the agent's definition file is the inline script.
21
- 3. Populate the agent's output contract (`verdict`, `route_recommendation`, `requirements_status`, `qa_status`, `code_review_findings`, `user_satisfaction`, `scope_divergence_detected`, etc.) with `mode: 'inline_fallback'` so analytics distinguishes.
22
- 4. Apply the pre-emptive-skip rule: when the same failure class fired in the previous skill of this session, skip the spawn attempt entirely and go straight to inline.
23
- 5. Continue the skill — do NOT abort. Inline-fallback is intended to keep the pipeline moving under sustained outages.
24
-
25
- Inline-fallback is NOT a quality downgrade trapdoor — every Phase from the agent definition MUST be walked, in order, with the same Read/Grep depth the agent would have used. Skipping phases under the banner of fallback is a separate failure mode that `cbp-improve-claude` flags as `inline_fallback_shortcutting`.
26
-
27
- ## When Used
28
-
29
- - After all rounds complete and all files approved (auto-triggered by `/cbp-round-complete`)
30
- - Before `/cbp-task-testing`
31
- - `/cbp-task-check` is NEVER skippable
32
-
33
- ## Instructions
34
-
35
- ### Step 1: Parse `$ARGUMENTS`
36
-
37
- Parse the argument using the canonical chk-task-round notation (see `cbp-round-start` Step 0 "CHK / TASK / ROUND Identifier Notation Vocabulary"):
38
-
39
- | Shape | Regex | Resolves to |
40
- |-------|-------|-------------|
41
- | `{chk}-{task}` (e.g. `108-1`) | `^[0-9]+-[0-9]+$` | Checkpoint-bound: CHK-{chk} TASK-{task} |
42
- | _(empty)_ | — | Resolve from local state per Step 1.5/2 (MCP `get_current_task` break-glass) — the active in-progress task |
43
- | `{task}` (bare number) | — | **Error**: "Use /cbp-standalone-task-check {N} instead — bare numbers no longer route to standalone tasks." |
44
-
45
- Anything else is malformed — surface this error and stop:
46
-
47
- ```
48
- task-check: invalid argument `{value}`. Expected:
49
- 108-1 → CHK-108 TASK-1 (checkpoint-bound)
50
- (empty) → active in-progress task
51
-
52
- For standalone tasks, use `/cbp-standalone-task-check {N}`.
53
- For a specific round, use `/cbp-round-update 108-1-2`.
54
- ```
55
-
56
- Error cases: `108-1-2` (that is round-update's shape), `abc`, `108-`, `-1`, `108--1`, anything with whitespace or non-numeric characters.
57
-
58
- #### Worked examples
59
-
60
- - `task-check 108-1` → CHK-108 TASK-1
61
- - `task-check` (no arg) → active in-progress task via `get_current_task`
62
- - `task-check 45` → error: "Use /cbp-standalone-task-check 45 instead — bare numbers no longer route to standalone tasks."
63
- - `task-check 108-1-2` → error: "use `/cbp-round-update 108-1-2`"
64
- - `task-check abc` → error: malformed
65
-
66
- ### Step 1.5: Get Current Task
67
-
68
- Given the parse from Step 1:
69
-
70
- | Parse | Resolution path |
71
- |-------|-----------------|
72
- | `{chk}-{task}` | Read `.codebyplan/state/checkpoints/*.json` → filter `number === {chk}`. Read `.codebyplan/state/checkpoints/<id>/tasks/*.json` → filter `number === {task}`. If missing/stale, run `npx codebyplan sync` once and re-read. Break-glass fallback: MCP `get_checkpoints`/`get_tasks` when state dir absent and sync fails. |
73
- | _(empty)_ | Read `.codebyplan/state/todos.json` → find the active in-progress task. If missing/stale, run `npx codebyplan sync` once and re-read. Break-glass fallback: MCP `get_current_task(repo_id)` when state dir absent and sync fails. |
74
-
75
- If no in-progress task, show error and stop.
76
-
77
- ### Step 2: Quick Gate — Verify All Rounds Complete
78
-
79
- Read `.codebyplan/state/checkpoints/<checkpointId>/tasks/<taskId>/rounds/*.json` (local-first). If missing/stale, run `npx codebyplan sync` once and re-read. Break-glass fallback: MCP `get_rounds` when state dir absent and sync fails. Verify all rounds are `completed`.
80
-
81
- If any rounds still in_progress:
82
-
83
- ```
84
- ## Cannot Run Task Check
85
-
86
- TASK-[N] has an active round (Round [N]). Complete it first:
87
- - Run `/cbp-round-update` to finish the round
88
- ```
89
-
90
- Stop here.
91
-
92
- ### Step 3: Load All Context
93
-
94
- 1. Checkpoint details — from `.codebyplan/state/checkpoints/<checkpointId>.json` (already read in Step 1.5)
95
- 2. Task details — from `.codebyplan/state/checkpoints/<checkpointId>/tasks/<taskId>.json` (already read in Step 1.5)
96
- 3. All rounds — from `.codebyplan/state/checkpoints/<checkpointId>/tasks/<taskId>/rounds/*.json` (already read in Step 2)
97
-
98
- ### Step 4: Spawn Task Check Agent
99
-
100
- Spawn `cbp-task-check` agent with full context:
101
-
102
- ```yaml
103
- input:
104
- task_number: [N]
105
- round_number: [total rounds]
106
- checkpoint: { id, title, goal, context }
107
- task: { id, title, requirements, context, files_changed, qa }
108
- rounds: [{ number, requirements, context, qa, files_changed }]
109
- ```
110
-
111
- Wait for agent to complete. Agent handles all 10 phases including user satisfaction discussion.
112
-
113
- ### Step 5: Save Agent Output
114
-
115
- Save agent output to task context: `codebyplan task update --id <taskId> --checkpoint-id <checkpointId> --context '{"check_verdict": ...}'` (CLI write-through: local state file + REST). Break-glass fallback: MCP `update_task` when CLI is unavailable.
116
-
117
- - `task.context.check_verdict` = agent output (verdict, requirements_check, etc.)
118
-
119
- ### Step 6: Route Based on Verdict
120
-
121
- **READY + satisfied:**
122
-
123
- Starting task testing...
124
-
125
- Invoke `cbp-task-testing` via the Skill tool with the same `{chk-task}` argument. `cbp-task-testing`
126
- is `allow`-tier — it auto-fires silently. If the `cbp-skill-context-guard.sh` hook detects the
127
- context window is above the 200K threshold it will block the skill and direct you to run
128
- `/cbp-clear-prep` first; otherwise testing starts immediately.
129
-
130
- **NOT READY — fixable issues:**
131
-
132
- ```
133
- Issues found that need addressing:
134
- - [issue 1]
135
- - [issue 2]
136
- ```
137
-
138
- Invoking `cbp-round-input` to address the issues found during review...
139
-
140
- Invoke `cbp-round-input` via the Skill tool. `cbp-round-input` is `allow`-tier — it auto-fires silently.
141
-
142
- **NOT READY — needs new task:**
143
-
144
- ```
145
- Scope issues identified that require a new task:
146
- - [scope issue]
147
- ```
148
-
149
- Suggest: `/cbp-task-create`. **STOP HERE** — wait for user (creating a new task is a user scope decision — not auto-triggered).
150
-
151
- **NOT READY — approvals missing:**
152
-
153
- ```
154
- Code review passed but [N] files need user approval.
155
- ```
156
-
157
- Suggest: Approve files, then re-run `/cbp-task-check`. **STOP HERE** — wait for user (approval is a user action — not auto-triggered).
158
-
159
- ## Key Rules
160
-
161
- - **`/cbp-task-check` is NEVER skippable** — mandatory before `/cbp-task-testing`
162
- - **This is AI review + user satisfaction** — not automated testing
163
- - **Read all changed files** — agent does the heavy lifting
164
- - **No file changes** — review only, never edit
165
- - **Checkpoint-bound only** — for standalone tasks use `/cbp-standalone-task-check`
166
-
167
- ## Integration
168
-
169
- - **Reads**: `.codebyplan/state/checkpoints/*.json`, `checkpoints/<id>/tasks/*.json`, `checkpoints/<id>/tasks/<id>/rounds/*.json`, `todos.json` (local-first; `npx codebyplan sync` on miss; MCP `get_current_task`/`get_rounds` break-glass), plus all changed files (via agent)
170
- - **Writes**: `codebyplan task update` (CLI write-through; MCP `update_task` break-glass)
171
- - **Triggers**: auto-triggers `cbp-task-testing` via Skill tool on READY + satisfied (`allow`-tier, fires silently; the 200K context guard handles oversized contexts via the cbp-clear-prep flow); auto-triggers `cbp-round-input` via Skill tool on NOT READY — fixable issues (`allow`-tier, fires silently)
172
- - **Triggered by**: `/cbp-round-complete` (auto, when all files approved)
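Note on the removed `cbp-task-check` skill: the Step 1 argument table can be read as a small parser. The following is a minimal sketch of that table, not the package's actual implementation. The type `ParsedArg`, the function `parseTaskCheckArg`, and the shortened error strings are hypothetical names introduced here for illustration. Only the regex `^[0-9]+-[0-9]+$` and the routing rules come from the skill text.

```typescript
// Hypothetical result type; not part of codebyplan.
type ParsedArg =
  | { kind: "checkpoint-task"; chk: number; task: number }
  | { kind: "active-task" }
  | { kind: "error"; message: string };

// Sketch of the Step 1 table. The input is not trimmed, because the skill
// lists "anything with whitespace" as malformed.
function parseTaskCheckArg(raw: string): ParsedArg {
  // (empty) resolves to the active in-progress task.
  if (raw === "") {
    return { kind: "active-task" };
  }
  // {chk}-{task}, e.g. 108-1, is the checkpoint-bound shape.
  if (/^[0-9]+-[0-9]+$/.test(raw)) {
    const [chk, task] = raw.split("-").map(Number);
    return { kind: "checkpoint-task", chk, task };
  }
  // A bare number no longer routes to standalone tasks.
  if (/^[0-9]+$/.test(raw)) {
    return {
      kind: "error",
      message: `Use /cbp-standalone-task-check ${raw} instead`,
    };
  }
  // {chk}-{task}-{round} is round-update's shape.
  if (/^[0-9]+-[0-9]+-[0-9]+$/.test(raw)) {
    return { kind: "error", message: `use /cbp-round-update ${raw}` };
  }
  // Everything else (abc, 108-, -1, 108--1, whitespace) is malformed.
  return { kind: "error", message: `task-check: invalid argument ${raw}` };
}
```

Calling `parseTaskCheckArg("108-1")` yields the checkpoint-bound result for CHK-108 TASK-1, while `parseTaskCheckArg("45")` yields the standalone redirect error, matching the worked examples in the removed file.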
@@ -1,277 +0,0 @@
1
- ---
2
- name: cbp-task-testing
3
- description: Run comprehensive task-level testing after /cbp-task-check passes
4
- argument-hint: [chk-task]
5
- triggers: [cbp-task-complete, cbp-round-input]
6
- effort: xhigh
7
- ---
8
-
9
- # Task Testing Command
10
-
11
- Comprehensive task-level testing — runs all automated tests and walks the user through manual testing one-by-one. Distinct from round-level testing (`testing-qa-agent`): this tests the **entire delivered feature holistically** after all rounds are complete. Runs inline — no sub-agent.
12
-
13
- ## When Used
14
-
15
- - After `/cbp-task-check` passes with READY verdict (auto-triggered)
16
- - Before `/cbp-task-complete`
17
- - **Never skippable**
18
-
19
- ## Scope vs Round-Level Validation
20
-
21
- Per-wave `testing-qa-agent` runs inside `/cbp-round-execute` Step 5. This skill adds the cross-cutting layer that is only visible across the full task diff: whole-repo lint, whole-repo typecheck, full test suite, `pnpm audit` (via `codebyplan check --scope task --json`), and full-diff security scan — each run once here, not per-round.
22
-
23
- ## Instructions
24
-
25
- ### Step 1: Parse `$ARGUMENTS`
26
-
27
- Parse the argument using the canonical chk-task-round notation (see `.claude/rules/notation-consistency.md`):
28
-
29
- | Shape | Regex | Resolves to |
30
- |-------|-------|-------------|
31
- | `{chk}-{task}` (e.g. `108-1`) | `^[0-9]+-[0-9]+$` | Checkpoint-bound: CHK-{chk} TASK-{task} |
32
- | _(empty)_ | — | Resolve from local state per Step 1.5/2 (MCP `get_current_task` break-glass) — the active in-progress task |
33
- | `{task}` (bare number) | — | **Error**: "Use /cbp-standalone-task-testing {N} instead — bare numbers no longer route to standalone tasks." |
34
-
35
- Anything else is malformed — surface this error and stop:
36
-
37
- ```
38
- task-testing: invalid argument `{value}`. Expected:
39
- 108-1 → CHK-108 TASK-1 (checkpoint-bound)
40
- (empty) → active in-progress task
41
-
42
- For standalone tasks, use `/cbp-standalone-task-testing {N}`.
43
- For a specific round, use `/cbp-round-update 108-1-2`.
44
- ```
45
-
46
- Error cases: `108-1-2` (that is round-update's shape), `abc`, `108-`, `-1`, `108--1`, anything with whitespace or non-numeric characters.
47
-
48
- #### Worked examples
49
-
50
- - `task-testing 108-1` → CHK-108 TASK-1
51
- - `task-testing` (no arg) → active in-progress task via `get_current_task`
52
- - `task-testing 45` → error: "Use /cbp-standalone-task-testing 45 instead — bare numbers no longer route to standalone tasks."
53
- - `task-testing 108-1-2` → error: "use `/cbp-round-update 108-1-2`"
54
- - `task-testing abc` → error: malformed
55
-
56
- ### Step 1.5: Get Current Task
57
-
58
- Given the parse from Step 1:
59
-
60
- | Parse | Resolution path |
61
- |-------|-----------------|
62
- | `{chk}-{task}` | Read `.codebyplan/state/checkpoints/*.json` → filter `number === {chk}`. Read `.codebyplan/state/checkpoints/<id>/tasks/*.json` → filter `number === {task}`. If missing/stale, run `npx codebyplan sync` once and re-read. Break-glass fallback: MCP `get_checkpoints`/`get_tasks` when state dir absent and sync fails. |
63
- | _(empty)_ | Read `.codebyplan/state/todos.json` → find the active in-progress task. If missing/stale, run `npx codebyplan sync` once and re-read. Break-glass fallback: MCP `get_current_task(repo_id)` when state dir absent and sync fails. |
64
-
65
- If no in-progress task, show error and stop.
66
-
67
- ### Step 2: Verify All Rounds Complete
68
-
69
- Read `.codebyplan/state/checkpoints/<checkpointId>/tasks/<taskId>/rounds/*.json` (local-first). If missing/stale, run `npx codebyplan sync` once and re-read. Break-glass fallback: MCP `get_rounds(task_id)` when state dir absent and sync fails. Verify all rounds are `completed`. If any still `in_progress`:
70
-
71
- ```
72
- ## Cannot Run Task Testing
73
-
74
- TASK-[N] has an active round (Round [N]). Complete it first:
75
- - Run `/cbp-round-update` to finish the round
76
- ```
77
-
78
- Stop.
79
-
80
- ### Step 3: Verify `/cbp-task-check` Passed
81
-
82
- Check `task.context.check_verdict`: must exist and have `verdict = "READY"`. Otherwise:
83
-
84
- ```
85
- ## Cannot Run Task Testing
86
-
87
- `/cbp-task-check` has not passed yet. Run `/cbp-task-check` first.
88
- ```
89
-
90
- Stop.
91
-
92
- ### Step 4: Aggregate Files Changed
93
-
94
- Collect all `files_changed` from all rounds, deduplicate (latest action per path wins). Skip deleted files for file-reading in Step 5.
95
-
96
- ### Step 5: Read ALL Final Changed Files
97
-
98
- Read every non-deleted file in the aggregated list. Understand the complete delivered work across all rounds. Build a mental model of what was built and how it connects.
99
-
100
- ### Step 6: Run Comprehensive Automated Testing
101
-
102
- Capture stdout and stderr for each check.
103
-
104
- **Hard-fail tests** (block completion):
105
-
106
- Run the unified check matrix:
107
-
108
- ```bash
109
- codebyplan check --scope task --json
110
- ```
111
-
112
- Capture the JSON result. The runner is **whole-repo + baseline**: it runs `turbo run lint|typecheck|test` across every package and diffs each per-package result against the committed `.check-baseline.json`, so only NEW per-package failures fail a check. Five checks run for `--scope task`: `gate6` (sibling-identity parity — ALWAYS hard-fail, never baselined), `lint`, `typecheck`, `tests`, and `audit` (`audit.new_failures` lists new GHSA advisory ids not in the allowlist). A baselined check's `status` is `pass` when its `new_failures` array is empty even if the underlying command exited non-zero. If `any_failed === true` (or `hard_fail_checks` is non-empty), this is a hard fail — surface each failing result's `stdout`/`stderr`/`new_failures` and stop.
113
-
114
- For each result entry, record: `category` (from `result.check`), `status` (from `result.status`), `details`, `stdout` (from `result.stdout`), `stderr` (from `result.stderr`), and `new_failures` (from `result.new_failures` — the newly-failing packages / new GHSA ids; the field is omitted/`undefined` for `gate6`, not `null`).
115
-
116
- Additional hard-fail checks (not part of the runner):
117
-
118
- | Category | Command | Condition |
119
- | ----------------------- | ------------------------------- | -------------------------------- |
120
- | Per-package E2E | `pnpm --filter <pkg> e2e:test` | UI files in aggregated_files |
121
- | Full-diff security scan | inline grep or `security-agent` | Always |
122
-
123
- Per-file lint + format are enforced by `lint-format-on-edit.sh` hook per edit. This step catches cross-package issues invisible to per-wave checks.
124
-
125
- **Soft tests** (report, don't block):
126
-
127
- | Category | Method | Condition |
128
- | ---------- | ----------------------------------------- | ---------------------------- |
129
- | Visual | Screenshot compare via `e2e:visual-check` | UI work + dev server running |
130
- | API Health | `curl` health endpoint | API routes changed |
131
-
132
- #### Step 6.x: Autonomous Sim Screenshot Validation (mobile / on-device)
133
-
134
- For mobile rounds (Maestro / XCUITest / Tauri-mobile) where unit tests passed but the round touched component-mount code paths (custom hooks, prop signatures, conditional renders, navigation tabs), unit-test green is NOT sufficient evidence that the screen mounts at runtime. Use the autonomous sim screenshot loop to catch runtime crashes invisible to mocked unit tests.
135
-
136
- **Procedure** (when iOS Simulator is the target — adapt the screenshot command for Android/Tauri equivalents):
137
-
138
- 1. Confirm the target screen's default state via `Read` of its parent component or store.
139
- 2. If the screen is normally gated behind a tab/route the simulator isn't currently on, temporarily flip the gating state's default value at the screen's entry point (`useState(initialTab) → useState('targetTab')` or equivalent) so the screen mounts on the next reload.
140
- 3. Trigger a Fast Refresh by saving a touched file, then capture: `xcrun simctl io booted screenshot /tmp/codebyplan-task-testing-{task}-{state}.png`.
141
- 4. Read the screenshot via the multimodal Read tool. Confirm the screen rendered (vs blank/crash/red-box error overlay).
142
- 5. Revert the state-default flip from step 2. Confirm the file diff is empty (`git diff <path>`) before proceeding.
143
-
144
- **When to use**: round modifies hook usage, prop signatures, component-tree shape, or store subscriptions on a screen whose unit tests mock the data layer. Skip when the modified code path has no UI surface (pure utilities, server actions).
145
-
146
- **When NOT to use**: don't flip state defaults on the main app entry / auth gate / feature-flag boundaries — the revert risk is too high. Use storybook or a dedicated `__dev__` tab if the screen has cross-cutting state.
147
-
148
- Record the result in `task_testing_output.autonomous_sim_check`:
149
-
150
- ```yaml
151
- autonomous_sim_check:
152
- screen: "<screen-name>"
153
- status: "rendered" | "crashed" | "blank"
154
- screenshot_path: "/tmp/codebyplan-task-testing-..."
155
- state_flip_reverted: true
156
- ```
157
-
158
- This technique uniquely catches Rules-of-Hooks violations and prop-shape mismatches that mocked unit tests cannot detect — the runtime hook scheduler is the only oracle.
159
-
160
- ### Step 6.5: Cross-Round Code Review
161
-
162
- Round-level code review runs per-round via `improve-round` at `/cbp-round-end`. This step adds the cross-round holistic layer — things only visible once all rounds are aggregated.
163
-
164
- Inline review (no sub-agent) across the aggregated files read in Step 5. Check:
165
-
166
- | Concern | What to Look For |
167
- | ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
168
- | Leftover debug | `console.log`, `debugger`, commented-out blocks, `TODO`/`FIXME` added during this task |
169
- | Cross-round duplication | Same helper/logic written independently in 2+ rounds — candidate for extraction |
170
- | Convention drift | One round introduces a pattern (error handling, naming, file layout) that contradicts a pattern established in an earlier round of the same task |
171
- | Incomplete follow-through | A round added a type/field/table column that later rounds never consume |
172
- | Orphaned additions | Exports or utilities added in an early round with no callers after later rounds refactored past them |
173
-
174
- For each finding, record: `{category, file, description, severity: 'low'|'medium'|'high', suggested_fix}`.
175
-
176
- Findings with severity `medium` or `high` feed the Step 9 problem classification. `low` findings are recorded in `task_testing_output` for the record but do not block.
177
-
178
- If any finding points to a need that exceeds task scope (e.g. a utility worth extracting for the wider codebase, a convention the repo should adopt globally), route per `immediate-issue-capture.md` "How to Capture" — default to a NEW TASK in the current checkpoint, not a standalone task. Standalone routing applies only when the finding is genuinely off-axis from every active checkpoint AND the user has confirmed standalone routing.
179
-
180
- ### Step 7: Separate Claude-Testable vs User-Testable
181
-
182
- **Claude handles automatically** (Step 6): build, types, unit tests, E2E tests, visual, API health.
183
-
184
- **User must verify** (requires human judgment):
185
-
186
- - Visual appearance quality (does it look good?)
187
- - UX flow (is the interaction intuitive?)
188
- - Business logic correctness (does it do the right thing?)
189
- - Edge cases (unusual inputs, boundary conditions)
190
- - Cross-browser / real-device behavior
191
- - Content accuracy (text, labels, messages)
192
-
193
- Generate user test items based on: task requirements, changed files, round context.
194
-
195
- ### Step 8: User Testing Walkthrough
196
-
197
- Present all user-testable items as a **single checklist in one `AskUserQuestion` prompt**. Do not ask one question per item — the batched format is preferred.
198
-
199
- Format the question so every item is visible in the checklist, with a single overall answer (e.g., "all pass", "minor issues", "major issues"). Provide the description, how-to-test steps, and expected result per item inside the question body. If the user reports mixed results, collect the specifics in a follow-up.
200
-
201
- Record the aggregate response and any per-item notes.
202
-
203
- ### Step 9: Classify Problems
204
-
205
- Collect failures from automated tests (Step 6), cross-round code review (Step 6.5, medium+), and user tests (Step 8). Classify:
206
-
207
- - **Minor** (round-fixable): styling, small bugs, missing edge cases, localized duplication
208
- - **Major** (new-task-worthy): architectural issues, missing features, fundamental design problems, convention drift that spans multiple files
209
-
210
- ### Step 10: Save Results
211
-
212
- `codebyplan task update --id <taskId> --checkpoint-id <checkpointId> --context '<json>'` (CLI write-through: local state file + REST), merging `task_testing_output` into the existing context object. Break-glass fallback: MCP `update_task` when CLI is unavailable.
213
-
214
- ```ts
215
- // context payload to merge:
216
- {
217
- task_testing_output: {
218
- claude_tests: [...],
219
- cross_round_code_findings: [...], // from Step 6.5
220
- user_tests: [...],
221
- problems_found: [...],
222
- all_passed: boolean,
223
- summary: { total, passed, failed, pending }
224
- }
225
- }
226
- ```
227
-
228
- ### Step 11: Route Based on Results
229
-
230
- **ALL PASS:**
231
-
232
- ```
233
- All tests passed for TASK-[N]. Routing to task-complete...
234
- ```
235
-
236
- Invoke `cbp-task-complete` via the Skill tool. `cbp-task-complete` is `ask`-tier — the harness
237
- permission prompt IS the human gate; the user confirms (or declines) before task commit,
238
- merge-main, and completion.
239
-
240
- **Minor problems found:**
241
-
242
- Invoking `cbp-round-input` to address the minor issues found during testing...
243
-
244
- Invoke `cbp-round-input` via the Skill tool. `cbp-round-input` is `allow`-tier — it auto-fires
245
- silently.
246
-
247
- **Major problems found:**
248
-
249
- ---
250
-
251
- **Next:**
252
- Run `/cbp-task-create` to:
253
-
254
- - Create a new task for the identified issues
255
-
256
- ---
257
-
258
- Waiting for user to run `/cbp-task-create`.
259
-
260
- **User wants re-test:** Suggest re-running `/cbp-task-testing`.
261
-
262
- ## Key Rules
263
-
264
- - **Never skippable** — mandatory before `/cbp-task-complete`
265
- - **Must loop until everything passes** — problems must be addressed
266
- - **No file changes** — testing only, never edit
267
- - **Batch user tests** — present all user-testable items in a single `AskUserQuestion` checklist; never one-per-question
268
- - **Read actual files** — do not rely on metadata alone
269
- - **Run actual commands** — capture real stdout/stderr
270
- - **Checkpoint-bound only** — for standalone tasks use `/cbp-standalone-task-testing`
271
-
272
- ## Integration
273
-
274
- - **Reads**: `.codebyplan/state/checkpoints/*.json`, `checkpoints/<id>/tasks/*.json`, `checkpoints/<id>/tasks/<id>/rounds/*.json`, `todos.json` (local-first; `npx codebyplan sync` on miss; MCP `get_current_task`/`get_rounds` break-glass), plus all aggregated files
275
- - **Writes**: `codebyplan task update` (CLI write-through; MCP `update_task` break-glass)
276
- - **Triggers**: `cbp-task-complete` (auto via Skill tool, when ALL PASS — `ask`-tier, permission prompt IS the human gate); `cbp-round-input` (auto via Skill tool, on minor problems — `allow`-tier, fires silently)
277
- - **Triggered by**: `cbp-task-check` auto-triggers this skill via Skill tool on READY verdict; `cbp-task-testing` is `allow`-tier and fires silently (no permission prompt)
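Note on the removed `cbp-task-testing` skill: Step 4 says to collect `files_changed` from all rounds, keep the latest action per path, and skip deleted files when reading in Step 5. The sketch below shows one way that could look. It is written under assumptions: the `FileChange` and `Round` shapes, the `action` values, and both function names are hypothetical, and "latest" is taken to mean the highest round number, which the skill text does not spell out.

```typescript
// Hypothetical shapes; the real round JSON schema is not shown in this diff.
interface FileChange {
  path: string;
  action: "created" | "modified" | "deleted";
}

interface Round {
  number: number;
  files_changed: FileChange[];
}

// Deduplicate by path, letting later rounds overwrite earlier ones.
function aggregateFilesChanged(rounds: Round[]): FileChange[] {
  const latest = new Map<string, FileChange>();
  // Sort a copy so the caller's array is left untouched.
  const ordered = rounds.slice().sort((a, b) => a.number - b.number);
  for (const round of ordered) {
    for (const change of round.files_changed) {
      latest.set(change.path, change);
    }
  }
  return Array.from(latest.values());
}

// Paths worth reading in Step 5: everything whose latest action is not a delete.
function filesToRead(rounds: Round[]): string[] {
  return aggregateFilesChanged(rounds)
    .filter((change) => change.action !== "deleted")
    .map((change) => change.path);
}
```

A file created in round 1 and deleted in round 2 ends up with the action `deleted` and is dropped from the read list, so Step 5 only opens files that still exist in the delivered work.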