
codebyplan 1.13.53 → 1.13.55


Files changed (84)

package/templates/skills/cbp-task-testing/SKILL.md DELETED

@@ -1,279 +0,0 @@
----
-scope: org-shared
-name: cbp-task-testing
-description: Run comprehensive task-level testing after /cbp-task-check passes
-argument-hint: [chk-task]
-triggers: [cbp-task-complete, cbp-round-input]
-effort: xhigh
----
-# Task Testing Command
-Comprehensive task-level testing — runs all automated tests and walks the user through manual testing one-by-one. Distinct from round-level testing (`testing-qa-agent`): this tests the **entire delivered feature holistically** after all rounds are complete. Runs inline — no sub-agent.
-## When Used
-- After `/cbp-task-check` passes with READY verdict (auto-triggered)
-- Before `/cbp-task-complete`
-- **Never skippable**
-## Scope vs Round-Level Validation
-Per-wave `testing-qa-agent` runs inside `/cbp-round-execute` Step 5. This skill adds the cross-cutting layer that is only visible across the full task diff: whole-repo lint, whole-repo typecheck, full test suite, `pnpm audit` (via `codebyplan check --scope task --json`), and full-diff security scan — each run once here, not per-round.
-## Instructions
-### Step 1: Parse `$ARGUMENTS`
-Parse the argument using the canonical chk-task-round notation (see `.claude/rules/notation-consistency.md`):
-| Shape | Regex | Resolves to |
-|-------|-------|-------------|
-| `{chk}-{task}` (e.g. `108-1`) | `^[0-9]+-[0-9]+$` | Checkpoint-bound: CHK-{chk} TASK-{task} |
-| _(empty)_ | — | Resolve from local state per Step 1.5/2 (MCP `get_current_task` break-glass) — the active in-progress task |
-| `{task}` (bare number) | — | **Error**: "Use /cbp-standalone-task-testing {N} instead — bare numbers no longer route to standalone tasks." |
-Anything else is malformed — surface this error and stop:
-```
-task-testing: invalid argument `{value}`. Expected:
-  108-1  → CHK-108 TASK-1 (checkpoint-bound)
-  (empty) → active in-progress task
-For standalone tasks, use `/cbp-standalone-task-testing {N}`.
-For a specific round, use `/cbp-round-update 108-1-2`.
-```
-Error cases: `108-1-2` (that is round-update's shape), `abc`, `108-`, `-1`, `108--1`, anything with whitespace or non-numeric characters.
-#### Worked examples
-- `task-testing 108-1` → CHK-108 TASK-1
-- `task-testing` (no arg) → active in-progress task via `get_current_task`
-- `task-testing 45` → error: "Use /cbp-standalone-task-testing 45 instead — bare numbers no longer route to standalone tasks."
-- `task-testing 108-1-2` → error: "use `/cbp-round-update 108-1-2`"
-- `task-testing abc` → error: malformed
-### Step 1.5: Get Current Task
-Given the parse from Step 1:
-| Parse | Resolution path |
-|-------|-----------------|
-| `{chk}-{task}` | Read `.codebyplan/state/checkpoints/*.json` → filter `number === {chk}`. Read `.codebyplan/state/checkpoints/<id>/tasks/*.json` → filter `number === {task}`. If missing/stale, run `npx codebyplan sync` once and re-read. Break-glass fallback: MCP `get_checkpoints`/`get_tasks` when state dir absent and sync fails. |
-| _(empty)_ | Read `.codebyplan/state/todos.json` → find the active in-progress task. If missing/stale, run `npx codebyplan sync` once and re-read. Break-glass fallback: MCP `get_current_task(repo_id)` when state dir absent and sync fails. |
-If no in-progress task, show error and stop.
-### Step 2: Verify All Rounds Complete
-Read `.codebyplan/state/checkpoints/<checkpointId>/tasks/<taskId>/rounds/*.json` (local-first). If missing/stale, run `npx codebyplan sync` once and re-read. Break-glass fallback: MCP `get_rounds(task_id)` when state dir absent and sync fails. Verify all rounds are `completed`. If any still `in_progress`:
-```
-## Cannot Run Task Testing
-TASK-[N] has an active round (Round [N]). Complete it first:
-- Run `/cbp-round-update` to finish the round
-```
-Stop.
-### Step 3: Verify `/cbp-task-check` Passed
-Check `task.context.check_verdict`: must exist and have `verdict = "READY"`. Otherwise:
-```
-## Cannot Run Task Testing
-`/cbp-task-check` has not passed yet. Run `/cbp-task-check` first.
-```
-Stop.
-### Step 4: Aggregate Files Changed
-Collect all `files_changed` from all rounds, deduplicate (latest action per path wins). Skip deleted files for file-reading in Step 5.
-### Step 5: Read ALL Final Changed Files
-Read every non-deleted file in the aggregated list. Understand the complete delivered work across all rounds. Build a mental model of what was built and how it connects.
-### Step 6: Run Comprehensive Automated Testing
-Capture stdout and stderr for each check.
-**Hard-fail tests** (block completion):
-Run the unified check matrix:
-```bash
-codebyplan check --scope task --json
-```
-Capture the JSON result. The runner is **whole-repo + baseline**: it runs `turbo run lint|typecheck|test` across every package and diffs each per-package result against the committed `.check-baseline.json`, so only NEW per-package failures fail a check. Five checks run for `--scope task`: `gate6` (sibling-identity parity — ALWAYS hard-fail, never baselined), `lint`, `typecheck`, `tests`, and `audit` (`audit.new_failures` lists new GHSA advisory ids not in the allowlist). A baselined check's `status` is `pass` when its `new_failures` array is empty even if the underlying command exited non-zero. If `any_failed === true` (or `hard_fail_checks` is non-empty), this is a hard fail — surface each failing result's `stdout`/`stderr`/`new_failures` and stop.
-For each result entry, record: `category` (from `result.check`), `status` (from `result.status`), `details`, `stdout` (from `result.stdout`), `stderr` (from `result.stderr`), and `new_failures` (from `result.new_failures` — the newly-failing packages / new GHSA ids; the field is omitted/`undefined` for `gate6`, not `null`).
-Additional hard-fail checks (not part of the runner):
-| Category                | Command                         | Condition                        |
-| ----------------------- | ------------------------------- | -------------------------------- |
-| Per-package E2E         | `pnpm --filter <pkg> e2e:test`  | UI files in aggregated_files     |
-| Full-diff security scan | inline grep or `security-agent` | Always                           |
-Per-file lint + format are enforced by `lint-format-on-edit.sh` hook per edit. This step catches cross-package issues invisible to per-wave checks.
-**Soft tests** (report, don't block):
-| Category   | Method                                    | Condition                    |
-| ---------- | ----------------------------------------- | ---------------------------- |
-| Visual     | Screenshot compare via `e2e:visual-check` | UI work + dev server running |
-| API Health | `curl` health endpoint                    | API routes changed           |
-#### Step 6.x: Autonomous Sim Screenshot Validation (mobile / on-device)
-For mobile rounds (Maestro / XCUITest / Tauri-mobile) where unit tests passed but the round touched component-mount code paths (custom hooks, prop signatures, conditional renders, navigation tabs), unit-test green is NOT sufficient evidence that the screen mounts at runtime. Use the autonomous sim screenshot loop to catch runtime crashes invisible to mocked unit tests.
-**Procedure** (when iOS Simulator is the target — adapt the screenshot command for Android/Tauri equivalents):
-1. Confirm the target screen's default state via `Read` of its parent component or store.
-2. If the screen is normally gated behind a tab/route the simulator isn't currently on, temporarily flip the gating state's default value at the screen's entry point (`useState(initialTab) → useState('targetTab')` or equivalent) so the screen mounts on the next reload.
-3. Trigger a Fast Refresh by saving a touched file, then capture: `xcrun simctl io booted screenshot /tmp/codebyplan-task-testing-{task}-{state}.png`.
-4. Read the screenshot via the multimodal Read tool. Confirm the screen rendered (vs blank/crash/red-box error overlay).
-5. Revert the state-default flip from step 2. Confirm the file diff is empty (`git diff <path>`) before proceeding.
-**When to use**: round modifies hook usage, prop signatures, component-tree shape, or store subscriptions on a screen whose unit tests mock the data layer. Skip when the modified code path has no UI surface (pure utilities, server actions).
-**When NOT to use**: don't flip state defaults on the main app entry / auth gate / feature-flag boundaries — the revert risk is too high. Use storybook or a dedicated `__dev__` tab if the screen has cross-cutting state.
-Record the result in `task_testing_output.autonomous_sim_check`:
-```yaml
-autonomous_sim_check:
-  screen: "<screen-name>"
-  status: "rendered" | "crashed" | "blank"
-  screenshot_path: "/tmp/codebyplan-task-testing-..."
-  state_flip_reverted: true
-```
-This technique uniquely catches Rules-of-Hooks violations and prop-shape mismatches that mocked unit tests cannot detect — the runtime hook scheduler is the only oracle.
-### Step 6.5: Cross-Round Code Review
-Round-level code review runs per-round via `improve-round` at `/cbp-round-end`. This step adds the cross-round holistic layer — things only visible once all rounds are aggregated.
-Inline review (no sub-agent) across the aggregated files read in Step 5. Check:
-| Concern                   | What to Look For                                                                                                                                 |
-| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
-| Leftover debug            | `console.log`, `debugger`, commented-out blocks, `TODO`/`FIXME` added during this task                                                           |
-| Cross-round duplication   | Same helper/logic written independently in 2+ rounds — candidate for extraction                                                                  |
-| Convention drift          | One round introduces a pattern (error handling, naming, file layout) that contradicts a pattern established in an earlier round of the same task |
-| Incomplete follow-through | A round added a type/field/table column that later rounds never consume                                                                          |
-| Orphaned additions        | Exports or utilities added in an early round with no callers after later rounds refactored past them                                             |
-For each finding, record: `{category, file, description, severity: 'low'|'medium'|'high', suggested_fix}`.
-Findings with severity `medium` or `high` feed the Step 9 problem classification. `low` findings are recorded in `task_testing_output` for the record but do not block.
-If any finding points to a need that exceeds task scope (e.g. a utility worth extracting for the wider codebase, a convention the repo should adopt globally), route per `immediate-issue-capture.md` "How to Capture" — default to a NEW TASK in the current checkpoint, not a standalone task. Standalone routing applies only when the finding is genuinely off-axis from every active checkpoint AND the user has confirmed standalone routing.
-### Step 7: Separate Claude-Testable vs User-Testable
-**Claude handles automatically** (Step 6): build, types, unit tests, E2E tests, visual, API health.
-**User must verify** (requires human judgment):
-- Visual appearance quality (does it look good?)
-- UX flow (is the interaction intuitive?)
-- Business logic correctness (does it do the right thing?)
-- Edge cases (unusual inputs, boundary conditions)
-- Cross-browser / real-device behavior
-- Content accuracy (text, labels, messages)
-Generate user test items based on: task requirements, changed files, round context.
-### Step 8: User Testing Walkthrough
-Present all user-testable items as a **single checklist in one `AskUserQuestion` prompt**. Do not ask one question per item — the batched format is preferred.
-Format the question so every item is visible in the checklist, with a single overall answer (e.g., "all pass", "minor issues", "major issues"). Provide the description, how-to-test steps, and expected result per item inside the question body. If the user reports mixed results, collect the specifics in a follow-up.
-Record the aggregate response and any per-item notes.
-### Step 9: Classify Problems
-Collect failures from automated tests (Step 6), cross-round code review (Step 6.5, medium+), and user tests (Step 8). Classify:
-- **Minor** (round-fixable): styling, small bugs, missing edge cases, localized duplication
-- **Major** (new-task-worthy): architectural issues, missing features, fundamental design problems, convention drift that spans multiple files
-### Step 10: Save Results
-`codebyplan task update --id <taskId> --checkpoint-id <checkpointId> --context '<json>'` (CLI write-through: local state file + REST), merging `task_testing_output` into the existing context object. Break-glass fallback: MCP `update_task` when CLI is unavailable.
-```ts
-// context payload to merge:
-{
-  task_testing_output: {
-    claude_tests: [...],
-    cross_round_code_findings: [...],   // from Step 6.5
-    user_tests: [...],
-    problems_found: [...],
-    all_passed: boolean,
-    summary: { total, passed, failed, pending }
-  }
-}
-```
-### Step 11: Route Based on Results
-**ALL PASS:**
-```
-All tests passed for TASK-[N]. Routing to task-complete...
-```
-Invoke `cbp-task-complete` via the Skill tool. `cbp-task-complete` is `ask`-tier — the harness
-permission prompt IS the human gate; the user confirms (or declines) before task commit,
-merge-main, and completion.
-**Minor problems found:**
-Invoking `cbp-round-input` to address the minor issues found during testing...
-Invoke `cbp-round-input` via the Skill tool. `cbp-round-input` is `allow`-tier — it auto-fires
-silently.
-**Major problems found:**
----
-**Next:**
-Run `/cbp-task-create` to:
-- Create a new task for the identified issues
----
-Waiting for user to run `/cbp-task-create`.
-**User wants re-test:** Suggest re-running `/cbp-task-testing`.
-## Key Rules
-- **Never skippable** — mandatory before `/cbp-task-complete`
-- **Must loop until everything passes** — problems must be addressed
-- **No file changes** — testing only, never edit
-- **Batch user tests** — present all user-testable items in a single `AskUserQuestion` checklist; never one-per-question
-- **Read actual files** — do not rely on metadata alone
-- **Run actual commands** — capture real stdout/stderr
-- **Checkpoint-bound only** — for standalone tasks use `/cbp-standalone-task-testing`
-## Integration
-- **Reads**: `.codebyplan/state/checkpoints/*.json`, `checkpoints/<id>/tasks/*.json`, `checkpoints/<id>/tasks/<id>/rounds/*.json`, `todos.json` (local-first; `npx codebyplan sync` on miss; MCP `get_current_task`/`get_rounds` break-glass), plus all aggregated files
-- **Writes**: `codebyplan task update` (CLI write-through; MCP `update_task` break-glass)
-- **Triggers**: `cbp-task-complete` (auto via Skill tool, when ALL PASS — `ask`-tier, permission prompt IS the human gate); `cbp-round-input` (auto via Skill tool, on minor problems — `allow`-tier, fires silently)
-- **Triggered by**: `cbp-task-check` auto-triggers this skill via Skill tool on READY verdict; `cbp-task-testing` is `allow`-tier and fires silently (no permission prompt)
-- **ci.json awareness**: `codebyplan check --scope task --json` is turbo-native — it runs `turbo run lint|typecheck|test` directly and does NOT read `.codebyplan/ci.json`. ci.json command resolution (via `npx codebyplan ci resolve <category> [--platform <slug>]`) is used by non-check consumers (`cbp-testing-qa-agent`, `cbp-security-agent`, `cbp-standalone-task-testing`, `cbp-checkpoint-check`), with a central-default fallback ensuring exit 0 even when ci.json is absent.
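
The argument rules in Step 1 of the deleted skill can be sketched as a small parser. This is only an illustration of the table and worked examples above, not code from the package: the names `ParsedArg` and `parseTaskTestingArg` are hypothetical, and the error strings are abbreviated from the messages quoted in the skill.

```typescript
// Hypothetical sketch of the Step 1 argument rules; not part of codebyplan.
type ParsedArg =
  | { kind: "checkpoint-task"; chk: number; task: number }
  | { kind: "active-task" }
  | { kind: "error"; message: string };

// Same shape as the `^[0-9]+-[0-9]+$` regex in the Step 1 table, with capture groups added.
const CHK_TASK = /^([0-9]+)-([0-9]+)$/;
// Assumed helper patterns used to pick the more specific error message.
const BARE_NUMBER = /^[0-9]+$/;
const CHK_TASK_ROUND = /^[0-9]+-[0-9]+-[0-9]+$/;

function parseTaskTestingArg(raw: string): ParsedArg {
  // Empty argument: resolve the active in-progress task later (Step 1.5).
  if (raw === "") {
    return { kind: "active-task" };
  }

  const match = CHK_TASK.exec(raw);
  if (match) {
    return {
      kind: "checkpoint-task",
      chk: Number(match[1]),
      task: Number(match[2]),
    };
  }

  // Bare numbers are rejected and pointed at the standalone skill.
  if (BARE_NUMBER.test(raw)) {
    return {
      kind: "error",
      message: "Use /cbp-standalone-task-testing " + raw + " instead",
    };
  }

  // A three-part value is the round-update shape, not a task.
  if (CHK_TASK_ROUND.test(raw)) {
    return { kind: "error", message: "use /cbp-round-update " + raw };
  }

  // Everything else (letters, whitespace, dangling dashes) is malformed.
  return {
    kind: "error",
    message: "task-testing: invalid argument " + JSON.stringify(raw),
  };
}
```

Under these assumptions, `parseTaskTestingArg("108-1")` yields a checkpoint-bound result for CHK-108 TASK-1, the empty string defers to the active task, and `45`, `108-1-2`, `abc`, `108-`, `-1`, and `108--1` all come back as errors. The input is deliberately not trimmed, because the skill treats any whitespace as malformed.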