@glrs-dev/harness-plugin-opencode (npm) 0.3.1 → 1.0.1

Files changed (26)

package/CHANGELOG.md +192 -0
package/dist/agents/prompts/pilot-builder.md +18 -3
package/dist/agents/prompts/pilot-planner.md +19 -9
package/dist/chunk-57EOY72Y.js +174 -0
package/dist/chunk-5TAMY7P6.js +67 -0
package/dist/chunk-BKTFWXLG.js +204 -0
package/dist/chunk-KB7M7JXU.js +145 -0
package/dist/chunk-RNRCXQ65.js +56 -0
package/dist/cli.js +955 -1453
package/dist/index.js +1 -1
package/dist/paths-LT3QQKCF.js +18 -0
package/dist/pilot/mcp/status-server.d.ts +1 -0
package/dist/pilot/mcp/status-server.js +228 -0
package/dist/pilot-config-7LJZ23YK.js +55 -0
package/dist/runs-QWPL3TKV.js +18 -0
package/dist/safety-gate-WM3EWOCY.js +10 -0
package/dist/setup-hook-FHTXMAQL.js +88 -0
package/dist/skills/adr/SKILL.md +328 -0
package/dist/skills/pilot-planning/SKILL.md +40 -13
package/dist/skills/pilot-planning/rules/decomposition.md +27 -0
package/dist/skills/pilot-planning/rules/self-review.md +1 -1
package/dist/skills/pilot-planning/rules/touches-scope.md +34 -0
package/dist/skills/pilot-planning/rules/verify-design.md +78 -14
package/dist/tasks-KJ3WN2KY.js +32 -0
package/package.json +4 -2
package/dist/skills/pilot-planning/rules/setup-authoring.md +0 -68

package/dist/skills/adr/SKILL.md ADDED

@@ -0,0 +1,328 @@
+---
+name: adr
+description: "Use when drafting, revising, or reading any engineering ADR in `docs/adr/`. Encodes grounding steps, the mandatory section template, the Unspecified-interactions-vs-Open-questions rubric, the security-default-deny rule, and self-check red flags. Use when the task is to write an ADR, draft an architecture decision, produce a design doc for a schema/contract/cross-package change, propose a new table/entity, or capture a consequential decision. Do NOT draft an ADR without this skill loaded."
+---
+# Engineering ADR Skill (docs/adr/)
+Purpose: every engineering ADR in this repo starts from the same
+opinionated foundation. Read prior ADRs in `docs/adr/` before drafting
+(see Step 1) — each one's lessons compound.
+This skill describes **what** to do and **how** to structure an ADR.
+It deliberately does NOT prescribe a review process — how an ADR gets
+scrutinized before merge is up to whoever is shipping it and whichever
+harness or team workflow applies. The skill's job is to make the draft
+good; the review process is a separate concern.
+## When you MUST load this skill
+- Drafting a new file in `docs/adr/`.
+- Revising an existing ADR (even a typo-sized change — you may trip
+  one of the red flags below).
+- Reading an existing ADR to understand a past decision, if you need
+  to write a supersession or cite its pattern.
+## When this skill does NOT apply
+- Product decisions (if a `docs/product/` directory exists, use that).
+- LLM-feature proposals (if a dedicated template exists, use that).
+- Implementation plans, task breakdowns, build sequencing — Linear
+  issues or plan files.
+- Bug fixes, refactors, single-PR work — Linear issue, no ADR.
+## The iron rules (five rules; every ADR should honor them)
+1. **Ground before you draft.** Run the grounding checklist below
+   BEFORE writing the Decision section. Invented table/column/module
+   names are the #1 cause of ADR rework.
+2. **Section order is frozen** (see Template). Don't reorder. Don't
+   omit. A missing section is a signal you skipped work, not that the
+   work wasn't needed.
+3. **Security-sensitive capabilities DEFAULT DENY.** Every new role
+   grant, every new partner scope, every new cross-org read path
+   starts in the `off` position with an explicit, logged
+   per-principal enablement path. "Probably fine" is not a stance.
+4. **Cross-system couplings go in `Consequences -> Unspecified
+   interactions`, not `Open questions`.** See the rubric below.
+5. **"Pre-implementation codebase investigation" items must be
+   genuinely unknown at write time.** If it's "verify my bullets are
+   right", it's already your job — do it before drafting.
+## Step 1: Grounding (mandatory, before drafting)
+This is not optional. Perform each step and capture the real
+names/paths in a scratch note you'll use while drafting:
+1. **Discover prior ADRs.** Read existing ADRs in `docs/adr/` to
+   understand established conventions and patterns. If an `adr-index`
+   MCP tool is available, use it to find ADRs by subject-area tags.
+   Otherwise, list and skim the directory. Pay particular attention to
+   conventions in each ADR's `establishes` frontmatter — those are in
+   force (unless a later ADR's `supersedes:` includes it).
+2. **Read every referenced file.** For the decision you're about to
+   make, identify the 3-10 existing files/tables/contracts your ADR
+   will touch or adjoin. Read them. Copy real symbol names into your
+   scratch note — do not paraphrase from memory.
+3. **Grep-verify every table, column, entity, and symbol name before
+   it lands in the draft.** Use AST-aware symbol lookup for code
+   symbols where available; fall back to `grep`. An invented name in
+   the Decision section is the #1 cause of ADR rework.
+4. **Identify the access/tenancy story.** Is the new entity scoped to
+   a user, an org, global, or cross-tenant? Confirm it follows
+   existing access patterns and doesn't accidentally bypass them.
+5. **Identify every touched contract.** Internal vs external, file
+   paths, permission keys. The ADR must cite the real file paths.
+6. **Identify circuit breakers and cross-system coupling.** List every
+   module/table/entity whose behavior will change because of this
+   decision.
+7. **Decide whether this ADR warrants a follow-up project.** If the
+   decision produces 3+ implementable issues, file a project when the
+   ADR merges. Small decisions that land in one PR don't need one.
+Only after these seven steps do you touch the template.
+## Step 2: Template (frozen section order)
+```markdown
+---
+touches: [<coarse subject-area tags>]
+establishes:
+  - <convention-slug-this-adr-introduces>
+  - <another-convention-if-any>
+supersedes: []  # or [<prior-adr-filename-without-.md>] if this replaces one
+---
+# ADR: <Short decision title>
+---
+---
+## 1. Context
+What system state exists today, cited with real file paths + symbol
+names. Who the actors/roles are. What's broken, missing, or ambiguous.
+Include a "Prior art in this repo" subsection listing existing
+patterns that inform or constrain the decision.
+## 2. Decision
+What we will do, subsectioned by concern:
+  2.1 Data model (if any — new tables/columns/enums with real names)
+  2.2 Resolution / runtime semantics (pure functions, state transitions)
+  2.3 External API contract (paths, verbs, schemas, file locations)
+  2.4 Internal API contract (same)
+  2.5 UI design (surfaces, routes, key flows, broken-state treatment)
+  2.6 External integration surface (third-party APIs, adapters, etc.)
+  2.7 Role-based access matrix (see iron rule #3)
+  2.8 Migration strategy (new table? rename? backfill? legacy handling?)
+Execution planning — merge units, task sequencing, PR boundaries —
+does NOT belong in an ADR. Those are implementation concerns tracked
+separately. If a project exists for the decision, the project is
+where sequencing lives, not here.
+## 3. Consequences
+### Positive
+### Negative / trade-offs
+### Neutral / noted
+### Unspecified interactions with existing mechanisms
+  (see rubric below; this subsection is mandatory if any exist)
+## 4. Alternatives considered
+Alt 1, Alt 2, ..., each with a one-paragraph rejection reason. Include
+the genuinely-considered options; don't straw-man. If only one
+alternative existed, this section is a red flag — you haven't
+explored the decision space.
+## 5. Decision linkages
+Consumers, dependencies, blockers, future extensions, what this ADR
+establishes (e.g. a new convention).
+## 6. Open questions
+ITERATE UNTIL EMPTY. An ADR should not merge with unresolved open
+questions. Each question is either: (a) answerable now — answer it
+inline and move to a "Resolved during drafting" appendix, or (b) a
+blocker that requires external input — in which case the ADR is not
+ready to merge. Do not use this section as a parking lot for
+laziness. If you can grep the codebase or reason through the
+tradeoffs to resolve a question, do it before declaring the draft
+complete.
+Format when all questions are resolved:
+  "None. All questions resolved during drafting:"
+  followed by a "### Resolved during drafting" subsection with
+  numbered answers preserving the original question for traceability.
+## 7. Pre-implementation codebase investigation
+ITERATE UNTIL EMPTY. Same rule as S6. Every item here must be
+resolved before the ADR merges — either by doing the investigation
+during drafting (preferred) or by explicitly blocking the ADR on the
+investigation. An ADR with unresolved S7 items is an ADR that will
+produce wrong implementation work.
+Format when all items are confirmed:
+  "None. All items confirmed during drafting:"
+  followed by a "### Resolved during drafting" subsection with
+  numbered findings.
+## 8. References
+Every file cited, every external doc, every ticket/issue, and the
+convention this ADR establishes or modifies.
+```
+Sections with no content in your decision: write "Not applicable" and
+one sentence explaining why. Do not delete the heading.
+### Frontmatter contract
+The YAML frontmatter is the **only** machine-readable metadata on an
+ADR. There is no prose header block — no `Date`, no `Authors`, no
+status. The date is in the filename, authorship is in `git log`,
+and whether an ADR is in force is determined by Git (on `main` = in
+force; named in a later ADR's `supersedes:` = superseded).
+Duplicating any of this in the body would create drift. The body
+opens straight with the `# ADR: <title>` heading and goes to S1
+Context.
+The frontmatter carries only facts about the ADR's content, never
+state or intent about implementation follow-through (whether a
+project gets created, whether the decision has been acted on, etc. —
+those are independently observable and don't belong here).
+Rules:
+- **`touches`** — inline list of coarse subject-area tags. Err toward
+  more tags — matching is cheap, missing a cross-reference is
+  expensive.
+- **`establishes`** — block list of convention slugs this ADR
+  introduces (kebab-case; descriptive, not clever). These are what
+  future ADR authors discover when their decision is constrained by
+  conventions you set.
+- **`supersedes`** — list of prior ADR filenames (without `.md`) that
+  this ADR replaces. Empty for most ADRs. Supersession lives in the
+  superseding ADR's frontmatter, not as a flag on the superseded ADR —
+  that one stays unchanged on `main` as a truthful historical record.
+## Rubric: Unspecified interactions vs. Open questions vs. Pre-implementation investigation
+This is the most common ADR failure. Use this table:
+| Item type | Goes in | Test |
+|---|---|---|
+| A coupling we know exists in the codebase today that this decision changes or newly touches, but we deliberately are not specifying here | `Consequences -> Unspecified interactions with existing mechanisms` | "Implementers need to know about X coupling to avoid breaking it." |
+| A design sub-decision we deferred because it isn't blocking and has multiple valid answers | `Open questions` | "A reasonable person could answer this two ways and either is defensible; we'll pick one during implementation." |
+| A fact we don't know yet about the codebase that must be verified before the first PR | `Pre-implementation codebase investigation` | "The answer is knowable by grepping / reading code, not by discussion." |
+If an item is really "I haven't done my homework" dressed up as an
+open question, it fails this rubric. Do the homework or move it to
+Pre-implementation investigation with a specific grep/read
+prescribed.
+## Security default-deny rule (iron rule #3, expanded)
+For every capability that can:
+- Write to another user's/org's data
+- Stamp long-lived credentials used on outbound traffic
+- Grant a partner/API-key/integration-user role any verb beyond
+  `read` on its own scope
+the ADR must:
+1. Default to `off` (not-granted). Do not write "probably fine, worth
+   confirming."
+2. Specify the enablement mechanism: who grants it, where it's logged,
+   and how it's revoked.
+3. State the blast radius if the grant is misused (a mistaken or
+   compromised principal).
+4. Name the expected flow without the grant (what does the actor do
+   instead?).
+## Red flags — author self-check
+These are common failure modes observed across ADRs. Use this list as
+a self-check before you consider a draft complete.
+- Any table, column, enum, or code symbol in your draft has not been
+  grep-confirmed against the actual codebase.
+- Your Decision section says "probably fine" about a security grant.
+  Make it default-deny.
+- You have zero alternatives in S4 beyond the chosen one.
+- Your S7 "pre-implementation investigation" reads like "verify my
+  bullets are right." Move these to grounding and do them now.
+- A coupling with existing mechanisms is not mentioned. If you
+  honestly looked and found none, state that.
+- Your ADR introduces a new enum/channel/role/surface whose naming
+  collides with an existing one.
+- Your S2 Decision subsections leak into execution planning — merge
+  units, PR boundaries, task sequencing. That belongs in issues, not
+  in the ADR.
+- Your UI section doesn't describe the broken-state case (what
+  happens when a referenced entity is archived/inactive/missing).
+- Your migration section doesn't describe the down() path.
+- Your S6 Open questions are really S3 Unspecified interactions (they
+  describe *existing* couplings, not *deferred* design decisions).
+- Your S6 or S7 has unresolved items. Both sections must be iterated
+  to empty before the ADR merges. If you can answer a question by
+  reading code or reasoning through tradeoffs, do it now — don't
+  defer to implementation what you can resolve during drafting.
+- Your ADR is missing YAML frontmatter. Without frontmatter, the ADR
+  is invisible to discovery and future authors will rediscover your
+  lessons from scratch.
+- A convention you introduce in S2 is not listed in `establishes:`
+  frontmatter. Future ADRs can't find that it exists.
+## Inline-vs-follow-on decision rubric
+When you discover during drafting that a sub-decision is bigger than
+you thought:
+- **Inline it** if: the sub-decision touches <=3 files, introduces no
+  new abstractions, and doesn't shift the boundary of any existing
+  subsystem.
+- **Follow-on ADR** if: crosses a package boundary you haven't
+  mapped, introduces a new abstraction (new model pattern, new
+  helper), or requires re-architecting an existing subsystem.
+- **Resolve it now** if: you can answer the question by reading code
+  or reasoning through tradeoffs. S6 must be empty at merge — don't
+  defer what you can decide during drafting.
+A follow-on ADR is cited in S5 Decision linkages as a "Blocker" or
+"Future extension."
+## File placement and naming
+- **Location:** `docs/adr/`.
+- **Filename:** `YYYY-MM-DD-<slug>.md`. ISO date (authored date),
+  kebab-case slug, 3-7 words.
+- **Branch name:** `docs/<slug>` or `<user>/<ticket>-<slug>` if
+  tracked by an issue.
+## Commit sequence
+1. Verify the frontmatter block parses (no tabs, list items use
+   `  - ` indent). Check that `touches` tags are meaningful and any
+   new conventions are listed in `establishes`.
+2. `git add docs/adr/<file>.md`
+3. Commit message: `docs(adr): <title>`.
+4. Push branch and open PR. Link the issue in the PR body if one
+   exists.
+5. If the decision warrants a follow-up project (per grounding step
+   7), create the project on merge and link it from the ADR's S5
+   Decision linkages in a follow-up commit.
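
The filename convention and the supersession rule from the ADR skill above can be sketched as two small checks. This is a minimal sketch, not code from the package: the `AdrMeta` shape and both function names are hypothetical, the date check is syntactic only, and the sketch does not verify that a superseding ADR is actually dated later than the one it replaces.

```typescript
// Hypothetical shape: only the fields the skill's frontmatter contract names.
interface AdrMeta {
  filename: string;     // e.g. "2025-01-15-rule-engine-cache-invalidation.md"
  supersedes: string[]; // prior ADR filenames without ".md"
}

// Checks `YYYY-MM-DD-<slug>.md`: ISO-shaped date, kebab-case slug, 3-7 words.
// Syntactic only: a date such as 2025-02-31 is not rejected.
function isValidAdrFilename(filename: string): boolean {
  const match = /^(\d{4})-(\d{2})-(\d{2})-([a-z0-9]+(?:-[a-z0-9]+)*)\.md$/.exec(filename);
  if (match === null) return false;
  const month = Number(match[2]);
  const day = Number(match[3]);
  if (month < 1 || month > 12 || day < 1 || day > 31) return false;
  const words = match[4].split("-").length;
  return words >= 3 && words <= 7;
}

// An ADR is treated as in force unless another ADR names it in `supersedes:`.
// Assumption: every entry in `adrs` is already on `main`.
function adrsInForce(adrs: AdrMeta[]): string[] {
  const superseded = new Set<string>();
  for (const adr of adrs) {
    for (const name of adr.supersedes) superseded.add(name);
  }
  return adrs
    .map((adr) => adr.filename.replace(/\.md$/, ""))
    .filter((name) => !superseded.has(name))
    .sort();
}
```

Keeping supersession in the newer ADR's frontmatter means the in-force set is derived by reading all files, so the superseded file never needs an edit.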

package/dist/skills/pilot-planning/SKILL.md CHANGED

@@ -11,7 +11,7 @@ A good plan trades a planning-session's worth of patient thought for hours of un
 ## Workflow
-Apply these ten rules in order. Each rule has its own file in `rules/` for the full text:
+Apply these nine rules in order. Each rule has its own file in `rules/` for the full text:
 1. [`first-principles.md`](rules/first-principles.md) — Frame the task FROM the user's intent, not from a templated checklist. Ask "what does the user actually want done?" before "what files might change?"
@@ -25,29 +25,56 @@ Apply these ten rules in order. Each rule has its own file in `rules/` for the f
 6. [`milestones.md`](rules/milestones.md) — Optional grouping. Use when several tasks share a "is this batch done?" check (e.g. integration tests after a chunk of unit-test work).
-7. [`self-review.md`](rules/self-review.md) — Before declaring the plan ready, run through a 7-question checklist. Find the holes yourself; the validator only catches schema errors.
+7. [`self-review.md`](rules/self-review.md) — Before declaring the plan ready, run through a 7-question checklist. Find the holes yourself; the validator only catches schema errors. And before declaring "refuse", revisit the bundle-vs-split decision below.
 8. [`task-context.md`](rules/task-context.md) — Every non-trivial task carries a `context:` block. Thin plans fail because the builder works each task from scratch with no carry-over; rich context pre-loads what the builder needs to work confidently. Cover outcome, rationale, code pointers, acceptance.
-9. [`setup-authoring.md`](rules/setup-authoring.md) — Detect → propose → confirm the top-level `setup:` block. Covers package manager install, docker-compose services, and migration tooling detection.
-10. [`qa-expectations.md`](rules/qa-expectations.md) — Detect → propose → confirm per-surface verify patterns for UI, API, DB, integration, browser-based component, and CLI surfaces.
+9. [`qa-expectations.md`](rules/qa-expectations.md) — Detect → propose → confirm per-surface verify patterns for UI, API, DB, integration, browser-based component, and CLI surfaces.
 ## After applying the rules
 1. Save the YAML to the path returned by `bunx @glrs-dev/harness-plugin-opencode pilot plan-dir`.
-2. Run `bunx @glrs-dev/harness-plugin-opencode pilot validate <path>` and fix every error / warning.
-3. Hand off to the user with: `Plan saved to <path>. Next: bunx @glrs-dev/harness-plugin-opencode pilot build`.
+2. Remind the user the plan assumes their dev stack is already running (install, compose, migrate, seed). Plans no longer bootstrap their own environment.
+3. Run `bunx @glrs-dev/harness-plugin-opencode pilot validate <path>` and fix every error / warning.
+4. Hand off to the user with: `Plan saved to <path>. Next: bunx @glrs-dev/harness-plugin-opencode pilot build`.
 Do NOT summarize the plan in chat. The user can read the YAML.
+## When to bundle vs. split plans
+Multi-issue cross-cutting plans are a first-class pilot shape. When a user's scope spans 2–4 related issues, default to **one plan** covering all of them — as long as they share:
+- Same repo (or monorepo).
+- Same package manager / install command.
+- Same `docker-compose` (or equivalent local-infra) stack.
+- Same test runner and verify style.
+- Same migrations/seed pipeline.
+Bundling amortizes setup cost (install, compose up, migrate, seed — minutes each, paid once per pilot run) across all the work. Tasks from different issues typically form disconnected subtrees in the DAG — see [`dag-shape.md`](rules/dag-shape.md)'s "Disconnected" pattern. Task-level `cascadeFail` only blocks transitive dependents, so a failure in one subtree does NOT cascade into its siblings.
+**Split into separate pilot plans when:**
+- Issues live in different repositories.
+- Issues require fundamentally different setup environments.
+- Issues have fundamentally different acceptance shapes (e.g., automated typecheck vs. manual operator playbook).
+See [`decomposition.md`](rules/decomposition.md) "Plan sizing — count of tasks" for more.
 ## When to refuse
-If, after applying the methodology, you cannot produce a plan with at least:
+Refuse ONLY when the **work itself** is underspecified or ambiguous — no concrete acceptance criteria, no clear "done" condition. Examples that warrant refusal:
+- "Make the API better."
+- "Refactor auth."
+- "Clean up tech debt."
+These don't name specific behaviors the pilot-builder can verify. Ask the user to narrow the scope before planning.
+**Do NOT refuse for:**
-- 2 tasks
-- Each with non-trivial verify
-- Each with tight `touches`
-- A coherent DAG
+- Plan size (5–30 tasks is fine; even more is fine when the work is well-defined).
+- Multi-issue scope (2–4 related issues in one plan is first-class — see "When to bundle" above).
+- Disconnected-subtree DAG shape (tasks from different concerns don't need artificial edges).
+- Concerns about PR shape (that's a reviewer decision; the pilot run can produce one PR or several).
-…tell the user the work isn't ready for pilot. Suggest they break it down themselves first, or use the regular `/plan` agent (markdown plans, human-driven execution). It is far better to refuse than to ship a bad plan.
+When you do refuse: tell the user honestly and specifically what's missing. Suggest the regular `/plan` agent (markdown plans, human-driven execution) for ambiguous work that needs human iteration before it's pilotable. It is far better to refuse an unspecified request than to ship a plan full of `echo done` verifies — but narrow what "bad plan" means. Ambitious is not bad; ambiguous is bad.
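
The bundle-vs-split checklist added to `SKILL.md` above amounts to comparing a handful of environment facts across issues. The following is a rough sketch under stated assumptions: the `IssueEnv` fields and the `bundleOrSplit` name are illustrative and do not come from the package, and the "acceptance shape" criterion is not modeled because it is a judgment call rather than a comparable value.

```typescript
// Hypothetical description of what each issue needs; field names are illustrative.
interface IssueEnv {
  repo: string;
  packageManager: string;
  composeStack: string;
  testRunner: string;
  migrationPipeline: string;
}

type Verdict = { bundle: true } | { bundle: false; reasons: string[] };

// Bundle when every shared-infrastructure field agrees across all issues;
// otherwise report which fields differ so the planner can explain the split.
function bundleOrSplit(issues: IssueEnv[]): Verdict {
  const keys: (keyof IssueEnv)[] = [
    "repo",
    "packageManager",
    "composeStack",
    "testRunner",
    "migrationPipeline",
  ];
  const reasons: string[] = [];
  for (const key of keys) {
    const values = new Set(issues.map((issue) => issue[key]));
    if (values.size > 1) {
      reasons.push(`${key} differs: ${Array.from(values).join(", ")}`);
    }
  }
  return reasons.length === 0 ? { bundle: true } : { bundle: false, reasons };
}
```

Returning the differing fields, not just a boolean, matches the skill's instruction to tell the user specifically what is in the way.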

package/dist/skills/pilot-planning/rules/decomposition.md CHANGED

@@ -34,3 +34,30 @@ A "right-sized" pilot task is one the pilot-builder can complete in a single ses
 ## When you can't decompose
 If the work genuinely doesn't decompose (e.g., a 200-line algorithm that has to land atomically), it might not be a fit for pilot. Tell the user; they may want to run it as a regular `/build` task instead.
+## Plan sizing — count of tasks
+Per-task size is covered above. Plan-level size (total task count) is a different dimension and has its own sweet spot: **roughly 5–30 tasks per `pilot.yaml`**. Outside this range:
+- **Fewer than 5 tasks:** usually means the work is a single change that doesn't benefit from the pilot harness. Consider `/plan` + `/build` instead.
+- **More than 30 tasks:** fine in principle, but at that size the plan probably spans enough distinct concerns that a human reviewer will want it split — not a pilot problem, a PR-shape problem.
+### Multi-issue cross-cutting plans are a first-class shape
+It is **normal and correct** for a single pilot plan to span 2–4 related issues (Linear tickets, GitHub issues) **when those issues share setup and verify infrastructure** — same repo, same package manager, same `docker-compose`, same test runner, same migrations. Reasons to bundle:
+- **Setup amortization.** `pnpm install`, `docker compose up`, `pnpm db:migrate`, seed scripts — each of these is minutes of wall time. Running them once per pilot session vs. once per Linear issue saves hours across a multi-issue push.
+- **Context reuse.** The builder learns the codebase through reading during early tasks; that context benefits every subsequent task in the run.
+- **Shared acceptance.** Cross-issue integration checks (a milestone-close verify that exercises all three issues' changes together) are natural in one plan, awkward across three runs.
+**Reference shape (not a red flag):** rule-engine cleanup + LISTEN/NOTIFY cache invalidation + read-only admin UI landed together in one plan of ~19 tasks across 4 milestones, covering 3 Linear issues. This is the shape pilot is built for.
+When bundling, the tasks from different issues typically form **disconnected subtrees** in the DAG (no real semantic dependency between them). That's fine — see [`dag-shape.md`](dag-shape.md)'s "Disconnected" pattern. Task-level `cascadeFail` only blocks transitive dependents, so a failure in one subtree doesn't cascade into the siblings.
+### When to split instead of bundle
+Split into separate pilot plans when:
+- The issues live in **different repositories**.
+- The issues require **fundamentally different setup environments** (e.g., one needs Postgres + Temporal, the other needs a headless browser grid — sharing setup is worse than paying the cost twice).
+- The issues have **fundamentally different acceptance criteria** (e.g., one is a TypeScript refactor verified via typecheck, the other is an infrastructure change verified via a manual operator playbook — no shared verify makes sense).
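
The plan-sizing guidance and the setup-amortization argument in `decomposition.md` above reduce to simple arithmetic. Here is a small sketch, assuming hypothetical names (`advisePlanSize`, `setupMinutesSaved`) and treating the 5 and 30 thresholds as the rough guide the rule describes rather than hard limits enforced by pilot.

```typescript
type PlanSizeAdvice =
  | "consider-plan-and-build"          // fewer than 5 tasks
  | "sweet-spot"                       // roughly 5-30 tasks
  | "fine-but-reviewer-may-want-split"; // more than 30 tasks

// Maps a total task count onto the three ranges the rule text describes.
function advisePlanSize(taskCount: number): PlanSizeAdvice {
  if (!Number.isInteger(taskCount) || taskCount < 1) {
    throw new RangeError(`taskCount must be a positive integer, got ${taskCount}`);
  }
  if (taskCount < 5) return "consider-plan-and-build";
  if (taskCount <= 30) return "sweet-spot";
  return "fine-but-reviewer-may-want-split";
}

// Setup wall time avoided by running setup once for a bundled plan
// instead of once per issue. The per-run figure is supplied by the caller.
function setupMinutesSaved(setupMinutesPerRun: number, issueCount: number): number {
  if (!Number.isInteger(issueCount) || issueCount < 1) {
    throw new RangeError(`issueCount must be a positive integer, got ${issueCount}`);
  }
  return setupMinutesPerRun * (issueCount - 1);
}
```

For the reference shape mentioned in the rule (about 19 tasks covering 3 issues), `advisePlanSize(19)` lands in the sweet spot, and bundling avoids two of the three setup runs.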

package/dist/skills/pilot-planning/rules/self-review.md CHANGED

@@ -16,7 +16,7 @@ The validator catches schema, DAG, and glob errors. It cannot catch "this verify
 5. **Are there missing edges?** Look at every pair of tasks that share files in their `touches:`. Do they need an order? If T2's verify exercises code T1 introduces, T2 depends on T1 — even if their `touches:` don't overlap.
-6. **Can the plan recover from a per-task failure?** If T3 fails, the cascade-fail blocks T4 onward. Is the resulting "failed=T3, blocked=[T4..T7]" state useful for the human operator? Or did you concentrate too much value into T3 such that its failure is catastrophic?
+6. **Does the DAG concentrate too much value in one task?** Task-level `cascadeFail` only blocks transitive DEPENDENTS of the failed task — sibling subtrees in a disconnected DAG keep running. So plan size is not itself a risk. The real risk is a task everything else depends on: a schema migration that all downstream work reads, a core-type definition all imports reference, a shared config every consumer parses. If THAT task fails, the whole run stalls. Is there such a task in your plan? If yes, can it be simplified — smaller diff, tighter verify, higher success probability? Don't over-concentrate; a plan where 80% of tasks depend on T1 and T1 is complex is fragile by design.
 7. **Could you read this plan in 6 months and understand it?** Plan names + task titles + prompts should be a self-explanatory summary of the work. If the plan needs a verbal preamble to make sense, rewrite the prompts.
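
The revised question 6 in `self-review.md` above rests on one property: a failure blocks only transitive dependents, so sibling subtrees keep running. This can be sketched as a graph walk. This is an illustration, not pilot's implementation: the `Task` shape and the `dependsOn` field name are assumptions, and the input is assumed to be acyclic.

```typescript
// Hypothetical minimal task shape; the real plan schema has more fields
// and may name the dependency field differently.
interface Task {
  id: string;
  dependsOn: string[];
}

// Returns every task that transitively depends on `failedId` (excluding itself).
function blockedBy(tasks: Task[], failedId: string): string[] {
  // Build the reverse adjacency: dependency id -> ids of direct dependents.
  const dependents = new Map<string, string[]>();
  for (const task of tasks) {
    for (const dep of task.dependsOn) {
      const list = dependents.get(dep) ?? [];
      list.push(task.id);
      dependents.set(dep, list);
    }
  }
  // Breadth-first walk over dependents, starting from the failed task.
  const blocked = new Set<string>();
  const queue: string[] = [failedId];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const next of dependents.get(current) ?? []) {
      if (!blocked.has(next)) {
        blocked.add(next);
        queue.push(next);
      }
    }
  }
  return Array.from(blocked).sort();
}

// Fraction of the other tasks that would be blocked if `id` failed.
// Values near 1 indicate the over-concentration the checklist warns about.
function concentration(tasks: Task[], id: string): number {
  if (tasks.length <= 1) return 0;
  return blockedBy(tasks, id).length / (tasks.length - 1);
}
```

Running `concentration` over every task id gives a quick way to spot the one task whose failure would stall most of the run.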

package/dist/skills/pilot-planning/rules/touches-scope.md CHANGED

@@ -45,3 +45,37 @@ If the verify commands would FAIL without edits, an empty `touches` is a STOP
 - **Including the migrations dir for a non-migration task.** Tight scope.
 When in doubt, write the tightest possible scope first. If the task fails verify with "touches violation: src/X.ts", the worker shows you which file got touched — broaden then.
+## `tolerate:` — files allowed in the diff but outside the contract
+When a task's verify step runs a tool that writes files as a side-effect (codegen, build, snapshots), those files will appear in `git diff` even though the agent didn't author them. Add them to `tolerate:` so enforcement accepts them without counting them as part of the task's output.
+Two categories to watch for:
+**Built-in defaults (already tolerated — don't list these):**
+- `**/next-env.d.ts` — Next.js regenerates on every `next build`.
+- `**/.next/types/**`, `**/.next/dev/types/**` — Next.js app-router generated types.
+- `**/*.tsbuildinfo` — TypeScript project-reference build cache.
+- `**/__snapshots__/**`, `**/*.snap` — Jest / Vitest snapshot files rewritten by `-u`.
+**Project-specific (list in `tolerate:` per task):**
+- Prisma client output (e.g., `prisma/client/**` if `prisma generate` runs in verify).
+- GraphQL codegen output (`graphql/generated/**`, `*.graphql.d.ts`).
+- OpenAPI codegen output (`api-types/generated/**`).
+- Anywhere you have a build step that writes type declarations downstream of the agent's source edits.
+A good test: if the task's verify step runs `prisma generate`, `pnpm codegen`, `next build`, or similar, ask: "does that command write files anywhere?" If yes, those paths go in `tolerate:`.
+### Example
+```yaml
+- id: T-ADD-RULE-MODEL
+  touches:
+    - prisma/schema.prisma
+    - src/models/rule.ts
+  tolerate:
+    - prisma/client/**        # prisma generate output
+  verify:
+    - pnpm prisma generate
+    - pnpm --filter core test rule-model
+```
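
The `touches:` / `tolerate:` split described in `touches-scope.md` above can be pictured as a three-way classification of the changed files. The sketch below is a simplified model, not pilot's enforcement code: the glob matcher handles only `*`, `**`, and `**/`, the function names are hypothetical, and the rule that `touches` wins over `tolerate` when both match is an assumption.

```typescript
// The built-in defaults listed in the rule text.
const BUILT_IN_TOLERATE: string[] = [
  "**/next-env.d.ts",
  "**/.next/types/**",
  "**/.next/dev/types/**",
  "**/*.tsbuildinfo",
  "**/__snapshots__/**",
  "**/*.snap",
];

// Minimal glob support: "**/" = zero or more directories, "**" = anything,
// "*" = anything within one path segment. Other characters match literally.
function globToRegExp(glob: string): RegExp {
  let source = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.startsWith("**/", i)) {
      source += "(?:.*/)?";
      i += 3;
    } else if (glob.startsWith("**", i)) {
      source += ".*";
      i += 2;
    } else if (glob[i] === "*") {
      source += "[^/]*";
      i += 1;
    } else {
      source += glob[i].replace(/[.+?^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp("^" + source + "$");
}

interface DiffReport {
  authored: string[];   // matched by `touches`
  tolerated: string[];  // side-effect files accepted but not counted as output
  violations: string[]; // outside both lists
}

function classifyDiff(changed: string[], touches: string[], tolerate: string[]): DiffReport {
  const matchesAny = (path: string, globs: string[]): boolean =>
    globs.some((glob) => globToRegExp(glob).test(path));
  const allTolerated = BUILT_IN_TOLERATE.concat(tolerate);
  const report: DiffReport = { authored: [], tolerated: [], violations: [] };
  for (const path of changed) {
    if (matchesAny(path, touches)) report.authored.push(path);
    else if (matchesAny(path, allTolerated)) report.tolerated.push(path);
    else report.violations.push(path);
  }
  return report;
}
```

Applied to the `T-ADD-RULE-MODEL` example, generated files under `prisma/client/` fall into `tolerated`, while an unrelated edit elsewhere in `src/` would surface as a violation.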

package/dist/skills/pilot-planning/rules/verify-design.md CHANGED

@@ -2,24 +2,63 @@
 **Each task's `verify:` commands must succeed iff the task is correctly done.**
-The verify list is the contract between the planner and the builder. It is the ONLY signal pilot uses to decide "did this task work?". A weak verify means you're shipping work the run thinks is fine but really isn't.
+The verify list is the contract between the planner and the builder. It is the ONLY signal pilot uses to decide "did this task work?". A weak verify means you're shipping work the run thinks is fine but really isn't. An over-broad verify means the task fails for reasons unrelated to the work — pre-existing test failures, missing infrastructure, flaky integration tests — and the agent wastes its retry budget on something it can't fix.
+## The cardinal rule: verify ONLY what the task changed
+A verify command must exercise **exactly the code the task produced** — no more, no less. If the task adds `src/entities/audit-log/schema.ts` and its test file, the verify is:
+```yaml
+verify:
+  - pnpm --filter @kn/core test -- --run src/entities/audit-log/__tests__/schema.test.ts
+```
+NOT:
+```yaml
+verify:
+  - pnpm --filter @kn/core test -- --run src/entities/audit-log
+```
+The second form runs EVERY test under that directory — including integration tests that need a running database, tests for pre-existing code the task didn't touch, and tests that may already be failing on the base branch. The agent cannot fix those failures. It will exhaust its retry budget and STOP.
+**The verify command's scope must be as tight as the `touches:` scope.** If you wouldn't put a file in `touches:`, don't let the verify command exercise it.
 ## What a good verify looks like
-- `bun test test/api.test.ts` (assertion)
-- `bun run typecheck` (semantic check, catches real failures)
-- `bun run lint` (style, but only when style is the work)
-- `node scripts/check-schema.ts` (your own probe — write it as part of the task)
-- `curl -fsS http://localhost:3000/health | jq .ok` (integration probe)
+- `pnpm test -- --run path/to/specific.test.ts` — runs ONE test file
+- `bun test test/api/specific.test.ts` — same, bun flavor
+- `bun run typecheck` — semantic check, catches real type failures (good as `verify_after_each`)
+- `node scripts/check-schema.ts` — your own probe script (write it as part of the task)
+- `grep -q 'export function newThing' src/file.ts && bun test test/file.test.ts` — existence + behavior
 ## What's not OK
 - `echo done` — proves nothing
 - `test -f src/foo.ts` — file existence is necessary but rarely sufficient
 - `bun run build` ALONE — build success without tests means "TypeScript was happy"; insufficient for behavior tasks
+- `pnpm test` (whole package) — pulls in every test in the package; pre-existing failures block the task
+- `pnpm --filter @pkg test -- --run src/module` (directory-level) — same problem; runs integration tests the task didn't write
 - `grep -q 'newFunction' src/file.ts` — proves text presence, not behavior
 - `git diff --name-only | grep src/api` — proves edits happened, not that they're correct
+## The pre-existing-failure trap
+Pilot runs a **baseline check** before the agent starts: every verify command is executed on the clean tree. If ANY command fails in baseline, the task aborts immediately with a clear message:
+> baseline verify failed: `pnpm --filter @kn/core test` → exit 1.
+> This command fails on the clean tree BEFORE the agent starts —
+> fix your environment or narrow the verify scope.
+This prevents the agent from wasting its 5-attempt retry budget on failures it didn't cause and can't fix. The baseline is the planner's contract: "these commands WILL pass if the environment is set up correctly."
+**If your verify command fails in baseline, the fix is one of:**
+1. Start the missing infrastructure (the setup hook should handle this).
+2. Narrow the verify to only the specific test file the task creates.
+3. Fix the pre-existing test failure on the base branch first.
+The agent gets 5 attempts (with escalating "try a different approach" nudges) for failures it introduces AFTER the baseline passes. Pre-existing failures never reach the agent.
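The baseline check described above can be sketched in shell. This is a minimal illustration, not Pilot's actual implementation: the function name `baseline_check` and the message wording are assumptions. Only the idea comes from the text: run every verify command on the clean tree and abort on the first failure.

```shell
# Hypothetical sketch: run each verify command before the agent starts.
# baseline_check is an illustrative name, not part of Pilot's CLI.
baseline_check() {
  local cmd status
  for cmd in "$@"; do
    # Run the command in a child shell and discard its output.
    bash -c "$cmd" >/dev/null 2>&1
    status=$?
    if [ "$status" -ne 0 ]; then
      echo "baseline verify failed: $cmd -> exit $status"
      return 1
    fi
  done
  echo "baseline ok"
}
```

Called as `baseline_check 'bun run typecheck' 'bun test test/api/create-rule.test.ts'`, it stops at the first command that fails on the clean tree, so that failure never counts against the agent's attempts.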
 ## Two-tier verify
 Use BOTH a per-task verify and `defaults.verify_after_each`:
@@ -27,24 +66,27 @@ Use BOTH a per-task verify and `defaults.verify_after_each`:
 ```yaml
 defaults:
   verify_after_each:
-    - bun run typecheck     # always must pass
+    - bun run typecheck     # always must pass — catches cross-file breakage
 tasks:
   - id: T1
     verify:
-      - bun test test/api/specific.test.ts   # task-specific
+      - bun test test/api/create-rule.test.ts   # task-specific behavior proof
 ```
-`verify_after_each` catches global breakage (a syntax error in a file the task didn't even touch); per-task verify catches task-specific behavior.
+`verify_after_each` catches global breakage (a syntax error in a file the task didn't even touch); per-task verify catches task-specific behavior. Together they form a tight net without over-reaching.
 ## Touches and verify must agree
-If the task `touches: src/api/**` but the verify command runs `bun test test/web/`, you almost certainly have a wrong scope. The verify that would actually catch task failure must exercise files in the touched scope.
+If the task `touches: [src/api/rules.ts, test/api/rules.test.ts]` but the verify command runs `bun test test/web/`, you have a wrong scope. The verify must exercise files in the touched scope — and ONLY those files.
-## Verify must be deterministic
+Conversely: if the verify runs `test/api/rules.test.ts` but `touches:` doesn't include `test/api/rules.test.ts`, the agent can't create or edit that test file. Both must agree.
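A planner could check this agreement mechanically before emitting a plan. The helper below is a hypothetical sketch: `in_touches` is an invented name, and it only does exact path matching, so glob entries such as `src/api/**` would need real glob handling that is not shown here.

```shell
# Hypothetical helper: succeed if the first argument (a file the verify
# command runs) appears verbatim among the remaining arguments (touches).
# Exact string match only; glob patterns are NOT expanded.
in_touches() {
  local target="$1"
  shift
  local entry
  for entry in "$@"; do
    [ "$entry" = "$target" ] && return 0
  done
  return 1
}
```

For instance, `in_touches test/api/rules.test.ts src/api/rules.ts test/api/rules.test.ts` succeeds, while the same call with `test/web/home.test.ts` as the first argument fails, which signals a scope mismatch.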
-- No `sleep` to wait for a service that may not start in CI.
-- No `docker run` unless the task is explicitly about containers.
+## Verify must be deterministic and self-contained
+- No `sleep` to wait for a service that may not start.
 - No external network calls that could flake — mock or skip.
+- No dependency on infrastructure the setup hook didn't start. If the verify needs postgres, the setup hook must start it. If the verify needs an API server, the setup hook must start it.
+- No dependency on other tasks' output being committed (use `depends_on` to sequence).
 If a verify command flakes, its retries consume the task's attempt budget and the task fails for environmental reasons. Pilot has no way to distinguish "real failure" from "flake".
@@ -52,6 +94,28 @@ If a verify command flakes, three retries will exhaust attempts and the task fai
 For non-trivial tasks, write a verify that would HAVE FAILED before the task ran. This makes the task's value observable. If the verify passed before AND passes after, the task didn't actually move the system.
+Good pattern: the test file the agent creates IS the "before" check — it didn't exist before, so `bun test path/to/new.test.ts` would have failed (file not found). After the task, it exists and passes.
+## Port and environment awareness
+If the setup hook starts services on non-default ports (to avoid collisions with the user's dev stack), verify commands must use those ports. Three patterns:
+**A. Source the env file the hook wrote:**
+```yaml
+verify:
+  - bash -c 'source .env.pilot && pnpm --filter @pkg test -- --run path/to/test.ts'
+```
+**B. Use `defaults.verify_after_each` for the env-sourcing wrapper:**
+```yaml
+defaults:
+  verify_after_each:
+    - bash -c 'source .env.pilot && bun run typecheck'
+```
+**C. Tests read from `process.env` at runtime** (best — no wrapper needed):
+If the test framework reads `DATABASE_URL` from the environment, and the setup hook exports it, the verify command just works. This is the cleanest pattern.
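How a setup hook and a verify wrapper might cooperate through the env file can be sketched as follows. Both function names, the port, and the database name are assumptions made for illustration. The source only establishes that the hook writes `.env.pilot` and that tests may read `DATABASE_URL`.

```shell
# Hypothetical setup-hook step: write connection settings to an env file.
# The database name app_test and the port passed in are illustrative values.
write_pilot_env() {
  local file="$1" port="$2"
  printf 'export DATABASE_URL=postgres://localhost:%s/app_test\n' "$port" > "$file"
}

# Hypothetical verify wrapper: load the env file in a subshell, then run
# the given command so the exported variables do not leak to the caller.
run_with_pilot_env() {
  local file="$1"
  shift
  ( . "$file" && "$@" )
}
```

A verify step could then look like `run_with_pilot_env ./.env.pilot bun test test/api/create-rule.test.ts`. Because the file uses `export`, the test process sees `DATABASE_URL` in its environment, which is what pattern C relies on.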
 ## Cross-reference: per-surface tooling menu
-For the per-surface tooling menu (Playwright for UI, curl for API, Postgres for DB), see rule 10 (`qa-expectations.md`). That rule applies these principles to specific tools; this rule defines the principles themselves.
+For the per-surface tooling menu (Playwright for UI, curl for API, Postgres for DB), see rule 9 (`qa-expectations.md`). That rule applies these principles to specific tools; this rule defines the principles themselves.

package/dist/tasks-KJ3WN2KY.js ADDED

@@ -0,0 +1,32 @@
+import {
+  countByStatus,
+  getTask,
+  listTasks,
+  markAborted,
+  markBlocked,
+  markFailed,
+  markPending,
+  markReady,
+  markRunning,
+  markSucceeded,
+  readyTasks,
+  resetTasksForResume,
+  setCostUsd,
+  upsertFromPlan
+} from "./chunk-57EOY72Y.js";
+export {
+  countByStatus,
+  getTask,
+  listTasks,
+  markAborted,
+  markBlocked,
+  markFailed,
+  markPending,
+  markReady,
+  markRunning,
+  markSucceeded,
+  readyTasks,
+  resetTasksForResume,
+  setCostUsd,
+  upsertFromPlan
+};