@glrs-dev/harness-plugin-opencode 0.3.1 → 1.0.1

@@ -0,0 +1,328 @@
1
+ ---
2
+ name: adr
3
+ description: "Use when drafting, revising, or reading any engineering ADR in `docs/adr/`. Encodes grounding steps, the mandatory section template, the Unspecified-interactions-vs-Open-questions rubric, the security-default-deny rule, and self-check red flags. Use when the task is to write an ADR, draft an architecture decision, produce a design doc for a schema/contract/cross-package change, propose a new table/entity, or capture a consequential decision. Do NOT draft an ADR without this skill loaded."
4
+ ---
5
+
6
+ # Engineering ADR Skill (docs/adr/)
7
+
8
+ Purpose: every engineering ADR in this repo starts from the same
9
+ opinionated foundation. Read prior ADRs in `docs/adr/` before drafting
10
+ (see Step 1) — each one's lessons compound.
11
+
12
+ This skill describes **what** to do and **how** to structure an ADR.
13
+ It deliberately does NOT prescribe a review process — how an ADR gets
14
+ scrutinized before merge is up to whoever is shipping it and whichever
15
+ harness or team workflow applies. The skill's job is to make the draft
16
+ good; the review process is a separate concern.
17
+
18
+ ## When you MUST load this skill
19
+
20
+ - Drafting a new file in `docs/adr/`.
21
+ - Revising an existing ADR (even a typo-sized change — you may trip
22
+ one of the red flags below).
23
+ - Reading an existing ADR to understand a past decision, if you need
24
+ to write a supersession or cite its pattern.
25
+
26
+ ## When this skill does NOT apply
27
+
28
+ - Product decisions (if a `docs/product/` directory exists, use that).
29
+ - LLM-feature proposals (if a dedicated template exists, use that).
30
+ - Implementation plans, task breakdowns, build sequencing — Linear
31
+ issues or plan files.
32
+ - Bug fixes, refactors, single-PR work — Linear issue, no ADR.
33
+
34
+ ## The iron rules (five rules; every ADR should honor them)
35
+
36
+ 1. **Ground before you draft.** Run the grounding checklist below
37
+ BEFORE writing the Decision section. Invented table/column/module
38
+ names are the #1 cause of ADR rework.
39
+ 2. **Section order is frozen** (see Template). Don't reorder. Don't
40
+ omit. A missing section is a signal you skipped work, not that the
41
+ work wasn't needed.
42
+ 3. **Security-sensitive capabilities DEFAULT DENY.** Every new role
43
+ grant, every new partner scope, every new cross-org read path
44
+ starts in the `off` position with an explicit, logged
45
+ per-principal enablement path. "Probably fine" is not a stance.
46
+ 4. **Cross-system couplings go in `Consequences -> Unspecified
47
+ interactions`, not `Open questions`.** See the rubric below.
48
+ 5. **"Pre-implementation codebase investigation" items must be
49
+ genuinely unknown at write time.** If it's "verify my bullets are
50
+ right", it's already your job — do it before drafting.
51
+
52
+ ## Step 1: Grounding (mandatory, before drafting)
53
+
54
+ This is not optional. Perform each step and capture the real
55
+ names/paths in a scratch note you'll use while drafting:
56
+
57
+ 1. **Discover prior ADRs.** Read existing ADRs in `docs/adr/` to
58
+ understand established conventions and patterns. If an `adr-index`
59
+ MCP tool is available, use it to find ADRs by subject-area tags.
60
+ Otherwise, list and skim the directory. Pay particular attention to
61
+ conventions in each ADR's `establishes` frontmatter — those are in
62
+ force (unless a later ADR's `supersedes:` includes it).
63
+
64
+ 2. **Read every referenced file.** For the decision you're about to
65
+ make, identify the 3-10 existing files/tables/contracts your ADR
66
+ will touch or adjoin. Read them. Copy real symbol names into your
67
+ scratch note — do not paraphrase from memory.
68
+
69
+ 3. **Grep-verify every table, column, entity, and symbol name before
70
+ it lands in the draft.** Use AST-aware symbol lookup for code
71
+ symbols where available; fall back to `grep`. An invented name in
72
+ the Decision section is the #1 cause of ADR rework.
73
+
74
+ 4. **Identify the access/tenancy story.** Is the new entity scoped to
75
+ a user, an org, global, or cross-tenant? Confirm it follows
76
+ existing access patterns and doesn't accidentally bypass them.
77
+
78
+ 5. **Identify every touched contract.** Internal vs external, file
79
+ paths, permission keys. The ADR must cite the real file paths.
80
+
81
+ 6. **Identify circuit breakers and cross-system coupling.** List every
82
+ module/table/entity whose behavior will change because of this
83
+ decision.
84
+
85
+ 7. **Decide whether this ADR warrants a follow-up project.** If the
86
+ decision produces 3+ implementable issues, file a project when the
87
+ ADR merges. Small decisions that land in one PR don't need one.
88
+
89
+ Only after these seven steps do you touch the template.
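Grounding step 3 can be partly mechanized. The following is a minimal sketch, assuming the draft wraps identifiers in backticks and that a plain substring search is an acceptable stand-in for `grep`. The function name `unverified_names` and the sample draft and sources are hypothetical, not part of this skill.

```python
import re

def unverified_names(draft: str, sources: dict[str, str]) -> list[str]:
    """Return backticked names from the draft that appear in no source file."""
    # Collect every `identifier` quoted in the draft, keeping first-seen order.
    names = list(dict.fromkeys(re.findall(r"`([A-Za-z_][\w.]*)`", draft)))
    # A name counts as verified if any source file contains it verbatim.
    return [n for n in names if not any(n in text for text in sources.values())]

# Hypothetical inputs: a one-line draft and two tiny "files".
draft = "Add `audit_log.actor_id` and reuse `OrgScope` from the access layer."
sources = {
    "src/access.ts": "export class OrgScope {}",
    "db/schema.sql": "CREATE TABLE audit_log (id int);",
}
missing = unverified_names(draft, sources)
print(missing)
```

Substring matching is deliberately crude: it can report a name as present when it only appears inside a longer identifier, so treat an empty result as a prompt to read the file, not as proof.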
90
+
91
+ ## Step 2: Template (frozen section order)
92
+
93
+ ```markdown
94
+ ---
95
+ touches: [<coarse subject-area tags>]
96
+ establishes:
97
+ - <convention-slug-this-adr-introduces>
98
+ - <another-convention-if-any>
99
+ supersedes: [] # or [<prior-adr-filename-without-.md>] if this replaces one
100
+ ---
101
+
102
+ # ADR: <Short decision title>
103
+
104
+ ---
105
+
106
+ ---
107
+
108
+ ## 1. Context
109
+
110
+ What system state exists today, cited with real file paths + symbol
111
+ names. Who the actors/roles are. What's broken, missing, or ambiguous.
112
+ Include a "Prior art in this repo" subsection listing existing
113
+ patterns that inform or constrain the decision.
114
+
115
+ ## 2. Decision
116
+
117
+ What we will do, subsectioned by concern:
118
+
119
+ 2.1 Data model (if any — new tables/columns/enums with real names)
120
+ 2.2 Resolution / runtime semantics (pure functions, state transitions)
121
+ 2.3 External API contract (paths, verbs, schemas, file locations)
122
+ 2.4 Internal API contract (same)
123
+ 2.5 UI design (surfaces, routes, key flows, broken-state treatment)
124
+ 2.6 External integration surface (third-party APIs, adapters, etc.)
125
+ 2.7 Role-based access matrix (see iron rule #3)
126
+ 2.8 Migration strategy (new table? rename? backfill? legacy handling?)
127
+
128
+ Execution planning — merge units, task sequencing, PR boundaries —
129
+ does NOT belong in an ADR. Those are implementation concerns tracked
130
+ separately. If a project exists for the decision, the project is
131
+ where sequencing lives, not here.
132
+
133
+ ## 3. Consequences
134
+
135
+ ### Positive
136
+ ### Negative / trade-offs
137
+ ### Neutral / noted
138
+
139
+ ### Unspecified interactions with existing mechanisms
140
+ (see rubric below; this subsection is mandatory if any exist)
141
+
142
+ ## 4. Alternatives considered
143
+
144
+ Alt 1, Alt 2, ..., each with a one-paragraph rejection reason. Include
145
+ the genuinely-considered options; don't straw-man. If only one
146
+ alternative existed, this section is a red flag — you haven't
147
+ explored the decision space.
148
+
149
+ ## 5. Decision linkages
150
+
151
+ Consumers, dependencies, blockers, future extensions, what this ADR
152
+ establishes (e.g. a new convention).
153
+
154
+ ## 6. Open questions
155
+
156
+ ITERATE UNTIL EMPTY. An ADR should not merge with unresolved open
157
+ questions. Each question is either: (a) answerable now — answer it
158
+ inline and move to a "Resolved during drafting" appendix, or (b) a
159
+ blocker that requires external input — in which case the ADR is not
160
+ ready to merge. Do not use this section as a parking lot for
161
+ laziness. If you can grep the codebase or reason through the
162
+ tradeoffs to resolve a question, do it before declaring the draft
163
+ complete.
164
+
165
+ Format when all questions are resolved:
166
+ "None. All questions resolved during drafting:"
167
+ followed by a "### Resolved during drafting" subsection with
168
+ numbered answers preserving the original question for traceability.
169
+
170
+ ## 7. Pre-implementation codebase investigation
171
+
172
+ ITERATE UNTIL EMPTY. Same rule as S6. Every item here must be
173
+ resolved before the ADR merges — either by doing the investigation
174
+ during drafting (preferred) or by explicitly blocking the ADR on the
175
+ investigation. An ADR with unresolved S7 items is an ADR that will
176
+ produce wrong implementation work.
177
+
178
+ Format when all items are confirmed:
179
+ "None. All items confirmed during drafting:"
180
+ followed by a "### Resolved during drafting" subsection with
181
+ numbered findings.
182
+
183
+ ## 8. References
184
+
185
+ Every file cited, every external doc, every ticket/issue, and the
186
+ convention this ADR establishes or modifies.
187
+ ```
188
+
189
+ Sections with no content in your decision: write "Not applicable" and
190
+ one sentence explaining why. Do not delete the heading.
191
+
192
+ ### Frontmatter contract
193
+
194
+ The YAML frontmatter is the **only** machine-readable metadata on an
195
+ ADR. There is no prose header block — no `Date`, no `Authors`, no
196
+ status. The date is in the filename, authorship is in `git log`,
197
+ and whether an ADR is in force is determined by Git (on `main` = in
198
+ force; named in a later ADR's `supersedes:` = superseded).
199
+ Duplicating any of this in the body would create drift. The body
200
+ opens straight with the `# ADR: <title>` heading and goes to S1
201
+ Context.
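The in-force rule described above can be sketched as a set difference. This is an illustration under stated assumptions: the input is a hypothetical mapping from ADR filename (without `.md`) to its parsed `supersedes` list, every ADR in the mapping is assumed to be on `main`, and the sketch does not check that the superseding ADR is actually the later one.

```python
def in_force(adrs: dict[str, list[str]]) -> list[str]:
    """Return ADR names that no other ADR lists in its `supersedes`."""
    # Gather every name that appears in any ADR's supersedes list.
    superseded = {name for targets in adrs.values() for name in targets}
    # Whatever is left is still in force.
    return sorted(name for name in adrs if name not in superseded)

# Hypothetical ADR set: v2 replaces v1, partner-scopes stands alone.
adrs = {
    "2024-01-05-audit-log-v1": [],
    "2024-03-02-partner-scopes": [],
    "2024-06-10-audit-log-v2": ["2024-01-05-audit-log-v1"],
}
active = in_force(adrs)
print(active)
```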
202
+
203
+ The frontmatter carries only facts about the ADR's content, never
204
+ state or intent about implementation follow-through (whether a
205
+ project gets created, whether the decision has been acted on, etc. —
206
+ those are independently observable and don't belong here).
207
+
208
+ Rules:
209
+
210
+ - **`touches`** — inline list of coarse subject-area tags. Err toward
211
+ more tags — matching is cheap, missing a cross-reference is
212
+ expensive.
213
+ - **`establishes`** — block list of convention slugs this ADR
214
+ introduces (kebab-case; descriptive, not clever). These are what
215
+ future ADR authors discover when their decision is constrained by
216
+ conventions you set.
217
+ - **`supersedes`** — list of prior ADR filenames (without `.md`) that
218
+ this ADR replaces. Empty for most ADRs. Supersession lives in the
219
+ superseding ADR's frontmatter, not as a flag on the superseded ADR —
220
+ that one stays unchanged on `main` as a truthful historical record.
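The three rules above lend themselves to a quick self-check. A minimal sketch, assuming the frontmatter has already been parsed into a dict by whatever YAML loader is in use, and assuming all three keys are expected to be present as lists. The function name `frontmatter_problems` and the sample values are hypothetical.

```python
import re

# Lowercase words joined by single hyphens, e.g. "org-scoped-audit-log".
KEBAB = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def frontmatter_problems(meta: dict) -> list[str]:
    """Return human-readable problems found in parsed ADR frontmatter."""
    problems = []
    for key in ("touches", "establishes", "supersedes"):
        if not isinstance(meta.get(key), list):
            problems.append(f"{key}: missing or not a list")
    establishes = meta.get("establishes")
    if isinstance(establishes, list):
        for slug in establishes:
            if not KEBAB.match(str(slug)):
                problems.append(f"establishes: '{slug}' is not kebab-case")
    supersedes = meta.get("supersedes")
    if isinstance(supersedes, list):
        for name in supersedes:
            if str(name).endswith(".md"):
                problems.append(f"supersedes: '{name}' should omit the .md suffix")
    return problems

good = {"touches": ["billing"], "establishes": ["org-scoped-audit-log"], "supersedes": []}
bad = {"touches": ["billing"], "establishes": ["OrgAudit"], "supersedes": ["2024-01-05-old-audit.md"]}
print(frontmatter_problems(good))
print(frontmatter_problems(bad))
```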
221
+
222
+ ## Rubric: Unspecified interactions vs. Open questions vs. Pre-implementation investigation
223
+
224
+ This is the most common ADR failure. Use this table:
225
+
226
+ | Item type | Goes in | Test |
227
+ |---|---|---|
228
+ | A coupling we know exists in the codebase today that this decision changes or newly touches, but we deliberately are not specifying here | `Consequences -> Unspecified interactions with existing mechanisms` | "Implementers need to know about X coupling to avoid breaking it." |
229
+ | A design sub-decision we deferred because it isn't blocking and has multiple valid answers | `Open questions` | "A reasonable person could answer this two ways and either is defensible; we'll pick one during implementation." |
230
+ | A fact we don't know yet about the codebase that must be verified before the first PR | `Pre-implementation codebase investigation` | "The answer is knowable by grepping / reading code, not by discussion." |
231
+
232
+ If an item is really "I haven't done my homework" dressed up as an
233
+ open question, it fails this rubric. Do the homework or move it to
234
+ Pre-implementation investigation with a specific grep/read
235
+ prescribed.
236
+
237
+ ## Security default-deny rule (iron rule #3, expanded)
238
+
239
+ For every capability that can:
240
+
241
+ - Write to another user's/org's data
242
+ - Stamp long-lived credentials used on outbound traffic
243
+ - Grant a partner/API-key/integration-user role any verb beyond
244
+ `read` on its own scope
245
+
246
+ the ADR must:
247
+
248
+ 1. Default to `off` (not-granted). Do not write "probably fine, worth
249
+ confirming."
250
+ 2. Specify the enablement mechanism: who grants it, where it's logged,
251
+ and how it's revoked.
252
+ 3. State the blast radius if the grant is misused (a mistaken or
253
+ compromised principal).
254
+ 4. Name the expected flow without the grant (what does the actor do
255
+ instead?).
256
+
257
+ ## Red flags — author self-check
258
+
259
+ These are common failure modes observed across ADRs. Use this list as
260
+ a self-check before you consider a draft complete.
261
+
262
+ - Any table, column, enum, or code symbol in your draft has not been
263
+ grep-confirmed against the actual codebase.
264
+ - Your Decision section says "probably fine" about a security grant.
265
+ Make it default-deny.
266
+ - You have zero alternatives in S4 beyond the chosen one.
267
+ - Your S7 "pre-implementation investigation" reads like "verify my
268
+ bullets are right." Move these to grounding and do them now.
269
+ - A coupling with existing mechanisms is not mentioned. If you
270
+ honestly looked and found none, state that.
271
+ - Your ADR introduces a new enum/channel/role/surface whose naming
272
+ collides with an existing one.
273
+ - Your S2 Decision subsections leak into execution planning — merge
274
+ units, PR boundaries, task sequencing. That belongs in issues, not
275
+ in the ADR.
276
+ - Your UI section doesn't describe the broken-state case (what
277
+ happens when a referenced entity is archived/inactive/missing).
278
+ - Your migration section doesn't describe the down() path.
279
+ - Your S6 Open questions are really S3 Unspecified interactions (they
280
+ describe *existing* couplings, not *deferred* design decisions).
281
+ - Your S6 or S7 has unresolved items. Both sections must be iterated
282
+ to empty before the ADR merges. If you can answer a question by
283
+ reading code or reasoning through tradeoffs, do it now — don't
284
+ defer to implementation what you can resolve during drafting.
285
+ - Your ADR is missing YAML frontmatter. Without frontmatter, the ADR
286
+ is invisible to discovery and future authors will rediscover your
287
+ lessons from scratch.
288
+ - A convention you introduce in S2 is not listed in `establishes:`
289
+ frontmatter. Future ADRs can't find that it exists.
290
+
291
+ ## Inline-vs-follow-on decision rubric
292
+
293
+ When you discover during drafting that a sub-decision is bigger than
294
+ you thought:
295
+
296
+ - **Inline it** if: the sub-decision touches <=3 files, introduces no
297
+ new abstractions, and doesn't shift the boundary of any existing
298
+ subsystem.
299
+ - **Follow-on ADR** if: crosses a package boundary you haven't
300
+ mapped, introduces a new abstraction (new model pattern, new
301
+ helper), or requires re-architecting an existing subsystem.
302
+ - **Resolve it now** if: you can answer the question by reading code
303
+ or reasoning through tradeoffs. S6 must be empty at merge — don't
304
+ defer what you can decide during drafting.
305
+
306
+ A follow-on ADR is cited in S5 Decision linkages as a "Blocker" or
307
+ "Future extension."
308
+
309
+ ## File placement and naming
310
+
311
+ - **Location:** `docs/adr/`.
312
+ - **Filename:** `YYYY-MM-DD-<slug>.md`. ISO date (authored date),
313
+ kebab-case slug, 3-7 words.
314
+ - **Branch name:** `docs/<slug>` or `<user>/<ticket>-<slug>` if
315
+ tracked by an issue.
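The filename rule can be checked mechanically. A minimal sketch, assuming "3-7 words" means three to seven hyphen-separated tokens in the slug and that slugs are lowercase alphanumerics. The function name `valid_adr_filename` and the sample filenames are hypothetical.

```python
import re
from datetime import date

# YYYY-MM-DD-<kebab-case-slug>.md
FILENAME = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-([a-z0-9]+(?:-[a-z0-9]+)*)\.md$")

def valid_adr_filename(name: str) -> bool:
    """Check the date prefix, kebab-case slug, and 3-7 word slug length."""
    m = FILENAME.match(name)
    if not m:
        return False
    year, month, day, slug = m.groups()
    try:
        date(int(year), int(month), int(day))  # rejects e.g. month 13
    except ValueError:
        return False
    return 3 <= len(slug.split("-")) <= 7

print(valid_adr_filename("2024-06-10-org-scoped-audit-log.md"))
print(valid_adr_filename("2024-13-01-org-scoped-audit-log.md"))
```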
316
+
317
+ ## Commit sequence
318
+
319
+ 1. Verify the frontmatter block parses (no tabs, list items use
320
+ ` - ` indent). Check that `touches` tags are meaningful and any
321
+ new conventions are listed in `establishes`.
322
+ 2. `git add docs/adr/<file>.md`
323
+ 3. Commit message: `docs(adr): <title>`.
324
+ 4. Push branch and open PR. Link the issue in the PR body if one
325
+ exists.
326
+ 5. If the decision warrants a follow-up project (per grounding step
327
+ 7), create the project on merge and link it from the ADR's S5
328
+ Decision linkages in a follow-up commit.
@@ -11,7 +11,7 @@ A good plan trades a planning-session's worth of patient thought for hours of un
11
11
 
12
12
  ## Workflow
13
13
 
14
- Apply these ten rules in order. Each rule has its own file in `rules/` for the full text:
14
+ Apply these nine rules in order. Each rule has its own file in `rules/` for the full text:
15
15
 
16
16
  1. [`first-principles.md`](rules/first-principles.md) — Frame the task FROM the user's intent, not from a templated checklist. Ask "what does the user actually want done?" before "what files might change?"
17
17
 
@@ -25,29 +25,56 @@ Apply these ten rules in order. Each rule has its own file in `rules/` for the f
25
25
 
26
26
  6. [`milestones.md`](rules/milestones.md) — Optional grouping. Use when several tasks share a "is this batch done?" check (e.g. integration tests after a chunk of unit-test work).
27
27
 
28
- 7. [`self-review.md`](rules/self-review.md) — Before declaring the plan ready, run through a 7-question checklist. Find the holes yourself; the validator only catches schema errors.
28
+ 7. [`self-review.md`](rules/self-review.md) — Before declaring the plan ready, run through a 7-question checklist. Find the holes yourself; the validator only catches schema errors. And before declaring "refuse", revisit the bundle-vs-split decision below.
29
29
 
30
30
  8. [`task-context.md`](rules/task-context.md) — Every non-trivial task carries a `context:` block. Thin plans fail because the builder works each task from scratch with no carry-over; rich context pre-loads what the builder needs to work confidently. Cover outcome, rationale, code pointers, acceptance.
31
31
 
32
- 9. [`setup-authoring.md`](rules/setup-authoring.md) — Detect → propose → confirm the top-level `setup:` block. Covers package manager install, docker-compose services, and migration tooling detection.
33
-
34
- 10. [`qa-expectations.md`](rules/qa-expectations.md) — Detect → propose → confirm per-surface verify patterns for UI, API, DB, integration, browser-based component, and CLI surfaces.
32
+ 9. [`qa-expectations.md`](rules/qa-expectations.md) — Detect → propose → confirm per-surface verify patterns for UI, API, DB, integration, browser-based component, and CLI surfaces.
35
33
 
36
34
  ## After applying the rules
37
35
 
38
36
  1. Save the YAML to the path returned by `bunx @glrs-dev/harness-plugin-opencode pilot plan-dir`.
39
- 2. Run `bunx @glrs-dev/harness-plugin-opencode pilot validate <path>` and fix every error / warning.
40
- 3. Hand off to the user with: `Plan saved to <path>. Next: bunx @glrs-dev/harness-plugin-opencode pilot build`.
37
+ 2. Remind the user the plan assumes their dev stack is already running (install, compose, migrate, seed). Plans no longer bootstrap their own environment.
38
+ 3. Run `bunx @glrs-dev/harness-plugin-opencode pilot validate <path>` and fix every error / warning.
39
+ 4. Hand off to the user with: `Plan saved to <path>. Next: bunx @glrs-dev/harness-plugin-opencode pilot build`.
41
40
 
42
41
  Do NOT summarize the plan in chat. The user can read the YAML.
43
42
 
43
+ ## When to bundle vs. split plans
44
+
45
+ Multi-issue cross-cutting plans are a first-class pilot shape. When a user's scope spans 2–4 related issues, default to **one plan** covering all of them — as long as they share:
46
+
47
+ - Same repo (or monorepo).
48
+ - Same package manager / install command.
49
+ - Same `docker-compose` (or equivalent local-infra) stack.
50
+ - Same test runner and verify style.
51
+ - Same migrations/seed pipeline.
52
+
53
+ Bundling amortizes setup cost (install, compose up, migrate, seed — minutes each, paid once per pilot run) across all the work. Tasks from different issues typically form disconnected subtrees in the DAG — see [`dag-shape.md`](rules/dag-shape.md)'s "Disconnected" pattern. Task-level `cascadeFail` only blocks transitive dependents, so a failure in one subtree does NOT cascade into its siblings.
54
+
55
+ **Split into separate pilot plans when:**
56
+
57
+ - Issues live in different repositories.
58
+ - Issues require fundamentally different setup environments.
59
+ - Issues have fundamentally different acceptance shapes (e.g., automated typecheck vs. manual operator playbook).
60
+
61
+ See [`decomposition.md`](rules/decomposition.md) "Plan sizing — count of tasks" for more.
62
+
44
63
  ## When to refuse
45
64
 
46
- If, after applying the methodology, you cannot produce a plan with at least:
65
+ Refuse ONLY when the **work itself** is underspecified or ambiguous — no concrete acceptance criteria, no clear "done" condition. Examples that warrant refusal:
66
+
67
+ - "Make the API better."
68
+ - "Refactor auth."
69
+ - "Clean up tech debt."
70
+
71
+ These don't name specific behaviors the pilot-builder can verify. Ask the user to narrow the scope before planning.
72
+
73
+ **Do NOT refuse for:**
47
74
 
48
- - 2 tasks
49
- - Each with non-trivial verify
50
- - Each with tight `touches`
51
- - A coherent DAG
75
+ - Plan size (5–30 tasks is fine; even more is fine when the work is well-defined).
76
+ - Multi-issue scope (2–4 related issues in one plan is first-class — see "When to bundle" above).
77
+ - Disconnected-subtree DAG shape (tasks from different concerns don't need artificial edges).
78
+ - Concerns about PR shape (that's a reviewer decision; the pilot run can produce one PR or several).
52
79
 
53
- tell the user the work isn't ready for pilot. Suggest they break it down themselves first, or use the regular `/plan` agent (markdown plans, human-driven execution). It is far better to refuse than to ship a bad plan.
80
+ When you do refuse: tell the user honestly and specifically what's missing. Suggest the regular `/plan` agent (markdown plans, human-driven execution) for ambiguous work that needs human iteration before it's pilotable. It is far better to refuse an underspecified request than to ship a plan full of `echo done` verifies, but keep the definition of "bad plan" narrow: ambitious is not bad; ambiguous is bad.
@@ -34,3 +34,30 @@ A "right-sized" pilot task is one the pilot-builder can complete in a single ses
34
34
  ## When you can't decompose
35
35
 
36
36
  If the work genuinely doesn't decompose (e.g., a 200-line algorithm that has to land atomically), it might not be a fit for pilot. Tell the user; they may want to run it as a regular `/build` task instead.
37
+
38
+ ## Plan sizing — count of tasks
39
+
40
+ Per-task size is covered above. Plan-level size (total task count) is a different dimension and has its own sweet spot: **roughly 5–30 tasks per `pilot.yaml`**. Outside this range:
41
+
42
+ - **Fewer than 5 tasks:** usually means the work is a single change that doesn't benefit from the pilot harness. Consider `/plan` + `/build` instead.
43
+ - **More than 30 tasks:** fine in principle, but at that size the plan probably spans enough distinct concerns that a human reviewer will want it split — not a pilot problem, a PR-shape problem.
44
+
45
+ ### Multi-issue cross-cutting plans are a first-class shape
46
+
47
+ It is **normal and correct** for a single pilot plan to span 2–4 related issues (Linear tickets, GitHub issues) **when those issues share setup and verify infrastructure** — same repo, same package manager, same `docker-compose`, same test runner, same migrations. Reasons to bundle:
48
+
49
+ - **Setup amortization.** `pnpm install`, `docker compose up`, `pnpm db:migrate`, seed scripts — each of these is minutes of wall time. Running them once per pilot session vs. once per Linear issue saves hours across a multi-issue push.
50
+ - **Context reuse.** The builder learns the codebase through reading during early tasks; that context benefits every subsequent task in the run.
51
+ - **Shared acceptance.** Cross-issue integration checks (a milestone-close verify that exercises all three issues' changes together) are natural in one plan, awkward across three runs.
52
+
53
+ **Reference shape (not a red flag):** rule-engine cleanup + LISTEN/NOTIFY cache invalidation + read-only admin UI landed together in one plan of ~19 tasks across 4 milestones, covering 3 Linear issues. This is the shape pilot is built for.
54
+
55
+ When bundling, the tasks from different issues typically form **disconnected subtrees** in the DAG (no real semantic dependency between them). That's fine — see [`dag-shape.md`](dag-shape.md)'s "Disconnected" pattern. Task-level `cascadeFail` only blocks transitive dependents, so a failure in one subtree doesn't cascade into the siblings.
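The claim that `cascadeFail` only blocks transitive dependents can be sketched as a graph walk. This is an illustration, not pilot's implementation: the `depends_on` mapping, the task ids, and the function name `blocked_by` are all hypothetical.

```python
def blocked_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every task that transitively depends on `failed`."""
    # Invert the edges: task -> tasks that directly depend on it.
    dependents: dict[str, list[str]] = {}
    for task, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(task)
    # Walk downstream from the failed task, collecting everything reachable.
    blocked: set[str] = set()
    stack = [failed]
    while stack:
        for child in dependents.get(stack.pop(), []):
            if child not in blocked:
                blocked.add(child)
                stack.append(child)
    return blocked

# Two disconnected subtrees, one per issue: A1 -> A2 -> A3 and B1 -> B2.
plan = {"A1": [], "A2": ["A1"], "A3": ["A2"], "B1": [], "B2": ["B1"]}
print(sorted(blocked_by("A1", plan)))
```

In this toy plan a failure in `A1` blocks `A2` and `A3` while `B1` and `B2` are untouched, which is the property that makes bundling unrelated issues safe.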
56
+
57
+ ### When to split instead of bundle
58
+
59
+ Split into separate pilot plans when:
60
+
61
+ - The issues live in **different repositories**.
62
+ - The issues require **fundamentally different setup environments** (e.g., one needs Postgres + Temporal, the other needs a headless browser grid — sharing setup is worse than paying the cost twice).
63
+ - The issues have **fundamentally different acceptance criteria** (e.g., one is a TypeScript refactor verified via typecheck, the other is an infrastructure change verified via a manual operator playbook — no shared verify makes sense).
@@ -16,7 +16,7 @@ The validator catches schema, DAG, and glob errors. It cannot catch "this verify
16
16
 
17
17
  5. **Are there missing edges?** Look at every pair of tasks that share files in their `touches:`. Do they need an order? If T2's verify exercises code T1 introduces, T2 depends on T1 — even if their `touches:` don't overlap.
18
18
 
19
- 6. **Can the plan recover from a per-task failure?** If T3 fails, the cascade-fail blocks T4 onward. Is the resulting "failed=T3, blocked=[T4..T7]" state useful for the human operator? Or did you concentrate too much value into T3 such that its failure is catastrophic?
19
+ 6. **Does the DAG concentrate too much value in one task?** Task-level `cascadeFail` only blocks transitive DEPENDENTS of the failed task; sibling subtrees in a disconnected DAG keep running. So plan size is not itself a risk. The real risk is a task everything else depends on: a schema migration that all downstream work reads, a core-type definition all imports reference, a shared config every consumer parses. If THAT task fails, the whole run stalls. Is there such a task in your plan? If yes, can it be simplified (smaller diff, tighter verify, higher success probability)? Don't over-concentrate; a plan where 80% of tasks depend on T1 and T1 is complex is fragile by design.
20
20
 
21
21
  7. **Could you read this plan in 6 months and understand it?** Plan names + task titles + prompts should be a self-explanatory summary of the work. If the plan needs a verbal preamble to make sense, rewrite the prompts.
22
22
 
@@ -45,3 +45,37 @@ If the verify commands would FAIL without edits, an empty `touches` is a STOP
45
45
  - **Including the migrations dir for a non-migration task.** Tight scope.
46
46
 
47
47
  When in doubt, write the tightest possible scope first. If the task fails verify with "touches violation: src/X.ts", the worker shows you which file got touched — broaden then.
48
+
49
+ ## `tolerate:` — files allowed in the diff but outside the contract
50
+
51
+ When a task's verify step runs a tool that writes files as a side-effect (codegen, build, snapshots), those files will appear in `git diff` even though the agent didn't author them. Add them to `tolerate:` so enforcement accepts them without counting them as part of the task's output.
52
+
53
+ Two categories to watch for:
54
+
55
+ **Built-in defaults (already tolerated — don't list these):**
56
+ - `**/next-env.d.ts` — Next.js regenerates on every `next build`.
57
+ - `**/.next/types/**`, `**/.next/dev/types/**` — Next.js app-router generated types.
58
+ - `**/*.tsbuildinfo` — TypeScript project-reference build cache.
59
+ - `**/__snapshots__/**`, `**/*.snap` — Jest / Vitest snapshot files rewritten by `-u`.
60
+
61
+ **Project-specific (list in `tolerate:` per task):**
62
+ - Prisma client output (e.g., `prisma/client/**` if `prisma generate` runs in verify).
63
+ - GraphQL codegen output (`graphql/generated/**`, `*.graphql.d.ts`).
64
+ - OpenAPI codegen output (`api-types/generated/**`).
65
+ - Anywhere you have a build step that writes type declarations downstream of the agent's source edits.
66
+
67
+ A good test: if the task's verify step runs `prisma generate`, `pnpm codegen`, `next build`, or similar, ask: "does that command write files anywhere?" If yes, those paths go in `tolerate:`.
68
+
69
+ ### Example
70
+
71
+ ```yaml
72
+ - id: T-ADD-RULE-MODEL
73
+   touches:
74
+     - prisma/schema.prisma
75
+     - src/models/rule.ts
76
+   tolerate:
77
+     - prisma/client/** # prisma generate output
78
+   verify:
79
+     - pnpm prisma generate
80
+     - pnpm --filter core test rule-model
81
+ ```
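How enforcement might sort a task's diff into contract files, tolerated side-effects, and violations can be sketched as follows. This is an assumption-laden illustration: it uses Python's `fnmatch` as a stand-in for whatever glob engine pilot actually uses, it ignores the built-in defaults, and the function name `classify_diff` and the changed-file list are hypothetical.

```python
from fnmatch import fnmatch

def classify_diff(changed: list[str], touches: list[str], tolerate: list[str]) -> dict[str, list[str]]:
    """Sort changed paths into contract, tolerated, and violation buckets."""
    result: dict[str, list[str]] = {"contract": [], "tolerated": [], "violation": []}
    for path in changed:
        if any(fnmatch(path, pat) for pat in touches):
            result["contract"].append(path)    # part of the task's output
        elif any(fnmatch(path, pat) for pat in tolerate):
            result["tolerated"].append(path)   # side-effect of a verify tool
        else:
            result["violation"].append(path)   # outside both lists
    return result

# Mirrors the example task above, plus one stray edit.
changed = [
    "prisma/schema.prisma",
    "src/models/rule.ts",
    "prisma/client/index.d.ts",
    "src/api/routes.ts",
]
result = classify_diff(
    changed,
    touches=["prisma/schema.prisma", "src/models/rule.ts"],
    tolerate=["prisma/client/**"],
)
print(result)
```

Note that `fnmatch` lets `*` match path separators, so its semantics differ from stricter glob engines; the sketch only shows the three-way split, not exact matching rules.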
@@ -2,24 +2,63 @@
2
2
 
3
3
  **Each task's `verify:` commands must succeed iff the task is correctly done.**
4
4
 
5
- The verify list is the contract between the planner and the builder. It is the ONLY signal pilot uses to decide "did this task work?". A weak verify means you're shipping work the run thinks is fine but really isn't.
5
+ The verify list is the contract between the planner and the builder. It is the ONLY signal pilot uses to decide "did this task work?". A weak verify means you're shipping work the run thinks is fine but really isn't. An over-broad verify means the task fails for reasons unrelated to the work — pre-existing test failures, missing infrastructure, flaky integration tests — and the agent wastes its retry budget on something it can't fix.
6
+
7
+ ## The cardinal rule: verify ONLY what the task changed
8
+
9
+ A verify command must exercise **exactly the code the task produced** — no more, no less. If the task adds `src/entities/audit-log/schema.ts` and its test file, the verify is:
10
+
11
+ ```yaml
12
+ verify:
13
+ - pnpm --filter @kn/core test -- --run src/entities/audit-log/__tests__/schema.test.ts
14
+ ```
15
+
16
+ NOT:
17
+
18
+ ```yaml
19
+ verify:
20
+ - pnpm --filter @kn/core test -- --run src/entities/audit-log
21
+ ```
22
+
23
+ The second form runs EVERY test under that directory — including integration tests that need a running database, tests for pre-existing code the task didn't touch, and tests that may already be failing on the base branch. The agent cannot fix those failures. It will exhaust its retry budget and STOP.
24
+
25
+ **The verify command's scope must be as tight as the `touches:` scope.** If you wouldn't put a file in `touches:`, don't let the verify command exercise it.
6
26
 
7
27
  ## What a good verify looks like
8
28
 
9
- - `bun test test/api.test.ts` (assertion)
10
- - `bun run typecheck` (semantic check, catches real failures)
11
- - `bun run lint` (style, but only when style is the work)
12
- - `node scripts/check-schema.ts` (your own probe write it as part of the task)
13
- - `curl -fsS http://localhost:3000/health | jq .ok` (integration probe)
29
+ - `pnpm test -- --run path/to/specific.test.ts` — runs ONE test file
30
+ - `bun test test/api/specific.test.ts` (same, bun flavor)
31
+ - `bun run typecheck` — semantic check, catches real type failures (good as `verify_after_each`)
32
+ - `node scripts/check-schema.ts` (your own probe script; write it as part of the task)
33
+ - `grep -q 'export function newThing' src/file.ts && bun test test/file.test.ts` (existence + behavior)
14
34
 
15
35
  ## What's not OK
16
36
 
17
37
  - `echo done` — proves nothing
18
38
  - `test -f src/foo.ts` — file existence is necessary but rarely sufficient
19
39
  - `bun run build` ALONE — build success without tests means "TypeScript was happy"; insufficient for behavior tasks
40
+ - `pnpm test` (whole package) — pulls in every test in the package; pre-existing failures block the task
41
+ - `pnpm --filter @pkg test -- --run src/module` (directory-level) — same problem; runs integration tests the task didn't write
20
42
  - `grep -q 'newFunction' src/file.ts` — proves text presence, not behavior
21
43
  - `git diff --name-only | grep src/api` — proves edits happened, not that they're correct
22
44
 
45
+ ## The pre-existing-failure trap
46
+
47
+ Pilot runs a **baseline check** before the agent starts: every verify command is executed on the clean tree. If ANY command fails in baseline, the task aborts immediately with a clear message:
48
+
49
+ > baseline verify failed: `pnpm --filter @kn/core test` → exit 1.
50
+ > This command fails on the clean tree BEFORE the agent starts —
51
+ > fix your environment or narrow the verify scope.
52
+
53
+ This prevents the agent from wasting its 5-attempt retry budget on failures it didn't cause and can't fix. The baseline is the planner's contract: "these commands WILL pass if the environment is set up correctly."
54
+
55
+ **If your verify command fails in baseline, the fix is one of:**
56
+ 1. Start the missing infrastructure (the setup hook should handle this).
57
+ 2. Narrow the verify to only the specific test file the task creates.
58
+ 3. Fix the pre-existing test failure on the base branch first.
59
+
60
+ The agent gets 5 attempts (with escalating "try a different approach" nudges) for failures it introduces AFTER the baseline passes. Pre-existing failures never reach the agent.
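The baseline check described above can be sketched as follows. This is an illustration under stated assumptions, not Pilot's implementation: `baselineFailure` and the `Runner` type are hypothetical names, and the command runner is injected so the logic stays independent of how commands are actually executed.

```typescript
// Hypothetical types and names; Pilot's real internals may differ.
// A Runner executes one shell command on the clean tree and returns its exit code.
export type Runner = (command: string) => number;

// Runs every verify command before the agent starts.
// Returns an abort message for the first failing command, or null if all pass.
export function baselineFailure(verify: string[], run: Runner): string | null {
  for (const command of verify) {
    const code = run(command);
    if (code !== 0) {
      return `baseline verify failed: \`${command}\` → exit ${code}.`;
    }
  }
  return null;
}
```

In a real harness the runner would shell out to the command; a non-null result means the task should abort before any retry attempt is spent.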
61
+
23
62
  ## Two-tier verify
24
63
 
25
64
  Use BOTH a per-task verify and `defaults.verify_after_each`:
@@ -27,24 +66,27 @@ Use BOTH a per-task verify and `defaults.verify_after_each`:
27
66
  ```yaml
28
67
  defaults:
29
68
    verify_after_each:
30
-     - bun run typecheck # always must pass
69
+     - bun run typecheck # always must pass — catches cross-file breakage
31
70
  tasks:
32
71
    - id: T1
33
72
      verify:
34
-       - bun test test/api/specific.test.ts # task-specific
73
+       - bun test test/api/create-rule.test.ts # task-specific behavior proof
35
74
  ```
36
75
 
37
- `verify_after_each` catches global breakage (a syntax error in a file the task didn't even touch); per-task verify catches task-specific behavior.
76
+ `verify_after_each` catches global breakage (a syntax error in a file the task didn't even touch); per-task verify catches task-specific behavior. Together they form a tight net without over-reaching.
38
77
 
39
78
  ## Touches and verify must agree
40
79
 
41
- If the task `touches: src/api/**` but the verify command runs `bun test test/web/`, you almost certainly have a wrong scope. The verify that would actually catch task failure must exercise files in the touched scope.
80
+ If the task `touches: [src/api/rules.ts, test/api/rules.test.ts]` but the verify command runs `bun test test/web/`, you have a wrong scope. The verify must exercise files in the touched scope — and ONLY those files.
42
81
 
43
- ## Verify must be deterministic
82
+ Conversely: if the verify runs `test/api/rules.test.ts` but `touches:` doesn't include `test/api/rules.test.ts`, the agent can't create or edit that test file. Both must agree.
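The agreement between `touches:` and verify can be checked with a small helper. This is a rough sketch with hypothetical names, assuming `touches:` entries are literal paths (no globs) and test files follow the `*.test.*` / `*.spec.*` naming convention.

```typescript
// Hypothetical helper (an assumption for illustration, not a Pilot API).
const TEST_FILE = /\.(test|spec)\.[cm]?[jt]sx?$/;

// Returns every test file named in a verify command that is absent from touches.
// An empty result means verify and touches agree on test files.
export function testFilesOutsideTouches(verify: string[], touches: string[]): string[] {
  const allowed = new Set(touches);
  const missing: string[] = [];
  for (const command of verify) {
    for (const token of command.trim().split(/\s+/)) {
      if (TEST_FILE.test(token) && !allowed.has(token)) {
        missing.push(token);
      }
    }
  }
  return missing;
}
```

A non-empty result flags the situation described above: the verify runs a test file the agent is not allowed to create or edit.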
44
83
 
45
- - No `sleep` to wait for a service that may not start in CI.
46
- - No `docker run` unless the task is explicitly about containers.
84
+ ## Verify must be deterministic and self-contained
85
+
86
+ - No `sleep` to wait for a service that may not start.
47
87
  - No external network calls that could flake — mock or skip.
88
+ - No dependency on infrastructure the setup hook didn't start. If the verify needs postgres, the setup hook must start it. If the verify needs an API server, the setup hook must start it.
89
+ - No dependency on other tasks' output being committed (use `depends_on` to sequence).
48
90
 
49
91
  If a verify command flakes, the retries can exhaust the attempt budget and the task fails for environmental reasons. Pilot has no way to distinguish "real failure" from "flake".
50
92
 
@@ -52,6 +94,28 @@ If a verify command flakes, three retries will exhaust attempts and the task fai
52
94
 
53
95
  For non-trivial tasks, write a verify that would HAVE FAILED before the task ran. This makes the task's value observable. If the verify passed before AND passes after, the task didn't actually move the system.
54
96
 
97
+ Good pattern: the test file the agent creates IS the "before" check — it didn't exist before, so `bun test path/to/new.test.ts` would have failed (file not found). After the task, it exists and passes.
98
+
99
+ ## Port and environment awareness
100
+
101
+ If the setup hook starts services on non-default ports (to avoid collisions with the user's dev stack), verify commands must use those ports. Three patterns:
102
+
103
+ **A. Source the env file the hook wrote:**
104
+ ```yaml
105
+ verify:
106
+ - bash -c 'source .env.pilot && pnpm --filter @pkg test -- --run path/to/test.ts'
107
+ ```
108
+
109
+ **B. Use `defaults.verify_after_each` for the env-sourcing wrapper:**
110
+ ```yaml
111
+ defaults:
112
+   verify_after_each:
113
+     - bash -c 'source .env.pilot && bun run typecheck'
114
+ ```
115
+
116
+ **C. Tests read from `process.env` at runtime** (best — no wrapper needed):
117
+ If the test framework reads `DATABASE_URL` from the environment, and the setup hook exports it, the verify command just works. This is the cleanest pattern.
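Pattern C might look like this in test setup code. This is a minimal sketch: `requireEnv` is a hypothetical helper, and the environment is passed in as a plain object so the function can be exercised without a real setup hook.

```typescript
// Hypothetical helper (an assumption for illustration, not part of any test framework).
// Reads a required variable from an environment object and fails loudly when it is absent.
export function requireEnv(env: Record<string, string | undefined>, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`${name} is not set; did the setup hook export it?`);
  }
  return value;
}
```

A test would call `requireEnv(process.env, "DATABASE_URL")` at startup, so a missing export surfaces as one clear error rather than a connection timeout.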
118
+
55
119
  ## Cross-reference: per-surface tooling menu
56
120
 
57
- For the per-surface tooling menu (Playwright for UI, curl for API, Postgres for DB), see rule 10 (`qa-expectations.md`). That rule applies these principles to specific tools; this rule defines the principles themselves.
121
+ For the per-surface tooling menu (Playwright for UI, curl for API, Postgres for DB), see rule 9 (`qa-expectations.md`). That rule applies these principles to specific tools; this rule defines the principles themselves.
@@ -0,0 +1,32 @@
1
+ import {
2
+ countByStatus,
3
+ getTask,
4
+ listTasks,
5
+ markAborted,
6
+ markBlocked,
7
+ markFailed,
8
+ markPending,
9
+ markReady,
10
+ markRunning,
11
+ markSucceeded,
12
+ readyTasks,
13
+ resetTasksForResume,
14
+ setCostUsd,
15
+ upsertFromPlan
16
+ } from "./chunk-57EOY72Y.js";
17
+ export {
18
+ countByStatus,
19
+ getTask,
20
+ listTasks,
21
+ markAborted,
22
+ markBlocked,
23
+ markFailed,
24
+ markPending,
25
+ markReady,
26
+ markRunning,
27
+ markSucceeded,
28
+ readyTasks,
29
+ resetTasksForResume,
30
+ setCostUsd,
31
+ upsertFromPlan
32
+ };