@glrs-dev/cli 1.1.0 → 2.0.0
- package/CHANGELOG.md +4 -0
- package/dist/vendor/harness-opencode/dist/agents/prompts/pilot-assessor.md +77 -0
- package/dist/vendor/harness-opencode/dist/agents/prompts/pilot-builder.md +24 -116
- package/dist/vendor/harness-opencode/dist/agents/prompts/pilot-planner.md +38 -160
- package/dist/vendor/harness-opencode/dist/agents/prompts/pilot-scoper.md +58 -0
- package/dist/vendor/harness-opencode/dist/{chunk-BWERBERN.js → chunk-6CZPRUMJ.js} +12 -62
- package/dist/vendor/harness-opencode/dist/chunk-DZG4D3OH.js +54 -0
- package/dist/vendor/harness-opencode/dist/chunk-OYRKOEXK.js +88 -0
- package/dist/vendor/harness-opencode/dist/cli.js +1619 -3930
- package/dist/vendor/harness-opencode/dist/index.js +831 -166
- package/dist/vendor/harness-opencode/dist/{install-5JKWK6Z4.js → install-6775ZBDG.js} +1 -1
- package/dist/vendor/harness-opencode/dist/paths-WZ23ZQOV.js +18 -0
- package/dist/vendor/harness-opencode/dist/skills/code-quality/SKILL.md +45 -0
- package/dist/vendor/harness-opencode/dist/skills/code-quality/rules/building.md +125 -0
- package/dist/vendor/harness-opencode/dist/skills/code-quality/rules/gap-analysis.md +92 -0
- package/dist/vendor/harness-opencode/dist/skills/code-quality/rules/planning.md +96 -0
- package/dist/vendor/harness-opencode/dist/skills/code-quality/rules/review.md +104 -0
- package/dist/vendor/harness-opencode/package.json +1 -1
- package/package.json +1 -1
- package/dist/vendor/harness-opencode/dist/agents/prompts/pilot-builder.open.md +0 -129
- package/dist/vendor/harness-opencode/dist/chunk-57EOY72Y.js +0 -174
- package/dist/vendor/harness-opencode/dist/chunk-5TAMY7P6.js +0 -67
- package/dist/vendor/harness-opencode/dist/chunk-BKTFWXLG.js +0 -204
- package/dist/vendor/harness-opencode/dist/chunk-EK7K4NTV.js +0 -747
- package/dist/vendor/harness-opencode/dist/chunk-KB7M7JXU.js +0 -145
- package/dist/vendor/harness-opencode/dist/chunk-RNRCXQ65.js +0 -56
- package/dist/vendor/harness-opencode/dist/paths-LT3QQKCF.js +0 -18
- package/dist/vendor/harness-opencode/dist/pilot/mcp/status-server.d.ts +0 -1
- package/dist/vendor/harness-opencode/dist/pilot/mcp/status-server.js +0 -228
- package/dist/vendor/harness-opencode/dist/pilot-config-7LJZ23YK.js +0 -55
- package/dist/vendor/harness-opencode/dist/runs-QWPL3TKV.js +0 -18
- package/dist/vendor/harness-opencode/dist/safety-gate-WM3EWOCY.js +0 -10
- package/dist/vendor/harness-opencode/dist/setup-hook-FHTXMAQL.js +0 -88
- package/dist/vendor/harness-opencode/dist/skills/pilot-planning/SKILL.md +0 -80
- package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/dag-shape.md +0 -47
- package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/decomposition.md +0 -63
- package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/first-principles.md +0 -29
- package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/milestones.md +0 -57
- package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/qa-expectations.md +0 -120
- package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/self-review.md +0 -46
- package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/task-context.md +0 -47
- package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/touches-scope.md +0 -81
- package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/verify-design.md +0 -121
- package/dist/vendor/harness-opencode/dist/tasks-KJ3WN2KY.js +0 -32
package/dist/vendor/harness-opencode/dist/skills/pilot-planning/SKILL.md +0 -80
@@ -1,80 +0,0 @@
----
-name: pilot-planning
-description: Methodology for producing a pilot.yaml plan that the pilot-builder agent can execute unattended. Use when the pilot-planner agent receives a feature request — covers task decomposition, verify-command design, scope tightness, DAG shape, and self-review. Auto-loaded by the pilot-planner agent.
----
-
-# Pilot Planning Skill
-
-You are producing a `pilot.yaml` plan: a list of tasks the pilot-builder agent can execute one at a time, fully unattended. The cost of a bad plan is high — the builder will fail tasks confusingly, the cascade-fail will block downstream work, and the human pilot operator has to clean up worktrees and re-plan.
-
-A good plan trades a planning-session's worth of patient thought for hours of unsupervised builder time. Take the patient thought.
-
-## Workflow
-
-Apply these nine rules in order. Each rule has its own file in `rules/` for the full text:
-
-1. [`first-principles.md`](rules/first-principles.md) — Frame the task FROM the user's intent, not from a templated checklist. Ask "what does the user actually want done?" before "what files might change?"
-
-2. [`decomposition.md`](rules/decomposition.md) — Break the work into right-sized tasks (10-30 minutes of agent time, ≤3 attempts). Too big = unbounded work; too small = orchestration overhead drowns the value.
-
-3. [`verify-design.md`](rules/verify-design.md) — Each task's `verify:` commands must succeed iff the task is correctly done. No `echo done`. No `test -f file.ts`. Real assertions only.
-
-4. [`touches-scope.md`](rules/touches-scope.md) — `touches:` globs must be the tightest set that lets the task succeed. Default to "specific file paths"; `**` is a smell.
-
-5. [`dag-shape.md`](rules/dag-shape.md) — Tasks depend on each other only when there's a real semantic dependency (B reads what A produces). False dependencies make the run sequential when it could parallel; missing dependencies cause subtle race-on-state bugs.
-
-6. [`milestones.md`](rules/milestones.md) — Optional grouping. Use when several tasks share a "is this batch done?" check (e.g. integration tests after a chunk of unit-test work).
-
-7. [`self-review.md`](rules/self-review.md) — Before declaring the plan ready, run through a 7-question checklist. Find the holes yourself; the validator only catches schema errors. And before declaring "refuse", revisit the bundle-vs-split decision below.
-
-8. [`task-context.md`](rules/task-context.md) — Every non-trivial task carries a `context:` block. Thin plans fail because the builder works each task from scratch with no carry-over; rich context pre-loads what the builder needs to work confidently. Cover outcome, rationale, code pointers, acceptance.
-
-9. [`qa-expectations.md`](rules/qa-expectations.md) — Detect → propose → confirm per-surface verify patterns for UI, API, DB, integration, browser-based component, and CLI surfaces.
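The fields the nine rules refer to can be pictured in a small plan fragment. This is a minimal sketch, assuming a top-level `tasks:` list with the field names used across the rule files (`id`, `title`, `depends_on`, `touches`, `verify`, `prompt`); the file paths, commands, and task wording are hypothetical, and the schema that `pilot validate` actually enforces is not part of this diff.

```yaml
# Hypothetical plan fragment; paths, commands, and wording are illustrative only.
tasks:
  - id: T1
    title: "Add getOrder accessor"
    touches:
      - src/db/orders.ts
      - test/db/orders.test.ts
    verify:
      # Scoped test that is expected to fail before the task and pass after it.
      - pnpm test test/db/orders.test.ts
    prompt: "Add a getOrder(id) accessor and a test covering the not-found case."
  - id: T2
    title: "Add order lookup endpoint"
    depends_on: [T1]   # real dependency: T2 calls the function T1 introduces
    touches:
      - src/api/orders.ts
      - test/api/orders.test.ts
    verify:
      - pnpm test test/api/orders.test.ts
    prompt: "Add the endpoint, calling getOrder from src/db/orders.ts."
```

Each task names specific files rather than a broad glob, and the only edge is the one where T2 reads what T1 produces.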
-
-## After applying the rules
-
-1. Save the YAML to the path returned by `bunx @glrs-dev/harness-plugin-opencode pilot plan-dir`.
-2. Remind the user the plan assumes their dev stack is already running (install, compose, migrate, seed). Plans no longer bootstrap their own environment.
-3. Run `bunx @glrs-dev/harness-plugin-opencode pilot validate <path>` and fix every error / warning.
-4. Hand off to the user with: `Plan saved to <path>. Next: bunx @glrs-dev/harness-plugin-opencode pilot build`.
-
-Do NOT summarize the plan in chat. The user can read the YAML.
-
-## When to bundle vs. split plans
-
-Multi-issue cross-cutting plans are a first-class pilot shape. When a user's scope spans 2–4 related issues, default to **one plan** covering all of them — as long as they share:
-
-- Same repo (or monorepo).
-- Same package manager / install command.
-- Same `docker-compose` (or equivalent local-infra) stack.
-- Same test runner and verify style.
-- Same migrations/seed pipeline.
-
-Bundling amortizes setup cost (install, compose up, migrate, seed — minutes each, paid once per pilot run) across all the work. Tasks from different issues typically form disconnected subtrees in the DAG — see [`dag-shape.md`](rules/dag-shape.md)'s "Disconnected" pattern. Task-level `cascadeFail` only blocks transitive dependents, so a failure in one subtree does NOT cascade into its siblings.
-
-**Split into separate pilot plans when:**
-
-- Issues live in different repositories.
-- Issues require fundamentally different setup environments.
-- Issues have fundamentally different acceptance shapes (e.g., automated typecheck vs. manual operator playbook).
-
-See [`decomposition.md`](rules/decomposition.md) "Plan sizing — count of tasks" for more.
-
-## When to refuse
-
-Refuse ONLY when the **work itself** is underspecified or ambiguous — no concrete acceptance criteria, no clear "done" condition. Examples that warrant refusal:
-
-- "Make the API better."
-- "Refactor auth."
-- "Clean up tech debt."
-
-These don't name specific behaviors the pilot-builder can verify. Ask the user to narrow the scope before planning.
-
-**Do NOT refuse for:**
-
-- Plan size (5–30 tasks is fine; even more is fine when the work is well-defined).
-- Multi-issue scope (2–4 related issues in one plan is first-class — see "When to bundle" above).
-- Disconnected-subtree DAG shape (tasks from different concerns don't need artificial edges).
-- Concerns about PR shape (that's a reviewer decision; the pilot run can produce one PR or several).
-
-When you do refuse: tell the user honestly and specifically what's missing. Suggest the regular `/plan` agent (markdown plans, human-driven execution) for ambiguous work that needs human iteration before it's pilotable. It is far better to refuse an unspecified request than to ship a plan full of `echo done` verifies — but narrow what "bad plan" means. Ambitious is not bad; ambiguous is bad.
package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/dag-shape.md +0 -47
@@ -1,47 +0,0 @@
-# Rule 5 — DAG shape
-
-**Tasks depend on each other only when there's a real semantic dependency.**
-
-The `depends_on` edges in the plan determine run order. False edges serialize work that could parallelize (v0.3); missing edges let a downstream task run against a state where its prerequisite hasn't committed yet.
-
-## What a real dependency looks like
-
-- **Reads code that the dep produces.** T2 imports a function T1 introduced.
-- **Reads schema that the dep modifies.** T2 calls an endpoint T1 added.
-- **Tests behavior the dep implements.** T2's verify runs a test T1's code makes pass.
-
-## What ISN'T a real dependency
-
-- "T1 should run first because it's foundational." If T2 doesn't use T1's output, the order doesn't matter for correctness — and forcing it costs you parallelism.
-- "Both touch `src/api/`." Touch overlap is a worktree-pool concern (v0.3), not a logical dependency. Capture it via `touches:` if at all.
-- "I want T1 to be done before I review T2." That's a human-review concern, not a pilot DAG concern. The pilot run completes; you review afterward.
-
-## Common shapes
-
-**Linear** — T1 → T2 → T3:
-
-Each task is the next layer. Use when each layer literally builds on the previous.
-
-**Diamond** — T1 fans out to T2, T3; both reconverge into T4:
-
-T1 = "introduce module skeleton"; T2, T3 = "fill in submodule X / Y" (parallelizable on disjoint scopes); T4 = "wire up everything and run integration tests".
-
-**Disconnected** — Two independent components in the same plan:
-
-`auth-1`, `auth-2` are one chain; `billing-1`, `billing-2` are another. Use when the plan covers multiple unrelated improvements.
-
-**Hub-and-spoke** — Many tasks all depend on T1 but not on each other:
-
-T1 = "add the typed client"; T2-Tn each = "use the typed client in module M". All Tn parallelize.
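The Diamond shape described above might be written as follows. This is a sketch that assumes `depends_on` takes a list of task ids, as in the `depends_on: [T2]` form used elsewhere in these rules; `touches`, `verify`, and `prompt` are omitted for brevity.

```yaml
# Diamond: T1 fans out to T2 and T3, which reconverge into T4.
tasks:
  - id: T1
    title: "Introduce module skeleton"
  - id: T2
    title: "Fill in submodule X"
    depends_on: [T1]
  - id: T3
    title: "Fill in submodule Y"
    depends_on: [T1]       # no edge to T2: the scopes are disjoint, so the two can parallelize
  - id: T4
    title: "Wire up everything and run integration tests"
    depends_on: [T2, T3]   # reconverge: T4 reads what both T2 and T3 produce
```

Dropping the `depends_on` on T2 and T3 would turn this into a plan with missing edges; adding `depends_on: [T2]` to T3 would make it falsely linear.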
-
-## Cycle detection
-
-The validator catches cycles. If you accidentally write `T1 → T2 → T1`, validate will tell you. Most cycles arise from copy-paste in `depends_on` lists; check yours before saving.
-
-## Self-loops
-
-`T1: depends_on: [T1]` is a self-loop, also caught by validate. Always a typo.
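For illustration, both mistakes look like this in a plan. The fragment is deliberately invalid: per the text above, `pilot validate` is expected to reject it, though the wording of the error is not shown in this diff. The task ids are hypothetical.

```yaml
# Intentionally invalid: do not copy.
tasks:
  - id: T1
    depends_on: [T2]   # T1 waits on T2 ...
  - id: T2
    depends_on: [T1]   # ... and T2 waits on T1, so neither can ever start (cycle)
  - id: T3
    depends_on: [T3]   # self-loop: T3 waits on itself
```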
-
-## "I want everything serial"
-
-Sometimes the right answer IS a fully linear DAG (e.g., a refactor where each step's diff would conflict with the next). Don't be afraid to chain everything if that's the truth — but don't pretend it's the truth when it isn't.
package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/decomposition.md +0 -63
@@ -1,63 +0,0 @@
-# Rule 2 — Decomposition
-
-**Right-sized tasks: 10-30 minutes of agent time, ≤3 attempts to pass verify.**
-
-A "right-sized" pilot task is one the pilot-builder can complete in a single session within the default `max_turns: 50` budget. Empirically, that's about 10-30 minutes of agent wall time and 1-3 attempts.
-
-## Sizing heuristics
-
-**Too big (split it):**
-
-- The verify command exercises >3 distinct code paths.
-- The task touches >5 files.
-- The prompt has >10 numbered steps.
-- The task says "and also" / "while you're at it" — a sign of conjoined work.
-
-**Too small (merge it):**
-
-- The task touches a single file with <30 lines added/changed.
-- The verify command would also pass before the task ran.
-- Splitting added a `depends_on` edge that just moves work around.
-
-## Splitting patterns
-
-- **Layer-by-layer**: schema → DB accessors → API → wiring. Each layer has its own tests; each is a task.
-- **Read → Write**: T1 = "add a function that returns the data", T2 = "add an endpoint that calls it". T2 depends on T1.
-- **Skeleton → Detail**: T1 = "introduce the module structure with stubs", T2-Tn = "fill in each stub with logic+tests". The stubs let downstream tasks parallelize.
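A Skeleton → Detail split could look roughly like this. File paths, test names, and commands are hypothetical; the point of the sketch is that T2 and T3 each depend only on the skeleton task, not on each other.

```yaml
# Skeleton -> Detail: one task creates stubs, later tasks fill them in independently.
tasks:
  - id: T1
    title: "Introduce the parser module structure with stubs"
    touches:
      - src/parser/index.ts
      - src/parser/lexer.ts
      - src/parser/ast.ts
      - test/parser/skeleton.test.ts
    verify:
      # Imports every stub, so it cannot pass before the module exists.
      - pnpm test test/parser/skeleton.test.ts
  - id: T2
    title: "Implement the lexer stub with logic and tests"
    depends_on: [T1]
    touches:
      - src/parser/lexer.ts
      - test/parser/lexer.test.ts
    verify:
      - pnpm test test/parser/lexer.test.ts
  - id: T3
    title: "Implement the AST builder stub with logic and tests"
    depends_on: [T1]   # depends on the skeleton only, not on T2
    touches:
      - src/parser/ast.ts
      - test/parser/ast.test.ts
    verify:
      - pnpm test test/parser/ast.test.ts
```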
-
-## Anti-patterns
-
-- **Refactor as one task.** "Refactor X" is a feature, not a task. Decompose into `extract Y`, `inline Z`, `rename W`, each with its own verify.
-- **Setup-only tasks.** "Install lodash" is not a pilot task — the next task can install it as part of its own scope. Avoid tasks that don't deliver an observable check.
-- **Cleanup-only tasks.** "Remove dead code". The verify is "tests still pass" — but tests passing was already the contract on the previous task. If there's nothing new to assert, this isn't a task.
-
-## When you can't decompose
-
-If the work genuinely doesn't decompose (e.g., a 200-line algorithm that has to land atomically), it might not be a fit for pilot. Tell the user; they may want to run it as a regular `/build` task instead.
-
-## Plan sizing — count of tasks
-
-Per-task size is covered above. Plan-level size (total task count) is a different dimension and has its own sweet spot: **roughly 5–30 tasks per `pilot.yaml`**. Outside this range:
-
-- **Fewer than 5 tasks:** usually means the work is a single change that doesn't benefit from the pilot harness. Consider `/plan` + `/build` instead.
-- **More than 30 tasks:** fine in principle, but at that size the plan probably spans enough distinct concerns that a human reviewer will want it split — not a pilot problem, a PR-shape problem.
-
-### Multi-issue cross-cutting plans are a first-class shape
-
-It is **normal and correct** for a single pilot plan to span 2–4 related issues (Linear tickets, GitHub issues) **when those issues share setup and verify infrastructure** — same repo, same package manager, same `docker-compose`, same test runner, same migrations. Reasons to bundle:
-
-- **Setup amortization.** `pnpm install`, `docker compose up`, `pnpm db:migrate`, seed scripts — each of these is minutes of wall time. Running them once per pilot session vs. once per Linear issue saves hours across a multi-issue push.
-- **Context reuse.** The builder learns the codebase through reading during early tasks; that context benefits every subsequent task in the run.
-- **Shared acceptance.** Cross-issue integration checks (a milestone-close verify that exercises all three issues' changes together) are natural in one plan, awkward across three runs.
-
-**Reference shape (not a red flag):** rule-engine cleanup + LISTEN/NOTIFY cache invalidation + read-only admin UI landed together in one plan of ~19 tasks across 4 milestones, covering 3 Linear issues. This is the shape pilot is built for.
-
-When bundling, the tasks from different issues typically form **disconnected subtrees** in the DAG (no real semantic dependency between them). That's fine — see [`dag-shape.md`](dag-shape.md)'s "Disconnected" pattern. Task-level `cascadeFail` only blocks transitive dependents, so a failure in one subtree doesn't cascade into the siblings.
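A bundled plan with two disconnected subtrees might be laid out as below, reusing the `auth-*` and `billing-*` ids from the "Disconnected" pattern in `dag-shape.md`. The comments restate the `cascadeFail` behavior described above; the titles are hypothetical and other fields are omitted.

```yaml
# Two issues in one plan; no edge crosses between the auth-* and billing-* chains.
tasks:
  - id: auth-1
    title: "Rotate session tokens on login"
  - id: auth-2
    title: "Expire stale sessions"
    depends_on: [auth-1]      # transitive dependent: blocked if auth-1 fails
  - id: billing-1
    title: "Add invoice totals accessor"
  - id: billing-2
    title: "Expose invoice totals endpoint"
    depends_on: [billing-1]   # different subtree: keeps running even if auth-1 fails
```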
-
-### When to split instead of bundle
-
-Split into separate pilot plans when:
-
-- The issues live in **different repositories**.
-- The issues require **fundamentally different setup environments** (e.g., one needs Postgres + Temporal, the other needs a headless browser grid — sharing setup is worse than paying the cost twice).
-- The issues have **fundamentally different acceptance criteria** (e.g., one is a TypeScript refactor verified via typecheck, the other is an infrastructure change verified via a manual operator playbook — no shared verify makes sense).
package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/first-principles.md +0 -29
@@ -1,29 +0,0 @@
-# Rule 1 — First-principles task framing
-
-**Frame from intent, not from a template.**
-
-Bad plans start with a checklist ("read AGENTS.md → write tests → write code → run tests"). Good plans start with the question: *what does the user actually want at the end of this?*
-
-## What to ask yourself
-
-1. **What is the working state at the end of the run?** A passing test suite that previously failed? A new endpoint serving real traffic? A refactor with zero behavior change? Different end-states demand different task shapes.
-
-2. **What can fail?** A task that "adds an import" can't really fail. A task that "implements pagination across three layers" can fail in a hundred ways. The latter needs decomposition.
-
-3. **What does the verify catch?** If you can't articulate the failure mode each verify command detects, the verify is decoration.
-
-4. **What is the smallest change that ships?** Pilot is good at small surgical work. If the user wants a wholesale rewrite, pilot is the wrong tool — say so.
-
-## Talk to the user — once
-
-Before you spend an hour reading code, take 2 minutes to ask the user 1-3 clarifying questions:
-
-- Scope (what's in / out of this plan?)
-- Success criteria (how do we know we're done?)
-- Constraints (deps to use, deps to avoid, tests to preserve)
-
-Do this BEFORE applying rules 2-7. The cheapest mistake to fix is the one you avoid by understanding intent up front.
-
-## Then read code
-
-Don't ask the user things you can answer by reading code. Don't ask "what test framework do you use?" — `package.json` says. Don't ask "where does auth live?" — `grep` it. Use the user's time only for things genuinely unknown to the codebase.
package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/milestones.md +0 -57
@@ -1,57 +0,0 @@
-# Rule 6 — Milestones (optional)
-
-**Use milestones to attach extra verify when a logical batch finishes.**
-
-Milestones are an optional grouping. They serve two purposes:
-
-1. **Status output** — `pilot status` groups tasks by milestone. Easier to read for big plans.
-2. **Milestone-level verify** — extra verify commands that run when the LAST task in the milestone completes.
-
-If neither of those is useful, don't add milestones. Plain task lists are simpler.
-
-## Schema
-
-```yaml
-milestones:
-  - name: M1
-    description: Foundation
-    verify:
-      - bun run integration-test:foundation
-  - name: M2
-    description: API layer
-    verify:
-      - bun run integration-test:api
-
-tasks:
-  - id: T1
-    title: schema
-    milestone: M1
-  - id: T2
-    title: db
-    milestone: M1
-  - id: T3
-    title: endpoint
-    milestone: M2
-```
-
-Each task has an optional `milestone:` label. The label must match a `milestones[].name` (the validator catches typos).
-
-## When milestone verify fires
-
-Milestone-level verify runs **after the last task in that milestone completes successfully**. "Last" = last in topological order among tasks with that label. If any task in the milestone fails or gets blocked, the milestone verify does NOT run (the cascade-fail will block downstream work anyway).
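As a sketch of the timing rule, assume the two M1 tasks from the schema example are chained with a `depends_on` edge. The edge is an addition for illustration only; the schema example itself defines no edges.

```yaml
milestones:
  - name: M1
    description: Foundation
    verify:
      - bun run integration-test:foundation

tasks:
  - id: T1
    title: schema
    milestone: M1
  - id: T2
    title: db
    milestone: M1
    depends_on: [T1]   # makes T2 the last M1 task in topological order

# Per the rule above:
#   T1 succeeds, T2 succeeds    -> M1's verify runs once, after T2.
#   T1 fails (T2 then blocked)  -> M1's verify does not run.
#   T1 succeeds, T2 fails       -> M1's verify does not run.
```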
-
-## When to use them
-
-- **Multi-layer features** where you want an integration test after each layer (schema, API, UI).
-- **Long plans** (8+ tasks) where the user wants visible progress markers.
-- **Mixed-domain plans** where milestones group related work for status readability.
-
-## When NOT to use them
-
-- Simple plans (≤5 tasks). Just list the tasks; status output is fine without grouping.
-- Plans where every "milestone" has only one task. Use task verify instead.
-- Plans where the milestone verify is "the same as the last task's verify". Redundant.
-
-## Don't conflate milestone with dep
-
-Milestones are a presentation/verify-grouping concept; they do NOT change scheduling. If T3 needs T2 done before it can start, that's a `depends_on: [T2]`, not a `milestone:` label. The DAG and milestones are independent axes.
package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/qa-expectations.md +0 -120
@@ -1,120 +0,0 @@
-# Rule 10 — QA-expectations establishment
-
-**Detect → propose → confirm per-surface verify patterns.**
-
-A plan's verify commands are its contract with the builder. Generic verifies ("run tests") waste builder time; specific verifies ("run the API tests that exercise the files this task touches") catch real failures. This rule establishes concrete, per-surface QA expectations with the user before emitting the plan.
-
-## The six surfaces
-
-For each surface below, detect signals in the codebase, propose a canonical verify pattern, and confirm with the user.
-
-### UI — Browser-based user interface
-
-**Detection signals:**
-- `@playwright/test`, `cypress`, or `@vitest/browser` in `package.json` dependencies
-- `playwright.config.{ts,js}` or `cypress.config.*` present
-
-**Proposed verify pattern:**
-Playwright MCP invocation for visual/interaction assertions:
-```yaml
-verify:
-  - playwright test --project=chromium --grep "@task-specific-tag"
-```
-
-### API — HTTP endpoints
-
-**Detection signals:**
-- `openapi.yaml` / `openapi.json` present
-- `curl` or `httpie` usage in existing scripts
-- Postman collection files
-
-**Proposed verify pattern:**
-Direct HTTP assertion against a local port:
-```yaml
-verify:
-  - curl -fsS http://localhost:3000/health | jq -e '.status == "ok"'
-```
-
-### DB — Database schema and queries
-
-**Detection signals:**
-- `docker-compose` postgres service defined
-- `prisma`, `drizzle-kit`, `knex`, or `flyway` in dependencies
-- `test/db` or similar helper directory
-
-**Proposed verify pattern:**
-Postgres readiness + migration + assertion:
-```yaml
-verify:
-  - pg_isready -h localhost -p 5432
-  - pnpm prisma migrate deploy
-  - pnpm tsx scripts/verify-db.ts
-```
-
-### Integration — Cross-module workflows
-
-**Detection signals:**
-- `test/integration/**` directory exists
-- `e2e/**` directory exists
-- `*.integration.test.ts` files
-
-**Proposed verify pattern:**
-Integration test runner scoped to relevant paths:
-```yaml
-verify:
-  - pnpm test test/integration
-```
-
-### Browser-based component — Storybook stories
-
-**Detection signals:**
-- `storybook` or `@storybook/*` in dependencies
-- `*.stories.{ts,tsx}` files present
-
-**Proposed verify pattern:**
-Storybook test or Chromatic visual verification:
-```yaml
-verify:
-  - pnpm storybook test --stories "ComponentName"
-```
-
-### CLI — Command-line interface
-
-**Detection signals:**
-- `bin/*` directory with executables
-- `package.json` `bin:` entry defined
-
-**Proposed verify pattern:**
-Smoke test via help flag or scripted invocation:
-```yaml
-verify:
-  - pnpm my-cli --help
-  - pnpm tsx scripts/smoke-test-cli.ts
-```
-
-## Question-bundling rule
-
-**Two or more surfaces detected:** Bundle into a single structured `question` tool call with one checkbox group per surface.
-
-**One surface detected:** Still ask (confirmation, not interrogation), but use a single-field call.
-
-**Zero surfaces detected:** Skip the QA-expectation question entirely. Fall back to generic verifies:
-```yaml
-defaults:
-  verify_after_each:
-    - pnpm run typecheck
-    - pnpm test
-```
-
-## Emission
-
-Confirmed patterns become:
-
-1. **Per-task verify templates** — tasks targeting specific files use scoped verifies (e.g., `pnpm test test/api/users.test.ts` for a task touching `src/api/users.ts`)
-2. **defaults.verify_after_each** — global breakage catchers (typecheck, full test suite)
-
-The rule: per-task verify targets the specific files touched; defaults catches global breakage.
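Put together, the two layers might appear in a plan like this. It is a sketch only: `defaults.verify_after_each` and the scoped test path come from the text above, while the task id, title, and `touches` entries are hypothetical.

```yaml
# Global breakage catchers, run after every task.
defaults:
  verify_after_each:
    - pnpm run typecheck
    - pnpm test

tasks:
  - id: T1
    title: "Add pagination to the users endpoint"
    touches:
      - src/api/users.ts
      - test/api/users.test.ts
    # Scoped verify: targets only the files this task touches.
    verify:
      - pnpm test test/api/users.test.ts
```

The scoped command gives the builder a fast, specific signal for its own change; the defaults catch anything the change broke elsewhere.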
-
-## Cross-reference to verify-design.md
-
-This rule (10) is the per-surface tactical layer — it names the tools to detect and the patterns to propose. Rule 3 (verify-design.md) owns the principles: deterministic, assertive, would-have-failed-before. Every proposed command must satisfy both layers.
package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/self-review.md +0 -46
@@ -1,46 +0,0 @@
-# Rule 7 — Self-review
-
-**Before declaring the plan ready, run through this checklist.**
-
-The validator catches schema, DAG, and glob errors. It cannot catch "this verify is too weak" or "this scope is too loose". You can.
-
-## The 7 questions
-
-1. **Is each task right-sized?** Reread each task's prompt. Could the pilot-builder do it in ~20 minutes with the standard `max_turns: 50`? If a task feels like 2 hours of work, split it. If it feels like 2 minutes, merge it.
-
-2. **Does each verify command HAVE to fail before the task runs?** For each task, mentally checkout the pre-task state. Would the verify command fail there? If not, the verify isn't observing the task's effect — fix it.
-
-3. **Is each `touches:` glob the tightest fit?** For each task, list the files the agent will need to edit. Are they all matched? Are there ANY paths matched that the agent SHOULDN'T touch? If yes to either, refine.
-
-4. **Does the DAG match the actual dependencies?** For each `depends_on:` edge, ask: does the dependent task READ code the dep produces, or assume schema the dep modifies? If "no" to both, the edge is false. Drop it.
-
-5. **Are there missing edges?** Look at every pair of tasks that share files in their `touches:`. Do they need an order? If T2's verify exercises code T1 introduces, T2 depends on T1 — even if their `touches:` don't overlap.
-
-6. **Does the DAG concentrate too much value in one task?** Task-level `cascadeFail` only blocks transitive DEPENDENTS of the failed task — sibling subtrees in a disconnected DAG keep running. So plan size is not itself a risk. The real risk is a task everything else depends on: a schema migration that all downstream work reads, a core-type definition all imports reference, a shared config every consumer parses. If THAT task fails, the whole run stalls. Is there such a task in your plan? If yes, can it be simplified — smaller diff, tighter verify, higher success probability? Don't over-concentrate; a plan where 80% of tasks depend on T1 and T1 is complex is fragile by design.
-
-7. **Could you read this plan in 6 months and understand it?** Plan names + task titles + prompts should be a self-explanatory summary of the work. If the plan needs a verbal preamble to make sense, rewrite the prompts.
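Questions 2 and 3 are easiest to see side by side. The two YAML documents below are hypothetical versions of the same task: the first has verifies that assert nothing about behavior and a scope that matches far more than the task needs, and the second tightens both. Paths and commands are illustrative assumptions.

```yaml
# Version 1 (weak): neither verify asserts behavior, and the glob matches the whole tree.
tasks:
  - id: T1
    title: "Compute invoice totals"
    touches:
      - "src/**"
    verify:
      - echo done                         # always exits 0
      - test -f src/billing/invoice.ts    # checks existence, not correctness
---
# Version 2 (tighter): scoped files, and a test assumed to fail before the task and pass after.
tasks:
  - id: T1
    title: "Compute invoice totals"
    touches:
      - src/billing/invoice.ts
      - test/billing/invoice.test.ts
    verify:
      - pnpm test test/billing/invoice.test.ts
```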
-
-## Run validate
-
-```
-bunx @glrs-dev/harness-plugin-opencode pilot validate <plan-path>
-```
-
-Fix every error AND warning. The "warnings" tier (e.g., glob conflicts between tasks) is also yours to action — either decide they're OK and document it, or restructure.
-
-## When the plan is ready
-
-When all seven questions are answered "yes" and `pilot validate` exits 0:
-
-- Save the plan.
-- Tell the user: `Plan saved to <path>. Next: bunx @glrs-dev/harness-plugin-opencode pilot build`.
-- Stop. Don't summarize. Don't editorialize. The user can read the YAML.
-
-## When the plan is NOT ready
-
-If you can't answer "yes" to any of the seven questions and you don't see a way to fix it within the planning session:
-
-- Tell the user honestly. "I can't produce a plan that I'd trust the unattended builder to execute, because <specific reason>."
-- Suggest the regular `/plan` agent (markdown plans, human-driven `/build`) or a manual decomposition.
-
-It is far better to refuse than to ship a bad plan.
package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/task-context.md +0 -47
@@ -1,47 +0,0 @@
-# Task context
-
-Every non-trivial task in a pilot plan carries a `context:` field — a markdown block that preloads the builder agent with the narrative it needs to work confidently without re-discovering the problem from scratch.
-
-The builder gets a fresh opencode session per task. No carry-over from the planning conversation. No memory of which files the planner inspected. Just: title, touches, verify, context (if present), and the prompt directive. If `context:` is empty, the builder starts from the directive alone — fine for a one-line task ("add a CHANGELOG entry for version 1.2.3"), but painful for anything else.
-
-## What belongs in context
-
-- **The user-facing outcome.** In one sentence, what changes from a user's perspective when this task lands? Why should anyone care it got done?
-- **The rationale / why this task exists.** What problem is this task solving? Why is it broken out as a separate task rather than rolled into a sibling? The planner had reasons; write them down.
-- **Code pointers.** The specific files / functions / types the builder should read BEFORE editing. Name them with paths the builder can `read` directly. E.g., "Start by reading `src/pilot/cli/build.ts:resolvePlanPath` (lines 350-370) — the three-step fallback lives there." Saves 3-10 minutes of the builder re-grepping the repo.
-- **Acceptance shorthand.** What "done" looks like from the human's view — a sentence or two that complements the machine-checkable `verify:` list. Verify says "tests pass"; context says "the user can now type `pilot build plan-name` without the full path."
-- **Gotchas / constraints.** Anything the builder would trip over that `prompt:` shouldn't carry as a directive. "The schema is `.strict()` — don't add unknown keys." "Downstream tools parse stdout; keep streaming logs on stderr."
14
|
-
|
|
15
|
-
## What does NOT belong in context
|
|
16
|
-
|
|
17
|
-
- **The directive itself.** "Add a function that …" is `prompt:` territory. Keep context for grounding, prompt for the imperative.
|
|
18
|
-
- **Implementation plans.** Don't pre-decide how the builder should write the code. `touches:` constrains the scope; the builder picks the structure within it. If you find yourself writing "first add X, then update Y, then rename Z," either the task is too big (split it) or you're over-specifying (trust the builder).
|
|
19
|
-
- **Copy-pasted architecture diagrams.** If it's longer than ~40 lines, it probably belongs in a doc file the builder can read via `touches`, not inline in the plan.
|
|
20
|
-
- **Tutorials.** The builder already knows how to write TypeScript / run tests / use `edit`. Don't explain the fundamentals; link to the specific non-obvious convention in the repo (AGENTS.md, CLAUDE.md).
|
|
21
|
-
|
|
22
|
-
## Length guidance
|
|
23
|
-
|
|
24
|
-
- **Trivial task** (one-line prompt, ≤1 file, ≤10 LOC): `context:` is optional; omitting it is fine.
|
|
25
|
-
- **Standard task** (3-5 files, non-trivial logic): one paragraph minimum, 3-5 sentences covering outcome, rationale, and the 2-3 most relevant code pointers.
|
|
26
|
-
- **Complex task** (many files, architectural change): several paragraphs, organized under headers (`### Outcome`, `### Rationale`, `### Code pointers`, `### Acceptance`). If you're writing more than ~60 lines of context, reconsider: is this really one task, or should it be split?
|
|
27
|
-
|
|
28
|
-
## Relationship to other fields
|
|
29
|
-
|
|
30
|
-
- **`prompt:`** is the directive. It says "do X." Keep it crisp — one to three short paragraphs max. If you're tempted to put narrative in `prompt:`, move it to `context:`.
|
|
31
|
-
- **`verify:`** is the machine contract. Binary, scripted, precise.
|
|
32
|
-
- **`touches:`** is the scope ceiling. Lists every file the builder is allowed to edit.
|
|
33
|
-
- **`context:`** is the human narrative. Read by the builder once at kickoff; helps the builder understand WHICH files inside `touches:` to read first and WHAT the end user will perceive.
|
|
34
|
-
|
|
35
|
-
The four work together: `context:` orients, `touches:` bounds, `prompt:` directs, `verify:` confirms.
|
|
36
|
-
|
|
37
|
-
## Emission
|
|
38
|
-
|
|
39
|
-
The kickoff prompt sent to the builder renders `context:` as a `## Context` section between the scope/verify block and the final `## Task` directive. Reading order: hard rules → allowed scope → verify commands → **context (grounding)** → task (act). The builder reads context right before the directive so the directive is the last, most salient framing when it starts making edits.
|
|
40
|
-
|
|
41
|
-
Empty `context:` → no `## Context` section emitted. No penalty for omission on trivial tasks.
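
The reading order above can be sketched as a small renderer. This is a minimal sketch, not the harness's actual code: the `Task` interface, the `renderKickoff` function, and every heading other than `## Context` and `## Task` are assumptions made for illustration.

```typescript
// Hypothetical task shape; field names mirror the plan fields described above.
interface Task {
  title: string;
  touches: string[];
  verify: string[];
  context?: string;
  prompt: string;
}

// Assemble sections in reading order:
// hard rules -> allowed scope -> verify commands -> context -> task.
function renderKickoff(task: Task, hardRules: string): string {
  const bullets = (items: string[]): string =>
    items.map((item) => `- ${item}`).join("\n");

  const sections: string[] = [
    `## Hard rules\n${hardRules}`, // heading name is an assumption
    `## Allowed scope\n${bullets(task.touches)}`, // heading name is an assumption
    `## Verify\n${bullets(task.verify)}`, // heading name is an assumption
  ];

  // Empty or missing context emits no "## Context" section at all.
  const context = (task.context ?? "").trim();
  if (context !== "") {
    sections.push(`## Context\n${context}`);
  }

  // The directive comes last so it is the most recent framing the builder reads.
  sections.push(`## Task\n${task.prompt}`);
  return sections.join("\n\n");
}

const rendered = renderKickoff(
  {
    title: "Add CHANGELOG entry",
    touches: ["CHANGELOG.md"],
    verify: ["bun test"],
    prompt: "Add a CHANGELOG entry for version 1.2.3.",
  },
  "Edit only files in the allowed scope.",
);
console.log(rendered.includes("## Context")); // false
```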
|
|
42
|
-
|
|
43
|
-
## Anti-pattern: copying the user's original request
|
|
44
|
-
|
|
45
|
-
Don't just paste the Linear ticket description or the user's chat message into `context:`. That defeats the point of planning — you're supposed to have DIGESTED the request into task-shaped outcomes, not forwarded it verbatim. If the context reads like the ticket, the planning didn't do its job.
|
|
46
|
-
|
|
47
|
-
Good context is specific to *this task*, referencing *this task's* files, *this task's* verify commands, *this task's* narrow success criterion. Plan-wide or epic-wide context belongs at the plan level (the top-of-file `name:` and `branch_prefix:`), not duplicated into every task.
|
|
@@ -1,81 +0,0 @@
|
|
|
1
|
-
# Rule 4 — `touches:` scope tightness
|
|
2
|
-
|
|
3
|
-
**Globs must be the tightest set that lets the task succeed. `**` is a smell.**
|
|
4
|
-
|
|
5
|
-
The `touches:` list is the agent's leash. After verify passes, the worker computes `git diff --name-only` against the worktree's pre-task SHA; any path NOT matched by `touches:` is a violation and the task fails.
|
|
6
|
-
|
|
7
|
-
This catches:
|
|
8
|
-
|
|
9
|
-
- Agents that "helpfully" reformat unrelated files.
|
|
10
|
-
- Agents that modify a test in a far-away module to make verify pass.
|
|
11
|
-
- Agents that drift into copilot-style imports of unrelated utils.
|
|
12
|
-
|
|
13
|
-
Tight scopes also let v0.3's parallel scheduler safely run two tasks at once — if their touches don't intersect, they can't conflict.
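
The check described above can be approximated in a few lines. The sketch below assumes a simplified glob dialect (only `*` and `**`) and takes the changed paths as an array, as if already read from `git diff --name-only`; `globToRegExp` and `findTouchesViolations` are hypothetical helpers, and the worker's real matcher may behave differently.

```typescript
// Convert a simplified glob into a RegExp. Assumed dialect:
// "**/" spans zero or more directories, a trailing "**" matches everything
// below, and "*" matches within a single path segment.
function globToRegExp(glob: string): RegExp {
  let re = "";
  let i = 0;
  while (i < glob.length) {
    const c = glob.charAt(i);
    if (c === "*" && glob.charAt(i + 1) === "*") {
      if (glob.charAt(i + 2) === "/") {
        re += "(?:.*/)?";
        i += 3;
      } else {
        re += ".*";
        i += 2;
      }
    } else if (c === "*") {
      re += "[^/]*";
      i += 1;
    } else {
      // Escape characters that are special in regular expressions.
      re += c.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp(`^${re}$`);
}

// Return every changed path that no `touches:` glob matches.
function findTouchesViolations(changed: string[], touches: string[]): string[] {
  const matchers = touches.map(globToRegExp);
  return changed.filter((path) => !matchers.some((m) => m.test(path)));
}

const changed = ["src/api/users.ts", "test/api/users.test.ts", "src/utils/format.ts"];
console.log(findTouchesViolations(changed, ["src/api/**", "test/api/**"]));
// ["src/utils/format.ts"]
```

Under this sketch an empty `touches` list yields no matchers, so every changed path is reported as a violation.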
|
|
14
|
-
|
|
15
|
-
## Heuristics
|
|
16
|
-
|
|
17
|
-
- **One module = one glob.** `src/api/**` and `test/api/**` for an API task. Not `src/**`.
|
|
18
|
-
- **Exact files when you know them.** `src/auth/login.ts` is better than `src/auth/**` if the task is just "edit login.ts".
|
|
19
|
-
- **Test files belong with their source files.** A task that adds source code almost always adds or edits a test. Both go in `touches:`.
|
|
20
|
-
- **Manifest and lock files: rarely.** `package.json` / `bun.lock` / `Cargo.lock` should appear ONLY when the task explicitly says "add a dependency". Don't include them speculatively.
|
|
21
|
-
- **Config files: rarely.** `tsconfig.json`, `.eslintrc`, `package.json` scripts — only if the task is about config.
|
|
22
|
-
|
|
23
|
-
## When `**` IS reasonable
|
|
24
|
-
|
|
25
|
-
- The task is a global rename / rewrite (across the whole repo).
|
|
26
|
-
- The task is "fix every TODO in the codebase" — touches everything by intent.
|
|
27
|
-
- The task explicitly says "this is a sweeping change".
|
|
28
|
-
|
|
29
|
-
In these cases, `**` is fine; the AGENT'S diligence becomes the constraint instead of the touches enforcement.
|
|
30
|
-
|
|
31
|
-
## What `touches: []` means
|
|
32
|
-
|
|
33
|
-
An empty `touches` list means the task **must NOT edit any files**. Use this for:
|
|
34
|
-
|
|
35
|
-
- Verify-only tasks (e.g., "confirm the existing tests still pass after a deps update was made by an upstream task").
|
|
36
|
-
- Probing tasks (e.g., "run benchmarks and report results" — though pilot doesn't yet have a "report results" mechanism, so this is rare).
|
|
37
|
-
|
|
38
|
-
If the verify commands would FAIL without edits, an empty `touches` is a STOP — the task is contradictory.
|
|
39
|
-
|
|
40
|
-
## Common mistakes
|
|
41
|
-
|
|
42
|
-
- **`touches: ["**/*.ts"]`** — too loose. Better: list the actual modules.
|
|
43
|
-
- **Forgetting tests.** Source-only `touches:` makes the task fail when the agent (correctly) edits the test file.
|
|
44
|
-
- **Forgetting docs.** If the task explicitly says "update README", README must be in `touches:`.
|
|
45
|
-
- **Including the migrations dir for a non-migration task.** Tight scope.
|
|
46
|
-
|
|
47
|
-
When in doubt, write the tightest possible scope first. If the task then fails with "touches violation: src/X.ts", the worker shows you which file got touched; broaden the scope at that point.
|
|
48
|
-
|
|
49
|
-
## `tolerate:` — files allowed in the diff but outside the contract
|
|
50
|
-
|
|
51
|
-
When a task's verify step runs a tool that writes files as a side-effect (codegen, build, snapshots), those files will appear in `git diff` even though the agent didn't author them. Add them to `tolerate:` so enforcement accepts them without counting them as part of the task's output.
|
|
52
|
-
|
|
53
|
-
Two categories to watch for:
|
|
54
|
-
|
|
55
|
-
**Built-in defaults (already tolerated — don't list these):**
|
|
56
|
-
- `**/next-env.d.ts` — Next.js regenerates on every `next build`.
|
|
57
|
-
- `**/.next/types/**`, `**/.next/dev/types/**` — Next.js app-router generated types.
|
|
58
|
-
- `**/*.tsbuildinfo` — TypeScript project-reference build cache.
|
|
59
|
-
- `**/__snapshots__/**`, `**/*.snap` — Jest / Vitest snapshot files rewritten by `-u`.
|
|
60
|
-
|
|
61
|
-
**Project-specific (list in `tolerate:` per task):**
|
|
62
|
-
- Prisma client output (e.g., `prisma/client/**` if `prisma generate` runs in verify).
|
|
63
|
-
- GraphQL codegen output (`graphql/generated/**`, `*.graphql.d.ts`).
|
|
64
|
-
- OpenAPI codegen output (`api-types/generated/**`).
|
|
65
|
-
- Anywhere you have a build step that writes type declarations downstream of the agent's source edits.
|
|
66
|
-
|
|
67
|
-
A good test: if the task's verify step runs `prisma generate`, `pnpm codegen`, `next build`, or similar, ask: "does that command write files anywhere?" If yes, those paths go in `tolerate:`.
|
|
68
|
-
|
|
69
|
-
### Example
|
|
70
|
-
|
|
71
|
-
```yaml
|
|
72
|
-
- id: T-ADD-RULE-MODEL
|
|
73
|
-
touches:
|
|
74
|
-
- prisma/schema.prisma
|
|
75
|
-
- src/models/rule.ts
|
|
76
|
-
tolerate:
|
|
77
|
-
- prisma/client/** # prisma generate output
|
|
78
|
-
verify:
|
|
79
|
-
- pnpm prisma generate
|
|
80
|
-
- pnpm --filter core test rule-model
|
|
81
|
-
```
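
To make the example concrete, here is a hedged sketch of how a diff for `T-ADD-RULE-MODEL` might be classified. The glob dialect, the `classify` helper, and the choice to check `touches:` before `tolerate:` are all assumptions; only the default patterns are taken from the list above.

```typescript
// Simplified glob matcher (assumed dialect: "*" within a segment,
// "**/" for zero or more directories, trailing "**" for everything below).
function toRegExp(glob: string): RegExp {
  let re = "";
  let i = 0;
  while (i < glob.length) {
    const c = glob.charAt(i);
    if (c === "*" && glob.charAt(i + 1) === "*") {
      const spansDirs = glob.charAt(i + 2) === "/";
      re += spansDirs ? "(?:.*/)?" : ".*";
      i += spansDirs ? 3 : 2;
    } else if (c === "*") {
      re += "[^/]*";
      i += 1;
    } else {
      re += c.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp(`^${re}$`);
}

// Built-in defaults, copied from the list above.
const DEFAULT_TOLERATE = [
  "**/next-env.d.ts",
  "**/.next/types/**",
  "**/.next/dev/types/**",
  "**/*.tsbuildinfo",
  "**/__snapshots__/**",
  "**/*.snap",
];

type Verdict = "touched" | "tolerated" | "violation";

// Assumed precedence: a path in `touches:` counts as task output;
// otherwise a tolerated path is accepted; anything else is a violation.
function classify(path: string, touches: string[], tolerate: string[]): Verdict {
  const matches = (globs: string[]): boolean =>
    globs.some((g) => toRegExp(g).test(path));
  if (matches(touches)) return "touched";
  if (matches([...DEFAULT_TOLERATE, ...tolerate])) return "tolerated";
  return "violation";
}

const touches = ["prisma/schema.prisma", "src/models/rule.ts"];
const tolerate = ["prisma/client/**"];

console.log(classify("prisma/schema.prisma", touches, tolerate)); // "touched"
console.log(classify("prisma/client/index.d.ts", touches, tolerate)); // "tolerated"
console.log(classify("tsconfig.tsbuildinfo", touches, tolerate)); // "tolerated"
console.log(classify("src/models/user.ts", touches, tolerate)); // "violation"
```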
|