npm - @glrs-dev/cli - Versions diffs - 0.0.1 → 0.1.1 - Mend

@glrs-dev/cli 0.0.1 → 0.1.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (173)

package/dist/vendor/harness-opencode/dist/install-4EYR56OR.js ADDED

@@ -0,0 +1,9 @@
+import {
+  MODEL_PRESETS,
+  install
+} from "./chunk-VVMP6QWS.js";
+import "./chunk-VJUETC6A.js";
+export {
+  MODEL_PRESETS,
+  install
+};

package/dist/vendor/harness-opencode/dist/skills/agent-estimation/SKILL.md ADDED

@@ -0,0 +1,159 @@
+---
+name: agent-estimation
+description: Estimate AI-agent task effort in tool-call rounds first, convert to wallclock only at the end. Use when the user asks 'how long will this take', 'estimate this', 'scope this work', 'round budget', 'effort estimate', or asks for a timeline on agent-executed work. Produces a structured module-breakdown table with risk coefficients and a final wallclock range. Avoids the systematic overestimation that happens when agents anchor to human-developer timelines from training data.
+---
+# Agent Work Estimation
+## Why this skill exists
+AI coding agents systematically overestimate task duration because they anchor to human-developer timelines absorbed from training data. A task you can complete in 30 minutes of agent time gets estimated as "2-3 days" because that's what a Stack Overflow answer would say.
+**The fix:** estimate in your own operational units — tool-call rounds — first. Convert to human wallclock only at the very end, as the last step.
+This skill is adapted for the harness-opencode environment from the OpenClaw `hjw21century/agent-estimation` skill. Original source: https://openclawlaunch.com/skills/agent-estimation.
+## Units
+| Unit | Definition | Scale |
+|------|------------|-------|
+| **Round** | One tool-call cycle: think → write/edit → execute → read output → decide if fix needed | ~2-4 min wallclock |
+| **Module** | A functional unit built from multiple rounds until it's usable on its own | 2-15 rounds |
+| **Project** | Sum of modules + integration rounds | Σ(modules) + integration |
+A **Round** is the atomic unit. It maps to one iteration of:
+1. Agent reasons about what to do.
+2. Agent writes or edits code.
+3. Agent runs the code or a test.
+4. Agent reads the output.
+5. Agent decides if it needs to fix something. If yes → next round.
+## Procedure
+Follow these five steps in order. Do NOT skip ahead to step 5: premature wallclock conversion is the failure mode.
+### Step 1: Decompose into modules
+Break the task into functional modules. Each module should be independently buildable and testable. Ask: "What are the distinct pieces I would build one at a time?"
+### Step 2: Estimate base rounds per module
+Use these anchors:
+| Pattern | Typical rounds | Examples |
+|---------|----------------|----------|
+| **Boilerplate / known pattern** | 1-2 | CRUD endpoint, config file, standard API client, adding a file to match an existing recipe |
+| **Moderate complexity** | 3-5 | Custom UI layout, state management, data pipeline, non-trivial refactor |
+| **Exploratory / under-documented** | 5-10 | Unfamiliar framework, platform-specific APIs, complex integrations |
+| **High uncertainty** | 8-15 | Undocumented behavior, novel algorithms, multi-system debugging |
+Calibration rules:
+- If you can generate the code in one shot and it will likely run → **1 round**.
+- If you'll generate, run, see an error, fix → **2-3 rounds**.
+- If the library/framework has sparse docs and you'll be guessing → **5+ rounds**.
+- If it involves platform permissions, OS-level APIs, or environment-specific behavior the user must manually verify → add **2-3 rounds**.
+### Step 3: Assign risk coefficients
+Each module gets a coefficient that inflates its round count:
+| Risk | Coefficient | When to apply |
+|------|-------------|---------------|
+| **Low** | 1.0 | Mature ecosystem, clear docs, strong pattern match |
+| **Medium** | 1.3 | Minor unknowns, may need 1-2 extra debug rounds |
+| **High** | 1.5 | Sparse docs, platform quirks, integration unknowns |
+| **Very High** | 2.0 | Possible dead ends, may need to change approach entirely |
+### Step 4: Calculate totals
+```
+module_effective_rounds = base_rounds × risk_coefficient
+integration_rounds      = 10-20% of base total (wiring modules together)
+project_rounds          = Σ(module_effective_rounds) + integration_rounds
+```
+### Step 5: Convert to wallclock — LAST
+Only after steps 1-4 are complete:
+```
+wallclock = project_rounds × minutes_per_round
+```
+Default `minutes_per_round = 3` (agent generation + user review).
+Adjust:
+- Fast iteration, user barely reviews → **2 min/round**.
+- Complex domain, user carefully reviews each step → **4 min/round**.
+- User needs to manually test (mobile, hardware, permissions) → **5 min/round**.
+## Output format
+Always produce the estimation in this exact structure:
+```markdown
+### Task: <task name>
+#### Module breakdown
+| # | Module | Base rounds | Risk | Effective rounds | Notes |
+|---|--------|-------------|------|------------------|-------|
+| 1 | ...    | N           | 1.x  | M                | why   |
+| 2 | ...    | N           | 1.x  | M                | why   |
+#### Summary
+- **Base rounds:** X
+- **Integration:** +Y rounds
+- **Risk-adjusted total:** Z rounds
+- **Estimated wallclock:** A – B minutes (at N min/round)
+#### Biggest risks
+1. <specific risk and what could blow up the estimate>
+2. <…>
+```
+## Anti-patterns to avoid
+These are the exact failure modes this skill exists to prevent:
+1. **Human-time anchoring:** "A developer would take about 2 weeks…" → NO. Start from rounds.
+2. **Padding by vibes:** Adding time "just to be safe" without a specific risk rationale → NO. Use risk coefficients; each bump must have a reason.
+3. **Confusing complexity with volume:** 500 lines of boilerplate ≠ hard. One line of CGEvent API ≠ easy. Estimate by uncertainty, not line count.
+4. **Forgetting integration cost:** Modules work alone but break together. Always add 10-20% for integration.
+5. **Ignoring user-side bottlenecks:** If the user must grant permissions, restart an app, or test on a device, that's extra round time. Adjust `minutes_per_round` upward; don't add phantom rounds.
+6. **Premature wallclock conversion:** If you computed minutes before finishing step 4, start over. The whole point is to think in rounds first.
+## Calibration examples
+These anchor what "N rounds" feels like in this codebase. Use them as reference points when estimating similar work.
+| Project | Module count | Total rounds | Notes |
+|---------|--------------|--------------|-------|
+| Add a new bundled skill (SKILL.md + test bump + build verify) | 1 | 2-3 | Recipe-driven, mature test suite, no new wiring |
+| Add a new agent with prompt + registration + test | 2 | 4-6 | New prompt file + `createAgents()` entry + test case |
+| Add a new slash command | 2 | 3-5 | Prompt file + `createCommands()` entry |
+| Add a new custom tool with schema + handler + test | 3 | 8-12 | Schema design + handler logic + integration point |
+| Refactor a cross-cutting concern (e.g., permission maps across all agents) | 3-5 | 15-25 | Medium-high risk due to surface area |
+| Add a new sub-plugin (hook + registration + tests) | 3-4 | 12-18 | Plugin API surface, test fixtures |
+| Non-trivial pilot subsystem feature (new verb, new scheduler rule) | 4-6 | 20-40 | Higher risk; SQLite schema + CLI + worker wiring |
+When in doubt, pick the closest example and adjust the risk coefficient for what makes this specific task different.
+## When to use this skill
+- Scoping a coding task before starting implementation.
+- Comparing two implementation approaches by round cost.
+- Setting realistic expectations with the user on agent-executed work.
+- Identifying which modules carry the most schedule risk.
+- Deciding whether a task fits in one session or needs to be split.
+## When NOT to use this skill
+- Trivial one-line edits (typo fixes, rename). Just do it; estimating takes longer than the work.
+- Open-ended research tasks where the "module breakdown" is the research itself. Estimate after the first exploratory round, not before.
+- Questions that aren't about effort ("how does X work", "what's the right pattern"). Answer the actual question.
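
The arithmetic in Steps 2 to 5 of `agent-estimation/SKILL.md` can be sketched in TypeScript. This is an illustration only and is not part of the package: the `ModuleEstimate` and `Estimate` types, the `estimate` function, and the 15% integration share (a midpoint of the stated 10-20% range) are assumptions made for the example.

```typescript
// Hypothetical types and names; only the formulas come from the skill text.
interface ModuleEstimate {
  name: string;
  baseRounds: number;
  risk: 1.0 | 1.3 | 1.5 | 2.0; // Low, Medium, High, Very High
}

interface Estimate {
  baseRounds: number;
  integrationRounds: number;
  projectRounds: number;
  wallclockMinutes: number;
}

function estimate(
  modules: ModuleEstimate[],
  integrationShare = 0.15, // assumption: midpoint of the 10-20% range
  minutesPerRound = 3, // the skill's stated default
): Estimate {
  const baseRounds = modules.reduce((sum, m) => sum + m.baseRounds, 0);
  // module_effective_rounds = base_rounds × risk_coefficient, summed over modules
  const effectiveRounds = modules.reduce((sum, m) => sum + m.baseRounds * m.risk, 0);
  // Integration is a share of the base total, not of the risk-adjusted total.
  const integrationRounds = baseRounds * integrationShare;
  const projectRounds = effectiveRounds + integrationRounds;
  // Wallclock conversion happens last, as Step 5 requires.
  return {
    baseRounds,
    integrationRounds,
    projectRounds,
    wallclockMinutes: projectRounds * minutesPerRound,
  };
}

const result = estimate([
  { name: "schema", baseRounds: 2, risk: 1.0 },
  { name: "handler", baseRounds: 4, risk: 1.5 },
  { name: "tests", baseRounds: 4, risk: 1.0 },
]);
console.log(result);
```

For these three hypothetical modules the base total is 10 rounds, the risk-adjusted sum is 12, integration adds 1.5, and the project comes to 13.5 rounds, or 40.5 minutes at 3 min/round.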

package/dist/vendor/harness-opencode/dist/skills/paths.ts ADDED

@@ -0,0 +1,18 @@
+import { fileURLToPath } from "node:url";
+import { dirname, join } from "node:path";
+/**
+ * Returns the absolute path to the bundled dist/skills/ directory.
+ *
+ * The plugin ships as ESM (tsup default). `import.meta.url` resolves to the
+ * plugin's own dist/index.js, which lives alongside dist/skills/ in the
+ * tsup-emitted output. Resolving relative to the module URL is both simple
+ * and robust against npm-cache path variance.
+ *
+ * No createRequire / require.resolve needed — verified by Spike 1 against
+ * OpenCode 1.14.19 on macOS.
+ */
+export function getSkillsRoot(): string {
+  const here = dirname(fileURLToPath(import.meta.url));
+  return join(here, "skills");
+}

package/dist/vendor/harness-opencode/dist/skills/pilot-planning/SKILL.md ADDED

@@ -0,0 +1,49 @@
+---
+name: pilot-planning
+description: Methodology for producing a pilot.yaml plan that the pilot-builder agent can execute unattended. Use when the pilot-planner agent receives a feature request — covers task decomposition, verify-command design, scope tightness, DAG shape, and self-review. Auto-loaded by the pilot-planner agent.
+---
+# Pilot Planning Skill
+You are producing a `pilot.yaml` plan: a list of tasks the pilot-builder agent can execute one at a time, fully unattended. The cost of a bad plan is high — the builder will fail tasks confusingly, the cascade-fail will block downstream work, and the human pilot operator has to clean up worktrees and re-plan.
+A good plan trades a planning-session's worth of patient thought for hours of unsupervised builder time. Take the patient thought.
+## Workflow
+Apply these eight rules in order. Each rule has its own file in `rules/` for the full text:
+1. [`first-principles.md`](rules/first-principles.md) — Frame the task FROM the user's intent, not from a templated checklist. Ask "what does the user actually want done?" before "what files might change?"
+2. [`decomposition.md`](rules/decomposition.md) — Break the work into right-sized tasks (10-30 minutes of agent time, ≤3 attempts). Too big = unbounded work; too small = orchestration overhead drowns the value.
+3. [`verify-design.md`](rules/verify-design.md) — Each task's `verify:` commands must succeed iff the task is correctly done. No `echo done`. No `test -f file.ts`. Real assertions only.
+4. [`touches-scope.md`](rules/touches-scope.md) — `touches:` globs must be the tightest set that lets the task succeed. Default to "specific file paths"; `**` is a smell.
+5. [`dag-shape.md`](rules/dag-shape.md) — Tasks depend on each other only when there's a real semantic dependency (B reads what A produces). False dependencies make the run sequential when it could run in parallel; missing dependencies cause subtle race-on-state bugs.
+6. [`milestones.md`](rules/milestones.md) — Optional grouping. Use when several tasks share a "is this batch done?" check (e.g. integration tests after a chunk of unit-test work).
+7. [`self-review.md`](rules/self-review.md) — Before declaring the plan ready, run through a 7-question checklist. Find the holes yourself; the validator only catches schema errors.
+8. [`task-context.md`](rules/task-context.md) — Every non-trivial task carries a `context:` block. Thin plans fail because the builder works each task from scratch with no carry-over; rich context pre-loads what the builder needs to work confidently. Cover outcome, rationale, code pointers, acceptance.
+## After applying the rules
+1. Save the YAML to the path returned by `bunx @glrs-dev/harness-plugin-opencode pilot plan-dir`.
+2. Run `bunx @glrs-dev/harness-plugin-opencode pilot validate <path>` and fix every error / warning.
+3. Hand off to the user with: `Plan saved to <path>. Next: bunx @glrs-dev/harness-plugin-opencode pilot build`.
+Do NOT summarize the plan in chat. The user can read the YAML.
+## When to refuse
+If, after applying the methodology, you cannot produce a plan with at least:
+- 2 tasks
+- Each with non-trivial verify
+- Each with tight `touches`
+- A coherent DAG
+…tell the user the work isn't ready for pilot. Suggest they break it down themselves first, or use the regular `/plan` agent (markdown plans, human-driven execution). It is far better to refuse than to ship a bad plan.
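
The refusal criteria in `pilot-planning/SKILL.md` lend themselves to a mechanical pre-check. The sketch below is a rough heuristic under stated assumptions, not the package's validator: `PlanTask`, `WEAK_VERIFY`, and `refusalReasons` are hypothetical names, "non-trivial verify" is approximated by rejecting the `echo` and `test -f` commands that the skill calls out, and "tight `touches`" is approximated by rejecting globs that begin with `**`. It does not look for cycles.

```typescript
// Hypothetical minimal task shape; the field names are taken from the rule files.
interface PlanTask {
  id: string;
  verify: string[];
  touches: string[];
  depends_on?: string[];
}

// Commands the skill singles out as proving nothing on their own.
const WEAK_VERIFY = [/^echo\b/, /^test -f\b/];

// Returns human-readable reasons to refuse; an empty list means no objection found.
function refusalReasons(tasks: PlanTask[]): string[] {
  const reasons: string[] = [];
  if (tasks.length < 2) reasons.push("fewer than 2 tasks");

  const ids = new Set<string>(tasks.map((t) => t.id));
  for (const t of tasks) {
    const strong = t.verify.filter(
      (cmd) => !WEAK_VERIFY.some((re) => re.test(cmd.trim())),
    );
    if (strong.length === 0) reasons.push(`${t.id}: no non-trivial verify`);

    if (t.touches.some((glob) => glob.startsWith("**"))) {
      reasons.push(`${t.id}: touches is too loose`);
    }

    // A coherent DAG at minimum references only tasks that exist.
    for (const dep of t.depends_on ?? []) {
      if (!ids.has(dep)) reasons.push(`${t.id}: depends on unknown task ${dep}`);
    }
  }
  return reasons;
}
```

A check like this can only flag the obvious cases; whether a verify command really observes the task's effect still needs the planner's judgment.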

package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/dag-shape.md ADDED

@@ -0,0 +1,47 @@
+# Rule 5 — DAG shape
+**Tasks depend on each other only when there's a real semantic dependency.**
+The `depends_on` edges in the plan determine run order. False edges serialize work that could parallelize (v0.3); missing edges let a downstream task run against a state where its prerequisite hasn't committed yet.
+## What a real dependency looks like
+- **Reads code that the dep produces.** T2 imports a function T1 introduced.
+- **Reads schema that the dep modifies.** T2 calls an endpoint T1 added.
+- **Tests behavior the dep implements.** T2's verify runs a test T1's code makes pass.
+## What ISN'T a real dependency
+- "T1 should run first because it's foundational." If T2 doesn't use T1's output, the order doesn't matter for correctness — and forcing it costs you parallelism.
+- "Both touch `src/api/`." Touch overlap is a worktree-pool concern (v0.3), not a logical dependency. Capture it via `touches:` if at all.
+- "I want T1 to be done before I review T2." That's a human-review concern, not a pilot DAG concern. The pilot run completes; you review afterward.
+## Common shapes
+**Linear** — T1 → T2 → T3:
+Each task is the next layer. Use when each layer literally builds on the previous.
+**Diamond** — T1 fans out to T2, T3; both reconverge into T4:
+T1 = "introduce module skeleton"; T2, T3 = "fill in submodule X / Y" (parallelizable on disjoint scopes); T4 = "wire up everything and run integration tests".
+**Disconnected** — Two independent components in the same plan:
+`auth-1`, `auth-2` are one chain; `billing-1`, `billing-2` are another. Use when the plan covers multiple unrelated improvements.
+**Hub-and-spoke** — Many tasks all depend on T1 but not on each other:
+T1 = "add the typed client"; T2-Tn each = "use the typed client in module M". All Tn parallelize.
+## Cycle detection
+The validator catches cycles. If you accidentally write `T1 → T2 → T1`, validate will tell you. Most cycles arise from copy-paste in `depends_on` lists; check yours before saving.
+## Self-loops
+`T1: depends_on: [T1]` is a self-loop, also caught by validate. Always a typo.
+## "I want everything serial"
+Sometimes the right answer IS a fully linear DAG (e.g., a refactor where each step's diff would conflict with the next). Don't be afraid to chain everything if that's the truth — but don't pretend it's the truth when it isn't.
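
The cycle and self-loop checks described in `dag-shape.md` can be illustrated with a depth-first search over `depends_on` edges. This is a minimal sketch of the general technique, not the validator shipped in the package; the `DependsOn` shape and the `findCycle` name are hypothetical.

```typescript
// Hypothetical input shape: task id -> ids it depends on.
type DependsOn = Record<string, string[]>;

// Returns one cycle as a list of ids (first id repeated at the end), or null.
function findCycle(deps: DependsOn): string[] | null {
  const state = new Map<string, "visiting" | "done">();
  const stack: string[] = [];

  const visit = (id: string): string[] | null => {
    if (state.get(id) === "done") return null;
    if (state.get(id) === "visiting") {
      // Back edge: the cycle is the stack from the first visit of `id` onward.
      return [...stack.slice(stack.indexOf(id)), id];
    }
    state.set(id, "visiting");
    stack.push(id);
    for (const dep of deps[id] ?? []) {
      const cycle = visit(dep);
      if (cycle) return cycle;
    }
    stack.pop();
    state.set(id, "done");
    return null;
  };

  for (const id of Object.keys(deps)) {
    const cycle = visit(id);
    if (cycle) return cycle;
  }
  return null;
}

console.log(findCycle({ T1: ["T2"], T2: ["T1"] })); // two-task cycle
console.log(findCycle({ T1: ["T1"] })); // self-loop
console.log(findCycle({ T1: [], T2: ["T1"], T3: ["T1"], T4: ["T2", "T3"] })); // diamond
```

The diamond shape has no cycle, so the last call returns `null`; a self-loop falls out of the same search as a cycle of length one.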

package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/decomposition.md ADDED

@@ -0,0 +1,36 @@
+# Rule 2 — Decomposition
+**Right-sized tasks: 10-30 minutes of agent time, ≤3 attempts to pass verify.**
+A "right-sized" pilot task is one the pilot-builder can complete in a single session within the default `max_turns: 50` budget. Empirically, that's about 10-30 minutes of agent wall time and 1-3 attempts.
+## Sizing heuristics
+**Too big (split it):**
+- The verify command exercises >3 distinct code paths.
+- The task touches >5 files.
+- The prompt has >10 numbered steps.
+- The task says "and also" / "while you're at it" — a sign of conjoined work.
+**Too small (merge it):**
+- The task touches a single file with <30 lines added/changed.
+- The verify command would also pass before the task ran.
+- Splitting added a `depends_on` edge that just moves work around.
+## Splitting patterns
+- **Layer-by-layer**: schema → DB accessors → API → wiring. Each layer has its own tests; each is a task.
+- **Read → Write**: T1 = "add a function that returns the data", T2 = "add an endpoint that calls it". T2 depends on T1.
+- **Skeleton → Detail**: T1 = "introduce the module structure with stubs", T2-Tn = "fill in each stub with logic+tests". The stubs let downstream tasks parallelize.
+## Anti-patterns
+- **Refactor as one task.** "Refactor X" is a feature, not a task. Decompose into `extract Y`, `inline Z`, `rename W`, each with its own verify.
+- **Setup-only tasks.** "Install lodash" is not a pilot task — the next task can install it as part of its own scope. Avoid tasks that don't deliver an observable check.
+- **Cleanup-only tasks.** "Remove dead code". The verify is "tests still pass" — but tests passing was already the contract on the previous task. If there's nothing new to assert, this isn't a task.
+## When you can't decompose
+If the work genuinely doesn't decompose (e.g., a 200-line algorithm that has to land atomically), it might not be a fit for pilot. Tell the user; they may want to run it as a regular `/build` task instead.
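
The numeric thresholds among the sizing heuristics in `decomposition.md` can be written down as a small classifier. This is a sketch under assumptions: `TaskShape`, `Sizing`, and `sizeTask` are hypothetical names, the inputs are measurements a planner would have to estimate by hand, and the wording-based signals ("and also", edges that only move work around) are left out.

```typescript
// Hypothetical measurements a planner might estimate for a draft task.
interface TaskShape {
  filesTouched: number;
  linesChanged: number;
  promptSteps: number;
  verifyCodePaths: number;
  verifyPassesBeforeTask: boolean;
}

type Sizing = "split" | "merge" | "ok";

function sizeTask(t: TaskShape): Sizing {
  // "Too big" signals: >3 code paths in verify, >5 files, >10 numbered steps.
  const tooBig = t.verifyCodePaths > 3 || t.filesTouched > 5 || t.promptSteps > 10;
  if (tooBig) return "split";

  // "Too small" signals: a single file with <30 changed lines, or a verify
  // that would already pass before the task ran.
  const tooSmall =
    (t.filesTouched === 1 && t.linesChanged < 30) || t.verifyPassesBeforeTask;
  return tooSmall ? "merge" : "ok";
}

console.log(
  sizeTask({
    filesTouched: 3,
    linesChanged: 120,
    promptSteps: 4,
    verifyCodePaths: 2,
    verifyPassesBeforeTask: false,
  }),
);
```

Checking the "too big" signals first is a choice made for this sketch; the rule file does not say which side wins when both apply.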

package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/first-principles.md ADDED

@@ -0,0 +1,29 @@
+# Rule 1 — First-principles task framing
+**Frame from intent, not from a template.**
+Bad plans start with a checklist ("read AGENTS.md → write tests → write code → run tests"). Good plans start with the question: *what does the user actually want at the end of this?*
+## What to ask yourself
+1. **What is the working state at the end of the run?** A passing test suite that previously failed? A new endpoint serving real traffic? A refactor with zero behavior change? Different end-states demand different task shapes.
+2. **What can fail?** A task that "adds an import" can't really fail. A task that "implements pagination across three layers" can fail in a hundred ways. The latter needs decomposition.
+3. **What does the verify catch?** If you can't articulate the failure mode each verify command detects, the verify is decoration.
+4. **What is the smallest change that ships?** Pilot is good at small surgical work. If the user wants a wholesale rewrite, pilot is the wrong tool — say so.
+## Talk to the user — once
+Before you spend an hour reading code, take 2 minutes to ask the user 1-3 clarifying questions:
+- Scope (what's in / out of this plan?)
+- Success criteria (how do we know we're done?)
+- Constraints (deps to use, deps to avoid, tests to preserve)
+Do this BEFORE applying rules 2-8. The cheapest mistake to fix is the one you avoid by understanding intent up front.
+## Then read code
+Don't ask the user things you can answer by reading code. Don't ask "what test framework do you use?" — `package.json` says. Don't ask "where does auth live?" — `grep` it. Use the user's time only for things genuinely unknown to the codebase.

package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/milestones.md ADDED

@@ -0,0 +1,57 @@
+# Rule 6 — Milestones (optional)
+**Use milestones to attach extra verify when a logical batch finishes.**
+Milestones are an optional grouping. They serve two purposes:
+1. **Status output** — `pilot status` groups tasks by milestone. Easier to read for big plans.
+2. **Milestone-level verify** — extra verify commands that run when the LAST task in the milestone completes.
+If neither of those is useful, don't add milestones. Plain task lists are simpler.
+## Schema
+```yaml
+milestones:
+  - name: M1
+    description: Foundation
+    verify:
+      - bun run integration-test:foundation
+  - name: M2
+    description: API layer
+    verify:
+      - bun run integration-test:api
+tasks:
+  - id: T1
+    title: schema
+    milestone: M1
+  - id: T2
+    title: db
+    milestone: M1
+  - id: T3
+    title: endpoint
+    milestone: M2
+```
+Each task has an optional `milestone:` label. The label must match a `milestones[].name` (the validator catches typos).
+## When milestone verify fires
+Milestone-level verify runs **after the last task in that milestone completes successfully**. "Last" = last in topological order among tasks with that label. If any task in the milestone fails or gets blocked, the milestone verify does NOT run (the cascade-fail will block downstream work anyway).
+## When to use them
+- **Multi-layer features** where you want an integration test after each layer (schema, API, UI).
+- **Long plans** (8+ tasks) where the user wants visible progress markers.
+- **Mixed-domain plans** where milestones group related work for status readability.
+## When NOT to use them
+- Simple plans (≤5 tasks). Just list the tasks; status output is fine without grouping.
+- Plans where every "milestone" has only one task. Use task verify instead.
+- Plans where the milestone verify is "the same as the last task's verify". Redundant.
+## Don't conflate milestone with dep
+Milestones are a presentation/verify-grouping concept; they do NOT change scheduling. If T3 needs T2 done before it can start, that's a `depends_on: [T2]`, not a `milestone:` label. The DAG and milestones are independent axes.
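
The statement in `milestones.md` that milestone verify fires after the last task "in topological order among tasks with that label" can be made concrete. The sketch below assumes Kahn's algorithm with ties broken by plan order; the package's actual ordering and tie-breaking are not shown in this diff, and `Task`, `topoOrder`, and `milestoneTriggers` are hypothetical names.

```typescript
// Hypothetical task shape; field names follow the schema example.
interface Task {
  id: string;
  milestone?: string;
  depends_on?: string[];
}

// Kahn's algorithm; ties are broken by plan order (an assumption of this sketch).
function topoOrder(tasks: Task[]): string[] {
  const pending = new Map<string, Set<string>>();
  for (const t of tasks) pending.set(t.id, new Set<string>(t.depends_on ?? []));

  const order: string[] = [];
  while (pending.size > 0) {
    // First task in plan order that is still pending and has no unmet dependencies.
    const ready = tasks.find((t) => pending.get(t.id)?.size === 0);
    if (!ready) throw new Error("cycle or unknown dependency");
    order.push(ready.id);
    pending.delete(ready.id);
    pending.forEach((deps) => deps.delete(ready.id));
  }
  return order;
}

// Maps each milestone name to the task whose success would trigger its verify.
function milestoneTriggers(tasks: Task[]): Map<string, string> {
  const labelById = new Map<string, string | undefined>();
  for (const t of tasks) labelById.set(t.id, t.milestone);

  const triggers = new Map<string, string>();
  for (const id of topoOrder(tasks)) {
    const label = labelById.get(id);
    if (label) triggers.set(label, id); // later tasks overwrite earlier ones
  }
  return triggers;
}

const triggers = milestoneTriggers([
  { id: "T1", milestone: "M1" },
  { id: "T2", milestone: "M1", depends_on: ["T1"] },
  { id: "T3", milestone: "M2", depends_on: ["T2"] },
]);
console.log(triggers.get("M1"), triggers.get("M2"));
```

The milestone label plays no part in `topoOrder`, which matches the rule that milestones do not change scheduling. The sketch also leaves out the failure case: per the rule text, a failed or blocked task in the milestone means its verify never runs.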

package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/self-review.md ADDED

@@ -0,0 +1,46 @@
+# Rule 7 — Self-review
+**Before declaring the plan ready, run through this checklist.**
+The validator catches schema, DAG, and glob errors. It cannot catch "this verify is too weak" or "this scope is too loose". You can.
+## The 7 questions
+1. **Is each task right-sized?** Reread each task's prompt. Could the pilot-builder do it in ~20 minutes with the standard `max_turns: 50`? If a task feels like 2 hours of work, split it. If it feels like 2 minutes, merge it.
+2. **Does each verify command HAVE to fail before the task runs?** For each task, mentally check out the pre-task state. Would the verify command fail there? If not, the verify isn't observing the task's effect, so fix it.
+3. **Is each `touches:` glob the tightest fit?** For each task, list the files the agent will need to edit. Are they all matched? Are there ANY paths matched that the agent SHOULDN'T touch? If yes to either, refine.
+4. **Does the DAG match the actual dependencies?** For each `depends_on:` edge, ask: does the dependent task READ code the dep produces, or assume schema the dep modifies? If "no" to both, the edge is false. Drop it.
+5. **Are there missing edges?** Look at every pair of tasks that share files in their `touches:`. Do they need an order? If T2's verify exercises code T1 introduces, T2 depends on T1 — even if their `touches:` don't overlap.
+6. **Can the plan recover from a per-task failure?** If T3 fails, the cascade-fail blocks T4 onward. Is the resulting "failed=T3, blocked=[T4..T7]" state useful for the human operator? Or did you concentrate too much value into T3 such that its failure is catastrophic?
+7. **Could you read this plan in 6 months and understand it?** Plan names + task titles + prompts should be a self-explanatory summary of the work. If the plan needs a verbal preamble to make sense, rewrite the prompts.
+## Run validate
+```
+bunx @glrs-dev/harness-plugin-opencode pilot validate <plan-path>
+```
+Fix every error AND warning. The "warnings" tier (e.g., glob conflicts between tasks) is also yours to action — either decide they're OK and document it, or restructure.
+## When the plan is ready
+When all seven questions are answered "yes" and `pilot validate` exits 0:
+- Save the plan.
+- Tell the user: `Plan saved to <path>. Next: bunx @glrs-dev/harness-plugin-opencode pilot build`.
+- Stop. Don't summarize. Don't editorialize. The user can read the YAML.
+## When the plan is NOT ready
+If any of the seven questions can't be answered "yes" and you don't see a way to fix it within the planning session:
+- Tell the user honestly. "I can't produce a plan that I'd trust the unattended builder to execute, because <specific reason>."
+- Suggest the regular `/plan` agent (markdown plans, human-driven `/build`) or a manual decomposition.
+It is far better to refuse than to ship a bad plan.

package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/task-context.md ADDED

@@ -0,0 +1,47 @@
+# Task context
+Every non-trivial task in a pilot plan carries a `context:` field — a markdown block that preloads the builder agent with the narrative it needs to work confidently without re-discovering the problem from scratch.
+The builder gets a fresh opencode session per task. No carry-over from the planning conversation. No memory of which files the planner inspected. Just: title, touches, verify, context (if present), and the prompt directive. If `context:` is empty, the builder starts from the directive alone — fine for a one-line task ("add a CHANGELOG entry for version 1.2.3"), but painful for anything else.
+## What belongs in context
+- **The user-facing outcome.** In one sentence, what changes from a user's perspective when this task lands? Why should anyone care it got done?
+- **The rationale / why this task exists.** What problem is this task solving? Why is it broken out as a separate task rather than rolled into a sibling? The planner had reasons; write them down.
+- **Code pointers.** The specific files / functions / types the builder should read BEFORE editing. Name them with paths the builder can `read` directly. E.g., "Start by reading `src/pilot/cli/build.ts:resolvePlanPath` (lines 350-370) — the three-step fallback lives there." Saves 3-10 minutes of the builder re-grepping the repo.
+- **Acceptance shorthand.** What "done" looks like from the human's view — a sentence or two that complements the machine-checkable `verify:` list. Verify says "tests pass"; context says "the user can now type `pilot build plan-name` without the full path."
+- **Gotchas / constraints.** Anything the builder would trip over that `prompt:` shouldn't carry as a directive. "The schema is `.strict()` — don't add unknown keys." "Downstream tools parse stdout; keep streaming logs on stderr."
+## What does NOT belong in context
+- **The directive itself.** "Add a function that …" is `prompt:` territory. Keep context for grounding, prompt for the imperative.
+- **Implementation plans.** Don't pre-decide how the builder should write the code. `touches:` constrains the scope; the builder picks the structure within it. If you find yourself writing "first add X, then update Y, then rename Z," either the task is too big (split it) or you're over-specifying (trust the builder).
+- **Copy-pasted architecture diagrams.** If it's longer than ~40 lines, it probably belongs in a doc file the builder can read via `touches`, not inline in the plan.
+- **Tutorials.** The builder already knows how to write TypeScript / run tests / use `edit`. Don't explain the fundamentals; link to the specific non-obvious convention in the repo (AGENTS.md, CLAUDE.md).
+## Length guidance
+- **Trivial task** (one-line prompt, ≤1 file, ≤10 LOC): `context:` is optional; omitting it is fine.
+- **Standard task** (3-5 files, non-trivial logic): one paragraph minimum, 3-5 sentences covering outcome, rationale, and the 2-3 most relevant code pointers.
+- **Complex task** (many files, architectural change): several paragraphs, organized under headers (`### Outcome`, `### Rationale`, `### Code pointers`, `### Acceptance`). If you're writing more than ~60 lines of context, reconsider: is this really one task, or should it be split?
+## Relationship to other fields
+- **`prompt:`** is the directive. It says "do X." Keep it crisp — one to three short paragraphs max. If you're tempted to put narrative in `prompt:`, move it to `context:`.
+- **`verify:`** is the machine contract. Binary, scripted, precise.
+- **`touches:`** is the scope ceiling. Lists every file the builder is allowed to edit.
+- **`context:`** is the human narrative. Read by the builder once at kickoff; helps the builder understand WHICH files inside `touches:` to read first and WHAT the end user will perceive.
+The four work together: `context:` orients, `touches:` bounds, `prompt:` directs, `verify:` confirms.
+## Emission
+The kickoff prompt sent to the builder renders `context:` as a `## Context` section between the scope/verify block and the final `## Task` directive. Reading order: hard rules → allowed scope → verify commands → **context (grounding)** → task (act). The builder reads context right before the directive so the directive is the last, most salient framing when it starts making edits.
+Empty `context:` → no `## Context` section emitted. No penalty for omission on trivial tasks.
+## Anti-pattern: copying the user's original request
+Don't just paste the Linear ticket description or the user's chat message into `context:`. That defeats the point of planning — you're supposed to have DIGESTED the request into task-shaped outcomes, not forwarded it verbatim. If the context reads like the ticket, the planning didn't do its job.
+Good context is specific to *this task*, referencing *this task's* files, *this task's* verify commands, *this task's* narrow success criterion. Plan-wide or epic-wide context belongs at the plan level (the top-of-file `name:` and `branch_prefix:`), not duplicated into every task.
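
The reading order described under "Emission" in `task-context.md` (hard rules, allowed scope, verify commands, context, task) can be sketched as a renderer. Only the `## Context` and `## Task` headings come from the rule text; the other heading names, the `KickoffTask` shape, and `renderKickoff` are assumptions made for illustration and may not match the real kickoff prompt.

```typescript
// Hypothetical task shape; heading names other than `## Context` and
// `## Task` are made up for this sketch.
interface KickoffTask {
  touches: string[];
  verify: string[];
  context?: string;
  prompt: string;
}

function renderKickoff(task: KickoffTask, hardRules: string): string {
  const sections = [
    `## Rules\n${hardRules}`,
    `## Allowed scope\n${task.touches.map((g) => `- ${g}`).join("\n")}`,
    `## Verify\n${task.verify.map((c) => `- ${c}`).join("\n")}`,
  ];
  // An empty or missing context emits no section at all.
  if (task.context && task.context.trim() !== "") {
    sections.push(`## Context\n${task.context.trim()}`);
  }
  // The directive goes last so it is the final framing the builder reads.
  sections.push(`## Task\n${task.prompt}`);
  return sections.join("\n\n");
}

console.log(
  renderKickoff(
    {
      touches: ["src/auth/login.ts"],
      verify: ["bun test test/auth/login.test.ts"],
      context: "Users can now log in with an email alias.",
      prompt: "Accept email aliases in the login handler.",
    },
    "Edit only files in the allowed scope.",
  ),
);
```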

package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/touches-scope.md ADDED

@@ -0,0 +1,47 @@
+# Rule 4 — `touches:` scope tightness
+**Globs must be the tightest set that lets the task succeed. `**` is a smell.**
+The `touches:` list is the agent's leash. After verify passes, the worker computes `git diff --name-only` against the worktree's pre-task SHA; any path NOT matched by `touches:` is a violation and the task fails.
+This catches:
+- Agents that "helpfully" reformat unrelated files.
+- Agents that modify a test in a far-away module to make verify pass.
+- Agents that drift into copilot-style imports of unrelated utils.
+Tight scopes also let v0.3's parallel scheduler safely run two tasks at once — if their touches don't intersect, they can't conflict.
+## Heuristics
+- **One module = one glob.** `src/api/**` and `test/api/**` for an API task. Not `src/**`.
+- **Exact files when you know them.** `src/auth/login.ts` is better than `src/auth/**` if the task is just "edit login.ts".
+- **Test files belong with their source files.** A task that adds source code almost always adds or edits a test. Both go in `touches:`.
+- **Dependency manifests and lock files: rarely.** `package.json` / `bun.lock` / `Cargo.lock` should appear ONLY when the task explicitly says "add a dependency". Don't include them speculatively.
+- **Config files: rarely.** `tsconfig.json`, `.eslintrc`, `package.json` scripts — only if the task is about config.
+## When `**` IS reasonable
+- The task is a global rename / rewrite (across the whole repo).
+- The task is "fix every TODO in the codebase" — touches everything by intent.
+- The task explicitly says "this is a sweeping change".
+In these cases, `**` is fine; the AGENT'S diligence becomes the constraint instead of the touches enforcement.
+## What `touches: []` means
+An empty `touches` list means the task **must NOT edit any files**. Use this for:
+- Verify-only tasks (e.g., "confirm the existing tests still pass after a deps update was made by an upstream task").
+- Probing tasks (e.g., "run benchmarks and report results" — though pilot doesn't yet have a "report results" mechanism, so this is rare).
+If the verify commands would FAIL without edits, an empty `touches` is a STOP — the task is contradictory.
+## Common mistakes
+- **`touches: ["**/*.ts"]`** — too loose. Better: list the actual modules.
+- **Forgetting tests.** Source-only `touches:` makes the task fail when the agent (correctly) edits the test file.
+- **Forgetting docs.** If the task explicitly says "update README", README must be in `touches:`.
+- **Including the migrations dir for a non-migration task.** Tight scope.
+When in doubt, write the tightest possible scope first. If the task fails with "touches violation: src/X.ts", the worker shows you which file got touched; broaden the scope then.
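
The enforcement step described in `touches-scope.md` (compare the changed paths from `git diff --name-only` against the `touches:` globs) can be illustrated as follows. This is a simplified sketch, not the worker's implementation: `globToRegExp` and `touchesViolations` are hypothetical names, and the matcher understands only `*` and `**`, whereas the package may use a full glob library with different semantics.

```typescript
// Simplified glob matcher (assumption): `**` crosses directory boundaries,
// `*` stays within one path segment, everything else is literal.
function globToRegExp(glob: string): RegExp {
  let out = "";
  for (let i = 0; i < glob.length; i++) {
    const ch = glob.charAt(i);
    if (ch === "*" && glob.charAt(i + 1) === "*") {
      if (glob.charAt(i + 2) === "/") {
        out += "(?:.*/)?"; // `**/` matches zero or more directories
        i += 2;
      } else {
        out += ".*";
        i += 1;
      }
    } else if (ch === "*") {
      out += "[^/]*";
    } else {
      out += ch.replace(/[.+?^${}()|[\]\\]/g, "\\$&"); // escape regex metacharacters
    }
  }
  return new RegExp(`^${out}$`);
}

// `changed` stands in for the output of `git diff --name-only`.
function touchesViolations(changed: string[], touches: string[]): string[] {
  const patterns = touches.map((glob) => globToRegExp(glob));
  return changed.filter((path) => !patterns.some((re) => re.test(path)));
}

console.log(
  touchesViolations(
    ["src/api/users.ts", "test/api/users.test.ts", "src/web/app.ts"],
    ["src/api/**", "test/api/**"],
  ),
);
```

With an empty `touches` list no pattern can match, so every changed path is reported, which lines up with the rule that `touches: []` means the task must not edit any files.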

package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/verify-design.md ADDED

@@ -0,0 +1,53 @@
+# Rule 3 — Verify-command design
+**Each task's `verify:` commands must succeed iff the task is correctly done.**
+The verify list is the contract between the planner and the builder. It is the ONLY signal pilot uses to decide "did this task work?". A weak verify means you're shipping work the run thinks is fine but really isn't.
+## What a good verify looks like
+- `bun test test/api.test.ts` (assertion)
+- `bun run typecheck` (semantic check, catches real failures)
+- `bun run lint` (style, but only when style is the work)
+- `node scripts/check-schema.ts` (your own probe — write it as part of the task)
+- `curl -fsS http://localhost:3000/health | jq -e .ok` (integration probe)
+## What's not OK
+- `echo done` — proves nothing
+- `test -f src/foo.ts` — file existence is necessary but rarely sufficient
+- `bun run build` ALONE — build success without tests means "TypeScript was happy"; insufficient for behavior tasks
+- `grep -q 'newFunction' src/file.ts` — proves text presence, not behavior
+- `git diff --name-only | grep src/api` — proves edits happened, not that they're correct
+## Two-tier verify
+Use BOTH a per-task verify and `defaults.verify_after_each`:
+```yaml
+defaults:
+  verify_after_each:
+    - bun run typecheck     # always must pass
+tasks:
+  - id: T1
+    verify:
+      - bun test test/api/specific.test.ts   # task-specific
+```
+`verify_after_each` catches global breakage (a syntax error in a file the task didn't even touch); per-task verify catches task-specific behavior.
+## Touches and verify must agree
+If the task has `touches: src/api/**` but the verify command runs `bun test test/web/`, you almost certainly have a wrong scope. The verify that would actually catch task failure must exercise files in the touched scope.
+## Verify must be deterministic
+- No `sleep` to wait for a service that may not start in CI.
+- No `docker run` unless the task is explicitly about containers.
+- No external network calls that could flake — mock or skip.
+If a verify command flakes, three retries will exhaust attempts and the task fails for environmental reasons. Pilot has no way to distinguish "real failure" from "flake".
+## Always include a "before" check
+For non-trivial tasks, write a verify that would HAVE FAILED before the task ran. This makes the task's value observable. If the verify passed before AND passes after, the task didn't actually move the system.
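
The two-tier verify and the "before" check in `verify-design.md` can be sketched with an injected command runner. Everything here is illustrative: `Run`, `VerifyInput`, `runVerify`, and `observesTask` are hypothetical names, and running `verify_after_each` ahead of the per-task commands is an assumption, since the rule file does not state the order.

```typescript
// Hypothetical shapes; `Run` stands in for whatever executes a shell command.
type Run = (command: string) => boolean; // true when the command exits 0

interface VerifyInput {
  taskVerify: string[];
  verifyAfterEach: string[];
}

// Runs global checks, then task checks, stopping at the first failure.
// The relative order of the two tiers is an assumption of this sketch.
function runVerify(input: VerifyInput, run: Run): { ok: boolean; failed?: string } {
  for (const command of [...input.verifyAfterEach, ...input.taskVerify]) {
    if (!run(command)) return { ok: false, failed: command };
  }
  return { ok: true };
}

// "Before" check: the per-task verify should fail on the pre-task tree,
// and the full verify should pass on the post-task tree.
function observesTask(input: VerifyInput, runBefore: Run, runAfter: Run): boolean {
  const taskOnly: VerifyInput = { taskVerify: input.taskVerify, verifyAfterEach: [] };
  return !runVerify(taskOnly, runBefore).ok && runVerify(input, runAfter).ok;
}

const input: VerifyInput = {
  verifyAfterEach: ["bun run typecheck"],
  taskVerify: ["bun test test/api/specific.test.ts"],
};
// Fake runners: before the task only the typecheck passes; afterwards everything does.
const before: Run = (command) => command === "bun run typecheck";
const after: Run = () => true;
console.log(observesTask(input, before, after));
```

A verify made only of commands that already pass on the pre-task tree makes `observesTask` return `false`, which is the "task didn't actually move the system" case.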