spec-and-loop 3.0.2 → 3.1.0

@@ -1,564 +1,160 @@
1
- # Writing Ralph-Friendly OpenSpec Tasks
1
+ # Ralph-Friendly Task Authoring Rules
2
2
 
3
- An actionable guide to shaping tasks so a Ralph Wiggum–style loop (fresh-session, re-reads prompt and repo state each iteration, uses objective backpressure) can make steady progress without getting stuck on ambiguity, hidden policy, or missing verification.
3
+ You are writing `tasks.md` for an OpenSpec change that will be executed by `ralph-run` in a fresh-session loop. Every iteration re-reads this file plus proposal.md, design.md, and specs. The loop implements one task per iteration, runs verification, and marks progress only on success.
4
4
 
5
- This guide is organized so you can act first and read rationale second:
5
+ ## Task template
6
6
 
7
- 1. [Quick reference: the task template](#quick-reference-the-task-template)
8
- 2. [Quick reference: how to order tasks](#quick-reference-how-to-order-tasks)
9
- 3. [Quick reference: authoring recipe](#quick-reference-authoring-recipe)
10
- 4. [Task sizing and splitting](#task-sizing-and-splitting)
11
- 5. [Worked examples: good vs. bad tasks](#worked-examples-good-vs-bad-tasks)
12
- 6. [Quality gates and baselines](#quality-gates-and-baselines)
13
- 7. [Human handoffs and operator-only work](#human-handoffs-and-operator-only-work)
14
- 8. [The surrounding artifact package](#the-surrounding-artifact-package)
15
- 9. [`prd.json` specifics](#prdjson-specifics)
16
- 10. [Authoring checklist](#authoring-checklist)
17
- 11. [Background and rationale](#background-and-rationale)
18
- 12. [Source notes](#source-notes)
19
-
20
- ---
21
-
22
- ## Quick reference: the task template
23
-
24
- Every Ralph-friendly task checkbox in `tasks.md` should fit this shape:
7
+ Every `- [ ]` checkbox must follow this shape:
25
8
 
26
9
  ```markdown
27
10
  - [ ] **<short imperative outcome>**
28
- - Scope: <1 subsystem or tightly related file cluster; name the primary files>
29
- - Change: <what behavior, data, or contract becomes true after this task>
11
+ - Scope: <1 subsystem or file cluster; name the primary files>
12
+ - Change: <what becomes true after this task>
30
13
  - Done when:
31
- - <observable change 1, tied to code/data/doc>
32
- - <verifier command or selector with expected result>
33
- - <optional second verifier if it is in the same cluster>
14
+ - <observable change tied to code/data/doc>
15
+ - <verifier command with expected result>
34
16
  - Stop and hand off if:
35
- - <concrete blocker condition, e.g. "a required-clean gate regresses">
36
- - <concrete ambiguity condition, e.g. "spec disagrees with design">
17
+ - <concrete blocker or ambiguity condition>
37
18
  ```
38
19
 
39
- Rules this template enforces:
40
-
41
- - **One dominant outcome.** The bold title is a single behavior slice, not a list.
42
- - **One file cluster.** Scope names the area so the loop does not go hunting.
43
- - **Objective "done".** Every `Done when` bullet is either an observable artifact change or a runnable check with a named expected result. No soft verbs (`ensure`, `validate`, `support`, `keep`) without an attached observable.
44
- - **Explicit stop conditions.** The loop has written permission to halt, so it does not improvise.
45
-
46
- A one-line prose version for the title and change, useful when drafting:
47
-
48
- > Change X behavior in Y area so that Z becomes true. Verify by running C and confirming D. Stop and hand off if E.
49
-
50
- If you need "and" twice in that first sentence, you are probably hiding a split point.
51
-
52
- ---
53
-
54
- ## Quick reference: how to order tasks
55
-
56
- When in doubt, arrange tasks from shared contract outward to user-facing surfaces:
57
-
58
- 1. **Pre-flight baseline.** Run every quality gate a later task requires to pass. Record the output. This is the only way later iterations can distinguish regressions from pre-existing failures. See [quality gates](#quality-gates-and-baselines).
59
- 2. **Freeze shared contracts and prerequisites.** Types, interfaces, registration points, ownership boundaries.
60
- 3. **Freeze typed data, config, schemas.** Centralize the data a later surface task would otherwise have to rediscover.
61
- 4. **Implement one user-facing surface at a time.** One route, one component family, or one workflow per task.
62
- 5. **Wire shared emitters and cross-links.** Navigation, shared shells, cross-references, anything that spans surfaces.
63
- 6. **Run final integrated quality gates.** A dedicated task that runs the full suite and is allowed to be a hard stop.
64
-
65
- Why this ordering: early tasks reduce ambiguity for later ones. A route task that can trust a frozen typed-data contract is dramatically smaller and safer than one that must invent the contract while rendering the page.
66
-
67
- Do **not** make the agent infer the dependency graph. List the checkboxes in `tasks.md` in the order they should run, and if two tasks are independent, say so explicitly.
68
-
69
- ---
70
-
71
- ## Quick reference: authoring recipe
72
-
73
- 1. **Draft behavior slices, not file edits.** "Freeze X contract," "centralize Y data," "implement Z surface." Never "edit `file-a`."
74
- 2. **Size against the smallest model you expect to run the loop.** If you expect a smaller model or heavy context reload, bias one size smaller than your strongest-model plan. See [sizing profiles](#two-sizing-profiles).
75
- 3. **For each candidate task, count `V`, `S`, `C`, `P`:**
76
- - `V` = independent verification clusters
77
- - `S` = independent subsystems or file clusters
78
- - `C` = clean stopping points that would leave the repo reviewable
79
- - `P` = unresolved policy or design questions
80
- 4. **If `P > 0`, stop.** Fix the design or spec first. Do not encode unresolved policy as a task.
81
- 5. **If `V`, `S`, or `C` is meaningfully `> 1`, split.** Default lightweight split target: `recommended subtasks = max(V, C)`, `+1` if the task mixes foundation with feature work, `+1` if it mixes a user-facing surface with shared cross-surface wiring.
82
- 6. **Stop splitting before tasks become file chores.** A child task that has no standalone verifier or clean stopping point has been split too far.
83
- 7. **Apply the [task template](#quick-reference-the-task-template) to each final checkbox.**
84
- 8. **Order the final list** using the [contract-to-surface pattern](#quick-reference-how-to-order-tasks), with a pre-flight baseline task at the top if any later task requires a gate to be clean.
85
-
86
- One-line rule of thumb:
87
-
88
- **Keep splitting until each checkbox has one dominant verifier and one clean stop point, then stop.**
89
-
90
- ---
91
-
92
- ## Task sizing and splitting
93
-
94
- Task size is **harness-relative**. The correct unit is the largest coherent slice that still fits the actual context budget of the loop you plan to run.
20
+ Enforced rules:
21
+ - Title is one outcome, not a list. If you need "and" twice, split.
22
+ - Scope names files so the loop does not hunt.
23
+ - `Done when` bullets are observable or runnable. No soft verbs (`ensure`, `support`, `validate`, `keep`) without attached evidence.
24
+ - `Stop and hand off if` gives the loop written permission to halt.
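The enforced rules above lend themselves to a mechanical pre-check. The following is a minimal sketch, not part of `ralph-run`: the function name `lint_task` is hypothetical, and the "attached evidence" test (a soft verb is acceptable only if the same line contains an inline code span) is an assumed heuristic rather than a rule stated by the guide.

```python
import re

# Soft verbs named by the guide; they need evidence attached to be acceptable.
SOFT_VERBS = ("ensure", "support", "validate", "keep")
# Field labels taken from the task template.
REQUIRED_FIELDS = ("Scope:", "Change:", "Done when:", "Stop and hand off if:")


def lint_task(block: str) -> list[str]:
    """Return a list of problems found in one task checkbox block."""
    problems = []
    lines = block.strip().splitlines()
    title = lines[0] if lines else ""

    if not re.match(r"^- \[ \] \*\*.+\*\*\s*$", title):
        problems.append("title is not a bold `- [ ]` checkbox")

    # Two or more "and" in the title suggest a hidden split point.
    if len(re.findall(r"\band\b", title, flags=re.IGNORECASE)) >= 2:
        problems.append("title uses 'and' twice; consider splitting")

    for field in REQUIRED_FIELDS:
        if not any(line.strip().startswith(f"- {field}") for line in lines):
            problems.append(f"missing field: {field}")

    # Assumed heuristic: a soft verb counts as backed by evidence only when
    # the same line also contains an inline code span (a backtick).
    for line in lines[1:]:
        words = re.findall(r"[a-z]+", line.lower())
        if any(verb in words for verb in SOFT_VERBS) and "`" not in line:
            problems.append(f"soft verb without evidence: {line.strip()}")

    return problems
```

Running a check like this over each checkbox before starting the loop would catch missing fields early; it cannot judge whether a verifier is meaningful, so it complements review rather than replacing it.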
95
25
 
96
- - Too broad: agent makes hidden design decisions and thrashes.
97
- - Too granular: loop wastes iterations on bookkeeping and repeated context reload.
98
- - Too context-heavy: task is semantically valid, but the model spends the iteration reloading unrelated facts.
99
- - Right: one coherent behavior slice, one main risk, one main verification cluster, one honest stopping point.
26
+ ## Ordering
100
27
 
101
- ### Signs a task is too large: split it
28
+ 1. Pre-flight baseline (if any later task needs a clean gate)
29
+ 2. Freeze shared contracts (types, interfaces, boundaries)
30
+ 3. Freeze typed data, config, schemas
31
+ 4. One user-facing surface per task
32
+ 5. Wire shared emitters and cross-links
33
+ 6. Final integrated quality gates (hard stop allowed)
102
34
 
103
- - Spans more than one independent file cluster or subsystem.
104
- - Has more than one independent verification cluster.
105
- - Mixes foundation work and feature work in one checkbox.
106
- - Contains more than one obvious mergeable stopping point.
107
- - Requires the agent to hold many unrelated policy rules in working memory.
108
- - Likely to trigger broad search, broad refactor, and broad validation in the same iteration.
35
+ Do not make the agent infer dependencies. Order checkboxes in execution order. Mark independent tasks explicitly.
109
36
 
110
- ### Signs a task is too small — merge it
37
+ ## Sizing and splitting
111
38
 
112
- - Touches the same small file cluster as its neighbor.
113
- - Shares the same main verification command as its neighbor.
114
- - The first half does not produce a meaningful checkpoint on its own.
115
- - Splitting only creates bookkeeping churn.
116
- - Exists only so the next checkbox can finish it.
117
- - Its only proof is "the next task worked."
39
+ For each candidate task, count:
40
+ - V = independent verification clusters
41
+ - S = independent subsystems or file clusters
42
+ - C = clean stopping points (repo stays reviewable)
43
+ - P = unresolved policy questions
118
44
 
119
- ### Two sizing profiles
120
-
121
- Pick one explicitly before you start writing tasks.
122
-
123
- **Medium profile** — use when the loop reloads the full artifact pack cleanly, the model is strong, and the repo area is familiar:
124
-
125
- - one dominant outcome, one dominant risk
126
- - one main code or data surface
127
- - one main verification cluster
128
- - roughly 2–5 primary files in one area
129
- - 1–2 focused verification commands or selectors
130
- - 3–7 `Done when` bullets
131
-
132
- **Lightweight profile** — use when you expect smaller models, heavy per-iteration context reload, broad/unfamiliar repo area, or want more loop checkpoints:
133
-
134
- - one dominant outcome, one dominant verification cluster
135
- - one subsystem or tightly related file cluster
136
- - one clean stopping point that would still be reviewable if the loop stopped there
137
- - roughly 1–3 primary files (sometimes 4 if same route or subsystem)
138
- - ideally 1 focused verification command or selector (occasionally 2)
139
- - 2–5 `Done when` bullets
140
-
141
- ### Split / merge test
142
-
143
- Before finalizing a checkbox, ask:
144
-
145
- 1. If the loop stopped halfway through this item, would the repo be in a clean, reviewable state?
146
- 2. Would I know exactly which verification command proves this half is done?
147
- 3. Does this checkbox have one dominant risk, or several unrelated ones?
45
+ Rules:
46
+ - P > 0 → stop. Fix design.md first.
47
+ - V, S, or C > 1 → split. Target subtasks = max(V, C).
48
+ - Stop splitting when a child has no standalone verifier.
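The counting rules above can be written as a small decision function. This is an illustrative sketch only: the names `Counts` and `split_decision` are hypothetical, and the formula `max(V, C)` is applied literally as written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    v: int  # independent verification clusters
    s: int  # independent subsystems or file clusters
    c: int  # clean stopping points
    p: int  # unresolved policy questions


def split_decision(counts: Counts) -> tuple[str, int]:
    """Return (action, target number of subtasks) for one candidate task."""
    if counts.p > 0:
        # Unresolved policy: fix design.md before writing any task.
        return ("stop", 0)
    if max(counts.v, counts.s, counts.c) > 1:
        # Literal reading of the rule. When only S exceeds 1 this yields 1,
        # so the author still has to choose the split point by hand.
        return ("split", max(counts.v, counts.c))
    return ("keep", 1)
```

For example, a candidate with three verification clusters and two clean stopping points gives `split_decision(Counts(v=3, s=1, c=2, p=0))`, which returns `("split", 3)`. Each resulting child still needs its own standalone verifier, which the function does not check.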
148
49
 
149
- Interpretation:
50
+ **Medium profile** (strong model, familiar repo): 1 outcome, 2–5 files, 1–2 verifiers, 3–7 `Done when` bullets.
51
+ **Lightweight profile** (smaller model, unfamiliar repo): 1 outcome, 1–3 files, 1 verifier, 2–5 `Done when` bullets.
150
52
 
151
- - If answers 1 and 2 are "yes," there is a valid split point.
152
- - If answer 3 is "several," the task is probably too large.
153
- - If none of the halves would be meaningful on their own, the task is already about right.
53
+ Split test: if the loop stopped halfway, would the repo be clean and reviewable? If yes and there's a verifier for each half, split. If no half is meaningful alone, don't split.
154
54
 
155
- ### Can you triple the task count?
55
+ ## Quality gates
156
56
 
157
- Yes, sometimes, when the extra checkboxes come from real checkpoints rather than mechanical crumbs.
57
+ - A failing `Done when` check means the task is NOT done. No rationalization.
58
+ - "Pre-existing" requires a before-baseline. Without one, any failure could be a regression.
59
+ - First task in a chain that needs clean gates must be a pre-flight baseline that records gate output.
60
+ - Explicitly distinguish known-broken validators (document and continue) from required-clean validators (hard stop). If only one is named, the loop generalizes permissively.
158
61
 
159
- Good reasons the count grows: a previous checkbox actually contained 2–3 independent verification clusters; a checkbox mixed contract freezing with feature work; a smaller model means medium tasks are no longer comfortable in one session; you want explicit clean stop points for long-running loops.
62
+ Pre-flight template:
63
+ ```markdown
64
+ - [ ] **Pre-flight: record quality gate baselines**
65
+ - Scope: no code edits
66
+ - Change: Capture current state of all gates later tasks require.
67
+ - Done when:
68
+ - `.ralph/baselines/<change>-<gate>.txt` exists for each gate with full output
69
+ - `.ralph/baselines/<change>-readme.md` lists passing/failing gates and exact failing identifiers
70
+ - Stop and hand off if: any gate is nondeterministic across two runs.
71
+ ```
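One way a pre-flight task like the template above could capture its baselines is sketched below. The helper names (`run_gate`, `record_baselines`) and the choice to compare only exit codes across the two runs are assumptions made for illustration. Full output often contains timings that differ between runs, and comparing failing identifiers would need gate-specific parsing that is not shown here.

```python
import subprocess
from pathlib import Path


def run_gate(command: list[str]) -> tuple[int, str]:
    """Run one gate; return its exit code and combined stdout/stderr."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr


def record_baselines(
    change: str, gates: dict[str, list[str]], out_dir: Path
) -> dict[str, bool]:
    """Write <change>-<gate>.txt for each gate.

    Returns a mapping of gate name to True when two consecutive runs
    produced the same exit code (the assumed stability check).
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    stable = {}
    for name, command in gates.items():
        first_code, first_output = run_gate(command)
        second_code, _ = run_gate(command)
        baseline = out_dir / f"{change}-{name}.txt"
        baseline.write_text(f"exit code: {first_code}\n{first_output}")
        stable[name] = first_code == second_code
    return stable
```

A call such as `record_baselines("my-change", {"tsc": ["npx", "tsc", "--noEmit"], "test": ["npm", "test"]}, Path(".ralph/baselines"))` would write one file per gate; the change name and gate commands here are placeholders. Any gate whose entry comes back `False` matches the template's stop condition.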
160
72
 
161
- Bad reasons: turning one coherent change into a sequence of file chores; splitting edits that share the same verifier and stopping point; separate tasks for imports, renames, or tiny mechanical follow-through; doc-only subtasks that do not freeze an independent contract.
73
+ ## Anti-patterns (do not do these)
162
74
 
163
- ---
75
+ - Soft verbs without observables (`ensure X`, `support Y`, `validate Z`)
76
+ - Unresolved policy as tasks ("decide whether X or Y")
77
+ - Mixing implementation + rollout + manual validation in one checkbox
78
+ - File chores (separate tasks for imports, renames, tiny follow-through)
79
+ - Tasks whose only proof is "the next task worked"
80
+ - A `Done when` that checks only unit tests when the real behavior is end-to-end
81
+ - Visual verification bundled into the same task as code changes instead of split out (context overflow risk)
82
+ - "Maybe this, maybe that" wording in tasks or specs once the loop starts
164
83
 
165
- ## Worked examples: good vs. bad tasks
84
+ ## Examples
166
85
 
167
- ### Example 1: too large vs. split
86
+ **Bad** (vague, no verifier):
87
+ ```markdown
88
+ - [ ] Ensure support for tenant-scoped promotion
89
+ ```
168
90
 
169
- **Too large** — mixes three independent contracts in one checkbox:
91
+ **Good** — outcome, verifier, stop condition:
92
+ ```markdown
93
+ - [ ] **Refuse promotion when staged tenant has missing required rows**
94
+ - Scope: `src/ingestion/promote.py`, `tests/unit/test_promote.py`
95
+ - Change: `promote` exits non-zero and leaves active version unchanged when required rows are missing.
96
+ - Done when:
97
+ - New test `test_promote_refuses_missing_rows` passes
98
+ - `pytest tests/unit/test_promote.py` exits 0
99
+ - Stop and hand off if: "required rows" not defined in design.md.
100
+ ```
170
101
 
102
+ **Bad** — too large, three contracts in one:
171
103
  ```markdown
172
104
  - [ ] Freeze the bootstrap contract in code, tests, and docs
173
105
  ```
174
106
 
175
- **Better** — three tasks, each with its own verifier and stop point:
176
-
107
+ **Good** — split into one task per contract:
177
108
  ```markdown
178
109
  - [ ] **Freeze Atmosphere CSS ownership**
179
110
  - Scope: `src/styles/atmosphere/*`, `tailwind.config.*`
180
- - Change: Atmosphere is the sole owner of the listed tokens; Harbor no longer redefines them.
111
+ - Change: Atmosphere is sole owner of listed tokens; Harbor no longer redefines them.
181
112
  - Done when:
182
113
  - `rg "atm-color-" src/styles/harbor` returns no matches
183
114
  - `npx tsc --noEmit` exits 0
184
- - Stop and hand off if: a required token is owned by both systems and the design does not say which wins.
115
+ - Stop and hand off if: a token is owned by both systems and the design does not say which wins.
185
116
 
186
117
  - [ ] **Freeze Harbor registration and TSX integration**
187
118
  - Scope: `src/components/harbor-bootstrap.tsx`, `src/types/harbor.d.ts`
188
- - Change: Harbor components are registered once at boot and typed for TSX usage.
119
+ - Change: Harbor components registered once at boot, typed for TSX.
189
120
  - Done when:
190
121
  - `rg "registerHarbor" src` returns exactly one call site
191
122
  - `npm test -- harbor-bootstrap` passes
192
- - Stop and hand off if: more than one registration site is required by a consumer.
193
-
194
- - [ ] **Freeze contributor docs for the chosen authoring model**
195
- - Scope: `docs/contributing/components.md`
196
- - Change: Docs describe the frozen Atmosphere+Harbor model and link to the two tasks above.
197
- - Done when:
198
- - Doc file exists and lists both ownership rules
199
- - `npm run lint:docs` exits 0
200
- - Stop and hand off if: docs would need to describe a policy not yet settled in `design.md`.
123
+ - Stop and hand off if: more than one registration site is required.
201
124
  ```
202
125
 
203
- ### Example 2: too small vs. merged
204
-
205
- **Too small** — two chores with no standalone checkpoint:
206
-
126
+ **Bad** (too small, file chores):
207
127
  ```markdown
208
128
  - [ ] Add `import { formatDate } from './date'` to `ReleaseCard.tsx`
209
129
  - [ ] Use `formatDate` in the `ReleaseCard` publish timestamp
210
130
  ```
211
131
 
212
- **Better** — one task with a real outcome and verifier:
213
-
132
+ **Good** — merged into one coherent outcome:
214
133
  ```markdown
215
- - [ ] **Format ReleaseCard publish timestamp via the shared `formatDate` helper**
134
+ - [ ] **Format ReleaseCard timestamp via shared `formatDate` helper**
216
135
  - Scope: `src/components/ReleaseCard.tsx`
217
- - Change: ReleaseCard renders timestamps through the shared helper instead of inline formatting.
136
+ - Change: ReleaseCard renders timestamps through the shared helper.
218
137
  - Done when:
219
138
  - `rg "toLocaleDateString" src/components/ReleaseCard.tsx` returns no matches
220
139
  - `npm test -- ReleaseCard` passes
221
- - Stop and hand off if: `formatDate` does not cover a required locale used by fixtures.
222
- ```
223
-
224
- ### Example 3: soft verbs vs. observable "done"
225
-
226
- **Bad** — vague verbs, no verifier:
227
-
228
- ```markdown
229
- - [ ] Ensure support for tenant-scoped promotion
230
- ```
231
-
232
- **Better** — outcome, verifier, and stop condition are explicit:
233
-
234
- ```markdown
235
- - [ ] **Refuse promotion when a staged tenant version has missing required rows**
236
- - Scope: `src/ingestion/promote.py`, `tests/unit/test_promote.py`
237
- - Change: `promote` exits non-zero and leaves the active version unchanged when required staged rows are missing.
238
- - Done when:
239
- - New test `test_promote_refuses_missing_rows` passes
240
- - `pytest tests/unit/test_promote.py` exits 0
241
- - Active version in the fixture DB is unchanged after the failed promote
242
- - Stop and hand off if: the set of "required rows" is not defined in `design.md`.
243
- ```
244
-
245
- ---
246
-
247
- ## Quality gates and baselines
248
-
249
- The single most common quality failure in observed Ralph runs: a task has a `Done when: npm test exits 0` bullet, the test fails, and the loop marks the task complete anyway with a rationalization note explaining why the failure is "unrelated" or "pre-existing."
250
-
251
- The rule:
252
-
253
- **If a `Done when` check fails, the task is not done. The loop must stop and report the failure rather than reclassify the gate as inapplicable.**
254
-
255
- The only legitimate exception is a check that the task itself explicitly categorizes as a known pre-existing failure — and that categorization must be made by the author at task-writing time, not by the agent at execution time.
256
-
257
- ### "Pre-existing" requires a before-baseline
258
-
259
- "Pre-existing and unrelated" is only valid if there is documented evidence that the check was already failing *before this task's code changes ran*. Without a before-run, any test failure could equally be a regression introduced by this task.
260
-
261
- A common failure mode:
262
-
263
- 1. Task A adds code. Its `Done when` list only says `npx tsc --noEmit exits 0`. The loop runs `tsc`, notes it is blocked by pre-existing errors, documents that, and moves on. `npm test` is never run.
264
- 2. Task B's `Done when` list says `npm test exits 0`. The loop runs `npm test` for the first time. It fails.
265
- 3. The loop classifies the failure as "pre-existing" because it "looks unrelated," even though no one verified the test suite was clean before Task A.
266
-
267
- ### Always include a pre-flight baseline task
268
-
269
- When a chain of tasks collectively produces a feature and any task in the chain requires a quality gate to be clean, the first task in the chain should be a dedicated pre-flight that:
270
-
271
- - Runs every gate later tasks require.
272
- - Records the exact outputs (file paths, exit codes, failing test names).
273
- - Documents any gates already failing and why.
274
- - Names the baseline file so later tasks can reference it when distinguishing regressions.
275
-
276
- Template:
277
-
278
- ```markdown
279
- - [ ] **Pre-flight: record quality gate baselines for this change**
280
- - Scope: no code edits
281
- - Change: Capture the current state of all gates later tasks require.
282
- - Done when:
283
- - `.ralph/baselines/<change>-tsc.txt` exists with full `npx tsc --noEmit` output
284
- - `.ralph/baselines/<change>-test.txt` exists with full `npm test` output
285
- - `.ralph/baselines/<change>-readme.md` lists which gates passed, which failed, and for failing gates the exact failing identifiers
286
- - Stop and hand off if: any gate behavior is nondeterministic across two consecutive runs (flaky baseline is not a baseline).
287
- ```
288
-
289
- ### Name known-broken vs. required-clean validators
290
-
291
- A subtler failure: the loop correctly learns to document and continue when `npx tsc --noEmit` fails due to known pre-existing repo errors. Then it applies the same permissive reasoning to `npm test`, which was supposed to be clean.
292
-
293
- Loop instructions (prompt or wrapper) must explicitly list both categories:
294
-
295
- - **Known-broken validators**: named by command or pattern. Loop may document failures and continue **only for these**.
296
- - **Required-clean validators**: named explicitly. Failures are hard blockers with no exception, regardless of how the failure looks.
297
-
298
- If only one list is named, the loop will generalize permissive behavior across all gates.
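The two validator categories can be made explicit in whatever wrapper drives the loop. The sketch below is an assumption about how such a wrapper might encode them; `Verdict` and `classify_gate_failure` are hypothetical names. It follows the rule that continuing is allowed only for named known-broken validators, so an unlisted gate defaults to a hard stop. Letting the required-clean list win when a gate appears in both lists is an added assumption.

```python
from enum import Enum


class Verdict(Enum):
    DOCUMENT_AND_CONTINUE = "document and continue"
    HARD_STOP = "hard stop"


def classify_gate_failure(
    gate: str, known_broken: set[str], required_clean: set[str]
) -> Verdict:
    """Decide what a failing gate means for the current task."""
    if gate in required_clean:
        # Required-clean gates are hard blockers, however the failure looks.
        return Verdict.HARD_STOP
    if gate in known_broken:
        # Only explicitly named known-broken gates may be documented and skipped.
        return Verdict.DOCUMENT_AND_CONTINUE
    # Unlisted gates get no permissive treatment.
    return Verdict.HARD_STOP
```

With `known_broken={"npx tsc --noEmit"}` and `required_clean={"npm test"}`, a failing `npm test` returns `Verdict.HARD_STOP`, which blocks the generalization described above.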
299
-
300
- ### Screenshots and large binary assets
301
-
302
- Visual verification tasks that use Chrome DevTools MCP or Figma MCP screenshots can exceed provider context limits when the images are read back into working context. In observed runs this caused `failed to parse request` errors mid-iteration, forcing compaction and restart from a summary. Compaction discards intermediate state and can cause the loop to misclassify prior task completion.
303
-
304
- Rules for visual tasks:
305
-
306
- - Save screenshots to a repo-local file path; record the path in task notes.
307
- - Do not re-read screenshot files into loop context after capture.
308
- - For Chrome DevTools screenshots, use `filePath` to save and only record the path.
309
- - Scope Figma MCP calls narrowly: use `excludeScreenshot: true` for structural inspection.
310
- - If a task requires both a code change and a visual verification, split them: one task for the code change (verified by `tsc`/`npm test`), one for the visual check (verified by screenshot path capture and manual or scripted comparison).
311
-
312
- ---
313
-
314
- ## Human handoffs and operator-only work
315
-
316
- Stage rollout validation, production-only checks, approvals, privileged access — these are not autonomous loop tasks.
317
-
318
- Rules:
319
-
320
- - Keep them documented in the artifact pack.
321
- - Put them in a dedicated `Human Handoff` or `Operator Handoff` section of `tasks.md` or in `proposal.md`.
322
- - Keep them **outside the checkbox path the loop consumes**. If your loop reads `- [ ]` items, operator items must not use that marker, or must live under a heading the loop instructions tell the agent to skip.
323
- - Do not rely on "we discussed this in chat." The handoff must live in a durable file.
324
-
325
- If the loop cannot honestly complete an item without a human or a protected environment, it should not be a normal task.
326
-
327
- ---
328
-
329
- ## The surrounding artifact package
330
-
331
- A task list does not stand alone. Ralph reloads a package of artifacts each iteration. Each file answers a different question:
332
-
333
- | Question | File | What it contains |
334
- | -------- | ---- | ---------------- |
335
- | Why are we doing this? | `proposal.md` | Problem, value, scope, non-goals, rollout boundaries, operator impact |
336
- | What must be true when we are done? | `specs/**/spec.md` | Required behaviors, failure cases, scenarios, first-rollout vs. deferred |
337
- | How should the system behave internally? | `design.md` | Algorithms, config shapes, failure semantics, compatibility, retention math, handoff flow |
338
- | What is the next safe increment? | `tasks.md` | Ordered checkboxes using the task template above |
339
- | How should the loop operate? | Loop prompt / wrapper | Reload rules, one-task-per-iteration, validator categories, stop conditions |
340
-
341
- Authoring rules:
342
-
343
- - **Resolve or explicitly defer policy before writing tasks.** Phrases like "may be shared or tenant-specific," "one option is," or "could support later" are fine while exploring; they are blockers once the loop starts. Resolve algorithms, fallback behavior, retention math, config shape, failure taxonomy, and compatibility-window behavior in `design.md`.
344
- - **Specs must be deterministic.** If two good implementers could read the spec and make materially different choices, the spec is not loop-safe yet.
345
- - **If a dedicated coverage artifact exists** (such as a `figma-route-map.md`), route and shared-surface tasks should reuse it as the durable source of truth instead of rediscovering coverage each iteration.
346
- - **Run with full OpenSpec context when available.** Repo guidance favors `./scripts/ralph-run.sh tasks <change>` over raw `tasks.md` mode because `opsx-apply` provides the agent with a manifest of OpenSpec artifact paths (`## OpenSpec Artifacts`) so the agent can read proposal, design, and specs as needed each iteration. If you run raw `prd.json` or raw `tasks.md` mode, push more detail down into each item because the companion docs will not be listed in the manifest.
347
-
348
- ### Loop-prompt / wrapper instructions
349
-
350
- At minimum, the loop prompt must tell the agent to:
351
-
352
- - Read the OpenSpec artifacts listed in `## OpenSpec Artifacts` (proposal, design, specs) before implementing the current task.
353
- - Inspect prior iteration state before starting new work.
354
- - Implement exactly one task per iteration.
355
- - Run the exact validators relevant to that task.
356
- - Mark progress **only** after verification succeeds.
357
- - Stop and request help on ambiguity, contradictions, missing dependencies, or unresolved failures.
358
- - Not invent new requirements or silently redefine "done."
359
- - Preserve human handoff items as handoff items.
360
-
361
- Additional critical rules to state explicitly:
362
-
363
- - A `Done when` check failure is a hard blocker. Do not reclassify a failing gate as "pre-existing" or "unrelated" unless the task itself documents that pre-existing failure with explicit before-run evidence.
364
- - Distinguish known-broken validators (document and continue) from required-clean validators (hard stop). Name both categories. Do not generalize permissive behavior from one to the other.
365
- - Before writing any code in a task requiring quality gates to pass, run those gates and record the baseline. A failure seen after code changes but not in the baseline is a regression and a hard stop.
366
- - Do not read large binary assets back into working context after capturing them. Save to disk, record the path, move on.
367
- - Use portable shell constructs. In zsh, `status` is read-only; use `$?` directly rather than `status=$?`.
368
-
369
- ---
370
-
371
- ## `prd.json` specifics
372
-
373
- The local JSON template is intentionally minimal:
374
-
375
- ```json
376
- {
377
- "features": [
378
- {
379
- "category": "functional",
380
- "description": "Description of the feature requirement",
381
- "steps": ["Step 1 to verify", "Step 2 to verify"],
382
- "passes": false
383
- }
384
- ]
385
- }
140
+ - Stop and hand off if: `formatDate` does not cover a required locale.
386
141
  ```
387
142
 
388
- Because the schema is small, most of the real quality comes from how `description` and `steps` are written.
389
-
390
- Rules:
391
-
392
- 1. **Treat `description` and `steps` as immutable truth.** The loop updates only `passes: false -> true`. If the requirement is wrong, a human edits it deliberately.
393
- 2. **Each feature is a behavior slice, not a file chore.** "Tenant-scoped promotion refuses to activate when required staged rows are missing," not "edit `promote.py`."
394
- 3. **Write `steps` as verification steps.** Observable, ordered, testable. Answer: how to observe the behavior, what to run, what to compare, what must be true for `passes` to become `true`.
395
- 4. **Each feature fits in one session.** If it needs multiple unrelated edits, multiple policy decisions, and several different verification modes, split it.
396
- 5. **Encode quality gates in `steps` or the loop prompt.** The schema does not carry them separately.
397
- 6. **Prefer end-to-end verification over code-only.** Unit tests passing does not prove the feature works end-to-end. For UI or workflow changes, include steps that simulate a real user path or a realistic system check.
398
- 7. **Manual/operator items go elsewhere.** Do not put stage validation or approvals into the autonomous JSON list unless clearly marked and excluded from execution.
399
- 8. **Order by dependency.** Schema and primitives first, then behavior, then integration, then docs. Do not make the agent infer the graph.
400
- 9. **Do not encode unresolved design choices as feature items.** "Decide whether cleanup is shared or tenant-specific" is a design question, not an implementation feature.
401
- 10. **Keep JSON concise; move rationale to companion docs.** JSON carries execution truth; motivation, scope, architecture, and handoffs live in `proposal.md` and `design.md`.
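Rule 1 in the list above (the loop may only flip `passes` from `false` to `true`) is easy to check after each iteration. A minimal sketch follows, assuming the wrapper keeps the parsed `prd.json` from before and after the iteration; `illegal_changes` is a hypothetical helper, not an existing `ralph-run` function.

```python
def illegal_changes(before: dict, after: dict) -> list[str]:
    """List edits to prd.json other than flipping `passes` from false to true."""
    old, new = before["features"], after["features"]
    if len(old) != len(new):
        return ["feature count changed"]

    problems = []
    for index, (a, b) in enumerate(zip(old, new)):
        # Every key except `passes` is treated as immutable.
        for key in sorted(set(a) | set(b)):
            if key != "passes" and a.get(key) != b.get(key):
                problems.append(f"feature {index}: `{key}` was modified")
        # `passes` may stay as it is or go false -> true, never back.
        if a.get("passes") is True and b.get("passes") is not True:
            problems.append(f"feature {index}: `passes` was reset")
    return problems
```

An empty result means the iteration touched only `passes`. Anything else points to a `description` or `steps` edit that, per the rule, should have been made deliberately by a human.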
402
-
403
- Good local `prd.json` item:
404
-
405
- ```json
406
- {
407
- "category": "ingestion",
408
- "description": "Standalone promote refuses to activate a staged tenant version when required staged rows are missing",
409
- "steps": [
410
- "Create or identify a tenant with a staged target version but missing required staged rows",
411
- "Run the standalone promote path for that tenant",
412
- "Verify the command exits non-zero or returns the documented failure outcome",
413
- "Verify tenant active version is unchanged",
414
- "Run the relevant test selector and confirm it passes"
415
- ],
416
- "passes": false
417
- }
418
- ```
419
-
420
- Why this works: outcome-based description, deterministic steps, explicit negative behavior, includes verification, no hidden product decision.
421
-
422
- ---
423
-
424
- ## Authoring checklist
425
-
426
- Before calling an OpenSpec change "Ralph-friendly," confirm all of these:
427
-
428
- ### Artifact package
429
-
430
- - [ ] `proposal.md` states scope, non-goals, and first-rollout boundaries.
431
- - [ ] `design.md` does not leave core policy choices unresolved.
432
- - [ ] Specs are specific enough that two implementers would not make materially different choices.
433
- - [ ] Human/operator work is documented outside the autonomous checkbox path.
434
- - [ ] The artifacts on disk contain all critical guidance that a fresh session needs.
435
-
436
- ### Task shape
437
-
438
- - [ ] Each task uses the [task template](#quick-reference-the-task-template) (outcome title, scope, change, `Done when`, stop condition).
439
- - [ ] Each task is atomic in the semantic sense, not merely tiny.
440
- - [ ] Each task has one dominant outcome and one dominant verification cluster.
441
- - [ ] Each task size matches the smallest model/harness expected to run it.
442
- - [ ] No checkbox contains two obvious mergeable checkpoints that could be verified independently.
443
- - [ ] If using a lightweight profile, the higher task count comes from real checkpoints rather than file-chore fragmentation.
444
- - [ ] No task has been split so far that it loses its own meaningful verifier or clean stopping point.
445
- - [ ] Each task has an explicit, runnable verification target.
446
-
447
- ### Ordering and dependencies
448
-
449
- - [ ] Tasks are ordered contract → data → surface → shared wiring → final gates.
450
- - [ ] No task depends on stage/prod/manual access unless it is explicitly a handoff item.
451
- - [ ] Any task chain that requires `npm test` or browser tests to pass in a later task includes a pre-flight quality gate task at the start that records baseline output.
452
-
453
- ### Quality gates
454
-
455
- - [ ] No task's `Done when` requires a gate to pass without the loop instructions making clear that gate failure is a hard stop.
456
- - [ ] The loop instructions name known-broken validators (document and continue) and required-clean validators (hard stop), and forbid generalizing between them.
457
- - [ ] Visual verification tasks save outputs to file paths and do not re-read large binary content back into loop context.
458
- - [ ] Tasks requiring both a code change and a visual screenshot comparison are split into two tasks if combining them would risk context overflow.
459
-
460
- ### `prd.json` (if used)
461
-
462
- - [ ] The loop is instructed to modify only `passes`.
463
- - [ ] `description` and `steps` are outcome-based and deterministic.
464
- - [ ] Manual/operator items are excluded from the autonomous list.
465
-
466
- ### Anti-patterns to avoid
467
-
468
- - Vague verbs (`ensure`, `support`, `validate`) with no observable output.
469
- - Asking the loop to decide policy mid-implementation.
470
- - Mixing implementation, rollout, and manual validation in one checkbox.
471
- - Splitting one migration or tightly related refactor into tiny loop iterations with no independent verification point.
472
- - Hiding critical instructions only in chat history.
473
- - Letting the agent rewrite feature definitions instead of only updating status.
474
- - Declaring done from unit tests alone when the real behavior is end-to-end.
475
- - Leaving "maybe this, maybe that" wording in design or proposal once implementation is about to start.
476
- - Marking a task complete when a `Done when` check failed, with a rationalization note explaining why the failure "does not count."
477
- - Classifying a failing quality gate as "pre-existing" without a documented before-baseline.
478
- - Running `tsc` during code-writing tasks but not `npm test`, then running `npm test` for the first time in a later task and treating its failure as pre-existing.
479
- - Generalizing the permissive "document and continue" pattern from a known-broken validator to validators that should be clean.
480
- - Reading large binary assets back into working context after capture.
481
- - Using reserved or read-only zsh variable names like `status`.
482
-
483
- ---
484
-
485
- ## Background and rationale
486
-
487
- This section preserves the reasoning behind the rules above. If the quick-reference sections answer your authoring question, you can skip it.
488
-
489
- ### Why Ralph loops depend on artifacts, not chat
490
-
491
- Ralph loops work because each iteration starts fresh, re-reads the prompt and repo state, and uses objective backpressure (tests, typechecks, render checks, browser checks). The loop is only as good as the artifacts it reloads every time.
492
-
493
- The most important principle:
494
-
495
- **Ralph does not want the smallest textual tasks. Ralph wants the largest coherent task that is still unambiguous, objectively verifiable, and comfortably completable in one agent session.**
496
-
497
- Task size is harness-relative: with a strong model and disciplined full-artifact reload, medium tasks are often best; with a smaller model, noisy tool output, or heavy context reload, the same task may be too large. The correct unit is the largest coherent slice that still fits the actual context budget of the loop you plan to run.
498
-
499
- ### Lessons from prior Ralph-loop reviews
500
-
501
- The `tenant-scoped-content-versioning` example and subsequent reviews produced the concrete rules in this guide. Key findings:
502
-
503
- 1. **Medium atomic tasks beat both vague tasks and micro-tasks.** The worst plans combined tiny mechanical subtasks with broad ambiguous ones. Better: merge obvious same-file mechanical work, split only tasks that still hide policy or control-flow decisions.
504
-
505
- 2. **Human handoffs must be documented but not executed by the loop.** Stage rollout validation, signoff, and production-only checks belong in a dedicated handoff section, outside the checkbox path the loop consumes.
506
-
507
- 3. **Unresolved policy questions become loop churn.** Phrases like "reuse the current staged version for the same release cycle," "validate critical failures," or "may be shared or tenant-specific" are acceptable in human planning but bad for a Ralph loop. A fresh-session agent will treat them as missing decisions and thrash.
508
-
509
- 4. **Every wide task needs explicit "done when" signals.** Verbs like `ensure`, `validate`, `keep`, or `support` are too soft on their own.
510
-
511
- 5. **Full OpenSpec context is better than raw task-file mode.** Repo guidance favors `./scripts/ralph-run.sh tasks <change>` over raw `tasks.md` mode because `opsx-apply` provides a manifest (`## OpenSpec Artifacts`) listing artifact paths so the agent can read proposal, design, and specs as needed. A task list can be shorter when the design/specs fully resolve tricky decisions, but only if the loop actually references those artifacts each iteration.
512
-
513
- 6. **"Done when" gates are hard stops, not soft guidelines.** The most common single-task quality failure is a loop marking a task complete after a `Done when` check failed, with a rationalization note. The gate is self-authorizing; the loop decides the gate does not apply, bypasses it, and moves on, recording a completion claim the stated verifier never confirmed.
514
-
515
- 7. **"Pre-existing" is a claim that requires a before-baseline, not a judgment call.** In the absence of a before-baseline, any test failure during a task could equally be a regression. Baselines must exist in writing before code changes land.
516
-
517
- 8. **Permissive reasoning for a known-broken validator bleeds to all validators.** A loop told "document tsc failures and continue" will generalize that to `npm test`, browser tests, or any other gate. Fix: name both known-broken and required-clean validators explicitly.
518
-
519
- 9. **Establish baselines before the first code-writing task in a chain.** Otherwise the loop has no clean before-state and cannot reliably distinguish its own regressions from pre-existing issues.
520
-
521
- 10. **Large binary assets in tool responses cause context overflow and loop restarts.** Screenshots read back into context can exceed provider limits, force compaction, and discard intermediate state — including prior task completion status.
522
-
523
- ### What "Ralph-friendly" means in practice
524
-
525
- A Ralph-friendly spec or task set has these properties:
526
-
527
- 1. One loop item equals one coherent slice of behavior.
528
- 2. "Done" is frozen up front and not negotiated mid-run.
529
- 3. Verification is explicit and runnable.
530
- 4. The agent is not asked to choose product or rollout policy.
531
- 5. Human-only checks are durable but excluded from autonomous execution.
532
- 6. Repeated failure patterns are corrected by editing the artifact, not by hoping the next session "remembers."
533
- 7. The loop can stop honestly with a blocker rather than improvise.
534
-
535
- ### Bottom line
536
-
537
- The best Ralph-friendly OpenSpec proposal is not the most detailed artifact in the abstract. It is the artifact set that leaves the loop with the fewest judgment calls.
538
-
539
- If a fresh-session agent can read the artifacts, pick one coherent increment, verify it objectively, stop honestly on blockers, and leave the repo in a clean state — the proposal is Ralph-friendly.
540
-
541
- ---
542
-
543
- ## Source notes
544
-
545
- Durable repo-local sources used:
546
-
547
- - `scripts/RALPHY-OPENSPEC-RUNNING.md`
548
- - `scripts/templates/features-template.json`
549
- - `scripts/templates/prd-template.md`
550
- - `scripts/ralph-run.sh`
551
- - `hidden/RALPH-WIGGUM-OPENSPEC.md`
552
- - `hidden/RALPH-WIGGUM-CURSOR.md`
143
+ ## Artifact requirements
553
144
 
554
- Prior internal Ralph-loop review conversations also informed this note.
145
+ Before writing tasks, confirm:
146
+ - `proposal.md` has scope, non-goals, rollout boundaries
147
+ - `design.md` resolves all policy (no "may be X or Y")
148
+ - Specs are deterministic (two implementers would make the same choices)
149
+ - Human/operator work is outside the `- [ ]` checkbox path
555
150
 
556
- External references consulted:
151
+ If any of these are unresolved, stop and fix the artifact before writing tasks.
557
152
 
558
- - Anthropic, "Effective harnesses for long-running agents" — `https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents`
559
- - Anthropic, "Harness design for long-running application development" — `https://www.anthropic.com/engineering/harness-design-long-running-apps/`
560
- - Claude docs, "Claude 4 best practices" → "Multi-context window workflows" — `https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/claude-4-best-practices#multi-context-window-workflows`
561
- - Geoffrey Huntley, "Ralph Wiggum as a software engineer" — `https://ghuntley.com/ralph/`
562
- - Geoffrey Huntley, "everything is a ralph loop" — `https://ghuntley.com/loop/`
563
- - Ralph TUI docs: `create-prd` and `convert` — `https://ralph-tui.com/docs/cli/create-prd`, `https://ralph-tui.com/docs/cli/convert`
153
+ ## prd.json rules (if used)
564
154
 
155
+ - `description` and `steps` are immutable; loop updates only `passes`
156
+ - Each feature = one behavior slice, not a file chore
157
+ - `steps` are verification steps: observable, ordered, testable
158
+ - Each feature fits in one session; split if it needs multiple unrelated edits
159
+ - Order by dependency; do not make the agent infer the graph
160
+ - No unresolved design choices as feature items
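
The first `prd.json` rule above (only `passes` may change) can be checked mechanically. The following is a minimal sketch, not code from the package: the helper name `onlyPassesChanged` is hypothetical, and it assumes `prd.json` is a plain array of items shaped like the `category` / `description` / `steps` / `passes` example in the removed guide.

```javascript
// Hypothetical helper, not part of spec-and-loop.
// Assumes prd.json is an array of items shaped like
// { category, description, steps, passes }.
function onlyPassesChanged(before, after) {
  if (!Array.isArray(before) || !Array.isArray(after)) return false;
  if (before.length !== after.length) return false;

  return before.every((oldItem, i) => {
    const newItem = after[i];
    if (typeof newItem !== 'object' || newItem === null) return false;
    if (typeof newItem.passes !== 'boolean') return false;

    // Drop `passes` from both sides, then compare everything else.
    // JSON.stringify is key-order sensitive, so reordered keys count as a change.
    const { passes: _oldPasses, ...oldRest } = oldItem;
    const { passes: _newPasses, ...newRest } = newItem;
    return JSON.stringify(oldRest) === JSON.stringify(newRest);
  });
}

const before = [
  {
    category: 'ingestion',
    description: 'Standalone promote refuses to activate an incomplete staged version',
    steps: ['Run the standalone promote path', 'Verify tenant active version is unchanged'],
    passes: false,
  },
];
const flipped = [{ ...before[0], passes: true }];
const rewritten = [{ ...before[0], description: 'Promote works', passes: true }];

console.log(onlyPassesChanged(before, flipped));   // true
console.log(onlyPassesChanged(before, rewritten)); // false
```

A check like this could run between iterations, comparing the committed `prd.json` with the working copy, so that a loop which rewrites a feature definition is caught rather than trusted.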
@@ -20,7 +20,6 @@
20
20
  * errors ("no prompt source configured", "prompt file
21
21
  * not found", "prompt file is empty") do not fire.
22
22
  * {{tasks}} - Raw tasks file content
23
- * {{task_context}} - Fresh current-task and completed-task summary
24
23
  * {{task_promise}} - Configured task promise string
25
24
  * {{completion_promise}} - Configured completion promise string
26
25
  * {{commit_contract}} - Commit instructions derived from options.noCommit
@@ -37,7 +36,6 @@
37
36
 
38
37
  const fs = require('fs');
39
38
  const path = require('path');
40
- const tasks = require('./tasks');
41
39
 
42
40
  // One-time fallback notice flag for invalid RALPH_BASE_PROMPT_WARN_BYTES
43
41
  let _warnBytesInvalidNoticed = false;
@@ -159,15 +157,12 @@ function render(options, iteration) {
159
157
  tasksContent = fs.readFileSync(options.tasksFile, 'utf8');
160
158
  }
161
159
 
162
- const taskContext = options.tasksFile ? tasks.taskContext(options.tasksFile) : '';
163
-
164
160
  const vars = {
165
161
  iteration: String(iteration),
166
162
  max_iterations: String(options.maxIterations || 50),
167
163
  change_dir: options.changeDir || '',
168
164
  base_prompt: base,
169
165
  tasks: tasksContent,
170
- task_context: taskContext,
171
166
  task_promise: options.taskPromise || 'READY_FOR_NEXT_TASK',
172
167
  completion_promise: options.completionPromise || 'COMPLETE',
173
168
  commit_contract: options.noCommit
@@ -141,39 +141,6 @@ function countTasks(tasksFile) {
141
141
  };
142
142
  }
143
143
 
144
- /**
145
- * Build a compact task-context block for the current tasks file.
146
- * Mirrors the shell-side task context format so prompts can render a fresh
147
- * snapshot on every iteration without regenerating the whole PRD.
148
- *
149
- * @param {string} tasksFile
150
- * @returns {string}
151
- */
152
- function taskContext(tasksFile) {
153
- const all = parseTasks(tasksFile);
154
- if (all.length === 0) return '';
155
-
156
- const current =
157
- all.find((task) => task.status === 'in_progress') ||
158
- all.find((task) => task.status === 'incomplete') ||
159
- null;
160
- const completedCount = all.filter((task) => task.status === 'completed').length;
161
- const total = all.length;
162
-
163
- const sections = [];
164
-
165
- if (current) {
166
- sections.push('## Current Task');
167
- sections.push(`- ${current.fullDescription || current.description}`);
168
- sections.push('');
169
- }
170
-
171
- sections.push('## Progress');
172
- sections.push(`- ${completedCount} of ${total} tasks complete`);
173
-
174
- return sections.join('\n');
175
- }
176
-
177
144
  // ---------------------------------------------------------------------------
178
145
  // Internal helpers
179
146
  // ---------------------------------------------------------------------------
@@ -199,6 +166,5 @@ module.exports = {
199
166
  currentTask,
200
167
  hashFile,
201
168
  countTasks,
202
- taskContext,
203
169
  tasksLinkPath,
204
170
  };
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "spec-and-loop",
3
- "version": "3.0.2",
3
+ "version": "3.1.0",
4
4
  "description": "OpenSpec + Ralph Loop integration for iterative development with opencode",
5
5
  "main": "index.js",
6
6
  "bin": {
@@ -337,65 +337,34 @@ ensure_artifacts_present() {
337
337
  local change_dir="$1"
338
338
  local change_name="$2"
339
339
 
340
- local required_files=(
341
- "proposal.md"
342
- "tasks.md"
343
- "design.md"
344
- )
345
-
346
- local missing=()
347
- for file in "${required_files[@]}"; do
348
- if [[ ! -f "$change_dir/$file" ]]; then
349
- missing+=("$file")
350
- fi
351
- done
340
+ local status_json
341
+ status_json=$(openspec status --change "$change_name" --json 2>/dev/null)
342
+ if [[ $? -ne 0 ]]; then
343
+ log_error "Failed to query openspec status for change: $change_name"
344
+ exit 1
345
+ fi
352
346
 
353
- if [[ ${#missing[@]} -eq 0 ]]; then
347
+ local blocked
348
+ blocked=$(echo "$status_json" | jq -r '.artifacts[] | select(.status == "blocked") | .id' 2>/dev/null)
349
+ if [[ -z "$blocked" ]]; then
354
350
  return 0
355
351
  fi
356
352
 
357
- log_info "Missing artifacts: ${missing[*]}"
353
+ log_info "Blocked artifacts detected: $blocked"
358
354
  log_info "Invoking opencode to complete missing artifacts..."
359
355
 
360
356
  opencode run "/opsx-ff $change_name" || true
361
357
 
362
- for file in "${required_files[@]}"; do
363
- if [[ ! -f "$change_dir/$file" ]]; then
364
- log_error "Required artifact still not found after /opsx-ff: $file"
365
- exit 1
366
- fi
367
- done
358
+ status_json=$(openspec status --change "$change_name" --json 2>/dev/null)
359
+ blocked=$(echo "$status_json" | jq -r '.artifacts[] | select(.status == "blocked") | .id' 2>/dev/null)
360
+ if [[ -n "$blocked" ]]; then
361
+ log_error "Artifacts still blocked after /opsx-ff: $blocked"
362
+ exit 1
363
+ fi
368
364
 
369
365
  log_info "All missing artifacts generated"
370
366
  }
371
367
 
372
- validate_openspec_artifacts() {
373
- local change_dir="$1"
374
-
375
- log_verbose "Validating OpenSpec artifacts..."
376
-
377
- local required_files=(
378
- "proposal.md"
379
- "tasks.md"
380
- "design.md"
381
- )
382
-
383
- for file in "${required_files[@]}"; do
384
- if [[ ! -f "$change_dir/$file" ]]; then
385
- log_error "Required artifact not found: $file"
386
- exit 1
387
- fi
388
- log_verbose "Found artifact: $file"
389
- done
390
-
391
- if [[ ! -d "$change_dir/specs" ]]; then
392
- log_error "Required directory not found: specs/"
393
- exit 1
394
- fi
395
- log_verbose "Found directory: specs/"
396
-
397
- log_info "All OpenSpec artifacts validated"
398
- }
399
368
 
400
369
  setup_ralph_directory() {
401
370
  local change_dir="$1"
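
The `jq` filter that `ensure_artifacts_present` now uses in the hunk above, `.artifacts[] | select(.status == "blocked") | .id`, can be read as the following JavaScript sketch. The function name `blockedArtifactIds` is hypothetical, the `{ artifacts: [{ id, status }] }` shape is inferred only from that filter, and the sample ids and the `done` status value are illustrative assumptions rather than documented `openspec status` output.

```javascript
// Hypothetical JavaScript equivalent of the jq filter in ensure_artifacts_present.
// The { artifacts: [{ id, status }] } shape is inferred from the filter itself,
// not taken from openspec documentation.
function blockedArtifactIds(statusJson) {
  const parsed = JSON.parse(statusJson); // throws on invalid JSON
  const artifacts = Array.isArray(parsed.artifacts) ? parsed.artifacts : [];
  return artifacts
    .filter((artifact) => artifact.status === 'blocked')
    .map((artifact) => artifact.id);
}

// Illustrative input only: ids and the "done" status are assumptions.
const sample = JSON.stringify({
  artifacts: [
    { id: 'proposal', status: 'done' },
    { id: 'tasks', status: 'blocked' },
  ],
});

console.log(blockedArtifactIds(sample)); // [ 'tasks' ]
```

One difference is worth noting: the shell version discards `jq` errors with `2>/dev/null` and treats an empty result as "nothing blocked", while this sketch throws on unparseable input.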
@@ -810,111 +779,7 @@ sync_tasks_to_ralph() {
810
779
  log_verbose "Symlink configured: $ralph_tasks_file -> $abs_tasks_file"
811
780
  }
812
781
 
813
- create_prompt_template() {
814
- local change_dir="$1"
815
- local template_file="$2"
816
-
817
- log_verbose "Creating custom prompt template..."
818
-
819
- local abs_change_dir
820
- abs_change_dir=$(get_realpath "$change_dir")
821
-
822
- cat > "$template_file" << 'EOF'
823
- # Ralph Wiggum Task Execution - Iteration {{iteration}} / {{max_iterations}}
824
-
825
- Change directory: {{change_dir}}
826
-
827
- ## OpenSpec Artifacts
828
-
829
- {{_openspec_manifest}}
830
-
831
- ## Fresh Task Context
832
-
833
- {{task_context}}
834
-
835
- ## Instructions
836
-
837
- Before implementing, read the OpenSpec artifacts listed above that are relevant to the current task.
838
-
839
- Follow this loop contract EXACTLY. Do not skip steps. Do not batch. Do not output a promise until every step is done.
840
-
841
- 1. Work on the task shown in `## Fresh Task Context` above. Before editing any marker, open `tasks.md` at `{{change_dir}}/tasks.md` and verify that same task is still `- [ ] ` or `- [/] ` on disk (it may have been closed by a prior iteration if you are resuming).
842
- 2. Edit `tasks.md` in place to change that line's marker to `- [/] ` (in-progress). You MUST use your file edit tool to modify the file on disk — a shell `cp`, `sed`, or print-to-stdout does not count. Verify by re-reading the file.
843
- 3. Implement the smallest change that fully satisfies the task's Done-when conditions. Run the task's verification command if one is specified.
844
- 4. On success, edit `tasks.md` again in place to change that line's marker from `- [/] ` to `- [x] `. Verify by re-reading the file and confirming the `[x]` is present on that exact line.
845
- 5. ONLY after step 4 writes `[x]` to disk, output `<promise>{{task_promise}}</promise>` on its own line.
846
- 6. If and only if EVERY task line in `tasks.md` is `- [x] `, output `<promise>{{completion_promise}}</promise>` instead.
847
-
848
- Hard rules:
849
- - If you do not actually modify `tasks.md` on disk in this iteration, DO NOT output any promise tag. Output a short failure note instead and stop.
850
- - Never output `<promise>{{task_promise}}</promise>` while the task you just worked on is still `- [ ]` on disk. That causes the same task to repeat forever.
851
- - Promise tags must be on their own line, literal, unquoted, and not described in prose.
852
- - If an approach fails twice, try a different one.
853
- - If the task is already satisfied by prior work (e.g. target file already exists with the right content), you STILL must flip the checkbox to `[x]` in `tasks.md` before emitting the promise.
854
-
855
- ## Commit Contract
856
-
857
- {{commit_contract}}
858
- EOF
859
-
860
- # Determine repo root for AGENTS.md probe
861
- local repo_root
862
- repo_root=$(git rev-parse --show-toplevel 2>/dev/null) || repo_root=""
863
-
864
- # Build the manifest body
865
- local manifest_body
866
- manifest_body="Read these as needed (source of truth for this change):"$'\n'$'\n'
867
- manifest_body+="- $abs_change_dir/proposal.md"$'\n'
868
- manifest_body+="- $abs_change_dir/design.md"$'\n'
869
-
870
- # Pre-expand specs/*/spec.md into concrete paths
871
- if [[ -d "$abs_change_dir/specs" ]]; then
872
- while IFS= read -r spec_path; do
873
- [[ -n "$spec_path" ]] && manifest_body+="- $spec_path"$'\n'
874
- done < <(find "$abs_change_dir/specs" -name spec.md -type f 2>/dev/null | sort)
875
- fi
876
-
877
- # Optionally append AGENTS.md reference
878
- local agents_line
879
- agents_line=$(probe_agents_md "$repo_root")
880
- if [[ -n "$agents_line" ]]; then
881
- manifest_body+=$'\n'"$agents_line"
882
- fi
883
782
 
884
- # Append Ralph best practices guide if project is ralphified
885
- if check_ralphified; then
886
- local bp_manifest_path="$abs_change_dir/../../OPENSPEC-RALPH-BP.md"
887
- if [[ ! -f "$bp_manifest_path" ]]; then
888
- bp_manifest_path="$repo_root/openspec/OPENSPEC-RALPH-BP.md"
889
- fi
890
- if [[ -f "$bp_manifest_path" ]]; then
891
- manifest_body+=$'\n'"- $bp_manifest_path (Ralph best practices guide)"
892
- fi
893
- fi
894
-
895
- # Substitute {{_openspec_manifest}} using awk with a manifest temp file
896
- # (awk -v cannot handle multi-line values; use getline from a file instead)
897
- local _manifest_file
898
- _manifest_file=$(mktemp 2>/dev/null || mktemp -t ralph-manifest)
899
- printf '%s' "$manifest_body" > "$_manifest_file"
900
- local _tmpfile
901
- _tmpfile=$(mktemp 2>/dev/null || mktemp -t ralph-template)
902
- awk -v mf="$_manifest_file" '
903
- {
904
- if ($0 == "{{_openspec_manifest}}") {
905
- while ((getline line < mf) > 0) { print line }
906
- close(mf)
907
- } else { print }
908
- }
909
- ' "$template_file" > "$_tmpfile" && mv "$_tmpfile" "$template_file"
910
- rm -f "$_manifest_file"
911
-
912
- # Substitute {{change_dir}}
913
- _tmpfile=$(mktemp 2>/dev/null || mktemp -t ralph-template)
914
- sed "s|{{change_dir}}|$abs_change_dir|g" "$template_file" > "$_tmpfile" && mv "$_tmpfile" "$template_file"
915
-
916
- log_verbose "Prompt template created: $template_file"
917
- }
918
783
 
919
784
  probe_agents_md() {
920
785
  local repo_root="$1"
@@ -1064,14 +929,11 @@ execute_ralph_loop() {
1064
929
  return 1
1065
930
  fi
1066
931
 
1067
- local template_file="$ralph_dir/prompt-template.md"
1068
-
1069
932
  # Clean up old output directories and setup new one
1070
933
  cleanup_old_output
1071
934
  local output_dir=$(setup_output_capture "$ralph_dir")
1072
935
 
1073
936
  sync_tasks_to_ralph "$change_dir" "$ralph_dir"
1074
- create_prompt_template "$change_dir" "$template_file"
1075
937
 
1076
938
  # Output files
1077
939
  local stdout_log="$output_dir/ralph-stdout.log"
@@ -1080,13 +942,25 @@ execute_ralph_loop() {
1080
942
  log_info "Invoking internal mini Ralph runtime..."
1081
943
  log_info "Capturing output to: $output_dir"
1082
944
 
945
+ local ralph_prompt_text="/opsx-apply $CHANGE_NAME
946
+
947
+ You are operating inside an automated loop. Follow these constraints EXACTLY:
948
+
949
+ 1. Implement exactly ONE pending task from the task list /opsx-apply shows you.
950
+ 2. After marking the task checkbox [x] on disk, output <promise>READY_FOR_NEXT_TASK</promise> on its own line.
951
+ 3. If and only if EVERY task checkbox is [x], output <promise>COMPLETE</promise> instead.
952
+ 4. Do not ask questions or wait for input. If blocked, output a short failure note describing the blocker and stop.
953
+ 5. If the task is already satisfied by prior work, still flip the checkbox to [x] before emitting the promise.
954
+
955
+ Do not create git commits yourself. The Ralph runner manages automatic task commits when auto-commit is enabled."
956
+
1083
957
  # Build the mini-ralph-cli arguments
1084
958
  local mini_ralph_args=(
1085
- "--prompt-template" "$template_file"
1086
959
  "--ralph-dir" "$ralph_dir"
1087
960
  "--tasks-file" "$change_dir/tasks.md"
1088
961
  "--tasks"
1089
962
  "--max-iterations" "$max_iterations"
963
+ "--prompt-text" "$ralph_prompt_text"
1090
964
  )
1091
965
 
1092
966
  if [[ "$no_commit" == true ]]; then
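
The inline prompt in the hunk above asks the agent to emit a promise tag "on its own line". The following is a minimal sketch of what that constraint implies for a consumer of the agent's output. It is an assumption for illustration only: the mini Ralph runtime's actual parsing is not shown in this diff, and `findPromise` is a hypothetical name. The default promise strings match the `READY_FOR_NEXT_TASK` and `COMPLETE` defaults visible in the `render` hunk.

```javascript
// Hypothetical detector; the real runtime's parsing is not shown in this diff.
// A tag counts only when it is the entire content of a line (after trimming),
// so a tag merely mentioned in prose is ignored.
function findPromise(
  output,
  taskPromise = 'READY_FOR_NEXT_TASK',
  completionPromise = 'COMPLETE'
) {
  const lines = output.split(/\r?\n/).map((line) => line.trim());
  if (lines.includes(`<promise>${completionPromise}</promise>`)) return 'complete';
  if (lines.includes(`<promise>${taskPromise}</promise>`)) return 'task';
  return null;
}

console.log(findPromise('Marked task 2 done.\n<promise>READY_FOR_NEXT_TASK</promise>\n')); // task
console.log(findPromise('I will output <promise>COMPLETE</promise> later.'));              // null
console.log(findPromise('<promise>COMPLETE</promise>'));                                   // complete
```

Matching whole lines rather than substrings is the design choice that makes the "own line" wording matter: an agent that only describes the tag in a sentence does not advance the loop.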
@@ -1474,7 +1348,6 @@ main() {
1474
1348
 
1475
1349
  ensure_artifacts_present "$change_dir" "$CHANGE_NAME"
1476
1350
 
1477
- validate_openspec_artifacts "$change_dir"
1478
1351
  validate_script_state "$change_dir"
1479
1352
  local ralph_dir=$(setup_ralph_directory "$change_dir")
1480
1353