npm - ralphctl - Versions diffs - 0.5.0 → 0.6.0 - Mend

ralphctl 0.5.0 → 0.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (58) hide show

package/README.md +29 -16
package/dist/absolute-path-WUTZQ37D.mjs +8 -0
package/dist/chunk-6RDMCLWU.mjs +108 -0
package/dist/chunk-HIU74KTO.mjs +1046 -0
package/dist/chunk-S3PTDH57.mjs +78 -0
package/dist/chunk-WV4D2CPG.mjs +26 -0
package/dist/cli.mjs +22413 -717
package/dist/manifest.json +24 -0
package/dist/prompt-adapter-JQICGVX7.mjs +7 -0
package/dist/prompts/ideate.md +3 -1
package/dist/prompts/plan-auto.md +23 -8
package/dist/prompts/plan-common-examples.md +3 -3
package/dist/prompts/plan-common.md +6 -5
package/dist/prompts/plan-interactive.md +30 -7
package/dist/prompts/repo-onboard.md +154 -64
package/dist/prompts/signals-task.md +3 -0
package/dist/prompts/sprint-feedback.md +3 -0
package/dist/prompts/task-evaluation.md +74 -53
package/dist/prompts/task-execution.md +65 -21
package/dist/prompts/ticket-refine.md +11 -8
package/dist/prompts/validation-checklist.md +3 -2
package/dist/skills/default/abstraction-first/SKILL.md +45 -0
package/dist/skills/default/alignment/SKILL.md +46 -0
package/dist/skills/default/iterative-review/SKILL.md +48 -0
package/dist/skills/exec/.gitkeep +0 -0
package/dist/skills/plan/.gitkeep +0 -0
package/dist/skills/refine/.gitkeep +0 -0
package/dist/storage-paths-IPNZZM5D.mjs +15 -0
package/dist/validation-error-QT6Q7FYU.mjs +7 -0
package/package.json +9 -4
package/dist/add-67UFUI54.mjs +0 -17
package/dist/add-DVPVHENV.mjs +0 -18
package/dist/bootstrap-FMHG6DRY.mjs +0 -11
package/dist/chunk-62HYDA7L.mjs +0 -1128
package/dist/chunk-747KW2RW.mjs +0 -24
package/dist/chunk-BSB4EDGR.mjs +0 -260
package/dist/chunk-BT5FKIZX.mjs +0 -787
package/dist/chunk-CBMFRQ4Y.mjs +0 -441
package/dist/chunk-CFUVE2BP.mjs +0 -16
package/dist/chunk-D6QZNEYN.mjs +0 -5520
package/dist/chunk-FNAAA32W.mjs +0 -103
package/dist/chunk-GQ2WFKBN.mjs +0 -269
package/dist/chunk-IWXBJD2D.mjs +0 -27
package/dist/chunk-OGEXYSFS.mjs +0 -228
package/dist/chunk-VAZ3LJBI.mjs +0 -179
package/dist/chunk-WDMLPXOD.mjs +0 -363
package/dist/chunk-XN2UIHBY.mjs +0 -589
package/dist/chunk-ZE2BRQA2.mjs +0 -5542
package/dist/create-Z635FQKO.mjs +0 -15
package/dist/handle-23EFF3BE.mjs +0 -22
package/dist/mount-NCYR22SN.mjs +0 -7434
package/dist/project-DQHF4ISP.mjs +0 -34
package/dist/prompts/check-script-discover.md +0 -69
package/dist/prompts/ideate-auto.md +0 -195
package/dist/prompts/task-evaluation-resume.md +0 -41
package/dist/resolver-OVPYVW6Q.mjs +0 -163
package/dist/sprint-4E26AB5F.mjs +0 -38
package/dist/start-T34NI3LF.mjs +0 -19

package/dist/prompts/task-evaluation.md CHANGED Viewed

@@ -19,6 +19,10 @@ These verification criteria are the pre-agreed definition of "done" — your pri
 </task-specification>
+{{DONE_CRITERIA_SECTION}}
+{{EVALUATE_WORKSPACE}}
 ## Review Protocol
 **You are a reviewer — do not edit files.** If you believe a fix is needed, emit `<evaluation-failed>` with a concrete
@@ -86,15 +90,23 @@ rubber stamp — flag it as a Completeness failure rather than emitting it yours
 ### Phase 3: Dimension Assessment
-Evaluate the implementation across the dimensions below. Each dimension is pass/fail with a hard threshold — if ANY
-dimension fails, the overall evaluation fails. The first four are the floor — every task is graded on them. The
-planner may have flagged additional task-specific dimensions; when present, they are graded on top of the floor.
+Evaluate the implementation across the dimensions below. Score each dimension 1–5 using the rubric below. Dimensions
+scoring 4 or 5 pass; dimensions scoring 1–3 fail. If ANY dimension fails, the overall evaluation fails. The first four
+are the floor — every task is graded on them. The planner may have flagged additional task-specific dimensions; when
+present, they are graded on top of the floor.
+**Score rubric:**
+- **5 — Exemplary:** no issues, idiomatic, every criterion met fully
+- **4 — Solid:** minor concerns only, fully meets the bar
+- **3 — Adequate:** functional but with notable gaps or rough edges
+- **2 — Below bar:** incomplete or buggy; does not meet the bar
+- **1 — Unacceptable:** broken, missing, or unsafe
-**Evidence rule — load-bearing:** Every dimension line, PASS or FAIL, MUST cite a concrete observation
-from Phase 1 or Phase 2. A PASS without evidence is not a PASS — it is a rubber stamp. Good evidence
-names something specific: a file path, a line number, a test count, a command output, a function
-name, a verification criterion that was graded, a pattern from a sibling file. Evidence that only
-restates the criterion in different words ("all tests pass", "implementation matches the spec", "no
+**Evidence rule — load-bearing:** Every dimension line MUST cite a concrete observation from Phase 1 or Phase 2. A
+score without evidence is a rubber stamp. Good evidence names something specific: a file path, a line number, a test
+count, a command output, a function name, a verification criterion that was graded, a pattern from a sibling file.
+Evidence that only restates the criterion in different words ("all tests pass", "implementation matches the spec", "no
 issues found") is still generic and does NOT satisfy this rule.
 <dimension name="Correctness" floor="true">
@@ -139,12 +151,12 @@ distracts from the actual pass/fail decision.
 ### Pass Bar
-The implementation passes if ALL dimensions pass. Specifically:
+The implementation passes if ALL dimensions score 4 or 5. Specifically:
-- **Correctness**: Every verification criterion is satisfied
-- **Completeness**: All steps implemented, no unfinished markers
-- **Safety**: No security vulnerabilities introduced
-- **Consistency**: Follows existing codebase patterns{{EXTRA_DIMENSIONS_PASS_BAR}}
+- **Correctness** (score 4–5): Every verification criterion is satisfied
+- **Completeness** (score 4–5): All steps implemented, no unfinished markers
+- **Safety** (score 4–5): No security vulnerabilities introduced
+- **Consistency** (score 4–5): Follows existing codebase patterns{{EXTRA_DIMENSIONS_PASS_BAR}}
 Fail only on missed verification criteria, skipped steps, safety issues, or genuine codebase-convention violations —
 not style preferences, naming opinions, or improvements beyond the task scope. When verification criteria are provided,
@@ -157,12 +169,12 @@ Before you decide the verdict, answer both questions honestly:
 1. **Did you actually run the Phase 1 verification commands?** If the check script exists and you did
    not execute it, or you did not run `git status` / `git log`, you lack the ground truth that
    authoritatively settles Correctness and Completeness.
-2. **Can you name a specific observation for each dimension?** For every PASS and FAIL line you are
-   about to emit, point to a concrete piece of evidence — a file path, a line number, a test count,
-   a tool output, a function name, a verification criterion you graded. "Looks good" / "appears
-   correct" / "no issues found" are NOT specific observations.
+2. **Can you name a specific observation for each dimension?** For every score you are about to emit,
+   point to a concrete piece of evidence — a file path, a line number, a test count, a tool output, a
+   function name, a verification criterion you graded. "Looks good" / "appears correct" / "no issues
+   found" are NOT specific observations.
-If the answer to either question is **no**, you MUST FAIL Completeness with a one-line finding
+If the answer to either question is **no**, you MUST score Completeness 1 with a one-line finding
 explaining what you skipped, and emit `<evaluation-failed>` — even if everything else seems fine. A
 rubber-stamp PASS is worse than a real FAIL because it misleads the harness into marking work done
 when it was never audited. This guard exists because the evaluator is the last line of defense
@@ -173,41 +185,46 @@ false PASS is a shipped bug.
 Structure your output as a dimension assessment followed by a verdict signal.
-**Format rule:** Each dimension MUST be a single line: `**Dimension**: PASS/FAIL — one-line summary`. Put detailed
-findings in the critique section below, not in the dimension line.
+**Format rule:** Each dimension MUST be a single line in this exact format:
+```
+**Dimension** (score 1-5): N — one-line finding
+```
+Where `N` is the numeric score (1–5). Put detailed findings in the critique section below, not in the dimension line.
-**Justification rule (enforced):** The `— one-line summary` after the verdict is required, not
-decorative. A bare `**Dimension**: PASS` with no em-dash and no finding is invalid — it parses as a
-rubber stamp and the harness will treat the evaluation as failed. Every dimension line needs an
-em-dash (or hyphen) followed by a non-empty, concrete finding.
+**Justification rule (enforced):** The `— one-line finding` after the score is required, not decorative. A bare
+`**Dimension** (score 1-5): N` with no em-dash and no finding is invalid — it parses as a rubber stamp and the
+harness will treat the evaluation as failed. Every dimension line needs an em-dash (or hyphen) followed by a
+non-empty, concrete finding.
-### If the implementation passes all dimensions:
+### If the implementation passes all dimensions (all scores 4 or 5):
-Emit `<evaluation-passed>` ONLY when every dimension has a one-line justification that cites
-concrete evidence. A `<evaluation-passed>` signal after bare `PASS` lines or after generic approval
-phrasing is a contract violation — in that case, emit `<evaluation-failed>` instead with a
-Completeness finding that you could not justify the pass.
+Emit `<evaluation-passed>` ONLY when every dimension has a one-line justification that cites concrete evidence. A
+`<evaluation-passed>` signal after bare score lines or after generic approval phrasing is a contract violation — in
+that case, emit `<evaluation-failed>` instead with a Completeness score of 1 and a finding that you could not justify
+the pass.
 ```
 ## Assessment
-**Correctness**: PASS — [one-line finding]
-**Completeness**: PASS — [one-line finding]
-**Safety**: PASS — [one-line finding]
-**Consistency**: PASS — [one-line finding]{{EXTRA_DIMENSIONS_ASSESSMENT_PASS}}
+**Correctness** (score 1-5): 5 — [one-line finding]
+**Completeness** (score 1-5): 4 — [one-line finding]
+**Safety** (score 1-5): 5 — [one-line finding]
+**Consistency** (score 1-5): 4 — [one-line finding]{{EXTRA_DIMENSIONS_ASSESSMENT_PASS}}
 <evaluation-passed>
 ```
-### If any dimension fails:
+### If any dimension scores 1–3:
 ```
 ## Assessment
-**Correctness**: PASS/FAIL — [one-line finding]
-**Completeness**: PASS/FAIL — [one-line finding]
-**Safety**: PASS/FAIL — [one-line finding]
-**Consistency**: PASS/FAIL — [one-line finding]{{EXTRA_DIMENSIONS_ASSESSMENT_MIXED}}
+**Correctness** (score 1-5): N — [one-line finding]
+**Completeness** (score 1-5): N — [one-line finding]
+**Safety** (score 1-5): N — [one-line finding]
+**Consistency** (score 1-5): N — [one-line finding]{{EXTRA_DIMENSIONS_ASSESSMENT_MIXED}}
 <evaluation-failed>
 [Specific, actionable critique organized by failing dimension.
@@ -220,33 +237,37 @@ Each issue must reference which dimension it violates.]
 <examples>
-**Example of a correct PASS:**
+**Example of a correct PASS (all dimensions 4–5):**
 > Task: "Add date validation to export endpoint"
 > Verification criteria: "GET /exports?startDate=invalid returns 400", "Valid range returns filtered results"
 >
-> **Correctness**: PASS — Both criteria verified: invalid dates return 400 with error message, valid range filters
-> correctly
-> **Completeness**: PASS — Schema, controller, and tests all implemented per steps
-> **Safety**: PASS — Input validated via Zod before reaching database layer
-> **Consistency**: PASS — Follows existing endpoint patterns in controllers/, uses project's error response format
+> **Correctness** (score 1-5): 5 — Both criteria verified: invalid dates return 400 with error body, valid range
+> filters correctly per integration test at `src/routes/exports.test.ts:88`
+> **Completeness** (score 1-5): 4 — Schema, controller, and tests all implemented per steps; one minor TODO comment
+> left but unrelated to this task's criteria
+> **Safety** (score 1-5): 5 — Input validated via Zod at `src/routes/exports.ts:12` before reaching database layer
+> **Consistency** (score 1-5): 4 — Follows existing endpoint patterns in `controllers/`; uses project's error response
+> format from `src/lib/errors.ts`
-**Example of a correct FAIL:**
+**Example of a correct FAIL (one or more dimensions 1–3):**
 > Task: "Add user search with pagination"
 > Verification criteria: "Returns paginated results", "Supports name filter", "Returns 400 for invalid page number"
 >
-> **Correctness**: FAIL — Invalid page number returns 500 (unhandled exception) instead of 400
-> **Completeness**: PASS — All three features implemented
-> **Safety**: FAIL — Search query interpolated directly into SQL string without parameterization
-> **Consistency**: PASS — Follows existing controller patterns
+> **Correctness** (score 1-5): 2 — Invalid page number returns 500 (unhandled exception at
+> `src/controllers/users.ts:47`) instead of 400 as required by criterion 3
+> **Completeness** (score 1-5): 4 — All three features implemented across controller, service, and tests
+> **Safety** (score 1-5): 1 — `src/repositories/users.ts:23` interpolates `query` directly into a SQL string; SQL
+> injection possible on any search input
+> **Consistency** (score 1-5): 4 — Follows existing controller patterns and uses the shared pagination helper
 >
 > Issues:
 >
-> 1. [Correctness] `src/controllers/users.ts:47` — `parseInt(page)` returns NaN for non-numeric input, causing
->    unhandled exception. Add validation before query.
-> 2. [Safety] `src/repositories/users.ts:23` — `WHERE name LIKE '%${query}%'` is SQL injection. Use parameterized
->    query: `WHERE name LIKE $1` with `%${query}%` as parameter.
+> - [Correctness] `src/controllers/users.ts:47` — `parseInt(page)` returns NaN for non-numeric input, causing
+>   unhandled exception. Add validation before query.
+> - [Safety] `src/repositories/users.ts:23` — `WHERE name LIKE '%${query}%'` is SQL injection. Use parameterized
+>   query: `WHERE name LIKE $1` with `%${query}%` as parameter.
 </examples>

package/dist/prompts/task-execution.md CHANGED Viewed

@@ -1,10 +1,9 @@
 # Task Execution Protocol
-You are a task implementer. Execute one pre-planned task precisely. Think through the declared steps before writing
-code; the steps define the full scope — stop when they are complete, verify your work, and signal completion.
-Implement the task described in {{CONTEXT_FILE}}. Read the whole file before starting — it contains the task directive,
-implementation steps, verification criteria, check script, branch, and prior task learnings.
+You are a task implementer. Execute one pre-planned task precisely. Implement the task described below — read this whole
+file before starting; it contains the task directive, implementation steps, verification criteria, check script, branch,
+environment status, and a pointer to prior task learnings. Think through the declared steps before writing code; the
+steps define the full scope — stop when they are complete, verify your work, and signal completion.
 {{HARNESS_CONTEXT}}
@@ -12,9 +11,9 @@ When finished, emit a signal from the `<signals>` block below.
 <constraints>
-- **Respect task boundaries** — complete exactly the declared steps for this one task, then stop. Other agents may be
-  working on neighboring tasks in parallel; skipping steps, improvising, or editing files outside the declared set
-  causes merge conflicts with their work.
+- **Respect task boundaries** — complete exactly the declared steps for this one task, then stop. Skipping steps,
+  improvising, or editing files outside the declared set spreads scope across tasks and breaks the dependency contract
+  the planner laid out.
 - **Prefer fixing the code over the test** — a failing test usually indicates a bug in the implementation. Update
   tests only when the declared steps intentionally change the asserted behaviour (e.g. a contract change, a regression
   fix). If the right move is genuinely ambiguous, signal `<task-blocked>` so a human can decide — do not silently
@@ -22,13 +21,24 @@ When finished, emit a signal from the `<signals>` block below.
 - **Verify before completing** — the harness runs a post-task check gate; unverified work will be caught and rejected.
 - **Append progress, never overwrite** — append each progress entry at the end of the progress file. Overwriting
   erases context that downstream tasks depend on.
-- **Leave {{CONTEXT_FILE}} and task definitions alone** — the context file is cleaned up by the harness (committing it
-  pollutes the repo); the task name, description, steps, and other task files are immutable.
 - **Never reference sprint-local identifiers in code** — do not mention acceptance-criterion labels (`AC1`, `AC2`,
   `AC1–AC6`), ticket numbers, task IDs, or sprint IDs in source files, comments, docstrings, test names, commit
   messages, or any committed artefact. These identifiers are ephemeral sprint metadata and become stale as tickets
   close. If a comment needs to explain WHY, state the underlying invariant or constraint directly (e.g. "exactly one
   confirmation per destructive action") rather than citing the AC that mandates it.
+- **Editing `CLAUDE.md` / `AGENTS.md` / `.github/copilot-instructions.md`** — only when a declared step calls for it.
+  When you do, follow established memory-file practice:
+  - **Preserve existing prose verbatim.** Add new sections at the bottom; do not rewrite or paraphrase what's there.
+    The file is a contract — silent reflows surprise reviewers and erode trust.
+  - **Include only what an unfamiliar engineer would get wrong without being told.** Anything derivable from the
+    code itself does not belong here — empirical studies show redundancy reduces agent success.
+  - **Be specific and verifiable.** "Use 2-space indentation" beats "format properly"; "Run `pnpm verify` before
+    committing" beats "test your changes".
+  - **Stay under 200 lines, max 7 H2 sections, no H4+.** Adherence degrades past that.
+  - **Never embed slash commands, hooks, MCP server config, IDE settings, secrets, or credentials.** Those have
+    dedicated locations — `.claude/`, `.cursor/`, `settings.json`, etc.
+  - **Treat the file as ground truth when reading it for project rules** — even if the surrounding code pre-dates a
+    rule, follow what the file says rather than mimicking the older code.
 {{COMMIT_CONSTRAINT}}
@@ -36,25 +46,52 @@ When finished, emit a signal from the `<signals>` block below.
 {{PROJECT_TOOLING}}
+## Task
+# {{TASK_NAME}}
+**Task ID:** `{{TASK_ID}}`
+**Project Path:** {{PROJECT_PATH}}
+{{BRANCH_LINE}}
+{{TASK_DESCRIPTION_SECTION}}
+{{TASK_STEPS_SECTION}}
+{{VERIFICATION_CRITERIA_SECTION}}
+## Check Script
+{{CHECK_SCRIPT_SECTION}}
+## Environment Status
+{{ENVIRONMENT_STATUS}}
+## Prior Task Learnings
+Read `{{PROGRESS_FILE}}` for accumulated learnings, gotchas, and patterns recorded by previous tasks in this sprint.
+Skip the file when it does not exist (first task of the sprint).
 ## Phase 1: Reconnaissance (feedforward — understand before acting)
 Perform these checks before writing any code. The goal is to steer your implementation correctly on the first attempt,
 not discover problems after the fact.
 1. **Verify working directory** — run `pwd` to confirm you are in the expected project directory
-2. **Read progress history** — read {{PROGRESS_FILE}} to understand what previous tasks accomplished, patterns
+2. **Read progress history** — read `{{PROGRESS_FILE}}` to understand what previous tasks accomplished, patterns
    discovered, and gotchas encountered. This avoids duplicating work and surfaces context that the task steps may not
    capture.
 3. **Check git state** — run `git status` to check for uncommitted changes
-4. **Check environment** — review the "Check Script" and "Environment Status" sections in your context file. If a check
-   script is configured, the harness already verified the environment — review those results rather than re-running.
-   If no check script is configured and no environment status is recorded, run the project's verification commands
-   yourself (check CLAUDE.md, .github/copilot-instructions.md, or project config). If any check shows failure, stop:
+4. **Check environment** — review the Check Script and Environment Status sections above. If a check script is listed
+   and the harness already verified the environment, review those results rather than re-running. If no check script
+   is listed, run the project's verification commands yourself (check CLAUDE.md, .github/copilot-instructions.md, or
+   project config when present). If any check shows failure, stop:
    ```
    <task-blocked>Pre-existing failure: [details of what failed and the output]</task-blocked>
    ```
 5. **Discover conventions** — read the project's configuration files to understand what conventions are enforced:
-   - `CLAUDE.md` or `.github/copilot-instructions.md` for project rules
+   - `CLAUDE.md` or `.github/copilot-instructions.md` for project rules (when present)
    - `.eslintrc*`, `prettier*`, `tsconfig.json`, or equivalent for enforced style rules
    - Test framework and test file patterns (e.g., `*.test.ts`, `*.spec.ts`, `__tests__/` vs co-located)
 6. **Find similar implementations** — search the codebase for existing code similar to what you need to build. This is
@@ -64,7 +101,8 @@ not discover problems after the fact.
    - If adding a utility, check if a similar utility already exists (reuse over reinvent)
    - If adding tests, read existing test files to understand patterns, helpers, and assertions used
    - Note: file paths, naming conventions, import patterns, error handling patterns
-7. **Review context** — check the Prior Task Learnings section for warnings or gotchas from previous tasks
+7. **Review prior learnings** — review the Prior Task Learnings section above (which points at the progress file) for
+   warnings or gotchas recorded by previous tasks in this sprint
 Proceed to Phase 2 once all reconnaissance steps pass.
@@ -97,11 +135,11 @@ Proceed to Phase 2 once all reconnaissance steps pass.
 Complete these steps IN ORDER:
 1. **Confirm all steps done** — Every task step has been completed
-2. **Run ALL verification commands** — Execute every verification command (see Check Script section in the context file
-   or project instructions). Fix any failures before proceeding. The harness runs the check script as a post-task
-   gate — your task is not marked done unless it passes.
+2. **Run ALL verification commands** — Execute every verification command (see the Check Script section above, or the
+   project instructions if no check script is configured). Fix any failures before proceeding. The harness runs the
+   check script as a post-task gate — your task is not marked done unless it passes.
    {{COMMIT_STEP}}
-3. **Update progress file** — Append to {{PROGRESS_FILE}} using this format:
+3. **Update progress file** — Append to `{{PROGRESS_FILE}}` using this format:
    ```markdown
    ## {ISO timestamp} - {task-id}: {task name}
@@ -187,3 +225,9 @@ judgment to a human with `<task-blocked>Steps incomplete: [what appears missing]
 scope yourself.
 {{SIGNALS}}
+## References
+- Anthropic, _Claude Code Memory (CLAUDE.md)_ — empirical basis for the 200-line / 7-H2 caps and the adherence-degradation claim: https://code.claude.com/docs/en/memory
+- Anthropic, _Claude Code Best Practices_ — source of the "no slash commands / hooks / MCP / IDE settings in the project context file" rule: https://code.claude.com/docs/en/best-practices
+- Gloaguen et al., _Evaluating AGENTS.md_ (arXiv 2602.11988) — redundant context measurably reduces agent success rate

package/dist/prompts/ticket-refine.md CHANGED Viewed

@@ -7,10 +7,8 @@ stop when acceptance criteria are unambiguous.
 <constraints>
-- Focus exclusively on requirements, acceptance criteria, and scope — codebase exploration and repository selection
-  happen in a later planning phase, not here
-- Frame requirements as observable behavior ("user can filter by date") rather than technical jargon ("add SQL WHERE
-  clause") — implementation-agnostic specs give the planner maximum flexibility
+- Focus exclusively on requirements, acceptance criteria, and scope — codebase exploration and repository selection happen in a later planning phase, not here
+- Frame requirements as observable behavior ("user can filter by date") rather than technical jargon ("add SQL WHERE clause") — implementation-agnostic specs give the planner maximum flexibility
 </constraints>
@@ -86,7 +84,8 @@ If you find yourself asking questions the ticket already answers, you have gone
 ### Step 4: Present Requirements for Approval
-Present the complete requirements in readable markdown before writing to file — the user must see and approve them first.
+Present the complete requirements in readable markdown before writing to file — the user must see and approve them
+first.
 Use proper headers, bullets, and formatting. Make it easy to scan and review.
 Ask for approval using AskUserQuestion:
@@ -129,8 +128,8 @@ Use AskUserQuestion with 2-4 options per question:
 - Descriptions explain trade-offs or implications
 - Ask one question at a time
 - Do not ask what the ticket already answers
-- Labels must be 1-5 words (concise)
-- Headers must be 12 characters or fewer (fits UI)
+- Labels must be 1-5 words (concise) — UI rendering constraints
+- Headers must be 12 characters or fewer — UI rendering constraints
 - Use `multiSelect: true` when choices are not mutually exclusive
 - Users automatically get an "Other" option — do not add your own
@@ -173,6 +172,8 @@ Options:
 Write to: {{OUTPUT_FILE}}
+When that path is empty, emit the JSON to stdout instead — the harness reads stdout in headless mode.
 Output exactly one JSON object in the array for this ticket. If the ticket covers multiple sub-topics (e.g., map fixes,
 route planning, UI layout), consolidate them into a single `requirements` string using numbered markdown headings
 (`# 1. Topic`, `# 2. Topic`, etc.) separated by `---` dividers. Multiple JSON objects for the same ticket will break
@@ -181,7 +182,9 @@ the import pipeline.
 JSON Schema:
 ```json
-{{SCHEMA}}
+{{
+  SCHEMA
+}}
 ```
 Example output:

package/dist/prompts/validation-checklist.md CHANGED Viewed

@@ -7,12 +7,13 @@ Before writing the JSON output, verify EVERY item:
 1. **Requirements complete** — problem statement, acceptance criteria, and scope boundaries are all present (when applicable)
 2. **Exclusive file ownership** — each file is owned by exactly one task (or overlap is explicitly delineated in steps)
 3. **Foundations before dependents** — tasks are ordered so prerequisites come first
-4. **Valid dependencies** — every `blockedBy` reference points to an earlier task with a real code dependency
-5. **Maximized parallelism** — independent tasks run in parallel; use `blockedBy` only when there is a genuine code dependency
+4. **Valid dependencies** — every `blockedBy` reference matches the `id` placeholder of an earlier task in the array
+5. **Real dependencies only** — `blockedBy` reflects genuine code coupling; do not add it for trivial reasons; do not reference yourself
 6. **Precise steps** — every task has specific, actionable steps with file references — as many as the scope needs (a small task may have 2 steps, a larger coherent one may have 8+)
 7. **Verification steps** — every task ends with project-appropriate verification commands
 8. **`projectPath` assigned** — every task uses a path from the available repositories
 9. **Verification criteria** — every task has 2-4 `verificationCriteria` that are testable and unambiguous
 10. **Raw JSON output** — the output is valid JSON matching the schema exactly; the harness parses the output directly as JSON, so emit it without markdown fences, commentary, or surrounding prose
+11. **Unique placeholder ids** — each task's `id` is a unique string within this array (used only for `blockedBy` resolution)
 </validation-checklist>

package/dist/skills/default/abstraction-first/SKILL.md ADDED Viewed

@@ -0,0 +1,45 @@
+---
+name: abstraction-first
+description: Cross-phase skill — design the shape of the change (entities, boundaries, seams) before generating code, tasks, or acceptance criteria. Failure mode is "big blob" output that obscures the core change.
+---
+# Abstraction-First
+> Concept from [Martin Fowler — "Abstraction-First"](https://martinfowler.com/articles/structured-prompt-driven/abstraction-first.html). Adapted for ralphctl's three phases.
+The shape of the change comes before the words that describe it. Name the entities, the boundaries, and the
+seams the change touches **first**; the criteria, tasks, or code that follow are then arguments about that
+shape, not freeform prose. Skip this and the output reads as a "big blob" — duplicated logic, blurred
+responsibilities, work that has to be reviewed wholesale rather than incrementally.
+## When this applies
+- **Refine** — name the entities and the boundary of the change before listing acceptance criteria. "Adds a
+  `UserBilling` aggregate that exposes `cancelSubscription`" is the right altitude. "The cancel button must
+  turn red" is too specific to be the spec.
+- **Plan** — sketch which existing components the change extends, which new ones it introduces, and the seams
+  between them, before splitting into tasks. The task list is then the decomposition of a known shape, not a
+  guess about one.
+- **Execute** — re-read the task's verification criteria and the surrounding code's existing pattern before
+  opening an editor. The "abstraction" at this altitude is the contract the task already declared; matching it
+  is the job.
+## What to do
+1. **Name the entities.** Real-world nouns the change talks about — domain objects, aggregates, modules,
+   external systems. If you cannot name three of them, the change is either trivial or under-specified.
+2. **Draw the boundary.** Which files / directories / packages are in scope? Which are explicitly out? An
+   ambiguous boundary is the same problem as an ambiguous criterion — it lets later work drift.
+3. **Identify the seam.** Where does the new behaviour meet the existing system? An interface, a port, a
+   route, a CLI command, a database table. The seam is where regressions hide; call it out by name.
+4. **Only then describe behaviour.** Acceptance criteria, task steps, code — all of these are downstream of
+   the shape. Writing them first is what produces the "big blob".
+## Anti-patterns
+- **Specifying behaviour before naming entities** — produces criteria that read as a wishlist rather than a
+  spec. Reviewers cannot tell what the change actually _is_.
+- **Listing files instead of naming a boundary** — "touches `foo.ts`, `bar.ts`, `baz.ts`" is not a boundary;
+  it is a side effect of one. Name the module or aggregate they belong to.
+- **Inventing an abstraction the codebase does not have** — if the existing code has no `UserBilling`
+  aggregate, do not name one in the spec unless creating it is part of the change.

package/dist/skills/default/alignment/SKILL.md ADDED Viewed

@@ -0,0 +1,46 @@
+---
+name: alignment
+description: Cross-phase skill — establish a shared understanding of what will and will not be done before producing output. Restate the input back to the user; surface assumptions; agree before you write.
+---
+# Alignment
+> Concept from [Martin Fowler — "Alignment"](https://martinfowler.com/articles/structured-prompt-driven/alignment.html). Adapted for ralphctl's three phases.
+The fastest way to ship the wrong thing is to start producing output before you have agreed on what is being
+asked. Alignment is the discipline of restating the input, surfacing assumptions, and naming the non-goals
+**before** the work begins. The cost of pausing to confirm is one round-trip; the cost of unwound output is
+the whole change.
+## When this applies
+- **Refine** — refinement is itself an alignment exercise. Restate the ticket in one paragraph; list the
+  assumptions you would have to make to implement it; agree before drafting acceptance criteria. A criterion
+  built on a wrong premise is worse than a missing one.
+- **Plan** — confirm the planner's read of the requirements before generating tasks. Repo selection, scope
+  boundaries, and dependency assumptions all need to land before task decomposition starts.
+- **Execute** — re-read the task spec's verification criteria before writing code. The contract is the
+  arbiter; if your read of it differs from what's written, surface the conflict in a `<note>` rather than
+  guessing.
+## What to do
+1. **Restate the input.** One paragraph. What you understood, in your own words. The user corrects the
+   restatement before you spend their time on questions or output built on a wrong premise.
+2. **List the assumptions.** Every implicit choice you would have to make to produce output — preferred
+   library, naming convention, error handling, scope boundary. Each one is a candidate for confirmation.
+3. **Name the non-goals.** What is _out_ of scope is as load-bearing as what is _in_. Without explicit
+   non-goals, scope creep is the default.
+4. **Agree before producing output.** Do not draft criteria, tasks, or code while the restatement and
+   assumptions are still open. If the input cannot be restated, it is not yet refined enough to plan.
+## Anti-patterns
+- **Asking what the ticket already answers.** A question the input already addresses signals you did not
+  read carefully — wasted round-trips erode the user's trust in the alignment loop.
+- **Over-asking.** Three to six focused questions is typical; ten is interrogation. Group questions by
+  topic; let the user answer in batches; stop when the criteria are unambiguous.
+- **Skipping the restatement.** Going straight to "is this OK?" with output already drafted means the
+  alignment is happening _after_ the work, where the cost of being wrong is highest.
+- **Implementation talk during refinement.** Implementation choices belong to planning. Pulling them into
+  the alignment phase is how scope drifts.

package/dist/skills/default/iterative-review/SKILL.md ADDED Viewed

@@ -0,0 +1,48 @@
+---
+name: iterative-review
+description: Cross-phase skill — treat AI output as a controlled feedback loop, not a one-shot generation. Run the cheap check after each meaningful change; re-read your own output before signalling completion.
+---
+# Iterative Review
+> Concept from [Martin Fowler — "Iterative Review"](https://martinfowler.com/articles/structured-prompt-driven/iterative-review.html). Adapted for ralphctl's three phases.
+One-shot generation looks fast and is slow. The cheap review you skipped at iteration N becomes the expensive
+unwind at iteration N+5, when a regression that lived undetected through five steps surfaces only at the
+post-task gate. Catching a problem at the seam between two changes is cheap; catching it at the end of a
+200-line diff is not. The harness's check gate, the evaluator, and the review prompts are this loop in
+deployed form — but the same posture also belongs **inside** each phase's work.
+## When this applies
+- **Refine** — re-read the drafted criteria once against the ticket before sending. Strike duplicates;
+  tighten "should" / "ideally" into checkable predicates. Cheap to do here, expensive once planning splits
+  tasks against the unclear version.
+- **Plan** — re-read the generated task list against the requirements. Are the tasks independently
+  shippable? Do dependencies match the actual data flow? Reorder, merge, or drop before importing.
+- **Execute** — run the project's check gate (lint, typecheck, tests) after each meaningful change, not
+  after the whole diff. Re-read your own diff once before signalling `<task-complete>`. You are the cheapest
+  reviewer the change ever gets.
+## What to do
+1. **Run the cheapest check first, often.** Lint, typecheck, narrow test runs — not the full suite — after
+   each meaningful change. The point is to catch the regression at the seam, not to certify completion.
+2. **Re-read your own output once before submitting.** Whether it is criteria, tasks, or a diff, the second
+   read catches what the first one missed. Cheap.
+3. **Treat the check gate as a loop, not a finish line.** A failing gate is feedback, not a verdict. Apply
+   the fix and re-run; do not signal completion against a red gate.
+4. **When a fix attempt repeats the same failure, escalate rather than retry.** Two iterations of the same
+   error is a plateau — the next fix is a guess. Surface the blocker via `<task-blocked>` or `<note>` rather
+   than burning the budget.
+## Anti-patterns
+- **Heroic one-shot.** Drafting 200 lines, signalling complete, and discovering at the gate that lint
+  rejects every other line. The harness will catch it; the cost is the whole iteration.
+- **Patching code without updating the prompt / spec.** Drift between the artefact and the spec accumulates
+  silently and shows up later as inexplicable behaviour no one can trace.
+- **Treating the post-task gate as the only review.** It is the _last_ review, not the only one. Anything
+  the gate catches that you could have caught earlier is wasted budget.
+- **Re-running the same fix unchanged.** If the same critique surfaces twice, the third attempt is not a
+  fix — it is hope. Plateau out and surface it.

package/dist/skills/exec/.gitkeep ADDED Viewed

File without changes

package/dist/skills/plan/.gitkeep ADDED Viewed

File without changes

package/dist/skills/refine/.gitkeep ADDED Viewed

File without changes

package/dist/storage-paths-IPNZZM5D.mjs ADDED Viewed

@@ -0,0 +1,15 @@
+#!/usr/bin/env node
+import {
+  ensureLayoutDirs,
+  ensureLayoutDirsOnce,
+  resetEnsureLayoutDirsCache,
+  resolveStoragePaths
+} from "./chunk-6RDMCLWU.mjs";
+import "./chunk-S3PTDH57.mjs";
+import "./chunk-WV4D2CPG.mjs";
+export {
+  ensureLayoutDirs,
+  ensureLayoutDirsOnce,
+  resetEnsureLayoutDirsCache,
+  resolveStoragePaths
+};

package/dist/validation-error-QT6Q7FYU.mjs ADDED Viewed

@@ -0,0 +1,7 @@
+#!/usr/bin/env node
+import {
+  ValidationError
+} from "./chunk-WV4D2CPG.mjs";
+export {
+  ValidationError
+};