npm - ralphctl - Versions diffs - 0.7.3 → 0.8.0 - Mend

ralphctl 0.7.3 → 0.8.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (19) hide show

package/README.md +83 -84
package/dist/cli.mjs +12265 -4933
package/dist/manifest.json +4 -3
package/dist/prompts/_partials/decisions.md +14 -0
package/dist/prompts/_partials/signals-feedback.md +18 -0
package/dist/prompts/_partials/validation-checklist.md +5 -4
package/dist/prompts/apply-feedback/template.md +24 -23
package/dist/prompts/create-pr/template.md +73 -0
package/dist/prompts/detect-scripts/template.md +1 -8
package/dist/prompts/detect-skills/template.md +1 -9
package/dist/prompts/evaluate/template.md +109 -121
package/dist/prompts/ideate/template.md +48 -22
package/dist/prompts/implement/template.md +57 -79
package/dist/prompts/plan/template.md +78 -45
package/dist/prompts/readiness/template.md +32 -28
package/dist/prompts/refine/template.md +35 -28
package/package.json +2 -2
package/dist/prompts/_partials/signals-evaluation.md +0 -14
package/dist/prompts/_partials/signals-task.md +0 -26

package/dist/manifest.json CHANGED Viewed

@@ -1,12 +1,13 @@
 {
   "version": 1,
-  "generatedAt": "2026-05-20T06:49:10.467Z",
+  "generatedAt": "2026-05-24T20:04:47.241Z",
   "assets": [
+    "prompts/_partials/decisions.md",
     "prompts/_partials/harness-context.md",
-    "prompts/_partials/signals-evaluation.md",
-    "prompts/_partials/signals-task.md",
+    "prompts/_partials/signals-feedback.md",
     "prompts/_partials/validation-checklist.md",
     "prompts/apply-feedback/template.md",
+    "prompts/create-pr/template.md",
     "prompts/detect-scripts/template.md",
     "prompts/detect-skills/template.md",
     "prompts/evaluate/template.md",

package/dist/prompts/_partials/decisions.md ADDED Viewed

@@ -0,0 +1,14 @@
+## Recording architectural decisions
+When you make a non-obvious architectural or implementation choice — one a future reviewer might disagree
+with or need to understand — emit `<decision>your concise rationale</decision>` so the harness can record
+it in the sprint's decisions log.
+- **Emit sparingly** — only for choices a future maintainer could not recover from the diff alone (e.g.
+  picking one valid pattern over another, choosing a tradeoff, deliberately deviating from a project
+  convention). Obvious changes do not need a decision entry.
+- **One sentence per decision** — lead with the choice, then the rationale: "Used X over Y because Z." Use
+  two sentences only when the rationale genuinely cannot be compressed without losing the key tradeoff.
+- The harness appends timestamp + task id + commit sha automatically — do not include those yourself.
+- Multiple `<decision>` tags per task are allowed when distinct choices were made; emit one tag per
+  decision rather than packing several into one body.

package/dist/prompts/_partials/signals-feedback.md ADDED Viewed

@@ -0,0 +1,18 @@
+<signals>
+Use these signals to communicate the outcome of this feedback round to the harness. The harness parses your output
+for these tags; nothing else in your message is treated as a control signal.
+- `<task-complete>` — Marks the round as successfully applied. Emit when every requested change is on disk and
+  the working tree reflects the user's direction. The harness commits your edits afterward and runs the project's
+  verify script itself — do not run verification yourself, and do not commit.
+- `<task-blocked>reason</task-blocked>` — Marks the round as un-appliable. Use when you genuinely cannot proceed:
+  the feedback is ambiguous in WHAT (not where), it contradicts an invariant in a prior round, or it asks for
+  information you do not have. Be concrete in the reason — the harness surfaces it verbatim to the operator and
+  ends the review loop.
+Emit exactly one of the two signals above. Any of the implement-flow signals (`<change>`, `<learning>`,
+`<note>`, `<decision>`, `<task-verified>`, `<commit-message>`, `<progress>`) are not consumed by the review
+flow — emitting them wastes tokens and produces no on-disk effect.
+</signals>

package/dist/prompts/_partials/validation-checklist.md CHANGED Viewed

@@ -6,16 +6,17 @@ Before writing the JSON output, verify EVERY item:
 1. **Requirements understood** — every approved ticket is reflected in at least one task; nothing in scope is dropped.
 2. **Exclusive file ownership** — each file is owned by exactly one task. When two tasks must edit the same file,
-   make the relationship explicit via `dependsOn` so they run in sequence, not in parallel.
-3. **Foundations before dependents** — order tasks so prerequisites come first; `dependsOn` reflects genuine code
+   make the relationship explicit via `blockedBy` so they run in sequence, not in parallel.
+3. **Foundations before dependents** — order tasks so prerequisites come first; `blockedBy` reflects genuine code
    coupling, not arbitrary preference.
-4. **Valid `dependsOn` references** — every id in `dependsOn` matches an earlier task's `id` placeholder; no
+4. **Valid `blockedBy` references** — every id in `blockedBy` matches an earlier task's `id` placeholder; no
    self-edges; no cycles.
 5. **Precise steps** — each task has 2–8 specific, actionable steps. Each step references concrete files or
    functions; "implement the feature" is not a step.
 6. **Verification criteria** — each task has 2–4 `verificationCriteria` that are testable and unambiguous.
    "Tests pass" alone is too vague — name the behaviour or invariant that proves the task is done.
-7. **Repository assignment** — every task's `repositoryId` is one of the sprint's affected repositories.
+7. **Repository assignment** — every task's `projectPath` matches one of the absolute paths listed under
+   "Selected repositories" above.
 8. **Raw JSON output** — output a single JSON array matching the Task schema. The harness parses your output
    directly; emit it without markdown fences, commentary, or surrounding prose.
 9. **Unique placeholder ids** — each task's `id` is a unique string within this array (used only for

package/dist/prompts/apply-feedback/template.md CHANGED Viewed

@@ -16,8 +16,10 @@ of them as they wrote it.
 code, don't add tests the user didn't ask for, don't tighten unrelated types. The user is
 shaping the work; you execute their direction.
-**Commit on completion.** When you've applied the round's feedback, the harness will commit
-your changes with the message `feedback(round-N): <body-snippet>`. Do not commit yourself.
+**Commit and verify are the harness's job.** When you've applied the round's feedback, the harness
+commits your changes with the message `feedback(round-N): <body-snippet>` and then runs the project's
+verify script itself. Do not commit, and do not run verify scripts — emit `<task-complete>` once your
+edits are on disk and let the harness drive the gate.
 **Make the edits — don't just describe them.** The harness does not apply changes for you;
 you must write the files. A written-out description of the edits, without actual file writes,
@@ -60,18 +62,23 @@ This is the round you are applying. Read it carefully and make ONLY the changes
 <progress>
 The sprint's `progress.md` — pinned learnings and decisions, plus per-task activity. Use it
-for context (don't re-discover what the prior tasks already learned), and emit `<learning>`
-or `<decision>` if your application surfaces new insight.
+for context so you don't re-discover what the prior tasks already established. This is a
+review-time prompt — the review flow does not mine `<learning>` / `<decision>` / `<note>`
+back into `progress.md`, so do not emit them; surface insights inside the change itself
+(via tests, docstrings, or the targeted edit).
 {{PROGRESS}}
 </progress>
-You are working in this project directory:
+## Repositories
-```
-{{PROJECT_PATH}}
-```
+The sprint targets the repositories below. Each line is `- \`<absolute-path>\` (<name>)`. Decide which
+repository (or repositories) the latest round touches based on the feedback content and the relevant
+source layout. The harness mounts every repository as a workspace root — open files via the absolute
+paths shown.
+{{REPOSITORIES}}
 ## Protocol
@@ -98,21 +105,15 @@ Then orient before editing:
    files when the round is symptom-described rather than file-described).
 3. **Do not commit.** The harness commits your changes with `feedback(round-N): <body-snippet>`.
-### Phase 3 — Verification
-1. **Run the check script** (when one is configured in the Project Tooling section). Record its
-   output verbatim for `<task-verified>`.
-2. **When no check script is configured**, emit
-   `<task-verified>no check script configured; change applied</task-verified>` so the harness can
-   record that the round produced changes without a verification gate.
-3. **Signal completion** with `<task-complete>` once the change is applied and verification (if
-   any) passed.
+### Phase 3 — Signal outcome
-If you cannot apply the feedback (ambiguous, contradicts an invariant, missing context that
-prior rounds did not supply), emit `<task-blocked>reason</task-blocked>` with a concrete
-explanation. Ambiguity in WHERE to apply the change is not a blocker — pick the narrowest
-plausible target. Ambiguity in WHAT to do is.
+When every requested change is on disk, emit `<task-complete>`. The harness then commits your edits
+and runs the project's verify script — you do not run either step yourself.
-When finished, emit a verdict signal from the `<signals>` block below.
+If you cannot apply the feedback (the request is ambiguous in WHAT to do, contradicts an invariant
+established by a prior round, or asks for information neither this round nor the feedback log
+supplies), emit `<task-blocked>reason</task-blocked>` with a concrete explanation. Ambiguity in
+WHERE to apply the change is not a blocker — pick the narrowest plausible target. Ambiguity in WHAT
+to do is.
-{{SIGNALS}}
+{{OUTPUT_CONTRACT_SECTION}}

package/dist/prompts/create-pr/template.md ADDED Viewed

@@ -0,0 +1,73 @@
+# Pull Request Authoring Protocol
+You are authoring a pull-request title and body for a branch that is ready to merge. Audience: the
+project's maintainers reviewing the PR. Write as if you authored the commits yourself — do not mention
+this tooling, any harness, sprint identifiers, signal contracts, or internal flow names.
+{{HARNESS_CONTEXT}}
+## Branch under review
+- **Head branch:** `{{HEAD_BRANCH}}` (already pushed to `origin`)
+- **Base branch:** `{{BASE_BRANCH}}`
+## Tickets the branch addresses
+{{TICKET_SUMMARY}}
+## How to gather context
+Run these from your cwd to see exactly what is changing:
+- `git log {{BASE_BRANCH}}..HEAD` — the commit history on this branch
+- `git diff {{BASE_BRANCH}}...HEAD --stat` — the file-level change summary
+- `git diff {{BASE_BRANCH}}...HEAD` — the full diff, when you need to inspect specific changes
+Lean on `--stat` to group changes sensibly; only read the full diff for sections you cannot summarise
+from commit messages alone.
+## What to author
+### Title
+One line, imperative present-tense, ≤70 characters. Examples — "Add CSV export for transactions",
+"Fix race in session locking". Do not prefix with the branch name, ticket id, or `feat:` / `fix:` —
+the project's commit-message convention is independent and already applied at commit time.
+### Body
+The body has three sections, in this order:
+1. **Summary** — 1–3 sentences naming what the branch does and why. Focus on intent and observable
+   behaviour change; do not describe file paths or implementation mechanics in the summary.
+2. **`## Changes`** — bullet list of what changed, grouped sensibly (by feature, module, or layer —
+   not file-by-file). Each bullet is one short sentence.
+3. **`## Test plan`** — markdown checklist of how a reviewer would verify the branch. Concrete
+   actions, not abstractions. Include both manual checks and automated coverage when applicable.
+End the body with the verbatim issue references below, if any are present:
+```
+{{ISSUE_REFS}}
+```
+If `{{ISSUE_REFS}}` is empty, omit the trailing closes block entirely — do not invent issue
+numbers, and do not write "no related issues".
+## Constraints
+- Stay implementation-agnostic in the summary — name behaviour, not call sites.
+- Never reference this tooling, any harness, sprint ids, internal flow names, or the AI itself.
+  Reviewers should not be able to tell from the PR description that it was authored with assistance.
+- Use em-dash `—` (not a plain hyphen) for explanatory clauses, matching the project's house style.
+- Do not invent acceptance criteria, ticket numbers, or roadmap items that are not visible in
+  the diff or the ticket summary above.
+## Anti-patterns
+- A summary that lists files instead of behaviour.
+- A title that exceeds 70 characters or reads as past-tense ("Added X" → "Add X").
+- A "Test plan" that says "see CI" — name the concrete checks.
+- Inventing a "Closes #N" line when `{{ISSUE_REFS}}` is empty.
+{{OUTPUT_CONTRACT_SECTION}}

package/dist/prompts/detect-scripts/template.md CHANGED Viewed

@@ -108,11 +108,4 @@ tools — the project's scripts are the documented contract.
 ### Phase 3 — Output
-Emit the elements below, each on its own line, no preamble, no commentary, no markdown fences
-around the tags:
-1. `<setup-script>…single shell line…</setup-script>` — omit only when the project documents no
-   setup step.
-2. `<verify-script>…single shell line…</verify-script>` — omit only when the project documents no
-   verification commands.
-3. `<note>…</note>` — optional, one short observation naming the source file(s) you relied on.
+{{OUTPUT_CONTRACT_SECTION}}

package/dist/prompts/detect-skills/template.md CHANGED Viewed

@@ -125,12 +125,4 @@ specific to this repo.
 ### Phase 3 — Output
-Emit the elements below, each as a single block, no preamble, no commentary, no markdown fences
-around the tags themselves:
-1. `<setup-skill>…multi-paragraph markdown body…</setup-skill>` — omit only when an existing
-   project skill already covers sprint setup for this repo.
-2. `<verify-skill>…multi-paragraph markdown body…</verify-skill>` — omit only when an existing
-   project skill already covers post-task verification for this repo.
-3. `<note>…</note>` — optional, one short observation naming the source file(s) relied on, or
-   noting which existing skill made a tag redundant.
+{{OUTPUT_CONTRACT_SECTION}}

package/dist/prompts/evaluate/template.md CHANGED Viewed

@@ -9,16 +9,19 @@ to confirm what they claim.
 <constraints>
-**You are a reviewer — do not edit files.** If you believe a fix is needed, emit `<evaluation-failed>` with a
-concrete critique; the harness will resume the generator to apply the fix. Do not run `git stash`, do not edit
-tests, do not create commits. Your tools are read-only: `git status`, `git log`, `git diff`, file reads, and
-running existing check scripts. Any write operation is a protocol violation.
+**You are a reviewer — do not edit files.** If you believe a fix is needed, emit an `evaluation` signal with
+`status: "failed"` and a concrete critique; the harness will resume the generator to apply the fix. Do not run
+`git stash`, do not edit tests, do not create commits. Your tools are read-only: `git status`, `git log`,
+`git diff`, file reads, and running existing verify scripts and per-criterion auto commands. Any write
+operation other than `signals.json` is a protocol violation.
 </constraints>
 <task-specification>
-These verification criteria are the pre-agreed definition of "done" — your primary grading rubric.
+These verification criteria are the pre-agreed definition of "done" — your primary grading rubric. The task
+contract at `{{CONTRACT_PATH}}` is the authoritative source; read it before starting. The criteria block
+below mirrors that file for in-context reference.
 **Task:** {{TASK_NAME}}
@@ -34,9 +37,21 @@ You are working in this project directory:
 {{PROJECT_PATH}}
 ```
-## Check Script
+## Verify Script
-{{CHECK_SCRIPT_SECTION}}
+{{VERIFY_SCRIPT_SECTION}}
+## Prior progress
+Below is the sprint's `progress.md` body so you can judge this round's work against what
+already shipped — prior tasks' decisions, changes, learnings, and notes. Use it to spot
+inconsistencies with established direction and to avoid critiquing the generator for
+following a decision already recorded in earlier rounds.
+{{PRIOR_PROGRESS}}
+If the block above is empty, no prior progress has been recorded — this is the first
+task-attempt of the sprint.
 ## Project Tooling
@@ -52,26 +67,43 @@ reasoning produces sharper reviews than jumping straight to verdicts.
 Then run deterministic checks first — these are cheap, fast, and authoritative.
-1. **Run the check script** (when configured in the Check Script section above) — this is the same gate the
+1. **Run the verify script** (when configured in the Verify Script section above) — this is the same gate the
    harness uses post-task. If it fails, the implementation fails regardless of how clean the code looks.
    Record the output verbatim.
-2. **`git status`** — the tree MUST be clean. Uncommitted changes from the generator are a Completeness
-   failure; uncommitted changes from you are a protocol violation.
-3. **`git log --oneline -10`** — identify which commits belong to this task.
+2. **`git status --porcelain`** — inventory the files the generator touched. The working tree is expected
+   to be dirty at this point: the harness commits the generator's output _after_ this evaluator passes,
+   not before. A dirty tree is normal; do not treat it as a Completeness failure. Do not run `git stash`,
+   `git add`, or `git commit` — those are write operations and a protocol violation.
+3. **`git diff`** — review the generator's uncommitted changes. This is your primary view of what was
+   implemented. `git log` will not show this task's work because no commit exists yet.
+Computational results are ground truth. If the verify script fails, stop early and emit an `evaluation`
+signal with `status: "failed"` — the implementation does not pass.
-Computational results are ground truth. If the check script fails, stop early and emit
-`<evaluation-failed>` — the implementation does not pass.
+### Phase 2 — Per-criterion assessment
-### Phase 2 — Inferential investigation
+For every criterion in the contract:
+- **`auto` criteria** — run the specified command and record the verbatim output (a trimmed tail when
+  enormous) in the matching dimension's `executionEvidence` field. PASS only when the command exits 0
+  AND the assertion holds; FAIL otherwise. Cite the command's exit code in the finding.
+- **`manual` criteria** — cite the specific code location (`path:line`) or behaviour evidence the
+  assertion describes. PASS only when the cited evidence demonstrably satisfies the assertion; FAIL
+  otherwise. Generic approval language ("looks good", "appears correct") is INSUFFICIENT and is itself
+  a Completeness failure.
+Grade each criterion PASS or FAIL — no middle ground. **Any single criterion FAIL forces an overall
+FAIL on the `evaluation` signal.**
+### Phase 3 — Inferential investigation
 Now apply semantic judgment to what the computational checks cannot catch. Every finding you emit MUST trace to
 a concrete observation — a file path, a line, a function name, a specific value, a tool output, or a quoted
-snippet. Generic approval language ("looks good", "appears correct", "seems fine", "looks clean", "should be
-OK") is INSUFFICIENT and is itself a Completeness failure if you emit it.
+snippet.
-1. **Diff the task's commit range** — derive the base from the branch's divergence point
-   (`git merge-base HEAD main` or the closest equivalent) and run `git diff <base>..HEAD`. Tasks may produce
-   multiple commits; do not assume a single commit.
+1. **Review the generator's changes** — run `git diff` to see all uncommitted working-tree changes, and
+   `git status --porcelain` for a quick inventory of touched files. These are the authoritative view of
+   what the generator produced; there is no task commit to diff against yet.
 2. **Read the changed files carefully** — understand the full implementation, not just the diff. Note specific
    constructs worth citing later (new functions, changed signatures, edge-case branches).
 3. **Read surrounding code** — check that the implementation follows existing patterns and conventions. Cite a
@@ -86,21 +118,12 @@ OK") is INSUFFICIENT and is itself a Completeness failure if you emit it.
    - Skip this step only when the project has no runnable verification tooling or the task is purely structural
      (types, schemas, config).
-### Phase 3 — Dimension assessment
+### Phase 4 — Dimension assessment
 Evaluate the implementation across the dimensions below. The floor dimensions apply to every task; the planner
-may have attached additional task-specific dimensions (rendered below the floor block when present). Score each
-on the same 1–5 rubric. Dimensions scoring 4 or 5 pass; dimensions scoring 1, 2, or 3 fail. If ANY dimension
-fails, the overall evaluation fails.
-**Score rubric:**
-- **5 — Exemplary:** no issues; idiomatic; every criterion met fully.
-- **4 — Solid:** meets every criterion; minor stylistic improvements possible but not material.
-- **3 — Adequate but flawed:** meets the letter of the criteria but with material gaps (incomplete edge-case
-  handling, weak tests, awkward patterns). Score 3 fails.
-- **2 — Below bar:** missing required behaviour; tests do not cover the change; significant pattern violations.
-- **1 — Unacceptable:** does not implement the task or actively breaks unrelated code.
+may have attached additional task-specific dimensions (rendered below the floor block when present). Each
+dimension assessment produces `passed: true | false` and a `finding` citing the specific evidence. No middle
+ground — a dimension either passes or fails. **If ANY dimension fails, the overall evaluation fails.**
 **Floor dimensions:**
@@ -115,122 +138,87 @@ fails, the overall evaluation fails.
 {{EXTRA_DIMENSIONS_SECTION}}
-Write per-dimension findings as a markdown section with a one-sentence verdict and 1–3 specific observations
-each. The verdict signal at the end is the aggregate; the per-dimension findings are the audit trail.
+Write per-dimension findings as a markdown section with a one-sentence PASS / FAIL verdict and 1–3 specific
+observations each. The verdict signal at the end is the aggregate; the per-dimension findings are the audit
+trail.
 ### Anti-Rubber-Stamp Guard
 Before you decide the verdict, answer both questions honestly:
-1. **Did you actually run the Phase 1 verification commands?** If the check script exists and you did
-   not execute it, or you did not run `git status` / `git log`, you lack the ground truth that
-   authoritatively settles Correctness and Completeness.
-2. **Can you name a specific observation for each dimension?** For every score you are about to emit,
-   point to a concrete piece of evidence — a file path, a line number, a test count, a tool output, a
-   function name, a verification criterion you graded. "Looks good" / "appears correct" / "no issues
-   found" are NOT specific observations.
-If the answer to either question is **no**, you MUST score Completeness 1 with a one-line finding
-explaining what you skipped, and emit `<evaluation-failed>` — even if everything else seems fine. A
-rubber-stamp PASS is worse than a real FAIL because it misleads the harness into marking work done
-when it was never audited. This guard exists because the evaluator is the last line of defense
-against silent-pass regressions; the cost of a false FAIL is one extra fix iteration, the cost of a
-false PASS is a shipped bug.
+1. **Did you actually run the Phase 1 verification commands AND every `auto` criterion's command?** If the
+   verify script exists and you did not execute it, or you did not run `git status --porcelain` / `git diff`,
+   or you skipped any auto criterion's command, you lack the ground truth that authoritatively settles
+   Correctness and Completeness.
+2. **Can you name a specific observation for each dimension AND each criterion?** For every PASS you are
+   about to emit, point to a concrete piece of evidence — a file path, a line number, a test count, a tool
+   output, a function name, a verification criterion you graded. "Looks good" / "appears correct" / "no
+   issues found" are NOT specific observations.
+If the answer to either question is **no**, you MUST set Completeness `passed: false` with a one-line
+finding explaining what you skipped, and set the signal status to `failed` — even if everything else seems
+fine. A rubber-stamp PASS is worse than a real FAIL because it misleads the harness into marking work done
+when it was never audited. This guard exists because the evaluator is the last line of defense against
+silent-pass regressions; the cost of a false FAIL is one extra fix iteration, the cost of a false PASS is a
+shipped bug.
 ## Output format
-Markdown body, then exactly one verdict signal at the end:
-```markdown
-## Findings
-### Correctness — passed (5)
-{1–3 specific observations citing files / lines / functions.}
-### Completeness — failed (3)
-{1–3 specific observations. Be concrete about what's missing.}
-### Safety — passed (4)
-{...}
-### Consistency — passed (5)
-{...}
-<evaluation-failed>
-{Actionable critique. The generator will see this and resume to fix it. Be specific:
-which dimension failed, what the gap is, what change would close it.}
-</evaluation-failed>
-```
-When every dimension passes, end with `<evaluation-passed>` (no body).
+Capture your per-dimension findings in the `evaluation` signal's `dimensions` array (one entry per
+dimension with `dimension`, `passed`, `finding`, and — for dimensions paired with an `auto` criterion —
+`executionEvidence` carrying the verbatim command output). When any dimension or criterion fails, set
+`status: "failed"` and supply a `critique` — the actionable summary the generator sees on the next round.
+When every dimension AND every criterion passes, set `status: "passed"` (the `critique` may be omitted).
 ### Calibration examples
 <examples>
-**Example of a correct PASS (every dimension scored 4 or 5):**
+**Example of a correct PASS (every dimension and every criterion graded PASS):**
 > Task: "Add date validation to export endpoint"
-> Verification criteria: "GET /exports?startDate=invalid returns 400", "Valid range returns filtered results"
->
-> ### Correctness — passed (5)
->
-> Both criteria verified: invalid dates return 400 with error body; valid range filters correctly per
-> integration test at `src/routes/exports.test.ts:88`.
->
-> ### Completeness — passed (4)
->
-> Schema, controller, and tests all implemented per steps; one minor TODO comment left but unrelated to
-> this task's criteria.
+> Contract criteria:
 >
-> ### Safety — passed (5)
+> - **[C1]** (auto) `npm run test -- export.test.ts` — endpoint integration tests pass.
+> - **[C2]** (manual) — invalid `startDate` returns 400 with error body.
 >
-> Input validated via Zod at `src/routes/exports.ts:12` before reaching the database layer.
+> Dimensions:
 >
-> ### Consistency — passed (4)
+> - Correctness — PASS — both criteria verified: C1 command exited 0 with all 12 tests green; C2 — invalid
+>   dates return 400 with the project's standard error body at `src/routes/exports.ts:42`. `executionEvidence`
+>   for the Correctness row carries the C1 command output verbatim.
+> - Completeness — PASS — schema, controller, and tests all implemented per steps; one minor TODO comment
+>   left but unrelated to this task's criteria.
+> - Safety — PASS — input validated via Zod at `src/routes/exports.ts:12` before reaching the database.
+> - Consistency — PASS — follows existing endpoint patterns in `controllers/`; uses the project's error
+>   response format from `src/lib/errors.ts`.
 >
-> Follows existing endpoint patterns in `controllers/`; uses the project's error response format from
-> `src/lib/errors.ts`.
->
-> <evaluation-passed>
+> → `status: "passed"`, no critique.
-**Example of a correct FAIL (one or more dimensions scored 1–3):**
+**Example of a correct FAIL (one criterion / dimension failed):**
 > Task: "Add user search with pagination"
-> Verification criteria: "Returns paginated results", "Supports name filter", "Returns 400 for invalid page number"
->
-> ### Correctness — failed (2)
->
-> Invalid page number returns 500 (unhandled exception at `src/controllers/users.ts:47`) instead of 400
-> as required by criterion 3.
->
-> ### Completeness — passed (4)
+> Contract criteria:
 >
-> All three features implemented across controller, service, and tests.
+> - **[C1]** (auto) `npm run test -- users-search.test.ts` — search tests pass.
+> - **[C2]** (manual) — invalid page number returns 400.
 >
-> ### Safety — failed (1)
+> Dimensions:
 >
-> `src/repositories/users.ts:23` interpolates `query` directly into a SQL string; SQL injection is
-> possible on any search input.
->
-> ### Consistency — passed (4)
->
-> Follows existing controller patterns and uses the shared pagination helper.
->
-> <evaluation-failed>
-> [Correctness] `src/controllers/users.ts:47` — `parseInt(page)` returns NaN for non-numeric input,
-> causing an unhandled exception. Add validation before the query.
+> - Correctness — FAIL — C1 command exited 1: `users-search.test.ts: invalid page number test failed —
+expected 400, got 500`. C2 — `src/controllers/users.ts:47` returns 500 (unhandled exception) instead
+>   of 400. `executionEvidence` on the Correctness row carries the failing test output verbatim.
+> - Completeness — PASS — all three features implemented across controller, service, and tests.
+> - Safety — FAIL — `src/repositories/users.ts:23` interpolates `query` directly into a SQL string; SQL
+>   injection is possible on any search input.
+> - Consistency — PASS — follows existing controller patterns and uses the shared pagination helper.
 >
+> → `status: "failed"`, critique:
+> "[Correctness · C2] `src/controllers/users.ts:47` — `parseInt(page)` returns NaN for non-numeric input,
+> causing an unhandled exception. Add validation before the query so invalid page numbers return 400.
 > [Safety] `src/repositories/users.ts:23` — `WHERE name LIKE '%${query}%'` is SQL injection. Use a
-> parameterised query: `WHERE name LIKE $1` with `%${query}%` as the parameter.
-> </evaluation-failed>
+> parameterised query: `WHERE name LIKE $1` with `%${query}%` as the parameter."
 </examples>
-When finished, emit a verdict signal from the `<signals>` block below.
-{{SIGNALS}}
+{{OUTPUT_CONTRACT_SECTION}}