ralphctl 0.7.2 → 0.8.0
- package/README.md +83 -84
- package/dist/cli.mjs +14145 -6686
- package/dist/manifest.json +4 -3
- package/dist/prompts/_partials/decisions.md +14 -0
- package/dist/prompts/_partials/signals-feedback.md +18 -0
- package/dist/prompts/_partials/validation-checklist.md +5 -4
- package/dist/prompts/apply-feedback/template.md +24 -23
- package/dist/prompts/create-pr/template.md +73 -0
- package/dist/prompts/detect-scripts/template.md +1 -8
- package/dist/prompts/detect-skills/template.md +1 -9
- package/dist/prompts/evaluate/template.md +109 -121
- package/dist/prompts/ideate/template.md +48 -22
- package/dist/prompts/implement/template.md +57 -79
- package/dist/prompts/plan/template.md +78 -45
- package/dist/prompts/readiness/template.md +32 -28
- package/dist/prompts/refine/template.md +35 -28
- package/package.json +2 -2
- package/dist/prompts/_partials/signals-evaluation.md +0 -14
- package/dist/prompts/_partials/signals-task.md +0 -26
package/dist/manifest.json
CHANGED
@@ -1,12 +1,13 @@
  {
  "version": 1,
- "generatedAt": "2026-05-
+ "generatedAt": "2026-05-24T20:04:47.241Z",
  "assets": [
+ "prompts/_partials/decisions.md",
  "prompts/_partials/harness-context.md",
- "prompts/_partials/signals-
- "prompts/_partials/signals-task.md",
+ "prompts/_partials/signals-feedback.md",
  "prompts/_partials/validation-checklist.md",
  "prompts/apply-feedback/template.md",
+ "prompts/create-pr/template.md",
  "prompts/detect-scripts/template.md",
  "prompts/detect-skills/template.md",
  "prompts/evaluate/template.md",
@@ -0,0 +1,14 @@
+ ## Recording architectural decisions
+
+ When you make a non-obvious architectural or implementation choice — one a future reviewer might disagree
+ with or need to understand — emit `<decision>your concise rationale</decision>` so the harness can record
+ it in the sprint's decisions log.
+
+ - **Emit sparingly** — only for choices a future maintainer could not recover from the diff alone (e.g.
+ picking one valid pattern over another, choosing a tradeoff, deliberately deviating from a project
+ convention). Obvious changes do not need a decision entry.
+ - **One sentence per decision** — lead with the choice, then the rationale: "Used X over Y because Z." Use
+ two sentences only when the rationale genuinely cannot be compressed without losing the key tradeoff.
+ - The harness appends timestamp + task id + commit sha automatically — do not include those yourself.
+ - Multiple `<decision>` tags per task are allowed when distinct choices were made; emit one tag per
+ decision rather than packing several into one body.
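As an illustration of the `<decision>` convention added in the hunk above, the following is a minimal sketch of how a consumer might pull decision bodies out of agent output. The function name `extractDecisions` is hypothetical, and the parser that ralphctl actually ships is not visible in this diff, so treat this as an assumption-laden example rather than the harness's implementation.

```typescript
// Hypothetical helper (not taken from ralphctl): collects the trimmed body of
// every <decision>...</decision> tag found in a block of agent output.
function extractDecisions(output: string): string[] {
  const pattern = /<decision>([\s\S]*?)<\/decision>/g;
  const decisions: string[] = [];
  let match: RegExpExecArray | null;
  // exec() with the g flag advances through the string one match at a time.
  while ((match = pattern.exec(output)) !== null) {
    const body = match[1].trim();
    if (body.length > 0) {
      decisions.push(body); // one entry per tag, as the prompt requests
    }
  }
  return decisions;
}
```

Under this sketch each tag yields exactly one entry, which matches the "one tag per decision" guidance; timestamps, task ids, and commit shas would be attached elsewhere.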
@@ -0,0 +1,18 @@
+ <signals>
+
+ Use these signals to communicate the outcome of this feedback round to the harness. The harness parses your output
+ for these tags; nothing else in your message is treated as a control signal.
+
+ - `<task-complete>` — Marks the round as successfully applied. Emit when every requested change is on disk and
+ the working tree reflects the user's direction. The harness commits your edits afterward and runs the project's
+ verify script itself — do not run verification yourself, and do not commit.
+ - `<task-blocked>reason</task-blocked>` — Marks the round as un-appliable. Use when you genuinely cannot proceed:
+ the feedback is ambiguous in WHAT (not where), it contradicts an invariant in a prior round, or it asks for
+ information you do not have. Be concrete in the reason — the harness surfaces it verbatim to the operator and
+ ends the review loop.
+
+ Emit exactly one of the two signals above. Any of the implement-flow signals (`<change>`, `<learning>`,
+ `<note>`, `<decision>`, `<task-verified>`, `<commit-message>`, `<progress>`) are not consumed by the review
+ flow — emitting them wastes tokens and produces no on-disk effect.
+
+ </signals>
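The "emit exactly one of the two signals" rule in the hunk above can be sketched as a small classifier. The type `FeedbackOutcome` and the function `parseFeedbackOutcome` are hypothetical names, and how the real harness reacts to zero or multiple signals is an assumption here, since the diff does not show that code.

```typescript
// Hypothetical result type; the "invalid" branch is an assumption about how a
// consumer could treat output that breaks the exactly-one rule.
type FeedbackOutcome =
  | { kind: "complete" }
  | { kind: "blocked"; reason: string }
  | { kind: "invalid"; problem: string };

function parseFeedbackOutcome(output: string): FeedbackOutcome {
  // Count bare <task-complete> tags.
  const completeCount = (output.match(/<task-complete>/g) || []).length;

  // Collect the body of every <task-blocked>...</task-blocked> tag.
  const blockedReasons: string[] = [];
  const blockedPattern = /<task-blocked>([\s\S]*?)<\/task-blocked>/g;
  let match: RegExpExecArray | null;
  while ((match = blockedPattern.exec(output)) !== null) {
    blockedReasons.push(match[1].trim());
  }

  const total = completeCount + blockedReasons.length;
  if (total === 0) {
    return { kind: "invalid", problem: "no outcome signal found" };
  }
  if (total > 1) {
    return { kind: "invalid", problem: "more than one outcome signal found" };
  }
  if (completeCount === 1) {
    return { kind: "complete" };
  }
  return { kind: "blocked", reason: blockedReasons[0] };
}
```

Implement-flow tags such as `<note>` or `<learning>` are simply ignored by this sketch, mirroring the statement that the review flow does not consume them.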
@@ -6,16 +6,17 @@ Before writing the JSON output, verify EVERY item:

  1. **Requirements understood** — every approved ticket is reflected in at least one task; nothing in scope is dropped.
  2. **Exclusive file ownership** — each file is owned by exactly one task. When two tasks must edit the same file,
- make the relationship explicit via `
- 3. **Foundations before dependents** — order tasks so prerequisites come first; `
+ make the relationship explicit via `blockedBy` so they run in sequence, not in parallel.
+ 3. **Foundations before dependents** — order tasks so prerequisites come first; `blockedBy` reflects genuine code
  coupling, not arbitrary preference.
- 4. **Valid `
+ 4. **Valid `blockedBy` references** — every id in `blockedBy` matches an earlier task's `id` placeholder; no
  self-edges; no cycles.
  5. **Precise steps** — each task has 2–8 specific, actionable steps. Each step references concrete files or
  functions; "implement the feature" is not a step.
  6. **Verification criteria** — each task has 2–4 `verificationCriteria` that are testable and unambiguous.
  "Tests pass" alone is too vague — name the behaviour or invariant that proves the task is done.
- 7. **Repository assignment** — every task's `
+ 7. **Repository assignment** — every task's `projectPath` matches one of the absolute paths listed under
+ "Selected repositories" above.
  8. **Raw JSON output** — output a single JSON array matching the Task schema. The harness parses your output
  directly; emit it without markdown fences, commentary, or surrounding prose.
  9. **Unique placeholder ids** — each task's `id` is a unique string within this array (used only for
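The `blockedBy` rules in checklist items 4 and 9 of the hunk above (references point at an earlier task, no self-edges, no cycles, unique ids) lend themselves to a single ordered pass. The sketch below assumes a task shape with only `id` and `blockedBy`; the full Task schema is not shown in this diff, and `PlannedTask` and `validateBlockedBy` are hypothetical names.

```typescript
// Assumed minimal task shape; the real Task schema has more fields
// (for example projectPath and verificationCriteria) that are not modelled here.
interface PlannedTask {
  id: string;
  blockedBy: string[];
}

// Returns a list of human-readable problems; an empty list means the array
// satisfies the blockedBy rules sketched from the checklist.
function validateBlockedBy(tasks: PlannedTask[]): string[] {
  const errors: string[] = [];
  const seen = new Set<string>(); // ids of tasks that appear earlier in the array

  for (const task of tasks) {
    if (seen.has(task.id)) {
      errors.push(`duplicate id: ${task.id}`);
    }
    for (const dep of task.blockedBy) {
      if (dep === task.id) {
        errors.push(`self-edge on ${task.id}`);
      } else if (!seen.has(dep)) {
        errors.push(`${task.id} is blocked by ${dep}, which is not an earlier task`);
      }
    }
    seen.add(task.id);
  }
  return errors;
}
```

Because every accepted reference points strictly backwards in the array, a cycle cannot survive this check, so no separate graph traversal is needed in this sketch.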
@@ -16,8 +16,10 @@ of them as they wrote it.
  code, don't add tests the user didn't ask for, don't tighten unrelated types. The user is
  shaping the work; you execute their direction.

- **Commit
- your changes with the message `feedback(round-N): <body-snippet
+ **Commit and verify are the harness's job.** When you've applied the round's feedback, the harness
+ commits your changes with the message `feedback(round-N): <body-snippet>` and then runs the project's
+ verify script itself. Do not commit, and do not run verify scripts — emit `<task-complete>` once your
+ edits are on disk and let the harness drive the gate.

  **Make the edits — don't just describe them.** The harness does not apply changes for you;
  you must write the files. A written-out description of the edits, without actual file writes,
@@ -60,18 +62,23 @@ This is the round you are applying. Read it carefully and make ONLY the changes
  <progress>

  The sprint's `progress.md` — pinned learnings and decisions, plus per-task activity. Use it
- for context
-
+ for context so you don't re-discover what the prior tasks already established. This is a
+ review-time prompt — the review flow does not mine `<learning>` / `<decision>` / `<note>`
+ back into `progress.md`, so do not emit them; surface insights inside the change itself
+ (via tests, docstrings, or the targeted edit).

  {{PROGRESS}}

  </progress>

-
+ ## Repositories

-
-
-
+ The sprint targets the repositories below. Each line is `- \`<absolute-path>\` (<name>)`. Decide which
+ repository (or repositories) the latest round touches based on the feedback content and the relevant
+ source layout. The harness mounts every repository as a workspace root — open files via the absolute
+ paths shown.
+
+ {{REPOSITORIES}}

  ## Protocol

@@ -98,21 +105,15 @@ Then orient before editing:
  files when the round is symptom-described rather than file-described).
  3. **Do not commit.** The harness commits your changes with `feedback(round-N): <body-snippet>`.

- ### Phase 3 —
-
- 1. **Run the check script** (when one is configured in the Project Tooling section). Record its
- output verbatim for `<task-verified>`.
- 2. **When no check script is configured**, emit
- `<task-verified>no check script configured; change applied</task-verified>` so the harness can
- record that the round produced changes without a verification gate.
- 3. **Signal completion** with `<task-complete>` once the change is applied and verification (if
- any) passed.
+ ### Phase 3 — Signal outcome

-
-
- explanation. Ambiguity in WHERE to apply the change is not a blocker — pick the narrowest
- plausible target. Ambiguity in WHAT to do is.
+ When every requested change is on disk, emit `<task-complete>`. The harness then commits your edits
+ and runs the project's verify script — you do not run either step yourself.

-
+ If you cannot apply the feedback (the request is ambiguous in WHAT to do, contradicts an invariant
+ established by a prior round, or asks for information neither this round nor the feedback log
+ supplies), emit `<task-blocked>reason</task-blocked>` with a concrete explanation. Ambiguity in
+ WHERE to apply the change is not a blocker — pick the narrowest plausible target. Ambiguity in WHAT
+ to do is.

- {{
+ {{OUTPUT_CONTRACT_SECTION}}
@@ -0,0 +1,73 @@
+ # Pull Request Authoring Protocol
+
+ You are authoring a pull-request title and body for a branch that is ready to merge. Audience: the
+ project's maintainers reviewing the PR. Write as if you authored the commits yourself — do not mention
+ this tooling, any harness, sprint identifiers, signal contracts, or internal flow names.
+
+ {{HARNESS_CONTEXT}}
+
+ ## Branch under review
+
+ - **Head branch:** `{{HEAD_BRANCH}}` (already pushed to `origin`)
+ - **Base branch:** `{{BASE_BRANCH}}`
+
+ ## Tickets the branch addresses
+
+ {{TICKET_SUMMARY}}
+
+ ## How to gather context
+
+ Run these from your cwd to see exactly what is changing:
+
+ - `git log {{BASE_BRANCH}}..HEAD` — the commit history on this branch
+ - `git diff {{BASE_BRANCH}}...HEAD --stat` — the file-level change summary
+ - `git diff {{BASE_BRANCH}}...HEAD` — the full diff, when you need to inspect specific changes
+
+ Lean on `--stat` to group changes sensibly; only read the full diff for sections you cannot summarise
+ from commit messages alone.
+
+ ## What to author
+
+ ### Title
+
+ One line, imperative present-tense, ≤70 characters. Examples — "Add CSV export for transactions",
+ "Fix race in session locking". Do not prefix with the branch name, ticket id, or `feat:` / `fix:` —
+ the project's commit-message convention is independent and already applied at commit time.
+
+ ### Body
+
+ The body has three sections, in this order:
+
+ 1. **Summary** — 1–3 sentences naming what the branch does and why. Focus on intent and observable
+ behaviour change; do not describe file paths or implementation mechanics in the summary.
+ 2. **`## Changes`** — bullet list of what changed, grouped sensibly (by feature, module, or layer —
+ not file-by-file). Each bullet is one short sentence.
+ 3. **`## Test plan`** — markdown checklist of how a reviewer would verify the branch. Concrete
+ actions, not abstractions. Include both manual checks and automated coverage when applicable.
+
+ End the body with the verbatim issue references below, if any are present:
+
+ ```
+ {{ISSUE_REFS}}
+ ```
+
+ If `{{ISSUE_REFS}}` is empty, omit the trailing closes block entirely — do not invent issue
+ numbers, and do not write "no related issues".
+
+ ## Constraints
+
+ - Stay implementation-agnostic in the summary — name behaviour, not call sites.
+ - Never reference this tooling, any harness, sprint ids, internal flow names, or the AI itself.
+ Reviewers should not be able to tell from the PR description that it was authored with assistance.
+ - Use em-dash `—` (not a plain hyphen) for explanatory clauses, matching the project's house style.
+ - Do not invent acceptance criteria, ticket numbers, or roadmap items that are not visible in
+ the diff or the ticket summary above.
+
+ ## Anti-patterns
+
+ - A summary that lists files instead of behaviour.
+ - A title that exceeds 70 characters or reads as past-tense ("Added X" → "Add X").
+ - A "Test plan" that says "see CI" — name the concrete checks.
+ - Inventing a "Closes #N" line when `{{ISSUE_REFS}}` is empty.
+
+ {{OUTPUT_CONTRACT_SECTION}}
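The mechanical parts of the title rules in the template above (one line, at most 70 characters, no `feat:` / `fix:` prefix) can be checked programmatically. This is a sketch under assumptions: `checkPrTitle` is a hypothetical name, nothing in the diff indicates that ralphctl performs such a check, and imperative mood is deliberately left out because it cannot be judged reliably with a pattern match.

```typescript
// Hypothetical lint for the mechanical title rules only; returns a list of
// problems, empty when the title satisfies all three checks.
function checkPrTitle(title: string): string[] {
  const problems: string[] = [];

  if (title.includes("\n")) {
    problems.push("title must be a single line");
  }
  // Array.from splits by code point, so characters outside the BMP count once.
  if (Array.from(title).length > 70) {
    problems.push("title exceeds 70 characters");
  }
  // Matches "feat:", "fix:", and scoped forms such as "feat(cli):".
  if (/^(feat|fix)(\([^)]*\))?:/i.test(title)) {
    problems.push("title must not carry a feat:/fix: prefix");
  }
  return problems;
}
```

The two example titles quoted in the template pass this sketch, while a conventional-commit style title such as "fix: race in session locking" is flagged.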
@@ -108,11 +108,4 @@ tools — the project's scripts are the documented contract.

  ### Phase 3 — Output

-
- around the tags:
-
- 1. `<setup-script>…single shell line…</setup-script>` — omit only when the project documents no
- setup step.
- 2. `<verify-script>…single shell line…</verify-script>` — omit only when the project documents no
- verification commands.
- 3. `<note>…</note>` — optional, one short observation naming the source file(s) you relied on.
+ {{OUTPUT_CONTRACT_SECTION}}
@@ -125,12 +125,4 @@ specific to this repo.

  ### Phase 3 — Output

-
- around the tags themselves:
-
- 1. `<setup-skill>…multi-paragraph markdown body…</setup-skill>` — omit only when an existing
- project skill already covers sprint setup for this repo.
- 2. `<verify-skill>…multi-paragraph markdown body…</verify-skill>` — omit only when an existing
- project skill already covers post-task verification for this repo.
- 3. `<note>…</note>` — optional, one short observation naming the source file(s) relied on, or
- noting which existing skill made a tag redundant.
+ {{OUTPUT_CONTRACT_SECTION}}
@@ -9,16 +9,19 @@ to confirm what they claim.

  <constraints>

- **You are a reviewer — do not edit files.** If you believe a fix is needed, emit
- concrete critique; the harness will resume the generator to apply the fix. Do not run
- tests, do not create commits. Your tools are read-only: `git status`, `git log`,
- running existing
+ **You are a reviewer — do not edit files.** If you believe a fix is needed, emit an `evaluation` signal with
+ `status: "failed"` and a concrete critique; the harness will resume the generator to apply the fix. Do not run
+ `git stash`, do not edit tests, do not create commits. Your tools are read-only: `git status`, `git log`,
+ `git diff`, file reads, and running existing verify scripts and per-criterion auto commands. Any write
+ operation other than `signals.json` is a protocol violation.

  </constraints>

  <task-specification>

- These verification criteria are the pre-agreed definition of "done" — your primary grading rubric.
+ These verification criteria are the pre-agreed definition of "done" — your primary grading rubric. The task
+ contract at `{{CONTRACT_PATH}}` is the authoritative source; read it before starting. The criteria block
+ below mirrors that file for in-context reference.

  **Task:** {{TASK_NAME}}

@@ -34,9 +37,21 @@ You are working in this project directory:
  {{PROJECT_PATH}}
  ```

- ##
+ ## Verify Script

- {{
+ {{VERIFY_SCRIPT_SECTION}}
+
+ ## Prior progress
+
+ Below is the sprint's `progress.md` body so you can judge this round's work against what
+ already shipped — prior tasks' decisions, changes, learnings, and notes. Use it to spot
+ inconsistencies with established direction and to avoid critiquing the generator for
+ following a decision already recorded in earlier rounds.
+
+ {{PRIOR_PROGRESS}}
+
+ If the block above is empty, no prior progress has been recorded — this is the first
+ task-attempt of the sprint.

  ## Project Tooling

@@ -52,26 +67,43 @@ reasoning produces sharper reviews than jumping straight to verdicts.

  Then run deterministic checks first — these are cheap, fast, and authoritative.

- 1. **Run the
+ 1. **Run the verify script** (when configured in the Verify Script section above) — this is the same gate the
  harness uses post-task. If it fails, the implementation fails regardless of how clean the code looks.
  Record the output verbatim.
- 2. **`git status`** — the
-
-
+ 2. **`git status --porcelain`** — inventory the files the generator touched. The working tree is expected
+ to be dirty at this point: the harness commits the generator's output _after_ this evaluator passes,
+ not before. A dirty tree is normal; do not treat it as a Completeness failure. Do not run `git stash`,
+ `git add`, or `git commit` — those are write operations and a protocol violation.
+ 3. **`git diff`** — review the generator's uncommitted changes. This is your primary view of what was
+ implemented. `git log` will not show this task's work because no commit exists yet.
+
+ Computational results are ground truth. If the verify script fails, stop early and emit an `evaluation`
+ signal with `status: "failed"` — the implementation does not pass.

-
- `<evaluation-failed>` — the implementation does not pass.
+ ### Phase 2 — Per-criterion assessment

-
+ For every criterion in the contract:
+
+ - **`auto` criteria** — run the specified command and record the verbatim output (a trimmed tail when
+ enormous) in the matching dimension's `executionEvidence` field. PASS only when the command exits 0
+ AND the assertion holds; FAIL otherwise. Cite the command's exit code in the finding.
+ - **`manual` criteria** — cite the specific code location (`path:line`) or behaviour evidence the
+ assertion describes. PASS only when the cited evidence demonstrably satisfies the assertion; FAIL
+ otherwise. Generic approval language ("looks good", "appears correct") is INSUFFICIENT and is itself
+ a Completeness failure.
+
+ Grade each criterion PASS or FAIL — no middle ground. **Any single criterion FAIL forces an overall
+ FAIL on the `evaluation` signal.**
+
+ ### Phase 3 — Inferential investigation

  Now apply semantic judgment to what the computational checks cannot catch. Every finding you emit MUST trace to
  a concrete observation — a file path, a line, a function name, a specific value, a tool output, or a quoted
- snippet.
- OK") is INSUFFICIENT and is itself a Completeness failure if you emit it.
+ snippet.

- 1. **
-
-
+ 1. **Review the generator's changes** — run `git diff` to see all uncommitted working-tree changes, and
+ `git status --porcelain` for a quick inventory of touched files. These are the authoritative view of
+ what the generator produced; there is no task commit to diff against yet.
  2. **Read the changed files carefully** — understand the full implementation, not just the diff. Note specific
  constructs worth citing later (new functions, changed signatures, edge-case branches).
  3. **Read surrounding code** — check that the implementation follows existing patterns and conventions. Cite a
@@ -86,21 +118,12 @@ OK") is INSUFFICIENT and is itself a Completeness failure if you emit it.
  - Skip this step only when the project has no runnable verification tooling or the task is purely structural
  (types, schemas, config).

- ### Phase
+ ### Phase 4 — Dimension assessment

  Evaluate the implementation across the dimensions below. The floor dimensions apply to every task; the planner
- may have attached additional task-specific dimensions (rendered below the floor block when present).
-
- fails, the overall evaluation fails
-
- **Score rubric:**
-
- - **5 — Exemplary:** no issues; idiomatic; every criterion met fully.
- - **4 — Solid:** meets every criterion; minor stylistic improvements possible but not material.
- - **3 — Adequate but flawed:** meets the letter of the criteria but with material gaps (incomplete edge-case
- handling, weak tests, awkward patterns). Score 3 fails.
- - **2 — Below bar:** missing required behaviour; tests do not cover the change; significant pattern violations.
- - **1 — Unacceptable:** does not implement the task or actively breaks unrelated code.
+ may have attached additional task-specific dimensions (rendered below the floor block when present). Each
+ dimension assessment produces `passed: true | false` and a `finding` citing the specific evidence. No middle
+ ground — a dimension either passes or fails. **If ANY dimension fails, the overall evaluation fails.**

  **Floor dimensions:**

@@ -115,122 +138,87 @@ fails, the overall evaluation fails.

  {{EXTRA_DIMENSIONS_SECTION}}

- Write per-dimension findings as a markdown section with a one-sentence verdict and 1–3 specific
- each. The verdict signal at the end is the aggregate; the per-dimension findings are the audit
+ Write per-dimension findings as a markdown section with a one-sentence PASS / FAIL verdict and 1–3 specific
+ observations each. The verdict signal at the end is the aggregate; the per-dimension findings are the audit
+ trail.

  ### Anti-Rubber-Stamp Guard

  Before you decide the verdict, answer both questions honestly:

- 1. **Did you actually run the Phase 1 verification commands
- not execute it, or you did not run `git status` / `git
-
-
-
-
-
-
-
-
-
-
-
- false
+ 1. **Did you actually run the Phase 1 verification commands AND every `auto` criterion's command?** If the
+ verify script exists and you did not execute it, or you did not run `git status --porcelain` / `git diff`,
+ or you skipped any auto criterion's command, you lack the ground truth that authoritatively settles
+ Correctness and Completeness.
+ 2. **Can you name a specific observation for each dimension AND each criterion?** For every PASS you are
+ about to emit, point to a concrete piece of evidence — a file path, a line number, a test count, a tool
+ output, a function name, a verification criterion you graded. "Looks good" / "appears correct" / "no
+ issues found" are NOT specific observations.
+
+ If the answer to either question is **no**, you MUST set Completeness `passed: false` with a one-line
+ finding explaining what you skipped, and set the signal status to `failed` — even if everything else seems
+ fine. A rubber-stamp PASS is worse than a real FAIL because it misleads the harness into marking work done
+ when it was never audited. This guard exists because the evaluator is the last line of defense against
+ silent-pass regressions; the cost of a false FAIL is one extra fix iteration, the cost of a false PASS is a
+ shipped bug.

  ## Output format

-
-
-
-
-
- ### Correctness — passed (5)
-
- {1–3 specific observations citing files / lines / functions.}
-
- ### Completeness — failed (3)
-
- {1–3 specific observations. Be concrete about what's missing.}
-
- ### Safety — passed (4)
-
- {...}
-
- ### Consistency — passed (5)
-
- {...}
-
- <evaluation-failed>
- {Actionable critique. The generator will see this and resume to fix it. Be specific:
- which dimension failed, what the gap is, what change would close it.}
- </evaluation-failed>
- ```
-
- When every dimension passes, end with `<evaluation-passed>` (no body).
+ Capture your per-dimension findings in the `evaluation` signal's `dimensions` array (one entry per
+ dimension with `dimension`, `passed`, `finding`, and — for dimensions paired with an `auto` criterion —
+ `executionEvidence` carrying the verbatim command output). When any dimension or criterion fails, set
+ `status: "failed"` and supply a `critique` — the actionable summary the generator sees on the next round.
+ When every dimension AND every criterion passes, set `status: "passed"` (the `critique` may be omitted).

  ### Calibration examples

  <examples>

- **Example of a correct PASS (every dimension
+ **Example of a correct PASS (every dimension and every criterion graded PASS):**

  > Task: "Add date validation to export endpoint"
- >
- >
- > ### Correctness — passed (5)
- >
- > Both criteria verified: invalid dates return 400 with error body; valid range filters correctly per
- > integration test at `src/routes/exports.test.ts:88`.
- >
- > ### Completeness — passed (4)
- >
- > Schema, controller, and tests all implemented per steps; one minor TODO comment left but unrelated to
- > this task's criteria.
+ > Contract criteria:
  >
- >
+ > - **[C1]** (auto) `npm run test -- export.test.ts` — endpoint integration tests pass.
+ > - **[C2]** (manual) — invalid `startDate` returns 400 with error body.
  >
- >
+ > Dimensions:
  >
- >
+ > - Correctness — PASS — both criteria verified: C1 command exited 0 with all 12 tests green; C2 — invalid
+ > dates return 400 with the project's standard error body at `src/routes/exports.ts:42`. `executionEvidence`
+ > for the Correctness row carries the C1 command output verbatim.
+ > - Completeness — PASS — schema, controller, and tests all implemented per steps; one minor TODO comment
+ > left but unrelated to this task's criteria.
+ > - Safety — PASS — input validated via Zod at `src/routes/exports.ts:12` before reaching the database.
+ > - Consistency — PASS — follows existing endpoint patterns in `controllers/`; uses the project's error
+ > response format from `src/lib/errors.ts`.
  >
- >
- > `src/lib/errors.ts`.
- >
- > <evaluation-passed>
+ > → `status: "passed"`, no critique.

- **Example of a correct FAIL (one
+ **Example of a correct FAIL (one criterion / dimension failed):**

  > Task: "Add user search with pagination"
- >
- >
- > ### Correctness — failed (2)
- >
- > Invalid page number returns 500 (unhandled exception at `src/controllers/users.ts:47`) instead of 400
- > as required by criterion 3.
- >
- > ### Completeness — passed (4)
+ > Contract criteria:
  >
- >
+ > - **[C1]** (auto) `npm run test -- users-search.test.ts` — search tests pass.
+ > - **[C2]** (manual) — invalid page number returns 400.
  >
- >
+ > Dimensions:
  >
- > `
-
- >
- >
- >
- >
- >
- > <evaluation-failed>
- > [Correctness] `src/controllers/users.ts:47` — `parseInt(page)` returns NaN for non-numeric input,
- > causing an unhandled exception. Add validation before the query.
+ > - Correctness — FAIL — C1 command exited 1: `users-search.test.ts: invalid page number test failed —
+ expected 400, got 500`. C2 — `src/controllers/users.ts:47` returns 500 (unhandled exception) instead
+ > of 400. `executionEvidence` on the Correctness row carries the failing test output verbatim.
+ > - Completeness — PASS — all three features implemented across controller, service, and tests.
+ > - Safety — FAIL — `src/repositories/users.ts:23` interpolates `query` directly into a SQL string; SQL
+ > injection is possible on any search input.
+ > - Consistency — PASS — follows existing controller patterns and uses the shared pagination helper.
  >
+ > → `status: "failed"`, critique:
+ > "[Correctness · C2] `src/controllers/users.ts:47` — `parseInt(page)` returns NaN for non-numeric input,
+ > causing an unhandled exception. Add validation before the query so invalid page numbers return 400.
  > [Safety] `src/repositories/users.ts:23` — `WHERE name LIKE '%${query}%'` is SQL injection. Use a
- > parameterised query: `WHERE name LIKE $1` with `%${query}%` as the parameter.
- > </evaluation-failed>
+ > parameterised query: `WHERE name LIKE $1` with `%${query}%` as the parameter."

  </examples>

-
-
- {{SIGNALS}}
+ {{OUTPUT_CONTRACT_SECTION}}
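The new "Output format" wording in the hunk above describes an `evaluation` signal with a `dimensions` array, a `status`, and an optional `critique`. A possible TypeScript rendering of that shape is sketched below. The field names come from the template text, but the exact `signals.json` schema is not part of this diff, so the interfaces and the `buildEvaluation` helper are assumptions for illustration only.

```typescript
// Assumed shape, inferred from the field names quoted in the template.
interface DimensionResult {
  dimension: string;
  passed: boolean;
  finding: string;
  executionEvidence?: string; // verbatim command output for auto criteria
}

interface EvaluationSignal {
  status: "passed" | "failed";
  dimensions: DimensionResult[];
  critique?: string;
}

// Hypothetical helper: applies the "any FAIL forces an overall FAIL" rule to
// both the dimension results and the per-criterion PASS/FAIL grades.
function buildEvaluation(
  dimensions: DimensionResult[],
  criterionResults: boolean[],
  critique: string
): EvaluationSignal {
  const allPassed =
    dimensions.every((d) => d.passed) && criterionResults.every((ok) => ok);
  if (allPassed) {
    return { status: "passed", dimensions }; // critique may be omitted on a pass
  }
  return { status: "failed", dimensions, critique };
}
```

In this sketch a single failing criterion flips the status even when every dimension passed, which mirrors the statement that any single criterion FAIL forces an overall FAIL.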