@jhlee0619/codexloop 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/marketplace.json +34 -0
- package/.claude-plugin/plugin.json +8 -0
- package/.codex-plugin/plugin.json +38 -0
- package/LICENSE +21 -0
- package/README.md +425 -0
- package/assets/banner.png +0 -0
- package/bin/cloop +45 -0
- package/commands/iterate.md +25 -0
- package/commands/model.md +33 -0
- package/commands/result.md +17 -0
- package/commands/start.md +188 -0
- package/commands/status.md +10 -0
- package/commands/stop.md +12 -0
- package/package.json +60 -0
- package/prompts/evaluate.md +91 -0
- package/prompts/rank.md +97 -0
- package/prompts/suggest.md +69 -0
- package/schemas/evaluation.schema.json +65 -0
- package/schemas/loop-state.schema.json +103 -0
- package/schemas/proposal.schema.json +74 -0
- package/schemas/ranking.schema.json +77 -0
- package/scripts/lib/apply.mjs +254 -0
- package/scripts/lib/args.mjs +202 -0
- package/scripts/lib/codex-exec.mjs +318 -0
- package/scripts/lib/convergence.mjs +153 -0
- package/scripts/lib/iteration.mjs +484 -0
- package/scripts/lib/process.mjs +164 -0
- package/scripts/lib/prompts.mjs +53 -0
- package/scripts/lib/rank.mjs +149 -0
- package/scripts/lib/render.mjs +240 -0
- package/scripts/lib/state.mjs +378 -0
- package/scripts/lib/validate.mjs +71 -0
- package/scripts/lib/workspace.mjs +49 -0
- package/scripts/loop-companion.mjs +849 -0
- package/skills/cloop/SKILL.md +177 -0
package/commands/start.md
ADDED

@@ -0,0 +1,188 @@
---
description: Begin a new CodexLoop with an interview + approval phase before the loop actually runs
argument-hint: '[--wait|--background] [--yes] [--goal "<text>"] [--task-file <path>] [--max-iter N] [--max-time <dur>] [--max-calls N] [--dry-run] [--model <m>] [--effort <e>] [--nproposals N] [--test-cmd "<cmd>"] [--lint-cmd "<cmd>"] [--type-cmd "<cmd>"]'
allowed-tools: Read, Glob, Grep, Bash(node:*), Bash(git:*), AskUserQuestion
---

Start a new CodexLoop. **Before launching the runtime, conduct a short interview with the user to clarify the goal and confirm the plan — do not just forward the raw arguments to `loop-companion.mjs`.** The runtime then drives Codex through the six-step iteration cycle (evaluate, suggest, rank, apply, validate, record) until the goal is met, the loop converges, or a budget limit fires.

Raw slash-command arguments:
`$ARGUMENTS`

---

## Your role

You are the interviewer. Your job has four phases:

1. **Pre-flight** — reject quickly on obvious problems.
2. **Goal capture + clarification** — produce a clear, immutable one-sentence goal.
3. **Plan assembly** — acceptance criteria, test/lint/type commands, budget, model + reasoning effort, execution mode (`--wait` vs `--background`).
4. **Approval** — show the full plan and get explicit go/no-go before calling `loop-companion.mjs`.

You do NOT generate proposals, apply patches, or run tests yourself. The runtime owns all Codex calls, patch application, validation, and state.

---

## Phase 1 — pre-flight

Before anything else, verify:

1. The current directory is a git repository (`git rev-parse --show-toplevel` succeeds). If not, tell the user to run `git init` and commit a baseline, then stop.
2. The working tree is clean (`git status --porcelain` is empty). If dirty, print the top of the dirty list and tell the user to commit or stash before starting a loop. Stop.
3. `codex --version` succeeds. If not, tell the user to run `/codex:setup` from the `openai-codex` plugin (or `npm install -g @openai/codex`). Stop.

If any pre-flight check fails, do NOT continue to the interview.

---

## Phase 2 — goal capture and clarification

Assemble the goal in this priority order:

1. If `$ARGUMENTS` contains `--goal "<text>"`, use that as the seed.
2. Otherwise, if `$ARGUMENTS` contains `--task-file <path>`, read that file and use its trimmed contents as the seed.
3. Otherwise, take every non-flag token in `$ARGUMENTS` (anything that is NOT a `--flag`, a `--flag=value`, the value of a value-taking flag, or `-C <dir>`) and join them with single spaces. If the result is non-empty, use it as the seed. This is the **primary path** — it lets users write `/cloop:start fix the failing auth tests` or `/cloop:start "fix the failing auth tests"` and have the text become the goal seed with zero ceremony.
4. Otherwise, search the working tree for `TASK.md`, `GOAL.md`, `PRD.md`, then `AGENTS.md` (in that order). If one exists, read it and offer it to the user as the seed with a one-line `AskUserQuestion` such as `Use TASK.md as the goal?` [Yes / Let me type a different goal].
5. Otherwise, ask the user directly: `AskUserQuestion` with `question: "What should this CodexLoop achieve?"` and two options: `Type the goal now` (user uses Other) and `Point me at a file instead`.
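The step 3 rule can be sketched as a small tokenizer. This is an illustration only: the runtime's real argument parser lives in `scripts/lib/args.mjs`, and `goalSeed` and `VALUE_FLAGS` are hypothetical names.

```javascript
// Hypothetical sketch of step 3: every token that is not a flag, a value
// consumed by a value-taking flag, or a `-C <dir>` pair becomes part of
// the goal seed. Not the runtime's actual parser.
const VALUE_FLAGS = new Set([
  '--goal', '--task-file', '--max-iter', '--max-time', '--max-calls',
  '--model', '--effort', '--nproposals',
  '--test-cmd', '--lint-cmd', '--type-cmd',
]);

function goalSeed(tokens) {
  const words = [];
  for (let i = 0; i < tokens.length; i++) {
    const t = tokens[i];
    if (t === '-C' || VALUE_FLAGS.has(t)) { i += 1; continue; } // skip flag and its value
    if (t.startsWith('--')) continue; // boolean flag, or --flag=value (value rides along)
    words.push(t);
  }
  return words.join(' ');
}

console.log(goalSeed(['fix', 'the', 'failing', 'auth', 'tests', '--dry-run']));
// → "fix the failing auth tests"
```

Note that `--flag=value` carries its value inside the same token, so skipping that one token is enough; only separate-value flags need the extra `i += 1`.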
Once you have seed text, **clarify it**:

- If the seed is shorter than ~20 characters or is vague ("fix it", "make it work", "refactor"), ask one clarifying `AskUserQuestion` with two scoped options AND a free-form Other. The two options should be your best guesses at specific goals based on the repo you see.
- Rewrite the user's wording into a single concrete sentence that names the files, symbols, or behaviors involved when possible. Show the rewritten text back via `AskUserQuestion`: `Is this the goal you want? → [Yes, this is accurate] / [Let me rephrase]`. If they rephrase, iterate once more.

**Never** modify the goal text after this phase. The runtime stores a hash of the final goal and will refuse to run if the text drifts.

---

## Phase 3 — plan assembly

Walk these slots in order. For each slot, skip the question if `$ARGUMENTS` already contains the corresponding flag. Otherwise ask a **focused** `AskUserQuestion` (one question per slot). Use concrete options drawn from the repo (`package.json` scripts, `Makefile` targets, existing test dirs) rather than generic placeholders.

1. **Acceptance criteria** — how will we know the goal is met? Examples: "all tests in `tests/auth/` pass", "`npm run lint` exits 0", "the `GET /health` handler no longer 500s". Collect 1–3 concrete, verifiable criteria. Store them as a single shell-quoted list that will become part of the goal text in `loop-companion.mjs`'s view. If the user cannot name any, default to "`<testCmd>` exits 0" once you have a test command.
2. **Test command** (`--test-cmd`) — inspect `package.json`, `Makefile`, `pyproject.toml`, or `go.mod` to propose plausible candidates. Offer 2–3 real options in `AskUserQuestion`. If nothing fits, ask the user to type one. Skippable only if the user explicitly confirms "no test command".
3. **Lint command** (`--lint-cmd`) — optional. Propose `npm run lint` / `eslint .` / `ruff check .` / `golangci-lint run` based on what you see. Offer a "(skip lint)" option.
4. **Type command** (`--type-cmd`) — optional. Propose `npm run typecheck` / `tsc --noEmit` / `mypy .` / `go vet ./...` based on the repo. Offer a "(skip type check)" option.
5. **Budget**:
   - `--max-iter` (default 20) — ask only if the user did not supply it. Offer `5 (small bug)`, `10 (feature)`, `20 (default)`, or free-form via Other.
   - `--max-time` (default `1h`) — ask only if missing. Offer `15m`, `30m`, `1h (default)`, `3h`.
   - `--max-calls` (default 200) — do not ask unless the user raised the topic.
6. **Model + reasoning effort**:
   - Defaults are `gpt-5.4` with `xhigh` reasoning. Do NOT ask about these by default — they are fine for almost every loop.
   - Only ask if the user mentioned cost, speed, or a specific model in their free-form input. Offer `gpt-5.4 + xhigh (Recommended)`, `spark + high (faster, cheaper)`, and Other.
7. **Execution mode** (`--wait` vs `--background`):
   - If `$ARGUMENTS` already contains `--wait` or `--background`, use it.
   - Otherwise estimate loop length from `--max-iter` and whether `--dry-run` is set. Recommend `--wait` when max-iter ≤ 2 OR `--dry-run` is set. Recommend `--background` otherwise (a 10-iteration real loop can take 10+ minutes — blocking the session for that long is rarely what the user wants).
   - Ask with `AskUserQuestion`, putting the recommendation first, suffixed `(Recommended)`:
     - `Wait for results in this session` — maps to `--wait`; Claude stays attached and streams every iteration transcript inline.
     - `Run in background` — maps to `--background`; the runtime spawns a detached Node worker via `spawnDetached`, writes `.loop/loop.pid`, and returns immediately. Progress is viewable via `/cloop:status`; the worker survives Claude Code restarts.
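The slot 7 recommendation reduces to a small decision rule. A sketch under the thresholds stated above (`recommendMode` is a hypothetical helper, not part of the plugin):

```javascript
// Recommend --wait for tiny or dry-run loops, --background otherwise.
// An explicit user flag always wins.
function recommendMode({ maxIter = 20, dryRun = false, explicitFlag = null }) {
  if (explicitFlag === '--wait' || explicitFlag === '--background') return explicitFlag;
  return maxIter <= 2 || dryRun ? '--wait' : '--background';
}

console.log(recommendMode({ maxIter: 10 }));               // → "--background"
console.log(recommendMode({ maxIter: 2 }));                // → "--wait"
console.log(recommendMode({ maxIter: 10, dryRun: true })); // → "--wait"
```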
### What `--wait` actually means

`--wait` is **foreground mode**. The `cloop start --wait` call blocks in this tool session until the loop finishes (goal-met / converged / budget exhausted / error). You see the full iteration transcript inline. Use this for short loops, dry runs, and when you want to watch the reasoning live.

`--background` is **OS-detached mode**. `cloop start --background` spawns a detached Node worker that owns the loop, returns a pid + log path immediately, and exits. The worker runs independently of the Claude Code session, so closing Claude does NOT kill it. Use this for real multi-minute loops; poll progress via `/cloop:status`.

Never pass both flags. If the user typed both in `$ARGUMENTS`, stop and ask.

---

## Phase 4 — approval

Build an approval summary and show it to the user via `AskUserQuestion` with a `preview` on the recommended option. The preview must contain the full plan as plain text, e.g.:

```
Goal: fix the failing adds() tests in tests/add.test.js
Acceptance criteria:
- node --test tests/ exits 0
- no test files are deleted or skipped
Test command: node --test tests/
Lint command: (skip)
Type command: (skip)
Budget: 10 iter / 30m / 200 calls
Model: gpt-5.4 (xhigh reasoning)
Mode: background
```

The question: `Start the loop with this plan?` with options:

1. `Start loop (Recommended)` — accept and run
2. `Edit the goal` — re-enter Phase 2
3. `Edit plan details` — re-enter Phase 3 from whichever slot the user wants changed
4. `Cancel` — stop without running

If the user chose `Start loop`, proceed to Phase 5. If they chose an edit option, loop back to the relevant phase and rebuild the approval preview. **Never** skip the approval step, even if `$ARGUMENTS` already contains every flag — the one exception is when `$ARGUMENTS` includes `--yes` (power-user skip), in which case you still print the plan summary but invoke `loop-companion.mjs` without waiting for explicit confirmation.

---

## Phase 5 — invoke the runtime

Assemble a single Bash command with every confirmed flag. Quote strings with spaces. Use the exact flags `loop-companion.mjs start` expects:

```bash
node "${CLAUDE_PLUGIN_ROOT}/scripts/loop-companion.mjs" start \
  <--wait|--background> \
  --goal "<final goal text>" \
  --max-iter <N> \
  --max-time <dur> \
  --test-cmd "<cmd>" \
  [--lint-cmd "<cmd>"] \
  [--type-cmd "<cmd>"] \
  [--model <m>] \
  [--effort <e>] \
  [--dry-run] \
  [--nproposals <N>]
```

Model + effort flags are OPTIONAL because the runtime already defaults to `gpt-5.4` / `xhigh`. Pass them only if the user explicitly chose something different. Test / lint / type commands are optional but nearly always desirable — at minimum, include `--test-cmd` if one was confirmed.
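One way to honor "quote strings with spaces" is to never build a shell string at all: assemble an argv array and hand it to `spawn`, which bypasses shell quoting entirely. A hypothetical sketch (`buildStartArgs` is not part of the plugin):

```javascript
// Build the start invocation as an argv array; each plan value stays a
// single argument no matter how many spaces it contains.
function buildStartArgs(plan) {
  const args = [
    'start',
    plan.background ? '--background' : '--wait',
    '--goal', plan.goal,
    '--max-iter', String(plan.maxIter),
    '--max-time', plan.maxTime,
  ];
  if (plan.testCmd) args.push('--test-cmd', plan.testCmd);
  if (plan.lintCmd) args.push('--lint-cmd', plan.lintCmd);
  if (plan.typeCmd) args.push('--type-cmd', plan.typeCmd);
  return args; // e.g. spawn('node', [companionPath, ...args])
}

const args = buildStartArgs({
  background: true,
  goal: 'fix the failing adds() tests in tests/add.test.js',
  maxIter: 10,
  maxTime: '30m',
  testCmd: 'node --test tests/',
});
console.log(args.length); // → 10
```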
### Foreground run (`--wait`)

Run the command as a plain blocking `Bash` call and return its stdout **verbatim**. Do not paraphrase per-iteration transcripts. Do not summarize.

### Background run (`--background`)

The runtime detaches the worker itself, so the Bash call is effectively immediate. Run it as a plain blocking `Bash` call (do NOT use `run_in_background: true` — the runtime already spawns a detached Node worker that outlives the tool call). Return its stdout verbatim. Then tell the user: `CodexLoop started in the background. Check /cloop:status for progress.`

---

## Argument rules

- Preserve every argument the user typed. Never strip `--wait`, `--background`, `--dry-run`, or any other flag.
- Never pass `--wait` and `--background` at the same time.
- If the user already supplied `--goal`, `--test-cmd`, etc., skip that interview slot but still include them in the final plan preview.
- If the companion script prints `Codex CLI is not ready`, stop and tell the user to run `/codex:setup`.
package/commands/status.md
ADDED

@@ -0,0 +1,10 @@
---
description: Show the current CodexLoop state, iteration log, budget consumption, and convergence metrics
argument-hint: '[--json]'
disable-model-invocation: true
allowed-tools: Bash(node:*)
---

!`node "${CLAUDE_PLUGIN_ROOT}/scripts/loop-companion.mjs" status $ARGUMENTS`

Present the output verbatim to the user. Preserve the status table, iteration history table, and any reported stopReason or error exactly as printed. Do not summarize.
package/commands/stop.md
ADDED

@@ -0,0 +1,12 @@
---
description: Stop a running background CodexLoop (SIGTERM; graceful shutdown at next iteration boundary)
argument-hint: '[--force]'
disable-model-invocation: true
allowed-tools: Bash(node:*)
---

!`node "${CLAUDE_PLUGIN_ROOT}/scripts/loop-companion.mjs" stop $ARGUMENTS`

Present the output verbatim to the user.

If the command prints that the worker did not exit within 60 seconds, tell the user they can re-run `/cloop:stop --force` to send SIGKILL immediately. Do not run `--force` automatically.
package/package.json
ADDED

@@ -0,0 +1,60 @@
{
  "name": "@jhlee0619/codexloop",
  "version": "0.1.0",
  "description": "CodexLoop \u2014 iterative improvement loop that drives OpenAI Codex as a multi-role critic (evaluate \u2192 suggest \u2192 rank \u2192 apply \u2192 validate \u2192 record).",
  "type": "module",
  "bin": {
    "cloop": "./bin/cloop"
  },
  "files": [
    "bin/",
    "scripts/",
    "prompts/",
    "schemas/",
    "assets/",
    ".claude-plugin/plugin.json",
    ".claude-plugin/marketplace.json",
    ".codex-plugin/plugin.json",
    "commands/",
    "skills/",
    "README.md",
    "LICENSE"
  ],
  "engines": {
    "node": ">=20"
  },
  "scripts": {
    "test": "node tests/unit/state.test.mjs && node tests/unit/rank.test.mjs && node tests/unit/convergence.test.mjs && node tests/integration/loop-smoke.test.mjs",
    "test:unit": "node tests/unit/state.test.mjs && node tests/unit/rank.test.mjs && node tests/unit/convergence.test.mjs",
    "test:integration": "node tests/integration/loop-smoke.test.mjs",
    "test:smoke": "node tests/integration/loop-smoke.test.mjs"
  },
  "keywords": [
    "codex",
    "claude-code",
    "plugin",
    "loop",
    "iterative",
    "refinement",
    "review",
    "adversarial",
    "convergence",
    "cloop",
    "cli"
  ],
  "repository": {
    "type": "git",
    "url": "git+https://github.com/jhlee0619/CodexLoop.git"
  },
  "bugs": {
    "url": "https://github.com/jhlee0619/CodexLoop/issues"
  },
  "homepage": "https://github.com/jhlee0619/CodexLoop#readme",
  "license": "MIT",
  "author": "jhlee",
  "preferGlobal": true,
  "os": [
    "linux",
    "darwin"
  ]
}
package/prompts/evaluate.md
ADDED

@@ -0,0 +1,91 @@
<role>
You are Codex operating as a strict code reviewer and adversarial critic for CodexLoop.
Your job is to evaluate the current state of a repository against a specific goal and report defects, risks, and missing pieces that block acceptance.
</role>

<task>
Evaluate whether the code in the current working directory satisfies the goal below. Return a single structured JSON evaluation matching the provided schema. You are the LAST reviewer before the loop decides what to do next.

Goal:
{{GOAL}}

Acceptance criteria:
{{ACCEPTANCE_CRITERIA}}

Iteration number: {{ITERATION_INDEX}} of at most {{MAX_ITERATIONS}}

Previous iteration summaries (most recent last, or "(none)" on the first iteration):
{{LAST_ITERATIONS}}

Append-only progress log (last 30 lines, or "(none)"):
{{PROGRESS_LOG_TAIL}}

Current open issues carried from previous iterations (or "(none)"):
{{OPEN_ISSUES}}

Git diff since the loop's seed commit (or "(none)" on the first iteration):
{{DIFF_SINCE_SEED}}

Latest test/lint/type command results (or "(none)"):
{{CURRENT_CHECK_STATE}}
</task>

<operating_stance>
Default to skepticism. Assume the code can fail in subtle, high-cost, or user-visible ways until the evidence says otherwise.
Do not give credit for good intent, partial fixes, or likely follow-up work.
If something only works on the happy path, treat that as a real weakness.
Be terse and specific — the runtime parses your JSON; it does not read prose outside the contract.
</operating_stance>

<attack_surface>
Prioritize failures that would block goal acceptance:
- acceptance criteria that are not yet met
- failing tests, type errors, lint errors introduced since the loop started
- edge cases, null/empty/timeout handling, partial-failure safety
- invariant violations, ordering assumptions, concurrency hazards
- reward-hacking patterns from prior iterations (deleted tests, weakened assertions, goal drift, error swallowing)
- regressions in files unrelated to the recent patch
</attack_surface>

<review_method>
Walk the git diff first. For each open issue from the previous iteration, decide whether it is resolved, still open, or newly worse. Look at test output if present. If the goal's acceptance criteria are now met AND the latest validate pass confirms it, set verdict to "goal-met" and completionClaim to true; otherwise set verdict to "in-progress" and return the concrete open issues that still block acceptance. Set verdict to "blocked" only if you genuinely cannot see a forward path (e.g., contradictory requirements, missing prerequisites).

Do NOT generate patches. Do NOT propose fixes. That is the job of the next step (suggest). Your job is evaluation only.
</review_method>

<finding_bar>
Report only concrete, verifiable issues. No style nitpicks. Every open issue must answer:
1. what is broken or missing
2. why it blocks the goal
3. where in the code (file + lines if possible)
4. what severity (low / medium / high / critical)
5. optionally, a one-line recommendation (not a full fix)

Prefer one strong finding over three weak ones.
</finding_bar>

<structured_output_contract>
Return ONLY valid JSON matching the provided schema. No prose before or after.
Required fields:
- verdict: "goal-met" | "in-progress" | "blocked"
- distanceFromGoal: number in [0, 1] where 0 means the goal is fully met
- openIssues: array of concrete issues (id, severity, summary; file/line/recommendation optional)
- passingTests / failingTests / typeErrors / lintErrors: integers (use 0 if you cannot tell)
- completionClaim: boolean, true ONLY if you are confident the goal is met AND a validate pass will confirm it
- rationale: 2–5 sentences in your own words explaining the verdict
</structured_output_contract>
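A minimal response satisfying this contract might look like the following (all values illustrative, including the issue id and file path):

```json
{
  "verdict": "in-progress",
  "distanceFromGoal": 0.4,
  "openIssues": [
    {
      "id": "timer-leak",
      "severity": "high",
      "summary": "login handler never clears its retry timer, so the auth test hangs",
      "file": "src/auth.js"
    }
  ],
  "passingTests": 12,
  "failingTests": 2,
  "typeErrors": 0,
  "lintErrors": 0,
  "completionClaim": false,
  "rationale": "Two auth tests still fail. The retry-timer leak blocks the acceptance criteria, so the goal is not yet met."
}
```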
<grounding_rules>
Every finding must be defensible from the diff, the test output, or the repository files you actually inspect.
Do not invent files, lines, test failures, or runtime behavior.
If a conclusion depends on an inference, state that in the finding body and keep severity honest.
</grounding_rules>

<final_check>
Before finalizing:
- every openIssue has a concrete file or symbol reference where possible
- severities are proportional to actual impact on the goal
- completionClaim is false unless you are sure the goal is met
- verdict matches the evidence
- output is valid JSON and nothing else
</final_check>
package/prompts/rank.md
ADDED

@@ -0,0 +1,97 @@
<role>
You are Codex operating as a strict single judge that scores and ranks multiple implementation proposals against a fixed six-dimension rubric.
</role>

<task>
Evaluate the proposals below and produce a ranking. Score each proposal on exactly six dimensions, compute the weighted score, pick a winner, and give concrete "why not" reasons for every loser.

Goal:
{{GOAL}}

Acceptance criteria:
{{ACCEPTANCE_CRITERIA}}

Evaluation that produced these proposals (for context):
{{EVALUATION}}

Proposals:
{{PROPOSALS}}
</task>

<scoring_rubric>
Score every proposal on these six dimensions, each in [0, 1]:

1. **correctness** — does the patch actually address the open issues, and does it avoid introducing new bugs?
2. **requirementSatisfaction** — does the patch advance the acceptance criteria?
3. **simplicity** — is the patch as small and direct as reasonable?
4. **maintainability** — will the resulting code be easy to reason about and modify later?
5. **riskInverse** — 1 = very safe, 0 = extremely risky. Inverted so higher is always better.
6. **testability** — does the patch leave the code easy to test, and does it strengthen test coverage when the acceptance criteria require it?

Runtime-enforced weights (the judge does not override these):
correctness = 0.30
requirementSatisfaction = 0.25
maintainability = 0.15
testability = 0.10
simplicity = 0.10
riskInverse = 0.10

Weighted score for proposal P:
weighted(P) = 0.30*correctness
            + 0.25*requirementSatisfaction
            + 0.15*maintainability
            + 0.10*testability
            + 0.10*simplicity
            + 0.10*riskInverse

Report the weighted score in each proposal's `weighted` field. The runtime will re-compute the weighted sum from your dimensional scores and use ITS computation if the two disagree — so be internally consistent.
</scoring_rubric>
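The weighted sum can be sketched directly from the table above (the runtime's actual recomputation lives in `scripts/lib/rank.mjs`; `weighted` here is illustrative):

```javascript
// Fixed rubric weights; they sum to 1.0, so a proposal scoring 1.0 on
// every dimension gets a weighted score of (approximately) 1.0.
const WEIGHTS = {
  correctness: 0.30,
  requirementSatisfaction: 0.25,
  maintainability: 0.15,
  testability: 0.10,
  simplicity: 0.10,
  riskInverse: 0.10,
};

function weighted(scores) {
  return Object.entries(WEIGHTS).reduce(
    (sum, [dim, w]) => sum + w * scores[dim],
    0,
  );
}

console.log(weighted({
  correctness: 0.9,
  requirementSatisfaction: 0.8,
  maintainability: 0.7,
  testability: 0.6,
  simplicity: 0.5,
  riskInverse: 0.9,
}).toFixed(3)); // → "0.775"
```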
<reward_hacking_guard>
Automatically assign `correctness = 0` to any proposal that:
- deletes a test file
- weakens an existing test assertion without a clear `testChangeJustification`
- skips or disables tests (skip, xit, only, test.skip, etc.)
- swallows errors silently to make tests pass (catch blocks that do nothing)
- mutates the goal text, acceptance criteria, or loop state

The runtime will additionally reject such proposals at apply time. Scoring them zero now avoids picking one and wasting the iteration.
</reward_hacking_guard>
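A crude sketch of the kind of screening this guard implies, purely illustrative and far weaker than the runtime's apply-time checks (which presumably live in `scripts/lib/apply.mjs`):

```javascript
// Flag two of the listed patterns in a patch's added lines: skipped or
// disabled tests, and silently swallowed errors. Hypothetical helper,
// not runtime code.
const SKIP_RE = /\b(?:test|it|describe)\.(?:skip|only)\b|\bxit\s*\(|\bxdescribe\s*\(/;
const SWALLOW_RE = /catch\s*(?:\([^)]*\))?\s*\{\s*\}/;

function screenPatch(patch) {
  const addedLines = patch
    .split('\n')
    .filter((line) => line.startsWith('+') && !line.startsWith('+++'))
    .join('\n');
  if (SKIP_RE.test(addedLines)) return 'skips or disables tests';
  if (SWALLOW_RE.test(addedLines)) return 'swallows errors silently';
  return null;
}

console.log(screenPatch('+it.skip("edge case", () => {});')); // → "skips or disables tests"
console.log(screenPatch('+const total = add(2, 3);'));        // → null
```

Only added lines are scanned, so a patch that merely moves past an existing `catch {}` in context lines is not penalized.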
<winner_selection>
The winner is the proposal with the highest weighted score.

If the top two proposals are within 0.03 of each other on weighted score, break the tie in this strict order:
1. higher correctness
2. higher riskInverse (lower risk)
3. smaller patch size (fewer lines in the diff)

State the applied tiebreaker (or that no tiebreaker was needed) in the `tiebreaker` field.

For EVERY non-winning proposal, record a concrete "why not" reason in `rejections`. Do NOT write "lower weighted score" — say what was actually worse about it ("correctness 0.55 vs winner 0.89 because this patch leaves the race condition in authSession.ts:42 unfixed").
</winner_selection>

<structured_output_contract>
Return ONLY valid JSON matching the provided schema. No prose before or after.
Shape:
- scores: array, one entry per proposal with the six dimensional scores + weighted + optional notes
- winner: { id, justification, confidence }
- rejections: object keyed by rejected proposal id, each value a concrete reason string
- tiebreaker: string or null
</structured_output_contract>

<grounding_rules>
Every score must be justifiable from the proposal's patch content and its justification field.
Do not prefer the first proposal, the longest justification, or the most verbose patch by default.
Do not invent deficiencies not supported by the patch text.
</grounding_rules>

<final_check>
Before finalizing:
- scores.length equals the number of proposals provided
- winner.id exactly matches one of the proposal ids in `scores`
- rejections has one entry per non-winning proposal, each with a concrete reason
- weighted values for each proposal match the formula above (small floating-point rounding is fine)
- no proposal that modifies tests was scored higher than its testChangeJustification supports
- output is valid JSON and nothing else
</final_check>
package/prompts/suggest.md
ADDED

@@ -0,0 +1,69 @@
<role>
You are Codex operating simultaneously as a solution generator, a refactoring advisor, and a test designer for CodexLoop.
You propose concrete patches that the runtime will rank, apply, and validate.
</role>

<task>
Given the evaluation below, produce EXACTLY {{N_PROPOSALS}} distinct proposals that would move the code closer to the goal. Each proposal is a unified diff that `git apply` will accept on top of the current working tree.

Goal:
{{GOAL}}

Acceptance criteria:
{{ACCEPTANCE_CRITERIA}}

Iteration number: {{ITERATION_INDEX}} of at most {{MAX_ITERATIONS}}

Latest evaluation (from the reviewer):
{{EVALUATION}}

Recent rejected proposals (do NOT re-propose these; explain in justifications if you considered and rejected them):
{{RECENT_REJECTIONS}}

Git diff since the loop's seed commit:
{{DIFF_SINCE_SEED}}
</task>

<proposal_rules>
1. Produce EXACTLY {{N_PROPOSALS}} proposals. Two proposals that differ only in formatting, naming, or comment wording are INVALID — the runtime will reject them. Each proposal must take a materially different approach OR target a different part of the code.

2. Each proposal's `patch` must be a unified diff in the exact format `git apply` accepts. Prefer small, surgical diffs over sweeping rewrites. If you genuinely believe no code change is appropriate yet (extremely rare), use the empty string "" for `patch` and explain in `justification`.

3. When the evaluation's open issues suggest multiple kinds of fix, bias toward producing proposals that embody different roles:
   - **Solution generator**: the direct, minimal fix.
   - **Refactor advisor**: a cleaner structural change that also fixes the issue.
   - **Test designer**: add or strengthen tests to catch the issue (only if the acceptance criteria mention tests and the existing tests are weak).

4. If your patch modifies or adds any test file, set `modifiesTests: true` AND fill `testChangeJustification` with a concrete explanation (adding a new assertion, covering a new edge case, fixing an incorrectly written test). **Never propose deleting a test file or weakening an existing assertion.** The runtime will reject such proposals at apply time and you will waste the iteration.

5. For each proposal, list `filesTouched` explicitly — exact paths relative to the repo root.

6. `estimatedRisk` and `estimatedImpact` are each one of low / medium / high. Be honest; the judge uses them as a sanity check.

7. Each proposal's `id` is a short lowercase stable tag: "a", "b", "c" (or longer if you prefer, but stick to `[a-z][a-z0-9_-]*`).

8. `reviewNotes` is optional: a short list of known shortcomings, assumptions, or follow-ups you'd flag for the judge.
</proposal_rules>

<grounding_rules>
Every proposal must be grounded in the evaluation's open issues or in unmet acceptance criteria. Do not invent file names, API surfaces, or library behavior you cannot verify by reading the repository.
Patches must apply cleanly on top of the current working tree — no speculative baseline, no half-applied changes.
Do not propose goal-scope changes (rewriting the goal, changing acceptance criteria). The goal is immutable.
</grounding_rules>

<structured_output_contract>
Return ONLY valid JSON matching the provided schema. No prose before or after.
Top-level shape: `{ "proposals": [ { ... }, { ... }, ... ] }` with exactly {{N_PROPOSALS}} entries.
Each entry has id, approach, patch, justification, estimatedRisk, estimatedImpact, filesTouched, modifiesTests, and (if modifiesTests is true) testChangeJustification.
</structured_output_contract>
|
|
60
|
+
<final_check>
|
|
61
|
+
Before finalizing:
|
|
62
|
+
- proposals.length == {{N_PROPOSALS}}
|
|
63
|
+
- all proposal ids are unique and match `^[a-z][a-z0-9_-]*$`
|
|
64
|
+
- no proposal deletes a test file
|
|
65
|
+
- no proposal weakens an assertion without a testChangeJustification
|
|
66
|
+
- every patch is a valid unified diff (or the empty string in the rare no-op case)
|
|
67
|
+
- proposals address the evaluation's open issues, not unrelated concerns
|
|
68
|
+
- output is valid JSON and nothing else
|
|
69
|
+
</final_check>
|
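The final check above can be sketched as a quick structural validation. This is a hypothetical helper for illustration, not part of the CodexLoop runtime, and it only covers the checks that are mechanically verifiable:

```python
import json
import re

# Matches the id format the prompt mandates: ^[a-z][a-z0-9_-]*$
ID_RE = re.compile(r"^[a-z][a-z0-9_-]*$")

def check_proposals(raw: str, n_proposals: int) -> list[str]:
    """Return a list of problems; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    problems = []
    proposals = data.get("proposals", [])
    if len(proposals) != n_proposals:
        problems.append(f"expected {n_proposals} proposals, got {len(proposals)}")
    ids = [p.get("id", "") for p in proposals]
    if len(set(ids)) != len(ids) or not all(ID_RE.match(i) for i in ids):
        problems.append("proposal ids must be unique and match ^[a-z][a-z0-9_-]*$")
    for p in proposals:
        # modifiesTests without a justification is rejected at apply time
        if p.get("modifiesTests") and not p.get("testChangeJustification"):
            problems.append(f"proposal {p.get('id')}: modifiesTests "
                            "without testChangeJustification")
    return problems
```

Semantic checks (whether a patch weakens an assertion, whether it addresses the evaluation's open issues) still require the judge.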
|
@@ -0,0 +1,65 @@
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "https://codexloop/evaluation.schema.json",
  "title": "CodexLoop Evaluation Output",
  "description": "Structured JSON that Codex must return for the evaluate step. Passed to `codex exec --output-schema`.",
  "type": "object",
  "additionalProperties": false,
  "required": [
    "verdict",
    "distanceFromGoal",
    "openIssues",
    "passingTests",
    "failingTests",
    "typeErrors",
    "lintErrors",
    "completionClaim",
    "rationale"
  ],
  "properties": {
    "verdict": {
      "type": "string",
      "enum": ["goal-met", "in-progress", "blocked"],
      "description": "Reviewer's top-level assessment of how close the current code is to the goal."
    },
    "distanceFromGoal": {
      "type": "number",
      "minimum": 0,
      "maximum": 1,
      "description": "0 = goal fully met, 1 = starting from scratch. Self-reported, treated as a hint."
    },
    "openIssues": {
      "type": "array",
      "description": "Concrete defects, risks, or missing pieces that would block acceptance.",
      "items": {
        "type": "object",
        "additionalProperties": false,
        "required": ["id", "severity", "summary"],
        "properties": {
          "id": { "type": "string", "description": "Stable short id like 'issue-001'." },
          "severity": {
            "type": "string",
            "enum": ["low", "medium", "high", "critical"]
          },
          "summary": { "type": "string" },
          "file": { "type": ["string", "null"] },
          "lineStart": { "type": ["integer", "null"], "minimum": 0 },
          "lineEnd": { "type": ["integer", "null"], "minimum": 0 },
          "recommendation": { "type": ["string", "null"] }
        }
      }
    },
    "passingTests": { "type": "integer", "minimum": 0 },
    "failingTests": { "type": "integer", "minimum": 0 },
    "typeErrors": { "type": "integer", "minimum": 0 },
    "lintErrors": { "type": "integer", "minimum": 0 },
    "completionClaim": {
      "type": "boolean",
      "description": "True only if the reviewer is confident the goal is fully met and a final validate pass should confirm."
    },
    "rationale": {
      "type": "string",
      "description": "Two to five sentences explaining the verdict, in the reviewer's own words."
    }
  }
}