@ai-dev-methodologies/rlp-desk 0.2.4 → 0.3.1
- package/README.md +32 -1
- package/docs/TODO-verification-next.md +59 -0
- package/docs/architecture.md +25 -0
- package/docs/getting-started.md +11 -4
- package/docs/internal/verification-policy-gap-analysis.md +523 -0
- package/docs/internal/verification-strategy-research.md +2097 -0
- package/docs/protocol-reference.md +21 -10
- package/package.json +1 -1
- package/src/commands/rlp-desk.md +97 -4
- package/src/governance.md +219 -1
- package/src/scripts/init_ralph_desk.zsh +221 -12
package/docs/protocol-reference.md
CHANGED
|
@@ -136,24 +136,28 @@ Written by the Worker at the end of every iteration. Provides a structured JSON
|
|
|
136
136
|
|
|
137
137
|
### Done Claim (`<slug>-done-claim.json`)
|
|
138
138
|
|
|
139
|
-
Written by the Worker when claiming
|
|
139
|
+
Written by the Worker when claiming work is complete. **Must include `execution_steps`** (governance §1f — always-on, not optional):
|
|
140
140
|
|
|
141
141
|
```json
|
|
142
142
|
{
|
|
143
|
-
"
|
|
144
|
-
"
|
|
145
|
-
"
|
|
146
|
-
|
|
147
|
-
|
|
148
|
-
"
|
|
149
|
-
"
|
|
150
|
-
|
|
143
|
+
"us_id": "US-001",
|
|
144
|
+
"claims": ["AC1: add(10,5)=15", "AC2: subtract(10,5)=5"],
|
|
145
|
+
"execution_steps": [
|
|
146
|
+
{"step": "write_test", "ac_id": "AC1", "command": null, "exit_code": null, "summary": "wrote tests/test_add.py"},
|
|
147
|
+
{"step": "verify_red", "ac_id": "AC1", "command": "pytest tests/", "exit_code": 1, "summary": "RED: test fails"},
|
|
148
|
+
{"step": "implement", "ac_id": "AC1", "command": null, "exit_code": null, "summary": "created add()"},
|
|
149
|
+
{"step": "verify_green", "ac_id": "AC1", "command": "pytest tests/", "exit_code": 0, "summary": "GREEN: 3 passed"},
|
|
150
|
+
{"step": "verify_e2e", "ac_id": "AC1", "command": "python -c '...'", "exit_code": 0, "summary": "E2E matches"},
|
|
151
|
+
{"step": "commit", "ac_id": "AC1", "command": "git commit", "exit_code": 0, "summary": "committed abc1234"}
|
|
152
|
+
]
|
|
151
153
|
}
|
|
152
154
|
```
|
|
153
155
|
|
|
156
|
+
`execution_steps` must be a JSON array of objects. Step types from §1f vocabulary: `write_test`, `verify_red`, `implement`, `verify_green`, `refactor`, `verify_e2e`, `commit`, `verify`.
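A shape check for `execution_steps` could look like the sketch below. It is not part of the package: the function name `validate_execution_steps` is hypothetical, the field list is read off the example done-claim above, and treating all five fields as required is an assumption. The step vocabulary is copied from the line above, which includes `verify_e2e`.

```python
# Step vocabulary as listed in the protocol-reference line above.
STEP_TYPES = {
    "write_test", "verify_red", "implement", "verify_green",
    "refactor", "verify_e2e", "commit", "verify",
}
# Field names taken from the example done-claim; requiring all of them is an assumption.
STEP_FIELDS = ("step", "ac_id", "command", "exit_code", "summary")


def validate_execution_steps(done_claim: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks valid."""
    steps = done_claim.get("execution_steps")
    if not isinstance(steps, list):
        return ["execution_steps must be a JSON array"]
    problems = []
    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            problems.append(f"step {i}: not an object")
            continue
        missing = [f for f in STEP_FIELDS if f not in step]
        if missing:
            problems.append(f"step {i}: missing {missing}")
        if step.get("step") not in STEP_TYPES:
            problems.append(f"step {i}: unknown step type {step.get('step')!r}")
    return problems
```

Fields such as `command` and `exit_code` may be `null` (as in the `write_test` and `implement` steps of the example), so the sketch only checks that the keys are present, not that they carry values.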
|
|
157
|
+
|
|
154
158
|
### Verify Verdict (`<slug>-verify-verdict.json`)
|
|
155
159
|
|
|
156
|
-
Written by the Verifier after independent verification.
|
|
160
|
+
Written by the Verifier after independent verification. **Must include `reasoning`** (governance §1f — always-on, not optional).
|
|
157
161
|
|
|
158
162
|
**Tmux mode polling:** In tmux mode, after dispatching the Verifier, the shell Leader polls for the existence of `verify-verdict.json` (same pattern as `iter-signal.json`). Once it appears, the Leader reads the `verdict` and `recommended_state_transition` fields via `jq` to decide whether to write a COMPLETE sentinel, continue iterating, or write a BLOCKED sentinel.
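The polling pattern described above can be sketched as follows. The package itself does this in a shell Leader with `jq`; this Python version only illustrates the control flow. The function name `wait_for_verdict`, the timeout, and the poll interval are assumptions, not values from the package.

```python
import json
import time
from pathlib import Path


def wait_for_verdict(path: Path, timeout_s: float = 600.0, poll_s: float = 1.0) -> dict:
    """Poll until the verdict file exists, then return the two fields the Leader reads."""
    deadline = time.monotonic() + timeout_s  # timeout is an assumption of this sketch
    while not path.exists():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no verdict file at {path}")
        time.sleep(poll_s)
    data = json.loads(path.read_text(encoding="utf-8"))
    return {
        "verdict": data.get("verdict"),
        "recommended_state_transition": data.get("recommended_state_transition"),
    }
```

Polling on existence alone can race with a writer that has created the file but not finished writing it. The sketch does not handle that case: a partial file would surface as a `json.JSONDecodeError`.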
|
|
159
163
|
|
|
@@ -172,6 +176,13 @@ Written by the Verifier after independent verification.
|
|
|
172
176
|
"evidence": "test -f calc.py → exit 0"
|
|
173
177
|
}
|
|
174
178
|
],
|
|
179
|
+
"reasoning": [
|
|
180
|
+
{"check": "IL-1 Evidence Gate", "decision": "pass", "basis": "ran pytest fresh, exit 0"},
|
|
181
|
+
{"check": "Layer Enforcement", "decision": "pass", "basis": "L1+L3 required and executed"},
|
|
182
|
+
{"check": "Test Sufficiency", "decision": "pass", "basis": "9 tests / 3 AC = 3.0 ratio"},
|
|
183
|
+
{"check": "Anti-Gaming", "decision": "pass", "basis": "no tautological tests, no mocking"},
|
|
184
|
+
{"check": "Worker Process Audit", "decision": "pass", "basis": "verify_red present with exit≠0"}
|
|
185
|
+
],
|
|
175
186
|
"missing_evidence": [],
|
|
176
187
|
"issues": [
|
|
177
188
|
{
|
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@ai-dev-methodologies/rlp-desk",
|
|
3
|
-
"version": "0.
|
|
3
|
+
"version": "0.3.1",
|
|
4
4
|
"description": "Fresh-context iterative loops for Claude Code — autonomous task completion with independent verification",
|
|
5
5
|
"scripts": {
|
|
6
6
|
"postinstall": "node scripts/postinstall.js",
|
package/src/commands/rlp-desk.md
CHANGED
|
@@ -24,6 +24,17 @@ Ask about these items one by one (or in small groups):
|
|
|
24
24
|
1. **Slug** — short identifier (e.g., `auth-refactor`). Suggest one, ask if OK.
|
|
25
25
|
2. **Objective** — what the loop achieves
|
|
26
26
|
3. **User Stories** — discrete units with testable acceptance criteria. Propose a breakdown, ask the user to confirm/modify.
|
|
27
|
+
- Apply INVEST criteria: each US must be Independent, Negotiable, Valuable, Estimable, Small, Testable.
|
|
28
|
+
- Each AC MUST use Given/When/Then format with **domain language only** (no class names, API paths, DB tables):
|
|
29
|
+
```
|
|
30
|
+
Given [precondition in domain language]
|
|
31
|
+
When [action in domain language]
|
|
32
|
+
Then [expected outcome with quantitative criteria]
|
|
33
|
+
```
|
|
34
|
+
- Include at least 1 negative test per US ("must NOT happen").
|
|
35
|
+
- Include boundary cases per US (empty, max, zero, concurrent).
|
|
36
|
+
- **Task Type** per US: `code` | `visual` | `content` | `integration` | `infra`
|
|
37
|
+
- **Risk Level** per US (governance §1c): `LOW` | `MEDIUM` | `HIGH` | `CRITICAL`
|
|
27
38
|
4. **Iteration Unit** — what one worker does per iteration. Explicitly ask:
|
|
28
39
|
- "One US per iteration (bounded, incremental verification)?"
|
|
29
40
|
- "All stories at once (faster, single verification)?"
|
|
@@ -42,8 +53,17 @@ Ask about these items one by one (or in small groups):
|
|
|
42
53
|
11. **Consensus Scope** — If consensus enabled, ask: "Consensus on every verify (all, default) or only on final verify (final-only)?" Default: all.
|
|
43
54
|
12. **Max Iterations** — suggest based on story count, ask if OK.
|
|
44
55
|
|
|
45
|
-
After all items are confirmed
|
|
46
|
-
|
|
56
|
+
After all items are confirmed:
|
|
57
|
+
|
|
58
|
+
1. **Ambiguity Gate (IL-2)** — score each AC per governance §1a IL-2 (6 dimensions, 0-12 points).
|
|
59
|
+
If ANY AC scores below 6: **REJECT** — refine that AC before proceeding.
|
|
60
|
+
If all ACs score 6-9: **WARN** — proceed with logged warning, show low-scoring dimensions.
|
|
61
|
+
If all ACs score 10-12: **PASS** — clean.
|
|
62
|
+
Present the score table to the user before proceeding.
|
|
63
|
+
2. Present the full contract summary.
|
|
64
|
+
3. **Self-Verification** — Ask: "Enable self-verification? Worker records step-by-step evidence, Verifier cross-validates process. Recommended for MEDIUM+ risk." Default: yes for HIGH/CRITICAL, no for LOW/MEDIUM.
|
|
65
|
+
4. On approval, offer to run `init`.
|
|
66
|
+
|
|
47
67
|
Do NOT create files during brainstorm.
|
|
48
68
|
Do NOT auto-decide iteration unit — the user MUST explicitly choose.
|
|
49
69
|
|
|
@@ -98,6 +118,7 @@ Options (parse from `$ARGUMENTS`):
|
|
|
98
118
|
- `all`: consensus runs on every verify (current behavior)
|
|
99
119
|
- `final-only`: consensus only on final ALL verify
|
|
100
120
|
- `--debug` — enable debug logging (writes to logs/<slug>/debug.log)
|
|
121
|
+
- `--with-self-verification` — enable campaign-level self-verification analysis. After COMPLETE, Leader analyzes all iteration records (done-claims + verdicts) and generates a campaign self-verification summary with patterns and recommendations for next planning cycle. (Note: execution_steps and reasoning are ALWAYS recorded per governance §1f — this flag adds post-campaign analysis.)
|
|
101
122
|
|
|
102
123
|
### Mode Selection
|
|
103
124
|
|
|
@@ -181,6 +202,12 @@ rm -f .claude/ralph-desk/memos/<slug>-verify-verdict.json
|
|
|
181
202
|
- Read `.claude/ralph-desk/prompts/<slug>.worker.prompt.md`
|
|
182
203
|
- Combine with iteration number + memory contract
|
|
183
204
|
- Write to `.claude/ralph-desk/logs/<slug>/iter-NNN.worker-prompt.md` (audit trail)
|
|
205
|
+
- Note: Worker ALWAYS records execution_steps in done-claim.json per governance §1f. No flag needed.
|
|
206
|
+
|
|
207
|
+
**④½ Contract review** (agent mode only)
|
|
208
|
+
- Before dispatching Worker, spawn a lightweight review: "Is this iteration contract sufficient to achieve the US's AC? Any missing steps?"
|
|
209
|
+
- If `--debug`: debug_log `[EXEC] iter=N phase=contract_review result=<ok|issues>`
|
|
210
|
+
- In tmux mode: skip (shell leader cannot reason). Log: `[EXEC] iter=N phase=contract_review skipped=tmux_mode`
|
|
184
211
|
|
|
185
212
|
**⑤ Execute Worker**
|
|
186
213
|
- If `--debug`: debug_log `[EXEC] iter=N phase=worker engine=<engine> model=<model> dispatched=true`
|
|
@@ -234,6 +261,7 @@ Bash("codex exec --model <worker_codex_model> --reasoning-effort <worker_codex_r
|
|
|
234
261
|
- Verifier checks all AC at once
|
|
235
262
|
|
|
236
263
|
**⑦a Dispatch Verifier**
|
|
264
|
+
- Note: Verifier ALWAYS records reasoning in verify-verdict.json per governance §1f. No flag needed.
|
|
237
265
|
- If `--debug`: debug_log `[EXEC] iter=N phase=verifier engine=<engine> model=<model> scope=<us_id> dispatched=true`
|
|
238
266
|
|
|
239
267
|
If `--verifier-engine claude` (default):
|
|
@@ -269,11 +297,16 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
|
|
|
269
297
|
1. Read `issues` array, sort by severity (`critical` → `major` → `minor`)
|
|
270
298
|
2. Build structured fix contract with traceability rule
|
|
271
299
|
3. Include `fix_hint` values labeled `(suggestion, non-authoritative)` if present
|
|
272
|
-
4.
|
|
273
|
-
5.
|
|
300
|
+
4. Include impacted tests from test-spec (so Worker can run them before and after the fix)
|
|
301
|
+
5. Increment `consecutive_failures` in `status.json`
|
|
302
|
+
6. If `consecutive_failures >= 3` for same US → **Architecture Escalation** (governance §7¾): stop fixing, report to user
|
|
303
|
+
7. Go to ⑧ with fix contract as next Worker contract
|
|
274
304
|
- `request_info` → Leader reads Verifier's questions, decides outcome (or relays to Worker in next contract) → go to ⑧
|
|
275
305
|
- `blocked` → write BLOCKED sentinel, stop
|
|
276
306
|
- If `--debug`: debug_log `[EXEC] iter=N phase=verdict engine=<engine> verdict=<pass|fail|request_info> us_id=<us_id>`
|
|
307
|
+
- If `--debug`: debug_log `[EXEC] iter=N phase=layer_check L1=<status> L2=<status> L3=<status> L4=<status>`
|
|
308
|
+
- If `--debug`: debug_log `[EXEC] iter=N phase=sufficiency test_count=<N> ac_count=<N> ratio=<N> verdict=<pass|fail>`
|
|
309
|
+
- If `--debug`: debug_log `[EXEC] iter=N phase=checkpoint level=<1|2> evidence=<summary>`
|
|
277
310
|
- If `--debug` and consensus: debug_log `[EXEC] iter=N phase=consensus claude=<verdict> codex=<verdict> round=<N>`
|
|
278
311
|
|
|
279
312
|
**⑧ Write result log and report to user, continue loop**
|
|
@@ -288,6 +321,63 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
|
|
|
288
321
|
At loop end (COMPLETE, BLOCKED, or TIMEOUT):
|
|
289
322
|
- If `--debug`: debug_log `[VALIDATE] result=<COMPLETE|BLOCKED|TIMEOUT> iterations=<N> verified_us=<list>`
|
|
290
323
|
|
|
324
|
+
**⑨ Campaign Self-Verification** (when `--with-self-verification` is enabled):
|
|
325
|
+
|
|
326
|
+
After the loop ends, the Leader performs post-campaign analysis:
|
|
327
|
+
|
|
328
|
+
1. **Collect data**: Read all archived `iter-NNN.result.md`, done-claim.json (with execution_steps), and verify-verdict.json (with reasoning) from `logs/<slug>/`
|
|
329
|
+
2. **Write cumulative data**: `logs/<slug>/self-verification-data.json` — normalized iteration records
|
|
330
|
+
3. **Generate versioned report**: `logs/<slug>/self-verification-report-NNN.md` (NNN = auto-increment from existing reports)
|
|
331
|
+
4. **Report to user**: Display the full report content
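The "auto-increment from existing reports" rule in step 3 might be implemented along these lines. This is a sketch under assumptions: the helper name `next_report_path` is hypothetical, and the three-digit zero padding and the starting number 001 are inferred from the `NNN` placeholder rather than stated in the source.

```python
import re
from pathlib import Path

REPORT_RE = re.compile(r"^self-verification-report-(\d+)\.md$")


def next_report_path(log_dir: Path) -> Path:
    """Pick the next self-verification-report-NNN.md name in log_dir."""
    numbers = [
        int(m.group(1))
        for p in log_dir.iterdir()
        if (m := REPORT_RE.match(p.name))
    ]
    # Assumption: numbering starts at 001 and uses three-digit zero padding.
    next_n = max(numbers, default=0) + 1
    return log_dir / f"self-verification-report-{next_n:03d}.md"
```

Taking the maximum existing number rather than the file count keeps numbering monotonic even if an older report has been deleted.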
|
|
332
|
+
|
|
333
|
+
Report template (9 sections):
|
|
334
|
+
|
|
335
|
+
```
|
|
336
|
+
# Campaign Self-Verification Report: <slug>
|
|
337
|
+
Report Version: NNN | Generated: timestamp | Campaign: slug — objective
|
|
338
|
+
Schema Version: governance hash | Data Quality: N% iterations complete
|
|
339
|
+
|
|
340
|
+
## 1. Automated Validation Summary
|
|
341
|
+
Table: Iter | US | Worker Verdict | Verifier Verdict | Outcome
|
|
342
|
+
|
|
343
|
+
## 2. Failure Deep Dive (per failed iteration)
|
|
344
|
+
Per failure: Worker steps → Verifier reasoning → Root cause → Resolution
|
|
345
|
+
|
|
346
|
+
## 3. Worker Process Quality (§1f audit)
|
|
347
|
+
Table: Iter | US | Steps | verify_red? | RED exit≠0? | verify_green? | Test-First? | E2E? | AC linked?
|
|
348
|
+
Aggregate: TDD compliance %, RED confirmation %, E2E evidence %, step completeness %
|
|
349
|
+
Audit: each step must have type from §1f vocabulary + ac_id + command + exit_code
|
|
350
|
+
|
|
351
|
+
## 4. Verifier Judgment Quality (§1f audit)
|
|
352
|
+
Table: Iter | US | Checks | All Basis? | Independent? | IL-1? | Layer? | Sufficiency? | Anti-Gaming? | Worker Audit?
|
|
353
|
+
Aggregate: Reasoning completeness %, Independent verification %, §1f category coverage %
|
|
354
|
+
Audit: verify all 5 mandatory check categories (IL-1, Layer Enforcement, Test Sufficiency, Anti-Gaming, Worker Process Audit) are present
|
|
355
|
+
|
|
356
|
+
## 5. AC Lifecycle
|
|
357
|
+
Table: US | AC | First Claimed (iter) | First Verified (iter) | Reopen Count | Final Status
|
|
358
|
+
|
|
359
|
+
## 6. Test-Spec Adherence
|
|
360
|
+
Spec completeness (layers/commands/mappings present)
|
|
361
|
+
Spec execution fidelity (exact checks run and cited)
|
|
362
|
+
|
|
363
|
+
## 7. Patterns: Strengths & Weaknesses
|
|
364
|
+
Strengths: what worked well
|
|
365
|
+
Weaknesses: systemic issues
|
|
366
|
+
|
|
367
|
+
## 8. Recommendations for Next Cycle
|
|
368
|
+
### Brainstorm (missing scenarios/constraints) — citing iter/AC
|
|
369
|
+
### PRD (ambiguous or oversized ACs) — citing iter/AC
|
|
370
|
+
### Test-Spec (missing layers, weak mappings) — citing iter/AC
|
|
371
|
+
|
|
372
|
+
## 9. Blind Spots
|
|
373
|
+
What this report CANNOT prove from available data
|
|
374
|
+
|
|
375
|
+
## Data Provenance Rule
|
|
376
|
+
Report content MUST be derivable from: done-claim.json (execution_steps), verify-verdict.json (reasoning),
|
|
377
|
+
PRD, and test-spec. Information from source code inspection that is not in these files must be excluded
|
|
378
|
+
or explicitly marked as "[source-inspection]" with justification.
|
|
379
|
+
```
|
|
380
|
+
|
|
291
381
|
### Circuit Breaker
|
|
292
382
|
- context-latest.md unchanged 3 iterations → BLOCKED
|
|
293
383
|
- Same acceptance criterion fails 2 consecutive iterations → upgrade model, retry once, then BLOCKED
|
|
@@ -329,6 +419,8 @@ Remove:
|
|
|
329
419
|
- `.claude/ralph-desk/logs/<slug>/session-config.json`
|
|
330
420
|
- `.claude/ralph-desk/logs/<slug>/worker-heartbeat.json`
|
|
331
421
|
- `.claude/ralph-desk/logs/<slug>/verifier-heartbeat.json`
|
|
422
|
+
- `.claude/ralph-desk/memos/<slug>-escalation.md`
|
|
423
|
+
Note: `logs/<slug>/self-verification-data.json` and `self-verification-report-NNN.md` are intentionally preserved across clean for historical comparison.
|
|
332
424
|
|
|
333
425
|
If `--kill-session` is passed, also kill any tmux session matching `rlp-desk-<slug>-*`:
|
|
334
426
|
```bash
|
|
@@ -359,6 +451,7 @@ Run options:
|
|
|
359
451
|
--verify-consensus Cross-engine consensus verification
|
|
360
452
|
--consensus-scope SCOPE When consensus runs: all|final-only (default: all)
|
|
361
453
|
--debug Debug logging (logs/<slug>/debug.log)
|
|
454
|
+
--with-self-verification Campaign self-verification analysis (post-loop report)
|
|
362
455
|
```
|
|
363
456
|
|
|
364
457
|
## Architecture
|
package/src/governance.md
CHANGED
|
@@ -10,10 +10,208 @@ The Leader orchestrates, while Worker/Verifier run in isolated fresh contexts ev
|
|
|
10
10
|
- **Fresh context per iteration**: Worker/Verifier start fresh every time. No prior conversation.
|
|
11
11
|
- **Filesystem = memory**: State exists only on the filesystem (PRD, memory, context, memos).
|
|
12
12
|
- **Worker claim ≠ complete**: A Worker's DONE is merely a claim. The Verifier must independently verify before it's confirmed.
|
|
13
|
+
- **Worker scope is bounded**: Worker implements only the contracted US per iteration (Scope Lock). Out-of-scope changes are flagged by the Verifier.
|
|
13
14
|
- **Verifier is independent**: The Verifier judges based on evidence alone, without knowledge of the Worker's reasoning process.
|
|
14
15
|
- **Sentinels are Leader-owned**: Only the Leader writes COMPLETE/BLOCKED sentinels.
|
|
15
16
|
- **Supported engines**: claude (default; models: haiku, sonnet, opus) and codex (opt-in via `--worker-engine codex` / `--verifier-engine codex`).
|
|
16
17
|
|
|
18
|
+
## 1a. Iron Laws
|
|
19
|
+
|
|
20
|
+
Absolute rules that cannot be violated under any circumstance.
|
|
21
|
+
|
|
22
|
+
```
|
|
23
|
+
IL-1: NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE
|
|
24
|
+
IL-2: NO INIT WITHOUT AC QUALITY SCORE >= 6
|
|
25
|
+
IL-3: NO PASS WITH TODO IN ANY REQUIRED VERIFICATION LAYER
|
|
26
|
+
IL-4: NO PASS WITHOUT TEST COUNT >= AC COUNT x 3
|
|
27
|
+
```
|
|
28
|
+
|
|
29
|
+
**IL-1: Evidence Mandate**
|
|
30
|
+
Required: every verdict must reference at least one command execution with its exit code.
|
|
31
|
+
A verdict without command output evidence is automatically invalid.
|
|
32
|
+
Additional signal: phrases such as "should pass", "probably works", "seems to",
|
|
33
|
+
"looks correct", "appears to" (including but not limited to) without command evidence
|
|
34
|
+
confirm the violation but are not the primary check — command output presence is.
|
|
35
|
+
`request_info` is not an escape from IL-1 — if evidence was collectible, it must be collected.
|
|
36
|
+
|
|
37
|
+
**IL-2: Ambiguity Gate**
|
|
38
|
+
Each AC is scored 0-12 on 6 dimensions (0/1/2 points each):
|
|
39
|
+
- Single behavior: 0=multiple behaviors, 1=mostly single, 2=exactly one testable behavior
|
|
40
|
+
- Domain language: 0=technical terms (class names, API paths, DB tables), 1=mixed, 2=pure domain language
|
|
41
|
+
- Stakeholder clarity: 0=unclear who benefits, 1=implied, 2=explicitly stated
|
|
42
|
+
- Portability: 0=tech-stack specific, 1=mostly portable, 2=fully stack-independent
|
|
43
|
+
- Concrete example: 0=vague ("some input"), 1=partial specifics, 2=exact values with expected results
|
|
44
|
+
- Independence: 0=requires another AC to pass first, 1=loosely coupled, 2=fully self-contained
|
|
45
|
+
|
|
46
|
+
Score interpretation:
|
|
47
|
+
- 0-5: **REJECT** — init blocked. ACs too ambiguous.
|
|
48
|
+
- 6-9: **WARN** — init proceeds with logged warning. Show which dimensions scored low.
|
|
49
|
+
- 10-12: **PASS** — clean.
|
|
50
|
+
|
|
51
|
+
IL-2 is a pre-run gate: scoring MUST happen during brainstorm or at init time.
|
|
52
|
+
In tmux mode, IL-2 must be satisfied before `/rlp-desk run` is invoked.
|
|
53
|
+
|
|
54
|
+
Calibration example:
|
|
55
|
+
- Score 5 (REJECT): "Given a user, When they log in, Then the system works correctly"
|
|
56
|
+
→ single:1 (login is one behavior), domain:2 (domain terms only), stakeholder:1 (implied user), portability:1 (mostly portable), concrete:0 ("works correctly" is vague), independence:0 (implies registration AC) = 5
|
|
57
|
+
- Score 7 (WARN): "Given a registered user with email 'test@example.com' and valid password, When they submit the login form, Then they are redirected to the dashboard within 2 seconds"
|
|
58
|
+
→ single:2 (exactly one: login redirect), domain:1 ("submit" is slightly technical), stakeholder:1 (implied end user), portability:1 (web-specific "form"), concrete:2 (specific email, 2s threshold), independence:0 (requires registration) = 7
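The rubric and thresholds above translate directly into a small scoring function. The sketch below is illustrative only: the dimension keys and the names `score_ac` and `gate` are hypothetical, and resolving a mix of WARN and PASS scores to the worst per-AC result is an assumption, since the text defines the bands per AC.

```python
DIMENSIONS = (
    "single_behavior", "domain_language", "stakeholder_clarity",
    "portability", "concrete_example", "independence",
)


def score_ac(points: dict[str, int]) -> tuple[int, str]:
    """Sum six 0/1/2 dimension scores and map the total to REJECT/WARN/PASS."""
    for dim in DIMENSIONS:
        if points.get(dim) not in (0, 1, 2):
            raise ValueError(f"{dim} must be 0, 1, or 2")
    total = sum(points[dim] for dim in DIMENSIONS)
    if total <= 5:
        return total, "REJECT"
    if total <= 9:
        return total, "WARN"
    return total, "PASS"


def gate(acs: dict[str, dict[str, int]]) -> str:
    """Overall result is the worst per-AC result (assumption for mixed scores)."""
    order = {"REJECT": 0, "WARN": 1, "PASS": 2}
    results = [score_ac(points)[1] for points in acs.values()]
    # Assumption: an empty AC set is rejected rather than passed.
    return min(results, key=order.__getitem__, default="REJECT")
```

Fed the two calibration examples above, `score_ac` yields totals of 5 (REJECT) and 7 (WARN), matching the stated sums.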
|
|
59
|
+
|
|
60
|
+
**IL-3: Layer Completeness**
|
|
61
|
+
Verification layers:
|
|
62
|
+
- L1: Unit Test — function-level, mocks allowed. Always required.
|
|
63
|
+
- L2: Integration — real external services (DB, API, Redis). Required when external dependencies exist.
|
|
64
|
+
- L3: E2E Simulation — known input → full pipeline → output comparison. Always required.
|
|
65
|
+
- L4: Deploy Verify — production environment checks. Required when deploying.
|
|
66
|
+
|
|
67
|
+
Layer requirements per US are determined by risk classification (§1c).
|
|
68
|
+
Non-applicable layers must be explicitly marked "N/A — {reason}" in the test-spec.
|
|
69
|
+
Any required layer section with TODO or blank = automatic Verifier FAIL.
|
|
70
|
+
See §1d for full layer definitions.
|
|
71
|
+
|
|
72
|
+
**IL-4: Test Sufficiency**
|
|
73
|
+
Each AC must have >= 3 tests covering >= 2 of 3 categories (happy path, negative/error, boundary).
|
|
74
|
+
Tests must be mapped to ACs in the test-spec's criteria-to-test mapping table.
|
|
75
|
+
Only tests listed in this mapping count toward IL-4.
|
|
76
|
+
Count < 3 per any AC = FAIL.
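A per-AC sufficiency check following the IL-4 text might be sketched as below. The input shape (AC id mapped to the category of each mapped test) and the category labels `happy`, `negative`, and `boundary` are assumptions made for illustration. The real criteria-to-test mapping lives in the test-spec, whose format is not shown here.

```python
# Category labels are hypothetical shorthand for the three categories named in IL-4.
CATEGORIES = {"happy", "negative", "boundary"}


def il4_failures(mapping: dict[str, list[str]]) -> dict[str, str]:
    """mapping: AC id -> category of each mapped test. Returns failing ACs with a reason."""
    failures = {}
    for ac_id, test_categories in mapping.items():
        covered = set(test_categories) & CATEGORIES
        if len(test_categories) < 3:
            failures[ac_id] = f"{len(test_categories)} tests, need >= 3"
        elif len(covered) < 2:
            failures[ac_id] = f"{len(covered)} categories covered, need >= 2"
    return failures
```

The check is applied per AC. The aggregate form in the Iron Laws block (test count >= AC count x 3) is then implied, but the reverse does not hold: nine tests on one AC would not cover two others.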
|
|
77
|
+
|
|
78
|
+
### Enforcement
|
|
79
|
+
|
|
80
|
+
| Iron Law | Checked by | When | Method |
|
|
81
|
+
|----------|-----------|------|--------|
|
|
82
|
+
| IL-1 | Verifier | verification time | mechanical (command output presence) |
|
|
83
|
+
| IL-2 | Leader | brainstorm/init | scored (6-dimension rubric) |
|
|
84
|
+
| IL-3 | Verifier | verification time | mechanical (TODO/blank scan) |
|
|
85
|
+
| IL-4 | Verifier | verification time | scored (test count per AC) |
|
|
86
|
+
|
|
87
|
+
- Violation of any Iron Law overrides all other verdict considerations — verdict MUST be FAIL.
|
|
88
|
+
- When an Iron Law is violated, the verdict MUST be `fail` regardless of uncertainty.
|
|
89
|
+
`request_info` remains valid only when the Verifier cannot determine whether an Iron Law
|
|
90
|
+
was violated (e.g., cannot access test files, command execution blocked).
|
|
91
|
+
- You (the Leader) cannot waive Iron Laws. Only the user can explicitly waive an Iron Law
|
|
92
|
+
for a specific US with documented justification in the PRD.
|
|
93
|
+
|
|
94
|
+
## 1b. Evidence Gate
|
|
95
|
+
|
|
96
|
+
This section operationalizes IL-1 (Evidence Mandate) into a concrete step-by-step protocol.
|
|
97
|
+
|
|
98
|
+
Before any verdict, the Verifier MUST follow this 5-step process:
|
|
99
|
+
|
|
100
|
+
1. **IDENTIFY**: What command proves this claim?
|
|
101
|
+
2. **RUN**: Execute the command (fresh, not cached or recalled)
|
|
102
|
+
3. **READ**: Full output + exit code + failure count
|
|
103
|
+
4. **VERIFY**: Does output confirm the claim?
|
|
104
|
+
- YES → state claim WITH evidence (command + output + exit code)
|
|
105
|
+
- NO → state actual status with evidence
|
|
106
|
+
5. **ONLY THEN**: Issue verdict
|
|
107
|
+
|
|
108
|
+
Skipping any step = invalid verification (IL-1 violation).
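The five steps can be sketched as a pair of helpers in which a verdict can only be built from freshly captured evidence. This is an illustration, not the package's Verifier: the names `Evidence`, `run_fresh`, and `verdict_for` are hypothetical, and comparing only the exit code is a simplification of the VERIFY step.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class Evidence:
    command: list[str]
    exit_code: int
    output: str


def run_fresh(command: list[str]) -> Evidence:
    """RUN + READ: execute the command now and capture its output and exit code."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return Evidence(command, proc.returncode, proc.stdout + proc.stderr)


def verdict_for(claim: str, evidence: Evidence, expected_exit: int = 0) -> dict:
    """VERIFY + ONLY THEN: the verdict always carries the evidence it rests on."""
    ok = evidence.exit_code == expected_exit  # simplification: exit code only
    return {
        "claim": claim,
        "decision": "pass" if ok else "fail",
        "basis": f"{' '.join(evidence.command)} -> exit {evidence.exit_code}",
    }
```

Because `verdict_for` requires an `Evidence` value, there is no code path that produces a decision without a command having been run, which is the property IL-1 asks for.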
|
|
109
|
+
|
|
110
|
+
### Forbidden Patterns
|
|
111
|
+
- "should pass", "probably works" without command output
|
|
112
|
+
- Trusting Worker's success reports without independent re-execution
|
|
113
|
+
- Partial verification ("linter passed" ≠ "tests passed" ≠ "all AC met")
|
|
114
|
+
- "Code inspection" as substitute for automated command execution
|
|
115
|
+
- Citing cached/prior results instead of fresh execution
|
|
116
|
+
|
|
117
|
+
## 1c. Risk Classification
|
|
118
|
+
|
|
119
|
+
Each US is classified by risk level during brainstorm. Higher risk = more verification layers.
|
|
120
|
+
|
|
121
|
+
| Level | Description | Required Layers | Extra Requirements |
|
|
122
|
+
|-------|-------------|-----------------|-------------------|
|
|
123
|
+
| LOW | Read-only, docs, config | L1 + L3 | — |
|
|
124
|
+
| MEDIUM | New feature, refactor | L1 + L2 (if external deps) + L3 | — |
|
|
125
|
+
| HIGH | Production deploy, data migration | L1 + L2 + L3 + L4 | — |
|
|
126
|
+
| CRITICAL | Financial, security, medical | L1 + L2 + L3 + L4 | consensus + mutation testing (when mutation testing tool is configured in test-spec) |
|
|
127
|
+
|
|
128
|
+
L2 is included in MEDIUM+ rows but is marked N/A when no external services exist (see §1d L2 "When N/A" clause).
|
|
129
|
+
|
|
130
|
+
### Who Decides
|
|
131
|
+
- During brainstorm: user assigns risk level per US (Leader suggests, user confirms).
|
|
132
|
+
- If brainstorm was skipped: Leader assigns based on PRD content at first run iteration.
|
|
133
|
+
- Risk level is recorded in PRD per US. Cannot be downgraded without user approval.
|
|
134
|
+
|
|
135
|
+
### Examples
|
|
136
|
+
- LOW: README update, adding comments, .env.example
|
|
137
|
+
- MEDIUM: REST API endpoint, React component, business rule
|
|
138
|
+
- HIGH: database migration, CI/CD change, deployment config
|
|
139
|
+
- CRITICAL: payment processing, auth, encryption, PII handling
|
|
140
|
+
|
|
141
|
+
## 1d. Verification Layers
|
|
142
|
+
|
|
143
|
+
Four layers of verification, each targeting a different failure mode.
|
|
144
|
+
|
|
145
|
+
### L1: Unit Test (always required)
|
|
146
|
+
- Scope: function/method level, isolated logic
|
|
147
|
+
- Mocks: allowed for external boundaries only
|
|
148
|
+
- Evidence: test runner output with pass/fail count + exit code
|
|
149
|
+
|
|
150
|
+
### L2: Integration (required when external dependencies exist)
|
|
151
|
+
- Scope: interaction with real external services (database, API, message queue, cache)
|
|
152
|
+
- Mocks: NOT allowed — use real or containerized services
|
|
153
|
+
- Evidence: integration test output with connection confirmation + data verification
|
|
154
|
+
- When N/A: "N/A — no external services (pure computation/transformation)"
|
|
155
|
+
|
|
156
|
+
### L3: E2E Simulation (always required)
|
|
157
|
+
- Scope: known input → full pipeline → quantitative output comparison
|
|
158
|
+
- Evidence: input data + actual output + expected output + comparison result
|
|
159
|
+
- For simple utilities: E2E = "run function with known input, verify output matches expected"
|
|
160
|
+
|
|
161
|
+
### L4: Deploy Verify (required when deploying)
|
|
162
|
+
- Scope: production/staging environment health after deployment
|
|
163
|
+
- Evidence: health check response + deployment status + monitoring state
|
|
164
|
+
- When N/A: "N/A — no deployment (library/tool, local-only change)"
|
|
165
|
+
|
|
166
|
+
### Rules
|
|
167
|
+
- L1 and L3: always required regardless of risk level.
|
|
168
|
+
- L2: required for MEDIUM+ risk when external services are involved.
|
|
169
|
+
- L4: required for HIGH+ risk when deployment occurs.
|
|
170
|
+
- Layer requirements per US are determined by risk classification (§1c).
|
|
171
|
+
- Non-required layers must be marked "N/A — {reason}" per IL-3. Blank or TODO = FAIL.
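Read together, the §1c table and the rules above reduce to a small lookup. The sketch below is one hedged reading: the function name and the two boolean parameters are hypothetical, and it returns only the required layers. The mandatory "N/A" reason text for the remaining layers still has to be written by whoever authors the test-spec.

```python
RISK_ORDER = ("LOW", "MEDIUM", "HIGH", "CRITICAL")


def required_layers(risk: str, has_external_deps: bool, deploys: bool) -> list[str]:
    """Return the layers required for a US, following the Rules list above."""
    rank = RISK_ORDER.index(risk)  # raises ValueError for unknown risk levels
    layers = ["L1", "L3"]  # always required
    if rank >= 1 and has_external_deps:  # MEDIUM+ with external services
        layers.append("L2")
    if rank >= 2 and deploys:  # HIGH+ when deployment occurs
        layers.append("L4")
    return sorted(layers)
```

For example, `required_layers("MEDIUM", True, False)` returns `["L1", "L2", "L3"]`. Under this reading, a LOW-risk US never requires L2 or L4, whatever its dependencies.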
|
|
172
|
+
|
|
173
|
+
## 1e. Verification Checkpoints
|
|
174
|
+
|
|
175
|
+
Verification occurs at two boundaries, not as a single final event.
|
|
176
|
+
|
|
177
|
+
### Checkpoint 1: Story/Unit (per-US)
|
|
178
|
+
- Trigger: Worker signals verify with us_id = specific US
|
|
179
|
+
- Scope: that US's acceptance criteria (L1 pass is verified as part of layer enforcement in Verifier step 5)
|
|
180
|
+
- On fail: fix loop (§7½)
|
|
181
|
+
|
|
182
|
+
### Checkpoint 2: Release Readiness (us_id=ALL)
|
|
183
|
+
- Trigger: all individual US pass Checkpoint 1 → Worker signals verify with us_id = "ALL"
|
|
184
|
+
- Scope: all AC + L2 integration (if applicable) + L3 E2E Simulation + L4 deploy (if applicable) + mutation score (if CRITICAL, when mutation testing tool is configured in test-spec)
|
|
185
|
+
- On fail: fix loop; escalation to user if 3 consecutive failures
|
|
186
|
+
|
|
187
|
+
### Relationship to Existing Flow
|
|
188
|
+
- Checkpoint 1 = existing per-US verify (§7a). No change.
|
|
189
|
+
- Checkpoint 2 = existing "us_id=ALL final full verify" (§7a). Adds explicit layer scope.
|
|
190
|
+
- No new iteration steps are introduced.
|
|
191
|
+
|
|
192
|
+
## 1f. Execution & Judgment Traceability
|
|
193
|
+
|
|
194
|
+
Every iteration, Worker and Verifier MUST record their process and reasoning — not just results.
|
|
195
|
+
This is the default behavior, not an optional flag. Without it, IL-1 (Evidence Mandate) is incomplete.
|
|
196
|
+
|
|
197
|
+
### Worker: execution_steps in done-claim.json
|
|
198
|
+
Worker records what was done, in what order, with command evidence in `done-claim.json`:
|
|
199
|
+
- Each step includes: what action, which AC, command executed, exit code, summary
|
|
200
|
+
- Step types: `write_test`, `verify_red`, `implement`, `verify_green`, `refactor`, `commit`, `verify`
|
|
201
|
+
- This proves the Worker followed test-first approach and did not skip steps
|
|
202
|
+
|
|
203
|
+
### Verifier: reasoning in verify-verdict.json
|
|
204
|
+
Verifier records WHY each judgment was made in `verify-verdict.json`:
|
|
205
|
+
- Each check includes: what was checked, decision (pass/fail), and the specific evidence basis
|
|
206
|
+
- Checks include: IL-1 Evidence Gate, Layer Enforcement, Test Sufficiency, Anti-Gaming, Worker Process Audit
|
|
207
|
+
- This proves the Verifier actually performed each check rather than rubber-stamping
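An audit of the `reasoning` array against the five check categories could look like this sketch. The helper name `missing_checks` is hypothetical, and matching a category by prefix of the `check` string (so that "IL-1 Evidence Gate" counts for "IL-1") is an assumption about how check names are written.

```python
MANDATORY_CHECKS = (
    "IL-1", "Layer Enforcement", "Test Sufficiency",
    "Anti-Gaming", "Worker Process Audit",
)


def missing_checks(verdict: dict) -> list[str]:
    """Return mandatory categories lacking a reasoning entry with a non-empty basis."""
    entries = verdict.get("reasoning") or []
    present = [
        str(e.get("check", "")) for e in entries
        if isinstance(e, dict) and e.get("basis")
    ]
    # Assumption: a category is covered when some check name starts with it.
    return [
        name for name in MANDATORY_CHECKS
        if not any(check.startswith(name) for check in present)
    ]
```

An entry without a `basis` is treated as absent, since a decision with no stated evidence is exactly what this section is meant to rule out.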
|
|
208
|
+
|
|
209
|
+
### Why This Is Default (Not Optional)
|
|
210
|
+
- IL-1 says "no claims without evidence" — this applies to Worker AND Verifier
|
|
211
|
+
- Without execution_steps, Worker's done-claim is an unsubstantiated assertion
|
|
212
|
+
- Without reasoning, Verifier's verdict is an unsubstantiated judgment
|
|
213
|
+
- Both are archived in `logs/<slug>/` per existing audit trail pattern
|
|
214
|
+
|
|
17
215
|
## 2. Roles
|
|
18
216
|
|
|
19
217
|
### Leader (current session)
|
|
@@ -182,6 +380,7 @@ Characteristics:
|
|
|
182
380
|
│ ├── <slug>-done-claim.json # Worker's completion claim (runtime)
|
|
183
381
|
│ ├── <slug>-iter-signal.json # Worker's iteration signal (runtime)
|
|
184
382
|
│ ├── <slug>-verify-verdict.json # Verifier's verdict (runtime)
|
|
383
|
+
│ ├── <slug>-escalation.md # Architecture escalation report (tmux mode, §7¾)
|
|
185
384
|
│ ├── <slug>-complete.md # SENTINEL (Leader only)
|
|
186
385
|
│ └── <slug>-blocked.md # SENTINEL (Leader only)
|
|
187
386
|
├── plans/
|
|
@@ -191,6 +390,8 @@ Characteristics:
|
|
|
191
390
|
├── iter-NNN.worker-prompt.md # Audit trail prompt copy
|
|
192
391
|
├── iter-NNN.verifier-prompt.md # Audit trail prompt copy
|
|
193
392
|
├── iter-NNN.result.md # Iteration result (leader-measured + git-measured)
|
|
393
|
+
├── self-verification-data.json # Cumulative campaign data (--with-self-verification)
|
|
394
|
+
├── self-verification-report-NNN.md # Versioned campaign analysis report (--with-self-verification)
|
|
194
395
|
└── status.json # Leader's loop state
|
|
195
396
|
```
|
|
196
397
|
|
|
@@ -311,12 +512,29 @@ Traceability: only changes that resolve a listed issue are allowed.
|
|
|
311
512
|
Every change must be justified by the issue it addresses.
|
|
312
513
|
```
|
|
313
514
|
|
|
515
|
+
## 7¾. Architecture Escalation
|
|
516
|
+
|
|
517
|
+
Note: Circuit Breaker (§8) fires first at 2 consecutive failures (model upgrade + retry). If the retry also fails (3rd consecutive failure), Architecture Escalation applies. The CB retry counts toward the consecutive_failures counter.
|
|
518
|
+
|
|
519
|
+
If 3+ consecutive fix attempts fail for the same US:
|
|
520
|
+
|
|
521
|
+
1. **STOP fixing symptoms** — the problem is likely architectural, not a bug.
|
|
522
|
+
2. **Leader reports to user**: "3 consecutive fix attempts failed for US-{id}. This suggests an architectural issue, not a simple bug."
|
|
523
|
+
3. **Include in report**:
|
|
524
|
+
- What was attempted in each fix
|
|
525
|
+
- What specifically kept failing
|
|
526
|
+
- Hypothesis: why fixes are not sticking
|
|
527
|
+
4. **Do NOT attempt fix #4** without user guidance.
|
|
528
|
+
5. **Options**: refactor architecture, simplify the US, split the US, or mark BLOCKED.
|
|
529
|
+
|
|
530
|
+
In tmux mode: Leader writes `<slug>-escalation.md` with the report and sets BLOCKED sentinel with reason "architecture-escalation."
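The interplay between the fix loop, the Circuit Breaker retry, and Architecture Escalation can be summarized as a function of the `consecutive_failures` counter. This is a sketch of one reading of the note above. The returned action names are hypothetical labels, not identifiers from the package.

```python
def next_action(consecutive_failures: int) -> str:
    """Map the per-US consecutive failure count to the next Leader action."""
    if consecutive_failures >= 3:
        # Stop fixing and report to the user; no fix #4 without guidance.
        return "architecture_escalation"
    if consecutive_failures == 2:
        # Circuit Breaker fires first: upgrade the model and retry once.
        return "upgrade_model_and_retry"
    if consecutive_failures == 1:
        return "fix_loop"
    return "continue"
```

Because the Circuit Breaker retry counts toward the same counter, a failed retry takes the count from 2 to 3 and lands directly in the escalation branch.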
|
|
531
|
+
|
|
314
532
|
## 8. Circuit Breaker
|
|
315
533
|
|
|
316
534
|
| Condition | Verdict |
|
|
317
535
|
|-----------|---------|
|
|
318
536
|
| context-latest.md unchanged for 3 consecutive iterations | BLOCKED |
|
|
319
|
-
| Same acceptance criterion fails 2 consecutive iterations | Upgrade model, retry once; if still failing → BLOCKED |
|
|
537
|
+
| Same acceptance criterion fails 2 consecutive iterations | Upgrade model, retry once; if still failing → Architecture Escalation (§7¾) → BLOCKED |
|
|
320
538
|
| 3 consecutive **fail** verdicts on 3 unique criterion IDs | Upgrade to opus, retry once; if still failing → BLOCKED |
|
|
321
539
|
| max_iter reached | TIMEOUT (report to user) |
|
|
322
540
|
|