@ai-dev-methodologies/rlp-desk 0.2.4 → 0.3.1

@@ -136,24 +136,28 @@ Written by the Worker at the end of every iteration. Provides a structured JSON
136
136
 
137
137
  ### Done Claim (`<slug>-done-claim.json`)
138
138
 
139
- Written by the Worker when claiming all work is complete:
139
+ Written by the Worker when claiming work is complete. **Must include `execution_steps`** (governance §1f — always-on, not optional):
140
140
 
141
141
  ```json
142
142
  {
143
- "iteration": 3,
144
- "claimed_at_utc": "2025-01-15T10:30:00Z",
145
- "summary": "All user stories implemented and tests passing",
146
- "stories_completed": ["US-001", "US-002"],
147
- "evidence": {
148
- "test_output": "8 passed in 0.05s",
149
- "files_created": ["calc.py", "test_calc.py"]
150
- }
143
+ "us_id": "US-001",
144
+ "claims": ["AC1: add(10,5)=15", "AC2: subtract(10,5)=5"],
145
+ "execution_steps": [
146
+ {"step": "write_test", "ac_id": "AC1", "command": null, "exit_code": null, "summary": "wrote tests/test_add.py"},
147
+ {"step": "verify_red", "ac_id": "AC1", "command": "pytest tests/", "exit_code": 1, "summary": "RED: test fails"},
148
+ {"step": "implement", "ac_id": "AC1", "command": null, "exit_code": null, "summary": "created add()"},
149
+ {"step": "verify_green", "ac_id": "AC1", "command": "pytest tests/", "exit_code": 0, "summary": "GREEN: 3 passed"},
150
+ {"step": "verify_e2e", "ac_id": "AC1", "command": "python -c '...'", "exit_code": 0, "summary": "E2E matches"},
151
+ {"step": "commit", "ac_id": "AC1", "command": "git commit", "exit_code": 0, "summary": "committed abc1234"}
152
+ ]
151
153
  }
152
154
  ```
153
155
 
156
+ `execution_steps` must be a JSON array of objects. Step types from §1f vocabulary: `write_test`, `verify_red`, `implement`, `verify_green`, `refactor`, `verify_e2e`, `commit`, `verify`.
157
+
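The shape rule above can be checked mechanically. The following is a minimal sketch, not part of the package: the function name `validate_execution_steps` and its error strings are hypothetical, and only the step vocabulary is taken from the text.

```python
# Step vocabulary as listed in the text above (governance §1f).
STEP_TYPES = {
    "write_test", "verify_red", "implement", "verify_green",
    "refactor", "verify_e2e", "commit", "verify",
}


def validate_execution_steps(claim):
    """Return a list of problems; an empty list means the claim looks well-formed.

    Hypothetical helper: the package may validate done-claims differently.
    """
    steps = claim.get("execution_steps")
    if not isinstance(steps, list):
        return ["execution_steps must be a JSON array"]
    problems = []
    for index, step in enumerate(steps):
        if not isinstance(step, dict):
            problems.append(f"step {index}: not an object")
            continue
        if step.get("step") not in STEP_TYPES:
            problems.append(f"step {index}: unknown step type {step.get('step')!r}")
    return problems
```

A caller would load `<slug>-done-claim.json` with `json.load` and treat any non-empty result as a malformed claim.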
154
158
  ### Verify Verdict (`<slug>-verify-verdict.json`)
155
159
 
156
- Written by the Verifier after independent verification.
160
+ Written by the Verifier after independent verification. **Must include `reasoning`** (governance §1f — always-on, not optional).
157
161
 
158
162
  **Tmux mode polling:** In tmux mode, after dispatching the Verifier, the shell Leader polls for the existence of `verify-verdict.json` (same pattern as `iter-signal.json`). Once it appears, the Leader reads the `verdict` and `recommended_state_transition` fields via `jq` to decide whether to write a COMPLETE sentinel, continue iterating, or write a BLOCKED sentinel.
159
163
 
@@ -172,6 +176,13 @@ Written by the Verifier after independent verification.
172
176
  "evidence": "test -f calc.py → exit 0"
173
177
  }
174
178
  ],
179
+ "reasoning": [
180
+ {"check": "IL-1 Evidence Gate", "decision": "pass", "basis": "ran pytest fresh, exit 0"},
181
+ {"check": "Layer Enforcement", "decision": "pass", "basis": "L1+L3 required and executed"},
182
+ {"check": "Test Sufficiency", "decision": "pass", "basis": "9 tests / 3 AC = 3.0 ratio"},
183
+ {"check": "Anti-Gaming", "decision": "pass", "basis": "no tautological tests, no mocking"},
184
+ {"check": "Worker Process Audit", "decision": "pass", "basis": "verify_red present with exit≠0"}
185
+ ],
175
186
  "missing_evidence": [],
176
187
  "issues": [
177
188
  {
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@ai-dev-methodologies/rlp-desk",
3
- "version": "0.2.4",
3
+ "version": "0.3.1",
4
4
  "description": "Fresh-context iterative loops for Claude Code — autonomous task completion with independent verification",
5
5
  "scripts": {
6
6
  "postinstall": "node scripts/postinstall.js",
@@ -24,6 +24,17 @@ Ask about these items one by one (or in small groups):
24
24
  1. **Slug** — short identifier (e.g., `auth-refactor`). Suggest one, ask if OK.
25
25
  2. **Objective** — what the loop achieves
26
26
  3. **User Stories** — discrete units with testable acceptance criteria. Propose a breakdown, ask the user to confirm/modify.
27
+ - Apply INVEST criteria: each US must be Independent, Negotiable, Valuable, Estimable, Small, Testable.
28
+ - Each AC MUST use Given/When/Then format with **domain language only** (no class names, API paths, DB tables):
29
+ ```
30
+ Given [precondition in domain language]
31
+ When [action in domain language]
32
+ Then [expected outcome with quantitative criteria]
33
+ ```
34
+ - Include at least 1 negative test per US ("must NOT happen").
35
+ - Include boundary cases per US (empty, max, zero, concurrent).
36
+ - **Task Type** per US: `code` | `visual` | `content` | `integration` | `infra`
37
+ - **Risk Level** per US (governance §1c): `LOW` | `MEDIUM` | `HIGH` | `CRITICAL`
27
38
  4. **Iteration Unit** — what one worker does per iteration. Explicitly ask:
28
39
  - "One US per iteration (bounded, incremental verification)?"
29
40
  - "All stories at once (faster, single verification)?"
@@ -42,8 +53,17 @@ Ask about these items one by one (or in small groups):
42
53
  11. **Consensus Scope** — If consensus enabled, ask: "Consensus on every verify (all, default) or only on final verify (final-only)?" Default: all.
43
54
  12. **Max Iterations** — suggest based on story count, ask if OK.
44
55
 
45
- After all items are confirmed, present the full contract summary.
46
- On approval, offer to run `init`.
56
+ After all items are confirmed:
57
+
58
+ 1. **Ambiguity Gate (IL-2)** — score each AC per governance §1a IL-2 (6 dimensions, 0-12 points).
59
+ If ANY AC scores below 6: **REJECT** — refine that AC before proceeding.
60
+ If no AC scores below 6 and at least one AC scores 6-9: **WARN** — proceed with logged warning, show low-scoring dimensions.
61
+ If all ACs score 10-12: **PASS** — clean.
62
+ Present the score table to the user before proceeding.
63
+ 2. Present the full contract summary.
64
+ 3. **Self-Verification** — Ask: "Enable self-verification? Worker records step-by-step evidence, Verifier cross-validates process. Recommended for MEDIUM+ risk." Default: yes for HIGH/CRITICAL, no for LOW/MEDIUM.
65
+ 4. On approval, offer to run `init`.
66
+
47
67
  Do NOT create files during brainstorm.
48
68
  Do NOT auto-decide iteration unit — the user MUST explicitly choose.
49
69
 
@@ -98,6 +118,7 @@ Options (parse from `$ARGUMENTS`):
98
118
  - `all`: consensus runs on every verify (current behavior)
99
119
  - `final-only`: consensus only on final ALL verify
100
120
  - `--debug` — enable debug logging (writes to logs/<slug>/debug.log)
121
+ - `--with-self-verification` — enable campaign-level self-verification analysis. After COMPLETE, Leader analyzes all iteration records (done-claims + verdicts) and generates a campaign self-verification summary with patterns and recommendations for next planning cycle. (Note: execution_steps and reasoning are ALWAYS recorded per governance §1f — this flag adds post-campaign analysis.)
101
122
 
102
123
  ### Mode Selection
103
124
 
@@ -181,6 +202,12 @@ rm -f .claude/ralph-desk/memos/<slug>-verify-verdict.json
181
202
  - Read `.claude/ralph-desk/prompts/<slug>.worker.prompt.md`
182
203
  - Combine with iteration number + memory contract
183
204
  - Write to `.claude/ralph-desk/logs/<slug>/iter-NNN.worker-prompt.md` (audit trail)
205
+ - Note: Worker ALWAYS records execution_steps in done-claim.json per governance §1f. No flag needed.
206
+
207
+ **④½ Contract review** (agent mode only)
208
+ - Before dispatching Worker, spawn a lightweight review: "Is this iteration contract sufficient to achieve the US's AC? Any missing steps?"
209
+ - If `--debug`: debug_log `[EXEC] iter=N phase=contract_review result=<ok|issues>`
210
+ - In tmux mode: skip (shell leader cannot reason). Log: `[EXEC] iter=N phase=contract_review skipped=tmux_mode`
184
211
 
185
212
  **⑤ Execute Worker**
186
213
  - If `--debug`: debug_log `[EXEC] iter=N phase=worker engine=<engine> model=<model> dispatched=true`
@@ -234,6 +261,7 @@ Bash("codex exec --model <worker_codex_model> --reasoning-effort <worker_codex_r
234
261
  - Verifier checks all AC at once
235
262
 
236
263
  **⑦a Dispatch Verifier**
264
+ - Note: Verifier ALWAYS records reasoning in verify-verdict.json per governance §1f. No flag needed.
237
265
  - If `--debug`: debug_log `[EXEC] iter=N phase=verifier engine=<engine> model=<model> scope=<us_id> dispatched=true`
238
266
 
239
267
  If `--verifier-engine claude` (default):
@@ -269,11 +297,16 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
269
297
  1. Read `issues` array, sort by severity (`critical` → `major` → `minor`)
270
298
  2. Build structured fix contract with traceability rule
271
299
  3. Include `fix_hint` values labeled `(suggestion, non-authoritative)` if present
272
- 4. Increment `consecutive_failures` in `status.json`
273
- 5. Go to ⑧ with fix contract as next Worker contract
300
+ 4. Include impacted tests from test-spec (so Worker can run them before and after the fix)
301
+ 5. Increment `consecutive_failures` in `status.json`
302
+ 6. If `consecutive_failures >= 3` for same US → **Architecture Escalation** (governance §7¾): stop fixing, report to user
303
+ 7. Go to ⑧ with fix contract as next Worker contract
274
304
  - `request_info` → Leader reads Verifier's questions, decides outcome (or relays to Worker in next contract) → go to ⑧
275
305
  - `blocked` → write BLOCKED sentinel, stop
276
306
  - If `--debug`: debug_log `[EXEC] iter=N phase=verdict engine=<engine> verdict=<pass|fail|request_info> us_id=<us_id>`
307
+ - If `--debug`: debug_log `[EXEC] iter=N phase=layer_check L1=<status> L2=<status> L3=<status> L4=<status>`
308
+ - If `--debug`: debug_log `[EXEC] iter=N phase=sufficiency test_count=<N> ac_count=<N> ratio=<N> verdict=<pass|fail>`
309
+ - If `--debug`: debug_log `[EXEC] iter=N phase=checkpoint level=<1|2> evidence=<summary>`
277
310
  - If `--debug` and consensus: debug_log `[EXEC] iter=N phase=consensus claude=<verdict> codex=<verdict> round=<N>`
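The first three steps of the `fail` branch (sort by severity, build the contract, label hints) could look roughly like the sketch below. It assumes each issue is a dict with `severity` and an optional `fix_hint`, as the text describes; the `summary` field and the function name `build_fix_items` are hypothetical.

```python
# Severity order taken from the text: critical, then major, then minor.
SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2}


def build_fix_items(issues):
    """Order issues by severity and label any fix_hint as non-authoritative."""
    ordered = sorted(
        issues,
        # Unknown severities sort last; sorted() is stable, so ties keep input order.
        key=lambda issue: SEVERITY_ORDER.get(issue.get("severity"), len(SEVERITY_ORDER)),
    )
    items = []
    for issue in ordered:
        # "summary" is an assumed field name used only for illustration.
        item = {"severity": issue.get("severity"), "summary": issue.get("summary")}
        if issue.get("fix_hint"):
            item["fix_hint"] = f"{issue['fix_hint']} (suggestion, non-authoritative)"
        items.append(item)
    return items
```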
278
311
 
279
312
  **⑧ Write result log and report to user, continue loop**
@@ -288,6 +321,63 @@ After the primary verifier runs, run a second verifier with the OTHER engine:
288
321
  At loop end (COMPLETE, BLOCKED, or TIMEOUT):
289
322
  - If `--debug`: debug_log `[VALIDATE] result=<COMPLETE|BLOCKED|TIMEOUT> iterations=<N> verified_us=<list>`
290
323
 
324
+ **⑨ Campaign Self-Verification** (when `--with-self-verification` is enabled):
325
+
326
+ After the loop ends, the Leader performs post-campaign analysis:
327
+
328
+ 1. **Collect data**: Read all archived `iter-NNN.result.md`, done-claim.json (with execution_steps), and verify-verdict.json (with reasoning) from `logs/<slug>/`
329
+ 2. **Write cumulative data**: `logs/<slug>/self-verification-data.json` — normalized iteration records
330
+ 3. **Generate versioned report**: `logs/<slug>/self-verification-report-NNN.md` (NNN = auto-increment from existing reports)
331
+ 4. **Report to user**: Display the full report content
332
+
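The auto-increment in step 3 can be sketched as follows. This assumes numbering starts at 001 and is zero-padded to three digits, by analogy with `iter-NNN`; the text does not state either detail, and `next_report_path` is a hypothetical name.

```python
import re
from pathlib import Path

REPORT_PATTERN = re.compile(r"self-verification-report-(\d+)\.md$")


def next_report_path(log_dir):
    """Return the path for the next versioned report in log_dir.

    Assumption: versions start at 1 and are rendered as three digits.
    """
    versions = []
    for path in Path(log_dir).glob("self-verification-report-*.md"):
        match = REPORT_PATTERN.match(path.name)
        if match:
            versions.append(int(match.group(1)))
    next_version = max(versions, default=0) + 1
    return Path(log_dir) / f"self-verification-report-{next_version:03d}.md"
```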
333
+ Report template (9 sections):
334
+
335
+ ```
336
+ # Campaign Self-Verification Report: <slug>
337
+ Report Version: NNN | Generated: timestamp | Campaign: slug — objective
338
+ Schema Version: governance hash | Data Quality: N% iterations complete
339
+
340
+ ## 1. Automated Validation Summary
341
+ Table: Iter | US | Worker Verdict | Verifier Verdict | Outcome
342
+
343
+ ## 2. Failure Deep Dive (per failed iteration)
344
+ Per failure: Worker steps → Verifier reasoning → Root cause → Resolution
345
+
346
+ ## 3. Worker Process Quality (§1f audit)
347
+ Table: Iter | US | Steps | verify_red? | RED exit≠0? | verify_green? | Test-First? | E2E? | AC linked?
348
+ Aggregate: TDD compliance %, RED confirmation %, E2E evidence %, step completeness %
349
+ Audit: each step must have type from §1f vocabulary + ac_id + command + exit_code
350
+
351
+ ## 4. Verifier Judgment Quality (§1f audit)
352
+ Table: Iter | US | Checks | All Basis? | Independent? | IL-1? | Layer? | Sufficiency? | Anti-Gaming? | Worker Audit?
353
+ Aggregate: Reasoning completeness %, Independent verification %, §1f category coverage %
354
+ Audit: verify all 5 mandatory check categories (IL-1, Layer Enforcement, Test Sufficiency, Anti-Gaming, Worker Process Audit) are present
355
+
356
+ ## 5. AC Lifecycle
357
+ Table: US | AC | First Claimed (iter) | First Verified (iter) | Reopen Count | Final Status
358
+
359
+ ## 6. Test-Spec Adherence
360
+ Spec completeness (layers/commands/mappings present)
361
+ Spec execution fidelity (exact checks run and cited)
362
+
363
+ ## 7. Patterns: Strengths & Weaknesses
364
+ Strengths: what worked well
365
+ Weaknesses: systemic issues
366
+
367
+ ## 8. Recommendations for Next Cycle
368
+ ### Brainstorm (missing scenarios/constraints) — citing iter/AC
369
+ ### PRD (ambiguous or oversized ACs) — citing iter/AC
370
+ ### Test-Spec (missing layers, weak mappings) — citing iter/AC
371
+
372
+ ## 9. Blind Spots
373
+ What this report CANNOT prove from available data
374
+
375
+ ## Data Provenance Rule
376
+ Report content MUST be derivable from: done-claim.json (execution_steps), verify-verdict.json (reasoning),
377
+ PRD, and test-spec. Information from source code inspection that is not in these files must be excluded
378
+ or explicitly marked as "[source-inspection]" with justification.
379
+ ```
380
+
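The per-iteration columns of report section 3 can be derived from `execution_steps` alone. The sketch below is one possible reading of those columns, not the Leader's actual logic: the function name and result keys are hypothetical, and "Test-First" is interpreted as `write_test` appearing before `implement`.

```python
def audit_worker_steps(steps):
    """Summarize test-first evidence for one done-claim's execution_steps."""
    kinds = [step.get("step") for step in steps]
    red = [step for step in steps if step.get("step") == "verify_red"]
    green = [step for step in steps if step.get("step") == "verify_green"]

    def first(kind):
        # Index of the first step of this kind, or None if absent.
        return kinds.index(kind) if kind in kinds else None

    write_test_at, implement_at = first("write_test"), first("implement")
    return {
        "verify_red_present": bool(red),
        # A RED run only counts if the command actually failed.
        "red_exit_nonzero": any(step.get("exit_code") not in (0, None) for step in red),
        "verify_green_present": bool(green),
        "test_first": (
            write_test_at is not None
            and implement_at is not None
            and write_test_at < implement_at
        ),
        "e2e_present": "verify_e2e" in kinds,
        "ac_linked": bool(steps) and all(step.get("ac_id") for step in steps),
    }
```

Aggregates such as TDD compliance % would then be the share of iterations where the relevant flags are true.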
291
381
  ### Circuit Breaker
292
382
  - context-latest.md unchanged 3 iterations → BLOCKED
293
383
  - Same acceptance criterion fails 2 consecutive iterations → upgrade model, retry once, then BLOCKED
@@ -329,6 +419,8 @@ Remove:
329
419
  - `.claude/ralph-desk/logs/<slug>/session-config.json`
330
420
  - `.claude/ralph-desk/logs/<slug>/worker-heartbeat.json`
331
421
  - `.claude/ralph-desk/logs/<slug>/verifier-heartbeat.json`
422
+ - `.claude/ralph-desk/memos/<slug>-escalation.md`
423
+ Note: `logs/<slug>/self-verification-data.json` and `logs/<slug>/self-verification-report-NNN.md` are intentionally preserved by `clean` for historical comparison.
332
424
 
333
425
  If `--kill-session` is passed, also kill any tmux session matching `rlp-desk-<slug>-*`:
334
426
  ```bash
@@ -359,6 +451,7 @@ Run options:
359
451
  --verify-consensus Cross-engine consensus verification
360
452
  --consensus-scope SCOPE When consensus runs: all|final-only (default: all)
361
453
  --debug Debug logging (logs/<slug>/debug.log)
454
+ --with-self-verification Campaign self-verification analysis (post-loop report)
362
455
  ```
363
456
 
364
457
  ## Architecture
package/src/governance.md CHANGED
@@ -10,10 +10,208 @@ The Leader orchestrates, while Worker/Verifier run in isolated fresh contexts ev
10
10
  - **Fresh context per iteration**: Worker/Verifier start fresh every time. No prior conversation.
11
11
  - **Filesystem = memory**: State exists only on the filesystem (PRD, memory, context, memos).
12
12
  - **Worker claim ≠ complete**: A Worker's DONE is merely a claim. The Verifier must independently verify before it's confirmed.
13
+ - **Worker scope is bounded**: Worker implements only the contracted US per iteration (Scope Lock). Out-of-scope changes are flagged by the Verifier.
13
14
  - **Verifier is independent**: The Verifier judges based on evidence alone, without knowledge of the Worker's reasoning process.
14
15
  - **Sentinels are Leader-owned**: Only the Leader writes COMPLETE/BLOCKED sentinels.
15
16
  - **Supported engines**: claude (default; models: haiku, sonnet, opus) and codex (opt-in via `--worker-engine codex` / `--verifier-engine codex`).
16
17
 
18
+ ## 1a. Iron Laws
19
+
20
+ Absolute rules that cannot be violated under any circumstance.
21
+
22
+ ```
23
+ IL-1: NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE
24
+ IL-2: NO INIT WITHOUT AC QUALITY SCORE >= 6
25
+ IL-3: NO PASS WITH TODO IN ANY REQUIRED VERIFICATION LAYER
26
+ IL-4: NO PASS WITHOUT TEST COUNT >= AC COUNT x 3
27
+ ```
28
+
29
+ **IL-1: Evidence Mandate**
30
+ Required: every verdict must reference at least one command execution with its exit code.
31
+ A verdict without command output evidence is automatically invalid.
32
+ Additional signal: phrases such as "should pass", "probably works", "seems to",
33
+ "looks correct", "appears to" (including but not limited to) without command evidence
34
+ confirm the violation but are not the primary check — command output presence is.
35
+ `request_info` is not an escape from IL-1 — if evidence was collectible, it must be collected.
36
+
37
+ **IL-2: Ambiguity Gate**
38
+ Each AC is scored 0-12 on 6 dimensions (0/1/2 points each):
39
+ - Single behavior: 0=multiple behaviors, 1=mostly single, 2=exactly one testable behavior
40
+ - Domain language: 0=technical terms (class names, API paths, DB tables), 1=mixed, 2=pure domain language
41
+ - Stakeholder clarity: 0=unclear who benefits, 1=implied, 2=explicitly stated
42
+ - Portability: 0=tech-stack specific, 1=mostly portable, 2=fully stack-independent
43
+ - Concrete example: 0=vague ("some input"), 1=partial specifics, 2=exact values with expected results
44
+ - Independence: 0=requires another AC to pass first, 1=loosely coupled, 2=fully self-contained
45
+
46
+ Score interpretation:
47
+ - 0-5: **REJECT** — init blocked. ACs too ambiguous.
48
+ - 6-9: **WARN** — init proceeds with logged warning. Show which dimensions scored low.
49
+ - 10-12: **PASS** — clean.
50
+
51
+ IL-2 is a pre-run gate: scoring MUST happen during brainstorm or at init time.
52
+ In tmux mode, IL-2 must be satisfied before `/rlp-desk run` is invoked.
53
+
54
+ Calibration example:
55
+ - Score 5 (REJECT): "Given a user, When they log in, Then the system works correctly"
56
+ → single:1 (login is one behavior), domain:2 (domain terms only), stakeholder:1 (implied user), portability:1 (mostly portable), concrete:0 ("works correctly" is vague), independence:0 (implies registration AC) = 5
57
+ - Score 7 (WARN): "Given a registered user with email 'test@example.com' and valid password, When they submit the login form, Then they are redirected to the dashboard within 2 seconds"
58
+ → single:2 (exactly one: login redirect), domain:1 ("submit" is slightly technical), stakeholder:1 (implied end user), portability:1 (web-specific "form"), concrete:2 (specific email, 2s threshold), independence:0 (requires registration) = 7
59
+
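The rubric and thresholds above reduce to a sum and two cut-offs. Here is a minimal sketch that reproduces the two calibration totals; the dimension keys mirror the shorthand used in the calibration lines, and the function names `score_ac` and `classify` are hypothetical.

```python
# Dimension keys follow the shorthand in the calibration example.
DIMENSIONS = ("single", "domain", "stakeholder", "portability", "concrete", "independence")


def score_ac(points):
    """Sum the six 0/1/2 dimension scores for one AC (range 0-12)."""
    for dimension in DIMENSIONS:
        if points.get(dimension) not in (0, 1, 2):
            raise ValueError(f"{dimension} must be scored 0, 1, or 2")
    return sum(points[dimension] for dimension in DIMENSIONS)


def classify(total):
    """Map a total score to the gate outcome described above."""
    if total < 6:
        return "REJECT"
    if total < 10:
        return "WARN"
    return "PASS"
```

For the first calibration AC, `score_ac` returns 5 and `classify(5)` returns `"REJECT"`; for the second it returns 7 and `"WARN"`.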
60
+ **IL-3: Layer Completeness**
61
+ Verification layers:
62
+ - L1: Unit Test — function-level, mocks allowed. Always required.
63
+ - L2: Integration — real external services (DB, API, Redis). Required when external dependencies exist.
64
+ - L3: E2E Simulation — known input → full pipeline → output comparison. Always required.
65
+ - L4: Deploy Verify — production environment checks. Required when deploying.
66
+
67
+ Layer requirements per US are determined by risk classification (§1c).
68
+ Non-applicable layers must be explicitly marked "N/A — {reason}" in the test-spec.
69
+ Any required layer section with TODO or blank = automatic Verifier FAIL.
70
+ See §1d for full layer definitions.
71
+
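The TODO/blank scan could be as simple as the sketch below. It assumes the test-spec has already been parsed into a mapping from layer name to section text; that parsed form and the name `layer_failures` are hypothetical, since the test-spec file format is not shown here.

```python
def layer_failures(sections, required):
    """Return required layers whose test-spec section is blank or contains TODO.

    sections: assumed mapping such as {"L1": "pytest tests/unit", "L3": "TODO"}.
    """
    failures = []
    for layer in required:
        text = (sections.get(layer) or "").strip()
        # Case-sensitive match on "TODO"; a real scan might be more lenient.
        if not text or "TODO" in text:
            failures.append(layer)
    return failures
```

Any non-empty result would mean an automatic Verifier FAIL under IL-3.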
72
+ **IL-4: Test Sufficiency**
73
+ Each AC must have >= 3 tests covering >= 2 of 3 categories (happy path, negative/error, boundary).
74
+ Tests must be mapped to ACs in the test-spec's criteria-to-test mapping table.
75
+ Only tests listed in this mapping count toward IL-4.
76
+ Count < 3 per any AC = FAIL.
77
+
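A sketch of the per-AC sufficiency check follows. It assumes the criteria-to-test mapping table has been parsed into a dict of AC id to `(test_name, category)` pairs; that structure, the category labels, and the name `insufficient_acs` are illustrative, not the package's format.

```python
# Category labels are shorthand for: happy path, negative/error, boundary.
CATEGORIES = {"happy", "negative", "boundary"}


def insufficient_acs(mapping):
    """Return AC ids with fewer than 3 mapped tests or fewer than 2 categories."""
    failing = []
    for ac_id, tests in mapping.items():
        covered = {category for _, category in tests if category in CATEGORIES}
        if len(tests) < 3 or len(covered) < 2:
            failing.append(ac_id)
    return failing
```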
78
+ ### Enforcement
79
+
80
+ | Iron Law | Checked by | When | Method |
81
+ |----------|-----------|------|--------|
82
+ | IL-1 | Verifier | verification time | mechanical (command output presence) |
83
+ | IL-2 | Leader | brainstorm/init | scored (6-dimension rubric) |
84
+ | IL-3 | Verifier | verification time | mechanical (TODO/blank scan) |
85
+ | IL-4 | Verifier | verification time | scored (test count per AC) |
86
+
87
+ - Violation of any Iron Law overrides all other verdict considerations — verdict MUST be FAIL.
88
+ - When an Iron Law is violated, the verdict MUST be `fail` regardless of uncertainty.
89
+ `request_info` remains valid only when the Verifier cannot determine whether an Iron Law
90
+ was violated (e.g., cannot access test files, command execution blocked).
91
+ - You (the Leader) cannot waive Iron Laws. Only the user can explicitly waive an Iron Law
92
+ for a specific US with documented justification in the PRD.
93
+
94
+ ## 1b. Evidence Gate
95
+
96
+ This section operationalizes IL-1 (Evidence Mandate) into a concrete step-by-step protocol.
97
+
98
+ Before any verdict, the Verifier MUST follow this 5-step process:
99
+
100
+ 1. **IDENTIFY**: What command proves this claim?
101
+ 2. **RUN**: Execute the command (fresh, not cached or recalled)
102
+ 3. **READ**: Full output + exit code + failure count
103
+ 4. **VERIFY**: Does output confirm the claim?
104
+ - YES → state claim WITH evidence (command + output + exit code)
105
+ - NO → state actual status with evidence
106
+ 5. **ONLY THEN**: Issue verdict
107
+
108
+ Skipping any step = invalid verification (IL-1 violation).
109
+
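The RUN and READ steps can be sketched with the standard library. This is a simplified illustration, not the Verifier's implementation: the function names are hypothetical, and treating exit code 0 as confirmation is a stand-in for the VERIFY step, which in practice also reads the output.

```python
import subprocess


def collect_evidence(command):
    """RUN + READ: execute the command fresh and capture output and exit code.

    command is an argument list, e.g. ["pytest", "tests/"].
    """
    result = subprocess.run(command, capture_output=True, text=True)
    return {
        "command": command,
        "exit_code": result.returncode,
        "output": result.stdout + result.stderr,
    }


def verdict_from(evidence):
    """ONLY THEN: derive a verdict from collected evidence (simplified to exit code)."""
    return "pass" if evidence["exit_code"] == 0 else "fail"
```

Because the verdict is computed from the evidence dict, there is no code path that yields a verdict without a command, its output, and its exit code.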
110
+ ### Forbidden Patterns
111
+ - "should pass", "probably works" without command output
112
+ - Trusting Worker's success reports without independent re-execution
113
+ - Partial verification ("linter passed" ≠ "tests passed" ≠ "all AC met")
114
+ - "Code inspection" as substitute for automated command execution
115
+ - Citing cached/prior results instead of fresh execution
116
+
117
+ ## 1c. Risk Classification
118
+
119
+ Each US is classified by risk level during brainstorm. Higher risk = more verification layers.
120
+
121
+ | Level | Description | Required Layers | Extra Requirements |
122
+ |-------|-------------|-----------------|-------------------|
123
+ | LOW | Read-only, docs, config | L1 + L3 | — |
124
+ | MEDIUM | New feature, refactor | L1 + L2 (if external deps) + L3 | — |
125
+ | HIGH | Production deploy, data migration | L1 + L2 + L3 + L4 | — |
126
+ | CRITICAL | Financial, security, medical | L1 + L2 + L3 + L4 | consensus + mutation testing (when mutation testing tool is configured in test-spec) |
127
+
128
+ L2 is included in MEDIUM+ rows but is marked N/A when no external services exist (see §1d L2 "When N/A" clause).
129
+
130
+ ### Who Decides
131
+ - During brainstorm: user assigns risk level per US (Leader suggests, user confirms).
132
+ - If brainstorm was skipped: Leader assigns based on PRD content at first run iteration.
133
+ - Risk level is recorded in PRD per US. Cannot be downgraded without user approval.
134
+
135
+ ### Examples
136
+ - LOW: README update, adding comments, .env.example
137
+ - MEDIUM: REST API endpoint, React component, business rule
138
+ - HIGH: database migration, CI/CD change, deployment config
139
+ - CRITICAL: payment processing, auth, encryption, PII handling
140
+
141
+ ## 1d. Verification Layers
142
+
143
+ Four layers of verification, each targeting a different failure mode.
144
+
145
+ ### L1: Unit Test (always required)
146
+ - Scope: function/method level, isolated logic
147
+ - Mocks: allowed for external boundaries only
148
+ - Evidence: test runner output with pass/fail count + exit code
149
+
150
+ ### L2: Integration (required when external dependencies exist)
151
+ - Scope: interaction with real external services (database, API, message queue, cache)
152
+ - Mocks: NOT allowed — use real or containerized services
153
+ - Evidence: integration test output with connection confirmation + data verification
154
+ - When N/A: "N/A — no external services (pure computation/transformation)"
155
+
156
+ ### L3: E2E Simulation (always required)
157
+ - Scope: known input → full pipeline → quantitative output comparison
158
+ - Evidence: input data + actual output + expected output + comparison result
159
+ - For simple utilities: E2E = "run function with known input, verify output matches expected"
160
+
161
+ ### L4: Deploy Verify (required when deploying)
162
+ - Scope: production/staging environment health after deployment
163
+ - Evidence: health check response + deployment status + monitoring state
164
+ - When N/A: "N/A — no deployment (library/tool, local-only change)"
165
+
166
+ ### Rules
167
+ - L1 and L3: always required regardless of risk level.
168
+ - L2: required for MEDIUM+ risk when external services are involved.
169
+ - L4: required for HIGH+ risk when deployment occurs.
170
+ - Layer requirements per US are determined by risk classification (§1c).
171
+ - Non-required layers must be marked "N/A — {reason}" per IL-3. Blank or TODO = FAIL.
172
+
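The rules above can be read as a small function of risk level and two facts about the US. The sketch below follows the Rules list (L2 only with external services, L4 only when deploying); `required_layers` and its parameter names are hypothetical.

```python
RISK_LEVELS = ("LOW", "MEDIUM", "HIGH", "CRITICAL")


def required_layers(risk, has_external_deps, deploys):
    """Return the verification layers required for one US."""
    if risk not in RISK_LEVELS:
        raise ValueError(f"unknown risk level: {risk!r}")
    layers = ["L1", "L3"]  # always required, regardless of risk
    if risk in ("MEDIUM", "HIGH", "CRITICAL") and has_external_deps:
        layers.append("L2")
    if risk in ("HIGH", "CRITICAL") and deploys:
        layers.append("L4")
    return sorted(layers)
```

Layers not returned here would still need an explicit N/A entry with a reason in the test-spec.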
173
+ ## 1e. Verification Checkpoints
174
+
175
+ Verification occurs at two boundaries, not as a single final event.
176
+
177
+ ### Checkpoint 1: Story/Unit (per-US)
178
+ - Trigger: Worker signals verify with us_id = specific US
179
+ - Scope: that US's acceptance criteria (L1 pass is verified as part of layer enforcement in Verifier step 5)
180
+ - On fail: fix loop (§7½)
181
+
182
+ ### Checkpoint 2: Release Readiness (us_id=ALL)
183
+ - Trigger: all individual US pass Checkpoint 1 → Worker signals verify with us_id = "ALL"
184
+ - Scope: all AC + L2 integration (if applicable) + L3 E2E Simulation + L4 deploy (if applicable) + mutation score (if CRITICAL, when mutation testing tool is configured in test-spec)
185
+ - On fail: fix loop; escalation to user if 3 consecutive failures
186
+
187
+ ### Relationship to Existing Flow
188
+ - Checkpoint 1 = existing per-US verify (§7a). No change.
189
+ - Checkpoint 2 = existing "us_id=ALL final full verify" (§7a). Adds explicit layer scope.
190
+ - No new iteration steps are introduced.
191
+
192
+ ## 1f. Execution & Judgment Traceability
193
+
194
+ Every iteration, Worker and Verifier MUST record their process and reasoning — not just results.
195
+ This is the default behavior, not an optional flag. Without it, IL-1 (Evidence Mandate) is incomplete.
196
+
197
+ ### Worker: execution_steps in done-claim.json
198
+ Worker records what was done, in what order, with command evidence in `done-claim.json`:
199
+ - Each step includes: what action, which AC, command executed, exit code, summary
200
+ - Step types: `write_test`, `verify_red`, `implement`, `verify_green`, `refactor`, `verify_e2e`, `commit`, `verify`
201
+ - This proves the Worker followed test-first approach and did not skip steps
202
+
203
+ ### Verifier: reasoning in verify-verdict.json
204
+ Verifier records WHY each judgment was made in `verify-verdict.json`:
205
+ - Each check includes: what was checked, decision (pass/fail), and the specific evidence basis
206
+ - Checks include: IL-1 Evidence Gate, Layer Enforcement, Test Sufficiency, Anti-Gaming, Worker Process Audit
207
+ - This proves the Verifier actually performed each check rather than rubber-stamping
208
+
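Whether a verdict carries all five checks can be tested directly against its `reasoning` array. A minimal sketch, assuming check names are matched by substring (the sample verdict names the first check "IL-1 Evidence Gate") and that an entry without a `basis` does not count; `missing_reasoning` is a hypothetical name.

```python
MANDATORY_CHECKS = (
    "IL-1", "Layer Enforcement", "Test Sufficiency", "Anti-Gaming", "Worker Process Audit",
)


def missing_reasoning(verdict):
    """Return the mandatory check categories absent from verdict['reasoning']."""
    reasoning = verdict.get("reasoning") or []
    names = [
        entry.get("check", "")
        for entry in reasoning
        if isinstance(entry, dict) and entry.get("basis")
    ]
    return [check for check in MANDATORY_CHECKS if not any(check in name for name in names)]
```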
209
+ ### Why This Is Default (Not Optional)
210
+ - IL-1 says "no claims without evidence" — this applies to Worker AND Verifier
211
+ - Without execution_steps, Worker's done-claim is an unsubstantiated assertion
212
+ - Without reasoning, Verifier's verdict is an unsubstantiated judgment
213
+ - Both are archived in `logs/<slug>/` per existing audit trail pattern
214
+
17
215
  ## 2. Roles
18
216
 
19
217
  ### Leader (current session)
@@ -182,6 +380,7 @@ Characteristics:
182
380
  │ ├── <slug>-done-claim.json # Worker's completion claim (runtime)
183
381
  │ ├── <slug>-iter-signal.json # Worker's iteration signal (runtime)
184
382
  │ ├── <slug>-verify-verdict.json # Verifier's verdict (runtime)
383
+ │ ├── <slug>-escalation.md # Architecture escalation report (tmux mode, §7¾)
185
384
  │ ├── <slug>-complete.md # SENTINEL (Leader only)
186
385
  │ └── <slug>-blocked.md # SENTINEL (Leader only)
187
386
  ├── plans/
@@ -191,6 +390,8 @@ Characteristics:
191
390
  ├── iter-NNN.worker-prompt.md # Audit trail prompt copy
192
391
  ├── iter-NNN.verifier-prompt.md # Audit trail prompt copy
193
392
  ├── iter-NNN.result.md # Iteration result (leader-measured + git-measured)
393
+ ├── self-verification-data.json # Cumulative campaign data (--with-self-verification)
394
+ ├── self-verification-report-NNN.md # Versioned campaign analysis report (--with-self-verification)
194
395
  └── status.json # Leader's loop state
195
396
  ```
196
397
 
@@ -311,12 +512,29 @@ Traceability: only changes that resolve a listed issue are allowed.
311
512
  Every change must be justified by the issue it addresses.
312
513
  ```
313
514
 
515
+ ## 7¾. Architecture Escalation
516
+
517
+ Note: Circuit Breaker (§8) fires first at 2 consecutive failures (model upgrade + retry). If the retry also fails (3rd consecutive failure), Architecture Escalation applies. The CB retry counts toward the consecutive_failures counter.
518
+
519
+ If 3+ consecutive fix attempts fail for the same US:
520
+
521
+ 1. **STOP fixing symptoms** — the problem is likely architectural, not a bug.
522
+ 2. **Leader reports to user**: "3 consecutive fix attempts failed for US-{id}. This suggests an architectural issue, not a simple bug."
523
+ 3. **Include in report**:
524
+ - What was attempted in each fix
525
+ - What specifically kept failing
526
+ - Hypothesis: why fixes are not sticking
527
+ 4. **Do NOT attempt fix #4** without user guidance.
528
+ 5. **Options**: refactor architecture, simplify the US, split the US, or mark BLOCKED.
529
+
530
+ In tmux mode: Leader writes `<slug>-escalation.md` with the report and sets the BLOCKED sentinel with reason `architecture-escalation`.
531
+
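The interplay between the Circuit Breaker and Architecture Escalation can be sketched as a function of the `consecutive_failures` counter. This assumes the counter is read right after a `fail` verdict for the same US; the function name and the returned action labels are illustrative only.

```python
def next_action(consecutive_failures):
    """Pick the Leader's next move after a fail verdict for one US."""
    if consecutive_failures >= 3:
        # Stop fixing symptoms and report to the user.
        return "architecture_escalation"
    if consecutive_failures == 2:
        # Circuit Breaker fires first: upgrade the model and retry once.
        return "upgrade_model_and_retry"
    return "fix_loop"
```

Because the Circuit Breaker retry also increments the counter, a failed retry lands on 3 and triggers escalation.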
314
532
  ## 8. Circuit Breaker
315
533
 
316
534
  | Condition | Verdict |
317
535
  |-----------|---------|
318
536
  | context-latest.md unchanged for 3 consecutive iterations | BLOCKED |
319
- | Same acceptance criterion fails 2 consecutive iterations | Upgrade model, retry once; if still failing → BLOCKED |
537
+ | Same acceptance criterion fails 2 consecutive iterations | Upgrade model, retry once; if still failing → Architecture Escalation (§7¾) → BLOCKED |
320
538
  | 3 consecutive **fail** verdicts on 3 unique criterion IDs | Upgrade to opus, retry once; if still failing → BLOCKED |
321
539
  | max_iter reached | TIMEOUT (report to user) |
322
540