@ai-dev-methodologies/rlp-desk 0.5.4 → 0.7.0
- package/docs/plans/mutable-booping-corbato.md +163 -0
- package/docs/superpowers/plans/2026-04-06-worker-verifier-prompt-restructure.md +179 -0
- package/package.json +1 -1
- package/src/commands/rlp-desk.md +23 -11
- package/src/model-upgrade-table.md +9 -9
- package/src/scripts/init_ralph_desk.zsh +75 -19
- package/src/scripts/lib_ralph_desk.zsh +33 -11
- package/src/scripts/run_ralph_desk.zsh +161 -51
|
@@ -0,0 +1,163 @@
|
|
|
1
|
+
# Plan: rlp-desk Batch Mode + Operational Context Improvements
|
|
2
|
+
|
|
3
|
+
## Context
|
|
4
|
+
|
|
5
|
+
Two structural problems were found in a real campaign (`prod-local-parity`, spark:high):
|
|
6
|
+
|
|
7
|
+
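1. **Endless FAIL in batch mode**: With 5 or more user stories (US), the Worker completes only some of them → the Verifier checks all of them → FAIL → the progress is ignored → CB BLOCKED. `VERIFIED_US` tracking exists only in per-us mode, not in batch mode.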
|
|
8
|
+
|
|
9
|
+
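2. **No support for server projects**: The Worker does not restart the server after modifying code, does not know the server port, and runs no health check. The spark model is not to blame; this is **a design flaw in which rlp-desk does not carry operational context into the brainstorm or the prompts**.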
|
|
10
|
+
|
|
11
|
+
---
|
|
12
|
+
|
|
13
|
+
## P0: Batch Mode Partial Progress Tracking
|
|
14
|
+
|
|
15
|
+
### Files to Modify
|
|
16
|
+
- `src/scripts/run_ralph_desk.zsh`
|
|
17
|
+
- `src/commands/rlp-desk.md` (agent mode ⑦c)
|
|
18
|
+
|
|
19
|
+
### Changes
|
|
20
|
+
|
|
21
|
+
#### 1. Track VERIFIED_US in batch mode as well (run_ralph_desk.zsh)
|
|
22
|
+
- PASS verdict handling (L2423): remove the `per-us` condition → in batch mode too, add `signal_us_id` to `VERIFIED_US` when it names an individual US
|
|
23
|
+
- FAIL verdict handling (L2445): parse `per_us_results` from the verdict JSON → add each US with `met=true` to `VERIFIED_US`
|
|
24
|
+
- status.json update: record the `verified_us` array in batch mode as well
|
|
25
|
+
|
|
26
|
+
#### 2. Pass VERIFIED_US to the Verifier prompt (run_ralph_desk.zsh L1225-1232)
|
|
27
|
+
- Change the condition `if [[ "$VERIFY_MODE" = "per-us"` → `if [[ -n "$VERIFIED_US"`
|
|
28
|
+
- Instruct the batch-mode verifier as well to "skip US that are already verified"
|
|
29
|
+
|
|
30
|
+
#### 3. Fix Contract Scope Narrowing (run_ralph_desk.zsh L2461-2473)
|
|
31
|
+
- On FAIL: extract the US that passed from the verdict → write into the fix contract "US-001~004 verified. Continue from US-005."
|
|
32
|
+
- When assembling the Worker prompt, consult `VERIFIED_US` and pass the narrowed scope
|
|
33
|
+
|
|
34
|
+
#### 4. Partial reset of consecutive_failures (run_ralph_desk.zsh L2447)
|
|
35
|
+
- If any US newly passed (`VERIFIED_US` grew) → reset `CONSECUTIVE_FAILURES=0`
|
|
36
|
+
- If the state is unchanged with no progress → increment as before
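
The partial-reset rule above can be sketched as follows. This is a minimal illustration, not code from `run_ralph_desk.zsh`: the function names are hypothetical, and `VERIFIED_US` is assumed to be a comma-separated string such as `US-001,US-002`.

```bash
# Hypothetical helpers; the names and the comma-separated list format are
# assumptions for illustration, not code from run_ralph_desk.zsh.

# Print how many non-empty entries a comma-separated list contains.
count_verified() {
  printf '%s' "$1" | tr ',' '\n' | grep -c . || true
}

# $1 = VERIFIED_US before the verdict, $2 = VERIFIED_US after the verdict,
# $3 = current consecutive-failure count. Prints the new failure count.
next_failure_count() {
  before=$(count_verified "$1")
  after=$(count_verified "$2")
  if [ "$after" -gt "$before" ]; then
    echo 0                # progress: at least one US was newly verified
  else
    echo $(( $3 + 1 ))    # no progress: count the failure as before
  fi
}
```

A caller might use it as `CONSECUTIVE_FAILURES=$(next_failure_count "$old_verified" "$VERIFIED_US" "$CONSECUTIVE_FAILURES")`, where `old_verified` is a hypothetical snapshot taken before the verdict was processed.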
|
|
37
|
+
|
|
38
|
+
#### 5. Make per_us_results mandatory in the Verifier verdict
|
|
39
|
+
- Add the output format to the Verifier prompt template (init_ralph_desk.zsh L384-474):
|
|
40
|
+
```json
|
|
41
|
+
{
|
|
42
|
+
"verdict": "fail",
|
|
43
|
+
"per_us_results": { "US-001": "pass", "US-005": "fail" },
|
|
44
|
+
"issues": [...]
|
|
45
|
+
}
|
|
46
|
+
```
|
|
47
|
+
- Instruct the Verifier to include per_us_results in both batch and per-us modes
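
One way a runner could fold such a verdict into `VERIFIED_US` is sketched below. It assumes the string form shown in the sample JSON (`"US-001": "pass"`) and a comma-separated `VERIFIED_US`; both function names are hypothetical. The `grep`-based extraction is a shortcut for illustration, and a real implementation would more likely use a JSON parser.

```bash
# Hypothetical helpers; names and list format are assumptions, not rlp-desk code.

# Read a verdict JSON document on stdin and print one passing US id per line.
# Text matching only: assumes ids look like US-<digits> and values are "pass"/"fail".
passed_us_from_verdict() {
  grep -oE '"US-[0-9]+"[[:space:]]*:[[:space:]]*"pass"' | grep -oE 'US-[0-9]+'
}

# $1 = existing comma-separated VERIFIED_US; stdin = newly passed ids.
# Prints the merged, de-duplicated, sorted list, comma-separated.
merge_verified() {
  { printf '%s\n' "$1" | tr ',' '\n'; cat; } | grep . | sort -u | paste -sd, -
}
```

For example, `VERIFIED_US=$(passed_us_from_verdict < verdict.json | merge_verified "$VERIFIED_US")`, where `verdict.json` stands in for wherever the verdict is stored.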
|
|
48
|
+
|
|
49
|
+
---
|
|
50
|
+
|
|
51
|
+
## P1: Brainstorm Operational Context + Worker System Prompt
|
|
52
|
+
|
|
53
|
+
### Files to Modify
|
|
54
|
+
- `src/commands/rlp-desk.md` (brainstorm section)
|
|
55
|
+
- `src/scripts/init_ralph_desk.zsh` (Worker/Verifier prompt template)
|
|
56
|
+
|
|
57
|
+
### Changes
|
|
58
|
+
|
|
59
|
+
#### 1. Brainstorm: collect Operational Context (rlp-desk.md L24-93)
|
|
60
|
+
The brainstorm currently collects 11 items; **add a 12th item**:
|
|
61
|
+
|
|
62
|
+
```
|
|
63
|
+
12. **Operational Context** (if applicable):
|
|
64
|
+
- Does this project require a running server/service? (y/n)
|
|
65
|
+
- Server start command (e.g., `npm run dev`, `python manage.py runserver`)
|
|
66
|
+
- Server port (e.g., 7001)
|
|
67
|
+
- Health check URL (e.g., `http://localhost:7001/health`)
|
|
68
|
+
- Other runtime dependencies (e.g., database, Redis)
|
|
69
|
+
```
|
|
70
|
+
|
|
71
|
+
The brainstorm auto-detects `scripts.dev`/`scripts.start` in `package.json`, as well as `Makefile`, `docker-compose.yml`, and similar files in the project directory, and suggests values based on them.
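
A rough sketch of such file-based detection, under stated assumptions: the function name `detect_server_hints` and the suggested commands are illustrative only, and the `package.json` check is a crude text match rather than real JSON parsing.

```bash
# Hypothetical helper, not part of rlp-desk. $1 = project directory.
# Prints one suggestion per line for each operational hint that is found.
detect_server_hints() {
  if [ -f "$1/package.json" ]; then
    # Crude text match on the script names; a JSON parser would be more robust.
    if grep -qE '"dev"[[:space:]]*:' "$1/package.json"; then echo "npm run dev"; fi
    if grep -qE '"start"[[:space:]]*:' "$1/package.json"; then echo "npm start"; fi
  fi
  if [ -f "$1/Makefile" ]; then echo "make (inspect targets)"; fi
  if [ -f "$1/docker-compose.yml" ]; then echo "docker compose up"; fi
  return 0
}
```

The output is meant only as a pre-filled suggestion; per the plan, the user still confirms the values during brainstorm.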
|
|
72
|
+
|
|
73
|
+
#### 2. Brainstorm: guidance to include operational steps when generating US
|
|
74
|
+
Add to the US/AC writing guide (rlp-desk.md L26-38):
|
|
75
|
+
|
|
76
|
+
```
|
|
77
|
+
- If the project has operational context (server, DB, etc.):
|
|
78
|
+
- Each US that modifies server code MUST include AC:
|
|
79
|
+
"Given server is running, When code is modified, Then server is restarted and responds on health check URL"
|
|
80
|
+
- Do NOT assume Worker will restart server on its own — spell it out in AC
|
|
81
|
+
```
|
|
82
|
+
|
|
83
|
+
#### 3. Init: inject Operational Rules into the Worker prompt (init_ralph_desk.zsh L285-380)
|
|
84
|
+
Inject the operational context collected during brainstorm into the Worker prompt template:
|
|
85
|
+
|
|
86
|
+
```markdown
|
|
87
|
+
## Operational Context
|
|
88
|
+
- **Server Command**: `npm run dev`
|
|
89
|
+
- **Server Port**: 7001
|
|
90
|
+
- **Health Check**: `http://localhost:7001/health`
|
|
91
|
+
|
|
92
|
+
### Operational Rules (always apply)
|
|
93
|
+
- After modifying server/application code, restart the server: `[server_cmd]`
|
|
94
|
+
- Before signaling done, verify server responds: `curl -s [health_url] || fail`
|
|
95
|
+
- Do NOT modify dependency files (package.json, requirements.txt, etc.) unless the AC explicitly requires it
|
|
96
|
+
- Do NOT run package install commands (npm install, pip install, etc.) unless the AC explicitly requires it
|
|
97
|
+
```
|
|
98
|
+
|
|
99
|
+
For projects without operational context (code changes only), omit this section.
|
|
100
|
+
|
|
101
|
+
#### 4. Init: add an Operational Check to the Verifier prompt as well
|
|
102
|
+
In the Verifier prompt template (init_ralph_desk.zsh L384-474), add:
|
|
103
|
+
|
|
104
|
+
```markdown
|
|
105
|
+
## Operational Verification (if server context provided)
|
|
106
|
+
- Verify server is running on expected port before checking ACs
|
|
107
|
+
- If server is down, verdict=FAIL with issue: "server not running"
|
|
108
|
+
```
|
|
109
|
+
|
|
110
|
+
#### 5. --server-cmd / --server-port CLI options (run_ralph_desk.zsh)
|
|
111
|
+
Init writes the values collected during brainstorm into the prompt, but they can also be overridden at run time:
|
|
112
|
+
- `--server-cmd "npm run dev"` → overrides the server command in the Worker prompt
|
|
113
|
+
- `--server-port 7001` → overrides the port in the Worker prompt
|
|
114
|
+
- At run time, run a health check at the start of each iteration (optional, `--server-health-check` flag)
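
The optional run-time health check could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, and the fallback to `http://localhost:PORT/` when no health URL is given is an assumption of this sketch rather than something the plan specifies. `curl -sf` is used so that HTTP error statuses count as failures.

```bash
# Hypothetical helpers, not part of rlp-desk.

# $1 = explicit health URL (may be empty), $2 = server port (may be empty).
# Prints the URL to probe, or nothing when neither value is available.
resolve_health_url() {
  if [ -n "$1" ]; then
    echo "$1"
  elif [ -n "$2" ]; then
    echo "http://localhost:$2/"
  fi
}

# Returns 0 when the server answers without an HTTP error, non-zero otherwise
# (including the case where there is nothing to probe).
check_server() {
  url=$(resolve_health_url "$1" "$2")
  [ -n "$url" ] || return 1
  curl -sf --max-time 5 "$url" > /dev/null
}
```

An iteration loop might call `check_server "$SERVER_HEALTH" "$SERVER_PORT" || echo "server not running"` before dispatching the Worker.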
|
|
115
|
+
|
|
116
|
+
---
|
|
117
|
+
|
|
118
|
+
## Verification Plan
|
|
119
|
+
|
|
120
|
+
### P0 Tests
|
|
121
|
+
```bash
|
|
122
|
+
# Batch partial progress unit tests
|
|
123
|
+
zsh tests/test_batch_partial_progress.sh
|
|
124
|
+
# Scenario: batch FAIL verdict includes per_us_results → confirm VERIFIED_US is tracked
|
|
125
|
+
# Scenario: confirm consecutive_failures resets when a new US passes
|
|
126
|
+
# Scenario: confirm the verifier prompt includes VERIFIED_US (batch mode)
|
|
127
|
+
```
|
|
128
|
+
|
|
129
|
+
### P1 Tests
|
|
130
|
+
```bash
|
|
131
|
+
# Operational context unit tests
|
|
132
|
+
zsh tests/test_operational_context.sh
|
|
133
|
+
# Scenario: confirm --server-cmd option parsing
|
|
134
|
+
# Scenario: confirm operational rules are injected into the Worker prompt
|
|
135
|
+
# Scenario: confirm the section is omitted for projects without operational context
|
|
136
|
+
```
|
|
137
|
+
|
|
138
|
+
### Self-Verification (required by CLAUDE.md)
|
|
139
|
+
Run self-verification on the changed src files across 3 scenarios (LOW/MEDIUM/CRITICAL).
|
|
140
|
+
|
|
141
|
+
### E2E
|
|
142
|
+
Test with real campaigns:
|
|
143
|
+
1. batch mode + 10 US → confirm partial progress is tracked
|
|
144
|
+
2. server project + spark:high → confirm the server restart is performed
|
|
145
|
+
|
|
146
|
+
---
|
|
147
|
+
|
|
148
|
+
## File Map
|
|
149
|
+
|
|
150
|
+
| File | P0 | P1 |
|
|
151
|
+
|------|----|----|
|
|
152
|
+
| `src/scripts/run_ralph_desk.zsh` | VERIFIED_US batch tracking, fix contract narrowing, CF reset | --server-cmd/port options |
|
|
153
|
+
| `src/scripts/lib_ralph_desk.zsh` | - | - |
|
|
154
|
+
| `src/scripts/init_ralph_desk.zsh` | - | Inject operational context into the Worker/Verifier prompts |
|
|
155
|
+
| `src/commands/rlp-desk.md` | agent mode ⑦c batch logic | brainstorm item 12, US guide |
|
|
156
|
+
| `src/governance.md` | - | - |
|
|
157
|
+
|
|
158
|
+
---
|
|
159
|
+
|
|
160
|
+
## Scope / Non-Goals
|
|
161
|
+
- Per-model guardrails (a spark-only forbidden list) → **not doing**. Solved through the brainstorm/prompt structure instead
|
|
162
|
+
- Removing batch mode entirely → **not doing**. Fix it so that it is usable
|
|
163
|
+
- auto-detect project type → only user confirmation during brainstorm plus file-based suggestions. Not full automation
|
|
@@ -0,0 +1,179 @@
|
|
|
1
|
+
# Worker/Verifier Prompt Restructure Implementation Plan
|
|
2
|
+
|
|
3
|
+
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
|
4
|
+
|
|
5
|
+
**Goal:** Restructure Worker/Verifier prompt templates in init_ralph_desk.zsh to make TDD more prominent, reduce prompt size, and add debugging/anti-rationalization guidance.
|
|
6
|
+
|
|
7
|
+
**Architecture:** Elevate TDD from procedural guidance to hard constraint box (same level as SCOPE LOCK). Compress Forbidden Shortcuts by removing items already enforced by Verifier. Add "When Stuck" debugging protocol and Verifier anti-rationalization patterns.
|
|
8
|
+
|
|
9
|
+
**Tech Stack:** zsh heredoc templates in init_ralph_desk.zsh
|
|
10
|
+
|
|
11
|
+
**Origin:** ralplan consensus (Architect APPROVE + Critic ACCEPT, 2026-04-06)
|
|
12
|
+
|
|
13
|
+
---
|
|
14
|
+
|
|
15
|
+
### Task 1: Insert TDD MANDATE and remove old Test-First Approach
|
|
16
|
+
|
|
17
|
+
**Files:**
|
|
18
|
+
- Modify: `src/scripts/init_ralph_desk.zsh:323-337`
|
|
19
|
+
|
|
20
|
+
- [ ] **Step 1: Insert TDD MANDATE after file-reading section (L323), before SCOPE LOCK (L325)**
|
|
21
|
+
|
|
22
|
+
In `src/scripts/init_ralph_desk.zsh`, find:
|
|
23
|
+
```
|
|
24
|
+
4. Latest Context: $DESK/context/$SLUG-latest.md → current state
|
|
25
|
+
|
|
26
|
+
## SCOPE LOCK (hard constraint — violation causes verification failure)
|
|
27
|
+
```
|
|
28
|
+
|
|
29
|
+
Replace with:
|
|
30
|
+
```
|
|
31
|
+
4. Latest Context: $DESK/context/$SLUG-latest.md → current state
|
|
32
|
+
|
|
33
|
+
## TDD MANDATE (hard constraint — violation = automatic FAIL)
|
|
34
|
+
> Write failing tests FIRST → confirm RED (exit_code=1) → implement minimum code → confirm GREEN.
|
|
35
|
+
> Every NEW AC requires: write_test → verify_red → implement → verify_green in execution_steps.
|
|
36
|
+
> No exceptions. Verifier rejects missing RED evidence. For already-passing ACs, use verify_existing.
|
|
37
|
+
|
|
38
|
+
## SCOPE LOCK (hard constraint — violation causes verification failure)
|
|
39
|
+
```
|
|
40
|
+
|
|
41
|
+
- [ ] **Step 2: Remove old Test-First Approach section (L333-337)**
|
|
42
|
+
|
|
43
|
+
Find and delete these 6 lines (header + blank line + 4 items):
|
|
44
|
+
```
|
|
45
|
+
## Test-First Approach (read test-spec BEFORE coding)
|
|
46
|
+
1. Read test-spec "Impacted Tests" — if TODO (first iteration), skip to step 2 and fill this section during your work. Otherwise, run these FIRST to confirm they pass before your changes.
|
|
47
|
+
2. Read test-spec "Required New Tests" — write these. They SHOULD FAIL initially.
|
|
48
|
+
3. Implement minimum code to make all tests pass.
|
|
49
|
+
4. Run ALL tests (impacted + new) to confirm nothing is broken.
|
|
50
|
+
```
|
|
51
|
+
|
|
52
|
+
- [ ] **Step 3: Verify changes**
|
|
53
|
+
|
|
54
|
+
Run: `grep -n "TDD MANDATE\|Test-First Approach\|SCOPE LOCK" src/scripts/init_ralph_desk.zsh`
|
|
55
|
+
Expected: TDD MANDATE appears BEFORE SCOPE LOCK. "Test-First Approach" does NOT appear.
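
The expectation above can also be checked mechanically instead of by eye. The following is a minimal sketch; both function names are hypothetical and not part of the plan.

```bash
# Hypothetical helpers for checking heading order in a template file.

# Print the line number of the first line in file $2 that starts with "## $1".
heading_line() {
  grep -n -m1 "^## $1" "$2" | cut -d: -f1
}

# $1 = file to check. Succeeds only when TDD MANDATE precedes SCOPE LOCK
# and the old "Test-First Approach" text no longer appears anywhere.
check_tdd_before_scope() {
  tdd=$(heading_line "TDD MANDATE" "$1")
  scope=$(heading_line "SCOPE LOCK" "$1")
  [ -n "$tdd" ] && [ -n "$scope" ] && [ "$tdd" -lt "$scope" ] \
    && ! grep -q "Test-First Approach" "$1"
}
```

Running `check_tdd_before_scope src/scripts/init_ralph_desk.zsh && echo OK` would then confirm the step without manual inspection of the `grep` output.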
|
|
56
|
+
|
|
57
|
+
---
|
|
58
|
+
|
|
59
|
+
### Task 2: Compress Forbidden Shortcuts from 14 to 6 items
|
|
60
|
+
|
|
61
|
+
**Files:**
|
|
62
|
+
- Modify: `src/scripts/init_ralph_desk.zsh:339-353`
|
|
63
|
+
|
|
64
|
+
- [ ] **Step 1: Replace Forbidden Shortcuts section**
|
|
65
|
+
|
|
66
|
+
Find the entire `## Forbidden Shortcuts` section (14 items from L339-353) and replace with these 6 items:
|
|
67
|
+
|
|
68
|
+
```
|
|
69
|
+
## Forbidden Shortcuts (Verifier will check these)
|
|
70
|
+
- Do not mock external services when L2 integration test is required by test-spec.
|
|
71
|
+
- Do not delete or weaken existing assertions to make tests pass.
|
|
72
|
+
- Do not skip boundary cases listed in the PRD.
|
|
73
|
+
- Do not write code before tests — if you did, delete it and start with tests.
|
|
74
|
+
- **NEVER modify rlp-desk infrastructure files** (~/.claude/ralph-desk/*, ~/.claude/commands/rlp-desk.md). If you discover a bug in rlp-desk itself, report it in done-claim.json with {"status": "blocked", "reason": "rlp-desk bug: <description>"} and signal blocked. Do NOT attempt to fix rlp-desk — it is the orchestration tool, not your project code.
|
|
75
|
+
- **NEVER modify Claude Code settings** (~/.claude/settings.json, .claude/settings.local.json, or any settings files). Do NOT add permissions, change models, or alter configuration. If a permission prompt blocks you, report it as blocked — do NOT try to edit settings to bypass it.
|
|
76
|
+
```
|
|
77
|
+
|
|
78
|
+
Removed items and their coverage:
|
|
79
|
+
- L342 "test-specific logic" → Verifier step 10 (L474) checks this
|
|
80
|
+
- L344 "code inspection" → Verifier step 10½ phrase scan (L484)
|
|
81
|
+
- L345 "too simple to test" → Verifier step 10½ phrase scan (L484)
|
|
82
|
+
- L346 "I'll test after" → will add to Verifier step 10½ in Task 4
|
|
83
|
+
- L347 "already manually tested" → Verifier step 10½ phrase scan (L484)
|
|
84
|
+
- L348 "partial check is enough" → Verifier step 10½ phrase scan (L484)
|
|
85
|
+
- L349 "I'm confident" → Verifier step 10½ phrase scan (L484)
|
|
86
|
+
- L350 "existing code has no tests" → TDD MANDATE covers ("no exceptions")
|
|
87
|
+
|
|
88
|
+
- [ ] **Step 2: Verify line count**
|
|
89
|
+
|
|
90
|
+
Run: `sed -n '/^## Forbidden Shortcuts/,/^## /p' src/scripts/init_ralph_desk.zsh | head -10`
|
|
91
|
+
Expected: 7 lines (1 header + 6 items)
|
|
92
|
+
|
|
93
|
+
---
|
|
94
|
+
|
|
95
|
+
### Task 3: Add "When Stuck" debugging guide
|
|
96
|
+
|
|
97
|
+
**Files:**
|
|
98
|
+
- Modify: `src/scripts/init_ralph_desk.zsh` (after Forbidden Shortcuts, before Iteration rules)
|
|
99
|
+
|
|
100
|
+
- [ ] **Step 1: Insert debugging guide**
|
|
101
|
+
|
|
102
|
+
Find:
|
|
103
|
+
```
|
|
104
|
+
- **NEVER modify Claude Code settings** (~/.claude/settings.json, .claude/settings.local.json, or any settings files). Do NOT add permissions, change models, or alter configuration. If a permission prompt blocks you, report it as blocked — do NOT try to edit settings to bypass it.
|
|
105
|
+
|
|
106
|
+
## Iteration rules
|
|
107
|
+
```
|
|
108
|
+
|
|
109
|
+
Replace with:
|
|
110
|
+
```
|
|
111
|
+
- **NEVER modify Claude Code settings** (~/.claude/settings.json, .claude/settings.local.json, or any settings files). Do NOT add permissions, change models, or alter configuration. If a permission prompt blocks you, report it as blocked — do NOT try to edit settings to bypass it.
|
|
112
|
+
|
|
113
|
+
## When Stuck (do NOT guess-and-fix)
|
|
114
|
+
> 1. STOP and READ the error. Trace the call stack. Identify the root cause before touching code.
|
|
115
|
+
> 2. Write a minimal test that reproduces the failure, then fix the root cause only.
|
|
116
|
+
> 3. If 3+ fixes fail on the same issue, signal "blocked" with your diagnosis.
|
|
117
|
+
|
|
118
|
+
## Iteration rules
|
|
119
|
+
```
|
|
120
|
+
|
|
121
|
+
- [ ] **Step 2: Verify insertion**
|
|
122
|
+
|
|
123
|
+
Run: `grep -n "When Stuck" src/scripts/init_ralph_desk.zsh`
|
|
124
|
+
Expected: exactly 1 match, between Forbidden Shortcuts and Iteration rules
|
|
125
|
+
|
|
126
|
+
---
|
|
127
|
+
|
|
128
|
+
### Task 4: Extend Verifier Anti-Rationalization + gap-close step 10½
|
|
129
|
+
|
|
130
|
+
**Files:**
|
|
131
|
+
- Modify: `src/scripts/init_ralph_desk.zsh:480,484`
|
|
132
|
+
|
|
133
|
+
- [ ] **Step 1: Add rationalization red flags to step 10¼**
|
|
134
|
+
|
|
135
|
+
Find:
|
|
136
|
+
```
|
|
137
|
+
- Never issue a silent PASS — every pass verdict must cite specific evidence for each AC checked
|
|
138
|
+
10½. **Worker Process Audit**:
|
|
139
|
+
```
|
|
140
|
+
|
|
141
|
+
Replace with:
|
|
142
|
+
```
|
|
143
|
+
- Never issue a silent PASS — every pass verdict must cite specific evidence for each AC checked
|
|
144
|
+
- Rationalization red flags: "tests pass so it works" (passing ≠ correct), "Worker is confident" (confidence ≠ evidence), "changes are minimal" (scope ≠ correctness)
|
|
145
|
+
10½. **Worker Process Audit**:
|
|
146
|
+
```
|
|
147
|
+
|
|
148
|
+
- [ ] **Step 2: Gap-close — add "I'll test after" to step 10½ phrase scan**
|
|
149
|
+
|
|
150
|
+
Find:
|
|
151
|
+
```
|
|
152
|
+
- Forbidden shortcuts: check done-claim claims and summary for forbidden phrases ("code inspection", "I'm confident", "too simple", "already manually tested", "partial check")
|
|
153
|
+
```
|
|
154
|
+
|
|
155
|
+
Replace with:
|
|
156
|
+
```
|
|
157
|
+
- Forbidden shortcuts: check done-claim claims and summary for forbidden phrases ("code inspection", "I'm confident", "too simple", "I'll test after", "already manually tested", "partial check")
|
|
158
|
+
```
|
|
159
|
+
|
|
160
|
+
- [ ] **Step 3: Verify both changes**
|
|
161
|
+
|
|
162
|
+
Run: `grep -n "Rationalization red flags\|I'll test after" src/scripts/init_ralph_desk.zsh`
|
|
163
|
+
Expected: 2 matches — one in step 10¼, one in step 10½
|
|
164
|
+
|
|
165
|
+
---
|
|
166
|
+
|
|
167
|
+
## Token Budget Verification
|
|
168
|
+
|
|
169
|
+
After all 4 tasks, verify net token reduction:
|
|
170
|
+
|
|
171
|
+
```bash
|
|
172
|
+
# Before: count Worker prompt lines (approximate)
|
|
173
|
+
# After: should be ~10 lines fewer
|
|
174
|
+
wc -l src/scripts/init_ralph_desk.zsh
|
|
175
|
+
# Compare with git: lines removed vs added
|
|
176
|
+
git diff --stat src/scripts/init_ralph_desk.zsh
|
|
177
|
+
```
|
|
178
|
+
|
|
179
|
+
Expected: net negative line count (fewer lines = less context pressure on Worker).
|
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@ai-dev-methodologies/rlp-desk",
|
|
3
|
-
"version": "0.
|
|
3
|
+
"version": "0.7.0",
|
|
4
4
|
"description": "Fresh-context iterative loops for Claude Code — autonomous task completion with independent verification",
|
|
5
5
|
"scripts": {
|
|
6
6
|
"postinstall": "node scripts/postinstall.js",
|
package/src/commands/rlp-desk.md
CHANGED
|
@@ -69,14 +69,14 @@ Ask about these items one by one (or in small groups):
|
|
|
69
69
|
|
|
70
70
|
| Complexity | Worker | per-US Verifier | Final Verifier | Consensus |
|
|
71
71
|
|------------|--------|-----------------|----------------|-----------|
|
|
72
|
-
| LOW |
|
|
73
|
-
| MEDIUM |
|
|
72
|
+
| LOW | gpt-5.4:high | sonnet | opus | final-only |
|
|
73
|
+
| MEDIUM | gpt-5.4:high | opus | opus | final-only |
|
|
74
74
|
| HIGH | gpt-5.4:high | opus | opus | all |
|
|
75
75
|
| CRITICAL | gpt-5.4:high | opus | opus + human | all |
|
|
76
76
|
|
|
77
77
|
**Worker model selection** (cross-engine):
|
|
78
|
-
- **
|
|
79
|
-
- **
|
|
78
|
+
- **gpt-5.4:high** — default recommendation (full context window, reliable for all US sizes)
|
|
79
|
+
- **spark:high** — only when US is small enough for spark's 100k context (single-file, AC count <= 4, simple logic). Do NOT use as primary recommendation — spark context window is too small for most tasks
|
|
80
80
|
|
|
81
81
|
Present complexity score with evidence to the user, e.g.: "I rate this MEDIUM because: US count=4 (MEDIUM), file scope=2 (MEDIUM), logic=conditionals (MEDIUM), deps=none (LOW), impact=modify (MEDIUM). Highest=MEDIUM."
|
|
82
82
|
|
|
@@ -85,12 +85,24 @@ Ask about these items one by one (or in small groups):
|
|
|
85
85
|
**If codex is NOT installed** — say: "Codex is not installed. Defaulting to claude-only Worker. Note: without a second engine, your Verifier shares the same perspective as the Worker — there is a risk of blind spots where both Worker and Verifier miss the same issue. To unlock cross-engine coverage: `npm install -g @openai/codex`"
|
|
86
86
|
|
|
87
87
|
8. **Batch Capacity Check** — when verify-mode is batch and PRD is large:
|
|
88
|
-
- batch + spark + AC >
|
|
88
|
+
- batch + spark + AC > 4 → warn "spark 100k context limit — switch to gpt-5.4 or split smaller"
|
|
89
89
|
- batch + gpt-5.4 + AC > 15 → warn "too many ACs for single batch — consider wave split (3-4 US per wave)"
|
|
90
90
|
- per-us → no warning (US-level processing, no limit concern)
|
|
91
91
|
9. **Verify Mode** — per-us (default) or batch. Ask: "Verify after each user story (per-us, recommended) or only after all stories are done (batch)?" Default recommendation: per-us for 2+ stories.
|
|
92
92
|
10. **Consensus** — Ask: "Use cross-engine consensus? off (single engine), final-only (cross-engine on final verify only), or all (cross-engine on every verify). Requires codex CLI." Default: off. Recommended: final-only when codex is installed.
|
|
93
93
|
11. **Max Iterations** — suggest based on story count, ask if OK.
|
|
94
|
+
12. **Operational Context** — Auto-detect: scan project root for `package.json` (scripts.dev/start), `Makefile`, `docker-compose.yml`, `manage.py`. If detected, ask:
|
|
95
|
+
- "Does this project require a running server/service during development?" (y/n)
|
|
96
|
+
- If yes: "Server start command?" (pre-fill from detected scripts, e.g., `npm run dev`)
|
|
97
|
+
- "Server port?" (e.g., 7001)
|
|
98
|
+
- "Health check URL?" (e.g., `http://localhost:7001/health`) — optional
|
|
99
|
+
- Pass to init: `--server-cmd "CMD" --server-port PORT --server-health URL`
|
|
100
|
+
- If no server needed: skip. Init generates prompts without operational context.
|
|
101
|
+
|
|
102
|
+
**US generation guidance when server context is present:**
|
|
103
|
+
- Each US that modifies server/application code SHOULD include an AC or note:
|
|
104
|
+
"Given server is running, When code is modified, Then server is restarted and health check passes"
|
|
105
|
+
- Do NOT assume the Worker model will restart servers on its own — spell it out in the AC or rely on the operational rules injected by init.
|
|
94
106
|
|
|
95
107
|
After all items are confirmed:
|
|
96
108
|
|
|
@@ -132,12 +144,12 @@ Tell the user:
|
|
|
132
144
|
```
|
|
133
145
|
Available run commands (copy the one you want):
|
|
134
146
|
|
|
135
|
-
# ★ Recommended: cross-engine + final-consensus (
|
|
136
|
-
/rlp-desk run <actual-slug> --mode tmux --worker-model spark:high --consensus final-only --debug
|
|
137
|
-
|
|
138
|
-
# Large PRD (AC > 15, exceeds spark 100k limit):
|
|
147
|
+
# ★ Recommended: cross-engine + final-consensus (full context + blind-spot coverage):
|
|
139
148
|
/rlp-desk run <actual-slug> --mode tmux --worker-model gpt-5.4:high --consensus final-only --debug
|
|
140
149
|
|
|
150
|
+
# Small tasks only (single-file, AC <= 4, simple logic — spark 100k context limit):
|
|
151
|
+
/rlp-desk run <actual-slug> --mode tmux --worker-model spark:high --consensus final-only --debug
|
|
152
|
+
|
|
141
153
|
# Critical (full consensus on every verify):
|
|
142
154
|
/rlp-desk run <actual-slug> --mode tmux --worker-model gpt-5.4:high --consensus all --debug
|
|
143
155
|
|
|
@@ -146,7 +158,7 @@ Tell the user:
|
|
|
146
158
|
|
|
147
159
|
# Full options reference:
|
|
148
160
|
# --mode agent|tmux (default: agent)
|
|
149
|
-
# --worker-model MODEL haiku|sonnet|opus or
|
|
161
|
+
# --worker-model MODEL haiku|sonnet|opus or gpt-5.4:high|spark:high (default: haiku)
|
|
150
162
|
# --lock-worker-model disable auto model upgrade
|
|
151
163
|
# --verifier-model MODEL per-US verifier (default: sonnet)
|
|
152
164
|
# --final-verifier-model MODEL final ALL verifier (default: opus)
|
|
@@ -695,7 +707,7 @@ Example:
|
|
|
695
707
|
|
|
696
708
|
Run options:
|
|
697
709
|
--mode agent|tmux Execution mode (default: agent)
|
|
698
|
-
--worker-model MODEL Worker model: haiku|sonnet|opus or
|
|
710
|
+
--worker-model MODEL Worker model: haiku|sonnet|opus or gpt-5.4:high|spark:high (default: haiku)
|
|
699
711
|
--lock-worker-model Disable auto model upgrade on failure
|
|
700
712
|
--verifier-model MODEL per-US verifier (default: sonnet)
|
|
701
713
|
--final-verifier-model MODEL Final ALL verifier (default: opus)
|
|
@@ -9,23 +9,23 @@ CB default: 6. Override: `--cb-threshold N`. Worker only — Verifier fixed at c
|
|
|
9
9
|
- CB < table columns → BLOCKED at that column
|
|
10
10
|
- CB > 6 → repeat ceiling model beyond column 6
|
|
11
11
|
|
|
12
|
-
## GPT Pro (spark — separate token limit)
|
|
12
|
+
## GPT Pro (gpt-5.3-codex-spark — separate token limit)
|
|
13
13
|
|
|
14
14
|
| Complexity | 1-2 | 3-4 | 5-6 | 7+ |
|
|
15
15
|
|------------|-----|-----|-----|-----|
|
|
16
|
-
| LOW | spark:low | spark:medium | spark:high | BLOCKED |
|
|
17
|
-
| MEDIUM | spark:medium | spark:high | spark:xhigh | BLOCKED |
|
|
18
|
-
| HIGH | spark:high | spark:xhigh | spark:xhigh | BLOCKED |
|
|
19
|
-
| CRITICAL | spark:xhigh | spark:xhigh | spark:xhigh | BLOCKED |
|
|
16
|
+
| LOW | gpt-5.3-codex-spark:low | gpt-5.3-codex-spark:medium | gpt-5.3-codex-spark:high | BLOCKED |
|
|
17
|
+
| MEDIUM | gpt-5.3-codex-spark:medium | gpt-5.3-codex-spark:high | gpt-5.3-codex-spark:xhigh | BLOCKED |
|
|
18
|
+
| HIGH | gpt-5.3-codex-spark:high | gpt-5.3-codex-spark:xhigh | gpt-5.3-codex-spark:xhigh | BLOCKED |
|
|
19
|
+
| CRITICAL | gpt-5.3-codex-spark:xhigh | gpt-5.3-codex-spark:xhigh | gpt-5.3-codex-spark:xhigh | BLOCKED |
|
|
20
20
|
|
|
21
21
|
## Non-Pro (gpt-5.4)
|
|
22
22
|
|
|
23
23
|
| Complexity | 1-2 | 3-4 | 5-6 | 7+ |
|
|
24
24
|
|------------|-----|-----|-----|-----|
|
|
25
|
-
| LOW | 5.4:low | 5.4:medium | 5.4:high | BLOCKED |
|
|
26
|
-
| MEDIUM | 5.4:medium | 5.4:high | 5.4:xhigh | BLOCKED |
|
|
27
|
-
| HIGH | 5.4:high | 5.4:xhigh | 5.4:xhigh | BLOCKED |
|
|
28
|
-
| CRITICAL | 5.4:xhigh | 5.4:xhigh | 5.4:xhigh | BLOCKED |
|
|
25
|
+
| LOW | gpt-5.4:low | gpt-5.4:medium | gpt-5.4:high | BLOCKED |
|
|
26
|
+
| MEDIUM | gpt-5.4:medium | gpt-5.4:high | gpt-5.4:xhigh | BLOCKED |
|
|
27
|
+
| HIGH | gpt-5.4:high | gpt-5.4:xhigh | gpt-5.4:xhigh | BLOCKED |
|
|
28
|
+
| CRITICAL | gpt-5.4:xhigh | gpt-5.4:xhigh | gpt-5.4:xhigh | BLOCKED |
|
|
29
29
|
|
|
30
30
|
## Claude-only
|
|
31
31
|
|
|
@@ -11,11 +11,14 @@ set -euo pipefail
|
|
|
11
11
|
# ~/.claude/ralph-desk/init_ralph_desk.zsh <slug> [objective] [--mode fresh|improve]
|
|
12
12
|
# =============================================================================
|
|
13
13
|
|
|
14
|
-
SLUG="${1:?Usage: $0 <slug> [objective] [--mode fresh|improve]}"
|
|
14
|
+
SLUG="${1:?Usage: $0 <slug> [objective] [--mode fresh|improve] [--server-cmd CMD] [--server-port PORT] [--server-health URL]}"
|
|
15
15
|
MODE=""
|
|
16
16
|
OBJECTIVE="TBD - fill in the objective"
|
|
17
|
+
SERVER_CMD=""
|
|
18
|
+
SERVER_PORT=""
|
|
19
|
+
SERVER_HEALTH=""
|
|
17
20
|
|
|
18
|
-
# Parse remaining arguments
|
|
21
|
+
# Parse remaining arguments
|
|
19
22
|
shift
|
|
20
23
|
while [[ $# -gt 0 ]]; do
|
|
21
24
|
case "$1" in
|
|
@@ -27,6 +30,30 @@ while [[ $# -gt 0 ]]; do
|
|
|
27
30
|
MODE="${1#--mode=}"
|
|
28
31
|
shift
|
|
29
32
|
;;
|
|
33
|
+
--server-cmd)
|
|
34
|
+
SERVER_CMD="${2:?--server-cmd requires a command}"
|
|
35
|
+
shift 2
|
|
36
|
+
;;
|
|
37
|
+
--server-cmd=*)
|
|
38
|
+
SERVER_CMD="${1#--server-cmd=}"
|
|
39
|
+
shift
|
|
40
|
+
;;
|
|
41
|
+
--server-port)
|
|
42
|
+
SERVER_PORT="${2:?--server-port requires a port number}"
|
|
43
|
+
shift 2
|
|
44
|
+
;;
|
|
45
|
+
--server-port=*)
|
|
46
|
+
SERVER_PORT="${1#--server-port=}"
|
|
47
|
+
shift
|
|
48
|
+
;;
|
|
49
|
+
--server-health)
|
|
50
|
+
SERVER_HEALTH="${2:?--server-health requires a URL}"
|
|
51
|
+
shift 2
|
|
52
|
+
;;
|
|
53
|
+
--server-health=*)
|
|
54
|
+
SERVER_HEALTH="${1#--server-health=}"
|
|
55
|
+
shift
|
|
56
|
+
;;
|
|
30
57
|
*)
|
|
31
58
|
OBJECTIVE="$1"
|
|
32
59
|
shift
|
|
@@ -176,10 +203,10 @@ print_run_presets() {
|
|
|
176
203
|
echo "Available run commands (copy the one you want):"
|
|
177
204
|
echo ""
|
|
178
205
|
if [[ $codex_available -eq 1 ]]; then
|
|
179
|
-
echo "# Recommended: cross-engine + final-consensus (
|
|
206
|
+
echo "# Recommended: cross-engine + final-consensus (full context + blind-spot coverage):"
|
|
180
207
|
echo "/rlp-desk run $slug --worker-model gpt-5.4:high --final-consensus --debug"
|
|
181
208
|
echo ""
|
|
182
|
-
echo "#
|
|
209
|
+
echo "# Small tasks only (single-file, AC <= 4, simple logic — spark 100k context limit):"
|
|
183
210
|
echo "/rlp-desk run $slug --worker-model gpt-5.3-codex-spark:high --debug"
|
|
184
211
|
echo ""
|
|
185
212
|
echo "# Claude-only:"
|
|
@@ -295,6 +322,11 @@ Read these files in order:
|
|
|
295
322
|
3. Test Spec: $DESK/plans/test-spec-$SLUG.md → verification methods
|
|
296
323
|
4. Latest Context: $DESK/context/$SLUG-latest.md → current state
|
|
297
324
|
|
|
325
|
+
## TDD MANDATE (hard constraint — violation = automatic FAIL)
|
|
326
|
+
> Write failing tests FIRST → confirm RED (exit_code=1) → implement minimum code → confirm GREEN.
|
|
327
|
+
> Every NEW AC requires: write_test → verify_red → implement → verify_green in execution_steps.
|
|
328
|
+
> No exceptions. Verifier rejects missing RED evidence. For already-passing ACs, use verify_existing.
|
|
329
|
+
|
|
298
330
|
## SCOPE LOCK (hard constraint — violation causes verification failure)
|
|
299
331
|
- You MUST only implement the work described in the "Next Iteration Contract" from campaign memory.
|
|
300
332
|
- If the contract says "implement US-001 only", do ONLY that. Do NOT touch other stories.
|
|
@@ -303,28 +335,19 @@ Read these files in order:
|
|
|
303
335
|
- No file creation or modification outside the project root.
|
|
304
336
|
- Do not modify this prompt file or any PRD/test-spec files.
|
|
305
337
|
|
|
306
|
-
## Test-First Approach (read test-spec BEFORE coding)
|
|
307
|
-
1. Read test-spec "Impacted Tests" — if TODO (first iteration), skip to step 2 and fill this section during your work. Otherwise, run these FIRST to confirm they pass before your changes.
|
|
308
|
-
2. Read test-spec "Required New Tests" — write these. They SHOULD FAIL initially.
|
|
309
|
-
3. Implement minimum code to make all tests pass.
|
|
310
|
-
4. Run ALL tests (impacted + new) to confirm nothing is broken.
|
|
311
|
-
|
|
312
338
|
## Forbidden Shortcuts (Verifier will check these)
|
|
313
339
|
- Do not mock external services when L2 integration test is required by test-spec.
|
|
314
340
|
- Do not delete or weaken existing assertions to make tests pass.
|
|
315
|
-
- Do not add test-specific logic (code that detects it is running in a test).
|
|
316
341
|
- Do not skip boundary cases listed in the PRD.
|
|
317
|
-
- Do not claim "code inspection" as verification — run the actual command.
|
|
318
|
-
- Do not say "too simple to test" — simple code breaks. Test takes 30 seconds.
|
|
319
|
-
- Do not say "I'll test after" — tests passing immediately prove nothing.
|
|
320
|
-
- Do not say "already manually tested" — ad-hoc is not systematic, no record.
|
|
321
|
-
- Do not say "partial check is enough" — partial proves nothing about the whole.
|
|
322
|
-
- Do not say "I'm confident" — confidence is not evidence.
|
|
323
|
-
- Do not say "existing code has no tests" — you are improving it, add tests.
|
|
324
342
|
- Do not write code before tests — if you did, delete it and start with tests.
|
|
325
343
|
- **NEVER modify rlp-desk infrastructure files** (~/.claude/ralph-desk/*, ~/.claude/commands/rlp-desk.md). If you discover a bug in rlp-desk itself, report it in done-claim.json with {"status": "blocked", "reason": "rlp-desk bug: <description>"} and signal blocked. Do NOT attempt to fix rlp-desk — it is the orchestration tool, not your project code.
|
|
326
344
|
- **NEVER modify Claude Code settings** (~/.claude/settings.json, .claude/settings.local.json, or any settings files). Do NOT add permissions, change models, or alter configuration. If a permission prompt blocks you, report it as blocked — do NOT try to edit settings to bypass it.
|
|
327
345
|
|
|
346
|
+
## When Stuck (do NOT guess-and-fix)
|
|
347
|
+
> 1. STOP and READ the error. Trace the call stack. Identify the root cause before touching code.
|
|
348
|
+
> 2. Write a minimal test that reproduces the failure, then fix the root cause only.
|
|
349
|
+
> 3. If 3+ fixes fail on the same issue, signal "blocked" with your diagnosis.
|
|
350
|
+
|
|
328
351
|
## Iteration rules
|
|
329
352
|
- Use fresh context only; do NOT depend on prior chat history.
|
|
330
353
|
- Execute exactly the work specified in the Next Iteration Contract.
|
|
@@ -378,6 +401,24 @@ execution_steps MUST be a JSON array of objects (not a dict with string keys). E
|
|
|
378
401
|
## Objective
|
|
379
402
|
$OBJECTIVE
|
|
380
403
|
EOF
|
|
404
|
+
|
|
405
|
+
# Inject operational context if server options provided
|
|
406
|
+
if [[ -n "$SERVER_CMD" || -n "$SERVER_PORT" ]]; then
|
|
407
|
+
cat >> "$F" <<OPCTX
|
|
408
|
+
|
|
409
|
+
## Operational Context
|
|
410
|
+
$([ -n "$SERVER_CMD" ] && echo "- **Server Start Command**: \`$SERVER_CMD\`")
|
|
411
|
+
$([ -n "$SERVER_PORT" ] && echo "- **Server Port**: $SERVER_PORT")
|
|
412
|
+
$([ -n "$SERVER_HEALTH" ] && echo "- **Health Check URL**: $SERVER_HEALTH")
|
|
413
|
+
|
|
414
|
+
### Operational Rules (always apply when server context is present)
|
|
415
|
+
- After modifying server/application code, restart the server$([ -n "$SERVER_CMD" ] && echo ": \`$SERVER_CMD\`")
|
|
416
|
+
- Before signaling done, verify the server responds$([ -n "$SERVER_HEALTH" ] && echo ": \`curl -sf $SERVER_HEALTH\`" || [ -n "$SERVER_PORT" ] && echo ": \`curl -sf http://localhost:$SERVER_PORT/\`")
|
|
417
|
+
- Do NOT modify dependency files (package.json, requirements.txt, etc.) unless the AC explicitly requires it
|
|
418
|
+
- Do NOT run package install commands (npm install, pip install, etc.) unless the AC explicitly requires it
|
|
419
|
+
OPCTX
|
|
420
|
+
fi
|
|
421
|
+
|
|
381
422
|
echo " + $F"
|
|
382
423
|
else echo " · $F"; fi
|
|
383
424
|
|
|
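Note on the "verify the server responds" line in the hunk above: in POSIX shells and zsh, `&&` and `||` have equal precedence and associate left to right, so `A && B || C && D` runs `D` whenever `B` succeeds. With `SERVER_HEALTH` set, the chained form would therefore appear to emit both hints. A minimal sketch of the difference, where `chained` and `health_hint` are hypothetical names used only for illustration and are not part of the package:

```shell
# Same shape as the chained test in the heredoc: (A && B) || C, then && D.
chained() { [ -n "$1" ] && echo "health" || [ -n "$2" ] && echo "port"; }

# Explicit branching prints at most one hint.
health_hint() {
  local health="$1"
  local port="$2"
  if [ -n "$health" ]; then
    echo ": \`curl -sf $health\`"
  elif [ -n "$port" ]; then
    echo ": \`curl -sf http://localhost:$port/\`"
  fi
}

chained "http://localhost:3000/health" ""        # prints both "health" and "port"
health_hint "http://localhost:3000/health" 3000  # prints only the health-check hint
health_hint "" 8080                              # falls back to the port hint
```

An `if`/`elif` helper like this could be called from inside the heredoc's `$(...)` in place of the chained test; whether the package adopts that is up to its maintainers.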
@@ -433,10 +474,11 @@ Check the iter-signal.json "us_id" field:
|
|
|
433
474
|
- If your verdict history shows a 100% pass rate, re-examine your last verdict with increased scrutiny — a 100% pass rate is a red flag for insufficient rigor
|
|
434
475
|
- When issuing PASS with explicit warning: note any concerning patterns (e.g., low test diversity, marginal coverage) even if technically passing
|
|
435
476
|
- Never issue a silent PASS — every pass verdict must cite specific evidence for each AC checked
|
|
477
|
+
- Rationalization red flags: "tests pass so it works" (passing ≠ correct), "Worker is confident" (confidence ≠ evidence), "changes are minimal" (scope ≠ correctness)
|
|
436
478
|
10½. **Worker Process Audit**:
|
|
437
479
|
- Test-first compliance: done-claim execution_steps must show write_test step before implement step for each AC
|
|
438
480
|
- RED phase evidence: at least one verify_red step with exit_code=1 per AC (proves tests were written before passing)
|
|
439
|
-
- Forbidden shortcuts: check done-claim claims and summary for forbidden phrases ("code inspection", "I'm confident", "too simple", "already manually tested", "partial check")
|
|
481
|
+
- Forbidden shortcuts: check done-claim claims and summary for forbidden phrases ("code inspection", "I'm confident", "too simple", "I'll test after", "already manually tested", "partial check")
|
|
440
482
|
- Step completeness: each AC should have write_test → verify_red → implement → verify_green sequence in execution_steps
|
|
441
483
|
11. **Reproducibility check**: verify lock file committed, clean install succeeds, security scan passes, env vars documented (per test-spec Reproducibility Gate). Skip if test-spec says "N/A."
|
|
442
484
|
12. Write verdict JSON to: $DESK/memos/$SLUG-verify-verdict.json
|
|
@@ -447,6 +489,7 @@ Verdict JSON:
|
|
|
447
489
|
"us_id": "US-NNN or ALL (matches the scope you verified)",
|
|
448
490
|
"verified_at_utc": "ISO timestamp",
|
|
449
491
|
"summary": "...",
|
|
492
|
+
"per_us_results": {"US-001": "pass|fail|not_started", "US-002": "pass|fail|not_started"},
|
|
450
493
|
"criteria_results": [{"criterion":"...","met":true/false,"evidence":"..."}],
|
|
451
494
|
"missing_evidence": [],
|
|
452
495
|
"issues": [{"id":"...","severity":"critical|major|minor","description":"...","fix_hint":"(suggestion, non-authoritative)"}],
|
|
@@ -471,7 +514,20 @@ Rules:
|
|
|
471
514
|
- Deterministic checks (type hints, linting, security) delegate to test-spec tools; focus on AC verification + semantic review + smoke test.
|
|
472
515
|
- Do NOT modify code or write sentinel files.
|
|
473
516
|
- If Worker claims "inspection" or "review" for an AC that requires an automated command, verdict = FAIL.
|
|
517
|
+
- **ALWAYS include per_us_results** in verdict JSON — map each US to "pass", "fail", or "not_started". This is required for partial progress tracking in both batch and per-us modes.
|
|
474
518
|
EOF
|
|
519
|
+
|
|
520
|
+
# Inject operational verification if server options provided
|
|
521
|
+
if [[ -n "$SERVER_CMD" || -n "$SERVER_PORT" ]]; then
|
|
522
|
+
cat >> "$F" <<OPVER
|
|
523
|
+
|
|
524
|
+
## Operational Verification (server context present)
|
|
525
|
+
- Before verifying ACs, check that the server is running$([ -n "$SERVER_PORT" ] && echo " on port $SERVER_PORT")$([ -n "$SERVER_HEALTH" ] && echo ": \`curl -sf $SERVER_HEALTH\`")
|
|
526
|
+
- If the server is not running, verdict = FAIL with issue: "server not running on expected port"
|
|
527
|
+
- If Worker modified server code but did not restart the server, verdict = FAIL with issue: "server not restarted after code change"
|
|
528
|
+
OPVER
|
|
529
|
+
fi
|
|
530
|
+
|
|
475
531
|
echo " + $F"
|
|
476
532
|
else echo " · $F"; fi
|
|
477
533
|
|
|
@@ -29,9 +29,36 @@ log_error() {
|
|
|
29
29
|
echo "[$(date '+%Y-%m-%d %H:%M:%S')] ERROR: $*" >&2
|
|
30
30
|
}
|
|
31
31
|
|
|
32
|
+
# build_claude_cmd() — centralized claude CLI command builder
|
|
33
|
+
# Single source of truth for all claude invocation flags (--no-mcp, DISABLE_OMC, etc.)
|
|
34
|
+
# Inspired by codex-plugin-cc companion pattern: CLI abstraction in one place.
|
|
35
|
+
# Args: $1=mode (tui|print) $2=model $3=prompt_file (print mode only) $4=output_log (print mode only)
|
|
36
|
+
# Output: complete command string on stdout
|
|
37
|
+
# Globals read: CLAUDE_BIN
|
|
38
|
+
build_claude_cmd() {
|
|
39
|
+
local mode="$1"
|
|
40
|
+
local model="$2"
|
|
41
|
+
local prompt_file="${3:-}"
|
|
42
|
+
local output_log="${4:-}"
|
|
43
|
+
|
|
44
|
+
local base="DISABLE_OMC=1 $CLAUDE_BIN --model $model --no-mcp --dangerously-skip-permissions"
|
|
45
|
+
case "$mode" in
|
|
46
|
+
tui)
|
|
47
|
+
echo "$base"
|
|
48
|
+
;;
|
|
49
|
+
print)
|
|
50
|
+
echo "$base -p \"\$(cat $prompt_file)\" --output-format text 2>&1 | tee $output_log"
|
|
51
|
+
;;
|
|
52
|
+
*)
|
|
53
|
+
echo "ERROR: build_claude_cmd unknown mode '$mode'" >&2
|
|
54
|
+
return 1
|
|
55
|
+
;;
|
|
56
|
+
esac
|
|
57
|
+
}
|
|
58
|
+
|
|
32
59
|
# parse_model_flag() — parse unified --worker-model / --verifier-model value
|
|
33
60
|
# Colon format (model:reasoning) → codex engine; plain name → claude engine.
|
|
34
|
-
# Spark alias:
|
|
61
|
+
# Spark alias: bare "spark" is expanded to full model ID "gpt-5.3-codex-spark".
|
|
35
62
|
# Usage: parse_model_flag <value> <role>
|
|
36
63
|
# Output (stdout): "engine model [reasoning]" e.g. "codex gpt-5.4 medium" | "claude sonnet"
|
|
37
64
|
# Returns: 0 on success, 1 on invalid format (error written to stderr)
|
|
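For orientation, here is a condensed restatement of `build_claude_cmd` from the hunk above together with a usage sketch. The flags are copied from the diff as-is; `CLAUDE_BIN="claude"`, the model name, and the `/tmp` paths are placeholder assumptions. The function only builds a string, so nothing is executed here.

```shell
CLAUDE_BIN="claude"   # assumption: placeholder binary name

build_claude_cmd() {
  local mode="$1"
  local model="$2"
  local prompt_file="${3:-}"
  local output_log="${4:-}"
  local base="DISABLE_OMC=1 $CLAUDE_BIN --model $model --no-mcp --dangerously-skip-permissions"
  case "$mode" in
    tui)   echo "$base" ;;
    print) echo "$base -p \"\$(cat $prompt_file)\" --output-format text 2>&1 | tee $output_log" ;;
    *)     echo "ERROR: build_claude_cmd unknown mode '$mode'" >&2; return 1 ;;
  esac
}

# tui mode: bare launch line, suitable for pasting into a tmux pane.
tui_cmd=$(build_claude_cmd tui sonnet)
# print mode: the $(cat ...) stays unexpanded until the trigger script runs.
print_cmd=$(build_claude_cmd print sonnet /tmp/prompt.md /tmp/out.log)
echo "$tui_cmd"
echo "$print_cmd"
```

Because `$prompt_file` and `$output_log` are interpolated unquoted into the generated command, paths containing spaces would likely need extra quoting.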
@@ -47,8 +74,8 @@ parse_model_flag() {
|
|
|
47
74
|
if (( colon_count == 1 )); then
|
|
48
75
|
local model="${value%%:*}"
|
|
49
76
|
local reasoning="${value##*:}"
|
|
50
|
-
if [[ "$model" ==
|
|
51
|
-
model="spark"
|
|
77
|
+
if [[ "$model" == "spark" ]]; then
|
|
78
|
+
model="gpt-5.3-codex-spark"
|
|
52
79
|
fi
|
|
53
80
|
echo "codex $model $reasoning"
|
|
54
81
|
else
|
|
@@ -76,7 +103,7 @@ get_model_string() {
|
|
|
76
103
|
# get_next_model() — return next model in Worker upgrade path, or empty at ceiling
|
|
77
104
|
# Usage: get_next_model <model_str>
|
|
78
105
|
# claude: "haiku"|"sonnet"|"opus"
|
|
79
|
-
# codex: "gpt-5.4:medium"|"gpt-5.4:high"|"gpt-5.4:xhigh"|"spark:medium"|...
|
|
106
|
+
# codex: "gpt-5.4:medium"|"gpt-5.4:high"|"gpt-5.4:xhigh"|"gpt-5.3-codex-spark:medium"|...
|
|
80
107
|
# Output: next model string, or empty string if at ceiling
|
|
81
108
|
get_next_model() {
|
|
82
109
|
local current="$1"
|
|
@@ -85,16 +112,11 @@ get_next_model() {
|
|
|
85
112
|
haiku) echo "sonnet" ;;
|
|
86
113
|
sonnet) echo "opus" ;;
|
|
87
114
|
opus) echo "" ;;
|
|
88
|
-
# Codex GPT Pro upgrade path
|
|
89
|
-
spark:low) echo "spark:medium" ;;
|
|
90
|
-
spark:medium) echo "spark:high" ;;
|
|
91
|
-
spark:high) echo "spark:xhigh" ;;
|
|
92
|
-
spark:xhigh) echo "" ;; # spark ceiling
|
|
93
|
-
# Codex GPT Pro upgrade path (full model names)
|
|
115
|
+
# Codex GPT Pro (spark) upgrade path
|
|
94
116
|
gpt-5.3-codex-spark:low) echo "gpt-5.3-codex-spark:medium" ;;
|
|
95
117
|
gpt-5.3-codex-spark:medium) echo "gpt-5.3-codex-spark:high" ;;
|
|
96
118
|
gpt-5.3-codex-spark:high) echo "gpt-5.3-codex-spark:xhigh" ;;
|
|
97
|
-
gpt-5.3-codex-spark:xhigh) echo "" ;; # spark ceiling
|
|
119
|
+
gpt-5.3-codex-spark:xhigh) echo "" ;; # spark ceiling
|
|
98
120
|
# Codex Non-Pro upgrade path
|
|
99
121
|
gpt-5.4:low) echo "gpt-5.4:medium" ;;
|
|
100
122
|
gpt-5.4:medium) echo "gpt-5.4:high" ;;
|
|
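The two hunks above appear to work together: once the bare `spark` alias is expanded to the full model ID at parse time, the upgrade table only needs the `gpt-5.3-codex-spark:*` entries. A reduced sketch of that flow, where `expand_spark_alias` and `next_spark_model` are hypothetical names standing in for the relevant parts of `parse_model_flag` and `get_next_model`:

```shell
# Expand the bare "spark" alias in a model:reasoning value.
expand_spark_alias() {
  local value="$1"
  local model="${value%%:*}"
  local reasoning="${value##*:}"
  if [ "$model" = "spark" ]; then
    model="gpt-5.3-codex-spark"
  fi
  echo "$model:$reasoning"
}

# Spark branch of the upgrade path, full model IDs only.
next_spark_model() {
  case "$1" in
    gpt-5.3-codex-spark:low)    echo "gpt-5.3-codex-spark:medium" ;;
    gpt-5.3-codex-spark:medium) echo "gpt-5.3-codex-spark:high" ;;
    gpt-5.3-codex-spark:high)   echo "gpt-5.3-codex-spark:xhigh" ;;
    *)                          echo "" ;;   # ceiling or unknown
  esac
}

current=$(expand_spark_alias "spark:high")
next=$(next_spark_model "$current")
echo "$current -> $next"
```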
@@ -66,7 +66,7 @@ _auto_detect_engine() {
|
|
|
66
66
|
if [[ "$model_val" == *:* ]]; then
|
|
67
67
|
local model_part="${model_val%%:*}"
|
|
68
68
|
local reasoning_part="${model_val##*:}"
|
|
69
|
-
[[ "$model_part" ==
|
|
69
|
+
[[ "$model_part" == "spark" ]] && model_part="gpt-5.3-codex-spark"
|
|
70
70
|
eval "$engine_var=codex"
|
|
71
71
|
eval "$model_var=$model_part"
|
|
72
72
|
[[ -n "$codex_model_var" ]] && eval "$codex_model_var=$model_part"
|
|
@@ -211,19 +211,55 @@ check_dead_pane() {
|
|
|
211
211
|
return 1 # alive
|
|
212
212
|
}
|
|
213
213
|
|
|
214
|
-
# launch_worker_codex() — launch codex Worker
|
|
215
|
-
#
|
|
216
|
-
#
|
|
214
|
+
# launch_worker_codex() — launch codex Worker TUI, send instruction, verify submission
|
|
215
|
+
# Matches launch_worker_claude() pattern for consistent tmux-visible execution.
|
|
216
|
+
# Args: $1=pane_id $2=prompt_file $3=iteration $4=worker_launch_cmd
|
|
217
|
+
# Returns: 0 on success, 1 on fatal failure
|
|
217
218
|
launch_worker_codex() {
|
|
218
219
|
local pane_id="$1"
|
|
219
|
-
local
|
|
220
|
+
local prompt_file="$2"
|
|
220
221
|
local iter="$3"
|
|
222
|
+
local worker_launch="$4"
|
|
223
|
+
|
|
224
|
+
log " Launching Worker codex TUI in pane $pane_id..."
|
|
225
|
+
paste_to_pane "$pane_id" "$worker_launch"
|
|
226
|
+
tmux send-keys -t "$pane_id" C-m
|
|
227
|
+
|
|
228
|
+
# Wait for codex TUI to be ready
|
|
229
|
+
if ! wait_for_pane_ready "$pane_id" 30; then
|
|
230
|
+
log_error "Worker codex failed to start"
|
|
231
|
+
return 1
|
|
232
|
+
fi
|
|
221
233
|
|
|
222
|
-
|
|
223
|
-
|
|
234
|
+
# Send instruction to codex TUI
|
|
235
|
+
sleep 3
|
|
236
|
+
local worker_instruction="Read and execute the instructions in $prompt_file"
|
|
237
|
+
paste_to_pane "$pane_id" "$worker_instruction"
|
|
224
238
|
tmux send-keys -t "$pane_id" C-m
|
|
225
|
-
log_debug "Worker codex
|
|
226
|
-
|
|
239
|
+
log_debug "Worker codex instruction sent (${#worker_instruction} chars)"
|
|
240
|
+
|
|
241
|
+
# Submit loop — verify codex started working
|
|
242
|
+
local submit_attempts=0
|
|
243
|
+
while (( submit_attempts < 15 )); do
|
|
244
|
+
sleep 2
|
|
245
|
+
local pane_check
|
|
246
|
+
pane_check=$(tmux capture-pane -t "$pane_id" -p 2>/dev/null)
|
|
247
|
+
if echo "$pane_check" | grep -qi "working\|thinking\|Exploring\|Running\|reading\|searching\|editing\|writing" 2>/dev/null; then
|
|
248
|
+
log_debug "Worker codex started working after $((submit_attempts + 1)) checks"
|
|
249
|
+
break
|
|
250
|
+
fi
|
|
251
|
+
if (( submit_attempts == 8 )); then
|
|
252
|
+
log_debug "Adaptive instruction retry: clearing line and re-typing"
|
|
253
|
+
tmux send-keys -t "$pane_id" C-u 2>/dev/null
|
|
254
|
+
sleep 0.1
|
|
255
|
+
paste_to_pane "$pane_id" "$worker_instruction"
|
|
256
|
+
tmux send-keys -t "$pane_id" C-m
|
|
257
|
+
fi
|
|
258
|
+
tmux send-keys -t "$pane_id" C-m 2>/dev/null
|
|
259
|
+
sleep 0.3
|
|
260
|
+
tmux send-keys -t "$pane_id" C-m 2>/dev/null
|
|
261
|
+
(( submit_attempts++ ))
|
|
262
|
+
done
|
|
227
263
|
return 0
|
|
228
264
|
}
|
|
229
265
|
|
|
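The submit loop in `launch_worker_codex` above is a bounded poll with one mid-way retry. The sketch below isolates that pattern with stubs in place of the tmux calls; `pane_shows_activity`, `resend_instruction`, and `wait_for_submission` are hypothetical names. One deliberate difference: this version returns non-zero when all attempts are exhausted, whereas the hunk above falls through to `return 0`.

```shell
CHECKS=0
# Stub for "capture-pane + grep": reports activity on the third check.
pane_shows_activity() {
  CHECKS=$((CHECKS + 1))
  [ "$CHECKS" -ge 3 ]
}
# Stub for "clear the line and re-paste the instruction".
resend_instruction() { RESENT=1; }

# Args: $1=max attempts  $2=attempt index that triggers one re-send  $3=poll interval (s)
wait_for_submission() {
  local max_attempts="$1"
  local retry_at="$2"
  local interval="$3"
  local attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    sleep "$interval"
    if pane_shows_activity; then
      return 0
    fi
    if [ "$attempt" -eq "$retry_at" ]; then
      resend_instruction
    fi
    attempt=$((attempt + 1))
  done
  return 1   # timed out; the caller decides how to react
}

wait_for_submission 15 8 0 && echo "submitted after $CHECKS checks"
```

Using `attempt=$((attempt + 1))` instead of `(( attempt++ ))` also sidesteps the non-zero exit status that the arithmetic command yields when the pre-increment value is 0, which matters if the script ever runs with `set -e`.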
@@ -308,19 +344,53 @@ launch_worker_claude() {
|
|
|
308
344
|
return 0
|
|
309
345
|
}
|
|
310
346
|
|
|
311
|
-
# launch_verifier_codex() — launch codex Verifier
|
|
347
|
+
# launch_verifier_codex() — launch codex Verifier TUI, send instruction, verify submission
|
|
348
|
+
# Matches launch_verifier_claude() pattern for consistent tmux-visible execution.
|
|
312
349
|
# Args: $1=pane_id $2=prompt_file $3=iteration $4=launch_cmd
|
|
313
|
-
# Returns: 0
|
|
350
|
+
# Returns: 0 on success
|
|
314
351
|
launch_verifier_codex() {
|
|
315
352
|
local pane_id="$1"
|
|
316
353
|
local prompt_file="$2"
|
|
317
354
|
local iter="$3"
|
|
318
355
|
local verifier_launch="$4"
|
|
319
356
|
|
|
320
|
-
log " Launching Verifier codex in pane $pane_id..."
|
|
357
|
+
log " Launching Verifier codex TUI in pane $pane_id..."
|
|
321
358
|
paste_to_pane "$pane_id" "$verifier_launch"
|
|
322
359
|
tmux send-keys -t "$pane_id" C-m
|
|
360
|
+
|
|
361
|
+
if ! wait_for_pane_ready "$pane_id" 30; then
|
|
362
|
+
log_error "Verifier codex failed to start"
|
|
363
|
+
return 1
|
|
364
|
+
fi
|
|
365
|
+
|
|
323
366
|
sleep 3
|
|
367
|
+
local verifier_instruction="Read and execute the instructions in $prompt_file"
|
|
368
|
+
paste_to_pane "$pane_id" "$verifier_instruction"
|
|
369
|
+
tmux send-keys -t "$pane_id" C-m
|
|
370
|
+
log_debug "Verifier codex instruction sent"
|
|
371
|
+
|
|
372
|
+
# Submit loop — verify codex started working
|
|
373
|
+
local submit_attempts=0
|
|
374
|
+
while (( submit_attempts < 15 )); do
|
|
375
|
+
sleep 2
|
|
376
|
+
local vs_check
|
|
377
|
+
vs_check=$(tmux capture-pane -t "$pane_id" -p 2>/dev/null)
|
|
378
|
+
if echo "$vs_check" | grep -qi "working\|thinking\|Exploring\|Running\|reading\|searching\|editing\|writing" 2>/dev/null; then
|
|
379
|
+
log_debug "Verifier codex started working after $((submit_attempts + 1)) checks"
|
|
380
|
+
break
|
|
381
|
+
fi
|
|
382
|
+
if (( submit_attempts == 8 )); then
|
|
383
|
+
log_debug "Adaptive instruction retry: clearing line and re-typing"
|
|
384
|
+
tmux send-keys -t "$pane_id" C-u 2>/dev/null
|
|
385
|
+
sleep 0.1
|
|
386
|
+
paste_to_pane "$pane_id" "$verifier_instruction"
|
|
387
|
+
tmux send-keys -t "$pane_id" C-m
|
|
388
|
+
fi
|
|
389
|
+
tmux send-keys -t "$pane_id" C-m 2>/dev/null
|
|
390
|
+
sleep 0.3
|
|
391
|
+
tmux send-keys -t "$pane_id" C-m 2>/dev/null
|
|
392
|
+
(( submit_attempts++ ))
|
|
393
|
+
done
|
|
324
394
|
return 0
|
|
325
395
|
}
|
|
326
396
|
|
|
@@ -386,7 +456,7 @@ handle_worker_exit_codex() {
|
|
|
386
456
|
local dc_us_id
|
|
387
457
|
dc_us_id=$(jq -r '.us_id // "unknown"' "$DONE_CLAIM_FILE" 2>/dev/null)
|
|
388
458
|
log " Codex worker completed with done-claim (us_id=$dc_us_id). Auto-generating signal."
|
|
389
|
-
echo '{"iteration":'"$iter"',"status":"verify","us_id":"'"$dc_us_id"'","summary":"auto-generated after codex
|
|
459
|
+
echo '{"iteration":'"$iter"',"status":"verify","us_id":"'"$dc_us_id"'","summary":"auto-generated after codex exit","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"}' > "$signal_file"
|
|
390
460
|
else
|
|
391
461
|
log " WARNING: Codex worker exited without done-claim. Generating verify signal for current US."
|
|
392
462
|
local current_us
|
|
@@ -394,7 +464,7 @@ handle_worker_exit_codex() {
|
|
|
394
464
|
local mem_us
|
|
395
465
|
mem_us=$(sed -n 's/.*Next.*US-\([0-9]*\).*/US-\1/p' "$DESK/memos/${SLUG}-memory.md" 2>/dev/null | head -1)
|
|
396
466
|
[[ -n "$mem_us" ]] && current_us="$mem_us"
|
|
397
|
-
echo '{"iteration":'"$iter"',"status":"verify","us_id":"'"$current_us"'","summary":"auto-generated after codex
|
|
467
|
+
echo '{"iteration":'"$iter"',"status":"verify","us_id":"'"$current_us"'","summary":"auto-generated after codex exit (no done-claim)","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"}' > "$signal_file"
|
|
398
468
|
fi
|
|
399
469
|
return 0
|
|
400
470
|
}
|
|
@@ -952,7 +1022,7 @@ restart_worker() {
|
|
|
952
1022
|
if [[ "$WORKER_ENGINE" = "codex" ]]; then
|
|
953
1023
|
safe_send_keys "$pane_id" "${CODEX_BIN:-codex} -m $WORKER_CODEX_MODEL -c model_reasoning_effort=\"$WORKER_CODEX_REASONING\" --dangerously-bypass-approvals-and-sandbox"
|
|
954
1024
|
else
|
|
955
|
-
safe_send_keys "$pane_id" "$
|
|
1025
|
+
safe_send_keys "$pane_id" "$(build_claude_cmd tui "$WORKER_MODEL")"
|
|
956
1026
|
fi
|
|
957
1027
|
WORKER_RESTARTS[$iter]=$((restart_count + 1))
|
|
958
1028
|
return 0
|
|
@@ -1068,30 +1138,35 @@ write_worker_trigger() {
|
|
|
1068
1138
|
elif [[ "$VERIFY_MODE" = "batch" ]]; then
|
|
1069
1139
|
echo ""
|
|
1070
1140
|
echo "---"
|
|
1071
|
-
|
|
1072
|
-
|
|
1073
|
-
|
|
1074
|
-
|
|
1075
|
-
|
|
1141
|
+
if [[ -n "$VERIFIED_US" ]]; then
|
|
1142
|
+
echo "## BATCH MODE — CONTINUE FROM PARTIAL PROGRESS"
|
|
1143
|
+
echo "The following US have already been verified: **$VERIFIED_US**"
|
|
1144
|
+
echo "- Do NOT re-implement these — they are done."
|
|
1145
|
+
echo "- Focus ONLY on the remaining unverified user stories."
|
|
1146
|
+
echo '- Signal verify with us_id="ALL" when the remaining stories are complete.'
|
|
1147
|
+
else
|
|
1148
|
+
echo "## BATCH MODE OVERRIDE"
|
|
1149
|
+
echo "Ignore any per-US signal instructions above. In batch mode:"
|
|
1150
|
+
echo "- Implement ALL user stories in this iteration"
|
|
1151
|
+
echo '- Signal verify with us_id="ALL" only when ALL stories are complete'
|
|
1152
|
+
echo "- Do NOT signal verify after individual stories"
|
|
1153
|
+
fi
|
|
1076
1154
|
fi
|
|
1077
1155
|
} | atomic_write "$prompt_file"
|
|
1078
1156
|
|
|
1079
1157
|
# Write trigger script (DO NOT use exec -- breaks heartbeat cleanup)
|
|
1080
1158
|
# Engine-specific launch command (expanded at write time)
|
|
1081
1159
|
if [[ "$WORKER_ENGINE" = "codex" ]]; then
|
|
1082
|
-
local engine_cmd="${CODEX_BIN:-codex}
|
|
1160
|
+
local engine_cmd="${CODEX_BIN:-codex} \\
|
|
1083
1161
|
-m $WORKER_CODEX_MODEL \\
|
|
1084
1162
|
-c model_reasoning_effort=\"$WORKER_CODEX_REASONING\" \\
|
|
1085
1163
|
--dangerously-bypass-approvals-and-sandbox \\
|
|
1086
1164
|
\"\$(cat $prompt_file)\""
|
|
1087
|
-
local engine_comment="# Run codex
|
|
1165
|
+
local engine_comment="# Run codex with fresh context (fallback trigger — TUI primary launch via launch_worker_codex)"
|
|
1088
1166
|
else
|
|
1089
|
-
local engine_cmd
|
|
1090
|
-
|
|
1091
|
-
|
|
1092
|
-
--output-format text \\
|
|
1093
|
-
2>&1 | tee $output_log"
|
|
1094
|
-
local engine_comment="# Run claude with fresh context (governance.md s7 step 5)"
|
|
1167
|
+
local engine_cmd
|
|
1168
|
+
engine_cmd=$(build_claude_cmd print "$WORKER_MODEL" "$prompt_file" "$output_log")
|
|
1169
|
+
local engine_comment="# Run claude with fresh context, no MCP/skills (governance.md s7 step 5)"
|
|
1095
1170
|
fi
|
|
1096
1171
|
|
|
1097
1172
|
{
|
|
@@ -1152,13 +1227,15 @@ write_verifier_trigger() {
|
|
|
1152
1227
|
echo "- **Iteration**: $iter"
|
|
1153
1228
|
echo "- **Done Claim**: $DONE_CLAIM_FILE"
|
|
1154
1229
|
echo "- **Verify Mode**: $VERIFY_MODE"
|
|
1155
|
-
if [[
|
|
1230
|
+
if [[ -n "$us_id" ]]; then
|
|
1156
1231
|
if [[ "$us_id" = "ALL" ]]; then
|
|
1157
|
-
echo "- **Scope**:
|
|
1158
|
-
echo "- **Previously verified US**: $VERIFIED_US"
|
|
1232
|
+
echo "- **Scope**: FULL VERIFY — check ALL acceptance criteria from the PRD"
|
|
1159
1233
|
else
|
|
1160
1234
|
echo "- **Scope**: Verify ONLY the acceptance criteria for **${us_id}**"
|
|
1235
|
+
fi
|
|
1236
|
+
if [[ -n "$VERIFIED_US" ]]; then
|
|
1161
1237
|
echo "- **Previously verified US**: $VERIFIED_US"
|
|
1238
|
+
echo "- **Note**: Skip re-verifying the above US. Focus on unverified stories."
|
|
1162
1239
|
fi
|
|
1163
1240
|
fi
|
|
1164
1241
|
} | atomic_write "$prompt_file"
|
|
@@ -1173,12 +1250,9 @@ write_verifier_trigger() {
|
|
|
1173
1250
|
2>&1 | tee $output_log"
|
|
1174
1251
|
local engine_comment="# Run codex with fresh context (governance.md s7 step 7)"
|
|
1175
1252
|
else
|
|
1176
|
-
local engine_cmd
|
|
1177
|
-
|
|
1178
|
-
|
|
1179
|
-
--output-format text \\
|
|
1180
|
-
2>&1 | tee $output_log"
|
|
1181
|
-
local engine_comment="# Run claude with fresh context (governance.md s7 step 7)"
|
|
1253
|
+
local engine_cmd
|
|
1254
|
+
engine_cmd=$(build_claude_cmd print "$verifier_model" "$prompt_file" "$output_log")
|
|
1255
|
+
local engine_comment="# Run claude with fresh context, no MCP/skills (governance.md s7 step 7)"
|
|
1182
1256
|
fi
|
|
1183
1257
|
|
|
1184
1258
|
{
|
|
@@ -1577,11 +1651,11 @@ run_single_verifier() {
|
|
|
1577
1651
|
# Launch verifier — dispatch to engine-specific function
|
|
1578
1652
|
local verifier_launch
|
|
1579
1653
|
if [[ "$engine" = "codex" ]]; then
|
|
1580
|
-
verifier_launch="${CODEX_BIN:-codex}
|
|
1654
|
+
verifier_launch="${CODEX_BIN:-codex} -m $VERIFIER_CODEX_MODEL -c model_reasoning_effort=\"$VERIFIER_CODEX_REASONING\" --dangerously-bypass-approvals-and-sandbox"
|
|
1581
1655
|
launch_verifier_codex "$VERIFIER_PANE" "$prompt_file" "$iter" "$verifier_launch"
|
|
1582
|
-
log_debug "Verifier$suffix codex
|
|
1656
|
+
log_debug "Verifier$suffix codex TUI dispatched"
|
|
1583
1657
|
else
|
|
1584
|
-
verifier_launch="$
|
|
1658
|
+
verifier_launch="$(build_claude_cmd tui "$model")"
|
|
1585
1659
|
if ! launch_verifier_claude "$VERIFIER_PANE" "$prompt_file" "$iter" "$verifier_launch"; then
|
|
1586
1660
|
log_error "Verifier$suffix failed to start"
|
|
1587
1661
|
return 1
|
|
@@ -1592,7 +1666,7 @@ run_single_verifier() {
|
|
|
1592
1666
|
# Poll for verdict
|
|
1593
1667
|
if [[ "$engine" = "codex" ]]; then
|
|
1594
1668
|
# Codex exec: simple file poll (non-interactive, no heartbeat/nudge needed)
|
|
1595
|
-
log " Polling for verify-verdict.json ($suffix, codex
|
|
1669
|
+
log " Polling for verify-verdict.json ($suffix, codex TUI)..."
|
|
1596
1670
|
local codex_poll_start
|
|
1597
1671
|
codex_poll_start=$(date +%s)
|
|
1598
1672
|
while true; do
|
|
@@ -1666,7 +1740,7 @@ run_sequential_final_verify() {
|
|
|
1666
1740
|
verifier_launch="${CODEX_BIN:-codex} -m $VERIFIER_CODEX_MODEL -c model_reasoning_effort=\"$VERIFIER_CODEX_REASONING\" --dangerously-bypass-approvals-and-sandbox"
|
|
1667
1741
|
launch_verifier_codex "$VERIFIER_PANE" "$verifier_prompt" "$iter" "$verifier_launch"
|
|
1668
1742
|
else
|
|
1669
|
-
verifier_launch="$
|
|
1743
|
+
verifier_launch="$(build_claude_cmd tui "$VERIFIER_MODEL")"
|
|
1670
1744
|
launch_verifier_claude "$VERIFIER_PANE" "$verifier_prompt" "$iter" "$verifier_launch" || {
|
|
1671
1745
|
log_error "Failed to launch verifier for $us"
|
|
1672
1746
|
FAILED_US="$us"
|
|
@@ -1936,7 +2010,7 @@ main() {
|
|
|
1936
2010
|
--arg verifier_model "$VERIFIER_MODEL" \
|
|
1937
2011
|
--argjson debug "$DEBUG" \
|
|
1938
2012
|
--argjson with_sv "$WITH_SELF_VERIFICATION" \
|
|
1939
|
-
--argjson consensus "$VERIFY_CONSENSUS" \
|
|
2013
|
+
--argjson consensus "${VERIFY_CONSENSUS:-0}" \
|
|
1940
2014
|
'{slug: $slug, project_root: $project_root, project_name: $project_name, campaign_status: $campaign_status, start_time: $start_time, end_time: $end_time, worker_model: $worker_model, verifier_model: $verifier_model, debug: $debug, with_self_verification: $with_sv, consensus: $consensus}' \
|
|
1941
2015
|
> "$METADATA_FILE"
|
|
1942
2016
|
|
|
@@ -1980,7 +2054,7 @@ main() {
|
|
|
1980
2054
|
log_debug "[OPTION] expected_flow=worker(all)->verify(ALL)->COMPLETE"
|
|
1981
2055
|
fi
|
|
1982
2056
|
|
|
1983
|
-
if [[ "$VERIFY_CONSENSUS" = "1" ]]; then
|
|
2057
|
+
if [[ "${VERIFY_CONSENSUS:-0}" = "1" ]]; then
|
|
1984
2058
|
log_debug "[OPTION] consensus_flow=each_verify_runs_claude+codex_both_must_pass"
|
|
1985
2059
|
fi
|
|
1986
2060
|
fi
|
|
@@ -2104,11 +2178,14 @@ main() {
|
|
|
2104
2178
|
|
|
2105
2179
|
local worker_launch
|
|
2106
2180
|
if [[ "$WORKER_ENGINE" = "codex" ]]; then
|
|
2107
|
-
|
|
2108
|
-
|
|
2109
|
-
|
|
2181
|
+
worker_launch="${CODEX_BIN:-codex} -m $WORKER_CODEX_MODEL -c model_reasoning_effort=\"$WORKER_CODEX_REASONING\" --dangerously-bypass-approvals-and-sandbox"
|
|
2182
|
+
if ! launch_worker_codex "$WORKER_PANE" "$worker_prompt" "$ITERATION" "$worker_launch"; then
|
|
2183
|
+
write_blocked_sentinel "Worker codex failed to start in pane"
|
|
2184
|
+
update_status "blocked" "worker_start_failed"
|
|
2185
|
+
return 1
|
|
2186
|
+
fi
|
|
2110
2187
|
else
|
|
2111
|
-
worker_launch="$
|
|
2188
|
+
worker_launch="$(build_claude_cmd tui "$WORKER_MODEL")"
|
|
2112
2189
|
if ! launch_worker_claude "$WORKER_PANE" "$worker_prompt" "$ITERATION" "$worker_launch"; then
|
|
2113
2190
|
write_blocked_sentinel "Worker claude failed to start in pane"
|
|
2114
2191
|
update_status "blocked" "worker_start_failed"
|
|
@@ -2285,7 +2362,7 @@ main() {
|
|
|
2285
2362
|
if [[ "$VERIFIER_ENGINE" = "codex" ]]; then
|
|
2286
2363
|
verifier_launch="${CODEX_BIN:-codex} -m $VERIFIER_CODEX_MODEL -c model_reasoning_effort=\"$VERIFIER_CODEX_REASONING\" --dangerously-bypass-approvals-and-sandbox"
|
|
2287
2364
|
else
|
|
2288
|
-
verifier_launch="$
|
|
2365
|
+
verifier_launch="$(build_claude_cmd tui "$VERIFIER_MODEL")"
|
|
2289
2366
|
fi
|
|
2290
2367
|
log_debug "[FLOW] iter=$ITERATION phase=verifier engine=$VERIFIER_ENGINE model=$VERIFIER_MODEL scope=${signal_us_id:-all} dispatched=true"
|
|
2291
2368
|
|
|
@@ -2346,8 +2423,8 @@ main() {
|
|
|
2346
2423
|
_MODEL_UPGRADED=0
|
|
2347
2424
|
fi
|
|
2348
2425
|
|
|
2349
|
-
# ---
|
|
2350
|
-
if [[
|
|
2426
|
+
# --- Verified US tracking (both per-us and batch modes) ---
|
|
2427
|
+
if [[ -n "$signal_us_id" && "$signal_us_id" != "ALL" ]]; then
|
|
2351
2428
|
# Add this US to verified list
|
|
2352
2429
|
if [[ -n "$VERIFIED_US" ]]; then
|
|
2353
2430
|
VERIFIED_US="${VERIFIED_US},${signal_us_id}"
|
|
@@ -2371,6 +2448,32 @@ main() {
|
|
|
2371
2448
|
;;
|
|
2372
2449
|
fail)
|
|
2373
2450
|
# --- governance.md s7½: Fix Loop (adapted for tmux lean mode) ---
|
|
2451
|
+
|
|
2452
|
+
# Parse per_us_results from verdict to track partial progress (batch + per-us)
|
|
2453
|
+
local _prev_verified="$VERIFIED_US"
|
|
2454
|
+
if jq -e '.per_us_results' "$VERDICT_FILE" &>/dev/null; then
|
|
2455
|
+
local _newly_passed
|
|
2456
|
+
_newly_passed=$(jq -r '.per_us_results | to_entries[] | select(.value == "pass") | .key' "$VERDICT_FILE" 2>/dev/null)
|
|
2457
|
+
for _pus in $(echo "$_newly_passed"); do
|
|
2458
|
+
if ! echo ",$VERIFIED_US," | grep -q ",$_pus,"; then
|
|
2459
|
+
if [[ -n "$VERIFIED_US" ]]; then
|
|
2460
|
+
VERIFIED_US="${VERIFIED_US},${_pus}"
|
|
2461
|
+
else
|
|
2462
|
+
VERIFIED_US="$_pus"
|
|
2463
|
+
fi
|
|
2464
|
+
log " Partial progress: $_pus passed (overall FAIL). Verified so far: $VERIFIED_US"
|
|
2465
|
+
fi
|
|
2466
|
+
done
|
|
2467
|
+
log_debug "[FLOW] iter=$ITERATION partial_progress prev=$_prev_verified now=$VERIFIED_US"
|
|
2468
|
+
fi
|
|
2469
|
+
|
|
2470
|
+
# Partial progress resets consecutive failures (progress was made)
|
|
2471
|
+
if [[ "$VERIFIED_US" != "$_prev_verified" ]]; then
|
|
2472
|
+
CONSECUTIVE_FAILURES=0
|
|
2473
|
+
log " Progress detected — consecutive_failures reset to 0"
|
|
2474
|
+
log_debug "[GOV] iter=$ITERATION consecutive_failures_reset=partial_progress"
|
|
2475
|
+
fi
|
|
2476
|
+
|
|
2374
2477
|
(( CONSECUTIVE_FAILURES++ ))
|
|
2375
2478
|
record_us_failure "${signal_us_id:-unknown}"
|
|
2376
2479
|
check_model_upgrade "${signal_us_id:-unknown}"
|
|
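The partial-progress block above appends each newly passed US to the comma-separated `VERIFIED_US` list unless it is already present. A minimal sketch of that append-with-dedupe step, using a `case` match instead of `echo | grep`; `add_verified` is a hypothetical helper name and the US IDs are sample values:

```shell
VERIFIED_US="US-001"

add_verified() {
  local us="$1"
  # Wrap both sides in commas so US-1 cannot match inside US-10.
  case ",$VERIFIED_US," in
    *",$us,"*) return 0 ;;   # already recorded
  esac
  if [ -n "$VERIFIED_US" ]; then
    VERIFIED_US="$VERIFIED_US,$us"
  else
    VERIFIED_US="$us"
  fi
}

for us in US-001 US-003; do
  add_verified "$us"
done
echo "$VERIFIED_US"
```

Note that in the hunk above the reset of `CONSECUTIVE_FAILURES` is followed by the unconditional `(( CONSECUTIVE_FAILURES++ ))`, so after a FAIL with partial progress the counter reads 1 rather than 0.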
@@ -2389,6 +2492,13 @@ main() {
|
|
|
2389
2492
|
{
|
|
2390
2493
|
echo "# Fix Contract (from Verifier iteration $ITERATION)"
|
|
2391
2494
|
echo ""
|
|
2495
|
+
if [[ -n "$VERIFIED_US" ]]; then
|
|
2496
|
+
echo "## Verified US (do NOT re-implement these)"
|
|
2497
|
+
echo "$VERIFIED_US" | tr ',' '\n' | sed 's/^/- /'
|
|
2498
|
+
echo ""
|
|
2499
|
+
echo "**Focus ONLY on unverified user stories. The above are already verified.**"
|
|
2500
|
+
echo ""
|
|
2501
|
+
fi
|
|
2392
2502
|
echo "## Summary"
|
|
2393
2503
|
echo "$verdict_summary_fail"
|
|
2394
2504
|
echo ""
|