@ai-dev-methodologies/rlp-desk 0.5.3 → 0.6.0

@@ -0,0 +1,163 @@
1
+ # Plan: rlp-desk Batch Mode + Operational Context Improvements
2
+
3
+ ## Context
4
+
5
+ A real campaign (`prod-local-parity`, spark:high) exposed two structural problems:
6
+
7
+ 1. **Endless FAIL in batch mode**: With 5 or more US, the Worker completes only some of them → the Verifier verifies everything → FAIL → progress is ignored → CB BLOCKED. `VERIFIED_US` tracking exists only in per-us mode, not in batch mode.
8
+
9
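+ 2. **No support for server projects**: After modifying code, the Worker does not restart the server, does not know the server port, and runs no health check. This is not the spark model's fault but a **design flaw: rlp-desk does not carry operational context into the brainstorm/prompt**.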
10
+
11
+ ---
12
+
13
+ ## P0: Batch Mode Partial Progress Tracking
14
+
15
+ ### Files to Modify
16
+ - `src/scripts/run_ralph_desk.zsh`
17
+ - `src/commands/rlp-desk.md` (agent mode ⑦c)
18
+
19
+ ### Changes
20
+
21
+ #### 1. Track VERIFIED_US in Batch Mode Too (run_ralph_desk.zsh)
22
+ - PASS verdict handling (L2423): remove the `per-us` condition → in batch mode too, add `signal_us_id` to `VERIFIED_US` when it is an individual US
23
+ - FAIL verdict handling (L2445): parse `per_us_results` from the verdict JSON → add each US with `met=true` to `VERIFIED_US`
24
+ - status.json update: record the `verified_us` array in batch mode as well
25
+
26
+ #### 2. Pass VERIFIED_US to the Verifier Prompt (run_ralph_desk.zsh L1225-1232)
27
+ - Change the `if [[ "$VERIFY_MODE" = "per-us"` condition → `if [[ -n "$VERIFIED_US"`
28
+ - Also instruct the batch-mode verifier to "skip US that are already verified"
29
+
30
+ #### 3. Fix Contract Scope Narrowing (run_ralph_desk.zsh L2461-2473)
31
+ - On FAIL: extract the US that passed from the verdict → add "US-001~004 verified. Continue from US-005." to the fix contract
32
+ - When assembling the Worker prompt, consult `VERIFIED_US` and pass the narrowed scope
33
+
34
+ #### 4. Partial Reset of consecutive_failures (run_ralph_desk.zsh L2447)
35
+ - If any US newly passed (`VERIFIED_US` grew) → reset `CONSECUTIVE_FAILURES=0`
36
+ - If the state is unchanged with no progress → increment as before
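
The tracking and reset rules in items 1 to 4 can be sketched in plain shell. This is a minimal illustration under stated assumptions, not the code in `run_ralph_desk.zsh`: `merge_verified` and `next_failure_count` are hypothetical helper names, and the sketch assumes `VERIFIED_US` is a comma-separated list of US IDs.

```shell
# Hypothetical helper: merge newly passed US IDs into a comma-separated list.
# $1 = current list (may be empty); remaining args = US IDs reported as passed.
merge_verified() {
  merged="$1"
  shift
  for us in "$@"; do
    case ",$merged," in
      *",$us,"*) ;;                            # already tracked, skip
      *) merged="${merged:+$merged,}$us" ;;    # append, adding a comma if non-empty
    esac
  done
  printf '%s\n' "$merged"
}

# Hypothetical helper: consecutive-failure count after a FAIL verdict.
# $1 = list before the verdict, $2 = list after it, $3 = current counter.
next_failure_count() {
  if [ "$1" != "$2" ]; then
    echo 0                 # the verified list grew: reset
  else
    echo $(( $3 + 1 ))     # no progress: count one more failure
  fi
}
```

For example, `merge_verified "US-001" US-001 US-003` prints `US-001,US-003`, and comparing the list before and after the merge is enough to decide whether the failure counter resets.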
37
+
38
+ #### 5. Make per_us_results Mandatory in the Verifier Verdict
39
+ - Add the output format to the Verifier prompt template (init_ralph_desk.zsh L384-474):
40
+ ```json
41
+ {
42
+ "verdict": "fail",
43
+ "per_us_results": { "US-001": "pass", "US-005": "fail" },
44
+ "issues": [...]
45
+ }
46
+ ```
47
+ - Instruct both batch and per-us modes to include per_us_results
48
+
49
+ ---
50
+
51
+ ## P1: Brainstorm Operational Context + Worker System Prompt
52
+
53
+ ### Files to Modify
54
+ - `src/commands/rlp-desk.md` (brainstorm section)
55
+ - `src/scripts/init_ralph_desk.zsh` (Worker/Verifier prompt template)
56
+
57
+ ### Changes
58
+
59
+ #### 1. Brainstorm: Collect Operational Context (rlp-desk.md L24-93)
60
+ The brainstorm currently collects 11 items; **add a 12th item**:
61
+
62
+ ```
63
+ 12. **Operational Context** (if applicable):
64
+ - Does this project require a running server/service? (y/n)
65
+ - Server start command (e.g., `npm run dev`, `python manage.py runserver`)
66
+ - Server port (e.g., 7001)
67
+ - Health check URL (e.g., `http://localhost:7001/health`)
68
+ - Other runtime dependencies (e.g., database, Redis)
69
+ ```
70
+
71
+ In the project directory, the brainstorm auto-detects `scripts.dev`/`scripts.start` in `package.json`, as well as `Makefile`, `docker-compose.yml`, and similar files, and suggests values based on them.
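
The file-based detection described here could look roughly like the following. This is a sketch only: `detect_operational_hints` is a hypothetical name, and it merely reports which candidate files exist; reading `scripts.dev`/`scripts.start` out of `package.json` would need a JSON tool and is left out.

```shell
# Hypothetical helper: list operational-context hint files found in a project root.
# $1 = project root (defaults to the current directory).
detect_operational_hints() {
  root="${1:-.}"
  for f in package.json Makefile docker-compose.yml; do
    if [ -f "$root/$f" ]; then
      echo "$f"    # one detected file name per line
    fi
  done
}
```

An empty result would mean there is nothing to suggest, so the brainstorm can fall back to asking the user directly.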
72
+
73
+ #### 2. Brainstorm: Guidance to Include Operational Steps When Generating US
74
+ Add to the US/AC writing guide (rlp-desk.md L26-38):
75
+
76
+ ```
77
+ - If the project has operational context (server, DB, etc.):
78
+ - Each US that modifies server code MUST include AC:
79
+ "Given server is running, When code is modified, Then server is restarted and responds on health check URL"
80
+ - Do NOT assume Worker will restart server on its own — spell it out in AC
81
+ ```
82
+
83
+ #### 3. Init: Inject Operational Rules into the Worker Prompt (init_ralph_desk.zsh L285-380)
84
+ Inject the operational context collected during brainstorm into the Worker prompt template:
85
+
86
+ ```markdown
87
+ ## Operational Context
88
+ - **Server Command**: `npm run dev`
89
+ - **Server Port**: 7001
90
+ - **Health Check**: `http://localhost:7001/health`
91
+
92
+ ### Operational Rules (always apply)
93
+ - After modifying server/application code, restart the server: `[server_cmd]`
94
+ - Before signaling done, verify server responds: `curl -s [health_url] || fail`
95
+ - Do NOT modify dependency files (package.json, requirements.txt, etc.) unless the AC explicitly requires it
96
+ - Do NOT run package install commands (npm install, pip install, etc.) unless the AC explicitly requires it
97
+ ```
98
+
99
+ Omit this section for projects without operational context (code-only changes).
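
A minimal sketch of this conditional injection, assuming the three values arrive as plain strings. `render_operational_context` is a hypothetical name, and the health-check fallback to `http://localhost:<port>/` is an assumption of the sketch. Explicit `if`/`elif` keeps the health URL and the port fallback mutually exclusive.

```shell
# Hypothetical helper: print the Operational Context section, or nothing at all
# when neither a server command nor a port is given.
# $1 = server command, $2 = port, $3 = health check URL (any may be empty).
render_operational_context() {
  cmd="$1"; port="$2"; health="$3"
  if [ -z "$cmd" ] && [ -z "$port" ]; then
    return 0    # code-only project: emit no section
  fi
  echo "## Operational Context"
  if [ -n "$cmd" ]; then echo "- **Server Command**: \`$cmd\`"; fi
  if [ -n "$port" ]; then echo "- **Server Port**: $port"; fi
  if [ -n "$health" ]; then echo "- **Health Check**: \`$health\`"; fi
  echo ""
  echo "### Operational Rules (always apply)"
  if [ -n "$cmd" ]; then
    echo "- After modifying server/application code, restart the server: \`$cmd\`"
  fi
  # Prefer the explicit health URL; fall back to the port only when no URL is set.
  if [ -n "$health" ]; then
    echo "- Before signaling done, verify server responds: \`curl -sf $health\`"
  elif [ -n "$port" ]; then
    echo "- Before signaling done, verify server responds: \`curl -sf http://localhost:$port/\`"
  fi
}
```

Calling it with three empty strings prints nothing, which matches the rule that code-only projects get no Operational Context section.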
100
+
101
+ #### 4. Init: Add an Operational Check to the Verifier Prompt as Well
102
+ In the Verifier prompt template (init_ralph_desk.zsh L384-474):
103
+
104
+ ```markdown
105
+ ## Operational Verification (if server context provided)
106
+ - Verify server is running on expected port before checking ACs
107
+ - If server is down, verdict=FAIL with issue: "server not running"
108
+ ```
109
+
110
+ #### 5. --server-cmd / --server-port CLI Options (run_ralph_desk.zsh)
111
+ Init writes the values collected during brainstorm into the prompt, but they can also be overridden at run time:
112
+ - `--server-cmd "npm run dev"` → overrides the server command in the Worker prompt
113
+ - `--server-port 7001` → overrides the port in the Worker prompt
114
+ - At runtime, run a health check at the start of each iteration (optional, `--server-health-check` flag)
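
The optional per-iteration health check could be sketched as a small retry loop. The plan does not specify retry behavior, so the attempt count, the delay, and the name `wait_for_health` are all assumptions; the probe command is passed in so the loop itself stays independent of `curl`.

```shell
# Hypothetical helper: run a probe command until it succeeds or attempts run out.
# $1 = max attempts, $2 = delay in seconds, remaining args = probe command.
wait_for_health() {
  attempts="$1"; delay="$2"
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0             # probe succeeded: server is healthy
    fi
    i=$(( i + 1 ))
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"       # wait before the next attempt
    fi
  done
  return 1                 # never became healthy
}
```

A call such as `wait_for_health 10 1 curl -sf "http://localhost:7001/health"` would probe the example URL from the plan up to ten times, one second apart.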
115
+
116
+ ---
117
+
118
+ ## Verification Plan
119
+
120
+ ### P0 Tests
121
+ ```bash
122
+ # Batch partial progress unit test
123
+ zsh tests/test_batch_partial_progress.sh
124
+ # Scenario: batch FAIL verdict includes per_us_results → confirm VERIFIED_US is tracked
125
+ # Scenario: confirm consecutive_failures resets when a new US passes
126
+ # Scenario: confirm the verifier prompt includes VERIFIED_US (batch mode)
127
+ ```
128
+
129
+ ### P1 Tests
130
+ ```bash
131
+ # Operational context unit test
132
+ zsh tests/test_operational_context.sh
133
+ # Scenario: confirm --server-cmd option parsing
134
+ # Scenario: confirm operational rules are injected into the Worker prompt
135
+ # Scenario: confirm the section is omitted for projects without operational context
136
+ ```
137
+
138
+ ### Self-Verification (required by CLAUDE.md)
139
+ Run self-verification with 3 scenarios (LOW/MEDIUM/CRITICAL) against the changed src files.
140
+
141
+ ### E2E
142
+ Test with a real campaign:
143
+ 1. batch mode + 10 US → confirm partial progress tracking
144
+ 2. server project + spark:high → confirm the server restart is performed
145
+
146
+ ---
147
+
148
+ ## File Map
149
+
150
+ | File | P0 | P1 |
151
+ |------|----|----|
152
+ | `src/scripts/run_ralph_desk.zsh` | VERIFIED_US batch tracking, fix contract narrowing, CF reset | --server-cmd/port options |
153
+ | `src/scripts/lib_ralph_desk.zsh` | - | - |
154
+ | `src/scripts/init_ralph_desk.zsh` | - | Inject operational context into Worker/Verifier prompts |
155
+ | `src/commands/rlp-desk.md` | agent mode ⑦c batch logic | brainstorm item 12, US guide |
156
+ | `src/governance.md` | - | - |
157
+
158
+ ---
159
+
160
+ ## Scope / Non-Goals
161
+ - Per-model guardrails (a spark-specific deny list) → **out of scope**. Addressed through the brainstorm/prompt structure instead
162
+ - Removing batch mode entirely → **out of scope**. Batch mode is fixed so that it is usable
163
+ - auto-detect project type → limited to user confirmation during brainstorm plus file-based suggestions. Not full automation
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@ai-dev-methodologies/rlp-desk",
3
- "version": "0.5.3",
3
+ "version": "0.6.0",
4
4
  "description": "Fresh-context iterative loops for Claude Code — autonomous task completion with independent verification",
5
5
  "scripts": {
6
6
  "postinstall": "node scripts/postinstall.js",
@@ -91,6 +91,18 @@ Ask about these items one by one (or in small groups):
91
91
  9. **Verify Mode** — per-us (default) or batch. Ask: "Verify after each user story (per-us, recommended) or only after all stories are done (batch)?" Default recommendation: per-us for 2+ stories.
92
92
  10. **Consensus** — Ask: "Use cross-engine consensus? off (single engine), final-only (cross-engine on final verify only), or all (cross-engine on every verify). Requires codex CLI." Default: off. Recommended: final-only when codex is installed.
93
93
  11. **Max Iterations** — suggest based on story count, ask if OK.
94
+ 12. **Operational Context** — Auto-detect: scan project root for `package.json` (scripts.dev/start), `Makefile`, `docker-compose.yml`, `manage.py`. If detected, ask:
95
+ - "Does this project require a running server/service during development?" (y/n)
96
+ - If yes: "Server start command?" (pre-fill from detected scripts, e.g., `npm run dev`)
97
+ - "Server port?" (e.g., 7001)
98
+ - "Health check URL?" (e.g., `http://localhost:7001/health`) — optional
99
+ - Pass to init: `--server-cmd "CMD" --server-port PORT --server-health URL`
100
+ - If no server needed: skip. Init generates prompts without operational context.
101
+
102
+ **US generation guidance when server context is present:**
103
+ - Each US that modifies server/application code SHOULD include an AC or note:
104
+ "Given server is running, When code is modified, Then server is restarted and health check passes"
105
+ - Do NOT assume the Worker model will restart servers on its own — spell it out in the AC or rely on the operational rules injected by init.
94
106
 
95
107
  After all items are confirmed:
96
108
 
@@ -9,23 +9,23 @@ CB default: 6. Override: `--cb-threshold N`. Worker only — Verifier fixed at c
9
9
  - CB < table columns → BLOCKED at that column
10
10
  - CB > 6 → repeat ceiling model beyond column 6
11
11
 
12
- ## GPT Pro (spark — separate token limit)
12
+ ## GPT Pro (gpt-5.3-codex-spark — separate token limit)
13
13
 
14
14
  | Complexity | 1-2 | 3-4 | 5-6 | 7+ |
15
15
  |------------|-----|-----|-----|-----|
16
- | LOW | spark:low | spark:medium | spark:high | BLOCKED |
17
- | MEDIUM | spark:medium | spark:high | spark:xhigh | BLOCKED |
18
- | HIGH | spark:high | spark:xhigh | spark:xhigh | BLOCKED |
19
- | CRITICAL | spark:xhigh | spark:xhigh | spark:xhigh | BLOCKED |
16
+ | LOW | gpt-5.3-codex-spark:low | gpt-5.3-codex-spark:medium | gpt-5.3-codex-spark:high | BLOCKED |
17
+ | MEDIUM | gpt-5.3-codex-spark:medium | gpt-5.3-codex-spark:high | gpt-5.3-codex-spark:xhigh | BLOCKED |
18
+ | HIGH | gpt-5.3-codex-spark:high | gpt-5.3-codex-spark:xhigh | gpt-5.3-codex-spark:xhigh | BLOCKED |
19
+ | CRITICAL | gpt-5.3-codex-spark:xhigh | gpt-5.3-codex-spark:xhigh | gpt-5.3-codex-spark:xhigh | BLOCKED |
20
20
 
21
21
  ## Non-Pro (gpt-5.4)
22
22
 
23
23
  | Complexity | 1-2 | 3-4 | 5-6 | 7+ |
24
24
  |------------|-----|-----|-----|-----|
25
- | LOW | 5.4:low | 5.4:medium | 5.4:high | BLOCKED |
26
- | MEDIUM | 5.4:medium | 5.4:high | 5.4:xhigh | BLOCKED |
27
- | HIGH | 5.4:high | 5.4:xhigh | 5.4:xhigh | BLOCKED |
28
- | CRITICAL | 5.4:xhigh | 5.4:xhigh | 5.4:xhigh | BLOCKED |
25
+ | LOW | gpt-5.4:low | gpt-5.4:medium | gpt-5.4:high | BLOCKED |
26
+ | MEDIUM | gpt-5.4:medium | gpt-5.4:high | gpt-5.4:xhigh | BLOCKED |
27
+ | HIGH | gpt-5.4:high | gpt-5.4:xhigh | gpt-5.4:xhigh | BLOCKED |
28
+ | CRITICAL | gpt-5.4:xhigh | gpt-5.4:xhigh | gpt-5.4:xhigh | BLOCKED |
29
29
 
30
30
  ## Claude-only
31
31
 
@@ -11,11 +11,14 @@ set -euo pipefail
11
11
  # ~/.claude/ralph-desk/init_ralph_desk.zsh <slug> [objective] [--mode fresh|improve]
12
12
  # =============================================================================
13
13
 
14
- SLUG="${1:?Usage: $0 <slug> [objective] [--mode fresh|improve]}"
14
+ SLUG="${1:?Usage: $0 <slug> [objective] [--mode fresh|improve] [--server-cmd CMD] [--server-port PORT] [--server-health URL]}"
15
15
  MODE=""
16
16
  OBJECTIVE="TBD - fill in the objective"
17
+ SERVER_CMD=""
18
+ SERVER_PORT=""
19
+ SERVER_HEALTH=""
17
20
 
18
- # Parse remaining arguments: --mode fresh|improve + optional positional objective
21
+ # Parse remaining arguments
19
22
  shift
20
23
  while [[ $# -gt 0 ]]; do
21
24
  case "$1" in
@@ -27,6 +30,30 @@ while [[ $# -gt 0 ]]; do
27
30
  MODE="${1#--mode=}"
28
31
  shift
29
32
  ;;
33
+ --server-cmd)
34
+ SERVER_CMD="${2:?--server-cmd requires a command}"
35
+ shift 2
36
+ ;;
37
+ --server-cmd=*)
38
+ SERVER_CMD="${1#--server-cmd=}"
39
+ shift
40
+ ;;
41
+ --server-port)
42
+ SERVER_PORT="${2:?--server-port requires a port number}"
43
+ shift 2
44
+ ;;
45
+ --server-port=*)
46
+ SERVER_PORT="${1#--server-port=}"
47
+ shift
48
+ ;;
49
+ --server-health)
50
+ SERVER_HEALTH="${2:?--server-health requires a URL}"
51
+ shift 2
52
+ ;;
53
+ --server-health=*)
54
+ SERVER_HEALTH="${1#--server-health=}"
55
+ shift
56
+ ;;
30
57
  *)
31
58
  OBJECTIVE="$1"
32
59
  shift
@@ -378,6 +405,24 @@ execution_steps MUST be a JSON array of objects (not a dict with string keys). E
378
405
  ## Objective
379
406
  $OBJECTIVE
380
407
  EOF
408
+
409
+ # Inject operational context if server options provided
410
+ if [[ -n "$SERVER_CMD" || -n "$SERVER_PORT" ]]; then
411
+ cat >> "$F" <<OPCTX
412
+
413
+ ## Operational Context
414
+ $([ -n "$SERVER_CMD" ] && echo "- **Server Start Command**: \`$SERVER_CMD\`")
415
+ $([ -n "$SERVER_PORT" ] && echo "- **Server Port**: $SERVER_PORT")
416
+ $([ -n "$SERVER_HEALTH" ] && echo "- **Health Check URL**: $SERVER_HEALTH")
417
+
418
+ ### Operational Rules (always apply when server context is present)
419
+ - After modifying server/application code, restart the server$([ -n "$SERVER_CMD" ] && echo ": \`$SERVER_CMD\`")
420
+ - Before signaling done, verify the server responds$(if [ -n "$SERVER_HEALTH" ]; then echo ": \`curl -sf $SERVER_HEALTH\`"; elif [ -n "$SERVER_PORT" ]; then echo ": \`curl -sf http://localhost:$SERVER_PORT/\`"; fi)
421
+ - Do NOT modify dependency files (package.json, requirements.txt, etc.) unless the AC explicitly requires it
422
+ - Do NOT run package install commands (npm install, pip install, etc.) unless the AC explicitly requires it
423
+ OPCTX
424
+ fi
425
+
381
426
  echo " + $F"
382
427
  else echo " · $F"; fi
383
428
 
@@ -447,6 +492,7 @@ Verdict JSON:
447
492
  "us_id": "US-NNN or ALL (matches the scope you verified)",
448
493
  "verified_at_utc": "ISO timestamp",
449
494
  "summary": "...",
495
+ "per_us_results": {"US-001": "pass|fail|not_started", "US-002": "pass|fail|not_started"},
450
496
  "criteria_results": [{"criterion":"...","met":true/false,"evidence":"..."}],
451
497
  "missing_evidence": [],
452
498
  "issues": [{"id":"...","severity":"critical|major|minor","description":"...","fix_hint":"(suggestion, non-authoritative)"}],
@@ -471,7 +517,20 @@ Rules:
471
517
  - Deterministic checks (type hints, linting, security) delegate to test-spec tools; focus on AC verification + semantic review + smoke test.
472
518
  - Do NOT modify code or write sentinel files.
473
519
  - If Worker claims "inspection" or "review" for an AC that requires an automated command, verdict = FAIL.
520
+ - **ALWAYS include per_us_results** in verdict JSON — map each US to "pass", "fail", or "not_started". This is required for partial progress tracking in both batch and per-us modes.
474
521
  EOF
522
+
523
+ # Inject operational verification if server options provided
524
+ if [[ -n "$SERVER_CMD" || -n "$SERVER_PORT" ]]; then
525
+ cat >> "$F" <<OPVER
526
+
527
+ ## Operational Verification (server context present)
528
+ - Before verifying ACs, check that the server is running$([ -n "$SERVER_PORT" ] && echo " on port $SERVER_PORT")$([ -n "$SERVER_HEALTH" ] && echo ": \`curl -sf $SERVER_HEALTH\`")
529
+ - If the server is not running, verdict = FAIL with issue: "server not running on expected port"
530
+ - If Worker modified server code but did not restart the server, verdict = FAIL with issue: "server not restarted after code change"
531
+ OPVER
532
+ fi
533
+
475
534
  echo " + $F"
476
535
  else echo " · $F"; fi
477
536
 
@@ -31,7 +31,7 @@ log_error() {
31
31
 
32
32
  # parse_model_flag() — parse unified --worker-model / --verifier-model value
33
33
  # Colon format (model:reasoning) → codex engine; plain name → claude engine.
34
- # Spark alias: any model name containing "spark" is normalized to "spark".
34
+ # Spark alias: bare "spark" is expanded to full model ID "gpt-5.3-codex-spark".
35
35
  # Usage: parse_model_flag <value> <role>
36
36
  # Output (stdout): "engine model [reasoning]" e.g. "codex gpt-5.4 medium" | "claude sonnet"
37
37
  # Returns: 0 on success, 1 on invalid format (error written to stderr)
@@ -47,8 +47,8 @@ parse_model_flag() {
47
47
  if (( colon_count == 1 )); then
48
48
  local model="${value%%:*}"
49
49
  local reasoning="${value##*:}"
50
- if [[ "$model" == *"spark"* ]]; then
51
- model="spark"
50
+ if [[ "$model" == "spark" ]]; then
51
+ model="gpt-5.3-codex-spark"
52
52
  fi
53
53
  echo "codex $model $reasoning"
54
54
  else
@@ -76,7 +76,7 @@ get_model_string() {
76
76
  # get_next_model() — return next model in Worker upgrade path, or empty at ceiling
77
77
  # Usage: get_next_model <model_str>
78
78
  # claude: "haiku"|"sonnet"|"opus"
79
- # codex: "gpt-5.4:medium"|"gpt-5.4:high"|"gpt-5.4:xhigh"|"spark:medium"|...
79
+ # codex: "gpt-5.4:medium"|"gpt-5.4:high"|"gpt-5.4:xhigh"|"gpt-5.3-codex-spark:medium"|...
80
80
  # Output: next model string, or empty string if at ceiling
81
81
  get_next_model() {
82
82
  local current="$1"
@@ -85,16 +85,11 @@ get_next_model() {
85
85
  haiku) echo "sonnet" ;;
86
86
  sonnet) echo "opus" ;;
87
87
  opus) echo "" ;;
88
- # Codex GPT Pro upgrade path (short aliases)
89
- spark:low) echo "spark:medium" ;;
90
- spark:medium) echo "spark:high" ;;
91
- spark:high) echo "spark:xhigh" ;;
92
- spark:xhigh) echo "" ;; # spark ceiling
93
- # Codex GPT Pro upgrade path (full model names)
88
+ # Codex GPT Pro (spark) upgrade path
94
89
  gpt-5.3-codex-spark:low) echo "gpt-5.3-codex-spark:medium" ;;
95
90
  gpt-5.3-codex-spark:medium) echo "gpt-5.3-codex-spark:high" ;;
96
91
  gpt-5.3-codex-spark:high) echo "gpt-5.3-codex-spark:xhigh" ;;
97
- gpt-5.3-codex-spark:xhigh) echo "" ;; # spark ceiling (full name)
92
+ gpt-5.3-codex-spark:xhigh) echo "" ;; # spark ceiling
98
93
  # Codex Non-Pro upgrade path
99
94
  gpt-5.4:low) echo "gpt-5.4:medium" ;;
100
95
  gpt-5.4:medium) echo "gpt-5.4:high" ;;
@@ -58,10 +58,30 @@ IDLE_NUDGE_THRESHOLD="${IDLE_NUDGE_THRESHOLD:-30}"
58
58
  MAX_NUDGES="${MAX_NUDGES:-3}"
59
59
  WITH_SELF_VERIFICATION="${WITH_SELF_VERIFICATION:-0}"
60
60
 
61
- # --- Engine Selection ---
62
- WORKER_ENGINE="${WORKER_ENGINE:-claude}" # claude|codex
63
- VERIFIER_ENGINE="${VERIFIER_ENGINE:-claude}" # claude|codex
64
- FINAL_VERIFIER_ENGINE="${FINAL_VERIFIER_ENGINE:-claude}" # claude|codex (derived from FINAL_VERIFIER_MODEL)
61
+ # --- Engine Selection (auto-detect from model format: name=claude, name:reasoning=codex) ---
62
+ # If model contains ":", it's codex format — auto-set engine and split model/reasoning
63
+ _auto_detect_engine() {
64
+ local model_var="$1" engine_var="$2" codex_model_var="$3" codex_reasoning_var="$4"
65
+ local model_val="${(P)model_var}"
66
+ if [[ "$model_val" == *:* ]]; then
67
+ local model_part="${model_val%%:*}"
68
+ local reasoning_part="${model_val##*:}"
69
+ [[ "$model_part" == "spark" ]] && model_part="gpt-5.3-codex-spark"
70
+ eval "$engine_var=codex"
71
+ eval "$model_var=$model_part"
72
+ [[ -n "$codex_model_var" ]] && eval "$codex_model_var=$model_part"
73
+ [[ -n "$codex_reasoning_var" ]] && eval "$codex_reasoning_var=$reasoning_part"
74
+ fi
75
+ }
76
+
77
+ WORKER_ENGINE="${WORKER_ENGINE:-claude}"
78
+ VERIFIER_ENGINE="${VERIFIER_ENGINE:-claude}"
79
+ FINAL_VERIFIER_ENGINE="${FINAL_VERIFIER_ENGINE:-claude}"
80
+
81
+ # Auto-detect engine from model format for env var path (CLI path uses parse_model_flag)
82
+ _auto_detect_engine WORKER_MODEL WORKER_ENGINE WORKER_CODEX_MODEL WORKER_CODEX_REASONING
83
+ _auto_detect_engine VERIFIER_MODEL VERIFIER_ENGINE VERIFIER_CODEX_MODEL VERIFIER_CODEX_REASONING
84
+ _auto_detect_engine FINAL_VERIFIER_MODEL FINAL_VERIFIER_ENGINE "" ""
65
85
  WORKER_CODEX_MODEL="${WORKER_CODEX_MODEL:-gpt-5.4}"
66
86
  WORKER_CODEX_REASONING="${WORKER_CODEX_REASONING:-high}" # low|medium|high
67
87
  VERIFIER_CODEX_MODEL="${VERIFIER_CODEX_MODEL:-gpt-5.4}"
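
The format rule behind this auto-detection (a colon means codex, a plain name means claude, and a bare `spark` expands to the full model ID) can be shown in isolation. This is a simplified sketch, not the package's `_auto_detect_engine` or `parse_model_flag`: `split_model_spec` is a hypothetical name, and it prints its result instead of assigning variables through `eval`.

```shell
# Hypothetical helper: classify a model spec and expand the bare "spark" alias.
# Prints "codex <model> <reasoning>" for "model:reasoning", else "claude <model>".
split_model_spec() {
  spec="$1"
  case "$spec" in
    *:*)
      model="${spec%%:*}"          # text before the first colon
      reasoning="${spec##*:}"      # text after the last colon
      if [ "$model" = "spark" ]; then
        model="gpt-5.3-codex-spark"   # only the exact alias is expanded
      fi
      echo "codex $model $reasoning"
      ;;
    *)
      echo "claude $spec"
      ;;
  esac
}
```

With this rule, `spark:high` yields `codex gpt-5.3-codex-spark high`, while a full ID such as `gpt-5.3-codex-spark:high` passes through unchanged.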
@@ -191,19 +211,55 @@ check_dead_pane() {
191
211
  return 1 # alive
192
212
  }
193
213
 
194
- # launch_worker_codex() — launch codex Worker via trigger script (non-interactive exec)
195
- # Args: $1=pane_id $2=trigger_file $3=iteration
196
- # Returns: 0 always (codex failures detected by poll_for_signal)
214
+ # launch_worker_codex() — launch codex Worker TUI, send instruction, verify submission
215
+ # Matches launch_worker_claude() pattern for consistent tmux-visible execution.
216
+ # Args: $1=pane_id $2=prompt_file $3=iteration $4=worker_launch_cmd
217
+ # Returns: 0 on success, 1 on fatal failure
197
218
  launch_worker_codex() {
198
219
  local pane_id="$1"
199
- local trigger_file="$2"
220
+ local prompt_file="$2"
200
221
  local iter="$3"
222
+ local worker_launch="$4"
223
+
224
+ log " Launching Worker codex TUI in pane $pane_id..."
225
+ paste_to_pane "$pane_id" "$worker_launch"
226
+ tmux send-keys -t "$pane_id" C-m
201
227
 
202
- log " Launching Worker codex via trigger script in pane $pane_id..."
203
- paste_to_pane "$pane_id" "bash $trigger_file"
228
+ # Wait for codex TUI to be ready
229
+ if ! wait_for_pane_ready "$pane_id" 30; then
230
+ log_error "Worker codex failed to start"
231
+ return 1
232
+ fi
233
+
234
+ # Send instruction to codex TUI
235
+ sleep 3
236
+ local worker_instruction="Read and execute the instructions in $prompt_file"
237
+ paste_to_pane "$pane_id" "$worker_instruction"
204
238
  tmux send-keys -t "$pane_id" C-m
205
- log_debug "Worker codex trigger sent: $trigger_file"
206
- sleep 3 # brief wait for codex to start
239
+ log_debug "Worker codex instruction sent (${#worker_instruction} chars)"
240
+
241
+ # Submit loop — verify codex started working
242
+ local submit_attempts=0
243
+ while (( submit_attempts < 15 )); do
244
+ sleep 2
245
+ local pane_check
246
+ pane_check=$(tmux capture-pane -t "$pane_id" -p 2>/dev/null)
247
+ if echo "$pane_check" | grep -qi "working\|thinking\|Exploring\|Running\|reading\|searching\|editing\|writing" 2>/dev/null; then
248
+ log_debug "Worker codex started working after $((submit_attempts + 1)) checks"
249
+ break
250
+ fi
251
+ if (( submit_attempts == 8 )); then
252
+ log_debug "Adaptive instruction retry: clearing line and re-typing"
253
+ tmux send-keys -t "$pane_id" C-u 2>/dev/null
254
+ sleep 0.1
255
+ paste_to_pane "$pane_id" "$worker_instruction"
256
+ tmux send-keys -t "$pane_id" C-m
257
+ fi
258
+ tmux send-keys -t "$pane_id" C-m 2>/dev/null
259
+ sleep 0.3
260
+ tmux send-keys -t "$pane_id" C-m 2>/dev/null
261
+ (( submit_attempts++ ))
262
+ done
207
263
  return 0
208
264
  }
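
The activity check inside the submit loop above can be isolated into a small predicate. This sketch uses `grep -E` with plain `|` alternation, since `\|` in basic regular expressions is an extension rather than POSIX; `pane_shows_activity` is a hypothetical name, and the keyword list is copied from the loop.

```shell
# Hypothetical helper: succeed when captured pane text contains an activity keyword.
# $1 = text captured from the pane (for example via tmux capture-pane -p).
pane_shows_activity() {
  printf '%s\n' "$1" |
    grep -Eqi 'working|thinking|exploring|running|reading|searching|editing|writing'
}
```

Because the match is a case-insensitive substring test, a file path or prompt that happens to contain one of these words would also count as activity, so it is a heuristic rather than a guarantee.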
209
265
 
@@ -288,19 +344,53 @@ launch_worker_claude() {
288
344
  return 0
289
345
  }
290
346
 
291
- # launch_verifier_codex() — launch codex Verifier in pane (non-interactive)
347
+ # launch_verifier_codex() — launch codex Verifier TUI, send instruction, verify submission
348
+ # Matches launch_verifier_claude() pattern for consistent tmux-visible execution.
292
349
  # Args: $1=pane_id $2=prompt_file $3=iteration $4=launch_cmd
293
- # Returns: 0 always
350
+ # Returns: 0 on success
294
351
  launch_verifier_codex() {
295
352
  local pane_id="$1"
296
353
  local prompt_file="$2"
297
354
  local iter="$3"
298
355
  local verifier_launch="$4"
299
356
 
300
- log " Launching Verifier codex in pane $pane_id..."
357
+ log " Launching Verifier codex TUI in pane $pane_id..."
301
358
  paste_to_pane "$pane_id" "$verifier_launch"
302
359
  tmux send-keys -t "$pane_id" C-m
360
+
361
+ if ! wait_for_pane_ready "$pane_id" 30; then
362
+ log_error "Verifier codex failed to start"
363
+ return 1
364
+ fi
365
+
303
366
  sleep 3
367
+ local verifier_instruction="Read and execute the instructions in $prompt_file"
368
+ paste_to_pane "$pane_id" "$verifier_instruction"
369
+ tmux send-keys -t "$pane_id" C-m
370
+ log_debug "Verifier codex instruction sent"
371
+
372
+ # Submit loop — verify codex started working
373
+ local submit_attempts=0
374
+ while (( submit_attempts < 15 )); do
375
+ sleep 2
376
+ local vs_check
377
+ vs_check=$(tmux capture-pane -t "$pane_id" -p 2>/dev/null)
378
+ if echo "$vs_check" | grep -qi "working\|thinking\|Exploring\|Running\|reading\|searching\|editing\|writing" 2>/dev/null; then
379
+ log_debug "Verifier codex started working after $((submit_attempts + 1)) checks"
380
+ break
381
+ fi
382
+ if (( submit_attempts == 8 )); then
383
+ log_debug "Adaptive instruction retry: clearing line and re-typing"
384
+ tmux send-keys -t "$pane_id" C-u 2>/dev/null
385
+ sleep 0.1
386
+ paste_to_pane "$pane_id" "$verifier_instruction"
387
+ tmux send-keys -t "$pane_id" C-m
388
+ fi
389
+ tmux send-keys -t "$pane_id" C-m 2>/dev/null
390
+ sleep 0.3
391
+ tmux send-keys -t "$pane_id" C-m 2>/dev/null
392
+ (( submit_attempts++ ))
393
+ done
304
394
  return 0
305
395
  }
306
396
 
@@ -366,7 +456,7 @@ handle_worker_exit_codex() {
366
456
  local dc_us_id
367
457
  dc_us_id=$(jq -r '.us_id // "unknown"' "$DONE_CLAIM_FILE" 2>/dev/null)
368
458
  log " Codex worker completed with done-claim (us_id=$dc_us_id). Auto-generating signal."
369
- echo '{"iteration":'"$iter"',"status":"verify","us_id":"'"$dc_us_id"'","summary":"auto-generated after codex exec exit","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"}' > "$signal_file"
459
+ echo '{"iteration":'"$iter"',"status":"verify","us_id":"'"$dc_us_id"'","summary":"auto-generated after codex exit","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"}' > "$signal_file"
370
460
  else
371
461
  log " WARNING: Codex worker exited without done-claim. Generating verify signal for current US."
372
462
  local current_us
@@ -374,7 +464,7 @@ handle_worker_exit_codex() {
374
464
  local mem_us
375
465
  mem_us=$(sed -n 's/.*Next.*US-\([0-9]*\).*/US-\1/p' "$DESK/memos/${SLUG}-memory.md" 2>/dev/null | head -1)
376
466
  [[ -n "$mem_us" ]] && current_us="$mem_us"
377
- echo '{"iteration":'"$iter"',"status":"verify","us_id":"'"$current_us"'","summary":"auto-generated after codex exec exit (no done-claim)","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"}' > "$signal_file"
467
+ echo '{"iteration":'"$iter"',"status":"verify","us_id":"'"$current_us"'","summary":"auto-generated after codex exit (no done-claim)","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"}' > "$signal_file"
378
468
  fi
379
469
  return 0
380
470
  }
@@ -1048,23 +1138,31 @@ write_worker_trigger() {
1048
1138
  elif [[ "$VERIFY_MODE" = "batch" ]]; then
1049
1139
  echo ""
1050
1140
  echo "---"
1051
- echo "## BATCH MODE OVERRIDE"
1052
- echo "Ignore any per-US signal instructions above. In batch mode:"
1053
- echo "- Implement ALL user stories in this iteration"
1054
- echo '- Signal verify with us_id="ALL" only when ALL stories are complete'
1055
- echo "- Do NOT signal verify after individual stories"
1141
+ if [[ -n "$VERIFIED_US" ]]; then
1142
+ echo "## BATCH MODE CONTINUE FROM PARTIAL PROGRESS"
1143
+ echo "The following US have already been verified: **$VERIFIED_US**"
1144
+ echo "- Do NOT re-implement these; they are done."
1145
+ echo "- Focus ONLY on the remaining unverified user stories."
1146
+ echo '- Signal verify with us_id="ALL" when the remaining stories are complete.'
1147
+ else
1148
+ echo "## BATCH MODE OVERRIDE"
1149
+ echo "Ignore any per-US signal instructions above. In batch mode:"
1150
+ echo "- Implement ALL user stories in this iteration"
1151
+ echo '- Signal verify with us_id="ALL" only when ALL stories are complete'
1152
+ echo "- Do NOT signal verify after individual stories"
1153
+ fi
1056
1154
  fi
1057
1155
  } | atomic_write "$prompt_file"
1058
1156
 
1059
1157
  # Write trigger script (DO NOT use exec -- breaks heartbeat cleanup)
1060
1158
  # Engine-specific launch command (expanded at write time)
1061
1159
  if [[ "$WORKER_ENGINE" = "codex" ]]; then
1062
- local engine_cmd="${CODEX_BIN:-codex} exec \\
1160
+ local engine_cmd="${CODEX_BIN:-codex} \\
1063
1161
  -m $WORKER_CODEX_MODEL \\
1064
1162
  -c model_reasoning_effort=\"$WORKER_CODEX_REASONING\" \\
1065
1163
  --dangerously-bypass-approvals-and-sandbox \\
1066
1164
  \"\$(cat $prompt_file)\""
1067
- local engine_comment="# Run codex exec with fresh context (no pipe; codex requires terminal)"
1165
+ local engine_comment="# Run codex with fresh context (fallback trigger; TUI primary launch via launch_worker_codex)"
1068
1166
  else
1069
1167
  local engine_cmd="$CLAUDE_BIN -p \"\$(cat $prompt_file)\" \\
1070
1168
  --model $WORKER_MODEL \\
@@ -1132,13 +1230,15 @@ write_verifier_trigger() {
1132
1230
  echo "- **Iteration**: $iter"
1133
1231
  echo "- **Done Claim**: $DONE_CLAIM_FILE"
1134
1232
  echo "- **Verify Mode**: $VERIFY_MODE"
1135
- if [[ "$VERIFY_MODE" = "per-us" && -n "$us_id" ]]; then
1233
+ if [[ -n "$us_id" ]]; then
1136
1234
  if [[ "$us_id" = "ALL" ]]; then
1137
- echo "- **Scope**: FINAL FULL VERIFY — check ALL acceptance criteria from the PRD"
1138
- echo "- **Previously verified US**: $VERIFIED_US"
1235
+ echo "- **Scope**: FULL VERIFY — check ALL acceptance criteria from the PRD"
1139
1236
  else
1140
1237
  echo "- **Scope**: Verify ONLY the acceptance criteria for **${us_id}**"
1238
+ fi
1239
+ if [[ -n "$VERIFIED_US" ]]; then
1141
1240
  echo "- **Previously verified US**: $VERIFIED_US"
1241
+ echo "- **Note**: Skip re-verifying the above US. Focus on unverified stories."
1142
1242
  fi
1143
1243
  fi
1144
1244
  } | atomic_write "$prompt_file"
@@ -1557,9 +1657,9 @@ run_single_verifier() {
1557
1657
  # Launch verifier — dispatch to engine-specific function
1558
1658
  local verifier_launch
1559
1659
  if [[ "$engine" = "codex" ]]; then
1560
- verifier_launch="${CODEX_BIN:-codex} exec \"\$(cat $prompt_file)\" -m $VERIFIER_CODEX_MODEL -c model_reasoning_effort=\"$VERIFIER_CODEX_REASONING\" --dangerously-bypass-approvals-and-sandbox"
1660
+ verifier_launch="${CODEX_BIN:-codex} -m $VERIFIER_CODEX_MODEL -c model_reasoning_effort=\"$VERIFIER_CODEX_REASONING\" --dangerously-bypass-approvals-and-sandbox"
1561
1661
  launch_verifier_codex "$VERIFIER_PANE" "$prompt_file" "$iter" "$verifier_launch"
1562
- log_debug "Verifier$suffix codex exec dispatched"
1662
+ log_debug "Verifier$suffix codex TUI dispatched"
1563
1663
  else
1564
1664
  verifier_launch="$CLAUDE_BIN --model $model --dangerously-skip-permissions"
1565
1665
  if ! launch_verifier_claude "$VERIFIER_PANE" "$prompt_file" "$iter" "$verifier_launch"; then
@@ -1572,7 +1672,7 @@ run_single_verifier() {
1572
1672
  # Poll for verdict
1573
1673
  if [[ "$engine" = "codex" ]]; then
1574
1674
  # Codex exec: simple file poll (non-interactive, no heartbeat/nudge needed)
1575
- log " Polling for verify-verdict.json ($suffix, codex exec)..."
1675
+ log " Polling for verify-verdict.json ($suffix, codex TUI)..."
1576
1676
  local codex_poll_start
1577
1677
  codex_poll_start=$(date +%s)
1578
1678
  while true; do
@@ -1916,7 +2016,7 @@ main() {
1916
2016
  --arg verifier_model "$VERIFIER_MODEL" \
1917
2017
  --argjson debug "$DEBUG" \
1918
2018
  --argjson with_sv "$WITH_SELF_VERIFICATION" \
1919
- --argjson consensus "$VERIFY_CONSENSUS" \
2019
+ --argjson consensus "${VERIFY_CONSENSUS:-0}" \
1920
2020
  '{slug: $slug, project_root: $project_root, project_name: $project_name, campaign_status: $campaign_status, start_time: $start_time, end_time: $end_time, worker_model: $worker_model, verifier_model: $verifier_model, debug: $debug, with_self_verification: $with_sv, consensus: $consensus}' \
1921
2021
  > "$METADATA_FILE"
1922
2022
 
@@ -1960,7 +2060,7 @@ main() {
1960
2060
  log_debug "[OPTION] expected_flow=worker(all)->verify(ALL)->COMPLETE"
1961
2061
  fi
1962
2062
 
1963
- if [[ "$VERIFY_CONSENSUS" = "1" ]]; then
2063
+ if [[ "${VERIFY_CONSENSUS:-0}" = "1" ]]; then
1964
2064
  log_debug "[OPTION] consensus_flow=each_verify_runs_claude+codex_both_must_pass"
1965
2065
  fi
1966
2066
  fi
@@ -2084,9 +2184,12 @@ main() {
2084
2184
 
2085
2185
  local worker_launch
2086
2186
  if [[ "$WORKER_ENGINE" = "codex" ]]; then
2087
- local worker_trigger="$LOGS_DIR/iter-$(printf '%03d' $ITERATION).worker-trigger.sh"
2088
- worker_launch="bash $worker_trigger"
2089
- launch_worker_codex "$WORKER_PANE" "$worker_trigger" "$ITERATION"
2187
+ worker_launch="${CODEX_BIN:-codex} -m $WORKER_CODEX_MODEL -c model_reasoning_effort=\"$WORKER_CODEX_REASONING\" --dangerously-bypass-approvals-and-sandbox"
2188
+ if ! launch_worker_codex "$WORKER_PANE" "$worker_prompt" "$ITERATION" "$worker_launch"; then
2189
+ write_blocked_sentinel "Worker codex failed to start in pane"
2190
+ update_status "blocked" "worker_start_failed"
2191
+ return 1
2192
+ fi
2090
2193
  else
2091
2194
  worker_launch="$CLAUDE_BIN --model $WORKER_MODEL --dangerously-skip-permissions"
2092
2195
  if ! launch_worker_claude "$WORKER_PANE" "$worker_prompt" "$ITERATION" "$worker_launch"; then
@@ -2326,8 +2429,8 @@ main() {
2326
2429
  _MODEL_UPGRADED=0
2327
2430
  fi
2328
2431
 
2329
- # --- Per-US tracking ---
2330
- if [[ "$VERIFY_MODE" = "per-us" && -n "$signal_us_id" && "$signal_us_id" != "ALL" ]]; then
2432
+ # --- Verified US tracking (both per-us and batch modes) ---
2433
+ if [[ -n "$signal_us_id" && "$signal_us_id" != "ALL" ]]; then
2331
2434
  # Add this US to verified list
2332
2435
  if [[ -n "$VERIFIED_US" ]]; then
2333
2436
  VERIFIED_US="${VERIFIED_US},${signal_us_id}"
@@ -2351,6 +2454,32 @@ main() {
2351
2454
  ;;
2352
2455
  fail)
2353
2456
  # --- governance.md s7½: Fix Loop (adapted for tmux lean mode) ---
2457
+
2458
+ # Parse per_us_results from verdict to track partial progress (batch + per-us)
2459
+ local _prev_verified="$VERIFIED_US"
2460
+ if jq -e '.per_us_results' "$VERDICT_FILE" &>/dev/null; then
2461
+ local _newly_passed
2462
+ _newly_passed=$(jq -r '.per_us_results | to_entries[] | select(.value == "pass") | .key' "$VERDICT_FILE" 2>/dev/null)
2463
+ for _pus in $(echo "$_newly_passed"); do
2464
+ if ! echo ",$VERIFIED_US," | grep -q ",$_pus,"; then
2465
+ if [[ -n "$VERIFIED_US" ]]; then
2466
+ VERIFIED_US="${VERIFIED_US},${_pus}"
2467
+ else
2468
+ VERIFIED_US="$_pus"
2469
+ fi
2470
+ log " Partial progress: $_pus passed (overall FAIL). Verified so far: $VERIFIED_US"
2471
+ fi
2472
+ done
2473
+ log_debug "[FLOW] iter=$ITERATION partial_progress prev=$_prev_verified now=$VERIFIED_US"
2474
+ fi
2475
+
2476
+ # Partial progress resets consecutive failures (progress was made)
2477
+ if [[ "$VERIFIED_US" != "$_prev_verified" ]]; then
2478
+ CONSECUTIVE_FAILURES=0
2479
+ log " Progress detected — consecutive_failures reset to 0"
2480
+ log_debug "[GOV] iter=$ITERATION consecutive_failures_reset=partial_progress"
2481
+ fi
2482
+
2354
2483
  (( CONSECUTIVE_FAILURES++ ))
2355
2484
  record_us_failure "${signal_us_id:-unknown}"
2356
2485
  check_model_upgrade "${signal_us_id:-unknown}"
@@ -2369,6 +2498,13 @@ main() {
2369
2498
  {
2370
2499
  echo "# Fix Contract (from Verifier iteration $ITERATION)"
2371
2500
  echo ""
2501
+ if [[ -n "$VERIFIED_US" ]]; then
2502
+ echo "## Verified US (do NOT re-implement these)"
2503
+ echo "$VERIFIED_US" | tr ',' '\n' | sed 's/^/- /'
2504
+ echo ""
2505
+ echo "**Focus ONLY on unverified user stories. The above are already verified.**"
2506
+ echo ""
2507
+ fi
2372
2508
  echo "## Summary"
2373
2509
  echo "$verdict_summary_fail"
2374
2510
  echo ""