maestro-flow 0.3.39 → 0.3.41

@@ -279,9 +279,15 @@ Write `.workflow/.maestro/ralph-{YYYYMMDD-HHmmss}/status.json`:
  }
  ```
 
- ### 1.7: Initialize plan + confirm
+ ### 1.7: Initialize tracking + confirm
 
  ```
+ // Goal = outer constraint — ensures entire lifecycle chain completes
+ functions.create_goal({
+ objective: `Ralph lifecycle: ${lifecycle_position} → milestone-complete | ${steps.length} steps (${decision_count} decisions) | quality=${quality_mode}`
+ })
+
+ // Plan = inner tracking — sub-step progress
  functions.update_plan({
  explanation: "Ralph lifecycle: {position} → milestone-complete",
  plan: steps.map(step => ({ step: stepLabel(step), status: "pending" }))
@@ -355,8 +361,10 @@ STATUS: proceed | fix | escalate
  REASON: one-sentence explanation
  GAP_SUMMARY: problem description (fill in for fix/escalate)
  CONFIDENCE: high | medium | low
+ CONFIDENCE_SCORE: 0-100 (read the confidence score from the result file; estimate if absent)
+ WEAKEST_DIMENSION: name of the weakest dimension
  ---END---
- CONSTRAINTS: evaluate only | STATUS must be one of the three | must escalate once retry ${meta.retry_count}/${meta.max_retries} reaches the limit" --role analyze --mode analysis`,
+ CONSTRAINTS: evaluate only | STATUS must be one of the three | confidence < 60% leans toward fix | must escalate once retry ${meta.retry_count}/${meta.max_retries} reaches the limit" --role analyze --mode analysis`,
  yield_time_ms: 30000,
  max_output_tokens: 6000
  })
@@ -366,17 +374,25 @@ CONSTRAINTS: evaluate only | STATUS must be one of the three | retry ${meta.retry_count}/${meta.max
 
  **Parse verdict** (on callback):
  ```
- Extract STATUS / REASON / GAP_SUMMARY / CONFIDENCE from output.
+ Extract STATUS / REASON / GAP_SUMMARY / CONFIDENCE / CONFIDENCE_SCORE / WEAKEST_DIMENSION from output.
  If parse fails → fallback: STATUS = "fix", GAP_SUMMARY = generic
+
+ Confidence-based verdict adjustment (after parse, before apply):
+ If CONFIDENCE_SCORE < 60 AND STATUS == "proceed":
+ → Override to "fix", REASON += " (insufficient confidence: {score}%, {weakest} needs strengthening)"
+ If CONFIDENCE_SCORE > 95 AND STATUS == "fix" AND retry_count > 0:
+ → Suggest "proceed" override, REASON += " (sufficient confidence: {score}%, suggest proceeding)"
  ```
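The adjustment rule in this hunk can be sketched in JavaScript as follows. This is a minimal illustration, not maestro-flow's implementation: the function name `adjustVerdict`, the object shape (`status`, `reason`, `confidenceScore`, `weakestDimension`, `suggestedOverride`) and the English reason strings are all assumptions introduced here.

```javascript
// Hypothetical helper: applies the two confidence rules after parsing
// and before the verdict is acted on. All field names are illustrative.
function adjustVerdict(verdict, retryCount) {
  const score = verdict.confidenceScore;
  const weakest = verdict.weakestDimension;
  const adjusted = { ...verdict, suggestedOverride: null };

  // Without a numeric score there is nothing to adjust.
  if (typeof score !== "number" || Number.isNaN(score)) return adjusted;

  if (score < 60 && verdict.status === "proceed") {
    // Low confidence overrides an optimistic verdict.
    adjusted.status = "fix";
    adjusted.reason += ` (insufficient confidence: ${score}%, ${weakest} needs strengthening)`;
  } else if (score > 95 && verdict.status === "fix" && retryCount > 0) {
    // High confidence after at least one retry only suggests proceeding;
    // the status itself is left as "fix".
    adjusted.suggestedOverride = "proceed";
    adjusted.reason += ` (sufficient confidence: ${score}%, suggest proceeding)`;
  }
  return adjusted;
}

// Usage: a "proceed" verdict with a score of 45 is turned into "fix".
const example = adjustVerdict(
  { status: "proceed", reason: "tests pass", confidenceScore: 45, weakestDimension: "coverage" },
  0
);
console.log(example.status, "|", example.reason);
```

Keeping the high-confidence case as a suggestion rather than a status change mirrors the wording in the hunk ("Suggest"), so the caller still decides whether to honor it.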
 
+ **Confidence-aware evaluation**: Before delegating, check if artifact contains confidence section (added by downstream commands). If found, include `Existing confidence assessment: overall {overall}%, weakest dimension: {weakest} ({score}%)` in delegate prompt as additional signal.
+
  **Apply verdict:**
 
  | Mode | Behavior |
  |------|----------|
- | `-y` (auto_mode) | Follow verdict directly, no user prompt |
- | Interactive + confidence == "high" | Display recommendation, prompt user with options below |
- | Interactive + confidence != "high" | Display recommendation **with warning**, prompt user with options below |
+ | `-y` (auto_mode) | Follow adjusted verdict directly, no user prompt |
+ | Interactive + confidence_score >= 80 | Display recommendation with confidence, prompt user |
+ | Interactive + confidence_score < 80 | Display recommendation **with confidence warning**, prompt user |
 
  Interactive prompt (via `request_user_input`):
  ```json
@@ -577,8 +593,13 @@ functions.update_plan({
  explanation: "Ralph lifecycle complete",
  plan: steps.map(step => ({ step: stepLabel(step), status: "completed" }))
  })
+
+ // Release goal constraint — only on true completion
+ functions.update_goal({ status: "complete" })
  ```
 
+ **Note**: Pause/escalate paths (`post-debug-escalate` STOP, session pause) do NOT call `update_goal` — goal stays running for resume.
+
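The goal/plan split described in the note can be illustrated with a small sketch. It assumes a stand-in for the host-provided `functions` tool namespace (the real tools come from the agent runtime), and `makeFakeFunctions`, `finishLifecycle`, and the `outcome` values are hypothetical names used only for this example.

```javascript
// Stand-in for the host-provided `functions` namespace; it only records calls.
function makeFakeFunctions() {
  const calls = [];
  return {
    calls,
    create_goal: (args) => calls.push(["create_goal", args]),
    update_goal: (args) => calls.push(["update_goal", args]),
    update_plan: (args) => calls.push(["update_plan", args]),
  };
}

// Hypothetical driver: `outcome` is "complete", "pause", or "escalate".
function finishLifecycle(functions, stepLabels, outcome) {
  if (outcome !== "complete") {
    // Pause/escalate: leave the goal running so a later resume can continue.
    return;
  }
  functions.update_plan({
    explanation: "Ralph lifecycle complete",
    plan: stepLabels.map((step) => ({ step, status: "completed" })),
  });
  // Release the outer goal constraint only on true completion.
  functions.update_goal({ status: "complete" });
}

// Usage: a paused session makes no tool calls; a completed one makes two.
const fake = makeFakeFunctions();
finishLifecycle(fake, ["plan", "execute"], "pause");
finishLifecycle(fake, ["plan", "execute"], "complete");
console.log(fake.calls.map(([name]) => name));
```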
  Display:
  ```
  ============================================================
@@ -665,9 +686,11 @@ Rules:
  - [ ] Conditional steps evaluated at decision time (coverage threshold)
  - [ ] buildSkillCall() completes arg enrichment + auto flag, CSV contains full commands
  - [ ] Quality-gate decisions delegate-evaluated via `maestro delegate --role analyze`
- - [ ] Delegate verdict parsed: STATUS / REASON / GAP_SUMMARY / CONFIDENCE
- - [ ] `-y` mode: auto-follow delegate verdict, no STOP (except post-debug-escalate)
- - [ ] Interactive mode: display recommendation + request_user_input with override
+ - [ ] Delegate verdict parsed: STATUS / REASON / GAP_SUMMARY / CONFIDENCE / CONFIDENCE_SCORE / WEAKEST_DIMENSION
+ - [ ] Confidence-based verdict adjustment applied (< 60% bias fix, > 95% bias proceed)
+ - [ ] Artifact confidence sections read when available as additional signal
+ - [ ] `-y` mode: auto-follow adjusted verdict, no STOP (except post-debug-escalate)
+ - [ ] Interactive mode: display recommendation with confidence score + request_user_input with override
  - [ ] Delegate failure fallback: treat as "fix" verdict
  - [ ] passed_gates[] tracks passed quality gates, skips re-runs in retry loops
  - [ ] passed_gates cleared when code changes (fix-loop inserts execute step)
@@ -385,6 +385,9 @@ OUTER LOOP (max_iter iterations):
  Analyze: pass rate delta, failure clusters, strategy effectiveness
  Append to reflection-log.md
 
+ **Test confidence scoring** (at each REFLECT step):
+ Dimensions (5): scenario_coverage, test_quality, diagnostic_accuracy, strategy_effectiveness, infrastructure_fitness. Factors (weights): completeness(.30), pass_rate_trend(.25), classification_accuracy(.20), coverage_breadth(.15), consistency(.10). Enhanced convergence: BOTH pass_rate ≥ threshold AND confidence ≥ 60%. Add confidence to `report.json`.
+
  ADJUST (Adaptive Strategy):
  IF startStrategy provided AND iteration == 1: use startStrategy as initial
  OTHERWISE auto-select:
@@ -540,6 +543,9 @@ CSV Session: .tests/auto-test/.csv-session/
  - [ ] discoveries.ndjson append-only throughout
  - [ ] Cross-layer context propagation via prev_context column
  - [ ] Iteration engine ran (inner: test_defect fix, outer: strategy adjust)
+ - [ ] Test confidence scored per iteration with 5-dimension factor model
+ - [ ] Convergence check includes confidence >= 60% alongside pass_rate
+ - [ ] Confidence section added to report.json
  - [ ] state.json, report.json, reflection-log.md written
  - [ ] If spec: traceability.md produced
  - [ ] If failures: issues auto-created in issues.jsonl
@@ -257,6 +257,10 @@ spawn_agents_on_csv({
 
  2. **Generate context.md**: Debug report with summary (mode, hypothesis/confirmed/fixed/verified counts), per-hypothesis results (hypothesis, evidence for/against, findings, status), per-fix results (fix applied, verified, findings), aggregated root causes, and next steps.
 
+ 2b. **Debug confidence scoring**:
+
+ Dimensions (4): hypothesis_quality, evidence_completeness, root_cause_isolation, fix_confidence. Factors (weights): evidence_depth(.30), evidence_strength(.25), coverage_breadth(.20), reproduction(.15), consistency(.10). Map to legacy: <40% = low, 40-70% = medium, >70% = high. Append confidence assessment to context.md.
+
  3. **UAT update** (if --from-uat): Update `uat.md` gaps with `root_cause`, `fix_direction`, `affected_files` for confirmed hypotheses.
 
  4. **Issue update**: If `issues.jsonl` exists, update matching issues with status `diagnosed`, add `context.suggested_fix` and `context.notes`.
@@ -296,7 +300,7 @@ spawn_agents_on_csv({
 
  | Type | Dedup Key | Data Schema | Description |
  |------|-----------|-------------|-------------|
- | `root_cause` | `data.location` | `{location, cause, severity, confidence}` | Confirmed root cause |
+ | `root_cause` | `data.location` | `{location, cause, severity, confidence_score, confidence_factors}` | Confirmed root cause |
  | `hypothesis_evidence` | `data.hypothesis+data.location` | `{hypothesis, location, type, conclusion}` | Evidence for/against hypothesis |
  | `affected_component` | `data.component` | `{component, files[], impact}` | Component affected by bug |
  | `reproduction_path` | `data.trigger` | `{trigger, steps[], frequency}` | Bug reproduction path |
@@ -333,6 +337,8 @@ echo '{"ts":"<ISO>","worker":"{id}","type":"root_cause","data":{"location":"src/
  - [ ] Refuted/inconclusive hypotheses correctly skip wave 2 fix tasks
  - [ ] Wave 2 fixes attempted only for confirmed hypotheses
  - [ ] context.md produced with diagnosis summary
+ - [ ] Multi-factor confidence scored per hypothesis replacing simple high/medium/low
+ - [ ] Confidence assessment appended to context.md
  - [ ] UAT gaps updated (if --from-uat)
  - [ ] Issues updated with diagnosis results
  - [ ] discoveries.ndjson append-only throughout
@@ -434,6 +434,12 @@ Options:
  If re-verify passes: update uat.md gaps as resolved, report success.
  If gaps remain after 2 iterations: report remaining, suggest manual intervention.
 
+ ### Step 12.5: UAT Confidence Scoring
+
+ Dimensions (4): scenario_coverage, diagnostic_depth, observation_quality, closure_completeness. Factors (weights): requirements_mapped(.30), observation_specificity(.25), user_validation(.20), diagnostic_depth(.15), consistency(.10). Append confidence summary to `uat.md`.
+
+ **Readiness gate** (before final report): Block if scenario_coverage < 40% or any blocker-severity gap without diagnosis.
+
  ### Step 13: Report
 
  ```
@@ -490,6 +496,9 @@ Files:
  - [ ] test-results.json and coverage-report.json written
  - [ ] index.json uat fields updated
  - [ ] Artifact registered in state.json
+ - [ ] UAT confidence scored with 4-dimension factor model
+ - [ ] Readiness gate checked before final report
+ - [ ] Confidence summary appended to uat.md
  - [ ] If issues: diagnosis.csv built, spawn_agents_on_csv executed per gap cluster
  - [ ] Gaps updated with root_cause, fix_direction, affected_files
  - [ ] Gap-fix loop triggered if --auto-fix (max 2 iterations)
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "maestro-flow",
- "version": "0.3.39",
+ "version": "0.3.41",
  "description": "Workflow orchestration CLI with MCP endpoint support and extensible architecture",
  "type": "module",
  "imports": {
@@ -252,6 +252,10 @@ Re-read original User Intent from discussion.md header. Check each intent item a
 
  Append initial Intent Coverage Check to discussion.md.
 
+ **Step 4.6: Baseline Confidence Scoring**
+
+ Dimensions = the 6 analysis dimensions. Factors (weights): findings_depth(.30), evidence_strength(.25), coverage_breadth(.20), user_validation(.15), consistency(.10). Score each factor per dimension from Round 1 results. Append baseline confidence table to discussion.md. Thresholds: <60% keep deepening | 60-80% optional | 80-95% near convergence | >95% convergence recommended.
+
  ### Step 5: Interactive Discussion Loop
 
  Max 5 rounds. Each round follows this sequence:
@@ -301,7 +305,19 @@ Re-read original User Intent. Check each item:
 
  If ❌ or ⚠️ items exist → proactively surface to user at start of next round.
 
- **Auto mode (-y)**: auto-deepen for up to 3 rounds, then synthesize.
+ **5.8: Re-score Confidence** (every round):
+ Re-evaluate factors per dimension. Show delta: `Confidence: {prev}% → {current}% ({±N%}), {weakest_dim} still needs deeper analysis`
+
+ **5.9: Quality Mechanisms**:
+ - **Pressure Pass** (mandatory ≥1 before Step 6): highest-confidence finding → pressure ladder (evidence demand → assumption probe → boundary/tradeoff → root cause check). Record under `#### Pressure Test`.
+ - **Devil's Advocate**: dimension > 0.7 → challenge "What if [finding] does not hold?" (once per dimension)
+ - **Scope Minimizer**: findings > 5 + scope expanding → "Minimal viable set of conclusions?"
+ - **Stall Detection**: delta < 5% for 2 consecutive rounds → "Analysis may have stalled; consider switching direction or converging"
+
+ **5.10: Pre-Synthesis Readiness Gate** (on "analysis complete"):
+ Block if: ❌ items without deferral | any dimension < 40% | no pressure pass | unresolved contradictions. If blocked → AskUserQuestion: "supplement, then continue" or "ignore risks and continue" (record `residual_risks[]`).
+
+ **Auto mode (-y)**: auto-deepen ≤3 rounds, readiness gate auto-overrides with residual risk recording.
 
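Stall detection (5.9) and the pre-synthesis readiness gate (5.10) could be checked along the following lines. This is a sketch under stated assumptions: confidence values are percentages on a 0-100 scale, "delta" is read as the absolute round-over-round change in overall confidence, and the names `isStalled`, `readinessBlockers`, and the `state` fields are hypothetical.

```javascript
// True when each of the last `rounds` round-over-round changes is under
// `minDelta` percentage points. `history` holds overall confidence per round.
function isStalled(history, minDelta = 5, rounds = 2) {
  if (history.length < rounds + 1) return false;
  const recent = history.slice(-(rounds + 1));
  for (let i = 1; i < recent.length; i++) {
    if (Math.abs(recent[i] - recent[i - 1]) >= minDelta) return false;
  }
  return true;
}

// Returns the list of reasons the gate would block; an empty list means pass.
function readinessBlockers(state) {
  const blockers = [];
  if (state.uncoveredIntentsWithoutDeferral > 0) {
    blockers.push("uncovered intent items without deferral");
  }
  const weak = Object.entries(state.dimensions)
    .filter(([, pct]) => pct < 40)
    .map(([name]) => name);
  if (weak.length > 0) blockers.push(`dimensions below 40%: ${weak.join(", ")}`);
  if (!state.pressurePassDone) blockers.push("no pressure pass");
  if (state.unresolvedContradictions > 0) blockers.push("unresolved contradictions");
  return blockers;
}

// Usage: two flat rounds in a row count as a stall.
console.log(isStalled([50, 62, 64, 66]));
console.log(readinessBlockers({
  uncoveredIntentsWithoutDeferral: 0,
  dimensions: { feasibility: 72, risk: 35 },
  pressurePassDone: false,
  unresolvedContradictions: 0,
}));
```

Returning the blocker list rather than a boolean makes it straightforward to record the same reasons as `residual_risks[]` when the user, or auto mode, chooses to proceed anyway.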
  ### Step 6: Six-Dimension Scoring
 
@@ -318,9 +334,11 @@ Using all exploration findings, discussion insights, and user feedback, score ac
 
  Each dimension scored with specific evidence (code refs, data points from exploration).
 
+ Each 1-5 score justified by confidence factors from Step 4.6/5.8. Include per-dimension confidence % alongside the score.
+
  Build probability-impact risk matrix from identified risks.
 
- Formulate Go/No-Go/Conditional recommendation with confidence level.
+ Formulate Go/No-Go/Conditional recommendation with overall confidence %. Write confidence summary (per-dimension scores + overall + pressure pass result + residual risks) to analysis.md.
 
  ### Step 7: Synthesis & Conclusion
 
@@ -673,6 +691,10 @@ Replaceable blocks (overwritten each round):
  - At least 2 alternatives compared with tradeoffs
  - Go/No-Go/Conditional recommendation with confidence level
  - Code references included where relevant (file paths, line numbers)
+ - Confidence tracking initialized and re-scored each round
+ - Readiness gate checked before synthesis (Step 5.10)
+ - Pressure pass completed ≥ 1 time before Step 6
+ - Confidence summary with factor decomposition in analysis.md
 
  **Both modes (full + quick):**
  - context.md written with all decisions classified as Locked/Free/Deferred
@@ -498,6 +498,18 @@ END OUTER
 
  ---
 
+ ### Step 7.5: Test Confidence Scoring
+
+ Scored after each REFLECT step. Dimensions (5): scenario_coverage, test_quality, diagnostic_accuracy, strategy_effectiveness, infrastructure_fitness. Factors (weights): completeness(.30), pass_rate_trend(.25), classification_accuracy(.20), coverage_breadth(.15), consistency(.10). Append confidence table to reflection-log.md.
+
+ **Enhanced Convergence**: pass_rate ≥ 95% AND confidence ≥ 60% → converged. pass_rate ≥ 95% BUT confidence < 60% → continue (tests may be weak). max_iter reached or all failures = code_defect → Step 8.
+
+ **Quality mechanisms**: Pressure Pass (before Step 8) — select 2-3 passing tests from highest-pass-rate layer, verify they exercise real behavior (not mock-only, non-trivial assertions). Devil's Advocate — pass_rate > 80% → challenge assertion specificity, error path coverage, mock over-reliance. Stall Detection — delta < 5% for 2 iterations + pass_rate flat → force Reflective strategy.
+
+ **Readiness Gate** (before Step 8, skip if max_iter=1): scenario_coverage < 40% | no pressure pass | diagnostic_accuracy < 40% | unclassified failures. If blocked → force one additional iteration. Add confidence section to report.json.
+
+ ---
+
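The enhanced convergence rule reads naturally as a three-way decision. The sketch below is one possible reading, not the shipped logic: `convergenceDecision`, its parameter names, and the return labels are hypothetical, and the precedence (convergence is checked before the `max_iter` and `code_defect` exits) is an assumption the source does not spell out.

```javascript
// Hypothetical decision helper. passRate and confidence are percentages (0-100);
// `failures` is a list of { classification } records for the current iteration.
function convergenceDecision({ passRate, confidence, iteration, maxIter, failures, threshold = 95 }) {
  // Converged only when BOTH the pass rate and the confidence are high enough.
  if (passRate >= threshold && confidence >= 60) return "converged";

  // Hard exits: iteration budget exhausted, or nothing left that test fixes can solve.
  const allCodeDefects =
    failures.length > 0 && failures.every((f) => f.classification === "code_defect");
  if (iteration >= maxIter || allCodeDefects) return "finalize"; // go to Step 8

  // Includes the "high pass rate but weak tests" case: keep iterating.
  return "continue";
}

// Usage: a 96% pass rate with only 50% confidence keeps the loop running.
console.log(convergenceDecision({
  passRate: 96, confidence: 50, iteration: 2, maxIter: 5, failures: [],
}));
```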
  ### Step 8: Complete & Write Artifacts
 
  1. Update session state:
@@ -331,7 +331,7 @@ Each agent receives:
 
  ### Step 5: Synthesis Integration (Auto Mode)
 
- Six sub-phases producing feature specs from cross-role analysis:
+ Seven sub-phases producing feature specs from cross-role analysis:
 
  **Sub-phase 1: Discovery**
  - Detect session, validate analysis files exist
@@ -349,6 +349,16 @@ Six sub-phases producing feature specs from cross-role analysis:
  - Output: `enhancement_recommendations` (EP-001, EP-002, ...) + `feature_conflict_map` (per-feature consensus/conflicts/cross_refs)
  - Conflict resolution quality: actionable, justified ("because...tradeoff:..."), scoped, confidence-tagged ([RESOLVED]|[SUGGESTED]|[UNRESOLVED])
 
+ **Sub-phase 3B: Brainstorm Confidence Scoring**
+
+ Compute after cross-role analysis, before user interaction. Dimensions (5): role_coverage, cross_role_consistency, feature_completeness, spec_quality, design_feasibility. Factors (weights): analysis_depth(.30), evidence_strength(.25), coverage_breadth(.20), user_validation(.15), consistency(.10).
+
+ Scoring points: per-role completion → `role_coverage`; after cross-role → `cross_role_consistency`; after user interaction → `user_validation`.
+
+ **Quality mechanisms**: Pressure Pass (before spec gen) — verify claims backed by ≥2 role analyses. Devil's Advocate — challenge when dimension > 0.7. Conflict Detection (replaces stall) — >3 `[UNRESOLVED]` conflicts → challenge.
+
+ **Readiness Gate** (before Sub-phase 4): Block if role_coverage < 100% | cross_role_consistency < 40% | feature_completeness below intent count | no pressure pass. If blocked → AskUserQuestion: fill gaps or proceed with residual risks. Append confidence summary to synthesis-changelog.md.
+
  **Sub-phase 4: User Interaction**
  - Enhancement selection: AskUserQuestion (multiSelect=true, batched by 4)
  - Clarification questions: 9-category taxonomy scan, AskUserQuestion (single-select, multi-round)
@@ -137,7 +137,7 @@ For each cluster, spawn concurrently as general-purpose agent (`run_in_backgroun
 
  - **Input**: cluster name, phase, all gaps (test_id, truth, reason, severity). Mode: `symptoms_prefilled`.
  - **Process**: read source files, form 2-3 hypotheses per gap ranked by likelihood, search code for evidence, log each as NDJSON line, confirm/refute.
- - **Output per gap**: `root_cause`, `fix_direction`, `affected_files` (file:line), `confidence` (high/medium/low), `evidence` summary.
+ - **Output per gap**: `root_cause`, `fix_direction`, `affected_files` (file:line), `confidence` (multi-factor; also include legacy `confidence_level`: high/medium/low for backward compat), `evidence` summary.
  - **Files**: `{debug_dir}/evidence-{cluster_slug}.ndjson`, `{debug_dir}/understanding-{cluster_slug}.md`
 
  All agents run concurrently. Collect all results.
@@ -198,7 +198,7 @@ For each agent result, extract:
  - root_cause per gap
  - fix_direction per gap
  - affected_files per gap
- - confidence level
+ - confidence (multi-factor, see 7.1)
  - evidence summary
 
  Build unified diagnosis:
@@ -215,14 +215,23 @@ Build unified diagnosis:
  "root_cause": "...",
  "fix_direction": "...",
  "affected_files": ["src/components/Comments.tsx:42"],
- "confidence": "high"
+ "confidence": { "overall": 0.78, "dimensions": {} }
  }
  ]
  }
- ]
+ ],
+ "confidence": {}
  }
  ```
 
+ ### Step 7.0: Debug Confidence Scoring
+
+ Dimensions (4): hypothesis_quality, evidence_completeness, root_cause_isolation, fix_confidence. Factors (weights): evidence_depth(.30), evidence_strength(.25), coverage_breadth(.20), reproduction(.15), consistency(.10). Map to legacy levels: <40% = low, 40-70% = medium, >70% = high.
+
+ Quality mechanisms: Pressure Pass (before Step 9) — cross-check confirmed vs refuted hypotheses. Devil's Advocate — root_cause_isolation > 0.7 → "Is the root cause deeper?". Stall Detection — no new evidence + delta < 5% for 2 continuations → "Investigation may have stalled".
+
+ Readiness Gate (blocks Step 9): evidence_completeness ≥ 40% | pressure pass done | no contradicting evidence | fix_direction has specific files. If blocked → AskUserQuestion: "investigate further" or "ignore risks and confirm". Append confidence table to understanding.md.
+
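The factor weights and the legacy mapping in Step 7.0 suggest a weighted-sum score. The sketch below is an illustration only: the diff lists the weights but not the aggregation formula, so scoring each dimension as a weighted sum of factor scores and taking the overall value as the unweighted mean of dimensions are both assumptions, as are the names `scoreDimension`, `overallConfidence`, and `toLegacyLevel`. Scores are percentages on a 0-100 scale, whereas the JSON example above stores `overall` as a fraction (0.78).

```javascript
// Factor weights as listed in Step 7.0 (they sum to 1.0).
const DEBUG_FACTOR_WEIGHTS = {
  evidence_depth: 0.30,
  evidence_strength: 0.25,
  coverage_breadth: 0.20,
  reproduction: 0.15,
  consistency: 0.10,
};

// Weighted sum of factor scores (each 0-100) for one dimension.
// Missing factors count as 0.
function scoreDimension(factorScores, weights = DEBUG_FACTOR_WEIGHTS) {
  let total = 0;
  for (const [factor, weight] of Object.entries(weights)) {
    total += weight * (factorScores[factor] ?? 0);
  }
  return total;
}

// Assumption: overall confidence is the unweighted mean of the dimension scores.
function overallConfidence(dimensions) {
  const scores = Object.values(dimensions).map((factors) => scoreDimension(factors));
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// Legacy mapping from the hunk: <40% low, 40-70% medium, >70% high.
function toLegacyLevel(pct) {
  if (pct < 40) return "low";
  if (pct <= 70) return "medium";
  return "high";
}

// Usage: strong evidence but no reproduction.
const hypothesisQuality = scoreDimension({
  evidence_depth: 90, evidence_strength: 80, coverage_breadth: 70, reproduction: 0, consistency: 60,
});
console.log(hypothesisQuality.toFixed(1), toLegacyLevel(hypothesisQuality));
```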
  ### Step 7.1: Update Issues with Diagnosis
 
  For each diagnosed gap with an `issue_id`, update the corresponding issue in `.workflow/issues/issues.jsonl`:
package/workflows/plan.md CHANGED
@@ -251,12 +251,22 @@ Bidirectional linking: update matching issues in `.workflow/issues/issues.jsonl`
  - Critical issues → re-spawn planner with issues, revise, re-check
  - Warnings only → log and proceed
 
- 3. **Update index.json**
- - Set `index.json.plan` = `{ task_ids, task_count, complexity, waves, executor_assignments: {} }`
+ 3. **Plan Confidence Scoring**
+
+ Dimensions (5): requirements_coverage, task_quality, dependency_correctness, estimation_accuracy, collision_safety. Factors (weights): completeness(.30), specificity(.25), structural_validity(.20), user_validation(.15), consistency(.10). Re-score after each revision round.
+
+ Quality mechanisms: Pressure Pass (mandatory before P4.5) — verify highest-complexity task's read_first/convergence.criteria/action. Devil's Advocate — requirements_coverage > 0.7 → "Implicit requirements?". Scope Minimizer — task_count exceeds guard → "Minimal viable task set?". Stall Detection — delta < 5% → suggest broader revision.
+
+ 4. **Plan Readiness Gate** (blocks P4.5)
+
+ Block if: requirements_coverage < 40% | task missing read_first/convergence.criteria | no pressure pass | circular deps. If blocked → AskUserQuestion: "revise plan" or "ignore risks and continue" (record residual_risks). Add confidence section to plan.json.
+
+ 5. **Update index.json**
+ - Set `index.json.plan` = `{ task_ids, task_count, complexity, waves, executor_assignments: {}, confidence: overall_score }`
  - Set `status: "planning"`, `updated_at: now()`
 
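The Plan Readiness Gate combines threshold checks with a structural check for circular dependencies. A minimal sketch follows, assuming each task carries an `id` and a `depends_on` list (the diff does not name the dependency field); `hasCircularDeps`, `planGateBlockers`, `requirementsCoverage`, and `pressurePassDone` are hypothetical names, while `read_first` and `convergence.criteria` come from the hunk.

```javascript
// Depth-first search for a dependency cycle. A task reached again while it is
// still being visited means there is a back edge, hence a cycle.
function hasCircularDeps(tasks) {
  const deps = new Map(tasks.map((t) => [t.id, t.depends_on ?? []]));
  const state = new Map(); // id -> "visiting" | "done"
  const visit = (id) => {
    if (state.get(id) === "done") return false;
    if (state.get(id) === "visiting") return true;
    state.set(id, "visiting");
    for (const dep of deps.get(id) ?? []) {
      if (visit(dep)) return true;
    }
    state.set(id, "done");
    return false;
  };
  return tasks.some((t) => visit(t.id));
}

// Returns the reasons the gate would block; an empty list means the plan passes.
function planGateBlockers(plan) {
  const blockers = [];
  if (plan.requirementsCoverage < 40) blockers.push("requirements_coverage < 40%");
  const incomplete = plan.tasks.filter(
    (t) => !t.read_first?.length || !t.convergence?.criteria?.length
  );
  if (incomplete.length > 0) {
    blockers.push(`missing read_first/convergence.criteria: ${incomplete.map((t) => t.id).join(", ")}`);
  }
  if (!plan.pressurePassDone) blockers.push("no pressure pass");
  if (hasCircularDeps(plan.tasks)) blockers.push("circular deps");
  return blockers;
}

// Usage: T1 and T2 depend on each other, so the gate blocks.
console.log(planGateBlockers({
  requirementsCoverage: 85,
  pressurePassDone: true,
  tasks: [
    { id: "T1", read_first: ["a.md"], convergence: { criteria: ["builds"] }, depends_on: ["T2"] },
    { id: "T2", read_first: ["b.md"], convergence: { criteria: ["tests pass"] }, depends_on: ["T1"] },
  ],
}));
```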
  ### Output
- - Updated plan.json (if revised)
+ - Updated plan.json (if revised) with confidence section
  - Updated .task/ files (if revised)
  - Updated index.json with plan fields
 
@@ -286,7 +296,7 @@ Bidirectional linking: update matching issues in `.workflow/issues/issues.jsonl`
 
  ### Steps
 
- 1. **Display plan summary** — summary, approach, task count, wave structure, complexity, key dependencies
+ 1. **Display plan summary** — summary, approach, task count, wave structure, complexity, key dependencies, **plan confidence** (overall %, weakest dimension, pressure pass result)
 
  2. **Present options via AskUserQuestion** (skip if `config.gates.confirm_plan == false`, auto-proceed)
  - Execute now → build executionContext, hand off to /workflow:execute
package/workflows/test.md CHANGED
@@ -416,6 +416,16 @@ If re-verify still has gaps: report remaining gaps, suggest manual intervention.
 
  ---
 
+ ### Step 12.5: UAT Confidence Scoring
+
+ Dimensions (4): scenario_coverage, diagnostic_depth, observation_quality, closure_completeness. Factors (weights): requirements_mapped(.30), observation_specificity(.25), user_validation(.20), diagnostic_depth(.15), consistency(.10). Score at: init (Step 5), per user response (Step 8), after gap-fix loop (Step 12).
+
+ Quality mechanisms: Pressure Pass — >80% pass → ask user to try edge case. Devil's Advocate — >70% first-try pass → challenge scenario difficulty. Stall Detection — 2 gap-fix iterations without improvement → stop.
+
+ Readiness Gate (blocks Step 13): scenario_coverage < 40% | blocker gap without diagnosis | no pressure pass (if >80%) | unresolved gaps without acknowledgment. Append confidence summary to uat.md.
+
+ ---
+
  ### Step 13: Report
 
  ```