maestro-flow 0.3.39 → 0.3.40 (npm package version diff, Mend)

This diff shows the contents of publicly available package versions released to a supported registry. It is provided for informational purposes only and reflects the changes between versions as they appear in their public registries.

Files changed (21)

package/.claude/commands/maestro-analyze.md +4 -0
package/.claude/commands/maestro-brainstorm.md +4 -0
package/.claude/commands/maestro-plan.md +5 -1
package/.claude/commands/maestro-ralph.md +39 -9
package/.claude/commands/quality-auto-test.md +4 -0
package/.claude/commands/quality-debug.md +5 -1
package/.claude/commands/quality-test.md +4 -0
package/.codex/skills/maestro-analyze/SKILL.md +7 -0
package/.codex/skills/maestro-brainstorm/SKILL.md +12 -3
package/.codex/skills/maestro-plan/SKILL.md +10 -0
package/.codex/skills/maestro-ralph/SKILL.md +20 -8
package/.codex/skills/quality-auto-test/SKILL.md +6 -0
package/.codex/skills/quality-debug/SKILL.md +7 -1
package/.codex/skills/quality-test/SKILL.md +9 -0
package/package.json +1 -1
package/workflows/analyze.md +24 -2
package/workflows/auto-test.md +12 -0
package/workflows/brainstorm.md +11 -1
package/workflows/debug.md +13 -4
package/workflows/plan.md +14 -4
package/workflows/test.md +10 -0

package/.claude/commands/maestro-analyze.md CHANGED

@@ -108,6 +108,10 @@ Full mode:
 - [ ] analysis.md written with all 6 dimensions scored with evidence
 - [ ] conclusions.json created with recommendations and decision trail
 - [ ] Intent Coverage tracked and verified (no unresolved ❌ items)
+- [ ] Confidence tracking initialized (Step 4.6) and re-scored each round (Step 5.8)
+- [ ] Readiness gate checked before synthesis (Step 5.10)
+- [ ] Pressure pass completed ≥ 1 time before Step 6
+- [ ] Confidence summary with factor decomposition written to analysis.md
 Gaps mode:
 - [ ] Issues loaded from issues.jsonl (all open/registered, or single ISS-ID)

package/.claude/commands/maestro-brainstorm.md CHANGED

@@ -91,6 +91,10 @@ Single role mode:
 - [ ] Final Output Gate passed (Step 5.5) or `--yes` bypassed
 - [ ] All user decisions captured with Decision Recording Protocol
 - [ ] Session metadata updated with completion status
+- [ ] Confidence scored per role completion and after cross-role analysis
+- [ ] Readiness gate checked before spec generation
+- [ ] Pressure pass completed on at least 1 feature spec
+- [ ] Confidence summary appended to synthesis-changelog.md
 **Single role mode**:
 - [ ] analysis.md written to `{output_dir}/{role}/`

package/.claude/commands/maestro-plan.md CHANGED

@@ -143,8 +143,12 @@ Follow workflow plan.md § "Revise Mode" and § "Check Mode" respectively. These
 - [ ] Every task has `read_first[]` with at least the file being modified + source of truth files
 - [ ] Every task has `convergence.criteria[]` with grep-verifiable conditions (no subjective language)
 - [ ] Every task `action` and `implementation` contain concrete values (no "align X with Y")
+- [ ] Plan confidence scored in P4 with 5-dimension factor model
+- [ ] Plan readiness gate checked before P4.5 collision detection
+- [ ] Pressure pass completed on highest-complexity task
+- [ ] plan.json includes confidence section (overall, dimensions, pressure_pass)
 - [ ] Collision detection executed against same-milestone plans (non-blocking)
 - [ ] Plan-checker passed (or minor issues acknowledged)
-- [ ] User confirmation captured (execute/modify/cancel)
+- [ ] User confirmation captured (execute/modify/cancel) with confidence displayed
 - [ ] Artifact registered in state.json with correct scope/milestone/phase/depends_on
 </success_criteria>

package/.claude/commands/maestro-ralph.md CHANGED

@@ -326,10 +326,28 @@ For quality-gate decisions (post-verify, post-business-test, post-review, post-t
 | post-review | `{artifact_dir}/review.json` |
 | post-test | `{artifact_dir}/uat.md`, `{artifact_dir}/.tests/test-results.json` |
+**Confidence-aware evaluation**:
+Before delegating, check if artifact contains a confidence section (added by downstream commands):
+- `verification.json` → `confidence.overall` (from maestro-verify)
+- `report.json` → `confidence.overall` (from quality-auto-test)
+- `review.json` → may contain dimension confidence (from quality-review)
+- `uat.md` → confidence summary section (from quality-test)
+If confidence data found, include in delegate prompt as additional signal:
+```
+已有置信度评估: 整体 {overall}%, 最弱维度: {weakest} ({score}%)
+```
+**Confidence-based verdict bias**: When artifact confidence is available:
+- confidence < 60% → bias toward "fix" even if surface status looks clean (hidden quality gaps)
+- confidence 60-95% → use delegate verdict as-is
+- confidence > 95% → bias toward "proceed" (strong evidence of quality)
 ```
 Bash({
   command: `maestro delegate "PURPOSE: 评估 ${meta.decision} 质量门结果，判断是否通过
-TASK: 读取结果文件 | 分析通过/失败状态 | 评估问题严重性 | 给出下一步建议
+TASK: 读取结果文件 | 分析通过/失败状态 | 评估问题严重性 | 检查置信度评分 | 给出下一步建议
 MODE: analysis
 CONTEXT: @${result_files}
 EXPECTED: 严格按以下格式输出:
@@ -338,8 +356,10 @@ STATUS: proceed | fix | escalate
 REASON: 一句话解释
 GAP_SUMMARY: 具体问题描述（仅 fix/escalate 时填写，用于传递给 quality-debug）
 CONFIDENCE: high | medium | low
+CONFIDENCE_SCORE: 0-100（从结果文件中读取置信度分数，无则估算）
+WEAKEST_DIMENSION: 最弱维度名称
 ---END---
-CONSTRAINTS: 只评估不修改 | STATUS 三选一 | 如果 retry ${meta.retry_count}/${meta.max_retries} 已达上限且仍有问题则必须 escalate" --role analyze --mode analysis`,
+CONSTRAINTS: 只评估不修改 | STATUS 三选一 | 置信度 < 60% 倾向 fix | 如果 retry ${meta.retry_count}/${meta.max_retries} 已达上限且仍有问题则必须 escalate" --role analyze --mode analysis`,
   run_in_background: true
 })
 STOP — wait for callback.
@@ -352,12 +372,20 @@ STOP — wait for callback.
 Parse structured response:
 ```
 Extract between ---VERDICT--- and ---END---:
-  verdict.status   = "proceed" | "fix" | "escalate"
-  verdict.reason   = string
-  verdict.gap_summary = string (context for quality-debug)
-  verdict.confidence = "high" | "medium" | "low"
+  verdict.status           = "proceed" | "fix" | "escalate"
+  verdict.reason           = string
+  verdict.gap_summary      = string (context for quality-debug)
+  verdict.confidence       = "high" | "medium" | "low"
+  verdict.confidence_score = 0-100 (numeric, from artifact or estimated)
+  verdict.weakest_dimension = string (weakest confidence dimension)
 If parse fails → fallback: treat as "fix" with generic gap_summary
+Confidence-based verdict adjustment (after parse, before apply):
+  If verdict.confidence_score < 60 AND verdict.status == "proceed":
+    → Override to "fix", reason += " (置信度不足: {score}%，{weakest_dimension} 需加强)"
+  If verdict.confidence_score > 95 AND verdict.status == "fix" AND retry_count > 0:
+    → Suggest "proceed" override, reason += " (置信度充分: {score}%，建议通过)"
 ```
 **Apply verdict:**
@@ -503,9 +531,11 @@ End.
 - [ ] Full quality pipeline generated: verify → business-test → review → test-gen → test
 - [ ] Decision nodes inserted after: post-verify, post-business-test, post-review, post-test, post-milestone
 - [ ] Quality-gate decisions delegated via `maestro delegate --role analyze --mode analysis`
-- [ ] Delegate verdict parsed: STATUS / REASON / GAP_SUMMARY / CONFIDENCE
-- [ ] `-y` mode: auto-follow delegate verdict without user confirmation
-- [ ] Interactive mode: display recommendation + AskUserQuestion with override options
+- [ ] Delegate verdict parsed: STATUS / REASON / GAP_SUMMARY / CONFIDENCE / CONFIDENCE_SCORE / WEAKEST_DIMENSION
+- [ ] Confidence-based verdict adjustment applied (< 60% bias fix, > 95% bias proceed)
+- [ ] Artifact confidence sections read when available (verification.json, report.json, uat.md)
+- [ ] `-y` mode: auto-follow adjusted verdict without user confirmation
+- [ ] Interactive mode: display recommendation with confidence score + AskUserQuestion with override options
 - [ ] Delegate failure fallback: treat as "fix" verdict
 - [ ] gap_summary from delegate passed to quality-debug as context
 - [ ] Fix-loop templates applied per decision type with retry_count increment
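The structured verdict format in this hunk (fields between `---VERDICT---` and `---END---`, with a "fix" fallback on parse failure) can be illustrated with a minimal Python sketch. It is not the package's implementation: the function name `parse_verdict`, the `FALLBACK` wording, and the handling of trailing text after the score are assumptions made for illustration.

```python
import re

VALID_STATUS = {"proceed", "fix", "escalate"}

# Per the diff, a failed parse is treated as "fix" with a generic gap_summary.
# The other fallback values below are assumptions made for this sketch.
FALLBACK = {
    "status": "fix",
    "reason": "delegate output could not be parsed",
    "gap_summary": "generic: re-check the quality-gate result files",
    "confidence": "low",
    "confidence_score": None,
    "weakest_dimension": "",
}


def parse_verdict(output):
    """Extract the fields between ---VERDICT--- and ---END--- from delegate output."""
    block = re.search(r"---VERDICT---(.*?)---END---", output, re.DOTALL)
    if block is None:
        return dict(FALLBACK)

    fields = {}
    for line in block.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that do not have a "KEY: value" shape
            fields[key.strip().upper()] = value.strip()

    status = fields.get("STATUS", "").lower()
    if status not in VALID_STATUS:
        return dict(FALLBACK)

    # CONFIDENCE_SCORE may carry trailing text; keep only the leading integer.
    digits = re.match(r"\d+", fields.get("CONFIDENCE_SCORE", ""))
    score = min(int(digits.group()), 100) if digits else None

    return {
        "status": status,
        "reason": fields.get("REASON", ""),
        "gap_summary": fields.get("GAP_SUMMARY", ""),
        "confidence": fields.get("CONFIDENCE", "").lower(),
        "confidence_score": score,
        "weakest_dimension": fields.get("WEAKEST_DIMENSION", ""),
    }
```

Returning a copy of `FALLBACK` keeps callers from mutating the shared default, and an unknown `STATUS` value is handled the same way as a missing block so that the "STATUS is one of three" constraint is enforced on the reading side as well.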

package/.claude/commands/quality-auto-test.md CHANGED

@@ -116,6 +116,10 @@ Append to state.json.artifacts[]:
 - [ ] Tests executed progressively (L0→L3) with fail-fast on critical
 - [ ] Iteration engine ran (inner: test_defect fix, outer: strategy adjust)
 - [ ] state.json, report.json, reflection-log.md written
+- [ ] Test confidence scored per iteration (Step 7.5) with 5-dimension factor model
+- [ ] Convergence check includes confidence >= 60% alongside pass_rate threshold
+- [ ] Pressure pass completed on highest-pass-rate layer before completion
+- [ ] report.json includes confidence section
 - [ ] index.json updated with auto_test section
 - [ ] If spec source: traceability matrix built, traceability.md written
 - [ ] If failures: issues auto-created in issues.jsonl

package/.claude/commands/quality-debug.md CHANGED

@@ -115,7 +115,11 @@ If user confirms, invoke `Skill({ skill: "spec-add", args: "<category> <content>
 - [ ] evidence.ndjson written with structured NDJSON entries
 - [ ] understanding.md tracks evolving understanding per cluster
 - [ ] Root causes collected with fix_direction and affected_files
+- [ ] Multi-factor confidence scored per gap (Step 7.0) replacing simple high/medium/low
+- [ ] Readiness gate checked before ROOT CAUSE declaration
+- [ ] Pressure pass completed on confirmed hypothesis
+- [ ] Confidence table appended to understanding.md
 - [ ] If --from-uat: uat.md gaps updated with diagnosis artifacts
-- [ ] Results unified into diagnosis summary
+- [ ] Results unified into diagnosis summary with confidence section
 - [ ] Next step routed (plan --gaps + execute if fix needed, verify if fix applied, resume if inconclusive)
 </success_criteria>

package/.claude/commands/quality-test.md CHANGED

@@ -95,6 +95,10 @@ Append to state.json.artifacts[]:
 - [ ] Severity inferred from natural language (never asked)
 - [ ] Batched writes: on issue, every 5 passes, or completion
 - [ ] test-results.json and coverage-report.json written
+- [ ] UAT confidence scored with 4-dimension factor model
+- [ ] Readiness gate checked before final report
+- [ ] Pressure pass completed if > 80% pass rate
+- [ ] Confidence summary appended to uat.md
 - [ ] index.json uat fields updated
 - [ ] If issues: parallel debug agents spawned per gap cluster
 - [ ] Gaps updated with root_cause, fix_direction, affected_files

package/.codex/skills/maestro-analyze/SKILL.md CHANGED

@@ -112,6 +112,7 @@ id,title,description,dimension,analysis_type,deps,context_from,wave,status,findi
 | `findings` | Output | Key findings summary (max 500 chars) |
 | `score` | Output | Dimension score (0-100 for scoring tasks, empty for explore/decide) |
 | `recommendations` | Output | Dimension-specific recommendations |
+| `confidence_score` | Output | Per-dimension confidence score (0-100) from factor-based assessment |
 | `error` | Output | Error message if failed |
 ### Per-Wave CSV (Temporary)
@@ -356,6 +357,10 @@ Write wave CSV with `prev_context`, execute `spawn_agents_on_csv` for synthesis
 {prioritized recommendations with rationale}
 ```
+3b. **Confidence scoring** (full mode only):
+   Factors (weights): findings_depth(.30), evidence_strength(.25), coverage_breadth(.20), user_validation(.15, 0 in CSV mode), consistency(.10). Overall = average of dimension scores. Thresholds: <60% deeper | 60-80% optional | 80-95% converging | >95% converge. Append confidence summary to `analysis.md` and `conclusions.json`.
 4. Build `context.md` (both modes):
 ```markdown
@@ -479,6 +484,8 @@ echo '{"ts":"<ISO>","worker":"{id}","type":"exploration_finding","data":{"file":
 - [ ] analysis.md + conclusions.json produced (full mode only)
 - [ ] Deferred items auto-created as issues
 - [ ] Artifact registered in state.json
+- [ ] Confidence scored per dimension with factor-based model (full mode only)
+- [ ] Confidence summary appended to analysis.md and conclusions.json
 - [ ] Final outputs copied to scratchDir
 - [ ] discoveries.ndjson append-only throughout
 </success_criteria>
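The factor model added in step 3b (weighted factors per dimension, overall score as the average of dimension scores, then threshold bands) can be sketched as follows. This is an illustrative reading, not code from the package: factor scores are assumed to be on a 0-100 scale, "0 in CSV mode" is read as the `user_validation` factor scoring 0, and the handling of the exact boundary values 80 and 95 is an assumption because the diff leaves it open.

```python
# Factor weights as listed in step 3b of the diff.
WEIGHTS = {
    "findings_depth": 0.30,
    "evidence_strength": 0.25,
    "coverage_breadth": 0.20,
    "user_validation": 0.15,  # scored 0 in CSV mode (assumed reading)
    "consistency": 0.10,
}


def dimension_confidence(factors):
    """Weighted sum of factor scores (each assumed 0-100); missing factors count as 0."""
    return sum(weight * factors.get(name, 0.0) for name, weight in WEIGHTS.items())


def overall_confidence(dimensions):
    """Overall confidence is the plain average of the per-dimension scores."""
    scores = [dimension_confidence(factors) for factors in dimensions.values()]
    return sum(scores) / len(scores) if scores else 0.0


def band(score):
    """Map a score to the threshold bands from the diff (boundary handling assumed)."""
    if score < 60:
        return "deeper"
    if score <= 80:
        return "optional"
    if score <= 95:
        return "converging"
    return "converge"
```

Under this reading, a dimension whose other four factors all score 80 reaches only 68% in CSV mode, so the missing user validation alone keeps it in the "optional" band.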

package/.codex/skills/maestro-brainstorm/SKILL.md CHANGED

@@ -364,9 +364,15 @@ spawn_agents_on_csv({
 - Skill: maestro-roadmap --mode full -- Generate full spec package from brainstorm
 ```
-4. Copy artifacts to output `.brainstorming/` directory (phase mode or scratch mode target)
-5. Update phase `index.json` with brainstorm status (if phase mode)
-6. **Next-Step Routing** (skip if AUTO_YES — default to first applicable):
+4. **Brainstorm confidence scoring**:
+   Dimensions (5): role_coverage, cross_role_consistency, feature_completeness, spec_quality, design_feasibility. Factors (weights): analysis_depth(.30), evidence_strength(.25), coverage_breadth(.20), user_validation(.15, 0 if --yes), consistency(.10). Append confidence summary to `synthesis-changelog.md`.
+   **Conflict-based quality gate**: >3 `[UNRESOLVED]` conflicts → warn before artifact registration.
+5. Copy artifacts to output `.brainstorming/` directory (phase mode or scratch mode target)
+6. Update phase `index.json` with brainstorm status (if phase mode)
+7. **Next-Step Routing** (skip if AUTO_YES — default to first applicable):
    - Detect UI features: scan feature-index.json for UI/frontend-related features (keywords: ui, interface, page, component, dashboard, form, layout)
    - `request_user_input` (include UI Design option only when UI features detected):
      ```json
@@ -439,4 +445,7 @@ echo '{"ts":"<ISO>","worker":"{id}","type":"terminology","data":{"term":"CRDT","
 - [ ] context.md produced with full brainstorm report
 - [ ] Artifacts copied to target .brainstorming/ directory
 - [ ] discoveries.ndjson append-only throughout
+- [ ] Confidence scored per role and after cross-role synthesis
+- [ ] Conflict-based quality gate evaluated (> 3 unresolved = warning)
+- [ ] Confidence summary appended to synthesis-changelog.md
 </success_criteria>
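The conflict-based quality gate in this hunk (warn when more than 3 `[UNRESOLVED]` conflicts remain) amounts to counting tags before artifact registration. A minimal sketch, assuming the conflicts are available as text carrying the literal `[UNRESOLVED]` tag; the function names and the warning wording are hypothetical.

```python
UNRESOLVED_TAG = "[UNRESOLVED]"


def count_unresolved(text):
    """Count literal [UNRESOLVED] tags in the synthesized conflict text."""
    return text.count(UNRESOLVED_TAG)


def conflict_warning(text, limit=3):
    """Return a warning string when more than `limit` conflicts are unresolved, else None."""
    unresolved = count_unresolved(text)
    if unresolved > limit:
        return f"{unresolved} unresolved conflicts (limit {limit}): review before artifact registration"
    return None
```

The gate is a warning rather than a block, which matches the diff's wording ("warn before artifact registration"); exactly 3 unresolved conflicts do not trigger it.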

package/.codex/skills/maestro-plan/SKILL.md CHANGED

@@ -359,6 +359,12 @@ spawn_agents_on_csv({
 1. **Plan checking** (inline, not a separate wave):
    Read `plan.json` + all `.task/TASK-*.json`. Validate: requirements coverage, file feasibility, dependency correctness (no cycles, valid wave order), grep-verifiable convergence criteria, read_first completeness, action concreteness, no parallel file conflicts, **task count within complexity threshold** (reject over-split plans), **no per-file splitting** (each task must be feature-level).
+1b. **Plan confidence scoring**:
+   Dimensions (5): requirements_coverage, task_quality, dependency_correctness, estimation_accuracy, collision_safety. Factors (weights): completeness(.30), specificity(.25), structural_validity(.20), user_validation(.15), consistency(.10). Add `confidence` section to `plan.json`.
+   **Readiness gate**: Block if requirements_coverage < 40% or any task missing read_first/convergence.criteria.
 2. **Revision loop** (max 3 rounds): If critical issues found, regenerate affected tasks.
 2b. **Spec Enrichment**: Persist cross-task reusable design decisions:
@@ -397,6 +403,7 @@ spawn_agents_on_csv({
    Tasks: {task_count} tasks in {wave_count} waves
    Check: {checker_status} (iteration {check_count}/{max_checks})
    Collision: {collision_status}
+   Confidence: {overall}% (weakest: {dim})
    Next steps:
      $maestro-execute "{phase}"     -- Execute the plan
@@ -464,6 +471,9 @@ echo '{"ts":"<ISO>","worker":"{id}","type":"existing_pattern","data":{"name":"Re
 - [ ] plan.json produced in phase directory
 - [ ] .task/TASK-*.json files produced for all tasks
 - [ ] Plan passes quality checks (coverage, deps, criteria)
+- [ ] Plan confidence scored with 5-dimension factor model
+- [ ] Readiness gate checked before confirmation
+- [ ] plan.json includes confidence section
 - [ ] Collision detection executed against same-milestone plans
 - [ ] PLN artifact registered in state.json
 - [ ] context.md produced with exploration findings + plan overview
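The plan readiness gate added in step 1b blocks when `requirements_coverage` is below 40% or when any task lacks `read_first` or `convergence.criteria`. The check could look roughly like the sketch below. The task dictionaries mirror the `read_first[]` and `convergence.criteria[]` fields named in the diff; the `id` key, the function name, and the message wording are assumptions.

```python
def plan_readiness_blockers(requirements_coverage, tasks):
    """Return the reasons the plan readiness gate would block (empty list = ready)."""
    blockers = []
    if requirements_coverage < 40:
        blockers.append(f"requirements_coverage {requirements_coverage}% is below 40%")

    for task in tasks:
        task_id = task.get("id", "<unknown>")  # "id" key is assumed for this sketch
        if not task.get("read_first"):
            blockers.append(f"{task_id}: missing read_first")
        # Tolerate a missing or null convergence object.
        if not (task.get("convergence") or {}).get("criteria"):
            blockers.append(f"{task_id}: missing convergence.criteria")
    return blockers
```

Returning the list of reasons, instead of a bare boolean, lets the confirmation summary show the user which tasks need attention.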

package/.codex/skills/maestro-ralph/SKILL.md CHANGED

@@ -355,8 +355,10 @@ STATUS: proceed | fix | escalate
 REASON: 一句话解释
 GAP_SUMMARY: 问题描述（fix/escalate 时填写）
 CONFIDENCE: high | medium | low
+CONFIDENCE_SCORE: 0-100（从结果文件中读取置信度分数，无则估算）
+WEAKEST_DIMENSION: 最弱维度名称
 ---END---
-CONSTRAINTS: 只评估 | STATUS 三选一 | retry ${meta.retry_count}/${meta.max_retries} 达上限必须 escalate" --role analyze --mode analysis`,
+CONSTRAINTS: 只评估 | STATUS 三选一 | 置信度 < 60% 倾向 fix | retry ${meta.retry_count}/${meta.max_retries} 达上限必须 escalate" --role analyze --mode analysis`,
   yield_time_ms: 30000,
   max_output_tokens: 6000
 })
@@ -366,17 +368,25 @@ CONSTRAINTS: 只评估 | STATUS 三选一 | retry ${meta.retry_count}/${meta.max
 **Parse verdict** (on callback):
 ```
-Extract STATUS / REASON / GAP_SUMMARY / CONFIDENCE from output.
+Extract STATUS / REASON / GAP_SUMMARY / CONFIDENCE / CONFIDENCE_SCORE / WEAKEST_DIMENSION from output.
 If parse fails → fallback: STATUS = "fix", GAP_SUMMARY = generic
+Confidence-based verdict adjustment (after parse, before apply):
+  If CONFIDENCE_SCORE < 60 AND STATUS == "proceed":
+    → Override to "fix", REASON += " (置信度不足: {score}%，{weakest} 需加强)"
+  If CONFIDENCE_SCORE > 95 AND STATUS == "fix" AND retry_count > 0:
+    → Suggest "proceed" override, REASON += " (置信度充分: {score}%，建议通过)"
 ```
+**Confidence-aware evaluation**: Before delegating, check if artifact contains confidence section (added by downstream commands). If found, include `已有置信度评估: 整体 {overall}%, 最弱维度: {weakest} ({score}%)` in delegate prompt as additional signal.
 **Apply verdict:**
 | Mode | Behavior |
 |------|----------|
-| `-y` (auto_mode) | Follow verdict directly, no user prompt |
-| Interactive + confidence == "high" | Display recommendation, prompt user with options below |
-| Interactive + confidence != "high" | Display recommendation **with warning**, prompt user with options below |
+| `-y` (auto_mode) | Follow adjusted verdict directly, no user prompt |
+| Interactive + confidence_score >= 80 | Display recommendation with confidence, prompt user |
+| Interactive + confidence_score < 80 | Display recommendation **with confidence warning**, prompt user |
 Interactive prompt (via `request_user_input`):
 ```json
@@ -665,9 +675,11 @@ Rules:
 - [ ] Conditional steps evaluated at decision time (coverage threshold)
 - [ ] buildSkillCall() completes arg enrichment + auto flag, CSV contains full commands
 - [ ] Quality-gate decisions delegate-evaluated via `maestro delegate --role analyze`
-- [ ] Delegate verdict parsed: STATUS / REASON / GAP_SUMMARY / CONFIDENCE
-- [ ] `-y` mode: auto-follow delegate verdict, no STOP (except post-debug-escalate)
-- [ ] Interactive mode: display recommendation + request_user_input with override
+- [ ] Delegate verdict parsed: STATUS / REASON / GAP_SUMMARY / CONFIDENCE / CONFIDENCE_SCORE / WEAKEST_DIMENSION
+- [ ] Confidence-based verdict adjustment applied (< 60% bias fix, > 95% bias proceed)
+- [ ] Artifact confidence sections read when available as additional signal
+- [ ] `-y` mode: auto-follow adjusted verdict, no STOP (except post-debug-escalate)
+- [ ] Interactive mode: display recommendation with confidence score + request_user_input with override
 - [ ] Delegate failure fallback: treat as "fix" verdict
 - [ ] passed_gates[] tracks passed quality gates, skips re-runs in retry loops
 - [ ] passed_gates cleared when code changes (fix-loop inserts execute step)
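The confidence-based verdict adjustment and the interactive display rule from this hunk can be sketched in Python as below. This is an illustration under stated assumptions: the verdict is a dictionary with `status`, `reason`, `confidence_score`, and `weakest_dimension` keys, the `suggested_override` key is a hypothetical way to carry the "suggest proceed" case, and the reason suffixes are English renderings of the Chinese strings in the diff ("置信度不足" = insufficient confidence, "置信度充分" = sufficient confidence).

```python
def adjust_verdict(verdict, retry_count):
    """Apply the post-parse confidence adjustment and return a new verdict dict."""
    adjusted = dict(verdict)
    adjusted["suggested_override"] = None  # hypothetical field for the "suggest" case
    score = verdict.get("confidence_score")
    if score is None:
        return adjusted  # no numeric score: leave the delegate verdict as-is

    weakest = verdict.get("weakest_dimension", "")
    if score < 60 and verdict["status"] == "proceed":
        # Low confidence overrides a clean-looking "proceed".
        adjusted["status"] = "fix"
        adjusted["reason"] = f"{verdict['reason']} (insufficient confidence: {score}%, strengthen {weakest})"
    elif score > 95 and verdict["status"] == "fix" and retry_count > 0:
        # High confidence after at least one retry only suggests "proceed".
        adjusted["suggested_override"] = "proceed"
        adjusted["reason"] = f"{verdict['reason']} (sufficient confidence: {score}%, suggest proceeding)"
    return adjusted


def needs_confidence_warning(score):
    """Interactive mode shows a confidence warning below 80, per the table in the diff."""
    return score < 80
```

Note the asymmetry in the diff: a low score overrides "proceed" outright, while a high score only suggests overriding "fix", and only once `retry_count` is above 0.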

package/.codex/skills/quality-auto-test/SKILL.md CHANGED

@@ -385,6 +385,9 @@ OUTER LOOP (max_iter iterations):
     Analyze: pass rate delta, failure clusters, strategy effectiveness
     Append to reflection-log.md
+    **Test confidence scoring** (at each REFLECT step):
+       Dimensions (5): scenario_coverage, test_quality, diagnostic_accuracy, strategy_effectiveness, infrastructure_fitness. Factors (weights): completeness(.30), pass_rate_trend(.25), classification_accuracy(.20), coverage_breadth(.15), consistency(.10). Enhanced convergence: BOTH pass_rate ≥ threshold AND confidence ≥ 60%. Add confidence to `report.json`.
   ADJUST (Adaptive Strategy):
     IF startStrategy provided AND iteration == 1: use startStrategy as initial
     OTHERWISE auto-select:
@@ -540,6 +543,9 @@ CSV Session: .tests/auto-test/.csv-session/
 - [ ] discoveries.ndjson append-only throughout
 - [ ] Cross-layer context propagation via prev_context column
 - [ ] Iteration engine ran (inner: test_defect fix, outer: strategy adjust)
+- [ ] Test confidence scored per iteration with 5-dimension factor model
+- [ ] Convergence check includes confidence >= 60% alongside pass_rate
+- [ ] Confidence section added to report.json
 - [ ] state.json, report.json, reflection-log.md written
 - [ ] If spec: traceability.md produced
 - [ ] If failures: issues auto-created in issues.jsonl
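The enhanced convergence rule in this hunk requires both `pass_rate ≥ threshold` and `confidence ≥ 60%`. A minimal decision sketch follows. The pass-rate threshold is left as a parameter because this hunk does not state its value, and checking convergence before the `max_iter` limit is an assumption about precedence; the function name and return labels are hypothetical.

```python
def convergence_decision(pass_rate, confidence, iteration, max_iter, pass_rate_threshold):
    """Decide the outer loop's next move after a REFLECT step.

    Returns "converged", "stop_max_iter", or "continue".
    """
    if pass_rate >= pass_rate_threshold and confidence >= 60:
        return "converged"
    if iteration >= max_iter:
        return "stop_max_iter"
    # Includes the case of a high pass rate with low confidence:
    # passing tests alone are not treated as convergence.
    return "continue"
```

A run with a 97% pass rate but 50% confidence keeps iterating under this rule, which is the behavioral change the hunk introduces compared with a pass-rate-only check.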

package/.codex/skills/quality-debug/SKILL.md CHANGED

@@ -257,6 +257,10 @@ spawn_agents_on_csv({
 2. **Generate context.md**: Debug report with summary (mode, hypothesis/confirmed/fixed/verified counts), per-hypothesis results (hypothesis, evidence for/against, findings, status), per-fix results (fix applied, verified, findings), aggregated root causes, and next steps.
+2b. **Debug confidence scoring**:
+   Dimensions (4): hypothesis_quality, evidence_completeness, root_cause_isolation, fix_confidence. Factors (weights): evidence_depth(.30), evidence_strength(.25), coverage_breadth(.20), reproduction(.15), consistency(.10). Map to legacy: <40% = low, 40-70% = medium, >70% = high. Append confidence assessment to context.md.
 3. **UAT update** (if --from-uat): Update `uat.md` gaps with `root_cause`, `fix_direction`, `affected_files` for confirmed hypotheses.
 4. **Issue update**: If `issues.jsonl` exists, update matching issues with status `diagnosed`, add `context.suggested_fix` and `context.notes`.
@@ -296,7 +300,7 @@ spawn_agents_on_csv({
 | Type | Dedup Key | Data Schema | Description |
 |------|-----------|-------------|-------------|
-| `root_cause` | `data.location` | `{location, cause, severity, confidence}` | Confirmed root cause |
+| `root_cause` | `data.location` | `{location, cause, severity, confidence_score, confidence_factors}` | Confirmed root cause |
 | `hypothesis_evidence` | `data.hypothesis+data.location` | `{hypothesis, location, type, conclusion}` | Evidence for/against hypothesis |
 | `affected_component` | `data.component` | `{component, files[], impact}` | Component affected by bug |
 | `reproduction_path` | `data.trigger` | `{trigger, steps[], frequency}` | Bug reproduction path |
@@ -333,6 +337,8 @@ echo '{"ts":"<ISO>","worker":"{id}","type":"root_cause","data":{"location":"src/
 - [ ] Refuted/inconclusive hypotheses correctly skip wave 2 fix tasks
 - [ ] Wave 2 fixes attempted only for confirmed hypotheses
 - [ ] context.md produced with diagnosis summary
+- [ ] Multi-factor confidence scored per hypothesis replacing simple high/medium/low
+- [ ] Confidence assessment appended to context.md
 - [ ] UAT gaps updated (if --from-uat)
 - [ ] Issues updated with diagnosis results
 - [ ] discoveries.ndjson append-only throughout
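The legacy mapping in step 2b (<40% = low, 40-70% = medium, >70% = high) and the new `root_cause` data schema can be sketched together. This is illustrative only: the helper names and all sample values are hypothetical, and the envelope keys (`ts`, `worker`, `type`, `data`) follow the `echo` example shown in the hunk context.

```python
import json
from datetime import datetime, timezone


def legacy_level(score):
    """Map a 0-100 confidence score to the legacy high/medium/low level."""
    if score < 40:
        return "low"
    if score <= 70:
        return "medium"
    return "high"


def root_cause_line(worker, location, cause, severity, score, factors):
    """Build one NDJSON line for a root_cause discovery using the new data schema."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "worker": worker,
        "type": "root_cause",
        "data": {
            "location": location,
            "cause": cause,
            "severity": severity,
            "confidence_score": score,
            "confidence_factors": factors,
        },
    }
    # ensure_ascii=False keeps any non-ASCII text readable in the log.
    return json.dumps(entry, ensure_ascii=False)
```

Consumers that still expect the old `confidence` string can derive it with `legacy_level(data["confidence_score"])` instead of reading it from the entry.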

package/.codex/skills/quality-test/SKILL.md CHANGED

@@ -434,6 +434,12 @@ Options:
 If re-verify passes: update uat.md gaps as resolved, report success.
 If gaps remain after 2 iterations: report remaining, suggest manual intervention.
+### Step 12.5: UAT Confidence Scoring
+Dimensions (4): scenario_coverage, diagnostic_depth, observation_quality, closure_completeness. Factors (weights): requirements_mapped(.30), observation_specificity(.25), user_validation(.20), diagnostic_depth(.15), consistency(.10). Append confidence summary to `uat.md`.
+**Readiness gate** (before final report): Block if scenario_coverage < 40% or any blocker-severity gap without diagnosis.
 ### Step 13: Report
 ```
@@ -490,6 +496,9 @@ Files:
 - [ ] test-results.json and coverage-report.json written
 - [ ] index.json uat fields updated
 - [ ] Artifact registered in state.json
+- [ ] UAT confidence scored with 4-dimension factor model
+- [ ] Readiness gate checked before final report
+- [ ] Confidence summary appended to uat.md
 - [ ] If issues: diagnosis.csv built, spawn_agents_on_csv executed per gap cluster
 - [ ] Gaps updated with root_cause, fix_direction, affected_files
 - [ ] Gap-fix loop triggered if --auto-fix (max 2 iterations)

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "maestro-flow",
-  "version": "0.3.39",
+  "version": "0.3.40",
   "description": "Workflow orchestration CLI with MCP endpoint support and extensible architecture",
   "type": "module",
   "imports": {

package/workflows/analyze.md CHANGED

@@ -252,6 +252,10 @@ Re-read original User Intent from discussion.md header. Check each intent item a
 Append initial Intent Coverage Check to discussion.md.
+**Step 4.6: Baseline Confidence Scoring**
+Dimensions = the 6 analysis dimensions. Factors (weights): findings_depth(.30), evidence_strength(.25), coverage_breadth(.20), user_validation(.15), consistency(.10). Score each factor per dimension from Round 1 results. Append baseline confidence table to discussion.md. Thresholds: <60% 继续深入 | 60-80% 可选 | 80-95% 接近收敛 | >95% 建议收敛.
 ### Step 5: Interactive Discussion Loop
 Max 5 rounds. Each round follows this sequence:
@@ -301,7 +305,19 @@ Re-read original User Intent. Check each item:
 If ❌ or ⚠️ items exist → proactively surface to user at start of next round.
-**Auto mode (-y)**: auto-deepen for up to 3 rounds, then synthesize.
+**5.8: Re-score Confidence** (every round):
+Re-evaluate factors per dimension. Show delta: `Confidence: {prev}% → {current}% ({±N%}), {weakest_dim} 仍需深入`
+**5.9: Quality Mechanisms**:
+- **Pressure Pass** (mandatory ≥1 before Step 6): highest-confidence finding → pressure ladder (evidence demand → assumption probe → boundary/tradeoff → root cause check). Record under `#### 压力测试`.
+- **Devil's Advocate**: dimension > 0.7 → challenge "如果 [finding] 不成立？" (once per dimension)
+- **Scope Minimizer**: findings > 5 + scope expanding → "最小可行结论集？"
+- **Stall Detection**: delta < 5% for 2 consecutive rounds → "分析可能停滞，建议切换方向或收敛"
+**5.10: Pre-Synthesis Readiness Gate** (on "分析完成"):
+Block if: ❌ items without deferral | any dimension < 40% | no pressure pass | unresolved contradictions. If blocked → AskUserQuestion: 补充后继续 or 忽略风险并继续 (record `residual_risks[]`).
+**Auto mode (-y)**: auto-deepen ≤3 rounds, readiness gate auto-overrides with residual risk recording.
 ### Step 6: Six-Dimension Scoring
@@ -318,9 +334,11 @@ Using all exploration findings, discussion insights, and user feedback, score ac
 Each dimension scored with specific evidence (code refs, data points from exploration).
+Each 1-5 score justified by confidence factors from Step 4.6/5.8. Include per-dimension confidence % alongside the score.
 Build probability-impact risk matrix from identified risks.
-Formulate Go/No-Go/Conditional recommendation with confidence level.
+Formulate Go/No-Go/Conditional recommendation with overall confidence %. Write confidence summary (per-dimension scores + overall + pressure pass result + residual risks) to analysis.md.
 ### Step 7: Synthesis & Conclusion
@@ -673,6 +691,10 @@ Replaceable blocks (overwritten each round):
 - At least 2 alternatives compared with tradeoffs
 - Go/No-Go/Conditional recommendation with confidence level
 - Code references included where relevant (file paths, line numbers)
+- Confidence tracking initialized and re-scored each round
+- Readiness gate checked before synthesis (Step 5.10)
+- Pressure pass completed ≥ 1 time before Step 6
+- Confidence summary with factor decomposition in analysis.md
 **Both modes (full + quick):**
 - context.md written with all decisions classified as Locked/Free/Deferred
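Two of the mechanisms added in Steps 5.8 and 5.9, the per-round delta line and stall detection (delta < 5% for 2 consecutive rounds), can be sketched as follows. This is a hedged illustration: the diff does not say whether the delta is signed, so the absolute change is used here as an assumption, and the trailing phrase of the delta line is an English rendering of "仍需深入" (still needs deeper analysis). The function names are hypothetical.

```python
def delta_line(prev, current, weakest_dim):
    """Format the per-round confidence delta shown to the user in Step 5.8."""
    return (
        f"Confidence: {prev:.0f}% → {current:.0f}% ({current - prev:+.0f}%), "
        f"{weakest_dim} still needs deeper analysis"
    )


def is_stalled(history, min_delta=5.0, rounds=2):
    """True when the last `rounds` round-to-round changes are all below `min_delta`.

    `history` is the list of overall confidence scores, one per round.
    Using the absolute change is an assumption made for this sketch.
    """
    if len(history) < rounds + 1:
        return False  # not enough rounds to observe `rounds` deltas
    recent = history[-(rounds + 1):]
    deltas = [abs(later - earlier) for earlier, later in zip(recent, recent[1:])]
    return all(delta < min_delta for delta in deltas)
```

For example, a history of 50, 62, 64, 66 is flagged as stalled because the last two changes are both 2 points, while 50, 62, 70 is not.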

package/workflows/auto-test.md CHANGED

@@ -498,6 +498,18 @@ END OUTER
 ---
+### Step 7.5: Test Confidence Scoring
+Scored after each REFLECT step. Dimensions (5): scenario_coverage, test_quality, diagnostic_accuracy, strategy_effectiveness, infrastructure_fitness. Factors (weights): completeness(.30), pass_rate_trend(.25), classification_accuracy(.20), coverage_breadth(.15), consistency(.10). Append confidence table to reflection-log.md.
+**Enhanced Convergence**: pass_rate ≥ 95% AND confidence ≥ 60% → converged. pass_rate ≥ 95% BUT confidence < 60% → continue (tests may be weak). max_iter reached or all failures = code_defect → Step 8.
+**Quality mechanisms**: Pressure Pass (before Step 8) — select 2-3 passing tests from highest-pass-rate layer, verify they exercise real behavior (not mock-only, non-trivial assertions). Devil's Advocate — pass_rate > 80% → challenge assertion specificity, error path coverage, mock over-reliance. Stall Detection — delta < 5% for 2 iterations + pass_rate flat → force Reflective strategy.
+**Readiness Gate** (before Step 8, skip if max_iter=1): scenario_coverage < 40% | no pressure pass | diagnostic_accuracy < 40% | unclassified failures. If blocked → force one additional iteration. Add confidence section to report.json.
+---
 ### Step 8: Complete & Write Artifacts
 1. Update session state:

package/workflows/brainstorm.md CHANGED

@@ -331,7 +331,7 @@ Each agent receives:
 ### Step 5: Synthesis Integration (Auto Mode)
-Six sub-phases producing feature specs from cross-role analysis:
+Seven sub-phases producing feature specs from cross-role analysis:
 **Sub-phase 1: Discovery**
 - Detect session, validate analysis files exist
@@ -349,6 +349,16 @@ Six sub-phases producing feature specs from cross-role analysis:
 - Output: `enhancement_recommendations` (EP-001, EP-002, ...) + `feature_conflict_map` (per-feature consensus/conflicts/cross_refs)
 - Conflict resolution quality: actionable, justified ("because...tradeoff:..."), scoped, confidence-tagged ([RESOLVED]|[SUGGESTED]|[UNRESOLVED])
+**Sub-phase 3B: Brainstorm Confidence Scoring**
+Compute after cross-role analysis, before user interaction. Dimensions (5): role_coverage, cross_role_consistency, feature_completeness, spec_quality, design_feasibility. Factors (weights): analysis_depth(.30), evidence_strength(.25), coverage_breadth(.20), user_validation(.15), consistency(.10).
+Scoring points: per-role completion → `role_coverage`; after cross-role → `cross_role_consistency`; after user interaction → `user_validation`.
+**Quality mechanisms**: Pressure Pass (before spec gen) — verify claims backed by ≥2 role analyses. Devil's Advocate — challenge when dimension > 0.7. Conflict Detection (replaces stall) — >3 `[UNRESOLVED]` conflicts → challenge.
+**Readiness Gate** (before Sub-phase 4): Block if role_coverage < 100% | cross_role_consistency < 40% | feature_completeness below intent count | no pressure pass. If blocked → AskUserQuestion: fill gaps or proceed with residual risks. Append confidence summary to synthesis-changelog.md.
 **Sub-phase 4: User Interaction**
 - Enhancement selection: AskUserQuestion (multiSelect=true, batched by 4)
 - Clarification questions: 9-category taxonomy scan, AskUserQuestion (single-select, multi-round)

package/workflows/debug.md CHANGED

@@ -137,7 +137,7 @@ For each cluster, spawn concurrently as general-purpose agent (`run_in_backgroun
 - **Input**: cluster name, phase, all gaps (test_id, truth, reason, severity). Mode: `symptoms_prefilled`.
 - **Process**: read source files, form 2-3 hypotheses per gap ranked by likelihood, search code for evidence, log each as NDJSON line, confirm/refute.
-- **Output per gap**: `root_cause`, `fix_direction`, `affected_files` (file:line), `confidence` (high/medium/low), `evidence` summary.
+- **Output per gap**: `root_cause`, `fix_direction`, `affected_files` (file:line), `confidence` (multi-factor; also include legacy `confidence_level`: high/medium/low for backward compat), `evidence` summary.
 - **Files**: `{debug_dir}/evidence-{cluster_slug}.ndjson`, `{debug_dir}/understanding-{cluster_slug}.md`
 All agents run concurrently. Collect all results.
@@ -198,7 +198,7 @@ For each agent result, extract:
 - root_cause per gap
 - fix_direction per gap
 - affected_files per gap
-- confidence level
+- confidence (multi-factor, see 7.1)
 - evidence summary
 Build unified diagnosis:
@@ -215,14 +215,23 @@ Build unified diagnosis:
           "root_cause": "...",
           "fix_direction": "...",
           "affected_files": ["src/components/Comments.tsx:42"],
-          "confidence": "high"
+          "confidence": { "overall": 0.78, "dimensions": {} }
         }
       ]
     }
-  ]
+  ],
+  "confidence": {}
 }
 ```
+### Step 7.0: Debug Confidence Scoring
+Dimensions (4): hypothesis_quality, evidence_completeness, root_cause_isolation, fix_confidence. Factors (weights): evidence_depth(.30), evidence_strength(.25), coverage_breadth(.20), reproduction(.15), consistency(.10). Map to legacy levels: <40% = low, 40-70% = medium, >70% = high.
+Quality mechanisms: Pressure Pass (before Step 9) — cross-check confirmed vs refuted hypotheses. Devil's Advocate — root_cause_isolation > 0.7 → "Is the root cause at a deeper layer?". Stall Detection — no new evidence + delta < 5% for 2 continuations → "Investigation may have stalled".
+Readiness Gate (blocks Step 9): pass requires all of evidence_completeness ≥ 40% | pressure pass done | no contradicting evidence | fix_direction names specific files. If blocked → AskUserQuestion: "Investigate further" or "Ignore risks and confirm". Append confidence table to understanding.md.
 ### Step 7.1: Update Issues with Diagnosis
 For each diagnosed gap with an `issue_id`, update the corresponding issue in `.workflow/issues/issues.jsonl`:

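A minimal sketch of the legacy-level mapping and the stall check described in Step 7.0 of debug.md. The thresholds come from the diff; treating exactly 70% as medium, representing scores as fractions in [0, 1], and the function names and the `history` shape are assumptions made for illustration.

```python
from __future__ import annotations


def legacy_confidence_level(overall: float) -> str:
    """Map an overall score to the legacy level: <40% low, 40-70% medium, >70% high."""
    if overall < 0.40:
        return "low"
    if overall <= 0.70:
        return "medium"
    return "high"


def gap_confidence_fields(overall: float, dimensions: dict[str, float]) -> dict:
    """Build the multi-factor `confidence` object plus the legacy `confidence_level`."""
    return {
        "confidence": {"overall": overall, "dimensions": dimensions},
        "confidence_level": legacy_confidence_level(overall),
    }


def is_stalled(history: list[tuple[float, bool]]) -> bool:
    """history holds one (overall_score, found_new_evidence) pair per continuation.

    Stalled when each of the last two continuations found no new evidence
    and moved the score by less than 5 percentage points.
    """
    if len(history) < 3:
        return False
    recent = history[-3:]
    for (prev_score, _), (score, new_evidence) in zip(recent, recent[1:]):
        if new_evidence or abs(score - prev_score) >= 0.05:
            return False
    return True
```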
package/workflows/plan.md CHANGED

@@ -251,12 +251,22 @@ Bidirectional linking: update matching issues in `.workflow/issues/issues.jsonl`
    - Critical issues → re-spawn planner with issues, revise, re-check
    - Warnings only → log and proceed
-3. **Update index.json**
-   - Set `index.json.plan` = `{ task_ids, task_count, complexity, waves, executor_assignments: {} }`
+3. **Plan Confidence Scoring**
+   Dimensions (5): requirements_coverage, task_quality, dependency_correctness, estimation_accuracy, collision_safety. Factors (weights): completeness(.30), specificity(.25), structural_validity(.20), user_validation(.15), consistency(.10). Re-score after each revision round.
+   Quality mechanisms: Pressure Pass (mandatory before P4.5) — verify highest-complexity task's read_first/convergence.criteria/action. Devil's Advocate — requirements_coverage > 0.7 → "Implicit requirements?". Scope Minimizer — task_count exceeds guard → "Minimum viable task set?". Stall Detection — delta < 5% → suggest broader revision.
+4. **Plan Readiness Gate** (blocks P4.5)
+   Block if: requirements_coverage < 40% | task missing read_first/convergence.criteria | no pressure pass | circular deps. If blocked → AskUserQuestion: "Revise plan" or "Ignore risks and continue" (record residual_risks). Add confidence section to plan.json.
+5. **Update index.json**
+   - Set `index.json.plan` = `{ task_ids, task_count, complexity, waves, executor_assignments: {}, confidence: overall_score }`
    - Set `status: "planning"`, `updated_at: now()`
 ### Output
-- Updated plan.json (if revised)
+- Updated plan.json (if revised) with confidence section
 - Updated .task/ files (if revised)
 - Updated index.json with plan fields
@@ -286,7 +296,7 @@ Bidirectional linking: update matching issues in `.workflow/issues/issues.jsonl`
 ### Steps
-1. **Display plan summary** — summary, approach, task count, wave structure, complexity, key dependencies
+1. **Display plan summary** — summary, approach, task count, wave structure, complexity, key dependencies, **plan confidence** (overall %, weakest dimension, pressure pass result)
 2. **Present options via AskUserQuestion** (skip if `config.gates.confirm_plan == false`, auto-proceed)
    - Execute now → build executionContext, hand off to /workflow:execute

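The Plan Readiness Gate conditions, including the circular-dependency check, could look roughly like this. The task fields `id` and `depends_on` are hypothetical names chosen for the sketch; `read_first` and `convergence.criteria` are the fields named in the diff. The depth-first cycle search is one common way to detect circular deps, not necessarily what the package does.

```python
from __future__ import annotations


def find_cycle(deps: dict[str, list[str]]) -> list[str] | None:
    """Return one dependency cycle as a list of task ids, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {task: WHITE for task in deps}
    stack: list[str] = []

    def visit(task: str) -> list[str] | None:
        state[task] = GRAY
        stack.append(task)
        for dep in deps.get(task, []):
            if state.get(dep, WHITE) == GRAY:
                # dep is already on the current path, so the path closes a cycle.
                return stack[stack.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        state[task] = BLACK
        return None

    for task in deps:
        if state[task] == WHITE:
            found = visit(task)
            if found:
                return found
    return None


def plan_blockers(requirements_coverage: float, tasks: list[dict],
                  pressure_pass_done: bool) -> list[str]:
    """Return the reasons the gate would block P4.5; empty means proceed."""
    blockers = []
    if requirements_coverage < 0.40:
        blockers.append("requirements_coverage < 40%")
    for task in tasks:
        has_criteria = bool(task.get("convergence", {}).get("criteria"))
        if not task.get("read_first") or not has_criteria:
            blockers.append(f"{task['id']} missing read_first/convergence.criteria")
    if not pressure_pass_done:
        blockers.append("no pressure pass")
    cycle = find_cycle({t["id"]: t.get("depends_on", []) for t in tasks})
    if cycle:
        blockers.append("circular deps: " + " -> ".join(cycle))
    return blockers
```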
package/workflows/test.md CHANGED

@@ -416,6 +416,16 @@ If re-verify still has gaps: report remaining gaps, suggest manual intervention.
 ---
+### Step 12.5: UAT Confidence Scoring
+Dimensions (4): scenario_coverage, diagnostic_depth, observation_quality, closure_completeness. Factors (weights): requirements_mapped(.30), observation_specificity(.25), user_validation(.20), diagnostic_depth(.15), consistency(.10). Score at: init (Step 5), per user response (Step 8), after gap-fix loop (Step 12).
+Quality mechanisms: Pressure Pass — >80% pass → ask user to try edge case. Devil's Advocate — >70% first-try pass → challenge scenario difficulty. Stall Detection — 2 gap-fix iterations without improvement → stop.
+Readiness Gate (blocks Step 13): Block if scenario_coverage < 40% | blocker gap without diagnosis | no pressure pass (when pass rate > 80%) | unresolved gaps without acknowledgment. Append confidence summary to uat.md.
+---
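As a rough illustration of the Step 12.5 quality mechanisms: the thresholds below are the ones stated in the hunk, while the function names, the action labels, and the choice to measure "improvement" as a drop in the number of open gaps are assumptions.

```python
from __future__ import annotations


def uat_quality_actions(total: int, passed: int, first_try_passed: int) -> list[str]:
    """Decide which challenge mechanisms fire for the current UAT results."""
    if total <= 0:
        return []
    actions = []
    if passed / total > 0.80:
        actions.append("pressure_pass")    # ask the user to try an edge case
    if first_try_passed / total > 0.70:
        actions.append("devils_advocate")  # challenge scenario difficulty
    return actions


def should_stop_gap_fix(open_gaps_per_iteration: list[int]) -> bool:
    """Stop after two consecutive gap-fix iterations that did not reduce open gaps.

    The list starts with the gap count before the first iteration, followed by
    the count after each iteration.
    """
    if len(open_gaps_per_iteration) < 3:
        return False
    before, first, second = open_gaps_per_iteration[-3:]
    return first >= before and second >= first
```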
 ### Step 13: Report
 ```