maestro-flow 0.3.39 → 0.3.40
- package/.claude/commands/maestro-analyze.md +4 -0
- package/.claude/commands/maestro-brainstorm.md +4 -0
- package/.claude/commands/maestro-plan.md +5 -1
- package/.claude/commands/maestro-ralph.md +39 -9
- package/.claude/commands/quality-auto-test.md +4 -0
- package/.claude/commands/quality-debug.md +5 -1
- package/.claude/commands/quality-test.md +4 -0
- package/.codex/skills/maestro-analyze/SKILL.md +7 -0
- package/.codex/skills/maestro-brainstorm/SKILL.md +12 -3
- package/.codex/skills/maestro-plan/SKILL.md +10 -0
- package/.codex/skills/maestro-ralph/SKILL.md +20 -8
- package/.codex/skills/quality-auto-test/SKILL.md +6 -0
- package/.codex/skills/quality-debug/SKILL.md +7 -1
- package/.codex/skills/quality-test/SKILL.md +9 -0
- package/package.json +1 -1
- package/workflows/analyze.md +24 -2
- package/workflows/auto-test.md +12 -0
- package/workflows/brainstorm.md +11 -1
- package/workflows/debug.md +13 -4
- package/workflows/plan.md +14 -4
- package/workflows/test.md +10 -0
|
@@ -108,6 +108,10 @@ Full mode:
 - [ ] analysis.md written with all 6 dimensions scored with evidence
 - [ ] conclusions.json created with recommendations and decision trail
 - [ ] Intent Coverage tracked and verified (no unresolved ❌ items)
+- [ ] Confidence tracking initialized (Step 4.6) and re-scored each round (Step 5.8)
+- [ ] Readiness gate checked before synthesis (Step 5.10)
+- [ ] Pressure pass completed ≥ 1 time before Step 6
+- [ ] Confidence summary with factor decomposition written to analysis.md

 Gaps mode:
 - [ ] Issues loaded from issues.jsonl (all open/registered, or single ISS-ID)
@@ -91,6 +91,10 @@ Single role mode:
 - [ ] Final Output Gate passed (Step 5.5) or `--yes` bypassed
 - [ ] All user decisions captured with Decision Recording Protocol
 - [ ] Session metadata updated with completion status
+- [ ] Confidence scored per role completion and after cross-role analysis
+- [ ] Readiness gate checked before spec generation
+- [ ] Pressure pass completed on at least 1 feature spec
+- [ ] Confidence summary appended to synthesis-changelog.md

 **Single role mode**:
 - [ ] analysis.md written to `{output_dir}/{role}/`
@@ -143,8 +143,12 @@ Follow workflow plan.md § "Revise Mode" and § "Check Mode" respectively. These
 - [ ] Every task has `read_first[]` with at least the file being modified + source of truth files
 - [ ] Every task has `convergence.criteria[]` with grep-verifiable conditions (no subjective language)
 - [ ] Every task `action` and `implementation` contain concrete values (no "align X with Y")
+- [ ] Plan confidence scored in P4 with 5-dimension factor model
+- [ ] Plan readiness gate checked before P4.5 collision detection
+- [ ] Pressure pass completed on highest-complexity task
+- [ ] plan.json includes confidence section (overall, dimensions, pressure_pass)
 - [ ] Collision detection executed against same-milestone plans (non-blocking)
 - [ ] Plan-checker passed (or minor issues acknowledged)
-- [ ] User confirmation captured (execute/modify/cancel)
+- [ ] User confirmation captured (execute/modify/cancel) with confidence displayed
 - [ ] Artifact registered in state.json with correct scope/milestone/phase/depends_on
 </success_criteria>
@@ -326,10 +326,28 @@ For quality-gate decisions (post-verify, post-business-test, post-review, post-t
 | post-review | `{artifact_dir}/review.json` |
 | post-test | `{artifact_dir}/uat.md`, `{artifact_dir}/.tests/test-results.json` |

+**Confidence-aware evaluation**:
+
+Before delegating, check if artifact contains a confidence section (added by downstream commands):
+- `verification.json` → `confidence.overall` (from maestro-verify)
+- `report.json` → `confidence.overall` (from quality-auto-test)
+- `review.json` → may contain dimension confidence (from quality-review)
+- `uat.md` → confidence summary section (from quality-test)
+
+If confidence data found, include in delegate prompt as additional signal:
+```
+已有置信度评估: 整体 {overall}%, 最弱维度: {weakest} ({score}%)
+```
+
+**Confidence-based verdict bias**: When artifact confidence is available:
+- confidence < 60% → bias toward "fix" even if surface status looks clean (hidden quality gaps)
+- confidence 60-95% → use delegate verdict as-is
+- confidence > 95% → bias toward "proceed" (strong evidence of quality)
+
 ```
 Bash({
 command: `maestro delegate "PURPOSE: 评估 ${meta.decision} 质量门结果,判断是否通过
-TASK: 读取结果文件 | 分析通过/失败状态 | 评估问题严重性 | 给出下一步建议
+TASK: 读取结果文件 | 分析通过/失败状态 | 评估问题严重性 | 检查置信度评分 | 给出下一步建议
 MODE: analysis
 CONTEXT: @${result_files}
 EXPECTED: 严格按以下格式输出:
@@ -338,8 +356,10 @@ STATUS: proceed | fix | escalate
 REASON: 一句话解释
 GAP_SUMMARY: 具体问题描述(仅 fix/escalate 时填写,用于传递给 quality-debug)
 CONFIDENCE: high | medium | low
+CONFIDENCE_SCORE: 0-100(从结果文件中读取置信度分数,无则估算)
+WEAKEST_DIMENSION: 最弱维度名称
 ---END---
-CONSTRAINTS: 只评估不修改 | STATUS 三选一 | 如果 retry ${meta.retry_count}/${meta.max_retries} 已达上限且仍有问题则必须 escalate" --role analyze --mode analysis`,
+CONSTRAINTS: 只评估不修改 | STATUS 三选一 | 置信度 < 60% 倾向 fix | 如果 retry ${meta.retry_count}/${meta.max_retries} 已达上限且仍有问题则必须 escalate" --role analyze --mode analysis`,
 run_in_background: true
 })
 STOP — wait for callback.
@@ -352,12 +372,20 @@ STOP — wait for callback.
 Parse structured response:
 ```
 Extract between ---VERDICT--- and ---END---:
-verdict.status
-verdict.reason
-verdict.gap_summary
-verdict.confidence
+verdict.status = "proceed" | "fix" | "escalate"
+verdict.reason = string
+verdict.gap_summary = string (context for quality-debug)
+verdict.confidence = "high" | "medium" | "low"
+verdict.confidence_score = 0-100 (numeric, from artifact or estimated)
+verdict.weakest_dimension = string (weakest confidence dimension)

 If parse fails → fallback: treat as "fix" with generic gap_summary
+
+Confidence-based verdict adjustment (after parse, before apply):
+If verdict.confidence_score < 60 AND verdict.status == "proceed":
+→ Override to "fix", reason += " (置信度不足: {score}%,{weakest_dimension} 需加强)"
+If verdict.confidence_score > 95 AND verdict.status == "fix" AND retry_count > 0:
+→ Suggest "proceed" override, reason += " (置信度充分: {score}%,建议通过)"
 ```

 **Apply verdict:**
@@ -503,9 +531,11 @@ End.
 - [ ] Full quality pipeline generated: verify → business-test → review → test-gen → test
 - [ ] Decision nodes inserted after: post-verify, post-business-test, post-review, post-test, post-milestone
 - [ ] Quality-gate decisions delegated via `maestro delegate --role analyze --mode analysis`
-- [ ] Delegate verdict parsed: STATUS / REASON / GAP_SUMMARY / CONFIDENCE
-- [ ]
-- [ ]
+- [ ] Delegate verdict parsed: STATUS / REASON / GAP_SUMMARY / CONFIDENCE / CONFIDENCE_SCORE / WEAKEST_DIMENSION
+- [ ] Confidence-based verdict adjustment applied (< 60% bias fix, > 95% bias proceed)
+- [ ] Artifact confidence sections read when available (verification.json, report.json, uat.md)
+- [ ] `-y` mode: auto-follow adjusted verdict without user confirmation
+- [ ] Interactive mode: display recommendation with confidence score + AskUserQuestion with override options
 - [ ] Delegate failure fallback: treat as "fix" verdict
 - [ ] gap_summary from delegate passed to quality-debug as context
 - [ ] Fix-loop templates applied per decision type with retry_count increment
@@ -116,6 +116,10 @@ Append to state.json.artifacts[]:
 - [ ] Tests executed progressively (L0→L3) with fail-fast on critical
 - [ ] Iteration engine ran (inner: test_defect fix, outer: strategy adjust)
 - [ ] state.json, report.json, reflection-log.md written
+- [ ] Test confidence scored per iteration (Step 7.5) with 5-dimension factor model
+- [ ] Convergence check includes confidence >= 60% alongside pass_rate threshold
+- [ ] Pressure pass completed on highest-pass-rate layer before completion
+- [ ] report.json includes confidence section
 - [ ] index.json updated with auto_test section
 - [ ] If spec source: traceability matrix built, traceability.md written
 - [ ] If failures: issues auto-created in issues.jsonl
@@ -115,7 +115,11 @@ If user confirms, invoke `Skill({ skill: "spec-add", args: "<category> <content>
 - [ ] evidence.ndjson written with structured NDJSON entries
 - [ ] understanding.md tracks evolving understanding per cluster
 - [ ] Root causes collected with fix_direction and affected_files
+- [ ] Multi-factor confidence scored per gap (Step 7.0) replacing simple high/medium/low
+- [ ] Readiness gate checked before ROOT CAUSE declaration
+- [ ] Pressure pass completed on confirmed hypothesis
+- [ ] Confidence table appended to understanding.md
 - [ ] If --from-uat: uat.md gaps updated with diagnosis artifacts
-- [ ] Results unified into diagnosis summary
+- [ ] Results unified into diagnosis summary with confidence section
 - [ ] Next step routed (plan --gaps + execute if fix needed, verify if fix applied, resume if inconclusive)
 </success_criteria>
@@ -95,6 +95,10 @@ Append to state.json.artifacts[]:
 - [ ] Severity inferred from natural language (never asked)
 - [ ] Batched writes: on issue, every 5 passes, or completion
 - [ ] test-results.json and coverage-report.json written
+- [ ] UAT confidence scored with 4-dimension factor model
+- [ ] Readiness gate checked before final report
+- [ ] Pressure pass completed if > 80% pass rate
+- [ ] Confidence summary appended to uat.md
 - [ ] index.json uat fields updated
 - [ ] If issues: parallel debug agents spawned per gap cluster
 - [ ] Gaps updated with root_cause, fix_direction, affected_files
@@ -112,6 +112,7 @@ id,title,description,dimension,analysis_type,deps,context_from,wave,status,findi
 | `findings` | Output | Key findings summary (max 500 chars) |
 | `score` | Output | Dimension score (0-100 for scoring tasks, empty for explore/decide) |
 | `recommendations` | Output | Dimension-specific recommendations |
+| `confidence_score` | Output | Per-dimension confidence score (0-100) from factor-based assessment |
 | `error` | Output | Error message if failed |

 ### Per-Wave CSV (Temporary)
@@ -356,6 +357,10 @@ Write wave CSV with `prev_context`, execute `spawn_agents_on_csv` for synthesis
 {prioritized recommendations with rationale}
 ```

+3b. **Confidence scoring** (full mode only):
+
+Factors (weights): findings_depth(.30), evidence_strength(.25), coverage_breadth(.20), user_validation(.15, 0 in CSV mode), consistency(.10). Overall = average of dimension scores. Thresholds: <60% deeper | 60-80% optional | 80-95% converging | >95% converge. Append confidence summary to `analysis.md` and `conclusions.json`.
+
 4. Build `context.md` (both modes):

 ```markdown
@@ -479,6 +484,8 @@ echo '{"ts":"<ISO>","worker":"{id}","type":"exploration_finding","data":{"file":
 - [ ] analysis.md + conclusions.json produced (full mode only)
 - [ ] Deferred items auto-created as issues
 - [ ] Artifact registered in state.json
+- [ ] Confidence scored per dimension with factor-based model (full mode only)
+- [ ] Confidence summary appended to analysis.md and conclusions.json
 - [ ] Final outputs copied to scratchDir
 - [ ] discoveries.ndjson append-only throughout
 </success_criteria>
@@ -364,9 +364,15 @@ spawn_agents_on_csv({
 - Skill: maestro-roadmap --mode full -- Generate full spec package from brainstorm
 ```

-4.
-
-
+4. **Brainstorm confidence scoring**:
+
+Dimensions (5): role_coverage, cross_role_consistency, feature_completeness, spec_quality, design_feasibility. Factors (weights): analysis_depth(.30), evidence_strength(.25), coverage_breadth(.20), user_validation(.15, 0 if --yes), consistency(.10). Append confidence summary to `synthesis-changelog.md`.
+
+**Conflict-based quality gate**: >3 `[UNRESOLVED]` conflicts → warn before artifact registration.
+
+5. Copy artifacts to output `.brainstorming/` directory (phase mode or scratch mode target)
+6. Update phase `index.json` with brainstorm status (if phase mode)
+7. **Next-Step Routing** (skip if AUTO_YES — default to first applicable):
 - Detect UI features: scan feature-index.json for UI/frontend-related features (keywords: ui, interface, page, component, dashboard, form, layout)
 - `request_user_input` (include UI Design option only when UI features detected):
 ```json
@@ -439,4 +445,7 @@ echo '{"ts":"<ISO>","worker":"{id}","type":"terminology","data":{"term":"CRDT","
 - [ ] context.md produced with full brainstorm report
 - [ ] Artifacts copied to target .brainstorming/ directory
 - [ ] discoveries.ndjson append-only throughout
+- [ ] Confidence scored per role and after cross-role synthesis
+- [ ] Conflict-based quality gate evaluated (> 3 unresolved = warning)
+- [ ] Confidence summary appended to synthesis-changelog.md
 </success_criteria>
@@ -359,6 +359,12 @@ spawn_agents_on_csv({
 1. **Plan checking** (inline, not a separate wave):
 Read `plan.json` + all `.task/TASK-*.json`. Validate: requirements coverage, file feasibility, dependency correctness (no cycles, valid wave order), grep-verifiable convergence criteria, read_first completeness, action concreteness, no parallel file conflicts, **task count within complexity threshold** (reject over-split plans), **no per-file splitting** (each task must be feature-level).

+1b. **Plan confidence scoring**:
+
+Dimensions (5): requirements_coverage, task_quality, dependency_correctness, estimation_accuracy, collision_safety. Factors (weights): completeness(.30), specificity(.25), structural_validity(.20), user_validation(.15), consistency(.10). Add `confidence` section to `plan.json`.
+
+**Readiness gate**: Block if requirements_coverage < 40% or any task missing read_first/convergence.criteria.
+
 2. **Revision loop** (max 3 rounds): If critical issues found, regenerate affected tasks.

 2b. **Spec Enrichment**: Persist cross-task reusable design decisions:
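Annotation (not part of the package): the plan readiness gate added in step 1b above can be pictured as a simple blocker check. The sketch below is a hedged illustration only. It assumes each task is a dict shaped like the `.task/TASK-*.json` files, with `read_first` and `convergence.criteria` as named in the diff; the function name `plan_readiness_blockers` and the `id` key are hypothetical.

```python
def plan_readiness_blockers(requirements_coverage: float, tasks: list) -> list:
    """Return human-readable reasons why the plan should be blocked.

    An empty list means the gate passes. The 40% threshold and the two
    required task fields come from the diff; everything else is assumed.
    """
    blockers = []
    if requirements_coverage < 40:
        blockers.append(f"requirements_coverage {requirements_coverage}% is below 40%")
    for task in tasks:
        task_id = task.get("id", "<unknown task>")  # hypothetical key
        if not task.get("read_first"):
            blockers.append(f"{task_id}: missing read_first")
        criteria = (task.get("convergence") or {}).get("criteria")
        if not criteria:
            blockers.append(f"{task_id}: missing convergence.criteria")
    return blockers


complete_task = {
    "id": "TASK-001",
    "read_first": ["src/example.ts"],
    "convergence": {"criteria": ["grep -q 'export' src/example.ts"]},
}
incomplete_task = {"id": "TASK-002", "read_first": [], "convergence": {}}

print(plan_readiness_blockers(85, [complete_task]))
print(plan_readiness_blockers(30, [complete_task, incomplete_task]))
```

Returning the list of reasons, rather than a bare boolean, makes it easier to show the user why the gate blocked.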
@@ -397,6 +403,7 @@ spawn_agents_on_csv({
 Tasks: {task_count} tasks in {wave_count} waves
 Check: {checker_status} (iteration {check_count}/{max_checks})
 Collision: {collision_status}
+Confidence: {overall}% (weakest: {dim})

 Next steps:
 $maestro-execute "{phase}" -- Execute the plan
@@ -464,6 +471,9 @@ echo '{"ts":"<ISO>","worker":"{id}","type":"existing_pattern","data":{"name":"Re
 - [ ] plan.json produced in phase directory
 - [ ] .task/TASK-*.json files produced for all tasks
 - [ ] Plan passes quality checks (coverage, deps, criteria)
+- [ ] Plan confidence scored with 5-dimension factor model
+- [ ] Readiness gate checked before confirmation
+- [ ] plan.json includes confidence section
 - [ ] Collision detection executed against same-milestone plans
 - [ ] PLN artifact registered in state.json
 - [ ] context.md produced with exploration findings + plan overview
@@ -355,8 +355,10 @@ STATUS: proceed | fix | escalate
 REASON: 一句话解释
 GAP_SUMMARY: 问题描述(fix/escalate 时填写)
 CONFIDENCE: high | medium | low
+CONFIDENCE_SCORE: 0-100(从结果文件中读取置信度分数,无则估算)
+WEAKEST_DIMENSION: 最弱维度名称
 ---END---
-CONSTRAINTS: 只评估 | STATUS 三选一 | retry ${meta.retry_count}/${meta.max_retries} 达上限必须 escalate" --role analyze --mode analysis`,
+CONSTRAINTS: 只评估 | STATUS 三选一 | 置信度 < 60% 倾向 fix | retry ${meta.retry_count}/${meta.max_retries} 达上限必须 escalate" --role analyze --mode analysis`,
 yield_time_ms: 30000,
 max_output_tokens: 6000
 })
@@ -366,17 +368,25 @@ CONSTRAINTS: 只评估 | STATUS 三选一 | retry ${meta.retry_count}/${meta.max

 **Parse verdict** (on callback):
 ```
-Extract STATUS / REASON / GAP_SUMMARY / CONFIDENCE from output.
+Extract STATUS / REASON / GAP_SUMMARY / CONFIDENCE / CONFIDENCE_SCORE / WEAKEST_DIMENSION from output.
 If parse fails → fallback: STATUS = "fix", GAP_SUMMARY = generic
+
+Confidence-based verdict adjustment (after parse, before apply):
+If CONFIDENCE_SCORE < 60 AND STATUS == "proceed":
+→ Override to "fix", REASON += " (置信度不足: {score}%,{weakest} 需加强)"
+If CONFIDENCE_SCORE > 95 AND STATUS == "fix" AND retry_count > 0:
+→ Suggest "proceed" override, REASON += " (置信度充分: {score}%,建议通过)"
 ```

+**Confidence-aware evaluation**: Before delegating, check if artifact contains confidence section (added by downstream commands). If found, include `已有置信度评估: 整体 {overall}%, 最弱维度: {weakest} ({score}%)` in delegate prompt as additional signal.
+
 **Apply verdict:**

 | Mode | Behavior |
 |------|----------|
-| `-y` (auto_mode) | Follow verdict directly, no user prompt |
-| Interactive +
-| Interactive +
+| `-y` (auto_mode) | Follow adjusted verdict directly, no user prompt |
+| Interactive + confidence_score >= 80 | Display recommendation with confidence, prompt user |
+| Interactive + confidence_score < 80 | Display recommendation **with confidence warning**, prompt user |

 Interactive prompt (via `request_user_input`):
 ```json
@@ -665,9 +675,11 @@ Rules:
 - [ ] Conditional steps evaluated at decision time (coverage threshold)
 - [ ] buildSkillCall() completes arg enrichment + auto flag, CSV contains full commands
 - [ ] Quality-gate decisions delegate-evaluated via `maestro delegate --role analyze`
-- [ ] Delegate verdict parsed: STATUS / REASON / GAP_SUMMARY / CONFIDENCE
-- [ ]
-- [ ]
+- [ ] Delegate verdict parsed: STATUS / REASON / GAP_SUMMARY / CONFIDENCE / CONFIDENCE_SCORE / WEAKEST_DIMENSION
+- [ ] Confidence-based verdict adjustment applied (< 60% bias fix, > 95% bias proceed)
+- [ ] Artifact confidence sections read when available as additional signal
+- [ ] `-y` mode: auto-follow adjusted verdict, no STOP (except post-debug-escalate)
+- [ ] Interactive mode: display recommendation with confidence score + request_user_input with override
 - [ ] Delegate failure fallback: treat as "fix" verdict
 - [ ] passed_gates[] tracks passed quality gates, skips re-runs in retry loops
 - [ ] passed_gates cleared when code changes (fix-loop inserts execute step)
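Annotation (not part of the package): the checklist above refers to parsing the delegate verdict and then applying a confidence-based adjustment. A minimal Python sketch of that flow follows. It assumes the delegate reply wraps its fields between `---VERDICT---` and `---END---`, as the workflow text in this diff states; the names `parse_verdict`, `adjust_verdict`, and the `suggested_override` key are hypothetical and not part of maestro-flow.

```python
import re

FIELDS = ("STATUS", "REASON", "GAP_SUMMARY", "CONFIDENCE",
          "CONFIDENCE_SCORE", "WEAKEST_DIMENSION")
FALLBACK = {"status": "fix", "gap_summary": "delegate output could not be parsed"}


def parse_verdict(output: str) -> dict:
    """Collect the known fields between ---VERDICT--- and ---END---."""
    match = re.search(r"---VERDICT---(.*?)---END---", output, re.S)
    if not match:
        return dict(FALLBACK)  # documented fallback: treat as "fix"
    verdict = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in FIELDS:
            verdict[key.strip().lower()] = value.strip()
    if verdict.get("status") not in ("proceed", "fix", "escalate"):
        return dict(FALLBACK)
    return verdict


def adjust_verdict(verdict: dict, retry_count: int) -> dict:
    """Apply the confidence-based adjustment after parsing, before applying."""
    adjusted = dict(verdict)
    try:
        score = int(verdict["confidence_score"])
    except (KeyError, ValueError):
        return adjusted  # no usable numeric score: leave the verdict alone
    if score < 60 and verdict["status"] == "proceed":
        adjusted["status"] = "fix"
        adjusted["reason"] = f"{verdict.get('reason', '')} (low confidence: {score}%)"
    elif score > 95 and verdict["status"] == "fix" and retry_count > 0:
        adjusted["suggested_override"] = "proceed"  # a suggestion, not a forced change
    return adjusted
```

The high-confidence branch only records a suggestion, mirroring the diff's wording ("Suggest \"proceed\" override"), while the low-confidence branch actually changes the status.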
@@ -385,6 +385,9 @@ OUTER LOOP (max_iter iterations):
 Analyze: pass rate delta, failure clusters, strategy effectiveness
 Append to reflection-log.md

+**Test confidence scoring** (at each REFLECT step):
+Dimensions (5): scenario_coverage, test_quality, diagnostic_accuracy, strategy_effectiveness, infrastructure_fitness. Factors (weights): completeness(.30), pass_rate_trend(.25), classification_accuracy(.20), coverage_breadth(.15), consistency(.10). Enhanced convergence: BOTH pass_rate ≥ threshold AND confidence ≥ 60%. Add confidence to `report.json`.
+
 ADJUST (Adaptive Strategy):
 IF startStrategy provided AND iteration == 1: use startStrategy as initial
 OTHERWISE auto-select:
@@ -540,6 +543,9 @@ CSV Session: .tests/auto-test/.csv-session/
 - [ ] discoveries.ndjson append-only throughout
 - [ ] Cross-layer context propagation via prev_context column
 - [ ] Iteration engine ran (inner: test_defect fix, outer: strategy adjust)
+- [ ] Test confidence scored per iteration with 5-dimension factor model
+- [ ] Convergence check includes confidence >= 60% alongside pass_rate
+- [ ] Confidence section added to report.json
 - [ ] state.json, report.json, reflection-log.md written
 - [ ] If spec: traceability.md produced
 - [ ] If failures: issues auto-created in issues.jsonl
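Annotation (not part of the package): the convergence check named in this checklist combines pass rate and confidence. The sketch below is one possible reading, using the 95% pass-rate and 60% confidence figures that the workflow text in this diff gives for auto-test. The function name `next_action`, its return labels, and the decision to let `max_iter` take precedence over "continue" are assumptions made for illustration.

```python
def next_action(pass_rate: float, confidence: float, iteration: int,
                max_iter: int, all_failures_code_defect: bool = False) -> str:
    """Decide what the outer test loop does after a REFLECT step.

    Returns "converged", "complete" (stop iterating and write artifacts),
    or "continue" (run another iteration).
    """
    if pass_rate >= 95 and confidence >= 60:
        return "converged"
    if iteration >= max_iter or all_failures_code_defect:
        return "complete"
    # Covers both a low pass rate and a high pass rate backed by weak tests.
    return "continue"


print(next_action(97, 72, iteration=1, max_iter=5))
print(next_action(97, 45, iteration=1, max_iter=5))
print(next_action(97, 45, iteration=5, max_iter=5))
```

The second call illustrates the case the diff highlights: a pass rate above the threshold is not enough on its own when confidence is below 60%.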
@@ -257,6 +257,10 @@ spawn_agents_on_csv({

 2. **Generate context.md**: Debug report with summary (mode, hypothesis/confirmed/fixed/verified counts), per-hypothesis results (hypothesis, evidence for/against, findings, status), per-fix results (fix applied, verified, findings), aggregated root causes, and next steps.

+2b. **Debug confidence scoring**:
+
+Dimensions (4): hypothesis_quality, evidence_completeness, root_cause_isolation, fix_confidence. Factors (weights): evidence_depth(.30), evidence_strength(.25), coverage_breadth(.20), reproduction(.15), consistency(.10). Map to legacy: <40% = low, 40-70% = medium, >70% = high. Append confidence assessment to context.md.
+
 3. **UAT update** (if --from-uat): Update `uat.md` gaps with `root_cause`, `fix_direction`, `affected_files` for confirmed hypotheses.

 4. **Issue update**: If `issues.jsonl` exists, update matching issues with status `diagnosed`, add `context.suggested_fix` and `context.notes`.
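Annotation (not part of the package): step 2b above lists factor weights and a mapping back to the legacy low/medium/high labels. The sketch below shows one way those two pieces could fit together. The diff does not say how factor scores combine, so the weighted sum here is an assumption, as are the function names and the choice to treat a missing factor as 0.

```python
# Factor weights as listed in step 2b of the hunk above.
DEBUG_WEIGHTS = {
    "evidence_depth": 0.30,
    "evidence_strength": 0.25,
    "coverage_breadth": 0.20,
    "reproduction": 0.15,
    "consistency": 0.10,
}


def weighted_confidence(factors: dict) -> float:
    """Weighted sum of 0-100 factor scores (assumed combination rule)."""
    return sum(weight * factors.get(name, 0.0)
               for name, weight in DEBUG_WEIGHTS.items())


def legacy_level(score: float) -> str:
    """Map a 0-100 score to the legacy label: <40 low, 40-70 medium, >70 high."""
    if score < 40:
        return "low"
    if score <= 70:
        return "medium"
    return "high"


example = {
    "evidence_depth": 80,
    "evidence_strength": 60,
    "coverage_breadth": 50,
    "reproduction": 100,
    "consistency": 40,
}
score = weighted_confidence(example)
print(round(score, 1), legacy_level(score))
```

Keeping the legacy mapping as a separate function lets older consumers that only understand high/medium/low keep working while the numeric score is stored alongside.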
@@ -296,7 +300,7 @@ spawn_agents_on_csv({

 | Type | Dedup Key | Data Schema | Description |
 |------|-----------|-------------|-------------|
-| `root_cause` | `data.location` | `{location, cause, severity,
+| `root_cause` | `data.location` | `{location, cause, severity, confidence_score, confidence_factors}` | Confirmed root cause |
 | `hypothesis_evidence` | `data.hypothesis+data.location` | `{hypothesis, location, type, conclusion}` | Evidence for/against hypothesis |
 | `affected_component` | `data.component` | `{component, files[], impact}` | Component affected by bug |
 | `reproduction_path` | `data.trigger` | `{trigger, steps[], frequency}` | Bug reproduction path |
@@ -333,6 +337,8 @@ echo '{"ts":"<ISO>","worker":"{id}","type":"root_cause","data":{"location":"src/
 - [ ] Refuted/inconclusive hypotheses correctly skip wave 2 fix tasks
 - [ ] Wave 2 fixes attempted only for confirmed hypotheses
 - [ ] context.md produced with diagnosis summary
+- [ ] Multi-factor confidence scored per hypothesis replacing simple high/medium/low
+- [ ] Confidence assessment appended to context.md
 - [ ] UAT gaps updated (if --from-uat)
 - [ ] Issues updated with diagnosis results
 - [ ] discoveries.ndjson append-only throughout
@@ -434,6 +434,12 @@ Options:
 If re-verify passes: update uat.md gaps as resolved, report success.
 If gaps remain after 2 iterations: report remaining, suggest manual intervention.

+### Step 12.5: UAT Confidence Scoring
+
+Dimensions (4): scenario_coverage, diagnostic_depth, observation_quality, closure_completeness. Factors (weights): requirements_mapped(.30), observation_specificity(.25), user_validation(.20), diagnostic_depth(.15), consistency(.10). Append confidence summary to `uat.md`.
+
+**Readiness gate** (before final report): Block if scenario_coverage < 40% or any blocker-severity gap without diagnosis.
+
 ### Step 13: Report

 ```
@@ -490,6 +496,9 @@ Files:
 - [ ] test-results.json and coverage-report.json written
 - [ ] index.json uat fields updated
 - [ ] Artifact registered in state.json
+- [ ] UAT confidence scored with 4-dimension factor model
+- [ ] Readiness gate checked before final report
+- [ ] Confidence summary appended to uat.md
 - [ ] If issues: diagnosis.csv built, spawn_agents_on_csv executed per gap cluster
 - [ ] Gaps updated with root_cause, fix_direction, affected_files
 - [ ] Gap-fix loop triggered if --auto-fix (max 2 iterations)
package/package.json
CHANGED
package/workflows/analyze.md
CHANGED
|
@@ -252,6 +252,10 @@ Re-read original User Intent from discussion.md header. Check each intent item a

 Append initial Intent Coverage Check to discussion.md.

+**Step 4.6: Baseline Confidence Scoring**
+
+Dimensions = the 6 analysis dimensions. Factors (weights): findings_depth(.30), evidence_strength(.25), coverage_breadth(.20), user_validation(.15), consistency(.10). Score each factor per dimension from Round 1 results. Append baseline confidence table to discussion.md. Thresholds: <60% 继续深入 | 60-80% 可选 | 80-95% 接近收敛 | >95% 建议收敛.
+
 ### Step 5: Interactive Discussion Loop

 Max 5 rounds. Each round follows this sequence:
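Annotation (not part of the package): Step 4.6 above attaches a recommendation to each confidence band. The sketch below averages per-dimension scores into an overall figure (the Codex skill variant in this diff states "Overall = average of dimension scores") and maps it to a band. The English band labels are taken from that skill variant; the function names, the sample dimension names, and the handling of the exact boundary values 80 and 95 are assumptions.

```python
from statistics import mean


def overall_confidence(dimension_scores: dict) -> float:
    """Average the per-dimension confidence scores (0-100)."""
    return mean(list(dimension_scores.values()))


def band(score: float) -> str:
    """Map a score to the recommendation band from the thresholds above."""
    if score < 60:
        return "deeper"       # keep digging
    if score < 80:
        return "optional"     # another round is optional
    if score <= 95:
        return "converging"   # close to convergence
    return "converge"         # recommend wrapping up


# Hypothetical dimension names, used only to exercise the functions.
scores = {"dim_a": 70, "dim_b": 80, "dim_c": 90,
          "dim_d": 60, "dim_e": 75, "dim_f": 75}
overall = overall_confidence(scores)
print(overall, band(overall))
```

A lookup like this keeps the threshold numbers in one place, so the baseline score in Step 4.6 and any later re-score can share the same bands.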
@@ -301,7 +305,19 @@ Re-read original User Intent. Check each item:

 If ❌ or ⚠️ items exist → proactively surface to user at start of next round.

-**
+**5.8: Re-score Confidence** (every round):
+Re-evaluate factors per dimension. Show delta: `Confidence: {prev}% → {current}% ({±N%}), {weakest_dim} 仍需深入`
+
+**5.9: Quality Mechanisms**:
+- **Pressure Pass** (mandatory ≥1 before Step 6): highest-confidence finding → pressure ladder (evidence demand → assumption probe → boundary/tradeoff → root cause check). Record under `#### 压力测试`.
+- **Devil's Advocate**: dimension > 0.7 → challenge "如果 [finding] 不成立?" (once per dimension)
+- **Scope Minimizer**: findings > 5 + scope expanding → "最小可行结论集?"
+- **Stall Detection**: delta < 5% for 2 consecutive rounds → "分析可能停滞,建议切换方向或收敛"
+
+**5.10: Pre-Synthesis Readiness Gate** (on "分析完成"):
+Block if: ❌ items without deferral | any dimension < 40% | no pressure pass | unresolved contradictions. If blocked → AskUserQuestion: 补充后继续 or 忽略风险并继续 (record `residual_risks[]`).
+
+**Auto mode (-y)**: auto-deepen ≤3 rounds, readiness gate auto-overrides with residual risk recording.

 ### Step 6: Six-Dimension Scoring

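Annotation (not part of the package): the Stall Detection rule in 5.9 above (delta < 5% for 2 consecutive rounds) can be sketched as a check over the history of overall confidence scores. The function name `is_stalled` and its parameters are hypothetical. Using the absolute change, so that a large drop does not count as a stall, is an assumption; the diff only says "delta < 5%".

```python
def is_stalled(history: list, min_delta: float = 5.0, rounds: int = 2) -> bool:
    """True when the last `rounds` round-to-round changes are all below min_delta.

    `history` holds one overall confidence score (0-100) per round, oldest first.
    """
    if len(history) < rounds + 1:
        return False  # not enough rounds to observe `rounds` deltas
    recent = history[-(rounds + 1):]
    deltas = [abs(later - earlier) for earlier, later in zip(recent, recent[1:])]
    return all(delta < min_delta for delta in deltas)


print(is_stalled([50, 62, 64, 66]))
print(is_stalled([50, 62, 70]))
```

When the check fires, the workflow text suggests prompting the user to switch direction or converge; that prompt is left out of the sketch.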
@@ -318,9 +334,11 @@ Using all exploration findings, discussion insights, and user feedback, score ac

 Each dimension scored with specific evidence (code refs, data points from exploration).

+Each 1-5 score justified by confidence factors from Step 4.6/5.8. Include per-dimension confidence % alongside the score.
+
 Build probability-impact risk matrix from identified risks.

-Formulate Go/No-Go/Conditional recommendation with confidence
+Formulate Go/No-Go/Conditional recommendation with overall confidence %. Write confidence summary (per-dimension scores + overall + pressure pass result + residual risks) to analysis.md.

 ### Step 7: Synthesis & Conclusion

@@ -673,6 +691,10 @@ Replaceable blocks (overwritten each round):
 - At least 2 alternatives compared with tradeoffs
 - Go/No-Go/Conditional recommendation with confidence level
 - Code references included where relevant (file paths, line numbers)
+- Confidence tracking initialized and re-scored each round
+- Readiness gate checked before synthesis (Step 5.10)
+- Pressure pass completed ≥ 1 time before Step 6
+- Confidence summary with factor decomposition in analysis.md

 **Both modes (full + quick):**
 - context.md written with all decisions classified as Locked/Free/Deferred
package/workflows/auto-test.md
CHANGED
|
@@ -498,6 +498,18 @@ END OUTER

 ---

+### Step 7.5: Test Confidence Scoring
+
+Scored after each REFLECT step. Dimensions (5): scenario_coverage, test_quality, diagnostic_accuracy, strategy_effectiveness, infrastructure_fitness. Factors (weights): completeness(.30), pass_rate_trend(.25), classification_accuracy(.20), coverage_breadth(.15), consistency(.10). Append confidence table to reflection-log.md.
+
+**Enhanced Convergence**: pass_rate ≥ 95% AND confidence ≥ 60% → converged. pass_rate ≥ 95% BUT confidence < 60% → continue (tests may be weak). max_iter reached or all failures = code_defect → Step 8.
+
+**Quality mechanisms**: Pressure Pass (before Step 8) — select 2-3 passing tests from highest-pass-rate layer, verify they exercise real behavior (not mock-only, non-trivial assertions). Devil's Advocate — pass_rate > 80% → challenge assertion specificity, error path coverage, mock over-reliance. Stall Detection — delta < 5% for 2 iterations + pass_rate flat → force Reflective strategy.
+
+**Readiness Gate** (before Step 8, skip if max_iter=1): scenario_coverage < 40% | no pressure pass | diagnostic_accuracy < 40% | unclassified failures. If blocked → force one additional iteration. Add confidence section to report.json.
+
+---
+
 ### Step 8: Complete & Write Artifacts

 1. Update session state:
package/workflows/brainstorm.md
CHANGED
|
@@ -331,7 +331,7 @@ Each agent receives:
|
|
|
331
331
|
|
|
332
332
|
### Step 5: Synthesis Integration (Auto Mode)
|
|
333
333
|
|
|
334
|
-
|
|
334
|
+
Seven sub-phases producing feature specs from cross-role analysis:
|
|
335
335
|
|
|
336
336
|
**Sub-phase 1: Discovery**
|
|
337
337
|
- Detect session, validate analysis files exist
|
|
@@ -349,6 +349,16 @@ Six sub-phases producing feature specs from cross-role analysis:
|
|
|
349
349
|
- Output: `enhancement_recommendations` (EP-001, EP-002, ...) + `feature_conflict_map` (per-feature consensus/conflicts/cross_refs)
|
|
350
350
|
- Conflict resolution quality: actionable, justified ("because...tradeoff:..."), scoped, confidence-tagged ([RESOLVED]|[SUGGESTED]|[UNRESOLVED])
|
|
351
351
|
|
|
352
|
+
**Sub-phase 3B: Brainstorm Confidence Scoring**
|
|
353
|
+
|
|
354
|
+
Compute after cross-role analysis, before user interaction. Dimensions (5): role_coverage, cross_role_consistency, feature_completeness, spec_quality, design_feasibility. Factors (weights): analysis_depth(.30), evidence_strength(.25), coverage_breadth(.20), user_validation(.15), consistency(.10).
|
|
355
|
+
|
|
356
|
+
Scoring points: per-role completion → `role_coverage`; after cross-role → `cross_role_consistency`; after user interaction → `user_validation`.
|
|
357
|
+
|
|
358
|
+
**Quality mechanisms**: Pressure Pass (before spec gen) — verify claims backed by ≥2 role analyses. Devil's Advocate — challenge when dimension > 0.7. Conflict Detection (replaces stall) — >3 `[UNRESOLVED]` conflicts → challenge.
|
|
359
|
+
|
|
360
|
+
**Readiness Gate** (before Sub-phase 4): Block if role_coverage < 100% | cross_role_consistency < 40% | feature_completeness below intent count | no pressure pass. If blocked → AskUserQuestion: fill gaps or proceed with residual risks. Append confidence summary to synthesis-changelog.md.
|
|
361
|
+
|
|
352
362
|
**Sub-phase 4: User Interaction**
|
|
353
363
|
- Enhancement selection: AskUserQuestion (multiSelect=true, batched by 4)
|
|
354
364
|
- Clarification questions: 9-category taxonomy scan, AskUserQuestion (single-select, multi-round)
|
package/workflows/debug.md
CHANGED
|
@@ -137,7 +137,7 @@ For each cluster, spawn concurrently as general-purpose agent (`run_in_backgroun
|
|
|
137
137
|
|
|
138
138
|
- **Input**: cluster name, phase, all gaps (test_id, truth, reason, severity). Mode: `symptoms_prefilled`.
|
|
139
139
|
- **Process**: read source files, form 2-3 hypotheses per gap ranked by likelihood, search code for evidence, log each as NDJSON line, confirm/refute.
|
|
140
|
-
- **Output per gap**: `root_cause`, `fix_direction`, `affected_files` (file:line), `confidence` (high/medium/low), `evidence` summary.
|
|
140
|
+
- **Output per gap**: `root_cause`, `fix_direction`, `affected_files` (file:line), `confidence` (multi-factor; also include legacy `confidence_level`: high/medium/low for backward compat), `evidence` summary.
|
|
141
141
|
- **Files**: `{debug_dir}/evidence-{cluster_slug}.ndjson`, `{debug_dir}/understanding-{cluster_slug}.md`
|
|
142
142
|
|
|
143
143
|
All agents run concurrently. Collect all results.
|
|
@@ -198,7 +198,7 @@ For each agent result, extract:
|
|
|
198
198
|
- root_cause per gap
|
|
199
199
|
- fix_direction per gap
|
|
200
200
|
- affected_files per gap
|
|
201
|
-
- confidence
|
|
201
|
+
- confidence (multi-factor, see 7.1)
|
|
202
202
|
- evidence summary
|
|
203
203
|
|
|
204
204
|
Build unified diagnosis:
|
|
@@ -215,14 +215,23 @@ Build unified diagnosis:
|
|
|
215
215
|
"root_cause": "...",
|
|
216
216
|
"fix_direction": "...",
|
|
217
217
|
"affected_files": ["src/components/Comments.tsx:42"],
|
|
218
|
-
"confidence": "
|
|
218
|
+
"confidence": { "overall": 0.78, "dimensions": {} }
|
|
219
219
|
}
|
|
220
220
|
]
|
|
221
221
|
}
|
|
222
|
-
]
|
|
222
|
+
],
|
|
223
|
+
"confidence": {}
|
|
223
224
|
}
|
|
224
225
|
```
|
|
225
226
|
|
|
227
|
+
### Step 7.0: Debug Confidence Scoring
|
|
228
|
+
|
|
229
|
+
Dimensions (4): hypothesis_quality, evidence_completeness, root_cause_isolation, fix_confidence. Factors (weights): evidence_depth(.30), evidence_strength(.25), coverage_breadth(.20), reproduction(.15), consistency(.10). Map to legacy levels: <40% = low, 40-70% = medium, >70% = high.
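
The legacy mapping can be sketched as a threshold function. The thresholds come from the line above; the function name and the choice to express scores as fractions in [0, 1] (matching `"overall": 0.78` in the diagnosis JSON) are assumptions.

```python
def legacy_confidence_level(overall: float) -> str:
    """Map a multi-factor overall score to the legacy high/medium/low level.

    Thresholds as stated: below 40% is low, 40-70% is medium, above 70% is high.
    """
    if not 0.0 <= overall <= 1.0:
        raise ValueError("overall must be within [0, 1]")
    if overall < 0.40:
        return "low"
    if overall <= 0.70:
        return "medium"
    return "high"
```

Both boundary values (exactly 40% and exactly 70%) fall into `medium` under the stated ranges.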
|
|
230
|
+
|
|
231
|
+
Quality mechanisms: Pressure Pass (before Step 9) — cross-check confirmed vs refuted hypotheses. Devil's Advocate — root_cause_isolation > 0.7 → "Is the root cause at a deeper layer?". Stall Detection — no new evidence + delta < 5% for 2 continuations → "The investigation may have stalled".
|
|
232
|
+
|
|
233
|
+
Readiness Gate (blocks Step 9 unless all hold): evidence_completeness ≥ 40% | pressure pass done | no contradicting evidence | fix_direction has specific files. If blocked → AskUserQuestion: "Investigate further" or "Ignore the risk and confirm". Append confidence table to understanding.md.
|
|
234
|
+
|
|
226
235
|
### Step 7.1: Update Issues with Diagnosis
|
|
227
236
|
|
|
228
237
|
For each diagnosed gap with an `issue_id`, update the corresponding issue in `.workflow/issues/issues.jsonl`:
|
package/workflows/plan.md
CHANGED
|
@@ -251,12 +251,22 @@ Bidirectional linking: update matching issues in `.workflow/issues/issues.jsonl`
|
|
|
251
251
|
- Critical issues → re-spawn planner with issues, revise, re-check
|
|
252
252
|
- Warnings only → log and proceed
|
|
253
253
|
|
|
254
|
-
3. **
|
|
255
|
-
|
|
254
|
+
3. **Plan Confidence Scoring**
|
|
255
|
+
|
|
256
|
+
Dimensions (5): requirements_coverage, task_quality, dependency_correctness, estimation_accuracy, collision_safety. Factors (weights): completeness(.30), specificity(.25), structural_validity(.20), user_validation(.15), consistency(.10). Re-score after each revision round.
|
|
257
|
+
|
|
258
|
+
Quality mechanisms: Pressure Pass (mandatory before P4.5) — verify highest-complexity task's read_first/convergence.criteria/action. Devil's Advocate — requirements_coverage > 0.7 → "Are there implicit requirements?". Scope Minimizer — task_count exceeds guard → "What is the minimum viable task set?". Stall Detection — delta < 5% → suggest broader revision.
|
|
259
|
+
|
|
260
|
+
4. **Plan Readiness Gate** (blocks P4.5)
|
|
261
|
+
|
|
262
|
+
Block if: requirements_coverage < 40% | task missing read_first/convergence.criteria | no pressure pass | circular deps. If blocked → AskUserQuestion: "Revise the plan" or "Ignore the risk and continue" (record residual_risks). Add confidence section to plan.json.
|
|
263
|
+
|
|
264
|
+
5. **Update index.json**
|
|
265
|
+
- Set `index.json.plan` = `{ task_ids, task_count, complexity, waves, executor_assignments: {}, confidence: overall_score }`
|
|
256
266
|
- Set `status: "planning"`, `updated_at: now()`
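
The Plan Readiness Gate in item 4 above lists four blocking conditions. The sketch below shows how they might be checked, using Python's standard `graphlib` module for the circular-dependency test. The task shape is an assumption: `read_first` and `convergence.criteria` appear in the workflow text, while `id`, `deps`, and the function name `plan_gate_blockers` are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def plan_gate_blockers(
    requirements_coverage: float,
    tasks: list[dict],
    pressure_pass_done: bool,
) -> list[str]:
    """Return the reasons the gate blocks P4.5 (empty list means open)."""
    blockers = []

    if requirements_coverage < 0.40:
        blockers.append("requirements_coverage < 40%")

    for task in tasks:
        criteria = (task.get("convergence") or {}).get("criteria")
        if not task.get("read_first") or not criteria:
            blockers.append(
                f"task {task['id']} missing read_first/convergence.criteria"
            )

    if not pressure_pass_done:
        blockers.append("no pressure pass")

    # Map each task id to the ids it depends on (its predecessors);
    # prepare() raises CycleError when the graph contains a cycle.
    graph = {task["id"]: set(task.get("deps", [])) for task in tasks}
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        blockers.append("circular deps")

    return blockers
```

Returning every blocker rather than stopping at the first gives the user the full picture when the gate prompts them to revise the plan or continue with recorded residual risks.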
|
|
257
267
|
|
|
258
268
|
### Output
|
|
259
|
-
- Updated plan.json (if revised)
|
|
269
|
+
- Updated plan.json (if revised) with confidence section
|
|
260
270
|
- Updated .task/ files (if revised)
|
|
261
271
|
- Updated index.json with plan fields
|
|
262
272
|
|
|
@@ -286,7 +296,7 @@ Bidirectional linking: update matching issues in `.workflow/issues/issues.jsonl`
|
|
|
286
296
|
|
|
287
297
|
### Steps
|
|
288
298
|
|
|
289
|
-
1. **Display plan summary** — summary, approach, task count, wave structure, complexity, key dependencies
|
|
299
|
+
1. **Display plan summary** — summary, approach, task count, wave structure, complexity, key dependencies, **plan confidence** (overall %, weakest dimension, pressure pass result)
|
|
290
300
|
|
|
291
301
|
2. **Present options via AskUserQuestion** (skip if `config.gates.confirm_plan == false`, auto-proceed)
|
|
292
302
|
- Execute now → build executionContext, hand off to /workflow:execute
|
package/workflows/test.md
CHANGED
|
@@ -416,6 +416,16 @@ If re-verify still has gaps: report remaining gaps, suggest manual intervention.
|
|
|
416
416
|
|
|
417
417
|
---
|
|
418
418
|
|
|
419
|
+
### Step 12.5: UAT Confidence Scoring
|
|
420
|
+
|
|
421
|
+
Dimensions (4): scenario_coverage, diagnostic_depth, observation_quality, closure_completeness. Factors (weights): requirements_mapped(.30), observation_specificity(.25), user_validation(.20), diagnostic_depth(.15), consistency(.10). Score at: init (Step 5), per user response (Step 8), after gap-fix loop (Step 12).
|
|
422
|
+
|
|
423
|
+
Quality mechanisms: Pressure Pass — >80% pass → ask user to try edge case. Devil's Advocate — >70% first-try pass → challenge scenario difficulty. Stall Detection — 2 gap-fix iterations without improvement → stop.
|
|
424
|
+
|
|
425
|
+
Readiness Gate (blocks Step 13): scenario_coverage < 40% | blocker gap without diagnosis | no pressure pass (if >80%) | unresolved gaps without acknowledgment. Append confidence summary to uat.md.
|
|
426
|
+
|
|
427
|
+
---
|
|
428
|
+
|
|
419
429
|
### Step 13: Report
|
|
420
430
|
|
|
421
431
|
```
|