@thierrynakoa/fire-flow 10.0.0
- package/.claude-plugin/plugin.json +64 -0
- package/ARCHITECTURE-DIAGRAM.md +440 -0
- package/COMMAND-REFERENCE.md +172 -0
- package/DOMINION-FLOW-OVERVIEW.md +421 -0
- package/LICENSE +21 -0
- package/QUICK-START.md +351 -0
- package/README.md +398 -0
- package/TROUBLESHOOTING.md +264 -0
- package/agents/fire-codebase-mapper.md +484 -0
- package/agents/fire-debugger.md +535 -0
- package/agents/fire-executor.md +949 -0
- package/agents/fire-fact-checker.md +276 -0
- package/agents/fire-learncoding-explainer.md +237 -0
- package/agents/fire-learncoding-walker.md +147 -0
- package/agents/fire-planner.md +675 -0
- package/agents/fire-project-researcher.md +155 -0
- package/agents/fire-research-synthesizer.md +166 -0
- package/agents/fire-researcher.md +723 -0
- package/agents/fire-reviewer.md +499 -0
- package/agents/fire-roadmapper.md +203 -0
- package/agents/fire-verifier.md +880 -0
- package/bin/cli.js +208 -0
- package/commands/fire-0-orient.md +476 -0
- package/commands/fire-1-new.md +281 -0
- package/commands/fire-1a-discuss.md +455 -0
- package/commands/fire-2-plan.md +527 -0
- package/commands/fire-3-execute.md +1303 -0
- package/commands/fire-4-verify.md +845 -0
- package/commands/fire-5-handoff.md +515 -0
- package/commands/fire-6-resume.md +501 -0
- package/commands/fire-7-review.md +409 -0
- package/commands/fire-add-new-skill.md +598 -0
- package/commands/fire-analytics.md +499 -0
- package/commands/fire-assumptions.md +78 -0
- package/commands/fire-autonomous.md +528 -0
- package/commands/fire-brainstorm.md +413 -0
- package/commands/fire-complete-milestone.md +270 -0
- package/commands/fire-dashboard.md +375 -0
- package/commands/fire-debug.md +663 -0
- package/commands/fire-discover.md +616 -0
- package/commands/fire-double-check.md +460 -0
- package/commands/fire-execute-plan.md +182 -0
- package/commands/fire-learncoding.md +242 -0
- package/commands/fire-loop-resume.md +272 -0
- package/commands/fire-loop-stop.md +198 -0
- package/commands/fire-loop.md +1168 -0
- package/commands/fire-map-codebase.md +313 -0
- package/commands/fire-new-milestone.md +356 -0
- package/commands/fire-reflect.md +235 -0
- package/commands/fire-research.md +246 -0
- package/commands/fire-search.md +330 -0
- package/commands/fire-security-audit-repo.md +293 -0
- package/commands/fire-security-scan.md +484 -0
- package/commands/fire-session-summary.md +252 -0
- package/commands/fire-skills-diff.md +506 -0
- package/commands/fire-skills-history.md +388 -0
- package/commands/fire-skills-rollback.md +408 -0
- package/commands/fire-skills-sync.md +470 -0
- package/commands/fire-test.md +520 -0
- package/commands/fire-todos.md +335 -0
- package/commands/fire-transition.md +186 -0
- package/commands/fire-update.md +312 -0
- package/commands/fire-verify-uat.md +146 -0
- package/commands/fire-vuln-scan.md +493 -0
- package/hooks/hooks.json +16 -0
- package/hooks/run-hook.cmd +69 -0
- package/hooks/run-hook.sh +8 -0
- package/hooks/run-session-end.cmd +49 -0
- package/hooks/run-session-end.sh +7 -0
- package/hooks/session-end.sh +90 -0
- package/hooks/session-start.sh +111 -0
- package/package.json +52 -0
- package/plugin.json +7 -0
- package/references/auto-skill-extraction.md +136 -0
- package/references/behavioral-directives.md +365 -0
- package/references/blocker-tracking.md +155 -0
- package/references/checkpoints.md +165 -0
- package/references/circuit-breaker.md +410 -0
- package/references/context-engineering.md +587 -0
- package/references/decision-time-guidance.md +289 -0
- package/references/error-classification.md +326 -0
- package/references/execution-mode-intelligence.md +242 -0
- package/references/git-integration.md +217 -0
- package/references/honesty-protocols.md +304 -0
- package/references/integration-architecture.md +470 -0
- package/references/issue-to-pr-pipeline.md +150 -0
- package/references/metrics-and-trends.md +234 -0
- package/references/playwright-e2e-testing.md +326 -0
- package/references/questioning.md +125 -0
- package/references/research-improvements.md +110 -0
- package/references/skills-usage-guide.md +429 -0
- package/references/tdd.md +131 -0
- package/references/testing-enforcement.md +192 -0
- package/references/ui-brand.md +383 -0
- package/references/validation-checklist.md +456 -0
- package/references/verification-patterns.md +187 -0
- package/references/warrior-principles.md +173 -0
- package/skills-library/SKILLS-INDEX.md +588 -0
- package/skills-library/_general/frontend/html-visual-reports.md +292 -0
- package/skills-library/_general/methodology/debug-swarm-researcher-escape-hatch.md +240 -0
- package/skills-library/_general/methodology/learncoding-agentic-pattern.md +114 -0
- package/skills-library/_general/methodology/shell-autonomous-loop-fixplan.md +238 -0
- package/skills-library/basics/api-rest-basics.md +162 -0
- package/skills-library/basics/env-variables.md +96 -0
- package/skills-library/basics/error-handling-basics.md +125 -0
- package/skills-library/basics/git-commit-conventions.md +106 -0
- package/skills-library/basics/readme-template.md +108 -0
- package/skills-library/common-tasks/async-await-patterns.md +157 -0
- package/skills-library/common-tasks/auth-jwt-basics.md +164 -0
- package/skills-library/common-tasks/database-schema-design.md +166 -0
- package/skills-library/common-tasks/file-upload-basics.md +166 -0
- package/skills-library/common-tasks/form-validation.md +159 -0
- package/skills-library/debugging/FAILURE_TAXONOMY_CLASSIFICATION.md +117 -0
- package/skills-library/debugging/THREE_AGENT_HYPOTHESIS_DEBUGGING.md +86 -0
- package/skills-library/methodology/BREATH_BASED_PARALLEL_EXECUTION.md +678 -0
- package/skills-library/methodology/CONFIDENCE_GATED_EXECUTION.md +243 -0
- package/skills-library/methodology/EVIDENCE_BASED_VALIDATION.md +308 -0
- package/skills-library/methodology/MULTI_PERSPECTIVE_CODE_REVIEW.md +330 -0
- package/skills-library/methodology/PATH_VERIFICATION_GATE.md +211 -0
- package/skills-library/methodology/REFLEXION_MEMORY_PATTERN.md +183 -0
- package/skills-library/methodology/RESEARCH_BACKED_WORKFLOW_UPGRADE.md +263 -0
- package/skills-library/methodology/SABBATH_REST_PATTERN.md +267 -0
- package/skills-library/methodology/STONE_AND_SCAFFOLD.md +220 -0
- package/skills-library/performance/cache-augmented-generation.md +172 -0
- package/skills-library/quality-safety/debugging-steps.md +147 -0
- package/skills-library/quality-safety/deployment-checklist.md +155 -0
- package/skills-library/quality-safety/security-checklist.md +204 -0
- package/skills-library/quality-safety/testing-basics.md +180 -0
- package/skills-library/security/agent-security-scanner.md +445 -0
- package/skills-library/specialists/api-architecture/api-designer.md +49 -0
- package/skills-library/specialists/api-architecture/graphql-architect.md +49 -0
- package/skills-library/specialists/api-architecture/mcp-developer.md +51 -0
- package/skills-library/specialists/api-architecture/microservices-architect.md +50 -0
- package/skills-library/specialists/api-architecture/websocket-engineer.md +48 -0
- package/skills-library/specialists/backend/django-expert.md +52 -0
- package/skills-library/specialists/backend/fastapi-expert.md +52 -0
- package/skills-library/specialists/backend/laravel-specialist.md +52 -0
- package/skills-library/specialists/backend/nestjs-expert.md +51 -0
- package/skills-library/specialists/backend/rails-expert.md +53 -0
- package/skills-library/specialists/backend/spring-boot-engineer.md +56 -0
- package/skills-library/specialists/data-ml/fine-tuning-expert.md +48 -0
- package/skills-library/specialists/data-ml/ml-pipeline.md +47 -0
- package/skills-library/specialists/data-ml/pandas-pro.md +47 -0
- package/skills-library/specialists/data-ml/rag-architect.md +51 -0
- package/skills-library/specialists/data-ml/spark-engineer.md +47 -0
- package/skills-library/specialists/frontend/angular-architect.md +52 -0
- package/skills-library/specialists/frontend/flutter-expert.md +51 -0
- package/skills-library/specialists/frontend/nextjs-developer.md +54 -0
- package/skills-library/specialists/frontend/react-native-expert.md +50 -0
- package/skills-library/specialists/frontend/vue-expert.md +51 -0
- package/skills-library/specialists/infrastructure/chaos-engineer.md +74 -0
- package/skills-library/specialists/infrastructure/cloud-architect.md +70 -0
- package/skills-library/specialists/infrastructure/database-optimizer.md +64 -0
- package/skills-library/specialists/infrastructure/devops-engineer.md +70 -0
- package/skills-library/specialists/infrastructure/kubernetes-specialist.md +52 -0
- package/skills-library/specialists/infrastructure/monitoring-expert.md +70 -0
- package/skills-library/specialists/infrastructure/sre-engineer.md +70 -0
- package/skills-library/specialists/infrastructure/terraform-engineer.md +51 -0
- package/skills-library/specialists/languages/cpp-pro.md +74 -0
- package/skills-library/specialists/languages/csharp-developer.md +69 -0
- package/skills-library/specialists/languages/dotnet-core-expert.md +54 -0
- package/skills-library/specialists/languages/golang-pro.md +51 -0
- package/skills-library/specialists/languages/java-architect.md +49 -0
- package/skills-library/specialists/languages/javascript-pro.md +68 -0
- package/skills-library/specialists/languages/kotlin-specialist.md +68 -0
- package/skills-library/specialists/languages/php-pro.md +49 -0
- package/skills-library/specialists/languages/python-pro.md +52 -0
- package/skills-library/specialists/languages/react-expert.md +51 -0
- package/skills-library/specialists/languages/rust-engineer.md +50 -0
- package/skills-library/specialists/languages/sql-pro.md +56 -0
- package/skills-library/specialists/languages/swift-expert.md +69 -0
- package/skills-library/specialists/languages/typescript-pro.md +51 -0
- package/skills-library/specialists/platform/atlassian-mcp.md +52 -0
- package/skills-library/specialists/platform/embedded-systems.md +53 -0
- package/skills-library/specialists/platform/game-developer.md +53 -0
- package/skills-library/specialists/platform/salesforce-developer.md +53 -0
- package/skills-library/specialists/platform/shopify-expert.md +49 -0
- package/skills-library/specialists/platform/wordpress-pro.md +49 -0
- package/skills-library/specialists/quality/code-documenter.md +51 -0
- package/skills-library/specialists/quality/code-reviewer.md +67 -0
- package/skills-library/specialists/quality/debugging-wizard.md +51 -0
- package/skills-library/specialists/quality/fullstack-guardian.md +51 -0
- package/skills-library/specialists/quality/legacy-modernizer.md +50 -0
- package/skills-library/specialists/quality/playwright-expert.md +65 -0
- package/skills-library/specialists/quality/spec-miner.md +56 -0
- package/skills-library/specialists/quality/test-master.md +65 -0
- package/skills-library/specialists/security/secure-code-guardian.md +55 -0
- package/skills-library/specialists/security/security-reviewer.md +53 -0
- package/skills-library/specialists/workflow/architecture-designer.md +53 -0
- package/skills-library/specialists/workflow/cli-developer.md +70 -0
- package/skills-library/specialists/workflow/feature-forge.md +65 -0
- package/skills-library/specialists/workflow/prompt-engineer.md +54 -0
- package/skills-library/specialists/workflow/the-fool.md +62 -0
- package/templates/ASSUMPTIONS.md +125 -0
- package/templates/BLOCKERS.md +73 -0
- package/templates/DECISION_LOG.md +116 -0
- package/templates/UAT.md +96 -0
- package/templates/blueprint.md +94 -0
- package/templates/brainstorm.md +185 -0
- package/templates/conscience.md +92 -0
- package/templates/fire-handoff.md +159 -0
- package/templates/metrics.md +67 -0
- package/templates/phase-prompt.md +142 -0
- package/templates/record.md +131 -0
- package/templates/review-report.md +117 -0
- package/templates/skills-index.md +157 -0
- package/templates/verification.md +149 -0
- package/templates/vision.md +79 -0
- package/validation-config.yml +793 -0
- package/version.json +7 -0
- package/workflows/execute-phase.md +732 -0
- package/workflows/handoff-session.md +678 -0
- package/workflows/new-project.md +578 -0
- package/workflows/plan-phase.md +592 -0
- package/workflows/verify-phase.md +874 -0
@@ -0,0 +1,243 @@

# Confidence-Gated Execution — Adaptive Autonomy Through Uncertainty Estimation

## The Problem

AI agents treat all tasks with the same level of autonomy. Whether the agent is implementing a well-understood CRUD endpoint or modifying a security-critical authentication flow, it proceeds with the same approach. This leads to:

- Overconfident execution on unfamiliar tasks (mistakes that could have been prevented by asking)
- Underconfident pausing on familiar tasks (unnecessary user interruptions)
- No learning signal for when to be cautious vs. autonomous

### Why It Was Hard

- Confidence is subjective — agents tend to default to "high confidence" (optimism bias)
- Need concrete, measurable signals rather than vibes
- Gate behavior must be lightweight (can't add 5 minutes of analysis before every action)
- Must integrate with existing execution flows without disrupting them
- Progressive autonomy requires tracking outcomes over time

### Impact

- Without confidence gates: 73.5% error recovery rate
- With self-evaluation (which confidence gates enable): 95% error recovery rate
- Users' auto-approve rate rises from 20% to 40% over 750 sessions when agents demonstrate appropriate caution

---

## The Solution

### Root Cause

Agents lack a structured way to estimate "how sure am I?" before acting. The SAUP framework (ACL 2025) shows that propagating uncertainty through reasoning steps significantly improves decision quality.

### Signal-Based Confidence Scoring

Instead of asking "how confident are you?" (subjective), compute confidence from concrete signals:

```
confidence = 50 (baseline)

Positive signals (increase confidence):
  + Matching skill found in library:   +20
  + Similar reflection exists:         +15
  + Tests available to verify:         +25
  + Familiar technology/framework:     +15
  + Clear, unambiguous requirements:   +15

Negative signals (decrease confidence):
  - Unfamiliar framework/library:      -20
  - No tests available to verify:      -15
  - Ambiguous/incomplete requirements: -20
  - Security-sensitive change:         -10
  - Destructive operation:             -15
```

### Three Confidence Levels

```
HIGH (>80%): Proceed Autonomously
  → Execute task directly
  → Run Self-Judge after completion
  → Log confidence in summary

MEDIUM (50-80%): Proceed with Extra Validation
  → Search reflections for similar scenarios
  → Search skills library for guidance
  → Run Self-Judge BEFORE and AFTER
  → Log uncertainty reason for future learning

LOW (<50%): Pause and Escalate
  → Search Context7 for current library docs
  → Check if this is outside trained domain
  → Ask user for guidance before proceeding
  → Create checkpoint before attempting
  → Log what made confidence low
```
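
The three levels can be read as a lookup from a score to a list of gate actions. The following is a minimal sketch under the thresholds stated above (above 80, 50 to 80, below 50); the names `GATE_ACTIONS`, `classify`, and `plan_gate` are hypothetical and are not taken from the Dominion Flow sources.

```python
# Hypothetical sketch: map a 0-100 confidence score to a level and its gate actions.
# Thresholds and action lists are copied from the three levels described above.

GATE_ACTIONS = {
    "HIGH": [
        "execute task directly",
        "run Self-Judge after completion",
        "log confidence in summary",
    ],
    "MEDIUM": [
        "search reflections for similar scenarios",
        "search skills library for guidance",
        "run Self-Judge before and after",
        "log uncertainty reason for future learning",
    ],
    "LOW": [
        "search Context7 for current library docs",
        "check whether the task is outside the trained domain",
        "ask user for guidance before proceeding",
        "create checkpoint before attempting",
        "log what made confidence low",
    ],
}


def classify(score: int) -> str:
    """Return HIGH for scores above 80, MEDIUM for 50-80, LOW below 50."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be in 0-100, got {score}")
    if score > 80:
        return "HIGH"
    if score >= 50:
        return "MEDIUM"
    return "LOW"


def plan_gate(score: int) -> dict:
    """Bundle the level and the ordered actions the gate asks for."""
    level = classify(score)
    return {"score": score, "level": level, "actions": GATE_ACTIONS[level]}


if __name__ == "__main__":
    for example_score in (85, 80, 50, 45):
        plan = plan_gate(example_score)
        print(plan["score"], plan["level"], "->", plan["actions"][0])
```

Note the boundary behaviour: a score of exactly 80 lands in MEDIUM and a score of exactly 50 also lands in MEDIUM, which matches the `>80` and `<50` limits above.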

### Integration Points

**In fire-3-execute (executor context injection):**
```xml
<confidence_gates>
Before each plan task, estimate confidence using signal scoring.
Log confidence level and signals in RECORD.md.
Gate behavior: HIGH=proceed, MEDIUM=extra-validation, LOW=escalate.
</confidence_gates>
```

**In fire-loop (recitation block):**
```markdown
## Confidence Check (v5.0)
- Score: {0-100} — {HIGH/MEDIUM/LOW}
- Signals: {what raised or lowered confidence}
- Action: {proceed / extra-validation / escalate}
```

**In RECORD.md (execution log):**
```yaml
confidence_log:
  - task: 1
    score: 85
    level: HIGH
    signals: [skill_match, tests_available, familiar_tech]
  - task: 3
    score: 45
    level: LOW
    signals: [unfamiliar_framework, no_tests, ambiguous_requirements]
    action: "Asked user for clarification on WebSocket auth approach"
```

### Code Example: Confidence Estimation

```python
# Conceptual implementation (adapt to your agent framework)

# Example values only; list the technologies your agent has a track record with.
KNOWN_TECHNOLOGIES = {"python", "typescript", "react", "postgresql"}


def estimate_confidence(task, skills_library, reflections, test_suite):
    """Return (score, level, signals) computed from concrete task signals."""
    score = 50  # baseline
    signals = []

    # Check skill library
    matching_skills = skills_library.search(task.description)
    if matching_skills:
        score += 20
        signals.append("skill_match")

    # Check reflections
    matching_reflections = reflections.search(task.description)
    if matching_reflections:
        score += 15
        signals.append("reflection_match")

    # Check test availability
    if test_suite.has_tests_for(task.affected_files):
        score += 25
        signals.append("tests_available")
    else:
        score -= 15
        signals.append("no_tests")

    # Check technology familiarity
    if task.technology in KNOWN_TECHNOLOGIES:
        score += 15
        signals.append("familiar_tech")
    else:
        score -= 20
        signals.append("unfamiliar_framework")

    # Check requirement clarity
    if task.has_clear_acceptance_criteria:
        score += 15
        signals.append("clear_requirements")
    else:
        score -= 20
        signals.append("ambiguous_requirements")

    # Security/destructive checks
    if task.is_security_sensitive:
        score -= 10
        signals.append("security_sensitive")
    if task.is_destructive:
        score -= 15
        signals.append("destructive_operation")

    # Clamp to 0-100
    score = max(0, min(100, score))

    # HIGH: above 80, MEDIUM: 50-80, LOW: below 50
    level = "HIGH" if score > 80 else "MEDIUM" if score >= 50 else "LOW"
    return score, level, signals
```

---

## Testing the Fix

### Scenario Tests

| Task | Expected Score | Expected Level | Expected Action |
|------|---------------|----------------|-----------------|
| Add CRUD endpoint (known stack, tests exist, skill found) | 90+ | HIGH | Proceed |
| Implement WebSocket auth (unfamiliar, no tests, no skill) | 30-40 | LOW | Escalate |
| Fix CSS layout (familiar, no tests, clear requirement) | 65-75 | MEDIUM | Extra validation |
| Delete user data migration (familiar, tests, destructive) | 55-65 | MEDIUM | Extra validation |
| Integrate new payment provider (unfamiliar, security) | 25-35 | LOW | Escalate |

### Verification

1. Execute tasks of varying familiarity
2. Verify HIGH-confidence tasks proceed without interruption
3. Verify MEDIUM-confidence tasks trigger reflection/skill search
4. Verify LOW-confidence tasks pause for user input
5. Check confidence log in RECORD.md matches observed behavior

---

## Prevention

- Calibrate signals based on actual outcomes (if LOW-confidence tasks consistently succeed, adjust weights)
- Don't let confidence become a checkbox (the score should reflect genuine uncertainty)
- Review confidence logs periodically to identify systematic biases
- Track confidence vs. outcome correlation to improve scoring

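
Calibration means comparing logged confidence levels against what actually happened. The sketch below is one possible way to do that, assuming each log entry carries a `level` plus a boolean `succeeded` field. The `succeeded` field, the function names, and the `min_high` / `max_low` cut-offs are all assumptions made for illustration; the RECORD.md log format in this pattern does not define an outcome field.

```python
from collections import defaultdict


def success_rate_by_level(entries):
    """Return {level: fraction of tasks that succeeded} for each level seen."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for entry in entries:
        totals[entry["level"]] += 1
        if entry["succeeded"]:
            successes[entry["level"]] += 1
    return {level: successes[level] / totals[level] for level in totals}


def calibration_warnings(rates, min_high=0.9, max_low=0.7):
    """Flag levels whose observed outcomes contradict their label.

    min_high and max_low are illustrative cut-offs, not values from the pattern.
    """
    warnings = []
    if "HIGH" in rates and rates["HIGH"] < min_high:
        warnings.append("HIGH tasks fail too often: scoring looks overconfident")
    if "LOW" in rates and rates["LOW"] > max_low:
        warnings.append("LOW tasks mostly succeed: scoring looks over-cautious")
    return warnings


if __name__ == "__main__":
    log = [
        {"level": "HIGH", "succeeded": True},
        {"level": "HIGH", "succeeded": False},
        {"level": "LOW", "succeeded": True},
        {"level": "LOW", "succeeded": True},
    ]
    rates = success_rate_by_level(log)
    print(rates)
    for warning in calibration_warnings(rates):
        print("WARNING:", warning)
```

A warning here is only a prompt to revisit the signal weights; with few logged tasks per level the rates are too noisy to act on.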

---

## Related Patterns

- [AGENT_SELF_IMPROVEMENT_LOOP](./AGENT_SELF_IMPROVEMENT_LOOP.md) - Confidence is upgrade #6 of the loop
- [REFLEXION_MEMORY_PATTERN](./REFLEXION_MEMORY_PATTERN.md) - Reflections feed confidence scoring (+15)
- [SELF_QUESTIONING_TASK_GENERATION](./SELF_QUESTIONING_TASK_GENERATION.md) - Self-Judge runs at MEDIUM+ confidence
- [CONFIDENCE_ANNOTATION_PATTERN](./CONFIDENCE_ANNOTATION_PATTERN.md) - Related annotation approach

---

## Common Mistakes to Avoid

- Setting baseline too high (70+) — makes everything look "confident"
- Ignoring negative signals — agents naturally want to proceed
- Treating confidence gates as hard blockers — they're advisory; the agent can override with justification
- Not logging confidence scores — you need the data to calibrate over time
- Applying same weights to all task types — security tasks should weight "security_sensitive" more heavily
- Making confidence estimation take >30 seconds — speed is critical for adoption

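
The point about task-specific weights can be illustrated with a weight table plus per-task-type overrides. This is a sketch only: the base weights are the ones listed under Signal-Based Confidence Scoring, while `TASK_TYPE_OVERRIDES`, the `"security"` task type, and the `-25` override value are hypothetical and chosen purely for illustration.

```python
# Base weights copied from the signal table in this pattern.
BASE_WEIGHTS = {
    "skill_match": 20,
    "reflection_match": 15,
    "tests_available": 25,
    "familiar_tech": 15,
    "clear_requirements": 15,
    "unfamiliar_framework": -20,
    "no_tests": -15,
    "ambiguous_requirements": -20,
    "security_sensitive": -10,
    "destructive_operation": -15,
}

# Hypothetical per-task-type overrides; the -25 value is illustrative only.
TASK_TYPE_OVERRIDES = {
    "security": {"security_sensitive": -25},
}


def score_signals(signals, task_type="default", baseline=50):
    """Sum the weights of the observed signals and clamp the result to 0-100."""
    weights = {**BASE_WEIGHTS, **TASK_TYPE_OVERRIDES.get(task_type, {})}
    score = baseline + sum(weights[name] for name in signals)
    return max(0, min(100, score))


if __name__ == "__main__":
    observed = ["familiar_tech", "tests_available", "security_sensitive"]
    print("default :", score_signals(observed))
    print("security:", score_signals(observed, task_type="security"))
```

With the same observed signals, the override moves the task from the top of the MEDIUM band (80) to the middle of it (65), so a security task needs more positive evidence before it can reach HIGH.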

---

## Resources

- SAUP (ACL 2025): Uncertainty propagation through reasoning steps
- Anthropic measurement: Progressive autonomy 20%→40% over 750 sessions
- Snorkel AI: 95% vs 73.5% error recovery with self-evaluation
- Dominion Flow implementation: `fire-3-execute.md` confidence_gates, `fire-loop.md` confidence check

---

## Time to Implement

**2-3 hours** — Add confidence scoring to executor context, integrate into loop recitation, add RECORD.md logging

## Difficulty Level

Stars: 3/5 — The signal scoring is simple. The challenge is **calibration**: getting the weights right so the agent isn't overconfident or over-cautious. Requires tracking outcomes over multiple sessions.

---

**Author Notes:**
The most counterintuitive finding: agents that ask for help more often perform better overall. LOW-confidence escalation isn't a failure — it's the agent saying "I know what I don't know." The 95% error recovery rate comes not from avoiding errors, but from knowing when to pause and seek guidance. Confidence gates formalize the difference between "I know this" and "I'm guessing" — and that distinction is worth a 21.5 percentage point improvement in recovery rate.

@@ -0,0 +1,308 @@

# Evidence-Based Validation - Verification Before Completion

## The Problem

AI agents (and humans) often claim work is "done" without actually verifying it works. This leads to:
- "Tests pass" claims without running tests
- "Bug fixed" assertions without reproduction verification
- Premature completion claims that waste time on rework

### Why It Was Hard

- Pressure to deliver quickly encourages shortcuts
- Verification feels redundant after writing code
- Confidence in one's work creates blind spots
- No systematic enforcement of "prove it works"

### Impact

- False completion claims waste reviewer time
- Bugs reach production that should have been caught
- Trust erodes when "done" doesn't mean "done"
- Rework costs exceed original implementation time

---

## The Solution

**Evidence Before Assertion** - Never claim something works without captured proof.

### Core Principle

```
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  WRONG: "The tests pass" → then run tests                       │
│  RIGHT: Run tests → "The tests pass (output: 45/45)"            │
│                                                                 │
│  WRONG: "I fixed the bug" → hope it works                       │
│  RIGHT: Reproduce → fix → verify fix → "Bug fixed (proof)"      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### The Verification Protocol

#### Step 1: Identify Claims

Before declaring work complete, list all implicit claims:

```markdown
## Claims to Verify

| # | Implicit Claim | Evidence Required |
|---|----------------|-------------------|
| 1 | Code compiles | Build output with exit code 0 |
| 2 | Tests pass | Test runner output showing pass count |
| 3 | Feature works | Demo or API test with expected response |
| 4 | No regressions | Full test suite output |
| 5 | Types correct | TypeScript compiler output |
```
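
The claims table can also be tracked as data, so that "complete" is only reachable once every claim has evidence attached. This is a minimal sketch; the `Claim` class and the `unverified` and `may_claim_complete` helpers are hypothetical names introduced here, not part of the `/fire-double-check` command.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    """One implicit claim and the proof it needs (names are illustrative)."""
    text: str
    evidence_required: str
    evidence: Optional[str] = None  # captured output; None until verified


def unverified(claims: List[Claim]) -> List[Claim]:
    """Return the claims that have no captured evidence yet."""
    return [claim for claim in claims if not claim.evidence]


def may_claim_complete(claims: List[Claim]) -> bool:
    """Work may be called complete only when every listed claim has evidence."""
    return bool(claims) and not unverified(claims)


if __name__ == "__main__":
    claims = [
        Claim("Code compiles", "Build output with exit code 0"),
        Claim("Tests pass", "Test runner output showing pass count"),
    ]
    claims[0].evidence = "npm run build finished with exit code 0"
    print("Still unverified:", [claim.text for claim in unverified(claims)])
    print("May claim complete:", may_claim_complete(claims))
```

An empty claim list deliberately counts as not complete: listing no claims is itself a sign that Step 1 was skipped.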

#### Step 2: Execute Verification Commands

Run ALL commands and capture output:

```bash
# Build check
echo "=== BUILD CHECK ==="
npm run build 2>&1
echo "Exit code: $?"

# Test check
echo "=== TEST CHECK ==="
npm run test 2>&1
echo "Exit code: $?"

# Type check
echo "=== TYPE CHECK ==="
npm run typecheck 2>&1
echo "Exit code: $?"

# Lint check
echo "=== LINT CHECK ==="
npm run lint 2>&1
echo "Exit code: $?"
```

#### Step 3: Document Results with Evidence

````markdown
## Verification Results

### Build Check
**Command:** `npm run build`
**Exit Code:** 0
**Output:**
```
> project@1.0.0 build
> tsc && vite build

vite v5.0.0 building for production...
✓ 234 modules transformed.
dist/index.html          0.45 kB │ gzip:  0.29 kB
dist/assets/index.js   145.67 kB │ gzip: 47.23 kB
✓ built in 2.34s
```
**Verdict:** PASS

### Test Check
**Command:** `npm run test`
**Exit Code:** 0
**Output:**
```
PASS src/auth/login.test.ts
PASS src/api/users.test.ts
PASS src/utils/helpers.test.ts

Test Suites: 3 passed, 3 total
Tests:       45 passed, 45 total
Time:        3.42s
```
**Verdict:** PASS (45/45 tests)
````

#### Step 4: Honesty Protocol

Before claiming complete, answer honestly:

```markdown
## Honesty Check

### Question 1: Did I actually run these commands?
- [x] Yes, all commands executed in this session
- [ ] No, I assumed based on previous runs

### Question 2: Am I interpreting output honestly?
- [x] Output clearly shows success
- [ ] Output is ambiguous, I'm assuming

### Question 3: What am I NOT checking?
- [ ] Nothing unchecked
- [x] E2E tests not run (documented limitation)

### Question 4: Would a skeptic be convinced?
- [x] Yes, evidence is clear
- [ ] No, more verification needed
```

#### Step 5: Final Verdict

```markdown
## Double-Check Summary

| Check | Command | Result | Evidence |
|-------|---------|--------|----------|
| Build | `npm run build` | PASS | Exit 0, no errors |
| Tests | `npm run test` | PASS | 45/45 passing |
| Types | `npm run typecheck` | PASS | 0 errors |
| Lint | `npm run lint` | PASS | 0 warnings |

**STATUS:** VERIFIED
**Confidence:** HIGH

You may now claim this work is complete.
```
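
The summary table and status line can be derived mechanically from the captured results rather than typed by hand. The sketch below assumes each result is a dict with `check`, `command`, `exit_code`, and `evidence` keys; that shape, the `final_verdict` name, and the `NOT VERIFIED` label are assumptions made here (the pattern itself only shows the `VERIFIED` status).

```python
def final_verdict(results):
    """Build the Double-Check Summary table and an overall status.

    results: list of dicts with keys check, command, exit_code, evidence.
    Returns (status, markdown_text).
    """
    lines = [
        "| Check | Command | Result | Evidence |",
        "|-------|---------|--------|----------|",
    ]
    all_passed = True
    for result in results:
        passed = result["exit_code"] == 0
        all_passed = all_passed and passed
        verdict = "PASS" if passed else "FAIL"
        lines.append(
            f"| {result['check']} | `{result['command']}` | {verdict} | {result['evidence']} |"
        )
    # No results at all means nothing was verified, so never report VERIFIED.
    status = "VERIFIED" if results and all_passed else "NOT VERIFIED"
    lines.append("")
    lines.append(f"**STATUS:** {status}")
    return status, "\n".join(lines)


if __name__ == "__main__":
    status, table = final_verdict([
        {"check": "Build", "command": "npm run build", "exit_code": 0,
         "evidence": "Exit 0, no errors"},
        {"check": "Tests", "command": "npm run test", "exit_code": 1,
         "evidence": "42/45 passing"},
    ])
    print(table)
```

Deriving the status from exit codes keeps the verdict tied to evidence: a single failing check is enough to withhold `VERIFIED`.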

---

## Implementation

### Verification Command Template

```javascript
const { exec: execCallback } = require('node:child_process');

// Map each check name to the command that implements it.
const COMMANDS = {
  build: 'npm run build',
  test: 'npm run test',
  typecheck: 'npm run typecheck',
  lint: 'npm run lint',
  coverage: 'npm run test:coverage',
  security: 'npm audit',
  e2e: 'npm run test:e2e'
};

// Run a shell command and resolve with its output and exit code.
// Never rejects, so a failing check is recorded instead of thrown.
function exec(command) {
  return new Promise((resolve) => {
    execCallback(command, (error, stdout, stderr) => {
      // On a non-zero exit, error.code holds the exit status.
      const exitCode = error ? (typeof error.code === 'number' ? error.code : 1) : 0;
      resolve({ stdout, stderr, exitCode });
    });
  });
}

async function doubleCheck(checks = ['build', 'test', 'typecheck', 'lint']) {
  const results = {};

  for (const check of checks) {
    const command = COMMANDS[check];
    if (!command) {
      throw new Error(`Unknown check: ${check}`);
    }
    console.log(`=== ${check.toUpperCase()} CHECK ===`);

    const { stdout, stderr, exitCode } = await exec(command);

    results[check] = {
      command,
      exitCode,
      output: stdout + stderr,
      verdict: exitCode === 0 ? 'PASS' : 'FAIL'
    };

    console.log(`Exit code: ${exitCode}`);
    console.log(results[check].output);
  }

  return results;
}
```

### Anti-Pattern Detection

```javascript
// BAD: Claiming without evidence
function reviewCodeWithoutEvidence() {
  // ... look at code ...
  return "Tests pass"; // NO EVIDENCE!
}

// GOOD: Evidence-based claim
// Assumes `exec` resolves to { stdout, stderr, exitCode } and that the test
// runner prints a Jest-style summary line ("Tests: 45 passed, 45 total").
async function reviewCodeWithEvidence() {
  const { stdout, stderr, exitCode } = await exec('npm run test');
  const summary = /Tests:.*?(\d+) passed, (\d+) total/.exec(stdout + stderr);
  if (exitCode !== 0 || !summary) {
    return `Tests NOT verified (exit code ${exitCode})`;
  }
  return `Tests pass (${summary[1]}/${summary[2]})`;
}
```

---

## Testing the Pattern

### Before (No Verification)
```
Claim: "All tests pass"
Reality: 3 tests failing
Result: Bug reaches production
Cost: 4 hours debugging + hotfix
```

### After (Evidence-Based)
```
Intended claim: "All tests pass"
Evidence: Test output shows 42/45 passing
Reality: 3 tests failing (caught immediately)
Result: Fixed before merge
Cost: 15 minutes
```

---

## Prevention

### When to Use Evidence-Based Validation

- **Always:** Before claiming any work is complete
- **Always:** Before creating a PR
- **Always:** Before merging to main
- **Always:** After fixing bugs (verify the fix)

### Verification Depth Levels

| Depth | Checks | Use Case |
|-------|--------|----------|
| Quick | build, lint | Minor changes |
| Standard | build, test, types, lint | Normal PRs |
| Deep | All + coverage + security + E2E | Production releases |

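
The depth table maps naturally onto a lookup from depth name to the list of checks to run. A minimal sketch follows; `DEPTH_CHECKS` and `checks_for` are hypothetical names, and the check identifiers reuse the keys of the `COMMANDS` object above (`typecheck` for "types", `e2e` for "E2E").

```python
# Check names reuse the keys of the COMMANDS table in this pattern.
STANDARD_CHECKS = ["build", "test", "typecheck", "lint"]

DEPTH_CHECKS = {
    "quick": ["build", "lint"],
    "standard": STANDARD_CHECKS,
    # "Deep" is every standard check plus coverage, security, and E2E.
    "deep": STANDARD_CHECKS + ["coverage", "security", "e2e"],
}


def checks_for(depth: str) -> list:
    """Return the ordered checks for a depth level (case-insensitive)."""
    try:
        # Return a copy so callers cannot mutate the shared table.
        return list(DEPTH_CHECKS[depth.lower()])
    except KeyError:
        valid = ", ".join(DEPTH_CHECKS)
        raise ValueError(f"unknown depth {depth!r}; expected one of: {valid}") from None


if __name__ == "__main__":
    for depth in ("Quick", "Standard", "Deep"):
        print(depth, "->", checks_for(depth))
```

Failing loudly on an unknown depth is a deliberate choice: silently falling back to a shallower level would undercut the rule that all relevant checks must run.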

---

## Related Patterns

- [Multi-Perspective Code Review](./MULTI_PERSPECTIVE_CODE_REVIEW.md)
- [Honesty Protocols](./HONESTY_PROTOCOLS.md)
- [60-Point Validation Checklist](./VALIDATION_CHECKLIST.md)

---

## Common Mistakes to Avoid

- **Skipping verification to save time** - Rework costs more
- **Assuming previous run is still valid** - Always re-run
- **Ignoring warnings** - Warnings become errors
- **Partial verification** - Run ALL relevant checks
- **Trusting memory** - Capture actual output

---

## Resources

- [superpowers:verification-before-completion](https://github.com/anthropics/claude-code-plugins)
- [Test-Driven Development patterns](./TDD_PATTERNS.md)
- [Continuous Integration best practices](../deployment-security/CI_CD_PATTERNS.md)

---

## Time to Implement

**Per verification:** 2-5 minutes
**ROI:** Prevents 1-4 hour debugging sessions

## Difficulty Level

⭐ (1/5) - Simple to implement, requires discipline

---

**Author Notes:**
This pattern seems obvious but is violated constantly. The key insight is that **verification must be mandatory, not optional**. By requiring captured output as evidence, you make unverified claims easy to spot and hard to make.

The honesty protocol questions force reflection before completion. Question 3 ("What am I NOT checking?") is particularly powerful - it surfaces blind spots before they become problems.

**Implementation in Dominion Flow:** Available via `/fire-double-check` command.