@thierrynakoa/fire-flow 10.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (215)
  1. package/.claude-plugin/plugin.json +64 -0
  2. package/ARCHITECTURE-DIAGRAM.md +440 -0
  3. package/COMMAND-REFERENCE.md +172 -0
  4. package/DOMINION-FLOW-OVERVIEW.md +421 -0
  5. package/LICENSE +21 -0
  6. package/QUICK-START.md +351 -0
  7. package/README.md +398 -0
  8. package/TROUBLESHOOTING.md +264 -0
  9. package/agents/fire-codebase-mapper.md +484 -0
  10. package/agents/fire-debugger.md +535 -0
  11. package/agents/fire-executor.md +949 -0
  12. package/agents/fire-fact-checker.md +276 -0
  13. package/agents/fire-learncoding-explainer.md +237 -0
  14. package/agents/fire-learncoding-walker.md +147 -0
  15. package/agents/fire-planner.md +675 -0
  16. package/agents/fire-project-researcher.md +155 -0
  17. package/agents/fire-research-synthesizer.md +166 -0
  18. package/agents/fire-researcher.md +723 -0
  19. package/agents/fire-reviewer.md +499 -0
  20. package/agents/fire-roadmapper.md +203 -0
  21. package/agents/fire-verifier.md +880 -0
  22. package/bin/cli.js +208 -0
  23. package/commands/fire-0-orient.md +476 -0
  24. package/commands/fire-1-new.md +281 -0
  25. package/commands/fire-1a-discuss.md +455 -0
  26. package/commands/fire-2-plan.md +527 -0
  27. package/commands/fire-3-execute.md +1303 -0
  28. package/commands/fire-4-verify.md +845 -0
  29. package/commands/fire-5-handoff.md +515 -0
  30. package/commands/fire-6-resume.md +501 -0
  31. package/commands/fire-7-review.md +409 -0
  32. package/commands/fire-add-new-skill.md +598 -0
  33. package/commands/fire-analytics.md +499 -0
  34. package/commands/fire-assumptions.md +78 -0
  35. package/commands/fire-autonomous.md +528 -0
  36. package/commands/fire-brainstorm.md +413 -0
  37. package/commands/fire-complete-milestone.md +270 -0
  38. package/commands/fire-dashboard.md +375 -0
  39. package/commands/fire-debug.md +663 -0
  40. package/commands/fire-discover.md +616 -0
  41. package/commands/fire-double-check.md +460 -0
  42. package/commands/fire-execute-plan.md +182 -0
  43. package/commands/fire-learncoding.md +242 -0
  44. package/commands/fire-loop-resume.md +272 -0
  45. package/commands/fire-loop-stop.md +198 -0
  46. package/commands/fire-loop.md +1168 -0
  47. package/commands/fire-map-codebase.md +313 -0
  48. package/commands/fire-new-milestone.md +356 -0
  49. package/commands/fire-reflect.md +235 -0
  50. package/commands/fire-research.md +246 -0
  51. package/commands/fire-search.md +330 -0
  52. package/commands/fire-security-audit-repo.md +293 -0
  53. package/commands/fire-security-scan.md +484 -0
  54. package/commands/fire-session-summary.md +252 -0
  55. package/commands/fire-skills-diff.md +506 -0
  56. package/commands/fire-skills-history.md +388 -0
  57. package/commands/fire-skills-rollback.md +408 -0
  58. package/commands/fire-skills-sync.md +470 -0
  59. package/commands/fire-test.md +520 -0
  60. package/commands/fire-todos.md +335 -0
  61. package/commands/fire-transition.md +186 -0
  62. package/commands/fire-update.md +312 -0
  63. package/commands/fire-verify-uat.md +146 -0
  64. package/commands/fire-vuln-scan.md +493 -0
  65. package/hooks/hooks.json +16 -0
  66. package/hooks/run-hook.cmd +69 -0
  67. package/hooks/run-hook.sh +8 -0
  68. package/hooks/run-session-end.cmd +49 -0
  69. package/hooks/run-session-end.sh +7 -0
  70. package/hooks/session-end.sh +90 -0
  71. package/hooks/session-start.sh +111 -0
  72. package/package.json +52 -0
  73. package/plugin.json +7 -0
  74. package/references/auto-skill-extraction.md +136 -0
  75. package/references/behavioral-directives.md +365 -0
  76. package/references/blocker-tracking.md +155 -0
  77. package/references/checkpoints.md +165 -0
  78. package/references/circuit-breaker.md +410 -0
  79. package/references/context-engineering.md +587 -0
  80. package/references/decision-time-guidance.md +289 -0
  81. package/references/error-classification.md +326 -0
  82. package/references/execution-mode-intelligence.md +242 -0
  83. package/references/git-integration.md +217 -0
  84. package/references/honesty-protocols.md +304 -0
  85. package/references/integration-architecture.md +470 -0
  86. package/references/issue-to-pr-pipeline.md +150 -0
  87. package/references/metrics-and-trends.md +234 -0
  88. package/references/playwright-e2e-testing.md +326 -0
  89. package/references/questioning.md +125 -0
  90. package/references/research-improvements.md +110 -0
  91. package/references/skills-usage-guide.md +429 -0
  92. package/references/tdd.md +131 -0
  93. package/references/testing-enforcement.md +192 -0
  94. package/references/ui-brand.md +383 -0
  95. package/references/validation-checklist.md +456 -0
  96. package/references/verification-patterns.md +187 -0
  97. package/references/warrior-principles.md +173 -0
  98. package/skills-library/SKILLS-INDEX.md +588 -0
  99. package/skills-library/_general/frontend/html-visual-reports.md +292 -0
  100. package/skills-library/_general/methodology/debug-swarm-researcher-escape-hatch.md +240 -0
  101. package/skills-library/_general/methodology/learncoding-agentic-pattern.md +114 -0
  102. package/skills-library/_general/methodology/shell-autonomous-loop-fixplan.md +238 -0
  103. package/skills-library/basics/api-rest-basics.md +162 -0
  104. package/skills-library/basics/env-variables.md +96 -0
  105. package/skills-library/basics/error-handling-basics.md +125 -0
  106. package/skills-library/basics/git-commit-conventions.md +106 -0
  107. package/skills-library/basics/readme-template.md +108 -0
  108. package/skills-library/common-tasks/async-await-patterns.md +157 -0
  109. package/skills-library/common-tasks/auth-jwt-basics.md +164 -0
  110. package/skills-library/common-tasks/database-schema-design.md +166 -0
  111. package/skills-library/common-tasks/file-upload-basics.md +166 -0
  112. package/skills-library/common-tasks/form-validation.md +159 -0
  113. package/skills-library/debugging/FAILURE_TAXONOMY_CLASSIFICATION.md +117 -0
  114. package/skills-library/debugging/THREE_AGENT_HYPOTHESIS_DEBUGGING.md +86 -0
  115. package/skills-library/methodology/BREATH_BASED_PARALLEL_EXECUTION.md +678 -0
  116. package/skills-library/methodology/CONFIDENCE_GATED_EXECUTION.md +243 -0
  117. package/skills-library/methodology/EVIDENCE_BASED_VALIDATION.md +308 -0
  118. package/skills-library/methodology/MULTI_PERSPECTIVE_CODE_REVIEW.md +330 -0
  119. package/skills-library/methodology/PATH_VERIFICATION_GATE.md +211 -0
  120. package/skills-library/methodology/REFLEXION_MEMORY_PATTERN.md +183 -0
  121. package/skills-library/methodology/RESEARCH_BACKED_WORKFLOW_UPGRADE.md +263 -0
  122. package/skills-library/methodology/SABBATH_REST_PATTERN.md +267 -0
  123. package/skills-library/methodology/STONE_AND_SCAFFOLD.md +220 -0
  124. package/skills-library/performance/cache-augmented-generation.md +172 -0
  125. package/skills-library/quality-safety/debugging-steps.md +147 -0
  126. package/skills-library/quality-safety/deployment-checklist.md +155 -0
  127. package/skills-library/quality-safety/security-checklist.md +204 -0
  128. package/skills-library/quality-safety/testing-basics.md +180 -0
  129. package/skills-library/security/agent-security-scanner.md +445 -0
  130. package/skills-library/specialists/api-architecture/api-designer.md +49 -0
  131. package/skills-library/specialists/api-architecture/graphql-architect.md +49 -0
  132. package/skills-library/specialists/api-architecture/mcp-developer.md +51 -0
  133. package/skills-library/specialists/api-architecture/microservices-architect.md +50 -0
  134. package/skills-library/specialists/api-architecture/websocket-engineer.md +48 -0
  135. package/skills-library/specialists/backend/django-expert.md +52 -0
  136. package/skills-library/specialists/backend/fastapi-expert.md +52 -0
  137. package/skills-library/specialists/backend/laravel-specialist.md +52 -0
  138. package/skills-library/specialists/backend/nestjs-expert.md +51 -0
  139. package/skills-library/specialists/backend/rails-expert.md +53 -0
  140. package/skills-library/specialists/backend/spring-boot-engineer.md +56 -0
  141. package/skills-library/specialists/data-ml/fine-tuning-expert.md +48 -0
  142. package/skills-library/specialists/data-ml/ml-pipeline.md +47 -0
  143. package/skills-library/specialists/data-ml/pandas-pro.md +47 -0
  144. package/skills-library/specialists/data-ml/rag-architect.md +51 -0
  145. package/skills-library/specialists/data-ml/spark-engineer.md +47 -0
  146. package/skills-library/specialists/frontend/angular-architect.md +52 -0
  147. package/skills-library/specialists/frontend/flutter-expert.md +51 -0
  148. package/skills-library/specialists/frontend/nextjs-developer.md +54 -0
  149. package/skills-library/specialists/frontend/react-native-expert.md +50 -0
  150. package/skills-library/specialists/frontend/vue-expert.md +51 -0
  151. package/skills-library/specialists/infrastructure/chaos-engineer.md +74 -0
  152. package/skills-library/specialists/infrastructure/cloud-architect.md +70 -0
  153. package/skills-library/specialists/infrastructure/database-optimizer.md +64 -0
  154. package/skills-library/specialists/infrastructure/devops-engineer.md +70 -0
  155. package/skills-library/specialists/infrastructure/kubernetes-specialist.md +52 -0
  156. package/skills-library/specialists/infrastructure/monitoring-expert.md +70 -0
  157. package/skills-library/specialists/infrastructure/sre-engineer.md +70 -0
  158. package/skills-library/specialists/infrastructure/terraform-engineer.md +51 -0
  159. package/skills-library/specialists/languages/cpp-pro.md +74 -0
  160. package/skills-library/specialists/languages/csharp-developer.md +69 -0
  161. package/skills-library/specialists/languages/dotnet-core-expert.md +54 -0
  162. package/skills-library/specialists/languages/golang-pro.md +51 -0
  163. package/skills-library/specialists/languages/java-architect.md +49 -0
  164. package/skills-library/specialists/languages/javascript-pro.md +68 -0
  165. package/skills-library/specialists/languages/kotlin-specialist.md +68 -0
  166. package/skills-library/specialists/languages/php-pro.md +49 -0
  167. package/skills-library/specialists/languages/python-pro.md +52 -0
  168. package/skills-library/specialists/languages/react-expert.md +51 -0
  169. package/skills-library/specialists/languages/rust-engineer.md +50 -0
  170. package/skills-library/specialists/languages/sql-pro.md +56 -0
  171. package/skills-library/specialists/languages/swift-expert.md +69 -0
  172. package/skills-library/specialists/languages/typescript-pro.md +51 -0
  173. package/skills-library/specialists/platform/atlassian-mcp.md +52 -0
  174. package/skills-library/specialists/platform/embedded-systems.md +53 -0
  175. package/skills-library/specialists/platform/game-developer.md +53 -0
  176. package/skills-library/specialists/platform/salesforce-developer.md +53 -0
  177. package/skills-library/specialists/platform/shopify-expert.md +49 -0
  178. package/skills-library/specialists/platform/wordpress-pro.md +49 -0
  179. package/skills-library/specialists/quality/code-documenter.md +51 -0
  180. package/skills-library/specialists/quality/code-reviewer.md +67 -0
  181. package/skills-library/specialists/quality/debugging-wizard.md +51 -0
  182. package/skills-library/specialists/quality/fullstack-guardian.md +51 -0
  183. package/skills-library/specialists/quality/legacy-modernizer.md +50 -0
  184. package/skills-library/specialists/quality/playwright-expert.md +65 -0
  185. package/skills-library/specialists/quality/spec-miner.md +56 -0
  186. package/skills-library/specialists/quality/test-master.md +65 -0
  187. package/skills-library/specialists/security/secure-code-guardian.md +55 -0
  188. package/skills-library/specialists/security/security-reviewer.md +53 -0
  189. package/skills-library/specialists/workflow/architecture-designer.md +53 -0
  190. package/skills-library/specialists/workflow/cli-developer.md +70 -0
  191. package/skills-library/specialists/workflow/feature-forge.md +65 -0
  192. package/skills-library/specialists/workflow/prompt-engineer.md +54 -0
  193. package/skills-library/specialists/workflow/the-fool.md +62 -0
  194. package/templates/ASSUMPTIONS.md +125 -0
  195. package/templates/BLOCKERS.md +73 -0
  196. package/templates/DECISION_LOG.md +116 -0
  197. package/templates/UAT.md +96 -0
  198. package/templates/blueprint.md +94 -0
  199. package/templates/brainstorm.md +185 -0
  200. package/templates/conscience.md +92 -0
  201. package/templates/fire-handoff.md +159 -0
  202. package/templates/metrics.md +67 -0
  203. package/templates/phase-prompt.md +142 -0
  204. package/templates/record.md +131 -0
  205. package/templates/review-report.md +117 -0
  206. package/templates/skills-index.md +157 -0
  207. package/templates/verification.md +149 -0
  208. package/templates/vision.md +79 -0
  209. package/validation-config.yml +793 -0
  210. package/version.json +7 -0
  211. package/workflows/execute-phase.md +732 -0
  212. package/workflows/handoff-session.md +678 -0
  213. package/workflows/new-project.md +578 -0
  214. package/workflows/plan-phase.md +592 -0
  215. package/workflows/verify-phase.md +874 -0
@@ -0,0 +1,243 @@
1
+ # Confidence-Gated Execution — Adaptive Autonomy Through Uncertainty Estimation
2
+
3
+ ## The Problem
4
+
5
+ AI agents treat all tasks with the same level of autonomy. Whether the agent is implementing a well-understood CRUD endpoint or modifying a security-critical authentication flow, it proceeds with the same approach. This leads to:
6
+
7
+ - Overconfident execution on unfamiliar tasks (mistakes that could have been prevented by asking)
8
+ - Underconfident pausing on familiar tasks (unnecessary user interruptions)
9
+ - No learning signal for when to be cautious vs. autonomous
10
+
11
+ ### Why It Was Hard
12
+
13
+ - Confidence is subjective — agents tend to default to "high confidence" (optimism bias)
14
+ - Need concrete, measurable signals rather than vibes
15
+ - Gate behavior must be lightweight (can't add 5 minutes of analysis before every action)
16
+ - Must integrate with existing execution flows without disrupting them
17
+ - Progressive autonomy requires tracking outcomes over time
18
+
19
+ ### Impact
20
+
21
+ - Without confidence gates: 73.5% error recovery rate
22
+ - With self-evaluation (which confidence gates enable): 95% error recovery rate
23
+ - Users' auto-approve rate increases from 20% to 40% over 750 sessions when agents demonstrate appropriate caution
24
+
25
+ ---
26
+
27
+ ## The Solution
28
+
29
+ ### Root Cause
30
+
31
+ Agents lack a structured way to estimate "how sure am I?" before acting. The SAUP framework (ACL 2025) shows that propagating uncertainty through reasoning steps significantly improves decision quality.
32
+
33
+ ### Signal-Based Confidence Scoring
34
+
35
+ Instead of asking "how confident are you?" (subjective), compute confidence from concrete signals:
36
+
37
+ ```
38
+ confidence = 50 (baseline)
39
+
40
+ Positive signals (increase confidence):
41
+ + Matching skill found in library: +20
42
+ + Similar reflection exists: +15
43
+ + Tests available to verify: +25
44
+ + Familiar technology/framework: +15
45
+ + Clear, unambiguous requirements: +15
46
+
47
+ Negative signals (decrease confidence):
48
+ - Unfamiliar framework/library: -20
49
+ - No tests available to verify: -15
50
+ - Ambiguous/incomplete requirements: -20
51
+ - Security-sensitive change: -10
52
+ - Destructive operation: -15
53
+ ```
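The weight table above can be sketched as a lookup plus a clamp. This is a minimal illustration, assuming the signal names used in this document's log and code examples; `SIGNAL_WEIGHTS` and `score_from_signals` are hypothetical names, not part of Dominion Flow.

```python
# Hypothetical sketch of the weight table; names are illustrative only.
BASELINE = 50

SIGNAL_WEIGHTS = {
    "skill_match": 20,
    "reflection_match": 15,
    "tests_available": 25,
    "familiar_tech": 15,
    "clear_requirements": 15,
    "unfamiliar_framework": -20,
    "no_tests": -15,
    "ambiguous_requirements": -20,
    "security_sensitive": -10,
    "destructive_operation": -15,
}


def score_from_signals(signals):
    """Sum the baseline and each signal's weight, then clamp to 0-100."""
    raw = BASELINE + sum(SIGNAL_WEIGHTS[name] for name in signals)
    return max(0, min(100, raw))


# Familiar stack, no tests, clear requirement: 50 + 15 - 15 + 15 = 65
print(score_from_signals(["familiar_tech", "no_tests", "clear_requirements"]))
```

Keeping the weights in one table makes later calibration a data change rather than a code change.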
54
+
55
+ ### Three Confidence Levels
56
+
57
+ ```
58
+ HIGH (>80%): Proceed Autonomously
59
+ → Execute task directly
60
+ → Run Self-Judge after completion
61
+ → Log confidence in summary
62
+
63
+ MEDIUM (50-80%): Proceed with Extra Validation
64
+ → Search reflections for similar scenarios
65
+ → Search skills library for guidance
66
+ → Run Self-Judge BEFORE and AFTER
67
+ → Log uncertainty reason for future learning
68
+
69
+ LOW (<50%): Pause and Escalate
70
+ → Search Context7 for current library docs
71
+ → Check if this is outside trained domain
72
+ → Ask user for guidance before proceeding
73
+ → Create checkpoint before attempting
74
+ → Log what made confidence low
75
+ ```
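The three levels above amount to a threshold check followed by a fixed list of actions. The sketch below is one possible encoding, not the Dominion Flow implementation; the action identifiers are hypothetical labels for the steps listed above.

```python
# Hypothetical sketch: thresholds and action labels follow the three levels above.
GATE_ACTIONS = {
    "HIGH": ["execute_task", "self_judge_after", "log_confidence"],
    "MEDIUM": [
        "search_reflections",
        "search_skills_library",
        "self_judge_before",
        "execute_task",
        "self_judge_after",
        "log_uncertainty_reason",
    ],
    "LOW": [
        "search_context7_docs",
        "check_trained_domain",
        "ask_user",
        "create_checkpoint",
        "log_low_confidence_cause",
    ],
}


def level_for(score):
    """Map a 0-100 score to a level: >80 HIGH, 50-80 MEDIUM, <50 LOW."""
    if score > 80:
        return "HIGH"
    if score >= 50:
        return "MEDIUM"
    return "LOW"


def gate(score):
    """Return the level and the ordered actions for that level."""
    level = level_for(score)
    return level, GATE_ACTIONS[level]


print(gate(80)[0])  # a score of exactly 80 falls in MEDIUM
```

Note the boundary behavior: 80 is MEDIUM and 50 is MEDIUM, so only scores strictly above 80 proceed without extra validation.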
76
+
77
+ ### Integration Points
78
+
79
+ **In fire-3-execute (executor context injection):**
80
+ ```xml
81
+ <confidence_gates>
82
+ Before each plan task, estimate confidence using signal scoring.
83
+ Log confidence level and signals in RECORD.md.
84
+ Gate behavior: HIGH=proceed, MEDIUM=extra-validation, LOW=escalate.
85
+ </confidence_gates>
86
+ ```
87
+
88
+ **In fire-loop (recitation block):**
89
+ ```markdown
90
+ ## Confidence Check (v5.0)
91
+ - Score: {0-100} — {HIGH/MEDIUM/LOW}
92
+ - Signals: {what raised or lowered confidence}
93
+ - Action: {proceed / extra-validation / escalate}
94
+ ```
95
+
96
+ **In RECORD.md (execution log):**
97
+ ```yaml
+ confidence_log:
+   - task: 1
+     score: 100  # 50 + 20 + 25 + 15 = 110, clamped to 100
+     level: HIGH
+     signals: [skill_match, tests_available, familiar_tech]
+   - task: 3
+     score: 0  # 50 - 20 - 15 - 20 = -5, clamped to 0
+     level: LOW
+     signals: [unfamiliar_framework, no_tests, ambiguous_requirements]
+     action: "Asked user for clarification on WebSocket auth approach"
+ ```
109
+
110
+ ### Code Example: Confidence Estimation
111
+
112
+ ```python
+ # Conceptual implementation (adapt to your agent framework)
+
+ # Placeholder set of technologies the agent treats as familiar
+ # (an assumption for illustration; replace with your own list)
+ KNOWN_TECHNOLOGIES = {"python", "typescript", "react", "postgresql"}
+
+
+ def estimate_confidence(task, skills_library, reflections, test_suite):
+     """Score a task from concrete signals; returns (score, level, signals)."""
+     score = 50  # baseline
+     signals = []
+
+     # Check skill library
+     matching_skills = skills_library.search(task.description)
+     if matching_skills:
+         score += 20
+         signals.append("skill_match")
+
+     # Check reflections
+     matching_reflections = reflections.search(task.description)
+     if matching_reflections:
+         score += 15
+         signals.append("reflection_match")
+
+     # Check test availability
+     if test_suite.has_tests_for(task.affected_files):
+         score += 25
+         signals.append("tests_available")
+     else:
+         score -= 15
+         signals.append("no_tests")
+
+     # Check technology familiarity
+     if task.technology in KNOWN_TECHNOLOGIES:
+         score += 15
+         signals.append("familiar_tech")
+     else:
+         score -= 20
+         signals.append("unfamiliar_framework")
+
+     # Check requirement clarity
+     if task.has_clear_acceptance_criteria:
+         score += 15
+         signals.append("clear_requirements")
+     else:
+         score -= 20
+         signals.append("ambiguous_requirements")
+
+     # Security/destructive checks
+     if task.is_security_sensitive:
+         score -= 10
+         signals.append("security_sensitive")
+     if task.is_destructive:
+         score -= 15
+         signals.append("destructive_operation")
+
+     # Clamp to 0-100
+     score = max(0, min(100, score))
+
+     # Thresholds: >80 HIGH, 50-80 MEDIUM, <50 LOW
+     level = "HIGH" if score > 80 else "MEDIUM" if score >= 50 else "LOW"
+     return score, level, signals
+ ```
169
+
170
+ ---
171
+
172
+ ## Testing the Fix
173
+
174
+ ### Scenario Tests
175
+
176
+ | Task | Expected Score | Expected Level | Expected Action |
177
+ |------|---------------|----------------|-----------------|
178
+ | Add CRUD endpoint (known stack, tests exist, skill found) | 90+ | HIGH | Proceed |
179
+ | Implement WebSocket auth (unfamiliar, no tests, no skill) | 30-40 | LOW | Escalate |
180
+ | Fix CSS layout (familiar, no tests, clear requirement) | 65-75 | MEDIUM | Extra validation |
181
+ | Delete user data migration (familiar, tests, destructive) | 55-65 | MEDIUM | Extra validation |
182
+ | Integrate new payment provider (unfamiliar, security) | 25-35 | LOW | Escalate |
183
+
184
+ ### Verification
185
+
186
+ 1. Execute tasks of varying familiarity
187
+ 2. Verify HIGH-confidence tasks proceed without interruption
188
+ 3. Verify MEDIUM-confidence tasks trigger reflection/skill search
189
+ 4. Verify LOW-confidence tasks pause for user input
190
+ 5. Check confidence log in RECORD.md matches observed behavior
191
+
192
+ ---
193
+
194
+ ## Prevention
195
+
196
+ - Calibrate signals based on actual outcomes (if LOW-confidence tasks consistently succeed, adjust weights)
197
+ - Don't let confidence become a checkbox (the score should reflect genuine uncertainty)
198
+ - Review confidence logs periodically to identify systematic biases
199
+ - Track confidence vs. outcome correlation to improve scoring
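The calibration advice above can be sketched as a per-level success rate computed from logged outcomes. This is a minimal illustration under stated assumptions: the `succeeded` field and the `success_rate_by_level` function are hypothetical and are not part of the RECORD.md format shown earlier.

```python
from collections import defaultdict


def success_rate_by_level(log):
    """Group logged tasks by confidence level and compute each level's success rate."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for entry in log:
        totals[entry["level"]] += 1
        if entry["succeeded"]:
            successes[entry["level"]] += 1
    return {level: successes[level] / totals[level] for level in totals}


# Hypothetical log entries; "succeeded" is an assumed outcome field.
log = [
    {"level": "HIGH", "succeeded": True},
    {"level": "HIGH", "succeeded": True},
    {"level": "LOW", "succeeded": True},
    {"level": "LOW", "succeeded": False},
]
rates = success_rate_by_level(log)
print(rates)
```

If the LOW rate stays close to the HIGH rate over many sessions, the negative weights are probably too pessimistic; if HIGH tasks fail often, the positive weights are too generous.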
200
+
201
+ ---
202
+
203
+ ## Related Patterns
204
+
205
+ - [AGENT_SELF_IMPROVEMENT_LOOP](./AGENT_SELF_IMPROVEMENT_LOOP.md) - Confidence is upgrade #6 of the loop
206
+ - [REFLEXION_MEMORY_PATTERN](./REFLEXION_MEMORY_PATTERN.md) - Reflections feed confidence scoring (+15)
207
+ - [SELF_QUESTIONING_TASK_GENERATION](./SELF_QUESTIONING_TASK_GENERATION.md) - Self-Judge runs at MEDIUM+ confidence
208
+ - [CONFIDENCE_ANNOTATION_PATTERN](./CONFIDENCE_ANNOTATION_PATTERN.md) - Related annotation approach
209
+
210
+ ---
211
+
212
+ ## Common Mistakes to Avoid
213
+
214
+ - Setting baseline too high (70+) — makes everything look "confident"
215
+ - Ignoring negative signals — agents naturally want to proceed
216
+ - Treating confidence gates as hard blockers — they're advisory, agent can override with justification
217
+ - Not logging confidence scores — you need the data to calibrate over time
218
+ - Applying same weights to all task types — security tasks should weight "security_sensitive" more heavily
219
+ - Making confidence estimation take >30 seconds — speed is critical for adoption
220
+
221
+ ---
222
+
223
+ ## Resources
224
+
225
+ - SAUP (ACL 2025): Uncertainty propagation through reasoning steps
226
+ - Anthropic measurement: Progressive autonomy 20%→40% over 750 sessions
227
+ - Snorkel AI: 95% vs 73.5% error recovery with self-evaluation
228
+ - Dominion Flow implementation: `fire-3-execute.md` confidence_gates, `fire-loop.md` confidence check
229
+
230
+ ---
231
+
232
+ ## Time to Implement
233
+
234
+ **2-3 hours** — Add confidence scoring to executor context, integrate into loop recitation, add RECORD.md logging
235
+
236
+ ## Difficulty Level
237
+
238
+ Stars: 3/5 — The signal scoring is simple. The challenge is **calibration**: getting the weights right so the agent isn't overconfident or over-cautious. Requires tracking outcomes over multiple sessions.
239
+
240
+ ---
241
+
242
+ **Author Notes:**
243
+ The most counterintuitive finding: agents that ask for help more often perform better overall. LOW-confidence escalation isn't a failure — it's the agent saying "I know what I don't know." The 95% error recovery rate comes not from avoiding errors, but from knowing when to pause and seek guidance. Confidence gates formalize the difference between "I know this" and "I'm guessing" — and that distinction is worth a 21.5 percentage point improvement in recovery rate.
@@ -0,0 +1,308 @@
1
+ # Evidence-Based Validation - Verification Before Completion
2
+
3
+ ## The Problem
4
+
5
+ AI agents (and humans) often claim work is "done" without actually verifying it works. This leads to:
6
+ - "Tests pass" claims without running tests
7
+ - "Bug fixed" assertions without reproduction verification
8
+ - Premature completion claims that waste time on rework
9
+
10
+ ### Why It Was Hard
11
+
12
+ - Pressure to deliver quickly encourages shortcuts
13
+ - Verification feels redundant after writing code
14
+ - Confidence in one's work creates blind spots
15
+ - No systematic enforcement of "prove it works"
16
+
17
+ ### Impact
18
+
19
+ - False completion claims waste reviewer time
20
+ - Bugs reach production that should have been caught
21
+ - Trust erodes when "done" doesn't mean "done"
22
+ - Rework costs exceed original implementation time
23
+
24
+ ---
25
+
26
+ ## The Solution
27
+
28
+ **Evidence Before Assertion** - Never claim something works without captured proof.
29
+
30
+ ### Core Principle
31
+
32
+ ```
+ ┌─────────────────────────────────────────────────────────────────┐
+ │                                                                 │
+ │   WRONG: "The tests pass" → then run tests                      │
+ │   RIGHT: Run tests → "The tests pass (output: 45/45)"           │
+ │                                                                 │
+ │   WRONG: "I fixed the bug" → hope it works                      │
+ │   RIGHT: Reproduce → fix → verify fix → "Bug fixed (proof)"     │
+ │                                                                 │
+ └─────────────────────────────────────────────────────────────────┘
+ ```
43
+
44
+ ### The Verification Protocol
45
+
46
+ #### Step 1: Identify Claims
47
+
48
+ Before declaring work complete, list all implicit claims:
49
+
50
+ ```markdown
51
+ ## Claims to Verify
52
+
53
+ | # | Implicit Claim | Evidence Required |
54
+ |---|----------------|-------------------|
55
+ | 1 | Code compiles | Build output with exit code 0 |
56
+ | 2 | Tests pass | Test runner output showing pass count |
57
+ | 3 | Feature works | Demo or API test with expected response |
58
+ | 4 | No regressions | Full test suite output |
59
+ | 5 | Types correct | TypeScript compiler output |
60
+ ```
61
+
62
+ #### Step 2: Execute Verification Commands
63
+
64
+ Run ALL commands and capture output:
65
+
66
+ ```bash
67
+ # Build check
68
+ echo "=== BUILD CHECK ==="
69
+ npm run build 2>&1
70
+ echo "Exit code: $?"
71
+
72
+ # Test check
73
+ echo "=== TEST CHECK ==="
74
+ npm run test 2>&1
75
+ echo "Exit code: $?"
76
+
77
+ # Type check
78
+ echo "=== TYPE CHECK ==="
79
+ npm run typecheck 2>&1
80
+ echo "Exit code: $?"
81
+
82
+ # Lint check
83
+ echo "=== LINT CHECK ==="
84
+ npm run lint 2>&1
85
+ echo "Exit code: $?"
86
+ ```
87
+
88
+ #### Step 3: Document Results with Evidence
89
+
90
+ ```markdown
91
+ ## Verification Results
92
+
93
+ ### Build Check
94
+ **Command:** `npm run build`
95
+ **Exit Code:** 0
96
+ **Output:**
97
+ ```
98
+ > project@1.0.0 build
99
+ > tsc && vite build
100
+
101
+ vite v5.0.0 building for production...
102
+ ✓ 234 modules transformed.
103
+ dist/index.html 0.45 kB │ gzip: 0.29 kB
104
+ dist/assets/index.js 145.67 kB │ gzip: 47.23 kB
105
+ ✓ built in 2.34s
106
+ ```
107
+ **Verdict:** PASS
108
+
109
+ ### Test Check
110
+ **Command:** `npm run test`
111
+ **Exit Code:** 0
112
+ **Output:**
113
+ ```
114
+ PASS src/auth/login.test.ts
115
+ PASS src/api/users.test.ts
116
+ PASS src/utils/helpers.test.ts
117
+
118
+ Test Suites: 3 passed, 3 total
119
+ Tests: 45 passed, 45 total
120
+ Time: 3.42s
121
+ ```
122
+ **Verdict:** PASS (45/45 tests)
123
+ ```
124
+
125
+ #### Step 4: Honesty Protocol
126
+
127
+ Before claiming complete, answer honestly:
128
+
129
+ ```markdown
130
+ ## Honesty Check
131
+
132
+ ### Question 1: Did I actually run these commands?
133
+ - [x] Yes, all commands executed in this session
134
+ - [ ] No, I assumed based on previous runs
135
+
136
+ ### Question 2: Am I interpreting output honestly?
137
+ - [x] Output clearly shows success
138
+ - [ ] Output is ambiguous, I'm assuming
139
+
140
+ ### Question 3: What am I NOT checking?
141
+ - [ ] Nothing unchecked
142
+ - [x] E2E tests not run (documented limitation)
143
+
144
+ ### Question 4: Would a skeptic be convinced?
145
+ - [x] Yes, evidence is clear
146
+ - [ ] No, more verification needed
147
+ ```
148
+
149
+ #### Step 5: Final Verdict
150
+
151
+ ```markdown
152
+ ## Double-Check Summary
153
+
154
+ | Check | Command | Result | Evidence |
155
+ |-------|---------|--------|----------|
156
+ | Build | `npm run build` | PASS | Exit 0, no errors |
157
+ | Tests | `npm run test` | PASS | 45/45 passing |
158
+ | Types | `npm run typecheck` | PASS | 0 errors |
159
+ | Lint | `npm run lint` | PASS | 0 warnings |
160
+
161
+ **STATUS:** VERIFIED
162
+ **Confidence:** HIGH
163
+
164
+ You may now claim this work is complete.
165
+ ```
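The final verdict above follows mechanically from the captured exit codes. The sketch below shows one way to derive it, assuming each result is stored as a dict with `command` and `exit_code` keys; the `summarize` function and the `NOT VERIFIED` label are hypothetical, not taken from Dominion Flow.

```python
def summarize(results):
    """Render a summary table and return VERIFIED only if every check passed."""
    lines = ["| Check | Command | Result |", "|-------|---------|--------|"]
    for name, result in results.items():
        verdict = "PASS" if result["exit_code"] == 0 else "FAIL"
        lines.append(f"| {name} | `{result['command']}` | {verdict} |")
    all_passed = all(r["exit_code"] == 0 for r in results.values())
    # An empty result set is never treated as verified.
    status = "VERIFIED" if results and all_passed else "NOT VERIFIED"
    return "\n".join(lines), status


table, status = summarize({
    "build": {"command": "npm run build", "exit_code": 0},
    "test": {"command": "npm run test", "exit_code": 1},
})
print(table)
print(status)
```

Refusing to report `VERIFIED` for an empty result set guards against the case where no check was actually run.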
166
+
167
+ ---
168
+
169
+ ## Implementation
170
+
171
+ ### Verification Command Template
172
+
173
+ ```javascript
+ const { exec } = require('node:child_process');
+
+ // Map each check name to the shell command that performs it.
+ const COMMANDS = {
+   build: 'npm run build',
+   test: 'npm run test',
+   typecheck: 'npm run typecheck',
+   lint: 'npm run lint',
+   coverage: 'npm run test:coverage',
+   security: 'npm audit',
+   e2e: 'npm run test:e2e'
+ };
+
+ // Run a command and always resolve with its output and exit code,
+ // so a failing check is recorded as evidence instead of thrown.
+ function run(command) {
+   return new Promise((resolve) => {
+     exec(command, (error, stdout, stderr) => {
+       const exitCode = error ? (typeof error.code === 'number' ? error.code : 1) : 0;
+       resolve({ stdout, stderr, exitCode });
+     });
+   });
+ }
+
+ async function doubleCheck(checks = ['build', 'test', 'typecheck', 'lint']) {
+   const results = {};
+
+   for (const check of checks) {
+     const command = COMMANDS[check];
+     console.log(`=== ${check.toUpperCase()} CHECK ===`);
+
+     const { stdout, stderr, exitCode } = await run(command);
+
+     results[check] = {
+       command,
+       exitCode,
+       output: stdout + stderr,
+       verdict: exitCode === 0 ? 'PASS' : 'FAIL'
+     };
+
+     console.log(`Exit code: ${exitCode}`);
+     console.log(stdout + stderr);
+   }
+
+   return results;
+ }
+ ```
207
+
208
+ ### Anti-Pattern Detection
209
+
210
+ ```javascript
+ // BAD: Claiming without evidence
+ function reviewCode() {
+   // ... look at code ...
+   return "Tests pass"; // NO EVIDENCE!
+ }
+
+ // GOOD: Evidence-based claim
+ // (`exec` is assumed here to be a helper that runs the command
+ // and returns the parsed pass and total counts)
+ async function reviewCode() {
+   const output = await exec('npm run test');
+   return `Tests pass (${output.passCount}/${output.totalCount})`;
+ }
+ ```
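One way to make the evidence-based version concrete is to build the claim only from counts found in captured test output. The sketch below assumes a summary line in the format shown in the Step 3 example (`Tests: 45 passed, 45 total`, optionally with a leading failed count); `claim_from_output` is a hypothetical helper, and other test runners will need a different pattern.

```python
import re

# Matches summary lines such as "Tests: 45 passed, 45 total"
# or "Tests: 3 failed, 42 passed, 45 total" (assumed format).
SUMMARY = re.compile(r"Tests:\s+(?:\d+ failed,\s+)?(\d+) passed,\s+(\d+) total")


def claim_from_output(output):
    """Build a completion claim only from counts found in captured test output."""
    match = SUMMARY.search(output)
    if match is None:
        return "Unverified: no test summary found in output"
    passed, total = int(match.group(1)), int(match.group(2))
    if passed == total:
        return f"Tests pass ({passed}/{total})"
    return f"Tests failing ({passed}/{total} passing)"


print(claim_from_output("Tests:       3 failed, 42 passed, 45 total"))
```

When no summary line is found, the function reports the claim as unverified rather than assuming success.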
223
+
224
+ ---
225
+
226
+ ## Testing the Pattern
227
+
228
+ ### Before (No Verification)
229
+ ```
230
+ Claim: "All tests pass"
231
+ Reality: 3 tests failing
232
+ Result: Bug reaches production
233
+ Cost: 4 hours debugging + hotfix
234
+ ```
235
+
236
+ ### After (Evidence-Based)
237
+ ```
238
+ Claim: "All tests pass"
239
+ Evidence: Test output shows 42/45 passing
240
+ Reality: 3 tests failing (caught immediately)
241
+ Result: Fixed before merge
242
+ Cost: 15 minutes
243
+ ```
244
+
245
+ ---
246
+
247
+ ## Prevention
248
+
249
+ ### When to Use Evidence-Based Validation
250
+
251
+ - **Always:** Before claiming any work is complete
252
+ - **Always:** Before creating a PR
253
+ - **Always:** Before merging to main
254
+ - **Always:** After fixing bugs (verify the fix)
255
+
256
+ ### Verification Depth Levels
257
+
258
+ | Depth | Checks | Use Case |
259
+ |-------|--------|----------|
260
+ | Quick | build, lint | Minor changes |
261
+ | Standard | build, test, types, lint | Normal PRs |
262
+ | Deep | All + coverage + security + E2E | Production releases |
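The depth table above can be expressed as a small lookup. This is a sketch under assumptions: the check names reuse the keys of the `COMMANDS` map from this document (so "types" becomes `typecheck`), and `DEPTH_CHECKS` and `checks_for` are hypothetical names.

```python
# Hypothetical mapping of the depth table; check names follow the COMMANDS keys.
STANDARD = ["build", "test", "typecheck", "lint"]

DEPTH_CHECKS = {
    "quick": ["build", "lint"],
    "standard": STANDARD,
    "deep": STANDARD + ["coverage", "security", "e2e"],
}


def checks_for(depth):
    """Return the checks for a depth level, rejecting unknown levels."""
    try:
        return list(DEPTH_CHECKS[depth])
    except KeyError:
        raise ValueError(f"unknown depth: {depth!r}") from None


print(checks_for("deep"))
```

Raising on an unknown depth avoids silently running no checks at all.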
263
+
264
+ ---
265
+
266
+ ## Related Patterns
267
+
268
+ - [Multi-Perspective Code Review](./MULTI_PERSPECTIVE_CODE_REVIEW.md)
269
+ - [Honesty Protocols](./HONESTY_PROTOCOLS.md)
270
+ - [60-Point Validation Checklist](./VALIDATION_CHECKLIST.md)
271
+
272
+ ---
273
+
274
+ ## Common Mistakes to Avoid
275
+
276
+ - **Skipping verification to save time** - Rework costs more
277
+ - **Assuming previous run is still valid** - Always re-run
278
+ - **Ignoring warnings** - Warnings become errors
279
+ - **Partial verification** - Run ALL relevant checks
280
+ - **Trusting memory** - Capture actual output
281
+
282
+ ---
283
+
284
+ ## Resources
285
+
286
+ - [superpowers:verification-before-completion](https://github.com/anthropics/claude-code-plugins)
287
+ - [Test-Driven Development patterns](./TDD_PATTERNS.md)
288
+ - [Continuous Integration best practices](../deployment-security/CI_CD_PATTERNS.md)
289
+
290
+ ---
291
+
292
+ ## Time to Implement
293
+
294
+ **Per verification:** 2-5 minutes
295
+ **ROI:** Prevents 1-4 hour debugging sessions
296
+
297
+ ## Difficulty Level
298
+
299
+ ⭐ (1/5) - Simple to implement, requires discipline
300
+
301
+ ---
302
+
303
+ **Author Notes:**
304
+ This pattern seems obvious but is violated constantly. The key insight is that **verification must be mandatory, not optional**. By requiring captured output as evidence, you eliminate the possibility of false claims.
305
+
306
+ The honesty protocol questions force reflection before completion. Question 3 ("What am I NOT checking?") is particularly powerful - it surfaces blind spots before they become problems.
307
+
308
+ **Implementation in Dominion Flow:** Available via `/fire-double-check` command.