@exaudeus/workrail 0.7.1 → 0.7.2-beta.0
package/package.json
CHANGED
@@ -1,5 +1,25 @@
 # Changelog - Systematic Bug Investigation Workflow
 
+## [1.1.0-beta.18] - 2025-01-06
+
+### CRITICAL FIX
+- **Addresses persistent early-stopping bug**: Agents were still stopping after Phase 1/2 saying "I found the bug"
+- **Root Cause Identified**: Agents fundamentally misunderstand THE GOAL
+  - WRONG: "The goal is finding the bug" → Stop after analysis with high confidence
+  - RIGHT: "The goal is PROVING the bug with evidence" → Must complete Phases 3-5
+- **New Meta-Guidance Section**: Added explicit "CRITICAL MISUNDERSTANDING TO AVOID" section
+  - "FINDING ≠ DONE. PROVING = DONE."
+  - "\"I found the bug\" = YOU HAVE A GUESS. \"I proved the bug\" = YOU HAVE EVIDENCE."
+  - "NEVER create summary documents until Phase 6"
+- **Step-Level Warnings**: Added "FINDING ≠ PROVING" warnings at all critical stopping points:
+  - **Phase 1f** (after analysis): Full explanation of why analysis ≠ proof
+  - **Phase 2a** (hypothesis development): "You have THEORIES, not EVIDENCE"
+  - **Phase 2h** (midpoint): "You may have 'found' the bug, but haven't 'proved' it"
+- **Step Count Corrections**: Fixed inconsistencies (27 → 23 steps throughout)
+
+### Why This Fix Is Different
+Previous fixes (beta.1-beta.17) added warnings about "high confidence ≠ done" but didn't address the fundamental goal misunderstanding. Agents thought their job was to "identify" the bug, not "prove" it. This fix makes the distinction crystal clear upfront.
+
 ## [1.1.0-beta.17] - 2025-01-06
 
 ### Major Restructuring
@@ -1,7 +1,7 @@
 {
   "id": "systematic-bug-investigation-with-loops",
   "name": "Systematic Bug Investigation Workflow",
-  "version": "1.1.0-beta.17",
+  "version": "1.1.0-beta.18",
   "description": "A comprehensive workflow for systematic bug and failing test investigation that prevents LLMs from jumping to conclusions. Enforces thorough evidence gathering, hypothesis formation, debugging instrumentation, and validation to achieve near 100% certainty about root causes. This workflow does NOT fix bugs - it produces detailed diagnostic writeups that enable effective fixing by providing complete understanding of what is happening, why it's happening, and supporting evidence.",
   "clarificationPrompts": [
     "What type of system is this? (web app, mobile app, backend service, desktop app, etc.)",
@@ -23,10 +23,16 @@
   "metaGuidance": [
     "**\ud83d\udea8 MANDATORY WORKFLOW EXECUTION - READ THIS FIRST:**",
     "YOU ARE EXECUTING A STRUCTURED WORKFLOW, NOT FREESTYLE DEBUGGING.",
-    "You CANNOT \"figure out the bug\" and stop. You MUST execute all 27 workflow steps by repeatedly calling workflow_next and following instructions until the MCP returns isComplete=true.",
+    "You CANNOT \"figure out the bug\" and stop. You MUST execute all 23 workflow steps by repeatedly calling workflow_next and following instructions until the MCP returns isComplete=true.",
     "WORKFLOW MECHANICS: Each call to workflow_next returns the next required step. You MUST execute that step, then call workflow_next again. Repeat until isComplete=true.",
     "DO NOT STOP CALLING WORKFLOW_NEXT: Even if you think you know the bug, even if you have high confidence, even if it seems obvious - you MUST continue calling workflow_next.",
-    "STEP COUNTER: Every prompt shows \"Step X of 27\" - you are NOT done until you reach Step 27/27 and isComplete=true.",
+    "STEP COUNTER: Every prompt shows \"Step X of 23\" - you are NOT done until you reach Step 23/23 and isComplete=true.",
+    "**\ud83d\udea8 CRITICAL MISUNDERSTANDING TO AVOID:**",
+    "THE GOAL IS NOT \"FINDING\" THE BUG. THE GOAL IS \"PROVING\" THE BUG WITH EVIDENCE.",
+    "\"I found the bug\" = YOU HAVE A GUESS. \"I proved the bug\" = YOU HAVE EVIDENCE FROM PHASES 3-5.",
+    "FINDING \u2260 DONE. PROVING = DONE. Only after completing instrumentation, evidence collection, and validation do you have proof.",
+    "NEVER say \"I've identified the root cause\" and stop. That is a THEORY, not PROOF. Continue to evidence collection.",
+    "DO NOT create \"summary documents\" or \"diagnostic writeups\" until Phase 6. That is SKIPPING THE WORKFLOW.",
     "**\ud83c\udfaf PHASE 0 = PURE SETUP (NO ANALYSIS):**",
     "Phase 0 is MECHANICAL SETUP ONLY: triage, user preferences, tool checking, context creation. No code analysis, no assumption checking. That comes in Phase 1.",
     "Phase 1a NOW includes assumption verification - AFTER you've seen the code and built structural understanding. You can't meaningfully question assumptions before understanding the codebase.",
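The metaGuidance above describes a strict client loop: call `workflow_next`, execute the returned step, call again, and stop only when the MCP server reports `isComplete=true`. A minimal TypeScript sketch of that loop, for orientation — the tool name `workflow_next` and the `isComplete` flag come from the guidance text, while the argument and result shapes here are illustrative assumptions, not workrail's actual API:

```ts
// Sketch of the mandatory execution loop described in metaGuidance.
// Assumptions: `callTool` is some MCP client invocation; the exact
// request/response shape of workrail's workflow_next is not shown here.
interface WorkflowNextResult {
  isComplete: boolean; // true only after the final step (Step 23/23)
  step?: {
    id: string;     // e.g. "phase-1f-breadth-verification"
    prompt: string; // the instructions the agent must actually execute
  };
}

async function runInvestigation(
  callTool: (name: string, args: object) => Promise<WorkflowNextResult>,
  executeStep: (prompt: string) => Promise<void>,
): Promise<void> {
  // DO NOT STOP CALLING WORKFLOW_NEXT: high confidence is not completion.
  for (;;) {
    const result = await callTool("workflow_next", {});
    if (result.isComplete) return; // "proved", not merely "found"
    if (result.step) await executeStep(result.step.prompt);
  }
}
```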
@@ -321,7 +327,7 @@
     {
       "id": "phase-1f-breadth-verification",
       "title": "Phase 1f: Final Breadth & Scope Verification",
-      "prompt": "**FINAL BREADTH & SCOPE VERIFICATION - Catch Tunnel Vision NOW**\n\n\u26a0\ufe0f **CRITICAL CHECKPOINT BEFORE HYPOTHESES**: This step prevents the #1 cause of wrong conclusions: looking in the wrong place or missing the wider context.\n\n**Goal**: Verify you analyzed the RIGHT code with sufficient breadth AND depth before committing to hypotheses.\n\n\ud83d\udea8 **DO NOT STOP HERE
+      "prompt": "**FINAL BREADTH & SCOPE VERIFICATION - Catch Tunnel Vision NOW**\n\n\u26a0\ufe0f **CRITICAL CHECKPOINT BEFORE HYPOTHESES**: This step prevents the #1 cause of wrong conclusions: looking in the wrong place or missing the wider context.\n\n**Goal**: Verify you analyzed the RIGHT code with sufficient breadth AND depth before committing to hypotheses.\n\n\ud83d\udea8 **DO NOT STOP HERE - CRITICAL MISUNDERSTANDING:**\n\n**\"I FOUND THE BUG\" \u2260 DONE. \"I PROVED THE BUG\" = DONE.**\n\nEven if you think you found the bug during analysis, you have ZERO PROOF:\n- \"Finding\" the bug = You have a THEORY/GUESS based on code analysis\n- \"Proving\" the bug = You have EVIDENCE from instrumentation + logs + validation (Phases 3-5)\n\nAnalysis = educated guesses. Proof comes from Phases 3-5 (instrumentation + evidence). You are only ~25% done.\n\n**DO NOT create summary documents or \"comprehensive findings\" now. That is Phase 6, not Phase 1f.**\n\nMUST continue to Phase 2 (Hypothesis Formation), then Phases 3-5 (Evidence Collection).\n\n---\n\n**STEP 1: Scope Sanity Check**\n\nAsk yourself these questions:\n1. **Module Root Correctness**: Is the `moduleRoot` from Phase 1a actually correct?\n - Does it include ALL files in the error stack trace?\n - Did I clamp too narrowly to a subdirectory when the bug spans multiple modules?\n - Should I expand scope to parent directory or adjacent modules?\n\n2. **Missing Adjacent Systems**: Did I consider:\n - Adjacent microservices/modules that interact with this one?\n - Shared libraries or utilities used here?\n - Configuration systems (env vars, config files, feature flags)?\n - Caching layers or state management systems?\n - Database schema or data migration issues?\n\n3. **Entry Point Coverage**: From Phase 1a Flow Anchors, did I verify:\n - ALL entry points that could trigger this bug?\n - Less obvious entry points (background jobs, scheduled tasks, webhooks)?\n - Initialization code that runs before the failing code?\n\n---\n\n**STEP 2: Wide-Angle Review**\n\nReview your Phase 1 analysis outputs and answer:\n\n1. **Pattern Confidence** (from Phase 1, sub-phase 2):\n - Do I have a solid Pattern Catalog with \u22652 occurrences per pattern?\n - Did I identify clear pattern deviations in failing code?\n - Are there OTHER files that deviate from patterns I haven't looked at?\n\n2. **Call Graph Completeness** (from Phase 1, sub-phase 1 & 2):\n - Did my bounded call graph capture all HOT paths?\n - Are there callers OUTSIDE my 2-hop boundary I should check?\n - Did I trace BACKWARDS from the error far enough (to true entry points)?\n\n3. **Component Rankings** (from Phase 1, sub-phase 3):\n - Are my top 5 components actually the most suspicious?\n - Did I miss components because they're not in the stack trace?\n - Should I re-rank based on new understanding?\n\n4. **Data Flow Completeness** (from Phase 1, sub-phase 4):\n - Did I trace data flow from TRUE origin (user input, external system)?\n - Are there data transformations BEFORE my analyzed scope?\n - Did I check data validation at ALL boundaries?\n\n5. **Test Coverage Gaps** (from Phase 1, sub-phase 5):\n - Did I find tests that SHOULD exist but don't?\n - Are there missing test categories (integration, edge cases, error conditions)?\n - Do test gaps reveal I'm looking in wrong place?\n\n---\n\n\n\n**STEP 2.5: Assumption Verification**\n\n**NOW that you've completed 5 phases of code analysis, verify all assumptions:**\n\n1. **Bug Report Assumptions**:\n - Is the described behavior actually a bug based on what you now know about the code?\n - Are the reproduction steps accurate given the code paths you've mapped?\n - Is the error message consistent with the actual code flow you've traced?\n - Are there missing steps or context in the bug report that your analysis revealed?\n\n2. **API/Library Assumptions**:\n - Check documentation for any APIs/libraries mentioned in stack trace\n - Verify actual behavior vs assumed behavior based on your code analysis\n - Note any version-specific behavior that might matter\n - Did your call graph analysis reveal unexpected library usage patterns?\n\n3. **Environment Assumptions**:\n - Based on code analysis, is this environment-specific?\n - Are there configuration dependencies you discovered in the code?\n - Could timing/concurrency be a factor (based on code structure you analyzed)?\n - Did pattern analysis reveal environment-dependent code paths?\n\n4. **Recent Changes Impact**:\n - Review last 5-10 commits affecting the analyzed code\n - Do they relate to the bug or point to alternative causes?\n - Did your analysis reveal recent changes that break established patterns?\n\n**Document**: Create or update AssumptionVerification.md with verified/challenged assumptions.\n\n**Set**: `assumptionsVerified = true` in context\n\n---\n**STEP 3: Alternative Scope Analysis**\n\n**Generate 2-3 alternative investigation scopes and evaluate:**\n\nFor each alternative scope, assess:\n- **Scope Description**: What module/area would this focus on?\n- **Why It Might Be Better**: What evidence suggests this scope?\n- **Evidence For**: What supports investigating this area?\n- **Evidence Against**: Why might this be wrong direction?\n- **Confidence**: Rate 1-10 that this is the right scope\n\n**Example Alternative Scopes**:\n- Expand to parent module (if current feels too narrow)\n- Shift to adjacent service (if this might be symptom not cause)\n- Focus on infrastructure layer (if might be env/config issue)\n- Focus on data layer (if might be data corruption/migration issue)\n\n---\n\n**STEP 4: Breadth Decision**\n\nBased on Steps 1-3, make ONE of these decisions:\n\n**OPTION A: SCOPE IS CORRECT - Continue to Hypothesis Development**\n- Current module root and analyzed components are right\n- Breadth and depth are sufficient\n- Ready to form hypotheses with confidence\n- Set `scopeVerified = true` and proceed\n\n**OPTION B: EXPAND SCOPE - Additional Analysis Required**\n- Identified critical gaps in breadth or depth\n- Need to analyze additional modules/components\n- Set specific components/areas to add to analysis\n- Set `needsScopeExpansion = true`\n- Document what to add: `additionalAnalysisNeeded = [list]`\n\n**OPTION C: SHIFT SCOPE - Wrong Area**\n- Current focus is likely wrong place\n- Alternative scope has stronger evidence\n- Need to restart Phase 1 with new module root\n- Set `needsScopeShift = true`\n- Set `newModuleRoot = [path]`\n\n---\n\n**OUTPUT: Create ScopeVerification.md**\n\nMust include:\n1. **Scope Sanity Check Results** (answers to Step 1 questions)\n2. **Wide-Angle Review Findings** (answers to Step 2 questions)\n3. **Alternative Scopes Evaluated** (2-3 alternatives with scores)\n4. **Breadth Decision** (A, B, or C with justification)\n5. **Confidence in Current Scope** (1-10)\n6. **Action Items** (if Option B or C selected)\n\n**Context Variables to Set**:\n- `scopeVerified` (true/false)\n- `needsScopeExpansion` (true/false)\n- `needsScopeShift` (true/false)\n- `scopeConfidence` (1-10)\n- `additionalAnalysisNeeded` (array, if Option B)\n- `newModuleRoot` (string, if Option C)\n\n---\n\n**\ud83c\udfaf WHY THIS MATTERS**: \n\nResearch shows that 60% of failed investigations looked in the wrong place or too narrowly. This checkpoint catches that BEFORE you invest effort in wrong hypotheses.\n\n**Self-Critique**: List 1-2 specific uncertainties about scope that concern you most.",
       "agentRole": "You are a senior investigator performing final scope verification. Your expertise is catching tunnel vision, identifying missing context, and ensuring investigations focus on the right area. You excel at meta-analysis and sanity checking investigative scope.",
       "guidance": [
         "This step comes AFTER Phase 1 (5-phase analysis loop) and BEFORE Phase 2a (hypothesis development)",
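The Phase 1f prompt ends by naming six context variables the agent must set. As a reading aid, here is how that contract could be typed — the variable names and the true/false, 1-10, array, and string shapes come from the prompt's own parenthetical notes, while the interface name and example values are invented for illustration:

```ts
// Context variables named in the Phase 1f prompt; types inferred from
// its parenthetical notes ("true/false", "1-10", "array", "string").
interface ScopeVerificationContext {
  scopeVerified: boolean;              // Option A: scope is correct
  needsScopeExpansion: boolean;        // Option B: analyze more components
  needsScopeShift: boolean;            // Option C: wrong area, restart Phase 1
  scopeConfidence: number;             // 1-10 confidence in current scope
  additionalAnalysisNeeded?: string[]; // only set when Option B is chosen
  newModuleRoot?: string;              // only set when Option C is chosen
}

// Example: the Option A ("scope is correct") outcome.
const optionA: ScopeVerificationContext = {
  scopeVerified: true,
  needsScopeExpansion: false,
  needsScopeShift: false,
  scopeConfidence: 8,
};
```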
@@ -341,7 +347,7 @@
     {
       "id": "phase-2a-hypothesis-development",
       "title": "Phase 2a: Hypothesis Development & Prioritization",
-      "prompt": "**HYPOTHESIS GENERATION** - Based on codebase analysis, formulate testable hypotheses about the bug's root cause.\n\n\ud83d\udea8 **YOU ARE NOT DONE
+      "prompt": "**HYPOTHESIS GENERATION** - Based on codebase analysis, formulate testable hypotheses about the bug's root cause.\n\n\ud83d\udea8 **YOU ARE NOT DONE - \"FINDING\" \u2260 \"PROVING\"**\n\n**You have THEORIES, not EVIDENCE. You have FOUND possible causes, not PROVED them.**: You have completed Phase 1 (Analysis). You do NOT have proof yet. You have THEORIES, not EVIDENCE. Phases 2-6 are MANDATORY to prove your hypotheses and produce the diagnostic writeup. You are ~30% done.\n\n**CRITICAL REMINDERS:**\n- Even if you're \"100% confident\" in a hypothesis, it's unproven without instrumentation + evidence (Phases 3-5)\n- Confidence in a theory \u2260 proof of that theory\n- Professional practice requires validation even with high confidence\n- The workflow requires you to continue through all phases\n- DO NOT provide final conclusions or \"stop here\" - you MUST continue\n\n---\n\n**STEP 1: Evidence-Based Hypothesis Development**\nCreate maximum 5 prioritized hypotheses. Each includes:\n- **Root Cause Theory**: Specific technical explanation\n- **Supporting Evidence**: Code patterns/logic flows supporting this theory\n- **Failure Mechanism**: Exact sequence leading to observed bug\n- **Testability Score**: Quantified assessment (1-10) of validation ease\n- **Evidence Strength Score**: Quantified assessment (1-10) based on code findings\n\n**STEP 2: Hypothesis Prioritization Matrix**\nRank hypotheses using weighted scoring:\n- **Evidence Strength** (40%): Code analysis support for theory\n- **Testability** (35%): Validation ease with debugging instruments\n- **Impact Scope** (25%): How well this explains all symptoms\n\n**STEP 3: Pattern Integration**\nIncorporate findings from findSimilarBugs():\n- **Historical Patterns**: Similar bugs fixed previously\n- **Known Issues**: Related problems in the codebase\n- **Test Failures**: Similar test failure patterns\n- Adjust hypothesis confidence based on pattern matches\n\n**CRITICAL RULE**: All hypotheses must be based on concrete evidence from code analysis.\n\n**OUTPUTS**: Maximum 5 hypotheses with quantified scoring, ranked by priority.\n\n**\u26a0\ufe0f INVESTIGATION NOT COMPLETE**: Developing hypotheses with high evidence scores is excellent progress, but represents only ~35% of the investigation. Even if you have a hypothesis with 9-10/10 evidence strength:\n\n- You are NOT done with the investigation\n- You MUST continue to Phase 2b-2h to refine and validate hypotheses\n- You MUST continue to Phase 3 to implement instrumentation\n- You MUST continue to Phase 4-5 to collect and analyze evidence\n- You MUST continue to Phase 6 to produce the comprehensive diagnostic writeup\n\n**DO NOT set isWorkflowComplete=true at this stage.** The workflow requires completing all phases.",
       "agentRole": "You are a senior software detective and root cause analysis expert with deep expertise in systematic hypothesis formation. Your strength lies in connecting code evidence to potential failure mechanisms and creating testable theories. You excel at logical reasoning and evidence-based deduction. You must maintain rigorous quantitative standards and reject any hypothesis not grounded in concrete code evidence.",
       "guidance": [
         "EVIDENCE-BASED ONLY: Every hypothesis must be grounded in concrete code analysis findings with quantified evidence scores",
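Step 2 of the Phase 2a prompt specifies a weighted prioritization matrix: Evidence Strength 40%, Testability 35%, Impact Scope 25%, each scored 1-10, with at most 5 hypotheses. A small sketch of that arithmetic — the weights, score ranges, and 5-hypothesis cap come from the prompt; the type and function names are illustrative:

```ts
// Weighted hypothesis score per the Phase 2a prioritization matrix:
// Evidence Strength 40%, Testability 35%, Impact Scope 25% (all 1-10).
interface Hypothesis {
  rootCauseTheory: string;
  evidenceStrength: number; // 1-10, from code analysis findings
  testability: number;      // 1-10, ease of validation via instrumentation
  impactScope: number;      // 1-10, how well it explains all symptoms
}

function priorityScore(h: Hypothesis): number {
  return 0.4 * h.evidenceStrength + 0.35 * h.testability + 0.25 * h.impactScope;
}

// Rank hypotheses highest-priority first, keeping at most 5.
// Example: evidenceStrength 9, testability 8, impactScope 7
// → 0.4*9 + 0.35*8 + 0.25*7 = 8.15
function rankHypotheses(hypotheses: Hypothesis[]): Hypothesis[] {
  return [...hypotheses]
    .sort((a, b) => priorityScore(b) - priorityScore(a))
    .slice(0, 5);
}
```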
@@ -510,7 +516,7 @@
     {
       "id": "phase-2h-cognitive-reset",
       "title": "Phase 2h: Cognitive Reset & Plan Review",
-      "prompt": "**COGNITIVE RESET** - Take a mental step back before implementing instrumentation.\n\n\ud83d\udea8 **YOU ARE HALFWAY DONE (~50%)**: You have hypotheses and a validation plan. This is NOT proof. You MUST continue to Phases 3-6 to:\n- Phase 3: Add instrumentation to validate hypotheses\n- Phase 4: Collect concrete evidence\n- Phase 5: Analyze evidence and confirm/refute hypotheses\n- Phase 6: Write comprehensive diagnostic report\n\nEven if you have \"100% confidence\" in a hypothesis, professional practice requires empirical validation. DO NOT STOP HERE.\n\n---\n\n**GOAL**: Review the investigation with fresh eyes and validate the plan before execution.\n\n**STEP 1: Progress Summary**\n- What have we learned so far? (3-5 key insights)\n- What are our top hypotheses? (brief recap)\n- What's our instrumentation strategy? (high-level summary)\n\n**STEP 2: Critical Questions**\n- Are we missing any obvious alternative explanations?\n- Are our hypotheses too similar or too narrow?\n- Is our instrumentation plan efficient and comprehensive?\n- Are we making any unwarranted assumptions?\n- Is there a simpler approach we haven't considered?\n\n**STEP 3: Bias Check**\n- First impression bias: Are we anchored to initial theories?\n- Confirmation bias: Are we seeking evidence that confirms our beliefs?\n- Complexity bias: Are we overcomplicating a simple issue?\n- Recency bias: Are we over-weighting recent findings?\n\n**STEP 4: Sanity Checks**\n- Does the timeline make sense? (When did bug appear vs when hypothesized causes were introduced)\n- Do the symptoms match our theories? (All symptoms explained, no contradictions)\n- Are we investigating the right level? (Too high-level or too low-level)\n- Have we consulted existing documentation/logs adequately?\n\n**STEP 5: Plan Validation**\n- Review the instrumentation plan from Phase 2g\n- Will it actually answer our questions?\n- Are there any gaps or redundancies?\n- Is it safe to execute? (no production impacts, no data corruption risks)\n\n**STEP 6: Proceed or Pivot Decision**\n- **PROCEED**: Plan is sound, move to implementation\n- **REFINE**: Minor adjustments needed (update plan)\n- **PIVOT**: Major issues found (return to earlier phase)\n\n**OUTPUT**:\n- Cognitive reset complete with decision (PROCEED/REFINE/PIVOT)\n- Any plan adjustments documented\n- Set `resetComplete` = true",
+      "prompt": "**COGNITIVE RESET** - Take a mental step back before implementing instrumentation.\n\n\ud83d\udea8 **YOU ARE HALFWAY DONE (~50%) - FINDING \u2260 PROVING**: You may have \"found\" the bug (high confidence theory), but you haven't \"proved\" it yet. You have hypotheses and a validation plan. This is NOT proof. You MUST continue to Phases 3-6 to:\n- Phase 3: Add instrumentation to validate hypotheses\n- Phase 4: Collect concrete evidence\n- Phase 5: Analyze evidence and confirm/refute hypotheses\n- Phase 6: Write comprehensive diagnostic report\n\nEven if you have \"100% confidence\" in a hypothesis, professional practice requires empirical validation. DO NOT STOP HERE.\n\n---\n\n**GOAL**: Review the investigation with fresh eyes and validate the plan before execution.\n\n**STEP 1: Progress Summary**\n- What have we learned so far? (3-5 key insights)\n- What are our top hypotheses? (brief recap)\n- What's our instrumentation strategy? (high-level summary)\n\n**STEP 2: Critical Questions**\n- Are we missing any obvious alternative explanations?\n- Are our hypotheses too similar or too narrow?\n- Is our instrumentation plan efficient and comprehensive?\n- Are we making any unwarranted assumptions?\n- Is there a simpler approach we haven't considered?\n\n**STEP 3: Bias Check**\n- First impression bias: Are we anchored to initial theories?\n- Confirmation bias: Are we seeking evidence that confirms our beliefs?\n- Complexity bias: Are we overcomplicating a simple issue?\n- Recency bias: Are we over-weighting recent findings?\n\n**STEP 4: Sanity Checks**\n- Does the timeline make sense? (When did bug appear vs when hypothesized causes were introduced)\n- Do the symptoms match our theories? (All symptoms explained, no contradictions)\n- Are we investigating the right level? (Too high-level or too low-level)\n- Have we consulted existing documentation/logs adequately?\n\n**STEP 5: Plan Validation**\n- Review the instrumentation plan from Phase 2g\n- Will it actually answer our questions?\n- Are there any gaps or redundancies?\n- Is it safe to execute? (no production impacts, no data corruption risks)\n\n**STEP 6: Proceed or Pivot Decision**\n- **PROCEED**: Plan is sound, move to implementation\n- **REFINE**: Minor adjustments needed (update plan)\n- **PIVOT**: Major issues found (return to earlier phase)\n\n**OUTPUT**:\n- Cognitive reset complete with decision (PROCEED/REFINE/PIVOT)\n- Any plan adjustments documented\n- Set `resetComplete` = true",
       "agentRole": "You are a senior debugger reviewing the investigation plan with fresh, critical eyes before committing to implementation.",
       "guidance": [
         "Be honest about potential biases and blind spots",
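The Phase 2h output likewise reduces to a three-way decision plus the `resetComplete` flag. One possible encoding, where only `PROCEED`/`REFINE`/`PIVOT` and `resetComplete` come from the prompt and the other names are assumed:

```ts
// Three-way outcome of the Phase 2h cognitive reset, per the prompt.
type ResetDecision = "PROCEED" | "REFINE" | "PIVOT";

interface CognitiveResetOutput {
  decision: ResetDecision;
  planAdjustments: string[]; // documented changes; empty when PROCEED
  resetComplete: true;       // the prompt requires setting this flag
}

// Example: plan is sound, move on to instrumentation (Phase 3).
const reset: CognitiveResetOutput = {
  decision: "PROCEED",
  planAdjustments: [],
  resetComplete: true,
};
```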