@exaudeus/workrail 0.7.2-beta.3 β†’ 0.7.2-beta.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@exaudeus/workrail",
- "version": "0.7.2-beta.3",
+ "version": "0.7.2-beta.4",
  "description": "MCP server for structured workflow orchestration and step-by-step task guidance",
  "license": "MIT",
  "bin": {
@@ -1,5 +1,19 @@
  # Changelog - Systematic Bug Investigation Workflow
 
+ ## [1.1.0-beta.22] - 2025-01-06
+
+ ### CRITICAL FIX - Invalid Loop Step Schema
+ - **ROOT CAUSE**: In beta.19, we added `guidance` to the loop step, but loop steps DON'T support guidance in the schema
+ - Schema allows: `id`, `type`, `title`, `loop`, `body`, `functionDefinitions`, `requireConfirmation`, `runCondition`
+ - Does NOT allow: `guidance`, `prompt`, `agentRole`
+ - **Fix**: Moved loop enforcement guidance to first body step (`analysis-neighborhood-contracts`)
+ "USER SAYS: This loop MUST complete ALL 5 iterations..."
+ - Now properly enforced on each iteration
+ - **Validation**: ✅ Workflow now passes full schema validation
+
+ ### Why This Matters
+ Without proper validation, the MCP server couldn't load the workflow at all. Beta.19-21 were broken due to schema violations.
+
  ## [1.1.0-beta.21] - 2025-01-06
 
  ### HOTFIX - metaGuidance Schema Violations
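
To make the schema constraint described in the beta.22 entry concrete, here is a minimal TypeScript sketch of the shape the changelog describes: the loop step carries only the fields the schema allows, and the "USER SAYS" enforcement guidance moves onto the first body step. The interfaces, the loop configuration, and the loop step id are illustrative assumptions; only the allowed/disallowed field names and the `analysis-neighborhood-contracts` step id come from the diff itself.

```ts
// Illustrative sketch only - not the package's actual schema types.
interface WorkflowStep {
  id: string;
  title: string;
  prompt?: string;
  agentRole?: string;
  guidance?: string[];            // allowed on ordinary steps
  requireConfirmation?: boolean;
}

interface LoopStep {
  id: string;
  type: "loop";
  title: string;
  loop: { iterations?: number };  // hypothetical loop config
  body: WorkflowStep[];
  functionDefinitions?: string[];
  requireConfirmation?: boolean;
  runCondition?: string;
  // No `guidance`, `prompt`, or `agentRole` here - adding `guidance` was the beta.19 regression.
}

// beta.22 shape: the enforcement guidance lives on the first body step instead of the loop.
const analysisLoop: LoopStep = {
  id: "phase-1-analysis-loop",    // hypothetical id, not shown in the diff
  type: "loop",
  title: "Phase 1: Multi-Dimensional Codebase Analysis",
  loop: { iterations: 5 },
  body: [
    {
      id: "analysis-neighborhood-contracts",
      title: "Analysis 1/5: Neighborhood, Call Graph & Contracts",
      guidance: ["USER SAYS: This loop MUST complete ALL 5 iterations. Do NOT exit early..."],
    },
  ],
  requireConfirmation: false,
};
```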
@@ -1,7 +1,7 @@
  {
  "id": "systematic-bug-investigation-with-loops",
  "name": "Systematic Bug Investigation Workflow",
- "version": "1.1.0-beta.21",
+ "version": "1.1.0-beta.22",
  "description": "A comprehensive workflow for systematic bug and failing test investigation that prevents LLMs from jumping to conclusions. Enforces thorough evidence gathering, hypothesis formation, debugging instrumentation, and validation to achieve near 100% certainty about root causes. This workflow does NOT fix bugs - it produces detailed diagnostic writeups that enable effective fixing by providing complete understanding of what is happening, why it's happening, and supporting evidence.",
  "clarificationPrompts": [
  "What type of system is this? (web app, mobile app, backend service, desktop app, etc.)",
@@ -21,44 +21,44 @@
  "Bug is reproducible with specific steps or a minimal test case"
  ],
  "metaGuidance": [
- "**🚨 MANDATORY WORKFLOW EXECUTION - READ THIS FIRST:**",
+ "**\ud83d\udea8 MANDATORY WORKFLOW EXECUTION - READ THIS FIRST:**",
  "YOU ARE EXECUTING A STRUCTURED WORKFLOW, NOT FREESTYLE DEBUGGING.",
  "You CANNOT \"figure out the bug\" and stop. You MUST execute all 23 workflow steps by repeatedly calling workflow_next and following instructions until the MCP returns isComplete=true.",
  "WORKFLOW MECHANICS: Each call to workflow_next returns the next required step. You MUST execute that step, then call workflow_next again. Repeat until isComplete=true.",
  "DO NOT STOP CALLING WORKFLOW_NEXT: Even if you think you know the bug, even if you have high confidence, even if it seems obvious - you MUST continue calling workflow_next.",
  "STEP COUNTER: Every prompt shows \"Step X of 23\" - you are NOT done until you reach Step 23/23 and isComplete=true.",
- "**🚨 CRITICAL MISUNDERSTANDING TO AVOID:**",
+ "**\ud83d\udea8 CRITICAL MISUNDERSTANDING TO AVOID:**",
  "THE GOAL IS NOT \"FINDING\" THE BUG. THE GOAL IS \"PROVING\" THE BUG WITH EVIDENCE.",
  "\"I found the bug\" = YOU HAVE A GUESS. \"I proved the bug\" = YOU HAVE EVIDENCE FROM PHASES 3-5.",
- "FINDING ≠ DONE. PROVING = DONE. Only after completing instrumentation, evidence collection, and validation do you have proof.",
+ "FINDING \u2260 DONE. PROVING = DONE. Only after completing instrumentation, evidence collection, and validation do you have proof.",
  "NEVER say \"I've identified the root cause\" and stop. That is a THEORY, not PROOF. Continue to evidence collection.",
- "**🚨 USER SAYS: NO RATIONALIZATION. NO EXCEPTIONS. NO \"BUT IN MY CASE...\":**",
+ "**\ud83d\udea8 USER SAYS: NO RATIONALIZATION. NO EXCEPTIONS. NO \"BUT IN MY CASE...\":**",
  "DO NOT say \"However, given that I have...\" or \"Let me do a targeted Phase X...\" or \"Based on my high confidence...\"",
  "USER SAYS: YOUR SITUATION IS NOT SPECIAL. YOU ARE NOT THE EXCEPTION. Complete ALL 23 steps. Complete ALL 5 analysis iterations.",
  "USER SAYS: \"I found the bug early\" = ALL THE MORE REASON to validate it properly through ALL phases. Quick conclusions are WRONG 90% of the time.",
  "DO NOT create \"summary documents\" or \"diagnostic writeups\" until Phase 6. That is SKIPPING THE WORKFLOW.",
- "**🎯 PHASE 0 = PURE SETUP (NO ANALYSIS):**",
+ "**\ud83c\udfaf PHASE 0 = PURE SETUP (NO ANALYSIS):**",
  "Phase 0 is MECHANICAL SETUP ONLY: triage, user preferences, tool checking, context creation. No code analysis, no assumption checking. That comes in Phase 1.",
  "Phase 1a NOW includes assumption verification - AFTER you've seen the code and built structural understanding. You can't meaningfully question assumptions before understanding the codebase.",
- "**🚨 CRITICAL: ANALYSIS ≠ DIAGNOSIS ≠ PROOF:**",
+ "**\ud83d\udea8 CRITICAL: ANALYSIS \u2260 DIAGNOSIS \u2260 PROOF:**",
  "AFTER PHASE 1 (Analysis): You have analyzed code and identified suspicious patterns. This is NOT proof. You have ZERO evidence yet. You are ~20% done.",
  "AFTER PHASE 2 (Hypotheses): You have theories about the bug. This is NOT proof. You still have ZERO evidence. You are ~40% done.",
  "EVIDENCE COMES FROM PHASES 3-5: Instrumentation, evidence collection, and validation. Only THEN do you have proof.",
  "STOP = WRONG: Stopping after analysis (even with \"100% confidence\") means you have ZERO PROOF and are providing GUESSES, not diagnosis.",
- "**🎯 WHY THIS STRUCTURE EXISTS (Evidence-Based):**",
+ "**\ud83c\udfaf WHY THIS STRUCTURE EXISTS (Evidence-Based):**",
  "Professional research spanning 20+ years shows agents who skip systematic investigation steps are wrong ~90% of the time, even with 9-10/10 self-reported confidence.",
  "Quick conclusions miss: edge cases, alternative explanations, environment factors, interaction effects, and data corruption paths.",
  "This workflow FORCES thoroughness through: code analysis, hypothesis formation, instrumentation, evidence gathering, adversarial review, and comprehensive documentation.",
  "**CRITICAL WORKFLOW DISCIPLINE:**",
- "HIGH CONFIDENCE ≠ INVESTIGATION COMPLETE: Achieving 8-10/10 confidence in a hypothesis is excellent progress but does NOT mean the workflow is done.",
+ "HIGH CONFIDENCE \u2260 INVESTIGATION COMPLETE: Achieving 8-10/10 confidence in a hypothesis is excellent progress but does NOT mean the workflow is done.",
  "COMPLETE ALL PHASES: You MUST complete ALL phases (0 through 6) regardless of confidence level. Each phase builds critical evidence and documentation.",
  "WORKFLOW COMPLETION FLAG: Only set isWorkflowComplete=true when you complete Phase 6 (Comprehensive Diagnostic Writeup) AND produce the full deliverable.",
  "DO NOT SKIP PHASES: Even with high confidence, you must complete hypothesis generation (Phase 2), instrumentation (Phase 3), evidence collection (Phase 4), analysis (Phase 5), and writeup (Phase 6).",
  "PHASE PROGRESSION: An investigation that stops at triage (Phase 0) or hypothesis formation (Phase 2) or evidence collection (Phase 4) is INCOMPLETE - the diagnostic writeup is the required deliverable.",
  "**HIGH AUTO MODE DISCIPLINE:**",
- "In HIGH automation mode, agents must execute phases WITHOUT asking permission between phases. This means: proceed automatically from Phase 1→2→3→4→5→6.",
- "HIGH AUTO ≠ PERMISSION TO SKIP PHASES. HIGH AUTO = NO INTERRUPTIONS, NOT NO PHASES.",
- "**CRITICAL: HIGH AUTOMATION ≠ AUTONOMY TO SKIP:**",
+ "In HIGH automation mode, agents must execute phases WITHOUT asking permission between phases. This means: proceed automatically from Phase 1\u21922\u21923\u21924\u21925\u21926.",
+ "HIGH AUTO \u2260 PERMISSION TO SKIP PHASES. HIGH AUTO = NO INTERRUPTIONS, NOT NO PHASES.",
+ "**CRITICAL: HIGH AUTOMATION \u2260 AUTONOMY TO SKIP:**",
  "USER SAYS: 'High automation mode' means you DON'T ASK PERMISSION. It does NOT mean you have autonomy to decide which phases to skip.",
  "High auto = Faster execution of ALL phases. NOT = Smarter agent gets to skip phases it thinks are unnecessary.",
  "are: (1) Phase 0e early termination decision, (2) Phase 4a controlled experiments. All other phases execute automatically based on the systematic workflow structure.",
@@ -72,10 +72,10 @@
  "fun checkHypothesisInTests(hypothesis) = 'Search existing tests for evidence. Direct: tests of suspected components. Indirect: tests that would fail if true. Document in TestEvidence/{hypothesis}.md'",
  "fun aggregateDebugLogs(pattern, timeWindow=100) = 'Deduplicate logs matching {pattern}. Output: {pattern} x{count} in {timeWindow}ms, variations: {unique_values}'",
  "fun createInvestigationBranch() = 'git checkout -b investigate/{bug-id}-{timestamp}. If git unavailable, create Investigation/{timestamp}/ directory for artifacts.'",
- "fun trackInvestigation(phase, status) = 'Update INVESTIGATION_CONTEXT.md progress: ✅ {completed}, 🔄 {phase}, ⏳ Remaining: {list}, 📊 Confidence: {score}/10'",
+ "fun trackInvestigation(phase, status) = 'Update INVESTIGATION_CONTEXT.md progress: \u2705 {completed}, \ud83d\udd04 {phase}, \u23f3 Remaining: {list}, \ud83d\udcca Confidence: {score}/10'",
  "fun updateInvestigationContext(section, content) = 'Update INVESTIGATION_CONTEXT.md {section} with {content}. Include timestamp. If section doesn\\'t exist, create it. Preserve all other sections.'",
  "fun findSimilarBugs() = 'Search for: 1) Similar error patterns in codebase, 2) Previous fixes in git history, 3) Related test cases. Document in SimilarPatterns.md'",
- "fun visualProgress() = 'Show: ✅ Phase 0 | ✅ Phase 1 | 🔄 Phase 2 | ⏳ Phase 3-5 | ⏳ Phase 6 | 📊 35% Complete. Include time spent per phase.'",
+ "fun visualProgress() = 'Show: \u2705 Phase 0 | \u2705 Phase 1 | \ud83d\udd04 Phase 2 | \u23f3 Phase 3-5 | \u23f3 Phase 6 | \ud83d\udcca 35% Complete. Include time spent per phase.'",
  "fun applyDebugPreferences() = 'Apply user debugging preferences from userDebugPreferences context variable. Adapt logging verbosity, tool selection, output format.'",
  "fun addResumptionJson(phase) = 'Update INVESTIGATION_CONTEXT.md resumption section with: workflowId, completedSteps up to {phase}, all context variables. Include workflow_get and workflow_next instructions.'",
  "**USAGE:** When you see function calls like instrumentCode() or analyzeTests(), execute the full instructions defined above.",
@@ -101,7 +101,7 @@
  "CONTEXT DOCUMENTATION: Maintain INVESTIGATION_CONTEXT.md throughout. Update after major milestones, failures, or user interventions to enable seamless resumption and handoffs. Include explicit resumption instructions using workflow_get and workflow_next.",
  "GIT FALLBACK STRATEGY: If git unavailable, gracefully skip commits/branches, log changes manually in CONTEXT.md with timestamps, warn user, document modifications for manual control.",
  "GIT ERROR HANDLING: Use run_terminal_cmd for git operations; if fails, output exact command for user manual execution. Never halt investigation due to git unavailability.",
- "TOOL AVAILABILITY AWARENESS: Check debugging tool availability before investigation design. Have fallbacks for when primary tools unavailable (grep→file_search, etc).",
+ "TOOL AVAILABILITY AWARENESS: Check debugging tool availability before investigation design. Have fallbacks for when primary tools unavailable (grep\u2192file_search, etc).",
  "SECURITY PROTOCOLS: Sanitize sensitive data in logs/reproduction steps. Be mindful of exposing credentials, PII, or system internals during evidence collection phases.",
  "DYNAMIC RE-TRIAGE: Allow complexity upgrades during investigation if evidence reveals deeper issues. Safe downgrades only with explicit user confirmation after evidence review.",
  "DEVIL'S ADVOCATE REVIEW: Actively challenge primary hypothesis with available evidence. Seek alternative explanations and rate alternative likelihood before final confidence assessment.",
@@ -115,7 +115,7 @@
  {
  "id": "phase-0-complete-setup",
  "title": "Phase 0: Complete Investigation Setup",
- "prompt": "**SYSTEMATIC INVESTIGATION SETUP** - Complete all mechanical setup before analysis begins.\n\n**This phase is PURELY MECHANICAL - no code analysis or hypothesis formation yet.**\n\n---\n\n**PART 1: Bug Report Triage**\n\nPlease provide complete bug context:\n- **Bug Description**: Observed vs expected behavior?\n- **Error Messages/Stack Traces**: Complete error output\n- **Reproduction Steps**: Consistent reproduction method?\n- **Environment Details**: OS, language/framework versions\n- **Recent Changes**: Commits, deployments, config changes?\n\n**Classify Project Type:**\n- Languages/Frameworks (primary tech stack)\n- Build System (Maven, Gradle, npm, etc.)\n- Testing Framework (JUnit, Jest, pytest, etc.)\n- Logging System (available mechanisms)\n- Architecture (monolithic, microservices, distributed, serverless)\n\n**Assess Bug Complexity:**\n- Simple: Single function, clear error path, minimal dependencies\n- Standard: Multiple components, moderate investigation required\n- Complex: Cross-system, race conditions, complex state management\n\n**Determine Automation Level:**\nAsk user: \"What automation level for this investigation?\"\n- High: Auto-approve decisions >8.0 confidence, minimal confirmations\n- Medium: Standard confirmations for key decisions\n- Low: Extra confirmations, manual approval for all changes\n\n---\n\n**PART 2: User Debugging Preferences**\n\n**Check for preferences in:**\n- User settings/memory\n- Project documentation (team standards)\n- Previous instructions/guidance\n\n**Categorize preferences:**\n- Debugging Tools: debugger vs logs vs traces\n- Log Verbosity: detailed vs concise\n- Output Format: structured vs human-readable\n- Testing Approach: unit vs integration test focus\n- Commit Style: conventional vs descriptive\n- Documentation: inline comments vs separate docs\n- Error Handling: fail fast vs defensive\n\n**If no explicit preferences, ask user:**\n- \"Verbose logging or concise summaries?\"\n- \"Interactive debuggers or log analysis?\"\n- \"Any specific tools or approaches your team prefers?\"\n\n---\n\n**PART 3: Tool Availability Check**\n\n**Verify core tools:**\n\n1. **Analysis Tools**: Test availability of grep_search, read_file, codebase_search\n2. **Git Operations**: Check `git --version`, set gitAvailable flag\n3. **Build/Test Tools** (based on projectType): npm/yarn, Maven/Gradle, pytest, etc.\n4. **Debugging Tools**: Language-specific debuggers, profilers, log aggregation\n\n**Fallback strategies if tools unavailable:**\n- grep_search fails β†’ use file_search\n- codebase_search fails β†’ use grep_search with context\n- Git unavailable β†’ track changes in INVESTIGATION_CONTEXT.md\n- Build tools missing β†’ focus on static analysis\n\n---\n\n**PART 4: Initialize Investigation Context Document**\n\nUse createInvestigationBranch() for version control, then create INVESTIGATION_CONTEXT.md with bug summary, progress tracking, environment setup, and resumption instructions.\n\n**REQUIRED OUTPUTS:**\n\nSet ALL context variables:\n- `projectType`, `bugComplexity`, `debuggingMechanism`, `isDistributed`\n- `automationLevel` (High/Medium/Low)\n- `userDebugPreferences` (categorized preferences object)\n- `availableTools` (array of available tool names)\n- `gitAvailable` (boolean)\n- `toolLimitations` (string describing any restrictions)\n- `contextInitialized` = true\n\nCreate comprehensive INVESTIGATION_CONTEXT.md with all function definitions from metaGuidance.",
+ "prompt": "**SYSTEMATIC INVESTIGATION SETUP** - Complete all mechanical setup before analysis begins.\n\n**This phase is PURELY MECHANICAL - no code analysis or hypothesis formation yet.**\n\n---\n\n**PART 1: Bug Report Triage**\n\nPlease provide complete bug context:\n- **Bug Description**: Observed vs expected behavior?\n- **Error Messages/Stack Traces**: Complete error output\n- **Reproduction Steps**: Consistent reproduction method?\n- **Environment Details**: OS, language/framework versions\n- **Recent Changes**: Commits, deployments, config changes?\n\n**Classify Project Type:**\n- Languages/Frameworks (primary tech stack)\n- Build System (Maven, Gradle, npm, etc.)\n- Testing Framework (JUnit, Jest, pytest, etc.)\n- Logging System (available mechanisms)\n- Architecture (monolithic, microservices, distributed, serverless)\n\n**Assess Bug Complexity:**\n- Simple: Single function, clear error path, minimal dependencies\n- Standard: Multiple components, moderate investigation required\n- Complex: Cross-system, race conditions, complex state management\n\n**Determine Automation Level:**\nAsk user: \"What automation level for this investigation?\"\n- High: Auto-approve decisions >8.0 confidence, minimal confirmations\n- Medium: Standard confirmations for key decisions\n- Low: Extra confirmations, manual approval for all changes\n\n---\n\n**PART 2: User Debugging Preferences**\n\n**Check for preferences in:**\n- User settings/memory\n- Project documentation (team standards)\n- Previous instructions/guidance\n\n**Categorize preferences:**\n- Debugging Tools: debugger vs logs vs traces\n- Log Verbosity: detailed vs concise\n- Output Format: structured vs human-readable\n- Testing Approach: unit vs integration test focus\n- Commit Style: conventional vs descriptive\n- Documentation: inline comments vs separate docs\n- Error Handling: fail fast vs defensive\n\n**If no explicit preferences, ask user:**\n- \"Verbose logging or concise summaries?\"\n- \"Interactive debuggers or log analysis?\"\n- \"Any specific tools or approaches your team prefers?\"\n\n---\n\n**PART 3: Tool Availability Check**\n\n**Verify core tools:**\n\n1. **Analysis Tools**: Test availability of grep_search, read_file, codebase_search\n2. **Git Operations**: Check `git --version`, set gitAvailable flag\n3. **Build/Test Tools** (based on projectType): npm/yarn, Maven/Gradle, pytest, etc.\n4. **Debugging Tools**: Language-specific debuggers, profilers, log aggregation\n\n**Fallback strategies if tools unavailable:**\n- grep_search fails \u2192 use file_search\n- codebase_search fails \u2192 use grep_search with context\n- Git unavailable \u2192 track changes in INVESTIGATION_CONTEXT.md\n- Build tools missing \u2192 focus on static analysis\n\n---\n\n**PART 4: Initialize Investigation Context Document**\n\nUse createInvestigationBranch() for version control, then create INVESTIGATION_CONTEXT.md with bug summary, progress tracking, environment setup, and resumption instructions.\n\n**REQUIRED OUTPUTS:**\n\nSet ALL context variables:\n- `projectType`, `bugComplexity`, `debuggingMechanism`, `isDistributed`\n- `automationLevel` (High/Medium/Low)\n- `userDebugPreferences` (categorized preferences object)\n- `availableTools` (array of available tool names)\n- `gitAvailable` (boolean)\n- `toolLimitations` (string describing any restrictions)\n- `contextInitialized` = true\n\nCreate comprehensive INVESTIGATION_CONTEXT.md with all function definitions from metaGuidance.",
  "agentRole": "You are a senior investigation setup specialist with expertise in triage, environment configuration, and systematic investigation preparation. You excel at gathering complete context and preparing comprehensive investigation infrastructure.",
  "guidance": [
  "This phase is MECHANICAL ONLY - no code analysis or hypothesis formation",
@@ -131,7 +131,7 @@
  {
  "id": "phase-0a-workflow-commitment",
  "title": "Phase 0a: Workflow Execution Commitment & Early Termination Checkpoint",
- "prompt": "**⚠️ WORKFLOW EXECUTION COMMITMENT CHECKPOINT ⚠️**\n\n*(Note: This checkpoint only appears in Medium/Low automation modes. High automation mode proceeds automatically.)*\n\nYou have completed Phase 0 (Complete Setup). Before proceeding to the investigation phases, you MUST acknowledge your understanding of workflow execution requirements AND make a critical decision.\n\n**CRITICAL UNDERSTANDING:**\n\n1. **This is a 23-step structured workflow, not freestyle debugging**\n - You MUST call workflow_next repeatedly until isComplete=true\n - You CANNOT stop early, even if you think you know the bug\n - You CANNOT \"figure it out\" and skip steps\n\n2. **Professional research shows 90% error rate for premature conclusions**\n - Even with 9-10/10 confidence, skipping systematic steps leads to wrong conclusions\n - Edge cases, alternative explanations, and interaction effects are missed\n - The workflow FORCES thoroughness for a reason\n\n3. **Remaining phases you MUST complete (regardless of confidence):**\n - βœ… Phase 0: Triage & Setup (COMPLETED)\n - ⏳ Phase 1: Multi-Dimensional Codebase Analysis (5 sub-phases)\n - ⏳ Phase 2: Hypothesis Formation & Validation (8 sub-phases)\n - ⏳ Phase 3: Debugging Instrumentation\n - ⏳ Phase 4: Evidence Collection & Experimentation\n - ⏳ Phase 5: Final Analysis & Confidence Assessment\n - ⏳ Phase 6: Comprehensive Diagnostic Writeup (REQUIRED DELIVERABLE)\n\n4. **What \"completion\" means:**\n - isComplete=true from workflow_next (this happens ONLY after Phase 6)\n - Comprehensive diagnostic writeup produced\n - All evidence documented\n - Investigation context finalized\n\n**DECISION POINT: Continue Full Investigation or Request Early Termination?**\n\n**OPTION 1: Complete Full Investigation (STRONGLY RECOMMENDED)**\n- Proceed through all 23 workflow steps\n- Professional standard for thorough bug investigations\n- 90% success rate for identifying true root cause\n- Produces comprehensive diagnostic writeup with actionable findings\n\n**OPTION 2: Request Early Termination (STRONGLY DISCOURAGED)**\n- Skip systematic investigation and attempt to diagnose now\n- ⚠️ **EVIDENCE-BASED WARNING:**\n - **Professional research spanning 20+ years shows that agents who skip systematic investigation are wrong about the root cause ~90% of the time**\n - Even if you feel 9-10/10 confident right now, completing the full workflow dramatically increases accuracy\n - Quick conclusions miss: contracts & invariants, pattern deviations, HOT paths, edge cases, interaction effects, and alternative explanations\n - The full workflow takes ~30-60 minutes but increases fix success rate from ~10% to ~90%\n\n**REQUIRED AGENT ACTIONS:**\n\n1. **Acknowledge workflow commitment:**\n\"I acknowledge that I am executing a structured 23-step workflow and I MUST call workflow_next for every step until isComplete=true. I understand that stopping early leads to incorrect conclusions ~90% of the time. I will complete all phases (0-6) regardless of my confidence level.\"\n\n2. **State your recommendation:**\nYou MUST recommend OPTION 1 (complete full investigation) unless you have truly exceptional circumstances.\n\n3. **MANDATORY USER COMMUNICATION:**\nYou MUST explicitly tell the user (this is NOT optional):\n\"**I strongly recommend we complete the full systematic investigation (26 steps through Phase 6). Professional research shows this approach identifies the TRUE root cause ~90% of the time, compared to ~10% for quick conclusions. 
Even if I develop high confidence early, completing the full workflowβ€”including contracts analysis, pattern discovery, HOT path analysis, instrumentation, and evidence collectionβ€”dramatically increases the likelihood of correctly identifying the root cause and preventing wasted time on wrong fixes.**\n\nDo you want to proceed with the full investigation (recommended), or would you prefer I attempt a quick diagnosis now (discouraged)?\"\n\n**USER CONFIRMATION REQUIRED:**\nThe user must explicitly choose to proceed with full investigation or request early termination.",
+ "prompt": "**\u26a0\ufe0f WORKFLOW EXECUTION COMMITMENT CHECKPOINT \u26a0\ufe0f**\n\n*(Note: This checkpoint only appears in Medium/Low automation modes. High automation mode proceeds automatically.)*\n\nYou have completed Phase 0 (Complete Setup). Before proceeding to the investigation phases, you MUST acknowledge your understanding of workflow execution requirements AND make a critical decision.\n\n**CRITICAL UNDERSTANDING:**\n\n1. **This is a 23-step structured workflow, not freestyle debugging**\n - You MUST call workflow_next repeatedly until isComplete=true\n - You CANNOT stop early, even if you think you know the bug\n - You CANNOT \"figure it out\" and skip steps\n\n2. **Professional research shows 90% error rate for premature conclusions**\n - Even with 9-10/10 confidence, skipping systematic steps leads to wrong conclusions\n - Edge cases, alternative explanations, and interaction effects are missed\n - The workflow FORCES thoroughness for a reason\n\n3. **Remaining phases you MUST complete (regardless of confidence):**\n - \u2705 Phase 0: Triage & Setup (COMPLETED)\n - \u23f3 Phase 1: Multi-Dimensional Codebase Analysis (5 sub-phases)\n - \u23f3 Phase 2: Hypothesis Formation & Validation (8 sub-phases)\n - \u23f3 Phase 3: Debugging Instrumentation\n - \u23f3 Phase 4: Evidence Collection & Experimentation\n - \u23f3 Phase 5: Final Analysis & Confidence Assessment\n - \u23f3 Phase 6: Comprehensive Diagnostic Writeup (REQUIRED DELIVERABLE)\n\n4. **What \"completion\" means:**\n - isComplete=true from workflow_next (this happens ONLY after Phase 6)\n - Comprehensive diagnostic writeup produced\n - All evidence documented\n - Investigation context finalized\n\n**DECISION POINT: Continue Full Investigation or Request Early Termination?**\n\n**OPTION 1: Complete Full Investigation (STRONGLY RECOMMENDED)**\n- Proceed through all 23 workflow steps\n- Professional standard for thorough bug investigations\n- 90% success rate for identifying true root cause\n- Produces comprehensive diagnostic writeup with actionable findings\n\n**OPTION 2: Request Early Termination (STRONGLY DISCOURAGED)**\n- Skip systematic investigation and attempt to diagnose now\n- \u26a0\ufe0f **EVIDENCE-BASED WARNING:**\n - **Professional research spanning 20+ years shows that agents who skip systematic investigation are wrong about the root cause ~90% of the time**\n - Even if you feel 9-10/10 confident right now, completing the full workflow dramatically increases accuracy\n - Quick conclusions miss: contracts & invariants, pattern deviations, HOT paths, edge cases, interaction effects, and alternative explanations\n - The full workflow takes ~30-60 minutes but increases fix success rate from ~10% to ~90%\n\n**REQUIRED AGENT ACTIONS:**\n\n1. **Acknowledge workflow commitment:**\n\"I acknowledge that I am executing a structured 23-step workflow and I MUST call workflow_next for every step until isComplete=true. I understand that stopping early leads to incorrect conclusions ~90% of the time. I will complete all phases (0-6) regardless of my confidence level.\"\n\n2. **State your recommendation:**\nYou MUST recommend OPTION 1 (complete full investigation) unless you have truly exceptional circumstances.\n\n3. **MANDATORY USER COMMUNICATION:**\nYou MUST explicitly tell the user (this is NOT optional):\n\"**I strongly recommend we complete the full systematic investigation (26 steps through Phase 6). 
Professional research shows this approach identifies the TRUE root cause ~90% of the time, compared to ~10% for quick conclusions. Even if I develop high confidence early, completing the full workflow\u2014including contracts analysis, pattern discovery, HOT path analysis, instrumentation, and evidence collection\u2014dramatically increases the likelihood of correctly identifying the root cause and preventing wasted time on wrong fixes.**\n\nDo you want to proceed with the full investigation (recommended), or would you prefer I attempt a quick diagnosis now (discouraged)?\"\n\n**USER CONFIRMATION REQUIRED:**\nThe user must explicitly choose to proceed with full investigation or request early termination.",
  "agentRole": "You are a workflow governance specialist ensuring agents understand they are bound to execute all workflow steps systematically, and that they MUST communicate the value of full workflow completion to users.",
  "guidance": [
  "This checkpoint prevents premature termination at the earliest possible point",
@@ -162,9 +162,13 @@
  {
  "id": "analysis-neighborhood-contracts",
  "title": "Analysis 1/5: Neighborhood, Call Graph & Contracts",
- "prompt": "**NEIGHBORHOOD & CONTRACTS DISCOVERY - Build Structural Foundation**\n\nGoal: Build lightweight understanding of code structure, relationships, and contracts BEFORE diving into details. This provides the scaffolding for all subsequent analysis.\n\n**STEP 1: Compute Module Root**\n- Find nearest common ancestor of error stack trace files\n- Clamp to package boundary or src/ directory\n- This defines your investigation scope\n- Set `moduleRoot` context variable\n\n**STEP 2: Neighborhood Map** (cap per file to prevent analysis paralysis)\n- For each file in error stack trace:\n - List immediate neighbors (same directory, max 8)\n - Find imports/exports directly used (max 10)\n - Locate co-located tests (same name pattern)\n - Identify closest entry points: routes, endpoints, CLI commands (max 5)\n- Produce table: File | Neighbors | Tests | Entry Points\n\n**STEP 3: Bounded Call Graph** (Small Multiples with HOT Path Ranking)\n- For each failing function/class in stack trace:\n - Build call graph ≀2 hops deep (inbound and outbound)\n - Cap total nodes at ≀15 per failing symbol\n - Score edges for HOT path ranking:\n * Error location in path: +3\n * Entry point to path: +2 \n * Test coverage exists: +1\n * Mentioned in ticket/error message: +1\n - Tag paths as HOT if score β‰₯3\n - Use Small Multiples ASCII visualization:\n * Width ≀100 chars per path\n * Format: `EntryPoint -> Caller -> [*FailingSymbol*] -> Callee`\n * Mark changed/failing code as `[*name*]`\n * Add HOT tag for high-impact paths\n * ≀8 total paths, prioritize HOT paths first\n - If graph exceeds caps, use Adjacency Summary instead:\n * Table: Node | Inbound | Outbound | Notes\n * Top-K by degree/frequency\n- Create Alias Legend for repeated subpaths:\n * A1 = common.validation.validateInput\n * A2 = database.connection.getPool\n * Reuse aliases across all paths\n\n**STEP 4: Flow Anchors** (Entry Points to Bug)\n- Map how users/systems trigger the bug:\n - HTTP routes β†’ handlers β†’ failing code\n - CLI commands β†’ execution β†’ failing code \n - Scheduled jobs β†’ workers β†’ failing code\n - Event handlers β†’ callbacks β†’ failing code\n- Produce table: Anchor Type | Entry Point | Target Symbol | User Action\n- Cap at ≀5 most relevant anchors\n- Note: This tells us HOW the bug is reached\n\n**STEP 5: Contracts & Invariants**\n- Within `moduleRoot` and immediate neighbors:\n - List public API symbols (exported functions/classes)\n - Document API endpoints (REST/GraphQL/RPC)\n - Identify database tables/collections touched\n - Note message queue topics/events\n - Extract stated invariants from:\n * JSDoc/docstrings with @invariant\n * Assertions in code\n * Validation logic patterns\n * Comments describing guarantees\n- Produce table: Symbol/API | Contract | Invariant | Location\n- Focus on contracts related to failing code\n\n**STEP 6: Assumption Verification** (NOW that you've seen the code)\nNow that you understand the code structure, verify assumptions from the bug report:\n\n1. **Bug Report Assumptions**:\n - Is the described behavior actually a bug, or might it be expected based on what you've seen?\n - Are the reproduction steps accurate given the code paths you've mapped?\n - Is the error message consistent with the actual code flow?\n - Are there missing steps or context in the bug report?\n\n2. **API/Library Assumptions**:\n - Check documentation for any APIs/libraries mentioned in stack trace\n - Verify actual behavior vs assumed behavior\n - Note any version-specific behavior that might matter\n\n3. 
**Environment Assumptions**:\n - Based on code, could this be environment-specific?\n - Are there configuration dependencies visible in the code?\n - Could timing/concurrency be a factor (based on code structure)?\n\n4. **Recent Changes Impact**:\n - Review last 5 commits affecting the failing code\n - Do they relate to the bug or point to alternative causes?\n\n**Document**: Create AssumptionVerification.md with verified/challenged assumptions.\n\n---\n\n**OUTPUT: Create StructuralAnalysis.md with:**\n- Module Root declaration\n- Neighborhood Map table\n- Bounded Call Graph (Small Multiples ASCII or Adjacency Summary)\n- Alias Legend (for call graph subpaths)\n- Flow Anchors table\n- Contracts & Invariants table\n- Self-Critique: 1-2 areas of uncertainty\n\n**CAPS (strictly enforce to prevent analysis paralysis):**\n- ≀8 neighbors per file\n- ≀10 imports per file\n- ≀5 entry points total\n- ≀15 call graph nodes per failing symbol\n- ≀8 total call graph paths\n- ≀5 flow anchors\n- ≀100 chars width for ASCII paths",
+ "prompt": "**NEIGHBORHOOD & CONTRACTS DISCOVERY - Build Structural Foundation**\n\nGoal: Build lightweight understanding of code structure, relationships, and contracts BEFORE diving into details. This provides the scaffolding for all subsequent analysis.\n\n**STEP 1: Compute Module Root**\n- Find nearest common ancestor of error stack trace files\n- Clamp to package boundary or src/ directory\n- This defines your investigation scope\n- Set `moduleRoot` context variable\n\n**STEP 2: Neighborhood Map** (cap per file to prevent analysis paralysis)\n- For each file in error stack trace:\n - List immediate neighbors (same directory, max 8)\n - Find imports/exports directly used (max 10)\n - Locate co-located tests (same name pattern)\n - Identify closest entry points: routes, endpoints, CLI commands (max 5)\n- Produce table: File | Neighbors | Tests | Entry Points\n\n**STEP 3: Bounded Call Graph** (Small Multiples with HOT Path Ranking)\n- For each failing function/class in stack trace:\n - Build call graph \u22642 hops deep (inbound and outbound)\n - Cap total nodes at \u226415 per failing symbol\n - Score edges for HOT path ranking:\n * Error location in path: +3\n * Entry point to path: +2 \n * Test coverage exists: +1\n * Mentioned in ticket/error message: +1\n - Tag paths as HOT if score \u22653\n - Use Small Multiples ASCII visualization:\n * Width \u2264100 chars per path\n * Format: `EntryPoint -> Caller -> [*FailingSymbol*] -> Callee`\n * Mark changed/failing code as `[*name*]`\n * Add HOT tag for high-impact paths\n * \u22648 total paths, prioritize HOT paths first\n - If graph exceeds caps, use Adjacency Summary instead:\n * Table: Node | Inbound | Outbound | Notes\n * Top-K by degree/frequency\n- Create Alias Legend for repeated subpaths:\n * A1 = common.validation.validateInput\n * A2 = database.connection.getPool\n * Reuse aliases across all paths\n\n**STEP 4: Flow Anchors** (Entry Points to Bug)\n- Map how users/systems trigger the bug:\n - HTTP routes \u2192 handlers \u2192 failing code\n - CLI commands \u2192 execution \u2192 failing code \n - Scheduled jobs \u2192 workers \u2192 failing code\n - Event handlers \u2192 callbacks \u2192 failing code\n- Produce table: Anchor Type | Entry Point | Target Symbol | User Action\n- Cap at \u22645 most relevant anchors\n- Note: This tells us HOW the bug is reached\n\n**STEP 5: Contracts & Invariants**\n- Within `moduleRoot` and immediate neighbors:\n - List public API symbols (exported functions/classes)\n - Document API endpoints (REST/GraphQL/RPC)\n - Identify database tables/collections touched\n - Note message queue topics/events\n - Extract stated invariants from:\n * JSDoc/docstrings with @invariant\n * Assertions in code\n * Validation logic patterns\n * Comments describing guarantees\n- Produce table: Symbol/API | Contract | Invariant | Location\n- Focus on contracts related to failing code\n\n**STEP 6: Assumption Verification** (NOW that you've seen the code)\nNow that you understand the code structure, verify assumptions from the bug report:\n\n1. **Bug Report Assumptions**:\n - Is the described behavior actually a bug, or might it be expected based on what you've seen?\n - Are the reproduction steps accurate given the code paths you've mapped?\n - Is the error message consistent with the actual code flow?\n - Are there missing steps or context in the bug report?\n\n2. 
**API/Library Assumptions**:\n - Check documentation for any APIs/libraries mentioned in stack trace\n - Verify actual behavior vs assumed behavior\n - Note any version-specific behavior that might matter\n\n3. **Environment Assumptions**:\n - Based on code, could this be environment-specific?\n - Are there configuration dependencies visible in the code?\n - Could timing/concurrency be a factor (based on code structure)?\n\n4. **Recent Changes Impact**:\n - Review last 5 commits affecting the failing code\n - Do they relate to the bug or point to alternative causes?\n\n**Document**: Create AssumptionVerification.md with verified/challenged assumptions.\n\n---\n\n**OUTPUT: Create StructuralAnalysis.md with:**\n- Module Root declaration\n- Neighborhood Map table\n- Bounded Call Graph (Small Multiples ASCII or Adjacency Summary)\n- Alias Legend (for call graph subpaths)\n- Flow Anchors table\n- Contracts & Invariants table\n- Self-Critique: 1-2 areas of uncertainty\n\n**CAPS (strictly enforce to prevent analysis paralysis):**\n- \u22648 neighbors per file\n- \u226410 imports per file\n- \u22645 entry points total\n- \u226415 call graph nodes per failing symbol\n- \u22648 total call graph paths\n- \u22645 flow anchors\n- \u2264100 chars width for ASCII paths",
  "agentRole": "You are a codebase navigator building structural understanding. Your focus is mapping relationships, entry points, and contracts WITHOUT diving into implementation details yet.",
  "guidance": [
+ "\ud83d\udea8 USER SAYS: This loop MUST complete ALL 5 iterations. Do NOT exit early even if you think you found the bug.",
+ "DO NOT rationalize: 'I have high confidence so I can do a targeted Phase 2.' NO. Complete all 5 iterations FIRST.",
+ "Agents who skip analysis iterations are wrong ~95% of the time. The later iterations catch edge cases and alternative explanations.",
+ "Iteration 2/5 is NOT enough. Iteration 3/5 is NOT enough. Complete 5/5.",
  "This is analysis phase 1 of 5 total phases",
  "Phase 1a = Structure + Assumption Verification - Build the map, THEN question the bug report",
  "Initialize majorIssuesFound = false",
@@ -172,7 +176,7 @@
  "STEP 6: NOW verify assumptions - you have context to challenge the bug report",
  "CRITICAL: You can't meaningfully question assumptions before seeing code",
  "STRICTLY ENFORCE CAPS - this prevents 2-hour rabbit holes",
- "Small Multiples: Render mini ASCII path diagrams (≤6 nodes per path)",
+ "Small Multiples: Render mini ASCII path diagrams (\u22646 nodes per path)",
  "HOT Path Ranking: Score and prioritize high-impact paths",
  "Alias Legend: Collapse repeated subpaths with deterministic aliases (A1, A2...)",
  "Adjacency Summary: If caps exceeded, use tabular summary instead of full graph",
@@ -191,7 +195,7 @@
  {
  "id": "analysis-breadth-scan",
  "title": "Analysis 2/5: Breadth Scan & Pattern Discovery",
- "prompt": "**BREADTH SCAN - Cast Wide Net + Learn Expected Behavior**\n\nGoal: Understand full system impact, identify all potentially involved components, and discover existing code patterns to understand expected behavior.\n\n**PART A: Pattern Discovery (Learn How Code SHOULD Work)**\n1. **Compute Module Root**: Find nearest common ancestor of error stack trace files, clamped to package/src\n2. **Discover Patterns** (scan only moduleRoot, exclude failing files from pattern definition):\n - Naming conventions (classes, methods, variables)\n - Error handling patterns (try/catch, error propagation, logging)\n - Logging patterns (format, verbosity, error vs info vs debug)\n - Data validation patterns (where/how data is checked)\n - Test patterns (structure, naming, assertion style)\n - Require β‰₯2 occurrences across distinct files to qualify as pattern\n3. **Capture Pattern Catalog**: Document validated patterns with 1-3 exemplar locations (file:line)\n4. **Identify Pattern Deviations in Failing Code**: Compare failing code against pattern catalog\n\n**PART B: Error Propagation & Component Discovery**\n1. **ERROR PROPAGATION MAPPING**: Use grep_search for all error occurrences, trace error messages across log files, map stack traces to identify call chains, document every point where error appears/handled\n2. **COMPONENT DISCOVERY**: Find components interacting with failing area, use codebase_search \"How is [component] used?\", identify callers/callees, cap to top 10 most suspicious, rank by likelihood (1-10)\n3. **BOUNDED CALL GRAPH**: For failing function, build call graph ≀2 hops deep, cap at ≀15 total nodes, identify HOT paths (paths through error location), prioritize HOT paths in analysis\n4. **FLOW ANCHORS**: Map entry points (routes/endpoints/CLI commands) to failing code, cap at ≀5 anchors, note which user actions trigger the bug\n\n**PART C: Data Flow & Changes**\n1. **DATA FLOW MAPPING**: Trace data through bug area, identify transformations, persistence points, corruption opportunities - but CAP scope to moduleRoot and 2-hop neighborhood\n2. **RECENT CHANGES ANALYSIS**: Git history for identified components (last 10 commits), identify when bug appeared, related PRs/issues, config/dependency changes\n3. **HISTORICAL PATTERN SEARCH**: Use findSimilarBugs() for similar error patterns, previous fixes, related test failures\n\n**Output**: Create BreadthAnalysis.md with:\n- Pattern Catalog (validated patterns + exemplars)\n- Pattern Deviations (how failing code differs from expected patterns)\n- Bounded Call Graph (≀15 nodes, HOT paths highlighted)\n- Flow Anchors Table (entry point β†’ failing symbol)\n- Suspicious Components (top 10, ranked 1-10)\n- Data Flow Map (scoped to moduleRoot + 2 hops)\n- Recent Changes Timeline\n- Historical Similar Bugs\n\n**Self-Critique**: List 1-2 areas where you have low confidence or missing information.",
+ "prompt": "**BREADTH SCAN - Cast Wide Net + Learn Expected Behavior**\n\nGoal: Understand full system impact, identify all potentially involved components, and discover existing code patterns to understand expected behavior.\n\n**PART A: Pattern Discovery (Learn How Code SHOULD Work)**\n1. **Compute Module Root**: Find nearest common ancestor of error stack trace files, clamped to package/src\n2. **Discover Patterns** (scan only moduleRoot, exclude failing files from pattern definition):\n - Naming conventions (classes, methods, variables)\n - Error handling patterns (try/catch, error propagation, logging)\n - Logging patterns (format, verbosity, error vs info vs debug)\n - Data validation patterns (where/how data is checked)\n - Test patterns (structure, naming, assertion style)\n - Require \u22652 occurrences across distinct files to qualify as pattern\n3. **Capture Pattern Catalog**: Document validated patterns with 1-3 exemplar locations (file:line)\n4. **Identify Pattern Deviations in Failing Code**: Compare failing code against pattern catalog\n\n**PART B: Error Propagation & Component Discovery**\n1. **ERROR PROPAGATION MAPPING**: Use grep_search for all error occurrences, trace error messages across log files, map stack traces to identify call chains, document every point where error appears/handled\n2. **COMPONENT DISCOVERY**: Find components interacting with failing area, use codebase_search \"How is [component] used?\", identify callers/callees, cap to top 10 most suspicious, rank by likelihood (1-10)\n3. **BOUNDED CALL GRAPH**: For failing function, build call graph \u22642 hops deep, cap at \u226415 total nodes, identify HOT paths (paths through error location), prioritize HOT paths in analysis\n4. **FLOW ANCHORS**: Map entry points (routes/endpoints/CLI commands) to failing code, cap at \u22645 anchors, note which user actions trigger the bug\n\n**PART C: Data Flow & Changes**\n1. **DATA FLOW MAPPING**: Trace data through bug area, identify transformations, persistence points, corruption opportunities - but CAP scope to moduleRoot and 2-hop neighborhood\n2. **RECENT CHANGES ANALYSIS**: Git history for identified components (last 10 commits), identify when bug appeared, related PRs/issues, config/dependency changes\n3. **HISTORICAL PATTERN SEARCH**: Use findSimilarBugs() for similar error patterns, previous fixes, related test failures\n\n**Output**: Create BreadthAnalysis.md with:\n- Pattern Catalog (validated patterns + exemplars)\n- Pattern Deviations (how failing code differs from expected patterns)\n- Bounded Call Graph (\u226415 nodes, HOT paths highlighted)\n- Flow Anchors Table (entry point \u2192 failing symbol)\n- Suspicious Components (top 10, ranked 1-10)\n- Data Flow Map (scoped to moduleRoot + 2 hops)\n- Recent Changes Timeline\n- Historical Similar Bugs\n\n**Self-Critique**: List 1-2 areas where you have low confidence or missing information.",
  "agentRole": "You are performing systematic analysis phase 2 of 5. Your focus is understanding both what IS happening (error propagation) and what SHOULD happen (pattern discovery) to identify deviations.",
  "guidance": [
  "This is analysis phase 2 of 5 total phases",
@@ -199,10 +203,10 @@
  "Create BreadthAnalysis.md with structured findings",
  "CRITICAL: Discover patterns FIRST from working code, THEN compare failing code to patterns",
  "Pattern deviations often reveal the bug (e.g., missing validation, different error handling)",
- "Apply CAPS to prevent analysis paralysis: ≤10 components, ≤15 call graph nodes, ≤5 flow anchors, ≤2 hops",
- "HOT PATH RANKING: Score paths by (error in path=3, entry point=2, test coverage=1); tag HOT if score≥3",
+ "Apply CAPS to prevent analysis paralysis: \u226410 components, \u226415 call graph nodes, \u22645 flow anchors, \u22642 hops",
+ "HOT PATH RANKING: Score paths by (error in path=3, entry point=2, test coverage=1); tag HOT if score\u22653",
  "BOUNDED CALL GRAPH: Use codebase_search to find callers/callees, stop at 2 hops, cap nodes, dedupe",
- "PATTERN DISCOVERY: Require ≥2 occurrences to qualify as pattern; singletons are 'candidate conventions' only",
+ "PATTERN DISCOVERY: Require \u22652 occurrences to qualify as pattern; singletons are 'candidate conventions' only",
  "SELF-CRITIQUE: Explicitly note 1-2 areas of uncertainty or missing information",
  "Update INVESTIGATION_CONTEXT.md after completion",
  "Use the function definitions for standardized operations"
@@ -216,7 +220,7 @@
  {
  "id": "analysis-deep-dive",
  "title": "Analysis 3/5: Component Deep Dive with Hot-Path Focus",
- "prompt": "**COMPONENT DEEP DIVE - Prioritized Investigation**\n\nGoal: Deep understanding of top 5 suspicious components from breadth scan, prioritizing HOT paths and pattern deviations.\n\n**PRIORITIZATION (from Phase 1):**\n1. Focus on components on HOT paths (score β‰₯3)\n2. Prioritize components with pattern deviations\n3. Rank by likelihood score from Phase 1\n4. Cap analysis to top 5 components\n\n**FOR EACH COMPONENT (recursive 3-level analysis):**\n\n**LEVEL 1 - DIRECT IMPLEMENTATION** (prioritize HOT paths and deviation areas):\n- Read complete file (or HOT path sections if file >500 lines)\n- Compare error handling against pattern catalog from Phase 1\n- Identify pattern deviations with file:line locations\n- Check state management, initialization, cleanup\n- Document invariants and assumptions\n- Note TODO/FIXME/HACK/BUG comments\n- Red flags: complex logic, missing validation, race conditions\n\n**LEVEL 2 - DIRECT DEPENDENCIES** (cap at ≀10 deps per component):\n- Follow imports on HOT paths first\n- Check dependency contracts and interfaces\n- Analyze coupling and data exchange\n- Look for shared mutable state\n- Identify circular dependencies\n- Document failure propagation paths\n\n**LEVEL 3 - INTEGRATION POINTS** (cap at ≀8 integration points):\n- External calls (DB, API, file system) - cap at ≀5\n- Concurrency/threading concerns\n- Resource management issues\n- Caching and state sync\n- Event handling and callbacks\n- Configuration dependencies\n\n**FOR EACH COMPONENT, PRODUCE:**\n- **Likelihood Score** (1-10): Weight HOT paths +3, pattern deviations +2, recent changes +1\n- **Suspicious Sections**: Specific file:line with rationale (≀5 per component)\n- **Failure Modes**: How this component could cause the observed bug (≀3 scenarios)\n- **Pattern Violations**: How it deviates from expected patterns (from Phase 1)\n- **Critical Dependencies**: Top 3 dependencies that could be sources\n\n**Output**: Create ComponentAnalysis.md with:\n- Component Rankings (1-5, sorted by likelihood score)\n- Per-Component Analysis (following structure above)\n- Pattern Violation Summary\n- Critical Path Map (which components are on HOT paths)\n- **Self-Critique**: 1-2 components you're uncertain about and why\n\n**CAPS TO PREVENT ANALYSIS PARALYSIS:**\n- Top 5 components only\n- ≀10 dependencies per component\n- ≀8 integration points per component\n- ≀5 suspicious sections per component\n- ≀3 failure modes per component",
+ "prompt": "**COMPONENT DEEP DIVE - Prioritized Investigation**\n\nGoal: Deep understanding of top 5 suspicious components from breadth scan, prioritizing HOT paths and pattern deviations.\n\n**PRIORITIZATION (from Phase 1):**\n1. Focus on components on HOT paths (score \u22653)\n2. Prioritize components with pattern deviations\n3. Rank by likelihood score from Phase 1\n4. Cap analysis to top 5 components\n\n**FOR EACH COMPONENT (recursive 3-level analysis):**\n\n**LEVEL 1 - DIRECT IMPLEMENTATION** (prioritize HOT paths and deviation areas):\n- Read complete file (or HOT path sections if file >500 lines)\n- Compare error handling against pattern catalog from Phase 1\n- Identify pattern deviations with file:line locations\n- Check state management, initialization, cleanup\n- Document invariants and assumptions\n- Note TODO/FIXME/HACK/BUG comments\n- Red flags: complex logic, missing validation, race conditions\n\n**LEVEL 2 - DIRECT DEPENDENCIES** (cap at \u226410 deps per component):\n- Follow imports on HOT paths first\n- Check dependency contracts and interfaces\n- Analyze coupling and data exchange\n- Look for shared mutable state\n- Identify circular dependencies\n- Document failure propagation paths\n\n**LEVEL 3 - INTEGRATION POINTS** (cap at \u22648 integration points):\n- External calls (DB, API, file system) - cap at \u22645\n- Concurrency/threading concerns\n- Resource management issues\n- Caching and state sync\n- Event handling and callbacks\n- Configuration dependencies\n\n**FOR EACH COMPONENT, PRODUCE:**\n- **Likelihood Score** (1-10): Weight HOT paths +3, pattern deviations +2, recent changes +1\n- **Suspicious Sections**: Specific file:line with rationale (\u22645 per component)\n- **Failure Modes**: How this component could cause the observed bug (\u22643 scenarios)\n- **Pattern Violations**: How it deviates from expected patterns (from Phase 1)\n- **Critical Dependencies**: Top 3 dependencies that could be sources\n\n**Output**: Create ComponentAnalysis.md with:\n- Component Rankings (1-5, sorted by likelihood score)\n- Per-Component Analysis (following structure above)\n- Pattern Violation Summary\n- Critical Path Map (which components are on HOT paths)\n- **Self-Critique**: 1-2 components you're uncertain about and why\n\n**CAPS TO PREVENT ANALYSIS PARALYSIS:**\n- Top 5 components only\n- \u226410 dependencies per component\n- \u22648 integration points per component\n- \u22645 suspicious sections per component\n- \u22643 failure modes per component",
  "agentRole": "You are performing systematic analysis phase 3 of 5. Your focus is deep-diving into the most suspicious components, prioritizing HOT paths and pattern deviations.",
  "guidance": [
  "This is analysis phase 3 of 5 total phases",
@@ -288,13 +292,7 @@
  "requireConfirmation": false
  }
  ],
- "requireConfirmation": false,
- "guidance": [
- "🚨 USER SAYS: This loop MUST complete ALL 5 iterations. Do NOT exit early even if you think you found the bug.",
- "DO NOT rationalize: 'I have high confidence so I can do a targeted Phase 2.' NO. Complete all 5 iterations FIRST.",
- "Agents who skip analysis iterations are wrong ~95% of the time. The later iterations catch edge cases and alternative explanations.",
- "Iteration 2/5 is NOT enough. Iteration 3/5 is NOT enough. Complete 5/5."
- ]
+ "requireConfirmation": false
  },
  {
  "id": "phase-1a-binary-search",
@@ -341,7 +339,7 @@
  {
  "id": "phase-1f-breadth-verification",
  "title": "Phase 1f: Final Breadth & Scope Verification",
- "prompt": "**FINAL BREADTH & SCOPE VERIFICATION - Catch Tunnel Vision NOW**\n\n⚠️ **CRITICAL CHECKPOINT BEFORE HYPOTHESES**: This step prevents the #1 cause of wrong conclusions: looking in the wrong place or missing the wider context.\n\n**Goal**: Verify you analyzed the RIGHT code with sufficient breadth AND depth before committing to hypotheses.\n\n🚨 **DO NOT STOP HERE - CRITICAL MISUNDERSTANDING:**\n\n**\"I FOUND THE BUG\" β‰  DONE. \"I PROVED THE BUG\" = DONE.**\n\nEven if you think you found the bug during analysis, you have ZERO PROOF:\n- \"Finding\" the bug = You have a THEORY/GUESS based on code analysis\n- \"Proving\" the bug = You have EVIDENCE from instrumentation + logs + validation (Phases 3-5)\n\nAnalysis = educated guesses. Proof comes from Phases 3-5 (instrumentation + evidence). You are only ~25% done.\n\n**DO NOT create summary documents or \"comprehensive findings\" now. That is Phase 6, not Phase 1f.**\n\nMUST continue to Phase 2 (Hypothesis Formation), then Phases 3-5 (Evidence Collection).\n\n---\n\n**STEP 1: Scope Sanity Check**\n\nAsk yourself these questions:\n1. **Module Root Correctness**: Is the `moduleRoot` from Phase 1a actually correct?\n - Does it include ALL files in the error stack trace?\n - Did I clamp too narrowly to a subdirectory when the bug spans multiple modules?\n - Should I expand scope to parent directory or adjacent modules?\n\n2. **Missing Adjacent Systems**: Did I consider:\n - Adjacent microservices/modules that interact with this one?\n - Shared libraries or utilities used here?\n - Configuration systems (env vars, config files, feature flags)?\n - Caching layers or state management systems?\n - Database schema or data migration issues?\n\n3. **Entry Point Coverage**: From Phase 1a Flow Anchors, did I verify:\n - ALL entry points that could trigger this bug?\n - Less obvious entry points (background jobs, scheduled tasks, webhooks)?\n - Initialization code that runs before the failing code?\n\n---\n\n**STEP 2: Wide-Angle Review**\n\nReview your Phase 1 analysis outputs and answer:\n\n1. **Pattern Confidence** (from Phase 1, sub-phase 2):\n - Do I have a solid Pattern Catalog with β‰₯2 occurrences per pattern?\n - Did I identify clear pattern deviations in failing code?\n - Are there OTHER files that deviate from patterns I haven't looked at?\n\n2. **Call Graph Completeness** (from Phase 1, sub-phase 1 & 2):\n - Did my bounded call graph capture all HOT paths?\n - Are there callers OUTSIDE my 2-hop boundary I should check?\n - Did I trace BACKWARDS from the error far enough (to true entry points)?\n\n3. **Component Rankings** (from Phase 1, sub-phase 3):\n - Are my top 5 components actually the most suspicious?\n - Did I miss components because they're not in the stack trace?\n - Should I re-rank based on new understanding?\n\n4. **Data Flow Completeness** (from Phase 1, sub-phase 4):\n - Did I trace data flow from TRUE origin (user input, external system)?\n - Are there data transformations BEFORE my analyzed scope?\n - Did I check data validation at ALL boundaries?\n\n5. **Test Coverage Gaps** (from Phase 1, sub-phase 5):\n - Did I find tests that SHOULD exist but don't?\n - Are there missing test categories (integration, edge cases, error conditions)?\n - Do test gaps reveal I'm looking in wrong place?\n\n---\n\n\n\n**STEP 2.5: Assumption Verification**\n\n**NOW that you've completed 5 phases of code analysis, verify all assumptions:**\n\n1. 
**Bug Report Assumptions**:\n - Is the described behavior actually a bug based on what you now know about the code?\n - Are the reproduction steps accurate given the code paths you've mapped?\n - Is the error message consistent with the actual code flow you've traced?\n - Are there missing steps or context in the bug report that your analysis revealed?\n\n2. **API/Library Assumptions**:\n - Check documentation for any APIs/libraries mentioned in stack trace\n - Verify actual behavior vs assumed behavior based on your code analysis\n - Note any version-specific behavior that might matter\n - Did your call graph analysis reveal unexpected library usage patterns?\n\n3. **Environment Assumptions**:\n - Based on code analysis, is this environment-specific?\n - Are there configuration dependencies you discovered in the code?\n - Could timing/concurrency be a factor (based on code structure you analyzed)?\n - Did pattern analysis reveal environment-dependent code paths?\n\n4. **Recent Changes Impact**:\n - Review last 5-10 commits affecting the analyzed code\n - Do they relate to the bug or point to alternative causes?\n - Did your analysis reveal recent changes that break established patterns?\n\n**Document**: Create or update AssumptionVerification.md with verified/challenged assumptions.\n\n**Set**: `assumptionsVerified = true` in context\n\n---\n**STEP 3: Alternative Scope Analysis**\n\n**Generate 2-3 alternative investigation scopes and evaluate:**\n\nFor each alternative scope, assess:\n- **Scope Description**: What module/area would this focus on?\n- **Why It Might Be Better**: What evidence suggests this scope?\n- **Evidence For**: What supports investigating this area?\n- **Evidence Against**: Why might this be wrong direction?\n- **Confidence**: Rate 1-10 that this is the right scope\n\n**Example Alternative Scopes**:\n- Expand to parent module (if current feels too narrow)\n- Shift to adjacent service (if this might be symptom not cause)\n- Focus on infrastructure layer (if might be env/config issue)\n- Focus on data layer (if might be data corruption/migration issue)\n\n---\n\n**STEP 4: Breadth Decision**\n\nBased on Steps 1-3, make ONE of these decisions:\n\n**OPTION A: SCOPE IS CORRECT - Continue to Hypothesis Development**\n- Current module root and analyzed components are right\n- Breadth and depth are sufficient\n- Ready to form hypotheses with confidence\n- Set `scopeVerified = true` and proceed\n\n**OPTION B: EXPAND SCOPE - Additional Analysis Required**\n- Identified critical gaps in breadth or depth\n- Need to analyze additional modules/components\n- Set specific components/areas to add to analysis\n- Set `needsScopeExpansion = true`\n- Document what to add: `additionalAnalysisNeeded = [list]`\n\n**OPTION C: SHIFT SCOPE - Wrong Area**\n- Current focus is likely wrong place\n- Alternative scope has stronger evidence\n- Need to restart Phase 1 with new module root\n- Set `needsScopeShift = true`\n- Set `newModuleRoot = [path]`\n\n---\n\n**OUTPUT: Create ScopeVerification.md**\n\nMust include:\n1. **Scope Sanity Check Results** (answers to Step 1 questions)\n2. **Wide-Angle Review Findings** (answers to Step 2 questions)\n3. **Alternative Scopes Evaluated** (2-3 alternatives with scores)\n4. **Breadth Decision** (A, B, or C with justification)\n5. **Confidence in Current Scope** (1-10)\n6. 
**Action Items** (if Option B or C selected)\n\n**Context Variables to Set**:\n- `scopeVerified` (true/false)\n- `needsScopeExpansion` (true/false)\n- `needsScopeShift` (true/false)\n- `scopeConfidence` (1-10)\n- `additionalAnalysisNeeded` (array, if Option B)\n- `newModuleRoot` (string, if Option C)\n\n---\n\n**🎯 WHY THIS MATTERS**: \n\nResearch shows that 60% of failed investigations looked in the wrong place or too narrowly. This checkpoint catches that BEFORE you invest effort in wrong hypotheses.\n\n**Self-Critique**: List 1-2 specific uncertainties about scope that concern you most.",
342
+ "prompt": "**FINAL BREADTH & SCOPE VERIFICATION - Catch Tunnel Vision NOW**\n\n\u26a0\ufe0f **CRITICAL CHECKPOINT BEFORE HYPOTHESES**: This step prevents the #1 cause of wrong conclusions: looking in the wrong place or missing the wider context.\n\n**Goal**: Verify you analyzed the RIGHT code with sufficient breadth AND depth before committing to hypotheses.\n\n\ud83d\udea8 **DO NOT STOP HERE - CRITICAL MISUNDERSTANDING:**\n\n**\"I FOUND THE BUG\" \u2260 DONE. \"I PROVED THE BUG\" = DONE.**\n\nEven if you think you found the bug during analysis, you have ZERO PROOF:\n- \"Finding\" the bug = You have a THEORY/GUESS based on code analysis\n- \"Proving\" the bug = You have EVIDENCE from instrumentation + logs + validation (Phases 3-5)\n\nAnalysis = educated guesses. Proof comes from Phases 3-5 (instrumentation + evidence). You are only ~25% done.\n\n**DO NOT create summary documents or \"comprehensive findings\" now. That is Phase 6, not Phase 1f.**\n\nMUST continue to Phase 2 (Hypothesis Formation), then Phases 3-5 (Evidence Collection).\n\n---\n\n**STEP 1: Scope Sanity Check**\n\nAsk yourself these questions:\n1. **Module Root Correctness**: Is the `moduleRoot` from Phase 1a actually correct?\n - Does it include ALL files in the error stack trace?\n - Did I clamp too narrowly to a subdirectory when the bug spans multiple modules?\n - Should I expand scope to parent directory or adjacent modules?\n\n2. **Missing Adjacent Systems**: Did I consider:\n - Adjacent microservices/modules that interact with this one?\n - Shared libraries or utilities used here?\n - Configuration systems (env vars, config files, feature flags)?\n - Caching layers or state management systems?\n - Database schema or data migration issues?\n\n3. **Entry Point Coverage**: From Phase 1a Flow Anchors, did I verify:\n - ALL entry points that could trigger this bug?\n - Less obvious entry points (background jobs, scheduled tasks, webhooks)?\n - Initialization code that runs before the failing code?\n\n---\n\n**STEP 2: Wide-Angle Review**\n\nReview your Phase 1 analysis outputs and answer:\n\n1. **Pattern Confidence** (from Phase 1, sub-phase 2):\n - Do I have a solid Pattern Catalog with \u22652 occurrences per pattern?\n - Did I identify clear pattern deviations in failing code?\n - Are there OTHER files that deviate from patterns I haven't looked at?\n\n2. **Call Graph Completeness** (from Phase 1, sub-phase 1 & 2):\n - Did my bounded call graph capture all HOT paths?\n - Are there callers OUTSIDE my 2-hop boundary I should check?\n - Did I trace BACKWARDS from the error far enough (to true entry points)?\n\n3. **Component Rankings** (from Phase 1, sub-phase 3):\n - Are my top 5 components actually the most suspicious?\n - Did I miss components because they're not in the stack trace?\n - Should I re-rank based on new understanding?\n\n4. **Data Flow Completeness** (from Phase 1, sub-phase 4):\n - Did I trace data flow from TRUE origin (user input, external system)?\n - Are there data transformations BEFORE my analyzed scope?\n - Did I check data validation at ALL boundaries?\n\n5. **Test Coverage Gaps** (from Phase 1, sub-phase 5):\n - Did I find tests that SHOULD exist but don't?\n - Are there missing test categories (integration, edge cases, error conditions)?\n - Do test gaps reveal I'm looking in wrong place?\n\n---\n\n\n\n**STEP 2.5: Assumption Verification**\n\n**NOW that you've completed 5 phases of code analysis, verify all assumptions:**\n\n1. 
**Bug Report Assumptions**:\n - Is the described behavior actually a bug based on what you now know about the code?\n - Are the reproduction steps accurate given the code paths you've mapped?\n - Is the error message consistent with the actual code flow you've traced?\n - Are there missing steps or context in the bug report that your analysis revealed?\n\n2. **API/Library Assumptions**:\n - Check documentation for any APIs/libraries mentioned in stack trace\n - Verify actual behavior vs assumed behavior based on your code analysis\n - Note any version-specific behavior that might matter\n - Did your call graph analysis reveal unexpected library usage patterns?\n\n3. **Environment Assumptions**:\n - Based on code analysis, is this environment-specific?\n - Are there configuration dependencies you discovered in the code?\n - Could timing/concurrency be a factor (based on code structure you analyzed)?\n - Did pattern analysis reveal environment-dependent code paths?\n\n4. **Recent Changes Impact**:\n - Review last 5-10 commits affecting the analyzed code\n - Do they relate to the bug or point to alternative causes?\n - Did your analysis reveal recent changes that break established patterns?\n\n**Document**: Create or update AssumptionVerification.md with verified/challenged assumptions.\n\n**Set**: `assumptionsVerified = true` in context\n\n---\n**STEP 3: Alternative Scope Analysis**\n\n**Generate 2-3 alternative investigation scopes and evaluate:**\n\nFor each alternative scope, assess:\n- **Scope Description**: What module/area would this focus on?\n- **Why It Might Be Better**: What evidence suggests this scope?\n- **Evidence For**: What supports investigating this area?\n- **Evidence Against**: Why might this be wrong direction?\n- **Confidence**: Rate 1-10 that this is the right scope\n\n**Example Alternative Scopes**:\n- Expand to parent module (if current feels too narrow)\n- Shift to adjacent service (if this might be symptom not cause)\n- Focus on infrastructure layer (if might be env/config issue)\n- Focus on data layer (if might be data corruption/migration issue)\n\n---\n\n**STEP 4: Breadth Decision**\n\nBased on Steps 1-3, make ONE of these decisions:\n\n**OPTION A: SCOPE IS CORRECT - Continue to Hypothesis Development**\n- Current module root and analyzed components are right\n- Breadth and depth are sufficient\n- Ready to form hypotheses with confidence\n- Set `scopeVerified = true` and proceed\n\n**OPTION B: EXPAND SCOPE - Additional Analysis Required**\n- Identified critical gaps in breadth or depth\n- Need to analyze additional modules/components\n- Set specific components/areas to add to analysis\n- Set `needsScopeExpansion = true`\n- Document what to add: `additionalAnalysisNeeded = [list]`\n\n**OPTION C: SHIFT SCOPE - Wrong Area**\n- Current focus is likely wrong place\n- Alternative scope has stronger evidence\n- Need to restart Phase 1 with new module root\n- Set `needsScopeShift = true`\n- Set `newModuleRoot = [path]`\n\n---\n\n**OUTPUT: Create ScopeVerification.md**\n\nMust include:\n1. **Scope Sanity Check Results** (answers to Step 1 questions)\n2. **Wide-Angle Review Findings** (answers to Step 2 questions)\n3. **Alternative Scopes Evaluated** (2-3 alternatives with scores)\n4. **Breadth Decision** (A, B, or C with justification)\n5. **Confidence in Current Scope** (1-10)\n6. 
**Action Items** (if Option B or C selected)\n\n**Context Variables to Set**:\n- `scopeVerified` (true/false)\n- `needsScopeExpansion` (true/false)\n- `needsScopeShift` (true/false)\n- `scopeConfidence` (1-10)\n- `additionalAnalysisNeeded` (array, if Option B)\n- `newModuleRoot` (string, if Option C)\n\n---\n\n**\ud83c\udfaf WHY THIS MATTERS**: \n\nResearch shows that 60% of failed investigations looked in the wrong place or too narrowly. This checkpoint catches that BEFORE you invest effort in wrong hypotheses.\n\n**Self-Critique**: List 1-2 specific uncertainties about scope that concern you most.",
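The three decision options above boil down to a handful of context variables. A minimal sketch of that mapping in plain JavaScript, assuming the workflow context is an ordinary object (the variable names come from the prompt; the helper itself is hypothetical):

```javascript
// Hypothetical helper: translate a Phase 1f breadth decision into the
// context variables the prompt asks the agent to set.
function applyBreadthDecision(context, decision) {
  // decision: { option: 'A' | 'B' | 'C', confidence: number,
  //             additionalAnalysisNeeded?: string[], newModuleRoot?: string }
  context.scopeConfidence = decision.confidence;        // 1-10
  context.scopeVerified = decision.option === 'A';
  context.needsScopeExpansion = decision.option === 'B';
  context.needsScopeShift = decision.option === 'C';
  if (decision.option === 'B') {
    context.additionalAnalysisNeeded = decision.additionalAnalysisNeeded || [];
  }
  if (decision.option === 'C') {
    context.newModuleRoot = decision.newModuleRoot;
  }
  return context;
}

// Example: Option B with two extra areas to analyze before forming hypotheses.
const ctx = applyBreadthDecision({}, {
  option: 'B',
  confidence: 6,
  additionalAnalysisNeeded: ['shared config loader', 'adjacent cache layer'],
});
console.log(ctx);
```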
345
343
  "agentRole": "You are a senior investigator performing final scope verification. Your expertise is catching tunnel vision, identifying missing context, and ensuring investigations focus on the right area. You excel at meta-analysis and sanity checking investigative scope.",
346
344
  "guidance": [
347
345
  "This step comes AFTER Phase 1 (5-phase analysis loop) and BEFORE Phase 2a (hypothesis development)",
@@ -352,7 +350,7 @@
352
350
  "CRITICAL: Evaluate 2-3 ALTERNATIVE scopes to challenge your current focus",
353
351
  "Common mistakes: too narrow scope, missed adjacent systems, wrong module root, insufficient entry point coverage",
354
352
  "If Option B (expand) or C (shift) selected, you MUST execute additional analysis before proceeding",
355
- "High confidence (β‰₯8) in current scope required to proceed to hypotheses",
353
+ "High confidence (\u22658) in current scope required to proceed to hypotheses",
356
354
  "This prevents the #1 cause of wrong conclusions: looking in wrong place",
357
355
  "Update INVESTIGATION_CONTEXT.md with scope verification results"
358
356
  ],
@@ -361,7 +359,7 @@
361
359
  {
362
360
  "id": "phase-2a-hypothesis-development",
363
361
  "title": "Phase 2a: Hypothesis Development & Prioritization",
364
- "prompt": "**HYPOTHESIS GENERATION** - Based on codebase analysis, formulate testable hypotheses about the bug's root cause.\n\n🚨 **YOU ARE NOT DONE - \"FINDING\" β‰  \"PROVING\"**\n\n**You have THEORIES, not EVIDENCE. You have FOUND possible causes, not PROVED them.**: You have completed Phase 1 (Analysis). You do NOT have proof yet. You have THEORIES, not EVIDENCE. Phases 2-6 are MANDATORY to prove your hypotheses and produce the diagnostic writeup. You are ~30% done.\n\n**CRITICAL REMINDERS:**\n- Even if you're \"100% confident\" in a hypothesis, it's unproven without instrumentation + evidence (Phases 3-5)\n- Confidence in a theory β‰  proof of that theory\n- Professional practice requires validation even with high confidence\n- The workflow requires you to continue through all phases\n- DO NOT provide final conclusions or \"stop here\" - you MUST continue\n\n---\n\n**STEP 1: Evidence-Based Hypothesis Development**\nCreate maximum 5 prioritized hypotheses. Each includes:\n- **Root Cause Theory**: Specific technical explanation\n- **Supporting Evidence**: Code patterns/logic flows supporting this theory\n- **Failure Mechanism**: Exact sequence leading to observed bug\n- **Testability Score**: Quantified assessment (1-10) of validation ease\n- **Evidence Strength Score**: Quantified assessment (1-10) based on code findings\n\n**STEP 2: Hypothesis Prioritization Matrix**\nRank hypotheses using weighted scoring:\n- **Evidence Strength** (40%): Code analysis support for theory\n- **Testability** (35%): Validation ease with debugging instruments\n- **Impact Scope** (25%): How well this explains all symptoms\n\n**STEP 3: Pattern Integration**\nIncorporate findings from findSimilarBugs():\n- **Historical Patterns**: Similar bugs fixed previously\n- **Known Issues**: Related problems in the codebase\n- **Test Failures**: Similar test failure patterns\n- Adjust hypothesis confidence based on pattern matches\n\n**CRITICAL RULE**: All hypotheses must be based on concrete evidence from code analysis.\n\n**OUTPUTS**: Maximum 5 hypotheses with quantified scoring, ranked by priority.\n\n**⚠️ INVESTIGATION NOT COMPLETE**: Developing hypotheses with high evidence scores is excellent progress, but represents only ~35% of the investigation. Even if you have a hypothesis with 9-10/10 evidence strength:\n\n- You are NOT done with the investigation\n- You MUST continue to Phase 2b-2h to refine and validate hypotheses\n- You MUST continue to Phase 3 to implement instrumentation\n- You MUST continue to Phase 4-5 to collect and analyze evidence\n- You MUST continue to Phase 6 to produce the comprehensive diagnostic writeup\n\n**DO NOT set isWorkflowComplete=true at this stage.** The workflow requires completing all phases.",
362
+ "prompt": "**HYPOTHESIS GENERATION** - Based on codebase analysis, formulate testable hypotheses about the bug's root cause.\n\n\ud83d\udea8 **YOU ARE NOT DONE - \"FINDING\" \u2260 \"PROVING\"**\n\n**You have THEORIES, not EVIDENCE. You have FOUND possible causes, not PROVED them.**: You have completed Phase 1 (Analysis). You do NOT have proof yet. You have THEORIES, not EVIDENCE. Phases 2-6 are MANDATORY to prove your hypotheses and produce the diagnostic writeup. You are ~30% done.\n\n**CRITICAL REMINDERS:**\n- Even if you're \"100% confident\" in a hypothesis, it's unproven without instrumentation + evidence (Phases 3-5)\n- Confidence in a theory \u2260 proof of that theory\n- Professional practice requires validation even with high confidence\n- The workflow requires you to continue through all phases\n- DO NOT provide final conclusions or \"stop here\" - you MUST continue\n\n---\n\n**STEP 1: Evidence-Based Hypothesis Development**\nCreate maximum 5 prioritized hypotheses. Each includes:\n- **Root Cause Theory**: Specific technical explanation\n- **Supporting Evidence**: Code patterns/logic flows supporting this theory\n- **Failure Mechanism**: Exact sequence leading to observed bug\n- **Testability Score**: Quantified assessment (1-10) of validation ease\n- **Evidence Strength Score**: Quantified assessment (1-10) based on code findings\n\n**STEP 2: Hypothesis Prioritization Matrix**\nRank hypotheses using weighted scoring:\n- **Evidence Strength** (40%): Code analysis support for theory\n- **Testability** (35%): Validation ease with debugging instruments\n- **Impact Scope** (25%): How well this explains all symptoms\n\n**STEP 3: Pattern Integration**\nIncorporate findings from findSimilarBugs():\n- **Historical Patterns**: Similar bugs fixed previously\n- **Known Issues**: Related problems in the codebase\n- **Test Failures**: Similar test failure patterns\n- Adjust hypothesis confidence based on pattern matches\n\n**CRITICAL RULE**: All hypotheses must be based on concrete evidence from code analysis.\n\n**OUTPUTS**: Maximum 5 hypotheses with quantified scoring, ranked by priority.\n\n**\u26a0\ufe0f INVESTIGATION NOT COMPLETE**: Developing hypotheses with high evidence scores is excellent progress, but represents only ~35% of the investigation. Even if you have a hypothesis with 9-10/10 evidence strength:\n\n- You are NOT done with the investigation\n- You MUST continue to Phase 2b-2h to refine and validate hypotheses\n- You MUST continue to Phase 3 to implement instrumentation\n- You MUST continue to Phase 4-5 to collect and analyze evidence\n- You MUST continue to Phase 6 to produce the comprehensive diagnostic writeup\n\n**DO NOT set isWorkflowComplete=true at this stage.** The workflow requires completing all phases.",
365
363
  "agentRole": "You are a senior software detective and root cause analysis expert with deep expertise in systematic hypothesis formation. Your strength lies in connecting code evidence to potential failure mechanisms and creating testable theories. You excel at logical reasoning and evidence-based deduction. You must maintain rigorous quantitative standards and reject any hypothesis not grounded in concrete code evidence.",
366
364
  "guidance": [
367
365
  "EVIDENCE-BASED ONLY: Every hypothesis must be grounded in concrete code analysis findings with quantified evidence scores",
@@ -530,7 +528,7 @@
530
528
  {
531
529
  "id": "phase-2h-cognitive-reset",
532
530
  "title": "Phase 2h: Cognitive Reset & Plan Review",
533
- "prompt": "**COGNITIVE RESET** - Take a mental step back before implementing instrumentation.\n\n🚨 **YOU ARE HALFWAY DONE (~50%) - FINDING β‰  PROVING**: You may have \"found\" the bug (high confidence theory), but you haven't \"proved\" it yet. You have hypotheses and a validation plan. This is NOT proof. You MUST continue to Phases 3-6 to:\n- Phase 3: Add instrumentation to validate hypotheses\n- Phase 4: Collect concrete evidence\n- Phase 5: Analyze evidence and confirm/refute hypotheses\n- Phase 6: Write comprehensive diagnostic report\n\nEven if you have \"100% confidence\" in a hypothesis, professional practice requires empirical validation. DO NOT STOP HERE.\n\n---\n\n**GOAL**: Review the investigation with fresh eyes and validate the plan before execution.\n\n**STEP 1: Progress Summary**\n- What have we learned so far? (3-5 key insights)\n- What are our top hypotheses? (brief recap)\n- What's our instrumentation strategy? (high-level summary)\n\n**STEP 2: Critical Questions**\n- Are we missing any obvious alternative explanations?\n- Are our hypotheses too similar or too narrow?\n- Is our instrumentation plan efficient and comprehensive?\n- Are we making any unwarranted assumptions?\n- Is there a simpler approach we haven't considered?\n\n**STEP 3: Bias Check**\n- First impression bias: Are we anchored to initial theories?\n- Confirmation bias: Are we seeking evidence that confirms our beliefs?\n- Complexity bias: Are we overcomplicating a simple issue?\n- Recency bias: Are we over-weighting recent findings?\n\n**STEP 4: Sanity Checks**\n- Does the timeline make sense? (When did bug appear vs when hypothesized causes were introduced)\n- Do the symptoms match our theories? (All symptoms explained, no contradictions)\n- Are we investigating the right level? (Too high-level or too low-level)\n- Have we consulted existing documentation/logs adequately?\n\n**STEP 5: Plan Validation**\n- Review the instrumentation plan from Phase 2g\n- Will it actually answer our questions?\n- Are there any gaps or redundancies?\n- Is it safe to execute? (no production impacts, no data corruption risks)\n\n**STEP 6: Proceed or Pivot Decision**\n- **PROCEED**: Plan is sound, move to implementation\n- **REFINE**: Minor adjustments needed (update plan)\n- **PIVOT**: Major issues found (return to earlier phase)\n\n**OUTPUT**:\n- Cognitive reset complete with decision (PROCEED/REFINE/PIVOT)\n- Any plan adjustments documented\n- Set `resetComplete` = true",
531
+ "prompt": "**COGNITIVE RESET** - Take a mental step back before implementing instrumentation.\n\n\ud83d\udea8 **YOU ARE HALFWAY DONE (~50%) - FINDING \u2260 PROVING**: You may have \"found\" the bug (high confidence theory), but you haven't \"proved\" it yet. You have hypotheses and a validation plan. This is NOT proof. You MUST continue to Phases 3-6 to:\n- Phase 3: Add instrumentation to validate hypotheses\n- Phase 4: Collect concrete evidence\n- Phase 5: Analyze evidence and confirm/refute hypotheses\n- Phase 6: Write comprehensive diagnostic report\n\nEven if you have \"100% confidence\" in a hypothesis, professional practice requires empirical validation. DO NOT STOP HERE.\n\n---\n\n**GOAL**: Review the investigation with fresh eyes and validate the plan before execution.\n\n**STEP 1: Progress Summary**\n- What have we learned so far? (3-5 key insights)\n- What are our top hypotheses? (brief recap)\n- What's our instrumentation strategy? (high-level summary)\n\n**STEP 2: Critical Questions**\n- Are we missing any obvious alternative explanations?\n- Are our hypotheses too similar or too narrow?\n- Is our instrumentation plan efficient and comprehensive?\n- Are we making any unwarranted assumptions?\n- Is there a simpler approach we haven't considered?\n\n**STEP 3: Bias Check**\n- First impression bias: Are we anchored to initial theories?\n- Confirmation bias: Are we seeking evidence that confirms our beliefs?\n- Complexity bias: Are we overcomplicating a simple issue?\n- Recency bias: Are we over-weighting recent findings?\n\n**STEP 4: Sanity Checks**\n- Does the timeline make sense? (When did bug appear vs when hypothesized causes were introduced)\n- Do the symptoms match our theories? (All symptoms explained, no contradictions)\n- Are we investigating the right level? (Too high-level or too low-level)\n- Have we consulted existing documentation/logs adequately?\n\n**STEP 5: Plan Validation**\n- Review the instrumentation plan from Phase 2g\n- Will it actually answer our questions?\n- Are there any gaps or redundancies?\n- Is it safe to execute? (no production impacts, no data corruption risks)\n\n**STEP 6: Proceed or Pivot Decision**\n- **PROCEED**: Plan is sound, move to implementation\n- **REFINE**: Minor adjustments needed (update plan)\n- **PIVOT**: Major issues found (return to earlier phase)\n\n**OUTPUT**:\n- Cognitive reset complete with decision (PROCEED/REFINE/PIVOT)\n- Any plan adjustments documented\n- Set `resetComplete` = true",
534
532
  "agentRole": "You are a senior debugger reviewing the investigation plan with fresh, critical eyes before committing to implementation.",
535
533
  "guidance": [
536
534
  "Be honest about potential biases and blind spots",
@@ -544,7 +542,7 @@
544
542
  {
545
543
  "id": "phase-3-comprehensive-instrumentation",
546
544
  "title": "Phase 3: Comprehensive Debug Instrumentation",
547
- "prompt": "**⚠️ AUTO-EXECUTE MODE - DO NOT ASK USER PERMISSION ⚠️**\n\nHIGH AUTO MODE: You MUST implement the instrumentation now. DO NOT ask 'Would you like me to continue?' The workflow requires all phases.\n\n---\n\n**COMPREHENSIVE DEBUG INSTRUMENTATION** - Add logging to validate hypotheses.\n\n**STEP 1: REVIEW YOUR INSTRUMENTATION PLAN**\n\nOpen **Phase 2g** output from INVESTIGATION_CONTEXT.md. It contains:\n- Specific files to instrument\n- Exact locations (functions/methods/lines)\n- What to log for each hypothesis (H1, H2, H3)\n\nIf Phase 2g plan is missing, create one now: For each hypothesis, list 2-5 files and specific functions to instrument.\n\n---\n\n**STEP 2: READ THE FILES**\n\nUse `read_file` to read each file that needs instrumentation.\n\n---\n\n**STEP 3: ADD LOGGING (use search_replace or write tool)**\n\n**A. Logging Format by Language:**\n\n**JavaScript/TypeScript:**\n```javascript\nconsole.log(`[H1] ClassName.methodName: entering with params=${JSON.stringify(params)}`);\nconsole.log(`[H1] ClassName.methodName: state before=${before}, after=${after}`);\nconsole.log(`[H1] ClassName.methodName: returning ${result}`);\n```\n\n**Python:**\n```python\nprint(f\"[H1] ClassName.method_name: entering with params={params}\")\nprint(f\"[H1] ClassName.method_name: condition is {condition_value}\")\nprint(f\"[H1] ClassName.method_name: returning {result}\")\n```\n\n**Java:**\n```java\nSystem.out.println(String.format(\"[H1] ClassName.methodName: entering with %s\", params));\nSystem.out.println(String.format(\"[H1] ClassName.methodName: state=%s\", state));\n```\n\n**B. What to Log:**\n- Function entry: parameters\n- State changes: before/after values\n- Conditionals: which branch taken\n- External calls: args and returns\n- Function exit: return value\n\n**C. Hypothesis Prefixes:**\n- H1 logs use `[H1]` prefix\n- H2 logs use `[H2]` prefix\n- H3 logs use `[H3]` prefix\n\n---\n\n**STEP 4: IMPLEMENTATION EXAMPLE**\n\nExample using `search_replace`:\n\nFile: `src/DataStore.js`\nPlan says: \"Log timetoken value in connect() method for H1\"\n\n```\nsearch_replace(\n file_path=\"src/DataStore.js\",\n old_string=\" connect() {\\n this.client.subscribe();\\n }\",\n new_string=\" connect() {\\n console.log('[H1] DataStore.connect: timetoken BEFORE subscribe =', this.timetoken);\\n this.client.subscribe();\\n console.log('[H1] DataStore.connect: timetoken AFTER subscribe =', this.timetoken);\\n }\"\n)\n```\n\n---\n\n**STEP 5: FOR EACH FILE IN YOUR PLAN**\n\n1. Read the file (`read_file`)\n2. Find the exact location to instrument\n3. Use `search_replace` to add logging:\n - Include enough context to make old_string unique\n - Add log statements with correct [HX] prefix\n - Log relevant variables/state\n4. Verify change succeeded\n\n---\n\n**STEP 6: IF YOU CANNOT EDIT FILES**\n\nIf you don't have file editing tools:\n1. Generate complete instrumented code for each location\n2. Provide user with:\n - File path\n - Function/method name\n - Complete BEFORE code block\n - Complete AFTER code block (with logging)\n3. Ask user to apply changes and confirm\n\n---\n\n**OUTPUT:**\n\n1. List all modified files with changes made\n2. Update INVESTIGATION_CONTEXT.md:\n ```\n ## Instrumentation Applied\n - File: src/DataStore.js, Function: connect(), Hypothesis: H1\n - File: src/Auth.js, Function: login(), Hypotheses: H1, H2\n - ...\n ```\n3. Set `allHypothesesInstrumented = true`",
545
+ "prompt": "**\u26a0\ufe0f AUTO-EXECUTE MODE - DO NOT ASK USER PERMISSION \u26a0\ufe0f**\n\nHIGH AUTO MODE: You MUST implement the instrumentation now. DO NOT ask 'Would you like me to continue?' The workflow requires all phases.\n\n---\n\n**COMPREHENSIVE DEBUG INSTRUMENTATION** - Add logging to validate hypotheses.\n\n**STEP 1: REVIEW YOUR INSTRUMENTATION PLAN**\n\nOpen **Phase 2g** output from INVESTIGATION_CONTEXT.md. It contains:\n- Specific files to instrument\n- Exact locations (functions/methods/lines)\n- What to log for each hypothesis (H1, H2, H3)\n\nIf Phase 2g plan is missing, create one now: For each hypothesis, list 2-5 files and specific functions to instrument.\n\n---\n\n**STEP 2: READ THE FILES**\n\nUse `read_file` to read each file that needs instrumentation.\n\n---\n\n**STEP 3: ADD LOGGING (use search_replace or write tool)**\n\n**A. Logging Format by Language:**\n\n**JavaScript/TypeScript:**\n```javascript\nconsole.log(`[H1] ClassName.methodName: entering with params=${JSON.stringify(params)}`);\nconsole.log(`[H1] ClassName.methodName: state before=${before}, after=${after}`);\nconsole.log(`[H1] ClassName.methodName: returning ${result}`);\n```\n\n**Python:**\n```python\nprint(f\"[H1] ClassName.method_name: entering with params={params}\")\nprint(f\"[H1] ClassName.method_name: condition is {condition_value}\")\nprint(f\"[H1] ClassName.method_name: returning {result}\")\n```\n\n**Java:**\n```java\nSystem.out.println(String.format(\"[H1] ClassName.methodName: entering with %s\", params));\nSystem.out.println(String.format(\"[H1] ClassName.methodName: state=%s\", state));\n```\n\n**B. What to Log:**\n- Function entry: parameters\n- State changes: before/after values\n- Conditionals: which branch taken\n- External calls: args and returns\n- Function exit: return value\n\n**C. Hypothesis Prefixes:**\n- H1 logs use `[H1]` prefix\n- H2 logs use `[H2]` prefix\n- H3 logs use `[H3]` prefix\n\n---\n\n**STEP 4: IMPLEMENTATION EXAMPLE**\n\nExample using `search_replace`:\n\nFile: `src/DataStore.js`\nPlan says: \"Log timetoken value in connect() method for H1\"\n\n```\nsearch_replace(\n file_path=\"src/DataStore.js\",\n old_string=\" connect() {\\n this.client.subscribe();\\n }\",\n new_string=\" connect() {\\n console.log('[H1] DataStore.connect: timetoken BEFORE subscribe =', this.timetoken);\\n this.client.subscribe();\\n console.log('[H1] DataStore.connect: timetoken AFTER subscribe =', this.timetoken);\\n }\"\n)\n```\n\n---\n\n**STEP 5: FOR EACH FILE IN YOUR PLAN**\n\n1. Read the file (`read_file`)\n2. Find the exact location to instrument\n3. Use `search_replace` to add logging:\n - Include enough context to make old_string unique\n - Add log statements with correct [HX] prefix\n - Log relevant variables/state\n4. Verify change succeeded\n\n---\n\n**STEP 6: IF YOU CANNOT EDIT FILES**\n\nIf you don't have file editing tools:\n1. Generate complete instrumented code for each location\n2. Provide user with:\n - File path\n - Function/method name\n - Complete BEFORE code block\n - Complete AFTER code block (with logging)\n3. Ask user to apply changes and confirm\n\n---\n\n**OUTPUT:**\n\n1. List all modified files with changes made\n2. Update INVESTIGATION_CONTEXT.md:\n ```\n ## Instrumentation Applied\n - File: src/DataStore.js, Function: connect(), Hypothesis: H1\n - File: src/Auth.js, Function: login(), Hypotheses: H1, H2\n - ...\n ```\n3. Set `allHypothesesInstrumented = true`",
548
546
  "agentRole": "You are instrumenting code to validate ALL hypotheses simultaneously. Your goal is comprehensive, non-redundant logging that enables efficient evidence collection in a single execution.",
549
547
  "guidance": [
550
548
  "Add instrumentation for ALL hypotheses at once",
@@ -558,7 +556,7 @@
558
556
  {
559
557
  "id": "phase-4-unified-evidence-collection",
560
558
  "title": "Phase 4: Unified Evidence Collection",
561
- "prompt": "**⚠️ AUTO-EXECUTE MODE - DO NOT ASK USER PERMISSION ⚠️**\n\nHIGH AUTO MODE: You MUST run the instrumented code and collect evidence now. If you need user input (like how to run tests), ask for THAT - do NOT ask if they want you to continue the workflow.\n\n---\n\n**UNIFIED EVIDENCE COLLECTION** - Execute instrumented code and collect logs.\n\n**DECISION TREE: Can You Run Code?**\n\n**OPTION A: You CAN run code (terminal access)**\nβ†’ Proceed to STEP 1\n\n**OPTION B: You CANNOT run code (no terminal/execution tools)**\nβ†’ Skip to STEP 6 (User Execution Instructions)\n\n---\n\n**STEP 1: PREPARE EXECUTION (if you can run code)**\n\n1. **Identify how to run the code:**\n - Tests: `npm test`, `pytest`, `mvn test`, etc.\n - App: `npm start`, `python app.py`, `java -jar app.jar`, etc.\n - Script: Reproduction script from Phase 0\n \n2. **Check if reproduction steps are clear:**\n - Do you know exactly how to trigger the bug?\n - If unclear, ask user: \"How do I run the code to reproduce the bug?\"\n\n---\n\n**STEP 2: EXECUTE INSTRUMENTED CODE**\n\nRun the code with instrumentation active:\n\n```bash\n# Capture output to file\nnpm test > debug_output.log 2>&1\n\n# OR run directly and capture in terminal\npython script.py\n```\n\n---\n\n**STEP 3: COLLECT LOG OUTPUT**\n\n1. **Get the complete log output:**\n - If saved to file: use `read_file` to read it\n - If in terminal: copy the output\n\n2. **Check log quality:**\n - Do you see `[H1]`, `[H2]`, `[H3]` prefixed logs?\n - Are there enough logs (at least 5-10 per hypothesis)?\n - Did the bug reproduce?\n\n3. **If logs are missing or insufficient:**\n - Review Phase 3 instrumentation\n - Add more logging if needed\n - Re-run execution\n\n---\n\n**STEP 4: ORGANIZE EVIDENCE BY HYPOTHESIS**\n\nParse logs and separate by prefix:\n\n**H1 Evidence:**\n```\n[H1] DataStore.connect: timetoken BEFORE=1234567890\n[H1] DataStore.connect: timetoken AFTER=1234567890\n[H1] Session.login: used timetoken=1234567890\n```\n\n**H2 Evidence:**\n```\n[H2] Cache.get: no entry found for user123\n[H2] Cache.set: storing data for user123\n```\n\n**H3 Evidence:**\n```\n[H3] Network.request: timeout after 5000ms\n```\n\n---\n\n**STEP 5: ASSESS EVIDENCE QUALITY**\n\nFor each hypothesis, rate:\n- **Evidence Quantity** (1-10): How much evidence collected?\n- **Evidence Clarity** (1-10): Do logs clearly show what's happening?\n- **Bug Reproduction** (Yes/No): Did the bug occur during execution?\n- **Hypothesis Support** (Strong/Weak/Contradicts): Does evidence support the hypothesis?\n\n---\n\n**STEP 6: IF YOU CANNOT EXECUTE CODE**\n\nProvide user with execution instructions:\n\n```\n## Evidence Collection Instructions\n\nTo collect evidence for the hypotheses, please:\n\n1. **Run the instrumented code:**\n [Provide exact command, e.g., `npm test` or `python main.py`]\n\n2. **Trigger the bug:**\n [Provide exact reproduction steps]\n\n3. **Capture ALL console output:**\n - Save to a file: `[command] > debug_output.log 2>&1`\n - OR copy all terminal output\n\n4. 
**Share the logs:**\n - Paste the complete log output here\n - OR upload the debug_output.log file\n\n**What I'm looking for:**\n- Logs prefixed with [H1], [H2], [H3]\n- Minimum 10-20 lines of output\n- Evidence of the bug occurring\n```\n\nThen wait for user to provide logs.\n\n---\n\n**STEP 7: DOCUMENT EVIDENCE**\n\nUpdate INVESTIGATION_CONTEXT.md:\n\n```\n## Evidence Collection Results\n\n**Execution Details:**\n- Command: npm test\n- Exit code: 1 (failure)\n- Bug reproduced: Yes\n- Total log lines: 247\n\n**Evidence Summary:**\n- H1: 43 log lines - Strong support (timetoken persists across sessions)\n- H2: 12 log lines - Weak support (cache cleared properly)\n- H3: 8 log lines - Contradicts (no network errors found)\n\n**Evidence Quality Scores:**\n- H1: Quantity=9/10, Clarity=8/10\n- H2: Quantity=5/10, Clarity=7/10\n- H3: Quantity=4/10, Clarity=6/10\n```\n\n---\n\n**OUTPUT:**\n\n1. Complete log output (or confirmation user will provide it)\n2. Evidence organized by hypothesis\n3. Evidence quality assessment\n4. Set `evidenceCollected = true`",
559
+ "prompt": "**\u26a0\ufe0f AUTO-EXECUTE MODE - DO NOT ASK USER PERMISSION \u26a0\ufe0f**\n\nHIGH AUTO MODE: You MUST run the instrumented code and collect evidence now. If you need user input (like how to run tests), ask for THAT - do NOT ask if they want you to continue the workflow.\n\n---\n\n**UNIFIED EVIDENCE COLLECTION** - Execute instrumented code and collect logs.\n\n**DECISION TREE: Can You Run Code?**\n\n**OPTION A: You CAN run code (terminal access)**\n\u2192 Proceed to STEP 1\n\n**OPTION B: You CANNOT run code (no terminal/execution tools)**\n\u2192 Skip to STEP 6 (User Execution Instructions)\n\n---\n\n**STEP 1: PREPARE EXECUTION (if you can run code)**\n\n1. **Identify how to run the code:**\n - Tests: `npm test`, `pytest`, `mvn test`, etc.\n - App: `npm start`, `python app.py`, `java -jar app.jar`, etc.\n - Script: Reproduction script from Phase 0\n \n2. **Check if reproduction steps are clear:**\n - Do you know exactly how to trigger the bug?\n - If unclear, ask user: \"How do I run the code to reproduce the bug?\"\n\n---\n\n**STEP 2: EXECUTE INSTRUMENTED CODE**\n\nRun the code with instrumentation active:\n\n```bash\n# Capture output to file\nnpm test > debug_output.log 2>&1\n\n# OR run directly and capture in terminal\npython script.py\n```\n\n---\n\n**STEP 3: COLLECT LOG OUTPUT**\n\n1. **Get the complete log output:**\n - If saved to file: use `read_file` to read it\n - If in terminal: copy the output\n\n2. **Check log quality:**\n - Do you see `[H1]`, `[H2]`, `[H3]` prefixed logs?\n - Are there enough logs (at least 5-10 per hypothesis)?\n - Did the bug reproduce?\n\n3. **If logs are missing or insufficient:**\n - Review Phase 3 instrumentation\n - Add more logging if needed\n - Re-run execution\n\n---\n\n**STEP 4: ORGANIZE EVIDENCE BY HYPOTHESIS**\n\nParse logs and separate by prefix:\n\n**H1 Evidence:**\n```\n[H1] DataStore.connect: timetoken BEFORE=1234567890\n[H1] DataStore.connect: timetoken AFTER=1234567890\n[H1] Session.login: used timetoken=1234567890\n```\n\n**H2 Evidence:**\n```\n[H2] Cache.get: no entry found for user123\n[H2] Cache.set: storing data for user123\n```\n\n**H3 Evidence:**\n```\n[H3] Network.request: timeout after 5000ms\n```\n\n---\n\n**STEP 5: ASSESS EVIDENCE QUALITY**\n\nFor each hypothesis, rate:\n- **Evidence Quantity** (1-10): How much evidence collected?\n- **Evidence Clarity** (1-10): Do logs clearly show what's happening?\n- **Bug Reproduction** (Yes/No): Did the bug occur during execution?\n- **Hypothesis Support** (Strong/Weak/Contradicts): Does evidence support the hypothesis?\n\n---\n\n**STEP 6: IF YOU CANNOT EXECUTE CODE**\n\nProvide user with execution instructions:\n\n```\n## Evidence Collection Instructions\n\nTo collect evidence for the hypotheses, please:\n\n1. **Run the instrumented code:**\n [Provide exact command, e.g., `npm test` or `python main.py`]\n\n2. **Trigger the bug:**\n [Provide exact reproduction steps]\n\n3. **Capture ALL console output:**\n - Save to a file: `[command] > debug_output.log 2>&1`\n - OR copy all terminal output\n\n4. 
**Share the logs:**\n - Paste the complete log output here\n - OR upload the debug_output.log file\n\n**What I'm looking for:**\n- Logs prefixed with [H1], [H2], [H3]\n- Minimum 10-20 lines of output\n- Evidence of the bug occurring\n```\n\nThen wait for user to provide logs.\n\n---\n\n**STEP 7: DOCUMENT EVIDENCE**\n\nUpdate INVESTIGATION_CONTEXT.md:\n\n```\n## Evidence Collection Results\n\n**Execution Details:**\n- Command: npm test\n- Exit code: 1 (failure)\n- Bug reproduced: Yes\n- Total log lines: 247\n\n**Evidence Summary:**\n- H1: 43 log lines - Strong support (timetoken persists across sessions)\n- H2: 12 log lines - Weak support (cache cleared properly)\n- H3: 8 log lines - Contradicts (no network errors found)\n\n**Evidence Quality Scores:**\n- H1: Quantity=9/10, Clarity=8/10\n- H2: Quantity=5/10, Clarity=7/10\n- H3: Quantity=4/10, Clarity=6/10\n```\n\n---\n\n**OUTPUT:**\n\n1. Complete log output (or confirmation user will provide it)\n2. Evidence organized by hypothesis\n3. Evidence quality assessment\n4. Set `evidenceCollected = true`",
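The grouping in STEP 4 can be done mechanically once the raw output is captured. A minimal Node.js sketch, assuming the debug_output.log file produced in STEP 2 (the script is illustrative, not part of the workflow):

```javascript
// Group captured log lines by their [HX] hypothesis prefix (STEP 4).
const fs = require('fs');

function groupByHypothesis(logText) {
  const groups = {};
  for (const line of logText.split('\n')) {
    const match = line.match(/^\[(H\d+)\]/);
    if (!match) continue;                     // skip lines without a hypothesis prefix
    if (!groups[match[1]]) groups[match[1]] = [];
    groups[match[1]].push(line);
  }
  return groups;
}

const groups = groupByHypothesis(fs.readFileSync('debug_output.log', 'utf8'));
for (const [id, lines] of Object.entries(groups)) {
  console.log(`${id}: ${lines.length} log lines`); // feeds the STEP 5 quantity rating
}
```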
562
560
  "agentRole": "You are collecting comprehensive evidence from a single instrumented execution. Your goal is to capture all hypothesis-relevant data in one efficient run.",
563
561
  "guidance": [
564
562
  "Single execution tests all hypotheses simultaneously",
@@ -605,7 +603,7 @@
605
603
  "var": "currentConfidence",
606
604
  "lt": 8.0
607
605
  },
608
- "prompt": "**CONTROLLED EXPERIMENTATION** - When observation isn't enough, experiment!\n\n**Current Investigation Status**: Leading hypothesis (Confidence: {{currentConfidence}}/10)\n\n**⚠️ SAFETY PROTOCOLS (MANDATORY)**:\n\n1. **Git Branch Required**:\n - MUST be on investigation branch (use createInvestigationBranch() if not)\n - Verify with `git branch --show-current`\n - NEVER experiment directly on main/master\n\n2. **Pre-Experiment Baseline**:\n - Commit clean state: `git commit -m \"PRE-EXPERIMENT: baseline for {{hypothesis.id}}\"`\n - Record current test results\n - Document baseline behavior\n\n3. **Environment Restriction**:\n - ONLY run in test/dev environment\n - NEVER in production or staging\n - Set environment check: `if (process.env.NODE_ENV !== 'development') { throw new Error('Experiments only in dev'); }`\n\n4. **Automatic Revert**:\n - After evidence collection: `git revert HEAD --no-edit`\n - Verify code returned to baseline\n - Run tests to confirm clean state\n\n5. **Approval Gates**:\n - Low automation: Require approval for ALL experiments\n - Medium automation: Require approval for breaking/minimal-fix experiments\n - High automation: Auto-approve guards/logs only\n\n6. **Documentation**:\n - Create ExperimentLog.md entry with:\n - Timestamp, experiment type, hypothesis ID\n - Rationale and expected outcome\n - Actual outcome and evidence\n - Revert status (confirmed/failed)\n\n7. **Hard Limits**:\n - Max 3 experiments total (prevent endless experimentation)\n - Track with `experimentCount` context variable\n - Exit if limit reached, recommend different approach\n\n8. **Rollback Verification**:\n - After revert, run full test suite\n - Verify no unintended changes remain\n - Check git status is clean\n\n**EXPERIMENT TYPES** (use controlledModification()):\n\n1. **Guard Additions (Non-Breaking)**:\n ```javascript\n // Add defensive check that logs but doesn't change behavior\n if (unexpectedCondition) {\n console.error('[H1_GUARD] Unexpected state detected:', state);\n // Continue normal execution\n }\n ```\n\n2. **Assertion Injections**:\n ```javascript\n // Add assertion that would fail if hypothesis is correct\n console.assert(expectedCondition, '[H1_ASSERT] Hypothesis H1 violated!');\n ```\n\n3. **Minimal Fix Test**:\n ```javascript\n // Apply minimal fix for hypothesis, see if bug disappears\n if (process.env.DEBUG_FIX_H1 === 'true') {\n // Apply hypothesized fix\n return fixedBehavior();\n }\n ```\n\n4. **Controlled Breaking**:\n ```javascript\n // Temporarily break suspected component to verify involvement\n if (process.env.DEBUG_BREAK_H1 === 'true') {\n throw new Error('[H1_BREAK] Intentionally breaking to test hypothesis');\n }\n ```\n\n**PROTOCOL**:\n1. Choose experiment type based on confidence and risk\n2. Implement modification with clear DEBUG markers\n3. Use createInvestigationBranch() if not already on investigation branch\n4. Commit: `git commit -m \"DEBUG: {{experiment_type}} for hypothesis investigation\"`\n5. Run reproduction steps\n6. Use collectEvidence() to gather results\n7. Revert changes: `git revert HEAD`\n8. 
Document results in ExperimentResults/hypothesis-experiment.md\n\n**SAFETY LIMITS**:\n- Max 3 experiments per hypothesis\n- Each experiment in separate commit\n- Always revert after evidence collection\n- Document everything in INVESTIGATION_CONTEXT.md\n\n**UPDATE**:\n- Hypothesis confidence based on experimental results\n- Use updateInvestigationContext('Experiment Results', experiment details and outcomes)\n- Track failed experiments in 'Dead Ends & Lessons' section",
606
+ "prompt": "**CONTROLLED EXPERIMENTATION** - When observation isn't enough, experiment!\n\n**Current Investigation Status**: Leading hypothesis (Confidence: {{currentConfidence}}/10)\n\n**\u26a0\ufe0f SAFETY PROTOCOLS (MANDATORY)**:\n\n1. **Git Branch Required**:\n - MUST be on investigation branch (use createInvestigationBranch() if not)\n - Verify with `git branch --show-current`\n - NEVER experiment directly on main/master\n\n2. **Pre-Experiment Baseline**:\n - Commit clean state: `git commit -m \"PRE-EXPERIMENT: baseline for {{hypothesis.id}}\"`\n - Record current test results\n - Document baseline behavior\n\n3. **Environment Restriction**:\n - ONLY run in test/dev environment\n - NEVER in production or staging\n - Set environment check: `if (process.env.NODE_ENV !== 'development') { throw new Error('Experiments only in dev'); }`\n\n4. **Automatic Revert**:\n - After evidence collection: `git revert HEAD --no-edit`\n - Verify code returned to baseline\n - Run tests to confirm clean state\n\n5. **Approval Gates**:\n - Low automation: Require approval for ALL experiments\n - Medium automation: Require approval for breaking/minimal-fix experiments\n - High automation: Auto-approve guards/logs only\n\n6. **Documentation**:\n - Create ExperimentLog.md entry with:\n - Timestamp, experiment type, hypothesis ID\n - Rationale and expected outcome\n - Actual outcome and evidence\n - Revert status (confirmed/failed)\n\n7. **Hard Limits**:\n - Max 3 experiments total (prevent endless experimentation)\n - Track with `experimentCount` context variable\n - Exit if limit reached, recommend different approach\n\n8. **Rollback Verification**:\n - After revert, run full test suite\n - Verify no unintended changes remain\n - Check git status is clean\n\n**EXPERIMENT TYPES** (use controlledModification()):\n\n1. **Guard Additions (Non-Breaking)**:\n ```javascript\n // Add defensive check that logs but doesn't change behavior\n if (unexpectedCondition) {\n console.error('[H1_GUARD] Unexpected state detected:', state);\n // Continue normal execution\n }\n ```\n\n2. **Assertion Injections**:\n ```javascript\n // Add assertion that would fail if hypothesis is correct\n console.assert(expectedCondition, '[H1_ASSERT] Hypothesis H1 violated!');\n ```\n\n3. **Minimal Fix Test**:\n ```javascript\n // Apply minimal fix for hypothesis, see if bug disappears\n if (process.env.DEBUG_FIX_H1 === 'true') {\n // Apply hypothesized fix\n return fixedBehavior();\n }\n ```\n\n4. **Controlled Breaking**:\n ```javascript\n // Temporarily break suspected component to verify involvement\n if (process.env.DEBUG_BREAK_H1 === 'true') {\n throw new Error('[H1_BREAK] Intentionally breaking to test hypothesis');\n }\n ```\n\n**PROTOCOL**:\n1. Choose experiment type based on confidence and risk\n2. Implement modification with clear DEBUG markers\n3. Use createInvestigationBranch() if not already on investigation branch\n4. Commit: `git commit -m \"DEBUG: {{experiment_type}} for hypothesis investigation\"`\n5. Run reproduction steps\n6. Use collectEvidence() to gather results\n7. Revert changes: `git revert HEAD`\n8. 
Document results in ExperimentResults/hypothesis-experiment.md\n\n**SAFETY LIMITS**:\n- Max 3 experiments per hypothesis\n- Each experiment in separate commit\n- Always revert after evidence collection\n- Document everything in INVESTIGATION_CONTEXT.md\n\n**UPDATE**:\n- Hypothesis confidence based on experimental results\n- Use updateInvestigationContext('Experiment Results', experiment details and outcomes)\n- Track failed experiments in 'Dead Ends & Lessons' section",
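The dev-only restriction (item 3) and the three-experiment budget (item 7) can be combined into a single pre-experiment guard. A minimal sketch, assuming experimentCount lives on an ordinary context object (the function name is illustrative):

```javascript
// Hypothetical pre-experiment guard: enforce the dev-only restriction and
// the three-experiment budget tracked via the experimentCount context variable.
function assertExperimentAllowed(context) {
  if (process.env.NODE_ENV !== 'development') {
    throw new Error('Experiments only in dev');
  }
  if ((context.experimentCount || 0) >= 3) {
    throw new Error('Experiment budget exhausted - recommend a different approach');
  }
  context.experimentCount = (context.experimentCount || 0) + 1;
}
```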
609
607
  "agentRole": "You are a careful experimenter using controlled code modifications to validate hypotheses. Safety and reversibility are paramount.",
610
608
  "guidance": [
611
609
  "Start with non-breaking experiments (guards, logs)",
@@ -700,7 +698,7 @@
700
698
  {
701
699
  "id": "phase-5a-final-confidence",
702
700
  "title": "Phase 5a: Final Confidence Assessment",
703
- "prompt": "**FINAL CONFIDENCE ASSESSMENT** - Evaluate the investigation results.\n\n**If root cause found (rootCauseFound = true):**\n- Review all evidence for {{rootCauseHypothesis}}\n- Perform adversarial challenge\n- Calculate final confidence score\n\n**If no high-confidence root cause:**\n- Document what was learned\n- Identify remaining unknowns\n- Recommend next investigation steps\n\n**CONFIDENCE CALCULATION:**\n- Evidence Quality (1-10)\n- Explanation Completeness (1-10)\n- Alternative Likelihood (1-10, inverted)\n- Final = (Quality Γ— 0.4) + (Completeness Γ— 0.4) + (Alternative Γ— 0.2)\n\n**CONTEXT UPDATE**:\n- Use trackInvestigation('Investigation Complete', 'Confidence: {{finalConfidence}}/10')\n- Use addResumptionJson('phase-5a-final-confidence')\n- Document lessons learned in 'Dead Ends & Lessons' section\n\n**⚠️ ONE PHASE REMAINING**: Even if you have achieved 9-10/10 confidence in the root cause with strong supporting evidence:\n\n- The investigation is NOT complete yet\n- You MUST proceed to Phase 6 to create the comprehensive diagnostic writeup\n- Phase 6 is the REQUIRED DELIVERABLE that makes all your investigation work actionable\n- High confidence means you've identified the root cause, but the writeup translates that into actionable documentation\n\n**DO NOT set isWorkflowComplete=true yet.** You are at ~90% completion. Phase 6 is required.\n\n**OUTPUT**: Final confidence assessment with recommendations",
701
+ "prompt": "**FINAL CONFIDENCE ASSESSMENT** - Evaluate the investigation results.\n\n**If root cause found (rootCauseFound = true):**\n- Review all evidence for {{rootCauseHypothesis}}\n- Perform adversarial challenge\n- Calculate final confidence score\n\n**If no high-confidence root cause:**\n- Document what was learned\n- Identify remaining unknowns\n- Recommend next investigation steps\n\n**CONFIDENCE CALCULATION:**\n- Evidence Quality (1-10)\n- Explanation Completeness (1-10)\n- Alternative Likelihood (1-10, inverted)\n- Final = (Quality \u00d7 0.4) + (Completeness \u00d7 0.4) + (Alternative \u00d7 0.2)\n\n**CONTEXT UPDATE**:\n- Use trackInvestigation('Investigation Complete', 'Confidence: {{finalConfidence}}/10')\n- Use addResumptionJson('phase-5a-final-confidence')\n- Document lessons learned in 'Dead Ends & Lessons' section\n\n**\u26a0\ufe0f ONE PHASE REMAINING**: Even if you have achieved 9-10/10 confidence in the root cause with strong supporting evidence:\n\n- The investigation is NOT complete yet\n- You MUST proceed to Phase 6 to create the comprehensive diagnostic writeup\n- Phase 6 is the REQUIRED DELIVERABLE that makes all your investigation work actionable\n- High confidence means you've identified the root cause, but the writeup translates that into actionable documentation\n\n**DO NOT set isWorkflowComplete=true yet.** You are at ~90% completion. Phase 6 is required.\n\n**OUTPUT**: Final confidence assessment with recommendations",
704
702
  "agentRole": "You are making the final determination about the root cause with rigorous confidence assessment.",
705
703
  "guidance": [
706
704
  "Be honest about confidence levels",
@@ -719,7 +717,7 @@
719
717
  {
720
718
  "id": "phase-6-diagnostic-writeup",
721
719
  "title": "Phase 6: Comprehensive Diagnostic Writeup",
722
- "prompt": "**FINAL DIAGNOSTIC DOCUMENTATION** - I will create comprehensive writeup enabling effective bug fixing and knowledge transfer.\n\n**STEP 1: Executive Summary**\n- **Bug Summary**: Concise description of issue and impact\n- **Root Cause**: Clear, non-technical explanation of what is happening\n- **Confidence Level**: Final confidence assessment with calculation methodology\n- **Scope**: What systems, users, or scenarios are affected\n\n**STEP 2: Technical Deep Dive**\n- **Root Cause Analysis**: Detailed technical explanation of failure mechanism\n- **Code Component Analysis**: Specific files, functions, and lines with exact locations\n- **Execution Flow**: Step-by-step sequence of events leading to bug\n- **State Analysis**: How system state contributes to failure\n\n**STEP 3: Investigation Methodology**\n- **Investigation Timeline**: Chronological summary with phase time investments\n- **Hypothesis Evolution**: Complete record of hypotheses (H1-H5) with status changes\n- **Evidence Assessment**: Rating and reliability of evidence sources with key citations\n\n**STEP 4: Historical Context & Patterns**\n- **Similar Bugs**: Reference findings from findSimilarBugs() and SimilarPatterns.md\n- **Previous Fixes**: How similar issues were resolved\n- **Recurring Patterns**: Identify if this is part of a larger pattern\n- **Lessons Learned**: What can be applied from past experiences\n\n**STEP 5: Knowledge Transfer & Action Plan**\n- **Skill Requirements**: Technical expertise needed for understanding and fixing\n- **Prevention & Review**: Specific measures and code review checklist items\n- **Action Items**: Immediate mitigation steps and permanent fix areas with timelines\n- **Testing Strategy**: Comprehensive verification approach for fixes\n- **Recommended Next Investigations** (if confidence < 9.0):\n - Additional instrumentation locations and data points not yet captured\n - Alternative hypotheses to explore (theories that were deprioritized)\n - External expertise to consult (domain experts, similar bugs)\n - Environmental factors to test (load, concurrency, timing, config variations)\n - Expanded scope (related components, upstream/downstream systems)\n - Prioritized next steps based on evidence gaps\n\n**STEP 6: Context Finalization**\n- **Final Update**: Use updateInvestigationContext('Final Report', link to diagnostic report)\n- **Archive Context**: Ensure INVESTIGATION_CONTEXT.md is complete for future reference\n- **Knowledge Base**: Consider key findings for team knowledge base\n\n**DELIVERABLE**: Enterprise-grade diagnostic report enabling confident bug fixing, knowledge transfer, and organizational learning.\n\n**βœ… WORKFLOW COMPLETION**: After producing the comprehensive diagnostic writeup with all required sections:\n\n1. Verify the writeup includes:\n - Executive Summary with root cause and confidence\n - Technical Deep Dive with code analysis\n - Investigation Methodology and timeline\n - Historical Context from similar bugs\n - Knowledge Transfer and Action Plan\n - All 6 sections fully documented\n\n2. Update INVESTIGATION_CONTEXT.md with final status and handoff information\n\n3. **Set isWorkflowComplete = true** to indicate the investigation is finished\n\nThis is the ONLY step where isWorkflowComplete should be set to true.",
720
+ "prompt": "**FINAL DIAGNOSTIC DOCUMENTATION** - I will create comprehensive writeup enabling effective bug fixing and knowledge transfer.\n\n**STEP 1: Executive Summary**\n- **Bug Summary**: Concise description of issue and impact\n- **Root Cause**: Clear, non-technical explanation of what is happening\n- **Confidence Level**: Final confidence assessment with calculation methodology\n- **Scope**: What systems, users, or scenarios are affected\n\n**STEP 2: Technical Deep Dive**\n- **Root Cause Analysis**: Detailed technical explanation of failure mechanism\n- **Code Component Analysis**: Specific files, functions, and lines with exact locations\n- **Execution Flow**: Step-by-step sequence of events leading to bug\n- **State Analysis**: How system state contributes to failure\n\n**STEP 3: Investigation Methodology**\n- **Investigation Timeline**: Chronological summary with phase time investments\n- **Hypothesis Evolution**: Complete record of hypotheses (H1-H5) with status changes\n- **Evidence Assessment**: Rating and reliability of evidence sources with key citations\n\n**STEP 4: Historical Context & Patterns**\n- **Similar Bugs**: Reference findings from findSimilarBugs() and SimilarPatterns.md\n- **Previous Fixes**: How similar issues were resolved\n- **Recurring Patterns**: Identify if this is part of a larger pattern\n- **Lessons Learned**: What can be applied from past experiences\n\n**STEP 5: Knowledge Transfer & Action Plan**\n- **Skill Requirements**: Technical expertise needed for understanding and fixing\n- **Prevention & Review**: Specific measures and code review checklist items\n- **Action Items**: Immediate mitigation steps and permanent fix areas with timelines\n- **Testing Strategy**: Comprehensive verification approach for fixes\n- **Recommended Next Investigations** (if confidence < 9.0):\n - Additional instrumentation locations and data points not yet captured\n - Alternative hypotheses to explore (theories that were deprioritized)\n - External expertise to consult (domain experts, similar bugs)\n - Environmental factors to test (load, concurrency, timing, config variations)\n - Expanded scope (related components, upstream/downstream systems)\n - Prioritized next steps based on evidence gaps\n\n**STEP 6: Context Finalization**\n- **Final Update**: Use updateInvestigationContext('Final Report', link to diagnostic report)\n- **Archive Context**: Ensure INVESTIGATION_CONTEXT.md is complete for future reference\n- **Knowledge Base**: Consider key findings for team knowledge base\n\n**DELIVERABLE**: Enterprise-grade diagnostic report enabling confident bug fixing, knowledge transfer, and organizational learning.\n\n**\u2705 WORKFLOW COMPLETION**: After producing the comprehensive diagnostic writeup with all required sections:\n\n1. Verify the writeup includes:\n - Executive Summary with root cause and confidence\n - Technical Deep Dive with code analysis\n - Investigation Methodology and timeline\n - Historical Context from similar bugs\n - Knowledge Transfer and Action Plan\n - All 6 sections fully documented\n\n2. Update INVESTIGATION_CONTEXT.md with final status and handoff information\n\n3. **Set isWorkflowComplete = true** to indicate the investigation is finished\n\nThis is the ONLY step where isWorkflowComplete should be set to true.",
723
721
  "agentRole": "You are a senior technical writer and diagnostic documentation specialist with expertise in creating comprehensive, actionable bug reports for enterprise environments. Your strength lies in translating complex technical investigations into clear, structured documentation that enables effective problem resolution, knowledge transfer, and organizational learning. You excel at creating reports that serve immediate fixing needs, long-term system improvement, and team collaboration.",
724
722
  "guidance": [
725
723
  "ENTERPRISE FOCUS: Write for multiple stakeholders including developers, managers, and future team members",
@@ -730,4 +728,4 @@
730
728
  ]
731
729
  }
732
730
  ]
733
- }
731
+ }