testchimp-runner-core 0.0.35 → 0.0.36

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (71)
  1. package/package.json +6 -1
  2. package/plandocs/BEFORE_AFTER_VERIFICATION.md +0 -148
  3. package/plandocs/COORDINATE_MODE_DIAGNOSIS.md +0 -144
  4. package/plandocs/CREDIT_CALLBACK_ARCHITECTURE.md +0 -253
  5. package/plandocs/HUMAN_LIKE_IMPROVEMENTS.md +0 -642
  6. package/plandocs/IMPLEMENTATION_STATUS.md +0 -108
  7. package/plandocs/INTEGRATION_COMPLETE.md +0 -322
  8. package/plandocs/MULTI_AGENT_ARCHITECTURE_REVIEW.md +0 -844
  9. package/plandocs/ORCHESTRATOR_MVP_SUMMARY.md +0 -539
  10. package/plandocs/PHASE1_ABSTRACTION_COMPLETE.md +0 -241
  11. package/plandocs/PHASE1_FINAL_STATUS.md +0 -210
  12. package/plandocs/PHASE_1_COMPLETE.md +0 -165
  13. package/plandocs/PHASE_1_SUMMARY.md +0 -184
  14. package/plandocs/PLANNING_SESSION_SUMMARY.md +0 -372
  15. package/plandocs/PROMPT_OPTIMIZATION_ANALYSIS.md +0 -120
  16. package/plandocs/PROMPT_SANITY_CHECK.md +0 -120
  17. package/plandocs/SCRIPT_CLEANUP_FEATURE.md +0 -201
  18. package/plandocs/SCRIPT_GENERATION_ARCHITECTURE.md +0 -364
  19. package/plandocs/SELECTOR_IMPROVEMENTS.md +0 -139
  20. package/plandocs/SESSION_SUMMARY_v0.0.33.md +0 -151
  21. package/plandocs/TROUBLESHOOTING_SESSION.md +0 -72
  22. package/plandocs/VISION_DIAGNOSTICS_IMPROVEMENTS.md +0 -336
  23. package/plandocs/VISUAL_AGENT_EVOLUTION_PLAN.md +0 -396
  24. package/plandocs/WHATS_NEW_v0.0.33.md +0 -183
  25. package/plandocs/exploratory-mode-support-v2.plan.md +0 -953
  26. package/plandocs/exploratory-mode-support.plan.md +0 -928
  27. package/plandocs/journey-id-tracking-addendum.md +0 -227
  28. package/releasenotes/RELEASE_0.0.26.md +0 -165
  29. package/releasenotes/RELEASE_0.0.27.md +0 -236
  30. package/releasenotes/RELEASE_0.0.28.md +0 -286
  31. package/src/auth-config.ts +0 -84
  32. package/src/credit-usage-service.ts +0 -188
  33. package/src/env-loader.ts +0 -103
  34. package/src/execution-service.ts +0 -996
  35. package/src/file-handler.ts +0 -104
  36. package/src/index.ts +0 -432
  37. package/src/llm-facade.ts +0 -821
  38. package/src/llm-provider.ts +0 -53
  39. package/src/model-constants.ts +0 -35
  40. package/src/orchestrator/decision-parser.ts +0 -139
  41. package/src/orchestrator/index.ts +0 -58
  42. package/src/orchestrator/orchestrator-agent.ts +0 -1282
  43. package/src/orchestrator/orchestrator-prompts.ts +0 -786
  44. package/src/orchestrator/page-som-handler.ts +0 -1565
  45. package/src/orchestrator/som-types.ts +0 -188
  46. package/src/orchestrator/tool-registry.ts +0 -184
  47. package/src/orchestrator/tools/check-page-ready.ts +0 -75
  48. package/src/orchestrator/tools/extract-data.ts +0 -92
  49. package/src/orchestrator/tools/index.ts +0 -15
  50. package/src/orchestrator/tools/inspect-page.ts +0 -42
  51. package/src/orchestrator/tools/recall-history.ts +0 -72
  52. package/src/orchestrator/tools/refresh-som-markers.ts +0 -69
  53. package/src/orchestrator/tools/take-screenshot.ts +0 -128
  54. package/src/orchestrator/tools/verify-action-result.ts +0 -159
  55. package/src/orchestrator/tools/view-previous-screenshot.ts +0 -103
  56. package/src/orchestrator/types.ts +0 -291
  57. package/src/playwright-mcp-service.ts +0 -224
  58. package/src/progress-reporter.ts +0 -144
  59. package/src/prompts.ts +0 -842
  60. package/src/providers/backend-proxy-llm-provider.ts +0 -91
  61. package/src/providers/local-llm-provider.ts +0 -38
  62. package/src/scenario-service.ts +0 -252
  63. package/src/scenario-worker-class.ts +0 -1110
  64. package/src/script-utils.ts +0 -203
  65. package/src/types.ts +0 -239
  66. package/src/utils/browser-utils.ts +0 -348
  67. package/src/utils/coordinate-converter.ts +0 -162
  68. package/src/utils/page-info-retry.ts +0 -65
  69. package/src/utils/page-info-utils.ts +0 -285
  70. package/testchimp-runner-core-0.0.35.tgz +0 -0
  71. package/tsconfig.json +0 -19
@@ -1,139 +0,0 @@
- # Selector Preference Improvements
-
- ## Summary
- Updated the orchestrator agent to prefer user-friendly, semantic Playwright selectors over auto-generated IDs, following Playwright's official best practices.
-
- ## Problem
- The agent was generating commands like:
- ```typescript
- await page.fill('#«r3»-form-item', 'alice@example.com')
- await page.fill('#«r4»-form-item', 'TestPass123')
- ```
-
- These auto-generated IDs (especially ones with unicode characters like `«r3»`) are:
- - Not user-friendly or readable
- - Brittle: they break when component instances change
- - Not maintainable
- - Not aligned with Playwright best practices
-
- ## Solution
- Implemented a comprehensive selector preference strategy across three key files:
-
- ### 1. Orchestrator Agent Prompt (`orchestrator-agent.ts`)
-
- Added new section **"5b. SELECTOR PREFERENCE"** with explicit guidance:
-
- **Preferred selectors (in order):**
- 1. `page.getByRole('role', {name: 'text'})` - Accessible, semantic, resilient
- 2. `page.getByLabel('label text')` - Great for form inputs
- 3. `page.getByPlaceholder('placeholder')` - Good for inputs without labels
- 4. `page.getByText('visible text')` - Clear and readable
- 5. `page.getByTestId('test-id')` - Stable if available
-
- **Avoid (last resort only):**
- - CSS selectors with auto-generated IDs: `#r3-form-item`, `#«r3»-form-item`
- - CSS selectors with unicode characters
- - Complex CSS paths
-
- **Examples provided:**
- ```typescript
- // ❌ BAD
- await page.fill('#«r3»-form-item', 'alice@example.com')
-
- // ✅ GOOD
- await page.getByLabel('Email').fill('alice@example.com')
- await page.getByRole('textbox', {name: 'Email'}).fill('alice@example.com')
- await page.getByPlaceholder('Enter your email').fill('alice@example.com')
- ```
-
- ### 2. Page Info Utils (`page-info-utils.ts`)
-
- **Reordered selector generation priority:**
-
- Before:
- 1. ID selector (e.g., `#«r3»-form-item`)
- 2. data-testid
- 3. getByRole
-
- After:
- 1. getByLabel (for inputs with associated labels)
- 2. getByRole with name (semantic and accessible)
- 3. getByPlaceholder (for inputs with placeholders)
- 4. getByText (visible text fallback)
- 5. getByTestId (stable, explicit)
- 6. ID selector - ONLY if stable (filters out auto-generated IDs containing unicode or patterns like `rc_`, `:r[0-9]+:`, `__`)
-
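The stability filter in item 6 could be sketched roughly as below. This is an illustrative sketch, not the actual `page-info-utils.ts` code: the function names are invented, and only the pattern list (unicode, `rc_`, `:r[0-9]+:`, `__`) comes from the description above.

```typescript
// Sketch of the stable-ID check described above. Names and exact regexes are
// assumptions; only the pattern categories come from the doc: unicode
// guillemets («r3»), rc_ prefixes, React useId-style ids (:r3:), __ infixes.
const UNSTABLE_ID_PATTERNS: RegExp[] = [
  /[^\x00-\x7F]/, // any non-ASCII character, e.g. «r3»
  /^rc_/,         // rc_-prefixed auto-generated ids
  /^:r\d+:?/,     // React useId-style ids like :r3:
  /__/,           // double-underscore infixes
];

function isStableId(id: string): boolean {
  return id.length > 0 && !UNSTABLE_ID_PATTERNS.some((p) => p.test(id));
}

// Only fall back to an ID selector when the id looks hand-written:
function idSelectorOrNull(id: string): string | null {
  return isStableId(id) ? `#${id}` : null;
}

console.log(isStableId('login-form'));     // hand-written id, kept
console.log(isStableId('«r3»-form-item')); // auto-generated, rejected
```

With a filter like this, `#«r3»-form-item` never wins over `getByLabel('Email')`, because it is excluded before the priority ordering is even consulted.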
- **Enhanced display:**
- - Shows the best selector first, with up to 2 alternatives
- - Example: `getByLabel('Email') (or: getByRole('textbox', {name: 'Email'}), getByPlaceholder('Enter your email'))`
-
- ### 3. Screenshot Tool (`take-screenshot.ts`)
-
- **Updated vision analysis prompts:**
- - System prompt now emphasizes: "ALWAYS prioritize semantic selectors (getByRole, getByLabel, getByText) over CSS selectors with auto-generated IDs"
- - Analysis task explicitly instructs: "Recommend SEMANTIC SELECTORS FIRST" and "AVOID auto-generated IDs with unicode"
-
- ### 4. Updated EXPERIENCES Section
-
- Enhanced examples to capture semantic selector patterns:
- ```
- ✅ GOOD - App-specific patterns:
- - "Login form fields accessible via getByLabel: 'Email' and 'Password'"
- - "Submit buttons consistently use role=button with text matching action"
- - "Input fields have clear placeholders - prefer getByPlaceholder over IDs"
-
- ❌ BAD:
- - Noting auto-generated IDs like #«r3»-form-item (these are unreliable)
- ```
-
- ## Benefits
-
- 1. **More Maintainable**: Semantic selectors are resilient to UI changes
- 2. **Self-Documenting**: Code reads like natural language
- 3. **Accessibility**: Ensures UI elements are properly accessible
- 4. **Best Practices**: Follows Playwright's official recommendations
- 5. **User-Friendly**: Generated tests are more readable and easier to understand
-
- ## Expected Output
-
- After these changes, the agent will generate:
-
- ```typescript
- // Login form example
- await page.getByLabel('Email').fill('alice@example.com')
- await page.getByLabel('Password').fill('TestPass123')
- await page.getByRole('button', {name: 'Sign In'}).click()
-
- // Navigation example
- await page.getByRole('link', {name: 'Dashboard'}).click()
-
- // Form with placeholders
- await page.getByPlaceholder('Search...').fill('test query')
- ```
-
- Instead of:
-
- ```typescript
- // Old style with auto-generated IDs
- await page.fill('#«r3»-form-item', 'alice@example.com')
- await page.fill('#«r4»-form-item', 'TestPass123')
- await page.click('#«r5»-button')
- ```
-
- ## Testing
-
- The changes are backward compatible: the agent will still use ID selectors as a last resort when semantic selectors are not available. The build completed successfully with no linter errors.
-
- ## Files Modified
-
- 1. `/src/orchestrator/orchestrator-agent.ts` - System prompt enhancements
- 2. `/src/utils/page-info-utils.ts` - Selector priority reordering
- 3. `/src/orchestrator/tools/take-screenshot.ts` - Vision analysis updates
-
- ## Next Steps
-
- Test the changes with real scenarios to verify that:
- 1. Generated commands use semantic selectors when available
- 2. Tests remain stable and maintainable
- 3. Fallback to ID selectors works when semantic options aren't available
-
@@ -1,151 +0,0 @@
- # Runner-Core v0.0.33 - Session Summary
-
- ## Date: October 15, 2025
-
- ## Major Accomplishments:
-
- ### 1. ✅ **Coordinate Fallback System** (Phase 1 Complete)
- - Percentage-based coordinates (0-100%, 3 decimal precision)
- - Activates after 3 selector failures
- - 2 coordinate attempts before giving up
- - Resolution-independent positioning
-
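The percentage scheme above could be converted to pixels roughly as follows. This is a sketch under stated assumptions: the interface and function names are invented for illustration, and the actual `coordinate-converter.ts` implementation is not shown in this diff.

```typescript
// Sketch of resolution-independent coordinates as described above: the model
// emits 0-100% values (3-decimal precision) and the runner converts them to
// pixels for whatever viewport is active. Names are illustrative assumptions.
interface PercentPoint {
  xPct: number; // 0..100
  yPct: number; // 0..100
}

function toPixels(p: PercentPoint, viewport: { width: number; height: number }) {
  // Clamp so a slightly out-of-range model output never clicks off-screen.
  const clamp = (v: number) => Math.min(100, Math.max(0, v));
  return {
    x: Math.round((clamp(p.xPct) / 100) * viewport.width),
    y: Math.round((clamp(p.yPct) / 100) * viewport.height),
  };
}

// The same percentage point resolves to different pixels per viewport:
const point: PercentPoint = { xPct: 50.125, yPct: 25.0 };
console.log(toPixels(point, { width: 1280, height: 720 }));
console.log(toPixels(point, { width: 1920, height: 1080 }));
```

This is what makes the fallback resolution-independent: the LLM never sees raw pixel coordinates, only percentages of the visible page.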
- ### 2. ✅ **Note to Future Self** (Tactical Memory)
- - Free-form notes persist across iterations AND steps
- - Enables strategic planning across agent decisions
- - Helps maintain context: "Tried X, will try Y next"
-
- ### 3. ✅ **Visual Verification Tool** (NEW)
- - `verify_action_result` - Before/after screenshot comparison
- - Agent-callable (the agent decides when to use it)
- - JPEG at 60% quality (85-90% smaller than PNG)
- - Multi-image LLM interface support
-
- ### 4. ✅ **Critical Bug Fixes**
- - **Coordinate mode never activated**: Changed the forced-stuck threshold from >= 3 to >= 5 failures
- - **Missing required fields**: Made the parser flexible (accepts reasoning OR statusReasoning)
- - **Navigation timeouts**: Added 30s timeout guidance for page.goto()
- - **Strict mode violations**: Added scoping guidance (locator('#parent').getByText())
-
- ### 5. ✅ **Prompt Optimizations**
- - **59% reduction**: 17,573 chars → 7,287 chars in the system prompt
- - **Cache-optimized**: Static content first, dynamic last
- - **Cost savings**: ~40% overall with model tiering
- - **Focused on cognition**: Removed bloat, kept decision-making guidance
-
- ### 6. ✅ **Model Optimization**
- - **gpt-5-mini**: Complex tasks (4 operations)
-   - Command generation
-   - Goal completion checks
-   - Repair suggestions
-   - Agent orchestration
- - **gpt-4o-mini**: Simple tasks (7 operations)
-   - Scenario breakdown
-   - Screenshot need assessment
-   - Repair confidence
-   - Test name generation
-   - Hashtag generation
-   - Script parsing
-   - Final script merging
- - **Est. 25-30% cost reduction**
-
- ### 7. ✅ **Code Cleanup**
- - Removed V1 SmartTestRunnerCore (V2 is stable)
- - Removed backup files (.bak, .tmp)
- - Consolidated types into V2
- - Removed PeopleHR-specific examples from prompts
-
- ### 8. ✅ **Enhanced Logging**
- - Prompt length metrics (chars + estimated tokens)
- - Full LLM response on parsing errors
- - Field presence diagnostics
- - Retry logging for 500 errors
-
- ### 9. ✅ **Retry Logic**
- - Automatic retry for OpenAI 500 errors
- - Exponential backoff (1s, 2s, 4s)
- - Up to 3 attempts before failing
-
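The retry behavior listed above can be sketched like this. It is a sketch only: `withRetry` and its parameters are assumed names, the delays are parameterized for clarity, and the package's actual retry code (which presumably inspects the HTTP status rather than retrying every error) is not shown in this diff.

```typescript
// Sketch of the retry policy described above: up to 3 attempts with
// exponential backoff (1s, 2s, 4s by default). A real implementation would
// additionally check that the error is a transient 5xx before retrying.
const sleep = (ms: number) => new Promise<void>((res) => setTimeout(res, ms));

async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        // With the default base delay: 1s after attempt 1, 2s, then 4s.
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  throw lastError; // all attempts exhausted
}
```

Usage would look like `withRetry(() => callOpenAI(request))`, where `callOpenAI` stands in for whatever LLM call is being protected.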
- ### 10. ✅ **Headed Mode for Local Testing**
- - All browser instances switched from headless: true → headless: false for local dev
- - Visual debugging enabled
-
- ## Files Modified:
-
- ### Runner-Core:
- 1. `src/orchestrator/orchestrator-agent.ts` - Main agent logic
- 2. `src/orchestrator/types.ts` - NoteToFutureSelf, CoordinateAction
- 3. `src/utils/coordinate-converter.ts` - NEW - Coordinate to Playwright conversion
- 4. `src/orchestrator/tools/verify-action-result.ts` - NEW - Visual verification tool
- 5. `src/llm-provider.ts` - Added LabeledImage, multi-image support
- 6. `src/llm-facade.ts` - Model optimization
- 7. `src/model-constants.ts` - Added DEFAULT_SIMPLER_MODEL
- 8. `src/scenario-worker-class.ts` - Tool registration
- 9. `src/orchestrator/index.ts` - Exports
- 10. `src/orchestrator/tools/index.ts` - Tool exports
-
- ### Scriptservice:
- 1. `providers/scriptservice-llm-provider.ts` - Multi-image handling, retry logic
- 2. `smart-test-runner-core-v2.ts` - Type definitions, V1 removal
- 3. `smart-test-execution-handler.ts` - V1 removal
- 4. `workers/test-based-explorer.ts` - V1 removal
- 5. `script-generation-handlers.ts` - Headed mode
- 6. `script-generation/script-generation-service.ts` - Headed mode
- 7. `smart-test-execution-handler.ts` - Headed mode
-
- ### Documentation:
- 1. `WHATS_NEW_v0.0.33.md`
- 2. `PHASE_1_COMPLETE.md`
- 3. `PHASE_1_SUMMARY.md`
- 4. `IMPLEMENTATION_STATUS.md`
- 5. `VISUAL_AGENT_EVOLUTION_PLAN.md`
- 6. `PROMPT_SANITY_CHECK.md`
- 7. `PROMPT_OPTIMIZATION_ANALYSIS.md`
- 8. `COORDINATE_MODE_DIAGNOSIS.md`
- 9. `BEFORE_AFTER_VERIFICATION.md`
- 10. `TROUBLESHOOTING_SESSION.md`
-
- ## Live Test Status:
-
- **Job**: `71b88c60-52f5-4343-aef8-c44ebb07f3e9`
- **Status**: Running (check browser + logs)
- **Watch For**:
- - Step 5 (Employee Information) - Previously problematic
- - Coordinate mode activation
- - verify_action_result tool usage
- - Overall completion
-
- ## Key Metrics:
-
- **Cost Optimization:**
- - Prompt size: 59% reduction
- - Model tiering: 7/11 tasks on the cheaper model
- - JPEG compression: 85-90% smaller screenshots
- - **Total savings: ~40% cost reduction**
-
- ## Next Steps After Test Completes:
-
- 1. Check if Step 5 completes successfully
- 2. Verify coordinate mode activated if needed
- 3. Check if the verify_action_result tool was used
- 4. Analyze any remaining failures
- 5. Iterate on prompts/logic based on results
-
- ## Known Issues to Monitor:
-
- 1. **Step 5 False Positive**: Clicking the menu item vs navigating to the page
- 2. **Coordinate Loop**: Agent not knowing when coordinate clicks succeed
- 3. **Vision verification usage**: Will the agent call it proactively?
-
- ## Success Criteria:
-
- ✅ All 7 steps complete
- ✅ Coordinate fallback used when selectors fail
- ✅ Visual verification validates goal achievement
- ✅ No infinite loops or stuck states
- ✅ Generated script is accurate
-
- ---
-
- **Check your browser window and /tmp/scriptservice-test.log for live execution!**
-
@@ -1,72 +0,0 @@
- # Troubleshooting Session: All Modules Icon Click Failure
-
- ## Objective:
- Understand why the orchestrator agent gets stuck on "Click on the All Modules menu item (top menu icon)" while manual Playwright MCP navigation succeeded.
-
- ## What I Need to See:
-
- ### 1. Full Agent Logs for the Failing Step
- Please provide the complete logs showing:
- - What iteration attempts were made (iteration 1, 2, 3...)
- - What selectors the agent tried each time
- - What errors it encountered
- - What the DOM snapshot showed
- - Whether it took screenshots
- - What notes it left to its future self
-
- ### 2. The DOM Context It Saw
- - Interactive elements list
- - ARIA tree snapshot
- - Whether the hamburger icon was visible in the list
-
- ## What Worked (My Manual MCP Session):
-
- From an earlier successful navigation:
- ```
- ✅ Step 1: Clicked hamburger menu
-    Selector: #sidebar-toggle > span > svg
-
- ✅ Step 2: Clicked "Core HR"
-    Selector: getByText('Core HR')
-
- ✅ Step 3: Clicked "Employee Information"
-    Selector: getByText('Employee Information')
- ```
-
- ## Hypotheses for Why the Agent Fails:
-
- ### Possible Issue 1: Wrong Selector Strategy
- - Agent might be trying: `getByText('All Modules')` (strict mode violation)
- - Or: `#MenuToggle` (wrong ID)
- - Or: `#sidebar-toggle-menu` (doesn't exist)
- - Instead of: `#sidebar-toggle > span > svg` (the actual selector)
-
- ### Possible Issue 2: Missing Icon Detection
- - Hamburger icons are often SVG elements without accessible text
- - Agent might not recognize this pattern
- - Prompt doesn't explicitly guide on icon/SVG selector strategy
-
- ### Possible Issue 3: DOM List Incomplete
- - Interactive elements might not include the SVG icon
- - If the icon isn't in the list, the agent won't know it exists
- - Need to check if `getEnhancedPageInfo` captures SVG icons
-
- ### Possible Issue 4: Ambiguous Text
- - "All Modules" might appear in multiple places (menu button + modal title)
- - Agent tries `getByText('All Modules')` → strict mode violation
- - Should scope to a parent: `locator('#sidebar-toggle').getByText('All Modules')`
-
- ## Next Steps:
-
- 1. **Get full logs** from your failing run
- 2. **Compare** what the agent saw vs what I saw
- 3. **Identify** the gap (prompt, DOM extraction, or selector logic)
- 4. **Plan fixes**:
-    - Prompt improvements (icon/SVG guidance)
-    - DOM extraction improvements (ensure icons are captured)
-    - Selector strategy improvements (parent scoping for icons)
-    - Example-based learning (hamburger menu pattern)
-
- ## Waiting For:
- Please paste the full logs from the failing step showing all iteration attempts and what the agent tried.
-
@@ -1,336 +0,0 @@
- # Vision-Based Diagnostic Analysis Improvements
-
- ## Overview
- Enhanced the test automation system to use screenshot-based vision analysis as a **diagnostic tool** to understand WHY failures occur and to recommend better strategies based on visual reality vs DOM assumptions.
-
- **Vision diagnostics are now utilized in BOTH:**
- 1. **Script Generation** (`ScenarioWorker`) - When generating commands from scenarios
- 2. **Script Repair/Execution** (`ExecutionService`) - When repairing failing scripts with AI
-
- ## Problem Statement (From Logs)
- 1. **Hallucinated Verification Targets**: Goal completion was creating fake sub-goals like "verify message was sent" and looking for non-existent elements like "Message sent" text
- 2. **No Screenshot Analysis**: Vision mode was never triggering because the LLM assessment always said "NO"
-
- ## Solutions Implemented
-
- ### 1. Fixed Hallucinated Verification Targets ✅
-
- #### A. Goal Completion Check (`prompts.ts` lines 55-129)
- **Changes:**
- - System prompt: "Be EXTREMELY CONSERVATIVE - mark goals complete when PRIMARY action succeeds. DO NOT invent verification steps"
- - Added golden rule: "If goal is SIMPLE ACTION and action SUCCEEDED, mark COMPLETE immediately"
- - Explicit examples showing action goals complete after the action succeeds (no verification needed)
-
- **Example:**
- ```
- Before: "Click send button" → click succeeds → "INCOMPLETE - need to verify message sent"
- After:  "Click send button" → click succeeds → "COMPLETE ✅"
- ```
-
- #### B. Command Generation Anti-Hallucination (`prompts.ts` lines 215-225)
- **New Section: "NEVER Hallucinate Verification Elements"**
- - ONLY verify elements that ACTUALLY EXIST in the current DOM state
- - Don't invent success messages, confirmations, or "sent" indicators
- - Use alternative verification: state changes, network, page load
- - Stop trying to find elements after previous attempts have failed
-
- #### C. Smart Hallucination Detection (`llm-facade.ts` lines 573-599)
- **Automatic Detection:**
- - Detects when the LLM repeatedly tries to find non-existent elements (2+ attempts)
- - Shows a "⚠️ HALLUCINATION ALERT" with guidance to stop and use alternatives
- - Analyzes command patterns (getByText, toBeVisible) + errors (not found, timeout)
-
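The detection heuristic described above might look roughly like this. The names and thresholds are illustrative assumptions drawn from the bullet points (2+ attempts, getByText/toBeVisible command patterns, not-found/timeout errors); the real `llm-facade.ts` code is not shown in this diff.

```typescript
// Sketch of the hallucination heuristic described above: flag a likely
// invented verification target when 2+ attempts paired a verification-style
// command with a not-found/timeout error. Names are assumptions.
interface Attempt {
  command: string; // the Playwright command the LLM generated
  error: string;   // the failure message it produced
}

const VERIFY_PATTERN = /getByText|toBeVisible/;
const NOT_FOUND_PATTERN = /not found|timeout/i;

function looksLikeHallucination(attempts: Attempt[]): boolean {
  const suspicious = attempts.filter(
    (a) => VERIFY_PATTERN.test(a.command) && NOT_FOUND_PATTERN.test(a.error),
  );
  return suspicious.length >= 2;
}
```

When this returns true, the prompt gets the "⚠️ HALLUCINATION ALERT" guidance telling the model to stop hunting for the element and switch to state-based verification.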
- ### 2. Enabled Screenshot-Based Diagnostic Analysis 📸
-
- #### A. Conservative Screenshot Trigger (`scenario-worker-class.ts` line 193)
- ```typescript
- // ONLY on attempt 2 (3rd attempt, after exactly 2 failures)
- // This is the ONLY chance to use vision - must be absolutely necessary
- if (attempt === 2 && lastError && !usedVisionMode)
- ```
-
- **Attempt Flow:**
- - Attempt 0: First try (DOM-based)
- - Attempt 1: Second try (DOM-based)
- - **Attempt 2: Third try - Vision assessment (if truly needed)**
- - Attempt 3: Fourth try (final, DOM-based)
-
- #### B. Screenshot Assessment (`prompts.ts` lines 131-191)
- **System Prompt:** "Be LIBERAL in recommending screenshots - visual context provides diagnostic insights DOM cannot"
-
- **Diagnostic Value Framework:**
- 1. Identify WHY attempts failed (DOM assumptions vs visual reality)
- 2. Detect hallucinated elements (see if expected elements exist visually)
- 3. Recommend better strategies (verification based on what's visible)
- 4. Find visual blockers (overlays, modals, layout issues)
- 5. Correct wrong assumptions (actual state vs expected state)
-
- **Hallucination Detection in Assessment:**
- - If "not found/timeout" errors + verification attempts → HIGH chance of hallucination
- - The screenshot reveals whether the elements actually exist
-
- #### C. Enhanced Vision Mode as Diagnostic Tool (`prompts.ts` lines 359-462)
-
- **System Prompt:** "Analyze screenshot to understand WHY previous attempts failed and recommend BEST next step based on visual reality vs DOM assumptions"
-
- **Three Primary Tasks:**
- 1. **DEBUG WHY PREVIOUS ATTEMPTS FAILED** - Compare what you assumed vs what you SEE
- 2. **IDENTIFY THE ROOT CAUSE** - Why did DOM-based approaches fail?
- 3. **RECOMMEND SMARTER ALTERNATIVES** - What should we do instead?
-
- **Critical Questions Framework:**
-
- 🔍 **Visual vs DOM Reality Check:**
- - What do you SEE vs what the DOM suggested?
- - Are the elements you tried to find VISIBLE on screen?
- - Are there visual indicators you missed?
- - What's ACTUALLY present vs ASSUMED?
-
- 🚫 **Why Did Previous Attempts Fail?**
- - Looking for elements that DON'T EXIST? (hallucination)
- - Wrong selectors for elements that ARE visible?
- - Elements blocked/covered by overlays?
- - Elements in a different state than expected?
- - Page in a different state than assumed?
-
- 💡 **What Should You Do Differently?**
- - If verification elements don't exist: use state-based verification
- - If elements are blocked: remove the blocker first
- - If wrong selector: use visual clues for better selectors
- - If goal achieved: confirm with a simple wait/check
-
- **Verification Strategy Based on Visual Reality:**
-
- ```
- IF you see success indicators in the screenshot:
-   ✅ Use them: await expect(page.getByText('visible-text-here')).toBeVisible()
-
- IF you DON'T see success indicators but the action appears complete:
-   ✅ Use state-based checks:
-      - Button disabled check
-      - Form cleared/reset check
-      - URL changed check
-      - Network response verification
-      - Load state verification
-
- IF previous attempts looked for non-existent elements:
-   ❌ STOP looking for them
-   ✅ Switch to alternative verification
- ```
-
- **Comparison Analysis Template:**
- ```
- "Based on screenshot analysis:
- - DOM suggested: [what you thought was there]
- - Visual reality: [what you actually see]
- - Why previous failed: [root cause analysis]
- - Better approach: [what to do instead]"
- ```
-
- #### D. Enhanced Diagnostic Logging (`llm-facade.ts` lines 313-328)
-
- **Vision Mode Response Now Includes:**
- - `visualInsights`: What the screenshot revealed that the DOM couldn't tell you
- - `failureRootCause`: Why previous DOM-based attempts failed
- - `recommendedAlternative`: A better verification/interaction strategy
- - `reasoning`: Full diagnostic analysis
-
- **Console Logging:**
- ```
- 📸 Visual insights: [what was discovered]
- 🔍 Root cause analysis: [why it failed]
- 💡 Recommended alternative: [what to do instead]
- 🧠 Vision-based reasoning: [full analysis]
- ```
-
- ## Code Modularization
-
- Vision diagnostics are properly modularized and reused across both flows:
-
- 1. **`LLMFacade` Methods** (shared by both flows):
-    - `assessScreenshotNeed()` - Assesses whether a screenshot would help
-    - `getVisionDiagnostics()` - Supervisor analyzes the screenshot (gpt-4o)
-    - `generateCommandFromSupervisorInstructions()` - Worker generates the command (gpt-4.1-mini)
-
- 2. **Two-Step Supervisor Pattern** (used consistently):
- ```typescript
- // Step 1: Supervisor analyzes screenshot (expensive vision model)
- const supervisorDiagnostics = await llmFacade.getVisionDiagnostics(...)
-
- // Step 2: Use insights for action
- // - In generation: Generate command from instructions
- // - In repair: Enhance failure context with visual insights
- ```
-
- 3. **No Code Duplication**:
-    - Vision prompts defined once in `prompts.ts`
-    - LLM calls handled by `LLMFacade`
-    - Both flows use the same vision assessment logic
-
- ## Behavioral Changes
-
- ### Script Generation - Before:
- ```
- 1. Click send button → succeeds
- 2. Goal check → "INCOMPLETE - verify message sent" (hallucinated)
- 3. Try: await expect(page.getByText('Message sent')).toBeVisible()
- 4. Fail: Element not found
- 5. Try: await page.getByRole('status').waitFor()
- 6. Fail: Timeout
- 7. Repeat until max failures
- 8. Screenshot assessment → "NO, not needed"
- 9. Vision mode → never triggers
- ```
-
- ### Script Generation - After:
- ```
- 1. Click send button → succeeds
- 2. Goal check → "COMPLETE ✅" (action succeeded, no verification needed)
-
- OR if verification is truly needed but the approach is wrong:
- 3. After 2 failures → Screenshot assessment
- 4. Assessment → "YES - Diagnostic value: Visual reveals if success indicators exist"
- 5. Vision mode activates with screenshot
- 6. Diagnostic analysis:
-    - "DOM suggested: Success message would appear"
-    - "Visual reality: No success message visible, button is now disabled"
-    - "Root cause: Hallucinated verification element that doesn't exist"
-    - "Better approach: Check button disabled state instead"
- 7. Generates: await expect(page.locator('button[name="Send"]')).toBeDisabled()
- 8. Success ✅
- ```
-
- ### Script Repair - Before:
- ```
- 1. Script step fails: await page.click('button[name="Submit"]')
- 2. AI Repair attempt 1: Try different selector → Fails
- 3. AI Repair attempt 2: Try with wait → Fails
- 4. AI Repair attempt 3: Try another approach → Fails
- 5. Give up - repair failed
- ```
-
- ### Script Repair - After:
- ```
- 1. Script step fails: await page.click('button[name="Submit"]')
- 2. AI Repair attempt 1: Try different selector → Fails
- 3. AI Repair attempt 2: Try with wait → Fails
- 4. After 2 failures → Screenshot assessment
- 5. Assessment → "YES - Visual analysis can reveal why repairs are failing"
- 6. Vision supervisor analyzes screenshot:
-    - "Visual analysis: Button is disabled and grayed out"
-    - "Root cause: Trying to click a disabled button"
-    - "Recommended approach: Wait for button to become enabled first"
- 7. AI Repair attempt 3 with vision insights: Insert wait step before click
- 8. Success ✅ (vision-aided repair)
- ```
-
- ## Cost Considerations
-
- **Vision Mode (GPT-4o) Triggers in Both Flows:**
-
- 1. **Script Generation** (`ScenarioWorker`):
-    - After 2 failed command generation attempts (3rd attempt)
-    - Only when the LLM assessment says "YES"
-    - Prevents endless DOM-based retries on wrong assumptions
-
- 2. **Script Repair** (`ExecutionService`):
-    - After 2 failed repair attempts (3rd/final repair attempt)
-    - Only when the LLM assessment says "YES"
-    - Prevents wasted repair cycles on a wrong strategy
-
- **What You Get (Both Flows):**
- - Root cause analysis of failures
- - Visual vs DOM reality comparison
- - Recommended alternative strategies
- - Smart fallback only when truly needed
- - Prevents wasted attempts on wrong approaches
-
- ## Files Modified
-
- 1. **`/runner-core/src/prompts.ts`**
-    - Enhanced goal completion check (lines 55-129)
-    - Enhanced screenshot assessment (lines 131-184)
-    - Enhanced vision mode prompt with diagnostics (lines 359-462)
-    - Added anti-hallucination sections
-
- 2. **`/runner-core/src/llm-facade.ts`**
-    - Added hallucination detection (lines 573-599)
-    - Enhanced vision logging (lines 313-328)
-    - Vision methods: `assessScreenshotNeed`, `getVisionDiagnostics`, `generateCommandFromSupervisorInstructions`
-
- 3. **`/runner-core/src/scenario-worker-class.ts`** (Script Generation)
-    - Vision fallback on attempt 2 (line 197)
-    - Two-step supervisor pattern for command generation (lines 197-241)
-
- 4. **`/runner-core/src/execution-service.ts`** (Script Repair/Execution)
-    - Vision fallback on the final repair attempt after 2 failures (lines 500-594)
-    - Vision-enhanced context for repair suggestions
-    - Same supervisor pattern for consistency
-
- ## Testing Recommendations
-
- ### Script Generation Tests
-
- 1. **Test with a messaging flow:**
-    - Verify "send message" steps complete without hallucinated verification
-    - Check whether vision mode provides diagnostic insights when command generation fails
-
- 2. **Test with form submission:**
-    - Ensure submit actions complete without looking for non-existent confirmations
-    - Verify alternative verification strategies are used
-
- ### Script Repair Tests
-
- 1. **Test with selector failures:**
-    - Run a script with outdated selectors
-    - Verify vision diagnostics reveal the actual element state
-    - Check that repairs use vision insights
-
- 2. **Test with state-dependent failures:**
-    - Run a script where timing/state causes failures
-    - Verify vision reveals the actual page state (e.g., disabled buttons, loading states)
-    - Check that repairs address the root cause
-
- ### Monitor Both Flows
-
- 1. **Vision mode activation:**
-    - Should trigger after 2 failures when truly needed (not for every failure)
-    - Check diagnostic logs for root cause analysis
-    - Verify recommended alternatives are sensible
-
- 2. **Success patterns:**
-    - Look for "(vision-aided)" markers in success logs
-    - Track improvement in success rate after vision diagnostics were added
-
- ## Next Steps
-
- 1. **Monitor logs for:**
-    - Reduced hallucination instances (generation flow)
-    - Vision mode diagnostic quality (both flows)
-    - Effectiveness of alternative verification strategies (generation flow)
-    - Repair success rate improvement (repair flow)
-
- 2. **Metrics to track:**
-    - Vision mode activation rate in generation vs repair
-    - Hallucination detection rate
-    - Test success rate improvement (generation)
-    - Repair success rate improvement (execution)
-    - Cost vs benefit of vision mode in each flow
-
- 3. **Potential future enhancements:**
-    - Learn from vision diagnostics to improve DOM-only approaches
-    - Build a library of common visual patterns and solutions
-    - Optimize screenshot timing based on diagnostic value
-    - Consider vision diagnostics for other failure scenarios
-    - Share vision insights between the generation and repair flows
-
- ## Summary
-
- **✅ Vision-based diagnostics are now properly integrated into BOTH:**
- - **Script Generation** - Helps generate better commands when DOM info is insufficient
- - **Script Repair** - Helps diagnose why repairs fail and suggests better fixes
-
- **✅ Properly modularized with no code duplication:**
- - Shared `LLMFacade` methods
- - Consistent two-step supervisor pattern
- - A single source of truth for prompts and logic