testchimp-runner-core 0.0.35 → 0.0.36
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/package.json +6 -1
- package/plandocs/BEFORE_AFTER_VERIFICATION.md +0 -148
- package/plandocs/COORDINATE_MODE_DIAGNOSIS.md +0 -144
- package/plandocs/CREDIT_CALLBACK_ARCHITECTURE.md +0 -253
- package/plandocs/HUMAN_LIKE_IMPROVEMENTS.md +0 -642
- package/plandocs/IMPLEMENTATION_STATUS.md +0 -108
- package/plandocs/INTEGRATION_COMPLETE.md +0 -322
- package/plandocs/MULTI_AGENT_ARCHITECTURE_REVIEW.md +0 -844
- package/plandocs/ORCHESTRATOR_MVP_SUMMARY.md +0 -539
- package/plandocs/PHASE1_ABSTRACTION_COMPLETE.md +0 -241
- package/plandocs/PHASE1_FINAL_STATUS.md +0 -210
- package/plandocs/PHASE_1_COMPLETE.md +0 -165
- package/plandocs/PHASE_1_SUMMARY.md +0 -184
- package/plandocs/PLANNING_SESSION_SUMMARY.md +0 -372
- package/plandocs/PROMPT_OPTIMIZATION_ANALYSIS.md +0 -120
- package/plandocs/PROMPT_SANITY_CHECK.md +0 -120
- package/plandocs/SCRIPT_CLEANUP_FEATURE.md +0 -201
- package/plandocs/SCRIPT_GENERATION_ARCHITECTURE.md +0 -364
- package/plandocs/SELECTOR_IMPROVEMENTS.md +0 -139
- package/plandocs/SESSION_SUMMARY_v0.0.33.md +0 -151
- package/plandocs/TROUBLESHOOTING_SESSION.md +0 -72
- package/plandocs/VISION_DIAGNOSTICS_IMPROVEMENTS.md +0 -336
- package/plandocs/VISUAL_AGENT_EVOLUTION_PLAN.md +0 -396
- package/plandocs/WHATS_NEW_v0.0.33.md +0 -183
- package/plandocs/exploratory-mode-support-v2.plan.md +0 -953
- package/plandocs/exploratory-mode-support.plan.md +0 -928
- package/plandocs/journey-id-tracking-addendum.md +0 -227
- package/releasenotes/RELEASE_0.0.26.md +0 -165
- package/releasenotes/RELEASE_0.0.27.md +0 -236
- package/releasenotes/RELEASE_0.0.28.md +0 -286
- package/src/auth-config.ts +0 -84
- package/src/credit-usage-service.ts +0 -188
- package/src/env-loader.ts +0 -103
- package/src/execution-service.ts +0 -996
- package/src/file-handler.ts +0 -104
- package/src/index.ts +0 -432
- package/src/llm-facade.ts +0 -821
- package/src/llm-provider.ts +0 -53
- package/src/model-constants.ts +0 -35
- package/src/orchestrator/decision-parser.ts +0 -139
- package/src/orchestrator/index.ts +0 -58
- package/src/orchestrator/orchestrator-agent.ts +0 -1282
- package/src/orchestrator/orchestrator-prompts.ts +0 -786
- package/src/orchestrator/page-som-handler.ts +0 -1565
- package/src/orchestrator/som-types.ts +0 -188
- package/src/orchestrator/tool-registry.ts +0 -184
- package/src/orchestrator/tools/check-page-ready.ts +0 -75
- package/src/orchestrator/tools/extract-data.ts +0 -92
- package/src/orchestrator/tools/index.ts +0 -15
- package/src/orchestrator/tools/inspect-page.ts +0 -42
- package/src/orchestrator/tools/recall-history.ts +0 -72
- package/src/orchestrator/tools/refresh-som-markers.ts +0 -69
- package/src/orchestrator/tools/take-screenshot.ts +0 -128
- package/src/orchestrator/tools/verify-action-result.ts +0 -159
- package/src/orchestrator/tools/view-previous-screenshot.ts +0 -103
- package/src/orchestrator/types.ts +0 -291
- package/src/playwright-mcp-service.ts +0 -224
- package/src/progress-reporter.ts +0 -144
- package/src/prompts.ts +0 -842
- package/src/providers/backend-proxy-llm-provider.ts +0 -91
- package/src/providers/local-llm-provider.ts +0 -38
- package/src/scenario-service.ts +0 -252
- package/src/scenario-worker-class.ts +0 -1110
- package/src/script-utils.ts +0 -203
- package/src/types.ts +0 -239
- package/src/utils/browser-utils.ts +0 -348
- package/src/utils/coordinate-converter.ts +0 -162
- package/src/utils/page-info-retry.ts +0 -65
- package/src/utils/page-info-utils.ts +0 -285
- package/testchimp-runner-core-0.0.35.tgz +0 -0
- package/tsconfig.json +0 -19
@@ -1,139 +0,0 @@
# Selector Preference Improvements

## Summary
Updated the orchestrator agent to prefer user-friendly, semantic Playwright selectors over auto-generated IDs, following Playwright's official best practices.

## Problem
The agent was generating commands like:
```typescript
await page.fill('#«r3»-form-item', 'alice@example.com')
await page.fill('#«r4»-form-item', 'TestPass123')
```

These auto-generated IDs (especially with unicode characters like `«r3»`) are:
- Not user-friendly or readable
- Brittle: they break when component instances change
- Not maintainable
- Not aligned with Playwright best practices

## Solution
Implemented a comprehensive selector preference strategy across three key files:

### 1. Orchestrator Agent Prompt (`orchestrator-agent.ts`)

Added new section **"5b. SELECTOR PREFERENCE"** with explicit guidance:

**Preferred selectors (in order):**
1. `page.getByRole('role', {name: 'text'})` - Accessible, semantic, resilient
2. `page.getByLabel('label text')` - Great for form inputs
3. `page.getByPlaceholder('placeholder')` - Good for inputs without labels
4. `page.getByText('visible text')` - Clear and readable
5. `page.getByTestId('test-id')` - Stable if available

**Avoid (last resort only):**
- CSS selectors with auto-generated IDs: `#r3-form-item`, `#«r3»-form-item`
- CSS selectors with unicode characters
- Complex CSS paths

**Examples provided:**
```typescript
// ❌ BAD
await page.fill('#«r3»-form-item', 'alice@example.com')

// ✅ GOOD
await page.getByLabel('Email').fill('alice@example.com')
await page.getByRole('textbox', {name: 'Email'}).fill('alice@example.com')
await page.getByPlaceholder('Enter your email').fill('alice@example.com')
```

### 2. Page Info Utils (`page-info-utils.ts`)

**Reordered selector generation priority:**

Before:
1. ID selector (e.g., `#«r3»-form-item`)
2. data-testid
3. getByRole

After:
1. getByLabel (for inputs with associated labels)
2. getByRole with name (semantic and accessible)
3. getByPlaceholder (for inputs with placeholders)
4. getByText (visible text fallback)
5. getByTestId (stable, explicit)
6. ID selector - ONLY if stable (filters out auto-generated IDs with unicode or patterns like `rc_`, `:r[0-9]+:`, `__`)

**Enhanced display:**
- Shows best selector first, with up to 2 alternatives
- Example: `getByLabel('Email') (or: getByRole('textbox', {name: 'Email'}), getByPlaceholder('Enter your email'))`
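
The reordered priority and the stable-ID filter could look roughly like the sketch below. This is an illustration only, not the code from `page-info-utils.ts`: the `ElementInfo` shape and the names `isLikelyAutoGeneratedId`, `rankSelectors`, and `formatSelectors` are hypothetical, and the regular expressions are one plausible reading of the patterns listed above (`rc_`, `:r[0-9]+:`, `__`, non-ASCII characters).

```typescript
// Hypothetical element description; the real structure may differ.
interface ElementInfo {
  label?: string;
  role?: string;
  name?: string;
  placeholder?: string;
  text?: string;
  testId?: string;
  id?: string;
}

// Assumed heuristics for IDs that look auto-generated.
function isLikelyAutoGeneratedId(id: string): boolean {
  return (
    /[^\x00-\x7F]/.test(id) || // non-ASCII characters such as « »
    /rc_/.test(id) ||
    /:r[0-9]+:/.test(id) ||
    /__/.test(id)
  );
}

// Wrap a value in single quotes, escaping backslashes and quotes.
const quote = (s: string): string =>
  `'${s.replace(/\\/g, "\\\\").replace(/'/g, "\\'")}'`;

// Candidate selectors, best first, following the "After" order.
function rankSelectors(el: ElementInfo): string[] {
  const out: string[] = [];
  if (el.label) out.push(`getByLabel(${quote(el.label)})`);
  if (el.role && el.name) {
    out.push(`getByRole(${quote(el.role)}, {name: ${quote(el.name)}})`);
  }
  if (el.placeholder) out.push(`getByPlaceholder(${quote(el.placeholder)})`);
  if (el.text) out.push(`getByText(${quote(el.text)})`);
  if (el.testId) out.push(`getByTestId(${quote(el.testId)})`);
  // ID selectors come last, and only when the ID looks stable.
  if (el.id && !isLikelyAutoGeneratedId(el.id)) {
    out.push(`locator(${quote("#" + el.id)})`);
  }
  return out;
}

// Best selector first, with up to two alternatives.
function formatSelectors(el: ElementInfo): string {
  const [best, ...rest] = rankSelectors(el);
  if (!best) return "(no selector)";
  const alternatives = rest.slice(0, 2);
  return alternatives.length > 0
    ? `${best} (or: ${alternatives.join(", ")})`
    : best;
}
```

With an element that has a label, a role with a name, a placeholder, and the ID `«r3»-form-item`, `formatSelectors` drops the ID and yields the display string shown in the example bullet.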

### 3. Screenshot Tool (`take-screenshot.ts`)

**Updated vision analysis prompts:**
- System prompt now emphasizes: "ALWAYS prioritize semantic selectors (getByRole, getByLabel, getByText) over CSS selectors with auto-generated IDs"
- Analysis task explicitly instructs: "Recommend SEMANTIC SELECTORS FIRST" and "AVOID auto-generated IDs with unicode"

### 4. Updated EXPERIENCES Section

Enhanced examples to capture semantic selector patterns:
```
✅ GOOD - App-specific patterns:
- "Login form fields accessible via getByLabel: 'Email' and 'Password'"
- "Submit buttons consistently use role=button with text matching action"
- "Input fields have clear placeholders - prefer getByPlaceholder over IDs"

❌ BAD:
- Noting auto-generated IDs like #«r3»-form-item (these are unreliable)
```

## Benefits

1. **More Maintainable**: Semantic selectors survive markup and ID changes as long as visible labels and roles stay the same
2. **Self-Documenting**: Code reads like natural language
3. **Accessibility**: Role- and label-based selectors only match elements that expose accessible names, so accessibility gaps surface early
4. **Best Practices**: Follows Playwright's official recommendations
5. **User-Friendly**: Generated tests are more readable and easier to understand

## Expected Output

After these changes, the agent is expected to generate:

```typescript
// Login form example
await page.getByLabel('Email').fill('alice@example.com')
await page.getByLabel('Password').fill('TestPass123')
await page.getByRole('button', {name: 'Sign In'}).click()

// Navigation example
await page.getByRole('link', {name: 'Dashboard'}).click()

// Form with placeholders
await page.getByPlaceholder('Search...').fill('test query')
```

Instead of:

```typescript
// Old style with auto-generated IDs
await page.fill('#«r3»-form-item', 'alice@example.com')
await page.fill('#«r4»-form-item', 'TestPass123')
await page.click('#«r5»-button')
```

## Testing

The changes are backward compatible - the agent will still use ID selectors as a last resort when semantic selectors are not available. The build completed successfully with no linter errors.

## Files Modified

1. `/src/orchestrator/orchestrator-agent.ts` - System prompt enhancements
2. `/src/utils/page-info-utils.ts` - Selector priority reordering
3. `/src/orchestrator/tools/take-screenshot.ts` - Vision analysis updates

## Next Steps

Test the changes with real scenarios to verify that:
1. Generated commands use semantic selectors when available
2. Tests remain stable and maintainable
3. Fallback to ID selectors works when semantic options aren't available

@@ -1,151 +0,0 @@
# Runner-Core v0.0.33 - Session Summary

## Date: October 15, 2025

## Major Accomplishments:

### 1. ✅ **Coordinate Fallback System** (Phase 1 Complete)
- Percentage-based coordinates (0-100%, 3 decimal precision)
- Activates after 3 selector failures
- 2 coordinate attempts before giving up
- Resolution-independent positioning
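
A minimal sketch of the fallback described above, assuming percentages are resolved against the current viewport size. The names `PercentPoint`, `ViewportSize`, `toPixels`, and `nextMode` are hypothetical; the actual logic in `coordinate-converter.ts` and the orchestrator is not shown in this summary and may differ.

```typescript
interface PercentPoint {
  xPercent: number; // 0-100
  yPercent: number; // 0-100
}

interface ViewportSize {
  width: number;
  height: number;
}

// Round to 3 decimal places, matching the precision noted above.
function roundPercent(value: number): number {
  return Math.round(value * 1000) / 1000;
}

// Convert a percentage-based point into pixel coordinates for a viewport.
function toPixels(point: PercentPoint, viewport: ViewportSize): { x: number; y: number } {
  for (const value of [point.xPercent, point.yPercent]) {
    if (!Number.isFinite(value) || value < 0 || value > 100) {
      throw new RangeError(`Percentage out of range 0-100: ${value}`);
    }
  }
  return {
    x: Math.round((roundPercent(point.xPercent) / 100) * viewport.width),
    y: Math.round((roundPercent(point.yPercent) / 100) * viewport.height),
  };
}

type FallbackMode = "selector" | "coordinate" | "stuck";

// Assumed gating: 3 selector failures switch to coordinates,
// then 2 coordinate attempts are allowed before giving up.
function nextMode(selectorFailures: number, coordinateAttempts: number): FallbackMode {
  if (selectorFailures < 3) return "selector";
  if (coordinateAttempts < 2) return "coordinate";
  return "stuck";
}
```

Because the point is stored as a percentage, the same `{ xPercent: 50, yPercent: 25 }` maps to (640, 180) on a 1280×720 viewport and (960, 270) on 1920×1080. The resulting pixels could then be passed to Playwright's `page.mouse.click(x, y)`.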

### 2. ✅ **Note to Future Self** (Tactical Memory)
- Free-form notes persist across iterations AND steps
- Enables strategic planning across agent decisions
- Helps maintain context: "Tried X, will try Y next"

### 3. ✅ **Visual Verification Tool** (NEW)
- `verify_action_result` - Before/after screenshot comparison
- Agent-callable (the agent decides when to use it)
- JPEG 60% quality (85-90% smaller than PNG)
- Multi-image LLM interface support

### 4. ✅ **Critical Bug Fixes**
- **Coordinate mode never activated**: Raised the forced-stuck threshold from >= 3 to >= 5 failures, so the agent is no longer marked stuck at the same point where coordinate mode should start (3 selector failures + 2 coordinate attempts)
- **Missing required fields**: Made parser flexible (accepts `reasoning` OR `statusReasoning`)
- **Navigation timeouts**: Added 30s timeout guidance for `page.goto()`
- **Strict mode violations**: Added scoping guidance (`locator('#parent').getByText()`)
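
The "flexible parser" fix can be pictured as below. This is a sketch under assumptions: `extractReasoning` is a hypothetical helper, and the real `decision-parser.ts` may validate more fields or behave differently. Only the two field names come from the note above.

```typescript
// Accept either `reasoning` or `statusReasoning` from a raw LLM decision object.
function extractReasoning(raw: Record<string, unknown>): string {
  for (const key of ["reasoning", "statusReasoning"]) {
    const value = raw[key];
    if (typeof value === "string" && value.trim() !== "") {
      return value;
    }
  }
  // Listing the keys that were present helps diagnose malformed responses.
  throw new Error(
    `Missing required field: expected "reasoning" or "statusReasoning"; ` +
      `got keys [${Object.keys(raw).join(", ")}]`,
  );
}
```

Preferring `reasoning` when both fields are present is an arbitrary choice in this sketch.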

### 5. ✅ **Prompt Optimizations**
- **59% reduction**: 17,573 chars → 7,287 chars in system prompt
- **Cache-optimized**: Static content first, dynamic last
- **Cost savings**: ~40% overall with model tiering
- **Focused on cognition**: Removed bloat, kept decision-making guidance

### 6. ✅ **Model Optimization**
- **gpt-5-mini**: Complex tasks (4 operations)
  - Command generation
  - Goal completion checks
  - Repair suggestions
  - Agent orchestration
- **gpt-4o-mini**: Simple tasks (7 operations)
  - Scenario breakdown
  - Screenshot need assessment
  - Repair confidence
  - Test name generation
  - Hashtag generation
  - Script parsing
  - Final script merging
- **Est. 25-30% cost reduction**

### 7. ✅ **Code Cleanup**
- Removed V1 SmartTestRunnerCore (V2 is stable)
- Removed backup files (.bak, .tmp)
- Consolidated types into V2
- Removed PeopleHR-specific examples from prompts

### 8. ✅ **Enhanced Logging**
- Prompt length metrics (chars + estimated tokens)
- Full LLM response on parsing errors
- Field presence diagnostics
- Retry logging for 500 errors

### 9. ✅ **Retry Logic**
- Automatic retry for OpenAI 500 errors
- Exponential backoff (1s, 2s, 4s)
- Up to 3 attempts before failing
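
The retry behaviour above could be sketched as follows. The helper names (`backoffDelayMs`, `isRetryable`, `withRetry`) and the error shape with a numeric `status` are assumptions for illustration, not the provider's actual code. Note that with three total attempts only the 1s and 2s waits occur; if "3 attempts" means three retries after the initial call, `maxAttempts` would be 4 and the 4s delay would also be used.

```typescript
// Delay before retry number `retryIndex` (0-based): 1s, 2s, 4s, ...
function backoffDelayMs(retryIndex: number, baseMs = 1000): number {
  return baseMs * 2 ** retryIndex;
}

// Assumed error shape: an object carrying an HTTP-like status code.
interface HttpLikeError {
  status?: number;
}

// Only server-side 500 errors are retried in this sketch.
function isRetryable(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as HttpLikeError).status === 500;
}

async function withRetry<T>(
  call: () => Promise<T>,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => {
      setTimeout(resolve, ms);
    }),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
      // Stop on non-retryable errors or when attempts are exhausted.
      if (!isRetryable(err) || attempt === maxAttempts - 1) break;
      await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

Passing `sleep` as a parameter keeps the function testable without real waiting.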

### 10. ✅ **Headed Mode for Local Testing**
- All browser instances now launch in headed mode (visible browser window) for local dev
- Visual debugging enabled

## Files Modified:

### Runner-Core:
1. `src/orchestrator/orchestrator-agent.ts` - Main agent logic
2. `src/orchestrator/types.ts` - NoteToFutureSelf, CoordinateAction
3. `src/utils/coordinate-converter.ts` - NEW - Coordinate to Playwright conversion
4. `src/orchestrator/tools/verify-action-result.ts` - NEW - Visual verification tool
5. `src/llm-provider.ts` - Added LabeledImage, multi-image support
6. `src/llm-facade.ts` - Model optimization
7. `src/model-constants.ts` - Added DEFAULT_SIMPLER_MODEL
8. `src/scenario-worker-class.ts` - Tool registration
9. `src/orchestrator/index.ts` - Exports
10. `src/orchestrator/tools/index.ts` - Tool exports

### Scriptservice:
1. `providers/scriptservice-llm-provider.ts` - Multi-image handling, retry logic
2. `smart-test-runner-core-v2.ts` - Type definitions, V1 removal
3. `smart-test-execution-handler.ts` - V1 removal
4. `workers/test-based-explorer.ts` - V1 removal
5. `script-generation-handlers.ts` - Headed mode
6. `script-generation/script-generation-service.ts` - Headed mode
7. `smart-test-execution-handler.ts` - Headed mode

### Documentation:
1. `WHATS_NEW_v0.0.33.md`
2. `PHASE_1_COMPLETE.md`
3. `PHASE_1_SUMMARY.md`
4. `IMPLEMENTATION_STATUS.md`
5. `VISUAL_AGENT_EVOLUTION_PLAN.md`
6. `PROMPT_SANITY_CHECK.md`
7. `PROMPT_OPTIMIZATION_ANALYSIS.md`
8. `COORDINATE_MODE_DIAGNOSIS.md`
9. `BEFORE_AFTER_VERIFICATION.md`
10. `TROUBLESHOOTING_SESSION.md`

## Live Test Status:

**Job**: `71b88c60-52f5-4343-aef8-c44ebb07f3e9`
**Status**: Running (check browser + logs)
**Watch For**:
- Step 5 (Employee Information) - Previously problematic
- Coordinate mode activation
- verify_action_result tool usage
- Overall completion

## Key Metrics:

**Cost Optimization:**
- Prompt size: 59% reduction
- Model tiering: 7/11 tasks on cheaper model
- JPEG compression: 85-90% smaller screenshots
- **Total savings: ~40% cost reduction**

## Next Steps After Test Completes:

1. Check if Step 5 completes successfully
2. Verify coordinate mode activated if needed
3. Check if verify_action_result tool was used
4. Analyze any remaining failures
5. Iterate on prompts/logic based on results

## Known Issues to Monitor:

1. **Step 5 False Positive**: Clicking menu item vs navigating to page
2. **Coordinate Loop**: Agent not knowing when coordinate clicks succeed
3. **Vision verification usage**: Will agent call it proactively?

## Success Criteria:

✅ All 7 steps complete
✅ Coordinate fallback used when selectors fail
✅ Visual verification validates goal achievement
✅ No infinite loops or stuck states
✅ Generated script is accurate

---

**Check your browser window and /tmp/scriptservice-test.log for live execution!**

@@ -1,72 +0,0 @@
# Troubleshooting Session: All Modules Icon Click Failure

## Objective:
Understand why the orchestrator agent gets stuck on "Click on the all Modules menu item (top menu icon)" while manual Playwright MCP navigation succeeded.

## What I Need to See:

### 1. Full Agent Logs for the Failing Step
Please provide the complete logs showing:
- What iteration attempts were made (iteration 1, 2, 3...)
- What selectors the agent tried each time
- What errors it encountered
- What the DOM snapshot showed
- Whether it took screenshots
- What notes it left to future self

### 2. The DOM Context It Saw
- Interactive elements list
- ARIA tree snapshot
- Whether the hamburger icon was visible in the list

## What Worked (My Manual MCP Session):

From earlier successful navigation:
```
✅ Step 1: Clicked hamburger menu
Selector: #sidebar-toggle > span > svg

✅ Step 2: Clicked "Core HR"
Selector: getByText('Core HR')

✅ Step 3: Clicked "Employee Information"
Selector: getByText('Employee Information')
```

## Hypotheses for Why the Agent Fails:

### Possible Issue 1: Wrong Selector Strategy
- Agent might be trying: `getByText('All Modules')` (strict mode violation)
- Or: `#MenuToggle` (wrong ID)
- Or: `#sidebar-toggle-menu` (doesn't exist)
- Instead of: `#sidebar-toggle > span > svg` (actual selector)

### Possible Issue 2: Missing Icon Detection
- Hamburger icons are often SVG elements without accessible text
- Agent might not recognize this pattern
- Prompt doesn't explicitly guide on icon/SVG selector strategy

### Possible Issue 3: DOM List Incomplete
- Interactive elements might not include the SVG icon
- If icon isn't in the list, agent won't know it exists
- Need to check if `getEnhancedPageInfo` captures SVG icons

### Possible Issue 4: Ambiguous Text
- "All Modules" might appear in multiple places (menu button + modal title)
- Agent tries `getByText('All Modules')` → strict mode violation
- Should scope to parent: `locator('#sidebar-toggle').getByText('All Modules')`
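
To make the ambiguity in Possible Issue 4 concrete, here is a toy model of strict-mode resolution. It is not Playwright itself: `ToyElement`, `findByText`, `findById`, and `resolveStrict` are hypothetical stand-ins, and the `modules-modal` ID is invented for the example. It only illustrates why an unscoped text lookup fails when two elements share the same text, and why scoping to a parent resolves it.

```typescript
// Toy DOM node with just enough structure to model text matching and scoping.
interface ToyElement {
  id?: string;
  text?: string;
  children?: ToyElement[];
}

// Collect every element in the subtree (root included) whose text matches exactly.
function findByText(root: ToyElement, text: string): ToyElement[] {
  const matches: ToyElement[] = root.text === text ? [root] : [];
  for (const child of root.children ?? []) {
    matches.push(...findByText(child, text));
  }
  return matches;
}

// Depth-first lookup of the first element with the given id.
function findById(root: ToyElement, id: string): ToyElement | undefined {
  if (root.id === id) return root;
  for (const child of root.children ?? []) {
    const hit = findById(child, id);
    if (hit) return hit;
  }
  return undefined;
}

// Mimics strict mode: an action needs exactly one matching element.
function resolveStrict(matches: ToyElement[], description: string): ToyElement {
  if (matches.length === 0) {
    throw new Error(`${description}: no matching element`);
  }
  if (matches.length > 1) {
    throw new Error(
      `strict mode violation: ${description} resolved to ${matches.length} elements`,
    );
  }
  return matches[0];
}

// "All Modules" appears both in the menu button and in a modal title.
const page: ToyElement = {
  children: [
    { id: "sidebar-toggle", children: [{ text: "All Modules" }] },
    { id: "modules-modal", children: [{ text: "All Modules" }] },
  ],
};
```

Calling `resolveStrict(findByText(page, "All Modules"), ...)` throws because two elements match, while running `findByText` on the subtree returned by `findById(page, "sidebar-toggle")` yields exactly one.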

## Next Steps:

1. **Get full logs** from your failing run
2. **Compare** what agent saw vs what I saw
3. **Identify** the gap (prompt, DOM extraction, or selector logic)
4. **Plan fixes**:
   - Prompt improvements (icon/SVG guidance)
   - DOM extraction improvements (ensure icons are captured)
   - Selector strategy improvements (parent scoping for icons)
   - Example-based learning (hamburger menu pattern)

## Waiting For:
Please paste the full logs from the failing step showing all iteration attempts and what the agent tried.

@@ -1,336 +0,0 @@
# Vision-Based Diagnostic Analysis Improvements

## Overview
Enhanced the test automation system to use screenshot-based vision analysis as a **diagnostic tool** to understand WHY failures occur and recommend better strategies based on visual reality vs DOM assumptions.

**Vision diagnostics are now utilized in BOTH:**
1. **Script Generation** (`ScenarioWorker`) - When generating commands from scenarios
2. **Script Repair/Execution** (`ExecutionService`) - When repairing failing scripts with AI

## Problem Statement (From Logs)
1. **Hallucinated Verification Targets**: Goal completion was creating fake sub-goals like "verify message was sent" and looking for non-existent elements like "Message sent" text
2. **No Screenshot Analysis**: Vision mode was never triggering because LLM assessment always said "NO"

## Solutions Implemented

### 1. Fixed Hallucinated Verification Targets ✅

#### A. Goal Completion Check (`prompts.ts` lines 55-129)
**Changes:**
- System prompt: "Be EXTREMELY CONSERVATIVE - mark goals complete when PRIMARY action succeeds. DO NOT invent verification steps"
- Added golden rule: "If goal is SIMPLE ACTION and action SUCCEEDED, mark COMPLETE immediately"
- Explicit examples showing action goals complete after action succeeds (no verification needed)

**Example:**
```
Before: "Click send button" → click succeeds → "INCOMPLETE - need to verify message sent"
After: "Click send button" → click succeeds → "COMPLETE ✅"
```

#### B. Command Generation Anti-Hallucination (`prompts.ts` lines 215-225)
**New Section: "NEVER Hallucinate Verification Elements"**
- ONLY verify elements that ACTUALLY EXIST in current DOM state
- Don't invent success messages, confirmations, or sent indicators
- Use alternative verification: state changes, network, page load
- Stop trying to find elements after previous attempts failed

#### C. Smart Hallucination Detection (`llm-facade.ts` lines 573-599)
**Automatic Detection:**
- Detects when LLM repeatedly tries to find non-existent elements (2+ attempts)
- Shows "⚠️ HALLUCINATION ALERT" with guidance to stop and use alternatives
- Analyzes command patterns (getByText, toBeVisible) + errors (not found, timeout)
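
The detection heuristic described above might be approximated like this. The `AttemptRecord` shape, the function names, and the exact regular expressions are assumptions made for this sketch; the code in `llm-facade.ts` is not shown in this document and may use different signals or wording.

```typescript
// One previous command attempt and the error it produced, if any.
interface AttemptRecord {
  command: string;
  error?: string;
}

// Command patterns that suggest the agent is verifying an element.
const VERIFICATION_PATTERN = /getByText|toBeVisible/;
// Error patterns that suggest the element does not exist.
const MISSING_ELEMENT_PATTERN = /not found|timeout/i;

// Count failed attempts that look like searches for a missing element.
function countSuspectAttempts(attempts: AttemptRecord[]): number {
  return attempts.filter(
    (a) =>
      a.error !== undefined &&
      VERIFICATION_PATTERN.test(a.command) &&
      MISSING_ELEMENT_PATTERN.test(a.error),
  ).length;
}

// Returns an alert string once the threshold (2+ attempts) is reached, else null.
function hallucinationAlert(attempts: AttemptRecord[], threshold = 2): string | null {
  if (countSuspectAttempts(attempts) < threshold) return null;
  return (
    "⚠️ HALLUCINATION ALERT: repeated attempts to find elements that may not exist. " +
    "Stop and use alternative verification (state changes, network, page load)."
  );
}
```

The returned string would be appended to the next prompt so the model sees the warning alongside its attempt history.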

### 2. Enabled Screenshot-Based Diagnostic Analysis 📸

#### A. Conservative Screenshot Trigger (`scenario-worker-class.ts` line 193)
```typescript
// ONLY on attempt 2 (3rd attempt, after exactly 2 failures)
// This is the ONLY chance to use vision - must be absolutely necessary
if (attempt === 2 && lastError && !usedVisionMode) {
  // ... screenshot assessment and vision diagnostics run here
}
```

**Attempt Flow:**
- Attempt 0: First try (DOM-based)
- Attempt 1: Second try (DOM-based)
- **Attempt 2: Third try - Vision assessment (if truly needed)**
- Attempt 3: Fourth try (final, DOM-based)
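
The attempt flow can be expressed as a small pure function. This is a sketch that mirrors the condition quoted above; `AttemptStrategy` and `strategyForAttempt` are hypothetical names, and the real worker interleaves this check with command generation and execution.

```typescript
type AttemptStrategy = "dom" | "vision-assessment";

// Vision is considered only on attempt index 2, after a failure,
// and only if vision mode has not already been used.
function strategyForAttempt(
  attempt: number,
  lastError: string | undefined,
  usedVisionMode: boolean,
): AttemptStrategy {
  const visionEligible = attempt === 2 && Boolean(lastError) && !usedVisionMode;
  return visionEligible ? "vision-assessment" : "dom";
}

// Strategies for the four attempts when every earlier attempt failed.
const flow: AttemptStrategy[] = [0, 1, 2, 3].map((attempt) =>
  strategyForAttempt(attempt, attempt > 0 ? "Element not found" : undefined, false),
);
```

Here `flow` evaluates to DOM, DOM, vision assessment, DOM, matching the list above. Note that "vision-assessment" only means the LLM is asked whether a screenshot would help; it may still answer "NO".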

#### B. Screenshot Assessment (`prompts.ts` lines 131-191)
The trigger in A stays conservative (a single vision opportunity, on attempt 2), but the assessment prompt itself now leans toward taking the screenshot.

**System Prompt:** "Be LIBERAL in recommending screenshots - visual context provides diagnostic insights DOM cannot"

**Diagnostic Value Framework:**
1. Identify WHY attempts failed (DOM assumptions vs visual reality)
2. Detect hallucinated elements (see if expected elements exist visually)
3. Recommend better strategies (verification based on what's visible)
4. Find visual blockers (overlays, modals, layout issues)
5. Correct wrong assumptions (actual state vs expected state)

**Hallucination Detection in Assessment:**
- If "not found/timeout" errors + verification attempts → HIGH chance of hallucination
- Screenshot reveals if elements actually exist

#### C. Enhanced Vision Mode as Diagnostic Tool (`prompts.ts` lines 359-462)

**System Prompt:** "Analyze screenshot to understand WHY previous attempts failed and recommend BEST next step based on visual reality vs DOM assumptions"

**Three Primary Tasks:**
1. **DEBUG WHY PREVIOUS ATTEMPTS FAILED** - Compare what you assumed vs what you SEE
2. **IDENTIFY THE ROOT CAUSE** - Why did DOM-based approaches fail?
3. **RECOMMEND SMARTER ALTERNATIVES** - What should we do instead?

**Critical Questions Framework:**

🔍 **Visual vs DOM Reality Check:**
- What do you SEE vs what DOM suggested?
- Are elements you tried to find VISIBLE on screen?
- Are there visual indicators you missed?
- What's ACTUALLY present vs ASSUMED?

🚫 **Why Did Previous Attempts Fail?**
- Looking for elements that DON'T EXIST? (hallucination)
- Wrong selectors for elements that ARE visible?
- Elements blocked/covered by overlays?
- Elements in different state than expected?
- Page in different state than assumed?

💡 **What Should You Do Differently?**
- If verification elements don't exist: Use state-based verification
- If elements blocked: Remove blocker first
- If wrong selector: Use visual clues for better selectors
- If goal achieved: Confirm with simple wait/check

**Verification Strategy Based on Visual Reality:**

```
IF you see success indicators in screenshot:
  ✅ Use them: await expect(page.getByText('visible-text-here')).toBeVisible()

IF you DON'T see success indicators but action appears complete:
  ✅ Use state-based checks:
  - Button disabled check
  - Form cleared/reset check
  - URL changed check
  - Network response verification
  - Load state verification

IF previous attempts looked for non-existent elements:
  ❌ STOP looking for them
  ✅ Switch to alternative verification
```

**Comparison Analysis Template:**
```
"Based on screenshot analysis:
- DOM suggested: [what you thought was there]
- Visual reality: [what you actually see]
- Why previous failed: [root cause analysis]
- Better approach: [what to do instead]"
```

#### D. Enhanced Diagnostic Logging (`llm-facade.ts` lines 313-328)

**Vision Mode Response Now Includes:**
- `visualInsights`: What screenshot revealed that DOM couldn't tell you
- `failureRootCause`: Why previous DOM-based attempts failed
- `recommendedAlternative`: Better verification/interaction strategy
- `reasoning`: Full diagnostic analysis

**Console Logging:**
```
📸 Visual insights: [what was discovered]
🔍 Root cause analysis: [why it failed]
💡 Recommended alternative: [what to do instead]
🧠 Vision-based reasoning: [full analysis]
```
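
As a rough illustration, the four response fields and the console lines above could be tied together as follows. The `VisionDiagnostics` interface uses the field names listed above, but treating them all as required strings is an assumption, and `formatVisionLog` is a hypothetical helper rather than the function in `llm-facade.ts`.

```typescript
// Assumed shape of the vision-mode response; all fields taken as strings.
interface VisionDiagnostics {
  visualInsights: string;
  failureRootCause: string;
  recommendedAlternative: string;
  reasoning: string;
}

// Build the four log lines in the order shown in the console example.
function formatVisionLog(d: VisionDiagnostics): string[] {
  return [
    `📸 Visual insights: ${d.visualInsights}`,
    `🔍 Root cause analysis: ${d.failureRootCause}`,
    `💡 Recommended alternative: ${d.recommendedAlternative}`,
    `🧠 Vision-based reasoning: ${d.reasoning}`,
  ];
}

// Example values echo the send-button scenario used elsewhere in this document.
const sample: VisionDiagnostics = {
  visualInsights: "No success message visible, button is now disabled",
  failureRootCause: "Hallucinated verification element that doesn't exist",
  recommendedAlternative: "Check button disabled state instead",
  reasoning: "The click succeeded; the UI signals completion through state, not text",
};

for (const line of formatVisionLog(sample)) {
  console.log(line);
}
```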

## Code Modularization

Vision diagnostics are properly modularized and reused across both flows:

1. **`LLMFacade` Methods** (shared by both flows):
   - `assessScreenshotNeed()` - Conservative assessment if screenshot would help
   - `getVisionDiagnostics()` - Supervisor analyzes screenshot (gpt-4o)
   - `generateCommandFromSupervisorInstructions()` - Worker generates command (gpt-4.1-mini)

2. **Two-Step Supervisor Pattern** (used consistently):
   ```typescript
   // Step 1: Supervisor analyzes screenshot (expensive vision model)
   const supervisorDiagnostics = await llmFacade.getVisionDiagnostics(...)

   // Step 2: Use insights for action
   // - In generation: Generate command from instructions
   // - In repair: Enhance failure context with visual insights
   ```

3. **No Code Duplication**:
   - Vision prompts defined once in `prompts.ts`
   - LLM calls handled by `LLMFacade`
   - Both flows use same vision assessment logic

170
|
-
## Behavioral Changes
|
|
171
|
-
|
|
172
|
-
### Script Generation - Before:
|
|
173
|
-
```
|
|
174
|
-
1. Click send button → succeeds
|
|
175
|
-
2. Goal check → "INCOMPLETE - verify message sent" (hallucinated)
|
|
176
|
-
3. Try: await expect(page.getByText('Message sent')).toBeVisible()
|
|
177
|
-
4. Fail: Element not found
|
|
178
|
-
5. Try: await page.getByRole('status').waitFor()
|
|
179
|
-
6. Fail: Timeout
|
|
180
|
-
7. Repeat until max failures
|
|
181
|
-
8. Screenshot assessment → "NO, not needed"
|
|
182
|
-
9. Vision mode → never triggers
|
|
183
|
-
```
|
|
184
|
-
|
|
185
|
-
### Script Generation - After:
|
|
186
|
-
```
|
|
187
|
-
1. Click send button → succeeds
|
|
188
|
-
2. Goal check → "COMPLETE ✅" (action succeeded, no verification needed)
|
|
189
|
-
OR if verification truly needed but wrong approach:
|
|
190
|
-
3. After 2 failures → Screenshot assessment
|
|
191
|
-
4. Assessment → "YES - Diagnostic value: Visual reveals if success indicators exist"
|
|
192
|
-
5. Vision mode activates with screenshot
|
|
193
|
-
6. Diagnostic analysis:
|
|
194
|
-
- "DOM suggested: Success message would appear"
|
|
195
|
-
- "Visual reality: No success message visible, button is now disabled"
|
|
196
|
-
- "Root cause: Hallucinated verification element that doesn't exist"
|
|
197
|
-
- "Better approach: Check button disabled state instead"
|
|
198
|
-
7. Generates: await expect(page.locator('button[name="Send"]')).toBeDisabled()
|
|
199
|
-
8. Success ✅
|
|
200
|
-
```
|
|
201
|
-
|
|
202
|
-
### Script Repair - Before:
|
|
203
|
-
```
|
|
204
|
-
1. Script step fails: await page.click('button[name="Submit"]')
|
|
205
|
-
2. AI Repair attempt 1: Try different selector → Fails
|
|
206
|
-
3. AI Repair attempt 2: Try with wait → Fails
|
|
207
|
-
4. AI Repair attempt 3: Try another approach → Fails
|
|
208
|
-
5. Give up - repair failed
|
|
209
|
-
```
|
|
210
|
-
|
|
211
|
-
### Script Repair - After:
|
|
212
|
-
```
|
|
213
|
-
1. Script step fails: await page.click('button[name="Submit"]')
|
|
214
|
-
2. AI Repair attempt 1: Try different selector → Fails
|
|
215
|
-
3. AI Repair attempt 2: Try with wait → Fails
|
|
216
|
-
4. After 2 failures → Screenshot assessment
|
|
217
|
-
5. Assessment → "YES - Visual analysis can reveal why repairs are failing"
|
|
218
|
-
6. Vision supervisor analyzes screenshot:
|
|
219
|
-
- "Visual analysis: Button is disabled and grayed out"
|
|
220
|
-
- "Root cause: Trying to click a disabled button"
|
|
221
|
-
- "Recommended approach: Wait for button to become enabled first"
|
|
222
|
-
7. AI Repair attempt 3 with vision insights: Insert wait step before click
|
|
223
|
-
8. Success ✅ (vision-aided repair)
|
|
224
|
-
```
|
|
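The repair flow above can be sketched as a loop that stays DOM-only for two attempts and consults vision once before the final attempt. This is an illustration under stated assumptions, not the package's implementation: the method names `assessScreenshotNeed` and `getVisionDiagnostics` appear in this document, but their signatures here are guesses, and `repairWithVisionFallback`, `RepairFn`, and `RepairOutcome` are hypothetical.

```typescript
// Assumed signatures; the real LLMFacade methods may take different arguments.
interface VisionHooks {
  assessScreenshotNeed(error: string): Promise<boolean>; // true means "YES"
  getVisionDiagnostics(error: string): Promise<string>; // root cause and recommendation
}

// A single repair attempt; receives the vision insight when one is available.
type RepairFn = (attempt: number, visionInsight?: string) => Promise<boolean>;

interface RepairOutcome {
  success: boolean;
  attempts: number;
  visionAided: boolean;
}

async function repairWithVisionFallback(
  tryRepair: RepairFn,
  hooks: VisionHooks,
  lastError: string,
  maxAttempts = 3,
  visionAfterFailures = 2,
): Promise<RepairOutcome> {
  let visionInsight: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const failures = attempt - 1;
    // Escalate once, and only after the configured number of failures.
    if (failures === visionAfterFailures && visionInsight === undefined) {
      if (await hooks.assessScreenshotNeed(lastError)) {
        visionInsight = await hooks.getVisionDiagnostics(lastError);
      }
    }
    if (await tryRepair(attempt, visionInsight)) {
      return { success: true, attempts: attempt, visionAided: visionInsight !== undefined };
    }
  }
  return { success: false, attempts: maxAttempts, visionAided: visionInsight !== undefined };
}
```

Injecting the repair step and the vision hooks as parameters keeps the loop testable without a browser or an LLM call; a caller would pass closures that wrap the real services.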
225
|
-
|
|
226
|
-
## Cost Considerations
|
|
227
|
-
|
|
228
|
-
**Vision Mode (GPT-4o) Triggers in Both Flows:**
|
|
229
|
-
|
|
230
|
-
1. **Script Generation** (`ScenarioWorker`):
|
|
231
|
-
- After 2 failed command generation attempts (3rd attempt)
|
|
232
|
-
- Only when LLM assessment says "YES"
|
|
233
|
-
- Prevents endless DOM-based retries on wrong assumptions
|
|
234
|
-
|
|
235
|
-
2. **Script Repair** (`ExecutionService`):
|
|
236
|
-
- After 2 failed repair attempts (3rd/final repair attempt)
|
|
237
|
-
- Only when LLM assessment says "YES"
|
|
238
|
-
- Prevents wasted repair cycles on wrong strategy
|
|
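Both flows share the same two-part trigger: at least 2 failed attempts, and an assessment that starts with "YES". A minimal sketch of that gate follows, assuming the assessment is returned as a string such as `YES - Diagnostic value: ...`. The names `shouldUseVision` and `VISION_FAILURE_THRESHOLD` are hypothetical.

```typescript
// Hypothetical constant; the value 2 comes from the trigger rules listed above.
const VISION_FAILURE_THRESHOLD = 2;

// Returns true only when both conditions hold: enough failures and a "YES" assessment.
function shouldUseVision(failedAttempts: number, assessment: string): boolean {
  const saidYes = assessment.trim().toUpperCase().startsWith("YES");
  return failedAttempts >= VISION_FAILURE_THRESHOLD && saidYes;
}
```

Because the failure count is checked alongside the assessment, a "YES" on the first failure still does not trigger a vision call, which is what keeps the cost bounded.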
239
|
-
|
|
240
|
-
**What You Get (Both Flows):**
|
|
241
|
-
- Root cause analysis of failures
|
|
242
|
-
- Visual vs DOM reality comparison
|
|
243
|
-
- Recommended alternative strategies
|
|
244
|
-
- Fallback to vision only after 2 failures and a "YES" assessment
|
|
245
|
-
- Prevents wasted attempts on wrong approaches
|
|
246
|
-
|
|
247
|
-
## Files Modified
|
|
248
|
-
|
|
249
|
-
1. **`/runner-core/src/prompts.ts`**
|
|
250
|
-
- Enhanced goal completion check (lines 55-129)
|
|
251
|
-
- Enhanced screenshot assessment (lines 131-184)
|
|
252
|
-
- Enhanced vision mode prompt with diagnostics (lines 359-462)
|
|
253
|
-
- Added anti-hallucination sections
|
|
254
|
-
|
|
255
|
-
2. **`/runner-core/src/llm-facade.ts`**
|
|
256
|
-
- Added hallucination detection (lines 573-599)
|
|
257
|
-
- Enhanced vision logging (lines 313-328)
|
|
258
|
-
- Vision methods: `assessScreenshotNeed`, `getVisionDiagnostics`, `generateCommandFromSupervisorInstructions`
|
|
259
|
-
|
|
260
|
-
3. **`/runner-core/src/scenario-worker-class.ts`** (Script Generation)
|
|
261
|
-
- Vision fallback on the 3rd attempt, after 2 failures (line 197)
|
|
262
|
-
- Two-step supervisor pattern for command generation (lines 197-241)
|
|
263
|
-
|
|
264
|
-
4. **`/runner-core/src/execution-service.ts`** (Script Repair/Execution)
|
|
265
|
-
- Vision fallback on final repair attempt after 2 failures (lines 500-594)
|
|
266
|
-
- Vision-enhanced context for repair suggestions
|
|
267
|
-
- Same supervisor pattern for consistency
|
|
268
|
-
|
|
269
|
-
## Testing Recommendations
|
|
270
|
-
|
|
271
|
-
### Script Generation Tests
|
|
272
|
-
|
|
273
|
-
1. **Test with messaging flow:**
|
|
274
|
-
- Verify "send message" steps complete without hallucinated verification
|
|
275
|
-
- Check if vision mode provides diagnostic insights when command generation fails
|
|
276
|
-
|
|
277
|
-
2. **Test with form submission:**
|
|
278
|
-
- Ensure submit actions complete without looking for non-existent confirmations
|
|
279
|
-
- Verify alternative verification strategies are used
|
|
280
|
-
|
|
281
|
-
### Script Repair Tests
|
|
282
|
-
|
|
283
|
-
1. **Test with selector failures:**
|
|
284
|
-
- Run script with outdated selectors
|
|
285
|
-
- Verify vision diagnostics reveal actual element state
|
|
286
|
-
- Check that repairs use vision insights
|
|
287
|
-
|
|
288
|
-
2. **Test with state-dependent failures:**
|
|
289
|
-
- Run script where timing/state causes failures
|
|
290
|
-
- Verify vision reveals actual page state (e.g., disabled buttons, loading states)
|
|
291
|
-
- Check that repairs address root cause
|
|
292
|
-
|
|
293
|
-
### Monitor Both Flows
|
|
294
|
-
|
|
295
|
-
1. **Vision mode activation:**
|
|
296
|
-
- Should trigger after 2 failures when truly needed (not for every failure)
|
|
297
|
-
- Check diagnostic logs for root cause analysis
|
|
298
|
-
- Verify recommended alternatives are sensible
|
|
299
|
-
|
|
300
|
-
2. **Success patterns:**
|
|
301
|
-
- Look for "(vision-aided)" markers in success logs
|
|
302
|
-
- Track the improvement in success rate after vision diagnostics were added
|
|
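The monitoring checks above could be automated with a small log summary. The sketch below assumes success lines contain the word `Success` and that vision-aided ones also contain `vision-aided`, as in the example flows; the actual log format is not shown in this document, and `summarizeSuccessLogs` is a hypothetical helper.

```typescript
interface VisionLogStats {
  successes: number;
  visionAided: number;
  visionAidedShare: number; // visionAided / successes, or 0 when there are no successes
}

// Counts success lines and how many of them carry the vision-aided marker.
function summarizeSuccessLogs(lines: string[]): VisionLogStats {
  const successLines = lines.filter((line) => line.includes("Success"));
  const visionAided = successLines.filter((line) => line.includes("vision-aided")).length;
  const successes = successLines.length;
  return {
    successes,
    visionAided,
    visionAidedShare: successes === 0 ? 0 : visionAided / successes,
  };
}
```

A share close to 1 would suggest vision is firing on most successful runs, which is worth comparing against the "not for every failure" expectation.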
303
|
-
|
|
304
|
-
## Next Steps
|
|
305
|
-
|
|
306
|
-
1. **Monitor logs for:**
|
|
307
|
-
- Reduced hallucination instances (generation flow)
|
|
308
|
-
- Vision mode diagnostic quality (both flows)
|
|
309
|
-
- Effectiveness of alternative verification strategies (generation flow)
|
|
310
|
-
- Repair success rate improvement (repair flow)
|
|
311
|
-
|
|
312
|
-
2. **Metrics to track:**
|
|
313
|
-
- Vision mode activation rate in generation vs repair
|
|
314
|
-
- Hallucination detection rate
|
|
315
|
-
- Test success rate improvement (generation)
|
|
316
|
-
- Repair success rate improvement (execution)
|
|
317
|
-
- Cost vs benefit of vision mode in each flow
|
|
318
|
-
|
|
319
|
-
3. **Potential future enhancements:**
|
|
320
|
-
- Learn from vision diagnostics to improve DOM-only approaches
|
|
321
|
-
- Build a library of common visual patterns and solutions
|
|
322
|
-
- Optimize screenshot timing based on diagnostic value
|
|
323
|
-
- Consider vision diagnostics for other failure scenarios
|
|
324
|
-
- Share vision insights between generation and repair flows
|
|
325
|
-
|
|
326
|
-
## Summary
|
|
327
|
-
|
|
328
|
-
**✅ Vision-based diagnostics are now properly integrated into BOTH:**
|
|
329
|
-
- **Script Generation** - Helps generate better commands when DOM info is insufficient
|
|
330
|
-
- **Script Repair** - Helps diagnose why repairs fail and suggest better fixes
|
|
331
|
-
|
|
332
|
-
**✅ Properly modularized with no code duplication:**
|
|
333
|
-
- Shared `LLMFacade` methods
|
|
334
|
-
- Consistent two-step supervisor pattern
|
|
335
|
-
- Single source of truth for prompts and logic
|
|
336
|
-
|