testchimp-runner-core 0.0.32 → 0.0.34

Files changed (71)
  1. package/dist/llm-facade.d.ts.map +1 -1
  2. package/dist/llm-facade.js +7 -7
  3. package/dist/llm-facade.js.map +1 -1
  4. package/dist/llm-provider.d.ts +9 -0
  5. package/dist/llm-provider.d.ts.map +1 -1
  6. package/dist/model-constants.d.ts +16 -5
  7. package/dist/model-constants.d.ts.map +1 -1
  8. package/dist/model-constants.js +17 -6
  9. package/dist/model-constants.js.map +1 -1
  10. package/dist/orchestrator/index.d.ts +1 -1
  11. package/dist/orchestrator/index.d.ts.map +1 -1
  12. package/dist/orchestrator/index.js +3 -2
  13. package/dist/orchestrator/index.js.map +1 -1
  14. package/dist/orchestrator/orchestrator-agent.d.ts +0 -8
  15. package/dist/orchestrator/orchestrator-agent.d.ts.map +1 -1
  16. package/dist/orchestrator/orchestrator-agent.js +206 -405
  17. package/dist/orchestrator/orchestrator-agent.js.map +1 -1
  18. package/dist/orchestrator/orchestrator-prompts.d.ts +20 -0
  19. package/dist/orchestrator/orchestrator-prompts.d.ts.map +1 -0
  20. package/dist/orchestrator/orchestrator-prompts.js +455 -0
  21. package/dist/orchestrator/orchestrator-prompts.js.map +1 -0
  22. package/dist/orchestrator/tools/index.d.ts +2 -1
  23. package/dist/orchestrator/tools/index.d.ts.map +1 -1
  24. package/dist/orchestrator/tools/index.js +4 -2
  25. package/dist/orchestrator/tools/index.js.map +1 -1
  26. package/dist/orchestrator/tools/verify-action-result.d.ts +17 -0
  27. package/dist/orchestrator/tools/verify-action-result.d.ts.map +1 -0
  28. package/dist/orchestrator/tools/verify-action-result.js +140 -0
  29. package/dist/orchestrator/tools/verify-action-result.js.map +1 -0
  30. package/dist/orchestrator/types.d.ts +26 -0
  31. package/dist/orchestrator/types.d.ts.map +1 -1
  32. package/dist/orchestrator/types.js.map +1 -1
  33. package/dist/prompts.d.ts.map +1 -1
  34. package/dist/prompts.js +87 -37
  35. package/dist/prompts.js.map +1 -1
  36. package/dist/scenario-worker-class.d.ts.map +1 -1
  37. package/dist/scenario-worker-class.js +4 -1
  38. package/dist/scenario-worker-class.js.map +1 -1
  39. package/dist/utils/coordinate-converter.d.ts +32 -0
  40. package/dist/utils/coordinate-converter.d.ts.map +1 -0
  41. package/dist/utils/coordinate-converter.js +130 -0
  42. package/dist/utils/coordinate-converter.js.map +1 -0
  43. package/package.json +1 -1
  44. package/plandocs/BEFORE_AFTER_VERIFICATION.md +148 -0
  45. package/plandocs/COORDINATE_MODE_DIAGNOSIS.md +144 -0
  46. package/plandocs/IMPLEMENTATION_STATUS.md +108 -0
  47. package/plandocs/PHASE_1_COMPLETE.md +165 -0
  48. package/plandocs/PHASE_1_SUMMARY.md +184 -0
  49. package/plandocs/PROMPT_OPTIMIZATION_ANALYSIS.md +120 -0
  50. package/plandocs/PROMPT_SANITY_CHECK.md +120 -0
  51. package/plandocs/SESSION_SUMMARY_v0.0.33.md +151 -0
  52. package/plandocs/TROUBLESHOOTING_SESSION.md +72 -0
  53. package/plandocs/VISUAL_AGENT_EVOLUTION_PLAN.md +396 -0
  54. package/plandocs/WHATS_NEW_v0.0.33.md +183 -0
  55. package/src/llm-facade.ts +8 -8
  56. package/src/llm-provider.ts +11 -1
  57. package/src/model-constants.ts +17 -5
  58. package/src/orchestrator/index.ts +3 -2
  59. package/src/orchestrator/orchestrator-agent.ts +249 -424
  60. package/src/orchestrator/orchestrator-agent.ts.backup +1386 -0
  61. package/src/orchestrator/orchestrator-prompts.ts +474 -0
  62. package/src/orchestrator/tools/index.ts +2 -1
  63. package/src/orchestrator/tools/verify-action-result.ts +159 -0
  64. package/src/orchestrator/types.ts +48 -0
  65. package/src/prompts.ts +87 -37
  66. package/src/scenario-worker-class.ts +7 -2
  67. package/src/utils/coordinate-converter.ts +162 -0
  68. package/testchimp-runner-core-0.0.33.tgz +0 -0
  69. /package/{CREDIT_CALLBACK_ARCHITECTURE.md → plandocs/CREDIT_CALLBACK_ARCHITECTURE.md} +0 -0
  70. /package/{INTEGRATION_COMPLETE.md → plandocs/INTEGRATION_COMPLETE.md} +0 -0
  71. /package/{VISION_DIAGNOSTICS_IMPROVEMENTS.md → plandocs/VISION_DIAGNOSTICS_IMPROVEMENTS.md} +0 -0
@@ -0,0 +1,165 @@
1
+ # Phase 1 Implementation - COMPLETE ✅
2
+
3
+ ## Version: runner-core v0.0.33
4
+
5
+ ## What's Been Implemented
6
+
7
+ ### 1. Free-Form "Note to Future Self"
8
+ **Purpose:** Tactical memory - agent leaves notes that persist across iterations AND steps.
9
+
10
+ **Type:**
11
+ ```typescript
12
+ interface NoteToFutureSelf {
13
+ fromIteration: number;
14
+ content: string; // FREE-FORM - agent writes whatever it wants
15
+ }
16
+ ```
17
+
18
+ **How it works:**
19
+ - Agent includes `"noteToFutureSelf": "..."` in response
20
+ - System stores it in `memory.latestNote` (persists across steps!)
21
+ - Passed to next iteration AND next step
22
+ - Displayed prominently at top of prompt
23
+ - Agent reads it FIRST before making decision
24
+
25
+ **Scope:** Entire scenario journey (not just current step)
26
+
27
+ **Example notes:**
28
+
29
+ *Iteration-specific:*
30
+ - "Tried #sidebar-toggle, failed with 'not clickable'. Will try child SVG element next."
31
+
32
+ *Step-spanning:*
33
+ - "This app has slow-loading modals. Always wait 2s after page load before clicking."
34
+ - "Cookie consent appears on every page. Check for and dismiss it first."
35
+ - "Sidebar only visible on desktop viewport (>1024px width)."
36
+
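The note flow described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: `JourneyMemory`, `AgentResponseSketch`, `rememberNote`, and `prependNote` are hypothetical names, and keeping the previous note when the agent writes no new one is an assumption. Only `NoteToFutureSelf`, `latestNote`, and `noteToFutureSelf` come from the text.

```typescript
// Shape taken from the type shown above.
interface NoteToFutureSelf {
  fromIteration: number;
  content: string; // free-form text written by the agent
}

// Hypothetical container; only the `latestNote` field name comes from the text.
interface JourneyMemory {
  latestNote?: NoteToFutureSelf;
}

// Hypothetical subset of the agent's JSON response.
interface AgentResponseSketch {
  noteToFutureSelf?: string;
}

// Store the note if the agent wrote one. Assumption: when no new note is
// given, the previous note is kept so it survives across iterations and steps.
function rememberNote(
  memory: JourneyMemory,
  iteration: number,
  response: AgentResponseSketch
): void {
  const text = response.noteToFutureSelf?.trim();
  if (text) {
    memory.latestNote = { fromIteration: iteration, content: text };
  }
}

// Put the note ahead of the rest of the prompt so the agent reads it first.
function prependNote(memory: JourneyMemory, promptBody: string): string {
  if (!memory.latestNote) return promptBody;
  const { fromIteration, content } = memory.latestNote;
  return `NOTE FROM YOUR PAST SELF (iteration ${fromIteration}):\n${content}\n\n${promptBody}`;
}
```

Because the memory object lives outside any single step, the same note would be visible to the next iteration and to the next step alike.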
37
+ ### 2. Percentage-Based Coordinate Fallback
38
+ **Purpose:** Last-resort mechanism when selector generation repeatedly fails.
39
+
40
+ **Type:**
41
+ ```typescript
42
+ interface CoordinateAction {
43
+ type: 'coordinate';
44
+ action: 'click' | 'doubleClick' | 'rightClick' | 'hover' | 'drag' | 'fill' | 'scroll';
45
+ xPercent: number; // 0-100, 3 decimal precision
46
+ yPercent: number;
47
+ toXPercent?: number; // For drag
48
+ toYPercent?: number;
49
+ value?: string; // For fill
50
+ scrollAmount?: number; // For scroll
51
+ }
52
+ ```
53
+
54
+ **How it works:**
55
+ - LLM outputs percentages: `{xPercent: 15.755, yPercent: 8.500}`
56
+ - CoordinateConverter converts to pixels: `15.755% → 252px`
57
+ - Generates Playwright command: `await page.mouse.click(252, 68);`
58
+
59
+ **Supported actions:**
60
+ - click, doubleClick, rightClick, hover
61
+ - fill (clicks then types value)
62
+ - drag (from x%,y% to toX%,toY%)
63
+ - scroll (at position, by amount)
64
+
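As a rough sketch of the percentage-to-pixel step, assuming simple rounding to the nearest pixel: the signature below is a guess (the real `percentToPixels()` in `coordinate-converter.ts` may differ), `clickCommand` is a hypothetical helper, the clamping is an added safety assumption, and the 1600x800 viewport is inferred from the `15.755% → 252px` and `(252, 68)` figures above.

```typescript
interface ViewportSize {
  width: number;
  height: number;
}

interface PixelPoint {
  x: number;
  y: number;
}

// Convert percentages (0-100) of the viewport into integer pixel coordinates.
// Assumption: out-of-range values are clamped rather than rejected.
function percentToPixels(
  xPercent: number,
  yPercent: number,
  viewport: ViewportSize
): PixelPoint {
  const clamp = (p: number): number => Math.min(100, Math.max(0, p));
  return {
    x: Math.round((clamp(xPercent) / 100) * viewport.width),
    y: Math.round((clamp(yPercent) / 100) * viewport.height),
  };
}

// Hypothetical helper: emit the Playwright command as a string for the script.
function clickCommand(
  xPercent: number,
  yPercent: number,
  viewport: ViewportSize
): string {
  const { x, y } = percentToPixels(xPercent, yPercent, viewport);
  return `await page.mouse.click(${x}, ${y});`;
}
```

With a 1600x800 viewport, `clickCommand(15.755, 8.5, { width: 1600, height: 800 })` yields the command shown in the list above.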
65
+ ### 3. Two-Tier Auto-Escalation
66
+ **Trigger:** Code-controlled (not LLM-decided)
67
+
68
+ ```
69
+ Tier 1 (iterations 1-3): Playwright Selector Mode
70
+ ├─ Normal buildSystemPrompt()
71
+ ├─ Agent generates: await page.getByRole(...).click()
72
+ ├─ Leaves noteToFutureSelf for continuity
73
+ └─ 3 attempts, then escalate
74
+
75
+ Tier 2 (iterations 4-5): Coordinate Mode
76
+ ├─ Auto-activates when consecutiveFailures >= 3
77
+ ├─ Uses buildCoordinateSystemPrompt()
78
+ ├─ Agent outputs: {xPercent: 15.755, yPercent: 8.500}
79
+ ├─ CoordinateConverter → mouse.click(x, y)
80
+ └─ 2 attempts max, then give up
81
+
82
+ Total: Maximum 5 iterations per step
83
+ ```
84
+
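The code-controlled escalation in the diagram can be sketched like this. It is an illustration under assumptions: `chooseMode`, `runStep`, and the constant names are hypothetical, and the attempt callback is synchronous for brevity where a real implementation would await the browser.

```typescript
type AgentMode = "selector" | "coordinate";

const COORDINATE_MODE_THRESHOLD = 3; // consecutive failures before escalating
const MAX_ITERATIONS_PER_STEP = 5; // 3 selector attempts + 2 coordinate attempts

// The mode is picked by code from the failure count, never by the LLM.
function chooseMode(consecutiveFailures: number): AgentMode {
  return consecutiveFailures >= COORDINATE_MODE_THRESHOLD ? "coordinate" : "selector";
}

// Run one step. `attempt` is a hypothetical callback returning true on success.
function runStep(
  attempt: (mode: AgentMode, iteration: number) => boolean
): { succeeded: boolean; modes: AgentMode[] } {
  const modes: AgentMode[] = [];
  let consecutiveFailures = 0;
  for (let iteration = 1; iteration <= MAX_ITERATIONS_PER_STEP; iteration++) {
    const mode = chooseMode(consecutiveFailures);
    modes.push(mode);
    if (attempt(mode, iteration)) {
      return { succeeded: true, modes };
    }
    consecutiveFailures++;
  }
  return { succeeded: false, modes }; // budget exhausted: give up
}
```

A step whose selectors always fail would try the modes in the order selector, selector, selector, coordinate, coordinate, matching the two tiers above.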
85
+ ### 4. Precision & Accuracy
86
+ - **3 decimal precision** for coordinates (~1px accuracy on most screens)
87
+ - **Resolution-independent** - works on any viewport size
88
+ - **Percentage reference:**
89
+ - Top-left: (0, 0)
90
+ - Top-right: (100, 0)
91
+ - Center: (50, 50)
92
+ - Bottom-right: (100, 100)
93
+
94
+ ## Files Modified
95
+
96
+ 1. **orchestrator/types.ts**
97
+ - Added `NoteToFutureSelf` interface
98
+ - Added `CoordinateAction` interface
99
+ - Updated `AgentDecision` with new fields
100
+ - Updated `AgentContext` with noteFromPreviousIteration
101
+
102
+ 2. **orchestrator/orchestrator-agent.ts**
103
+ - Added note tracking in executeStep()
104
+ - Added coordinate action execution
105
+ - Added buildCoordinateSystemPrompt()
106
+ - Updated buildUserPrompt() to display notes
107
+ - Added mode switching in callAgent()
108
+ - Updated response format documentation
109
+
110
+ 3. **utils/coordinate-converter.ts** (NEW)
111
+ - percentToPixels() - Convert % to pixels
112
+ - getViewportSize() - Get current viewport dimensions
113
+ - generateCommands() - Create Playwright commands from percentages
114
+ - executeAction() - Direct execution helper
115
+
116
+ 4. **scenario-worker-class.ts** (Earlier fix)
117
+ - Smart timeout handling for waitForLoadState
118
+
119
+ 5. **execution-service.ts** (Earlier fix)
120
+ - Smart timeout handling for navigation commands
121
+
122
+ ## How to Use
123
+
124
+ **No code changes needed!** The features activate automatically:
125
+
126
+ 1. **Note to self:** Agent can optionally include `noteToFutureSelf` in any iteration
127
+ 2. **Coordinates:** Auto-activate at iteration 4 if selectors keep failing
128
+
129
+ ## Testing Phase 1
130
+
131
+ To validate the implementation:
132
+
133
+ 1. **Run PeopleHR scenario** (previously failed on hamburger menu)
134
+ - Should now succeed with note guidance
135
+ - May use coordinates if SVG selector still fails
136
+
137
+ 2. **Check logs for:**
138
+ - `📝 Note to self: ...` (agent leaving tactical notes)
139
+ - `🎯 COORDINATE MODE ACTIVATED` (tier 2 triggered)
140
+ - `🎯 Coordinate Action: click at (X%, Y%)` (using fallback)
141
+
142
+ 3. **Expected improvements:**
143
+ - 20-30% fewer iterations per step (thanks to notes)
144
+ - < 5% of scenarios need coordinate fallback
145
+ - Coordinates work when everything else fails
146
+
147
+ ## Phase 2 Preview (Not Yet Implemented)
148
+
149
+ When Phase 2 is added, it will become a **three-tier** system:
150
+ - Tier 1 (iterations 1-2): Playwright selectors
151
+ - Tier 2 (iterations 3-4): Numbered elements (CLICK[3])
152
+ - Tier 3 (iterations 5+): Percentage coordinates
153
+
154
+ Phase 2 adds visual markers [1], [2], [3] on elements with structured commands.
155
+
156
+ ---
157
+
158
+ ## Status: ✅ READY FOR TESTING
159
+
160
+ Runner-core v0.0.33 is built and ready. Test it with:
161
+ - VS Code extension "Run Test" on peoplehr-corrected.smart.spec.ts
162
+ - Or generate new script from peoplehr.txt scenario
163
+
164
+ **Next:** Validate Phase 1 works before starting Phase 2.
165
+
@@ -0,0 +1,184 @@
1
+ # Phase 1 Complete - Summary & Testing Guide
2
+
3
+ ## Version: runner-core v0.0.33 ✅
4
+
5
+ ---
6
+
7
+ ## Implementation Complete
8
+
9
+ ### What's New:
10
+
11
+ 1. **📝 Note to Future Self**
12
+ - Free-form tactical memory between iterations
13
+ - Agent writes: "Tried X, failed. Will try Y next."
14
+ - Prevents repeated mistakes
15
+
16
+ 2. **🎯 Percentage-Based Coordinates**
17
+ - Last-resort fallback (3-decimal precision)
18
+ - Resolution-independent (works on any viewport size)
19
+ - Supports: click, fill, drag, hover, scroll
20
+
21
+ 3. **⚡ Optimized Iteration Limits**
22
+ - Max 5 iterations per step (down from 8)
23
+ - 2 coordinate attempts max (coordinates work or they don't)
24
+ - Faster feedback on stuck scenarios
25
+
26
+ ---
27
+
28
+ ## Current Behavior (Phase 1)
29
+
30
+ ```
31
+ ┌─────────────────────────────────────────────────────┐
32
+ │ Iteration 1: Playwright selector │
33
+ │ Try: await page.getByRole('button'...).click() │
34
+ │ Note: "If this fails, try #id selector" │
35
+ │ │
36
+ │ Iteration 2: Playwright selector │
37
+ │ Read note from iteration 1 │
38
+ │ Try: await page.locator('#sidebar-toggle').click()│
39
+ │ Note: "If this fails, try SVG child" │
40
+ │ │
41
+ │ Iteration 3: Playwright selector │
42
+ │ Read note from iteration 2 │
43
+ │ Try: await page.locator('#sidebar-toggle svg') │
44
+ │ → Fails again │
45
+ │ │
46
+ │ 🎯 COORDINATE MODE ACTIVATED 🎯 │
47
+ │ │
48
+ │ Iteration 4: Coordinate action │
49
+ │ Agent outputs: {xPercent: 5.500, yPercent: 8.250}│
50
+ │ Execute: page.mouse.click(88, 66) │
51
+ │ → Success! │
52
+ │ │
53
+ │ OR if fails... │
54
+ │ │
55
+ │ Iteration 5: Coordinate action (2nd attempt) │
56
+ │ Try slightly adjusted coordinates │
57
+ │ → If fails: GIVE UP (stuck) │
58
+ │ │
59
+ │ Total: Max 5 iterations │
60
+ └─────────────────────────────────────────────────────┘
61
+ ```
62
+
63
+ ---
64
+
65
+ ## Testing Phase 1
66
+
67
+ ### Test 1: PeopleHR Scenario (Previously Failed)
68
+
69
+ **Expected outcome:**
70
+ - Iterations 1-2: Try text/ID selectors → fail
71
+ - Iteration 3: Note says "try SVG child" → succeeds!
72
+ - OR Iteration 4: Coordinates → succeeds!
73
+
74
+ **Run:**
75
+ ```bash
76
+ # Via VS Code extension: "Generate Script" on peoplehr.txt
77
+ # Or "Run Test" on peoplehr-corrected.smart.spec.ts
78
+ ```
79
+
80
+ **Look for in logs:**
81
+ ```
82
+ 📝 Note to self: ...
83
+ 🎯 COORDINATE MODE ACTIVATED
84
+ 🎯 Coordinate Action (attempt 1/2): click at (5.500%, 8.250%)
85
+ ```
86
+
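To check a run for these markers without reading the log by eye, a small scanner along these lines could help. It is a sketch only: `summarizeLog` and its types are hypothetical, and the regular expression assumes the log lines look exactly like the samples above (with the attempt counter optional).

```typescript
interface CoordinateLogEntry {
  action: string;
  xPercent: number;
  yPercent: number;
  attempt?: number;
}

interface LogSummary {
  noteCount: number;
  coordinateModeActivated: boolean;
  coordinateActions: CoordinateLogEntry[];
}

// Matches e.g. "Coordinate Action (attempt 1/2): click at (5.500%, 8.250%)".
const COORDINATE_ACTION_PATTERN =
  /Coordinate Action(?: \(attempt (\d+)\/\d+\))?: (\w+) at \(([\d.]+)%, ([\d.]+)%\)/;

function summarizeLog(lines: string[]): LogSummary {
  const summary: LogSummary = {
    noteCount: 0,
    coordinateModeActivated: false,
    coordinateActions: [],
  };
  for (const line of lines) {
    if (line.includes("Note to self:")) summary.noteCount++;
    if (line.includes("COORDINATE MODE ACTIVATED")) summary.coordinateModeActivated = true;
    const m = COORDINATE_ACTION_PATTERN.exec(line);
    if (m) {
      const entry: CoordinateLogEntry = {
        action: m[2] ?? "",
        xPercent: parseFloat(m[3] ?? ""),
        yPercent: parseFloat(m[4] ?? ""),
      };
      if (m[1] !== undefined) entry.attempt = parseInt(m[1], 10);
      summary.coordinateActions.push(entry);
    }
  }
  return summary;
}
```

Matching on the text rather than the emoji keeps the scanner tolerant of terminals that strip or re-encode the icons.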
87
+ ### Test 2: Simple Scenario (Should Still Be Fast)
88
+
89
+ Create test: `simple-login.txt`
90
+ ```
91
+ - go to https://example.com/login
92
+ - fill username with "alice"
93
+ - fill password with "password123"
94
+ - click login button
95
+ ```
96
+
97
+ **Expected:**
98
+ - Each step: 1 iteration (Tier 1 success)
99
+ - No coordinates needed
100
+ - Fast execution
101
+
102
+ ### Test 3: Coordinate Fallback
103
+
104
+ **Deliberately difficult scenario:**
105
+ ```
106
+ - go to https://some-app-with-shadow-dom.com
107
+ - click on custom web component icon
108
+ ```
109
+
110
+ **Expected:**
111
+ - Iterations 1-3: Selectors fail
112
+ - Iteration 4: Coordinates succeed
113
+ - Generated script contains: `await page.mouse.click(x, y);`
114
+
115
+ ---
116
+
117
+ ## Expected Improvements
118
+
119
+ ### Metrics to Track:
120
+
121
+ 1. **Iteration Efficiency**
122
+ - Before: ~4 average iterations per step
123
+ - After: ~2.5 average iterations per step (30-40% reduction)
124
+
125
+ 2. **Success Rate**
126
+ - Before: Stuck on complex UIs (hamburgers, icons, shadow DOM)
127
+ - After: Coordinates provide escape hatch
128
+
129
+ 3. **Coordinate Usage**
130
+ - Target: < 10% of scenarios use coordinates
131
+ - Most scenarios still succeed with selectors
132
+
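One way to track these metrics is sketched below. The record shapes and function names are hypothetical, not part of runner-core; the only figures taken from the text are the ~4 and ~2.5 iteration averages.

```typescript
interface StepRecord {
  iterations: number;
  usedCoordinates: boolean;
}

interface ScenarioRecord {
  steps: StepRecord[];
}

// Mean number of iterations across every step of every scenario.
function averageIterationsPerStep(scenarios: ScenarioRecord[]): number {
  let totalIterations = 0;
  let totalSteps = 0;
  for (const scenario of scenarios) {
    for (const step of scenario.steps) {
      totalIterations += step.iterations;
      totalSteps++;
    }
  }
  return totalSteps === 0 ? 0 : totalIterations / totalSteps;
}

// Fraction (0-1) of scenarios in which at least one step fell back to coordinates.
function coordinateScenarioShare(scenarios: ScenarioRecord[]): number {
  if (scenarios.length === 0) return 0;
  const withCoordinates = scenarios.filter((s) =>
    s.steps.some((step) => step.usedCoordinates)
  ).length;
  return withCoordinates / scenarios.length;
}

// Relative reduction in percent, e.g. from ~4 to ~2.5 iterations per step.
function percentReduction(before: number, after: number): number {
  return ((before - after) / before) * 100;
}
```

Going from 4 to 2.5 average iterations is a 37.5% reduction, which sits inside the 30-40% range quoted above.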
133
+ ---
134
+
135
+ ## Files Changed
136
+
137
+ **New:**
138
+ - `src/utils/coordinate-converter.ts` - Percentage conversion utility
139
+ - `VISUAL_AGENT_EVOLUTION_PLAN.md` - Complete plan
140
+ - `PHASE_1_COMPLETE.md` - Feature documentation
141
+ - `IMPLEMENTATION_STATUS.md` - Current status
142
+ - `PHASE_1_SUMMARY.md` - This file
143
+
144
+ **Modified:**
145
+ - `src/orchestrator/types.ts` - Added NoteToFutureSelf, CoordinateAction
146
+ - `src/orchestrator/orchestrator-agent.ts` - Note tracking, coordinate handling, mode switching
147
+ - `src/scenario-worker-class.ts` - Timeout handling (earlier fix)
148
+ - `src/execution-service.ts` - Timeout handling (earlier fix)
149
+
150
+ ---
151
+
152
+ ## Iteration Budget (Max 5 per Step)
153
+
154
+ **Phase 1 (Current):**
155
+ ```
156
+ Iterations 1-3: Playwright selectors (3 attempts)
157
+ Iterations 4-5: Coordinates (2 attempts)
158
+ ```
159
+
160
+ **Phase 2 (Future - Optimized):**
161
+ ```
162
+ Iteration 1: Playwright selector (1 attempt) - fast path
163
+ Iterations 2-3: Index commands (2 attempts) - reliable fallback
164
+ Iterations 4-5: Coordinates (2 attempts) - last resort
165
+ ```
166
+
167
+ **Benefit of Phase 2:**
168
+ - Most scenarios finish in iteration 1 (fast!)
169
+ - Complex scenarios use iterations 2-3 (index system)
170
+ - Only extreme cases reach iterations 4-5 (coordinates)
171
+
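The two budgets can be expressed as data, which makes the Phase 1 to Phase 2 change a table edit rather than a logic change. This is a sketch under assumptions: `BudgetTier`, the two constants, and `strategyForIteration` are hypothetical names, and the Phase 2 table reflects the plan in this file, not shipped behavior.

```typescript
type Strategy = "selector" | "index" | "coordinate";

interface BudgetTier {
  strategy: Strategy;
  attempts: number;
}

// Phase 1 (current): 3 selector attempts, then 2 coordinate attempts.
const PHASE_1_BUDGET: BudgetTier[] = [
  { strategy: "selector", attempts: 3 },
  { strategy: "coordinate", attempts: 2 },
];

// Phase 2 (planned): 1 selector, 2 index, 2 coordinate attempts.
const PHASE_2_BUDGET: BudgetTier[] = [
  { strategy: "selector", attempts: 1 },
  { strategy: "index", attempts: 2 },
  { strategy: "coordinate", attempts: 2 },
];

// Map a 1-based iteration number to its strategy, or null once the
// budget is exhausted (the step is then considered stuck).
function strategyForIteration(budget: BudgetTier[], iteration: number): Strategy | null {
  if (iteration < 1) return null;
  let lastIterationOfTier = 0;
  for (const tier of budget) {
    lastIterationOfTier += tier.attempts;
    if (iteration <= lastIterationOfTier) return tier.strategy;
  }
  return null;
}
```

Both tables sum to 5 attempts, so the per-step maximum stays the same across phases.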
172
+ ---
173
+
174
+ ## Ready to Test!
175
+
176
+ **Current version** (runner-core v0.0.33) is built and ready.
177
+
178
+ **Test with:**
179
+ 1. VS Code extension "Generate Script" on `peoplehr.txt`
180
+ 2. Or "Run Test" on any existing smart test
181
+ 3. Check logs for note-to-self and coordinate usage
182
+
183
+ **After validating Phase 1 works well, proceed to Phase 2 for numbered element system.**
184
+
@@ -0,0 +1,120 @@
1
+ # System Prompt Optimization Analysis
2
+
3
+ ## Current Stats:
4
+ - **System Prompt**: 17,573 chars (346 lines)
5
+ - **With Tool Descriptions**: 19,613 chars (~4,903 tokens)
6
+ - **Cost per call**: ~$0.0007 (gpt-5-mini input tokens)
7
+
8
+ ## Optimization Opportunities:
9
+
10
+ ### 1. **Duplicate Examples** (Save ~30%)
11
+ **Current**: Multiple example sections with ❌/✅ pairs
12
+ - Lines 633-644: Examples section with goto, fill, click examples
13
+ - Lines 621-626: Ambiguous text handling examples
14
+ - Lines 603-607: DOM snapshot examples
15
+ - Lines 615-619: Selector preference list
16
+
17
+ **Optimization**: Consolidate into ONE examples section
18
+ **Savings**: ~2,000 chars
19
+
20
+ ### 2. **Verbose Selector Section** (Save ~20%)
21
+ **Current**: Lines 602-644 (43 lines, ~1,800 chars)
22
+ - Lists all selector types with emoji
23
+ - Detailed examples for each
24
+ - Repetitive "Good/Bad" patterns
25
+
26
+ **Optimization**: Create compact reference table
27
+ ```
28
+ SELECTORS (preference order):
29
+ 1. getByRole/Label/Placeholder (semantic, stable)
30
+ 2. getByText (scope to parent if ambiguous!)
31
+ 3. CSS IDs (avoid auto-generated)
32
+
33
+ Common mistakes: Missing goto timeout, unscoped getByText, auto-generated IDs
34
+ ```
35
+ **Savings**: ~1,200 chars
36
+
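To make the preference order in the compact table concrete, here is a small ranking sketch. `selectorRank` and `sortByPreference` are hypothetical helpers, not part of runner-core, and the pattern matching is deliberately naive: it classifies a locator expression by the first pattern that matches and does not try to detect auto-generated IDs.

```typescript
// Rank 1 is most preferred; higher numbers are less preferred.
function selectorRank(locatorCode: string): number {
  if (/\.getBy(Role|Label|Placeholder)\(/.test(locatorCode)) return 1; // semantic, stable
  if (/\.getByText\(/.test(locatorCode)) return 2; // text, scope to parent if ambiguous
  if (/\.locator\(\s*['"`]#/.test(locatorCode)) return 3; // CSS ID
  return 4; // anything else
}

// Return a new array ordered from most to least preferred.
function sortByPreference(candidates: string[]): string[] {
  return [...candidates].sort((a, b) => selectorRank(a) - selectorRank(b));
}
```

For example, given a CSS ID locator, a `getByText` locator, and a `getByRole` locator for the same button, the `getByRole` candidate sorts first.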
37
+ ### 3. **Emoji Overuse** (Save ~5%)
38
+ **Current**: Heavy use of ⚠️, ❌, ✅, 🏆, etc.
39
+
40
+ **Optimization**: Use sparingly (only for critical warnings)
41
+ **Savings**: ~500 chars
42
+
43
+ ### 4. **Redundant "WHY" Explanations** (Save ~10%)
44
+ **Current**: Multiple "WHY:" sections explaining rationale
45
+ - Lines 642-644: WHY semantic selectors
46
+ - Similar explanations scattered throughout
47
+
48
+ **Optimization**: Remove or consolidate
49
+ **Savings**: ~800 chars
50
+
51
+ ### 5. **Tool Instructions Redundancy** (Save ~10%)
52
+ **Current**: Tools described twice:
53
+ - In tool registry (dynamic)
54
+ - In prompt rules (static)
55
+
56
+ **Optimization**: Rely more on tool registry descriptions
57
+ **Savings**: ~600 chars
58
+
59
+ ### 6. **Status Rules Repetition** (Save ~5%)
60
+ **Current**: Lines 468-486 - Status rules explained multiple times
61
+
62
+ **Optimization**: Single concise statement
63
+ **Savings**: ~400 chars
64
+
65
+ ## Proposed Condensed Structure:
66
+
67
+ ```markdown
68
+ # System Prompt (Optimized)
69
+
70
+ ## Agent Role & Tools
71
+ [Tool descriptions from registry]
72
+
73
+ ## Response Format (JSON)
74
+ {required fields} - minimal format, no extensive comments
75
+
76
+ ## Core Rules (Prioritized)
77
+ 1. Status decisions (complete/continue/stuck)
78
+ 2. Selector strategy (semantic > text > CSS)
79
+ 3. Common errors (goto timeout, strict mode, auto-IDs)
80
+ 4. When to use tools vs commands
81
+ 5. Note to future self usage
82
+
83
+ ## Examples (Consolidated)
84
+ - Navigation: goto with 30s timeout
85
+ - Selectors: Scoped getByText, semantic selectors
86
+ - Coordinates: When and how
87
+
88
+ ## Advanced Features
89
+ - Blocker detection
90
+ - Step re-evaluation
91
+ - Coordinate fallback
92
+ ```
93
+
94
+ ## Total Potential Savings:
95
+
96
+ - **Before**: 17,573 chars (~4,393 tokens)
97
+ - **After**: ~12,000 chars (~3,000 tokens)
98
+ - **Reduction**: ~32% of the system prompt
99
+ - **Cost savings**: ~$0.0002 per call (~30% per call)
100
+ - **Overall impact**: With 7 tasks using gpt-4o-mini, only 4 tasks will benefit
101
+ - **Est. total savings**: ~5-8% additional cost reduction
102
+
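The totals above can be cross-checked with a few lines of arithmetic. This is a sketch: the 4-characters-per-token ratio is an assumption inferred from the figures in this file (19,613 chars ≈ 4,903 tokens), and the item names are just labels for the six savings estimates listed earlier.

```typescript
// Assumption inferred from the stats above: roughly 4 characters per token.
const CHARS_PER_TOKEN = 4;

// Per-item savings estimates, in characters, from the six sections above.
const savingsEstimates: { item: string; chars: number }[] = [
  { item: "duplicate examples", chars: 2000 },
  { item: "verbose selector section", chars: 1200 },
  { item: "emoji overuse", chars: 500 },
  { item: "redundant WHY explanations", chars: 800 },
  { item: "tool instructions redundancy", chars: 600 },
  { item: "status rules repetition", chars: 400 },
];

function estimateTokens(chars: number): number {
  return Math.round(chars / CHARS_PER_TOKEN);
}

const currentChars = 17573;
const totalSavedChars = savingsEstimates.reduce((sum, s) => sum + s.chars, 0);
const optimizedChars = currentChars - totalSavedChars;
const reductionPercent = (totalSavedChars / currentChars) * 100;

console.log(`saved ${totalSavedChars} chars, ${optimizedChars} remain`);
console.log(`~${estimateTokens(currentChars)} -> ~${estimateTokens(optimizedChars)} tokens`);
console.log(`reduction: ${reductionPercent.toFixed(1)}%`);
```

The itemized estimates add up to 5,500 characters, leaving about 12,073, which is in line with the rounded ~12,000 chars and roughly one-third reduction stated above.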
103
+ ## Recommendation:
104
+
105
+ **Optimize if:**
106
+ - You're seeing consistent 500 errors (less likely now with retry)
107
+ - You want to maximize caching efficiency
108
+ - You're running high-volume scenarios (1000+ per day)
109
+
110
+ **Skip if:**
111
+ - Current cost is acceptable
112
+ - Prompt clarity is more important than 5-8% savings
113
+ - Risk of quality degradation concerns you
114
+
115
+ ## Action Items (if optimizing):
116
+
117
+ 1. ✅ Keep: Critical decision logic, JSON format, coordinate mode
118
+ 2. ⚠️ Condense: Selector examples, error responses, WHY sections
119
+ 3. ❌ Remove: Duplicate examples, excessive emojis, redundant explanations
120
+
@@ -0,0 +1,120 @@
1
+ # Prompt Sanity Check - Runner-Core v0.0.33
2
+
3
+ ## ✅ STRENGTHS
4
+
5
+ ### System Prompt (`buildSystemPrompt`)
6
+ - ✅ Required fields clearly marked at top (status, reasoning, statusReasoning)
7
+ - ✅ Comprehensive JSON format with examples
8
+ - ✅ Clear status decision rules
9
+ - ✅ Good blocker detection guidance
10
+ - ✅ Semantic selector preference clearly explained with examples
11
+ - ✅ Tool vs command distinction is clear
12
+ - ✅ Coordinate fallback documented
13
+
14
+ ### User Prompt (`buildUserPrompt`)
15
+ - ✅ Static content first (cache-optimized)
16
+ - ✅ Dynamic content last (current state, page info)
17
+ - ✅ Notes from previous iteration shown prominently
18
+ - ✅ Clear warnings for consecutive failures
19
+ - ✅ Coordinate mode trigger clear
20
+
21
+ ## ⚠️ ISSUES FOUND
22
+
23
+ ### 1. **Duplication/Redundancy**
24
+ - ❌ "Use semantic selectors" mentioned in:
25
+ - System prompt (line ~605: "SELECTOR PREFERENCE")
26
+ - User prompt (line ~860: "SELECTOR STRATEGY")
27
+ - **FIX**: Remove from user prompt, keep in system prompt only
28
+
29
+ ### 2. **Length Concerns**
30
+ - ⚠️ System prompt is ~325 lines (very long)
31
+ - ⚠️ May cause LLM to miss critical details in the middle
32
+ - **SUGGESTION**: Consider breaking into sections or condensing
33
+
34
+ ### 3. **Conflicting Guidance**
35
+ - ⚠️ Line ~469: "stuck: Tried 3+ iterations"
36
+ - But coordinate mode triggers at 3 failures (line ~904)
37
+ - **FIX**: Clarify that "stuck" means 5 attempts total (3 regular + 2 coordinate)
38
+
39
+ ### 4. **Unclear Iteration Count**
40
+ - ❌ Line ~714: "When iteration count reaches 4+"
41
+ - ❌ Line ~748: "iteration 4+"
42
+ - ✅ But code triggers at 3 failures
43
+ - **FIX**: Update prompt to say "iteration 4+ (after 3 consecutive failures)"; zero-based iterations 0, 1, 2 are the 3 failures, so the next one (index 3) is the 4th iteration
44
+
45
+ ### 5. **Missing Information**
46
+ - ❌ Max iterations per step not mentioned (code has 5)
47
+ - **FIX**: Add to system prompt: "MAX 5 iterations per step"
48
+
49
+ ### 6. **Verbosity**
50
+ - ⚠️ Examples section (lines ~617-628) is great but long
51
+ - ⚠️ Multiple emoji warnings (⚠️⚠️⚠️) can be reduced to a single ⚠️
52
+ - **SUGGESTION**: Keep examples, reduce emoji spam
53
+
54
+ ## 🔧 RECOMMENDED FIXES
55
+
56
+ ### Priority 1 (Critical):
57
+ 1. Remove duplicate selector strategy from user prompt
58
+ 2. Clarify max iterations (5 total)
59
+ 3. Fix coordinate mode iteration number (4th iteration = after 3 failures)
60
+
61
+ ### Priority 2 (Nice to have):
62
+ 4. Condense system prompt if possible (target: 250 lines)
63
+ 5. Reduce emoji overuse
64
+ 6. Add section headers in system prompt for clarity
65
+
66
+ ## 📊 PROMPT STRUCTURE ANALYSIS
67
+
68
+ ### System Prompt Sections:
69
+ 1. Introduction (1 line)
70
+ 2. Tool descriptions (dynamic, from registry)
71
+ 3. JSON format (40 lines) ✅
72
+ 4. Status rules (15 lines) ✅
73
+ 5. Step re-evaluation (20 lines) ✅
74
+ 6. Blocker detection (25 lines) ✅
75
+ 7. Experiences (25 lines) ✅
76
+ 8. Critical rules (200 lines) ⚠️ TOO LONG
77
+ 9. Coordinate actions (45 lines) ✅
78
+
79
+ **TOTAL**: ~370 lines (with tool descriptions)
80
+
81
+ ### User Prompt Sections:
82
+ 1. Static instructions (20 lines) - **Cache-friendly** ✅
83
+ 2. Dynamic context marker (1 line) ✅
84
+ 3. Notes from previous iteration (5 lines) ✅
85
+ 4. Warnings for failures (15 lines) ✅
86
+ 5. Coordinate mode trigger (8 lines) ✅
87
+ 6. Current step goal (10 lines) ✅
88
+ 7. Page state (50-100 lines, variable) ✅
89
+ 8. Recent steps (20-50 lines, variable) ✅
90
+ 9. Experiences (10 lines) ✅
91
+
92
+ **TOTAL**: ~140-220 lines per call
93
+
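The static-first, dynamic-last ordering can be sketched as below. This is an illustration, not the package's `buildUserPrompt`: the `DynamicContext` shape, the section labels, the marker text, and `sharedPrefixLength` are all hypothetical; only the ordering of sections and the `noteFromPreviousIteration` field name come from the documents in this diff.

```typescript
// Hypothetical per-call context; field order mirrors the section list above.
interface DynamicContext {
  noteFromPreviousIteration?: string;
  failureWarnings: string[];
  stepGoal: string;
  pageState: string;
  recentSteps: string[];
}

// Placeholder text; the real static instructions are about 20 lines.
const STATIC_INSTRUCTIONS = "You are executing one step of a test scenario.";
const DYNAMIC_MARKER = "=== DYNAMIC CONTEXT ===";

// Static text always comes first so every call shares the same prefix,
// which is what prefix-based prompt caching can reuse.
function buildUserPromptSketch(ctx: DynamicContext): string {
  const parts: string[] = [STATIC_INSTRUCTIONS, DYNAMIC_MARKER];
  if (ctx.noteFromPreviousIteration) {
    parts.push(`NOTE FROM PREVIOUS ITERATION:\n${ctx.noteFromPreviousIteration}`);
  }
  parts.push(...ctx.failureWarnings);
  parts.push(`STEP GOAL:\n${ctx.stepGoal}`);
  parts.push(`PAGE STATE:\n${ctx.pageState}`);
  if (ctx.recentSteps.length > 0) {
    parts.push(`RECENT STEPS:\n${ctx.recentSteps.join("\n")}`);
  }
  return parts.join("\n\n");
}

// Length of the common prefix of two prompts; a rough proxy for cacheable text.
function sharedPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}
```

Comparing two prompts built for different steps shows that at least the static block and the marker are shared, while everything after the marker varies per call.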
94
+ ## 🎯 RECOMMENDATION SUMMARY
95
+
96
+ **Keep as-is:**
97
+ - JSON structure
98
+ - Semantic selector examples
99
+ - Blocker detection
100
+ - Note to future self
101
+ - Coordinate fallback
102
+ - Cache optimization
103
+
104
+ **Fix:**
105
+ - Remove selector duplication in user prompt
106
+ - Clarify iteration counts
107
+ - Add max iteration limit
108
+ - Reduce emoji spam
109
+
110
+ **Consider:**
111
+ - Condensing "Critical Rules" section (currently 200 lines)
112
+ - Moving some examples to external docs
113
+ - Breaking long sections with clear headers
114
+
115
+ ## Overall Assessment: **8/10**
116
+ - Prompts are comprehensive and well-structured
117
+ - Main issues are length and minor redundancies
118
+ - Cache optimization is excellent
119
+ - A few clarity fixes needed for iteration counts
120
+