testchimp-runner-core 0.0.35 → 0.0.37

Files changed (81)
  1. package/dist/orchestrator/orchestrator-agent.d.ts.map +1 -1
  2. package/dist/orchestrator/orchestrator-agent.js +7 -4
  3. package/dist/orchestrator/orchestrator-agent.js.map +1 -1
  4. package/dist/orchestrator/orchestrator-prompts.d.ts.map +1 -1
  5. package/dist/orchestrator/orchestrator-prompts.js +73 -15
  6. package/dist/orchestrator/orchestrator-prompts.js.map +1 -1
  7. package/dist/orchestrator/page-som-handler.d.ts +1 -2
  8. package/dist/orchestrator/page-som-handler.d.ts.map +1 -1
  9. package/dist/orchestrator/page-som-handler.js +51 -25
  10. package/dist/orchestrator/page-som-handler.js.map +1 -1
  11. package/package.json +6 -1
  12. package/plandocs/BEFORE_AFTER_VERIFICATION.md +0 -148
  13. package/plandocs/COORDINATE_MODE_DIAGNOSIS.md +0 -144
  14. package/plandocs/CREDIT_CALLBACK_ARCHITECTURE.md +0 -253
  15. package/plandocs/HUMAN_LIKE_IMPROVEMENTS.md +0 -642
  16. package/plandocs/IMPLEMENTATION_STATUS.md +0 -108
  17. package/plandocs/INTEGRATION_COMPLETE.md +0 -322
  18. package/plandocs/MULTI_AGENT_ARCHITECTURE_REVIEW.md +0 -844
  19. package/plandocs/ORCHESTRATOR_MVP_SUMMARY.md +0 -539
  20. package/plandocs/PHASE1_ABSTRACTION_COMPLETE.md +0 -241
  21. package/plandocs/PHASE1_FINAL_STATUS.md +0 -210
  22. package/plandocs/PHASE_1_COMPLETE.md +0 -165
  23. package/plandocs/PHASE_1_SUMMARY.md +0 -184
  24. package/plandocs/PLANNING_SESSION_SUMMARY.md +0 -372
  25. package/plandocs/PROMPT_OPTIMIZATION_ANALYSIS.md +0 -120
  26. package/plandocs/PROMPT_SANITY_CHECK.md +0 -120
  27. package/plandocs/SCRIPT_CLEANUP_FEATURE.md +0 -201
  28. package/plandocs/SCRIPT_GENERATION_ARCHITECTURE.md +0 -364
  29. package/plandocs/SELECTOR_IMPROVEMENTS.md +0 -139
  30. package/plandocs/SESSION_SUMMARY_v0.0.33.md +0 -151
  31. package/plandocs/TROUBLESHOOTING_SESSION.md +0 -72
  32. package/plandocs/VISION_DIAGNOSTICS_IMPROVEMENTS.md +0 -336
  33. package/plandocs/VISUAL_AGENT_EVOLUTION_PLAN.md +0 -396
  34. package/plandocs/WHATS_NEW_v0.0.33.md +0 -183
  35. package/plandocs/exploratory-mode-support-v2.plan.md +0 -953
  36. package/plandocs/exploratory-mode-support.plan.md +0 -928
  37. package/plandocs/journey-id-tracking-addendum.md +0 -227
  38. package/releasenotes/RELEASE_0.0.26.md +0 -165
  39. package/releasenotes/RELEASE_0.0.27.md +0 -236
  40. package/releasenotes/RELEASE_0.0.28.md +0 -286
  41. package/src/auth-config.ts +0 -84
  42. package/src/credit-usage-service.ts +0 -188
  43. package/src/env-loader.ts +0 -103
  44. package/src/execution-service.ts +0 -996
  45. package/src/file-handler.ts +0 -104
  46. package/src/index.ts +0 -432
  47. package/src/llm-facade.ts +0 -821
  48. package/src/llm-provider.ts +0 -53
  49. package/src/model-constants.ts +0 -35
  50. package/src/orchestrator/decision-parser.ts +0 -139
  51. package/src/orchestrator/index.ts +0 -58
  52. package/src/orchestrator/orchestrator-agent.ts +0 -1282
  53. package/src/orchestrator/orchestrator-prompts.ts +0 -786
  54. package/src/orchestrator/page-som-handler.ts +0 -1565
  55. package/src/orchestrator/som-types.ts +0 -188
  56. package/src/orchestrator/tool-registry.ts +0 -184
  57. package/src/orchestrator/tools/check-page-ready.ts +0 -75
  58. package/src/orchestrator/tools/extract-data.ts +0 -92
  59. package/src/orchestrator/tools/index.ts +0 -15
  60. package/src/orchestrator/tools/inspect-page.ts +0 -42
  61. package/src/orchestrator/tools/recall-history.ts +0 -72
  62. package/src/orchestrator/tools/refresh-som-markers.ts +0 -69
  63. package/src/orchestrator/tools/take-screenshot.ts +0 -128
  64. package/src/orchestrator/tools/verify-action-result.ts +0 -159
  65. package/src/orchestrator/tools/view-previous-screenshot.ts +0 -103
  66. package/src/orchestrator/types.ts +0 -291
  67. package/src/playwright-mcp-service.ts +0 -224
  68. package/src/progress-reporter.ts +0 -144
  69. package/src/prompts.ts +0 -842
  70. package/src/providers/backend-proxy-llm-provider.ts +0 -91
  71. package/src/providers/local-llm-provider.ts +0 -38
  72. package/src/scenario-service.ts +0 -252
  73. package/src/scenario-worker-class.ts +0 -1110
  74. package/src/script-utils.ts +0 -203
  75. package/src/types.ts +0 -239
  76. package/src/utils/browser-utils.ts +0 -348
  77. package/src/utils/coordinate-converter.ts +0 -162
  78. package/src/utils/page-info-retry.ts +0 -65
  79. package/src/utils/page-info-utils.ts +0 -285
  80. package/testchimp-runner-core-0.0.35.tgz +0 -0
  81. package/tsconfig.json +0 -19
@@ -1,184 +0,0 @@
1
- # Phase 1 Complete - Summary & Testing Guide
2
-
3
- ## Version: runner-core v0.0.33 ✅
4
-
5
- ---
6
-
7
- ## Implementation Complete
8
-
9
- ### What's New:
10
-
11
- 1. **📝 Note to Future Self**
12
- - Free-form tactical memory between iterations
13
- - Agent writes: "Tried X, failed. Will try Y next."
14
- - Prevents repeated mistakes
15
-
16
- 2. **🎯 Percentage-Based Coordinates**
17
- - Last-resort fallback (3-decimal precision)
18
- - Resolution-independent (works any viewport size)
19
- - Supports: click, fill, drag, hover, scroll
20
-
21
- 3. **⚡ Optimized Iteration Limits**
22
- - Max 5 iterations per step (down from 8)
23
- - 2 coordinate attempts max (coordinates work or they don't)
24
- - Faster feedback on stuck scenarios
25
-
26
- ---
27
-
28
- ## Current Behavior (Phase 1)
29
-
30
- ```
31
- ┌─────────────────────────────────────────────────────┐
32
- │ Iteration 1: Playwright selector │
33
- │ Try: await page.getByRole('button'...).click() │
34
- │ Note: "If this fails, try #id selector" │
35
- │ │
36
- │ Iteration 2: Playwright selector │
37
- │ Read note from iteration 1 │
38
- │ Try: await page.locator('#sidebar-toggle').click()│
39
- │ Note: "If this fails, try SVG child" │
40
- │ │
41
- │ Iteration 3: Playwright selector │
42
- │ Read note from iteration 2 │
43
- │ Try: await page.locator('#sidebar-toggle svg') │
44
- │ → Fails again │
45
- │ │
46
- │ 🎯 COORDINATE MODE ACTIVATED 🎯 │
47
- │ │
48
- │ Iteration 4: Coordinate action │
49
- │ Agent outputs: {xPercent: 5.500, yPercent: 8.250}│
50
- │ Execute: page.mouse.click(88, 66) │
51
- │ → Success! │
52
- │ │
53
- │ OR if fails... │
54
- │ │
55
- │ Iteration 5: Coordinate action (2nd attempt) │
56
- │ Try slightly adjusted coordinates │
57
- │ → If fails: GIVE UP (stuck) │
58
- │ │
59
- │ Total: Max 5 iterations │
60
- └─────────────────────────────────────────────────────┘
61
- ```
62
-
63
- ---
64
-
65
- ## Testing Phase 1
66
-
67
- ### Test 1: PeopleHR Scenario (Previously Failed)
68
-
69
- **Expected outcome:**
70
- - Iteration 1-2: Try text/ID selectors → fail
71
- - Iteration 3: Note says "try SVG child" → succeeds!
72
- - OR Iteration 4: Coordinates → succeeds!
73
-
74
- **Run:**
75
- ```bash
76
- # Via VS extension: "Generate Script" on peoplehr.txt
77
- # Or "Run Test" on peoplehr-corrected.smart.spec.ts
78
- ```
79
-
80
- **Look for in logs:**
81
- ```
82
- 📝 Note to self: ...
83
- 🎯 COORDINATE MODE ACTIVATED
84
- 🎯 Coordinate Action (attempt 1/2): click at (5.500%, 8.250%)
85
- ```
86
-
87
- ### Test 2: Simple Scenario (Should Still Be Fast)
88
-
89
- Create test: `simple-login.txt`
90
- ```
91
- - go to https://example.com/login
92
- - fill username with "alice"
93
- - fill password with "password123"
94
- - click login button
95
- ```
96
-
97
- **Expected:**
98
- - Each step: 1 iteration (Tier 1 success)
99
- - No coordinates needed
100
- - Fast execution
101
-
102
- ### Test 3: Coordinate Fallback
103
-
104
- **Deliberately difficult scenario:**
105
- ```
106
- - go to https://some-app-with-shadow-dom.com
107
- - click on custom web component icon
108
- ```
109
-
110
- **Expected:**
111
- - Iterations 1-3: Selectors fail
112
- - Iteration 4: Coordinates succeed
113
- - Generated script contains: `await page.mouse.click(x, y);`
114
-
115
- ---
116
-
117
- ## Expected Improvements
118
-
119
- ### Metrics to Track:
120
-
121
- 1. **Iteration Efficiency**
122
- - Before: ~4 average iterations per step
123
- - After: ~2.5 average iterations per step (30-40% reduction)
124
-
125
- 2. **Success Rate**
126
- - Before: Stuck on complex UIs (hamburgers, icons, shadow DOM)
127
- - After: Coordinates provide escape hatch
128
-
129
- 3. **Coordinate Usage**
130
- - Target: < 10% of scenarios use coordinates
131
- - Most scenarios still succeed with selectors
132
-
133
- ---
134
-
135
- ## Files Changed
136
-
137
- **New:**
138
- - `src/utils/coordinate-converter.ts` - Percentage conversion utility
139
- - `VISUAL_AGENT_EVOLUTION_PLAN.md` - Complete plan
140
- - `PHASE_1_COMPLETE.md` - Feature documentation
141
- - `IMPLEMENTATION_STATUS.md` - Current status
142
- - `PHASE_1_SUMMARY.md` - This file
143
-
144
- **Modified:**
145
- - `src/orchestrator/types.ts` - Added NoteToFutureSelf, CoordinateAction
146
- - `src/orchestrator/orchestrator-agent.ts` - Note tracking, coordinate handling, mode switching
147
- - `src/scenario-worker-class.ts` - Timeout handling (earlier fix)
148
- - `src/execution-service.ts` - Timeout handling (earlier fix)
149
-
150
- ---
151
-
152
- ## Iteration Budget (Max 5 per Step)
153
-
154
- **Phase 1 (Current):**
155
- ```
156
- Iterations 1-3: Playwright selectors (3 attempts)
157
- Iterations 4-5: Coordinates (2 attempts)
158
- ```
159
-
160
- **Phase 2 (Future - Optimized):**
161
- ```
162
- Iteration 1: Playwright selector (1 attempt) - fast path
163
- Iterations 2-3: Index commands (2 attempts) - reliable fallback
164
- Iterations 4-5: Coordinates (2 attempts) - last resort
165
- ```
166
-
167
- **Benefit of Phase 2:**
168
- - Most scenarios finish in iteration 1 (fast!)
169
- - Complex scenarios use iterations 2-3 (index system)
170
- - Only extreme cases reach iterations 4-5 (coordinates)
171
-
172
- ---
173
-
174
- ## Ready to Test!
175
-
176
- **Current version** (runner-core v0.0.33) is built and ready.
177
-
178
- **Test with:**
179
- 1. VS Code extension "Generate Script" on `peoplehr.txt`
180
- 2. Or "Run Test" on any existing smart test
181
- 3. Check logs for note-to-self and coordinate usage
182
-
183
- **After validating Phase 1 works well, proceed to Phase 2 for numbered element system.**
184
-
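Note on the removed PHASE_1_SUMMARY.md: its flow diagram maps `{xPercent: 5.500, yPercent: 8.250}` to `page.mouse.click(88, 66)`. The sketch below shows one way such a percentage-to-pixel conversion could work. It is not the removed `src/utils/coordinate-converter.ts`; the function name, the types, and the 1600x800 viewport (inferred from the example numbers, not stated in the document) are all assumptions.

```typescript
// Hypothetical helper: a sketch of the idea, not the removed coordinate-converter.ts.
interface Viewport { width: number; height: number; }
interface PercentPoint { xPercent: number; yPercent: number; }
interface PixelPoint { x: number; y: number; }

function percentToPixels(point: PercentPoint, viewport: Viewport): PixelPoint {
  // Multiply before dividing so inputs such as 5.5 * 1600 stay exact,
  // then round to whole pixels.
  return {
    x: Math.round((point.xPercent * viewport.width) / 100),
    y: Math.round((point.yPercent * viewport.height) / 100),
  };
}

// Assumed viewport: 1600x800 is inferred from the example, not given in the source.
const assumedViewport: Viewport = { width: 1600, height: 800 };
const target = percentToPixels({ xPercent: 5.5, yPercent: 8.25 }, assumedViewport);
console.log(target); // { x: 88, y: 66 }
```

The resulting pair would then be passed to Playwright's `page.mouse.click(x, y)`. Storing percentages rather than pixels is what makes the action resolution-independent: the same percentages resolve to different pixel positions on a different viewport.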
@@ -1,372 +0,0 @@
1
- # Planning Session Summary: Orchestrator Agent Architecture
2
-
3
- ## Date: October 11, 2025
4
-
5
- ---
6
-
7
- ## Final Decisions Made
8
-
9
- ### 1. ✅ Self-Reflection in MVP
10
- **Decision**: Include free-form self-reflection with agent-driven loop detection
11
- - Agent outputs `guidanceForNext` (free-form text) for train of thought continuity
12
- - Agent signals `detectingLoop: true` when it notices repetition
13
- - Agent decides when to break own loop, system enforces hard limits as backup
14
-
15
- **Rationale**: Valuable for maintaining context across iterations, agent self-corrects
16
-
17
- ### 2. ✅ No Screenshot Budget
18
- **Decision**: Screenshots available freely, no artificial limits
19
- - Corrected token cost: 1-2K tokens (NOT 100K!)
20
- - For 1920x1080 viewport: ~1,452 tokens (gpt-4.1-mini)
21
- - Comparable to extra DOM context
22
-
23
- **Rationale**: Very affordable, enables liberal vision use throughout journey
24
-
25
- ### 3. ✅ DOM Limits (Increased for Complex Pages)
26
- **Decision**: Increased limits in getEnhancedPageInfo to handle complex UIs
27
- - ARIA tree depth: 4 levels
28
- - Interactive elements: top 50 (was 12)
29
- - IDs: top 50 (was 10)
30
- - Data attributes: top 50 (was 10)
31
- - Form fields: top 20 (was 8)
32
- - Page structure: top 10 (was 6)
33
- - General elements: top 50 (was 15)
34
- - Text: 30 chars max
35
- - Result: ~800-1,500 tokens
36
-
37
- **Rationale**: Complex pages need more context, still compact with truncation
38
-
39
- ### 4. ✅ Token Usage Tracking
40
- **Decision**: Track and report all LLM token usage via callback
41
- - Interface: `onTokensUsed({inputTokens, outputTokens, includesImage})`
42
- - Heuristic: 4 characters = 1 token
43
- - Image tokens: ~1,500 estimate for viewport screenshots
44
- - Reported via ProgressReporter for analytics
45
-
46
- **Rationale**: Cost tracking, optimization, analytics
47
-
48
- ### 5. ✅ Recovery Tools in MVP
49
- **Decision**: Include 3 recovery tools for self-unsticking
50
- - `navigate_back()` - Go back in history
51
- - `refresh_page()` - Reload page
52
- - `navigate_to_url({url})` - Navigate to specific URL (with domain validation)
53
-
54
- **Rationale**: Agent needs ability to recover from bad states (wrong navigation, stuck page, side effects)
55
-
56
- ### 6. ✅ Inquisitive Exploration in Phase 2
57
- **Decision**: Defer exploratory actions to Phase 2, MVP uses workarounds
58
- - Phase 2 tool: `explore_element({action, selector, purpose})`
59
- - Actions: hover, click_info, click_menu, focus
60
- - Safety: State validation, non-consequential only
61
- - **Screenshot handling**: Immediate analysis via sub-agent call
62
- - System takes screenshot after exploration
63
- - Calls agent to analyze screenshot
64
- - Agent extracts learnings (text)
65
- - Only TEXT stored in history, NOT screenshot
66
- - Keeps memory lightweight
67
- - MVP workaround: Use screenshot + DOM analysis + retry
68
-
69
- **Rationale**: Safety concerns, complexity, needs battle-testing first
70
-
71
- ### 7. ✅ Always-Provided Context Structure
72
- **Decision**: Provide comprehensive context automatically each iteration
73
- - Overall goal + current goal
74
- - Current page info (DOM)
75
- - Recent 6-7 steps
76
- - Experiences (learnings)
77
- - Extracted data
78
- - Self-reflection from previous iteration
79
- - Journey progress tracking
80
-
81
- **Rationale**: Agent needs full situational awareness without repeated tool calls
82
-
83
- ### 8. ✅ System vs Agent Guardrails
84
- **Decision**: Clear separation of responsibilities
85
- - **System enforces**: Iteration limits, tool call limits, command limits
86
- - **Agent signals**: Stuck, infeasible, detecting loop
87
- - System has final say, agent provides soft guidance
88
-
89
- **Rationale**: Safety (hard limits) + intelligence (agent self-awareness)
90
-
91
- ---
92
-
93
- ## Architecture Summary
94
-
95
- ### Core Components
96
-
97
- **1. Always-Provided Context** (auto-fetched each iteration)
98
- ```typescript
99
- {
100
- overallGoal, currentStepGoal, stepNumber, totalSteps,
101
- currentPageInfo, currentURL,
102
- recentSteps (6-7), experiences, extractedData,
103
- previousIterationGuidance
104
- }
105
- ```
106
-
107
- **2. Tools** (8 in MVP)
108
- - **Information**: take_screenshot, recall_history, inspect_page, check_page_ready
109
- - **Data**: extract_data
110
- - **Recovery**: navigate_back, refresh_page, navigate_to_url
111
-
112
- **3. Agent Decision Output**
113
- ```typescript
114
- {
115
- toolCalls, toolReasoning, needsToolResults,
116
- commands, commandReasoning,
117
- selfReflection: {guidanceForNext, detectingLoop, loopReasoning},
118
- experiences, memoryUpdate,
119
- status: 'complete' | 'stuck' | 'infeasible' | 'continue',
120
- statusReasoning, reasoning
121
- }
122
- ```
123
-
124
- **4. Sequential Batch Execution**
125
- - Agent plans batch of commands (max 3-5)
126
- - System executes one-by-one
127
- - Stop at first failure
128
- - Record each individually in history
129
-
130
- **5. Comprehensive Logging**
131
- - Every iteration: goal, reasoning, self-reflection, tools, commands, experiences, status
132
- - All thoughts visible for debugging
133
- - Exported via ProgressReporter
134
-
135
- **6. Token Usage Tracking**
136
- - Input + output tokens calculated (4 chars = 1 token)
137
- - Image tokens estimated (~1,500 for viewport)
138
- - Reported via `onTokensUsed()` callback
139
-
140
- ---
141
-
142
- ## MVP vs Phase 2
143
-
144
- ### MVP Includes:
145
- - ✅ 8 core tools (info + data + recovery)
146
- - ✅ Journey memory with experiences
147
- - ✅ Self-reflection + loop detection
148
- - ✅ Batch command planning
149
- - ✅ Self-recovery (navigate back/refresh)
150
- - ✅ Token tracking
151
- - ✅ Comprehensive logging
152
- - ✅ Configurable guardrails
153
-
154
- ### Phase 2 Adds:
155
- - Inquisitive exploration (explore_element)
156
- - Advanced optimizations (caching, adaptive limits)
157
- - Memory summarization for long journeys
158
-
159
- ---
160
-
161
- ## Key Metrics
162
-
163
- ### Token Usage Per Iteration (Estimated)
164
- ```
165
- System prompt: 500 tokens
166
- Always-provided context: 1,200-2,000 tokens
167
- - Goals & progress: 100
168
- - DOM (increased limits): 800-1,500
169
- - Recent 6-7 steps: 300-500
170
- - Experiences: 100-200
171
- Self-reflection: 100 tokens
172
- Tool results (optional): 300-500 tokens
173
- Screenshot (optional): 1,500 tokens
174
-
175
- Total without screenshot: 2,400-3,600 tokens
176
- Total with screenshot: 3,900-5,100 tokens
177
- ```
178
-
179
- ### Expected Performance
180
- - LLM calls/step: 2-4 (vs 4-6 current)
181
- - Iterations/step: 3-5 (vs 8-12 current)
182
- - Tool calls/step: 1-3
183
- - Commands/iteration: 2-3 (batched)
184
- - Agent learns: 1-2 experiences per step
185
-
186
- ---
187
-
188
- ## Inquisitive Exploration Design (Phase 2)
189
-
190
- ### Problem
191
- Menu items are icon-only, no text/ARIA labels → Agent unsure which to click
192
-
193
- ### Solution
194
- Agent investigates non-consequentially, analyzes immediately:
195
-
196
- ```
197
- Iteration N:
198
- Agent Decision: "Need to hover over icons to see tooltips"
199
- Tool: explore_element({action: "hover", selector: "nav button:nth-child(2)"})
200
-
201
- System:
202
- → Hover, wait 500ms
203
- → Take screenshot
204
- → Call agent (sub-call): "What do you see in this screenshot?"
205
-
206
- Agent Analysis (sub-call):
207
- → Sees tooltip
208
- → Returns: "Tooltip shows 'Dashboard' - this is the Dashboard button"
209
-
210
- System:
211
- → Stores TEXT in history: "Explored button, tooltip confirms Dashboard"
212
- → Does NOT store screenshot
213
- → Returns to main agent: {success: true, learning: "Tooltip shows Dashboard"}
214
-
215
- Agent Decision (continues):
216
- → "Great, confirmed it's Dashboard"
217
- → Commands: ["page.click('nav button:nth-child(2)')"]
218
-
219
- System: Execute commands
220
- ```
221
-
222
- **Key difference**: Screenshot analyzed WITHIN same iteration, only text stored
223
-
224
- ### Allowed Actions
225
- - ✅ hover (show tooltips)
226
- - ✅ click_info (info icons)
227
- - ✅ click_menu (expand menus)
228
- - ✅ focus (see input hints)
229
-
230
- ### NOT Allowed
231
- - ❌ Submit forms
232
- - ❌ Delete/remove
233
- - ❌ Logout
234
- - ❌ File uploads
235
-
236
- ### Safety
237
- - State validation (URL, modal count)
238
- - Revert if unexpected navigation
239
- - Budget: 10 explorations per step
240
- - Timeout: 2s per exploration
241
-
242
- ### Why Phase 2
243
- - Safety risk (needs robust validation)
244
- - Complexity (screenshot handling, state comparison)
245
- - MVP workaround: screenshot + DOM + retry
246
-
247
- ---
248
-
249
- ## Implementation Status
250
-
251
- ### Completed (During Planning):
252
- - ✅ Token usage tracking added to interfaces
253
- - ✅ BackendProxyLLMProvider calculates token usage
254
- - ✅ LLMFacade prepared for token callback
255
- - ✅ Progress reporter extended with `onTokensUsed()`
256
-
257
- ### Ready to Implement:
258
- 1. OrchestratorAgent class
259
- 2. ToolRegistry with 8 tools
260
- 3. Journey memory implementation
261
- 4. Always-provided context builder
262
- 5. Self-reflection structures
263
- 6. Recovery tools (navigate_back, refresh, navigate_to)
264
- 7. Comprehensive logging
265
- 8. Token tracking integration
266
-
267
- ### Phase 2:
268
- 1. Exploratory actions (explore_element)
269
- 2. State validation logic
270
- 3. Advanced optimizations
271
-
272
- ---
273
-
274
- ## Documentation Created
275
-
276
- 1. **MULTI_AGENT_ARCHITECTURE_REVIEW.md**
277
- - 8 pitfalls analyzed with mitigations
278
- - Phased implementation strategy
279
- - Risk analysis and trade-offs
280
-
281
- 2. **ORCHESTRATOR_IMPLEMENTATION_PLAN.md**
282
- - Detailed implementation specs
283
- - Code examples
284
- - Integration points
285
-
286
- 3. **ORCHESTRATOR_MVP_SUMMARY.md**
287
- - Executive summary
288
- - Complete feature list
289
- - Inquisitive exploration section
290
- - MVP vs Phase 2 breakdown
291
-
292
- 4. **PLANNING_SESSION_SUMMARY.md** (this document)
293
- - All decisions made
294
- - Rationales
295
- - Implementation status
296
-
297
- ---
298
-
299
- ## Next Steps
300
-
301
- ### When Ready to Implement:
302
- 1. Review all 3 architecture documents
303
- 2. Start with MVP (8 tools, no exploration)
304
- 3. Implement in order:
305
- - Tool registry + tool implementations
306
- - Journey memory structures
307
- - OrchestratorAgent loop
308
- - Integration with ScenarioWorker
309
- - Token tracking integration
310
- - Comprehensive logging
311
- 4. Test with real scenarios
312
- 5. Measure metrics vs current approach
313
- 6. Iterate based on findings
314
- 7. Add Phase 2 features when validated
315
-
316
- ---
317
-
318
- ## Success Criteria (MVP)
319
-
320
- ### Must Have:
321
- - [ ] Fewer iterations than current (target: 50% reduction)
322
- - [ ] Backward compatible (VS Extension & GitHub Runner work)
323
- - [ ] No infinite loops (guardrails work)
324
- - [ ] Memory doesn't bloat
325
- - [ ] Tool extensibility works
326
- - [ ] Token usage tracked accurately
327
-
328
- ### Nice to Have:
329
- - [ ] Fewer LLM calls than current
330
- - [ ] Better success rate
331
- - [ ] Faster execution
332
-
333
- ### Acceptable Trade-offs:
334
- - ⚠️ Slightly higher token usage per iteration (richer context)
335
- - ⚠️ Some tool call overhead
336
- - ⚠️ No exploratory actions in MVP
337
-
338
- ---
339
-
340
- ## Estimated Timeline
341
-
342
- **MVP Implementation**: 2-3 weeks
343
- - Week 1: Foundation (types, tool registry, tool implementations)
344
- - Week 2: Orchestrator loop, integration
345
- - Week 3: Testing, refinement, metrics
346
-
347
- **Phase 2 (Exploration)**: 1-2 weeks after MVP validated
348
-
349
- **Total**: 3-5 weeks for complete solution
350
-
351
- ---
352
-
353
- ## Final Architecture Confidence
354
-
355
- **✅ Ready to implement** with:
356
- - All major decisions finalized
357
- - Trade-offs understood and accepted
358
- - Risks identified with mitigations
359
- - Phased approach reduces implementation risk
360
- - Backward compatibility ensured
361
- - Comprehensive documentation complete
362
-
363
- **Key strengths**:
364
- - Human-like (memory, learning, reflection, recovery)
365
- - Extensible (tool registry, dynamic prompts)
366
- - Safe (system guardrails, agent self-awareness)
367
- - Transparent (comprehensive logging)
368
- - Cost-aware (token tracking)
369
- - Practical (recovery tools, self-unstuck)
370
-
371
- **No blockers to proceed.**
372
-
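Note on the removed PLANNING_SESSION_SUMMARY.md: decision 4 describes a token callback `onTokensUsed({inputTokens, outputTokens, includesImage})` driven by a 4-characters-per-token heuristic and a ~1,500-token estimate per viewport screenshot. A minimal sketch of that estimate follows. The helper names, the rounding mode, and the choice to count image tokens as input are assumptions; the package's actual implementation may differ.

```typescript
// Callback payload shape as described in the removed plan.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  includesImage: boolean;
}

const CHARS_PER_TOKEN = 4;         // heuristic stated in the plan
const IMAGE_TOKEN_ESTIMATE = 1500; // plan's estimate for a viewport screenshot

// Hypothetical helper; rounding to the nearest token is an assumption.
function estimateTextTokens(text: string): number {
  return Math.round(text.length / CHARS_PER_TOKEN);
}

// Hypothetical helper; image tokens are counted as input here (assumption).
function estimateUsage(prompt: string, response: string, includesImage: boolean): TokenUsage {
  const imageTokens = includesImage ? IMAGE_TOKEN_ESTIMATE : 0;
  return {
    inputTokens: estimateTextTokens(prompt) + imageTokens,
    outputTokens: estimateTextTokens(response),
    includesImage,
  };
}

// Example reporter that just collects usage records.
const reported: TokenUsage[] = [];
const onTokensUsed = (usage: TokenUsage): void => {
  reported.push(usage);
};

onTokensUsed(estimateUsage("x".repeat(8000), "y".repeat(400), true));
console.log(reported[0]); // { inputTokens: 3500, outputTokens: 100, includesImage: true }
```

A character-count heuristic is only approximate, so figures produced this way suit trend analytics better than exact billing.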
@@ -1,120 +0,0 @@
1
- # System Prompt Optimization Analysis
2
-
3
- ## Current Stats:
4
- - **System Prompt**: 17,573 chars (346 lines)
5
- - **With Tool Descriptions**: 19,613 chars (~4,903 tokens)
6
- - **Cost per call**: ~$0.0007 (gpt-5-mini input tokens)
7
-
8
- ## Optimization Opportunities:
9
-
10
- ### 1. **Duplicate Examples** (Save ~30%)
11
- **Current**: Multiple example sections with ❌/✅ pairs
12
- - Lines 633-644: Examples section with goto, fill, click examples
13
- - Lines 621-626: Ambiguous text handling examples
14
- - Lines 603-607: DOM snapshot examples
15
- - Lines 615-619: Selector preference list
16
-
17
- **Optimization**: Consolidate into ONE examples section
18
- **Savings**: ~2,000 chars
19
-
20
- ### 2. **Verbose Selector Section** (Save ~20%)
21
- **Current**: Lines 602-644 (42 lines, ~1,800 chars)
22
- - Lists all selector types with emoji
23
- - Detailed examples for each
24
- - Repetitive "Good/Bad" patterns
25
-
26
- **Optimization**: Create compact reference table
27
- ```
28
- SELECTORS (preference order):
29
- 1. getByRole/Label/Placeholder (semantic, stable)
30
- 2. getByText (scope to parent if ambiguous!)
31
- 3. CSS IDs (avoid auto-generated)
32
-
33
- Common mistakes: Missing goto timeout, unscoped getByText, auto-generated IDs
34
- ```
35
- **Savings**: ~1,200 chars
36
-
37
- ### 3. **Emoji Overuse** (Save ~5%)
38
- **Current**: Heavy use of ⚠️, ❌, ✅, 🏆, etc.
39
-
40
- **Optimization**: Use sparingly (only for critical warnings)
41
- **Savings**: ~500 chars
42
-
43
- ### 4. **Redundant "WHY" Explanations** (Save ~10%)
44
- **Current**: Multiple "WHY:" sections explaining rationale
45
- - Line 642-644: WHY semantic selectors
46
- - Similar explanations scattered throughout
47
-
48
- **Optimization**: Remove or consolidate
49
- **Savings**: ~800 chars
50
-
51
- ### 5. **Tool Instructions Redundancy** (Save ~10%)
52
- **Current**: Tools described twice:
53
- - In tool registry (dynamic)
54
- - In prompt rules (static)
55
-
56
- **Optimization**: Rely more on tool registry descriptions
57
- **Savings**: ~600 chars
58
-
59
- ### 6. **Status Rules Repetition** (Save ~5%)
60
- **Current**: Lines 468-486 - Status rules explained multiple times
61
-
62
- **Optimization**: Single concise statement
63
- **Savings**: ~400 chars
64
-
65
- ## Proposed Condensed Structure:
66
-
67
- ```markdown
68
- # System Prompt (Optimized)
69
-
70
- ## Agent Role & Tools
71
- [Tool descriptions from registry]
72
-
73
- ## Response Format (JSON)
74
- {required fields} - minimal format, no extensive comments
75
-
76
- ## Core Rules (Prioritized)
77
- 1. Status decisions (complete/continue/stuck)
78
- 2. Selector strategy (semantic > text > CSS)
79
- 3. Common errors (goto timeout, strict mode, auto-IDs)
80
- 4. When to use tools vs commands
81
- 5. Note to future self usage
82
-
83
- ## Examples (Consolidated)
84
- - Navigation: goto with 30s timeout
85
- - Selectors: Scoped getByText, semantic selectors
86
- - Coordinates: When and how
87
-
88
- ## Advanced Features
89
- - Blocker detection
90
- - Step re-evaluation
91
- - Coordinate fallback
92
- ```
93
-
94
- ## Total Potential Savings:
95
-
96
- - **Before**: 17,573 chars (~4,393 tokens)
97
- - **After**: ~12,000 chars (~3,000 tokens)
98
- - **Reduction**: ~32% reduction in system prompt
99
- - **Cost savings**: ~$0.0002 per call (~30% per call)
100
- - **Overall impact**: With 7 tasks using gpt-4o-mini, only 4 tasks will benefit
101
- - **Est. total savings**: ~5-8% additional cost reduction
102
-
103
- ## Recommendation:
104
-
105
- **Optimize if:**
106
- - You're seeing consistent 500 errors (less likely now with retry)
107
- - Want to maximize caching efficiency
108
- - Running high-volume scenarios (1000+ per day)
109
-
110
- **Skip if:**
111
- - Current cost is acceptable
112
- - Prompt clarity is more important than 5-8% savings
113
- - Risk of quality degradation concerns you
114
-
115
- ## Action Items (if optimizing):
116
-
117
- 1. ✅ Keep: Critical decision logic, JSON format, coordinate mode
118
- 2. ⚠️ Condense: Selector examples, error responses, WHY sections
119
- 3. ❌ Remove: Duplicate examples, excessive emojis, redundant explanations
120
-
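Note on the removed PROMPT_OPTIMIZATION_ANALYSIS.md: the per-section savings can be added up to check the stated totals. The sketch below uses only figures from the document and the 4-characters-per-token heuristic; the variable names are hypothetical. The itemized savings sum to 5,500 chars, which lands close to the stated "~12,000 chars" and "~32%" figures rather than matching them exactly.

```typescript
// Figures copied from the removed analysis; names are illustrative only.
const beforeChars = 17573;
const savingsChars = [2000, 1200, 500, 800, 600, 400]; // sections 1 through 6

const totalSavings = savingsChars.reduce((sum, n) => sum + n, 0);
const afterChars = beforeChars - totalSavings;

// 4 chars per token, rounded to the nearest token (rounding mode is an assumption).
const toTokens = (chars: number): number => Math.round(chars / 4);
const reductionPercent = (totalSavings / beforeChars) * 100;

console.log(totalSavings);                // 5500
console.log(afterChars);                  // 12073
console.log(toTokens(beforeChars));       // 4393
console.log(toTokens(afterChars));        // 3018
console.log(reductionPercent.toFixed(1)); // "31.3"
```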