testchimp-runner-core 0.0.21 → 0.0.23

Files changed (146)
  1. package/VISION_DIAGNOSTICS_IMPROVEMENTS.md +336 -0
  2. package/dist/credit-usage-service.d.ts +9 -0
  3. package/dist/credit-usage-service.d.ts.map +1 -1
  4. package/dist/credit-usage-service.js +20 -5
  5. package/dist/credit-usage-service.js.map +1 -1
  6. package/dist/execution-service.d.ts +7 -2
  7. package/dist/execution-service.d.ts.map +1 -1
  8. package/dist/execution-service.js +91 -36
  9. package/dist/execution-service.js.map +1 -1
  10. package/dist/index.d.ts +30 -2
  11. package/dist/index.d.ts.map +1 -1
  12. package/dist/index.js +91 -26
  13. package/dist/index.js.map +1 -1
  14. package/dist/llm-facade.d.ts +64 -8
  15. package/dist/llm-facade.d.ts.map +1 -1
  16. package/dist/llm-facade.js +361 -109
  17. package/dist/llm-facade.js.map +1 -1
  18. package/dist/llm-provider.d.ts +39 -0
  19. package/dist/llm-provider.d.ts.map +1 -0
  20. package/dist/llm-provider.js +7 -0
  21. package/dist/llm-provider.js.map +1 -0
  22. package/dist/model-constants.d.ts +21 -0
  23. package/dist/model-constants.d.ts.map +1 -0
  24. package/dist/model-constants.js +24 -0
  25. package/dist/model-constants.js.map +1 -0
  26. package/dist/orchestrator/index.d.ts +8 -0
  27. package/dist/orchestrator/index.d.ts.map +1 -0
  28. package/dist/orchestrator/index.js +23 -0
  29. package/dist/orchestrator/index.js.map +1 -0
  30. package/dist/orchestrator/orchestrator-agent.d.ts +66 -0
  31. package/dist/orchestrator/orchestrator-agent.d.ts.map +1 -0
  32. package/dist/orchestrator/orchestrator-agent.js +855 -0
  33. package/dist/orchestrator/orchestrator-agent.js.map +1 -0
  34. package/dist/orchestrator/tool-registry.d.ts +74 -0
  35. package/dist/orchestrator/tool-registry.d.ts.map +1 -0
  36. package/dist/orchestrator/tool-registry.js +131 -0
  37. package/dist/orchestrator/tool-registry.js.map +1 -0
  38. package/dist/orchestrator/tools/check-page-ready.d.ts +13 -0
  39. package/dist/orchestrator/tools/check-page-ready.d.ts.map +1 -0
  40. package/dist/orchestrator/tools/check-page-ready.js +72 -0
  41. package/dist/orchestrator/tools/check-page-ready.js.map +1 -0
  42. package/dist/orchestrator/tools/extract-data.d.ts +13 -0
  43. package/dist/orchestrator/tools/extract-data.d.ts.map +1 -0
  44. package/dist/orchestrator/tools/extract-data.js +84 -0
  45. package/dist/orchestrator/tools/extract-data.js.map +1 -0
  46. package/dist/orchestrator/tools/index.d.ts +10 -0
  47. package/dist/orchestrator/tools/index.d.ts.map +1 -0
  48. package/dist/orchestrator/tools/index.js +18 -0
  49. package/dist/orchestrator/tools/index.js.map +1 -0
  50. package/dist/orchestrator/tools/inspect-page.d.ts +13 -0
  51. package/dist/orchestrator/tools/inspect-page.d.ts.map +1 -0
  52. package/dist/orchestrator/tools/inspect-page.js +39 -0
  53. package/dist/orchestrator/tools/inspect-page.js.map +1 -0
  54. package/dist/orchestrator/tools/recall-history.d.ts +13 -0
  55. package/dist/orchestrator/tools/recall-history.d.ts.map +1 -0
  56. package/dist/orchestrator/tools/recall-history.js +64 -0
  57. package/dist/orchestrator/tools/recall-history.js.map +1 -0
  58. package/dist/orchestrator/tools/take-screenshot.d.ts +15 -0
  59. package/dist/orchestrator/tools/take-screenshot.d.ts.map +1 -0
  60. package/dist/orchestrator/tools/take-screenshot.js +112 -0
  61. package/dist/orchestrator/tools/take-screenshot.js.map +1 -0
  62. package/dist/orchestrator/types.d.ts +133 -0
  63. package/dist/orchestrator/types.d.ts.map +1 -0
  64. package/dist/orchestrator/types.js +28 -0
  65. package/dist/orchestrator/types.js.map +1 -0
  66. package/dist/playwright-mcp-service.d.ts +9 -0
  67. package/dist/playwright-mcp-service.d.ts.map +1 -1
  68. package/dist/playwright-mcp-service.js +20 -5
  69. package/dist/playwright-mcp-service.js.map +1 -1
  70. package/dist/progress-reporter.d.ts +97 -0
  71. package/dist/progress-reporter.d.ts.map +1 -0
  72. package/dist/progress-reporter.js +18 -0
  73. package/dist/progress-reporter.js.map +1 -0
  74. package/dist/prompts.d.ts +24 -0
  75. package/dist/prompts.d.ts.map +1 -1
  76. package/dist/prompts.js +593 -68
  77. package/dist/prompts.js.map +1 -1
  78. package/dist/providers/backend-proxy-llm-provider.d.ts +25 -0
  79. package/dist/providers/backend-proxy-llm-provider.d.ts.map +1 -0
  80. package/dist/providers/backend-proxy-llm-provider.js +76 -0
  81. package/dist/providers/backend-proxy-llm-provider.js.map +1 -0
  82. package/dist/providers/local-llm-provider.d.ts +21 -0
  83. package/dist/providers/local-llm-provider.d.ts.map +1 -0
  84. package/dist/providers/local-llm-provider.js +35 -0
  85. package/dist/providers/local-llm-provider.js.map +1 -0
  86. package/dist/scenario-service.d.ts +27 -1
  87. package/dist/scenario-service.d.ts.map +1 -1
  88. package/dist/scenario-service.js +48 -12
  89. package/dist/scenario-service.js.map +1 -1
  90. package/dist/scenario-worker-class.d.ts +39 -2
  91. package/dist/scenario-worker-class.d.ts.map +1 -1
  92. package/dist/scenario-worker-class.js +614 -86
  93. package/dist/scenario-worker-class.js.map +1 -1
  94. package/dist/script-utils.d.ts +2 -0
  95. package/dist/script-utils.d.ts.map +1 -1
  96. package/dist/script-utils.js +44 -4
  97. package/dist/script-utils.js.map +1 -1
  98. package/dist/types.d.ts +11 -0
  99. package/dist/types.d.ts.map +1 -1
  100. package/dist/types.js.map +1 -1
  101. package/dist/utils/browser-utils.d.ts +20 -1
  102. package/dist/utils/browser-utils.d.ts.map +1 -1
  103. package/dist/utils/browser-utils.js +102 -51
  104. package/dist/utils/browser-utils.js.map +1 -1
  105. package/dist/utils/page-info-utils.d.ts +23 -4
  106. package/dist/utils/page-info-utils.d.ts.map +1 -1
  107. package/dist/utils/page-info-utils.js +174 -43
  108. package/dist/utils/page-info-utils.js.map +1 -1
  109. package/package.json +1 -2
  110. package/plandocs/HUMAN_LIKE_IMPROVEMENTS.md +642 -0
  111. package/plandocs/MULTI_AGENT_ARCHITECTURE_REVIEW.md +844 -0
  112. package/plandocs/ORCHESTRATOR_MVP_SUMMARY.md +539 -0
  113. package/plandocs/PHASE1_ABSTRACTION_COMPLETE.md +241 -0
  114. package/plandocs/PHASE1_FINAL_STATUS.md +210 -0
  115. package/plandocs/PLANNING_SESSION_SUMMARY.md +372 -0
  116. package/plandocs/SCRIPT_CLEANUP_FEATURE.md +201 -0
  117. package/plandocs/SCRIPT_GENERATION_ARCHITECTURE.md +364 -0
  118. package/plandocs/SELECTOR_IMPROVEMENTS.md +139 -0
  119. package/src/credit-usage-service.ts +23 -5
  120. package/src/execution-service.ts +152 -42
  121. package/src/index.ts +169 -26
  122. package/src/llm-facade.ts +500 -126
  123. package/src/llm-provider.ts +43 -0
  124. package/src/model-constants.ts +23 -0
  125. package/src/orchestrator/index.ts +33 -0
  126. package/src/orchestrator/orchestrator-agent.ts +1037 -0
  127. package/src/orchestrator/tool-registry.ts +182 -0
  128. package/src/orchestrator/tools/check-page-ready.ts +75 -0
  129. package/src/orchestrator/tools/extract-data.ts +92 -0
  130. package/src/orchestrator/tools/index.ts +11 -0
  131. package/src/orchestrator/tools/inspect-page.ts +42 -0
  132. package/src/orchestrator/tools/recall-history.ts +72 -0
  133. package/src/orchestrator/tools/take-screenshot.ts +128 -0
  134. package/src/orchestrator/types.ts +200 -0
  135. package/src/playwright-mcp-service.ts +23 -5
  136. package/src/progress-reporter.ts +109 -0
  137. package/src/prompts.ts +606 -69
  138. package/src/providers/backend-proxy-llm-provider.ts +91 -0
  139. package/src/providers/local-llm-provider.ts +38 -0
  140. package/src/scenario-service.ts +83 -13
  141. package/src/scenario-worker-class.ts +740 -72
  142. package/src/script-utils.ts +50 -5
  143. package/src/types.ts +13 -1
  144. package/src/utils/browser-utils.ts +123 -51
  145. package/src/utils/page-info-utils.ts +210 -53
  146. package/testchimp-runner-core-0.0.22.tgz +0 -0
@@ -0,0 +1,642 @@
1
+ # Human-Like Script Generation Improvements
2
+
3
+ ## Current Gaps vs Human Behavior
4
+
5
+ ### Gap 1: **Reactive vs Proactive Planning**
6
+
7
+ **Current (Reactive):**
8
+ ```
9
+ Step: "Login with credentials: admin, pass123"
10
+ 1. Generate command → fill(username)
11
+ 2. Execute → success
12
+ 3. Check goal → incomplete
13
+ 4. Generate command → fill(password)
14
+ 5. Execute → success
15
+ 6. Check goal → incomplete
16
+ 7. Generate command → click(login)
17
+ 8. Execute → success
18
+ 9. Check goal → complete
19
+ ```
20
+ **3 LLM calls + 3 goal checks = slow, expensive, non-deterministic**
21
+
22
+ **Human (Proactive):**
23
+ ```
24
+ Step: "Login with credentials: admin, pass123"
25
+ Looks at page → sees login form with username, password, and button
26
+ Plans: fill(username) → fill(password) → click(login)
27
+ Executes all 3 → done
28
+ ```
29
+ **1 planning call + execution = fast, deterministic**
30
+
31
+ ### Gap 2: **Excessive Retries**
32
+
33
+ **Current:**
34
+ - 4 attempts per sub-action × 5 sub-actions = up to 20 attempts per step
35
+ - Humans don't retry 20 times - they try 2-3 approaches max
36
+
37
+ **Human:**
38
+ - Try primary approach (e.g., click by text)
39
+ - If fails, try alternative (e.g., click by role)
40
+ - If still fails, reassess situation (vision/give up)
41
+
42
+ ### Gap 3: **No Pattern Recognition**
43
+
44
+ **Current:**
45
+ - Every login form treated as novel
46
+ - Every navigation treated independently
47
+ - No learning from previous successful patterns
48
+
49
+ **Human:**
50
+ - Recognizes "this is a login form" → knows the pattern
51
+ - Recognizes "this is a modal" → knows to look for close button
52
+ - Learns from experience
53
+
54
+ ### Gap 4: **Sub-Action Approach is Non-Deterministic**
55
+
56
+ **Current:**
57
+ ```
58
+ Step: "Click on All Modules"
59
+ Sub-action 1: Try approach A → fails
60
+ Sub-action 2: Try approach B → fails
61
+ Sub-action 3: Try approach C → fails
62
+ Sub-action count: 3, failed attempts: 8
63
+ ```
64
+
65
+ **Human:**
66
+ ```
67
+ Step: "Click on All Modules"
68
+ Looks at page → sees "All Modules" menu
69
+ Clicks it → if fails, tries once more with better selector
70
+ Max 2-3 attempts total
71
+ ```
72
+
73
+ ### Gap 5: **No Confidence Scoring**
74
+
75
+ **Current:**
76
+ - LLM generates command with no confidence indication
77
+ - System doesn't know if LLM is guessing or certain
78
+
79
+ **Human:**
80
+ - "I see exactly where to click" → high confidence, execute
81
+ - "I'm not sure which element" → low confidence, look more carefully first
82
+
83
+ ## Proposed Improvements
84
+
85
+ ### 1. **Upfront Step Planning Mode** 🎯
86
+
87
+ Add a planning phase before execution:
88
+
89
+ ```typescript
90
+ // NEW: Step planning API
+ async planStepExecution(
+   stepDescription: string,
+   pageInfo: PageInfo
+ ): Promise<StepExecutionPlan> {
+   // LLM analyzes page and creates complete plan for step
+   return {
+     commands: [
+       { code: `await page.fill('[name="username"]', 'admin')`, confidence: 0.95 },
+       { code: `await page.fill('[name="password"]', 'pass123')`, confidence: 0.95 },
+       { code: `await page.click('button[type="submit"]')`, confidence: 0.85 }
+     ],
+     reasoning: "Login form detected with clear username/password fields and submit button",
+     overallConfidence: 0.90,
+     needsVisionUpfront: false  // LLM indicates if vision would help BEFORE attempting
+   };
+ }
107
+ ```
108
+
109
+ **Benefits:**
110
+ - Single LLM call for entire step
111
+ - All commands generated together (more coherent)
112
+ - Confidence scoring guides execution
113
+ - Can request vision upfront if uncertain
114
+
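The plan-level fields could also be derived from the per-command scores instead of being reported by the LLM. The sketch below is one conservative way to do that; the `summarizePlan` helper, the choice of the minimum score as `overallConfidence`, and the 0.6 vision threshold are assumptions for illustration and are not part of the package.

```typescript
// Hypothetical shapes mirroring the StepExecutionPlan example.
interface PlannedCommand {
  code: string;
  confidence: number; // 0-1
}

interface PlanSummary {
  overallConfidence: number;
  needsVisionUpfront: boolean;
}

// Assumption: a plan is only as reliable as its least certain command,
// so the minimum per-command confidence becomes the overall score.
function summarizePlan(commands: PlannedCommand[], visionThreshold = 0.6): PlanSummary {
  if (commands.length === 0) {
    // Nothing was planned: treat as fully uncertain and request vision.
    return { overallConfidence: 0, needsVisionUpfront: true };
  }
  const overallConfidence = Math.min(...commands.map((c) => c.confidence));
  return { overallConfidence, needsVisionUpfront: overallConfidence < visionThreshold };
}

const loginPlanSummary = summarizePlan([
  { code: `await page.fill('[name="username"]', 'admin')`, confidence: 0.95 },
  { code: `await page.fill('[name="password"]', 'pass123')`, confidence: 0.95 },
  { code: `await page.click('button[type="submit"]')`, confidence: 0.85 },
]);
console.log(loginPlanSummary);
```

Using the minimum rather than the mean means a single weak command is enough to pull the whole plan toward the vision path, which matches the "look more carefully first" behavior described for low confidence.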
115
+ ### 2. **Pattern Recognition System** 📚
116
+
117
+ Build common pattern library:
118
+
119
+ ```typescript
120
+ interface ActionPattern {
+   name: string;
+   trigger: (pageInfo: PageInfo) => boolean;  // DOM-based detection
+   template: (params: any) => string[];       // Command sequence
+ }
+
+ // Helpers such as hasFields/hasButton are assumed to inspect the given pageInfo.
+ // JSON.stringify quotes and escapes parameter values in the generated commands.
+ const COMMON_PATTERNS: Record<string, ActionPattern> = {
+   LOGIN_FORM: {
+     name: 'login-form',
+     trigger: (pageInfo) => hasFields(pageInfo, ['username', 'password']) && hasButton(pageInfo, 'login'),
+     template: ({ username, password }) => [
+       `await page.fill('[name="username"]', ${JSON.stringify(username)})`,
+       `await page.fill('[name="password"]', ${JSON.stringify(password)})`,
+       `await page.click('button[type="submit"]')`
+     ]
+   },
+   MODAL_WITH_CLOSE: {
+     name: 'modal-with-close',
+     trigger: (pageInfo) => pageInfo.title.includes('modal') || hasElement(pageInfo, '[role="dialog"]'),
+     template: () => [
+       `await page.click('[aria-label="Close"]')`
+     ]
+   },
+   SEARCH_BAR: {
+     name: 'search-bar',
+     trigger: (pageInfo) => hasSearchInput(pageInfo),
+     template: ({ query }) => [
+       `await page.fill('[type="search"]', ${JSON.stringify(query)})`,
+       `await page.keyboard.press('Enter')`
+     ]
+   }
+ };
149
+ ```
150
+
151
+ **Flow:**
152
+ 1. Check if current page matches known pattern
153
+ 2. If match + high confidence → use pattern template
154
+ 3. If no match or low confidence → use LLM
155
+
156
+ **Benefits:**
157
+ - Deterministic for common cases
158
+ - Faster execution (no LLM call for patterns)
159
+ - More reliable (tested patterns)
160
+ - LLM handles edge cases
161
+
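The match-then-fall-back flow described above can be sketched without a browser. `SimplePageInfo` is a simplified, hypothetical stand-in for the real `PageInfo` type, and `matchPattern` plus the login-button regex are illustrative assumptions; a `null` result stands for "no pattern matched, use the LLM".

```typescript
// Simplified stand-in for PageInfo (assumption: the real type is richer).
interface SimplePageInfo {
  inputNames: string[];   // name attributes of visible inputs
  buttonLabels: string[]; // visible button text
}

interface Pattern {
  name: string;
  trigger: (info: SimplePageInfo) => boolean;
  template: (params: Record<string, string>) => string[];
}

// Quote parameter values so the generated command stays valid JavaScript.
const quote = (value: string): string => JSON.stringify(value);

const PATTERNS: Pattern[] = [
  {
    name: "login-form",
    trigger: (info) =>
      info.inputNames.includes("username") &&
      info.inputNames.includes("password") &&
      info.buttonLabels.some((label) => /log\s*in|sign\s*in/i.test(label)),
    template: ({ username, password }) => [
      `await page.fill('[name="username"]', ${quote(username)})`,
      `await page.fill('[name="password"]', ${quote(password)})`,
      `await page.click('button[type="submit"]')`,
    ],
  },
];

// Returns the commands of the first matching pattern,
// or null to signal that the caller should fall back to the LLM.
function matchPattern(info: SimplePageInfo, params: Record<string, string>): string[] | null {
  const pattern = PATTERNS.find((p) => p.trigger(info));
  return pattern ? pattern.template(params) : null;
}

const loginCommands = matchPattern(
  { inputNames: ["username", "password"], buttonLabels: ["Login to Continue"] },
  { username: "admin", password: "pa'ss" }
);
console.log(loginCommands);
```

Quoting values with `JSON.stringify` keeps the generated command valid even when a credential contains a quote character, which plain string interpolation would not.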
162
+ ### 3. **Smarter Retry Strategy** 🔄
163
+
164
+ Replace sub-action approach with strategic retries:
165
+
166
+ ```typescript
167
+ interface RetryStrategy {
+   maxAttempts: 3;        // Humans try 2-3 times max
+   approaches: [
+     'primary',           // First attempt: Use most obvious selector
+     'alternative',       // Second attempt: Try different selector strategy
+     'vision'             // Third attempt: Use vision if previous failed
+   ];
+ }
+
+ // Instead of:
+ // Sub-action 1 (4 attempts) → Sub-action 2 (4 attempts) → Sub-action 3 (4 attempts)
+ // = 12 attempts with different commands
+
+ // Do:
+ // Attempt 1: Primary approach
+ // Attempt 2: Alternative selector strategy
+ // Attempt 3: Vision-guided (if needed)
+ // = 3 attempts max, clear progression
185
+ ```
186
+
187
+ **Benefits:**
188
+ - Predictable attempt count
189
+ - Clear escalation path
190
+ - More human-like (try twice, then ask for help/vision)
191
+
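A minimal sketch of the primary → alternative → vision ladder is shown below. The `executeWithEscalation` function and its `tryApproach` callback are hypothetical names; the callback stands for "generate and run one command with this approach" and is injected so the escalation logic stays independent of Playwright and the LLM.

```typescript
type Approach = "primary" | "alternative" | "vision";

// Fixed escalation order: each approach is tried at most once.
const ESCALATION: readonly Approach[] = ["primary", "alternative", "vision"];

interface AttemptResult {
  success: boolean;
  approach: Approach; // approach that succeeded, or the last one tried
  attempts: number;
}

// tryApproach is a hypothetical callback that resolves to true on success.
async function executeWithEscalation(
  tryApproach: (approach: Approach) => Promise<boolean>
): Promise<AttemptResult> {
  let attempts = 0;
  for (const approach of ESCALATION) {
    attempts += 1;
    let ok = false;
    try {
      ok = await tryApproach(approach);
    } catch {
      ok = false; // a thrown error counts as a failed attempt
    }
    if (ok) {
      return { success: true, approach, attempts };
    }
  }
  return { success: false, approach: "vision", attempts };
}
```

Because the ladder is a fixed list, the attempt count per command is bounded by its length (three here), which is what makes the worst case predictable.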
192
+ ### 4. **Context-Aware Command Generation** 🧠
193
+
194
+ Use previous step context more effectively:
195
+
196
+ ```typescript
197
+ async generateCommandWithContext(
+   stepDescription: string,
+   pageInfo: PageInfo,
+   previousSuccessfulCommands: string[],  // What worked before
+   recentPatterns: PatternMatch[]         // Patterns detected recently
+ ): Promise<Command> {
+
+   const prompt = `
+     Previous successful patterns in this session:
+     - Username field used: page.fill('[name="username"]', ...)
+     - Buttons clicked using: page.getByRole('button', ...)
+
+     Use similar approaches for consistency.
+
+     Current step: ${stepDescription}
+   `;
+ }
214
+ ```
215
+
216
+ **Benefits:**
217
+ - Commands consistent within session
218
+ - Learns what selectors work on this app
219
+ - Faster convergence
220
+
221
+ ### 5. **Confidence-Based Execution** 📊
222
+
223
+ Add confidence scoring to guide execution:
224
+
225
+ ```typescript
226
+ interface CommandWithConfidence {
+   code: string;
+   confidence: number; // 0-1
+   reasoning: string;
+ }
+
+ // Execution logic:
+ if (command.confidence < 0.6) {
+   // Low confidence - use vision upfront instead of trying and failing
+   logger.log("Low confidence, requesting vision analysis before attempting");
+   const visionDiagnostics = await getVisionDiagnostics(...);
+   command = regenerateWithVisionInsights(visionDiagnostics);
+ } else if (command.confidence < 0.8) {
+   // Medium confidence - try but have vision ready
+   logger.log("Medium confidence, will use vision if this fails");
+ } else {
+   // High confidence - proceed normally
+   logger.log("High confidence, executing");
+ }
245
+ ```
246
+
247
+ **Benefits:**
248
+ - Vision used proactively when LLM is uncertain
249
+ - Fewer wasted attempts on guesses
250
+ - More human-like (ask for help when unsure)
251
+
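The three confidence bands can be isolated into a small pure function, which makes the thresholds easy to test. This is a sketch: `chooseStrategy` and the strategy names are hypothetical, the 0.6 and 0.8 cut-offs come from the snippet above, and treating an invalid score as low confidence is an added assumption.

```typescript
type ExecutionStrategy = "vision-first" | "try-with-vision-fallback" | "execute";

// Maps a 0-1 confidence score to an execution strategy.
function chooseStrategy(confidence: number): ExecutionStrategy {
  // Assumption: missing, NaN, or out-of-range scores are treated as low confidence.
  if (!Number.isFinite(confidence) || confidence < 0 || confidence > 1) {
    return "vision-first";
  }
  if (confidence < 0.6) return "vision-first";             // low: look before acting
  if (confidence < 0.8) return "try-with-vision-fallback"; // medium: try, keep vision ready
  return "execute";                                        // high: proceed normally
}

for (const score of [0.3, 0.7, 0.95]) {
  console.log(score, chooseStrategy(score));
}
```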
252
+ ### 6. **Simplified Goal Model** 🎯
253
+
254
+ Remove sub-action pattern, use clearer model:
255
+
256
+ ```typescript
257
+ // Current (complex):
+ // Step → Sub-actions (1-5) → Attempts (0-3) → Goal checks after each sub-action
+ // = nested loops, hard to reason about
+
+ // Proposed (simple):
+ // Step → Plan (upfront) → Execute commands → Check completion once
+ // = linear flow, easy to understand
+
+ async executeStep(step: Step, page: Page): Promise<StepResult> {
+   // 1. Plan the step (single LLM call)
+   const plan = await planStepExecution(step.description, getPageInfo(page));
+
+   // 2. Execute each command in plan
+   for (const command of plan.commands) {
+     try {
+       await executeCommand(command, page);
+     } catch (error) {
+       // If any command fails, replan with error context
+       if (plan.overallConfidence > 0.8) {
+         // Was confident, try alternative selector
+         const alt = await regenerateCommand(command, error, page);
+         await executeCommand(alt, page);
+       } else {
+         // Was uncertain, use vision
+         const vision = await getVisionGuidedCommand(command, error, page);
+         await executeCommand(vision, page);
+       }
+     }
+   }
+
+   // 3. Single goal check at end
+   return checkGoalCompletion(step.description, plan.commands, page);
+ }
290
+ ```
291
+
292
+ **Benefits:**
293
+ - Predictable execution flow
294
+ - Fewer LLM calls (plan once, not per sub-action)
295
+ - Easier to debug and reason about
296
+ - More human-like (plan → execute → verify)
297
+
298
+ ### 7. **Smart DOM Analysis** 🔍
299
+
300
+ Better upfront DOM understanding:
301
+
302
+ ```typescript
303
+ interface PageUnderstanding {
+   pageType: 'login' | 'dashboard' | 'form' | 'modal' | 'list' | 'unknown';
+   primaryInteractiveElements: Element[];
+   forms: FormInfo[];
+   modals: ModalInfo[];
+   navigation: NavigationInfo[];
+   confidence: number;
+ }
+
+ async analyzePageContext(page: Page): Promise<PageUnderstanding> {
+   const pageInfo = await getEnhancedPageInfo(page);
+
+   // Detect page type and structure
+   const analysis = await llm.analyzePage(pageInfo);
+
+   return {
+     pageType: 'login', // Detected
+     forms: [{ fields: ['username', 'password'], submitButton: 'Login to Continue' }],
+     confidence: 0.95
+   };
+ }
324
+ ```
325
+
326
+ **Usage:**
327
+ ```typescript
328
+ const pageContext = await analyzePageContext(page);
+ if (pageContext.pageType === 'login' && step.description.includes('login')) {
+   // High confidence - use login pattern
+   const form = pageContext.forms[0];
+   return generateLoginCommands(form, credentials);
+ }
334
+ ```
335
+
336
+ **Benefits:**
337
+ - Understands page structure upfront
338
+ - Can detect common patterns automatically
339
+ - Informs better command generation
340
+ - More deterministic (same page type → same approach)
341
+
342
+ ### 8. **Reduce Noise in Execution** 🔇
343
+
344
+ **Current issues:**
345
+ - Too many goal completion checks (after every sub-action)
346
+ - Too many retry variations
347
+ - Too much back-and-forth with LLM
348
+
349
+ **Improvement:**
350
+ ```typescript
351
+ // Instead of:
352
+ // fill(username) → goal check → fill(password) → goal check → click(login) → goal check
353
+ // = 6 operations (3 actions + 3 checks)
354
+
355
+ // Do:
356
+ // Plan: [fill(username), fill(password), click(login)]
357
+ // Execute all → single goal check
358
+ // = 4 operations (3 actions + 1 check)
359
+ ```
360
+
361
+ **Benefits:**
362
+ - Fewer LLM calls for multi-action steps (one goal check instead of three in the login example)
363
+ - Faster execution
364
+ - More predictable
365
+
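The saving can be made concrete with a little arithmetic. The sketch below assumes that every command generation, every planning call, and every goal check costs one LLM call; the function names are hypothetical and the counts are a model of the two flows, not measurements.

```typescript
interface OperationCount {
  llmCalls: number;   // generation or planning calls plus goal checks
  actions: number;    // commands executed in the browser
  goalChecks: number;
}

// Reactive loop: one generation call and one goal check per command.
function reactiveCost(commands: number): OperationCount {
  return { llmCalls: commands * 2, actions: commands, goalChecks: commands };
}

// Planned execution: one planning call, all commands, then a single goal check.
// Assumes the step needs at least one command.
function plannedCost(commands: number): OperationCount {
  return { llmCalls: 2, actions: commands, goalChecks: 1 };
}

// Login example: fill(username), fill(password), click(login)
const reactiveLogin = reactiveCost(3);
const plannedLogin = plannedCost(3);
console.log(reactiveLogin, plannedLogin);
```

Under these assumptions the planned flow's LLM cost stays constant as the number of commands in a step grows, while the reactive flow's cost grows linearly.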
366
+ ### 9. **Visual Scanning Before Action** 👁️
367
+
368
+ Add optional visual scan for complex steps:
369
+
370
+ ```typescript
371
+ async executeComplexStep(step: Step, page: Page): Promise<Result> {
+   // Check if step seems complex/ambiguous
+   const complexity = assessStepComplexity(step.description);
+
+   if (complexity === 'high' && visualBudgetAvailable) {
+     // Proactive vision for complex steps
+     logger.log("Complex step detected - using vision upfront for better planning");
+     const pageInfo = await getEnhancedPageInfo(page);
+     const screenshot = await page.screenshot();
+     const plan = await generatePlanWithVision(step, pageInfo, screenshot);
+     return executePlan(plan);
+   } else {
+     // Standard DOM-based approach
+     return executeStepStandard(step, page);
+   }
+ }
386
+ ```
387
+
388
+ **When to use:**
389
+ - Ambiguous descriptions ("click the settings icon")
390
+ - Multiple similar elements expected
391
+ - Visual-heavy interactions
392
+ - Previous similar steps failed
393
+
394
+ **Benefits:**
395
+ - Fewer wasted attempts
396
+ - Higher first-attempt success rate
397
+ - More human-like (look before you leap)
398
+
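The `assessStepComplexity` call above is not defined in this document. One possible keyword-based heuristic, built from the "When to use" criteria, is sketched below; the hint lists and the `previousSimilarStepFailed` parameter are assumptions chosen for illustration and would need tuning per application.

```typescript
type Complexity = "low" | "high";

// Hypothetical hint lists derived from the "When to use" criteria.
const VISUAL_HINTS = ["icon", "image", "logo", "color", "chart", "avatar"];
const AMBIGUITY_HINTS = ["first", "second", "last", "any", "one of", "next to"];

// True if any hint appears as a whole word or phrase in the text.
function mentionsAny(text: string, hints: string[]): boolean {
  return hints.some((hint) => new RegExp(`\\b${hint}\\b`).test(text));
}

function assessStepComplexity(
  description: string,
  previousSimilarStepFailed = false
): Complexity {
  const text = description.toLowerCase();
  if (
    previousSimilarStepFailed ||
    mentionsAny(text, VISUAL_HINTS) ||
    mentionsAny(text, AMBIGUITY_HINTS)
  ) {
    return "high";
  }
  return "low";
}

console.log(assessStepComplexity("Click the settings icon")); // high: mentions "icon"
console.log(assessStepComplexity("Click on All Modules"));    // low: no hints present
```

Matching on word boundaries avoids false positives such as "any" inside "company".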
399
+ ### 10. **Session Memory** 💭
400
+
401
+ Remember what works across steps:
402
+
403
+ ```typescript
404
+ interface SessionMemory {
+   workingSelectors: Map<string, string>; // "login button" → "button[name='login']"
+   pagePatterns: string[];                // e.g. "uses-data-testid"
+   commonElements: Element[];             // Persistent nav/header elements
+   successfulStrategies: string[];        // What approaches worked
+ }
+
+ // Example usage:
+ if (sessionMemory.workingSelectors.has('login button')) {
+   // We found login button before, use same selector
+   const selector = sessionMemory.workingSelectors.get('login button');
+   // JSON.stringify quotes the selector safely, even if it contains quotes itself
+   return `await page.click(${JSON.stringify(selector)})`;
+ }
+
+ if (sessionMemory.pagePatterns.includes('uses-data-testid')) {
+   // This app uses data-testid, prefer that
+   return `await page.click('[data-testid="login"]')`;
+ }
422
+ ```
423
+
424
+ **Benefits:**
425
+ - Consistency across steps
426
+ - Learn from early steps
427
+ - Deterministic within session
428
+ - More human-like (remember what worked)
429
+
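How an entry such as `'uses-data-testid'` gets into `pagePatterns` is left open above. One possible approach is to infer it from the selectors that have already worked, as in the sketch below; `detectSelectorPatterns` and the 50% share threshold are assumptions, while the two labels are the ones used in this document.

```typescript
// Selector conventions to look for, paired with the label recorded for each.
const SELECTOR_CONVENTIONS: Array<[string, RegExp]> = [
  ["uses-data-testid", /\[data-testid[^\w-]/],
  ["uses-aria-labels", /\[aria-label[^\w-]/],
];

// Returns the labels of conventions followed by at least `minShare`
// of the successful selectors (threshold is an assumption).
function detectSelectorPatterns(selectors: string[], minShare = 0.5): string[] {
  if (selectors.length === 0) return [];
  return SELECTOR_CONVENTIONS
    .filter(([, test]) => {
      const hits = selectors.filter((selector) => test.test(selector)).length;
      return hits / selectors.length >= minShare;
    })
    .map(([label]) => label);
}

const detected = detectSelectorPatterns([
  '[data-testid="login"]',
  '[data-testid="user-menu"]',
  'button[name="go"]',
]);
console.log(detected);
```

The detected labels could then be stored in `pagePatterns` and passed to the LLM as session context.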
430
+ ## Recommended Implementation Priority
431
+
432
+ ### Phase 1: **Low-Hanging Fruit** (Quick Wins)
433
+
434
+ 1. ✅ **Reduce retry budget**
435
+ - MAX_SUBACTIONS_PER_STEP: 5 → 3 (humans try 2-3 approaches max)
436
+ - MAX_RETRIES_PER_STEP: 3 → 2 (4 attempts per sub-action → 3)
437
+ - MAX_FAILED_ATTEMPTS_PER_STEP: 12 → 8
438
+
439
+ 2. ✅ **Confidence scoring**
440
+ - Add confidence field to LLM responses
441
+ - Trigger vision earlier for low confidence (< 0.6)
442
+ - Log confidence for debugging
443
+
444
+ 3. ✅ **Session memory (simple version)**
445
+ - Track successful selectors in current session
446
+ - Reuse patterns that worked
447
+ - Pass to LLM as context
448
+
449
+ ### Phase 2: **Structural Improvements** (Medium Effort)
450
+
451
+ 4. **Upfront step planning**
452
+ - New API: `planStepExecution()`
453
+ - Generate all commands for step at once
454
+ - Execute plan without goal checks between commands
455
+ - Single goal check at end
456
+
457
+ 5. **Pattern detection**
458
+ - Detect common patterns (login, search, modal, dropdown)
459
+ - Use templates for high-confidence matches
460
+ - Fallback to LLM for novel cases
461
+
462
+ 6. **Smarter retry strategy**
463
+ - Replace sub-action approach with: primary → alternative → vision
464
+ - Clear escalation path
465
+ - Fail faster
466
+
467
+ ### Phase 3: **Advanced Features** (Future)
468
+
469
+ 7. **Proactive vision for complex steps**
470
+ - Complexity scoring
471
+ - Visual scan before attempting
472
+ - Better first-attempt success
473
+
474
+ 8. **Cross-session learning**
475
+ - Persist patterns across sessions
476
+ - Build app-specific knowledge base
477
+ - Improve over time
478
+
479
+ 9. **Multi-step lookahead**
480
+ - Plan multiple steps at once when related
481
+ - Optimize for workflow efficiency
482
+ - Batch similar operations
483
+
484
+ ## Immediate Actionable Changes
485
+
486
+ ### Change 1: Reduce Retry Budget (More Human-Like)
487
+
488
+ ```typescript
489
+ // In scenario-worker-class.ts
490
+ const MAX_RETRIES_PER_STEP = 2; // Was 3 → Now 2 retries per approach (3 tries total)
491
+ const MAX_SUBACTIONS_PER_STEP = 3; // Was 5 → Now 3 different approaches max
492
+ const MAX_FAILED_ATTEMPTS_PER_STEP = 8; // Was 12 → Now 8 total failures max
493
+ ```
494
+
495
+ **Rationale:** Humans try 2-3 things, not 12. Fail faster = faster feedback.
496
+
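The effect of the three constants on the worst case can be checked with a short calculation. The sketch assumes that `MAX_RETRIES_PER_STEP` counts retries after the first try of each sub-action and that `MAX_FAILED_ATTEMPTS_PER_STEP` caps failures across all sub-actions; `worstCaseAttempts` is a hypothetical helper, not a function from `scenario-worker-class.ts`.

```typescript
interface RetryBudget {
  maxRetriesPerStep: number;        // retries after the first try of a sub-action
  maxSubActionsPerStep: number;     // different approaches per step
  maxFailedAttemptsPerStep: number; // overall cap on failed attempts
}

// Worst case with and without the overall failure cap applied.
function worstCaseAttempts(budget: RetryBudget): { uncapped: number; capped: number } {
  const uncapped = (budget.maxRetriesPerStep + 1) * budget.maxSubActionsPerStep;
  return { uncapped, capped: Math.min(uncapped, budget.maxFailedAttemptsPerStep) };
}

const oldBudget = worstCaseAttempts({
  maxRetriesPerStep: 3,
  maxSubActionsPerStep: 5,
  maxFailedAttemptsPerStep: 12,
});
const newBudget = worstCaseAttempts({
  maxRetriesPerStep: 2,
  maxSubActionsPerStep: 3,
  maxFailedAttemptsPerStep: 8,
});
console.log(oldBudget, newBudget);
```

Under these assumptions the old settings allow 4 × 5 = 20 attempts before the cap of 12 applies, and the new settings allow 3 × 3 = 9 before the cap of 8 applies.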
497
+ ### Change 2: Add Confidence Scoring
498
+
499
+ **Modify LLM response:**
500
+ ```typescript
501
+ interface LLMPlaywrightCommandResponse {
+   command: string;
+   reasoning: string;
+   confidence: number;          // NEW: 0-1 scale
+   uncertaintyReason?: string;  // NEW: Why low confidence
+ }
507
+ ```
508
+
509
+ **Prompt addition:**
510
+ ```
511
+ Additionally, provide a confidence score (0-1):
512
+ - 1.0: Element clearly identified in DOM with unique selector
513
+ - 0.8: Element found but selector might not be unique
514
+ - 0.6: Multiple possible elements, best guess
515
+ - 0.4: Element not clearly in DOM, inferring from context
516
+ - 0.2: Guessing, likely to fail
517
+
518
+ If confidence < 0.7, explain uncertainty.
519
+ ```
520
+
521
+ **Usage:**
522
+ ```typescript
523
+ const result = await generatePlaywrightCommand(...);
+ if (result.confidence < 0.6 && !usedVisionMode) {
+   logger.log(`⚠️ Low confidence (${result.confidence}) - using vision proactively`);
+   // Skip attempts, go straight to vision
+   const visionResult = await getVisionGuidedCommand(...);
+   return visionResult;
+ }
530
+ ```
531
+
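Since the confidence value comes from the LLM, it is worth validating before it drives the vision decision. The sketch below assumes the response arrives as a JSON string; `parseCommandResponse`, the 0.5 default for a missing score, and the clamping to the 0-1 range are illustrative assumptions.

```typescript
interface ParsedCommandResponse {
  command: string;
  reasoning: string;
  confidence: number; // clamped to 0-1
  uncertaintyReason?: string;
}

// Parses and validates a raw LLM response.
// Assumption: a missing or invalid confidence falls back to defaultConfidence.
function parseCommandResponse(raw: string, defaultConfidence = 0.5): ParsedCommandResponse {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("LLM response is not a JSON object");
  }
  const data = parsed as Record<string, unknown>;
  if (typeof data.command !== "string" || data.command.trim() === "") {
    throw new Error("LLM response is missing a command");
  }
  const rawConfidence =
    typeof data.confidence === "number" && Number.isFinite(data.confidence)
      ? data.confidence
      : defaultConfidence;
  return {
    command: data.command,
    reasoning: typeof data.reasoning === "string" ? data.reasoning : "",
    confidence: Math.min(1, Math.max(0, rawConfidence)),
    uncertaintyReason:
      typeof data.uncertaintyReason === "string" ? data.uncertaintyReason : undefined,
  };
}

const parsedResponse = parseCommandResponse(
  JSON.stringify({ command: "await page.click('#login')", reasoning: "unique id", confidence: 1.4 })
);
console.log(parsedResponse.confidence);
```

Defaulting to a middle value rather than 0 or 1 keeps responses without a score out of both the "skip straight to vision" and "fully trusted" paths.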
532
+ ### Change 3: Session Memory (Simple Version)
533
+
534
+ ```typescript
535
+ class SessionContext {
+   successfulSelectors: Map<string, string> = new Map();
+   appPatterns: Set<string> = new Set(); // 'uses-data-testid', 'uses-aria-labels', etc.
+
+   recordSuccess(elementDescription: string, selector: string) {
+     this.successfulSelectors.set(elementDescription, selector);
+   }
+
+   getContext(): string {
+     if (this.successfulSelectors.size === 0) return '';
+
+     return `
+ Selectors that worked in this session:
+ ${Array.from(this.successfulSelectors.entries())
+   .map(([desc, sel]) => `- "${desc}": ${sel}`)
+   .join('\n')}
+
+ Try similar patterns for consistency.
+ `;
+   }
+ }
556
+ ```
557
+
558
+ **Add to command generation prompt:**
559
+ ```typescript
560
+ ${sessionContext.getContext()}
561
+ ```
562
+
563
+ ### Change 4: Skip Sub-Action Loop for High Confidence
564
+
565
+ ```typescript
566
+ // After planning, if the plan's overall confidence is high (> 0.8):
+ if (plan.overallConfidence > 0.8) {
+   // Execute entire plan without intermediate checks
+   for (const cmd of plan.commands) {
+     await execute(cmd);
+   }
+   // Single goal check at end
+   return checkGoalCompletion(step, plan.commands, page);
+ } else {
+   // Low confidence - use current sub-action approach for safety
+   return executeWithSubActions(step, page);
+ }
578
+ ```
579
+
580
+ ## Expected Improvements
581
+
582
+ ### Metrics:
583
+
584
+ | Metric | Current | After Phase 1 | After Phase 2 |
585
+ |--------|---------|---------------|---------------|
586
+ | Avg LLM calls per step | 4-6 | 3-4 | 1-2 |
587
+ | Max attempts per step | 20 | 8 | 6 |
588
+ | Success rate (1st attempt) | ~40% | ~50% | ~70% |
589
+ | Time per step | 30-60s | 20-40s | 10-20s |
590
+ | Determinism | Low | Medium | High |
591
+
592
+ ### Behavioral Changes:
593
+
594
+ **Login scenario:**
595
+
596
+ **Current:**
597
+ ```
598
+ 1. fill(user) → check goal → incomplete
599
+ 2. fill(pass) → check goal → incomplete
600
+ 3. click(login) → check goal → complete
601
+ = 3 command generations + 3 actions + 3 goal checks = 9 operations
602
+ ```
603
+
604
+ **After Phase 1 (confidence + memory):**
605
+ ```
606
+ 1. fill(user) → check goal → incomplete (but with confidence: 0.9)
607
+ 2. fill(pass) → check goal → incomplete (confidence: 0.9)
608
+ 3. click(login) → check goal → complete (confidence: 0.85)
609
+ = 3 actions + 3 checks, but with confidence signals
610
+ ```
611
+
612
+ **After Phase 2 (planning):**
613
+ ```
614
+ 1. Plan: [fill(user), fill(pass), click(login)] (confidence: 0.9)
615
+ 2. Execute all 3 commands
616
+ 3. Single goal check → complete
617
+ = 1 plan + 3 actions + 1 check = 5 operations
618
+ ```
619
+
620
+ ## Summary
621
+
622
+ **Core Philosophy Shift:**
623
+
624
+ **From:** Reactive trial-and-error with many retries
625
+ **To:** Proactive planning with fewer, smarter attempts
626
+
627
+ **Most Important:**
628
+ 1. Plan before acting (not generate-execute-check loop)
629
+ 2. Use confidence to guide strategy
630
+ 3. Remember what works
631
+ 4. Fail faster (humans don't retry 12 times)
632
+ 5. Recognize common patterns
633
+
634
+ **Most Human-Like = Most Deterministic**
635
+ - Humans plan, then execute
636
+ - Humans learn from experience
637
+ - Humans don't endlessly retry
638
+ - Humans recognize familiar patterns
639
+ - Humans ask for help when uncertain
640
+
641
+ These improvements make the system more deterministic, faster, cheaper, and more aligned with how humans actually interact with web applications.
642
+