testchimp-runner-core 0.0.35 → 0.0.37
- package/dist/orchestrator/orchestrator-agent.d.ts.map +1 -1
- package/dist/orchestrator/orchestrator-agent.js +7 -4
- package/dist/orchestrator/orchestrator-agent.js.map +1 -1
- package/dist/orchestrator/orchestrator-prompts.d.ts.map +1 -1
- package/dist/orchestrator/orchestrator-prompts.js +73 -15
- package/dist/orchestrator/orchestrator-prompts.js.map +1 -1
- package/dist/orchestrator/page-som-handler.d.ts +1 -2
- package/dist/orchestrator/page-som-handler.d.ts.map +1 -1
- package/dist/orchestrator/page-som-handler.js +51 -25
- package/dist/orchestrator/page-som-handler.js.map +1 -1
- package/package.json +6 -1
- package/plandocs/BEFORE_AFTER_VERIFICATION.md +0 -148
- package/plandocs/COORDINATE_MODE_DIAGNOSIS.md +0 -144
- package/plandocs/CREDIT_CALLBACK_ARCHITECTURE.md +0 -253
- package/plandocs/HUMAN_LIKE_IMPROVEMENTS.md +0 -642
- package/plandocs/IMPLEMENTATION_STATUS.md +0 -108
- package/plandocs/INTEGRATION_COMPLETE.md +0 -322
- package/plandocs/MULTI_AGENT_ARCHITECTURE_REVIEW.md +0 -844
- package/plandocs/ORCHESTRATOR_MVP_SUMMARY.md +0 -539
- package/plandocs/PHASE1_ABSTRACTION_COMPLETE.md +0 -241
- package/plandocs/PHASE1_FINAL_STATUS.md +0 -210
- package/plandocs/PHASE_1_COMPLETE.md +0 -165
- package/plandocs/PHASE_1_SUMMARY.md +0 -184
- package/plandocs/PLANNING_SESSION_SUMMARY.md +0 -372
- package/plandocs/PROMPT_OPTIMIZATION_ANALYSIS.md +0 -120
- package/plandocs/PROMPT_SANITY_CHECK.md +0 -120
- package/plandocs/SCRIPT_CLEANUP_FEATURE.md +0 -201
- package/plandocs/SCRIPT_GENERATION_ARCHITECTURE.md +0 -364
- package/plandocs/SELECTOR_IMPROVEMENTS.md +0 -139
- package/plandocs/SESSION_SUMMARY_v0.0.33.md +0 -151
- package/plandocs/TROUBLESHOOTING_SESSION.md +0 -72
- package/plandocs/VISION_DIAGNOSTICS_IMPROVEMENTS.md +0 -336
- package/plandocs/VISUAL_AGENT_EVOLUTION_PLAN.md +0 -396
- package/plandocs/WHATS_NEW_v0.0.33.md +0 -183
- package/plandocs/exploratory-mode-support-v2.plan.md +0 -953
- package/plandocs/exploratory-mode-support.plan.md +0 -928
- package/plandocs/journey-id-tracking-addendum.md +0 -227
- package/releasenotes/RELEASE_0.0.26.md +0 -165
- package/releasenotes/RELEASE_0.0.27.md +0 -236
- package/releasenotes/RELEASE_0.0.28.md +0 -286
- package/src/auth-config.ts +0 -84
- package/src/credit-usage-service.ts +0 -188
- package/src/env-loader.ts +0 -103
- package/src/execution-service.ts +0 -996
- package/src/file-handler.ts +0 -104
- package/src/index.ts +0 -432
- package/src/llm-facade.ts +0 -821
- package/src/llm-provider.ts +0 -53
- package/src/model-constants.ts +0 -35
- package/src/orchestrator/decision-parser.ts +0 -139
- package/src/orchestrator/index.ts +0 -58
- package/src/orchestrator/orchestrator-agent.ts +0 -1282
- package/src/orchestrator/orchestrator-prompts.ts +0 -786
- package/src/orchestrator/page-som-handler.ts +0 -1565
- package/src/orchestrator/som-types.ts +0 -188
- package/src/orchestrator/tool-registry.ts +0 -184
- package/src/orchestrator/tools/check-page-ready.ts +0 -75
- package/src/orchestrator/tools/extract-data.ts +0 -92
- package/src/orchestrator/tools/index.ts +0 -15
- package/src/orchestrator/tools/inspect-page.ts +0 -42
- package/src/orchestrator/tools/recall-history.ts +0 -72
- package/src/orchestrator/tools/refresh-som-markers.ts +0 -69
- package/src/orchestrator/tools/take-screenshot.ts +0 -128
- package/src/orchestrator/tools/verify-action-result.ts +0 -159
- package/src/orchestrator/tools/view-previous-screenshot.ts +0 -103
- package/src/orchestrator/types.ts +0 -291
- package/src/playwright-mcp-service.ts +0 -224
- package/src/progress-reporter.ts +0 -144
- package/src/prompts.ts +0 -842
- package/src/providers/backend-proxy-llm-provider.ts +0 -91
- package/src/providers/local-llm-provider.ts +0 -38
- package/src/scenario-service.ts +0 -252
- package/src/scenario-worker-class.ts +0 -1110
- package/src/script-utils.ts +0 -203
- package/src/types.ts +0 -239
- package/src/utils/browser-utils.ts +0 -348
- package/src/utils/coordinate-converter.ts +0 -162
- package/src/utils/page-info-retry.ts +0 -65
- package/src/utils/page-info-utils.ts +0 -285
- package/testchimp-runner-core-0.0.35.tgz +0 -0
- package/tsconfig.json +0 -19
@@ -1,642 +0,0 @@
# Human-Like Script Generation Improvements

## Current Gaps vs Human Behavior

### Gap 1: **Reactive vs Proactive Planning**

**Current (Reactive):**
```
Step: "Login with credentials: admin, pass123"
1. Generate command → fill(username)
2. Execute → success
3. Check goal → incomplete
4. Generate command → fill(password)
5. Execute → success
6. Check goal → incomplete
7. Generate command → click(login)
8. Execute → success
9. Check goal → complete
```
**3 LLM calls + 3 goal checks = slow, expensive, non-deterministic**

**Human (Proactive):**
```
Step: "Login with credentials: admin, pass123"
Looks at page → sees login form with username, password, and button
Plans: fill(username) → fill(password) → click(login)
Executes all 3 → done
```
**1 planning call + execution = fast, deterministic**

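The difference in LLM traffic between the two flows can be made concrete with a small counting sketch. It assumes, as the reactive trace suggests, that each command generation and each goal check costs one LLM call, and that the proactive flow ends with a single goal check. `reactiveCalls` and `proactiveCalls` are illustrative names, not functions from the package.

```typescript
// Hypothetical cost model: one LLM call per command generation,
// per planning request, and per goal check.
interface CallCount {
  generation: number; // command-generation or planning calls
  goalChecks: number; // goal-completion checks
  total: number;
}

// Reactive loop: generate → execute → check goal, repeated for every action.
function reactiveCalls(actions: number): CallCount {
  return { generation: actions, goalChecks: actions, total: actions * 2 };
}

// Proactive flow: one planning call for the whole step, one goal check at the end.
function proactiveCalls(actions: number): CallCount {
  const generation = actions > 0 ? 1 : 0;
  const goalChecks = actions > 0 ? 1 : 0;
  return { generation, goalChecks, total: generation + goalChecks };
}

const login = { reactive: reactiveCalls(3), proactive: proactiveCalls(3) };
console.log(login.reactive.total, login.proactive.total);
```

Under this model the reactive count grows linearly with the number of actions in a step, while the proactive count stays constant as long as the plan does not need to be regenerated.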
### Gap 2: **Excessive Retries**

**Current:**
- 4 attempts per sub-action × 5 sub-actions = up to 20 attempts per step
- Humans don't retry 20 times; they try 2-3 approaches at most

**Human:**
- Try primary approach (e.g., click by text)
- If that fails, try an alternative (e.g., click by role)
- If that still fails, reassess the situation (use vision or give up)

### Gap 3: **No Pattern Recognition**

**Current:**
- Every login form treated as novel
- Every navigation treated independently
- No learning from previous successful patterns

**Human:**
- Recognizes "this is a login form" → knows the pattern
- Recognizes "this is a modal" → knows to look for close button
- Learns from experience

### Gap 4: **Sub-Action Approach is Non-Deterministic**

**Current:**
```
Step: "Click on All Modules"
Sub-action 1: Try approach A → fails
Sub-action 2: Try approach B → fails
Sub-action 3: Try approach C → fails
Sub-action count: 3, failed attempts: 8
```

**Human:**
```
Step: "Click on All Modules"
Looks at page → sees "All Modules" menu
Clicks it → if fails, tries once more with better selector
Max 2-3 attempts total
```

### Gap 5: **No Confidence Scoring**

**Current:**
- LLM generates command with no confidence indication
- System doesn't know if LLM is guessing or certain

**Human:**
- "I see exactly where to click" → high confidence, execute
- "I'm not sure which element" → low confidence, look more carefully first

## Proposed Improvements

### 1. **Upfront Step Planning Mode** 🎯

Add a planning phase before execution:

```typescript
// NEW: Step planning API
async planStepExecution(
  stepDescription: string,
  pageInfo: PageInfo
): Promise<StepExecutionPlan> {
  // LLM analyzes page and creates complete plan for step
  return {
    commands: [
      { code: "await page.fill('[name=username]', 'admin')", confidence: 0.95 },
      { code: "await page.fill('[name=password]', 'pass123')", confidence: 0.95 },
      { code: "await page.click('button[type=submit]')", confidence: 0.85 }
    ],
    reasoning: "Login form detected with clear username/password fields and submit button",
    overallConfidence: 0.90,
    needsVisionUpfront: false  // LLM indicates if vision would help BEFORE attempting
  }
}
```

**Benefits:**
- Single LLM call for entire step
- All commands generated together (more coherent)
- Confidence scoring guides execution
- Can request vision upfront if uncertain

### 2. **Pattern Recognition System** 📚

Build a library of common patterns:

```typescript
interface ActionPattern {
  name: string;
  trigger: (pageInfo: PageInfo) => boolean;  // DOM-based detection
  template: (params: any) => string[];        // Command sequence
}

// hasFields, hasButton, hasElement, and hasSearchInput are DOM helpers
// that are not defined in this document.
const COMMON_PATTERNS: Record<string, ActionPattern> = {
  LOGIN_FORM: {
    name: 'LOGIN_FORM',
    trigger: (page) => hasFields(['username', 'password']) && hasButton('login'),
    template: ({ username, password }) => [
      `await page.fill('[name="username"]', '${username}')`,
      `await page.fill('[name="password"]', '${password}')`,
      `await page.click('button[type="submit"]')`
    ]
  },
  MODAL_WITH_CLOSE: {
    name: 'MODAL_WITH_CLOSE',
    trigger: (page) => page.title.includes('modal') || hasElement('[role="dialog"]'),
    template: () => [
      `await page.click('[aria-label="Close"]')`
    ]
  },
  SEARCH_BAR: {
    name: 'SEARCH_BAR',
    trigger: (page) => hasSearchInput(),
    template: ({ query }) => [
      `await page.fill('[type="search"]', '${query}')`,
      `await page.keyboard.press('Enter')`
    ]
  }
}
```

**Flow:**
1. Check if current page matches known pattern
2. If match + high confidence → use pattern template
3. If no match or low confidence → use LLM

**Benefits:**
- Deterministic for common cases
- Faster execution (no LLM call for patterns)
- More reliable (tested patterns)
- LLM handles edge cases

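The three-step flow (match a pattern, use its template, otherwise fall back to the LLM) could be wired up roughly as follows. This is a minimal sketch under stated assumptions: `PageInfo` is reduced to two hypothetical fields, a trigger match is treated as "high confidence", and `matchPattern`, `generateCommands`, and `llmFallback` are illustrative names rather than package APIs. Values are embedded with `JSON.stringify` so that quotes in credentials cannot break the generated command.

```typescript
// Hypothetical, simplified page summary; the real PageInfo type is not shown in this document.
interface PageInfo {
  fieldNames: string[];
  buttonLabels: string[];
}

interface ActionPattern {
  name: string;
  trigger: (pageInfo: PageInfo) => boolean;
  template: (params: Record<string, string>) => string[];
}

const LOGIN_FORM: ActionPattern = {
  name: 'LOGIN_FORM',
  trigger: (p) =>
    ['username', 'password'].every((f) => p.fieldNames.includes(f)) &&
    p.buttonLabels.some((b) => b.toLowerCase().includes('login')),
  template: ({ username, password }) => [
    // JSON.stringify yields a safely quoted JavaScript string literal.
    `await page.fill('[name="username"]', ${JSON.stringify(username)})`,
    `await page.fill('[name="password"]', ${JSON.stringify(password)})`,
    `await page.click('button[type="submit"]')`,
  ],
};

// Step 1 of the flow: find the first pattern whose trigger fires.
function matchPattern(
  pageInfo: PageInfo,
  patterns: readonly ActionPattern[]
): ActionPattern | undefined {
  return patterns.find((p) => p.trigger(pageInfo));
}

type GeneratedCommands =
  | { source: 'pattern'; pattern: string; commands: string[] }
  | { source: 'llm'; commands: string[] };

// Steps 2 and 3: use the template on a match, otherwise defer to the LLM.
async function generateCommands(
  pageInfo: PageInfo,
  params: Record<string, string>,
  patterns: readonly ActionPattern[],
  llmFallback: () => Promise<string[]>
): Promise<GeneratedCommands> {
  const match = matchPattern(pageInfo, patterns);
  if (match) {
    return { source: 'pattern', pattern: match.name, commands: match.template(params) };
  }
  return { source: 'llm', commands: await llmFallback() };
}
```

Returning the `source` alongside the commands keeps it visible in logs whether a step was handled deterministically or by the LLM.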
### 3. **Smarter Retry Strategy** 🔄

Replace sub-action approach with strategic retries:

```typescript
interface RetryStrategy {
  maxAttempts: 2;  // Humans try 2-3 times max
  approaches: [
    'primary',      // First attempt: Use most obvious selector
    'alternative',  // Second attempt: Try different selector strategy
    'vision'        // Third attempt: Use vision if previous failed
  ]
}

// Instead of:
// Sub-action 1 (4 attempts) → Sub-action 2 (4 attempts) → Sub-action 3 (4 attempts)
// = 12 attempts with different commands

// Do:
// Attempt 1: Primary approach
// Attempt 2: Alternative selector strategy
// Attempt 3: Vision-guided (if needed)
// = 3 attempts max, clear progression
```

**Benefits:**
- Predictable attempt count
- Clear escalation path
- More human-like (try twice, then ask for help/vision)

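The primary → alternative → vision escalation can be expressed as a small driver loop. The sketch below is one possible shape, not the package's implementation: `runWithEscalation` and its callback signature are hypothetical, and the callback is expected to throw when an approach fails.

```typescript
type Approach = 'primary' | 'alternative' | 'vision';

const ESCALATION_ORDER: readonly Approach[] = ['primary', 'alternative', 'vision'];

interface EscalationResult {
  success: boolean;
  succeededWith?: Approach;
  attempts: number;
  errors: string[];
}

// Tries each approach once, in order, and stops at the first success.
// Errors from earlier approaches are passed along so that a later attempt
// (for example a regenerated selector) can take them into account.
async function runWithEscalation(
  attempt: (approach: Approach, previousErrors: readonly string[]) => Promise<void>,
  order: readonly Approach[] = ESCALATION_ORDER
): Promise<EscalationResult> {
  const errors: string[] = [];
  for (const approach of order) {
    try {
      await attempt(approach, errors);
      return { success: true, succeededWith: approach, attempts: errors.length + 1, errors };
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      errors.push(`${approach}: ${message}`);
    }
  }
  return { success: false, attempts: errors.length, errors };
}
```

Because the loop runs each approach exactly once, the attempt count per command is bounded by the length of `order`, which is what makes the retry budget predictable.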
### 4. **Context-Aware Command Generation** 🧠

Use previous step context more effectively:

```typescript
async generateCommandWithContext(
  stepDescription: string,
  pageInfo: PageInfo,
  previousSuccessfulCommands: string[],  // What worked before
  recentPatterns: PatternMatch[]         // Patterns detected recently
): Promise<Command> {

  const prompt = `
    Previous successful patterns in this session:
    - Username field used: page.fill('[name="username"]', ...)
    - Buttons clicked using: page.getByRole('button', ...)

    Use similar approaches for consistency.

    Current step: ${stepDescription}
  `;
  // (LLM call and response parsing omitted)
}
```

**Benefits:**
- Commands consistent within session
- Learns what selectors work on this app
- Faster convergence

### 5. **Confidence-Based Execution** 📊

Add confidence scoring to guide execution:

```typescript
interface CommandWithConfidence {
  code: string;
  confidence: number;  // 0-1
  reasoning: string;
}

// Execution logic:
if (command.confidence < 0.6) {
  // Low confidence - use vision upfront instead of trying and failing
  logger.log("Low confidence, requesting vision analysis before attempting");
  const visionDiagnostics = await getVisionDiagnostics(...);
  command = regenerateWithVisionInsights(visionDiagnostics);
} else if (command.confidence < 0.8) {
  // Medium confidence - try but have vision ready
  logger.log("Medium confidence, will use vision if this fails");
} else {
  // High confidence - proceed normally
  logger.log("High confidence, executing");
}
```

**Benefits:**
- Vision used proactively when LLM is uncertain
- Fewer wasted attempts on guesses
- More human-like (ask for help when unsure)

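The 0.6 and 0.8 thresholds can be isolated in a pure function so that the routing decision is easy to unit-test separately from the vision and execution calls. This is a sketch: `chooseStrategy` and the strategy names are hypothetical, and treating an out-of-range or non-numeric score as the lowest confidence is an assumption, not something this plan specifies.

```typescript
type ExecutionStrategy = 'vision-first' | 'execute-with-vision-fallback' | 'execute';

const LOW_CONFIDENCE = 0.6;
const HIGH_CONFIDENCE = 0.8;

// Maps an LLM-reported confidence (expected range 0-1) to an execution strategy.
function chooseStrategy(confidence: number): ExecutionStrategy {
  // Assumption: a malformed score is handled like the lowest confidence.
  if (!Number.isFinite(confidence) || confidence < 0 || confidence > 1) {
    return 'vision-first';
  }
  if (confidence < LOW_CONFIDENCE) return 'vision-first';
  if (confidence < HIGH_CONFIDENCE) return 'execute-with-vision-fallback';
  return 'execute';
}

for (const c of [0.2, 0.6, 0.75, 0.8, 0.95]) {
  console.log(c, chooseStrategy(c));
}
```

Both boundaries are inclusive on the upper band, matching the `<` comparisons in the execution logic: a score of exactly 0.6 is medium and exactly 0.8 is high.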
### 6. **Simplified Goal Model** 🎯

Remove sub-action pattern, use clearer model:

```typescript
// Current (complex):
// Step → Sub-actions (1-5) → Attempts (0-3) → Goal checks after each sub-action
// = nested loops, hard to reason about

// Proposed (simple):
// Step → Plan (upfront) → Execute commands → Check completion once
// = linear flow, easy to understand

async executeStep(step: Step, page: Page): Promise<StepResult> {
  // 1. Plan the step (single LLM call)
  const plan = await planStepExecution(step.description, getPageInfo(page));

  // 2. Execute each command in plan
  for (const command of plan.commands) {
    try {
      await executeCommand(command, page);
    } catch (error) {
      // If any command fails, replan with error context
      if (plan.overallConfidence > 0.8) {
        // Was confident, try alternative selector
        const alt = await regenerateCommand(command, error, page);
        await executeCommand(alt, page);
      } else {
        // Was uncertain, use vision
        const vision = await getVisionGuidedCommand(command, error, page);
        await executeCommand(vision, page);
      }
    }
  }

  // 3. Single goal check at end
  return checkGoalCompletion(step.description, plan.commands, page);
}
```

**Benefits:**
- Predictable execution flow
- Fewer LLM calls (plan once, not per sub-action)
- Easier to debug and reason about
- More human-like (plan → execute → verify)

### 7. **Smart DOM Analysis** 🔍

Better upfront DOM understanding:

```typescript
interface PageUnderstanding {
  pageType: 'login' | 'dashboard' | 'form' | 'modal' | 'list' | 'unknown';
  primaryInteractiveElements: Element[];
  forms: FormInfo[];
  modals: ModalInfo[];
  navigation: NavigationInfo[];
  confidence: number;
}

async analyzePageContext(page: Page): Promise<PageUnderstanding> {
  const pageInfo = await getEnhancedPageInfo(page);

  // Detect page type and structure
  const analysis = await llm.analyzePage(pageInfo);

  return {
    pageType: 'login',  // Detected
    forms: [{ fields: ['username', 'password'], submitButton: 'Login to Continue' }],
    confidence: 0.95
  };
}
```

**Usage:**
```typescript
const pageContext = await analyzePageContext(page);
if (pageContext.pageType === 'login' && step.description.includes('login')) {
  // High confidence - use login pattern
  const form = pageContext.forms[0];
  return generateLoginCommands(form, credentials);
}
```

**Benefits:**
- Understands page structure upfront
- Can detect common patterns automatically
- Informs better command generation
- More deterministic (same page type → same approach)

### 8. **Reduce Noise in Execution** 🔇

**Current issues:**
- Too many goal completion checks (after every sub-action)
- Too many retry variations
- Too much back-and-forth with LLM

**Improvement:**
```typescript
// Instead of:
// fill(username) → goal check → fill(password) → goal check → click(login) → goal check
// = 6 operations (3 actions + 3 checks)

// Do:
// Plan: [fill(username), fill(password), click(login)]
// Execute all → single goal check
// = 4 operations (3 actions + 1 check)
```

**Benefits:**
- 50% fewer LLM calls for multi-action steps
- Faster execution
- More predictable

### 9. **Visual Scanning Before Action** 👁️

Add optional visual scan for complex steps:

```typescript
async executeComplexStep(step: Step, page: Page): Promise<Result> {
  // Check if step seems complex/ambiguous
  const complexity = assessStepComplexity(step.description);

  if (complexity === 'high' && visualBudgetAvailable) {
    // Proactive vision for complex steps
    logger.log("Complex step detected - using vision upfront for better planning");
    const pageInfo = getPageInfo(page);  // DOM summary passed alongside the screenshot
    const screenshot = await page.screenshot();
    const plan = await generatePlanWithVision(step, pageInfo, screenshot);
    return executePlan(plan);
  } else {
    // Standard DOM-based approach
    return executeStepStandard(step, page);
  }
}
```

**When to use:**
- Ambiguous descriptions ("click the settings icon")
- Multiple similar elements expected
- Visual-heavy interactions
- Previous similar steps failed

**Benefits:**
- Fewer wasted attempts
- Higher first-attempt success rate
- More human-like (look before you leap)

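The "When to use" criteria could be turned into a simple heuristic before any scoring model exists. The sketch below is purely illustrative: the keyword list, the `ComplexitySignals` shape, and the function name `assessComplexityFromSignals` are all assumptions, and it takes a richer input than the string-only `assessStepComplexity(step.description)` call in the plan.

```typescript
type Complexity = 'low' | 'high';

// Hypothetical inputs mirroring the "When to use" criteria.
interface ComplexitySignals {
  description: string;              // step text, e.g. "click the settings icon"
  candidateElementCount: number;    // DOM elements that plausibly match the step
  previousSimilarStepFailed: boolean;
}

// Assumed vocabulary hinting that the target is identified visually rather than by text.
const VISUAL_WORDS = ['icon', 'image', 'logo', 'avatar', 'chart', 'color'];

function assessComplexityFromSignals(signals: ComplexitySignals): Complexity {
  const text = signals.description.toLowerCase();
  const mentionsVisualTarget = VISUAL_WORDS.some((word) => text.includes(word));
  const ambiguousTarget = signals.candidateElementCount > 1;
  return mentionsVisualTarget || ambiguousTarget || signals.previousSimilarStepFailed
    ? 'high'
    : 'low';
}

console.log(
  assessComplexityFromSignals({
    description: 'Click the settings icon',
    candidateElementCount: 1,
    previousSimilarStepFailed: false,
  })
);
```

A heuristic like this errs toward spending vision budget; the `visualBudgetAvailable` guard in the plan would still cap how often it is used.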
### 10. **Session Memory** 💭

Remember what works across steps:

```typescript
interface SessionMemory {
  workingSelectors: Map<string, string>;  // "login button" → "button[name='login']"
  pagePatterns: string[];                 // "This app uses data-testid"
  commonElements: Element[];              // Persistent nav/header elements
  successfulStrategies: string[];         // What approaches worked
}

// Example usage:
if (sessionMemory.workingSelectors.has('login button')) {
  // We found login button before, use same selector
  const selector = sessionMemory.workingSelectors.get('login button');
  return `await page.click('${selector}')`;
}

if (sessionMemory.pagePatterns.includes('uses-data-testid')) {
  // This app uses data-testid, prefer that
  return `await page.click('[data-testid="login"]')`;
}
```

**Benefits:**
- Consistency across steps
- Learn from early steps
- Deterministic within session
- More human-like (remember what worked)

## Recommended Implementation Priority

### Phase 1: **Low-Hanging Fruit** (Quick Wins)

1. ✅ **Reduce retry budget**
   - MAX_SUBACTIONS_PER_STEP: 5 → 3 (humans try 2-3 approaches max)
   - MAX_RETRIES_PER_STEP: 3 → 2 (3 retries per sub-action → 2)
   - MAX_FAILED_ATTEMPTS_PER_STEP: 12 → 8

2. ✅ **Confidence scoring**
   - Add confidence field to LLM responses
   - Trigger vision earlier for low confidence (< 0.6)
   - Log confidence for debugging

3. ✅ **Session memory (simple version)**
   - Track successful selectors in current session
   - Reuse patterns that worked
   - Pass to LLM as context

### Phase 2: **Structural Improvements** (Medium Effort)

4. **Upfront step planning**
   - New API: `planStepExecution()`
   - Generate all commands for step at once
   - Execute plan without goal checks between commands
   - Single goal check at end

5. **Pattern detection**
   - Detect common patterns (login, search, modal, dropdown)
   - Use templates for high-confidence matches
   - Fallback to LLM for novel cases

6. **Smarter retry strategy**
   - Replace sub-action approach with: primary → alternative → vision
   - Clear escalation path
   - Fail faster

### Phase 3: **Advanced Features** (Future)

7. **Proactive vision for complex steps**
   - Complexity scoring
   - Visual scan before attempting
   - Better first-attempt success

8. **Cross-session learning**
   - Persist patterns across sessions
   - Build app-specific knowledge base
   - Improve over time

9. **Multi-step lookahead**
   - Plan multiple steps at once when related
   - Optimize for workflow efficiency
   - Batch similar operations

## Immediate Actionable Changes

### Change 1: Reduce Retry Budget (More Human-Like)

```typescript
// In scenario-worker-class.ts
const MAX_RETRIES_PER_STEP = 2;          // Was 3 → Now 2 retries per approach (3 tries total)
const MAX_SUBACTIONS_PER_STEP = 3;       // Was 5 → Now 3 different approaches max
const MAX_FAILED_ATTEMPTS_PER_STEP = 8;  // Was 12 → Now 8 total failures max
```

**Rationale:** Humans try 2-3 things, not 12. Fail faster = faster feedback.

### Change 2: Add Confidence Scoring

**Modify LLM response:**
```typescript
interface LLMPlaywrightCommandResponse {
  command: string;
  reasoning: string;
  confidence: number;          // NEW: 0-1 scale
  uncertaintyReason?: string;  // NEW: Why low confidence
}
```

**Prompt addition:**
```
Additionally, provide a confidence score (0-1):
- 1.0: Element clearly identified in DOM with unique selector
- 0.8: Element found but selector might not be unique
- 0.6: Multiple possible elements, best guess
- 0.4: Element not clearly in DOM, inferring from context
- 0.2: Guessing, likely to fail

If confidence < 0.7, explain uncertainty.
```

**Usage:**
```typescript
const result = await generatePlaywrightCommand(...);
if (result.confidence < 0.6 && !usedVisionMode) {
  logger.log(`⚠️ Low confidence (${result.confidence}) - using vision proactively`);
  // Skip attempts, go straight to vision
  const visionResult = await getVisionGuidedCommand(...);
  return visionResult;
}
```

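Because the confidence value comes from the model, it is worth validating before it drives any decision. The sketch below assumes the LLM returns a JSON object with the fields of `LLMPlaywrightCommandResponse`; `parseCommandResponse` is a hypothetical helper, and defaulting a missing confidence to 0 (so that callers escalate to vision) is an assumption rather than documented behavior.

```typescript
interface LLMPlaywrightCommandResponse {
  command: string;
  reasoning: string;
  confidence: number;          // 0-1 scale
  uncertaintyReason?: string;  // Why low confidence
}

// Parses and sanitizes a raw JSON reply from the LLM.
function parseCommandResponse(raw: string): LLMPlaywrightCommandResponse {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== 'object' || data === null) {
    throw new Error('LLM response is not a JSON object');
  }
  const record = data as Record<string, unknown>;

  const command = record['command'];
  if (typeof command !== 'string' || command.trim() === '') {
    throw new Error('LLM response has no usable "command" field');
  }

  const reasoning = record['reasoning'];
  const confidence = record['confidence'];
  const uncertaintyReason = record['uncertaintyReason'];

  // Assumption: a missing or non-numeric score counts as no confidence at all.
  const numeric =
    typeof confidence === 'number' && Number.isFinite(confidence) ? confidence : 0;

  const result: LLMPlaywrightCommandResponse = {
    command,
    reasoning: typeof reasoning === 'string' ? reasoning : '',
    confidence: Math.min(1, Math.max(0, numeric)), // clamp into [0, 1]
  };
  if (typeof uncertaintyReason === 'string') {
    result.uncertaintyReason = uncertaintyReason;
  }
  return result;
}
```

With the score clamped and defaulted, the `result.confidence < 0.6` check in the usage snippet behaves sensibly even when the model omits the field or returns a value outside the requested scale.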
### Change 3: Session Memory (Simple Version)

```typescript
class SessionContext {
  successfulSelectors: Map<string, string> = new Map();
  appPatterns: Set<string> = new Set();  // 'uses-data-testid', 'uses-aria-labels', etc.

  recordSuccess(elementDescription: string, selector: string) {
    this.successfulSelectors.set(elementDescription, selector);
  }

  getContext(): string {
    if (this.successfulSelectors.size === 0) return '';

    return `
Selectors that worked in this session:
${Array.from(this.successfulSelectors.entries())
  .map(([desc, sel]) => `- "${desc}": ${sel}`)
  .join('\n')}

Try similar patterns for consistency.
`;
  }
}
```

**Add to command generation prompt:**
```typescript
${sessionContext.getContext()}
```

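The `SessionContext` class declares `appPatterns` but never fills it. One way to populate it is to derive the labels from the selectors that have already succeeded, as in the sketch below. `inferAppPatterns` is a hypothetical helper; the two labels are the ones named in the class comment, and the substring checks are a deliberately crude assumption.

```typescript
// Derives app-level selector conventions from selectors that have already worked.
function inferAppPatterns(selectors: readonly string[]): Set<string> {
  const patterns = new Set<string>();
  for (const selector of selectors) {
    if (selector.includes('data-testid')) patterns.add('uses-data-testid');
    if (selector.includes('aria-label')) patterns.add('uses-aria-labels');
  }
  return patterns;
}

const inferred = inferAppPatterns([
  '[data-testid="login"]',
  'button[aria-label="Close"]',
  '#submit',
]);
console.log(Array.from(inferred));
```

With the class above, this could be called as `inferAppPatterns(Array.from(sessionContext.successfulSelectors.values()))` after each recorded success, and the result stored in `appPatterns`.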
### Change 4: Skip Sub-Action Loop for High Confidence

```typescript
// After planning, if the plan's overall confidence is high (> 0.8):
if (plan.overallConfidence > 0.8) {
  // Execute entire plan without intermediate checks
  for (const cmd of plan.commands) {
    await execute(cmd);
  }
  // Single goal check at end
  return checkGoalCompletion(step, plan.commands, page);
} else {
  // Lower confidence - use current sub-action approach for safety
  return executeWithSubActions(step, page);
}
```

## Expected Improvements

### Metrics:

| Metric | Current | After Phase 1 | After Phase 2 |
|--------|---------|---------------|---------------|
| Avg LLM calls per step | 4-6 | 3-4 | 1-2 |
| Max attempts per step | 20 | 8 | 6 |
| Success rate (1st attempt) | ~40% | ~50% | ~70% |
| Time per step | 30-60s | 20-40s | 10-20s |
| Determinism | Low | Medium | High |

### Behavioral Changes:

**Login scenario:**

**Current:**
```
1. fill(user) → check goal → incomplete
2. fill(pass) → check goal → incomplete
3. click(login) → check goal → complete
= 3 actions + 3 goal checks = 6 operations
```

**After Phase 1 (confidence + memory):**
```
1. fill(user) → check goal → incomplete (but with confidence: 0.9)
2. fill(pass) → check goal → incomplete (confidence: 0.9)
3. click(login) → check goal → complete (confidence: 0.85)
= 3 actions + 3 checks, but with confidence signals
```

**After Phase 2 (planning):**
```
1. Plan: [fill(user), fill(pass), click(login)] (confidence: 0.9)
2. Execute all 3 commands
3. Single goal check → complete
= 1 plan + 3 actions + 1 check = 5 operations
```

## Summary

**Core Philosophy Shift:**

**From:** Reactive trial-and-error with many retries
**To:** Proactive planning with fewer, smarter attempts

**Most Important:**
1. Plan before acting (not generate-execute-check loop)
2. Use confidence to guide strategy
3. Remember what works
4. Fail faster (humans don't retry 12 times)
5. Recognize common patterns

**Most Human-Like = Most Deterministic**
- Humans plan, then execute
- Humans learn from experience
- Humans don't endlessly retry
- Humans recognize familiar patterns
- Humans ask for help when uncertain

These improvements make the system more deterministic, faster, cheaper, and more aligned with how humans actually interact with web applications.