testchimp-runner-core 0.0.35 → 0.0.36

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (71)
  1. package/package.json +6 -1
  2. package/plandocs/BEFORE_AFTER_VERIFICATION.md +0 -148
  3. package/plandocs/COORDINATE_MODE_DIAGNOSIS.md +0 -144
  4. package/plandocs/CREDIT_CALLBACK_ARCHITECTURE.md +0 -253
  5. package/plandocs/HUMAN_LIKE_IMPROVEMENTS.md +0 -642
  6. package/plandocs/IMPLEMENTATION_STATUS.md +0 -108
  7. package/plandocs/INTEGRATION_COMPLETE.md +0 -322
  8. package/plandocs/MULTI_AGENT_ARCHITECTURE_REVIEW.md +0 -844
  9. package/plandocs/ORCHESTRATOR_MVP_SUMMARY.md +0 -539
  10. package/plandocs/PHASE1_ABSTRACTION_COMPLETE.md +0 -241
  11. package/plandocs/PHASE1_FINAL_STATUS.md +0 -210
  12. package/plandocs/PHASE_1_COMPLETE.md +0 -165
  13. package/plandocs/PHASE_1_SUMMARY.md +0 -184
  14. package/plandocs/PLANNING_SESSION_SUMMARY.md +0 -372
  15. package/plandocs/PROMPT_OPTIMIZATION_ANALYSIS.md +0 -120
  16. package/plandocs/PROMPT_SANITY_CHECK.md +0 -120
  17. package/plandocs/SCRIPT_CLEANUP_FEATURE.md +0 -201
  18. package/plandocs/SCRIPT_GENERATION_ARCHITECTURE.md +0 -364
  19. package/plandocs/SELECTOR_IMPROVEMENTS.md +0 -139
  20. package/plandocs/SESSION_SUMMARY_v0.0.33.md +0 -151
  21. package/plandocs/TROUBLESHOOTING_SESSION.md +0 -72
  22. package/plandocs/VISION_DIAGNOSTICS_IMPROVEMENTS.md +0 -336
  23. package/plandocs/VISUAL_AGENT_EVOLUTION_PLAN.md +0 -396
  24. package/plandocs/WHATS_NEW_v0.0.33.md +0 -183
  25. package/plandocs/exploratory-mode-support-v2.plan.md +0 -953
  26. package/plandocs/exploratory-mode-support.plan.md +0 -928
  27. package/plandocs/journey-id-tracking-addendum.md +0 -227
  28. package/releasenotes/RELEASE_0.0.26.md +0 -165
  29. package/releasenotes/RELEASE_0.0.27.md +0 -236
  30. package/releasenotes/RELEASE_0.0.28.md +0 -286
  31. package/src/auth-config.ts +0 -84
  32. package/src/credit-usage-service.ts +0 -188
  33. package/src/env-loader.ts +0 -103
  34. package/src/execution-service.ts +0 -996
  35. package/src/file-handler.ts +0 -104
  36. package/src/index.ts +0 -432
  37. package/src/llm-facade.ts +0 -821
  38. package/src/llm-provider.ts +0 -53
  39. package/src/model-constants.ts +0 -35
  40. package/src/orchestrator/decision-parser.ts +0 -139
  41. package/src/orchestrator/index.ts +0 -58
  42. package/src/orchestrator/orchestrator-agent.ts +0 -1282
  43. package/src/orchestrator/orchestrator-prompts.ts +0 -786
  44. package/src/orchestrator/page-som-handler.ts +0 -1565
  45. package/src/orchestrator/som-types.ts +0 -188
  46. package/src/orchestrator/tool-registry.ts +0 -184
  47. package/src/orchestrator/tools/check-page-ready.ts +0 -75
  48. package/src/orchestrator/tools/extract-data.ts +0 -92
  49. package/src/orchestrator/tools/index.ts +0 -15
  50. package/src/orchestrator/tools/inspect-page.ts +0 -42
  51. package/src/orchestrator/tools/recall-history.ts +0 -72
  52. package/src/orchestrator/tools/refresh-som-markers.ts +0 -69
  53. package/src/orchestrator/tools/take-screenshot.ts +0 -128
  54. package/src/orchestrator/tools/verify-action-result.ts +0 -159
  55. package/src/orchestrator/tools/view-previous-screenshot.ts +0 -103
  56. package/src/orchestrator/types.ts +0 -291
  57. package/src/playwright-mcp-service.ts +0 -224
  58. package/src/progress-reporter.ts +0 -144
  59. package/src/prompts.ts +0 -842
  60. package/src/providers/backend-proxy-llm-provider.ts +0 -91
  61. package/src/providers/local-llm-provider.ts +0 -38
  62. package/src/scenario-service.ts +0 -252
  63. package/src/scenario-worker-class.ts +0 -1110
  64. package/src/script-utils.ts +0 -203
  65. package/src/types.ts +0 -239
  66. package/src/utils/browser-utils.ts +0 -348
  67. package/src/utils/coordinate-converter.ts +0 -162
  68. package/src/utils/page-info-retry.ts +0 -65
  69. package/src/utils/page-info-utils.ts +0 -285
  70. package/testchimp-runner-core-0.0.35.tgz +0 -0
  71. package/tsconfig.json +0 -19
@@ -1,844 +0,0 @@
# Multi-Agent Architecture: Critical Review & Phased Approach

## Executive Summary

**Architecture**: Single orchestrator agent with extensible tools, journey memory, and self-reflection
**Goal**: Replace reactive command-by-command execution with proactive, tool-based decision-making
**Compatibility**: Internal refactor only; external API unchanged

---

## Strengths of the Design

### 1. Human-Like Operation ✅
- Always-provided context mirrors human awareness
- Tools model human information gathering
- Self-reflection models metacognition
- Experience accumulation models learning

### 2. Extensibility ✅
- Dynamic tool registry → add tools without prompt changes
- Tool descriptions auto-included in prompts
- Easy to add new capabilities

### 3. Flexibility ✅
- Works for both generation and repair modes
- Configurable guardrails per job
- Agent + system termination (soft + hard limits)

### 4. Efficiency ✅
- Batch command planning (fewer iterations)
- Always-provided context (no repeated tool calls for the same info)
- Recent-memory window prevents bloat

---

## Potential Pitfalls & Mitigations

### Pitfall 1: Tool Call Overhead

**Problem**: If the agent calls tools every iteration, we get:
```
Iteration 1: Agent call → Tool calls (DOM, screenshot) → Agent call with results
= 2 LLM calls + tool execution time
```

**Current Mitigation**:
- ✅ DOM always provided (no tool call)
- ✅ Screenshot available freely (not expensive)
- ⚠️ Risk: Agent might overuse other tools

**Improvement**:
```typescript
// Track tool usage patterns
if (agent.calledSameToolLast3Iterations('take_screenshot')) {
  logger.warn('Agent overusing screenshot, limiting availability');
  // Only provide the screenshot tool every other iteration
}
```

**For MVP**: Accept some tool overhead; optimize in Phase 2

---

### Pitfall 2: Self-Reflection Spiral

**Problem**: Agent reflection might reinforce wrong assumptions:
```
Iteration 1: "Focus on finding 'Submit' button"
Iteration 2: "Still focusing on 'Submit' button, try different selector"
Iteration 3: "Submit button must exist, trying again..."
(Agent stuck on non-existent element)
```

**Current Mitigation**:
- ✅ System can override reflection after N failures
- ✅ Fresh DOM each iteration (reality check)
- ⚠️ Risk: Agent ignores fresh context, follows stale reflection

**Improvement**:
```typescript
// Detect reflection loops
if (lastReflection.focus === currentReflection.focus && failureCount > 2) {
  context.systemNote = "OVERRIDE: Your focus isn't working. Try a completely different approach.";
  context.previousIterationGuidance = null; // Clear the bad reflection
}
```

**For MVP**: Start without self-reflection; add in Phase 2 with loop detection

---
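The override check above can be reduced to a pure predicate, which makes the threshold easy to unit-test in isolation. A minimal sketch; the `Reflection` shape and the 2-failure threshold are illustrative assumptions, not the shipped types:

```typescript
interface Reflection {
  focus: string;
}

// Returns true when the agent has kept the same focus across consecutive
// failing iterations and the system should force a reset.
// `maxRepeats` is the number of failures tolerated before the override.
function isReflectionLoop(
  last: Reflection | null,
  current: Reflection,
  failureCount: number,
  maxRepeats = 2
): boolean {
  return last !== null && last.focus === current.focus && failureCount > maxRepeats;
}
```

Keeping the check pure means the same predicate can later drive both the system-side override and the agent's own `detectingLoop` signal.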

### Pitfall 3: Exploratory Actions Cause Side Effects

**Problem**: "Safe" actions might not be safe:
```
Click info icon → Opens modal → Blocks page → Can't recover
Hover over menu → Menu stays open → Interferes with next action
Click dropdown → Triggers onChange event → Unexpected state change
```

**Current Mitigation**:
- ✅ Limited action types (hover, click_menu, click_info, focus)
- ✅ Screenshot after exploration to see state
- ⚠️ Risk: No way to undo exploratory actions

**Improvement**:
```typescript
// Snapshot before exploration
const beforeState = await page.evaluate(() => ({
  url: window.location.href,
  activeElement: document.activeElement?.tagName
}));

// Execute exploration
await explore();

// Check if state changed unexpectedly (same snapshot as above)
const afterState = await page.evaluate(() => ({
  url: window.location.href,
  activeElement: document.activeElement?.tagName
}));
if (beforeState.url !== afterState.url) {
  logger.error('Exploration caused navigation - aborting');
  await page.goBack();
}
```

**For MVP**: Start without exploratory actions; add in Phase 2 with state validation

---

### Pitfall 4: Memory Bloat

**Problem**: Long scenarios create huge history:
```
50 steps × 5 iterations/step = 250 memory entries
Each entry: action (50 chars) + code (100 chars) + observation (100 chars)
= 62,500 characters in memory
```

**Current Mitigation**:
- ✅ Always-provided: recent 6-7 steps only
- ✅ Tool for deeper history
- ⚠️ Risk: Full history grows unbounded

**Improvement**:
```typescript
interface AgentConfig {
  maxHistorySize?: number;  // Trigger summarization above this. Default: 100
  keepRecentSteps?: number; // Recent entries kept verbatim. Default: 50
}

// Once history exceeds the cap, summarize everything older than the recent window
if (history.length > config.maxHistorySize) {
  const oldSteps = history.slice(0, -config.keepRecentSteps);
  const summary = await llm.summarize(oldSteps);
  history = [{ ...summary, isSummary: true }, ...history.slice(-config.keepRecentSteps)];
}
```

**For MVP**: Cap history at 100 steps; add summarization in Phase 2

---
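The MVP's hard cap (before any LLM summarization lands) can be expressed as a pure function over the history array. A sketch under assumed types; the placeholder-summary entry stands in for the Phase 2 summarization call:

```typescript
interface HistoryEntry {
  action: string;
  isSummary?: boolean;
}

// Keep at most `maxSize` entries; collapse anything older into a single
// placeholder entry so the agent still knows earlier steps happened.
function capHistory(history: HistoryEntry[], maxSize = 100): HistoryEntry[] {
  if (history.length <= maxSize) return history;
  const dropped = history.length - (maxSize - 1);
  return [
    { action: `[summary of ${dropped} earlier steps]`, isSummary: true },
    ...history.slice(-(maxSize - 1)),
  ];
}
```

Because it is pure, the cap can be applied after every memory update without any async plumbing; swapping the placeholder for a real LLM summary later only changes the first entry.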

### Pitfall 5: Prompt Complexity

**Problem**: Always-provided context + tool descriptions + self-reflection → large prompts

**Estimated prompt size:**
```
- System prompt: 500 tokens (with tool descriptions)
- Always-provided context: 1200-2000 tokens
- Goals & progress: 100 tokens
- DOM (optimized with increased limits): 800-1500 tokens
  * getEnhancedPageInfo limits:
    - ARIA depth: 4 levels max
    - Interactive elements: top 50
    - IDs: top 50
    - Data attrs: top 50
    - Form fields: top 20
    - Text: 30 chars max
- Recent 6-7 steps: 300-500 tokens (50-80 tokens each)
- Experiences: 100-200 tokens (~10 learnings)
- Self-reflection: 100 tokens
- Tool results (if any): 300-500 tokens
Total: 2400-3600 tokens per iteration
```

**Current Mitigation**:
- ✅ DOM pre-truncated by getEnhancedPageInfo (already compact!)
- ✅ Recent steps only (6-7, not all)
- ✅ Experiences capped at 20
- ⚠️ Risk: Still substantial with tool results

**Validation Needed**:
```typescript
// Log actual prompt sizes during development
logger.debug(`Prompt tokens: system=${systemTokens}, user=${userTokens}, total=${total}`);

// If it exceeds the threshold, warn
if (total > 3000) {
  logger.warn(`Large prompt: ${total} tokens`);
}
```

**For MVP**: Increased to ~2,400-3,600 tokens per iteration to support complex pages. An acceptable trade-off for better agent awareness; monitor and optimize in Phase 2 if it becomes an issue.

---
206
-
207
- ### Pitfall 6: Agent Ignores Available Information
208
-
209
- **Problem**: Agent has DOM in context but still calls inspect_page tool
210
- ```
211
- Agent: "toolCalls": [{"name": "inspect_page"}]
212
- System: (But DOM was just provided...)
213
- ```
214
-
215
- **Current Mitigation**:
216
- - ✅ Prompt says "Current DOM snapshot" in always-provided
217
- - ⚠️ Risk: Agent doesn't realize info is already available
218
-
219
- **Improvement**:
220
- ```typescript
221
- // System validates tool calls
222
- if (decision.toolCalls?.includes('inspect_page') && domProvidedThisIteration) {
223
- logger.warn('SYSTEM: inspect_page unnecessary, DOM already provided');
224
- decision.toolCalls = decision.toolCalls.filter(t => t.name !== 'inspect_page');
225
- }
226
- ```
227
-
228
- **For MVP**: Log warnings but allow redundant calls, add validation Phase 2
229
-
230
- ---

### Pitfall 7: Batch Execution Waste

**Problem**: Agent plans 5 commands, the first one fails, the remaining 4 are wasted:
```
Batch: [cmd1, cmd2, cmd3, cmd4, cmd5]
Execute cmd1 → FAIL
Skip cmd2-5
(Agent had to think of all 5, but only 1 executed)
```

**Current Mitigation**:
- ✅ Sequential execution prevents cascade failures
- ⚠️ Risk: Wasted planning effort

**Improvement**:
```typescript
// Adaptive batch size
if (successRate < 0.5) {
  config.maxCommandsPerIteration = 2; // Plan fewer when failing
} else {
  config.maxCommandsPerIteration = 5; // Plan more when succeeding
}
```

**For MVP**: Accept some waste; optimize in Phase 2

---
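The adaptive sizing above can be isolated into a small helper so the thresholds are testable on their own. The 0.5 cutoff and the 2/3/5 sizes mirror the numbers discussed in this document; the function name and the no-data default are illustrative assumptions:

```typescript
// Pick the next iteration's batch size from recent command outcomes.
// Shrink when commands are failing, grow when they succeed.
function adaptiveBatchSize(succeeded: number, attempted: number): number {
  if (attempted === 0) return 3; // no data yet: default mid-size batch
  const rate = succeeded / attempted;
  return rate < 0.5 ? 2 : 5;
}
```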

### Pitfall 8: Screenshot Token Cost (CORRECTED)

**Actual cost** (based on OpenAI vision pricing):

**For gpt-4.1-mini / gpt-5-mini:**
- Viewport screenshot (1920x1080): ~1024-1452 tokens
- Full page (1920x3000): ~1536 tokens (capped)

**For gpt-4o / gpt-4.1:**
- Viewport (1920x1080): ~1105 tokens
- Full page: ~1360 tokens

**Calculation for 1920x1080:**
```
1. Scale to fit 2048x2048: No scaling (already fits)
2. Scale shortest side to 768px: 1080→768, 1920→1365
3. Tiles: ceil(1365/32) × ceil(768/32) = 43 × 24 = 1032 patches
4. For gpt-4.1-mini: 1032 × 1.62 ≈ 1672 tokens (capped at 1536)
```

**Conclusion**: Screenshots are **1K-2K tokens**, NOT 100K!

**Impact**:
- ✅ Very affordable (comparable to providing extra DOM context)
- ✅ No budget needed - agent can use freely
- ✅ Vision mode viable for most steps if helpful
- ✅ Tool call overhead acceptable

**For MVP**: Use screenshots liberally, no artificial limits

---
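The calculation above can be checked mechanically. A sketch of the three scaling steps exactly as quoted here; treat the 1.62 multiplier and 1536 cap as this document's assumptions about gpt-4.1-mini pricing rather than guaranteed figures:

```typescript
// Step 1: fit within 2048x2048, Step 2: scale the shortest side to 768px.
function scaleForVision(width: number, height: number): [number, number] {
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = Math.floor(width * fit);
  let h = Math.floor(height * fit);
  const short = Math.min(w, h);
  if (short > 768) {
    w = Math.floor((w * 768) / short);
    h = Math.floor((h * 768) / short);
  }
  return [w, h];
}

// Step 3: count 32px patches over the scaled image.
function estimatePatches(width: number, height: number): number {
  const [w, h] = scaleForVision(width, height);
  return Math.ceil(w / 32) * Math.ceil(h / 32);
}

// gpt-4.1-mini figures quoted above: 1.62 per-patch multiplier, 1536 cap.
function estimateMiniTokens(width: number, height: number): number {
  return Math.min(1536, Math.round(estimatePatches(width, height) * 1.62));
}
```

Running this for a 1920x1080 viewport reproduces the 1032-patch figure and the 1536-token cap from the worked example.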

## Phased Implementation Strategy

### MVP (Phase 1): Core Agent Without Advanced Features

**Include:**
- ✅ OrchestratorAgent class
- ✅ Dynamic ToolRegistry
- ✅ 5 essential tools: inspect_page, take_screenshot, recall_history, extract_data, check_page_ready
- ✅ Journey memory (history + experiences + extracted data)
- ✅ Always-provided context (overall goal, current goal, DOM, recent 6-7 steps)
- ✅ Self-reflection (free-form guidance to next iteration)
- ✅ Loop detection (agent detects its own spirals via the detectingLoop flag)
- ✅ Batch command planning (max 3-5)
- ✅ Sequential execution (stop on first failure)
- ✅ Experience accumulation (learnings)
- ✅ System guardrails (iteration limits, no screenshot budget)
- ✅ Agent termination (complete/stuck/infeasible)
- ✅ Comprehensive logging (all reasoning visible)

**Exclude (Phase 2):**
- ❌ Exploratory actions (safety concerns)
- ❌ Advanced tool validation
- ❌ Tool result caching
- ❌ Memory summarization

**Benefits**:
- Complete feature set (tools + memory + reflection + learning)
- Agent maintains its train of thought via self-reflection
- Agent self-corrects via loop detection
- Still simpler than a 2-agent architecture
- Comprehensive logging for debugging

**Expected metrics**:
- LLM calls/step: 2-4 (vs 4-6 current)
- Iterations/step: 3-5 (vs 8-12 current)
- Tool calls/step: 1-3
- Commands/iteration: 2-3 (batched)
- Agent learns: 1-2 experiences per step

---
333
- ### Phase 2: Add Learning & Reflection
334
-
335
- **Add:**
336
- - ✅ Self-reflection (previousIterationGuidance)
337
- - ✅ Experience accumulation
338
- - ✅ extract_data tool
339
- - ✅ Experience deduplication
340
- - ✅ Reflection loop detection
341
-
342
- **Benefits**:
343
- - Agent learns patterns
344
- - Continuity across iterations
345
- - Better context for future steps
346
-
347
- **Risks**:
348
- - Reflection spirals (mitigated with loop detection)
349
- - Experience bloat (mitigated with deduplication)
350
-
351
- ---
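The experience-deduplication item above can be as simple as a normalized-text set that keeps the first phrasing of each learning. A sketch; the normalization rule (lowercase, strip punctuation) is an illustrative assumption:

```typescript
// Collapse near-identical learnings ("Forms use #id selectors" vs
// "forms use #id selectors.") into a single entry, keeping first phrasing.
function dedupeExperiences(experiences: string[]): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  for (const exp of experiences) {
    const key = exp.toLowerCase().replace(/[^a-z0-9 ]/g, "").trim();
    if (!seen.has(key)) {
      seen.add(key);
      result.push(exp);
    }
  }
  return result;
}
```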

### Phase 3: Advanced Exploration & Optimization

**Add:**
- ✅ Exploratory actions (with state validation)
- ✅ Tool result caching
- ✅ Adaptive batch sizing
- ✅ Tool usage pattern detection
- ✅ Memory summarization

**Benefits**:
- Handle ambiguous UIs
- More efficient tool use
- Better long-scenario handling

---

## Critical Architecture Decisions to Validate

### Decision 1: Always-Provide DOM vs Tool

**Current**: DOM always provided (auto-fetched each iteration)

**Alternative**: DOM via tool call

**Analysis**:
- Pro (always-provide): No tool call overhead, always fresh
- Con (always-provide): Wasted if the agent doesn't need it
- **Verdict**: ✅ Always-provide is correct (DOM is needed ~95% of the time)

### Decision 2: Recent Steps Count (6-7 vs 3)

**Current**: 6-7 steps in the always-provided context

**Analysis**:
- More context → better decisions
- More context → larger prompts
- **Test**: Measure prompt size and decision quality

**For MVP**: Start with 5, tune based on real usage

### Decision 3: Self-Reflection Format

**Current**: Structured (focus, avoid, hypothesis)

**Alternative**: Free-form text

**Analysis**:
- Structured → easier to process
- Structured → might be restrictive
- **Verdict**: ✅ Structured is better (easier to override when looping)

**For MVP**: Skip self-reflection entirely, add in Phase 2

### Decision 4: Batch Size

**Current**: Max 5 commands per iteration

**Analysis**:
- Too small → many iterations
- Too large → wasted planning on an early failure
- **Test**: Measure the success rate of the 2nd-5th commands in a batch

**For MVP**: Max 3 commands; increase to 5 in Phase 2 if the success rate is high

---

## Agent Transparency: Comprehensive Logging

### Principle: Make the Agent's Thinking Visible

**Every agent decision must be logged** so developers can understand:
- What the agent is thinking
- Why it made each decision
- What it learned
- When and why it's stuck

### Logged Information

Per iteration, log:
1. Iteration number and goal
2. Agent reasoning (why this approach)
3. Self-reflection (focus, avoid, hypothesis)
4. Tools requested + why
5. Commands planned + why
6. Experiences learned
7. Status decision + why
8. Command execution results

### Example Log Output

```
[Orchestrator] === Iteration 1/8 ===
[Orchestrator] 🎯 Goal: Login with alice@example.com, TestPass123
[Orchestrator] 💭 Reasoning: Need to locate login form elements
[Orchestrator] 🔧 Tools: [inspect_page]
[Orchestrator] 📋 Why: Need DOM to find email/password fields
[Orchestrator] ⏳ Executing tools...
[Orchestrator] ✓ Tools complete
[Orchestrator] 📝 Commands (3): fill email, fill password, click submit
[Orchestrator] 💡 Why batch: Can fill entire form before submitting
[Orchestrator] 🧠 Next iteration focus: Check for redirect after submit
[Orchestrator] 📚 Learning: Forms use #id selectors consistently
[Orchestrator] ▶ Executing sequentially...
[Orchestrator] ✓ [1/3] await page.fill('#email', 'alice@example.com')
[Orchestrator] ✓ [2/3] await page.fill('#password', 'TestPass123')
[Orchestrator] ✓ [3/3] await page.click('button[type="submit"]')
[Orchestrator] 🎯 Status: continue
[Orchestrator] 💭 Why: Commands executed, need to verify navigation
```

### Progress Reporter Extension

Add agent thoughts to progress reporting:

```typescript
interface StepProgress {
  // ... existing fields

  // NEW: Agent transparency
  agentIteration?: number;
  agentReasoning?: string;
  agentSelfReflection?: SelfReflection;
  agentExperiences?: string[];
  agentToolsUsed?: string[];
  agentStatus?: string;
}

// Report after each iteration:
await progressReporter?.onStepProgress?.({
  jobId,
  stepNumber,
  description: stepGoal,
  status: StepExecutionStatus.IN_PROGRESS,
  code: decision.commands?.join('\n'),
  agentIteration: iteration,
  agentReasoning: decision.reasoning,
  agentSelfReflection: decision.selfReflection,
  agentExperiences: decision.experiences,
  agentToolsUsed: decision.toolCalls?.map(t => t.name),
  agentStatus: decision.status
});
```

**Benefits:**
- The VS extension can display agent thoughts in its output panel
- The Script Service can store them in the DB for frontend visualization
- Debugging becomes much easier
- Users understand what the agent is doing

---

## Recommended MVP Scope

### Include (Core Functionality)

1. **OrchestratorAgent**
   - Single agent loop
   - Always-provided context (overall goal, current goal, current DOM, recent 5 steps)
   - Tool calls (max 3 per iteration)
   - Batch commands (max 3 per iteration)
   - Sequential execution with early stop
   - System guardrails

2. **Essential Tools**
   - `inspect_page` (might be redundant since the DOM is always provided, but keep it as an extensibility demo)
   - `take_screenshot` (isFullPage param)
   - `recall_history` (maxSteps param)

3. **Simple Memory**
   - Unified history array
   - No experiences (add in Phase 2)
   - No self-reflection (add in Phase 2)
   - Basic extracted data

4. **Guardrails**
   - Max 8 iterations/step
   - Max 10 screenshots/scenario
   - Max 2 consecutive failures
   - Agent status: complete | stuck | continue

### Exclude (Phase 2+)

1. **Self-Reflection** - Add after validating the basic agent works
2. **Exploratory Actions** - Add after validating tools work safely
3. **Experiences** - Add after validating memory works
4. **extract_data Tool** - Can be added later
5. **Advanced Validation** - Tool result caching, loop detection, etc.

---

## Implementation Risks & Mitigation

### Risk 1: Tool Call Latency
**Impact**: High
**Mitigation**: Always provide the DOM; limit tool calls to 3
**Acceptance**: Some overhead is acceptable for better decisions

### Risk 2: Prompt Token Cost
**Impact**: Medium
**Mitigation**: Truncate the DOM; limit recent steps to 5
**Acceptance**: Monitor token usage, optimize in Phase 2

### Risk 3: Agent Loops
**Impact**: High
**Mitigation**: System iteration limits (8), consecutive-failure stops (2)
**Acceptance**: Hard limits prevent runaway

### Risk 4: Breaking Backward Compatibility
**Impact**: Critical
**Mitigation**: Changes are internal to ScenarioWorker only; external API unchanged
**Validation**: Test the VS Extension and GitHub Runner after implementation

### Risk 5: Complexity
**Impact**: Medium
**Mitigation**: MVP excludes advanced features; phased approach
**Acceptance**: Simplify if the MVP proves too complex

---

## Key Simplifications for MVP

### 1. Include Self-Reflection (Agent Detects Spirals)
**Why**: Provides valuable train-of-thought continuity
**How**: Agent outputs free-form guidance to itself, PLUS detects if it is spiraling
**Safety**: Agent can recognize "I'm stuck on the same approach" and reset

```typescript
interface SelfReflection {
  guidanceForNext: string;  // Free-form: "Try data-testid, the icon approach failed"
  detectingLoop: boolean;   // Agent signals if it thinks it's looping
  loopReasoning?: string;   // "Tried same selector 3 times, need different approach"
}

// In the prompt:
// "If you notice you're trying the same approach repeatedly, set
// detectingLoop=true and try something completely different."
```

**The agent decides when to break its own loop**; the system enforces hard limits as a backup.

### 2. Include Self-Reflection with Loop Detection
**Agent outputs free-form guidance** for the next iteration
**Agent detects its own loops** via the `detectingLoop` flag
**Agent resets its approach** when it notices repetition

### 3. No Exploratory Actions (Phase 2)
**Why**: Safety risk; adds complexity
**Trade-off**: Can't investigate ambiguous elements programmatically
**Acceptable**: Can use screenshot + DOM analysis instead

### 4. Include Experience Accumulation
**Why**: Learning across steps provides value
**How**: Agent outputs experiences learned each iteration
**Simple**: Just an array of strings; deduplicate similar ones

### 5. Include extract_data Tool
**Why**: Steps often reference earlier data
**How**: Simple tool to save a selector's value as named data
**Benefit**: Agent can explicitly save data for later steps

### 6. Simple Tool Validation (Phase 2+)
**Why**: Complex validation adds overhead
**Trade-off**: Agent might make redundant tool calls
**Acceptable**: Log warnings, don't block

---
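The extract_data idea in simplification 5 reduces to a named key-value store that gets rendered back into the always-provided context for later steps. A sketch; the class name and API are illustrative, not the shipped tool:

```typescript
// Named store for values the agent wants to reference in later steps
// (e.g. an order ID read from the page after checkout).
class ExtractedData {
  private data = new Map<string, string>();

  save(name: string, value: string): void {
    this.data.set(name, value);
  }

  recall(name: string): string | undefined {
    return this.data.get(name);
  }

  // Rendered into the always-provided context each iteration
  toContext(): string {
    return [...this.data.entries()].map(([k, v]) => `${k}=${v}`).join("; ");
  }
}
```

The tool itself would just wrap `save()` around a selector read; everything downstream is plain context text.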

## MVP Architecture (Simplified)

```
┌─────────────────────────────────────────┐
│ ORCHESTRATOR AGENT (MVP)                │
│ • Always gets: Overall goal             │
│                Current goal             │
│                Current DOM              │
│                Recent 5 steps           │
│ • Tools: screenshot, recall_history     │
│ • Plans: Up to 3 commands               │
│ • Executes: Sequential, stop on fail    │
│ • Decides: complete | stuck | continue  │
└─────────────────────────────────────────┘
```

**Excluded from MVP:**
- Self-reflection
- Exploratory actions
- Experience learning
- extract_data tool
- Advanced validations

---

## Phase 2 Additions (After MVP Validated)

**Add if the MVP works well:**

1. **Self-Reflection** (if the agent needs continuity)
   - Monitor: Does the agent make progress without it?
   - Add: Only if we see repeated mistakes

2. **Experience Accumulation** (if learning helps)
   - Monitor: Do patterns repeat across steps?
   - Add: If we see benefit

3. **Exploratory Actions** (if needed for ambiguous UIs)
   - Monitor: How often are elements ambiguous?
   - Add: With state validation

4. **extract_data Tool** (if data is needed across steps)
   - Monitor: Do steps reference earlier data?
   - Add: If manual observation notes aren't sufficient

---

## Phase 3 Optimizations (Future)

1. Tool result caching
2. Adaptive batch sizing
3. Memory summarization for long scenarios
4. Project-level memory
5. Cross-scenario learning
6. Token budget tracking
7. Tool cost-benefit analysis

---

## Success Criteria (MVP)

### Must Have
- [ ] Fewer iterations than current (target: 50% reduction)
- [ ] Backward compatible (VS Extension and GitHub Runner work)
- [ ] No infinite loops (guardrails work)
- [ ] Memory doesn't bloat
- [ ] Tool extensibility works (can add a new tool)

### Nice to Have
- [ ] Fewer LLM calls than current
- [ ] Better success rate
- [ ] Faster execution

### Acceptable Trade-offs
- ⚠️ Slightly higher token usage (more context in prompts)
- ⚠️ Some tool call overhead
- ⚠️ No learning/reflection in MVP (add in Phase 2)

---

## Implementation Order (MVP)

### Week 1: Foundation
1. Create types (AgentConfig, JourneyMemory, AlwaysProvidedContext)
2. Create ToolRegistry with dynamic prompt generation
3. Implement 3 tools (inspect_page, take_screenshot, recall_history)
4. Test tools independently

### Week 2: Orchestrator
1. Implement OrchestratorAgent.executeStep()
2. Always-provided context building
3. Tool call execution
4. Batch command execution (sequential)
5. Simple memory update (no experiences)
6. System guardrails

### Week 3: Integration
1. Add orchestrator prompts
2. Refactor ScenarioWorker to use the orchestrator
3. Test generation mode
4. Test repair mode (if time allows)

### Week 4: Testing & Refinement
1. Test with real scenarios
2. Tune iteration limits
3. Measure token usage
4. Compare metrics vs current
5. Fix bugs
6. Verify backward compatibility

---

## Decision: Start with MVP or Full Architecture?

### Recommendation: **Start with the MVP**

**Why:**
1. **Validate the core concept** - Does a tool-use agent work better?
2. **Reduce risk** - Simpler implementation, fewer edge cases
3. **Faster iteration** - Can test and tune quicker
4. **Learn from usage** - Real data tells us which features matter

**MVP Excludes:**
- Self-reflection (complex, risky)
- Exploratory actions (safety concerns)
- Experience accumulation (nice-to-have)
- extract_data tool (not essential)

**MVP Focuses On:**
- ✅ Tool-based information gathering
- ✅ Batch command planning
- ✅ Memory management
- ✅ Guardrails
- ✅ Backward compatibility

**After the MVP proves value, add Phase 2 features based on:**
- What problems remain?
- What would self-reflection solve?
- How often are elements ambiguous (exploration)?
- Do patterns repeat (experiences)?

---

## Open Questions to Resolve During Implementation

1. **Optimal recent-steps count**: 3, 5, or 7?
   - Test with different values, measure decision quality

2. **Tool call timing**: Before every iteration, or only when the agent requests?
   - MVP: Only when the agent requests
   - Could change if DOM staleness becomes an issue

3. **Batch size sweet spot**: 3 or 5 commands?
   - Start with 3; increase if the success rate of later commands is high

4. **Memory update frequency**: After each command or after each iteration?
   - MVP: After each command (more granular)
   - Could batch if it becomes a performance issue

5. **Tool result format**: Raw or summarized?
   - MVP: Raw (simpler)
   - Add summarization if token usage is too high

---

## Success Metrics to Track

### Performance
- Average LLM calls per step (target: < 3)
- Average iterations per step (target: < 5)
- Average tool calls per step (target: < 2)
- Time per step (target: < 30s)

### Quality
- Success rate per step (target: > 80%)
- Commands per iteration (target: 2-3 avg)
- Tool call relevance (target: > 90% useful)

### Resource
- Token usage per step (monitor; optimize if > 5K)
- Screenshot usage (should be < 10% of steps)
- Memory size (should cap at 100 entries)

---

## Recommendation: Proceed with MVP

**Start with:**
1. Basic orchestrator agent
2. 3 core tools (screenshot, recall_history, inspect_page optional)
3. Simple memory (history only)
4. Always-provided context (goal + DOM + recent 5 steps)
5. Batch commands (max 3)
6. System guardrails

**Validate:**
- Does it work better than the current approach?
- Is backward compatibility maintained?
- Are the guardrails sufficient?

**Then add Phase 2:**
- Self-reflection (if needed)
- Experiences (if patterns emerge)
- Exploratory actions (if ambiguity is common)

---

## Final Thoughts

**This architecture is sound, but ambitious.**

**Recommendation**: Implement the MVP first to validate that:
1. The tool-use paradigm works
2. Memory management works
3. Batch execution helps
4. The guardrails are sufficient

**Then iterate** based on real usage data, not assumptions.

**MVP timeline**: 2-3 weeks
**Full architecture**: 4-6 weeks

**Start with the MVP to reduce risk and validate the concept.**