testchimp-runner-core 0.0.35 → 0.0.36
- package/package.json +6 -1
- package/plandocs/BEFORE_AFTER_VERIFICATION.md +0 -148
- package/plandocs/COORDINATE_MODE_DIAGNOSIS.md +0 -144
- package/plandocs/CREDIT_CALLBACK_ARCHITECTURE.md +0 -253
- package/plandocs/HUMAN_LIKE_IMPROVEMENTS.md +0 -642
- package/plandocs/IMPLEMENTATION_STATUS.md +0 -108
- package/plandocs/INTEGRATION_COMPLETE.md +0 -322
- package/plandocs/MULTI_AGENT_ARCHITECTURE_REVIEW.md +0 -844
- package/plandocs/ORCHESTRATOR_MVP_SUMMARY.md +0 -539
- package/plandocs/PHASE1_ABSTRACTION_COMPLETE.md +0 -241
- package/plandocs/PHASE1_FINAL_STATUS.md +0 -210
- package/plandocs/PHASE_1_COMPLETE.md +0 -165
- package/plandocs/PHASE_1_SUMMARY.md +0 -184
- package/plandocs/PLANNING_SESSION_SUMMARY.md +0 -372
- package/plandocs/PROMPT_OPTIMIZATION_ANALYSIS.md +0 -120
- package/plandocs/PROMPT_SANITY_CHECK.md +0 -120
- package/plandocs/SCRIPT_CLEANUP_FEATURE.md +0 -201
- package/plandocs/SCRIPT_GENERATION_ARCHITECTURE.md +0 -364
- package/plandocs/SELECTOR_IMPROVEMENTS.md +0 -139
- package/plandocs/SESSION_SUMMARY_v0.0.33.md +0 -151
- package/plandocs/TROUBLESHOOTING_SESSION.md +0 -72
- package/plandocs/VISION_DIAGNOSTICS_IMPROVEMENTS.md +0 -336
- package/plandocs/VISUAL_AGENT_EVOLUTION_PLAN.md +0 -396
- package/plandocs/WHATS_NEW_v0.0.33.md +0 -183
- package/plandocs/exploratory-mode-support-v2.plan.md +0 -953
- package/plandocs/exploratory-mode-support.plan.md +0 -928
- package/plandocs/journey-id-tracking-addendum.md +0 -227
- package/releasenotes/RELEASE_0.0.26.md +0 -165
- package/releasenotes/RELEASE_0.0.27.md +0 -236
- package/releasenotes/RELEASE_0.0.28.md +0 -286
- package/src/auth-config.ts +0 -84
- package/src/credit-usage-service.ts +0 -188
- package/src/env-loader.ts +0 -103
- package/src/execution-service.ts +0 -996
- package/src/file-handler.ts +0 -104
- package/src/index.ts +0 -432
- package/src/llm-facade.ts +0 -821
- package/src/llm-provider.ts +0 -53
- package/src/model-constants.ts +0 -35
- package/src/orchestrator/decision-parser.ts +0 -139
- package/src/orchestrator/index.ts +0 -58
- package/src/orchestrator/orchestrator-agent.ts +0 -1282
- package/src/orchestrator/orchestrator-prompts.ts +0 -786
- package/src/orchestrator/page-som-handler.ts +0 -1565
- package/src/orchestrator/som-types.ts +0 -188
- package/src/orchestrator/tool-registry.ts +0 -184
- package/src/orchestrator/tools/check-page-ready.ts +0 -75
- package/src/orchestrator/tools/extract-data.ts +0 -92
- package/src/orchestrator/tools/index.ts +0 -15
- package/src/orchestrator/tools/inspect-page.ts +0 -42
- package/src/orchestrator/tools/recall-history.ts +0 -72
- package/src/orchestrator/tools/refresh-som-markers.ts +0 -69
- package/src/orchestrator/tools/take-screenshot.ts +0 -128
- package/src/orchestrator/tools/verify-action-result.ts +0 -159
- package/src/orchestrator/tools/view-previous-screenshot.ts +0 -103
- package/src/orchestrator/types.ts +0 -291
- package/src/playwright-mcp-service.ts +0 -224
- package/src/progress-reporter.ts +0 -144
- package/src/prompts.ts +0 -842
- package/src/providers/backend-proxy-llm-provider.ts +0 -91
- package/src/providers/local-llm-provider.ts +0 -38
- package/src/scenario-service.ts +0 -252
- package/src/scenario-worker-class.ts +0 -1110
- package/src/script-utils.ts +0 -203
- package/src/types.ts +0 -239
- package/src/utils/browser-utils.ts +0 -348
- package/src/utils/coordinate-converter.ts +0 -162
- package/src/utils/page-info-retry.ts +0 -65
- package/src/utils/page-info-utils.ts +0 -285
- package/testchimp-runner-core-0.0.35.tgz +0 -0
- package/tsconfig.json +0 -19
|
@@ -1,184 +0,0 @@
|
|
|
1
|
-
# Phase 1 Complete - Summary & Testing Guide
|
|
2
|
-
|
|
3
|
-
## Version: runner-core v0.0.33 ✅
|
|
4
|
-
|
|
5
|
-
---
|
|
6
|
-
|
|
7
|
-
## Implementation Complete
|
|
8
|
-
|
|
9
|
-
### What's New:
|
|
10
|
-
|
|
11
|
-
1. **📝 Note to Future Self**
|
|
12
|
-
- Free-form tactical memory between iterations
|
|
13
|
-
- Agent writes: "Tried X, failed. Will try Y next."
|
|
14
|
-
- Prevents repeated mistakes
|
|
15
|
-
|
|
16
|
-
2. **🎯 Percentage-Based Coordinates**
|
|
17
|
-
- Last-resort fallback (3-decimal precision)
|
|
18
|
-
- Resolution-independent (works at any viewport size)
|
|
19
|
-
- Supports: click, fill, drag, hover, scroll
|
|
20
|
-
|
|
21
|
-
3. **⚡ Optimized Iteration Limits**
|
|
22
|
-
- Max 5 iterations per step (down from 8)
|
|
23
|
-
- 2 coordinate attempts max (coordinates work or they don't)
|
|
24
|
-
- Faster feedback on stuck scenarios
|
|
25
|
-
|
|
26
|
-
---
|
|
27
|
-
|
|
28
|
-
## Current Behavior (Phase 1)
|
|
29
|
-
|
|
30
|
-
```
|
|
31
|
-
┌─────────────────────────────────────────────────────┐
|
|
32
|
-
│ Iteration 1: Playwright selector │
|
|
33
|
-
│ Try: await page.getByRole('button'...).click() │
|
|
34
|
-
│ Note: "If this fails, try #id selector" │
|
|
35
|
-
│ │
|
|
36
|
-
│ Iteration 2: Playwright selector │
|
|
37
|
-
│ Read note from iteration 1 │
|
|
38
|
-
│ Try: await page.locator('#sidebar-toggle').click()│
|
|
39
|
-
│ Note: "If this fails, try SVG child" │
|
|
40
|
-
│ │
|
|
41
|
-
│ Iteration 3: Playwright selector │
|
|
42
|
-
│ Read note from iteration 2 │
|
|
43
|
-
│ Try: await page.locator('#sidebar-toggle svg') │
|
|
44
|
-
│ → Fails again │
|
|
45
|
-
│ │
|
|
46
|
-
│ 🎯 COORDINATE MODE ACTIVATED 🎯 │
|
|
47
|
-
│ │
|
|
48
|
-
│ Iteration 4: Coordinate action │
|
|
49
|
-
│ Agent outputs: {xPercent: 5.500, yPercent: 8.250}│
|
|
50
|
-
│ Execute: page.mouse.click(88, 66) │
|
|
51
|
-
│ → Success! │
|
|
52
|
-
│ │
|
|
53
|
-
│ OR if fails... │
|
|
54
|
-
│ │
|
|
55
|
-
│ Iteration 5: Coordinate action (2nd attempt) │
|
|
56
|
-
│ Try slightly adjusted coordinates │
|
|
57
|
-
│ → If fails: GIVE UP (stuck) │
|
|
58
|
-
│ │
|
|
59
|
-
│ Total: Max 5 iterations │
|
|
60
|
-
└─────────────────────────────────────────────────────┘
|
|
61
|
-
```
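The coordinate step in the diagram maps (5.500%, 8.250%) to `page.mouse.click(88, 66)`, which is consistent with a 1600×800 viewport. A minimal sketch of that conversion, assuming plain rounding to whole pixels and clamping to the 0-100 range; the type and function names are hypothetical and are not taken from `src/utils/coordinate-converter.ts`:

```typescript
// Hypothetical names for illustration; the real converter may differ.
interface Viewport { width: number; height: number; }
interface PercentPoint { xPercent: number; yPercent: number; }
interface PixelPoint { x: number; y: number; }

// Convert viewport-relative percentages into absolute pixel coordinates.
function percentToPixels(point: PercentPoint, viewport: Viewport): PixelPoint {
  // Assumption: out-of-range percentages are clamped rather than rejected.
  const clamp = (value: number): number => Math.min(100, Math.max(0, value));
  return {
    // Multiply before dividing so values like 5.5 * 1600 stay exact.
    x: Math.round((clamp(point.xPercent) * viewport.width) / 100),
    y: Math.round((clamp(point.yPercent) * viewport.height) / 100),
  };
}

const target = percentToPixels(
  { xPercent: 5.5, yPercent: 8.25 },
  { width: 1600, height: 800 },
);
console.log(target); // { x: 88, y: 66 }
```

Because the agent emits percentages, the same output resolves to a different pixel position on a different viewport, which is what makes the fallback resolution-independent.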
|
|
62
|
-
|
|
63
|
-
---
|
|
64
|
-
|
|
65
|
-
## Testing Phase 1
|
|
66
|
-
|
|
67
|
-
### Test 1: PeopleHR Scenario (Previously Failed)
|
|
68
|
-
|
|
69
|
-
**Expected outcome:**
|
|
70
|
-
- Iteration 1-2: Try text/ID selectors → fail
|
|
71
|
-
- Iteration 3: Note says "try SVG child" → succeeds!
|
|
72
|
-
- OR Iteration 4: Coordinates → succeeds!
|
|
73
|
-
|
|
74
|
-
**Run:**
|
|
75
|
-
```bash
|
|
76
|
-
# Via VS Code extension: "Generate Script" on peoplehr.txt
|
|
77
|
-
# Or "Run Test" on peoplehr-corrected.smart.spec.ts
|
|
78
|
-
```
|
|
79
|
-
|
|
80
|
-
**Look for in logs:**
|
|
81
|
-
```
|
|
82
|
-
📝 Note to self: ...
|
|
83
|
-
🎯 COORDINATE MODE ACTIVATED
|
|
84
|
-
🎯 Coordinate Action (attempt 1/2): click at (5.500%, 8.250%)
|
|
85
|
-
```
|
|
86
|
-
|
|
87
|
-
### Test 2: Simple Scenario (Should Still Be Fast)
|
|
88
|
-
|
|
89
|
-
Create test: `simple-login.txt`
|
|
90
|
-
```
|
|
91
|
-
- go to https://example.com/login
|
|
92
|
-
- fill username with "alice"
|
|
93
|
-
- fill password with "password123"
|
|
94
|
-
- click login button
|
|
95
|
-
```
|
|
96
|
-
|
|
97
|
-
**Expected:**
|
|
98
|
-
- Each step: 1 iteration (Tier 1 success)
|
|
99
|
-
- No coordinates needed
|
|
100
|
-
- Fast execution
|
|
101
|
-
|
|
102
|
-
### Test 3: Coordinate Fallback
|
|
103
|
-
|
|
104
|
-
**Deliberately difficult scenario:**
|
|
105
|
-
```
|
|
106
|
-
- go to https://some-app-with-shadow-dom.com
|
|
107
|
-
- click on custom web component icon
|
|
108
|
-
```
|
|
109
|
-
|
|
110
|
-
**Expected:**
|
|
111
|
-
- Iterations 1-3: Selectors fail
|
|
112
|
-
- Iteration 4: Coordinates succeed
|
|
113
|
-
- Generated script contains: `await page.mouse.click(x, y);`
|
|
114
|
-
|
|
115
|
-
---
|
|
116
|
-
|
|
117
|
-
## Expected Improvements
|
|
118
|
-
|
|
119
|
-
### Metrics to Track:
|
|
120
|
-
|
|
121
|
-
1. **Iteration Efficiency**
|
|
122
|
-
- Before: ~4 average iterations per step
|
|
123
|
-
- After: ~2.5 average iterations per step (30-40% reduction)
|
|
124
|
-
|
|
125
|
-
2. **Success Rate**
|
|
126
|
-
- Before: Stuck on complex UIs (hamburgers, icons, shadow DOM)
|
|
127
|
-
- After: Coordinates provide escape hatch
|
|
128
|
-
|
|
129
|
-
3. **Coordinate Usage**
|
|
130
|
-
- Target: < 10% of scenarios use coordinates
|
|
131
|
-
- Most scenarios still succeed with selectors
|
|
132
|
-
|
|
133
|
-
---
|
|
134
|
-
|
|
135
|
-
## Files Changed
|
|
136
|
-
|
|
137
|
-
**New:**
|
|
138
|
-
- `src/utils/coordinate-converter.ts` - Percentage conversion utility
|
|
139
|
-
- `VISUAL_AGENT_EVOLUTION_PLAN.md` - Complete plan
|
|
140
|
-
- `PHASE_1_COMPLETE.md` - Feature documentation
|
|
141
|
-
- `IMPLEMENTATION_STATUS.md` - Current status
|
|
142
|
-
- `PHASE_1_SUMMARY.md` - This file
|
|
143
|
-
|
|
144
|
-
**Modified:**
|
|
145
|
-
- `src/orchestrator/types.ts` - Added NoteToFutureSelf, CoordinateAction
|
|
146
|
-
- `src/orchestrator/orchestrator-agent.ts` - Note tracking, coordinate handling, mode switching
|
|
147
|
-
- `src/scenario-worker-class.ts` - Timeout handling (earlier fix)
|
|
148
|
-
- `src/execution-service.ts` - Timeout handling (earlier fix)
|
|
149
|
-
|
|
150
|
-
---
|
|
151
|
-
|
|
152
|
-
## Iteration Budget (Max 5 per Step)
|
|
153
|
-
|
|
154
|
-
**Phase 1 (Current):**
|
|
155
|
-
```
|
|
156
|
-
Iterations 1-3: Playwright selectors (3 attempts)
|
|
157
|
-
Iterations 4-5: Coordinates (2 attempts)
|
|
158
|
-
```
|
|
159
|
-
|
|
160
|
-
**Phase 2 (Future - Optimized):**
|
|
161
|
-
```
|
|
162
|
-
Iteration 1: Playwright selector (1 attempt) - fast path
|
|
163
|
-
Iterations 2-3: Index commands (2 attempts) - reliable fallback
|
|
164
|
-
Iterations 4-5: Coordinates (2 attempts) - last resort
|
|
165
|
-
```
|
|
166
|
-
|
|
167
|
-
**Benefit of Phase 2:**
|
|
168
|
-
- Most scenarios finish in iteration 1 (fast!)
|
|
169
|
-
- Complex scenarios use iterations 2-3 (index system)
|
|
170
|
-
- Only extreme cases reach iterations 4-5 (coordinates)
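The two budgets above can be read as a lookup from iteration number to strategy. A minimal sketch under that reading; the `Strategy` and `Phase` names and the `give_up` value are illustrative assumptions, not identifiers from the package:

```typescript
// Hypothetical names; only the iteration ranges come from the budget above.
type Strategy = "selector" | "index" | "coordinate" | "give_up";
type Phase = 1 | 2;

const MAX_ITERATIONS_PER_STEP = 5;

// Map a 1-based iteration number to the strategy the budget assigns to it.
function strategyForIteration(iteration: number, phase: Phase): Strategy {
  if (iteration < 1 || iteration > MAX_ITERATIONS_PER_STEP) return "give_up";
  if (iteration >= 4) return "coordinate"; // iterations 4-5 in both phases
  if (phase === 1) return "selector";      // Phase 1: iterations 1-3
  return iteration === 1 ? "selector" : "index"; // Phase 2: 1, then 2-3
}

console.log(strategyForIteration(3, 1)); // "selector"
console.log(strategyForIteration(3, 2)); // "index"
```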
|
|
171
|
-
|
|
172
|
-
---
|
|
173
|
-
|
|
174
|
-
## Ready to Test!
|
|
175
|
-
|
|
176
|
-
**Current version** (runner-core v0.0.33) is built and ready.
|
|
177
|
-
|
|
178
|
-
**Test with:**
|
|
179
|
-
1. VS Code extension "Generate Script" on `peoplehr.txt`
|
|
180
|
-
2. Or "Run Test" on any existing smart test
|
|
181
|
-
3. Check logs for note-to-self and coordinate usage
|
|
182
|
-
|
|
183
|
-
**After validating Phase 1 works well, proceed to Phase 2 for numbered element system.**
|
|
184
|
-
|
|
@@ -1,372 +0,0 @@
|
|
|
1
|
-
# Planning Session Summary: Orchestrator Agent Architecture
|
|
2
|
-
|
|
3
|
-
## Date: October 11, 2025
|
|
4
|
-
|
|
5
|
-
---
|
|
6
|
-
|
|
7
|
-
## Final Decisions Made
|
|
8
|
-
|
|
9
|
-
### 1. ✅ Self-Reflection in MVP
|
|
10
|
-
**Decision**: Include free-form self-reflection with agent-driven loop detection
|
|
11
|
-
- Agent outputs `guidanceForNext` (free-form text) for train of thought continuity
|
|
12
|
-
- Agent signals `detectingLoop: true` when it notices repetition
|
|
13
|
-
- Agent decides when to break its own loop; the system enforces hard limits as a backup
|
|
14
|
-
|
|
15
|
-
**Rationale**: Maintains context across iterations and lets the agent self-correct
|
|
16
|
-
|
|
17
|
-
### 2. ✅ No Screenshot Budget
|
|
18
|
-
**Decision**: Screenshots available freely, no artificial limits
|
|
19
|
-
- Corrected token cost: 1-2K tokens (NOT 100K!)
|
|
20
|
-
- For 1920x1080 viewport: ~1,452 tokens (gpt-4.1-mini)
|
|
21
|
-
- Comparable to extra DOM context
|
|
22
|
-
|
|
23
|
-
**Rationale**: Very affordable, enables liberal vision use throughout journey
|
|
24
|
-
|
|
25
|
-
### 3. ✅ DOM Limits (Increased for Complex Pages)
|
|
26
|
-
**Decision**: Increased limits in getEnhancedPageInfo to handle complex UIs
|
|
27
|
-
- ARIA tree depth: 4 levels
|
|
28
|
-
- Interactive elements: top 50 (was 12)
|
|
29
|
-
- IDs: top 50 (was 10)
|
|
30
|
-
- Data attributes: top 50 (was 10)
|
|
31
|
-
- Form fields: top 20 (was 8)
|
|
32
|
-
- Page structure: top 10 (was 6)
|
|
33
|
-
- General elements: top 50 (was 15)
|
|
34
|
-
- Text: 30 chars max
|
|
35
|
-
- Result: ~800-1,500 tokens
|
|
36
|
-
|
|
37
|
-
**Rationale**: Complex pages need more context, still compact with truncation
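The limits listed for this decision could be captured in a single configuration object. The sketch below is illustrative only: the key names and helper functions are assumptions, and the document does not say whether truncated text gets an ellipsis, so none is added here.

```typescript
// Hypothetical config shape; the numeric values are the ones listed above.
const DOM_LIMITS = {
  ariaTreeDepth: 4,
  interactiveElements: 50,
  ids: 50,
  dataAttributes: 50,
  formFields: 20,
  pageStructure: 10,
  generalElements: 50,
  maxTextLength: 30,
} as const;

// Cut element text to the configured maximum length (no ellipsis assumed).
function truncateText(text: string, max: number = DOM_LIMITS.maxTextLength): string {
  return text.length <= max ? text : text.slice(0, max);
}

// Keep only the first `limit` entries of a collected list.
function topN<T>(items: readonly T[], limit: number): T[] {
  return items.slice(0, limit);
}

console.log(truncateText("Submit your quarterly expense report now").length); // 30
```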
|
|
38
|
-
|
|
39
|
-
### 4. ✅ Token Usage Tracking
|
|
40
|
-
**Decision**: Track and report all LLM token usage via callback
|
|
41
|
-
- Interface: `onTokensUsed({inputTokens, outputTokens, includesImage})`
|
|
42
|
-
- Heuristic: 4 characters = 1 token
|
|
43
|
-
- Image tokens: ~1,500 estimate for viewport screenshots
|
|
44
|
-
- Reported via ProgressReporter for analytics
|
|
45
|
-
|
|
46
|
-
**Rationale**: Cost tracking, optimization, analytics
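A minimal sketch of the heuristic described in this decision, assuming the character count is rounded up and that the image estimate is added to the input side. The callback payload matches the `onTokensUsed({inputTokens, outputTokens, includesImage})` interface named above; the helper names `estimateTokens` and `reportUsage` are hypothetical.

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  includesImage: boolean;
}
type OnTokensUsed = (usage: TokenUsage) => void;

const CHARS_PER_TOKEN = 4;         // heuristic from the decision above
const IMAGE_TOKEN_ESTIMATE = 1500; // approximate cost of one viewport screenshot

// Estimate tokens from character count (assumption: round up).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Build a usage record for one LLM call and hand it to the callback.
function reportUsage(
  prompt: string,
  response: string,
  includesImage: boolean,
  onTokensUsed: OnTokensUsed,
): TokenUsage {
  const usage: TokenUsage = {
    // Assumption: image tokens count toward input tokens.
    inputTokens: estimateTokens(prompt) + (includesImage ? IMAGE_TOKEN_ESTIMATE : 0),
    outputTokens: estimateTokens(response),
    includesImage,
  };
  onTokensUsed(usage);
  return usage;
}
```

Since these are estimates rather than provider-reported counts, they suit trend tracking and analytics better than exact billing.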
|
|
47
|
-
|
|
48
|
-
### 5. ✅ Recovery Tools in MVP
|
|
49
|
-
**Decision**: Include 3 recovery tools for self-unsticking
|
|
50
|
-
- `navigate_back()` - Go back in history
|
|
51
|
-
- `refresh_page()` - Reload page
|
|
52
|
-
- `navigate_to_url({url})` - Navigate to specific URL (with domain validation)
|
|
53
|
-
|
|
54
|
-
**Rationale**: Agent needs ability to recover from bad states (wrong navigation, stuck page, side effects)
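The document does not spell out what the domain validation for `navigate_to_url` checks. One plausible reading, sketched below as an assumption, is to allow only http(s) targets on the same hostname as the current page; the function name `isAllowedNavigation` is hypothetical.

```typescript
// Assumed policy: the target must be http(s) and share the current hostname.
function isAllowedNavigation(currentUrl: string, targetUrl: string): boolean {
  try {
    const current = new URL(currentUrl);
    // Resolve relative targets (for example "/dashboard") against the current page.
    const target = new URL(targetUrl, currentUrl);
    const isHttp = target.protocol === "http:" || target.protocol === "https:";
    return isHttp && target.hostname === current.hostname;
  } catch {
    // Unparseable URLs are rejected rather than passed to the browser.
    return false;
  }
}

console.log(isAllowedNavigation("https://app.example.com/settings", "/dashboard")); // true
console.log(isAllowedNavigation("https://app.example.com/settings", "https://other.example.net/")); // false
```

A stricter or looser rule (an allow-list, or matching on the registrable domain) would fit the same tool signature.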
|
|
55
|
-
|
|
56
|
-
### 6. ✅ Inquisitive Exploration in Phase 2
|
|
57
|
-
**Decision**: Defer exploratory actions to Phase 2, MVP uses workarounds
|
|
58
|
-
- Phase 2 tool: `explore_element({action, selector, purpose})`
|
|
59
|
-
- Actions: hover, click_info, click_menu, focus
|
|
60
|
-
- Safety: State validation, non-consequential only
|
|
61
|
-
- **Screenshot handling**: Immediate analysis via sub-agent call
|
|
62
|
-
- System takes screenshot after exploration
|
|
63
|
-
- Calls agent to analyze screenshot
|
|
64
|
-
- Agent extracts learnings (text)
|
|
65
|
-
- Only TEXT stored in history, NOT screenshot
|
|
66
|
-
- Keeps memory lightweight
|
|
67
|
-
- MVP workaround: Use screenshot + DOM analysis + retry
|
|
68
|
-
|
|
69
|
-
**Rationale**: Safety concerns, complexity, needs battle-testing first
|
|
70
|
-
|
|
71
|
-
### 7. ✅ Always-Provided Context Structure
|
|
72
|
-
**Decision**: Provide comprehensive context automatically each iteration
|
|
73
|
-
- Overall goal + current goal
|
|
74
|
-
- Current page info (DOM)
|
|
75
|
-
- Recent 6-7 steps
|
|
76
|
-
- Experiences (learnings)
|
|
77
|
-
- Extracted data
|
|
78
|
-
- Self-reflection from previous iteration
|
|
79
|
-
- Journey progress tracking
|
|
80
|
-
|
|
81
|
-
**Rationale**: Agent needs full situational awareness without repeated tool calls
|
|
82
|
-
|
|
83
|
-
### 8. ✅ System vs Agent Guardrails
|
|
84
|
-
**Decision**: Clear separation of responsibilities
|
|
85
|
-
- **System enforces**: Iteration limits, tool call limits, command limits
|
|
86
|
-
- **Agent signals**: Stuck, infeasible, detecting loop
|
|
87
|
-
- System has final say, agent provides soft guidance
|
|
88
|
-
|
|
89
|
-
**Rationale**: Safety (hard limits) + intelligence (agent self-awareness)
|
|
90
|
-
|
|
91
|
-
---
|
|
92
|
-
|
|
93
|
-
## Architecture Summary
|
|
94
|
-
|
|
95
|
-
### Core Components
|
|
96
|
-
|
|
97
|
-
**1. Always-Provided Context** (auto-fetched each iteration)
|
|
98
|
-
```typescript
|
|
99
|
-
{
|
|
100
|
-
overallGoal, currentStepGoal, stepNumber, totalSteps,
|
|
101
|
-
currentPageInfo, currentURL,
|
|
102
|
-
  recentSteps, experiences, extractedData, // recentSteps: the most recent 6-7 steps
|
|
103
|
-
previousIterationGuidance
|
|
104
|
-
}
|
|
105
|
-
```
|
|
106
|
-
|
|
107
|
-
**2. Tools** (8 in MVP)
|
|
108
|
-
- **Information**: take_screenshot, recall_history, inspect_page, check_page_ready
|
|
109
|
-
- **Data**: extract_data
|
|
110
|
-
- **Recovery**: navigate_back, refresh_page, navigate_to_url
|
|
111
|
-
|
|
112
|
-
**3. Agent Decision Output**
|
|
113
|
-
```typescript
|
|
114
|
-
{
|
|
115
|
-
toolCalls, toolReasoning, needsToolResults,
|
|
116
|
-
commands, commandReasoning,
|
|
117
|
-
selfReflection: {guidanceForNext, detectingLoop, loopReasoning},
|
|
118
|
-
experiences, memoryUpdate,
|
|
119
|
-
status: 'complete' | 'stuck' | 'infeasible' | 'continue',
|
|
120
|
-
statusReasoning, reasoning
|
|
121
|
-
}
|
|
122
|
-
```
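The decision output above lists field names but not their types. The following sketch assigns plausible types and validates the `status` field when parsing a raw LLM response. All field types, the optionality of fields, and the `parseDecision` helper are assumptions for illustration; this is not the package's `decision-parser.ts`.

```typescript
type AgentStatus = "complete" | "stuck" | "infeasible" | "continue";

interface SelfReflection {
  guidanceForNext: string;
  detectingLoop: boolean;
  loopReasoning?: string;
}

// Field names come from the outline above; the types are guesses.
interface AgentDecision {
  toolCalls?: { name: string; params?: Record<string, unknown> }[];
  toolReasoning?: string;
  needsToolResults?: boolean;
  commands?: string[];
  commandReasoning?: string;
  selfReflection?: SelfReflection;
  experiences?: string[];
  memoryUpdate?: unknown;
  status: AgentStatus;
  statusReasoning?: string;
  reasoning?: string;
}

const AGENT_STATUSES: readonly AgentStatus[] = ["complete", "stuck", "infeasible", "continue"];

function isAgentStatus(value: unknown): value is AgentStatus {
  return typeof value === "string" && AGENT_STATUSES.indexOf(value as AgentStatus) !== -1;
}

// Parse a raw JSON response; return null if it is not JSON or has no valid status.
function parseDecision(raw: string): AgentDecision | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const status = (parsed as { status?: unknown }).status;
  return isAgentStatus(status) ? (parsed as AgentDecision) : null;
}
```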
|
|
123
|
-
|
|
124
|
-
**4. Sequential Batch Execution**
|
|
125
|
-
- Agent plans batch of commands (max 3-5)
|
|
126
|
-
- System executes one-by-one
|
|
127
|
-
- Stop at first failure
|
|
128
|
-
- Record each individually in history
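The batch rules above (cap the batch, run commands one by one, stop at the first failure, record each command separately) can be sketched as follows. The executor is synchronous here only to keep the example short; a real runner would await each Playwright command. `CommandRecord`, `Executor`, and `executeBatch` are hypothetical names.

```typescript
interface CommandRecord {
  command: string;
  success: boolean;
  error?: string;
}

// Assumption: the executor throws when a command fails.
type Executor = (command: string) => void;

const MAX_BATCH_SIZE = 5; // upper end of the "max 3-5" range above

function executeBatch(commands: string[], execute: Executor): CommandRecord[] {
  const records: CommandRecord[] = [];
  for (const command of commands.slice(0, MAX_BATCH_SIZE)) {
    try {
      execute(command);
      records.push({ command, success: true });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      records.push({ command, success: false, error: message });
      break; // stop at first failure; later commands are never attempted
    }
  }
  return records;
}
```

Stopping early means the agent replans from the actual page state instead of running commands that assumed the failed one succeeded.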
|
|
129
|
-
|
|
130
|
-
**5. Comprehensive Logging**
|
|
131
|
-
- Every iteration: goal, reasoning, self-reflection, tools, commands, experiences, status
|
|
132
|
-
- All thoughts visible for debugging
|
|
133
|
-
- Exported via ProgressReporter
|
|
134
|
-
|
|
135
|
-
**6. Token Usage Tracking**
|
|
136
|
-
- Input + output tokens calculated (4 chars = 1 token)
|
|
137
|
-
- Image tokens estimated (~1,500 for viewport)
|
|
138
|
-
- Reported via `onTokensUsed()` callback
|
|
139
|
-
|
|
140
|
-
---
|
|
141
|
-
|
|
142
|
-
## MVP vs Phase 2
|
|
143
|
-
|
|
144
|
-
### MVP Includes:
|
|
145
|
-
- ✅ 8 core tools (info + data + recovery)
|
|
146
|
-
- ✅ Journey memory with experiences
|
|
147
|
-
- ✅ Self-reflection + loop detection
|
|
148
|
-
- ✅ Batch command planning
|
|
149
|
-
- ✅ Self-recovery (navigate back/refresh)
|
|
150
|
-
- ✅ Token tracking
|
|
151
|
-
- ✅ Comprehensive logging
|
|
152
|
-
- ✅ Configurable guardrails
|
|
153
|
-
|
|
154
|
-
### Phase 2 Adds:
|
|
155
|
-
- Inquisitive exploration (explore_element)
|
|
156
|
-
- Advanced optimizations (caching, adaptive limits)
|
|
157
|
-
- Memory summarization for long journeys
|
|
158
|
-
|
|
159
|
-
---
|
|
160
|
-
|
|
161
|
-
## Key Metrics
|
|
162
|
-
|
|
163
|
-
### Token Usage Per Iteration (Estimated)
|
|
164
|
-
```
|
|
165
|
-
System prompt: 500 tokens
|
|
166
|
-
Always-provided context: 1,200-2,000 tokens
|
|
167
|
-
- Goals & progress: 100
|
|
168
|
-
- DOM (increased limits): 800-1,500
|
|
169
|
-
- Recent 6-7 steps: 300-500
|
|
170
|
-
- Experiences: 100-200
|
|
171
|
-
Self-reflection: 100 tokens
|
|
172
|
-
Tool results (optional): 300-500 tokens
|
|
173
|
-
Screenshot (optional): 1,500 tokens
|
|
174
|
-
|
|
175
|
-
Total without screenshot: 2,400-3,600 tokens
|
|
176
|
-
Total with screenshot: 3,900-5,100 tokens
|
|
177
|
-
```
|
|
178
|
-
|
|
179
|
-
### Expected Performance
|
|
180
|
-
- LLM calls/step: 2-4 (vs 4-6 current)
|
|
181
|
-
- Iterations/step: 3-5 (vs 8-12 current)
|
|
182
|
-
- Tool calls/step: 1-3
|
|
183
|
-
- Commands/iteration: 2-3 (batched)
|
|
184
|
-
- Agent learns: 1-2 experiences per step
|
|
185
|
-
|
|
186
|
-
---
|
|
187
|
-
|
|
188
|
-
## Inquisitive Exploration Design (Phase 2)
|
|
189
|
-
|
|
190
|
-
### Problem
|
|
191
|
-
Menu items are icon-only, no text/ARIA labels → Agent unsure which to click
|
|
192
|
-
|
|
193
|
-
### Solution
|
|
194
|
-
Agent investigates non-consequentially, analyzes immediately:
|
|
195
|
-
|
|
196
|
-
```
|
|
197
|
-
Iteration N:
|
|
198
|
-
Agent Decision: "Need to hover over icons to see tooltips"
|
|
199
|
-
Tool: explore_element({action: "hover", selector: "nav button:nth-child(2)"})
|
|
200
|
-
|
|
201
|
-
System:
|
|
202
|
-
→ Hover, wait 500ms
|
|
203
|
-
→ Take screenshot
|
|
204
|
-
→ Call agent (sub-call): "What do you see in this screenshot?"
|
|
205
|
-
|
|
206
|
-
Agent Analysis (sub-call):
|
|
207
|
-
→ Sees tooltip
|
|
208
|
-
→ Returns: "Tooltip shows 'Dashboard' - this is the Dashboard button"
|
|
209
|
-
|
|
210
|
-
System:
|
|
211
|
-
→ Stores TEXT in history: "Explored button, tooltip confirms Dashboard"
|
|
212
|
-
→ Does NOT store screenshot
|
|
213
|
-
→ Returns to main agent: {success: true, learning: "Tooltip shows Dashboard"}
|
|
214
|
-
|
|
215
|
-
Agent Decision (continues):
|
|
216
|
-
→ "Great, confirmed it's Dashboard"
|
|
217
|
-
→ Commands: ["page.click('nav button:nth-child(2)')"]
|
|
218
|
-
|
|
219
|
-
System: Execute commands
|
|
220
|
-
```
|
|
221
|
-
|
|
222
|
-
**Key difference**: Screenshot analyzed WITHIN same iteration, only text stored
|
|
223
|
-
|
|
224
|
-
### Allowed Actions
|
|
225
|
-
- ✅ hover (show tooltips)
|
|
226
|
-
- ✅ click_info (info icons)
|
|
227
|
-
- ✅ click_menu (expand menus)
|
|
228
|
-
- ✅ focus (see input hints)
|
|
229
|
-
|
|
230
|
-
### NOT Allowed
|
|
231
|
-
- ❌ Submit forms
|
|
232
|
-
- ❌ Delete/remove
|
|
233
|
-
- ❌ Logout
|
|
234
|
-
- ❌ File uploads
|
|
235
|
-
|
|
236
|
-
### Safety
|
|
237
|
-
- State validation (URL, modal count)
|
|
238
|
-
- Revert if unexpected navigation
|
|
239
|
-
- Budget: 10 explorations per step
|
|
240
|
-
- Timeout: 2s per exploration
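A minimal sketch of the safety checks listed above, assuming that any change in URL or modal count after an exploration counts as an unexpected state change that should be reverted. The document names the signals (URL, modal count) but not the comparison rule, so the rule and all identifiers here are assumptions.

```typescript
interface PageState {
  url: string;
  modalCount: number;
}

const MAX_EXPLORATIONS_PER_STEP = 10; // budget from the list above
const EXPLORATION_TIMEOUT_MS = 2000;  // 2s per exploration

// Budget check: may the agent start another exploration in this step?
function canExplore(explorationsUsed: number): boolean {
  return explorationsUsed < MAX_EXPLORATIONS_PER_STEP;
}

// Timeout check for a single exploration.
function hasTimedOut(elapsedMs: number): boolean {
  return elapsedMs > EXPLORATION_TIMEOUT_MS;
}

// Assumed rule: revert if the URL or the number of open modals changed.
function needsRevert(before: PageState, after: PageState): boolean {
  return before.url !== after.url || before.modalCount !== after.modalCount;
}
```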
|
|
241
|
-
|
|
242
|
-
### Why Phase 2
|
|
243
|
-
- Safety risk (needs robust validation)
|
|
244
|
-
- Complexity (screenshot handling, state comparison)
|
|
245
|
-
- MVP workaround: screenshot + DOM + retry
|
|
246
|
-
|
|
247
|
-
---
|
|
248
|
-
|
|
249
|
-
## Implementation Status
|
|
250
|
-
|
|
251
|
-
### Completed (During Planning):
|
|
252
|
-
- ✅ Token usage tracking added to interfaces
|
|
253
|
-
- ✅ BackendProxyLLMProvider calculates token usage
|
|
254
|
-
- ✅ LLMFacade prepared for token callback
|
|
255
|
-
- ✅ Progress reporter extended with `onTokensUsed()`
|
|
256
|
-
|
|
257
|
-
### Ready to Implement:
|
|
258
|
-
1. OrchestratorAgent class
|
|
259
|
-
2. ToolRegistry with 8 tools
|
|
260
|
-
3. Journey memory implementation
|
|
261
|
-
4. Always-provided context builder
|
|
262
|
-
5. Self-reflection structures
|
|
263
|
-
6. Recovery tools (navigate_back, refresh_page, navigate_to_url)
|
|
264
|
-
7. Comprehensive logging
|
|
265
|
-
8. Token tracking integration
|
|
266
|
-
|
|
267
|
-
### Phase 2:
|
|
268
|
-
1. Exploratory actions (explore_element)
|
|
269
|
-
2. State validation logic
|
|
270
|
-
3. Advanced optimizations
|
|
271
|
-
|
|
272
|
-
---
|
|
273
|
-
|
|
274
|
-
## Documentation Created
|
|
275
|
-
|
|
276
|
-
1. **MULTI_AGENT_ARCHITECTURE_REVIEW.md**
|
|
277
|
-
- 8 pitfalls analyzed with mitigations
|
|
278
|
-
- Phased implementation strategy
|
|
279
|
-
- Risk analysis and trade-offs
|
|
280
|
-
|
|
281
|
-
2. **ORCHESTRATOR_IMPLEMENTATION_PLAN.md**
|
|
282
|
-
- Detailed implementation specs
|
|
283
|
-
- Code examples
|
|
284
|
-
- Integration points
|
|
285
|
-
|
|
286
|
-
3. **ORCHESTRATOR_MVP_SUMMARY.md**
|
|
287
|
-
- Executive summary
|
|
288
|
-
- Complete feature list
|
|
289
|
-
- Inquisitive exploration section
|
|
290
|
-
- MVP vs Phase 2 breakdown
|
|
291
|
-
|
|
292
|
-
4. **PLANNING_SESSION_SUMMARY.md** (this document)
|
|
293
|
-
- All decisions made
|
|
294
|
-
- Rationales
|
|
295
|
-
- Implementation status
|
|
296
|
-
|
|
297
|
-
---
|
|
298
|
-
|
|
299
|
-
## Next Steps
|
|
300
|
-
|
|
301
|
-
### When Ready to Implement:
|
|
302
|
-
1. Review all 3 architecture documents
|
|
303
|
-
2. Start with MVP (8 tools, no exploration)
|
|
304
|
-
3. Implement in order:
|
|
305
|
-
- Tool registry + tool implementations
|
|
306
|
-
- Journey memory structures
|
|
307
|
-
- OrchestratorAgent loop
|
|
308
|
-
- Integration with ScenarioWorker
|
|
309
|
-
- Token tracking integration
|
|
310
|
-
- Comprehensive logging
|
|
311
|
-
4. Test with real scenarios
|
|
312
|
-
5. Measure metrics vs current approach
|
|
313
|
-
6. Iterate based on findings
|
|
314
|
-
7. Add Phase 2 features when validated
|
|
315
|
-
|
|
316
|
-
---
|
|
317
|
-
|
|
318
|
-
## Success Criteria (MVP)
|
|
319
|
-
|
|
320
|
-
### Must Have:
|
|
321
|
-
- [ ] Fewer iterations than current (target: 50% reduction)
|
|
322
|
-
- [ ] Backward compatible (VS Code extension and GitHub Runner still work)
|
|
323
|
-
- [ ] No infinite loops (guardrails work)
|
|
324
|
-
- [ ] Memory doesn't bloat
|
|
325
|
-
- [ ] Tool extensibility works
|
|
326
|
-
- [ ] Token usage tracked accurately
|
|
327
|
-
|
|
328
|
-
### Nice to Have:
|
|
329
|
-
- [ ] Fewer LLM calls than current
|
|
330
|
-
- [ ] Better success rate
|
|
331
|
-
- [ ] Faster execution
|
|
332
|
-
|
|
333
|
-
### Acceptable Trade-offs:
|
|
334
|
-
- ⚠️ Slightly higher token usage per iteration (richer context)
|
|
335
|
-
- ⚠️ Some tool call overhead
|
|
336
|
-
- ⚠️ No exploratory actions in MVP
|
|
337
|
-
|
|
338
|
-
---
|
|
339
|
-
|
|
340
|
-
## Estimated Timeline
|
|
341
|
-
|
|
342
|
-
**MVP Implementation**: 2-3 weeks
|
|
343
|
-
- Week 1: Foundation (types, tool registry, tool implementations)
|
|
344
|
-
- Week 2: Orchestrator loop, integration
|
|
345
|
-
- Week 3: Testing, refinement, metrics
|
|
346
|
-
|
|
347
|
-
**Phase 2 (Exploration)**: 1-2 weeks after MVP validated
|
|
348
|
-
|
|
349
|
-
**Total**: 3-5 weeks for complete solution
|
|
350
|
-
|
|
351
|
-
---
|
|
352
|
-
|
|
353
|
-
## Final Architecture Confidence
|
|
354
|
-
|
|
355
|
-
**✅ Ready to implement** with:
|
|
356
|
-
- All major decisions finalized
|
|
357
|
-
- Trade-offs understood and accepted
|
|
358
|
-
- Risks identified with mitigations
|
|
359
|
-
- Phased approach reduces implementation risk
|
|
360
|
-
- Backward compatibility ensured
|
|
361
|
-
- Comprehensive documentation complete
|
|
362
|
-
|
|
363
|
-
**Key strengths**:
|
|
364
|
-
- Human-like (memory, learning, reflection, recovery)
|
|
365
|
-
- Extensible (tool registry, dynamic prompts)
|
|
366
|
-
- Safe (system guardrails, agent self-awareness)
|
|
367
|
-
- Transparent (comprehensive logging)
|
|
368
|
-
- Cost-aware (token tracking)
|
|
369
|
-
- Practical (recovery tools, self-unstuck)
|
|
370
|
-
|
|
371
|
-
**No blockers to proceed.**
|
|
372
|
-
|
|
@@ -1,120 +0,0 @@
|
|
|
1
|
-
# System Prompt Optimization Analysis
|
|
2
|
-
|
|
3
|
-
## Current Stats:
|
|
4
|
-
- **System Prompt**: 17,573 chars (346 lines)
|
|
5
|
-
- **With Tool Descriptions**: 19,613 chars (~4,903 tokens)
|
|
6
|
-
- **Cost per call**: ~$0.0007 (gpt-5-mini input tokens)
|
|
7
|
-
|
|
8
|
-
## Optimization Opportunities:
|
|
9
|
-
|
|
10
|
-
### 1. **Duplicate Examples** (Save ~30%)
|
|
11
|
-
**Current**: Multiple example sections with ❌/✅ pairs
|
|
12
|
-
- Lines 633-644: Examples section with goto, fill, click examples
|
|
13
|
-
- Lines 621-626: Ambiguous text handling examples
|
|
14
|
-
- Lines 603-607: DOM snapshot examples
|
|
15
|
-
- Lines 615-619: Selector preference list
|
|
16
|
-
|
|
17
|
-
**Optimization**: Consolidate into ONE examples section
|
|
18
|
-
**Savings**: ~2,000 chars
|
|
19
|
-
|
|
20
|
-
### 2. **Verbose Selector Section** (Save ~20%)
|
|
21
|
-
**Current**: Lines 602-644 (43 lines, ~1,800 chars)
|
|
22
|
-
- Lists all selector types with emoji
|
|
23
|
-
- Detailed examples for each
|
|
24
|
-
- Repetitive "Good/Bad" patterns
|
|
25
|
-
|
|
26
|
-
**Optimization**: Create compact reference table
|
|
27
|
-
```
|
|
28
|
-
SELECTORS (preference order):
|
|
29
|
-
1. getByRole/Label/Placeholder (semantic, stable)
|
|
30
|
-
2. getByText (scope to parent if ambiguous!)
|
|
31
|
-
3. CSS IDs (avoid auto-generated)
|
|
32
|
-
|
|
33
|
-
Common mistakes: Missing goto timeout, unscoped getByText, auto-generated IDs
|
|
34
|
-
```
|
|
35
|
-
**Savings**: ~1,200 chars
|
|
36
|
-
|
|
37
|
-
### 3. **Emoji Overuse** (Save ~5%)
|
|
38
|
-
**Current**: Heavy use of ⚠️, ❌, ✅, 🏆, etc.
|
|
39
|
-
|
|
40
|
-
**Optimization**: Use sparingly (only for critical warnings)
|
|
41
|
-
**Savings**: ~500 chars
|
|
42
|
-
|
|
43
|
-
### 4. **Redundant "WHY" Explanations** (Save ~10%)
|
|
44
|
-
**Current**: Multiple "WHY:" sections explaining rationale
|
|
45
|
-
- Lines 642-644: WHY semantic selectors
|
|
46
|
-
- Similar explanations scattered throughout
|
|
47
|
-
|
|
48
|
-
**Optimization**: Remove or consolidate
|
|
49
|
-
**Savings**: ~800 chars
|
|
50
|
-
|
|
51
|
-
### 5. **Tool Instructions Redundancy** (Save ~10%)
|
|
52
|
-
**Current**: Tools described twice:
|
|
53
|
-
- In tool registry (dynamic)
|
|
54
|
-
- In prompt rules (static)
|
|
55
|
-
|
|
56
|
-
**Optimization**: Rely more on tool registry descriptions
|
|
57
|
-
**Savings**: ~600 chars
|
|
58
|
-
|
|
59
|
-
### 6. **Status Rules Repetition** (Save ~5%)
|
|
60
|
-
**Current**: Lines 468-486 - Status rules explained multiple times
|
|
61
|
-
|
|
62
|
-
**Optimization**: Single concise statement
|
|
63
|
-
**Savings**: ~400 chars
|
|
64
|
-
|
|
65
|
-
## Proposed Condensed Structure:
|
|
66
|
-
|
|
67
|
-
```markdown
|
|
68
|
-
# System Prompt (Optimized)
|
|
69
|
-
|
|
70
|
-
## Agent Role & Tools
|
|
71
|
-
[Tool descriptions from registry]
|
|
72
|
-
|
|
73
|
-
## Response Format (JSON)
|
|
74
|
-
{required fields} - minimal format, no extensive comments
|
|
75
|
-
|
|
76
|
-
## Core Rules (Prioritized)
|
|
77
|
-
1. Status decisions (complete/continue/stuck)
|
|
78
|
-
2. Selector strategy (semantic > text > CSS)
|
|
79
|
-
3. Common errors (goto timeout, strict mode, auto-IDs)
|
|
80
|
-
4. When to use tools vs commands
|
|
81
|
-
5. Note to future self usage
|
|
82
|
-
|
|
83
|
-
## Examples (Consolidated)
|
|
84
|
-
- Navigation: goto with 30s timeout
|
|
85
|
-
- Selectors: Scoped getByText, semantic selectors
|
|
86
|
-
- Coordinates: When and how
|
|
87
|
-
|
|
88
|
-
## Advanced Features
|
|
89
|
-
- Blocker detection
|
|
90
|
-
- Step re-evaluation
|
|
91
|
-
- Coordinate fallback
|
|
92
|
-
```
|
|
93
|
-
|
|
94
|
-
## Total Potential Savings:
|
|
95
|
-
|
|
96
|
-
- **Before**: 17,573 chars (~4,393 tokens)
|
|
97
|
-
- **After**: ~12,000 chars (~3,000 tokens)
|
|
98
|
-
- **Reduction**: ~32% reduction in system prompt
|
|
99
|
-
- **Cost savings**: ~$0.0002 per call (~30% per call)
|
|
100
|
-
- **Overall impact**: With 7 tasks using gpt-4o-mini, only 4 tasks will benefit
|
|
101
|
-
- **Est. total savings**: ~5-8% additional cost reduction
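The before/after figures above follow from the 4-characters-per-token heuristic. A short worked check, assuming rounding to the nearest token; the helper names are illustrative:

```typescript
const CHARS_PER_TOKEN = 4;

// Approximate token count for a prompt of the given character length.
function charsToTokens(chars: number): number {
  return Math.round(chars / CHARS_PER_TOKEN);
}

// Relative size reduction, as a percentage of the original.
function reductionPercent(beforeChars: number, afterChars: number): number {
  return ((beforeChars - afterChars) / beforeChars) * 100;
}

const beforeChars = 17573;
const afterChars = 12000;

console.log(charsToTokens(beforeChars)); // 4393
console.log(charsToTokens(afterChars));  // 3000
console.log(reductionPercent(beforeChars, afterChars).toFixed(1)); // "31.7"
```

The 31.7% result rounds to the ~32% reduction quoted in the list.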
|
|
102
|
-
|
|
103
|
-
## Recommendation:
|
|
104
|
-
|
|
105
|
-
**Optimize if:**
|
|
106
|
-
- You're seeing consistent 500 errors (less likely now with retry)
|
|
107
|
-
- Want to maximize caching efficiency
|
|
108
|
-
- Running high-volume scenarios (1000+ per day)
|
|
109
|
-
|
|
110
|
-
**Skip if:**
|
|
111
|
-
- Current cost is acceptable
|
|
112
|
-
- Prompt clarity is more important than 5-8% savings
|
|
113
|
-
- Risk of quality degradation concerns you
|
|
114
|
-
|
|
115
|
-
## Action Items (if optimizing):
|
|
116
|
-
|
|
117
|
-
1. ✅ Keep: Critical decision logic, JSON format, coordinate mode
|
|
118
|
-
2. ⚠️ Condense: Selector examples, error responses, WHY sections
|
|
119
|
-
3. ❌ Remove: Duplicate examples, excessive emojis, redundant explanations
|
|
120
|
-
|