testchimp-runner-core 0.0.21 → 0.0.23
- package/VISION_DIAGNOSTICS_IMPROVEMENTS.md +336 -0
- package/dist/credit-usage-service.d.ts +9 -0
- package/dist/credit-usage-service.d.ts.map +1 -1
- package/dist/credit-usage-service.js +20 -5
- package/dist/credit-usage-service.js.map +1 -1
- package/dist/execution-service.d.ts +7 -2
- package/dist/execution-service.d.ts.map +1 -1
- package/dist/execution-service.js +91 -36
- package/dist/execution-service.js.map +1 -1
- package/dist/index.d.ts +30 -2
- package/dist/index.d.ts.map +1 -1
- package/dist/index.js +91 -26
- package/dist/index.js.map +1 -1
- package/dist/llm-facade.d.ts +64 -8
- package/dist/llm-facade.d.ts.map +1 -1
- package/dist/llm-facade.js +361 -109
- package/dist/llm-facade.js.map +1 -1
- package/dist/llm-provider.d.ts +39 -0
- package/dist/llm-provider.d.ts.map +1 -0
- package/dist/llm-provider.js +7 -0
- package/dist/llm-provider.js.map +1 -0
- package/dist/model-constants.d.ts +21 -0
- package/dist/model-constants.d.ts.map +1 -0
- package/dist/model-constants.js +24 -0
- package/dist/model-constants.js.map +1 -0
- package/dist/orchestrator/index.d.ts +8 -0
- package/dist/orchestrator/index.d.ts.map +1 -0
- package/dist/orchestrator/index.js +23 -0
- package/dist/orchestrator/index.js.map +1 -0
- package/dist/orchestrator/orchestrator-agent.d.ts +66 -0
- package/dist/orchestrator/orchestrator-agent.d.ts.map +1 -0
- package/dist/orchestrator/orchestrator-agent.js +855 -0
- package/dist/orchestrator/orchestrator-agent.js.map +1 -0
- package/dist/orchestrator/tool-registry.d.ts +74 -0
- package/dist/orchestrator/tool-registry.d.ts.map +1 -0
- package/dist/orchestrator/tool-registry.js +131 -0
- package/dist/orchestrator/tool-registry.js.map +1 -0
- package/dist/orchestrator/tools/check-page-ready.d.ts +13 -0
- package/dist/orchestrator/tools/check-page-ready.d.ts.map +1 -0
- package/dist/orchestrator/tools/check-page-ready.js +72 -0
- package/dist/orchestrator/tools/check-page-ready.js.map +1 -0
- package/dist/orchestrator/tools/extract-data.d.ts +13 -0
- package/dist/orchestrator/tools/extract-data.d.ts.map +1 -0
- package/dist/orchestrator/tools/extract-data.js +84 -0
- package/dist/orchestrator/tools/extract-data.js.map +1 -0
- package/dist/orchestrator/tools/index.d.ts +10 -0
- package/dist/orchestrator/tools/index.d.ts.map +1 -0
- package/dist/orchestrator/tools/index.js +18 -0
- package/dist/orchestrator/tools/index.js.map +1 -0
- package/dist/orchestrator/tools/inspect-page.d.ts +13 -0
- package/dist/orchestrator/tools/inspect-page.d.ts.map +1 -0
- package/dist/orchestrator/tools/inspect-page.js +39 -0
- package/dist/orchestrator/tools/inspect-page.js.map +1 -0
- package/dist/orchestrator/tools/recall-history.d.ts +13 -0
- package/dist/orchestrator/tools/recall-history.d.ts.map +1 -0
- package/dist/orchestrator/tools/recall-history.js +64 -0
- package/dist/orchestrator/tools/recall-history.js.map +1 -0
- package/dist/orchestrator/tools/take-screenshot.d.ts +15 -0
- package/dist/orchestrator/tools/take-screenshot.d.ts.map +1 -0
- package/dist/orchestrator/tools/take-screenshot.js +112 -0
- package/dist/orchestrator/tools/take-screenshot.js.map +1 -0
- package/dist/orchestrator/types.d.ts +133 -0
- package/dist/orchestrator/types.d.ts.map +1 -0
- package/dist/orchestrator/types.js +28 -0
- package/dist/orchestrator/types.js.map +1 -0
- package/dist/playwright-mcp-service.d.ts +9 -0
- package/dist/playwright-mcp-service.d.ts.map +1 -1
- package/dist/playwright-mcp-service.js +20 -5
- package/dist/playwright-mcp-service.js.map +1 -1
- package/dist/progress-reporter.d.ts +97 -0
- package/dist/progress-reporter.d.ts.map +1 -0
- package/dist/progress-reporter.js +18 -0
- package/dist/progress-reporter.js.map +1 -0
- package/dist/prompts.d.ts +24 -0
- package/dist/prompts.d.ts.map +1 -1
- package/dist/prompts.js +593 -68
- package/dist/prompts.js.map +1 -1
- package/dist/providers/backend-proxy-llm-provider.d.ts +25 -0
- package/dist/providers/backend-proxy-llm-provider.d.ts.map +1 -0
- package/dist/providers/backend-proxy-llm-provider.js +76 -0
- package/dist/providers/backend-proxy-llm-provider.js.map +1 -0
- package/dist/providers/local-llm-provider.d.ts +21 -0
- package/dist/providers/local-llm-provider.d.ts.map +1 -0
- package/dist/providers/local-llm-provider.js +35 -0
- package/dist/providers/local-llm-provider.js.map +1 -0
- package/dist/scenario-service.d.ts +27 -1
- package/dist/scenario-service.d.ts.map +1 -1
- package/dist/scenario-service.js +48 -12
- package/dist/scenario-service.js.map +1 -1
- package/dist/scenario-worker-class.d.ts +39 -2
- package/dist/scenario-worker-class.d.ts.map +1 -1
- package/dist/scenario-worker-class.js +614 -86
- package/dist/scenario-worker-class.js.map +1 -1
- package/dist/script-utils.d.ts +2 -0
- package/dist/script-utils.d.ts.map +1 -1
- package/dist/script-utils.js +44 -4
- package/dist/script-utils.js.map +1 -1
- package/dist/types.d.ts +11 -0
- package/dist/types.d.ts.map +1 -1
- package/dist/types.js.map +1 -1
- package/dist/utils/browser-utils.d.ts +20 -1
- package/dist/utils/browser-utils.d.ts.map +1 -1
- package/dist/utils/browser-utils.js +102 -51
- package/dist/utils/browser-utils.js.map +1 -1
- package/dist/utils/page-info-utils.d.ts +23 -4
- package/dist/utils/page-info-utils.d.ts.map +1 -1
- package/dist/utils/page-info-utils.js +174 -43
- package/dist/utils/page-info-utils.js.map +1 -1
- package/package.json +1 -2
- package/plandocs/HUMAN_LIKE_IMPROVEMENTS.md +642 -0
- package/plandocs/MULTI_AGENT_ARCHITECTURE_REVIEW.md +844 -0
- package/plandocs/ORCHESTRATOR_MVP_SUMMARY.md +539 -0
- package/plandocs/PHASE1_ABSTRACTION_COMPLETE.md +241 -0
- package/plandocs/PHASE1_FINAL_STATUS.md +210 -0
- package/plandocs/PLANNING_SESSION_SUMMARY.md +372 -0
- package/plandocs/SCRIPT_CLEANUP_FEATURE.md +201 -0
- package/plandocs/SCRIPT_GENERATION_ARCHITECTURE.md +364 -0
- package/plandocs/SELECTOR_IMPROVEMENTS.md +139 -0
- package/src/credit-usage-service.ts +23 -5
- package/src/execution-service.ts +152 -42
- package/src/index.ts +169 -26
- package/src/llm-facade.ts +500 -126
- package/src/llm-provider.ts +43 -0
- package/src/model-constants.ts +23 -0
- package/src/orchestrator/index.ts +33 -0
- package/src/orchestrator/orchestrator-agent.ts +1037 -0
- package/src/orchestrator/tool-registry.ts +182 -0
- package/src/orchestrator/tools/check-page-ready.ts +75 -0
- package/src/orchestrator/tools/extract-data.ts +92 -0
- package/src/orchestrator/tools/index.ts +11 -0
- package/src/orchestrator/tools/inspect-page.ts +42 -0
- package/src/orchestrator/tools/recall-history.ts +72 -0
- package/src/orchestrator/tools/take-screenshot.ts +128 -0
- package/src/orchestrator/types.ts +200 -0
- package/src/playwright-mcp-service.ts +23 -5
- package/src/progress-reporter.ts +109 -0
- package/src/prompts.ts +606 -69
- package/src/providers/backend-proxy-llm-provider.ts +91 -0
- package/src/providers/local-llm-provider.ts +38 -0
- package/src/scenario-service.ts +83 -13
- package/src/scenario-worker-class.ts +740 -72
- package/src/script-utils.ts +50 -5
- package/src/types.ts +13 -1
- package/src/utils/browser-utils.ts +123 -51
- package/src/utils/page-info-utils.ts +210 -53
- package/testchimp-runner-core-0.0.22.tgz +0 -0
|
@@ -0,0 +1,372 @@
|
|
|
1
|
+
# Planning Session Summary: Orchestrator Agent Architecture
|
|
2
|
+
|
|
3
|
+
## Date: October 11, 2025
|
|
4
|
+
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## Final Decisions Made
|
|
8
|
+
|
|
9
|
+
### 1. ✅ Self-Reflection in MVP
|
|
10
|
+
**Decision**: Include free-form self-reflection with agent-driven loop detection
|
|
11
|
+
- Agent outputs `guidanceForNext` (free-form text) for train of thought continuity
|
|
12
|
+
- Agent signals `detectingLoop: true` when it notices repetition
|
|
13
|
+
- Agent decides when to break its own loop; the system enforces hard limits as a backup
|
|
14
|
+
|
|
15
|
+
**Rationale**: Valuable for maintaining context across iterations, agent self-corrects
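The split between the agent's own loop signal and the system's backup limit can be sketched as below. This is a minimal illustration, not the package's implementation: `checkLoop`, `LoopVerdict`, and the `MAX_ITERATIONS` value are hypothetical names and numbers; only `guidanceForNext`, `detectingLoop`, and `loopReasoning` come from the plan.

```typescript
// Field names follow the plan; the interface itself is an illustrative assumption.
interface SelfReflection {
  guidanceForNext: string;
  detectingLoop: boolean;
  loopReasoning?: string;
}

// Assumed hard cap: the plan does not state a concrete number.
const MAX_ITERATIONS = 5;

type LoopVerdict = "continue" | "agent_break" | "system_break";

function checkLoop(iteration: number, reflection: SelfReflection): LoopVerdict {
  // The system limit is the backup and always takes precedence.
  if (iteration >= MAX_ITERATIONS) return "system_break";
  // Below the hard limit, honor the agent's own signal.
  if (reflection.detectingLoop) return "agent_break";
  return "continue";
}
```

Checking the hard limit first keeps the guarantee independent of what the agent reports.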
|
|
16
|
+
|
|
17
|
+
### 2. ✅ No Screenshot Budget
|
|
18
|
+
**Decision**: Screenshots available freely, no artificial limits
|
|
19
|
+
- Corrected token cost: 1-2K tokens (NOT 100K!)
|
|
20
|
+
- For 1920x1080 viewport: ~1,452 tokens (gpt-4.1-mini)
|
|
21
|
+
- Comparable to extra DOM context
|
|
22
|
+
|
|
23
|
+
**Rationale**: Very affordable, enables liberal vision use throughout journey
|
|
24
|
+
|
|
25
|
+
### 3. ✅ DOM Limits (Increased for Complex Pages)
|
|
26
|
+
**Decision**: Increased limits in getEnhancedPageInfo to handle complex UIs
|
|
27
|
+
- ARIA tree depth: 4 levels
|
|
28
|
+
- Interactive elements: top 50 (was 12)
|
|
29
|
+
- IDs: top 50 (was 10)
|
|
30
|
+
- Data attributes: top 50 (was 10)
|
|
31
|
+
- Form fields: top 20 (was 8)
|
|
32
|
+
- Page structure: top 10 (was 6)
|
|
33
|
+
- General elements: top 50 (was 15)
|
|
34
|
+
- Text: 30 chars max
|
|
35
|
+
- Result: ~800-1,500 tokens
|
|
36
|
+
|
|
37
|
+
**Rationale**: Complex pages need more context, still compact with truncation
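As a rough sketch of how such limits keep the output compact, the helpers below cap list sizes and text length. The numeric limits are the ones listed above; `LIMITS`, `truncateText`, and `takeTop` are hypothetical names, and the actual `getEnhancedPageInfo` logic may differ.

```typescript
// Limits copied from the decision above; the constant name is illustrative.
const LIMITS = {
  interactive: 50,
  ids: 50,
  dataAttributes: 50,
  formFields: 20,
  pageStructure: 10,
  general: 50,
  maxTextChars: 30,
} as const;

// Collapse whitespace, then cut the text to the character limit.
function truncateText(text: string, max: number = LIMITS.maxTextChars): string {
  const collapsed = text.replace(/\s+/g, " ").trim();
  return collapsed.length <= max ? collapsed : collapsed.slice(0, max);
}

// Keep only the first `limit` entries of a category.
function takeTop<T>(items: readonly T[], limit: number): T[] {
  return items.slice(0, limit);
}
```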
|
|
38
|
+
|
|
39
|
+
### 4. ✅ Token Usage Tracking
|
|
40
|
+
**Decision**: Track and report all LLM token usage via callback
|
|
41
|
+
- Interface: `onTokensUsed({inputTokens, outputTokens, includesImage})`
|
|
42
|
+
- Heuristic: 4 characters = 1 token
|
|
43
|
+
- Image tokens: ~1,500 estimate for viewport screenshots
|
|
44
|
+
- Reported via ProgressReporter for analytics
|
|
45
|
+
|
|
46
|
+
**Rationale**: Cost tracking, optimization, analytics
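The heuristic can be illustrated with a small estimator. The sketch assumes rounding up and a callback passed in directly; `estimateTokens` and `reportUsage` are hypothetical helpers, while the payload fields match the `onTokensUsed` interface above.

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  includesImage: boolean;
}

const CHARS_PER_TOKEN = 4;          // heuristic from the plan
const IMAGE_TOKEN_ESTIMATE = 1500;  // viewport screenshot estimate from the plan

// Rounding up is an assumption; the plan only states the 4:1 ratio.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

function reportUsage(
  prompt: string,
  response: string,
  includesImage: boolean,
  onTokensUsed: (usage: TokenUsage) => void,
): TokenUsage {
  const usage: TokenUsage = {
    inputTokens: estimateTokens(prompt) + (includesImage ? IMAGE_TOKEN_ESTIMATE : 0),
    outputTokens: estimateTokens(response),
    includesImage,
  };
  onTokensUsed(usage);
  return usage;
}
```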
|
|
47
|
+
|
|
48
|
+
### 5. ✅ Recovery Tools in MVP
|
|
49
|
+
**Decision**: Include 3 recovery tools for self-unsticking
|
|
50
|
+
- `navigate_back()` - Go back in history
|
|
51
|
+
- `refresh_page()` - Reload page
|
|
52
|
+
- `navigate_to_url({url})` - Navigate to specific URL (with domain validation)
|
|
53
|
+
|
|
54
|
+
**Rationale**: Agent needs ability to recover from bad states (wrong navigation, stuck page, side effects)
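The plan does not say how domain validation works. One plausible reading, sketched here as an assumption, is to allow only HTTP(S) targets on the same hostname as the current page; `isAllowedNavigation` is a hypothetical helper name.

```typescript
// Assumed policy: same hostname as the current page, http/https only.
function isAllowedNavigation(targetUrl: string, currentUrl: string): boolean {
  try {
    const target = new URL(targetUrl);
    const current = new URL(currentUrl);
    if (target.protocol !== "http:" && target.protocol !== "https:") return false;
    return target.hostname === current.hostname;
  } catch (_err) {
    // Unparseable URLs are rejected rather than passed to the browser.
    return false;
  }
}
```

A stricter or looser policy (for example, allowing subdomains) would change only the hostname comparison.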
|
|
55
|
+
|
|
56
|
+
### 6. ✅ Inquisitive Exploration in Phase 2
|
|
57
|
+
**Decision**: Defer exploratory actions to Phase 2, MVP uses workarounds
|
|
58
|
+
- Phase 2 tool: `explore_element({action, selector, purpose})`
|
|
59
|
+
- Actions: hover, click_info, click_menu, focus
|
|
60
|
+
- Safety: State validation, non-consequential only
|
|
61
|
+
- **Screenshot handling**: Immediate analysis via sub-agent call
|
|
62
|
+
- System takes screenshot after exploration
|
|
63
|
+
- Calls agent to analyze screenshot
|
|
64
|
+
- Agent extracts learnings (text)
|
|
65
|
+
- Only TEXT stored in history, NOT screenshot
|
|
66
|
+
- Keeps memory lightweight
|
|
67
|
+
- MVP workaround: Use screenshot + DOM analysis + retry
|
|
68
|
+
|
|
69
|
+
**Rationale**: Safety concerns, complexity, needs battle-testing first
|
|
70
|
+
|
|
71
|
+
### 7. ✅ Always-Provided Context Structure
|
|
72
|
+
**Decision**: Provide comprehensive context automatically each iteration
|
|
73
|
+
- Overall goal + current goal
|
|
74
|
+
- Current page info (DOM)
|
|
75
|
+
- Recent 6-7 steps
|
|
76
|
+
- Experiences (learnings)
|
|
77
|
+
- Extracted data
|
|
78
|
+
- Self-reflection from previous iteration
|
|
79
|
+
- Journey progress tracking
|
|
80
|
+
|
|
81
|
+
**Rationale**: Agent needs full situational awareness without repeated tool calls
|
|
82
|
+
|
|
83
|
+
### 8. ✅ System vs Agent Guardrails
|
|
84
|
+
**Decision**: Clear separation of responsibilities
|
|
85
|
+
- **System enforces**: Iteration limits, tool call limits, command limits
|
|
86
|
+
- **Agent signals**: Stuck, infeasible, detecting loop
|
|
87
|
+
- System has final say, agent provides soft guidance
|
|
88
|
+
|
|
89
|
+
**Rationale**: Safety (hard limits) + intelligence (agent self-awareness)
|
|
90
|
+
|
|
91
|
+
---
|
|
92
|
+
|
|
93
|
+
## Architecture Summary
|
|
94
|
+
|
|
95
|
+
### Core Components
|
|
96
|
+
|
|
97
|
+
**1. Always-Provided Context** (auto-fetched each iteration)
|
|
98
|
+
```typescript
|
|
99
|
+
{
|
|
100
|
+
overallGoal, currentStepGoal, stepNumber, totalSteps,
|
|
101
|
+
currentPageInfo, currentURL,
|
|
102
|
+
recentSteps /* most recent 6-7 */, experiences, extractedData,
|
|
103
|
+
previousIterationGuidance
|
|
104
|
+
}
|
|
105
|
+
```
|
|
106
|
+
|
|
107
|
+
**2. Tools** (8 in MVP)
|
|
108
|
+
- **Information**: take_screenshot, recall_history, inspect_page, check_page_ready
|
|
109
|
+
- **Data**: extract_data
|
|
110
|
+
- **Recovery**: navigate_back, refresh_page, navigate_to_url
|
|
111
|
+
|
|
112
|
+
**3. Agent Decision Output**
|
|
113
|
+
```typescript
|
|
114
|
+
{
|
|
115
|
+
toolCalls, toolReasoning, needsToolResults,
|
|
116
|
+
commands, commandReasoning,
|
|
117
|
+
selfReflection: {guidanceForNext, detectingLoop, loopReasoning},
|
|
118
|
+
experiences, memoryUpdate,
|
|
119
|
+
status: 'complete' | 'stuck' | 'infeasible' | 'continue',
|
|
120
|
+
statusReasoning, reasoning
|
|
121
|
+
}
|
|
122
|
+
```
|
|
123
|
+
|
|
124
|
+
**4. Sequential Batch Execution**
|
|
125
|
+
- Agent plans batch of commands (max 3-5)
|
|
126
|
+
- System executes one-by-one
|
|
127
|
+
- Stop at first failure
|
|
128
|
+
- Record each individually in history
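A minimal sketch of this execution policy follows. The `run` callback stands in for whatever actually executes a command against the page, and `executeBatch`, `CommandResult`, and the default cap of 5 are illustrative assumptions.

```typescript
interface CommandResult {
  command: string;
  success: boolean;
  error?: string;
}

async function executeBatch(
  commands: string[],
  run: (command: string) => Promise<void>, // hypothetical executor supplied by the caller
  maxBatchSize: number = 5,
): Promise<CommandResult[]> {
  const history: CommandResult[] = [];
  // Execute one by one; never run more than the batch cap.
  for (const command of commands.slice(0, maxBatchSize)) {
    try {
      await run(command);
      history.push({ command, success: true });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      history.push({ command, success: false, error: message });
      break; // stop at first failure; later commands are not attempted
    }
  }
  return history;
}
```

Each command gets its own history entry, so the agent can see exactly which one failed on the next iteration.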
|
|
129
|
+
|
|
130
|
+
**5. Comprehensive Logging**
|
|
131
|
+
- Every iteration: goal, reasoning, self-reflection, tools, commands, experiences, status
|
|
132
|
+
- All thoughts visible for debugging
|
|
133
|
+
- Exported via ProgressReporter
|
|
134
|
+
|
|
135
|
+
**6. Token Usage Tracking**
|
|
136
|
+
- Input + output tokens calculated (4 chars = 1 token)
|
|
137
|
+
- Image tokens estimated (~1,500 for viewport)
|
|
138
|
+
- Reported via `onTokensUsed()` callback
|
|
139
|
+
|
|
140
|
+
---
|
|
141
|
+
|
|
142
|
+
## MVP vs Phase 2
|
|
143
|
+
|
|
144
|
+
### MVP Includes:
|
|
145
|
+
- ✅ 8 core tools (info + data + recovery)
|
|
146
|
+
- ✅ Journey memory with experiences
|
|
147
|
+
- ✅ Self-reflection + loop detection
|
|
148
|
+
- ✅ Batch command planning
|
|
149
|
+
- ✅ Self-recovery (navigate back/refresh)
|
|
150
|
+
- ✅ Token tracking
|
|
151
|
+
- ✅ Comprehensive logging
|
|
152
|
+
- ✅ Configurable guardrails
|
|
153
|
+
|
|
154
|
+
### Phase 2 Adds:
|
|
155
|
+
- Inquisitive exploration (explore_element)
|
|
156
|
+
- Advanced optimizations (caching, adaptive limits)
|
|
157
|
+
- Memory summarization for long journeys
|
|
158
|
+
|
|
159
|
+
---
|
|
160
|
+
|
|
161
|
+
## Key Metrics
|
|
162
|
+
|
|
163
|
+
### Token Usage Per Iteration (Estimated)
|
|
164
|
+
```
|
|
165
|
+
System prompt: 500 tokens
|
|
166
|
+
Always-provided context: 1,200-2,000 tokens
|
|
167
|
+
- Goals & progress: 100
|
|
168
|
+
- DOM (increased limits): 800-1,500
|
|
169
|
+
- Recent 6-7 steps: 300-500
|
|
170
|
+
- Experiences: 100-200
|
|
171
|
+
Self-reflection: 100 tokens
|
|
172
|
+
Tool results (optional): 300-500 tokens
|
|
173
|
+
Screenshot (optional): 1,500 tokens
|
|
174
|
+
|
|
175
|
+
Total without screenshot: 2,400-3,600 tokens
|
|
176
|
+
Total with screenshot: 3,900-5,100 tokens
|
|
177
|
+
```
|
|
178
|
+
|
|
179
|
+
### Expected Performance
|
|
180
|
+
- LLM calls/step: 2-4 (vs 4-6 current)
|
|
181
|
+
- Iterations/step: 3-5 (vs 8-12 current)
|
|
182
|
+
- Tool calls/step: 1-3
|
|
183
|
+
- Commands/iteration: 2-3 (batched)
|
|
184
|
+
- Agent learns: 1-2 experiences per step
|
|
185
|
+
|
|
186
|
+
---
|
|
187
|
+
|
|
188
|
+
## Inquisitive Exploration Design (Phase 2)
|
|
189
|
+
|
|
190
|
+
### Problem
|
|
191
|
+
Menu items are icon-only, with no text or ARIA labels, so the agent cannot tell which one to click.
|
|
192
|
+
|
|
193
|
+
### Solution
|
|
194
|
+
Agent investigates non-consequentially, analyzes immediately:
|
|
195
|
+
|
|
196
|
+
```
|
|
197
|
+
Iteration N:
|
|
198
|
+
Agent Decision: "Need to hover over icons to see tooltips"
|
|
199
|
+
Tool: explore_element({action: "hover", selector: "nav button:nth-child(2)"})
|
|
200
|
+
|
|
201
|
+
System:
|
|
202
|
+
→ Hover, wait 500ms
|
|
203
|
+
→ Take screenshot
|
|
204
|
+
→ Call agent (sub-call): "What do you see in this screenshot?"
|
|
205
|
+
|
|
206
|
+
Agent Analysis (sub-call):
|
|
207
|
+
→ Sees tooltip
|
|
208
|
+
→ Returns: "Tooltip shows 'Dashboard' - this is the Dashboard button"
|
|
209
|
+
|
|
210
|
+
System:
|
|
211
|
+
→ Stores TEXT in history: "Explored button, tooltip confirms Dashboard"
|
|
212
|
+
→ Does NOT store screenshot
|
|
213
|
+
→ Returns to main agent: {success: true, learning: "Tooltip shows Dashboard"}
|
|
214
|
+
|
|
215
|
+
Agent Decision (continues):
|
|
216
|
+
→ "Great, confirmed it's Dashboard"
|
|
217
|
+
→ Commands: ["page.click('nav button:nth-child(2)')"]
|
|
218
|
+
|
|
219
|
+
System: Execute commands
|
|
220
|
+
```
|
|
221
|
+
|
|
222
|
+
**Key difference**: Screenshot analyzed WITHIN same iteration, only text stored
|
|
223
|
+
|
|
224
|
+
### Allowed Actions
|
|
225
|
+
- ✅ hover (show tooltips)
|
|
226
|
+
- ✅ click_info (info icons)
|
|
227
|
+
- ✅ click_menu (expand menus)
|
|
228
|
+
- ✅ focus (see input hints)
|
|
229
|
+
|
|
230
|
+
### NOT Allowed
|
|
231
|
+
- ❌ Submit forms
|
|
232
|
+
- ❌ Delete/remove
|
|
233
|
+
- ❌ Logout
|
|
234
|
+
- ❌ File uploads
|
|
235
|
+
|
|
236
|
+
### Safety
|
|
237
|
+
- State validation (URL, modal count)
|
|
238
|
+
- Revert if unexpected navigation
|
|
239
|
+
- Budget: 10 explorations per step
|
|
240
|
+
- Timeout: 2s per exploration
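Two of these safety checks can be sketched as plain helpers. This is only an illustration under assumptions: `PageState`, `stateChanged`, and `createExplorationBudget` are hypothetical names, and treating any URL or modal-count difference as a change is a simplification of whatever validation Phase 2 ends up using.

```typescript
interface PageState {
  url: string;
  modalCount: number;
}

const MAX_EXPLORATIONS_PER_STEP = 10; // budget from the plan

// Compare the state captured before and after an exploratory action.
function stateChanged(before: PageState, after: PageState): boolean {
  return before.url !== after.url || before.modalCount !== after.modalCount;
}

// Per-step budget: tryConsume() returns false once the limit is reached.
function createExplorationBudget(limit: number = MAX_EXPLORATIONS_PER_STEP) {
  let used = 0;
  return {
    tryConsume(): boolean {
      if (used >= limit) return false;
      used += 1;
      return true;
    },
    remaining(): number {
      return limit - used;
    },
  };
}
```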
|
|
241
|
+
|
|
242
|
+
### Why Phase 2
|
|
243
|
+
- Safety risk (needs robust validation)
|
|
244
|
+
- Complexity (screenshot handling, state comparison)
|
|
245
|
+
- MVP workaround: screenshot + DOM + retry
|
|
246
|
+
|
|
247
|
+
---
|
|
248
|
+
|
|
249
|
+
## Implementation Status
|
|
250
|
+
|
|
251
|
+
### Completed (During Planning):
|
|
252
|
+
- ✅ Token usage tracking added to interfaces
|
|
253
|
+
- ✅ BackendProxyLLMProvider calculates token usage
|
|
254
|
+
- ✅ LLMFacade prepared for token callback
|
|
255
|
+
- ✅ Progress reporter extended with `onTokensUsed()`
|
|
256
|
+
|
|
257
|
+
### Ready to Implement:
|
|
258
|
+
1. OrchestratorAgent class
|
|
259
|
+
2. ToolRegistry with 8 tools
|
|
260
|
+
3. Journey memory implementation
|
|
261
|
+
4. Always-provided context builder
|
|
262
|
+
5. Self-reflection structures
|
|
263
|
+
6. Recovery tools (navigate_back, refresh_page, navigate_to_url)
|
|
264
|
+
7. Comprehensive logging
|
|
265
|
+
8. Token tracking integration
|
|
266
|
+
|
|
267
|
+
### Phase 2:
|
|
268
|
+
1. Exploratory actions (explore_element)
|
|
269
|
+
2. State validation logic
|
|
270
|
+
3. Advanced optimizations
|
|
271
|
+
|
|
272
|
+
---
|
|
273
|
+
|
|
274
|
+
## Documentation Created
|
|
275
|
+
|
|
276
|
+
1. **MULTI_AGENT_ARCHITECTURE_REVIEW.md**
|
|
277
|
+
- 8 pitfalls analyzed with mitigations
|
|
278
|
+
- Phased implementation strategy
|
|
279
|
+
- Risk analysis and trade-offs
|
|
280
|
+
|
|
281
|
+
2. **ORCHESTRATOR_IMPLEMENTATION_PLAN.md**
|
|
282
|
+
- Detailed implementation specs
|
|
283
|
+
- Code examples
|
|
284
|
+
- Integration points
|
|
285
|
+
|
|
286
|
+
3. **ORCHESTRATOR_MVP_SUMMARY.md**
|
|
287
|
+
- Executive summary
|
|
288
|
+
- Complete feature list
|
|
289
|
+
- Inquisitive exploration section
|
|
290
|
+
- MVP vs Phase 2 breakdown
|
|
291
|
+
|
|
292
|
+
4. **PLANNING_SESSION_SUMMARY.md** (this document)
|
|
293
|
+
- All decisions made
|
|
294
|
+
- Rationales
|
|
295
|
+
- Implementation status
|
|
296
|
+
|
|
297
|
+
---
|
|
298
|
+
|
|
299
|
+
## Next Steps
|
|
300
|
+
|
|
301
|
+
### When Ready to Implement:
|
|
302
|
+
1. Review all 3 architecture documents
|
|
303
|
+
2. Start with MVP (8 tools, no exploration)
|
|
304
|
+
3. Implement in order:
|
|
305
|
+
- Tool registry + tool implementations
|
|
306
|
+
- Journey memory structures
|
|
307
|
+
- OrchestratorAgent loop
|
|
308
|
+
- Integration with ScenarioWorker
|
|
309
|
+
- Token tracking integration
|
|
310
|
+
- Comprehensive logging
|
|
311
|
+
4. Test with real scenarios
|
|
312
|
+
5. Measure metrics vs current approach
|
|
313
|
+
6. Iterate based on findings
|
|
314
|
+
7. Add Phase 2 features when validated
|
|
315
|
+
|
|
316
|
+
---
|
|
317
|
+
|
|
318
|
+
## Success Criteria (MVP)
|
|
319
|
+
|
|
320
|
+
### Must Have:
|
|
321
|
+
- [ ] Fewer iterations than current (target: 50% reduction)
|
|
322
|
+
- [ ] Backward compatible (VS Extension & GitHub Runner work)
|
|
323
|
+
- [ ] No infinite loops (guardrails work)
|
|
324
|
+
- [ ] Memory doesn't bloat
|
|
325
|
+
- [ ] Tool extensibility works
|
|
326
|
+
- [ ] Token usage tracked accurately
|
|
327
|
+
|
|
328
|
+
### Nice to Have:
|
|
329
|
+
- [ ] Fewer LLM calls than current
|
|
330
|
+
- [ ] Better success rate
|
|
331
|
+
- [ ] Faster execution
|
|
332
|
+
|
|
333
|
+
### Acceptable Trade-offs:
|
|
334
|
+
- ⚠️ Slightly higher token usage per iteration (richer context)
|
|
335
|
+
- ⚠️ Some tool call overhead
|
|
336
|
+
- ⚠️ No exploratory actions in MVP
|
|
337
|
+
|
|
338
|
+
---
|
|
339
|
+
|
|
340
|
+
## Estimated Timeline
|
|
341
|
+
|
|
342
|
+
**MVP Implementation**: 2-3 weeks
|
|
343
|
+
- Week 1: Foundation (types, tool registry, tool implementations)
|
|
344
|
+
- Week 2: Orchestrator loop, integration
|
|
345
|
+
- Week 3: Testing, refinement, metrics
|
|
346
|
+
|
|
347
|
+
**Phase 2 (Exploration)**: 1-2 weeks after MVP validated
|
|
348
|
+
|
|
349
|
+
**Total**: 3-5 weeks for complete solution
|
|
350
|
+
|
|
351
|
+
---
|
|
352
|
+
|
|
353
|
+
## Final Architecture Confidence
|
|
354
|
+
|
|
355
|
+
**✅ Ready to implement** with:
|
|
356
|
+
- All major decisions finalized
|
|
357
|
+
- Trade-offs understood and accepted
|
|
358
|
+
- Risks identified with mitigations
|
|
359
|
+
- Phased approach reduces implementation risk
|
|
360
|
+
- Backward compatibility ensured
|
|
361
|
+
- Comprehensive documentation complete
|
|
362
|
+
|
|
363
|
+
**Key strengths**:
|
|
364
|
+
- Human-like (memory, learning, reflection, recovery)
|
|
365
|
+
- Extensible (tool registry, dynamic prompts)
|
|
366
|
+
- Safe (system guardrails, agent self-awareness)
|
|
367
|
+
- Transparent (comprehensive logging)
|
|
368
|
+
- Cost-aware (token tracking)
|
|
369
|
+
- Practical (recovery tools, self-unsticking)
|
|
370
|
+
|
|
371
|
+
**No blockers to proceed.**
|
|
372
|
+
|
|
@@ -0,0 +1,201 @@
|
|
|
1
|
+
# Script Cleanup Feature
|
|
2
|
+
|
|
3
|
+
## Summary
|
|
4
|
+
Added a final cleanup step in the script generation pipeline that uses an LLM to make minor adjustments to the generated test script, removing redundancies and improving code quality without changing the core logic.
|
|
5
|
+
|
|
6
|
+
## Purpose
|
|
7
|
+
After the orchestrator generates test steps, there may be minor redundancies or formatting issues:
|
|
8
|
+
- Duplicate expect() assertions
|
|
9
|
+
- Redundant waits or checks
|
|
10
|
+
- Inconsistent formatting
|
|
11
|
+
- Orphaned step comments without code
|
|
12
|
+
|
|
13
|
+
The cleanup step acts as a final sanity check to polish the generated script while preserving its core functionality.
|
|
14
|
+
|
|
15
|
+
## Implementation
|
|
16
|
+
|
|
17
|
+
### 1. New Prompt (`prompts.ts`)
|
|
18
|
+
|
|
19
|
+
**SCRIPT_CLEANUP** prompt with clear guidelines:
|
|
20
|
+
- **DO:** Remove duplicates, fix formatting, consolidate identical assertions
|
|
21
|
+
- **DO NOT:** Change test logic, remove legitimate assertions, restructure code, change selectors, add new functionality
|
|
22
|
+
|
|
23
|
+
**Examples in prompt:**
|
|
24
|
+
```typescript
|
|
25
|
+
// ❌ REMOVE redundancy:
|
|
26
|
+
await expect(page.getByText('Hello')).toBeVisible();
|
|
27
|
+
await expect(page.getByText('Hello')).toBeVisible(); // duplicate
|
|
28
|
+
|
|
29
|
+
// ✅ KEEP legitimate checks:
|
|
30
|
+
await expect(page.getByPlaceholder('Message...')).toBeEmpty();
|
|
31
|
+
await page.getByPlaceholder('Message...').fill('Hello');
|
|
32
|
+
await expect(page.getByPlaceholder('Message...')).toHaveValue('Hello'); // different checks
|
|
33
|
+
```
|
|
34
|
+
|
|
35
|
+
### 2. New Method in LLMFacade (`llm-facade.ts`)
|
|
36
|
+
|
|
37
|
+
```typescript
|
|
38
|
+
async cleanupScript(script: string, model?: string): Promise<{
|
|
39
|
+
script: string;
|
|
40
|
+
changes: string[];
|
|
41
|
+
skipped?: string;
|
|
42
|
+
}>
|
|
43
|
+
```
|
|
44
|
+
|
|
45
|
+
**Behavior:**
|
|
46
|
+
- Calls LLM with SCRIPT_CLEANUP prompt
|
|
47
|
+
- Parses JSON response with cleaned script and list of changes
|
|
48
|
+
- Returns original script on error (safe fallback)
|
|
49
|
+
- Logs all changes made for transparency
|
|
50
|
+
|
|
51
|
+
**Error Handling:**
|
|
52
|
+
- Invalid JSON → return original script
|
|
53
|
+
- Missing fields → return original script
|
|
54
|
+
- LLM error → return original script
|
|
55
|
+
- Never fails the generation process
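The fallback rules above could look roughly like the following parser. It is a sketch rather than the actual `cleanupScript` body: `parseCleanupResponse` is a hypothetical helper, and using the `skipped` field to carry the fallback reason is an assumption about how that field is meant to be used.

```typescript
interface CleanupResult {
  script: string;
  changes: string[];
  skipped?: string;
}

function parseCleanupResponse(raw: string, originalScript: string): CleanupResult {
  // Every failure path returns the untouched original script.
  const fallback = (reason: string): CleanupResult => ({
    script: originalScript,
    changes: [],
    skipped: reason, // assumption: reason is reported through `skipped`
  });

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (_err) {
    return fallback("invalid JSON");
  }
  if (typeof parsed !== "object" || parsed === null) return fallback("missing fields");

  const { script, changes } = parsed as { script?: unknown; changes?: unknown };
  if (typeof script !== "string" || !Array.isArray(changes)) return fallback("missing fields");
  if (!changes.every((c) => typeof c === "string")) return fallback("missing fields");

  return { script, changes: changes as string[] };
}
```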
|
|
56
|
+
|
|
57
|
+
### 3. Integration into Scenario Worker (`scenario-worker-class.ts`)
|
|
58
|
+
|
|
59
|
+
Added cleanup step immediately after `generateTestScript()`:
|
|
60
|
+
|
|
61
|
+
```typescript
|
|
62
|
+
// Generate clean script with TestChimp comment and code
|
|
63
|
+
generatedScript = generateTestScript(testName, steps, undefined, hashtags);
|
|
64
|
+
|
|
65
|
+
// Perform final cleanup pass to remove redundancies and make minor adjustments
|
|
66
|
+
this.log(`[ScenarioWorker] Performing final script cleanup...`);
|
|
67
|
+
try {
|
|
68
|
+
const cleanupResult = await this.llmFacade.cleanupScript(generatedScript, job.model);
|
|
69
|
+
|
|
70
|
+
if (cleanupResult.changes && cleanupResult.changes.length > 0) {
|
|
71
|
+
this.log(`[ScenarioWorker] Cleanup made ${cleanupResult.changes.length} improvement(s):`);
|
|
72
|
+
cleanupResult.changes.forEach((change, i) => {
|
|
73
|
+
this.log(`[ScenarioWorker] ${i + 1}. ${change}`);
|
|
74
|
+
});
|
|
75
|
+
generatedScript = cleanupResult.script;
|
|
76
|
+
} else if (cleanupResult.skipped) {
|
|
77
|
+
this.log(`[ScenarioWorker] Cleanup skipped: ${cleanupResult.skipped}`);
|
|
78
|
+
} else {
|
|
79
|
+
this.log(`[ScenarioWorker] Cleanup completed - no changes needed`);
|
|
80
|
+
}
|
|
81
|
+
} catch (error: any) {
|
|
82
|
+
this.log(`[ScenarioWorker] Cleanup failed, using original script: ${error.message}`);
|
|
83
|
+
// Continue with original script on error
|
|
84
|
+
}
|
|
85
|
+
```
|
|
86
|
+
|
|
87
|
+
## What Gets Cleaned Up
|
|
88
|
+
|
|
89
|
+
### ✅ Redundancies Removed
|
|
90
|
+
1. **Duplicate assertions:**
|
|
91
|
+
```typescript
|
|
92
|
+
// Before cleanup
|
|
93
|
+
await expect(page.getByText('Hello')).toBeVisible();
|
|
94
|
+
await expect(page.getByText('Hello')).toBeVisible();
|
|
95
|
+
|
|
96
|
+
// After cleanup
|
|
97
|
+
await expect(page.getByText('Hello')).toBeVisible();
|
|
98
|
+
```
|
|
99
|
+
|
|
100
|
+
2. **Redundant URL checks:**
|
|
101
|
+
```typescript
|
|
102
|
+
// Before cleanup
|
|
103
|
+
await expect(page).toHaveURL(/\/messages/);
|
|
104
|
+
await expect(page).toHaveURL(/\/messages/);
|
|
105
|
+
|
|
106
|
+
// After cleanup
|
|
107
|
+
await expect(page).toHaveURL(/\/messages/);
|
|
108
|
+
```
|
|
109
|
+
|
|
110
|
+
3. **Duplicate comments without code** (already handled by script generation, but this is a safety net)
|
|
111
|
+
|
|
112
|
+
### ✅ Minor Formatting Fixes
|
|
113
|
+
- Inconsistent spacing
|
|
114
|
+
- Alignment issues
|
|
115
|
+
- Obvious formatting problems
|
|
116
|
+
|
|
117
|
+
### ❌ Preserved (Not Changed)
|
|
118
|
+
- Test logic and flow
|
|
119
|
+
- Legitimate assertions (same locator, different expectations)
|
|
120
|
+
- Important waits
|
|
121
|
+
- Selectors
|
|
122
|
+
- Test structure
|
|
123
|
+
- Any functionality
|
|
124
|
+
|
|
125
|
+
## Safety Features
|
|
126
|
+
|
|
127
|
+
### 1. Conservative Approach
|
|
128
|
+
- Only makes changes when confident they're safe
|
|
129
|
+
- Prompt explicitly warns against major changes
|
|
130
|
+
- Focuses on "obvious" redundancies only
|
|
131
|
+
|
|
132
|
+
### 2. Transparency
|
|
133
|
+
- Logs all changes made with descriptions
|
|
134
|
+
- Makes it easy to see what was modified
|
|
135
|
+
- Helps debug if cleanup causes issues
|
|
136
|
+
|
|
137
|
+
### 3. Graceful Degradation
|
|
138
|
+
- Any error → return original script
|
|
139
|
+
- Invalid response → return original script
|
|
140
|
+
- Never breaks the generation pipeline
|
|
141
|
+
- Cleanup is an enhancement, not a requirement
|
|
142
|
+
|
|
143
|
+
### 4. Idempotent
|
|
144
|
+
- Running cleanup twice should produce the same result
|
|
145
|
+
- No cumulative changes or drift
|
|
146
|
+
|
|
147
|
+
## Example Output
|
|
148
|
+
|
|
149
|
+
**Console logs:**
|
|
150
|
+
```
|
|
151
|
+
[ScenarioWorker] Performing final script cleanup...
|
|
152
|
+
[LLMFacade] Script cleanup completed. Changes: 2
|
|
153
|
+
[LLMFacade] 1. Removed duplicate expect assertion for message visibility
|
|
154
|
+
[LLMFacade] 2. Consolidated redundant URL checks into single assertion
|
|
155
|
+
[ScenarioWorker] Cleanup made 2 improvement(s):
|
|
156
|
+
[ScenarioWorker] 1. Removed duplicate expect assertion for message visibility
|
|
157
|
+
[ScenarioWorker] 2. Consolidated redundant URL checks into single assertion
|
|
158
|
+
```
|
|
159
|
+
|
|
160
|
+
## Benefits
|
|
161
|
+
|
|
162
|
+
1. **Cleaner Scripts:** Removes redundancies that can make tests harder to read
|
|
163
|
+
2. **Reduced Token Usage:** Shorter scripts mean fewer tokens consumed by users
|
|
164
|
+
3. **Better Maintainability:** Clean code is easier to understand and modify
|
|
165
|
+
4. **Safety Net:** Catches issues that might slip through orchestrator logic
|
|
166
|
+
5. **Low Risk:** Falls back to the original script on any error
|
|
167
|
+
|
|
168
|
+
## Performance Impact
|
|
169
|
+
|
|
170
|
+
- **Time:** Adds one LLM call at the end (~1-3 seconds)
|
|
171
|
+
- **Cost:** One additional LLM call per script generation
|
|
172
|
+
- **Benefit:** Catches redundancies that would otherwise be in production tests
|
|
173
|
+
|
|
174
|
+
The small overhead is worthwhile for the quality improvement.
|
|
175
|
+
|
|
176
|
+
## Future Enhancements
|
|
177
|
+
|
|
178
|
+
Possible improvements:
|
|
179
|
+
1. **Configurable:** Allow users to disable cleanup if they prefer
|
|
180
|
+
2. **More Rules:** Add more specific cleanup patterns
|
|
181
|
+
3. **Static Analysis:** Use AST parsing instead of LLM for some checks (faster, cheaper)
|
|
182
|
+
4. **Metrics:** Track how often cleanup makes changes vs. no-ops
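To make the static-analysis idea in item 3 concrete, the sketch below removes only exactly repeated, adjacent `await expect(...)` lines without calling an LLM. It uses plain line comparison rather than AST parsing, and `removeConsecutiveDuplicateAssertions` is a hypothetical helper that is not part of the package.

```typescript
function removeConsecutiveDuplicateAssertions(script: string): { script: string; removed: number } {
  const kept: string[] = [];
  let removed = 0;
  let previous: string | null = null;

  for (const line of script.split("\n")) {
    const trimmed = line.trim();
    const isAssertion = trimmed.startsWith("await expect(");
    // Drop an assertion only when it exactly repeats the line directly before it.
    if (isAssertion && trimmed === previous) {
      removed += 1;
      continue;
    }
    kept.push(line);
    previous = trimmed;
  }
  return { script: kept.join("\n"), removed };
}
```

Because identical assertions separated by an action (for example a `fill` between two checks) are never adjacent, they are preserved, and a second pass over the output removes nothing.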
|
|
183
|
+
|
|
184
|
+
## Files Modified
|
|
185
|
+
|
|
186
|
+
1. `/src/prompts.ts` - Added SCRIPT_CLEANUP prompt
|
|
187
|
+
2. `/src/llm-facade.ts` - Added cleanupScript() method
|
|
188
|
+
3. `/src/scenario-worker-class.ts` - Integrated cleanup into generation pipeline
|
|
189
|
+
|
|
190
|
+
## Testing
|
|
191
|
+
|
|
192
|
+
The feature is safe to deploy because:
|
|
193
|
+
- Falls back to original script on any error
|
|
194
|
+
- Doesn't break existing functionality
|
|
195
|
+
- Only makes conservative changes
|
|
196
|
+
- Logs all modifications for review
|
|
197
|
+
|
|
198
|
+
## Conclusion
|
|
199
|
+
|
|
200
|
+
The script cleanup feature adds a lightweight final polish step to the generation pipeline, removing redundancies and improving code quality without risk to the core test logic. It's a safety net that catches issues the orchestrator might miss while maintaining backward compatibility and graceful error handling.
|
|
201
|
+
|