testchimp-runner-core 0.0.21 → 0.0.23

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (146)
  1. package/VISION_DIAGNOSTICS_IMPROVEMENTS.md +336 -0
  2. package/dist/credit-usage-service.d.ts +9 -0
  3. package/dist/credit-usage-service.d.ts.map +1 -1
  4. package/dist/credit-usage-service.js +20 -5
  5. package/dist/credit-usage-service.js.map +1 -1
  6. package/dist/execution-service.d.ts +7 -2
  7. package/dist/execution-service.d.ts.map +1 -1
  8. package/dist/execution-service.js +91 -36
  9. package/dist/execution-service.js.map +1 -1
  10. package/dist/index.d.ts +30 -2
  11. package/dist/index.d.ts.map +1 -1
  12. package/dist/index.js +91 -26
  13. package/dist/index.js.map +1 -1
  14. package/dist/llm-facade.d.ts +64 -8
  15. package/dist/llm-facade.d.ts.map +1 -1
  16. package/dist/llm-facade.js +361 -109
  17. package/dist/llm-facade.js.map +1 -1
  18. package/dist/llm-provider.d.ts +39 -0
  19. package/dist/llm-provider.d.ts.map +1 -0
  20. package/dist/llm-provider.js +7 -0
  21. package/dist/llm-provider.js.map +1 -0
  22. package/dist/model-constants.d.ts +21 -0
  23. package/dist/model-constants.d.ts.map +1 -0
  24. package/dist/model-constants.js +24 -0
  25. package/dist/model-constants.js.map +1 -0
  26. package/dist/orchestrator/index.d.ts +8 -0
  27. package/dist/orchestrator/index.d.ts.map +1 -0
  28. package/dist/orchestrator/index.js +23 -0
  29. package/dist/orchestrator/index.js.map +1 -0
  30. package/dist/orchestrator/orchestrator-agent.d.ts +66 -0
  31. package/dist/orchestrator/orchestrator-agent.d.ts.map +1 -0
  32. package/dist/orchestrator/orchestrator-agent.js +855 -0
  33. package/dist/orchestrator/orchestrator-agent.js.map +1 -0
  34. package/dist/orchestrator/tool-registry.d.ts +74 -0
  35. package/dist/orchestrator/tool-registry.d.ts.map +1 -0
  36. package/dist/orchestrator/tool-registry.js +131 -0
  37. package/dist/orchestrator/tool-registry.js.map +1 -0
  38. package/dist/orchestrator/tools/check-page-ready.d.ts +13 -0
  39. package/dist/orchestrator/tools/check-page-ready.d.ts.map +1 -0
  40. package/dist/orchestrator/tools/check-page-ready.js +72 -0
  41. package/dist/orchestrator/tools/check-page-ready.js.map +1 -0
  42. package/dist/orchestrator/tools/extract-data.d.ts +13 -0
  43. package/dist/orchestrator/tools/extract-data.d.ts.map +1 -0
  44. package/dist/orchestrator/tools/extract-data.js +84 -0
  45. package/dist/orchestrator/tools/extract-data.js.map +1 -0
  46. package/dist/orchestrator/tools/index.d.ts +10 -0
  47. package/dist/orchestrator/tools/index.d.ts.map +1 -0
  48. package/dist/orchestrator/tools/index.js +18 -0
  49. package/dist/orchestrator/tools/index.js.map +1 -0
  50. package/dist/orchestrator/tools/inspect-page.d.ts +13 -0
  51. package/dist/orchestrator/tools/inspect-page.d.ts.map +1 -0
  52. package/dist/orchestrator/tools/inspect-page.js +39 -0
  53. package/dist/orchestrator/tools/inspect-page.js.map +1 -0
  54. package/dist/orchestrator/tools/recall-history.d.ts +13 -0
  55. package/dist/orchestrator/tools/recall-history.d.ts.map +1 -0
  56. package/dist/orchestrator/tools/recall-history.js +64 -0
  57. package/dist/orchestrator/tools/recall-history.js.map +1 -0
  58. package/dist/orchestrator/tools/take-screenshot.d.ts +15 -0
  59. package/dist/orchestrator/tools/take-screenshot.d.ts.map +1 -0
  60. package/dist/orchestrator/tools/take-screenshot.js +112 -0
  61. package/dist/orchestrator/tools/take-screenshot.js.map +1 -0
  62. package/dist/orchestrator/types.d.ts +133 -0
  63. package/dist/orchestrator/types.d.ts.map +1 -0
  64. package/dist/orchestrator/types.js +28 -0
  65. package/dist/orchestrator/types.js.map +1 -0
  66. package/dist/playwright-mcp-service.d.ts +9 -0
  67. package/dist/playwright-mcp-service.d.ts.map +1 -1
  68. package/dist/playwright-mcp-service.js +20 -5
  69. package/dist/playwright-mcp-service.js.map +1 -1
  70. package/dist/progress-reporter.d.ts +97 -0
  71. package/dist/progress-reporter.d.ts.map +1 -0
  72. package/dist/progress-reporter.js +18 -0
  73. package/dist/progress-reporter.js.map +1 -0
  74. package/dist/prompts.d.ts +24 -0
  75. package/dist/prompts.d.ts.map +1 -1
  76. package/dist/prompts.js +593 -68
  77. package/dist/prompts.js.map +1 -1
  78. package/dist/providers/backend-proxy-llm-provider.d.ts +25 -0
  79. package/dist/providers/backend-proxy-llm-provider.d.ts.map +1 -0
  80. package/dist/providers/backend-proxy-llm-provider.js +76 -0
  81. package/dist/providers/backend-proxy-llm-provider.js.map +1 -0
  82. package/dist/providers/local-llm-provider.d.ts +21 -0
  83. package/dist/providers/local-llm-provider.d.ts.map +1 -0
  84. package/dist/providers/local-llm-provider.js +35 -0
  85. package/dist/providers/local-llm-provider.js.map +1 -0
  86. package/dist/scenario-service.d.ts +27 -1
  87. package/dist/scenario-service.d.ts.map +1 -1
  88. package/dist/scenario-service.js +48 -12
  89. package/dist/scenario-service.js.map +1 -1
  90. package/dist/scenario-worker-class.d.ts +39 -2
  91. package/dist/scenario-worker-class.d.ts.map +1 -1
  92. package/dist/scenario-worker-class.js +614 -86
  93. package/dist/scenario-worker-class.js.map +1 -1
  94. package/dist/script-utils.d.ts +2 -0
  95. package/dist/script-utils.d.ts.map +1 -1
  96. package/dist/script-utils.js +44 -4
  97. package/dist/script-utils.js.map +1 -1
  98. package/dist/types.d.ts +11 -0
  99. package/dist/types.d.ts.map +1 -1
  100. package/dist/types.js.map +1 -1
  101. package/dist/utils/browser-utils.d.ts +20 -1
  102. package/dist/utils/browser-utils.d.ts.map +1 -1
  103. package/dist/utils/browser-utils.js +102 -51
  104. package/dist/utils/browser-utils.js.map +1 -1
  105. package/dist/utils/page-info-utils.d.ts +23 -4
  106. package/dist/utils/page-info-utils.d.ts.map +1 -1
  107. package/dist/utils/page-info-utils.js +174 -43
  108. package/dist/utils/page-info-utils.js.map +1 -1
  109. package/package.json +1 -2
  110. package/plandocs/HUMAN_LIKE_IMPROVEMENTS.md +642 -0
  111. package/plandocs/MULTI_AGENT_ARCHITECTURE_REVIEW.md +844 -0
  112. package/plandocs/ORCHESTRATOR_MVP_SUMMARY.md +539 -0
  113. package/plandocs/PHASE1_ABSTRACTION_COMPLETE.md +241 -0
  114. package/plandocs/PHASE1_FINAL_STATUS.md +210 -0
  115. package/plandocs/PLANNING_SESSION_SUMMARY.md +372 -0
  116. package/plandocs/SCRIPT_CLEANUP_FEATURE.md +201 -0
  117. package/plandocs/SCRIPT_GENERATION_ARCHITECTURE.md +364 -0
  118. package/plandocs/SELECTOR_IMPROVEMENTS.md +139 -0
  119. package/src/credit-usage-service.ts +23 -5
  120. package/src/execution-service.ts +152 -42
  121. package/src/index.ts +169 -26
  122. package/src/llm-facade.ts +500 -126
  123. package/src/llm-provider.ts +43 -0
  124. package/src/model-constants.ts +23 -0
  125. package/src/orchestrator/index.ts +33 -0
  126. package/src/orchestrator/orchestrator-agent.ts +1037 -0
  127. package/src/orchestrator/tool-registry.ts +182 -0
  128. package/src/orchestrator/tools/check-page-ready.ts +75 -0
  129. package/src/orchestrator/tools/extract-data.ts +92 -0
  130. package/src/orchestrator/tools/index.ts +11 -0
  131. package/src/orchestrator/tools/inspect-page.ts +42 -0
  132. package/src/orchestrator/tools/recall-history.ts +72 -0
  133. package/src/orchestrator/tools/take-screenshot.ts +128 -0
  134. package/src/orchestrator/types.ts +200 -0
  135. package/src/playwright-mcp-service.ts +23 -5
  136. package/src/progress-reporter.ts +109 -0
  137. package/src/prompts.ts +606 -69
  138. package/src/providers/backend-proxy-llm-provider.ts +91 -0
  139. package/src/providers/local-llm-provider.ts +38 -0
  140. package/src/scenario-service.ts +83 -13
  141. package/src/scenario-worker-class.ts +740 -72
  142. package/src/script-utils.ts +50 -5
  143. package/src/types.ts +13 -1
  144. package/src/utils/browser-utils.ts +123 -51
  145. package/src/utils/page-info-utils.ts +210 -53
  146. package/testchimp-runner-core-0.0.22.tgz +0 -0
@@ -0,0 +1,372 @@
1
+ # Planning Session Summary: Orchestrator Agent Architecture
2
+
3
+ ## Date: October 11, 2025
4
+
5
+ ---
6
+
7
+ ## Final Decisions Made
8
+
9
+ ### 1. ✅ Self-Reflection in MVP
10
+ **Decision**: Include free-form self-reflection with agent-driven loop detection
11
+ - Agent outputs `guidanceForNext` (free-form text) for train of thought continuity
12
+ - Agent signals `detectingLoop: true` when it notices repetition
13
+ - Agent decides when to break its own loop; the system enforces hard limits as a backup
14
+
15
+ **Rationale**: Valuable for maintaining context across iterations; lets the agent self-correct
16
+
17
+ ### 2. ✅ No Screenshot Budget
18
+ **Decision**: Screenshots available freely, no artificial limits
19
+ - Corrected token cost: 1-2K tokens (NOT 100K!)
20
+ - For 1920x1080 viewport: ~1,452 tokens (gpt-4.1-mini)
21
+ - Comparable to extra DOM context
22
+
23
+ **Rationale**: Very affordable, enables liberal vision use throughout journey
24
+
25
+ ### 3. ✅ DOM Limits (Increased for Complex Pages)
26
+ **Decision**: Increased limits in getEnhancedPageInfo to handle complex UIs
27
+ - ARIA tree depth: 4 levels
28
+ - Interactive elements: top 50 (was 12)
29
+ - IDs: top 50 (was 10)
30
+ - Data attributes: top 50 (was 10)
31
+ - Form fields: top 20 (was 8)
32
+ - Page structure: top 10 (was 6)
33
+ - General elements: top 50 (was 15)
34
+ - Text: 30 chars max
35
+ - Result: ~800-1,500 tokens
36
+
37
+ **Rationale**: Complex pages need more context, still compact with truncation
38
+
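The limits above amount to "cap each list, truncate each text". A minimal sketch of that idea follows; the `ElementSummary` shape and both helper names are hypothetical, since the real `getEnhancedPageInfo` output is not shown in this document.

```typescript
// Hypothetical element shape; the actual getEnhancedPageInfo structure may differ.
interface ElementSummary {
  selector: string;
  text: string;
}

const MAX_TEXT_CHARS = 30;           // "Text: 30 chars max"
const MAX_INTERACTIVE_ELEMENTS = 50; // "Interactive elements: top 50"

// Cut text down to at most maxChars characters.
function truncateText(text: string, maxChars: number = MAX_TEXT_CHARS): string {
  return text.length <= maxChars ? text : text.slice(0, maxChars);
}

// Keep only the first `limit` elements and truncate their text.
function capElements(
  elements: ElementSummary[],
  limit: number = MAX_INTERACTIVE_ELEMENTS
): ElementSummary[] {
  return elements.slice(0, limit).map((el) => ({
    selector: el.selector,
    text: truncateText(el.text),
  }));
}
```

Capping before serialization is what keeps the DOM portion of the context bounded even when the page has hundreds of candidate elements.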
39
+ ### 4. ✅ Token Usage Tracking
40
+ **Decision**: Track and report all LLM token usage via callback
41
+ - Interface: `onTokensUsed({inputTokens, outputTokens, includesImage})`
42
+ - Heuristic: 4 characters = 1 token
43
+ - Image tokens: ~1,500 estimate for viewport screenshots
44
+ - Reported via ProgressReporter for analytics
45
+
46
+ **Rationale**: Cost tracking, optimization, analytics
47
+
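The tracking heuristic can be sketched as below. The callback payload fields come from the interface above; the function names, the use of `Math.ceil`, and counting the image estimate as input tokens are assumptions made for illustration.

```typescript
// Payload fields as listed for onTokensUsed above.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  includesImage: boolean;
}

const CHARS_PER_TOKEN = 4;         // "4 characters = 1 token"
const IMAGE_TOKEN_ESTIMATE = 1500; // "~1,500 estimate for viewport screenshots"

// Rounding up is an assumption; the document only states the ratio.
function estimateTextTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Assumes a screenshot, when present, is billed on the input side.
function estimateUsage(prompt: string, response: string, includesImage: boolean): TokenUsage {
  const imageTokens = includesImage ? IMAGE_TOKEN_ESTIMATE : 0;
  return {
    inputTokens: estimateTextTokens(prompt) + imageTokens,
    outputTokens: estimateTextTokens(response),
    includesImage,
  };
}

// Forward the estimate to an optional callback (hypothetical wiring).
function reportUsage(usage: TokenUsage, onTokensUsed?: (usage: TokenUsage) => void): void {
  onTokensUsed?.(usage);
}
```

Because the numbers are character-based estimates, they are suited to trend analytics rather than exact billing reconciliation.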
48
+ ### 5. ✅ Recovery Tools in MVP
49
+ **Decision**: Include 3 recovery tools for self-unsticking
50
+ - `navigate_back()` - Go back in history
51
+ - `refresh_page()` - Reload page
52
+ - `navigate_to_url({url})` - Navigate to specific URL (with domain validation)
53
+
54
+ **Rationale**: Agent needs ability to recover from bad states (wrong navigation, stuck page, side effects)
55
+
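The document does not say how `navigate_to_url` validates domains. One plausible sketch, assuming the rule is "same hostname as the current page, http(s) only", is shown here; `isSameDomain` is a hypothetical helper name.

```typescript
// Assumed rule: the target must be an http(s) URL on the same hostname as the
// current page. The package's real validation rule is not shown in this document.
function isSameDomain(targetUrl: string, currentUrl: string): boolean {
  try {
    const target = new URL(targetUrl);
    const current = new URL(currentUrl);
    if (target.protocol !== 'http:' && target.protocol !== 'https:') {
      return false; // rejects javascript:, data:, file:, etc.
    }
    return target.hostname === current.hostname;
  } catch {
    return false; // unparseable URL: refuse to navigate
  }
}
```

A stricter or looser policy (for example, allowing subdomains) would only change the final comparison.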
56
+ ### 6. ✅ Inquisitive Exploration in Phase 2
57
+ **Decision**: Defer exploratory actions to Phase 2, MVP uses workarounds
58
+ - Phase 2 tool: `explore_element({action, selector, purpose})`
59
+ - Actions: hover, click_info, click_menu, focus
60
+ - Safety: State validation, non-consequential only
61
+ - **Screenshot handling**: Immediate analysis via sub-agent call
62
+ - System takes screenshot after exploration
63
+ - Calls agent to analyze screenshot
64
+ - Agent extracts learnings (text)
65
+ - Only TEXT stored in history, NOT screenshot
66
+ - Keeps memory lightweight
67
+ - MVP workaround: Use screenshot + DOM analysis + retry
68
+
69
+ **Rationale**: Safety concerns, complexity, needs battle-testing first
70
+
71
+ ### 7. ✅ Always-Provided Context Structure
72
+ **Decision**: Provide comprehensive context automatically each iteration
73
+ - Overall goal + current goal
74
+ - Current page info (DOM)
75
+ - Recent 6-7 steps
76
+ - Experiences (learnings)
77
+ - Extracted data
78
+ - Self-reflection from previous iteration
79
+ - Journey progress tracking
80
+
81
+ **Rationale**: Agent needs full situational awareness without repeated tool calls
82
+
83
+ ### 8. ✅ System vs Agent Guardrails
84
+ **Decision**: Clear separation of responsibilities
85
+ - **System enforces**: Iteration limits, tool call limits, command limits
86
+ - **Agent signals**: Stuck, infeasible, detecting loop
87
+ - System has final say, agent provides soft guidance
88
+
89
+ **Rationale**: Safety (hard limits) + intelligence (agent self-awareness)
90
+
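The split between hard limits and agent signals might look roughly like the following. The status values are taken from the agent decision output described in this document; `GuardrailLimits`, `LoopCounters`, and `shouldStop` are hypothetical names, and the order of checks is an assumption.

```typescript
type AgentStatus = 'complete' | 'stuck' | 'infeasible' | 'continue';

// Hypothetical limit and counter shapes for illustration.
interface GuardrailLimits {
  maxIterations: number;
  maxToolCalls: number;
}

interface LoopCounters {
  iterations: number;
  toolCalls: number;
}

type StopDecision = { stop: false } | { stop: true; reason: string };

function shouldStop(
  status: AgentStatus,
  counters: LoopCounters,
  limits: GuardrailLimits
): StopDecision {
  // Hard limits are checked first: the system has the final say.
  if (counters.iterations >= limits.maxIterations) {
    return { stop: true, reason: 'iteration limit reached' };
  }
  if (counters.toolCalls >= limits.maxToolCalls) {
    return { stop: true, reason: 'tool call limit reached' };
  }
  // Agent signals are honored only while within the limits.
  if (status !== 'continue') {
    return { stop: true, reason: `agent reported ${status}` };
  }
  return { stop: false };
}
```

In this sketch `detectingLoop` is deliberately not an input: it stays advisory, fed back to the agent as context, while the counters provide the backstop.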
91
+ ---
92
+
93
+ ## Architecture Summary
94
+
95
+ ### Core Components
96
+
97
+ **1. Always-Provided Context** (auto-fetched each iteration)
98
+ ```typescript
99
+ {
100
+ overallGoal, currentStepGoal, stepNumber, totalSteps,
101
+ currentPageInfo, currentURL,
102
+ recentSteps (6-7), experiences, extractedData,
103
+ previousIterationGuidance
104
+ }
105
+ ```
106
+
107
+ **2. Tools** (8 in MVP)
108
+ - **Information**: take_screenshot, recall_history, inspect_page, check_page_ready
109
+ - **Data**: extract_data
110
+ - **Recovery**: navigate_back, refresh_page, navigate_to_url
111
+
112
+ **3. Agent Decision Output**
113
+ ```typescript
114
+ {
115
+ toolCalls, toolReasoning, needsToolResults,
116
+ commands, commandReasoning,
117
+ selfReflection: {guidanceForNext, detectingLoop, loopReasoning},
118
+ experiences, memoryUpdate,
119
+ status: 'complete' | 'stuck' | 'infeasible' | 'continue',
120
+ statusReasoning, reasoning
121
+ }
122
+ ```
123
+
124
+ **4. Sequential Batch Execution**
125
+ - Agent plans batch of commands (max 3-5)
126
+ - System executes one-by-one
127
+ - Stop at first failure
128
+ - Record each individually in history
129
+
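The sequential batch rule above (run one by one, stop at the first failure, record each command) can be sketched as follows. `CommandRecord`, `CommandExecutor`, and `executeBatch` are hypothetical names; the executor is injected so the sketch does not assume how commands are actually run against the browser.

```typescript
// Hypothetical history entry; the real journey memory shape may differ.
interface CommandRecord {
  command: string;
  success: boolean;
  error?: string;
}

// Injected executor: in practice this would run the command against the page.
type CommandExecutor = (command: string) => Promise<void>;

async function executeBatch(
  commands: string[],
  execute: CommandExecutor,
  history: CommandRecord[],
  maxBatchSize: number = 5 // upper end of "max 3-5"
): Promise<boolean> {
  for (const command of commands.slice(0, maxBatchSize)) {
    try {
      await execute(command);
      history.push({ command, success: true });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      history.push({ command, success: false, error: message });
      return false; // stop at first failure; later commands are not attempted
    }
  }
  return true;
}
```

Recording the failing command along with its error message gives the next iteration enough context to re-plan instead of repeating the same batch.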
130
+ **5. Comprehensive Logging**
131
+ - Every iteration: goal, reasoning, self-reflection, tools, commands, experiences, status
132
+ - All thoughts visible for debugging
133
+ - Exported via ProgressReporter
134
+
135
+ **6. Token Usage Tracking**
136
+ - Input + output tokens calculated (4 chars = 1 token)
137
+ - Image tokens estimated (~1,500 for viewport)
138
+ - Reported via `onTokensUsed()` callback
139
+
140
+ ---
141
+
142
+ ## MVP vs Phase 2
143
+
144
+ ### MVP Includes:
145
+ - ✅ 8 core tools (info + data + recovery)
146
+ - ✅ Journey memory with experiences
147
+ - ✅ Self-reflection + loop detection
148
+ - ✅ Batch command planning
149
+ - ✅ Self-recovery (navigate back/refresh)
150
+ - ✅ Token tracking
151
+ - ✅ Comprehensive logging
152
+ - ✅ Configurable guardrails
153
+
154
+ ### Phase 2 Adds:
155
+ - Inquisitive exploration (explore_element)
156
+ - Advanced optimizations (caching, adaptive limits)
157
+ - Memory summarization for long journeys
158
+
159
+ ---
160
+
161
+ ## Key Metrics
162
+
163
+ ### Token Usage Per Iteration (Estimated)
164
+ ```
165
+ System prompt: 500 tokens
166
+ Always-provided context: 1,200-2,000 tokens
167
+ - Goals & progress: 100
168
+ - DOM (increased limits): 800-1,500
169
+ - Recent 6-7 steps: 300-500
170
+ - Experiences: 100-200
171
+ Self-reflection: 100 tokens
172
+ Tool results (optional): 300-500 tokens
173
+ Screenshot (optional): 1,500 tokens
174
+
175
+ Total without screenshot: 2,400-3,600 tokens
176
+ Total with screenshot: 3,900-5,100 tokens
177
+ ```
178
+
179
+ ### Expected Performance
180
+ - LLM calls/step: 2-4 (vs 4-6 current)
181
+ - Iterations/step: 3-5 (vs 8-12 current)
182
+ - Tool calls/step: 1-3
183
+ - Commands/iteration: 2-3 (batched)
184
+ - Agent learns: 1-2 experiences per step
185
+
186
+ ---
187
+
188
+ ## Inquisitive Exploration Design (Phase 2)
189
+
190
+ ### Problem
191
+ Menu items are icon-only, no text/ARIA labels → Agent unsure which to click
192
+
193
+ ### Solution
194
+ Agent investigates non-consequentially, analyzes immediately:
195
+
196
+ ```
197
+ Iteration N:
198
+ Agent Decision: "Need to hover over icons to see tooltips"
199
+ Tool: explore_element({action: "hover", selector: "nav button:nth-child(2)"})
200
+
201
+ System:
202
+ → Hover, wait 500ms
203
+ → Take screenshot
204
+ → Call agent (sub-call): "What do you see in this screenshot?"
205
+
206
+ Agent Analysis (sub-call):
207
+ → Sees tooltip
208
+ → Returns: "Tooltip shows 'Dashboard' - this is the Dashboard button"
209
+
210
+ System:
211
+ → Stores TEXT in history: "Explored button, tooltip confirms Dashboard"
212
+ → Does NOT store screenshot
213
+ → Returns to main agent: {success: true, learning: "Tooltip shows Dashboard"}
214
+
215
+ Agent Decision (continues):
216
+ → "Great, confirmed it's Dashboard"
217
+ → Commands: ["page.click('nav button:nth-child(2)')"]
218
+
219
+ System: Execute commands
220
+ ```
221
+
222
+ **Key difference**: Screenshot analyzed WITHIN same iteration, only text stored
223
+
224
+ ### Allowed Actions
225
+ - ✅ hover (show tooltips)
226
+ - ✅ click_info (info icons)
227
+ - ✅ click_menu (expand menus)
228
+ - ✅ focus (see input hints)
229
+
230
+ ### NOT Allowed
231
+ - ❌ Submit forms
232
+ - ❌ Delete/remove
233
+ - ❌ Logout
234
+ - ❌ File uploads
235
+
236
+ ### Safety
237
+ - State validation (URL, modal count)
238
+ - Revert if unexpected navigation
239
+ - Budget: 10 explorations per step
240
+ - Timeout: 2s per exploration
241
+
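The safety checks listed above could be prototyped along these lines. This is a sketch under assumptions: `PageState`, `needsRevert`, and `ExplorationBudget` are hypothetical names, and treating only a URL change as grounds for a revert is one reading of "Revert if unexpected navigation".

```typescript
// Snapshot of the two signals named above: URL and modal count.
interface PageState {
  url: string;
  modalCount: number;
}

// Assumed policy: a URL change means the exploration navigated away and must be reverted.
function needsRevert(before: PageState, after: PageState): boolean {
  return before.url !== after.url;
}

// A modal count change is reported separately so the caller can decide what to do.
function modalCountChanged(before: PageState, after: PageState): boolean {
  return before.modalCount !== after.modalCount;
}

// Tracks the per-step exploration budget ("Budget: 10 explorations per step").
class ExplorationBudget {
  private used = 0;
  private readonly maxPerStep: number;

  constructor(maxPerStep: number = 10) {
    this.maxPerStep = maxPerStep;
  }

  // Returns false once the budget for the current step is exhausted.
  tryConsume(): boolean {
    if (this.used >= this.maxPerStep) return false;
    this.used += 1;
    return true;
  }

  resetForNextStep(): void {
    this.used = 0;
  }
}
```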
242
+ ### Why Phase 2
243
+ - Safety risk (needs robust validation)
244
+ - Complexity (screenshot handling, state comparison)
245
+ - MVP workaround: screenshot + DOM + retry
246
+
247
+ ---
248
+
249
+ ## Implementation Status
250
+
251
+ ### Completed (During Planning):
252
+ - ✅ Token usage tracking added to interfaces
253
+ - ✅ BackendProxyLLMProvider calculates token usage
254
+ - ✅ LLMFacade prepared for token callback
255
+ - ✅ Progress reporter extended with `onTokensUsed()`
256
+
257
+ ### Ready to Implement:
258
+ 1. OrchestratorAgent class
259
+ 2. ToolRegistry with 8 tools
260
+ 3. Journey memory implementation
261
+ 4. Always-provided context builder
262
+ 5. Self-reflection structures
263
+ 6. Recovery tools (navigate_back, refresh_page, navigate_to_url)
264
+ 7. Comprehensive logging
265
+ 8. Token tracking integration
266
+
267
+ ### Phase 2:
268
+ 1. Exploratory actions (explore_element)
269
+ 2. State validation logic
270
+ 3. Advanced optimizations
271
+
272
+ ---
273
+
274
+ ## Documentation Created
275
+
276
+ 1. **MULTI_AGENT_ARCHITECTURE_REVIEW.md**
277
+ - 8 pitfalls analyzed with mitigations
278
+ - Phased implementation strategy
279
+ - Risk analysis and trade-offs
280
+
281
+ 2. **ORCHESTRATOR_IMPLEMENTATION_PLAN.md**
282
+ - Detailed implementation specs
283
+ - Code examples
284
+ - Integration points
285
+
286
+ 3. **ORCHESTRATOR_MVP_SUMMARY.md**
287
+ - Executive summary
288
+ - Complete feature list
289
+ - Inquisitive exploration section
290
+ - MVP vs Phase 2 breakdown
291
+
292
+ 4. **PLANNING_SESSION_SUMMARY.md** (this document)
293
+ - All decisions made
294
+ - Rationales
295
+ - Implementation status
296
+
297
+ ---
298
+
299
+ ## Next Steps
300
+
301
+ ### When Ready to Implement:
302
+ 1. Review all 3 architecture documents
303
+ 2. Start with MVP (8 tools, no exploration)
304
+ 3. Implement in order:
305
+ - Tool registry + tool implementations
306
+ - Journey memory structures
307
+ - OrchestratorAgent loop
308
+ - Integration with ScenarioWorker
309
+ - Token tracking integration
310
+ - Comprehensive logging
311
+ 4. Test with real scenarios
312
+ 5. Measure metrics vs current approach
313
+ 6. Iterate based on findings
314
+ 7. Add Phase 2 features when validated
315
+
316
+ ---
317
+
318
+ ## Success Criteria (MVP)
319
+
320
+ ### Must Have:
321
+ - [ ] Fewer iterations than current (target: 50% reduction)
322
+ - [ ] Backward compatible (VS Extension & GitHub Runner work)
323
+ - [ ] No infinite loops (guardrails work)
324
+ - [ ] Memory doesn't bloat
325
+ - [ ] Tool extensibility works
326
+ - [ ] Token usage tracked accurately
327
+
328
+ ### Nice to Have:
329
+ - [ ] Fewer LLM calls than current
330
+ - [ ] Better success rate
331
+ - [ ] Faster execution
332
+
333
+ ### Acceptable Trade-offs:
334
+ - ⚠️ Slightly higher token usage per iteration (richer context)
335
+ - ⚠️ Some tool call overhead
336
+ - ⚠️ No exploratory actions in MVP
337
+
338
+ ---
339
+
340
+ ## Estimated Timeline
341
+
342
+ **MVP Implementation**: 2-3 weeks
343
+ - Week 1: Foundation (types, tool registry, tool implementations)
344
+ - Week 2: Orchestrator loop, integration
345
+ - Week 3: Testing, refinement, metrics
346
+
347
+ **Phase 2 (Exploration)**: 1-2 weeks after MVP validated
348
+
349
+ **Total**: 3-5 weeks for complete solution
350
+
351
+ ---
352
+
353
+ ## Final Architecture Confidence
354
+
355
+ **✅ Ready to implement** with:
356
+ - All major decisions finalized
357
+ - Trade-offs understood and accepted
358
+ - Risks identified with mitigations
359
+ - Phased approach reduces implementation risk
360
+ - Backward compatibility ensured
361
+ - Comprehensive documentation complete
362
+
363
+ **Key strengths**:
364
+ - Human-like (memory, learning, reflection, recovery)
365
+ - Extensible (tool registry, dynamic prompts)
366
+ - Safe (system guardrails, agent self-awareness)
367
+ - Transparent (comprehensive logging)
368
+ - Cost-aware (token tracking)
369
+ - Practical (recovery tools, self-unsticking)
370
+
371
+ **No blockers to proceed.**
372
+
@@ -0,0 +1,201 @@
1
+ # Script Cleanup Feature
2
+
3
+ ## Summary
4
+ Added a final cleanup step in the script generation pipeline that uses an LLM to make minor adjustments to the generated test script, removing redundancies and improving code quality without changing the core logic.
5
+
6
+ ## Purpose
7
+ After the orchestrator generates test steps, there may be minor redundancies or formatting issues:
8
+ - Duplicate expect() assertions
9
+ - Redundant waits or checks
10
+ - Inconsistent formatting
11
+ - Orphaned step comments without code
12
+
13
+ The cleanup step acts as a final sanity check to polish the generated script while preserving its core functionality.
14
+
15
+ ## Implementation
16
+
17
+ ### 1. New Prompt (`prompts.ts`)
18
+
19
+ **SCRIPT_CLEANUP** prompt with clear guidelines:
20
+ - **DO:** Remove duplicates, fix formatting, consolidate identical assertions
21
+ - **DO NOT:** Change test logic, remove legitimate assertions, restructure code, change selectors, add new functionality
22
+
23
+ **Examples in prompt:**
24
+ ```typescript
25
+ // ❌ REMOVE redundancy:
26
+ await expect(page.getByText('Hello')).toBeVisible();
27
+ await expect(page.getByText('Hello')).toBeVisible(); // duplicate
28
+
29
+ // ✅ KEEP legitimate checks:
30
+ await expect(page.getByPlaceholder('Message...')).toBeEmpty();
31
+ await page.getByPlaceholder('Message...').fill('Hello');
32
+ await expect(page.getByPlaceholder('Message...')).toHaveValue('Hello'); // different checks
33
+ ```
34
+
35
+ ### 2. New Method in LLMFacade (`llm-facade.ts`)
36
+
37
+ ```typescript
38
+ async cleanupScript(script: string, model?: string): Promise<{
39
+ script: string;
40
+ changes: string[];
41
+ skipped?: string;
42
+ }>
43
+ ```
44
+
45
+ **Behavior:**
46
+ - Calls LLM with SCRIPT_CLEANUP prompt
47
+ - Parses JSON response with cleaned script and list of changes
48
+ - Returns original script on error (safe fallback)
49
+ - Logs all changes made for transparency
50
+
51
+ **Error Handling:**
52
+ - Invalid JSON → return original script
53
+ - Missing fields → return original script
54
+ - LLM error → return original script
55
+ - Never fails the generation process
56
+
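The fallback rules above can be illustrated with a small parser. The result shape matches the `cleanupScript` signature shown earlier; `parseCleanupResponse` is a hypothetical helper, and reporting the fallback reason through `skipped` is an assumption about how that field is used.

```typescript
// Result shape taken from the cleanupScript signature above.
interface CleanupResult {
  script: string;
  changes: string[];
  skipped?: string;
}

// Hypothetical helper: never throws, always returns a usable script.
function parseCleanupResponse(originalScript: string, rawResponse: string): CleanupResult {
  const fallback = (reason: string): CleanupResult => ({
    script: originalScript,
    changes: [],
    skipped: reason,
  });

  let parsed: unknown;
  try {
    parsed = JSON.parse(rawResponse);
  } catch {
    return fallback('invalid JSON'); // Invalid JSON → original script
  }

  if (typeof parsed !== 'object' || parsed === null) {
    return fallback('missing fields');
  }
  const candidate = parsed as { script?: unknown; changes?: unknown };
  if (typeof candidate.script !== 'string' || !Array.isArray(candidate.changes)) {
    return fallback('missing fields'); // Missing fields → original script
  }

  // Keep only string entries so callers can log changes safely.
  const changes = candidate.changes.filter((c): c is string => typeof c === 'string');
  return { script: candidate.script, changes };
}
```

An LLM call failure would be handled one level up, by wrapping the request in `try`/`catch` and returning the same fallback.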
57
+ ### 3. Integration into Scenario Worker (`scenario-worker-class.ts`)
58
+
59
+ Added cleanup step immediately after `generateTestScript()`:
60
+
61
+ ```typescript
62
+ // Generate clean script with TestChimp comment and code
63
+ generatedScript = generateTestScript(testName, steps, undefined, hashtags);
64
+
65
+ // Perform final cleanup pass to remove redundancies and make minor adjustments
66
+ this.log(`[ScenarioWorker] Performing final script cleanup...`);
67
+ try {
68
+ const cleanupResult = await this.llmFacade.cleanupScript(generatedScript, job.model);
69
+
70
+ if (cleanupResult.changes && cleanupResult.changes.length > 0) {
71
+ this.log(`[ScenarioWorker] Cleanup made ${cleanupResult.changes.length} improvement(s):`);
72
+ cleanupResult.changes.forEach((change, i) => {
73
+ this.log(`[ScenarioWorker] ${i + 1}. ${change}`);
74
+ });
75
+ generatedScript = cleanupResult.script;
76
+ } else if (cleanupResult.skipped) {
77
+ this.log(`[ScenarioWorker] Cleanup skipped: ${cleanupResult.skipped}`);
78
+ } else {
79
+ this.log(`[ScenarioWorker] Cleanup completed - no changes needed`);
80
+ }
81
+ } catch (error: any) {
82
+ this.log(`[ScenarioWorker] Cleanup failed, using original script: ${error.message}`);
83
+ // Continue with original script on error
84
+ }
85
+ ```
86
+
87
+ ## What Gets Cleaned Up
88
+
89
+ ### ✅ Redundancies Removed
90
+ 1. **Duplicate assertions:**
91
+ ```typescript
92
+ // Before cleanup
93
+ await expect(page.getByText('Hello')).toBeVisible();
94
+ await expect(page.getByText('Hello')).toBeVisible();
95
+
96
+ // After cleanup
97
+ await expect(page.getByText('Hello')).toBeVisible();
98
+ ```
99
+
100
+ 2. **Redundant URL checks:**
101
+ ```typescript
102
+ // Before cleanup
103
+ await expect(page).toHaveURL(/\/messages/);
104
+ await expect(page).toHaveURL(/\/messages/);
105
+
106
+ // After cleanup
107
+ await expect(page).toHaveURL(/\/messages/);
108
+ ```
109
+
110
+ 3. **Duplicate comments without code** (already handled by script generation, but this is a safety net)
111
+
112
+ ### ✅ Minor Formatting Fixes
113
+ - Inconsistent spacing
114
+ - Alignment issues
115
+ - Obvious formatting problems
116
+
117
+ ### ❌ Preserved (Not Changed)
118
+ - Test logic and flow
119
+ - Legitimate assertions (same locator, different expectations)
120
+ - Important waits
121
+ - Selectors
122
+ - Test structure
123
+ - Any functionality
124
+
125
+ ## Safety Features
126
+
127
+ ### 1. Conservative Approach
128
+ - Only makes changes when confident they're safe
129
+ - Prompt explicitly warns against major changes
130
+ - Focuses on "obvious" redundancies only
131
+
132
+ ### 2. Transparency
133
+ - Logs all changes made with descriptions
134
+ - Makes it easy to see what was modified
135
+ - Helps debug if cleanup causes issues
136
+
137
+ ### 3. Graceful Degradation
138
+ - Any error → return original script
139
+ - Invalid response → return original script
140
+ - Never breaks the generation pipeline
141
+ - Cleanup is an enhancement, not a requirement
142
+
143
+ ### 4. Idempotent
144
+ - Running cleanup twice should produce the same result
145
+ - No cumulative changes or drift
146
+
147
+ ## Example Output
148
+
149
+ **Console logs:**
150
+ ```
151
+ [ScenarioWorker] Performing final script cleanup...
152
+ [LLMFacade] Script cleanup completed. Changes: 2
153
+ [LLMFacade] 1. Removed duplicate expect assertion for message visibility
154
+ [LLMFacade] 2. Consolidated redundant URL checks into single assertion
155
+ [ScenarioWorker] Cleanup made 2 improvement(s):
156
+ [ScenarioWorker] 1. Removed duplicate expect assertion for message visibility
157
+ [ScenarioWorker] 2. Consolidated redundant URL checks into single assertion
158
+ ```
159
+
160
+ ## Benefits
161
+
162
+ 1. **Cleaner Scripts:** Removes redundancies that can make tests harder to read
163
+ 2. **Reduced Token Usage:** Shorter scripts mean fewer tokens consumed by users
164
+ 3. **Better Maintainability:** Clean code is easier to understand and modify
165
+ 4. **Safety Net:** Catches issues that might slip through orchestrator logic
166
+ 5. **Low Risk:** Falls back to the original script on any error
167
+
168
+ ## Performance Impact
169
+
170
+ - **Time:** Adds one LLM call at the end (~1-3 seconds)
171
+ - **Cost:** One additional LLM call per script generation
172
+ - **Benefit:** Catches redundancies that would otherwise be in production tests
173
+
174
+ The small overhead is worthwhile for the quality improvement.
175
+
176
+ ## Future Enhancements
177
+
178
+ Possible improvements:
179
+ 1. **Configurable:** Allow users to disable cleanup if they prefer
180
+ 2. **More Rules:** Add more specific cleanup patterns
181
+ 3. **Static Analysis:** Use AST parsing instead of LLM for some checks (faster, cheaper)
182
+ 4. **Metrics:** Track how often cleanup makes changes vs. no-ops
183
+
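As a rough illustration of the non-LLM direction in item 3, the sketch below removes consecutive duplicate assertions with plain line matching. It is not AST parsing and not part of the package; `removeConsecutiveDuplicateAssertions` is a hypothetical name, and it only recognizes lines that start with `await expect(`.

```typescript
// Line-based sketch (not an AST pass): drops an `await expect(...)` line when it
// repeats the previous assertion with no other code in between.
function removeConsecutiveDuplicateAssertions(script: string): string {
  const result: string[] = [];
  let previousAssertion: string | null = null;

  for (const line of script.split('\n')) {
    const trimmed = line.trim();
    const isAssertion = trimmed.startsWith('await expect(');

    if (isAssertion && trimmed === previousAssertion) {
      continue; // exact repeat of the previous assertion: skip it
    }
    result.push(line);

    // Blank lines do not reset tracking; any other statement does.
    if (trimmed !== '') {
      previousAssertion = isAssertion ? trimmed : null;
    }
  }
  return result.join('\n');
}
```

Because any intervening statement resets the comparison, the same assertion repeated after an action such as `fill()` is kept, which matches the "different checks" case in the prompt examples.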
184
+ ## Files Modified
185
+
186
+ 1. `/src/prompts.ts` - Added SCRIPT_CLEANUP prompt
187
+ 2. `/src/llm-facade.ts` - Added cleanupScript() method
188
+ 3. `/src/scenario-worker-class.ts` - Integrated cleanup into generation pipeline
189
+
190
+ ## Testing
191
+
192
+ The feature is safe to deploy because:
193
+ - Falls back to original script on any error
194
+ - Doesn't break existing functionality
195
+ - Only makes conservative changes
196
+ - Logs all modifications for review
197
+
198
+ ## Conclusion
199
+
200
+ The script cleanup feature adds a lightweight final polish step to the generation pipeline, removing redundancies and improving code quality without risk to the core test logic. It's a safety net that catches issues the orchestrator might miss while maintaining backward compatibility and graceful error handling.
201
+