testchimp-runner-core 0.0.35 → 0.0.36

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (71)
  1. package/package.json +6 -1
  2. package/plandocs/BEFORE_AFTER_VERIFICATION.md +0 -148
  3. package/plandocs/COORDINATE_MODE_DIAGNOSIS.md +0 -144
  4. package/plandocs/CREDIT_CALLBACK_ARCHITECTURE.md +0 -253
  5. package/plandocs/HUMAN_LIKE_IMPROVEMENTS.md +0 -642
  6. package/plandocs/IMPLEMENTATION_STATUS.md +0 -108
  7. package/plandocs/INTEGRATION_COMPLETE.md +0 -322
  8. package/plandocs/MULTI_AGENT_ARCHITECTURE_REVIEW.md +0 -844
  9. package/plandocs/ORCHESTRATOR_MVP_SUMMARY.md +0 -539
  10. package/plandocs/PHASE1_ABSTRACTION_COMPLETE.md +0 -241
  11. package/plandocs/PHASE1_FINAL_STATUS.md +0 -210
  12. package/plandocs/PHASE_1_COMPLETE.md +0 -165
  13. package/plandocs/PHASE_1_SUMMARY.md +0 -184
  14. package/plandocs/PLANNING_SESSION_SUMMARY.md +0 -372
  15. package/plandocs/PROMPT_OPTIMIZATION_ANALYSIS.md +0 -120
  16. package/plandocs/PROMPT_SANITY_CHECK.md +0 -120
  17. package/plandocs/SCRIPT_CLEANUP_FEATURE.md +0 -201
  18. package/plandocs/SCRIPT_GENERATION_ARCHITECTURE.md +0 -364
  19. package/plandocs/SELECTOR_IMPROVEMENTS.md +0 -139
  20. package/plandocs/SESSION_SUMMARY_v0.0.33.md +0 -151
  21. package/plandocs/TROUBLESHOOTING_SESSION.md +0 -72
  22. package/plandocs/VISION_DIAGNOSTICS_IMPROVEMENTS.md +0 -336
  23. package/plandocs/VISUAL_AGENT_EVOLUTION_PLAN.md +0 -396
  24. package/plandocs/WHATS_NEW_v0.0.33.md +0 -183
  25. package/plandocs/exploratory-mode-support-v2.plan.md +0 -953
  26. package/plandocs/exploratory-mode-support.plan.md +0 -928
  27. package/plandocs/journey-id-tracking-addendum.md +0 -227
  28. package/releasenotes/RELEASE_0.0.26.md +0 -165
  29. package/releasenotes/RELEASE_0.0.27.md +0 -236
  30. package/releasenotes/RELEASE_0.0.28.md +0 -286
  31. package/src/auth-config.ts +0 -84
  32. package/src/credit-usage-service.ts +0 -188
  33. package/src/env-loader.ts +0 -103
  34. package/src/execution-service.ts +0 -996
  35. package/src/file-handler.ts +0 -104
  36. package/src/index.ts +0 -432
  37. package/src/llm-facade.ts +0 -821
  38. package/src/llm-provider.ts +0 -53
  39. package/src/model-constants.ts +0 -35
  40. package/src/orchestrator/decision-parser.ts +0 -139
  41. package/src/orchestrator/index.ts +0 -58
  42. package/src/orchestrator/orchestrator-agent.ts +0 -1282
  43. package/src/orchestrator/orchestrator-prompts.ts +0 -786
  44. package/src/orchestrator/page-som-handler.ts +0 -1565
  45. package/src/orchestrator/som-types.ts +0 -188
  46. package/src/orchestrator/tool-registry.ts +0 -184
  47. package/src/orchestrator/tools/check-page-ready.ts +0 -75
  48. package/src/orchestrator/tools/extract-data.ts +0 -92
  49. package/src/orchestrator/tools/index.ts +0 -15
  50. package/src/orchestrator/tools/inspect-page.ts +0 -42
  51. package/src/orchestrator/tools/recall-history.ts +0 -72
  52. package/src/orchestrator/tools/refresh-som-markers.ts +0 -69
  53. package/src/orchestrator/tools/take-screenshot.ts +0 -128
  54. package/src/orchestrator/tools/verify-action-result.ts +0 -159
  55. package/src/orchestrator/tools/view-previous-screenshot.ts +0 -103
  56. package/src/orchestrator/types.ts +0 -291
  57. package/src/playwright-mcp-service.ts +0 -224
  58. package/src/progress-reporter.ts +0 -144
  59. package/src/prompts.ts +0 -842
  60. package/src/providers/backend-proxy-llm-provider.ts +0 -91
  61. package/src/providers/local-llm-provider.ts +0 -38
  62. package/src/scenario-service.ts +0 -252
  63. package/src/scenario-worker-class.ts +0 -1110
  64. package/src/script-utils.ts +0 -203
  65. package/src/types.ts +0 -239
  66. package/src/utils/browser-utils.ts +0 -348
  67. package/src/utils/coordinate-converter.ts +0 -162
  68. package/src/utils/page-info-retry.ts +0 -65
  69. package/src/utils/page-info-utils.ts +0 -285
  70. package/testchimp-runner-core-0.0.35.tgz +0 -0
  71. package/tsconfig.json +0 -19
@@ -1,844 +0,0 @@
# Multi-Agent Architecture: Critical Review & Phased Approach

## Executive Summary

**Architecture**: Single orchestrator agent with extensible tools, journey memory, and self-reflection
**Goal**: Replace reactive command-by-command execution with proactive, tool-based decision-making
**Compatibility**: Internal refactor only; external API unchanged

---

## Strengths of the Design

### 1. Human-Like Operation ✅
- Always-provided context mirrors human awareness
- Tools model human information gathering
- Self-reflection models metacognition
- Experience accumulation models learning

### 2. Extensibility ✅
- Dynamic tool registry → add tools without prompt changes
- Tool descriptions auto-included in prompts
- Easy to add new capabilities

### 3. Flexibility ✅
- Works for both generation and repair modes
- Configurable guardrails per job
- Agent + system termination (soft + hard limits)

### 4. Efficiency ✅
- Batch command planning (fewer iterations)
- Always-provided context (no repeated tool calls for the same info)
- Recent-memory window prevents bloat

---

## Potential Pitfalls & Mitigations

### Pitfall 1: Tool Call Overhead

**Problem**: If the agent calls tools every iteration, we get:
```
Iteration 1: Agent call → Tool calls (DOM, screenshot) → Agent call with results
= 2 LLM calls + tool execution time
```

**Current Mitigation**:
- ✅ DOM always provided (no tool call)
- ✅ Screenshot available freely (not expensive)
- ⚠️ Risk: Agent might overuse other tools

**Improvement**:
```typescript
// Track tool usage patterns
if (agent.calledSameToolLast3Iterations('take_screenshot')) {
  logger.warn('Agent overusing screenshot, limiting availability');
  // Only provide the screenshot tool every other iteration
}
```

**For MVP**: Accept some tool overhead; optimize in Phase 2

---

### Pitfall 2: Self-Reflection Spiral

**Problem**: Agent reflection might reinforce wrong assumptions:
```
Iteration 1: "Focus on finding 'Submit' button"
Iteration 2: "Still focusing on 'Submit' button, try different selector"
Iteration 3: "Submit button must exist, trying again..."
(Agent stuck on non-existent element)
```

**Current Mitigation**:
- ✅ System can override reflection after N failures
- ✅ Fresh DOM each iteration (reality check)
- ⚠️ Risk: Agent ignores fresh context, follows stale reflection

**Improvement**:
```typescript
// Detect reflection loops
if (lastReflection.focus === currentReflection.focus && failureCount > 2) {
  context.systemNote = "OVERRIDE: Your focus isn't working. Try a completely different approach.";
  context.previousIterationGuidance = null; // Clear the bad reflection
}
```

**For MVP**: Start without self-reflection; add in Phase 2 with loop detection

---
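The override check above can be reduced to a pure predicate, which makes the threshold easy to unit-test in isolation. A minimal sketch; the `Reflection` shape and the 2-failure threshold are illustrative assumptions, not the shipped types:

```typescript
interface Reflection {
  focus: string;
}

// Returns true when the agent has kept the same focus across consecutive
// failing iterations and the system should force a reset.
// `maxRepeats` is the number of failures tolerated before the override.
function isReflectionLoop(
  last: Reflection | null,
  current: Reflection,
  failureCount: number,
  maxRepeats = 2
): boolean {
  return last !== null && last.focus === current.focus && failureCount > maxRepeats;
}
```

Keeping the check pure means the same predicate can later drive both the system-side override and the agent's own `detectingLoop` signal.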

### Pitfall 3: Exploratory Actions Cause Side Effects

**Problem**: "Safe" actions might not be safe:
```
Click info icon → Opens modal → Blocks page → Can't recover
Hover over menu → Menu stays open → Interferes with next action
Click dropdown → Triggers onChange event → Unexpected state change
```

**Current Mitigation**:
- ✅ Limited action types (hover, click_menu, click_info, focus)
- ✅ Screenshot after exploration to see state
- ⚠️ Risk: No way to undo exploratory actions

**Improvement**:
```typescript
// Snapshot before exploration
const beforeState = await page.evaluate(() => ({
  url: window.location.href,
  activeElement: document.activeElement?.tagName
}));

// Execute exploration
await explore();

// Check if state changed unexpectedly (same snapshot as above)
const afterState = await page.evaluate(() => ({
  url: window.location.href,
  activeElement: document.activeElement?.tagName
}));
if (beforeState.url !== afterState.url) {
  logger.error('Exploration caused navigation - aborting');
  await page.goBack();
}
```

**For MVP**: Start without exploratory actions; add in Phase 2 with state validation

---

### Pitfall 4: Memory Bloat

**Problem**: Long scenarios create huge history:
```
50 steps × 5 iterations/step = 250 memory entries
Each entry: action (50 chars) + code (100 chars) + observation (100 chars)
= 62,500 characters in memory
```

**Current Mitigation**:
- ✅ Always-provided: recent 6-7 steps only
- ✅ Tool for deeper history
- ⚠️ Risk: Full history grows unbounded

**Improvement**:
```typescript
interface AgentConfig {
  maxHistorySize?: number;  // Trigger summarization above this. Default: 100
  keepRecentSteps?: number; // Recent entries kept verbatim. Default: 50
}

// Once history exceeds the cap, summarize everything older than the recent window
if (history.length > config.maxHistorySize) {
  const oldSteps = history.slice(0, -config.keepRecentSteps);
  const summary = await llm.summarize(oldSteps);
  history = [{ ...summary, isSummary: true }, ...history.slice(-config.keepRecentSteps)];
}
```

**For MVP**: Cap history at 100 steps; add summarization in Phase 2

---
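The MVP's hard cap (before any LLM summarization lands) can be expressed as a pure function over the history array. A sketch under assumed types; the placeholder-summary entry stands in for the Phase 2 summarization call:

```typescript
interface HistoryEntry {
  action: string;
  isSummary?: boolean;
}

// Keep at most `maxSize` entries; collapse anything older into a single
// placeholder entry so the agent still knows earlier steps happened.
function capHistory(history: HistoryEntry[], maxSize = 100): HistoryEntry[] {
  if (history.length <= maxSize) return history;
  const dropped = history.length - (maxSize - 1);
  return [
    { action: `[summary of ${dropped} earlier steps]`, isSummary: true },
    ...history.slice(-(maxSize - 1)),
  ];
}
```

Because it is pure, the cap can be applied after every memory update without any async plumbing; swapping the placeholder for a real LLM summary later only changes the first entry.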

### Pitfall 5: Prompt Complexity

**Problem**: Always-provided context + tool descriptions + self-reflection → large prompts

**Estimated prompt size:**
```
- System prompt: 500 tokens (with tool descriptions)
- Always-provided context: 1200-2000 tokens
- Goals & progress: 100 tokens
- DOM (optimized with increased limits): 800-1500 tokens
  * getEnhancedPageInfo limits:
    - ARIA depth: 4 levels max
    - Interactive elements: top 50
    - IDs: top 50
    - Data attrs: top 50
    - Form fields: top 20
    - Text: 30 chars max
- Recent 6-7 steps: 300-500 tokens (50-80 tokens each)
- Experiences: 100-200 tokens (~10 learnings)
- Self-reflection: 100 tokens
- Tool results (if any): 300-500 tokens
Total: 2400-3600 tokens per iteration
```

**Current Mitigation**:
- ✅ DOM pre-truncated by getEnhancedPageInfo (already compact!)
- ✅ Recent steps only (6-7, not all)
- ✅ Experiences capped at 20
- ⚠️ Risk: Still substantial with tool results

**Validation Needed**:
```typescript
// Log actual prompt sizes during development
logger.debug(`Prompt tokens: system=${systemTokens}, user=${userTokens}, total=${total}`);

// If it exceeds the threshold, warn
if (total > 3000) {
  logger.warn(`Large prompt: ${total} tokens`);
}
```

**For MVP**: Increased to ~2,400-3,600 tokens per iteration to support complex pages. An acceptable trade-off for better agent awareness; monitor and optimize in Phase 2 if it becomes an issue.

---
206
-
207
- ### Pitfall 6: Agent Ignores Available Information
208
-
209
- **Problem**: Agent has DOM in context but still calls inspect_page tool
210
- ```
211
- Agent: "toolCalls": [{"name": "inspect_page"}]
212
- System: (But DOM was just provided...)
213
- ```
214
-
215
- **Current Mitigation**:
216
- - ✅ Prompt says "Current DOM snapshot" in always-provided
217
- - ⚠️ Risk: Agent doesn't realize info is already available
218
-
219
- **Improvement**:
220
- ```typescript
221
- // System validates tool calls
222
- if (decision.toolCalls?.includes('inspect_page') && domProvidedThisIteration) {
223
- logger.warn('SYSTEM: inspect_page unnecessary, DOM already provided');
224
- decision.toolCalls = decision.toolCalls.filter(t => t.name !== 'inspect_page');
225
- }
226
- ```
227
-
228
- **For MVP**: Log warnings but allow redundant calls, add validation Phase 2
229
-
230
- ---

### Pitfall 7: Batch Execution Waste

**Problem**: Agent plans 5 commands, the first one fails, the remaining 4 are wasted:
```
Batch: [cmd1, cmd2, cmd3, cmd4, cmd5]
Execute cmd1 → FAIL
Skip cmd2-5
(Agent had to think of all 5, but only 1 executed)
```

**Current Mitigation**:
- ✅ Sequential execution prevents cascade failures
- ⚠️ Risk: Wasted planning effort

**Improvement**:
```typescript
// Adaptive batch size
if (successRate < 0.5) {
  config.maxCommandsPerIteration = 2; // Plan fewer when failing
} else {
  config.maxCommandsPerIteration = 5; // Plan more when succeeding
}
```

**For MVP**: Accept some waste; optimize in Phase 2

---
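The adaptive sizing above can be isolated into a small helper so the thresholds are testable on their own. The 0.5 cutoff and the 2/3/5 sizes mirror the numbers discussed in this document; the function name and the no-data default are illustrative assumptions:

```typescript
// Pick the next iteration's batch size from recent command outcomes.
// Shrink when commands are failing, grow when they succeed.
function adaptiveBatchSize(succeeded: number, attempted: number): number {
  if (attempted === 0) return 3; // no data yet: default mid-size batch
  const rate = succeeded / attempted;
  return rate < 0.5 ? 2 : 5;
}
```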

### Pitfall 8: Screenshot Token Cost (CORRECTED)

**Actual cost** (based on OpenAI vision pricing):

**For gpt-4.1-mini / gpt-5-mini:**
- Viewport screenshot (1920x1080): ~1024-1452 tokens
- Full page (1920x3000): ~1536 tokens (capped)

**For gpt-4o / gpt-4.1:**
- Viewport (1920x1080): ~1105 tokens
- Full page: ~1360 tokens

**Calculation for 1920x1080:**
```
1. Scale to fit 2048x2048: No scaling (already fits)
2. Scale shortest side to 768px: 1080→768, 1920→1365
3. Tiles: ceil(1365/32) × ceil(768/32) = 43 × 24 = 1032 patches
4. For gpt-4.1-mini: 1032 × 1.62 ≈ 1672 tokens (capped at 1536)
```

**Conclusion**: Screenshots are **1K-2K tokens**, NOT 100K!

**Impact**:
- ✅ Very affordable (comparable to providing extra DOM context)
- ✅ No budget needed - agent can use freely
- ✅ Vision mode viable for most steps if helpful
- ✅ Tool call overhead acceptable

**For MVP**: Use screenshots liberally, no artificial limits

---
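The calculation above can be checked mechanically. A sketch of the three scaling steps exactly as quoted here; treat the 1.62 multiplier and 1536 cap as this document's assumptions about gpt-4.1-mini pricing rather than guaranteed figures:

```typescript
// Step 1: fit within 2048x2048, Step 2: scale the shortest side to 768px.
function scaleForVision(width: number, height: number): [number, number] {
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = Math.floor(width * fit);
  let h = Math.floor(height * fit);
  const short = Math.min(w, h);
  if (short > 768) {
    w = Math.floor((w * 768) / short);
    h = Math.floor((h * 768) / short);
  }
  return [w, h];
}

// Step 3: count 32px patches over the scaled image.
function estimatePatches(width: number, height: number): number {
  const [w, h] = scaleForVision(width, height);
  return Math.ceil(w / 32) * Math.ceil(h / 32);
}

// gpt-4.1-mini figures quoted above: 1.62 per-patch multiplier, 1536 cap.
function estimateMiniTokens(width: number, height: number): number {
  return Math.min(1536, Math.round(estimatePatches(width, height) * 1.62));
}
```

Running this for a 1920x1080 viewport reproduces the 1032-patch figure and the 1536-token cap from the worked example.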

## Phased Implementation Strategy

### MVP (Phase 1): Core Agent Without Advanced Features

**Include:**
- ✅ OrchestratorAgent class
- ✅ Dynamic ToolRegistry
- ✅ 5 essential tools: inspect_page, take_screenshot, recall_history, extract_data, check_page_ready
- ✅ Journey memory (history + experiences + extracted data)
- ✅ Always-provided context (overall goal, current goal, DOM, recent 6-7 steps)
- ✅ Self-reflection (free-form guidance to next iteration)
- ✅ Loop detection (agent detects its own spirals via the detectingLoop flag)
- ✅ Batch command planning (max 3-5)
- ✅ Sequential execution (stop on first failure)
- ✅ Experience accumulation (learnings)
- ✅ System guardrails (iteration limits, no screenshot budget)
- ✅ Agent termination (complete/stuck/infeasible)
- ✅ Comprehensive logging (all reasoning visible)

**Exclude (Phase 2):**
- ❌ Exploratory actions (safety concerns)
- ❌ Advanced tool validation
- ❌ Tool result caching
- ❌ Memory summarization

**Benefits**:
- Complete feature set (tools + memory + reflection + learning)
- Agent maintains its train of thought via self-reflection
- Agent self-corrects via loop detection
- Still simpler than a 2-agent architecture
- Comprehensive logging for debugging

**Expected metrics**:
- LLM calls/step: 2-4 (vs 4-6 current)
- Iterations/step: 3-5 (vs 8-12 current)
- Tool calls/step: 1-3
- Commands/iteration: 2-3 (batched)
- Agent learns: 1-2 experiences per step

---
333
- ### Phase 2: Add Learning & Reflection
334
-
335
- **Add:**
336
- - ✅ Self-reflection (previousIterationGuidance)
337
- - ✅ Experience accumulation
338
- - ✅ extract_data tool
339
- - ✅ Experience deduplication
340
- - ✅ Reflection loop detection
341
-
342
- **Benefits**:
343
- - Agent learns patterns
344
- - Continuity across iterations
345
- - Better context for future steps
346
-
347
- **Risks**:
348
- - Reflection spirals (mitigated with loop detection)
349
- - Experience bloat (mitigated with deduplication)
350
-
351
- ---
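The experience-deduplication item above can be as simple as a normalized-text set that keeps the first phrasing of each learning. A sketch; the normalization rule (lowercase, strip punctuation) is an illustrative assumption:

```typescript
// Collapse near-identical learnings ("Forms use #id selectors" vs
// "forms use #id selectors.") into a single entry, keeping first phrasing.
function dedupeExperiences(experiences: string[]): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  for (const exp of experiences) {
    const key = exp.toLowerCase().replace(/[^a-z0-9 ]/g, "").trim();
    if (!seen.has(key)) {
      seen.add(key);
      result.push(exp);
    }
  }
  return result;
}
```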

### Phase 3: Advanced Exploration & Optimization

**Add:**
- ✅ Exploratory actions (with state validation)
- ✅ Tool result caching
- ✅ Adaptive batch sizing
- ✅ Tool usage pattern detection
- ✅ Memory summarization

**Benefits**:
- Handle ambiguous UIs
- More efficient tool use
- Better long-scenario handling

---

## Critical Architecture Decisions to Validate

### Decision 1: Always-Provide DOM vs Tool

**Current**: DOM always provided (auto-fetched each iteration)

**Alternative**: DOM via tool call

**Analysis**:
- Pro (always-provide): No tool call overhead, always fresh
- Con (always-provide): Wasted if the agent doesn't need it
- **Verdict**: ✅ Always-provide is correct (DOM is needed ~95% of the time)

### Decision 2: Recent Steps Count (6-7 vs 3)

**Current**: 6-7 steps in the always-provided context

**Analysis**:
- More context → better decisions
- More context → larger prompts
- **Test**: Measure prompt size and decision quality

**For MVP**: Start with 5, tune based on real usage

### Decision 3: Self-Reflection Format

**Current**: Structured (focus, avoid, hypothesis)

**Alternative**: Free-form text

**Analysis**:
- Structured → easier to process
- Structured → might be restrictive
- **Verdict**: ✅ Structured is better (easier to override when looping)

**For MVP**: Skip self-reflection entirely, add in Phase 2

### Decision 4: Batch Size

**Current**: Max 5 commands per iteration

**Analysis**:
- Too small → many iterations
- Too large → wasted planning on an early failure
- **Test**: Measure the success rate of the 2nd-5th commands in a batch

**For MVP**: Max 3 commands; increase to 5 in Phase 2 if the success rate is high

---

## Agent Transparency: Comprehensive Logging

### Principle: Make the Agent's Thinking Visible

**Every agent decision must be logged** so developers can understand:
- What the agent is thinking
- Why it made each decision
- What it learned
- When and why it's stuck

### Logged Information

Per iteration, log:
1. Iteration number and goal
2. Agent reasoning (why this approach)
3. Self-reflection (focus, avoid, hypothesis)
4. Tools requested + why
5. Commands planned + why
6. Experiences learned
7. Status decision + why
8. Command execution results

### Example Log Output

```
[Orchestrator] === Iteration 1/8 ===
[Orchestrator] 🎯 Goal: Login with alice@example.com, TestPass123
[Orchestrator] 💭 Reasoning: Need to locate login form elements
[Orchestrator] 🔧 Tools: [inspect_page]
[Orchestrator] 📋 Why: Need DOM to find email/password fields
[Orchestrator] ⏳ Executing tools...
[Orchestrator] ✓ Tools complete
[Orchestrator] 📝 Commands (3): fill email, fill password, click submit
[Orchestrator] 💡 Why batch: Can fill entire form before submitting
[Orchestrator] 🧠 Next iteration focus: Check for redirect after submit
[Orchestrator] 📚 Learning: Forms use #id selectors consistently
[Orchestrator] ▶ Executing sequentially...
[Orchestrator] ✓ [1/3] await page.fill('#email', 'alice@example.com')
[Orchestrator] ✓ [2/3] await page.fill('#password', 'TestPass123')
[Orchestrator] ✓ [3/3] await page.click('button[type="submit"]')
[Orchestrator] 🎯 Status: continue
[Orchestrator] 💭 Why: Commands executed, need to verify navigation
```

### Progress Reporter Extension

Add agent thoughts to progress reporting:

```typescript
interface StepProgress {
  // ... existing fields

  // NEW: Agent transparency
  agentIteration?: number;
  agentReasoning?: string;
  agentSelfReflection?: SelfReflection;
  agentExperiences?: string[];
  agentToolsUsed?: string[];
  agentStatus?: string;
}

// Report after each iteration:
await progressReporter?.onStepProgress?.({
  jobId,
  stepNumber,
  description: stepGoal,
  status: StepExecutionStatus.IN_PROGRESS,
  code: decision.commands?.join('\n'),
  agentIteration: iteration,
  agentReasoning: decision.reasoning,
  agentSelfReflection: decision.selfReflection,
  agentExperiences: decision.experiences,
  agentToolsUsed: decision.toolCalls?.map(t => t.name),
  agentStatus: decision.status
});
```

**Benefits:**
- The VS extension can display agent thoughts in its output panel
- The Script Service can store them in the DB for frontend visualization
- Debugging becomes much easier
- Users understand what the agent is doing

---

## Recommended MVP Scope

### Include (Core Functionality)

1. **OrchestratorAgent**
   - Single agent loop
   - Always-provided context (overall goal, current goal, current DOM, recent 5 steps)
   - Tool calls (max 3 per iteration)
   - Batch commands (max 3 per iteration)
   - Sequential execution with early stop
   - System guardrails

2. **Essential Tools**
   - `inspect_page` (might be redundant since the DOM is always provided, but keep it as an extensibility demo)
   - `take_screenshot` (isFullPage param)
   - `recall_history` (maxSteps param)

3. **Simple Memory**
   - Unified history array
   - No experiences (add in Phase 2)
   - No self-reflection (add in Phase 2)
   - Basic extracted data

4. **Guardrails**
   - Max 8 iterations/step
   - Max 10 screenshots/scenario
   - Max 2 consecutive failures
   - Agent status: complete | stuck | continue

### Exclude (Phase 2+)

1. **Self-Reflection** - Add after validating the basic agent works
2. **Exploratory Actions** - Add after validating tools work safely
3. **Experiences** - Add after validating memory works
4. **extract_data Tool** - Can be added later
5. **Advanced Validation** - Tool result caching, loop detection, etc.

---

## Implementation Risks & Mitigation

### Risk 1: Tool Call Latency
**Impact**: High
**Mitigation**: Always provide the DOM; limit tool calls to 3
**Acceptance**: Some overhead is acceptable for better decisions

### Risk 2: Prompt Token Cost
**Impact**: Medium
**Mitigation**: Truncate the DOM; limit recent steps to 5
**Acceptance**: Monitor token usage, optimize in Phase 2

### Risk 3: Agent Loops
**Impact**: High
**Mitigation**: System iteration limits (8), consecutive-failure stops (2)
**Acceptance**: Hard limits prevent runaway

### Risk 4: Breaking Backward Compatibility
**Impact**: Critical
**Mitigation**: Changes are internal to ScenarioWorker only; external API unchanged
**Validation**: Test the VS Extension and GitHub Runner after implementation

### Risk 5: Complexity
**Impact**: Medium
**Mitigation**: MVP excludes advanced features; phased approach
**Acceptance**: Simplify if the MVP proves too complex

---

## Key Simplifications for MVP

### 1. Include Self-Reflection (Agent Detects Spirals)
**Why**: Provides valuable train-of-thought continuity
**How**: Agent outputs free-form guidance to itself, PLUS detects if it is spiraling
**Safety**: Agent can recognize "I'm stuck on the same approach" and reset

```typescript
interface SelfReflection {
  guidanceForNext: string;  // Free-form: "Try data-testid, the icon approach failed"
  detectingLoop: boolean;   // Agent signals if it thinks it's looping
  loopReasoning?: string;   // "Tried same selector 3 times, need different approach"
}

// In the prompt:
// "If you notice you're trying the same approach repeatedly, set
// detectingLoop=true and try something completely different."
```

**The agent decides when to break its own loop**; the system enforces hard limits as a backup.

### 2. Include Self-Reflection with Loop Detection
**Agent outputs free-form guidance** for the next iteration
**Agent detects its own loops** via the `detectingLoop` flag
**Agent resets its approach** when it notices repetition

### 3. No Exploratory Actions (Phase 2)
**Why**: Safety risk; adds complexity
**Trade-off**: Can't investigate ambiguous elements programmatically
**Acceptable**: Can use screenshot + DOM analysis instead

### 4. Include Experience Accumulation
**Why**: Learning across steps provides value
**How**: Agent outputs experiences learned each iteration
**Simple**: Just an array of strings; deduplicate similar ones

### 5. Include extract_data Tool
**Why**: Steps often reference earlier data
**How**: Simple tool to save a selector's value as named data
**Benefit**: Agent can explicitly save data for later steps

### 6. Simple Tool Validation (Phase 2+)
**Why**: Complex validation adds overhead
**Trade-off**: Agent might make redundant tool calls
**Acceptable**: Log warnings, don't block

---
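The extract_data idea in simplification 5 reduces to a named key-value store that gets rendered back into the always-provided context for later steps. A sketch; the class name and API are illustrative, not the shipped tool:

```typescript
// Named store for values the agent wants to reference in later steps
// (e.g. an order ID read from the page after checkout).
class ExtractedData {
  private data = new Map<string, string>();

  save(name: string, value: string): void {
    this.data.set(name, value);
  }

  recall(name: string): string | undefined {
    return this.data.get(name);
  }

  // Rendered into the always-provided context each iteration
  toContext(): string {
    return [...this.data.entries()].map(([k, v]) => `${k}=${v}`).join("; ");
  }
}
```

The tool itself would just wrap `save()` around a selector read; everything downstream is plain context text.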

## MVP Architecture (Simplified)

```
┌─────────────────────────────────────────┐
│ ORCHESTRATOR AGENT (MVP)                │
│ • Always gets: Overall goal             │
│                Current goal             │
│                Current DOM              │
│                Recent 5 steps           │
│ • Tools: screenshot, recall_history     │
│ • Plans: Up to 3 commands               │
│ • Executes: Sequential, stop on fail    │
│ • Decides: complete | stuck | continue  │
└─────────────────────────────────────────┘
```

**Excluded from MVP:**
- Self-reflection
- Exploratory actions
- Experience learning
- extract_data tool
- Advanced validations

---

## Phase 2 Additions (After MVP Validated)

**Add if the MVP works well:**

1. **Self-Reflection** (if the agent needs continuity)
   - Monitor: Does the agent make progress without it?
   - Add: Only if we see repeated mistakes

2. **Experience Accumulation** (if learning helps)
   - Monitor: Do patterns repeat across steps?
   - Add: If we see benefit

3. **Exploratory Actions** (if needed for ambiguous UIs)
   - Monitor: How often are elements ambiguous?
   - Add: With state validation

4. **extract_data Tool** (if data is needed across steps)
   - Monitor: Do steps reference earlier data?
   - Add: If manual observation notes aren't sufficient

---

## Phase 3 Optimizations (Future)

1. Tool result caching
2. Adaptive batch sizing
3. Memory summarization for long scenarios
4. Project-level memory
5. Cross-scenario learning
6. Token budget tracking
7. Tool cost-benefit analysis

---

## Success Criteria (MVP)

### Must Have
- [ ] Fewer iterations than current (target: 50% reduction)
- [ ] Backward compatible (VS Extension and GitHub Runner work)
- [ ] No infinite loops (guardrails work)
- [ ] Memory doesn't bloat
- [ ] Tool extensibility works (can add a new tool)

### Nice to Have
- [ ] Fewer LLM calls than current
- [ ] Better success rate
- [ ] Faster execution

### Acceptable Trade-offs
- ⚠️ Slightly higher token usage (more context in prompts)
- ⚠️ Some tool call overhead
- ⚠️ No learning/reflection in MVP (add in Phase 2)

---

## Implementation Order (MVP)

### Week 1: Foundation
1. Create types (AgentConfig, JourneyMemory, AlwaysProvidedContext)
2. Create ToolRegistry with dynamic prompt generation
3. Implement 3 tools (inspect_page, take_screenshot, recall_history)
4. Test tools independently

### Week 2: Orchestrator
1. Implement OrchestratorAgent.executeStep()
2. Always-provided context building
3. Tool call execution
4. Batch command execution (sequential)
5. Simple memory update (no experiences)
6. System guardrails

### Week 3: Integration
1. Add orchestrator prompts
2. Refactor ScenarioWorker to use the orchestrator
3. Test generation mode
4. Test repair mode (if time allows)

### Week 4: Testing & Refinement
1. Test with real scenarios
2. Tune iteration limits
3. Measure token usage
4. Compare metrics vs current
5. Fix bugs
6. Verify backward compatibility

---

## Decision: Start with MVP or Full Architecture?

### Recommendation: **Start with the MVP**

**Why:**
1. **Validate the core concept** - Does a tool-use agent work better?
2. **Reduce risk** - Simpler implementation, fewer edge cases
3. **Faster iteration** - Can test and tune quicker
4. **Learn from usage** - Real data tells us which features matter

**MVP Excludes:**
- Self-reflection (complex, risky)
- Exploratory actions (safety concerns)
- Experience accumulation (nice-to-have)
- extract_data tool (not essential)

**MVP Focuses On:**
- ✅ Tool-based information gathering
- ✅ Batch command planning
- ✅ Memory management
- ✅ Guardrails
- ✅ Backward compatibility

**After the MVP proves value, add Phase 2 features based on:**
- What problems remain?
- What would self-reflection solve?
- How often are elements ambiguous (exploration)?
- Do patterns repeat (experiences)?

---

## Open Questions to Resolve During Implementation

1. **Optimal recent-steps count**: 3, 5, or 7?
   - Test with different values, measure decision quality

2. **Tool call timing**: Before every iteration, or only when the agent requests?
   - MVP: Only when the agent requests
   - Could change if DOM staleness becomes an issue

3. **Batch size sweet spot**: 3 or 5 commands?
   - Start with 3; increase if the success rate of later commands is high

4. **Memory update frequency**: After each command or after each iteration?
   - MVP: After each command (more granular)
   - Could batch if it becomes a performance issue

5. **Tool result format**: Raw or summarized?
   - MVP: Raw (simpler)
   - Add summarization if token usage is too high

---

## Success Metrics to Track

### Performance
- Average LLM calls per step (target: < 3)
- Average iterations per step (target: < 5)
- Average tool calls per step (target: < 2)
- Time per step (target: < 30s)

### Quality
- Success rate per step (target: > 80%)
- Commands per iteration (target: 2-3 avg)
- Tool call relevance (target: > 90% useful)

### Resource
- Token usage per step (monitor; optimize if > 5K)
- Screenshot usage (should be < 10% of steps)
- Memory size (should cap at 100 entries)

---

## Recommendation: Proceed with MVP

**Start with:**
1. Basic orchestrator agent
2. 3 core tools (screenshot, recall_history, inspect_page optional)
3. Simple memory (history only)
4. Always-provided context (goal + DOM + recent 5 steps)
5. Batch commands (max 3)
6. System guardrails

**Validate:**
- Does it work better than the current approach?
- Is backward compatibility maintained?
- Are the guardrails sufficient?

**Then add Phase 2:**
- Self-reflection (if needed)
- Experiences (if patterns emerge)
- Exploratory actions (if ambiguity is common)

---

## Final Thoughts

**This architecture is sound, but ambitious.**

**Recommendation**: Implement the MVP first to validate that:
1. The tool-use paradigm works
2. Memory management works
3. Batch execution helps
4. The guardrails are sufficient

**Then iterate** based on real usage data, not assumptions.

**MVP timeline**: 2-3 weeks
**Full architecture**: 4-6 weeks

**Start with the MVP to reduce risk and validate the concept.**