npm - testchimp-runner-core - Versions diffs - 0.0.35 → 0.0.36 - Mend

testchimp-runner-core 0.0.35 → 0.0.36

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (71)

package/package.json +6 -1
package/plandocs/BEFORE_AFTER_VERIFICATION.md +0 -148
package/plandocs/COORDINATE_MODE_DIAGNOSIS.md +0 -144
package/plandocs/CREDIT_CALLBACK_ARCHITECTURE.md +0 -253
package/plandocs/HUMAN_LIKE_IMPROVEMENTS.md +0 -642
package/plandocs/IMPLEMENTATION_STATUS.md +0 -108
package/plandocs/INTEGRATION_COMPLETE.md +0 -322
package/plandocs/MULTI_AGENT_ARCHITECTURE_REVIEW.md +0 -844
package/plandocs/ORCHESTRATOR_MVP_SUMMARY.md +0 -539
package/plandocs/PHASE1_ABSTRACTION_COMPLETE.md +0 -241
package/plandocs/PHASE1_FINAL_STATUS.md +0 -210
package/plandocs/PHASE_1_COMPLETE.md +0 -165
package/plandocs/PHASE_1_SUMMARY.md +0 -184
package/plandocs/PLANNING_SESSION_SUMMARY.md +0 -372
package/plandocs/PROMPT_OPTIMIZATION_ANALYSIS.md +0 -120
package/plandocs/PROMPT_SANITY_CHECK.md +0 -120
package/plandocs/SCRIPT_CLEANUP_FEATURE.md +0 -201
package/plandocs/SCRIPT_GENERATION_ARCHITECTURE.md +0 -364
package/plandocs/SELECTOR_IMPROVEMENTS.md +0 -139
package/plandocs/SESSION_SUMMARY_v0.0.33.md +0 -151
package/plandocs/TROUBLESHOOTING_SESSION.md +0 -72
package/plandocs/VISION_DIAGNOSTICS_IMPROVEMENTS.md +0 -336
package/plandocs/VISUAL_AGENT_EVOLUTION_PLAN.md +0 -396
package/plandocs/WHATS_NEW_v0.0.33.md +0 -183
package/plandocs/exploratory-mode-support-v2.plan.md +0 -953
package/plandocs/exploratory-mode-support.plan.md +0 -928
package/plandocs/journey-id-tracking-addendum.md +0 -227
package/releasenotes/RELEASE_0.0.26.md +0 -165
package/releasenotes/RELEASE_0.0.27.md +0 -236
package/releasenotes/RELEASE_0.0.28.md +0 -286
package/src/auth-config.ts +0 -84
package/src/credit-usage-service.ts +0 -188
package/src/env-loader.ts +0 -103
package/src/execution-service.ts +0 -996
package/src/file-handler.ts +0 -104
package/src/index.ts +0 -432
package/src/llm-facade.ts +0 -821
package/src/llm-provider.ts +0 -53
package/src/model-constants.ts +0 -35
package/src/orchestrator/decision-parser.ts +0 -139
package/src/orchestrator/index.ts +0 -58
package/src/orchestrator/orchestrator-agent.ts +0 -1282
package/src/orchestrator/orchestrator-prompts.ts +0 -786
package/src/orchestrator/page-som-handler.ts +0 -1565
package/src/orchestrator/som-types.ts +0 -188
package/src/orchestrator/tool-registry.ts +0 -184
package/src/orchestrator/tools/check-page-ready.ts +0 -75
package/src/orchestrator/tools/extract-data.ts +0 -92
package/src/orchestrator/tools/index.ts +0 -15
package/src/orchestrator/tools/inspect-page.ts +0 -42
package/src/orchestrator/tools/recall-history.ts +0 -72
package/src/orchestrator/tools/refresh-som-markers.ts +0 -69
package/src/orchestrator/tools/take-screenshot.ts +0 -128
package/src/orchestrator/tools/verify-action-result.ts +0 -159
package/src/orchestrator/tools/view-previous-screenshot.ts +0 -103
package/src/orchestrator/types.ts +0 -291
package/src/playwright-mcp-service.ts +0 -224
package/src/progress-reporter.ts +0 -144
package/src/prompts.ts +0 -842
package/src/providers/backend-proxy-llm-provider.ts +0 -91
package/src/providers/local-llm-provider.ts +0 -38
package/src/scenario-service.ts +0 -252
package/src/scenario-worker-class.ts +0 -1110
package/src/script-utils.ts +0 -203
package/src/types.ts +0 -239
package/src/utils/browser-utils.ts +0 -348
package/src/utils/coordinate-converter.ts +0 -162
package/src/utils/page-info-retry.ts +0 -65
package/src/utils/page-info-utils.ts +0 -285
package/testchimp-runner-core-0.0.35.tgz +0 -0
package/tsconfig.json +0 -19

package/plandocs/ORCHESTRATOR_MVP_SUMMARY.md DELETED

@@ -1,539 +0,0 @@
-# Orchestrator Agent MVP: Final Specification
-## What We're Building
-**Single orchestrator agent** that replaces the current reactive sub-action loop with a proactive, tool-using, memory-maintaining system.
----
-## Core Features (All in MVP)
-### 1. Always-Provided Context (Auto-Fetched Each Iteration)
-```typescript
-{
-  // WHERE AM I? (Journey tracking)
-  overallGoal: "Full scenario text",
-  currentStepGoal: "Login with alice@example.com, TestPass123",
-  stepNumber: 3,
-  totalSteps: 6,
-  completedSteps: ["Go to URL", "Click login"],
-  remainingSteps: ["Messages", "Send message", "Verify"],
-  // WHAT DO I SEE? (Current state - fresh)
-  currentPageInfo: getEnhancedPageInfo(page),  // ARIA + IDs + data attrs (compact!)
-  currentURL: page.url(),
-  // WHAT DID I JUST DO? (Recent memory ~6-7 steps - TEXT ONLY, NO SCREENSHOTS)
-  recentSteps: [
-    {
-      stepNumber: 2,
-      action: "Navigated to login page",
-      code: "await page.goto('https://...')",
-      result: "success",
-      observation: "Login form visible with email and password fields"
-    },
-    {
-      stepNumber: 2,
-      action: "Explored menu icon (hover)",
-      code: "explore_element({action: 'hover', selector: 'nav button:nth-child(2)'})",
-      result: "success",
-      observation: "Tooltip appeared showing 'Dashboard', confirmed this is the Dashboard button"
-    }
-  ],
-  // WHAT HAVE I LEARNED? (Experiences)
-  experiences: [
-    "Site uses #id selectors for forms",
-    "Login redirects to /dashboard"
-  ],
-  // WHAT DO I KNOW? (Extracted data)
-  extractedData: { userEmail: "alice@example.com" },
-  // WHAT DID I TELL MYSELF? (Self-reflection from previous iteration)
-  previousIterationGuidance: {
-    guidanceForNext: "Check for redirect after submit",
-    detectingLoop: false
-  }
-}
-```
-**Optimized for complex pages:**
-- DOM: Pre-truncated with increased limits
-  - Interactive elements: top 50
-  - IDs: top 50
-  - Data attributes: top 50
-  - Form fields: top 20
-  - Page structure: top 10
-  - General elements: top 50
-  - Text: 30 chars max
-  - Result: ~800-1,500 tokens
-- Steps: Recent 6-7 only
-- Experiences: Cap at 20
-### 2. Optional Tools (Agent Requests)
-**8 core tools:**
-**Information Gathering:**
-1. **take_screenshot({isFullPage})** - Visual context (use freely)
-2. **recall_history({maxSteps, query})** - Deeper history
-3. **inspect_page()** - Get DOM (might be redundant since always-provided, but keeps extensibility)
-4. **check_page_ready()** - Verify page loaded
-**Data Management:**
-5. **extract_data({selector, dataName})** - Save data for later
-**Recovery Tools (Self-Unstuck):**
-6. **navigate_back()** - Go back in browser history (if exploratory action had side effects)
-7. **refresh_page()** - Reload current page (if page in bad state)
-8. **navigate_to_url({url})** - Navigate to specific URL (validate it's within allowed domain)
-**Inquisitive Exploration (Phase 2):**
-9. **explore_element({action, selector, purpose})** - Investigate ambiguous elements
-   - Actions: "hover", "click_info", "click_menu", "focus"
-   - Non-consequential only (no form submit, delete, etc.)
-   - Returns: {success, screenshotTaken, observation}
-**Extensible**: Add new tools → auto-appear in prompt
-**Recovery scenarios:**
-- Agent realizes it navigated away from domain → `navigate_back()` or `navigate_to_url({baseUrl})`
-- Page got stuck/unresponsive → `refresh_page()`
-- Exploratory action opened modal/overlay → `navigate_back()` or refresh
-- Lost context after redirect → `navigate_to_url()` to known good state
-### 3. Agent Decision (What Agent Outputs)
-```typescript
-{
-  // 1. Tool requests (multiple allowed)
-  toolCalls: [
-    {name: "take_screenshot", params: {isFullPage: false}},
-    {name: "recall_history", params: {maxSteps: 5}}
-  ],
-  toolReasoning: "Need visual + history to understand pattern",
-  needsToolResults: true,  // Wait for tools before commands
-  // 2. Command batch (executed sequentially)
-  commands: [
-    "await page.fill('#email', 'alice@example.com')",
-    "await page.fill('#password', 'TestPass123')",
-    "await page.click('button[type=\"submit\"]')"
-  ],
-  commandReasoning: "Batch entire login flow",
-  // 3. Self-reflection (FREE-FORM guidance to next iteration)
-  selfReflection: {
-    guidanceForNext: "After submit, check URL changed to /dashboard",
-    detectingLoop: false,  // Agent signals if repeating same approach
-    loopReasoning: null
-  },
-  // 4. Learnings (stored in experiences)
-  experiences: [
-    "Forms use #id selectors consistently",
-    "Login redirects immediately after submit"
-  ],
-  // 5. Memory update (what to store)
-  memoryUpdate: {
-    action: "Filled login form and submitted",
-    observation: "Page redirected to dashboard",
-    extractedData: {userEmail: "alice@example.com"}
-  },
-  // 6. Termination decision
-  status: "complete",  // or "stuck" | "infeasible" | "continue"
-  statusReasoning: "Login completed, dashboard visible",
-  reasoning: "Overall iteration reasoning"
-}
-```
-### 4. Execution: Sequential with Early Stop
-**Agent plans batch, system executes sequentially:**
-```
-Agent: commands = [cmd1, cmd2, cmd3]
-Execute:
-  cmd1 → SUCCESS ✓ (record in history)
-  cmd2 → SUCCESS ✓ (record in history)
-  cmd3 → FAIL ✗ (record in history, STOP)
-Result: 2/3 succeeded, accurately tracked
-```
-### 5. Comprehensive Logging
-**Every iteration logs:**
-```
-[Orchestrator] === Iteration 2/8 ===
-[Orchestrator] 🎯 Current Goal: Login with alice@example.com, TestPass123
-[Orchestrator] 📍 Progress: Step 2/6
-[Orchestrator] 💭 Reasoning: Form fields located, batching fill operations
-[Orchestrator] 🧠 Previous Guidance: Check for redirect after submit
-[Orchestrator] 🔧 Tools: [take_screenshot (viewport)]
-[Orchestrator] 📋 Tool Reasoning: Visual check for login button state
-[Orchestrator] ✓ Tools executed
-[Orchestrator] 📝 Commands (3): fill email, fill password, click submit
-[Orchestrator] 💡 Batch Reasoning: Can execute entire login in one go
-[Orchestrator] ▶ Executing sequentially:
-[Orchestrator]   ✓ [1/3] await page.fill('#email', 'alice@example.com')
-[Orchestrator]   ✓ [2/3] await page.fill('#password', 'TestPass123')
-[Orchestrator]   ✓ [3/3] await page.click('button[type="submit"]')
-[Orchestrator] 📚 Experiences: Site uses #id for forms, Login redirects to /dashboard
-[Orchestrator] 🧠 Next Guidance: Verify dashboard loaded, check for user menu
-[Orchestrator] 🔄 Loop Detection: false
-[Orchestrator] 🎯 Status: continue
-[Orchestrator] 💭 Status Reasoning: Commands executed, need to verify navigation
-```
----
-## Inquisitive Exploration (Phase 2)
-### Problem: Ambiguous UI Elements
-**Scenario**: Agent needs to click "Dashboard" but menu items are icon-only (no text, no clear ARIA labels)
-**Solution**: Agent investigates non-consequentially before committing to actions
-### How It Works
-```typescript
-// ITERATION N - Agent realizes it needs more info:
-{
-  "toolCalls": [
-    {
-      "name": "explore_element",
-      "params": {
-        "action": "hover",
-        "selector": "nav button:nth-child(2)",
-        "purpose": "Check tooltip to see if this is Dashboard"
-      }
-    }
-  ],
-  "toolReasoning": "Menu items are icons without labels, need to hover to see tooltips",
-  "needsToolResults": true  // Agent waits for tool results before continuing
-}
-// System executes tool:
-1. Hover over element
-2. Wait 500ms for tooltip
-3. Take screenshot
-4. Call agent AGAIN with screenshot to analyze it
-5. Agent responds with analysis: "Tooltip shows 'Dashboard' text"
-6. System stores learning in history (TEXT, not screenshot):
-   {
-     action: "Explored menu icon (hover)",
-     code: "explore_element(...)",
-     result: "success",
-     observation: "Tooltip appeared showing 'Dashboard', confirmed this is the Dashboard button"
-   }
-7. Tool returns to original agent call: {success: true, learning: "Tooltip says Dashboard"}
-// SAME ITERATION N - Agent receives tool result (TEXT learning, no screenshot):
-{
-  "toolResults": {
-    "explore_element": {
-      "success": true,
-      "learning": "Tooltip appeared showing 'Dashboard', confirmed this is the Dashboard button"
-    }
-  }
-}
-// Agent now has confidence to proceed with commands:
-{
-  "commands": ["await page.click('nav button:nth-child(2)')"],
-  "commandReasoning": "Exploration confirmed this is Dashboard button via tooltip"
-}
-```
-**Key: Screenshot analyzed immediately, only TEXT learnings stored/passed forward**
-### Allowed Exploration Actions
-**Non-consequential only:**
-1. **hover** - Show tooltips, menus, dropdowns
-   - Safe: Doesn't change state
-   - Use: Reveal hidden info
-2. **click_info** - Click info icons, help buttons
-   - Safe: Usually opens modal/tooltip
-   - Use: Get more context
-   - Risk: Modal might block page (can navigate_back)
-3. **click_menu** - Click menu headers to reveal items
-   - Safe: Just expands menu
-   - Use: See menu options
-   - Risk: Menu might navigate (rare)
-4. **focus** - Focus on input to see placeholder/validation
-   - Safe: Just focuses element
-   - Use: See input hints
-**NOT allowed:**
-- ❌ Submit forms
-- ❌ Delete/remove actions
-- ❌ Purchase/confirm buttons
-- ❌ Logout
-- ❌ File uploads
-- ❌ Navigation links (unless explicitly exploratory)
-### Exploration Workflow
-```
-Iteration N:
-  Agent Decision 1: "Need to explore menu icons"
-  Tool: explore_element(hover, selector)
-  System executes:
-    → Hover element
-    → Take screenshot
-    → Call agent with screenshot: "Analyze this, what do you see?"
-  Agent Decision 2 (sub-call): "I see tooltip says 'Dashboard'"
-  System stores:
-    → History entry (TEXT): "Explored button, tooltip shows Dashboard"
-    → No screenshot stored
-  System returns to Agent Decision 1:
-    → Tool result: {success: true, learning: "Tooltip shows Dashboard"}
-  Agent Decision 1 continues: "Great! Now I can click it"
-  Commands: [click that element]
-  System executes commands sequentially
-```
-**Key: Screenshot analyzed IMMEDIATELY within same iteration, only TEXT learning stored**
-**Benefits:**
-- No screenshots stored in memory (saves tokens)
-- Immediate feedback (no waiting for next iteration)
-- Structured learning extraction
-- Future iterations only see concise text observations
-### Guardrails
-```typescript
-const config = {
-  // Per iteration
-  maxExploratoryActionsPerIteration: 3,  // Can explore up to 3 elements per iteration
-  // Per step
-  maxExploratoryActionsPerStep: 10,  // Total exploration budget per step
-  // Safety
-  explorationTimeout: 2000,  // Max wait for tooltip/menu
-  allowedExplorationActions: ['hover', 'click_info', 'click_menu', 'focus']
-}
-// System enforces:
-if (explorationAction.action.startsWith('click') && explorationAction.selector.includes('submit')) {
-  logger.error('SYSTEM: Exploratory click on submit button blocked');
-  return {success: false, reason: 'unsafe_action'};
-}
-if (explorationCount > config.maxExploratoryActionsPerStep) {
-  logger.warn('SYSTEM: Exploration budget exhausted');
-  // Remove explore_element from available tools
-}
-```
-### State Validation
-**Before exploration:**
-```typescript
-const beforeState = {
-  url: page.url(),
-  modalCount: await page.locator('[role="dialog"]').count()
-};
-```
-**After exploration:**
-```typescript
-const afterState = {
-  url: page.url(),
-  modalCount: await page.locator('[role="dialog"]').count()
-};
-// Check for unexpected navigation
-if (beforeState.url !== afterState.url) {
-  logger.warn('Exploration caused navigation, reverting');
-  await page.goBack();
-}
-// Modal opened (might be intended)
-if (afterState.modalCount > beforeState.modalCount) {
-  observation = "Modal opened (may need to close)";
-}
-```
-### When Agent Uses Exploration
-**Agent decides based on:**
-- ❓ DOM shows icons without text
-- ❓ Multiple similar elements, unclear which is correct
-- ❓ Need to see menu contents before deciding
-- ❓ Input field needs to show validation rules
-**Example agent reasoning:**
-```
-"DOM shows 5 icon buttons without labels. Need to hover over each
-to see tooltips and identify which is Dashboard. Will explore
-buttons 1-3 this iteration."
-```
-### Why Phase 2 (Not MVP)
-**Reasons to defer:**
-1. **Safety risk** - Need robust validation to prevent state changes
-2. **Complexity** - Requires screenshot handling, state comparison
-3. **Edge cases** - Modals, overlays, navigation need careful handling
-4. **Testing needed** - Validate on multiple sites before including
-**MVP workaround:**
-- Agent uses `take_screenshot()` + DOM analysis
-- Makes best guess from available info
-- If wrong, retry with different approach
-- Less elegant but safer
-**Add in Phase 2 when:**
-- MVP validated and working
-- Identified common patterns where exploration helps
-- State validation logic battle-tested
----
-## Guardrails
-### System-Enforced (Hard Limits)
-```typescript
-const config = {
-  // Per-step
-  maxIterationsPerStep: 8,
-  maxToolCallsPerIteration: 5,
-  maxCommandsPerIteration: 5,
-  // Scenario-wide
-  maxConsecutiveStepFailures: 2,
-  maxTotalIterations: 50,  // Across all steps
-  // Memory
-  maxExperiences: 20,
-  maxHistorySize: 100,
-  recentStepsCount: 7  // How many in always-provided
-}
-// System checks BEFORE and AFTER agent call
-if (iteration > config.maxIterationsPerStep) {
-  logger.warn('SYSTEM: Iteration limit reached');
-  return {success: false, reason: 'system_limit'};
-}
-```
-### Agent Self-Awareness (Soft Signals)
-```typescript
-// Agent can signal issues:
-{
-  "status": "stuck",
-  "statusReasoning": "Tried 4 different selectors, none work - element likely doesn't exist"
-}
-{
-  "selfReflection": {
-    "detectingLoop": true,
-    "loopReasoning": "I've tried text-based selectors 3 times, need completely different approach"
-  }
-}
-// System respects agent signals but also enforces hard limits
-```
-**No screenshot budget** - Agent can use screenshots freely
----
-## MVP Scope (Complete Feature Set)
-### Included in MVP:
-- ✅ OrchestratorAgent with tool-use
-- ✅ Dynamic ToolRegistry
-- ✅ **8 core tools**:
-  - Information: take_screenshot, recall_history, inspect_page, check_page_ready
-  - Data: extract_data
-  - Recovery: navigate_back, refresh_page, navigate_to_url
-- ✅ Journey memory (history + experiences + extracted data)
-- ✅ Always-provided context (goal + DOM + recent 7 steps + self-reflection)
-- ✅ **Free-form self-reflection** (train of thought continuity)
-- ✅ **Agent loop detection** (detectingLoop flag)
-- ✅ **Self-recovery** (navigate_back, refresh_page when stuck)
-- ✅ Batch command planning (max 3-5)
-- ✅ Sequential execution (stop on failure)
-- ✅ Experience accumulation
-- ✅ Configurable guardrails (per job)
-- ✅ **Token usage tracking** (input/output/image)
-- ✅ **Comprehensive logging** (all thoughts visible)
-- ✅ Works for generation AND repair modes
-### Excluded (Phase 2):
-- ❌ Exploratory actions (explore_element tool - safety concerns)
-- ❌ Advanced optimizations (caching, adaptive limits)
-- ❌ Memory summarization
----
-## Key Decisions
-1. ✅ **Self-reflection in MVP** - Valuable for continuity, agent detects own loops
-2. ✅ **No screenshot budget** - Use freely when helpful
-3. ✅ **DOM always-provided** - Already compact via getEnhancedPageInfo
-4. ✅ **Recent 6-7 steps** - Enough context without bloat
-5. ✅ **Agent + system guardrails** - Agent signals, system enforces
-6. ✅ **Sequential batch execution** - Plan together, execute one-by-one
-7. ✅ **Repair mode support** - Script → Agent on failure → Script
----
-## Why This Works
-**Agent maintains train of thought:**
-- Iteration 1: "Try #id selectors" → succeeds
-- Iteration 2 self-reflection: "IDs worked, continue using them"
-- Iteration 3: Uses IDs again → succeeds faster
-**Agent detects spirals:**
-- Iteration 1-2-3: Tries text selectors, all fail
-- Iteration 4 self-reflection: detectingLoop=true, "Text doesn't work, switching to IDs"
-- Breaks own loop before system limit
-**Human-like:**
-- Remember what just happened (recent steps)
-- Learn patterns (experiences)
-- Maintain train of thought (self-reflection)
-- Know when stuck (loop detection)
-- Use tools when needed (screenshot, history)
----
-## Ready to Implement
-**Full MVP specification complete** with:
-- All architecture decisions finalized
-- Self-reflection included with loop detection
-- No screenshot budgeting
-- DOM optimization validated
-- Comprehensive logging defined
-- Repair mode integration planned
-- Guardrails configured
-**Estimated implementation**: 2-3 weeks for complete MVP
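
The "Execution: Sequential with Early Stop" section of the deleted plan doc describes running a planned command batch one command at a time and stopping at the first failure. The following is a minimal sketch of that idea, not the package's actual implementation. The names `CommandOutcome`, `BatchResult`, and `executeBatch` are hypothetical, and the injected `run` callback is synchronous only to keep the example self-contained; a Playwright-backed runner would presumably be async and awaited.

```typescript
// Hypothetical types for illustration; not taken from the package source.
type CommandOutcome = { command: string; ok: boolean; error?: string };
type BatchResult = { outcomes: CommandOutcome[]; succeeded: number; total: number };

// Runs commands in order, records each outcome, and stops at the first failure.
// `run` is an injected executor that throws when a command fails (assumption).
function executeBatch(commands: string[], run: (command: string) => void): BatchResult {
  const outcomes: CommandOutcome[] = [];
  for (const command of commands) {
    try {
      run(command);
      outcomes.push({ command, ok: true });
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      outcomes.push({ command, ok: false, error });
      break; // early stop: later commands are never attempted
    }
  }
  const succeeded = outcomes.filter((o) => o.ok).length;
  return { outcomes, succeeded, total: commands.length };
}

// Usage mirroring the plan doc's example: cmd3 fails, so 2 of 3 succeed.
const result = executeBatch(["cmd1", "cmd2", "cmd3"], (cmd) => {
  if (cmd === "cmd3") throw new Error("element not found");
});
console.log(`${result.succeeded}/${result.total} succeeded`); // prints "2/3 succeeded"
```

Recording the failed command in `outcomes` before breaking keeps the history accurate, which matches the plan doc's "record in history, STOP" step.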