npm - @debugg-ai/debugg-ai-mcp - Versions diffs - 2.4.1 → 2.5.0 - Mend

@debugg-ai/debugg-ai-mcp 2.4.1 → 2.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (20)

package/README.md +21 -1
package/dist/handlers/index.js +1 -0
package/dist/handlers/probePageHandler.js +275 -0
package/dist/handlers/searchEnvironmentsHandler.js +12 -2
package/dist/handlers/testPageChangesHandler.js +149 -70
package/dist/handlers/triggerCrawlHandler.js +65 -21
package/dist/services/ngrok/tunnelManager.js +46 -7
package/dist/services/ngrok/tunnelRegistry.js +39 -5
package/dist/services/ngrok/types.js +0 -1
package/dist/tools/index.js +3 -0
package/dist/tools/probePage.js +89 -0
package/dist/types/index.js +17 -0
package/dist/utils/errors.js +0 -1
package/dist/utils/harSummarizer.js +105 -0
package/dist/utils/projectAnalyzer.js +2 -2
package/dist/utils/telemetry.js +1 -0
package/dist/utils/transientErrors.js +82 -0
package/dist/utils/urlParser.js +1 -1
package/dist/utils/validation.js +1 -1
package/package.json +1 -1

package/README.md CHANGED

@@ -34,7 +34,7 @@ docker run -i --rm --init -e DEBUGGAI_API_KEY=your_api_key quinnosha/debugg-ai-m
 ## Tools
-The server exposes **11** tools grouped into Browser (2), Search (3), Projects (3), and Environments (3). The headline tool is `check_app_in_browser`; the rest manage projects, environments + their credentials, and execution history through a uniform `search_*` + CRUD pattern.
+The server exposes **12** tools grouped into Browser (3), Search (3), Projects (3), and Environments (3). The headline tools are `check_app_in_browser` (full AI agent) and `probe_page` (lightweight no-LLM page probe); the rest manage projects, environments + their credentials, and execution history through a uniform `search_*` + CRUD pattern.
 ### Browser
@@ -75,6 +75,26 @@ URLs are short-lived presigned S3 — refetch the parent execution via `search_e
 Fires a server-side browser-agent crawl to populate the project's knowledge graph. Localhost URLs tunnel automatically. Returns `{executionId, status, targetUrl, durationMs, outcome?, crawlSummary?, knowledgeGraph?, browserSession?}` with `knowledgeGraph.imported === true` on successful ingestion. The `browserSession` block (HAR + console-log URLs, same shape as above) is also present on completed crawls.
+#### `probe_page`
+**Lightweight no-LLM batch page probe.** Pass 1-20 URLs; each navigates, waits for load, and returns rendered state — screenshot + page metadata + structured console errors + network summary. No agent loop, no LLM cost, no scenario assertions. Use it for "did I just break /settings?", multi-route smoke after a refactor, CI per-PR sweeps, and quick is-it-up checks where `check_app_in_browser`'s 60-150s agent loop is overkill.
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `targets` | array **required** | 1-20 entries: `[{url, waitForSelector?, waitForLoadState?, timeoutMs?}]` |
+| `targets[].url` | string **required** | Public URL or localhost (auto-tunneled) |
+| `targets[].waitForLoadState` | enum | `'load'` (default) / `'domcontentloaded'` / `'networkidle'` |
+| `targets[].waitForSelector` | string | Optional CSS selector to wait for after navigation |
+| `targets[].timeoutMs` | number | Per-URL timeout, 1000-30000 (default 10000) |
+| `includeHtml` | boolean | Return raw HTML in each result (default false) |
+| `captureScreenshots` | boolean | Return one PNG per target (default true) |
+The whole batch shares a single backend execution + browser session + tunnel — 5 URLs in one call is dramatically faster than 5 parallel single-URL calls. Per-URL `error` field preserves batch resilience: a single failed target doesn't fail the others.
+**`networkSummary` aggregation key is `origin + pathname`** — refetch loops (`?n=0..4` repeatedly hitting the same endpoint) collapse into a single entry with the count, so `/api/poll` showing up with `count: 47` is the actionable "infinite refetch loop" signal users originally asked for.
+Performance budget: <10s for 1 URL, <25s for 20. Localhost dead-port returns `LocalServerUnreachable` in <2s without burning a workflow execution.
 ### Search (dual-mode: uuid detail OR filtered list)
 Each `search_*` tool has two modes. Pass `{uuid}` for a single-record detail response. Pass filter params for a paginated summary list. 404 from the backend surfaces as `isError: true` with `{error: 'NotFound', message, uuid}`.
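
The README hunk above says `networkSummary` is keyed on `origin + pathname`, so requests that differ only in their query string collapse into one entry with a count. A minimal sketch of that idea follows. It is not the package's `summarizeHar` implementation: the helper name `aggregateByOriginAndPath` and the `{url, count}` result shape are assumptions made for illustration.

```javascript
// Hypothetical helper (not from the package): groups request URLs by
// origin + pathname, ignoring query strings and fragments, and counts
// how many requests fall under each key.
function aggregateByOriginAndPath(requestUrls) {
    const counts = new Map();
    for (const raw of requestUrls) {
        let key;
        try {
            const u = new URL(raw);
            key = u.origin + u.pathname;
        } catch {
            key = raw; // unparseable URL: fall back to the raw string
        }
        counts.set(key, (counts.get(key) ?? 0) + 1);
    }
    // Highest counts first, so a refetch loop surfaces at the top.
    return [...counts.entries()]
        .map(([url, count]) => ({ url, count }))
        .sort((a, b) => b.count - a.count);
}

const summary = aggregateByOriginAndPath([
    'http://localhost:3000/api/poll?n=0',
    'http://localhost:3000/api/poll?n=1',
    'http://localhost:3000/api/poll?n=2',
    'http://localhost:3000/settings',
]);
console.log(summary);
```

In this sketch the three `/api/poll` requests end up in a single entry with `count: 3`, which is the kind of signal the README describes for spotting a refetch loop.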

package/dist/handlers/index.js CHANGED

@@ -1,5 +1,6 @@
 export * from './testPageChangesHandler.js';
 export * from './triggerCrawlHandler.js';
+export * from './probePageHandler.js';
 export * from './searchProjectsHandler.js';
 export * from './searchEnvironmentsHandler.js';
 export * from './searchExecutionsHandler.js';

package/dist/handlers/probePageHandler.js ADDED

@@ -0,0 +1,275 @@
+/**
+ * probePageHandler — lightweight no-LLM batch page probe.
+ *
+ * Mirrors triggerCrawlHandler's 4-step pattern (find template → execute →
+ * poll → format response) but: (a) takes a list of targets and produces a
+ * list of results, (b) does no agent steps (zero LLM in critical path),
+ * (c) MCP-side aggregates per-target HAR slices into NetworkSummary[].
+ *
+ * The backend "Page Probe" workflow template runs:
+ *   browser.setup → loop[targets](page.navigate → page.capture) → done
+ *
+ * Each page.capture node emits per-iteration outputData with consoleSlice
+ * + harSlice windowed to that URL's load span — that's what makes per-URL
+ * networkSummary attribution accurate.
+ */
+import { config } from '../config/index.js';
+import { Logger } from '../utils/logger.js';
+import { handleExternalServiceError } from '../utils/errors.js';
+import { imageContentBlock } from '../utils/imageUtils.js';
+import { DebuggAIServerClient } from '../services/index.js';
+import { TunnelProvisionError } from '../services/tunnels.js';
+import { tunnelManager } from '../services/ngrok/tunnelManager.js';
+import { probeLocalPort, probeTunnelHealth } from '../utils/localReachability.js';
+import { extractLocalhostPort } from '../utils/urlParser.js';
+import { buildContext, findExistingTunnel, ensureTunnel, sanitizeResponseUrls, touchTunnelById, } from '../utils/tunnelContext.js';
+import { getCachedTemplateUuid, invalidateTemplateCache } from '../utils/handlerCaches.js';
+import { summarizeHar, summarizeConsole } from '../utils/harSummarizer.js';
+const logger = new Logger({ module: 'probePageHandler' });
+const TEMPLATE_KEYWORD = 'page probe';
+export async function probePageHandler(input, context, rawProgressCallback) {
+    const startTime = Date.now();
+    logger.toolStart('probe_page', input);
+    // Bead 0bq: progress circuit-breaker — see testPageChangesHandler for rationale.
+    let progressDisabled = false;
+    const progressCallback = rawProgressCallback
+        ? async (update) => {
+            if (progressDisabled)
+                return;
+            try {
+                await rawProgressCallback(update);
+            }
+            catch (err) {
+                progressDisabled = true;
+                logger.warn('Progress emission failed; disabling further emissions for this request', {
+                    error: err instanceof Error ? err.message : String(err),
+                });
+            }
+        }
+        : undefined;
+    const client = new DebuggAIServerClient(config.api.key);
+    await client.init();
+    const abortController = new AbortController();
+    const onStdinClose = () => {
+        abortController.abort();
+        progressDisabled = true;
+    };
+    process.stdin.once('close', onStdinClose);
+    // Per-target tunnel contexts. Index aligns with input.targets[].
+    const targetContexts = [];
+    // Tunnel keys we provisioned this call (for cleanup if creation fails after key acquired).
+    const acquiredKeyIds = [];
+    // Progress budget: 1 pre-flight + 1 template + 1 execute + N per-target captures + 1 done
+    const TOTAL_STEPS = 3 + input.targets.length + 1;
+    let progressStep = 0;
+    try {
+        if (progressCallback) {
+            await progressCallback({ progress: ++progressStep, total: TOTAL_STEPS, message: `Pre-flight + tunnel setup (${input.targets.length} target${input.targets.length === 1 ? '' : 's'})...` });
+        }
+        // ── Per-target pre-flight + tunnel resolution ──────────────────────────
+        for (const target of input.targets) {
+            const ctx = buildContext(target.url);
+            if (ctx.isLocalhost) {
+                // Pre-flight TCP probe: fail fast if dev server isn't listening.
+                const port = extractLocalhostPort(ctx.originalUrl);
+                if (typeof port === 'number') {
+                    const probe = await probeLocalPort(port);
+                    if (!probe.reachable) {
+                        const payload = {
+                            error: 'LocalServerUnreachable',
+                            message: `No server listening on 127.0.0.1:${port}. Start your dev server on that port before running probe_page. Probe result: ${probe.code} (${probe.detail ?? 'no detail'}).`,
+                            detail: {
+                                port,
+                                probeCode: probe.code,
+                                probeDetail: probe.detail,
+                                elapsedMs: probe.elapsedMs,
+                            },
+                        };
+                        logger.warn(`Pre-flight port probe failed for ${ctx.originalUrl}: ${probe.code} in ${probe.elapsedMs}ms`);
+                        return { content: [{ type: 'text', text: JSON.stringify(payload, null, 2) }], isError: true };
+                    }
+                }
+                // Reuse existing tunnel for this port if any; otherwise provision.
+                const reused = findExistingTunnel(ctx);
+                if (reused) {
+                    targetContexts.push(reused);
+                }
+                else {
+                    let tunnel;
+                    try {
+                        tunnel = await client.tunnels.provisionWithRetry();
+                    }
+                    catch (provisionError) {
+                        const msg = provisionError instanceof Error ? provisionError.message : String(provisionError);
+                        const diag = provisionError instanceof TunnelProvisionError ? ` ${provisionError.diagnosticSuffix()}` : '';
+                        throw new Error(`Failed to provision tunnel for ${ctx.originalUrl}. ` +
+                            `(Detail: ${msg})${diag}`);
+                    }
+                    acquiredKeyIds.push(tunnel.keyId);
+                    let tunneled;
+                    try {
+                        tunneled = await ensureTunnel(ctx, tunnel.tunnelKey, tunnel.tunnelId, tunnel.keyId, () => client.revokeNgrokKey(tunnel.keyId));
+                    }
+                    catch (tunnelError) {
+                        const msg = tunnelError instanceof Error ? tunnelError.message : String(tunnelError);
+                        throw new Error(`Tunnel creation failed for ${ctx.originalUrl}. (Detail: ${msg})`);
+                    }
+                    // Tunnel health probe: catch the IPv4/IPv6 bind / dead-server case
+                    // before committing to a full backend execution.
+                    if (tunneled.targetUrl) {
+                        const health = await probeTunnelHealth(tunneled.targetUrl);
+                        if (!health.healthy) {
+                            const payload = {
+                                error: 'TunnelTrafficBlocked',
+                                message: `Tunnel established but traffic isn't reaching the dev server. ${health.detail ?? ''}`,
+                                detail: {
+                                    code: health.code,
+                                    status: health.status,
+                                    ngrokErrorCode: health.ngrokErrorCode,
+                                    elapsedMs: health.elapsedMs,
+                                },
+                            };
+                            if (tunneled.tunnelId) {
+                                tunnelManager.stopTunnel(tunneled.tunnelId).catch((err) => logger.warn(`Failed to stop broken tunnel ${tunneled.tunnelId}: ${err}`));
+                            }
+                            return { content: [{ type: 'text', text: JSON.stringify(payload, null, 2) }], isError: true };
+                        }
+                    }
+                    targetContexts.push(tunneled);
+                }
+            }
+            else {
+                // Public URL — no tunnel needed.
+                targetContexts.push(ctx);
+            }
+        }
+        // ── Locate workflow template ───────────────────────────────────────────
+        if (progressCallback) {
+            await progressCallback({ progress: ++progressStep, total: TOTAL_STEPS, message: 'Locating page-probe workflow template...' });
+        }
+        const templateUuid = await getCachedTemplateUuid(TEMPLATE_KEYWORD, async (name) => {
+            return client.workflows.findTemplateByName(name);
+        });
+        if (!templateUuid) {
+            throw new Error(`Page Probe Workflow Template not found. ` +
+                `Ensure the backend has a template matching "${TEMPLATE_KEYWORD}" seeded and accessible.`);
+        }
+        // ── Build contextData (camelCase; axiosTransport snake_cases on the wire) ──
+        const contextData = {
+            targets: input.targets.map((t, i) => ({
+                url: targetContexts[i].targetUrl ?? t.url,
+                waitForSelector: t.waitForSelector,
+                waitForLoadState: t.waitForLoadState,
+                timeoutMs: t.timeoutMs,
+            })),
+            includeHtml: input.includeHtml,
+            captureScreenshots: input.captureScreenshots,
+        };
+        // ── Execute ────────────────────────────────────────────────────────────
+        if (progressCallback) {
+            await progressCallback({ progress: ++progressStep, total: TOTAL_STEPS, message: 'Queuing workflow execution...' });
+        }
+        const executeResponse = await client.workflows.executeWorkflow(templateUuid, contextData);
+        const executionUuid = executeResponse.executionUuid;
+        logger.info(`Probe execution queued: ${executionUuid}`);
+        // ── Poll ───────────────────────────────────────────────────────────────
+        let lastCompleted = -1;
+        const finalExecution = await client.workflows.pollExecution(executionUuid, async (exec) => {
+            // Keep all active tunnels alive during polling.
+            for (const tc of targetContexts) {
+                if (tc.tunnelId)
+                    touchTunnelById(tc.tunnelId);
+            }
+            if (!progressCallback)
+                return;
+            const completedNodes = (exec.nodeExecutions ?? []).filter(n => n.nodeType === 'page.capture' && n.status === 'success').length;
+            if (completedNodes !== lastCompleted) {
+                lastCompleted = completedNodes;
+                await progressCallback({
+                    progress: Math.min(progressStep + completedNodes, TOTAL_STEPS - 1),
+                    total: TOTAL_STEPS,
+                    message: `Probed ${completedNodes}/${input.targets.length} target${input.targets.length === 1 ? '' : 's'}...`,
+                });
+            }
+        }, abortController.signal);
+        // ── Format response ────────────────────────────────────────────────────
+        const duration = Date.now() - startTime;
+        const captureNodes = (finalExecution.nodeExecutions ?? [])
+            .filter(n => n.nodeType === 'page.capture')
+            .sort((a, b) => a.executionOrder - b.executionOrder);
+        const results = [];
+        const screenshotBlocks = [];
+        for (let i = 0; i < input.targets.length; i++) {
+            const target = input.targets[i];
+            const node = captureNodes[i];
+            const data = node?.outputData ?? {};
+            const result = {
+                url: target.url, // ORIGINAL caller URL — not the tunneled rewrite
+                finalUrl: typeof data.finalUrl === 'string' ? data.finalUrl : (typeof data.url === 'string' ? data.url : target.url),
+                statusCode: typeof data.statusCode === 'number' ? data.statusCode : 0,
+                title: typeof data.title === 'string' ? data.title : null,
+                loadTimeMs: typeof data.loadTimeMs === 'number' ? data.loadTimeMs : 0,
+                consoleErrors: summarizeConsole(Array.isArray(data.consoleSlice) ? data.consoleSlice : []),
+                networkSummary: summarizeHar(Array.isArray(data.harSlice) ? data.harSlice : []),
+            };
+            if (input.includeHtml && typeof data.html === 'string') {
+                result.html = data.html;
+            }
+            if (typeof data.error === 'string' && data.error) {
+                result.error = data.error;
+            }
+            results.push(result);
+            if (input.captureScreenshots && typeof data.screenshotB64 === 'string' && data.screenshotB64) {
+                screenshotBlocks.push(imageContentBlock(data.screenshotB64, 'image/png'));
+            }
+        }
+        const responsePayload = {
+            executionId: executionUuid,
+            durationMs: typeof finalExecution.durationMs === 'number' ? finalExecution.durationMs : duration,
+            results,
+        };
+        if (finalExecution.browserSession) {
+            responsePayload.browserSession = finalExecution.browserSession;
+        }
+        // Sanitize ngrok URLs from the entire payload — agent-authored strings in
+        // node outputData (titles, HTML, console messages from the page itself)
+        // can occasionally contain the tunnel URL; rewrite to the original
+        // localhost origin per tunnel context. For multi-localhost batches we
+        // run sanitize once per localhost target since each may have its own
+        // tunnel↔origin mapping.
+        let sanitizedPayload = responsePayload;
+        for (const tc of targetContexts) {
+            if (tc.isLocalhost) {
+                sanitizedPayload = sanitizeResponseUrls(sanitizedPayload, tc);
+            }
+        }
+        logger.toolComplete('probe_page', duration);
+        return {
+            content: [
+                { type: 'text', text: JSON.stringify(sanitizedPayload, null, 2) },
+                ...screenshotBlocks,
+            ],
+        };
+    }
+    catch (error) {
+        const duration = Date.now() - startTime;
+        logger.toolError('probe_page', error, duration);
+        if (error instanceof Error && (error.message.includes('not found') || error.message.includes('401'))) {
+            invalidateTemplateCache();
+        }
+        throw handleExternalServiceError(error, 'DebuggAI', 'probe_page execution');
+    }
+    finally {
+        process.stdin.removeListener('close', onStdinClose);
+        // Tunnels intentionally NOT torn down — reuse pattern (bead vwd) +
+        // 55-min idle auto-shutoff. Revoke only orphaned keys (we acquired the
+        // key but tunnel creation failed before ensureTunnel completed).
+        for (let i = 0; i < acquiredKeyIds.length; i++) {
+            const keyId = acquiredKeyIds[i];
+            const tc = targetContexts[i];
+            if (tc && !tc.tunnelId && keyId) {
+                client.revokeNgrokKey(keyId).catch(err => logger.warn(`Failed to revoke unused ngrok key ${keyId}: ${err}`));
+            }
+        }
+    }
+}
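
The handler above wraps `rawProgressCallback` in a one-way circuit breaker: the first failed emission sets `progressDisabled`, and every later update is skipped. The same pattern can be sketched in isolation as below. The factory name `makeGuardedProgress` and the `onFailure` hook are hypothetical and stand in for the handler's inline closure and `logger.warn` call.

```javascript
// Hypothetical standalone version of the progress circuit breaker.
// The first failure disables all later emissions for this request.
function makeGuardedProgress(rawCallback, onFailure = () => {}) {
    if (!rawCallback) return undefined;
    let disabled = false;
    return async (update) => {
        if (disabled) return;
        try {
            await rawCallback(update);
        } catch (err) {
            disabled = true;
            onFailure(err instanceof Error ? err.message : String(err));
        }
    };
}

// Demo: a callback that fails on its second invocation.
const seen = [];
const failures = [];
let calls = 0;
const guarded = makeGuardedProgress(async (update) => {
    calls++;
    if (calls === 2) throw new Error('client disconnected');
    seen.push(update.progress);
}, (msg) => failures.push(msg));

const demo = (async () => {
    await guarded({ progress: 1, total: 3 }); // delivered
    await guarded({ progress: 2, total: 3 }); // throws inside, swallowed
    await guarded({ progress: 3, total: 3 }); // skipped, breaker is open
    return { seen, failures, calls };
})();
demo.then((result) => console.log(result));
```

The breaker never resets within a request, which matches the handler's behavior of treating a failed emission as a sign that the client is no longer listening.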

package/dist/handlers/searchEnvironmentsHandler.js CHANGED

@@ -61,6 +61,11 @@ export async function searchEnvironmentsHandler(input, _context) {
         const client = new DebuggAIServerClient(config.api.key);
         await client.init();
         // ── Resolve projectUuid ──
+        // Bead gb4n: when projectUuid is provided directly (caller skips git
+        // auto-resolution), `name` and `repoName` are unknown. OMIT those fields
+        // rather than emitting nulls — null fields surprised callers and
+        // muddied the contract. If a caller needs them, they fetch via
+        // search_projects.
         let projectUuid = input.projectUuid;
         let project = null;
         if (!projectUuid) {
@@ -73,10 +78,15 @@ export async function searchEnvironmentsHandler(input, _context) {
                 return noProjectResolved(pagination, `No DebuggAI project found for repo "${repoName}". Pass projectUuid explicitly.`);
             }
             projectUuid = resolved.uuid;
-            project = { uuid: resolved.uuid, name: resolved.name, repoName: resolved.repo?.name ?? repoName };
+            project = { uuid: resolved.uuid };
+            if (resolved.name)
+                project.name = resolved.name;
+            const rn = resolved.repo?.name ?? repoName;
+            if (rn)
+                project.repoName = rn;
         }
         else {
-            project = { uuid: projectUuid, name: null, repoName: null };
+            project = { uuid: projectUuid };
         }
         // ── uuid mode ──
         if (input.uuid) {
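
The hunk above changes the `project` block to omit `name` and `repoName` when they are unknown instead of emitting `null`. A small sketch of the resulting shapes follows, using a hypothetical `buildProjectRef` helper that is not part of the package.

```javascript
// Hypothetical helper mirroring the omit-instead-of-null contract:
// unknown fields are left off the object entirely.
function buildProjectRef(uuid, name, repoName) {
    const project = { uuid };
    if (name) project.name = name;
    if (repoName) project.repoName = repoName;
    return project;
}

// projectUuid passed directly: only uuid is known.
const direct = buildProjectRef('abc-123');
// Resolved from git: name and repoName are available.
const resolved = buildProjectRef('abc-123', 'web', 'org/web');

console.log(JSON.stringify(direct));
console.log(JSON.stringify(resolved));
```

Callers that previously checked `project.name === null` would need to check for an absent key instead under this contract.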

package/dist/handlers/testPageChangesHandler.js CHANGED

@@ -15,8 +15,23 @@ import { tunnelManager } from '../services/ngrok/tunnelManager.js';
 import { probeLocalPort, probeTunnelHealth } from '../utils/localReachability.js';
 import { extractLocalhostPort } from '../utils/urlParser.js';
 import { getCachedTemplateUuid, getCachedProjectUuid, invalidateTemplateCache, invalidateProjectCache, } from '../utils/handlerCaches.js';
+import { isTransientWorkflowError, transientReasonTag } from '../utils/transientErrors.js';
+import { Telemetry, TelemetryEvents } from '../utils/telemetry.js';
 const logger = new Logger({ module: 'testPageChangesHandler' });
 const TEMPLATE_NAME = 'app evaluation';
+// Bead kbxy: bounded retry on known transient backend signatures (Pydantic
+// JSON parse errors, 502s, ECONNRESETs). Default 1 retry; env-overridable
+// up to 3 to balance reliability vs quota cost. Conservative: only retries
+// on documented transient patterns (utils/transientErrors.ts).
+function getMaxTransientRetries() {
+    const raw = process.env.DEBUGGAI_TRANSIENT_RETRIES;
+    if (raw === undefined || raw === '')
+        return 1;
+    const n = parseInt(raw, 10);
+    if (!Number.isFinite(n) || n < 0)
+        return 1;
+    return Math.min(n, 3);
+}
 // Concurrency control — max 2 simultaneous browser checks.
 // Additional requests queue and run when a slot opens.
 const MAX_CONCURRENT = 2;
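
`getMaxTransientRetries` in the hunk above defaults to 1 retry, falls back to 1 on unparseable or negative input, and caps the value at 3. The retry loop added in the following hunk waits `1000 * (attempt - 1)` ms before each retry. The sketch below restates both rules as pure functions so the edge cases are easy to check. The names `parseMaxRetries` and `backoffMs` are hypothetical, and the value is passed in as an argument rather than read from `process.env.DEBUGGAI_TRANSIENT_RETRIES`.

```javascript
// Hypothetical pure version of the retry-count parsing shown in the diff.
function parseMaxRetries(raw) {
    if (raw === undefined || raw === '') return 1; // unset: default of 1
    const n = parseInt(raw, 10);
    if (!Number.isFinite(n) || n < 0) return 1;    // garbage or negative: default
    return Math.min(n, 3);                         // hard cap of 3
}

// Linear backoff before a given 1-based attempt, as in the retry loop.
function backoffMs(attempt) {
    return 1000 * (attempt - 1);
}

for (const raw of [undefined, '', '0', '2', '10', '-1', 'abc']) {
    console.log(JSON.stringify(raw), '->', parseMaxRetries(raw));
}
console.log('delay before attempt 2:', backoffMs(2), 'ms');
console.log('delay before attempt 3:', backoffMs(3), 'ms');
```

Note that `'0'` is honored and disables retries entirely, while invalid values fall back to the default of 1 rather than to 0.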
@@ -229,88 +244,126 @@ async function testPageChangesHandlerInner(input, context, rawProgressCallback)
         if (progressCallback) {
             await progressCallback({ progress: 3, total: TOTAL_STEPS, message: 'Queuing workflow execution...' });
         }
-        const executeResponse = await client.workflows.executeWorkflow(templateUuid, contextData, Object.keys(env).length > 0 ? env : undefined);
-        const executionUuid = executeResponse.executionUuid;
-        logger.info(`Execution queued: ${executionUuid}`);
-        // --- Poll ---
-        // Progress phases:
+        // --- Execute + Poll (with bounded retry on transient errors, bead kbxy) ---
+        // Progress phases (per attempt):
         //   1-3:   MCP setup (tunnel, template, queue) — already sent above
         //   4-6:   Backend setup (trigger, browser.setup, subworkflow starting)
         //   7-27:  Agent steps (mapped from state.stepsTaken)
         //   28:    Complete
         const BACKEND_SETUP_END = 6;
-        let lastStepsTaken = 0;
-        let observedMaxSteps = MAX_EXEC_STEPS;
         const TERMINAL_STATUSES = new Set(['completed', 'failed', 'cancelled']);
-        const finalExecution = await client.workflows.pollExecution(executionUuid, async (exec) => {
-            // Keep the tunnel alive while the workflow is actively running
-            if (ctx.tunnelId)
-                touchTunnelById(ctx.tunnelId);
-            const nodes = exec.nodeExecutions ?? [];
-            const stepsTaken = Math.max(nodes.filter(n => n.nodeType === 'brain.step').length, exec.state?.stepsTaken ?? 0);
-            if (stepsTaken !== lastStepsTaken) {
-                lastStepsTaken = stepsTaken;
-                logger.info(`Execution status: ${exec.status}, nodes: ${nodes.length}, steps: ${stepsTaken}`);
-            }
-            if (!progressCallback)
-                return;
-            // Bead 0bq: emit the final "Complete:" progress INSIDE this callback
-            // when terminal status is detected. pollExecution will return on the
-            // next line (line 183 in services/workflows.ts), so there's no
-            // post-pollExecution progress emission that could race the response.
-            if (TERMINAL_STATUSES.has(exec.status)) {
-                const terminalOutcome = exec.state?.outcome ?? exec.status;
-                await progressCallback({
-                    progress: TOTAL_STEPS,
-                    total: TOTAL_STEPS,
-                    message: `Complete: ${terminalOutcome}`,
+        const MAX_RETRIES = getMaxTransientRetries();
+        let executeResponse;
+        let executionUuid = '';
+        let finalExecution;
+        let attempt = 0;
+        while (true) {
+            attempt++;
+            if (attempt > 1) {
+                // Retry path — emit telemetry + progress notification + brief backoff.
+                Telemetry.capture(TelemetryEvents.WORKFLOW_TRANSIENT_RETRY, {
+                    tool: 'check_app_in_browser',
+                    attempt,
+                    reason: transientReasonTag(finalExecution),
+                    previousExecutionId: executionUuid,
+                    previousErrorMessage: finalExecution?.errorMessage?.slice(0, 200),
+                    previousStateError: finalExecution?.state?.error?.slice(0, 200),
                 });
-                return;
-            }
-            // --- Compute progress number ---
-            let execProgress;
-            let message;
-            if (stepsTaken > 0) {
-                // Agent is actively stepping — map into slots 7..27
-                if (stepsTaken > observedMaxSteps)
-                    observedMaxSteps = stepsTaken + 5;
-                const stepSlots = TOTAL_STEPS - BACKEND_SETUP_END - 1; // 21 slots
-                execProgress = BACKEND_SETUP_END + Math.max(1, Math.round((stepsTaken / observedMaxSteps) * stepSlots));
-                execProgress = Math.min(execProgress, TOTAL_STEPS - 1);
-                // Use state.currentAction for the message (backend sends intent + actionType)
-                const ca = exec.state?.currentAction;
-                if (ca?.intent) {
-                    const action = ca.actionType ?? ca.action_type ?? 'working';
-                    message = `Step ${stepsTaken}: [${action}] ${ca.intent}`;
-                }
-                else {
-                    message = `Agent evaluating... (step ${stepsTaken})`;
+                if (progressCallback) {
+                    await progressCallback({
+                        progress: SETUP_STEPS,
+                        total: TOTAL_STEPS,
+                        message: `Transient backend error — retrying (attempt ${attempt}/${MAX_RETRIES + 1})...`,
+                    });
                 }
+                await new Promise(r => setTimeout(r, 1000 * (attempt - 1)));
             }
-            else {
-                // No agent steps yet — show backend setup progress from node transitions
-                const hasSubworkflow = nodes.some(n => n.nodeType === 'subworkflow.run');
-                const hasBrowserSetup = nodes.some(n => n.nodeType === 'browser.setup');
-                const browserReady = nodes.some(n => n.nodeType === 'browser.setup' && n.status === 'success');
-                if (browserReady || hasSubworkflow) {
-                    execProgress = BACKEND_SETUP_END;
-                    message = 'Browser ready, agent starting...';
+            executeResponse = await client.workflows.executeWorkflow(templateUuid, contextData, Object.keys(env).length > 0 ? env : undefined);
+            executionUuid = executeResponse.executionUuid;
+            logger.info(`Execution queued: ${executionUuid}${attempt > 1 ? ` (retry ${attempt - 1}/${MAX_RETRIES})` : ''}`);
+            // Closure state — reset PER ATTEMPT so progress numbers don't double-count
+            // across retries.
+            let lastStepsTaken = 0;
+            let observedMaxSteps = MAX_EXEC_STEPS;
+            finalExecution = await client.workflows.pollExecution(executionUuid, async (exec) => {
+                // Keep the tunnel alive while the workflow is actively running
+                if (ctx.tunnelId)
+                    touchTunnelById(ctx.tunnelId);
+                const nodes = exec.nodeExecutions ?? [];
+                const stepsTaken = Math.max(nodes.filter(n => n.nodeType === 'brain.step').length, exec.state?.stepsTaken ?? 0);
+                if (stepsTaken !== lastStepsTaken) {
+                    lastStepsTaken = stepsTaken;
+                    logger.info(`Execution status: ${exec.status}, nodes: ${nodes.length}, steps: ${stepsTaken}`);
                 }
-                else if (hasBrowserSetup) {
-                    execProgress = SETUP_STEPS + 2;
-                    message = 'Launching browser...';
+                if (!progressCallback)
+                    return;
+                // Bead 0bq: emit the final "Complete:" progress INSIDE this callback
+                // when terminal status is detected. pollExecution will return on the
+                // next line (line 183 in services/workflows.ts), so there's no
+                // post-pollExecution progress emission that could race the response.
+                if (TERMINAL_STATUSES.has(exec.status)) {
+                    const terminalOutcome = exec.state?.outcome ?? exec.status;
+                    await progressCallback({
+                        progress: TOTAL_STEPS,
+                        total: TOTAL_STEPS,
+                        message: `Complete: ${terminalOutcome}`,
+                    });
+                    return;
                 }
-                else if (nodes.length > 0) {
-                    execProgress = SETUP_STEPS + 1;
-                    message = 'Workflow triggered, preparing...';
+                // --- Compute progress number ---
+                let execProgress;
+                let message;
+                if (stepsTaken > 0) {
+                    // Agent is actively stepping — map into slots 7..27
+                    if (stepsTaken > observedMaxSteps)
+                        observedMaxSteps = stepsTaken + 5;
+                    const stepSlots = TOTAL_STEPS - BACKEND_SETUP_END - 1; // 21 slots
+                    execProgress = BACKEND_SETUP_END + Math.max(1, Math.round((stepsTaken / observedMaxSteps) * stepSlots));
+                    execProgress = Math.min(execProgress, TOTAL_STEPS - 1);
+                    // Use state.currentAction for the message (backend sends intent + actionType)
+                    const ca = exec.state?.currentAction;
+                    if (ca?.intent) {
+                        const action = ca.actionType ?? ca.action_type ?? 'working';
+                        message = `Step ${stepsTaken}: [${action}] ${ca.intent}`;
+                    }
+                    else {
+                        message = `Agent evaluating... (step ${stepsTaken})`;
+                    }
                 }
                 else {
-                    execProgress = SETUP_STEPS + 1;
-                    message = 'Waiting for execution to start...';
+                    // No agent steps yet — show backend setup progress from node transitions
+                    const hasSubworkflow = nodes.some(n => n.nodeType === 'subworkflow.run');
+                    const hasBrowserSetup = nodes.some(n => n.nodeType === 'browser.setup');
+                    const browserReady = nodes.some(n => n.nodeType === 'browser.setup' && n.status === 'success');
+                    if (browserReady || hasSubworkflow) {
+                        execProgress = BACKEND_SETUP_END;
+                        message = 'Browser ready, agent starting...';
+                    }
+                    else if (hasBrowserSetup) {
+                        execProgress = SETUP_STEPS + 2;
+                        message = 'Launching browser...';
+                    }
+                    else if (nodes.length > 0) {
+                        execProgress = SETUP_STEPS + 1;
+                        message = 'Workflow triggered, preparing...';
+                    }
+                    else {
+                        execProgress = SETUP_STEPS + 1;
+                        message = 'Waiting for execution to start...';
+                    }
                 }
-            }
-            await progressCallback({ progress: execProgress, total: TOTAL_STEPS, message });
-        }, abortController.signal);
+                await progressCallback({ progress: execProgress, total: TOTAL_STEPS, message });
+            }, abortController.signal);
+            // Decide retry vs exit: only retry on documented transient signatures
+            // AND while we still have budget. Otherwise break and surface whatever
+            // result the agent reached.
+            if (attempt > MAX_RETRIES)
+                break;
+            if (!isTransientWorkflowError(finalExecution))
+                break;
+            logger.warn(`Transient backend error detected (${transientReasonTag(finalExecution) ?? 'unknown'}) — ` +
+                `retrying (attempt ${attempt + 1}/${MAX_RETRIES + 1})`);
+        }
         const duration = Date.now() - startTime;
         // --- Format result ---
         const outcome = finalExecution.state?.outcome ?? finalExecution.status;
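Reviewer note on the hunk above: the step-to-progress mapping can be tried in isolation with the minimal sketch below. The values of `TOTAL_STEPS` and `BACKEND_SETUP_END` are assumptions, picked only so that the "slots 7..27" and "21 slots" comments hold; the diff shows the names but not the values. `makeProgressMapper` and its `initialMaxSteps` parameter are hypothetical helpers for illustration, not part of the package.

```javascript
// Assumed values: the diff does not show them. 28 and 6 make
// TOTAL_STEPS - BACKEND_SETUP_END - 1 equal 21, matching the "21 slots" comment.
const TOTAL_STEPS = 28;
const BACKEND_SETUP_END = 6;

// Hypothetical wrapper that keeps observedMaxSteps between polls,
// the way the handler's closure variable appears to.
function makeProgressMapper(initialMaxSteps = 10) {
    let observedMaxSteps = initialMaxSteps;
    return function mapStepToProgress(stepsTaken) {
        // Raise the estimate with 5 steps of headroom once the agent exceeds it.
        if (stepsTaken > observedMaxSteps)
            observedMaxSteps = stepsTaken + 5;
        const stepSlots = TOTAL_STEPS - BACKEND_SETUP_END - 1;
        // Scale into the agent slots, always advancing at least one slot
        // past backend setup.
        const raw = BACKEND_SETUP_END +
            Math.max(1, Math.round((stepsTaken / observedMaxSteps) * stepSlots));
        // Hold the last slot back for the terminal "Complete:" message.
        return Math.min(raw, TOTAL_STEPS - 1);
    };
}

const mapStep = makeProgressMapper(10);
for (const step of [1, 5, 10, 11]) {
    console.log(`step ${step} -> progress ${mapStep(step)} of ${TOTAL_STEPS}`);
}
```

Under these assumed values the reported number can drop when the estimate is raised (step 11 maps lower than step 10). Whether the real handler guards against that elsewhere, for example in the wrapper around `rawProgressCallback`, is not visible in this hunk.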
@@ -368,15 +421,41 @@ async function testPageChangesHandlerInner(input, context, rawProgressCallback)
                 reason: sw.error || undefined,
             };
         }
+        const stepsTaken = finalExecution.state?.stepsTaken ?? subworkflowNode?.outputData?.stepsTaken ?? actionTrace.length;
+        const success = finalExecution.state?.success ?? subworkflowNode?.outputData?.success ?? false;
         const responsePayload = {
             outcome,
-            success: finalExecution.state?.success ?? subworkflowNode?.outputData?.success ?? false,
+            success,
             status: finalExecution.status,
-            stepsTaken: finalExecution.state?.stepsTaken ?? subworkflowNode?.outputData?.stepsTaken ?? actionTrace.length,
+            stepsTaken,
+            stepsBudget: MAX_EXEC_STEPS, // bead qmdd
+            stepsRemaining: Math.max(0, MAX_EXEC_STEPS - (stepsTaken ?? 0)), // bead qmdd
             targetUrl: originalUrl,
             executionId: executionUuid,
             durationMs: finalExecution.durationMs ?? duration,
         };
+        // Bead jqmj: failureCategory disambiguates the three meanings of 'fail':
+        //   'agent-error'        — workflow/infra failure (Pydantic parse error,
+        //                          backend exception, transport issue). Caller's
+        //                          right move: retry-with-backoff.
+        //   'assertion-mismatch' — agent ran the scenario but page state didn't
+        //                          match expectations. Caller's right move: fix
+        //                          code or update the test description.
+        //   ('page-error' is reserved for v2 — needs a structured signal from
+        //   backend to distinguish from assertion-mismatch reliably; today's
+        //   inferrable info is too fragile.)
+        // Field is OMITTED on success (no failure to categorize).
+        if (!success) {
+            // state.error is the AGENT's narrative — it can describe assertion
+            // failures ("expected heading to contain Welcome") OR infrastructure
+            // failures ("Pydantic JSON parse error"). Without a structured signal,
+            // we only count it as 'agent-error' when paired with workflow-level
+            // failure (status='failed') or transient signature.
+            // status='failed' or errorMessage set → workflow-level / transport error.
+            const hasInfraFailure = finalExecution.status === 'failed'
+                || !!finalExecution.errorMessage;
+            responsePayload.failureCategory = hasInfraFailure ? 'agent-error' : 'assertion-mismatch';
+        }
         if (actionTrace.length > 0)
             responsePayload.actionTrace = actionTrace;
         if (evaluation)