npm - agent-scenario-loop - Versions diffs - 0.1.2 → 0.1.3 - Mend

agent-scenario-loop 0.1.2 → 0.1.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (63)

package/README.md +9 -9
package/app/profile-session.ts +98 -4
package/dist/core/agent-summary.d.ts +3 -2
package/dist/core/agent-summary.js +44 -2
package/dist/core/artifact-contract.d.ts +22 -4
package/dist/core/artifact-contract.js +512 -11
package/dist/core/comparison.d.ts +57 -3
package/dist/core/comparison.js +113 -1
package/dist/core/planner.d.ts +32 -1
package/dist/core/planner.js +144 -0
package/dist/core/run-index.d.ts +4 -0
package/dist/core/run-index.js +55 -1
package/dist/core/schema-validator.d.ts +1 -0
package/dist/core/schema-validator.js +1 -0
package/dist/runner/compare-latest.d.ts +8 -4
package/dist/runner/compare-latest.js +24 -5
package/dist/runner/example-android-live.d.ts +10 -1
package/dist/runner/example-android-live.js +55 -0
package/dist/runner/example-ios-live.d.ts +10 -1
package/dist/runner/example-ios-live.js +55 -0
package/dist/runner/ios-simctl.d.ts +5 -0
package/dist/runner/ios-simctl.js +6 -0
package/dist/runner/live-comparison.d.ts +2 -2
package/dist/runner/live-comparison.js +2 -1
package/dist/runner/live-proof-summary.d.ts +5 -4
package/dist/runner/live-proof-summary.js +12 -2
package/dist/runner/live-proof.d.ts +3 -2
package/dist/runner/live-proof.js +9 -2
package/dist/runner/profile-android.d.ts +5 -0
package/dist/runner/profile-android.js +148 -24
package/dist/runner/profile-ios.d.ts +11 -1
package/dist/runner/profile-ios.js +128 -9
package/dist/runner/profile-mobile.d.ts +8 -0
package/dist/runner/profile-mobile.js +267 -28
package/docs/adapters.md +4 -0
package/docs/architecture.md +90 -0
package/docs/authoring.md +5 -1
package/docs/concepts.md +3 -24
package/docs/consumer-rehearsal.md +4 -0
package/docs/contracts.md +30 -100
package/docs/external-adapter-protocol.md +219 -0
package/docs/live-proofs.md +83 -2
package/docs/principles.md +9 -15
package/examples/mobile-app/README.md +12 -0
package/examples/mobile-app/runner-manifests/primary-runner.json +1 -0
package/examples/runners/README.md +1 -0
package/examples/runners/adb-android.json +1 -0
package/examples/runners/agent-device-android.json +1 -0
package/examples/runners/agent-device-ios.json +1 -0
package/examples/runners/argent-android.json +1 -0
package/examples/runners/argent-ios.json +1 -0
package/examples/runners/xcodebuildmcp-ios.json +1 -0
package/package.json +2 -1
package/schemas/causal-run.schema.json +85 -2
package/schemas/comparison.schema.json +130 -2
package/schemas/external-adapter-message.schema.json +693 -0
package/schemas/health.schema.json +72 -0
package/schemas/live-proof-set.schema.json +1 -1
package/schemas/live-proof.schema.json +14 -6
package/schemas/manifest.schema.json +442 -1
package/schemas/runner-capabilities.schema.json +20 -0
package/schemas/scenario.schema.json +16 -0
package/templates/primary-runner.json +1 -0

package/docs/contracts.md CHANGED

@@ -4,6 +4,8 @@ This package ships the scenario, runner, and artifact contracts that make Agent
 The package is intentionally contract-first: adopt the scenario and artifact shape once, then add or swap runner loops without rewriting your scenarios.
+See [Architecture](architecture.md) for the TypeScript-first implementation and language-neutral contract boundary.
 ## What ships today
 - [app/profile-session.ts](../app/profile-session.ts): thin React Native integration for session control, truth events, and signal attachments
@@ -68,11 +70,14 @@ Portable scenario manifests describe the durable app behavior before choosing a
 - `budgets`: product thresholds evaluated only after scenario health passes
 - `steps`: runner-facing launch, command, wait, gesture, and capture actions
 - `selector`: optional app target on a step, such as a test id, accessibility id, label, text, resource id, or xpath
+- `uiContext`: optional UI ownership requirement on a step; UI driver actions default to `app`
 - `artifacts`: required and optional evidence outputs
 The scenario contract is intentionally runner-neutral. Runners can map steps to adb, XcodeBuildMCP, agent-device, accessibility tools, profilers, or custom scripts while preserving the same journey, milestones, budgets, and expected events.
-Runner capabilities describe ownership, such as launch, session control, command execution, log capture, artifact writing, or profiler support. Driver actions describe the concrete operations an adapter can perform inside a run. A runner may be able to own a scenario lifecycle without supporting every driver action; the planner fails only when a required step declares a `driverAction` that the selected runner or an active provider does not declare in `driverActions`.
+Runner capabilities describe ownership, such as launch, session control, command execution, log capture, artifact writing, or profiler support. Driver actions describe the concrete operations an adapter can perform inside a run. UI contexts describe which surface the runner or provider can own: `app`, `systemDialog`, `notificationShade`, `externalBrowser`, `webView`, `shareSheet`, `picker`, or `otherApp`. UI and capture driver actions default to `app` when a step omits `uiContext`; a scenario must opt into system or external contexts explicitly. A runner may be able to own a scenario lifecycle without supporting every driver action or UI context; the planner fails when a required step declares a `driverAction` or `uiContext` that the selected runner or an active provider does not declare.
+Planner compatibility artifacts and planner-derived `health.json` include a `downgradePolicy` block with `mode: "no-silent-downgrade"`. Required capability, driver-action, UI-context, or artifact gaps are recorded as `unsupported`; optional gaps are recorded as warnings. `allowedSubstitutions` and `substitutions` are explicit arrays, so future semantic downgrades must be visible in artifacts instead of being inferred from a passed plan.
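The compatibility rule described in the two added paragraphs above can be sketched as follows. This is a minimal illustration, not the package's planner: the `uiContexts` key on the runner manifest, the `id` and `required` keys on steps, and the gap-record shape are assumptions made for the sketch. Only the `app` default, the `no-silent-downgrade` mode, and the `unsupported`/warning split come from the text.

```python
# Sketch only: manifest and step shapes here are hypothetical, not the ASL schema.
NO_SILENT_DOWNGRADE = "no-silent-downgrade"


def check_compatibility(steps, runner):
    """Classify driver-action and UI-context gaps for one runner manifest."""
    actions = set(runner.get("driverActions", []))
    contexts = set(runner.get("uiContexts", []))  # assumed key name
    unsupported, warnings = [], []
    for step in steps:
        action = step.get("driverAction")
        if action is None:
            continue  # steps without a driver action are out of scope here
        gaps = []
        if action not in actions:
            gaps.append({"stepId": step["id"], "gap": "driverAction", "value": action})
        # A step that omits uiContext is treated as targeting the app surface.
        ui_context = step.get("uiContext", "app")
        if ui_context not in contexts:
            gaps.append({"stepId": step["id"], "gap": "uiContext", "value": ui_context})
        # Required gaps block the plan; optional gaps are only recorded as warnings.
        (unsupported if step.get("required", True) else warnings).extend(gaps)
    return {
        "downgradePolicy": {
            "mode": NO_SILENT_DOWNGRADE,
            "allowedSubstitutions": [],
            "substitutions": [],
        },
        "unsupported": unsupported,
        "warnings": warnings,
    }
```

Keeping `allowedSubstitutions` and `substitutions` as explicit empty arrays mirrors the stated intent: a reader of the artifact can see that nothing was substituted, rather than inferring it from a passing plan.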
 `buildScenarioExecutionPlan()` turns the same scenario steps into a deterministic adapter-facing work list. Each normalized step records the scenario step id, original kind, required flag, optional driver action, and the runner port method that owns it: `launch`, `executeStep`, `waitForTruthEvent`, or `captureEvidence`.
@@ -80,11 +85,11 @@ Android adb capture routes normalized steps with `driverAction: "tap"`, `"scroll
 When Android adb `tap` or `scroll` steps provide a portable selector instead of coordinates, the runner captures `uiautomator dump` output, resolves supported selector kinds against node bounds, and derives adb input coordinates before executing the action. Built-in Android selector resolution supports `testId`, `resourceId`, `accessibilityId`, `accessibilityLabel`, and `text`; `xpath` stays available for external runners with native selector engines.
-I/O from iOS simctl capture routes through the simctl driver adapter. `readLogs` preserves bounded simulator logs under `raw/ios-simctl-log.txt`. A scenario step with `driverAction: "screenshot"` or `artifact: "screenshot"` requests a screenshot capture, defaulting to `captures/ios-screenshot.png`; when `--screenshot-type`, `--screenshot-display`, or `--screenshot-mask` are supplied to `asl-ios-simctl`, the command passes those supported `simctl io screenshot` options and records them in capture metadata. The profile manifest records the resulting capture path in `artifacts.captures.screenshots`.
+I/O from iOS simctl capture routes through the simctl driver adapter. `readLogs` preserves bounded simulator logs under `raw/ios-simctl-log.txt`. A scenario step with `driverAction: "screenshot"` or `artifact: "screenshot"` requests a screenshot capture, defaulting to `captures/ios-screenshot.png`. The profile manifest records the resulting capture path in `artifacts.captures.screenshots`, and capture metadata records any supported simulator screenshot options the runner used.
 Planner compatibility also validates the adapter metadata that built-in runners require. Android adb `tap` steps need either `adapterOptions.androidAdb.x/y` or a portable selector; Android adb `scroll` steps need either `startX/startY/endX/endY` or a portable selector; iOS simctl command metadata needs non-empty command strings and positive integer waits/repeat counts. Argent `tap` steps need `adapterOptions.argent.x/y`, Argent `scroll` steps need `adapterOptions.argent.startX/startY/endX/endY`, and Argent `assertVisible` steps need a portable selector. These failures become `invalid_adapter_options` health checks before runtime execution starts.
-Adapter-target fixtures such as `agent-device-android`, `agent-device-ios`, `argent-ios`, `argent-android`, `argent-react-profiler-provider`, and `axe-accessibility-provider` describe where external tools can plug into the same contract. They are schema-checked and planner-tested capability manifests. The bundled `agent-device` capture runner implements the portable interaction subset for iOS and Android; broader agent-device surfaces such as React DevTools, traces, network, and performance still need explicit adapters or provider attachments before they become part of the stable artifact contract. The bundled Argent runner implements launch, coordinate-backed gestures, screenshot requests, and description-backed visibility proof for portable selector match modes while keeping React profiler output in a separate Android evidence-provider lane. Argent command-surface checks prove the configured tools exist; runtime health still owns whether the selected device backend produced screenshot evidence. Required screenshot failures fail health, and optional screenshot failures are preserved as warnings. Active evidence providers can satisfy required evidence artifacts and provider-owned driver actions such as `collectPerfSignals`; providers outside the selected platform do not contribute to the match. When those tools write files independently, profile CLIs can attach the files with `--signal <js|memory|network>:<path>` or `--capture <screenshot|video|uiTree>:<path>` so provider evidence lands in the stable manifest and artifact layout. The `script-accessibility-provider`, `script-profiler-provider`, `script-memory-provider`, and `script-network-provider` examples show provider-command wrappers for project-local tools without making those tools package dependencies.
+Adapter-target fixtures such as `agent-device-android`, `agent-device-ios`, `argent-ios`, `argent-android`, `argent-react-profiler-provider`, and `axe-accessibility-provider` describe where external tools can plug into the same contract. They are schema-checked and planner-tested capability manifests. The bundled `agent-device` capture runner implements the portable interaction subset for iOS and Android; broader agent-device surfaces such as React DevTools, traces, network, and performance still need explicit adapters or provider attachments before they become part of the stable artifact contract. The bundled Argent runner implements launch, coordinate-backed gestures, screenshot requests, and description-backed visibility proof for portable selector match modes while keeping React profiler output in a separate Android evidence-provider lane. Argent command-surface checks prove the configured tools exist; runtime health still owns whether the selected device backend produced screenshot evidence. Required screenshot failures fail health, and optional screenshot failures are preserved as warnings. Active evidence providers can satisfy required evidence artifacts and provider-owned driver actions such as `collectPerfSignals`; providers outside the selected platform do not contribute to the match. When those tools write files independently, attached provider evidence lands in the stable manifest and artifact layout. The `script-accessibility-provider`, `script-profiler-provider`, `script-memory-provider`, and `script-network-provider` examples show provider-command wrappers for project-local tools without making those tools package dependencies.
 ## Public artifact layout
@@ -111,11 +116,23 @@ Profile runner artifacts:
 `manifest.json`, `metrics.json`, `budget-verdict.json`, and `causal-run.json` are schema-checked before the runner writes them. This keeps profile artifacts stable across fixture logs, adb-captured logs, and future runner adapters.
+`causal-run.json` preserves app-emitted timeline events through the public causal phase/status vocabulary. If an app emits richer phase or status values, ASL writes schema-valid top-level values and preserves the originals as timeline metadata. Timeline metadata also preserves scalar correlation fields such as `iteration`, `sequence`, `queueId`, `commandId`, `operationId`, `attemptId`, and `clockDomain` when the app emits them. Profile-session command acknowledgements are included as ASL-owned timeline entries with command status, result, source, sequence, queue, wait, and command ID metadata, so agents can inspect runtime ordering without treating command transport as product truth. Repeated runs include `iterationSummary` so agents can distinguish complete, partial, failed, and timeout iteration evidence without scraping raw logs. Scenarios without budget thresholds still produce schema-valid causal artifacts with an empty `budgets` object.
+`manifest.attempt` records the run attempt identity and terminal semantics independently of prose summaries. It includes an `attemptId`, `attemptNumber`, `maxAttempts`, optional retry lineage, terminal state, failure classification, cleanup outcome, and whether preserved partial artifacts are valid for diagnosis. Retry attempts must identify the prior attempt and retry reason. A failed attempt can therefore keep usable raw evidence without implying that product verdict, timing, or comparison claims are trustworthy.
+`manifest.provenance.cohort` records product-neutral compatibility inputs for comparing runs. Profile runners populate known fields such as `appId`, `platform`, `runnerName`, `runnerVersion`, `commandTransport`, and active provider IDs; richer callers can add app/build version, build mode, OS version, device class, feature flags, and seed identity. ASL derives `manifest.provenance.cohortHash` from the normalized cohort. Latest-trusted comparison requires the same cohort hash when the current run records one, so old artifacts remain comparable only when the current artifact has not opted into cohort-aware selection.
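One way to picture the cohort hash is a digest over a canonical serialization of the cohort. The diff does not state how ASL normalizes the cohort or which digest it uses, so the normalization (drop unset fields, sort list values, sort keys) and the choice of SHA-256 below are assumptions for illustration only.

```python
import hashlib
import json


def cohort_hash(cohort):
    """Hash a cohort dict so that key order and list order do not matter."""
    normalized = {
        key: (sorted(value) if isinstance(value, list) else value)
        for key, value in cohort.items()
        if value is not None  # unset fields do not contribute (assumption)
    }
    canonical = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cohort_comparable(current_manifest, candidate_manifest):
    """Apply the rule: same hash is required once the current run records one."""
    current = current_manifest.get("provenance", {}).get("cohortHash")
    if current is None:
        # The current run has not opted into cohort-aware selection.
        return True
    return candidate_manifest.get("provenance", {}).get("cohortHash") == current
```

How a current run without a hash should treat candidates that do carry one is not spelled out in the text; the sketch leaves that case permissive.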
+`manifest.attempt.terminalState` uses a terminal vocabulary of `passed`, `failed`, `timeout`, `cancelled`, `aborted`, `inconclusive`, `unsupported`, `skipped`, and `unhealthy`. Attempt construction rejects misleading terminal combinations: passed attempts must end as `passed`, failed attempts must use a failure terminal state, timeout/cancelled/aborted attempts must preserve valid partial artifact paths, and cleanup statuses such as `passed`, `failed`, or `partial` must include a cleanup message. `manifest.environment` records product-neutral lifecycle and environment preconditions and postconditions. Each field is an assertion object with a `value` and `evidence` state. Generated profile artifacts default to `value: "unknown"` and `evidence: "not-asserted"` unless the runner can prove more. The dedicated `lifecyclePhase` assertion supports `cold-launch`, `warm-launch`, `hot-launch`, `resume`, `foreground`, `background`, `force-stop`, `process-death`, `scene-recreation`, `activity-recreation`, `os-reclaim`, `reboot`, and `relaunch`. This preserves what the runner did not prove instead of letting agents infer installed state, app data state, auth state, route, foreground state, permissions, locale, timezone, theme, font scale, orientation, network, animations, cleanup, data, or artifact completeness from surrounding logs.
+Profile `agent-summary.md` files include an `attempt` section when the run has a manifest attempt block, including terminal state, cleanup state, partial-artifact validity, and retry lineage. Latest-trusted baseline selection treats attempt-aware runs as baseline-trusted only when health and verdict passed, the attempt is a clean first attempt, cleanup did not fail or remain partial, and partial artifacts are not marked valid diagnostic fragments. Older artifacts without `manifest.attempt` remain legacy-trusted when health and verdict passed, but new attempt-aware runs cannot hide retry laundering behind a green final verdict.
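The baseline-trust rule in the paragraph above reads naturally as a predicate. The sketch below assumes hypothetical field names inside `manifest.attempt` (`retryOf`, `cleanup.status`, `partialArtifactsValid`) and the literal status string `passed`; only `attemptNumber` and `terminalState` are named in the text.

```python
def is_baseline_trusted(health_status, verdict_status, manifest):
    """Return True when a prior run may serve as a latest-trusted baseline."""
    if health_status != "passed" or verdict_status != "passed":
        return False
    attempt = manifest.get("attempt")
    if attempt is None:
        # Legacy artifact without an attempt block: trusted on health and verdict alone.
        return True
    clean_first_attempt = (
        attempt.get("attemptNumber") == 1
        and attempt.get("retryOf") is None  # hypothetical retry-lineage field
        and attempt.get("terminalState") == "passed"
    )
    cleanup_ok = attempt.get("cleanup", {}).get("status") not in ("failed", "partial")
    fragments_only = bool(attempt.get("partialArtifactsValid", False))
    return clean_first_attempt and cleanup_ok and not fragments_only
```

A second attempt that ends green is rejected by this predicate, which is the retry-laundering case the paragraph calls out.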
+Profile runners assert only environment facts they own. Every completed profile manifest records ASL-controlled artifact completeness and cleanup postconditions. Live adb/simctl capture paths also assert runner-controlled foreground state, explicit lifecycle preconditions, and foreground postconditions. Use `--lifecycle-phase <phase>` when a runner can prove a non-cold precondition such as `warm-launch` or `resume`; log-ingest and preexisting artifact ingestion keep those fields `unknown/not-asserted`.
 Aggregate live proof commands write `live-proof.json` and `agent-summary.md` under `_live-proof/<run-id>`. The live-proof artifact points to preflight evidence, every scenario run, optional interaction proofs from tools such as agent-device or Argent, optional skipped interaction proof declarations, and optional latest-trusted comparison outputs, giving agents one stable entrypoint after a proof run. Preflight, profile, and interaction pointers include health and verdict status from the linked run artifacts, so agents can see what passed before opening deeper evidence. Interaction proof pointers also include sidecar screenshot capture inventory when the sidecar produced screenshots, plus `warnings` when optional sidecar checks failed without invalidating the required proof. If profile health or verdict fails, requested sidecars are not executed; they are recorded in `skippedInteractionProofs` with a reason and next action so agent feedback stays explicit without mixing runner evidence into an untrusted timing run. The aggregate artifact records `status`, `comparisonStatus`, `comparisonCounts`, optional per-comparison `metricSummary` counts/highlights, and a `nextAction` hint so agents can distinguish failed proof gates, regressions, mixed metric movement, missing baselines, inconclusive comparisons, partial sidecar evidence, and clean summaries without scraping prose.
 Platform-set proof commands write `live-proof-set.json` and `agent-summary.md` under the caller-provided proof-set output directory. The proof-set artifact records required platforms, present platforms, missing platforms, each linked `live-proof.json`, failed proof reasons, regression-gate reasons, and a next action. This gives agents one stable Android-plus-iOS gate after the per-platform live proofs have written their own aggregate evidence.
-Provider or custom-script evidence attached with `--signal` or `--capture` is copied into stable run folders and inventoried in `manifest.artifacts.evidenceAttachments`. Each inventory entry records the evidence channel, kind, run-relative path, source filename, byte size, and sha256 hash; it does not preserve local absolute source paths.
+Provider or custom-script evidence attachments are copied into stable run folders and inventoried in `manifest.artifacts.evidenceAttachments`. Each inventory entry records the evidence channel, kind, run-relative path, source filename, byte size, sha256 hash, completeness status, corruption status, redaction status, and transformations; it does not preserve local absolute source paths.
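As a rough sketch of what an inventory entry involves, the function below copies a file into a run folder and records its size and hash without keeping the absolute source path. The `evidence/<channel>/` layout and the entry key names are assumptions; the package's real layout and schema may differ, and the completeness, corruption, redaction, and transformation fields are omitted.

```python
import hashlib
import shutil
from pathlib import Path


def attach_evidence(source, run_dir, channel, kind):
    """Copy one evidence file into the run folder and describe it."""
    source = Path(source)
    run_dir = Path(run_dir)
    target_dir = run_dir / "evidence" / channel  # hypothetical layout
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copyfile(source, target)
    data = target.read_bytes()
    return {
        "channel": channel,
        "kind": kind,
        # Run-relative POSIX path; the absolute source path is never stored.
        "path": target.relative_to(run_dir).as_posix(),
        "sourceFilename": source.name,
        "sizeBytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
```

Hashing the copied bytes rather than the source means the inventory describes what is actually in the run folder.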
 Evidence folders:
@@ -164,104 +181,17 @@ Not yet shipped as supported public features:
 - Computer Use flows
 - product-specific scenarios
-## Preflight planning
-Use `check-plan` to validate a scenario, runner manifest, and optional evidence-provider manifests before execution:
-```bash
-pnpm check-plan -- --scenario examples/scenarios/mobile/app-startup.json --runner examples/runners/xcodebuildmcp-ios.json --platform ios --out artifacts/plan/app-startup
-```
-This validates the input manifests, writes schema-checked `health.json` and `verdict.json`, writes `agent-summary.md`, and includes the raw planner match in `planner-compatibility.json`.
-## Android adb readiness
-Use `android:preflight` to verify adb and connected-device readiness before adding live Android scenario execution:
-```bash
-pnpm android:preflight -- --package com.example.app --out artifacts/android-adb-preflight
-```
-The command writes:
-- `health.json`
-- `verdict.json`
-- `agent-summary.md`
-- `raw/adb-version.txt`
-- `raw/adb-devices.txt`
-- `raw/android-metadata.json`
-If adb, a connected online device, or an optional package check fails, health fails and the verdict remains `inconclusive`.
-Add `--capture-logcat --logcat-lines <count>` to write `raw/adb-logcat.txt` in the same artifact folder. Add `--react-native-debug-host <host:port>` with `--package <name>` for React Native development builds that need adb reverse plus the app `debug_http_host` preference before launch; the runner writes `raw/adb-react-native-reverse.txt` and `raw/adb-react-native-debug-host.txt`. Add `--clear-logcat --launch --wait-ms <ms>` with `--package <name>` to clear logs, launch the package, wait for a bounded capture window, and then collect logcat evidence. If requested capture-window setup or logcat capture fails, scenario health fails because timing and event evidence would be incomplete.
-Use that captured logcat evidence directly with Android profiling:
-```bash
-pnpm profile:android -- --config core/config-template.json --scenario examples/mobile-app/scenarios/android/app-startup.json --adb-artifacts artifacts/android-adb-preflight --run-id android-run-1
-```
-Or let Android profiling own the adb capture window before it writes profile artifacts:
-```bash
-pnpm profile:android -- --config core/config-template.json --scenario examples/mobile-app/scenarios/android/app-startup.json --adb-capture --react-native-debug-host localhost:8097 --clear-logcat --launch --run-id android-run-1
-```
-## iOS simulator capture
-Use `profile:ios --simctl-capture` when the example app or a consuming app is already installed on a booted simulator:
-```bash
-pnpm profile:ios -- --config core/config-template.json --scenario examples/mobile-app/scenarios/ios/app-startup.json --simctl-capture --profile-session --profile-session-storage --launch --run-id ios-run-1
-```
-The command writes a separate simctl capture folder under the selected output root, seeds the app-owned profile session into native AsyncStorage before launch, then collects stored app profile events after the capture window. Command scenarios seed the scenario command queue through the same storage contract before launch. When `raw/ios-profile-events.log` exists, the iOS profile runner ingests that stored truth-event log; otherwise it falls back to `raw/ios-simctl-log.txt`.
-For profile-session capture on Android or iOS, omitting `--wait-ms` lets ASL derive the final evidence window from scenario execution waits and cycle count. Explicit `--wait-ms` remains authoritative when a consuming app has a known startup or logging delay that the scenario cannot express.
-Scenario command targets live in `adapterOptions.iosSimctl.commands`, while the app handles them through `registerProfileCommandTargetHandler`. The iOS proof does not depend on unified logs carrying JavaScript console output; it depends on app-owned stored profile events.
-## Historical comparison
-Use `compare` to build `comparison.json` from two completed run folders:
-```bash
-pnpm compare -- --baseline artifacts/runs/app-startup/baseline --current artifacts/runs/app-startup/current --out artifacts/runs/app-startup/current --fail-on-regression
-```
-The comparison gate is intentionally strict. If either run failed scenario health, or if the scenario ids do not match, the comparison is `inconclusive`. Numeric budget checks are compared only after that health gate passes. `comparison.json` includes `comparisonBasis` with the baseline/current run ids and run directories, giving agents artifact-local provenance instead of forcing them to infer it from folder names.
-Use `compare:latest` when an artifact root contains run history and the agent should compare the current run against the newest trusted prior run for the same scenario:
-```bash
-pnpm compare:latest -- --root artifacts/runs --scenario app-startup --current artifacts/runs/app-startup/current --out artifacts/runs/app-startup/current --fail-on-regression
-```
-The latest-trusted command excludes the exact current run directory from baseline selection. Baseline trust requires passed health and passed verdict. Current runs must pass scenario health before the command will compare timing or budget evidence. If the current manifest declares `comparisonLane`, baseline selection is scoped to trusted prior runs with the same lane; if the current manifest has no lane, selection stays within unlabeled trusted prior runs. Profile manifests also include `scenarioHash`, a stable fingerprint of the normalized scenario contract. When the current run has that hash, latest-trusted selection only compares against trusted prior runs with the same hash; legacy runs without the hash remain comparable only to legacy current runs. This keeps proof modes such as plain live proof and live proof plus agent-device sidecar from comparing against each other, and it keeps migrated scenario definitions from poisoning before/after verdicts. Latest-trusted artifacts set `comparisonBasis.strategy` to `latest_trusted_prior` and record selection counts for inspected, trusted, trusted-prior, lane-comparable, and scenario-contract-comparable candidates.
-## Fixture loop
-Use `demo:loop` to run the current contract without a simulator:
-```bash
-pnpm demo:loop -- --out artifacts/demo-loop
-```
-The fixture loop writes:
+## Command guidance
-- `preflight/app-startup/health.json`
-- `preflight/app-startup/verdict.json`
-- `preflight/app-startup/agent-summary.md`
-- `profile-runs/app-startup/demo-baseline/*`
-- `profile-runs/app-startup/demo-current/*`
-- `profile-runs/app-startup/demo-current/comparison.json`
+Contracts defines the schemas, artifact fields, runner surfaces, and trust policy. Runnable walkthroughs live in [Live Proofs](live-proofs.md):
-This is not a replacement for live device proof. It is a stable contract check that keeps the evidence loop reproducible through trusted prior-run selection while iOS or Android runtime setup is unavailable.
+- [plan checks](live-proofs.md#plan-check)
+- [Android adb preflight and profile capture](live-proofs.md#platform-preflight-and-profile-capture)
+- [iOS simctl profile capture](live-proofs.md#platform-preflight-and-profile-capture)
+- [fixture loop](live-proofs.md#fixture-loop)
+- [explicit and latest-trusted comparison](live-proofs.md#comparison)
+- [generic Android and iOS live proof](live-proofs.md#generic-mobile-proof)
 ## Read next
-- [README](../README.md) for the shortest path through the project
-- [Concepts](concepts.md) for the broader product framing
-- [Adapter Onboarding](adapters.md) for adding runners and evidence providers
-- [Consumer App Rehearsal](consumer-rehearsal.md) for adopting the package in an existing app
-- [Runner docs](../runner/README.md) for current runner behavior and limits
+- [Scenario Authoring](authoring.md) for writing portable scenarios against these contracts

package/docs/external-adapter-protocol.md ADDED

@@ -0,0 +1,219 @@
+# External Adapter Protocol
+ASL core is TypeScript, but the adapter contract is language-neutral. An external adapter is an out-of-process executable that exchanges newline-delimited JSON messages over stdin and stdout. The executable can be written in any language and must not depend on ASL TypeScript internals.
+This document defines the minimal protocol surface for conformance fixtures and future adapter hosts. JSON Schema and this normative protocol document are the source of truth for portable behavior; built-in TypeScript runners remain implementations of the same contract, not the contract itself. The protocol message schema is published in `schemas/external-adapter-message.schema.json`.
+## Transport
+- The host starts the adapter as a child process without a shell.
+- Each message is one UTF-8 JSON object followed by `\n`.
+- stdout is reserved for protocol messages. Diagnostics must go to stderr.
+- Requests and responses are correlated by `operationId`.
+- `seq` is a monotonically increasing integer within each sender's stream.
+- Hosts and adapters maintain independent `seq` streams. A receiver must treat missing, repeated, or non-monotonic `seq` values as protocol health failures.
+- Timestamps use RFC 3339 strings. Timing-sensitive waits must declare their `clockDomain`.
+- Adapters must classify work received after its request `deadline` as a structured deadline failure instead of silently attempting stale work.
+- Paths in artifact references are run-relative unless `uri` is explicitly used.
+- Artifact and raw file references should include `sha256` and `sizeBytes` when the adapter can compute them.
+- Evidence bytes must not be embedded in protocol messages as raw data or base64.
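The framing and sequencing rules in the list above can be sketched in a few lines. The helper names and the returned failure strings are hypothetical. The sketch treats an absent `seq` and a repeated or decreasing `seq` as failures; whether a gap in the sequence also counts as "missing" is not settled by the text and is not checked here.

```python
import json


def encode_message(message):
    """Serialize one protocol message as a single UTF-8 JSON line."""
    # json.dumps escapes newlines inside strings, so the only raw "\n" is the terminator.
    line = json.dumps(message, separators=(",", ":"), ensure_ascii=False)
    return line.encode("utf-8") + b"\n"


def decode_line(raw):
    """Parse one received line; anything other than a JSON object is rejected."""
    message = json.loads(raw.decode("utf-8"))
    if not isinstance(message, dict):
        raise ValueError("protocol message must be a JSON object")
    return message


class SeqTracker:
    """Track one sender's seq stream; host and adapter each need their own tracker."""

    def __init__(self):
        self.last = None

    def check(self, message):
        seq = message.get("seq")
        if not isinstance(seq, int) or isinstance(seq, bool):
            return "missing_seq"
        if self.last is not None and seq <= self.last:
            return "non_monotonic_seq"
        self.last = seq
        return None
```

An adapter written this way would write `encode_message(...)` to stdout and send anything else, such as debug logging, to stderr.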
+## Envelope
+Every message uses the same envelope:
+```json
+{
+  "protocolVersion": "1.0",
+  "seq": 1,
+  "operationId": "op-001",
+  "kind": "request",
+  "type": "hello",
+  "runId": "run-001",
+  "attemptId": "attempt-001",
+  "deadline": "2026-06-19T12:00:05.000Z",
+  "body": {}
+}
+```
+Fields:
+| Field | Required | Meaning |
+| --- | --- | --- |
+| `protocolVersion` | yes | Protocol major/minor string. This document defines `1.0`. |
+| `seq` | yes | Sender-local message sequence. |
+| `operationId` | yes | Correlates one request with one response. Cancellation targets this value. |
+| `kind` | yes | `request`, `response`, or `event`. |
+| `type` | yes | Operation or event name. |
+| `runId` | request after `hello` | Stable ASL run identifier. |
+| `attemptId` | request after `hello` | Stable retry/attempt identifier for the run. |
+| `deadline` | request operations | Absolute deadline for bounded work. |
+| `body` | yes | Operation-specific payload. |
+Responses must echo `protocolVersion`, `operationId`, `runId`, and `attemptId` when those fields were present on the request.
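The echo rule can be sketched as a small response builder. Reusing the request's `type` on the response is an assumption; the table only says `type` is required, not what value a response carries.

```python
ECHOED_FIELDS = ("protocolVersion", "operationId", "runId", "attemptId")


def build_response(request, seq, body):
    """Build a response envelope that echoes the request's correlation fields."""
    response = {
        "seq": seq,  # the adapter's own sender-local sequence, not the request's
        "kind": "response",
        "type": request["type"],  # assumption: the response reuses the operation name
        "body": body,
    }
    for field in ECHOED_FIELDS:
        if field in request:
            response[field] = request[field]
    return response
```

Fields that were absent on the request, such as `runId` on an initial `hello`, are simply not added.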
+## Hello And Capability Discovery
+The first host message must be `hello`. The adapter responds with its identity, supported protocol range, platforms, capabilities, driver actions, artifact outputs, and clock domains.
+Request body:
+```json
+{
+  "host": {
+    "name": "agent-scenario-loop",
+    "version": "0.1.x"
+  },
+  "platform": "android"
+}
+```
+Response body:
+```json
+{
+  "adapter": {
+    "name": "asl-python-conformance-fixture",
+    "version": "0.1.0"
+  },
+  "acceptedProtocolVersion": "1.0",
+  "platforms": ["android", "ios"],
+  "capabilities": ["prepare", "launch", "command", "truthEvent", "evidence", "cancel", "stop", "finalize"],
+  "driverActions": ["tap", "assertVisible"],
+  "artifactOutputs": ["logs", "screenshot", "truth-events"],
+  "clockDomains": ["host-monotonic", "device-log"]
+}
+```
+If the adapter cannot support the requested protocol or platform, it must return a structured failure response and then exit cleanly or wait for `finalize`.
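A hedged sketch of the adapter side of `hello` follows. The failure codes `unsupported_protocol` and `unsupported_platform` are hypothetical names. The success body follows the example shown above literally; the text does not say whether it should also be wrapped in the `ok`/`result` shape used by other operations, so treat that as an open point.

```python
SUPPORTED_PROTOCOL = "1.0"
SUPPORTED_PLATFORMS = ["android", "ios"]


def handle_hello(request):
    """Return the hello response body for one request envelope."""
    version = request.get("protocolVersion")
    if version != SUPPORTED_PROTOCOL:
        return {"ok": False, "failure": {
            "category": "protocol",
            "code": "unsupported_protocol",  # hypothetical code
            "message": f"protocol {version!r} is not supported",
            "retryable": False,
        }}
    platform = request.get("body", {}).get("platform")
    if platform not in SUPPORTED_PLATFORMS:
        return {"ok": False, "failure": {
            "category": "unsupported",
            "code": "unsupported_platform",  # hypothetical code
            "message": f"platform {platform!r} is not supported",
            "retryable": False,
            "details": {"platform": platform},
        }}
    return {
        "adapter": {"name": "example-adapter", "version": "0.0.0"},  # placeholder identity
        "acceptedProtocolVersion": SUPPORTED_PROTOCOL,
        "platforms": SUPPORTED_PLATFORMS,
        "capabilities": ["prepare", "launch", "evidence", "finalize"],
        "driverActions": ["tap", "assertVisible"],
        "artifactOutputs": ["logs", "screenshot"],
        "clockDomains": ["host-monotonic"],
    }
```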
+## Operations
+All operation responses use:
+```json
+{
+  "ok": true,
+  "result": {}
+}
+```
+or:
+```json
+{
+  "ok": false,
+  "failure": {
+    "category": "unsupported",
+    "code": "unsupported_action",
+    "message": "driverAction `pinch` is not supported",
+    "retryable": false,
+    "details": {
+      "driverAction": "pinch"
+    }
+  }
+}
+```
+`failure.category` is optional for older adapters but recommended for conformance. Use these product-neutral categories:
+| Category | Use |
+| --- | --- |
+| `adapter` | Adapter implementation failure that is not more specific. |
+| `cancelled` | Operation was cancelled before completion. |
+| `cleanup` | Stop/finalize/cleanup invariant failed. |
+| `deadline` | Request deadline expired before or during adapter work. |
+| `environment` | Host, device, simulator, permission, or tool environment prevented execution. |
+| `protocol` | Malformed message, invalid sequence, unsupported protocol, or decode failure. |
+| `runner` | Runner orchestration failed outside app product behavior. |
+| `unsupported` | Operation, platform, driver action, or evidence kind is unsupported. |
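Reviewer sketch of how a host might type the two response shapes and the category list above. The category names are copied from the table. The `Failure` and `OperationResponse` types, the assumption that `code`, `message`, and `retryable` are always present, and the `describeResponse` helper are hypothetical.

```typescript
// Category names copied from the table; the rest is an illustrative sketch.
const FAILURE_CATEGORIES = [
  "adapter", "cancelled", "cleanup", "deadline",
  "environment", "protocol", "runner", "unsupported",
] as const;
type FailureCategory = (typeof FAILURE_CATEGORIES)[number];

interface Failure {
  category?: FailureCategory; // optional for older adapters
  code: string;
  message: string;
  retryable: boolean;
  details?: Record<string, unknown>;
}

// Discriminated union on `ok`, so TypeScript narrows to result or failure.
type OperationResponse<T> =
  | { ok: true; result: T }
  | { ok: false; failure: Failure };

// Hypothetical helper: summarize a response in one line for host logs.
function describeResponse<T>(response: OperationResponse<T>): string {
  if (response.ok) return "ok";
  const { category, code, retryable } = response.failure;
  return `${category ?? "uncategorized"}/${code}${retryable ? " (retryable)" : ""}`;
}
```

Treating a missing `category` as `uncategorized` keeps older adapters readable without guessing a category on their behalf.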
+### prepare
+Validates target configuration, creates or verifies run directories, and reports setup metadata. The request body should include `platform`, target identifiers, environment assumptions, and an optional `artifactsRoot`.
+The response should include normalized target metadata and any adapter-owned artifact directories.
+### launch
+Launches or verifies the app, device, browser, or other target. The request body should include `platform`, `target`, and optional launch arguments.
+The response should include launch status, a target reference when available, and artifact references for raw command output.
+### executeAction
+Executes one portable driver action. The request body must include `driverAction` and action-specific input. The adapter must reject unknown or unsupported actions with `ok: false`.
+### waitCondition
+Waits for a truth event, UI condition, log marker, or other bounded condition. The request body must include `condition`, `deadline`, and `clockDomain`.
+The response should include matched truth-event data when available. Timing values are not trustworthy verdict inputs unless scenario health passed.
+### captureEvidence
+Captures logs, screenshots, UI trees, videos, profiler output, or provider signals. The response must return artifact references:
+```json
+{
+  "artifacts": [
+    {
+      "kind": "screenshot",
+      "path": "captures/final-screen.png",
+      "contentType": "image/png",
+      "description": "Final screen after launch",
+      "sha256": "0000000000000000000000000000000000000000000000000000000000000000",
+      "sizeBytes": 0
+    }
+  ]
+}
+```
+### cancel
+Requests cancellation of an in-flight `operationId`. The body must include `targetOperationId` and a human-readable `reason`. Adapters should make cancellation best effort and respond to the original operation with `code: "cancelled"` if it was interrupted.
+### stop
+Stops the active app/session/target while preserving evidence produced so far. This is distinct from `finalize`; the adapter may still accept evidence capture or finalization work after stop.
+If there is no active launched target, `stop` must return a structured cleanup failure instead of pretending cleanup ran. Include `details.cleanupStatus` when the adapter can distinguish `not-required`, `partial`, `failed`, or `passed`.
+### finalize
+Flushes pending protocol output, closes adapter-owned resources, and reports final artifact inventory. After a successful `finalize` response the adapter should exit with code `0`.
+`finalize` is terminal for one adapter attempt. Repeated finalization must return a structured cleanup or protocol failure and must not rewrite the prior artifact inventory.
+## Events
+Adapters may emit `event` messages between request responses for truth events, progress, and evidence discovery:
+```json
+{
+  "protocolVersion": "1.0",
+  "seq": 4,
+  "kind": "event",
+  "type": "truthEvent",
+  "runId": "run-001",
+  "attemptId": "attempt-001",
+  "body": {
+    "name": "app.ready",
+    "clockDomain": "device-log",
+    "observedAt": "2026-06-19T12:00:02.000Z",
+    "payload": {
+      "screen": "Home"
+    }
+  }
+}
+```
+Events must not replace the response for an operation. The host should still receive one terminal response for every request except when the process exits unexpectedly.
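Reviewer sketch of the rule that events never stand in for a response, seen from the host side. The `PendingOperations` class and the reduced `Incoming` message type are hypothetical; a real host would also handle deadlines and unexpected process exit.

```typescript
// Reduced message shape for illustration; real envelopes carry more fields.
type Incoming =
  | { kind: "event"; type: string }
  | { kind: "response"; operationId: string };

// Hypothetical host-side tracker: only a response settles a pending request.
class PendingOperations {
  private readonly pending = new Set<string>();

  // Record a request that still needs its terminal response.
  begin(operationId: string): void {
    this.pending.add(operationId);
  }

  // Returns true only when the message is the terminal response for a known request.
  accept(message: Incoming): boolean {
    if (message.kind === "event") return false; // events never settle a request
    return this.pending.delete(message.operationId);
  }

  // Operation ids still waiting for a terminal response.
  unsettled(): string[] {
    return Array.from(this.pending);
  }
}
```

Anything left in `unsettled()` when the adapter process exits is a request that never received its terminal response.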
+## Conformance Fixture
+The fixture under `runner/__tests__/fixtures/external-adapter/` is intentionally small and non-JavaScript. It proves that a conforming adapter can be an external process with no ASL TypeScript imports. Golden transcripts in the same directory define expected request/response behavior for the success path, unsupported action failure, expired deadline failure, cleanup/finalization failure, sequence monotonicity, and artifact references without embedded evidence bytes.
+## Read next
+- [Contracts](contracts.md) for the scenario, runner, artifact, health, verdict, comparison, and provenance shapes

package/docs/live-proofs.md CHANGED


@@ -14,6 +14,27 @@ pnpm demo:loop -- --out artifacts/demo-loop
 The command runs preflight, profiles baseline/current event logs, writes run artifacts, compares the current run against the latest trusted prior run, and refreshes the current run's `agent-summary.md`.
+It writes:
+- `preflight/app-startup/health.json`
+- `preflight/app-startup/verdict.json`
+- `preflight/app-startup/agent-summary.md`
+- `profile-runs/app-startup/demo-baseline/*`
+- `profile-runs/app-startup/demo-current/*`
+- `profile-runs/app-startup/demo-current/comparison.json`
+This is not a replacement for live device proof. It is a stable contract check that keeps the evidence loop reproducible through trusted prior-run selection while iOS or Android runtime setup is unavailable.
+## Plan Check
+Use `check-plan` to validate a scenario, runner manifest, and optional evidence-provider manifests before execution:
+```bash
+pnpm check-plan -- --scenario examples/scenarios/mobile/app-startup.json --runner examples/runners/xcodebuildmcp-ios.json --platform ios --out artifacts/plan/app-startup
+```
+This validates the input manifests, writes schema-checked `health.json` and `verdict.json`, writes `agent-summary.md`, and includes the raw planner match in `planner-compatibility.json`.
 ## Host/Device Access
 Keep deterministic validation and live device proof as separate execution lanes.
@@ -69,6 +90,46 @@ ASL_ARGENT_BIN=pnpm \
 The doctor composes the existing adb, simctl, agent-device, and Argent checks into one ASL artifact set. A failed doctor is environment evidence, not product evidence: fix the host access or command shape before starting scenario execution.
+## Platform Preflight and Profile Capture
+Use `android:preflight` to verify adb and connected-device readiness before adding live Android scenario execution:
+```bash
+pnpm android:preflight -- --package com.example.app --out artifacts/android-adb-preflight
+```
+The command writes `health.json`, `verdict.json`, `agent-summary.md`, `raw/adb-version.txt`, `raw/adb-devices.txt`, and `raw/android-metadata.json`. If adb, a connected online device, or an optional package check fails, health fails and the verdict remains `inconclusive`.
+Add `--capture-logcat --logcat-lines <count>` to write `raw/adb-logcat.txt` in the same artifact folder. Add `--react-native-debug-host <host:port>` with `--package <name>` for React Native development builds that need adb reverse plus the app `debug_http_host` preference before launch; the runner writes `raw/adb-react-native-reverse.txt` and `raw/adb-react-native-debug-host.txt`. Add `--clear-logcat --launch --wait-ms <ms>` with `--package <name>` to clear logs, launch the package, wait for a bounded capture window, and then collect logcat evidence. If requested capture-window setup or logcat capture fails, scenario health fails because timing and event evidence would be incomplete.
+Use captured logcat evidence directly with Android profiling:
+```bash
+pnpm profile:android -- --config core/config-template.json --scenario examples/mobile-app/scenarios/android/app-startup.json --adb-artifacts artifacts/android-adb-preflight --run-id android-run-1
+```
+Or let Android profiling own the adb capture window before it writes profile artifacts:
+```bash
+pnpm profile:android -- --config core/config-template.json --scenario examples/mobile-app/scenarios/android/app-startup.json --adb-capture --react-native-debug-host localhost:8097 --clear-logcat --launch --run-id android-run-1
+```
+Use `profile:ios --simctl-capture` when the example app or a consuming app is already installed on a booted simulator:
+```bash
+pnpm profile:ios -- --config core/config-template.json --scenario examples/mobile-app/scenarios/ios/app-startup.json --simctl-capture --profile-session --profile-session-storage --launch --run-id ios-run-1
+```
+The command writes a separate simctl capture folder under the selected output root, seeds the app-owned profile session into native AsyncStorage before launch, then collects stored app profile events after the capture window. Command scenarios seed the scenario command queue through the same storage contract before launch. Command envelopes preserve `commandId`, `sequence`, `queueId`, and, for normalized execution-plan commands followed by a milestone wait, `waitForMilestone` plus `waitTimeoutMs`. Deep-link command transport uses the same envelope in query parameters. When `raw/ios-profile-events.log` exists, the iOS profile runner ingests that stored truth-event log; otherwise it falls back to `raw/ios-simctl-log.txt`.
+When a scenario requests a screenshot, pass supported simulator screenshot options through the iOS capture command with `--screenshot-type`, `--screenshot-display`, or `--screenshot-mask`; ASL records the chosen options in capture metadata and the resulting path in `manifest.artifacts.captures.screenshots`.
+For profile-session capture on Android or iOS, omitting `--wait-ms` lets ASL derive the final evidence window from scenario execution waits and cycle count. Explicit `--wait-ms` remains authoritative when a consuming app has a known startup or logging delay that the scenario cannot express.
+Scenario command targets live in `adapterOptions.iosSimctl.commands`, while the app handles them through `registerProfileCommandTargetHandler`. The iOS proof does not depend on unified logs carrying JavaScript console output; it depends on app-owned stored profile events.
+Attach independently produced provider evidence with `--signal <js|memory|network>:<path>` or `--capture <screenshot|video|uiTree>:<path>` so profile commands copy those files into stable run folders and inventory them in `manifest.artifacts.evidenceAttachments`.
 ## Generic Mobile Proof
 Use the generic live runners in a consuming app after `asl-init` has created `asl.config.json`, `scenarios/mobile/<id>.json`, and the `asl:*` package-script snippets:
@@ -134,6 +195,14 @@ pnpm example:android:live -- --run-suffix before-change
 pnpm example:android:live -- --run-suffix after-change
 ```
+After dependency, native-build, or scenario-contract changes, use `--seed-baseline` to capture a trusted same-cohort baseline immediately before the measured run. The seeded profiles use `*-baseline` run ids, must pass health and verdict, and stay in the same comparison lane:
+```bash
+pnpm example:android:live -- --run-suffix release-check --seed-baseline
+```
+When latest-trusted comparison sees slower single-run timing but both baseline and current remain inside their budgets, ASL reports `low_confidence` instead of `regressed`. Treat that as a repeat-or-sample signal, not proof of product regression.
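Reviewer sketch of the `low_confidence` rule above for a single timing metric. The function name, the millisecond parameters, and the `not_slower` label are hypothetical. The package's real policy also records tolerance and sample counts, which this sketch ignores.

```typescript
// `low_confidence` and `regressed` come from the docs; `not_slower` is a placeholder label.
type TimingOutcome = "not_slower" | "low_confidence" | "regressed";

// Hypothetical single-metric check against one budget.
function classifyTiming(baselineMs: number, currentMs: number, budgetMs: number): TimingOutcome {
  if (currentMs <= baselineMs) return "not_slower";
  // Slower, but both runs still satisfy the budget: repeat or sample, do not fail.
  const bothWithinBudget = baselineMs <= budgetMs && currentMs <= budgetMs;
  return bothWithinBudget ? "low_confidence" : "regressed";
}
```

With a 1000 ms budget, a move from 800 ms to 950 ms lands on `low_confidence`, while 800 ms to 1200 ms lands on `regressed`.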
 Read [Example Mobile App: Android Capture](../examples/mobile-app/README.md#android-capture) for Metro routing, adb permissions, individual scenario commands, selector behavior, and optional video capture.
 Expo dev-client Android shells may need an explicit Metro deep link after the native app launches. Put that local URL in ignored env state, for example `ASL_EXAMPLE_ANDROID_DEV_CLIENT_URL=asl-example://expo-development-client/?url=http%3A%2F%2F10.0.2.2%3A8097`, so Android profile capture opens the correct app session before profile-session deep links. When bundle load time is variable, also set `ASL_EXAMPLE_ANDROID_DEV_CLIENT_READY_PATTERN='Running "main"'` so the runner waits for bounded logcat readiness evidence before sending scenario links.
@@ -163,6 +232,14 @@ The root example live scripts pass `--compare-latest --fail-on-regression` by de
 pnpm example:ios:live -- --run-suffix after-change
 ```
+Use `--seed-baseline` for fresh release checks where no compatible trusted iOS baseline exists yet:
+```bash
+pnpm example:ios:live -- --run-suffix release-check --seed-baseline
+```
+The same `low_confidence` comparison policy applies to iOS seeded baselines, where simulator and dev-client startup timing can vary between adjacent runs while still satisfying product budgets.
 Expo dev-client iOS shells may need an explicit Metro deep link after the native app launches. Put that local URL in ignored env state, for example `ASL_EXAMPLE_IOS_DEV_CLIENT_URL=asl-example://expo-development-client/?url=http%3A%2F%2Flocalhost%3A8097`, so iOS profile capture opens the correct app session before collecting evidence.
 The default iOS live proof transport seeds profile-session control into simulator app storage. Use `--ios-profile-session-transport deeplink` when the app should receive profile-session start and command control through app URLs instead.
@@ -247,6 +324,10 @@ pnpm compare:latest \
 Scenario health must pass before timing or budget evidence can support an improvement or regression claim.
+The comparison gate is intentionally strict. If either run failed scenario health, or if the scenario ids do not match, the comparison is `inconclusive`. Numeric budget checks are compared only after that health gate passes. `comparison.json` includes `comparisonBasis` with the baseline/current run ids and run directories, giving agents artifact-local provenance instead of forcing them to infer it from folder names. It also includes `measurementPolicy`, which records the baseline selection mode, poisoning protections, valid sample counts, timing tolerance, and confidence level used for the comparison.
+The latest-trusted command excludes the exact current run directory from baseline selection. Baseline trust requires passed health and passed verdict. For attempt-aware artifacts, baseline trust also requires a clean first passed attempt, no retry lineage, no failed or partial cleanup, and no valid partial-artifact diagnostic fragments. Current runs must pass scenario health before the command will compare timing or budget evidence. If the current manifest declares `comparisonLane`, baseline selection is scoped to trusted prior runs with the same lane; if the current manifest has no lane, selection stays within unlabeled trusted prior runs. Profile manifests also include `scenarioHash`, a stable fingerprint of the normalized scenario contract. When the current run has that hash, latest-trusted selection only compares against trusted prior runs with the same hash; legacy runs without the hash remain comparable only to legacy current runs. This keeps proof modes such as plain live proof and live proof plus agent-device sidecar from comparing against each other, and it keeps migrated scenario definitions from poisoning before/after verdicts. Latest-trusted artifacts set `comparisonBasis.strategy` to `latest_trusted_prior`, record selection counts for inspected, trusted, trusted-prior, lane-comparable, and scenario-contract-comparable candidates, and mirror the active lane, scenario hash, and cohort hash inside `measurementPolicy.baselineSelection.poisoningProtection` when those filters are active.
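Reviewer sketch of the latest-trusted selection filters described above. `comparisonLane` and `scenarioHash` are the manifest fields named in the text. The `RunRecord` shape, the `finishedAt` ordering key, and the function name are assumptions, and the attempt-aware trust checks (retry lineage, cleanup status, partial artifacts) are omitted.

```typescript
// Hypothetical run record; only comparisonLane and scenarioHash are named in the docs.
interface RunRecord {
  runDir: string;
  finishedAt: number; // assumed ordering key (epoch milliseconds)
  healthPassed: boolean;
  verdictPassed: boolean;
  comparisonLane?: string;
  scenarioHash?: string;
}

function selectLatestTrustedPrior(current: RunRecord, candidates: RunRecord[]): RunRecord | undefined {
  const comparable = candidates.filter(
    (run) =>
      run.runDir !== current.runDir && // exclude the exact current run directory
      run.healthPassed &&
      run.verdictPassed &&
      run.comparisonLane === current.comparisonLane && // unlabeled matches only unlabeled
      run.scenarioHash === current.scenarioHash, // legacy (no hash) matches only legacy
  );
  // Newest comparable run first.
  comparable.sort((a, b) => b.finishedAt - a.finishedAt);
  return comparable[0];
}
```

Strict equality on the optional fields gives the lane and legacy behavior without extra branches: `undefined === undefined` keeps unlabeled or unhashed runs in their own pool.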
 ## Release Gate
 Before publishing, run:
@@ -263,8 +344,8 @@ Package smoke and consumer rehearsal keep child commands bounded so package-mana
 ASL_PACKAGE_GATE_TIMEOUT_MS=300000 pnpm release:check
 ```
-Read next:
+## Side References
-- [Contracts](contracts.md) for artifact layout and supported runner surface
 - [Consumer App Rehearsal](consumer-rehearsal.md) for adoption inside an existing app
 - [examples/mobile-app](../examples/mobile-app/README.md) for detailed dogfood app commands
+- [Public API](api.md) for package imports and programmable runner composition

package/docs/principles.md CHANGED

@@ -1,28 +1,24 @@
 # Principles
-`agent-scenario-loop` is a scenario orchestration and evidence collection layer for agent-driven software development.
+`agent-scenario-loop` has one durable claim: scenarios, contracts, and evidence must outlive the current runner.
-Read this after [Concepts](concepts.md) if you want the project doctrine in a compact form.
-The durable value is not any one runner. The durable value is a stable scenario and evidence contract that survives runner changes.
-It is not another agent runner. It is the layer that coordinates runners, preserves evidence, and keeps scenarios useful as tooling changes.
-Scenarios are long-lived project assets. They describe important application behaviors, not the temporary mechanics of the current runner.
+Read this after [Concepts](concepts.md). Concepts explains the model; this page is the compressed doctrine.
 ## Four planes
+ASL separates mobile proof into four planes. Mixing them is the usual source of flaky claims.
 1. Control plane
-Use semantic app commands, deep links, and deterministic hooks before falling back to raw UI replay.
+Use semantic app commands, deep links, and deterministic hooks to start and steer the scenario. Raw UI replay is a realism check, not the preferred control architecture.
 2. Truth plane
-Use explicit profile events, stored signals, route state, and committed artifacts as the source of truth.
+Use app-owned truth events, stored signals, route state, and committed artifacts as the source of what happened.
 3. Evidence plane
-Preserve logs, screenshots, videos, profiler exports, memory captures, network captures, UI trees, metrics, and verdicts in one stable artifact layout.
+Preserve logs, screenshots, videos, profiler exports, memory captures, network captures, UI trees, metrics, verdicts, comparisons, and summaries in one stable artifact layout.
 4. Realism plane
-Use taps, swipes, and full UI interaction for realism checks and last-mile validation, not as the primary control architecture.
+Use taps, swipes, alerts, full UI interaction, and external device tools to prove the app still behaves under real interaction pressure.
 ## Invariants
@@ -41,6 +37,4 @@ Use taps, swipes, and full UI interaction for realism checks and last-mile valid
 ## Read next
-- [Contracts](contracts.md) for the current artifact and package surface
-- [Runner docs](../runner/README.md) for the host execution boundary
-- [README](../README.md) for the project entrypoint
+- [Architecture](architecture.md) for the TypeScript-first, language-neutral contract boundary

package/examples/mobile-app/README.md CHANGED

@@ -233,6 +233,12 @@ pnpm example:android:live -- --run-suffix before-change
 pnpm example:android:live -- --run-suffix after-change
 ```
+When a release, native build, or scenario edit changes the compatible cohort, seed a fresh trusted baseline in the same command:
+```bash
+pnpm example:android:live -- --run-suffix release-check --seed-baseline
+```
 The individual live commands remain useful while debugging one scenario:
 ```bash
@@ -289,6 +295,12 @@ The root example live scripts pass `--compare-latest --fail-on-regression` by de
 pnpm example:ios:live -- --run-suffix after-change
 ```
+When there is no compatible trusted iOS baseline for the current release cohort, seed one before the measured run:
+```bash
+pnpm example:ios:live -- --run-suffix release-check --seed-baseline
+```
 If global `xcode-select` points at a beta Xcode whose simulator services are not ready, set `ASL_EXAMPLE_XCODE_DEVELOPER_DIR` before the Node runner starts:
 ```bash

package/examples/mobile-app/runner-manifests/primary-runner.json CHANGED

@@ -6,6 +6,7 @@
   "capabilities": ["launch", "sessionControl", "command", "logCapture", "artifactWrite"],
   "driverActions": ["tap", "scroll", "assertVisible", "inspectTree", "screenshot", "record", "readLogs"],
   "artifactOutputs": ["logs", "signals", "screenshot", "video", "uiTree"],
+  "uiContexts": ["app"],
   "lifecycle": [
     "prepare",
     "launch",

package/examples/runners/README.md CHANGED

@@ -32,6 +32,7 @@ They do not mean the package bundles every named tool. A fixture describes what
 - Keep `capabilities` about lifecycle or evidence ownership.
 - Keep `driverActions` about concrete operations the adapter can perform.
+- Keep `uiContexts` about the surface the adapter can own; do not use `app` proof for system dialogs, share sheets, external browsers, WebViews, pickers, notifications, or another app unless the manifest explicitly declares that context.
 - Do not add a capability or driver action until a runner or provider can produce the corresponding evidence.
 - Keep `providerCommands` on evidence-provider manifests; primary runners should own lifecycle orchestration, not provider command wrappers.
 - When a tool writes files independently, attach them through `--signal`, `--capture`, or a `providerCommands` manifest so the run keeps stable artifact paths.