RubyGems - ace-test-runner-e2e - Versions diffs - 0.29.8 → 0.40.1 - Mend

ace-test-runner-e2e 0.29.8 → 0.40.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (52)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 6206e4d6f65fe1ab5c27d1b5479e37af079b3554bd6289786aa71ce62e4ecf50
-  data.tar.gz: bedb5fa2830bc1f2818e2246acbbab15ec6a7227898bf4dba2b23b6521eb8d5b
+  metadata.gz: ae94a3ebd8b4ed697d8b3b5a705030236659ab9b8e34fa4e5baa421a22d1b781
+  data.tar.gz: 8d65447a2174a8fe614d0fce6a8ab2503d39b06e76584aa7af4ba9cedaf3f8bc
 SHA512:
-  metadata.gz: 9d06bc8d9447debe2b48128b7c45ea0a357d01677b9b8b48fa508c5f8a078e8b4ed0396a4d4d38c06be25573701d4bd75b6f481438d116e3499b4b1890a9edd5
-  data.tar.gz: b148663600b83ffde9821761a1ef4a7b43d11a2efc10d97433bb090d4ae2bfc8f7dd997756190cdafc894c497f1eada99faf3b78dcf7df38ed7eec9d57cedec3
+  metadata.gz: 3889a846fd3631330728fe5e144259328f08ebd7ee35d4d8f5358fb149f85ad1dc4071132d1db283bb402f1cbed16655f1e790a644e40508e1d104b4fb24e0f5
+  data.tar.gz: c68c945a12f8ab86c23ad3714584afbc9e507a7cc8d91b089c99048e43df579adae9f407e769e9669e17910c5853c0919b81846fb278b5d72722220b29187101

data/.ace-defaults/e2e-runner/config.yml CHANGED

@@ -2,12 +2,16 @@
 # This file provides defaults for the ace-test-runner-e2e gem
 paths:
+  # Preferred location for deterministic preflight tests in packages.
+  preflight: "test/feat"
   # Where test scenarios are stored in packages
   scenarios: "test/e2e"
   # Directory for test execution artifacts (gitignored)
   cache_dir: ".ace-local/test-e2e"
 patterns:
+  # Glob pattern for deterministic preflight tests.
+  preflight: "test/feat/**/*_test.rb"
   # Glob pattern for finding test scenarios (TS-format directories)
   discovery: "test/e2e/TS-*/scenario.yml"
@@ -38,13 +42,21 @@ reporting:
 # Execution defaults
 execution:
-  # Default LLM provider:model for test execution
-  provider: "role:e2e-executor"
+  # Legacy provider fallback when runner/verifier are not explicitly split
+  provider: "role:e2e-runner"
+  # LLM provider:model for runner execution
+  runner_provider: "role:e2e-runner"
+  # LLM provider:model for verifier execution
+  verifier_provider: "role:e2e-verifier"
   # Timeout per test in seconds
   timeout: 600
   # Number of tests to run in parallel (1 = sequential)
   parallel: 3
+sandbox:
+  profile: "ace-default"
+  ruby_version: "3.4.9"
 # Provider configuration
 providers:
   # CLI providers use deterministic pipeline execution (runner + verifier)
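
Reviewer note: the new comments describe `provider` as a legacy fallback for when the runner and verifier providers are not split. The gem's actual config loading is not part of this diff, so the following is only a minimal sketch of one plausible resolution order. The helper name `resolve_providers` is hypothetical.

```ruby
require "yaml"

# Hypothetical helper (not from the gem): pick the runner and verifier
# providers, falling back to the legacy `provider` key when a split key
# is missing.
def resolve_providers(execution)
  legacy = execution["provider"]
  {
    runner: execution["runner_provider"] || legacy,
    verifier: execution["verifier_provider"] || legacy
  }
end

# Values copied from the new defaults shown in the hunk above.
config = YAML.safe_load(<<~YAML)
  execution:
    provider: "role:e2e-runner"
    runner_provider: "role:e2e-runner"
    verifier_provider: "role:e2e-verifier"
YAML

split = resolve_providers(config["execution"])

# A config that only sets the legacy key resolves both roles to it.
legacy_only = resolve_providers({ "provider" => "role:e2e-runner" })
```

Under this assumption, an older config that only sets `provider` keeps working, while the new defaults route verification to `role:e2e-verifier`.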

data/CHANGELOG.md CHANGED

@@ -7,6 +7,239 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [0.40.1] - 2026-04-24
+### Fixed
+- Removed suite-specific wording from the single-command `ace-test-e2e` help/output path so the `RunTest` CLI stays scoped to the single-command surface while preserving prune-artifact guidance.
+## [0.40.0] - 2026-04-24
+### Changed
+- Added `--[no-]prune-artifacts` to `ace-test-e2e` and `ace-test-e2e-suite` so operators can clear stale `.ace-local/test-e2e` run artifacts before execution while preserving suite reports and the shared `runtime-cache/`.
+## [0.39.1] - 2026-04-24
+### Fixed
+- Resolved suite-shared runtime reuse in child `ace-test-e2e` subprocesses by honoring inherited `ACE_E2E_SHARED_RUNTIME_ROOT` from the process environment instead of rebuilding sandbox-local runtimes after prewarming.
+## [0.39.0] - 2026-04-24
+### Changed
+- Added `--[no-]retry-failures-once` for full-suite reruns, including flaky-recovery reporting when a failed first pass succeeds on the retry pass.
+- Reused a suite-shared E2E runtime cache under `.ace-local/test-e2e/runtime-cache/` so parallel sandbox workers stop rebuilding the same Bundler environment and native extensions for every scenario.
+## [0.38.17] - 2026-04-24
+### Fixed
+- Detected fixture-commit setup flows across the full setup sequence instead of only single-step `git add && git commit` commands, restoring support-path git excludes for split-step fixture repositories.
+## [0.38.16] - 2026-04-23
+### Fixed
+- Enforced runner-owned verifier artifact contracts in scenario loading, expanded grouped `.stdout` / `.stderr` / `.exit` shorthand, and rejected verifier-only or wildcard artifact declarations that previously let retained E2E drift slip through.
+### Technical
+- Updated E2E guides, templates, and create/review/plan/rewrite/fix workflows to distinguish `public-surface` versus `retained-contract` TCs and require explicit downstream retained-E2E sweeps after public contract changes.
+## [0.38.15] - 2026-04-23
+### Fixed
+- Passed declared artifact contracts directly into runner prompts, added one bounded runner repair pass when required captures are still missing, and persisted repair metadata so missing-artifact E2E failures can recover before verifier judgment.
+## [0.38.14] - 2026-04-23
+### Fixed
+- Limited deterministic sandbox git excludes to setup-commit scenarios so copied package trees remain visible to ignore-aware tools while fixture-repo support paths stay unstaged.
+## [0.38.13] - 2026-04-23
+### Fixed
+- Enabled role-based verifier fallback in pipeline execution so successful runner phases still produce verifier results when the first verifier provider is unavailable.
+- Seeded deterministic sandbox git excludes for copied package trees and fixture-commit support paths so setup-time `git add -A` no longer stages runner support files or copied package content into fixture repositories.
+## [0.38.12] - 2026-04-23
+### Changed
+- Updated default ACE sandbox bootstrap to use `ace-config sync ace-llm-providers-cli` before `ace-handbook sync`, matching the renamed config sync command and minimal quick-start config requirement.
+## [0.38.11] - 2026-04-20
+### Fixed
+- Spaced batch run IDs by 100ms in `TestOrchestrator` so generated 50ms-format IDs remain unique under fast consecutive suite execution.
+## [0.38.10] - 2026-04-19
+### Fixed
+- Added strict runner ordering guidance, verifier artifact mtimes, and direct goal-number-to-TC mapping so E2E reports classify out-of-order postcondition captures as runner errors instead of shifting failed TC IDs.
+## [0.38.9] - 2026-04-19
+### Changed
+- Strengthened the E2E failure-analysis and fix workflows to require explicit docs/help drift reporting for every failed TC, so stale usage docs or CLI help surfaced by E2E failures become concrete fix targets instead of hidden runner workarounds.
+## [0.38.8] - 2026-04-16
+### Fixed
+- Synced protocol-source package trees into prepared sandboxes before deterministic setup, preserved the sanitized setup environment for runner and verifier execution, and tightened the shared runner contract to require direct `ace-*` commands with immediate `.stdout` / `.stderr` / `.exit` persistence.
+## [0.38.7] - 2026-04-16
+### Fixed
+- Reused already prepared CLI-provider sandboxes during pipeline execution so the runner no longer rewrites tracked sandbox state after deterministic setup, which prevents staged-path failures caused by post-setup provider-directory symlinks.
+## [0.38.6] - 2026-04-16
+### Fixed
+- Scoped declared sandbox-layout artifacts to the active test case, recorded present-versus-missing required artifacts in harness snapshots and report metadata, and passed that contract into verifier prompts.
+- Added canonical goal-verdict reporting so generated scenario reports keep the authoritative failed-TC mapping even when narrative evidence includes contradictory wording.
+## [0.38.5] - 2026-04-16
+### Fixed
+- Synced package protocol-source manifests into copied E2E sandboxes so bundled workflow and skill resolution continues to work after sandbox setup.
+- Hardened the shared runner prompt contract to preserve sandbox runtime `PATH`/environment and forbid wrapper patterns that break direct `ace-*` execution.
+## [0.38.4] - 2026-04-16
+### Fixed
+- Built a dedicated sandbox runtime for E2E runs with sandbox-local Gemfile, Bundler state, gem home, bin shims, verifier sandbox context, preserved report-directory reuse, and wrapper-compatible launch behavior so sandboxed commands stop leaking back into the source worktree.
+## [0.38.3] - 2026-04-16
+### Fixed
+- Stripped inherited Bundler and Ruby env leakage from sandboxed E2E subprocesses, created sandbox-local Bundler state, preserved failure-stub report directories in suite aggregation, and aligned shared setup templates/docs with the `ACE_E2E_SOURCE_ROOT` source-root contract.
+## [0.38.2] - 2026-04-16
+### Fixed
+- Prepared setup steps with sandbox runtime environment, hardened runtime directory permissions for tmux access, and kept sandbox support paths aligned with the active `bubblewrap` execution model.
+## [0.38.1] - 2026-04-15
+### Fixed
+- Tightened the Linux `bubblewrap` sandbox mounts to preserve required device access such as `/dev/null` while keeping the host filesystem isolated.
+- Moved sandbox support directories outside the copied repo workspace so E2E setup steps like `git add -A` no longer stage sandbox home, tmp, or runtime files.
+## [0.38.0] - 2026-04-15
+### Changed
+- Rewrote `TS-RUNNER-001` to use public fixture-driven discovery (`copy-fixtures`) and expanded suite control-flow coverage beyond help-only output.
+- Added `TS-RUNNER-002` to cover real non-dry run report generation, verifier-output evidence, and explicit `ace-test-e2e-sh` public shell-helper usage.
+- Updated `docs/usage.md` with safe shell-helper workflows tied to deterministic `.ace-local/test-e2e/` report paths.
+### Fixed
+- Routed setup/runner/verifier subprocesses through the new sandbox backend, kept user-facing verifier metadata in written reports, and taught the minimal verifier parser to accept standalone `Results: X/Y passed` summaries.
+## [0.37.2] - 2026-04-14
+### Changed
+- Added a canonical public-surface gate across the E2E handbook so goal-based scenarios must prove both that the tool works and that a user can complete the job from docs, `--help`, and the public CLI without hidden recipes or workarounds.
+- Updated the create/review/plan/rewrite/run/fix workflow guidance, shared guides, and templates to treat workaround-driven scenarios as invalid or at-risk and to record friction through runner observations instead of teaching fallback procedures.
+## [0.37.1] - 2026-04-13
+### Changed
+- Updated the canonical E2E create/review/rewrite/run guidance, templates, and references so goal-based scenarios are written around final sandbox state plus runner observations instead of helper artifacts under `results/`.
+## [0.37.0] - 2026-04-13
+### Changed
+- Made runner `Observations` the canonical non-filesystem evidence channel for goal-based E2E scenarios, passed them directly into verifier prompts, and persisted them through the harness-managed report surface.
+- Updated the shared E2E template, authoring guides, and rewrite/run workflows to require goal achievement from sandbox end state first, using runner observations as the only secondary evidence source instead of helper artifacts under `results/`.
+## [0.36.1] - 2026-04-13
+### Fixed
+- Preferred canonical per-scenario `report.md` metadata when building aggregate package and suite reports so failed TC mappings no longer drift from the underlying scenario reports.
+- Added explicit dirty-worktree diagnostics to suite reporting so tracked repo mutations are surfaced as runner diagnostics instead of being inferred after the fact.
+### Changed
+- Updated the canonical E2E failure-analysis and fix workflows plus usage guidance to treat aggregate reports as indexes and per-scenario reports as the source of truth for TC-level triage.
+## [0.36.0] - 2026-04-13
+### Fixed
+- Renamed aggregated E2E outputs to scope-specific package and suite report filenames instead of the ambiguous shared `final-report` label.
+- Stripped ambient `TMUX` and `TMUX_PANE` state from setup and pipeline subprocess environments so E2E runs do not accidentally attach to the operator's live tmux session.
+### Technical
+- Updated suite orchestrator/report writer coverage and E2E workflow guidance around the explicit package-vs-suite report contract.
+## [0.35.0] - 2026-04-13
+### Changed
+- **ace-test-runner-e2e v0.35.0**: Added optional scenario artifact declarations via `(optional)`, separated required and optional artifact tracking, and included optional outputs in manifests and snapshots without failing scenarios when they are absent.
+## [0.34.1] - 2026-04-13
+### Changed
+- Completed the batch i05 migration follow-through for this package and aligned it with the restarted `fast` / `feat` / `e2e` verification model.
+### Technical
+- Included in the coordinated assignment-driven patch release for batch i05 package updates.
+## [0.34.0] - 2026-04-12
+### Changed
+- Migrated package deterministic tests to the restarted `fast`/`feat` layout by moving `test/atoms`, `test/commands`, `test/handbook`, `test/models`, `test/molecules`, and `test/organisms` under `test/fast/`, and moving legacy `test/integration` coverage into `test/feat/`.
+- Updated package docs and CLI wording to teach `fast`/`feat` deterministic coverage plus scenario-only `test/e2e` execution via `ace-test-e2e`.
+- Refreshed `TS-RUNNER-001` scenario metadata and decision-record unit coverage references to point at migrated `test/fast` paths.
+## [0.33.1] - 2026-04-12
+### Fixed
+- Made suite final reports deterministic for canonical sections by deriving summary rows, failed-test details, reports tables, and the overall line from runtime results instead of model-authored prose.
+- Added regression coverage so hallucinated scenario titles, failed TC IDs, and duplicate overall lines are ignored or replaced before report files are written.
+## [0.33.0] - 2026-04-11
+### Changed
+- Made `wfi://e2e/fix` a self-bootstrapping workflow that reuses existing failure analysis when present and generates it via `wfi://e2e/analyze-failures` when missing or incomplete.
+- Updated the canonical `as-e2e-fix` skill contract to state that missing analysis is generated automatically before fixes are applied.
+### Technical
+- Refactored `ConfigLoader` molecule tests to use config mock mode, removing dependency on monorepo `.ace` overrides and making the test contract stable across environments.
+## [0.32.2] - 2026-04-11
+### Fixed
+- Generated per-scenario CLI batch `run_id`s from explicit 50ms timestamp buckets so parallel package runs no longer occasionally reuse the same report-path ID and trip the unique-run-id orchestration contract.
+## [0.32.1] - 2026-04-11
+### Technical
+- Synced the canonical `as-e2e-review` skill description with the package-targeted assign verification contract so shipped metadata no longer implies broader scenario-sweep execution.
+## [0.31.0] - 2026-04-10
+### Changed
+- Restored the two-phase E2E harness to run deterministic `test/integration` coverage before agent scenarios from `test/e2e`, with integration failures short-circuiting scenario execution.
+- Added deterministic integration execution, richer per-test-case manifests and artifact snapshotting, and refreshed CLI/docs/workflows/tests around the restarted layout and role-based runner/verifier contract.
+### Fixed
+- Accepted minimal verifier evidence responses in the runner pipeline so successful scenario runs no longer fail when a verifier omits the full structured envelope.
+## [0.30.2] - 2026-04-10
+### Fixed
+- Surface `git diff` stderr when affected-package detection fails so invalid refs and shallow-clone failures no longer look like empty affected sets.
+## [0.30.1] - 2026-04-10
+### Fixed
+- Raised the `ace-support-test-helpers` runtime dependency floor to `~> 0.14` so released installs accept the shared sandbox package-copy helper line used by the restarted runner.
+- Restored the `TS-RUNNER-001` smoke scenario fixture source path so the CLI smoke scenario resolves its canonical demo fixture again.
+## [0.30.0] - 2026-04-10
+### Changed
+- Reworked `ace-test-runner-e2e` back into a two-phase contract, with deterministic integration from `test/integration` before agent scenarios from `test/e2e`.
+- Switched sandbox orchestration to the shared package-copy helper and refreshed CLI/docs/workflows for the restarted E2E structure.
+### Fixed
+- Hardened affected-file detection by capturing git diff stderr so provider-side affected checks fail with clearer diagnostics.
 ## [0.29.8] - 2026-04-01
 ### Fixed
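
Reviewer note: entries 0.32.2 and 0.38.11 together say that run IDs are derived from 50ms timestamp buckets and that batch IDs are spaced by 100ms so they stay unique. The real ID encoding in `TestOrchestrator` is not shown in this diff. The sketch below only illustrates why a 100ms spacing cannot collide within 50ms buckets. All names here are hypothetical.

```ruby
# Assumed constants taken from the changelog wording, not from the gem source.
BUCKET_MS = 50
SPACING_MS = 100

# Hypothetical: map a millisecond timestamp to its 50ms bucket index.
def bucket_for(epoch_ms)
  epoch_ms / BUCKET_MS
end

# Hypothetical: bucket indexes for a batch whose members are spaced
# SPACING_MS apart, starting at start_ms.
def batch_buckets(start_ms, count)
  Array.new(count) { |i| bucket_for(start_ms + i * SPACING_MS) }
end

buckets = batch_buckets(1_776_000_000_123, 5)
```

Because the spacing is a whole multiple of the bucket width, consecutive members always land two buckets apart, whatever the start time.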

data/README.md CHANGED

@@ -18,11 +18,11 @@
 ![ace-test-runner-e2e demo](docs/demo/ace-test-runner-e2e-getting-started.gif)
-`ace-test-runner-e2e` runs realistic workflow scenarios through coding agents so teams can validate behavior beyond unit and integration coverage while keeping execution reproducible and isolated from the working tree.
+`ace-test-runner-e2e` runs realistic workflow scenarios through coding agents so teams can validate behavior beyond deterministic package tests while keeping execution reproducible and isolated from the working tree.
 ## How It Works
-1. Discover E2E scenario definitions from package-local `test/e2e/` suites with metadata, tags, and command flows.
+1. Discover deterministic preflight tests from package-local `test/feat/` and agent scenarios from `test/e2e/`, preserving metadata, tags, and command flows.
 2. Execute scenarios inside reproducible sandboxes that isolate agent runs from the working tree.
 3. Produce structured reports that are easy to inspect, compare across runs, and feed back into triage workflows.

data/exe/ace-test-e2e-sh CHANGED

@@ -1,6 +1,8 @@
 #!/usr/bin/env ruby
 # frozen_string_literal: true
+require_relative "../lib/ace/test/end_to_end_runner"
 # ace-test-e2e-sh - Execute commands within E2E test sandbox
 #
 # Usage:
@@ -57,11 +59,14 @@ unless Dir.exist?(test_dir)
   exit 1
 end
-Dir.chdir(test_dir)
-ENV["PROJECT_ROOT_PATH"] = test_dir
+backend = Ace::Test::EndToEndRunner::Molecules::BwrapSandboxBackend.new(
+  sandbox_root: test_dir,
+  source_root: ENV["ACE_E2E_SOURCE_ROOT"]
+)
+env = backend.prepared_env("PROJECT_ROOT_PATH" => test_dir, "ACE_E2E_SOURCE_ROOT" => ENV["ACE_E2E_SOURCE_ROOT"])
 if ARGV.empty?
-  exec "bash"
+  backend.exec(["bash"], chdir: test_dir, env: env)
 else
-  exec(*ARGV)
+  backend.exec(ARGV, chdir: test_dir, env: env)
 end
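
Reviewer note: the script now builds its environment through `backend.prepared_env` instead of mutating `ENV`. The body of `prepared_env` is not in this diff. The changelog (0.36.0, 0.38.3) mentions stripping `TMUX`, `TMUX_PANE`, and inherited Bundler and Ruby variables, so a rough sketch of that kind of sanitizing could look like the following. The exact variable list and the helper name `sanitized_env` are assumptions.

```ruby
# Assumed list; the diff does not enumerate which variables are removed.
STRIPPED_KEYS = %w[TMUX TMUX_PANE RUBYOPT RUBYLIB GEM_HOME GEM_PATH].freeze

# Hypothetical helper: drop host-session and Bundler/Ruby state, then apply
# explicit overrides. Nil overrides are skipped so an unset
# ACE_E2E_SOURCE_ROOT does not appear in the result.
def sanitized_env(base, overrides = {})
  kept = base.reject do |key, _value|
    STRIPPED_KEYS.include?(key) || key.start_with?("BUNDLE_")
  end
  kept.merge(overrides.compact)
end

host = {
  "PATH" => "/usr/bin",
  "TMUX" => "/tmp/tmux-1000/default,1,0",
  "BUNDLE_GEMFILE" => "/src/Gemfile"
}
env = sanitized_env(
  host,
  { "PROJECT_ROOT_PATH" => "/sandbox", "ACE_E2E_SOURCE_ROOT" => nil }
)
```

Returning a new hash rather than editing `ENV` in place matches the direction of the change above, where the environment is passed explicitly to `backend.exec`.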

data/handbook/guides/e2e-testing.g.md CHANGED

@@ -3,8 +3,8 @@ doc-type: guide
 title: E2E Testing Guide
 purpose: Conventions and best practices for agent-executed end-to-end tests
 ace-docs:
-  last-updated: 2026-03-12
-  last-checked: 2026-03-21
+  last-updated: 2026-04-19
+  last-checked: 2026-04-19
 ---
 # E2E Testing Guide
@@ -12,6 +12,11 @@ ace-docs:
 ## Overview
 E2E tests are executed by an AI agent and reserved for behaviors that require real CLI execution, real tools, and real filesystem side effects.
+They must also answer a user-journey question: can a user do the job from the tool's public surface, and how much friction does that journey have?
+In practice, ACE uses two valid TC styles:
+- **Public-surface TCs** — prove a user job from docs/usage/`--help` and the CLI itself.
+- **Retained-contract TCs** — pin a previously fragile integrated behavior with deterministic, explicitly declared evidence.
 ## Canonical Conventions
@@ -24,7 +29,7 @@ E2E tests are executed by an AI agent and reserved for behaviors that require re
   - `TC-*.verify.md`
   - `runner.yml.md`
   - `verifier.yml.md`
-- TC artifacts use `results/tc/{NN}/`
+- TC outcome artifacts use `results/tc/{NN}/`
 - Summary reports use `tcs-passed`, `tcs-failed`, `tcs-total`, and `failed[].tc`
 - Scenarios declare `tags` for discovery-time filtering via `--tags`/`--exclude-tags`
@@ -32,15 +37,24 @@ E2E tests are executed by an AI agent and reserved for behaviors that require re
 - Runner is **execution-only**:
   - perform user-like CLI actions in sandbox
-  - produce evidence files under `results/tc/{NN}/`
+  - produce only declared outcome evidence under `results/tc/{NN}/`
+  - return final runner observations through the harness contract
   - do not issue PASS/FAIL verdicts
   - do not perform verifier-style assertion/classification
+  - do not invent workarounds or hidden command recipes to compensate for docs/help/CLI gaps
 - Verifier is **verification-only**:
   - evaluate TC outcome from sandbox evidence
+  - use runner observations as the only non-filesystem secondary evidence source
   - apply an **impact-first** evidence order:
     1. sandbox/project state impact
-    2. explicit TC artifacts
-    3. debug captures (`stdout`, `stderr`, `*.exit`, metadata) only as fallback
+    2. runner observations
+    3. explicit TC artifacts that are true product outcomes
+    4. debug captures (`stdout`, `stderr`, `*.exit`, metadata) only as fallback
+- Artifact contract ownership:
+  - runner instructions and `scenario.yml` setup/layout declare verifier-visible artifact paths
+  - verifier consumes that contract; it does not create new required artifact paths
+  - grouped shorthand such as ``results/tc/02/help.stdout`, `.stderr`, `.exit`` counts as an exact declaration of all three files
+  - wildcard artifact paths such as `results/tc/02/output.*` are not valid declarations
 - Setup ownership:
   - sandbox preparation belongs to `scenario.yml` `setup:` + `fixtures/`
   - TC runner files must not define independent environment setup procedures
@@ -52,7 +66,40 @@ Before adding a TC, confirm the behavior needs:
 - real external tools/processes
 - real filesystem I/O and environment state
-If not, keep coverage in unit/integration tests.
+If not, keep coverage in `fast`/`feat` tests.
+## Public-Surface Gate
+Before keeping or adding a goal-style TC, confirm the user job is achievable from:
+- package README / usage docs
+- `--help`
+- declared fixtures/setup
+- the tool under test itself
+Reject or rewrite the TC if it depends on:
+- hidden recipes embedded in runner instructions
+- workaround branches for unsupported or undocumented behavior
+- direct supporting-tool probes as the primary oracle
+- internal details that are not necessary to prove the user job
+When an E2E failure shows that a valid user job is not discoverable from docs, usage guides, or `--help`, treat that as
+docs/help drift. Failure analysis must record the stale or missing public surface and the exact docs/help target to
+update instead of teaching the runner a workaround.
+## TC Style Selection
+Use **public-surface** style when the goal is a real user journey and the primary oracle should stay on user-visible behavior.
+Use **retained-contract** style when the integrated behavior matters but final sandbox state alone is not enough. In that case, small declared supporting captures are valid, for example:
+- `.stdout`, `.stderr`, `.exit`
+- `command.txt`
+- `path-check.txt`
+- `artifact-check.txt`
+Even retained-contract TCs must not rely on:
+- verifier-only artifact declarations
+- wildcard artifact paths
+- reflections, PASS/FAIL summaries, or verifier-facing manifests under `results/`
 ## Cost and Scope
@@ -79,6 +126,7 @@ The verifier is always-on for standalone goal-mode TCs in the CLI pipeline. For
 ## Scenario Layout
 ```text
+{package}/test/feat/**/*_test.rb
 {package}/test/e2e/TS-{AREA}-{NNN}-{slug}/
   scenario.yml
   runner.yml.md
@@ -101,10 +149,19 @@ This prevents duplicate assertions across test layers.
 ## Authoring Rules
 - Keep runner goals outcome-oriented and deterministic.
+- Keep runner goals aligned with the public user path; if the runner needs a workaround, surface that as friction rather than teaching the workaround.
 - Keep verifier expectations impact-first, then artifacts, then debug fallback.
 - Preserve strict TC pairing (`runner` + `verify`).
-- Keep outputs inside `results/tc/{NN}/`.
+- Keep `results/tc/{NN}/` for declared verifier-dependent evidence only.
+- Declare every verifier-dependent file path in runner instructions or scenario setup. Do not rely on verifier-only path references.
+- Allow small supporting captures only when they are explicitly declared and materially improve confidence.
+- Do not use wildcard artifact paths.
+- Do not instruct runners to create reflections, PASS/FAIL summaries, verifier-facing manifests, or ad hoc temp inputs in `results/`.
+- Do not judge success from runner-authored summaries when final sandbox state can prove the goal directly.
+- Use runner observations only to explain ambiguity or missing side effects, not to replace missing end-state evidence.
+- Treat any workaround noted in runner observations as a product/docs/help or scenario-design smell that must be fixed, not preserved.
 - Avoid hidden dependencies between TCs unless explicitly intended.
+- For `--watch` or other live-output commands, use a bounded-session pattern with explicit termination behavior and captured exit codes.
 ## Execution Artifacts
@@ -121,4 +178,13 @@ Before approving new/updated E2E tests:
 - [ ] `runner.yml.md` and `verifier.yml.md` exist
 - [ ] Every TC has both `.runner.md` and `.verify.md`
 - [ ] Artifacts are scoped to `results/tc/{NN}/`
-- [ ] Value-gate metadata is present (`e2e-justification`, `unit-coverage-reviewed`, `cost-tier`)
+- [ ] Every verifier-dependent artifact path is declared by runner/setup
+- [ ] No verifier depends on wildcard or verifier-only artifact paths
+- [ ] Verifier primary oracle is final sandbox state or real product output, not helper artifacts
+- [ ] Runner observations are the only non-filesystem secondary evidence source
+- [ ] TC style is explicit in the review (`public-surface` or `retained-contract`)
+- [ ] Scenario can be completed from docs/usage/`--help` without hidden recipes or workaround instructions
+- [ ] Any internal-detail assertion is part of the public contract or justified as retained-contract evidence
+- [ ] Any friction/workaround found during review is treated as a gap, not as a runner script opportunity
+- [ ] Failure analysis records docs/help drift from failed public user paths, or explicitly records `None`
+- [ ] Value-gate metadata is present (`e2e-justification`, `unit-coverage-reviewed`, `cost-tier`)
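
Reviewer note: the guide now states that grouped shorthand counts as an exact declaration of all three sibling files and that wildcard paths are invalid. The scenario loader itself is not in this diff. As a minimal sketch of those two rules, assuming the declaration arrives as a comma-separated string, a hypothetical `expand_artifact_declaration` might behave like this:

```ruby
# Hypothetical helper (not the gem's loader): expand grouped shorthand such
# as "results/tc/02/help.stdout, .stderr, .exit" into exact file paths and
# reject wildcard declarations.
def expand_artifact_declaration(declaration)
  cleaned = declaration.delete("`")
  if cleaned.include?("*")
    raise ArgumentError, "wildcard artifact paths are not valid: #{cleaned}"
  end

  first, *suffixes = cleaned.split(",").map(&:strip)
  return [first] if suffixes.empty?

  # Strip the first file's extension to get the shared base name.
  base = first.sub(/\.[^.\/]+\z/, "")
  [first] + suffixes.map { |suffix| "#{base}#{suffix}" }
end

expanded = expand_artifact_declaration("results/tc/02/help.stdout, .stderr, .exit")

wildcard_rejected =
  begin
    expand_artifact_declaration("results/tc/02/output.*")
    false
  rescue ArgumentError
    true
  end
```

Expanding at load time gives the verifier a fixed list of required files and leaves no pattern to match.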

data/handbook/guides/scenario-yml-reference.g.md CHANGED

@@ -46,14 +46,14 @@ Example: `ace-lint/test/e2e/TS-LINT-001-lint-pipeline/scenario.yml`
 |-------|------|---------|-------------|
 | `priority` | string | `medium` | Test priority: `high`, `medium`, `low` |
 | `tool-under-test` | string | — | Primary command/tool validated |
-| `sandbox-layout` | object | `{}` | Declared artifact paths and expected outputs |
+| `sandbox-layout` | object | `{}` | Directory-level outcome hints used to precreate `results/tc/*` paths and guide verification |
 | `duration` | string | — | Estimated duration (e.g., `~15min`) |
 | `timeout` | integer | — | Optional per-scenario execution timeout in seconds |
 | `automation-candidate` | boolean | `false` | Whether test is automatable |
 | `tags` | array | `[]` | Scenario tags for filtering with `--tags`/`--exclude-tags` (OR semantics) |
 | `cost-tier` | string | `smoke` | Run profile: `smoke`, `happy-path`, `deep` |
 | `e2e-justification` | string | — | Why E2E is needed |
-| `unit-coverage-reviewed` | array | `[]` | Unit/integration files reviewed |
+| `unit-coverage-reviewed` | array | `[]` | Deterministic test files reviewed (`test/fast` and/or `test/feat`) |
 | `requires` | object | — | Test prerequisites |
 | `setup` | array | `[]` | Setup directives before execution |
 | `last-verified` | string | — | Last successful verification date |
@@ -73,6 +73,11 @@ Pairing rule:
 Artifact layout conventions:
 - canonical: `results/tc/{NN}/`
 - avoid non-TC-scoped result folders
+- keep only declared verifier-dependent evidence under `results/tc/{NN}/`; runner observations live in harness reports, not sandbox helper files
+- file-level verifier checks must be declared by the runner; `sandbox-layout` does not replace exact file declarations
+- grouped shorthand such as ``results/tc/01/help.stdout`, `.stderr`, `.exit`` is valid for exact sibling captures
+- wildcard artifact paths are not supported
+- absence of a declared path is debug context, not a standalone failure reason
 Canonical summary report fields:
 - `tcs-passed`
@@ -83,6 +88,8 @@ Canonical summary report fields:
 Role contract:
 - `runner.yml.md` + `TC-*.runner.md` are execution-only.
 - `verifier.yml.md` + `TC-*.verify.md` are verification-only with impact-first checks.
+- Public-surface TCs should be solvable from the public surface (docs/usage/`--help` + tool under test) without hidden recipes or workaround instructions.
+- Retained-contract TCs may keep small declared supporting captures when they materially improve confidence.
 ## `requires` Object
@@ -92,6 +99,11 @@ requires:
   ruby: ">= 3.0"
 ```
+`requires.tools` rules:
+- declare execution prerequisites and supporting environment dependencies
+- do not use `requires.tools` as permission to make fallback probes the primary oracle
+- for ACE CLI scenarios, support tools are setup/dependency context unless the scenario is explicitly about that support tool itself
 ## `setup` Directives
 Available directives:
@@ -112,7 +124,7 @@ setup:
   - git-init
   - tmux-session:
       name-source: run-id
-  - run: "cp $PROJECT_ROOT_PATH/mise.toml mise.toml && mise trust mise.toml"
+  - run: "cp ${ACE_E2E_SOURCE_ROOT:-$PROJECT_ROOT_PATH}/mise.toml mise.toml && mise trust mise.toml"
   - copy-fixtures
   - run: git add -A && git commit -m "initial" --quiet
   - agent-env:
@@ -122,6 +134,7 @@ setup:
 Setup rules:
 - Setup is fail-fast. Do not hide setup failures with `|| true`.
 - Setup belongs in `scenario.yml` and fixtures, not in TC runner instructions.
+- Use setup to create prerequisite state, not verifier-facing helper files under `results/`.
 - If setup fails (for example, missing `mise trust` support), stop scenario execution and report infrastructure failure.
 ## Complete Example
@@ -137,17 +150,17 @@ cost-tier: smoke
 tags: [smoke, "use-case:lint"]
 e2e-justification: "Validates real subprocess behavior and report file generation"
 unit-coverage-reviewed:
-  - test/molecules/lint_runner_test.rb
-  - test/organisms/lint_orchestrator_test.rb
+  - test/fast/molecules/lint_runner_test.rb
+  - test/fast/organisms/lint_orchestrator_test.rb
 tool-under-test: ace-lint
 sandbox-layout:
-  results/tc/01/: "help artifacts"
+  results/tc/01/: "Goal 1 outcome artifacts"
 requires:
   tools: [ace-lint, standardrb, jq]
   ruby: ">= 3.0"
 setup:
   - git-init
-  - run: "cp $PROJECT_ROOT_PATH/mise.toml mise.toml && mise trust mise.toml"
+  - run: "cp ${ACE_E2E_SOURCE_ROOT:-$PROJECT_ROOT_PATH}/mise.toml mise.toml && mise trust mise.toml"
   - copy-fixtures
   - agent-env:
       PROJECT_ROOT_PATH: "."
@@ -179,4 +192,4 @@ test/e2e/TS-LINT-001-lint-pipeline/
 ├── TC-001-help-survey.runner.md
 ├── TC-001-help-survey.verify.md
 └── fixtures/
-```
+```
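
Reviewer note: the setup examples switch from `$PROJECT_ROOT_PATH` to `${ACE_E2E_SOURCE_ROOT:-$PROJECT_ROOT_PATH}`. In POSIX shells the `:-` form uses the fallback when the variable is unset or empty. A Ruby equivalent of that lookup, with the hypothetical name `source_root`, would be roughly:

```ruby
# Hypothetical helper mirroring ${ACE_E2E_SOURCE_ROOT:-$PROJECT_ROOT_PATH}:
# prefer ACE_E2E_SOURCE_ROOT, fall back when it is unset or empty.
def source_root(env)
  value = env["ACE_E2E_SOURCE_ROOT"]
  value.nil? || value.empty? ? env["PROJECT_ROOT_PATH"] : value
end

with_source = source_root({ "ACE_E2E_SOURCE_ROOT" => "/src", "PROJECT_ROOT_PATH" => "/sandbox" })
empty_source = source_root({ "ACE_E2E_SOURCE_ROOT" => "", "PROJECT_ROOT_PATH" => "/sandbox" })
unset_source = source_root({ "PROJECT_ROOT_PATH" => "/sandbox" })
```

The empty-string case matters: a plain `||` in Ruby would keep `""`, which is not what the shell expansion does.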

data/handbook/guides/tc-authoring.g.md CHANGED

@@ -29,9 +29,14 @@ Inline `.tc.md` and frontmatter `mode` values are no longer supported.
 - Scenario-level config files:
   - `runner.yml.md`
   - `verifier.yml.md`
-- TC artifacts write to `results/tc/{NN}/`
+- TC outcome artifacts write to `results/tc/{NN}/`
 - Summary counters use `tcs-passed`, `tcs-failed`, and `tcs-total`
+## TC Styles
+- **Public-surface**: prove a documented user job from docs/usage/`--help` and the CLI.
+- **Retained-contract**: pin an integrated behavior with deterministic, explicitly declared supporting evidence when end-state checks alone are insufficient.
 ## File Naming
 - `TC-{NNN}` — test case number (e.g., TC-001)
@@ -77,12 +82,14 @@ Run `ace-lint` and produce report artifacts for a valid file.
 ## Workspace
 - Root: sandbox directory
-- Output: `results/tc/01/`
+- Outcome artifacts: `results/tc/01/`
 ## Constraints
 - Use only sandbox paths
-- Keep evidence under `results/tc/01/`
+- Keep only declared verifier-dependent evidence under `results/tc/01/`
+- Declare exact paths for any verifier-dependent captures, for example ``results/tc/01/help.stdout`, `.stderr`, `.exit``
+- Do not place helper inputs, manifests, PASS/FAIL summaries, or reflections under `results/tc/01/`
 - Execute actions only; do not assign PASS/FAIL or final verdicts
 ```
@@ -102,6 +109,7 @@ Example:
 - **Impact Checks**: target sandbox/project state changed as expected
 - **Artifact Checks**: `results/tc/01/report.json` exists and is valid
+- **Runner Observations**: use harness-provided end-of-run observations only as supporting context
 - **Debug Fallback**: inspect `stdout`/`stderr`/`*.exit` only when primary checks are inconclusive
 ## Verdict
@@ -120,12 +128,22 @@ Pass only when all expectations are satisfied by on-disk evidence.
 - Keep each TC focused on one coherent behavior path.
 - Ensure goal numbers and TC numbers remain aligned (`TC-001` -> Goal 1).
+- Choose the TC style up front: `public-surface` or `retained-contract`.
 - Keep runner files execution-only and verifier files verdict-only.
 - Make verifier expectations deterministic with impact-first ordering.
-- Keep all artifacts under `results/tc/{NN}/` to avoid cross-goal contamination.
+- Keep `results/tc/{NN}/` for declared verifier-dependent evidence only.
+- Declare every verifier-dependent path in the runner or setup. Do not rely on verifier-only references.
+- Grouped capture shorthand is valid only for exact sibling files, for example ``foo.stdout`, `.stderr`, `.exit``.
+- Do not use wildcard artifact paths.
+- Use harness-provided runner observations as the only non-filesystem secondary evidence source.
+- Prefer final sandbox state and real product output over raw debug captures.
+- Do not ask the runner to write setup inputs, audit manifests, verifier-facing summaries, or final reflections for the verifier.
+- Do not teach the runner hidden recipes or workaround sequences; if the path is not discoverable from docs/usage/`--help`, the TC is wrong or the public surface needs improvement.
+- Use runner observations to record friction and workaround pressure, not to normalize it.
+- For watch/live-output flows, use a bounded-session pattern with explicit shutdown and captured exit code.
 - Record why each scenario remains E2E via `e2e-justification` and `unit-coverage-reviewed` in `scenario.yml`.
 ## Related
 - [scenario.yml Reference](scenario-yml-reference.g.md)
-- [E2E Testing Guide](e2e-testing.g.md)
+- [E2E Testing Guide](e2e-testing.g.md)

data/handbook/skills/as-e2e-fix/SKILL.md CHANGED

@@ -1,6 +1,6 @@
 ---
 name: as-e2e-fix
-description: Diagnose, fix, and rerun failing E2E tests systematically
+description: Diagnose, fix, and rerun failing E2E tests systematically, generating failure analysis when needed
 # context: no-fork
 # agent: general-purpose
 user-invocable: true
@@ -32,4 +32,4 @@ skill:
     workflow: wfi://e2e/fix
 ---
-Load and run `ace-bundle wfi://e2e/fix` in the current project, then follow the loaded workflow as the source of truth and execute it end-to-end instead of only summarizing it.
+Load and run `ace-bundle wfi://e2e/fix` in the current project, then follow the loaded workflow as the source of truth and execute it end-to-end instead of only summarizing it. If E2E failure analysis is missing or incomplete, generate it via `wfi://e2e/analyze-failures` as part of the fix workflow before applying changes.