RubyGems - ace-test-runner-e2e - Versions diffs - 0.29.6 → 0.38.11 - Mend

ace-test-runner-e2e 0.29.6 → 0.38.11


Files changed (49)

data/handbook/workflow-instructions/e2e/plan-changes.wf.md CHANGED

@@ -71,6 +71,7 @@ For REMOVE due to overlap, replacement evidence is mandatory:
 **KEEP** — The TC has genuine E2E value and needs no changes. Criteria (all must be true):
 - TC passes the E2E Value Gate (tests real CLI binary + external tools + filesystem I/O)
+- TC passes the Public-Surface Gate (user can do the job from docs/usage/`--help` without hidden recipes or workarounds)
 - Related source code has no changes since `last-verified`
 - TC structure is valid and assertions are current
@@ -79,6 +80,7 @@ For REMOVE due to overlap, replacement evidence is mandatory:
 - TC scope is too broad (should be narrowed to only E2E-exclusive aspects)
 - TC scope is too narrow (missing assertions for related behavior in same CLI invocation)
 - TC has structure issues flagged in the review
+- TC is hidden-recipe-driven or workaround-driven but the underlying user job should still be supported by the public surface after scenario/docs/help correction
 **CONSOLIDATE** — The TC should merge with another TC. Criteria (any one is sufficient):
 - Multiple TCs share the same CLI invocation and could be a single TC with multiple assertions
@@ -91,6 +93,7 @@ For each classification, document:
 - For REMOVE (overlap): replacement evidence (`existing unit tests` or `planned unit backfill`)
 - For MODIFY: what specifically needs to change
 - For CONSOLIDATE: the target TC and which assertions merge
+- Whether the current TC is public-surface-valid, hidden-recipe-driven, workaround-driven, or checking an unsupported internal detail
 ### 4. Identify New TCs Needed
@@ -112,6 +115,12 @@ For each candidate, answer: "Does this require the full CLI binary + real extern
 - If NO: skip — unit tests cover this (or add explicit unit test action if coverage is missing)
 - If YES: include in the plan
+**Filter through Public-Surface Gate:**
+For each candidate, answer: "Can a user do this job through the public tool surface without hidden recipes or workarounds?"
+- If NO because the job should be supported: add a product/docs/help improvement action and do not encode the workaround into the TC
+- If NO because the detail is not user-visible: skip or narrow the TC
+- If YES: keep planning the TC
 ### 5. Propose Scenario Structure
 Group all planned TCs (KEEP + MODIFY + CONSOLIDATE targets + ADD) into scenarios:
@@ -176,6 +185,13 @@ Format the complete change plan:
 |----|---------------|
 | {tc-id} | Update assertions — {feature} behavior changed in {commit} |
 | {tc-id} | Narrow scope — remove assertions covered by unit tests |
+| {tc-id} | Remove hidden recipe / workaround dependence — rewrite around public docs/help/CLI path |
+### Public-Surface Gaps ({n} actions)
+| Action | Target | Why |
+|--------|--------|-----|
+| Update docs/help/CLI | {package/path} | {job is valid but current public surface is too weak for the E2E path} |
 ### CONSOLIDATE ({n} TCs → {n} TCs)
@@ -252,4 +268,4 @@ implementation code and at least one E2E test before planning changes.
 If the user rejects the plan:
 1. Ask which classifications they disagree with
 2. Adjust the plan based on feedback
-3. Re-present the updated plan
+3. Re-present the updated plan
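
The Public-Surface Gate filter added to step 4 of this file reduces to a three-way decision per candidate TC. A minimal Ruby sketch of that decision follows; the method name, keyword arguments, and returned symbols are hypothetical illustrations and are not part of the package.

```ruby
# Hypothetical helper (not from the gem): maps the two Public-Surface Gate
# answers to the planning action described in the workflow text.
#
# doable:              "Can a user do this job through the public tool
#                      surface without hidden recipes or workarounds?"
# should_be_supported: when the answer is NO, whether the job is one the
#                      public surface ought to support (as opposed to a
#                      detail that is not user-visible).
def public_surface_action(doable:, should_be_supported:)
  return :keep_planning if doable
  # Valid job, weak public surface: plan a product/docs/help improvement
  # and do not encode the workaround into the TC.
  return :add_docs_help_action if should_be_supported

  # Not user-visible: skip or narrow the TC.
  :skip_or_narrow
end

p public_surface_action(doable: false, should_be_supported: true)
```

The ordering matters only in that a YES answer short-circuits everything else; the two NO branches are mutually exclusive in the workflow text.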

data/handbook/workflow-instructions/e2e/review.wf.md CHANGED

@@ -13,7 +13,9 @@ This workflow performs deep exploration of a package to produce a **coverage mat
 During review, treat the runner/verifier split as a first-class quality check:
 - Runner must be execution-only (no verdict language).
-- Verifier must be impact-first (sandbox impact before artifacts/debug).
+- Verifier must be impact-first (sandbox impact before runner observations and debug).
+- `results/tc/{NN}/` must not be used for helper inputs or verifier-feeding helper reports.
+- Goal-style TCs must also pass the public-surface check: the runner should be able to do the job from docs/usage/`--help` and the tool under test, without hidden recipes or workarounds.
 **Pipeline position:** Stage 1 of 3 (Explore)
@@ -86,9 +88,8 @@ Map what unit tests cover at each layer:
 **List all test files by layer:**
 ```bash
-find {PACKAGE}/test/atoms -name "*_test.rb" 2>/dev/null | sort
-find {PACKAGE}/test/molecules -name "*_test.rb" 2>/dev/null | sort
-find {PACKAGE}/test/organisms -name "*_test.rb" 2>/dev/null | sort
+find {PACKAGE}/test/fast -name "*_test.rb" 2>/dev/null | sort
+find {PACKAGE}/test/feat -name "*_test.rb" 2>/dev/null | sort
 ```
 **For each test file:**
@@ -100,7 +101,7 @@ Build a unit test map:
 | Test File | Layer | Feature Covered | Test Count | Assertion Count |
 |-----------|-------|-----------------|------------|-----------------|
-| {path} | atom | {feature} | {n} | {n} |
+| {path} | fast/feat | {feature} | {n} | {n} |
 ### 4. Inventory Existing E2E Coverage
@@ -116,20 +117,32 @@ find {PACKAGE}/test/e2e -name "scenario.yml" -path "*/TS-*" 2>/dev/null | sort
   - `tags`, `cost-tier`, `e2e-justification`, `unit-coverage-reviewed`
   - `last-verified`, `verified-by`
 - Extract the objective (what the TC verifies)
+- Record the TC's primary oracle:
+  - final sandbox state / real product output
+  - runner observations as supporting context
+  - debug fallback only when necessary
+- Record whether the job is achievable from the public surface:
+  - `valid`
+  - `hidden-recipe-driven`
+  - `workaround-driven`
+  - `unsupported-detail`
+- Record qualitative friction:
+  - `low`, `medium`, `high`
 - Identify which CLI commands the TC runs
-- Count verification steps (PASS/FAIL checks)
+- Record command fingerprint (`command + key flags`) for each command assertion
 - Map to the feature it tests
 - Mark TC evidence status:
-  - `complete` when `e2e-justification` is present and `unit-coverage-reviewed` has at least one path
+  - `complete` when `e2e-justification` is present, the verifier is end-state-first, and `unit-coverage-reviewed` has at least one path
   - `missing` otherwise
+  - `at-risk` when evidence is existence-only, helper-artifact-driven, duplicate command invocations are detected, or the TC is hidden-recipe/workaround-driven
 If `--scope` was provided, filter to only the specified scenario.
 Build an E2E test map:
-| TC ID | Title | CLI Command | Feature Tested | Verifications | Tags | Cost Tier | E2E Justification | Unit Coverage Reviewed | Evidence |
-|-------|-------|-------------|----------------|---------------|------|-----------|-------------------|------------------------|----------|
-| {id} | {title} | {command} | {feature} | {n} | {tags} | {tier} | {reason or "(missing)"} | {files or "(missing)"} | {complete/missing} |
+| TC ID | Title | Command Invocations | Feature Tested | Primary Oracle | Public Surface Fit | Friction | Tags | Cost Tier | E2E Justification | Unit Coverage Reviewed | Evidence | False-Positive Risk |
+|-------|-------|-------------|----------------|----------------|--------------------|----------|------|-----------|-------------------|------------------------|----------|---------------------|
+| {id} | {title} | {command list} | {feature} | {state / output / observations+fallback} | {valid/hidden-recipe/workaround/unsupported-detail} | {low/medium/high} | {tags} | {tier} | {reason or "(missing)"} | {files or "(missing)"} | {complete/missing/at-risk} | {low/medium/high} |
 ### 5. Build Coverage Matrix
@@ -137,19 +150,19 @@ Combine the three inventories into a single coverage matrix:
 **Matrix structure:**
 - **Rows:** Features/behaviors from step 2
-- **Columns:** Unit Tests (atoms/molecules/organisms) | E2E Tests
+- **Columns:** Unit Tests (`fast`/`feat`) | E2E Tests
 - **Cells:** Test file references + counts, or "none"
 ```markdown
 ### Coverage Matrix
-| Feature | Unit Tests | E2E Tests | Status |
-|---------|-----------|-----------|--------|
-| {feature} | {test files} ({n} assertions) | {TC IDs} ({n} verifications) | Covered |
-| {feature} | {test files} ({n} assertions) | none | Unit-only |
-| {feature} | none | {TC IDs} ({n} verifications) | E2E-only |
-| {feature} | {test files} ({n} assertions) | {TC IDs} ({n} verifications) | Overlap |
-| {feature} | none | none | Gap |
+| Feature | Unit Tests | E2E Tests | Evidence Strength | False-Positive Risk | Status |
+|---------|-----------|-----------|------------------|----------------------|--------|
+| {feature} | {test files} ({n} assertions) | {TC IDs} | state+content + observations | low | Covered |
+| {feature} | {test files} ({n} assertions) | none | none | n/a | Unit-only |
+| {feature} | none | {TC IDs} | state+content | low | E2E-only |
+| {feature} | {test files} ({n} assertions) | {TC IDs} | debug-heavy, helper-artifact-driven, or workaround-driven | medium/high | Overlap |
+| {feature} | none | none | none | high | Gap |
 ```
 **Classify each row:**
@@ -158,6 +171,7 @@ Combine the three inventories into a single coverage matrix:
 - **E2E-only** — E2E test exists but no unit test. Valid if the behavior is inherently E2E (subprocess execution, filesystem discovery).
 - **Overlap** — Both unit and E2E test the same assertions. E2E TC is a candidate for removal.
 - **Gap** — Neither unit nor E2E test covers this feature. Needs investigation.
+- If a row has `false-positive risk` `high`, downgrade Covered/Overlap to **manual-review** until evidence is corrected.
 ### 6. Generate Review Report
@@ -168,7 +182,7 @@ Produce the full review report with actionable findings:
 **Reviewed:** {timestamp}
 **Scope:** {package-wide or scenario-id}
-**Workflow version:** 2.1
+**Workflow version:** 2.2
 ### Summary
@@ -179,7 +193,8 @@ Produce the full review report with actionable findings:
 | Unit assertions | {n} |
 | E2E scenarios | {n} |
 | E2E test cases | {n} |
-| TCs with decision evidence | {n}/{total} |
+| TCs with end-state-first evidence | {n}/{total} |
+| High-risk helper-artifact TCs | {n}/{total} |
 ### Coverage Matrix
@@ -187,23 +202,24 @@ Produce the full review report with actionable findings:
 ### Overlap Analysis
-TCs that may fail the E2E Value Gate (unit tests cover the same behavior):
+TCs that may fail the E2E Value Gate (unit tests cover the same behavior or high false-positive risk):
 | TC ID | Feature | Overlapping Unit Tests | Recommendation |
 |-------|---------|----------------------|----------------|
 | {id} | {feature} | {test files} | Remove — unit tests cover this fully |
-| {id} | {feature} | {test files} | Keep — TC tests CLI pipeline, units test logic |
+| {id} | {feature} | {test files} | Keep — TC tests real CLI journey and final integrated outcome |
+| {id} | {feature} | {test files} | Strengthen — currently helper-artifact-driven, workaround-driven, or debug-heavy |
 **Candidates for removal:** {n} TCs have full overlap with unit tests
 ### E2E Decision Record Coverage
-| TC ID | Evidence Status | Missing Fields |
-|-------|------------------|----------------|
-| {id} | complete | none |
-| {id} | missing | e2e-justification, unit-coverage-reviewed |
+| TC ID | Evidence Status | Public Surface Fit | Friction | Missing Fields / Contract Drift |
+|-------|------------------|--------------------|----------|-------------------------------|
+| {id} | complete | valid | low | none |
+| {id} | missing | hidden-recipe-driven | high | e2e-justification, unit-coverage-reviewed, end-state oracle |
-**Action:** Any TC with missing evidence should be updated in `scenario.yml` during the next rewrite cycle.
+**Action:** Any TC with missing evidence, helper-artifact drift, hidden recipes, workaround dependence, or unsupported internal-detail checks should be updated during the next rewrite cycle.
 ### Gap Analysis
@@ -283,4 +299,4 @@ Package '{package}' not found.
 Available packages:
 {list of ace-* directories}
-```
+```
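
The new classification rule in step 5 (rows with high false-positive risk are downgraded from Covered or Overlap to manual-review) can be sketched in Ruby as below. The method name and the use of plain strings for statuses are assumptions made for illustration; the diff does not show how the package represents these values, if at all.

```ruby
# Hypothetical helper (not from the gem): applies the downgrade rule from
# the coverage matrix classification. Only Covered and Overlap rows are
# downgraded; Unit-only, E2E-only, and Gap rows keep their status.
DOWNGRADABLE_STATUSES = %w[Covered Overlap].freeze

def effective_status(status, false_positive_risk)
  if false_positive_risk == "high" && DOWNGRADABLE_STATUSES.include?(status)
    "manual-review"
  else
    status
  end
end

puts effective_status("Overlap", "high")
puts effective_status("Gap", "high")
```

A Gap row keeps its status even at high risk, which matches the matrix example where the Gap row is listed with `high` risk.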

data/handbook/workflow-instructions/e2e/rewrite.wf.md CHANGED

@@ -30,7 +30,8 @@ ace-bundle wfi://e2e/review  →  ace-bundle wfi://e2e/plan-changes  →  ace-bu
 - Keep scenario IDs in `TS-<PACKAGE_SHORT>-<NNN>[-slug]`
 - Keep standalone pairs as `TC-*.runner.md` + `TC-*.verify.md`
-- Keep TC artifact outputs under `results/tc/{NN}/`
+- Keep TC outcome artifacts under `results/tc/{NN}/`
+- Keep runner observations in harness reports, not sandbox helper files
 - Keep summary report fields as `tcs-passed`, `tcs-failed`, `tcs-total`, `failed[].tc`
 - CLI split reminder:
   - `ace-test-e2e` runs single-package tests
@@ -41,6 +42,8 @@ ace-bundle wfi://e2e/review  →  ace-bundle wfi://e2e/plan-changes  →  ace-bu
 - Normalize runner files to execution-only language.
 - Normalize verifier files to verdict-only, impact-first validation.
 - Keep setup concerns in `scenario.yml` and fixtures, not in TC runner setup sections.
+- Remove helper artifact requirements from `results/tc/{NN}/`; use runner observations instead.
+- Rewrite goal-style TCs around the public user path. Do not preserve hidden recipes, workaround branches, or supporting-tool probes as the way the runner reaches the goal.
 ## Workflow Steps
@@ -120,11 +123,18 @@ Follow the E2E test writing rules:
 - **Run the tool first** to verify actual behavior before writing assertions
 - Apply the E2E Value Gate — every TC must require real CLI binary + external tools + filesystem I/O
-- Use `&& echo "PASS" || echo "FAIL"` patterns for every verification step
 - Follow TC ordering: error paths first, happy path, structure verification, lifecycle, end state
 - Consolidate assertions sharing the same CLI invocation into a single TC
 - Target 2-5 TCs per scenario
 - Test through the CLI interface, not library imports
+- Write runner goals as “do the job” outcomes, not “write a report for the verifier” chores
+- Keep `results/tc/{NN}/` for real outcomes only; avoid helper YAML, path files, command files, and reflections
+- Use runner observations as the only non-filesystem secondary evidence source
+- Make final sandbox state or real product output the primary oracle whenever possible
+- Add behavioral/content assertions only when CLI output itself is part of the user-visible outcome
+- Remove duplicate command-only TCs; fold related assertions into one TC where possible
+- Do not encode exact workaround procedures, hidden command recipes, or internal debugging tricks the user would not infer from docs/usage/`--help`
+- If the job is valid but the public surface is too weak, plan a product/docs/help fix instead of hardcoding the workaround into the TC
 **Load the TC template for reference:**
 ```bash
@@ -141,6 +151,9 @@ For each TC classified as MODIFY:
    - **Narrow scope** — remove assertions that unit tests cover, keep only E2E-exclusive checks
    - **Broaden scope** — add assertions for related behavior tested by the same CLI invocation
    - **Fix structure** — add missing sections, fix formatting issues
+   - **Replace helper-artifact oracles** — if the existing TC relies on runner-written helper files, rewrite it around final sandbox state plus runner observations
+   - **Add evidence gates** — if the existing TC relies on existence-only or missing end-state checks, strengthen the primary oracle before falling back to debug captures
+   - **Remove hidden recipes/workarounds** — if the existing TC teaches the runner how to bypass the public surface, rewrite it around the supported user path or narrow/remove the TC
 3. Update the `last-verified` field if the TC was re-run during modification
 4. Write the updated TC runner/verifier files
@@ -228,6 +241,7 @@ Present the execution summary:
 - [ ] TC count matches plan: {yes/no}
 - [ ] No stale references: {yes/no}
 - [ ] All scenarios have 2-5 TCs: {yes/no}
+- [ ] Modified/created TCs avoid helper files in `results/tc/{NN}/`: {yes/no}
 ### Next Steps
@@ -278,4 +292,4 @@ If execution fails partway through:
 1. Report which actions completed and which failed
 2. Do not attempt to roll back completed actions
 3. Show the state of `{PACKAGE}/test/e2e/` after partial execution
-4. Suggest re-running with the remaining actions
+4. Suggest re-running with the remaining actions

data/handbook/workflow-instructions/e2e/run.wf.md CHANGED

@@ -1,4 +1,12 @@
 ---
+name: e2e-run
+description: Execute an E2E test scenario with full agent guidance
+allowed-tools:
+- Bash(ace-bundle:*)
+- Read
+- Write
+- Glob
+- Grep
 doc-type: workflow
 title: Run E2E Test Workflow
 purpose: Execute an E2E test scenario with full agent guidance
@@ -13,7 +21,7 @@ This workflow guides an agent through executing an E2E test scenario. It support
 ## Arguments
-- `PACKAGE` (optional) - Package containing the test (e.g., `ace-lint`). If omitted, looks for `test/e2e/` in project root.
+- `PACKAGE` (optional) - Package containing the test (e.g., `ace-lint`). If omitted, discovery uses `test/feat/` and `test/e2e/` in the project root.
 - `TEST_ID` (optional) - Test identifier (e.g., `TS-LINT-001`). If omitted, runs all tests.
 - `--run-id RUN_ID` (optional) - Pre-generated timestamp ID for deterministic report paths.
 - `--report-dir PATH` (optional) - Explicit report directory path (skips computed `${TEST_DIR}-reports`).
@@ -33,18 +41,25 @@ This workflow guides an agent through executing an E2E test scenario. It support
 - `ace-test-e2e` runs single-package scenarios; `ace-test-e2e-suite` runs suite-level execution
 - Scenario IDs: `TS-<PACKAGE_SHORT>-<NNN>[-slug]`
 - Standalone TC pairs: `TC-*.runner.md` + `TC-*.verify.md`
-- TC artifacts: `results/tc/{NN}/`
+- TC outcome artifacts: `results/tc/{NN}/`
 - Summary counters: `tcs-passed`, `tcs-failed`, `tcs-total`, `failed[].tc`
 - Tag filtering happens at discovery time (before sandbox setup)
 ## Execution Contract
-- Runner instructions are execution-only: perform actions and write evidence.
+- Runner instructions are execution-only: perform actions and return final observations.
+- The runner should follow the public user path from docs/usage/`--help` and the tool under test itself. Do not encode or normalize hidden recipes and workarounds.
 - Verifier instructions are verification-only: assign verdicts using impact-first checks:
   1. sandbox/project state impact
-  2. explicit artifacts
-  3. debug captures as fallback
+  2. runner observations
+  3. explicit outcome artifacts
+  4. debug captures as fallback
 - Do not place ad-hoc setup logic in TC runner files; sandbox setup belongs to `scenario.yml` and fixtures.
+- Do not place helper inputs, reflections, or temp manifests under `results/tc/{NN}/`.
+- Do not ask the runner to write verifier-facing summaries or audit files when final sandbox state can prove the goal directly.
+- If the runner observations show a workaround was needed, treat that as a docs/help/product or scenario-design gap, not a successful steady-state contract.
 ## Execution Environment Guardrail
@@ -55,12 +70,12 @@ This workflow guides an agent through executing an E2E test scenario. It support
 For CLI providers (`ace-test-e2e`), the deterministic 6-phase pipeline handles execution automatically:
-1. **Setup** — `SetupExecutor` creates sandbox (git init, mise.toml, .ace symlinks, `results/tc/{NN}/` dirs)
-2. **Runner prompt** — `SkillPromptBuilder` assembles context from `runner.yml.md` + `TC-*.runner.md`
-3. **Runner LLM** — Agent executes TC steps in sandbox, produces artifacts
-4. **Verifier prompt** — `SkillPromptBuilder` assembles context from `verifier.yml.md` + `TC-*.verify.md`
-5. **Verifier LLM** — Independent agent evaluates artifacts against expectations
-6. **Report** — `PipelineReportGenerator` produces deterministic summary
+1. **Setup** -- `SetupExecutor` creates sandbox (git init, mise.toml, .ace symlinks, `results/tc/{NN}/` dirs)
+2. **Runner prompt** -- `SkillPromptBuilder` assembles context from `runner.yml.md` + `TC-*.runner.md`
+3. **Runner LLM** -- Agent executes TC steps in sandbox and returns final observations
+4. **Verifier prompt** -- `SkillPromptBuilder` assembles context from `verifier.yml.md` + `TC-*.verify.md` and includes runner observations
+5. **Verifier LLM** -- Independent agent evaluates artifacts against expectations
+6. **Report** -- `PipelineReportGenerator` produces deterministic summary and persists runner observations in harness-managed reports
 When this workflow is invoked directly (not via CLI pipeline), the agent performs steps 1-6 manually using the workflow steps below.
@@ -83,7 +98,8 @@ When invoked as a subagent (via a batch orchestrator such as an assignment fan-o
 - **Failed**: {count}
 - **Total**: {count}
 - **Report Paths**: {timestamp}-{short-pkg}-{short-id}.*
-- **Issues**: Brief description or "None"
+- **Observations**: Brief factual summary or "None"
+- **Issues**: Brief description or "None" (legacy alias if `Observations` is unavailable)
 ```
 Do NOT return full report contents, detailed TC output, or setup logs.
@@ -95,6 +111,7 @@ Do NOT return full report contents, detailed TC output, or setup logs.
 When invoked with `--tc-mode`, the sandbox is pre-populated by `SetupExecutor` and only a single TC is executed. Steps 1-5 of standard mode are skipped.
 **TC-Level Arguments:**
 - `PACKAGE` (required), `TEST_ID` (required), `TC_ID` (required)
 - `--tc-mode` (required), `--sandbox SANDBOX_PATH` (required)
 - `--run-id RUN_ID` (optional), `--env KEY=VALUE,...` (optional)
@@ -108,7 +125,8 @@ When invoked with `--tc-mode`, the sandbox is pre-populated by `SetupExecutor` a
 6. Return TC-level contract
 **TC-Level Rules:**
-- Do NOT create or modify sandbox — `SetupExecutor` already prepared it
+- Do NOT create or modify sandbox -- `SetupExecutor` already prepared it
 - Always export `--env` variables before executing test steps
 - Report actual results even if they differ from expected
@@ -138,6 +156,7 @@ If no tests found after filtering, report error and exit.
 ### 2. Read Test Scenario
 For each scenario file, read and parse:
 - `test-id`, `title`, `priority`, `duration`, `requires`, `tags`
 **Multiple tests:** Execute steps 2-7 for each scenario sequentially, then generate a combined summary.
@@ -172,22 +191,24 @@ Report missing prerequisites before proceeding.
 **Pre-generated Run ID:** If `--run-id` was provided, set `TIMESTAMP_ID=$RUN_ID` instead of generating a new one.
 **Directory naming convention:**
-- `{timestamp}` — 6-char base36 timestamp
-- `{short-pkg}` — package without `ace-` prefix (e.g., `lint`)
-- `{short-id}` — lowercase prefix + number (e.g., `ts001`)
+- `{timestamp}` -- 6-char base36 timestamp
+- `{short-pkg}` -- package without `ace-` prefix (e.g., `lint`)
+- `{short-id}` -- lowercase prefix + number (e.g., `ts001`)
 ```
 .ace-local/test-e2e/
 ├── 8osvnh-lint-ts001/          # Sandbox
 ├── 8osvnh-lint-ts001-reports/  # Reports (summary.r.md, experience.r.md, metadata.yml)
-└── 8osynv-final-report.md     # Suite report (sibling)
+└── 8osynv-suite-report.md     # Suite report (sibling)
 ```
 **Expected variables after setup:**
-- `PROJECT_ROOT` — Original project directory
-- `TEST_DIR` — Sandbox directory (cwd after setup)
-- `REPORTS_DIR` — Reports directory
-- `TIMESTAMP_ID` — Unique run identifier
+- `PROJECT_ROOT` -- Original project directory
+- `TEST_DIR` -- Sandbox directory (cwd after setup)
+- `REPORTS_DIR` -- Reports directory
+- `TIMESTAMP_ID` -- Unique run identifier
 ### 4.1 Sandbox Isolation Checkpoint (MANDATORY)
@@ -198,7 +219,7 @@ echo "=== SANDBOX ISOLATION CHECK ==="
 CURRENT_DIR="$(pwd)"
 [[ "$CURRENT_DIR" == *".ace-local/test-e2e/"* ]] && echo "PASS: In sandbox" || echo "FAIL: NOT in sandbox"
 git rev-parse --git-dir >/dev/null 2>&1 && { [ -z "$(git remote -v 2>/dev/null)" ] && echo "PASS: No remotes" || echo "FAIL: Remotes found"; } || echo "PASS: No git"
-[ -f "CLAUDE.md" ] || [ -f "Gemfile" ] || [ -d ".ace-taskflow" ] && echo "FAIL: Project markers found" || echo "PASS: No markers"
+[ -f "CLAUDE.md" ] || [ -f "Gemfile" ] || [ -d ".ace-task" ] && echo "FAIL: Project markers found" || echo "PASS: No markers"
 echo "=== END CHECK ==="
 ```
@@ -208,7 +229,7 @@ echo "=== END CHECK ==="
 ### 5. Create Test Data
 > **Use `ace-test-e2e-sh "$TEST_DIR"` for ALL commands after setup.**
-> Each bash block runs in a fresh shell — the wrapper ensures sandbox isolation.
+> Each bash block runs in a fresh shell -- the wrapper ensures sandbox isolation.
 Execute test data creation commands from the scenario, writing files inside `$TEST_DIR/`.
@@ -219,9 +240,9 @@ Execute test data creation commands from the scenario, writing files inside `$TE
 If `FILTERED_CASES` is set, execute only matching TCs. Otherwise execute all.
 For each TC (TC-NNN):
-1. **Check filter** — skip if not in `FILTERED_CASES`
+1. **Check filter** -- skip if not in `FILTERED_CASES`
 2. **Read** the runner file (`TC-NNN-*.runner.md`)
-3. **Execute** runner steps, save artifacts to `results/tc/{NN}/`
+3. **Execute** runner steps and create only final outcome artifacts under `results/tc/{NN}/`
 4. **Verify** against paired `.verify.md` expectations
 5. **Record** status (Pass/Fail) with evidence
@@ -232,6 +253,7 @@ Track friction points during execution for the experience report.
 Write three report files to the reports directory.
 **Report path setup:**
 ```bash
 REPORT_DIR="${PROVIDED_REPORT_DIR:-${TEST_DIR}-reports}"
 mkdir -p "$REPORT_DIR"
@@ -302,6 +324,7 @@ Sandbox directories in `.ace-local/test-e2e/` are gitignored.
 Summarize execution in the response. Reports are persisted to disk.
 **Single test:**
 ```markdown
 ## E2E Test Execution Report
 **Test ID:** {test-id} | **Package:** {package} | **Status:** {PASS/FAIL}
@@ -316,6 +339,7 @@ Reports: `.ace-local/test-e2e/{timestamp}-{short-pkg}-{short-id}-reports/`
 ### 10. Update Test Scenario
 If all tests pass, update `scenario.yml`:
 ```yaml
 last-verified: {today's date}
 verified-by: claude-{model}
@@ -352,4 +376,4 @@ ace-test-e2e ace-lint --exclude-tags deep
 # All tests in project root
 ace-test-e2e
-```
+```
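
The directory naming convention in this file (`{timestamp}-{short-pkg}-{short-id}` for the sandbox, with a `-reports` suffix for the reports directory) can be illustrated with a short Ruby sketch. The helper names are hypothetical, and the sketch assumes that `<PACKAGE_SHORT>` in a scenario ID contains no hyphen; the package's own implementation is not shown in this diff.

```ruby
# Hypothetical helpers (not from the gem) illustrating the naming convention.

# "ace-lint" -> "lint"
def short_pkg(package)
  package.delete_prefix("ace-")
end

# "TS-LINT-001" or "TS-LINT-001-some-slug" -> "ts001"
# Assumes the ID shape TS-<PACKAGE_SHORT>-<NNN>[-slug] with no hyphen
# inside <PACKAGE_SHORT>.
def short_id(test_id)
  prefix, _package_short, number = test_id.split("-", 4)
  "#{prefix.downcase}#{number}"
end

def run_dirs(timestamp:, package:, test_id:)
  sandbox = "#{timestamp}-#{short_pkg(package)}-#{short_id(test_id)}"
  { sandbox: sandbox, reports: "#{sandbox}-reports" }
end

dirs = run_dirs(timestamp: "8osvnh", package: "ace-lint", test_id: "TS-LINT-001")
puts dirs[:sandbox] # 8osvnh-lint-ts001
puts dirs[:reports] # 8osvnh-lint-ts001-reports
```

The printed names match the example tree shown in the workflow (`8osvnh-lint-ts001/` and `8osvnh-lint-ts001-reports/`).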

data/handbook/workflow-instructions/e2e/setup-sandbox.wf.md CHANGED

@@ -132,7 +132,7 @@ else
 fi
 # Check 3: Project root markers should NOT exist
-if [ -f "CLAUDE.md" ] || [ -f "Gemfile" ] || [ -d ".ace-taskflow" ]; then
+if [ -f "CLAUDE.md" ] || [ -f "Gemfile" ] || [ -d ".ace-task" ]; then
   echo "FAIL: Main project markers found - NOT an isolated repo!"
   echo "  ACTION: STOP - You are in the main repository."
 else
@@ -321,7 +321,7 @@ Add setup directives to `scenario.yml`:
 # scenario.yml
 setup:
   - git-init
-  - run: "cp $PROJECT_ROOT_PATH/mise.toml mise.toml && mise trust mise.toml"
+  - run: "cp ${ACE_E2E_SOURCE_ROOT:-$PROJECT_ROOT_PATH}/mise.toml mise.toml && mise trust mise.toml"
   - copy-fixtures
   - agent-env:
       PROJECT_ROOT_PATH: "."
@@ -405,7 +405,7 @@ else
 fi
 # Check 3: Project markers
-if [ -f "CLAUDE.md" ] || [ -f "Gemfile" ] || [ -d ".ace-taskflow" ]; then
+if [ -f "CLAUDE.md" ] || [ -f "Gemfile" ] || [ -d ".ace-task" ]; then
   echo "FAIL: Main project markers found!"
   exit 1
 else
@@ -458,4 +458,4 @@ ace-test-e2e-sh "$REPO_DIR" git status
 ## See Also
 - [E2E Testing Guide](guide://e2e-testing)
-- [Test Suite Health](guide://test-suite-health)
+- [Test Suite Health](guide://test-suite-health)
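
The `scenario.yml` change in this file swaps `$PROJECT_ROOT_PATH` for the shell expansion `${ACE_E2E_SOURCE_ROOT:-$PROJECT_ROOT_PATH}`, which uses `ACE_E2E_SOURCE_ROOT` when it is set and non-empty and falls back to `PROJECT_ROOT_PATH` otherwise. For readers more at home in Ruby, an equivalent lookup might look like the sketch below; the method name is hypothetical and nothing here is taken from the package's code.

```ruby
# Hypothetical helper (not from the gem) mirroring the shell expansion
# ${ACE_E2E_SOURCE_ROOT:-$PROJECT_ROOT_PATH}.
# With ":-", the fallback applies when the variable is unset OR empty.
# `env` defaults to the process environment; a plain Hash works for testing.
def e2e_source_root(env = ENV)
  override = env["ACE_E2E_SOURCE_ROOT"].to_s
  override.empty? ? env["PROJECT_ROOT_PATH"].to_s : override
end

puts e2e_source_root("PROJECT_ROOT_PATH" => "/repo")
puts e2e_source_root("ACE_E2E_SOURCE_ROOT" => "/src", "PROJECT_ROOT_PATH" => "/repo")
```

As in the shell, an unset fallback yields an empty string, not an error.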

data/lib/ace/test/end_to_end_runner/atoms/skill_prompt_builder.rb CHANGED

@@ -35,7 +35,7 @@ module Ace
           #
           # Resolves role: references to their concrete provider before checking.
           #
-          # @param provider_string [String] Provider:model string (e.g., "claude:sonnet", "role:e2e-executor")
+          # @param provider_string [String] Provider:model string (e.g., "claude:sonnet", "role:e2e-runner")
           # @return [Boolean]
           def cli_provider?(provider_string)
             resolved = resolve_provider_name(provider_string)
@@ -44,9 +44,9 @@ module Ace
           def build_execution_prompt(command:, tc_mode:)
             return_contract = if tc_mode
-              "- **Test ID**: ...\n- **TC ID**: ...\n- **Status**: pass | fail\n- **Report Paths**: ...\n- **Issues**: ..."
+              "- **Test ID**: ...\n- **TC ID**: ...\n- **Status**: pass | fail\n- **Report Paths**: ...\n- **Observations**: ...\n- **Issues**: ... (optional legacy alias)"
             else
-              "- **Test ID**: ...\n- **Status**: pass | fail | partial\n- **Passed**: ...\n- **Failed**: ...\n- **Total**: ...\n- **Report Paths**: ...\n- **Issues**: ..."
+              "- **Test ID**: ...\n- **Status**: pass | fail | partial\n- **Passed**: ...\n- **Failed**: ...\n- **Total**: ...\n- **Report Paths**: ...\n- **Observations**: ...\n- **Issues**: ... (optional legacy alias)"
             end
             <<~PROMPT.strip
@@ -55,8 +55,9 @@ module Ace
               Execution requirements:
               - Do not run `/ace-...` inside a shell command.
-              - If slash commands are unavailable, stop and report that limitation in `Issues`.
+              - If slash commands are unavailable, stop and report that limitation in `Observations`.
               - Write reports under `.ace-local/test-e2e/*-reports/`.
+              - `Observations` is required and must be a concise factual summary of actions, outcomes, and blockers without verdict language.
               - Return only this structured summary:
               #{return_contract}
             PROMPT
@@ -122,6 +123,7 @@ module Ace
               Verification requirements:
               - Inspect sandbox artifacts and scenario files directly.
+              - Judge from sandbox state first, then runner observations, then raw debug captures only when needed.
               - Evaluate each test case using `TC-*.verify.md` criteria when present.
               - Classify each failed test case with one category:
                 `test-spec-error`, `tool-bug`, `runner-error`, or `infrastructure-error`.
@@ -145,7 +147,7 @@ module Ace
           # Resolve the bare provider name from a provider string.
           # For role: references, resolves via ProviderModelParser to find the
-          # concrete provider (e.g. "role:e2e-executor" → "claude").
+          # concrete provider (e.g. "role:e2e-runner" → "claude").
           def resolve_provider_name(provider_string)
             name = self.class.provider_name(provider_string)
             return name unless name == "role"
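
The return contract in this file now requires `Observations` and keeps `Issues` as an optional legacy alias. A consumer of that structured summary could honor the alias roughly as sketched below. This is an illustration only: `parse_summary`, `observations_from`, and the `FIELD_LINE` pattern are hypothetical names, and the diff does not show how the package itself parses the runner's reply.

```ruby
# Hypothetical consumer (not from the gem) of the structured summary lines,
# which have the shape "- **Field**: value".
FIELD_LINE = /\A-\s+\*\*(.+?)\*\*:\s*(.*)\z/

# Collects "- **Key**: value" lines into a Hash; other lines are ignored.
def parse_summary(text)
  text.each_line.with_object({}) do |line, fields|
    match = FIELD_LINE.match(line.strip)
    fields[match[1]] = match[2] if match
  end
end

# Prefers Observations; falls back to the legacy Issues field when
# Observations is missing or empty.
def observations_from(fields)
  value = fields["Observations"].to_s
  value.empty? ? fields["Issues"].to_s : value
end

legacy_reply = <<~REPLY
  - **Test ID**: TS-LINT-001
  - **Status**: pass
  - **Issues**: None
REPLY

puts observations_from(parse_summary(legacy_reply))
```

Reading `Observations` first and `Issues` second keeps older runner replies usable while the new field becomes the norm.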