ace-test-runner-e2e 0.29.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.ace-defaults/e2e-runner/config.yml +70 -0
- data/.ace-defaults/nav/protocols/guide-sources/ace-test-runner-e2e.yml +11 -0
- data/.ace-defaults/nav/protocols/skill-sources/ace-test-runner-e2e.yml +19 -0
- data/.ace-defaults/nav/protocols/tmpl-sources/ace-test-runner-e2e.yml +12 -0
- data/.ace-defaults/nav/protocols/wfi-sources/ace-test-runner-e2e.yml +11 -0
- data/CHANGELOG.md +1166 -0
- data/LICENSE +21 -0
- data/README.md +42 -0
- data/Rakefile +15 -0
- data/exe/ace-test-e2e +15 -0
- data/exe/ace-test-e2e-sh +67 -0
- data/exe/ace-test-e2e-suite +13 -0
- data/handbook/guides/e2e-testing.g.md +124 -0
- data/handbook/guides/scenario-yml-reference.g.md +182 -0
- data/handbook/guides/tc-authoring.g.md +131 -0
- data/handbook/skills/as-e2e-create/SKILL.md +30 -0
- data/handbook/skills/as-e2e-fix/SKILL.md +35 -0
- data/handbook/skills/as-e2e-manage/SKILL.md +31 -0
- data/handbook/skills/as-e2e-plan-changes/SKILL.md +30 -0
- data/handbook/skills/as-e2e-review/SKILL.md +35 -0
- data/handbook/skills/as-e2e-rewrite/SKILL.md +31 -0
- data/handbook/skills/as-e2e-run/SKILL.md +48 -0
- data/handbook/skills/as-e2e-setup-sandbox/SKILL.md +34 -0
- data/handbook/templates/ace-taskflow-fixture.template.md +322 -0
- data/handbook/templates/agent-experience-report.template.md +89 -0
- data/handbook/templates/metadata.template.yml +49 -0
- data/handbook/templates/scenario.yml.template.yml +60 -0
- data/handbook/templates/tc-file.template.md +45 -0
- data/handbook/templates/test-report.template.md +94 -0
- data/handbook/workflow-instructions/e2e/analyze-failures.wf.md +126 -0
- data/handbook/workflow-instructions/e2e/create.wf.md +395 -0
- data/handbook/workflow-instructions/e2e/execute.wf.md +253 -0
- data/handbook/workflow-instructions/e2e/fix.wf.md +166 -0
- data/handbook/workflow-instructions/e2e/manage.wf.md +179 -0
- data/handbook/workflow-instructions/e2e/plan-changes.wf.md +255 -0
- data/handbook/workflow-instructions/e2e/review.wf.md +286 -0
- data/handbook/workflow-instructions/e2e/rewrite.wf.md +281 -0
- data/handbook/workflow-instructions/e2e/run.wf.md +355 -0
- data/handbook/workflow-instructions/e2e/setup-sandbox.wf.md +461 -0
- data/lib/ace/test/end_to_end_runner/atoms/display_helpers.rb +234 -0
- data/lib/ace/test/end_to_end_runner/atoms/prompt_builder.rb +199 -0
- data/lib/ace/test/end_to_end_runner/atoms/result_parser.rb +166 -0
- data/lib/ace/test/end_to_end_runner/atoms/skill_prompt_builder.rb +166 -0
- data/lib/ace/test/end_to_end_runner/atoms/skill_result_parser.rb +244 -0
- data/lib/ace/test/end_to_end_runner/atoms/suite_report_prompt_builder.rb +103 -0
- data/lib/ace/test/end_to_end_runner/atoms/tc_fidelity_validator.rb +39 -0
- data/lib/ace/test/end_to_end_runner/atoms/test_case_parser.rb +108 -0
- data/lib/ace/test/end_to_end_runner/cli/commands/run_suite.rb +130 -0
- data/lib/ace/test/end_to_end_runner/cli/commands/run_test.rb +156 -0
- data/lib/ace/test/end_to_end_runner/models/test_case.rb +47 -0
- data/lib/ace/test/end_to_end_runner/models/test_result.rb +115 -0
- data/lib/ace/test/end_to_end_runner/models/test_scenario.rb +90 -0
- data/lib/ace/test/end_to_end_runner/molecules/affected_detector.rb +92 -0
- data/lib/ace/test/end_to_end_runner/molecules/config_loader.rb +75 -0
- data/lib/ace/test/end_to_end_runner/molecules/failure_finder.rb +203 -0
- data/lib/ace/test/end_to_end_runner/molecules/fixture_copier.rb +35 -0
- data/lib/ace/test/end_to_end_runner/molecules/pipeline_executor.rb +121 -0
- data/lib/ace/test/end_to_end_runner/molecules/pipeline_prompt_bundler.rb +182 -0
- data/lib/ace/test/end_to_end_runner/molecules/pipeline_report_generator.rb +321 -0
- data/lib/ace/test/end_to_end_runner/molecules/pipeline_sandbox_builder.rb +131 -0
- data/lib/ace/test/end_to_end_runner/molecules/progress_display_manager.rb +172 -0
- data/lib/ace/test/end_to_end_runner/molecules/report_writer.rb +259 -0
- data/lib/ace/test/end_to_end_runner/molecules/scenario_loader.rb +254 -0
- data/lib/ace/test/end_to_end_runner/molecules/setup_executor.rb +181 -0
- data/lib/ace/test/end_to_end_runner/molecules/simple_display_manager.rb +72 -0
- data/lib/ace/test/end_to_end_runner/molecules/suite_progress_display_manager.rb +223 -0
- data/lib/ace/test/end_to_end_runner/molecules/suite_report_writer.rb +277 -0
- data/lib/ace/test/end_to_end_runner/molecules/suite_simple_display_manager.rb +116 -0
- data/lib/ace/test/end_to_end_runner/molecules/test_discoverer.rb +136 -0
- data/lib/ace/test/end_to_end_runner/molecules/test_executor.rb +332 -0
- data/lib/ace/test/end_to_end_runner/organisms/suite_orchestrator.rb +830 -0
- data/lib/ace/test/end_to_end_runner/organisms/test_orchestrator.rb +442 -0
- data/lib/ace/test/end_to_end_runner/version.rb +9 -0
- data/lib/ace/test/end_to_end_runner.rb +71 -0
- metadata +220 -0
+++ data/handbook/workflow-instructions/e2e/create.wf.md
@@ -0,0 +1,395 @@

---
doc-type: workflow
title: Create E2E Test Workflow
purpose: Create a new E2E test scenario from template
ace-docs:
  last-updated: 2026-03-12
  last-checked: 2026-03-21
---

# Create E2E Test Workflow

This workflow guides an agent through creating a new E2E test scenario.

## Arguments

- `PACKAGE` (required) - The package for the test (e.g., `ace-lint`)
- `AREA` (required) - The test area code (e.g., `LINT`, `REVIEW`, `GIT`)
- `--format ts` (optional, default) - Test format. Creates a directory with `scenario.yml`, `runner.yml.md`, `verifier.yml.md`, and TC runner/verifier pairs (TS-format). This is the only supported format.
- `--context <description>` (optional) - Description of what the test should verify

## Canonical Conventions

- Scenario ID format: `TS-<PACKAGE_SHORT>-<NNN>[-slug]`
- Standalone files: `TC-*.runner.md` and `TC-*.verify.md`
- TC artifact layout: `results/tc/{NN}/`
- Summary counters: `tcs-passed`, `tcs-failed`, `tcs-total`, `failed[].tc`
- CLI split reminder:
  - `ace-test-e2e` for single-package execution
  - `ace-test-e2e-suite` for suite-level execution

## Authoring Contract

- Runner files (`runner.yml.md`, `TC-*.runner.md`) are execution-only.
- Verifier files (`verifier.yml.md`, `TC-*.verify.md`) are verdict-only with impact-first evidence order:
  1. sandbox/project state impact
  2. explicit artifacts
  3. debug captures as fallback
- Setup belongs to `scenario.yml` `setup:` and fixtures; do not duplicate setup in runner TC instructions.

## Workflow Steps

### 1. Validate Inputs

**Check package exists:**
```bash
test -d "{PACKAGE}" && echo "Package exists" || echo "Package not found"
```

If package doesn't exist, list available packages:
```bash
ls -d */ | grep -E "^ace-" | sed 's/\/$//'
```

**Normalize area code:**
- Convert to uppercase (e.g., `lint` -> `LINT`)
- Verify it's a valid area name (2-10 alphanumeric characters)
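The normalization and validity check can be sketched in shell (a sketch, not part of the shipped tooling; the regex encodes the 2-10 alphanumeric rule stated above):

```shell
#!/usr/bin/env bash
# Sketch: uppercase the area code and validate it (2-10 alphanumeric chars).
AREA=$(echo "lint" | tr '[:lower:]' '[:upper:]')

if echo "$AREA" | grep -Eq '^[A-Z0-9]{2,10}$'; then
  echo "Area code: $AREA"
else
  echo "Error: Invalid area code '$AREA'." >&2
  exit 1
fi
```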

### 2. Generate Test ID

Find the next available test ID:

```bash
# Search TS-format directories
find {PACKAGE}/test/e2e -maxdepth 1 -type d -name "TS-{AREA}-*" 2>/dev/null | \
  sed 's/.*TS-{AREA}-\([0-9]*\).*/\1/'
```

Sort and take the highest number:
- If no existing tests: use `001`
- Otherwise: increment the highest number by 1
- Format as three digits (e.g., `001`, `002`, `015`)

Result: `TS-{AREA}-{NNN}` (e.g., `TS-LINT-003`)
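The full ID computation above (find, extract, sort, increment, pad) can be sketched as one snippet, with `PACKAGE` and `AREA` standing in for the resolved argument values:

```shell
#!/usr/bin/env bash
# Sketch: compute the next TS-<AREA>-<NNN> id for a package.
PACKAGE=ace-lint
AREA=LINT

# Highest existing number, or empty when no tests exist yet.
last=$(find "${PACKAGE}/test/e2e" -maxdepth 1 -type d -name "TS-${AREA}-*" 2>/dev/null \
  | sed "s/.*TS-${AREA}-\([0-9]*\).*/\1/" | sort -n | tail -1)

# Increment (10# forces base-10 so "009" is not parsed as octal) and zero-pad.
next=$(printf '%03d' $(( 10#${last:-0} + 1 )))
echo "TS-${AREA}-${next}"
```

With no existing directories this prints `TS-LINT-001`, matching the "use `001`" rule.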

### 3. Create Directory

Ensure the E2E test directory exists:

```bash
mkdir -p {PACKAGE}/test/e2e
```

### 4. Generate Test Slug

Create a kebab-case slug:

**If --context provided:**
- Extract key words from the context description
- Convert to lowercase
- Replace spaces with hyphens
- Limit to 5-6 words

**If no context:**
- Use a placeholder: `new-test-scenario`

Example: "Test config file validation" -> `config-file-validation`

The slug is the directory name suffix: `TS-LINT-003-config-file-validation/`
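The mechanical part of the slugging rules (lowercase, hyphenate, truncate) can be sketched as below; key-word selection, such as dropping "Test" in the example above, is left to the agent:

```shell
#!/usr/bin/env bash
# Sketch: turn a context description into a kebab-case slug (first 6 words).
context="Test config file validation"

slug=$(echo "$context" \
  | tr '[:upper:]' '[:lower:]' \
  | tr -cs 'a-z0-9' '-' \
  | sed 's/^-//; s/-$//' \
  | cut -d- -f1-6)
echo "${slug:-new-test-scenario}"
```

The `${slug:-…}` fallback implements the no-context placeholder.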

### 5. Load Template

Load the test template:
```bash
ace-bundle tmpl://test-e2e
```

Or read directly:
```
ace-test-runner-e2e/handbook/templates/test-e2e.template.md
```

### 6. Populate Template

Replace template placeholders with actual values:

| Placeholder | Value |
|-------------|-------|
| `{AREA}` | Area code (uppercase) |
| `{NNN}` | Sequential number (3 digits) |
| `{short-pkg}` | Package name without `ace-` prefix (e.g., `git-commit`) |
| `{short-id}` | Lowercase test number (e.g., `ts001`) |
| `{Descriptive Title}` | Generated from context or area |
| `{area-name}` | Area code (lowercase) |

Initial values for optional fields:
- `priority: medium`
- `duration: ~10min`
- `automation-candidate: false`
- `cost-tier: smoke`
- `tags: [{cost-tier}, "use-case:{area}"]`
- `e2e-justification:` (brief statement of why this cannot be unit-only)
- `unit-coverage-reviewed:` (list of unit test files checked during Value Gate)
- `last-verified:` (leave empty)
- `verified-by:` (leave empty)
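Placeholder substitution can be done with a chain of `sed` expressions (a sketch; the shell variable names and output filename are illustrative, not part of the tool):

```shell
#!/usr/bin/env bash
# Sketch: fill template placeholders with computed values.
AREA=LINT NNN=003 SHORT_PKG=lint SHORT_ID=ts003

sed -e "s/{AREA}/${AREA}/g" \
    -e "s/{NNN}/${NNN}/g" \
    -e "s/{short-pkg}/${SHORT_PKG}/g" \
    -e "s/{short-id}/${SHORT_ID}/g" \
    test-e2e.template.md > scenario-draft.md
```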

### 7. E2E Value Gate Check

Before generating test cases, verify the proposed test has genuine E2E value.

**Check unit test coverage:**
```bash
# Search for existing unit tests covering this area
find {PACKAGE}/test/atoms {PACKAGE}/test/molecules {PACKAGE}/test/organisms \
  -name "*_test.rb" 2>/dev/null | head -20
```

Read the relevant test files and count assertions covering the behavior described in `--context`.

**Apply the gate per TC:**
For each proposed TC, answer: **"Does this require the full CLI binary + real external tools + real filesystem I/O?"**

- If **YES**: proceed to TC generation
- If **NO**: note that unit tests cover this behavior and skip the TC
- If **PARTIAL**: create the TC but scope it to only the E2E-exclusive aspects

**Example decisions:**
- "Test that invalid YAML config produces error" — check if `atoms/config_parser_test.rb` already asserts this. If so, **skip** (unit test covers it). If the unit test checks parsing but not the full CLI exit code path, **create** a TC scoped to just the exit code.
- "Test that StandardRB subprocess executes and returns results" — unit tests stub the subprocess. **Create** this as E2E because it requires the real tool.

If all proposed TCs fail the gate, report to the user:
```
All proposed behaviors are already covered by unit tests in {PACKAGE}/test/.
No E2E test needed. Consider adding unit tests instead if coverage gaps exist.
```

### 7a. E2E Decision Record (Required)

Before writing files, produce a decision record table for every candidate TC:

| TC ID | Decision (KEEP/ADD/SKIP) | E2E-only reason | Unit tests reviewed |
|-------|--------------------------|-----------------|---------------------|
| {tc-id} | {decision} | {why this needs real CLI/tools/fs} | {path1,path2} |

Rules:
- No TC may be created without a row in this table.
- If decision is `SKIP`, include the unit-test evidence that replaces it.
- At least one `unit tests reviewed` path is required for each row.
- The scenario-level `unit-coverage-reviewed` field must include the union of all referenced unit test files.

### 8. Context-Based Generation (if --context)

If a context description was provided, enhance the test with:

**Research the package:**
1. **Run unit tests first** (`ace-test` in the package) — they are the ground truth for implemented behavior
2. Examine the relevant code in `{PACKAGE}/lib/`
3. Check existing unit tests for expected behavior patterns
4. Understand the feature being tested
5. **Run the tool** to observe actual behavior, output format, file paths, and exit codes
6. **Verify config/input formats** by reading the actual parsing code — never assume formats from design specs or task descriptions

**Generate test content:**
1. Write a clear objective based on the context
2. Identify prerequisites for the test
3. Create appropriate test data setup
4. Generate test cases following the rules below
5. Define success criteria

#### Test Case Generation Rules

**MUST (required for all E2E tests):**
- **Verify the feature is implemented** before writing the test — read the actual implementation code, not just task specs or design documents
- **Verify config/input formats** by reading the parsing code — never assume formats from BDD specs, task descriptions, or documentation
- Include an error/negative TC only when it validates E2E-exclusive behavior (real CLI parser/runtime/tooling/filesystem) or when unit coverage has a documented gap
- Verify actual file paths by running the tool first — never hardcode paths from documentation or assumptions
- Use explicit `&& echo "PASS" || echo "FAIL"` patterns for every verification step
- Check specific exit codes for error commands (not just "non-zero")

**SHOULD (strongly recommended):**
- Test the real user journey — structure TCs as a sequential workflow, not isolated commands
- Verify exit codes for all commands, not just error cases
- Include negative assertions (files/directories that should NOT exist)
- Capture and check CLI output content, not just exit codes
- Verify that status values match actual implementation (e.g., `done` vs `completed`)

**COST-AWARE (reduce LLM invocations):**
- Consolidate assertions that share the same CLI invocation into a single TC. For example, after running `ace-lint file.rb`, check exit code, report.json structure, and ok.md existence in ONE TC — not three.
- Target 2-5 TCs per scenario. More than 5 suggests the scenario is too broad; split into focused scenarios. Fewer than 2 suggests merging with a related scenario.
- Never create a TC for a single assertion when that assertion could be appended to an existing TC that runs the same command.
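The exit-code, output-content, and negative-assertion rules above consolidate into one TC like this. This is a sketch: it uses `grep` on a missing file as a runnable stand-in for the package CLI, and the checked exit code and message belong to `grep`, not to any ace tool:

```shell
#!/usr/bin/env bash
# Stand-in CLI: grep on a missing file exits 2 and prints an error,
# so the verification pattern can be demonstrated without the real tool.
OUTPUT=$(grep pattern missing-config.yml 2>&1)
EXIT_CODE=$?

# Specific exit code, not just "non-zero".
[ "$EXIT_CODE" -eq 2 ] && echo "PASS: exit code" || echo "FAIL: exit code $EXIT_CODE"

# Check output content, not just the exit code.
echo "$OUTPUT" | grep -q "No such file" && echo "PASS: error message" || echo "FAIL: error message"

# Negative assertion: the failed run must not leave artifacts behind.
[ ! -f report.json ] && echo "PASS: no report.json" || echo "FAIL: report.json exists"
```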

#### Recommended TC Ordering

1. **Error paths first** — wrong args, missing files, no prior state (run from clean state)
2. **Happy path start** — create/init with correct args, verify output
3. **Structure verification** — check actual on-disk file structure with negative assertions
4. **Lifecycle operations** — status, advance, fail, retry in workflow order
5. **End state** — verify completion message, all steps terminal

This ordering ensures error TCs run before any state is created (clean environment), and happy-path TCs build on each other sequentially.

See: **e2e-testing.g.md § "Avoiding False Positive Tests"** for the full list of anti-patterns and the reviewer checklist.

#### CLI-Based Testing Requirement

**E2E tests MUST test through the CLI interface, not library imports.**

**Valid approach:**
```bash
OUTPUT=$(ace-review --preset code --subject "diff:HEAD~1" --auto-execute 2>&1)
EXIT_CODE=$?
[ "$EXIT_CODE" -eq 0 ] && echo "PASS" || echo "FAIL"
```

**Invalid approach (this is integration/unit testing, not E2E):**
```bash
bundle exec ruby -e '
  require_relative "lib/ace/review"
  result = Ace::Review::SomeClass.method(args)
'
```

**For execution tests (LLM, API calls):**
- Use `--auto-execute` to make real API calls
- Using only `--dry-run` cannot verify actual execution behavior
- Keep costs minimal: cheap models, tiny prompts, small diffs

#### Common Anti-Patterns to Avoid

**Writing tests from design specs before implementation:**
- Task descriptions and BDD specs often describe *intended* behavior with *proposed* config formats
- The actual implementation may use different formats, different commands, or different workflows
- Example: A spec might describe `jobs:` with explicit `number:` and `parent:` fields, but the implementation uses `steps:` with auto-generated numbers and dynamic hierarchy via `add --after --child`
- **Fix:** Always read the actual implementation code (especially config parsing) before writing test data

**Assuming static vs dynamic behavior:**
- Tests may assume features work at config-time (static) when they actually work at runtime (dynamic)
- Example: Assuming hierarchy is defined in config when it's actually built dynamically via commands
- **Fix:** Trace the actual code path for the feature being tested

**Splitting one command into many redundant TCs:**
- Multiple TCs each validate one assertion after the same CLI invocation, creating overlap with unit tests and increasing run cost
- Example: TC-A checks exit code, TC-B checks report file, TC-C checks summary text for the same command run
- **Fix:** Consolidate those assertions into one TC and move formatter/parser details to unit tests

**Example for "Test config file validation":**
```markdown
## Test Cases

### TC-001: Error — Missing Config File
**Objective:** Verify that a nonexistent config file produces exit code 3 and a clear error

### TC-002: Error — Malformed YAML Config
**Objective:** Verify malformed YAML is handled gracefully with actionable error message

### TC-003: Valid Config File
**Objective:** Verify valid configuration files are accepted

### TC-004: Verify On-Disk Structure
**Objective:** Check actual file paths created, with negative assertions for wrong paths
```

### 9. Write Test Files

Create the scenario directory with separate files:
```bash
mkdir -p {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}
```

Write `scenario.yml` (metadata and setup):
```
{PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/scenario.yml
```

Write scenario pair configs:
```
{PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/runner.yml.md
{PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/verifier.yml.md
```

Write individual TC runner/verifier files for each test case:
```
{PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/TC-001-{tc-slug}.runner.md
{PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/TC-001-{tc-slug}.verify.md
```

Optionally create a fixtures directory if test data is needed:
```bash
mkdir -p {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/fixtures
```

Example: `ace-lint/test/e2e/TS-LINT-003-config-file-validation/scenario.yml`

### 10. Report Result

Output a summary:

```markdown
## E2E Test Created

**Test ID:** TS-{AREA}-{NNN}
**Format:** TS (directory-based)
**Package:** {package}
**Directory:** {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/
**Files:**
- scenario.yml
- runner.yml.md
- verifier.yml.md
- TC-001-{tc-slug}.runner.md
- TC-001-{tc-slug}.verify.md

### Next Steps

1. Review and customize `scenario.yml` and TC files
2. Add fixtures to the `fixtures/` directory if needed
3. Review the E2E Decision Record and ensure `unit-coverage-reviewed` is populated
4. Run the test with `ace-test-e2e {package} TS-{AREA}-{NNN}`
5. Update `last-verified` after successful execution
```

## Example Invocations

**Create a test:**
```bash
ace-bundle wfi://e2e/create
```

Creates: `ace-lint/test/e2e/TS-LINT-003-new-test-scenario/` with `scenario.yml` and TC files.

**Create a contextual test:**
```bash
ace-bundle wfi://e2e/create
```

Creates: `ace-lint/test/e2e/TS-LINT-003-config-file-validation/` with `scenario.yml` and TC files for config validation.

**Create test for new area:**
```bash
ace-bundle wfi://e2e/create
```

Creates: `ace-review/test/e2e/TS-COMMENT-001-pr-comment-threading/` with `scenario.yml` and TC files.

## Error Handling

### Package Not Found

```
Error: Package '{package}' not found.

Available packages:
- ace-lint
- ace-review
- ace-test-runner-e2e
```

### Invalid Area Code

```
Error: Invalid area code '{area}'.

Area codes must be:
- 2-10 characters
- Alphanumeric only
- Will be converted to uppercase
```
+++ data/handbook/workflow-instructions/e2e/execute.wf.md
@@ -0,0 +1,253 @@

---
doc-type: workflow
title: Execute E2E Test Workflow
purpose: Execute test cases in a pre-populated sandbox with reporting
ace-docs:
  last-updated: 2026-03-04
  last-checked: 2026-03-21
---

# Execute E2E Test Workflow

This workflow guides an agent through executing test cases in a **pre-populated sandbox**. The sandbox was created by `SetupExecutor` — this workflow handles only execution and reporting.

## SetupExecutor Contract

Before this workflow is invoked, `SetupExecutor` has already:
- Created an isolated sandbox directory under `.ace-local/test-e2e/`
- Initialized git (`git init`, user config, `.gitignore`)
- Installed `mise.toml` for tool version management
- Created `.ace` symlinks for configuration access
- Created `results/tc/{NN}/` directories for each TC
- Copied fixtures from the scenario's `fixtures/` directory
- Placed `TC-*.runner.md` and `TC-*.verify.md` files in the sandbox

Tag filtering happens at discovery time (before `SetupExecutor` runs). By the time this workflow executes, only matching scenarios are included.

## Arguments

- `PACKAGE` (required) - Package containing the test (e.g., `ace-lint`)
- `TEST_ID` (required) - Test identifier (e.g., `TS-LINT-001`)
- `--sandbox SANDBOX_PATH` (required) - Path to pre-populated sandbox directory
- `--run-id RUN_ID` (optional) - Pre-generated timestamp ID for deterministic report paths
- `--env KEY=VALUE[,...]` (optional) - Comma-separated environment variables to set before execution
- `--verify` (optional) - Enable independent verifier mode (second agent pass with sandbox inspection)
- `TEST_CASES` (optional) - Comma-separated TC IDs to execute (e.g., `TC-001,tc-003,002`)

**TC ID normalization:** `TC-001` (unchanged), `tc-001` → `TC-001`, `001` → `TC-001`, `1` → `TC-001`, `TC-1` → `TC-001`
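The normalization table above can be sketched as a small shell function (a sketch for illustration; not part of the shipped CLI):

```shell
#!/usr/bin/env bash
# Sketch: normalize any accepted TC id spelling to TC-NNN.
normalize_tc() {
  # Strip an optional tc-/TC- prefix, then zero-pad the number to 3 digits.
  local n
  n=$(echo "$1" | sed -E 's/^[Tt][Cc]-?//')
  printf 'TC-%03d\n' "$((10#$n))"
}

normalize_tc "TC-001"  # TC-001
normalize_tc "tc-001"  # TC-001
normalize_tc "1"       # TC-001
normalize_tc "TC-1"    # TC-001
```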

## Canonical Conventions

- `ace-test-e2e` runs single-package scenarios; `ace-test-e2e-suite` runs suite-level execution
- Scenario IDs: `TS-<PACKAGE_SHORT>-<NNN>[-slug]`
- Standalone TC pairs: `TC-*.runner.md` + `TC-*.verify.md`
- TC artifacts: `results/tc/{NN}/`
- Summary counters: `tcs-passed`, `tcs-failed`, `tcs-total`, `failed[].tc`

## Execution Contract

- Runner is execution-only: execute declared TC actions and capture evidence.
- Verifier is verification-only: determine PASS/FAIL using impact-first ordering:
  1. sandbox/project state impact
  2. explicit artifacts
  3. debug captures (`stdout`/`stderr`/exit) as fallback
- Do not interpret setup ownership in runner TC files; setup is owned by `scenario.yml` + fixtures.

## Dual-Agent Verifier

When `--verify` is passed (or always-on for CLI pipeline runs), execution follows a dual-agent pattern:

1. **Runner agent** executes TC steps and produces artifacts in `results/tc/{NN}/`
2. **Verifier agent** independently inspects the sandbox and artifacts against `TC-*.verify.md` expectations
3. **Report generator** (`PipelineReportGenerator`) produces a deterministic summary from verifier output

The verifier has no access to the runner's conversation — it evaluates purely from on-disk evidence. This prevents self-confirmation bias.

## Subagent Mode

When invoked as a subagent (via Task tool from orchestrator):

**Return contract:**
```markdown
- **Test ID**: {test-id}
- **Status**: pass | fail | partial
- **Passed**: {count}
- **Failed**: {count}
- **Total**: {count}
- **Report Paths**: {timestamp}-{short-pkg}-{short-id}.*
- **Issues**: Brief description or "None"
```

Do NOT return full report contents — they are on disk.

## TC-Level Execution Mode

When invoked with `--tc-mode`, only a single TC is executed.

**TC-Level Arguments:**
- `PACKAGE` (required), `TEST_ID` (required), `TC_ID` (required)
- `--tc-mode` (required), `--sandbox SANDBOX_PATH` (required)
- `--run-id RUN_ID` (optional)

**TC-Level Steps:**
1. Verify `SANDBOX_PATH` exists
2. `cd SANDBOX_PATH`
3. Execute TC steps from the runner file
4. Write per-TC reports to `{RUN_ID}-{pkg}-{scenario}-{tc}-reports/`
5. Return TC-level contract

**TC-Level Rules:**
- Do NOT create or modify sandbox — `SetupExecutor` already prepared it
- Execute only the steps described in the TC content
- Report actual results even if they differ from expected

---

## Sandbox Rules

- Do NOT create or modify sandbox setup — it is already prepared
- Do NOT run environment setup, prerequisite checks, or test data creation
- Focus exclusively on TC execution and reporting

## Workflow Steps

### 1. Set Up Execution Environment

1. Parse `--env` and export each `KEY=VALUE`
2. `cd SANDBOX_PATH`
3. Set `TIMESTAMP_ID` from `--run-id` or generate with `ace-b36ts encode`

**Expected variables:**
- `SANDBOX_PATH` — Pre-populated sandbox (cwd)
- `TIMESTAMP_ID` — Unique run identifier
- Any variables from `--env`
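Parsing the comma-separated `--env` value can be sketched as follows (`ENV_ARG` and the example variables are illustrative, not defined by the tool):

```shell
#!/usr/bin/env bash
# Sketch: export each KEY=VALUE from a comma-separated --env argument.
ENV_ARG="ACE_LOG=debug,ACE_TIMEOUT=30"

IFS=',' read -ra pairs <<< "$ENV_ARG"
for pair in "${pairs[@]}"; do
  export "$pair"
done

echo "$ACE_LOG $ACE_TIMEOUT"
```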

### 2. Discover and Filter Test Cases

Find TC definitions in the sandbox:

```bash
find "${SANDBOX_PATH}" -name "TC-*.runner.md" -o -name "TC-*.verify.md" 2>/dev/null | sort
```

List all found TCs before proceeding:
```
Found N test case files:
- TC-001: (unknown)
- TC-002: (unknown)
```

> **TC FIDELITY RULE:** Execute ONLY discovered `TC-*.runner.md` + `TC-*.verify.md` pairs. Do NOT invent TCs. Every runner must have a matching verifier and vice versa. Missing pairs are errors — report them and skip the unmatched TC.

If `TEST_CASES` argument provided, normalize IDs to `TC-NNN` format and filter. Only execute matching TCs.
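The runner/verifier pairing check from the fidelity rule can be sketched as below (a sketch, run from the directory containing the TC files; it assumes the flat `TC-*.runner.md`/`TC-*.verify.md` naming described above):

```shell
#!/usr/bin/env bash
# Sketch: report runner files without a matching verifier, and vice versa.
for runner in TC-*.runner.md; do
  [ -e "$runner" ] || continue
  verifier="${runner%.runner.md}.verify.md"
  [ -f "$verifier" ] || echo "ERROR: missing verifier for $runner"
done
for verifier in TC-*.verify.md; do
  [ -e "$verifier" ] || continue
  runner="${verifier%.verify.md}.runner.md"
  [ -f "$runner" ] || echo "ERROR: missing runner for $verifier"
done
```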

### 3. Execute Test Cases

> **Use `ace-test-e2e-sh "$SANDBOX_PATH"` for ALL commands.**

For each TC (TC-NNN):

1. **Check filter** — skip if `FILTERED_CASES` is set and TC not in list
2. **Read** the runner file objective
3. **Execute** runner steps, save artifacts to `results/tc/{NN}/`
4. **Capture** exit codes, output, error messages
5. **Evaluate** against verifier expectations
6. **Record** Pass/Fail with per-TC evidence

**Self-check:** Before writing reports, verify your result table has exactly N rows matching discovered TCs (or filtered subset).

Track friction points for the experience report.

### 4. Write Reports

Write three report files to `${SANDBOX_PATH}-reports/`.

```bash
REPORT_DIR="${SANDBOX_PATH}-reports"
mkdir -p "$REPORT_DIR"
```

Replace all `{placeholder}` values with actual data.

#### 4.1 summary.r.md

```yaml
---
test-id: {test-id}
package: {package}
agent: {agent-name}
executed: {timestamp}
status: pass|fail|partial|incomplete
tcs-passed: {count}
tcs-failed: {count}
tcs-total: {count}
score: "{passed}/{total}"
verdict: pass|fail|partial|incomplete
filtered: true|false
failed:
  - tc: TC-NNN
    category: tool-bug|runner-error|test-spec-error|infrastructure-error
    evidence: "brief evidence"
---
```

Followed by test information table, results summary, and TC evaluation details.

#### 4.2 experience.r.md

Agent experience report with friction points, root cause analysis, improvement suggestions, and positive observations.

#### 4.3 metadata.yml

```yaml
run-id: "{TIMESTAMP_ID}"
test-id: "{test-id}"
package: "{package}"
status: "{status}"
score: {0.0-1.0}
verdict: pass|partial|fail
tcs-passed: {count}
tcs-failed: {count}
tcs-total: {count}
failed:
  - tc: TC-NNN
    category: tool-bug|runner-error|test-spec-error|infrastructure-error
    evidence: "brief evidence"
test_cases:
  filtered: true|false
  executed: [TC-001, TC-003]
git:
  branch: "{branch}"
  commit: "{short-sha}"
```
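Assuming the 0.0-1.0 `score` is the pass ratio implied by `score: "{passed}/{total}"` in summary.r.md (an inference, not stated explicitly), its computation can be sketched as:

```shell
#!/usr/bin/env bash
# Sketch: derive a 0.0-1.0 score from pass counts, using awk for float division.
TCS_PASSED=3
TCS_TOTAL=4
SCORE=$(awk -v p="$TCS_PASSED" -v t="$TCS_TOTAL" 'BEGIN { printf "%.2f", t ? p / t : 0 }')
echo "$SCORE"   # 0.75
```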

#### 4.4 Report file paths

```
Reports written:
- ${REPORT_DIR}/summary.r.md
- ${REPORT_DIR}/experience.r.md
- ${REPORT_DIR}/metadata.yml
```

### 5. Return Summary

```markdown
## E2E Test Execution Report
**Test ID:** {test-id} | **Package:** {package} | **Status:** {PASS/FAIL}

| Test Case | Description | Status |
|-----------|-------------|--------|
| TC-001 | ... | Pass |

Reports: `.ace-local/test-e2e/{timestamp}-{short-pkg}-{short-id}-reports/`
```

## Error Handling

| Failure | Action |
|---------|--------|
| TC fails | Record details, continue remaining TCs, include in report |
| Sandbox missing/corrupted | Report error, do NOT recreate, return error summary |
| TC filter mismatch | STOP, do not write reports, offer re-run |
| Missing TC pair file | Report error for that TC, skip it, continue others |