RubyGems - ace-test-runner-e2e - Versions diffs - 0.29.6 → 0.38.11 - Mend

ace-test-runner-e2e 0.29.6 → 0.38.11


Files changed (49)

data/handbook/templates/ace-taskflow-fixture.template.md CHANGED

@@ -1,7 +1,7 @@
 ---
 doc-type: template
 title: ACE Taskflow Test Fixture Template
-purpose: Documentation for ace-test-runner-e2e/handbook/templates/ace-taskflow-fixture.template.md
+purpose: Documentation for ace-test-runner-e2e/handbook/templates/ace-task-fixture.template.md
 ace-docs:
   last-updated: 2026-02-25
   last-checked: 2026-03-21
@@ -9,7 +9,7 @@ ace-docs:
 # ACE Taskflow Test Fixture Template
-This template provides scaffolding for E2E tests that need valid ace-taskflow structures.
+This template provides scaffolding for E2E tests that need valid ace-task structures.
 ## Basic Task Fixture
@@ -17,10 +17,10 @@ Create a minimal valid taskflow structure:
 ```bash
 # Create release directory structure
-mkdir -p "$REPO_DIR/.ace-taskflow/v.test/tasks/001-feature"
+mkdir -p "$REPO_DIR/.ace-task/v.test/tasks/001-feature"
 # Create a valid task file
-cat > "$REPO_DIR/.ace-taskflow/v.test/tasks/001-feature/001-test-task.s.md" << 'EOF'
+cat > "$REPO_DIR/.ace-task/v.test/tasks/001-feature/001-test-task.s.md" << 'EOF'
 ---
 id: v.test+task.001
 status: pending
@@ -54,7 +54,7 @@ EOF
 For tests involving ace-git-worktree:
 ```bash
-cat > "$REPO_DIR/.ace-taskflow/v.test/tasks/001-feature/001-test-task.s.md" << 'EOF'
+cat > "$REPO_DIR/.ace-task/v.test/tasks/001-feature/001-test-task.s.md" << 'EOF'
 ---
 id: v.test+task.001
 status: in-progress
@@ -93,10 +93,10 @@ For tests involving task hierarchies:
 ```bash
 # Create parent task directory
-mkdir -p "$REPO_DIR/.ace-taskflow/v.test/tasks/100-parent-feature"
+mkdir -p "$REPO_DIR/.ace-task/v.test/tasks/100-parent-feature"
 # Create orchestrator task
-cat > "$REPO_DIR/.ace-taskflow/v.test/tasks/100-parent-feature/100-orchestrator.s.md" << 'EOF'
+cat > "$REPO_DIR/.ace-task/v.test/tasks/100-parent-feature/100-orchestrator.s.md" << 'EOF'
 ---
 id: v.test+task.100
 status: pending
@@ -125,7 +125,7 @@ Orchestrator task that coordinates subtasks.
 EOF
 # Create first subtask
-cat > "$REPO_DIR/.ace-taskflow/v.test/tasks/100-parent-feature/100.01-first-subtask.s.md" << 'EOF'
+cat > "$REPO_DIR/.ace-task/v.test/tasks/100-parent-feature/100.01-first-subtask.s.md" << 'EOF'
 ---
 id: v.test+task.100.01
 status: pending
@@ -151,7 +151,7 @@ First part of the parent feature.
 EOF
 # Create second subtask
-cat > "$REPO_DIR/.ace-taskflow/v.test/tasks/100-parent-feature/100.02-second-subtask.s.md" << 'EOF'
+cat > "$REPO_DIR/.ace-task/v.test/tasks/100-parent-feature/100.02-second-subtask.s.md" << 'EOF'
 ---
 id: v.test+task.100.02
 status: pending
@@ -184,7 +184,7 @@ For tests that need a complete release setup:
 ```bash
 # Create release.yml
-cat > "$REPO_DIR/.ace-taskflow/v.test/release.yml" << 'EOF'
+cat > "$REPO_DIR/.ace-task/v.test/release.yml" << 'EOF'
 id: v.test
 title: Test Release
 status: active
@@ -206,10 +206,10 @@ git config user.email "test@example.com"
 git config user.name "Test User"
 # Create taskflow structure
-mkdir -p .ace-taskflow/v.test/tasks/001-feature
+mkdir -p .ace-task/v.test/tasks/001-feature
 # Create release configuration
-cat > .ace-taskflow/v.test/release.yml << 'EOF'
+cat > .ace-task/v.test/release.yml << 'EOF'
 id: v.test
 title: Test Release
 status: active
@@ -217,7 +217,7 @@ started: 2026-01-01
 EOF
 # Create task
-cat > .ace-taskflow/v.test/tasks/001-feature/001-test-task.s.md << 'EOF'
+cat > .ace-task/v.test/tasks/001-feature/001-test-task.s.md << 'EOF'
 ---
 id: v.test+task.001
 status: pending
@@ -242,13 +242,13 @@ Test task for E2E testing.
 EOF
 # Commit the structure
-git add .ace-taskflow/
+git add .ace-task/
 git commit -m "Add taskflow structure" --quiet
 # Set PROJECT_ROOT_PATH for isolated testing
 export PROJECT_ROOT_PATH="$REPO_DIR"
-# Now ace-taskflow commands will use this isolated structure
+# Now ace-task commands will use this isolated structure
 # ace-task show 001  # Should find the test task
 ```
@@ -284,13 +284,13 @@ export PROJECT_ROOT_PATH="$REPO_DIR"
 ### Testing Task Selection
 ```bash
-# Verify ace-taskflow can find the task
+# Verify ace-task can find the task
 ace-task show 001
 # Should output task details
 # Verify task file path
 ace-task show 001 --path
-# Should output: .ace-taskflow/v.test/tasks/001-feature/001-test-task.s.md
+# Should output: .ace-task/v.test/tasks/001-feature/001-test-task.s.md
 ```
 ### Testing Status Updates
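
When updating scenarios written against the older template, it may help to check for leftover references to the old `.ace-taskflow` directory name. The following is a minimal sketch, assuming a `grep` that supports `-r` (GNU and BSD both do). `find_stale_taskflow_refs` is a hypothetical helper for illustration, not something the gem ships.

```shell
#!/bin/sh
# Hypothetical helper (not part of ace-test-runner-e2e).
# find_stale_taskflow_refs DIR: list files under DIR that still mention
# the old ".ace-taskflow" directory name, in sorted order.
find_stale_taskflow_refs() {
  # -r: recurse, -l: print matching file names only,
  # -F: treat the pattern as a fixed string so "." is not a regex wildcard
  grep -rlF -- '.ace-taskflow' "$1" 2>/dev/null | sort
}

# Example call (the directory name here is an assumption):
# find_stale_taskflow_refs test/e2e
```

Because the search string is the longer, old name, files that only mention `.ace-task` are not reported.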

data/handbook/templates/agent-experience-report.template.md CHANGED

@@ -69,10 +69,11 @@ ace-docs:
 ## Workarounds Used
-{Document any workarounds the agent had to employ to complete the test. These indicate areas needing improvement.}
+{Document any workarounds the agent had to employ to complete the test. These are failure signals against the public-surface contract and should drive product/docs/help or scenario changes.}
 - **Issue:** {What required a workaround}
   **Workaround:** {What was done instead}
+  **Why this is a gap:** {Which part of the public surface or scenario contract failed}
 ## Positive Observations
@@ -86,4 +87,4 @@ ace-docs:
 {Suggestions for improving this specific test scenario based on execution experience.}
 - {Recommendation 1}
-- {Recommendation 2}
+- {Recommendation 2}

data/handbook/templates/scenario.yml.template.yml CHANGED

@@ -23,6 +23,12 @@ tags: [{cost-tier}, "use-case:{area}"]
 # Optional: Why this scenario must be E2E (not unit-only)
 e2e-justification: "{Requires real CLI/tools/filesystem behavior}"
+# Optional: Evidence quality target for review coverage (`command-output`, `state+content`, `existence-only`)
+e2e-evidence-strength: command-output
+# Optional: False-positive risk estimate (`low`, `medium`, `high`)
+e2e-false-positive-risk: low
 # Optional: Unit test files reviewed during Value Gate analysis
 unit-coverage-reviewed:
   - test/{layer}/{file}_test.rb
@@ -30,11 +36,16 @@ unit-coverage-reviewed:
 # Optional: Primary command under test
 tool-under-test: {ace-tool}
+# Goal-style scenarios should be doable from docs/usage/--help and the public CLI
+# without hidden runner recipes or workaround instructions.
 # Optional: Declared sandbox artifact layout
 sandbox-layout:
-  results/tc/01/: "Goal 1 artifacts"
+  results/tc/01/: "Goal 1 outcome artifacts"
 # Optional: Prerequisites
+# `requires.tools` declares dependencies; it does not justify fallback probing as
+# the main oracle for an ACE CLI scenario.
 requires:
   tools: [{tool1}, {tool2}]
   ruby: ">= 3.0"
@@ -49,7 +60,7 @@ setup:
   # Optional detached tmux session for test isolation (uses unique run ID)
   # - tmux-session:
   #     name-source: run-id
-  - run: "cp $PROJECT_ROOT_PATH/mise.toml mise.toml && mise trust mise.toml"
+  - run: "cp ${ACE_E2E_SOURCE_ROOT:-$PROJECT_ROOT_PATH}/mise.toml mise.toml && mise trust mise.toml"
   # Uncomment as needed:
   # - copy-fixtures         # Copy fixtures/ directory to sandbox
   # - agent-env:            # Environment variables passed to runner/verifier agent subprocess
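
The changed `run:` step relies on the shell's `${VAR:-default}` expansion: `ACE_E2E_SOURCE_ROOT` is used when it is set and non-empty, otherwise the step falls back to `PROJECT_ROOT_PATH`. A minimal sketch of that behavior follows. The `resolve_source_root` helper and the `/src` and `/repo` paths are illustrative assumptions, not part of the gem.

```shell
#!/bin/sh
# Hypothetical helper: prints the directory the setup step would copy from,
# using the same expansion as the template's `run:` line.
resolve_source_root() {
  printf '%s\n' "${ACE_E2E_SOURCE_ROOT:-$PROJECT_ROOT_PATH}"
}

# Each case runs in a subshell so the assignments do not leak.
( unset ACE_E2E_SOURCE_ROOT; PROJECT_ROOT_PATH=/repo; resolve_source_root )   # prints /repo
( ACE_E2E_SOURCE_ROOT=/src;  PROJECT_ROOT_PATH=/repo; resolve_source_root )   # prints /src
( ACE_E2E_SOURCE_ROOT='';    PROJECT_ROOT_PATH=/repo; resolve_source_root )   # prints /repo
```

The third case shows why the `:-` form matters: an empty `ACE_E2E_SOURCE_ROOT` is treated the same as an unset one.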

data/handbook/templates/tc-file.template.md CHANGED

@@ -16,14 +16,16 @@ ace-docs:
 ## Workspace
 - Working directory: {sandbox-root}
-- Output directory: `results/tc/{NN}/`
+- Outcome artifacts only: `results/tc/{NN}/`
 ## Constraints
 - Use only declared scenario tools (`ace-*` and explicit exceptions)
-- Keep artifacts under `results/tc/{NN}/`
+- Keep only product outcomes or essential command captures under `results/tc/{NN}/`
+- Do not write helper inputs, reflections, manifests, or temp files under `results/tc/{NN}/`
 - Do not write outside sandbox
 - Execute actions only; do not assign PASS/FAIL in runner file
+- Follow the public user path from docs/usage/`--help`; do not embed hidden recipes or workaround branches in the TC
 <!--
 Companion verifier file (`TC-{NNN}-{slug}.verify.md`) example:
@@ -35,11 +37,19 @@ Companion verifier file (`TC-{NNN}-{slug}.verify.md`) example:
 - Impact Checks:
   - {Sandbox/project impact expectation}
 - Artifact Checks:
-  - {Artifact expectation}
+  - {Outcome artifact expectation when a real end-user-visible file or output should exist}
+- Runner Observations:
+  - {How final runner observations help disambiguate the result when state alone is not enough, and record friction/workaround pressure if present}
 - Debug Fallback:
   - {Optional stdout/stderr/exit evidence when needed}
+## Overall User Outcome
+- **Works for end user**: {yes|partial|no}
+- **Friction**: {User-visible friction or `None`}
+- **Feedback**: {Product/docs/help feedback or `None`}
 ## Verdict
-- Pass when impact and artifact checks are satisfied from sandbox evidence.
+- Pass when the public path works from sandbox evidence. Missing helper artifacts alone should not fail the goal.
 -->

data/handbook/workflow-instructions/e2e/analyze-failures.wf.md CHANGED

@@ -1,10 +1,17 @@
 ---
+name: e2e-analyze-failures
+description: Analyze failing E2E scenarios, classify root causes, and surface docs/help drift before fixes.
+allowed-tools:
+- Bash(ace-bundle:*)
+- Read
+- Grep
+- Glob
 doc-type: workflow
 title: Analyze E2E Failures Workflow
 purpose: analyze-e2e-failures workflow instruction
 ace-docs:
-  last-updated: 2026-03-04
-  last-checked: 2026-03-21
+  last-updated: 2026-04-19
+  last-checked: 2026-04-19
 ---
 # Analyze E2E Failures Workflow
@@ -17,6 +24,7 @@ This workflow determines whether each failure is caused by:
 - application/tool code
 - E2E test definition/spec
 - E2E runner/infrastructure
+- stale, missing, or misleading docs/help that made the public user path unclear
 ## Hard Rule
@@ -49,6 +57,11 @@ Use exactly one category per failed TC:
 3. `runner-infrastructure-issue`
 - Sandbox/setup/provider/parsing/orchestration issue
+Public-surface interpretation rules:
+- If the TC fails because it encoded a hidden recipe or workaround, classify it as `test-issue`.
+- If the intended user job is valid but the public CLI/docs/`--help` do not support it cleanly, classify it as `code-issue` with a fix target in product docs/help or CLI help rather than preserving the workaround.
+- If the failure is about internal detail that a user cannot or need not observe from the public surface, prefer narrowing/removing the TC over deepening the runner.
 ## Required Evidence Sources
 Use these files as primary evidence:
@@ -57,6 +70,8 @@ Use these files as primary evidence:
 - `metadata.yml`
 - Relevant artifacts in `results/tc/{NN}/`
+Aggregate suite/package reports are indexing aids only. For failed TC IDs, categories, and evidence, the per-scenario `report.md` in the referenced report directory is the canonical source of truth.
 ## Analysis Procedure
 1. Locate latest failing report directories
@@ -68,20 +83,40 @@ ls -lt .ace-local/test-e2e/*-reports/ 2>/dev/null | head -20
 - failed TC IDs
 - reported category/evidence from metadata
 - corroborating artifact evidence
+- if analyzing from a suite/package report, read the referenced per-scenario `report.md` before accepting any failed-TC mapping
+If the aggregate report and per-scenario report disagree:
+- trust the per-scenario `report.md`
+- classify the mismatch itself as a runner/reporting issue in your analysis notes
+- do not plan fixes from the aggregate failed-TC mapping alone
 3. Reclassify each failed TC if needed
 - Use `code-issue`, `test-issue`, or `runner-infrastructure-issue`
 - Add confidence: `high|medium|low`
 - Add one disconfirming check per TC
 - If confidence is `medium` or `low`, run at least one additional diagnostic read/search before final decision
-4. Recommend rerun scope (cost-aware)
+- Before claiming sandbox escape or fixture contamination, compare repo `git status --short` before and after the relevant E2E run when that evidence is available. Do not infer escape solely from an after-the-fact dirty tree.
+- Check whether the scenario required a hidden recipe or workaround to reach the goal. If yes, record that explicitly in the evidence and classification.
+4. Audit docs/help drift for each failed TC
+- Identify the user job the TC is trying to prove.
+- Check the public surface that a normal user or agent would consult:
+  - package `README.md`
+  - package `docs/usage.md`
+  - package `docs/getting-started.md`
+  - package `docs/handbook.md`
+  - direct command `--help` output for the command involved
+- Record whether the failure exposes stale, missing, or misleading docs/help.
+- If drift exists, list concrete docs/help update targets and make them part of the fix target.
+- If no drift exists, record `None` explicitly. Do not omit the docs/help assessment.
+5. Recommend rerun scope (cost-aware)
 - `scenario` (default)
 - `package`
 - `suite`
 with explicit rationale
-5. Choose autonomous fix decision per failed TC
+6. Choose autonomous fix decision per failed TC
 - Select a single primary fix action
 - Provide concrete file targets in priority order
 - Define explicit no-touch boundaries
@@ -99,6 +134,17 @@ Produce this section before exiting:
 | TS-FOO-001 / TC-003 | test-issue | summary + artifact mismatch details | scenario files | test-scenario-runner | TC-003-foo.runner.md | TC-003-foo.verify.md | lib/** | high | re-run scenario after spec adjustment | scenario |
 ```
+Then produce a docs/help drift section. This section is required even when no drift is found:
+```markdown
+## Docs / Help Drift From E2E Failures
+| Scenario / TC | User Job | Public Surface Checked | Drift Found | Evidence | Update Targets | Action |
+|---|---|---|---|---|---|---|
+| TS-FOO-001 / TC-003 | user-facing job being tested | README, docs/usage.md, --help | yes | docs show stale flag missing from help | docs/usage.md, CLI --help | update docs/help before preserving scenario path |
+| TS-BAR-001 / TC-001 | user-facing job being tested | README, docs/usage.md, --help | no | public path is documented and help matches | None | no docs/help update |
+```
 Then include:
 ```markdown
@@ -121,6 +167,7 @@ Then include:
 - Fix target is explicit per failed TC
 - Fix target files are explicit per failed TC (primary + fallback)
 - No-touch boundaries are explicit per failed TC
+- Docs/help drift is assessed for every failed TC with concrete update targets or `None`
 - A single autonomous chosen fix decision is present per failed TC
 - Rerun scope recommendation is cost-aware
-- No code/scenario/runner edits were made in this workflow
+- No code/scenario/runner edits were made in this workflow
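
The new step 3 guidance asks for a comparison of `git status --short` before and after the E2E run, rather than inferring sandbox escape from a dirty tree seen afterwards. A minimal sketch of that comparison follows, assuming `git` is on `PATH`. The helper names `snapshot_status` and `tree_changed_during` are hypothetical and not part of the workflow.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the before/after comparison.

# snapshot_status REPO: print the working-tree state in a stable order.
snapshot_status() {
  git -C "$1" status --short | sort
}

# tree_changed_during REPO CMD...: run CMD, then print "changed" if
# `git status --short` differs from the snapshot taken beforehand,
# or "unchanged" if it is identical.
tree_changed_during() {
  _repo=$1
  shift
  _before=$(snapshot_status "$_repo")
  # The command's own output and exit status are ignored in this sketch.
  "$@" >/dev/null 2>&1 || true
  _after=$(snapshot_status "$_repo")
  if [ "$_before" = "$_after" ]; then
    echo "unchanged"
  else
    echo "changed"
  fi
}
```

A tree that was already dirty before the run reports `unchanged` as long as the run adds nothing new, which is the distinction the workflow asks for. One caveat: by default `git status` lists an untracked directory once, so new files inside an already-untracked directory would not register in this comparison.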