ace-test-runner-e2e 0.29.8 → 0.40.1

Files changed (52)
  1. checksums.yaml +4 -4
  2. data/.ace-defaults/e2e-runner/config.yml +14 -2
  3. data/CHANGELOG.md +233 -0
  4. data/README.md +2 -2
  5. data/exe/ace-test-e2e-sh +9 -4
  6. data/handbook/guides/e2e-testing.g.md +75 -9
  7. data/handbook/guides/scenario-yml-reference.g.md +21 -8
  8. data/handbook/guides/tc-authoring.g.md +23 -5
  9. data/handbook/skills/as-e2e-fix/SKILL.md +2 -2
  10. data/handbook/skills/as-e2e-review/SKILL.md +2 -2
  11. data/handbook/templates/ace-taskflow-fixture.template.md +17 -17
  12. data/handbook/templates/agent-experience-report.template.md +3 -2
  13. data/handbook/templates/scenario.yml.template.yml +7 -2
  14. data/handbook/templates/tc-file.template.md +16 -4
  15. data/handbook/workflow-instructions/e2e/analyze-failures.wf.md +53 -6
  16. data/handbook/workflow-instructions/e2e/create.wf.md +128 -25
  17. data/handbook/workflow-instructions/e2e/execute.wf.md +11 -7
  18. data/handbook/workflow-instructions/e2e/fix.wf.md +84 -15
  19. data/handbook/workflow-instructions/e2e/plan-changes.wf.md +33 -1
  20. data/handbook/workflow-instructions/e2e/review.wf.md +40 -25
  21. data/handbook/workflow-instructions/e2e/rewrite.wf.md +22 -8
  22. data/handbook/workflow-instructions/e2e/run.wf.md +50 -26
  23. data/handbook/workflow-instructions/e2e/setup-sandbox.wf.md +4 -4
  24. data/lib/ace/test/end_to_end_runner/atoms/artifact_contract_validator.rb +138 -0
  25. data/lib/ace/test/end_to_end_runner/atoms/skill_prompt_builder.rb +7 -5
  26. data/lib/ace/test/end_to_end_runner/atoms/skill_result_parser.rb +73 -7
  27. data/lib/ace/test/end_to_end_runner/cli/commands/run_suite.rb +195 -5
  28. data/lib/ace/test/end_to_end_runner/cli/commands/run_test.rb +58 -9
  29. data/lib/ace/test/end_to_end_runner/models/test_case.rb +8 -2
  30. data/lib/ace/test/end_to_end_runner/models/test_result.rb +9 -3
  31. data/lib/ace/test/end_to_end_runner/models/test_scenario.rb +4 -2
  32. data/lib/ace/test/end_to_end_runner/molecules/affected_detector.rb +7 -2
  33. data/lib/ace/test/end_to_end_runner/molecules/artifact_pruner.rb +61 -0
  34. data/lib/ace/test/end_to_end_runner/molecules/bwrap_sandbox_backend.rb +271 -0
  35. data/lib/ace/test/end_to_end_runner/molecules/config_loader.rb +28 -1
  36. data/lib/ace/test/end_to_end_runner/molecules/integration_runner.rb +122 -0
  37. data/lib/ace/test/end_to_end_runner/molecules/pipeline_executor.rb +235 -18
  38. data/lib/ace/test/end_to_end_runner/molecules/pipeline_prompt_bundler.rb +164 -13
  39. data/lib/ace/test/end_to_end_runner/molecules/pipeline_report_generator.rb +91 -19
  40. data/lib/ace/test/end_to_end_runner/molecules/pipeline_sandbox_builder.rb +121 -18
  41. data/lib/ace/test/end_to_end_runner/molecules/report_writer.rb +15 -12
  42. data/lib/ace/test/end_to_end_runner/molecules/sandbox_runtime_builder.rb +374 -0
  43. data/lib/ace/test/end_to_end_runner/molecules/scenario_loader.rb +83 -5
  44. data/lib/ace/test/end_to_end_runner/molecules/setup_executor.rb +121 -16
  45. data/lib/ace/test/end_to_end_runner/molecules/suite_report_writer.rb +422 -97
  46. data/lib/ace/test/end_to_end_runner/molecules/test_discoverer.rb +38 -13
  47. data/lib/ace/test/end_to_end_runner/molecules/test_executor.rb +27 -5
  48. data/lib/ace/test/end_to_end_runner/organisms/suite_orchestrator.rb +98 -18
  49. data/lib/ace/test/end_to_end_runner/organisms/test_orchestrator.rb +159 -19
  50. data/lib/ace/test/end_to_end_runner/version.rb +1 -1
  51. data/lib/ace/test/end_to_end_runner.rb +4 -0
  52. metadata +21 -2
@@ -1,6 +1,6 @@
1
1
  ---
2
2
  name: as-e2e-review
3
- description: Deep exploration producing a coverage matrix of functionality, unit tests, and E2E tests
3
+ description: Review E2E coverage for modified packages and run targeted package scenarios
4
4
  # bundle: wfi://e2e/review
5
5
  # agent: general-purpose
6
6
  user-invocable: true
@@ -24,7 +24,7 @@ assign:
24
24
  source: wfi://e2e/review
25
25
  steps:
26
26
  - name: verify-e2e
27
- description: Review E2E coverage for modified packages and run targeted scenarios
27
+ description: Review E2E coverage for modified packages and run targeted package scenarios
28
28
  tags: [testing, e2e, verification]
29
29
  skill:
30
30
  kind: workflow
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  doc-type: template
3
3
  title: ACE Taskflow Test Fixture Template
4
- purpose: Documentation for ace-test-runner-e2e/handbook/templates/ace-taskflow-fixture.template.md
4
+ purpose: Documentation for ace-test-runner-e2e/handbook/templates/ace-task-fixture.template.md
5
5
  ace-docs:
6
6
  last-updated: 2026-02-25
7
7
  last-checked: 2026-03-21
@@ -9,7 +9,7 @@ ace-docs:
9
9
 
10
10
  # ACE Taskflow Test Fixture Template
11
11
 
12
- This template provides scaffolding for E2E tests that need valid ace-taskflow structures.
12
+ This template provides scaffolding for E2E tests that need valid ace-task structures.
13
13
 
14
14
  ## Basic Task Fixture
15
15
 
@@ -17,10 +17,10 @@ Create a minimal valid taskflow structure:
17
17
 
18
18
  ```bash
19
19
  # Create release directory structure
20
- mkdir -p "$REPO_DIR/.ace-taskflow/v.test/tasks/001-feature"
20
+ mkdir -p "$REPO_DIR/.ace-task/v.test/tasks/001-feature"
21
21
 
22
22
  # Create a valid task file
23
- cat > "$REPO_DIR/.ace-taskflow/v.test/tasks/001-feature/001-test-task.s.md" << 'EOF'
23
+ cat > "$REPO_DIR/.ace-task/v.test/tasks/001-feature/001-test-task.s.md" << 'EOF'
24
24
  ---
25
25
  id: v.test+task.001
26
26
  status: pending
@@ -54,7 +54,7 @@ EOF
54
54
  For tests involving ace-git-worktree:
55
55
 
56
56
  ```bash
57
- cat > "$REPO_DIR/.ace-taskflow/v.test/tasks/001-feature/001-test-task.s.md" << 'EOF'
57
+ cat > "$REPO_DIR/.ace-task/v.test/tasks/001-feature/001-test-task.s.md" << 'EOF'
58
58
  ---
59
59
  id: v.test+task.001
60
60
  status: in-progress
@@ -93,10 +93,10 @@ For tests involving task hierarchies:
93
93
 
94
94
  ```bash
95
95
  # Create parent task directory
96
- mkdir -p "$REPO_DIR/.ace-taskflow/v.test/tasks/100-parent-feature"
96
+ mkdir -p "$REPO_DIR/.ace-task/v.test/tasks/100-parent-feature"
97
97
 
98
98
  # Create orchestrator task
99
- cat > "$REPO_DIR/.ace-taskflow/v.test/tasks/100-parent-feature/100-orchestrator.s.md" << 'EOF'
99
+ cat > "$REPO_DIR/.ace-task/v.test/tasks/100-parent-feature/100-orchestrator.s.md" << 'EOF'
100
100
  ---
101
101
  id: v.test+task.100
102
102
  status: pending
@@ -125,7 +125,7 @@ Orchestrator task that coordinates subtasks.
125
125
  EOF
126
126
 
127
127
  # Create first subtask
128
- cat > "$REPO_DIR/.ace-taskflow/v.test/tasks/100-parent-feature/100.01-first-subtask.s.md" << 'EOF'
128
+ cat > "$REPO_DIR/.ace-task/v.test/tasks/100-parent-feature/100.01-first-subtask.s.md" << 'EOF'
129
129
  ---
130
130
  id: v.test+task.100.01
131
131
  status: pending
@@ -151,7 +151,7 @@ First part of the parent feature.
151
151
  EOF
152
152
 
153
153
  # Create second subtask
154
- cat > "$REPO_DIR/.ace-taskflow/v.test/tasks/100-parent-feature/100.02-second-subtask.s.md" << 'EOF'
154
+ cat > "$REPO_DIR/.ace-task/v.test/tasks/100-parent-feature/100.02-second-subtask.s.md" << 'EOF'
155
155
  ---
156
156
  id: v.test+task.100.02
157
157
  status: pending
@@ -184,7 +184,7 @@ For tests that need a complete release setup:
184
184
 
185
185
  ```bash
186
186
  # Create release.yml
187
- cat > "$REPO_DIR/.ace-taskflow/v.test/release.yml" << 'EOF'
187
+ cat > "$REPO_DIR/.ace-task/v.test/release.yml" << 'EOF'
188
188
  id: v.test
189
189
  title: Test Release
190
190
  status: active
@@ -206,10 +206,10 @@ git config user.email "test@example.com"
206
206
  git config user.name "Test User"
207
207
 
208
208
  # Create taskflow structure
209
- mkdir -p .ace-taskflow/v.test/tasks/001-feature
209
+ mkdir -p .ace-task/v.test/tasks/001-feature
210
210
 
211
211
  # Create release configuration
212
- cat > .ace-taskflow/v.test/release.yml << 'EOF'
212
+ cat > .ace-task/v.test/release.yml << 'EOF'
213
213
  id: v.test
214
214
  title: Test Release
215
215
  status: active
@@ -217,7 +217,7 @@ started: 2026-01-01
217
217
  EOF
218
218
 
219
219
  # Create task
220
- cat > .ace-taskflow/v.test/tasks/001-feature/001-test-task.s.md << 'EOF'
220
+ cat > .ace-task/v.test/tasks/001-feature/001-test-task.s.md << 'EOF'
221
221
  ---
222
222
  id: v.test+task.001
223
223
  status: pending
@@ -242,13 +242,13 @@ Test task for E2E testing.
242
242
  EOF
243
243
 
244
244
  # Commit the structure
245
- git add .ace-taskflow/
245
+ git add .ace-task/
246
246
  git commit -m "Add taskflow structure" --quiet
247
247
 
248
248
  # Set PROJECT_ROOT_PATH for isolated testing
249
249
  export PROJECT_ROOT_PATH="$REPO_DIR"
250
250
 
251
- # Now ace-taskflow commands will use this isolated structure
251
+ # Now ace-task commands will use this isolated structure
252
252
  # ace-task show 001 # Should find the test task
253
253
  ```
254
254
 
@@ -284,13 +284,13 @@ export PROJECT_ROOT_PATH="$REPO_DIR"
284
284
  ### Testing Task Selection
285
285
 
286
286
  ```bash
287
- # Verify ace-taskflow can find the task
287
+ # Verify ace-task can find the task
288
288
  ace-task show 001
289
289
  # Should output task details
290
290
 
291
291
  # Verify task file path
292
292
  ace-task show 001 --path
293
- # Should output: .ace-taskflow/v.test/tasks/001-feature/001-test-task.s.md
293
+ # Should output: .ace-task/v.test/tasks/001-feature/001-test-task.s.md
294
294
  ```
295
295
 
296
296
  ### Testing Status Updates
@@ -69,10 +69,11 @@ ace-docs:
69
69
 
70
70
  ## Workarounds Used
71
71
 
72
- {Document any workarounds the agent had to employ to complete the test. These indicate areas needing improvement.}
72
+ {Document any workarounds the agent had to employ to complete the test. These are failure signals against the public-surface contract and should drive product/docs/help or scenario changes.}
73
73
 
74
74
  - **Issue:** {What required a workaround}
75
75
  **Workaround:** {What was done instead}
76
+ **Why this is a gap:** {Which part of the public surface or scenario contract failed}
76
77
 
77
78
  ## Positive Observations
78
79
 
@@ -86,4 +87,4 @@ ace-docs:
86
87
  {Suggestions for improving this specific test scenario based on execution experience.}
87
88
 
88
89
  - {Recommendation 1}
89
- - {Recommendation 2}
90
+ - {Recommendation 2}
@@ -36,11 +36,16 @@ unit-coverage-reviewed:
36
36
  # Optional: Primary command under test
37
37
  tool-under-test: {ace-tool}
38
38
 
39
+ # Goal-style scenarios should be doable from docs/usage/--help and the public CLI
40
+ # without hidden runner recipes or workaround instructions.
41
+
39
42
  # Optional: Declared sandbox artifact layout
40
43
  sandbox-layout:
41
- results/tc/01/: "Goal 1 artifacts"
44
+ results/tc/01/: "Goal 1 outcome artifacts"
42
45
 
43
46
  # Optional: Prerequisites
47
+ # `requires.tools` declares dependencies; it does not justify fallback probing as
48
+ # the main oracle for an ACE CLI scenario.
44
49
  requires:
45
50
  tools: [{tool1}, {tool2}]
46
51
  ruby: ">= 3.0"
@@ -55,7 +60,7 @@ setup:
55
60
  # Optional detached tmux session for test isolation (uses unique run ID)
56
61
  # - tmux-session:
57
62
  # name-source: run-id
58
- - run: "cp $PROJECT_ROOT_PATH/mise.toml mise.toml && mise trust mise.toml"
63
+ - run: "cp ${ACE_E2E_SOURCE_ROOT:-$PROJECT_ROOT_PATH}/mise.toml mise.toml && mise trust mise.toml"
59
64
  # Uncomment as needed:
60
65
  # - copy-fixtures # Copy fixtures/ directory to sandbox
61
66
  # - agent-env: # Environment variables passed to runner/verifier agent subprocess
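Note on the `run:` change in the hunk above: `${ACE_E2E_SOURCE_ROOT:-$PROJECT_ROOT_PATH}` is standard POSIX parameter expansion. It yields `ACE_E2E_SOURCE_ROOT` when that variable is set and non-empty, and falls back to `PROJECT_ROOT_PATH` otherwise. The diff does not show who sets `ACE_E2E_SOURCE_ROOT`, so the sketch below only illustrates the fallback; the helper name and paths are hypothetical.

```bash
# Hypothetical helper mirroring ${ACE_E2E_SOURCE_ROOT:-$PROJECT_ROOT_PATH}.
resolve_source_root() {
  printf '%s\n' "${ACE_E2E_SOURCE_ROOT:-$PROJECT_ROOT_PATH}"
}

PROJECT_ROOT_PATH=/tmp/project    # hypothetical path
unset ACE_E2E_SOURCE_ROOT
resolve_source_root               # prints /tmp/project

ACE_E2E_SOURCE_ROOT=/tmp/source   # hypothetical path
resolve_source_root               # prints /tmp/source

ACE_E2E_SOURCE_ROOT=
resolve_source_root               # prints /tmp/project (":-" treats empty as unset)
```

Because of the `:-` form, an existing scenario that only exports `PROJECT_ROOT_PATH` should keep resolving the same `mise.toml` path as before.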
@@ -16,14 +16,18 @@ ace-docs:
16
16
  ## Workspace
17
17
 
18
18
  - Working directory: {sandbox-root}
19
- - Output directory: `results/tc/{NN}/`
19
+ - Outcome artifacts only: `results/tc/{NN}/`
20
20
 
21
21
  ## Constraints
22
22
 
23
23
  - Use only declared scenario tools (`ace-*` and explicit exceptions)
24
- - Keep artifacts under `results/tc/{NN}/`
24
+ - Keep only product outcomes or essential command captures under `results/tc/{NN}/`
25
+ - Declare every verifier-dependent path explicitly in the runner or scenario setup
26
+ - Grouped capture shorthand such as ``results/tc/{NN}/cmd.stdout`, `.stderr`, `.exit`` is allowed for exact sibling files
27
+ - Do not write helper inputs, reflections, PASS/FAIL summaries, manifests, or temp files under `results/tc/{NN}/`
25
28
  - Do not write outside sandbox
26
29
  - Execute actions only; do not assign PASS/FAIL in runner file
30
+ - Follow the public user path from docs/usage/`--help`; do not embed hidden recipes or workaround branches in the TC
27
31
 
28
32
  <!--
29
33
  Companion verifier file (`TC-{NNN}-{slug}.verify.md`) example:
@@ -35,11 +39,19 @@ Companion verifier file (`TC-{NNN}-{slug}.verify.md`) example:
35
39
  - Impact Checks:
36
40
  - {Sandbox/project impact expectation}
37
41
  - Artifact Checks:
38
- - {Artifact expectation}
42
+ - {Outcome artifact expectation when a real end-user-visible file or output should exist}
43
+ - Runner Observations:
44
+ - {How final runner observations help disambiguate the result when state alone is not enough, and record friction/workaround pressure if present}
39
45
  - Debug Fallback:
40
46
  - {Optional stdout/stderr/exit evidence when needed}
41
47
 
48
+ ## Overall User Outcome
49
+
50
+ - **Works for end user**: {yes|partial|no}
51
+ - **Friction**: {User-visible friction or `None`}
52
+ - **Feedback**: {Product/docs/help feedback or `None`}
53
+
42
54
  ## Verdict
43
55
 
44
- - Pass when impact and artifact checks are satisfied from sandbox evidence.
56
+ - Pass when the public path or retained contract is satisfied from sandbox evidence. Undeclared helper artifacts alone should not fail the goal.
45
57
  -->
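Note on the grouped capture shorthand introduced in the hunk above (`cmd.stdout`, `.stderr`, `.exit` as exact sibling files): the diff does not show how the runner produces these files, so the following is only a minimal sketch of one way to write such a triple. The function name `capture_cmd` is an assumption for illustration, not part of the package.

```bash
# Hypothetical helper: run a command and write <prefix>.stdout, <prefix>.stderr
# and <prefix>.exit as sibling files.
capture_cmd() {
  prefix=$1
  shift
  mkdir -p "$(dirname "$prefix")"
  "$@" > "$prefix.stdout" 2> "$prefix.stderr"
  # $? here is still the exit status of the command that was just run.
  echo $? > "$prefix.exit"
}

# Usage, assuming the sandbox root is the current directory:
#   capture_cmd results/tc/01/cmd ace-task show 001
# leaves results/tc/01/cmd.stdout, cmd.stderr and cmd.exit behind.
```

Keeping the three files under one prefix matches the "exact sibling files" wording and avoids writing any extra helper files under `results/tc/{NN}/`.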
@@ -1,10 +1,17 @@
1
1
  ---
2
+ name: e2e-analyze-failures
3
+ description: Analyze failing E2E scenarios, classify root causes, and surface docs/help drift before fixes.
4
+ allowed-tools:
5
+ - Bash(ace-bundle:*)
6
+ - Read
7
+ - Grep
8
+ - Glob
2
9
  doc-type: workflow
3
10
  title: Analyze E2E Failures Workflow
4
11
  purpose: analyze-e2e-failures workflow instruction
5
12
  ace-docs:
6
- last-updated: 2026-03-04
7
- last-checked: 2026-03-21
13
+ last-updated: 2026-04-19
14
+ last-checked: 2026-04-19
8
15
  ---
9
16
 
10
17
  # Analyze E2E Failures Workflow
@@ -17,6 +24,7 @@ This workflow determines whether each failure is caused by:
17
24
  - application/tool code
18
25
  - E2E test definition/spec
19
26
  - E2E runner/infrastructure
27
+ - stale, missing, or misleading docs/help that made the public user path unclear
20
28
 
21
29
  ## Hard Rule
22
30
 
@@ -49,6 +57,11 @@ Use exactly one category per failed TC:
49
57
  3. `runner-infrastructure-issue`
50
58
  - Sandbox/setup/provider/parsing/orchestration issue
51
59
 
60
+ Public-surface interpretation rules:
61
+ - If the TC fails because it encoded a hidden recipe or workaround, classify it as `test-issue`.
62
+ - If the intended user job is valid but the public CLI/docs/`--help` do not support it cleanly, classify it as `code-issue` with a fix target in product docs/help or CLI help rather than preserving the workaround.
63
+ - If the failure is about internal detail that a user cannot or need not observe from the public surface, prefer narrowing/removing the TC over deepening the runner.
64
+
52
65
  ## Required Evidence Sources
53
66
 
54
67
  Use these files as primary evidence:
@@ -57,6 +70,8 @@ Use these files as primary evidence:
57
70
  - `metadata.yml`
58
71
  - Relevant artifacts in `results/tc/{NN}/`
59
72
 
73
+ Aggregate suite/package reports are indexing aids only. For failed TC IDs, categories, and evidence, the per-scenario `report.md` in the referenced report directory is the canonical source of truth.
74
+
60
75
  ## Analysis Procedure
61
76
 
62
77
  1. Locate latest failing report directories
@@ -68,20 +83,40 @@ ls -lt .ace-local/test-e2e/*-reports/ 2>/dev/null | head -20
68
83
  - failed TC IDs
69
84
  - reported category/evidence from metadata
70
85
  - corroborating artifact evidence
86
+ - if analyzing from a suite/package report, read the referenced per-scenario `report.md` before accepting any failed-TC mapping
87
+
88
+ If the aggregate report and per-scenario report disagree:
89
+ - trust the per-scenario `report.md`
90
+ - classify the mismatch itself as a runner/reporting issue in your analysis notes
91
+ - do not plan fixes from the aggregate failed-TC mapping alone
71
92
 
72
93
  3. Reclassify each failed TC if needed
73
94
  - Use `code-issue`, `test-issue`, or `runner-infrastructure-issue`
74
95
  - Add confidence: `high|medium|low`
75
96
  - Add one disconfirming check per TC
76
97
  - If confidence is `medium` or `low`, run at least one additional diagnostic read/search before final decision
77
-
78
- 4. Recommend rerun scope (cost-aware)
98
+ - Before claiming sandbox escape or fixture contamination, compare repo `git status --short` before and after the relevant E2E run when that evidence is available. Do not infer escape solely from an after-the-fact dirty tree.
99
+ - Check whether the scenario required a hidden recipe or workaround to reach the goal. If yes, record that explicitly in the evidence and classification.
100
+
101
+ 4. Audit docs/help drift for each failed TC
102
+ - Identify the user job the TC is trying to prove.
103
+ - Check the public surface that a normal user or agent would consult:
104
+ - package `README.md`
105
+ - package `docs/usage.md`
106
+ - package `docs/getting-started.md`
107
+ - package `docs/handbook.md`
108
+ - direct command `--help` output for the command involved
109
+ - Record whether the failure exposes stale, missing, or misleading docs/help.
110
+ - If drift exists, list concrete docs/help update targets and make them part of the fix target.
111
+ - If no drift exists, record `None` explicitly. Do not omit the docs/help assessment.
112
+
113
+ 5. Recommend rerun scope (cost-aware)
79
114
  - `scenario` (default)
80
115
  - `package`
81
116
  - `suite`
82
117
  with explicit rationale
83
118
 
84
- 5. Choose autonomous fix decision per failed TC
119
+ 6. Choose autonomous fix decision per failed TC
85
120
  - Select a single primary fix action
86
121
  - Provide concrete file targets in priority order
87
122
  - Define explicit no-touch boundaries
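Note on the new step 3 rule in the hunk above (compare repo `git status --short` before and after the run rather than inferring escape from a dirty tree): a minimal sketch of that comparison follows. It assumes the two status snapshots were saved to files; the helper name `new_dirty_entries` and the file names are hypothetical.

```bash
# Hypothetical helper: print status entries that appear in the "after"
# snapshot but not in the "before" snapshot.
new_dirty_entries() {
  before_file=$1
  after_file=$2
  sorted_before=$(mktemp)
  sorted_after=$(mktemp)
  sort "$before_file" > "$sorted_before"
  sort "$after_file" > "$sorted_after"
  # comm -13 keeps only the lines unique to the second (after) file.
  comm -13 "$sorted_before" "$sorted_after"
  rm -f "$sorted_before" "$sorted_after"
}

# Usage:
#   git status --short > before.txt
#   ...run the relevant E2E scenario...
#   git status --short > after.txt
#   new_dirty_entries before.txt after.txt
```

Empty output means the run added no new dirty entries, so a tree that was already dirty beforehand is not evidence of sandbox escape.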
@@ -99,6 +134,17 @@ Produce this section before exiting:
99
134
  | TS-FOO-001 / TC-003 | test-issue | summary + artifact mismatch details | scenario files | test-scenario-runner | TC-003-foo.runner.md | TC-003-foo.verify.md | lib/** | high | re-run scenario after spec adjustment | scenario |
100
135
  ```
101
136
 
137
+ Then produce a docs/help drift section. This section is required even when no drift is found:
138
+
139
+ ```markdown
140
+ ## Docs / Help Drift From E2E Failures
141
+
142
+ | Scenario / TC | User Job | Public Surface Checked | Drift Found | Evidence | Update Targets | Action |
143
+ |---|---|---|---|---|---|---|
144
+ | TS-FOO-001 / TC-003 | user-facing job being tested | README, docs/usage.md, --help | yes | docs show stale flag missing from help | docs/usage.md, CLI --help | update docs/help before preserving scenario path |
145
+ | TS-BAR-001 / TC-001 | user-facing job being tested | README, docs/usage.md, --help | no | public path is documented and help matches | None | no docs/help update |
146
+ ```
147
+
102
148
  Then include:
103
149
 
104
150
  ```markdown
@@ -121,6 +167,7 @@ Then include:
121
167
  - Fix target is explicit per failed TC
122
168
  - Fix target files are explicit per failed TC (primary + fallback)
123
169
  - No-touch boundaries are explicit per failed TC
170
+ - Docs/help drift is assessed for every failed TC with concrete update targets or `None`
124
171
  - A single autonomous chosen fix decision is present per failed TC
125
172
  - Rerun scope recommendation is cost-aware
126
- - No code/scenario/runner edits were made in this workflow
173
+ - No code/scenario/runner edits were made in this workflow