ace-test-runner-e2e 0.29.6 → 0.38.11

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (49)
  1. checksums.yaml +4 -4
  2. data/.ace-defaults/e2e-runner/config.yml +14 -2
  3. data/CHANGELOG.md +187 -0
  4. data/README.md +2 -2
  5. data/exe/ace-test-e2e-sh +9 -4
  6. data/handbook/guides/e2e-testing.g.md +43 -9
  7. data/handbook/guides/scenario-yml-reference.g.md +16 -8
  8. data/handbook/guides/tc-authoring.g.md +12 -5
  9. data/handbook/skills/as-e2e-fix/SKILL.md +2 -2
  10. data/handbook/skills/as-e2e-review/SKILL.md +2 -2
  11. data/handbook/templates/ace-taskflow-fixture.template.md +17 -17
  12. data/handbook/templates/agent-experience-report.template.md +3 -2
  13. data/handbook/templates/scenario.yml.template.yml +13 -2
  14. data/handbook/templates/tc-file.template.md +14 -4
  15. data/handbook/workflow-instructions/e2e/analyze-failures.wf.md +53 -6
  16. data/handbook/workflow-instructions/e2e/create.wf.md +139 -23
  17. data/handbook/workflow-instructions/e2e/execute.wf.md +11 -7
  18. data/handbook/workflow-instructions/e2e/fix.wf.md +65 -15
  19. data/handbook/workflow-instructions/e2e/plan-changes.wf.md +17 -1
  20. data/handbook/workflow-instructions/e2e/review.wf.md +44 -28
  21. data/handbook/workflow-instructions/e2e/rewrite.wf.md +17 -3
  22. data/handbook/workflow-instructions/e2e/run.wf.md +50 -26
  23. data/handbook/workflow-instructions/e2e/setup-sandbox.wf.md +4 -4
  24. data/lib/ace/test/end_to_end_runner/atoms/skill_prompt_builder.rb +7 -5
  25. data/lib/ace/test/end_to_end_runner/atoms/skill_result_parser.rb +73 -7
  26. data/lib/ace/test/end_to_end_runner/cli/commands/run_test.rb +21 -8
  27. data/lib/ace/test/end_to_end_runner/models/test_case.rb +8 -2
  28. data/lib/ace/test/end_to_end_runner/models/test_result.rb +9 -3
  29. data/lib/ace/test/end_to_end_runner/models/test_scenario.rb +4 -2
  30. data/lib/ace/test/end_to_end_runner/molecules/affected_detector.rb +7 -2
  31. data/lib/ace/test/end_to_end_runner/molecules/bwrap_sandbox_backend.rb +271 -0
  32. data/lib/ace/test/end_to_end_runner/molecules/config_loader.rb +28 -1
  33. data/lib/ace/test/end_to_end_runner/molecules/integration_runner.rb +122 -0
  34. data/lib/ace/test/end_to_end_runner/molecules/pipeline_executor.rb +165 -25
  35. data/lib/ace/test/end_to_end_runner/molecules/pipeline_prompt_bundler.rb +121 -8
  36. data/lib/ace/test/end_to_end_runner/molecules/pipeline_report_generator.rb +91 -19
  37. data/lib/ace/test/end_to_end_runner/molecules/pipeline_sandbox_builder.rb +119 -18
  38. data/lib/ace/test/end_to_end_runner/molecules/report_writer.rb +13 -12
  39. data/lib/ace/test/end_to_end_runner/molecules/sandbox_runtime_builder.rb +282 -0
  40. data/lib/ace/test/end_to_end_runner/molecules/scenario_loader.rb +85 -5
  41. data/lib/ace/test/end_to_end_runner/molecules/setup_executor.rb +98 -16
  42. data/lib/ace/test/end_to_end_runner/molecules/suite_report_writer.rb +241 -97
  43. data/lib/ace/test/end_to_end_runner/molecules/test_discoverer.rb +38 -13
  44. data/lib/ace/test/end_to_end_runner/molecules/test_executor.rb +27 -5
  45. data/lib/ace/test/end_to_end_runner/organisms/suite_orchestrator.rb +73 -15
  46. data/lib/ace/test/end_to_end_runner/organisms/test_orchestrator.rb +120 -19
  47. data/lib/ace/test/end_to_end_runner/version.rb +1 -1
  48. data/lib/ace/test/end_to_end_runner.rb +2 -0
  49. metadata +19 -2
@@ -71,6 +71,7 @@ For REMOVE due to overlap, replacement evidence is mandatory:
71
71
 
72
72
  **KEEP** — The TC has genuine E2E value and needs no changes. Criteria (all must be true):
73
73
  - TC passes the E2E Value Gate (tests real CLI binary + external tools + filesystem I/O)
74
+ - TC passes the Public-Surface Gate (user can do the job from docs/usage/`--help` without hidden recipes or workarounds)
74
75
  - Related source code has no changes since `last-verified`
75
76
  - TC structure is valid and assertions are current
76
77
 
@@ -79,6 +80,7 @@ For REMOVE due to overlap, replacement evidence is mandatory:
79
80
  - TC scope is too broad (should be narrowed to only E2E-exclusive aspects)
80
81
  - TC scope is too narrow (missing assertions for related behavior in same CLI invocation)
81
82
  - TC has structure issues flagged in the review
83
+ - TC is hidden-recipe-driven or workaround-driven but the underlying user job should still be supported by the public surface after scenario/docs/help correction
82
84
 
83
85
  **CONSOLIDATE** — The TC should merge with another TC. Criteria (any one is sufficient):
84
86
  - Multiple TCs share the same CLI invocation and could be a single TC with multiple assertions
@@ -91,6 +93,7 @@ For each classification, document:
91
93
  - For REMOVE (overlap): replacement evidence (`existing unit tests` or `planned unit backfill`)
92
94
  - For MODIFY: what specifically needs to change
93
95
  - For CONSOLIDATE: the target TC and which assertions merge
96
+ - Whether the current TC is public-surface-valid, hidden-recipe-driven, workaround-driven, or checking an unsupported internal detail
94
97
 
95
98
  ### 4. Identify New TCs Needed
96
99
 
@@ -112,6 +115,12 @@ For each candidate, answer: "Does this require the full CLI binary + real extern
112
115
  - If NO: skip — unit tests cover this (or add explicit unit test action if coverage is missing)
113
116
  - If YES: include in the plan
114
117
 
118
+ **Filter through Public-Surface Gate:**
119
+ For each candidate, answer: "Can a user do this job through the public tool surface without hidden recipes or workarounds?"
120
+ - If NO because the job should be supported: add a product/docs/help improvement action and do not encode the workaround into the TC
121
+ - If NO because the detail is not user-visible: skip or narrow the TC
122
+ - If YES: keep planning the TC
123
+
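The Public-Surface Gate added in this hunk is a three-way decision. The sketch below is only an illustration of that decision in Ruby; the method name, keyword arguments, and symbols are hypothetical and do not come from the gem.

```ruby
# Hypothetical helper: maps the Public-Surface Gate answers to a planning action.
# Method name, keywords, and symbols are illustrative assumptions.
def public_surface_action(achievable:, reason: nil)
  # YES: the user can do the job through the public tool surface.
  return :keep_planning if achievable

  case reason
  when :should_be_supported
    # NO, but the job is valid: plan a product/docs/help improvement and
    # do not encode the workaround into the TC.
    :add_improvement_action
  when :not_user_visible
    # NO, because the detail is internal: skip or narrow the TC.
    :skip_or_narrow
  else
    raise ArgumentError, "a NO answer needs a reason"
  end
end
```

Requiring a reason for every NO answer keeps the two NO branches from being silently merged, which is the distinction the gate text draws.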
115
124
  ### 5. Propose Scenario Structure
116
125
 
117
126
  Group all planned TCs (KEEP + MODIFY + CONSOLIDATE targets + ADD) into scenarios:
@@ -176,6 +185,13 @@ Format the complete change plan:
176
185
  |----|---------------|
177
186
  | {tc-id} | Update assertions — {feature} behavior changed in {commit} |
178
187
  | {tc-id} | Narrow scope — remove assertions covered by unit tests |
188
+ | {tc-id} | Remove hidden recipe / workaround dependence — rewrite around public docs/help/CLI path |
189
+
190
+ ### Public-Surface Gaps ({n} actions)
191
+
192
+ | Action | Target | Why |
193
+ |--------|--------|-----|
194
+ | Update docs/help/CLI | {package/path} | {job is valid but current public surface is too weak for the E2E path} |
179
195
 
180
196
  ### CONSOLIDATE ({n} TCs → {n} TCs)
181
197
 
@@ -252,4 +268,4 @@ implementation code and at least one E2E test before planning changes.
252
268
  If the user rejects the plan:
253
269
  1. Ask which classifications they disagree with
254
270
  2. Adjust the plan based on feedback
255
- 3. Re-present the updated plan
271
+ 3. Re-present the updated plan
@@ -13,7 +13,9 @@ This workflow performs deep exploration of a package to produce a **coverage mat
13
13
 
14
14
  During review, treat the runner/verifier split as a first-class quality check:
15
15
  - Runner must be execution-only (no verdict language).
16
- - Verifier must be impact-first (sandbox impact before artifacts/debug).
16
+ - Verifier must be impact-first (sandbox impact before runner observations and debug).
17
+ - `results/tc/{NN}/` must not be used for helper inputs or verifier-feeding helper reports.
18
+ - Goal-style TCs must also pass the public-surface check: the runner should be able to do the job from docs/usage/`--help` and the tool under test, without hidden recipes or workarounds.
17
19
 
18
20
  **Pipeline position:** Stage 1 of 3 (Explore)
19
21
 
@@ -86,9 +88,8 @@ Map what unit tests cover at each layer:
86
88
 
87
89
  **List all test files by layer:**
88
90
  ```bash
89
- find {PACKAGE}/test/atoms -name "*_test.rb" 2>/dev/null | sort
90
- find {PACKAGE}/test/molecules -name "*_test.rb" 2>/dev/null | sort
91
- find {PACKAGE}/test/organisms -name "*_test.rb" 2>/dev/null | sort
91
+ find {PACKAGE}/test/fast -name "*_test.rb" 2>/dev/null | sort
92
+ find {PACKAGE}/test/feat -name "*_test.rb" 2>/dev/null | sort
92
93
  ```
93
94
 
94
95
  **For each test file:**
@@ -100,7 +101,7 @@ Build a unit test map:
100
101
 
101
102
  | Test File | Layer | Feature Covered | Test Count | Assertion Count |
102
103
  |-----------|-------|-----------------|------------|-----------------|
103
- | {path} | atom | {feature} | {n} | {n} |
104
+ | {path} | fast/feat | {feature} | {n} | {n} |
104
105
 
105
106
  ### 4. Inventory Existing E2E Coverage
106
107
 
@@ -116,20 +117,32 @@ find {PACKAGE}/test/e2e -name "scenario.yml" -path "*/TS-*" 2>/dev/null | sort
116
117
  - `tags`, `cost-tier`, `e2e-justification`, `unit-coverage-reviewed`
117
118
  - `last-verified`, `verified-by`
118
119
  - Extract the objective (what the TC verifies)
120
+ - Record the TC's primary oracle:
121
+ - final sandbox state / real product output
122
+ - runner observations as supporting context
123
+ - debug fallback only when necessary
124
+ - Record whether the job is achievable from the public surface:
125
+ - `valid`
126
+ - `hidden-recipe-driven`
127
+ - `workaround-driven`
128
+ - `unsupported-detail`
129
+ - Record qualitative friction:
130
+ - `low`, `medium`, `high`
119
131
  - Identify which CLI commands the TC runs
120
- - Count verification steps (PASS/FAIL checks)
132
+ - Record command fingerprint (`command + key flags`) for each command assertion
121
133
  - Map to the feature it tests
122
134
  - Mark TC evidence status:
123
- - `complete` when `e2e-justification` is present and `unit-coverage-reviewed` has at least one path
135
+ - `complete` when `e2e-justification` is present, the verifier is end-state-first, and `unit-coverage-reviewed` has at least one path
124
136
  - `missing` otherwise
137
+ - `at-risk` when evidence is existence-only, helper-artifact-driven, duplicate command invocations are detected, or the TC is hidden-recipe/workaround-driven
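The three evidence statuses in this hunk can be read as a small classifier. The following is a minimal sketch under stated assumptions: the `TcEvidence` struct and its field names are hypothetical, and checking `at-risk` before `complete` is an assumption, since the workflow text does not state a precedence.

```ruby
# Hypothetical record of what the reviewer noted for one TC.
TcEvidence = Struct.new(
  :e2e_justification, :unit_coverage_reviewed, :end_state_first,
  :existence_only, :helper_artifact_driven, :duplicate_invocations,
  :public_surface_fit,
  keyword_init: true
)

RISKY_FITS = %w[hidden-recipe-driven workaround-driven].freeze

def evidence_status(tc)
  # Assumption: any risk signal wins over an otherwise complete record.
  at_risk = tc.existence_only || tc.helper_artifact_driven ||
            tc.duplicate_invocations || RISKY_FITS.include?(tc.public_surface_fit)
  return "at-risk" if at_risk

  complete = !tc.e2e_justification.to_s.strip.empty? &&
             tc.end_state_first &&
             !Array(tc.unit_coverage_reviewed).empty?
  complete ? "complete" : "missing"
end
```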
125
138
 
126
139
  If `--scope` was provided, filter to only the specified scenario.
127
140
 
128
141
  Build an E2E test map:
129
142
 
130
- | TC ID | Title | CLI Command | Feature Tested | Verifications | Tags | Cost Tier | E2E Justification | Unit Coverage Reviewed | Evidence |
131
- |-------|-------|-------------|----------------|---------------|------|-----------|-------------------|------------------------|----------|
132
- | {id} | {title} | {command} | {feature} | {n} | {tags} | {tier} | {reason or "(missing)"} | {files or "(missing)"} | {complete/missing} |
143
+ | TC ID | Title | Command Invocations | Feature Tested | Primary Oracle | Public Surface Fit | Friction | Tags | Cost Tier | E2E Justification | Unit Coverage Reviewed | Evidence | False-Positive Risk |
144
+ |-------|-------|-------------|----------------|----------------|--------------------|----------|------|-----------|-------------------|------------------------|----------|---------------------|
145
+ | {id} | {title} | {command list} | {feature} | {state / output / observations+fallback} | {valid/hidden-recipe/workaround/unsupported-detail} | {low/medium/high} | {tags} | {tier} | {reason or "(missing)"} | {files or "(missing)"} | {complete/missing/at-risk} | {low/medium/high} |
133
146
 
134
147
  ### 5. Build Coverage Matrix
135
148
 
@@ -137,19 +150,19 @@ Combine the three inventories into a single coverage matrix:
137
150
 
138
151
  **Matrix structure:**
139
152
  - **Rows:** Features/behaviors from step 2
140
- - **Columns:** Unit Tests (atoms/molecules/organisms) | E2E Tests
153
+ - **Columns:** Unit Tests (`fast`/`feat`) | E2E Tests
141
154
  - **Cells:** Test file references + counts, or "none"
142
155
 
143
156
  ```markdown
144
157
  ### Coverage Matrix
145
158
 
146
- | Feature | Unit Tests | E2E Tests | Status |
147
- |---------|-----------|-----------|--------|
148
- | {feature} | {test files} ({n} assertions) | {TC IDs} ({n} verifications) | Covered |
149
- | {feature} | {test files} ({n} assertions) | none | Unit-only |
150
- | {feature} | none | {TC IDs} ({n} verifications) | E2E-only |
151
- | {feature} | {test files} ({n} assertions) | {TC IDs} ({n} verifications) | Overlap |
152
- | {feature} | none | none | Gap |
159
+ | Feature | Unit Tests | E2E Tests | Evidence Strength | False-Positive Risk | Status |
160
+ |---------|-----------|-----------|------------------|----------------------|--------|
161
+ | {feature} | {test files} ({n} assertions) | {TC IDs} | state+content + observations | low | Covered |
162
+ | {feature} | {test files} ({n} assertions) | none | none | n/a | Unit-only |
163
+ | {feature} | none | {TC IDs} | state+content | low | E2E-only |
164
+ | {feature} | {test files} ({n} assertions) | {TC IDs} | debug-heavy, helper-artifact-driven, or workaround-driven | medium/high | Overlap |
165
+ | {feature} | none | none | none | high | Gap |
153
166
  ```
154
167
 
155
168
  **Classify each row:**
@@ -158,6 +171,7 @@ Combine the three inventories into a single coverage matrix:
158
171
  - **E2E-only** — E2E test exists but no unit test. Valid if the behavior is inherently E2E (subprocess execution, filesystem discovery).
159
172
  - **Overlap** — Both unit and E2E test the same assertions. E2E TC is a candidate for removal.
160
173
  - **Gap** — Neither unit nor E2E test covers this feature. Needs investigation.
174
+ - If a row has `false-positive risk` `high`, downgrade Covered/Overlap to **manual-review** until evidence is corrected.
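The downgrade rule added here is small enough to express directly. This is a sketch only; the method name and the string values for status and risk are assumptions taken from the table cells in the hunk above.

```ruby
# Hypothetical helper: applies the manual-review downgrade to one matrix row.
def effective_status(status, false_positive_risk)
  downgradable = %w[Covered Overlap].include?(status)
  downgradable && false_positive_risk == "high" ? "manual-review" : status
end
```

Rows that are already `Unit-only`, `E2E-only`, or `Gap` pass through unchanged, because the rule names only Covered and Overlap.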
161
175
 
162
176
  ### 6. Generate Review Report
163
177
 
@@ -168,7 +182,7 @@ Produce the full review report with actionable findings:
168
182
 
169
183
  **Reviewed:** {timestamp}
170
184
  **Scope:** {package-wide or scenario-id}
171
- **Workflow version:** 2.1
185
+ **Workflow version:** 2.2
172
186
 
173
187
  ### Summary
174
188
 
@@ -179,7 +193,8 @@ Produce the full review report with actionable findings:
179
193
  | Unit assertions | {n} |
180
194
  | E2E scenarios | {n} |
181
195
  | E2E test cases | {n} |
182
- | TCs with decision evidence | {n}/{total} |
196
+ | TCs with end-state-first evidence | {n}/{total} |
197
+ | High-risk helper-artifact TCs | {n}/{total} |
183
198
 
184
199
  ### Coverage Matrix
185
200
 
@@ -187,23 +202,24 @@ Produce the full review report with actionable findings:
187
202
 
188
203
  ### Overlap Analysis
189
204
 
190
- TCs that may fail the E2E Value Gate (unit tests cover the same behavior):
205
+ TCs that may fail the E2E Value Gate (unit tests cover the same behavior or high false-positive risk):
191
206
 
192
207
  | TC ID | Feature | Overlapping Unit Tests | Recommendation |
193
208
  |-------|---------|----------------------|----------------|
194
209
  | {id} | {feature} | {test files} | Remove — unit tests cover this fully |
195
- | {id} | {feature} | {test files} | Keep — TC tests CLI pipeline, units test logic |
210
+ | {id} | {feature} | {test files} | Keep — TC tests real CLI journey and final integrated outcome |
211
+ | {id} | {feature} | {test files} | Strengthen — currently helper-artifact-driven, workaround-driven, or debug-heavy |
196
212
 
197
213
  **Candidates for removal:** {n} TCs have full overlap with unit tests
198
214
 
199
215
  ### E2E Decision Record Coverage
200
216
 
201
- | TC ID | Evidence Status | Missing Fields |
202
- |-------|------------------|----------------|
203
- | {id} | complete | none |
204
- | {id} | missing | e2e-justification, unit-coverage-reviewed |
217
+ | TC ID | Evidence Status | Public Surface Fit | Friction | Missing Fields / Contract Drift |
218
+ |-------|------------------|--------------------|----------|-------------------------------|
219
+ | {id} | complete | valid | low | none |
220
+ | {id} | missing | hidden-recipe-driven | high | e2e-justification, unit-coverage-reviewed, end-state oracle |
205
221
 
206
- **Action:** Any TC with missing evidence should be updated in `scenario.yml` during the next rewrite cycle.
222
+ **Action:** Any TC with missing evidence, helper-artifact drift, hidden recipes, workaround dependence, or unsupported internal-detail checks should be updated during the next rewrite cycle.
207
223
 
208
224
  ### Gap Analysis
209
225
 
@@ -283,4 +299,4 @@ Package '{package}' not found.
283
299
 
284
300
  Available packages:
285
301
  {list of ace-* directories}
286
- ```
302
+ ```
@@ -30,7 +30,8 @@ ace-bundle wfi://e2e/review → ace-bundle wfi://e2e/plan-changes → ace-bu
30
30
 
31
31
  - Keep scenario IDs in `TS-<PACKAGE_SHORT>-<NNN>[-slug]`
32
32
  - Keep standalone pairs as `TC-*.runner.md` + `TC-*.verify.md`
33
- - Keep TC artifact outputs under `results/tc/{NN}/`
33
+ - Keep TC outcome artifacts under `results/tc/{NN}/`
34
+ - Keep runner observations in harness reports, not sandbox helper files
34
35
  - Keep summary report fields as `tcs-passed`, `tcs-failed`, `tcs-total`, `failed[].tc`
35
36
  - CLI split reminder:
36
37
  - `ace-test-e2e` runs single-package tests
@@ -41,6 +42,8 @@ ace-bundle wfi://e2e/review → ace-bundle wfi://e2e/plan-changes → ace-bu
41
42
  - Normalize runner files to execution-only language.
42
43
  - Normalize verifier files to verdict-only, impact-first validation.
43
44
  - Keep setup concerns in `scenario.yml` and fixtures, not in TC runner setup sections.
45
+ - Remove helper artifact requirements from `results/tc/{NN}/`; use runner observations instead.
46
+ - Rewrite goal-style TCs around the public user path. Do not preserve hidden recipes, workaround branches, or supporting-tool probes as the way the runner reaches the goal.
44
47
 
45
48
  ## Workflow Steps
46
49
 
@@ -120,11 +123,18 @@ Follow the E2E test writing rules:
120
123
 
121
124
  - **Run the tool first** to verify actual behavior before writing assertions
122
125
  - Apply the E2E Value Gate — every TC must require real CLI binary + external tools + filesystem I/O
123
- - Use `&& echo "PASS" || echo "FAIL"` patterns for every verification step
124
126
  - Follow TC ordering: error paths first, happy path, structure verification, lifecycle, end state
125
127
  - Consolidate assertions sharing the same CLI invocation into a single TC
126
128
  - Target 2-5 TCs per scenario
127
129
  - Test through the CLI interface, not library imports
130
+ - Write runner goals as “do the job” outcomes, not “write a report for the verifier” chores
131
+ - Keep `results/tc/{NN}/` for real outcomes only; avoid helper YAML, path files, command files, and reflections
132
+ - Use runner observations as the only non-filesystem secondary evidence source
133
+ - Make final sandbox state or real product output the primary oracle whenever possible
134
+ - Add behavioral/content assertions only when CLI output itself is part of the user-visible outcome
135
+ - Remove duplicate command-only TCs; fold related assertions into one TC where possible
136
+ - Do not encode exact workaround procedures, hidden command recipes, or internal debugging tricks the user would not infer from docs/usage/`--help`
137
+ - If the job is valid but the public surface is too weak, plan a product/docs/help fix instead of hardcoding the workaround into the TC
128
138
 
129
139
  **Load the TC template for reference:**
130
140
  ```bash
@@ -141,6 +151,9 @@ For each TC classified as MODIFY:
141
151
  - **Narrow scope** — remove assertions that unit tests cover, keep only E2E-exclusive checks
142
152
  - **Broaden scope** — add assertions for related behavior tested by the same CLI invocation
143
153
  - **Fix structure** — add missing sections, fix formatting issues
154
+ - **Replace helper-artifact oracles** — if the existing TC relies on runner-written helper files, rewrite it around final sandbox state plus runner observations
155
+ - **Add evidence gates** — if the existing TC relies on existence-only or missing end-state checks, strengthen the primary oracle before falling back to debug captures
156
+ - **Remove hidden recipes/workarounds** — if the existing TC teaches the runner how to bypass the public surface, rewrite it around the supported user path or narrow/remove the TC
144
157
  3. Update the `last-verified` field if the TC was re-run during modification
145
158
  4. Write the updated TC runner/verifier files
146
159
 
@@ -228,6 +241,7 @@ Present the execution summary:
228
241
  - [ ] TC count matches plan: {yes/no}
229
242
  - [ ] No stale references: {yes/no}
230
243
  - [ ] All scenarios have 2-5 TCs: {yes/no}
244
+ - [ ] Modified/created TCs avoid helper files in `results/tc/{NN}/`: {yes/no}
231
245
 
232
246
  ### Next Steps
233
247
 
@@ -278,4 +292,4 @@ If execution fails partway through:
278
292
  1. Report which actions completed and which failed
279
293
  2. Do not attempt to roll back completed actions
280
294
  3. Show the state of `{PACKAGE}/test/e2e/` after partial execution
281
- 4. Suggest re-running with the remaining actions
295
+ 4. Suggest re-running with the remaining actions
@@ -1,4 +1,12 @@
1
1
  ---
2
+ name: e2e-run
3
+ description: Execute an E2E test scenario with full agent guidance
4
+ allowed-tools:
5
+ - Bash(ace-bundle:*)
6
+ - Read
7
+ - Write
8
+ - Glob
9
+ - Grep
2
10
  doc-type: workflow
3
11
  title: Run E2E Test Workflow
4
12
  purpose: Execute an E2E test scenario with full agent guidance
@@ -13,7 +21,7 @@ This workflow guides an agent through executing an E2E test scenario. It support
13
21
 
14
22
  ## Arguments
15
23
 
16
- - `PACKAGE` (optional) - Package containing the test (e.g., `ace-lint`). If omitted, looks for `test/e2e/` in project root.
24
+ - `PACKAGE` (optional) - Package containing the test (e.g., `ace-lint`). If omitted, discovery uses `test/feat/` and `test/e2e/` in the project root.
17
25
  - `TEST_ID` (optional) - Test identifier (e.g., `TS-LINT-001`). If omitted, runs all tests.
18
26
  - `--run-id RUN_ID` (optional) - Pre-generated timestamp ID for deterministic report paths.
19
27
  - `--report-dir PATH` (optional) - Explicit report directory path (skips computed `${TEST_DIR}-reports`).
@@ -33,18 +41,25 @@ This workflow guides an agent through executing an E2E test scenario. It support
33
41
  - `ace-test-e2e` runs single-package scenarios; `ace-test-e2e-suite` runs suite-level execution
34
42
  - Scenario IDs: `TS-<PACKAGE_SHORT>-<NNN>[-slug]`
35
43
  - Standalone TC pairs: `TC-*.runner.md` + `TC-*.verify.md`
36
- - TC artifacts: `results/tc/{NN}/`
44
+ - TC outcome artifacts: `results/tc/{NN}/`
37
45
  - Summary counters: `tcs-passed`, `tcs-failed`, `tcs-total`, `failed[].tc`
38
46
  - Tag filtering happens at discovery time (before sandbox setup)
39
47
 
40
48
  ## Execution Contract
41
49
 
42
- - Runner instructions are execution-only: perform actions and write evidence.
50
+ - Runner instructions are execution-only: perform actions and return final observations.
51
+ - The runner should follow the public user path from docs/usage/`--help` and the tool under test itself. Do not encode or normalize hidden recipes and workarounds.
43
52
  - Verifier instructions are verification-only: assign verdicts using impact-first checks:
53
+
44
54
  1. sandbox/project state impact
45
- 2. explicit artifacts
46
- 3. debug captures as fallback
55
+ 2. runner observations
56
+ 3. explicit outcome artifacts
57
+ 4. debug captures as fallback
58
+
47
59
  - Do not place ad-hoc setup logic in TC runner files; sandbox setup belongs to `scenario.yml` and fixtures.
60
+ - Do not place helper inputs, reflections, or temp manifests under `results/tc/{NN}/`.
61
+ - Do not ask the runner to write verifier-facing summaries or audit files when final sandbox state can prove the goal directly.
62
+ - If the runner observations show a workaround was needed, treat that as a docs/help/product or scenario-design gap, not a successful steady-state contract.
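The revised impact-first order now has four tiers. As a rough illustration, a verifier-side helper could pick the strongest evidence source available; the constant, symbols, and method name below are hypothetical and only mirror the numbered list in this hunk.

```ruby
# Evidence sources in the order the Execution Contract lists them.
EVIDENCE_ORDER = %i[
  sandbox_state
  runner_observations
  outcome_artifacts
  debug_captures
].freeze

# Returns the highest-priority source present, or nil when none is available.
def primary_evidence(available)
  EVIDENCE_ORDER.find { |source| available.include?(source) }
end
```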
48
63
 
49
64
  ## Execution Environment Guardrail
50
65
 
@@ -55,12 +70,12 @@ This workflow guides an agent through executing an E2E test scenario. It support
55
70
 
56
71
  For CLI providers (`ace-test-e2e`), the deterministic 6-phase pipeline handles execution automatically:
57
72
 
58
- 1. **Setup** `SetupExecutor` creates sandbox (git init, mise.toml, .ace symlinks, `results/tc/{NN}/` dirs)
59
- 2. **Runner prompt** `SkillPromptBuilder` assembles context from `runner.yml.md` + `TC-*.runner.md`
60
- 3. **Runner LLM** Agent executes TC steps in sandbox, produces artifacts
61
- 4. **Verifier prompt** `SkillPromptBuilder` assembles context from `verifier.yml.md` + `TC-*.verify.md`
62
- 5. **Verifier LLM** Independent agent evaluates artifacts against expectations
63
- 6. **Report** `PipelineReportGenerator` produces deterministic summary
73
+ 1. **Setup** -- `SetupExecutor` creates sandbox (git init, mise.toml, .ace symlinks, `results/tc/{NN}/` dirs)
74
+ 2. **Runner prompt** -- `SkillPromptBuilder` assembles context from `runner.yml.md` + `TC-*.runner.md`
75
+ 3. **Runner LLM** -- Agent executes TC steps in sandbox and returns final observations
76
+ 4. **Verifier prompt** -- `SkillPromptBuilder` assembles context from `verifier.yml.md` + `TC-*.verify.md` and includes runner observations
77
+ 5. **Verifier LLM** -- Independent agent evaluates artifacts against expectations
78
+ 6. **Report** -- `PipelineReportGenerator` produces deterministic summary and persists runner observations in harness-managed reports
64
79
 
65
80
  When this workflow is invoked directly (not via CLI pipeline), the agent performs steps 1-6 manually using the workflow steps below.
66
81
 
@@ -83,7 +98,8 @@ When invoked as a subagent (via a batch orchestrator such as an assignment fan-o
83
98
  - **Failed**: {count}
84
99
  - **Total**: {count}
85
100
  - **Report Paths**: {timestamp}-{short-pkg}-{short-id}.*
86
- - **Issues**: Brief description or "None"
101
+ - **Observations**: Brief factual summary or "None"
102
+ - **Issues**: Brief description or "None" (legacy alias if `Observations` is unavailable)
87
103
  ```
88
104
 
89
105
  Do NOT return full report contents, detailed TC output, or setup logs.
@@ -95,6 +111,7 @@ Do NOT return full report contents, detailed TC output, or setup logs.
95
111
  When invoked with `--tc-mode`, the sandbox is pre-populated by `SetupExecutor` and only a single TC is executed. Steps 1-5 of standard mode are skipped.
96
112
 
97
113
  **TC-Level Arguments:**
114
+
98
115
  - `PACKAGE` (required), `TEST_ID` (required), `TC_ID` (required)
99
116
  - `--tc-mode` (required), `--sandbox SANDBOX_PATH` (required)
100
117
  - `--run-id RUN_ID` (optional), `--env KEY=VALUE,...` (optional)
@@ -108,7 +125,8 @@ When invoked with `--tc-mode`, the sandbox is pre-populated by `SetupExecutor` a
108
125
  6. Return TC-level contract
109
126
 
110
127
  **TC-Level Rules:**
111
- - Do NOT create or modify sandbox — `SetupExecutor` already prepared it
128
+
129
+ - Do NOT create or modify sandbox -- `SetupExecutor` already prepared it
112
130
  - Always export `--env` variables before executing test steps
113
131
  - Report actual results even if they differ from expected
114
132
 
@@ -138,6 +156,7 @@ If no tests found after filtering, report error and exit.
138
156
  ### 2. Read Test Scenario
139
157
 
140
158
  For each scenario file, read and parse:
159
+
141
160
  - `test-id`, `title`, `priority`, `duration`, `requires`, `tags`
142
161
 
143
162
  **Multiple tests:** Execute steps 2-7 for each scenario sequentially, then generate a combined summary.
@@ -172,22 +191,24 @@ Report missing prerequisites before proceeding.
172
191
  **Pre-generated Run ID:** If `--run-id` was provided, set `TIMESTAMP_ID=$RUN_ID` instead of generating a new one.
173
192
 
174
193
  **Directory naming convention:**
175
- - `{timestamp}` — 6-char base36 timestamp
176
- - `{short-pkg}` package without `ace-` prefix (e.g., `lint`)
177
- - `{short-id}` lowercase prefix + number (e.g., `ts001`)
194
+
195
+ - `{timestamp}` -- 6-char base36 timestamp
196
+ - `{short-pkg}` -- package without `ace-` prefix (e.g., `lint`)
197
+ - `{short-id}` -- lowercase prefix + number (e.g., `ts001`)
178
198
 
179
199
  ```
180
200
  .ace-local/test-e2e/
181
201
  ├── 8osvnh-lint-ts001/ # Sandbox
182
202
  ├── 8osvnh-lint-ts001-reports/ # Reports (summary.r.md, experience.r.md, metadata.yml)
183
- └── 8osynv-final-report.md # Suite report (sibling)
203
+ └── 8osynv-suite-report.md # Suite report (sibling)
184
204
  ```
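The naming convention above can be sketched as follows. The helper names are hypothetical, and deriving `{short-id}` from the first and third segments of a `TS-<PACKAGE_SHORT>-<NNN>[-slug]` ID is an assumption based on the `ts001` example; the timestamp is taken as input rather than generated.

```ruby
# Hypothetical helpers illustrating the sandbox and reports directory names.
def short_pkg(package)
  package.delete_prefix("ace-")
end

def short_id(test_id)
  # Assumes PACKAGE_SHORT contains no hyphen.
  match = test_id.match(/\A([A-Za-z]+)-[^-]+-(\d+)(?:-.*)?\z/)
  raise ArgumentError, "unexpected test id: #{test_id}" unless match

  "#{match[1].downcase}#{match[2]}"
end

def sandbox_dir_name(timestamp, package, test_id)
  "#{timestamp}-#{short_pkg(package)}-#{short_id(test_id)}"
end

def reports_dir_name(timestamp, package, test_id)
  "#{sandbox_dir_name(timestamp, package, test_id)}-reports"
end
```

With these assumptions, `sandbox_dir_name("8osvnh", "ace-lint", "TS-LINT-001")` yields the `8osvnh-lint-ts001` name shown in the tree.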
185
205
 
186
206
  **Expected variables after setup:**
187
- - `PROJECT_ROOT` — Original project directory
188
- - `TEST_DIR` Sandbox directory (cwd after setup)
189
- - `REPORTS_DIR` Reports directory
190
- - `TIMESTAMP_ID` Unique run identifier
207
+
208
+ - `PROJECT_ROOT` -- Original project directory
209
+ - `TEST_DIR` -- Sandbox directory (cwd after setup)
210
+ - `REPORTS_DIR` -- Reports directory
211
+ - `TIMESTAMP_ID` -- Unique run identifier
191
212
 
192
213
  ### 4.1 Sandbox Isolation Checkpoint (MANDATORY)
193
214
 
@@ -198,7 +219,7 @@ echo "=== SANDBOX ISOLATION CHECK ==="
198
219
  CURRENT_DIR="$(pwd)"
199
220
  [[ "$CURRENT_DIR" == *".ace-local/test-e2e/"* ]] && echo "PASS: In sandbox" || echo "FAIL: NOT in sandbox"
200
221
  git rev-parse --git-dir >/dev/null 2>&1 && { [ -z "$(git remote -v 2>/dev/null)" ] && echo "PASS: No remotes" || echo "FAIL: Remotes found"; } || echo "PASS: No git"
201
- [ -f "CLAUDE.md" ] || [ -f "Gemfile" ] || [ -d ".ace-taskflow" ] && echo "FAIL: Project markers found" || echo "PASS: No markers"
222
+ [ -f "CLAUDE.md" ] || [ -f "Gemfile" ] || [ -d ".ace-task" ] && echo "FAIL: Project markers found" || echo "PASS: No markers"
202
223
  echo "=== END CHECK ==="
203
224
  ```
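For readers more comfortable with Ruby than with chained `||`/`&&` in shell, the path and marker checks above could be sketched like this. It is an illustration only: the constant and method names are hypothetical, and the git-remote check is omitted.

```ruby
# Markers from the updated check: two regular files and one directory.
PROJECT_MARKERS = {
  "CLAUDE.md" => :file,
  "Gemfile" => :file,
  ".ace-task" => :directory
}.freeze

# Mirrors the shell glob *".ace-local/test-e2e/"* on the current directory.
def in_sandbox?(path)
  path.include?(".ace-local/test-e2e/")
end

# Returns the names of project markers present in dir; empty means isolated.
def project_markers_in(dir)
  PROJECT_MARKERS.select do |name, kind|
    path = File.join(dir, name)
    kind == :file ? File.file?(path) : File.directory?(path)
  end.keys
end
```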
204
225
 
@@ -208,7 +229,7 @@ echo "=== END CHECK ==="
208
229
  ### 5. Create Test Data
209
230
 
210
231
  > **Use `ace-test-e2e-sh "$TEST_DIR"` for ALL commands after setup.**
211
- > Each bash block runs in a fresh shell the wrapper ensures sandbox isolation.
232
+ > Each bash block runs in a fresh shell -- the wrapper ensures sandbox isolation.
212
233
 
213
234
  Execute test data creation commands from the scenario, writing files inside `$TEST_DIR/`.
214
235
 
@@ -219,9 +240,9 @@ Execute test data creation commands from the scenario, writing files inside `$TE
219
240
  If `FILTERED_CASES` is set, execute only matching TCs. Otherwise execute all.
220
241
 
221
242
  For each TC (TC-NNN):
222
- 1. **Check filter** skip if not in `FILTERED_CASES`
243
+ 1. **Check filter** -- skip if not in `FILTERED_CASES`
223
244
  2. **Read** the runner file (`TC-NNN-*.runner.md`)
224
- 3. **Execute** runner steps, save artifacts to `results/tc/{NN}/`
245
+ 3. **Execute** runner steps and create only final outcome artifacts under `results/tc/{NN}/`
225
246
  4. **Verify** against paired `.verify.md` expectations
226
247
  5. **Record** status (Pass/Fail) with evidence
227
248
 
@@ -232,6 +253,7 @@ Track friction points during execution for the experience report.
232
253
  Write three report files to the reports directory.
233
254
 
234
255
  **Report path setup:**
256
+
235
257
  ```bash
236
258
  REPORT_DIR="${PROVIDED_REPORT_DIR:-${TEST_DIR}-reports}"
237
259
  mkdir -p "$REPORT_DIR"
@@ -302,6 +324,7 @@ Sandbox directories in `.ace-local/test-e2e/` are gitignored.
302
324
  Summarize execution in the response. Reports are persisted to disk.
303
325
 
304
326
  **Single test:**
327
+
305
328
  ```markdown
306
329
  ## E2E Test Execution Report
307
330
  **Test ID:** {test-id} | **Package:** {package} | **Status:** {PASS/FAIL}
@@ -316,6 +339,7 @@ Reports: `.ace-local/test-e2e/{timestamp}-{short-pkg}-{short-id}-reports/`
316
339
  ### 10. Update Test Scenario
317
340
 
318
341
  If all tests pass, update `scenario.yml`:
342
+
319
343
  ```yaml
320
344
  last-verified: {today's date}
321
345
  verified-by: claude-{model}
@@ -352,4 +376,4 @@ ace-test-e2e ace-lint --exclude-tags deep
 
  # All tests in project root
  ace-test-e2e
- ```
+ ```
@@ -132,7 +132,7 @@ else
  fi
 
  # Check 3: Project root markers should NOT exist
- if [ -f "CLAUDE.md" ] || [ -f "Gemfile" ] || [ -d ".ace-taskflow" ]; then
+ if [ -f "CLAUDE.md" ] || [ -f "Gemfile" ] || [ -d ".ace-task" ]; then
  echo "FAIL: Main project markers found - NOT an isolated repo!"
  echo " ACTION: STOP - You are in the main repository."
  else
@@ -321,7 +321,7 @@ Add setup directives to `scenario.yml`:
  # scenario.yml
  setup:
  - git-init
- - run: "cp $PROJECT_ROOT_PATH/mise.toml mise.toml && mise trust mise.toml"
+ - run: "cp ${ACE_E2E_SOURCE_ROOT:-$PROJECT_ROOT_PATH}/mise.toml mise.toml && mise trust mise.toml"
  - copy-fixtures
  - agent-env:
  PROJECT_ROOT_PATH: "."
@@ -405,7 +405,7 @@ else
  fi
 
  # Check 3: Project markers
- if [ -f "CLAUDE.md" ] || [ -f "Gemfile" ] || [ -d ".ace-taskflow" ]; then
+ if [ -f "CLAUDE.md" ] || [ -f "Gemfile" ] || [ -d ".ace-task" ]; then
  echo "FAIL: Main project markers found!"
  exit 1
  else
@@ -458,4 +458,4 @@ ace-test-e2e-sh "$REPO_DIR" git status
  ## See Also
 
  - [E2E Testing Guide](guide://e2e-testing)
- - [Test Suite Health](guide://test-suite-health)
+ - [Test Suite Health](guide://test-suite-health)
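The two shell idioms touched by the hunks of this file, the `${VAR:-fallback}` expansion and the project-marker check, can be sketched as below. `resolve_source_root` and `is_isolated_dir` are hypothetical helper names introduced for illustration; only the variable names and marker paths come from the diff.

```bash
#!/usr/bin/env bash
set -u

# Hypothetical helper: prefer ACE_E2E_SOURCE_ROOT, fall back to PROJECT_ROOT_PATH.
# The ":-" form also falls back when the first variable is set but empty.
resolve_source_root() {
  printf '%s\n' "${ACE_E2E_SOURCE_ROOT:-${PROJECT_ROOT_PATH:-}}"
}

# Hypothetical helper: succeeds only when the directory holds none of the
# main-project markers named in the diff (CLAUDE.md, Gemfile, .ace-task).
is_isolated_dir() {
  local dir="$1"
  if [ -f "$dir/CLAUDE.md" ] || [ -f "$dir/Gemfile" ] || [ -d "$dir/.ace-task" ]; then
    return 1
  fi
  return 0
}

# Usage: a throwaway directory stands in for the sandbox.
sandbox="$(mktemp -d)"
if is_isolated_dir "$sandbox"; then
  echo "PASS: no main project markers in $sandbox"
else
  echo "FAIL: main project markers found in $sandbox"
fi
rm -rf "$sandbox"
```

Taking the directory as an argument, rather than checking the current working directory as the workflow snippet does, makes the check testable without changing directories.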
@@ -35,7 +35,7 @@ module Ace
  #
  # Resolves role: references to their concrete provider before checking.
  #
- # @param provider_string [String] Provider:model string (e.g., "claude:sonnet", "role:e2e-executor")
+ # @param provider_string [String] Provider:model string (e.g., "claude:sonnet", "role:e2e-runner")
  # @return [Boolean]
  def cli_provider?(provider_string)
  resolved = resolve_provider_name(provider_string)
@@ -44,9 +44,9 @@ module Ace
 
  def build_execution_prompt(command:, tc_mode:)
  return_contract = if tc_mode
- "- **Test ID**: ...\n- **TC ID**: ...\n- **Status**: pass | fail\n- **Report Paths**: ...\n- **Issues**: ..."
+ "- **Test ID**: ...\n- **TC ID**: ...\n- **Status**: pass | fail\n- **Report Paths**: ...\n- **Observations**: ...\n- **Issues**: ... (optional legacy alias)"
  else
- "- **Test ID**: ...\n- **Status**: pass | fail | partial\n- **Passed**: ...\n- **Failed**: ...\n- **Total**: ...\n- **Report Paths**: ...\n- **Issues**: ..."
+ "- **Test ID**: ...\n- **Status**: pass | fail | partial\n- **Passed**: ...\n- **Failed**: ...\n- **Total**: ...\n- **Report Paths**: ...\n- **Observations**: ...\n- **Issues**: ... (optional legacy alias)"
  end
 
  <<~PROMPT.strip
@@ -55,8 +55,9 @@ module Ace
 
  Execution requirements:
  - Do not run `/ace-...` inside a shell command.
- - If slash commands are unavailable, stop and report that limitation in `Issues`.
+ - If slash commands are unavailable, stop and report that limitation in `Observations`.
  - Write reports under `.ace-local/test-e2e/*-reports/`.
+ - `Observations` is required and must be a concise factual summary of actions, outcomes, and blockers without verdict language.
  - Return only this structured summary:
  #{return_contract}
  PROMPT
@@ -122,6 +123,7 @@ module Ace
 
  Verification requirements:
  - Inspect sandbox artifacts and scenario files directly.
+ - Judge from sandbox state first, then runner observations, then raw debug captures only when needed.
  - Evaluate each test case using `TC-*.verify.md` criteria when present.
  - Classify each failed test case with one category:
  `test-spec-error`, `tool-bug`, `runner-error`, or `infrastructure-error`.
@@ -145,7 +147,7 @@ module Ace
 
  # Resolve the bare provider name from a provider string.
  # For role: references, resolves via ProviderModelParser to find the
- # concrete provider (e.g. "role:e2e-executor" → "claude").
+ # concrete provider (e.g. "role:e2e-runner" → "claude").
  def resolve_provider_name(provider_string)
  name = self.class.provider_name(provider_string)
  return name unless name == "role"