ace-test-runner-e2e 0.29.8 → 0.38.11

Files changed (49)
  1. checksums.yaml +4 -4
  2. data/.ace-defaults/e2e-runner/config.yml +14 -2
  3. data/CHANGELOG.md +178 -0
  4. data/README.md +2 -2
  5. data/exe/ace-test-e2e-sh +9 -4
  6. data/handbook/guides/e2e-testing.g.md +43 -9
  7. data/handbook/guides/scenario-yml-reference.g.md +16 -8
  8. data/handbook/guides/tc-authoring.g.md +12 -5
  9. data/handbook/skills/as-e2e-fix/SKILL.md +2 -2
  10. data/handbook/skills/as-e2e-review/SKILL.md +2 -2
  11. data/handbook/templates/ace-taskflow-fixture.template.md +17 -17
  12. data/handbook/templates/agent-experience-report.template.md +3 -2
  13. data/handbook/templates/scenario.yml.template.yml +7 -2
  14. data/handbook/templates/tc-file.template.md +14 -4
  15. data/handbook/workflow-instructions/e2e/analyze-failures.wf.md +53 -6
  16. data/handbook/workflow-instructions/e2e/create.wf.md +118 -25
  17. data/handbook/workflow-instructions/e2e/execute.wf.md +11 -7
  18. data/handbook/workflow-instructions/e2e/fix.wf.md +65 -15
  19. data/handbook/workflow-instructions/e2e/plan-changes.wf.md +17 -1
  20. data/handbook/workflow-instructions/e2e/review.wf.md +36 -25
  21. data/handbook/workflow-instructions/e2e/rewrite.wf.md +15 -8
  22. data/handbook/workflow-instructions/e2e/run.wf.md +50 -26
  23. data/handbook/workflow-instructions/e2e/setup-sandbox.wf.md +4 -4
  24. data/lib/ace/test/end_to_end_runner/atoms/skill_prompt_builder.rb +7 -5
  25. data/lib/ace/test/end_to_end_runner/atoms/skill_result_parser.rb +73 -7
  26. data/lib/ace/test/end_to_end_runner/cli/commands/run_test.rb +21 -8
  27. data/lib/ace/test/end_to_end_runner/models/test_case.rb +8 -2
  28. data/lib/ace/test/end_to_end_runner/models/test_result.rb +9 -3
  29. data/lib/ace/test/end_to_end_runner/models/test_scenario.rb +4 -2
  30. data/lib/ace/test/end_to_end_runner/molecules/affected_detector.rb +7 -2
  31. data/lib/ace/test/end_to_end_runner/molecules/bwrap_sandbox_backend.rb +271 -0
  32. data/lib/ace/test/end_to_end_runner/molecules/config_loader.rb +28 -1
  33. data/lib/ace/test/end_to_end_runner/molecules/integration_runner.rb +122 -0
  34. data/lib/ace/test/end_to_end_runner/molecules/pipeline_executor.rb +157 -16
  35. data/lib/ace/test/end_to_end_runner/molecules/pipeline_prompt_bundler.rb +121 -8
  36. data/lib/ace/test/end_to_end_runner/molecules/pipeline_report_generator.rb +91 -19
  37. data/lib/ace/test/end_to_end_runner/molecules/pipeline_sandbox_builder.rb +119 -18
  38. data/lib/ace/test/end_to_end_runner/molecules/report_writer.rb +13 -12
  39. data/lib/ace/test/end_to_end_runner/molecules/sandbox_runtime_builder.rb +282 -0
  40. data/lib/ace/test/end_to_end_runner/molecules/scenario_loader.rb +85 -5
  41. data/lib/ace/test/end_to_end_runner/molecules/setup_executor.rb +98 -16
  42. data/lib/ace/test/end_to_end_runner/molecules/suite_report_writer.rb +241 -97
  43. data/lib/ace/test/end_to_end_runner/molecules/test_discoverer.rb +38 -13
  44. data/lib/ace/test/end_to_end_runner/molecules/test_executor.rb +27 -5
  45. data/lib/ace/test/end_to_end_runner/organisms/suite_orchestrator.rb +73 -15
  46. data/lib/ace/test/end_to_end_runner/organisms/test_orchestrator.rb +120 -19
  47. data/lib/ace/test/end_to_end_runner/version.rb +1 -1
  48. data/lib/ace/test/end_to_end_runner.rb +2 -0
  49. metadata +19 -2
@@ -1,4 +1,12 @@
1
1
  ---
2
+ name: e2e-create
3
+ description: Create a new E2E test scenario from template
4
+ allowed-tools:
5
+ - Bash(ace-bundle:*)
6
+ - Read
7
+ - Write
8
+ - Glob
9
+ - Grep
2
10
  doc-type: workflow
3
11
  title: Create E2E Test Workflow
4
12
  purpose: Create a new E2E test scenario from template
@@ -23,35 +31,48 @@ This workflow guides an agent through creating a new E2E test scenario.
23
31
  - Scenario ID format: `TS-<PACKAGE_SHORT>-<NNN>[-slug]`
24
32
  - Standalone files: `TC-*.runner.md` and `TC-*.verify.md`
25
33
  - TC artifact layout: `results/tc/{NN}/`
34
+ - Runner observations are harness-managed report data, not sandbox helper files
26
35
  - Summary counters: `tcs-passed`, `tcs-failed`, `tcs-total`, `failed[].tc`
27
36
  - CLI split reminder:
37
+
28
38
  - `ace-test-e2e` for single-package execution
29
39
  - `ace-test-e2e-suite` for suite-level execution
30
40
 
31
41
  ## Authoring Contract
32
42
 
33
43
  - Runner files (`runner.yml.md`, `TC-*.runner.md`) are execution-only.
44
+ - Goal-style TCs must prove two things:
45
+ - the tool works
46
+ - a user can do the job from the public surface (`README`, usage docs, `--help`, and the CLI itself) without hidden recipes or workarounds
34
47
  - Verifier files (`verifier.yml.md`, `TC-*.verify.md`) are verdict-only with impact-first evidence order:
48
+
35
49
  1. sandbox/project state impact
36
- 2. explicit artifacts
37
- 3. debug captures as fallback
50
+ 2. runner observations
51
+ 3. explicit product outcomes
52
+ 4. debug captures as fallback
53
+
38
54
  - Setup belongs to `scenario.yml` `setup:` and fixtures; do not duplicate setup in runner TC instructions.
55
+ - Keep `results/tc/{NN}/` for real outcome artifacts only; do not ask the runner to write helper YAML, path files, command files, reflections, or verifier-facing manifests there.
56
+ - Do not encode hidden command recipes, fallback detours, or workaround sequences in runner TC files. If the job cannot be done from the public surface, treat that as a product/docs/help gap or remove/narrow the TC.
39
57
 
40
58
  ## Workflow Steps
41
59
 
42
60
  ### 1. Validate Inputs
43
61
 
44
62
  **Check package exists:**
63
+
45
64
  ```bash
46
65
  test -d "{PACKAGE}" && echo "Package exists" || echo "Package not found"
47
66
  ```
48
67
 
49
68
  If package doesn't exist, list available packages:
69
+
50
70
  ```bash
51
71
  ls -d */ | grep -E "^ace-" | sed 's/\/$//'
52
72
  ```
53
73
 
54
74
  **Normalize area code:**
75
+
55
76
  - Convert to uppercase (e.g., `lint` -> `LINT`)
56
77
  - Verify it's a valid area name (2-10 alphanumeric characters)
57
78
 
@@ -66,6 +87,7 @@ find {PACKAGE}/test/e2e -maxdepth 1 -type d -name "TS-{AREA}-*" 2>/dev/null | \
66
87
  ```
67
88
 
68
89
  Sort and take the highest number:
90
+
69
91
  - If no existing tests: use `001`
70
92
  - Otherwise: increment the highest number by 1
71
93
  - Format as three digits (e.g., `001`, `002`, `015`)
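The area-code normalization and numbering rules above can be sketched in bash as follows. This is a minimal illustration, not code from the package: `normalize_area` and `next_scenario_number` are hypothetical helper names, and the sketch assumes scenario directories follow the `TS-<AREA>-<NNN>[-slug]` layout under `{PACKAGE}/test/e2e`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of ace-test-runner-e2e.

# Uppercase the area code and accept it only if it is 2-10 alphanumeric characters.
normalize_area() {
  local area
  area=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  if [[ "$area" =~ ^[A-Z0-9]{2,10}$ ]]; then
    printf '%s\n' "$area"
  else
    return 1
  fi
}

# Print the next three-digit scenario number for an area in a package.
next_scenario_number() {
  local package=$1 area=$2 highest=0 dir name num
  for dir in "$package"/test/e2e/TS-"$area"-*/; do
    [ -d "$dir" ] || continue          # unmatched glob stays literal; skip it
    name=$(basename "$dir")            # e.g. TS-LINT-003-config-file-validation
    num=${name#TS-"$area"-}            # strip prefix -> 003-config-file-validation
    num=${num%%-*}                     # keep the number -> 003
    [[ "$num" =~ ^[0-9]+$ ]] || continue
    # 10# forces base 10 so values like 008 are not parsed as octal
    if (( 10#$num > highest )); then highest=$((10#$num)); fi
  done
  printf '%03d\n' $((highest + 1))
}
```

With no existing scenarios the second helper prints `001`; otherwise it prints the highest existing number plus one, zero-padded to three digits.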
@@ -85,12 +107,14 @@ mkdir -p {PACKAGE}/test/e2e
85
107
  Create a kebab-case slug:
86
108
 
87
109
  **If --context provided:**
110
+
88
111
  - Extract key words from the context description
89
112
  - Convert to lowercase
90
113
  - Replace spaces with hyphens
91
114
  - Limit to 5-6 words
92
115
 
93
116
  **If no context:**
117
+
94
118
  - Use a placeholder: `new-test-scenario`
95
119
 
96
120
  Example: "Test config file validation" -> `config-file-validation`
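The mechanical part of the slug rules (lowercase, hyphens, word limit) can be sketched as below. `slugify` is a hypothetical helper name, and the six-word default is an assumption taken from the "5-6 words" limit. Choosing which key words to keep (for instance dropping a leading "Test") is a judgment step that this sketch does not attempt.

```bash
# Hypothetical helper for illustration; not part of ace-test-runner-e2e.
# Usage: slugify "<description>" [max_words]
slugify() {
  local max_words=${2:-6}
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' ' ' \
    | awk -v max="$max_words" '{
        n = (NF < max) ? NF : max        # cap the number of words
        out = $1
        for (i = 2; i <= n; i++) out = out "-" $i
        print out
      }'
}

# Fall back to the placeholder when no context is given.
context="Config file validation"
slug=$(slugify "$context")
echo "${slug:-new-test-scenario}"
```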
@@ -100,11 +124,13 @@ The slug is the directory name suffix: `TS-LINT-003-config-file-validation/`
100
124
  ### 5. Load Template
101
125
 
102
126
  Load the test template:
127
+
103
128
  ```bash
104
129
  ace-bundle tmpl://test-e2e
105
130
  ```
106
131
 
107
132
  Or read directly:
133
+
108
134
  ```
109
135
  ace-test-runner-e2e/handbook/templates/test-e2e.template.md
110
136
  ```
@@ -123,6 +149,7 @@ Replace template placeholders with actual values:
123
149
  | `{area-name}` | Area code (lowercase) |
124
150
 
125
151
  Initial values for optional fields:
152
+
126
153
  - `priority: medium`
127
154
  - `duration: ~10min`
128
155
  - `automation-candidate: false`
@@ -138,9 +165,10 @@ Initial values for optional fields:
138
165
  Before generating test cases, verify the proposed test has genuine E2E value.
139
166
 
140
167
  **Check unit test coverage:**
168
+
141
169
  ```bash
142
170
  # Search for existing unit tests covering this area
143
- find {PACKAGE}/test/atoms {PACKAGE}/test/molecules {PACKAGE}/test/organisms \
171
+ find {PACKAGE}/test/fast {PACKAGE}/test/feat \
144
172
  -name "*_test.rb" 2>/dev/null | head -20
145
173
  ```
146
174
 
@@ -154,47 +182,91 @@ For each proposed TC, answer: **"Does this require the full CLI binary + real ex
154
182
  - If **PARTIAL**: create the TC but scope it to only the E2E-exclusive aspects
155
183
 
156
184
  **Example decisions:**
157
- - "Test that invalid YAML config produces error" — check if `atoms/config_parser_test.rb` already asserts this. If so, **skip** (unit test covers it). If unit test checks parsing but not the full CLI exit code path, **create** a TC scoped to just the exit code.
158
- - "Test that StandardRB subprocess executes and returns results" unit tests stub the subprocess. **Create** this as E2E because it requires the real tool.
185
+
186
+ - "Test that invalid YAML config produces error" -- check if `atoms/config_parser_test.rb` already asserts this. If so, **skip** (unit test covers it). If unit test checks parsing but not the full CLI exit code path, **create** a TC scoped to just the exit code.
187
+ - "Test that StandardRB subprocess executes and returns results" -- unit tests stub the subprocess. **Create** this as E2E because it requires the real tool.
159
188
 
160
189
  If all proposed TCs fail the gate, report to the user:
190
+
161
191
  ```
162
192
  All proposed behaviors are already covered by unit tests in {PACKAGE}/test/.
163
193
  No E2E test needed. Consider adding unit tests instead if coverage gaps exist.
164
194
  ```
165
195
 
166
- ### 7a. Evidence-Gate Review Before Writing Files
196
+ ### 7a. Public-Surface Gate
197
+
198
+ Before generating or keeping a goal-style TC, answer:
199
+ **"Can a normal user complete this job from the package's public surface, without hidden recipes or workarounds?"**
200
+
201
+ Public surface means:
202
+ - package README / usage docs
203
+ - `--help`
204
+ - declared fixtures and `scenario.yml` setup
205
+ - the tool under test itself
206
+
207
+ Reject or narrow the TC if it depends on:
208
+ - step-by-step runner procedures a user would not infer from docs/help
209
+ - workaround branches to compensate for CLI/docs/help gaps
210
+ - direct supporting-tool probes as the primary oracle for an ACE CLI scenario
211
+ - internal-state checks that the public surface does not expose and that do not matter to the user job
212
+
213
+ ### 7b. Evidence-Gate Review Before Writing Files
167
214
 
168
215
  Before finalizing the test plan, block weak coverage patterns:
216
+
169
217
  - **Existence-only TC**:
218
+
170
219
  - only checks directory/file existence
171
220
  - no command output/content assertion
172
221
  - missing `*.exit` capture for the executed command
222
+
173
223
  - **Duplicate-invocation TC**:
224
+
174
225
  - same command invocation, same purpose, split across multiple TCs
175
226
 
227
+ - **Helper-artifact-driven TC**:
228
+
229
+ - runner is instructed to create YAML/TXT/MD helper files in `results/tc/{NN}/`
230
+ - verifier depends on those helper files instead of final sandbox state or real product output
231
+
232
+ - **Hidden-recipe-driven TC**:
233
+
234
+ - the runner must follow a command sequence not discoverable from docs/usage/`--help`
235
+ - the TC succeeds only because the scenario teaches an internal or non-obvious workaround
236
+
237
+ - **Workaround-driven TC**:
238
+
239
+ - the runner is told how to bypass a docs/help/CLI gap instead of surfacing it
240
+ - the verifier would pass a scenario that a normal user could not complete cleanly
241
+
176
242
  | TC ID | Decision (KEEP/ADD/SKIP) | Evidence Strength | E2E-only reason | Unit tests reviewed |
177
243
  |-------|---------------------------|------------------|-----------------|--------------------|
178
244
  | {tc-id} | {decision} | `command-output` | {why this needs real CLI/tools/fs} | {path1,path2} |
179
245
 
180
246
  Rules:
247
+
181
248
  - `existence-only` is never valid for KEEP/ADD. Use it only for SKIP rows with explicit unit-test replacement.
249
+ - `helper-artifact-driven` is never valid for KEEP/ADD when final sandbox state could prove the goal directly.
250
+ - `hidden-recipe-driven` and `workaround-driven` are never valid for KEEP/ADD.
182
251
  - `SKIP` rows must include replacement unit-test evidence.
183
- - Non-skipped rows must include command-level artifacts (`stdout`, `stderr`, `exit`, and/or explicit proof files).
252
+ - Non-skipped rows must identify the primary oracle for the TC: final sandbox state, real product output, or debug fallback.
253
+ - Non-skipped rows must state why the job is achievable from the public surface without hidden recipes.
184
254
  - At least one `unit tests reviewed` path is required for every row.
185
255
  - The scenario-level `unit-coverage-reviewed` field must include the union of all referenced unit test files.
186
256
 
187
- ### 7b. E2E Decision Record (Required)
257
+ ### 7c. E2E Decision Record (Required)
188
258
 
189
259
  Before writing files, produce a decision record table for every candidate TC:
190
260
 
191
- | TC ID | Decision (KEEP/ADD/SKIP) | E2E-only reason | Unit tests reviewed |
192
- |-------|---------------------------|-----------------|---------------------|
193
- | {tc-id} | {decision} | {why this needs real CLI/tools/fs} | {path1,path2} |
261
+ | TC ID | Decision (KEEP/ADD/SKIP) | E2E-only reason | Public-surface path | Unit tests reviewed |
262
+ |-------|---------------------------|-----------------|---------------------|---------------------|
263
+ | {tc-id} | {decision} | {why this needs real CLI/tools/fs} | {docs/help/CLI path or "not valid"} | {path1,path2} |
194
264
 
195
265
  Rules:
266
+
196
267
  - No TC may be created without a row in this table.
197
268
  - If decision is `SKIP`, include the unit-test evidence that replaces it.
269
+ - If the public-surface path is missing or workaround-driven, the TC must be `SKIP` or explicitly planned as a product/docs/help improvement before creation.
198
270
  - At least one `unit tests reviewed` path is required for each row.
199
271
  - The scenario-level `unit-coverage-reviewed` field must include the union of all referenced unit test files.
200
272
 
@@ -203,12 +275,13 @@ Rules:
203
275
  If a context description was provided, enhance the test with:
204
276
 
205
277
  **Research the package:**
206
- 1. **Run unit tests first** (`ace-test` in the package) they are the ground truth for implemented behavior
278
+ 1. **Run unit tests first** (`ace-test` in the package) -- they are the ground truth for implemented behavior
207
279
  2. Examine the relevant code in `{PACKAGE}/lib/`
208
280
  3. Check existing unit tests for expected behavior patterns
209
281
  4. Understand the feature being tested
210
282
  5. **Run the tool** to observe actual behavior, output format, file paths, and exit codes
211
- 6. **Verify config/input formats** by reading the actual parsing code never assume formats from design specs or task descriptions
283
+ 6. **Verify config/input formats** by reading the actual parsing code -- never assume formats from design specs or task descriptions
284
+ 7. **Compare with the public surface** -- verify the intended user path is actually supported by docs/help, and do not compensate for gaps with hidden runner instructions
212
285
 
213
286
  **Generate test content:**
214
287
  1. Write a clear objective based on the context
@@ -220,16 +293,20 @@ If a context description was provided, enhance the test with:
220
293
  #### Test Case Generation Rules
221
294
 
222
295
  **MUST (required for all E2E tests):**
223
- - **Verify the feature is implemented** before writing the test — read the actual implementation code, not just task specs or design documents
224
- - **Verify config/input formats** by reading the parsing code never assume formats from BDD specs, task descriptions, or documentation
296
+
297
+ - **Verify the feature is implemented** before writing the test -- read the actual implementation code, not just task specs or design documents
298
+ - **Verify config/input formats** by reading the parsing code -- never assume formats from BDD specs, task descriptions, or documentation
225
299
  - Include an error/negative TC only when it validates E2E-exclusive behavior (real CLI parser/runtime/tooling/filesystem) or when unit coverage has a documented gap
226
- - Verify actual file paths by running the tool first never hardcode paths from documentation or assumptions
227
- - Use explicit `&& echo "PASS" || echo "FAIL"` patterns for every verification step
300
+ - Verify actual file paths by running the tool first -- never hardcode paths from documentation or assumptions
301
+ - Write runner goals as user outcomes, not “create a report” chores for the verifier
228
302
  - Check specific exit codes for error commands (not just "non-zero")
229
- - Add at least one output-content assertion for each command being verified
303
+ - Make final sandbox state or real product output the primary oracle whenever possible
304
+ - Do not require runner-authored helper files under `results/tc/{NN}/`
305
+ - Add at least one behavioral/content assertion when CLI output itself is part of the outcome being tested
230
306
 
231
307
  **SHOULD (strongly recommended):**
232
- - Test the real user journey — structure TCs as a sequential workflow, not isolated commands
308
+
309
+ - Test the real user journey -- structure TCs as a sequential workflow, not isolated commands
233
310
  - Verify exit codes for all commands, not just error cases
234
311
  - Include negative assertions (files/directories that should NOT exist)
235
312
  - Capture and retain command output for all assertions (`stdout`, `stderr`, and `*.exit`)
@@ -237,17 +314,18 @@ If a context description was provided, enhance the test with:
237
314
  - Verify that status values match actual implementation (e.g., `done` vs `completed`)
238
315
 
239
316
  **COST-AWARE (reduce LLM invocations):**
240
- - Consolidate assertions that share the same CLI invocation into a single TC. For example, after running `ace-lint file.rb`, check exit code, report.json structure, and ok.md existence in ONE TC — not three.
317
+
318
+ - Consolidate assertions that share the same CLI invocation into a single TC. For example, after running `ace-lint file.rb`, check exit code, report.json structure, and ok.md existence in ONE TC -- not three.
241
319
  - Target 2-5 TCs per scenario. More than 5 suggests the scenario is too broad; split into focused scenarios. Fewer than 2 suggests merging with a related scenario.
242
320
  - Never create a TC for a single assertion when that assertion could be appended to an existing TC that runs the same command.
243
321
 
244
322
  #### Recommended TC Ordering
245
323
 
246
- 1. **Error paths first** wrong args, missing files, no prior state (run from clean state)
247
- 2. **Happy path start** create/init with correct args, verify output
248
- 3. **Structure verification** check actual on-disk file structure with negative assertions
249
- 4. **Lifecycle operations** status, advance, fail, retry in workflow order
250
- 5. **End state** verify completion message, all steps terminal
324
+ 1. **Error paths first** -- wrong args, missing files, no prior state (run from clean state)
325
+ 2. **Happy path start** -- create/init with correct args, verify output
326
+ 3. **Structure verification** -- check actual on-disk file structure with negative assertions
327
+ 4. **Lifecycle operations** -- status, advance, fail, retry in workflow order
328
+ 5. **End state** -- verify completion message, all steps terminal
251
329
 
252
330
  This ordering ensures error TCs run before any state is created (clean environment), and happy-path TCs build on each other sequentially.
253
331
 
@@ -258,6 +336,7 @@ See: **e2e-testing.g.md § "Avoiding False Positive Tests"** for the full list o
258
336
  **E2E tests MUST test through the CLI interface, not library imports.**
259
337
 
260
338
  **Valid approach:**
339
+
261
340
  ```bash
262
341
  OUTPUT=$(ace-review --preset code --subject "diff:HEAD~1" --auto-execute 2>&1)
263
342
  EXIT_CODE=$?
@@ -265,6 +344,7 @@ EXIT_CODE=$?
265
344
  ```
266
345
 
267
346
  **Invalid approach (this is integration/unit testing, not E2E):**
347
+
268
348
  ```bash
269
349
  bundle exec ruby -e '
270
350
  require_relative "lib/ace/review"
@@ -273,6 +353,7 @@ bundle exec ruby -e '
273
353
  ```
274
354
 
275
355
  **For execution tests (LLM, API calls):**
356
+
276
357
  - Use `--auto-execute` to make real API calls
277
358
  - Using only `--dry-run` cannot verify actual execution behavior
278
359
  - Keep costs minimal: cheap models, tiny prompts, small diffs
@@ -280,22 +361,26 @@ bundle exec ruby -e '
280
361
  #### Common Anti-Patterns to Avoid
281
362
 
282
363
  **Writing tests from design specs before implementation:**
364
+
283
365
  - Task descriptions and BDD specs often describe *intended* behavior with *proposed* config formats
284
366
  - The actual implementation may use different formats, different commands, or different workflows
285
367
  - Example: A spec might describe `jobs:` with explicit `number:` and `parent:` fields, but implementation uses `steps:` with auto-generated numbers and dynamic hierarchy via `add --after --child`
286
368
  - **Fix:** Always read the actual implementation code (especially config parsing) before writing test data
287
369
 
288
370
  **Assuming static vs dynamic behavior:**
371
+
289
372
  - Tests may assume features work at config-time (static) when they actually work at runtime (dynamic)
290
373
  - Example: Assuming hierarchy is defined in config when it's actually built dynamically via commands
291
374
  - **Fix:** Trace the actual code path for the feature being tested
292
375
 
293
376
  **Splitting one command into many redundant TCs:**
377
+
294
378
  - Multiple TCs each validate one assertion after the same CLI invocation, creating overlap with unit tests and increasing run cost
295
379
  - Example: TC-A checks exit code, TC-B checks report file, TC-C checks summary text for the same command run
296
380
  - **Fix:** Consolidate those assertions into one TC and move formatter/parser details to unit tests
297
381
 
298
382
  **Example for "Test config file validation":**
383
+
299
384
  ```markdown
300
385
  ## Test Cases
301
386
 
@@ -315,28 +400,33 @@ bundle exec ruby -e '
315
400
  ### 9. Write Test Files
316
401
 
317
402
  Create the scenario directory with separate files:
403
+
318
404
  ```bash
319
405
  mkdir -p {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}
320
406
  ```
321
407
 
322
408
  Write `scenario.yml` (metadata and setup):
409
+
323
410
  ```
324
411
  {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/scenario.yml
325
412
  ```
326
413
 
327
414
  Write scenario pair configs:
415
+
328
416
  ```
329
417
  {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/runner.yml.md
330
418
  {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/verifier.yml.md
331
419
  ```
332
420
 
333
421
  Write individual TC runner/verifier files for each test case:
422
+
334
423
  ```
335
424
  {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/TC-001-{tc-slug}.runner.md
336
425
  {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/TC-001-{tc-slug}.verify.md
337
426
  ```
338
427
 
339
428
  Optionally create a fixtures directory if test data is needed:
429
+
340
430
  ```bash
341
431
  mkdir -p {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/fixtures
342
432
  ```
@@ -373,6 +463,7 @@ Output a summary:
373
463
  ## Example Invocations
374
464
 
375
465
  **Create a test:**
466
+
376
467
  ```bash
377
468
  ace-bundle wfi://e2e/create
378
469
  ```
@@ -380,6 +471,7 @@ ace-bundle wfi://e2e/create
380
471
  Creates: `ace-lint/test/e2e/TS-LINT-003-new-test-scenario/` with `scenario.yml` and TC files.
381
472
 
382
473
  **Create a contextual test:**
474
+
383
475
  ```bash
384
476
  ace-bundle wfi://e2e/create
385
477
  ```
@@ -387,6 +479,7 @@ ace-bundle wfi://e2e/create
387
479
  Creates: `ace-lint/test/e2e/TS-LINT-003-config-file-validation/` with `scenario.yml` and TC files for config validation.
388
480
 
389
481
  **Create test for new area:**
482
+
390
483
  ```bash
391
484
  ace-bundle wfi://e2e/create
392
485
  ```
@@ -46,12 +46,15 @@ Tag filtering happens at discovery time (before `SetupExecutor` runs). By the ti
46
46
 
47
47
  ## Execution Contract
48
48
 
49
- - Runner is execution-only: execute declared TC actions and capture evidence.
49
+ - Runner is execution-only: execute declared TC actions, leave only real outcome evidence under `results/tc/{NN}/`, and return final observations through the harness.
50
+ - Runner follows the public user path. Do not turn missing docs/help/CLI affordances into embedded workaround instructions.
50
51
  - Verifier is verification-only: determine PASS/FAIL using impact-first ordering:
51
52
  1. sandbox/project state impact
52
- 2. explicit artifacts
53
- 3. debug captures (`stdout`/`stderr`/exit) as fallback
53
+ 2. runner observations
54
+ 3. explicit artifacts that are true product outcomes
55
+ 4. debug captures (`stdout`/`stderr`/exit) as fallback
54
56
  - Do not interpret setup ownership in runner TC files; setup is owned by `scenario.yml` + fixtures.
57
+ - Treat workaround pressure recorded in runner observations as a gap to fix, not as permission to strengthen the runner script.
55
58
 
56
59
  ## Dual-Agent Verifier
57
60
 
@@ -61,7 +64,7 @@ When `--verify` is passed (or always-on for CLI pipeline runs), execution follow
61
64
  2. **Verifier agent** independently inspects the sandbox and artifacts against `TC-*.verify.md` expectations
62
65
  3. **Report generator** (`PipelineReportGenerator`) produces deterministic summary from verifier output
63
66
 
64
- The verifier has no access to the runner's conversation — it evaluates purely from on-disk evidence. This prevents self-confirmation bias.
67
+ The verifier has no access to the runner's conversation — it evaluates from sandbox evidence plus the structured runner observations persisted by the harness. This prevents self-confirmation bias while still surfacing execution context.
65
68
 
66
69
  ## Subagent Mode
67
70
 
@@ -75,6 +78,7 @@ When invoked as a subagent (via Task tool from orchestrator):
75
78
  - **Failed**: {count}
76
79
  - **Total**: {count}
77
80
  - **Report Paths**: {timestamp}-{short-pkg}-{short-id}.*
81
+ - **Observations**: Brief factual summary or "None"
78
82
  - **Issues**: Brief description or "None"
79
83
  ```
80
84
 
@@ -149,8 +153,8 @@ For each TC (TC-NNN):
149
153
 
150
154
  1. **Check filter** — skip if `FILTERED_CASES` is set and TC not in list
151
155
  2. **Read** the runner file objective
152
- 3. **Execute** runner steps, save artifacts to `results/tc/{NN}/`
153
- 4. **Capture** exit codes, output, error messages
156
+ 3. **Execute** runner steps, save only real outcome artifacts to `results/tc/{NN}/`
157
+ 4. **Return** factual runner observations through the harness
154
158
  5. **Evaluate** against verifier expectations
155
159
  6. **Record** Pass/Fail with per-TC evidence
156
160
 
@@ -250,4 +254,4 @@ Reports: `.ace-local/test-e2e/{timestamp}-{short-pkg}-{short-id}-reports/`
250
254
  | TC fails | Record details, continue remaining TCs, include in report |
251
255
  | Sandbox missing/corrupted | Report error, do NOT recreate, return error summary |
252
256
  | TC filter mismatch | STOP, do not write reports, offer re-run |
253
- | Missing TC pair file | Report error for that TC, skip it, continue others |
257
+ | Missing TC pair file | Report error for that TC, skip it, continue others |
@@ -1,23 +1,33 @@
1
1
  ---
2
+ name: e2e-fix
3
+ description: Diagnose, fix, and rerun failing E2E scenarios with a self-bootstrapping analysis loop.
4
+ allowed-tools:
5
+ - Bash(ace-bundle:*)
6
+ - Bash(ace-test:*)
7
+ - Read
8
+ - Write
9
+ - Edit
10
+ - Skill
2
11
  doc-type: workflow
3
12
  title: Fix E2E Tests Workflow
4
13
  purpose: fix-e2e-tests workflow instruction
5
14
  ace-docs:
6
- last-updated: 2026-03-13
7
- last-checked: 2026-03-21
15
+ last-updated: 2026-04-19
16
+ last-checked: 2026-04-19
8
17
  ---
9
18
 
10
19
  # Fix E2E Tests Workflow
11
20
 
12
21
  ## Goal
13
22
 
14
- Apply targeted fixes for failing E2E scenarios based on an existing E2E failure analysis report.
23
+ Diagnose, fix, and rerun failing E2E scenarios with a single workflow entrypoint.
15
24
 
16
- This workflow is execution-only. Root cause classification is handled by `wfi://e2e/analyze-failures`.
25
+ This workflow owns analysis readiness before any fix is applied. Reuse an existing analysis report when it is complete; otherwise generate or complete it via `wfi://e2e/analyze-failures`, then continue directly into the fix loop.
17
26
 
18
- ## Hard Gate (Required Before Any Fix)
27
+ ## Analysis Readiness Gate
28
+
29
+ Before any fix, ensure an analysis report exists with:
19
30
 
20
- Do not apply any fix until an analysis report exists with:
21
31
  - scenario / TC identifier
22
32
  - category (`code-issue`, `test-issue`, `runner-infrastructure-issue`)
23
33
  - evidence from reports/artifacts
@@ -26,22 +36,29 @@ Do not apply any fix until an analysis report exists with:
26
36
  - primary candidate files
27
37
  - do-not-touch boundaries
28
38
  - rerun scope recommendation
39
+ - `Docs / Help Drift From E2E Failures` section with `Public Surface Checked`, `Drift Found`, and `Update Targets`
40
+
41
+ If analysis is missing or incomplete, generate or refresh it first:
29
42
 
30
- If analysis is missing or incomplete, stop and run:
31
43
  ```bash
32
44
  ace-bundle wfi://e2e/analyze-failures
33
45
  ```
34
46
 
47
+ Then continue this workflow using the resulting `E2E Failure Analysis Report`, `Fix Decisions`, and `Execution Plan Input` as the source of truth. Do not stop merely because analysis had to be generated.
48
+
35
49
  ## Required Input
36
50
 
37
- Use the output section from `e2e/analyze-failures`:
51
+ Use the output section from `e2e/analyze-failures` when present, whether it was provided up front or generated by this workflow:
52
+
38
53
  - `## E2E Failure Analysis Report`
54
+ - `## Docs / Help Drift From E2E Failures`
39
55
  - `## Fix Decisions`
40
56
  - `### Execution Plan Input`
41
57
 
42
58
  ## Autonomy Rule
43
59
 
44
60
  - Do not ask the user to choose fix target, category, or rerun scope.
61
+ - If analysis is missing, run `wfi://e2e/analyze-failures` yourself before fixing.
45
62
  - If analysis is incomplete, auto-complete missing decision fields via local evidence (reports, artifacts, scenario files, implementation), then proceed.
46
63
  - Only stop for hard blockers (missing files/tools/permissions).
47
64
 
@@ -61,27 +78,40 @@ Apply fixes in this order:
61
78
 
62
79
  ## Fix Procedure
63
80
 
64
- 1. Pick the first prioritized item from analysis
81
+ 1. Establish or refresh analysis
82
+
83
+ - Check for a current analysis report that satisfies the Analysis Readiness Gate.
84
+ - If none exists, or if required fields are missing, including the docs/help drift section, run `ace-bundle wfi://e2e/analyze-failures`.
85
+ - Reuse the most recent valid analysis output as the source of truth for fix selection.
86
+ - Treat full-suite/package reruns and targeted scenario reruns as different scopes. Do not label a broader suite failure set as a regression in a previously fixed targeted scenario unless the same scenario fails again on a clean rerun.
87
+
88
+ 2. Pick the first prioritized item from analysis
89
+
65
90
  - Use the selected "First item to fix"
66
91
  - Confirm category, fix target, and rerun scope
67
92
  - Apply the "Chosen fix decision" and primary candidate files directly
68
93
 
69
- 2. Apply category-specific fix
94
+ 3. Apply category-specific fix
70
95
 
71
96
  ### Category: runner-infrastructure-issue
97
+
72
98
  - Fix runner/sandbox/provider/reporting/orchestration behavior
73
99
  - Verify with runner tests when applicable: `ace-test ace-test-runner-e2e`
74
100
 
75
101
  ### Category: code-issue
102
+
76
103
  - Fix package/tool behavior in implementation code
77
104
  - Add/update unit tests if needed
105
+ - When the user job is valid but not achievable from docs/help/public CLI, apply the documented docs/help update target instead of codifying the workaround in the scenario
78
106
 
79
107
  ### Category: test-issue
108
+
80
109
  - Fix scenario definition, runner/verifier criteria, fixtures, or setup steps
81
110
  - Preserve role split: runner is execution-only, verifier is impact-first verdict
82
111
  - Keep implementation unchanged unless analysis is revised
112
+ - Remove hidden recipes, workaround branches, and unsupported internal-detail checks from goal-style TCs
83
113
 
84
- 3. Rerun the selected failing scope after each fix
114
+ 4. Rerun the selected failing scope after each fix
85
115
 
86
116
  After every implemented fix, rerun the analysis-selected failing scope before moving to the next item or recommending release.
87
117
 
@@ -96,25 +126,31 @@ ace-test-e2e {package}
96
126
  ```
97
127
 
98
128
  Rules:
129
+
99
130
  - Scenario rerun is the default after each fix iteration.
100
131
  - Use package rerun only when analysis explicitly selected package scope.
101
132
  - For multiple failing scenarios, rerun each scenario explicitly.
133
+
102
134
  ```text
103
135
  ace-test-e2e ace-assign TS-ASSIGN-001
104
136
  ace-test-e2e ace-assign TS-ASSIGN-002
105
137
  ace-test-e2e ace-bundle TS-BUNDLE-001
106
138
  ```
139
+
107
140
  - Record the rerun command and result in the execution summary for every fix item.
108
141
 
109
- 4. Re-check classification when evidence conflicts
142
+ 5. Re-check classification when evidence conflicts
143
+
110
144
  - If outcome contradicts analysis, return to `e2e/analyze-failures`
111
145
  - Update analysis report and re-select a new autonomous chosen fix decision before continuing
146
+ - If a suite/package report conflicts with a scenario report, the scenario report wins and the aggregate mismatch must be fixed or explicitly tracked before relying on suite-level TC mappings.
147
+
148
+ 6. Iterate until all targeted failures are resolved
112
149
 
113
- 5. Iterate until all targeted failures are resolved
114
150
  - Keep one active scenario/TC at a time
115
151
  - Preserve cost-conscious rerun discipline
116
152
 
117
- 6. Run a final explicit failing-scenario checkpoint before concluding the fix session
153
+ 7. Run a final explicit failing-scenario checkpoint before concluding the fix session
118
154
 
119
155
  After the currently targeted failures are addressed, require one final:
120
156
 
@@ -136,12 +172,25 @@ Use one explicit command per previously failing scenario to confirm no targeted
136
172
  ```markdown
137
173
  ## E2E Fix Execution Summary
138
174
 
175
+ Analysis Source: reused existing analysis | generated via `wfi://e2e/analyze-failures` | refreshed incomplete analysis
176
+
139
177
  | Scenario / TC | Category | Change Applied | Verification Command | Result |
140
178
  |---|---|---|---|---|
141
179
  | ... | ... | ... | ... | pass/fail |
142
180
  ```
143
181
 
182
+ If the analysis reported docs/help drift, include:
183
+
184
+ ```markdown
185
+ ## Docs / Help Updates
186
+
187
+ | Scenario / TC | Public Surface Updated | Why |
188
+ |---|---|---|
189
+ | ... | docs/usage.md, CLI --help | E2E failure showed the valid user job was not discoverable |
190
+ ```
191
+
144
192
  Include one final row for the batch checkpoint:
193
+
145
194
  - Verification Command: one explicit rerun command per remaining failed scenario (`ace-test-e2e {package} {test-id}`)
146
195
  - Result: `pass` or remaining failing scenarios
147
196
  - If failures remain, continue the fix loop instead of treating the session as complete
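
Editorial sketch (not part of the diffed file): the batch checkpoint asks for one explicit rerun command per remaining failed scenario, in the form `ace-test-e2e {package} {test-id}`. The Ruby below shows one way to expand a failure list into those commands. `rerun_commands` is a hypothetical helper, and the package and test IDs are the ones used in the workflow's own example.

```ruby
# Hypothetical helper for illustration; not part of ace-test-runner-e2e.
# Expands [package, test_id] pairs into one explicit command per scenario,
# following the documented form: ace-test-e2e {package} {test-id}
def rerun_commands(failing_scenarios)
  failing_scenarios.map do |package, test_id|
    if package.to_s.empty? || test_id.to_s.empty?
      raise ArgumentError, "package and test id are both required"
    end

    "ace-test-e2e #{package} #{test_id}"
  end
end

# Scenario IDs taken from the example in the workflow text.
remaining = [
  ["ace-assign", "TS-ASSIGN-001"],
  ["ace-assign", "TS-ASSIGN-002"],
  ["ace-bundle", "TS-BUNDLE-001"]
]

# Print the commands so each can be recorded in the execution summary.
puts rerun_commands(remaining)
```

Listing each scenario separately, rather than collapsing to a package-level rerun, follows the rule that scenario rerun is the default and package rerun is used only when the analysis selected package scope.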
@@ -160,7 +209,8 @@ If unresolved:
160
209
 
161
210
  - Fixes are traceable to analyzed failures
162
211
  - Verification scope matches analysis recommendation, including mandatory reruns after each fix
212
+ - Any docs/help drift from analysis is fixed or explicitly carried as an unresolved blocker
163
213
  - Cost-conscious rerun strategy was followed
164
214
  - Final explicit per-scenario rerun checkpoint for all targeted failures was completed before concluding the fix session
165
215
  - No user clarification was required for fix targeting/scope in normal flow
166
- - Targeted failures pass, or blockers are explicitly documented
216
+ - Targeted failures pass, or blockers are explicitly documented