ace-test-runner-e2e 0.29.8 → 0.40.1

Files changed (52)
  1. checksums.yaml +4 -4
  2. data/.ace-defaults/e2e-runner/config.yml +14 -2
  3. data/CHANGELOG.md +233 -0
  4. data/README.md +2 -2
  5. data/exe/ace-test-e2e-sh +9 -4
  6. data/handbook/guides/e2e-testing.g.md +75 -9
  7. data/handbook/guides/scenario-yml-reference.g.md +21 -8
  8. data/handbook/guides/tc-authoring.g.md +23 -5
  9. data/handbook/skills/as-e2e-fix/SKILL.md +2 -2
  10. data/handbook/skills/as-e2e-review/SKILL.md +2 -2
  11. data/handbook/templates/ace-taskflow-fixture.template.md +17 -17
  12. data/handbook/templates/agent-experience-report.template.md +3 -2
  13. data/handbook/templates/scenario.yml.template.yml +7 -2
  14. data/handbook/templates/tc-file.template.md +16 -4
  15. data/handbook/workflow-instructions/e2e/analyze-failures.wf.md +53 -6
  16. data/handbook/workflow-instructions/e2e/create.wf.md +128 -25
  17. data/handbook/workflow-instructions/e2e/execute.wf.md +11 -7
  18. data/handbook/workflow-instructions/e2e/fix.wf.md +84 -15
  19. data/handbook/workflow-instructions/e2e/plan-changes.wf.md +33 -1
  20. data/handbook/workflow-instructions/e2e/review.wf.md +40 -25
  21. data/handbook/workflow-instructions/e2e/rewrite.wf.md +22 -8
  22. data/handbook/workflow-instructions/e2e/run.wf.md +50 -26
  23. data/handbook/workflow-instructions/e2e/setup-sandbox.wf.md +4 -4
  24. data/lib/ace/test/end_to_end_runner/atoms/artifact_contract_validator.rb +138 -0
  25. data/lib/ace/test/end_to_end_runner/atoms/skill_prompt_builder.rb +7 -5
  26. data/lib/ace/test/end_to_end_runner/atoms/skill_result_parser.rb +73 -7
  27. data/lib/ace/test/end_to_end_runner/cli/commands/run_suite.rb +195 -5
  28. data/lib/ace/test/end_to_end_runner/cli/commands/run_test.rb +58 -9
  29. data/lib/ace/test/end_to_end_runner/models/test_case.rb +8 -2
  30. data/lib/ace/test/end_to_end_runner/models/test_result.rb +9 -3
  31. data/lib/ace/test/end_to_end_runner/models/test_scenario.rb +4 -2
  32. data/lib/ace/test/end_to_end_runner/molecules/affected_detector.rb +7 -2
  33. data/lib/ace/test/end_to_end_runner/molecules/artifact_pruner.rb +61 -0
  34. data/lib/ace/test/end_to_end_runner/molecules/bwrap_sandbox_backend.rb +271 -0
  35. data/lib/ace/test/end_to_end_runner/molecules/config_loader.rb +28 -1
  36. data/lib/ace/test/end_to_end_runner/molecules/integration_runner.rb +122 -0
  37. data/lib/ace/test/end_to_end_runner/molecules/pipeline_executor.rb +235 -18
  38. data/lib/ace/test/end_to_end_runner/molecules/pipeline_prompt_bundler.rb +164 -13
  39. data/lib/ace/test/end_to_end_runner/molecules/pipeline_report_generator.rb +91 -19
  40. data/lib/ace/test/end_to_end_runner/molecules/pipeline_sandbox_builder.rb +121 -18
  41. data/lib/ace/test/end_to_end_runner/molecules/report_writer.rb +15 -12
  42. data/lib/ace/test/end_to_end_runner/molecules/sandbox_runtime_builder.rb +374 -0
  43. data/lib/ace/test/end_to_end_runner/molecules/scenario_loader.rb +83 -5
  44. data/lib/ace/test/end_to_end_runner/molecules/setup_executor.rb +121 -16
  45. data/lib/ace/test/end_to_end_runner/molecules/suite_report_writer.rb +422 -97
  46. data/lib/ace/test/end_to_end_runner/molecules/test_discoverer.rb +38 -13
  47. data/lib/ace/test/end_to_end_runner/molecules/test_executor.rb +27 -5
  48. data/lib/ace/test/end_to_end_runner/organisms/suite_orchestrator.rb +98 -18
  49. data/lib/ace/test/end_to_end_runner/organisms/test_orchestrator.rb +159 -19
  50. data/lib/ace/test/end_to_end_runner/version.rb +1 -1
  51. data/lib/ace/test/end_to_end_runner.rb +4 -0
  52. metadata +21 -2
@@ -1,4 +1,12 @@
1
1
  ---
2
+ name: e2e-create
3
+ description: Create a new E2E test scenario from template
4
+ allowed-tools:
5
+ - Bash(ace-bundle:*)
6
+ - Read
7
+ - Write
8
+ - Glob
9
+ - Grep
2
10
  doc-type: workflow
3
11
  title: Create E2E Test Workflow
4
12
  purpose: Create a new E2E test scenario from template
@@ -23,35 +31,54 @@ This workflow guides an agent through creating a new E2E test scenario.
23
31
  - Scenario ID format: `TS-<PACKAGE_SHORT>-<NNN>[-slug]`
24
32
  - Standalone files: `TC-*.runner.md` and `TC-*.verify.md`
25
33
  - TC artifact layout: `results/tc/{NN}/`
34
+ - Runner observations are harness-managed report data, not sandbox helper files
26
35
  - Summary counters: `tcs-passed`, `tcs-failed`, `tcs-total`, `failed[].tc`
27
36
  - CLI split reminder:
37
+
28
38
  - `ace-test-e2e` for single-package execution
29
39
  - `ace-test-e2e-suite` for suite-level execution
30
40
 
31
41
  ## Authoring Contract
32
42
 
33
43
  - Runner files (`runner.yml.md`, `TC-*.runner.md`) are execution-only.
44
+ - Every TC must be authored as one of:
45
+ - **public-surface** — a user job from docs/usage/`--help` and the CLI
46
+ - **retained-contract** — a deterministic integrated regression check with declared supporting evidence
47
+ - Goal-style/public-surface TCs must prove two things:
48
+ - the tool works
49
+ - a user can do the job from the public surface (`README`, usage docs, `--help`, and the CLI itself) without hidden recipes or workarounds
34
50
  - Verifier files (`verifier.yml.md`, `TC-*.verify.md`) are verdict-only with impact-first evidence order:
51
+
35
52
  1. sandbox/project state impact
36
- 2. explicit artifacts
37
- 3. debug captures as fallback
53
+ 2. runner observations
54
+ 3. explicit product outcomes
55
+ 4. debug captures as fallback
56
+
38
57
  - Setup belongs to `scenario.yml` `setup:` and fixtures; do not duplicate setup in runner TC instructions.
58
+ - Keep `results/tc/{NN}/` for declared verifier-dependent evidence only.
59
+ - Declare every verifier-dependent path in the runner or setup. Grouped shorthand such as ``foo.stdout`, `.stderr`, `.exit`` is allowed for exact sibling captures.
60
+ - Do not use wildcard artifact paths.
61
+ - Do not ask the runner to write reflections, verifier-facing manifests, or undeclared helper files there.
62
+ - Do not encode hidden command recipes, fallback detours, or workaround sequences in runner TC files. If the job cannot be done from the public surface, treat that as a product/docs/help gap or remove/narrow the TC.
39
63
 
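The "no wildcard artifact paths" rule in the Authoring Contract above could be checked mechanically before review. The following is a minimal sketch under stated assumptions: `find_wildcard_artifacts` is a hypothetical helper, not something shipped by `ace-test-runner-e2e`, and it assumes TC files follow the `TC-*.md` naming shown in this workflow.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of ace-test-runner-e2e): list TC file lines
# that reference a results/tc/ path containing a literal '*'.
# Exit status 0 means at least one wildcard path was found, 1 means none.
find_wildcard_artifacts() {
  local scenario_dir="$1"
  # -r: recurse, -n: show line numbers, -E: extended regex
  # The pattern matches "results/tc/" followed by non-space characters and a '*'
  grep -rnE --include='TC-*.md' 'results/tc/[^[:space:]]*\*' "$scenario_dir"
}

# Usage (assumed scenario path):
#   if find_wildcard_artifacts "ace-lint/test/e2e/TS-LINT-003-config-file-validation"; then
#     echo "wildcard artifact paths must be replaced with declared paths" >&2
#   fi
```

A text search like this only catches literal `*` characters; it does not decide whether a declared path is actually produced by the runner or setup.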
40
64
  ## Workflow Steps
41
65
 
42
66
  ### 1. Validate Inputs
43
67
 
44
68
  **Check package exists:**
69
+
45
70
  ```bash
46
71
  test -d "{PACKAGE}" && echo "Package exists" || echo "Package not found"
47
72
  ```
48
73
 
49
74
  If package doesn't exist, list available packages:
75
+
50
76
  ```bash
51
77
  ls -d */ | grep -E "^ace-" | sed 's/\/$//'
52
78
  ```
53
79
 
54
80
  **Normalize area code:**
81
+
55
82
  - Convert to uppercase (e.g., `lint` -> `LINT`)
56
83
  - Verify it's a valid area name (2-10 alphanumeric characters)
57
84
 
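The area-code normalization described above (uppercase, then 2-10 alphanumeric characters) can be sketched as a small shell function. `normalize_area` is a hypothetical name chosen for illustration; the workflow itself does not define such a helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of ace-test-runner-e2e): normalize an area code.
normalize_area() {
  local raw="$1" area
  # Uppercase the input, e.g. lint -> LINT
  area=$(printf '%s' "$raw" | tr '[:lower:]' '[:upper:]')
  # Accept only 2-10 alphanumeric characters
  if [[ "$area" =~ ^[A-Z0-9]{2,10}$ ]]; then
    printf '%s\n' "$area"
  else
    echo "invalid area code: $raw" >&2
    return 1
  fi
}

normalize_area "lint"   # prints LINT
```

Returning a non-zero status for invalid input lets a caller stop before any directories are created.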
@@ -66,6 +93,7 @@ find {PACKAGE}/test/e2e -maxdepth 1 -type d -name "TS-{AREA}-*" 2>/dev/null | \
66
93
  ```
67
94
 
68
95
  Sort and take the highest number:
96
+
69
97
  - If no existing tests: use `001`
70
98
  - Otherwise: increment the highest number by 1
71
99
  - Format as three digits (e.g., `001`, `002`, `015`)
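The numbering rule above (start at `001`, otherwise increment the highest existing number, always three digits) could look roughly like this. It is a sketch only: `next_scenario_number` is a hypothetical function, and it assumes scenario directories are named `TS-{AREA}-{NNN}[-slug]` directly under the given directory.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of ace-test-runner-e2e): compute the next
# three-digit scenario number for an area inside an e2e directory.
next_scenario_number() {
  local e2e_dir="$1" area="$2" highest=0 dir name num
  for dir in "$e2e_dir"/TS-"$area"-*/; do
    [ -d "$dir" ] || continue        # unmatched glob stays literal; skip it
    name=$(basename "$dir")
    num=${name#TS-"$area"-}          # drop the "TS-AREA-" prefix
    num=${num%%-*}                   # drop the optional "-slug" suffix
    [[ "$num" =~ ^[0-9]+$ ]] || continue
    # 10# forces base 10 so values such as 008 are not read as octal
    if (( 10#$num > highest )); then
      highest=$((10#$num))
    fi
  done
  printf '%03d\n' $((highest + 1))
}

# Usage (assumed path): next_scenario_number "ace-lint/test/e2e" "LINT"
```

Forcing base 10 matters because shell arithmetic treats a leading zero as an octal marker, which would reject numbers like `008` or `009`.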
@@ -85,12 +113,14 @@ mkdir -p {PACKAGE}/test/e2e
85
113
  Create a kebab-case slug:
86
114
 
87
115
  **If --context provided:**
116
+
88
117
  - Extract key words from the context description
89
118
  - Convert to lowercase
90
119
  - Replace spaces with hyphens
91
120
  - Limit to 5-6 words
92
121
 
93
122
  **If no context:**
123
+
94
124
  - Use a placeholder: `new-test-scenario`
95
125
 
96
126
  Example: "Test config file validation" -> `config-file-validation`
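The mechanical part of slug creation (lowercase, hyphens for spaces, a word limit) can be sketched as below. `slugify` is a hypothetical helper, and it deliberately does not attempt the "extract key words" step, which still needs judgment: in the example above, the leading word "Test" was dropped by hand.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of ace-test-runner-e2e): turn already-selected
# key words into a kebab-case slug limited to a maximum number of words.
slugify() {
  local text="$1" max_words="${2:-6}"
  printf '%s\n' "$text" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//' \
    | cut -d- -f1-"$max_words"
}

slugify "Config file validation"   # prints config-file-validation
```

The second argument defaults to 6 to match the "5-6 words" limit; pass a smaller value for shorter directory names.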
@@ -100,11 +130,13 @@ The slug is the directory name suffix: `TS-LINT-003-config-file-validation/`
100
130
  ### 5. Load Template
101
131
 
102
132
  Load the test template:
133
+
103
134
  ```bash
104
135
  ace-bundle tmpl://test-e2e
105
136
  ```
106
137
 
107
138
  Or read directly:
139
+
108
140
  ```
109
141
  ace-test-runner-e2e/handbook/templates/test-e2e.template.md
110
142
  ```
@@ -123,6 +155,7 @@ Replace template placeholders with actual values:
123
155
  | `{area-name}` | Area code (lowercase) |
124
156
 
125
157
  Initial values for optional fields:
158
+
126
159
  - `priority: medium`
127
160
  - `duration: ~10min`
128
161
  - `automation-candidate: false`
@@ -138,9 +171,10 @@ Initial values for optional fields:
138
171
  Before generating test cases, verify the proposed test has genuine E2E value.
139
172
 
140
173
  **Check unit test coverage:**
174
+
141
175
  ```bash
142
176
  # Search for existing unit tests covering this area
143
- find {PACKAGE}/test/atoms {PACKAGE}/test/molecules {PACKAGE}/test/organisms \
177
+ find {PACKAGE}/test/fast {PACKAGE}/test/feat \
144
178
  -name "*_test.rb" 2>/dev/null | head -20
145
179
  ```
146
180
 
@@ -154,47 +188,95 @@ For each proposed TC, answer: **"Does this require the full CLI binary + real ex
154
188
  - If **PARTIAL**: create the TC but scope it to only the E2E-exclusive aspects
155
189
 
156
190
  **Example decisions:**
157
- - "Test that invalid YAML config produces error" — check if `atoms/config_parser_test.rb` already asserts this. If so, **skip** (unit test covers it). If unit test checks parsing but not the full CLI exit code path, **create** a TC scoped to just the exit code.
158
- - "Test that StandardRB subprocess executes and returns results" unit tests stub the subprocess. **Create** this as E2E because it requires the real tool.
191
+
192
+ - "Test that invalid YAML config produces error" -- check if `atoms/config_parser_test.rb` already asserts this. If so, **skip** (unit test covers it). If unit test checks parsing but not the full CLI exit code path, **create** a TC scoped to just the exit code.
193
+ - "Test that StandardRB subprocess executes and returns results" -- unit tests stub the subprocess. **Create** this as E2E because it requires the real tool.
159
194
 
160
195
  If all proposed TCs fail the gate, report to the user:
196
+
161
197
  ```
162
198
  All proposed behaviors are already covered by unit tests in {PACKAGE}/test/.
163
199
  No E2E test needed. Consider adding unit tests instead if coverage gaps exist.
164
200
  ```
165
201
 
166
- ### 7a. Evidence-Gate Review Before Writing Files
202
+ ### 7a. Public-Surface Gate
203
+
204
+ Before generating or keeping a goal-style TC, answer:
205
+ **"Can a normal user complete this job from the package's public surface, without hidden recipes or workarounds?"**
206
+
207
+ Public surface means:
208
+ - package README / usage docs
209
+ - `--help`
210
+ - declared fixtures and `scenario.yml` setup
211
+ - the tool under test itself
212
+
213
+ Reject or narrow the TC if it depends on:
214
+ - step-by-step runner procedures a user would not infer from docs/help
215
+ - workaround branches to compensate for CLI/docs/help gaps
216
+ - direct supporting-tool probes as the primary oracle for an ACE CLI scenario
217
+ - internal-state checks that the public surface does not expose and that do not matter to the user job
218
+
219
+ ### 7b. Evidence-Gate Review Before Writing Files
167
220
 
168
221
  Before finalizing the test plan, block weak coverage patterns:
222
+
169
223
  - **Existence-only TC**:
224
+
170
225
  - only checks directory/file existence
171
226
  - no command output/content assertion
172
227
  - missing `*.exit` capture for the executed command
228
+
173
229
  - **Duplicate-invocation TC**:
230
+
174
231
  - same command invocation, same purpose, split across multiple TCs
175
232
 
233
+ - **Helper-artifact-driven TC**:
234
+
235
+ - runner is instructed to create YAML/TXT/MD helper files in `results/tc/{NN}/`
236
+ - verifier depends on those helper files instead of final sandbox state or real product output
237
+
238
+ - **Hidden-recipe-driven TC**:
239
+
240
+ - the runner must follow a command sequence not discoverable from docs/usage/`--help`
241
+ - the TC succeeds only because the scenario teaches an internal or non-obvious workaround
242
+
243
+ - **Workaround-driven TC**:
244
+
245
+ - the runner is told how to bypass a docs/help/CLI gap instead of surfacing it
246
+ - the verifier would pass a scenario that a normal user could not complete cleanly
247
+
176
248
  | TC ID | Decision (KEEP/ADD/SKIP) | Evidence Strength | E2E-only reason | Unit tests reviewed |
177
249
  |-------|---------------------------|------------------|-----------------|--------------------|
178
250
  | {tc-id} | {decision} | `command-output` | {why this needs real CLI/tools/fs} | {path1,path2} |
179
251
 
180
252
  Rules:
253
+
181
254
  - `existence-only` is never valid for KEEP/ADD. Use it only for SKIP rows with explicit unit-test replacement.
255
+ - `helper-artifact-driven` is never valid for KEEP/ADD when final sandbox state could prove the goal directly.
256
+ - `hidden-recipe-driven` and `workaround-driven` are never valid for KEEP/ADD.
257
+ - Every verifier-dependent artifact must be declared by runner/setup; verifier-only references are invalid.
258
+ - Wildcard artifact paths are never valid for KEEP/ADD.
182
259
  - `SKIP` rows must include replacement unit-test evidence.
183
- - Non-skipped rows must include command-level artifacts (`stdout`, `stderr`, `exit`, and/or explicit proof files).
260
+ - Non-skipped rows must identify the primary oracle for the TC: final sandbox state, real product output, or debug fallback.
261
+ - Non-skipped rows must state why the job is achievable from the public surface without hidden recipes.
262
+ - Non-skipped rows must identify TC style: `public-surface` or `retained-contract`.
184
263
  - At least one `unit tests reviewed` path is required for every row.
185
264
  - The scenario-level `unit-coverage-reviewed` field must include the union of all referenced unit test files.
186
265
 
187
- ### 7b. E2E Decision Record (Required)
266
+ ### 7c. E2E Decision Record (Required)
188
267
 
189
268
  Before writing files, produce a decision record table for every candidate TC:
190
269
 
191
- | TC ID | Decision (KEEP/ADD/SKIP) | E2E-only reason | Unit tests reviewed |
192
- |-------|---------------------------|-----------------|---------------------|
193
- | {tc-id} | {decision} | {why this needs real CLI/tools/fs} | {path1,path2} |
270
+ | TC ID | Decision (KEEP/ADD/SKIP) | E2E-only reason | Public-surface path | Unit tests reviewed |
271
+ |-------|---------------------------|-----------------|---------------------|---------------------|
272
+ | {tc-id} | {decision} | {why this needs real CLI/tools/fs} | {docs/help/CLI path or "not valid"} | {path1,path2} |
194
273
 
195
274
  Rules:
275
+
196
276
  - No TC may be created without a row in this table.
197
277
  - If decision is `SKIP`, include the unit-test evidence that replaces it.
278
+ - If the public-surface path is missing or workaround-driven, the TC must be `SKIP` or explicitly planned as a product/docs/help improvement before creation.
279
+ - If the TC uses live refresh or watch behavior, include a bounded-session capture plan with explicit shutdown behavior and exit-code expectations.
198
280
  - At least one `unit tests reviewed` path is required for each row.
199
281
  - The scenario-level `unit-coverage-reviewed` field must include the union of all referenced unit test files.
200
282
 
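For the live refresh or watch rule above, one possible bounded-session capture plan uses the GNU coreutils `timeout` command, which exits with status 124 when it has to stop the command at the time limit. This is an illustrative sketch: `capture_bounded` and the file names it writes are assumptions, not part of the runner harness.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of ace-test-runner-e2e): run a long-lived
# command for at most N seconds and capture stdout, stderr, and exit code.
capture_bounded() {
  local seconds="$1" out_dir="$2" name="$3" code
  shift 3
  mkdir -p "$out_dir"
  # timeout sends TERM when the limit is reached and then exits with 124
  timeout "$seconds" "$@" >"$out_dir/$name.stdout" 2>"$out_dir/$name.stderr"
  code=$?
  printf '%s\n' "$code" >"$out_dir/$name.exit"
}

# Usage (assumed command and paths):
#   capture_bounded 5 results/tc/01 watch some-tool --watch
#   A TC could then state that 124 in watch.exit is the expected shutdown path.
```

Recording the exit code in its own file keeps the expectation explicit: the verifier can distinguish a bounded shutdown (124) from a crash or an early clean exit.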
@@ -203,12 +285,13 @@ Rules:
203
285
  If a context description was provided, enhance the test with:
204
286
 
205
287
  **Research the package:**
206
- 1. **Run unit tests first** (`ace-test` in the package) they are the ground truth for implemented behavior
288
+ 1. **Run unit tests first** (`ace-test` in the package) -- they are the ground truth for implemented behavior
207
289
  2. Examine the relevant code in `{PACKAGE}/lib/`
208
290
  3. Check existing unit tests for expected behavior patterns
209
291
  4. Understand the feature being tested
210
292
  5. **Run the tool** to observe actual behavior, output format, file paths, and exit codes
211
- 6. **Verify config/input formats** by reading the actual parsing code never assume formats from design specs or task descriptions
293
+ 6. **Verify config/input formats** by reading the actual parsing code -- never assume formats from design specs or task descriptions
294
+ 7. **Compare with the public surface** -- verify the intended user path is actually supported by docs/help, and do not compensate for gaps with hidden runner instructions
212
295
 
213
296
  **Generate test content:**
214
297
  1. Write a clear objective based on the context
@@ -220,16 +303,20 @@ If a context description was provided, enhance the test with:
220
303
  #### Test Case Generation Rules
221
304
 
222
305
  **MUST (required for all E2E tests):**
223
- - **Verify the feature is implemented** before writing the test — read the actual implementation code, not just task specs or design documents
224
- - **Verify config/input formats** by reading the parsing code never assume formats from BDD specs, task descriptions, or documentation
306
+
307
+ - **Verify the feature is implemented** before writing the test -- read the actual implementation code, not just task specs or design documents
308
+ - **Verify config/input formats** by reading the parsing code -- never assume formats from BDD specs, task descriptions, or documentation
225
309
  - Include an error/negative TC only when it validates E2E-exclusive behavior (real CLI parser/runtime/tooling/filesystem) or when unit coverage has a documented gap
226
- - Verify actual file paths by running the tool first never hardcode paths from documentation or assumptions
227
- - Use explicit `&& echo "PASS" || echo "FAIL"` patterns for every verification step
310
+ - Verify actual file paths by running the tool first -- never hardcode paths from documentation or assumptions
311
+ - Write runner goals as user outcomes, not “create a report” chores for the verifier
228
312
  - Check specific exit codes for error commands (not just "non-zero")
229
- - Add at least one output-content assertion for each command being verified
313
+ - Make final sandbox state or real product output the primary oracle whenever possible
314
+ - Do not require undeclared or verifier-facing helper files under `results/tc/{NN}/`
315
+ - Add at least one behavioral/content assertion when CLI output itself is part of the outcome being tested
230
316
 
231
317
  **SHOULD (strongly recommended):**
232
- - Test the real user journey — structure TCs as a sequential workflow, not isolated commands
318
+
319
+ - Test the real user journey -- structure TCs as a sequential workflow, not isolated commands
233
320
  - Verify exit codes for all commands, not just error cases
234
321
  - Include negative assertions (files/directories that should NOT exist)
235
322
  - Capture and retain command output for all assertions (`stdout`, `stderr`, and `*.exit`)
@@ -237,17 +324,18 @@ If a context description was provided, enhance the test with:
237
324
  - Verify that status values match actual implementation (e.g., `done` vs `completed`)
238
325
 
239
326
  **COST-AWARE (reduce LLM invocations):**
240
- - Consolidate assertions that share the same CLI invocation into a single TC. For example, after running `ace-lint file.rb`, check exit code, report.json structure, and ok.md existence in ONE TC — not three.
327
+
328
+ - Consolidate assertions that share the same CLI invocation into a single TC. For example, after running `ace-lint file.rb`, check exit code, report.json structure, and ok.md existence in ONE TC -- not three.
241
329
  - Target 2-5 TCs per scenario. More than 5 suggests the scenario is too broad; split into focused scenarios. Fewer than 2 suggests merging with a related scenario.
242
330
  - Never create a TC for a single assertion when that assertion could be appended to an existing TC that runs the same command.
243
331
 
244
332
  #### Recommended TC Ordering
245
333
 
246
- 1. **Error paths first** wrong args, missing files, no prior state (run from clean state)
247
- 2. **Happy path start** create/init with correct args, verify output
248
- 3. **Structure verification** check actual on-disk file structure with negative assertions
249
- 4. **Lifecycle operations** status, advance, fail, retry in workflow order
250
- 5. **End state** verify completion message, all steps terminal
334
+ 1. **Error paths first** -- wrong args, missing files, no prior state (run from clean state)
335
+ 2. **Happy path start** -- create/init with correct args, verify output
336
+ 3. **Structure verification** -- check actual on-disk file structure with negative assertions
337
+ 4. **Lifecycle operations** -- status, advance, fail, retry in workflow order
338
+ 5. **End state** -- verify completion message, all steps terminal
251
339
 
252
340
  This ordering ensures error TCs run before any state is created (clean environment), and happy-path TCs build on each other sequentially.
253
341
 
@@ -258,6 +346,7 @@ See: **e2e-testing.g.md § "Avoiding False Positive Tests"** for the full list o
258
346
  **E2E tests MUST test through the CLI interface, not library imports.**
259
347
 
260
348
  **Valid approach:**
349
+
261
350
  ```bash
262
351
  OUTPUT=$(ace-review --preset code --subject "diff:HEAD~1" --auto-execute 2>&1)
263
352
  EXIT_CODE=$?
@@ -265,6 +354,7 @@ EXIT_CODE=$?
265
354
  ```
266
355
 
267
356
  **Invalid approach (this is integration/unit testing, not E2E):**
357
+
268
358
  ```bash
269
359
  bundle exec ruby -e '
270
360
  require_relative "lib/ace/review"
@@ -273,6 +363,7 @@ bundle exec ruby -e '
273
363
  ```
274
364
 
275
365
  **For execution tests (LLM, API calls):**
366
+
276
367
  - Use `--auto-execute` to make real API calls
277
368
  - Using only `--dry-run` cannot verify actual execution behavior
278
369
  - Keep costs minimal: cheap models, tiny prompts, small diffs
@@ -280,22 +371,26 @@ bundle exec ruby -e '
280
371
  #### Common Anti-Patterns to Avoid
281
372
 
282
373
  **Writing tests from design specs before implementation:**
374
+
283
375
  - Task descriptions and BDD specs often describe *intended* behavior with *proposed* config formats
284
376
  - The actual implementation may use different formats, different commands, or different workflows
285
377
  - Example: A spec might describe `jobs:` with explicit `number:` and `parent:` fields, but implementation uses `steps:` with auto-generated numbers and dynamic hierarchy via `add --after --child`
286
378
  - **Fix:** Always read the actual implementation code (especially config parsing) before writing test data
287
379
 
288
380
  **Assuming static vs dynamic behavior:**
381
+
289
382
  - Tests may assume features work at config-time (static) when they actually work at runtime (dynamic)
290
383
  - Example: Assuming hierarchy is defined in config when it's actually built dynamically via commands
291
384
  - **Fix:** Trace the actual code path for the feature being tested
292
385
 
293
386
  **Splitting one command into many redundant TCs:**
387
+
294
388
  - Multiple TCs each validate one assertion after the same CLI invocation, creating overlap with unit tests and increasing run cost
295
389
  - Example: TC-A checks exit code, TC-B checks report file, TC-C checks summary text for the same command run
296
390
  - **Fix:** Consolidate those assertions into one TC and move formatter/parser details to unit tests
297
391
 
298
392
  **Example for "Test config file validation":**
393
+
299
394
  ```markdown
300
395
  ## Test Cases
301
396
 
@@ -315,28 +410,33 @@ bundle exec ruby -e '
315
410
  ### 9. Write Test Files
316
411
 
317
412
  Create the scenario directory with separate files:
413
+
318
414
  ```bash
319
415
  mkdir -p {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}
320
416
  ```
321
417
 
322
418
  Write `scenario.yml` (metadata and setup):
419
+
323
420
  ```
324
421
  {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/scenario.yml
325
422
  ```
326
423
 
327
424
  Write scenario pair configs:
425
+
328
426
  ```
329
427
  {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/runner.yml.md
330
428
  {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/verifier.yml.md
331
429
  ```
332
430
 
333
431
  Write individual TC runner/verifier files for each test case:
432
+
334
433
  ```
335
434
  {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/TC-001-{tc-slug}.runner.md
336
435
  {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/TC-001-{tc-slug}.verify.md
337
436
  ```
338
437
 
339
438
  Optionally create a fixtures directory if test data is needed:
439
+
340
440
  ```bash
341
441
  mkdir -p {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/fixtures
342
442
  ```
@@ -373,6 +473,7 @@ Output a summary:
373
473
  ## Example Invocations
374
474
 
375
475
  **Create a test:**
476
+
376
477
  ```bash
377
478
  ace-bundle wfi://e2e/create
378
479
  ```
@@ -380,6 +481,7 @@ ace-bundle wfi://e2e/create
380
481
  Creates: `ace-lint/test/e2e/TS-LINT-003-new-test-scenario/` with `scenario.yml` and TC files.
381
482
 
382
483
  **Create a contextual test:**
484
+
383
485
  ```bash
384
486
  ace-bundle wfi://e2e/create
385
487
  ```
@@ -387,6 +489,7 @@ ace-bundle wfi://e2e/create
387
489
  Creates: `ace-lint/test/e2e/TS-LINT-003-config-file-validation/` with `scenario.yml` and TC files for config validation.
388
490
 
389
491
  **Create test for new area:**
492
+
390
493
  ```bash
391
494
  ace-bundle wfi://e2e/create
392
495
  ```
@@ -46,12 +46,15 @@ Tag filtering happens at discovery time (before `SetupExecutor` runs). By the ti
46
46
 
47
47
  ## Execution Contract
48
48
 
49
- - Runner is execution-only: execute declared TC actions and capture evidence.
49
+ - Runner is execution-only: execute declared TC actions, leave only real outcome evidence under `results/tc/{NN}/`, and return final observations through the harness.
50
+ - Runner follows the public user path. Do not turn missing docs/help/CLI affordances into embedded workaround instructions.
50
51
  - Verifier is verification-only: determine PASS/FAIL using impact-first ordering:
51
52
  1. sandbox/project state impact
52
- 2. explicit artifacts
53
- 3. debug captures (`stdout`/`stderr`/exit) as fallback
53
+ 2. runner observations
54
+ 3. explicit artifacts that are true product outcomes
55
+ 4. debug captures (`stdout`/`stderr`/exit) as fallback
54
56
  - Do not interpret setup ownership in runner TC files; setup is owned by `scenario.yml` + fixtures.
57
+ - Treat workaround pressure recorded in runner observations as a gap to fix, not as permission to strengthen the runner script.
55
58
 
56
59
  ## Dual-Agent Verifier
57
60
 
@@ -61,7 +64,7 @@ When `--verify` is passed (or always-on for CLI pipeline runs), execution follow
61
64
  2. **Verifier agent** independently inspects the sandbox and artifacts against `TC-*.verify.md` expectations
62
65
  3. **Report generator** (`PipelineReportGenerator`) produces deterministic summary from verifier output
63
66
 
64
- The verifier has no access to the runner's conversation — it evaluates purely from on-disk evidence. This prevents self-confirmation bias.
67
+ The verifier has no access to the runner's conversation — it evaluates from sandbox evidence plus the structured runner observations persisted by the harness. This prevents self-confirmation bias while still surfacing execution context.
65
68
 
66
69
  ## Subagent Mode
67
70
 
@@ -75,6 +78,7 @@ When invoked as a subagent (via Task tool from orchestrator):
75
78
  - **Failed**: {count}
76
79
  - **Total**: {count}
77
80
  - **Report Paths**: {timestamp}-{short-pkg}-{short-id}.*
81
+ - **Observations**: Brief factual summary or "None"
78
82
  - **Issues**: Brief description or "None"
79
83
  ```
80
84
 
@@ -149,8 +153,8 @@ For each TC (TC-NNN):
149
153
 
150
154
  1. **Check filter** — skip if `FILTERED_CASES` is set and TC not in list
151
155
  2. **Read** the runner file objective
152
- 3. **Execute** runner steps, save artifacts to `results/tc/{NN}/`
153
- 4. **Capture** exit codes, output, error messages
156
+ 3. **Execute** runner steps, save only real outcome artifacts to `results/tc/{NN}/`
157
+ 4. **Return** factual runner observations through the harness
154
158
  5. **Evaluate** against verifier expectations
155
159
  6. **Record** Pass/Fail with per-TC evidence
156
160
 
@@ -250,4 +254,4 @@ Reports: `.ace-local/test-e2e/{timestamp}-{short-pkg}-{short-id}-reports/`
250
254
  | TC fails | Record details, continue remaining TCs, include in report |
251
255
  | Sandbox missing/corrupted | Report error, do NOT recreate, return error summary |
252
256
  | TC filter mismatch | STOP, do not write reports, offer re-run |
253
- | Missing TC pair file | Report error for that TC, skip it, continue others |
257
+ | Missing TC pair file | Report error for that TC, skip it, continue others |
@@ -1,23 +1,33 @@
1
1
  ---
2
+ name: e2e-fix
3
+ description: Diagnose, fix, and rerun failing E2E scenarios with a self-bootstrapping analysis loop.
4
+ allowed-tools:
5
+ - Bash(ace-bundle:*)
6
+ - Bash(ace-test:*)
7
+ - Read
8
+ - Write
9
+ - Edit
10
+ - Skill
2
11
  doc-type: workflow
3
12
  title: Fix E2E Tests Workflow
4
13
  purpose: fix-e2e-tests workflow instruction
5
14
  ace-docs:
6
- last-updated: 2026-03-13
7
- last-checked: 2026-03-21
15
+ last-updated: 2026-04-19
16
+ last-checked: 2026-04-19
8
17
  ---
9
18
 
10
19
  # Fix E2E Tests Workflow
11
20
 
12
21
  ## Goal
13
22
 
14
- Apply targeted fixes for failing E2E scenarios based on an existing E2E failure analysis report.
23
+ Diagnose, fix, and rerun failing E2E scenarios with a single workflow entrypoint.
15
24
 
16
- This workflow is execution-only. Root cause classification is handled by `wfi://e2e/analyze-failures`.
25
+ This workflow owns analysis readiness before any fix is applied. Reuse an existing analysis report when it is complete; otherwise generate or complete it via `wfi://e2e/analyze-failures`, then continue directly into the fix loop.
17
26
 
18
- ## Hard Gate (Required Before Any Fix)
27
+ ## Analysis Readiness Gate
28
+
29
+ Before any fix, ensure an analysis report exists with:
19
30
 
20
- Do not apply any fix until an analysis report exists with:
21
31
  - scenario / TC identifier
22
32
  - category (`code-issue`, `test-issue`, `runner-infrastructure-issue`)
23
33
  - evidence from reports/artifacts
@@ -26,22 +36,29 @@ Do not apply any fix until an analysis report exists with:
26
36
  - primary candidate files
27
37
  - do-not-touch boundaries
28
38
  - rerun scope recommendation
39
+ - `Docs / Help Drift From E2E Failures` section with `Public Surface Checked`, `Drift Found`, and `Update Targets`
40
+
41
+ If analysis is missing or incomplete, generate or refresh it first:
29
42
 
30
- If analysis is missing or incomplete, stop and run:
31
43
  ```bash
32
44
  ace-bundle wfi://e2e/analyze-failures
33
45
  ```
34
46
 
47
+ Then continue this workflow using the resulting `E2E Failure Analysis Report`, `Fix Decisions`, and `Execution Plan Input` as the source of truth. Do not stop merely because analysis had to be generated.
48
+
35
49
  ## Required Input
36
50
 
37
- Use the output section from `e2e/analyze-failures`:
51
+ Use the output section from `e2e/analyze-failures` when present, whether it was provided up front or generated by this workflow:
52
+
38
53
  - `## E2E Failure Analysis Report`
54
+ - `## Docs / Help Drift From E2E Failures`
39
55
  - `## Fix Decisions`
40
56
  - `### Execution Plan Input`
41
57
 
42
58
  ## Autonomy Rule
43
59
 
44
60
  - Do not ask the user to choose fix target, category, or rerun scope.
61
+ - If analysis is missing, run `wfi://e2e/analyze-failures` yourself before fixing.
45
62
  - If analysis is incomplete, auto-complete missing decision fields via local evidence (reports, artifacts, scenario files, implementation), then proceed.
46
63
  - Only stop for hard blockers (missing files/tools/permissions).
47
64
 
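The readiness check on the required analysis sections listed under Required Input could be approximated with a heading scan. This is a sketch under assumptions: `analysis_ready` is a hypothetical function, and it assumes the analysis output has been saved to a markdown file where each required heading sits on its own line.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of ace-test-runner-e2e): verify that a saved
# analysis report contains every required section heading.
analysis_ready() {
  local report="$1" heading missing=0
  local required=(
    "## E2E Failure Analysis Report"
    "## Docs / Help Drift From E2E Failures"
    "## Fix Decisions"
    "### Execution Plan Input"
  )
  for heading in "${required[@]}"; do
    # -F: fixed string, -x: match the whole line, -q: no output
    if ! grep -Fxq -- "$heading" "$report"; then
      echo "missing section: $heading" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (assumed file name):
#   analysis_ready analysis.md || ace-bundle wfi://e2e/analyze-failures
```

A heading scan only confirms that the sections exist; whether the fields inside them (category, evidence, rerun scope) are complete still has to be judged from the report content.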
@@ -61,27 +78,41 @@ Apply fixes in this order:
61
78
 
62
79
  ## Fix Procedure
63
80
 
64
- 1. Pick the first prioritized item from analysis
81
+ 1. Establish or refresh analysis
82
+
83
+ - Check for a current analysis report that satisfies the Analysis Readiness Gate.
84
+ - If none exists, or if required fields are missing, including the docs/help drift section, run `ace-bundle wfi://e2e/analyze-failures`.
85
+ - Reuse the most recent valid analysis output as the source of truth for fix selection.
86
+ - Treat full-suite/package reruns and targeted scenario reruns as different scopes. Do not label a broader suite failure set as a regression in a previously fixed targeted scenario unless the same scenario fails again on a clean rerun.
87
+
88
+ 2. Pick the first prioritized item from analysis
89
+
65
90
  - Use the selected "First item to fix"
66
91
  - Confirm category, fix target, and rerun scope
67
92
  - Apply the "Chosen fix decision" and primary candidate files directly
68
93
 
69
- 2. Apply category-specific fix
94
+ 3. Apply category-specific fix
70
95
 
71
96
  ### Category: runner-infrastructure-issue
97
+
72
98
  - Fix runner/sandbox/provider/reporting/orchestration behavior
73
99
  - Verify with runner tests when applicable: `ace-test ace-test-runner-e2e`
74
100
 
75
101
  ### Category: code-issue
102
+
76
103
  - Fix package/tool behavior in implementation code
77
104
  - Add/update unit tests if needed
105
+ - When the user job is valid but not achievable from docs/help/public CLI, apply the documented docs/help update target instead of codifying the workaround in the scenario
78
106
 
79
107
  ### Category: test-issue
108
+
80
109
  - Fix scenario definition, runner/verifier criteria, fixtures, or setup steps
81
110
  - Preserve role split: runner is execution-only, verifier is impact-first verdict
82
111
  - Keep implementation unchanged unless analysis is revised
112
+ - Remove hidden recipes, workaround branches, and unsupported internal-detail checks from goal-style TCs
113
+ - Repair undeclared or wildcard artifact contracts before weakening product assertions
83
114
 
84
- 3. Rerun the selected failing scope after each fix
115
+ 4. Rerun the selected failing scope after each fix
85
116
 
86
117
  After every implemented fix, rerun the analysis-selected failing scope before moving to the next item or recommending release.
87
118
 
@@ -96,25 +127,37 @@ ace-test-e2e {package}
96
127
  ```
97
128
 
98
129
  Rules:
130
+
99
131
  - Scenario rerun is the default after each fix iteration.
100
132
  - Use package rerun only when analysis explicitly selected package scope.
101
133
  - For multiple failing scenarios, rerun each scenario explicitly.
134
+
102
135
  ```text
103
136
  ace-test-e2e ace-assign TS-ASSIGN-001
104
137
  ace-test-e2e ace-assign TS-ASSIGN-002
105
138
  ace-test-e2e ace-bundle TS-BUNDLE-001
106
139
  ```
140
+
107
141
  - Record the rerun command and result in the execution summary for every fix item.
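
The rerun rules above can be sketched as a small command builder. This is an illustrative Ruby sketch, not code from the package: `rerun_commands` and its `scope:` keyword are hypothetical names, and only the two command shapes shown in this workflow (`ace-test-e2e {package} {test-id}` and `ace-test-e2e {package}`) are used.

```ruby
# Hypothetical helper: builds rerun commands from [package, test_id] pairs.
# Scenario scope is the default; package scope is used only when the
# analysis explicitly selected it.
def rerun_commands(failures, scope: "scenario")
  if scope == "package"
    # One command per distinct package.
    failures.map { |package, _test_id| "ace-test-e2e #{package}" }.uniq
  else
    # One explicit command per failing scenario.
    failures.map { |package, test_id| "ace-test-e2e #{package} #{test_id}" }
  end
end

failures = [
  ["ace-assign", "TS-ASSIGN-001"],
  ["ace-assign", "TS-ASSIGN-002"],
  ["ace-bundle", "TS-BUNDLE-001"]
]
puts rerun_commands(failures)
```

Emitting one command per scenario keeps each rerun result attributable to a single fix item in the execution summary.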
108
142
 
109
- 4. Re-check classification when evidence conflicts
143
+ 5. Re-check classification when evidence conflicts
144
+
110
145
  - If outcome contradicts analysis, return to `e2e/analyze-failures`
111
146
  - Update analysis report and re-select a new autonomous chosen fix decision before continuing
147
+ - If a suite/package report conflicts with a scenario report, the scenario report wins. Fix or explicitly track the aggregate mismatch before relying on suite-level TC mappings.
148
+
149
+ 6. Iterate until all targeted failures are resolved
112
150
 
113
- 5. Iterate until all targeted failures are resolved
114
151
  - Keep one active scenario/TC at a time
115
152
  - Preserve cost-conscious rerun discipline
116
153
 
117
- 6. Run a final explicit failing-scenario checkpoint before concluding the fix session
154
+ 6a. If the fix changes a public contract, run a downstream retained-E2E sweep
155
+
156
+ - Trigger this sweep when the fix changes status words, JSON keys, command shapes, lifecycle semantics, or ownership/state semantics
157
+ - Grep impacted scenarios and downstream consumers before concluding the fix
158
+ - Update retained runner/verifier contracts in the same change set whenever feasible
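
The "grep impacted scenarios" step in 6a could look roughly like the Ruby sketch below. It assumes scenario and consumer files are markdown or YAML somewhere under a root directory; the function name, the glob pattern, and the example tokens are all hypothetical, and the real repository layout may differ.

```ruby
# Hypothetical sweep: list files under root that mention any changed
# public-contract token (status word, JSON key, command shape, ...).
def impacted_files(root, tokens, pattern: "**/*.{md,yml}")
  Dir.glob(File.join(root, pattern)).sort.select do |path|
    text = File.read(path)
    tokens.any? { |token| text.include?(token) }
  end
end

# Example: a fix renamed a status word and a JSON key (made-up values).
# impacted_files(".", ["in-progress", "\"owner_id\""])
```

Plain substring matching is deliberately coarse here. For a sweep, a false positive costs one extra file to inspect, while a missed consumer costs a later E2E failure.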
159
+
160
+ 7. Run a final explicit failing-scenario checkpoint before concluding the fix session
118
161
 
119
162
  After the currently targeted failures are addressed, require one final:
120
163
 
@@ -136,12 +179,37 @@ Use one explicit command per previously failing scenario to confirm no targeted
136
179
  ```markdown
137
180
  ## E2E Fix Execution Summary
138
181
 
182
+ Analysis Source: reused existing analysis | generated via `wfi://e2e/analyze-failures` | refreshed incomplete analysis
183
+
139
184
  | Scenario / TC | Category | Change Applied | Verification Command | Result |
140
185
  |---|---|---|---|---|
141
186
  | ... | ... | ... | ... | pass/fail |
142
187
  ```
143
188
 
189
+ Also include:
190
+
191
+ ```markdown
192
+ ## Fix Classification Totals
193
+
194
+ | Bucket | Count |
195
+ |---|---|
196
+ | Product bug | {n} |
197
+ | Harness bug | {n} |
198
+ | Retained test/spec drift | {n} |
199
+ ```
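
As an illustration of how the totals table could be produced, here is a short Ruby sketch. The mapping from analysis categories to buckets is an assumption made for this example; the workflow text does not state which category feeds which bucket. The function names are hypothetical as well.

```ruby
BUCKETS = ["Product bug", "Harness bug", "Retained test/spec drift"].freeze

# ASSUMPTION: this category-to-bucket mapping is a guess, not from the source.
CATEGORY_TO_BUCKET = {
  "code-issue" => "Product bug",
  "runner-infrastructure-issue" => "Harness bug",
  "test-issue" => "Retained test/spec drift"
}.freeze

# Counts fixes per bucket; every bucket appears even when its count is zero.
def classification_totals(categories)
  counts = BUCKETS.map { |bucket| [bucket, 0] }.to_h
  categories.each { |category| counts[CATEGORY_TO_BUCKET.fetch(category)] += 1 }
  counts
end

# Renders the counts in the table shape shown above.
def totals_table(counts)
  rows = counts.map { |bucket, n| "| #{bucket} | #{n} |" }
  (["## Fix Classification Totals", "", "| Bucket | Count |", "|---|---|"] + rows).join("\n")
end

puts totals_table(classification_totals(%w[code-issue test-issue test-issue]))
```

Using `fetch` makes an unrecognized category raise a `KeyError` instead of silently dropping a fix from the totals.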
200
+
201
+ If the analysis reported docs/help drift, include:
202
+
203
+ ```markdown
204
+ ## Docs / Help Updates
205
+
206
+ | Scenario / TC | Public Surface Updated | Why |
207
+ |---|---|---|
208
+ | ... | docs/usage.md, CLI --help | E2E failure showed the valid user job was not discoverable |
209
+ ```
210
+
144
211
  Include one final row for the batch checkpoint:
212
+
145
213
  - Verification Command: one explicit rerun command per remaining failed scenario (`ace-test-e2e {package} {test-id}`)
146
214
  - Result: `pass` or remaining failing scenarios
147
215
  - If failures remain, continue the fix loop instead of treating the session as complete
@@ -160,7 +228,8 @@ If unresolved:
160
228
 
161
229
  - Fixes are traceable to analyzed failures
162
230
  - Verification scope matches analysis recommendation, including mandatory reruns after each fix
231
+ - Any docs/help drift from analysis is fixed or explicitly carried as an unresolved blocker
163
232
  - Cost-conscious rerun strategy was followed
164
233
  - Final explicit per-scenario rerun checkpoint for all targeted failures was completed before concluding the fix session
165
234
  - No user clarification was required for fix targeting/scope in normal flow
166
- - Targeted failures pass, or blockers are explicitly documented
235
+ - Targeted failures pass, or blockers are explicitly documented