ace-test-runner-e2e 0.29.6 → 0.38.11
- checksums.yaml +4 -4
- data/.ace-defaults/e2e-runner/config.yml +14 -2
- data/CHANGELOG.md +187 -0
- data/README.md +2 -2
- data/exe/ace-test-e2e-sh +9 -4
- data/handbook/guides/e2e-testing.g.md +43 -9
- data/handbook/guides/scenario-yml-reference.g.md +16 -8
- data/handbook/guides/tc-authoring.g.md +12 -5
- data/handbook/skills/as-e2e-fix/SKILL.md +2 -2
- data/handbook/skills/as-e2e-review/SKILL.md +2 -2
- data/handbook/templates/ace-taskflow-fixture.template.md +17 -17
- data/handbook/templates/agent-experience-report.template.md +3 -2
- data/handbook/templates/scenario.yml.template.yml +13 -2
- data/handbook/templates/tc-file.template.md +14 -4
- data/handbook/workflow-instructions/e2e/analyze-failures.wf.md +53 -6
- data/handbook/workflow-instructions/e2e/create.wf.md +139 -23
- data/handbook/workflow-instructions/e2e/execute.wf.md +11 -7
- data/handbook/workflow-instructions/e2e/fix.wf.md +65 -15
- data/handbook/workflow-instructions/e2e/plan-changes.wf.md +17 -1
- data/handbook/workflow-instructions/e2e/review.wf.md +44 -28
- data/handbook/workflow-instructions/e2e/rewrite.wf.md +17 -3
- data/handbook/workflow-instructions/e2e/run.wf.md +50 -26
- data/handbook/workflow-instructions/e2e/setup-sandbox.wf.md +4 -4
- data/lib/ace/test/end_to_end_runner/atoms/skill_prompt_builder.rb +7 -5
- data/lib/ace/test/end_to_end_runner/atoms/skill_result_parser.rb +73 -7
- data/lib/ace/test/end_to_end_runner/cli/commands/run_test.rb +21 -8
- data/lib/ace/test/end_to_end_runner/models/test_case.rb +8 -2
- data/lib/ace/test/end_to_end_runner/models/test_result.rb +9 -3
- data/lib/ace/test/end_to_end_runner/models/test_scenario.rb +4 -2
- data/lib/ace/test/end_to_end_runner/molecules/affected_detector.rb +7 -2
- data/lib/ace/test/end_to_end_runner/molecules/bwrap_sandbox_backend.rb +271 -0
- data/lib/ace/test/end_to_end_runner/molecules/config_loader.rb +28 -1
- data/lib/ace/test/end_to_end_runner/molecules/integration_runner.rb +122 -0
- data/lib/ace/test/end_to_end_runner/molecules/pipeline_executor.rb +165 -25
- data/lib/ace/test/end_to_end_runner/molecules/pipeline_prompt_bundler.rb +121 -8
- data/lib/ace/test/end_to_end_runner/molecules/pipeline_report_generator.rb +91 -19
- data/lib/ace/test/end_to_end_runner/molecules/pipeline_sandbox_builder.rb +119 -18
- data/lib/ace/test/end_to_end_runner/molecules/report_writer.rb +13 -12
- data/lib/ace/test/end_to_end_runner/molecules/sandbox_runtime_builder.rb +282 -0
- data/lib/ace/test/end_to_end_runner/molecules/scenario_loader.rb +85 -5
- data/lib/ace/test/end_to_end_runner/molecules/setup_executor.rb +98 -16
- data/lib/ace/test/end_to_end_runner/molecules/suite_report_writer.rb +241 -97
- data/lib/ace/test/end_to_end_runner/molecules/test_discoverer.rb +38 -13
- data/lib/ace/test/end_to_end_runner/molecules/test_executor.rb +27 -5
- data/lib/ace/test/end_to_end_runner/organisms/suite_orchestrator.rb +73 -15
- data/lib/ace/test/end_to_end_runner/organisms/test_orchestrator.rb +120 -19
- data/lib/ace/test/end_to_end_runner/version.rb +1 -1
- data/lib/ace/test/end_to_end_runner.rb +2 -0
- metadata +19 -2
|
data/handbook/workflow-instructions/e2e/create.wf.md
@@ -1,4 +1,12 @@
|
|
|
1
1
|
---
|
|
2
|
+
name: e2e-create
|
|
3
|
+
description: Create a new E2E test scenario from template
|
|
4
|
+
allowed-tools:
|
|
5
|
+
- Bash(ace-bundle:*)
|
|
6
|
+
- Read
|
|
7
|
+
- Write
|
|
8
|
+
- Glob
|
|
9
|
+
- Grep
|
|
2
10
|
doc-type: workflow
|
|
3
11
|
title: Create E2E Test Workflow
|
|
4
12
|
purpose: Create a new E2E test scenario from template
|
|
@@ -23,35 +31,48 @@ This workflow guides an agent through creating a new E2E test scenario.
|
|
|
23
31
|
- Scenario ID format: `TS-<PACKAGE_SHORT>-<NNN>[-slug]`
|
|
24
32
|
- Standalone files: `TC-*.runner.md` and `TC-*.verify.md`
|
|
25
33
|
- TC artifact layout: `results/tc/{NN}/`
|
|
34
|
+
- Runner observations are harness-managed report data, not sandbox helper files
|
|
26
35
|
- Summary counters: `tcs-passed`, `tcs-failed`, `tcs-total`, `failed[].tc`
|
|
27
36
|
- CLI split reminder:
|
|
37
|
+
|
|
28
38
|
- `ace-test-e2e` for single-package execution
|
|
29
39
|
- `ace-test-e2e-suite` for suite-level execution
|
|
30
40
|
|
|
31
41
|
## Authoring Contract
|
|
32
42
|
|
|
33
43
|
- Runner files (`runner.yml.md`, `TC-*.runner.md`) are execution-only.
|
|
44
|
+
- Goal-style TCs must prove two things:
|
|
45
|
+
- the tool works
|
|
46
|
+
- a user can do the job from the public surface (`README`, usage docs, `--help`, and the CLI itself) without hidden recipes or workarounds
|
|
34
47
|
- Verifier files (`verifier.yml.md`, `TC-*.verify.md`) are verdict-only with impact-first evidence order:
|
|
48
|
+
|
|
35
49
|
1. sandbox/project state impact
|
|
36
|
-
2.
|
|
37
|
-
3.
|
|
50
|
+
2. runner observations
|
|
51
|
+
3. explicit product outcomes
|
|
52
|
+
4. debug captures as fallback
|
|
53
|
+
|
|
38
54
|
- Setup belongs to `scenario.yml` `setup:` and fixtures; do not duplicate setup in runner TC instructions.
|
|
55
|
+
- Keep `results/tc/{NN}/` for real outcome artifacts only; do not ask the runner to write helper YAML, path files, command files, reflections, or verifier-facing manifests there.
|
|
56
|
+
- Do not encode hidden command recipes, fallback detours, or workaround sequences in runner TC files. If the job cannot be done from the public surface, treat that as a product/docs/help gap or remove/narrow the TC.
|
|
39
57
|
|
|
40
58
|
## Workflow Steps
|
|
41
59
|
|
|
42
60
|
### 1. Validate Inputs
|
|
43
61
|
|
|
44
62
|
**Check package exists:**
|
|
63
|
+
|
|
45
64
|
```bash
|
|
46
65
|
test -d "{PACKAGE}" && echo "Package exists" || echo "Package not found"
|
|
47
66
|
```
|
|
48
67
|
|
|
49
68
|
If package doesn't exist, list available packages:
|
|
69
|
+
|
|
50
70
|
```bash
|
|
51
71
|
ls -d */ | grep -E "^ace-" | sed 's/\/$//'
|
|
52
72
|
```
|
|
53
73
|
|
|
54
74
|
**Normalize area code:**
|
|
75
|
+
|
|
55
76
|
- Convert to uppercase (e.g., `lint` -> `LINT`)
|
|
56
77
|
- Verify it's a valid area name (2-10 alphanumeric characters)
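A minimal sketch of the normalization step, assuming the rule is exactly "uppercase, then 2-10 alphanumeric characters". The function name `normalize_area` is hypothetical and is not part of the package.

```bash
# Hypothetical helper (not part of ace-test-runner-e2e):
# uppercase an area code, then validate length and character set.
normalize_area() {
  area=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  case "$area" in
    ''|*[!A-Z0-9]*)
      # Empty, or contains something other than A-Z / 0-9
      echo "invalid area code: $1" >&2
      return 1
      ;;
  esac
  if [ "${#area}" -lt 2 ] || [ "${#area}" -gt 10 ]; then
    echo "invalid area code length: $1" >&2
    return 1
  fi
  printf '%s\n' "$area"
}
```

For example, `normalize_area lint` prints `LINT`, while `normalize_area x` fails because the code is shorter than two characters.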
|
|
57
78
|
|
|
@@ -66,6 +87,7 @@ find {PACKAGE}/test/e2e -maxdepth 1 -type d -name "TS-{AREA}-*" 2>/dev/null | \
|
|
|
66
87
|
```
|
|
67
88
|
|
|
68
89
|
Sort and take the highest number:
|
|
90
|
+
|
|
69
91
|
- If no existing tests: use `001`
|
|
70
92
|
- Otherwise: increment the highest number by 1
|
|
71
93
|
- Format as three digits (e.g., `001`, `002`, `015`)
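The numbering rules above can be sketched as a small shell function. This is an illustration under stated assumptions: the function name `next_scenario_number` is hypothetical, and it assumes scenario directories are named `TS-<AREA>-<NNN>[-slug]` directly under the given directory.

```bash
# Hypothetical helper: print the next zero-padded scenario number.
# Usage: next_scenario_number DIR AREA
next_scenario_number() {
  dir=$1
  area=$2
  # Extract the numeric part of each TS-<AREA>-<NNN>[-slug] directory name
  # and keep the highest one.
  highest=$(
    find "$dir" -maxdepth 1 -type d -name "TS-${area}-*" 2>/dev/null |
      sed -n "s|.*/TS-${area}-\([0-9][0-9]*\).*|\1|p" |
      sort -n | tail -n 1
  )
  # Strip leading zeros so the shell does not read values like 008 as octal.
  highest=$(printf '%s' "${highest:-0}" | sed 's/^0*//')
  printf '%03d\n' $(( ${highest:-0} + 1 ))
}
```

With no matching directories the function prints `001`. Otherwise it prints the highest existing number plus one, formatted as three digits.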
|
|
@@ -85,12 +107,14 @@ mkdir -p {PACKAGE}/test/e2e
|
|
|
85
107
|
Create a kebab-case slug:
|
|
86
108
|
|
|
87
109
|
**If --context provided:**
|
|
110
|
+
|
|
88
111
|
- Extract key words from the context description
|
|
89
112
|
- Convert to lowercase
|
|
90
113
|
- Replace spaces with hyphens
|
|
91
114
|
- Limit to 5-6 words
|
|
92
115
|
|
|
93
116
|
**If no context:**
|
|
117
|
+
|
|
94
118
|
- Use a placeholder: `new-test-scenario`
|
|
95
119
|
|
|
96
120
|
Example: "Test config file validation" -> `config-file-validation`
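One possible way to derive such a slug in shell is shown below. It is only a sketch: the function name `make_slug` and the small stop-word list (which drops words such as "test") are assumptions made for illustration, since the workflow leaves "extract key words" to the agent's judgment.

```bash
# Hypothetical helper: build a kebab-case slug from a context description.
make_slug() {
  slug=$(
    printf '%s\n' "$1" |
      tr '[:upper:]' '[:lower:]' |   # convert to lowercase
      tr -cs 'a-z0-9' ' ' |          # anything else becomes a single space
      tr ' ' '\n' |                  # one word per line
      sed '/^$/d' |                  # drop empty lines
      grep -v -x -E 'test|tests|the|a|an|that|of|for' |  # assumed stop words
      head -n 6 |                    # limit to six words
      paste -sd '-' -                # join with hyphens
  )
  # Fall back to the placeholder when no key words remain.
  printf '%s\n' "${slug:-new-test-scenario}"
}
```

Under these assumptions, `make_slug "Test config file validation"` yields `config-file-validation`, and an empty description yields the placeholder `new-test-scenario`.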
|
|
@@ -100,11 +124,13 @@ The slug is the directory name suffix: `TS-LINT-003-config-file-validation/`
|
|
|
100
124
|
### 5. Load Template
|
|
101
125
|
|
|
102
126
|
Load the test template:
|
|
127
|
+
|
|
103
128
|
```bash
|
|
104
129
|
ace-bundle tmpl://test-e2e
|
|
105
130
|
```
|
|
106
131
|
|
|
107
132
|
Or read directly:
|
|
133
|
+
|
|
108
134
|
```
|
|
109
135
|
ace-test-runner-e2e/handbook/templates/test-e2e.template.md
|
|
110
136
|
```
|
|
@@ -123,6 +149,7 @@ Replace template placeholders with actual values:
|
|
|
123
149
|
| `{area-name}` | Area code (lowercase) |
|
|
124
150
|
|
|
125
151
|
Initial values for optional fields:
|
|
152
|
+
|
|
126
153
|
- `priority: medium`
|
|
127
154
|
- `duration: ~10min`
|
|
128
155
|
- `automation-candidate: false`
|
|
@@ -138,9 +165,10 @@ Initial values for optional fields:
|
|
|
138
165
|
Before generating test cases, verify the proposed test has genuine E2E value.
|
|
139
166
|
|
|
140
167
|
**Check unit test coverage:**
|
|
168
|
+
|
|
141
169
|
```bash
|
|
142
170
|
# Search for existing unit tests covering this area
|
|
143
|
-
find {PACKAGE}/test/
|
|
171
|
+
find {PACKAGE}/test/fast {PACKAGE}/test/feat \
|
|
144
172
|
-name "*_test.rb" 2>/dev/null | head -20
|
|
145
173
|
```
|
|
146
174
|
|
|
@@ -154,26 +182,91 @@ For each proposed TC, answer: **"Does this require the full CLI binary + real ex
|
|
|
154
182
|
- If **PARTIAL**: create the TC but scope it to only the E2E-exclusive aspects
|
|
155
183
|
|
|
156
184
|
**Example decisions:**
|
|
157
|
-
|
|
158
|
-
- "Test that
|
|
185
|
+
|
|
186
|
+
- "Test that invalid YAML config produces error" -- check if `atoms/config_parser_test.rb` already asserts this. If so, **skip** (unit test covers it). If unit test checks parsing but not the full CLI exit code path, **create** a TC scoped to just the exit code.
|
|
187
|
+
- "Test that StandardRB subprocess executes and returns results" -- unit tests stub the subprocess. **Create** this as E2E because it requires the real tool.
|
|
159
188
|
|
|
160
189
|
If all proposed TCs fail the gate, report to the user:
|
|
190
|
+
|
|
161
191
|
```
|
|
162
192
|
All proposed behaviors are already covered by unit tests in {PACKAGE}/test/.
|
|
163
193
|
No E2E test needed. Consider adding unit tests instead if coverage gaps exist.
|
|
164
194
|
```
|
|
165
195
|
|
|
166
|
-
### 7a.
|
|
196
|
+
### 7a. Public-Surface Gate
|
|
197
|
+
|
|
198
|
+
Before generating or keeping a goal-style TC, answer:
|
|
199
|
+
**"Can a normal user complete this job from the package's public surface, without hidden recipes or workarounds?"**
|
|
200
|
+
|
|
201
|
+
Public surface means:
|
|
202
|
+
- package README / usage docs
|
|
203
|
+
- `--help`
|
|
204
|
+
- declared fixtures and `scenario.yml` setup
|
|
205
|
+
- the tool under test itself
|
|
206
|
+
|
|
207
|
+
Reject or narrow the TC if it depends on:
|
|
208
|
+
- step-by-step runner procedures a user would not infer from docs/help
|
|
209
|
+
- workaround branches to compensate for CLI/docs/help gaps
|
|
210
|
+
- direct supporting-tool probes as the primary oracle for an ACE CLI scenario
|
|
211
|
+
- internal-state checks that the public surface does not expose and that do not matter to the user job
|
|
212
|
+
|
|
213
|
+
### 7b. Evidence-Gate Review Before Writing Files
|
|
214
|
+
|
|
215
|
+
Before finalizing the test plan, block weak coverage patterns:
|
|
216
|
+
|
|
217
|
+
- **Existence-only TC**:
|
|
218
|
+
|
|
219
|
+
- only checks directory/file existence
|
|
220
|
+
- no command output/content assertion
|
|
221
|
+
- missing `*.exit` capture for the executed command
|
|
222
|
+
|
|
223
|
+
- **Duplicate-invocation TC**:
|
|
224
|
+
|
|
225
|
+
- same command invocation, same purpose, split across multiple TCs
|
|
226
|
+
|
|
227
|
+
- **Helper-artifact-driven TC**:
|
|
228
|
+
|
|
229
|
+
- runner is instructed to create YAML/TXT/MD helper files in `results/tc/{NN}/`
|
|
230
|
+
- verifier depends on those helper files instead of final sandbox state or real product output
|
|
231
|
+
|
|
232
|
+
- **Hidden-recipe-driven TC**:
|
|
233
|
+
|
|
234
|
+
- the runner must follow a command sequence not discoverable from docs/usage/`--help`
|
|
235
|
+
- the TC succeeds only because the scenario teaches an internal or non-obvious workaround
|
|
236
|
+
|
|
237
|
+
- **Workaround-driven TC**:
|
|
238
|
+
|
|
239
|
+
- the runner is told how to bypass a docs/help/CLI gap instead of surfacing it
|
|
240
|
+
- the verifier would pass a scenario that a normal user could not complete cleanly
|
|
241
|
+
|
|
242
|
+
| TC ID | Decision (KEEP/ADD/SKIP) | Evidence Strength | E2E-only reason | Unit tests reviewed |
|
|
243
|
+
|-------|---------------------------|------------------|-----------------|--------------------|
|
|
244
|
+
| {tc-id} | {decision} | `command-output` | {why this needs real CLI/tools/fs} | {path1,path2} |
|
|
245
|
+
|
|
246
|
+
Rules:
|
|
247
|
+
|
|
248
|
+
- `existence-only` is never valid for KEEP/ADD. Use it only for SKIP rows with explicit unit-test replacement.
|
|
249
|
+
- `helper-artifact-driven` is never valid for KEEP/ADD when final sandbox state could prove the goal directly.
|
|
250
|
+
- `hidden-recipe-driven` and `workaround-driven` are never valid for KEEP/ADD.
|
|
251
|
+
- `SKIP` rows must include replacement unit-test evidence.
|
|
252
|
+
- Non-skipped rows must identify the primary oracle for the TC: final sandbox state, real product output, or debug fallback.
|
|
253
|
+
- Non-skipped rows must state why the job is achievable from the public surface without hidden recipes.
|
|
254
|
+
- At least one `unit tests reviewed` path is required for every row.
|
|
255
|
+
- The scenario-level `unit-coverage-reviewed` field must include the union of all referenced unit test files.
|
|
256
|
+
|
|
257
|
+
### 7c. E2E Decision Record (Required)
|
|
167
258
|
|
|
168
259
|
Before writing files, produce a decision record table for every candidate TC:
|
|
169
260
|
|
|
170
|
-
| TC ID | Decision (KEEP/ADD/SKIP) | E2E-only reason | Unit tests reviewed |
|
|
171
|
-
|
|
172
|
-
| {tc-id} | {decision} | {why this needs real CLI/tools/fs} | {path1,path2} |
|
|
261
|
+
| TC ID | Decision (KEEP/ADD/SKIP) | E2E-only reason | Public-surface path | Unit tests reviewed |
|
|
262
|
+
|-------|---------------------------|-----------------|---------------------|---------------------|
|
|
263
|
+
| {tc-id} | {decision} | {why this needs real CLI/tools/fs} | {docs/help/CLI path or "not valid"} | {path1,path2} |
|
|
173
264
|
|
|
174
265
|
Rules:
|
|
266
|
+
|
|
175
267
|
- No TC may be created without a row in this table.
|
|
176
268
|
- If decision is `SKIP`, include the unit-test evidence that replaces it.
|
|
269
|
+
- If the public-surface path is missing or workaround-driven, the TC must be `SKIP` or explicitly planned as a product/docs/help improvement before creation.
|
|
177
270
|
- At least one `unit tests reviewed` path is required for each row.
|
|
178
271
|
- The scenario-level `unit-coverage-reviewed` field must include the union of all referenced unit test files.
|
|
179
272
|
|
|
@@ -182,12 +275,13 @@ Rules:
|
|
|
182
275
|
If a context description was provided, enhance the test with:
|
|
183
276
|
|
|
184
277
|
**Research the package:**
|
|
185
|
-
1. **Run unit tests first** (`ace-test` in the package)
|
|
278
|
+
1. **Run unit tests first** (`ace-test` in the package) -- they are the ground truth for implemented behavior
|
|
186
279
|
2. Examine the relevant code in `{PACKAGE}/lib/`
|
|
187
280
|
3. Check existing unit tests for expected behavior patterns
|
|
188
281
|
4. Understand the feature being tested
|
|
189
282
|
5. **Run the tool** to observe actual behavior, output format, file paths, and exit codes
|
|
190
|
-
6. **Verify config/input formats** by reading the actual parsing code
|
|
283
|
+
6. **Verify config/input formats** by reading the actual parsing code -- never assume formats from design specs or task descriptions
|
|
284
|
+
7. **Compare with the public surface** -- verify the intended user path is actually supported by docs/help, and do not compensate for gaps with hidden runner instructions
|
|
191
285
|
|
|
192
286
|
**Generate test content:**
|
|
193
287
|
1. Write a clear objective based on the context
|
|
@@ -199,32 +293,39 @@ If a context description was provided, enhance the test with:
|
|
|
199
293
|
#### Test Case Generation Rules
|
|
200
294
|
|
|
201
295
|
**MUST (required for all E2E tests):**
|
|
202
|
-
|
|
203
|
-
- **Verify
|
|
296
|
+
|
|
297
|
+
- **Verify the feature is implemented** before writing the test -- read the actual implementation code, not just task specs or design documents
|
|
298
|
+
- **Verify config/input formats** by reading the parsing code -- never assume formats from BDD specs, task descriptions, or documentation
|
|
204
299
|
- Include an error/negative TC only when it validates E2E-exclusive behavior (real CLI parser/runtime/tooling/filesystem) or when unit coverage has a documented gap
|
|
205
|
-
- Verify actual file paths by running the tool first
|
|
206
|
-
-
|
|
300
|
+
- Verify actual file paths by running the tool first -- never hardcode paths from documentation or assumptions
|
|
301
|
+
- Write runner goals as user outcomes, not “create a report” chores for the verifier
|
|
207
302
|
- Check specific exit codes for error commands (not just "non-zero")
|
|
303
|
+
- Make final sandbox state or real product output the primary oracle whenever possible
|
|
304
|
+
- Do not require runner-authored helper files under `results/tc/{NN}/`
|
|
305
|
+
- Add at least one behavioral/content assertion when CLI output itself is part of the outcome being tested
|
|
208
306
|
|
|
209
307
|
**SHOULD (strongly recommended):**
|
|
210
|
-
|
|
308
|
+
|
|
309
|
+
- Test the real user journey -- structure TCs as a sequential workflow, not isolated commands
|
|
211
310
|
- Verify exit codes for all commands, not just error cases
|
|
212
311
|
- Include negative assertions (files/directories that should NOT exist)
|
|
312
|
+
- Capture and retain command output for all assertions (`stdout`, `stderr`, and `*.exit`)
|
|
213
313
|
- Capture and check CLI output content, not just exit codes
|
|
214
314
|
- Verify that status values match actual implementation (e.g., `done` vs `completed`)
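The capture recommendations above can be sketched as a wrapper that stores `stdout`, `stderr`, and the exit code of one invocation. The function name `capture_command`, the `.stdout`/`.stderr` file suffixes, and the capture directory are assumptions for illustration; only the `*.exit` naming comes from the workflow text.

```bash
# Hypothetical helper: run a command once and keep its debug captures.
# Usage: capture_command OUT_DIR NAME COMMAND [ARGS...]
capture_command() {
  out_dir=$1
  name=$2
  shift 2
  mkdir -p "$out_dir"
  status=0
  # Run the command exactly once; every assertion reuses these captures.
  "$@" >"$out_dir/$name.stdout" 2>"$out_dir/$name.stderr" || status=$?
  printf '%s\n' "$status" >"$out_dir/$name.exit"
  return "$status"
}
```

A verifier-side check can then compare the recorded value against one specific expected code, for example `[ "$(cat "$CAPTURE_DIR/lint.exit")" = "2" ]`, rather than testing for "non-zero". Here `CAPTURE_DIR` and the expected code `2` are placeholders.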
|
|
215
315
|
|
|
216
316
|
**COST-AWARE (reduce LLM invocations):**
|
|
217
|
-
|
|
317
|
+
|
|
318
|
+
- Consolidate assertions that share the same CLI invocation into a single TC. For example, after running `ace-lint file.rb`, check exit code, report.json structure, and ok.md existence in ONE TC -- not three.
|
|
218
319
|
- Target 2-5 TCs per scenario. More than 5 suggests the scenario is too broad; split into focused scenarios. Fewer than 2 suggests merging with a related scenario.
|
|
219
320
|
- Never create a TC for a single assertion when that assertion could be appended to an existing TC that runs the same command.
|
|
220
321
|
|
|
221
322
|
#### Recommended TC Ordering
|
|
222
323
|
|
|
223
|
-
1. **Error paths first**
|
|
224
|
-
2. **Happy path start**
|
|
225
|
-
3. **Structure verification**
|
|
226
|
-
4. **Lifecycle operations**
|
|
227
|
-
5. **End state**
|
|
324
|
+
1. **Error paths first** -- wrong args, missing files, no prior state (run from clean state)
|
|
325
|
+
2. **Happy path start** -- create/init with correct args, verify output
|
|
326
|
+
3. **Structure verification** -- check actual on-disk file structure with negative assertions
|
|
327
|
+
4. **Lifecycle operations** -- status, advance, fail, retry in workflow order
|
|
328
|
+
5. **End state** -- verify completion message, all steps terminal
|
|
228
329
|
|
|
229
330
|
This ordering ensures error TCs run before any state is created (clean environment), and happy-path TCs build on each other sequentially.
|
|
230
331
|
|
|
@@ -235,6 +336,7 @@ See: **e2e-testing.g.md § "Avoiding False Positive Tests"** for the full list o
|
|
|
235
336
|
**E2E tests MUST test through the CLI interface, not library imports.**
|
|
236
337
|
|
|
237
338
|
**Valid approach:**
|
|
339
|
+
|
|
238
340
|
```bash
|
|
239
341
|
OUTPUT=$(ace-review --preset code --subject "diff:HEAD~1" --auto-execute 2>&1)
|
|
240
342
|
EXIT_CODE=$?
|
|
@@ -242,6 +344,7 @@ EXIT_CODE=$?
|
|
|
242
344
|
```
|
|
243
345
|
|
|
244
346
|
**Invalid approach (this is integration/unit testing, not E2E):**
|
|
347
|
+
|
|
245
348
|
```bash
|
|
246
349
|
bundle exec ruby -e '
|
|
247
350
|
require_relative "lib/ace/review"
|
|
@@ -250,6 +353,7 @@ bundle exec ruby -e '
|
|
|
250
353
|
```
|
|
251
354
|
|
|
252
355
|
**For execution tests (LLM, API calls):**
|
|
356
|
+
|
|
253
357
|
- Use `--auto-execute` to make real API calls
|
|
254
358
|
- Using only `--dry-run` cannot verify actual execution behavior
|
|
255
359
|
- Keep costs minimal: cheap models, tiny prompts, small diffs
|
|
@@ -257,22 +361,26 @@ bundle exec ruby -e '
|
|
|
257
361
|
#### Common Anti-Patterns to Avoid
|
|
258
362
|
|
|
259
363
|
**Writing tests from design specs before implementation:**
|
|
364
|
+
|
|
260
365
|
- Task descriptions and BDD specs often describe *intended* behavior with *proposed* config formats
|
|
261
366
|
- The actual implementation may use different formats, different commands, or different workflows
|
|
262
367
|
- Example: A spec might describe `jobs:` with explicit `number:` and `parent:` fields, but implementation uses `steps:` with auto-generated numbers and dynamic hierarchy via `add --after --child`
|
|
263
368
|
- **Fix:** Always read the actual implementation code (especially config parsing) before writing test data
|
|
264
369
|
|
|
265
370
|
**Assuming static vs dynamic behavior:**
|
|
371
|
+
|
|
266
372
|
- Tests may assume features work at config-time (static) when they actually work at runtime (dynamic)
|
|
267
373
|
- Example: Assuming hierarchy is defined in config when it's actually built dynamically via commands
|
|
268
374
|
- **Fix:** Trace the actual code path for the feature being tested
|
|
269
375
|
|
|
270
376
|
**Splitting one command into many redundant TCs:**
|
|
377
|
+
|
|
271
378
|
- Multiple TCs each validate one assertion after the same CLI invocation, creating overlap with unit tests and increasing run cost
|
|
272
379
|
- Example: TC-A checks exit code, TC-B checks report file, TC-C checks summary text for the same command run
|
|
273
380
|
- **Fix:** Consolidate those assertions into one TC and move formatter/parser details to unit tests
|
|
274
381
|
|
|
275
382
|
**Example for "Test config file validation":**
|
|
383
|
+
|
|
276
384
|
```markdown
|
|
277
385
|
## Test Cases
|
|
278
386
|
|
|
@@ -292,28 +400,33 @@ bundle exec ruby -e '
|
|
|
292
400
|
### 9. Write Test Files
|
|
293
401
|
|
|
294
402
|
Create the scenario directory with separate files:
|
|
403
|
+
|
|
295
404
|
```bash
|
|
296
405
|
mkdir -p {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}
|
|
297
406
|
```
|
|
298
407
|
|
|
299
408
|
Write `scenario.yml` (metadata and setup):
|
|
409
|
+
|
|
300
410
|
```
|
|
301
411
|
{PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/scenario.yml
|
|
302
412
|
```
|
|
303
413
|
|
|
304
414
|
Write scenario pair configs:
|
|
415
|
+
|
|
305
416
|
```
|
|
306
417
|
{PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/runner.yml.md
|
|
307
418
|
{PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/verifier.yml.md
|
|
308
419
|
```
|
|
309
420
|
|
|
310
421
|
Write individual TC runner/verifier files for each test case:
|
|
422
|
+
|
|
311
423
|
```
|
|
312
424
|
{PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/TC-001-{tc-slug}.runner.md
|
|
313
425
|
{PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/TC-001-{tc-slug}.verify.md
|
|
314
426
|
```
|
|
315
427
|
|
|
316
428
|
Optionally create a fixtures directory if test data is needed:
|
|
429
|
+
|
|
317
430
|
```bash
|
|
318
431
|
mkdir -p {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/fixtures
|
|
319
432
|
```
|
|
@@ -350,6 +463,7 @@ Output a summary:
|
|
|
350
463
|
## Example Invocations
|
|
351
464
|
|
|
352
465
|
**Create a test:**
|
|
466
|
+
|
|
353
467
|
```bash
|
|
354
468
|
ace-bundle wfi://e2e/create
|
|
355
469
|
```
|
|
@@ -357,6 +471,7 @@ ace-bundle wfi://e2e/create
|
|
|
357
471
|
Creates: `ace-lint/test/e2e/TS-LINT-003-new-test-scenario/` with `scenario.yml` and TC files.
|
|
358
472
|
|
|
359
473
|
**Create a contextual test:**
|
|
474
|
+
|
|
360
475
|
```bash
|
|
361
476
|
ace-bundle wfi://e2e/create
|
|
362
477
|
```
|
|
@@ -364,6 +479,7 @@ ace-bundle wfi://e2e/create
|
|
|
364
479
|
Creates: `ace-lint/test/e2e/TS-LINT-003-config-file-validation/` with `scenario.yml` and TC files for config validation.
|
|
365
480
|
|
|
366
481
|
**Create test for new area:**
|
|
482
|
+
|
|
367
483
|
```bash
|
|
368
484
|
ace-bundle wfi://e2e/create
|
|
369
485
|
```
|
|
@@ -392,4 +508,4 @@ Area codes must be:
|
|
|
392
508
|
- 2-10 characters
|
|
393
509
|
- Alphanumeric only
|
|
394
510
|
- Will be converted to uppercase
|
|
395
|
-
```
|
|
511
|
+
```
|
|
data/handbook/workflow-instructions/e2e/execute.wf.md
@@ -46,12 +46,15 @@ Tag filtering happens at discovery time (before `SetupExecutor` runs). By the ti
|
|
|
46
46
|
|
|
47
47
|
## Execution Contract
|
|
48
48
|
|
|
49
|
-
- Runner is execution-only: execute declared TC actions and
|
|
49
|
+
- Runner is execution-only: execute declared TC actions, leave only real outcome evidence under `results/tc/{NN}/`, and return final observations through the harness.
|
|
50
|
+
- Runner follows the public user path. Do not turn missing docs/help/CLI affordances into embedded workaround instructions.
|
|
50
51
|
- Verifier is verification-only: determine PASS/FAIL using impact-first ordering:
|
|
51
52
|
1. sandbox/project state impact
|
|
52
|
-
2.
|
|
53
|
-
3.
|
|
53
|
+
2. runner observations
|
|
54
|
+
3. explicit artifacts that are true product outcomes
|
|
55
|
+
4. debug captures (`stdout`/`stderr`/exit) as fallback
|
|
54
56
|
- Do not interpret setup ownership in runner TC files; setup is owned by `scenario.yml` + fixtures.
|
|
57
|
+
- Treat workaround pressure recorded in runner observations as a gap to fix, not as permission to strengthen the runner script.
|
|
55
58
|
|
|
56
59
|
## Dual-Agent Verifier
|
|
57
60
|
|
|
@@ -61,7 +64,7 @@ When `--verify` is passed (or always-on for CLI pipeline runs), execution follow
|
|
|
61
64
|
2. **Verifier agent** independently inspects the sandbox and artifacts against `TC-*.verify.md` expectations
|
|
62
65
|
3. **Report generator** (`PipelineReportGenerator`) produces deterministic summary from verifier output
|
|
63
66
|
|
|
64
|
-
The verifier has no access to the runner's conversation — it evaluates
|
|
67
|
+
The verifier has no access to the runner's conversation — it evaluates from sandbox evidence plus the structured runner observations persisted by the harness. This prevents self-confirmation bias while still surfacing execution context.
|
|
65
68
|
|
|
66
69
|
## Subagent Mode
|
|
67
70
|
|
|
@@ -75,6 +78,7 @@ When invoked as a subagent (via Task tool from orchestrator):
|
|
|
75
78
|
- **Failed**: {count}
|
|
76
79
|
- **Total**: {count}
|
|
77
80
|
- **Report Paths**: {timestamp}-{short-pkg}-{short-id}.*
|
|
81
|
+
- **Observations**: Brief factual summary or "None"
|
|
78
82
|
- **Issues**: Brief description or "None"
|
|
79
83
|
```
|
|
80
84
|
|
|
@@ -149,8 +153,8 @@ For each TC (TC-NNN):
|
|
|
149
153
|
|
|
150
154
|
1. **Check filter** — skip if `FILTERED_CASES` is set and TC not in list
|
|
151
155
|
2. **Read** the runner file objective
|
|
152
|
-
3. **Execute** runner steps, save artifacts to `results/tc/{NN}/`
|
|
153
|
-
4. **
|
|
156
|
+
3. **Execute** runner steps, save only real outcome artifacts to `results/tc/{NN}/`
|
|
157
|
+
4. **Return** factual runner observations through the harness
|
|
154
158
|
5. **Evaluate** against verifier expectations
|
|
155
159
|
6. **Record** Pass/Fail with per-TC evidence
|
|
156
160
|
|
|
@@ -250,4 +254,4 @@ Reports: `.ace-local/test-e2e/{timestamp}-{short-pkg}-{short-id}-reports/`
|
|
|
250
254
|
| TC fails | Record details, continue remaining TCs, include in report |
|
|
251
255
|
| Sandbox missing/corrupted | Report error, do NOT recreate, return error summary |
|
|
252
256
|
| TC filter mismatch | STOP, do not write reports, offer re-run |
|
|
253
|
-
| Missing TC pair file | Report error for that TC, skip it, continue others |
|
|
257
|
+
| Missing TC pair file | Report error for that TC, skip it, continue others |
|
|
data/handbook/workflow-instructions/e2e/fix.wf.md
@@ -1,23 +1,33 @@
|
|
|
1
1
|
---
|
|
2
|
+
name: e2e-fix
|
|
3
|
+
description: Diagnose, fix, and rerun failing E2E scenarios with a self-bootstrapping analysis loop.
|
|
4
|
+
allowed-tools:
|
|
5
|
+
- Bash(ace-bundle:*)
|
|
6
|
+
- Bash(ace-test:*)
|
|
7
|
+
- Read
|
|
8
|
+
- Write
|
|
9
|
+
- Edit
|
|
10
|
+
- Skill
|
|
2
11
|
doc-type: workflow
|
|
3
12
|
title: Fix E2E Tests Workflow
|
|
4
13
|
purpose: fix-e2e-tests workflow instruction
|
|
5
14
|
ace-docs:
|
|
6
|
-
last-updated: 2026-
|
|
7
|
-
last-checked: 2026-
|
|
15
|
+
last-updated: 2026-04-19
|
|
16
|
+
last-checked: 2026-04-19
|
|
8
17
|
---
|
|
9
18
|
|
|
10
19
|
# Fix E2E Tests Workflow
|
|
11
20
|
|
|
12
21
|
## Goal
|
|
13
22
|
|
|
14
|
-
|
|
23
|
+
Diagnose, fix, and rerun failing E2E scenarios with a single workflow entrypoint.
|
|
15
24
|
|
|
16
|
-
This workflow is
|
|
25
|
+
This workflow owns analysis readiness before any fix is applied. Reuse an existing analysis report when it is complete; otherwise generate or complete it via `wfi://e2e/analyze-failures`, then continue directly into the fix loop.
|
|
17
26
|
|
|
18
|
-
##
|
|
27
|
+
## Analysis Readiness Gate
|
|
28
|
+
|
|
29
|
+
Before any fix, ensure an analysis report exists with:
|
|
19
30
|
|
|
20
|
-
Do not apply any fix until an analysis report exists with:
|
|
21
31
|
- scenario / TC identifier
|
|
22
32
|
- category (`code-issue`, `test-issue`, `runner-infrastructure-issue`)
|
|
23
33
|
- evidence from reports/artifacts
|
|
@@ -26,22 +36,29 @@ Do not apply any fix until an analysis report exists with:
|
|
|
26
36
|
- primary candidate files
|
|
27
37
|
- do-not-touch boundaries
|
|
28
38
|
- rerun scope recommendation
|
|
39
|
+
- `Docs / Help Drift From E2E Failures` section with `Public Surface Checked`, `Drift Found`, and `Update Targets`
|
|
40
|
+
|
|
41
|
+
If analysis is missing or incomplete, generate or refresh it first:
|
|
29
42
|
|
|
30
|
-
If analysis is missing or incomplete, stop and run:
|
|
31
43
|
```bash
|
|
32
44
|
ace-bundle wfi://e2e/analyze-failures
|
|
33
45
|
```
|
|
34
46
|
|
|
47
|
+
Then continue this workflow using the resulting `E2E Failure Analysis Report`, `Fix Decisions`, and `Execution Plan Input` as the source of truth. Do not stop merely because analysis had to be generated.
|
|
48
|
+
|
|
35
49
|
## Required Input
|
|
36
50
|
|
|
37
|
-
Use the output section from `e2e/analyze-failures
|
|
51
|
+
Use the output section from `e2e/analyze-failures` when present, whether it was provided up front or generated by this workflow:
|
|
52
|
+
|
|
38
53
|
- `## E2E Failure Analysis Report`
|
|
54
|
+
- `## Docs / Help Drift From E2E Failures`
|
|
39
55
|
- `## Fix Decisions`
|
|
40
56
|
- `### Execution Plan Input`
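A rough readiness check for the four sections listed above might look like the following. The function name `analysis_ready` and the idea that the analysis is saved as a single Markdown file are assumptions. The workflow does not say where the report is stored.

```bash
# Hypothetical helper: succeed only if the analysis report contains
# every required section heading as an exact line.
analysis_ready() {
  report=$1
  [ -f "$report" ] || return 1
  for heading in \
    '## E2E Failure Analysis Report' \
    '## Docs / Help Drift From E2E Failures' \
    '## Fix Decisions' \
    '### Execution Plan Input'
  do
    # -x: whole-line match, -F: treat the heading as a fixed string
    grep -q -x -F -- "$heading" "$report" || {
      echo "missing section: $heading" >&2
      return 1
    }
  done
}
```

A caller could then run `analysis_ready "$REPORT" || ace-bundle wfi://e2e/analyze-failures`, where `REPORT` is a placeholder path. A heading check only shows that the sections exist; the fields inside them still need review.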
|
|
41
57
|
|
|
42
58
|
## Autonomy Rule
|
|
43
59
|
|
|
44
60
|
- Do not ask the user to choose fix target, category, or rerun scope.
|
|
61
|
+
- If analysis is missing, run `wfi://e2e/analyze-failures` yourself before fixing.
|
|
45
62
|
- If analysis is incomplete, auto-complete missing decision fields via local evidence (reports, artifacts, scenario files, implementation), then proceed.
|
|
46
63
|
- Only stop for hard blockers (missing files/tools/permissions).
|
|
47
64
|
|
|
@@ -61,27 +78,40 @@ Apply fixes in this order:
|
|
|
61
78
|
|
|
62
79
|
## Fix Procedure
|
|
63
80
|
|
|
64
|
-
1.
|
|
81
|
+
1. Establish or refresh analysis
|
|
82
|
+
|
|
83
|
+
- Check for a current analysis report that satisfies the Analysis Readiness Gate.
|
|
84
|
+
- If none exists, or if required fields are missing, including the docs/help drift section, run `ace-bundle wfi://e2e/analyze-failures`.
|
|
85
|
+
- Reuse the most recent valid analysis output as the source of truth for fix selection.
|
|
86
|
+
- Treat full-suite/package reruns and targeted scenario reruns as different scopes. Do not label a broader suite failure set as a regression in a previously fixed targeted scenario unless the same scenario fails again on a clean rerun.
|
|
87
|
+
|
|
88
|
+
2. Pick the first prioritized item from analysis
|
|
89
|
+
|
|
65
90
|
- Use the selected "First item to fix"
|
|
66
91
|
- Confirm category, fix target, and rerun scope
|
|
67
92
|
- Apply the "Chosen fix decision" and primary candidate files directly
|
|
68
93
|
|
|
69
|
-
|
|
94
|
+
3. Apply category-specific fix

  ### Category: runner-infrastructure-issue
+
  - Fix runner/sandbox/provider/reporting/orchestration behavior
  - Verify with runner tests when applicable: `ace-test ace-test-runner-e2e`

  ### Category: code-issue
+
  - Fix package/tool behavior in implementation code
  - Add/update unit tests if needed
+ - When the user job is valid but not achievable from docs/help/public CLI, apply the documented docs/help update target instead of codifying the workaround in the scenario

  ### Category: test-issue
+
  - Fix scenario definition, runner/verifier criteria, fixtures, or setup steps
  - Preserve role split: runner is execution-only, verifier is impact-first verdict
  - Keep implementation unchanged unless analysis is revised
+ - Remove hidden recipes, workaround branches, and unsupported internal-detail checks from goal-style TCs

-
+ 4. Rerun the selected failing scope after each fix

  After every implemented fix, rerun the analysis-selected failing scope before moving to the next item or recommending release.

@@ -96,25 +126,31 @@ ace-test-e2e {package}
  ```

  Rules:
+
  - Scenario rerun is the default after each fix iteration.
  - Use package rerun only when analysis explicitly selected package scope.
  - For multiple failing scenarios, rerun each scenario explicitly.
+
  ```text
  ace-test-e2e ace-assign TS-ASSIGN-001
  ace-test-e2e ace-assign TS-ASSIGN-002
  ace-test-e2e ace-bundle TS-BUNDLE-001
  ```
+
  - Record the rerun command and result in the execution summary for every fix item.
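
The "one explicit command per failing scenario" rule can be sketched as a small command builder. This is an illustration only: `rerun_commands` is a hypothetical helper, and it only formats strings in the `ace-test-e2e {package} {test-id}` shape used in this workflow; it does not invoke the CLI.

```python
def rerun_commands(failing: list[tuple[str, str]]) -> list[str]:
    """Build one explicit scenario rerun command per (package, test_id) pair.

    Order is preserved and duplicates are dropped so each scenario is
    rerun once, in line with the cost-conscious rerun rule.
    """
    seen: set[str] = set()
    commands: list[str] = []
    for package, test_id in failing:
        command = f"ace-test-e2e {package} {test_id}"
        if command not in seen:
            seen.add(command)
            commands.append(command)
    return commands


if __name__ == "__main__":
    failing = [
        ("ace-assign", "TS-ASSIGN-001"),
        ("ace-assign", "TS-ASSIGN-002"),
        ("ace-bundle", "TS-BUNDLE-001"),
    ]
    print("\n".join(rerun_commands(failing)))
```

Each generated string is what would be recorded in the "Verification Command" column of the execution summary.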

-
+ 5. Re-check classification when evidence conflicts
+
  - If outcome contradicts analysis, return to `e2e/analyze-failures`
  - Update analysis report and re-select a new autonomous chosen fix decision before continuing
+ - If a suite/package report conflicts with a scenario report, the scenario report wins and the aggregate mismatch must be fixed or explicitly tracked before relying on suite-level TC mappings.
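
The precedence rule in the last bullet (the scenario report wins over the suite/package report) can be sketched as follows. `resolve_result` is a hypothetical helper, and falling back to the suite result when no scenario report exists is an assumption the workflow does not state.

```python
from typing import Optional


def resolve_result(scenario_result: Optional[str],
                   suite_result: Optional[str]) -> tuple[Optional[str], bool]:
    """Return (effective_result, mismatch) for one scenario.

    The scenario-level report takes precedence. `mismatch` is True when
    both reports exist and disagree, signalling that the aggregate needs
    to be fixed or explicitly tracked.
    """
    if scenario_result is None:
        # Assumption: with no scenario report, the suite result is all we have.
        return suite_result, False
    mismatch = suite_result is not None and suite_result != scenario_result
    return scenario_result, mismatch


if __name__ == "__main__":
    print(resolve_result("pass", "fail"))
```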
+
+ 6. Iterate until all targeted failures are resolved

- 5. Iterate until all targeted failures are resolved
  - Keep one active scenario/TC at a time
  - Preserve cost-conscious rerun discipline

-
+ 7. Run a final explicit failing-scenario checkpoint before concluding the fix session

  After the currently targeted failures are addressed, require one final:

@@ -136,12 +172,25 @@ Use one explicit command per previously failing scenario to confirm no targeted
  ```markdown
  ## E2E Fix Execution Summary

+ Analysis Source: reused existing analysis | generated via `wfi://e2e/analyze-failures` | refreshed incomplete analysis
+
  | Scenario / TC | Category | Change Applied | Verification Command | Result |
  |---|---|---|---|---|
  | ... | ... | ... | ... | pass/fail |
  ```

+ If the analysis reported docs/help drift, include:
+
+ ```markdown
+ ## Docs / Help Updates
+
+ | Scenario / TC | Public Surface Updated | Why |
+ |---|---|---|
+ | ... | docs/usage.md, CLI --help | E2E failure showed the valid user job was not discoverable |
+ ```
+
  Include one final row for the batch checkpoint:
+
  - Verification Command: one explicit rerun command per remaining failed scenario (`ace-test-e2e {package} {test-id}`)
  - Result: `pass` or remaining failing scenarios
  - If failures remain, continue the fix loop instead of treating the session as complete
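
The execution summary template above can be filled programmatically. The sketch below is illustrative only: `FixRow` and `render_summary` are hypothetical names, the sample values are invented for the example, and the column layout is copied from the template in this hunk.

```python
from dataclasses import dataclass


@dataclass
class FixRow:
    """One row of the execution summary table (hypothetical structure)."""
    scenario: str
    category: str
    change: str
    command: str
    result: str


def render_summary(analysis_source: str, rows: list[FixRow]) -> str:
    """Render the summary in the Markdown layout shown in the template."""
    lines = [
        "## E2E Fix Execution Summary",
        "",
        f"Analysis Source: {analysis_source}",
        "",
        "| Scenario / TC | Category | Change Applied | Verification Command | Result |",
        "|---|---|---|---|---|",
    ]
    for row in rows:
        lines.append(
            f"| {row.scenario} | {row.category} | {row.change} "
            f"| {row.command} | {row.result} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Sample values are made up for illustration.
    rows = [FixRow("TS-ASSIGN-001", "test-issue", "Adjusted verifier criteria",
                   "ace-test-e2e ace-assign TS-ASSIGN-001", "pass")]
    print(render_summary("reused existing analysis", rows))
```

The final batch-checkpoint row would be appended as one more `FixRow` whose result is `pass` or the list of scenarios still failing.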
@@ -160,7 +209,8 @@ If unresolved:

  - Fixes are traceable to analyzed failures
  - Verification scope matches analysis recommendation, including mandatory reruns after each fix
+ - Any docs/help drift from analysis is fixed or explicitly carried as an unresolved blocker
  - Cost-conscious rerun strategy was followed
  - Final explicit per-scenario rerun checkpoint for all targeted failures was completed before concluding the fix session
  - No user clarification was required for fix targeting/scope in normal flow
- - Targeted failures pass, or blockers are explicitly documented
+ - Targeted failures pass, or blockers are explicitly documented