ace-test-runner-e2e 0.29.8 → 0.38.11
This diff shows the changes between publicly released package versions as they appear in their respective public registries. It is provided for informational purposes only.
- checksums.yaml +4 -4
- data/.ace-defaults/e2e-runner/config.yml +14 -2
- data/CHANGELOG.md +178 -0
- data/README.md +2 -2
- data/exe/ace-test-e2e-sh +9 -4
- data/handbook/guides/e2e-testing.g.md +43 -9
- data/handbook/guides/scenario-yml-reference.g.md +16 -8
- data/handbook/guides/tc-authoring.g.md +12 -5
- data/handbook/skills/as-e2e-fix/SKILL.md +2 -2
- data/handbook/skills/as-e2e-review/SKILL.md +2 -2
- data/handbook/templates/ace-taskflow-fixture.template.md +17 -17
- data/handbook/templates/agent-experience-report.template.md +3 -2
- data/handbook/templates/scenario.yml.template.yml +7 -2
- data/handbook/templates/tc-file.template.md +14 -4
- data/handbook/workflow-instructions/e2e/analyze-failures.wf.md +53 -6
- data/handbook/workflow-instructions/e2e/create.wf.md +118 -25
- data/handbook/workflow-instructions/e2e/execute.wf.md +11 -7
- data/handbook/workflow-instructions/e2e/fix.wf.md +65 -15
- data/handbook/workflow-instructions/e2e/plan-changes.wf.md +17 -1
- data/handbook/workflow-instructions/e2e/review.wf.md +36 -25
- data/handbook/workflow-instructions/e2e/rewrite.wf.md +15 -8
- data/handbook/workflow-instructions/e2e/run.wf.md +50 -26
- data/handbook/workflow-instructions/e2e/setup-sandbox.wf.md +4 -4
- data/lib/ace/test/end_to_end_runner/atoms/skill_prompt_builder.rb +7 -5
- data/lib/ace/test/end_to_end_runner/atoms/skill_result_parser.rb +73 -7
- data/lib/ace/test/end_to_end_runner/cli/commands/run_test.rb +21 -8
- data/lib/ace/test/end_to_end_runner/models/test_case.rb +8 -2
- data/lib/ace/test/end_to_end_runner/models/test_result.rb +9 -3
- data/lib/ace/test/end_to_end_runner/models/test_scenario.rb +4 -2
- data/lib/ace/test/end_to_end_runner/molecules/affected_detector.rb +7 -2
- data/lib/ace/test/end_to_end_runner/molecules/bwrap_sandbox_backend.rb +271 -0
- data/lib/ace/test/end_to_end_runner/molecules/config_loader.rb +28 -1
- data/lib/ace/test/end_to_end_runner/molecules/integration_runner.rb +122 -0
- data/lib/ace/test/end_to_end_runner/molecules/pipeline_executor.rb +157 -16
- data/lib/ace/test/end_to_end_runner/molecules/pipeline_prompt_bundler.rb +121 -8
- data/lib/ace/test/end_to_end_runner/molecules/pipeline_report_generator.rb +91 -19
- data/lib/ace/test/end_to_end_runner/molecules/pipeline_sandbox_builder.rb +119 -18
- data/lib/ace/test/end_to_end_runner/molecules/report_writer.rb +13 -12
- data/lib/ace/test/end_to_end_runner/molecules/sandbox_runtime_builder.rb +282 -0
- data/lib/ace/test/end_to_end_runner/molecules/scenario_loader.rb +85 -5
- data/lib/ace/test/end_to_end_runner/molecules/setup_executor.rb +98 -16
- data/lib/ace/test/end_to_end_runner/molecules/suite_report_writer.rb +241 -97
- data/lib/ace/test/end_to_end_runner/molecules/test_discoverer.rb +38 -13
- data/lib/ace/test/end_to_end_runner/molecules/test_executor.rb +27 -5
- data/lib/ace/test/end_to_end_runner/organisms/suite_orchestrator.rb +73 -15
- data/lib/ace/test/end_to_end_runner/organisms/test_orchestrator.rb +120 -19
- data/lib/ace/test/end_to_end_runner/version.rb +1 -1
- data/lib/ace/test/end_to_end_runner.rb +2 -0
- metadata +19 -2
|
@@ -1,4 +1,12 @@
|
|
|
1
1
|
---
|
|
2
|
+
name: e2e-create
|
|
3
|
+
description: Create a new E2E test scenario from template
|
|
4
|
+
allowed-tools:
|
|
5
|
+
- Bash(ace-bundle:*)
|
|
6
|
+
- Read
|
|
7
|
+
- Write
|
|
8
|
+
- Glob
|
|
9
|
+
- Grep
|
|
2
10
|
doc-type: workflow
|
|
3
11
|
title: Create E2E Test Workflow
|
|
4
12
|
purpose: Create a new E2E test scenario from template
|
|
@@ -23,35 +31,48 @@ This workflow guides an agent through creating a new E2E test scenario.
|
|
|
23
31
|
- Scenario ID format: `TS-<PACKAGE_SHORT>-<NNN>[-slug]`
|
|
24
32
|
- Standalone files: `TC-*.runner.md` and `TC-*.verify.md`
|
|
25
33
|
- TC artifact layout: `results/tc/{NN}/`
|
|
34
|
+
- Runner observations are harness-managed report data, not sandbox helper files
|
|
26
35
|
- Summary counters: `tcs-passed`, `tcs-failed`, `tcs-total`, `failed[].tc`
|
|
27
36
|
- CLI split reminder:
|
|
37
|
+
|
|
28
38
|
- `ace-test-e2e` for single-package execution
|
|
29
39
|
- `ace-test-e2e-suite` for suite-level execution
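
The scenario ID format listed above (`TS-<PACKAGE_SHORT>-<NNN>[-slug]`) can be checked mechanically. The following is a minimal sketch, not code from the gem: `SCENARIO_ID` and `parse_scenario_id` are hypothetical names, and the pattern assumes an uppercase alphanumeric short name of 2-10 characters, a three-digit number, and a kebab-case slug.

```ruby
# Hypothetical helper, not part of ace-test-runner-e2e.
# Assumes: uppercase alphanumeric area (2-10 chars), three-digit number,
# optional kebab-case slug.
SCENARIO_ID = /\ATS-(?<area>[A-Z0-9]{2,10})-(?<number>\d{3})(?:-(?<slug>[a-z0-9]+(?:-[a-z0-9]+)*))?\z/

# Returns a hash with the ID parts, or nil when the ID does not match.
def parse_scenario_id(id)
  match = SCENARIO_ID.match(id)
  return nil unless match

  { area: match[:area], number: match[:number].to_i, slug: match[:slug] }
end
```

A caller could reject a directory name early when `parse_scenario_id` returns `nil`, before any files are written.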
|
|
30
40
|
|
|
31
41
|
## Authoring Contract
|
|
32
42
|
|
|
33
43
|
- Runner files (`runner.yml.md`, `TC-*.runner.md`) are execution-only.
|
|
44
|
+
- Goal-style TCs must prove two things:
|
|
45
|
+
- the tool works
|
|
46
|
+
- a user can do the job from the public surface (`README`, usage docs, `--help`, and the CLI itself) without hidden recipes or workarounds
|
|
34
47
|
- Verifier files (`verifier.yml.md`, `TC-*.verify.md`) are verdict-only with impact-first evidence order:
|
|
48
|
+
|
|
35
49
|
1. sandbox/project state impact
|
|
36
|
-
2.
|
|
37
|
-
3.
|
|
50
|
+
2. runner observations
|
|
51
|
+
3. explicit product outcomes
|
|
52
|
+
4. debug captures as fallback
|
|
53
|
+
|
|
38
54
|
- Setup belongs to `scenario.yml` `setup:` and fixtures; do not duplicate setup in runner TC instructions.
|
|
55
|
+
- Keep `results/tc/{NN}/` for real outcome artifacts only; do not ask the runner to write helper YAML, path files, command files, reflections, or verifier-facing manifests there.
|
|
56
|
+
- Do not encode hidden command recipes, fallback detours, or workaround sequences in runner TC files. If the job cannot be done from the public surface, treat that as a product/docs/help gap or remove/narrow the TC.
|
|
39
57
|
|
|
40
58
|
## Workflow Steps
|
|
41
59
|
|
|
42
60
|
### 1. Validate Inputs
|
|
43
61
|
|
|
44
62
|
**Check package exists:**
|
|
63
|
+
|
|
45
64
|
```bash
|
|
46
65
|
test -d "{PACKAGE}" && echo "Package exists" || echo "Package not found"
|
|
47
66
|
```
|
|
48
67
|
|
|
49
68
|
If the package doesn't exist, list the available packages:
|
|
69
|
+
|
|
50
70
|
```bash
|
|
51
71
|
ls -d */ | grep -E "^ace-" | sed 's/\/$//'
|
|
52
72
|
```
|
|
53
73
|
|
|
54
74
|
**Normalize area code:**
|
|
75
|
+
|
|
55
76
|
- Convert to uppercase (e.g., `lint` -> `LINT`)
|
|
56
77
|
- Verify it's a valid area name (2-10 alphanumeric characters)
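
As an illustration of the normalization rule above, here is a small sketch. `normalize_area` is a hypothetical helper, and raising `ArgumentError` on invalid input is an assumption; the workflow itself only states the uppercase and 2-10 alphanumeric requirements.

```ruby
# Hypothetical helper: upcases an area code and enforces the
# 2-10 alphanumeric character rule described above.
def normalize_area(raw)
  area = raw.to_s.strip.upcase
  unless area.match?(/\A[A-Z0-9]{2,10}\z/)
    raise ArgumentError, "invalid area code: #{raw.inspect}"
  end

  area
end
```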
|
|
57
78
|
|
|
@@ -66,6 +87,7 @@ find {PACKAGE}/test/e2e -maxdepth 1 -type d -name "TS-{AREA}-*" 2>/dev/null | \
|
|
|
66
87
|
```
|
|
67
88
|
|
|
68
89
|
Sort and take the highest number:
|
|
90
|
+
|
|
69
91
|
- If no existing tests: use `001`
|
|
70
92
|
- Otherwise: increment the highest number by 1
|
|
71
93
|
- Format as three digits (e.g., `001`, `002`, `015`)
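
The numbering rule above could be sketched as follows. `next_scenario_number` is a hypothetical helper that takes directory basenames (for example, the output of the `find` command passed through `File.basename`); it is not taken from the package.

```ruby
# Hypothetical helper: given existing scenario directory names for one area,
# return the next three-digit sequence number as a string.
def next_scenario_number(dir_names, area)
  numbers = dir_names.filter_map do |name|
    m = name.match(/\ATS-#{Regexp.escape(area)}-(\d{3})(?:-|\z)/)
    m && m[1].to_i
  end
  # With no existing scenarios, max is nil and numbering starts at 001.
  format("%03d", (numbers.max || 0) + 1)
end
```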
|
|
@@ -85,12 +107,14 @@ mkdir -p {PACKAGE}/test/e2e
|
|
|
85
107
|
Create a kebab-case slug:
|
|
86
108
|
|
|
87
109
|
**If --context provided:**
|
|
110
|
+
|
|
88
111
|
- Extract key words from the context description
|
|
89
112
|
- Convert to lowercase
|
|
90
113
|
- Replace spaces with hyphens
|
|
91
114
|
- Limit to 5-6 words
|
|
92
115
|
|
|
93
116
|
**If no context:**
|
|
117
|
+
|
|
94
118
|
- Use a placeholder: `new-test-scenario`
|
|
95
119
|
|
|
96
120
|
Example: "Test config file validation" -> `config-file-validation`
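
A rough sketch of the slug steps above. Keyword extraction is a judgment call in the workflow, so the `STOP_WORDS` list here is purely an assumption chosen to reproduce the example; `scenario_slug` is a hypothetical name.

```ruby
# Hypothetical stop words; the workflow leaves keyword selection to the agent.
STOP_WORDS = %w[a an the test tests that for of to].freeze

# Builds a kebab-case slug from a context description, falling back to the
# documented placeholder when no usable words remain.
def scenario_slug(context, max_words: 6)
  return "new-test-scenario" if context.nil? || context.strip.empty?

  words = context.downcase.scan(/[a-z0-9]+/) - STOP_WORDS
  return "new-test-scenario" if words.empty?

  words.first(max_words).join("-")
end
```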
|
|
@@ -100,11 +124,13 @@ The slug is the directory name suffix: `TS-LINT-003-config-file-validation/`
|
|
|
100
124
|
### 5. Load Template
|
|
101
125
|
|
|
102
126
|
Load the test template:
|
|
127
|
+
|
|
103
128
|
```bash
|
|
104
129
|
ace-bundle tmpl://test-e2e
|
|
105
130
|
```
|
|
106
131
|
|
|
107
132
|
Or read directly:
|
|
133
|
+
|
|
108
134
|
```
|
|
109
135
|
ace-test-runner-e2e/handbook/templates/test-e2e.template.md
|
|
110
136
|
```
|
|
@@ -123,6 +149,7 @@ Replace template placeholders with actual values:
|
|
|
123
149
|
| `{area-name}` | Area code (lowercase) |
|
|
124
150
|
|
|
125
151
|
Initial values for optional fields:
|
|
152
|
+
|
|
126
153
|
- `priority: medium`
|
|
127
154
|
- `duration: ~10min`
|
|
128
155
|
- `automation-candidate: false`
|
|
@@ -138,9 +165,10 @@ Initial values for optional fields:
|
|
|
138
165
|
Before generating test cases, verify the proposed test has genuine E2E value.
|
|
139
166
|
|
|
140
167
|
**Check unit test coverage:**
|
|
168
|
+
|
|
141
169
|
```bash
|
|
142
170
|
# Search for existing unit tests covering this area
|
|
143
|
-
find {PACKAGE}/test/
|
|
171
|
+
find {PACKAGE}/test/fast {PACKAGE}/test/feat \
|
|
144
172
|
-name "*_test.rb" 2>/dev/null | head -20
|
|
145
173
|
```
|
|
146
174
|
|
|
@@ -154,47 +182,91 @@ For each proposed TC, answer: **"Does this require the full CLI binary + real ex
|
|
|
154
182
|
- If **PARTIAL**: create the TC but scope it to only the E2E-exclusive aspects
|
|
155
183
|
|
|
156
184
|
**Example decisions:**
|
|
157
|
-
|
|
158
|
-
- "Test that
|
|
185
|
+
|
|
186
|
+
- "Test that invalid YAML config produces error" -- check if `atoms/config_parser_test.rb` already asserts this. If so, **skip** (unit test covers it). If unit test checks parsing but not the full CLI exit code path, **create** a TC scoped to just the exit code.
|
|
187
|
+
- "Test that StandardRB subprocess executes and returns results" -- unit tests stub the subprocess. **Create** this as E2E because it requires the real tool.
|
|
159
188
|
|
|
160
189
|
If all proposed TCs fail the gate, report to the user:
|
|
190
|
+
|
|
161
191
|
```
|
|
162
192
|
All proposed behaviors are already covered by unit tests in {PACKAGE}/test/.
|
|
163
193
|
No E2E test needed. Consider adding unit tests instead if coverage gaps exist.
|
|
164
194
|
```
|
|
165
195
|
|
|
166
|
-
### 7a.
|
|
196
|
+
### 7a. Public-Surface Gate
|
|
197
|
+
|
|
198
|
+
Before generating or keeping a goal-style TC, answer:
|
|
199
|
+
**"Can a normal user complete this job from the package's public surface, without hidden recipes or workarounds?"**
|
|
200
|
+
|
|
201
|
+
Public surface means:
|
|
202
|
+
- package README / usage docs
|
|
203
|
+
- `--help`
|
|
204
|
+
- declared fixtures and `scenario.yml` setup
|
|
205
|
+
- the tool under test itself
|
|
206
|
+
|
|
207
|
+
Reject or narrow the TC if it depends on:
|
|
208
|
+
- step-by-step runner procedures a user would not infer from docs/help
|
|
209
|
+
- workaround branches to compensate for CLI/docs/help gaps
|
|
210
|
+
- direct supporting-tool probes as the primary oracle for an ACE CLI scenario
|
|
211
|
+
- internal-state checks that the public surface does not expose and that do not matter to the user job
|
|
212
|
+
|
|
213
|
+
### 7b. Evidence-Gate Review Before Writing Files
|
|
167
214
|
|
|
168
215
|
Before finalizing the test plan, block weak coverage patterns:
|
|
216
|
+
|
|
169
217
|
- **Existence-only TC**:
|
|
218
|
+
|
|
170
219
|
- only checks directory/file existence
|
|
171
220
|
- no command output/content assertion
|
|
172
221
|
- missing `*.exit` capture for the executed command
|
|
222
|
+
|
|
173
223
|
- **Duplicate-invocation TC**:
|
|
224
|
+
|
|
174
225
|
- same command invocation, same purpose, split across multiple TCs
|
|
175
226
|
|
|
227
|
+
- **Helper-artifact-driven TC**:
|
|
228
|
+
|
|
229
|
+
- runner is instructed to create YAML/TXT/MD helper files in `results/tc/{NN}/`
|
|
230
|
+
- verifier depends on those helper files instead of final sandbox state or real product output
|
|
231
|
+
|
|
232
|
+
- **Hidden-recipe-driven TC**:
|
|
233
|
+
|
|
234
|
+
- the runner must follow a command sequence not discoverable from docs/usage/`--help`
|
|
235
|
+
- the TC succeeds only because the scenario teaches an internal or non-obvious workaround
|
|
236
|
+
|
|
237
|
+
- **Workaround-driven TC**:
|
|
238
|
+
|
|
239
|
+
- the runner is told how to bypass a docs/help/CLI gap instead of surfacing it
|
|
240
|
+
- the verifier would pass a scenario that a normal user could not complete cleanly
|
|
241
|
+
|
|
176
242
|
| TC ID | Decision (KEEP/ADD/SKIP) | Evidence Strength | E2E-only reason | Unit tests reviewed |
|
|
177
243
|
|-------|---------------------------|------------------|-----------------|--------------------|
|
|
178
244
|
| {tc-id} | {decision} | `command-output` | {why this needs real CLI/tools/fs} | {path1,path2} |
|
|
179
245
|
|
|
180
246
|
Rules:
|
|
247
|
+
|
|
181
248
|
- `existence-only` is never valid for KEEP/ADD. Use it only for SKIP rows with explicit unit-test replacement.
|
|
249
|
+
- `helper-artifact-driven` is never valid for KEEP/ADD when final sandbox state could prove the goal directly.
|
|
250
|
+
- `hidden-recipe-driven` and `workaround-driven` are never valid for KEEP/ADD.
|
|
182
251
|
- `SKIP` rows must include replacement unit-test evidence.
|
|
183
|
-
- Non-skipped rows must
|
|
252
|
+
- Non-skipped rows must identify the primary oracle for the TC: final sandbox state, real product output, or debug fallback.
|
|
253
|
+
- Non-skipped rows must state why the job is achievable from the public surface without hidden recipes.
|
|
184
254
|
- At least one `unit tests reviewed` path is required for every row.
|
|
185
255
|
- The scenario-level `unit-coverage-reviewed` field must include the union of all referenced unit test files.
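
The evidence-gate rules above lend themselves to a mechanical check. This is a simplified sketch under stated assumptions: the `Row` struct and `gate_errors` are hypothetical, the workflow defines a Markdown table rather than a data format, and the sketch treats `helper-artifact-driven` as always blocked for KEEP/ADD, ignoring the "when final sandbox state could prove the goal directly" qualifier.

```ruby
# Hypothetical row shape; the workflow defines a table, not a data format.
Row = Struct.new(:tc_id, :decision, :evidence, :unit_tests, keyword_init: true)

# Evidence strengths that the rules above disallow for KEEP/ADD rows.
BLOCKED_FOR_KEEP = %w[
  existence-only
  helper-artifact-driven
  hidden-recipe-driven
  workaround-driven
].freeze

# Returns a list of rule violations for one decision row (empty when valid).
def gate_errors(row)
  errors = []
  errors << "no unit tests reviewed" if row.unit_tests.empty?
  if %w[KEEP ADD].include?(row.decision) && BLOCKED_FOR_KEEP.include?(row.evidence)
    errors << "#{row.evidence} is not valid for #{row.decision}"
  end
  errors
end
```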
|
|
186
256
|
|
|
187
|
-
###
|
|
257
|
+
### 7c. E2E Decision Record (Required)
|
|
188
258
|
|
|
189
259
|
Before writing files, produce a decision record table for every candidate TC:
|
|
190
260
|
|
|
191
|
-
| TC ID | Decision (KEEP/ADD/SKIP) | E2E-only reason | Unit tests reviewed |
|
|
192
|
-
|
|
193
|
-
| {tc-id} | {decision} | {why this needs real CLI/tools/fs} | {path1,path2} |
|
|
261
|
+
| TC ID | Decision (KEEP/ADD/SKIP) | E2E-only reason | Public-surface path | Unit tests reviewed |
|
|
262
|
+
|-------|---------------------------|-----------------|---------------------|---------------------|
|
|
263
|
+
| {tc-id} | {decision} | {why this needs real CLI/tools/fs} | {docs/help/CLI path or "not valid"} | {path1,path2} |
|
|
194
264
|
|
|
195
265
|
Rules:
|
|
266
|
+
|
|
196
267
|
- No TC may be created without a row in this table.
|
|
197
268
|
- If decision is `SKIP`, include the unit-test evidence that replaces it.
|
|
269
|
+
- If the public-surface path is missing or workaround-driven, the TC must be `SKIP` or explicitly planned as a product/docs/help improvement before creation.
|
|
198
270
|
- At least one `unit tests reviewed` path is required for each row.
|
|
199
271
|
- The scenario-level `unit-coverage-reviewed` field must include the union of all referenced unit test files.
|
|
200
272
|
|
|
@@ -203,12 +275,13 @@ Rules:
|
|
|
203
275
|
If a context description was provided, enhance the test with:
|
|
204
276
|
|
|
205
277
|
**Research the package:**
|
|
206
|
-
1. **Run unit tests first** (`ace-test` in the package)
|
|
278
|
+
1. **Run unit tests first** (`ace-test` in the package) -- they are the ground truth for implemented behavior
|
|
207
279
|
2. Examine the relevant code in `{PACKAGE}/lib/`
|
|
208
280
|
3. Check existing unit tests for expected behavior patterns
|
|
209
281
|
4. Understand the feature being tested
|
|
210
282
|
5. **Run the tool** to observe actual behavior, output format, file paths, and exit codes
|
|
211
|
-
6. **Verify config/input formats** by reading the actual parsing code
|
|
283
|
+
6. **Verify config/input formats** by reading the actual parsing code -- never assume formats from design specs or task descriptions
|
|
284
|
+
7. **Compare with the public surface** -- verify the intended user path is actually supported by docs/help, and do not compensate for gaps with hidden runner instructions
|
|
212
285
|
|
|
213
286
|
**Generate test content:**
|
|
214
287
|
1. Write a clear objective based on the context
|
|
@@ -220,16 +293,20 @@ If a context description was provided, enhance the test with:
|
|
|
220
293
|
#### Test Case Generation Rules
|
|
221
294
|
|
|
222
295
|
**MUST (required for all E2E tests):**
|
|
223
|
-
|
|
224
|
-
- **Verify
|
|
296
|
+
|
|
297
|
+
- **Verify the feature is implemented** before writing the test -- read the actual implementation code, not just task specs or design documents
|
|
298
|
+
- **Verify config/input formats** by reading the parsing code -- never assume formats from BDD specs, task descriptions, or documentation
|
|
225
299
|
- Include an error/negative TC only when it validates E2E-exclusive behavior (real CLI parser/runtime/tooling/filesystem) or when unit coverage has a documented gap
|
|
226
|
-
- Verify actual file paths by running the tool first
|
|
227
|
-
-
|
|
300
|
+
- Verify actual file paths by running the tool first -- never hardcode paths from documentation or assumptions
|
|
301
|
+
- Write runner goals as user outcomes, not “create a report” chores for the verifier
|
|
228
302
|
- Check specific exit codes for error commands (not just "non-zero")
|
|
229
|
-
-
|
|
303
|
+
- Make final sandbox state or real product output the primary oracle whenever possible
|
|
304
|
+
- Do not require runner-authored helper files under `results/tc/{NN}/`
|
|
305
|
+
- Add at least one behavioral/content assertion when CLI output itself is part of the outcome being tested
|
|
230
306
|
|
|
231
307
|
**SHOULD (strongly recommended):**
|
|
232
|
-
|
|
308
|
+
|
|
309
|
+
- Test the real user journey -- structure TCs as a sequential workflow, not isolated commands
|
|
233
310
|
- Verify exit codes for all commands, not just error cases
|
|
234
311
|
- Include negative assertions (files/directories that should NOT exist)
|
|
235
312
|
- Capture and retain command output for all assertions (`stdout`, `stderr`, and `*.exit`)
|
|
@@ -237,17 +314,18 @@ If a context description was provided, enhance the test with:
|
|
|
237
314
|
- Verify that status values match actual implementation (e.g., `done` vs `completed`)
|
|
238
315
|
|
|
239
316
|
**COST-AWARE (reduce LLM invocations):**
|
|
240
|
-
|
|
317
|
+
|
|
318
|
+
- Consolidate assertions that share the same CLI invocation into a single TC. For example, after running `ace-lint file.rb`, check exit code, report.json structure, and ok.md existence in ONE TC -- not three.
|
|
241
319
|
- Target 2-5 TCs per scenario. More than 5 suggests the scenario is too broad; split into focused scenarios. Fewer than 2 suggests merging with a related scenario.
|
|
242
320
|
- Never create a TC for a single assertion when that assertion could be appended to an existing TC that runs the same command.
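
To illustrate the consolidation rule above, the sketch below groups planned assertions by the CLI invocation they depend on, so each command yields one TC. The hash shape (`:command`, `:check`) and the `consolidate` helper are assumptions made for this example only.

```ruby
# Hypothetical helper: merges assertions that share a CLI invocation
# into a single TC entry per command.
def consolidate(assertions)
  assertions
    .group_by { |assertion| assertion[:command] }
    .map { |command, group| { command: command, checks: group.map { |a| a[:check] } } }
end
```

Applied to the `ace-lint file.rb` example, three separate assertions (exit code, `report.json` structure, `ok.md` existence) collapse into one TC with three checks.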
|
|
243
321
|
|
|
244
322
|
#### Recommended TC Ordering
|
|
245
323
|
|
|
246
|
-
1. **Error paths first**
|
|
247
|
-
2. **Happy path start**
|
|
248
|
-
3. **Structure verification**
|
|
249
|
-
4. **Lifecycle operations**
|
|
250
|
-
5. **End state**
|
|
324
|
+
1. **Error paths first** -- wrong args, missing files, no prior state (run from clean state)
|
|
325
|
+
2. **Happy path start** -- create/init with correct args, verify output
|
|
326
|
+
3. **Structure verification** -- check actual on-disk file structure with negative assertions
|
|
327
|
+
4. **Lifecycle operations** -- status, advance, fail, retry in workflow order
|
|
328
|
+
5. **End state** -- verify completion message, all steps terminal
|
|
251
329
|
|
|
252
330
|
This ordering ensures error TCs run before any state is created (clean environment), and happy-path TCs build on each other sequentially.
|
|
253
331
|
|
|
@@ -258,6 +336,7 @@ See: **e2e-testing.g.md § "Avoiding False Positive Tests"** for the full list o
|
|
|
258
336
|
**E2E tests MUST test through the CLI interface, not library imports.**
|
|
259
337
|
|
|
260
338
|
**Valid approach:**
|
|
339
|
+
|
|
261
340
|
```bash
|
|
262
341
|
OUTPUT=$(ace-review --preset code --subject "diff:HEAD~1" --auto-execute 2>&1)
|
|
263
342
|
EXIT_CODE=$?
|
|
@@ -265,6 +344,7 @@ EXIT_CODE=$?
|
|
|
265
344
|
```
|
|
266
345
|
|
|
267
346
|
**Invalid approach (this is integration/unit testing, not E2E):**
|
|
347
|
+
|
|
268
348
|
```bash
|
|
269
349
|
bundle exec ruby -e '
|
|
270
350
|
require_relative "lib/ace/review"
|
|
@@ -273,6 +353,7 @@ bundle exec ruby -e '
|
|
|
273
353
|
```
|
|
274
354
|
|
|
275
355
|
**For execution tests (LLM, API calls):**
|
|
356
|
+
|
|
276
357
|
- Use `--auto-execute` to make real API calls
|
|
277
358
|
- Using only `--dry-run` cannot verify actual execution behavior
|
|
278
359
|
- Keep costs minimal: cheap models, tiny prompts, small diffs
|
|
@@ -280,22 +361,26 @@ bundle exec ruby -e '
|
|
|
280
361
|
#### Common Anti-Patterns to Avoid
|
|
281
362
|
|
|
282
363
|
**Writing tests from design specs before implementation:**
|
|
364
|
+
|
|
283
365
|
- Task descriptions and BDD specs often describe *intended* behavior with *proposed* config formats
|
|
284
366
|
- The actual implementation may use different formats, different commands, or different workflows
|
|
285
367
|
- Example: A spec might describe `jobs:` with explicit `number:` and `parent:` fields, but implementation uses `steps:` with auto-generated numbers and dynamic hierarchy via `add --after --child`
|
|
286
368
|
- **Fix:** Always read the actual implementation code (especially config parsing) before writing test data
|
|
287
369
|
|
|
288
370
|
**Assuming static vs dynamic behavior:**
|
|
371
|
+
|
|
289
372
|
- Tests may assume features work at config-time (static) when they actually work at runtime (dynamic)
|
|
290
373
|
- Example: Assuming hierarchy is defined in config when it's actually built dynamically via commands
|
|
291
374
|
- **Fix:** Trace the actual code path for the feature being tested
|
|
292
375
|
|
|
293
376
|
**Splitting one command into many redundant TCs:**
|
|
377
|
+
|
|
294
378
|
- Multiple TCs each validate one assertion after the same CLI invocation, creating overlap with unit tests and increasing run cost
|
|
295
379
|
- Example: TC-A checks exit code, TC-B checks report file, TC-C checks summary text for the same command run
|
|
296
380
|
- **Fix:** Consolidate those assertions into one TC and move formatter/parser details to unit tests
|
|
297
381
|
|
|
298
382
|
**Example for "Test config file validation":**
|
|
383
|
+
|
|
299
384
|
```markdown
|
|
300
385
|
## Test Cases
|
|
301
386
|
|
|
@@ -315,28 +400,33 @@ bundle exec ruby -e '
|
|
|
315
400
|
### 9. Write Test Files
|
|
316
401
|
|
|
317
402
|
Create the scenario directory with separate files:
|
|
403
|
+
|
|
318
404
|
```bash
|
|
319
405
|
mkdir -p {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}
|
|
320
406
|
```
|
|
321
407
|
|
|
322
408
|
Write `scenario.yml` (metadata and setup):
|
|
409
|
+
|
|
323
410
|
```
|
|
324
411
|
{PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/scenario.yml
|
|
325
412
|
```
|
|
326
413
|
|
|
327
414
|
Write scenario pair configs:
|
|
415
|
+
|
|
328
416
|
```
|
|
329
417
|
{PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/runner.yml.md
|
|
330
418
|
{PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/verifier.yml.md
|
|
331
419
|
```
|
|
332
420
|
|
|
333
421
|
Write individual TC runner/verifier files for each test case:
|
|
422
|
+
|
|
334
423
|
```
|
|
335
424
|
{PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/TC-001-{tc-slug}.runner.md
|
|
336
425
|
{PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/TC-001-{tc-slug}.verify.md
|
|
337
426
|
```
|
|
338
427
|
|
|
339
428
|
Optionally create a fixtures directory if test data is needed:
|
|
429
|
+
|
|
340
430
|
```bash
|
|
341
431
|
mkdir -p {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/fixtures
|
|
342
432
|
```
|
|
@@ -373,6 +463,7 @@ Output a summary:
|
|
|
373
463
|
## Example Invocations
|
|
374
464
|
|
|
375
465
|
**Create a test:**
|
|
466
|
+
|
|
376
467
|
```bash
|
|
377
468
|
ace-bundle wfi://e2e/create
|
|
378
469
|
```
|
|
@@ -380,6 +471,7 @@ ace-bundle wfi://e2e/create
|
|
|
380
471
|
Creates: `ace-lint/test/e2e/TS-LINT-003-new-test-scenario/` with `scenario.yml` and TC files.
|
|
381
472
|
|
|
382
473
|
**Create a contextual test:**
|
|
474
|
+
|
|
383
475
|
```bash
|
|
384
476
|
ace-bundle wfi://e2e/create
|
|
385
477
|
```
|
|
@@ -387,6 +479,7 @@ ace-bundle wfi://e2e/create
|
|
|
387
479
|
Creates: `ace-lint/test/e2e/TS-LINT-003-config-file-validation/` with `scenario.yml` and TC files for config validation.
|
|
388
480
|
|
|
389
481
|
**Create test for new area:**
|
|
482
|
+
|
|
390
483
|
```bash
|
|
391
484
|
ace-bundle wfi://e2e/create
|
|
392
485
|
```
|
|
@@ -46,12 +46,15 @@ Tag filtering happens at discovery time (before `SetupExecutor` runs). By the ti
|
|
|
46
46
|
|
|
47
47
|
## Execution Contract
|
|
48
48
|
|
|
49
|
-
- Runner is execution-only: execute declared TC actions and
|
|
49
|
+
- Runner is execution-only: execute declared TC actions, leave only real outcome evidence under `results/tc/{NN}/`, and return final observations through the harness.
|
|
50
|
+
- Runner follows the public user path. Do not turn missing docs/help/CLI affordances into embedded workaround instructions.
|
|
50
51
|
- Verifier is verification-only: determine PASS/FAIL using impact-first ordering:
|
|
51
52
|
1. sandbox/project state impact
|
|
52
|
-
2.
|
|
53
|
-
3.
|
|
53
|
+
2. runner observations
|
|
54
|
+
3. explicit artifacts that are true product outcomes
|
|
55
|
+
4. debug captures (`stdout`/`stderr`/exit) as fallback
|
|
54
56
|
- Do not interpret setup ownership in runner TC files; setup is owned by `scenario.yml` + fixtures.
|
|
57
|
+
- Treat workaround pressure recorded in runner observations as a gap to fix, not as permission to strengthen the runner script.
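
One way to picture the impact-first ordering above is as a fixed precedence list. The sketch below is illustrative only: the symbol names and `primary_evidence` are hypothetical and do not reflect how the verifier is implemented.

```ruby
# Hypothetical symbols mirroring the four evidence levels listed above,
# from strongest (sandbox state) to fallback (debug captures).
EVIDENCE_ORDER = %i[
  sandbox_state
  runner_observations
  product_artifacts
  debug_captures
].freeze

# Returns the highest-precedence evidence source that is available, or nil.
def primary_evidence(available)
  EVIDENCE_ORDER.find { |source| available.include?(source) }
end
```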
|
|
55
58
|
|
|
56
59
|
## Dual-Agent Verifier
|
|
57
60
|
|
|
@@ -61,7 +64,7 @@ When `--verify` is passed (or always-on for CLI pipeline runs), execution follow
|
|
|
61
64
|
2. **Verifier agent** independently inspects the sandbox and artifacts against `TC-*.verify.md` expectations
|
|
62
65
|
3. **Report generator** (`PipelineReportGenerator`) produces deterministic summary from verifier output
|
|
63
66
|
|
|
64
|
-
The verifier has no access to the runner's conversation — it evaluates
|
|
67
|
+
The verifier has no access to the runner's conversation — it evaluates from sandbox evidence plus the structured runner observations persisted by the harness. This prevents self-confirmation bias while still surfacing execution context.
|
|
65
68
|
|
|
66
69
|
## Subagent Mode
|
|
67
70
|
|
|
@@ -75,6 +78,7 @@ When invoked as a subagent (via Task tool from orchestrator):
|
|
|
75
78
|
- **Failed**: {count}
|
|
76
79
|
- **Total**: {count}
|
|
77
80
|
- **Report Paths**: {timestamp}-{short-pkg}-{short-id}.*
|
|
81
|
+
- **Observations**: Brief factual summary or "None"
|
|
78
82
|
- **Issues**: Brief description or "None"
|
|
79
83
|
```
|
|
80
84
|
|
|
@@ -149,8 +153,8 @@ For each TC (TC-NNN):
|
|
|
149
153
|
|
|
150
154
|
1. **Check filter** — skip if `FILTERED_CASES` is set and TC not in list
|
|
151
155
|
2. **Read** the runner file objective
|
|
152
|
-
3. **Execute** runner steps, save artifacts to `results/tc/{NN}/`
|
|
153
|
-
4. **
|
|
156
|
+
3. **Execute** runner steps, save only real outcome artifacts to `results/tc/{NN}/`
|
|
157
|
+
4. **Return** factual runner observations through the harness
|
|
154
158
|
5. **Evaluate** against verifier expectations
|
|
155
159
|
6. **Record** Pass/Fail with per-TC evidence
|
|
156
160
|
|
|
@@ -250,4 +254,4 @@ Reports: `.ace-local/test-e2e/{timestamp}-{short-pkg}-{short-id}-reports/`
|
|
|
250
254
|
| TC fails | Record details, continue remaining TCs, include in report |
|
|
251
255
|
| Sandbox missing/corrupted | Report error, do NOT recreate, return error summary |
|
|
252
256
|
| TC filter mismatch | STOP, do not write reports, offer re-run |
|
|
253
|
-
| Missing TC pair file | Report error for that TC, skip it, continue others |
|
|
257
|
+
| Missing TC pair file | Report error for that TC, skip it, continue others |
|
|
@@ -1,23 +1,33 @@
|
|
|
1
1
|
---
|
|
2
|
+
name: e2e-fix
|
|
3
|
+
description: Diagnose, fix, and rerun failing E2E scenarios with a self-bootstrapping analysis loop.
|
|
4
|
+
allowed-tools:
|
|
5
|
+
- Bash(ace-bundle:*)
|
|
6
|
+
- Bash(ace-test:*)
|
|
7
|
+
- Read
|
|
8
|
+
- Write
|
|
9
|
+
- Edit
|
|
10
|
+
- Skill
|
|
2
11
|
doc-type: workflow
|
|
3
12
|
title: Fix E2E Tests Workflow
|
|
4
13
|
purpose: fix-e2e-tests workflow instruction
|
|
5
14
|
ace-docs:
|
|
6
|
-
last-updated: 2026-
|
|
7
|
-
last-checked: 2026-
|
|
15
|
+
last-updated: 2026-04-19
|
|
16
|
+
last-checked: 2026-04-19
|
|
8
17
|
---
|
|
9
18
|
|
|
10
19
|
# Fix E2E Tests Workflow
|
|
11
20
|
|
|
12
21
|
## Goal
|
|
13
22
|
|
|
14
|
-
|
|
23
|
+
Diagnose, fix, and rerun failing E2E scenarios with a single workflow entrypoint.
|
|
15
24
|
|
|
16
|
-
This workflow is
|
|
25
|
+
This workflow owns analysis readiness before any fix is applied. Reuse an existing analysis report when it is complete; otherwise generate or complete it via `wfi://e2e/analyze-failures`, then continue directly into the fix loop.
|
|
17
26
|
|
|
18
|
-
##
|
|
27
|
+
## Analysis Readiness Gate
|
|
28
|
+
|
|
29
|
+
Before any fix, ensure an analysis report exists with:
|
|
19
30
|
|
|
20
|
-
Do not apply any fix until an analysis report exists with:
|
|
21
31
|
- scenario / TC identifier
|
|
22
32
|
- category (`code-issue`, `test-issue`, `runner-infrastructure-issue`)
|
|
23
33
|
- evidence from reports/artifacts
|
|
@@ -26,22 +36,29 @@ Do not apply any fix until an analysis report exists with:
|
|
|
26
36
|
- primary candidate files
|
|
27
37
|
- do-not-touch boundaries
|
|
28
38
|
- rerun scope recommendation
|
|
39
|
+
- `Docs / Help Drift From E2E Failures` section with `Public Surface Checked`, `Drift Found`, and `Update Targets`
|
|
40
|
+
|
|
41
|
+
If analysis is missing or incomplete, generate or refresh it first:
|
|
29
42
|
|
|
30
|
-
If analysis is missing or incomplete, stop and run:
|
|
31
43
|
```bash
|
|
32
44
|
ace-bundle wfi://e2e/analyze-failures
|
|
33
45
|
```
|
|
34
46
|
|
|
47
|
+
Then continue this workflow using the resulting `E2E Failure Analysis Report`, `Fix Decisions`, and `Execution Plan Input` as the source of truth. Do not stop merely because analysis had to be generated.
|
|
48
|
+
|
|
35
49
|
## Required Input
|
|
36
50
|
|
|
37
|
-
Use the output section from `e2e/analyze-failures
|
|
51
|
+
Use the output section from `e2e/analyze-failures` when present, whether it was provided up front or generated by this workflow:
|
|
52
|
+
|
|
38
53
|
- `## E2E Failure Analysis Report`
|
|
54
|
+
- `## Docs / Help Drift From E2E Failures`
|
|
39
55
|
- `## Fix Decisions`
|
|
40
56
|
- `### Execution Plan Input`
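
A minimal sketch of checking an analysis report for the sections listed above. `missing_sections` is a hypothetical helper, and it assumes the headings appear verbatim on their own lines in the report Markdown.

```ruby
# Headings the fix workflow expects in the analysis output.
REQUIRED_HEADINGS = [
  "## E2E Failure Analysis Report",
  "## Docs / Help Drift From E2E Failures",
  "## Fix Decisions",
  "### Execution Plan Input"
].freeze

# Returns the required headings that are absent from the report text.
def missing_sections(markdown)
  lines = markdown.lines.map(&:strip)
  REQUIRED_HEADINGS.reject { |heading| lines.include?(heading) }
end
```

An empty result would suggest the report is structurally ready; a non-empty one indicates which parts of the analysis still need to be generated.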
|
|
41
57
|
|
|
42
58
|
## Autonomy Rule
|
|
43
59
|
|
|
44
60
|
- Do not ask the user to choose fix target, category, or rerun scope.
|
|
61
|
+
- If analysis is missing, run `wfi://e2e/analyze-failures` yourself before fixing.
|
|
45
62
|
- If analysis is incomplete, auto-complete missing decision fields via local evidence (reports, artifacts, scenario files, implementation), then proceed.
|
|
46
63
|
- Only stop for hard blockers (missing files/tools/permissions).
|
|
47
64
|
|
|
@@ -61,27 +78,40 @@ Apply fixes in this order:
|
|
|
61
78
|
|
|
62
79
|
## Fix Procedure
|
|
63
80
|
|
|
64
|
-
1.
|
|
81
|
+
1. Establish or refresh analysis
|
|
82
|
+
|
|
83
|
+
- Check for a current analysis report that satisfies the Analysis Readiness Gate.
|
|
84
|
+
- If none exists, or if required fields are missing, including the docs/help drift section, run `ace-bundle wfi://e2e/analyze-failures`.
|
|
85
|
+
- Reuse the most recent valid analysis output as the source of truth for fix selection.
|
|
86
|
+
- Treat full-suite/package reruns and targeted scenario reruns as different scopes. Do not label a broader suite failure set as a regression in a previously fixed targeted scenario unless the same scenario fails again on a clean rerun.
|
|
87
|
+
|
|
88
|
+
2. Pick the first prioritized item from analysis
|
|
89
|
+
|
|
65
90
|
- Use the selected "First item to fix"
|
|
66
91
|
- Confirm category, fix target, and rerun scope
|
|
67
92
|
- Apply the "Chosen fix decision" and primary candidate files directly
|
|
68
93
|
|
|
69
|
-
|
|
94
|
+
3. Apply category-specific fix
|
|
70
95
|
|
|
71
96
|
### Category: runner-infrastructure-issue
|
|
97
|
+
|
|
72
98
|
- Fix runner/sandbox/provider/reporting/orchestration behavior
|
|
73
99
|
- Verify with runner tests when applicable: `ace-test ace-test-runner-e2e`
|
|
74
100
|
|
|
75
101
|
### Category: code-issue
|
|
102
|
+
|
|
76
103
|
- Fix package/tool behavior in implementation code
|
|
77
104
|
- Add/update unit tests if needed
|
|
105
|
+
- When the user job is valid but not achievable from docs/help/public CLI, apply the documented docs/help update target instead of codifying the workaround in the scenario
|
|
78
106
|
|
|
79
107
|
### Category: test-issue
|
|
108
|
+
|
|
80
109
|
- Fix scenario definition, runner/verifier criteria, fixtures, or setup steps
|
|
81
110
|
- Preserve role split: runner is execution-only, verifier is impact-first verdict
|
|
82
111
|
- Keep implementation unchanged unless analysis is revised
|
|
112
|
+
- Remove hidden recipes, workaround branches, and unsupported internal-detail checks from goal-style TCs
|
|
83
113
|
|
|
84
|
-
|
|
114
|
+
4. Rerun the selected failing scope after each fix
|
|
85
115
|
|
|
86
116
|
After every implemented fix, rerun the analysis-selected failing scope before moving to the next item or recommending release.
|
|
87
117
|
|
|
@@ -96,25 +126,31 @@ ace-test-e2e {package}
|
|
|
96
126
|
```
|
|
97
127
|
|
|
98
128
|
Rules:
|
|
129
|
+
|
|
99
130
|
- Scenario rerun is the default after each fix iteration.
|
|
100
131
|
- Use package rerun only when analysis explicitly selected package scope.
|
|
101
132
|
- For multiple failing scenarios, rerun each scenario explicitly.
|
|
133
|
+
|
|
102
134
|
```text
|
|
103
135
|
ace-test-e2e ace-assign TS-ASSIGN-001
|
|
104
136
|
ace-test-e2e ace-assign TS-ASSIGN-002
|
|
105
137
|
ace-test-e2e ace-bundle TS-BUNDLE-001
|
|
106
138
|
```
|
|
139
|
+
|
|
107
140
|
- Record the rerun command and result in the execution summary for every fix item.
|
|
108
141
|
|
|
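The rerun rules above can be sketched as a command builder. This is only an illustration under stated assumptions: `FailingScenario` and `rerun_commands` are hypothetical names, and the only command forms used are the two shown in this workflow (`ace-test-e2e {package}` and `ace-test-e2e {package} {test-id}`).

```ruby
require "shellwords"

# Hypothetical record for one failing scenario.
FailingScenario = Struct.new(:package, :test_id)

# Builds one explicit rerun command per failing scenario (the default).
# With package_scope: true, builds one command per distinct package instead,
# which the rules allow only when analysis explicitly selected package scope.
def rerun_commands(failures, package_scope: false)
  if package_scope
    failures.map(&:package).uniq.map { |pkg| ["ace-test-e2e", pkg].shelljoin }
  else
    failures.map { |f| ["ace-test-e2e", f.package, f.test_id].shelljoin }
  end
end

failures = [
  FailingScenario.new("ace-assign", "TS-ASSIGN-001"),
  FailingScenario.new("ace-assign", "TS-ASSIGN-002"),
  FailingScenario.new("ace-bundle", "TS-BUNDLE-001")
]
puts rerun_commands(failures)
```

Each returned string is what would be recorded, together with its result, in the execution summary for that fix item.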
109
|
-
|
|
142
|
+
5. Re-check classification when evidence conflicts
|
|
143
|
+
|
|
110
144
|
- If outcome contradicts analysis, return to `e2e/analyze-failures`
|
|
111
145
|
- Update analysis report and re-select a new autonomous chosen fix decision before continuing
|
|
146
|
+
- If a suite/package report conflicts with a scenario report, the scenario report wins and the aggregate mismatch must be fixed or explicitly tracked before relying on suite-level TC mappings.
|
|
147
|
+
|
|
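The precedence rule in step 5 (the scenario report wins over a conflicting suite/package report) could look roughly like the following. `Reconciled` and `reconcile_status` are hypothetical names, and falling back to the suite status when no scenario report exists is an assumption the workflow does not spell out.

```ruby
# Hypothetical result: the status to trust, plus a flag saying the aggregate
# report disagrees and must be fixed or explicitly tracked.
Reconciled = Struct.new(:status, :aggregate_mismatch)

def reconcile_status(scenario_status:, suite_status:)
  # Assumption: with no scenario-level report, the suite status is all we have.
  return Reconciled.new(suite_status, false) if scenario_status.nil?

  mismatch = !suite_status.nil? && suite_status != scenario_status
  Reconciled.new(scenario_status, mismatch)
end

result = reconcile_status(scenario_status: "pass", suite_status: "fail")
puts "status=#{result.status} mismatch=#{result.aggregate_mismatch}"
```

A `true` mismatch flag is the signal not to rely on suite-level TC mappings until the aggregate report is corrected or tracked.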
148
|
+
6. Iterate until all targeted failures are resolved
|
|
112
149
|
|
|
113
|
-
5. Iterate until all targeted failures are resolved
|
|
114
150
|
- Keep one active scenario/TC at a time
|
|
115
151
|
- Preserve cost-conscious rerun discipline
|
|
116
152
|
|
|
117
|
-
|
|
153
|
+
7. Run a final explicit failing-scenario checkpoint before concluding the fix session
|
|
118
154
|
|
|
119
155
|
After the currently targeted failures are addressed, require one final:
|
|
120
156
|
|
|
@@ -136,12 +172,25 @@ Use one explicit command per previously failing scenario to confirm no targeted
|
|
|
136
172
|
```markdown
|
|
137
173
|
## E2E Fix Execution Summary
|
|
138
174
|
|
|
175
|
+
Analysis Source: reused existing analysis | generated via `wfi://e2e/analyze-failures` | refreshed incomplete analysis
|
|
176
|
+
|
|
139
177
|
| Scenario / TC | Category | Change Applied | Verification Command | Result |
|
|
140
178
|
|---|---|---|---|---|
|
|
141
179
|
| ... | ... | ... | ... | pass/fail |
|
|
142
180
|
```
|
|
143
181
|
|
|
182
|
+
If the analysis reported docs/help drift, include:
|
|
183
|
+
|
|
184
|
+
```markdown
|
|
185
|
+
## Docs / Help Updates
|
|
186
|
+
|
|
187
|
+
| Scenario / TC | Public Surface Updated | Why |
|
|
188
|
+
|---|---|---|
|
|
189
|
+
| ... | docs/usage.md, CLI --help | E2E failure showed the valid user job was not discoverable |
|
|
190
|
+
```
|
|
191
|
+
|
|
144
192
|
Include one final row for the batch checkpoint:
|
|
193
|
+
|
|
145
194
|
- Verification Command: one explicit rerun command per remaining failed scenario (`ace-test-e2e {package} {test-id}`)
|
|
146
195
|
- Result: `pass` or remaining failing scenarios
|
|
147
196
|
- If failures remain, continue the fix loop instead of treating the session as complete
|
|
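As an illustration of the summary format in this hunk, the table could be rendered from recorded fix items as sketched below. `FixRow` and `render_fix_summary` are hypothetical names, and escaping `|` inside cells is an added safeguard rather than something the template requires.

```ruby
# Hypothetical record for one fix item (or for the final batch checkpoint row).
FixRow = Struct.new(:scenario, :category, :change, :command, :result)

# Renders the "E2E Fix Execution Summary" block in the template's layout.
def render_fix_summary(analysis_source, rows)
  lines = [
    "## E2E Fix Execution Summary",
    "",
    "Analysis Source: #{analysis_source}",
    "",
    "| Scenario / TC | Category | Change Applied | Verification Command | Result |",
    "|---|---|---|---|---|"
  ]
  rows.each do |row|
    # Escape pipes so a cell value cannot break the table layout.
    cells = row.to_a.map { |cell| cell.to_s.gsub("|") { "\\|" } }
    lines << "| #{cells.join(' | ')} |"
  end
  lines.join("\n") + "\n"
end

puts render_fix_summary(
  "reused existing analysis",
  [FixRow.new("TS-ASSIGN-001", "test-issue", "Updated fixture setup",
              "ace-test-e2e ace-assign TS-ASSIGN-001", "pass")]
)
```

The final batch checkpoint can be appended as one more `FixRow` whose result is `pass` or the list of scenarios still failing.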
@@ -160,7 +209,8 @@ If unresolved:
|
|
|
160
209
|
|
|
161
210
|
- Fixes are traceable to analyzed failures
|
|
162
211
|
- Verification scope matches analysis recommendation, including mandatory reruns after each fix
|
|
212
|
+
- Any docs/help drift from analysis is fixed or explicitly carried as an unresolved blocker
|
|
163
213
|
- Cost-conscious rerun strategy was followed
|
|
164
214
|
- Final explicit per-scenario rerun checkpoint for all targeted failures was completed before concluding the fix session
|
|
165
215
|
- No user clarification was required for fix targeting/scope in normal flow
|
|
166
|
-
- Targeted failures pass, or blockers are explicitly documented
|
|
216
|
+
- Targeted failures pass, or blockers are explicitly documented
|