ace-test-runner-e2e 0.29.8 → 0.40.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.ace-defaults/e2e-runner/config.yml +14 -2
- data/CHANGELOG.md +233 -0
- data/README.md +2 -2
- data/exe/ace-test-e2e-sh +9 -4
- data/handbook/guides/e2e-testing.g.md +75 -9
- data/handbook/guides/scenario-yml-reference.g.md +21 -8
- data/handbook/guides/tc-authoring.g.md +23 -5
- data/handbook/skills/as-e2e-fix/SKILL.md +2 -2
- data/handbook/skills/as-e2e-review/SKILL.md +2 -2
- data/handbook/templates/ace-taskflow-fixture.template.md +17 -17
- data/handbook/templates/agent-experience-report.template.md +3 -2
- data/handbook/templates/scenario.yml.template.yml +7 -2
- data/handbook/templates/tc-file.template.md +16 -4
- data/handbook/workflow-instructions/e2e/analyze-failures.wf.md +53 -6
- data/handbook/workflow-instructions/e2e/create.wf.md +128 -25
- data/handbook/workflow-instructions/e2e/execute.wf.md +11 -7
- data/handbook/workflow-instructions/e2e/fix.wf.md +84 -15
- data/handbook/workflow-instructions/e2e/plan-changes.wf.md +33 -1
- data/handbook/workflow-instructions/e2e/review.wf.md +40 -25
- data/handbook/workflow-instructions/e2e/rewrite.wf.md +22 -8
- data/handbook/workflow-instructions/e2e/run.wf.md +50 -26
- data/handbook/workflow-instructions/e2e/setup-sandbox.wf.md +4 -4
- data/lib/ace/test/end_to_end_runner/atoms/artifact_contract_validator.rb +138 -0
- data/lib/ace/test/end_to_end_runner/atoms/skill_prompt_builder.rb +7 -5
- data/lib/ace/test/end_to_end_runner/atoms/skill_result_parser.rb +73 -7
- data/lib/ace/test/end_to_end_runner/cli/commands/run_suite.rb +195 -5
- data/lib/ace/test/end_to_end_runner/cli/commands/run_test.rb +58 -9
- data/lib/ace/test/end_to_end_runner/models/test_case.rb +8 -2
- data/lib/ace/test/end_to_end_runner/models/test_result.rb +9 -3
- data/lib/ace/test/end_to_end_runner/models/test_scenario.rb +4 -2
- data/lib/ace/test/end_to_end_runner/molecules/affected_detector.rb +7 -2
- data/lib/ace/test/end_to_end_runner/molecules/artifact_pruner.rb +61 -0
- data/lib/ace/test/end_to_end_runner/molecules/bwrap_sandbox_backend.rb +271 -0
- data/lib/ace/test/end_to_end_runner/molecules/config_loader.rb +28 -1
- data/lib/ace/test/end_to_end_runner/molecules/integration_runner.rb +122 -0
- data/lib/ace/test/end_to_end_runner/molecules/pipeline_executor.rb +235 -18
- data/lib/ace/test/end_to_end_runner/molecules/pipeline_prompt_bundler.rb +164 -13
- data/lib/ace/test/end_to_end_runner/molecules/pipeline_report_generator.rb +91 -19
- data/lib/ace/test/end_to_end_runner/molecules/pipeline_sandbox_builder.rb +121 -18
- data/lib/ace/test/end_to_end_runner/molecules/report_writer.rb +15 -12
- data/lib/ace/test/end_to_end_runner/molecules/sandbox_runtime_builder.rb +374 -0
- data/lib/ace/test/end_to_end_runner/molecules/scenario_loader.rb +83 -5
- data/lib/ace/test/end_to_end_runner/molecules/setup_executor.rb +121 -16
- data/lib/ace/test/end_to_end_runner/molecules/suite_report_writer.rb +422 -97
- data/lib/ace/test/end_to_end_runner/molecules/test_discoverer.rb +38 -13
- data/lib/ace/test/end_to_end_runner/molecules/test_executor.rb +27 -5
- data/lib/ace/test/end_to_end_runner/organisms/suite_orchestrator.rb +98 -18
- data/lib/ace/test/end_to_end_runner/organisms/test_orchestrator.rb +159 -19
- data/lib/ace/test/end_to_end_runner/version.rb +1 -1
- data/lib/ace/test/end_to_end_runner.rb +4 -0
- metadata +21 -2
|
@@ -1,4 +1,12 @@
|
|
|
1
1
|
---
|
|
2
|
+
name: e2e-create
|
|
3
|
+
description: Create a new E2E test scenario from template
|
|
4
|
+
allowed-tools:
|
|
5
|
+
- Bash(ace-bundle:*)
|
|
6
|
+
- Read
|
|
7
|
+
- Write
|
|
8
|
+
- Glob
|
|
9
|
+
- Grep
|
|
2
10
|
doc-type: workflow
|
|
3
11
|
title: Create E2E Test Workflow
|
|
4
12
|
purpose: Create a new E2E test scenario from template
|
|
@@ -23,35 +31,54 @@ This workflow guides an agent through creating a new E2E test scenario.
|
|
|
23
31
|
- Scenario ID format: `TS-<PACKAGE_SHORT>-<NNN>[-slug]`
|
|
24
32
|
- Standalone files: `TC-*.runner.md` and `TC-*.verify.md`
|
|
25
33
|
- TC artifact layout: `results/tc/{NN}/`
|
|
34
|
+
- Runner observations are harness-managed report data, not sandbox helper files
|
|
26
35
|
- Summary counters: `tcs-passed`, `tcs-failed`, `tcs-total`, `failed[].tc`
|
|
27
36
|
- CLI split reminder:
|
|
37
|
+
|
|
28
38
|
- `ace-test-e2e` for single-package execution
|
|
29
39
|
- `ace-test-e2e-suite` for suite-level execution
|
|
30
40
|
|
|
31
41
|
## Authoring Contract
|
|
32
42
|
|
|
33
43
|
- Runner files (`runner.yml.md`, `TC-*.runner.md`) are execution-only.
|
|
44
|
+
- Every TC must be authored as one of:
|
|
45
|
+
- **public-surface** — a user job from docs/usage/`--help` and the CLI
|
|
46
|
+
- **retained-contract** — a deterministic integrated regression check with declared supporting evidence
|
|
47
|
+
- Goal-style/public-surface TCs must prove two things:
|
|
48
|
+
- the tool works
|
|
49
|
+
- a user can do the job from the public surface (`README`, usage docs, `--help`, and the CLI itself) without hidden recipes or workarounds
|
|
34
50
|
- Verifier files (`verifier.yml.md`, `TC-*.verify.md`) are verdict-only with impact-first evidence order:
|
|
51
|
+
|
|
35
52
|
1. sandbox/project state impact
|
|
36
|
-
2.
|
|
37
|
-
3.
|
|
53
|
+
2. runner observations
|
|
54
|
+
3. explicit product outcomes
|
|
55
|
+
4. debug captures as fallback
|
|
56
|
+
|
|
38
57
|
- Setup belongs to `scenario.yml` `setup:` and fixtures; do not duplicate setup in runner TC instructions.
|
|
58
|
+
- Keep `results/tc/{NN}/` for declared verifier-dependent evidence only.
|
|
59
|
+
- Declare every verifier-dependent path in the runner or setup. Grouped shorthand such as ``foo.stdout`, `.stderr`, `.exit`` is allowed for exact sibling captures.
|
|
60
|
+
- Do not use wildcard artifact paths.
|
|
61
|
+
- Do not ask the runner to write reflections, verifier-facing manifests, or undeclared helper files there.
|
|
62
|
+
- Do not encode hidden command recipes, fallback detours, or workaround sequences in runner TC files. If the job cannot be done from the public surface, treat that as a product/docs/help gap or remove/narrow the TC.
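The "no wildcard artifact paths" rule above can be checked mechanically before a TC file is committed. The following is only a sketch: `find_wildcard_artifacts` is a hypothetical helper, not part of `ace-test-e2e`, and it assumes artifact paths appear in runner or verifier files as plain `results/tc/...` strings.

```bash
# Hypothetical helper (not part of ace-test-e2e): list lines in a TC file
# that reference results/tc/ paths containing a glob character (* or ?).
# Exit status is 0 when at least one wildcard path is found, 1 otherwise.
find_wildcard_artifacts() {
  # Match "results/tc/" followed by non-space, non-backtick characters
  # and then a literal * or ?.
  grep -nE 'results/tc/[^[:space:]`]*[*?]' "$1"
}
```

A non-empty result means the TC should name each capture explicitly (for example with the grouped sibling shorthand) instead of relying on a glob.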
|
|
39
63
|
|
|
40
64
|
## Workflow Steps
|
|
41
65
|
|
|
42
66
|
### 1. Validate Inputs
|
|
43
67
|
|
|
44
68
|
**Check package exists:**
|
|
69
|
+
|
|
45
70
|
```bash
|
|
46
71
|
test -d "{PACKAGE}" && echo "Package exists" || echo "Package not found"
|
|
47
72
|
```
|
|
48
73
|
|
|
49
74
|
If package doesn't exist, list available packages:
|
|
75
|
+
|
|
50
76
|
```bash
|
|
51
77
|
ls -d */ | grep -E "^ace-" | sed 's/\/$//'
|
|
52
78
|
```
|
|
53
79
|
|
|
54
80
|
**Normalize area code:**
|
|
81
|
+
|
|
55
82
|
- Convert to uppercase (e.g., `lint` -> `LINT`)
|
|
56
83
|
- Verify it's a valid area name (2-10 alphanumeric characters)
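The two normalization rules above could be sketched as a small shell function. `normalize_area` is a hypothetical name used for illustration; the workflow itself does not prescribe a helper.

```bash
# Hypothetical helper: uppercase an area code and validate its shape.
normalize_area() {
  # Convert to uppercase (e.g. lint -> LINT).
  area=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  # Accept only 2-10 alphanumeric characters.
  if printf '%s' "$area" | grep -Eq '^[A-Z0-9]{2,10}$'; then
    printf '%s\n' "$area"
  else
    echo "invalid area code: $1" >&2
    return 1
  fi
}

normalize_area lint   # prints LINT
```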
|
|
57
84
|
|
|
@@ -66,6 +93,7 @@ find {PACKAGE}/test/e2e -maxdepth 1 -type d -name "TS-{AREA}-*" 2>/dev/null | \
|
|
|
66
93
|
```
|
|
67
94
|
|
|
68
95
|
Sort and take the highest number:
|
|
96
|
+
|
|
69
97
|
- If no existing tests: use `001`
|
|
70
98
|
- Otherwise: increment the highest number by 1
|
|
71
99
|
- Format as three digits (e.g., `001`, `002`, `015`)
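As an illustration of the numbering rules above, the sketch below reads scenario directory names on stdin (for example, the output of the `find` command) and prints the next three-digit number. `next_scenario_number` is a hypothetical helper, and it assumes directory names follow the `TS-<AREA>-<NNN>[-slug]` format.

```bash
# Hypothetical helper: read scenario directory names on stdin and print
# the next three-digit scenario number for the given area code.
next_scenario_number() {
  area=$1
  # Extract the numeric part from names such as TS-LINT-003-some-slug,
  # then keep the highest value.
  highest=$(sed -n "s/^.*TS-${area}-\([0-9][0-9]*\).*$/\1/p" | sort -n | tail -n 1)
  # Strip leading zeros so shell arithmetic does not read the value as octal.
  highest=$(printf '%s' "${highest:-0}" | sed 's/^0*//')
  # No existing tests yields 001; otherwise increment the highest number.
  printf '%03d\n' $(( ${highest:-0} + 1 ))
}

printf '%s\n' TS-LINT-001-basic TS-LINT-002-autofix | next_scenario_number LINT   # prints 003
```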
|
|
@@ -85,12 +113,14 @@ mkdir -p {PACKAGE}/test/e2e
|
|
|
85
113
|
Create a kebab-case slug:
|
|
86
114
|
|
|
87
115
|
**If --context provided:**
|
|
116
|
+
|
|
88
117
|
- Extract key words from the context description
|
|
89
118
|
- Convert to lowercase
|
|
90
119
|
- Replace spaces with hyphens
|
|
91
120
|
- Limit to 5-6 words
|
|
92
121
|
|
|
93
122
|
**If no context:**
|
|
123
|
+
|
|
94
124
|
- Use a placeholder: `new-test-scenario`
|
|
95
125
|
|
|
96
126
|
Example: "Test config file validation" -> `config-file-validation`
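The mechanical part of slug creation (lowercase, hyphens, word limit, placeholder fallback) might look like the sketch below. `slugify` is a hypothetical helper; it does not attempt the "extract key words" step, which is left to the agent, so it keeps every word of its input up to the limit.

```bash
# Hypothetical helper: turn a context description into a kebab-case slug.
# It does NOT pick key words; it only lowercases, hyphenates, and truncates.
slugify() {
  slug=$(printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' '-' \
    | sed 's/^-*//; s/-*$//' \
    | cut -d- -f1-6)
  # Fall back to the placeholder when no usable context was given.
  printf '%s\n' "${slug:-new-test-scenario}"
}

slugify "Config file validation"   # prints config-file-validation
```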
|
|
@@ -100,11 +130,13 @@ The slug is the directory name suffix: `TS-LINT-003-config-file-validation/`
|
|
|
100
130
|
### 5. Load Template
|
|
101
131
|
|
|
102
132
|
Load the test template:
|
|
133
|
+
|
|
103
134
|
```bash
|
|
104
135
|
ace-bundle tmpl://test-e2e
|
|
105
136
|
```
|
|
106
137
|
|
|
107
138
|
Or read directly:
|
|
139
|
+
|
|
108
140
|
```
|
|
109
141
|
ace-test-runner-e2e/handbook/templates/test-e2e.template.md
|
|
110
142
|
```
|
|
@@ -123,6 +155,7 @@ Replace template placeholders with actual values:
|
|
|
123
155
|
| `{area-name}` | Area code (lowercase) |
|
|
124
156
|
|
|
125
157
|
Initial values for optional fields:
|
|
158
|
+
|
|
126
159
|
- `priority: medium`
|
|
127
160
|
- `duration: ~10min`
|
|
128
161
|
- `automation-candidate: false`
|
|
@@ -138,9 +171,10 @@ Initial values for optional fields:
|
|
|
138
171
|
Before generating test cases, verify the proposed test has genuine E2E value.
|
|
139
172
|
|
|
140
173
|
**Check unit test coverage:**
|
|
174
|
+
|
|
141
175
|
```bash
|
|
142
176
|
# Search for existing unit tests covering this area
|
|
143
|
-
find {PACKAGE}/test/
|
|
177
|
+
find {PACKAGE}/test/fast {PACKAGE}/test/feat \
|
|
144
178
|
-name "*_test.rb" 2>/dev/null | head -20
|
|
145
179
|
```
|
|
146
180
|
|
|
@@ -154,47 +188,95 @@ For each proposed TC, answer: **"Does this require the full CLI binary + real ex
|
|
|
154
188
|
- If **PARTIAL**: create the TC but scope it to only the E2E-exclusive aspects
|
|
155
189
|
|
|
156
190
|
**Example decisions:**
|
|
157
|
-
|
|
158
|
-
- "Test that
|
|
191
|
+
|
|
192
|
+
- "Test that invalid YAML config produces error" -- check if `atoms/config_parser_test.rb` already asserts this. If so, **skip** (unit test covers it). If unit test checks parsing but not the full CLI exit code path, **create** a TC scoped to just the exit code.
|
|
193
|
+
- "Test that StandardRB subprocess executes and returns results" -- unit tests stub the subprocess. **Create** this as E2E because it requires the real tool.
|
|
159
194
|
|
|
160
195
|
If all proposed TCs fail the gate, report to the user:
|
|
196
|
+
|
|
161
197
|
```
|
|
162
198
|
All proposed behaviors are already covered by unit tests in {PACKAGE}/test/.
|
|
163
199
|
No E2E test needed. Consider adding unit tests instead if coverage gaps exist.
|
|
164
200
|
```
|
|
165
201
|
|
|
166
|
-
### 7a.
|
|
202
|
+
### 7a. Public-Surface Gate
|
|
203
|
+
|
|
204
|
+
Before generating or keeping a goal-style TC, answer:
|
|
205
|
+
**"Can a normal user complete this job from the package's public surface, without hidden recipes or workarounds?"**
|
|
206
|
+
|
|
207
|
+
Public surface means:
|
|
208
|
+
- package README / usage docs
|
|
209
|
+
- `--help`
|
|
210
|
+
- declared fixtures and `scenario.yml` setup
|
|
211
|
+
- the tool under test itself
|
|
212
|
+
|
|
213
|
+
Reject or narrow the TC if it depends on:
|
|
214
|
+
- step-by-step runner procedures a user would not infer from docs/help
|
|
215
|
+
- workaround branches to compensate for CLI/docs/help gaps
|
|
216
|
+
- direct supporting-tool probes as the primary oracle for an ACE CLI scenario
|
|
217
|
+
- internal-state checks that the public surface does not expose and that do not matter to the user job
|
|
218
|
+
|
|
219
|
+
### 7b. Evidence-Gate Review Before Writing Files
|
|
167
220
|
|
|
168
221
|
Before finalizing the test plan, block weak coverage patterns:
|
|
222
|
+
|
|
169
223
|
- **Existence-only TC**:
|
|
224
|
+
|
|
170
225
|
- only checks directory/file existence
|
|
171
226
|
- no command output/content assertion
|
|
172
227
|
- missing `*.exit` capture for the executed command
|
|
228
|
+
|
|
173
229
|
- **Duplicate-invocation TC**:
|
|
230
|
+
|
|
174
231
|
- same command invocation, same purpose, split across multiple TCs
|
|
175
232
|
|
|
233
|
+
- **Helper-artifact-driven TC**:
|
|
234
|
+
|
|
235
|
+
- runner is instructed to create YAML/TXT/MD helper files in `results/tc/{NN}/`
|
|
236
|
+
- verifier depends on those helper files instead of final sandbox state or real product output
|
|
237
|
+
|
|
238
|
+
- **Hidden-recipe-driven TC**:
|
|
239
|
+
|
|
240
|
+
- the runner must follow a command sequence not discoverable from docs/usage/`--help`
|
|
241
|
+
- the TC succeeds only because the scenario teaches an internal or non-obvious workaround
|
|
242
|
+
|
|
243
|
+
- **Workaround-driven TC**:
|
|
244
|
+
|
|
245
|
+
- the runner is told how to bypass a docs/help/CLI gap instead of surfacing it
|
|
246
|
+
- the verifier would pass a scenario that a normal user could not complete cleanly
|
|
247
|
+
|
|
176
248
|
| TC ID | Decision (KEEP/ADD/SKIP) | Evidence Strength | E2E-only reason | Unit tests reviewed |
|
|
177
249
|
|-------|---------------------------|------------------|-----------------|--------------------|
|
|
178
250
|
| {tc-id} | {decision} | `command-output` | {why this needs real CLI/tools/fs} | {path1,path2} |
|
|
179
251
|
|
|
180
252
|
Rules:
|
|
253
|
+
|
|
181
254
|
- `existence-only` is never valid for KEEP/ADD. Use it only for SKIP rows with explicit unit-test replacement.
|
|
255
|
+
- `helper-artifact-driven` is never valid for KEEP/ADD when final sandbox state could prove the goal directly.
|
|
256
|
+
- `hidden-recipe-driven` and `workaround-driven` are never valid for KEEP/ADD.
|
|
257
|
+
- Every verifier-dependent artifact must be declared by runner/setup; verifier-only references are invalid.
|
|
258
|
+
- Wildcard artifact paths are never valid for KEEP/ADD.
|
|
182
259
|
- `SKIP` rows must include replacement unit-test evidence.
|
|
183
|
-
- Non-skipped rows must
|
|
260
|
+
- Non-skipped rows must identify the primary oracle for the TC: final sandbox state, real product output, or debug fallback.
|
|
261
|
+
- Non-skipped rows must state why the job is achievable from the public surface without hidden recipes.
|
|
262
|
+
- Non-skipped rows must identify TC style: `public-surface` or `retained-contract`.
|
|
184
263
|
- At least one `unit tests reviewed` path is required for every row.
|
|
185
264
|
- The scenario-level `unit-coverage-reviewed` field must include the union of all referenced unit test files.
|
|
186
265
|
|
|
187
|
-
###
|
|
266
|
+
### 7c. E2E Decision Record (Required)
|
|
188
267
|
|
|
189
268
|
Before writing files, produce a decision record table for every candidate TC:
|
|
190
269
|
|
|
191
|
-
| TC ID | Decision (KEEP/ADD/SKIP) | E2E-only reason | Unit tests reviewed |
|
|
192
|
-
|
|
193
|
-
| {tc-id} | {decision} | {why this needs real CLI/tools/fs} | {path1,path2} |
|
|
270
|
+
| TC ID | Decision (KEEP/ADD/SKIP) | E2E-only reason | Public-surface path | Unit tests reviewed |
|
|
271
|
+
|-------|---------------------------|-----------------|---------------------|---------------------|
|
|
272
|
+
| {tc-id} | {decision} | {why this needs real CLI/tools/fs} | {docs/help/CLI path or "not valid"} | {path1,path2} |
|
|
194
273
|
|
|
195
274
|
Rules:
|
|
275
|
+
|
|
196
276
|
- No TC may be created without a row in this table.
|
|
197
277
|
- If decision is `SKIP`, include the unit-test evidence that replaces it.
|
|
278
|
+
- If the public-surface path is missing or workaround-driven, the TC must be `SKIP` or explicitly planned as a product/docs/help improvement before creation.
|
|
279
|
+
- If the TC uses live refresh or watch behavior, include a bounded-session capture plan with explicit shutdown behavior and exit-code expectations.
|
|
198
280
|
- At least one `unit tests reviewed` path is required for each row.
|
|
199
281
|
- The scenario-level `unit-coverage-reviewed` field must include the union of all referenced unit test files.
|
|
200
282
|
|
|
@@ -203,12 +285,13 @@ Rules:
|
|
|
203
285
|
If a context description was provided, enhance the test with:
|
|
204
286
|
|
|
205
287
|
**Research the package:**
|
|
206
|
-
1. **Run unit tests first** (`ace-test` in the package)
|
|
288
|
+
1. **Run unit tests first** (`ace-test` in the package) -- they are the ground truth for implemented behavior
|
|
207
289
|
2. Examine the relevant code in `{PACKAGE}/lib/`
|
|
208
290
|
3. Check existing unit tests for expected behavior patterns
|
|
209
291
|
4. Understand the feature being tested
|
|
210
292
|
5. **Run the tool** to observe actual behavior, output format, file paths, and exit codes
|
|
211
|
-
6. **Verify config/input formats** by reading the actual parsing code
|
|
293
|
+
6. **Verify config/input formats** by reading the actual parsing code -- never assume formats from design specs or task descriptions
|
|
294
|
+
7. **Compare with the public surface** -- verify the intended user path is actually supported by docs/help, and do not compensate for gaps with hidden runner instructions
|
|
212
295
|
|
|
213
296
|
**Generate test content:**
|
|
214
297
|
1. Write a clear objective based on the context
|
|
@@ -220,16 +303,20 @@ If a context description was provided, enhance the test with:
|
|
|
220
303
|
#### Test Case Generation Rules
|
|
221
304
|
|
|
222
305
|
**MUST (required for all E2E tests):**
|
|
223
|
-
|
|
224
|
-
- **Verify
|
|
306
|
+
|
|
307
|
+
- **Verify the feature is implemented** before writing the test -- read the actual implementation code, not just task specs or design documents
|
|
308
|
+
- **Verify config/input formats** by reading the parsing code -- never assume formats from BDD specs, task descriptions, or documentation
|
|
225
309
|
- Include an error/negative TC only when it validates E2E-exclusive behavior (real CLI parser/runtime/tooling/filesystem) or when unit coverage has a documented gap
|
|
226
|
-
- Verify actual file paths by running the tool first
|
|
227
|
-
-
|
|
310
|
+
- Verify actual file paths by running the tool first -- never hardcode paths from documentation or assumptions
|
|
311
|
+
- Write runner goals as user outcomes, not “create a report” chores for the verifier
|
|
228
312
|
- Check specific exit codes for error commands (not just "non-zero")
|
|
229
|
-
-
|
|
313
|
+
- Make final sandbox state or real product output the primary oracle whenever possible
|
|
314
|
+
- Do not require undeclared or verifier-facing helper files under `results/tc/{NN}/`
|
|
315
|
+
- Add at least one behavioral/content assertion when CLI output itself is part of the outcome being tested
|
|
230
316
|
|
|
231
317
|
**SHOULD (strongly recommended):**
|
|
232
|
-
|
|
318
|
+
|
|
319
|
+
- Test the real user journey -- structure TCs as a sequential workflow, not isolated commands
|
|
233
320
|
- Verify exit codes for all commands, not just error cases
|
|
234
321
|
- Include negative assertions (files/directories that should NOT exist)
|
|
235
322
|
- Capture and retain command output for all assertions (`stdout`, `stderr`, and `*.exit`)
|
|
@@ -237,17 +324,18 @@ If a context description was provided, enhance the test with:
|
|
|
237
324
|
- Verify that status values match actual implementation (e.g., `done` vs `completed`)
|
|
238
325
|
|
|
239
326
|
**COST-AWARE (reduce LLM invocations):**
|
|
240
|
-
|
|
327
|
+
|
|
328
|
+
- Consolidate assertions that share the same CLI invocation into a single TC. For example, after running `ace-lint file.rb`, check exit code, report.json structure, and ok.md existence in ONE TC -- not three.
|
|
241
329
|
- Target 2-5 TCs per scenario. More than 5 suggests the scenario is too broad; split into focused scenarios. Fewer than 2 suggests merging with a related scenario.
|
|
242
330
|
- Never create a TC for a single assertion when that assertion could be appended to an existing TC that runs the same command.
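One way to support consolidation is to run the command exactly once and keep its stdout, stderr, and exit code as sibling files that every assertion in the TC then reads. The sketch below assumes a hypothetical `capture_once` helper and uses a stand-in `sh -c` command in a temporary directory rather than a real ACE tool.

```bash
# Hypothetical helper: run a command a single time and store stdout,
# stderr, and the exit code as sibling files (<name>.stdout/.stderr/.exit).
capture_once() {
  out_dir=$1
  name=$2
  shift 2
  mkdir -p "$out_dir"
  if "$@" >"$out_dir/$name.stdout" 2>"$out_dir/$name.stderr"; then
    status=0
  else
    status=$?
  fi
  # Record the exit code so error paths can assert on a specific value.
  printf '%s\n' "$status" >"$out_dir/$name.exit"
}

# Stand-in command for demonstration only.
demo_dir=$(mktemp -d)
capture_once "$demo_dir/results/tc/01" demo sh -c 'echo ok; echo warn >&2; exit 3'
cat "$demo_dir/results/tc/01/demo.exit"   # prints 3
```

In a real TC the stand-in would be replaced by the single invocation under test, such as the `ace-lint file.rb` example above, with the three captures declared in the runner file.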
|
|
243
331
|
|
|
244
332
|
#### Recommended TC Ordering
|
|
245
333
|
|
|
246
|
-
1. **Error paths first**
|
|
247
|
-
2. **Happy path start**
|
|
248
|
-
3. **Structure verification**
|
|
249
|
-
4. **Lifecycle operations**
|
|
250
|
-
5. **End state**
|
|
334
|
+
1. **Error paths first** -- wrong args, missing files, no prior state (run from clean state)
|
|
335
|
+
2. **Happy path start** -- create/init with correct args, verify output
|
|
336
|
+
3. **Structure verification** -- check actual on-disk file structure with negative assertions
|
|
337
|
+
4. **Lifecycle operations** -- status, advance, fail, retry in workflow order
|
|
338
|
+
5. **End state** -- verify completion message, all steps terminal
|
|
251
339
|
|
|
252
340
|
This ordering ensures error TCs run before any state is created (clean environment), and happy-path TCs build on each other sequentially.
|
|
253
341
|
|
|
@@ -258,6 +346,7 @@ See: **e2e-testing.g.md § "Avoiding False Positive Tests"** for the full list o
|
|
|
258
346
|
**E2E tests MUST test through the CLI interface, not library imports.**
|
|
259
347
|
|
|
260
348
|
**Valid approach:**
|
|
349
|
+
|
|
261
350
|
```bash
|
|
262
351
|
OUTPUT=$(ace-review --preset code --subject "diff:HEAD~1" --auto-execute 2>&1)
|
|
263
352
|
EXIT_CODE=$?
|
|
@@ -265,6 +354,7 @@ EXIT_CODE=$?
|
|
|
265
354
|
```
|
|
266
355
|
|
|
267
356
|
**Invalid approach (this is integration/unit testing, not E2E):**
|
|
357
|
+
|
|
268
358
|
```bash
|
|
269
359
|
bundle exec ruby -e '
|
|
270
360
|
require_relative "lib/ace/review"
|
|
@@ -273,6 +363,7 @@ bundle exec ruby -e '
|
|
|
273
363
|
```
|
|
274
364
|
|
|
275
365
|
**For execution tests (LLM, API calls):**
|
|
366
|
+
|
|
276
367
|
- Use `--auto-execute` to make real API calls
|
|
277
368
|
- Using only `--dry-run` cannot verify actual execution behavior
|
|
278
369
|
- Keep costs minimal: cheap models, tiny prompts, small diffs
|
|
@@ -280,22 +371,26 @@ bundle exec ruby -e '
|
|
|
280
371
|
#### Common Anti-Patterns to Avoid
|
|
281
372
|
|
|
282
373
|
**Writing tests from design specs before implementation:**
|
|
374
|
+
|
|
283
375
|
- Task descriptions and BDD specs often describe *intended* behavior with *proposed* config formats
|
|
284
376
|
- The actual implementation may use different formats, different commands, or different workflows
|
|
285
377
|
- Example: A spec might describe `jobs:` with explicit `number:` and `parent:` fields, but implementation uses `steps:` with auto-generated numbers and dynamic hierarchy via `add --after --child`
|
|
286
378
|
- **Fix:** Always read the actual implementation code (especially config parsing) before writing test data
|
|
287
379
|
|
|
288
380
|
**Assuming static vs dynamic behavior:**
|
|
381
|
+
|
|
289
382
|
- Tests may assume features work at config-time (static) when they actually work at runtime (dynamic)
|
|
290
383
|
- Example: Assuming hierarchy is defined in config when it's actually built dynamically via commands
|
|
291
384
|
- **Fix:** Trace the actual code path for the feature being tested
|
|
292
385
|
|
|
293
386
|
**Splitting one command into many redundant TCs:**
|
|
387
|
+
|
|
294
388
|
- Multiple TCs each validate one assertion after the same CLI invocation, creating overlap with unit tests and increasing run cost
|
|
295
389
|
- Example: TC-A checks exit code, TC-B checks report file, TC-C checks summary text for the same command run
|
|
296
390
|
- **Fix:** Consolidate those assertions into one TC and move formatter/parser details to unit tests
|
|
297
391
|
|
|
298
392
|
**Example for "Test config file validation":**
|
|
393
|
+
|
|
299
394
|
```markdown
|
|
300
395
|
## Test Cases
|
|
301
396
|
|
|
@@ -315,28 +410,33 @@ bundle exec ruby -e '
|
|
|
315
410
|
### 9. Write Test Files
|
|
316
411
|
|
|
317
412
|
Create the scenario directory with separate files:
|
|
413
|
+
|
|
318
414
|
```bash
|
|
319
415
|
mkdir -p {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}
|
|
320
416
|
```
|
|
321
417
|
|
|
322
418
|
Write `scenario.yml` (metadata and setup):
|
|
419
|
+
|
|
323
420
|
```
|
|
324
421
|
{PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/scenario.yml
|
|
325
422
|
```
|
|
326
423
|
|
|
327
424
|
Write scenario pair configs:
|
|
425
|
+
|
|
328
426
|
```
|
|
329
427
|
{PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/runner.yml.md
|
|
330
428
|
{PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/verifier.yml.md
|
|
331
429
|
```
|
|
332
430
|
|
|
333
431
|
Write individual TC runner/verifier files for each test case:
|
|
432
|
+
|
|
334
433
|
```
|
|
335
434
|
{PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/TC-001-{tc-slug}.runner.md
|
|
336
435
|
{PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/TC-001-{tc-slug}.verify.md
|
|
337
436
|
```
|
|
338
437
|
|
|
339
438
|
Optionally create a fixtures directory if test data is needed:
|
|
439
|
+
|
|
340
440
|
```bash
|
|
341
441
|
mkdir -p {PACKAGE}/test/e2e/TS-{AREA}-{NNN}-{slug}/fixtures
|
|
342
442
|
```
|
|
@@ -373,6 +473,7 @@ Output a summary:
|
|
|
373
473
|
## Example Invocations
|
|
374
474
|
|
|
375
475
|
**Create a test:**
|
|
476
|
+
|
|
376
477
|
```bash
|
|
377
478
|
ace-bundle wfi://e2e/create
|
|
378
479
|
```
|
|
@@ -380,6 +481,7 @@ ace-bundle wfi://e2e/create
|
|
|
380
481
|
Creates: `ace-lint/test/e2e/TS-LINT-003-new-test-scenario/` with `scenario.yml` and TC files.
|
|
381
482
|
|
|
382
483
|
**Create a contextual test:**
|
|
484
|
+
|
|
383
485
|
```bash
|
|
384
486
|
ace-bundle wfi://e2e/create
|
|
385
487
|
```
|
|
@@ -387,6 +489,7 @@ ace-bundle wfi://e2e/create
|
|
|
387
489
|
Creates: `ace-lint/test/e2e/TS-LINT-003-config-file-validation/` with `scenario.yml` and TC files for config validation.
|
|
388
490
|
|
|
389
491
|
**Create test for new area:**
|
|
492
|
+
|
|
390
493
|
```bash
|
|
391
494
|
ace-bundle wfi://e2e/create
|
|
392
495
|
```
|
|
@@ -46,12 +46,15 @@ Tag filtering happens at discovery time (before `SetupExecutor` runs). By the ti
|
|
|
46
46
|
|
|
47
47
|
## Execution Contract
|
|
48
48
|
|
|
49
|
-
- Runner is execution-only: execute declared TC actions and
|
|
49
|
+
- Runner is execution-only: execute declared TC actions, leave only real outcome evidence under `results/tc/{NN}/`, and return final observations through the harness.
|
|
50
|
+
- Runner follows the public user path. Do not turn missing docs/help/CLI affordances into embedded workaround instructions.
|
|
50
51
|
- Verifier is verification-only: determine PASS/FAIL using impact-first ordering:
|
|
51
52
|
1. sandbox/project state impact
|
|
52
|
-
2.
|
|
53
|
-
3.
|
|
53
|
+
2. runner observations
|
|
54
|
+
3. explicit artifacts that are true product outcomes
|
|
55
|
+
4. debug captures (`stdout`/`stderr`/exit) as fallback
|
|
54
56
|
- Do not interpret setup ownership in runner TC files; setup is owned by `scenario.yml` + fixtures.
|
|
57
|
+
- Treat workaround pressure recorded in runner observations as a gap to fix, not as permission to strengthen the runner script.
|
|
55
58
|
|
|
56
59
|
## Dual-Agent Verifier
|
|
57
60
|
|
|
@@ -61,7 +64,7 @@ When `--verify` is passed (or always-on for CLI pipeline runs), execution follow
|
|
|
61
64
|
2. **Verifier agent** independently inspects the sandbox and artifacts against `TC-*.verify.md` expectations
|
|
62
65
|
3. **Report generator** (`PipelineReportGenerator`) produces deterministic summary from verifier output
|
|
63
66
|
|
|
64
|
-
The verifier has no access to the runner's conversation — it evaluates
|
|
67
|
+
The verifier has no access to the runner's conversation — it evaluates from sandbox evidence plus the structured runner observations persisted by the harness. This prevents self-confirmation bias while still surfacing execution context.
|
|
65
68
|
|
|
66
69
|
## Subagent Mode
|
|
67
70
|
|
|
@@ -75,6 +78,7 @@ When invoked as a subagent (via Task tool from orchestrator):
|
|
|
75
78
|
- **Failed**: {count}
|
|
76
79
|
- **Total**: {count}
|
|
77
80
|
- **Report Paths**: {timestamp}-{short-pkg}-{short-id}.*
|
|
81
|
+
- **Observations**: Brief factual summary or "None"
|
|
78
82
|
- **Issues**: Brief description or "None"
|
|
79
83
|
```
|
|
80
84
|
|
|
@@ -149,8 +153,8 @@ For each TC (TC-NNN):
|
|
|
149
153
|
|
|
150
154
|
1. **Check filter** — skip if `FILTERED_CASES` is set and TC not in list
|
|
151
155
|
2. **Read** the runner file objective
|
|
152
|
-
3. **Execute** runner steps, save artifacts to `results/tc/{NN}/`
|
|
153
|
-
4. **
|
|
156
|
+
3. **Execute** runner steps, save only real outcome artifacts to `results/tc/{NN}/`
|
|
157
|
+
4. **Return** factual runner observations through the harness
|
|
154
158
|
5. **Evaluate** against verifier expectations
|
|
155
159
|
6. **Record** Pass/Fail with per-TC evidence
|
|
156
160
|
|
|
@@ -250,4 +254,4 @@ Reports: `.ace-local/test-e2e/{timestamp}-{short-pkg}-{short-id}-reports/`
|
|
|
250
254
|
| TC fails | Record details, continue remaining TCs, include in report |
|
|
251
255
|
| Sandbox missing/corrupted | Report error, do NOT recreate, return error summary |
|
|
252
256
|
| TC filter mismatch | STOP, do not write reports, offer re-run |
|
|
253
|
-
| Missing TC pair file | Report error for that TC, skip it, continue others |
|
|
257
|
+
| Missing TC pair file | Report error for that TC, skip it, continue others |
|
|
@@ -1,23 +1,33 @@
|
|
|
1
1
|
---
|
|
2
|
+
name: e2e-fix
|
|
3
|
+
description: Diagnose, fix, and rerun failing E2E scenarios with a self-bootstrapping analysis loop.
|
|
4
|
+
allowed-tools:
|
|
5
|
+
- Bash(ace-bundle:*)
|
|
6
|
+
- Bash(ace-test:*)
|
|
7
|
+
- Read
|
|
8
|
+
- Write
|
|
9
|
+
- Edit
|
|
10
|
+
- Skill
|
|
2
11
|
doc-type: workflow
|
|
3
12
|
title: Fix E2E Tests Workflow
|
|
4
13
|
purpose: fix-e2e-tests workflow instruction
|
|
5
14
|
ace-docs:
|
|
6
|
-
last-updated: 2026-
|
|
7
|
-
last-checked: 2026-
|
|
15
|
+
last-updated: 2026-04-19
|
|
16
|
+
last-checked: 2026-04-19
|
|
8
17
|
---
|
|
9
18
|
|
|
10
19
|
# Fix E2E Tests Workflow
|
|
11
20
|
|
|
12
21
|
## Goal
|
|
13
22
|
|
|
14
|
-
|
|
23
|
+
Diagnose, fix, and rerun failing E2E scenarios with a single workflow entrypoint.
|
|
15
24
|
|
|
16
|
-
This workflow is
|
|
25
|
+
This workflow owns analysis readiness before any fix is applied. Reuse an existing analysis report when it is complete; otherwise generate or complete it via `wfi://e2e/analyze-failures`, then continue directly into the fix loop.
|
|
17
26
|
|
|
18
|
-
##
|
|
27
|
+
## Analysis Readiness Gate
|
|
28
|
+
|
|
29
|
+
Before any fix, ensure an analysis report exists with:
|
|
19
30
|
|
|
20
|
-
Do not apply any fix until an analysis report exists with:
|
|
21
31
|
- scenario / TC identifier
|
|
22
32
|
- category (`code-issue`, `test-issue`, `runner-infrastructure-issue`)
|
|
23
33
|
- evidence from reports/artifacts
|
|
@@ -26,22 +36,29 @@ Do not apply any fix until an analysis report exists with:
|
|
|
26
36
|
- primary candidate files
|
|
27
37
|
- do-not-touch boundaries
|
|
28
38
|
- rerun scope recommendation
|
|
39
|
+
- `Docs / Help Drift From E2E Failures` section with `Public Surface Checked`, `Drift Found`, and `Update Targets`
|
|
40
|
+
|
|
41
|
+
If analysis is missing or incomplete, generate or refresh it first:
|
|
29
42
|
|
|
30
|
-
If analysis is missing or incomplete, stop and run:
|
|
31
43
|
```bash
|
|
32
44
|
ace-bundle wfi://e2e/analyze-failures
|
|
33
45
|
```
|
|
34
46
|
|
|
47
|
+
Then continue this workflow using the resulting `E2E Failure Analysis Report`, `Fix Decisions`, and `Execution Plan Input` as the source of truth. Do not stop merely because analysis had to be generated.
|
|
48
|
+
|
|
35
49
|
## Required Input
|
|
36
50
|
|
|
37
|
-
Use the output section from `e2e/analyze-failures
|
|
51
|
+
Use the output section from `e2e/analyze-failures` when present, whether it was provided up front or generated by this workflow:
|
|
52
|
+
|
|
38
53
|
- `## E2E Failure Analysis Report`
|
|
54
|
+
- `## Docs / Help Drift From E2E Failures`
|
|
39
55
|
- `## Fix Decisions`
|
|
40
56
|
- `### Execution Plan Input`
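A quick completeness check for the required sections could be sketched as follows. `analysis_ready` is a hypothetical helper, and it assumes the analysis output has been saved to a Markdown file whose headings match the four names above exactly.

```bash
# Hypothetical helper: succeed only if the saved analysis report contains
# every required heading as a whole line.
analysis_ready() {
  report=$1
  missing=0
  for heading in \
    '## E2E Failure Analysis Report' \
    '## Docs / Help Drift From E2E Failures' \
    '## Fix Decisions' \
    '### Execution Plan Input'
  do
    # -x matches whole lines; -F treats the heading as a literal string.
    if ! grep -qxF -- "$heading" "$report"; then
      echo "missing section: $heading" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A non-zero status would correspond to the "analysis is missing or incomplete" branch, in which case the workflow runs `ace-bundle wfi://e2e/analyze-failures` and then continues.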
|
|
41
57
|
|
|
42
58
|
## Autonomy Rule
|
|
43
59
|
|
|
44
60
|
- Do not ask the user to choose fix target, category, or rerun scope.
|
|
61
|
+
- If analysis is missing, run `wfi://e2e/analyze-failures` yourself before fixing.
|
|
45
62
|
- If analysis is incomplete, auto-complete missing decision fields via local evidence (reports, artifacts, scenario files, implementation), then proceed.
|
|
46
63
|
- Only stop for hard blockers (missing files/tools/permissions).
|
|
47
64
|
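The readiness gate in this hunk amounts to checking that the analysis output carries all four required sections before any fix is applied. A minimal Ruby sketch of that check follows. It assumes the analysis output is markdown text with each heading on its own line. `REQUIRED_SECTIONS`, `missing_sections`, and `analysis_ready?` are hypothetical names for illustration and are not part of ace-test-runner-e2e.

```ruby
# Hypothetical helper, not part of ace-test-runner-e2e.
# The heading names are taken from the "Required Input" list in the workflow text.
REQUIRED_SECTIONS = [
  "## E2E Failure Analysis Report",
  "## Docs / Help Drift From E2E Failures",
  "## Fix Decisions",
  "### Execution Plan Input"
].freeze

# Returns the required headings that do not appear as a whole line of the report.
def missing_sections(report_text)
  lines = report_text.each_line.map(&:strip)
  REQUIRED_SECTIONS.reject { |section| lines.include?(section) }
end

# True when every required section is present.
def analysis_ready?(report_text)
  missing_sections(report_text).empty?
end

# Example: a report that lacks the docs/help drift section.
report = <<~MD
  ## E2E Failure Analysis Report
  (findings)
  ## Fix Decisions
  ### Execution Plan Input
MD

puts missing_sections(report).inspect
puts analysis_ready?(report)
```

When sections are missing, the workflow says to run `ace-bundle wfi://e2e/analyze-failures` and then continue, not to stop.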
@@ -61,27 +78,41 @@ Apply fixes in this order:

 ## Fix Procedure

-1.
+1. Establish or refresh analysis
+
+- Check for a current analysis report that satisfies the Analysis Readiness Gate.
+- If none exists, or if required fields are missing, including the docs/help drift section, run `ace-bundle wfi://e2e/analyze-failures`.
+- Reuse the most recent valid analysis output as the source of truth for fix selection.
+- Treat full-suite/package reruns and targeted scenario reruns as different scopes. Do not label a broader suite failure set as a regression in a previously fixed targeted scenario unless the same scenario fails again on a clean rerun.
+
+2. Pick the first prioritized item from analysis
+
 - Use the selected "First item to fix"
 - Confirm category, fix target, and rerun scope
 - Apply the "Chosen fix decision" and primary candidate files directly

-
+3. Apply category-specific fix

 ### Category: runner-infrastructure-issue
+
 - Fix runner/sandbox/provider/reporting/orchestration behavior
 - Verify with runner tests when applicable: `ace-test ace-test-runner-e2e`

 ### Category: code-issue
+
 - Fix package/tool behavior in implementation code
 - Add/update unit tests if needed
+- When the user job is valid but not achievable from docs/help/public CLI, apply the documented docs/help update target instead of codifying the workaround in the scenario

 ### Category: test-issue
+
 - Fix scenario definition, runner/verifier criteria, fixtures, or setup steps
 - Preserve role split: runner is execution-only, verifier is impact-first verdict
 - Keep implementation unchanged unless analysis is revised
+- Remove hidden recipes, workaround branches, and unsupported internal-detail checks from goal-style TCs
+- Repair undeclared or wildcard artifact contracts before weakening product assertions

-
+4. Rerun the selected failing scope after each fix

 After every implemented fix, rerun the analysis-selected failing scope before moving to the next item or recommending release.

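The scope rule added in step 1 (a suite-level failure is not a regression of a previously fixed scenario unless that scenario fails again on a clean rerun) can be expressed as a small set computation. The Ruby sketch below is illustrative only. `triage_suite_failures` and its keyword arguments are hypothetical, and the scenario IDs are the sample IDs used elsewhere in the workflow.

```ruby
# Hypothetical helper, not part of ace-test-runner-e2e.
# previously_fixed:     scenario IDs fixed earlier in the session
# suite_failures:       scenario IDs failing in a broader suite/package run
# clean_rerun_failures: scenario IDs that failed again on a clean targeted rerun
def triage_suite_failures(previously_fixed:, suite_failures:, clean_rerun_failures:)
  # Only previously fixed scenarios can be regressions at all.
  candidates = suite_failures & previously_fixed
  # A candidate counts as a regression only if its own clean rerun also failed.
  confirmed = candidates & clean_rerun_failures
  { regressions: confirmed, unconfirmed: candidates - confirmed }
end

result = triage_suite_failures(
  previously_fixed: ["TS-ASSIGN-001", "TS-BUNDLE-001"],
  suite_failures: ["TS-ASSIGN-001", "TS-ASSIGN-002", "TS-BUNDLE-001"],
  clean_rerun_failures: ["TS-BUNDLE-001"]
)
puts result.inspect
```

Scenarios under `unconfirmed` still need a clean targeted rerun before they are labeled either way.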
@@ -96,25 +127,37 @@ ace-test-e2e {package}
 ```

 Rules:
+
 - Scenario rerun is the default after each fix iteration.
 - Use package rerun only when analysis explicitly selected package scope.
 - For multiple failing scenarios, rerun each scenario explicitly.
+
 ```text
 ace-test-e2e ace-assign TS-ASSIGN-001
 ace-test-e2e ace-assign TS-ASSIGN-002
 ace-test-e2e ace-bundle TS-BUNDLE-001
 ```
+
 - Record the rerun command and result in the execution summary for every fix item.

-
+5. Re-check classification when evidence conflicts
+
 - If outcome contradicts analysis, return to `e2e/analyze-failures`
 - Update analysis report and re-select a new autonomous chosen fix decision before continuing
+- If a suite/package report conflicts with a scenario report, the scenario report wins and the aggregate mismatch must be fixed or explicitly tracked before relying on suite-level TC mappings.
+
+6. Iterate until all targeted failures are resolved

-5. Iterate until all targeted failures are resolved
 - Keep one active scenario/TC at a time
 - Preserve cost-conscious rerun discipline

-
+6a. If the fix changes a public contract, run a downstream retained-E2E sweep
+
+- Trigger this sweep when the fix changes status words, JSON keys, command shapes, lifecycle semantics, or ownership/state semantics
+- Grep impacted scenarios and downstream consumers before concluding the fix
+- Update retained runner/verifier contracts in the same change set whenever feasible
+
+7. Run a final explicit failing-scenario checkpoint before concluding the fix session

 After the currently targeted failures are addressed, require one final:

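The "rerun each scenario explicitly" rule in this hunk means one `ace-test-e2e {package} {test-id}` command per failing scenario. A minimal Ruby sketch that builds those command lines is shown below. `rerun_commands` and the shape of the input hashes are assumptions for illustration. Only the command shape itself comes from the workflow text.

```ruby
require "shellwords"

# Hypothetical helper, not part of ace-test-runner-e2e.
# Builds one explicit rerun command per failing scenario,
# following the `ace-test-e2e {package} {test-id}` shape from the workflow.
def rerun_commands(failing_scenarios)
  failing_scenarios.map do |scenario|
    ["ace-test-e2e", scenario.fetch(:package), scenario.fetch(:test_id)].shelljoin
  end
end

failing = [
  { package: "ace-assign", test_id: "TS-ASSIGN-001" },
  { package: "ace-assign", test_id: "TS-ASSIGN-002" },
  { package: "ace-bundle", test_id: "TS-BUNDLE-001" }
]

commands = rerun_commands(failing)
puts commands
```

Keeping the commands as separate strings makes it straightforward to record each command and its result in the execution summary, as the rules above require.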
@@ -136,12 +179,37 @@ Use one explicit command per previously failing scenario to confirm no targeted
 ```markdown
 ## E2E Fix Execution Summary

+Analysis Source: reused existing analysis | generated via `wfi://e2e/analyze-failures` | refreshed incomplete analysis
+
 | Scenario / TC | Category | Change Applied | Verification Command | Result |
 |---|---|---|---|---|
 | ... | ... | ... | ... | pass/fail |
 ```

+Also include:
+
+```markdown
+## Fix Classification Totals
+
+| Bucket | Count |
+|---|---|
+| Product bug | {n} |
+| Harness bug | {n} |
+| Retained test/spec drift | {n} |
+```
+
+If the analysis reported docs/help drift, include:
+
+```markdown
+## Docs / Help Updates
+
+| Scenario / TC | Public Surface Updated | Why |
+|---|---|---|
+| ... | docs/usage.md, CLI --help | E2E failure showed the valid user job was not discoverable |
+```
+
 Include one final row for the batch checkpoint:
+
 - Verification Command: one explicit rerun command per remaining failed scenario (`ace-test-e2e {package} {test-id}`)
 - Result: `pass` or remaining failing scenarios
 - If failures remain, continue the fix loop instead of treating the session as complete
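The `Fix Classification Totals` block added in this hunk is a simple count per bucket. The Ruby sketch below renders that table from a list of fix records. It assumes each record already carries a `:bucket` value matching one of the three bucket names. How a fix is assigned to a bucket is not specified in the workflow text, and `classification_totals_table` is a hypothetical name.

```ruby
# Hypothetical helper, not part of ace-test-runner-e2e.
# Bucket names are taken from the summary template in the workflow.
BUCKETS = ["Product bug", "Harness bug", "Retained test/spec drift"].freeze

# Renders the totals table as markdown, with 0 for buckets that have no fixes.
def classification_totals_table(fixes)
  counts = fixes.map { |fix| fix.fetch(:bucket) }.tally
  rows = BUCKETS.map { |bucket| "| #{bucket} | #{counts.fetch(bucket, 0)} |" }
  (["## Fix Classification Totals", "", "| Bucket | Count |", "|---|---|"] + rows).join("\n")
end

fixes = [
  { scenario: "TS-ASSIGN-001", bucket: "Product bug" },
  { scenario: "TS-ASSIGN-002", bucket: "Product bug" },
  { scenario: "TS-BUNDLE-001", bucket: "Harness bug" }
]

table = classification_totals_table(fixes)
puts table
```

Listing every bucket, including those with a zero count, keeps the table shape stable across fix sessions.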
@@ -160,7 +228,8 @@ If unresolved:

 - Fixes are traceable to analyzed failures
 - Verification scope matches analysis recommendation, including mandatory reruns after each fix
+- Any docs/help drift from analysis is fixed or explicitly carried as an unresolved blocker
 - Cost-conscious rerun strategy was followed
 - Final explicit per-scenario rerun checkpoint for all targeted failures was completed before concluding the fix session
 - No user clarification was required for fix targeting/scope in normal flow
-- Targeted failures pass, or blockers are explicitly documented
+- Targeted failures pass, or blockers are explicitly documented
|