ace-test-runner-e2e 0.29.8 → 0.38.11
- checksums.yaml +4 -4
- data/.ace-defaults/e2e-runner/config.yml +14 -2
- data/CHANGELOG.md +178 -0
- data/README.md +2 -2
- data/exe/ace-test-e2e-sh +9 -4
- data/handbook/guides/e2e-testing.g.md +43 -9
- data/handbook/guides/scenario-yml-reference.g.md +16 -8
- data/handbook/guides/tc-authoring.g.md +12 -5
- data/handbook/skills/as-e2e-fix/SKILL.md +2 -2
- data/handbook/skills/as-e2e-review/SKILL.md +2 -2
- data/handbook/templates/ace-taskflow-fixture.template.md +17 -17
- data/handbook/templates/agent-experience-report.template.md +3 -2
- data/handbook/templates/scenario.yml.template.yml +7 -2
- data/handbook/templates/tc-file.template.md +14 -4
- data/handbook/workflow-instructions/e2e/analyze-failures.wf.md +53 -6
- data/handbook/workflow-instructions/e2e/create.wf.md +118 -25
- data/handbook/workflow-instructions/e2e/execute.wf.md +11 -7
- data/handbook/workflow-instructions/e2e/fix.wf.md +65 -15
- data/handbook/workflow-instructions/e2e/plan-changes.wf.md +17 -1
- data/handbook/workflow-instructions/e2e/review.wf.md +36 -25
- data/handbook/workflow-instructions/e2e/rewrite.wf.md +15 -8
- data/handbook/workflow-instructions/e2e/run.wf.md +50 -26
- data/handbook/workflow-instructions/e2e/setup-sandbox.wf.md +4 -4
- data/lib/ace/test/end_to_end_runner/atoms/skill_prompt_builder.rb +7 -5
- data/lib/ace/test/end_to_end_runner/atoms/skill_result_parser.rb +73 -7
- data/lib/ace/test/end_to_end_runner/cli/commands/run_test.rb +21 -8
- data/lib/ace/test/end_to_end_runner/models/test_case.rb +8 -2
- data/lib/ace/test/end_to_end_runner/models/test_result.rb +9 -3
- data/lib/ace/test/end_to_end_runner/models/test_scenario.rb +4 -2
- data/lib/ace/test/end_to_end_runner/molecules/affected_detector.rb +7 -2
- data/lib/ace/test/end_to_end_runner/molecules/bwrap_sandbox_backend.rb +271 -0
- data/lib/ace/test/end_to_end_runner/molecules/config_loader.rb +28 -1
- data/lib/ace/test/end_to_end_runner/molecules/integration_runner.rb +122 -0
- data/lib/ace/test/end_to_end_runner/molecules/pipeline_executor.rb +157 -16
- data/lib/ace/test/end_to_end_runner/molecules/pipeline_prompt_bundler.rb +121 -8
- data/lib/ace/test/end_to_end_runner/molecules/pipeline_report_generator.rb +91 -19
- data/lib/ace/test/end_to_end_runner/molecules/pipeline_sandbox_builder.rb +119 -18
- data/lib/ace/test/end_to_end_runner/molecules/report_writer.rb +13 -12
- data/lib/ace/test/end_to_end_runner/molecules/sandbox_runtime_builder.rb +282 -0
- data/lib/ace/test/end_to_end_runner/molecules/scenario_loader.rb +85 -5
- data/lib/ace/test/end_to_end_runner/molecules/setup_executor.rb +98 -16
- data/lib/ace/test/end_to_end_runner/molecules/suite_report_writer.rb +241 -97
- data/lib/ace/test/end_to_end_runner/molecules/test_discoverer.rb +38 -13
- data/lib/ace/test/end_to_end_runner/molecules/test_executor.rb +27 -5
- data/lib/ace/test/end_to_end_runner/organisms/suite_orchestrator.rb +73 -15
- data/lib/ace/test/end_to_end_runner/organisms/test_orchestrator.rb +120 -19
- data/lib/ace/test/end_to_end_runner/version.rb +1 -1
- data/lib/ace/test/end_to_end_runner.rb +2 -0
- metadata +19 -2
|
@@ -71,6 +71,7 @@ For REMOVE due to overlap, replacement evidence is mandatory:
|
|
|
71
71
|
|
|
72
72
|
**KEEP** — The TC has genuine E2E value and needs no changes. Criteria (all must be true):
|
|
73
73
|
- TC passes the E2E Value Gate (tests real CLI binary + external tools + filesystem I/O)
|
|
74
|
+
- TC passes the Public-Surface Gate (user can do the job from docs/usage/`--help` without hidden recipes or workarounds)
|
|
74
75
|
- Related source code has no changes since `last-verified`
|
|
75
76
|
- TC structure is valid and assertions are current
|
|
76
77
|
|
|
@@ -79,6 +80,7 @@ For REMOVE due to overlap, replacement evidence is mandatory:
|
|
|
79
80
|
- TC scope is too broad (should be narrowed to only E2E-exclusive aspects)
|
|
80
81
|
- TC scope is too narrow (missing assertions for related behavior in same CLI invocation)
|
|
81
82
|
- TC has structure issues flagged in the review
|
|
83
|
+
- TC is hidden-recipe-driven or workaround-driven but the underlying user job should still be supported by the public surface after scenario/docs/help correction
|
|
82
84
|
|
|
83
85
|
**CONSOLIDATE** — The TC should merge with another TC. Criteria (any one is sufficient):
|
|
84
86
|
- Multiple TCs share the same CLI invocation and could be a single TC with multiple assertions
|
|
@@ -91,6 +93,7 @@ For each classification, document:
|
|
|
91
93
|
- For REMOVE (overlap): replacement evidence (`existing unit tests` or `planned unit backfill`)
|
|
92
94
|
- For MODIFY: what specifically needs to change
|
|
93
95
|
- For CONSOLIDATE: the target TC and which assertions merge
|
|
96
|
+
- Whether the current TC is public-surface-valid, hidden-recipe-driven, workaround-driven, or checking an unsupported internal detail
|
|
94
97
|
|
|
95
98
|
### 4. Identify New TCs Needed
|
|
96
99
|
|
|
@@ -112,6 +115,12 @@ For each candidate, answer: "Does this require the full CLI binary + real extern
|
|
|
112
115
|
- If NO: skip — unit tests cover this (or add explicit unit test action if coverage is missing)
|
|
113
116
|
- If YES: include in the plan
|
|
114
117
|
|
|
118
|
+
**Filter through Public-Surface Gate:**
|
|
119
|
+
For each candidate, answer: "Can a user do this job through the public tool surface without hidden recipes or workarounds?"
|
|
120
|
+
- If NO because the job should be supported: add a product/docs/help improvement action and do not encode the workaround into the TC
|
|
121
|
+
- If NO because the detail is not user-visible: skip or narrow the TC
|
|
122
|
+
- If YES: keep planning the TC
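The Public-Surface Gate above reads as a three-way decision. A minimal Ruby sketch of that decision, assuming a hypothetical `public_surface_action` helper and hypothetical reason symbols (none of these names come from the package):

```ruby
# Hypothetical helper: maps a Public-Surface Gate answer to a planning action.
# `achievable` answers "can a user do this job through the public tool surface
# without hidden recipes or workarounds?"; `reason` explains a NO answer.
def public_surface_action(achievable:, reason: nil)
  return :keep_planning if achievable

  case reason
  when :job_should_be_supported
    # Plan a product/docs/help improvement; do not encode the workaround in the TC.
    :add_improvement_action
  when :not_user_visible
    # The detail is not user-visible, so the TC is skipped or narrowed.
    :skip_or_narrow
  else
    raise ArgumentError, "a NO answer needs a reason"
  end
end
```

Forcing a reason on every NO answer keeps the two outcomes apart: one produces work for the product or docs, the other only changes the test plan.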
|
|
123
|
+
|
|
115
124
|
### 5. Propose Scenario Structure
|
|
116
125
|
|
|
117
126
|
Group all planned TCs (KEEP + MODIFY + CONSOLIDATE targets + ADD) into scenarios:
|
|
@@ -176,6 +185,13 @@ Format the complete change plan:
|
|
|
176
185
|
|----|---------------|
|
|
177
186
|
| {tc-id} | Update assertions — {feature} behavior changed in {commit} |
|
|
178
187
|
| {tc-id} | Narrow scope — remove assertions covered by unit tests |
|
|
188
|
+
| {tc-id} | Remove hidden recipe / workaround dependence — rewrite around public docs/help/CLI path |
|
|
189
|
+
|
|
190
|
+
### Public-Surface Gaps ({n} actions)
|
|
191
|
+
|
|
192
|
+
| Action | Target | Why |
|
|
193
|
+
|--------|--------|-----|
|
|
194
|
+
| Update docs/help/CLI | {package/path} | {job is valid but current public surface is too weak for the E2E path} |
|
|
179
195
|
|
|
180
196
|
### CONSOLIDATE ({n} TCs → {n} TCs)
|
|
181
197
|
|
|
@@ -252,4 +268,4 @@ implementation code and at least one E2E test before planning changes.
|
|
|
252
268
|
If the user rejects the plan:
|
|
253
269
|
1. Ask which classifications they disagree with
|
|
254
270
|
2. Adjust the plan based on feedback
|
|
255
|
-
3. Re-present the updated plan
|
|
271
|
+
3. Re-present the updated plan
|
|
@@ -13,7 +13,9 @@ This workflow performs deep exploration of a package to produce a **coverage mat
|
|
|
13
13
|
|
|
14
14
|
During review, treat the runner/verifier split as a first-class quality check:
|
|
15
15
|
- Runner must be execution-only (no verdict language).
|
|
16
|
-
- Verifier must be impact-first (sandbox impact before
|
|
16
|
+
- Verifier must be impact-first (sandbox impact before runner observations and debug).
|
|
17
|
+
- `results/tc/{NN}/` must not be used for helper inputs or verifier-feeding helper reports.
|
|
18
|
+
- Goal-style TCs must also pass the public-surface check: the runner should be able to do the job from docs/usage/`--help` and the tool under test, without hidden recipes or workarounds.
|
|
17
19
|
|
|
18
20
|
**Pipeline position:** Stage 1 of 3 (Explore)
|
|
19
21
|
|
|
@@ -86,9 +88,8 @@ Map what unit tests cover at each layer:
|
|
|
86
88
|
|
|
87
89
|
**List all test files by layer:**
|
|
88
90
|
```bash
|
|
89
|
-
find {PACKAGE}/test/
|
|
90
|
-
find {PACKAGE}/test/
|
|
91
|
-
find {PACKAGE}/test/organisms -name "*_test.rb" 2>/dev/null | sort
|
|
91
|
+
find {PACKAGE}/test/fast -name "*_test.rb" 2>/dev/null | sort
|
|
92
|
+
find {PACKAGE}/test/feat -name "*_test.rb" 2>/dev/null | sort
|
|
92
93
|
```
|
|
93
94
|
|
|
94
95
|
**For each test file:**
|
|
@@ -100,7 +101,7 @@ Build a unit test map:
|
|
|
100
101
|
|
|
101
102
|
| Test File | Layer | Feature Covered | Test Count | Assertion Count |
|
|
102
103
|
|-----------|-------|-----------------|------------|-----------------|
|
|
103
|
-
| {path} |
|
|
104
|
+
| {path} | fast/feat | {feature} | {n} | {n} |
|
|
104
105
|
|
|
105
106
|
### 4. Inventory Existing E2E Coverage
|
|
106
107
|
|
|
@@ -116,22 +117,32 @@ find {PACKAGE}/test/e2e -name "scenario.yml" -path "*/TS-*" 2>/dev/null | sort
|
|
|
116
117
|
- `tags`, `cost-tier`, `e2e-justification`, `unit-coverage-reviewed`
|
|
117
118
|
- `last-verified`, `verified-by`
|
|
118
119
|
- Extract the objective (what the TC verifies)
|
|
120
|
+
- Record the TC's primary oracle:
|
|
121
|
+
- final sandbox state / real product output
|
|
122
|
+
- runner observations as supporting context
|
|
123
|
+
- debug fallback only when necessary
|
|
124
|
+
- Record whether the job is achievable from the public surface:
|
|
125
|
+
- `valid`
|
|
126
|
+
- `hidden-recipe-driven`
|
|
127
|
+
- `workaround-driven`
|
|
128
|
+
- `unsupported-detail`
|
|
129
|
+
- Record qualitative friction:
|
|
130
|
+
- `low`, `medium`, `high`
|
|
119
131
|
- Identify which CLI commands the TC runs
|
|
120
132
|
- Record command fingerprint (`command + key flags`) for each command assertion
|
|
121
|
-
- Count verification steps (PASS/FAIL checks)
|
|
122
133
|
- Map to the feature it tests
|
|
123
134
|
- Mark TC evidence status:
|
|
124
|
-
- `complete` when `e2e-justification` is present,
|
|
135
|
+
- `complete` when `e2e-justification` is present, the verifier is end-state-first, and `unit-coverage-reviewed` has at least one path
|
|
125
136
|
- `missing` otherwise
|
|
126
|
-
- `at-risk` when evidence is existence-only
|
|
137
|
+
- `at-risk` when evidence is existence-only, helper-artifact-driven, duplicate command invocations are detected, or the TC is hidden-recipe/workaround-driven
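The evidence-status rules above could be applied mechanically. The following Ruby sketch is illustrative only: the `TcEvidence` struct, its field names, and the order in which the three statuses are checked are assumptions, since the workflow text does not say which status wins when a TC is both incomplete and risky.

```ruby
# Hypothetical record of what the review found for one TC.
TcEvidence = Struct.new(
  :e2e_justification, :end_state_first, :unit_coverage_paths,
  :existence_only, :helper_artifact_driven, :duplicate_invocations,
  :public_surface_fit,
  keyword_init: true
)

RISKY_FITS = %w[hidden-recipe-driven workaround-driven].freeze

def evidence_status(tc)
  has_required = !tc.e2e_justification.to_s.strip.empty? &&
                 tc.end_state_first &&
                 !Array(tc.unit_coverage_paths).empty?
  # Assumption: missing required evidence is reported before any risk flag.
  return "missing" unless has_required

  risky = tc.existence_only || tc.helper_artifact_driven ||
          tc.duplicate_invocations || RISKY_FITS.include?(tc.public_surface_fit)
  risky ? "at-risk" : "complete"
end
```

With this ordering, a TC only reaches `complete` when all three required items are present and none of the risk flags is set.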
|
|
127
138
|
|
|
128
139
|
If `--scope` was provided, filter to only the specified scenario.
|
|
129
140
|
|
|
130
141
|
Build an E2E test map:
|
|
131
142
|
|
|
132
|
-
| TC ID | Title | Command Invocations | Feature Tested |
|
|
133
|
-
|
|
134
|
-
| {id} | {title} | {command list} | {feature} | {
|
|
143
|
+
| TC ID | Title | Command Invocations | Feature Tested | Primary Oracle | Public Surface Fit | Friction | Tags | Cost Tier | E2E Justification | Unit Coverage Reviewed | Evidence | False-Positive Risk |
|
|
144
|
+
|-------|-------|-------------|----------------|----------------|--------------------|----------|------|-----------|-------------------|------------------------|----------|---------------------|
|
|
145
|
+
| {id} | {title} | {command list} | {feature} | {state / output / observations+fallback} | {valid/hidden-recipe/workaround/unsupported-detail} | {low/medium/high} | {tags} | {tier} | {reason or "(missing)"} | {files or "(missing)"} | {complete/missing/at-risk} | {low/medium/high} |
|
|
135
146
|
|
|
136
147
|
### 5. Build Coverage Matrix
|
|
137
148
|
|
|
@@ -139,7 +150,7 @@ Combine the three inventories into a single coverage matrix:
|
|
|
139
150
|
|
|
140
151
|
**Matrix structure:**
|
|
141
152
|
- **Rows:** Features/behaviors from step 2
|
|
142
|
-
- **Columns:** Unit Tests (
|
|
153
|
+
- **Columns:** Unit Tests (`fast`/`feat`) | E2E Tests
|
|
143
154
|
- **Cells:** Test file references + counts, or "none"
|
|
144
155
|
|
|
145
156
|
```markdown
|
|
@@ -147,10 +158,10 @@ Combine the three inventories into a single coverage matrix:
|
|
|
147
158
|
|
|
148
159
|
| Feature | Unit Tests | E2E Tests | Evidence Strength | False-Positive Risk | Status |
|
|
149
160
|
|---------|-----------|-----------|------------------|----------------------|--------|
|
|
150
|
-
| {feature} | {test files} ({n} assertions) | {TC IDs}
|
|
161
|
+
| {feature} | {test files} ({n} assertions) | {TC IDs} | state+content + observations | low | Covered |
|
|
151
162
|
| {feature} | {test files} ({n} assertions) | none | none | n/a | Unit-only |
|
|
152
|
-
| {feature} | none | {TC IDs}
|
|
153
|
-
| {feature} | {test files} ({n} assertions) | {TC IDs}
|
|
163
|
+
| {feature} | none | {TC IDs} | state+content | low | E2E-only |
|
|
164
|
+
| {feature} | {test files} ({n} assertions) | {TC IDs} | debug-heavy, helper-artifact-driven, or workaround-driven | medium/high | Overlap |
|
|
154
165
|
| {feature} | none | none | none | high | Gap |
|
|
155
166
|
```
|
|
156
167
|
|
|
@@ -171,7 +182,7 @@ Produce the full review report with actionable findings:
|
|
|
171
182
|
|
|
172
183
|
**Reviewed:** {timestamp}
|
|
173
184
|
**Scope:** {package-wide or scenario-id}
|
|
174
|
-
**Workflow version:** 2.
|
|
185
|
+
**Workflow version:** 2.2
|
|
175
186
|
|
|
176
187
|
### Summary
|
|
177
188
|
|
|
@@ -182,8 +193,8 @@ Produce the full review report with actionable findings:
|
|
|
182
193
|
| Unit assertions | {n} |
|
|
183
194
|
| E2E scenarios | {n} |
|
|
184
195
|
| E2E test cases | {n} |
|
|
185
|
-
| TCs with
|
|
186
|
-
| High-risk
|
|
196
|
+
| TCs with end-state-first evidence | {n}/{total} |
|
|
197
|
+
| High-risk helper-artifact TCs | {n}/{total} |
|
|
187
198
|
|
|
188
199
|
### Coverage Matrix
|
|
189
200
|
|
|
@@ -196,19 +207,19 @@ TCs that may fail the E2E Value Gate (unit tests cover the same behavior or high
|
|
|
196
207
|
| TC ID | Feature | Overlapping Unit Tests | Recommendation |
|
|
197
208
|
|-------|---------|----------------------|----------------|
|
|
198
209
|
| {id} | {feature} | {test files} | Remove — unit tests cover this fully |
|
|
199
|
-
| {id} | {feature} | {test files} | Keep — TC tests CLI
|
|
200
|
-
| {id} | {feature} | {test files} | Strengthen — currently
|
|
210
|
+
| {id} | {feature} | {test files} | Keep — TC tests real CLI journey and final integrated outcome |
|
|
211
|
+
| {id} | {feature} | {test files} | Strengthen — currently helper-artifact-driven, workaround-driven, or debug-heavy |
|
|
201
212
|
|
|
202
213
|
**Candidates for removal:** {n} TCs have full overlap with unit tests
|
|
203
214
|
|
|
204
215
|
### E2E Decision Record Coverage
|
|
205
216
|
|
|
206
|
-
| TC ID | Evidence Status | Missing Fields |
|
|
207
|
-
|
|
208
|
-
| {id} | complete | none |
|
|
209
|
-
| {id} | missing | e2e-justification, unit-coverage-reviewed |
|
|
217
|
+
| TC ID | Evidence Status | Public Surface Fit | Friction | Missing Fields / Contract Drift |
|
|
218
|
+
|-------|------------------|--------------------|----------|-------------------------------|
|
|
219
|
+
| {id} | complete | valid | low | none |
|
|
220
|
+
| {id} | missing | hidden-recipe-driven | high | e2e-justification, unit-coverage-reviewed, end-state oracle |
|
|
210
221
|
|
|
211
|
-
**Action:** Any TC with missing evidence should be updated
|
|
222
|
+
**Action:** Any TC with missing evidence, helper-artifact drift, hidden recipes, workaround dependence, or unsupported internal-detail checks should be updated during the next rewrite cycle.
|
|
212
223
|
|
|
213
224
|
### Gap Analysis
|
|
214
225
|
|
|
@@ -30,7 +30,8 @@ ace-bundle wfi://e2e/review → ace-bundle wfi://e2e/plan-changes → ace-bu
|
|
|
30
30
|
|
|
31
31
|
- Keep scenario IDs in `TS-<PACKAGE_SHORT>-<NNN>[-slug]`
|
|
32
32
|
- Keep standalone pairs as `TC-*.runner.md` + `TC-*.verify.md`
|
|
33
|
-
- Keep TC
|
|
33
|
+
- Keep TC outcome artifacts under `results/tc/{NN}/`
|
|
34
|
+
- Keep runner observations in harness reports, not sandbox helper files
|
|
34
35
|
- Keep summary report fields as `tcs-passed`, `tcs-failed`, `tcs-total`, `failed[].tc`
|
|
35
36
|
- CLI split reminder:
|
|
36
37
|
- `ace-test-e2e` runs single-package tests
|
|
@@ -41,6 +42,8 @@ ace-bundle wfi://e2e/review → ace-bundle wfi://e2e/plan-changes → ace-bu
|
|
|
41
42
|
- Normalize runner files to execution-only language.
|
|
42
43
|
- Normalize verifier files to verdict-only, impact-first validation.
|
|
43
44
|
- Keep setup concerns in `scenario.yml` and fixtures, not in TC runner setup sections.
|
|
45
|
+
- Remove helper artifact requirements from `results/tc/{NN}/`; use runner observations instead.
|
|
46
|
+
- Rewrite goal-style TCs around the public user path. Do not preserve hidden recipes, workaround branches, or supporting-tool probes as the way the runner reaches the goal.
|
|
44
47
|
|
|
45
48
|
## Workflow Steps
|
|
46
49
|
|
|
@@ -120,16 +123,18 @@ Follow the E2E test writing rules:
|
|
|
120
123
|
|
|
121
124
|
- **Run the tool first** to verify actual behavior before writing assertions
|
|
122
125
|
- Apply the E2E Value Gate — every TC must require real CLI binary + external tools + filesystem I/O
|
|
123
|
-
- Use `&& echo "PASS" || echo "FAIL"` patterns for every verification step
|
|
124
126
|
- Follow TC ordering: error paths first, happy path, structure verification, lifecycle, end state
|
|
125
127
|
- Consolidate assertions sharing the same CLI invocation into a single TC
|
|
126
128
|
- Target 2-5 TCs per scenario
|
|
127
129
|
- Test through the CLI interface, not library imports
|
|
128
|
-
-
|
|
129
|
-
|
|
130
|
-
|
|
131
|
-
-
|
|
130
|
+
- Write runner goals as “do the job” outcomes, not “write a report for the verifier” chores
|
|
131
|
+
- Keep `results/tc/{NN}/` for real outcomes only; avoid helper YAML, path files, command files, and reflections
|
|
132
|
+
- Use runner observations as the only non-filesystem secondary evidence source
|
|
133
|
+
- Make final sandbox state or real product output the primary oracle whenever possible
|
|
134
|
+
- Add behavioral/content assertions only when CLI output itself is part of the user-visible outcome
|
|
132
135
|
- Remove duplicate command-only TCs; fold related assertions into one TC where possible
|
|
136
|
+
- Do not encode exact workaround procedures, hidden command recipes, or internal debugging tricks the user would not infer from docs/usage/`--help`
|
|
137
|
+
- If the job is valid but the public surface is too weak, plan a product/docs/help fix instead of hardcoding the workaround into the TC
|
|
133
138
|
|
|
134
139
|
**Load the TC template for reference:**
|
|
135
140
|
```bash
|
|
@@ -146,7 +151,9 @@ For each TC classified as MODIFY:
|
|
|
146
151
|
- **Narrow scope** — remove assertions that unit tests cover, keep only E2E-exclusive checks
|
|
147
152
|
- **Broaden scope** — add assertions for related behavior tested by the same CLI invocation
|
|
148
153
|
- **Fix structure** — add missing sections, fix formatting issues
|
|
149
|
-
- **
|
|
154
|
+
- **Replace helper-artifact oracles** — if the existing TC relies on runner-written helper files, rewrite it around final sandbox state plus runner observations
|
|
155
|
+
- **Add evidence gates** — if the existing TC relies on existence-only or missing end-state checks, strengthen the primary oracle before falling back to debug captures
|
|
156
|
+
- **Remove hidden recipes/workarounds** — if the existing TC teaches the runner how to bypass the public surface, rewrite it around the supported user path or narrow/remove the TC
|
|
150
157
|
3. Update the `last-verified` field if the TC was re-run during modification
|
|
151
158
|
4. Write the updated TC runner/verifier files
|
|
152
159
|
|
|
@@ -234,7 +241,7 @@ Present the execution summary:
|
|
|
234
241
|
- [ ] TC count matches plan: {yes/no}
|
|
235
242
|
- [ ] No stale references: {yes/no}
|
|
236
243
|
- [ ] All scenarios have 2-5 TCs: {yes/no}
|
|
237
|
-
- [ ]
|
|
244
|
+
- [ ] Modified/created TCs avoid helper files in `results/tc/{NN}/`: {yes/no}
|
|
238
245
|
|
|
239
246
|
### Next Steps
|
|
240
247
|
|
|
@@ -1,4 +1,12 @@
|
|
|
1
1
|
---
|
|
2
|
+
name: e2e-run
|
|
3
|
+
description: Execute an E2E test scenario with full agent guidance
|
|
4
|
+
allowed-tools:
|
|
5
|
+
- Bash(ace-bundle:*)
|
|
6
|
+
- Read
|
|
7
|
+
- Write
|
|
8
|
+
- Glob
|
|
9
|
+
- Grep
|
|
2
10
|
doc-type: workflow
|
|
3
11
|
title: Run E2E Test Workflow
|
|
4
12
|
purpose: Execute an E2E test scenario with full agent guidance
|
|
@@ -13,7 +21,7 @@ This workflow guides an agent through executing an E2E test scenario. It support
|
|
|
13
21
|
|
|
14
22
|
## Arguments
|
|
15
23
|
|
|
16
|
-
- `PACKAGE` (optional) - Package containing the test (e.g., `ace-lint`). If omitted,
|
|
24
|
+
- `PACKAGE` (optional) - Package containing the test (e.g., `ace-lint`). If omitted, discovery uses `test/feat/` and `test/e2e/` in the project root.
|
|
17
25
|
- `TEST_ID` (optional) - Test identifier (e.g., `TS-LINT-001`). If omitted, runs all tests.
|
|
18
26
|
- `--run-id RUN_ID` (optional) - Pre-generated timestamp ID for deterministic report paths.
|
|
19
27
|
- `--report-dir PATH` (optional) - Explicit report directory path (skips computed `${TEST_DIR}-reports`).
|
|
@@ -33,18 +41,25 @@ This workflow guides an agent through executing an E2E test scenario. It support
|
|
|
33
41
|
- `ace-test-e2e` runs single-package scenarios; `ace-test-e2e-suite` runs suite-level execution
|
|
34
42
|
- Scenario IDs: `TS-<PACKAGE_SHORT>-<NNN>[-slug]`
|
|
35
43
|
- Standalone TC pairs: `TC-*.runner.md` + `TC-*.verify.md`
|
|
36
|
-
- TC artifacts: `results/tc/{NN}/`
|
|
44
|
+
- TC outcome artifacts: `results/tc/{NN}/`
|
|
37
45
|
- Summary counters: `tcs-passed`, `tcs-failed`, `tcs-total`, `failed[].tc`
|
|
38
46
|
- Tag filtering happens at discovery time (before sandbox setup)
|
|
39
47
|
|
|
40
48
|
## Execution Contract
|
|
41
49
|
|
|
42
|
-
- Runner instructions are execution-only: perform actions and
|
|
50
|
+
- Runner instructions are execution-only: perform actions and return final observations.
|
|
51
|
+
- The runner should follow the public user path from docs/usage/`--help` and the tool under test itself. Do not encode or normalize hidden recipes and workarounds.
|
|
43
52
|
- Verifier instructions are verification-only: assign verdicts using impact-first checks:
|
|
53
|
+
|
|
44
54
|
1. sandbox/project state impact
|
|
45
|
-
2.
|
|
46
|
-
3.
|
|
55
|
+
2. runner observations
|
|
56
|
+
3. explicit outcome artifacts
|
|
57
|
+
4. debug captures as fallback
|
|
58
|
+
|
|
47
59
|
- Do not place ad-hoc setup logic in TC runner files; sandbox setup belongs to `scenario.yml` and fixtures.
|
|
60
|
+
- Do not place helper inputs, reflections, or temp manifests under `results/tc/{NN}/`.
|
|
61
|
+
- Do not ask the runner to write verifier-facing summaries or audit files when final sandbox state can prove the goal directly.
|
|
62
|
+
- If the runner observations show a workaround was needed, treat that as a docs/help/product or scenario-design gap, not a successful steady-state contract.
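The impact-first order in the contract above can be expressed as a fixed ranking of evidence sources. A small Ruby sketch, assuming hypothetical source names and a hypothetical `ranked_evidence` helper that is not part of the package:

```ruby
# Evidence sources in the order the verifier is asked to consult them.
EVIDENCE_ORDER = %i[
  sandbox_state
  runner_observations
  outcome_artifacts
  debug_captures
].freeze

# Hypothetical helper: returns the available sources, highest priority first.
# `available` maps a source name to true/false (or to the evidence itself).
def ranked_evidence(available)
  EVIDENCE_ORDER.select { |source| available[source] }
end
```

For example, `ranked_evidence(debug_captures: true, runner_observations: true)` places `:runner_observations` ahead of `:debug_captures`, so debug output only leads when nothing stronger exists.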
|
|
48
63
|
|
|
49
64
|
## Execution Environment Guardrail
|
|
50
65
|
|
|
@@ -55,12 +70,12 @@ This workflow guides an agent through executing an E2E test scenario. It support
|
|
|
55
70
|
|
|
56
71
|
For CLI providers (`ace-test-e2e`), the deterministic 6-phase pipeline handles execution automatically:
|
|
57
72
|
|
|
58
|
-
1. **Setup**
|
|
59
|
-
2. **Runner prompt**
|
|
60
|
-
3. **Runner LLM**
|
|
61
|
-
4. **Verifier prompt**
|
|
62
|
-
5. **Verifier LLM**
|
|
63
|
-
6. **Report**
|
|
73
|
+
1. **Setup** -- `SetupExecutor` creates sandbox (git init, mise.toml, .ace symlinks, `results/tc/{NN}/` dirs)
|
|
74
|
+
2. **Runner prompt** -- `SkillPromptBuilder` assembles context from `runner.yml.md` + `TC-*.runner.md`
|
|
75
|
+
3. **Runner LLM** -- Agent executes TC steps in sandbox and returns final observations
|
|
76
|
+
4. **Verifier prompt** -- `SkillPromptBuilder` assembles context from `verifier.yml.md` + `TC-*.verify.md` and includes runner observations
|
|
77
|
+
5. **Verifier LLM** -- Independent agent evaluates artifacts against expectations
|
|
78
|
+
6. **Report** -- `PipelineReportGenerator` produces deterministic summary and persists runner observations in harness-managed reports
|
|
64
79
|
|
|
65
80
|
When this workflow is invoked directly (not via CLI pipeline), the agent performs steps 1-6 manually using the workflow steps below.
|
|
66
81
|
|
|
@@ -83,7 +98,8 @@ When invoked as a subagent (via a batch orchestrator such as an assignment fan-o
|
|
|
83
98
|
- **Failed**: {count}
|
|
84
99
|
- **Total**: {count}
|
|
85
100
|
- **Report Paths**: {timestamp}-{short-pkg}-{short-id}.*
|
|
86
|
-
- **
|
|
101
|
+
- **Observations**: Brief factual summary or "None"
|
|
102
|
+
- **Issues**: Brief description or "None" (legacy alias if `Observations` is unavailable)
|
|
87
103
|
```
|
|
88
104
|
|
|
89
105
|
Do NOT return full report contents, detailed TC output, or setup logs.
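A consumer of this return contract has to accept both the new `Observations` field and the legacy `Issues` alias. The Ruby sketch below shows one way to read the `- **Key**: value` lines with that fallback. Both helper names are hypothetical, and this is not the package's `skill_result_parser.rb`, whose contents are not shown in this diff.

```ruby
# Hypothetical parser for "- **Key**: value" lines in a returned summary.
def parse_return_contract(text)
  fields = {}
  text.each_line do |line|
    match = line.match(/\A\s*-\s*\*\*(.+?)\*\*:\s*(.*?)\s*\z/)
    fields[match[1]] = match[2] if match
  end
  fields
end

# Prefer `Observations`; fall back to the legacy `Issues` field, then "None".
def observations_from(fields)
  fields["Observations"] || fields["Issues"] || "None"
end
```

Keeping the fallback in one small function means callers never need to know which of the two field names a given runner produced.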
|
|
@@ -95,6 +111,7 @@ Do NOT return full report contents, detailed TC output, or setup logs.
|
|
|
95
111
|
When invoked with `--tc-mode`, the sandbox is pre-populated by `SetupExecutor` and only a single TC is executed. Steps 1-5 of standard mode are skipped.
|
|
96
112
|
|
|
97
113
|
**TC-Level Arguments:**
|
|
114
|
+
|
|
98
115
|
- `PACKAGE` (required), `TEST_ID` (required), `TC_ID` (required)
|
|
99
116
|
- `--tc-mode` (required), `--sandbox SANDBOX_PATH` (required)
|
|
100
117
|
- `--run-id RUN_ID` (optional), `--env KEY=VALUE,...` (optional)
|
|
@@ -108,7 +125,8 @@ When invoked with `--tc-mode`, the sandbox is pre-populated by `SetupExecutor` a
|
|
|
108
125
|
6. Return TC-level contract
|
|
109
126
|
|
|
110
127
|
**TC-Level Rules:**
|
|
111
|
-
|
|
128
|
+
|
|
129
|
+
- Do NOT create or modify sandbox -- `SetupExecutor` already prepared it
|
|
112
130
|
- Always export `--env` variables before executing test steps
|
|
113
131
|
- Report actual results even if they differ from expected
|
|
114
132
|
|
|
@@ -138,6 +156,7 @@ If no tests found after filtering, report error and exit.
|
|
|
138
156
|
### 2. Read Test Scenario
|
|
139
157
|
|
|
140
158
|
For each scenario file, read and parse:
|
|
159
|
+
|
|
141
160
|
- `test-id`, `title`, `priority`, `duration`, `requires`, `tags`
|
|
142
161
|
|
|
143
162
|
**Multiple tests:** Execute steps 2-7 for each scenario sequentially, then generate a combined summary.
|
|
@@ -172,22 +191,24 @@ Report missing prerequisites before proceeding.
|
|
|
172
191
|
**Pre-generated Run ID:** If `--run-id` was provided, set `TIMESTAMP_ID=$RUN_ID` instead of generating a new one.
|
|
173
192
|
|
|
174
193
|
**Directory naming convention:**
|
|
175
|
-
|
|
176
|
-
- `{
|
|
177
|
-
- `{short-
|
|
194
|
+
|
|
195
|
+
- `{timestamp}` -- 6-char base36 timestamp
|
|
196
|
+
- `{short-pkg}` -- package without `ace-` prefix (e.g., `lint`)
|
|
197
|
+
- `{short-id}` -- lowercase prefix + number (e.g., `ts001`)
|
|
178
198
|
|
|
179
199
|
```
|
|
180
200
|
.ace-local/test-e2e/
|
|
181
201
|
├── 8osvnh-lint-ts001/ # Sandbox
|
|
182
202
|
├── 8osvnh-lint-ts001-reports/ # Reports (summary.r.md, experience.r.md, metadata.yml)
|
|
183
|
-
└── 8osynv-
|
|
203
|
+
└── 8osynv-suite-report.md # Suite report (sibling)
|
|
184
204
|
```
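The naming convention above can be assembled from its three parts. A minimal Ruby sketch, assuming a hypothetical `sandbox_dir_name` helper; the timestamp is taken as given because the workflow does not show how the 6-char base36 value is generated, and the scenario ID is assumed to follow `TS-<PACKAGE_SHORT>-<NNN>[-slug]` with no hyphen inside the package segment.

```ruby
# Hypothetical helper that builds "{timestamp}-{short-pkg}-{short-id}".
def sandbox_dir_name(timestamp:, package:, test_id:)
  # {short-pkg}: package name without the "ace-" prefix.
  short_pkg = package.delete_prefix("ace-")

  # {short-id}: lowercase prefix plus number, dropping the package segment
  # and any trailing slug (assumed ID shape: TS-<PACKAGE_SHORT>-<NNN>[-slug]).
  match = test_id.match(/\A([A-Za-z]+)-[^-]+-(\d+)(?:-.+)?\z/)
  raise ArgumentError, "unrecognized test id: #{test_id}" unless match

  "#{timestamp}-#{short_pkg}-#{match[1].downcase}#{match[2]}"
end
```

Under these assumptions, `sandbox_dir_name(timestamp: "8osvnh", package: "ace-lint", test_id: "TS-LINT-001")` yields the sandbox name used in the tree above, and appending `-reports` gives the sibling reports directory.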
|
|
185
205
|
|
|
186
206
|
**Expected variables after setup:**
|
|
187
|
-
|
|
188
|
-
- `
|
|
189
|
-
- `
|
|
190
|
-
- `
|
|
207
|
+
|
|
208
|
+
- `PROJECT_ROOT` -- Original project directory
|
|
209
|
+
- `TEST_DIR` -- Sandbox directory (cwd after setup)
|
|
210
|
+
- `REPORTS_DIR` -- Reports directory
|
|
211
|
+
- `TIMESTAMP_ID` -- Unique run identifier
|
|
191
212
|
|
|
192
213
|
### 4.1 Sandbox Isolation Checkpoint (MANDATORY)
|
|
193
214
|
|
|
@@ -198,7 +219,7 @@ echo "=== SANDBOX ISOLATION CHECK ==="
|
|
|
198
219
|
CURRENT_DIR="$(pwd)"
|
|
199
220
|
[[ "$CURRENT_DIR" == *".ace-local/test-e2e/"* ]] && echo "PASS: In sandbox" || echo "FAIL: NOT in sandbox"
|
|
200
221
|
git rev-parse --git-dir >/dev/null 2>&1 && { [ -z "$(git remote -v 2>/dev/null)" ] && echo "PASS: No remotes" || echo "FAIL: Remotes found"; } || echo "PASS: No git"
|
|
201
|
-
[ -f "CLAUDE.md" ] || [ -f "Gemfile" ] || [ -d ".ace-
|
|
222
|
+
[ -f "CLAUDE.md" ] || [ -f "Gemfile" ] || [ -d ".ace-task" ] && echo "FAIL: Project markers found" || echo "PASS: No markers"
|
|
202
223
|
echo "=== END CHECK ==="
|
|
203
224
|
```
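For callers that want the same guard outside a shell, two of the three checks above (sandbox path and project markers) translate directly to Ruby. This is a sketch with a hypothetical `sandbox_isolation_failures` helper; the git-remote check is left out because it needs a subprocess.

```ruby
# Returns a list of failure messages; an empty list means both checks passed.
def sandbox_isolation_failures(dir)
  failures = []

  # Check 1: the directory must live under .ace-local/test-e2e/.
  failures << "NOT in sandbox" unless dir.include?(".ace-local/test-e2e/")

  # Check 3: markers of the main project must not be present.
  marker_found = File.file?(File.join(dir, "CLAUDE.md")) ||
                 File.file?(File.join(dir, "Gemfile")) ||
                 File.directory?(File.join(dir, ".ace-task"))
  failures << "Project markers found" if marker_found

  failures
end
```

Returning the list of failures instead of printing lets a caller stop on the first non-empty result, which matches the mandatory nature of the checkpoint.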
|
|
204
225
|
|
|
@@ -208,7 +229,7 @@ echo "=== END CHECK ==="
|
|
|
208
229
|
### 5. Create Test Data
|
|
209
230
|
|
|
210
231
|
> **Use `ace-test-e2e-sh "$TEST_DIR"` for ALL commands after setup.**
|
|
211
|
-
> Each bash block runs in a fresh shell
|
|
232
|
+
> Each bash block runs in a fresh shell -- the wrapper ensures sandbox isolation.
|
|
212
233
|
|
|
213
234
|
Execute test data creation commands from the scenario, writing files inside `$TEST_DIR/`.
|
|
214
235
|
|
|
@@ -219,9 +240,9 @@ Execute test data creation commands from the scenario, writing files inside `$TE
|
|
|
219
240
|
If `FILTERED_CASES` is set, execute only matching TCs. Otherwise execute all.
|
|
220
241
|
|
|
221
242
|
For each TC (TC-NNN):
|
|
222
|
-
1. **Check filter**
|
|
243
|
+
1. **Check filter** -- skip if not in `FILTERED_CASES`
|
|
223
244
|
2. **Read** the runner file (`TC-NNN-*.runner.md`)
|
|
224
|
-
3. **Execute** runner steps
|
|
245
|
+
3. **Execute** runner steps and create only final outcome artifacts under `results/tc/{NN}/`
|
|
225
246
|
4. **Verify** against paired `.verify.md` expectations
|
|
226
247
|
5. **Record** status (Pass/Fail) with evidence
|
|
227
248
|
|
|
@@ -232,6 +253,7 @@ Track friction points during execution for the experience report.
|
|
|
232
253
|
Write three report files to the reports directory.
|
|
233
254
|
|
|
234
255
|
**Report path setup:**
|
|
256
|
+
|
|
235
257
|
```bash
|
|
236
258
|
REPORT_DIR="${PROVIDED_REPORT_DIR:-${TEST_DIR}-reports}"
|
|
237
259
|
mkdir -p "$REPORT_DIR"
|
|
@@ -302,6 +324,7 @@ Sandbox directories in `.ace-local/test-e2e/` are gitignored.
|
|
|
302
324
|
Summarize execution in the response. Reports are persisted to disk.
|
|
303
325
|
|
|
304
326
|
**Single test:**
|
|
327
|
+
|
|
305
328
|
```markdown
|
|
306
329
|
## E2E Test Execution Report
|
|
307
330
|
**Test ID:** {test-id} | **Package:** {package} | **Status:** {PASS/FAIL}
|
|
@@ -316,6 +339,7 @@ Reports: `.ace-local/test-e2e/{timestamp}-{short-pkg}-{short-id}-reports/`
|
|
|
316
339
|
### 10. Update Test Scenario
|
|
317
340
|
|
|
318
341
|
If all tests pass, update `scenario.yml`:
|
|
342
|
+
|
|
319
343
|
```yaml
|
|
320
344
|
last-verified: {today's date}
|
|
321
345
|
verified-by: claude-{model}
|
|
@@ -352,4 +376,4 @@ ace-test-e2e ace-lint --exclude-tags deep
|
|
|
352
376
|
|
|
353
377
|
# All tests in project root
|
|
354
378
|
ace-test-e2e
|
|
355
|
-
```
|
|
379
|
+
```
|
|
@@ -132,7 +132,7 @@ else
|
|
|
132
132
|
fi
|
|
133
133
|
|
|
134
134
|
# Check 3: Project root markers should NOT exist
|
|
135
|
-
if [ -f "CLAUDE.md" ] || [ -f "Gemfile" ] || [ -d ".ace-
|
|
135
|
+
if [ -f "CLAUDE.md" ] || [ -f "Gemfile" ] || [ -d ".ace-task" ]; then
|
|
136
136
|
echo "FAIL: Main project markers found - NOT an isolated repo!"
|
|
137
137
|
echo " ACTION: STOP - You are in the main repository."
|
|
138
138
|
else
|
|
@@ -321,7 +321,7 @@ Add setup directives to `scenario.yml`:
|
|
|
321
321
|
# scenario.yml
|
|
322
322
|
setup:
|
|
323
323
|
- git-init
|
|
324
|
-
- run: "cp $PROJECT_ROOT_PATH/mise.toml mise.toml && mise trust mise.toml"
|
|
324
|
+
- run: "cp ${ACE_E2E_SOURCE_ROOT:-$PROJECT_ROOT_PATH}/mise.toml mise.toml && mise trust mise.toml"
|
|
325
325
|
- copy-fixtures
|
|
326
326
|
- agent-env:
|
|
327
327
|
PROJECT_ROOT_PATH: "."
|
|
@@ -405,7 +405,7 @@ else
|
|
|
405
405
|
fi
|
|
406
406
|
|
|
407
407
|
# Check 3: Project markers
|
|
408
|
-
if [ -f "CLAUDE.md" ] || [ -f "Gemfile" ] || [ -d ".ace-
|
|
408
|
+
if [ -f "CLAUDE.md" ] || [ -f "Gemfile" ] || [ -d ".ace-task" ]; then
|
|
409
409
|
echo "FAIL: Main project markers found!"
|
|
410
410
|
exit 1
|
|
411
411
|
else
|
|
@@ -458,4 +458,4 @@ ace-test-e2e-sh "$REPO_DIR" git status
|
|
|
458
458
|
## See Also
|
|
459
459
|
|
|
460
460
|
- [E2E Testing Guide](guide://e2e-testing)
|
|
461
|
-
- [Test Suite Health](guide://test-suite-health)
|
|
461
|
+
- [Test Suite Health](guide://test-suite-health)
|
|
@@ -35,7 +35,7 @@ module Ace
|
|
|
35
35
|
#
|
|
36
36
|
# Resolves role: references to their concrete provider before checking.
|
|
37
37
|
#
|
|
38
|
-
# @param provider_string [String] Provider:model string (e.g., "claude:sonnet", "role:e2e-
|
|
38
|
+
# @param provider_string [String] Provider:model string (e.g., "claude:sonnet", "role:e2e-runner")
|
|
39
39
|
# @return [Boolean]
|
|
40
40
|
def cli_provider?(provider_string)
|
|
41
41
|
resolved = resolve_provider_name(provider_string)
|
|
@@ -44,9 +44,9 @@ module Ace
|
|
|
44
44
|
|
|
45
45
|
def build_execution_prompt(command:, tc_mode:)
|
|
46
46
|
return_contract = if tc_mode
|
|
47
|
-
"- **Test ID**: ...\n- **TC ID**: ...\n- **Status**: pass | fail\n- **Report Paths**: ...\n- **Issues**: ..."
|
|
47
|
+
"- **Test ID**: ...\n- **TC ID**: ...\n- **Status**: pass | fail\n- **Report Paths**: ...\n- **Observations**: ...\n- **Issues**: ... (optional legacy alias)"
|
|
48
48
|
else
|
|
49
|
-
"- **Test ID**: ...\n- **Status**: pass | fail | partial\n- **Passed**: ...\n- **Failed**: ...\n- **Total**: ...\n- **Report Paths**: ...\n- **Issues**: ..."
|
|
49
|
+
"- **Test ID**: ...\n- **Status**: pass | fail | partial\n- **Passed**: ...\n- **Failed**: ...\n- **Total**: ...\n- **Report Paths**: ...\n- **Observations**: ...\n- **Issues**: ... (optional legacy alias)"
|
|
50
50
|
end
|
|
51
51
|
|
|
52
52
|
<<~PROMPT.strip
|
|
@@ -55,8 +55,9 @@ module Ace
|
|
|
55
55
|
|
|
56
56
|
Execution requirements:
|
|
57
57
|
- Do not run `/ace-...` inside a shell command.
|
|
58
|
-
- If slash commands are unavailable, stop and report that limitation in `
|
|
58
|
+
- If slash commands are unavailable, stop and report that limitation in `Observations`.
|
|
59
59
|
- Write reports under `.ace-local/test-e2e/*-reports/`.
|
|
60
|
+
- `Observations` is required and must be a concise factual summary of actions, outcomes, and blockers without verdict language.
|
|
60
61
|
- Return only this structured summary:
|
|
61
62
|
#{return_contract}
|
|
62
63
|
PROMPT
|
|
@@ -122,6 +123,7 @@ module Ace
|
|
|
122
123
|
|
|
123
124
|
Verification requirements:
|
|
124
125
|
- Inspect sandbox artifacts and scenario files directly.
|
|
126
|
+
- Judge from sandbox state first, then runner observations, then raw debug captures only when needed.
|
|
125
127
|
- Evaluate each test case using `TC-*.verify.md` criteria when present.
|
|
126
128
|
- Classify each failed test case with one category:
|
|
127
129
|
`test-spec-error`, `tool-bug`, `runner-error`, or `infrastructure-error`.
|
|
@@ -145,7 +147,7 @@ module Ace
|
|
|
145
147
|
|
|
146
148
|
# Resolve the bare provider name from a provider string.
|
|
147
149
|
# For role: references, resolves via ProviderModelParser to find the
|
|
148
|
-
# concrete provider (e.g. "role:e2e-
|
|
150
|
+
# concrete provider (e.g. "role:e2e-runner" → "claude").
|
|
149
151
|
def resolve_provider_name(provider_string)
|
|
150
152
|
name = self.class.provider_name(provider_string)
|
|
151
153
|
return name unless name == "role"
|