ace-test-runner-e2e 0.29.6 → 0.29.8
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +9 -0
- data/handbook/templates/scenario.yml.template.yml +6 -0
- data/handbook/workflow-instructions/e2e/create.wf.md +25 -2
- data/handbook/workflow-instructions/e2e/review.wf.md +17 -12
- data/handbook/workflow-instructions/e2e/rewrite.wf.md +8 -1
- data/lib/ace/test/end_to_end_runner/molecules/pipeline_executor.rb +11 -12
- data/lib/ace/test/end_to_end_runner/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 6206e4d6f65fe1ab5c27d1b5479e37af079b3554bd6289786aa71ce62e4ecf50
+data.tar.gz: bedb5fa2830bc1f2818e2246acbbab15ec6a7227898bf4dba2b23b6521eb8d5b
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 9d06bc8d9447debe2b48128b7c45ea0a357d01677b9b8b48fa508c5f8a078e8b4ed0396a4d4d38c06be25573701d4bd75b6f481438d116e3499b4b1890a9edd5
+data.tar.gz: b148663600b83ffde9821761a1ef4a7b43d11a2efc10d97433bb090d4ae2bfc8f7dd997756190cdafc894c497f1eada99faf3b78dcf7df38ed7eec9d57cedec3
data/CHANGELOG.md
CHANGED
@@ -7,6 +7,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.29.8] - 2026-04-01
+
+### Fixed
+- Replaced process-global `Dir.chdir` in pipeline LLM execution with explicit `working_dir` threading to avoid parallel scenario crashes (`RuntimeError: conflicting chdir during another chdir block`).
+
+### Changed
+- **ace-monorepo-e2e**: Added stronger command/output evidence gates to `TS-MONO-001-rubygems-install` and `TS-MONO-002-quickstart-local` so local sandbox installs and quick-start workflow checks validate real CLI behavior, output, and exit status rather than directory/file presence alone.
+- **ace-monorepo-e2e**: Updated `ace-test-runner-e2e` workflow instructions and scenario template defaults to reduce false-positive E2E tests through command-level evidence, false-positive risk tagging, and duplicate-command consolidation rules.
+
 ## [0.29.6] - 2026-04-01
 
 ### Fixed
data/handbook/templates/scenario.yml.template.yml
CHANGED
@@ -23,6 +23,12 @@ tags: [{cost-tier}, "use-case:{area}"]
 # Optional: Why this scenario must be E2E (not unit-only)
 e2e-justification: "{Requires real CLI/tools/filesystem behavior}"
 
+# Optional: Evidence quality target for review coverage (`command-output`, `state+content`, `existence-only`)
+e2e-evidence-strength: command-output
+
+# Optional: False-positive risk estimate (`low`, `medium`, `high`)
+e2e-false-positive-risk: low
+
 # Optional: Unit test files reviewed during Value Gate analysis
 unit-coverage-reviewed:
 - test/{layer}/{file}_test.rb
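The template now carries two optional fields whose values come from small closed sets. As a minimal Ruby sketch (Ruby being the gem's own language), assuming the allowed values are exactly those listed in the template comments, a hypothetical `validate_scenario_fields` helper could reject typos before a review run. It is not part of ace-test-runner-e2e, and the sample unit test path is made up.

```ruby
require "yaml"

# Allowed values, taken from the comments in scenario.yml.template.yml.
EVIDENCE_STRENGTHS = %w[command-output state+content existence-only].freeze
FALSE_POSITIVE_RISKS = %w[low medium high].freeze

# Hypothetical helper: returns a list of problems with the optional
# evidence fields. Absent fields are allowed because both are optional.
def validate_scenario_fields(yaml_text)
  data = YAML.safe_load(yaml_text)
  data = {} unless data.is_a?(Hash)
  errors = []

  strength = data["e2e-evidence-strength"]
  if strength && !EVIDENCE_STRENGTHS.include?(strength)
    errors << "e2e-evidence-strength: unknown value #{strength.inspect}"
  end

  risk = data["e2e-false-positive-risk"]
  if risk && !FALSE_POSITIVE_RISKS.include?(risk)
    errors << "e2e-false-positive-risk: unknown value #{risk.inspect}"
  end

  errors
end

# Illustrative scenario fragment (the unit test path is made up).
sample = <<~YAML
  e2e-justification: "Requires real CLI/tools/filesystem behavior"
  e2e-evidence-strength: command-output
  e2e-false-positive-risk: low
  unit-coverage-reviewed:
    - test/molecules/example_test.rb
YAML

p validate_scenario_fields(sample) # prints []
```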
data/handbook/workflow-instructions/e2e/create.wf.md
CHANGED
@@ -163,7 +163,28 @@ All proposed behaviors are already covered by unit tests in {PACKAGE}/test/.
 No E2E test needed. Consider adding unit tests instead if coverage gaps exist.
 ```
 
-### 7a.
+### 7a. Evidence-Gate Review Before Writing Files
+
+Before finalizing the test plan, block weak coverage patterns:
+- **Existence-only TC**:
+- only checks directory/file existence
+- no command output/content assertion
+- missing `*.exit` capture for the executed command
+- **Duplicate-invocation TC**:
+- same command invocation, same purpose, split across multiple TCs
+
+| TC ID | Decision (KEEP/ADD/SKIP) | Evidence Strength | E2E-only reason | Unit tests reviewed |
+|-------|---------------------------|------------------|-----------------|--------------------|
+| {tc-id} | {decision} | `command-output` | {why this needs real CLI/tools/fs} | {path1,path2} |
+
+Rules:
+- `existence-only` is never valid for KEEP/ADD. Use it only for SKIP rows with explicit unit-test replacement.
+- `SKIP` rows must include replacement unit-test evidence.
+- Non-skipped rows must include command-level artifacts (`stdout`, `stderr`, `exit`, and/or explicit proof files).
+- At least one `unit tests reviewed` path is required for every row.
+- The scenario-level `unit-coverage-reviewed` field must include the union of all referenced unit test files.
+
+### 7b. E2E Decision Record (Required)
 
 Before writing files, produce a decision record table for every candidate TC:
 
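The 7a rules can be checked mechanically once the decision table is parsed. The sketch below is a hypothetical `decision_row_errors` helper; the `DecisionRow` struct, and in particular its `artifacts` and `replacement_unit_tests` fields, are illustrative assumptions rather than columns the workflow defines.

```ruby
require "set"

# Hypothetical row shape for the 7a decision table. `artifacts` and
# `replacement_unit_tests` are illustrative fields, not template columns.
DecisionRow = Struct.new(:tc_id, :decision, :evidence_strength, :unit_tests_reviewed,
                         :artifacts, :replacement_unit_tests, keyword_init: true)

# Returns one message per violated rule; an empty array means the table passes.
def decision_row_errors(rows, scenario_unit_coverage)
  errors = []
  rows.each do |row|
    skip = row.decision == "SKIP"
    if row.evidence_strength == "existence-only" && !skip
      errors << "#{row.tc_id}: existence-only evidence is not valid for #{row.decision}"
    end
    if skip && Array(row.replacement_unit_tests).empty?
      errors << "#{row.tc_id}: SKIP needs replacement unit-test evidence"
    end
    if !skip && Array(row.artifacts).empty?
      errors << "#{row.tc_id}: #{row.decision} needs command-level artifacts"
    end
    if Array(row.unit_tests_reviewed).empty?
      errors << "#{row.tc_id}: at least one reviewed unit test path is required"
    end
  end

  # The scenario-level field must contain the union of all referenced paths.
  referenced = rows.flat_map { |r| Array(r.unit_tests_reviewed) }.to_set
  missing = referenced - scenario_unit_coverage.to_set
  errors << "unit-coverage-reviewed is missing: #{missing.to_a.sort.join(', ')}" unless missing.empty?
  errors
end

rows = [
  DecisionRow.new(tc_id: "TC-001", decision: "KEEP", evidence_strength: "command-output",
                  unit_tests_reviewed: ["test/a_test.rb"], artifacts: ["tc01.stdout", "tc01.exit"]),
  DecisionRow.new(tc_id: "TC-002", decision: "ADD", evidence_strength: "existence-only",
                  unit_tests_reviewed: ["test/b_test.rb"], artifacts: ["tc02.exit"])
]
puts decision_row_errors(rows, ["test/a_test.rb"])
```

Here TC-002 is flagged twice: once for existence-only evidence on an ADD row, and once because its reviewed unit test is absent from the scenario-level list.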
@@ -205,11 +226,13 @@ If a context description was provided, enhance the test with:
 - Verify actual file paths by running the tool first — never hardcode paths from documentation or assumptions
 - Use explicit `&& echo "PASS" || echo "FAIL"` patterns for every verification step
 - Check specific exit codes for error commands (not just "non-zero")
+- Add at least one output-content assertion for each command being verified
 
 **SHOULD (strongly recommended):**
 - Test the real user journey — structure TCs as a sequential workflow, not isolated commands
 - Verify exit codes for all commands, not just error cases
 - Include negative assertions (files/directories that should NOT exist)
+- Capture and retain command output for all assertions (`stdout`, `stderr`, and `*.exit`)
 - Capture and check CLI output content, not just exit codes
 - Verify that status values match actual implementation (e.g., `done` vs `completed`)
 
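The MUST rules pair a specific exit-code check with an output-content assertion. A minimal Ruby sketch of one such verification step, printing a PASS/FAIL line per check; `verify_command` is hypothetical, and the exit code `1` for an unknown package is an assumed value, not one documented in this diff.

```ruby
# Hypothetical verification step: one PASS/FAIL line per check, in the spirit
# of the `&& echo "PASS" || echo "FAIL"` pattern. Returns true only if all pass.
def verify_command(label, exit_code:, expected_exit:, output:, expect_output:)
  checks = {
    "exit code is #{expected_exit}" => exit_code == expected_exit,
    "output matches #{expect_output.inspect}" => expect_output.match?(output)
  }
  checks.each { |what, ok| puts "#{ok ? 'PASS' : 'FAIL'}: #{label}: #{what}" }
  checks.values.all?
end

# Illustrative values: exit code 1 is an assumption; the message is modeled on
# the "Package '{package}' not found." text used in the review workflow.
verify_command("unknown package",
               exit_code: 1, expected_exit: 1,
               output: "Package 'ace-nope' not found.\n",
               expect_output: /Package 'ace-nope' not found\./)
```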
@@ -392,4 +415,4 @@ Area codes must be:
 - 2-10 characters
 - Alphanumeric only
 - Will be converted to uppercase
-```
+```
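The three area-code rules translate into a small normalizer. A sketch, assuming "alphanumeric" means ASCII letters and digits; `normalize_area_code` is a hypothetical name, not a function from the package.

```ruby
# Hypothetical helper applying the area-code rules: 2-10 characters,
# alphanumeric only (assumed ASCII), converted to uppercase.
def normalize_area_code(code)
  unless code.match?(/\A[A-Za-z0-9]{2,10}\z/)
    raise ArgumentError, "area code must be 2-10 alphanumeric characters: #{code.inspect}"
  end

  code.upcase
end

puts normalize_area_code("mono") # prints MONO
```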
data/handbook/workflow-instructions/e2e/review.wf.md
CHANGED
@@ -117,19 +117,21 @@ find {PACKAGE}/test/e2e -name "scenario.yml" -path "*/TS-*" 2>/dev/null | sort
 - `last-verified`, `verified-by`
 - Extract the objective (what the TC verifies)
 - Identify which CLI commands the TC runs
+- Record command fingerprint (`command + key flags`) for each command assertion
 - Count verification steps (PASS/FAIL checks)
 - Map to the feature it tests
 - Mark TC evidence status:
-- `complete` when `e2e-justification` is present and `unit-coverage-reviewed` has at least one path
+- `complete` when `e2e-justification` is present, command artifacts are present, and `unit-coverage-reviewed` has at least one path
 - `missing` otherwise
+- `at-risk` when evidence is existence-only or duplicate command invocations are detected
 
 If `--scope` was provided, filter to only the specified scenario.
 
 Build an E2E test map:
 
-| TC ID | Title |
+| TC ID | Title | Command Invocations | Feature Tested | Verifications | Tags | Cost Tier | E2E Justification | Unit Coverage Reviewed | Evidence | False-Positive Risk |
 |-------|-------|-------------|----------------|---------------|------|-----------|-------------------|------------------------|----------|
-| {id} | {title} | {command} | {feature} | {n} | {tags} | {tier} | {reason or "(missing)"} | {files or "(missing)"} | {complete/missing} |
+| {id} | {title} | {command list} | {feature} | {n} | {tags} | {tier} | {reason or "(missing)"} | {files or "(missing)"} | {complete/missing/at-risk} | {low/medium/high} |
 
 ### 5. Build Coverage Matrix
 
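The review step now records a command fingerprint and an `at-risk` status. One way to compute both, sketched under assumptions the workflow does not spell out: the fingerprint keeps the executable, leading subcommands and sorted flag names (dropping flag values), duplicates are fingerprints shared by more than one TC, and `at-risk` takes precedence over `complete`/`missing`. The `ace-example` CLI and the TC hash keys are illustrative only.

```ruby
require "shellwords"

# Hypothetical fingerprint: executable and leading subcommands, plus sorted
# flag names with their values dropped ("--scope=TS-1" becomes "--scope").
def command_fingerprint(command)
  tokens = Shellwords.split(command)
  base = tokens.take_while { |t| !t.start_with?("-") }
  flags = tokens.select { |t| t.start_with?("-") }.map { |t| t.split("=", 2).first }
  (base + flags.uniq.sort).join(" ")
end

# Fingerprints that appear in more than one TC.
def duplicate_fingerprints(tcs)
  counts = Hash.new(0)
  tcs.each do |tc|
    tc[:commands].map { |c| command_fingerprint(c) }.uniq.each { |f| counts[f] += 1 }
  end
  counts.select { |_, n| n > 1 }.keys
end

# at-risk is checked first; this precedence is an assumption.
def evidence_status(tc, duplicates)
  prints = tc[:commands].map { |c| command_fingerprint(c) }
  if tc[:existence_only] || prints.any? { |f| duplicates.include?(f) }
    "at-risk"
  elsif tc[:justification] && tc[:has_artifacts] && !Array(tc[:unit_coverage]).empty?
    "complete"
  else
    "missing"
  end
end

# Illustrative TCs for a made-up `ace-example` CLI.
tcs = [
  { commands: ["ace-example init --force"], justification: "needs real fs",
    has_artifacts: true, unit_coverage: ["test/init_test.rb"] },
  { commands: ["ace-example status"], justification: nil,
    has_artifacts: true, unit_coverage: ["test/status_test.rb"] },
  { commands: ["ace-example build --out=a"], justification: "spawns a subprocess",
    has_artifacts: true, unit_coverage: ["test/build_test.rb"] },
  { commands: ["ace-example build --out b"], justification: "spawns a subprocess",
    has_artifacts: true, unit_coverage: ["test/build_test.rb"] }
]
dups = duplicate_fingerprints(tcs)
p tcs.map { |tc| evidence_status(tc, dups) }
# prints ["complete", "missing", "at-risk", "at-risk"]
```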
@@ -143,13 +145,13 @@ Combine the three inventories into a single coverage matrix:
 ```markdown
 ### Coverage Matrix
 
-| Feature | Unit Tests | E2E Tests | Status |
-
-| {feature} | {test files} ({n} assertions) | {TC IDs} ({n} verifications) | Covered |
-| {feature} | {test files} ({n} assertions) | none | Unit-only |
-| {feature} | none | {TC IDs} ({n} verifications) | E2E-only |
-| {feature} | {test files} ({n} assertions) | {TC IDs} ({n} verifications) | Overlap |
-| {feature} | none | none | Gap |
+| Feature | Unit Tests | E2E Tests | Evidence Strength | False-Positive Risk | Status |
+|---------|-----------|-----------|------------------|----------------------|--------|
+| {feature} | {test files} ({n} assertions) | {TC IDs} ({n} verifications) | command-output/state+content | low | Covered |
+| {feature} | {test files} ({n} assertions) | none | none | n/a | Unit-only |
+| {feature} | none | {TC IDs} ({n} verifications) | command-output | low | E2E-only |
+| {feature} | {test files} ({n} assertions) | {TC IDs} ({n} verifications) | command-output or existence-only | medium/high | Overlap |
+| {feature} | none | none | none | high | Gap |
 ```
 
 **Classify each row:**
@@ -158,6 +160,7 @@ Combine the three inventories into a single coverage matrix:
 - **E2E-only** — E2E test exists but no unit test. Valid if the behavior is inherently E2E (subprocess execution, filesystem discovery).
 - **Overlap** — Both unit and E2E test the same assertions. E2E TC is a candidate for removal.
 - **Gap** — Neither unit nor E2E test covers this feature. Needs investigation.
+- If a row has `false-positive risk` `high`, downgrade Covered/Overlap to **manual-review** until evidence is corrected.
 
 ### 6. Generate Review Report
 
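The row classification, including the new high-risk downgrade, fits in a single function. In this hypothetical sketch, `same_assertions` stands in for the reviewer's judgment of whether unit and E2E tests check the same behavior, and `Covered` is assumed to mean both layers exist without full overlap, since its definition sits outside this hunk.

```ruby
# Hypothetical classifier for one coverage-matrix row. `same_assertions` stands
# in for the reviewer's judgment that unit and E2E tests check the same thing.
def coverage_status(unit_tests:, e2e_tcs:, same_assertions:, false_positive_risk:)
  status =
    if unit_tests.empty? && e2e_tcs.empty? then "Gap"
    elsif e2e_tcs.empty? then "Unit-only"
    elsif unit_tests.empty? then "E2E-only"
    elsif same_assertions then "Overlap"
    else "Covered"
    end

  # High false-positive risk downgrades Covered/Overlap until evidence is fixed.
  if false_positive_risk == "high" && %w[Covered Overlap].include?(status)
    "manual-review"
  else
    status
  end
end

puts coverage_status(unit_tests: ["test/a_test.rb"], e2e_tcs: ["TC-001"],
                     same_assertions: false, false_positive_risk: "high")
# prints manual-review
```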
@@ -180,6 +183,7 @@ Produce the full review report with actionable findings:
 | E2E scenarios | {n} |
 | E2E test cases | {n} |
 | TCs with decision evidence | {n}/{total} |
+| High-risk false-positive TCs | {n}/{total} |
 
 ### Coverage Matrix
 
@@ -187,12 +191,13 @@ Produce the full review report with actionable findings:
 
 ### Overlap Analysis
 
-TCs that may fail the E2E Value Gate (unit tests cover the same behavior):
+TCs that may fail the E2E Value Gate (unit tests cover the same behavior or high false-positive risk):
 
 | TC ID | Feature | Overlapping Unit Tests | Recommendation |
 |-------|---------|----------------------|----------------|
 | {id} | {feature} | {test files} | Remove — unit tests cover this fully |
 | {id} | {feature} | {test files} | Keep — TC tests CLI pipeline, units test logic |
+| {id} | {feature} | {test files} | Strengthen — currently existence-only or duplicate command assertions |
 
 **Candidates for removal:** {n} TCs have full overlap with unit tests
 
@@ -283,4 +288,4 @@ Package '{package}' not found.
 
 Available packages:
 {list of ace-* directories}
-```
+```
data/handbook/workflow-instructions/e2e/rewrite.wf.md
CHANGED
@@ -125,6 +125,11 @@ Follow the E2E test writing rules:
 - Consolidate assertions sharing the same CLI invocation into a single TC
 - Target 2-5 TCs per scenario
 - Test through the CLI interface, not library imports
+- Add command-level evidence in every runner:
+- command output (`*.stdout`/`*.stderr`)
+- command exit status (`*.exit`)
+- Add at least one behavioral/content assertion per command assertion set
+- Remove duplicate command-only TCs; fold related assertions into one TC where possible
 
 **Load the TC template for reference:**
 ```bash
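Runners now have to keep `*.stdout`, `*.stderr` and `*.exit` for each command. A minimal sketch using Ruby's standard `Open3.capture3`; the helper name `capture_command`, the `<name>.<kind>` file naming, and the one-line exit-file format are assumptions, not the package's actual runner contract.

```ruby
require "open3"
require "fileutils"
require "rbconfig"
require "tmpdir"

# Hypothetical runner helper: runs one command and keeps its evidence as
# <name>.stdout, <name>.stderr and <name>.exit inside out_dir.
def capture_command(name, argv, out_dir:, chdir: Dir.pwd)
  FileUtils.mkdir_p(out_dir)
  # chdir: is a spawn option, so only the child process changes directory.
  stdout, stderr, status = Open3.capture3(*argv, chdir: chdir)
  File.write(File.join(out_dir, "#{name}.stdout"), stdout)
  File.write(File.join(out_dir, "#{name}.stderr"), stderr)
  File.write(File.join(out_dir, "#{name}.exit"), "#{status.exitstatus}\n")
  status.exitstatus
end

Dir.mktmpdir do |dir|
  code = capture_command("tc01", [RbConfig.ruby, "-e", "puts 'hello'; exit 3"], out_dir: dir)
  puts code # prints 3
  puts Dir.children(dir).sort.inspect # prints ["tc01.exit", "tc01.stderr", "tc01.stdout"]
end
```

Keeping the exit status in its own file lets a verifier assert on the exact code later, instead of relying only on whether the runner itself succeeded.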
@@ -141,6 +146,7 @@ For each TC classified as MODIFY:
 - **Narrow scope** — remove assertions that unit tests cover, keep only E2E-exclusive checks
 - **Broaden scope** — add assertions for related behavior tested by the same CLI invocation
 - **Fix structure** — add missing sections, fix formatting issues
+- **Add evidence gates** — if the existing TC relies on existence-only or missing exit/status checks, add explicit command output assertions and `.exit` captures
 3. Update the `last-verified` field if the TC was re-run during modification
 4. Write the updated TC runner/verifier files
 
@@ -228,6 +234,7 @@ Present the execution summary:
 - [ ] TC count matches plan: {yes/no}
 - [ ] No stale references: {yes/no}
 - [ ] All scenarios have 2-5 TCs: {yes/no}
+- [ ] All modified/created TCs include command output + exit artifacts: {yes/no}
 
 ### Next Steps
 
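The new checklist item can be answered by scanning TC directories for artifacts. This sketch assumes one directory per TC containing files named with the `*.stdout`/`*.stderr`/`*.exit` suffixes from the runner rules; both helpers are hypothetical.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical layout: one directory per TC that holds its captured artifacts.
def missing_artifacts(tc_dir)
  missing = []
  if Dir.glob("*.{stdout,stderr}", base: tc_dir).empty?
    missing << "command output (*.stdout/*.stderr)"
  end
  missing << "exit status (*.exit)" if Dir.glob("*.exit", base: tc_dir).empty?
  missing
end

# Answer for the checklist line: "yes" only if every TC directory passes.
def artifacts_checklist_answer(tc_dirs)
  tc_dirs.all? { |dir| missing_artifacts(dir).empty? } ? "yes" : "no"
end

Dir.mktmpdir do |root|
  tc = File.join(root, "TC-001")
  FileUtils.mkdir_p(tc)
  File.write(File.join(tc, "tc01.stdout"), "ok\n")
  puts artifacts_checklist_answer([tc]) # prints no (no .exit file yet)
  File.write(File.join(tc, "tc01.exit"), "0\n")
  puts artifacts_checklist_answer([tc]) # prints yes
end
```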
@@ -278,4 +285,4 @@ If execution fails partway through:
 1. Report which actions completed and which failed
 2. Do not attempt to roll back completed actions
 3. Show the state of `{PACKAGE}/test/e2e/` after partial execution
-4. Suggest re-running with the remaining actions
+4. Suggest re-running with the remaining actions
data/lib/ace/test/end_to_end_runner/molecules/pipeline_executor.rb
CHANGED
@@ -102,18 +102,17 @@ module Ace
 system = File.read(system_path)
 sandbox_dir = env_vars["PROJECT_ROOT_PATH"] || env_vars[:PROJECT_ROOT_PATH]
 
-
-
-
-
-
-
-
-
-
-
-
-end
+Ace::LLM::QueryInterface.query(
+@provider,
+prompt,
+system: system,
+cli_args: cli_args,
+timeout: @timeout,
+fallback: false,
+output: output_path,
+subprocess_env: env_vars,
+working_dir: sandbox_dir
+)
 end
 end
 end
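The 0.29.8 fix replaces a process-wide `Dir.chdir` with a `working_dir:` argument. The sketch below shows why the old pattern breaks when scenarios run in parallel threads, and the general alternative. It uses Ruby's standard `chdir:` spawn option through `Open3.capture2` as a stand-in; how `Ace::LLM::QueryInterface.query` actually applies `working_dir` is not shown in this diff, and the helper names are hypothetical. On Ruby versions that only warn about the conflict, `chdir_conflict_message` returns `nil`.

```ruby
require "open3"
require "rbconfig"
require "tmpdir"

# Problem: Dir.chdir changes the working directory of the whole process. On
# Ruby 3.x, a second thread calling it while another thread is inside a
# Dir.chdir block raises RuntimeError. Returns that message, or nil if the
# running Ruby only warns.
def chdir_conflict_message
  entered = Queue.new
  release = Queue.new
  holder = Thread.new do
    Dir.chdir(Dir.tmpdir) do
      entered << true
      release.pop # hold the block open while the main thread tries to chdir
    end
  end
  entered.pop
  begin
    Dir.chdir(Dir.tmpdir) {}
    nil
  rescue RuntimeError => e
    e.message
  ensure
    release << true
    holder.join
  end
end

# Alternative: leave the parent's working directory alone and pass the
# directory to each child process through the standard chdir: spawn option.
def pwd_of_child(dir)
  out, _status = Open3.capture2(RbConfig.ruby, "-e", "print Dir.pwd", chdir: dir)
  out
end

# Runs one child per directory in parallel threads; no thread calls Dir.chdir.
def parallel_pwds(dirs)
  dirs.map { |dir| Thread.new { pwd_of_child(dir) } }.map(&:value)
end

puts chdir_conflict_message.inspect
Dir.mktmpdir do |a|
  Dir.mktmpdir do |b|
    p parallel_pwds([a, b]).map { |d| File.realpath(d) } == [File.realpath(a), File.realpath(b)]
  end
end
```

Passing the directory to the child keeps the parent's working directory untouched, so threads running scenarios concurrently no longer contend for shared process state.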
metadata
CHANGED
@@ -1,13 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: ace-test-runner-e2e
 version: !ruby/object:Gem::Version
-version: 0.29.
+version: 0.29.8
 platform: ruby
 authors:
 - Michal Czyz
 bindir: exe
 cert_chain: []
-date: 2026-04-
+date: 2026-04-05 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
 name: ace-support-cli