ace-test-runner-e2e 0.29.2 → 0.29.8

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: dde42f8b80c7e0a73e15b49c75c855de309a131e85e563e11229649b58ecbe80
-   data.tar.gz: 705e034b6dff3495dc2c442ddc85ebc186165d9c09160e78e2dc324c42afe526
+   metadata.gz: 6206e4d6f65fe1ab5c27d1b5479e37af079b3554bd6289786aa71ce62e4ecf50
+   data.tar.gz: bedb5fa2830bc1f2818e2246acbbab15ec6a7227898bf4dba2b23b6521eb8d5b
  SHA512:
-   metadata.gz: b935805e2fc496cf7b79526d29995ada1cf06cb03f9dbbda409a5f35b2f58492419a8f4b70b50b4224a09dec7452b5bb71cc8406b55c8c49f5566a60659506db
-   data.tar.gz: 89abb75c2dabb819e7068e0654bb5da1dc2a2319bbca78d0b8e1114ccb6776c3f3f79748596479da188820ddc60bb9481611fca64575d1f1638fd949a678e6e8
+   metadata.gz: 9d06bc8d9447debe2b48128b7c45ea0a357d01677b9b8b48fa508c5f8a078e8b4ed0396a4d4d38c06be25573701d4bd75b6f481438d116e3499b4b1890a9edd5
+   data.tar.gz: b148663600b83ffde9821761a1ef4a7b43d11a2efc10d97433bb090d4ae2bfc8f7dd997756190cdafc894c497f1eada99faf3b78dcf7df38ed7eec9d57cedec3
@@ -32,14 +32,14 @@ cleanup:
  # Reporting defaults (suite final report LLM synthesis)
  reporting:
    # LLM model alias for suite report generation
-   model: "glite"
+   model: "role:e2e-reporter"
    # Timeout in seconds for report generation
    timeout: 60

  # Execution defaults
  execution:
    # Default LLM provider:model for test execution
-   provider: "claude:sonnet@yolo"
+   provider: "role:e2e-executor"
    # Timeout per test in seconds
    timeout: 600
    # Number of tests to run in parallel (1 = sequential)
data/CHANGELOG.md CHANGED
@@ -7,6 +7,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

  ## [Unreleased]

+ ## [0.29.8] - 2026-04-01
+
+ ### Fixed
+ - Replaced process-global `Dir.chdir` in pipeline LLM execution with explicit `working_dir` threading to avoid parallel scenario crashes (`RuntimeError: conflicting chdir during another chdir block`).
+
+ ### Changed
+ - **ace-monorepo-e2e**: Added stronger command/output evidence gates to `TS-MONO-001-rubygems-install` and `TS-MONO-002-quickstart-local` so local sandbox installs and quick-start workflow checks validate real CLI behavior, output, and exit status rather than directory/file presence alone.
+ - **ace-monorepo-e2e**: Updated `ace-test-runner-e2e` workflow instructions and scenario template defaults to reduce false-positive E2E tests through command-level evidence, false-positive risk tagging, and duplicate-command consolidation rules.
+
+ ## [0.29.6] - 2026-04-01
+
+ ### Fixed
+ - Resolved `role:` provider references in CLI provider detection so sandbox isolation and pipeline execution apply when using role-based model selectors like `role:e2e-executor`.
+
+ ## [0.29.5] - 2026-04-01
+
+ ### Fixed
+ - Changed pipeline executor to `Dir.chdir` into sandbox before launching the LLM agent, preventing artifact leaks to the repo root.
+
+ ## [0.29.4] - 2026-03-31
+
+ ### Changed
+ - Role-based E2E runner model defaults.
+
+ ## [0.29.3] - 2026-03-29
+
+ ### Changed
+ - Role-based e2e execution and reporting defaults.
+
+
  ## [0.29.2] - 2026-03-29

  ### Technical
@@ -23,6 +23,12 @@ tags: [{cost-tier}, "use-case:{area}"]
  # Optional: Why this scenario must be E2E (not unit-only)
  e2e-justification: "{Requires real CLI/tools/filesystem behavior}"

+ # Optional: Evidence quality target for review coverage (`command-output`, `state+content`, `existence-only`)
+ e2e-evidence-strength: command-output
+
+ # Optional: False-positive risk estimate (`low`, `medium`, `high`)
+ e2e-false-positive-risk: low
+
  # Optional: Unit test files reviewed during Value Gate analysis
  unit-coverage-reviewed:
    - test/{layer}/{file}_test.rb
@@ -163,7 +163,28 @@ All proposed behaviors are already covered by unit tests in {PACKAGE}/test/.
  No E2E test needed. Consider adding unit tests instead if coverage gaps exist.
  ```

- ### 7a. E2E Decision Record (Required)
+ ### 7a. Evidence-Gate Review Before Writing Files
+
+ Before finalizing the test plan, block weak coverage patterns:
+ - **Existence-only TC**:
+   - only checks directory/file existence
+   - no command output/content assertion
+   - missing `*.exit` capture for the executed command
+ - **Duplicate-invocation TC**:
+   - same command invocation, same purpose, split across multiple TCs
+
+ | TC ID | Decision (KEEP/ADD/SKIP) | Evidence Strength | E2E-only reason | Unit tests reviewed |
+ |-------|---------------------------|------------------|-----------------|--------------------|
+ | {tc-id} | {decision} | `command-output` | {why this needs real CLI/tools/fs} | {path1,path2} |
+
+ Rules:
+ - `existence-only` is never valid for KEEP/ADD. Use it only for SKIP rows with explicit unit-test replacement.
+ - `SKIP` rows must include replacement unit-test evidence.
+ - Non-skipped rows must include command-level artifacts (`stdout`, `stderr`, `exit`, and/or explicit proof files).
+ - At least one `unit tests reviewed` path is required for every row.
+ - The scenario-level `unit-coverage-reviewed` field must include the union of all referenced unit test files.
+
+ ### 7b. E2E Decision Record (Required)

  Before writing files, produce a decision record table for every candidate TC:
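
The evidence-gate rules added in this hunk can be read as a small validation pass over the decision table. The Ruby sketch below is illustrative only: `DecisionRow`, `evidence_gate_violations`, `unit_coverage_union`, and the artifact labels are hypothetical names chosen for this example and are not part of the package.

```ruby
# Hypothetical row shape for one line of the decision table.
DecisionRow = Struct.new(:tc_id, :decision, :evidence_strength, :artifacts, :unit_tests,
                         keyword_init: true)

# Assumed labels for command-level artifacts (stdout, stderr, exit, proof files).
COMMAND_ARTIFACTS = %w[stdout stderr exit proof].freeze

# Returns the list of rule violations for a single row (empty when the row passes).
def evidence_gate_violations(row)
  violations = []
  skipped = row.decision == "SKIP"

  if row.evidence_strength == "existence-only" && !skipped
    violations << "existence-only is never valid for KEEP/ADD"
  end
  if !skipped && (row.artifacts & COMMAND_ARTIFACTS).empty?
    violations << "non-skipped rows need command-level artifacts"
  end
  # Covers both "every row needs a unit test path" and "SKIP needs replacement evidence".
  violations << "at least one unit test path is required" if row.unit_tests.empty?
  violations
end

# Scenario-level unit-coverage-reviewed: the union of all referenced unit test files.
def unit_coverage_union(rows)
  rows.flat_map(&:unit_tests).uniq.sort
end
```

A row marked KEEP with `existence-only` evidence, no artifacts, and no unit test paths would fail all three checks under these assumptions.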
 
@@ -205,11 +226,13 @@ If a context description was provided, enhance the test with:
  - Verify actual file paths by running the tool first — never hardcode paths from documentation or assumptions
  - Use explicit `&& echo "PASS" || echo "FAIL"` patterns for every verification step
  - Check specific exit codes for error commands (not just "non-zero")
+ - Add at least one output-content assertion for each command being verified

  **SHOULD (strongly recommended):**
  - Test the real user journey — structure TCs as a sequential workflow, not isolated commands
  - Verify exit codes for all commands, not just error cases
  - Include negative assertions (files/directories that should NOT exist)
+ - Capture and retain command output for all assertions (`stdout`, `stderr`, and `*.exit`)
  - Capture and check CLI output content, not just exit codes
  - Verify that status values match actual implementation (e.g., `done` vs `completed`)
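
One way to retain `stdout`, `stderr`, and `*.exit` evidence for a command is sketched below in Ruby. This is an assumption-laden example, not the runner's implementation: `capture_evidence` and the `<name>.stdout` / `<name>.stderr` / `<name>.exit` file layout are hypothetical. Only `Open3.capture3` from the Ruby standard library is relied on.

```ruby
require "open3"

# Runs a command and stores its output and exit status as
# <name>.stdout, <name>.stderr, and <name>.exit under results_dir.
# The file naming is an assumption made for this sketch.
def capture_evidence(name, *command, results_dir:)
  stdout, stderr, status = Open3.capture3(*command)

  File.write(File.join(results_dir, "#{name}.stdout"), stdout)
  File.write(File.join(results_dir, "#{name}.stderr"), stderr)
  File.write(File.join(results_dir, "#{name}.exit"), status.exitstatus.to_s)

  { stdout: stdout, stderr: stderr, exit: status.exitstatus }
end
```

A verifier can then assert on the returned hash or on the files, which gives both an output-content check and an exact exit-code check for the same invocation.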
 
@@ -392,4 +415,4 @@ Area codes must be:
  - 2-10 characters
  - Alphanumeric only
  - Will be converted to uppercase
- ```
+ ```
@@ -117,19 +117,21 @@ find {PACKAGE}/test/e2e -name "scenario.yml" -path "*/TS-*" 2>/dev/null | sort
  - `last-verified`, `verified-by`
  - Extract the objective (what the TC verifies)
  - Identify which CLI commands the TC runs
+ - Record command fingerprint (`command + key flags`) for each command assertion
  - Count verification steps (PASS/FAIL checks)
  - Map to the feature it tests
  - Mark TC evidence status:
-   - `complete` when `e2e-justification` is present and `unit-coverage-reviewed` has at least one path
+   - `complete` when `e2e-justification` is present, command artifacts are present, and `unit-coverage-reviewed` has at least one path
    - `missing` otherwise
+   - `at-risk` when evidence is existence-only or duplicate command invocations are detected

  If `--scope` was provided, filter to only the specified scenario.

  Build an E2E test map:

- | TC ID | Title | CLI Command | Feature Tested | Verifications | Tags | Cost Tier | E2E Justification | Unit Coverage Reviewed | Evidence |
+ | TC ID | Title | Command Invocations | Feature Tested | Verifications | Tags | Cost Tier | E2E Justification | Unit Coverage Reviewed | Evidence | False-Positive Risk |
  |-------|-------|-------------|----------------|---------------|------|-----------|-------------------|------------------------|----------|
- | {id} | {title} | {command} | {feature} | {n} | {tags} | {tier} | {reason or "(missing)"} | {files or "(missing)"} | {complete/missing} |
+ | {id} | {title} | {command list} | {feature} | {n} | {tags} | {tier} | {reason or "(missing)"} | {files or "(missing)"} | {complete/missing/at-risk} | {low/medium/high} |

  ### 5. Build Coverage Matrix
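
The three evidence states and the command fingerprint from this hunk could be computed roughly as follows. This Ruby sketch is hypothetical: `evidence_status` and `command_fingerprint` are names invented here, the fingerprint format is a guess at what "command + key flags" means, and checking `at-risk` before `complete` is an assumption because the source lists the states without a precedence.

```ruby
# Classifies a TC's evidence status.
# Assumption: `at-risk` takes precedence over `complete` and `missing`.
def evidence_status(justification:, unit_coverage:, artifacts:,
                    existence_only: false, duplicate_invocation: false)
  return "at-risk" if existence_only || duplicate_invocation

  complete = !justification.to_s.strip.empty? && !artifacts.empty? && !unit_coverage.empty?
  complete ? "complete" : "missing"
end

# Builds a simplistic fingerprint: positional words in order, then sorted flags.
# Flags that take a separate value (e.g. `--out file`) are not handled specially.
def command_fingerprint(command_line)
  flags, positional = command_line.split.partition { |word| word.start_with?("-") }
  (positional + flags.sort).join(" ")
end
```

Two TCs whose commands produce the same fingerprint would be candidates for the duplicate-invocation check.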
 
@@ -143,13 +145,13 @@ Combine the three inventories into a single coverage matrix:
  ```markdown
  ### Coverage Matrix

- | Feature | Unit Tests | E2E Tests | Status |
- |---------|-----------|-----------|--------|
- | {feature} | {test files} ({n} assertions) | {TC IDs} ({n} verifications) | Covered |
- | {feature} | {test files} ({n} assertions) | none | Unit-only |
- | {feature} | none | {TC IDs} ({n} verifications) | E2E-only |
- | {feature} | {test files} ({n} assertions) | {TC IDs} ({n} verifications) | Overlap |
- | {feature} | none | none | Gap |
+ | Feature | Unit Tests | E2E Tests | Evidence Strength | False-Positive Risk | Status |
+ |---------|-----------|-----------|------------------|----------------------|--------|
+ | {feature} | {test files} ({n} assertions) | {TC IDs} ({n} verifications) | command-output/state+content | low | Covered |
+ | {feature} | {test files} ({n} assertions) | none | none | n/a | Unit-only |
+ | {feature} | none | {TC IDs} ({n} verifications) | command-output | low | E2E-only |
+ | {feature} | {test files} ({n} assertions) | {TC IDs} ({n} verifications) | command-output or existence-only | medium/high | Overlap |
+ | {feature} | none | none | none | high | Gap |
  ```

  **Classify each row:**
@@ -158,6 +160,7 @@ Combine the three inventories into a single coverage matrix:
  - **E2E-only** — E2E test exists but no unit test. Valid if the behavior is inherently E2E (subprocess execution, filesystem discovery).
  - **Overlap** — Both unit and E2E test the same assertions. E2E TC is a candidate for removal.
  - **Gap** — Neither unit nor E2E test covers this feature. Needs investigation.
+ - If a row has `false-positive risk` `high`, downgrade Covered/Overlap to **manual-review** until evidence is corrected.

  ### 6. Generate Review Report
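
The row classification plus the new downgrade rule can be sketched as a single function. This Ruby example is a hypothetical reading of the rules above: `coverage_status` and its `overlap:` flag (meaning unit and E2E tests check the same assertions) are assumptions introduced for illustration.

```ruby
# Derives a coverage-matrix status for one feature row, then applies the
# high false-positive-risk downgrade to Covered/Overlap rows.
def coverage_status(unit_tests:, e2e_tests:, overlap: false, false_positive_risk: "low")
  status =
    if unit_tests.empty? && e2e_tests.empty?
      "Gap"
    elsif e2e_tests.empty?
      "Unit-only"
    elsif unit_tests.empty?
      "E2E-only"
    elsif overlap
      "Overlap"
    else
      "Covered"
    end

  needs_review = false_positive_risk == "high" && %w[Covered Overlap].include?(status)
  needs_review ? "manual-review" : status
end
```

Under this reading, Gap, Unit-only, and E2E-only rows keep their status even at high risk, since the rule names only Covered and Overlap.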
 
@@ -180,6 +183,7 @@ Produce the full review report with actionable findings:
  | E2E scenarios | {n} |
  | E2E test cases | {n} |
  | TCs with decision evidence | {n}/{total} |
+ | High-risk false-positive TCs | {n}/{total} |

  ### Coverage Matrix
 
@@ -187,12 +191,13 @@ Produce the full review report with actionable findings:

  ### Overlap Analysis

- TCs that may fail the E2E Value Gate (unit tests cover the same behavior):
+ TCs that may fail the E2E Value Gate (unit tests cover the same behavior or high false-positive risk):

  | TC ID | Feature | Overlapping Unit Tests | Recommendation |
  |-------|---------|----------------------|----------------|
  | {id} | {feature} | {test files} | Remove — unit tests cover this fully |
  | {id} | {feature} | {test files} | Keep — TC tests CLI pipeline, units test logic |
+ | {id} | {feature} | {test files} | Strengthen — currently existence-only or duplicate command assertions |

  **Candidates for removal:** {n} TCs have full overlap with unit tests
 
@@ -283,4 +288,4 @@ Package '{package}' not found.

  Available packages:
  {list of ace-* directories}
- ```
+ ```
@@ -125,6 +125,11 @@ Follow the E2E test writing rules:
  - Consolidate assertions sharing the same CLI invocation into a single TC
  - Target 2-5 TCs per scenario
  - Test through the CLI interface, not library imports
+ - Add command-level evidence in every runner:
+   - command output (`*.stdout`/`*.stderr`)
+   - command exit status (`*.exit`)
+ - Add at least one behavioral/content assertion per command assertion set
+ - Remove duplicate command-only TCs; fold related assertions into one TC where possible

  **Load the TC template for reference:**
  ```bash
@@ -141,6 +146,7 @@ For each TC classified as MODIFY:
     - **Narrow scope** — remove assertions that unit tests cover, keep only E2E-exclusive checks
     - **Broaden scope** — add assertions for related behavior tested by the same CLI invocation
     - **Fix structure** — add missing sections, fix formatting issues
+    - **Add evidence gates** — if the existing TC relies on existence-only or missing exit/status checks, add explicit command output assertions and `.exit` captures
  3. Update the `last-verified` field if the TC was re-run during modification
  4. Write the updated TC runner/verifier files
 
@@ -228,6 +234,7 @@ Present the execution summary:
  - [ ] TC count matches plan: {yes/no}
  - [ ] No stale references: {yes/no}
  - [ ] All scenarios have 2-5 TCs: {yes/no}
+ - [ ] All modified/created TCs include command output + exit artifacts: {yes/no}

  ### Next Steps
 
@@ -278,4 +285,4 @@ If execution fails partway through:
  1. Report which actions completed and which failed
  2. Do not attempt to roll back completed actions
  3. Show the state of `{PACKAGE}/test/e2e/` after partial execution
- 4. Suggest re-running with the remaining actions
+ 4. Suggest re-running with the remaining actions
@@ -33,11 +33,13 @@ module Ace

  # Instance method: check if a provider string refers to a CLI provider
  #
- # @param provider_string [String] Provider:model string
+ # Resolves role: references to their concrete provider before checking.
+ #
+ # @param provider_string [String] Provider:model string (e.g., "claude:sonnet", "role:e2e-executor")
  # @return [Boolean]
  def cli_provider?(provider_string)
-   name = self.class.provider_name(provider_string)
-   @cli_providers.include?(name)
+   resolved = resolve_provider_name(provider_string)
+   @cli_providers.include?(resolved)
  end

  def build_execution_prompt(command:, tc_mode:)
@@ -139,6 +141,23 @@ module Ace
  PROMPT
  end

+ private
+
+ # Resolve the bare provider name from a provider string.
+ # For role: references, resolves via ProviderModelParser to find the
+ # concrete provider (e.g. "role:e2e-executor" → "claude").
+ def resolve_provider_name(provider_string)
+   name = self.class.provider_name(provider_string)
+   return name unless name == "role"
+
+   parse_result = Ace::LLM::Molecules::ProviderModelParser.new.parse(provider_string)
+   parse_result.valid? ? parse_result.provider : name
+ rescue
+   name
+ end
+
+ public
+
  # Lazily-loaded default instance backed by ConfigLoader
  # @return [CliProviderAdapter]
  def self.default_instance
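
The effect of resolving `role:` references before the CLI-provider check can be illustrated with a standalone Ruby sketch. Everything here is hypothetical: `ROLE_TABLE`, `CLI_PROVIDERS`, and the `"claude:sonnet"` mapping are invented for the example, and the real lookup goes through `Ace::LLM::Molecules::ProviderModelParser`, whose internals are not shown in this diff.

```ruby
# Hypothetical role table and CLI provider list, for illustration only.
ROLE_TABLE = { "e2e-executor" => "claude:sonnet" }.freeze
CLI_PROVIDERS = %w[claude].freeze

# Returns the part before the first colon, e.g. "claude:sonnet" -> "claude".
def provider_name(provider_string)
  provider_string.to_s.split(":", 2).first
end

# Resolves "role:<name>" to the provider behind the role; other strings pass through.
# Unknown roles fall back to the literal name "role".
def resolve_provider_name(provider_string, roles: ROLE_TABLE)
  name = provider_name(provider_string)
  return name unless name == "role"

  target = roles[provider_string.split(":", 2).last]
  target ? provider_name(target) : name
end

def cli_provider?(provider_string, cli_providers: CLI_PROVIDERS)
  cli_providers.include?(resolve_provider_name(provider_string))
end
```

Without the resolution step, `"role:e2e-executor"` would be compared as the literal name `"role"` and never match a CLI provider, which is consistent with the 0.29.6 changelog entry.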
@@ -100,7 +100,7 @@ module Ace
    def run_llm(prompt_path:, system_path:, output_path:, cli_args:, env_vars:)
      prompt = File.read(prompt_path)
      system = File.read(system_path)
-     working_dir = env_vars["PROJECT_ROOT_PATH"] || env_vars[:PROJECT_ROOT_PATH]
+     sandbox_dir = env_vars["PROJECT_ROOT_PATH"] || env_vars[:PROJECT_ROOT_PATH]

      Ace::LLM::QueryInterface.query(
        @provider,
@@ -110,8 +110,8 @@ module Ace
        timeout: @timeout,
        fallback: false,
        output: output_path,
-       working_dir: working_dir,
-       subprocess_env: env_vars
+       subprocess_env: env_vars,
+       working_dir: sandbox_dir
      )
    end
  end
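
The changelog attributes the parallel crashes to process-global `Dir.chdir`. Passing the directory to the child process instead leaves the parent's working directory alone. The Ruby sketch below shows that general technique with the standard library's `Open3.capture2e` and its `chdir:` spawn option. It is not the package's code: `run_in_dir` and `parallel_pwds` are hypothetical helpers, and how `Ace::LLM::QueryInterface.query` consumes `working_dir` internally is not visible in this diff.

```ruby
require "open3"
require "rbconfig"

# Runs a child process inside working_dir without changing the parent
# process's current directory, so concurrent threads cannot conflict.
def run_in_dir(working_dir, *command)
  output, status = Open3.capture2e(*command, chdir: working_dir)
  [output.strip, status.success?]
end

# Asks a child Ruby process for its working directory from several threads at once.
def parallel_pwds(dirs)
  threads = dirs.map do |dir|
    Thread.new { run_in_dir(dir, RbConfig.ruby, "-e", "puts Dir.pwd").first }
  end
  threads.map(&:value)
end
```

Each thread sees its own directory in the child, while `Dir.pwd` in the parent stays unchanged throughout.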
@@ -3,7 +3,7 @@
  module Ace
    module Test
      module EndToEndRunner
-       VERSION = '0.29.2'
+       VERSION = '0.29.8'
      end
    end
  end
metadata CHANGED
@@ -1,13 +1,13 @@
  --- !ruby/object:Gem::Specification
  name: ace-test-runner-e2e
  version: !ruby/object:Gem::Version
-   version: 0.29.2
+   version: 0.29.8
  platform: ruby
  authors:
  - Michal Czyz
  bindir: exe
  cert_chain: []
- date: 2026-03-29 00:00:00.000000000 Z
+ date: 2026-04-05 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: ace-support-cli