ace-test-runner-e2e 0.29.2 → 0.29.8 (RubyGems version diff)

This diff shows the content changes between two publicly released versions of the package, as they appear in the public registry. It is provided for informational purposes only.

Files changed (11)

checksums.yaml +4 -4
data/.ace-defaults/e2e-runner/config.yml +2 -2
data/CHANGELOG.md +30 -0
data/handbook/templates/scenario.yml.template.yml +6 -0
data/handbook/workflow-instructions/e2e/create.wf.md +25 -2
data/handbook/workflow-instructions/e2e/review.wf.md +17 -12
data/handbook/workflow-instructions/e2e/rewrite.wf.md +8 -1
data/lib/ace/test/end_to_end_runner/atoms/skill_prompt_builder.rb +22 -3
data/lib/ace/test/end_to_end_runner/molecules/pipeline_executor.rb +3 -3
data/lib/ace/test/end_to_end_runner/version.rb +1 -1
metadata +2 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: dde42f8b80c7e0a73e15b49c75c855de309a131e85e563e11229649b58ecbe80
-  data.tar.gz: 705e034b6dff3495dc2c442ddc85ebc186165d9c09160e78e2dc324c42afe526
+  metadata.gz: 6206e4d6f65fe1ab5c27d1b5479e37af079b3554bd6289786aa71ce62e4ecf50
+  data.tar.gz: bedb5fa2830bc1f2818e2246acbbab15ec6a7227898bf4dba2b23b6521eb8d5b
 SHA512:
-  metadata.gz: b935805e2fc496cf7b79526d29995ada1cf06cb03f9dbbda409a5f35b2f58492419a8f4b70b50b4224a09dec7452b5bb71cc8406b55c8c49f5566a60659506db
-  data.tar.gz: 89abb75c2dabb819e7068e0654bb5da1dc2a2319bbca78d0b8e1114ccb6776c3f3f79748596479da188820ddc60bb9481611fca64575d1f1638fd949a678e6e8
+  metadata.gz: 9d06bc8d9447debe2b48128b7c45ea0a357d01677b9b8b48fa508c5f8a078e8b4ed0396a4d4d38c06be25573701d4bd75b6f481438d116e3499b4b1890a9edd5
+  data.tar.gz: b148663600b83ffde9821761a1ef4a7b43d11a2efc10d97433bb090d4ae2bfc8f7dd997756190cdafc894c497f1eada99faf3b78dcf7df38ed7eec9d57cedec3

data/.ace-defaults/e2e-runner/config.yml CHANGED

@@ -32,14 +32,14 @@ cleanup:
 # Reporting defaults (suite final report LLM synthesis)
 reporting:
   # LLM model alias for suite report generation
-  model: "glite"
+  model: "role:e2e-reporter"
   # Timeout in seconds for report generation
   timeout: 60
 # Execution defaults
 execution:
   # Default LLM provider:model for test execution
-  provider: "claude:sonnet@yolo"
+  provider: "role:e2e-executor"
   # Timeout per test in seconds
   timeout: 600
   # Number of tests to run in parallel (1 = sequential)

data/CHANGELOG.md CHANGED

@@ -7,6 +7,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [0.29.8] - 2026-04-01
+### Fixed
+- Replaced process-global `Dir.chdir` in pipeline LLM execution with explicit `working_dir` threading to avoid parallel scenario crashes (`RuntimeError: conflicting chdir during another chdir block`).
+### Changed
+- **ace-monorepo-e2e**: Added stronger command/output evidence gates to `TS-MONO-001-rubygems-install` and `TS-MONO-002-quickstart-local` so local sandbox installs and quick-start workflow checks validate real CLI behavior, output, and exit status rather than directory/file presence alone.
+- **ace-monorepo-e2e**: Updated `ace-test-runner-e2e` workflow instructions and scenario template defaults to reduce false-positive E2E tests through command-level evidence, false-positive risk tagging, and duplicate-command consolidation rules.
+## [0.29.6] - 2026-04-01
+### Fixed
+- Resolved `role:` provider references in CLI provider detection so sandbox isolation and pipeline execution apply when using role-based model selectors like `role:e2e-executor`.
+## [0.29.5] - 2026-04-01
+### Fixed
+- Changed pipeline executor to `Dir.chdir` into sandbox before launching the LLM agent, preventing artifact leaks to the repo root.
+## [0.29.4] - 2026-03-31
+### Changed
+- Role-based E2E runner model defaults.
+## [0.29.3] - 2026-03-29
+### Changed
+- Role-based e2e execution and reporting defaults.
 ## [0.29.2] - 2026-03-29
 ### Technical

data/handbook/templates/scenario.yml.template.yml CHANGED

@@ -23,6 +23,12 @@ tags: [{cost-tier}, "use-case:{area}"]
 # Optional: Why this scenario must be E2E (not unit-only)
 e2e-justification: "{Requires real CLI/tools/filesystem behavior}"
+# Optional: Evidence quality target for review coverage (`command-output`, `state+content`, `existence-only`)
+e2e-evidence-strength: command-output
+# Optional: False-positive risk estimate (`low`, `medium`, `high`)
+e2e-false-positive-risk: low
 # Optional: Unit test files reviewed during Value Gate analysis
 unit-coverage-reviewed:
   - test/{layer}/{file}_test.rb
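
The two new optional fields take a closed set of values, listed in the template comments above. A minimal sketch of how a reviewer might check them, assuming the scenario file is loaded as a plain YAML hash; the helper name `evidence_field_errors` is hypothetical and not part of the gem:

```ruby
require "yaml"

# Allowed values, copied from the template comments in the hunk above.
EVIDENCE_STRENGTHS = %w[command-output state+content existence-only].freeze
FALSE_POSITIVE_RISKS = %w[low medium high].freeze

# Hypothetical helper (not part of the gem): lists problems found in the two
# new fields. Absent fields are accepted because both are marked optional.
def evidence_field_errors(scenario)
  errors = []
  strength = scenario["e2e-evidence-strength"]
  risk = scenario["e2e-false-positive-risk"]
  if strength && !EVIDENCE_STRENGTHS.include?(strength)
    errors << "e2e-evidence-strength: unknown value #{strength.inspect}"
  end
  if risk && !FALSE_POSITIVE_RISKS.include?(risk)
    errors << "e2e-false-positive-risk: unknown value #{risk.inspect}"
  end
  errors
end

scenario = YAML.safe_load(<<~SCENARIO)
  e2e-justification: "Requires real CLI/tools/filesystem behavior"
  e2e-evidence-strength: command-output
  e2e-false-positive-risk: low
SCENARIO

puts evidence_field_errors(scenario).empty? ? "ok" : "invalid"
```

Returning a list of messages rather than raising lets a review step report every bad field in one pass.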

data/handbook/workflow-instructions/e2e/create.wf.md CHANGED

@@ -163,7 +163,28 @@ All proposed behaviors are already covered by unit tests in {PACKAGE}/test/.
 No E2E test needed. Consider adding unit tests instead if coverage gaps exist.
 ```
-### 7a. E2E Decision Record (Required)
+### 7a. Evidence-Gate Review Before Writing Files
+Before finalizing the test plan, block weak coverage patterns:
+- **Existence-only TC**:
+  - only checks directory/file existence
+  - no command output/content assertion
+  - missing `*.exit` capture for the executed command
+- **Duplicate-invocation TC**:
+  - same command invocation, same purpose, split across multiple TCs
+| TC ID | Decision (KEEP/ADD/SKIP) | Evidence Strength | E2E-only reason | Unit tests reviewed |
+|-------|---------------------------|------------------|-----------------|--------------------|
+| {tc-id} | {decision} | `command-output` | {why this needs real CLI/tools/fs} | {path1,path2} |
+Rules:
+- `existence-only` is never valid for KEEP/ADD. Use it only for SKIP rows with explicit unit-test replacement.
+- `SKIP` rows must include replacement unit-test evidence.
+- Non-skipped rows must include command-level artifacts (`stdout`, `stderr`, `exit`, and/or explicit proof files).
+- At least one `unit tests reviewed` path is required for every row.
+- The scenario-level `unit-coverage-reviewed` field must include the union of all referenced unit test files.
+### 7b. E2E Decision Record (Required)
 Before writing files, produce a decision record table for every candidate TC:
@@ -205,11 +226,13 @@ If a context description was provided, enhance the test with:
 - Verify actual file paths by running the tool first — never hardcode paths from documentation or assumptions
 - Use explicit `&& echo "PASS" || echo "FAIL"` patterns for every verification step
 - Check specific exit codes for error commands (not just "non-zero")
+- Add at least one output-content assertion for each command being verified
 **SHOULD (strongly recommended):**
 - Test the real user journey — structure TCs as a sequential workflow, not isolated commands
 - Verify exit codes for all commands, not just error cases
 - Include negative assertions (files/directories that should NOT exist)
+- Capture and retain command output for all assertions (`stdout`, `stderr`, and `*.exit`)
 - Capture and check CLI output content, not just exit codes
 - Verify that status values match actual implementation (e.g., `done` vs `completed`)
@@ -392,4 +415,4 @@ Area codes must be:
 - 2-10 characters
 - Alphanumeric only
 - Will be converted to uppercase
-```
+```
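
The evidence-gate rules added in section 7a can be read as a small row validator. The sketch below is one possible reading, not the gem's implementation: the `DecisionRow` struct and its field names are hypothetical, and the "replacement unit-test evidence" required for SKIP rows is assumed to be the same list as the row's reviewed unit test paths.

```ruby
# Hypothetical row shape for the 7a table; field names are illustrative only.
DecisionRow = Struct.new(:tc_id, :decision, :evidence_strength, :artifacts,
                         :unit_tests, keyword_init: true)

# Applies the per-row rules listed under "Rules:" in the hunk above.
def decision_row_errors(row)
  errors = []
  if %w[KEEP ADD].include?(row.decision) && row.evidence_strength == "existence-only"
    errors << "#{row.tc_id}: existence-only is not valid for #{row.decision}"
  end
  if row.decision != "SKIP" && row.artifacts.empty?
    errors << "#{row.tc_id}: needs command-level artifacts (stdout, stderr, exit, or proof files)"
  end
  # Assumption: this also covers the replacement evidence required for SKIP rows.
  errors << "#{row.tc_id}: needs at least one reviewed unit test path" if row.unit_tests.empty?
  errors
end

# Unit test files referenced by rows but absent from the scenario-level
# `unit-coverage-reviewed` list (which must hold the union of all rows).
def missing_scenario_coverage(rows, scenario_paths)
  rows.flat_map(&:unit_tests).uniq - scenario_paths
end

rows = [
  DecisionRow.new(tc_id: "TC-001", decision: "KEEP", evidence_strength: "command-output",
                  artifacts: %w[stdout exit], unit_tests: ["test/atoms/foo_test.rb"]),
  DecisionRow.new(tc_id: "TC-002", decision: "ADD", evidence_strength: "existence-only",
                  artifacts: [], unit_tests: [])
]

rows.flat_map { |row| decision_row_errors(row) }.each { |message| puts message }
```

Here `TC-001` passes and `TC-002` fails all three row checks, which is the weak pattern the gate is meant to block before files are written.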

data/handbook/workflow-instructions/e2e/review.wf.md CHANGED

@@ -117,19 +117,21 @@ find {PACKAGE}/test/e2e -name "scenario.yml" -path "*/TS-*" 2>/dev/null | sort
   - `last-verified`, `verified-by`
 - Extract the objective (what the TC verifies)
 - Identify which CLI commands the TC runs
+- Record command fingerprint (`command + key flags`) for each command assertion
 - Count verification steps (PASS/FAIL checks)
 - Map to the feature it tests
 - Mark TC evidence status:
-  - `complete` when `e2e-justification` is present and `unit-coverage-reviewed` has at least one path
+  - `complete` when `e2e-justification` is present, command artifacts are present, and `unit-coverage-reviewed` has at least one path
   - `missing` otherwise
+  - `at-risk` when evidence is existence-only or duplicate command invocations are detected
 If `--scope` was provided, filter to only the specified scenario.
 Build an E2E test map:
-| TC ID | Title | CLI Command | Feature Tested | Verifications | Tags | Cost Tier | E2E Justification | Unit Coverage Reviewed | Evidence |
+| TC ID | Title | Command Invocations | Feature Tested | Verifications | Tags | Cost Tier | E2E Justification | Unit Coverage Reviewed | Evidence | False-Positive Risk |
 |-------|-------|-------------|----------------|---------------|------|-----------|-------------------|------------------------|----------|
-| {id} | {title} | {command} | {feature} | {n} | {tags} | {tier} | {reason or "(missing)"} | {files or "(missing)"} | {complete/missing} |
+| {id} | {title} | {command list} | {feature} | {n} | {tags} | {tier} | {reason or "(missing)"} | {files or "(missing)"} | {complete/missing/at-risk} | {low/medium/high} |
 ### 5. Build Coverage Matrix
@@ -143,13 +145,13 @@ Combine the three inventories into a single coverage matrix:
 ```markdown
 ### Coverage Matrix
-| Feature | Unit Tests | E2E Tests | Status |
-|---------|-----------|-----------|--------|
-| {feature} | {test files} ({n} assertions) | {TC IDs} ({n} verifications) | Covered |
-| {feature} | {test files} ({n} assertions) | none | Unit-only |
-| {feature} | none | {TC IDs} ({n} verifications) | E2E-only |
-| {feature} | {test files} ({n} assertions) | {TC IDs} ({n} verifications) | Overlap |
-| {feature} | none | none | Gap |
+| Feature | Unit Tests | E2E Tests | Evidence Strength | False-Positive Risk | Status |
+|---------|-----------|-----------|------------------|----------------------|--------|
+| {feature} | {test files} ({n} assertions) | {TC IDs} ({n} verifications) | command-output/state+content | low | Covered |
+| {feature} | {test files} ({n} assertions) | none | none | n/a | Unit-only |
+| {feature} | none | {TC IDs} ({n} verifications) | command-output | low | E2E-only |
+| {feature} | {test files} ({n} assertions) | {TC IDs} ({n} verifications) | command-output or existence-only | medium/high | Overlap |
+| {feature} | none | none | none | high | Gap |
 ```
 **Classify each row:**
@@ -158,6 +160,7 @@ Combine the three inventories into a single coverage matrix:
 - **E2E-only** — E2E test exists but no unit test. Valid if the behavior is inherently E2E (subprocess execution, filesystem discovery).
 - **Overlap** — Both unit and E2E test the same assertions. E2E TC is a candidate for removal.
 - **Gap** — Neither unit nor E2E test covers this feature. Needs investigation.
+- If a row has `false-positive risk` `high`, downgrade Covered/Overlap to **manual-review** until evidence is corrected.
 ### 6. Generate Review Report
@@ -180,6 +183,7 @@ Produce the full review report with actionable findings:
 | E2E scenarios | {n} |
 | E2E test cases | {n} |
 | TCs with decision evidence | {n}/{total} |
+| High-risk false-positive TCs | {n}/{total} |
 ### Coverage Matrix
@@ -187,12 +191,13 @@ Produce the full review report with actionable findings:
 ### Overlap Analysis
-TCs that may fail the E2E Value Gate (unit tests cover the same behavior):
+TCs that may fail the E2E Value Gate (unit tests cover the same behavior or high false-positive risk):
 | TC ID | Feature | Overlapping Unit Tests | Recommendation |
 |-------|---------|----------------------|----------------|
 | {id} | {feature} | {test files} | Remove — unit tests cover this fully |
 | {id} | {feature} | {test files} | Keep — TC tests CLI pipeline, units test logic |
+| {id} | {feature} | {test files} | Strengthen — currently existence-only or duplicate command assertions |
 **Candidates for removal:** {n} TCs have full overlap with unit tests
@@ -283,4 +288,4 @@ Package '{package}' not found.
 Available packages:
 {list of ace-* directories}
-```
+```
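
The review changes above introduce three pieces of logic: command fingerprints, the `complete`/`missing`/`at-risk` evidence status, and the downgrade of high-risk rows to **manual-review**. A minimal sketch under stated assumptions: the `TestCase` struct and all method names are hypothetical, and `at-risk` is assumed to take precedence over the other two statuses, which the workflow text does not spell out.

```ruby
# Hypothetical TC summary; field names are illustrative only.
TestCase = Struct.new(:id, :justification, :artifacts, :unit_coverage,
                      :evidence_strength, :fingerprints, keyword_init: true)

# Fingerprints (command + key flags) that appear in more than one TC.
def duplicate_fingerprints(test_cases)
  test_cases.flat_map { |tc| tc.fingerprints.uniq }.tally.select { |_, count| count > 1 }.keys
end

def evidence_status(test_case, duplicates)
  # Assumption: at-risk wins when it applies alongside complete or missing.
  return "at-risk" if test_case.evidence_strength == "existence-only"
  return "at-risk" if (test_case.fingerprints & duplicates).any?

  complete = !test_case.justification.to_s.empty? &&
             test_case.artifacts.any? &&
             test_case.unit_coverage.any?
  complete ? "complete" : "missing"
end

# Covered/Overlap rows with high false-positive risk need manual review.
def matrix_status(status, false_positive_risk)
  downgrade = false_positive_risk == "high" && %w[Covered Overlap].include?(status)
  downgrade ? "manual-review" : status
end

shared = { justification: "needs real CLI", artifacts: %w[stdout exit],
           unit_coverage: ["test/a_test.rb"], evidence_strength: "command-output" }
test_cases = [
  TestCase.new(id: "TC-001", fingerprints: ["tool run --verbose"], **shared),
  TestCase.new(id: "TC-002", fingerprints: ["tool list"], **shared),
  TestCase.new(id: "TC-003", fingerprints: ["tool list"], **shared)
]

duplicates = duplicate_fingerprints(test_cases)
test_cases.each { |tc| puts "#{tc.id}: #{evidence_status(tc, duplicates)}" }
puts matrix_status("Covered", "high")
```

`tool run` and `tool list` are placeholder commands, not real CLI invocations from the package.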

data/handbook/workflow-instructions/e2e/rewrite.wf.md CHANGED

@@ -125,6 +125,11 @@ Follow the E2E test writing rules:
 - Consolidate assertions sharing the same CLI invocation into a single TC
 - Target 2-5 TCs per scenario
 - Test through the CLI interface, not library imports
+- Add command-level evidence in every runner:
+  - command output (`*.stdout`/`*.stderr`)
+  - command exit status (`*.exit`)
+- Add at least one behavioral/content assertion per command assertion set
+- Remove duplicate command-only TCs; fold related assertions into one TC where possible
 **Load the TC template for reference:**
 ```bash
@@ -141,6 +146,7 @@ For each TC classified as MODIFY:
    - **Narrow scope** — remove assertions that unit tests cover, keep only E2E-exclusive checks
    - **Broaden scope** — add assertions for related behavior tested by the same CLI invocation
    - **Fix structure** — add missing sections, fix formatting issues
+   - **Add evidence gates** — if the existing TC relies on existence-only or missing exit/status checks, add explicit command output assertions and `.exit` captures
 3. Update the `last-verified` field if the TC was re-run during modification
 4. Write the updated TC runner/verifier files
@@ -228,6 +234,7 @@ Present the execution summary:
 - [ ] TC count matches plan: {yes/no}
 - [ ] No stale references: {yes/no}
 - [ ] All scenarios have 2-5 TCs: {yes/no}
+- [ ] All modified/created TCs include command output + exit artifacts: {yes/no}
 ### Next Steps
@@ -278,4 +285,4 @@ If execution fails partway through:
 1. Report which actions completed and which failed
 2. Do not attempt to roll back completed actions
 3. Show the state of `{PACKAGE}/test/e2e/` after partial execution
-4. Suggest re-running with the remaining actions
+4. Suggest re-running with the remaining actions
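
The command-level evidence required above (`*.stdout`, `*.stderr`, `*.exit`) can be captured with Ruby's standard `Open3`. This is a sketch only: `capture_evidence` is a hypothetical helper, the file layout is an assumption, and the Ruby interpreter stands in for the CLI under test.

```ruby
require "open3"
require "rbconfig"
require "tmpdir"

# Hypothetical helper (not part of the gem): runs a command and writes
# <name>.stdout, <name>.stderr and <name>.exit into results_dir.
def capture_evidence(name, results_dir, *command)
  stdout, stderr, status = Open3.capture3(*command)
  File.write(File.join(results_dir, "#{name}.stdout"), stdout)
  File.write(File.join(results_dir, "#{name}.stderr"), stderr)
  File.write(File.join(results_dir, "#{name}.exit"), "#{status.exitstatus}\n")
  { stdout: stdout, stderr: stderr, exit: status.exitstatus }
end

Dir.mktmpdir do |results_dir|
  # The Ruby interpreter stands in for the real CLI command here.
  result = capture_evidence("tc-001", results_dir,
                            RbConfig.ruby, "-e", 'puts "status: done"; exit 3')

  content_ok = result[:stdout].include?("status: done") # behavioral/content assertion
  exit_ok = result[:exit] == 3                          # specific exit code, not just non-zero
  puts(content_ok && exit_ok ? "PASS" : "FAIL")
end
```

Keeping the three artifacts on disk means a later review can confirm that a TC asserted on real output and exit status rather than on file existence alone.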

data/lib/ace/test/end_to_end_runner/atoms/skill_prompt_builder.rb CHANGED

@@ -33,11 +33,13 @@ module Ace
           # Instance method: check if a provider string refers to a CLI provider
           #
-          # @param provider_string [String] Provider:model string
+          # Resolves role: references to their concrete provider before checking.
+          #
+          # @param provider_string [String] Provider:model string (e.g., "claude:sonnet", "role:e2e-executor")
           # @return [Boolean]
           def cli_provider?(provider_string)
-            name = self.class.provider_name(provider_string)
-            @cli_providers.include?(name)
+            resolved = resolve_provider_name(provider_string)
+            @cli_providers.include?(resolved)
           end
           def build_execution_prompt(command:, tc_mode:)
@@ -139,6 +141,23 @@ module Ace
             PROMPT
           end
+          private
+          # Resolve the bare provider name from a provider string.
+          # For role: references, resolves via ProviderModelParser to find the
+          # concrete provider (e.g. "role:e2e-executor" → "claude").
+          def resolve_provider_name(provider_string)
+            name = self.class.provider_name(provider_string)
+            return name unless name == "role"
+            parse_result = Ace::LLM::Molecules::ProviderModelParser.new.parse(provider_string)
+            parse_result.valid? ? parse_result.provider : name
+          rescue
+            name
+          end
+          public
           # Lazily-loaded default instance backed by ConfigLoader
           # @return [CliProviderAdapter]
           def self.default_instance
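
The resolution logic in this hunk can be exercised in isolation with a stand-in parser. In the sketch below, `StubRoleParser`, `ParseResult`, `ProviderCheck`, and the role table are all hypothetical; the only interface taken from the diff is `parse(...)` returning an object that responds to `valid?` and `provider`. The real `Ace::LLM::Molecules::ProviderModelParser` may behave differently.

```ruby
# Hypothetical stand-in for the parser result; only `valid?` and `provider`
# are modelled because those are the calls visible in the diff.
ParseResult = Struct.new(:provider, :valid) do
  def valid?
    valid
  end
end

# Hypothetical role table. The real mapping comes from configuration that
# this diff does not show.
class StubRoleParser
  ROLES = { "role:e2e-executor" => "claude" }.freeze

  def parse(provider_string)
    provider = ROLES[provider_string]
    ParseResult.new(provider, !provider.nil?)
  end
end

class ProviderCheck
  def initialize(cli_providers, parser: StubRoleParser.new)
    @cli_providers = cli_providers
    @parser = parser
  end

  def cli_provider?(provider_string)
    @cli_providers.include?(resolve_provider_name(provider_string))
  end

  private

  # Mirrors the diff: non-role strings keep their bare provider name, and
  # any parser failure falls back to that bare name as well.
  def resolve_provider_name(provider_string)
    name = provider_string.to_s.split(":", 2).first
    return name unless name == "role"

    result = @parser.parse(provider_string)
    result.valid? ? result.provider : name
  rescue StandardError
    name
  end
end

check = ProviderCheck.new(%w[claude])
puts check.cli_provider?("role:e2e-executor") # role resolves to "claude"
puts check.cli_provider?("claude:sonnet")     # plain provider:model string
puts check.cli_provider?("role:unknown")      # unresolved role stays "role"
```

Falling back to the bare name on failure means an unresolvable role is treated as a non-CLI provider instead of aborting the run, which matches the `rescue` branch in the diff.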

data/lib/ace/test/end_to_end_runner/molecules/pipeline_executor.rb CHANGED

@@ -100,7 +100,7 @@ module Ace
           def run_llm(prompt_path:, system_path:, output_path:, cli_args:, env_vars:)
             prompt = File.read(prompt_path)
             system = File.read(system_path)
-            working_dir = env_vars["PROJECT_ROOT_PATH"] || env_vars[:PROJECT_ROOT_PATH]
+            sandbox_dir = env_vars["PROJECT_ROOT_PATH"] || env_vars[:PROJECT_ROOT_PATH]
             Ace::LLM::QueryInterface.query(
               @provider,
@@ -110,8 +110,8 @@ module Ace
               timeout: @timeout,
               fallback: false,
               output: output_path,
-              working_dir: working_dir,
-              subprocess_env: env_vars
+              subprocess_env: env_vars,
+              working_dir: sandbox_dir
             )
           end
         end
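
The changelog explains why the directory is passed as `working_dir:` instead of wrapping the call in `Dir.chdir`: the working directory is process-global, and Ruby raises `RuntimeError: conflicting chdir during another chdir block` when a second thread calls `Dir.chdir` while another thread's block is active. The sketch below shows the general technique with Ruby's standard `chdir:` spawn option. It does not show how `Ace::LLM::QueryInterface` applies `working_dir` internally, which this diff does not reveal; `run_in_dir` is a hypothetical helper.

```ruby
require "open3"
require "rbconfig"
require "tmpdir"

# Hypothetical helper: runs a command with an explicit working directory.
# The `chdir:` spawn option applies to the child process only, so the
# parent's working directory is never changed.
def run_in_dir(working_dir, *command)
  stdout, _stderr, status = Open3.capture3(*command, chdir: working_dir)
  [stdout, status.exitstatus]
end

Dir.mktmpdir do |root|
  sandboxes = %w[a b c].map { |name| File.join(root, name).tap { |dir| Dir.mkdir(dir) } }
  parent_dir = Dir.pwd

  # Each thread stands in for one parallel scenario with its own sandbox.
  threads = sandboxes.map do |dir|
    Thread.new { run_in_dir(dir, RbConfig.ruby, "-e", "print Dir.pwd").first }
  end
  reported = threads.map(&:value)

  puts reported.map { |path| File.basename(path) }.inspect
  puts Dir.pwd == parent_dir
end
```

Because no thread touches the parent's working directory, the scenarios cannot conflict with each other, and files written by a child process land in its own sandbox.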

data/lib/ace/test/end_to_end_runner/version.rb CHANGED

@@ -3,7 +3,7 @@
 module Ace
   module Test
     module EndToEndRunner
-      VERSION = '0.29.2'
+      VERSION = '0.29.8'
     end
   end
 end

metadata CHANGED

@@ -1,13 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: ace-test-runner-e2e
 version: !ruby/object:Gem::Version
-  version: 0.29.2
+  version: 0.29.8
 platform: ruby
 authors:
 - Michal Czyz
 bindir: exe
 cert_chain: []
-date: 2026-03-29 00:00:00.000000000 Z
+date: 2026-04-05 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: ace-support-cli