
traceval-0.2.1.tar.gz → traceval-0.2.2.tar.gz


Files changed (70)

{traceval-0.2.1 → traceval-0.2.2}/CHANGELOG.md RENAMED

@@ -5,6 +5,12 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [0.2.2] - 2026-07-02
+### Fixed
+- A run in which zero cases execute (e.g. unresolvable target) now writes a self-describing run report with an `errors` section instead of writing nothing; `--json` never reports `null`. Added a single clear top-level error line on target-resolution failure. Added `make test` and CONTRIBUTING as the canonical dev commands. Reported via external review of a failed-target invocation.
+- Run report schema additions (existing fields unchanged): `summary.errored` counts cases that never executed due to setup/collection errors; top-level `errors` lists `{stage, detail}` entries for `target_resolution`, `collection`, and `setup` failures, with identical details deduplicated into one entry carrying a `count`; `results` is `[]` rather than absent when nothing executed. The terminal summary now shows `Errored: n`.
 ## [0.2.1] - 2026-07-02
 ### Added
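
The schema additions described in the 0.2.2 entry can be read back with a few lines of Python. This is a minimal sketch, not part of traceval: `summarize_report` and the `sample` dict are hypothetical, written by hand from the field names in the changelog (`summary.errored`, top-level `errors` with `stage`/`detail`/`count`, and `results` as `[]`). Defaulting `errored` to 0 and `count` to 1 is a defensive assumption for reports or entries that lack those keys.

```python
from typing import Any


def summarize_report(report: dict[str, Any]) -> str:
    """Render a one-line summary plus one line per error entry."""
    summary = report["summary"]
    lines = [
        f"Total: {summary['total']} | Passed: {summary['passed']} | "
        f"Failed: {summary['failed']} | Errored: {summary.get('errored', 0)}"
    ]
    for entry in report.get("errors", []):
        # Assumption: an entry without `count` stands for a single occurrence.
        lines.append(f"[{entry['stage']}] x{entry.get('count', 1)}: {entry['detail']}")
    return "\n".join(lines)


# Hand-written sample shaped like a zero-case run; not real traceval output.
sample = {
    "summary": {"total": 2, "passed": 0, "failed": 0, "errored": 2},
    "errors": [
        {"stage": "target_resolution", "detail": "target 'x:y' could not be imported"},
        {"stage": "setup", "detail": "target 'x:y' could not be imported", "count": 2},
    ],
    "results": [],
}
print(summarize_report(sample))
```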

traceval-0.2.2/CONTRIBUTING.md ADDED

@@ -0,0 +1,22 @@
+# Contributing to traceval
+## Setup
+```bash
+git clone https://github.com/theramkm/traceval.git
+cd traceval
+uv sync          # installs the package and dev dependencies (same as CI)
+```
+## Development loop
+```bash
+make test        # pytest -q
+make lint        # ruff check, ruff format --check, mypy
+make demo        # end-to-end smoke: healthy agent passes, buggy agent fails
+make all         # lint + test
+```
+CI runs the same commands on Python 3.11, 3.12, and 3.13, plus a
+wheel-based demo smoke job, and enforces 85% coverage. Keep all of it
+green; add a test for every behavior change.

traceval-0.2.2/Makefile ADDED

@@ -0,0 +1,14 @@
+.PHONY: test lint demo all
+test:
+	uv run pytest -q
+lint:
+	uv run ruff check src tests examples
+	uv run ruff format --check src tests examples
+	uv run mypy src/traceval
+demo:
+	uv run traceval demo -o /tmp/traceval-demo --force
+all: lint test

{traceval-0.2.1 → traceval-0.2.2}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: traceval
-Version: 0.2.1
+Version: 0.2.2
 Summary: Trace-to-Eval Compiler
 License: MIT
 License-File: LICENSE
@@ -140,7 +140,7 @@ traceval Run Summary
 │ c_d3f3b631__case_007 │ c_d3f3b631 │  PASS   │          0.0 │
 │ c_e834c13c__case_008 │ c_e834c13c │  PASS   │          0.0 │
 └──────────────────────┴────────────┴─────────┴──────────────┘
-Total: 8 | Passed: 8 | Failed: 0
+Total: 8 | Passed: 8 | Failed: 0 | Errored: 0
 ```
 The target is an HTTP URL or a `module:function` callable. Checks cover `exact`, `contains_any`, `not_contains`, `regex`, `json_schema`, `tool_sequence`, `no_tool_loop`, and `judge`. Run reports land in `<evals_dir>/runs/` (override with `--runs-dir`); pass `--compare <previous report>` to print regressions and improvements between runs. The exit code is nonzero when any case fails.
@@ -191,7 +191,7 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4
-      - uses: theramkm/traceval@v0.2.1
+      - uses: theramkm/traceval@v0.2.2
         with:
           evals-dir: evals/
           target: myapp.agent:invoke_agent   # or an HTTP URL
@@ -200,6 +200,11 @@ jobs:
 Inputs: `evals-dir` and `target` (required); `judge`, `compare`, `only`, `runs-dir`, `traceval-version`, `python-version` (optional). For a real LLM judge, set `judge: openai` and pass `OPENAI_API_KEY` (or `GEMINI_API_KEY`) via `env:` from your repository secrets.
+## Development
+See [CONTRIBUTING.md](CONTRIBUTING.md) for setup.
+Run the test suite with `make test` and the full gate set with `make lint`.
 ## Honest Limitations
 * **Side-Effect Free**: traceval assertions evaluate input/output matches. It does not attempt to replay side effects (e.g., updating database records) on mock tools.

{traceval-0.2.1 → traceval-0.2.2}/README.md RENAMED

@@ -123,7 +123,7 @@ traceval Run Summary
 │ c_d3f3b631__case_007 │ c_d3f3b631 │  PASS   │          0.0 │
 │ c_e834c13c__case_008 │ c_e834c13c │  PASS   │          0.0 │
 └──────────────────────┴────────────┴─────────┴──────────────┘
-Total: 8 | Passed: 8 | Failed: 0
+Total: 8 | Passed: 8 | Failed: 0 | Errored: 0
 ```
 The target is an HTTP URL or a `module:function` callable. Checks cover `exact`, `contains_any`, `not_contains`, `regex`, `json_schema`, `tool_sequence`, `no_tool_loop`, and `judge`. Run reports land in `<evals_dir>/runs/` (override with `--runs-dir`); pass `--compare <previous report>` to print regressions and improvements between runs. The exit code is nonzero when any case fails.
@@ -174,7 +174,7 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4
-      - uses: theramkm/traceval@v0.2.1
+      - uses: theramkm/traceval@v0.2.2
         with:
           evals-dir: evals/
           target: myapp.agent:invoke_agent   # or an HTTP URL
@@ -183,6 +183,11 @@ jobs:
 Inputs: `evals-dir` and `target` (required); `judge`, `compare`, `only`, `runs-dir`, `traceval-version`, `python-version` (optional). For a real LLM judge, set `judge: openai` and pass `OPENAI_API_KEY` (or `GEMINI_API_KEY`) via `env:` from your repository secrets.
+## Development
+See [CONTRIBUTING.md](CONTRIBUTING.md) for setup.
+Run the test suite with `make test` and the full gate set with `make lint`.
 ## Honest Limitations
 * **Side-Effect Free**: traceval assertions evaluate input/output matches. It does not attempt to replay side effects (e.g., updating database records) on mock tools.

traceval-0.2.2/docs/img/report.png ADDED

Binary file

{traceval-0.2.1 → traceval-0.2.2}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "traceval"
-version = "0.2.1"
+version = "0.2.2"
 description = "Trace-to-Eval Compiler"
 readme = "README.md"
 requires-python = ">=3.11"

traceval-0.2.2/src/traceval/__init__.py ADDED

@@ -0,0 +1 @@
+__version__ = "0.2.2"

{traceval-0.2.1 → traceval-0.2.2}/src/traceval/compile/templates/conftest.py.jinja RENAMED

@@ -46,38 +46,111 @@ def pytest_generate_tests(metafunc):
         metafunc.parametrize("eval_case", cases, ids=[c["id"] for c in cases])
-@pytest.fixture(scope="session")
-def target_runner(request):
-    target_opt = request.config.getoption("--target")
+# Target resolution runs ONCE per session so a broken target produces one
+# clear error line up front (and one errors entry in the run report),
+# never a wall of identical per-case tracebacks with no explanation.
+_target_probe = {}
+def _probe_target(config):
+    if _target_probe.get("done"):
+        return _target_probe
+    _target_probe["done"] = True
+    target_opt = config.getoption("--target")
     if not target_opt:
         config_path = Path(__file__).parent / "traceval.yaml"
         if config_path.exists():
             try:
                 with open(config_path, encoding="utf-8") as f:
-                    config = yaml.safe_load(f)
-                    target_opt = config.get("target", {}).get("default_url")
+                    cfg = yaml.safe_load(f)
+                    target_opt = cfg.get("target", {}).get("default_url")
             except Exception:
                 pass
     if not target_opt:
-        pytest.fail("No target specified. Use --target option or set in traceval.yaml.")
-    return resolve_target(target_opt)
+        detail = "No target specified. Use --target option or set in traceval.yaml."
+        _target_probe["error"] = detail
+        _record_error("target_resolution", detail)
+        return _target_probe
+    try:
+        _target_probe["target"] = resolve_target(target_opt)
+    except Exception as e:
+        detail = f"target '{target_opt}' could not be imported ({e}). Check the module path or URL."
+        _target_probe["error"] = detail
+        _record_error("target_resolution", detail)
+        return _target_probe
+    if target_opt.startswith(("http://", "https://")):
+        # First-contact reachability check; connection-level failures only.
+        # The resolved target is kept either way so cases still run.
+        import httpx
+        try:
+            httpx.get(target_opt, timeout=2.0)
+        except (httpx.ConnectError, httpx.InvalidURL, httpx.UnsupportedProtocol) as e:
+            detail = f"target '{target_opt}' is unreachable ({e}). Check the module path or URL."
+            _target_probe["error"] = detail
+            _record_error("target_resolution", detail)
+    return _target_probe
+def pytest_report_header(config):
+    probe = _probe_target(config)
+    if probe.get("error"):
+        return [f"ERROR: {probe['error']}"]
+    return []
+@pytest.fixture(scope="session")
+def target_runner(request):
+    probe = _probe_target(request.config)
+    if "target" not in probe:
+        pytest.fail(probe.get("error") or "Target resolution failed.")
+    return probe["target"]
 @pytest.fixture(scope="session")
 def judge_runner(request):
     judge_opt = request.config.getoption("--judge")
     return resolve_judge(judge_opt)
-# Accumulator for final report
+# Accumulators for final report
 _results_accumulator = []
+_errors_accumulator = []
+_errored_cases = [0]
+def _record_error(stage, detail):
+    # Deduplicate identical details into one entry with a count field
+    for entry in _errors_accumulator:
+        if entry["stage"] == stage and entry["detail"] == detail:
+            entry["count"] = entry.get("count", 1) + 1
+            return
+    _errors_accumulator.append({"stage": stage, "detail": detail})
+def pytest_collectreport(report):
+    if report.failed:
+        detail = str(report.longrepr).strip().splitlines()[-1]
+        _record_error("collection", detail)
+@pytest.hookimpl(hookwrapper=True)
+def pytest_runtest_makereport(item, call):
+    outcome = yield
+    report = outcome.get_result()
+    if report.when == "setup" and report.failed:
+        # Case never executed (fixture/setup error)
+        _errored_cases[0] += 1
+        detail = getattr(getattr(report, "longrepr", None), "reprcrash", None)
+        detail = detail.message if detail else str(report.longrepr).strip().splitlines()[-1]
+        _record_error("setup", detail)
 @pytest.hookimpl(tryfirst=True)
 def pytest_sessionstart(session):
     _results_accumulator.clear()
+    _errors_accumulator.clear()
+    _errored_cases[0] = 0
 def pytest_sessionfinish(session, exitstatus):
-    if not _results_accumulator:
-        return
+    # ALWAYS write a run report, including when zero cases executed:
+    # catastrophic failure must produce a self-describing artifact,
+    # never silence.
     runs_opt = session.config.getoption("--runs-dir")
     if runs_opt:
         runs_dir = Path(runs_opt).resolve()
@@ -95,43 +168,48 @@ def pytest_sessionfinish(session, exitstatus):
     passed_count = sum(1 for r in _results_accumulator if r["passed"])
     failed_count = len(_results_accumulator) - passed_count
+    errored_count = _errored_cases[0]
+    total_count = len(_results_accumulator) + errored_count
     report = {
         "timestamp": datetime.now(timezone.utc).isoformat(),
         "summary": {
-            "total": len(_results_accumulator),
+            "total": total_count,
             "passed": passed_count,
             "failed": failed_count,
+            "errored": errored_count,
         },
+        "errors": _errors_accumulator,
         "results": _results_accumulator
     }
     report_file.write_text(json.dumps(report, indent=2), encoding="utf-8")
     # Rich Table Terminal output
     from rich.console import Console
     from rich.table import Table
     console = Console()
     console.print("\n[bold purple]traceval Run Summary[/bold purple]")
-    table = Table(show_header=True, header_style="bold blue")
-    table.add_column("Case ID", style="cyan")
-    table.add_column("Cluster", style="magenta")
-    table.add_column("Outcome", justify="center")
-    table.add_column("Latency (ms)", justify="right")
-    for r in _results_accumulator:
-        outcome_str = "[bold green]PASS[/bold green]" if r["passed"] else "[bold red]FAIL[/bold red]"
-        table.add_row(
-            r["case_id"],
-            r["cluster"],
-            outcome_str,
-            f"{r['latency_ms']:.1f}"
-        )
-    console.print(table)
-    console.print(f"Total: {len(_results_accumulator)} | Passed: {passed_count} | Failed: {failed_count}")
+    if _results_accumulator:
+        table = Table(show_header=True, header_style="bold blue")
+        table.add_column("Case ID", style="cyan")
+        table.add_column("Cluster", style="magenta")
+        table.add_column("Outcome", justify="center")
+        table.add_column("Latency (ms)", justify="right")
+        for r in _results_accumulator:
+            outcome_str = "[bold green]PASS[/bold green]" if r["passed"] else "[bold red]FAIL[/bold red]"
+            table.add_row(
+                r["case_id"],
+                r["cluster"],
+                outcome_str,
+                f"{r['latency_ms']:.1f}"
+            )
+        console.print(table)
+    console.print(f"Total: {total_count} | Passed: {passed_count} | Failed: {failed_count} | Errored: {errored_count}")
     console.print(f"Run report written to: {report_file}")
     # Optional --compare checking
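
The deduplication done by `_record_error` in the hunk above can be tried in isolation. The sketch below restates the same logic as a standalone function; `record_error` and its explicit `errors` list argument are illustrative stand-ins for the module-level accumulator, not names from the template. It shows one consequence of the logic as written: an entry seen only once carries no `count` key, so readers of the report may want `entry.get("count", 1)`.

```python
def record_error(errors: list[dict], stage: str, detail: str) -> None:
    """Append an error entry, folding exact (stage, detail) repeats into `count`."""
    for entry in errors:
        if entry["stage"] == stage and entry["detail"] == detail:
            entry["count"] = entry.get("count", 1) + 1
            return
    errors.append({"stage": stage, "detail": detail})


errors: list[dict] = []
record_error(errors, "setup", "boom")
first_has_count = "count" in errors[0]      # a single occurrence has no count key
record_error(errors, "setup", "boom")       # repeat: count becomes 2
record_error(errors, "setup", "boom")       # repeat: count becomes 3
record_error(errors, "collection", "boom")  # different stage, so a separate entry
```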

traceval-0.2.2/tests/test_broken_target.py ADDED

@@ -0,0 +1,90 @@
+"""Regression tests for the failed-target incident: a run in which zero
+cases execute must write a self-describing run report, print one clear
+error line, and never report null. Silence is the bug.
+Reported via external review of a failed-target invocation.
+"""
+import json
+import sys
+from pathlib import Path
+from typer.testing import CliRunner
+from traceval.cli import app
+from traceval.compile import generate_evals
+from traceval.ingest import ingest_file
+from traceval.store import TraceStore
+FIXTURES_DIR = Path(__file__).parent / "fixtures"
+BROKEN_TARGET = "no.such.module:fn"
+runner = CliRunner()
+def _generate_suite(tmp_path):
+    db_path = tmp_path / "traces.db"
+    store = TraceStore(db_path)
+    ingest_file(FIXTURES_DIR / "generic_traces.jsonl", store, format_name="generic")
+    store.close()
+    evals_dir = tmp_path / "evals"
+    generate_evals(db_path, evals_dir, include_failures=True)
+    return evals_dir
+def _run(evals_dir, *extra_args):
+    # `run` calls pytest.main in-process; a previously executed generated
+    # suite leaves its conftest cached in sys.modules and poisons this one.
+    for mod in ("conftest", "test_generated"):
+        sys.modules.pop(mod, None)
+    return runner.invoke(
+        app,
+        [
+            "run",
+            str(evals_dir),
+            "--target",
+            BROKEN_TARGET,
+            "--judge",
+            "fake",
+            *extra_args,
+        ],
+    )
+def test_broken_target_writes_report(tmp_path):
+    evals_dir = _generate_suite(tmp_path)
+    result = _run(evals_dir)
+    assert result.exit_code != 0
+    reports = list((evals_dir / "runs").glob("run_*.json"))
+    assert len(reports) == 1, "exactly one report must be written"
+    with open(reports[0], encoding="utf-8") as f:
+        report = json.load(f)
+    assert report["summary"]["errored"] == report["summary"]["total"] > 0
+    assert report["summary"]["passed"] == 0
+    assert report["summary"]["failed"] == 0
+    assert report["results"] == []
+    assert report["errors"][0]["stage"] == "target_resolution"
+    assert BROKEN_TARGET in report["errors"][0]["detail"]
+    # Identical per-case setup errors deduplicate into one counted entry
+    setup_errors = [e for e in report["errors"] if e["stage"] == "setup"]
+    assert len(setup_errors) == 1
+    assert setup_errors[0]["count"] == report["summary"]["errored"]
+def test_broken_target_json_report_not_null(tmp_path):
+    evals_dir = _generate_suite(tmp_path)
+    result = _run(evals_dir, "--json")
+    data = json.loads(result.stdout)
+    assert isinstance(data["report"], str)
+    assert Path(data["report"]).exists()
+    assert data["exit_code"] == 1
+    assert result.exit_code == 1
+def test_broken_target_prints_one_clear_error(tmp_path):
+    evals_dir = _generate_suite(tmp_path)
+    result = _run(evals_dir)
+    error_line = f"ERROR: target '{BROKEN_TARGET}' could not be imported"
+    assert result.output.count(error_line) == 1, result.output
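
The `--json` test above relies on two keys in the CLI payload: `report` (a path string) and `exit_code`. A caller could consume that payload roughly as follows. This is a sketch under assumptions: `load_run_report` is a hypothetical helper, the payload may contain other keys not shown in the test, and the temporary file below merely imitates a report rather than reproducing real traceval output.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def load_run_report(cli_stdout: str) -> tuple[int, dict[str, Any]]:
    """Parse the `--json` payload and load the report file it points to."""
    payload = json.loads(cli_stdout)
    report_path = Path(payload["report"])  # a string path, never null, per the test
    report = json.loads(report_path.read_text(encoding="utf-8"))
    return payload["exit_code"], report


# Imitate a zero-case run with a hand-written report file.
with tempfile.TemporaryDirectory() as tmp:
    fake_report = Path(tmp) / "run_example.json"
    fake_report.write_text(
        json.dumps(
            {
                "summary": {"total": 1, "passed": 0, "failed": 0, "errored": 1},
                "errors": [{"stage": "target_resolution", "detail": "example"}],
                "results": [],
            }
        ),
        encoding="utf-8",
    )
    stdout = json.dumps({"report": str(fake_report), "exit_code": 1})
    exit_code, report = load_run_report(stdout)
```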

{traceval-0.2.1 → traceval-0.2.2}/uv.lock RENAMED

@@ -1037,7 +1037,7 @@ wheels = [
 [[package]]
 name = "traceval"
-version = "0.2.1"
+version = "0.2.2"
 source = { editable = "." }
 dependencies = [
     { name = "httpx" },