PyPI - evalcraft - Versions diffs - 0.2.0__tar.gz → 0.3.0__tar.gz - Mend

evalcraft 0.2.0tar.gz → 0.3.0tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (295) hide show

{evalcraft-0.2.0 → evalcraft-0.3.0}/CHANGELOG.md RENAMED Viewed

@@ -5,6 +5,16 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [0.3.0] — 2026-06-01
+### Added
+- **`evalcraft check-stale`** — activates the provenance every cassette already records (model set, prompt hash, timestamp) to flag deterministic tests that have silently gone stale: a recorded model absent from the current `--models` set is **CRITICAL** (non-zero exit — a CI gate), a drifted `--prompts` hash is a **WARNING**, and age is **INFO**. Adds a `StalenessChecker` Python API (`evalcraft.staleness`) and refactors a shared `compute_prompt_hash` so recorded and recomputed prompt hashes match byte-for-byte. No new dependencies; runs fully offline.
+## [0.2.1] — 2026-05-30
+### Fixed
+- **Removed references to the unregistered `evalcraft.dev` domain.** The cloud client and the `evalcraft cloud` CLI no longer default to a non-existent `api.evalcraft.dev` endpoint. There is **no public hosted service** — configure a self-hosted dashboard URL explicitly via `base_url=`, the `EVALCRAFT_BASE_URL` env var, or `~/.evalcraft/config.json`. A cloud call with no URL configured now raises a clear, self-host-pointing error instead of failing against a dead host. Also scrubbed the dead domain from the `evalcraft init` config template and the landing-page contact links.
 ## [0.2.0] — 2026-05-30
 Ships everything developed since the initial `0.1.0` PyPI upload — a much larger

{evalcraft-0.2.0 → evalcraft-0.3.0}/PKG-INFO RENAMED Viewed

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: evalcraft
-Version: 0.2.0
+Version: 0.3.0
 Summary: VCR for AI agents — record agent runs as cassettes and replay them deterministically in CI for $0.
 Project-URL: Homepage, https://github.com/beyhangl/evalcraft
 Project-URL: Repository, https://github.com/beyhangl/evalcraft
@@ -69,7 +69,8 @@ Description-Content-Type: text/markdown
 <p align="center">
   <img src="site/logo.png" alt="Evalcraft" width="400" />
 </p>
-<p align="center"><strong>VCR for AI agents.</strong> Record an agent run once, replay it deterministically in CI for <strong>$0</strong> — fast regression tests for your agent's plumbing (tool calls, control flow, cost &amp; latency budgets), plus live-eval to catch real model drift.</p>
+<p align="center"><strong>Deterministic tests for AI agents — generated from one real run.</strong></p>
+<p align="center">Capture an agent run and evalcraft writes a <strong>pytest</strong> that locks its tool calls, output shape, and cost — then replays it in CI for <strong>$0</strong>. Like VCR for HTTP, but it writes the agent tests for you.</p>
 [![CI](https://github.com/beyhangl/evalcraft/actions/workflows/ci.yml/badge.svg)](https://github.com/beyhangl/evalcraft/actions/workflows/ci.yml)
 [![PyPI](https://img.shields.io/pypi/v/evalcraft)](https://pypi.org/project/evalcraft/)
@@ -536,6 +537,7 @@ evalcraft [command] [options]
 | `evalcraft sanitize <cassette>` | Redact PII and secrets |
 | `evalcraft doctor` | Diagnose setup issues (deps, API keys, cassettes) |
 | `evalcraft live-eval <current> --baseline <b>` | Gate a live-eval run vs a baseline (catch drift) |
+| `evalcraft check-stale <cassettes> --models <set>` | Fail CI when a cassette's recorded model was retired or swapped |
 ---

{evalcraft-0.2.0 → evalcraft-0.3.0}/README.md RENAMED Viewed

@@ -1,7 +1,8 @@
 <p align="center">
   <img src="site/logo.png" alt="Evalcraft" width="400" />
 </p>
-<p align="center"><strong>VCR for AI agents.</strong> Record an agent run once, replay it deterministically in CI for <strong>$0</strong> — fast regression tests for your agent's plumbing (tool calls, control flow, cost &amp; latency budgets), plus live-eval to catch real model drift.</p>
+<p align="center"><strong>Deterministic tests for AI agents — generated from one real run.</strong></p>
+<p align="center">Capture an agent run and evalcraft writes a <strong>pytest</strong> that locks its tool calls, output shape, and cost — then replays it in CI for <strong>$0</strong>. Like VCR for HTTP, but it writes the agent tests for you.</p>
 [![CI](https://github.com/beyhangl/evalcraft/actions/workflows/ci.yml/badge.svg)](https://github.com/beyhangl/evalcraft/actions/workflows/ci.yml)
 [![PyPI](https://img.shields.io/pypi/v/evalcraft)](https://pypi.org/project/evalcraft/)
@@ -468,6 +469,7 @@ evalcraft [command] [options]
 | `evalcraft sanitize <cassette>` | Redact PII and secrets |
 | `evalcraft doctor` | Diagnose setup issues (deps, API keys, cassettes) |
 | `evalcraft live-eval <current> --baseline <b>` | Gate a live-eval run vs a baseline (catch drift) |
+| `evalcraft check-stale <cassettes> --models <set>` | Fail CI when a cassette's recorded model was retired or swapped |
 ---

{evalcraft-0.2.0 → evalcraft-0.3.0}/docs/index.md RENAMED Viewed

@@ -1,6 +1,8 @@
 # Evalcraft
-**VCR for AI agents.** Record an agent run once, replay it deterministically in CI for **$0** — fast regression tests for your agent's plumbing (tool calls, control flow, cost & latency budgets), plus live-eval to catch real model drift.
+**Deterministic tests for AI agents — generated from one real run.**
+Capture an agent run and evalcraft writes a **pytest** that locks its tool calls, output shape, and cost — then replays it in CI for **$0**. Like VCR for HTTP, but it writes the agent tests for you.
 [![CI](https://github.com/beyhangl/evalcraft/actions/workflows/ci.yml/badge.svg)](https://github.com/beyhangl/evalcraft/actions/workflows/ci.yml)
 [![PyPI](https://img.shields.io/pypi/v/evalcraft)](https://pypi.org/project/evalcraft/)

{evalcraft-0.2.0 → evalcraft-0.3.0}/docs/user-guide/changelog.md RENAMED Viewed

@@ -7,6 +7,13 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) and
 ---
+## [0.3.0] — 2026-06-01
+### Added
+- `evalcraft check-stale` — detect cassettes recorded against a retired/swapped model (CRITICAL, non-zero exit for CI) or a drifted prompt (WARNING), by activating the provenance each cassette records. See [Check Stale](check-stale.md).
+---
 ## [0.1.0] — 2026-03-05
 Initial public release of Evalcraft — the pytest for AI agents.

evalcraft-0.3.0/docs/user-guide/check-stale.md ADDED Viewed

@@ -0,0 +1,95 @@
+# Check Stale — catch cassettes recorded against a retired model
+A replayed cassette is a *deterministic* test: it passes as long as the recording
+is unchanged. But that's exactly the trap — a green replay says nothing about
+whether the recording still mirrors reality. In 2026, models get **hard
+retirement dates** (and providers silently update weights). When the model a
+cassette was recorded against is gone, your test keeps "passing" against a world
+that no longer exists.
+`evalcraft check-stale` fixes the blind spot by **activating the provenance**
+every cassette already records (model set, prompt hash, timestamp) and turning it
+into a CI gate.
+```bash
+evalcraft check-stale tests/cassettes/*.json --models "gpt-5.1,claude-sonnet-4-5"
+```
+```
+  staleness check  3 cassette(s)
+  refund_flow
+  CRITICAL  [model_retired] Recorded model 'gpt-4o' is not in the current model set —
+            it may have been retired or swapped. This deterministic test no longer
+            mirrors production.
+  fresh  weather_agent
+  fresh  search_agent
+  CRITICAL staleness found — re-record the affected cassettes
+# exit code 1
+```
+## What it checks
+| Finding | Severity | Meaning | Exits CI? |
+|---|---|---|---|
+| `model_retired` | **CRITICAL** | A recorded model is absent from the current `--models` set (retired or swapped) — the cassette may now exercise an API that errors live. | **Yes (exit 1)** |
+| `prompt_drift` | WARNING | The current prompt hash (`--prompts`) differs from the recorded one — still replays, but no longer mirrors the live prompt. | No |
+| `age` | INFO | The recording is older than `--max-age-days`. | No |
+| `no_provenance` | INFO | A legacy / hand-built cassette with no provenance — re-record to enable checks. | No |
+Only a **retired model** blocks the build — it's the one signal that means "your
+deterministic test is lying." Prompt drift and age are visible but non-blocking.
+## Flags
+| Flag | Description |
+|---|---|
+| `--models "a,b,c"` | The model set you ship today. Any recorded model not in this exact set → CRITICAL. Omit to skip the model check. |
+| `--prompts <file>` | A file of your current prompts; its hash is compared to the recorded `prompt_hash`. Omit to skip. |
+| `--max-age-days N` | Recorded-at age over `N` days → INFO. Defaults to `30` if no other check is given. |
+| `--json` | Emit `{"cassettes": [report, ...]}` (severity strings `CRITICAL`/`WARNING`/`INFO`). Still exits 1 on any CRITICAL. |
+Matching is **exact and case-sensitive** — a swap from `gpt-5.1` to `gpt-5.1-mini`
+*should* fire. No fuzzy family matching.
+### `--prompts` file shape
+The hash basis is identical to what was recorded at capture time, so a file that
+reproduces the prompts matches byte-for-byte. Accepted shapes:
+```jsonc
+// 1. JSON object with the run's input + per-LLM-call inputs
+{ "input_text": "refund order 123", "llm_inputs": ["system + user prompt...", "..."] }
+// 2. JSON list → treated as llm_inputs (input_text = "")
+["system + user prompt..."]
+// 3. anything else → treated as input_text
+```
+## Wire it into CI
+Add it as a fast, deterministic gate next to your other checks — no API key, no
+network:
+```yaml
+- name: Fail if any cassette was recorded against a retired model
+  run: evalcraft check-stale tests/cassettes/*.json --models "${{ vars.CURRENT_MODELS }}"
+```
+When a model is retired, the gate goes red — **re-record the affected cassettes**
+(which refreshes their provenance), review the new behavior, and commit.
+## Python API
+```python
+from evalcraft import StalenessChecker
+from evalcraft.core.models import Cassette
+report = StalenessChecker(max_age_days=30).check(
+    Cassette.load("tests/cassettes/refund_flow.json"),
+    current_models=["gpt-5.1", "claude-sonnet-4-5"],
+)
+assert not report.has_critical, report.to_dict()
+```

{evalcraft-0.2.0 → evalcraft-0.3.0}/docs/user-guide/cli.md RENAMED Viewed

@@ -465,6 +465,25 @@ def test_with_generated_fixtures():
 ---
+## `evalcraft check-stale`
+Flag cassettes recorded against a model that's been retired or swapped (or a drifted prompt), using the provenance each cassette records. Exits non-zero on a retired model so CI can block stale deterministic tests.
+```bash
+evalcraft check-stale CASSETTES... [OPTIONS]
+```
+| Option | Description |
+|--------|-------------|
+| `--models "a,b,c"` | Current model set; a recorded model not in it is CRITICAL |
+| `--prompts PATH` | Current prompts file; hash drift vs the recording is a WARNING |
+| `--max-age-days N` | Recorded-at age over N days is INFO (defaults to 30 if no other check) |
+| `--json` | Emit JSON; still exits 1 on any CRITICAL |
+See [Check Stale](check-stale.md) for the full guide.
+---
 ## Exit codes
 | Code | Meaning |

{evalcraft-0.2.0 → evalcraft-0.3.0}/evalcraft/__init__.py RENAMED Viewed

@@ -4,7 +4,7 @@ Record agent runs as cassettes and replay them deterministically in CI for $0;
 mock LLMs/tools, score runs, and catch real model drift with live-eval.
 """
-__version__ = "0.2.0"
+__version__ = "0.3.0"
 from evalcraft.capture.recorder import CaptureContext, capture
 from evalcraft.cloud.client import EvalcraftCloud
@@ -52,6 +52,7 @@ from evalcraft.mock.llm import MockLLM
 from evalcraft.mock.tool import MockTool
 from evalcraft.regression.detector import RegressionDetector, RegressionReport
 from evalcraft.replay.engine import ReplayEngine, replay
+from evalcraft.staleness import StalenessChecker, StalenessFinding, StalenessReport
 __all__ = [
     "capture",
@@ -95,5 +96,8 @@ __all__ = [
     "GoldenSet",
     "RegressionDetector",
     "RegressionReport",
+    "StalenessChecker",
+    "StalenessFinding",
+    "StalenessReport",
     "EvalcraftCloud",
 ]

{evalcraft-0.2.0 → evalcraft-0.3.0}/evalcraft/cli/main.py RENAMED Viewed

@@ -58,7 +58,7 @@ _SPAN_COLORS: dict[SpanKind, str] = {
 # ─── CLI root ─────────────────────────────────────────────────────────────────
 @click.group()
-@click.version_option(version="0.2.0", prog_name="evalcraft")
+@click.version_option(version="0.3.0", prog_name="evalcraft")
 def cli() -> None:
     """evalcraft — capture, replay, and evaluate AI agent runs."""
@@ -885,6 +885,108 @@ def regression_cmd(cassette: str, golden: str, as_json: bool) -> None:
         sys.exit(1)
+# ─── check-stale ──────────────────────────────────────────────────────────────
+@cli.command("check-stale")
+@click.argument("cassettes", nargs=-1, required=True,
+                type=click.Path(exists=True, dir_okay=False))
+@click.option("--models", "models_csv", default=None,
+              help="Comma-separated current model set (e.g. 'gpt-5.1,claude-sonnet-4-5'). "
+                   "A recorded model absent here is CRITICAL (retired/swapped).")
+@click.option("--prompts", "prompts_path", default=None,
+              type=click.Path(exists=True, dir_okay=False),
+              help="JSON/text file of current prompts; its hash is compared to the "
+                   "recorded prompt_hash (WARNING on drift).")
+@click.option("--max-age-days", default=None, type=int,
+              help="Recorded-at age over N days is INFO. Defaults to 30 if no other check given.")
+@click.option("--json", "as_json", is_flag=True, help="Output as JSON")
+def check_stale_cmd(
+    cassettes: tuple[str, ...],
+    models_csv: str | None,
+    prompts_path: str | None,
+    max_age_days: int | None,
+    as_json: bool,
+) -> None:
+    """Flag CASSETTES recorded against a retired model or a drifted prompt.
+    Activates each cassette's recorded provenance (model set, prompt hash,
+    timestamp). Exits non-zero if ANY cassette references a model no longer in
+    --models, so CI can block deterministic tests that have silently gone stale.
+    Example:
+        evalcraft check-stale tests/cassettes/*.json --models "gpt-5.1,claude-sonnet-4-5"
+    """
+    from evalcraft.staleness import StalenessChecker, hash_prompts_file
+    current_models = (
+        [m.strip() for m in models_csv.split(",") if m.strip()]
+        if models_csv is not None
+        else None
+    )
+    current_prompt_hash = hash_prompts_file(prompts_path) if prompts_path else None
+    # With no explicit check requested, fall back to an age check (default 30d).
+    effective_age = max_age_days
+    if max_age_days is None and current_models is None and current_prompt_hash is None:
+        effective_age = 30
+    checker = StalenessChecker(max_age_days=effective_age)
+    reports = []
+    for path in cassettes:
+        cassette = _load_cassette(path)
+        report = checker.check(
+            cassette,
+            current_models=current_models,
+            current_prompt_hash=current_prompt_hash,
+        )
+        if not report.cassette_name:
+            report.cassette_name = Path(path).stem
+        reports.append(report)
+    any_critical = any(r.has_critical for r in reports)
+    if as_json:
+        click.echo(json.dumps(
+            {"cassettes": [r.to_dict() for r in reports]}, indent=2, default=str
+        ))
+        if any_critical:
+            sys.exit(1)
+        return
+    _SEV_COLORS = {"CRITICAL": "red", "WARNING": "yellow", "INFO": "blue"}
+    click.echo(
+        click.style("  staleness check", fg="cyan", bold=True)
+        + f"  {len(reports)} cassette(s)"
+    )
+    click.echo()
+    total_findings = 0
+    for report in reports:
+        if not report.has_findings:
+            click.echo(click.style("  fresh", fg="green", bold=True) + f"  {report.cassette_name}")
+            continue
+        click.echo(click.style(f"  {report.cassette_name}", bold=True))
+        for f in report.findings:
+            total_findings += 1
+            color = _SEV_COLORS.get(f.severity.value, "white")
+            icon = click.style(f"  {f.severity.value:<8}", fg=color, bold=True)
+            click.echo(f"{icon}  [{f.category}] {f.message}")
+        click.echo()
+    if total_findings == 0:
+        click.echo(click.style("  all cassettes fresh", fg="green", bold=True))
+    elif any_critical:
+        click.echo(click.style(
+            "  CRITICAL staleness found — re-record the affected cassettes", fg="red", bold=True
+        ))
+    if any_critical:
+        sys.exit(1)
 # ─── sanitize ─────────────────────────────────────────────────────────────────
 @cli.command()
@@ -1111,8 +1213,8 @@ def cloud() -> None:
 @cloud.command("login")
 @click.option("--api-key", prompt="API key", hide_input=True,
               help="Your Evalcraft API key (ec_...)")
-@click.option("--url", default="https://api.evalcraft.dev/v1",
-              help="Override API base URL")
+@click.option("--url", default="",
+              help="Your self-hosted dashboard URL (optional; there is no public hosted service)")
 def cloud_login(api_key: str, url: str) -> None:
     """Save your API key to ~/.evalcraft/config.json.

{evalcraft-0.2.0 → evalcraft-0.3.0}/evalcraft/cli/templates/evalcraft.toml RENAMED Viewed

@@ -37,4 +37,4 @@ auto_upload = false
 [cloud]
 # Override the Evalcraft cloud API endpoint.
-# base_url = "https://api.evalcraft.dev/v1"
+# base_url = "http://localhost:8000/v1"  # your self-hosted dashboard (no public service)

{evalcraft-0.2.0 → evalcraft-0.3.0}/evalcraft/cloud/client.py RENAMED Viewed

@@ -34,13 +34,11 @@ from typing import Any
 logger = logging.getLogger(__name__)
-# NOTE: The hosted Evalcraft dashboard/API is not yet publicly available.
-# This default points at the *planned* hosted endpoint; until it ships, set
-# ``base_url`` (or the ``base_url`` field in ~/.evalcraft/config.json) to your
-# own self-hosted dashboard — see the ``dashboard/`` directory. All cloud
-# features are optional: the core capture / replay / eval workflow runs fully
-# offline and never contacts this endpoint.
-_DEFAULT_BASE_URL = "https://api.evalcraft.dev/v1"
+# There is no public hosted Evalcraft API. Cloud features are optional and target
+# a *self-hosted* dashboard (see the ``dashboard/`` directory); configure the
+# endpoint explicitly via the ``base_url`` argument, the ``EVALCRAFT_BASE_URL``
+# environment variable, or ``~/.evalcraft/config.json``. The core capture /
+# replay / eval workflow runs fully offline and never contacts any endpoint.
 _CONFIG_DIR = Path.home() / ".evalcraft"
 _CONFIG_FILE = _CONFIG_DIR / "config.json"
 _QUEUE_DIR = _CONFIG_DIR / "queue"
@@ -104,7 +102,9 @@ class EvalcraftCloud:
         api_key: Bearer token (``ec_...``).  If None, reads from
             ``~/.evalcraft/config.json`` or the ``EVALCRAFT_API_KEY``
             environment variable.
-        base_url: Override the default API endpoint.
+        base_url: URL of your self-hosted Evalcraft dashboard.  Required for any
+            cloud call — there is no public hosted service.  Falls back to the
+            ``EVALCRAFT_BASE_URL`` env var, then ``~/.evalcraft/config.json``.
         timeout: Request timeout in seconds (default 30).
         max_retries: Maximum number of retry attempts for transient errors
             (default 3).  Uses exponential backoff with jitter.
@@ -115,13 +115,13 @@ class EvalcraftCloud:
     def __init__(
         self,
         api_key: str | None = None,
-        base_url: str = _DEFAULT_BASE_URL,
+        base_url: str | None = None,
         timeout: int = 30,
         max_retries: int = 3,
         queue_dir: Path | None = None,
     ):
         self.api_key = api_key or self._load_api_key()
-        self.base_url = base_url.rstrip("/")
+        self.base_url = (base_url or self._load_base_url()).rstrip("/")
         self.timeout = timeout
         self.max_retries = max_retries
         self.queue_dir = queue_dir or _QUEUE_DIR
@@ -237,8 +237,8 @@ class EvalcraftCloud:
     # ──────────────────────────────────────────
     @staticmethod
-    def save_config(api_key: str, base_url: str = _DEFAULT_BASE_URL) -> None:
-        """Persist API key and base URL to ``~/.evalcraft/config.json``."""
+    def save_config(api_key: str, base_url: str = "") -> None:
+        """Persist the API key (and optional dashboard URL) to ``~/.evalcraft/config.json``."""
         _CONFIG_DIR.mkdir(parents=True, exist_ok=True)
         config: dict = {}
         if _CONFIG_FILE.exists():
@@ -247,7 +247,8 @@ class EvalcraftCloud:
             except Exception:
                 pass
         config["api_key"] = api_key
-        config["base_url"] = base_url
+        if base_url:
+            config["base_url"] = base_url
         _CONFIG_FILE.write_text(json.dumps(config, indent=2))
         _CONFIG_FILE.chmod(0o600)
@@ -288,6 +289,15 @@ class EvalcraftCloud:
         config = self.load_config()
         return str(config.get("api_key", ""))
+    def _load_base_url(self) -> str:
+        """Resolve the dashboard base URL from env or config (empty if unset)."""
+        import os
+        env_url = os.environ.get("EVALCRAFT_BASE_URL", "")
+        if env_url:
+            return env_url
+        config = self.load_config()
+        return str(config.get("base_url", ""))
     def _request(
         self,
         method: str,
@@ -307,11 +317,18 @@ class EvalcraftCloud:
         Raises:
             CloudUploadError: After max_retries exhausted or on 4xx errors.
         """
+        if not self.base_url:
+            raise CloudUploadError(
+                "No Evalcraft dashboard URL is configured. There is no public "
+                "hosted service — point the client at your own self-hosted "
+                "dashboard (see the dashboard/ directory) via base_url=..., the "
+                "EVALCRAFT_BASE_URL env var, or ~/.evalcraft/config.json."
+            )
         url = f"{self.base_url}{path}"
         body: bytes | None = None
         headers: dict[str, str] = {
             "Accept": "application/json",
-            "User-Agent": "evalcraft-sdk/0.2.0",
+            "User-Agent": "evalcraft-sdk/0.3.0",
         }
         if self.api_key:
             headers["Authorization"] = f"Bearer {self.api_key}"

{evalcraft-0.2.0 → evalcraft-0.3.0}/evalcraft/core/models.py RENAMED Viewed

@@ -165,6 +165,21 @@ class Provenance:
         )
+def compute_prompt_hash(input_text: str, llm_inputs: list[Any]) -> str:
+    """Hash the prompt surface of a run — the user input plus each LLM span's input.
+    Used both when recording provenance (:meth:`Cassette.capture_provenance`) and
+    when checking staleness, so a recorded hash and a recomputed one match
+    byte-for-byte. List order is significant; only dict keys are sorted.
+    """
+    basis = json.dumps(
+        {"input_text": input_text, "llm_inputs": list(llm_inputs)},
+        sort_keys=True,
+        default=str,
+    )
+    return hashlib.sha256(basis.encode()).hexdigest()[:16]
 @dataclass
 class Cassette:
     """A recorded agent run — the fundamental unit of Evalcraft.
@@ -230,15 +245,9 @@ class Cassette:
         llm_spans = self.get_llm_calls()
         models = sorted({s.model for s in llm_spans if s.model})
-        basis = json.dumps(
-            {
-                "input_text": self.input_text,
-                "llm_inputs": [s.input for s in llm_spans],
-            },
-            sort_keys=True,
-            default=str,
+        prompt_hash = compute_prompt_hash(
+            self.input_text, [s.input for s in llm_spans]
         )
-        prompt_hash = hashlib.sha256(basis.encode()).hexdigest()[:16]
         self.provenance = Provenance(
             recorded_at=time.time(),
@@ -288,7 +297,7 @@ class Cassette:
         self.compute_metrics()
         self.compute_fingerprint()
         return {
-            "evalcraft_version": "0.2.0",
+            "evalcraft_version": "0.3.0",
             "cassette": {
                 "id": self.id,
                 "name": self.name,

evalcraft-0.3.0/evalcraft/staleness/__init__.py ADDED Viewed

@@ -0,0 +1,25 @@
+"""Staleness detection — flag cassettes recorded against retired models / drifted prompts.
+A cassette's recorded provenance (model set, prompt hash, timestamp) is only
+useful if something acts on it. This module does: it turns that provenance into
+actionable CI signal so a deterministic test can't silently keep passing against
+a model that no longer exists.
+    from evalcraft.staleness import StalenessChecker
+"""
+from evalcraft.core.models import compute_prompt_hash
+from evalcraft.staleness.checker import (
+    StalenessChecker,
+    StalenessFinding,
+    StalenessReport,
+    hash_prompts_file,
+)
+__all__ = [
+    "StalenessChecker",
+    "StalenessFinding",
+    "StalenessReport",
+    "compute_prompt_hash",
+    "hash_prompts_file",
+]

evalcraft 0.2.0__tar.gz → 0.3.0__tar.gz

evalcraft 0.2.0tar.gz → 0.3.0tar.gz