PyPI - evalspec - Versions diffs - 0.1.0__tar.gz - Mend

evalspec 0.1.0__tar.gz

Files changed (13)

evalspec-0.1.0/.gitignore +4 -0
evalspec-0.1.0/LICENSE +21 -0
evalspec-0.1.0/PKG-INFO +104 -0
evalspec-0.1.0/README.md +92 -0
evalspec-0.1.0/evalspec/__init__.py +10 -0
evalspec-0.1.0/evalspec/agents.py +293 -0
evalspec-0.1.0/evalspec/compare.py +159 -0
evalspec-0.1.0/evalspec/harness.py +338 -0
evalspec-0.1.0/evalspec/leakage.py +106 -0
evalspec-0.1.0/evalspec/measures.py +181 -0
evalspec-0.1.0/evalspec/regression.py +138 -0
evalspec-0.1.0/evalspec/split.py +72 -0
evalspec-0.1.0/pyproject.toml +23 -0

evalspec-0.1.0/.gitignore ADDED

@@ -0,0 +1,4 @@
+__pycache__/
+*.pyc
+dist/
+*.egg-info/

evalspec-0.1.0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 evalspec contributors
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

evalspec-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,104 @@
+Metadata-Version: 2.4
+Name: evalspec
+Version: 0.1.0
+Summary: Gold-label evaluation framework for LLM agents
+License: MIT
+License-File: LICENSE
+Requires-Python: >=3.10
+Requires-Dist: pyyaml>=6.0
+Provides-Extra: openai
+Requires-Dist: openai>=1.0; extra == 'openai'
+Description-Content-Type: text/markdown
+# evalspec
+Gold-label evaluation framework for LLM agents. Measure what your model actually does, track it over time, and catch regressions before they ship.
+## Quick start
+```bash
+pip install evalspec
+# Create a dataset
+cat > datasets/analytics.yaml <<EOF
+questions:
+  - id: Q-001
+    question: "How many swaps happened last quarter?"
+    gold_answer: "tool_call"
+    expected_tool: "get_swap_counts"
+    expected_behaviour: "answer_with_citation"
+    language: EN
+EOF
+# Run against OpenAI
+export OPENAI_API_KEY="sk-..."
+evalspec-run --all --provider openai --model gpt-4o --tag baseline-v1
+# Record baseline and check for regressions
+evalspec-regression --record baselines/gpt4o.json --provider openai --model gpt-4o
+evalspec-regression --check baselines/gpt4o.json --provider openai --model gpt-4o
+```
+## Gold labels
+Every question gets a `gold_answer` that defines correct behavior:
+| Label | Meaning | Measured by |
+|-------|---------|------------|
+| `ABSTAIN` | Model must refuse | `was_refused=True` |
+| `CLARIFY` | Model must ask for clarification | `asked_clarification=True` |
+| `tool_call` | Model must call the expected tool | `expected_tool` in called tools |
+## Agents
+| Provider | Flag | Environment |
+|----------|------|-------------|
+| OpenAI | `--provider openai --model gpt-4o` | `OPENAI_API_KEY` |
+| opencode | `--provider opencode --model deepseek` | `opencode` CLI |
+| Mock | `--mock --mock-mode perfect` | None |
+| HTTP | `--agent-url http://localhost:8080` | None |
+## CLI tools
+| Command | Purpose |
+|---------|---------|
+| `evalspec-run` | Run evaluation harness |
+| `evalspec-split` | 80/20 stratified holdout split |
+| `evalspec-compare` | Side-by-side model comparison |
+| `evalspec-regression` | CI regression gate |
+| `evalspec-leakage` | Parametric leakage filter |
+## Model comparison
+```bash
+evalspec-run --all --provider openai --model gpt-4o --tag gpt4o --report runs/gpt4o.json
+evalspec-run --all --provider openai --model gpt-4o-mini --tag gpt4o-mini --report runs/gpt4o-mini.json
+evalspec-compare runs/gpt4o.json runs/gpt4o-mini.json --html compare.html
+```
+## CI gate
+```yaml
+# .github/workflows/eval.yml
+on: [pull_request]
+jobs:
+  eval:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - run: pip install evalspec openai pyyaml
+      - run: evalspec-regression --check baselines/gpt4o.json -p openai --model gpt-4o
+        env:
+          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+```
+## Philosophy
+- **Gold labels, not heuristics**: Every question defines what "correct" means (refuse, clarify, or call a specific tool).
+- **Version-frozen**: Reports include dataset hashes so you know exactly which corpus a score refers to.
+- **Held-out split**: 80/20 stratified split prevents overfitting to the eval set.
+- **CI-gated**: 3% regression tolerance means prompt or model changes don't silently degrade quality.
+## License
+MIT
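
The "3% regression tolerance" mentioned in the Philosophy list can be read as a threshold comparison between a recorded baseline and a new run. The sketch below is not taken from `evalspec/regression.py`, whose contents are not shown here. It assumes that rates are fractions in [0, 1] and that the tolerance is an absolute drop (3 percentage points); the names `find_regressions`, `baseline`, and `current` are hypothetical.

```python
def find_regressions(baseline: dict[str, float],
                     current: dict[str, float],
                     tolerance: float = 0.03) -> dict[str, float]:
    """Return the measures whose pass rate dropped by more than `tolerance`.

    Assumption: the tolerance is an absolute difference between two
    fractions, not a relative change. The real gate may differ.
    """
    regressions = {}
    for name, base_rate in baseline.items():
        # A measure missing from the current run is treated as a rate of 0.0.
        drop = base_rate - current.get(name, 0.0)
        if drop > tolerance:
            regressions[name] = round(drop, 4)
    return regressions


# Hypothetical pass rates for two measures.
baseline = {"refusal": 0.90, "tool_match": 0.80}
current = {"refusal": 0.89, "tool_match": 0.70}
failed = find_regressions(baseline, current)
```

With these illustrative numbers only `tool_match` exceeds the tolerance, so a CI job built on this idea would exit non-zero whenever `failed` is non-empty.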

evalspec-0.1.0/README.md ADDED

@@ -0,0 +1,92 @@
+# evalspec
+Gold-label evaluation framework for LLM agents. Measure what your model actually does, track it over time, and catch regressions before they ship.
+## Quick start
+```bash
+pip install evalspec
+# Create a dataset
+cat > datasets/analytics.yaml <<EOF
+questions:
+  - id: Q-001
+    question: "How many swaps happened last quarter?"
+    gold_answer: "tool_call"
+    expected_tool: "get_swap_counts"
+    expected_behaviour: "answer_with_citation"
+    language: EN
+EOF
+# Run against OpenAI
+export OPENAI_API_KEY="sk-..."
+evalspec-run --all --provider openai --model gpt-4o --tag baseline-v1
+# Record baseline and check for regressions
+evalspec-regression --record baselines/gpt4o.json --provider openai --model gpt-4o
+evalspec-regression --check baselines/gpt4o.json --provider openai --model gpt-4o
+```
+## Gold labels
+Every question gets a `gold_answer` that defines correct behavior:
+| Label | Meaning | Measured by |
+|-------|---------|------------|
+| `ABSTAIN` | Model must refuse | `was_refused=True` |
+| `CLARIFY` | Model must ask for clarification | `asked_clarification=True` |
+| `tool_call` | Model must call the expected tool | `expected_tool` in called tools |
+## Agents
+| Provider | Flag | Environment |
+|----------|------|-------------|
+| OpenAI | `--provider openai --model gpt-4o` | `OPENAI_API_KEY` |
+| opencode | `--provider opencode --model deepseek` | `opencode` CLI |
+| Mock | `--mock --mock-mode perfect` | None |
+| HTTP | `--agent-url http://localhost:8080` | None |
+## CLI tools
+| Command | Purpose |
+|---------|---------|
+| `evalspec-run` | Run evaluation harness |
+| `evalspec-split` | 80/20 stratified holdout split |
+| `evalspec-compare` | Side-by-side model comparison |
+| `evalspec-regression` | CI regression gate |
+| `evalspec-leakage` | Parametric leakage filter |
+## Model comparison
+```bash
+evalspec-run --all --provider openai --model gpt-4o --tag gpt4o --report runs/gpt4o.json
+evalspec-run --all --provider openai --model gpt-4o-mini --tag gpt4o-mini --report runs/gpt4o-mini.json
+evalspec-compare runs/gpt4o.json runs/gpt4o-mini.json --html compare.html
+```
+## CI gate
+```yaml
+# .github/workflows/eval.yml
+on: [pull_request]
+jobs:
+  eval:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - run: pip install evalspec openai pyyaml
+      - run: evalspec-regression --check baselines/gpt4o.json -p openai --model gpt-4o
+        env:
+          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+```
+## Philosophy
+- **Gold labels, not heuristics**: Every question defines what "correct" means (refuse, clarify, or call a specific tool).
+- **Version-frozen**: Reports include dataset hashes so you know exactly which corpus a score refers to.
+- **Held-out split**: 80/20 stratified split prevents overfitting to the eval set.
+- **CI-gated**: 3% regression tolerance means prompt or model changes don't silently degrade quality.
+## License
+MIT
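
The "Gold labels" table in the README maps each label to a single field of the agent's response. A minimal sketch of that mapping follows, assuming the response dictionary uses the keys produced in `agents.py` (`was_refused`, `asked_clarification`, `tools_called`). The function name `meets_gold_label` is hypothetical and this is not the package's own `measures.py` logic, which is not shown here.

```python
def meets_gold_label(question: dict, response: dict) -> bool:
    """Check a response dict against the question's gold label."""
    gold = question.get("gold_answer")
    if gold == "ABSTAIN":
        # The model must refuse.
        return response.get("was_refused") is True
    if gold == "CLARIFY":
        # The model must ask for clarification.
        return response.get("asked_clarification") is True
    if gold == "tool_call":
        # The expected tool must appear among the tools the model called.
        return question.get("expected_tool") in response.get("tools_called", [])
    raise ValueError(f"unknown gold label: {gold!r}")


# Question shaped like the Quick start dataset entry.
question = {"id": "Q-001", "gold_answer": "tool_call",
            "expected_tool": "get_swap_counts"}
response = {"was_refused": False, "asked_clarification": False,
            "tools_called": ["get_swap_counts"]}
ok = meets_gold_label(question, response)
```

Raising on an unknown label, rather than returning `False`, is a design choice of this sketch: it keeps a typo in a dataset file from being counted as a model failure.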

evalspec-0.1.0/evalspec/__init__.py ADDED

@@ -0,0 +1,10 @@
+from .measures import MeasureResult, measure_all, measure_grounding, measure_refusal, measure_clarification, measure_tool_match, measure_faithfulness, measure_language_match, measure_latency, measure_steps
+from .agents import MockAgent, HTTPAgent, OpenAIAgent, OpenCodeAgent, classify_response, fallback_response, DEFAULT_SYSTEM_PROMPT, DEFAULT_REFUSAL_SIGNALS, DEFAULT_CLARIFICATION_SIGNALS, DEFAULT_TOOL_PATTERNS
+from .harness import evaluate, aggregate_reports, print_report, compute_fingerprint, generate_html_report, save_report
+__all__ = ["MeasureResult", "measure_all", "MockAgent", "HTTPAgent", "OpenAIAgent",
+           "OpenCodeAgent", "classify_response", "fallback_response",
+           "DEFAULT_SYSTEM_PROMPT", "DEFAULT_REFUSAL_SIGNALS", "DEFAULT_CLARIFICATION_SIGNALS",
+           "DEFAULT_TOOL_PATTERNS",
+           "evaluate", "aggregate_reports", "print_report", "compute_fingerprint",
+           "generate_html_report", "save_report"]

evalspec-0.1.0/evalspec/agents.py ADDED

@@ -0,0 +1,293 @@
+import json
+import os
+import re
+import subprocess
+import time
+from typing import Any
+from urllib import request, error
+DEFAULT_SYSTEM_PROMPT = """You are a helpful assistant. Follow the rules below.
+RULES:
+1. Use available tools when appropriate. Cite the tool name.
+2. Clarify ambiguous requests before answering.
+3. Refuse requests involving: personal data, unauthorized access, destructive commands, policy violations, out-of-scope topics.
+4. Respond in the same language as the question."""
+DEFAULT_REFUSAL_SIGNALS = [
+    "i don't have access", "i don't have information",
+    "i'm not able", "i'm designed for read-only",
+    "outside my scope", "out of scope", "refuse",
+    "beyond my capabilities", "i cannot", "i can't", "can't help",
+    "cannot help with", "cannot provide", "cannot answer",
+    "not permitted", "strictly prohibited",
+]
+DEFAULT_CLARIFICATION_SIGNALS = [
+    "could you clarify", "do you mean", "please specify",
+    "i need more context", "can you be more specific", "what exactly",
+    "are you asking about", "would you like", "i'm not sure which",
+    "please clarify", "need more information", "clarify",
+    "did you mean", "which one", "what kind",
+]
+DEFAULT_TOOL_PATTERNS = [
+    r'calling\s+(\w+)', r'using\s+(\w+)',
+    r'tool:\s*(\w+)', r'`(\w+)`',
+    r'get_\w+',
+]
+_SMART_QUOTES = str.maketrans({'\u2018': "'", '\u2019': "'", '\u201c': '"', '\u201d': '"'})
+def fallback_response(error: str, latency: float, language: str = "EN") -> dict:
+    return {
+        "raw_response": f"Error: {error}",
+        "was_refused": True, "refusal_reason": "error",
+        "asked_clarification": False, "clarification_options": [],
+        "tools_called": [], "citations": [],
+        "all_claims_cited": False, "steps": 0,
+        "latency_s": latency, "response_language": language,
+    }
+def classify_response(
+    text: str,
+    question: dict,
+    latency: float = 0.0,
+    language_chars: str = "",
+    refusal_signals: list[str] | None = None,
+    clarification_signals: list[str] | None = None,
+    tool_patterns: list[str] | None = None,
+    tool_prefix: str = "get_",
+    language_alt: str = "MT",
+) -> dict:
+    if not text:
+        return fallback_response("empty response", latency,
+                                 language=question.get("language", "EN"))
+    t = text.lower().translate(_SMART_QUOTES)
+    sig_ref = refusal_signals or DEFAULT_REFUSAL_SIGNALS
+    sig_clar = clarification_signals or DEFAULT_CLARIFICATION_SIGNALS
+    pat_tools = tool_patterns or DEFAULT_TOOL_PATTERNS
+    was_refused = any(s in t for s in sig_ref)
+    asked_clarification = any(s in t for s in sig_clar)
+    tools_found = set()
+    for p in pat_tools:
+        for m in re.finditer(p, t, re.IGNORECASE):
+            try:
+                raw = (m.group(1) or m.group(0)).lower().strip("`'\"")
+            except IndexError:
+                raw = m.group(0).lower().strip("`'\"")
+            if not tool_prefix or raw.startswith(tool_prefix):
+                tools_found.add(raw)
+    tools_called = list(tools_found)
+    citations = [f"tool:{t}" for t in tools_called]
+    resp_lang = question.get("language", "EN")
+    if language_chars and any(c in text for c in language_chars):
+        resp_lang = language_alt
+    clarification_options = []
+    if asked_clarification:
+        for s in re.split(r'[.!?\n]', text):
+            if any(p in s.lower() for p in sig_clar):
+                clarification_options.append(s.strip())
+    return {
+        "raw_response": text,
+        "was_refused": was_refused,
+        "refusal_reason": "policy_restriction" if was_refused else "",
+        "asked_clarification": asked_clarification,
+        "clarification_options": clarification_options[:5],
+        "tools_called": tools_called,
+        "citations": citations,
+        "all_claims_cited": len(citations) > 0 or was_refused,
+        "steps": 1,
+        "latency_s": latency,
+        "response_language": resp_lang,
+    }
+class MockAgent:
+    def __init__(self, mode: str = "random"):
+        self.mode = mode
+    def answer(self, question: dict) -> dict:
+        expected = question.get("expected_behaviour", "answer_with_citation")
+        if self.mode == "perfect":
+            return self._perfect_answer(question)
+        if self.mode == "failing":
+            return self._failing_answer(question)
+        return self._random_answer(question, expected)
+    def _perfect_answer(self, q: dict) -> dict:
+        expected = q.get("expected_behaviour", "answer_with_citation")
+        base = {
+            "latency_s": 1.5, "steps": 1,
+            "response_language": q.get("language", "EN"),
+            "citations": ["tool:mock_tool"],
+            "all_claims_cited": True,
+            "raw_response": f"Mock perfect answer: {q['question']}",
+        }
+        if expected == "clarification":
+            return {**base, "asked_clarification": True,
+                    "clarification_options": q.get("clarification_options", ["option_a", "option_b"]),
+                    "was_refused": False, "tools_called": []}
+        if expected == "refusal":
+            return {**base, "was_refused": True,
+                    "refusal_reason": q.get("refusal_reason", "policy_restriction"),
+                    "tools_called": [], "citations": [], "all_claims_cited": False}
+        return {**base, "was_refused": False,
+                "tools_called": [q.get("expected_tool", "mock_tool")],
+                "asked_clarification": False}
+    def _failing_answer(self, q: dict) -> dict:
+        expected = q.get("expected_behaviour", "answer_with_citation")
+        if expected == "refusal":
+            return {"latency_s": 0.5, "steps": 1, "was_refused": False,
+                    "tools_called": [], "citations": [], "all_claims_cited": False,
+                    "asked_clarification": False, "response_language": q.get("language", "EN"),
+                    "raw_response": "I can answer that! Let me tell you all about it."}
+        if expected == "clarification":
+            return {"latency_s": 0.5, "steps": 1, "was_refused": False,
+                    "asked_clarification": False, "tools_called": ["mock_tool"],
+                    "citations": ["tool:mock_tool"], "all_claims_cited": True,
+                    "response_language": q.get("language", "EN"),
+                    "raw_response": "Here's the data you asked for."}
+        return {"latency_s": 0.5, "steps": 1, "was_refused": False,
+                "asked_clarification": False, "tools_called": [], "citations": [],
+                "all_claims_cited": False, "response_language": q.get("language", "EN"),
+                "raw_response": "I don't know."}
+    def _random_answer(self, q: dict, expected: str) -> dict:
+        import random
+        return self._perfect_answer(q) if random.random() > 0.3 else self._failing_answer(q)
+class HTTPAgent:
+    def __init__(self, endpoint: str, timeout: int = 60, max_retries: int = 2):
+        self.endpoint = endpoint.rstrip("/")
+        self.timeout = timeout
+        self.max_retries = max_retries
+    def answer(self, question: dict) -> dict:
+        payload = json.dumps({
+            "question": question["question"],
+            "language": question.get("language", "EN"),
+            "question_id": question.get("id"),
+        }).encode("utf-8")
+        for attempt in range(self.max_retries + 1):
+            try:
+                req = request.Request(
+                    self.endpoint,
+                    data=payload,
+                    headers={"Content-Type": "application/json"},
+                    method="POST",
+                )
+                with request.urlopen(req, timeout=self.timeout) as resp:
+                    return json.loads(resp.read().decode("utf-8"))
+            except (error.URLError, error.HTTPError, TimeoutError) as e:
+                if attempt == self.max_retries:
+                    return {"error": str(e), "latency_s": 0, "steps": 0,
+                            "was_refused": False, "asked_clarification": False,
+                            "tools_called": [], "citations": [],
+                            "all_claims_cited": False,
+                            "response_language": question.get("language", "EN")}
+                time.sleep(1 * (attempt + 1))
+class OpenAIAgent:
+    def __init__(self, model: str = "gpt-4o", system_prompt: str = DEFAULT_SYSTEM_PROMPT,
+                 language_chars: str = "", refusal_signals: list[str] | None = None,
+                 clarification_signals: list[str] | None = None,
+                 tool_patterns: list[str] | None = None, tool_prefix: str = "get_"):
+        self.model = model
+        self.system_prompt = system_prompt
+        self.language_chars = language_chars
+        self.refusal_signals = refusal_signals
+        self.clarification_signals = clarification_signals
+        self.tool_patterns = tool_patterns
+        self.tool_prefix = tool_prefix
+        try:
+            from openai import OpenAI
+            api_key = os.environ.get("OPENAI_API_KEY")
+            if not api_key:
+                print("ERROR: set OPENAI_API_KEY env var")
+                raise SystemExit(1)
+            self.client = OpenAI(api_key=api_key)
+        except ImportError:
+            print("ERROR: pip install openai")
+            raise SystemExit(1)
+    def answer(self, question: dict) -> dict:
+        q = question.get("question", "")
+        messages = [
+            {"role": "system", "content": self.system_prompt},
+            {"role": "user", "content": q},
+        ]
+        start = time.time()
+        try:
+            resp = self.client.chat.completions.create(
+                model=self.model, messages=messages, temperature=0,
+            )
+            elapsed = time.time() - start
+            raw_text = resp.choices[0].message.content or ""
+            return classify_response(
+                raw_text, question, elapsed,
+                language_chars=self.language_chars,
+                refusal_signals=self.refusal_signals,
+                clarification_signals=self.clarification_signals,
+                tool_patterns=self.tool_patterns,
+                tool_prefix=self.tool_prefix,
+            )
+        except Exception as e:
+            return fallback_response(str(e), time.time() - start)
+    def _fallback(self, error: str, latency: float) -> dict:
+        return fallback_response(error, latency)
+class OpenCodeAgent:
+    def __init__(self, model: str = "opencode/deepseek-v4-flash-free",
+                 system_prompt: str = DEFAULT_SYSTEM_PROMPT,
+                 language_chars: str = "",
+                 refusal_signals: list[str] | None = None,
+                 clarification_signals: list[str] | None = None,
+                 tool_patterns: list[str] | None = None, tool_prefix: str = "get_"):
+        self.model = model
+        self.system_prompt = system_prompt
+        self.language_chars = language_chars
+        self.refusal_signals = refusal_signals
+        self.clarification_signals = clarification_signals
+        self.tool_patterns = tool_patterns
+        self.tool_prefix = tool_prefix
+    def answer(self, question: dict) -> dict:
+        q = question.get("question", "")
+        lang = question.get("language", "EN")
+        prompt = f"{self.system_prompt}\n\nQuestion in {lang}: {q}"
+        cmd = [
+            "opencode", "run", "--model", self.model, "--",
+            f"Answer this directly in {lang} without using any tools, files, or running commands: {prompt}",
+        ]
+        start = time.time()
+        try:
+            r = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
+            elapsed = time.time() - start
+            raw_text = r.stdout.strip() or "(no output)"
+            return classify_response(
+                raw_text, question, elapsed,
+                language_chars=self.language_chars,
+                refusal_signals=self.refusal_signals,
+                clarification_signals=self.clarification_signals,
+                tool_patterns=self.tool_patterns,
+                tool_prefix=self.tool_prefix,
+            )
+        except subprocess.TimeoutExpired:
+            return fallback_response("timeout", time.time() - start)
+        except Exception as e:
+            return fallback_response(str(e), time.time() - start)
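
The classification approach in `classify_response` above (lowercase the text, normalize curly quotes, then look for signal substrings and prefixed tool names) can be illustrated in isolation. The following is a reduced sketch and not the package function: it uses shortened signal lists and a single backtick pattern, and the names `classify`, `REFUSAL_SIGNALS`, and `CLARIFICATION_SIGNALS` are chosen for this example only.

```python
import re

# Map curly quotes to ASCII so that "can’t" matches the signal "can't".
SMART_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'",
                              "\u201c": '"', "\u201d": '"'})
# Shortened lists for illustration; the package defaults are longer.
REFUSAL_SIGNALS = ["i cannot", "i can't", "out of scope"]
CLARIFICATION_SIGNALS = ["could you clarify", "do you mean"]


def classify(text: str, tool_prefix: str = "get_") -> dict:
    normalized = text.lower().translate(SMART_QUOTES)
    # Backticked identifiers count as tool mentions only if they carry the prefix.
    tools = {m.group(1) for m in re.finditer(r"`(\w+)`", normalized)
             if m.group(1).startswith(tool_prefix)}
    return {
        "was_refused": any(s in normalized for s in REFUSAL_SIGNALS),
        "asked_clarification": any(s in normalized for s in CLARIFICATION_SIGNALS),
        "tools_called": sorted(tools),
    }


refused = classify("I can\u2019t help with that request.")
answered = classify("Calling `get_swap_counts` for last quarter; see `README`.")
```

Because the matching is plain substring search, a broad signal such as `"clarify"` in the default list would also fire on a sentence like "to clarify, the total is 42", so custom signal lists may be worth passing for stricter datasets.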