PyPI - codeprobe - Versions diffs - 0.3.5__tar.gz → 0.3.7__tar.gz - Mend

codeprobe 0.3.5tar.gz → 0.3.7tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (185) hide show

{codeprobe-0.3.5 → codeprobe-0.3.7}/PKG-INFO RENAMED Viewed

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: codeprobe
-Version: 0.3.5
+Version: 0.3.7
 Summary: Benchmark AI coding agents against your own codebase. Mine real tasks from repo history, run agents, interpret results.
 Author: codeprobe contributors
 License-Expression: Apache-2.0
@@ -42,7 +42,7 @@ Mine real tasks from your repo history, run agents against them, and find out wh
 ## Why codeprobe?
-Existing benchmarks (SWE-bench, HumanEval) use fixed task sets that AI models may have memorized from training data, and as general public benchmarks likely don't capture what is most important to your unique  workflows. codeprobe mines tasks from **your private repo history**, producing benchmarks that are impossible to contaminate. You can also point the tool at any public repo to mine tasks from.
+Existing benchmarks (SWE-bench, HumanEval) use fixed task sets that AI models may have memorized from training data, and as general public benchmarks likely don't capture what is most important to your unique workflows. codeprobe mines tasks from **your private repo history**, producing benchmarks that are impossible to contaminate. You can also point the tool at any public repo to mine tasks from.
 ## Prerequisites
@@ -122,6 +122,28 @@ codeprobe probe . -n 10 -l python -s 42 -o ./probes
 Generates four probe types: find-function, count-callers, return-type, module-dependency.
+## Curation Workflows
+End-to-end flows from a raw repo to ranked agent results. Each workflow covers the full `assess → mine → validate → run → interpret` pipeline.
+| Workflow       | When to use                               | Guide                                                        |
+| -------------- | ----------------------------------------- | ------------------------------------------------------------ |
+| **Standard**   | Repo has merged PRs/MRs                   | [docs/workflows/standard.md](docs/workflows/standard.md)     |
+| **Cold-start** | New repo, squashed history, vendored code | [docs/workflows/cold-start.md](docs/workflows/cold-start.md) |
+| **Cross-repo** | Tasks spanning multiple repositories      | [docs/workflows/cross-repo.md](docs/workflows/cross-repo.md) |
+**Quick start (standard path):**
+```bash
+codeprobe assess /path/to/repo
+codeprobe mine /path/to/repo --goal quality --count 10 --no-interactive
+codeprobe validate /path/to/repo/.codeprobe/tasks/<task-id>
+codeprobe run /path/to/repo --agent claude --max-cost-usd 5.00
+codeprobe interpret /path/to/repo
+```
+For the full MCP comparison setup (preambles, baseline vs with-MCP configs), see the next section.
 ## MCP Comparison Experiments
 Compare agent performance with and without MCP tools (Sourcegraph, GitHub, etc.).

{codeprobe-0.3.5 → codeprobe-0.3.7}/README.md RENAMED Viewed

@@ -6,7 +6,7 @@ Mine real tasks from your repo history, run agents against them, and find out wh
 ## Why codeprobe?
-Existing benchmarks (SWE-bench, HumanEval) use fixed task sets that AI models may have memorized from training data, and as general public benchmarks likely don't capture what is most important to your unique  workflows. codeprobe mines tasks from **your private repo history**, producing benchmarks that are impossible to contaminate. You can also point the tool at any public repo to mine tasks from.
+Existing benchmarks (SWE-bench, HumanEval) use fixed task sets that AI models may have memorized from training data, and as general public benchmarks likely don't capture what is most important to your unique workflows. codeprobe mines tasks from **your private repo history**, producing benchmarks that are impossible to contaminate. You can also point the tool at any public repo to mine tasks from.
 ## Prerequisites
@@ -86,6 +86,28 @@ codeprobe probe . -n 10 -l python -s 42 -o ./probes
 Generates four probe types: find-function, count-callers, return-type, module-dependency.
+## Curation Workflows
+End-to-end flows from a raw repo to ranked agent results. Each workflow covers the full `assess → mine → validate → run → interpret` pipeline.
+| Workflow       | When to use                               | Guide                                                        |
+| -------------- | ----------------------------------------- | ------------------------------------------------------------ |
+| **Standard**   | Repo has merged PRs/MRs                   | [docs/workflows/standard.md](docs/workflows/standard.md)     |
+| **Cold-start** | New repo, squashed history, vendored code | [docs/workflows/cold-start.md](docs/workflows/cold-start.md) |
+| **Cross-repo** | Tasks spanning multiple repositories      | [docs/workflows/cross-repo.md](docs/workflows/cross-repo.md) |
+**Quick start (standard path):**
+```bash
+codeprobe assess /path/to/repo
+codeprobe mine /path/to/repo --goal quality --count 10 --no-interactive
+codeprobe validate /path/to/repo/.codeprobe/tasks/<task-id>
+codeprobe run /path/to/repo --agent claude --max-cost-usd 5.00
+codeprobe interpret /path/to/repo
+```
+For the full MCP comparison setup (preambles, baseline vs with-MCP configs), see the next section.
 ## MCP Comparison Experiments
 Compare agent performance with and without MCP tools (Sourcegraph, GitHub, etc.).

{codeprobe-0.3.5 → codeprobe-0.3.7}/pyproject.toml RENAMED Viewed

@@ -1,6 +1,6 @@
 [project]
 name = "codeprobe"
-version = "0.3.5"
+version = "0.3.7"
 description = "Benchmark AI coding agents against your own codebase. Mine real tasks from repo history, run agents, interpret results."
 readme = "README.md"
 license = "Apache-2.0"
@@ -75,6 +75,9 @@ where = ["src"]
 [tool.pytest.ini_options]
 testpaths = ["tests"]
+markers = [
+    "integration: requires external services (skipped by default in CI)",
+]
 [tool.mypy]
 python_version = "3.11"

{codeprobe-0.3.5 → codeprobe-0.3.7}/src/codeprobe/__init__.py RENAMED Viewed

@@ -1,3 +1,3 @@
 """codeprobe — Benchmark AI coding agents against your own codebase."""
-__version__ = "0.3.1"
+__version__ = "0.3.7"

{codeprobe-0.3.5 → codeprobe-0.3.7}/src/codeprobe/adapters/_base.py RENAMED Viewed

@@ -63,6 +63,19 @@ def _adapter_safe_env(extra: dict[str, str] | None = None) -> dict[str, str]:
     return env
+def _decode_timeout_output(raw: str | bytes | None) -> str:
+    """Decode stdout/stderr from a TimeoutExpired exception.
+    The exception may carry ``str``, ``bytes``, or ``None`` depending on
+    how ``subprocess.run`` was called and how the process was killed.
+    """
+    if raw is None:
+        return ""
+    if isinstance(raw, bytes):
+        return raw.decode("utf-8", errors="replace")
+    return raw
 class BaseAdapter:
     """Base class for CLI-based agent adapters.
@@ -162,12 +175,46 @@ class BaseAdapter:
             )
         except subprocess.TimeoutExpired as exc:
             duration = time.monotonic() - start
+            timeout_error = f"Agent timed out after {config.timeout_seconds}s"
+            raw_stdout = _decode_timeout_output(exc.stdout)
+            raw_stderr = _decode_timeout_output(exc.stderr) or None
+            if raw_stdout:
+                try:
+                    partial_result = subprocess.CompletedProcess(
+                        args=cmd,
+                        returncode=-1,
+                        stdout=raw_stdout,
+                        stderr=raw_stderr or "",
+                    )
+                    parsed = self.parse_output(partial_result, duration)
+                    merged_error = timeout_error
+                    if parsed.error:
+                        merged_error = f"{timeout_error}; {parsed.error}"
+                    return AgentOutput(
+                        stdout=parsed.stdout,
+                        stderr=parsed.stderr,
+                        exit_code=-1,
+                        duration_seconds=duration,
+                        input_tokens=parsed.input_tokens,
+                        output_tokens=parsed.output_tokens,
+                        cache_read_tokens=parsed.cache_read_tokens,
+                        cost_usd=parsed.cost_usd,
+                        cost_model=parsed.cost_model,
+                        cost_source=parsed.cost_source,
+                        error=merged_error,
+                        tool_call_count=parsed.tool_call_count,
+                    )
+                except Exception as parse_exc:
+                    timeout_error = f"{timeout_error}; parse_output failed: {parse_exc}"
             return AgentOutput(
-                stdout=exc.stdout if isinstance(exc.stdout, str) else "",
-                stderr=exc.stderr if isinstance(exc.stderr, str) else None,
+                stdout=raw_stdout,
+                stderr=raw_stderr,
                 exit_code=-1,
                 duration_seconds=duration,
-                error=f"Agent timed out after {config.timeout_seconds}s",
+                error=timeout_error,
             )
         except FileNotFoundError as exc:
             raise AdapterSetupError(f"Binary not found at runtime: {exc}") from exc

{codeprobe-0.3.5 → codeprobe-0.3.7}/src/codeprobe/adapters/claude.py RENAMED Viewed

@@ -132,4 +132,5 @@ class ClaudeAdapter(BaseAdapter):
             cost_model=usage.cost_model,
             cost_source=usage.cost_source,
             error=usage.error,
+            tool_call_count=usage.tool_call_count,
         )

{codeprobe-0.3.5 → codeprobe-0.3.7}/src/codeprobe/adapters/protocol.py RENAMED Viewed

@@ -44,6 +44,7 @@ class AgentOutput:
     cost_model: str = "unknown"
     error: str | None = None
     cost_source: str = "unavailable"
+    tool_call_count: int | None = None
     def __post_init__(self) -> None:
         if self.cost_model not in ALLOWED_COST_MODELS:
@@ -81,6 +82,13 @@ class AgentAdapter(Protocol):
         [project.entry-points."codeprobe.agents"]
         myagent = "my_package:MyAgentAdapter"
+    For cross-repo tasks, the executor may lay out additional
+    repositories under ``<workspace>/repos/<name>``, each pinned to its
+    own pre-merge commit.  Adapters don't need special handling — the
+    paths are available for the model to navigate, and the primary
+    workspace remains at its existing location for backwards
+    compatibility with single-repo tasks.
     """
     @property

{codeprobe-0.3.5 → codeprobe-0.3.7}/src/codeprobe/adapters/telemetry.py RENAMED Viewed

@@ -65,6 +65,7 @@ class UsageData:
     cost_model: str = "unknown"
     cost_source: str = "unavailable"
     error: str | None = None
+    tool_call_count: int | None = None
     def __post_init__(self) -> None:
         if self.cost_model not in ALLOWED_COST_MODELS:
@@ -86,6 +87,28 @@ class TelemetryCollector(Protocol):
     def collect(self, raw_output: str, **context: Any) -> UsageData: ...
+def _count_tool_use_blocks(envelope: dict[str, Any]) -> int | None:
+    """Count ``tool_use`` content blocks in a Claude CLI JSON envelope.
+    Iterates the ``messages`` array (when present) and counts content
+    blocks with ``type == "tool_use"`` in assistant messages.
+    Returns ``None`` when the envelope has no ``messages`` key.
+    """
+    messages = envelope.get("messages")
+    if messages is None:
+        return None
+    count = 0
+    for msg in messages:
+        content = msg.get("content")
+        if not isinstance(content, list):
+            continue
+        for block in content:
+            if isinstance(block, dict) and block.get("type") == "tool_use":
+                count += 1
+    return count
 class JsonStdoutCollector:
     """Extract telemetry from Claude CLI JSON envelope on stdout.
@@ -125,6 +148,8 @@ class JsonStdoutCollector:
             cost_model = "unknown"
             cost_source = "unavailable"
+        tool_call_count = _count_tool_use_blocks(envelope)
         return UsageData(
             input_tokens=input_tokens,
             output_tokens=output_tokens,
@@ -132,6 +157,7 @@ class JsonStdoutCollector:
             cost_usd=cost_usd_raw,
             cost_model=cost_model,
             cost_source=cost_source,
+            tool_call_count=tool_call_count,
         )

{codeprobe-0.3.5 → codeprobe-0.3.7}/src/codeprobe/api.py RENAMED Viewed

@@ -185,8 +185,17 @@ def run_experiment(
         save_config_results(experiment_dir, exp_config.label, results)
-        passed = sum(1 for r in results if r.automated_score >= 1.0)
-        logger.info("[%s] %d/%d passed", exp_config.label, passed, len(results))
+        scoring = sum(1 for r in results if r.automated_score > 0.0)
+        mean = (
+            sum(r.automated_score for r in results) / len(results) if results else 0.0
+        )
+        logger.info(
+            "[%s] %d/%d scored (mean=%.2f)",
+            exp_config.label,
+            scoring,
+            len(results),
+            mean,
+        )
         all_config_results.append(
             ConfigResults(config=exp_config.label, completed=results)

{codeprobe-0.3.5 → codeprobe-0.3.7}/src/codeprobe/assess/heuristics.py RENAMED Viewed

@@ -69,6 +69,16 @@ _TEST_GLOBS: list[str] = [
     "*.spec.js",
 ]
+# Recursive variants for repos with nested test layouts (e.g. numpy/_core/tests/).
+_RECURSIVE_TEST_DIR_GLOBS: list[str] = [
+    "**/tests/**",
+    "**/test/**",
+    "**/spec/**",
+    "**/__tests__/**",
+]
+_RECURSIVE_TEST_FILE_GLOBS: list[str] = [f"**/{p}" for p in _TEST_GLOBS]
 # ---------------------------------------------------------------------------
 # Fixed rubric — model scores against these, doesn't invent them
 # ---------------------------------------------------------------------------
@@ -217,16 +227,20 @@ def _detect_primary_languages(file_list: str) -> list[str]:
 def _has_tests(repo_path: Path) -> bool:
-    """Check whether the repo appears to contain tests."""
+    """Check whether the repo appears to contain tests.
+    Checks top-level test directories first, then falls back to recursive
+    git ls-files glob patterns to catch repos with nested test layouts
+    (e.g. numpy/_core/tests/, numpy/tests/).
+    """
+    # Fast path: top-level test directories
     for d in _TEST_DIRS:
         if (repo_path / d).is_dir():
             return True
-    # Check for test files via git ls-files
-    for pattern in _TEST_GLOBS:
-        out = _run_git(["ls-files", "--", pattern], cwd=repo_path)
-        if out:
-            return True
-    return False
+    # Single git ls-files call with all patterns (top-level + recursive)
+    all_patterns = _TEST_GLOBS + _RECURSIVE_TEST_DIR_GLOBS + _RECURSIVE_TEST_FILE_GLOBS
+    out = _run_git(["ls-files", "--", *all_patterns], cwd=repo_path)
+    return bool(out)
 def _has_ci(repo_path: Path) -> bool:

codeprobe 0.3.5__tar.gz → 0.3.7__tar.gz

codeprobe 0.3.5tar.gz → 0.3.7tar.gz