
inspect-eval-utils 1.2.0 (tar.gz) → 1.3.0 (tar.gz)


Files changed (38)

{inspect_eval_utils-1.2.0 → inspect_eval_utils-1.3.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: inspect-eval-utils
-Version: 1.2.0
+Version: 1.3.0
 Summary: Shared utilities for METR Inspect AI eval repos: task scaffolder + common runtime helpers.
 Project-URL: Repository, https://github.com/METR/inspect-eval-utils
 Project-URL: Issues, https://github.com/METR/inspect-eval-utils/issues
@@ -297,6 +297,31 @@ tools call <tool-name> --json-args '{"arg": "value"}'
 The CLI keeps a short cache for list/help/completion metadata, but tool calls
 refresh the current `ToolSource` before execution.
+#### Running the tool CLI from a task setup solver
+Use `start_tool_cli` to expose `Setting`/task tools as a `tools` command for the
+agent in one line. It installs the CLI, starts the RPC service in the background,
+and returns once it's ready (raising if startup fails):
+```python
+from inspect_eval_utils.tool_cli import start_tool_cli
+from inspect_ai.util import sandbox
+@solver
+def setup() -> Solver:
+    async def solve(state: TaskState, generate: Generate) -> TaskState:
+        await start_tool_cli(MY_TOOLS, sandbox("default"), user="agent")
+        return state
+    return solve
+```
+The command is resolved two ways: *interactive* shells (e.g. `human_cli`) pick it
+up via a `.bashrc` alias + tab-completion; *non-interactive* shells (the model
+agent's `bash()` tool, `sandbox.exec`) find it on `PATH` at
+`/usr/local/bin/<command_name>`. Pass `on_path=False` to skip the PATH wrapper, or
+`bin_dir=...` to relocate it. `run_tool_cli_service` and `setting_tool_cli_running`
+install the PATH wrapper too (default-on).
 #### Common mistakes
 - **Listing infrastructure sandboxes as Workspaces.** Only list sandboxes the
@@ -407,6 +432,27 @@ It does NOT modify `[tool.uv.workspace].members` — that's typically a glob lik
 common surprise — the scaffolder modifies a file outside `tasks/my_eval/`, so
 review the diff before committing.
+### Generated eval-set config
+The scaffolder also writes a minimal Hawk eval-set skeleton to
+`eval_sets/<name>.eval-set.yaml` (creating `eval_sets/` if needed). This is the
+config you run a batch grid with:
+```bash
+hawk eval-set eval_sets/my_eval.eval-set.yaml
+```
+The task `package` URL is derived from the target repo's git `origin` remote and
+current branch, e.g.
+`git+ssh://git@github.com/METR/<repo>@<branch>#subdirectory=tasks/my_eval`. When
+the metadata can't be determined, a TODO marker is left in its place:
+- no `origin` remote → the whole `package` value is a `TODO:` string,
+- detached HEAD (no branch) → the ref becomes `TODO-set-ref`.
+The skeleton is intentionally minimal (one model, one solver). An existing
+`eval_sets/<name>.eval-set.yaml` is only overwritten with `--force`.
 ### How substitution works
 The scaffolder rewrites two things in the same pass:

{inspect_eval_utils-1.2.0 → inspect_eval_utils-1.3.0}/README.md RENAMED

@@ -272,6 +272,31 @@ tools call <tool-name> --json-args '{"arg": "value"}'
 The CLI keeps a short cache for list/help/completion metadata, but tool calls
 refresh the current `ToolSource` before execution.
+#### Running the tool CLI from a task setup solver
+Use `start_tool_cli` to expose `Setting`/task tools as a `tools` command for the
+agent in one line. It installs the CLI, starts the RPC service in the background,
+and returns once it's ready (raising if startup fails):
+```python
+from inspect_eval_utils.tool_cli import start_tool_cli
+from inspect_ai.util import sandbox
+@solver
+def setup() -> Solver:
+    async def solve(state: TaskState, generate: Generate) -> TaskState:
+        await start_tool_cli(MY_TOOLS, sandbox("default"), user="agent")
+        return state
+    return solve
+```
+The command is resolved two ways: *interactive* shells (e.g. `human_cli`) pick it
+up via a `.bashrc` alias + tab-completion; *non-interactive* shells (the model
+agent's `bash()` tool, `sandbox.exec`) find it on `PATH` at
+`/usr/local/bin/<command_name>`. Pass `on_path=False` to skip the PATH wrapper, or
+`bin_dir=...` to relocate it. `run_tool_cli_service` and `setting_tool_cli_running`
+install the PATH wrapper too (default-on).
 #### Common mistakes
 - **Listing infrastructure sandboxes as Workspaces.** Only list sandboxes the
@@ -382,6 +407,27 @@ It does NOT modify `[tool.uv.workspace].members` — that's typically a glob lik
 common surprise — the scaffolder modifies a file outside `tasks/my_eval/`, so
 review the diff before committing.
+### Generated eval-set config
+The scaffolder also writes a minimal Hawk eval-set skeleton to
+`eval_sets/<name>.eval-set.yaml` (creating `eval_sets/` if needed). This is the
+config you run a batch grid with:
+```bash
+hawk eval-set eval_sets/my_eval.eval-set.yaml
+```
+The task `package` URL is derived from the target repo's git `origin` remote and
+current branch, e.g.
+`git+ssh://git@github.com/METR/<repo>@<branch>#subdirectory=tasks/my_eval`. When
+the metadata can't be determined, a TODO marker is left in its place:
+- no `origin` remote → the whole `package` value is a `TODO:` string,
+- detached HEAD (no branch) → the ref becomes `TODO-set-ref`.
+The skeleton is intentionally minimal (one model, one solver). An existing
+`eval_sets/<name>.eval-set.yaml` is only overwritten with `--force`.
 ### How substitution works
 The scaffolder rewrites two things in the same pass:
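
Note on the "Generated eval-set config" section above: the following is a minimal standalone sketch of how the `package` value could be assembled from an origin URL and a branch name, following the rules the README lists. The function names `parse_remote` and `package_url` and the `example-org/example-repo` remote are illustrative assumptions, not the package's API, and the TODO text is shortened.

```python
import re

# Remote URL shapes recognized in this sketch: scp-style SSH, ssh://, and https://.
_REMOTE_PATTERNS = (
    r"^git@([^:]+):(.+)$",
    r"^ssh://git@([^/]+)/(.+)$",
    r"^https://([^/]+)/(.+)$",
)


def parse_remote(url):
    """Split a git remote URL into (host, "org/repo"), or None if unrecognized."""
    url = url.strip()
    if url.endswith(".git"):
        url = url[:-4]
    for pattern in _REMOTE_PATTERNS:
        match = re.match(pattern, url)
        if match:
            return match.group(1), match.group(2)
    return None


def package_url(origin, branch, task_name):
    """Build a git+ssh package URL, leaving TODO markers for unknown pieces."""
    parsed = parse_remote(origin) if origin else None
    if parsed is None:
        # No usable origin remote: the whole value becomes a TODO string.
        return f"TODO: set git+ssh package URL for tasks/{task_name}"
    host, path = parsed
    # Detached HEAD (no branch): only the ref slot becomes a marker.
    ref = branch or "TODO-set-ref"
    return f"git+ssh://git@{host}/{path}@{ref}#subdirectory=tasks/{task_name}"


print(package_url("git@github.com:example-org/example-repo.git", "main", "my_eval"))
print(package_url("https://github.com/example-org/example-repo", None, "my_eval"))
print(package_url(None, "main", "my_eval"))
```

Whatever the remote's own scheme, the result is always normalized to `git+ssh://git@<host>/...`, so an `https://` origin still yields an SSH package URL.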

{inspect_eval_utils-1.2.0 → inspect_eval_utils-1.3.0}/pyproject.toml RENAMED

@@ -62,6 +62,7 @@ dev = [
     "pytest-timeout>=2.3",
     "ruff>=0.11",
     "basedpyright>=1.37",
+    "pyyaml>=6.0",
 ]
 [tool.ruff]

{inspect_eval_utils-1.2.0 → inspect_eval_utils-1.3.0}/src/inspect_eval_utils/_cli.py RENAMED

@@ -49,7 +49,7 @@ def main(argv: list[str] | None = None) -> None:
     parser.add_argument(
         "--force",
         action="store_true",
-        help="Overwrite an existing tasks/<name>/",
+        help="Overwrite an existing tasks/<name>/ and eval_sets/<name>.eval-set.yaml",
     )
     args = parser.parse_args(argv)
@@ -85,6 +85,8 @@ def main(argv: list[str] | None = None) -> None:
     print(f"  cd {target_dir}")
     print("  uv sync --group tasks")
     print(f"  uv run inspect eval {snake} --model mockllm/replay --limit 1")
+    print(f"Also generated eval_sets/{snake}.eval-set.yaml (Hawk batch config).")
+    print(f"  Batch run: hawk eval-set eval_sets/{snake}.eval-set.yaml")
 if __name__ == "__main__":

inspect_eval_utils-1.3.0/src/inspect_eval_utils/report/plot.py ADDED

@@ -0,0 +1,234 @@
+"""Render the score-vs-cost matplotlib plot as PNG bytes."""
+from __future__ import annotations
+import io
+import logging
+import math
+import threading
+from collections.abc import Sequence
+from importlib.resources import files
+from inspect_eval_utils.report.cost import cumulative_cost
+from inspect_eval_utils.report.events import ReportEvent
+# Matplotlib logs "generated new fontManager" at INFO the first time its font
+# cache is built. Quiet it so eval scoring transcripts stay clean.
+logging.getLogger("matplotlib.font_manager").setLevel(logging.WARNING)
+# Color palette derived from the METR May 2026 brand guide.
+_LEAD_GREEN_500 = "#589885"
+_GREEN_700 = "#2A6912"
+_GRAY_300 = "#D9DCE2"
+_GRAY_700 = "#3D424D"
+_GRAY_800 = "#282C33"
+_GRAY_900 = "#1B1D22"
+_BUNDLED_FONT_FAMILY = ["Instrument Sans", "DejaVu Sans"]
+# Guards the one-time mutation of matplotlib's global font registry so
+# concurrent build_plot callers don't race the check-then-addfont below.
+_FONT_LOCK = threading.Lock()
+# Set once registration has succeeded; lets the common case skip the lock.
+_font_registered = False
+def _register_bundled_font() -> None:
+    """Register the vendored Instrument Sans TTF with matplotlib (best-effort).
+    Quietly returns if already registered or if the asset is missing. Uses
+    double-checked locking so that, after the one-time registration, concurrent
+    callers take the lock-free fast path instead of serializing on every render.
+    """
+    global _font_registered
+    if _font_registered:
+        return
+    from matplotlib import font_manager
+    with _FONT_LOCK:
+        if _font_registered:
+            return
+        installed = {f.name for f in font_manager.fontManager.ttflist}
+        if "Instrument Sans" not in installed:
+            try:
+                font_path = files("inspect_eval_utils.report") / "assets" / "InstrumentSans.ttf"
+                font_manager.fontManager.addfont(str(font_path))
+            except Exception:  # noqa: BLE001
+                # Asset missing or unreadable; leave the flag unset so a later
+                # call retries. Callers proceed with matplotlib's DejaVu Sans
+                # fallback in the meantime.
+                return
+        _font_registered = True
+def build_plot(
+    events: Sequence[ReportEvent],
+    *,
+    model: str,
+    title: str,
+    y_label: str,
+    line_label: str = "Best score",
+    current_score_label: str | None = None,
+    x_label_money: str = "Cumulative model cost ($)",
+    x_label_tokens: str = "Cumulative tokens (cost unavailable)",
+    marker_event_kind: str | None,
+) -> bytes:
+    """Render the score-vs-cost plot as PNG bytes.
+    The line plots best-so-far `score_update` values, starting at `(0, 0)`,
+    against cumulative model cost for `model`. If Inspect AI has no pricing for
+    the model, the x-axis falls back to cumulative token count instead.
+    `title`, `y_label`, `line_label`, `x_label_money`, and `x_label_tokens`
+    provide the plot, legend, and axis copy.
+    `marker_event_kind` selects which non-score events delimit episodic spans
+    (e.g. `"attempt_start"`); pass `None` to disable. When set, the plot area
+    is shaded into alternating background bands — one per span — so band
+    *width* visually encodes the compute spent in each span.
+    When `current_score_label` is provided, a second (non-monotonic) line is
+    drawn through the raw per-event score values and labelled accordingly in
+    the legend.
+    The bundled Instrument Sans font is registered best-effort and used with
+    DejaVu Sans as a fallback. Returns PNG bytes.
+    """
+    from matplotlib.backends.backend_agg import FigureCanvasAgg
+    from matplotlib.figure import Figure
+    from matplotlib.font_manager import FontProperties
+    _register_bundled_font()
+    font_family = _BUNDLED_FONT_FAMILY
+    has_usage = False
+    cost_available = True
+    xs_line: list[float] = [0.0]
+    ys_line: list[float] = [0.0]
+    xs_current: list[float] = [0.0]
+    ys_current: list[float] = [0.0]
+    marker_xs: list[float] = []
+    best_so_far = 0.0
+    for ev in events:
+        if ev.usage is None:
+            continue
+        has_usage = True
+        x, available = cumulative_cost(ev.usage, model)
+        cost_available = cost_available and available
+        if ev.event_type == "score_update":
+            best_so_far = max(best_so_far, ev.score)
+            xs_line.append(x)
+            ys_line.append(best_so_far)
+            xs_current.append(x)
+            ys_current.append(ev.score)
+        elif marker_event_kind is not None and ev.event_type == marker_event_kind:
+            marker_xs.append(x)
+            # Break the current-score line at episodic boundaries so it
+            # renders as separate segments per attempt instead of a vertical
+            # drop back to the new attempt's starting floor.
+            xs_current.append(x)
+            ys_current.append(float("nan"))
+    label_font = FontProperties(family=font_family, size=14)
+    title_font = FontProperties(family=font_family, size=15, weight="medium")
+    legend_font = FontProperties(family=font_family, size=11)
+    # Object-oriented (non-pyplot) API: a standalone Figure with an explicit
+    # Agg canvas keeps this function thread-safe. pyplot's global figure
+    # registry and `rc_context`'s process-wide rcParams mutation both race
+    # under concurrent calls, so we render off a local Figure and apply every
+    # style per-artist instead of via global rcParams.
+    fig = Figure(figsize=(10, 6))
+    FigureCanvasAgg(fig)  # attaches an Agg canvas (sets fig.canvas)
+    ax = fig.subplots()
+    if current_score_label is not None:
+        ax.plot(
+            xs_current,
+            ys_current,
+            "--",
+            color=_LEAD_GREEN_500,
+            linewidth=1.5,
+            label=current_score_label,
+            zorder=1,
+        )
+    ax.plot(
+        xs_line,
+        ys_line,
+        "-",
+        color=_GREEN_700,
+        linewidth=2,
+        label=line_label,
+        zorder=2,
+    )
+    if marker_xs:
+        # Render each marker_event_kind span as a background band. Band
+        # *width* encodes the compute spent in that span, so clustering
+        # naturally shows as a squeeze of narrow bands.
+        sorted_starts = sorted(marker_xs)
+        finite_xs = xs_line + [v for v in xs_current if not math.isnan(v)] + marker_xs
+        band_end = max(finite_xs) if finite_xs else 0.0
+        boundaries = sorted_starts + [band_end]
+        for k in range(len(sorted_starts)):
+            if k % 2 == 1:
+                ax.axvspan(
+                    boundaries[k],
+                    boundaries[k + 1],
+                    color=_GRAY_300,
+                    alpha=0.25,
+                    zorder=0,
+                )
+    x_label = x_label_money if (has_usage and cost_available) else x_label_tokens
+    ax.set_xlabel(x_label, color=_GRAY_800, fontproperties=label_font)
+    ax.set_ylabel(y_label, color=_GRAY_800, rotation=90, fontproperties=label_font)
+    ax.set_ylim(0, 1.05)
+    ax.set_xlim(left=0)
+    ax.spines["top"].set_visible(False)
+    ax.spines["right"].set_visible(False)
+    ax.spines["bottom"].set_color(_GRAY_700)
+    ax.spines["left"].set_color(_GRAY_700)
+    ax.spines["bottom"].set_linewidth(0.8)
+    ax.spines["left"].set_linewidth(0.8)
+    ax.tick_params(
+        colors=_GRAY_700,
+        labelsize=12,
+        width=0.5,
+        length=0,
+        labelfontfamily=font_family,
+    )
+    ax.grid(
+        True,
+        color=_GRAY_300,
+        linewidth=0.8,
+        linestyle=(0, (4, 2)),
+        zorder=0,
+    )
+    ax.set_axisbelow(True)
+    ax.set_title(title, color=_GRAY_900, fontproperties=title_font, pad=12)
+    legend = ax.legend(
+        loc="upper left",
+        frameon=True,
+        fancybox=False,
+        edgecolor=_GRAY_300,
+        framealpha=1.0,
+        borderpad=0.6,
+        prop=legend_font,
+    )
+    legend.get_frame().set_linewidth(0.5)
+    legend.get_frame().set_facecolor("white")
+    buf = io.BytesIO()
+    fig.savefig(
+        buf,
+        format="png",
+        dpi=300,
+        bbox_inches="tight",
+        facecolor="white",
+    )
+    return buf.getvalue()
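
Note on `build_plot` above: the series construction and the band selection can be followed without matplotlib. The sketch below is a simplified reading of that logic under stated assumptions: `Event` is a hypothetical stand-in for `ReportEvent` that already carries its cumulative x value, and `build_series` and `shaded_bands` are illustrative names, not functions from the package.

```python
import math
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical stand-in for ReportEvent with a precomputed cumulative x."""
    kind: str
    x: float
    score: float = 0.0


def build_series(events, marker_kind):
    """Return (best-so-far line, raw current-score line, marker x positions)."""
    xs_best, ys_best = [0.0], [0.0]
    xs_cur, ys_cur = [0.0], [0.0]
    markers = []
    best = 0.0
    for ev in events:
        if ev.kind == "score_update":
            best = max(best, ev.score)  # monotonic running maximum
            xs_best.append(ev.x)
            ys_best.append(best)
            xs_cur.append(ev.x)
            ys_cur.append(ev.score)
        elif marker_kind is not None and ev.kind == marker_kind:
            markers.append(ev.x)
            # A NaN y value breaks the current-score line at the boundary.
            xs_cur.append(ev.x)
            ys_cur.append(math.nan)
    return (xs_best, ys_best), (xs_cur, ys_cur), markers


def shaded_bands(markers, x_max):
    """Pick every second span (odd index) between sorted markers for shading."""
    starts = sorted(markers)
    bounds = starts + [x_max]
    return [(bounds[k], bounds[k + 1]) for k in range(len(starts)) if k % 2 == 1]


events = [
    Event("attempt_start", 0.0),
    Event("score_update", 1.0, 0.4),
    Event("attempt_start", 2.0),
    Event("score_update", 3.0, 0.2),
    Event("attempt_start", 4.0),
    Event("score_update", 5.0, 0.7),
]
(best_xs, best_ys), (cur_xs, cur_ys), marks = build_series(events, "attempt_start")
bands = shaded_bands(marks, x_max=5.0)
print(best_ys)  # [0.0, 0.4, 0.4, 0.7]
print(bands)    # [(2.0, 4.0)]
```

Only odd-indexed spans are shaded, so adjacent attempts alternate between the plain background and a gray band.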

{inspect_eval_utils-1.2.0 → inspect_eval_utils-1.3.0}/src/inspect_eval_utils/scaffolder.py RENAMED

@@ -308,6 +308,132 @@ def render_readme(*, snake: str, description: str) -> str:
     return README_TEMPLATE.format(snake=snake, description=description)
+EVAL_SET_TEMPLATE = """\
+name: {name}
+tasks:
+  - package: "{package_url}"
+    name: {namespace}
+    items:
+      - name: {name}
+        args: []
+epochs: 4
+token_limit: 40000000
+models:
+  - package: anthropic
+    name: anthropic
+    items:
+      - name: claude-opus-4-5-20251101
+        args:
+          config:
+            max_tokens: 32000
+            reasoning_tokens: 16000
+            max_connections: 60
+solvers:
+  - package: "git+https://github.com/METR/inspect-agents@metr_agents/v0.3.5#subdirectory=packages/agents"
+    name: metr_agents
+    items:
+      - name: react
+        args:
+          tools:
+            required:
+              - inspect_ai/bash
+              - metr_agents/set_timeout
+            optional:
+              - inspect_ai/python
+          truncation: disabled
+          compaction: CompactionSummary
+          compaction_threshold: 0.75
+"""
+def render_eval_set(*, name: str, namespace: str, package_url: str) -> str:
+    """Render a minimal Hawk eval-set skeleton for a scaffolded task."""
+    return EVAL_SET_TEMPLATE.format(name=name, namespace=namespace, package_url=package_url)
+def _read_origin_url(git_dir: Path) -> str | None:
+    """Return the `[remote "origin"] url` value from a .git/config, or None.
+    Hand-parsed rather than via configparser: git indents entries with tabs,
+    which configparser misreads as multi-line value continuations.
+    """
+    config_path = git_dir / "config"
+    if not config_path.is_file():
+        return None
+    try:
+        lines = config_path.read_text().splitlines()
+    except (OSError, UnicodeDecodeError):
+        return None
+    in_origin = False
+    for line in lines:
+        stripped = line.strip()
+        if stripped.startswith("[") and stripped.endswith("]"):
+            in_origin = stripped.replace(" ", "") == '[remote"origin"]'
+            continue
+        if in_origin and "=" in stripped:
+            key, _, value = stripped.partition("=")
+            if key.strip() == "url":
+                return value.strip()
+    return None
+def _read_current_branch(git_dir: Path) -> str | None:
+    """Return the current branch name from .git/HEAD, or None if detached/missing."""
+    head_path = git_dir / "HEAD"
+    if not head_path.is_file():
+        return None
+    try:
+        content = head_path.read_text().strip()
+    except (OSError, UnicodeDecodeError):
+        return None
+    prefix = "ref: refs/heads/"
+    if content.startswith(prefix):
+        return content[len(prefix) :]
+    return None
+def _parse_remote_url(url: str) -> tuple[str, str] | None:
+    """Parse a git remote URL into (host, 'org/repo'). None if unrecognized."""
+    url = url.strip()
+    if url.endswith(".git"):
+        url = url[:-4]
+    for pattern in (
+        r"^git@([^:]+):(.+)$",
+        r"^ssh://git@([^/]+)/(.+)$",
+        r"^https://([^/]+)/(.+)$",
+    ):
+        m = re.match(pattern, url)
+        if m:
+            return m.group(1), m.group(2)
+    return None
+def derive_package_url(target_dir: Path, task_name: str) -> str:
+    """Build the eval-set task package URL from the target repo's git metadata.
+    Returns a `git+ssh://...#subdirectory=tasks/<task_name>` URL. Any piece that
+    cannot be determined is filled with a TODO marker so the result is never
+    silently wrong:
+      - no readable origin remote -> the whole value is a TODO string
+      - detached HEAD (no branch)  -> the ref slot becomes `TODO-set-ref`
+    """
+    git_dir = target_dir / ".git"
+    url = _read_origin_url(git_dir)
+    parsed = _parse_remote_url(url) if url else None
+    if parsed is None:
+        return (
+            "TODO: set git+ssh package URL, e.g. "
+            f"git+ssh://git@github.com/<org>/<repo>@<branch>"
+            f"#subdirectory=tasks/{task_name}"
+        )
+    host, path = parsed
+    branch = _read_current_branch(git_dir) or "TODO-set-ref"
+    return f"git+ssh://git@{host}/{path}@{branch}#subdirectory=tasks/{task_name}"
 def edit_root_pyproject(src: str, *, target_pkg_name: str, new_task_dir_name: str) -> str:
     """Add the new task to dependency-groups.tasks and tool.uv.sources, and
     ensure [tool.uv.workspace].members covers tasks/<new_task_dir_name>.
@@ -461,6 +587,12 @@ def scaffold_into(
         new_task_dir_name=target.new_task_name,
     )
+    # Validate the eval-set destination up front too, so a conflict aborts
+    # before any file writes (mirrors the dest_root / root-pyproject checks).
+    eval_set_path = target_dir / "eval_sets" / f"{target.new_task_name}.eval-set.yaml"
+    if eval_set_path.exists() and not force:
+        sys.exit(f"{eval_set_path} already exists (use --force to overwrite)")
     if dest_root.exists():
         if not force:
             sys.exit(f"{dest_root} already exists (use --force to overwrite)")
@@ -518,5 +650,15 @@ def scaffold_into(
     # Write the (already-validated) edited root pyproject.toml.
     root_pyproject.write_text(new_root_pyproject)
+    # Generated eval-set skeleton at the repo root (not inside tasks/<name>/).
+    eval_set_path.parent.mkdir(parents=True, exist_ok=True)
+    eval_set_path.write_text(
+        render_eval_set(
+            name=target.new_task_name,
+            namespace=target.namespace,
+            package_url=derive_package_url(target_dir, target.new_task_name),
+        )
+    )
     # Audit.
     audit_generated_tree(dest_root, source=source)
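
Note on `_read_origin_url` and `_read_current_branch` above: both read git metadata as plain text rather than shelling out to `git`. A minimal sketch of the same line-oriented approach, run against in-memory strings instead of files, is shown here. The sample config, the `example-org/example-repo` remote, and the names `origin_url` and `current_branch` are assumptions for illustration only.

```python
# Tab-indented sample in the layout git writes to .git/config (assumed content).
SAMPLE_CONFIG = (
    "[core]\n"
    "\trepositoryformatversion = 0\n"
    '[remote "origin"]\n'
    "\turl = git@github.com:example-org/example-repo.git\n"
    "\tfetch = +refs/heads/*:refs/remotes/origin/*\n"
    '[branch "main"]\n'
    "\tremote = origin\n"
)


def origin_url(config_text):
    """Return the url under [remote "origin"], or None if absent."""
    in_origin = False
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Section header: track whether we are inside [remote "origin"].
            in_origin = stripped.replace(" ", "") == '[remote"origin"]'
            continue
        if in_origin and "=" in stripped:
            key, _, value = stripped.partition("=")
            if key.strip() == "url":
                return value.strip()
    return None


def current_branch(head_text):
    """Return the branch named in .git/HEAD text, or None for a detached HEAD."""
    prefix = "ref: refs/heads/"
    content = head_text.strip()
    return content[len(prefix):] if content.startswith(prefix) else None


print(origin_url(SAMPLE_CONFIG))
print(current_branch("ref: refs/heads/main\n"))
print(current_branch("0123456789abcdef0123456789abcdef01234567\n"))
```

A detached HEAD stores a bare commit hash, so the prefix check fails and the caller can substitute its `TODO-set-ref` marker.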

{inspect_eval_utils-1.2.0 → inspect_eval_utils-1.3.0}/src/inspect_eval_utils/tool_cli/__init__.py RENAMED

@@ -7,13 +7,19 @@ in the sandbox shell.
 """
 from inspect_eval_utils.tool_cli._mechanism import (
+    generate_tool_cli_script,
     install_tool_cli,
     run_tool_cli_service,
+    start_tool_cli,
+    tool_cli_service_methods,
 )
 from inspect_eval_utils.tool_cli._setting import setting_tool_cli_running
 __all__ = [
+    "generate_tool_cli_script",
     "install_tool_cli",
     "run_tool_cli_service",
     "setting_tool_cli_running",
+    "start_tool_cli",
+    "tool_cli_service_methods",
 ]

{inspect_eval_utils-1.2.0 → inspect_eval_utils-1.3.0}/src/inspect_eval_utils/tool_cli/_mechanism.py RENAMED

@@ -5,6 +5,7 @@ with an RPC bridge back to the host for actual tool execution.
 """
 import json
+import logging
 import re
 import shlex
 import time
@@ -16,10 +17,19 @@ import anyio
 from inspect_ai.model import ChatMessage, ChatMessageAssistant, ChatMessageTool, execute_tools
 from inspect_ai.tool import Tool, ToolCall, ToolDef, ToolSource
 from inspect_ai.tool._tool_def import tool_defs
-from inspect_ai.util import SandboxEnvironment, sandbox_service
+from inspect_ai.util import (
+    SandboxEnvironment,
+    background,
+    sandbox_service,
+)
+from inspect_ai.util import (
+    sandbox as _get_sandbox,
+)
 from inspect_ai.util._sandbox.service import SandboxServiceMethod
 from pydantic import JsonValue
+logger = logging.getLogger(__name__)
 class _ToolCliResolver:
     def __init__(
@@ -62,6 +72,8 @@ async def install_tool_cli(
     service_name: str = "tool_cli",
     install_dir: str = "/opt/tool_cli",
     user: str | None = None,
+    on_path: bool = True,
+    bin_dir: str = "/usr/local/bin",
 ) -> dict[str, SandboxServiceMethod]:
     """Generate a CLI script, install it into a sandbox, and return service methods.
@@ -75,6 +87,9 @@ async def install_tool_cli(
         service_name: Name for the sandbox service (used for RPC).
         install_dir: Directory in the sandbox to install the CLI script.
         user: Sandbox user to install as.
+        on_path: Install a wrapper for the command in ``bin_dir`` so it resolves on
+            PATH for non-interactive shells (e.g. the agent's bash() tool).
+        bin_dir: Directory on PATH to install the wrapper into.
     Returns:
         A dict of service methods to pass to ``sandbox_service()``.
@@ -89,6 +104,8 @@ async def install_tool_cli(
         command_name=command_name,
         install_dir=install_dir,
         user=user,
+        on_path=on_path,
+        bin_dir=bin_dir,
     )
     return methods
@@ -103,6 +120,8 @@ async def run_tool_cli_service(
     service_name: str = "tool_cli",
     install_dir: str = "/opt/tool_cli",
     user: str | None = None,
+    on_path: bool = True,
+    bin_dir: str = "/usr/local/bin",
     polling_interval: float | None = None,
     started: anyio.Event | None = None,
 ) -> None:
@@ -118,6 +137,9 @@ async def run_tool_cli_service(
         service_name: Name for the sandbox service (used for RPC).
         install_dir: Directory in the sandbox to install the CLI script.
         user: Sandbox user to install as.
+        on_path: Install a wrapper for the command in ``bin_dir`` so it resolves on
+            PATH for non-interactive shells (e.g. the agent's bash() tool).
+        bin_dir: Directory on PATH to install the wrapper into.
         polling_interval: Polling interval for RPC request checking.
         started: Event set once the sandbox service is ready.
     """
@@ -128,6 +150,8 @@ async def run_tool_cli_service(
         service_name=service_name,
         install_dir=install_dir,
         user=user,
+        on_path=on_path,
+        bin_dir=bin_dir,
     )
     await sandbox_service(
         service_name,
@@ -140,6 +164,98 @@ async def run_tool_cli_service(
     )
+async def start_tool_cli(
+    tools: Sequence[Tool | ToolDef | ToolSource],
+    sandbox: SandboxEnvironment | None = None,
+    *,
+    command_name: str = "tools",
+    service_name: str = "tool_cli",
+    install_dir: str = "/opt/tool_cli",
+    user: str | None = None,
+    on_path: bool = True,
+    bin_dir: str = "/usr/local/bin",
+    polling_interval: float | None = None,
+) -> None:
+    """Install the tool CLI and run its sandbox service in the background.
+    Fire-and-forget helper for task **setup solvers**: it installs the CLI in the
+    foreground (so install errors propagate to you), starts the RPC service in the
+    background, and returns once the service is ready. The service then runs until
+    the sample ends. By default the command is exposed on PATH (see ``on_path``) so
+    the model agent's non-interactive ``bash()`` tool can run it.
+    Unlike a bare ``background(run_tool_cli_service(...))`` + ``started.wait()``,
+    this surfaces startup failures as an exception instead of hanging.
+    Args:
+        tools: Tools to expose as CLI commands.
+        sandbox: Sandbox to install into. Defaults to ``sandbox("default")``.
+        command_name: Name of the CLI command (and the PATH wrapper).
+        service_name: Sandbox-service name used for RPC.
+        install_dir: Directory in the sandbox to install the CLI script.
+        user: Sandbox user the service runs as (e.g. the agent's user).
+        on_path: Expose ``command_name`` on PATH (default True).
+        bin_dir: Directory on PATH for the wrapper.
+        polling_interval: RPC request polling interval.
+    Example:
+        ```python
+        @solver
+        def setup() -> Solver:
+            async def solve(state: TaskState, generate: Generate) -> TaskState:
+                await start_tool_cli(MY_TOOLS, sandbox("default"), user="agent")
+                return state
+            return solve
+        ```
+    """
+    sbx = sandbox if sandbox is not None else _get_sandbox("default")
+    # Foreground: install errors propagate to the caller (no deadlock).
+    methods = await install_tool_cli(
+        tools,
+        sbx,
+        command_name=command_name,
+        service_name=service_name,
+        install_dir=install_dir,
+        user=user,
+        on_path=on_path,
+        bin_dir=bin_dir,
+    )
+    started = anyio.Event()
+    startup_error: dict[str, BaseException] = {}
+    async def _serve() -> None:
+        try:
+            await sandbox_service(
+                service_name,
+                methods,
+                lambda: False,  # run for the lifetime of the sample
+                sbx,
+                user=user,
+                polling_interval=polling_interval,
+                started=started,
+            )
+        except anyio.get_cancelled_exc_class():
+            raise
+        except BaseException as exc:  # noqa: BLE001 - re-raised on the caller's task
+            if not started.is_set():
+                # Startup failure: record it and unblock the waiter so the caller
+                # raises a clean error instead of hanging on started.wait().
+                startup_error["error"] = exc
+                started.set()
+            else:
+                # Failure after startup: let background() log/propagate it.
+                raise
+    background(_serve)
+    await started.wait()
+    if "error" in startup_error:
+        raise RuntimeError(f"tool_cli service {service_name!r} failed to start") from startup_error[
+            "error"
+        ]
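
Note on `start_tool_cli` above: the key idea is that a failure before the `started` event is set gets recorded and then unblocks the waiter, so the caller raises instead of hanging. The sketch below reproduces that pattern with the standard library's `asyncio` rather than `anyio` and Inspect AI's `background()`, so it is an analogy, not the package's implementation. `start_in_background`, `good_service`, and `bad_service` are hypothetical names.

```python
import asyncio


async def start_in_background(serve, *, name="service"):
    """Start `serve(started)` as a task; return once ready, raise if startup fails."""
    started = asyncio.Event()
    startup_error = {}

    async def runner():
        try:
            await serve(started)
        except asyncio.CancelledError:
            raise  # never swallow cancellation
        except Exception as exc:
            if not started.is_set():
                # Startup failure: record it and unblock the waiter below.
                startup_error["error"] = exc
                started.set()
            else:
                raise  # failure after startup stays with the task

    task = asyncio.create_task(runner())
    await started.wait()
    if "error" in startup_error:
        raise RuntimeError(f"{name!r} failed to start") from startup_error["error"]
    return task


async def good_service(started):
    started.set()  # signal readiness, then keep serving
    await asyncio.sleep(3600)


async def bad_service(started):
    raise OSError("port unavailable")  # fails before signalling readiness


async def demo():
    task = await start_in_background(good_service, name="ok")
    task.cancel()
    try:
        await start_in_background(bad_service, name="broken")
    except RuntimeError as exc:
        return type(exc.__cause__).__name__
    return None


print(asyncio.run(demo()))  # OSError
```

Without the `started.set()` in the error branch, the second call would wait on the event forever, which is the hang the docstring above describes.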
 def generate_tool_cli_script(service_name: str = "tool_cli") -> str:
     """Generate a Python CLI script that calls tools via sandbox service RPC.
@@ -211,7 +327,7 @@ def _add_dynamic_arg(parser, name, param, required):
             parser.add_argument(flag, dest=dest, nargs="?", const=True, default=None, type=_parse_bool, help=description)
         return
     if type_str in ("array", "object"):
-        parser.add_argument(_flag_name(name), dest=dest, type=str, required=required, default=None if not required else None, help=description)
+        parser.add_argument(_flag_name(name), dest=dest, type=str, required=required, default=None, help=description)
         return
     type_map = {{"string": str, "integer": int, "number": float}}
     py_type = type_map.get(type_str or "string", str)
@@ -277,15 +393,20 @@ def _required_bool_names(tool):
 def _call_rpc(method, *args, **kwargs):
+    # The RPC client is keyword-only after `method`; pass args by parameter name.
     try:
         if method == "list_tools":
             return call_{service_name}('list_tools')
         if method == "describe_tool":
-            return call_{service_name}('describe_tool', *args, **kwargs)
+            return call_{service_name}('describe_tool', tool_name=args[0])
         if method == "describe_tool_for_call":
-            return call_{service_name}('describe_tool_for_call', *args, **kwargs)
+            return call_{service_name}('describe_tool_for_call', tool_name=args[0])
         if method == "call_tool":
-            return call_{service_name}('call_tool', *args, **kwargs)
+            if len(args) > 2:
+                return call_{service_name}(
+                    'call_tool', tool_name=args[0], arguments=args[1], snapshot_token=args[2]
+                )
+            return call_{service_name}('call_tool', tool_name=args[0], arguments=args[1])
         return call_{service_name}(method, *args, **kwargs)
     except Exception as exc:
         print(str(exc), file=sys.stderr)
@@ -472,6 +593,30 @@ def _check_duplicate_tool_names(tool_defs_list: Sequence[ToolDef]) -> None:
         raise ValueError(f"Duplicate tool names: {names}")
+class _SnapshotStore:
+    """Bounded token->snapshot store; evicts oldest entries past ``max_size``.
+    Guards against unbounded growth when a CLI ``call`` is abandoned between
+    ``describe_tool_for_call`` (which stores a snapshot) and ``call_tool``
+    (which pops it).
+    """
+    def __init__(self, max_size: int = 128) -> None:
+        self._max = max_size
+        self._data: dict[str, list[ToolDef]] = {}
+    def put(self, token: str, value: list[ToolDef]) -> None:
+        self._data[token] = value
+        while len(self._data) > self._max:
+            del self._data[next(iter(self._data))]  # dicts preserve insertion order
+    def pop(self, token: str) -> list[ToolDef] | None:
+        return self._data.pop(token, None)
+    def __len__(self) -> int:
+        return len(self._data)
 def tool_cli_service_methods(
     tools: Sequence[Tool | ToolDef | ToolSource],
     *,
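
Note on `_SnapshotStore` above: eviction relies on Python dicts preserving insertion order, so the first key returned by `iter()` is the oldest entry. A small standalone sketch of the same idea, using the hypothetical name `BoundedStore` and a tiny `max_size` so the eviction is visible:

```python
class BoundedStore:
    """Token -> value store that evicts the oldest entries past max_size."""

    def __init__(self, max_size=128):
        self._max = max_size
        self._data = {}

    def put(self, token, value):
        self._data[token] = value
        while len(self._data) > self._max:
            # Dicts preserve insertion order, so the first key is the oldest.
            del self._data[next(iter(self._data))]

    def pop(self, token):
        # Missing or already-evicted tokens yield None instead of raising.
        return self._data.pop(token, None)

    def __len__(self):
        return len(self._data)


store = BoundedStore(max_size=2)
store.put("a", 1)
store.put("b", 2)
store.put("c", 3)  # exceeds max_size, so "a" is evicted
print(len(store))  # 2
```

Overwriting an existing key would keep its original position in the dict. That case does not arise in the diff above because each token is a fresh `uuid4().hex`.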
@@ -487,7 +632,7 @@ def tool_cli_service_methods(
         A dict mapping method names to async handler functions.
     """
     resolver = _ToolCliResolver(tools, cache_ttl=cache_ttl)
-    call_snapshots: dict[str, list[ToolDef]] = {}
+    call_snapshots = _SnapshotStore()
     async def list_tools() -> JsonValue:
         resolved = await resolver.resolve(use_cache=True)
@@ -509,7 +654,7 @@ def tool_cli_service_methods(
         if td is None:
             raise ValueError(f"Unknown tool: {tool_name}")
         snapshot_token = uuid4().hex
-        call_snapshots[snapshot_token] = resolved
+        call_snapshots.put(snapshot_token, resolved)
         description = _tool_description(td)
         description["_call_snapshot"] = snapshot_token
         return description
@@ -522,7 +667,7 @@ def tool_cli_service_methods(
         if snapshot_token is None:
             resolved = await resolver.resolve(use_cache=False)
         else:
-            resolved = call_snapshots.pop(snapshot_token, None)
+            resolved = call_snapshots.pop(snapshot_token)
             if resolved is None:
                 resolved = await resolver.resolve(use_cache=False)
         tools_by_name = _tools_by_name(resolved)
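
Note on the hunks above: the plain `call_snapshots` dict is replaced by a bounded `_SnapshotStore`, so snapshots left behind by abandoned calls can no longer accumulate. Below is a minimal standalone sketch of the same eviction idea. It assumes string values in place of `list[ToolDef]`, and the class name `BoundedStore` is hypothetical, not part of the package.

```python
from __future__ import annotations


class BoundedStore:
    """Token -> value store that evicts the oldest entries past max_size."""

    def __init__(self, max_size: int = 128) -> None:
        self._max = max_size
        self._data: dict[str, str] = {}

    def put(self, token: str, value: str) -> None:
        self._data[token] = value
        # dicts preserve insertion order, so the first key is the oldest.
        while len(self._data) > self._max:
            del self._data[next(iter(self._data))]

    def pop(self, token: str) -> str | None:
        # Returns None for unknown or already-evicted tokens.
        return self._data.pop(token, None)

    def __len__(self) -> int:
        return len(self._data)


store = BoundedStore(max_size=2)
store.put("a", "snapshot-a")
store.put("b", "snapshot-b")
store.put("c", "snapshot-c")  # exceeds max_size, so "a" is evicted

evicted = store.pop("a")  # no longer present
kept = store.pop("c")     # still present
```

Overwriting an existing token would keep its original position in the dict. That should not matter in the diff above, since each token is a fresh `uuid4().hex`. A caller that gets `None` back from `pop` falls back to a fresh resolve, as `call_tool` does.
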
@@ -633,10 +778,18 @@ async def _install_script(
     command_name: str,
     install_dir: str,
     user: str | None,
+    on_path: bool = True,
+    bin_dir: str = "/usr/local/bin",
 ) -> None:
     """Install the CLI script into the sandbox."""
     _validate_command_name(command_name)
+    # Validate python3 before any writes so a missing interpreter fails cleanly
+    # (the CLI script and PATH wrapper both invoke python3).
+    python_check = await sandbox.exec(["sh", "-c", "command -v python3"], user=user)
+    if not python_check.success:
+        raise RuntimeError("tool_cli requires python3 in the sandbox but none was found on PATH.")
     # create install dir
     await _checked_exec(sandbox, ["mkdir", "-p", install_dir], user="root")
     if user and user != "root":
@@ -648,55 +801,78 @@ async def _install_script(
     await _checked_exec(sandbox, ["tee", "--", script_path], input=script, user=user)
     await _checked_exec(sandbox, ["chmod", "+x", script_path], user=user)
-    # determine user's home directory for .bashrc
-    if user:
-        result = await sandbox.exec(["getent", "passwd", user], user=user)
-        if result.success and result.stdout.strip():
-            fields = result.stdout.strip().split(":")
-            home_dir = fields[5] if len(fields) > 5 and fields[5] else f"/home/{user}"
+    # Expose the command on PATH so non-interactive shells (the model agent's
+    # bash() tool) can find it; the .bashrc alias only helps interactive shells.
+    # Written as root because /usr/local/bin is not writable by the agent user.
+    if on_path:
+        wrapper_path = f"{bin_dir}/{command_name}"
+        wrapper = f'#!/bin/sh\nexec python3 {shlex.quote(script_path)} "$@"\n'
+        await _checked_exec(sandbox, ["mkdir", "-p", bin_dir], user="root")
+        await _checked_exec(sandbox, ["tee", "--", wrapper_path], input=wrapper, user="root")
+        await _checked_exec(sandbox, ["chmod", "+x", wrapper_path], user="root")
+    # Interactive shell alias + tab completion (best-effort: only benefits the
+    # interactive human_cli shell; the PATH wrapper is what model agents use).
+    try:
+        # determine user's home directory for .bashrc
+        if user:
+            result = await sandbox.exec(["getent", "passwd", user], user=user)
+            if result.success and result.stdout.strip():
+                fields = result.stdout.strip().split(":")
+                home_dir = fields[5] if len(fields) > 5 and fields[5] else f"/home/{user}"
+            else:
+                home_dir = f"/home/{user}"
         else:
-            home_dir = f"/home/{user}"
-    else:
-        result = await sandbox.exec(["bash", "-c", "echo $HOME"], user=user)
-        home_dir = result.stdout.strip() if result.success and result.stdout.strip() else "/root"
-    # build bash alias and tab completion
-    shell_setup_path = f"{home_dir}/.tool_cli_bashrc"
-    shell_setup_source = (
-        f"[ -f {shlex.quote(shell_setup_path)} ] && . {shlex.quote(shell_setup_path)}"
-    )
-    bashrc_addition = dedent(f"""
-        # Tool CLI alias and completion
-        alias {command_name}={shlex.quote(f"python3 {script_path}")}
-        _{command_name}_completion() {{
-            local cur candidate
-            cur="${{COMP_WORDS[COMP_CWORD]}}"
-            COMPREPLY=()
-            while IFS= read -r candidate; do
-                [[ $candidate == "$cur"* ]] && COMPREPLY+=("$candidate")
-            done < <(python3 {shlex.quote(script_path)} __complete "$COMP_CWORD" "${{COMP_WORDS[@]}}" 2>/dev/null)
-        }}
-        complete -F _{command_name}_completion {command_name}
-    """)
-    await _checked_exec(
-        sandbox,
-        ["tee", "--", shell_setup_path],
-        input=bashrc_addition,
-        user=user,
-    )
+            result = await sandbox.exec(["bash", "-c", "echo $HOME"], user=user)
+            home_dir = (
+                result.stdout.strip() if result.success and result.stdout.strip() else "/root"
+            )
+        # build bash alias and tab completion
+        shell_setup_path = f"{home_dir}/.tool_cli_bashrc"
+        shell_setup_source = (
+            f"[ -f {shlex.quote(shell_setup_path)} ] && . {shlex.quote(shell_setup_path)}"
+        )
+        bashrc_addition = dedent(f"""
+            # Tool CLI alias and completion
+            alias {command_name}={shlex.quote(f"python3 {script_path}")}
+            _{command_name}_completion() {{
+                local cur candidate
+                cur="${{COMP_WORDS[COMP_CWORD]}}"
+                COMPREPLY=()
+                while IFS= read -r candidate; do
+                    [[ $candidate == "$cur"* ]] && COMPREPLY+=("$candidate")
+                done < <(python3 {shlex.quote(script_path)} __complete "$COMP_CWORD" "${{COMP_WORDS[@]}}" 2>/dev/null)
+            }}
+            complete -F _{command_name}_completion {command_name}
+        """)
-    bashrc_path = f"{home_dir}/.bashrc"
-    result = await sandbox.exec(["grep", "-qxF", shell_setup_source, bashrc_path], user=user)
-    if not result.success:
         await _checked_exec(
             sandbox,
-            ["tee", "-a", bashrc_path],
-            input=f"\n{shell_setup_source}\n",
+            ["tee", "--", shell_setup_path],
+            input=bashrc_addition,
             user=user,
         )
+        bashrc_path = f"{home_dir}/.bashrc"
+        result = await sandbox.exec(["grep", "-qxF", shell_setup_source, bashrc_path], user=user)
+        if not result.success:
+            await _checked_exec(
+                sandbox,
+                ["tee", "-a", bashrc_path],
+                input=f"\n{shell_setup_source}\n",
+                user=user,
+            )
+    except Exception as exc:  # noqa: BLE001 - alias is best-effort
+        logger.warning(
+            "tool_cli: could not install the interactive shell alias (%s); "
+            "the %r command is still available on PATH.",
+            exc,
+            command_name,
+            exc_info=True,
+        )
 async def _checked_exec(
     sandbox: SandboxEnvironment,
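
Note on the `on_path` branch above: it writes a small `sh` wrapper into `bin_dir` so that non-interactive shells can find the command without the `.bashrc` alias. Below is a rough sketch of how that wrapper text is assembled. The helper name `build_wrapper` and both example paths are hypothetical and used only for illustration.

```python
import shlex


def build_wrapper(script_path: str) -> str:
    """Return an sh wrapper that forwards all arguments to the CLI script."""
    # shlex.quote leaves simple paths untouched and single-quotes anything
    # containing spaces or shell metacharacters.
    return f'#!/bin/sh\nexec python3 {shlex.quote(script_path)} "$@"\n'


plain = build_wrapper("/opt/tool_cli/tools.py")
spaced = build_wrapper("/opt/tool cli/tools.py")

print(plain, end="")
print(spaced, end="")
```

Using `exec` means the wrapper shell is replaced by the Python process, so the exit code and signals pass straight through to the caller.
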

inspect_eval_utils-1.2.0/src/inspect_eval_utils/report/plot.py DELETED

@@ -1,219 +0,0 @@
-# Matplotlib's API is partially untyped; these suppressions apply only to
-# build_plot below.
-# pyright: reportUnknownMemberType=false
-# pyright: reportUnknownVariableType=false
-"""Render the score-vs-cost matplotlib plot as PNG bytes."""
-from __future__ import annotations
-import io
-import logging
-import math
-from collections.abc import Sequence
-from importlib.resources import files
-from inspect_eval_utils.report.cost import cumulative_cost
-from inspect_eval_utils.report.events import ReportEvent
-# Matplotlib logs "generated new fontManager" at INFO the first time its font
-# cache is built. Quiet it so eval scoring transcripts stay clean.
-logging.getLogger("matplotlib.font_manager").setLevel(logging.WARNING)
-# Color palette derived from the METR May 2026 brand guide.
-_LEAD_GREEN_500 = "#589885"
-_GREEN_700 = "#2A6912"
-_GRAY_300 = "#D9DCE2"
-_GRAY_700 = "#3D424D"
-_GRAY_800 = "#282C33"
-_GRAY_900 = "#1B1D22"
-_BUNDLED_FONT_FAMILY = ["Instrument Sans", "DejaVu Sans"]
-def _register_bundled_font() -> None:
-    """Register the vendored Instrument Sans TTF with matplotlib (best-effort).
-    Quietly returns if already registered or if the asset is missing.
-    """
-    from matplotlib import font_manager
-    installed = {f.name for f in font_manager.fontManager.ttflist}
-    if "Instrument Sans" in installed:
-        return
-    try:
-        font_path = files("inspect_eval_utils.report") / "assets" / "InstrumentSans.ttf"
-        font_manager.fontManager.addfont(str(font_path))
-    except Exception:  # noqa: BLE001
-        # Asset missing or unreadable; caller can still proceed with the
-        # DejaVu Sans fallback that matplotlib supplies.
-        return
-def build_plot(
-    events: Sequence[ReportEvent],
-    *,
-    model: str,
-    title: str,
-    y_label: str,
-    line_label: str = "Best score",
-    current_score_label: str | None = None,
-    x_label_money: str = "Cumulative model cost ($)",
-    x_label_tokens: str = "Cumulative tokens (cost unavailable)",
-    marker_event_kind: str | None,
-) -> bytes:
-    """Render the score-vs-cost plot as PNG bytes.
-    The line plots best-so-far `score_update` values, starting at `(0, 0)`,
-    against cumulative model cost for `model`. If Inspect AI has no pricing for
-    the model, the x-axis falls back to cumulative token count instead.
-    `title`, `y_label`, `line_label`, `x_label_money`, and `x_label_tokens`
-    provide the plot, legend, and axis copy.
-    `marker_event_kind` selects which non-score events delimit episodic spans
-    (e.g. `"attempt_start"`); pass `None` to disable. When set, the plot area
-    is shaded into alternating background bands — one per span — so band
-    *width* visually encodes the compute spent in each span.
-    When `current_score_label` is provided, a second (non-monotonic) line is
-    drawn through the raw per-event score values and labelled accordingly in
-    the legend.
-    The bundled Instrument Sans font is registered best-effort and used with
-    DejaVu Sans as a fallback. Returns PNG bytes.
-    """
-    import matplotlib
-    matplotlib.use("Agg")
-    import matplotlib.pyplot as plt
-    _register_bundled_font()
-    font_family = _BUNDLED_FONT_FAMILY
-    has_usage = False
-    cost_available = True
-    xs_line: list[float] = [0.0]
-    ys_line: list[float] = [0.0]
-    xs_current: list[float] = [0.0]
-    ys_current: list[float] = [0.0]
-    marker_xs: list[float] = []
-    best_so_far = 0.0
-    for ev in events:
-        if ev.usage is None:
-            continue
-        has_usage = True
-        x, available = cumulative_cost(ev.usage, model)
-        cost_available = cost_available and available
-        if ev.event_type == "score_update":
-            best_so_far = max(best_so_far, ev.score)
-            xs_line.append(x)
-            ys_line.append(best_so_far)
-            xs_current.append(x)
-            ys_current.append(ev.score)
-        elif marker_event_kind is not None and ev.event_type == marker_event_kind:
-            marker_xs.append(x)
-            # Break the current-score line at episodic boundaries so it
-            # renders as separate segments per attempt instead of a vertical
-            # drop back to the new attempt's starting floor.
-            xs_current.append(x)
-            ys_current.append(float("nan"))
-    rc_overrides = {
-        "font.family": font_family,
-        "font.size": 13,
-        "axes.labelsize": 14,
-        "axes.titlesize": 15,
-        "xtick.labelsize": 12,
-        "ytick.labelsize": 12,
-        "legend.fontsize": 11,
-        "axes.linewidth": 0.8,
-        "xtick.major.width": 0.5,
-        "ytick.major.width": 0.5,
-        "xtick.major.size": 0,
-        "ytick.major.size": 0,
-    }
-    with plt.rc_context(rc_overrides):
-        fig, ax = plt.subplots(figsize=(10, 6))
-        if current_score_label is not None:
-            ax.plot(
-                xs_current,
-                ys_current,
-                "--",
-                color=_LEAD_GREEN_500,
-                linewidth=1.5,
-                label=current_score_label,
-                zorder=1,
-            )
-        ax.plot(
-            xs_line,
-            ys_line,
-            "-",
-            color=_GREEN_700,
-            linewidth=2,
-            label=line_label,
-            zorder=2,
-        )
-        if marker_xs:
-            # Render each marker_event_kind span as a background band. Band
-            # *width* encodes the compute spent in that span, so clustering
-            # naturally shows as a squeeze of narrow bands.
-            sorted_starts = sorted(marker_xs)
-            finite_xs = xs_line + [v for v in xs_current if not math.isnan(v)] + marker_xs
-            band_end = max(finite_xs) if finite_xs else 0.0
-            boundaries = sorted_starts + [band_end]
-            for k in range(len(sorted_starts)):
-                if k % 2 == 1:
-                    ax.axvspan(
-                        boundaries[k],
-                        boundaries[k + 1],
-                        color=_GRAY_300,
-                        alpha=0.25,
-                        zorder=0,
-                    )
-        x_label = x_label_money if (has_usage and cost_available) else x_label_tokens
-        ax.set_xlabel(x_label, color=_GRAY_800)
-        ax.set_ylabel(y_label, color=_GRAY_800, rotation=90)
-        ax.set_ylim(0, 1.05)
-        ax.set_xlim(left=0)
-        ax.spines["top"].set_visible(False)
-        ax.spines["right"].set_visible(False)
-        ax.spines["bottom"].set_color(_GRAY_700)
-        ax.spines["left"].set_color(_GRAY_700)
-        ax.spines["bottom"].set_linewidth(0.8)
-        ax.spines["left"].set_linewidth(0.8)
-        ax.tick_params(colors=_GRAY_700)
-        ax.grid(
-            True,
-            color=_GRAY_300,
-            linewidth=0.8,
-            linestyle=(0, (4, 2)),
-            zorder=0,
-        )
-        ax.set_axisbelow(True)
-        ax.set_title(title, color=_GRAY_900, fontweight="medium", pad=12)
-        legend = ax.legend(
-            loc="upper left",
-            frameon=True,
-            fancybox=False,
-            edgecolor=_GRAY_300,
-            framealpha=1.0,
-            borderpad=0.6,
-        )
-        legend.get_frame().set_linewidth(0.5)
-        legend.get_frame().set_facecolor("white")
-        buf = io.BytesIO()
-        fig.savefig(
-            buf,
-            format="png",
-            dpi=300,
-            bbox_inches="tight",
-            facecolor="white",
-        )
-        plt.close(fig)
-    return buf.getvalue()
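
Note on the deleted `build_plot`: it derived two series from `score_update` events, a monotonic best-so-far line and a raw current-score line that is broken with NaN at each span marker. Below is a minimal sketch of that bookkeeping without matplotlib. It assumes events are simplified to hypothetical `(kind, x, score)` tuples rather than `ReportEvent` objects, with `x` standing in for the cumulative cost.

```python
import math

# Hypothetical simplified events: (kind, cumulative_x, score).
events = [
    ("attempt_start", 0.0, 0.0),
    ("score_update", 1.0, 0.4),
    ("score_update", 2.0, 0.2),
    ("attempt_start", 3.0, 0.0),
    ("score_update", 4.0, 0.7),
]

# Both series start at the origin, as in the removed function.
xs_line, ys_line = [0.0], [0.0]
xs_current, ys_current = [0.0], [0.0]
best_so_far = 0.0

for kind, x, score in events:
    if kind == "score_update":
        best_so_far = max(best_so_far, score)
        xs_line.append(x)
        ys_line.append(best_so_far)
        xs_current.append(x)
        ys_current.append(score)
    elif kind == "attempt_start":
        # NaN splits the raw line into one segment per attempt.
        xs_current.append(x)
        ys_current.append(float("nan"))

nan_positions = [i for i, y in enumerate(ys_current) if math.isnan(y)]
```

Line plotting in matplotlib skips NaN points, which is why the removed code used them to avoid a vertical drop between attempts.
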