PyPI - benchflow - Versions diffs - 0.2.2__tar.gz → 0.3.0__tar.gz - Mend

benchflow 0.2.2__tar.gz → 0.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (181)

{benchflow-0.2.2 → benchflow-0.3.0}/.gitignore RENAMED

@@ -181,3 +181,5 @@ dogfood/
 tmp/
 .claude/settings.local.json
 tests/.smoke-jobs/
+context/
+tutorials/

{benchflow-0.2.2 → benchflow-0.3.0}/CHANGELOG.md RENAMED

@@ -2,6 +2,26 @@
 ## [Unreleased]
+## 0.2.3 — 2026-04-15
+### Added
+- `benchmarks/tb2_multiturn-claude-haiku45.yaml` — shipped config for the README's TB2 multi-turn Claude result.
+- Daytona resource clamping via `BENCHFLOW_DAYTONA_MAX_CPUS` / `MAX_MEMORY_MB`.
+### Changed
+- Renamed `skillsbench-claude-glm5.yaml` → `skillsbench-claude-glm51.yaml` to match the model ID.
+- `codex --login` correction in `docs/getting-started.md`.
+- Restricted sdist build to `src/`, `tests/`, and metadata.
+### Fixed
+- Verifier sandbox hardening follow-ups across several base-image and tooling edge cases.
+- Preserve trusted verifier path entries and workspace answer files.
+- Redirect oracle output to container log.
+- Align YAML path resolution to config file location.
 ## 0.2.2 — 2026-04-13
 ### Added
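
The 0.2.3 "Fixed" entry about aligning YAML path resolution to the config file location is not accompanied by its implementation in this hunk. A minimal sketch of the general idea, assuming relative paths in a config are resolved against the directory that contains the config file rather than the current working directory. `resolve_config_path` is a hypothetical helper name, not a benchflow function:

```python
from pathlib import Path
from typing import Union

PathLike = Union[str, Path]


def resolve_config_path(config_file: PathLike, value: PathLike) -> Path:
    """Resolve `value` against the directory holding `config_file`.

    Hypothetical helper for illustration only; benchflow's actual
    resolution logic is not shown in this diff.
    """
    candidate = Path(value)
    if candidate.is_absolute():
        # Absolute paths are returned unchanged.
        return candidate
    # Relative paths are anchored at the config file's parent directory.
    return Path(config_file).parent / candidate
```

With this convention, a `task_dir` written in a YAML file keeps pointing at the same directory regardless of where the command is launched from.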

benchflow-0.3.0/PKG-INFO ADDED

@@ -0,0 +1,212 @@
+Metadata-Version: 2.4
+Name: benchflow
+Version: 0.3.0
+Summary: Multi-turn agent benchmarking with ACP — run any agent, any model, any provider.
+Project-URL: Homepage, https://github.com/benchflow-ai/benchflow
+Project-URL: Repository, https://github.com/benchflow-ai/benchflow
+Project-URL: Issues, https://github.com/benchflow-ai/benchflow/issues
+Project-URL: Discord, https://discord.gg/mZ9Rc8q8W3
+Project-URL: Changelog, https://github.com/benchflow-ai/benchflow/blob/main/CHANGELOG.md
+Author-email: Xiangyi Li <xiangyi@benchflow.ai>, Kyoung Whan Choe <choe.kyoung@gmail.com>
+Maintainer-email: Xiangyi Li <xiangyi@benchflow.ai>, Kyoung Whan Choe <choe.kyoung@gmail.com>
+License: Apache-2.0
+License-File: LICENSE
+Keywords: acp,agent-evaluation,benchmark,llm-agents,multi-turn,skillsbench,terminal-bench
+Classifier: License :: OSI Approved :: Apache Software License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Requires-Python: >=3.12
+Requires-Dist: anyio>=4.0
+Requires-Dist: harbor==0.3.0
+Requires-Dist: httpx>=0.27.0
+Requires-Dist: pydantic>=2.0
+Requires-Dist: pyyaml>=6.0
+Requires-Dist: rich>=13.0
+Requires-Dist: typer>=0.9
+Provides-Extra: dev
+Requires-Dist: pre-commit>=3.7; extra == 'dev'
+Requires-Dist: pytest-asyncio>=0.24.0; extra == 'dev'
+Requires-Dist: pytest>=9.0.3; extra == 'dev'
+Requires-Dist: ruff>=0.7.0; extra == 'dev'
+Requires-Dist: ty>=0.0.1a1; extra == 'dev'
+Description-Content-Type: text/markdown
+<div align="center">
+  <h1>BenchFlow</h1>
+  <p>Multi-turn agent benchmarking — Scene-based lifecycle for any ACP agent</p>
+  <a href="https://pypi.org/project/benchflow/" target="_blank">
+    <img src="https://img.shields.io/badge/PyPI-0.3.0a3-blue?style=for-the-badge&logo=pypi" alt="PyPI">
+  </a>
+  <a href="https://discord.gg/mZ9Rc8q8W3" target="_blank">
+    <img src="https://img.shields.io/badge/Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white" alt="Discord">
+  </a>
+</div>
+## What
+BenchFlow runs AI agents against benchmark tasks in sandboxed environments. It supports single-agent, multi-agent, and multi-turn evaluation patterns through a Scene-based lifecycle.
+- **Any ACP agent** — Gemini CLI, Claude, Codex, OpenClaw, Pi, or your own
+- **Multi-scene trials** — skill generation → solve, coder → reviewer → revision
+- **Cloud sandboxes** — Daytona backend for parallel execution at scale
+- **YAML-driven** — same task folder, different trial configs for ablation
+## Install
+```bash
+pip install benchflow==0.3.0a3
+```
+Requires Python 3.12+. For cloud sandboxes, set `DAYTONA_API_KEY`.
+## Quick Start
+### CLI
+```bash
+# Run a single task with Gemini
+bench eval create -t tasks/my-task -a gemini -m gemini-3.1-flash-lite-preview -e daytona
+# Run from YAML config (batch, concurrent)
+bench eval create -f benchmarks/tb2-gemini-baseline.yaml
+# List agents
+bench agent list
+# Check task validity
+bench tasks check tasks/my-task
+```
+### Python
+```python
+import benchflow as bf
+from benchflow.trial import TrialConfig, Scene, Role, Turn
+# Simplest: one agent, one task
+result = await bf.run("gemini", task_path="tasks/my-task", model="gemini-3.1-flash-lite-preview")
+print(result.rewards)  # {"reward": 1.0}
+# Scene-based: skill-gen → solve (BYOS pattern)
+config = TrialConfig(
+    task_path=Path("tasks/my-task"),
+    scenes=[
+        Scene(name="skill-gen",
+              roles=[Role("gen", "gemini", "gemini-3.1-flash-lite-preview")],
+              turns=[Turn("gen", "Analyze the task and write a skill to /app/generated-skill.md")]),
+        Scene(name="solve",
+              roles=[Role("solver", "gemini", "gemini-3.1-flash-lite-preview")],
+              turns=[Turn("solver")]),  # None prompt = use instruction.md
+    ],
+    environment="daytona",
+)
+result = await bf.run(config)
+# Multi-agent: coder + reviewer
+config = TrialConfig(
+    task_path=Path("tasks/my-task"),
+    scenes=[
+        Scene(name="review-loop",
+              roles=[
+                  Role("coder", "gemini", "gemini-3.1-flash-lite-preview"),
+                  Role("reviewer", "gemini", "gemini-3.1-flash-lite-preview"),
+              ],
+              turns=[
+                  Turn("coder", "Solve the task. Write to /app/.outbox/reviewer.json when done."),
+                  Turn("reviewer", "Review the coder's work. Write feedback to /app/.outbox/coder.json."),
+                  Turn("coder", "Read the reviewer's feedback and revise your solution."),
+              ]),
+    ],
+    environment="daytona",
+)
+result = await bf.run(config)
+```
+### YAML Trial Config
+```yaml
+# trial-baseline.yaml
+task_dir: .ref/terminal-bench-2
+agent: gemini
+model: gemini-3.1-flash-lite-preview
+environment: daytona
+concurrency: 89
+# trial-byos.yaml (same tasks, different config)
+task_dir: .ref/terminal-bench-2
+scenes:
+  - name: skill-gen
+    roles: [{name: gen, agent: gemini, model: gemini-3.1-flash-lite-preview}]
+    turns: [{role: gen, prompt: "Generate a skill for this task..."}]
+  - name: solve
+    roles: [{name: solver, agent: gemini, model: gemini-3.1-flash-lite-preview}]
+```
+## CLI Reference
+```
+bench agent list              List registered agents
+bench agent show <name>       Agent details + conformance status
+bench eval create             Create + run evaluation (returns job-id)
+bench eval list               List completed evaluations
+bench skills eval             Evaluate skill via evals.json
+bench tasks init <name>       Scaffold new task
+bench tasks check <dir>       Validate task (--rubric for custom)
+bench train create            Reward-based training sweep
+bench environment create      Spin up sandbox from task dir
+bench environment list        List active sandboxes
+```
+## Architecture
+```
+Trial = sequence of Scenes in a shared sandbox
+Scene = Roles + Turns (one interaction region)
+Role  = agent + model
+Turn  = one prompt for one role
+bf.run(config)
+  → Trial.create(config)
+    → trial.setup()      # resolve config, create env object
+    → trial.start()      # spin up sandbox, upload task files
+    → for scene in config.scenes:
+        → trial._run_scene(scene)  # connect/execute/disconnect per role
+    → trial.verify()     # run verifier, score
+    → trial.cleanup()    # stop sandbox
+```
+## Registered Agents
+| Agent | Command | Auth |
+|-------|---------|------|
+| `gemini` | `gemini --acp --yolo` | GOOGLE_API_KEY |
+| `claude-agent-acp` | `claude-agent-acp` | ANTHROPIC_API_KEY |
+| `codex-acp` | `codex-acp` | OPENAI_API_KEY |
+| `openclaw` | `openclaw-acp-shim` | inferred from model |
+| `pi-acp` | `pi-acp` | ANTHROPIC_API_KEY |
+## Adding a Custom Agent
+Any ACP-native agent works. Create `agent.toml`:
+```toml
+name = "my-agent"
+launch_cmd = "my-agent --acp"
+install_cmd = "npm install -g my-agent"
+requires_env = ["MY_API_KEY"]
+```
+## Development
+```bash
+uv venv -p 3.12 .venv && uv pip install -e ".[dev]"
+.venv/bin/python -m pytest tests/       # 580+ unit tests
+.venv/bin/ty check src/                 # type check
+```

benchflow-0.3.0/README.md ADDED

@@ -0,0 +1,177 @@
+<div align="center">
+  <h1>BenchFlow</h1>
+  <p>Multi-turn agent benchmarking — Scene-based lifecycle for any ACP agent</p>
+  <a href="https://pypi.org/project/benchflow/" target="_blank">
+    <img src="https://img.shields.io/badge/PyPI-0.3.0a3-blue?style=for-the-badge&logo=pypi" alt="PyPI">
+  </a>
+  <a href="https://discord.gg/mZ9Rc8q8W3" target="_blank">
+    <img src="https://img.shields.io/badge/Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white" alt="Discord">
+  </a>
+</div>
+## What
+BenchFlow runs AI agents against benchmark tasks in sandboxed environments. It supports single-agent, multi-agent, and multi-turn evaluation patterns through a Scene-based lifecycle.
+- **Any ACP agent** — Gemini CLI, Claude, Codex, OpenClaw, Pi, or your own
+- **Multi-scene trials** — skill generation → solve, coder → reviewer → revision
+- **Cloud sandboxes** — Daytona backend for parallel execution at scale
+- **YAML-driven** — same task folder, different trial configs for ablation
+## Install
+```bash
+pip install benchflow==0.3.0a3
+```
+Requires Python 3.12+. For cloud sandboxes, set `DAYTONA_API_KEY`.
+## Quick Start
+### CLI
+```bash
+# Run a single task with Gemini
+bench eval create -t tasks/my-task -a gemini -m gemini-3.1-flash-lite-preview -e daytona
+# Run from YAML config (batch, concurrent)
+bench eval create -f benchmarks/tb2-gemini-baseline.yaml
+# List agents
+bench agent list
+# Check task validity
+bench tasks check tasks/my-task
+```
+### Python
+```python
+import benchflow as bf
+from benchflow.trial import TrialConfig, Scene, Role, Turn
+# Simplest: one agent, one task
+result = await bf.run("gemini", task_path="tasks/my-task", model="gemini-3.1-flash-lite-preview")
+print(result.rewards)  # {"reward": 1.0}
+# Scene-based: skill-gen → solve (BYOS pattern)
+config = TrialConfig(
+    task_path=Path("tasks/my-task"),
+    scenes=[
+        Scene(name="skill-gen",
+              roles=[Role("gen", "gemini", "gemini-3.1-flash-lite-preview")],
+              turns=[Turn("gen", "Analyze the task and write a skill to /app/generated-skill.md")]),
+        Scene(name="solve",
+              roles=[Role("solver", "gemini", "gemini-3.1-flash-lite-preview")],
+              turns=[Turn("solver")]),  # None prompt = use instruction.md
+    ],
+    environment="daytona",
+)
+result = await bf.run(config)
+# Multi-agent: coder + reviewer
+config = TrialConfig(
+    task_path=Path("tasks/my-task"),
+    scenes=[
+        Scene(name="review-loop",
+              roles=[
+                  Role("coder", "gemini", "gemini-3.1-flash-lite-preview"),
+                  Role("reviewer", "gemini", "gemini-3.1-flash-lite-preview"),
+              ],
+              turns=[
+                  Turn("coder", "Solve the task. Write to /app/.outbox/reviewer.json when done."),
+                  Turn("reviewer", "Review the coder's work. Write feedback to /app/.outbox/coder.json."),
+                  Turn("coder", "Read the reviewer's feedback and revise your solution."),
+              ]),
+    ],
+    environment="daytona",
+)
+result = await bf.run(config)
+```
+### YAML Trial Config
+```yaml
+# trial-baseline.yaml
+task_dir: .ref/terminal-bench-2
+agent: gemini
+model: gemini-3.1-flash-lite-preview
+environment: daytona
+concurrency: 89
+# trial-byos.yaml (same tasks, different config)
+task_dir: .ref/terminal-bench-2
+scenes:
+  - name: skill-gen
+    roles: [{name: gen, agent: gemini, model: gemini-3.1-flash-lite-preview}]
+    turns: [{role: gen, prompt: "Generate a skill for this task..."}]
+  - name: solve
+    roles: [{name: solver, agent: gemini, model: gemini-3.1-flash-lite-preview}]
+```
+## CLI Reference
+```
+bench agent list              List registered agents
+bench agent show <name>       Agent details + conformance status
+bench eval create             Create + run evaluation (returns job-id)
+bench eval list               List completed evaluations
+bench skills eval             Evaluate skill via evals.json
+bench tasks init <name>       Scaffold new task
+bench tasks check <dir>       Validate task (--rubric for custom)
+bench train create            Reward-based training sweep
+bench environment create      Spin up sandbox from task dir
+bench environment list        List active sandboxes
+```
+## Architecture
+```
+Trial = sequence of Scenes in a shared sandbox
+Scene = Roles + Turns (one interaction region)
+Role  = agent + model
+Turn  = one prompt for one role
+bf.run(config)
+  → Trial.create(config)
+    → trial.setup()      # resolve config, create env object
+    → trial.start()      # spin up sandbox, upload task files
+    → for scene in config.scenes:
+        → trial._run_scene(scene)  # connect/execute/disconnect per role
+    → trial.verify()     # run verifier, score
+    → trial.cleanup()    # stop sandbox
+```
+## Registered Agents
+| Agent | Command | Auth |
+|-------|---------|------|
+| `gemini` | `gemini --acp --yolo` | GOOGLE_API_KEY |
+| `claude-agent-acp` | `claude-agent-acp` | ANTHROPIC_API_KEY |
+| `codex-acp` | `codex-acp` | OPENAI_API_KEY |
+| `openclaw` | `openclaw-acp-shim` | inferred from model |
+| `pi-acp` | `pi-acp` | ANTHROPIC_API_KEY |
+## Adding a Custom Agent
+Any ACP-native agent works. Create `agent.toml`:
+```toml
+name = "my-agent"
+launch_cmd = "my-agent --acp"
+install_cmd = "npm install -g my-agent"
+requires_env = ["MY_API_KEY"]
+```
+## Development
+```bash
+uv venv -p 3.12 .venv && uv pip install -e ".[dev]"
+.venv/bin/python -m pytest tests/       # 580+ unit tests
+.venv/bin/ty check src/                 # type check
+```
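
Two observations on the README's Python example as shipped: it calls `Path(...)` without a visible `from pathlib import Path`, and it uses `await` at top level, so it would need to run inside an async function or an async-aware REPL. The Trial / Scene / Role / Turn hierarchy from the Architecture block can be sketched with stand-in dataclasses. These are illustrative types, not benchflow's classes; the field names follow the README's positional usage, and the `plan` helper is hypothetical:

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class Role:
    name: str
    agent: str
    model: str


@dataclass
class Turn:
    role: str
    prompt: Optional[str] = None  # None means "use the task's instruction.md"


@dataclass
class Scene:
    name: str
    roles: list = field(default_factory=list)
    turns: list = field(default_factory=list)


@dataclass
class TrialConfig:
    task_path: Path
    scenes: list
    environment: str


def plan(config: TrialConfig) -> list:
    """Flatten a config into (scene, role, prompt) steps in execution order."""
    steps = []
    for scene in config.scenes:
        known_roles = {role.name for role in scene.roles}
        for turn in scene.turns:
            if turn.role not in known_roles:
                raise ValueError(f"turn references unknown role: {turn.role}")
            # Same placeholder the 0.3.0 logging change uses for a None prompt.
            steps.append((scene.name, turn.role, turn.prompt or "<instruction.md>"))
    return steps
```

Flattening the config this way makes the ordering explicit: scenes run in sequence, and within a scene each turn addresses exactly one declared role.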

{benchflow-0.2.2 → benchflow-0.3.0}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "benchflow"
-version = "0.2.2"
+version = "0.3.0"
 description = "Multi-turn agent benchmarking with ACP — run any agent, any model, any provider."
 readme = "README.md"
 requires-python = ">=3.12"
@@ -37,7 +37,7 @@ classifiers = [
 [project.optional-dependencies]
 dev = [
     "pre-commit>=3.7",
-    "pytest>=8.0",
+    "pytest>=9.0.3",
     "pytest-asyncio>=0.24.0",
     "ruff>=0.7.0",
     "ty>=0.0.1a1",
@@ -45,6 +45,7 @@ dev = [
 [project.scripts]
 benchflow = "benchflow.cli.main:app"
+bench = "benchflow.cli.main:app"
 [project.urls]
 Homepage = "https://github.com/benchflow-ai/benchflow"
@@ -58,20 +59,20 @@ requires = ["hatchling"]
 build-backend = "hatchling.build"
 [tool.hatch.build.targets.sdist]
-exclude = [
-    ".venv*",
-    ".ref",
-    "jobs",
-    "dist",
-    ".claude",
-    ".dev-docs",
-    ".pytest_cache",
-    "__pycache__",
+# Allowlist: only ship what the installed package needs.
+only-include = [
+    "src",
+    "tests",
+    "README.md",
+    "CHANGELOG.md",
+    "LICENSE",
+    "pyproject.toml",
 ]
 [tool.pytest.ini_options]
 asyncio_mode = "auto"
 addopts = "-m 'not live'"
+testpaths = ["tests"]
 markers = [
     "live: requires real Anthropic API and Docker daemon (run with -m live)",
 ]

{benchflow-0.2.2 → benchflow-0.3.0}/src/benchflow/__init__.py RENAMED

@@ -45,7 +45,20 @@ from benchflow.environments import (
 from benchflow.job import Job, JobConfig, JobResult, RetryConfig
 from benchflow.metrics import BenchmarkMetrics, collect_metrics
 from benchflow.models import AgentInstallError, AgentTimeoutError, RunResult
+from benchflow.runtime import (
+    Agent,
+    Environment,
+    Runtime,
+    RuntimeConfig,
+    RuntimeResult,
+    run,  # bf.run(agent, env) — the primary 0.3 API
+)
+from benchflow._scene import MailboxTransport, Message, MessageTransport, Role, Scene
+from benchflow._snapshot import list_snapshots, restore, snapshot
 from benchflow.sdk import SDK
+from benchflow.trial import Trial, TrialConfig
+from benchflow.trial import Role as TrialRole, Scene as TrialScene, Turn
+from benchflow.trial_yaml import trial_config_from_yaml
 from benchflow.skills import SkillInfo, discover_skills, install_skill, parse_skill
 from benchflow.trajectories.otel import OTelCollector
 from benchflow.trajectories.proxy import TrajectoryProxy
@@ -63,7 +76,6 @@ __all__ = [
     "ExecResult",
     "Task",
     "TaskConfig",
-    "Trial",
     "Verifier",
     "VerifierResult",
     # ACP
@@ -88,7 +100,30 @@ __all__ = [
     "AgentInstallError",
     "AgentTimeoutError",
     "RunResult",
-    # SDK
+    # Runtime (0.3 primary API)
+    "Agent",
+    "Environment",
+    "Runtime",
+    "RuntimeConfig",
+    "RuntimeResult",
+    "run",
+    # Multi-agent scene
+    "Scene",
+    "Role",
+    "Message",
+    "MessageTransport",
+    "MailboxTransport",
+    # Env snapshots
+    "snapshot",
+    "restore",
+    "list_snapshots",
+    # Trial (decomposed lifecycle)
+    "Trial",
+    "TrialConfig",
+    "TrialRole",
+    "TrialScene",
+    "Turn",
+    # SDK (backwards compat)
     "SDK",
     # Environments / dep staging
     "SERVICES",

{benchflow-0.2.2 → benchflow-0.3.0}/src/benchflow/_acp_run.py RENAMED

@@ -25,11 +25,16 @@ from benchflow._sandbox import build_priv_drop_cmd
 from benchflow._trajectory import _capture_session_trajectory
 from benchflow.acp.client import ACPClient
 from benchflow.acp.container_transport import ContainerTransport
+from benchflow.agents.providers import strip_provider_prefix
 from benchflow.process import DaytonaProcess, DockerProcess
 logger = logging.getLogger(__name__)
+_ACP_CONNECT_MAX_RETRIES = 3
+_ACP_CONNECT_BASE_DELAY = 2.0
 async def connect_acp(
     env,
     agent: str,
@@ -41,7 +46,10 @@ async def connect_acp(
     environment: str,
     agent_cwd: str,
 ) -> tuple[ACPClient, object, str]:
-    """Create ACP transport, connect, init session, set model. Return (client, session, agent_name)."""
+    """Create ACP transport, connect, init session, set model. Return (client, session, agent_name).
+    Retries with exponential backoff on ConnectionError (Daytona SSH storms).
+    """
     # Resolve agent binary path for non-docker environments
     if environment != "docker":
         which_result = await env.exec(
@@ -58,32 +66,61 @@ async def connect_acp(
         agent_launch = build_priv_drop_cmd(agent_launch, sandbox_user)
         logger.info(f"Agent sandboxed as: {sandbox_user}")
-    if environment == "docker":
-        live_proc = DockerProcess.from_harbor_env(env)
-    else:
-        live_proc = await DaytonaProcess.from_harbor_env(env)
-    agent_log = trial_dir / "agent" / f"{agent.replace('-', '_')}.txt"
-    transport = ContainerTransport(
-        container_process=live_proc,
-        command=agent_launch,
-        env=agent_env,
-        cwd=agent_cwd,
-        agent_log_path=agent_log,
-    )
-    acp_client = ACPClient(transport)
-    await acp_client.connect()
-    init_result = await asyncio.wait_for(acp_client.initialize(), timeout=60)
-    agent_name = init_result.agent_info.name if init_result.agent_info else agent
-    logger.info(f"ACP agent: {agent_name}")
-    session = await asyncio.wait_for(acp_client.session_new(cwd=agent_cwd), timeout=60)
-    logger.info(f"Session: {session.session_id}")
+    last_err: Exception | None = None
+    acp_client: ACPClient | None = None
+    for attempt in range(_ACP_CONNECT_MAX_RETRIES + 1):
+        if attempt > 0:
+            delay = _ACP_CONNECT_BASE_DELAY * (2 ** (attempt - 1))
+            logger.info(f"ACP connect retry {attempt}/{_ACP_CONNECT_MAX_RETRIES} after {delay:.0f}s")
+            await asyncio.sleep(delay)
-    if model:
-        from benchflow.agents.providers import strip_provider_prefix
+        try:
+            if environment == "docker":
+                live_proc = DockerProcess.from_harbor_env(env)
+            else:
+                live_proc = await DaytonaProcess.from_harbor_env(env)
+            agent_log = trial_dir / "agent" / f"{agent.replace('-', '_')}.txt"
+            transport = ContainerTransport(
+                container_process=live_proc,
+                command=agent_launch,
+                env=agent_env,
+                cwd=agent_cwd,
+                agent_log_path=agent_log,
+            )
+            acp_client = ACPClient(transport)
+            await acp_client.connect()
+            init_result = await asyncio.wait_for(acp_client.initialize(), timeout=60)
+            agent_name = init_result.agent_info.name if init_result.agent_info else agent
+            logger.info(f"ACP agent: {agent_name}")
+            session = await asyncio.wait_for(acp_client.session_new(cwd=agent_cwd), timeout=60)
+            logger.info(f"Session: {session.session_id}")
+            break
+        except ConnectionError as e:
+            # Close the failed client before retrying
+            if acp_client:
+                try:
+                    await acp_client.close()
+                except Exception:
+                    pass
+                acp_client = None
+            last_err = e
+            if attempt == _ACP_CONNECT_MAX_RETRIES:
+                raise
+            logger.warning(f"ACP connect failed (attempt {attempt + 1}): {e}")
+            continue
+        except Exception:
+            # Non-retryable error — close client to prevent leak
+            if acp_client:
+                try:
+                    await acp_client.close()
+                except Exception:
+                    pass
+            raise
+    if model:
         acp_model_id = strip_provider_prefix(model)
         try:
             await asyncio.wait_for(acp_client.set_model(acp_model_id), timeout=60)
@@ -102,7 +139,7 @@ async def execute_prompts(
 ) -> tuple[list[dict], int]:
     """Send prompts via ACP and capture trajectory. Return (trajectory, n_tool_calls)."""
     for i, prompt in enumerate(prompts):
-        logger.info(f"Prompt {i + 1}/{len(prompts)}: {prompt[:80]}...")
+        logger.info(f"Prompt {i + 1}/{len(prompts)}: {(prompt or '<instruction.md>')[:80]}...")
         prompt_result = await asyncio.wait_for(
             acp_client.prompt(prompt),
             timeout=timeout,
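
The retry loop added to `connect_acp` sleeps `_ACP_CONNECT_BASE_DELAY * 2 ** (attempt - 1)` before each retry. With the constants in the hunk (3 retries, base delay 2.0) that gives up to four attempts, with waits of 2, 4, and 8 seconds between them. The same pattern can be sketched in isolation. `backoff_delays` and `retry_on_connection_error` are hypothetical names, and the injectable `sleep` parameter is an addition for testability that the original does not have:

```python
import asyncio

MAX_RETRIES = 3    # mirrors _ACP_CONNECT_MAX_RETRIES in the hunk
BASE_DELAY = 2.0   # mirrors _ACP_CONNECT_BASE_DELAY in the hunk


def backoff_delays(max_retries=MAX_RETRIES, base_delay=BASE_DELAY):
    """Return the sleep before each retry: base, 2*base, 4*base, ..."""
    return [base_delay * (2 ** (attempt - 1)) for attempt in range(1, max_retries + 1)]


async def retry_on_connection_error(connect, *, max_retries=MAX_RETRIES,
                                    base_delay=BASE_DELAY, sleep=asyncio.sleep):
    """Call `connect()` until it succeeds or the retries are exhausted.

    Only ConnectionError is retried; any other exception propagates at once.
    """
    for attempt in range(max_retries + 1):
        if attempt > 0:
            await sleep(base_delay * (2 ** (attempt - 1)))
        try:
            return await connect()
        except ConnectionError:
            if attempt == max_retries:
                raise  # final attempt failed, surface the error
```

Unlike this sketch, the code in the diff also closes the failed `ACPClient` before retrying and on non-retryable errors, which a real caller should preserve to avoid leaking transports.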
