PyPI - benchflow - Versions diffs - 0.3.2__tar.gz → 0.3.3__tar.gz - Mend

benchflow 0.3.2__tar.gz → 0.3.3__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (172)

{benchflow-0.3.2 → benchflow-0.3.3}/.gitignore RENAMED

@@ -173,7 +173,8 @@ cython_debug/
 .DS_Store
 # benchflow
-.ref/
+.cache/
 trials/
 jobs/
 .jobs/

{benchflow-0.3.2 → benchflow-0.3.3}/CHANGELOG.md RENAMED

@@ -2,6 +2,45 @@
 ## [Unreleased]
+## 0.3.3 — 2026-05-15
+### Added
+- **Harvey LAB benchmark** — converter, agent shim, and parity validation for 1,251 legal AI tasks (#239).
+- **Harvey LAB Claude Sonnet judge** — switched verifier from Gemini to `claude-sonnet-4-6`, matching the original benchmark default (#264).
+- **ProgramBench integration** — new benchmark adapter; TB2 removed; `.ref/` migrated to `benchmarks/` (#237).
+- **CLI progress output** — `bench eval create` / `bench run` now show progress messages by default (#264).
+- **Skill nudge** — optional prompt injection for skill-enhanced agent runs (#207).
+- **Self-generated skill mode** for Codex agent (#233).
+- **Integration test suite** for ENG-6 + `OPENAI_BASE_URL` inheritance fix (#255).
+- **Modal backend support** — Dockerfile compatibility for Modal environments.
+- **CITATION.cff** (#246).
+- **`AGENTS.md`** — canonical contributor guide; `CLAUDE.md` deprecated (#258).
+### Changed
+- **Two-field source pattern** for dataset sourcing (#252).
+- **Docs overhaul** — synced from www.benchflow.ai; Mintlify config added then orphaned config removed (#259, #257, #226).
+- **`uv sync`** for package management (#232).
+### Fixed
+- Prevent `TypeError` in `metrics.collect_metrics` when reward is `None` (#243).
+- Copy eval `requirements.txt` into Docker build context (#245).
+- Resolve agent aliases in `bench agent show` and display aliases in `bench agent list` (#251).
+- Guard ACP transports against JSON scalar logs (#236).
+- Agent timeout reward fallback for Codex (#234).
+- Isolate JS agent runtime installs (#231).
+- Route Codex ACP through responses API (#224).
+- Deploy skills and forward `solution.env` for oracle runs (#223).
+- Honor no-internet tasks for agent runs; disable web tools without prompt mutation (#215).
+- Propagate `OPENAI_API_KEY` for vllm provider (#3).
+- Preserve arrival order of thought/message within flush windows (#214).
+- Record user messages and per-turn agent text in ACP trajectory (#745).
+- Chown skill-link parent dirs so sandbox user can write into them.
+- Dynamic `--rootdir` in `PYTEST_ADDOPTS` based on task workspace.
+- Unique env-file path in `DaytonaPtyProcess` to avoid race conditions (#200).
 ## 0.2.3 — 2026-04-15
 ### Added
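The `None`-reward fix (#243) is listed without implementation detail. A minimal sketch of the general pattern, assuming rewards arrive as a sequence in which an unscored trial contributes `None`; `mean_reward` is a hypothetical name and is not taken from `metrics.collect_metrics`:

```python
from statistics import fmean
from typing import Optional, Sequence


def mean_reward(rewards: Sequence[Optional[float]]) -> Optional[float]:
    """Average only the rewards that were actually recorded.

    Hypothetical helper: entries that are None are filtered out before any
    arithmetic, since summing None with a float raises TypeError.
    """
    scored = [r for r in rewards if r is not None]
    # With no scored trials there is nothing to average, so report None.
    return fmean(scored) if scored else None


if __name__ == "__main__":
    print(mean_reward([1.0, None, 0.0]))
    print(mean_reward([None, None]))
```

Returning `None` for an all-unscored run keeps "no data" distinguishable from a genuine reward of `0.0`; whether BenchFlow makes the same choice is not stated in the changelog.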
@@ -66,7 +105,7 @@
 - **Vertex AI support** — ADC auth for `google-vertex/`, `anthropic-vertex/`, `vertex-zai/` prefixed models.
 - **Provider registry** — add a new LLM endpoint via a dict entry in `providers.py`, no code changes.
 - **`benchmarks/` directory** with reusable YAML configs and runner scripts for TB2 and SkillsBench.
-- **Auto task download** via `ensure_tasks()` — `terminal-bench-2` and `skillsbench` clone into `.ref/` on first run.
+- **Auto task download** — YAML configs reference datasets as `org/repo/path` (e.g. `harbor-framework/terminal-bench-2`). Repos are cloned on first use and cached under `.cache/datasets/`.
 - **`benchflow tasks init`** — scaffold new tasks.
 - **`benchflow tasks check`** — validate task structure.
 - **`benchflow cleanup`** — delete old sandboxes with `--max-age` filtering (default 24h).
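The reworded "Auto task download" entry describes dataset references of the form `org/repo/path`, cached under `.cache/datasets/`. A rough sketch of how such a reference could be split, assuming the first two segments name the GitHub repo and the rest is an optional subpath; `DatasetRef`, `parse_dataset_ref`, `cache_dir`, and the one-directory-per-repo cache layout are illustrative assumptions, not BenchFlow's actual code:

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class DatasetRef:
    repo: str  # "org/repo"
    path: str  # subpath inside the repo, "" when the reference has none


def parse_dataset_ref(ref: str) -> DatasetRef:
    """Split 'org/repo[/path...]' into its repo and subpath parts."""
    parts = [p for p in ref.strip("/").split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected 'org/repo[/path]', got {ref!r}")
    return DatasetRef(repo="/".join(parts[:2]), path="/".join(parts[2:]))


def cache_dir(ref: DatasetRef, root: str = ".cache/datasets") -> PurePosixPath:
    # Assumed layout: one clone per org/repo beneath the cache root.
    return PurePosixPath(root) / ref.repo


if __name__ == "__main__":
    ref = parse_dataset_ref("benchflow-ai/skillsbench/tasks")
    print(ref, cache_dir(ref))
```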

benchflow-0.3.3/PKG-INFO ADDED

@@ -0,0 +1,143 @@
+Metadata-Version: 2.4
+Name: benchflow
+Version: 0.3.3
+Summary: Multi-turn agent benchmarking with ACP — run any agent, any model, any provider.
+Project-URL: Homepage, https://github.com/benchflow-ai/benchflow
+Project-URL: Repository, https://github.com/benchflow-ai/benchflow
+Project-URL: Issues, https://github.com/benchflow-ai/benchflow/issues
+Project-URL: Discord, https://discord.gg/mZ9Rc8q8W3
+Project-URL: Changelog, https://github.com/benchflow-ai/benchflow/blob/main/CHANGELOG.md
+Author-email: Xiangyi Li <xiangyi@benchflow.ai>, Kyoung Whan Choe <choe.kyoung@gmail.com>
+Maintainer-email: Xiangyi Li <xiangyi@benchflow.ai>, Kyoung Whan Choe <choe.kyoung@gmail.com>
+License: Apache-2.0
+License-File: LICENSE
+Keywords: acp,agent-evaluation,benchmark,llm-agents,multi-turn,skillsbench,terminal-bench
+Classifier: License :: OSI Approved :: Apache Software License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Requires-Python: >=3.12
+Requires-Dist: anyio>=4.0
+Requires-Dist: harbor==0.3.0
+Requires-Dist: httpx>=0.27.0
+Requires-Dist: pydantic>=2.0
+Requires-Dist: pyyaml>=6.0
+Requires-Dist: rich>=13.0
+Requires-Dist: typer>=0.9
+Provides-Extra: bedrock
+Requires-Dist: boto3>=1.40; extra == 'bedrock'
+Provides-Extra: dev
+Requires-Dist: pre-commit>=3.7; extra == 'dev'
+Requires-Dist: pytest-asyncio>=0.24.0; extra == 'dev'
+Requires-Dist: pytest>=9.0.3; extra == 'dev'
+Requires-Dist: ruff>=0.7.0; extra == 'dev'
+Requires-Dist: ty>=0.0.1a1; extra == 'dev'
+Description-Content-Type: text/markdown
+<div align="center">
+  <h1>BenchFlow</h1>
+  <p>Multi-turn agent benchmarking — Scene-based lifecycle for any ACP agent</p>
+  <a href="https://pypi.org/project/benchflow/" target="_blank">
+    <img src="https://img.shields.io/pypi/v/benchflow?style=for-the-badge&logo=pypi" alt="PyPI">
+  </a>
+  <a href="https://discord.gg/mZ9Rc8q8W3" target="_blank">
+    <img src="https://img.shields.io/badge/Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white" alt="Discord">
+  </a>
+</div>
+## What
+BenchFlow runs AI agents against benchmark tasks in sandboxed environments. Single-agent, multi-agent, and multi-round patterns share one Scene-based lifecycle.
+- **Any ACP agent** — Gemini CLI, Claude Code, Codex, OpenCode, OpenHands, OpenClaw, Pi, or your own
+- **Single + multi + progressive** — single-agent / multi-agent (coder + reviewer, simulated user) / multi-round with a Python `BaseUser` callback
+- **Sandbox backends** — Docker locally, Daytona for parallel cloud runs, Modal for serverless/GPU-backed task environments
+- **Hardened verifier** — defaults block BenchJack/Meerkat-style reward-hacking; tasks opt out per-feature
+## Install
+```bash
+uv tool install benchflow
+```
+Requires Python 3.12+ and [uv](https://docs.astral.sh/uv/). Set `DAYTONA_API_KEY` for Daytona runs or configure Modal auth for Modal runs; export the relevant agent API key (`GEMINI_API_KEY`, `ANTHROPIC_API_KEY`, etc.) or run `claude login` / `codex --login` for subscription auth.
+## Documentation
+Start with [Getting started](./docs/getting-started.md), then [Concepts](./docs/concepts.md) for the mental model. Then by goal:
+| If you want to… | Read |
+|------------------|------|
+| Run an eval on an existing task | [Getting started](./docs/getting-started.md) |
+| Understand Trial / Scene / Role / Verifier | [Concepts](./docs/concepts.md) |
+| Author a new task | [Task authoring](./docs/task-authoring.md) |
+| Multi-agent: coder + reviewer, simulated user, BYOS, stateful envs | [Use cases](./docs/use-cases.md) |
+| Multi-round single-agent (progressive disclosure, oracle access) | [Progressive disclosure](./docs/progressive-disclosure.md) |
+| Skill evaluation (when the artifact is a skill, not a workspace) | [Skill eval](./docs/skill-eval.md) |
+| Understand the security model | [Sandbox hardening](./docs/sandbox-hardening.md) |
+| CLI flags + commands | [CLI reference](./docs/reference/cli.md) |
+| Python API surface | [Python API reference](./docs/reference/python-api.md) |
+Notebooks and runnable example scripts live under [`docs/examples/`](./docs/examples/) so examples stay versioned with the docs that explain them.
+## Benchmark task sources
+Benchmark datasets live in external Git repos and are referenced with two fields:
+```yaml
+# benchmarks/skillsbench-claude-glm51.yaml
+source:
+  repo: benchflow-ai/skillsbench   # GitHub org/repo
+  path: tasks                       # optional subpath within repo
+  ref: main                         # optional branch/tag
+agent: claude-agent-acp
+model: claude-sonnet-4-6
+```
+Run any benchmark via the CLI:
+```bash
+# From a YAML config
+bench eval create -f benchmarks/skillsbench-claude-glm51.yaml
+# Inline — mirrors the YAML source fields
+bench eval create \
+    --source-repo benchflow-ai/skillsbench --source-path tasks \
+    -a gemini -m gemini-3.1-flash-lite-preview -e daytona -c 64
+```
+Repos are cloned and cached locally under `.cache/datasets/` on first use.
+SkillsBench itself sources BenchFlow from GitHub `main` in its
+[`pyproject.toml`](https://github.com/benchflow-ai/skillsbench/blob/main/pyproject.toml).
+After a BenchFlow change lands, run `uv lock --upgrade-package benchflow` in
+SkillsBench when you need its lockfile to point at the newest BenchFlow commit.
+## Featured
+- **Progressive disclosure on SWE-bench Pro** — the `BaseUser` abstraction drives a multi-round trial: terse round-0 prompt → failing-test hints → full spec. 5/5 oracle on Daytona, runnable demo at [`docs/examples/swebench_pro_progressive_disclosure.ipynb`](./docs/examples/swebench_pro_progressive_disclosure.ipynb). Also benchflow's [Harbor #1316](https://github.com/harbor-ai/harbor/issues/1316) parity answer for the no-second-LLM case. See [Progressive disclosure](./docs/progressive-disclosure.md).
+## Research artifacts
+Two runnable labs validate the security story:
+- [`labs/benchjack-sandbox-hardening/`](./labs/benchjack-sandbox-hardening/) — end-to-end demo that 0.2.1+ blocks three [BenchJack](https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/) exploits that flip 0.2.0's reward from 0.0 to 1.0.
+- [`labs/reward-hack-matrix/`](./labs/reward-hack-matrix/) — full reward-hack sweep across real benchmarks comparing 0.2.0 vs 0.2.2.
+## Audience
+- **Eval researchers / paper writers** → [Getting started](./docs/getting-started.md) → [Concepts](./docs/concepts.md) → [Use cases](./docs/use-cases.md)
+- **Task authors** → [Task authoring](./docs/task-authoring.md) → [Sandbox hardening](./docs/sandbox-hardening.md)
+- **Agent builders integrating with benchflow** → [Concepts](./docs/concepts.md) → [Python API reference](./docs/reference/python-api.md) → [`benchflow.agents.registry`](./src/benchflow/agents/registry.py)
+- **Existing Harbor users migrating** → [Use cases — migration section](./docs/use-cases.md#migration-from-harbor) → [Progressive disclosure](./docs/progressive-disclosure.md#comparison-with-multi-agent-simulated-user)
+## Contributing
+PRs welcome. Open against `main`. CI runs ruff + tests on every PR; please run `ruff check .` and `pytest tests/` locally first.
+For a release: bump `pyproject.toml` to the next stable version, tag `v<version>` on main, push the tag — CI publishes to PyPI. Then bump main to the next `.dev0`.
+## License
+Apache-2.0.

benchflow-0.3.3/README.md ADDED

@@ -0,0 +1,106 @@
+<div align="center">
+  <h1>BenchFlow</h1>
+  <p>Multi-turn agent benchmarking — Scene-based lifecycle for any ACP agent</p>
+  <a href="https://pypi.org/project/benchflow/" target="_blank">
+    <img src="https://img.shields.io/pypi/v/benchflow?style=for-the-badge&logo=pypi" alt="PyPI">
+  </a>
+  <a href="https://discord.gg/mZ9Rc8q8W3" target="_blank">
+    <img src="https://img.shields.io/badge/Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white" alt="Discord">
+  </a>
+</div>
+## What
+BenchFlow runs AI agents against benchmark tasks in sandboxed environments. Single-agent, multi-agent, and multi-round patterns share one Scene-based lifecycle.
+- **Any ACP agent** — Gemini CLI, Claude Code, Codex, OpenCode, OpenHands, OpenClaw, Pi, or your own
+- **Single + multi + progressive** — single-agent / multi-agent (coder + reviewer, simulated user) / multi-round with a Python `BaseUser` callback
+- **Sandbox backends** — Docker locally, Daytona for parallel cloud runs, Modal for serverless/GPU-backed task environments
+- **Hardened verifier** — defaults block BenchJack/Meerkat-style reward-hacking; tasks opt out per-feature
+## Install
+```bash
+uv tool install benchflow
+```
+Requires Python 3.12+ and [uv](https://docs.astral.sh/uv/). Set `DAYTONA_API_KEY` for Daytona runs or configure Modal auth for Modal runs; export the relevant agent API key (`GEMINI_API_KEY`, `ANTHROPIC_API_KEY`, etc.) or run `claude login` / `codex --login` for subscription auth.
+## Documentation
+Start with [Getting started](./docs/getting-started.md), then [Concepts](./docs/concepts.md) for the mental model. Then by goal:
+| If you want to… | Read |
+|------------------|------|
+| Run an eval on an existing task | [Getting started](./docs/getting-started.md) |
+| Understand Trial / Scene / Role / Verifier | [Concepts](./docs/concepts.md) |
+| Author a new task | [Task authoring](./docs/task-authoring.md) |
+| Multi-agent: coder + reviewer, simulated user, BYOS, stateful envs | [Use cases](./docs/use-cases.md) |
+| Multi-round single-agent (progressive disclosure, oracle access) | [Progressive disclosure](./docs/progressive-disclosure.md) |
+| Skill evaluation (when the artifact is a skill, not a workspace) | [Skill eval](./docs/skill-eval.md) |
+| Understand the security model | [Sandbox hardening](./docs/sandbox-hardening.md) |
+| CLI flags + commands | [CLI reference](./docs/reference/cli.md) |
+| Python API surface | [Python API reference](./docs/reference/python-api.md) |
+Notebooks and runnable example scripts live under [`docs/examples/`](./docs/examples/) so examples stay versioned with the docs that explain them.
+## Benchmark task sources
+Benchmark datasets live in external Git repos and are referenced with two fields:
+```yaml
+# benchmarks/skillsbench-claude-glm51.yaml
+source:
+  repo: benchflow-ai/skillsbench   # GitHub org/repo
+  path: tasks                       # optional subpath within repo
+  ref: main                         # optional branch/tag
+agent: claude-agent-acp
+model: claude-sonnet-4-6
+```
+Run any benchmark via the CLI:
+```bash
+# From a YAML config
+bench eval create -f benchmarks/skillsbench-claude-glm51.yaml
+# Inline — mirrors the YAML source fields
+bench eval create \
+    --source-repo benchflow-ai/skillsbench --source-path tasks \
+    -a gemini -m gemini-3.1-flash-lite-preview -e daytona -c 64
+```
+Repos are cloned and cached locally under `.cache/datasets/` on first use.
+SkillsBench itself sources BenchFlow from GitHub `main` in its
+[`pyproject.toml`](https://github.com/benchflow-ai/skillsbench/blob/main/pyproject.toml).
+After a BenchFlow change lands, run `uv lock --upgrade-package benchflow` in
+SkillsBench when you need its lockfile to point at the newest BenchFlow commit.
+## Featured
+- **Progressive disclosure on SWE-bench Pro** — the `BaseUser` abstraction drives a multi-round trial: terse round-0 prompt → failing-test hints → full spec. 5/5 oracle on Daytona, runnable demo at [`docs/examples/swebench_pro_progressive_disclosure.ipynb`](./docs/examples/swebench_pro_progressive_disclosure.ipynb). Also benchflow's [Harbor #1316](https://github.com/harbor-ai/harbor/issues/1316) parity answer for the no-second-LLM case. See [Progressive disclosure](./docs/progressive-disclosure.md).
+## Research artifacts
+Two runnable labs validate the security story:
+- [`labs/benchjack-sandbox-hardening/`](./labs/benchjack-sandbox-hardening/) — end-to-end demo that 0.2.1+ blocks three [BenchJack](https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/) exploits that flip 0.2.0's reward from 0.0 to 1.0.
+- [`labs/reward-hack-matrix/`](./labs/reward-hack-matrix/) — full reward-hack sweep across real benchmarks comparing 0.2.0 vs 0.2.2.
+## Audience
+- **Eval researchers / paper writers** → [Getting started](./docs/getting-started.md) → [Concepts](./docs/concepts.md) → [Use cases](./docs/use-cases.md)
+- **Task authors** → [Task authoring](./docs/task-authoring.md) → [Sandbox hardening](./docs/sandbox-hardening.md)
+- **Agent builders integrating with benchflow** → [Concepts](./docs/concepts.md) → [Python API reference](./docs/reference/python-api.md) → [`benchflow.agents.registry`](./src/benchflow/agents/registry.py)
+- **Existing Harbor users migrating** → [Use cases — migration section](./docs/use-cases.md#migration-from-harbor) → [Progressive disclosure](./docs/progressive-disclosure.md#comparison-with-multi-agent-simulated-user)
+## Contributing
+PRs welcome. Open against `main`. CI runs ruff + tests on every PR; please run `ruff check .` and `pytest tests/` locally first.
+For a release: bump `pyproject.toml` to the next stable version, tag `v<version>` on main, push the tag — CI publishes to PyPI. Then bump main to the next `.dev0`.
+## License
+Apache-2.0.
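The README's "Benchmark task sources" section says the inline CLI form mirrors the YAML `source` fields. A small sketch of that mapping, using only the flags that appear in the README (`--source-repo`, `--source-path`, `-a`, `-m`); the function name `eval_create_argv` is hypothetical, and `ref` is left unmapped because the README shows no inline flag for it:

```python
import shlex
from typing import Mapping


def eval_create_argv(source: Mapping[str, str], agent: str, model: str) -> list[str]:
    """Build a `bench eval create` argv from YAML-style source fields."""
    argv = ["bench", "eval", "create", "--source-repo", source["repo"]]
    # `path` is optional in the YAML, so emit its flag only when present.
    if source.get("path"):
        argv += ["--source-path", source["path"]]
    argv += ["-a", agent, "-m", model]
    return argv


if __name__ == "__main__":
    argv = eval_create_argv(
        {"repo": "benchflow-ai/skillsbench", "path": "tasks", "ref": "main"},
        agent="gemini",
        model="gemini-3.1-flash-lite-preview",
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(argv))
```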

{benchflow-0.3.2 → benchflow-0.3.3}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "benchflow"
-version = "0.3.2"
+version = "0.3.3"
 description = "Multi-turn agent benchmarking with ACP — run any agent, any model, any provider."
 readme = "README.md"
 requires-python = ">=3.12"
@@ -42,6 +42,9 @@ dev = [
     "ruff>=0.7.0",
     "ty>=0.0.1a1",
 ]
+bedrock = [
+    "boto3>=1.40",
+]
 [project.scripts]
 benchflow = "benchflow.cli.main:app"
@@ -71,14 +74,16 @@ only-include = [
 [tool.pytest.ini_options]
 asyncio_mode = "auto"
-addopts = "-m 'not live'"
+addopts = "-m 'not live and not integration'"
 testpaths = ["tests"]
 markers = [
     "live: requires real Anthropic API and Docker daemon (run with -m live)",
+    "integration: full integration tests — requires GEMINI_API_KEY + DAYTONA_API_KEY (run with -m integration)",
 ]
 [tool.ruff]
 target-version = "py312"
+extend-exclude = [".claude/skills/skill-creator"]
 [tool.ruff.lint]
 select = [
@@ -96,6 +101,16 @@ ignore = [
     "RUF022", # __all__ unsorted — grouped by section for agent-friendliness
 ]
+[tool.ruff.lint.per-file-ignores]
+# Standalone scripts — sys.path manipulation before imports is intentional
+"experiments/*.py" = ["E402"]
+"tests/conformance/*.py" = ["E402"]
+# Notebooks: cell-local imports + short loop vars are notebook conventions
+"docs/examples/*.ipynb" = ["E402", "E741", "SIM115"]
+# Forward references resolved via __future__ annotations — ruff flags them
+# but they work at runtime; explicit TYPE_CHECKING imports would force eager loads.
+"src/benchflow/runtime.py" = ["F821"]
 [tool.ty.environment]
 python-version = "3.12"
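The `addopts` change means tests marked `integration` are now deselected by default, alongside those marked `live`. A simplified model of that selection rule, not pytest's own expression evaluator; the test names are made up for illustration:

```python
# Markers excluded by the default expression "not live and not integration".
EXCLUDED_BY_DEFAULT = frozenset({"live", "integration"})


def runs_by_default(markers: set[str]) -> bool:
    """True when a test carries none of the excluded markers."""
    return EXCLUDED_BY_DEFAULT.isdisjoint(markers)


# Hypothetical tests and the markers attached to them.
tests = {
    "test_parse_config": set(),
    "test_daytona_roundtrip": {"integration"},
    "test_anthropic_call": {"live"},
}

default_run = sorted(name for name, marks in tests.items() if runs_by_default(marks))

if __name__ == "__main__":
    print(default_run)
```

Per the marker descriptions in the hunk, the excluded groups are run on demand with `-m live` or `-m integration`.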

{benchflow-0.3.2 → benchflow-0.3.3}/src/benchflow/__init__.py RENAMED

@@ -19,13 +19,14 @@ from harbor import (
     ExecResult,
     Task,
     TaskConfig,
-    Trial,
     Verifier,
     VerifierResult,
 )
 # benchflow's additions
 from benchflow._env_setup import stage_dockerfile_deps
+from benchflow._scene import MailboxTransport, Message, MessageTransport, Role, Scene
+from benchflow._snapshot import list_snapshots, restore, snapshot
 from benchflow.acp.client import ACPClient
 from benchflow.acp.session import ACPSession
 from benchflow.agents.registry import (
@@ -53,16 +54,16 @@ from benchflow.runtime import (
     RuntimeResult,
     run,  # bf.run(agent, env) — the primary 0.3 API
 )
-from benchflow._scene import MailboxTransport, Message, MessageTransport, Role, Scene
-from benchflow._snapshot import list_snapshots, restore, snapshot
 from benchflow.sdk import SDK
-from benchflow.trial import Trial, TrialConfig
-from benchflow.trial import Role as TrialRole, Scene as TrialScene, Turn
-from benchflow.trial_yaml import trial_config_from_yaml
 from benchflow.skills import SkillInfo, discover_skills, install_skill, parse_skill
 from benchflow.trajectories.otel import OTelCollector
 from benchflow.trajectories.proxy import TrajectoryProxy
 from benchflow.trajectories.types import Trajectory
+from benchflow.trial import Role as TrialRole
+from benchflow.trial import Scene as TrialScene
+from benchflow.trial import Trial, TrialConfig, Turn
+from benchflow.trial_yaml import trial_config_from_yaml
+from benchflow.user import BaseUser, FunctionUser, PassthroughUser, RoundResult
 # Public API surface. Anything not in this list is implementation detail and
 # may change without notice. Names are grouped by source module to match the
@@ -123,6 +124,12 @@ __all__ = [
     "TrialRole",
     "TrialScene",
     "Turn",
+    "trial_config_from_yaml",
+    # User abstraction (progressive disclosure)
+    "BaseUser",
+    "FunctionUser",
+    "PassthroughUser",
+    "RoundResult",
     # SDK (backwards compat)
     "SDK",
     # Environments / dep staging
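This hunk adds imports (`BaseUser`, `FunctionUser`, `PassthroughUser`, `RoundResult`) together with matching `__all__` entries. A minimal sketch of a check that the two stay in sync; `missing_exports` is a hypothetical helper, and `demo_pkg` is a stand-in module built only for this example:

```python
import types


def missing_exports(module: types.ModuleType) -> list[str]:
    """Return names listed in `__all__` that the module does not define."""
    return [name for name in getattr(module, "__all__", []) if not hasattr(module, name)]


# Stand-in module: exports two names but defines only one of them.
demo = types.ModuleType("demo_pkg")
demo.BaseUser = type("BaseUser", (), {})
demo.__all__ = ["BaseUser", "RoundResult"]

if __name__ == "__main__":
    print(missing_exports(demo))
```

Against an installed package the same helper would take the imported module object, for example the result of `import benchflow`.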
