PyPI - benchflow - Versions diffs - 0.2.0__tar.gz → 0.2.2__tar.gz - Mend

benchflow 0.2.0.tar.gz → 0.2.2.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (153)

benchflow-0.2.2/.git ADDED

@@ -0,0 +1 @@
+gitdir: /workspace/.git/modules/repos/benchflow/worktrees/benchflow-main

benchflow-0.2.2/.github/workflows/test.yml ADDED

@@ -0,0 +1,38 @@
+name: test
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v3
+        with:
+          enable-cache: true
+      - name: Set up Python
+        run: uv python install 3.12
+      - name: Install dependencies
+        run: |
+          uv venv -p 3.12 .venv
+          uv pip install -e ".[dev]"
+      - name: Lint
+        run: .venv/bin/ruff check src tests
+      - name: Format check
+        run: .venv/bin/ruff format --check src tests
+      - name: Type check
+        run: .venv/bin/ty check
+      - name: Test
+        run: .venv/bin/python -m pytest tests/
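
The four gate steps in this workflow (lint, format check, type check, test) can be reproduced locally before pushing. The following is a minimal sketch, not part of the package: the command lists are copied from the workflow above, while `CI_STEPS`, `run_ci_steps`, and the injectable `runner` argument are hypothetical names introduced for illustration. It assumes the `.venv` created by the "Install dependencies" step already exists.

```python
import subprocess

# Commands copied from the Lint / Format check / Type check / Test steps above.
CI_STEPS = [
    ("lint", [".venv/bin/ruff", "check", "src", "tests"]),
    ("format", [".venv/bin/ruff", "format", "--check", "src", "tests"]),
    ("types", [".venv/bin/ty", "check"]),
    ("test", [".venv/bin/python", "-m", "pytest", "tests/"]),
]


def run_ci_steps(steps=CI_STEPS, runner=subprocess.run):
    """Run each step in order; return the first failing step's name, or None.

    `runner` is injectable (hypothetical design choice) so the ordering logic
    can be exercised without invoking the real tools.
    """
    for name, cmd in steps:
        result = runner(cmd)
        if result.returncode != 0:
            return name  # stop at the first failure, as a CI job would
    return None
```

Calling `run_ci_steps()` from the repository root runs the real tools; passing a stub `runner` exercises only the stop-on-first-failure logic.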

{benchflow-0.2.0 → benchflow-0.2.2}/.gitignore RENAMED

@@ -130,6 +130,7 @@ celerybeat.pid
 # Environments
 .env
 .venv
+.venvs/
 env/
 venv/
 ENV/
@@ -175,6 +176,8 @@ cython_debug/
 .ref/
 trials/
 jobs/
+.jobs/
 dogfood/
 tmp/
 .claude/settings.local.json
+tests/.smoke-jobs/

benchflow-0.2.2/.pre-commit-config.yaml ADDED

@@ -0,0 +1,22 @@
+# Run on staged files at commit time. Mirrors what CI runs so format/lint
+# failures are caught locally before push. Install once per clone:
+#
+#     uv pip install -e ".[dev]" && pre-commit install
+#
+# Bypass with --no-verify only if you know what you're doing; CI will still
+# gate the same checks. Pinned to ruff 0.15.7 to match pyproject.toml's
+# ruff>=0.7.0 floor and avoid format-rule drift between hook and CI.
+#
+# Scoped to src/ and tests/ to match what CI actually checks
+# (`.github/workflows/*` runs `ruff format --check src tests`). benchmarks/
+# is intentionally excluded — out of CI scope, would silently expand the
+# format gate beyond what's enforced upstream.
+repos:
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.15.7
+    hooks:
+      - id: ruff-format
+        files: ^(src|tests)/
+      - id: ruff-check
+        files: ^(src|tests)/
+        args: [--fix]
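
The `files: ^(src|tests)/` pattern is what keeps both hooks scoped to the directories CI checks. A small sketch of which paths that pattern selects, assuming pre-commit applies `files` as a regular-expression search against each repository-relative path; the sample paths and the `in_hook_scope` helper are hypothetical, chosen only to illustrate the anchoring.

```python
import re

# Same pattern as the `files:` key in the hooks above.
HOOK_FILES = re.compile(r"^(src|tests)/")


def in_hook_scope(path: str) -> bool:
    """Return True if a repo-relative path would be picked up by the hooks."""
    return HOOK_FILES.search(path) is not None


# Hypothetical paths for illustration only.
for path in [
    "src/benchflow/example.py",
    "tests/test_example.py",
    "benchmarks/run_tb2.py",
    "docs/src/notes.md",
]:
    print(f"{path}: {in_hook_scope(path)}")
```

Because of the leading `^`, a nested `src/` directory such as `docs/src/` is not matched, and `benchmarks/` stays outside the format gate as the config comment intends.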

benchflow-0.2.2/CHANGELOG.md ADDED

@@ -0,0 +1,86 @@
+# Changelog
+## [Unreleased]
+## 0.2.2 — 2026-04-13
+### Added
+- **Sandbox hardening tiers 1–3** — layered defense (env scrubbing, path lockdown, workspace
+  freeze, wider snapshot, oracle privilege drop) blocking F1–F6 red-team findings.
+- **`labs/reward-hack-matrix`** — per-trial timeout support and 0.2.2 sweep handoff scripts.
+### Fixed
+- Multiple sandbox bypass vectors identified in red-team testing.
+## 0.2.1 — 2026-04-12
+### Added
+- **Sandbox hardening on by default** — `sandbox_user` now defaults to `"agent"` (was `None`/root). Blocks conftest-hook and answer-lookup exploit patterns.
+- **Path lockdown** — new `sandbox_locked_paths` parameter makes `/solution` and `/tests` read-only before the verifier runs, blocking `.pth`-injection and similar pre-verify tampering.
+- **Verifier failure isolation** — agent errors and verifier errors are now stored separately; a crashing verifier no longer masks the agent result.
+- **`labs/benchjack-sandbox-hardening`** — cookbook demonstrating three exploit patterns (P1 conftest-hook, P2 answer-lookup, P7 `.pth`-injection) and their defenses.
+### Fixed
+- **Oracle runs as `sandbox_user`** — oracle agent now respects path lockdown instead of running as root and bypassing it.
+- **Multi-endpoint provider routing** — providers with multiple endpoints now route by the agent's native API protocol.
+- **Stale API key shadowing subscription auth** — emits a warning when `ANTHROPIC_API_KEY` env var is present alongside `claude login` credentials.
+- **pytest `ini`-injection bypass** — closed a verifier hardening edge case.
+### Changed
+- Version is now single-sourced via `importlib.metadata`; no more duplicate version string in `__init__.py`.
+- **User-facing docs** — new `docs/` directory with getting-started guide, CLI reference, architecture overview, task-authoring guide, and labs index. README trimmed; detailed content moved to `docs/`.
+## 0.2.0 — 2026-04-09
+**First public release.** A near-complete rearchitecture from the 0.1.x era. API surface has changed — assume breaking changes. Future releases will maintain compatibility within the 0.2.x line. 0.1.x users should treat this as a fresh install; see `.dev-docs/sdk-reference.md` for the new SDK.
+### Added
+- **Multi-agent, multi-provider, multi-auth matrix** — one YAML config, any supported agent × model × provider × auth combination.
+- **Subscription auth support** — use `claude login`, `codex --login`, `gemini` OAuth credentials directly. No API keys required for host-based agent workflows.
+- **Vertex AI support** — ADC auth for `google-vertex/`, `anthropic-vertex/`, `vertex-zai/` prefixed models.
+- **Provider registry** — add a new LLM endpoint via a dict entry in `providers.py`, no code changes.
+- **`benchmarks/` directory** with reusable YAML configs and runner scripts for TB2 and SkillsBench.
+- **Auto task download** via `ensure_tasks()` — `terminal-bench-2` and `skillsbench` clone into `.ref/` on first run.
+- **`benchflow tasks init`** — scaffold new tasks.
+- **`benchflow tasks check`** — validate task structure.
+- **`benchflow cleanup`** — delete old sandboxes with `--max-age` filtering (default 24h).
+- **Oracle agent support** — run `solution/solve.sh` directly for task validation.
+- **Hello-world-task example** for sanity-testing the agent pipeline.
+- **Model generation params** via env vars (`BENCHFLOW_TEMPERATURE`, `BENCHFLOW_TOP_P`, `BENCHFLOW_MAX_TOKENS`).
+- **OpenClaw ACP shim** with trajectory parsing and skills support.
+- **ACP trajectory capture** — full multi-turn agent trajectories via ACP protocol.
+### Changed
+- **Skill loading** — agent-targeted with proper precedence; auto-distributed from `task.toml` `skills_dir`.
+- **`openclaw-gemini` merged** into `openclaw` — provider mode selected at runtime via `BENCHFLOW_PROVIDER_NAME`.
+### Fixed
+- **API keys leaking in `ps aux`** — env vars now written inside the container instead of passed via Docker exec `-e`.
+- **Subscription auth skipped without `-m`** — `benchflow run` without `--model` now checks correctly.
+- **ADC credentials break with `sandbox_user`** (#111) — credentials written to sandbox user's home instead of `/root/`.
+- **Daytona sandboxes not cleaned up** (#102) — auto-delete after max age.
+- **`benchflow cleanup` ignoring `--max-age`** — was deleting everything regardless of age.
+- **readline buffer overflow crashes trial** (#98).
+- **OpenClaw ACP shim loses tool command text** (#96).
+- **OpenClaw ACP shim hardcodes `anthropic/` prefix** (#95) — now routes correctly for Gemini/GLM models.
+- **Oracle agent `PermissionError`** writing `agent/oracle.txt` (#91).
+- **Oracle path skips `pre_agent_hooks`** (#92) — services now start before oracle runs.
+- **Trial data parity with Harbor** (#90) — richer `result.json`, agent logs, per-phase timing.
+- **`SDK.run()` `PermissionError`** — `jobs_dir` subdirectories created as root (#88).
+- **Partial trajectory lost on timeout** — saved before timeout raises.
+- **Redundant `--version` binary check** removed — was wasting 30s per trial.
+- **Trajectory fallback** — scrapes agent-native files when ACP `session/update` is empty (#94).
+- **`litellm` upgraded to 1.83.0** for CVE-2026-35030; transitive dep security alerts resolved (13 Dependabot alerts closed).
+### Deprecated
+- `BaseAgent` re-export — planned removal in 0.3.0
+- `Trial` re-export — planned removal in 0.3.0
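
The 0.2.1 entry above notes that the version is now single-sourced via `importlib.metadata`. A minimal sketch of that general technique follows; it is not taken from the package's `__init__.py`, and the `package_version` helper and its fallback string are assumptions made for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def package_version(dist_name: str, default: str = "0.0.0+unknown") -> str:
    """Read the installed distribution's version from its metadata.

    Falls back to `default` (a hypothetical placeholder) when the
    distribution is not installed, e.g. when running from a bare checkout.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return default


print(package_version("benchflow"))
```

With this approach the `Version:` field in PKG-INFO is the only place the number is written, so a release bump cannot leave a stale string behind in the source tree.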

benchflow-0.2.2/CLAUDE.md ADDED

@@ -0,0 +1,31 @@
+# benchflow
+Multi-turn agent benchmarking with ACP.
+Architecture, CLI, task format: see `docs/architecture.md`, `docs/cli-reference.md`, `docs/task-authoring.md`. Internal refactor notes and SDK reference: `.dev-docs/`.
+## Setup
+Requires Python 3.12+. Uses `uv`.
+```bash
+uv venv -p 3.12 .venv && uv pip install -e ".[dev]"
+.venv/bin/pre-commit install
+```
+## Test
+```bash
+.venv/bin/python -m pytest tests/          # unit (fast, no Docker)
+.venv/bin/python -m pytest -m live tests/  # e2e (Docker + API key)
+.venv/bin/ty check src/                    # type check — also the fastest "find references" after any signature change
+```
+CI gates `ruff format`, `ruff check`, `pytest`, and `ty check src/`. Run all four before pushing. Live tests use Haiku 4.5 (`claude-haiku-4-5-20251001`).
+## Conventions
+- **Minimal fix.** Do only what was asked. "Leave as is" is a valid outcome. Generalize on the third repetition, not the first.
+- **Registry over hardcode.** Adding an agent or provider is a dict entry in `agents/registry.py` or `providers.py` — not a new code path. The `oracle` special case in `sdk.py` exists because it bypasses the agent loop; don't add more without the same justification.
+- **Don't rewrite passing tests.** Updating a test because the code it covers changed shape is fine. Rewriting one to match new behavior without understanding why it was written is not. No tautological tests (dataclass reads, stdlib behavior, "does it construct").
+- **Human review before main.** Commit freely on a feature branch, open a PR. Never push to `main` directly, never force-push it.
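
The "Registry over hardcode" convention can be pictured with a short sketch. Everything here is hypothetical: `AgentSpec`, its fields, `AGENTS`, `resolve_agent`, and the agent names are invented for illustration and do not reflect the actual contents of `agents/registry.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical registry entry: how to install and launch an agent."""
    install_cmd: str
    launch_cmd: str


# Hypothetical registry: one dict entry per agent, no per-agent code path.
AGENTS: dict[str, AgentSpec] = {
    "example-agent": AgentSpec(
        install_cmd="npm install -g example-agent",
        launch_cmd="example-agent",
    ),
}


def resolve_agent(name: str) -> AgentSpec:
    """Look up an agent by name; unknown names fail with the known options."""
    try:
        return AGENTS[name]
    except KeyError:
        raise ValueError(
            f"unknown agent {name!r}; registered: {sorted(AGENTS)}"
        ) from None


# Adding an agent is a data change, not a new branch in the caller.
AGENTS["another-agent"] = AgentSpec(
    install_cmd="npm install -g another-agent",
    launch_cmd="another-agent",
)
print(sorted(AGENTS))
```

The point of the pattern is that callers only ever go through the lookup, so a special case like `oracle` stands out as an explicit exception.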

{benchflow-0.2.0 → benchflow-0.2.2}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: benchflow
-Version: 0.2.0
+Version: 0.2.2
 Summary: Multi-turn agent benchmarking with ACP — run any agent, any model, any provider.
 Project-URL: Homepage, https://github.com/benchflow-ai/benchflow
 Project-URL: Repository, https://github.com/benchflow-ai/benchflow
@@ -26,9 +26,11 @@ Requires-Dist: pyyaml>=6.0
 Requires-Dist: rich>=13.0
 Requires-Dist: typer>=0.9
 Provides-Extra: dev
+Requires-Dist: pre-commit>=3.7; extra == 'dev'
 Requires-Dist: pytest-asyncio>=0.24.0; extra == 'dev'
 Requires-Dist: pytest>=8.0; extra == 'dev'
 Requires-Dist: ruff>=0.7.0; extra == 'dev'
+Requires-Dist: ty>=0.0.1a1; extra == 'dev'
 Description-Content-Type: text/markdown
 <div align="center">
@@ -75,85 +77,25 @@ benchflow view jobs/my-job/my-trial/
 ## SDK
 ```python
-import asyncio
 from benchflow import SDK, Job, JobConfig, collect_metrics
-async def main():
-    sdk = SDK()
-    # Single task — API keys auto-inherited from os.environ
-    result = await sdk.run(
-        task_path="path/to/task",
-        agent="claude-agent-acp",
-        model="claude-haiku-4-5-20251001",
-        environment="daytona",  # or "docker"
-    )
-    print(result.rewards)       # {"reward": 1.0}
-    print(result.n_tool_calls)  # 17
-    # Multi-turn — None = use task's instruction.md
-    result = await sdk.run(
-        task_path="path/to/task",
-        agent="claude-agent-acp",
-        prompts=[
-            None,
-            "Review your solution. Check for errors, test it, and fix any issues.",
-        ],
-        environment="daytona",
-    )
-    # Job — run a full benchmark with concurrency and retries
-    job = Job(
-        tasks_dir="path/to/tasks",
-        jobs_dir="jobs/tb2",
-        config=JobConfig(
-            agent="claude-agent-acp",
-            model="claude-haiku-4-5-20251001",
-            environment="daytona",
-            concurrency=64,
-        ),
-    )
-    result = await job.run()
-    print(f"{result.passed}/{result.total} ({result.score:.1%})")
-    # Metrics — aggregate results from a jobs directory
-    metrics = collect_metrics("jobs/tb2", benchmark="TB2")
-    print(metrics.summary())
-asyncio.run(main())
+result = await SDK().run(task_path="path/to/task", agent="claude-agent-acp")
+print(result.rewards)  # {"reward": 1.0}
 ```
+Single task, multi-turn, full benchmark jobs, and programmatic metrics — see [docs/getting-started.md](docs/getting-started.md).
 ## CLI
 ```bash
-# Run a single task
-benchflow run -t task/ -a claude-agent-acp -m claude-haiku-4-5-20251001 -e daytona
-# Run a benchmark job
-benchflow job -t tasks/ -a claude-agent-acp -c 64 -e daytona --retries 1
-# List agents
-benchflow agents
-# View metrics
-benchflow metrics jobs/tb2/ --json
-benchflow metrics jobs/tb2/
-# Evaluate a skill against tasks
-benchflow eval -t tasks/ --skills-dir skills/ -a claude-agent-acp -e daytona
-# List/install skills
-benchflow skills
-benchflow skills --install owner/repo@skill-name
-# View trajectory
-benchflow view jobs/tb2/my-trial/
-# Create/validate tasks
-benchflow tasks init my-task     # scaffold a new task directory
-benchflow tasks check tasks/my-task/  # validate task structure
+benchflow run -t path/to/task -a claude-agent-acp   # single task
+benchflow job -t tasks/ -a claude-agent-acp -c 1    # benchmark job
+benchflow metrics jobs/                              # aggregate results
+benchflow view jobs/my-job/my-trial/                # trajectory viewer
 ```
+Full flag reference for all 8 subcommands: [docs/cli-reference.md](docs/cli-reference.md).
 ## Agents
 Any [ACP-compatible agent](https://agentclientprotocol.com/get-started/agents) works. Registered agents are auto-installed in sandboxes.
@@ -163,7 +105,7 @@ benchflow agents              # list registered agents
 benchflow run -t task/ -a pi-acp -e daytona
 ```
-See [docs/tested-agents.md](docs/tested-agents.md) for the full list of tested agent × model/provider combinations.
+See [docs/architecture.md](docs/architecture.md#registry-pattern) for the full tested agent × model/provider matrix and how to add your own.
 ## Environments
@@ -174,24 +116,9 @@ See [docs/tested-agents.md](docs/tested-agents.md) for the full list of tested a
 ## How it Works
-```
-benchflow (host)                          Sandbox (Docker/Daytona)
-     |                                         |
-     |  1. Start environment (Harbor)          |
-     |  2. Install ACP agent (npm)             |
-     |  3. stdio pipe (exec/SSH) --------> claude-agent-acp
-     |                                         |
-     |  ACP: initialize                        |
-     |  ACP: session/new(cwd) --------------> agent sees workspace, skills
-     |  ACP: session/set_model(haiku) ------> model configured
-     |  ACP: session/prompt("solve this") --> agent uses Bash, Read, Write
-     |  ACP: session/update <---------------- tool calls, messages, thoughts
-     |  ACP: session/prompt("test it") -----> same session, full context
-     |  ACP: session/update <---------------- more tool calls
-     |                                         |
-     |  4. Run verifier (Harbor) -----------> tests/test.sh → reward.txt
-     |  5. Stop environment                    |
-```
+BenchFlow starts a sandboxed environment, connects to the agent via ACP over a live stdio pipe, sends one or more prompts (the agent retains full context between turns), then runs the verifier and captures the full trajectory.
+See [docs/architecture.md](docs/architecture.md) for SDK run phases, ACP protocol details, and the registry pattern.
 ## Task Format
@@ -208,6 +135,8 @@ my-task/
 └── solution/              # optional reference solution
 ```
+Full `task.toml` schema, verifier contract, and a worked example: [docs/task-authoring.md](docs/task-authoring.md).
 ## Results
 Every run produces structured output:
@@ -227,7 +156,25 @@ jobs/{job_name}/{trial_name}/
     └── reward.txt           # reward value
 ```
-## Benchmark Results
+## Benchmarks
+Tasks are auto-downloaded on first run (cloned into `.ref/`).
+**SkillsBench** (86 tasks — tool use, file editing, API calls):
+```bash
+python benchmarks/run_skillsbench.py benchmarks/skillsbench-claude-glm5.yaml   # Claude
+python benchmarks/run_skillsbench.py benchmarks/skillsbench-codex-gpt54.yaml   # Codex
+```
+**Terminal-Bench 2** (89 tasks — shell, git, compilers, daemons):
+```bash
+python benchmarks/run_tb2.py benchmarks/tb2_single-codex-gpt54.yaml      # single-turn
+python benchmarks/run_tb2.py benchmarks/tb2_multiturn-codex-gpt54.yaml   # multi-turn
+```
+Shipped configs use `environment: daytona` and `concurrency: 8`. For local Docker: `--env docker --concurrency 1`.
 | Benchmark | Agent | Model | Score |
 |-----------|-------|-------|-------|
@@ -247,15 +194,7 @@ Validation tasks in `.claude/skills/benchflow/tasks/` confirm agents can use the
 ## Architecture
-BenchFlow provides:
-- **ACP client** — multi-turn agent communication via live stdio pipe
-- **Job orchestration** — concurrency, retries, resume, metrics
-- **Multi-agent registry** — auto-install agents in sandboxes
-- **Trajectory capture** — from ACP protocol
-- **Skills** — teach agents to use BenchFlow itself
-- **Viewer** — HTML trajectory visualization
-- **CLI** — `run`, `job`, `agents`, `metrics`, `view`, `eval`, `skills`, `tasks`, `cleanup`
+ACP client, job orchestration, multi-agent registry, trajectory capture, skills, viewer, and CLI — see [docs/architecture.md](docs/architecture.md).
 ## Citation

{benchflow-0.2.0 → benchflow-0.2.2}/README.md RENAMED

@@ -42,85 +42,25 @@ benchflow view jobs/my-job/my-trial/
 ## SDK
 ```python
-import asyncio
 from benchflow import SDK, Job, JobConfig, collect_metrics
-async def main():
-    sdk = SDK()
-    # Single task — API keys auto-inherited from os.environ
-    result = await sdk.run(
-        task_path="path/to/task",
-        agent="claude-agent-acp",
-        model="claude-haiku-4-5-20251001",
-        environment="daytona",  # or "docker"
-    )
-    print(result.rewards)       # {"reward": 1.0}
-    print(result.n_tool_calls)  # 17
-    # Multi-turn — None = use task's instruction.md
-    result = await sdk.run(
-        task_path="path/to/task",
-        agent="claude-agent-acp",
-        prompts=[
-            None,
-            "Review your solution. Check for errors, test it, and fix any issues.",
-        ],
-        environment="daytona",
-    )
-    # Job — run a full benchmark with concurrency and retries
-    job = Job(
-        tasks_dir="path/to/tasks",
-        jobs_dir="jobs/tb2",
-        config=JobConfig(
-            agent="claude-agent-acp",
-            model="claude-haiku-4-5-20251001",
-            environment="daytona",
-            concurrency=64,
-        ),
-    )
-    result = await job.run()
-    print(f"{result.passed}/{result.total} ({result.score:.1%})")
-    # Metrics — aggregate results from a jobs directory
-    metrics = collect_metrics("jobs/tb2", benchmark="TB2")
-    print(metrics.summary())
-asyncio.run(main())
+result = await SDK().run(task_path="path/to/task", agent="claude-agent-acp")
+print(result.rewards)  # {"reward": 1.0}
 ```
+Single task, multi-turn, full benchmark jobs, and programmatic metrics — see [docs/getting-started.md](docs/getting-started.md).
 ## CLI
 ```bash
-# Run a single task
-benchflow run -t task/ -a claude-agent-acp -m claude-haiku-4-5-20251001 -e daytona
-# Run a benchmark job
-benchflow job -t tasks/ -a claude-agent-acp -c 64 -e daytona --retries 1
-# List agents
-benchflow agents
-# View metrics
-benchflow metrics jobs/tb2/ --json
-benchflow metrics jobs/tb2/
-# Evaluate a skill against tasks
-benchflow eval -t tasks/ --skills-dir skills/ -a claude-agent-acp -e daytona
-# List/install skills
-benchflow skills
-benchflow skills --install owner/repo@skill-name
-# View trajectory
-benchflow view jobs/tb2/my-trial/
-# Create/validate tasks
-benchflow tasks init my-task     # scaffold a new task directory
-benchflow tasks check tasks/my-task/  # validate task structure
+benchflow run -t path/to/task -a claude-agent-acp   # single task
+benchflow job -t tasks/ -a claude-agent-acp -c 1    # benchmark job
+benchflow metrics jobs/                              # aggregate results
+benchflow view jobs/my-job/my-trial/                # trajectory viewer
 ```
+Full flag reference for all 8 subcommands: [docs/cli-reference.md](docs/cli-reference.md).
 ## Agents
 Any [ACP-compatible agent](https://agentclientprotocol.com/get-started/agents) works. Registered agents are auto-installed in sandboxes.
@@ -130,7 +70,7 @@ benchflow agents              # list registered agents
 benchflow run -t task/ -a pi-acp -e daytona
 ```
-See [docs/tested-agents.md](docs/tested-agents.md) for the full list of tested agent × model/provider combinations.
+See [docs/architecture.md](docs/architecture.md#registry-pattern) for the full tested agent × model/provider matrix and how to add your own.
 ## Environments
@@ -141,24 +81,9 @@ See [docs/tested-agents.md](docs/tested-agents.md) for the full list of tested a
 ## How it Works
-```
-benchflow (host)                          Sandbox (Docker/Daytona)
-     |                                         |
-     |  1. Start environment (Harbor)          |
-     |  2. Install ACP agent (npm)             |
-     |  3. stdio pipe (exec/SSH) --------> claude-agent-acp
-     |                                         |
-     |  ACP: initialize                        |
-     |  ACP: session/new(cwd) --------------> agent sees workspace, skills
-     |  ACP: session/set_model(haiku) ------> model configured
-     |  ACP: session/prompt("solve this") --> agent uses Bash, Read, Write
-     |  ACP: session/update <---------------- tool calls, messages, thoughts
-     |  ACP: session/prompt("test it") -----> same session, full context
-     |  ACP: session/update <---------------- more tool calls
-     |                                         |
-     |  4. Run verifier (Harbor) -----------> tests/test.sh → reward.txt
-     |  5. Stop environment                    |
-```
+BenchFlow starts a sandboxed environment, connects to the agent via ACP over a live stdio pipe, sends one or more prompts (the agent retains full context between turns), then runs the verifier and captures the full trajectory.
+See [docs/architecture.md](docs/architecture.md) for SDK run phases, ACP protocol details, and the registry pattern.
 ## Task Format
@@ -175,6 +100,8 @@ my-task/
 └── solution/              # optional reference solution
 ```
+Full `task.toml` schema, verifier contract, and a worked example: [docs/task-authoring.md](docs/task-authoring.md).
 ## Results
 Every run produces structured output:
@@ -194,7 +121,25 @@ jobs/{job_name}/{trial_name}/
     └── reward.txt           # reward value
 ```
-## Benchmark Results
+## Benchmarks
+Tasks are auto-downloaded on first run (cloned into `.ref/`).
+**SkillsBench** (86 tasks — tool use, file editing, API calls):
+```bash
+python benchmarks/run_skillsbench.py benchmarks/skillsbench-claude-glm5.yaml   # Claude
+python benchmarks/run_skillsbench.py benchmarks/skillsbench-codex-gpt54.yaml   # Codex
+```
+**Terminal-Bench 2** (89 tasks — shell, git, compilers, daemons):
+```bash
+python benchmarks/run_tb2.py benchmarks/tb2_single-codex-gpt54.yaml      # single-turn
+python benchmarks/run_tb2.py benchmarks/tb2_multiturn-codex-gpt54.yaml   # multi-turn
+```
+Shipped configs use `environment: daytona` and `concurrency: 8`. For local Docker: `--env docker --concurrency 1`.
 | Benchmark | Agent | Model | Score |
 |-----------|-------|-------|-------|
@@ -214,15 +159,7 @@ Validation tasks in `.claude/skills/benchflow/tasks/` confirm agents can use the
 ## Architecture
-BenchFlow provides:
-- **ACP client** — multi-turn agent communication via live stdio pipe
-- **Job orchestration** — concurrency, retries, resume, metrics
-- **Multi-agent registry** — auto-install agents in sandboxes
-- **Trajectory capture** — from ACP protocol
-- **Skills** — teach agents to use BenchFlow itself
-- **Viewer** — HTML trajectory visualization
-- **CLI** — `run`, `job`, `agents`, `metrics`, `view`, `eval`, `skills`, `tasks`, `cleanup`
+ACP client, job orchestration, multi-agent registry, trajectory capture, skills, viewer, and CLI — see [docs/architecture.md](docs/architecture.md).
 ## Citation
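
Note that the shortened SDK snippet in the 0.2.2 README uses a bare `await`, whereas the 0.2.0 version wrapped its calls in `asyncio.run(main())`. In a plain script the call still needs an event loop. The sketch below shows one way to drive it; `run_single_task` and `StubSDK` are hypothetical names, and the stub stands in for `benchflow.SDK` only so the example runs without the package installed.

```python
import asyncio


async def run_single_task(sdk, task_path: str, agent: str):
    """Await sdk.run(...) with the two keyword arguments shown in the README."""
    return await sdk.run(task_path=task_path, agent=agent)


class StubSDK:
    """Hypothetical stand-in for benchflow.SDK; echoes its arguments."""

    async def run(self, task_path: str, agent: str):
        return {"task_path": task_path, "agent": agent}


# asyncio.run creates the event loop that a top-level `await` would lack.
result = asyncio.run(
    run_single_task(StubSDK(), "path/to/task", "claude-agent-acp")
)
print(result)
```

With benchflow installed, passing `SDK()` in place of `StubSDK()` would presumably yield the result object whose `rewards` attribute the README prints.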

benchflow-0.2.2/benchmarks/skillsbench-claude-glm5.yaml ADDED

@@ -0,0 +1,10 @@
+tasks_dir: ../.ref/skillsbench/tasks
+jobs_dir: ../jobs/skillsbench-claude-glm51
+agent: claude-agent-acp
+model: zai/glm-5.1
+environment: daytona
+concurrency: 8
+max_retries: 2
+exclude:
+  - scheduling-email-assistant
+  - mhc-layer-impl
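
To make the `exclude` key concrete, here is a sketch of how a runner might apply it when choosing tasks. The dictionary mirrors the YAML above; the `select_tasks` helper, the task name `example-task`, and the assumption that `exclude` matches task directory names exactly are all hypothetical, since the runner script itself is not shown in this diff.

```python
# Mirror of the YAML config above, written as a dict to avoid a YAML dependency.
CONFIG = {
    "tasks_dir": "../.ref/skillsbench/tasks",
    "jobs_dir": "../jobs/skillsbench-claude-glm51",
    "agent": "claude-agent-acp",
    "model": "zai/glm-5.1",
    "environment": "daytona",
    "concurrency": 8,
    "max_retries": 2,
    "exclude": ["scheduling-email-assistant", "mhc-layer-impl"],
}


def select_tasks(available: list[str], config: dict) -> list[str]:
    """Drop excluded task names (assumed exact match) and sort the rest."""
    excluded = set(config.get("exclude", []))
    return sorted(name for name in available if name not in excluded)


# "example-task" is a made-up name; the other two come from the config.
print(select_tasks(
    ["mhc-layer-impl", "example-task", "scheduling-email-assistant"],
    CONFIG,
))
```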
