PyPI - benchmaker - Versions diffs - 0.1.2__tar.gz → 0.1.3__tar.gz - Mend

benchmaker 0.1.2.tar.gz → 0.1.3.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (89)

{benchmaker-0.1.2 → benchmaker-0.1.3}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: benchmaker
-Version: 0.1.2
+Version: 0.1.3
 Summary: Async HTTP benchmarking utility with pluggable workloads and load models.
 Author: Xiaozhe Yao
 License: MIT
@@ -18,6 +18,8 @@ Requires-Dist: rich>=13; extra == "rich"
 Provides-Extra: hf
 Requires-Dist: datasets>=2.18; extra == "hf"
 Requires-Dist: transformers>=4.40; extra == "hf"
+Provides-Extra: tokenizer
+Requires-Dist: transformers>=4.40; extra == "tokenizer"
 Provides-Extra: dev
 Requires-Dist: pytest>=7; extra == "dev"
 Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
@@ -73,8 +75,8 @@ asyncio.run(main())
 ```
 Or via the CLI. Workload-specific benchmarks are exposed as **recipes** —
-`benchmaker <recipe> --args` (`http`, `llm`, `sandbox`, `swebench`, `sglang`,
-`trajectory-replay`):
+`benchmaker <recipe> --args` (`http`, `llm`, `sandbox`, `swebench`,
+`swebench-replay`, `sglang`, `trajectory-replay`):
 ```bash
 benchmaker http --url https://httpbin.org/get --rate poisson:50 --duration 10s
@@ -185,16 +187,15 @@ Full docs live in [`docs/`](docs/):
 - [Correctness / accuracy eval](docs/eval.md) — grade responses against references
 - [CLI & YAML reference](docs/cli-and-yaml.md)
 - [ShareGPT benchmark](docs/sharegpt-benchmark.md) — self-contained end-to-end walkthrough
-- `benchmaker sglang` — native SGLang `/generate` benchmark (see [`docs/sglang.md`](docs/sglang.md)).
-- `benchmaker trajectory-replay` — multi-turn prefix-cache parity replay of
-  trajectory datasets like SWE-smith (see [`docs/trajectory-replay.md`](docs/trajectory-replay.md)).
+- [SGLang benchmark](docs/sglang.md) — native SGLang `/generate` benchmark
+- [Trajectory replay](docs/trajectory-replay.md) — multi-turn prefix-cache parity replay
 ## Deterministic replay (`swebench-replay`)
 Re-run a recorded SWE-bench job with the LLM **mocked from its own logs** — the
 real pi + sandbox + verifier pipeline still runs, only the model is served back
 from recorded outputs, so re-runs are deterministic and free of model
-cost/variance. Vary `--concurrency` (or `--sweep`) to study the rest of the
+cost/variance. Vary `--concurrency` (or `--concurrency-sweep`) to study the rest of the
 pipeline without the model's stochasticity as a confound. Still needs
 `FLASH_SANDBOX_URL` (the sandbox + verifier are real).
@@ -207,7 +208,7 @@ python -m benchmaker.swebench.trajectory jobs/2026-06-08__05-24-01_b352cb \
 # 2) replay (host mode, localhost) across a concurrency sweep
 FLASH_SANDBOX_URL=http://localhost:8080 \
   benchmaker swebench-replay --trajectories replay-trajectories.jsonl \
-    --mode pi-host --sweep 1,5,25
+    --mode pi-host --concurrency-sweep 1,5,25
 # container mode: bind 0.0.0.0 and tell the sandbox how to reach the server
 FLASH_SANDBOX_URL=http://localhost:8080 \
@@ -221,6 +222,18 @@ run lacked an instance id) plus the count of assistant messages already in the
 request — so it is correct at any concurrency. A `MISSES` column in the summary
 flags any divergence (a request beyond the recorded turns).
+The standalone replay server can also **mock realistic streaming** for
+latency-sensitive benchmarks. Pass a real tokenizer and a per-token delay; the
+first token is emitted immediately (prefill free, TTFT≈0) and each subsequent
+token is spaced by `--inter-token-time` ms. Output stays byte-exact and the
+reported `usage` is the recorded value.
+```bash
+pip install 'benchmaker[tokenizer]'   # adds transformers for the tokenizer
+python -m benchmaker.swebench.replay_server replay-trajectories.jsonl \
+    --tokenizer zai-org/GLM-4.7-Flash --inter-token-time 50
+```
 ## Examples
 Under [`examples/`](examples/):
@@ -228,9 +241,12 @@ Under [`examples/`](examples/):
 - `simple_get.py`         — minimal library usage
 - `custom_hooks.py`       — request signing + response parsing
 - `llm_chat.py`           — OpenAI-compatible LLM endpoint with streaming
+- `llm_from_env.py`       — LLM benchmark using `from_env()`
 - `vllm_with_monitor.py`  — LLM benchmark with concurrent vLLM `/metrics` scrape
+- `agent_trove.py`        — user-defined agent benchmark
 - `sandbox_exec.py`       — Flash Sandbox `/exec` latency benchmark
 - `sandbox_lifecycle.py`  — full create → exec → delete cold-start benchmark
+- `bench_sandbox.py` / `bench_sandbox.sh` — sandbox benchmarks
 - `llm_eval.py`           — LLM benchmark + accuracy grading (exact/regex/judge)
 - `gsm8k_eval.py`         — GSM8K from HuggingFace + integer-match scorer
 - `config.yaml`           — generic HTTP YAML config
@@ -254,9 +270,22 @@ benchmaker/          # library code
   config.py  env.py  #   YAML config loading + .env interpolation
   core/              #   engine: types, load models, runner, metrics, monitors, trace
   io/                #   run output: per-run bundle + cross-run collection
-  workloads/         #   workload-types (http, llm, sandbox, agent, hf, eval)
-  recipes/           #   CLI recipes (http, llm, sandbox, swebench, swebench-replay) + registry
-  swebench/          #   SWE-bench coding agent + grading + harbor adapters
+  workloads/
+    http.py          #   HTTP workload-type
+    llm.py           #   OpenAI-compatible chat workload-type
+    sandbox.py       #   Flash Sandbox workload-type
+    sglang.py        #   SGLang native /generate workload-type
+    agent.py         #   user-defined Agent workload-type
+    trajectory.py    #   multi-turn trajectory replay workload
+    eval.py          #   correctness/accuracy evaluation
+    hf.py            #   HuggingFace dataset source
+    datasets.py      #   generic workload/dataset base classes
+    base.py          #   WorkloadType base class
+  recipes/           #   CLI recipes (http, llm, sandbox, swebench, swebench-replay, sglang, trajectory-replay) + registry
+  swebench/
+    trajectory.py    #   convert pi logs to replay trajectories
+    replay_server.py #   mock-LLM replay server for swebench-replay
+    agent.py         #   SWE-bench coding agent + grading + harbor adapters
 examples/            # runnable examples (incl. swebench/ coding-agent config)
 tools/               # out-of-tree tooling: sharegpt/, swe_images/, agent_warmup/
 tests/               # pytest smoke tests

{benchmaker-0.1.2 → benchmaker-0.1.3}/README.md RENAMED

@@ -45,8 +45,8 @@ asyncio.run(main())
 ```
 Or via the CLI. Workload-specific benchmarks are exposed as **recipes** —
-`benchmaker <recipe> --args` (`http`, `llm`, `sandbox`, `swebench`, `sglang`,
-`trajectory-replay`):
+`benchmaker <recipe> --args` (`http`, `llm`, `sandbox`, `swebench`,
+`swebench-replay`, `sglang`, `trajectory-replay`):
 ```bash
 benchmaker http --url https://httpbin.org/get --rate poisson:50 --duration 10s
@@ -157,16 +157,15 @@ Full docs live in [`docs/`](docs/):
 - [Correctness / accuracy eval](docs/eval.md) — grade responses against references
 - [CLI & YAML reference](docs/cli-and-yaml.md)
 - [ShareGPT benchmark](docs/sharegpt-benchmark.md) — self-contained end-to-end walkthrough
-- `benchmaker sglang` — native SGLang `/generate` benchmark (see [`docs/sglang.md`](docs/sglang.md)).
-- `benchmaker trajectory-replay` — multi-turn prefix-cache parity replay of
-  trajectory datasets like SWE-smith (see [`docs/trajectory-replay.md`](docs/trajectory-replay.md)).
+- [SGLang benchmark](docs/sglang.md) — native SGLang `/generate` benchmark
+- [Trajectory replay](docs/trajectory-replay.md) — multi-turn prefix-cache parity replay
 ## Deterministic replay (`swebench-replay`)
 Re-run a recorded SWE-bench job with the LLM **mocked from its own logs** — the
 real pi + sandbox + verifier pipeline still runs, only the model is served back
 from recorded outputs, so re-runs are deterministic and free of model
-cost/variance. Vary `--concurrency` (or `--sweep`) to study the rest of the
+cost/variance. Vary `--concurrency` (or `--concurrency-sweep`) to study the rest of the
 pipeline without the model's stochasticity as a confound. Still needs
 `FLASH_SANDBOX_URL` (the sandbox + verifier are real).
@@ -179,7 +178,7 @@ python -m benchmaker.swebench.trajectory jobs/2026-06-08__05-24-01_b352cb \
 # 2) replay (host mode, localhost) across a concurrency sweep
 FLASH_SANDBOX_URL=http://localhost:8080 \
   benchmaker swebench-replay --trajectories replay-trajectories.jsonl \
-    --mode pi-host --sweep 1,5,25
+    --mode pi-host --concurrency-sweep 1,5,25
 # container mode: bind 0.0.0.0 and tell the sandbox how to reach the server
 FLASH_SANDBOX_URL=http://localhost:8080 \
@@ -193,6 +192,18 @@ run lacked an instance id) plus the count of assistant messages already in the
 request — so it is correct at any concurrency. A `MISSES` column in the summary
 flags any divergence (a request beyond the recorded turns).
+The standalone replay server can also **mock realistic streaming** for
+latency-sensitive benchmarks. Pass a real tokenizer and a per-token delay; the
+first token is emitted immediately (prefill free, TTFT≈0) and each subsequent
+token is spaced by `--inter-token-time` ms. Output stays byte-exact and the
+reported `usage` is the recorded value.
+```bash
+pip install 'benchmaker[tokenizer]'   # adds transformers for the tokenizer
+python -m benchmaker.swebench.replay_server replay-trajectories.jsonl \
+    --tokenizer zai-org/GLM-4.7-Flash --inter-token-time 50
+```
 ## Examples
 Under [`examples/`](examples/):
@@ -200,9 +211,12 @@ Under [`examples/`](examples/):
 - `simple_get.py`         — minimal library usage
 - `custom_hooks.py`       — request signing + response parsing
 - `llm_chat.py`           — OpenAI-compatible LLM endpoint with streaming
+- `llm_from_env.py`       — LLM benchmark using `from_env()`
 - `vllm_with_monitor.py`  — LLM benchmark with concurrent vLLM `/metrics` scrape
+- `agent_trove.py`        — user-defined agent benchmark
 - `sandbox_exec.py`       — Flash Sandbox `/exec` latency benchmark
 - `sandbox_lifecycle.py`  — full create → exec → delete cold-start benchmark
+- `bench_sandbox.py` / `bench_sandbox.sh` — sandbox benchmarks
 - `llm_eval.py`           — LLM benchmark + accuracy grading (exact/regex/judge)
 - `gsm8k_eval.py`         — GSM8K from HuggingFace + integer-match scorer
 - `config.yaml`           — generic HTTP YAML config
@@ -226,9 +240,22 @@ benchmaker/          # library code
   config.py  env.py  #   YAML config loading + .env interpolation
   core/              #   engine: types, load models, runner, metrics, monitors, trace
   io/                #   run output: per-run bundle + cross-run collection
-  workloads/         #   workload-types (http, llm, sandbox, agent, hf, eval)
-  recipes/           #   CLI recipes (http, llm, sandbox, swebench, swebench-replay) + registry
-  swebench/          #   SWE-bench coding agent + grading + harbor adapters
+  workloads/
+    http.py          #   HTTP workload-type
+    llm.py           #   OpenAI-compatible chat workload-type
+    sandbox.py       #   Flash Sandbox workload-type
+    sglang.py        #   SGLang native /generate workload-type
+    agent.py         #   user-defined Agent workload-type
+    trajectory.py    #   multi-turn trajectory replay workload
+    eval.py          #   correctness/accuracy evaluation
+    hf.py            #   HuggingFace dataset source
+    datasets.py      #   generic workload/dataset base classes
+    base.py          #   WorkloadType base class
+  recipes/           #   CLI recipes (http, llm, sandbox, swebench, swebench-replay, sglang, trajectory-replay) + registry
+  swebench/
+    trajectory.py    #   convert pi logs to replay trajectories
+    replay_server.py #   mock-LLM replay server for swebench-replay
+    agent.py         #   SWE-bench coding agent + grading + harbor adapters
 examples/            # runnable examples (incl. swebench/ coding-agent config)
 tools/               # out-of-tree tooling: sharegpt/, swe_images/, agent_warmup/
 tests/               # pytest smoke tests
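
The streaming mock described in the README hunk above (first token emitted immediately, each later token spaced by `--inter-token-time` ms, output byte-exact) can be pictured with a minimal sketch. This is not the `replay_server` implementation: `stream_tokens` is a hypothetical helper, and a whitespace-preserving regex split stands in for the real tokenizer that `--tokenizer` would load.

```python
import asyncio
import re


async def stream_tokens(text: str, inter_token_time_ms: float):
    """Yield `text` piece by piece with a fixed delay between pieces.

    Assumption: a regex split replaces the real tokenizer. The split keeps
    every character, so joining the pieces reproduces `text` exactly.
    """
    pieces = re.findall(r"\s+|\S+", text)
    for i, piece in enumerate(pieces):
        if i:
            # Every piece after the first waits one inter-token interval;
            # the first piece is emitted with no delay (TTFT close to 0).
            await asyncio.sleep(inter_token_time_ms / 1000)
        yield piece


async def collect(text: str, inter_token_time_ms: float) -> list:
    return [piece async for piece in stream_tokens(text, inter_token_time_ms)]


recorded = "def fix():\n    return 42\n"
chunks = asyncio.run(collect(recorded, inter_token_time_ms=1))
```

Joining `chunks` gives back `recorded` unchanged, which is the byte-exact property the README calls out; only the pacing of the chunks is synthetic.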

{benchmaker-0.1.2 → benchmaker-0.1.3}/benchmaker/__init__.py RENAMED

@@ -153,4 +153,4 @@ __all__ = [
     "write_bundle",
 ]
-__version__ = "0.1.1"
+__version__ = "0.1.3"

{benchmaker-0.1.2 → benchmaker-0.1.3}/benchmaker/config.py RENAMED

@@ -407,6 +407,13 @@ def build_config(cfg: dict, dotenv_path: Optional[str] = ".env",
         workload_type, extra_post = apply_correctness(workload_type, correctness_spec)
         post_hooks = list(post_hooks) + list(extra_post)
+    # A workload that schedules on per-request completion (e.g. interleaved
+    # trajectory replay) declares the post-hook it needs; install it so a YAML
+    # config can't silently stall waiting for a signal it never wired up.
+    wl_hook = workload.completion_hook()
+    if wl_hook is not None and wl_hook not in post_hooks:
+        post_hooks = list(post_hooks) + [wl_hook]
     recorder = _build_recorder(cfg.get("record"))
     return BenchConfig(
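
The hunk above installs a workload-declared completion hook without duplicating one the caller already supplied. A minimal sketch of that pattern follows. Only the `completion_hook()` method name comes from the diff; both workload classes, the hook signature (a single `dict` argument), and `install_completion_hook` are hypothetical.

```python
from typing import Callable, Optional

Hook = Callable[[dict], None]  # assumed hook signature, not taken from the diff


class PlainWorkload:
    """Hypothetical workload that needs no completion signal."""

    def completion_hook(self) -> Optional[Hook]:
        return None


class InterleavedWorkload:
    """Hypothetical workload that schedules on per-request completion."""

    def __init__(self) -> None:
        self.completed: list[str] = []

    def _on_complete(self, result: dict) -> None:
        self.completed.append(result["id"])

    def completion_hook(self) -> Optional[Hook]:
        return self._on_complete


def install_completion_hook(workload, post_hooks: list) -> list:
    """Append the workload's hook unless it is absent or already present."""
    hook = workload.completion_hook()
    if hook is not None and hook not in post_hooks:
        return list(post_hooks) + [hook]
    return list(post_hooks)


wl = InterleavedWorkload()
hooks = install_completion_hook(wl, [])
hooks_again = install_completion_hook(wl, hooks)  # second call adds nothing
```

The membership test works for bound methods because two bound methods of the same object and function compare equal, so calling the helper twice leaves a single hook in the list.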

{benchmaker-0.1.2 → benchmaker-0.1.3}/benchmaker/recipes/swebench_replay.py RENAMED

@@ -5,7 +5,7 @@ Builds a replay store from recorded pi logs (or loads a prebuilt
 `replay-trajectories.jsonl`), starts the stateless replay server in-process, and
 runs the *real* harbor SWE-bench pipeline (pi + sandbox + verifier) with the
 model endpoint pointed at the replay server — at one ``--concurrency`` or a
-``--sweep`` of them. The LLM is the only thing mocked; everything else runs for
+``--concurrency-sweep`` of them. The LLM is the only thing mocked; everything else runs for
 real, so re-runs are deterministic and free of model cost/variance.
 Still requires ``FLASH_SANDBOX_URL`` (the sandbox + verifier are real). For
@@ -54,28 +54,30 @@ def _parse_concurrencies(sweep: Optional[str], concurrency: int) -> list[int]:
     return [int(x.strip()) for x in sweep.split(",") if x.strip()]
-def _resolve_task_filter(task, store) -> tuple[list[str], int]:
+def _resolve_task_filter(task, exclude_task, store) -> tuple[list[str], int]:
     """Which dataset tasks to run, and how many trajectories can't be targeted.
     Default to exactly the recorded tasks (each trajectory's instance_id) so
     harbor replays only what we have trajectories for — otherwise it would run
     the whole ``--dataset`` and every task without a recording becomes a replay
     miss. An explicit ``--task`` wins (the user is narrowing on purpose).
+    ``--exclude-task`` drops the named id(s) from the resolved set.
     Returns ``(task_ids, n_missing_instance_id)``."""
-    explicit = list(task)
+    excluded = set(exclude_task)
+    explicit = [t for t in task if t not in excluded]
     if explicit:
         return explicit, 0
-    ids = sorted({t.instance_id for t in store.values() if t.instance_id})
+    ids = sorted({t.instance_id for t in store.values()
+                  if t.instance_id and t.instance_id not in excluded})
     missing = sum(1 for t in store.values() if not t.instance_id)
     return ids, missing
 class SWEBenchReplayRecipe(Recipe):
     name = "swebench-replay"
     help = (
         "Replay recorded SWE-bench trajectories deterministically: mock the LLM "
         "with recorded outputs, run the real pi+sandbox+verifier pipeline at one "
-        "--concurrency or a --sweep. Requires FLASH_SANDBOX_URL."
+        "--concurrency or a --concurrency-sweep. Requires FLASH_SANDBOX_URL."
     )
     wants_load_options = False
@@ -88,12 +90,20 @@ class SWEBenchReplayRecipe(Recipe):
                          help="Prebuilt replay-trajectories.jsonl (instead of --job)."),
             click.option("--concurrency", type=int, default=4, show_default=True,
                          help="Concurrent trials (harbor n_concurrent_trials)."),
-            click.option("--sweep", default=None,
+            click.option("--concurrency-sweep", "concurrency_sweep", default=None,
                          help="Comma list of concurrencies to run in sequence, "
                               "e.g. '1,5,25' (overrides --concurrency)."),
             click.option("--mode", type=click.Choice(["pi-host", "pi-container"]),
                          default="pi-host", show_default=True,
                          help="pi run mode (the harbor agent key)."),
+            click.option("--route-tools", "route_tools",
+                         type=click.Choice(["all", "bash"]),
+                         default="all", show_default=True,
+                         help="pi-host: which tools to route into the sandbox. "
+                              "'all' routes bash+read+write+edit (matches how "
+                              "trajectories are recorded); 'bash' routes only bash "
+                              "(file edits hit the host fs and are lost on replay). "
+                              "Ignored for pi-container (pi runs in the sandbox)."),
             click.option("--host", default="127.0.0.1", show_default=True,
                          help="Replay server bind host (use 0.0.0.0 for container mode)."),
             click.option("--port", type=int, default=9100, show_default=True,
@@ -107,11 +117,21 @@ class SWEBenchReplayRecipe(Recipe):
                               "trajectory's model."),
             click.option("--dataset", default="swebench-verified", show_default=True,
                          help="Harbor dataset slug."),
+            click.option("--exec-timeout-sec", "exec_timeout_sec", type=float,
+                         default=None,
+                         help="pi-host: real per-command timeout (seconds) passed "
+                              "to environment.exec for every routed tool call "
+                              "(default 600). Lower it to surface real sandbox "
+                              "slowness/hangs under load. Ignored for pi-container "
+                              "(pi runs as one process with no per-command timeout)."),
             click.option("--n-tasks", "n_tasks", type=int, default=None,
                          help="Cap the number of recorded tasks to replay "
                               "(applied on top of the recorded-task filter)."),
             click.option("--task", multiple=True,
                          help="Restrict to specific task name(s)/glob(s). Repeatable."),
+            click.option("--exclude-task", "exclude_task", multiple=True,
+                         help="Drop specific task id(s) from the replay set. "
+                              "Repeatable."),
             click.option("--n-attempts", "n_attempts", type=int, default=1,
                          show_default=True, help="Attempts per task."),
             click.option("--timeout-multiplier", "timeout_multiplier", type=float,
@@ -129,15 +149,22 @@ class SWEBenchReplayRecipe(Recipe):
             click.option("--timeline/--no-timeline", "timeline", default=True,
                          show_default=True,
                          help="Capture timeline/utilization/tokens into the job dir."),
+            click.option("--validate-observations/--no-validate-observations",
+                         "validate_observations", default=False, show_default=True,
+                         help="Fail-fast on environment divergence: compare each "
+                              "step's tool-result status against the recording and "
+                              "stop the agent at the first mismatch. Requires a "
+                              "trajectory store recorded with tool_results."),
             click.option("--utilization-interval-sec", "utilization_interval_sec",
                          type=float, default=5.0, show_default=True),
         ]
-    def run(self, shared: SharedOpts, *, job, trajectories, concurrency, sweep, mode,
-            host, port, reachable_host, model, dataset, n_tasks, task, n_attempts,
+    def run(self, shared: SharedOpts, *, job, trajectories, concurrency,
+            concurrency_sweep, mode, route_tools, host, port, reachable_host, model,
+            dataset, exec_timeout_sec, n_tasks, task, exclude_task, n_attempts,
             timeout_multiplier, backend_type, request_timeout_sec,
             agent_ready_timeout_sec, jobs_dir, timeline,
-            utilization_interval_sec) -> Optional[int]:
+            utilization_interval_sec, validate_observations) -> Optional[int]:
         from benchmaker.swebench import harbor_eval as he
         from benchmaker.swebench import trajectory as T
@@ -180,7 +207,7 @@ class SWEBenchReplayRecipe(Recipe):
             raise click.UsageError("--model required (no model recorded in trajectories).")
         # Run exactly the recorded tasks, not the whole dataset (see helper).
-        task_filter, n_missing = _resolve_task_filter(task, store)
+        task_filter, n_missing = _resolve_task_filter(task, exclude_task, store)
         if n_missing:
             click.echo(f"warning: {n_missing} trajectories have no instance_id "
                        f"and cannot be targeted; they will be skipped.")
@@ -190,18 +217,34 @@ class SWEBenchReplayRecipe(Recipe):
                 "cannot select which tasks to replay.")
         replay_url = _replay_url(host, port, reachable_host)
-        concurrencies = _parse_concurrencies(sweep, concurrency)
+        concurrencies = _parse_concurrencies(concurrency_sweep, concurrency)
         click.echo(f"replay: {len(store)} trajectories, {len(task_filter)} tasks, "
                    f"model={run_model}, agent={mode}, url={replay_url}, "
                    f"concurrencies={concurrencies}")
+        # pi-host edits the sandbox over a bridge; the file tools (read/write/edit)
+        # only land in the sandbox when routed (route_tools=all), which is how the
+        # trajectories were recorded. With the agent default (bash-only) those
+        # recorded edits replay against the host fs and silently no-op. pi-container
+        # runs pi inside the sandbox, so the kwarg does not apply.
+        agent_kwargs = [f"route_tools={route_tools}"] if mode == "pi-host" else []
+        # Real per-command sandbox timeout. Only pi-host routes each tool call
+        # through environment.exec(timeout_sec=...); pi-container runs as one
+        # process with no per-command budget, so the flag is a no-op there.
+        if exec_timeout_sec is not None:
+            if mode == "pi-host":
+                agent_kwargs.append(f"exec_timeout_s={exec_timeout_sec}")
+            else:
+                click.echo("warning: --exec-timeout-sec is ignored for "
+                           "pi-container (no per-command timeout).")
         # Static harbor config shared by every sweep iteration; only `concurrency`
         # and `job_name` vary per run (set inside `_run_one`).
         base_ns = argparse.Namespace(
             dataset=dataset, agent=mode, model=run_model,
             api_key="replay",
-            agent_kwarg=[], agent_config_file=None,
-            n_tasks=n_tasks, task=task_filter,
+            agent_kwarg=agent_kwargs, agent_config_file=None,
+            n_tasks=n_tasks, task=task_filter, exclude_task=None,
             n_attempts=n_attempts, timeout_multiplier=timeout_multiplier,
             force_build=False, backend_type=backend_type,
             request_timeout_sec=request_timeout_sec,
@@ -214,20 +257,21 @@ class SWEBenchReplayRecipe(Recipe):
             for c in concurrencies:
                 results.append(asyncio.run(self._run_one(
                     store, base_ns, c, run_model, host, port, reachable_host,
-                    timeline, utilization_interval_sec)))
+                    timeline, utilization_interval_sec, validate_observations)))
         finally:
             if tmpdir is not None:
                 tmpdir.cleanup()
         # Comparison table.
-        click.echo("\nCONCURRENCY  ACCURACY  PASS/TOTAL  MISSES  JOB_DIR")
-        for c, accuracy, n_pass, n_total, misses, job_dir in results:
+        click.echo("\nCONCURRENCY  ACCURACY  PASS/TOTAL  MISSES  DIVERG  JOB_DIR")
+        for c, accuracy, n_pass, n_total, misses, diverg, job_dir in results:
             click.echo(f"{c:>11}  {accuracy:>7.1%}  {n_pass:>4}/{n_total:<5}  "
-                       f"{misses:>6}  {job_dir}")
+                       f"{misses:>6}  {diverg:>6}  {job_dir}")
         return None
     async def _run_one(self, store, base_ns, concurrency, run_model, host, port,
-                       reachable_host, timeline, utilization_interval_sec) -> tuple:
+                       reachable_host, timeline, utilization_interval_sec,
+                       validate_observations) -> tuple:
         """Serve `store` on host:port and run one harbor job at `concurrency`.
         Binds a fresh listener per call (pass --port 0 for an ephemeral port,
@@ -240,9 +284,9 @@ class SWEBenchReplayRecipe(Recipe):
         from benchmaker.swebench import harbor_eval as he
         from benchmaker.swebench.observability import run_job_with_observability
-        from benchmaker.swebench.replay_server import as_app, get_misses
+        from benchmaker.swebench.replay_server import as_app, get_divergences, get_misses
-        app = as_app(store, model_fallback=run_model)
+        app = as_app(store, model_fallback=run_model, validate=validate_observations)
         runner = web.AppRunner(app)
         await runner.setup()
         site = web.TCPSite(runner, host, port)
@@ -261,7 +305,7 @@ class SWEBenchReplayRecipe(Recipe):
             rows, accuracy = he._summarise(job_result)
             n_pass = sum(1 for r in rows if r["passed"])
             return (concurrency, accuracy, n_pass, len(rows), get_misses(app),
-                    str(job.job_dir))
+                    get_divergences(app), str(job.job_dir))
         finally:
             await runner.cleanup()
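
To see how the new `exclude_task` argument changes `_resolve_task_filter`, the function body from the hunk above can be run against a stub store. The function is copied from the diff; the `Trajectory` dataclass and the instance ids are hypothetical stand-ins for the real trajectory store.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trajectory:
    """Hypothetical stand-in: only the attribute the helper reads."""
    instance_id: Optional[str]


def _resolve_task_filter(task, exclude_task, store) -> tuple:
    # Body as shown in the 0.1.3 side of the diff.
    excluded = set(exclude_task)
    explicit = [t for t in task if t not in excluded]
    if explicit:
        return explicit, 0
    ids = sorted({t.instance_id for t in store.values()
                  if t.instance_id and t.instance_id not in excluded})
    missing = sum(1 for t in store.values() if not t.instance_id)
    return ids, missing


store = {
    "a": Trajectory("django__django-1"),   # made-up ids for illustration
    "b": Trajectory("sympy__sympy-2"),
    "c": Trajectory(None),                 # recorded without an instance id
}

# No --task: recorded ids minus the exclusion, plus the untargetable count.
default_case = _resolve_task_filter((), ("sympy__sympy-2",), store)

# Explicit --task wins, with excluded ids dropped from it.
explicit_case = _resolve_task_filter(
    ("django__django-1", "sympy__sympy-2"), ("sympy__sympy-2",), store)

# Every explicit task excluded: the helper falls back to the recorded set.
fallback_case = _resolve_task_filter(
    ("sympy__sympy-2",), ("sympy__sympy-2",), store)
```

One consequence of the code as written in the hunk: when `--exclude-task` removes every `--task` value, the explicit list is empty and the function returns the recorded ids (minus exclusions) rather than an empty selection.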

{benchmaker-0.1.2 → benchmaker-0.1.3}/benchmaker/recipes/trajectory_replay.py RENAMED

@@ -3,7 +3,21 @@
 Expands each trajectory into one chat request per assistant turn (growing shared
 prefix) against an OpenAI-compatible endpoint, recording the prefix-cache parity
 pair: meta.expected_prefix_tokens (tokenizer upper bound) vs extra.cached_tokens
-(server actual). Use `--rate closed:N` for clean prefix-cache locality.
+(server actual).
+Two scheduling regimes:
+* **Contiguous (default)** — all of trajectory A's turns, then all of B's. Turn
+  k+1 is served within a few requests of turn k, so its history is reused while
+  still hot in the local cache. Use ``--rate closed:N`` for clean prefix-cache
+  locality (best case: locality preserved).
+* **Interleaved** (``--concurrent-sessions N``) — keep up to N sessions active
+  and round-robin their turns, gating each session's turn k+1 on turn k's
+  completion (+ an optional ``--inter-turn-gap`` think time). Concurrent session
+  histories overflow the device KV pool, so a session's history is evicted
+  before its next turn — the multi-turn *reuse-after-eviction* regime that
+  stresses hierarchical / shared KV tiers. The in-flight ceiling defaults to
+  ``closed:N`` to match the active session count.
 """
 from __future__ import annotations
@@ -67,12 +81,24 @@ class TrajectoryReplayRecipe(Recipe):
                          help="Cap assistant turns replayed per trajectory."),
             click.option("--max-trajectories", "max_trajectories", type=int,
                          default=None, help="Cap number of trajectories replayed."),
+            click.option("--concurrent-sessions", "concurrent_sessions", type=int,
+                         default=None,
+                         help="Interleave turns across up to N concurrent "
+                              "sessions (round-robin, each session's turn k+1 "
+                              "gated on turn k completing) instead of replaying "
+                              "each trajectory contiguously. Enables the "
+                              "reuse-after-eviction regime; defaults the rate to "
+                              "closed:N."),
+            click.option("--inter-turn-gap", "inter_turn_gap", default=None,
+                         help="Per-session think time between consecutive turns "
+                              "(interleaved mode). E.g. 'const:2s', 'exp:1.5', "
+                              "'uniform:1s..3s'. Default: no gap."),
         ]
     def build(self, shared: SharedOpts, *, url, model, api_key, header, dataset,
               prompts_jsonl, split, preset, tokenizer, messages_field, id_field,
-              model_field, max_tokens, max_turns_per_trajectory, max_trajectories
-              ) -> BuildResult:
+              model_field, max_tokens, max_turns_per_trajectory, max_trajectories,
+              concurrent_sessions=None, inter_turn_gap=None) -> BuildResult:
         from benchmaker.workloads.llm import OpenAIChatWorkloadType
         from benchmaker.workloads.trajectory import TrajectoryReplayWorkload
@@ -99,7 +125,8 @@ class TrajectoryReplayRecipe(Recipe):
             messages_field=messages_field, id_field=id_field,
             model_field=model_field, max_tokens=max_tokens,
             max_turns_per_trajectory=max_turns_per_trajectory,
-            max_trajectories=max_trajectories, tokenizer=tokenizer)
+            max_trajectories=max_trajectories, tokenizer=tokenizer,
+            concurrent_sessions=concurrent_sessions, inter_turn_gap=inter_turn_gap)
         source_config = {
             "workload_type": {"type": "openai-chat", "url": wt._url,
@@ -111,14 +138,26 @@ class TrajectoryReplayRecipe(Recipe):
                          "model_field": model_field, "tokenizer": tokenizer,
                          "max_tokens": max_tokens,
                          "max_trajectories": max_trajectories,
-                         "max_turns_per_trajectory": max_turns_per_trajectory},
+                         "max_turns_per_trajectory": max_turns_per_trajectory,
+                         "concurrent_sessions": concurrent_sessions,
+                         "inter_turn_gap": inter_turn_gap},
         }
+        # Interleaved mode needs a per-turn completion signal to gate each
+        # session's next turn; wire the workload's post-hook and default the
+        # in-flight ceiling to the active session count.
+        hook = workload.completion_hook()
+        post_hooks: list = [hook] if hook is not None else []
+        default_rate = ("closed:8" if concurrent_sessions is None
+                        else f"closed:{concurrent_sessions}")
         # Finite dataset: replay once. The workload raises StopAsyncIteration when
         # exhausted, which halts the run; default to closed-loop with a long
         # nominal duration so exhaustion (not the clock) ends it.
         return BuildResult(
             workload_type=wt, workload=workload, source_config=source_config,
-            default_rate="closed:8", default_duration="24h")
+            post_hooks=post_hooks,
+            default_rate=default_rate, default_duration="24h")
 register(TrajectoryReplayRecipe())
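
The interleaved regime in the docstring above (up to N active sessions, each session's turn k+1 gated on turn k completing, optional think time between turns) can be sketched with plain `asyncio`. This is an illustration of the scheduling idea only, not the `TrajectoryReplayWorkload` implementation, which this diff does not show. `replay_interleaved`, `send`, and the float-valued gap are hypothetical.

```python
import asyncio


async def replay_interleaved(trajectories: dict, concurrent_sessions: int,
                             send, inter_turn_gap: float = 0.0) -> None:
    """Run at most `concurrent_sessions` sessions at once, turns in order."""
    queue: asyncio.Queue = asyncio.Queue()
    for item in trajectories.items():
        queue.put_nowait(item)

    async def worker() -> None:
        while True:
            try:
                session_id, turns = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            for k, turn in enumerate(turns):
                if k and inter_turn_gap:
                    await asyncio.sleep(inter_turn_gap)  # per-session think time
                # Awaiting here is the gate: turn k+1 starts only after
                # turn k has completed.
                await send(session_id, k, turn)

    # One worker per active-session slot; a slot picks up the next
    # trajectory when its current session runs out of turns.
    await asyncio.gather(*(worker() for _ in range(concurrent_sessions)))


async def _demo() -> list:
    log = []

    async def fake_send(session_id, k, turn):
        await asyncio.sleep(0)  # yield so other sessions can make progress
        log.append((session_id, k))

    await replay_interleaved(
        {"A": ["a0", "a1"], "B": ["b0", "b1"], "C": ["c0"]}, 2, fake_send)
    return log


log = asyncio.run(_demo())
```

In the demo, sessions A and B alternate turns while C waits for a free slot, which is the access pattern that lets one session's history go cold before its next turn arrives.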

{benchmaker-0.1.2 → benchmaker-0.1.3}/benchmaker/swebench/harbor_eval.py RENAMED

@@ -209,7 +209,8 @@ def _build_job_config(args: argparse.Namespace) -> JobConfig:
     )
     dataset = DatasetConfig(name=args.dataset, n_tasks=args.n_tasks,
-                            task_names=args.task or None)
+                            task_names=args.task or None,
+                            exclude_task_names=args.exclude_task or None)
     # Parent directory for the run bundle (harbor writes to <jobs_dir>/<job_name>).
     # Omit when unset so harbor keeps its own default of "jobs".
@@ -301,6 +302,10 @@ def _parse_args() -> argparse.Namespace:
                    help="Cap the number of dataset tasks.")
     p.add_argument("--task", action="append", default=[],
                    help="Restrict to specific task name(s)/glob(s) (repeatable).")
+    p.add_argument("--exclude-task", action="append", default=[],
+                   help="Skip specific task name(s)/glob(s) (repeatable). Applied "
+                        "after --task and before the --n-tasks cap, so the cap "
+                        "selects the first N tasks that remain after exclusion.")
     p.add_argument("--concurrency", type=int, default=4)
     p.add_argument("--n-attempts", type=int, default=1)
     p.add_argument("--timeout-multiplier", type=float, default=4.0,
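
The `--exclude-task` help text above fixes an ordering: include by `--task`, then drop exclusions, then apply the `--n-tasks` cap. The actual filtering happens inside harbor and is not part of this diff, so the following is only a sketch of that ordering. `select_tasks` is a hypothetical helper, the task names are made up, and `fnmatch` is assumed as the glob matcher.

```python
from fnmatch import fnmatch
from typing import Optional, Sequence


def select_tasks(all_tasks: Sequence[str], task: Sequence[str] = (),
                 exclude_task: Sequence[str] = (),
                 n_tasks: Optional[int] = None) -> list:
    # 1) --task: keep names matching any pattern (no patterns keeps all).
    selected = [t for t in all_tasks
                if not task or any(fnmatch(t, p) for p in task)]
    # 2) --exclude-task: drop names matching any exclusion pattern.
    selected = [t for t in selected
                if not any(fnmatch(t, p) for p in exclude_task)]
    # 3) --n-tasks: cap what remains after exclusion.
    return selected if n_tasks is None else selected[:n_tasks]


tasks = ["django-1", "django-2", "sympy-1", "sympy-2"]
picked = select_tasks(tasks, task=["django-*", "sympy-*"],
                      exclude_task=["django-1"], n_tasks=2)
```

Here `picked` is `["django-2", "sympy-1"]`. Applying the cap before the exclusion would instead leave only `django-2`, which is the difference the help text is pointing at.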
