PyPI - benchmaker - Versions diffs - 0.1.2__tar.gz → 0.1.4__tar.gz - Mend

benchmaker 0.1.2__tar.gz → 0.1.4__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (94)

{benchmaker-0.1.2 → benchmaker-0.1.4}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: benchmaker
-Version: 0.1.2
+Version: 0.1.4
 Summary: Async HTTP benchmarking utility with pluggable workloads and load models.
 Author: Xiaozhe Yao
 License: MIT
@@ -18,6 +18,8 @@ Requires-Dist: rich>=13; extra == "rich"
 Provides-Extra: hf
 Requires-Dist: datasets>=2.18; extra == "hf"
 Requires-Dist: transformers>=4.40; extra == "hf"
+Provides-Extra: tokenizer
+Requires-Dist: transformers>=4.40; extra == "tokenizer"
 Provides-Extra: dev
 Requires-Dist: pytest>=7; extra == "dev"
 Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
@@ -73,8 +75,8 @@ asyncio.run(main())
 ```
 Or via the CLI. Workload-specific benchmarks are exposed as **recipes** —
-`benchmaker <recipe> --args` (`http`, `llm`, `sandbox`, `swebench`, `sglang`,
-`trajectory-replay`):
+`benchmaker <recipe> --args` (`http`, `llm`, `sandbox`, `swebench`,
+`swebench-replay`, `sglang`, `trajectory-replay`):
 ```bash
 benchmaker http --url https://httpbin.org/get --rate poisson:50 --duration 10s
@@ -185,16 +187,16 @@ Full docs live in [`docs/`](docs/):
 - [Correctness / accuracy eval](docs/eval.md) — grade responses against references
 - [CLI & YAML reference](docs/cli-and-yaml.md)
 - [ShareGPT benchmark](docs/sharegpt-benchmark.md) — self-contained end-to-end walkthrough
-- `benchmaker sglang` — native SGLang `/generate` benchmark (see [`docs/sglang.md`](docs/sglang.md)).
-- `benchmaker trajectory-replay` — multi-turn prefix-cache parity replay of
-  trajectory datasets like SWE-smith (see [`docs/trajectory-replay.md`](docs/trajectory-replay.md)).
+- [DeepRAG and mixed lanes](docs/deeprag-mix.md) — prefill-heavy RAG and phase-swinging dataset lanes
+- [SGLang benchmark](docs/sglang.md) — native SGLang `/generate` benchmark
+- [Trajectory replay](docs/trajectory-replay.md) — multi-turn prefix-cache parity replay
 ## Deterministic replay (`swebench-replay`)
 Re-run a recorded SWE-bench job with the LLM **mocked from its own logs** — the
 real pi + sandbox + verifier pipeline still runs, only the model is served back
 from recorded outputs, so re-runs are deterministic and free of model
-cost/variance. Vary `--concurrency` (or `--sweep`) to study the rest of the
+cost/variance. Vary `--concurrency` (or `--concurrency-sweep`) to study the rest of the
 pipeline without the model's stochasticity as a confound. Still needs
 `FLASH_SANDBOX_URL` (the sandbox + verifier are real).
@@ -207,7 +209,7 @@ python -m benchmaker.swebench.trajectory jobs/2026-06-08__05-24-01_b352cb \
 # 2) replay (host mode, localhost) across a concurrency sweep
 FLASH_SANDBOX_URL=http://localhost:8080 \
   benchmaker swebench-replay --trajectories replay-trajectories.jsonl \
-    --mode pi-host --sweep 1,5,25
+    --mode pi-host --concurrency-sweep 1,5,25
 # container mode: bind 0.0.0.0 and tell the sandbox how to reach the server
 FLASH_SANDBOX_URL=http://localhost:8080 \
@@ -221,6 +223,18 @@ run lacked an instance id) plus the count of assistant messages already in the
 request — so it is correct at any concurrency. A `MISSES` column in the summary
 flags any divergence (a request beyond the recorded turns).
+The standalone replay server can also **mock realistic streaming** for
+latency-sensitive benchmarks. Pass a real tokenizer and a per-token delay; the
+first token is emitted immediately (prefill free, TTFT≈0) and each subsequent
+token is spaced by `--inter-token-time` ms. Output stays byte-exact and the
+reported `usage` is the recorded value.
+```bash
+pip install 'benchmaker[tokenizer]'   # adds transformers for the tokenizer
+python -m benchmaker.swebench.replay_server replay-trajectories.jsonl \
+    --tokenizer zai-org/GLM-4.7-Flash --inter-token-time 50
+```
 ## Examples
 Under [`examples/`](examples/):
@@ -228,9 +242,12 @@ Under [`examples/`](examples/):
 - `simple_get.py`         — minimal library usage
 - `custom_hooks.py`       — request signing + response parsing
 - `llm_chat.py`           — OpenAI-compatible LLM endpoint with streaming
+- `llm_from_env.py`       — LLM benchmark using `from_env()`
 - `vllm_with_monitor.py`  — LLM benchmark with concurrent vLLM `/metrics` scrape
+- `agent_trove.py`        — user-defined agent benchmark
 - `sandbox_exec.py`       — Flash Sandbox `/exec` latency benchmark
 - `sandbox_lifecycle.py`  — full create → exec → delete cold-start benchmark
+- `bench_sandbox.py` / `bench_sandbox.sh` — sandbox benchmarks
 - `llm_eval.py`           — LLM benchmark + accuracy grading (exact/regex/judge)
 - `gsm8k_eval.py`         — GSM8K from HuggingFace + integer-match scorer
 - `config.yaml`           — generic HTTP YAML config
@@ -254,9 +271,22 @@ benchmaker/          # library code
   config.py  env.py  #   YAML config loading + .env interpolation
   core/              #   engine: types, load models, runner, metrics, monitors, trace
   io/                #   run output: per-run bundle + cross-run collection
-  workloads/         #   workload-types (http, llm, sandbox, agent, hf, eval)
-  recipes/           #   CLI recipes (http, llm, sandbox, swebench, swebench-replay) + registry
-  swebench/          #   SWE-bench coding agent + grading + harbor adapters
+  workloads/
+    http.py          #   HTTP workload-type
+    llm.py           #   OpenAI-compatible chat workload-type
+    sandbox.py       #   Flash Sandbox workload-type
+    sglang.py        #   SGLang native /generate workload-type
+    agent.py         #   user-defined Agent workload-type
+    trajectory.py    #   multi-turn trajectory replay workload
+    eval.py          #   correctness/accuracy evaluation
+    hf.py            #   HuggingFace dataset source
+    datasets.py      #   generic workload/dataset base classes
+    base.py          #   WorkloadType base class
+  recipes/           #   CLI recipes (http, llm, sandbox, swebench, swebench-replay, sglang, trajectory-replay) + registry
+  swebench/
+    trajectory.py    #   convert pi logs to replay trajectories
+    replay_server.py #   mock-LLM replay server for swebench-replay
+    agent.py         #   SWE-bench coding agent + grading + harbor adapters
 examples/            # runnable examples (incl. swebench/ coding-agent config)
 tools/               # out-of-tree tooling: sharegpt/, swe_images/, agent_warmup/
 tests/               # pytest smoke tests
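
The streaming paragraph added to the README describes a pacing rule: the first token goes out immediately and every later token is delayed by `--inter-token-time` milliseconds, while the joined output stays identical to the recording. The following is a minimal sketch of that rule only, not the code in `replay_server.py`. The names `paced_tokens` and `replay` are hypothetical, and the token pieces are hand-written rather than produced by a real tokenizer.

```python
import asyncio
import time
from typing import AsyncIterator, Sequence


async def paced_tokens(tokens: Sequence[str],
                       inter_token_ms: float) -> AsyncIterator[str]:
    """Yield the first token at once, then one token per delay interval."""
    for index, token in enumerate(tokens):
        if index > 0:
            # Only tokens after the first are delayed, so TTFT stays near zero.
            await asyncio.sleep(inter_token_ms / 1000.0)
        yield token


async def replay(pieces: Sequence[str], inter_token_ms: float):
    """Collect the streamed text and the arrival time of each token."""
    start = time.monotonic()
    arrivals: list[float] = []
    chunks: list[str] = []
    async for token in paced_tokens(pieces, inter_token_ms):
        arrivals.append(time.monotonic() - start)
        chunks.append(token)
    return "".join(chunks), arrivals


# Hypothetical recorded output, already split into token-sized pieces.
pieces = ["def ", "main", "():", "\n    ", "pass"]
text, arrivals = asyncio.run(replay(pieces, inter_token_ms=20))
```

With five pieces and a 20 ms gap, the whole stream takes roughly 80 ms, and concatenating the pieces reproduces the recorded text exactly.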

{benchmaker-0.1.2 → benchmaker-0.1.4}/README.md RENAMED

@@ -45,8 +45,8 @@ asyncio.run(main())
 ```
 Or via the CLI. Workload-specific benchmarks are exposed as **recipes** —
-`benchmaker <recipe> --args` (`http`, `llm`, `sandbox`, `swebench`, `sglang`,
-`trajectory-replay`):
+`benchmaker <recipe> --args` (`http`, `llm`, `sandbox`, `swebench`,
+`swebench-replay`, `sglang`, `trajectory-replay`):
 ```bash
 benchmaker http --url https://httpbin.org/get --rate poisson:50 --duration 10s
@@ -157,16 +157,16 @@ Full docs live in [`docs/`](docs/):
 - [Correctness / accuracy eval](docs/eval.md) — grade responses against references
 - [CLI & YAML reference](docs/cli-and-yaml.md)
 - [ShareGPT benchmark](docs/sharegpt-benchmark.md) — self-contained end-to-end walkthrough
-- `benchmaker sglang` — native SGLang `/generate` benchmark (see [`docs/sglang.md`](docs/sglang.md)).
-- `benchmaker trajectory-replay` — multi-turn prefix-cache parity replay of
-  trajectory datasets like SWE-smith (see [`docs/trajectory-replay.md`](docs/trajectory-replay.md)).
+- [DeepRAG and mixed lanes](docs/deeprag-mix.md) — prefill-heavy RAG and phase-swinging dataset lanes
+- [SGLang benchmark](docs/sglang.md) — native SGLang `/generate` benchmark
+- [Trajectory replay](docs/trajectory-replay.md) — multi-turn prefix-cache parity replay
 ## Deterministic replay (`swebench-replay`)
 Re-run a recorded SWE-bench job with the LLM **mocked from its own logs** — the
 real pi + sandbox + verifier pipeline still runs, only the model is served back
 from recorded outputs, so re-runs are deterministic and free of model
-cost/variance. Vary `--concurrency` (or `--sweep`) to study the rest of the
+cost/variance. Vary `--concurrency` (or `--concurrency-sweep`) to study the rest of the
 pipeline without the model's stochasticity as a confound. Still needs
 `FLASH_SANDBOX_URL` (the sandbox + verifier are real).
@@ -179,7 +179,7 @@ python -m benchmaker.swebench.trajectory jobs/2026-06-08__05-24-01_b352cb \
 # 2) replay (host mode, localhost) across a concurrency sweep
 FLASH_SANDBOX_URL=http://localhost:8080 \
   benchmaker swebench-replay --trajectories replay-trajectories.jsonl \
-    --mode pi-host --sweep 1,5,25
+    --mode pi-host --concurrency-sweep 1,5,25
 # container mode: bind 0.0.0.0 and tell the sandbox how to reach the server
 FLASH_SANDBOX_URL=http://localhost:8080 \
@@ -193,6 +193,18 @@ run lacked an instance id) plus the count of assistant messages already in the
 request — so it is correct at any concurrency. A `MISSES` column in the summary
 flags any divergence (a request beyond the recorded turns).
+The standalone replay server can also **mock realistic streaming** for
+latency-sensitive benchmarks. Pass a real tokenizer and a per-token delay; the
+first token is emitted immediately (prefill free, TTFT≈0) and each subsequent
+token is spaced by `--inter-token-time` ms. Output stays byte-exact and the
+reported `usage` is the recorded value.
+```bash
+pip install 'benchmaker[tokenizer]'   # adds transformers for the tokenizer
+python -m benchmaker.swebench.replay_server replay-trajectories.jsonl \
+    --tokenizer zai-org/GLM-4.7-Flash --inter-token-time 50
+```
 ## Examples
 Under [`examples/`](examples/):
@@ -200,9 +212,12 @@ Under [`examples/`](examples/):
 - `simple_get.py`         — minimal library usage
 - `custom_hooks.py`       — request signing + response parsing
 - `llm_chat.py`           — OpenAI-compatible LLM endpoint with streaming
+- `llm_from_env.py`       — LLM benchmark using `from_env()`
 - `vllm_with_monitor.py`  — LLM benchmark with concurrent vLLM `/metrics` scrape
+- `agent_trove.py`        — user-defined agent benchmark
 - `sandbox_exec.py`       — Flash Sandbox `/exec` latency benchmark
 - `sandbox_lifecycle.py`  — full create → exec → delete cold-start benchmark
+- `bench_sandbox.py` / `bench_sandbox.sh` — sandbox benchmarks
 - `llm_eval.py`           — LLM benchmark + accuracy grading (exact/regex/judge)
 - `gsm8k_eval.py`         — GSM8K from HuggingFace + integer-match scorer
 - `config.yaml`           — generic HTTP YAML config
@@ -226,9 +241,22 @@ benchmaker/          # library code
   config.py  env.py  #   YAML config loading + .env interpolation
   core/              #   engine: types, load models, runner, metrics, monitors, trace
   io/                #   run output: per-run bundle + cross-run collection
-  workloads/         #   workload-types (http, llm, sandbox, agent, hf, eval)
-  recipes/           #   CLI recipes (http, llm, sandbox, swebench, swebench-replay) + registry
-  swebench/          #   SWE-bench coding agent + grading + harbor adapters
+  workloads/
+    http.py          #   HTTP workload-type
+    llm.py           #   OpenAI-compatible chat workload-type
+    sandbox.py       #   Flash Sandbox workload-type
+    sglang.py        #   SGLang native /generate workload-type
+    agent.py         #   user-defined Agent workload-type
+    trajectory.py    #   multi-turn trajectory replay workload
+    eval.py          #   correctness/accuracy evaluation
+    hf.py            #   HuggingFace dataset source
+    datasets.py      #   generic workload/dataset base classes
+    base.py          #   WorkloadType base class
+  recipes/           #   CLI recipes (http, llm, sandbox, swebench, swebench-replay, sglang, trajectory-replay) + registry
+  swebench/
+    trajectory.py    #   convert pi logs to replay trajectories
+    replay_server.py #   mock-LLM replay server for swebench-replay
+    agent.py         #   SWE-bench coding agent + grading + harbor adapters
 examples/            # runnable examples (incl. swebench/ coding-agent config)
 tools/               # out-of-tree tooling: sharegpt/, swe_images/, agent_warmup/
 tests/               # pytest smoke tests

{benchmaker-0.1.2 → benchmaker-0.1.4}/benchmaker/__init__.py RENAMED

@@ -19,6 +19,7 @@ from benchmaker.workloads.http import HttpWorkloadType
 from benchmaker.workloads.llm import OpenAIChatWorkloadType
 from benchmaker.workloads.sandbox import SandboxWorkloadType
 from benchmaker.workloads.hf import HFDatasetWorkload
+from benchmaker.workloads.rag import DeepRAGWorkload
 from benchmaker.workloads.sglang import SGLangGenerateWorkloadType
 from benchmaker.workloads.trajectory import TrajectoryReplayWorkload
 from benchmaker.workloads.agent import (
@@ -59,7 +60,7 @@ from benchmaker.core.monitors import (
     PrometheusMonitor,
     parse_prometheus,
 )
-from benchmaker.core.runner import BenchRunner, BenchConfig, BenchResult
+from benchmaker.core.runner import BenchLane, BenchRunner, BenchConfig, BenchResult
 from benchmaker.core.trace import (
     ReplayWorkloadType,
     TracePacedLoad,
@@ -89,6 +90,7 @@ __all__ = [
     "OpenAIChatWorkloadType",
     "SandboxWorkloadType",
     "HFDatasetWorkload",
+    "DeepRAGWorkload",
     "SGLangGenerateWorkloadType",
     "TrajectoryReplayWorkload",
     # agent workload (pluggable user-defined agents)
@@ -136,6 +138,7 @@ __all__ = [
     # runner
     "BenchRunner",
     "BenchConfig",
+    "BenchLane",
     "BenchResult",
     # trace: record & replay
     "TraceRecorder",
@@ -153,4 +156,4 @@ __all__ = [
     "write_bundle",
 ]
-__version__ = "0.1.1"
+__version__ = "0.1.4"

{benchmaker-0.1.2 → benchmaker-0.1.4}/benchmaker/config.py RENAMED

@@ -22,7 +22,7 @@ from typing import Any, Callable, Optional
 from benchmaker.env import interpolate, load_dotenv
 from benchmaker.core.load import parse_duration, parse_rate_spec
 from benchmaker.core.monitors import FunctionMonitor, Monitor, PrometheusMonitor
-from benchmaker.core.runner import BenchConfig
+from benchmaker.core.runner import BenchConfig, BenchLane
 from benchmaker.workloads.base import WorkloadType
 from benchmaker.workloads.datasets import (
     CallableWorkload,
@@ -31,6 +31,7 @@ from benchmaker.workloads.datasets import (
     Workload,
 )
 from benchmaker.workloads.hf import HFDatasetWorkload
+from benchmaker.workloads.rag import DeepRAGWorkload
 from benchmaker.workloads.http import HttpWorkloadType
 from benchmaker.workloads.llm import OpenAIChatWorkloadType
 from benchmaker.workloads.sandbox import SandboxWorkloadType
@@ -154,6 +155,8 @@ def build_workload(spec: Any) -> Workload:
         return CallableWorkload(fn=fn, **kwargs)
     if t in ("hf", "huggingface"):
         return HFDatasetWorkload(**kwargs)
+    if t in ("deeprag", "deep-rag", "rag"):
+        return DeepRAGWorkload(**kwargs)
     if t == "trajectory":
         from benchmaker.workloads.trajectory import TrajectoryReplayWorkload
         return TrajectoryReplayWorkload(**kwargs)
@@ -365,8 +368,12 @@ def build_config(cfg: dict, dotenv_path: Optional[str] = ".env",
         cfg = interpolate(cfg)
     replay_spec = cfg.get("replay")
+    mix_spec = cfg.get("mix")
+    if replay_spec is not None and mix_spec is not None:
+        raise ValueError("'replay' and 'mix' are mutually exclusive")
     if replay_spec is not None:
         workload_type, workload, load_model = _build_replay(replay_spec)
+        lanes: list[BenchLane] = []
     else:
         wt_spec = cfg.get("workload_type")
         if not wt_spec:
@@ -382,16 +389,27 @@ def build_config(cfg: dict, dotenv_path: Optional[str] = ".env",
                 raise ValueError("config must define 'workload_type' or 'replay'")
         workload_type = build_workload_type(wt_spec)
-        workload = build_workload(cfg.get("workload"))
-        load_spec = cfg.get("load")
-        if load_spec is None:
-            raise ValueError("config must define 'load'")
         duration = cfg.get("duration") or cfg.get("duration_s")
         if duration is not None and isinstance(duration, str):
             duration = parse_duration(duration)
-        load_model = parse_rate_spec(load_spec, duration_s=duration,
-                                     max_requests=cfg.get("max_requests"))
+        if mix_spec is not None:
+            if cfg.get("load") is not None:
+                raise ValueError("a mixed config cannot also define top-level 'load'")
+            workload = StaticWorkload()
+            load_model = None
+            lanes = _build_lanes(
+                mix_spec,
+                duration_s=duration,
+                max_requests=cfg.get("max_requests"),
+            )
+        else:
+            workload = build_workload(cfg.get("workload"))
+            load_spec = cfg.get("load")
+            if load_spec is None:
+                raise ValueError("config must define 'load' or 'mix.lanes'")
+            load_model = parse_rate_spec(load_spec, duration_s=duration,
+                                         max_requests=cfg.get("max_requests"))
+            lanes = []
     pre_hooks = [resolve_callable(h) for h in (cfg.get("pre_hooks") or [])]
     post_hooks = [resolve_callable(h) for h in (cfg.get("post_hooks") or [])]
@@ -407,12 +425,22 @@ def build_config(cfg: dict, dotenv_path: Optional[str] = ".env",
         workload_type, extra_post = apply_correctness(workload_type, correctness_spec)
         post_hooks = list(post_hooks) + list(extra_post)
+    # A workload that schedules on per-request completion (e.g. interleaved
+    # trajectory replay) declares the post-hook it needs; install it so a YAML
+    # config can't silently stall waiting for a signal it never wired up.
+    workloads = [lane.workload for lane in lanes] if lanes else [workload]
+    for lane_workload in workloads:
+        wl_hook = lane_workload.completion_hook()
+        if wl_hook is not None and wl_hook not in post_hooks:
+            post_hooks = list(post_hooks) + [wl_hook]
     recorder = _build_recorder(cfg.get("record"))
     return BenchConfig(
         workload_type=workload_type,
         workload=workload,
         load=load_model,
+        lanes=lanes,
         pre_hooks=pre_hooks,
         post_hooks=post_hooks,
         monitors=monitors,
@@ -421,9 +449,48 @@ def build_config(cfg: dict, dotenv_path: Optional[str] = ".env",
         timeout_s=float(cfg.get("timeout_s", 60.0)),
         max_in_flight=int(cfg.get("max_in_flight", 10000)),
         progress_every_s=float(cfg.get("progress_every_s", 1.0)),
+        stop_on_exhausted=bool(cfg.get("stop_on_exhausted", True)),
     )
+def _build_lanes(spec: Any, *, duration_s: Optional[float],
+                 max_requests: Optional[int]) -> list[BenchLane]:
+    """Build independent workload/load pairs from a ``mix:`` YAML block."""
+    if not isinstance(spec, dict):
+        raise TypeError("'mix' must be a mapping with a 'lanes' list")
+    lane_specs = spec.get("lanes")
+    if not isinstance(lane_specs, list) or not lane_specs:
+        raise ValueError("'mix.lanes' must be a non-empty list")
+    lanes: list[BenchLane] = []
+    for index, lane_spec in enumerate(lane_specs):
+        if not isinstance(lane_spec, dict):
+            raise TypeError(f"mix.lanes[{index}] must be a mapping")
+        name = lane_spec.get("name")
+        if not isinstance(name, str) or not name.strip():
+            raise ValueError(f"mix.lanes[{index}].name must be a non-empty string")
+        if "workload" not in lane_spec:
+            raise ValueError(f"mix.lanes[{index}] must define a workload")
+        rate = lane_spec.get("rate", lane_spec.get("load"))
+        if rate is None:
+            raise ValueError(f"mix.lanes[{index}] must define rate (or load)")
+        lane_duration = lane_spec.get("duration", duration_s)
+        if isinstance(lane_duration, str):
+            lane_duration = parse_duration(lane_duration)
+        lane_max_requests = lane_spec.get("max_requests", max_requests)
+        lanes.append(BenchLane(
+            name=name,
+            workload=build_workload(lane_spec["workload"]),
+            load=parse_rate_spec(
+                rate,
+                duration_s=lane_duration,
+                max_requests=lane_max_requests,
+            ),
+        ))
+    return lanes
 def _build_recorder(spec: Any) -> Optional[TraceRecorder]:
     if spec is None:
         return None
@@ -451,4 +518,3 @@ def _build_replay(spec: Any) -> tuple[WorkloadType, Workload, Any]:
         TracePacedLoad(trace, speed=speed),
     )
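
The `config.py` changes above add a `mix:` block whose `lanes` list pairs a workload with its own rate, and they reject several invalid combinations. The sketch below restates those checks as a standalone function over a plain dict, so the accepted shape is easier to see. It does not import benchmaker. The function name `check_mix`, the `type` key inside each workload spec, and the concrete rate strings are assumptions for illustration; only the error conditions and the `name` / `workload` / `rate` (or `load`) / `duration` lane keys come from the diff.

```python
from typing import Any


def check_mix(cfg: dict[str, Any]) -> list[str]:
    """Validate a mixed config and return its lane names in order."""
    mix = cfg.get("mix")
    if cfg.get("replay") is not None and mix is not None:
        raise ValueError("'replay' and 'mix' are mutually exclusive")
    if cfg.get("load") is not None:
        raise ValueError("a mixed config cannot also define top-level 'load'")
    if not isinstance(mix, dict):
        raise TypeError("'mix' must be a mapping with a 'lanes' list")
    lanes = mix.get("lanes")
    if not isinstance(lanes, list) or not lanes:
        raise ValueError("'mix.lanes' must be a non-empty list")

    names: list[str] = []
    for index, lane in enumerate(lanes):
        if not isinstance(lane, dict):
            raise TypeError(f"mix.lanes[{index}] must be a mapping")
        name = lane.get("name")
        if not isinstance(name, str) or not name.strip():
            raise ValueError(f"mix.lanes[{index}].name must be a non-empty string")
        if "workload" not in lane:
            raise ValueError(f"mix.lanes[{index}] must define a workload")
        # 'rate' wins; 'load' is accepted as a fallback spelling.
        if lane.get("rate", lane.get("load")) is None:
            raise ValueError(f"mix.lanes[{index}] must define rate (or load)")
        names.append(name)
    return names


# Dict form of a hypothetical YAML config with two lanes.
config = {
    "workload_type": {"type": "llm"},  # assumed spec shape
    "duration": "60s",                 # default for lanes without 'duration'
    "mix": {
        "lanes": [
            {"name": "rag", "workload": {"type": "deeprag"}, "rate": "poisson:5"},
            {"name": "chat", "workload": {"type": "hf"}, "load": "poisson:20",
             "duration": "30s"},
        ]
    },
}
lane_names = check_mix(config)

# Adding a top-level 'load' next to 'mix' is rejected.
try:
    check_mix({**config, "load": "poisson:50"})
    rejected = False
except ValueError:
    rejected = True
```

In the real `_build_lanes`, a lane without its own `duration` or `max_requests` inherits the top-level value, which is why the second lane can override the 60 s default with 30 s.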

{benchmaker-0.1.2 → benchmaker-0.1.4}/benchmaker/core/metrics.py RENAMED

@@ -52,64 +52,22 @@ class MetricsAggregator:
     def summary(self) -> dict:
         end = self.end_time or time.monotonic()
         wall_s = max(end - self.start_time, 1e-9)
-        ok = [s for s in self.samples if s.ok]
-        fail = [s for s in self.samples if not s.ok]
-        # Split fail into transport failures vs. delivered-but-graded-wrong.
-        wrong = [s for s in fail if s.request_ok]
-        request_failed = [s for s in fail if not s.request_ok]
-        latencies = [s.latency_s for s in ok]
-        status_counts = Counter(s.status for s in self.samples)
-        error_counts = Counter(s.error for s in fail if s.error)
-        out: dict = {
-            "wall_time_s": wall_s,
-            "total_requests": len(self.samples),
-            "success": len(ok),
-            "failed": len(fail),
-            "request_failed": len(request_failed),
-            "wrong_output": len(wrong),
-            "error_rate": (len(fail) / len(self.samples)) if self.samples else 0.0,
-            "request_failure_rate": (
-                (len(request_failed) / len(self.samples)) if self.samples else 0.0
-            ),
-            "throughput_rps": len(self.samples) / wall_s,
-            "goodput_rps": len(ok) / wall_s,
-            "bytes_sent": sum(s.bytes_sent for s in self.samples),
-            "bytes_recv": sum(s.bytes_recv for s in self.samples),
-            "status_codes": dict(status_counts),
-            "errors": dict(error_counts),
-        }
-        if latencies:
-            out["latency_s"] = {
-                "mean": statistics.mean(latencies),
-                "min": min(latencies),
-                "max": max(latencies),
-                "p50": _pct(latencies, 50),
-                "p90": _pct(latencies, 90),
-                "p95": _pct(latencies, 95),
-                "p99": _pct(latencies, 99),
-                "p999": _pct(latencies, 99.9),
-            }
+        out = _summary_for_samples(self.samples, wall_s)
-        # Aggregate workload-specific `extra` metrics generically: mean + percentiles.
-        extras: dict[str, list[float]] = defaultdict(list)
-        for s in ok:
-            for k, v in s.extra.items():
-                if isinstance(v, (int, float)):
-                    extras[k].append(float(v))
-        if extras:
-            ext_summary = {}
-            for k, vals in extras.items():
-                ext_summary[k] = {
-                    "mean": statistics.mean(vals),
-                    "p50": _pct(vals, 50),
-                    "p90": _pct(vals, 90),
-                    "p99": _pct(vals, 99),
-                    "min": min(vals),
-                    "max": max(vals),
-                }
-            out["workload_metrics"] = ext_summary
+        # A mixed benchmark needs each lane's SLO signal independently.  Use
+        # the same wall-clock interval as the aggregate so lane throughput is
+        # directly comparable to the total, while latency and workload metrics
+        # remain scoped to that lane's samples.
+        lanes: dict[str, list[Sample]] = defaultdict(list)
+        for sample in self.samples:
+            lane = sample.meta.get("lane")
+            if isinstance(lane, str) and lane:
+                lanes[lane].append(sample)
+        if lanes:
+            out["lanes"] = {
+                name: _summary_for_samples(samples, wall_s)
+                for name, samples in sorted(lanes.items())
+            }
         # Monitor time-series: summarize each metric per monitor.
         if self.monitor_samples:
@@ -181,6 +139,22 @@ class MetricsAggregator:
                 lines.append(f"    {k}")
                 for kk in ("mean", "p50", "p90", "p99", "max"):
                     lines.append(f"      {kk:<6}: {v[kk]:.4f}")
+        if s.get("lanes"):
+            lines.append("")
+            lines.append("  lanes")
+            for name, lane in s["lanes"].items():
+                lines.append(
+                    f"    {name}: {lane['total_requests']} requests, "
+                    f"{lane['throughput_rps']:.2f} req/s, "
+                    f"{lane['success']} success"
+                )
+                for metric in ("ttft_s", "itl_ms_mean", "tokens_per_s"):
+                    values = lane.get("workload_metrics", {}).get(metric)
+                    if values:
+                        lines.append(
+                            f"      {metric}: p50={values['p50']:.4f}, "
+                            f"p99={values['p99']:.4f}"
+                        )
         if s.get("monitors"):
             for mon_name, mon in s["monitors"].items():
                 lines.append("")
@@ -223,6 +197,70 @@ class MetricsAggregator:
                     }) + "\n")
+def _summary_for_samples(samples: list[Sample], wall_s: float) -> dict:
+    """Summarize a sample subset over a shared benchmark wall-clock interval."""
+    ok = [s for s in samples if s.ok]
+    fail = [s for s in samples if not s.ok]
+    # Split fail into transport failures vs. delivered-but-graded-wrong.
+    wrong = [s for s in fail if s.request_ok]
+    request_failed = [s for s in fail if not s.request_ok]
+    latencies = [s.latency_s for s in ok]
+    status_counts = Counter(s.status for s in samples)
+    error_counts = Counter(s.error for s in fail if s.error)
+    out: dict = {
+        "wall_time_s": wall_s,
+        "total_requests": len(samples),
+        "success": len(ok),
+        "failed": len(fail),
+        "request_failed": len(request_failed),
+        "wrong_output": len(wrong),
+        "error_rate": (len(fail) / len(samples)) if samples else 0.0,
+        "request_failure_rate": (
+            (len(request_failed) / len(samples)) if samples else 0.0
+        ),
+        "throughput_rps": len(samples) / wall_s,
+        "goodput_rps": len(ok) / wall_s,
+        "bytes_sent": sum(s.bytes_sent for s in samples),
+        "bytes_recv": sum(s.bytes_recv for s in samples),
+        "status_codes": dict(status_counts),
+        "errors": dict(error_counts),
+    }
+    if latencies:
+        out["latency_s"] = {
+            "mean": statistics.mean(latencies),
+            "min": min(latencies),
+            "max": max(latencies),
+            "p50": _pct(latencies, 50),
+            "p90": _pct(latencies, 90),
+            "p95": _pct(latencies, 95),
+            "p99": _pct(latencies, 99),
+            "p999": _pct(latencies, 99.9),
+        }
+    # Aggregate workload-specific `extra` metrics generically: mean + percentiles.
+    extras: dict[str, list[float]] = defaultdict(list)
+    for s in ok:
+        for k, v in s.extra.items():
+            if isinstance(v, (int, float)):
+                extras[k].append(float(v))
+    if extras:
+        ext_summary = {}
+        for k, vals in extras.items():
+            ext_summary[k] = {
+                "mean": statistics.mean(vals),
+                "p50": _pct(vals, 50),
+                "p90": _pct(vals, 90),
+                "p99": _pct(vals, 99),
+                "min": min(vals),
+                "max": max(vals),
+            }
+        out["workload_metrics"] = ext_summary
+    return out
 def _safe_meta(meta: dict) -> dict:
     out = {}
     for k, v in meta.items():
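
The `metrics.py` refactor above groups samples by `meta["lane"]` and summarizes each group over the same wall-clock interval as the aggregate. A small sketch of that grouping follows, assuming a stripped-down `Sample` stand-in with only the fields used here. The helper name `lane_throughput` is hypothetical, and it computes only three of the fields that `_summary_for_samples` returns.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Sample:
    """Hypothetical stand-in for benchmaker's sample type."""
    ok: bool
    latency_s: float
    meta: dict = field(default_factory=dict)


def lane_throughput(samples: list[Sample], wall_s: float) -> dict[str, dict]:
    """Group samples by lane tag and rate each lane over the shared interval."""
    lanes: dict[str, list[Sample]] = defaultdict(list)
    for sample in samples:
        lane = sample.meta.get("lane")
        # Untagged or non-string lanes are skipped, as in the diff.
        if isinstance(lane, str) and lane:
            lanes[lane].append(sample)
    return {
        name: {
            "total_requests": len(group),
            "throughput_rps": len(group) / wall_s,
            "success": sum(1 for s in group if s.ok),
        }
        for name, group in sorted(lanes.items())
    }


samples = [
    Sample(True, 0.10, {"lane": "chat"}),
    Sample(True, 0.12, {"lane": "chat"}),
    Sample(False, 0.90, {"lane": "rag"}),
    Sample(True, 0.05),  # no lane tag: counted only in the aggregate
]
wall_s = 2.0
per_lane = lane_throughput(samples, wall_s)
total_rps = len(samples) / wall_s
```

Because every lane divides by the same `wall_s`, lane throughputs add up to the total whenever all samples carry a lane tag. Here the untagged sample makes the lane sum (1.5 req/s) fall short of the aggregate (2.0 req/s).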
