llama-benchy 0.1.0__tar.gz → 0.1.2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: llama-benchy
- Version: 0.1.0
+ Version: 0.1.2
  Summary: llama-bench style benchmarking tool for all OpenAI-compatible LLM endpoints
  Author: eugr
  License: MIT License
@@ -74,12 +74,26 @@ As of January 2nd, 2026, I wasn't able to find any existing benchmarking tool th
  - Supports executing a command after each run (e.g., to clear cache).
  - Configurable latency measurement mode.

+ # Current Limitations
+
+ - Evaluates against `/v1/chat/completions` endpoint only.
+ - Doesn't measure throughput in concurrency mode (coming later).
+ - Outputs results as a Markdown table only for now.
+
  ## Installation

- Install `uv` first: https://docs.astral.sh/uv/getting-started/installation/
+ Using `uv` is recommended. You can install `uv` here: https://docs.astral.sh/uv/getting-started/installation/

  ### Option 1: Run without installation using `uvx`

+ Run the release version from PyPI:
+
+ ```bash
+ uvx llama-benchy --base-url <ENDPOINT_URL> --model <MODEL_NAME>
+ ```
+
+ Run the latest version from the main branch:
+
  ```bash
  uvx --from git+https://github.com/eugr/llama-benchy llama-benchy --base-url <ENDPOINT_URL> --model <MODEL_NAME>
  ```
@@ -124,6 +138,14 @@ uv run llama-benchy --base-url <ENDPOINT_URL> --model <MODEL_NAME>

  ### Option 3: Install into system path

+ Release version from PyPI:
+
+ ```bash
+ uv pip install -U llama-benchy
+ ```
+
+ Current version from the main branch:
+
  ```bash
  uv pip install git+https://github.com/eugr/llama-benchy --system
  ```
@@ -231,7 +253,7 @@ When `--enable-prefix-caching` is used (with `--depth` > 0), the script performs
  2. **Inference**: Sends the same context (system message) followed by the actual prompt (user message). The server should reuse the cached context.
  - Reported as standard `pp{tokens} @ d{depth}` and `tg{tokens} @ d{depth}`.

- By comparing the `pp` speed of the Inference step with a non-cached run (or the Context Load step), you can observe the speedup provided by prefix caching.
+ In this case, the `pp` and `tg` speeds show the actual prompt processing and token generation speeds for a follow-up prompt with the context pre-filled.

  ### Example

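Read as a flow, the two steps above boil down to two chat-completion calls that share the same prefix. The sketch below illustrates that flow against a generic OpenAI-compatible server; it is not llama-benchy's own code, and `BASE_URL`, `MODEL`, `CONTEXT`, and `PROMPT` are placeholders.

```python
# Minimal sketch of the two-step prefix-caching flow described above,
# not the tool's implementation.
import requests

BASE_URL = "http://localhost:8000/v1"  # assumed OpenAI-compatible endpoint
MODEL = "my-model"                     # placeholder model name
CONTEXT = "<long shared prefix>"       # the cached context (system message)
PROMPT = "<actual benchmark prompt>"   # the follow-up prompt (user message)

def chat(messages, max_tokens):
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={"model": MODEL, "messages": messages, "max_tokens": max_tokens},
    )
    resp.raise_for_status()
    return resp.json()

# Step 1 - Context Load: send only the context so the server fills its prefix cache.
chat([{"role": "system", "content": CONTEXT}], max_tokens=1)

# Step 2 - Inference: resend the same context plus the prompt; the prefix should be
# reused, and pp/tg are reported for this call.
chat(
    [{"role": "system", "content": CONTEXT}, {"role": "user", "content": PROMPT}],
    max_tokens=128,
)
```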
@@ -29,12 +29,26 @@ As of January 2nd, 2026, I wasn't able to find any existing benchmarking tool th
  - Supports executing a command after each run (e.g., to clear cache).
  - Configurable latency measurement mode.

+ # Current Limitations
+
+ - Evaluates against `/v1/chat/completions` endpoint only.
+ - Doesn't measure throughput in concurrency mode (coming later).
+ - Outputs results as a Markdown table only for now.
+
  ## Installation

- Install `uv` first: https://docs.astral.sh/uv/getting-started/installation/
+ Using `uv` is recommended. You can install `uv` here: https://docs.astral.sh/uv/getting-started/installation/

  ### Option 1: Run without installation using `uvx`

+ Run the release version from PyPI:
+
+ ```bash
+ uvx llama-benchy --base-url <ENDPOINT_URL> --model <MODEL_NAME>
+ ```
+
+ Run the latest version from the main branch:
+
  ```bash
  uvx --from git+https://github.com/eugr/llama-benchy llama-benchy --base-url <ENDPOINT_URL> --model <MODEL_NAME>
  ```
@@ -79,6 +93,14 @@ uv run llama-benchy --base-url <ENDPOINT_URL> --model <MODEL_NAME>

  ### Option 3: Install into system path

+ Release version from PyPI:
+
+ ```bash
+ uv pip install -U llama-benchy
+ ```
+
+ Current version from the main branch:
+
  ```bash
  uv pip install git+https://github.com/eugr/llama-benchy --system
  ```
@@ -186,7 +208,7 @@ When `--enable-prefix-caching` is used (with `--depth` > 0), the script performs
  2. **Inference**: Sends the same context (system message) followed by the actual prompt (user message). The server should reuse the cached context.
  - Reported as standard `pp{tokens} @ d{depth}` and `tg{tokens} @ d{depth}`.

- By comparing the `pp` speed of the Inference step with a non-cached run (or the Context Load step), you can observe the speedup provided by prefix caching.
+ In this case, the `pp` and `tg` speeds show the actual prompt processing and token generation speeds for a follow-up prompt with the context pre-filled.

  ### Example

@@ -0,0 +1,8 @@
+ """
+ llama-benchy - llama-bench style benchmarking tool for all backends
+
+ This package provides a benchmarking tool for OpenAI-compatible LLM endpoints,
+ generating statistics similar to `llama-bench`.
+ """
+
+ from ._version import __version__
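With this change the package's public surface is just the version re-export. A quick, hedged way to confirm what it exposes once the package is installed:

```python
# Assumes llama-benchy 0.1.2 is installed; __version__ comes from the generated
# _version module re-exported in __init__.py above.
import llama_benchy

print(llama_benchy.__version__)  # -> "0.1.2"
```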
@@ -19,12 +19,13 @@ from transformers import AutoTokenizer
  import requests

  # Build number is now imported from __init__.py
- from . import __build__, __version__
+ from . import __version__



  def parse_arguments():
  parser = argparse.ArgumentParser(description="LLM Benchmark Script")
+ parser.add_argument("--version", action="version", version=f"%(prog)s {__version__}")
  parser.add_argument("--base-url", type=str, required=True, help="OpenAI compatible endpoint URL")
  parser.add_argument("--api-key", type=str, default="EMPTY", help="API Key for the endpoint")
  parser.add_argument("--model", type=str, required=True, help="Model name to use for benchmarking")
@@ -240,6 +241,9 @@ async def run_benchmark(session, base_url, api_key, model_name, context_text, pr
  "est_ppt": None,
  "e2e_ttft": None
  }
+
+ # DEBUG: Buffer to store first few lines of raw response
+ debug_lines = []

  try:
  payload = {
@@ -275,12 +279,22 @@ async def run_benchmark(session, base_url, api_key, model_name, context_text, pr
  while "\n" in buffer:
  line, buffer = buffer.split("\n", 1)
  line = line.strip()
- if not line or line == 'data: [DONE]':
+ if not line:
+ continue
+
+ # Capture first 5 lines for debugging if needed
+ if len(debug_lines) < 5:
+ debug_lines.append(line)
+
+ if line == 'data: [DONE]' or line == 'data:[DONE]':
  continue

- if line.startswith('data: '):
+ if line.startswith('data:'):
  try:
- chunk = json.loads(line[6:])
+ # Strip 'data:' and potential whitespace
+ json_str = line[5:].strip()
+ chunk = json.loads(json_str)
+
  if 'usage' in chunk:
  prompt_usage_tokens = chunk['usage'].get('prompt_tokens', 0)

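The hunk above relaxes the `'data: '` prefix check to `'data:'` plus a `strip()`, so chunks parse whether or not the server puts a space after the colon. A self-contained sketch of that tolerant SSE line handling, separate from the tool's streaming loop:

```python
import json

def parse_sse_line(line: str):
    """Return the decoded JSON chunk for a 'data:' line, or None to skip it."""
    line = line.strip()
    if not line or line in ("data: [DONE]", "data:[DONE]"):
        return None
    if line.startswith("data:"):
        # len("data:") == 5; strip() tolerates an optional space after the colon
        return json.loads(line[5:].strip())
    return None

print(parse_sse_line('data: {"choices": []}'))  # with a space -> {'choices': []}
print(parse_sse_line('data:{"choices": []}'))   # without one  -> {'choices': []}
print(parse_sse_line('data: [DONE]'))           # terminator   -> None
```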
@@ -291,8 +305,9 @@ async def run_benchmark(session, base_url, api_key, model_name, context_text, pr
  delta = chunk['choices'][0].get('delta', {})
  content = delta.get('content')
  reasoning_content = delta.get('reasoning_content')
+ reasoning = delta.get('reasoning')

- if content or reasoning_content:
+ if content or reasoning_content or reasoning:
  if token_count == 0:
  first_token_time = chunk_time
  e2e_ttft = first_token_time - start_time
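The added `reasoning` field covers backends that stream reasoning text under that key rather than `reasoning_content`. A small illustration of the delta shapes the updated check now counts as generated output (which key a given server emits is backend-dependent):

```python
# Example streamed delta payloads; any non-empty field below now starts/advances
# the token timing in the loop shown above.
deltas = [
    {"content": "Hello"},                # plain completion text
    {"reasoning_content": "Thinking"},   # reasoning under 'reasoning_content'
    {"reasoning": "Thinking"},           # reasoning under 'reasoning' (newly handled)
]

for delta in deltas:
    counted = bool(
        delta.get("content") or delta.get("reasoning_content") or delta.get("reasoning")
    )
    print(counted)  # True for all three
```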
@@ -306,6 +321,10 @@ async def run_benchmark(session, base_url, api_key, model_name, context_text, pr

  end_time = time.perf_counter()

+ # DEBUG: Print warning if no tokens were collected
+ if token_count == 0:
+ print(f"\n[Warning] Run generated 0 tokens. Raw response sample: {debug_lines}")
+
  if token_count > 0:
  # Calculate decode time (time for subsequent tokens)
  # If only 1 token, decode_time is effectively 0, so we can't calculate inter-token speed
@@ -369,11 +388,10 @@ async def main_async():
  print("Error: --enable-prefix-caching and --no-cache are incompatible.")
  return

- build_number = __build__
  version_number = __version__

  current_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
- print(f"llama-benchy ({version_number}.{build_number})")
+ print(f"llama-benchy ({version_number})")
  print(f"Date: {current_time}")
  print(f"Benchmarking model: {args.model} at {args.base_url}")

@@ -522,7 +540,7 @@ async def main_async():
  print("No results collected. Check if the model is generating tokens.")
  else:
  print(tabulate(results, headers=["model", "test", "t/s", "ttfr (ms)", "est_ppt (ms)", "e2e_ttft (ms)"], tablefmt="pipe", colalign=("left", "right", "right", "right", "right", "right")))
- print(f"\nllama-benchy ({version_number}.{build_number})")
+ print(f"\nllama-benchy ({version_number})")
  print(f"date: {current_time} | latency mode: {args.latency_mode}")


@@ -28,7 +28,7 @@ version_tuple: VERSION_TUPLE
  commit_id: COMMIT_ID
  __commit_id__: COMMIT_ID

- __version__ = version = '0.1.0'
- __version_tuple__ = version_tuple = (0, 1, 0)
+ __version__ = version = '0.1.2'
+ __version_tuple__ = version_tuple = (0, 1, 2)

  __commit_id__ = commit_id = None
@@ -1,24 +0,0 @@
- """
- llama-benchy - llama-bench style benchmarking tool for all backends
-
- This package provides a benchmarking tool for OpenAI-compatible LLM endpoints,
- generating statistics similar to `llama-bench`.
- """
-
- from ._version import __version__
-
- # Extract build number from the version string
- # Version format is like: '0.1.dev34+g33f03d886.d20260105'
- # We want to extract the git hash part: '33f03d886'
- __build__ = "unknown"
- if "+" in __version__:
- try:
- # Extract the part after the '+' and before the '.'
- build_part = __version__.split("+")[1].split(".")[0]
- # Remove the 'g' prefix if it exists
- if build_part.startswith("g"):
- __build__ = build_part[1:]
- else:
- __build__ = build_part
- except (IndexError, AttributeError):
- pass
File without changes
File without changes