guidellm 0.1.0__tar.gz → 0.2.0rc20250418__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Note: this version of guidellm has been flagged as a potentially problematic release on its registry page.
- {guidellm-0.1.0/src/guidellm.egg-info → guidellm-0.2.0rc20250418}/PKG-INFO +95 -78
- guidellm-0.2.0rc20250418/README.md +201 -0
- guidellm-0.2.0rc20250418/pyproject.toml +74 -0
- guidellm-0.2.0rc20250418/src/guidellm/__init__.py +51 -0
- guidellm-0.2.0rc20250418/src/guidellm/__main__.py +294 -0
- guidellm-0.2.0rc20250418/src/guidellm/backend/__init__.py +23 -0
- guidellm-0.2.0rc20250418/src/guidellm/backend/backend.py +238 -0
- guidellm-0.2.0rc20250418/src/guidellm/backend/openai.py +578 -0
- guidellm-0.2.0rc20250418/src/guidellm/backend/response.py +132 -0
- guidellm-0.2.0rc20250418/src/guidellm/benchmark/__init__.py +73 -0
- guidellm-0.2.0rc20250418/src/guidellm/benchmark/aggregator.py +760 -0
- guidellm-0.2.0rc20250418/src/guidellm/benchmark/benchmark.py +838 -0
- guidellm-0.2.0rc20250418/src/guidellm/benchmark/benchmarker.py +334 -0
- guidellm-0.2.0rc20250418/src/guidellm/benchmark/entrypoints.py +141 -0
- guidellm-0.2.0rc20250418/src/guidellm/benchmark/output.py +946 -0
- guidellm-0.2.0rc20250418/src/guidellm/benchmark/profile.py +409 -0
- guidellm-0.2.0rc20250418/src/guidellm/benchmark/progress.py +720 -0
- {guidellm-0.1.0 → guidellm-0.2.0rc20250418}/src/guidellm/config.py +34 -56
- guidellm-0.2.0rc20250418/src/guidellm/data/__init__.py +4 -0
- guidellm-0.2.0rc20250418/src/guidellm/data/prideandprejudice.txt.gz +0 -0
- guidellm-0.2.0rc20250418/src/guidellm/dataset/__init__.py +22 -0
- guidellm-0.2.0rc20250418/src/guidellm/dataset/creator.py +213 -0
- guidellm-0.2.0rc20250418/src/guidellm/dataset/entrypoints.py +42 -0
- guidellm-0.2.0rc20250418/src/guidellm/dataset/file.py +90 -0
- guidellm-0.2.0rc20250418/src/guidellm/dataset/hf_datasets.py +62 -0
- guidellm-0.2.0rc20250418/src/guidellm/dataset/in_memory.py +132 -0
- guidellm-0.2.0rc20250418/src/guidellm/dataset/synthetic.py +262 -0
- guidellm-0.2.0rc20250418/src/guidellm/objects/__init__.py +18 -0
- guidellm-0.2.0rc20250418/src/guidellm/objects/pydantic.py +60 -0
- guidellm-0.2.0rc20250418/src/guidellm/objects/statistics.py +947 -0
- guidellm-0.2.0rc20250418/src/guidellm/request/__init__.py +15 -0
- guidellm-0.2.0rc20250418/src/guidellm/request/loader.py +281 -0
- guidellm-0.2.0rc20250418/src/guidellm/request/request.py +79 -0
- guidellm-0.2.0rc20250418/src/guidellm/scheduler/__init__.py +52 -0
- guidellm-0.2.0rc20250418/src/guidellm/scheduler/result.py +137 -0
- guidellm-0.2.0rc20250418/src/guidellm/scheduler/scheduler.py +382 -0
- guidellm-0.2.0rc20250418/src/guidellm/scheduler/strategy.py +493 -0
- guidellm-0.2.0rc20250418/src/guidellm/scheduler/types.py +7 -0
- guidellm-0.2.0rc20250418/src/guidellm/scheduler/worker.py +511 -0
- guidellm-0.2.0rc20250418/src/guidellm/utils/__init__.py +27 -0
- guidellm-0.2.0rc20250418/src/guidellm/utils/colors.py +8 -0
- guidellm-0.2.0rc20250418/src/guidellm/utils/hf_transformers.py +35 -0
- guidellm-0.2.0rc20250418/src/guidellm/utils/random.py +43 -0
- guidellm-0.2.0rc20250418/src/guidellm/utils/text.py +216 -0
- {guidellm-0.1.0 → guidellm-0.2.0rc20250418/src/guidellm.egg-info}/PKG-INFO +95 -78
- guidellm-0.2.0rc20250418/src/guidellm.egg-info/SOURCES.txt +52 -0
- guidellm-0.2.0rc20250418/src/guidellm.egg-info/entry_points.txt +2 -0
- {guidellm-0.1.0 → guidellm-0.2.0rc20250418}/src/guidellm.egg-info/requires.txt +5 -1
- guidellm-0.1.0/README.md +0 -189
- guidellm-0.1.0/pyproject.toml +0 -206
- guidellm-0.1.0/src/guidellm/__init__.py +0 -19
- guidellm-0.1.0/src/guidellm/backend/__init__.py +0 -10
- guidellm-0.1.0/src/guidellm/backend/base.py +0 -320
- guidellm-0.1.0/src/guidellm/backend/openai.py +0 -168
- guidellm-0.1.0/src/guidellm/core/__init__.py +0 -24
- guidellm-0.1.0/src/guidellm/core/distribution.py +0 -190
- guidellm-0.1.0/src/guidellm/core/report.py +0 -321
- guidellm-0.1.0/src/guidellm/core/request.py +0 -44
- guidellm-0.1.0/src/guidellm/core/result.py +0 -545
- guidellm-0.1.0/src/guidellm/core/serializable.py +0 -169
- guidellm-0.1.0/src/guidellm/executor/__init__.py +0 -10
- guidellm-0.1.0/src/guidellm/executor/base.py +0 -213
- guidellm-0.1.0/src/guidellm/executor/profile_generator.py +0 -343
- guidellm-0.1.0/src/guidellm/main.py +0 -336
- guidellm-0.1.0/src/guidellm/request/__init__.py +0 -13
- guidellm-0.1.0/src/guidellm/request/base.py +0 -194
- guidellm-0.1.0/src/guidellm/request/emulated.py +0 -391
- guidellm-0.1.0/src/guidellm/request/file.py +0 -76
- guidellm-0.1.0/src/guidellm/request/transformers.py +0 -100
- guidellm-0.1.0/src/guidellm/scheduler/__init__.py +0 -4
- guidellm-0.1.0/src/guidellm/scheduler/base.py +0 -374
- guidellm-0.1.0/src/guidellm/scheduler/load_generator.py +0 -196
- guidellm-0.1.0/src/guidellm/utils/__init__.py +0 -40
- guidellm-0.1.0/src/guidellm/utils/injector.py +0 -70
- guidellm-0.1.0/src/guidellm/utils/progress.py +0 -196
- guidellm-0.1.0/src/guidellm/utils/text.py +0 -455
- guidellm-0.1.0/src/guidellm/utils/transformers.py +0 -151
- guidellm-0.1.0/src/guidellm.egg-info/SOURCES.txt +0 -39
- guidellm-0.1.0/src/guidellm.egg-info/entry_points.txt +0 -3
- {guidellm-0.1.0 → guidellm-0.2.0rc20250418}/LICENSE +0 -0
- {guidellm-0.1.0 → guidellm-0.2.0rc20250418}/MANIFEST.in +0 -0
- {guidellm-0.1.0 → guidellm-0.2.0rc20250418}/setup.cfg +0 -0
- {guidellm-0.1.0 → guidellm-0.2.0rc20250418}/src/guidellm/logger.py +0 -0
- {guidellm-0.1.0 → guidellm-0.2.0rc20250418}/src/guidellm.egg-info/dependency_links.txt +0 -0
- {guidellm-0.1.0 → guidellm-0.2.0rc20250418}/src/guidellm.egg-info/top_level.txt +0 -0
{guidellm-0.1.0/src/guidellm.egg-info → guidellm-0.2.0rc20250418}/PKG-INFO

@@ -1,6 +1,6 @@
-Metadata-Version: 2.
+Metadata-Version: 2.4
 Name: guidellm
-Version: 0.
+Version: 0.2.0rc20250418
 Summary: Guidance platform for deploying and managing large language models.
 Author: Neuralmagic, Inc.
 License: Apache License
@@ -206,15 +206,17 @@ License: Apache License
 limitations under the License.
 
 Project-URL: homepage, https://github.com/neuralmagic/guidellm
-Requires-Python: <4.0,>=3.
+Requires-Python: <4.0,>=3.9.0
 Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: click
 Requires-Dist: datasets
 Requires-Dist: ftfy>=6.0.0
+Requires-Dist: httpx[http2]<1.0.0
 Requires-Dist: loguru
 Requires-Dist: numpy
-Requires-Dist:
+Requires-Dist: pillow
+Requires-Dist: protobuf
 Requires-Dist: pydantic>=2.0.0
 Requires-Dist: pydantic-settings>=2.0.0
 Requires-Dist: pyyaml>=6.0.0
@@ -226,12 +228,14 @@ Requires-Dist: pre-commit~=3.5.0; extra == "dev"
 Requires-Dist: scipy~=1.10; extra == "dev"
 Requires-Dist: sphinx~=7.1.2; extra == "dev"
 Requires-Dist: tox~=4.16.0; extra == "dev"
+Requires-Dist: lorem~=0.1.1; extra == "dev"
 Requires-Dist: pytest~=8.2.2; extra == "dev"
 Requires-Dist: pytest-asyncio~=0.23.8; extra == "dev"
 Requires-Dist: pytest-cov~=5.0.0; extra == "dev"
 Requires-Dist: pytest-mock~=3.14.0; extra == "dev"
 Requires-Dist: pytest-rerunfailures~=14.0; extra == "dev"
 Requires-Dist: requests-mock~=1.12.1; extra == "dev"
+Requires-Dist: respx~=0.22.0; extra == "dev"
 Requires-Dist: mypy~=1.10.1; extra == "dev"
 Requires-Dist: ruff~=0.5.2; extra == "dev"
 Requires-Dist: mdformat~=0.7.17; extra == "dev"
@@ -242,11 +246,12 @@ Requires-Dist: types-click~=7.1.8; extra == "dev"
 Requires-Dist: types-PyYAML~=6.0.1; extra == "dev"
 Requires-Dist: types-requests~=2.32.0; extra == "dev"
 Requires-Dist: types-toml; extra == "dev"
+Dynamic: license-file
 
 <p align="center">
 <picture>
-<source media="(prefers-color-scheme: dark)" srcset="https://
-<img alt="GuideLLM Logo" src="https://
+<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/neuralmagic/guidellm/main/docs/assets/guidellm-logo-light.png">
+<img alt="GuideLLM Logo" src="https://raw.githubusercontent.com/neuralmagic/guidellm/main/docs/assets/guidellm-logo-dark.png" width=55%>
 </picture>
 </p>
@@ -254,25 +259,25 @@ Requires-Dist: types-toml; extra == "dev"
 Scale Efficiently: Evaluate and Optimize Your LLM Deployments for Real-World Inference
 </h3>
 
-[](https://github.com/neuralmagic/guidellm/releases) [](https://github.com/neuralmagic/guidellm/tree/main/docs) [](https://github.com/neuralmagic/guidellm/blob/main/LICENSE) [](https://pypi.python.org/pypi/guidellm)
+[](https://github.com/neuralmagic/guidellm/releases) [](https://github.com/neuralmagic/guidellm/tree/main/docs) [](https://github.com/neuralmagic/guidellm/blob/main/LICENSE) [](https://pypi.python.org/pypi/guidellm) [](https://pypi.python.org/pypi/guidellm) [](https://github.com/neuralmagic/guidellm/actions/workflows/nightly.yml)
 
 ## Overview
 
 <p>
 <picture>
-<source media="(prefers-color-scheme: dark)" srcset="https://
-<img alt="GuideLLM User Flows" src="https://
+<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/neuralmagic/guidellm/main/docs/assets/guidellm-user-flows-dark.png">
+<img alt="GuideLLM User Flows" src="https://raw.githubusercontent.com/neuralmagic/guidellm/main/docs/assets/guidellm-user-flows-light.png">
 </picture>
 </p>
 
-**GuideLLM** is a
+**GuideLLM** is a platform for evaluating and optimizing the deployment of large language models (LLMs). By simulating real-world inference workloads, GuideLLM enables users to assess the performance, resource requirements, and cost implications of deploying LLMs on various hardware configurations. This approach ensures efficient, scalable, and cost-effective LLM inference serving while maintaining high service quality.
 
 ### Key Features
 
 - **Performance Evaluation:** Analyze LLM inference under different load scenarios to ensure your system meets your service level objectives (SLOs).
 - **Resource Optimization:** Determine the most suitable hardware configurations for running your models effectively.
 - **Cost Estimation:** Understand the financial impact of different deployment strategies and make informed decisions to minimize costs.
-- **Scalability Testing:** Simulate scaling to handle large numbers of concurrent users without degradation
+- **Scalability Testing:** Simulate scaling to handle large numbers of concurrent users without performance degradation.
 
 ## Getting Started
 
@@ -281,21 +286,27 @@ Scale Efficiently: Evaluate and Optimize Your LLM Deployments for Real-World Inf
 Before installing, ensure you have the following prerequisites:
 
 - OS: Linux or MacOS
-- Python: 3.
+- Python: 3.9 – 3.13
 
-
+The latest GuideLLM release can be installed using pip:
 
 ```bash
 pip install guidellm
 ```
 
+Or from source code using pip:
+
+```bash
+pip install git+https://github.com/neuralmagic/guidellm.git
+```
+
 For detailed installation instructions and requirements, see the [Installation Guide](https://github.com/neuralmagic/guidellm/tree/main/docs/install.md).
 
 ### Quick Start
 
 #### 1. Start an OpenAI Compatible Server (vLLM)
 
-GuideLLM requires an OpenAI-compatible server to run evaluations. [vLLM](https://github.com/vllm-project/vllm) is recommended for this purpose.
+GuideLLM requires an OpenAI-compatible server to run evaluations. [vLLM](https://github.com/vllm-project/vllm) is recommended for this purpose. After installing vLLM on your desired server (`pip install vllm`), start a vLLM server with a Llama 3.1 8B quantized model by running the following command:
 
 ```bash
 vllm serve "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"
@@ -303,107 +314,114 @@ vllm serve "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"
 
 For more information on starting a vLLM server, see the [vLLM Documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html).
 
-
+For information on starting other supported inference servers or platforms, see the [Supported Backends documentation](https://github.com/neuralmagic/guidellm/tree/main/docs/backends.md).
+
+#### 2. Run a GuideLLM Benchmark
 
-To run a GuideLLM
+To run a GuideLLM benchmark, use the `guidellm benchmark` command with the target set to an OpenAI-compatible server. For this example, the target is set to 'http://localhost:8000', assuming that vLLM is active and running on the same server. Otherwise, update it to the appropriate location. By default, GuideLLM automatically determines the model available on the server and uses it. To target a different model, pass the desired name with the `--model` argument. Additionally, the `--rate-type` is set to `sweep`, which automatically runs a range of benchmarks to determine the minimum and maximum rates that the server and model can support. Each benchmark run under the sweep will run for 30 seconds, as set by the `--max-seconds` argument. Finally, `--data` is set to a synthetic dataset with 256 prompt tokens and 128 output tokens per request. For more arguments, supported scenarios, and configurations, jump to the [Configurations Section](#configurations) or run `guidellm benchmark --help`.
+
+Now, to start benchmarking, run the following command:
 
 ```bash
-guidellm \
-  --target "http://localhost:8000
-  --
-  --
-  --data "prompt_tokens=
+guidellm benchmark \
+  --target "http://localhost:8000" \
+  --rate-type sweep \
+  --max-seconds 30 \
+  --data "prompt_tokens=256,output_tokens=128"
 ```
 
-The above command will begin the evaluation and
+The above command will begin the evaluation and provide progress updates similar to the following: <img src= "https://raw.githubusercontent.com/neuralmagic/guidellm/main/docs/assets/sample-benchmarks.gif"/>
 
-
+#### 3. Analyze the Results
 
-
-- The `--model` flag specifies the model to evaluate. The model name should match the name of the model deployed on the server
-- By default, GuideLLM will run a `sweep` of performance evaluations across different request rates, each lasting 120 seconds and the results are printed out to the terminal.
+After the evaluation is completed, GuideLLM will summarize the results into three sections:
 
-
+1. Benchmarks Metadata: A summary of the benchmark run and the arguments used to create it, including the server, data, profile, and more.
+2. Benchmarks Info: A high-level view of each benchmark and the requests that were run, including the type, duration, request statuses, and number of tokens.
+3. Benchmarks Stats: A summary of the statistics for each benchmark run, including the request rate, concurrency, latency, and token-level metrics such as TTFT, ITL, and more.
 
-
+The sections will look similar to the following: <img alt="Sample GuideLLM benchmark output" src="https://raw.githubusercontent.com/neuralmagic/guidellm/main/docs/assets/sample-output.png" />
 
-
+For more details about the metrics and definitions, please refer to the [Metrics documentation](https://raw.githubusercontent.com/neuralmagic/guidellm/main/docs/metrics.md).
 
-
+#### 4. Explore the Results File
 
-
+By default, the full results, including complete statistics and request data, are saved to a file `benchmarks.json` in the current working directory. This file can be used for further analysis or reporting, and additionally can be reloaded into Python for further analysis using the `guidellm.benchmark.GenerativeBenchmarksReport` class. You can specify a different file name and extension with the `--output` argument.
 
-
+For more details about the supported output file types, please take a look at the [Outputs documentation](raw.githubusercontent.com/neuralmagic/guidellm/main/docs/outputs.md).
 
-####
+#### 5. Use the Results
 
 The results from GuideLLM are used to optimize your LLM deployment for performance, resource efficiency, and cost. By analyzing the performance metrics, you can identify bottlenecks, determine the optimal request rate, and select the most cost-effective hardware configuration for your deployment.
 
-For example,
+For example, when deploying a chat application, we likely want to ensure that our time to first token (TTFT) and inter-token latency (ITL) are under certain thresholds to meet our service level objectives (SLOs) or service level agreements (SLAs). For example, setting TTFT to 200ms and ITL 25ms for the sample data provided in the example above, we can see that even though the server is capable of handling up to 13 requests per second, we would only be able to meet our SLOs for 99% of users at a request rate of 3.5 requests per second. If we relax our constraints on ITL to 50 ms, then we can meet the TTFT SLA for 99% of users at a request rate of approximately 10 requests per second.
 
-
+For further details on determining the optimal request rate and SLOs, refer to the [SLOs documentation](https://raw.githubusercontent.com/neuralmagic/guidellm/main/docs/service_level_objectives.md).
 
 ### Configurations
 
-GuideLLM
+GuideLLM offers a range of configurations through both the benchmark CLI command and environment variables, which provide default values and more granular controls. The most common configurations are listed below. A complete list is easily accessible, though, by running `guidellm benchmark --help` or `guidellm config` respectively.
 
-
+#### Benchmark CLI
 
-
-- `--rate-type sweep`: (default) Sweep runs through the full range of the server's performance, starting with a `synchronous` rate, then `throughput`, and finally, 10 `constant` rates between the min and max request rate found.
-- `--rate-type synchronous`: Synchronous runs requests synchronously, one after the other.
-- `--rate-type throughput`: Throughput runs requests in a throughput manner, sending requests as fast as possible.
-- `--rate-type constant`: Constant runs requests at a constant rate. Specify the request rate per second with the `--rate` argument. For example, `--rate 10` or multiple rates with `--rate 10 --rate 20 --rate 30`.
-- `--rate-type poisson`: Poisson draws from a Poisson distribution with the mean at the specified rate, adding some real-world variance to the runs. Specify the request rate per second with the `--rate` argument. For example, `--rate 10` or multiple rates with `--rate 10 --rate 20 --rate 30`.
-- `--data-type`: The data to use for the benchmark. Options include `emulated`, `transformers`, and `file`.
-- `--data-type emulated`: Emulated supports an EmulationConfig in string or file format for the `--data` argument to generate fake data. Specify the number of prompt tokens at a minimum and optionally the number of output tokens and other parameters for variance in the length. For example, `--data "prompt_tokens=128"`, `--data "prompt_tokens=128,generated_tokens=128" `, or `--data "prompt_tokens=128,prompt_tokens_variance=10" `.
-- `--data-type file`: File supports a file path or URL to a file for the `--data` argument. The file should contain data encoded as a CSV, JSONL, TXT, or JSON/YAML file with a single prompt per line for CSV, JSONL, and TXT or a list of prompts for JSON/YAML. For example, `--data "data.txt"` where data.txt contents are `"prompt1\nprompt2\nprompt3"`.
-- `--data-type transformers`: Transformers supports a dataset name or file path for the `--data` argument. For example, `--data "neuralmagic/LLM_compression_calibration"`.
-- `--max-seconds`: The maximum number of seconds to run each benchmark. The default is 120 seconds.
-- `--max-requests`: The maximum number of requests to run in each benchmark.
+The `guidellm benchmark` command is used to run benchmarks against a generative AI backend/server. The command accepts a variety of arguments to customize the benchmark run. The most common arguments include:
 
-
+- `--target`: Specifies the target path for the backend to run benchmarks against. For example, `http://localhost:8000`. This is required to define the server endpoint.
 
-
-guidellm --help
-```
+- `--model`: Allows selecting a specific model from the server. If not provided, it defaults to the first model available on the server. Useful when multiple models are hosted on the same server.
 
-
+- `--processor`: Used only for synthetic data creation or when the token source configuration is set to local for calculating token metrics locally. It must match the model's processor or tokenizer to ensure compatibility and correctness. This supports either a HuggingFace model ID or a local path to a processor or tokenizer.
 
-
-
-
+- `--data`: Specifies the dataset to use. This can be a HuggingFace dataset ID, a local path to a dataset, or standard text files such as CSV, JSONL, and more. Additionally, synthetic data configurations can be provided using JSON or key-value strings. Synthetic data options include:
+
+  - `prompt_tokens`: Average number of tokens for prompts.
+  - `output_tokens`: Average number of tokens for outputs.
+  - `TYPE_stdev`, `TYPE_min`, `TYPE_max`: Standard deviation, minimum, and maximum values for the specified type (e.g., `prompt_tokens`, `output_tokens`). If not provided, will use the provided tokens value only.
+  - `samples`: Number of samples to generate, defaults to 1000.
+  - `source`: Source text data for generation, defaults to a local copy of Pride and Prejudice.
+
+- `--data-args`: A JSON string used to specify the columns to source data from (e.g., `prompt_column`, `output_tokens_count_column`) and additional arguments to pass into the HuggingFace datasets constructor.
+
+- `--data-sampler`: Enables applying `random` shuffling or sampling to the dataset. If not set, no sampling is used.
+
+- `--rate-type`: Defines the type of benchmark to run (default sweep). Supported types include:
+
+  - `synchronous`: Runs a single stream of requests one at a time. `--rate` must not be set for this mode.
+  - `throughput`: Runs all requests in parallel to measure the maximum throughput for the server (bounded by GUIDELLM\_\_MAX_CONCURRENCY config argument). `--rate` must not be set for this mode.
+  - `concurrent`: Runs a fixed number of streams of requests in parallel. `--rate` must be set to the desired concurrency level/number of streams.
+  - `constant`: Sends requests asynchronously at a constant rate set by `--rate`.
+  - `poisson`: Sends requests at a rate following a Poisson distribution with the mean set by `--rate`.
+  - `sweep`: Automatically determines the minimum and maximum rates the server can support by running synchronous and throughput benchmarks, and then runs a series of benchmarks equally spaced between the two rates. The number of benchmarks is set by `--rate` (default is 10).
+
+- `--max-seconds`: Sets the maximum duration (in seconds) for each benchmark run. If not specified, the benchmark will run until the dataset is exhausted or the `--max-requests` limit is reached.
 
-
+- `--max-requests`: Sets the maximum number of requests for each benchmark run. If not provided, the benchmark will run until `--max-seconds` is reached or the dataset is exhausted.
+
+- `--warmup-percent`: Specifies the percentage of the benchmark to treat as a warmup phase. Requests during this phase are excluded from the final results.
+
+- `--cooldown-percent`: Specifies the percentage of the benchmark to treat as a cooldown phase. Requests during this phase are excluded from the final results.
+
+- `--output-path`: Defines the path to save the benchmark results. Supports JSON, YAML, or CSV formats. If a directory is provided, the results will be saved as `benchmarks.json` in that directory. If not set, the results will be saved in the current working directory.
 
 ## Resources
 
 ### Documentation
 
-Our comprehensive documentation
+Our comprehensive documentation offers detailed guides and resources to help you maximize the benefits of GuideLLM. Whether just getting started or looking to dive deeper into advanced topics, you can find what you need in our [documentation](https://github.com/neuralmagic/guidellm/tree/main/docs).
 
 ### Core Docs
 
 - [**Installation Guide**](https://github.com/neuralmagic/guidellm/tree/main/docs/install.md) - This guide provides step-by-step instructions for installing GuideLLM, including prerequisites and setup tips.
+- [**Backends Guide**](https://github.com/neuralmagic/guidellm/tree/main/docs/backends.md) - A comprehensive overview of supported backends and how to set them up for use with GuideLLM.
+- [**Metrics Guide**](https://github.com/neuralmagic/guidellm/tree/main/docs/metrics.md) - Detailed explanations of the metrics used in GuideLLM, including definitions and how to interpret them.
+- [**Outputs Guide**](https://github.com/neuralmagic/guidellm/tree/main/docs/outputs.md) - Information on the different output formats supported by GuideLLM and how to use them.
 - [**Architecture Overview**](https://github.com/neuralmagic/guidellm/tree/main/docs/architecture.md) - A detailed look at GuideLLM's design, components, and how they interact.
-- [**CLI Guide**](https://github.com/neuralmagic/guidellm/tree/main/docs/guides/cli.md) - Comprehensive usage information for running GuideLLM via the command line, including available commands and options.
-- [**Configuration Guide**](https://github.com/neuralmagic/guidellm/tree/main/docs/guides/configuration.md) - Instructions on configuring GuideLLM to suit various deployment needs and performance goals.
 
 ### Supporting External Documentation
 
 - [**vLLM Documentation**](https://vllm.readthedocs.io/en/latest/) - Official vLLM documentation provides insights into installation, usage, and supported models.
 
-###
-
-Visit our [GitHub Releases page](https://github.com/neuralmagic/guidellm/releases) and review the release notes to stay updated with the latest releases.
-
-### License
-
-GuideLLM is licensed under the [Apache License 2.0](https://github.com/neuralmagic/guidellm/blob/main/LICENSE).
-
-## Community
-
-### Contribute
+### Contribution Docs
 
 We appreciate contributions to the code, examples, integrations, documentation, bug reports, and feature requests! Your feedback and involvement are crucial in helping GuideLLM grow and improve. Below are some ways you can get involved:
 
@@ -411,14 +429,13 @@ We appreciate contributions to the code, examples, integrations, documentation,
 - [**CONTRIBUTING.md**](https://github.com/neuralmagic/guidellm/blob/main/CONTRIBUTING.md) - Guidelines for contributing to the project, including code standards, pull request processes, and more.
 - [**CODE_OF_CONDUCT.md**](https://github.com/neuralmagic/guidellm/blob/main/CODE_OF_CONDUCT.md) - Our expectations for community behavior to ensure a welcoming and inclusive environment.
 
-###
+### Releases
+
+Visit our [GitHub Releases page](https://github.com/neuralmagic/guidellm/releases) and review the release notes to stay updated with the latest releases.
 
-
+### License
 
-
-- [**GitHub Issues**](https://github.com/neuralmagic/guidellm/issues) - Report bugs, request features, or browse existing issues. Your feedback helps us improve GuideLLM.
-- [**Subscribe to Updates**](https://neuralmagic.com/subscribe/) - Sign up for the latest news, announcements, and updates about GuideLLM, webinars, events, and more.
-- [**Contact Us**](http://neuralmagic.com/contact/) - Use our contact form for general questions about Neural Magic or GuideLLM.
+GuideLLM is licensed under the [Apache License 2.0](https://github.com/neuralmagic/guidellm/blob/main/LICENSE).
 
 ### Cite
 
guidellm-0.2.0rc20250418/README.md

@@ -0,0 +1,201 @@
+<p align="center">
+<picture>
+<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/neuralmagic/guidellm/main/docs/assets/guidellm-logo-light.png">
+<img alt="GuideLLM Logo" src="https://raw.githubusercontent.com/neuralmagic/guidellm/main/docs/assets/guidellm-logo-dark.png" width=55%>
+</picture>
+</p>
+
+<h3 align="center">
+Scale Efficiently: Evaluate and Optimize Your LLM Deployments for Real-World Inference
+</h3>
+
+[](https://github.com/neuralmagic/guidellm/releases) [](https://github.com/neuralmagic/guidellm/tree/main/docs) [](https://github.com/neuralmagic/guidellm/blob/main/LICENSE) [](https://pypi.python.org/pypi/guidellm) [](https://pypi.python.org/pypi/guidellm) [](https://github.com/neuralmagic/guidellm/actions/workflows/nightly.yml)
+
+## Overview
+
+<p>
+<picture>
+<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/neuralmagic/guidellm/main/docs/assets/guidellm-user-flows-dark.png">
+<img alt="GuideLLM User Flows" src="https://raw.githubusercontent.com/neuralmagic/guidellm/main/docs/assets/guidellm-user-flows-light.png">
+</picture>
+</p>
+
+**GuideLLM** is a platform for evaluating and optimizing the deployment of large language models (LLMs). By simulating real-world inference workloads, GuideLLM enables users to assess the performance, resource requirements, and cost implications of deploying LLMs on various hardware configurations. This approach ensures efficient, scalable, and cost-effective LLM inference serving while maintaining high service quality.
+
+### Key Features
+
+- **Performance Evaluation:** Analyze LLM inference under different load scenarios to ensure your system meets your service level objectives (SLOs).
+- **Resource Optimization:** Determine the most suitable hardware configurations for running your models effectively.
+- **Cost Estimation:** Understand the financial impact of different deployment strategies and make informed decisions to minimize costs.
+- **Scalability Testing:** Simulate scaling to handle large numbers of concurrent users without performance degradation.
+
+## Getting Started
+
+### Installation
+
+Before installing, ensure you have the following prerequisites:
+
+- OS: Linux or MacOS
+- Python: 3.9 – 3.13
+
+The latest GuideLLM release can be installed using pip:
+
+```bash
+pip install guidellm
+```
+
+Or from source code using pip:
+
+```bash
+pip install git+https://github.com/neuralmagic/guidellm.git
+```
+
+For detailed installation instructions and requirements, see the [Installation Guide](https://github.com/neuralmagic/guidellm/tree/main/docs/install.md).
+
+### Quick Start
+
+#### 1. Start an OpenAI Compatible Server (vLLM)
+
+GuideLLM requires an OpenAI-compatible server to run evaluations. [vLLM](https://github.com/vllm-project/vllm) is recommended for this purpose. After installing vLLM on your desired server (`pip install vllm`), start a vLLM server with a Llama 3.1 8B quantized model by running the following command:
+
+```bash
+vllm serve "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"
+```
+
+For more information on starting a vLLM server, see the [vLLM Documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html).
+
+For information on starting other supported inference servers or platforms, see the [Supported Backends documentation](https://github.com/neuralmagic/guidellm/tree/main/docs/backends.md).
+
+#### 2. Run a GuideLLM Benchmark
+
+To run a GuideLLM benchmark, use the `guidellm benchmark` command with the target set to an OpenAI-compatible server. For this example, the target is set to 'http://localhost:8000', assuming that vLLM is active and running on the same server. Otherwise, update it to the appropriate location. By default, GuideLLM automatically determines the model available on the server and uses it. To target a different model, pass the desired name with the `--model` argument. Additionally, the `--rate-type` is set to `sweep`, which automatically runs a range of benchmarks to determine the minimum and maximum rates that the server and model can support. Each benchmark run under the sweep will run for 30 seconds, as set by the `--max-seconds` argument. Finally, `--data` is set to a synthetic dataset with 256 prompt tokens and 128 output tokens per request. For more arguments, supported scenarios, and configurations, jump to the [Configurations Section](#configurations) or run `guidellm benchmark --help`.
+
+Now, to start benchmarking, run the following command:
+
+```bash
+guidellm benchmark \
+  --target "http://localhost:8000" \
+  --rate-type sweep \
+  --max-seconds 30 \
+  --data "prompt_tokens=256,output_tokens=128"
+```
+
+The above command will begin the evaluation and provide progress updates similar to the following: <img src= "https://raw.githubusercontent.com/neuralmagic/guidellm/main/docs/assets/sample-benchmarks.gif"/>
+
+#### 3. Analyze the Results
+
+After the evaluation is completed, GuideLLM will summarize the results into three sections:
+
+1. Benchmarks Metadata: A summary of the benchmark run and the arguments used to create it, including the server, data, profile, and more.
+2. Benchmarks Info: A high-level view of each benchmark and the requests that were run, including the type, duration, request statuses, and number of tokens.
+3. Benchmarks Stats: A summary of the statistics for each benchmark run, including the request rate, concurrency, latency, and token-level metrics such as TTFT, ITL, and more.
+
+The sections will look similar to the following: <img alt="Sample GuideLLM benchmark output" src="https://raw.githubusercontent.com/neuralmagic/guidellm/main/docs/assets/sample-output.png" />
+
+For more details about the metrics and definitions, please refer to the [Metrics documentation](https://raw.githubusercontent.com/neuralmagic/guidellm/main/docs/metrics.md).
+
+#### 4. Explore the Results File
+
+By default, the full results, including complete statistics and request data, are saved to a file `benchmarks.json` in the current working directory. This file can be used for further analysis or reporting, and additionally can be reloaded into Python for further analysis using the `guidellm.benchmark.GenerativeBenchmarksReport` class. You can specify a different file name and extension with the `--output` argument.
+
+For more details about the supported output file types, please take a look at the [Outputs documentation](raw.githubusercontent.com/neuralmagic/guidellm/main/docs/outputs.md).
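
Since the saved `benchmarks.json` is plain JSON, it can also be inspected without GuideLLM installed. Below is a minimal sketch using only the standard library; the top-level structure is an assumption here, since the exact schema comes from GuideLLM's pydantic models and may change between versions, so the `GenerativeBenchmarksReport` class mentioned above remains the supported loading path.

```python
# Minimal sketch: peek at the saved results file with the standard library.
# The shape of the data is illustrative; the authoritative loader is
# guidellm.benchmark.GenerativeBenchmarksReport.
import json
from pathlib import Path

raw = json.loads(Path("benchmarks.json").read_text())
# Print the top-level keys (for a dict) or the entry count (for a list).
print(sorted(raw) if isinstance(raw, dict) else len(raw))
```
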
+
+#### 5. Use the Results
+
+The results from GuideLLM are used to optimize your LLM deployment for performance, resource efficiency, and cost. By analyzing the performance metrics, you can identify bottlenecks, determine the optimal request rate, and select the most cost-effective hardware configuration for your deployment.
+
+For example, when deploying a chat application, we likely want to ensure that our time to first token (TTFT) and inter-token latency (ITL) are under certain thresholds to meet our service level objectives (SLOs) or service level agreements (SLAs). For example, setting TTFT to 200ms and ITL 25ms for the sample data provided in the example above, we can see that even though the server is capable of handling up to 13 requests per second, we would only be able to meet our SLOs for 99% of users at a request rate of 3.5 requests per second. If we relax our constraints on ITL to 50 ms, then we can meet the TTFT SLA for 99% of users at a request rate of approximately 10 requests per second.
+
+For further details on determining the optimal request rate and SLOs, refer to the [SLOs documentation](https://raw.githubusercontent.com/neuralmagic/guidellm/main/docs/service_level_objectives.md).
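
The SLO check described above reduces to a simple filter over per-rate percentile metrics. A minimal sketch follows; the measurements dict holds hypothetical p99 numbers shaped like the example in the text, not real benchmark output.

```python
# Minimal sketch of the SLO reasoning above: find the highest request rate
# whose p99 TTFT and ITL both meet the thresholds from the text.
def max_rate_meeting_slos(measurements, ttft_slo_ms=200.0, itl_slo_ms=25.0):
    """Return the highest rate whose p99 (TTFT, ITL) meet the SLOs, or None."""
    passing = [
        rate
        for rate, (p99_ttft, p99_itl) in measurements.items()
        if p99_ttft <= ttft_slo_ms and p99_itl <= itl_slo_ms
    ]
    return max(passing, default=None)

# Hypothetical p99 (TTFT ms, ITL ms) per request rate, for illustration only:
sweep_p99 = {1.0: (90, 12), 3.5: (160, 24), 7.0: (210, 38), 13.0: (480, 70)}
print(max_rate_meeting_slos(sweep_p99))  # -> 3.5
```
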
+
+### Configurations
+
+GuideLLM offers a range of configurations through both the benchmark CLI command and environment variables, which provide default values and more granular controls. The most common configurations are listed below. A complete list is easily accessible, though, by running `guidellm benchmark --help` or `guidellm config` respectively.
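
As a sketch of the environment-variable side: GuideLLM depends on `pydantic-settings`, and `GUIDELLM__MAX_CONCURRENCY` (named later in this README) suggests a `GUIDELLM__`-prefixed naming scheme. The exact set of variables should be confirmed with `guidellm config`; the value below is illustrative only.

```python
# Minimal sketch: override one GuideLLM setting via the environment and dump
# the effective configuration. GUIDELLM__MAX_CONCURRENCY is the only variable
# named in this README; 256 is an arbitrary illustrative value.
import os
import subprocess

env = dict(os.environ, GUIDELLM__MAX_CONCURRENCY="256")
subprocess.run(["guidellm", "config"], env=env, check=True)
```
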
+
+#### Benchmark CLI
+
+The `guidellm benchmark` command is used to run benchmarks against a generative AI backend/server. The command accepts a variety of arguments to customize the benchmark run. The most common arguments include:
+
+- `--target`: Specifies the target path for the backend to run benchmarks against. For example, `http://localhost:8000`. This is required to define the server endpoint.
+
+- `--model`: Allows selecting a specific model from the server. If not provided, it defaults to the first model available on the server. Useful when multiple models are hosted on the same server.
+
+- `--processor`: Used only for synthetic data creation or when the token source configuration is set to local for calculating token metrics locally. It must match the model's processor or tokenizer to ensure compatibility and correctness. This supports either a HuggingFace model ID or a local path to a processor or tokenizer.
+
+- `--data`: Specifies the dataset to use. This can be a HuggingFace dataset ID, a local path to a dataset, or standard text files such as CSV, JSONL, and more. Additionally, synthetic data configurations can be provided using JSON or key-value strings. Synthetic data options include:
+
+  - `prompt_tokens`: Average number of tokens for prompts.
+  - `output_tokens`: Average number of tokens for outputs.
+  - `TYPE_stdev`, `TYPE_min`, `TYPE_max`: Standard deviation, minimum, and maximum values for the specified type (e.g., `prompt_tokens`, `output_tokens`). If not provided, will use the provided tokens value only.
+  - `samples`: Number of samples to generate, defaults to 1000.
+  - `source`: Source text data for generation, defaults to a local copy of Pride and Prejudice.
+
+- `--data-args`: A JSON string used to specify the columns to source data from (e.g., `prompt_column`, `output_tokens_count_column`) and additional arguments to pass into the HuggingFace datasets constructor.
+
+- `--data-sampler`: Enables applying `random` shuffling or sampling to the dataset. If not set, no sampling is used.
+
+- `--rate-type`: Defines the type of benchmark to run (default sweep). Supported types include:
+
+  - `synchronous`: Runs a single stream of requests one at a time. `--rate` must not be set for this mode.
+  - `throughput`: Runs all requests in parallel to measure the maximum throughput for the server (bounded by GUIDELLM\_\_MAX_CONCURRENCY config argument). `--rate` must not be set for this mode.
+  - `concurrent`: Runs a fixed number of streams of requests in parallel. `--rate` must be set to the desired concurrency level/number of streams.
+  - `constant`: Sends requests asynchronously at a constant rate set by `--rate`.
+  - `poisson`: Sends requests at a rate following a Poisson distribution with the mean set by `--rate`.
+  - `sweep`: Automatically determines the minimum and maximum rates the server can support by running synchronous and throughput benchmarks, and then runs a series of benchmarks equally spaced between the two rates. The number of benchmarks is set by `--rate` (default is 10).
+
+- `--max-seconds`: Sets the maximum duration (in seconds) for each benchmark run. If not specified, the benchmark will run until the dataset is exhausted or the `--max-requests` limit is reached.
+
+- `--max-requests`: Sets the maximum number of requests for each benchmark run. If not provided, the benchmark will run until `--max-seconds` is reached or the dataset is exhausted.
+
+- `--warmup-percent`: Specifies the percentage of the benchmark to treat as a warmup phase. Requests during this phase are excluded from the final results.
+
+- `--cooldown-percent`: Specifies the percentage of the benchmark to treat as a cooldown phase. Requests during this phase are excluded from the final results.
+
+- `--output-path`: Defines the path to save the benchmark results. Supports JSON, YAML, or CSV formats. If a directory is provided, the results will be saved as `benchmarks.json` in that directory. If not set, the results will be saved in the current working directory.
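
To make the `constant` and `poisson` schedules in the list above concrete, here is a minimal sketch (not GuideLLM's actual scheduler) of how the two spacing strategies generate request arrival times for a given `--rate`: constant spacing places requests exactly 1/rate seconds apart, while Poisson scheduling draws exponential inter-arrival times whose mean is 1/rate.

```python
# Minimal sketch of the two request-spacing strategies described above.
import random

def arrival_times(rate_per_s: float, n: int, poisson: bool) -> list[float]:
    """Generate n request arrival timestamps for the given rate."""
    t, times = 0.0, []
    for _ in range(n):
        # Exponential inter-arrival gaps yield a Poisson arrival process;
        # otherwise use a fixed gap of 1/rate seconds.
        t += random.expovariate(rate_per_s) if poisson else 1.0 / rate_per_s
        times.append(t)
    return times

print(arrival_times(10.0, 5, poisson=False))  # evenly spaced 0.1 s apart
print(arrival_times(10.0, 5, poisson=True))   # irregular, mean spacing 0.1 s
```
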
+
+## Resources
+
+### Documentation
+
+Our comprehensive documentation offers detailed guides and resources to help you maximize the benefits of GuideLLM. Whether just getting started or looking to dive deeper into advanced topics, you can find what you need in our [documentation](https://github.com/neuralmagic/guidellm/tree/main/docs).
+
+### Core Docs
+
+- [**Installation Guide**](https://github.com/neuralmagic/guidellm/tree/main/docs/install.md) - This guide provides step-by-step instructions for installing GuideLLM, including prerequisites and setup tips.
+- [**Backends Guide**](https://github.com/neuralmagic/guidellm/tree/main/docs/backends.md) - A comprehensive overview of supported backends and how to set them up for use with GuideLLM.
+- [**Metrics Guide**](https://github.com/neuralmagic/guidellm/tree/main/docs/metrics.md) - Detailed explanations of the metrics used in GuideLLM, including definitions and how to interpret them.
+- [**Outputs Guide**](https://github.com/neuralmagic/guidellm/tree/main/docs/outputs.md) - Information on the different output formats supported by GuideLLM and how to use them.
+- [**Architecture Overview**](https://github.com/neuralmagic/guidellm/tree/main/docs/architecture.md) - A detailed look at GuideLLM's design, components, and how they interact.
+
+### Supporting External Documentation
+
+- [**vLLM Documentation**](https://vllm.readthedocs.io/en/latest/) - Official vLLM documentation provides insights into installation, usage, and supported models.
+
+### Contribution Docs
+
+We appreciate contributions to the code, examples, integrations, documentation, bug reports, and feature requests! Your feedback and involvement are crucial in helping GuideLLM grow and improve. Below are some ways you can get involved:
+
+- [**DEVELOPING.md**](https://github.com/neuralmagic/guidellm/blob/main/DEVELOPING.md) - Development guide for setting up your environment and making contributions.
+- [**CONTRIBUTING.md**](https://github.com/neuralmagic/guidellm/blob/main/CONTRIBUTING.md) - Guidelines for contributing to the project, including code standards, pull request processes, and more.
+- [**CODE_OF_CONDUCT.md**](https://github.com/neuralmagic/guidellm/blob/main/CODE_OF_CONDUCT.md) - Our expectations for community behavior to ensure a welcoming and inclusive environment.
+
+### Releases
+
+Visit our [GitHub Releases page](https://github.com/neuralmagic/guidellm/releases) and review the release notes to stay updated with the latest releases.
+
+### License
+
+GuideLLM is licensed under the [Apache License 2.0](https://github.com/neuralmagic/guidellm/blob/main/LICENSE).
+
+### Cite
+
+If you find GuideLLM helpful in your research or projects, please consider citing it:
+
+```bibtex
+@misc{guidellm2024,
+  title={GuideLLM: Scalable Inference and Optimization for Large Language Models},
+  author={Neural Magic, Inc.},
+  year={2024},
+  howpublished={\url{https://github.com/neuralmagic/guidellm}},
+}
+```
guidellm-0.2.0rc20250418/pyproject.toml

@@ -0,0 +1,74 @@
+[build-system]
+requires = [ "setuptools >= 61.0", "wheel", "build",]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "guidellm"
+version = "0.2.0.rc20250418"
+description = "Guidance platform for deploying and managing large language models."
+requires-python = ">=3.9.0,<4.0"
+dependencies = [ "click", "datasets", "ftfy>=6.0.0", "httpx[http2]<1.0.0", "loguru", "numpy", "pillow", "protobuf", "pydantic>=2.0.0", "pydantic-settings>=2.0.0", "pyyaml>=6.0.0", "requests", "rich", "transformers",]
+[[project.authors]]
+name = "Neuralmagic, Inc."
+
+[tool.isort]
+profile = "black"
+
+[tool.mypy]
+files = [ "src/guidellm", "tests",]
+python_version = "3.9"
+warn_redundant_casts = true
+warn_unused_ignores = false
+show_error_codes = true
+namespace_packages = true
+exclude = [ "venv", ".tox",]
+follow_imports = "silent"
+[[tool.mypy.overrides]]
+module = [ "datasets.*",]
+ignore_missing_imports = true
+
+[tool.ruff]
+line-length = 88
+indent-width = 4
+exclude = [ "build", "dist", "env", ".venv",]
+
+[project.readme]
+file = "README.md"
+content-type = "text/markdown"
+
+[project.license]
+file = "LICENSE"
+
+[project.urls]
+homepage = "https://github.com/neuralmagic/guidellm"
+
+[project.optional-dependencies]
+dev = [ "pre-commit~=3.5.0", "scipy~=1.10", "sphinx~=7.1.2", "tox~=4.16.0", "lorem~=0.1.1", "pytest~=8.2.2", "pytest-asyncio~=0.23.8", "pytest-cov~=5.0.0", "pytest-mock~=3.14.0", "pytest-rerunfailures~=14.0", "requests-mock~=1.12.1", "respx~=0.22.0", "mypy~=1.10.1", "ruff~=0.5.2", "mdformat~=0.7.17", "mdformat-footnote~=0.1.1", "mdformat-frontmatter~=2.0.8", "mdformat-gfm~=0.3.6", "types-click~=7.1.8", "types-PyYAML~=6.0.1", "types-requests~=2.32.0", "types-toml",]
+
+[tool.setuptools.package-data]
+"guidellm.data" = [ "*.gz",]
+
+[tool.ruff.format]
+quote-style = "double"
+indent-style = "space"
+
+[tool.ruff.lint]
+ignore = [ "PLR0913", "TCH001", "COM812", "ISC001", "TCH002", "PLW1514", "RET505", "RET506", "PD011",]
+select = [ "E", "W", "A", "C", "COM", "ERA", "I", "ICN", "N", "NPY", "PD", "PT", "PTH", "Q", "TCH", "TID", "RUF022", "C4", "C90", "ISC", "PIE", "R", "SIM", "ARG", "ASYNC", "B", "BLE", "E", "F", "INP", "PGH", "PL", "RSE", "S", "SLF", "T10", "T20", "UP", "W", "YTT", "FIX",]
+
+[tool.pytest.ini_options]
+addopts = "-s -vvv --cache-clear"
+markers = [ "smoke: quick tests to check basic functionality", "sanity: detailed tests to ensure major functions work correctly", "regression: tests to ensure that new changes do not break existing functionality",]
+
+[project.entry-points.console_scripts]
+guidellm = "guidellm.__main__:cli"
+
+[tool.setuptools.packages.find]
+where = [ "src",]
+include = [ "*",]
+
+[tool.ruff.lint.extend-per-file-ignores]
+"tests/**/*.py" = [ "S101", "ARG", "PLR2004", "TCH002", "SLF001", "S105", "S311", "PT011", "N806", "PGH003", "S106", "PLR0915",]
+
+[tool.ruff.lint.isort]
+known-first-party = [ "guidellm", "tests",]
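
The `[project]` table above can be read back programmatically, which is a quick way to confirm the packaged version string. Below is a minimal sketch using the standard library's `tomllib` (Python 3.11+); on the 3.9/3.10 interpreters this package also supports, the third-party `tomli` package provides the same API.

```python
# Minimal sketch: load the [project] table from the pyproject.toml shown above.
import tomllib  # use "import tomli as tomllib" on Python < 3.11

with open("pyproject.toml", "rb") as f:
    project = tomllib.load(f)["project"]

print(project["version"])          # 0.2.0.rc20250418
print(project["requires-python"])  # >=3.9.0,<4.0
```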