@flotorch/loadtest 0.1.0 → 0.1.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (2)
  1. package/README.md +250 -6
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -1,15 +1,259 @@
- # loadtest
+ # FLOTorch Load Tester
 
- To install dependencies:
+ LLM inference load testing and benchmarking tool. Measure TTFT, TPS, latency percentiles, and throughput of any OpenAI-compatible or SageMaker endpoint under sustained concurrent load.
+
+ ## Features
+
+ - Single-command benchmarking with auto-generated synthetic prompts
+ - Accurate concurrency control with ramp-up/ramp-down phases
+ - Streaming and non-streaming support
+ - Comprehensive metrics: TTFT, TTFNT, TPS, RPM, ITL, E2E latency, percentiles (p25–p99)
+ - JSON and CSV report exports with per-request logs
+ - Cache-hit simulation for testing endpoint caching behavior
+ - OpenAI-compatible API and AWS SageMaker backends
+
+ ## Installation
+
+ Requires **Node.js 18+**.
+
+ ```bash
+ # npm
+ npm install -g @flotorch/loadtest
+
+ # pnpm
+ pnpm add -g @flotorch/loadtest
+
+ # yarn
+ yarn global add @flotorch/loadtest
+ ```
+
+ After installation, the `flotorch-loadtest` command is available globally.
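+
+ A quick way to confirm that the global install put the binary on your `PATH` (plain shell, nothing package-specific):
+
+ ```bash
+ # Should print the resolved path of the flotorch-loadtest executable
+ command -v flotorch-loadtest
+ ```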
+
+ ## Quick Start
+
+ ### 1. Generate a config file
+
+ ```bash
+ flotorch-loadtest init
+ ```
+
+ This launches an interactive wizard that asks for:
+
+ - **Provider adapter** — `openai` or `sagemaker` (default: `openai`)
+ - **Model name** — the model identifier your endpoint expects
+ - **Base URL** — API endpoint (default: `https://api.openai.com/v1/chat/completions`)
+ - **Concurrency** — number of parallel requests (default: `10`)
+ - **Input tokens mean** — average input token count per request (default: `512`)
+ - **Output tokens mean** — average output token count per request (default: `256`)
+ - **Max requests** — total number of requests to send (default: `100`)
+ - **Streaming** — whether to stream responses (default: `y`)
+
+ Writes `config.json` to the current directory. You can specify a custom path:
+
+ ```bash
+ flotorch-loadtest init my-test.json
+ ```
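+
+ Accepting the defaults above produces a config along these lines (a sketch assembled from the wizard defaults and the Configuration Reference below; the file the wizard actually writes may include more fields):
+
+ ```jsonc
+ {
+   "provider": {
+     "adapter": "openai",
+     "model": "gpt-4o", // example value; the wizard uses whatever model name you enter
+     "baseURL": "https://api.openai.com/v1/chat/completions"
+   },
+   "benchmark": {
+     "concurrency": 10,
+     "inputTokens": { "mean": 512 },
+     "outputTokens": { "mean": 256 },
+     "maxRequests": 100,
+     "streaming": true
+   }
+ }
+ ```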
+
+ ### 2. Set your API key
 
  ```bash
- bun install
+ export OPENAI_API_KEY="sk-..."
  ```
 
- To run:
+ For SageMaker, configure AWS credentials via standard AWS environment variables or credential files.
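+
+ For example, via the standard AWS SDK environment variables (values shown are placeholders):
+
+ ```bash
+ # Standard AWS credential/region variables picked up by the AWS SDK
+ export AWS_ACCESS_KEY_ID="AKIA..."
+ export AWS_SECRET_ACCESS_KEY="..."
+ export AWS_REGION="us-east-1"
+ ```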
+
+ ### 3. Run the load test
 
  ```bash
- bun run index.ts
+ flotorch-loadtest run -c config.json
  ```
 
- This project was created using `bun init` in bun v1.3.9. [Bun](https://bun.com) is a fast all-in-one JavaScript runtime.
+ This runs the full pipeline: **generate prompts → run benchmark → generate report**.
+
+ Results are saved to `./results/<run-id>/` containing:
+
+ | File                    | Description                                                        |
+ | ----------------------- | ------------------------------------------------------------------ |
+ | `summary.json`          | Aggregated metrics (latency, throughput, error rates, percentiles) |
+ | `run_log.jsonl`         | Per-request metrics streamed during the run                        |
+ | `prompts.jsonl`         | All generated prompts                                              |
+ | `individual_responses/` | Full response data for each request                                |
+ | `config.resolved.json`  | Final merged configuration used                                    |
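+
+ For example, to inspect a finished run from the shell (the directory name is the run ID — an ISO timestamp unless you pass `--run-id`; `jq` is only used here for pretty-printing):
+
+ ```bash
+ ls ./results/                           # list run IDs
+ jq . ./results/<run-id>/summary.json    # aggregated metrics
+ tail ./results/<run-id>/run_log.jsonl   # last few per-request records
+ ```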
+
+ ## Commands
+
+ | Command       | Description                                             |
+ | ------------- | ------------------------------------------------------- |
+ | `run`         | Full pipeline: generate → bench → report **(default)**  |
+ | `generate`    | Generate and save prompts only                          |
+ | `bench`       | Run benchmark using pre-generated prompts               |
+ | `report`      | Generate report from existing benchmark results         |
+ | `init [path]` | Interactively create a config file                      |
+
+ ```bash
+ flotorch-loadtest run -c config.json        # full pipeline
+ flotorch-loadtest generate -c config.json   # prompts only
+ flotorch-loadtest bench -c config.json      # benchmark only
+ flotorch-loadtest report -c config.json     # report only
+ ```
+
+ ## CLI Options
+
+ Any config value can be overridden from the command line:
+
+ | Flag                  | Short | Description                                 |
+ | --------------------- | ----- | ------------------------------------------- |
+ | `--config <path>`     | `-c`  | Path to config JSON (required)              |
+ | `--run-id <id>`       |       | Custom run ID (default: ISO timestamp)      |
+ | `--model <name>`      | `-m`  | Override `provider.model`                   |
+ | `--concurrency <n>`   | `-n`  | Override `benchmark.concurrency`            |
+ | `--max-requests <n>`  |       | Override `benchmark.maxRequests`            |
+ | `--max-duration <n>`  |       | Override `benchmark.maxDuration` (seconds)  |
+ | `--output-dir <path>` | `-o`  | Override `benchmark.outputDir`              |
+ | `--base-url <url>`    |       | Override `provider.baseURL`                 |
+ | `--streaming`         |       | Enable streaming                            |
+ | `--no-streaming`      |       | Disable streaming                           |
+
+ Example — override concurrency and model on the fly:
+
+ ```bash
+ flotorch-loadtest run -c config.json -n 50 -m gpt-4o
+ ```
+
+ ## Configuration Reference
+
+ The config file is JSON with four sections:
+
+ ```jsonc
+ {
+   "provider": {
+     "adapter": "openai", // "openai" | "sagemaker"
+     "model": "gpt-4o", // model identifier (required)
+     "baseURL": "https://api.openai.com/v1/chat/completions", // API endpoint
+     "systemPrompt": "You are a helpful assistant.", // optional system message
+     "config": {}, // backend-specific options
+   },
+   "benchmark": {
+     "concurrency": 10, // parallel requests (required)
+     "inputTokens": { "mean": 512, "stddev": 51 }, // input token distribution
+     "outputTokens": { "mean": 256, "stddev": 26 }, // output token distribution
+     "maxRequests": 100, // total requests (required if no maxDuration)
+     "maxDuration": 60, // duration in seconds (required if no maxRequests)
+     "timeout": 600, // per-request timeout in seconds (default: 600)
+     "streaming": true, // stream responses (default: true)
+     "cachePercentage": 0, // % of requests reusing previous prompts (0–100)
+     "outputDir": "./results", // results directory (default: ./results)
+     "inputFile": "prompts.jsonl", // pre-generated prompts (for bench command)
+     "rampUp": {
+       // optional: gradually increase concurrency
+       "duration": 30, // over N seconds, or
+       "requests": 50, // over N requests
+     },
+     "rampDown": {
+       // optional: gradually decrease concurrency
+       "duration": 15,
+     },
+   },
+   "generator": {
+     "enabled": false, // use synthetic prompt generator
+     "prompt": "Custom instruction...", // optional custom prompt template
+     "corpus": "./my-corpus.txt", // optional custom corpus file
+   },
+   "reporter": {
+     "adapters": ["json", "csv"], // export formats (default: ["json"])
+   },
+ }
+ ```
+
+ > At least one of `maxRequests` or `maxDuration` is required. If `stddev` is omitted, it defaults to 10% of the mean.
+
+ ## Metrics Collected
+
+ ### Per-request
+
+ - **TTFT** — Time to first token (ms)
+ - **TTFNT** — Time to first non-thinking token (for reasoning models)
+ - **E2E Latency** — End-to-end request latency (ms)
+ - **Inter-token latencies** — Time between successive tokens (streaming)
+ - **Output throughput** — Tokens per second
+ - **Input/output token counts**
+ - **Phase** — ramp-up, steady, or ramp-down
+ - **Cache hit** — Whether the request was a cache hit
+ - **Error details** — Error message and code if failed
+
+ ### Summary
+
+ - Success/failure counts and error rate
+ - RPM (requests per minute) and overall TPS (tokens per second)
+ - Percentiles (p25, p50, p75, p90, p95, p99) for all latency and throughput metrics
+ - Error code frequency breakdown
+ - Phase-level breakdown (requests and error rates per phase)
+ - Cache hit rate
+
+ ## Examples
+
+ ### Load test an OpenAI-compatible endpoint
+
+ ```json
+ {
+   "provider": {
+     "adapter": "openai",
+     "model": "gpt-4o",
+     "baseURL": "https://api.openai.com/v1/chat/completions"
+   },
+   "benchmark": {
+     "concurrency": 20,
+     "inputTokens": { "mean": 256 },
+     "outputTokens": { "mean": 128 },
+     "maxRequests": 500,
+     "streaming": true
+   }
+ }
+ ```
+
+ ### Load test a self-hosted model (vLLM, Ollama, etc.)
+
+ ```json
+ {
+   "provider": {
+     "adapter": "openai",
+     "model": "meta-llama/Llama-3-8B",
+     "baseURL": "http://localhost:8000/v1"
+   },
+   "benchmark": {
+     "concurrency": 50,
+     "inputTokens": { "mean": 512 },
+     "outputTokens": { "mean": 256 },
+     "maxDuration": 120,
+     "streaming": true,
+     "rampUp": { "duration": 30 },
+     "rampDown": { "duration": 15 }
+   }
+ }
+ ```
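+
+ For repeatable comparisons you can also split the pipeline and reuse one prompt set across runs (the subcommands are listed under Commands above; this assumes `benchmark.inputFile` points at the generated `prompts.jsonl`):
+
+ ```bash
+ flotorch-loadtest generate -c config.json   # write prompts.jsonl once
+ flotorch-loadtest bench -c config.json      # benchmark against the pre-generated prompts
+ flotorch-loadtest report -c config.json     # build the report from the results
+ ```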
+
+ ### Time-bounded test with CSV output
+
+ ```json
+ {
+   "provider": {
+     "adapter": "openai",
+     "model": "gpt-4o-mini",
+     "baseURL": "https://api.openai.com/v1/chat/completions"
+   },
+   "benchmark": {
+     "concurrency": 10,
+     "inputTokens": { "mean": 128 },
+     "outputTokens": { "mean": 64 },
+     "maxDuration": 300,
+     "streaming": true
+   },
+   "reporter": {
+     "adapters": ["json", "csv"]
+   }
+ }
+ ```
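+
+ A 5-minute run like this can be tagged and routed to its own directory with the flags documented above (the run ID shown is just an example):
+
+ ```bash
+ flotorch-loadtest run -c config.json --run-id nightly-5min -o ./results/nightly
+ ```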
+
+ ## License
+
+ MIT
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@flotorch/loadtest",
- "version": "0.1.0",
+ "version": "0.1.2",
   "description": "LLM inference load testing and benchmarking tool",
   "license": "MIT",
   "repository": {