EvoScientist 0.0.1.dev4__py3-none-any.whl → 0.1.0rc2__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- EvoScientist/EvoScientist.py +25 -61
- EvoScientist/__init__.py +0 -19
- EvoScientist/backends.py +0 -26
- EvoScientist/cli.py +1365 -480
- EvoScientist/middleware.py +7 -56
- EvoScientist/skills/clip/SKILL.md +253 -0
- EvoScientist/skills/clip/references/applications.md +207 -0
- EvoScientist/skills/langgraph-docs/SKILL.md +36 -0
- EvoScientist/skills/tensorboard/SKILL.md +629 -0
- EvoScientist/skills/tensorboard/references/integrations.md +638 -0
- EvoScientist/skills/tensorboard/references/profiling.md +545 -0
- EvoScientist/skills/tensorboard/references/visualization.md +620 -0
- EvoScientist/skills/vllm/SKILL.md +364 -0
- EvoScientist/skills/vllm/references/optimization.md +226 -0
- EvoScientist/skills/vllm/references/quantization.md +284 -0
- EvoScientist/skills/vllm/references/server-deployment.md +255 -0
- EvoScientist/skills/vllm/references/troubleshooting.md +447 -0
- EvoScientist/stream/__init__.py +0 -25
- EvoScientist/stream/utils.py +16 -23
- EvoScientist/tools.py +2 -75
- {evoscientist-0.0.1.dev4.dist-info → evoscientist-0.1.0rc2.dist-info}/METADATA +8 -153
- {evoscientist-0.0.1.dev4.dist-info → evoscientist-0.1.0rc2.dist-info}/RECORD +26 -24
- evoscientist-0.1.0rc2.dist-info/entry_points.txt +2 -0
- EvoScientist/config.py +0 -274
- EvoScientist/llm/__init__.py +0 -21
- EvoScientist/llm/models.py +0 -99
- EvoScientist/memory.py +0 -715
- EvoScientist/onboard.py +0 -725
- EvoScientist/paths.py +0 -44
- EvoScientist/skills_manager.py +0 -391
- EvoScientist/stream/display.py +0 -604
- EvoScientist/stream/events.py +0 -415
- EvoScientist/stream/state.py +0 -343
- evoscientist-0.0.1.dev4.dist-info/entry_points.txt +0 -5
- {evoscientist-0.0.1.dev4.dist-info → evoscientist-0.1.0rc2.dist-info}/WHEEL +0 -0
- {evoscientist-0.0.1.dev4.dist-info → evoscientist-0.1.0rc2.dist-info}/licenses/LICENSE +0 -0
- {evoscientist-0.0.1.dev4.dist-info → evoscientist-0.1.0rc2.dist-info}/top_level.txt +0 -0
+++ EvoScientist/skills/vllm/SKILL.md
@@ -0,0 +1,364 @@
---
name: vllm
description: Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [vLLM, Inference Serving, PagedAttention, Continuous Batching, High Throughput, Production, OpenAI API, Quantization, Tensor Parallelism]
dependencies: [vllm, torch, transformers]
---

# vLLM - High-Performance LLM Serving

## Quick start

vLLM achieves up to 24x higher throughput than standard Hugging Face Transformers through PagedAttention (block-based KV cache management) and continuous batching (mixing prefill and decode requests in the same batch).

**Installation**:
```bash
pip install vllm
```

**Basic offline inference**:
```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain quantum computing"], sampling)
print(outputs[0].outputs[0].text)
```

**OpenAI-compatible server**:
```bash
vllm serve meta-llama/Llama-3-8B-Instruct

# Query with the OpenAI SDK
python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
print(client.chat.completions.create(
    model='meta-llama/Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Hello!'}]
).choices[0].message.content)
"
```

## Common workflows

### Workflow 1: Production API deployment

Copy this checklist and track progress:

```
Deployment Progress:
- [ ] Step 1: Configure server settings
- [ ] Step 2: Test with limited traffic
- [ ] Step 3: Enable monitoring
- [ ] Step 4: Deploy to production
- [ ] Step 5: Verify performance metrics
```

**Step 1: Configure server settings**

Choose configuration based on your model size:

```bash
# For 7B-13B models on a single GPU
vllm serve meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --port 8000

# For 30B-70B models with tensor parallelism
vllm serve meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --quantization awq \
  --port 8000

# For production with caching and metrics
vllm serve meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching \
  --enable-metrics \
  --metrics-port 9090 \
  --port 8000 \
  --host 0.0.0.0
```

**Step 2: Test with limited traffic**

Run a load test before production:

```bash
# Install load testing tool
pip install locust

# Create test_load.py with sample requests
# Run: locust -f test_load.py --host http://localhost:8000
```

Verify TTFT (time to first token) < 500ms and throughput > 100 req/sec.
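
The skill leaves `test_load.py` unspecified; a minimal Locust sketch is shown below. The endpoint path and payload follow the OpenAI-compatible API, while the model name, prompt, and pacing are illustrative assumptions.

```python
# test_load.py - minimal Locust sketch for the OpenAI-compatible endpoint.
# Model name, prompt, and wait times are placeholder assumptions.
from locust import HttpUser, task, between


class VLLMUser(HttpUser):
    wait_time = between(0.1, 0.5)  # pause between requests per simulated user

    @task
    def chat_completion(self):
        self.client.post(
            "/v1/chat/completions",
            json={
                "model": "meta-llama/Llama-3-8B-Instruct",
                "messages": [{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
                "max_tokens": 64,
            },
        )
```

Run it with, for example, `locust -f test_load.py --host http://localhost:8000 --users 50 --spawn-rate 10 --headless` and watch the latency percentiles.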

**Step 3: Enable monitoring**

vLLM exposes Prometheus metrics on port 9090:

```bash
curl http://localhost:9090/metrics | grep vllm
```

Key metrics to monitor:
- `vllm:time_to_first_token_seconds` - Latency
- `vllm:num_requests_running` - Active requests
- `vllm:gpu_cache_usage_perc` - KV cache utilization
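
For a quick scripted check, a sketch like the following pulls the metrics endpoint and prints only the series listed above; the URL assumes the metrics port configured in Step 1.

```python
# Sketch: print selected vLLM Prometheus metrics (assumes metrics on port 9090).
import requests

WATCHED = (
    "vllm:time_to_first_token_seconds",
    "vllm:num_requests_running",
    "vllm:gpu_cache_usage_perc",
)

text = requests.get("http://localhost:9090/metrics", timeout=5).text
for line in text.splitlines():
    if line.startswith(WATCHED):
        print(line)
```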
|
|
117
|
+
|
|
118
|
+
**Step 4: Deploy to production**
|
|
119
|
+
|
|
120
|
+
Use Docker for consistent deployment:
|
|
121
|
+
|
|
122
|
+
```bash
|
|
123
|
+
# Run vLLM in Docker
|
|
124
|
+
docker run --gpus all -p 8000:8000 \
|
|
125
|
+
vllm/vllm-openai:latest \
|
|
126
|
+
--model meta-llama/Llama-3-8B-Instruct \
|
|
127
|
+
--gpu-memory-utilization 0.9 \
|
|
128
|
+
--enable-prefix-caching
|
|
129
|
+
```
|
|
130
|
+
|
|
131
|
+
**Step 5: Verify performance metrics**
|
|
132
|
+
|
|
133
|
+
Check that deployment meets targets:
|
|
134
|
+
- TTFT < 500ms (for short prompts)
|
|
135
|
+
- Throughput > target req/sec
|
|
136
|
+
- GPU utilization > 80%
|
|
137
|
+
- No OOM errors in logs
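
One way to spot-check TTFT is to time a streaming request with the OpenAI SDK, as in the sketch below; the model name and prompt are placeholders, the first streamed chunk only approximates the first token, and the measurement includes network overhead.

```python
# Sketch: rough TTFT measurement against the OpenAI-compatible endpoint.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="meta-llama/Llama-3-8B-Instruct",   # placeholder model name
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=32,
    stream=True,
)
for chunk in stream:
    # The first streamed chunk approximates time to first token.
    print(f"TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
    break
```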

### Workflow 2: Offline batch inference

For processing large datasets without server overhead.

Copy this checklist:

```
Batch Processing:
- [ ] Step 1: Prepare input data
- [ ] Step 2: Configure LLM engine
- [ ] Step 3: Run batch inference
- [ ] Step 4: Process results
```

**Step 1: Prepare input data**

```python
# Load prompts from file
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f]

print(f"Loaded {len(prompts)} prompts")
```

**Step 2: Configure LLM engine**

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,  # Use 2 GPUs
    gpu_memory_utilization=0.9,
    max_model_len=4096
)

sampling = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
    stop=["</s>", "\n\n"]
)
```

**Step 3: Run batch inference**

vLLM automatically batches requests for efficiency:

```python
# Process all prompts in one call
outputs = llm.generate(prompts, sampling)

# vLLM handles batching internally
# No need to manually chunk prompts
```

**Step 4: Process results**

```python
import json

# Extract generated text
results = []
for output in outputs:
    results.append({
        "prompt": output.prompt,
        "generated": output.outputs[0].text,
        "tokens": len(output.outputs[0].token_ids)
    })

# Save to file
with open("results.jsonl", "w") as f:
    for result in results:
        f.write(json.dumps(result) + "\n")

print(f"Processed {len(results)} prompts")
```

### Workflow 3: Quantized model serving

Fit large models in limited GPU memory.

```
Quantization Setup:
- [ ] Step 1: Choose quantization method
- [ ] Step 2: Find or create quantized model
- [ ] Step 3: Launch with quantization flag
- [ ] Step 4: Verify accuracy
```

**Step 1: Choose quantization method**

- **AWQ**: Best for 70B models, minimal accuracy loss
- **GPTQ**: Wide model support, good compression
- **FP8**: Fastest on H100 GPUs

**Step 2: Find or create quantized model**

Use pre-quantized models from HuggingFace:

```bash
# Search for AWQ models
# Example: TheBloke/Llama-2-70B-AWQ
```

**Step 3: Launch with quantization flag**

```bash
# Using pre-quantized model
vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95

# Result: 70B model in ~40GB VRAM
```

**Step 4: Verify accuracy**

Test that outputs match expected quality (a comparison sketch follows below):

```python
# Compare quantized vs non-quantized responses
# Verify task-specific performance is unchanged
```
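
A minimal way to eyeball this, assuming the baseline and quantized models are served on two separate ports (8000 and 8001 are illustrative, as are the prompts), is to send the same questions to both endpoints and compare answers side by side:

```python
# Sketch: side-by-side comparison of baseline vs quantized endpoints.
# Ports and prompts are illustrative assumptions.
from openai import OpenAI

baseline = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
quantized = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

prompts = ["What is 17 * 23?", "Name the capital of Australia."]

for prompt in prompts:
    print(prompt)
    for name, client in [("baseline", baseline), ("quantized", quantized)]:
        resp = client.chat.completions.create(
            model=client.models.list().data[0].id,  # whichever model the server hosts
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,  # keep sampling deterministic-ish for comparison
            max_tokens=64,
        )
        print(f"  {name}: {resp.choices[0].message.content.strip()}")
```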

## When to use vs alternatives

**Use vLLM when:**
- Deploying production LLM APIs (100+ req/sec)
- Serving OpenAI-compatible endpoints
- Limited GPU memory but need large models
- Multi-user applications (chatbots, assistants)
- Need low latency with high throughput

**Use alternatives instead:**
- **llama.cpp**: CPU/edge inference, single-user
- **HuggingFace transformers**: Research, prototyping, one-off generation
- **TensorRT-LLM**: NVIDIA-only, need absolute maximum performance
- **Text-Generation-Inference**: Already in HuggingFace ecosystem

## Common issues

**Issue: Out of memory during model loading**

Reduce memory usage:
```bash
vllm serve MODEL \
  --gpu-memory-utilization 0.7 \
  --max-model-len 4096
```

Or use quantization:
```bash
vllm serve MODEL --quantization awq
```

**Issue: Slow first token (TTFT > 1 second)**

Enable prefix caching for repeated prompts:
```bash
vllm serve MODEL --enable-prefix-caching
```

For long prompts, enable chunked prefill:
```bash
vllm serve MODEL --enable-chunked-prefill
```

**Issue: Model not found error**

Use `--trust-remote-code` for custom models:
```bash
vllm serve MODEL --trust-remote-code
```

**Issue: Low throughput (<50 req/sec)**

Increase concurrent sequences:
```bash
vllm serve MODEL --max-num-seqs 512
```

Check GPU utilization with `nvidia-smi` - should be >80%.
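
If you prefer to log utilization over time rather than watch `nvidia-smi` interactively, a small polling sketch such as the following works (it just shells out to `nvidia-smi`; the 5-second interval is arbitrary):

```python
# Sketch: poll GPU utilization and memory via nvidia-smi (Ctrl-C to stop).
import subprocess
import time

while True:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(out)
    time.sleep(5)
```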

**Issue: Inference slower than expected**

Use a tensor-parallel size that evenly divides the model's attention head count (in practice, a power of 2):
```bash
vllm serve MODEL --tensor-parallel-size 4  # Not 3
```

Enable speculative decoding for faster generation:
```bash
vllm serve MODEL --speculative-model DRAFT_MODEL
```

## Advanced topics

**Server deployment patterns**: See [references/server-deployment.md](references/server-deployment.md) for Docker, Kubernetes, and load balancing configurations.

**Performance optimization**: See [references/optimization.md](references/optimization.md) for PagedAttention tuning, continuous batching details, and benchmark results.

**Quantization guide**: See [references/quantization.md](references/quantization.md) for AWQ/GPTQ/FP8 setup, model preparation, and accuracy comparisons.

**Troubleshooting**: See [references/troubleshooting.md](references/troubleshooting.md) for detailed error messages, debugging steps, and performance diagnostics.

## Hardware requirements

- **Small models (7B-13B)**: 1x A10 (24GB) or A100 (40GB)
- **Medium models (30B-40B)**: 2x A100 (40GB) with tensor parallelism
- **Large models (70B+)**: 4x A100 (40GB) or 2x A100 (80GB), use AWQ/GPTQ

Supported platforms: NVIDIA (primary), AMD ROCm, Intel GPUs, TPUs
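
As a rough sanity check before picking hardware, weight memory can be estimated from parameter count and dtype width (about 2 bytes per parameter in FP16/BF16, roughly 0.5-0.7 bytes per parameter with 4-bit quantization plus overhead); KV cache and activations come on top of that. A back-of-the-envelope helper, with the bytes-per-parameter figures as stated assumptions:

```python
# Sketch: rough GPU memory estimate for model weights only (KV cache is extra).
def estimate_weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / 1024**3

for params, label, bpp in [(8, "8B fp16", 2.0), (70, "70B fp16", 2.0), (70, "70B 4-bit (approx.)", 0.6)]:
    print(f"{label}: ~{estimate_weight_memory_gb(params, bpp):.0f} GB")
# 8B fp16: ~15 GB, 70B fp16: ~130 GB, 70B 4-bit: ~39 GB (order-of-magnitude only)
```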

## Resources

- Official docs: https://docs.vllm.ai
- GitHub: https://github.com/vllm-project/vllm
- Paper: "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023)
- Community: https://discuss.vllm.ai

+++ EvoScientist/skills/vllm/references/optimization.md
@@ -0,0 +1,226 @@
# Performance Optimization

## Contents
- PagedAttention explained
- Continuous batching mechanics
- Prefix caching strategies
- Speculative decoding setup
- Benchmark results and comparisons
- Performance tuning guide

## PagedAttention explained

**Traditional attention problem**:
- KV cache stored in contiguous memory
- Wastes ~50% GPU memory due to fragmentation
- Cannot dynamically reallocate for varying sequence lengths

**PagedAttention solution**:
- Divides KV cache into fixed-size blocks (like OS virtual memory)
- Dynamic allocation from free block queue
- Shares blocks across sequences (for prefix caching)

**Memory savings example**:
```
Traditional: 70B model needs 160GB KV cache → OOM on 8x A100
PagedAttention: 70B model needs 80GB KV cache → Fits on 4x A100
```

**Configuration**:
```bash
# Block size (default: 16 tokens)
vllm serve MODEL --block-size 16

# Number of GPU blocks is auto-calculated,
# controlled by --gpu-memory-utilization
vllm serve MODEL --gpu-memory-utilization 0.9
```
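
To make the block arithmetic concrete, the KV cache for one token is roughly `2 * num_layers * num_kv_heads * head_dim * bytes_per_value` (key plus value), and a sequence occupies `ceil(seq_len / block_size)` blocks. The sketch below plugs in Llama-3-8B-like shapes, which are assumptions for illustration rather than values read from vLLM:

```python
# Sketch: back-of-the-envelope KV cache sizing with 16-token blocks.
import math

num_layers, num_kv_heads, head_dim = 32, 8, 128   # Llama-3-8B-like shapes (assumed)
bytes_per_value = 2                                # fp16/bf16
block_size = 16                                    # vLLM default block size

kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")

seq_len = 2000
blocks = math.ceil(seq_len / block_size)
print(f"{seq_len}-token sequence -> {blocks} blocks, "
      f"{blocks * block_size * kv_bytes_per_token / 1024**2:.0f} MiB reserved")
```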

## Continuous batching mechanics

**Traditional batching**:
- Wait for all sequences in batch to finish
- GPU idle while waiting for longest sequence
- Low GPU utilization (~40-60%)

**Continuous batching**:
- Add new requests as slots become available
- Mix prefill (new requests) and decode (ongoing) in same batch
- High GPU utilization (>90%)

**Throughput improvement**:
```
Traditional batching: 50 req/sec @ 50% GPU util
Continuous batching: 200 req/sec @ 90% GPU util
= 4x throughput improvement
```

**Tuning parameters**:
```bash
# Max concurrent sequences (higher = more batching)
vllm serve MODEL --max-num-seqs 256

# Prefill/decode scheduling is auto-balanced by default;
# no manual tuning needed
```
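
The scheduling effect can be illustrated with a toy simulation that counts decode steps only; it ignores prefill and memory effects, so it captures just the tail-waiting waste and understates the real-world gap. The numbers are qualitative, not vLLM measurements:

```python
# Toy simulation: static batches wait for the longest sequence,
# continuous batching refills freed slots immediately.
import random

random.seed(0)
lengths = [random.randint(20, 400) for _ in range(256)]  # output tokens per request
batch_size = 16

# Static batching: each batch of 16 runs until its longest request finishes.
static_steps = sum(max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size))

# Continuous batching: 16 slots, each step decodes one token per active slot.
queue = list(lengths)
slots = [queue.pop() for _ in range(batch_size)]
continuous_steps = 0
while slots:
    continuous_steps += 1
    slots = [r - 1 for r in slots if r > 1]          # finished requests free their slot
    while queue and len(slots) < batch_size:
        slots.append(queue.pop())                    # new request fills the slot at once

print(f"static: {static_steps} steps, continuous: {continuous_steps} steps, "
      f"speedup ~{static_steps / continuous_steps:.1f}x")
```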

## Prefix caching strategies

Reuse computed KV cache for common prompt prefixes.

**Use cases**:
- System prompts repeated across requests
- Few-shot examples in every prompt
- RAG contexts with overlapping chunks

**Example savings**:
```
Prompt: [System: 500 tokens] + [User: 100 tokens]

Without caching: Compute 600 tokens every request
With caching: Compute 500 tokens once, then 100 tokens/request
= 83% faster TTFT
```

**Enable prefix caching**:
```bash
vllm serve MODEL --enable-prefix-caching
```

**Automatic prefix detection**:
- vLLM detects common prefixes automatically
- No code changes required
- Works with OpenAI-compatible API

**Cache hit rate monitoring**:
```bash
curl http://localhost:9090/metrics | grep cache_hit
# vllm_cache_hit_rate: 0.75 (75% hit rate)
```
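
No client changes are needed beyond keeping the shared prefix byte-identical across requests. A sketch of the pattern, with a placeholder system prompt and model name, assuming the server was started with `--enable-prefix-caching`:

```python
# Sketch: requests that share an identical system prompt so later ones
# can reuse the cached prefix.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
SYSTEM = "You are a support assistant for ACME Corp. ..."  # long, identical every time

for question in ["How do I reset my password?", "Where is my invoice?"]:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3-8B-Instruct",   # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM},   # shared prefix
            {"role": "user", "content": question},   # per-request suffix
        ],
    )
    print(resp.choices[0].message.content)
```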

## Speculative decoding setup

Use a smaller "draft" model to propose tokens and the larger target model to verify them.

**Speed improvement**:
```
Standard: Generate 1 token per forward pass
Speculative: Generate 3-5 tokens per forward pass
= 2-3x faster generation
```

**How it works** (a toy sketch follows this list):
1. Draft model proposes K tokens (fast)
2. Target model verifies all K tokens in parallel (one pass)
3. Accept verified tokens, restart from first rejection
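
The acceptance loop, reduced to greedy decoding with stand-in draft/target functions, looks roughly like this; it is a conceptual illustration, not vLLM's implementation:

```python
# Toy greedy speculative decoding: accept draft tokens until they disagree
# with the target model, then take the target's token and continue.
from typing import Callable, List


def speculative_decode(prompt: List[int],
                       draft_next: Callable[[List[int]], int],
                       target_next: Callable[[List[int]], int],
                       k: int = 5, max_new: int = 32) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. Draft proposes k tokens autoregressively (cheap model).
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target checks each position (a single batched pass in real systems).
        for i, t in enumerate(proposal):
            expected = target_next(tokens + proposal[:i])
            if t != expected:
                tokens.extend(proposal[:i])
                tokens.append(expected)   # 3. take the target's token at the first mismatch
                break
        else:
            tokens.extend(proposal)       # all k draft tokens accepted
    return tokens[len(prompt):]
```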

**Setup with separate draft model**:
```bash
vllm serve meta-llama/Llama-3-70B-Instruct \
  --speculative-model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --num-speculative-tokens 5
```

**Setup with n-gram draft** (no separate model):
```bash
vllm serve MODEL \
  --speculative-method ngram \
  --num-speculative-tokens 3
```

**When to use**:
- Output length > 100 tokens
- Draft model 5-10x smaller than target
- Acceptable 2-3% accuracy trade-off

## Benchmark results

**vLLM vs HuggingFace Transformers** (Llama 3 8B, A100):
```
Metric                  | HF Transformers | vLLM   | Improvement
------------------------|-----------------|--------|------------
Throughput (req/sec)    | 12              | 280    | 23x
TTFT (ms)               | 850             | 120    | 7x
Tokens/sec              | 45              | 2,100  | 47x
GPU Memory (GB)         | 28              | 16     | 1.75x less
```

**vLLM vs TensorRT-LLM** (Llama 2 70B, 4x A100):
```
Metric                  | TensorRT-LLM | vLLM         | Notes
------------------------|--------------|--------------|------------------
Throughput (req/sec)    | 320          | 285          | TRT 12% faster
Setup complexity        | High         | Low          | vLLM much easier
NVIDIA-only             | Yes          | No           | vLLM multi-platform
Quantization support    | FP8, INT8    | AWQ/GPTQ/FP8 | vLLM more options
```

## Performance tuning guide

**Step 1: Measure baseline**

```bash
# Optional: install locust for HTTP load tests against the running server
pip install locust

# Run baseline benchmark with vLLM's built-in tool
vllm bench throughput \
  --model MODEL \
  --input-tokens 128 \
  --output-tokens 256 \
  --num-prompts 1000

# Record: throughput, TTFT, tokens/sec
```

**Step 2: Tune memory utilization**

```bash
# Try different values: 0.7, 0.85, 0.9, 0.95
vllm serve MODEL --gpu-memory-utilization 0.9
```

Higher = more batch capacity = higher throughput, but risks OOM.

**Step 3: Tune concurrency**

```bash
# Try values: 128, 256, 512, 1024
vllm serve MODEL --max-num-seqs 256
```

Higher = more batching opportunity, but may increase latency.

**Step 4: Enable optimizations**

```bash
# Prefix caching helps repeated prompts; chunked prefill helps long prompts
vllm serve MODEL \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 512
```

**Step 5: Re-benchmark and compare**

Target improvements:
- Throughput: +30-100%
- TTFT: -20-50%
- GPU utilization: >85%

**Common performance issues**:

**Low throughput (<50 req/sec)**:
- Increase `--max-num-seqs`
- Enable `--enable-prefix-caching`
- Check GPU utilization (should be >80%)

**High TTFT (>1 second)**:
- Enable `--enable-chunked-prefill`
- Reduce `--max-model-len` if possible
- Check if model is too large for GPU

**OOM errors**:
- Reduce `--gpu-memory-utilization` to 0.7
- Reduce `--max-model-len`
- Use quantization (`--quantization awq`)