EvoScientist 0.1.0rc1__py3-none-any.whl → 0.1.0rc2__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (108)
  1. EvoScientist/EvoScientist.py +1 -1
  2. EvoScientist/cli.py +450 -178
  3. EvoScientist/middleware.py +5 -1
  4. EvoScientist/skills/accelerate/SKILL.md +332 -0
  5. EvoScientist/skills/accelerate/references/custom-plugins.md +453 -0
  6. EvoScientist/skills/accelerate/references/megatron-integration.md +489 -0
  7. EvoScientist/skills/accelerate/references/performance.md +525 -0
  8. EvoScientist/skills/bitsandbytes/SKILL.md +411 -0
  9. EvoScientist/skills/bitsandbytes/references/memory-optimization.md +521 -0
  10. EvoScientist/skills/bitsandbytes/references/qlora-training.md +521 -0
  11. EvoScientist/skills/bitsandbytes/references/quantization-formats.md +447 -0
  12. EvoScientist/skills/clip/SKILL.md +253 -0
  13. EvoScientist/skills/clip/references/applications.md +207 -0
  14. EvoScientist/skills/find-skills/SKILL.md +133 -0
  15. EvoScientist/skills/find-skills/scripts/install_skill.py +211 -0
  16. EvoScientist/skills/flash-attention/SKILL.md +367 -0
  17. EvoScientist/skills/flash-attention/references/benchmarks.md +215 -0
  18. EvoScientist/skills/flash-attention/references/transformers-integration.md +293 -0
  19. EvoScientist/skills/langgraph-docs/SKILL.md +36 -0
  20. EvoScientist/skills/llama-cpp/SKILL.md +258 -0
  21. EvoScientist/skills/llama-cpp/references/optimization.md +89 -0
  22. EvoScientist/skills/llama-cpp/references/quantization.md +213 -0
  23. EvoScientist/skills/llama-cpp/references/server.md +125 -0
  24. EvoScientist/skills/lm-evaluation-harness/SKILL.md +490 -0
  25. EvoScientist/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  26. EvoScientist/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  27. EvoScientist/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  28. EvoScientist/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  29. EvoScientist/skills/ml-paper-writing/SKILL.md +937 -0
  30. EvoScientist/skills/ml-paper-writing/references/checklists.md +361 -0
  31. EvoScientist/skills/ml-paper-writing/references/citation-workflow.md +562 -0
  32. EvoScientist/skills/ml-paper-writing/references/reviewer-guidelines.md +367 -0
  33. EvoScientist/skills/ml-paper-writing/references/sources.md +159 -0
  34. EvoScientist/skills/ml-paper-writing/references/writing-guide.md +476 -0
  35. EvoScientist/skills/ml-paper-writing/templates/README.md +251 -0
  36. EvoScientist/skills/ml-paper-writing/templates/aaai2026/README.md +534 -0
  37. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-supp.tex +144 -0
  38. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-template.tex +952 -0
  39. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bib +111 -0
  40. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bst +1493 -0
  41. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.sty +315 -0
  42. EvoScientist/skills/ml-paper-writing/templates/acl/README.md +50 -0
  43. EvoScientist/skills/ml-paper-writing/templates/acl/acl.sty +312 -0
  44. EvoScientist/skills/ml-paper-writing/templates/acl/acl_latex.tex +377 -0
  45. EvoScientist/skills/ml-paper-writing/templates/acl/acl_lualatex.tex +101 -0
  46. EvoScientist/skills/ml-paper-writing/templates/acl/acl_natbib.bst +1940 -0
  47. EvoScientist/skills/ml-paper-writing/templates/acl/anthology.bib.txt +26 -0
  48. EvoScientist/skills/ml-paper-writing/templates/acl/custom.bib +70 -0
  49. EvoScientist/skills/ml-paper-writing/templates/acl/formatting.md +326 -0
  50. EvoScientist/skills/ml-paper-writing/templates/colm2025/README.md +3 -0
  51. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bib +11 -0
  52. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bst +1440 -0
  53. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.pdf +0 -0
  54. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.sty +218 -0
  55. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.tex +305 -0
  56. EvoScientist/skills/ml-paper-writing/templates/colm2025/fancyhdr.sty +485 -0
  57. EvoScientist/skills/ml-paper-writing/templates/colm2025/math_commands.tex +508 -0
  58. EvoScientist/skills/ml-paper-writing/templates/colm2025/natbib.sty +1246 -0
  59. EvoScientist/skills/ml-paper-writing/templates/iclr2026/fancyhdr.sty +485 -0
  60. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bib +24 -0
  61. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bst +1440 -0
  62. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.pdf +0 -0
  63. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.sty +246 -0
  64. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.tex +414 -0
  65. EvoScientist/skills/ml-paper-writing/templates/iclr2026/math_commands.tex +508 -0
  66. EvoScientist/skills/ml-paper-writing/templates/iclr2026/natbib.sty +1246 -0
  67. EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithm.sty +79 -0
  68. EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithmic.sty +201 -0
  69. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.bib +75 -0
  70. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.pdf +0 -0
  71. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.tex +662 -0
  72. EvoScientist/skills/ml-paper-writing/templates/icml2026/fancyhdr.sty +864 -0
  73. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.bst +1443 -0
  74. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.sty +767 -0
  75. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml_numpapers.pdf +0 -0
  76. EvoScientist/skills/ml-paper-writing/templates/neurips2025/Makefile +36 -0
  77. EvoScientist/skills/ml-paper-writing/templates/neurips2025/extra_pkgs.tex +53 -0
  78. EvoScientist/skills/ml-paper-writing/templates/neurips2025/main.tex +38 -0
  79. EvoScientist/skills/ml-paper-writing/templates/neurips2025/neurips.sty +382 -0
  80. EvoScientist/skills/peft/SKILL.md +431 -0
  81. EvoScientist/skills/peft/references/advanced-usage.md +514 -0
  82. EvoScientist/skills/peft/references/troubleshooting.md +480 -0
  83. EvoScientist/skills/ray-data/SKILL.md +326 -0
  84. EvoScientist/skills/ray-data/references/integration.md +82 -0
  85. EvoScientist/skills/ray-data/references/transformations.md +83 -0
  86. EvoScientist/skills/skill-creator/LICENSE.txt +202 -0
  87. EvoScientist/skills/skill-creator/SKILL.md +356 -0
  88. EvoScientist/skills/skill-creator/references/output-patterns.md +82 -0
  89. EvoScientist/skills/skill-creator/references/workflows.md +28 -0
  90. EvoScientist/skills/skill-creator/scripts/init_skill.py +303 -0
  91. EvoScientist/skills/skill-creator/scripts/package_skill.py +110 -0
  92. EvoScientist/skills/skill-creator/scripts/quick_validate.py +95 -0
  93. EvoScientist/skills/tensorboard/SKILL.md +629 -0
  94. EvoScientist/skills/tensorboard/references/integrations.md +638 -0
  95. EvoScientist/skills/tensorboard/references/profiling.md +545 -0
  96. EvoScientist/skills/tensorboard/references/visualization.md +620 -0
  97. EvoScientist/skills/vllm/SKILL.md +364 -0
  98. EvoScientist/skills/vllm/references/optimization.md +226 -0
  99. EvoScientist/skills/vllm/references/quantization.md +284 -0
  100. EvoScientist/skills/vllm/references/server-deployment.md +255 -0
  101. EvoScientist/skills/vllm/references/troubleshooting.md +447 -0
  102. {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/METADATA +26 -3
  103. evoscientist-0.1.0rc2.dist-info/RECORD +119 -0
  104. evoscientist-0.1.0rc1.dist-info/RECORD +0 -21
  105. {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/WHEEL +0 -0
  106. {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/entry_points.txt +0 -0
  107. {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/licenses/LICENSE +0 -0
  108. {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/top_level.txt +0 -0
EvoScientist/skills/vllm/SKILL.md
@@ -0,0 +1,364 @@
+ ---
+ name: vllm
+ description: Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
+ version: 1.0.0
+ author: Orchestra Research
+ license: MIT
+ tags: [vLLM, Inference Serving, PagedAttention, Continuous Batching, High Throughput, Production, OpenAI API, Quantization, Tensor Parallelism]
+ dependencies: [vllm, torch, transformers]
+ ---
+
+ # vLLM - High-Performance LLM Serving
+
+ ## Quick start
+
+ vLLM achieves up to 24x higher throughput than standard HuggingFace transformers generation through PagedAttention (a block-based KV cache) and continuous batching (mixing prefill and decode requests in the same batch).
+
+ **Installation**:
+ ```bash
+ pip install vllm
+ ```
+
+ **Basic offline inference**:
+ ```python
+ from vllm import LLM, SamplingParams
+
+ llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
+ sampling = SamplingParams(temperature=0.7, max_tokens=256)
+
+ outputs = llm.generate(["Explain quantum computing"], sampling)
+ print(outputs[0].outputs[0].text)
+ ```
+
+ **OpenAI-compatible server**:
+ ```bash
+ vllm serve meta-llama/Llama-3-8B-Instruct
+
+ # Query with OpenAI SDK
+ python -c "
+ from openai import OpenAI
+ client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
+ print(client.chat.completions.create(
+     model='meta-llama/Llama-3-8B-Instruct',
+     messages=[{'role': 'user', 'content': 'Hello!'}]
+ ).choices[0].message.content)
+ "
+ ```
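+
+ The same endpoint also supports streaming. A minimal sketch (assumes the server above is running on localhost:8000; model name as above):
+
+ ```python
+ # Streaming variant of the query above, using the OpenAI SDK against the local server.
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+ stream = client.chat.completions.create(
+     model="meta-llama/Llama-3-8B-Instruct",
+     messages=[{"role": "user", "content": "Explain PagedAttention in two sentences."}],
+     stream=True,
+ )
+ for chunk in stream:
+     delta = chunk.choices[0].delta.content
+     if delta:
+         print(delta, end="", flush=True)
+ print()
+ ```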
+
+ ## Common workflows
+
+ ### Workflow 1: Production API deployment
+
+ Copy this checklist and track progress:
+
+ ```
+ Deployment Progress:
+ - [ ] Step 1: Configure server settings
+ - [ ] Step 2: Test with limited traffic
+ - [ ] Step 3: Enable monitoring
+ - [ ] Step 4: Deploy to production
+ - [ ] Step 5: Verify performance metrics
+ ```
+
+ **Step 1: Configure server settings**
+
+ Choose a configuration based on your model size:
+
+ ```bash
+ # For 7B-13B models on a single GPU
+ vllm serve meta-llama/Llama-3-8B-Instruct \
+     --gpu-memory-utilization 0.9 \
+     --max-model-len 8192 \
+     --port 8000
+
+ # For 30B-70B models with tensor parallelism
+ vllm serve meta-llama/Llama-2-70b-hf \
+     --tensor-parallel-size 4 \
+     --gpu-memory-utilization 0.9 \
+     --quantization awq \
+     --port 8000
+
+ # For production with caching and metrics
+ vllm serve meta-llama/Llama-3-8B-Instruct \
+     --gpu-memory-utilization 0.9 \
+     --enable-prefix-caching \
+     --enable-metrics \
+     --metrics-port 9090 \
+     --port 8000 \
+     --host 0.0.0.0
+ ```
+
+ **Step 2: Test with limited traffic**
+
+ Run a load test before going to production:
+
+ ```bash
+ # Install load testing tool
+ pip install locust
+
+ # Create test_load.py with sample requests (see the sketch below)
+ # Run: locust -f test_load.py --host http://localhost:8000
+ ```
+
+ Verify TTFT (time to first token) < 500 ms and throughput > 100 req/sec.
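+
+ A minimal `test_load.py` sketch for locust (the model name and payload are illustrative; adjust them to the model you serve):
+
+ ```python
+ # test_load.py - minimal locust user hammering the chat completions endpoint.
+ from locust import HttpUser, task, between
+
+ class ChatUser(HttpUser):
+     wait_time = between(0.5, 2.0)  # pause between requests per simulated user
+
+     @task
+     def chat(self):
+         self.client.post(
+             "/v1/chat/completions",
+             json={
+                 "model": "meta-llama/Llama-3-8B-Instruct",
+                 "messages": [{"role": "user", "content": "Give me one productivity tip."}],
+                 "max_tokens": 64,
+             },
+         )
+ ```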
+
+ **Step 3: Enable monitoring**
+
+ With the configuration above, vLLM exposes Prometheus metrics on port 9090:
+
+ ```bash
+ curl http://localhost:9090/metrics | grep vllm
+ ```
+
+ Key metrics to monitor (a scraping sketch follows this list):
+ - `vllm:time_to_first_token_seconds` - Latency
+ - `vllm:num_requests_running` - Active requests
+ - `vllm:gpu_cache_usage_perc` - KV cache utilization
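+
+ A small stdlib-only sketch that scrapes the endpoint and prints just these series (the URL assumes the metrics port configured earlier):
+
+ ```python
+ # Print the vLLM metrics listed above from the Prometheus endpoint.
+ from urllib.request import urlopen
+
+ METRICS_URL = "http://localhost:9090/metrics"  # adjust to your deployment
+ WATCHED = (
+     "vllm:time_to_first_token_seconds",
+     "vllm:num_requests_running",
+     "vllm:gpu_cache_usage_perc",
+ )
+
+ body = urlopen(METRICS_URL).read().decode()
+ for line in body.splitlines():
+     if line.startswith(WATCHED):  # skips "# HELP"/"# TYPE" comment lines
+         print(line)
+ ```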
+
+ **Step 4: Deploy to production**
+
+ Use Docker for consistent deployment:
+
+ ```bash
+ # Run vLLM in Docker
+ docker run --gpus all -p 8000:8000 \
+     vllm/vllm-openai:latest \
+     --model meta-llama/Llama-3-8B-Instruct \
+     --gpu-memory-utilization 0.9 \
+     --enable-prefix-caching
+ ```
+
+ **Step 5: Verify performance metrics**
+
+ Check that the deployment meets its targets:
+ - TTFT < 500 ms (for short prompts)
+ - Throughput > target req/sec
+ - GPU utilization > 80%
+ - No OOM errors in logs
+
+ ### Workflow 2: Offline batch inference
+
+ For processing large datasets without the overhead of running a server.
+
+ Copy this checklist:
+
+ ```
+ Batch Processing:
+ - [ ] Step 1: Prepare input data
+ - [ ] Step 2: Configure LLM engine
+ - [ ] Step 3: Run batch inference
+ - [ ] Step 4: Process results
+ ```
+
+ **Step 1: Prepare input data**
+
+ ```python
+ # Load prompts from file
+ prompts = []
+ with open("prompts.txt") as f:
+     prompts = [line.strip() for line in f]
+
+ print(f"Loaded {len(prompts)} prompts")
+ ```
+
+ **Step 2: Configure LLM engine**
+
+ ```python
+ from vllm import LLM, SamplingParams
+
+ llm = LLM(
+     model="meta-llama/Llama-3-8B-Instruct",
+     tensor_parallel_size=2,  # Use 2 GPUs
+     gpu_memory_utilization=0.9,
+     max_model_len=4096
+ )
+
+ sampling = SamplingParams(
+     temperature=0.7,
+     top_p=0.95,
+     max_tokens=512,
+     stop=["</s>", "\n\n"]
+ )
+ ```
+
+ **Step 3: Run batch inference**
+
+ vLLM automatically batches requests for efficiency:
+
+ ```python
+ # Process all prompts in one call
+ outputs = llm.generate(prompts, sampling)
+
+ # vLLM handles batching internally
+ # No need to manually chunk prompts
+ ```
+
+ **Step 4: Process results**
+
+ ```python
+ # Extract generated text
+ results = []
+ for output in outputs:
+     prompt = output.prompt
+     generated = output.outputs[0].text
+     results.append({
+         "prompt": prompt,
+         "generated": generated,
+         "tokens": len(output.outputs[0].token_ids)
+     })
+
+ # Save to file
+ import json
+ with open("results.jsonl", "w") as f:
+     for result in results:
+         f.write(json.dumps(result) + "\n")
+
+ print(f"Processed {len(results)} prompts")
+ ```
+
+ ### Workflow 3: Quantized model serving
+
+ Fit large models in limited GPU memory.
+
+ Copy this checklist:
+
+ ```
+ Quantization Setup:
+ - [ ] Step 1: Choose quantization method
+ - [ ] Step 2: Find or create quantized model
+ - [ ] Step 3: Launch with quantization flag
+ - [ ] Step 4: Verify accuracy
+ ```
+
+ **Step 1: Choose quantization method**
+
+ - **AWQ**: Best for 70B models, minimal accuracy loss
+ - **GPTQ**: Wide model support, good compression
+ - **FP8**: Fastest on H100 GPUs
+
+ **Step 2: Find or create quantized model**
+
+ Use pre-quantized models from HuggingFace:
+
+ ```bash
+ # Search for AWQ models (see the sketch below)
+ # Example: TheBloke/Llama-2-70B-AWQ
+ ```
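+
+ A hedged sketch of searching the Hub programmatically (assumes `huggingface_hub` is installed; the search string is illustrative):
+
+ ```python
+ # List popular AWQ checkpoints on the Hugging Face Hub.
+ from huggingface_hub import HfApi
+
+ api = HfApi()
+ for model in api.list_models(search="Llama-2-70B AWQ", sort="downloads", limit=10):
+     print(model.id)
+ ```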
+
+ **Step 3: Launch with quantization flag**
+
+ ```bash
+ # Using pre-quantized model
+ vllm serve TheBloke/Llama-2-70B-AWQ \
+     --quantization awq \
+     --tensor-parallel-size 1 \
+     --gpu-memory-utilization 0.95
+
+ # Result: 70B model in ~40GB VRAM
+ ```
+
+ **Step 4: Verify accuracy**
+
+ Test that outputs match the expected quality:
+
+ ```python
+ # Compare quantized vs non-quantized responses (see the sketch below)
+ # Verify task-specific performance is unchanged
+ ```
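+
+ A minimal comparison sketch, assuming a full-precision baseline server on port 8000 and the quantized server on port 8001 (both endpoints, model names, and prompts are placeholders):
+
+ ```python
+ # Spot-check quantized vs baseline responses on a few prompts with greedy decoding.
+ from openai import OpenAI
+
+ baseline = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+ quantized = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")
+ prompts = ["Summarize PagedAttention in one sentence.", "What is 17 * 23?"]
+
+ def ask(client, model, prompt):
+     resp = client.chat.completions.create(
+         model=model,
+         messages=[{"role": "user", "content": prompt}],
+         temperature=0.0,
+     )
+     return resp.choices[0].message.content
+
+ for p in prompts:
+     a = ask(baseline, "BASELINE_MODEL", p)
+     b = ask(quantized, "QUANTIZED_MODEL", p)
+     print(f"PROMPT: {p}\n  baseline : {a[:120]}\n  quantized: {b[:120]}\n")
+ ```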
+
+ ## When to use vs alternatives
+
+ **Use vLLM when:**
+ - Deploying production LLM APIs (100+ req/sec)
+ - Serving OpenAI-compatible endpoints
+ - Limited GPU memory but need large models
+ - Multi-user applications (chatbots, assistants)
+ - Need low latency with high throughput
+
+ **Use alternatives instead:**
+ - **llama.cpp**: CPU/edge inference, single-user
+ - **HuggingFace transformers**: Research, prototyping, one-off generation
+ - **TensorRT-LLM**: NVIDIA-only, need absolute maximum performance
+ - **Text-Generation-Inference**: Already in HuggingFace ecosystem
+
+ ## Common issues
+
+ **Issue: Out of memory during model loading**
+
+ Reduce memory usage:
+ ```bash
+ vllm serve MODEL \
+     --gpu-memory-utilization 0.7 \
+     --max-model-len 4096
+ ```
+
+ Or use quantization:
+ ```bash
+ vllm serve MODEL --quantization awq
+ ```
+
+ **Issue: Slow first token (TTFT > 1 second)**
+
+ Enable prefix caching for repeated prompts:
+ ```bash
+ vllm serve MODEL --enable-prefix-caching
+ ```
+
+ For long prompts, enable chunked prefill:
+ ```bash
+ vllm serve MODEL --enable-chunked-prefill
+ ```
+
+ **Issue: Model not found error**
+
+ Use `--trust-remote-code` for custom models:
+ ```bash
+ vllm serve MODEL --trust-remote-code
+ ```
+
+ **Issue: Low throughput (<50 req/sec)**
+
+ Increase concurrent sequences:
+ ```bash
+ vllm serve MODEL --max-num-seqs 512
+ ```
+
+ Check GPU utilization with `nvidia-smi`; it should be >80%.
+
+ **Issue: Inference slower than expected**
+
+ Verify that the tensor-parallel size is a power of 2:
+ ```bash
+ vllm serve MODEL --tensor-parallel-size 4  # Not 3
+ ```
+
+ Enable speculative decoding for faster generation:
+ ```bash
+ vllm serve MODEL --speculative-model DRAFT_MODEL
+ ```
+
+ ## Advanced topics
+
+ **Server deployment patterns**: See [references/server-deployment.md](references/server-deployment.md) for Docker, Kubernetes, and load balancing configurations.
+
+ **Performance optimization**: See [references/optimization.md](references/optimization.md) for PagedAttention tuning, continuous batching details, and benchmark results.
+
+ **Quantization guide**: See [references/quantization.md](references/quantization.md) for AWQ/GPTQ/FP8 setup, model preparation, and accuracy comparisons.
+
+ **Troubleshooting**: See [references/troubleshooting.md](references/troubleshooting.md) for detailed error messages, debugging steps, and performance diagnostics.
+
+ ## Hardware requirements
+
+ - **Small models (7B-13B)**: 1x A10 (24GB) or A100 (40GB)
+ - **Medium models (30B-40B)**: 2x A100 (40GB) with tensor parallelism
+ - **Large models (70B+)**: 4x A100 (40GB) or 2x A100 (80GB), use AWQ/GPTQ
+
+ Supported platforms: NVIDIA (primary), AMD ROCm, Intel GPUs, TPUs
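+
+ As a rough cross-check on these sizings, a back-of-envelope estimate (the constants are illustrative assumptions, not vLLM's allocator: fp16 weights at 2 bytes per parameter, ~30% headroom for KV cache and activations):
+
+ ```python
+ # Back-of-envelope VRAM estimate; compare against the list above.
+ def rough_vram_gb(n_params_b: float, bytes_per_param: float = 2.0, headroom: float = 1.3) -> float:
+     return n_params_b * bytes_per_param * headroom  # 1e9 params * bytes ~= GB
+
+ for size in (8, 34, 70):
+     print(f"{size}B: fp16 ~= {rough_vram_gb(size):.0f} GB, "
+           f"4-bit AWQ ~= {rough_vram_gb(size, bytes_per_param=0.5):.0f} GB")
+ ```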
+
+ ## Resources
+
+ - Official docs: https://docs.vllm.ai
+ - GitHub: https://github.com/vllm-project/vllm
+ - Paper: "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023)
+ - Community: https://discuss.vllm.ai
+
EvoScientist/skills/vllm/references/optimization.md
@@ -0,0 +1,226 @@
+ # Performance Optimization
+
+ ## Contents
+ - PagedAttention explained
+ - Continuous batching mechanics
+ - Prefix caching strategies
+ - Speculative decoding setup
+ - Benchmark results and comparisons
+ - Performance tuning guide
+
+ ## PagedAttention explained
+
+ **Traditional attention problem**:
+ - KV cache stored in contiguous memory
+ - Wastes ~50% GPU memory due to fragmentation
+ - Cannot dynamically reallocate for varying sequence lengths
+
+ **PagedAttention solution**:
+ - Divides KV cache into fixed-size blocks (like OS virtual memory)
+ - Dynamic allocation from free block queue
+ - Shares blocks across sequences (for prefix caching)
+
+ **Memory savings example**:
+ ```
+ Traditional: 70B model needs 160GB KV cache → OOM on 8x A100
+ PagedAttention: 70B model needs 80GB KV cache → Fits on 4x A100
+ ```
+
+ **Configuration**:
+ ```bash
+ # Block size (default: 16 tokens)
+ vllm serve MODEL --block-size 16
+
+ # Number of GPU blocks (auto-calculated)
+ # Controlled by --gpu-memory-utilization
+ vllm serve MODEL --gpu-memory-utilization 0.9
+ ```
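+
+ A toy sketch of the block-table idea (pure Python, not vLLM internals; block count and bookkeeping are illustrative):
+
+ ```python
+ # Sequences own lists of fixed-size block IDs drawn from a shared free pool,
+ # so KV memory grows on demand instead of being reserved contiguously up front.
+ BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM default)
+
+ class BlockPool:
+     def __init__(self, num_blocks: int):
+         self.free = list(range(num_blocks))  # free block IDs
+         self.tables = {}                     # seq_id -> list of block IDs
+
+     def append_token(self, seq_id: str, pos: int) -> None:
+         table = self.tables.setdefault(seq_id, [])
+         if pos % BLOCK_SIZE == 0:            # current block full: grab a new one
+             table.append(self.free.pop())
+         # the token's KV vectors would be written to (table[-1], pos % BLOCK_SIZE)
+
+     def release(self, seq_id: str) -> None:
+         self.free.extend(self.tables.pop(seq_id, []))  # blocks return to the pool
+
+ pool = BlockPool(num_blocks=8)
+ for t in range(40):                          # a 40-token sequence uses ceil(40/16)=3 blocks
+     pool.append_token("req-1", t)
+ print(len(pool.tables["req-1"]), "blocks used,", len(pool.free), "free")
+ pool.release("req-1")
+ ```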
+
+ ## Continuous batching mechanics
+
+ **Traditional batching**:
+ - Wait for all sequences in batch to finish
+ - GPU idle while waiting for longest sequence
+ - Low GPU utilization (~40-60%)
+
+ **Continuous batching**:
+ - Add new requests as slots become available
+ - Mix prefill (new requests) and decode (ongoing) in same batch
+ - High GPU utilization (>90%)
+
+ **Throughput improvement**:
+ ```
+ Traditional batching: 50 req/sec @ 50% GPU util
+ Continuous batching: 200 req/sec @ 90% GPU util
+ = 4x throughput improvement
+ ```
+
+ **Tuning parameters**:
+ ```bash
+ # Max concurrent sequences (higher = more batching)
+ vllm serve MODEL --max-num-seqs 256
+
+ # Prefill/decode schedule (auto-balanced by default)
+ # No manual tuning needed
+ ```
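+
+ A toy simulation of the difference (constants are illustrative; a real scheduler also accounts for prefill cost and KV-cache limits):
+
+ ```python
+ # Each request needs a random number of decode steps; a "slot" is one batch position.
+ import random
+
+ random.seed(0)
+ LENGTHS = [random.randint(10, 100) for _ in range(64)]  # decode steps per request
+ SLOTS = 8                                               # batch size
+
+ def static_batching(lengths):
+     # Fixed batches: every batch runs as long as its longest member.
+     return sum(max(lengths[i:i + SLOTS]) for i in range(0, len(lengths), SLOTS))
+
+ def continuous_batching(lengths):
+     # Refill a slot as soon as its request finishes.
+     pending = list(lengths)
+     active = [pending.pop() for _ in range(min(SLOTS, len(pending)))]
+     steps = 0
+     while active:
+         steps += 1
+         active = [r - 1 for r in active if r > 1]   # finished requests leave their slot
+         while pending and len(active) < SLOTS:
+             active.append(pending.pop())            # new request joins immediately
+     return steps
+
+ useful = sum(LENGTHS)  # total decode steps of real work
+ for name, fn in [("static", static_batching), ("continuous", continuous_batching)]:
+     steps = fn(LENGTHS)
+     print(f"{name:>10}: {steps} batch steps, slot utilization {useful / (steps * SLOTS):.0%}")
+ ```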
+
+ ## Prefix caching strategies
+
+ Reuse computed KV cache for common prompt prefixes.
+
+ **Use cases**:
+ - System prompts repeated across requests
+ - Few-shot examples in every prompt
+ - RAG contexts with overlapping chunks
+
+ **Example savings**:
+ ```
+ Prompt: [System: 500 tokens] + [User: 100 tokens]
+
+ Without caching: Compute 600 tokens every request
+ With caching: Compute 500 tokens once, then 100 tokens/request
+ = 83% faster TTFT
+ ```
+
+ **Enable prefix caching**:
+ ```bash
+ vllm serve MODEL --enable-prefix-caching
+ ```
+
+ **Automatic prefix detection**:
+ - vLLM detects common prefixes automatically
+ - No code changes required
+ - Works with OpenAI-compatible API
+
+ **Cache hit rate monitoring**:
+ ```bash
+ curl http://localhost:9090/metrics | grep cache_hit
+ # vllm_cache_hit_rate: 0.75 (75% hit rate)
+ ```
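+
+ A client-side sketch of the pattern that benefits most: many requests sharing one long system prompt (server assumed to be running with `--enable-prefix-caching`; names and prompts are illustrative):
+
+ ```python
+ # The identical system prompt is prefilled once and its KV blocks are reused;
+ # only the short user turn is newly computed on each request.
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+ SYSTEM = "You are a support assistant for ExampleCorp. " * 50  # long shared prefix
+
+ for question in ["How do I reset my password?", "Where is my invoice?"]:
+     resp = client.chat.completions.create(
+         model="meta-llama/Llama-3-8B-Instruct",
+         messages=[
+             {"role": "system", "content": SYSTEM},   # identical across requests -> cached
+             {"role": "user", "content": question},   # only this part is newly prefilled
+         ],
+     )
+     print(resp.choices[0].message.content[:80])
+ ```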
+
+ ## Speculative decoding setup
+
+ Use a smaller "draft" model to propose tokens and the larger target model to verify them.
+
+ **Speed improvement**:
+ ```
+ Standard: Generate 1 token per forward pass
+ Speculative: Generate 3-5 tokens per forward pass
+ = 2-3x faster generation
+ ```
+
+ **How it works**:
+ 1. Draft model proposes K tokens (fast)
+ 2. Target model verifies all K tokens in parallel (one pass)
+ 3. Accept verified tokens, restart from the first rejection
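+
+ A toy sketch of this propose/verify loop (no real models; draft agreement is simulated with a fixed acceptance probability):
+
+ ```python
+ # Estimate tokens produced per target forward pass under speculative decoding.
+ import random
+
+ random.seed(1)
+ K = 5             # speculative tokens per round
+ ACCEPT_PROB = 0.8 # chance a draft token matches what the target would emit
+
+ def speculative_round():
+     accepted = 0
+     for _ in range(K):                   # target checks the K drafts in one pass
+         if random.random() < ACCEPT_PROB:
+             accepted += 1                # draft token kept
+         else:
+             break                        # first rejection: discard the rest
+     return accepted + 1                  # target still emits one token of its own
+
+ rounds = 1000
+ tokens = sum(speculative_round() for _ in range(rounds))
+ print(f"average tokens per target forward pass: {tokens / rounds:.2f}")
+ ```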
+
+ **Setup with separate draft model**:
+ ```bash
+ vllm serve meta-llama/Llama-3-70B-Instruct \
+     --speculative-model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
+     --num-speculative-tokens 5
+ ```
+
+ **Setup with n-gram draft** (no separate model):
+ ```bash
+ vllm serve MODEL \
+     --speculative-method ngram \
+     --num-speculative-tokens 3
+ ```
+
+ **When to use**:
+ - Output length > 100 tokens
+ - Draft model 5-10x smaller than target
+ - Spare GPU memory available to hold the draft model (verification preserves the target model's output distribution)
+
+ ## Benchmark results
+
+ **vLLM vs HuggingFace Transformers** (Llama 3 8B, A100):
+ ```
+ Metric                  | HF Transformers | vLLM   | Improvement
+ ------------------------|-----------------|--------|------------
+ Throughput (req/sec)    | 12              | 280    | 23x
+ TTFT (ms)               | 850             | 120    | 7x
+ Tokens/sec              | 45              | 2,100  | 47x
+ GPU Memory (GB)         | 28              | 16     | 1.75x less
+ ```
+
+ **vLLM vs TensorRT-LLM** (Llama 2 70B, 4x A100):
+ ```
+ Metric                  | TensorRT-LLM | vLLM         | Notes
+ ------------------------|--------------|--------------|------------------
+ Throughput (req/sec)    | 320          | 285          | TRT 12% faster
+ Setup complexity        | High         | Low          | vLLM much easier
+ NVIDIA-only             | Yes          | No           | vLLM multi-platform
+ Quantization support    | FP8, INT8    | AWQ/GPTQ/FP8 | vLLM more options
+ ```
+
+ ## Performance tuning guide
+
+ **Step 1: Measure baseline**
+
+ ```bash
+ # Optional: install locust for HTTP load testing
+ pip install locust
+
+ # Run baseline benchmark
+ vllm bench throughput \
+     --model MODEL \
+     --input-tokens 128 \
+     --output-tokens 256 \
+     --num-prompts 1000
+
+ # Record: throughput, TTFT, tokens/sec
+ ```
+
+ **Step 2: Tune memory utilization**
+
+ ```bash
+ # Try different values: 0.7, 0.85, 0.9, 0.95
+ vllm serve MODEL --gpu-memory-utilization 0.9
+ ```
+
+ Higher values give more batch capacity and higher throughput, but risk OOM.
+
+ **Step 3: Tune concurrency**
+
+ ```bash
+ # Try values: 128, 256, 512, 1024
+ vllm serve MODEL --max-num-seqs 256
+ ```
+
+ Higher values give more batching opportunity, but may increase per-request latency.
+
+ **Step 4: Enable optimizations**
+
+ ```bash
+ # --enable-prefix-caching: for repeated prompts
+ # --enable-chunked-prefill: for long prompts
+ vllm serve MODEL \
+     --enable-prefix-caching \
+     --enable-chunked-prefill \
+     --gpu-memory-utilization 0.9 \
+     --max-num-seqs 512
+ ```
+
+ **Step 5: Re-benchmark and compare**
+
+ Target improvements:
+ - Throughput: +30-100%
+ - TTFT: -20-50%
+ - GPU utilization: >85%
+
+ **Common performance issues**:
+
+ **Low throughput (<50 req/sec)**:
+ - Increase `--max-num-seqs`
+ - Enable `--enable-prefix-caching`
+ - Check GPU utilization (should be >80%)
+
+ **High TTFT (>1 second)**:
+ - Enable `--enable-chunked-prefill`
+ - Reduce `--max-model-len` if possible
+ - Check if model is too large for GPU
+
+ **OOM errors**:
+ - Reduce `--gpu-memory-utilization` to 0.7
+ - Reduce `--max-model-len`
+ - Use quantization (`--quantization awq`)