EvoScientist 0.0.1.dev4__py3-none-any.whl → 0.1.0rc2__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- EvoScientist/EvoScientist.py +25 -61
- EvoScientist/__init__.py +0 -19
- EvoScientist/backends.py +0 -26
- EvoScientist/cli.py +1365 -480
- EvoScientist/middleware.py +7 -56
- EvoScientist/skills/clip/SKILL.md +253 -0
- EvoScientist/skills/clip/references/applications.md +207 -0
- EvoScientist/skills/langgraph-docs/SKILL.md +36 -0
- EvoScientist/skills/tensorboard/SKILL.md +629 -0
- EvoScientist/skills/tensorboard/references/integrations.md +638 -0
- EvoScientist/skills/tensorboard/references/profiling.md +545 -0
- EvoScientist/skills/tensorboard/references/visualization.md +620 -0
- EvoScientist/skills/vllm/SKILL.md +364 -0
- EvoScientist/skills/vllm/references/optimization.md +226 -0
- EvoScientist/skills/vllm/references/quantization.md +284 -0
- EvoScientist/skills/vllm/references/server-deployment.md +255 -0
- EvoScientist/skills/vllm/references/troubleshooting.md +447 -0
- EvoScientist/stream/__init__.py +0 -25
- EvoScientist/stream/utils.py +16 -23
- EvoScientist/tools.py +2 -75
- {evoscientist-0.0.1.dev4.dist-info → evoscientist-0.1.0rc2.dist-info}/METADATA +8 -153
- {evoscientist-0.0.1.dev4.dist-info → evoscientist-0.1.0rc2.dist-info}/RECORD +26 -24
- evoscientist-0.1.0rc2.dist-info/entry_points.txt +2 -0
- EvoScientist/config.py +0 -274
- EvoScientist/llm/__init__.py +0 -21
- EvoScientist/llm/models.py +0 -99
- EvoScientist/memory.py +0 -715
- EvoScientist/onboard.py +0 -725
- EvoScientist/paths.py +0 -44
- EvoScientist/skills_manager.py +0 -391
- EvoScientist/stream/display.py +0 -604
- EvoScientist/stream/events.py +0 -415
- EvoScientist/stream/state.py +0 -343
- evoscientist-0.0.1.dev4.dist-info/entry_points.txt +0 -5
- {evoscientist-0.0.1.dev4.dist-info → evoscientist-0.1.0rc2.dist-info}/WHEEL +0 -0
- {evoscientist-0.0.1.dev4.dist-info → evoscientist-0.1.0rc2.dist-info}/licenses/LICENSE +0 -0
- {evoscientist-0.0.1.dev4.dist-info → evoscientist-0.1.0rc2.dist-info}/top_level.txt +0 -0
+++ b/EvoScientist/skills/vllm/references/quantization.md
@@ -0,0 +1,284 @@
# Quantization Guide

## Contents
- Quantization methods comparison
- AWQ setup and usage
- GPTQ setup and usage
- FP8 quantization (H100)
- Model preparation
- Accuracy vs compression trade-offs

## Quantization methods comparison

| Method | Compression | Accuracy Loss | Speed | Best For |
|--------|-------------|---------------|-------|----------|
| **AWQ** | 4-bit (75%) | <1% | Fast | 70B models, production |
| **GPTQ** | 4-bit (75%) | 1-2% | Fast | Wide model support |
| **FP8** | 8-bit (50%) | <0.5% | Fastest | H100 GPUs only |
| **SqueezeLLM** | 3-4 bit (75-80%) | 2-3% | Medium | Extreme compression |

**Recommendation**:
- **Production**: Use AWQ for 70B models
- **H100 GPUs**: Use FP8 for best speed
- **Maximum compatibility**: Use GPTQ
- **Extreme compression**: Use SqueezeLLM

## AWQ setup and usage

**AWQ** (Activation-aware Weight Quantization) achieves the best accuracy at 4-bit.

**Step 1: Find a pre-quantized model**

Search HuggingFace for AWQ models:
```bash
# Example: TheBloke/Llama-2-70B-AWQ
# Example: TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ
```
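
To search programmatically rather than through the web UI, here is a minimal sketch using `huggingface_hub` (assumed installed; it is not otherwise used in this guide) with `HfApi.list_models` and a plain search string:

```python
from huggingface_hub import HfApi

# List a few AWQ checkpoints matching a base model name (illustrative query)
api = HfApi()
for m in api.list_models(search="Llama-2-70B AWQ", limit=5):
    print(m.id)
```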

**Step 2: Launch with AWQ**

```bash
vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95
```

**Memory savings**:
```
Llama 2 70B fp16: 140GB VRAM (4x A100 needed)
Llama 2 70B AWQ:  35GB VRAM (1x A100 40GB)
= 4x memory reduction
```
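
The 4x figure follows directly from bytes per weight: 70B parameters at 16 bits is about 140 GB, at 4 bits about 35 GB. A quick sketch of that arithmetic (weights only; the KV cache and activations come on top, which is why `--gpu-memory-utilization` still matters):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    # bytes = parameters * bits / 8, reported in units of 1e9 bytes
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_memory_gb(70, 16))  # 140.0 -> fp16 baseline
print(weight_memory_gb(70, 4))   #  35.0 -> 4-bit AWQ
```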

**Step 3: Verify performance**

Test that outputs are acceptable:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Test complex reasoning
response = client.chat.completions.create(
    model="TheBloke/Llama-2-70B-AWQ",
    messages=[{"role": "user", "content": "Explain quantum entanglement"}]
)

print(response.choices[0].message.content)
# Verify quality matches your requirements
```

**Quantize your own model** (requires GPU with 80GB+ VRAM):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-70b-hf"
quant_path = "llama-2-70b-awq"

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4}
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

## GPTQ setup and usage

**GPTQ** has the widest model support and good compression.

**Step 1: Find a GPTQ model**

```bash
# Example: TheBloke/Llama-2-13B-GPTQ
# Example: TheBloke/CodeLlama-34B-GPTQ
```

**Step 2: Launch with GPTQ**

```bash
vllm serve TheBloke/Llama-2-13B-GPTQ \
  --quantization gptq \
  --dtype float16
```

**GPTQ configuration options**:
```bash
# Specify GPTQ parameters if needed (e.g. activation ordering)
vllm serve MODEL \
  --quantization gptq \
  --gptq-act-order \
  --dtype float16
```

**Quantize your own model**:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_name = "meta-llama/Llama-2-13b-hf"
quantized_name = "llama-2-13b-gptq"

# Quantization settings
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True
)

# Load model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)

# Prepare calibration data
calib_data = [...]  # List of sample texts

# Quantize
model.quantize(calib_data)

# Save
model.save_quantized(quantized_name)
```

## FP8 quantization (H100)

**FP8** (8-bit floating point) offers best speed on H100 GPUs with minimal accuracy loss.

**Requirements**:
- H100 or H800 GPU
- CUDA 12.3+ (12.8 recommended)
- Hopper architecture support

**Step 1: Enable FP8**

```bash
vllm serve meta-llama/Llama-3-70B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 2
```

**Performance gains on H100**:
```
fp16: 180 tokens/sec
FP8:  320 tokens/sec
= 1.8x speedup
```
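
Throughput depends heavily on batch size, prompt/output lengths, and hardware, so it is worth measuring on your own workload rather than relying on the numbers above. A minimal single-request sketch against the server started in Step 1 (for realistic figures, drive many concurrent requests or use a dedicated benchmark tool):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="meta-llama/Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize the theory of relativity."}],
    max_tokens=256,
)
elapsed = time.perf_counter() - start

# The OpenAI-compatible server reports token usage on non-streaming responses
print(f"{resp.usage.completion_tokens / elapsed:.1f} tokens/sec (single request)")
```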

**Step 2: Verify accuracy**

FP8 typically has <0.5% accuracy degradation:
```python
# Run evaluation suite
# Compare FP8 vs FP16 on your tasks
# Verify acceptable accuracy
```

**Dynamic FP8 quantization** (no pre-quantized model needed):

```bash
# vLLM automatically quantizes at runtime
vllm serve MODEL --quantization fp8
# No model preparation required
```

## Model preparation

**Pre-quantized models (easiest)**:

1. Search HuggingFace: `[model name] AWQ` or `[model name] GPTQ`
2. Download or use directly: `TheBloke/[Model]-AWQ`
3. Launch with the appropriate `--quantization` flag

**Quantize your own model**:

**AWQ**:
```bash
# Install AutoAWQ
pip install autoawq

# Run quantization script
python quantize_awq.py --model MODEL --output OUTPUT
```

**GPTQ**:
```bash
# Install AutoGPTQ
pip install auto-gptq

# Run quantization script
python quantize_gptq.py --model MODEL --output OUTPUT
```

**Calibration data**:
- Use 128-512 diverse examples from the target domain (see the sketch below)
- Representative of production inputs
- Higher quality calibration = better accuracy
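
A minimal sketch of assembling a calibration list with the `datasets` library (the dataset here is only an illustration; substitute text that resembles your production traffic, and note that some toolkits expect the texts to be tokenized before they are passed to `quantize()`):

```python
from datasets import load_dataset

# Illustrative source; swap in data from your own domain
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# Keep a few hundred reasonably long, diverse samples
calib_data = [t for t in ds["text"] if len(t.split()) > 50][:256]
print(f"{len(calib_data)} calibration samples")
```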

## Accuracy vs compression trade-offs

**Empirical results** (Llama 2 70B on MMLU benchmark):

| Quantization | Accuracy | Memory | Speed | Production-Ready |
|--------------|----------|--------|-------|------------------|
| FP16 (baseline) | 100% | 140GB | 1.0x | ✅ (if memory available) |
| FP8 | 99.5% | 70GB | 1.8x | ✅ (H100 only) |
| AWQ 4-bit | 99.0% | 35GB | 1.5x | ✅ (best for 70B) |
| GPTQ 4-bit | 98.5% | 35GB | 1.5x | ✅ (good compatibility) |
| SqueezeLLM 3-bit | 96.0% | 26GB | 1.3x | ⚠️ (check accuracy) |

**When to use each**:

**No quantization (FP16)**:
- Have sufficient GPU memory
- Need absolute best accuracy
- Model <13B parameters

**FP8**:
- Using H100/H800 GPUs
- Need best speed with minimal accuracy loss
- Production deployment

**AWQ 4-bit**:
- Need to fit a 70B model in a 40GB GPU
- Production deployment
- <1% accuracy loss acceptable

**GPTQ 4-bit**:
- Wide model support needed
- Not on H100 (use FP8 instead)
- 1-2% accuracy loss acceptable

**Testing strategy**:

1. **Baseline**: Measure FP16 accuracy on your evaluation set
2. **Quantize**: Create the quantized version
3. **Evaluate**: Compare quantized vs baseline on the same tasks
4. **Decide**: Accept if degradation < threshold (typically 1-2%)

**Example evaluation**:
```python
# `evaluate`, `eval_suite`, `model_fp16`, and `model_awq` are placeholders
# for your own evaluation harness and the two models under test.

# Run on FP16 baseline
baseline_score = evaluate(model_fp16, eval_suite)

# Run on quantized
quant_score = evaluate(model_awq, eval_suite)

# Compare
degradation = (baseline_score - quant_score) / baseline_score * 100
print(f"Accuracy degradation: {degradation:.2f}%")

# Decision
if degradation < 1.0:
    print("✅ Quantization acceptable for production")
else:
    print("⚠️ Review accuracy loss")
```

+++ b/EvoScientist/skills/vllm/references/server-deployment.md
@@ -0,0 +1,255 @@
# Server Deployment Patterns

## Contents
- Docker deployment
- Kubernetes deployment
- Load balancing with Nginx
- Multi-node distributed serving
- Production configuration examples
- Health checks and monitoring

## Docker deployment

**Basic Dockerfile**:
```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

RUN apt-get update && apt-get install -y python3-pip
RUN pip install vllm

EXPOSE 8000

CMD ["vllm", "serve", "meta-llama/Llama-3-8B-Instruct", \
     "--host", "0.0.0.0", "--port", "8000", \
     "--gpu-memory-utilization", "0.9"]
```

**Build and run**:
```bash
docker build -t vllm-server .
docker run --gpus all -p 8000:8000 vllm-server
```

**Docker Compose** (with metrics):
```yaml
version: '3.8'
services:
  vllm:
    image: vllm/vllm-openai:latest
    command: >
      --model meta-llama/Llama-3-8B-Instruct
      --gpu-memory-utilization 0.9
      --enable-metrics
      --metrics-port 9090
    ports:
      - "8000:8000"
      - "9090:9090"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
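
To bring the stack up (assuming the file above is saved as `docker-compose.yml` and the NVIDIA Container Toolkit is installed on the host):

```bash
docker compose up -d
docker compose logs -f vllm            # watch model download/loading progress
curl http://localhost:8000/v1/models   # confirm the server is answering
```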

## Kubernetes deployment

**Deployment manifest**:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model=meta-llama/Llama-3-8B-Instruct"
        - "--gpu-memory-utilization=0.9"
        - "--enable-prefix-caching"
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8000
          name: http
        - containerPort: 9090
          name: metrics
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - port: 8000
    targetPort: 8000
    name: http
  - port: 9090
    targetPort: 9090
    name: metrics
  type: LoadBalancer
```
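
Applying and smoke-testing the manifest (assuming it is saved as `vllm.yaml` and the cluster already runs the NVIDIA device plugin so `nvidia.com/gpu` is schedulable):

```bash
kubectl apply -f vllm.yaml
kubectl get pods -l app=vllm                      # wait for READY 1/1 on both replicas
kubectl port-forward svc/vllm-service 8000:8000 &
curl http://localhost:8000/v1/models
```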

## Load balancing with Nginx

**Nginx configuration**:
```nginx
upstream vllm_backend {
    least_conn;  # Route to least-loaded server
    server localhost:8001;
    server localhost:8002;
    server localhost:8003;
}

server {
    listen 80;

    location / {
        proxy_pass http://vllm_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Timeouts for long-running inference
        proxy_read_timeout 300s;
        proxy_connect_timeout 75s;
    }

    # Metrics endpoint
    location /metrics {
        proxy_pass http://localhost:9090/metrics;
    }
}
```

**Start multiple vLLM instances**:
```bash
# Terminal 1
vllm serve MODEL --port 8001 --tensor-parallel-size 1

# Terminal 2
vllm serve MODEL --port 8002 --tensor-parallel-size 1

# Terminal 3
vllm serve MODEL --port 8003 --tensor-parallel-size 1

# Start Nginx
nginx -c /path/to/nginx.conf
```
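
A quick check that traffic is actually flowing through Nginx rather than hitting a backend directly (requests go to port 80; with `least_conn` they are spread across 8001-8003):

```bash
# Model list served by whichever backend Nginx selects
curl -s http://localhost/v1/models

# Fire a few requests to confirm the upstream pool is healthy
for i in $(seq 1 5); do
  curl -s -o /dev/null -w "request $i -> HTTP %{http_code}\n" http://localhost/v1/models
done
```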

## Multi-node distributed serving

For models too large for a single node:

**Node 1** (master):
```bash
export MASTER_ADDR=192.168.1.10
export MASTER_PORT=29500
export RANK=0
export WORLD_SIZE=2

vllm serve meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2
```

**Node 2** (worker):
```bash
export MASTER_ADDR=192.168.1.10
export MASTER_PORT=29500
export RANK=1
export WORLD_SIZE=2

vllm serve meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2
```

## Production configuration examples

**High throughput** (batch-heavy workload):
```bash
vllm serve MODEL \
  --max-num-seqs 512 \
  --gpu-memory-utilization 0.95 \
  --enable-prefix-caching \
  --trust-remote-code
```

**Low latency** (interactive workload):
```bash
vllm serve MODEL \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.85 \
  --enable-chunked-prefill
```

**Memory-constrained** (40GB GPU for a 70B model):
```bash
vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 4096
```
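
Which profile fits depends on your traffic shape; a small sketch for sanity-checking request latency under light concurrency with the OpenAI client (`MODEL` is a placeholder for whatever the server is serving; vLLM's own benchmark scripts are the better tool for rigorous comparisons):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def timed_request(_):
    start = time.perf_counter()
    client.chat.completions.create(
        model="MODEL",  # placeholder: the model the server was launched with
        messages=[{"role": "user", "content": "Say hello."}],
        max_tokens=32,
    )
    return time.perf_counter() - start

# 16 concurrent requests; rerun against each configuration above and compare
with ThreadPoolExecutor(max_workers=16) as pool:
    latencies = sorted(pool.map(timed_request, range(16)))

print(f"p50={latencies[len(latencies) // 2]:.2f}s  max={latencies[-1]:.2f}s")
```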

## Health checks and monitoring

**Health check endpoint**:
```bash
curl http://localhost:8000/health
# Returns: {"status": "ok"}
```

**Readiness check** (wait until the model is loaded):
```bash
#!/bin/bash
until curl -f http://localhost:8000/health; do
  echo "Waiting for vLLM to be ready..."
  sleep 5
done
echo "vLLM is ready!"
```

**Prometheus scraping**:
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: '/metrics'
    scrape_interval: 15s
```

**Grafana dashboard** (key metrics):
- Requests per second: `rate(vllm_request_success_total[5m])`
- TTFT p50: `histogram_quantile(0.5, rate(vllm_time_to_first_token_seconds_bucket[5m]))`
- TTFT p99: `histogram_quantile(0.99, rate(vllm_time_to_first_token_seconds_bucket[5m]))`
- GPU cache usage: `vllm_gpu_cache_usage_perc`
- Active requests: `vllm_num_requests_running`