EvoScientist 0.0.1.dev4__py3-none-any.whl → 0.1.0rc2__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (37)
  1. EvoScientist/EvoScientist.py +25 -61
  2. EvoScientist/__init__.py +0 -19
  3. EvoScientist/backends.py +0 -26
  4. EvoScientist/cli.py +1365 -480
  5. EvoScientist/middleware.py +7 -56
  6. EvoScientist/skills/clip/SKILL.md +253 -0
  7. EvoScientist/skills/clip/references/applications.md +207 -0
  8. EvoScientist/skills/langgraph-docs/SKILL.md +36 -0
  9. EvoScientist/skills/tensorboard/SKILL.md +629 -0
  10. EvoScientist/skills/tensorboard/references/integrations.md +638 -0
  11. EvoScientist/skills/tensorboard/references/profiling.md +545 -0
  12. EvoScientist/skills/tensorboard/references/visualization.md +620 -0
  13. EvoScientist/skills/vllm/SKILL.md +364 -0
  14. EvoScientist/skills/vllm/references/optimization.md +226 -0
  15. EvoScientist/skills/vllm/references/quantization.md +284 -0
  16. EvoScientist/skills/vllm/references/server-deployment.md +255 -0
  17. EvoScientist/skills/vllm/references/troubleshooting.md +447 -0
  18. EvoScientist/stream/__init__.py +0 -25
  19. EvoScientist/stream/utils.py +16 -23
  20. EvoScientist/tools.py +2 -75
  21. {evoscientist-0.0.1.dev4.dist-info → evoscientist-0.1.0rc2.dist-info}/METADATA +8 -153
  22. {evoscientist-0.0.1.dev4.dist-info → evoscientist-0.1.0rc2.dist-info}/RECORD +26 -24
  23. evoscientist-0.1.0rc2.dist-info/entry_points.txt +2 -0
  24. EvoScientist/config.py +0 -274
  25. EvoScientist/llm/__init__.py +0 -21
  26. EvoScientist/llm/models.py +0 -99
  27. EvoScientist/memory.py +0 -715
  28. EvoScientist/onboard.py +0 -725
  29. EvoScientist/paths.py +0 -44
  30. EvoScientist/skills_manager.py +0 -391
  31. EvoScientist/stream/display.py +0 -604
  32. EvoScientist/stream/events.py +0 -415
  33. EvoScientist/stream/state.py +0 -343
  34. evoscientist-0.0.1.dev4.dist-info/entry_points.txt +0 -5
  35. {evoscientist-0.0.1.dev4.dist-info → evoscientist-0.1.0rc2.dist-info}/WHEEL +0 -0
  36. {evoscientist-0.0.1.dev4.dist-info → evoscientist-0.1.0rc2.dist-info}/licenses/LICENSE +0 -0
  37. {evoscientist-0.0.1.dev4.dist-info → evoscientist-0.1.0rc2.dist-info}/top_level.txt +0 -0
@@ -0,0 +1,284 @@ EvoScientist/skills/vllm/references/quantization.md

# Quantization Guide

## Contents
- Quantization methods comparison
- AWQ setup and usage
- GPTQ setup and usage
- FP8 quantization (H100)
- Model preparation
- Accuracy vs compression trade-offs

## Quantization methods comparison

| Method | Compression | Accuracy Loss | Speed | Best For |
|--------|-------------|---------------|-------|----------|
| **AWQ** | 4-bit (75%) | <1% | Fast | 70B models, production |
| **GPTQ** | 4-bit (75%) | 1-2% | Fast | Wide model support |
| **FP8** | 8-bit (50%) | <0.5% | Fastest | H100 GPUs only |
| **SqueezeLLM** | 3-4 bit (75-80%) | 2-3% | Medium | Extreme compression |

**Recommendation** (codified in the helper sketch after this list):
- **Production**: Use AWQ for 70B models
- **H100 GPUs**: Use FP8 for best speed
- **Maximum compatibility**: Use GPTQ
- **Extreme compression**: Use SqueezeLLM

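The rules of thumb above are easy to codify. The sketch below is illustrative only: the return values happen to match vLLM's `--quantization` names, but the function and its thresholds are this guide's own shorthand, not a vLLM API.

```python
def pick_quantization(gpu: str, model_size_b: float, fits_in_vram: bool) -> str | None:
    """Rule-of-thumb from the comparison table above; not an official vLLM helper."""
    if fits_in_vram and model_size_b <= 13:
        return None          # small model that fits: serve in FP16, no quantization
    if gpu.upper() in {"H100", "H800"}:
        return "fp8"         # Hopper GPUs: fastest option, <0.5% accuracy loss
    if model_size_b >= 70:
        return "awq"         # large models: best 4-bit accuracy in production
    return "gptq"            # everything else: widest compatibility


print(pick_quantization("A100", 70, fits_in_vram=False))   # -> awq
print(pick_quantization("H100", 70, fits_in_vram=False))   # -> fp8
```
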
## AWQ setup and usage

**AWQ** (Activation-aware Weight Quantization) achieves the best accuracy among 4-bit methods.

**Step 1: Find a pre-quantized model**

Search HuggingFace for AWQ models:
```bash
# Example: TheBloke/Llama-2-70B-AWQ
# Example: TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ
```

**Step 2: Launch with AWQ**

```bash
vllm serve TheBloke/Llama-2-70B-AWQ \
    --quantization awq \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.95
```

**Memory savings**:
```
Llama 2 70B fp16: 140GB VRAM (4x A100 needed)
Llama 2 70B AWQ:  35GB VRAM (1x A100 40GB)
= 4x memory reduction
```
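The savings follow directly from bits per weight; a quick back-of-the-envelope check (weights only, ignoring KV cache and runtime overhead):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: parameters x bits / 8 (weights only)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


print(weight_memory_gb(70, 16))  # FP16: ~140 GB
print(weight_memory_gb(70, 4))   # AWQ 4-bit: ~35 GB (plus a little for scales/zero points)
```

Real deployments also need headroom for the KV cache, which is why `--gpu-memory-utilization` still matters after quantization.
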
**Step 3: Verify performance**

Test that outputs are acceptable:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Test complex reasoning
response = client.chat.completions.create(
    model="TheBloke/Llama-2-70B-AWQ",
    messages=[{"role": "user", "content": "Explain quantum entanglement"}]
)

print(response.choices[0].message.content)
# Verify quality matches your requirements
```

**Quantize your own model** (requires a GPU with 80GB+ VRAM):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-70b-hf"
quant_path = "llama-2-70b-awq"

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize (AutoAWQ runs calibration with its default dataset)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4}
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized weights and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
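Before standing up a server, the freshly saved checkpoint can be smoke-tested offline with vLLM's Python API. A minimal sketch, assuming the `llama-2-70b-awq` output directory from the snippet above and a GPU large enough to hold it:

```python
from vllm import LLM, SamplingParams

# Load the AWQ checkpoint written by the quantization step above
llm = LLM(model="llama-2-70b-awq", quantization="awq", tensor_parallel_size=1)

outputs = llm.generate(
    ["Explain quantum entanglement in two sentences."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```
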
## GPTQ setup and usage

**GPTQ** has the widest model support and good compression.

**Step 1: Find a GPTQ model**

```bash
# Example: TheBloke/Llama-2-13B-GPTQ
# Example: TheBloke/CodeLlama-34B-GPTQ
```

**Step 2: Launch with GPTQ**

```bash
vllm serve TheBloke/Llama-2-13B-GPTQ \
    --quantization gptq \
    --dtype float16
```

**GPTQ configuration options**:
```bash
# GPTQ parameters such as group size and activation ordering (desc_act)
# are read from the checkpoint's quantize_config.json, so only the
# quantization method and dtype need to be set at launch
vllm serve MODEL \
    --quantization gptq \
    --dtype float16
```

**Quantize your own model**:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_name = "meta-llama/Llama-2-13b-hf"
quantized_name = "llama-2-13b-gptq"

# Quantization settings (must exist before loading the model)
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True
)

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)

# Prepare calibration data: a list of tokenized samples
# (dicts with "input_ids" and "attention_mask")
calib_data = [...]

# Quantize
model.quantize(calib_data)

# Save
model.save_quantized(quantized_name)
tokenizer.save_pretrained(quantized_name)
```

## FP8 quantization (H100)

**FP8** (8-bit floating point) offers the best speed on H100 GPUs with minimal accuracy loss.

**Requirements**:
- H100 or H800 GPU
- CUDA 12.3+ (12.8 recommended)
- Hopper architecture support

**Step 1: Enable FP8**

```bash
vllm serve meta-llama/Llama-3-70B-Instruct \
    --quantization fp8 \
    --tensor-parallel-size 2
```

**Performance gains on H100**:
```
fp16: 180 tokens/sec
FP8:  320 tokens/sec
= 1.8x speedup
```
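Throughput depends heavily on hardware, batch size, and prompt mix, so measure it on your own traffic. A rough single-request measurement against the running server (OpenAI-compatible endpoint as launched above; streamed chunks are used as an approximate token count):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
tokens = 0
stream = client.chat.completions.create(
    model="meta-llama/Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize the history of GPUs in ~300 words."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1  # roughly one token per streamed content chunk

elapsed = time.perf_counter() - start
print(f"~{tokens / elapsed:.1f} tokens/sec (single request; batched throughput is higher)")
```
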
**Step 2: Verify accuracy**

FP8 typically has <0.5% accuracy degradation:
```python
# Run your evaluation suite against both the FP8 and FP16 deployments,
# compare scores on the same tasks, and confirm the degradation is
# acceptable (see "Example evaluation" in the trade-offs section below)
```

**Dynamic FP8 quantization** (no pre-quantized model needed):

```bash
# vLLM automatically quantizes at runtime
vllm serve MODEL --quantization fp8
# No model preparation required
```

## Model preparation

**Pre-quantized models (easiest)**:

1. Search HuggingFace: `[model name] AWQ` or `[model name] GPTQ`
2. Download or use directly: `TheBloke/[Model]-AWQ`
3. Launch with the appropriate `--quantization` flag

**Quantize your own model**:

**AWQ**:
```bash
# Install AutoAWQ
pip install autoawq

# Run your quantization script (e.g. a wrapper around the AutoAWQ snippet above)
python quantize_awq.py --model MODEL --output OUTPUT
```

**GPTQ**:
```bash
# Install AutoGPTQ
pip install auto-gptq

# Run your quantization script (e.g. a wrapper around the AutoGPTQ snippet above)
python quantize_gptq.py --model MODEL --output OUTPUT
```

**Calibration data** (a short sketch follows this list):
- Use 128-512 diverse examples from the target domain
- Make them representative of production inputs
- Higher-quality calibration = better accuracy

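For illustration, one way to assemble such a list with the Hugging Face `datasets` library; WikiText is used purely as a stand-in corpus, so swap in text from your own domain:

```python
from datasets import load_dataset

# Stand-in corpus; replace with samples that look like your production traffic
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

calib_data = [
    text for text in ds["text"]
    if len(text.split()) > 50       # drop headings and near-empty lines
][:256]                             # 128-512 examples is usually enough

print(f"{len(calib_data)} calibration samples")
```

For GPTQ, tokenize these strings first (the AutoGPTQ example above expects `input_ids`/`attention_mask` dicts); AutoAWQ can take raw text through its `calib_data` argument.
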
## Accuracy vs compression trade-offs

**Empirical results** (Llama 2 70B on MMLU benchmark):

| Quantization | Accuracy | Memory | Speed | Production-Ready |
|--------------|----------|--------|-------|------------------|
| FP16 (baseline) | 100% | 140GB | 1.0x | ✅ (if memory available) |
| FP8 | 99.5% | 70GB | 1.8x | ✅ (H100 only) |
| AWQ 4-bit | 99.0% | 35GB | 1.5x | ✅ (best for 70B) |
| GPTQ 4-bit | 98.5% | 35GB | 1.5x | ✅ (good compatibility) |
| SqueezeLLM 3-bit | 96.0% | 26GB | 1.3x | ⚠️ (check accuracy) |

**When to use each**:

**No quantization (FP16)**:
- Have sufficient GPU memory
- Need absolute best accuracy
- Model <13B parameters

**FP8**:
- Using H100/H800 GPUs
- Need best speed with minimal accuracy loss
- Production deployment

**AWQ 4-bit**:
- Need to fit 70B model in 40GB GPU
- Production deployment
- <1% accuracy loss acceptable

**GPTQ 4-bit**:
- Wide model support needed
- Not on H100 (use FP8 instead)
- 1-2% accuracy loss acceptable

**Testing strategy**:

1. **Baseline**: Measure FP16 accuracy on your evaluation set
2. **Quantize**: Create the quantized version
3. **Evaluate**: Compare quantized vs baseline on the same tasks
4. **Decide**: Accept if degradation < threshold (typically 1-2%)

**Example evaluation**:
```python
# `evaluate_model` and `eval_suite` are placeholders for your own
# evaluation harness (e.g. an MMLU run against each deployment)

# Run on the FP16 baseline
baseline_score = evaluate_model(model_fp16, eval_suite)

# Run on the quantized model
quant_score = evaluate_model(model_awq, eval_suite)

# Compare
degradation = (baseline_score - quant_score) / baseline_score * 100
print(f"Accuracy degradation: {degradation:.2f}%")

# Decision
if degradation < 1.0:
    print("✅ Quantization acceptable for production")
else:
    print("⚠️ Review accuracy loss")
```
@@ -0,0 +1,255 @@ EvoScientist/skills/vllm/references/server-deployment.md

# Server Deployment Patterns

## Contents
- Docker deployment
- Kubernetes deployment
- Load balancing with Nginx
- Multi-node distributed serving
- Production configuration examples
- Health checks and monitoring

## Docker deployment

**Basic Dockerfile**:
```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

RUN apt-get update && apt-get install -y python3-pip
RUN pip install vllm

EXPOSE 8000

CMD ["vllm", "serve", "meta-llama/Llama-3-8B-Instruct", \
     "--host", "0.0.0.0", "--port", "8000", \
     "--gpu-memory-utilization", "0.9"]
```

**Build and run**:
```bash
docker build -t vllm-server .
docker run --gpus all -p 8000:8000 vllm-server
```

**Docker Compose** (with metrics):
```yaml
version: '3.8'
services:
  vllm:
    image: vllm/vllm-openai:latest
    command: >
      --model meta-llama/Llama-3-8B-Instruct
      --gpu-memory-utilization 0.9
      --enable-metrics
      --metrics-port 9090
    ports:
      - "8000:8000"
      - "9090:9090"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

## Kubernetes deployment

**Deployment manifest**:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model=meta-llama/Llama-3-8B-Instruct"
            - "--gpu-memory-utilization=0.9"
            - "--enable-prefix-caching"
          resources:
            limits:
              nvidia.com/gpu: 1
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 9090
              name: metrics
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
    - port: 8000
      targetPort: 8000
      name: http
    - port: 9090
      targetPort: 9090
      name: metrics
  type: LoadBalancer
```

## Load balancing with Nginx

**Nginx configuration**:
```nginx
upstream vllm_backend {
    least_conn;  # Route to least-loaded server
    server localhost:8001;
    server localhost:8002;
    server localhost:8003;
}

server {
    listen 80;

    location / {
        proxy_pass http://vllm_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Timeouts for long-running inference
        proxy_read_timeout 300s;
        proxy_connect_timeout 75s;
    }

    # Metrics endpoint
    location /metrics {
        proxy_pass http://localhost:9090/metrics;
    }
}
```

**Start multiple vLLM instances**:
```bash
# Terminal 1
vllm serve MODEL --port 8001 --tensor-parallel-size 1

# Terminal 2
vllm serve MODEL --port 8002 --tensor-parallel-size 1

# Terminal 3
vllm serve MODEL --port 8003 --tensor-parallel-size 1

# Start Nginx
nginx -c /path/to/nginx.conf
```
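With the three backends and Nginx running, traffic to port 80 should be spread across the pool. A quick smoke test through the load balancer (the OpenAI-compatible route is proxied at `/` in the config above; `MODEL` is the same placeholder used when launching the backends):

```python
from openai import OpenAI

# Point the client at Nginx (port 80), not at an individual vLLM instance
client = OpenAI(base_url="http://localhost/v1", api_key="EMPTY")

for i in range(6):
    response = client.chat.completions.create(
        model="MODEL",  # must match the model name the backends were started with
        messages=[{"role": "user", "content": f"Ping {i}"}],
        max_tokens=8,
    )
    print(i, response.choices[0].message.content)
```
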

## Multi-node distributed serving

For models too large for a single node:

**Node 1** (master):
```bash
export MASTER_ADDR=192.168.1.10
export MASTER_PORT=29500
export RANK=0
export WORLD_SIZE=2

vllm serve meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2
```

**Node 2** (worker):
```bash
export MASTER_ADDR=192.168.1.10
export MASTER_PORT=29500
export RANK=1
export WORLD_SIZE=2

vllm serve meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2
```

## Production configuration examples

**High throughput** (batch-heavy workload):
```bash
vllm serve MODEL \
    --max-num-seqs 512 \
    --gpu-memory-utilization 0.95 \
    --enable-prefix-caching \
    --trust-remote-code
```

**Low latency** (interactive workload):
```bash
vllm serve MODEL \
    --max-num-seqs 64 \
    --gpu-memory-utilization 0.85 \
    --enable-chunked-prefill
```

**Memory-constrained** (40GB GPU for 70B model):
```bash
vllm serve TheBloke/Llama-2-70B-AWQ \
    --quantization awq \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 4096
```

## Health checks and monitoring

**Health check endpoint**:
```bash
curl http://localhost:8000/health
# Returns HTTP 200 when the server is healthy
```

**Readiness check** (wait for the model to load):
```bash
#!/bin/bash
until curl -f http://localhost:8000/health; do
    echo "Waiting for vLLM to be ready..."
    sleep 5
done
echo "vLLM is ready!"
```

**Prometheus scraping**:
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: '/metrics'
    scrape_interval: 15s
```

**Grafana dashboard** (key metrics):
- Requests per second: `rate(vllm_request_success_total[5m])`
- TTFT p50: `histogram_quantile(0.5, vllm_time_to_first_token_seconds_bucket)`
- TTFT p99: `histogram_quantile(0.99, vllm_time_to_first_token_seconds_bucket)`
- GPU cache usage: `vllm_gpu_cache_usage_perc`
- Active requests: `vllm_num_requests_running`
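
The same PromQL can be used outside Grafana (for alerts or smoke tests) via Prometheus's HTTP API. A small sketch, assuming your Prometheus server is reachable at `http://prometheus:9090` (adjust the URL; this is the Prometheus server itself, not the vLLM metrics port):

```python
import requests

PROM_URL = "http://prometheus:9090"  # Prometheus server, not the vLLM metrics endpoint

QUERIES = {
    "requests_per_second": "rate(vllm_request_success_total[5m])",
    "ttft_p99_seconds": "histogram_quantile(0.99, vllm_time_to_first_token_seconds_bucket)",
    "gpu_cache_usage": "vllm_gpu_cache_usage_perc",
    "active_requests": "vllm_num_requests_running",
}

for name, query in QUERIES.items():
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        labels = result.get("metric", {})
        value = result["value"][1]   # [timestamp, value-as-string]
        print(f"{name}: {value} {labels}")
```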