EvoScientist 0.1.0rc1-py3-none-any.whl → 0.1.0rc2-py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- EvoScientist/EvoScientist.py +1 -1
- EvoScientist/cli.py +450 -178
- EvoScientist/middleware.py +5 -1
- EvoScientist/skills/accelerate/SKILL.md +332 -0
- EvoScientist/skills/accelerate/references/custom-plugins.md +453 -0
- EvoScientist/skills/accelerate/references/megatron-integration.md +489 -0
- EvoScientist/skills/accelerate/references/performance.md +525 -0
- EvoScientist/skills/bitsandbytes/SKILL.md +411 -0
- EvoScientist/skills/bitsandbytes/references/memory-optimization.md +521 -0
- EvoScientist/skills/bitsandbytes/references/qlora-training.md +521 -0
- EvoScientist/skills/bitsandbytes/references/quantization-formats.md +447 -0
- EvoScientist/skills/clip/SKILL.md +253 -0
- EvoScientist/skills/clip/references/applications.md +207 -0
- EvoScientist/skills/find-skills/SKILL.md +133 -0
- EvoScientist/skills/find-skills/scripts/install_skill.py +211 -0
- EvoScientist/skills/flash-attention/SKILL.md +367 -0
- EvoScientist/skills/flash-attention/references/benchmarks.md +215 -0
- EvoScientist/skills/flash-attention/references/transformers-integration.md +293 -0
- EvoScientist/skills/langgraph-docs/SKILL.md +36 -0
- EvoScientist/skills/llama-cpp/SKILL.md +258 -0
- EvoScientist/skills/llama-cpp/references/optimization.md +89 -0
- EvoScientist/skills/llama-cpp/references/quantization.md +213 -0
- EvoScientist/skills/llama-cpp/references/server.md +125 -0
- EvoScientist/skills/lm-evaluation-harness/SKILL.md +490 -0
- EvoScientist/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- EvoScientist/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- EvoScientist/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- EvoScientist/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- EvoScientist/skills/ml-paper-writing/SKILL.md +937 -0
- EvoScientist/skills/ml-paper-writing/references/checklists.md +361 -0
- EvoScientist/skills/ml-paper-writing/references/citation-workflow.md +562 -0
- EvoScientist/skills/ml-paper-writing/references/reviewer-guidelines.md +367 -0
- EvoScientist/skills/ml-paper-writing/references/sources.md +159 -0
- EvoScientist/skills/ml-paper-writing/references/writing-guide.md +476 -0
- EvoScientist/skills/ml-paper-writing/templates/README.md +251 -0
- EvoScientist/skills/ml-paper-writing/templates/aaai2026/README.md +534 -0
- EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-supp.tex +144 -0
- EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-template.tex +952 -0
- EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bib +111 -0
- EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bst +1493 -0
- EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.sty +315 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/README.md +50 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/acl.sty +312 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/acl_latex.tex +377 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/acl_lualatex.tex +101 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/acl_natbib.bst +1940 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/anthology.bib.txt +26 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/custom.bib +70 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/formatting.md +326 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/README.md +3 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bib +11 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bst +1440 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.pdf +0 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.sty +218 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.tex +305 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/fancyhdr.sty +485 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/math_commands.tex +508 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/natbib.sty +1246 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/fancyhdr.sty +485 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bib +24 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bst +1440 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.pdf +0 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.sty +246 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.tex +414 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/math_commands.tex +508 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/natbib.sty +1246 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithm.sty +79 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithmic.sty +201 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.bib +75 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.pdf +0 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.tex +662 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/fancyhdr.sty +864 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.bst +1443 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.sty +767 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/icml_numpapers.pdf +0 -0
- EvoScientist/skills/ml-paper-writing/templates/neurips2025/Makefile +36 -0
- EvoScientist/skills/ml-paper-writing/templates/neurips2025/extra_pkgs.tex +53 -0
- EvoScientist/skills/ml-paper-writing/templates/neurips2025/main.tex +38 -0
- EvoScientist/skills/ml-paper-writing/templates/neurips2025/neurips.sty +382 -0
- EvoScientist/skills/peft/SKILL.md +431 -0
- EvoScientist/skills/peft/references/advanced-usage.md +514 -0
- EvoScientist/skills/peft/references/troubleshooting.md +480 -0
- EvoScientist/skills/ray-data/SKILL.md +326 -0
- EvoScientist/skills/ray-data/references/integration.md +82 -0
- EvoScientist/skills/ray-data/references/transformations.md +83 -0
- EvoScientist/skills/skill-creator/LICENSE.txt +202 -0
- EvoScientist/skills/skill-creator/SKILL.md +356 -0
- EvoScientist/skills/skill-creator/references/output-patterns.md +82 -0
- EvoScientist/skills/skill-creator/references/workflows.md +28 -0
- EvoScientist/skills/skill-creator/scripts/init_skill.py +303 -0
- EvoScientist/skills/skill-creator/scripts/package_skill.py +110 -0
- EvoScientist/skills/skill-creator/scripts/quick_validate.py +95 -0
- EvoScientist/skills/tensorboard/SKILL.md +629 -0
- EvoScientist/skills/tensorboard/references/integrations.md +638 -0
- EvoScientist/skills/tensorboard/references/profiling.md +545 -0
- EvoScientist/skills/tensorboard/references/visualization.md +620 -0
- EvoScientist/skills/vllm/SKILL.md +364 -0
- EvoScientist/skills/vllm/references/optimization.md +226 -0
- EvoScientist/skills/vllm/references/quantization.md +284 -0
- EvoScientist/skills/vllm/references/server-deployment.md +255 -0
- EvoScientist/skills/vllm/references/troubleshooting.md +447 -0
- {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/METADATA +26 -3
- evoscientist-0.1.0rc2.dist-info/RECORD +119 -0
- evoscientist-0.1.0rc1.dist-info/RECORD +0 -21
- {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/WHEEL +0 -0
- {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/entry_points.txt +0 -0
- {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/licenses/LICENSE +0 -0
- {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/top_level.txt +0 -0
EvoScientist/skills/flash-attention/references/transformers-integration.md
@@ -0,0 +1,293 @@
# HuggingFace Transformers Integration

## Contents
- Enabling Flash Attention in Transformers
- Supported model architectures
- Configuration examples
- Performance comparisons
- Troubleshooting model-specific issues

## Enabling Flash Attention in Transformers

HuggingFace Transformers (v4.36+) supports Flash Attention 2 natively.

**Simple enable for any supported model**:
```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto"
)
```

**Install requirements**:
```bash
pip install "transformers>=4.36"
pip install flash-attn --no-build-isolation
```
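Before loading a model this way, it can help to confirm that the environment actually supports Flash Attention 2. A minimal check, not part of the original reference, assuming an NVIDIA GPU and the usual Ampere-or-newer (compute capability 8.0+) requirement:

```python
import importlib.util

import torch

# flash-attn must be importable for attn_implementation="flash_attention_2"
has_flash_attn = importlib.util.find_spec("flash_attn") is not None

# Flash Attention 2 kernels target compute capability 8.0+ (Ampere or newer)
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
else:
    major, minor = 0, 0

print(f"flash-attn installed: {has_flash_attn}")
print(f"GPU compute capability: {major}.{minor} (needs >= 8.0)")
```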
## Supported model architectures

As of Transformers 4.40:

**Fully supported**:
- Llama / Llama 2 / Llama 3
- Mistral / Mixtral
- Falcon
- GPT-NeoX
- Phi / Phi-2 / Phi-3
- Qwen / Qwen2
- Gemma
- Starcoder2
- GPT-J
- OPT
- BLOOM

**Partially supported** (encoder-decoder):
- BART
- T5 / Flan-T5
- Whisper

**Check support**:
```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("model-name")
print(config._attn_implementation_internal)
# 'flash_attention_2' if supported
```
## Configuration examples

### Llama 2 with Flash Attention

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "meta-llama/Llama-2-7b-hf"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate
inputs = tokenizer("Once upon a time", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0]))
```

### Mistral with Flash Attention for long context

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "mistralai/Mistral-7B-v0.1"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,  # Better for long context
    device_map="auto",
    max_position_embeddings=32768  # Extended context
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Process long document (32K tokens)
long_text = "..." * 10000
inputs = tokenizer(long_text, return_tensors="pt", truncation=False).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512)
```

### Fine-tuning with Flash Attention

```python
import torch
from transformers import Trainer, TrainingArguments
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16
)

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    fp16=True,  # Must match model dtype
    optim="adamw_torch_fused"  # Fast optimizer
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset  # A pre-tokenized dataset prepared beforehand
)

trainer.train()
```

### Multi-GPU training

```python
from transformers import AutoModelForCausalLM
import torch

# Model parallelism with Flash Attention
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto",  # Automatic multi-GPU placement
    max_memory={0: "20GB", 1: "20GB"}  # Limit per GPU
)
```
## Performance comparisons

### Memory usage (Llama 2 7B, batch=1)

| Sequence Length | Standard Attention | Flash Attention 2 | Reduction |
|-----------------|--------------------|-------------------|-----------|
| 512             | 1.2 GB             | 0.9 GB            | 25%       |
| 2048            | 3.8 GB             | 1.4 GB            | 63%       |
| 8192            | 14.2 GB            | 3.2 GB            | 77%       |
| 32768           | OOM (>24 GB)       | 10.8 GB           | Fits      |

### Speed (tokens/sec, A100 80GB)

| Model                  | Standard | Flash Attn 2 | Speedup |
|------------------------|----------|--------------|---------|
| Llama 2 7B (seq=2048)  | 42       | 118          | 2.8x    |
| Llama 2 13B (seq=4096) | 18       | 52           | 2.9x    |
| Llama 2 70B (seq=2048) | 4        | 11           | 2.75x   |

### Training throughput (samples/sec)

| Model       | Batch Size | Standard | Flash Attn 2 | Speedup |
|-------------|------------|----------|--------------|---------|
| Llama 2 7B  | 4          | 1.2      | 3.1          | 2.6x    |
| Llama 2 7B  | 8          | 2.1      | 5.8          | 2.8x    |
| Llama 2 13B | 2          | 0.6      | 1.7          | 2.8x    |
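These figures vary with hardware, driver, and library versions; a rough way to reproduce the speed and peak-memory columns on your own setup is sketched below (the model ID, prompt, and token count are illustrative):

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # Illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
torch.cuda.reset_peak_memory_stats()

start = time.perf_counter()
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
print(f"peak memory: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```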
## Troubleshooting model-specific issues

### Issue: Model doesn't support Flash Attention

Check the support list above. If the model is not supported, use PyTorch SDPA as a fallback:

```python
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    attn_implementation="sdpa",  # PyTorch native (still faster than eager)
    torch_dtype=torch.float16
)
```

### Issue: CUDA out of memory during loading

Reduce the memory footprint:

```python
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "18GB"},  # Reserve memory for KV cache
    low_cpu_mem_usage=True
)
```

### Issue: Slower inference than expected

Ensure dtypes match:

```python
# Model and inputs must both be float16/bfloat16
model = model.to(torch.float16)
inputs = tokenizer(..., return_tensors="pt").to("cuda")
inputs = {k: v.to(torch.float16) if v.dtype == torch.float32 else v
          for k, v in inputs.items()}
```

### Issue: Different outputs vs standard attention

Flash Attention is mathematically equivalent but uses a different computation order, so small numerical differences (<1e-3) are normal:

```python
# Compare outputs
model_standard = AutoModelForCausalLM.from_pretrained(
    "model-name",
    torch_dtype=torch.float16
).to("cuda")
model_flash = AutoModelForCausalLM.from_pretrained(
    "model-name",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16
).to("cuda")

inputs = tokenizer("Test", return_tensors="pt").to("cuda")

with torch.no_grad():
    out_standard = model_standard(**inputs).logits
    out_flash = model_flash(**inputs).logits

diff = (out_standard - out_flash).abs().max()
print(f"Max diff: {diff:.6f}")  # Should be ~1e-3 to 1e-4
```

### Issue: ImportError during model loading

Install flash-attn:
```bash
pip install flash-attn --no-build-isolation
```

Or disable Flash Attention:
```python
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    attn_implementation="eager",  # Standard PyTorch attention
    torch_dtype=torch.float16
)
```

## Best practices

1. **Always use float16/bfloat16** with Flash Attention (not float32)
2. **Set device_map="auto"** for automatic memory management
3. **Use bfloat16 for long context** (better numerical stability)
4. **Enable gradient checkpointing** for training large models
5. **Monitor memory** with `torch.cuda.max_memory_allocated()`

**Example with all best practices**:
```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,  # Better for training
    device_map="auto",
    low_cpu_mem_usage=True
)

# Enable gradient checkpointing for memory
model.gradient_checkpointing_enable()

# Training with optimizations
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    bf16=True,  # Match model dtype
    optim="adamw_torch_fused",
    gradient_checkpointing=True
)
```
EvoScientist/skills/langgraph-docs/SKILL.md
@@ -0,0 +1,36 @@
---
name: langgraph-docs
description: Use this skill for requests related to LangGraph in order to fetch relevant documentation to provide accurate, up-to-date guidance.
---

# langgraph-docs

## Overview

This skill explains how to access LangGraph Python documentation to help answer questions and guide implementation.

## Instructions

### 1. Fetch the Documentation Index

Use the fetch_url tool to read the following URL:
https://docs.langchain.com/llms.txt

This provides a structured list of all available documentation with descriptions.

### 2. Select Relevant Documentation

Based on the question, identify the 2-4 most relevant documentation URLs from the index. Prioritize:

- Specific how-to guides for implementation questions
- Core concept pages for understanding questions
- Tutorials for end-to-end examples
- Reference docs for API details

### 3. Fetch Selected Documentation

Use the fetch_url tool to read the selected documentation URLs.

### 4. Provide Accurate Guidance

After reading the documentation, complete the user's request.
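For orientation, the same four steps can be sketched as plain Python. This is an illustrative sketch only, not part of the skill: it substitutes a plain HTTP GET for the fetch_url tool, and the keyword filter and URL extraction are assumptions about the llms.txt format:

```python
import urllib.request

INDEX_URL = "https://docs.langchain.com/llms.txt"

def fetch(url: str) -> str:
    # Stand-in for the fetch_url tool: plain HTTP GET
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

# 1. Fetch the documentation index
index = fetch(INDEX_URL)

# 2. Select relevant entries (naive keyword match on each index line)
keyword = "subgraph"
candidates = [line for line in index.splitlines() if keyword in line.lower()]

# 3. Fetch the selected documentation pages (index lines contain URLs)
pages = []
for line in candidates[:4]:
    for token in line.split():
        if token.startswith("http"):
            pages.append(fetch(token.rstrip(").,")))

# 4. Use the fetched pages to answer the user's request
print(f"Fetched {len(pages)} documentation pages")
```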
EvoScientist/skills/llama-cpp/SKILL.md
@@ -0,0 +1,258 @@
---
name: llama-cpp
description: Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Inference Serving, Llama.cpp, CPU Inference, Apple Silicon, Edge Deployment, GGUF, Quantization, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded]
dependencies: [llama-cpp-python]
---

# llama.cpp

Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.

## When to use llama.cpp

**Use llama.cpp when:**
- Running on CPU-only machines
- Deploying on Apple Silicon (M1/M2/M3/M4)
- Using AMD or Intel GPUs (no CUDA)
- Edge deployment (Raspberry Pi, embedded systems)
- Need simple deployment without Docker/Python

**Use TensorRT-LLM instead when:**
- Have NVIDIA GPUs (A100/H100)
- Need maximum throughput (100K+ tok/s)
- Running in datacenter with CUDA

**Use vLLM instead when:**
- Have NVIDIA GPUs
- Need Python-first API
- Want PagedAttention

## Quick start

### Installation

```bash
# macOS/Linux
brew install llama.cpp

# Or build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# With Metal (Apple Silicon)
make LLAMA_METAL=1

# With CUDA (NVIDIA)
make LLAMA_CUDA=1

# With ROCm (AMD)
make LLAMA_HIP=1
```

### Download model

```bash
# Download from HuggingFace (GGUF format)
huggingface-cli download \
  TheBloke/Llama-2-7B-Chat-GGUF \
  llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir models/

# Or convert from HuggingFace
python convert_hf_to_gguf.py models/llama-2-7b-chat/
```

### Run inference

```bash
# Simple chat
./llama-cli \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -p "Explain quantum computing" \
  -n 256  # Max tokens

# Interactive chat
./llama-cli \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  --interactive
```

### Server mode

```bash
# Start OpenAI-compatible server
./llama-server \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 32  # Offload 32 layers to GPU

# Client request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```
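Because the skill lists llama-cpp-python as a dependency, the same GGUF model can also be driven directly from Python instead of the CLI or server. A minimal sketch (the model path and sampling parameters mirror the examples above and are illustrative):

```python
from llama_cpp import Llama

# Load a GGUF model; n_gpu_layers behaves like -ngl (0 = CPU only)
llm = Llama(
    model_path="models/llama-2-7b-chat.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=32,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(out["choices"][0]["message"]["content"])
```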
## Quantization formats

### GGUF format overview

| Format     | Bits | Size (7B) | Speed   | Quality   | Use Case                |
|------------|------|-----------|---------|-----------|-------------------------|
| **Q4_K_M** | 4.5  | 4.1 GB    | Fast    | Good      | **Recommended default** |
| Q4_K_S     | 4.3  | 3.9 GB    | Faster  | Lower     | Speed critical          |
| Q5_K_M     | 5.5  | 4.8 GB    | Medium  | Better    | Quality critical        |
| Q6_K       | 6.5  | 5.5 GB    | Slower  | Best      | Maximum quality         |
| Q8_0       | 8.0  | 7.0 GB    | Slow    | Excellent | Minimal degradation     |
| Q2_K       | 2.5  | 2.7 GB    | Fastest | Poor      | Testing only            |

### Choosing quantization

```bash
# General use (balanced)
Q4_K_M  # 4-bit, medium quality

# Maximum speed (more degradation)
Q2_K or Q3_K_M

# Maximum quality (slower)
Q6_K or Q8_0

# Very large models (70B, 405B)
Q3_K_M or Q4_K_S  # Lower bits to fit in memory
```

## Hardware acceleration

### Apple Silicon (Metal)

```bash
# Build with Metal
make LLAMA_METAL=1

# Run with GPU acceleration (automatic)
./llama-cli -m model.gguf -ngl 999  # Offload all layers

# Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)
```

### NVIDIA GPUs (CUDA)

```bash
# Build with CUDA
make LLAMA_CUDA=1

# Offload layers to GPU
./llama-cli -m model.gguf -ngl 35  # Offload 35/40 layers

# Hybrid CPU+GPU for large models
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20  # GPU: 20 layers, CPU: rest
```

### AMD GPUs (ROCm)

```bash
# Build with ROCm
make LLAMA_HIP=1

# Run with AMD GPU
./llama-cli -m model.gguf -ngl 999
```

## Common patterns

### Batch processing

```bash
# Process multiple prompts from file
cat prompts.txt | ./llama-cli \
  -m model.gguf \
  --batch-size 512 \
  -n 100
```

### Constrained generation

```bash
# JSON output with grammar
./llama-cli \
  -m model.gguf \
  -p "Generate a person: " \
  --grammar-file grammars/json.gbnf

# Outputs valid JSON only
```
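The same grammar-constrained decoding is also exposed through llama-cpp-python; a rough sketch, assuming the json.gbnf grammar file from the example above is available locally:

```python
from llama_cpp import Llama, LlamaGrammar

llm = Llama(model_path="models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

# GBNF grammar shipped with llama.cpp; constrains sampling to valid JSON
grammar = LlamaGrammar.from_file("grammars/json.gbnf")

out = llm(
    "Generate a person: ",
    grammar=grammar,
    max_tokens=128,
)
print(out["choices"][0]["text"])  # Valid JSON only
```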
### Context size

```bash
# Increase context (default 512)
./llama-cli \
  -m model.gguf \
  -c 4096  # 4K context window

# Very long context (if the model supports it)
./llama-cli -m model.gguf -c 32768  # 32K context
```

## Performance benchmarks

### CPU performance (Llama 2-7B Q4_K_M)

| CPU               | Threads | Speed    | Cost       |
|-------------------|---------|----------|------------|
| Apple M3 Max      | 16      | 50 tok/s | $0 (local) |
| AMD Ryzen 9 7950X | 32      | 35 tok/s | $0.50/hour |
| Intel i9-13900K   | 32      | 30 tok/s | $0.40/hour |
| AWS c7i.16xlarge  | 64      | 40 tok/s | $2.88/hour |

### GPU acceleration (Llama 2-7B Q4_K_M)

| GPU                  | Speed     | vs CPU | Cost       |
|----------------------|-----------|--------|------------|
| NVIDIA RTX 4090      | 120 tok/s | 3-4×   | $0 (local) |
| NVIDIA A10           | 80 tok/s  | 2-3×   | $1.00/hour |
| AMD MI250            | 70 tok/s  | 2×     | $2.00/hour |
| Apple M3 Max (Metal) | 50 tok/s  | ~Same  | $0 (local) |

## Supported models

**LLaMA family**:
- Llama 2 (7B, 13B, 70B)
- Llama 3 (8B, 70B, 405B)
- Code Llama

**Mistral family**:
- Mistral 7B
- Mixtral 8x7B, 8x22B

**Other**:
- Falcon, BLOOM, GPT-J
- Phi-3, Gemma, Qwen
- LLaVA (vision), Whisper (audio)

**Find models**: https://huggingface.co/models?library=gguf

## References

- **[Quantization Guide](references/quantization.md)** - GGUF formats, conversion, quality comparison
- **[Server Deployment](references/server.md)** - API endpoints, Docker, monitoring
- **[Optimization](references/optimization.md)** - Performance tuning, hybrid CPU+GPU

## Resources

- **GitHub**: https://github.com/ggerganov/llama.cpp
- **Models**: https://huggingface.co/models?library=gguf
- **Discord**: https://discord.gg/llama-cpp
EvoScientist/skills/llama-cpp/references/optimization.md
@@ -0,0 +1,89 @@
# Performance Optimization Guide

Maximize llama.cpp inference speed and efficiency.

## CPU Optimization

### Thread tuning
```bash
# Set threads (default: physical cores)
./llama-cli -m model.gguf -t 8

# For AMD Ryzen 9 7950X (16 cores, 32 threads)
-t 16  # Best: physical cores

# Avoid hyperthreading (slower for matrix ops)
```

### BLAS acceleration
```bash
# OpenBLAS (faster matrix ops)
make LLAMA_OPENBLAS=1

# BLAS gives 2-3× speedup
```

## GPU Offloading

### Layer offloading
```bash
# Offload 35 layers to GPU (hybrid mode)
./llama-cli -m model.gguf -ngl 35

# Offload all layers
./llama-cli -m model.gguf -ngl 999

# Find optimal value:
# Start with -ngl 999
# If OOM, reduce by 5 until it fits
```
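The same search can be scripted through the llama-cpp-python bindings rather than done by hand; a rough sketch (model path and candidate layer counts are illustrative; drop any value that does not fit in VRAM):

```python
import time

from llama_cpp import Llama

MODEL = "model.gguf"  # Illustrative path

def tokens_per_second(n_gpu_layers: int, n_predict: int = 64) -> float:
    # Load with a given offload setting (equivalent to -ngl) and time generation
    llm = Llama(model_path=MODEL, n_gpu_layers=n_gpu_layers, n_ctx=2048, verbose=False)
    start = time.perf_counter()
    llm("Explain quantum computing", max_tokens=n_predict)
    return n_predict / (time.perf_counter() - start)

# Sweep candidate offload values and keep the fastest one that fits
for ngl in (0, 20, 35):
    print(f"-ngl {ngl}: {tokens_per_second(ngl):.1f} tok/s")
```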
### Memory usage
```bash
# Check VRAM usage
nvidia-smi dmon

# Reduce context if needed
./llama-cli -m model.gguf -c 2048  # 2K context instead of 4K
```

## Batch Processing

```bash
# Increase batch size for throughput
./llama-cli -m model.gguf -b 512  # Default: 512

# Physical batch size (GPU)
--ubatch-size 128  # Process 128 tokens at once
```

## Context Management

```bash
# Default context (512 tokens)
-c 512

# Longer context (slower, more memory)
-c 4096

# Very long context (if the model supports it)
-c 32768
```

## Benchmarks

### CPU Performance (Llama 2-7B Q4_K_M)

| Setup           | Speed    | Notes              |
|-----------------|----------|--------------------|
| Apple M3 Max    | 50 tok/s | Metal acceleration |
| AMD 7950X (16c) | 35 tok/s | OpenBLAS           |
| Intel i9-13900K | 30 tok/s | AVX2               |

### GPU Offloading (RTX 4090)

| Layers on GPU | Speed     | VRAM  |
|---------------|-----------|-------|
| 0 (CPU only)  | 30 tok/s  | 0 GB  |
| 20 (hybrid)   | 80 tok/s  | 8 GB  |
| 35 (all)      | 120 tok/s | 12 GB |