EvoScientist 0.1.0rc1__py3-none-any.whl → 0.1.0rc2__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (108)
  1. EvoScientist/EvoScientist.py +1 -1
  2. EvoScientist/cli.py +450 -178
  3. EvoScientist/middleware.py +5 -1
  4. EvoScientist/skills/accelerate/SKILL.md +332 -0
  5. EvoScientist/skills/accelerate/references/custom-plugins.md +453 -0
  6. EvoScientist/skills/accelerate/references/megatron-integration.md +489 -0
  7. EvoScientist/skills/accelerate/references/performance.md +525 -0
  8. EvoScientist/skills/bitsandbytes/SKILL.md +411 -0
  9. EvoScientist/skills/bitsandbytes/references/memory-optimization.md +521 -0
  10. EvoScientist/skills/bitsandbytes/references/qlora-training.md +521 -0
  11. EvoScientist/skills/bitsandbytes/references/quantization-formats.md +447 -0
  12. EvoScientist/skills/clip/SKILL.md +253 -0
  13. EvoScientist/skills/clip/references/applications.md +207 -0
  14. EvoScientist/skills/find-skills/SKILL.md +133 -0
  15. EvoScientist/skills/find-skills/scripts/install_skill.py +211 -0
  16. EvoScientist/skills/flash-attention/SKILL.md +367 -0
  17. EvoScientist/skills/flash-attention/references/benchmarks.md +215 -0
  18. EvoScientist/skills/flash-attention/references/transformers-integration.md +293 -0
  19. EvoScientist/skills/langgraph-docs/SKILL.md +36 -0
  20. EvoScientist/skills/llama-cpp/SKILL.md +258 -0
  21. EvoScientist/skills/llama-cpp/references/optimization.md +89 -0
  22. EvoScientist/skills/llama-cpp/references/quantization.md +213 -0
  23. EvoScientist/skills/llama-cpp/references/server.md +125 -0
  24. EvoScientist/skills/lm-evaluation-harness/SKILL.md +490 -0
  25. EvoScientist/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  26. EvoScientist/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  27. EvoScientist/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  28. EvoScientist/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  29. EvoScientist/skills/ml-paper-writing/SKILL.md +937 -0
  30. EvoScientist/skills/ml-paper-writing/references/checklists.md +361 -0
  31. EvoScientist/skills/ml-paper-writing/references/citation-workflow.md +562 -0
  32. EvoScientist/skills/ml-paper-writing/references/reviewer-guidelines.md +367 -0
  33. EvoScientist/skills/ml-paper-writing/references/sources.md +159 -0
  34. EvoScientist/skills/ml-paper-writing/references/writing-guide.md +476 -0
  35. EvoScientist/skills/ml-paper-writing/templates/README.md +251 -0
  36. EvoScientist/skills/ml-paper-writing/templates/aaai2026/README.md +534 -0
  37. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-supp.tex +144 -0
  38. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-template.tex +952 -0
  39. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bib +111 -0
  40. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bst +1493 -0
  41. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.sty +315 -0
  42. EvoScientist/skills/ml-paper-writing/templates/acl/README.md +50 -0
  43. EvoScientist/skills/ml-paper-writing/templates/acl/acl.sty +312 -0
  44. EvoScientist/skills/ml-paper-writing/templates/acl/acl_latex.tex +377 -0
  45. EvoScientist/skills/ml-paper-writing/templates/acl/acl_lualatex.tex +101 -0
  46. EvoScientist/skills/ml-paper-writing/templates/acl/acl_natbib.bst +1940 -0
  47. EvoScientist/skills/ml-paper-writing/templates/acl/anthology.bib.txt +26 -0
  48. EvoScientist/skills/ml-paper-writing/templates/acl/custom.bib +70 -0
  49. EvoScientist/skills/ml-paper-writing/templates/acl/formatting.md +326 -0
  50. EvoScientist/skills/ml-paper-writing/templates/colm2025/README.md +3 -0
  51. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bib +11 -0
  52. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bst +1440 -0
  53. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.pdf +0 -0
  54. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.sty +218 -0
  55. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.tex +305 -0
  56. EvoScientist/skills/ml-paper-writing/templates/colm2025/fancyhdr.sty +485 -0
  57. EvoScientist/skills/ml-paper-writing/templates/colm2025/math_commands.tex +508 -0
  58. EvoScientist/skills/ml-paper-writing/templates/colm2025/natbib.sty +1246 -0
  59. EvoScientist/skills/ml-paper-writing/templates/iclr2026/fancyhdr.sty +485 -0
  60. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bib +24 -0
  61. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bst +1440 -0
  62. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.pdf +0 -0
  63. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.sty +246 -0
  64. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.tex +414 -0
  65. EvoScientist/skills/ml-paper-writing/templates/iclr2026/math_commands.tex +508 -0
  66. EvoScientist/skills/ml-paper-writing/templates/iclr2026/natbib.sty +1246 -0
  67. EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithm.sty +79 -0
  68. EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithmic.sty +201 -0
  69. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.bib +75 -0
  70. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.pdf +0 -0
  71. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.tex +662 -0
  72. EvoScientist/skills/ml-paper-writing/templates/icml2026/fancyhdr.sty +864 -0
  73. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.bst +1443 -0
  74. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.sty +767 -0
  75. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml_numpapers.pdf +0 -0
  76. EvoScientist/skills/ml-paper-writing/templates/neurips2025/Makefile +36 -0
  77. EvoScientist/skills/ml-paper-writing/templates/neurips2025/extra_pkgs.tex +53 -0
  78. EvoScientist/skills/ml-paper-writing/templates/neurips2025/main.tex +38 -0
  79. EvoScientist/skills/ml-paper-writing/templates/neurips2025/neurips.sty +382 -0
  80. EvoScientist/skills/peft/SKILL.md +431 -0
  81. EvoScientist/skills/peft/references/advanced-usage.md +514 -0
  82. EvoScientist/skills/peft/references/troubleshooting.md +480 -0
  83. EvoScientist/skills/ray-data/SKILL.md +326 -0
  84. EvoScientist/skills/ray-data/references/integration.md +82 -0
  85. EvoScientist/skills/ray-data/references/transformations.md +83 -0
  86. EvoScientist/skills/skill-creator/LICENSE.txt +202 -0
  87. EvoScientist/skills/skill-creator/SKILL.md +356 -0
  88. EvoScientist/skills/skill-creator/references/output-patterns.md +82 -0
  89. EvoScientist/skills/skill-creator/references/workflows.md +28 -0
  90. EvoScientist/skills/skill-creator/scripts/init_skill.py +303 -0
  91. EvoScientist/skills/skill-creator/scripts/package_skill.py +110 -0
  92. EvoScientist/skills/skill-creator/scripts/quick_validate.py +95 -0
  93. EvoScientist/skills/tensorboard/SKILL.md +629 -0
  94. EvoScientist/skills/tensorboard/references/integrations.md +638 -0
  95. EvoScientist/skills/tensorboard/references/profiling.md +545 -0
  96. EvoScientist/skills/tensorboard/references/visualization.md +620 -0
  97. EvoScientist/skills/vllm/SKILL.md +364 -0
  98. EvoScientist/skills/vllm/references/optimization.md +226 -0
  99. EvoScientist/skills/vllm/references/quantization.md +284 -0
  100. EvoScientist/skills/vllm/references/server-deployment.md +255 -0
  101. EvoScientist/skills/vllm/references/troubleshooting.md +447 -0
  102. {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/METADATA +26 -3
  103. evoscientist-0.1.0rc2.dist-info/RECORD +119 -0
  104. evoscientist-0.1.0rc1.dist-info/RECORD +0 -21
  105. {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/WHEEL +0 -0
  106. {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/entry_points.txt +0 -0
  107. {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/licenses/LICENSE +0 -0
  108. {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/top_level.txt +0 -0
@@ -0,0 +1,293 @@
+ # HuggingFace Transformers Integration
+
+ ## Contents
+ - Enabling Flash Attention in Transformers
+ - Supported model architectures
+ - Configuration examples
+ - Performance comparisons
+ - Troubleshooting model-specific issues
+
+ ## Enabling Flash Attention in Transformers
+
+ HuggingFace Transformers (v4.36+) supports Flash Attention 2 natively.
+
+ **Simple enable for any supported model**:
+ ```python
+ import torch
+ from transformers import AutoModel
+
+ model = AutoModel.from_pretrained(
+     "meta-llama/Llama-2-7b-hf",
+     attn_implementation="flash_attention_2",
+     torch_dtype=torch.float16,
+     device_map="auto"
+ )
+ ```
+
+ **Install requirements**:
+ ```bash
+ pip install "transformers>=4.36"
+ pip install flash-attn --no-build-isolation
+ ```
+
+ ## Supported model architectures
+
+ As of Transformers 4.40:
+
+ **Fully supported**:
+ - Llama / Llama 2 / Llama 3
+ - Mistral / Mixtral
+ - Falcon
+ - GPT-NeoX
+ - Phi / Phi-2 / Phi-3
+ - Qwen / Qwen2
+ - Gemma
+ - Starcoder2
+ - GPT-J
+ - OPT
+ - BLOOM
+
+ **Partially supported** (encoder-decoder):
+ - BART
+ - T5 / Flan-T5
+ - Whisper
+
+ **Check support**:
+ ```python
+ from transformers import AutoConfig
+
+ config = AutoConfig.from_pretrained("model-name")
+ print(config._attn_implementation_internal)
+ # 'flash_attention_2' if supported
+ ```
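+
+ That private attribute only reflects an implementation that was explicitly requested, so on a freshly loaded config it may simply print `None`. A more direct check, sketched below, is to attempt loading with `flash_attention_2` and fall back if Transformers rejects it (this downloads weights, so use a small model when experimenting):
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM
+
+ def supports_flash_attention_2(model_id: str) -> bool:
+     """Try loading with FA2; False if the architecture or the install rejects it."""
+     try:
+         AutoModelForCausalLM.from_pretrained(
+             model_id,
+             attn_implementation="flash_attention_2",
+             torch_dtype=torch.float16,
+         )
+         return True
+     except (ValueError, ImportError):
+         return False
+ ```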
+
+ ## Configuration examples
+
+ ### Llama 2 with Flash Attention
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ model_id = "meta-llama/Llama-2-7b-hf"
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     attn_implementation="flash_attention_2",
+     torch_dtype=torch.float16,
+     device_map="auto"
+ )
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # Generate
+ inputs = tokenizer("Once upon a time", return_tensors="pt").to("cuda")
+ outputs = model.generate(**inputs, max_length=100)
+ print(tokenizer.decode(outputs[0]))
+ ```
+
+ ### Mistral with Flash Attention for long context
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ model_id = "mistralai/Mistral-7B-v0.1"
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     attn_implementation="flash_attention_2",
+     torch_dtype=torch.bfloat16,  # Better for long context
+     device_map="auto",
+     max_position_embeddings=32768  # Extended context
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # Process long document (32K tokens)
+ long_text = "..." * 10000
+ inputs = tokenizer(long_text, return_tensors="pt", truncation=False).to("cuda")
+ outputs = model.generate(**inputs, max_new_tokens=512)
+ ```
+
+ ### Fine-tuning with Flash Attention
+
+ ```python
+ import torch
+ from transformers import Trainer, TrainingArguments
+ from transformers import AutoModelForCausalLM
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "meta-llama/Llama-2-7b-hf",
+     attn_implementation="flash_attention_2",
+     torch_dtype=torch.float16
+ )
+
+ training_args = TrainingArguments(
+     output_dir="./results",
+     per_device_train_batch_size=4,
+     gradient_accumulation_steps=4,
+     num_train_epochs=3,
+     fp16=True,  # Must match model dtype
+     optim="adamw_torch_fused"  # Fast optimizer
+ )
+
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     train_dataset=train_dataset  # your tokenized dataset
+ )
+
+ trainer.train()
+ ```
+
+ ### Multi-GPU training
+
+ ```python
+ from transformers import AutoModelForCausalLM
+ import torch
+
+ # Model parallelism with Flash Attention
+ model = AutoModelForCausalLM.from_pretrained(
+     "meta-llama/Llama-2-13b-hf",
+     attn_implementation="flash_attention_2",
+     torch_dtype=torch.float16,
+     device_map="auto",  # Automatic multi-GPU placement
+     max_memory={0: "20GB", 1: "20GB"}  # Limit per GPU
+ )
+ ```
+
+ ## Performance comparisons
+
+ ### Memory usage (Llama 2 7B, batch=1)
+
+ | Sequence Length | Standard Attention | Flash Attention 2 | Reduction |
+ |-----------------|-------------------|-------------------|-----------|
+ | 512 | 1.2 GB | 0.9 GB | 25% |
+ | 2048 | 3.8 GB | 1.4 GB | 63% |
+ | 8192 | 14.2 GB | 3.2 GB | 77% |
+ | 32768 | OOM (>24GB) | 10.8 GB | Fits! |
+
+ ### Speed (tokens/sec, A100 80GB)
+
+ | Model | Standard | Flash Attn 2 | Speedup |
+ |-------|----------|--------------|---------|
+ | Llama 2 7B (seq=2048) | 42 | 118 | 2.8x |
+ | Llama 2 13B (seq=4096) | 18 | 52 | 2.9x |
+ | Llama 2 70B (seq=2048) | 4 | 11 | 2.75x |
+
+ ### Training throughput (samples/sec)
+
+ | Model | Batch Size | Standard | Flash Attn 2 | Speedup |
+ |-------|------------|----------|--------------|---------|
+ | Llama 2 7B | 4 | 1.2 | 3.1 | 2.6x |
+ | Llama 2 7B | 8 | 2.1 | 5.8 | 2.8x |
+ | Llama 2 13B | 2 | 0.6 | 1.7 | 2.8x |
+
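+ Numbers like these depend heavily on hardware, driver, and library versions. A minimal sketch for measuring decode throughput on your own setup, reusing the `model`/`tokenizer` pattern from the examples above (the prompt and token counts are arbitrary):
+
+ ```python
+ import time
+ import torch
+
+ inputs = tokenizer("Benchmark prompt", return_tensors="pt").to("cuda")
+
+ # Warm-up run so CUDA kernels are compiled/cached before timing
+ model.generate(**inputs, max_new_tokens=32)
+
+ torch.cuda.synchronize()
+ start = time.perf_counter()
+ outputs = model.generate(**inputs, max_new_tokens=256)
+ torch.cuda.synchronize()
+ elapsed = time.perf_counter() - start
+
+ new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
+ print(f"{new_tokens / elapsed:.1f} tokens/sec")
+ ```
+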
+ ## Troubleshooting model-specific issues
+
+ ### Issue: Model doesn't support Flash Attention
+
+ Check the support list above. If the architecture is not supported, use PyTorch SDPA as a fallback:
+
+ ```python
+ model = AutoModelForCausalLM.from_pretrained(
+     "model-name",
+     attn_implementation="sdpa",  # PyTorch-native attention, still faster than eager
+     torch_dtype=torch.float16
+ )
+ ```
+
+ ### Issue: CUDA out of memory during loading
+
+ Reduce the memory footprint:
+
+ ```python
+ model = AutoModelForCausalLM.from_pretrained(
+     "model-name",
+     attn_implementation="flash_attention_2",
+     torch_dtype=torch.float16,
+     device_map="auto",
+     max_memory={0: "18GB"},  # Reserve memory for KV cache
+     low_cpu_mem_usage=True
+ )
+ ```
+
+ ### Issue: Slower inference than expected
+
+ Ensure dtypes match:
+
+ ```python
+ # Model and inputs must both be float16/bfloat16
+ model = model.to(torch.float16)
+ inputs = tokenizer(..., return_tensors="pt").to("cuda")
+ inputs = {k: v.to(torch.float16) if v.dtype == torch.float32 else v
+           for k, v in inputs.items()}
+ ```
+
+ ### Issue: Different outputs vs standard attention
+
+ Flash Attention is mathematically equivalent to standard attention but accumulates results in a different order, so small numerical differences (<1e-3) are normal:
+
+ ```python
+ # Compare outputs
+ model_standard = AutoModelForCausalLM.from_pretrained(
+     "model-name", torch_dtype=torch.float16
+ ).to("cuda")
+ model_flash = AutoModelForCausalLM.from_pretrained(
+     "model-name",
+     attn_implementation="flash_attention_2",
+     torch_dtype=torch.float16
+ ).to("cuda")
+
+ inputs = tokenizer("Test", return_tensors="pt").to("cuda")
+
+ with torch.no_grad():
+     out_standard = model_standard(**inputs).logits
+     out_flash = model_flash(**inputs).logits
+
+ diff = (out_standard - out_flash).abs().max()
+ print(f"Max diff: {diff:.6f}")  # Typically ~1e-4 to 1e-3
+ ```
+
+ ### Issue: ImportError during model loading
+
+ Install flash-attn:
+ ```bash
+ pip install flash-attn --no-build-isolation
+ ```
+
+ Or disable Flash Attention:
+ ```python
+ model = AutoModelForCausalLM.from_pretrained(
+     "model-name",
+     attn_implementation="eager",  # Standard PyTorch attention
+     torch_dtype=torch.float16
+ )
+ ```
+
+ ## Best practices
+
+ 1. **Always use float16/bfloat16** with Flash Attention (not float32)
+ 2. **Set device_map="auto"** for automatic memory management
+ 3. **Use bfloat16 for long context** (better numerical stability)
+ 4. **Enable gradient checkpointing** for training large models
+ 5. **Monitor memory** with `torch.cuda.max_memory_allocated()` (see the snippet after the example below)
+
+ **Example with all best practices**:
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, TrainingArguments
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "meta-llama/Llama-2-7b-hf",
+     attn_implementation="flash_attention_2",
+     torch_dtype=torch.bfloat16,  # Better for training
+     device_map="auto",
+     low_cpu_mem_usage=True
+ )
+
+ # Enable gradient checkpointing for memory
+ model.gradient_checkpointing_enable()
+
+ # Training with optimizations
+ training_args = TrainingArguments(
+     output_dir="./results",
+     per_device_train_batch_size=8,
+     gradient_accumulation_steps=2,
+     bf16=True,  # Match model dtype
+     optim="adamw_torch_fused",
+     gradient_checkpointing=True
+ )
+ ```
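+
+ Best practice 5, as a short sketch: reset the peak-memory counter, run one step (here a generation call; a training step works the same way), and read the peak back. `model` and a tokenized `inputs` dict are assumed from the examples above:
+
+ ```python
+ import torch
+
+ torch.cuda.reset_peak_memory_stats()
+ outputs = model.generate(**inputs, max_new_tokens=128)
+ print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
+ ```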
@@ -0,0 +1,36 @@
+ ---
+ name: langgraph-docs
+ description: Use this skill for requests related to LangGraph in order to fetch relevant documentation to provide accurate, up-to-date guidance.
+ ---
+
+ # langgraph-docs
+
+ ## Overview
+
+ This skill explains how to access LangGraph Python documentation to help answer questions and guide implementation.
+
+ ## Instructions
+
+ ### 1. Fetch the Documentation Index
+
+ Use the fetch_url tool to read the following URL:
+ https://docs.langchain.com/llms.txt
+
+ This provides a structured list of all available documentation with descriptions.
+
+ ### 2. Select Relevant Documentation
+
+ Based on the question, identify the 2-4 most relevant documentation URLs from the index. Prioritize:
+
+ - Specific how-to guides for implementation questions
+ - Core concept pages for understanding questions
+ - Tutorials for end-to-end examples
+ - Reference docs for API details
+
+ ### 3. Fetch Selected Documentation
+
+ Use the fetch_url tool to read the selected documentation URLs.
+
+ ### 4. Provide Accurate Guidance
+
+ After reading the documentation, complete the user's request.
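+
+ For environments without a fetch_url tool, a minimal sketch of steps 1-3 using the requests library (the keyword filter is only an illustration of the selection step; normally the model picks URLs by reading the index descriptions):
+
+ ```python
+ import re
+ import requests
+
+ # Step 1: fetch the documentation index
+ index = requests.get("https://docs.langchain.com/llms.txt", timeout=30).text
+
+ # Step 2: collect URLs from the index and keep a few LangGraph-related ones
+ urls = re.findall(r"https?://\S+", index)
+ selected = [u.rstrip(").,") for u in urls if "langgraph" in u.lower()][:4]
+
+ # Step 3: fetch the selected pages and read them before answering
+ docs = {u: requests.get(u, timeout=30).text for u in selected}
+ ```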
@@ -0,0 +1,258 @@
+ ---
+ name: llama-cpp
+ description: Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.
+ version: 1.0.0
+ author: Orchestra Research
+ license: MIT
+ tags: [Inference Serving, Llama.cpp, CPU Inference, Apple Silicon, Edge Deployment, GGUF, Quantization, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded]
+ dependencies: [llama-cpp-python]
+ ---
+
+ # llama.cpp
+
+ Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.
+
+ ## When to use llama.cpp
+
+ **Use llama.cpp when:**
+ - Running on CPU-only machines
+ - Deploying on Apple Silicon (M1/M2/M3/M4)
+ - Using AMD or Intel GPUs (no CUDA)
+ - Edge deployment (Raspberry Pi, embedded systems)
+ - Need simple deployment without Docker/Python
+
+ **Use TensorRT-LLM instead when:**
+ - Have NVIDIA GPUs (A100/H100)
+ - Need maximum throughput (100K+ tok/s)
+ - Running in datacenter with CUDA
+
+ **Use vLLM instead when:**
+ - Have NVIDIA GPUs
+ - Need Python-first API
+ - Want PagedAttention
+
+ ## Quick start
+
+ ### Installation
+
+ ```bash
+ # macOS/Linux
+ brew install llama.cpp
+
+ # Or build from source
+ git clone https://github.com/ggerganov/llama.cpp
+ cd llama.cpp
+ make
+
+ # With Metal (Apple Silicon)
+ make LLAMA_METAL=1
+
+ # With CUDA (NVIDIA)
+ make LLAMA_CUDA=1
+
+ # With ROCm (AMD)
+ make LLAMA_HIP=1
+ ```
+
+ ### Download model
+
+ ```bash
+ # Download from HuggingFace (GGUF format)
+ huggingface-cli download \
+     TheBloke/Llama-2-7B-Chat-GGUF \
+     llama-2-7b-chat.Q4_K_M.gguf \
+     --local-dir models/
+
+ # Or convert from HuggingFace
+ python convert_hf_to_gguf.py models/llama-2-7b-chat/
+ ```
+
+ ### Run inference
+
+ ```bash
+ # Simple chat
+ ./llama-cli \
+     -m models/llama-2-7b-chat.Q4_K_M.gguf \
+     -p "Explain quantum computing" \
+     -n 256  # Max tokens
+
+ # Interactive chat
+ ./llama-cli \
+     -m models/llama-2-7b-chat.Q4_K_M.gguf \
+     --interactive
+ ```
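+
+ The skill lists llama-cpp-python as a dependency; a minimal sketch of the same chat through the Python bindings, reusing the GGUF downloaded above:
+
+ ```python
+ from llama_cpp import Llama
+
+ llm = Llama(
+     model_path="models/llama-2-7b-chat.Q4_K_M.gguf",
+     n_ctx=4096,       # context window
+     n_gpu_layers=-1,  # offload all layers when a GPU backend (Metal/CUDA/ROCm) is built in
+ )
+
+ out = llm.create_chat_completion(
+     messages=[{"role": "user", "content": "Explain quantum computing"}],
+     max_tokens=256,
+ )
+ print(out["choices"][0]["message"]["content"])
+ ```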
+
+ ### Server mode
+
+ ```bash
+ # Start OpenAI-compatible server
+ ./llama-server \
+     -m models/llama-2-7b-chat.Q4_K_M.gguf \
+     --host 0.0.0.0 \
+     --port 8080 \
+     -ngl 32  # Offload 32 layers to GPU
+
+ # Client request
+ curl http://localhost:8080/v1/chat/completions \
+     -H "Content-Type: application/json" \
+     -d '{
+         "model": "llama-2-7b-chat",
+         "messages": [{"role": "user", "content": "Hello!"}],
+         "temperature": 0.7,
+         "max_tokens": 100
+     }'
+ ```
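+
+ Because the endpoint is OpenAI-compatible, the official openai Python client works as well; a sketch against the server started above (the API key is a placeholder, llama-server ignores it unless one is configured):
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
+
+ resp = client.chat.completions.create(
+     model="llama-2-7b-chat",
+     messages=[{"role": "user", "content": "Hello!"}],
+     temperature=0.7,
+     max_tokens=100,
+ )
+ print(resp.choices[0].message.content)
+ ```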
+
+ ## Quantization formats
+
+ ### GGUF format overview
+
+ | Format | Bits | Size (7B) | Speed | Quality | Use Case |
+ |--------|------|-----------|-------|---------|----------|
+ | **Q4_K_M** | 4.5 | 4.1 GB | Fast | Good | **Recommended default** |
+ | Q4_K_S | 4.3 | 3.9 GB | Faster | Lower | Speed critical |
+ | Q5_K_M | 5.5 | 4.8 GB | Medium | Better | Quality critical |
+ | Q6_K | 6.5 | 5.5 GB | Slower | Best | Maximum quality |
+ | Q8_0 | 8.0 | 7.0 GB | Slow | Excellent | Minimal degradation |
+ | Q2_K | 2.5 | 2.7 GB | Fastest | Poor | Testing only |
+
+ ### Choosing quantization
+
+ - **General use (balanced)**: Q4_K_M (4-bit, medium quality)
+ - **Maximum speed (more degradation)**: Q2_K or Q3_K_M
+ - **Maximum quality (slower)**: Q6_K or Q8_0
+ - **Very large models (70B, 405B)**: Q3_K_M or Q4_K_S (lower bits to fit in memory)
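+
+ To produce one of these files yourself, a sketch using the llama-quantize tool built alongside llama-cli (file names are illustrative; the F16 GGUF comes from the convert_hf_to_gguf.py step above):
+
+ ```bash
+ # Convert the HF checkpoint to an F16 GGUF, then quantize it to Q4_K_M
+ python convert_hf_to_gguf.py models/llama-2-7b-chat/ --outfile models/llama-2-7b-chat.f16.gguf
+ ./llama-quantize models/llama-2-7b-chat.f16.gguf models/llama-2-7b-chat.Q4_K_M.gguf Q4_K_M
+ ```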
+
+ ## Hardware acceleration
+
+ ### Apple Silicon (Metal)
+
+ ```bash
+ # Build with Metal
+ make LLAMA_METAL=1
+
+ # Run with GPU acceleration (automatic)
+ ./llama-cli -m model.gguf -ngl 999  # Offload all layers
+
+ # Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)
+ ```
+
+ ### NVIDIA GPUs (CUDA)
+
+ ```bash
+ # Build with CUDA
+ make LLAMA_CUDA=1
+
+ # Offload layers to GPU
+ ./llama-cli -m model.gguf -ngl 35  # Offload 35/40 layers
+
+ # Hybrid CPU+GPU for large models
+ ./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20  # GPU: 20 layers, CPU: rest
+ ```
+
+ ### AMD GPUs (ROCm)
+
+ ```bash
+ # Build with ROCm
+ make LLAMA_HIP=1
+
+ # Run with AMD GPU
+ ./llama-cli -m model.gguf -ngl 999
+ ```
+
+ ## Common patterns
+
+ ### Batch processing
+
+ ```bash
+ # Process multiple prompts from file
+ cat prompts.txt | ./llama-cli \
+     -m model.gguf \
+     --batch-size 512 \
+     -n 100
+ ```
+
+ ### Constrained generation
+
+ ```bash
+ # JSON output with grammar
+ ./llama-cli \
+     -m model.gguf \
+     -p "Generate a person: " \
+     --grammar-file grammars/json.gbnf
+
+ # Outputs valid JSON only
+ ```
+
+ ### Context size
+
+ ```bash
+ # Increase context (default 512)
+ ./llama-cli \
+     -m model.gguf \
+     -c 4096  # 4K context window
+
+ # Very long context (if model supports)
+ ./llama-cli -m model.gguf -c 32768  # 32K context
+ ```
+
+ ## Performance benchmarks
+
+ ### CPU performance (Llama 2-7B Q4_K_M)
+
+ | CPU | Threads | Speed | Cost |
+ |-----|---------|-------|------|
+ | Apple M3 Max | 16 | 50 tok/s | $0 (local) |
+ | AMD Ryzen 9 7950X | 32 | 35 tok/s | $0.50/hour |
+ | Intel i9-13900K | 32 | 30 tok/s | $0.40/hour |
+ | AWS c7i.16xlarge | 64 | 40 tok/s | $2.88/hour |
+
+ ### GPU acceleration (Llama 2-7B Q4_K_M)
+
+ | GPU | Speed | vs CPU | Cost |
+ |-----|-------|--------|------|
+ | NVIDIA RTX 4090 | 120 tok/s | 3-4× | $0 (local) |
+ | NVIDIA A10 | 80 tok/s | 2-3× | $1.00/hour |
+ | AMD MI250 | 70 tok/s | 2× | $2.00/hour |
+ | Apple M3 Max (Metal) | 50 tok/s | ~Same | $0 (local) |
+
+ ## Supported models
+
+ **LLaMA family**:
+ - Llama 2 (7B, 13B, 70B)
+ - Llama 3 / 3.1 (8B, 70B, 405B)
+ - Code Llama
+
+ **Mistral family**:
+ - Mistral 7B
+ - Mixtral 8x7B, 8x22B
+
+ **Other**:
+ - Falcon, BLOOM, GPT-J
+ - Phi-3, Gemma, Qwen
+ - LLaVA (vision); for Whisper (audio), use the companion whisper.cpp project
+
+ **Find models**: https://huggingface.co/models?library=gguf
+
+ ## References
+
+ - **[Quantization Guide](references/quantization.md)** - GGUF formats, conversion, quality comparison
+ - **[Server Deployment](references/server.md)** - API endpoints, Docker, monitoring
+ - **[Optimization](references/optimization.md)** - Performance tuning, hybrid CPU+GPU
+
+ ## Resources
+
+ - **GitHub**: https://github.com/ggerganov/llama.cpp
+ - **Models**: https://huggingface.co/models?library=gguf
+ - **Discord**: https://discord.gg/llama-cpp
+
+
@@ -0,0 +1,89 @@
+ # Performance Optimization Guide
+
+ Maximize llama.cpp inference speed and efficiency.
+
+ ## CPU Optimization
+
+ ### Thread tuning
+ ```bash
+ # Set threads (default: physical cores)
+ ./llama-cli -m model.gguf -t 8
+
+ # For AMD Ryzen 9 7950X (16 cores, 32 threads)
+ -t 16  # Best: physical cores
+
+ # Avoid hyperthreading (slower for matrix ops)
+ ```
+
+ ### BLAS acceleration
+ ```bash
+ # OpenBLAS (faster matrix ops)
+ make LLAMA_OPENBLAS=1
+
+ # BLAS mainly speeds up prompt processing (typically 2-3×)
+ ```
+
+ ## GPU Offloading
+
+ ### Layer offloading
+ ```bash
+ # Offload 35 layers to GPU (hybrid mode)
+ ./llama-cli -m model.gguf -ngl 35
+
+ # Offload all layers
+ ./llama-cli -m model.gguf -ngl 999
+
+ # Find the optimal value:
+ # Start with -ngl 999
+ # If OOM, reduce by 5 until it fits
+ ```
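+
+ A quick way to compare settings empirically is the bundled llama-bench tool; a sketch that sweeps a few -ngl values (model path illustrative):
+
+ ```bash
+ for ngl in 0 10 20 30 35; do
+     ./llama-bench -m models/llama-2-7b-chat.Q4_K_M.gguf -ngl "$ngl"
+ done
+ ```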
+
+ ### Memory usage
+ ```bash
+ # Check VRAM usage
+ nvidia-smi dmon
+
+ # Reduce context if needed
+ ./llama-cli -m model.gguf -c 2048  # 2K context instead of 4K
+ ```
+
+ ## Batch Processing
+
+ ```bash
+ # Increase batch size for throughput
+ ./llama-cli -m model.gguf -b 512  # Default: 512
+
+ # Physical batch size (GPU)
+ --ubatch-size 128  # Process 128 tokens at once
+ ```
+
+ ## Context Management
+
+ ```bash
+ # Default context (512 tokens)
+ -c 512
+
+ # Longer context (slower, more memory)
+ -c 4096
+
+ # Very long context (if model supports)
+ -c 32768
+ ```
+
+ ## Benchmarks
+
+ ### CPU Performance (Llama 2-7B Q4_K_M)
+
+ | Setup | Speed | Notes |
+ |-------|-------|-------|
+ | Apple M3 Max | 50 tok/s | Metal acceleration |
+ | AMD 7950X (16c) | 35 tok/s | OpenBLAS |
+ | Intel i9-13900K | 30 tok/s | AVX2 |
+
+ ### GPU Offloading (RTX 4090)
+
+ | Layers GPU | Speed | VRAM |
+ |------------|-------|------|
+ | 0 (CPU only) | 30 tok/s | 0 GB |
+ | 20 (hybrid) | 80 tok/s | 8 GB |
+ | 35 (all) | 120 tok/s | 12 GB |