@tecet/ollm 0.1.4-b → 0.1.5
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/docs/README.md +3 -410
- package/package.json +2 -2
- package/docs/Context/CheckpointFlowDiagram.md +0 -673
- package/docs/Context/ContextArchitecture.md +0 -898
- package/docs/Context/ContextCompression.md +0 -1102
- package/docs/Context/ContextManagment.md +0 -750
- package/docs/Context/Index.md +0 -209
- package/docs/Context/README.md +0 -390
- package/docs/DevelopmentRoadmap/Index.md +0 -238
- package/docs/DevelopmentRoadmap/OLLM-CLI_Releases.md +0 -419
- package/docs/DevelopmentRoadmap/PlanedFeatures.md +0 -448
- package/docs/DevelopmentRoadmap/README.md +0 -174
- package/docs/DevelopmentRoadmap/Roadmap.md +0 -572
- package/docs/DevelopmentRoadmap/RoadmapVisual.md +0 -372
- package/docs/Hooks/Architecture.md +0 -885
- package/docs/Hooks/Index.md +0 -244
- package/docs/Hooks/KeyboardShortcuts.md +0 -248
- package/docs/Hooks/Protocol.md +0 -817
- package/docs/Hooks/README.md +0 -403
- package/docs/Hooks/UserGuide.md +0 -1483
- package/docs/Hooks/VisualGuide.md +0 -598
- package/docs/Index.md +0 -506
- package/docs/Installation.md +0 -586
- package/docs/Introduction.md +0 -367
- package/docs/LLM Models/Index.md +0 -239
- package/docs/LLM Models/LLM_GettingStarted.md +0 -748
- package/docs/LLM Models/LLM_Index.md +0 -701
- package/docs/LLM Models/LLM_MemorySystem.md +0 -337
- package/docs/LLM Models/LLM_ModelCompatibility.md +0 -499
- package/docs/LLM Models/LLM_ModelsArchitecture.md +0 -933
- package/docs/LLM Models/LLM_ModelsCommands.md +0 -839
- package/docs/LLM Models/LLM_ModelsConfiguration.md +0 -1094
- package/docs/LLM Models/LLM_ModelsList.md +0 -1071
- package/docs/LLM Models/LLM_ModelsList.md.backup +0 -400
- package/docs/LLM Models/README.md +0 -355
- package/docs/MCP/MCP_Architecture.md +0 -1086
- package/docs/MCP/MCP_Commands.md +0 -1111
- package/docs/MCP/MCP_GettingStarted.md +0 -590
- package/docs/MCP/MCP_Index.md +0 -524
- package/docs/MCP/MCP_Integration.md +0 -866
- package/docs/MCP/MCP_Marketplace.md +0 -160
- package/docs/MCP/README.md +0 -415
- package/docs/Prompts System/Architecture.md +0 -760
- package/docs/Prompts System/Index.md +0 -223
- package/docs/Prompts System/PromptsRouting.md +0 -1047
- package/docs/Prompts System/PromptsTemplates.md +0 -1102
- package/docs/Prompts System/README.md +0 -389
- package/docs/Prompts System/SystemPrompts.md +0 -856
- package/docs/Quickstart.md +0 -535
- package/docs/Tools/Architecture.md +0 -884
- package/docs/Tools/GettingStarted.md +0 -624
- package/docs/Tools/Index.md +0 -216
- package/docs/Tools/ManifestReference.md +0 -141
- package/docs/Tools/README.md +0 -440
- package/docs/Tools/UserGuide.md +0 -773
- package/docs/Troubleshooting.md +0 -1265
- package/docs/UI&Settings/Architecture.md +0 -729
- package/docs/UI&Settings/ColorASCII.md +0 -34
- package/docs/UI&Settings/Commands.md +0 -755
- package/docs/UI&Settings/Configuration.md +0 -872
- package/docs/UI&Settings/Index.md +0 -293
- package/docs/UI&Settings/Keybinds.md +0 -372
- package/docs/UI&Settings/README.md +0 -278
- package/docs/UI&Settings/Terminal.md +0 -637
- package/docs/UI&Settings/Themes.md +0 -604
- package/docs/UI&Settings/UIGuide.md +0 -550
@@ -1,400 +0,0 @@
# Ollama Models Reference

**Complete reference for Ollama-compatible models**

This document provides comprehensive information about Ollama models, focusing on context management, tool calling capabilities, and VRAM requirements for OLLM CLI.

---

## 📋 Table of Contents

1. [Context Window Fundamentals](#context-window-fundamentals)
2. [VRAM Requirements](#vram-requirements)
3. [Model Selection Matrix](#model-selection-matrix)
4. [Tool Calling Support](#tool-calling-support)
5. [Quantization Guide](#quantization-guide)
6. [Configuration](#configuration)
7. [Performance Benchmarks](#performance-benchmarks)
8. [Troubleshooting](#troubleshooting)
9. [Best Practices](#best-practices)
**See Also:**

- [Model Commands](Models_commands.md)
- [Model Configuration](Models_configuration.md)
- [Routing Guide](3%20projects/OLLM%20CLI/LLM%20Models/routing/user-guide.md)

---
## Context Window Fundamentals

### Ollama Default Behavior

- **Default context window**: 2,048 tokens (conservative default)
- **Configurable via**: `num_ctx` parameter in a Modelfile or API request
- **Environment variable**: `OLLAMA_CONTEXT_LENGTH`

### Context Window vs num_ctx

- **Context window**: Maximum tokens the model architecture supports
- **num_ctx**: Ollama-specific parameter setting the active context length for inference
- Models often support much larger contexts than Ollama's default, but the larger window must be configured explicitly

**Example:**

```bash
# Llama 3.1 supports 128K context, but Ollama defaults to 2K
# Must configure explicitly:
export OLLAMA_CONTEXT_LENGTH=32768
```
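To check a model's native context window (as opposed to the `num_ctx` currently in effect), `ollama show` prints the model's metadata, including its context length (exact output formatting varies by Ollama version):

```bash
# Prints architecture details, including the model's native context length
ollama show llama3.1:8b
```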
---

## VRAM Requirements

### Memory Calculation Formula

```
Total VRAM = Model Weights + KV Cache + System Overhead (~0.5-1GB)
```
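As a rough back-of-envelope sketch, applying the formula to an 8B model at Q4_K_M with a 16K context (the numbers are illustrative, taken from the ranges in the tables below rather than measured):

```bash
# Estimate total VRAM for an 8B model, Q4_K_M, 16K context
# weights: ~4.5-6.0 GB file size for 7B-9B models (illustrative value)
# kv:      FP16 KV cache at 16K tokens for an 8B model (see table below)
# oh:      system/runtime overhead
awk -v w=4.9 -v kv=2.2 -v oh=1.0 \
  'BEGIN {printf "Estimated total: %.1f GB\n", w + kv + oh}'
# -> Estimated total: 8.1 GB (consistent with the ~8-9 GB row below)
```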
### Base VRAM by Model Size (Q4_K_M Quantization)

| Model Size | Est. File Size | Recommended VRAM  | Example Models                          |
| ---------- | -------------- | ----------------- | --------------------------------------- |
| 3B-4B      | 2.0-2.5 GB     | 3-4 GB            | Llama 3.2 3B, Qwen3 4B                  |
| 7B-9B      | 4.5-6.0 GB     | 6-8 GB            | Llama 3.1 8B, Qwen3 8B, Mistral 7B      |
| 12B-14B    | 7.5-9.0 GB     | 10-12 GB          | Gemma 3 12B, Qwen3 14B, Phi-4 14B       |
| 22B-35B    | 14-21 GB       | 16-24 GB          | Gemma 3 27B, Qwen3 32B, DeepSeek R1 32B |
| 70B-72B    | 40-43 GB       | 48 GB (or 2×24GB) | Llama 3.3 70B, Qwen2.5 72B              |

### KV Cache Memory Impact

KV cache grows linearly with context length:

- **8B model at 32K context**: ~4.5 GB for KV cache alone
- **Q8_0 KV quantization**: Halves KV cache memory (minimal quality impact)
- **Q4_0 KV quantization**: Reduces to 1/3 (noticeable quality reduction)
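The figures above follow from the per-token size of the cache. As a rough sketch for Llama 3.1 8B, assuming its published architecture (32 transformer layers, 8 KV heads via grouped-query attention, head dimension 128) and FP16 cache entries:

```bash
# Per-token KV cache: 2 tensors (K and V) x layers x kv_heads x head_dim x 2 bytes
echo $(( 2 * 32 * 8 * 128 * 2 ))          # 131072 bytes = 128 KiB per token
echo $(( 2 * 32 * 8 * 128 * 2 * 32768 ))  # ~4.3 GB at a 32K context
```

That lands in the same ballpark as the ~4.5 GB figure quoted above; models without grouped-query attention cache one KV head per attention head and need several times more.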
### Context Length VRAM Examples (8B Model, Q4_K_M)

| Context Length | KV Cache (FP16) | KV Cache (Q8_0) | Total VRAM |
| -------------- | --------------- | --------------- | ---------- |
| 4K tokens      | ~0.6 GB         | ~0.3 GB         | ~6-7 GB    |
| 8K tokens      | ~1.1 GB         | ~0.55 GB        | ~7-8 GB    |
| 16K tokens     | ~2.2 GB         | ~1.1 GB         | ~8-9 GB    |
| 32K tokens     | ~4.5 GB         | ~2.25 GB        | ~10-11 GB  |

**Key Insight:** Always aim to fit the entire model in VRAM. Partial offload causes 5-20x slowdown.

---

## Model Selection Matrix

### By VRAM Available

| VRAM  | Recommended Models                    | Context Sweet Spot |
| ----- | ------------------------------------- | ------------------ |
| 4GB   | Llama 3.2 3B, Qwen3 4B, Phi-3 Mini    | 4K-8K              |
| 8GB   | Llama 3.1 8B, Qwen3 8B, Mistral 7B    | 8K-16K             |
| 12GB  | Gemma 3 12B, Qwen3 14B, Phi-4 14B     | 16K-32K            |
| 16GB  | Qwen3 14B (high ctx), DeepSeek R1 14B | 32K-64K            |
| 24GB  | Gemma 3 27B, Qwen3 32B, Mixtral 8x7B  | 32K-64K            |
| 48GB+ | Llama 3.3 70B, Qwen2.5 72B            | 64K-128K           |

### By Use Case

| Use Case       | Recommended Model    | Why                           |
| -------------- | -------------------- | ----------------------------- |
| General coding | Qwen2.5-Coder 7B/32B | Best open-source coding model |
| Tool calling   | Llama 3.1 8B         | Native support, well-tested   |
| Long context   | Qwen3 8B/14B         | Up to 256K context            |
| Reasoning      | DeepSeek-R1 14B/32B  | Matches frontier models       |
| Low VRAM       | Llama 3.2 3B         | Excellent for 4GB             |
| Multilingual   | Qwen3 series         | 29+ languages                 |

---

## Tool Calling Support

### Tier 1: Best Tool Calling Support

#### Llama 3.1/3.3 (8B, 70B)

- **Context**: 128K tokens (native)
- **Tool calling**: Native function calling support
- **VRAM**: 8GB (8B), 48GB+ (70B)
- **Strengths**: Best overall, largest ecosystem, excellent instruction following
- **Downloads**: 108M+ (most popular)
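As a concrete sketch of native function calling through Ollama's `/api/chat` endpoint (assuming a local server on the default port 11434; the `get_weather` tool here is a hypothetical example):

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "stream": false
}'
```

If the model decides to call the tool, the response's `message.tool_calls` field carries the function name and arguments; the caller runs the function and feeds the result back as a `tool`-role message.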
#### Qwen3 (4B-235B)

- **Context**: Up to 256K tokens
- **Tool calling**: Native support with hybrid reasoning modes
- **VRAM**: 4GB (4B) to 48GB+ (32B)
- **Strengths**: Multilingual, thinking/non-thinking modes, Apache 2.0 license
- **Special**: Can set a "thinking budget" to control reasoning depth

#### Mistral 7B / Mixtral 8x7B

- **Context**: 32K tokens (Mistral), 64K tokens (Mixtral)
- **Tool calling**: Native function calling
- **VRAM**: 7GB (7B), 26GB (8x7B MoE, ~13B active parameters)
- **Strengths**: Efficient, excellent European languages

### Tier 2: Good Tool Calling Support

#### DeepSeek-R1 (1.5B-671B)

- **Context**: 128K tokens
- **Tool calling**: Supported via reasoning
- **VRAM**: 4GB (1.5B) to enterprise-scale (671B)
- **Strengths**: Advanced reasoning, matches o1 on benchmarks
- **Architecture**: Mixture of Experts (MoE) at 671B; the smaller sizes are dense distillations
#### Gemma 3 (4B-27B)

- **Context**: 128K tokens (4B+)
- **Tool calling**: Supported
- **VRAM**: 4GB (4B) to 24GB (27B)
- **Strengths**: Multimodal (vision), efficient architecture

#### Phi-4 (14B)

- **Context**: 16K tokens
- **Tool calling**: Supported
- **VRAM**: 12GB
- **Strengths**: Exceptional reasoning for size, synthetic data training

### Tier 3: ReAct Fallback Required

#### CodeLlama (7B-70B)

- **Context**: 16K tokens (2K for 70B)
- **Tool calling**: Via ReAct loop
- **VRAM**: 4GB (7B) to 40GB (70B)
- **Best for**: Code-specific tasks

#### Llama 2 (7B-70B)

- **Context**: 4K tokens
- **Tool calling**: Via ReAct loop
- **VRAM**: 4GB (7B) to 40GB (70B)
- **Note**: Legacy, prefer Llama 3.x

---

## Quantization Guide

### Recommended Quantization Levels

| Level      | VRAM Savings (vs FP16) | Quality Impact | Recommendation  |
| ---------- | ---------------------- | -------------- | --------------- |
| Q8_0       | ~50%                   | Minimal        | Best quality    |
| Q6_K       | ~60%                   | Very slight    | High quality    |
| Q5_K_M     | ~65%                   | Slight         | Good balance    |
| **Q4_K_M** | ~75%                   | Minor          | **Sweet spot**  |
| Q4_0       | ~75%                   | Noticeable     | Budget option   |
| Q3_K_M     | ~80%                   | Significant    | Not recommended |
| Q2_K       | ~85%                   | Severe         | Avoid           |
### KV Cache Quantization

Enable KV cache quantization to reduce memory usage:

```bash
# Enable Q8_0 KV cache (recommended)
export OLLAMA_KV_CACHE_TYPE=q8_0

# Enable Q4_0 KV cache (aggressive)
export OLLAMA_KV_CACHE_TYPE=q4_0
```

**Note**: KV cache quantization requires Flash Attention enabled.

### Quantization Selection Guide

**For 8GB VRAM:**

- Use Q4_K_M for 8B models
- Enables 16K-32K context windows
- Minimal quality impact

**For 12GB VRAM:**

- Use Q5_K_M or Q6_K for 8B models
- Use Q4_K_M for 14B models
- Better quality, still good context

**For 24GB+ VRAM:**

- Use Q8_0 for best quality
- Can run larger models (32B+)
- Maximum context windows
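Most models in the Ollama library are published in several quantizations as separate tags. The tags below are only examples and exact names vary by model, so check the model's library page before pulling:

```bash
# Default tags usually resolve to a Q4_K_M build
ollama pull llama3.1:8b

# Pull an explicit quantization (example tag; naming differs per model)
ollama pull llama3.1:8b-instruct-q8_0
```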
---

## Configuration

### Setting Context Length

#### Via Modelfile

```dockerfile
FROM llama3.1:8b
PARAMETER num_ctx 32768
```
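A Modelfile takes effect once it is built into a new model with `ollama create`; for example (the name `llama3.1-32k` is just illustrative):

```bash
# Build a derived model with the 32K window baked in, then run it
ollama create llama3.1-32k -f Modelfile
ollama run llama3.1-32k
```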
#### Via API Request

```json
{
  "model": "llama3.1:8b",
  "options": {
    "num_ctx": 32768
  }
}
```
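The options ride along with an ordinary generation request; for example, against a local Ollama server on the default port 11434:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize what num_ctx controls in one sentence.",
  "options": { "num_ctx": 32768 },
  "stream": false
}'
```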
#### Via Environment

```bash
export OLLAMA_CONTEXT_LENGTH=32768
```

#### Via OLLM Configuration

```yaml
# ~/.ollm/config.yaml
options:
  numCtx: 32768
```

### Performance Optimization

```bash
# Enable Flash Attention
export OLLAMA_FLASH_ATTENTION=1

# Set KV cache quantization
export OLLAMA_KV_CACHE_TYPE=q8_0

# Keep model loaded
export OLLAMA_KEEP_ALIVE=24h

# Set number of GPU layers (for partial offload)
export OLLAMA_NUM_GPU=35  # Adjust based on VRAM
```
---

## Performance Benchmarks

### Inference Speed (tokens/second)

| Model        | Full VRAM | Partial Offload | CPU Only |
| ------------ | --------- | --------------- | -------- |
| Llama 3.1 8B | 40-70 t/s | 8-15 t/s        | 3-6 t/s  |
| Qwen3 8B     | 35-60 t/s | 7-12 t/s        | 2-5 t/s  |
| Mistral 7B   | 45-80 t/s | 10-18 t/s       | 4-7 t/s  |

**Key Insight:** Partial VRAM offload causes 5-20x slowdown. Always aim to fit the model entirely in VRAM.
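These numbers are heavily hardware-dependent; to measure throughput on your own setup, the `--verbose` flag on `ollama run` prints token counts and eval rates after each response:

```bash
# Prints load time, prompt eval rate, and eval rate (tokens/s) after the reply
ollama run --verbose llama3.1:8b "Explain the KV cache in two sentences."
```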
### Context Processing Speed

| Context Length | Processing Time (8B model) |
| -------------- | -------------------------- |
| 4K tokens      | ~0.5-1 second              |
| 8K tokens      | ~1-2 seconds               |
| 16K tokens     | ~2-4 seconds               |
| 32K tokens     | ~4-8 seconds               |

**Note:** Times vary based on hardware and quantization level.

---

## Troubleshooting

### Out of Memory Errors

**Symptoms:**

```
Error: Failed to load model: insufficient VRAM
```

**Solutions:**

1. Use a smaller model (8B instead of 70B)
2. Use more aggressive quantization (Q4_K_M instead of Q8_0)
3. Reduce the context window (`num_ctx`)
4. Enable KV cache quantization
5. Close other GPU applications

### Slow Inference

**Symptoms:**

- Very slow token generation (< 5 tokens/second)
- High CPU usage

**Solutions:**
1. Check whether the model fits entirely in VRAM (see the check below)
2. Increase `OLLAMA_NUM_GPU` if using partial offload
3. Enable Flash Attention
4. Reduce the context window
5. Use faster quantization (Q4_K_M)
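One way to make the first check above (a sketch; `nvidia-smi` assumes an NVIDIA GPU) is to compare what Ollama reports against overall GPU memory use:

```bash
# Shows loaded models and how much of each sits on GPU vs CPU
ollama ps

# On NVIDIA hardware, check total VRAM in use and remaining headroom
nvidia-smi
```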
### Context Window Issues

**Symptoms:**

```
Error: Context length exceeds model maximum
```

**Solutions:**

1. Check the model's native context window
2. Set `num_ctx` appropriately
3. Use a model with a larger context window
4. Enable context compression in OLLM

---

## Best Practices

### Model Selection

1. **Match VRAM to model size**: Use the selection matrix above
2. **Consider context needs**: Larger contexts need more VRAM
3. **Test before committing**: Try models before configuring projects
4. **Use routing**: Let OLLM select appropriate models automatically

### VRAM Management

1. **Monitor usage**: Use the `/context` command to check VRAM
2. **Keep models loaded**: Use `/model keep` for frequently used models
3. **Unload when switching**: Use `/model unload` to free VRAM
4. **Enable auto-sizing**: Use `/context auto` for automatic context sizing

### Performance Optimization

1. **Enable Flash Attention**: Significant speed improvement
2. **Use KV cache quantization**: Reduces memory with minimal quality impact
3. **Keep models in VRAM**: Avoid partial offload
4. **Optimize context size**: Use only what you need

---

## See Also

- [Model Commands](Models_commands.md) - Model management commands
- [Model Configuration](Models_configuration.md) - Configuration options
- [Routing Guide](3%20projects/OLLM%20CLI/LLM%20Models/routing/user-guide.md) - Automatic model selection
- [Context Management](3%20projects/OLLM%20CLI/LLM%20Context%20Manager/README.md) - Context and VRAM monitoring

---

**Last Updated:** 2026-01-16
**Version:** 0.1.0
**Source:** Ollama Documentation, llama.cpp, community benchmarks