@tecet/ollm 0.1.3 → 0.1.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (46)
  1. package/README.md +529 -529
  2. package/dist/cli.js +291 -244
  3. package/dist/cli.js.map +3 -3
  4. package/dist/config/LLM_profiles.json +4 -4
  5. package/dist/config/settingsService.d.ts +5 -0
  6. package/dist/config/settingsService.d.ts.map +1 -1
  7. package/dist/config/settingsService.js +20 -0
  8. package/dist/config/settingsService.js.map +1 -1
  9. package/dist/config/toolsConfig.d.ts +6 -3
  10. package/dist/config/toolsConfig.d.ts.map +1 -1
  11. package/dist/config/toolsConfig.js +52 -3
  12. package/dist/config/toolsConfig.js.map +1 -1
  13. package/dist/features/chat/hooks/useChatNetwork.d.ts.map +1 -1
  14. package/dist/features/chat/hooks/useChatNetwork.js +15 -4
  15. package/dist/features/chat/hooks/useChatNetwork.js.map +1 -1
  16. package/dist/features/context/ContextManagerContext.d.ts.map +1 -1
  17. package/dist/features/context/ContextManagerContext.js +14 -73
  18. package/dist/features/context/ContextManagerContext.js.map +1 -1
  19. package/dist/features/context/hooks/useToolSupport.d.ts.map +1 -1
  20. package/dist/features/context/hooks/useToolSupport.js +13 -5
  21. package/dist/features/context/hooks/useToolSupport.js.map +1 -1
  22. package/dist/features/context/utils/systemPromptBuilder.d.ts.map +1 -1
  23. package/dist/features/context/utils/systemPromptBuilder.js +6 -34
  24. package/dist/features/context/utils/systemPromptBuilder.js.map +1 -1
  25. package/dist/nonInteractive.d.ts.map +1 -1
  26. package/dist/nonInteractive.js +1 -2
  27. package/dist/nonInteractive.js.map +1 -1
  28. package/dist/templates/assistant/tier3.txt +2 -0
  29. package/dist/templates/system/CoreMandates.txt +3 -0
  30. package/dist/templates/system/ToolDescriptions.txt +11 -3
  31. package/dist/templates/system/skills/SkillsAssistant.txt +1 -2
  32. package/dist/ui/App.js +15 -15
  33. package/dist/ui/components/launch/VersionBanner.js +1 -1
  34. package/dist/ui/components/layout/KeybindsLegend.d.ts.map +1 -1
  35. package/dist/ui/components/layout/KeybindsLegend.js +1 -1
  36. package/dist/ui/components/layout/KeybindsLegend.js.map +1 -1
  37. package/dist/ui/components/tabs/BugReportTab.js +1 -1
  38. package/dist/ui/components/tools/CategorySection.d.ts.map +1 -1
  39. package/dist/ui/components/tools/CategorySection.js +1 -0
  40. package/dist/ui/components/tools/CategorySection.js.map +1 -1
  41. package/dist/ui/contexts/__tests__/mcpTestUtils.d.ts +14 -14
  42. package/docs/DevelopmentRoadmap/OLLM-CLI_Releases.md +419 -419
  43. package/docs/DevelopmentRoadmap/RoadmapVisual.md +372 -372
  44. package/docs/LLM Models/LLM_ModelsList.md +1071 -1071
  45. package/docs/MCP/MCP_Architecture.md +1086 -1086
  46. package/package.json +82 -82
package/docs/LLM Models/LLM_ModelsList.md
@@ -1,1071 +1,1071 @@
1
- # Ollama Models Reference
2
-
3
- **Complete reference for Ollama-compatible models**
4
-
5
- This document provides comprehensive information about Ollama models, focusing on context management, tool calling capabilities, and VRAM requirements for OLLM CLI.
6
-
7
- ---
8
-
9
- ## 📋 Table of Contents
10
-
11
- 1. [Context Window Fundamentals](#context-window-fundamentals)
12
- 2. [VRAM Requirements](#vram-requirements)
13
- 3. [Complete Model Database](#complete-model-database)
14
- 4. [Model Selection Matrix](#model-selection-matrix)
15
- 5. [Tool Calling Support](#tool-calling-support)
16
- 6. [Quantization Guide](#quantization-guide)
17
- 7. [Configuration](#configuration)
18
- 8. [Performance Benchmarks](#performance-benchmarks)
19
- 9. [VRAM Calculation Deep Dive](#vram-calculation-deep-dive)
20
-
21
- **See Also:**
22
-
23
- - [Model Commands](Models_commands.md)
24
- - [Model Configuration](Models_configuration.md)
25
- - [Routing Guide](3%20projects/OLLM%20CLI/LLM%20Models/routing/user-guide.md)
26
-
27
- ---
28
-
29
- ## Context Window Fundamentals
30
-
31
- ### Ollama Default Behavior
32
-
33
- - **Default context window**: 2,048 tokens (conservative default)
34
- - **Configurable via**: `num_ctx` parameter in Modelfile or API request
35
- - **Environment variable**: `OLLAMA_CONTEXT_LENGTH`
36
-
37
- ### Context Window vs num_ctx
38
-
39
- - **Context window**: Maximum tokens a model architecture supports
40
- - **num_ctx**: Ollama-specific parameter setting active context length for inference
41
- Models can support larger contexts than Ollama's default, but this must be configured explicitly
42
-
43
- **Example:**
44
-
45
- ```bash
46
- # Llama 3.1 supports 128K context, but Ollama defaults to 2K
47
- # Must configure explicitly:
48
- export OLLAMA_CONTEXT_LENGTH=32768
49
- ```
50
-
51
- ---
52
-
53
- ## VRAM Requirements
54
-
55
- ### Memory Calculation Formula
56
-
57
- ```
58
- Total VRAM = Model Weights + KV Cache + System Overhead (~0.5-1GB)
59
- ```
60
-
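To make the formula concrete, here is a minimal sketch in Python (the weight, KV-cache, and overhead figures are illustrative placeholders drawn from the tables in this section, not measured values):

```python
def estimate_total_vram(weights_gb: float, kv_cache_gb: float, overhead_gb: float = 0.75) -> float:
    """Rough total VRAM estimate: model weights + KV cache + system overhead (~0.5-1 GB)."""
    return weights_gb + kv_cache_gb + overhead_gb

# Example: an 8B model at Q4_K_M (~4.7 GB file) with an 8K context (~1.1 GB FP16 KV cache)
print(f"{estimate_total_vram(4.7, 1.1):.1f} GB")  # ~6.6 GB; real-world usage runs a bit higher (~7-8 GB, see the context-length table below)
```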
61
- ### Base VRAM by Model Size (Q4_K_M Quantization)
62
-
63
- | Model Size | Est. File Size | Recommended VRAM | Example Models |
64
- | ---------- | -------------- | ----------------- | --------------------------------------- |
65
- | 3B-4B | 2.0-2.5 GB | 3-4 GB | Llama 3.2 3B, Qwen3 4B |
66
- | 7B-9B | 4.5-6.0 GB | 6-8 GB | Llama 3.1 8B, Qwen3 8B, Mistral 7B |
67
- | 12B-14B | 7.5-9.0 GB | 10-12 GB | Gemma 3 12B, Qwen3 14B, Phi-4 14B |
68
- | 22B-35B | 14-21 GB | 16-24 GB | Gemma 3 27B, Qwen3 32B, DeepSeek R1 32B |
69
- | 70B-72B | 40-43 GB | 48 GB (or 2×24GB) | Llama 3.3 70B, Qwen2.5 72B |
70
-
71
- ### KV Cache Memory Impact
72
-
73
- KV cache grows linearly with context length:
74
-
75
- - **8B model at 32K context**: ~4.5 GB for KV cache alone
76
- - **Q8_0 KV quantization**: Halves KV cache memory (minimal quality impact)
77
- - **Q4_0 KV quantization**: Reduces to 1/3 (noticeable quality reduction)
78
-
79
- ### Context Length VRAM Examples (8B Model, Q4_K_M)
80
-
81
- | Context Length | KV Cache (FP16) | KV Cache (Q8_0) | Total VRAM |
82
- | -------------- | --------------- | --------------- | ---------- |
83
- | 4K tokens | ~0.6 GB | ~0.3 GB | ~6-7 GB |
84
- | 8K tokens | ~1.1 GB | ~0.55 GB | ~7-8 GB |
85
- | 16K tokens | ~2.2 GB | ~1.1 GB | ~8-9 GB |
86
- | 32K tokens | ~4.5 GB | ~2.25 GB | ~10-11 GB |
87
-
88
- **Key Insight:** Always aim to fit the entire model in VRAM. Partial offload causes 5-20x slowdown.
89
-
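Because the KV cache grows roughly linearly with context length, the table above can be extrapolated to other context sizes. A minimal sketch, using the ~4.5 GB at 32K (FP16, 8B model) figure as the reference point; treat the output as an approximation, since the true value depends on the model architecture:

```python
KV_GB_AT_32K_FP16 = 4.5  # 8B model, FP16 KV cache at 32K tokens (reference row above)

def kv_cache_gb(context_tokens: int, kv_quant: str = "fp16") -> float:
    """Approximate KV cache size by linear scaling from the 32K reference point."""
    scale = {"fp16": 1.0, "q8_0": 0.5, "q4_0": 1 / 3}[kv_quant]
    return KV_GB_AT_32K_FP16 * (context_tokens / 32_768) * scale

print(f"{kv_cache_gb(8_192):.2f} GB")           # ~1.12 GB, matches the 8K FP16 row
print(f"{kv_cache_gb(65_536, 'q8_0'):.2f} GB")  # ~4.50 GB for a 64K context with a Q8_0 cache
```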
90
- ---
91
-
92
- ## Complete Model Database
93
-
94
- This section contains all 35+ models currently supported in OLLM CLI with accurate VRAM requirements for each context tier.
95
-
96
- ### Coding Models
97
-
98
- #### CodeGeeX4 9B
99
-
100
- - **Company:** Zhipu AI
101
- - **Quantization:** Q4_0
102
- - **Size:** 5.5 GB
103
- - **Max Context:** 128K
104
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
105
- - **Description:** Multilingual code generation model with high inference speed
106
- - **URL:** [ollama.com/library/codegeex4](https://ollama.com/library/codegeex4)
107
-
108
- **VRAM Requirements:**
109
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
110
- |---------|----|----|-----|-----|-----|------|
111
- | VRAM | 7.5 GB | 8.1 GB | 9.2 GB | 11.5 GB | 15.8 GB | 24.5 GB |
112
-
113
- #### CodeGemma 2B
114
-
115
- - **Company:** Google DeepMind
116
- - **Quantization:** Q4_K_M
117
- - **Size:** 1.6 GB
118
- - **Max Context:** 8K
119
- - **Tools:** ❌ No | **Reasoning:** ❌ No
120
- - **Description:** Lightweight code completion model for on-device use
121
- - **URL:** [ollama.com/library/codegemma](https://ollama.com/library/codegemma)
122
-
123
- **VRAM Requirements:**
124
- | Context | 4k | 8k |
125
- |---------|----|----|
126
- | VRAM | 3.0 GB | 3.4 GB |
127
-
128
- #### CodeGemma 7B
129
-
130
- - **Company:** Google DeepMind
131
- - **Quantization:** Q4_K_M
132
- - **Size:** 5.0 GB
133
- - **Max Context:** 8K
134
- - **Tools:** ❌ No | **Reasoning:** ❌ No
135
- - **Description:** Specialized code generation and completion model
136
- - **URL:** [ollama.com/library/codegemma](https://ollama.com/library/codegemma)
137
-
138
- **VRAM Requirements:**
139
- | Context | 4k | 8k |
140
- |---------|----|----|
141
- | VRAM | 7.0 GB | 8.5 GB |
142
-
143
- #### Codestral 22B
144
-
145
- - **Company:** Mistral AI
146
- - **Quantization:** Q4_0
147
- - **Size:** 12 GB
148
- - **Max Context:** 32K
149
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
150
- - **Description:** Mistral's model optimized for intermediate code generation tasks
151
- - **URL:** [ollama.com/library/codestral](https://ollama.com/library/codestral)
152
-
153
- **VRAM Requirements:**
154
- | Context | 4k | 8k | 16k | 32k |
155
- |---------|----|----|-----|-----|
156
- | VRAM | 14.5 GB | 15.6 GB | 17.8 GB | 22.0 GB |
157
-
158
- #### DeepSeek Coder V2 16B
159
-
160
- - **Company:** DeepSeek
161
- - **Quantization:** Q4_K_M
162
- - **Size:** 8.9 GB
163
- - **Max Context:** 128K
164
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
165
- - **Description:** Mixture-of-Experts (MoE) model for advanced coding tasks with highly efficient MLA Cache
166
- - **URL:** [ollama.com/library/deepseek-coder-v2](https://ollama.com/library/deepseek-coder-v2)
167
-
168
- **VRAM Requirements:**
169
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
170
- |---------|----|----|-----|-----|-----|------|
171
- | VRAM | 11.2 GB | 11.5 GB | 12.0 GB | 13.1 GB | 15.2 GB | 19.5 GB |
172
-
173
- #### Granite Code 3B
174
-
175
- - **Company:** IBM
176
- - **Quantization:** Q4_K_M
177
- - **Size:** 2.0 GB
178
- - **Max Context:** 4K
179
- - **Tools:** ❌ No | **Reasoning:** ❌ No
180
- - **Description:** IBM enterprise coding model, very fast
181
- - **URL:** [ollama.com/library/granite-code](https://ollama.com/library/granite-code)
182
-
183
- **VRAM Requirements:**
184
- | Context | 4k |
185
- |---------|----|
186
- | VRAM | 3.5 GB |
187
-
188
- #### Granite Code 8B
189
-
190
- - **Company:** IBM
191
- - **Quantization:** Q4_K_M
192
- - **Size:** 4.6 GB
193
- - **Max Context:** 4K
194
- - **Tools:** ❌ No | **Reasoning:** ❌ No
195
- - **Description:** IBM enterprise coding model, robust for Python/Java/JS
196
- - **URL:** [ollama.com/library/granite-code](https://ollama.com/library/granite-code)
197
-
198
- **VRAM Requirements:**
199
- | Context | 4k |
200
- |---------|----|
201
- | VRAM | 6.5 GB |
202
-
203
- #### Magicoder Latest
204
-
205
- - **Company:** ise-uiuc (OSS-Instruct Team)
206
- - **Quantization:** Q4_K_M
207
- - **Size:** 3.8 GB
208
- - **Max Context:** 16K
209
- - **Tools:** ❌ No | **Reasoning:** ❌ No
210
- - **Description:** Code model trained on synthetic data (OSS-Instruct)
211
- - **URL:** [ollama.com/library/magicoder](https://ollama.com/library/magicoder)
212
-
213
- **VRAM Requirements:**
214
- | Context | 4k | 8k | 16k |
215
- |---------|----|----|-----|
216
- | VRAM | 5.5 GB | 6.0 GB | 7.1 GB |
217
-
218
- #### OpenCoder 8B
219
-
220
- - **Company:** OpenCoder Team / INF
221
- - **Quantization:** Q4_K_M
222
- - **Size:** 4.7 GB
223
- - **Max Context:** 8K
224
- - **Tools:** ❌ No | **Reasoning:** ❌ No
225
- - **Description:** Fully open-source coding model with transparent dataset
226
- - **URL:** [ollama.com/library/opencoder](https://ollama.com/library/opencoder)
227
-
228
- **VRAM Requirements:**
229
- | Context | 4k | 8k |
230
- |---------|----|----|
231
- | VRAM | 6.5 GB | 7.2 GB |
232
-
233
- #### Qwen 2.5 Coder 3B
234
-
235
- - **Company:** Alibaba Cloud
236
- - **Quantization:** Q4_K_M
237
- - **Size:** 1.9 GB
238
- - **Max Context:** 128K
239
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
240
- - **Description:** Small but potent coding assistant
241
- - **URL:** [ollama.com/library/qwen2.5-coder](https://ollama.com/library/qwen2.5-coder)
242
-
243
- **VRAM Requirements:**
244
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
245
- |---------|----|----|-----|-----|-----|------|
246
- | VRAM | 3.5 GB | 4.0 GB | 4.8 GB | 6.5 GB | 9.8 GB | 16.5 GB |
247
-
248
- #### Qwen 2.5 Coder 7B
249
-
250
- - **Company:** Alibaba Cloud
251
- - **Quantization:** Q4_K_M
252
- - **Size:** 4.7 GB
253
- - **Max Context:** 128K
254
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
255
- - **Description:** SOTA coding model in the 7B class
256
- - **URL:** [ollama.com/library/qwen2.5-coder](https://ollama.com/library/qwen2.5-coder)
257
-
258
- **VRAM Requirements:**
259
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
260
- |---------|----|----|-----|-----|-----|------|
261
- | VRAM | 6.5 GB | 7.2 GB | 8.4 GB | 10.8 GB | 15.5 GB | 24.5 GB |
262
-
263
- #### Qwen 3 Coder 30B
264
-
265
- - **Company:** Alibaba Cloud
266
- - **Quantization:** Q4_0
267
- - **Size:** 18 GB
268
- - **Max Context:** 128K
269
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
270
- - **Description:** Heavyweight coding model, expert level generation
271
- - **URL:** [ollama.com/library/qwen3-coder](https://ollama.com/library/qwen3-coder)
272
-
273
- **VRAM Requirements:**
274
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
275
- |---------|----|----|-----|-----|-----|------|
276
- | VRAM | 20.5 GB | 22.0 GB | 24.5 GB | 29.5 GB | 39.5 GB | 59.5 GB |
277
-
278
- ---
279
-
280
- ### Reasoning Models
281
-
282
- #### DeepSeek R1 1.5B
283
-
284
- - **Company:** DeepSeek
285
- - **Quantization:** Q4_K_M
286
- - **Size:** 1.1 GB
287
- - **Max Context:** 128K
288
- - **Tools:** ❌ No | **Reasoning:** ✅ Yes
289
- - **Description:** Distilled reasoning model, highly efficient for logic puzzles
290
- - **URL:** [ollama.com/library/deepseek-r1](https://ollama.com/library/deepseek-r1)
291
-
292
- **VRAM Requirements:**
293
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
294
- |---------|----|----|-----|-----|-----|------|
295
- | VRAM | 2.5 GB | 2.8 GB | 3.2 GB | 4.1 GB | 6.0 GB | 9.8 GB |
296
-
297
- #### DeepSeek R1 7B
298
-
299
- - **Company:** DeepSeek
300
- - **Quantization:** Q4_K_M
301
- - **Size:** 4.7 GB
302
- - **Max Context:** 128K
303
- - **Tools:** ❌ No | **Reasoning:** ✅ Yes
304
- - **Description:** Distilled reasoning model based on Qwen 2.5, excels in math/logic
305
- - **URL:** [ollama.com/library/deepseek-r1](https://ollama.com/library/deepseek-r1)
306
-
307
- **VRAM Requirements:**
308
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
309
- |---------|----|----|-----|-----|-----|------|
310
- | VRAM | 6.5 GB | 7.2 GB | 8.4 GB | 10.8 GB | 15.5 GB | 24.5 GB |
311
-
312
- #### DeepSeek R1 8B
313
-
314
- - **Company:** DeepSeek
315
- - **Quantization:** Q4_K_M
316
- - **Size:** 5.2 GB
317
- - **Max Context:** 128K
318
- - **Tools:** ❌ No | **Reasoning:** ✅ Yes
319
- - **Description:** Distilled reasoning model based on Llama 3, balanced for logic
320
- - **URL:** [ollama.com/library/deepseek-r1](https://ollama.com/library/deepseek-r1)
321
-
322
- **VRAM Requirements:**
323
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
324
- |---------|----|----|-----|-----|-----|------|
325
- | VRAM | 7.0 GB | 7.8 GB | 9.0 GB | 11.5 GB | 16.2 GB | 25.0 GB |
326
-
327
- #### Phi-4 Mini Reasoning
328
-
329
- - **Company:** Microsoft
330
- - **Quantization:** Q4_K_M
331
- - **Size:** 3.2 GB
332
- - **Max Context:** 128K
333
- - **Tools:** ❌ No | **Reasoning:** ✅ Yes
334
- - **Description:** Microsoft model specialized in math and logical reasoning steps
335
- - **URL:** [ollama.com/library/phi4](https://ollama.com/library/phi4)
336
-
337
- **VRAM Requirements:**
338
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
339
- |---------|----|----|-----|-----|-----|------|
340
- | VRAM | 5.0 GB | 5.6 GB | 6.8 GB | 9.2 GB | 13.8 GB | 23.0 GB |
341
-
342
- ---
343
-
344
- ### General Purpose Models
345
-
346
- #### Command R7B 7B
347
-
348
- - **Company:** Cohere
349
- - **Quantization:** Q4_K_M
350
- - **Size:** 5.1 GB
351
- - **Max Context:** 128K
352
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
353
- - **Description:** Optimized for RAG and tool use, excellent instruction following
354
- - **URL:** [ollama.com/library/command-r7b](https://ollama.com/library/command-r7b)
355
-
356
- **VRAM Requirements:**
357
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
358
- |---------|----|----|-----|-----|-----|------|
359
- | VRAM | 7.0 GB | 7.6 GB | 8.8 GB | 11.0 GB | 15.5 GB | 24.5 GB |
360
-
361
- #### Dolphin3 8B
362
-
363
- - **Company:** Cognitive Computations
364
- - **Quantization:** Q4_K_M
365
- - **Size:** 4.9 GB
366
- - **Max Context:** 32K
367
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
368
- - **Description:** Uncensored/compliant model optimized for general conversation/coding
369
- - **URL:** [ollama.com/library/dolphin3](https://ollama.com/library/dolphin3)
370
-
371
- **VRAM Requirements:**
372
- | Context | 4k | 8k | 16k | 32k |
373
- |---------|----|----|-----|-----|
374
- | VRAM | 7.0 GB | 7.7 GB | 8.9 GB | 11.2 GB |
375
-
376
- #### Gemma 3 1B
377
-
378
- - **Company:** Google DeepMind
379
- - **Quantization:** Q4_K_M
380
- - **Size:** 815 MB
381
- - **Max Context:** 128K
382
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
383
- - **Description:** Ultra-compact text model from Google, high efficiency
384
- - **URL:** [ollama.com/library/gemma3](https://ollama.com/library/gemma3)
385
-
386
- **VRAM Requirements:**
387
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
388
- |---------|----|----|-----|-----|-----|------|
389
- | VRAM | 2.0 GB | 2.4 GB | 3.2 GB | 4.8 GB | 8.0 GB | 14.2 GB |
390
-
391
- #### Gemma 3 4B
392
-
393
- - **Company:** Google DeepMind
394
- - **Quantization:** Q4_K_M
395
- - **Size:** 3.3 GB
396
- - **Max Context:** 128K
397
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
398
- - **Description:** Balanced Google model with strong general purpose capabilities
399
- - **URL:** [ollama.com/library/gemma3](https://ollama.com/library/gemma3)
400
-
401
- **VRAM Requirements:**
402
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
403
- |---------|----|----|-----|-----|-----|------|
404
- | VRAM | 5.0 GB | 5.8 GB | 7.5 GB | 10.5 GB | 16.5 GB | 28.5 GB |
405
-
406
- #### Gemma 3 Nano E2B
407
-
408
- - **Company:** Google DeepMind
409
- - **Quantization:** Q4_K_M
410
- - **Size:** 5.6 GB
411
- - **Max Context:** 128K
412
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
413
- - **Description:** Nano/Experimental variant of Gemma 3 (high cache usage)
414
- - **URL:** [ollama.com/library/gemma3](https://ollama.com/library/gemma3)
415
-
416
- **VRAM Requirements:**
417
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
418
- |---------|----|----|-----|-----|-----|------|
419
- | VRAM | 7.5 GB | 8.5 GB | 10.5 GB | 14.5 GB | 22.5 GB | 38.5 GB |
420
-
421
- #### Gemma 3 Nano Latest
422
-
423
- - **Company:** Google DeepMind
424
- - **Quantization:** Q4_K_M
425
- - **Size:** 7.5 GB
426
- - **Max Context:** 128K
427
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
428
- - **Description:** Latest standard variant of the Gemma 3 Nano/Next series
429
- - **URL:** [ollama.com/library/gemma3](https://ollama.com/library/gemma3)
430
-
431
- **VRAM Requirements:**
432
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
433
- |---------|----|----|-----|-----|-----|------|
434
- | VRAM | 9.5 GB | 10.8 GB | 13.5 GB | 18.5 GB | 28.5 GB | 48.0 GB |
435
-
436
- #### GPT-OSS 20B
437
-
438
- **Company:** OpenAI
439
- - **Quantization:** Q4_0
440
- - **Size:** 13 GB
441
- - **Max Context:** 128K
442
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
443
- **Description:** Open-weight model released by OpenAI for general tasks
444
- - **URL:** [ollama.com/library/gpt-oss](https://ollama.com/library/gpt-oss)
445
-
446
- **VRAM Requirements:**
447
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
448
- |---------|----|----|-----|-----|-----|------|
449
- | VRAM | 16.5 GB | 17.5 GB | 19.5 GB | 23.5 GB | 31.5 GB | 47.5 GB |
450
-
451
- #### Granite 3.3 8B
452
-
453
- - **Company:** IBM
454
- - **Quantization:** Q4_K_M
455
- - **Size:** 4.9 GB
456
- - **Max Context:** 128K
457
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
458
- - **Description:** Updated IBM model with improved reasoning and instruction following
459
- - **URL:** [ollama.com/library/granite3.3](https://ollama.com/library/granite3.3)
460
-
461
- **VRAM Requirements:**
462
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
463
- |---------|----|----|-----|-----|-----|------|
464
- | VRAM | 7.0 GB | 7.7 GB | 8.9 GB | 11.2 GB | 15.8 GB | 24.8 GB |
465
-
466
- #### Granite 4 Latest
467
-
468
- - **Company:** IBM
469
- - **Quantization:** Q4_K_M
470
- - **Size:** 2.1 GB
471
- - **Max Context:** 128K
472
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
473
- - **Description:** Next-gen IBM model, hybrid Mamba architecture for speed
474
- - **URL:** [ollama.com/library/granite4](https://ollama.com/library/granite4)
475
-
476
- **VRAM Requirements:**
477
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
478
- |---------|----|----|-----|-----|-----|------|
479
- | VRAM | 4.0 GB | 4.5 GB | 5.5 GB | 7.5 GB | 11.5 GB | 19.5 GB |
480
-
481
- #### Llama 3 Latest (8B)
482
-
483
- - **Company:** Meta
484
- - **Quantization:** Q4_K_M
485
- - **Size:** 4.7 GB
486
- - **Max Context:** 8K (Native Llama 3.0)
487
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
488
- - **Description:** Meta's 8B standard model (robust generalist)
489
- - **URL:** [ollama.com/library/llama3](https://ollama.com/library/llama3)
490
-
491
- **VRAM Requirements:**
492
- | Context | 4k | 8k |
493
- |---------|----|----|
494
- | VRAM | 6.5 GB | 7.2 GB |
495
-
496
- #### Llama 3.2 3B
497
-
498
- - **Company:** Meta
499
- - **Quantization:** Q4_K_M
500
- - **Size:** 2.0 GB
501
- - **Max Context:** 128K
502
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
503
- - **Description:** Lightweight Llama optimized for edge devices
504
- - **URL:** [ollama.com/library/llama3.2](https://ollama.com/library/llama3.2)
505
-
506
- **VRAM Requirements:**
507
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
508
- |---------|----|----|-----|-----|-----|------|
509
- | VRAM | 3.5 GB | 3.9 GB | 4.6 GB | 6.0 GB | 8.8 GB | 14.5 GB |
510
-
511
- #### Ministral 3 3B
512
-
513
- - **Company:** Mistral AI
514
- - **Quantization:** Q4_K_M
515
- - **Size:** 3.0 GB
516
- - **Max Context:** 128K
517
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
518
- - **Description:** Mistral's edge model with high intelligence-to-size ratio
519
- - **URL:** [ollama.com/library/ministral-3](https://ollama.com/library/ministral-3)
520
-
521
- **VRAM Requirements:**
522
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
523
- |---------|----|----|-----|-----|-----|------|
524
- | VRAM | 5.0 GB | 5.6 GB | 6.8 GB | 9.0 GB | 13.5 GB | 22.5 GB |
525
-
526
- #### Ministral 3 8B
527
-
528
- - **Company:** Mistral AI
529
- - **Quantization:** Q4_K_M
530
- - **Size:** 6.0 GB
531
- - **Max Context:** 128K
532
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
533
- - **Description:** Larger edge model from Mistral, strong instruction following
534
- - **URL:** [ollama.com/library/ministral-3](https://ollama.com/library/ministral-3)
535
-
536
- **VRAM Requirements:**
537
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
538
- |---------|----|----|-----|-----|-----|------|
539
- | VRAM | 8.0 GB | 8.8 GB | 10.5 GB | 13.6 GB | 20.0 GB | 32.5 GB |
540
-
541
- #### Phi-3 Mini 3.8B
542
-
543
- - **Company:** Microsoft
544
- - **Quantization:** Q4_K_M
545
- - **Size:** 2.2 GB
546
- - **Max Context:** 128K
547
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
548
- - **Description:** Microsoft's highly capable small model (Mini)
549
- - **URL:** [ollama.com/library/phi3](https://ollama.com/library/phi3)
550
-
551
- **VRAM Requirements:**
552
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
553
- |---------|----|----|-----|-----|-----|------|
554
- | VRAM | 4.0 GB | 4.5 GB | 5.5 GB | 7.5 GB | 11.5 GB | 19.5 GB |
555
-
556
- #### Qwen 2.5 7B
557
-
558
- - **Company:** Alibaba Cloud
559
- - **Quantization:** Q4_K_M
560
- - **Size:** 4.7 GB
561
- - **Max Context:** 128K
562
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
563
- - **Description:** Alibaba's strong generalist, beats Llama 3.1 8B in benchmarks
564
- - **URL:** [ollama.com/library/qwen2.5](https://ollama.com/library/qwen2.5)
565
-
566
- **VRAM Requirements:**
567
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
568
- |---------|----|----|-----|-----|-----|------|
569
- | VRAM | 6.5 GB | 7.2 GB | 8.4 GB | 10.8 GB | 15.5 GB | 24.5 GB |
570
-
571
- #### Qwen 3 4B
572
-
573
- - **Company:** Alibaba Cloud
574
- - **Quantization:** Q4_K_M
575
- - **Size:** 2.5 GB
576
- - **Max Context:** 256K
577
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
578
- - **Description:** Next-gen Qwen generalist, high efficiency
579
- - **URL:** [ollama.com/library/qwen3](https://ollama.com/library/qwen3)
580
-
581
- **VRAM Requirements:**
582
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
583
- |---------|----|----|-----|-----|-----|------|
584
- | VRAM | 4.5 GB | 5.0 GB | 6.0 GB | 8.0 GB | 12.0 GB | 20.0 GB |
585
-
586
- #### Qwen 3 8B
587
-
588
- - **Company:** Alibaba Cloud
589
- - **Quantization:** Q4_K_M
590
- - **Size:** 5.2 GB
591
- - **Max Context:** 256K
592
- - **Tools:** ✅ Yes | **Reasoning:** ✅ Yes
593
- - **Description:** Next-gen Qwen generalist 8B, high reasoning capability
594
- - **URL:** [ollama.com/library/qwen3](https://ollama.com/library/qwen3)
595
-
596
- **VRAM Requirements:**
597
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
598
- |---------|----|----|-----|-----|-----|------|
599
- | VRAM | 7.5 GB | 8.2 GB | 9.5 GB | 12.0 GB | 17.0 GB | 27.0 GB |
600
-
601
- ---
602
-
603
- ### Vision-Language Models
604
-
605
- #### Qwen 3 VL 4B
606
-
607
- - **Company:** Alibaba Cloud
608
- - **Quantization:** Q4_K_M
609
- - **Size:** 3.3 GB
610
- - **Max Context:** 128K
611
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
612
- - **Description:** Vision-Language model, can analyze images/video inputs
613
- - **URL:** [ollama.com/library/qwen3-vl](https://ollama.com/library/qwen3-vl)
614
-
615
- **VRAM Requirements:**
616
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
617
- |---------|----|----|-----|-----|-----|------|
618
- | VRAM | 5.0 GB | 5.5 GB | 6.5 GB | 8.5 GB | 12.5 GB | 20.5 GB |
619
-
620
- #### Qwen 3 VL 8B
621
-
622
- - **Company:** Alibaba Cloud
623
- - **Quantization:** Q4_K_M
624
- - **Size:** 6.1 GB
625
- - **Max Context:** 128K
626
- - **Tools:** ✅ Yes | **Reasoning:** ❌ No
627
- - **Description:** Larger Vision-Language model for detailed visual reasoning
628
- - **URL:** [ollama.com/library/qwen3-vl](https://ollama.com/library/qwen3-vl)
629
-
630
- **VRAM Requirements:**
631
- | Context | 4k | 8k | 16k | 32k | 64k | 128k |
632
- |---------|----|----|-----|-----|-----|------|
633
- | VRAM | 8.0 GB | 8.7 GB | 10.0 GB | 12.5 GB | 17.5 GB | 27.5 GB |
634
-
635
- ---
636
-
637
- ## Model Selection Matrix
638
-
639
- ### By VRAM Available
640
-
641
- | VRAM | Recommended Models | Context Sweet Spot |
642
- | ----- | ------------------------------------- | ------------------ |
643
- | 4GB | Llama 3.2 3B, Qwen3 4B, Phi-3 Mini | 4K-8K |
644
- | 8GB | Llama 3.1 8B, Qwen3 8B, Mistral 7B | 8K-16K |
645
- | 12GB | Gemma 3 12B, Qwen3 14B, Phi-4 14B | 16K-32K |
646
- | 16GB | Qwen3 14B (high ctx), DeepSeek R1 14B | 32K-64K |
647
- | 24GB | Gemma 3 27B, Qwen3 32B, Mixtral 8x7B | 32K-64K |
648
- | 48GB+ | Llama 3.3 70B, Qwen2.5 72B | 64K-128K |
649
-
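If you want to automate this lookup, here is a minimal sketch that treats the table above as data (the tiers and model names are copied directly from the table; the helper function itself is illustrative and not part of OLLM CLI):

```python
# (minimum VRAM in GB, recommended models, context sweet spot) — rows from the table above
VRAM_TIERS = [
    (4,  "Llama 3.2 3B, Qwen3 4B, Phi-3 Mini",    "4K-8K"),
    (8,  "Llama 3.1 8B, Qwen3 8B, Mistral 7B",    "8K-16K"),
    (12, "Gemma 3 12B, Qwen3 14B, Phi-4 14B",     "16K-32K"),
    (16, "Qwen3 14B (high ctx), DeepSeek R1 14B", "32K-64K"),
    (24, "Gemma 3 27B, Qwen3 32B, Mixtral 8x7B",  "32K-64K"),
    (48, "Llama 3.3 70B, Qwen2.5 72B",            "64K-128K"),
]

def recommend(vram_gb: float) -> tuple[str, str]:
    """Return (models, context sweet spot) for the largest tier that fits the given VRAM."""
    best = VRAM_TIERS[0]
    for tier in VRAM_TIERS:
        if vram_gb >= tier[0]:
            best = tier
    return best[1], best[2]

print(recommend(12))  # ('Gemma 3 12B, Qwen3 14B, Phi-4 14B', '16K-32K')
```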
650
- ### By Use Case
651
-
652
- | Use Case | Recommended Model | Why |
653
- | -------------- | -------------------- | ----------------------------- |
654
- | General coding | Qwen2.5-Coder 7B/32B | Best open-source coding model |
655
- | Tool calling | Llama 3.1 8B | Native support, well-tested |
656
- | Long context | Qwen3 8B/14B | Up to 256K context |
657
- | Reasoning | DeepSeek-R1 14B/32B | Matches frontier models |
658
- | Low VRAM | Llama 3.2 3B | Excellent for 4GB |
659
- | Multilingual | Qwen3 series | 29+ languages |
660
-
661
- ---
662
-
663
- ## Tool Calling Support
664
-
665
- ### Tier 1: Best Tool Calling Support
666
-
667
- #### Llama 3.1/3.3 (8B, 70B)
668
-
669
- - **Context**: 128K tokens (native)
670
- - **Tool calling**: Native function calling support
671
- - **VRAM**: 8GB (8B), 48GB+ (70B)
672
- - **Strengths**: Best overall, largest ecosystem, excellent instruction following
673
- - **Downloads**: 108M+ (most popular)
674
-
675
- #### Qwen3 (4B-235B)
676
-
677
- - **Context**: Up to 256K tokens
678
- - **Tool calling**: Native support with hybrid reasoning modes
679
- - **VRAM**: 4GB (4B) to 48GB+ (32B)
680
- - **Strengths**: Multilingual, thinking/non-thinking modes, Apache 2.0 license
681
- - **Special**: Can set "thinking budget" for reasoning depth
682
-
683
- #### Mistral 7B / Mixtral 8x7B
684
-
685
- - **Context**: 32K tokens (Mistral), 64K tokens (Mixtral)
686
- - **Tool calling**: Native function calling
687
- - **VRAM**: 7GB (7B), 26GB (8x7B MoE - only 13B active)
688
- - **Strengths**: Efficient, excellent European languages
689
-
690
- ### Tier 2: Good Tool Calling Support
691
-
692
- #### DeepSeek-R1 (1.5B-671B)
693
-
694
- - **Context**: 128K tokens
695
- - **Tool calling**: Supported via reasoning
696
- - **VRAM**: 4GB (1.5B) to enterprise (671B)
697
- - **Strengths**: Advanced reasoning, matches o1 on benchmarks
698
- - **Architecture**: Mixture of Experts (MoE)
699
-
700
- #### Gemma 3 (4B-27B)
701
-
702
- - **Context**: 128K tokens (4B+)
703
- - **Tool calling**: Supported
704
- - **VRAM**: 4GB (4B) to 24GB (27B)
705
- - **Strengths**: Multimodal (vision), efficient architecture
706
-
707
- #### Phi-4 (14B)
708
-
709
- - **Context**: 16K tokens
710
- - **Tool calling**: Supported
711
- - **VRAM**: 12GB
712
- - **Strengths**: Exceptional reasoning for size, synthetic data training
713
-
714
- ### Tier 3: ReAct Fallback Required
715
-
716
- #### CodeLlama (7B-70B)
717
-
718
- - **Context**: 16K tokens (2K for 70B)
719
- - **Tool calling**: Via ReAct loop
720
- - **VRAM**: 4GB (7B) to 40GB (70B)
721
- - **Best for**: Code-specific tasks
722
-
723
- #### Llama 2 (7B-70B)
724
-
725
- - **Context**: 4K tokens
726
- - **Tool calling**: Via ReAct loop
727
- - **VRAM**: 4GB (7B) to 40GB (70B)
728
- - **Note**: Legacy, prefer Llama 3.x
729
-
730
- ---
731
-
732
- ## Quantization Guide
733
-
734
- ### Recommended Quantization Levels
735
-
736
- | Level | VRAM Savings | Quality Impact | Recommendation |
737
- | ---------- | ------------ | -------------- | --------------- |
738
- | Q8_0 | ~50% | Minimal | Best quality |
739
- | Q6_K | ~60% | Very slight | High quality |
740
- | Q5_K_M | ~65% | Slight | Good balance |
741
- | **Q4_K_M** | ~75% | Minor | **Sweet spot** |
742
- | Q4_0 | ~75% | Noticeable | Budget option |
743
- | Q3_K_M | ~80% | Significant | Not recommended |
744
- | Q2_K | ~85% | Severe | Avoid |
745
-
746
- ### KV Cache Quantization
747
-
748
- Enable KV cache quantization to reduce memory usage:
749
-
750
- ```bash
751
- # Enable Q8_0 KV cache (recommended)
752
- export OLLAMA_KV_CACHE_TYPE=q8_0
753
-
754
- # Enable Q4_0 KV cache (aggressive)
755
- export OLLAMA_KV_CACHE_TYPE=q4_0
756
- ```
757
-
758
- **Note**: KV cache quantization requires Flash Attention enabled.
759
-
760
- ### Quantization Selection Guide
761
-
762
- **For 8GB VRAM:**
763
-
764
- - Use Q4_K_M for 8B models
765
- - Enables 16K-32K context windows
766
- - Minimal quality impact
767
-
768
- **For 12GB VRAM:**
769
-
770
- - Use Q5_K_M or Q6_K for 8B models
771
- - Use Q4_K_M for 14B models
772
- - Better quality, still good context
773
-
774
- **For 24GB+ VRAM:**
775
-
776
- - Use Q8_0 for best quality
777
- - Can run larger models (32B+)
778
- - Maximum context windows
779
-
780
- ---
781
-
782
- ## Configuration
783
-
784
- ### Setting Context Length
785
-
786
- #### Via Modelfile
787
-
788
- ```dockerfile
789
- FROM llama3.1:8b
790
- PARAMETER num_ctx 32768
791
- ```
792
-
793
- #### Via API Request
794
-
795
- ```json
796
- {
797
- "model": "llama3.1:8b",
798
- "options": {
799
- "num_ctx": 32768
800
- }
801
- }
802
- ```
803
-
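For example, a minimal sketch that sends this payload to a locally running Ollama server (assumes the default endpoint `http://localhost:11434` and the `requests` package; the prompt text is just a placeholder):

```python
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",
        "prompt": "Say hello.",          # placeholder prompt
        "stream": False,
        "options": {"num_ctx": 32768},   # active context length for this request
    },
    timeout=300,
)
print(response.json()["response"])
```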
804
- #### Via Environment
805
-
806
- ```bash
807
- export OLLAMA_CONTEXT_LENGTH=32768
808
- ```
809
-
810
- #### Via OLLM Configuration
811
-
812
- ```yaml
813
- # ~/.ollm/config.yaml
814
- options:
815
- numCtx: 32768
816
- ```
817
-
818
- ### Performance Optimization
819
-
820
- ```bash
821
- # Enable Flash Attention
822
- export OLLAMA_FLASH_ATTENTION=1
823
-
824
- # Set KV cache quantization
825
- export OLLAMA_KV_CACHE_TYPE=q8_0
826
-
827
- # Keep model loaded
828
- export OLLAMA_KEEP_ALIVE=24h
829
-
830
- # Set number of GPU layers (for partial offload)
831
- export OLLAMA_NUM_GPU=35 # Adjust based on VRAM
832
- ```
833
-
834
- ---
835
-
836
- ## Performance Benchmarks
837
-
838
- ### Inference Speed (tokens/second)
839
-
840
- | Model | Full VRAM | Partial Offload | CPU Only |
841
- | ------------ | --------- | --------------- | -------- |
842
- | Llama 3.1 8B | 40-70 t/s | 8-15 t/s | 3-6 t/s |
843
- | Qwen3 8B | 35-60 t/s | 7-12 t/s | 2-5 t/s |
844
- | Mistral 7B | 45-80 t/s | 10-18 t/s | 4-7 t/s |
845
-
846
- **Key Insight:** Partial VRAM offload causes 5-20x slowdown. Always aim to fit model entirely in VRAM.
847
-
848
- ### Context Processing Speed
849
-
850
- | Context Length | Processing Time (8B model) |
851
- | -------------- | -------------------------- |
852
- | 4K tokens | ~0.5-1 second |
853
- | 8K tokens | ~1-2 seconds |
854
- | 16K tokens | ~2-4 seconds |
855
- | 32K tokens | ~4-8 seconds |
856
-
857
- **Note:** Times vary based on hardware and quantization level.
858
-
859
- ---
860
-
861
- ## Troubleshooting
862
-
863
- ### Out of Memory Errors
864
-
865
- **Symptoms:**
866
-
867
- ```
868
- Error: Failed to load model: insufficient VRAM
869
- ```
870
-
871
- **Solutions:**
872
-
873
- 1. Use smaller model (8B instead of 70B)
874
- 2. Use more aggressive quantization (Q4_K_M instead of Q8_0)
875
- 3. Reduce context window (`num_ctx`)
876
- 4. Enable KV cache quantization
877
- 5. Close other GPU applications
878
-
879
- ### Slow Inference
880
-
881
- **Symptoms:**
882
-
883
- - Very slow token generation (< 5 tokens/second)
884
- - High CPU usage
885
-
886
- **Solutions:**
887
-
888
- 1. Check if model fits entirely in VRAM
889
- 2. Increase `OLLAMA_NUM_GPU` if using partial offload
890
- 3. Enable Flash Attention
891
- 4. Reduce context window
892
- 5. Use faster quantization (Q4_K_M)
893
-
894
- ### Context Window Issues
895
-
896
- **Symptoms:**
897
-
898
- ```
899
- Error: Context length exceeds model maximum
900
- ```
901
-
902
- **Solutions:**
903
-
904
- 1. Check model's native context window
905
- 2. Set `num_ctx` appropriately
906
- 3. Use model with larger context window
907
- 4. Enable context compression in OLLM
908
-
909
- ---
910
-
911
- ## Best Practices
912
-
913
- ### Model Selection
914
-
915
- 1. **Match VRAM to model size**: Use the selection matrix above
916
- 2. **Consider context needs**: Larger contexts need more VRAM
917
- 3. **Test before committing**: Try models before configuring projects
918
- 4. **Use routing**: Let OLLM select appropriate models automatically
919
-
920
- ### VRAM Management
921
-
922
- 1. **Monitor usage**: Use `/context` command to check VRAM
923
- 2. **Keep models loaded**: Use `/model keep` for frequently-used models
924
- 3. **Unload when switching**: Use `/model unload` to free VRAM
925
- 4. **Enable auto-sizing**: Use `/context auto` for automatic context sizing
926
-
927
- ### Performance Optimization
928
-
929
- 1. **Enable Flash Attention**: Significant speed improvement
930
- 2. **Use KV cache quantization**: Reduces memory, minimal quality impact
931
- 3. **Keep models in VRAM**: Avoid partial offload
932
- 4. **Optimize context size**: Use only what you need
933
-
934
- ---
935
-
936
- ## VRAM Calculation Deep Dive
937
-
938
- ### Understanding VRAM Usage Components
939
-
940
- VRAM usage comes from two distinct places:
941
-
942
- #### 1. The "Parking Fee" (Model Weights)
943
-
944
- This is the fixed cost just to load the model.
945
-
946
- - **For an 8B model (Q4 quantization):** You need roughly 5.5 GB just to store the "brain" of the model
947
- - This never changes, regardless of how much text you type
948
-
949
- #### 2. The "Running Cost" (KV Cache)
950
-
951
- This is the memory needed to "remember" the conversation. It grows with every single word (token) you add.
952
-
953
- - **For short context (4k tokens):** You only need ~0.5 GB extra
954
- - Total: 5.5 + 0.5 = 6.0 GB (Fits in 8GB card)
955
- - **For full context (128k tokens):** You need ~10 GB to 16 GB of extra VRAM just for the memory
956
- - Total: 5.5 + 16 = ~21.5 GB (Requires a 24GB card like an RTX 3090/4090)
957
-
958
- ### Context Cost by Model Class
959
-
960
- #### The "Small" Class (7B - 9B)
961
-
962
- **Models:** Llama 3 8B, Mistral 7B v0.3, Qwen 2.5 7B
963
-
964
- - **Context Cost:** ~0.13 GB per 1,000 tokens
965
- - **Notes:** These are extremely efficient. You can easily fit 32k context on a 12GB card (RTX 3060/4070) if you use 4-bit weights
966
- - **Warning:** Gemma 2 9B is an outlier. It uses a much larger cache (approx 0.34 GB per 1k tokens) due to its architecture (high head dimension). It eats VRAM much faster than Llama 3
967
-
968
- #### The "Medium" Class (Mixtral 8x7B)
969
-
970
- **Models:** Mixtral 8x7B, Mixtral 8x22B
971
-
972
- - **Context Cost:** ~0.13 GB per 1,000 tokens
973
- - **Notes:** Mixtral is unique. While the weights are huge (26GB+), the context cache is tiny (identical to the small 7B model). This means if you can fit the weights, increasing context to 32k is very "cheap" in terms of extra VRAM
974
-
975
- #### The "Large" Class (70B+)
976
-
977
- **Models:** Llama 3 70B, Qwen2 72B
978
-
979
- - **Context Cost:** ~0.33 GB per 1,000 tokens
980
- - **32k Context:** Requires ~10.5 GB just for the context
981
- - **128k Context:** Requires ~42 GB just for the context
982
- - **Notes:** To run Llama 3 70B at full 128k context, you generally need roughly 80GB+ VRAM (e.g., 2x A6000 or Mac Studio Ultra), even with 4-bit weights
983
-
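Putting the per-1,000-token costs above into a small sketch (the costs are the approximate figures quoted for each class; real usage varies with architecture, quantization, and cache settings):

```python
# Approximate KV-cache cost per 1,000 tokens of context, from the class descriptions above
GB_PER_1K_TOKENS = {
    "7B-9B (Llama 3 8B, Mistral 7B, Qwen 2.5 7B)": 0.13,
    "Mixtral 8x7B / 8x22B": 0.13,
    "70B+ (Llama 3 70B, Qwen2 72B)": 0.33,
    "Gemma 2 9B (outlier)": 0.34,
}

def context_cost_gb(model_class: str, context_tokens: int) -> float:
    """KV-cache VRAM needed just for the context, ignoring model weights."""
    return GB_PER_1K_TOKENS[model_class] * context_tokens / 1000

print(f"{context_cost_gb('70B+ (Llama 3 70B, Qwen2 72B)', 32_000):.1f} GB")   # ~10.6 GB (matches the ~10.5 GB above)
print(f"{context_cost_gb('70B+ (Llama 3 70B, Qwen2 72B)', 128_000):.1f} GB")  # ~42.2 GB (matches the ~42 GB above)
```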
984
- ### Tips for Saving VRAM
985
-
986
- #### KV Cache Quantization (FP8)
987
-
988
- If you use llama.cpp or ExLlamaV2, you can enable "FP8 Cache" (sometimes called Q8 cache). This cuts the context VRAM usage in half with negligible quality loss.
989
-
990
- **Example:** Llama 3 70B at 32k context drops from ~50.5 GB total to ~45 GB total
991
-
992
- #### System Overhead
993
-
994
- Always leave 1-2 GB of VRAM free for your display and OS overhead. If the chart says 23.8 GB and you have a 24 GB card, it will likely crash (OOM).
995
-
996
- ### Context Size Reference Table
997
-
998
- | Context Size | What it represents | Total VRAM Needed (8B Model) | GPU Class |
999
- | ------------ | -------------------- | ---------------------------- | -------------------------- |
1000
- | 4k | A long article | ~6.0 GB | 8GB Cards (RTX 3060/4060) |
1001
- | 16k | A small book chapter | ~7.5 GB | 8GB Cards (Tight fit) |
1002
- | 32k | A short book | ~9.5 GB | 12GB Cards (RTX 3060/4070) |
1003
- | 128k | A full novel | ~22.0 GB | 24GB Cards (RTX 3090/4090) |
1004
-
1005
- ### Practical Examples
1006
-
1007
- #### Example 1: Running Qwen 2.5 7B on 8GB Card
1008
-
1009
- - **Model weights:** 4.7 GB
1010
- - **4k context:** 6.5 GB total ✅ Fits comfortably
1011
- - **8k context:** 7.2 GB total ✅ Fits with headroom
1012
- - **16k context:** 8.4 GB total ❌ Will OOM
1013
- - **Recommendation:** Use 8k context maximum
1014
-
1015
- #### Example 2: Running DeepSeek Coder V2 16B on 12GB Card
1016
-
1017
- - **Model weights:** 8.9 GB (MoE with efficient MLA cache)
1018
- - **4k context:** 11.2 GB total ✅ Fits
1019
- - **8k context:** 11.5 GB total ✅ Fits
1020
- - **16k context:** 12.0 GB total ⚠️ Very tight, may OOM
1021
- - **32k context:** 13.1 GB total ❌ Will OOM
1022
- - **Recommendation:** Use 8k context for safety
1023
-
1024
- #### Example 3: Running Qwen 3 Coder 30B on 24GB Card
1025
-
1026
- - **Model weights:** 18 GB
1027
- - **4k context:** 20.5 GB total ✅ Fits
1028
- - **8k context:** 22.0 GB total ✅ Fits
1029
- - **16k context:** 24.5 GB total ❌ Will OOM
1030
- - **Recommendation:** Use 8k context maximum, or upgrade to 32GB+ VRAM
1031
-
1032
- ### GPU Recommendations by Use Case
1033
-
1034
- #### Budget Setup (8GB VRAM)
1035
-
1036
- - **Best models:** Llama 3.2 3B, Qwen 3 4B, Gemma 3 4B, DeepSeek R1 1.5B
1037
- - **Context sweet spot:** 8k-16k
1038
- - **Use case:** Code completion, quick queries, edge deployment
1039
-
1040
- #### Mid-Range Setup (12GB VRAM)
1041
-
1042
- - **Best models:** Qwen 2.5 7B, Qwen 2.5 Coder 7B, Llama 3 8B, Mistral 7B
1043
- - **Context sweet spot:** 16k-32k
1044
- - **Use case:** General development, moderate context needs
1045
-
1046
- #### High-End Setup (24GB VRAM)
1047
-
1048
- - **Best models:** Codestral 22B, Qwen 3 Coder 30B, DeepSeek Coder V2 16B
1049
- - **Context sweet spot:** 32k-64k
1050
- - **Use case:** Professional development, large codebases
1051
-
1052
- #### Workstation Setup (48GB+ VRAM)
1053
-
1054
- - **Best models:** Llama 3.3 70B, Qwen 2.5 72B, Large MoE models
1055
- - **Context sweet spot:** 64k-128k
1056
- - **Use case:** Enterprise, research, maximum capability
1057
-
1058
- ---
1059
-
1060
- ## See Also
1061
-
1062
- - [Model Commands](Models_commands.md) - Model management commands
1063
- - [Model Configuration](Models_configuration.md) - Configuration options
1064
- - [Routing Guide](3%20projects/OLLM%20CLI/LLM%20Models/routing/user-guide.md) - Automatic model selection
1065
- - [Context Management](3%20projects/OLLM%20CLI/LLM%20Context%20Manager/README.md) - Context and VRAM monitoring
1066
-
1067
- ---
1068
-
1069
- **Last Updated:** 2026-01-28
1070
- **Version:** 0.2.0
1071
- **Source:** Ollama Documentation, llama.cpp, community benchmarks, OLLM CLI Model Database
1
+ # Ollama Models Reference
2
+
3
+ **Complete reference for Ollama-compatible models**
4
+
5
+ This document provides comprehensive information about Ollama models, focusing on context management, tool calling capabilities, and VRAM requirements for OLLM CLI.
6
+
7
+ ---
8
+
9
+ ## 📋 Table of Contents
10
+
11
+ 1. [Context Window Fundamentals](#context-window-fundamentals)
12
+ 2. [VRAM Requirements](#vram-requirements)
13
+ 3. [Complete Model Database](#complete-model-database)
14
+ 4. [Model Selection Matrix](#model-selection-matrix)
15
+ 5. [Tool Calling Support](#tool-calling-support)
16
+ 6. [Quantization Guide](#quantization-guide)
17
+ 7. [Configuration](#configuration)
18
+ 8. [Performance Benchmarks](#performance-benchmarks)
19
+ 9. [VRAM Calculation Deep Dive](#vram-calculation-deep-dive)
20
+
21
+ **See Also:**
22
+
23
+ - [Model Commands](Models_commands.md)
24
+ - [Model Configuration](Models_configuration.md)
25
+ - [Routing Guide](3%20projects/OLLM%20CLI/LLM%20Models/routing/user-guide.md)
26
+
27
+ ---
28
+
29
+ ## Context Window Fundamentals
30
+
31
+ ### Ollama Default Behavior
32
+
33
+ - **Default context window**: 2,048 tokens (conservative default)
34
+ - **Configurable via**: `num_ctx` parameter in Modelfile or API request
35
+ - **Environment variable**: `OLLAMA_CONTEXT_LENGTH`
36
+
37
+ ### Context Window vs num_ctx
38
+
39
+ - **Context window**: Maximum tokens a model architecture supports
40
+ - **num_ctx**: Ollama-specific parameter setting active context length for inference
41
+ - Models can support larger contexts than Ollama's default, but this must be configured explicitly
42
+
43
+ **Example:**
44
+
45
+ ```bash
46
+ # Llama 3.1 supports 128K context, but Ollama defaults to 2K
47
+ # Must configure explicitly:
48
+ export OLLAMA_CONTEXT_LENGTH=32768
49
+ ```
50
+
51
+ ---
52
+
53
+ ## VRAM Requirements
54
+
55
+ ### Memory Calculation Formula
56
+
57
+ ```
58
+ Total VRAM = Model Weights + KV Cache + System Overhead (~0.5-1GB)
59
+ ```
60
+
61
+ ### Base VRAM by Model Size (Q4_K_M Quantization)
62
+
63
+ | Model Size | Est. File Size | Recommended VRAM | Example Models |
64
+ | ---------- | -------------- | ----------------- | --------------------------------------- |
65
+ | 3B-4B | 2.0-2.5 GB | 3-4 GB | Llama 3.2 3B, Qwen3 4B |
66
+ | 7B-9B | 4.5-6.0 GB | 6-8 GB | Llama 3.1 8B, Qwen3 8B, Mistral 7B |
67
+ | 12B-14B | 7.5-9.0 GB | 10-12 GB | Gemma 3 12B, Qwen3 14B, Phi-4 14B |
68
+ | 22B-35B | 14-21 GB | 16-24 GB | Gemma 3 27B, Qwen3 32B, DeepSeek R1 32B |
69
+ | 70B-72B | 40-43 GB | 48 GB (or 2×24GB) | Llama 3.3 70B, Qwen2.5 72B |
70
+
71
+ ### KV Cache Memory Impact
72
+
73
+ KV cache grows linearly with context length:
74
+
75
+ - **8B model at 32K context**: ~4.5 GB for KV cache alone
76
+ - **Q8_0 KV quantization**: Halves KV cache memory (minimal quality impact)
77
+ - **Q4_0 KV quantization**: Reduces to 1/3 (noticeable quality reduction)
78
+
79
+ ### Context Length VRAM Examples (8B Model, Q4_K_M)
80
+
81
+ | Context Length | KV Cache (FP16) | KV Cache (Q8_0) | Total VRAM |
82
+ | -------------- | --------------- | --------------- | ---------- |
83
+ | 4K tokens | ~0.6 GB | ~0.3 GB | ~6-7 GB |
84
+ | 8K tokens | ~1.1 GB | ~0.55 GB | ~7-8 GB |
85
+ | 16K tokens | ~2.2 GB | ~1.1 GB | ~8-9 GB |
86
+ | 32K tokens | ~4.5 GB | ~2.25 GB | ~10-11 GB |
87
+
88
+ **Key Insight:** Always aim to fit the entire model in VRAM. Partial offload causes 5-20x slowdown.
89
+
90
+ ---
91
+
92
+ ## Complete Model Database
93
+
94
+ This section contains all 35+ models currently supported in OLLM CLI with accurate VRAM requirements for each context tier.
95
+
96
+ ### Coding Models
97
+
98
+ #### CodeGeeX4 9B
99
+
100
+ - **Company:** Zhipu AI
101
+ - **Quantization:** Q4_0
102
+ - **Size:** 5.5 GB
103
+ - **Max Context:** 128K
104
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
105
+ - **Description:** Multilingual code generation model with high inference speed
106
+ - **URL:** [ollama.com/library/codegeex4](https://ollama.com/library/codegeex4)
107
+
108
+ **VRAM Requirements:**
109
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
110
+ |---------|----|----|-----|-----|-----|------|
111
+ | VRAM | 7.5 GB | 8.1 GB | 9.2 GB | 11.5 GB | 15.8 GB | 24.5 GB |
112
+
113
+ #### CodeGemma 2B
114
+
115
+ - **Company:** Google DeepMind
116
+ - **Quantization:** Q4_K_M
117
+ - **Size:** 1.6 GB
118
+ - **Max Context:** 8K
119
+ - **Tools:** ❌ No | **Reasoning:** ❌ No
120
+ - **Description:** Lightweight code completion model for on-device use
121
+ - **URL:** [ollama.com/library/codegemma](https://ollama.com/library/codegemma)
122
+
123
+ **VRAM Requirements:**
124
+ | Context | 4k | 8k |
125
+ |---------|----|----|
126
+ | VRAM | 3.0 GB | 3.4 GB |
127
+
128
+ #### CodeGemma 7B
129
+
130
+ - **Company:** Google DeepMind
131
+ - **Quantization:** Q4_K_M
132
+ - **Size:** 5.0 GB
133
+ - **Max Context:** 8K
134
+ - **Tools:** ❌ No | **Reasoning:** ❌ No
135
+ - **Description:** Specialized code generation and completion model
136
+ - **URL:** [ollama.com/library/codegemma](https://ollama.com/library/codegemma)
137
+
138
+ **VRAM Requirements:**
139
+ | Context | 4k | 8k |
140
+ |---------|----|----|
141
+ | VRAM | 7.0 GB | 8.5 GB |
142
+
143
+ #### Codestral 22B
144
+
145
+ - **Company:** Mistral AI
146
+ - **Quantization:** Q4_0
147
+ - **Size:** 12 GB
148
+ - **Max Context:** 32K
149
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
150
+ - **Description:** Mistral's model optimized for intermediate code generation tasks
151
+ - **URL:** [ollama.com/library/codestral](https://ollama.com/library/codestral)
152
+
153
+ **VRAM Requirements:**
154
+ | Context | 4k | 8k | 16k | 32k |
155
+ |---------|----|----|-----|-----|
156
+ | VRAM | 14.5 GB | 15.6 GB | 17.8 GB | 22.0 GB |
157
+
158
+ #### DeepSeek Coder V2 16B
159
+
160
+ - **Company:** DeepSeek
161
+ - **Quantization:** Q4_K_M
162
+ - **Size:** 8.9 GB
163
+ - **Max Context:** 128K
164
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
165
+ - **Description:** Mixture-of-Experts (MoE) model for advanced coding tasks with highly efficient MLA Cache
166
+ - **URL:** [ollama.com/library/deepseek-coder-v2](https://ollama.com/library/deepseek-coder-v2)
167
+
168
+ **VRAM Requirements:**
169
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
170
+ |---------|----|----|-----|-----|-----|------|
171
+ | VRAM | 11.2 GB | 11.5 GB | 12.0 GB | 13.1 GB | 15.2 GB | 19.5 GB |
172
+
173
+ #### Granite Code 3B
174
+
175
+ - **Company:** IBM
176
+ - **Quantization:** Q4_K_M
177
+ - **Size:** 2.0 GB
178
+ - **Max Context:** 4K
179
+ - **Tools:** ❌ No | **Reasoning:** ❌ No
180
+ - **Description:** IBM enterprise coding model, very fast
181
+ - **URL:** [ollama.com/library/granite-code](https://ollama.com/library/granite-code)
182
+
183
+ **VRAM Requirements:**
184
+ | Context | 4k |
185
+ |---------|----|
186
+ | VRAM | 3.5 GB |
187
+
188
+ #### Granite Code 8B
189
+
190
+ - **Company:** IBM
191
+ - **Quantization:** Q4_K_M
192
+ - **Size:** 4.6 GB
193
+ - **Max Context:** 4K
194
+ - **Tools:** ❌ No | **Reasoning:** ❌ No
195
+ - **Description:** IBM enterprise coding model, robust for Python/Java/JS
196
+ - **URL:** [ollama.com/library/granite-code](https://ollama.com/library/granite-code)
197
+
198
+ **VRAM Requirements:**
199
+ | Context | 4k |
200
+ |---------|----|
201
+ | VRAM | 6.5 GB |
202
+
203
+ #### Magicoder Latest
204
+
205
+ - **Company:** ise-uiuc (OSS-Instruct Team)
206
+ - **Quantization:** Q4_K_M
207
+ - **Size:** 3.8 GB
208
+ - **Max Context:** 16K
209
+ - **Tools:** ❌ No | **Reasoning:** ❌ No
210
+ - **Description:** Code model trained on synthetic data (OSS-Instruct)
211
+ - **URL:** [ollama.com/library/magicoder](https://ollama.com/library/magicoder)
212
+
213
+ **VRAM Requirements:**
214
+ | Context | 4k | 8k | 16k |
215
+ |---------|----|----|-----|
216
+ | VRAM | 5.5 GB | 6.0 GB | 7.1 GB |
217
+
218
+ #### OpenCoder 8B
219
+
220
+ - **Company:** OpenCoder Team / INF
221
+ - **Quantization:** Q4_K_M
222
+ - **Size:** 4.7 GB
223
+ - **Max Context:** 8K
224
+ - **Tools:** ❌ No | **Reasoning:** ❌ No
225
+ - **Description:** Fully open-source coding model with transparent dataset
226
+ - **URL:** [ollama.com/library/opencoder](https://ollama.com/library/opencoder)
227
+
228
+ **VRAM Requirements:**
229
+ | Context | 4k | 8k |
230
+ |---------|----|----|
231
+ | VRAM | 6.5 GB | 7.2 GB |
232
+
233
+ #### Qwen 2.5 Coder 3B
234
+
235
+ - **Company:** Alibaba Cloud
236
+ - **Quantization:** Q4_K_M
237
+ - **Size:** 1.9 GB
238
+ - **Max Context:** 128K
239
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
240
+ - **Description:** Small but potent coding assistant
241
+ - **URL:** [ollama.com/library/qwen2.5-coder](https://ollama.com/library/qwen2.5-coder)
242
+
243
+ **VRAM Requirements:**
244
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
245
+ |---------|----|----|-----|-----|-----|------|
246
+ | VRAM | 3.5 GB | 4.0 GB | 4.8 GB | 6.5 GB | 9.8 GB | 16.5 GB |
247
+
248
+ #### Qwen 2.5 Coder 7B
249
+
250
+ - **Company:** Alibaba Cloud
251
+ - **Quantization:** Q4_K_M
252
+ - **Size:** 4.7 GB
253
+ - **Max Context:** 128K
254
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
255
+ - **Description:** SOTA coding model in the 7B class
256
+ - **URL:** [ollama.com/library/qwen2.5-coder](https://ollama.com/library/qwen2.5-coder)
257
+
258
+ **VRAM Requirements:**
259
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
260
+ |---------|----|----|-----|-----|-----|------|
261
+ | VRAM | 6.5 GB | 7.2 GB | 8.4 GB | 10.8 GB | 15.5 GB | 24.5 GB |
262
+
263
+ #### Qwen 3 Coder 30B
264
+
265
+ - **Company:** Alibaba Cloud
266
+ - **Quantization:** Q4_0
267
+ - **Size:** 18 GB
268
+ - **Max Context:** 128K
269
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
270
+ - **Description:** Heavyweight coding model, expert level generation
271
+ - **URL:** [ollama.com/library/qwen3-coder](https://ollama.com/library/qwen3-coder)
272
+
273
+ **VRAM Requirements:**
274
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
275
+ |---------|----|----|-----|-----|-----|------|
276
+ | VRAM | 20.5 GB | 22.0 GB | 24.5 GB | 29.5 GB | 39.5 GB | 59.5 GB |
277
+
278
+ ---
279
+
280
+ ### Reasoning Models
281
+
282
+ #### DeepSeek R1 1.5B
283
+
284
+ - **Company:** DeepSeek
285
+ - **Quantization:** Q4_K_M
286
+ - **Size:** 1.1 GB
287
+ - **Max Context:** 128K
288
+ - **Tools:** ❌ No | **Reasoning:** ✅ Yes
289
+ - **Description:** Distilled reasoning model, highly efficient for logic puzzles
290
+ - **URL:** [ollama.com/library/deepseek-r1](https://ollama.com/library/deepseek-r1)
291
+
292
+ **VRAM Requirements:**
293
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
294
+ |---------|----|----|-----|-----|-----|------|
295
+ | VRAM | 2.5 GB | 2.8 GB | 3.2 GB | 4.1 GB | 6.0 GB | 9.8 GB |
296
+
297
+ #### DeepSeek R1 7B
298
+
299
+ - **Company:** DeepSeek
300
+ - **Quantization:** Q4_K_M
301
+ - **Size:** 4.7 GB
302
+ - **Max Context:** 128K
303
+ - **Tools:** ❌ No | **Reasoning:** ✅ Yes
304
+ - **Description:** Distilled reasoning model based on Qwen 2.5, excels in math/logic
305
+ - **URL:** [ollama.com/library/deepseek-r1](https://ollama.com/library/deepseek-r1)
306
+
307
+ **VRAM Requirements:**
308
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
309
+ |---------|----|----|-----|-----|-----|------|
310
+ | VRAM | 6.5 GB | 7.2 GB | 8.4 GB | 10.8 GB | 15.5 GB | 24.5 GB |
311
+
312
+ #### DeepSeek R1 8B
313
+
314
+ - **Company:** DeepSeek
315
+ - **Quantization:** Q4_K_M
316
+ - **Size:** 5.2 GB
317
+ - **Max Context:** 128K
318
+ - **Tools:** ❌ No | **Reasoning:** ✅ Yes
319
+ - **Description:** Distilled reasoning model based on Llama 3, balanced for logic
320
+ - **URL:** [ollama.com/library/deepseek-r1](https://ollama.com/library/deepseek-r1)
321
+
322
+ **VRAM Requirements:**
323
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
324
+ |---------|----|----|-----|-----|-----|------|
325
+ | VRAM | 7.0 GB | 7.8 GB | 9.0 GB | 11.5 GB | 16.2 GB | 25.0 GB |
326
+
327
+ #### Phi-4 Mini Reasoning
328
+
329
+ - **Company:** Microsoft
330
+ - **Quantization:** Q4_K_M
331
+ - **Size:** 3.2 GB
332
+ - **Max Context:** 128K
333
+ - **Tools:** ❌ No | **Reasoning:** ✅ Yes
334
+ - **Description:** Microsoft model specialized in math and logical reasoning steps
335
+ - **URL:** [ollama.com/library/phi4](https://ollama.com/library/phi4)
336
+
337
+ **VRAM Requirements:**
338
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
339
+ |---------|----|----|-----|-----|-----|------|
340
+ | VRAM | 5.0 GB | 5.6 GB | 6.8 GB | 9.2 GB | 13.8 GB | 23.0 GB |
341
+
342
+ ---
343
+
344
+ ### General Purpose Models
345
+
346
+ #### Command R7B 7B
347
+
348
+ - **Company:** Cohere
349
+ - **Quantization:** Q4_K_M
350
+ - **Size:** 5.1 GB
351
+ - **Max Context:** 128K
352
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
353
+ - **Description:** Optimized for RAG and tool use, excellent instruction following
354
+ - **URL:** [ollama.com/library/command-r7b](https://ollama.com/library/command-r7b)
355
+
356
+ **VRAM Requirements:**
357
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
358
+ |---------|----|----|-----|-----|-----|------|
359
+ | VRAM | 7.0 GB | 7.6 GB | 8.8 GB | 11.0 GB | 15.5 GB | 24.5 GB |
360
+
361
+ #### Dolphin3 8B
362
+
363
+ - **Company:** Cognitive Computations
364
+ - **Quantization:** Q4_K_M
365
+ - **Size:** 4.9 GB
366
+ - **Max Context:** 32K
367
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
368
+ - **Description:** Uncensored/compliant model optimized for general conversation/coding
369
+ - **URL:** [ollama.com/library/dolphin3](https://ollama.com/library/dolphin3)
370
+
371
+ **VRAM Requirements:**
372
+ | Context | 4k | 8k | 16k | 32k |
373
+ |---------|----|----|-----|-----|
374
+ | VRAM | 7.0 GB | 7.7 GB | 8.9 GB | 11.2 GB |
375
+
376
+ #### Gemma 3 1B
377
+
378
+ - **Company:** Google DeepMind
379
+ - **Quantization:** Q4_K_M
380
+ - **Size:** 815 MB
381
+ - **Max Context:** 128K
382
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
383
+ - **Description:** Ultra-compact text model from Google, high efficiency
384
+ - **URL:** [ollama.com/library/gemma3](https://ollama.com/library/gemma3)
385
+
386
+ **VRAM Requirements:**
387
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
388
+ |---------|----|----|-----|-----|-----|------|
389
+ | VRAM | 2.0 GB | 2.4 GB | 3.2 GB | 4.8 GB | 8.0 GB | 14.2 GB |
390
+
391
+ #### Gemma 3 4B
392
+
393
+ - **Company:** Google DeepMind
394
+ - **Quantization:** Q4_K_M
395
+ - **Size:** 3.3 GB
396
+ - **Max Context:** 128K
397
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
398
+ - **Description:** Balanced Google model with strong general purpose capabilities
399
+ - **URL:** [ollama.com/library/gemma3](https://ollama.com/library/gemma3)
400
+
401
+ **VRAM Requirements:**
402
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
403
+ |---------|----|----|-----|-----|-----|------|
404
+ | VRAM | 5.0 GB | 5.8 GB | 7.5 GB | 10.5 GB | 16.5 GB | 28.5 GB |
405
+
406
+ #### Gemma 3 Nano E2B
407
+
408
+ - **Company:** Google DeepMind
409
+ - **Quantization:** Q4_K_M
410
+ - **Size:** 5.6 GB
411
+ - **Max Context:** 128K
412
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
413
+ - **Description:** Nano/Experimental variant of Gemma 3 (high cache usage)
414
+ - **URL:** [ollama.com/library/gemma3](https://ollama.com/library/gemma3)
415
+
416
+ **VRAM Requirements:**
417
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
418
+ |---------|----|----|-----|-----|-----|------|
419
+ | VRAM | 7.5 GB | 8.5 GB | 10.5 GB | 14.5 GB | 22.5 GB | 38.5 GB |
420
+
421
+ #### Gemma 3 Nano Latest
422
+
423
+ - **Company:** Google DeepMind
424
+ - **Quantization:** Q4_K_M
425
+ - **Size:** 7.5 GB
426
+ - **Max Context:** 128K
427
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
428
+ - **Description:** Latest standard variant of the Gemma 3 Nano/Next series
429
+ - **URL:** [ollama.com/library/gemma3](https://ollama.com/library/gemma3)
430
+
431
+ **VRAM Requirements:**
432
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
433
+ |---------|----|----|-----|-----|-----|------|
434
+ | VRAM | 9.5 GB | 10.8 GB | 13.5 GB | 18.5 GB | 28.5 GB | 48.0 GB |
435
+
436
+ #### GPT-OSS 20B
437
+
438
+ - **Company:** OpenAI
439
+ - **Quantization:** Q4_0
440
+ - **Size:** 13 GB
441
+ - **Max Context:** 128K
442
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
443
+ - **Description:** Open-weight model released by OpenAI, suited to general and agentic tasks
444
+ - **URL:** [ollama.com/library/gpt-oss](https://ollama.com/library/gpt-oss)
445
+
446
+ **VRAM Requirements:**
447
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
448
+ |---------|----|----|-----|-----|-----|------|
449
+ | VRAM | 16.5 GB | 17.5 GB | 19.5 GB | 23.5 GB | 31.5 GB | 47.5 GB |
450
+
451
+ #### Granite 3.3 8B
452
+
453
+ - **Company:** IBM
454
+ - **Quantization:** Q4_K_M
455
+ - **Size:** 4.9 GB
456
+ - **Max Context:** 128K
457
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
458
+ - **Description:** Updated IBM model with improved reasoning and instruction following
459
+ - **URL:** [ollama.com/library/granite3.3](https://ollama.com/library/granite3.3)
460
+
461
+ **VRAM Requirements:**
462
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
463
+ |---------|----|----|-----|-----|-----|------|
464
+ | VRAM | 7.0 GB | 7.7 GB | 8.9 GB | 11.2 GB | 15.8 GB | 24.8 GB |
465
+
466
+ #### Granite 4 Latest
467
+
468
+ - **Company:** IBM
469
+ - **Quantization:** Q4_K_M
470
+ - **Size:** 2.1 GB
471
+ - **Max Context:** 128K
472
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
473
+ - **Description:** Next-gen IBM model, hybrid Mamba architecture for speed
474
+ - **URL:** [ollama.com/library/granite4](https://ollama.com/library/granite4)
475
+
476
+ **VRAM Requirements:**
477
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
478
+ |---------|----|----|-----|-----|-----|------|
479
+ | VRAM | 4.0 GB | 4.5 GB | 5.5 GB | 7.5 GB | 11.5 GB | 19.5 GB |
480
+
481
+ #### Llama 3 Latest (8B)
482
+
483
+ - **Company:** Meta
484
+ - **Quantization:** Q4_K_M
485
+ - **Size:** 4.7 GB
486
+ - **Max Context:** 8K (Native Llama 3.0)
487
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
488
+ - **Description:** Meta's 8B standard model (robust generalist)
489
+ - **URL:** [ollama.com/library/llama3](https://ollama.com/library/llama3)
490
+
491
+ **VRAM Requirements:**
492
+ | Context | 4k | 8k |
493
+ |---------|----|----|
494
+ | VRAM | 6.5 GB | 7.2 GB |
495
+
496
+ #### Llama 3.2 3B
497
+
498
+ - **Company:** Meta
499
+ - **Quantization:** Q4_K_M
500
+ - **Size:** 2.0 GB
501
+ - **Max Context:** 128K
502
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
503
+ - **Description:** Lightweight Llama optimized for edge devices
504
+ - **URL:** [ollama.com/library/llama3.2](https://ollama.com/library/llama3.2)
505
+
506
+ **VRAM Requirements:**
507
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
508
+ |---------|----|----|-----|-----|-----|------|
509
+ | VRAM | 3.5 GB | 3.9 GB | 4.6 GB | 6.0 GB | 8.8 GB | 14.5 GB |
510
+
511
+ #### Ministral 3 3B
512
+
513
+ - **Company:** Mistral AI
514
+ - **Quantization:** Q4_K_M
515
+ - **Size:** 3.0 GB
516
+ - **Max Context:** 128K
517
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
518
+ - **Description:** Mistral's edge model with high intelligence-to-size ratio
519
+ - **URL:** [ollama.com/library/ministral-3](https://ollama.com/library/ministral-3)
520
+
521
+ **VRAM Requirements:**
522
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
523
+ |---------|----|----|-----|-----|-----|------|
524
+ | VRAM | 5.0 GB | 5.6 GB | 6.8 GB | 9.0 GB | 13.5 GB | 22.5 GB |
525
+
526
+ #### Ministral 3 8B
527
+
528
+ - **Company:** Mistral AI
529
+ - **Quantization:** Q4_K_M
530
+ - **Size:** 6.0 GB
531
+ - **Max Context:** 128K
532
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
533
+ - **Description:** Larger edge model from Mistral, strong instruction following
534
+ - **URL:** [ollama.com/library/ministral-3](https://ollama.com/library/ministral-3)
535
+
536
+ **VRAM Requirements:**
537
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
538
+ |---------|----|----|-----|-----|-----|------|
539
+ | VRAM | 8.0 GB | 8.8 GB | 10.5 GB | 13.6 GB | 20.0 GB | 32.5 GB |
540
+
541
+ #### Phi-3 Mini 3.8B
542
+
543
+ - **Company:** Microsoft
544
+ - **Quantization:** Q4_K_M
545
+ - **Size:** 2.2 GB
546
+ - **Max Context:** 128K
547
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
548
+ - **Description:** Microsoft's highly capable small model (Mini)
549
+ - **URL:** [ollama.com/library/phi3](https://ollama.com/library/phi3)
550
+
551
+ **VRAM Requirements:**
552
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
553
+ |---------|----|----|-----|-----|-----|------|
554
+ | VRAM | 4.0 GB | 4.5 GB | 5.5 GB | 7.5 GB | 11.5 GB | 19.5 GB |
555
+
556
+ #### Qwen 2.5 7B
557
+
558
+ - **Company:** Alibaba Cloud
559
+ - **Quantization:** Q4_K_M
560
+ - **Size:** 4.7 GB
561
+ - **Max Context:** 128K
562
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
563
+ - **Description:** Alibaba's strong generalist, beats Llama 3.1 8B in benchmarks
564
+ - **URL:** [ollama.com/library/qwen2.5](https://ollama.com/library/qwen2.5)
565
+
566
+ **VRAM Requirements:**
567
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
568
+ |---------|----|----|-----|-----|-----|------|
569
+ | VRAM | 6.5 GB | 7.2 GB | 8.4 GB | 10.8 GB | 15.5 GB | 24.5 GB |
570
+
571
+ #### Qwen 3 4B
572
+
573
+ - **Company:** Alibaba Cloud
574
+ - **Quantization:** Q4_K_M
575
+ - **Size:** 2.5 GB
576
+ - **Max Context:** 256K
577
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
578
+ - **Description:** Next-gen Qwen generalist, high efficiency
579
+ - **URL:** [ollama.com/library/qwen3](https://ollama.com/library/qwen3)
580
+
581
+ **VRAM Requirements:**
582
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
583
+ |---------|----|----|-----|-----|-----|------|
584
+ | VRAM | 4.5 GB | 5.0 GB | 6.0 GB | 8.0 GB | 12.0 GB | 20.0 GB |
585
+
586
+ #### Qwen 3 8B
587
+
588
+ - **Company:** Alibaba Cloud
589
+ - **Quantization:** Q4_K_M
590
+ - **Size:** 5.2 GB
591
+ - **Max Context:** 256K
592
+ - **Tools:** ✅ Yes | **Reasoning:** ✅ Yes
593
+ - **Description:** Next-gen Qwen generalist 8B, high reasoning capability
594
+ - **URL:** [ollama.com/library/qwen3](https://ollama.com/library/qwen3)
595
+
596
+ **VRAM Requirements:**
597
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
598
+ |---------|----|----|-----|-----|-----|------|
599
+ | VRAM | 7.5 GB | 8.2 GB | 9.5 GB | 12.0 GB | 17.0 GB | 27.0 GB |
600
+
601
+ ---
602
+
603
+ ### Vision-Language Models
604
+
605
+ #### Qwen 3 VL 4B
606
+
607
+ - **Company:** Alibaba Cloud
608
+ - **Quantization:** Q4_K_M
609
+ - **Size:** 3.3 GB
610
+ - **Max Context:** 128K
611
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
612
+ - **Description:** Vision-Language model, can analyze images/video inputs
613
+ - **URL:** [ollama.com/library/qwen3-vl](https://ollama.com/library/qwen3-vl)
614
+
615
+ **VRAM Requirements:**
616
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
617
+ |---------|----|----|-----|-----|-----|------|
618
+ | VRAM | 5.0 GB | 5.5 GB | 6.5 GB | 8.5 GB | 12.5 GB | 20.5 GB |
619
+
620
+ #### Qwen 3 VL 8B
621
+
622
+ - **Company:** Alibaba Cloud
623
+ - **Quantization:** Q4_K_M
624
+ - **Size:** 6.1 GB
625
+ - **Max Context:** 128K
626
+ - **Tools:** ✅ Yes | **Reasoning:** ❌ No
627
+ - **Description:** Larger Vision-Language model for detailed visual reasoning
628
+ - **URL:** [ollama.com/library/qwen3-vl](https://ollama.com/library/qwen3-vl)
629
+
630
+ **VRAM Requirements:**
631
+ | Context | 4k | 8k | 16k | 32k | 64k | 128k |
632
+ |---------|----|----|-----|-----|-----|------|
633
+ | VRAM | 8.0 GB | 8.7 GB | 10.0 GB | 12.5 GB | 17.5 GB | 27.5 GB |
634
+
635
+ ---
636
+
637
+ ## Model Selection Matrix
638
+
639
+ ### By VRAM Available
640
+
641
+ | VRAM | Recommended Models | Context Sweet Spot |
642
+ | ----- | ------------------------------------- | ------------------ |
643
+ | 4GB | Llama 3.2 3B, Qwen3 4B, Phi-3 Mini | 4K-8K |
644
+ | 8GB | Llama 3.1 8B, Qwen3 8B, Mistral 7B | 8K-16K |
645
+ | 12GB | Gemma 3 12B, Qwen3 14B, Phi-4 14B | 16K-32K |
646
+ | 16GB | Qwen3 14B (high ctx), DeepSeek R1 14B | 32K-64K |
647
+ | 24GB | Gemma 3 27B, Qwen3 32B, Mixtral 8x7B | 32K-64K |
648
+ | 48GB+ | Llama 3.3 70B, Qwen2.5 72B | 64K-128K |
649
+
650
+ ### By Use Case
651
+
652
+ | Use Case | Recommended Model | Why |
653
+ | -------------- | -------------------- | ----------------------------- |
654
+ | General coding | Qwen2.5-Coder 7B/32B | Best open-source coding model |
655
+ | Tool calling | Llama 3.1 8B | Native support, well-tested |
656
+ | Long context | Qwen3 8B/14B | Up to 256K context |
657
+ | Reasoning | DeepSeek-R1 14B/32B | Matches frontier models |
658
+ | Low VRAM | Llama 3.2 3B | Excellent for 4GB |
659
+ | Multilingual | Qwen3 series | 29+ languages |
660
+
661
+ ---
662
+
663
+ ## Tool Calling Support
664
+
665
+ ### Tier 1: Best Tool Calling Support
666
+
667
+ #### Llama 3.1/3.3 (8B, 70B)
668
+
669
+ - **Context**: 128K tokens (native)
670
+ - **Tool calling**: Native function calling support
671
+ - **VRAM**: 8GB (8B), 48GB+ (70B)
672
+ - **Strengths**: Best overall, largest ecosystem, excellent instruction following
673
+ - **Downloads**: 108M+ (most popular)
674
+
675
+ #### Qwen3 (4B-235B)
676
+
677
+ - **Context**: Up to 256K tokens
678
+ - **Tool calling**: Native support with hybrid reasoning modes
679
+ - **VRAM**: 4GB (4B) to 48GB+ (32B)
680
+ - **Strengths**: Multilingual, thinking/non-thinking modes, Apache 2.0 license
681
+ - **Special**: Can set "thinking budget" for reasoning depth
682
+
683
+ #### Mistral 7B / Mixtral 8x7B
684
+
685
+ - **Context**: 32K tokens (Mistral 7B v0.3 and Mixtral 8x7B); 64K for Mixtral 8x22B
686
+ - **Tool calling**: Native function calling
687
+ - **VRAM**: 7GB (7B), 26GB (8x7B MoE - only 13B active)
688
+ - **Strengths**: Efficient, excellent European languages
689
+
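+ The Tier 1 models above accept tool schemas natively through Ollama's `/api/chat` endpoint via the `tools` field. The sketch below shows one round trip against a local server; the `get_weather` schema is purely illustrative and not part of Ollama or OLLM.
+
+ ```python
+ # Minimal sketch: native tool calling against a local Ollama server.
+ # Assumes Ollama is listening on localhost:11434 with llama3.1:8b pulled;
+ # get_weather is a made-up example tool, not a real API.
+ import json
+ import urllib.request
+
+ payload = {
+     "model": "llama3.1:8b",
+     "stream": False,
+     "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
+     "tools": [{
+         "type": "function",
+         "function": {
+             "name": "get_weather",
+             "description": "Look up current weather for a city",
+             "parameters": {
+                 "type": "object",
+                 "properties": {"city": {"type": "string"}},
+                 "required": ["city"],
+             },
+         },
+     }],
+ }
+
+ req = urllib.request.Request(
+     "http://localhost:11434/api/chat",
+     data=json.dumps(payload).encode(),
+     headers={"Content-Type": "application/json"},
+ )
+ with urllib.request.urlopen(req) as resp:
+     message = json.load(resp)["message"]
+
+ # Tier 1 models return structured tool calls instead of free text.
+ for call in message.get("tool_calls", []):
+     print(call["function"]["name"], call["function"]["arguments"])
+ ```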
690
+ ### Tier 2: Good Tool Calling Support
691
+
692
+ #### DeepSeek-R1 (1.5B-671B)
693
+
694
+ - **Context**: 128K tokens
695
+ - **Tool calling**: Supported via reasoning
696
+ - **VRAM**: 4GB (1.5B) to enterprise (671B)
697
+ - **Strengths**: Advanced reasoning, matches o1 on benchmarks
698
+ - **Architecture**: Mixture of Experts (MoE)
699
+
700
+ #### Gemma 3 (4B-27B)
701
+
702
+ - **Context**: 128K tokens (4B+)
703
+ - **Tool calling**: Supported
704
+ - **VRAM**: 4GB (4B) to 24GB (27B)
705
+ - **Strengths**: Multimodal (vision), efficient architecture
706
+
707
+ #### Phi-4 (14B)
708
+
709
+ - **Context**: 16K tokens
710
+ - **Tool calling**: Supported
711
+ - **VRAM**: 12GB
712
+ - **Strengths**: Exceptional reasoning for size, synthetic data training
713
+
714
+ ### Tier 3: ReAct Fallback Required
715
+
716
+ #### CodeLlama (7B-70B)
717
+
718
+ - **Context**: 16K tokens (2K for 70B)
719
+ - **Tool calling**: Via ReAct loop
720
+ - **VRAM**: 4GB (7B) to 40GB (70B)
721
+ - **Best for**: Code-specific tasks
722
+
723
+ #### Llama 2 (7B-70B)
724
+
725
+ - **Context**: 4K tokens
726
+ - **Tool calling**: Via ReAct loop
727
+ - **VRAM**: 4GB (7B) to 40GB (70B)
728
+ - **Note**: Legacy, prefer Llama 3.x
729
+
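+ When only a ReAct fallback is available, tool use has to be emulated in the prompt and parsed back out of plain text. The loop below is a rough sketch of that pattern, not OLLM's actual implementation: the `Action:` / `Action Input:` / `Final Answer:` line format and the `run_tool` dispatcher are illustrative assumptions.
+
+ ```python
+ # Rough ReAct fallback sketch for models without native tool calling.
+ # The prompt contract and run_tool() are illustrative assumptions.
+ import json
+ import urllib.request
+
+ SYSTEM = (
+     "Answer the question. When you need a tool, reply with exactly:\n"
+     "Action: <tool name>\nAction Input: <argument>\n"
+     "Otherwise reply with:\nFinal Answer: <answer>"
+ )
+
+ def generate(prompt: str) -> str:
+     req = urllib.request.Request(
+         "http://localhost:11434/api/generate",
+         data=json.dumps({"model": "codellama:7b", "prompt": prompt,
+                          "system": SYSTEM, "stream": False}).encode(),
+         headers={"Content-Type": "application/json"},
+     )
+     with urllib.request.urlopen(req) as resp:
+         return json.load(resp)["response"]
+
+ def run_tool(name: str, arg: str) -> str:
+     # Placeholder dispatcher; a real agent routes to actual tools here.
+     return f"(result of {name} on {arg!r})"
+
+ def react(question: str, max_steps: int = 5) -> str:
+     transcript = question
+     for _ in range(max_steps):
+         reply = generate(transcript)
+         if "Final Answer:" in reply:
+             return reply.split("Final Answer:", 1)[1].strip()
+         if "Action:" in reply and "Action Input:" in reply:
+             name = reply.split("Action:", 1)[1].splitlines()[0].strip()
+             arg = reply.split("Action Input:", 1)[1].splitlines()[0].strip()
+             transcript += f"\n{reply}\nObservation: {run_tool(name, arg)}\n"
+         else:
+             transcript += f"\n{reply}\n"
+     return "No final answer within the step budget."
+ ```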
730
+ ---
731
+
732
+ ## Quantization Guide
733
+
734
+ ### Recommended Quantization Levels
735
+
736
+ | Level | VRAM Savings | Quality Impact | Recommendation |
737
+ | ---------- | ------------ | -------------- | --------------- |
738
+ | Q8_0 | ~50% | Minimal | Best quality |
739
+ | Q6_K | ~60% | Very slight | High quality |
740
+ | Q5_K_M | ~65% | Slight | Good balance |
741
+ | **Q4_K_M** | ~75% | Minor | **Sweet spot** |
742
+ | Q4_0 | ~75% | Noticeable | Budget option |
743
+ | Q3_K_M | ~80% | Significant | Not recommended |
744
+ | Q2_K | ~85% | Severe | Avoid |
745
+
746
+ ### KV Cache Quantization
747
+
748
+ Enable KV cache quantization to reduce memory usage:
749
+
750
+ ```bash
751
+ # Enable Q8_0 KV cache (recommended)
752
+ export OLLAMA_KV_CACHE_TYPE=q8_0
753
+
754
+ # Enable Q4_0 KV cache (aggressive)
755
+ export OLLAMA_KV_CACHE_TYPE=q4_0
756
+ ```
757
+
758
+ **Note**: KV cache quantization requires Flash Attention enabled.
759
+
760
+ ### Quantization Selection Guide
761
+
762
+ **For 8GB VRAM:**
763
+
764
+ - Use Q4_K_M for 8B models
765
+ - Enables 16K-32K context windows
766
+ - Minimal quality impact
767
+
768
+ **For 12GB VRAM:**
769
+
770
+ - Use Q5_K_M or Q6_K for 8B models
771
+ - Use Q4_K_M for 14B models
772
+ - Better quality, still good context
773
+
774
+ **For 24GB+ VRAM:**
775
+
776
+ - Use Q8_0 for best quality
777
+ - Can run larger models (32B+)
778
+ - Maximum context windows
779
+
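+ The rules of thumb above can be collapsed into a tiny helper. This is only a sketch of the recommendations in this guide for ~8B-class models, not a hard limit enforced anywhere:
+
+ ```python
+ # Sketch: map available VRAM to the quantization level recommended above (8B-class models).
+ def recommended_quant(vram_gb: float) -> str:
+     if vram_gb >= 24:
+         return "Q8_0"    # best quality; larger models also become viable
+     if vram_gb >= 12:
+         return "Q5_K_M"  # or Q6_K for 8B models, Q4_K_M for 14B models
+     return "Q4_K_M"      # the 8GB sweet spot: ~75% savings, minor quality impact
+
+ print(recommended_quant(8))   # Q4_K_M
+ print(recommended_quant(16))  # Q5_K_M
+ ```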
780
+ ---
781
+
782
+ ## Configuration
783
+
784
+ ### Setting Context Length
785
+
786
+ #### Via Modelfile
787
+
788
+ ```dockerfile
789
+ FROM llama3.1:8b
790
+ PARAMETER num_ctx 32768
791
+ ```
792
+
793
+ #### Via API Request
794
+
795
+ ```json
796
+ {
797
+ "model": "llama3.1:8b",
798
+ "options": {
799
+ "num_ctx": 32768
800
+ }
801
+ }
802
+ ```
803
+
804
+ #### Via Environment
805
+
806
+ ```bash
807
+ export OLLAMA_CONTEXT_LENGTH=32768
808
+ ```
809
+
810
+ #### Via OLLM Configuration
811
+
812
+ ```yaml
813
+ # ~/.ollm/config.yaml
814
+ options:
815
+ numCtx: 32768
816
+ ```
817
+
818
+ ### Performance Optimization
819
+
820
+ ```bash
821
+ # Enable Flash Attention
822
+ export OLLAMA_FLASH_ATTENTION=1
823
+
824
+ # Set KV cache quantization
825
+ export OLLAMA_KV_CACHE_TYPE=q8_0
826
+
827
+ # Keep model loaded
828
+ export OLLAMA_KEEP_ALIVE=24h
829
+
830
+ # Set number of GPU layers (for partial offload)
831
+ export OLLAMA_NUM_GPU=35 # Adjust based on VRAM
832
+ ```
833
+
834
+ ---
835
+
836
+ ## Performance Benchmarks
837
+
838
+ ### Inference Speed (tokens/second)
839
+
840
+ | Model | Full VRAM | Partial Offload | CPU Only |
841
+ | ------------ | --------- | --------------- | -------- |
842
+ | Llama 3.1 8B | 40-70 t/s | 8-15 t/s | 3-6 t/s |
843
+ | Qwen3 8B | 35-60 t/s | 7-12 t/s | 2-5 t/s |
844
+ | Mistral 7B | 45-80 t/s | 10-18 t/s | 4-7 t/s |
845
+
846
+ **Key Insight:** Partial VRAM offload causes a 5-20x slowdown. Always aim to fit the model entirely in VRAM.
847
+
848
+ ### Context Processing Speed
849
+
850
+ | Context Length | Processing Time (8B model) |
851
+ | -------------- | -------------------------- |
852
+ | 4K tokens | ~0.5-1 second |
853
+ | 8K tokens | ~1-2 seconds |
854
+ | 16K tokens | ~2-4 seconds |
855
+ | 32K tokens | ~4-8 seconds |
856
+
857
+ **Note:** Times vary based on hardware and quantization level.
858
+
859
+ ---
860
+
861
+ ## Troubleshooting
862
+
863
+ ### Out of Memory Errors
864
+
865
+ **Symptoms:**
866
+
867
+ ```
868
+ Error: Failed to load model: insufficient VRAM
869
+ ```
870
+
871
+ **Solutions:**
872
+
873
+ 1. Use smaller model (8B instead of 70B)
874
+ 2. Use more aggressive quantization (Q4_K_M instead of Q8_0)
875
+ 3. Reduce context window (`num_ctx`)
876
+ 4. Enable KV cache quantization
877
+ 5. Close other GPU applications
878
+
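+ A quick way to confirm whether the loaded model actually fits is to ask the Ollama server what it has resident. The sketch below assumes a recent Ollama build exposing the `/api/ps` endpoint with `size` and `size_vram` fields (the same data `ollama ps` prints); treat the field names as an assumption if your version differs.
+
+ ```python
+ # Sketch: check how much of each loaded model is actually in VRAM.
+ import json
+ import urllib.request
+
+ with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
+     models = json.load(resp).get("models", [])
+
+ for m in models:
+     total = m.get("size", 0)
+     in_vram = m.get("size_vram", 0)
+     spilled = total - in_vram
+     status = "fully in VRAM" if spilled <= 0 else f"{spilled / 1e9:.1f} GB offloaded to CPU"
+     print(f"{m.get('name', m.get('model', '?'))}: {total / 1e9:.1f} GB total, {status}")
+ ```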
879
+ ### Slow Inference
880
+
881
+ **Symptoms:**
882
+
883
+ - Very slow token generation (< 5 tokens/second)
884
+ - High CPU usage
885
+
886
+ **Solutions:**
887
+
888
+ 1. Check if model fits entirely in VRAM
889
+ 2. Increase `OLLAMA_NUM_GPU` if using partial offload
890
+ 3. Enable Flash Attention
891
+ 4. Reduce context window
892
+ 5. Use faster quantization (Q4_K_M)
893
+
894
+ ### Context Window Issues
895
+
896
+ **Symptoms:**
897
+
898
+ ```
899
+ Error: Context length exceeds model maximum
900
+ ```
901
+
902
+ **Solutions:**
903
+
904
+ 1. Check model's native context window
905
+ 2. Set `num_ctx` appropriately
906
+ 3. Use model with larger context window
907
+ 4. Enable context compression in OLLM
908
+
909
+ ---
910
+
911
+ ## Best Practices
912
+
913
+ ### Model Selection
914
+
915
+ 1. **Match VRAM to model size**: Use the selection matrix above
916
+ 2. **Consider context needs**: Larger contexts need more VRAM
917
+ 3. **Test before committing**: Try models before configuring projects
918
+ 4. **Use routing**: Let OLLM select appropriate models automatically
919
+
920
+ ### VRAM Management
921
+
922
+ 1. **Monitor usage**: Use `/context` command to check VRAM
923
+ 2. **Keep models loaded**: Use `/model keep` for frequently-used models
924
+ 3. **Unload when switching**: Use `/model unload` to free VRAM
925
+ 4. **Enable auto-sizing**: Use `/context auto` for automatic context sizing
926
+
927
+ ### Performance Optimization
928
+
929
+ 1. **Enable Flash Attention**: Significant speed improvement
930
+ 2. **Use KV cache quantization**: Reduces memory, minimal quality impact
931
+ 3. **Keep models in VRAM**: Avoid partial offload
932
+ 4. **Optimize context size**: Use only what you need
933
+
934
+ ---
935
+
936
+ ## VRAM Calculation Deep Dive
937
+
938
+ ### Understanding VRAM Usage Components
939
+
940
+ VRAM usage comes from two distinct places:
941
+
942
+ #### 1. The "Parking Fee" (Model Weights)
943
+
944
+ This is the fixed cost just to load the model.
945
+
946
+ - **For an 8B model (Q4 quantization):** You need roughly 5.5 GB just to store the "brain" of the model
947
+ - This never changes, regardless of how much text you type
948
+
949
+ #### 2. The "Running Cost" (KV Cache)
950
+
951
+ This is the memory needed to "remember" the conversation. It grows with every single word (token) you add.
952
+
953
+ - **For short context (4k tokens):** You only need ~0.5 GB extra
954
+ - Total: 5.5 + 0.5 = 6.0 GB (Fits in 8GB card)
955
+ - **For full context (128k tokens):** You need ~10 GB to 16 GB of extra VRAM just for the memory
956
+ - Total: 5.5 + 16 = ~21.5 GB (Requires a 24GB card like an RTX 3090/4090)
957
+
958
+ ### Context Cost by Model Class
959
+
960
+ #### The "Small" Class (7B - 9B)
961
+
962
+ **Models:** Llama 3 8B, Mistral 7B v0.3, Qwen 2.5 7B
963
+
964
+ - **Context Cost:** ~0.13 GB per 1,000 tokens
965
+ - **Notes:** These are extremely efficient. You can easily fit 32k context on a 12GB card (RTX 3060/4070) if you use 4-bit weights
966
+ - **Warning:** Gemma 2 9B is an outlier. It uses a much larger cache (approx 0.34 GB per 1k tokens) due to its architecture (high head dimension). It eats VRAM much faster than Llama 3
967
+
968
+ #### The "Medium" Class (Mixtral 8x7B)
969
+
970
+ **Models:** Mixtral 8x7B, Mixtral 8x22B
971
+
972
+ - **Context Cost:** ~0.13 GB per 1,000 tokens
973
+ - **Notes:** Mixtral is unique. While the weights are huge (26GB+), the context cache is tiny (identical to the small 7B model). This means if you can fit the weights, increasing context to 32k is very "cheap" in terms of extra VRAM
974
+
975
+ #### The "Large" Class (70B+)
976
+
977
+ **Models:** Llama 3 70B, Qwen2 72B
978
+
979
+ - **Context Cost:** ~0.33 GB per 1,000 tokens
980
+ - **32k Context:** Requires ~10.5 GB just for the context
981
+ - **128k Context:** Requires ~42 GB just for the context
982
+ - **Notes:** To run Llama 3 70B at the full 128k context you generally need 80GB+ of VRAM (e.g., 2x A6000 or a Mac Studio Ultra), even with 4-bit weights
983
+
984
+ ### Tips for Saving VRAM
985
+
986
+ #### KV Cache Quantization (FP8)
987
+
988
+ If you use llama.cpp or ExLlamaV2, you can enable "FP8 Cache" (sometimes called Q8 cache). This cuts the context VRAM usage in half with negligible quality loss.
989
+
990
+ **Example:** Llama 3 70B at 32k context drops from ~50.5 GB total to ~45 GB total
991
+
992
+ #### System Overhead
993
+
994
+ Always leave 1-2 GB of VRAM free for your display and OS overhead. If the chart says 23.8 GB and you have a 24 GB card, it will likely crash (OOM).
995
+
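+ Putting the pieces together, the total is simple arithmetic: weights ("parking fee") plus per-token KV cache ("running cost"), halved if the cache is quantized, plus headroom. The sketch below uses the per-1k-token costs quoted above; the 40 GB weight figure for a 4-bit 70B model is an approximation, and real usage varies by runtime and architecture.
+
+ ```python
+ # Back-of-the-envelope VRAM estimator based on the figures in this section.
+ # Real usage varies by runtime, quantization, and model architecture.
+ def estimate_vram_gb(weights_gb: float, context_tokens: int,
+                      kv_gb_per_1k: float = 0.13,
+                      fp8_kv_cache: bool = False,
+                      overhead_gb: float = 1.5) -> float:
+     kv_cache = (context_tokens / 1000) * kv_gb_per_1k
+     if fp8_kv_cache:              # Q8/FP8 KV cache roughly halves the running cost
+         kv_cache /= 2
+     return weights_gb + kv_cache + overhead_gb
+
+ # Llama 3 8B (Q4, ~5.5 GB weights) at 32k context, default FP16 cache:
+ print(round(estimate_vram_gb(5.5, 32_000), 1))                      # ~11.2 GB incl. 1.5 GB headroom
+ # Llama 3 70B (~40 GB Q4 weights, ~0.33 GB per 1k tokens) at 32k, FP8 cache:
+ print(round(estimate_vram_gb(40.0, 32_000, kv_gb_per_1k=0.33,
+                              fp8_kv_cache=True), 1))                # ~46.8 GB
+ ```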
996
+ ### Context Size Reference Table
997
+
998
+ | Context Size | What it represents | Total VRAM Needed (8B Model) | GPU Class |
999
+ | ------------ | -------------------- | ---------------------------- | -------------------------- |
1000
+ | 4k | A long article | ~6.0 GB | 8GB Cards (RTX 3060/4060) |
1001
+ | 16k | A small book chapter | ~7.5 GB | 8GB Cards (Tight fit) |
1002
+ | 32k | A short book | ~9.5 GB | 12GB Cards (RTX 3060/4070) |
1003
+ | 128k | A full novel | ~22.0 GB | 24GB Cards (RTX 3090/4090) |
1004
+
1005
+ ### Practical Examples
1006
+
1007
+ #### Example 1: Running Qwen 2.5 7B on 8GB Card
1008
+
1009
+ - **Model weights:** 4.7 GB
1010
+ - **4k context:** 6.5 GB total ✅ Fits comfortably
1011
+ - **8k context:** 7.2 GB total ✅ Fits with headroom
1012
+ - **16k context:** 8.4 GB total ❌ Will OOM
1013
+ - **Recommendation:** Use 8k context maximum
1014
+
1015
+ #### Example 2: Running DeepSeek Coder V2 16B on 12GB Card
1016
+
1017
+ - **Model weights:** 8.9 GB (MoE with efficient MLA cache)
1018
+ - **4k context:** 11.2 GB total ✅ Fits
1019
+ - **8k context:** 11.5 GB total ✅ Fits
1020
+ - **16k context:** 12.0 GB total ⚠️ Very tight, may OOM
1021
+ - **32k context:** 13.1 GB total ❌ Will OOM
1022
+ - **Recommendation:** Use 8k context for safety
1023
+
1024
+ #### Example 3: Running Qwen 3 Coder 30B on 24GB Card
1025
+
1026
+ - **Model weights:** 18 GB
1027
+ - **4k context:** 20.5 GB total ✅ Fits
1028
+ - **8k context:** 22.0 GB total ✅ Fits
1029
+ - **16k context:** 24.5 GB total ❌ Will OOM
1030
+ - **Recommendation:** Use 8k context maximum, or upgrade to 32GB+ VRAM
1031
+
1032
+ ### GPU Recommendations by Use Case
1033
+
1034
+ #### Budget Setup (8GB VRAM)
1035
+
1036
+ - **Best models:** Llama 3.2 3B, Qwen 3 4B, Gemma 3 4B, DeepSeek R1 1.5B
1037
+ - **Context sweet spot:** 8k-16k
1038
+ - **Use case:** Code completion, quick queries, edge deployment
1039
+
1040
+ #### Mid-Range Setup (12GB VRAM)
1041
+
1042
+ - **Best models:** Qwen 2.5 7B, Qwen 2.5 Coder 7B, Llama 3 8B, Mistral 7B
1043
+ - **Context sweet spot:** 16k-32k
1044
+ - **Use case:** General development, moderate context needs
1045
+
1046
+ #### High-End Setup (24GB VRAM)
1047
+
1048
+ - **Best models:** Codestral 22B, Qwen 3 Coder 30B, DeepSeek Coder V2 16B
1049
+ - **Context sweet spot:** 32k-64k
1050
+ - **Use case:** Professional development, large codebases
1051
+
1052
+ #### Workstation Setup (48GB+ VRAM)
1053
+
1054
+ - **Best models:** Llama 3.3 70B, Qwen 2.5 72B, Large MoE models
1055
+ - **Context sweet spot:** 64k-128k
1056
+ - **Use case:** Enterprise, research, maximum capability
1057
+
1058
+ ---
1059
+
1060
+ ## See Also
1061
+
1062
+ - [Model Commands](Models_commands.md) - Model management commands
1063
+ - [Model Configuration](Models_configuration.md) - Configuration options
1064
+ - [Routing Guide](3%20projects/OLLM%20CLI/LLM%20Models/routing/user-guide.md) - Automatic model selection
1065
+ - [Context Management](3%20projects/OLLM%20CLI/LLM%20Context%20Manager/README.md) - Context and VRAM monitoring
1066
+
1067
+ ---
1068
+
1069
+ **Last Updated:** 2026-01-28
1070
+ **Version:** 0.2.0
1071
+ **Source:** Ollama Documentation, llama.cpp, community benchmarks, OLLM CLI Model Database