@tecet/ollm 0.1.4-b → 0.1.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (66)
  1. package/docs/README.md +3 -410
  2. package/package.json +2 -2
  3. package/docs/Context/CheckpointFlowDiagram.md +0 -673
  4. package/docs/Context/ContextArchitecture.md +0 -898
  5. package/docs/Context/ContextCompression.md +0 -1102
  6. package/docs/Context/ContextManagment.md +0 -750
  7. package/docs/Context/Index.md +0 -209
  8. package/docs/Context/README.md +0 -390
  9. package/docs/DevelopmentRoadmap/Index.md +0 -238
  10. package/docs/DevelopmentRoadmap/OLLM-CLI_Releases.md +0 -419
  11. package/docs/DevelopmentRoadmap/PlanedFeatures.md +0 -448
  12. package/docs/DevelopmentRoadmap/README.md +0 -174
  13. package/docs/DevelopmentRoadmap/Roadmap.md +0 -572
  14. package/docs/DevelopmentRoadmap/RoadmapVisual.md +0 -372
  15. package/docs/Hooks/Architecture.md +0 -885
  16. package/docs/Hooks/Index.md +0 -244
  17. package/docs/Hooks/KeyboardShortcuts.md +0 -248
  18. package/docs/Hooks/Protocol.md +0 -817
  19. package/docs/Hooks/README.md +0 -403
  20. package/docs/Hooks/UserGuide.md +0 -1483
  21. package/docs/Hooks/VisualGuide.md +0 -598
  22. package/docs/Index.md +0 -506
  23. package/docs/Installation.md +0 -586
  24. package/docs/Introduction.md +0 -367
  25. package/docs/LLM Models/Index.md +0 -239
  26. package/docs/LLM Models/LLM_GettingStarted.md +0 -748
  27. package/docs/LLM Models/LLM_Index.md +0 -701
  28. package/docs/LLM Models/LLM_MemorySystem.md +0 -337
  29. package/docs/LLM Models/LLM_ModelCompatibility.md +0 -499
  30. package/docs/LLM Models/LLM_ModelsArchitecture.md +0 -933
  31. package/docs/LLM Models/LLM_ModelsCommands.md +0 -839
  32. package/docs/LLM Models/LLM_ModelsConfiguration.md +0 -1094
  33. package/docs/LLM Models/LLM_ModelsList.md +0 -1071
  34. package/docs/LLM Models/LLM_ModelsList.md.backup +0 -400
  35. package/docs/LLM Models/README.md +0 -355
  36. package/docs/MCP/MCP_Architecture.md +0 -1086
  37. package/docs/MCP/MCP_Commands.md +0 -1111
  38. package/docs/MCP/MCP_GettingStarted.md +0 -590
  39. package/docs/MCP/MCP_Index.md +0 -524
  40. package/docs/MCP/MCP_Integration.md +0 -866
  41. package/docs/MCP/MCP_Marketplace.md +0 -160
  42. package/docs/MCP/README.md +0 -415
  43. package/docs/Prompts System/Architecture.md +0 -760
  44. package/docs/Prompts System/Index.md +0 -223
  45. package/docs/Prompts System/PromptsRouting.md +0 -1047
  46. package/docs/Prompts System/PromptsTemplates.md +0 -1102
  47. package/docs/Prompts System/README.md +0 -389
  48. package/docs/Prompts System/SystemPrompts.md +0 -856
  49. package/docs/Quickstart.md +0 -535
  50. package/docs/Tools/Architecture.md +0 -884
  51. package/docs/Tools/GettingStarted.md +0 -624
  52. package/docs/Tools/Index.md +0 -216
  53. package/docs/Tools/ManifestReference.md +0 -141
  54. package/docs/Tools/README.md +0 -440
  55. package/docs/Tools/UserGuide.md +0 -773
  56. package/docs/Troubleshooting.md +0 -1265
  57. package/docs/UI&Settings/Architecture.md +0 -729
  58. package/docs/UI&Settings/ColorASCII.md +0 -34
  59. package/docs/UI&Settings/Commands.md +0 -755
  60. package/docs/UI&Settings/Configuration.md +0 -872
  61. package/docs/UI&Settings/Index.md +0 -293
  62. package/docs/UI&Settings/Keybinds.md +0 -372
  63. package/docs/UI&Settings/README.md +0 -278
  64. package/docs/UI&Settings/Terminal.md +0 -637
  65. package/docs/UI&Settings/Themes.md +0 -604
  66. package/docs/UI&Settings/UIGuide.md +0 -550
@@ -1,400 +0,0 @@
- # Ollama Models Reference
-
- **Complete reference for Ollama-compatible models**
-
- This document provides comprehensive information about Ollama models, focusing on context management, tool calling capabilities, and VRAM requirements for OLLM CLI.
-
- ---
-
- ## 📋 Table of Contents
-
- 1. [Context Window Fundamentals](#context-window-fundamentals)
- 2. [VRAM Requirements](#vram-requirements)
- 3. [Model Selection Matrix](#model-selection-matrix)
- 4. [Tool Calling Support](#tool-calling-support)
- 5. [Quantization Guide](#quantization-guide)
- 6. [Configuration](#configuration)
- 7. [Performance Benchmarks](#performance-benchmarks)
-
- **See Also:**
-
- - [Model Commands](Models_commands.md)
- - [Model Configuration](Models_configuration.md)
- - [Routing Guide](3%20projects/OLLM%20CLI/LLM%20Models/routing/user-guide.md)
-
- ---
-
- ## Context Window Fundamentals
-
- ### Ollama Default Behavior
-
- **Default context window**: 2,048 tokens (a conservative default)
- **Configurable via**: `num_ctx` parameter in Modelfile or API request
- **Environment variable**: `OLLAMA_CONTEXT_LENGTH`
-
- ### Context Window vs num_ctx
-
- **Context window**: Maximum tokens a model architecture supports
- **num_ctx**: Ollama-specific parameter setting the active context length for inference
- Models often support much larger contexts than Ollama's default, but the larger value must be configured explicitly
-
- **Example:**
-
- ```bash
- # Llama 3.1 supports 128K context, but Ollama defaults to 2K
- # Must configure explicitly:
- export OLLAMA_CONTEXT_LENGTH=32768
- ```
-
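As a quick check before overriding anything, Ollama's own CLI can report a model's native context length, and `num_ctx` can also be raised for a single interactive session. A minimal sketch (the model tag is just an example):

```bash
# Print model details, including the architecture's context length
ollama show llama3.1:8b

# Raise num_ctx for one interactive session only
ollama run llama3.1:8b
>>> /set parameter num_ctx 32768
```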
- ---
-
- ## VRAM Requirements
-
- ### Memory Calculation Formula
-
- ```
- Total VRAM = Model Weights + KV Cache + System Overhead (~0.5-1GB)
- ```
-
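As a worked example of the formula, here is a rough estimate for an 8B model at Q4_K_M with a 16K context; the figures are approximations drawn from the tables in this section, not measurements:

```bash
# Back-of-envelope VRAM estimate: 8B model, Q4_K_M, 16K context (GB, approximate)
weights=4.9     # Q4_K_M weights for an ~8B model
kv_cache=2.2    # FP16 KV cache at 16K tokens (see table below)
overhead=0.7    # runtime/system overhead
echo "$weights + $kv_cache + $overhead" | bc   # ~7.8 GB -> an 8 GB GPU is the practical minimum
```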
- ### Base VRAM by Model Size (Q4_K_M Quantization)
-
- | Model Size | Est. File Size | Recommended VRAM | Example Models |
- | ---------- | -------------- | ----------------- | --------------------------------------- |
- | 3B-4B | 2.0-2.5 GB | 3-4 GB | Llama 3.2 3B, Qwen3 4B |
- | 7B-9B | 4.5-6.0 GB | 6-8 GB | Llama 3.1 8B, Qwen3 8B, Mistral 7B |
- | 12B-14B | 7.5-9.0 GB | 10-12 GB | Gemma 3 12B, Qwen3 14B, Phi-4 14B |
- | 22B-35B | 14-21 GB | 16-24 GB | Gemma 3 27B, Qwen3 32B, DeepSeek R1 32B |
- | 70B-72B | 40-43 GB | 48 GB (or 2×24GB) | Llama 3.3 70B, Qwen2.5 72B |
-
- ### KV Cache Memory Impact
-
- KV cache grows linearly with context length:
-
- **8B model at 32K context**: ~4.5 GB for KV cache alone
- **Q8_0 KV quantization**: Halves KV cache memory (minimal quality impact)
- **Q4_0 KV quantization**: Reduces KV cache to roughly one-third of its FP16 size (noticeable quality reduction)
-
- ### Context Length VRAM Examples (8B Model, Q4_K_M)
-
- | Context Length | KV Cache (FP16) | KV Cache (Q8_0) | Total VRAM |
- | -------------- | --------------- | --------------- | ---------- |
- | 4K tokens | ~0.6 GB | ~0.3 GB | ~6-7 GB |
- | 8K tokens | ~1.1 GB | ~0.55 GB | ~7-8 GB |
- | 16K tokens | ~2.2 GB | ~1.1 GB | ~8-9 GB |
- | 32K tokens | ~4.5 GB | ~2.25 GB | ~10-11 GB |
-
- **Key Insight:** Always aim to fit the entire model in VRAM. Partial offload causes 5-20x slowdown.
-
- ---
-
- ## Model Selection Matrix
-
- ### By VRAM Available
-
- | VRAM | Recommended Models | Context Sweet Spot |
- | ----- | ------------------------------------- | ------------------ |
- | 4GB | Llama 3.2 3B, Qwen3 4B, Phi-3 Mini | 4K-8K |
- | 8GB | Llama 3.1 8B, Qwen3 8B, Mistral 7B | 8K-16K |
- | 12GB | Gemma 3 12B, Qwen3 14B, Phi-4 14B | 16K-32K |
- | 16GB | Qwen3 14B (high ctx), DeepSeek R1 14B | 32K-64K |
- | 24GB | Gemma 3 27B, Qwen3 32B, Mixtral 8x7B | 32K-64K |
- | 48GB+ | Llama 3.3 70B, Qwen2.5 72B | 64K-128K |
-
- ### By Use Case
-
- | Use Case | Recommended Model | Why |
- | -------------- | -------------------- | ----------------------------- |
- | General coding | Qwen2.5-Coder 7B/32B | Best open-source coding model |
- | Tool calling | Llama 3.1 8B | Native support, well-tested |
- | Long context | Qwen3 8B/14B | Up to 256K context |
- | Reasoning | DeepSeek-R1 14B/32B | Matches frontier models |
- | Low VRAM | Llama 3.2 3B | Excellent for 4GB |
- | Multilingual | Qwen3 series | 29+ languages |
-
- ---
-
- ## Tool Calling Support
-
- ### Tier 1: Best Tool Calling Support
-
- #### Llama 3.1/3.3 (8B, 70B)
-
- **Context**: 128K tokens (native)
- **Tool calling**: Native function calling support
- **VRAM**: 8GB (8B), 48GB+ (70B)
- **Strengths**: Best overall, largest ecosystem, excellent instruction following
- **Downloads**: 108M+ (most popular)
-
- #### Qwen3 (4B-235B)
-
- **Context**: Up to 256K tokens
- **Tool calling**: Native support with hybrid reasoning modes
- **VRAM**: 4GB (4B) to 48GB+ (32B)
- **Strengths**: Multilingual, thinking/non-thinking modes, Apache 2.0 license
- **Special**: Can set "thinking budget" for reasoning depth
-
- #### Mistral 7B / Mixtral 8x7B
-
- **Context**: 32K tokens (Mistral), 64K tokens (Mixtral)
- **Tool calling**: Native function calling
- **VRAM**: 7GB (7B), 26GB (8x7B MoE - only 13B active)
- **Strengths**: Efficient, excellent European languages
-
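To make "native function calling" concrete, the sketch below sends a tool definition to Ollama's `/api/chat` endpoint; Tier 1 models answer with a structured `message.tool_calls` entry instead of free text. The `get_weather` tool is a made-up example and the model tag is illustrative:

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{ "role": "user", "content": "What is the weather in Paris?" }],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": { "city": { "type": "string" } },
        "required": ["city"]
      }
    }
  }],
  "stream": false
}'
# The response carries message.tool_calls with the function name and arguments to execute
```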
- ### Tier 2: Good Tool Calling Support
-
- #### DeepSeek-R1 (1.5B-671B)
-
- **Context**: 128K tokens
- **Tool calling**: Supported via reasoning
- **VRAM**: 4GB (1.5B) to enterprise (671B)
- **Strengths**: Advanced reasoning, matches o1 on benchmarks
- **Architecture**: Mixture of Experts (MoE)
-
- #### Gemma 3 (4B-27B)
-
- **Context**: 128K tokens (4B+)
- **Tool calling**: Supported
- **VRAM**: 4GB (4B) to 24GB (27B)
- **Strengths**: Multimodal (vision), efficient architecture
-
- #### Phi-4 (14B)
-
- **Context**: 16K tokens
- **Tool calling**: Supported
- **VRAM**: 12GB
- **Strengths**: Exceptional reasoning for size, synthetic data training
-
- ### Tier 3: ReAct Fallback Required
-
- #### CodeLlama (7B-70B)
-
- **Context**: 16K tokens (2K for 70B)
- **Tool calling**: Via ReAct loop
- **VRAM**: 4GB (7B) to 40GB (70B)
- **Best for**: Code-specific tasks
-
- #### Llama 2 (7B-70B)
-
- **Context**: 4K tokens
- **Tool calling**: Via ReAct loop
- **VRAM**: 4GB (7B) to 40GB (70B)
- **Note**: Legacy, prefer Llama 3.x
-
- ---
-
- ## Quantization Guide
-
- ### Recommended Quantization Levels
-
- | Level | VRAM Savings | Quality Impact | Recommendation |
- | ---------- | ------------ | -------------- | --------------- |
- | Q8_0 | ~50% | Minimal | Best quality |
- | Q6_K | ~60% | Very slight | High quality |
- | Q5_K_M | ~65% | Slight | Good balance |
- | **Q4_K_M** | ~75% | Minor | **Sweet spot** |
- | Q4_0 | ~75% | Noticeable | Budget option |
- | Q3_K_M | ~80% | Significant | Not recommended |
- | Q2_K | ~85% | Severe | Avoid |
-
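Most entries in the Ollama library publish several quantizations of the same model as separate tags, so a specific level from the table above can be pulled directly. Exact tag names vary per model (check the model's Tags page on ollama.com); the ones below are illustrative:

```bash
# Default tag (typically a Q4-class quantization)
ollama pull llama3.1:8b

# Explicit quantization tags (names vary per model)
ollama pull llama3.1:8b-instruct-q4_K_M   # sweet spot: ~75% smaller than FP16
ollama pull llama3.1:8b-instruct-q8_0     # near-lossless, roughly twice the VRAM of Q4_K_M
```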
- ### KV Cache Quantization
-
- Enable KV cache quantization to reduce memory usage:
-
- ```bash
- # Enable Q8_0 KV cache (recommended)
- export OLLAMA_KV_CACHE_TYPE=q8_0
-
- # Enable Q4_0 KV cache (aggressive)
- export OLLAMA_KV_CACHE_TYPE=q4_0
- ```
-
- **Note**: KV cache quantization requires Flash Attention enabled.
-
- ### Quantization Selection Guide
-
- **For 8GB VRAM:**
-
- Use Q4_K_M for 8B models
- Enables 16K-32K context windows
- Minimal quality impact
-
- **For 12GB VRAM:**
-
- Use Q5_K_M or Q6_K for 8B models
- Use Q4_K_M for 14B models
- Better quality, still good context
-
- **For 24GB+ VRAM:**
-
- Use Q8_0 for best quality
- Can run larger models (32B+)
- Maximum context windows
-
- ---
-
- ## Configuration
-
- ### Setting Context Length
-
- #### Via Modelfile
-
- ```dockerfile
- FROM llama3.1:8b
- PARAMETER num_ctx 32768
- ```
-
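A Modelfile on its own has no effect until it is built into a named model with `ollama create`; the model name below is arbitrary:

```bash
# Build and run a model that defaults to a 32K context
ollama create llama3.1-32k -f ./Modelfile
ollama run llama3.1-32k
```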
- #### Via API Request
-
- ```json
- {
-   "model": "llama3.1:8b",
-   "options": {
-     "num_ctx": 32768
-   }
- }
- ```
-
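For reference, the same `options` block as part of a complete request to Ollama's `/api/generate` endpoint (the prompt is a placeholder):

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize the design in one paragraph.",
  "options": { "num_ctx": 32768 },
  "stream": false
}'
```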
- #### Via Environment
-
- ```bash
- export OLLAMA_CONTEXT_LENGTH=32768
- ```
-
- #### Via OLLM Configuration
-
- ```yaml
- # ~/.ollm/config.yaml
- options:
-   numCtx: 32768
- ```
-
- ### Performance Optimization
-
- ```bash
- # Enable Flash Attention
- export OLLAMA_FLASH_ATTENTION=1
-
- # Set KV cache quantization
- export OLLAMA_KV_CACHE_TYPE=q8_0
-
- # Keep model loaded
- export OLLAMA_KEEP_ALIVE=24h
-
- # Set number of GPU layers (for partial offload)
- export OLLAMA_NUM_GPU=35 # Adjust based on VRAM
- ```
-
- ---
-
- ## Performance Benchmarks
-
- ### Inference Speed (tokens/second)
-
- | Model | Full VRAM | Partial Offload | CPU Only |
- | ------------ | --------- | --------------- | -------- |
- | Llama 3.1 8B | 40-70 t/s | 8-15 t/s | 3-6 t/s |
- | Qwen3 8B | 35-60 t/s | 7-12 t/s | 2-5 t/s |
- | Mistral 7B | 45-80 t/s | 10-18 t/s | 4-7 t/s |
-
- **Key Insight:** Partial VRAM offload causes 5-20x slowdown. Always aim to fit the model entirely in VRAM.
-
- ### Context Processing Speed
-
- | Context Length | Processing Time (8B model) |
- | -------------- | -------------------------- |
- | 4K tokens | ~0.5-1 second |
- | 8K tokens | ~1-2 seconds |
- | 16K tokens | ~2-4 seconds |
- | 32K tokens | ~4-8 seconds |
-
- **Note:** Times vary based on hardware and quantization level.
-
- ---
-
- ## Troubleshooting
-
- ### Out of Memory Errors
-
- **Symptoms:**
-
- ```
- Error: Failed to load model: insufficient VRAM
- ```
-
- **Solutions:**
-
- 1. Use smaller model (8B instead of 70B)
- 2. Use more aggressive quantization (Q4_K_M instead of Q8_0)
- 3. Reduce context window (`num_ctx`)
- 4. Enable KV cache quantization
- 5. Close other GPU applications
-
- ### Slow Inference
-
- **Symptoms:**
-
- Very slow token generation (< 5 tokens/second)
- High CPU usage
-
- **Solutions:**
-
- 1. Check if model fits entirely in VRAM
- 2. Increase `OLLAMA_NUM_GPU` if using partial offload
- 3. Enable Flash Attention
- 4. Reduce context window
- 5. Use faster quantization (Q4_K_M)
-
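A quick way to confirm the most common cause, a model that does not fully fit in VRAM, is to check how Ollama has placed it (the second command assumes an NVIDIA GPU):

```bash
# PROCESSOR column: "100% GPU" is healthy; a split such as "41%/59% CPU/GPU"
# means partial offload and explains the slow generation
ollama ps

# Cross-check actual VRAM usage
nvidia-smi
```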
- ### Context Window Issues
-
- **Symptoms:**
-
- ```
- Error: Context length exceeds model maximum
- ```
-
- **Solutions:**
-
- 1. Check model's native context window
- 2. Set `num_ctx` appropriately
- 3. Use model with larger context window
- 4. Enable context compression in OLLM
-
- ---
-
- ## Best Practices
-
- ### Model Selection
-
- 1. **Match VRAM to model size**: Use the selection matrix above
- 2. **Consider context needs**: Larger contexts need more VRAM
- 3. **Test before committing**: Try models before configuring projects
- 4. **Use routing**: Let OLLM select appropriate models automatically
-
- ### VRAM Management
-
- 1. **Monitor usage**: Use `/context` command to check VRAM
- 2. **Keep models loaded**: Use `/model keep` for frequently-used models
- 3. **Unload when switching**: Use `/model unload` to free VRAM
- 4. **Enable auto-sizing**: Use `/context auto` for automatic context sizing
-
- ### Performance Optimization
-
- 1. **Enable Flash Attention**: Significant speed improvement
- 2. **Use KV cache quantization**: Reduces memory, minimal quality impact
- 3. **Keep models in VRAM**: Avoid partial offload
- 4. **Optimize context size**: Use only what you need
-
- ---
-
- ## See Also
-
- [Model Commands](Models_commands.md) - Model management commands
- [Model Configuration](Models_configuration.md) - Configuration options
- [Routing Guide](3%20projects/OLLM%20CLI/LLM%20Models/routing/user-guide.md) - Automatic model selection
- [Context Management](3%20projects/OLLM%20CLI/LLM%20Context%20Manager/README.md) - Context and VRAM monitoring
-
- ---
-
- **Last Updated:** 2026-01-16
- **Version:** 0.1.0
- **Source:** Ollama Documentation, llama.cpp, community benchmarks