@tecet/ollm 0.1.4 → 0.1.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (86)
  1. package/dist/cli.js +20 -14
  2. package/dist/cli.js.map +3 -3
  3. package/dist/services/documentService.d.ts.map +1 -1
  4. package/dist/services/documentService.js +12 -2
  5. package/dist/services/documentService.js.map +1 -1
  6. package/dist/ui/components/docs/DocsPanel.d.ts.map +1 -1
  7. package/dist/ui/components/docs/DocsPanel.js +1 -1
  8. package/dist/ui/components/docs/DocsPanel.js.map +1 -1
  9. package/dist/ui/components/launch/VersionBanner.js +1 -1
  10. package/dist/ui/components/launch/VersionBanner.js.map +1 -1
  11. package/dist/ui/components/layout/KeybindsLegend.d.ts.map +1 -1
  12. package/dist/ui/components/layout/KeybindsLegend.js +1 -1
  13. package/dist/ui/components/layout/KeybindsLegend.js.map +1 -1
  14. package/dist/ui/components/tabs/BugReportTab.js +1 -1
  15. package/dist/ui/components/tabs/BugReportTab.js.map +1 -1
  16. package/dist/ui/services/docsService.d.ts +12 -27
  17. package/dist/ui/services/docsService.d.ts.map +1 -1
  18. package/dist/ui/services/docsService.js +40 -67
  19. package/dist/ui/services/docsService.js.map +1 -1
  20. package/docs/README.md +3 -410
  21. package/package.json +10 -7
  22. package/scripts/copy-docs-to-user.cjs +34 -0
  23. package/docs/Context/CheckpointFlowDiagram.md +0 -673
  24. package/docs/Context/ContextArchitecture.md +0 -898
  25. package/docs/Context/ContextCompression.md +0 -1102
  26. package/docs/Context/ContextManagment.md +0 -750
  27. package/docs/Context/Index.md +0 -209
  28. package/docs/Context/README.md +0 -390
  29. package/docs/DevelopmentRoadmap/Index.md +0 -238
  30. package/docs/DevelopmentRoadmap/OLLM-CLI_Releases.md +0 -419
  31. package/docs/DevelopmentRoadmap/PlanedFeatures.md +0 -448
  32. package/docs/DevelopmentRoadmap/README.md +0 -174
  33. package/docs/DevelopmentRoadmap/Roadmap.md +0 -572
  34. package/docs/DevelopmentRoadmap/RoadmapVisual.md +0 -372
  35. package/docs/Hooks/Architecture.md +0 -885
  36. package/docs/Hooks/Index.md +0 -244
  37. package/docs/Hooks/KeyboardShortcuts.md +0 -248
  38. package/docs/Hooks/Protocol.md +0 -817
  39. package/docs/Hooks/README.md +0 -403
  40. package/docs/Hooks/UserGuide.md +0 -1483
  41. package/docs/Hooks/VisualGuide.md +0 -598
  42. package/docs/Index.md +0 -506
  43. package/docs/Installation.md +0 -586
  44. package/docs/Introduction.md +0 -367
  45. package/docs/LLM Models/Index.md +0 -239
  46. package/docs/LLM Models/LLM_GettingStarted.md +0 -748
  47. package/docs/LLM Models/LLM_Index.md +0 -701
  48. package/docs/LLM Models/LLM_MemorySystem.md +0 -337
  49. package/docs/LLM Models/LLM_ModelCompatibility.md +0 -499
  50. package/docs/LLM Models/LLM_ModelsArchitecture.md +0 -933
  51. package/docs/LLM Models/LLM_ModelsCommands.md +0 -839
  52. package/docs/LLM Models/LLM_ModelsConfiguration.md +0 -1094
  53. package/docs/LLM Models/LLM_ModelsList.md +0 -1071
  54. package/docs/LLM Models/LLM_ModelsList.md.backup +0 -400
  55. package/docs/LLM Models/README.md +0 -355
  56. package/docs/MCP/MCP_Architecture.md +0 -1086
  57. package/docs/MCP/MCP_Commands.md +0 -1111
  58. package/docs/MCP/MCP_GettingStarted.md +0 -590
  59. package/docs/MCP/MCP_Index.md +0 -524
  60. package/docs/MCP/MCP_Integration.md +0 -866
  61. package/docs/MCP/MCP_Marketplace.md +0 -160
  62. package/docs/MCP/README.md +0 -415
  63. package/docs/Prompts System/Architecture.md +0 -760
  64. package/docs/Prompts System/Index.md +0 -223
  65. package/docs/Prompts System/PromptsRouting.md +0 -1047
  66. package/docs/Prompts System/PromptsTemplates.md +0 -1102
  67. package/docs/Prompts System/README.md +0 -389
  68. package/docs/Prompts System/SystemPrompts.md +0 -856
  69. package/docs/Quickstart.md +0 -535
  70. package/docs/Tools/Architecture.md +0 -884
  71. package/docs/Tools/GettingStarted.md +0 -624
  72. package/docs/Tools/Index.md +0 -216
  73. package/docs/Tools/ManifestReference.md +0 -141
  74. package/docs/Tools/README.md +0 -440
  75. package/docs/Tools/UserGuide.md +0 -773
  76. package/docs/Troubleshooting.md +0 -1265
  77. package/docs/UI&Settings/Architecture.md +0 -729
  78. package/docs/UI&Settings/ColorASCII.md +0 -34
  79. package/docs/UI&Settings/Commands.md +0 -755
  80. package/docs/UI&Settings/Configuration.md +0 -872
  81. package/docs/UI&Settings/Index.md +0 -293
  82. package/docs/UI&Settings/Keybinds.md +0 -372
  83. package/docs/UI&Settings/README.md +0 -278
  84. package/docs/UI&Settings/Terminal.md +0 -637
  85. package/docs/UI&Settings/Themes.md +0 -604
  86. package/docs/UI&Settings/UIGuide.md +0 -550
package/docs/LLM Models/LLM_ModelsList.md.backup
@@ -1,400 +0,0 @@
- # Ollama Models Reference
-
- **Complete reference for Ollama-compatible models**
-
- This document provides comprehensive information about Ollama models, focusing on context management, tool calling capabilities, and VRAM requirements for OLLM CLI.
-
- ---
-
- ## 📋 Table of Contents
-
- 1. [Context Window Fundamentals](#context-window-fundamentals)
- 2. [VRAM Requirements](#vram-requirements)
- 3. [Model Selection Matrix](#model-selection-matrix)
- 4. [Tool Calling Support](#tool-calling-support)
- 5. [Quantization Guide](#quantization-guide)
- 6. [Configuration](#configuration)
- 7. [Performance Benchmarks](#performance-benchmarks)
-
- **See Also:**
-
- [Model Commands](Models_commands.md)
- [Model Configuration](Models_configuration.md)
- [Routing Guide](3%20projects/OLLM%20CLI/LLM%20Models/routing/user-guide.md)
-
- ---
-
- ## Context Window Fundamentals
-
- ### Ollama Default Behavior
-
- **Default context window**: 2,048 tokens (a conservative setting)
- **Configurable via**: `num_ctx` parameter in a Modelfile or API request
- **Environment variable**: `OLLAMA_CONTEXT_LENGTH`
-
- ### Context Window vs num_ctx
-
- **Context window**: Maximum tokens the model architecture supports
- **num_ctx**: Ollama-specific parameter that sets the active context length for inference
- Models often support a larger context than Ollama's default; the larger window must be configured explicitly
-
- **Example:**
-
- ```bash
- # Llama 3.1 supports 128K context, but Ollama defaults to 2K
- # Must configure explicitly:
- export OLLAMA_CONTEXT_LENGTH=32768
- ```
-
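Before raising `num_ctx`, it helps to confirm what the model itself advertises. A minimal check using Ollama's model inspection command (exact output fields vary between Ollama versions):

```bash
# Print model metadata; the reported "context length" is the architecture's
# native window, independent of the num_ctx value Ollama will actually use.
ollama show llama3.1:8b

# Narrow the output down to the context-related lines
ollama show llama3.1:8b | grep -i context
```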
- ---
-
- ## VRAM Requirements
-
- ### Memory Calculation Formula
-
- ```
- Total VRAM = Model Weights + KV Cache + System Overhead (~0.5-1GB)
- ```
-
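As a rough worked example of the formula above, the sketch below estimates the FP16 KV cache for Llama 3.1 8B at a 32K context, using the commonly published model shape (32 layers, 8 KV heads, head dimension 128); treat the result as an estimate, not a measurement:

```bash
# Assumed model shape for Llama 3.1 8B; 2 bytes per value for an FP16 cache
layers=32; kv_heads=8; head_dim=128; bytes_per_val=2; ctx=32768

# Factor of 2 covers keys plus values at every layer, head, and position
kv_bytes=$(( 2 * layers * kv_heads * head_dim * bytes_per_val * ctx ))
echo "KV cache at 32K: $(( kv_bytes / 1024 / 1024 / 1024 )) GB"   # ~4 GB

# Add ~4.9 GB of Q4_K_M weights and ~0.5-1 GB overhead: roughly 10 GB total,
# in line with the context-length table below.
```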
- ### Base VRAM by Model Size (Q4_K_M Quantization)
-
- | Model Size | Est. File Size | Recommended VRAM | Example Models |
- | ---------- | -------------- | ----------------- | --------------------------------------- |
- | 3B-4B | 2.0-2.5 GB | 3-4 GB | Llama 3.2 3B, Qwen3 4B |
- | 7B-9B | 4.5-6.0 GB | 6-8 GB | Llama 3.1 8B, Qwen3 8B, Mistral 7B |
- | 12B-14B | 7.5-9.0 GB | 10-12 GB | Gemma 3 12B, Qwen3 14B, Phi-4 14B |
- | 22B-35B | 14-21 GB | 16-24 GB | Gemma 3 27B, Qwen3 32B, DeepSeek R1 32B |
- | 70B-72B | 40-43 GB | 48 GB (or 2×24GB) | Llama 3.3 70B, Qwen2.5 72B |
-
- ### KV Cache Memory Impact
-
- KV cache grows linearly with context length:
-
- **8B model at 32K context**: ~4.5 GB for KV cache alone
- **Q8_0 KV quantization**: Halves KV cache memory (minimal quality impact)
- **Q4_0 KV quantization**: Reduces KV cache memory to about one-third (noticeable quality reduction)
-
- ### Context Length VRAM Examples (8B Model, Q4_K_M)
-
- | Context Length | KV Cache (FP16) | KV Cache (Q8_0) | Total VRAM |
- | -------------- | --------------- | --------------- | ---------- |
- | 4K tokens | ~0.6 GB | ~0.3 GB | ~6-7 GB |
- | 8K tokens | ~1.1 GB | ~0.55 GB | ~7-8 GB |
- | 16K tokens | ~2.2 GB | ~1.1 GB | ~8-9 GB |
- | 32K tokens | ~4.5 GB | ~2.25 GB | ~10-11 GB |
-
- **Key Insight:** Always aim to fit the entire model in VRAM. Partial offload causes 5-20x slowdown.
-
- ---
-
- ## Model Selection Matrix
-
- ### By VRAM Available
-
- | VRAM | Recommended Models | Context Sweet Spot |
- | ----- | ------------------------------------- | ------------------ |
- | 4GB | Llama 3.2 3B, Qwen3 4B, Phi-3 Mini | 4K-8K |
- | 8GB | Llama 3.1 8B, Qwen3 8B, Mistral 7B | 8K-16K |
- | 12GB | Gemma 3 12B, Qwen3 14B, Phi-4 14B | 16K-32K |
- | 16GB | Qwen3 14B (high ctx), DeepSeek R1 14B | 32K-64K |
- | 24GB | Gemma 3 27B, Qwen3 32B, Mixtral 8x7B | 32K-64K |
- | 48GB+ | Llama 3.3 70B, Qwen2.5 72B | 64K-128K |
-
- ### By Use Case
-
- | Use Case | Recommended Model | Why |
- | -------------- | -------------------- | ----------------------------- |
- | General coding | Qwen2.5-Coder 7B/32B | Best open-source coding model |
- | Tool calling | Llama 3.1 8B | Native support, well-tested |
- | Long context | Qwen3 8B/14B | Up to 256K context |
- | Reasoning | DeepSeek-R1 14B/32B | Matches frontier models |
- | Low VRAM | Llama 3.2 3B | Excellent for 4GB |
- | Multilingual | Qwen3 series | 29+ languages |
-
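Pulling one of the recommended models is a single command. The tags below are the default quantized builds in the Ollama library at the time of writing; verify them on the model's library page if a tag has changed:

```bash
# ~8 GB VRAM: general assistant with native tool calling
ollama pull llama3.1:8b

# ~8 GB VRAM: coding-focused alternative
ollama pull qwen2.5-coder:7b

# ~4 GB VRAM: small model for constrained GPUs
ollama pull llama3.2:3b
```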
- ---
-
- ## Tool Calling Support
-
- ### Tier 1: Best Tool Calling Support
-
- #### Llama 3.1/3.3 (8B, 70B)
-
- **Context**: 128K tokens (native)
- **Tool calling**: Native function calling support
- **VRAM**: 8GB (8B), 48GB+ (70B)
- **Strengths**: Best overall, largest ecosystem, excellent instruction following
- **Downloads**: 108M+ (most popular)
-
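With a Tier 1 model loaded, tools are passed directly in the chat request. A hedged sketch against the local Ollama API (the `get_weather` function is a made-up example tool, and the payload shape follows Ollama's OpenAI-style tool schema):

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [
    { "role": "user", "content": "What is the weather in Paris right now?" }
  ],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": { "city": { "type": "string" } },
        "required": ["city"]
      }
    }
  }],
  "stream": false
}'
```

Models with native support return a structured `tool_calls` field instead of prose; Tier 3 models fall back to the ReAct loop described below.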
- #### Qwen3 (4B-235B)
-
- **Context**: Up to 256K tokens
- **Tool calling**: Native support with hybrid reasoning modes
- **VRAM**: 4GB (4B) to 48GB+ (32B)
- **Strengths**: Multilingual, thinking/non-thinking modes, Apache 2.0 license
- **Special**: Can set "thinking budget" for reasoning depth
-
- #### Mistral 7B / Mixtral 8x7B
-
- **Context**: 32K tokens (Mistral), 64K tokens (Mixtral)
- **Tool calling**: Native function calling
- **VRAM**: 7GB (7B), 26GB (8x7B MoE; only ~13B parameters active)
- **Strengths**: Efficient, excellent European languages
-
- ### Tier 2: Good Tool Calling Support
-
- #### DeepSeek-R1 (1.5B-671B)
-
- **Context**: 128K tokens
- **Tool calling**: Supported via reasoning
- **VRAM**: 4GB (1.5B) to enterprise (671B)
- **Strengths**: Advanced reasoning, matches o1 on benchmarks
- **Architecture**: Mixture of Experts (MoE)
-
- #### Gemma 3 (4B-27B)
-
- **Context**: 128K tokens (4B+)
- **Tool calling**: Supported
- **VRAM**: 4GB (4B) to 24GB (27B)
- **Strengths**: Multimodal (vision), efficient architecture
-
- #### Phi-4 (14B)
-
- **Context**: 16K tokens
- **Tool calling**: Supported
- **VRAM**: 12GB
- **Strengths**: Exceptional reasoning for size, synthetic data training
-
- ### Tier 3: ReAct Fallback Required
-
- #### CodeLlama (7B-70B)
-
- **Context**: 16K tokens (2K for 70B)
- **Tool calling**: Via ReAct loop
- **VRAM**: 4GB (7B) to 40GB (70B)
- **Best for**: Code-specific tasks
-
- #### Llama 2 (7B-70B)
-
- **Context**: 4K tokens
- **Tool calling**: Via ReAct loop
- **VRAM**: 4GB (7B) to 40GB (70B)
- **Note**: Legacy, prefer Llama 3.x
-
- ---
-
- ## Quantization Guide
-
- ### Recommended Quantization Levels
-
- | Level | VRAM Savings | Quality Impact | Recommendation |
- | ---------- | ------------ | -------------- | --------------- |
- | Q8_0 | ~50% | Minimal | Best quality |
- | Q6_K | ~60% | Very slight | High quality |
- | Q5_K_M | ~65% | Slight | Good balance |
- | **Q4_K_M** | ~75% | Minor | **Sweet spot** |
- | Q4_0 | ~75% | Noticeable | Budget option |
- | Q3_K_M | ~80% | Significant | Not recommended |
- | Q2_K | ~85% | Severe | Avoid |
-
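A specific quantization level is selected through the model tag when pulling. The exact tag strings below follow the common Ollama library naming pattern and differ per model, so verify them on the model's library page:

```bash
# Default tag (usually a Q4_K_M build for recent models)
ollama pull llama3.1:8b

# Explicit quantization variants, where the library publishes them
ollama pull llama3.1:8b-instruct-q8_0
ollama pull llama3.1:8b-instruct-q4_K_M
```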
- ### KV Cache Quantization
-
- Enable KV cache quantization to reduce memory usage:
-
- ```bash
- # Enable Q8_0 KV cache (recommended)
- export OLLAMA_KV_CACHE_TYPE=q8_0
-
- # Enable Q4_0 KV cache (aggressive)
- export OLLAMA_KV_CACHE_TYPE=q4_0
- ```
-
- **Note**: KV cache quantization requires Flash Attention to be enabled.
-
- ### Quantization Selection Guide
-
- **For 8GB VRAM:**
-
- Use Q4_K_M for 8B models
- Enables 16K-32K context windows
- Minimal quality impact
-
- **For 12GB VRAM:**
-
- Use Q5_K_M or Q6_K for 8B models
- Use Q4_K_M for 14B models
- Better quality, still good context
-
- **For 24GB+ VRAM:**
-
- Use Q8_0 for best quality
- Can run larger models (32B+)
- Maximum context windows
-
- ---
-
- ## Configuration
-
- ### Setting Context Length
-
- #### Via Modelfile
-
- ```dockerfile
- FROM llama3.1:8b
- PARAMETER num_ctx 32768
- ```
-
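The Modelfile only takes effect once it is built into a named local model; the name `llama31-32k` below is arbitrary:

```bash
# Build a local model from the Modelfile above, then run it
ollama create llama31-32k -f Modelfile
ollama run llama31-32k
```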
- #### Via API Request
-
- ```json
- {
-   "model": "llama3.1:8b",
-   "options": {
-     "num_ctx": 32768
-   }
- }
- ```
-
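The same request body can be sent to the local server with curl; `/api/generate` is shown here, and the identical `options` block also works with `/api/chat`:

```bash
# Default Ollama endpoint is localhost:11434
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize what num_ctx controls in one sentence.",
  "options": { "num_ctx": 32768 },
  "stream": false
}'
```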
- #### Via Environment
-
- ```bash
- export OLLAMA_CONTEXT_LENGTH=32768
- ```
-
- #### Via OLLM Configuration
-
- ```yaml
- # ~/.ollm/config.yaml
- options:
-   numCtx: 32768
- ```
-
- ### Performance Optimization
-
- ```bash
- # Enable Flash Attention
- export OLLAMA_FLASH_ATTENTION=1
-
- # Set KV cache quantization
- export OLLAMA_KV_CACHE_TYPE=q8_0
-
- # Keep model loaded
- export OLLAMA_KEEP_ALIVE=24h
-
- # Set number of GPU layers (for partial offload)
- export OLLAMA_NUM_GPU=35 # Adjust based on VRAM
- ```
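On Linux installs where Ollama runs as a systemd service, variables exported in a shell never reach the server process. A sketch of setting them on the service instead, assuming the service is named `ollama` as created by the official install script:

```bash
# Open an override file for the service
sudo systemctl edit ollama

# Add the variables under a [Service] section, for example:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"
#   Environment="OLLAMA_KV_CACHE_TYPE=q8_0"

# Apply the change
sudo systemctl restart ollama
```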
-
- ---
-
- ## Performance Benchmarks
-
- ### Inference Speed (tokens/second)
-
- | Model | Full VRAM | Partial Offload | CPU Only |
- | ------------ | --------- | --------------- | -------- |
- | Llama 3.1 8B | 40-70 t/s | 8-15 t/s | 3-6 t/s |
- | Qwen3 8B | 35-60 t/s | 7-12 t/s | 2-5 t/s |
- | Mistral 7B | 45-80 t/s | 10-18 t/s | 4-7 t/s |
-
- **Key Insight:** Partial VRAM offload causes 5-20x slowdown. Always aim to fit the model entirely in VRAM.
-
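To see where a particular machine falls in these ranges, the CLI's verbose timing output is the quickest check (flag and output format may differ slightly across Ollama versions):

```bash
# Prints load time, prompt eval rate, and eval rate (tokens/s) after the reply
ollama run llama3.1:8b --verbose "Explain what a KV cache is in two sentences."
```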
- ### Context Processing Speed
-
- | Context Length | Processing Time (8B model) |
- | -------------- | -------------------------- |
- | 4K tokens | ~0.5-1 second |
- | 8K tokens | ~1-2 seconds |
- | 16K tokens | ~2-4 seconds |
- | 32K tokens | ~4-8 seconds |
-
- **Note:** Times vary based on hardware and quantization level.
-
- ---
-
- ## Troubleshooting
-
- ### Out of Memory Errors
-
- **Symptoms:**
-
- ```
- Error: Failed to load model: insufficient VRAM
- ```
-
- **Solutions:**
-
- 1. Use a smaller model (8B instead of 70B)
- 2. Use more aggressive quantization (Q4_K_M instead of Q8_0)
- 3. Reduce the context window (`num_ctx`)
- 4. Enable KV cache quantization
- 5. Close other GPU applications
-
- ### Slow Inference
-
- **Symptoms:**
-
- Very slow token generation (< 5 tokens/second)
- High CPU usage
-
- **Solutions:**
-
- 1. Check whether the model fits entirely in VRAM (see the check below)
- 2. Increase `OLLAMA_NUM_GPU` if using partial offload
- 3. Enable Flash Attention
- 4. Reduce the context window
- 5. Use faster quantization (Q4_K_M)
-
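The split between GPU and CPU is easy to confirm before tuning anything else; a quick check (the `nvidia-smi` line applies to NVIDIA GPUs only):

```bash
# The PROCESSOR column shows how the loaded model is placed;
# anything other than "100% GPU" means partial offload.
ollama ps

# Confirm how much VRAM is actually free on NVIDIA hardware
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```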
- ### Context Window Issues
-
- **Symptoms:**
-
- ```
- Error: Context length exceeds model maximum
- ```
-
- **Solutions:**
-
- 1. Check the model's native context window
- 2. Set `num_ctx` appropriately
- 3. Use a model with a larger context window
- 4. Enable context compression in OLLM
-
- ---
-
- ## Best Practices
-
- ### Model Selection
-
- 1. **Match VRAM to model size**: Use the selection matrix above
- 2. **Consider context needs**: Larger contexts need more VRAM
- 3. **Test before committing**: Try models before configuring projects
- 4. **Use routing**: Let OLLM select appropriate models automatically
-
- ### VRAM Management
-
- 1. **Monitor usage**: Use the `/context` command to check VRAM
- 2. **Keep models loaded**: Use `/model keep` for frequently used models
- 3. **Unload when switching**: Use `/model unload` to free VRAM
- 4. **Enable auto-sizing**: Use `/context auto` for automatic context sizing
-
- ### Performance Optimization
-
- 1. **Enable Flash Attention**: Significant speed improvement
- 2. **Use KV cache quantization**: Reduces memory, minimal quality impact
- 3. **Keep models in VRAM**: Avoid partial offload
- 4. **Optimize context size**: Use only what you need
-
- ---
-
- ## See Also
-
- [Model Commands](Models_commands.md) - Model management commands
- [Model Configuration](Models_configuration.md) - Configuration options
- [Routing Guide](3%20projects/OLLM%20CLI/LLM%20Models/routing/user-guide.md) - Automatic model selection
- [Context Management](3%20projects/OLLM%20CLI/LLM%20Context%20Manager/README.md) - Context and VRAM monitoring
-
- ---
-
- **Last Updated:** 2026-01-16
- **Version:** 0.1.0
- **Source:** Ollama Documentation, llama.cpp, community benchmarks