gallama 0.0.1.post3__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (47)
  1. gallama-0.0.1.post3/PKG-INFO +525 -0
  2. gallama-0.0.1.post3/README.md +490 -0
  3. gallama-0.0.1.post3/pyproject.toml +46 -0
  4. gallama-0.0.1.post3/requirements.txt +25 -0
  5. gallama-0.0.1.post3/setup.cfg +4 -0
  6. gallama-0.0.1.post3/src/gallama/__init__.py +7 -0
  7. gallama-0.0.1.post3/src/gallama/api_response/__init__.py +0 -0
  8. gallama-0.0.1.post3/src/gallama/api_response/chat_response.py +667 -0
  9. gallama-0.0.1.post3/src/gallama/api_response/stream_parser.py +101 -0
  10. gallama-0.0.1.post3/src/gallama/app.py +472 -0
  11. gallama-0.0.1.post3/src/gallama/backend/__init__.py +1 -0
  12. gallama-0.0.1.post3/src/gallama/backend/chatgenerator.py +807 -0
  13. gallama-0.0.1.post3/src/gallama/backend/embedding.py +89 -0
  14. gallama-0.0.1.post3/src/gallama/backend/model.py +273 -0
  15. gallama-0.0.1.post3/src/gallama/backend/prompt_engine.py +371 -0
  16. gallama-0.0.1.post3/src/gallama/backend/thinking_template.py +132 -0
  17. gallama-0.0.1.post3/src/gallama/backend/tools.py +238 -0
  18. gallama-0.0.1.post3/src/gallama/cli.py +172 -0
  19. gallama-0.0.1.post3/src/gallama/config/__init__.py +1 -0
  20. gallama-0.0.1.post3/src/gallama/config/config_manager.py +182 -0
  21. gallama-0.0.1.post3/src/gallama/data/__init__.py +2 -0
  22. gallama-0.0.1.post3/src/gallama/data/__pycache__/__init__.cpython-311.pyc +0 -0
  23. gallama-0.0.1.post3/src/gallama/data/__pycache__/__init__.cpython-312.pyc +0 -0
  24. gallama-0.0.1.post3/src/gallama/data/default_model_list.yaml +388 -0
  25. gallama-0.0.1.post3/src/gallama/data/model_config.yaml +42 -0
  26. gallama-0.0.1.post3/src/gallama/data/model_token.yaml +143 -0
  27. gallama-0.0.1.post3/src/gallama/data/prompt/artifact_prompt.py +96 -0
  28. gallama-0.0.1.post3/src/gallama/data/thinking_template/tool_necessity_evaluation.regex +11 -0
  29. gallama-0.0.1.post3/src/gallama/data/thinking_template/tool_necessity_evaluation.xml +12 -0
  30. gallama-0.0.1.post3/src/gallama/data_classes/__init__.py +13 -0
  31. gallama-0.0.1.post3/src/gallama/data_classes/data_class.py +582 -0
  32. gallama-0.0.1.post3/src/gallama/data_classes/server_dataclass.py +35 -0
  33. gallama-0.0.1.post3/src/gallama/logger/__init__.py +1 -0
  34. gallama-0.0.1.post3/src/gallama/logger/logger.py +258 -0
  35. gallama-0.0.1.post3/src/gallama/server.py +636 -0
  36. gallama-0.0.1.post3/src/gallama/server_engine/__init__.py +3 -0
  37. gallama-0.0.1.post3/src/gallama/server_engine/mixture_of_agents.py +127 -0
  38. gallama-0.0.1.post3/src/gallama/server_engine/model_management.py +215 -0
  39. gallama-0.0.1.post3/src/gallama/server_engine/request_handler.py +90 -0
  40. gallama-0.0.1.post3/src/gallama/utils/__init__.py +0 -0
  41. gallama-0.0.1.post3/src/gallama/utils/utils.py +123 -0
  42. gallama-0.0.1.post3/src/gallama.egg-info/PKG-INFO +525 -0
  43. gallama-0.0.1.post3/src/gallama.egg-info/SOURCES.txt +45 -0
  44. gallama-0.0.1.post3/src/gallama.egg-info/dependency_links.txt +1 -0
  45. gallama-0.0.1.post3/src/gallama.egg-info/entry_points.txt +2 -0
  46. gallama-0.0.1.post3/src/gallama.egg-info/requires.txt +25 -0
  47. gallama-0.0.1.post3/src/gallama.egg-info/top_level.txt +1 -0
Metadata-Version: 2.1
Name: gallama
Version: 0.0.1.post3
Summary: An opinionated Llama server engine with a focus on agentic tasks
Author-email: David <trantrungduc91@example.com>
License: MIT
Project-URL: Homepage, https://github.com/remichu-ai/gallama
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: packaging
Requires-Dist: ninja
Requires-Dist: torch>=2.2.0
Requires-Dist: accelerate
Requires-Dist: pydantic
Requires-Dist: pytest
Requires-Dist: httpx
Requires-Dist: uvicorn
Requires-Dist: sentence-transformers
Requires-Dist: lm-format-enforcer
Requires-Dist: huggingface-hub
Requires-Dist: chardet
Requires-Dist: sse_starlette
Requires-Dist: fastapi
Requires-Dist: lxml
Requires-Dist: infinity-emb
Requires-Dist: colorama
Requires-Dist: zmq
Requires-Dist: transformers
Requires-Dist: pyyaml
Requires-Dist: rich
Requires-Dist: pyzmq
Requires-Dist: pygments
Requires-Dist: psutil

# gallama - Guided Agentic Llama

__gallama__ is an opinionated Python library that provides an LLM inference API service backend optimized for local agentic tasks.

It aims to close the gap between pure inference engines (such as ExLlamaV2 and Llama.cpp) and the additional needs of agentic work (e.g., function calling, formatting constraints).

It provides some experimental features such as Artifact (similar to Claude's) and Thinking templates (to guide chain of thought).

As there is currently no standard approach for these experimental features, the library implements them via some boilerplate code (you will find some fixed prompts inside the code) and is thus an opinionated API engine.

This library follows a rolling-update model, so please be patient with hiccups and bugs :) (feel free to open an issue if you come across any).

Currently, the backend mainly uses ExLlamaV2. Llama.cpp support is experimental.

Do check out [TabbyAPI](https://github.com/theroyallab/tabbyAPI) if you want a reliable, pure ExLlamaV2 API backend.
# Features

## Integrated Model Downloader

Download exl2 models from Hugging Face via the CLI for popular models.

For example, to download Llama 3.1 8B at 4.0 bpw (4 bits per weight quantization):

```shell
gallama download llama-3.1-8B:4.0
```

After the download, you can run the model with:

```shell
gallama run llama-3.1-8B
```

You can view the list of models currently supported by the downloader in the CLI with the following command:

```shell
gallama list available
```

**LLM Models**

| Model | Backend | Available Quantizations (bpw) |
|-------|---------|-------------------------------|
| llama-3.1-8B | exllama | `3.0`, `3.5`, `4.0`, `4.5`, `5.0`, `6.0`, `8.0` |
| llama-3.1-70B | exllama | `2.5`, `3.0`, `3.5`, `4.0`, `4.5`, `5.0`, `6.0` |
| mistral | exllama | `2.8`, `3.0`, `4.0`, `4.5`, `5.0`, `6.0` |
| mistral-large | exllama | `2.3`, `2.5`, `2.75`, `3.0`, `3.5`, `3.75`, `4.0`, `4.25`, `4.5`, `4.75`, `5.0`, `6.0` |
| mistral-nemo | exllama | `2.5`, `3.0`, `3.5`, `4.0`, `4.5`, `5.0`, `6.0`, `8.0` |
| codestral | exllama | `3.0`, `3.5`, `4.25`, `5.0`, `6.5`, `8.0` |
| gemma-2-9B | exllama | `2.5`, `3.0`, `3.5`, `4.0`, `4.5`, `5.0`, `5.5`, `6.0`, `8.0` |
| gemma-2-27B | exllama | `3.0`, `3.5`, `4.0`, `4.5`, `5.0`, `6.0`, `8.0` |
| qwen-2-1.5B | exllama | `3.0`, `4.0`, `5.0`, `6.0`, `8.0` |
| qwen-2-7B | exllama | `3.0`, `4.0`, `5.0`, `6.0`, `8.0` |
| qwen-2-72B | exllama | `3.0`, `3.5`, `4.0`, `4.25`, `4.65`, `5.0`, `6.0`, `8.0` |
| yi-1.5-34B | exllama | `3.0`, `3.5`, `3.75`, `4.0`, `4.25`, `5.0`, `6.0`, `6.5`, `8.0` |

**Embedding Models**

| Model | Backend | Available Quantizations (bpw) |
|-------|---------|-------------------------------|
| multilingual-e5-large-instruct | embedding | `16.0` |
| gte-large-en-v1.5 | embedding | `16.0` |

To specify a model, use the model name followed by `:` and the quantization as a float, e.g. `qwen-2-72B:4.0`.

For models not listed here, refer to `examples/Model_Downloader.ipynb` for code to download from Hugging Face. You will then need to manually update the config in `~/gallama/model_config.yaml`. Instructions will be updated.

Sometimes the download speed slows down. Just cancel the download with Ctrl+C and run the same download command again; the download will resume at a higher speed.
## OpenAI Compatible Server

Fully compatible with the OpenAI client.

Install the OpenAI client and point it at the gallama server as follows:

```shell
pip install openai
```

```python
import os
from openai import OpenAI

os.environ['OPENAI_API_KEY'] = 'test'
client = OpenAI(base_url='http://127.0.0.1:8000/v1')
```

```python
messages = [{"role": "user", "content": "Which is faster in terms of reaction speed: a cat or a dog?"}]

completion = client.chat.completions.create(
    model="mistral",
    messages=messages,
)

print(completion)
```
## Function Calling

Function calling is supported for all models, mimicking OpenAI's behavior for `tool_choice="auto"`: if tool usage is not applicable, the model generates a normal response instead.

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        }
    }
]

messages = [{"role": "user", "content": "What's the weather like in Boston today?"}]

completion = client.chat.completions.create(
    model="mistral",
    messages=messages,
    tools=tools,
    tool_choice="auto"
)

print(completion.choices[0].message.tool_calls[0].function)
```
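
When the server does return a tool call, its `arguments` field is a JSON string that your client must decode and route to a local function. A minimal dispatch sketch (the weather stub and registry are illustrative stand-ins, not part of gallama):

```python
import json

def get_current_weather(location: str, unit: str = "celsius") -> dict:
    # Stub: a real implementation would call a weather API.
    return {"location": location, "unit": unit, "temperature": 22}

# Map tool names (as declared in `tools`) to local callables.
TOOL_REGISTRY = {"get_current_weather": get_current_weather}

def dispatch_tool_call(name: str, arguments: str) -> dict:
    """Decode the JSON argument string and invoke the matching function."""
    return TOOL_REGISTRY[name](**json.loads(arguments))

# `name` and `arguments` would come from
# completion.choices[0].message.tool_calls[0].function
result = dispatch_tool_call(
    "get_current_weather", '{"location": "Boston, MA", "unit": "celsius"}'
)
```

The result would then typically be appended to `messages` as a `"tool"` role message for a follow-up completion.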
## Artifact System (Experimental)

The engine supports an artifact mode (similar to Claude's Artifacts).
While Claude probably trained their model on a custom dataset for this, open-source models have no such dataset, and there is unlikely to be a standard for it any time soon (much like every new model uses different special tokens :D).

Currently the library implements it via a combination of prompting, XML tags and format enforcement.

To use artifact mode effectively, you will need a sufficiently capable model; from our testing, Codestral 22B is the minimum.
70B+ models such as Qwen2 or Llama 3.1 give the best results.

As the artifact feature is most useful interactively rather than via the API, we have a lightweight WebUI that works with our artifact system.

If using Python, you can interact with it as follows.

__Non-Streaming__
```python
completion = client.chat.completions.create(
    model="codestral",
    messages=[{"role": "user",
               "content": "Explain in one liner what quicksort is and write me quicksort in python and Java"}],
    extra_body={
        "artifact": "Fast",
    }
)

for choice in completion.choices:
    print("-------------------------------\n")
    print(choice.message.artifact_meta)
    print(choice.message.content)
```

__Streaming__
```python
completion = client.chat.completions.create(
    model="codestral",
    messages=[{"role": "user",
               "content": "Explain in one liner what quicksort is and write me quicksort in python and Java"}],
    stream=True,
    extra_body={
        "artifact": "Fast",
    }
)

current_content_type = None
for chunk in completion:
    if current_content_type != chunk.choices[0].delta.artifact_meta["tag_type"]:
        print("\n------------------------------")
        print(chunk.choices[0].delta.artifact_meta)
        print("\n------------------------------")
        current_content_type = chunk.choices[0].delta.artifact_meta["tag_type"]
    print(chunk.choices[0].delta.content, end='')
```
## Thinking Method

A novel approach to guiding an LLM's thinking with XML templates (e.g., chain of thought) without modifying the prompt.
The benefit over traditional CoT prompting is that you can customize which XML template is used depending on the model and situation, without having to change the prompt itself.
In the following example, the template steers the model to reason step by step while only the final answer appears in the response.

```python
thinking_template = """
<chain_of_thought>
  <problem>{problem_statement}</problem>
  <initial_state>{initial_state}</initial_state>
  <steps>
    <step>{action1}</step>
    <step>{action2}</step>
    <!-- Add more steps as needed -->
  </steps>
  <answer>Provide the answer</answer>
  <final_answer>Only the final answer, no need to provide the step by step problem solving</final_answer>
</chain_of_thought>
"""

messages = [
    {"role": "user", "content": "I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?"}
]

completion = client.chat.completions.create(
    model="mistral",
    messages=messages,
    temperature=0.1,
    max_tokens=200,
    extra_body={
        "thinking_template": thinking_template,
    },
)

print(completion.choices[0].message.content)
# 10 apples
```
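
Since the template is XML, it is worth validating it client-side before sending; a malformed template will not guide the model as intended. A minimal sketch using the standard library and a trimmed-down template (the validation helper is an assumption, not a gallama API):

```python
import xml.etree.ElementTree as ET

thinking_template = """
<chain_of_thought>
  <problem>{problem_statement}</problem>
  <steps>
    <step>{action1}</step>
  </steps>
  <final_answer>Only the final answer</final_answer>
</chain_of_thought>
"""

def template_tags(template: str) -> list[str]:
    """Parse a thinking template, raising ParseError if the XML is
    malformed, and return the root tag plus its top-level children."""
    root = ET.fromstring(template)
    return [root.tag] + [child.tag for child in root]

tags = template_tags(thinking_template)
```

Placeholders like `{problem_statement}` are just text to the XML parser, so they pass through validation unchanged.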
## Multiple Concurrent Models

Run multiple models (different or the same) with automatic load balancing and request routing.
VRAM usage per model can be auto-allocated or split across specific GPUs.
Each model runs as a dedicated FastAPI instance to avoid threading issues and guarantee speed.
However, do note that this is more demanding on the system, as multiple FastAPI instances will be running.

```shell
gallama run -id "model_id=qwen2-72B gpus=20,15,15,0" -id "model_id=Llama3.1-8B gpus=0,0,0,20"
```
## OpenAI Embedding Endpoint

Uses the Infinity Embedding library to serve embeddings via the OpenAI client.

```python
response = client.embeddings.create(
    input="Your text string for embedding goes here",
    model="Alibaba-NLP/gte-large-en-v1.5"
)

print(response.data[0].embedding)
```
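
Embedding vectors returned by the endpoint are typically compared with cosine similarity, e.g. for semantic search or retrieval. A self-contained sketch of that comparison (plain Python, no dependency on the server):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors:
    dot(a, b) / (|a| * |b|). Ranges from -1 to 1; higher means
    more semantically similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

In practice `a` and `b` would be two `response.data[i].embedding` lists from the endpoint above.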
## Legacy OpenAI Completion Endpoint

The legacy completion endpoint is also supported.

```python
client.completions.create(
    model="mistral",
    prompt="Tell me a story about a Llama in 200 words",
    max_tokens=1000,
    temperature=0
)
```
## Format Enforcement

Ensure output conforms to specified patterns with the following options, which can be set in `extra_body` when using the OpenAI client.

```python
completion = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Is smoking bad for health? Answer with Yes or No"}],
    temperature=0.1,
    max_tokens=200,
    extra_body={
        # "leading_prompt": leading_prompt,       # prefix the generation with some string
        # "regex_pattern": regex_pattern,         # regex the whole generation must match
        # "regex_prefix_pattern": '(Yes|No)\.',   # regex the starting words must match
        # "stop_words": stop_words,               # words that stop generation
    },
)
```
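
The effect of the commented `regex_prefix_pattern` above can be checked on the client side with the same pattern; this is only a sketch of what conformance means, not the server's enforcement code:

```python
import re

# Same pattern as in regex_prefix_pattern: the reply must start
# with "Yes." or "No.".
prefix = re.compile(r"(Yes|No)\.")

def starts_conformant(text: str) -> bool:
    """True if the generated text begins with 'Yes.' or 'No.'."""
    return prefix.match(text) is not None
```

With enforcement enabled, a reply like `"No. Smoking is harmful."` passes this check, while free-form text would not.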
## Streaming

Streaming is fully supported, even in artifact mode.

```python
messages = [{"role": "user", "content": "Tell me a 200-word story about a Llama"}]

completion = client.chat.completions.create(
    model="mistral",
    messages=messages,
    stream=True,
    temperature=0.1,
)

for chunk in completion:
    print(chunk.choices[0].delta.content, end='')
```
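
Instead of printing, streamed deltas are often accumulated into the full response text. A minimal sketch, where the list stands in for the `chunk.choices[0].delta.content` values yielded by the loop above:

```python
def collect_stream(deltas) -> str:
    """Join streamed delta fragments into the complete response,
    skipping None (the final chunk's content is typically None)."""
    parts = []
    for delta in deltas:
        if delta is not None:
            parts.append(delta)
    return "".join(parts)

story = collect_stream(["Once ", "upon ", "a ", "time", None])
```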
## Remote Model Management

Load and unload models via API calls.

Start the gallama server if it is not currently running:

```shell
gallama run
```

```python
import requests

api_url = "http://127.0.0.1:8000/add_model"

payload = [
    {
        "model_id": "qwen-2-72B",
        "gpus": [22, 22, 4, 0],
        "cache_size": 32768,
    },
    {
        "model_id": "gemma-2-9b",
        "gpus": [0, 0, 0, 12],
        "cache_size": 32768,
    },
    {
        "model_id": "multilingual-e5-large-instruct",
        "gpus": [0, 0, 0, 5],
    },
]

response = requests.post(api_url, json=payload)
```
## Installation

gallama requires certain components to be installed and working.

Ensure that you have either ExLlamaV2 (recommended) or llama-cpp-python (in development) installed; you can also have both installed and use both with gallama.
If you already have ExLlamaV2 (and optionally Flash Attention) running, you can install gallama with:

```shell
pip install gallama
```

Or, install from source:

```shell
git clone https://github.com/remichu-ai/gallama.git
cd gallama
pip install .
```

If you're starting from scratch and don't have these dependencies yet, follow these steps:

1. Create a virtual environment (recommended).
   We recommend Python 3.12 if you can, or 3.11 at minimum, for future tensor parallel compatibility:
   ```shell
   conda create --name genv python=3.12
   conda activate genv
   ```

2. Install and verify ExLlamaV2 (recommended):
   - Follow the instructions at [ExLlamaV2 GitHub](https://github.com/turboderp/exLlamav2)
   - Test with examples from [ExLlamaV2 Examples](https://github.com/turboderp/exLlamav2/tree/master/examples)

3. (Optional) Install Flash Attention for improved performance:
   - Follow the instructions at [Flash Attention GitHub](https://github.com/Dao-AILab/flash-attention)

4. (Optional) Install llama-cpp-python:
   - Follow the instructions at [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
   - Test with examples from [llama-cpp-python Examples](https://github.com/abetlen/llama-cpp-python/blob/main/examples/high_level_api/high_level_api_streaming.py)
   - Note: ExLlamaV2 is currently recommended; Llama.cpp support is under development.

5. Install gallama:
   ```shell
   pip install gallama
   ```
   Or, install from source:
   ```shell
   git clone https://github.com/remichu-ai/gallama.git
   cd gallama
   pip install .
   ```
## Usage

Follow these steps to use the model. We aim to make this process more convenient in future releases.

### Setup

1. Download the ExLlama model. A Jupyter notebook to assist with the download is included in `examples/Model_download.ipynb`.

2. Initialize gallama:
   ```shell
   gallama run
   ```
   This creates a `model_config.yaml` file in `~/gallama`.

3. Update `~/gallama/model_config.yaml` with your model configurations.

4. Launch the model:

   Simple method:
   ```shell
   gallama run mistral
   ```
   Advanced method:
   ```shell
   gallama run -id "model_id=mistral"
   ```
### Advanced Usage

Using `gallama run -id` followed by a string of key-value pairs unlocks additional options.

Customize the model launch using various parameters. Available parameters for the `-id` option include:

- `model_id`: ID of the model from the YAML file (required)
- `model_name`: Name of the model (optional, defaults to the last part of `model_id`)
- `gpus`: VRAM usage for each GPU, comma-separated list of floats (optional)
- `cache_size`: Context length of the cache, as an integer (optional)
- `cache_quant`: Quantization to use for the cache; options are "FP16", "Q4", "Q6", "Q8" (optional, defaults to Q4)
- `max_seq_len`: Maximum sequence length (optional)
- `backend`: Model engine backend; options are "exLlama", "Llama_cpp", "embedding" (optional, defaults to "exLlama")
- `tp`: Enable tensor parallel with ExLlamaV2 (experimental); see further below

#### Speculative Decoding Parameters

- `draft_model_id`: ID of the draft model (optional)
- `draft_model_name`: Name of the draft model (optional)
- `draft_gpus`: VRAM usage for each GPU for the draft model, comma-separated list of floats (optional)
- `draft_cache_size`: Context length of the cache, as an integer, for the draft model (optional)
- `draft_cache_quant`: Quantization to use for the draft model's cache; options are "FP16", "Q4", "Q6", "Q8" (optional)
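
The `-id` string format above could be parsed along these lines (a hypothetical illustration of the format, not gallama's actual CLI code):

```python
def parse_id_string(spec: str) -> dict:
    """Parse an -id spec like 'model_id=mistral gpus=20,15 cache_quant=Q8'
    into a config dict. GPU lists become lists of floats; everything else
    stays a string."""
    config = {}
    for pair in spec.split():
        key, _, value = pair.partition("=")
        if key in ("gpus", "draft_gpus"):
            config[key] = [float(v) for v in value.split(",")]
        else:
            config[key] = value
    return config

cfg = parse_id_string("model_id=qwen2-72B gpus=20,15,15,0 cache_quant=Q8")
```

Here `cfg["gpus"]` comes out as `[20.0, 15.0, 15.0, 0.0]`, matching the per-GPU VRAM split described above.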
### Examples

1. Launch two models simultaneously:
   ```shell
   gallama run -id "model_id=mistral" -id "model_id=llama3"
   ```

2. Launch a model with specific VRAM limits per GPU:
   ```shell
   gallama run -id "model_id=qwen2-72B gpus=22,22,10,0"
   ```
   This limits memory usage to 22GB for GPU0 and GPU1, 10GB for GPU2, and 0GB for GPU3.

3. Launch a model with custom cache size and quantization:
   By default, `cache_size` is initialized to the model's maximum sequence length.
   However, if there is VRAM to spare, increasing `cache_size` helps the model perform better for concurrent and batched requests.
   By default, `cache_quant=Q4` is used; adjust it if required, e.g. Qwen2 1.5B doesn't work well with Q4 cache, so use Q6 or Q8.
   ```shell
   gallama run -id "model_id=mistral cache_size=102400 cache_quant=Q8"
   ```

4. Launch a model with reduced sequence length:
   For models with long context, lowering the sequence length can significantly reduce VRAM usage,
   e.g. Mistral Large 2 can handle 128K context, but it requires significant VRAM for the cache.
   ```shell
   gallama run -id "model_id=mistral_large max_seq_len=32768"
   ```

5. Launch a model for embedding:
   ```shell
   gallama run -id "model_id=Alibaba-NLP/gte-large-en-v1.5 backend=embedding"
   ```

6. Launch a model with speculative decoding:
   Only models with the same vocabulary should be paired for speculative decoding.
   For reference, enabling speculative decoding improves qwen2-72B generation speed from 20 tok/s to 25-35 tok/s on my 4090s.
   Speculative decoding is highly recommended if you have VRAM to spare.
   ```shell
   gallama run -id "model_id=qwen2-72B draft_model_id=qwen2-1.5B"
   ```
   Ensure your GPU settings can accommodate the model requirements, and adjust parameters as needed for your specific use case.
   Note: the backend is assumed to be the same for both the main model and the draft model in speculative decoding.

7. (Experimental) Tensor Parallel (TP):
   ExLlamaV2 tensor parallel support is in the works. If you are adventurous, you can enable it as follows:
   - Update your Python to >=3.11
   - Install the [exllamav2 dev_tp branch](https://github.com/turboderp/exllamav2/tree/dev_tp)
   - Only Qwen2-72B, Llama3.1-70B and Mistral Large are supported at the moment
   - Do run a draft model to help further with speed
   - There is no cache quantization with TP yet, so set `max_seq_len` to a lower number if you do not have enough VRAM

   You can launch gallama with tensor parallel as follows:
   ```shell
   gallama run -id "model_id=qwen-2-72B draft_model_id=qwen-2-1.5B max_seq_len=32768 tp=True"
   ```