PyPI - llama-cpp-python - Versions diffs - 0.1.77__tar.gz → 0.1.79__tar.gz - Mend

llama-cpp-python 0.1.77__tar.gz → 0.1.79__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (300)

{llama_cpp_python-0.1.77 → llama_cpp_python-0.1.79}/CHANGELOG.md RENAMED

@@ -7,6 +7,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [0.1.79]
+### Added
+- (llama.cpp) GGUF support
+## [0.1.78]
+### Added
+- Grammar based sampling via LlamaGrammar which can be passed to completions
+- Make n_gpu_layers == -1 offload all layers
 ## [0.1.77]
 - (llama.cpp) Update llama.cpp add support for LLaMa 2 70B

{llama_cpp_python-0.1.77 → llama_cpp_python-0.1.79}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: llama_cpp_python
-Version: 0.1.77
+Version: 0.1.79
 Summary: A Python wrapper for llama.cpp
 Author: Andrei Betlen
 Author-email: abetlen@gmail.com
@@ -35,6 +35,9 @@ This package provides:
 Documentation is available at [https://llama-cpp-python.readthedocs.io/en/latest](https://llama-cpp-python.readthedocs.io/en/latest).
+> [!WARNING]
+> Starting with version 0.1.79 the model format has changed from `ggmlv3` to `gguf`. Old model files can be converted using the `convert-llama-ggmlv3-to-gguf.py` script in [`llama.cpp`](https://github.com/ggerganov/llama.cpp)
 ## Installation from PyPI (recommended)
@@ -158,7 +161,7 @@ llm = Llama(model_path="./models/7B/ggml-model.bin", n_ctx=2048)
 Llama2 70b must set the `n_gqa` parameter (grouped-query attention factor) to 8 when loading:
 ```python
-llm = Llama(model_path="./models/7B/ggml-model.bin", n_gqa=8)
+llm = Llama(model_path="./models/70B/ggml-model.bin", n_gqa=8)
 ```
 ## Web Server
@@ -187,7 +190,7 @@ docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/ggm
 ## Low-level API
 The low-level API is a direct [`ctypes`](https://docs.python.org/3/library/ctypes.html) binding to the C API provided by `llama.cpp`.
-The entire lowe-level API can be found in [llama_cpp/llama_cpp.py](https://github.com/abetlen/llama-cpp-python/blob/master/llama_cpp/llama_cpp.py) and directly mirrors the C API in [llama.h](https://github.com/ggerganov/llama.cpp/blob/master/llama.h).
+The entire low-level API can be found in [llama_cpp/llama_cpp.py](https://github.com/abetlen/llama-cpp-python/blob/master/llama_cpp/llama_cpp.py) and directly mirrors the C API in [llama.h](https://github.com/ggerganov/llama.cpp/blob/master/llama.h).
 Below is a short example demonstrating how to use the low-level API to tokenize a prompt:

{llama_cpp_python-0.1.77 → llama_cpp_python-0.1.79}/README.md RENAMED

@@ -17,6 +17,9 @@ This package provides:
 Documentation is available at [https://llama-cpp-python.readthedocs.io/en/latest](https://llama-cpp-python.readthedocs.io/en/latest).
+> [!WARNING]
+> Starting with version 0.1.79 the model format has changed from `ggmlv3` to `gguf`. Old model files can be converted using the `convert-llama-ggmlv3-to-gguf.py` script in [`llama.cpp`](https://github.com/ggerganov/llama.cpp)
 ## Installation from PyPI (recommended)
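The format warning added in this hunk means a model file written for an older release may no longer load. As a rough pre-flight check, the sketch below tests whether a file begins with the 4-byte GGUF magic (`GGUF`). The helper name `looks_like_gguf` is hypothetical and not part of llama-cpp-python; it inspects only the first four bytes and does not validate the rest of the file.

```python
from pathlib import Path

# GGUF files begin with these four ASCII bytes.
GGUF_MAGIC = b"GGUF"


def looks_like_gguf(path):
    """Return True if the file at `path` starts with the GGUF magic bytes.

    Hypothetical helper for illustration; it does not parse the header
    beyond the magic and says nothing about whether the model is usable.
    """
    with Path(path).open("rb") as f:
        return f.read(4) == GGUF_MAGIC
```

A file that fails this check is a candidate for the conversion script named in the warning; a file that passes may still be rejected by the loader for other reasons.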
@@ -140,7 +143,7 @@ llm = Llama(model_path="./models/7B/ggml-model.bin", n_ctx=2048)
 Llama2 70b must set the `n_gqa` parameter (grouped-query attention factor) to 8 when loading:
 ```python
-llm = Llama(model_path="./models/7B/ggml-model.bin", n_gqa=8)
+llm = Llama(model_path="./models/70B/ggml-model.bin", n_gqa=8)
 ```
 ## Web Server
@@ -169,7 +172,7 @@ docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/ggm
 ## Low-level API
 The low-level API is a direct [`ctypes`](https://docs.python.org/3/library/ctypes.html) binding to the C API provided by `llama.cpp`.
-The entire lowe-level API can be found in [llama_cpp/llama_cpp.py](https://github.com/abetlen/llama-cpp-python/blob/master/llama_cpp/llama_cpp.py) and directly mirrors the C API in [llama.h](https://github.com/ggerganov/llama.cpp/blob/master/llama.h).
+The entire low-level API can be found in [llama_cpp/llama_cpp.py](https://github.com/abetlen/llama-cpp-python/blob/master/llama_cpp/llama_cpp.py) and directly mirrors the C API in [llama.h](https://github.com/ggerganov/llama.cpp/blob/master/llama.h).
 Below is a short example demonstrating how to use the low-level API to tokenize a prompt:

{llama_cpp_python-0.1.77 → llama_cpp_python-0.1.79}/docker/openblas_simple/Dockerfile RENAMED

@@ -9,7 +9,7 @@ COPY . .
 RUN apt update && apt install -y libopenblas-dev ninja-build build-essential
 RUN python -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette pydantic-settings
-RUN LLAMA_OPENBLAS=1 pip install llama_cpp_python --verbose
+RUN CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama_cpp_python --verbose
 # Run the server
 CMD python3 -m llama_cpp.server

{llama_cpp_python-0.1.77 → llama_cpp_python-0.1.79}/llama_cpp/llama.py RENAMED

@@ -23,10 +23,13 @@ import ctypes
 from . import llama_cpp
 from .llama_types import *
+from .llama_grammar import LlamaGrammar
 import numpy as np
 import numpy.typing as npt
+from .utils import suppress_stdout_stderr
 class BaseLlamaCache(ABC):
     """Base cache class for a llama.cpp model."""
@@ -224,7 +227,8 @@ class Llama:
         rope_freq_base: float = 10000.0,
         rope_freq_scale: float = 1.0,
         n_gqa: Optional[int] = None,  # (TEMPORARY) must be 8 for llama2 70b
-        rms_norm_eps: Optional[float] = None, # (TEMPORARY)
+        rms_norm_eps: Optional[float] = None,  # (TEMPORARY)
+        mul_mat_q: Optional[bool] = None,
         verbose: bool = True,
     ):
         """Load a llama.cpp model from `model_path`.
@@ -234,6 +238,7 @@ class Llama:
             n_ctx: Maximum context size.
             n_parts: Number of parts to split the model into. If -1, the number of parts is automatically determined.
             seed: Random seed. -1 for random.
+            n_gpu_layers: Number of layers to offload to GPU (-ngl). If -1, all layers are offloaded.
             f16_kv: Use half-precision for key/value cache.
             logits_all: Return logits for all tokens, not just the last token.
             vocab_only: Only load the vocabulary no weights.
@@ -262,7 +267,7 @@ class Llama:
         self.params = llama_cpp.llama_context_default_params()
         self.params.n_ctx = n_ctx
-        self.params.n_gpu_layers = n_gpu_layers
+        self.params.n_gpu_layers = 0x7FFFFFFF if n_gpu_layers == -1 else n_gpu_layers  # 0x7FFFFFFF is INT32 max, will be auto set to all layers
         self.params.seed = seed
         self.params.f16_kv = f16_kv
         self.params.logits_all = logits_all
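The `n_gpu_layers` change in this hunk maps the sentinel `-1` to the largest signed 32-bit integer, which the inline comment says is then resolved to "all layers". The mapping can be sketched in isolation as follows; the function name `resolve_n_gpu_layers` is hypothetical and exists only for illustration.

```python
# Largest value of a signed 32-bit integer, as used in the hunk above.
INT32_MAX = 0x7FFFFFFF


def resolve_n_gpu_layers(n_gpu_layers: int) -> int:
    """Translate the user-facing value into the value stored in the params.

    -1 is treated as "offload everything"; any other value passes through
    unchanged. Hypothetical helper mirroring the one-line expression in the diff.
    """
    return INT32_MAX if n_gpu_layers == -1 else n_gpu_layers
```

Only `-1` is special-cased; other negative values are passed through unchanged, exactly as in the expression shown in the diff.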
@@ -273,22 +278,21 @@ class Llama:
         self.params.low_vram = low_vram
         self.tensor_split = tensor_split
-        self._c_tensor_split = None
+        self._p_tensor_split = None
         if self.tensor_split is not None:
-            #Type conversion and expand the list to the length of LLAMA_MAX_DEVICES
-            FloatArray = ctypes.c_float * llama_cpp.LLAMA_MAX_DEVICES.value
-            self._c_tensor_split = FloatArray(*tensor_split) # keep a reference to the array so it is not gc'd
-            self.params.tensor_split = self._c_tensor_split
+            FloatArray = (ctypes.c_float * len(self.tensor_split))(*self.tensor_split)
+            self._p_tensor_split = ctypes.POINTER(ctypes.c_float)(
+                FloatArray
+            )  # keep a reference to the array so it is not gc'd
+            self.params.tensor_split = self._p_tensor_split
         self.params.rope_freq_base = rope_freq_base
         self.params.rope_freq_scale = rope_freq_scale
-        if n_gqa is not None:
-            self.params.n_gqa = n_gqa
-        if rms_norm_eps is not None:
-            self.params.rms_norm_eps = rms_norm_eps
+        if mul_mat_q is not None:
+            self.params.mul_mat_q = mul_mat_q
         self.last_n_tokens_size = last_n_tokens_size
         self.n_batch = min(n_ctx, n_batch)
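The `tensor_split` change in this hunk replaces a fixed-length array with an array sized to the input plus a `float *` pointer to it, and keeps a reference so the buffer is not garbage-collected while native code may still read it. The sketch below shows the same idea using `ctypes.cast`, which is a substitution for the constructor form in the diff, not a copy of it. The function name `make_float_pointer` is hypothetical.

```python
import ctypes


def make_float_pointer(values):
    """Build a C float array from `values` and a float* pointing at it.

    Both objects are returned so the caller can hold on to the array for as
    long as the pointer is in use. Hypothetical helper for illustration.
    """
    array = (ctypes.c_float * len(values))(*values)
    pointer = ctypes.cast(array, ctypes.POINTER(ctypes.c_float))
    return array, pointer


# Usage: keep `array` alive for as long as `pointer` is handed to C code.
array, pointer = make_float_pointer([0.5, 0.25])
```

Indexing the pointer (`pointer[0]`, `pointer[1]`) reads the same memory as the array; the pointer itself carries no length, so the caller has to track it separately.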
@@ -307,12 +311,25 @@ class Llama:
         if not os.path.exists(model_path):
             raise ValueError(f"Model path does not exist: {model_path}")
-        self.model = llama_cpp.llama_load_model_from_file(
-            self.model_path.encode("utf-8"), self.params
-        )
+        if verbose:
+            self.model = llama_cpp.llama_load_model_from_file(
+                self.model_path.encode("utf-8"), self.params
+            )
+        else:
+            with suppress_stdout_stderr():
+                self.model = llama_cpp.llama_load_model_from_file(
+                    self.model_path.encode("utf-8"), self.params
+                )
         assert self.model is not None
-        self.ctx = llama_cpp.llama_new_context_with_model(self.model, self.params)
+        if verbose:
+            self.ctx = llama_cpp.llama_new_context_with_model(self.model, self.params)
+        else:
+            with suppress_stdout_stderr():
+                print("here")
+                self.ctx = llama_cpp.llama_new_context_with_model(
+                    self.model, self.params
+                )
         assert self.ctx is not None
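This hunk wraps the native load calls in `suppress_stdout_stderr()` when `verbose` is false. The helper's implementation is not part of the lines shown here, so the following is only a sketch of one common way to silence output written by native code: temporarily pointing file descriptors at the null device. The name `suppress_fds` and its `fds` parameter are assumptions made for illustration.

```python
import os
from contextlib import contextmanager


@contextmanager
def suppress_fds(fds=(1, 2)):
    """Temporarily redirect the given file descriptors to os.devnull.

    Works at the descriptor level, so it also hides output written by C
    code that bypasses sys.stdout / sys.stderr. Hypothetical sketch, not
    the helper imported in the diff.
    """
    saved = [os.dup(fd) for fd in fds]          # remember the original targets
    devnull = os.open(os.devnull, os.O_WRONLY)
    try:
        for fd in fds:
            os.dup2(devnull, fd)                # point each fd at the null device
        yield
    finally:
        for fd, backup in zip(fds, saved):
            os.dup2(backup, fd)                 # restore the original target
            os.close(backup)
        os.close(devnull)
```

Redirecting at the descriptor level matters because replacing `sys.stdout` in Python has no effect on text printed by a C library.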
@@ -349,8 +366,8 @@ class Llama:
             sorted=sorted,
         )
         self._candidates = candidates
-        self._token_nl = Llama.token_nl()
-        self._token_eos = Llama.token_eos()
+        self._token_nl = self.token_nl()
+        self._token_eos = self.token_eos()
         self._candidates_data_id = np.arange(self._n_vocab, dtype=np.intc)  # type: ignore
         self._candidates_data_p = np.zeros(self._n_vocab, dtype=np.single)
@@ -391,11 +408,11 @@ class Llama:
         Returns:
             A list of tokens.
         """
-        assert self.ctx is not None
+        assert self.model is not None
         n_ctx = self._n_ctx
         tokens = (llama_cpp.llama_token * n_ctx)()
-        n_tokens = llama_cpp.llama_tokenize(
-            self.ctx,
+        n_tokens = llama_cpp.llama_tokenize_with_model(
+            self.model,
             text,
             tokens,
             llama_cpp.c_int(n_ctx),
@@ -404,8 +421,8 @@ class Llama:
         if n_tokens < 0:
             n_tokens = abs(n_tokens)
             tokens = (llama_cpp.llama_token * n_tokens)()
-            n_tokens = llama_cpp.llama_tokenize(
-                self.ctx,
+            n_tokens = llama_cpp.llama_tokenize_with_model(
+                self.model,
                 text,
                 tokens,
                 llama_cpp.c_int(n_tokens),
@@ -426,13 +443,19 @@ class Llama:
         Returns:
             The detokenized string.
         """
-        assert self.ctx is not None
+        assert self.model is not None
         output = b""
+        size = 8
+        buffer = (ctypes.c_char * size)()
         for token in tokens:
-            output += llama_cpp.llama_token_to_str(
-                self.ctx, llama_cpp.llama_token(token)
+            n = llama_cpp.llama_token_to_str_with_model(
+                self.model, llama_cpp.llama_token(token), buffer, size
             )
-        return output
+            assert n <= size
+            output += bytes(buffer[:n])
+        # NOTE: Llama1 models automatically added a space at the start of the prompt
+        # this line removes a leading space if the first token is a beginning of sentence token
+        return output[1:] if len(tokens) > 0 and tokens[0] == self.token_bos() else output
     def set_cache(self, cache: Optional[BaseLlamaCache]):
         """Set the cache.
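The rewritten `detokenize` in this hunk has the native call write each token's bytes into a reusable fixed-size buffer and return the byte count, then drops one leading byte when the sequence starts with the beginning-of-sequence token. The sketch below reproduces that control flow with a stand-in for the native call. The vocabulary `PIECES`, the token ids, and the function `token_to_bytes` are all hypothetical.

```python
import ctypes

BOS = 1  # hypothetical beginning-of-sequence token id
PIECES = {1: b"", 2: b" Hello", 3: b" world"}  # hypothetical vocabulary


def token_to_bytes(token, buffer, size):
    """Stand-in for the native call: copy the token's bytes into `buffer`
    and return how many bytes the token needs."""
    piece = PIECES[token]
    if len(piece) <= size:
        ctypes.memmove(buffer, piece, len(piece))
    return len(piece)


def detokenize(tokens):
    size = 8
    buffer = (ctypes.c_char * size)()  # reused for every token
    output = b""
    for token in tokens:
        n = token_to_bytes(token, buffer, size)
        assert n <= size
        output += buffer[:n]  # slicing a c_char array yields bytes
    # Drop the leading space only when the sequence starts with BOS.
    return output[1:] if tokens and tokens[0] == BOS else output
```

With this toy vocabulary, `detokenize([1, 2, 3])` yields `b"Hello world"` while `detokenize([2, 3])` keeps the leading space.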
@@ -497,6 +520,7 @@ class Llama:
         mirostat_eta: llama_cpp.c_float,
         penalize_nl: bool = True,
         logits_processor: Optional[LogitsProcessorList] = None,
+        grammar: Optional[LlamaGrammar] = None,
     ):
         assert self.ctx is not None
         assert self.n_tokens > 0
@@ -543,8 +567,16 @@ class Llama:
         )
         if not penalize_nl:
             candidates.data[self._token_nl].logit = llama_cpp.c_float(nl_logit)
+        if grammar is not None:
+            llama_cpp.llama_sample_grammar(
+                ctx=self.ctx,
+                candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
+                grammar=grammar.grammar,
+            )
         if temp.value == 0.0:
-            return llama_cpp.llama_sample_token_greedy(
+            id = llama_cpp.llama_sample_token_greedy(
                 ctx=self.ctx,
                 candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
             )
@@ -556,7 +588,7 @@ class Llama:
                 candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
                 temp=temp,
             )
-            return llama_cpp.llama_sample_token_mirostat(
+            id = llama_cpp.llama_sample_token_mirostat(
                 ctx=self.ctx,
                 candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
                 tau=mirostat_tau,
@@ -571,7 +603,7 @@ class Llama:
                 candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
                 temp=temp,
             )
-            return llama_cpp.llama_sample_token_mirostat_v2(
+            id = llama_cpp.llama_sample_token_mirostat_v2(
                 ctx=self.ctx,
                 candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
                 tau=mirostat_tau,
@@ -608,10 +640,17 @@ class Llama:
                 candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
                 temp=temp,
             )
-            return llama_cpp.llama_sample_token(
+            id = llama_cpp.llama_sample_token(
                 ctx=self.ctx,
                 candidates=llama_cpp.ctypes.byref(candidates),  # type: ignore
             )
+        if grammar is not None:
+            llama_cpp.llama_grammar_accept_token(
+                ctx=self.ctx,
+                grammar=grammar.grammar,
+                token=llama_cpp.ctypes.c_int(id),
+            )
+        return id
     def sample(
         self,
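The grammar changes in `_sample` follow a three-step pattern: constrain the candidate set with the grammar, pick a token, then tell the grammar which token was accepted so it can advance. The toy below illustrates that ordering in pure Python with greedy selection. `ToyGrammar` and `sample_greedy` are hypothetical; they are not `LlamaGrammar` and do not reflect how llama.cpp represents grammars internally.

```python
class ToyGrammar:
    """A trivially simple 'grammar': a list of allowed-token sets, one per position."""

    def __init__(self, allowed_per_position):
        self._allowed = allowed_per_position
        self._pos = 0

    def reset(self):
        self._pos = 0

    def allowed(self):
        if self._pos < len(self._allowed):
            return self._allowed[self._pos]
        return set()

    def accept(self, token):
        if token not in self.allowed():
            raise ValueError(f"token {token} not allowed at position {self._pos}")
        self._pos += 1


def sample_greedy(logits, grammar=None):
    """Pick the highest-logit token, optionally restricted by a grammar."""
    candidates = dict(logits)
    if grammar is not None:
        # Step 1: remove candidates the grammar forbids at this position.
        permitted = grammar.allowed()
        candidates = {t: v for t, v in candidates.items() if t in permitted}
    # Step 2: choose among the remaining candidates (greedy here).
    token = max(candidates, key=candidates.get)
    if grammar is not None:
        # Step 3: advance the grammar state with the chosen token.
        grammar.accept(token)
    return token
```

Because the grammar carries state between calls, it has to be reset before a new generation starts, which is what the `grammar.reset()` call added to `generate` later in this diff does.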
@@ -627,6 +666,7 @@ class Llama:
         mirostat_tau: float = 5.0,
         penalize_nl: bool = True,
         logits_processor: Optional[LogitsProcessorList] = None,
+        grammar: Optional[LlamaGrammar] = None,
     ):
         """Sample a token from the model.
@@ -660,6 +700,7 @@ class Llama:
             mirostat_eta=llama_cpp.c_float(mirostat_eta),
             penalize_nl=penalize_nl,
             logits_processor=logits_processor,
+            grammar=grammar,
         )
     def generate(
@@ -678,6 +719,7 @@ class Llama:
         mirostat_eta: float = 0.1,
         logits_processor: Optional[LogitsProcessorList] = None,
         stopping_criteria: Optional[StoppingCriteriaList] = None,
+        grammar: Optional[LlamaGrammar] = None,
     ) -> Generator[int, Optional[Sequence[int]], None]:
         """Create a generator of tokens from a prompt.
@@ -699,7 +741,6 @@ class Llama:
             The generated tokens.
         """
         assert self.ctx is not None
         if reset and len(self._input_ids) > 0:
             longest_prefix = 0
             for a, b in zip(self._input_ids, tokens[:-1]):
@@ -717,6 +758,9 @@ class Llama:
         if reset:
             self.reset()
+        if grammar is not None:
+            grammar.reset()
         while True:
             self.eval(tokens)
             token = self.sample(
@@ -731,6 +775,7 @@ class Llama:
                 mirostat_tau=mirostat_tau,
                 mirostat_eta=mirostat_eta,
                 logits_processor=logits_processor,
+                grammar=grammar,
             )
             if stopping_criteria is not None and stopping_criteria(
                 self._input_ids.tolist(), self._scores[-1, :].tolist()
@@ -833,6 +878,7 @@ class Llama:
         model: Optional[str] = None,
         stopping_criteria: Optional[StoppingCriteriaList] = None,
         logits_processor: Optional[LogitsProcessorList] = None,
+        grammar: Optional[LlamaGrammar] = None,
     ) -> Union[Iterator[Completion], Iterator[CompletionChunk]]:
         assert self.ctx is not None
@@ -840,7 +886,7 @@ class Llama:
         created: int = int(time.time())
         completion_tokens: List[int] = []
         # Add blank space to start of prompt to match OG llama tokenizer
-        prompt_tokens: List[int] = self.tokenize(b" " + prompt.encode("utf-8"))
+        prompt_tokens: List[int] = self.tokenize(prompt.encode("utf-8")) if prompt != "" else [self.token_bos()]
         text: bytes = b""
         returned_tokens: int = 0
         stop = (
@@ -910,6 +956,7 @@ class Llama:
             repeat_penalty=repeat_penalty,
             stopping_criteria=stopping_criteria,
             logits_processor=logits_processor,
+            grammar=grammar,
         ):
             if token == self._token_eos:
                 text = self.detokenize(completion_tokens)
@@ -960,9 +1007,7 @@ class Llama:
                 for token in remaining_tokens:
                     token_end_position += len(self.detokenize([token]))
                     # Check if stop sequence is in the token
-                    if token_end_position >= (
-                        remaining_length - first_stop_position
-                    ):
+                    if token_end_position >= (remaining_length - first_stop_position):
                         break
                     logprobs_or_none: Optional[CompletionLogprobs] = None
                     if logprobs is not None:
@@ -1256,6 +1301,7 @@ class Llama:
         model: Optional[str] = None,
         stopping_criteria: Optional[StoppingCriteriaList] = None,
         logits_processor: Optional[LogitsProcessorList] = None,
+        grammar: Optional[LlamaGrammar] = None,
     ) -> Union[Completion, Iterator[CompletionChunk]]:
         """Generate text from a prompt.
@@ -1300,6 +1346,7 @@ class Llama:
             model=model,
             stopping_criteria=stopping_criteria,
             logits_processor=logits_processor,
+            grammar=grammar
         )
         if stream:
             chunks: Iterator[CompletionChunk] = completion_or_chunks
@@ -1329,6 +1376,7 @@ class Llama:
         model: Optional[str] = None,
         stopping_criteria: Optional[StoppingCriteriaList] = None,
         logits_processor: Optional[LogitsProcessorList] = None,
+        grammar: Optional[LlamaGrammar] = None,
     ) -> Union[Completion, Iterator[CompletionChunk]]:
         """Generate text from a prompt.
@@ -1373,6 +1421,7 @@ class Llama:
             model=model,
             stopping_criteria=stopping_criteria,
             logits_processor=logits_processor,
+            grammar=grammar,
         )
     def _convert_text_completion_to_chat(
@@ -1453,6 +1502,7 @@ class Llama:
         mirostat_eta: float = 0.1,
         model: Optional[str] = None,
         logits_processor: Optional[LogitsProcessorList] = None,
+        grammar: Optional[LlamaGrammar] = None,
     ) -> Union[ChatCompletion, Iterator[ChatCompletionChunk]]:
         """Generate a chat completion from a list of messages.
@@ -1495,6 +1545,7 @@ class Llama:
             mirostat_eta=mirostat_eta,
             model=model,
             logits_processor=logits_processor,
+            grammar=grammar,
         )
         if stream:
             chunks: Iterator[CompletionChunk] = completion_or_chunks  # type: ignore
@@ -1504,10 +1555,10 @@ class Llama:
             return self._convert_text_completion_to_chat(completion)
     def __del__(self):
-        if self.model is not None:
+        if hasattr(self, "model") and self.model is not None:
             llama_cpp.llama_free_model(self.model)
             self.model = None
-        if self.ctx is not None:
+        if hasattr(self, "ctx") and self.ctx is not None:
             llama_cpp.llama_free(self.ctx)
             self.ctx = None
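The `hasattr` guards added to `__del__` in this hunk cover the case where `__init__` raises before the attribute is assigned: Python still calls `__del__` on the partially constructed object, and an unguarded attribute access would fail there. A minimal sketch of the pattern, using a hypothetical `Handle` class and a `freed` list in place of a native free call:

```python
freed = []  # records what was released, in place of a native free call


class Handle:
    def __init__(self, path):
        if not path:
            # Raised before self.resource exists, like the model-path
            # check that precedes loading in the diff.
            raise ValueError("path must not be empty")
        self.resource = f"resource:{path}"

    def __del__(self):
        # Guard against partially constructed instances and double frees.
        if hasattr(self, "resource") and self.resource is not None:
            freed.append(self.resource)
            self.resource = None
```

Setting the attribute to `None` after release makes a second call to `__del__` a no-op, which mirrors the `self.model = None` and `self.ctx = None` assignments in the hunk.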
@@ -1531,13 +1582,7 @@ class Llama:
             lora_base=self.lora_base,
             lora_path=self.lora_path,
             tensor_split=self.tensor_split,
-            ### TEMPORARY ###
-            n_gqa=self.params.n_gqa,
-            rms_norm_eps=self.params.rms_norm_eps,
-            ### TEMPORARY ###
-            ### DEPRECATED ###
-            n_parts=self.n_parts,
-            ### DEPRECATED ###
+            mul_mat_q=self.params.mul_mat_q,
         )
     def __setstate__(self, state):
@@ -1559,14 +1604,8 @@ class Llama:
             lora_base=state["lora_base"],
             lora_path=state["lora_path"],
             tensor_split=state["tensor_split"],
+            mul_mat_q=state["mul_mat_q"],
             verbose=state["verbose"],
-            ### TEMPORARY ###
-            n_gqa=state["n_gqa"],
-            rms_norm_eps=state["rms_norm_eps"],
-            ### TEMPORARY ###
-            ### DEPRECATED ###
-            n_parts=state["n_parts"],
-            ### DEPRECATED ###
         )
     def save_state(self) -> LlamaState:
@@ -1631,20 +1670,20 @@ class Llama:
         assert self.ctx is not None
         return LlamaTokenizer(self)
-    @staticmethod
-    def token_eos() -> int:
+    def token_eos(self) -> int:
         """Return the end-of-sequence token."""
-        return llama_cpp.llama_token_eos()
+        assert self.ctx is not None
+        return llama_cpp.llama_token_eos(self.ctx)
-    @staticmethod
-    def token_bos() -> int:
+    def token_bos(self) -> int:
         """Return the beginning-of-sequence token."""
-        return llama_cpp.llama_token_bos()
+        assert self.ctx is not None
+        return llama_cpp.llama_token_bos(self.ctx)
-    @staticmethod
-    def token_nl() -> int:
+    def token_nl(self) -> int:
         """Return the newline token."""
-        return llama_cpp.llama_token_nl()
+        assert self.ctx is not None
+        return llama_cpp.llama_token_nl(self.ctx)
     @staticmethod
     def logits_to_logprobs(logits: List[float]) -> List[float]:
