PyPI - ovos-gguf-plugin - Versions diffs - 1.2.0a3__tar.gz - Mend

ovos-gguf-plugin 1.2.0a3__tar.gz

Files changed (31)

ovos_gguf_plugin-1.2.0a3/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2024 OpenVoiceOS
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

ovos_gguf_plugin-1.2.0a3/PKG-INFO ADDED

@@ -0,0 +1,226 @@
+Metadata-Version: 2.4
+Name: ovos-gguf-plugin
+Version: 1.2.0a3
+Summary: local LLM plugin for OpenVoiceOS persona framework
+Author-email: jarbasai <jarbasai@mailfence.com>
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/OpenVoiceOS/ovos-gguf-plugin
+Project-URL: Repository, https://github.com/OpenVoiceOS/ovos-gguf-plugin
+Keywords: OVOS,openvoiceos,plugin,utterance,fallback,query
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: ovos-plugin-manager<3.0.0,>=2.2.3a1
+Requires-Dist: ovos-spec-tools>=0.8.0a2
+Requires-Dist: huggingface-hub
+Requires-Dist: llama-cpp-python
+Requires-Dist: numpy
+Requires-Dist: sentence_stream
+Provides-Extra: test
+Requires-Dist: pytest; extra == "test"
+Requires-Dist: numpy; extra == "test"
+Requires-Dist: ovoscope; extra == "test"
+Requires-Dist: ovos-persona; extra == "test"
+Requires-Dist: ovos-solver-failure-plugin; extra == "test"
+Dynamic: license-file
+# ovos-gguf-plugin
+Unified GGUF wrapper for OpenVoiceOS — chat, summarization, dialog rewriting, translation, language detection, and text embeddings, all backed by quantized GGUF models via `llama-cpp-python`.
+## Install
+```bash
+pip install ovos-gguf-plugin
+```
+For GPU inference, rebuild `llama-cpp-python` with CUDA support first:
+```bash
+CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --no-cache-dir
+```
+## Plugin entry points
+| Entry-point group | Plugin name | Class | Role |
+|---|---|---|---|
+| `opm.agents.chat` | `ovos-chat-gguf-plugin` | `GGUFChatEngine` | conversational chat / question answering |
+| `opm.agents.summarizer` | `ovos-summarizer-gguf-plugin` | `GGUFSummarizer` | text summarization |
+| `opm.transformer.dialog` | `ovos-dialog-transformer-gguf-plugin` | `GGUFDialogTransformer` | dialog rewriting |
+| `opm.lang.translate` | `ovos-translate-gguf-plugin` | `GGUFTextTranslator` | machine translation |
+| `opm.lang.detect` | `ovos-lang-detect-gguf-plugin` | `GGUFTextLangDetector` | language detection |
+| `opm.embeddings.text` | `ovos-gguf-embeddings-plugin` | `GGUFEmbeddings` | text embeddings |
+## Quickstart
+### Chat
+```python
+from ovos_gguf_plugin.chat import GGUFChatEngine
+from ovos_plugin_manager.templates.agents import AgentMessage, MessageRole
+engine = GGUFChatEngine({
+    "model": "afrideva/Smol-Llama-101M-Chat-v1-GGUF",
+    "remote_filename": "*q2_k.gguf",
+    "max_tokens": 128,
+})
+msgs = [AgentMessage(role=MessageRole.USER, content="Tell me a joke.")]
+# stream sentence-by-sentence (suitable for TTS)
+for sentence in engine.stream_sentences(msgs):
+    print(sentence)
+# or get the full response at once
+reply = engine.continue_chat(msgs)
+print(reply.content)
+```
+### Summarizer
+```python
+from ovos_gguf_plugin.summarizer import GGUFSummarizer
+s = GGUFSummarizer({
+    "model": "Qwen/Qwen2-0.5B-Instruct-GGUF",
+    "remote_filename": "*q8_0.gguf",
+})
+print(s.summarize("Long document text goes here ... " * 20))
+```
+### Dialog transformer
+```python
+from ovos_gguf_plugin.dialog_transformers import GGUFDialogTransformer
+dt = GGUFDialogTransformer({
+    "model": "Qwen/Qwen2-0.5B-Instruct-GGUF",
+    "remote_filename": "*q8_0.gguf",
+})
+print(dt.transform("gonna grab some food real quick"))
+```
+### Translation
+```python
+from ovos_gguf_plugin.translate import GGUFTextTranslator
+tx = GGUFTextTranslator({
+    "model": "TheBloke/TowerInstruct-7B-v0.1-GGUF",
+    "remote_filename": "*Q4_K_M.gguf",
+})
+print(tx.translate("the easiest way to contribute is to help with translations",
+                   target="es-es"))
+```
+### Language detection
+```python
+from ovos_gguf_plugin.translate import GGUFTextLangDetector
+dt = GGUFTextLangDetector({
+    "model": "Qwen/Qwen2-0.5B-Instruct-GGUF",
+    "remote_filename": "*q8_0.gguf",
+})
+print(dt.detect("you can help without any programming knowledge"))  # → en
+```
+### Text embeddings
+```python
+from ovos_gguf_plugin.embeddings import GGUFEmbeddings
+emb = GGUFEmbeddings({"model": "all-MiniLM-L6-v2"})
+vector = emb.get_embeddings("hello world")
+print(len(vector), "dims")
+```
+`model` accepts a friendly name from `GGUFEmbeddings.DEFAULT_MODELS` (e.g. `labse`, `all-MiniLM-L6-v2`, `nomic-embed-text-v1.5`, `bge-large-en-v1.5`), a bare Hugging Face repo id (with `remote_filename`), or a local `.gguf` path. Default is `labse`.
+As an OVOS text-embeddings plugin it is selected by name (`ovos-gguf-embeddings-plugin`), so it is a drop-in for anything that previously used the standalone embeddings plugin.
+## Configuration
+All wrappers share the same config keys:
+| Key | Default | Description |
+|---|---|---|
+| `model` | required | Local `.gguf` path, HuggingFace repo id, or friendly name (embeddings) |
+| `remote_filename` | `*Q4_K_M.gguf` | Glob for selecting the file from a HF repo |
+| `n_gpu_layers` | `0` | GPU layers to offload (`-1` = all) |
+| `chat_format` | `None` | llama.cpp chat format (auto-detected for most models) |
+| `verbose` | `True` | llama.cpp verbosity |
+| `max_tokens` | `512` | Maximum tokens to generate |
+| `system_prompt` | locale default | Override the system prompt |
+See [`docs/configuration.md`](docs/configuration.md) for the full reference including per-wrapper options and GPU build instructions.
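Editor's note: `remote_filename` is a glob, so its letter case matters when it is matched against the files in a repo. The sketch below assumes shell-style matching as provided by Python's `fnmatch` module; the `pick_file` helper and the file listing are hypothetical and only illustrate the matching rule, not how the plugin or `llama-cpp-python` resolves files internally.

```python
from fnmatch import fnmatchcase
from typing import List


def pick_file(repo_files: List[str], pattern: str) -> str:
    """Return the single file name matching a shell-style glob (case-sensitive)."""
    matches = [name for name in repo_files if fnmatchcase(name, pattern)]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match for {pattern!r}, got {matches}")
    return matches[0]


# Hypothetical repo listing, for illustration only.
repo_files = ["model.q2_k.gguf", "model.q8_0.gguf", "model.Q4_K_M.gguf", "README.md"]

print(pick_file(repo_files, "*q8_0.gguf"))
print(pick_file(repo_files, "*Q4_K_M.gguf"))
```

With case-sensitive matching, a pattern such as `*q4_k_m.gguf` would not match `model.Q4_K_M.gguf`, so it is worth copying the quantization suffix exactly as it appears in the repo.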
+## Localized prompts
+System prompts and templates ship as `.prompt` resource files under `ovos_gguf_plugin/locale/<lang>/` and are loaded via [ovos-spec-tools](https://github.com/OpenVoiceOS/ovos-spec-tools) (OVOS-INTENT-2 §4.4). Drop translated `.prompt` files under a new `locale/<lang>/` to add a language; English `en-us` ships by default and is the fallback. A `system_prompt` in config overrides the locale file.
+See [`docs/localization.md`](docs/localization.md) for the full guide.
+## OVOS Persona Framework
+```json
+{
+  "name": "MyAssistant",
+  "solvers": ["ovos-solver-gguf-plugin"],
+  "ovos-solver-gguf-plugin": {
+    "model": "TheBloke/notus-7B-v1-GGUF",
+    "remote_filename": "*Q4_K_M.gguf",
+    "persona": "You are a helpful assistant.",
+    "verbose": false
+  }
+}
+```
+```bash
+ovos-persona-server --persona my_persona.json
+```
+## Documentation
+- [`docs/configuration.md`](docs/configuration.md) — full config reference, GPU build, per-wrapper notes
+- [`docs/localization.md`](docs/localization.md) — `.prompt` system, adding a language
+- [`docs/models.md`](docs/models.md) — recommended models per wrapper (including tiny CI-friendly ones)
+## Examples
+Runnable scripts under [`examples/`](examples/):
+- [`chat_example.py`](examples/chat_example.py)
+- [`embeddings_example.py`](examples/embeddings_example.py)
+- [`translate_example.py`](examples/translate_example.py)
+- [`lang_detect_example.py`](examples/lang_detect_example.py)
+- [`summarize_example.py`](examples/summarize_example.py)
+## Testing
+```bash
+pip install "ovos-gguf-plugin[test]"
+python -m pytest test/ -v
+```
+The test suite contains:
+- `test/test_embeddings.py` — hermetic unit tests (mocked llama.cpp, no downloads)
+- `test/test_prompts.py` — hermetic unit tests for localized prompt loading
+- `test/test_e2e.py` — real-model end-to-end tests (downloads tiny GGUFs once, ~70 MB total):
+  - chat: `afrideva/Smol-Llama-101M-Chat-v1-GGUF` q2_k (~45 MB)
+  - embeddings: `leliuga/all-MiniLM-L6-v2-GGUF` Q4_K_M (~23 MB)
+## Credits
+Originally developed by [TigreGótico](https://tigregotico.pt) for [OpenVoiceOS](https://openvoiceos.org),
+sponsored by VisioLab. Modernized under the [NGI0 Commons Fund](https://nlnet.nl/commonsfund) / [NLnet](https://nlnet.nl).
+<img src="https://github.com/user-attachments/assets/809588a2-32a2-406c-98c0-f88bf7753cb4" width="220" alt="VisioLab"/>
+> This work was sponsored by VisioLab. VisioLab, part of [Royal Dutch Visio](https://visio.org/), is the test, education, and research center in the field of (innovative) assistive technology for blind and visually impaired people and professionals. We explore (new) technological developments such as Voice, VR and AI and make the knowledge and expertise we gain available to everyone.
+[![NGI0 Commons Fund](./ngi.png)](https://nlnet.nl/project/OpenVoiceOS)
+This project was funded through the [NGI0 Commons Fund](https://nlnet.nl/commonsfund),
+a fund established by [NLnet](https://nlnet.nl) with financial support from the
+European Commission's [Next Generation Internet](https://ngi.eu) programme, under
+the aegis of [DG Communications Networks, Content and Technology](https://commission.europa.eu/about-european-commission/departments-and-executive-agencies/communications-networks-content-and-technology_en)
+under grant agreement No [101135429](https://cordis.europa.eu/project/id/101135429).

ovos_gguf_plugin-1.2.0a3/README.md ADDED

@@ -0,0 +1,200 @@
+# ovos-gguf-plugin
+Unified GGUF wrapper for OpenVoiceOS — chat, summarization, dialog rewriting, translation, language detection, and text embeddings, all backed by quantized GGUF models via `llama-cpp-python`.
+## Install
+```bash
+pip install ovos-gguf-plugin
+```
+For GPU inference, rebuild `llama-cpp-python` with CUDA support first:
+```bash
+CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --no-cache-dir
+```
+## Plugin entry points
+| Entry-point group | Plugin name | Class | Role |
+|---|---|---|---|
+| `opm.agents.chat` | `ovos-chat-gguf-plugin` | `GGUFChatEngine` | conversational chat / question answering |
+| `opm.agents.summarizer` | `ovos-summarizer-gguf-plugin` | `GGUFSummarizer` | text summarization |
+| `opm.transformer.dialog` | `ovos-dialog-transformer-gguf-plugin` | `GGUFDialogTransformer` | dialog rewriting |
+| `opm.lang.translate` | `ovos-translate-gguf-plugin` | `GGUFTextTranslator` | machine translation |
+| `opm.lang.detect` | `ovos-lang-detect-gguf-plugin` | `GGUFTextLangDetector` | language detection |
+| `opm.embeddings.text` | `ovos-gguf-embeddings-plugin` | `GGUFEmbeddings` | text embeddings |
+## Quickstart
+### Chat
+```python
+from ovos_gguf_plugin.chat import GGUFChatEngine
+from ovos_plugin_manager.templates.agents import AgentMessage, MessageRole
+engine = GGUFChatEngine({
+    "model": "afrideva/Smol-Llama-101M-Chat-v1-GGUF",
+    "remote_filename": "*q2_k.gguf",
+    "max_tokens": 128,
+})
+msgs = [AgentMessage(role=MessageRole.USER, content="Tell me a joke.")]
+# stream sentence-by-sentence (suitable for TTS)
+for sentence in engine.stream_sentences(msgs):
+    print(sentence)
+# or get the full response at once
+reply = engine.continue_chat(msgs)
+print(reply.content)
+```
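Editor's note: to show why sentence-level streaming suits TTS, here is a minimal sketch of buffering streamed tokens until a sentence boundary appears. It is a simplified stand-in, not the plugin's implementation (which uses `sentence_stream.SentenceBoundaryDetector`); the `split_sentences` helper and its punctuation rule are assumptions made for illustration.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: '.', '!' or '?' followed by whitespace ends a sentence.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(tokens: Iterable[str], drop_incomplete: bool = True) -> Iterator[str]:
    """Yield complete sentences from a stream of text chunks."""
    buffer = ""
    for tok in tokens:
        buffer += tok
        parts = _BOUNDARY.split(buffer)
        # Every part except the last one is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    tail = buffer.strip()
    # The unfinished remainder is only emitted when explicitly requested.
    if tail and not drop_incomplete:
        yield tail


tokens = ["Hel", "lo the", "re. How ", "are you? I am", " fi"]
sentences = list(split_sentences(tokens))
print(sentences)
```

The `drop_incomplete` flag mirrors the idea behind the `drop_incomplete_sentences` config key read in `chat.py`: a fragment cut off by `max_tokens` is discarded rather than spoken.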
+### Summarizer
+```python
+from ovos_gguf_plugin.summarizer import GGUFSummarizer
+s = GGUFSummarizer({
+    "model": "Qwen/Qwen2-0.5B-Instruct-GGUF",
+    "remote_filename": "*q8_0.gguf",
+})
+print(s.summarize("Long document text goes here ... " * 20))
+```
+### Dialog transformer
+```python
+from ovos_gguf_plugin.dialog_transformers import GGUFDialogTransformer
+dt = GGUFDialogTransformer({
+    "model": "Qwen/Qwen2-0.5B-Instruct-GGUF",
+    "remote_filename": "*q8_0.gguf",
+})
+print(dt.transform("gonna grab some food real quick"))
+```
+### Translation
+```python
+from ovos_gguf_plugin.translate import GGUFTextTranslator
+tx = GGUFTextTranslator({
+    "model": "TheBloke/TowerInstruct-7B-v0.1-GGUF",
+    "remote_filename": "*Q4_K_M.gguf",
+})
+print(tx.translate("the easiest way to contribute is to help with translations",
+                   target="es-es"))
+```
+### Language detection
+```python
+from ovos_gguf_plugin.translate import GGUFTextLangDetector
+dt = GGUFTextLangDetector({
+    "model": "Qwen/Qwen2-0.5B-Instruct-GGUF",
+    "remote_filename": "*q8_0.gguf",
+})
+print(dt.detect("you can help without any programming knowledge"))  # → en
+```
+### Text embeddings
+```python
+from ovos_gguf_plugin.embeddings import GGUFEmbeddings
+emb = GGUFEmbeddings({"model": "all-MiniLM-L6-v2"})
+vector = emb.get_embeddings("hello world")
+print(len(vector), "dims")
+```
+`model` accepts a friendly name from `GGUFEmbeddings.DEFAULT_MODELS` (e.g. `labse`, `all-MiniLM-L6-v2`, `nomic-embed-text-v1.5`, `bge-large-en-v1.5`), a bare Hugging Face repo id (with `remote_filename`), or a local `.gguf` path. Default is `labse`.
+As an OVOS text-embeddings plugin it is selected by name (`ovos-gguf-embeddings-plugin`), so it is a drop-in for anything that previously used the standalone embeddings plugin.
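Editor's note: embedding vectors are usually compared with cosine similarity. The sketch below uses short hand-written vectors as stand-ins for real model output and assumes `get_embeddings` returns a flat sequence of floats; `cosine_similarity` is a hypothetical helper, not part of the plugin.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors, made up for illustration.
v_a = [1.0, 2.0, 3.0]
v_b = [2.0, 4.0, 6.0]   # same direction as v_a
v_c = [3.0, 0.0, -1.0]  # orthogonal to v_a

print(cosine_similarity(v_a, v_b))
print(cosine_similarity(v_a, v_c))
```

In practice `v_a` and `v_b` would come from two `emb.get_embeddings(...)` calls on the same model, since vectors from different models are not comparable.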
+## Configuration
+All wrappers share the same config keys:
+| Key | Default | Description |
+|---|---|---|
+| `model` | required | Local `.gguf` path, HuggingFace repo id, or friendly name (embeddings) |
+| `remote_filename` | `*Q4_K_M.gguf` | Glob for selecting the file from a HF repo |
+| `n_gpu_layers` | `0` | GPU layers to offload (`-1` = all) |
+| `chat_format` | `None` | llama.cpp chat format (auto-detected for most models) |
+| `verbose` | `True` | llama.cpp verbosity |
+| `max_tokens` | `512` | Maximum tokens to generate |
+| `system_prompt` | locale default | Override the system prompt |
+See [`docs/configuration.md`](docs/configuration.md) for the full reference including per-wrapper options and GPU build instructions.
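Editor's note: the defaults in the table above can be pictured as a dictionary merged under the user's config. This is a sketch only: `DEFAULTS` and `resolve_config` are hypothetical names, the values are copied from the table, and the chat wrapper in this release reads each key individually with `self.config.get(...)` rather than through such a helper.

```python
from typing import Any, Dict

# Default values as listed in the README table.
DEFAULTS: Dict[str, Any] = {
    "remote_filename": "*Q4_K_M.gguf",
    "n_gpu_layers": 0,
    "chat_format": None,
    "verbose": True,
    "max_tokens": 512,
}


def resolve_config(user_config: Dict[str, Any]) -> Dict[str, Any]:
    """Merge user settings over the documented defaults; 'model' is required."""
    if "model" not in user_config:
        raise ValueError("no 'model' set in config")
    return {**DEFAULTS, **user_config}


cfg = resolve_config({"model": "Qwen/Qwen2-0.5B-Instruct-GGUF", "n_gpu_layers": -1})
print(cfg["remote_filename"], cfg["n_gpu_layers"], cfg["max_tokens"])
```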
+## Localized prompts
+System prompts and templates ship as `.prompt` resource files under `ovos_gguf_plugin/locale/<lang>/` and are loaded via [ovos-spec-tools](https://github.com/OpenVoiceOS/ovos-spec-tools) (OVOS-INTENT-2 §4.4). Drop translated `.prompt` files under a new `locale/<lang>/` to add a language; English `en-us` ships by default and is the fallback. A `system_prompt` in config overrides the locale file.
+See [`docs/localization.md`](docs/localization.md) for the full guide.
+## OVOS Persona Framework
+```json
+{
+  "name": "MyAssistant",
+  "solvers": ["ovos-solver-gguf-plugin"],
+  "ovos-solver-gguf-plugin": {
+    "model": "TheBloke/notus-7B-v1-GGUF",
+    "remote_filename": "*Q4_K_M.gguf",
+    "persona": "You are a helpful assistant.",
+    "verbose": false
+  }
+}
+```
+```bash
+ovos-persona-server --persona my_persona.json
+```
+## Documentation
+- [`docs/configuration.md`](docs/configuration.md) — full config reference, GPU build, per-wrapper notes
+- [`docs/localization.md`](docs/localization.md) — `.prompt` system, adding a language
+- [`docs/models.md`](docs/models.md) — recommended models per wrapper (including tiny CI-friendly ones)
+## Examples
+Runnable scripts under [`examples/`](examples/):
+- [`chat_example.py`](examples/chat_example.py)
+- [`embeddings_example.py`](examples/embeddings_example.py)
+- [`translate_example.py`](examples/translate_example.py)
+- [`lang_detect_example.py`](examples/lang_detect_example.py)
+- [`summarize_example.py`](examples/summarize_example.py)
+## Testing
+```bash
+pip install "ovos-gguf-plugin[test]"
+python -m pytest test/ -v
+```
+The test suite contains:
+- `test/test_embeddings.py` — hermetic unit tests (mocked llama.cpp, no downloads)
+- `test/test_prompts.py` — hermetic unit tests for localized prompt loading
+- `test/test_e2e.py` — real-model end-to-end tests (downloads tiny GGUFs once, ~70 MB total):
+  - chat: `afrideva/Smol-Llama-101M-Chat-v1-GGUF` q2_k (~45 MB)
+  - embeddings: `leliuga/all-MiniLM-L6-v2-GGUF` Q4_K_M (~23 MB)
+## Credits
+Originally developed by [TigreGótico](https://tigregotico.pt) for [OpenVoiceOS](https://openvoiceos.org),
+sponsored by VisioLab. Modernized under the [NGI0 Commons Fund](https://nlnet.nl/commonsfund) / [NLnet](https://nlnet.nl).
+<img src="https://github.com/user-attachments/assets/809588a2-32a2-406c-98c0-f88bf7753cb4" width="220" alt="VisioLab"/>
+> This work was sponsored by VisioLab. VisioLab, part of [Royal Dutch Visio](https://visio.org/), is the test, education, and research center in the field of (innovative) assistive technology for blind and visually impaired people and professionals. We explore (new) technological developments such as Voice, VR and AI and make the knowledge and expertise we gain available to everyone.
+[![NGI0 Commons Fund](./ngi.png)](https://nlnet.nl/project/OpenVoiceOS)
+This project was funded through the [NGI0 Commons Fund](https://nlnet.nl/commonsfund),
+a fund established by [NLnet](https://nlnet.nl) with financial support from the
+European Commission's [Next Generation Internet](https://ngi.eu) programme, under
+the aegis of [DG Communications Networks, Content and Technology](https://commission.europa.eu/about-european-commission/departments-and-executive-agencies/communications-networks-content-and-technology_en)
+under grant agreement No [101135429](https://cordis.europa.eu/project/id/101135429).

ovos_gguf_plugin-1.2.0a3/ovos_gguf_plugin/__init__.py ADDED

@@ -0,0 +1,5 @@
+from ovos_gguf_plugin.chat import GGUFChatEngine
+from ovos_gguf_plugin.summarizer import GGUFSummarizer
+from ovos_gguf_plugin.dialog_transformers import GGUFDialogTransformer
+from ovos_gguf_plugin.translate import GGUFTextLangDetector, GGUFTextTranslator
+from ovos_gguf_plugin.embeddings import GGUFEmbeddings
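
Editor's note: the classes re-exported here are the ones the README maps to entry-point groups such as `opm.agents.chat`. As a rough sketch of how such groups can be inspected with the standard library alone, the hypothetical `find_plugins` helper below lists whatever is registered under a group. OVOS Plugin Manager has its own loaders, so this is an illustration, not its API; the output depends on what is installed.

```python
from importlib.metadata import entry_points
from typing import Dict


def find_plugins(group: str) -> Dict[str, str]:
    """Map entry-point name -> 'module:attr' target for one group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry-point collection
        selected = eps.select(group=group)
    else:
        # Python 3.9: plain mapping of group name -> entry points
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


print(find_plugins("opm.agents.chat"))
```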

ovos_gguf_plugin-1.2.0a3/ovos_gguf_plugin/chat.py ADDED

@@ -0,0 +1,193 @@
+import os
+from typing import Dict, Optional, List, Any, Iterable
+from sentence_stream import SentenceBoundaryDetector
+from llama_cpp import Llama
+from ovos_plugin_manager.templates.agents import ChatEngine, AgentMessage, MessageRole
+from ovos_utils.log import LOG
+class GGUFChatEngine(ChatEngine):
+    def __init__(self, config: Optional[Dict[str, Any]] = None,
+                 gguf_engine: Optional[Llama] = None):
+        config = config or {}
+        super().__init__(config)
+        if gguf_engine:
+            self.model = gguf_engine
+        else:
+            if "model" not in self.config:
+                raise ValueError("no 'model' set in config")
+            model = self.config["model"]
+            if os.path.isfile(model):  # local path
+                LOG.info(f"Loading GGUF model: {model}")
+                self.model = Llama(
+                    model_path=model,
+                    n_gpu_layers=self.config.get("n_gpu_layers", 0),
+                    chat_format=self.config.get("chat_format"),
+                    verbose=self.config.get("verbose", True))
+            else:
+                fname = self.config.get("remote_filename", "*Q4_K_M.gguf")
+                LOG.info(f"Loading GGUF model from hub: {model} from file: {fname}")
+                self.model = Llama.from_pretrained(
+                    repo_id=model,
+                    filename=fname,
+                    n_gpu_layers=self.config.get("n_gpu_layers", 0),
+                    chat_format=self.config.get("chat_format"),
+                    verbose=self.config.get("verbose", True)
+                )
+            LOG.info("GGUF model loaded!")
+        self.system_prompt = self.config.get("system_prompt")
+        self.allow_system = self.config.get("allow_system_prompts") or False
+    def validate_messages(self, messages: List[AgentMessage]) -> List[AgentMessage]:
+        """
+        Prepares the message list by enforcing system prompt rules.
+        This method:
+        1. Strips existing system messages if `allow_system` is False.
+        2. Injects the configured `system_prompt` if it exists.
+        3. Merges the configured system prompt with an existing one if
+           `allow_system` is True.
+        Args:
+            messages (List[AgentMessage]): The raw input history of messages.
+        Returns:
+            List[AgentMessage]: The processed list of messages ready for the API.
+        """
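+        if not self.allow_system:
+            messages = [m for m in messages if m.role != MessageRole.SYSTEM]
+        else:
+            # copy so the caller's history is not mutated by the insert/replace below
+            messages = list(messages)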
+        if not messages:
+            if self.system_prompt:
+                return [AgentMessage(role=MessageRole.SYSTEM, content=self.system_prompt)]
+            return []
+        if self.system_prompt:
+            sysm = AgentMessage(role=MessageRole.SYSTEM, content=self.system_prompt)
+            if messages and messages[0].role == MessageRole.SYSTEM:
+                if self.allow_system:  # merge system prompts
+                    sysm = AgentMessage(role=MessageRole.SYSTEM,
+                                        content=self.system_prompt + "\n" + messages[0].content)
+                # replace system prompt
+                messages[0] = sysm
+            else:
+                messages.insert(0, sysm)
+        return messages
+    ###########################################################
+    # ChatEngine interface methods implemented by this plugin
+    ###########################################################
+    def continue_chat(self, messages: List[AgentMessage],
+                      session_id: str = "default",
+                      lang: Optional[str] = None,
+                      units: Optional[str] = None) -> AgentMessage:
+        """
+        Generate a response message based on the provided chat history.
+        Args:
+            messages (List[AgentMessage]): Full list of messages in the conversation.
+            session_id (str): Identifier for the session.
+            lang (str, optional): BCP-47 language code.
+            units (str, optional): Preferred unit system (e.g., "metric", "imperial").
+        Returns:
+            AgentMessage: The generated response message from the assistant.
+        """
+        ans = self.model.create_chat_completion(
+            messages=[
+                {"role": m.role, "content": m.content}
+                for m in self.validate_messages(messages)
+            ],
+            max_tokens=self.config.get("max_tokens"),
+            stream=False
+        )["choices"][0]["message"]
+        return AgentMessage(role=MessageRole.ASSISTANT, content=ans["content"])
+    def stream_tokens(self, messages: List[AgentMessage],
+                      session_id: str = "default",
+                      lang: Optional[str] = None,
+                      units: Optional[str] = None) -> Iterable[str]:
+        """
+        Stream back response tokens as they are generated.
+        Returns partial sentences and is not suitable for direct TTS.
+        Once merged, the output corresponds to the content of an AgentMessage with MessageRole.ASSISTANT.
+        Args:
+            messages (List[AgentMessage]): Full list of messages.
+            session_id (str): Identifier for the session.
+            lang (str, optional): Language code.
+            units (str, optional): Unit system.
+        Returns:
+            Iterable[str]: A stream of tokens/partial text.
+        """
+        # With stream=True, the output is of type `Iterator[CompletionChunk]`.
+        ans = self.model.create_chat_completion(
+            messages=[
+                {"role": m.role, "content": m.content}
+                for m in self.validate_messages(messages)
+            ],
+            max_tokens=self.config.get("max_tokens"),
+            stream=True
+        )
+        for item in ans:
+            chunk = item['choices'][0]["delta"].get("content")
+            if chunk:
+                yield chunk
+    def stream_sentences(self, messages: List[AgentMessage],
+                         session_id: str = "default",
+                         lang: Optional[str] = None,
+                         units: Optional[str] = None) -> Iterable[str]:
+        """
+        Stream back response sentences as they are generated.
+        Returns full sentences only, suitable for direct TTS.
+        Once merged, the output corresponds to the content of an AgentMessage with MessageRole.ASSISTANT.
+        Note:
+            Sentences are assembled from the output of stream_tokens.
+            A trailing incomplete sentence is dropped unless
+            `drop_incomplete_sentences` is set to False in the config.
+        Args:
+            messages (List[AgentMessage]): Full list of messages.
+            session_id (str): Identifier for the session.
+            lang (str, optional): Language code.
+            units (str, optional): Unit system.
+        Returns:
+            Iterable[str]: A stream of complete sentences.
+        """
+        boundary_detector = SentenceBoundaryDetector()
+        for tok in self.stream_tokens(messages, session_id=session_id, lang=lang, units=units):
+            yield from boundary_detector.add_chunk(tok)
+        final_text = boundary_detector.finish()
+        if final_text and not self.config.get("drop_incomplete_sentences", True):
+            yield final_text
+if __name__ == "__main__":
+    LOG.set_level("DEBUG")
+    cfg = {
+        "model": "Qwen/Qwen2-0.5B-Instruct-GGUF",
+        "remote_filename": "*q8_0.gguf"
+    }
+    query = """The possibility of alien life in the solar system has been a topic of interest for scientists and astronomers for many years. The search for extraterrestrial life has been a major focus of space exploration, with numerous missions and discoveries made in recent years. While there is still no concrete evidence of life beyond Earth, the search for alien life continues to be a fascinating and exciting endeavor.
+One of the most promising areas for the search for alien life is the moons of Jupiter and Saturn. These moons, such as Europa and Enceladus, are believed to have subsurface oceans that could potentially harbor life. The presence of water, a key ingredient for life as we know it, has been detected on these moons, and there are also indications of other necessary elements such as carbon, nitrogen, and oxygen.
+Another area of interest for the search for alien life is the asteroid belt between Mars and Jupiter. This region is home to millions of asteroids, some of which may have the right conditions for life to exist. For example, some asteroids have been found to have water and organic compounds, which are essential for life.
+In addition to the moons and asteroids of the solar system, there are also other potential locations for the search for alien life. For example, there are exoplanets, or planets outside of our solar system, that have been discovered in recent years. Some of these exoplanets are believed to be in the habitable zone, which means they are located in the right distance from their star to potentially have liquid water on their surface.
+Despite the potential for alien life in the solar system, there are still many uncertainties and unknowns. The search for extraterrestrial life is a complex and multifaceted endeavor that requires a combination of scientific research, technological advancements, and exploration. While there is still no concrete evidence of life beyond Earth, the search for alien life continues to be a fascinating and exciting endeavor that holds the potential for groundbreaking discoveries in the future."""
+    s = GGUFChatEngine(cfg)
+    messages = [AgentMessage(role=MessageRole.USER, content=query)]
+    for sent in s.stream_sentences(messages):
+        print(sent)
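
Editor's note: the system-prompt rules in `validate_messages` can be followed more easily in isolation. The sketch below restates them with a hypothetical `Msg` dataclass standing in for `AgentMessage` and plain strings standing in for `MessageRole`; `apply_system_prompt` is an illustrative name, and unlike the method it always works on a copy of the history.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Msg:  # hypothetical stand-in for AgentMessage
    role: str
    content: str


def apply_system_prompt(messages: List[Msg], system_prompt: Optional[str],
                        allow_system: bool) -> List[Msg]:
    """Strip, inject or merge system prompts without touching the input list."""
    # Filtering builds a new list, so the caller's history stays intact.
    msgs = [m for m in messages if allow_system or m.role != "system"]
    if not system_prompt:
        return msgs
    if msgs and msgs[0].role == "system":
        # Only reachable when allow_system is True: merge both prompts.
        msgs[0] = Msg("system", system_prompt + "\n" + msgs[0].content)
    else:
        msgs.insert(0, Msg("system", system_prompt))
    return msgs


history = [Msg("system", "Answer in Portuguese."), Msg("user", "Tell me a joke.")]
strict = apply_system_prompt(history, "You are a helpful assistant.", allow_system=False)
merged = apply_system_prompt(history, "You are a helpful assistant.", allow_system=True)
print(strict[0].content)
print(merged[0].content)
```

With `allow_system=False` the caller's system message is discarded and replaced by the configured one; with `allow_system=True` the configured prompt is prepended to the caller's prompt, separated by a newline.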