lemonade-sdk 8.0.3__tar.gz → 8.0.4__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.


This version of lemonade-sdk might be problematic. Click here for more details.

Files changed (77)
  1. {lemonade_sdk-8.0.3/src/lemonade_sdk.egg-info → lemonade_sdk-8.0.4}/PKG-INFO +5 -12
  2. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/README.md +4 -11
  3. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/humaneval.py +1 -1
  4. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/mmlu.py +1 -1
  5. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/oga/load.py +1 -1
  6. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/perplexity.py +2 -2
  7. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/quark/quark_load.py +1 -1
  8. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/quark/quark_quantize.py +2 -2
  9. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/server/llamacpp.py +130 -9
  10. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/server/serve.py +73 -0
  11. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/server/static/styles.css +424 -4
  12. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/server/static/webapp.html +301 -35
  13. lemonade_sdk-8.0.4/src/lemonade/version.py +1 -0
  14. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4/src/lemonade_sdk.egg-info}/PKG-INFO +5 -12
  15. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade_server/model_manager.py +12 -2
  16. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade_server/pydantic_models.py +25 -1
  17. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade_server/server_models.json +46 -44
  18. lemonade_sdk-8.0.3/src/lemonade/version.py +0 -1
  19. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/LICENSE +0 -0
  20. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/NOTICE.md +0 -0
  21. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/pyproject.toml +0 -0
  22. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/setup.cfg +0 -0
  23. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/setup.py +0 -0
  24. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/__init__.py +0 -0
  25. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/api.py +0 -0
  26. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/cache.py +0 -0
  27. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/cli.py +0 -0
  28. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/common/__init__.py +0 -0
  29. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/common/build.py +0 -0
  30. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/common/cli_helpers.py +0 -0
  31. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/common/exceptions.py +0 -0
  32. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/common/filesystem.py +0 -0
  33. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/common/network.py +0 -0
  34. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/common/printing.py +0 -0
  35. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/common/status.py +0 -0
  36. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/common/system_info.py +0 -0
  37. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/common/test_helpers.py +0 -0
  38. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/profilers/__init__.py +0 -0
  39. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/profilers/memory_tracker.py +0 -0
  40. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/profilers/profiler.py +0 -0
  41. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/sequence.py +0 -0
  42. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/state.py +0 -0
  43. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/__init__.py +0 -0
  44. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/accuracy.py +0 -0
  45. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/adapter.py +0 -0
  46. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/bench.py +0 -0
  47. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/huggingface/bench.py +0 -0
  48. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/huggingface/load.py +0 -0
  49. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/huggingface/utils.py +0 -0
  50. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/llamacpp/bench.py +0 -0
  51. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/llamacpp/load.py +0 -0
  52. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/management_tools.py +0 -0
  53. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/oga/__init__.py +0 -0
  54. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/oga/bench.py +0 -0
  55. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/oga/utils.py +0 -0
  56. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/prompt.py +0 -0
  57. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/quark/__init__.py +0 -0
  58. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/report/__init__.py +0 -0
  59. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/report/llm_report.py +0 -0
  60. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/report/table.py +0 -0
  61. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/server/__init__.py +0 -0
  62. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/server/static/favicon.ico +0 -0
  63. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/server/tool_calls.py +0 -0
  64. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/server/tray.py +0 -0
  65. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/server/utils/port.py +0 -0
  66. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/server/utils/system_tray.py +0 -0
  67. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/server/utils/thread.py +0 -0
  68. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/server/webapp.py +0 -0
  69. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade/tools/tool.py +0 -0
  70. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade_install/__init__.py +0 -0
  71. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade_install/install.py +0 -0
  72. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade_sdk.egg-info/SOURCES.txt +0 -0
  73. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade_sdk.egg-info/dependency_links.txt +0 -0
  74. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade_sdk.egg-info/entry_points.txt +0 -0
  75. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade_sdk.egg-info/requires.txt +0 -0
  76. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade_sdk.egg-info/top_level.txt +0 -0
  77. {lemonade_sdk-8.0.3 → lemonade_sdk-8.0.4}/src/lemonade_server/cli.py +0 -0
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: lemonade-sdk
3
- Version: 8.0.3
3
+ Version: 8.0.4
4
4
  Summary: Lemonade SDK: Your LLM Aide for Validation and Deployment
5
5
  Author-email: lemonade@amd.com
6
6
  Requires-Python: >=3.10, <3.12
@@ -82,7 +82,7 @@ Dynamic: summary
82
82
 
83
83
  [![Lemonade tests](https://github.com/lemonade-sdk/lemonade/actions/workflows/test_lemonade.yml/badge.svg)](https://github.com/lemonade-sdk/lemonade/tree/main/test "Check out our tests")
84
84
  [![OS - Windows | Linux](https://img.shields.io/badge/OS-windows%20%7C%20linux-blue)](docs/README.md#installation "Check out our instructions")
85
- [![Made with Python](https://img.shields.io/badge/Python-3.8,3.10-blue?logo=python&logoColor=white)](docs/README.md#installation "Check out our instructions")
85
+ [![Made with Python](https://img.shields.io/badge/Python-3.10-blue?logo=python&logoColor=white)](docs/README.md#installation "Check out our instructions")
86
86
 
87
87
  ## 🍋 Lemonade SDK: Quickly serve, benchmark and deploy LLMs
88
88
 
@@ -97,8 +97,8 @@ The [Lemonade SDK](./docs/README.md) makes it easy to run Large Language Models
97
97
  The [Lemonade SDK](./docs/README.md) is comprised of the following:
98
98
 
99
99
  - 🌐 **[Lemonade Server](https://lemonade-server.ai/docs)**: A local LLM server for running ONNX and GGUF models using the OpenAI API standard. Install and enable your applications with NPU and GPU acceleration in minutes.
100
- - 🐍 **Lemonade API**: High-level Python API to directly integrate Lemonade LLMs into Python applications.
101
- - 🖥️ **Lemonade CLI**: The `lemonade` CLI lets you mix-and-match LLMs (ONNX, GGUF, SafeTensors) with measurement tools to characterize your models on your hardware. The available tools are:
100
+ - 🐍 **[Lemonade API](./docs/lemonade_api.md)**: High-level Python API to directly integrate Lemonade LLMs into Python applications.
101
+ - 🖥️ **[Lemonade CLI](./docs/dev_cli/README.md)**: The `lemonade` CLI lets you mix-and-match LLMs (ONNX, GGUF, SafeTensors) with measurement tools to characterize your models on your hardware. The available tools are:
102
102
  - Prompting with templates.
103
103
  - Measuring accuracy with a variety of tests.
104
104
  - Benchmarking to get the time-to-first-token and tokens per second.
@@ -153,14 +153,7 @@ Maximum LLM performance requires the right hardware accelerator with the right i
153
153
  </tbody>
154
154
  </table>
155
155
 
156
-
157
-
158
- #### Inference Engines Overview
159
- | Engine | Description |
160
- | :--- | :--- |
161
- | **OnnxRuntime GenAI (OGA)** | Microsoft engine that runs `.onnx` models and enables hardware vendors to provide their own execution providers (EPs) to support specialized hardware, such as neural processing units (NPUs). |
162
- | **llamacpp** | Community-driven engine with strong GPU acceleration, support for thousands of `.gguf` models, and advanced features such as vision-language models (VLMs) and mixture-of-experts (MoEs). |
163
- | **Hugging Face (HF)** | Hugging Face's `transformers` library can run the original `.safetensors` trained weights for models on Meta's PyTorch engine, which provides a source of truth for accuracy measurement. |
156
+ To learn more about the supported hardware and software, visit the documentation [here](./docs/README.md#software-and-hardware-overview).
164
157
 
165
158
  ## Integrate Lemonade Server with Your Application
166
159
 
@@ -1,6 +1,6 @@
1
1
  [![Lemonade tests](https://github.com/lemonade-sdk/lemonade/actions/workflows/test_lemonade.yml/badge.svg)](https://github.com/lemonade-sdk/lemonade/tree/main/test "Check out our tests")
2
2
  [![OS - Windows | Linux](https://img.shields.io/badge/OS-windows%20%7C%20linux-blue)](docs/README.md#installation "Check out our instructions")
3
- [![Made with Python](https://img.shields.io/badge/Python-3.8,3.10-blue?logo=python&logoColor=white)](docs/README.md#installation "Check out our instructions")
3
+ [![Made with Python](https://img.shields.io/badge/Python-3.10-blue?logo=python&logoColor=white)](docs/README.md#installation "Check out our instructions")
4
4
 
5
5
  ## 🍋 Lemonade SDK: Quickly serve, benchmark and deploy LLMs
6
6
 
@@ -15,8 +15,8 @@ The [Lemonade SDK](./docs/README.md) makes it easy to run Large Language Models
15
15
  The [Lemonade SDK](./docs/README.md) is comprised of the following:
16
16
 
17
17
  - 🌐 **[Lemonade Server](https://lemonade-server.ai/docs)**: A local LLM server for running ONNX and GGUF models using the OpenAI API standard. Install and enable your applications with NPU and GPU acceleration in minutes.
18
- - 🐍 **Lemonade API**: High-level Python API to directly integrate Lemonade LLMs into Python applications.
19
- - 🖥️ **Lemonade CLI**: The `lemonade` CLI lets you mix-and-match LLMs (ONNX, GGUF, SafeTensors) with measurement tools to characterize your models on your hardware. The available tools are:
18
+ - 🐍 **[Lemonade API](./docs/lemonade_api.md)**: High-level Python API to directly integrate Lemonade LLMs into Python applications.
19
+ - 🖥️ **[Lemonade CLI](./docs/dev_cli/README.md)**: The `lemonade` CLI lets you mix-and-match LLMs (ONNX, GGUF, SafeTensors) with measurement tools to characterize your models on your hardware. The available tools are:
20
20
  - Prompting with templates.
21
21
  - Measuring accuracy with a variety of tests.
22
22
  - Benchmarking to get the time-to-first-token and tokens per second.
@@ -71,14 +71,7 @@ Maximum LLM performance requires the right hardware accelerator with the right i
71
71
  </tbody>
72
72
  </table>
73
73
 
74
-
75
-
76
- #### Inference Engines Overview
77
- | Engine | Description |
78
- | :--- | :--- |
79
- | **OnnxRuntime GenAI (OGA)** | Microsoft engine that runs `.onnx` models and enables hardware vendors to provide their own execution providers (EPs) to support specialized hardware, such as neural processing units (NPUs). |
80
- | **llamacpp** | Community-driven engine with strong GPU acceleration, support for thousands of `.gguf` models, and advanced features such as vision-language models (VLMs) and mixture-of-experts (MoEs). |
81
- | **Hugging Face (HF)** | Hugging Face's `transformers` library can run the original `.safetensors` trained weights for models on Meta's PyTorch engine, which provides a source of truth for accuracy measurement. |
74
+ To learn more about the supported hardware and software, visit the documentation [here](./docs/README.md#software-and-hardware-overview).
82
75
 
83
76
  ## Integrate Lemonade Server with Your Application
84
77
 
@@ -24,7 +24,7 @@ class AccuracyHumaneval(Tool):
24
24
  - pass@10: Percentage of problems solved within 10 generation attempts
25
25
  - pass@100: Percentage of problems solved within 100 generation attempts
26
26
 
27
- See docs/lemonade/humaneval_accuracy.md for more details
27
+ See docs/dev_cli/humaneval_accuracy.md for more details
28
28
  """
29
29
 
30
30
  unique_name = "accuracy-humaneval"
@@ -27,7 +27,7 @@ def min_handle_none(*args: int):
27
27
 
28
28
  class AccuracyMMLU(Tool):
29
29
  """
30
- See docs/lemonade/mmlu_accuracy.md for more details
30
+ See docs/dev_cli/mmlu_accuracy.md for more details
31
31
  """
32
32
 
33
33
  unique_name = "accuracy-mmlu"
@@ -58,7 +58,7 @@ class OgaLoad(FirstTool):
58
58
  Input: path to a checkpoint.
59
59
  Supported choices for cpu and igpu from HF model repository:
60
60
  LLM models on Huggingface supported by model_builder. See documentation
61
- (https://github.com/lemonade-sdk/lemonade/blob/main/docs/ort_genai_igpu.md)
61
+ (https://github.com/lemonade-sdk/lemonade/blob/main/docs/dev_cli/ort_genai_igpu.md)
62
62
  for supported models.
63
63
  Supported choices for npu from HF model repository:
64
64
  Models on Hugging Face that follow the "amd/**-onnx-ryzen-strix" pattern
@@ -17,7 +17,7 @@ class AccuracyPerplexity(Tool):
17
17
 
18
18
  Output state produced: None
19
19
 
20
- See docs/lemonade/perplexity.md for more details.
20
+ See docs/dev_cli/perplexity.md for more details.
21
21
  """
22
22
 
23
23
  unique_name = "accuracy-perplexity"
@@ -63,7 +63,7 @@ class AccuracyPerplexity(Tool):
63
63
  # try-except will allow a few more LLMs to work
64
64
  max_length = 2048
65
65
  # Set stride to half of the maximum input length for overlapping window processing
66
- # Refer to docs/perplexity.md for more information on sliding window
66
+ # Refer to docs/dev_cli/perplexity.md for more information on sliding window
67
67
  stride = max_length // 2
68
68
  # Determine the total sequence length of the tokenized input
69
69
  seq_len = encodings.input_ids.size(1)
@@ -18,7 +18,7 @@ class QuarkLoad(Tool):
18
18
  Output:
19
19
  - state of the loaded model
20
20
 
21
- See docs/quark.md for more details.
21
+ See docs/dev_cli/quark.md for more details.
22
22
  """
23
23
 
24
24
  unique_name = "quark-load"
@@ -25,7 +25,7 @@ class QuarkQuantize(Tool):
25
25
  Output:
26
26
  - Modifies `state` with quantized and optionally exported model.
27
27
 
28
- See docs/quark.md for more details.
28
+ See docs/dev_cli/quark.md for more details.
29
29
  """
30
30
 
31
31
  unique_name = "quark-quantize"
@@ -94,7 +94,7 @@ class QuarkQuantize(Tool):
94
94
  help="Number of samples for calibration.",
95
95
  )
96
96
 
97
- # See docs/quark.md for more details.
97
+ # See docs/dev_cli/quark.md for more details.
98
98
  parser.add_argument(
99
99
  "--quant-scheme",
100
100
  type=str,
@@ -16,11 +16,29 @@ from fastapi.responses import StreamingResponse
16
16
 
17
17
  from openai import OpenAI
18
18
 
19
- from lemonade_server.pydantic_models import ChatCompletionRequest, PullConfig
19
+ from lemonade_server.pydantic_models import (
20
+ ChatCompletionRequest,
21
+ PullConfig,
22
+ EmbeddingsRequest,
23
+ RerankingRequest,
24
+ )
20
25
  from lemonade_server.model_manager import ModelManager
21
26
  from lemonade.tools.server.utils.port import find_free_port
22
27
 
23
- LLAMA_VERSION = "b5699"
28
+ LLAMA_VERSION = "b5787"
29
+
30
+
31
+ def llamacpp_address(port: int) -> str:
32
+ """
33
+ Generate the base URL for the llamacpp server.
34
+
35
+ Args:
36
+ port: The port number the llamacpp server is running on
37
+
38
+ Returns:
39
+ The base URL for the llamacpp server
40
+ """
41
+ return f"http://127.0.0.1:{port}/v1"
24
42
 
25
43
 
26
44
  def get_llama_server_paths():
@@ -244,10 +262,24 @@ def _wait_for_load(llama_server_process: subprocess.Popen, port: int):
244
262
 
245
263
 
246
264
  def _launch_llama_subprocess(
247
- snapshot_files: dict, use_gpu: bool, telemetry: LlamaTelemetry
265
+ snapshot_files: dict,
266
+ use_gpu: bool,
267
+ telemetry: LlamaTelemetry,
268
+ supports_embeddings: bool = False,
269
+ supports_reranking: bool = False,
248
270
  ) -> subprocess.Popen:
249
271
  """
250
- Launch llama server subprocess with GPU or CPU configuration
272
+ Launch llama server subprocess with appropriate configuration.
273
+
274
+ Args:
275
+ snapshot_files: Dictionary of model files to load
276
+ use_gpu: Whether to use GPU acceleration
277
+ telemetry: Telemetry object for tracking performance metrics
278
+ supports_embeddings: Whether the model supports embeddings
279
+ supports_reranking: Whether the model supports reranking
280
+
281
+ Returns:
282
+ Subprocess handle for the llama server
251
283
  """
252
284
 
253
285
  # Get the current executable path (handles both Windows and Ubuntu structures)
@@ -271,6 +303,14 @@ def _launch_llama_subprocess(
271
303
  # reasoning_content field
272
304
  base_command.extend(["--reasoning-format", "none"])
273
305
 
306
+ # Add embeddings support if the model supports it
307
+ if supports_embeddings:
308
+ base_command.append("--embeddings")
309
+
310
+ # Add reranking support if the model supports it
311
+ if supports_reranking:
312
+ base_command.append("--reranking")
313
+
274
314
  # Configure GPU layers: 99 for GPU, 0 for CPU-only
275
315
  ngl_value = "99" if use_gpu else "0"
276
316
  command = base_command + ["-ngl", ngl_value]
@@ -310,7 +350,6 @@ def _launch_llama_subprocess(
310
350
 
311
351
 
312
352
  def server_load(model_config: PullConfig, telemetry: LlamaTelemetry):
313
-
314
353
  # Validate platform support before proceeding
315
354
  validate_platform_support()
316
355
 
@@ -367,15 +406,26 @@ def server_load(model_config: PullConfig, telemetry: LlamaTelemetry):
367
406
  logging.info("Cleaned up zip file")
368
407
 
369
408
  # Download the gguf to the hugging face cache
370
- snapshot_files = ModelManager().download_gguf(model_config)
409
+ model_manager = ModelManager()
410
+ snapshot_files = model_manager.download_gguf(model_config)
371
411
  logging.debug(f"GGUF file paths: {snapshot_files}")
372
412
 
413
+ # Check if model supports embeddings
414
+ supported_models = model_manager.supported_models
415
+ model_info = supported_models.get(model_config.model_name, {})
416
+ supports_embeddings = "embeddings" in model_info.get("labels", [])
417
+ supports_reranking = "reranking" in model_info.get("labels", [])
418
+
373
419
  # Start the llama-serve.exe process
374
420
  logging.debug(f"Using llama_server for GGUF model: {llama_server_exe_path}")
375
421
 
376
422
  # Attempt loading on GPU first
377
423
  llama_server_process = _launch_llama_subprocess(
378
- snapshot_files, use_gpu=True, telemetry=telemetry
424
+ snapshot_files,
425
+ use_gpu=True,
426
+ telemetry=telemetry,
427
+ supports_embeddings=supports_embeddings,
428
+ supports_reranking=supports_reranking,
379
429
  )
380
430
 
381
431
  # Check the /health endpoint until GPU server is ready
@@ -395,7 +445,11 @@ def server_load(model_config: PullConfig, telemetry: LlamaTelemetry):
395
445
  raise Exception("llamacpp GPU loading failed")
396
446
 
397
447
  llama_server_process = _launch_llama_subprocess(
398
- snapshot_files, use_gpu=False, telemetry=telemetry
448
+ snapshot_files,
449
+ use_gpu=False,
450
+ telemetry=telemetry,
451
+ supports_embeddings=supports_embeddings,
452
+ supports_reranking=supports_reranking,
399
453
  )
400
454
 
401
455
  # Check the /health endpoint until CPU server is ready
@@ -416,7 +470,7 @@ def server_load(model_config: PullConfig, telemetry: LlamaTelemetry):
416
470
  def chat_completion(
417
471
  chat_completion_request: ChatCompletionRequest, telemetry: LlamaTelemetry
418
472
  ):
419
- base_url = f"http://127.0.0.1:{telemetry.port}/v1"
473
+ base_url = llamacpp_address(telemetry.port)
420
474
  client = OpenAI(
421
475
  base_url=base_url,
422
476
  api_key="lemonade",
@@ -467,3 +521,70 @@ def chat_completion(
467
521
  status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
468
522
  detail=f"Chat completion error: {str(e)}",
469
523
  )
524
+
525
+
526
+ def embeddings(embeddings_request: EmbeddingsRequest, telemetry: LlamaTelemetry):
527
+ """
528
+ Generate embeddings using the llamacpp server.
529
+
530
+ Args:
531
+ embeddings_request: The embeddings request containing input text/tokens
532
+ telemetry: Telemetry object containing the server port
533
+
534
+ Returns:
535
+ Embeddings response from the llamacpp server
536
+ """
537
+ base_url = llamacpp_address(telemetry.port)
538
+ client = OpenAI(
539
+ base_url=base_url,
540
+ api_key="lemonade",
541
+ )
542
+
543
+ # Convert Pydantic model to dict and remove unset/null values
544
+ request_dict = embeddings_request.model_dump(exclude_unset=True, exclude_none=True)
545
+
546
+ try:
547
+ # Call the embeddings endpoint
548
+ response = client.embeddings.create(**request_dict)
549
+ return response
550
+
551
+ except Exception as e: # pylint: disable=broad-exception-caught
552
+ raise HTTPException(
553
+ status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
554
+ detail=f"Embeddings error: {str(e)}",
555
+ )
556
+
557
+
558
+ def reranking(reranking_request: RerankingRequest, telemetry: LlamaTelemetry):
559
+ """
560
+ Rerank documents based on their relevance to a query using the llamacpp server.
561
+
562
+ Args:
563
+ reranking_request: The reranking request containing query and documents
564
+ telemetry: Telemetry object containing the server port
565
+
566
+ Returns:
567
+ Reranking response from the llamacpp server containing ranked documents and scores
568
+ """
569
+ base_url = llamacpp_address(telemetry.port)
570
+
571
+ try:
572
+ # Convert Pydantic model to dict and exclude unset/null values
573
+ request_dict = reranking_request.model_dump(
574
+ exclude_unset=True, exclude_none=True
575
+ )
576
+
577
+ # Call the reranking endpoint directly since it's not supported by the OpenAI API
578
+ response = requests.post(
579
+ f"{base_url}/rerank",
580
+ json=request_dict,
581
+ )
582
+ response.raise_for_status()
583
+ return response.json()
584
+
585
+ except Exception as e:
586
+ logging.error("Error during reranking: %s", str(e))
587
+ raise HTTPException(
588
+ status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
589
+ detail=f"Reranking error: {str(e)}",
590
+ ) from e
@@ -54,6 +54,8 @@ from lemonade_server.pydantic_models import (
54
54
  LoadConfig,
55
55
  CompletionRequest,
56
56
  ChatCompletionRequest,
57
+ EmbeddingsRequest,
58
+ RerankingRequest,
57
59
  ResponsesRequest,
58
60
  PullConfig,
59
61
  DeleteConfig,
@@ -231,8 +233,13 @@ class Server(ManagementTool):
231
233
 
232
234
  # OpenAI-compatible routes
233
235
  self.app.post(f"{prefix}/chat/completions")(self.chat_completions)
236
+ self.app.post(f"{prefix}/embeddings")(self.embeddings)
234
237
  self.app.get(f"{prefix}/models")(self.models)
235
238
 
239
+ # JinaAI routes (jina.ai/reranker/)
240
+ self.app.post(f"{prefix}/reranking")(self.reranking)
241
+ self.app.post(f"{prefix}/rerank")(self.reranking)
242
+
236
243
  @staticmethod
237
244
  def parser(add_help: bool = True) -> argparse.ArgumentParser:
238
245
  parser = __class__.helpful_parser(
@@ -796,6 +803,72 @@ class Server(ManagementTool):
796
803
  created=int(time.time()),
797
804
  )
798
805
 
806
+ async def embeddings(self, embeddings_request: EmbeddingsRequest):
807
+ """
808
+ Generate embeddings for the provided input.
809
+ """
810
+ # Initialize load config from embeddings request
811
+ lc = LoadConfig(model_name=embeddings_request.model)
812
+
813
+ # Load the model if it's different from the currently loaded one
814
+ await self.load_llm(lc)
815
+
816
+ if self.llm_loaded.recipe == "llamacpp":
817
+ try:
818
+ return llamacpp.embeddings(embeddings_request, self.llama_telemetry)
819
+ except Exception as e: # pylint: disable=broad-exception-caught
820
+ # Check if model has embeddings label
821
+ model_info = ModelManager().supported_models.get(
822
+ self.llm_loaded.model_name, {}
823
+ )
824
+ if "embeddings" not in model_info.get("labels", []):
825
+ raise HTTPException(
826
+ status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
827
+ detail="You tried to generate embeddings for a model that is "
828
+ "not labeled as an embeddings model. Please use another model "
829
+ "or re-register the current model with the 'embeddings' label.",
830
+ ) from e
831
+ else:
832
+ raise e
833
+ else:
834
+ raise HTTPException(
835
+ status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
836
+ detail=f"Embeddings not supported for recipe: {self.llm_loaded.recipe}",
837
+ )
838
+
839
+ async def reranking(self, reranking_request: RerankingRequest):
840
+ """
841
+ Rerank documents based on their relevance to a query using the llamacpp server.
842
+ """
843
+ # Initialize load config from reranking request
844
+ lc = LoadConfig(model_name=reranking_request.model)
845
+
846
+ # Load the model if it's different from the currently loaded one
847
+ await self.load_llm(lc)
848
+
849
+ if self.llm_loaded.recipe == "llamacpp":
850
+ try:
851
+ return llamacpp.reranking(reranking_request, self.llama_telemetry)
852
+ except Exception as e: # pylint: disable=broad-exception-caught
853
+ # Check if model has reranking label
854
+ model_info = ModelManager().supported_models.get(
855
+ self.llm_loaded.model_name, {}
856
+ )
857
+ if "reranking" not in model_info.get("labels", []):
858
+ raise HTTPException(
859
+ status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
860
+ detail="You tried to use reranking for a model that is "
861
+ "not labeled as a reranking model. Please use another model "
862
+ "or re-register the current model with the 'reranking' label.",
863
+ ) from e
864
+ else:
865
+ raise e
866
+ else:
867
+ raise HTTPException(
868
+ status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
869
+ detail=f"Reranking not supported for recipe: {self.llm_loaded.recipe}",
870
+ )
871
+
799
872
  def apply_chat_template(
800
873
  self, messages: list[dict], tools: list[dict] | None = None
801
874
  ):