PyPI - lm-deluge - Versions diffs - 0.0.13__tar.gz → 0.0.15__tar.gz - Mend

lm-deluge 0.0.13__tar.gz → 0.0.15__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of lm-deluge might be problematic.

Files changed (82)

{lm_deluge-0.0.13/src/lm_deluge.egg-info → lm_deluge-0.0.15}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: lm_deluge
-Version: 0.0.13
+Version: 0.0.15
 Summary: Python utility for using LLM API models.
 Author-email: Benjamin Anderson <ben@trytaylor.ai>
 Requires-Python: >=3.10
@@ -30,6 +30,7 @@ Dynamic: license-file
 `lm-deluge` is a lightweight helper library for maxing out your rate limits with LLM providers. It provides the following:
 - **Unified client** – Send prompts to all relevant models with a single client.
+- **Files and Images** - Include images easily for multimodal models, and PDF files for models that support them (OpenAI and Anthropic).
 - **Massive concurrency with throttling** – Set `max_tokens_per_minute` and `max_requests_per_minute` and let it fly. The client will process as many requests as possible while respecting rate limits and retrying failures.
 - **Spray across models/providers** – Configure a client with multiple models from any provider(s), and sampling weights. The client samples a model for each request.
 - **Tool Use** – Unified API for defining tools for all providers, and creating tools automatically from python functions.
@@ -41,6 +42,8 @@ Dynamic: license-file
 **STREAMING IS NOT IN SCOPE.** There are plenty of packages that let you stream chat completions across providers. The sole purpose of this package is to do very fast batch inference using APIs. Sorry!
+**Update 06/02/2025:** I lied, it supports (very basic) streaming now via client.stream(...). It will print tokens as they arrive, then return an APIResponse at the end. More sophisticated streaming may or may not be implemented later, don't count on it.
 ## Installation
 ```bash

{lm_deluge-0.0.13 → lm_deluge-0.0.15}/README.md RENAMED

@@ -3,6 +3,7 @@
 `lm-deluge` is a lightweight helper library for maxing out your rate limits with LLM providers. It provides the following:
 - **Unified client** – Send prompts to all relevant models with a single client.
+- **Files and Images** - Include images easily for multimodal models, and PDF files for models that support them (OpenAI and Anthropic).
 - **Massive concurrency with throttling** – Set `max_tokens_per_minute` and `max_requests_per_minute` and let it fly. The client will process as many requests as possible while respecting rate limits and retrying failures.
 - **Spray across models/providers** – Configure a client with multiple models from any provider(s), and sampling weights. The client samples a model for each request.
 - **Tool Use** – Unified API for defining tools for all providers, and creating tools automatically from python functions.
@@ -14,6 +15,8 @@
 **STREAMING IS NOT IN SCOPE.** There are plenty of packages that let you stream chat completions across providers. The sole purpose of this package is to do very fast batch inference using APIs. Sorry!
+**Update 06/02/2025:** I lied, it supports (very basic) streaming now via client.stream(...). It will print tokens as they arrive, then return an APIResponse at the end. More sophisticated streaming may or may not be implemented later, don't count on it.
 ## Installation
 ```bash

{lm_deluge-0.0.13 → lm_deluge-0.0.15}/pyproject.toml RENAMED

@@ -3,7 +3,7 @@ requires = ["setuptools", "wheel"]
 [project]
 name = "lm_deluge"
-version = "0.0.13"
+version = "0.0.15"
 authors = [{ name = "Benjamin Anderson", email = "ben@trytaylor.ai" }]
 description = "Python utility for using LLM API models."
 readme = "README.md"

{lm_deluge-0.0.13 → lm_deluge-0.0.15}/src/lm_deluge/__init__.py RENAMED

@@ -1,6 +1,7 @@
 from .client import LLMClient, SamplingParams, APIResponse
 from .prompt import Conversation, Message
 from .tool import Tool
+from .file import File
 import dotenv
 dotenv.load_dotenv()
@@ -12,4 +13,5 @@ __all__ = [
     "Conversation",
     "Message",
     "Tool",
+    "File",
 ]

{lm_deluge-0.0.13 → lm_deluge-0.0.15}/src/lm_deluge/api_requests/base.py RENAMED

@@ -1,165 +1,19 @@
 import asyncio
-import json
 import random
 import traceback
 from abc import ABC, abstractmethod
-from dataclasses import dataclass
 from typing import Callable
 import aiohttp
 from aiohttp import ClientResponse
-from lm_deluge.prompt import CachePattern, Conversation, Message
-from lm_deluge.usage import Usage
+from lm_deluge.prompt import CachePattern, Conversation
 from ..config import SamplingParams
 from ..errors import raise_if_modal_exception
 from ..models import APIModel
 from ..tracker import StatusTracker
-@dataclass
-class APIResponse:
-    # request information
-    id: int  # should be unique to the request within a given prompt-processing call
-    model_internal: str  # our internal model tag
-    prompt: Conversation
-    sampling_params: SamplingParams
-    # http response information
-    status_code: int | None
-    is_error: bool | None
-    error_message: str | None
-    # completion information - unified usage tracking
-    usage: Usage | None = None
-    # response content - structured format
-    content: Message | None = None
-    # optional or calculated automatically
-    thinking: str | None = None  # if model shows thinking tokens
-    model_external: str | None = None  # the model tag used by the API
-    region: str | None = None
-    logprobs: list | None = None
-    finish_reason: str | None = None  # make required later
-    cost: float | None = None  # calculated automatically
-    cache_hit: bool = False  # manually set if true
-    # set to true if is_error and should be retried with a different model
-    retry_with_different_model: bool | None = False
-    # set to true if should NOT retry with the same model (unrecoverable error)
-    give_up_if_no_other_models: bool | None = False
-    # OpenAI Responses API specific - used for computer use continuation
-    response_id: str | None = None
-    # Raw API response for debugging
-    raw_response: dict | None = None
-    @property
-    def completion(self) -> str | None:
-        """Backward compatibility: extract text from content Message."""
-        if self.content is not None:
-            return self.content.completion
-        return None
-    @property
-    def input_tokens(self) -> int | None:
-        """Get input tokens from usage object."""
-        return self.usage.input_tokens if self.usage else None
-    @property
-    def output_tokens(self) -> int | None:
-        """Get output tokens from usage object."""
-        return self.usage.output_tokens if self.usage else None
-    @property
-    def cache_read_tokens(self) -> int | None:
-        """Get cache read tokens from usage object."""
-        return self.usage.cache_read_tokens if self.usage else None
-    @property
-    def cache_write_tokens(self) -> int | None:
-        """Get cache write tokens from usage object."""
-        return self.usage.cache_write_tokens if self.usage else None
-    def __post_init__(self):
-        # calculate cost & get external model name
-        self.id = int(self.id)
-        api_model = APIModel.from_registry(self.model_internal)
-        self.model_external = api_model.name
-        self.cost = None
-        if (
-            self.usage is not None
-            and api_model.input_cost is not None
-            and api_model.output_cost is not None
-        ):
-            self.cost = (
-                self.usage.input_tokens * api_model.input_cost / 1e6
-                + self.usage.output_tokens * api_model.output_cost / 1e6
-            )
-        elif self.content is not None and self.completion is not None:
-            print(
-                f"Warning: Completion provided without token counts for model {self.model_internal}."
-            )
-    def to_dict(self):
-        return {
-            "id": self.id,
-            "model_internal": self.model_internal,
-            "model_external": self.model_external,
-            "region": self.region,
-            "prompt": self.prompt.to_log(),  # destroys image if present
-            "sampling_params": self.sampling_params.__dict__,
-            "status_code": self.status_code,
-            "is_error": self.is_error,
-            "error_message": self.error_message,
-            "completion": self.completion,  # computed property
-            "content": self.content.to_log() if self.content else None,
-            "usage": self.usage.to_dict() if self.usage else None,
-            "finish_reason": self.finish_reason,
-            "cost": self.cost,
-        }
-    @classmethod
-    def from_dict(cls, data: dict):
-        # Handle backward compatibility for content/completion
-        content = None
-        if "content" in data and data["content"] is not None:
-            # Reconstruct message from log format
-            content = Message.from_log(data["content"])
-        elif "completion" in data and data["completion"] is not None:
-            # Backward compatibility: create a Message with just text
-            content = Message.ai(data["completion"])
-        usage = None
-        if "usage" in data and data["usage"] is not None:
-            usage = Usage.from_dict(data["usage"])
-        return cls(
-            id=data.get("id", random.randint(0, 1_000_000_000)),
-            model_internal=data["model_internal"],
-            prompt=Conversation.from_log(data["prompt"]),
-            sampling_params=SamplingParams(**data["sampling_params"]),
-            status_code=data["status_code"],
-            is_error=data["is_error"],
-            error_message=data["error_message"],
-            usage=usage,
-            content=content,
-            thinking=data.get("thinking"),
-            model_external=data.get("model_external"),
-            region=data.get("region"),
-            logprobs=data.get("logprobs"),
-            finish_reason=data.get("finish_reason"),
-            cost=data.get("cost"),
-            cache_hit=data.get("cache_hit", False),
-        )
-    def write_to_file(self, filename):
-        """
-        Writes the APIResponse as a line to a file.
-        If file exists, appends to it.
-        """
-        with open(filename, "a") as f:
-            f.write(json.dumps(self.to_dict()) + "\n")
+from .response import APIResponse
 class APIRequestBase(ABC):
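
Note on the removed `APIResponse.__post_init__`: the cost is computed from token counts and per-model prices, and the division by `1e6` suggests the registry stores prices per million tokens. The following is a minimal sketch of that arithmetic only. `Pricing` and `estimate_cost` are hypothetical names used for illustration, not part of lm-deluge, and the prices are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pricing:
    # Hypothetical stand-in for a registry entry.
    # Assumption: prices are in USD per million tokens.
    input_cost: Optional[float]
    output_cost: Optional[float]


def estimate_cost(
    input_tokens: int, output_tokens: int, pricing: Pricing
) -> Optional[float]:
    """Mirror the arithmetic of the removed __post_init__ cost calculation."""
    if pricing.input_cost is None or pricing.output_cost is None:
        # The diffed code leaves cost as None when either price is missing.
        return None
    return (
        input_tokens * pricing.input_cost / 1e6
        + output_tokens * pricing.output_cost / 1e6
    )


# Made-up prices: 2.0 per million input tokens, 8.0 per million output tokens.
example_cost = estimate_cost(1_000_000, 500_000, Pricing(2.0, 8.0))
print(example_cost)  # 6.0
```

With these inputs the input side contributes 2.0 and the output side 4.0, so the total is 6.0.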

{lm_deluge-0.0.13 → lm_deluge-0.0.15}/src/lm_deluge/api_requests/common.py RENAMED

@@ -2,6 +2,7 @@ from .openai import OpenAIRequest, OpenAIResponsesRequest
 from .anthropic import AnthropicRequest
 from .mistral import MistralRequest
 from .bedrock import BedrockRequest
+from .gemini import GeminiRequest
 CLASSES = {
     "openai": OpenAIRequest,
@@ -9,4 +10,5 @@ CLASSES = {
     "anthropic": AnthropicRequest,
     "mistral": MistralRequest,
     "bedrock": BedrockRequest,
+    "gemini": GeminiRequest,
 }
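
Note on `CLASSES`: the dictionary maps a provider key to a request class, so adding Gemini support amounts to one new entry. The sketch below shows how such a registry is typically consulted. It assumes the key is the model's `api_spec` string (the diff for `openai.py` compares `model.api_spec` against `"openai"`), and `request_class_for` is a hypothetical helper, not a function from lm-deluge. The two classes are empty placeholders.

```python
class OpenAIRequest:
    """Placeholder standing in for the real request class."""


class GeminiRequest:
    """Placeholder standing in for the real request class."""


# Same shape as the CLASSES dict in the diff, reduced to two entries.
CLASSES = {
    "openai": OpenAIRequest,
    "gemini": GeminiRequest,
}


def request_class_for(api_spec: str) -> type:
    """Look up the request class for a provider key (hypothetical helper)."""
    try:
        return CLASSES[api_spec]
    except KeyError:
        raise ValueError(
            f"no request class registered for api_spec={api_spec!r}"
        ) from None


print(request_class_for("gemini").__name__)  # GeminiRequest
```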

lm_deluge-0.0.15/src/lm_deluge/api_requests/gemini.py ADDED

@@ -0,0 +1,222 @@
+import json
+import os
+import warnings
+from typing import Callable
+from aiohttp import ClientResponse
+from lm_deluge.tool import Tool
+from ..config import SamplingParams
+from ..models import APIModel
+from ..prompt import CachePattern, Conversation, Message, Text, Thinking, ToolCall
+from ..tracker import StatusTracker
+from ..usage import Usage
+from .base import APIRequestBase, APIResponse
+def _build_gemini_request(
+    model: APIModel,
+    prompt: Conversation,
+    tools: list[Tool] | None,
+    sampling_params: SamplingParams,
+) -> dict:
+    system_message, messages = prompt.to_gemini()
+    request_json = {
+        "contents": messages,
+        "generationConfig": {
+            "temperature": sampling_params.temperature,
+            "topP": sampling_params.top_p,
+            "maxOutputTokens": sampling_params.max_new_tokens,
+        },
+    }
+    # Add system instruction if present
+    if system_message:
+        request_json["systemInstruction"] = {"parts": [{"text": system_message}]}
+    # Handle reasoning models (thinking)
+    if model.reasoning_model:
+        request_json["generationConfig"]["thinkingConfig"] = {"includeThoughts": True}
+        if sampling_params.reasoning_effort and "flash" in model.id:
+            budget = {"low": 1024, "medium": 4096, "high": 16384}.get(
+                sampling_params.reasoning_effort
+            )
+            request_json["generationConfig"]["thinkingConfig"]["thinkingBudget"] = (
+                budget
+            )
+    else:
+        if sampling_params.reasoning_effort:
+            warnings.warn(
+                f"Ignoring reasoning_effort param for non-reasoning model: {model.name}"
+            )
+    # Add tools if provided
+    if tools:
+        tool_declarations = [tool.dump_for("google") for tool in tools]
+        request_json["tools"] = [{"functionDeclarations": tool_declarations}]
+    # Handle JSON mode
+    if sampling_params.json_mode and model.supports_json:
+        request_json["generationConfig"]["responseMimeType"] = "application/json"
+    return request_json
+class GeminiRequest(APIRequestBase):
+    def __init__(
+        self,
+        task_id: int,
+        model_name: str,  # must correspond to registry
+        prompt: Conversation,
+        attempts_left: int,
+        status_tracker: StatusTracker,
+        results_arr: list,
+        request_timeout: int = 30,
+        sampling_params: SamplingParams = SamplingParams(),
+        callback: Callable | None = None,
+        all_model_names: list[str] | None = None,
+        all_sampling_params: list[SamplingParams] | None = None,
+        tools: list | None = None,
+        cache: CachePattern | None = None,
+    ):
+        super().__init__(
+            task_id=task_id,
+            model_name=model_name,
+            prompt=prompt,
+            attempts_left=attempts_left,
+            status_tracker=status_tracker,
+            results_arr=results_arr,
+            request_timeout=request_timeout,
+            sampling_params=sampling_params,
+            callback=callback,
+            all_model_names=all_model_names,
+            all_sampling_params=all_sampling_params,
+            tools=tools,
+            cache=cache,
+        )
+        # Warn if cache is specified for Gemini model
+        if cache is not None:
+            warnings.warn(
+                f"Cache parameter '{cache}' is not supported for Gemini models, ignoring for {model_name}"
+            )
+        self.model = APIModel.from_registry(model_name)
+        # Gemini API endpoint format: https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent
+        self.url = f"{self.model.api_base}/models/{self.model.name}:generateContent"
+        self.request_header = {
+            "Content-Type": "application/json",
+        }
+        # Add API key as query parameter for Gemini
+        api_key = os.getenv(self.model.api_key_env_var)
+        if not api_key:
+            raise ValueError(
+                f"API key environment variable {self.model.api_key_env_var} not set"
+            )
+        self.url += f"?key={api_key}"
+        self.request_json = _build_gemini_request(
+            self.model, prompt, tools, sampling_params
+        )
+    async def handle_response(self, http_response: ClientResponse) -> APIResponse:
+        is_error = False
+        error_message = None
+        thinking = None
+        content = None
+        usage = None
+        status_code = http_response.status
+        mimetype = http_response.headers.get("Content-Type", None)
+        data = None
+        if status_code >= 200 and status_code < 300:
+            try:
+                data = await http_response.json()
+            except Exception as e:
+                is_error = True
+                error_message = (
+                    f"Error calling .json() on response w/ status {status_code}: {e}"
+                )
+            if not is_error:
+                assert data
+                try:
+                    # Parse Gemini response format
+                    parts = []
+                    if "candidates" in data and data["candidates"]:
+                        candidate = data["candidates"][0]
+                        if "content" in candidate and "parts" in candidate["content"]:
+                            for part in candidate["content"]["parts"]:
+                                if "text" in part:
+                                    parts.append(Text(part["text"]))
+                                elif "thought" in part:
+                                    parts.append(Thinking(part["thought"]))
+                                elif "functionCall" in part:
+                                    func_call = part["functionCall"]
+                                    # Generate a unique ID since Gemini doesn't provide one
+                                    import uuid
+                                    tool_id = f"call_{uuid.uuid4().hex[:8]}"
+                                    parts.append(
+                                        ToolCall(
+                                            id=tool_id,
+                                            name=func_call["name"],
+                                            arguments=func_call.get("args", {}),
+                                        )
+                                    )
+                    content = Message("assistant", parts)
+                    # Extract usage information if present
+                    if "usageMetadata" in data:
+                        usage_data = data["usageMetadata"]
+                        usage = Usage.from_gemini_usage(usage_data)
+                except Exception as e:
+                    is_error = True
+                    error_message = f"Error parsing Gemini response: {str(e)}"
+        elif mimetype and "json" in mimetype.lower():
+            is_error = True
+            try:
+                data = await http_response.json()
+                error_message = json.dumps(data)
+            except Exception:
+                error_message = (
+                    f"HTTP {status_code} with JSON content type but failed to parse"
+                )
+        else:
+            is_error = True
+            text = await http_response.text()
+            error_message = text
+        # Handle special kinds of errors
+        if is_error and error_message is not None:
+            if "rate limit" in error_message.lower() or status_code == 429:
+                error_message += " (Rate limit error, triggering cooldown.)"
+                self.status_tracker.rate_limit_exceeded()
+            if (
+                "context length" in error_message.lower()
+                or "token limit" in error_message.lower()
+            ):
+                error_message += " (Context length exceeded, set retries to 0.)"
+                self.attempts_left = 0
+        return APIResponse(
+            id=self.task_id,
+            status_code=status_code,
+            is_error=is_error,
+            error_message=error_message,
+            prompt=self.prompt,
+            content=content,
+            thinking=thinking,
+            model_internal=self.model_name,
+            sampling_params=self.sampling_params,
+            usage=usage,
+            raw_response=data,
+        )
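
Note on `_build_gemini_request`: the `generationConfig` part of the payload can be read in isolation. The sketch below restates it as a standalone function over plain values so the effort-to-budget mapping is easy to check. `build_generation_config` and its parameters are hypothetical names. One deliberate difference from the diffed code: an unrecognized `reasoning_effort` is skipped here, whereas the code above would write the `None` returned by `.get()` into `thinkingBudget`.

```python
from typing import Optional

# Budget table copied from the diff above.
THINKING_BUDGETS = {"low": 1024, "medium": 4096, "high": 16384}


def build_generation_config(
    temperature: float,
    top_p: float,
    max_new_tokens: int,
    reasoning_model: bool,
    reasoning_effort: Optional[str],
    model_id: str,
) -> dict:
    """Build the generationConfig dict in the shape used by the diffed code."""
    config: dict = {
        "temperature": temperature,
        "topP": top_p,
        "maxOutputTokens": max_new_tokens,
    }
    if reasoning_model:
        thinking: dict = {"includeThoughts": True}
        # As in the diff, a budget is only set for model ids containing "flash".
        if reasoning_effort and "flash" in model_id:
            budget = THINKING_BUDGETS.get(reasoning_effort)
            if budget is not None:  # assumption: skip unknown effort levels
                thinking["thinkingBudget"] = budget
        config["thinkingConfig"] = thinking
    return config


# The model id below is illustrative only.
example_config = build_generation_config(
    0.7, 1.0, 512, True, "medium", "gemini-2.5-flash"
)
print(example_config["thinkingConfig"])
```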

{lm_deluge-0.0.13 → lm_deluge-0.0.15}/src/lm_deluge/api_requests/openai.py RENAMED

@@ -1,17 +1,19 @@
-import warnings
-from aiohttp import ClientResponse
 import json
 import os
+import warnings
 from typing import Callable
+import aiohttp
+from aiohttp import ClientResponse
 from lm_deluge.tool import Tool
-from .base import APIRequestBase, APIResponse
-from ..prompt import Conversation, Message, Text, ToolCall, Thinking, CachePattern
-from ..usage import Usage
-from ..tracker import StatusTracker
 from ..config import SamplingParams
 from ..models import APIModel
+from ..prompt import CachePattern, Conversation, Message, Text, Thinking, ToolCall
+from ..tracker import StatusTracker
+from ..usage import Usage
+from .base import APIRequestBase, APIResponse
 def _build_oa_chat_request(
@@ -111,6 +113,7 @@ class OpenAIRequest(APIRequestBase):
         status_code = http_response.status
         mimetype = http_response.headers.get("Content-Type", None)
         data = None
+        finish_reason = None
         if status_code >= 200 and status_code < 300:
             try:
                 data = await http_response.json()
@@ -125,6 +128,7 @@ class OpenAIRequest(APIRequestBase):
                     # Parse response into Message with parts
                     parts = []
                     message = data["choices"][0]["message"]
+                    finish_reason = data["choices"][0]["finish_reason"]
                     # Add text content if present
                     if message.get("content"):
@@ -190,6 +194,7 @@ class OpenAIRequest(APIRequestBase):
             sampling_params=self.sampling_params,
             usage=usage,
             raw_response=data,
+            finish_reason=finish_reason,
         )
@@ -266,6 +271,13 @@ class OpenAIResponsesRequest(APIRequestBase):
             self.request_json["max_output_tokens"] = sampling_params.max_new_tokens
         if self.model.reasoning_model:
+            if sampling_params.reasoning_effort in [None, "none"]:
+                # gemini models can switch reasoning off
+                if "gemini" in self.model.id:
+                    self.sampling_params.reasoning_effort = "none"  # expects string
+                # openai models can only go down to "low"
+                else:
+                    self.sampling_params.reasoning_effort = "low"
             self.request_json["temperature"] = 1.0
             self.request_json["top_p"] = 1.0
             self.request_json["reasoning"] = {
@@ -413,3 +425,57 @@ class OpenAIResponsesRequest(APIRequestBase):
             usage=usage,
             raw_response=data,
         )
+async def stream_chat(
+    model_name: str,  # must correspond to registry
+    prompt: Conversation,
+    sampling_params: SamplingParams = SamplingParams(),
+    tools: list | None = None,
+    cache: CachePattern | None = None,
+):
+    if cache is not None:
+        warnings.warn(
+            f"Cache parameter '{cache}' is only supported for Anthropic models, ignoring for {model_name}"
+        )
+    model = APIModel.from_registry(model_name)
+    if model.api_spec != "openai":
+        raise ValueError("streaming only supported on openai models for now")
+    url = f"{model.api_base}/chat/completions"
+    request_header = {"Authorization": f"Bearer {os.getenv(model.api_key_env_var)}"}
+    request_json = _build_oa_chat_request(model, prompt, tools, sampling_params)
+    request_json["stream"] = True
+    async with aiohttp.ClientSession() as s:
+        async with s.post(url, headers=request_header, json=request_json) as r:
+            r.raise_for_status()  # bail on 4xx/5xx
+            content = ""
+            buf = ""
+            async for chunk in r.content.iter_any():  # raw bytes
+                buf += chunk.decode()
+                while "\n\n" in buf:  # full SSE frame
+                    event, buf = buf.split("\n\n", 1)
+                    if not event.startswith("data:"):
+                        continue  # ignore comments
+                    data = event[5:].strip()  # after "data:"
+                    if data == "[DONE]":
+                        yield APIResponse(
+                            id=0,
+                            status_code=None,
+                            is_error=False,
+                            error_message=None,
+                            prompt=prompt,
+                            content=Message(
+                                role="assistant", parts=[Text(text=content)]
+                            ),
+                            model_internal=model.id,
+                            sampling_params=sampling_params,
+                            usage=None,
+                            raw_response=None,
+                        )
+                    msg = json.loads(data)  # SSE payload
+                    delta = msg["choices"][0]["delta"].get("content")
+                    if delta:
+                        content += delta
+                        yield delta
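
Note on `stream_chat`: the loop buffers raw bytes, splits on blank lines to recover server-sent-event frames, and extracts `choices[0].delta.content` from each `data:` payload. In the diffed code, the `[DONE]` branch yields the final `APIResponse` and then falls through to `json.loads(data)`, which would fail on the literal `[DONE]` if the consumer resumes the generator after that final item. The sketch below isolates the frame parsing as a synchronous function and returns at `[DONE]` instead. `iter_sse_deltas` is a hypothetical name, and the incremental UTF-8 decoder is an addition of this sketch so that a multi-byte character split across chunks does not raise.

```python
import codecs
import json
from typing import Iterable, Iterator


def iter_sse_deltas(chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield text deltas from chat-completion SSE frames, stopping at [DONE]."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    buf = ""
    for chunk in chunks:
        buf += decoder.decode(chunk)
        while "\n\n" in buf:  # a blank line terminates one SSE frame
            event, buf = buf.split("\n\n", 1)
            if not event.startswith("data:"):
                continue  # ignore comments and other fields
            data = event[5:].strip()  # text after "data:"
            if data == "[DONE]":
                return  # end of stream: do not try to parse the sentinel
            delta = json.loads(data)["choices"][0]["delta"].get("content")
            if delta:
                yield delta


# Frames deliberately split mid-field to exercise the buffering.
example_chunks = [
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}\n\nda',
    b'ta: {"choices": [{"delta": {"content": "lo"}}]}\n\n',
    b"data: [DONE]\n\n",
]
print("".join(iter_sse_deltas(example_chunks)))  # Hello
```

A caller that also needs the assembled message can join the yielded deltas after the generator finishes, which is the role the final `APIResponse` plays in the diffed code.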
