lm-deluge 0.0.9__tar.gz → 0.0.13__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release: this version of lm-deluge might be problematic.
- {lm_deluge-0.0.9/src/lm_deluge.egg-info → lm_deluge-0.0.13}/PKG-INFO +101 -12
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/README.md +98 -9
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/pyproject.toml +3 -3
- lm_deluge-0.0.13/src/lm_deluge/__init__.py +15 -0
- lm_deluge-0.0.13/src/lm_deluge/agent.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/api_requests/anthropic.py +107 -60
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/api_requests/base.py +107 -54
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/api_requests/bedrock.py +59 -22
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/api_requests/common.py +2 -1
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/api_requests/mistral.py +21 -22
- lm_deluge-0.0.13/src/lm_deluge/api_requests/openai.py +415 -0
- lm_deluge-0.0.13/src/lm_deluge/batches.py +498 -0
- lm_deluge-0.0.13/src/lm_deluge/client.py +501 -0
- lm_deluge-0.0.13/src/lm_deluge/computer_use/anthropic_tools.py +75 -0
- lm_deluge-0.0.9/src/lm_deluge/sampling_params.py → lm_deluge-0.0.13/src/lm_deluge/config.py +10 -3
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/embed.py +17 -11
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/models.py +78 -33
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/prompt.py +173 -7
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/rerank.py +18 -12
- lm_deluge-0.0.13/src/lm_deluge/tool.py +290 -0
- lm_deluge-0.0.13/src/lm_deluge/tracker.py +253 -0
- lm_deluge-0.0.13/src/lm_deluge/usage.py +114 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/util/json.py +18 -1
- {lm_deluge-0.0.9 → lm_deluge-0.0.13/src/lm_deluge.egg-info}/PKG-INFO +101 -12
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge.egg-info/SOURCES.txt +22 -1
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge.egg-info/requires.txt +2 -2
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/tests/test_all_models.py +7 -7
- lm_deluge-0.0.13/tests/test_batch_real.py +95 -0
- lm_deluge-0.0.13/tests/test_bedrock_computer_use.py +378 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/tests/test_bedrock_models.py +19 -66
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/tests/test_cache.py +5 -4
- lm_deluge-0.0.13/tests/test_client_tracker_integration.py +43 -0
- lm_deluge-0.0.13/tests/test_computer_use.py +103 -0
- lm_deluge-0.0.13/tests/test_computer_use_integration.py +277 -0
- lm_deluge-0.0.13/tests/test_debug_format.py +47 -0
- lm_deluge-0.0.13/tests/test_logprobs_refactor.py +306 -0
- lm_deluge-0.0.13/tests/test_max_concurrent_requests.py +38 -0
- lm_deluge-0.0.13/tests/test_mcp_tools.py +221 -0
- lm_deluge-0.0.13/tests/test_openai_responses.py +356 -0
- lm_deluge-0.0.13/tests/test_prompt_caching.py +257 -0
- lm_deluge-0.0.13/tests/test_real_caching.py +305 -0
- lm_deluge-0.0.13/tests/test_real_caching_bedrock.py +307 -0
- lm_deluge-0.0.13/tests/test_rich_display.py +114 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/tests/test_tool_calls.py +3 -3
- lm_deluge-0.0.13/tests/test_tool_from_function.py +150 -0
- lm_deluge-0.0.13/tests/test_tool_validation.py +36 -0
- lm_deluge-0.0.13/tests/test_tracker_refactor.py +99 -0
- lm_deluge-0.0.9/src/lm_deluge/__init__.py +0 -7
- lm_deluge-0.0.9/src/lm_deluge/api_requests/openai.py +0 -183
- lm_deluge-0.0.9/src/lm_deluge/client.py +0 -762
- lm_deluge-0.0.9/src/lm_deluge/tool.py +0 -87
- lm_deluge-0.0.9/src/lm_deluge/tracker.py +0 -43
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/LICENSE +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/setup.cfg +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/api_requests/__init__.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/api_requests/deprecated/bedrock.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/api_requests/deprecated/cohere.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/api_requests/deprecated/deepseek.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/api_requests/deprecated/mistral.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/api_requests/deprecated/vertex.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/cache.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/errors.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/gemini_limits.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/image.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/llm_tools/__init__.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/llm_tools/extract.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/llm_tools/score.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/llm_tools/translate.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/util/logprobs.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/util/validation.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/util/xml.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge.egg-info/dependency_links.txt +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge.egg-info/top_level.txt +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/tests/test_image_models.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/tests/test_image_utils.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/tests/test_json_utils.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/tests/test_sampling_params.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/tests/test_translate.py +0 -0
- {lm_deluge-0.0.9 → lm_deluge-0.0.13}/tests/test_xml_utils.py +0 -0
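One change worth calling out from the list above: `sampling_params.py` was renamed to `config.py`, and a fuller top-level `__init__.py` was added. A minimal sketch of what that implies for imports against 0.0.13; the old `lm_deluge.sampling_params` path is inferred from the renamed file, the `lm_deluge.config` path from the `anthropic.py` hunks later in this diff, and the top-level re-export from the new `__init__.py` also shown below:

```python
# Sketch only: import paths implied by this diff, not taken from package docs.
from lm_deluge import SamplingParams          # top-level re-export, added in 0.0.13's __init__.py
from lm_deluge.config import SamplingParams   # renamed module; anthropic.py now does `from ..config import SamplingParams`

params = SamplingParams()  # default-constructible, as used in AnthropicRequest's signature below
```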
{lm_deluge-0.0.9/src/lm_deluge.egg-info → lm_deluge-0.0.13}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: lm_deluge
-Version: 0.0.9
+Version: 0.0.13
 Summary: Python utility for using LLM API models.
 Author-email: Benjamin Anderson <ben@trytaylor.ai>
 Requires-Python: >=3.10
@@ -21,8 +21,8 @@ Requires-Dist: bs4
 Requires-Dist: lxml
 Requires-Dist: pdf2image
 Requires-Dist: pillow
-Requires-Dist: …
-Requires-Dist: …
+Requires-Dist: fastmcp>=2.4
+Requires-Dist: rich
 Dynamic: license-file

 # lm-deluge
@@ -32,6 +32,9 @@ Dynamic: license-file
 - **Unified client** – Send prompts to all relevant models with a single client.
 - **Massive concurrency with throttling** – Set `max_tokens_per_minute` and `max_requests_per_minute` and let it fly. The client will process as many requests as possible while respecting rate limits and retrying failures.
 - **Spray across models/providers** – Configure a client with multiple models from any provider(s), and sampling weights. The client samples a model for each request.
+- **Tool Use** – Unified API for defining tools for all providers, and creating tools automatically from python functions.
+- **MCP Support** – Instantiate a `Tool` from a local or remote MCP server so that any LLM can use it, whether or not that provider natively supports MCP.
+- **Computer Use** – We support Claude Computer Use via the computer_use argument to process_prompts_sync/async. It works with Anthropic's API; Bedrock's API is broken right now and rejects the tool definitions, but in principle this will work there too when Bedrock gets their sh*t together.
 - **Caching** – Save completions in a local or distributed cache to avoid repeated LLM calls to process the same input.
 - **Convenient message constructor** – No more looking up how to build an Anthropic messages list with images. Our `Conversation` and `Message` classes work great with our client or with the `openai` and `anthropic` packages.
 - **Sync and async APIs** – Use the client from sync or async code.
@@ -44,7 +47,7 @@ Dynamic: license-file
 pip install lm-deluge
 ```

-The package relies on environment variables for API keys. Typical variables include `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `COHERE_API_KEY`, `META_API_KEY`, and `GOOGLE_API_KEY`. `LLMClient` will automatically load the `.env` file when imported; we recommend using that to set the environment variables.
+The package relies on environment variables for API keys. Typical variables include `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `COHERE_API_KEY`, `META_API_KEY`, and `GOOGLE_API_KEY`. `LLMClient` will automatically load the `.env` file when imported; we recommend using that to set the environment variables. For Bedrock, you'll need to set `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.

 ## Quickstart

@@ -60,13 +63,13 @@ print(resp[0].completion)

 ## Spraying Across Models

-To distribute your requests across models, just provide a list of more than one model to the constructor. The rate limits for the client apply to the client as a whole, not per-model, so you may want to increase them:
+To distribute your requests across models, just provide a list of more than one model to the constructor. See all available models in `models.py`. The rate limits for the client apply to the client as a whole, not per-model, so you may want to increase them:

 ```python
 from lm_deluge import LLMClient

 client = LLMClient.basic(
-    ["gpt-4o-mini", "claude-haiku…
+    ["gpt-4o-mini", "claude-3-haiku"],
     max_requests_per_minute=10_000
 )
 resps = client.process_prompts_sync(
@@ -81,7 +84,7 @@ API calls can be customized in a few ways.

 1. **Sampling Parameters.** This determines things like structured outputs, maximum completion tokens, nucleus sampling, etc. Provide a custom `SamplingParams` to the `LLMClient` to set temperature, top_p, json_mode, max_new_tokens, and/or reasoning_effort. You can pass 1 `SamplingParams` to use for all models, or a list of `SamplingParams` that's the same length as the list of models. You can also pass many of these arguments directly to `LLMClient.basic` so you don't have to construct an entire `SamplingParams` object.
 2. **Arguments to LLMClient.** This is where you set request timeout, rate limits, model name(s), model weight(s) for distributing requests across models, retries, and caching.
-3. **Arguments to process_prompts.** Per-call, you can set verbosity, whether to display progress, and whether to return just completions (rather than the full APIResponse object).
+3. **Arguments to process_prompts.** Per-call, you can set verbosity, whether to display progress, and whether to return just completions (rather than the full APIResponse object). This is also where you provide tools.

 Putting it all together:

@@ -120,11 +123,97 @@ resps = client.process_prompts_sync([prompt])

 This just works. Images can be local images on disk, URLs, bytes, base64 data URLs... go wild. You can use `Conversation.to_openai` or `Conversation.to_anthropic` to format your messages for the OpenAI or Anthropic clients directly.

-
+See a full multi-turn chat example in `examples/multiturn.md`.

-
+## Tool Use

-
+Define tools from Python functions and use them with any model:
+
+```python
+from lm_deluge import LLMClient, Tool
+
+def get_weather(city: str) -> str:
+    return f"The weather in {city} is sunny and 72°F"
+
+tool = Tool.from_function(get_weather)
+client = LLMClient.basic("claude-3-haiku")
+resps = client.process_prompts_sync(
+    ["What's the weather in Paris?"],
+    tools=[tool]
+)
+
+# you can iterate over the tool calls in the response automatically
+for tool_call in resps[0].tool_calls:
+    print(tool_call.name, tool_call.arguments)
+```
+
+You can also automatically instantiate tools from MCP servers. Under the hood, the the constructor connects to the server, asks it what tools it has, and then creates a `Tool` from each of them, *with a built-in `call` and `acall` interface*.
+
+```python
+from lm_deluge import LLMClient, Tool
+
+# Connect to a local MCP server and get all of its tools
+filesystem_tools = Tool.from_mcp(
+    "filesystem",
+    command="npx",
+    args=["-y", "@modelcontextprotocol/server-filesystem", "/path/to/directory"]
+)
+
+# or load ALL the tools from a Claude Desktop like config
+config = {
+    "mcpServers": {
+        "exa": {
+            "url": f"https://mcp.exa.ai/mcp?exaApiKey={os.getenv('EXA_API_KEY')}"
+        },
+        "zapier": {
+            "url": f"https://mcp.zapier.com/api/mcp/s/{os.getenv('ZAPIER_MCP_SECRET')}/mcp"
+        }
+    }
+}
+all_tools = Tool.from_mcp_config(config)
+
+# let the model use the tools
+client = LLMClient.basic("gpt-4o-mini")
+resps = client.process_prompts_sync(
+    ["List the files in the current directory"],
+    tools=tools
+)
+
+# call the tools
+for tool_call in resps[0].tool_calls:
+    # this is dumb sorry will make it better
+    tool_to_call = [x for x in tools if x.name == tool_call.name][0]
+    tool_to_call.call(**tool_call.arguments)  # in async code, use .acall()
+```
+
+### Prompt Caching (Anthropic)
+
+For Anthropic models, you can use prompt caching to reduce costs and latency for repeated context. This uses Anthropic's server-side prompt caching. Other providers like OpenAI and Google do this automatically, but Anthropic requires you to manually set cache-control on messages. You can do this in lm-deluge with a simple "cache" argument to `process_prompts_sync` or `process_prompts_async`:
+
+```python
+from lm_deluge import LLMClient, Conversation, Message
+
+# Create a conversation with system message
+conv = (
+    Conversation.system("You are an expert Python developer with deep knowledge of async programming.")
+    .add(Message.user("How do I use asyncio.gather?"))
+)
+
+# Use prompt caching to cache system message and tools
+client = LLMClient.basic("claude-3-5-sonnet")
+resps = client.process_prompts_sync(
+    [conv],
+    cache="system_and_tools"  # Cache system message and any tools
+)
+```
+
+Available cache patterns: `"system_and_tools"`, `"tools_only"`, `"last_user_message"`, `"last_2_user_messages"`, `"last_3_user_messages"`.
+
+## Local Caching
+
+Besides caching from model providers (which provides cache reads at a discount, but not for free) `lm_deluge.cache` includes LevelDB, SQLite and custom dictionary based caches to cache prompts locally. Pass an instance via `LLMClient(..., cache=my_cache)` and previously seen prompts will not be re‑sent across different `process_prompts_[...]` calls.
+
+**IMPORTANT:** Caching does not currently work for prompts in the SAME batch. That is, if you call `process_prompts_sync` with the same prompt 100 times, there will be 0 cache hits. If you call `process_prompts_sync` a *second* time with those same 100 prompts, all 100 will be cache hits. The local cache is intended to be persistent and help you save costs across many invocations, but it can't help with a single batch-inference session (yet!).

 ## Asynchronous Client
 Use this in asynchronous code, or in a Jupyter notebook. If you try to use the sync client in a Jupyter notebook, you'll have to use `nest-asyncio`, because internally the sync client uses async code. Don't do it! Just use the async client!
@@ -144,11 +233,11 @@ asyncio.run(main())

 ## Available Models

-We support all models in `src/lm_deluge/models.py`.
+We support all models in `src/lm_deluge/models.py`. Vertex support is not planned in the short term, since Google allows you to connect your Vertex account to AI Studio, and Vertex authentication is a huge pain (requires service account credentials, etc.)

 ## Feature Support

-We support structured outputs via `json_mode` parameter provided to `SamplingParams`. Structured outputs with a schema are planned. Reasoning models are supported via the `reasoning_effort` parameter, which is translated to a thinking budget for Claude/Gemini. Image models are supported. We…
+We support structured outputs via `json_mode` parameter provided to `SamplingParams`. Structured outputs with a schema are planned. Reasoning models are supported via the `reasoning_effort` parameter, which is translated to a thinking budget for Claude/Gemini. Image models are supported. We support tool use as documented above. We support logprobs for OpenAI models that return them.

 ## Built‑in tools
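The "Putting it all together" example referenced in the README's customization section is not included in the hunks shown above. A hedged sketch of what combining those three layers could look like; the keyword names are taken from the README text (`temperature`, `top_p`, `json_mode`, `max_new_tokens`, `max_requests_per_minute`, `max_tokens_per_minute`), while the exact constructor signatures are assumptions:

```python
# Sketch assembled from parameter names mentioned in the README text; signatures are not shown in this diff.
from lm_deluge import LLMClient, SamplingParams

params = SamplingParams(temperature=0.2, top_p=0.95, max_new_tokens=512, json_mode=True)
client = LLMClient(
    ["gpt-4o-mini", "claude-3-haiku"],   # multiple models: requests are sprayed across them
    sampling_params=params,              # one SamplingParams for all models (or a list, one per model)
    max_requests_per_minute=10_000,
    max_tokens_per_minute=2_000_000,
)
resps = client.process_prompts_sync(["Return a JSON object with a 'greeting' key."])
```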
{lm_deluge-0.0.9 → lm_deluge-0.0.13}/README.md

@@ -5,6 +5,9 @@
 - **Unified client** – Send prompts to all relevant models with a single client.
 - **Massive concurrency with throttling** – Set `max_tokens_per_minute` and `max_requests_per_minute` and let it fly. The client will process as many requests as possible while respecting rate limits and retrying failures.
 - **Spray across models/providers** – Configure a client with multiple models from any provider(s), and sampling weights. The client samples a model for each request.
+- **Tool Use** – Unified API for defining tools for all providers, and creating tools automatically from python functions.
+- **MCP Support** – Instantiate a `Tool` from a local or remote MCP server so that any LLM can use it, whether or not that provider natively supports MCP.
+- **Computer Use** – We support Claude Computer Use via the computer_use argument to process_prompts_sync/async. It works with Anthropic's API; Bedrock's API is broken right now and rejects the tool definitions, but in principle this will work there too when Bedrock gets their sh*t together.
 - **Caching** – Save completions in a local or distributed cache to avoid repeated LLM calls to process the same input.
 - **Convenient message constructor** – No more looking up how to build an Anthropic messages list with images. Our `Conversation` and `Message` classes work great with our client or with the `openai` and `anthropic` packages.
 - **Sync and async APIs** – Use the client from sync or async code.
@@ -17,7 +20,7 @@
 pip install lm-deluge
 ```

-The package relies on environment variables for API keys. Typical variables include `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `COHERE_API_KEY`, `META_API_KEY`, and `GOOGLE_API_KEY`. `LLMClient` will automatically load the `.env` file when imported; we recommend using that to set the environment variables.
+The package relies on environment variables for API keys. Typical variables include `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `COHERE_API_KEY`, `META_API_KEY`, and `GOOGLE_API_KEY`. `LLMClient` will automatically load the `.env` file when imported; we recommend using that to set the environment variables. For Bedrock, you'll need to set `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.

 ## Quickstart

@@ -33,13 +36,13 @@ print(resp[0].completion)

 ## Spraying Across Models

-To distribute your requests across models, just provide a list of more than one model to the constructor. The rate limits for the client apply to the client as a whole, not per-model, so you may want to increase them:
+To distribute your requests across models, just provide a list of more than one model to the constructor. See all available models in `models.py`. The rate limits for the client apply to the client as a whole, not per-model, so you may want to increase them:

 ```python
 from lm_deluge import LLMClient

 client = LLMClient.basic(
-    ["gpt-4o-mini", "claude-haiku…
+    ["gpt-4o-mini", "claude-3-haiku"],
     max_requests_per_minute=10_000
 )
 resps = client.process_prompts_sync(
@@ -54,7 +57,7 @@ API calls can be customized in a few ways.

 1. **Sampling Parameters.** This determines things like structured outputs, maximum completion tokens, nucleus sampling, etc. Provide a custom `SamplingParams` to the `LLMClient` to set temperature, top_p, json_mode, max_new_tokens, and/or reasoning_effort. You can pass 1 `SamplingParams` to use for all models, or a list of `SamplingParams` that's the same length as the list of models. You can also pass many of these arguments directly to `LLMClient.basic` so you don't have to construct an entire `SamplingParams` object.
 2. **Arguments to LLMClient.** This is where you set request timeout, rate limits, model name(s), model weight(s) for distributing requests across models, retries, and caching.
-3. **Arguments to process_prompts.** Per-call, you can set verbosity, whether to display progress, and whether to return just completions (rather than the full APIResponse object).
+3. **Arguments to process_prompts.** Per-call, you can set verbosity, whether to display progress, and whether to return just completions (rather than the full APIResponse object). This is also where you provide tools.

 Putting it all together:

@@ -93,11 +96,97 @@ resps = client.process_prompts_sync([prompt])

 This just works. Images can be local images on disk, URLs, bytes, base64 data URLs... go wild. You can use `Conversation.to_openai` or `Conversation.to_anthropic` to format your messages for the OpenAI or Anthropic clients directly.

-
+See a full multi-turn chat example in `examples/multiturn.md`.

-
+## Tool Use

-
+Define tools from Python functions and use them with any model:
+
+```python
+from lm_deluge import LLMClient, Tool
+
+def get_weather(city: str) -> str:
+    return f"The weather in {city} is sunny and 72°F"
+
+tool = Tool.from_function(get_weather)
+client = LLMClient.basic("claude-3-haiku")
+resps = client.process_prompts_sync(
+    ["What's the weather in Paris?"],
+    tools=[tool]
+)
+
+# you can iterate over the tool calls in the response automatically
+for tool_call in resps[0].tool_calls:
+    print(tool_call.name, tool_call.arguments)
+```
+
+You can also automatically instantiate tools from MCP servers. Under the hood, the the constructor connects to the server, asks it what tools it has, and then creates a `Tool` from each of them, *with a built-in `call` and `acall` interface*.
+
+```python
+from lm_deluge import LLMClient, Tool
+
+# Connect to a local MCP server and get all of its tools
+filesystem_tools = Tool.from_mcp(
+    "filesystem",
+    command="npx",
+    args=["-y", "@modelcontextprotocol/server-filesystem", "/path/to/directory"]
+)
+
+# or load ALL the tools from a Claude Desktop like config
+config = {
+    "mcpServers": {
+        "exa": {
+            "url": f"https://mcp.exa.ai/mcp?exaApiKey={os.getenv('EXA_API_KEY')}"
+        },
+        "zapier": {
+            "url": f"https://mcp.zapier.com/api/mcp/s/{os.getenv('ZAPIER_MCP_SECRET')}/mcp"
+        }
+    }
+}
+all_tools = Tool.from_mcp_config(config)
+
+# let the model use the tools
+client = LLMClient.basic("gpt-4o-mini")
+resps = client.process_prompts_sync(
+    ["List the files in the current directory"],
+    tools=tools
+)
+
+# call the tools
+for tool_call in resps[0].tool_calls:
+    # this is dumb sorry will make it better
+    tool_to_call = [x for x in tools if x.name == tool_call.name][0]
+    tool_to_call.call(**tool_call.arguments)  # in async code, use .acall()
+```
+
+### Prompt Caching (Anthropic)
+
+For Anthropic models, you can use prompt caching to reduce costs and latency for repeated context. This uses Anthropic's server-side prompt caching. Other providers like OpenAI and Google do this automatically, but Anthropic requires you to manually set cache-control on messages. You can do this in lm-deluge with a simple "cache" argument to `process_prompts_sync` or `process_prompts_async`:
+
+```python
+from lm_deluge import LLMClient, Conversation, Message
+
+# Create a conversation with system message
+conv = (
+    Conversation.system("You are an expert Python developer with deep knowledge of async programming.")
+    .add(Message.user("How do I use asyncio.gather?"))
+)
+
+# Use prompt caching to cache system message and tools
+client = LLMClient.basic("claude-3-5-sonnet")
+resps = client.process_prompts_sync(
+    [conv],
+    cache="system_and_tools"  # Cache system message and any tools
+)
+```
+
+Available cache patterns: `"system_and_tools"`, `"tools_only"`, `"last_user_message"`, `"last_2_user_messages"`, `"last_3_user_messages"`.
+
+## Local Caching
+
+Besides caching from model providers (which provides cache reads at a discount, but not for free) `lm_deluge.cache` includes LevelDB, SQLite and custom dictionary based caches to cache prompts locally. Pass an instance via `LLMClient(..., cache=my_cache)` and previously seen prompts will not be re‑sent across different `process_prompts_[...]` calls.
+
+**IMPORTANT:** Caching does not currently work for prompts in the SAME batch. That is, if you call `process_prompts_sync` with the same prompt 100 times, there will be 0 cache hits. If you call `process_prompts_sync` a *second* time with those same 100 prompts, all 100 will be cache hits. The local cache is intended to be persistent and help you save costs across many invocations, but it can't help with a single batch-inference session (yet!).

 ## Asynchronous Client
 Use this in asynchronous code, or in a Jupyter notebook. If you try to use the sync client in a Jupyter notebook, you'll have to use `nest-asyncio`, because internally the sync client uses async code. Don't do it! Just use the async client!
@@ -117,11 +206,11 @@ asyncio.run(main())

 ## Available Models

-We support all models in `src/lm_deluge/models.py`.
+We support all models in `src/lm_deluge/models.py`. Vertex support is not planned in the short term, since Google allows you to connect your Vertex account to AI Studio, and Vertex authentication is a huge pain (requires service account credentials, etc.)

 ## Feature Support

-We support structured outputs via `json_mode` parameter provided to `SamplingParams`. Structured outputs with a schema are planned. Reasoning models are supported via the `reasoning_effort` parameter, which is translated to a thinking budget for Claude/Gemini. Image models are supported. We…
+We support structured outputs via `json_mode` parameter provided to `SamplingParams`. Structured outputs with a schema are planned. Reasoning models are supported via the `reasoning_effort` parameter, which is translated to a thinking budget for Claude/Gemini. Image models are supported. We support tool use as documented above. We support logprobs for OpenAI models that return them.

 ## Built‑in tools
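The Local Caching section added to the README above describes the cache API only in prose. A hedged sketch of how wiring one in might look; the `SqliteCache` class name and its constructor argument are assumptions for illustration, not names confirmed by this diff:

```python
# Hypothetical sketch: lm_deluge.cache is documented to ship LevelDB-, SQLite- and
# dict-backed caches, but the concrete class names are not shown in this diff.
from lm_deluge import LLMClient
from lm_deluge.cache import SqliteCache  # assumed name

cache = SqliteCache("completions.db")            # assumed constructor
client = LLMClient("gpt-4o-mini", cache=cache)   # README: "Pass an instance via LLMClient(..., cache=my_cache)"

# A later call with the same prompts is served from the local cache; per the README,
# duplicate prompts within a single batch do not produce cache hits.
resps = client.process_prompts_sync(["What is 2 + 2?"])
```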
{lm_deluge-0.0.9 → lm_deluge-0.0.13}/pyproject.toml

@@ -3,7 +3,7 @@ requires = ["setuptools", "wheel"]

 [project]
 name = "lm_deluge"
-version = "0.0.9"
+version = "0.0.13"
 authors = [{ name = "Benjamin Anderson", email = "ben@trytaylor.ai" }]
 description = "Python utility for using LLM API models."
 readme = "README.md"
@@ -27,6 +27,6 @@ dependencies = [
     "lxml",
     "pdf2image",
     "pillow",
-    "…",
-    "…",
+    "fastmcp>=2.4",
+    "rich"
 ]
lm_deluge-0.0.13/src/lm_deluge/__init__.py

@@ -0,0 +1,15 @@
+from .client import LLMClient, SamplingParams, APIResponse
+from .prompt import Conversation, Message
+from .tool import Tool
+import dotenv
+
+dotenv.load_dotenv()
+
+__all__ = [
+    "LLMClient",
+    "SamplingParams",
+    "APIResponse",
+    "Conversation",
+    "Message",
+    "Tool",
+]
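For orientation, a short sketch of the public surface this new `__init__.py` exposes, following the Quickstart pattern from the README above; the assumption that `process_prompts_sync` returns `APIResponse` objects is based on the README's `resps[0].completion` usage:

```python
# Everything imported here is re-exported by the new top-level __init__.py,
# which also calls dotenv.load_dotenv() at import time so .env API keys are picked up.
from lm_deluge import APIResponse, Conversation, LLMClient, Message, Tool

client = LLMClient.basic("gpt-4o-mini")
prompt = Conversation.system("You are a helpful assistant.").add(
    Message.user("Say hello in one word.")
)
resps: list[APIResponse] = client.process_prompts_sync([prompt])
print(resps[0].completion)
```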
lm_deluge-0.0.13/src/lm_deluge/agent.py

File without changes (new empty file)
{lm_deluge-0.0.9 → lm_deluge-0.0.13}/src/lm_deluge/api_requests/anthropic.py

@@ -1,17 +1,94 @@
-import asyncio
 from aiohttp import ClientResponse
 import json
 import os
-import warnings
-from tqdm import tqdm
 from typing import Callable

-from lm_deluge.prompt import …
+from lm_deluge.prompt import (
+    Conversation,
+    Message,
+    Text,
+    ToolCall,
+    Thinking,
+    CachePattern,
+)
+from lm_deluge.tool import Tool
+from lm_deluge.usage import Usage
 from .base import APIRequestBase, APIResponse

 from ..tracker import StatusTracker
-from ..…
+from ..config import SamplingParams
 from ..models import APIModel
+from ..computer_use.anthropic_tools import get_anthropic_cu_tools
+
+
+def _build_anthropic_request(
+    model: APIModel,
+    prompt: Conversation,
+    tools: list[Tool] | None,
+    sampling_params: SamplingParams,
+    cache_pattern: CachePattern | None = None,
+    computer_use: bool = False,
+    display_width: int = 1024,
+    display_height: int = 768,
+):
+    system_message, messages = prompt.to_anthropic(cache_pattern=cache_pattern)
+    request_header = {
+        "x-api-key": os.getenv(model.api_key_env_var),
+        "anthropic-version": "2023-06-01",
+        "content-type": "application/json",
+    }
+
+    # Add beta header for Computer Use
+    if computer_use:
+        request_header["anthropic-beta"] = "computer-use-2025-01-24"
+
+    request_json = {
+        "model": model.name,
+        "messages": messages,
+        "temperature": sampling_params.temperature,
+        "top_p": sampling_params.top_p,
+        "max_tokens": sampling_params.max_new_tokens,
+    }
+
+    # handle thinking
+    if model.reasoning_model and sampling_params.reasoning_effort:
+        # translate reasoning effort of low, medium, high to budget tokens
+        budget = {"low": 1024, "medium": 4096, "high": 16384}.get(
+            sampling_params.reasoning_effort
+        )
+        request_json["thinking"] = {
+            "type": "enabled",
+            "budget_tokens": budget,
+        }
+        request_json.pop("top_p")
+        request_json["temperature"] = 1.0
+        request_json["max_tokens"] += budget
+    else:
+        request_json["thinking"] = {"type": "disabled"}
+        if sampling_params.reasoning_effort:
+            print("ignoring reasoning_effort for non-reasoning model")
+    if system_message is not None:
+        request_json["system"] = system_message
+    if tools or computer_use:
+        tool_definitions = []
+        if tools:
+            tool_definitions.extend([tool.dump_for("anthropic") for tool in tools])
+        # Add Computer Use tools
+        if computer_use:
+            cu_tools = get_anthropic_cu_tools(
+                model=model.id,
+                display_width=display_width,  # todo: set from ComputerUseParams
+                display_height=display_height,
+            )
+            tool_definitions.extend(cu_tools)
+
+        # Add cache control to last tool if tools_only caching is specified
+        if cache_pattern == "tools_only" and tool_definitions:
+            tool_definitions[-1]["cache_control"] = {"type": "ephemeral"}
+
+        request_json["tools"] = tool_definitions
+
+    return request_json, request_header


 class AnthropicRequest(APIRequestBase):
@@ -24,17 +101,19 @@ class AnthropicRequest(APIRequestBase):
         prompt: Conversation,
         attempts_left: int,
         status_tracker: StatusTracker,
-        retry_queue: asyncio.Queue,
         results_arr: list,
         request_timeout: int = 30,
         sampling_params: SamplingParams = SamplingParams(),
-        pbar: tqdm | None = None,
         callback: Callable | None = None,
-        debug: bool = False,
         # for retries
         all_model_names: list[str] | None = None,
         all_sampling_params: list[SamplingParams] | None = None,
         tools: list | None = None,
+        cache: CachePattern | None = None,
+        # Computer Use support
+        computer_use: bool = False,
+        display_width: int = 1024,
+        display_height: int = 768,
     ):
         super().__init__(
             task_id=task_id,
@@ -42,70 +121,42 @@ class AnthropicRequest(APIRequestBase):
             prompt=prompt,
             attempts_left=attempts_left,
             status_tracker=status_tracker,
-            retry_queue=retry_queue,
             results_arr=results_arr,
             request_timeout=request_timeout,
             sampling_params=sampling_params,
-            pbar=pbar,
             callback=callback,
-            debug=debug,
             all_model_names=all_model_names,
             all_sampling_params=all_sampling_params,
             tools=tools,
+            cache=cache,
         )
+        self.computer_use = computer_use
+        self.display_width = display_width
+        self.display_height = display_height
         self.model = APIModel.from_registry(model_name)
         self.url = f"{self.model.api_base}/messages"

-        …
-        …
-        …
-            "anthropic-version": "2023-06-01",
-            "content-type": "application/json",
-        }
+        # Lock images as bytes if caching is enabled
+        if cache is not None:
+            prompt.lock_images_as_bytes()

-        self.request_json = …
-        …
-        …
-        …
-        …
-        …
-        …
-        …
-        …
-        …
-                # translate reasoning effort of low, medium, high to budget tokens
-                budget = {"low": 1024, "medium": 4096, "high": 16384}.get(
-                    sampling_params.reasoning_effort
-                )
-                self.request_json["thinking"] = {
-                    "type": "enabled",
-                    "budget_tokens": budget,
-                }
-                self.request_json.pop("top_p")
-                self.request_json["temperature"] = 1.0
-                self.request_json["max_tokens"] += (
-                    budget  # assume max tokens is max completion tokens
-                )
-            else:
-                # no thinking
-                self.request_json["thinking"] = {"type": "disabled"}
-        else:
-            if sampling_params.reasoning_effort:
-                warnings.warn(
-                    f"Ignoring reasoning_effort param for non-reasoning model: {model_name}"
-                )
-        if self.system_message is not None:
-            self.request_json["system"] = self.system_message
-        if tools:
-            self.request_json["tools"] = [tool.dump_for("anthropic") for tool in tools]
+        self.request_json, self.request_header = _build_anthropic_request(
+            self.model,
+            prompt,
+            tools,
+            sampling_params,
+            cache,
+            computer_use,
+            display_width,
+            display_height,
+        )

     async def handle_response(self, http_response: ClientResponse) -> APIResponse:
         is_error = False
         error_message = None
         thinking = None
         content = None
-
-        output_tokens = None
+        usage = None
         status_code = http_response.status
         mimetype = http_response.headers.get("Content-Type", None)
         rate_limits = {}
@@ -118,8 +169,6 @@ class AnthropicRequest(APIRequestBase):
                 "anthropic-ratelimit-tokens-reset",
             ]:
                 rate_limits[header] = http_response.headers.get(header, None)
-        if self.debug:
-            print(f"Rate limits: {rate_limits}")
         if status_code >= 200 and status_code < 300:
             try:
                 data = await http_response.json()
@@ -143,8 +192,7 @@ class AnthropicRequest(APIRequestBase):
                 )

                 content = Message("assistant", parts)
-
-                output_tokens = data["usage"]["output_tokens"]
+                usage = Usage.from_anthropic_usage(data["usage"])
             except Exception as e:
                 is_error = True
                 error_message = (
@@ -182,6 +230,5 @@ class AnthropicRequest(APIRequestBase):
             thinking=thinking,
             model_internal=self.model_name,
             sampling_params=self.sampling_params,
-
-            output_tokens=output_tokens,
+            usage=usage,
         )