PyPI - patchpal 0.3.2.tar.gz → 0.4.0.tar.gz

Files changed (27)

{patchpal-0.3.2/patchpal.egg-info → patchpal-0.4.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: patchpal
-Version: 0.3.2
+Version: 0.4.0
 Summary: A lean Claude Code clone in pure  Python
 Author: PatchPal Contributors
 License-Expression: Apache-2.0
@@ -905,6 +905,11 @@ You: /status
 # - Token usage breakdown
 # - Visual progress bar
 # - Auto-compaction status
+# - Session statistics:
+#   - Total LLM calls made
+#   - Cumulative input tokens (all requests combined)
+#   - Cumulative output tokens (all responses combined)
+#   - Total tokens (helps estimate API costs)
 # Manually trigger compaction
 You: /compact
@@ -916,6 +921,23 @@ You: /compact
 # Note: Requires at least 5 messages; most effective when context >50% full
 ```
+**Understanding Session Statistics:**
+The `/status` command shows cumulative token usage:
+- **Cumulative input tokens**: Total tokens sent to the LLM across all calls
+  - Each LLM call resends the entire conversation history
+  - **Note on Anthropic models**: PatchPal uses prompt caching
+    - System prompt and last 2 messages are cached
+    - Cached tokens cost much less than regular input tokens
+    - The displayed token counts show raw totals, not cache-adjusted costs
+- **Cumulative output tokens**: Total tokens generated by the LLM
+  - Usually much smaller than input (just the generated responses)
+  - Typically costs more per token than input
+**Important**: The token counts shown are raw totals and don't reflect prompt caching discounts. For accurate cost information, check your provider's usage dashboard which shows cache hits and actual billing.
 **Configuration:**
 See the [Configuration](https://github.com/amaiya/patchpal?tab=readme-ov-file#configuration) section for context management settings including:
@@ -1004,3 +1026,72 @@ The system ensures you can work for extended periods without hitting context lim
 - Context is automatically managed at 75% capacity through pruning and compaction.
 - **Note:** Token estimation may be slightly inaccurate compared to the model's actual counting. If you see this error despite auto-compaction being enabled, the 75% threshold may need to be lowered further for your workload. You can adjust it with `export PATCHPAL_COMPACT_THRESHOLD=0.70` (or lower).
 - See [Configuration](https://github.com/amaiya/patchpal?tab=readme-ov-file#configuration) for context management settings.
+**Reducing API Costs via Token Optimization**
+When using cloud LLM providers (Anthropic, OpenAI, etc.), token usage directly impacts costs. PatchPal includes several features to help minimize token consumption:
+**1. Use Pruning to Manage Long Sessions**
+- **Automatic pruning** removes old tool outputs while preserving conversation context
+- Configure pruning thresholds to be more aggressive:
+  ```bash
+  export PATCHPAL_PRUNE_PROTECT=20000    # Reduce from 40k to 20k tokens
+  export PATCHPAL_PRUNE_MINIMUM=10000    # Reduce minimum saved from 20k to 10k
+  ```
+- Pruning happens transparently before compaction and is much faster (no LLM call needed)
+**2. Monitor Session Token Usage**
+- Use `/status` to see cumulative token usage in real-time
+- **Session Statistics** section shows:
+  - Total LLM calls made
+  - Cumulative input tokens (raw totals, before caching discounts)
+  - Cumulative output tokens
+  - Total tokens for the session
+- Check periodically during long sessions to monitor usage
+- **Important**: Token counts don't reflect prompt caching discounts (Anthropic models)
+- For actual costs, check your provider's usage dashboard which shows cache-adjusted billing
+**3. Manual Compaction for Cost Control**
+- Use `/status` regularly to monitor context window usage
+- Run `/compact` proactively when context grows large (before hitting auto-compact threshold)
+- Manual compaction gives you control over when the summarization LLM call happens
+**4. Adjust Auto-Compaction Threshold**
+- Lower threshold = more frequent compaction = smaller context = lower per-request costs
+- Higher threshold = fewer compaction calls = larger context = higher per-request costs
+  ```bash
+  # More aggressive compaction (compact at 60% instead of 75%)
+  export PATCHPAL_COMPACT_THRESHOLD=0.60
+  ```
+- Find the sweet spot for your workload (balance between compaction frequency and context size)
+**5. Use Local Models for Zero API Costs**
+- **Best option:** Run vLLM locally to eliminate API costs entirely
+  ```bash
+  export HOSTED_VLLM_API_BASE=http://localhost:8000
+  export HOSTED_VLLM_API_KEY=token-abc123
+  patchpal --model hosted_vllm/openai/gpt-oss-20b
+  ```
+- **Alternative:** Use Ollama (requires `OLLAMA_CONTEXT_LENGTH=32768`)
+- See [Using Local Models](https://github.com/amaiya/patchpal?tab=readme-ov-file#using-local-models-vllm--ollama) for setup
+**6. Start Fresh When Appropriate**
+- Use `/clear` command to reset conversation history without restarting PatchPal
+- Exit and restart PatchPal between unrelated tasks to clear context completely
+- Each fresh start begins with minimal tokens (just the system prompt)
+- Better than carrying large conversation history across different tasks
+**7. Use Smaller Models for Simple Tasks**
+- Use less expensive models for routine tasks:
+  ```bash
+  patchpal --model anthropic/claude-3-7-sonnet-latest  # Cheaper than claude-sonnet-4-5
+  patchpal --model openai/gpt-4o-mini                  # Cheaper than gpt-4o
+  ```
+- Reserve premium models for complex reasoning tasks
+**Cost Monitoring Tips:**
+- Check `/status` before large operations to see current token usage
+- **Anthropic models**: Prompt caching reduces costs (system prompt + last 2 messages cached)
+- Most cloud providers offer usage dashboards showing cache hits and actual charges
+- Set up billing alerts with your provider to avoid surprises
+- Consider local models (vLLM recommended) for high-volume usage or zero API costs
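
The added README text notes that each LLM call resends the entire conversation history, which is why cumulative input tokens grow much faster than output tokens. A minimal sketch of that arithmetic follows. The function name and the token figures are hypothetical, and the model is simplified: it counts only the system prompt plus one new message per call and ignores tool outputs, pruning, and compaction.

```python
from typing import List


def cumulative_input_tokens(system_tokens: int, turn_tokens: List[int]) -> int:
    """Sum the input size of every call when each call resends all prior turns.

    Hypothetical helper for illustration; not part of patchpal.
    """
    total = 0
    history = system_tokens  # the system prompt is sent on every call
    for tokens in turn_tokens:
        history += tokens  # the new message joins the history
        total += history   # the whole history is sent as input
    return total


# Three calls of 500 new tokens each on top of a 1,000-token system prompt:
# the calls send 1,500 + 2,000 + 2,500 tokens.
print(cumulative_input_tokens(1000, [500, 500, 500]))  # 6000
```

Only 1,500 tokens of new content were added here, yet 6,000 input tokens were sent in total, which matches the README's point that input usually dominates the session totals.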

{patchpal-0.3.2 → patchpal-0.4.0}/README.md RENAMED

@@ -868,6 +868,11 @@ You: /status
 # - Token usage breakdown
 # - Visual progress bar
 # - Auto-compaction status
+# - Session statistics:
+#   - Total LLM calls made
+#   - Cumulative input tokens (all requests combined)
+#   - Cumulative output tokens (all responses combined)
+#   - Total tokens (helps estimate API costs)
 # Manually trigger compaction
 You: /compact
@@ -879,6 +884,23 @@ You: /compact
 # Note: Requires at least 5 messages; most effective when context >50% full
 ```
+**Understanding Session Statistics:**
+The `/status` command shows cumulative token usage:
+- **Cumulative input tokens**: Total tokens sent to the LLM across all calls
+  - Each LLM call resends the entire conversation history
+  - **Note on Anthropic models**: PatchPal uses prompt caching
+    - System prompt and last 2 messages are cached
+    - Cached tokens cost much less than regular input tokens
+    - The displayed token counts show raw totals, not cache-adjusted costs
+- **Cumulative output tokens**: Total tokens generated by the LLM
+  - Usually much smaller than input (just the generated responses)
+  - Typically costs more per token than input
+**Important**: The token counts shown are raw totals and don't reflect prompt caching discounts. For accurate cost information, check your provider's usage dashboard which shows cache hits and actual billing.
 **Configuration:**
 See the [Configuration](https://github.com/amaiya/patchpal?tab=readme-ov-file#configuration) section for context management settings including:
@@ -967,3 +989,72 @@ The system ensures you can work for extended periods without hitting context lim
 - Context is automatically managed at 75% capacity through pruning and compaction.
 - **Note:** Token estimation may be slightly inaccurate compared to the model's actual counting. If you see this error despite auto-compaction being enabled, the 75% threshold may need to be lowered further for your workload. You can adjust it with `export PATCHPAL_COMPACT_THRESHOLD=0.70` (or lower).
 - See [Configuration](https://github.com/amaiya/patchpal?tab=readme-ov-file#configuration) for context management settings.
+**Reducing API Costs via Token Optimization**
+When using cloud LLM providers (Anthropic, OpenAI, etc.), token usage directly impacts costs. PatchPal includes several features to help minimize token consumption:
+**1. Use Pruning to Manage Long Sessions**
+- **Automatic pruning** removes old tool outputs while preserving conversation context
+- Configure pruning thresholds to be more aggressive:
+  ```bash
+  export PATCHPAL_PRUNE_PROTECT=20000    # Reduce from 40k to 20k tokens
+  export PATCHPAL_PRUNE_MINIMUM=10000    # Reduce minimum saved from 20k to 10k
+  ```
+- Pruning happens transparently before compaction and is much faster (no LLM call needed)
+**2. Monitor Session Token Usage**
+- Use `/status` to see cumulative token usage in real-time
+- **Session Statistics** section shows:
+  - Total LLM calls made
+  - Cumulative input tokens (raw totals, before caching discounts)
+  - Cumulative output tokens
+  - Total tokens for the session
+- Check periodically during long sessions to monitor usage
+- **Important**: Token counts don't reflect prompt caching discounts (Anthropic models)
+- For actual costs, check your provider's usage dashboard which shows cache-adjusted billing
+**3. Manual Compaction for Cost Control**
+- Use `/status` regularly to monitor context window usage
+- Run `/compact` proactively when context grows large (before hitting auto-compact threshold)
+- Manual compaction gives you control over when the summarization LLM call happens
+**4. Adjust Auto-Compaction Threshold**
+- Lower threshold = more frequent compaction = smaller context = lower per-request costs
+- Higher threshold = fewer compaction calls = larger context = higher per-request costs
+  ```bash
+  # More aggressive compaction (compact at 60% instead of 75%)
+  export PATCHPAL_COMPACT_THRESHOLD=0.60
+  ```
+- Find the sweet spot for your workload (balance between compaction frequency and context size)
+**5. Use Local Models for Zero API Costs**
+- **Best option:** Run vLLM locally to eliminate API costs entirely
+  ```bash
+  export HOSTED_VLLM_API_BASE=http://localhost:8000
+  export HOSTED_VLLM_API_KEY=token-abc123
+  patchpal --model hosted_vllm/openai/gpt-oss-20b
+  ```
+- **Alternative:** Use Ollama (requires `OLLAMA_CONTEXT_LENGTH=32768`)
+- See [Using Local Models](https://github.com/amaiya/patchpal?tab=readme-ov-file#using-local-models-vllm--ollama) for setup
+**6. Start Fresh When Appropriate**
+- Use `/clear` command to reset conversation history without restarting PatchPal
+- Exit and restart PatchPal between unrelated tasks to clear context completely
+- Each fresh start begins with minimal tokens (just the system prompt)
+- Better than carrying large conversation history across different tasks
+**7. Use Smaller Models for Simple Tasks**
+- Use less expensive models for routine tasks:
+  ```bash
+  patchpal --model anthropic/claude-3-7-sonnet-latest  # Cheaper than claude-sonnet-4-5
+  patchpal --model openai/gpt-4o-mini                  # Cheaper than gpt-4o
+  ```
+- Reserve premium models for complex reasoning tasks
+**Cost Monitoring Tips:**
+- Check `/status` before large operations to see current token usage
+- **Anthropic models**: Prompt caching reduces costs (system prompt + last 2 messages cached)
+- Most cloud providers offer usage dashboards showing cache hits and actual charges
+- Set up billing alerts with your provider to avoid surprises
+- Consider local models (vLLM recommended) for high-volume usage or zero API costs
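
The README says the total token count "helps estimate API costs". A rough sketch of such an estimate is shown below. The function name is hypothetical and the per-million-token prices are placeholders, not any provider's actual rates. Because the displayed counts are raw totals, the result ignores prompt-caching adjustments and should be checked against the provider's usage dashboard.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Rough cost estimate from raw token totals (no cache adjustment).

    Hypothetical helper for illustration; prices are in USD per million tokens.
    """
    cost = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return cost / 1_000_000


# Placeholder prices: 3.00 USD per million input tokens, 15.00 per million output.
print(estimate_cost_usd(2_000_000, 100_000, 3.0, 15.0))  # 7.5
```

Input and output are priced separately because, as the README notes, output tokens typically cost more per token than input tokens.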

{patchpal-0.3.2 → patchpal-0.4.0}/patchpal/__init__.py RENAMED

@@ -1,6 +1,6 @@
 """PatchPal - An open-source Claude Code clone implemented purely in Python."""
-__version__ = "0.3.2"
+__version__ = "0.4.0"
 from patchpal.agent import create_agent
 from patchpal.tools import (

{patchpal-0.3.2 → patchpal-0.4.0}/patchpal/agent.py RENAMED

@@ -725,9 +725,7 @@ def _apply_prompt_caching(messages: List[Dict[str, Any]], model_id: str) -> List
     Caches:
     - System messages (first 1-2 messages with role="system")
-    - Last 2 conversation messages (recent context)
-    This provides 90% cost reduction on cached content after the first request.
+    - Last 2 non-system messages (recent context, any role except system)
     Args:
         messages: List of message dictionaries
@@ -818,6 +816,11 @@ class PatchPalAgent:
         # Track last compaction to prevent compaction loops
         self._last_compaction_message_count = 0
+        # Track cumulative token usage across all LLM calls
+        self.total_llm_calls = 0
+        self.cumulative_input_tokens = 0
+        self.cumulative_output_tokens = 0
         # LiteLLM settings for models that need parameter dropping
         self.litellm_kwargs = {}
         if self.model_id.startswith("bedrock/"):
@@ -896,12 +899,22 @@ class PatchPalAgent:
                 messages = [{"role": "system", "content": SYSTEM_PROMPT}] + msgs
                 # Apply prompt caching for supported models
                 messages = _apply_prompt_caching(messages, self.model_id)
-                return litellm.completion(
+                response = litellm.completion(
                     model=self.model_id,
                     messages=messages,
                     **self.litellm_kwargs,
                 )
+                # Track token usage from compaction call
+                self.total_llm_calls += 1
+                if hasattr(response, "usage") and response.usage:
+                    if hasattr(response.usage, "prompt_tokens"):
+                        self.cumulative_input_tokens += response.usage.prompt_tokens
+                    if hasattr(response.usage, "completion_tokens"):
+                        self.cumulative_output_tokens += response.usage.completion_tokens
+                return response
             summary_msg, summary_text = self.context_manager.create_compaction(
                 self.messages,
                 compaction_completion,
@@ -995,6 +1008,15 @@ class PatchPalAgent:
                     tool_choice="auto",
                     **self.litellm_kwargs,
                 )
+                # Track token usage from this LLM call
+                self.total_llm_calls += 1
+                if hasattr(response, "usage") and response.usage:
+                    if hasattr(response.usage, "prompt_tokens"):
+                        self.cumulative_input_tokens += response.usage.prompt_tokens
+                    if hasattr(response.usage, "completion_tokens"):
+                        self.cumulative_output_tokens += response.usage.completion_tokens
             except Exception as e:
                 return f"Error calling model: {e}"
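
The compaction hunk and the main-loop hunk add the same usage-tracking logic twice. One way this could be factored into a single helper is sketched below. `UsageTracker` and `record` are hypothetical names, not part of patchpal, and the response object is assumed to expose `usage.prompt_tokens` and `usage.completion_tokens` as in the diff.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any


@dataclass
class UsageTracker:
    """Hypothetical holder for the three counters added in this release."""

    total_llm_calls: int = 0
    cumulative_input_tokens: int = 0
    cumulative_output_tokens: int = 0

    def record(self, response: Any) -> None:
        # Count the call even when the provider reports no usage data.
        self.total_llm_calls += 1
        usage = getattr(response, "usage", None)
        if not usage:
            return
        # Treat a missing or None field as zero tokens.
        self.cumulative_input_tokens += getattr(usage, "prompt_tokens", 0) or 0
        self.cumulative_output_tokens += getattr(usage, "completion_tokens", 0) or 0


tracker = UsageTracker()
# Stand-in response objects; real ones would come from litellm.completion().
tracker.record(SimpleNamespace(usage=SimpleNamespace(prompt_tokens=100, completion_tokens=20)))
tracker.record(SimpleNamespace(usage=None))
print(tracker)
```

Calling `record` from both call sites would keep the two code paths from drifting apart, at the cost of moving the counters off the agent object that `cli.py` currently reads.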

{patchpal-0.3.2 → patchpal-0.4.0}/patchpal/cli.py RENAMED

@@ -362,6 +362,24 @@ Supported models: Any LiteLLM-supported model
                         "\n  Auto-compaction: \033[33mDisabled\033[0m (set PATCHPAL_DISABLE_AUTOCOMPACT=false to enable)"
                     )
+                # Show cumulative token usage
+                print("\n\033[1;36mSession Statistics\033[0m")
+                print(f"  LLM calls: {agent.total_llm_calls}")
+                # Check if usage info is available (if we have LLM calls but no token counts)
+                has_usage_info = (
+                    agent.cumulative_input_tokens > 0 or agent.cumulative_output_tokens > 0
+                )
+                if agent.total_llm_calls > 0 and not has_usage_info:
+                    print(
+                        "  \033[2mToken usage unavailable (model doesn't report usage info)\033[0m"
+                    )
+                else:
+                    print(f"  Cumulative input tokens: {agent.cumulative_input_tokens:,}")
+                    print(f"  Cumulative output tokens: {agent.cumulative_output_tokens:,}")
+                    total_tokens = agent.cumulative_input_tokens + agent.cumulative_output_tokens
+                    print(f"  Total tokens: {total_tokens:,}")
                 print("=" * 70 + "\n")
                 continue
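
The display logic in this hunk can be read as a pure function of the three counters. The sketch below restates it that way so the branches are easier to follow. `format_session_stats` is a hypothetical name, and the ANSI color codes and section header from the diff are omitted.

```python
from typing import List


def format_session_stats(calls: int, input_tokens: int, output_tokens: int) -> List[str]:
    """Return the Session Statistics lines, mirroring the branches in the diff.

    Hypothetical helper for illustration; not part of patchpal.
    """
    lines = [f"  LLM calls: {calls}"]
    has_usage_info = input_tokens > 0 or output_tokens > 0
    if calls > 0 and not has_usage_info:
        # Calls were made but the model never reported token usage.
        lines.append("  Token usage unavailable (model doesn't report usage info)")
    else:
        lines.append(f"  Cumulative input tokens: {input_tokens:,}")
        lines.append(f"  Cumulative output tokens: {output_tokens:,}")
        lines.append(f"  Total tokens: {input_tokens + output_tokens:,}")
    return lines


print("\n".join(format_session_stats(3, 12000, 450)))
```

With zero calls the function takes the second branch and prints zeros, which matches the diff: the "unavailable" message appears only after at least one call has been made.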

{patchpal-0.3.2 → patchpal-0.4.0/patchpal.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: patchpal
-Version: 0.3.2
+Version: 0.4.0
 Summary: A lean Claude Code clone in pure  Python
 Author: PatchPal Contributors
 License-Expression: Apache-2.0
@@ -905,6 +905,11 @@ You: /status
 # - Token usage breakdown
 # - Visual progress bar
 # - Auto-compaction status
+# - Session statistics:
+#   - Total LLM calls made
+#   - Cumulative input tokens (all requests combined)
+#   - Cumulative output tokens (all responses combined)
+#   - Total tokens (helps estimate API costs)
 # Manually trigger compaction
 You: /compact
@@ -916,6 +921,23 @@ You: /compact
 # Note: Requires at least 5 messages; most effective when context >50% full
 ```
+**Understanding Session Statistics:**
+The `/status` command shows cumulative token usage:
+- **Cumulative input tokens**: Total tokens sent to the LLM across all calls
+  - Each LLM call resends the entire conversation history
+  - **Note on Anthropic models**: PatchPal uses prompt caching
+    - System prompt and last 2 messages are cached
+    - Cached tokens cost much less than regular input tokens
+    - The displayed token counts show raw totals, not cache-adjusted costs
+- **Cumulative output tokens**: Total tokens generated by the LLM
+  - Usually much smaller than input (just the generated responses)
+  - Typically costs more per token than input
+**Important**: The token counts shown are raw totals and don't reflect prompt caching discounts. For accurate cost information, check your provider's usage dashboard which shows cache hits and actual billing.
 **Configuration:**
 See the [Configuration](https://github.com/amaiya/patchpal?tab=readme-ov-file#configuration) section for context management settings including:
@@ -1004,3 +1026,72 @@ The system ensures you can work for extended periods without hitting context lim
 - Context is automatically managed at 75% capacity through pruning and compaction.
 - **Note:** Token estimation may be slightly inaccurate compared to the model's actual counting. If you see this error despite auto-compaction being enabled, the 75% threshold may need to be lowered further for your workload. You can adjust it with `export PATCHPAL_COMPACT_THRESHOLD=0.70` (or lower).
 - See [Configuration](https://github.com/amaiya/patchpal?tab=readme-ov-file#configuration) for context management settings.
+**Reducing API Costs via Token Optimization**
+When using cloud LLM providers (Anthropic, OpenAI, etc.), token usage directly impacts costs. PatchPal includes several features to help minimize token consumption:
+**1. Use Pruning to Manage Long Sessions**
+- **Automatic pruning** removes old tool outputs while preserving conversation context
+- Configure pruning thresholds to be more aggressive:
+  ```bash
+  export PATCHPAL_PRUNE_PROTECT=20000    # Reduce from 40k to 20k tokens
+  export PATCHPAL_PRUNE_MINIMUM=10000    # Reduce minimum saved from 20k to 10k
+  ```
+- Pruning happens transparently before compaction and is much faster (no LLM call needed)
+**2. Monitor Session Token Usage**
+- Use `/status` to see cumulative token usage in real-time
+- **Session Statistics** section shows:
+  - Total LLM calls made
+  - Cumulative input tokens (raw totals, before caching discounts)
+  - Cumulative output tokens
+  - Total tokens for the session
+- Check periodically during long sessions to monitor usage
+- **Important**: Token counts don't reflect prompt caching discounts (Anthropic models)
+- For actual costs, check your provider's usage dashboard which shows cache-adjusted billing
+**3. Manual Compaction for Cost Control**
+- Use `/status` regularly to monitor context window usage
+- Run `/compact` proactively when context grows large (before hitting auto-compact threshold)
+- Manual compaction gives you control over when the summarization LLM call happens
+**4. Adjust Auto-Compaction Threshold**
+- Lower threshold = more frequent compaction = smaller context = lower per-request costs
+- Higher threshold = fewer compaction calls = larger context = higher per-request costs
+  ```bash
+  # More aggressive compaction (compact at 60% instead of 75%)
+  export PATCHPAL_COMPACT_THRESHOLD=0.60
+  ```
+- Find the sweet spot for your workload (balance between compaction frequency and context size)
+**5. Use Local Models for Zero API Costs**
+- **Best option:** Run vLLM locally to eliminate API costs entirely
+  ```bash
+  export HOSTED_VLLM_API_BASE=http://localhost:8000
+  export HOSTED_VLLM_API_KEY=token-abc123
+  patchpal --model hosted_vllm/openai/gpt-oss-20b
+  ```
+- **Alternative:** Use Ollama (requires `OLLAMA_CONTEXT_LENGTH=32768`)
+- See [Using Local Models](https://github.com/amaiya/patchpal?tab=readme-ov-file#using-local-models-vllm--ollama) for setup
+**6. Start Fresh When Appropriate**
+- Use `/clear` command to reset conversation history without restarting PatchPal
+- Exit and restart PatchPal between unrelated tasks to clear context completely
+- Each fresh start begins with minimal tokens (just the system prompt)
+- Better than carrying large conversation history across different tasks
+**7. Use Smaller Models for Simple Tasks**
+- Use less expensive models for routine tasks:
+  ```bash
+  patchpal --model anthropic/claude-3-7-sonnet-latest  # Cheaper than claude-sonnet-4-5
+  patchpal --model openai/gpt-4o-mini                  # Cheaper than gpt-4o
+  ```
+- Reserve premium models for complex reasoning tasks
+**Cost Monitoring Tips:**
+- Check `/status` before large operations to see current token usage
+- **Anthropic models**: Prompt caching reduces costs (system prompt + last 2 messages cached)
+- Most cloud providers offer usage dashboards showing cache hits and actual charges
+- Set up billing alerts with your provider to avoid surprises
+- Consider local models (vLLM recommended) for high-volume usage or zero API costs
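
The documentation describes `PATCHPAL_COMPACT_THRESHOLD` as the fraction of the context window at which auto-compaction starts, with 75% as the default. A minimal sketch of such a check is given below. `should_compact` is a hypothetical function, and the `>=` comparison is an assumption; patchpal's actual implementation may differ.

```python
import os
from typing import Mapping, Optional


def should_compact(
    used_tokens: int,
    context_limit: int,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Return True once context usage reaches the configured threshold.

    Hypothetical helper for illustration; not part of patchpal.
    """
    env = os.environ if env is None else env
    # The 0.75 default follows the 75% figure stated in the README.
    threshold = float(env.get("PATCHPAL_COMPACT_THRESHOLD", "0.75"))
    return used_tokens / context_limit >= threshold


# 130k of a 200k window is 65% full.
print(should_compact(130_000, 200_000, {}))                                      # False
print(should_compact(130_000, 200_000, {"PATCHPAL_COMPACT_THRESHOLD": "0.60"}))  # True
```

Lowering the threshold to 0.60 triggers compaction at the same usage level that the default leaves alone, which is the trade-off described in tip 4: more summarization calls in exchange for a smaller context per request.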