patchpal 0.3.1__tar.gz → 0.4.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {patchpal-0.3.1/patchpal.egg-info → patchpal-0.4.1}/PKG-INFO +92 -1
- {patchpal-0.3.1 → patchpal-0.4.1}/README.md +91 -0
- {patchpal-0.3.1 → patchpal-0.4.1}/patchpal/__init__.py +1 -1
- {patchpal-0.3.1 → patchpal-0.4.1}/patchpal/agent.py +29 -7
- {patchpal-0.3.1 → patchpal-0.4.1}/patchpal/cli.py +67 -1
- {patchpal-0.3.1 → patchpal-0.4.1}/patchpal/tools.py +225 -22
- {patchpal-0.3.1 → patchpal-0.4.1/patchpal.egg-info}/PKG-INFO +92 -1
- {patchpal-0.3.1 → patchpal-0.4.1}/tests/test_agent.py +7 -7
- {patchpal-0.3.1 → patchpal-0.4.1}/LICENSE +0 -0
- {patchpal-0.3.1 → patchpal-0.4.1}/MANIFEST.in +0 -0
- {patchpal-0.3.1 → patchpal-0.4.1}/patchpal/context.py +0 -0
- {patchpal-0.3.1 → patchpal-0.4.1}/patchpal/permissions.py +0 -0
- {patchpal-0.3.1 → patchpal-0.4.1}/patchpal/skills.py +0 -0
- {patchpal-0.3.1 → patchpal-0.4.1}/patchpal/system_prompt.md +0 -0
- {patchpal-0.3.1 → patchpal-0.4.1}/patchpal.egg-info/SOURCES.txt +0 -0
- {patchpal-0.3.1 → patchpal-0.4.1}/patchpal.egg-info/dependency_links.txt +0 -0
- {patchpal-0.3.1 → patchpal-0.4.1}/patchpal.egg-info/entry_points.txt +0 -0
- {patchpal-0.3.1 → patchpal-0.4.1}/patchpal.egg-info/requires.txt +0 -0
- {patchpal-0.3.1 → patchpal-0.4.1}/patchpal.egg-info/top_level.txt +0 -0
- {patchpal-0.3.1 → patchpal-0.4.1}/pyproject.toml +0 -0
- {patchpal-0.3.1 → patchpal-0.4.1}/setup.cfg +0 -0
- {patchpal-0.3.1 → patchpal-0.4.1}/tests/test_cli.py +0 -0
- {patchpal-0.3.1 → patchpal-0.4.1}/tests/test_context.py +0 -0
- {patchpal-0.3.1 → patchpal-0.4.1}/tests/test_guardrails.py +0 -0
- {patchpal-0.3.1 → patchpal-0.4.1}/tests/test_operational_safety.py +0 -0
- {patchpal-0.3.1 → patchpal-0.4.1}/tests/test_skills.py +0 -0
- {patchpal-0.3.1 → patchpal-0.4.1}/tests/test_tools.py +0 -0
{patchpal-0.3.1/patchpal.egg-info → patchpal-0.4.1}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: patchpal
-Version: 0.3.1
+Version: 0.4.1
 Summary: A lean Claude Code clone in pure Python
 Author: PatchPal Contributors
 License-Expression: Apache-2.0
@@ -905,6 +905,11 @@ You: /status
 # - Token usage breakdown
 # - Visual progress bar
 # - Auto-compaction status
+# - Session statistics:
+#   - Total LLM calls made
+#   - Cumulative input tokens (all requests combined)
+#   - Cumulative output tokens (all responses combined)
+#   - Total tokens (helps estimate API costs)

 # Manually trigger compaction
 You: /compact
@@ -916,6 +921,23 @@ You: /compact
 # Note: Requires at least 5 messages; most effective when context >50% full
 ```

+**Understanding Session Statistics:**
+
+The `/status` command shows cumulative token usage:
+
+- **Cumulative input tokens**: Total tokens sent to the LLM across all calls
+  - Each LLM call resends the entire conversation history
+  - **Note on Anthropic models**: PatchPal uses prompt caching
+    - System prompt and last 2 messages are cached
+    - Cached tokens cost much less than regular input tokens
+    - The displayed token counts show raw totals, not cache-adjusted costs
+
+- **Cumulative output tokens**: Total tokens generated by the LLM
+  - Usually much smaller than input (just the generated responses)
+  - Typically costs more per token than input
+
+**Important**: The token counts shown are raw totals and don't reflect prompt caching discounts. For accurate cost information, check your provider's usage dashboard which shows cache hits and actual billing.
+
 **Configuration:**

 See the [Configuration](https://github.com/amaiya/patchpal?tab=readme-ov-file#configuration) section for context management settings including:
@@ -1004,3 +1026,72 @@ The system ensures you can work for extended periods without hitting context lim
 - Context is automatically managed at 75% capacity through pruning and compaction.
 - **Note:** Token estimation may be slightly inaccurate compared to the model's actual counting. If you see this error despite auto-compaction being enabled, the 75% threshold may need to be lowered further for your workload. You can adjust it with `export PATCHPAL_COMPACT_THRESHOLD=0.70` (or lower).
 - See [Configuration](https://github.com/amaiya/patchpal?tab=readme-ov-file#configuration) for context management settings.
+
+**Reducing API Costs via Token Optimization**
+
+When using cloud LLM providers (Anthropic, OpenAI, etc.), token usage directly impacts costs. PatchPal includes several features to help minimize token consumption:
+
+**1. Use Pruning to Manage Long Sessions**
+- **Automatic pruning** removes old tool outputs while preserving conversation context
+- Configure pruning thresholds to be more aggressive:
+  ```bash
+  export PATCHPAL_PRUNE_PROTECT=20000   # Reduce from 40k to 20k tokens
+  export PATCHPAL_PRUNE_MINIMUM=10000   # Reduce minimum saved from 20k to 10k
+  ```
+- Pruning happens transparently before compaction and is much faster (no LLM call needed)
+
+**2. Monitor Session Token Usage**
+- Use `/status` to see cumulative token usage in real-time
+- **Session Statistics** section shows:
+  - Total LLM calls made
+  - Cumulative input tokens (raw totals, before caching discounts)
+  - Cumulative output tokens
+  - Total tokens for the session
+- Check periodically during long sessions to monitor usage
+- **Important**: Token counts don't reflect prompt caching discounts (Anthropic models)
+  - For actual costs, check your provider's usage dashboard which shows cache-adjusted billing
+
+**3. Manual Compaction for Cost Control**
+- Use `/status` regularly to monitor context window usage
+- Run `/compact` proactively when context grows large (before hitting auto-compact threshold)
+- Manual compaction gives you control over when the summarization LLM call happens
+
+**4. Adjust Auto-Compaction Threshold**
+- Lower threshold = more frequent compaction = smaller context = lower per-request costs
+- Higher threshold = fewer compaction calls = larger context = higher per-request costs
+  ```bash
+  # More aggressive compaction (compact at 60% instead of 75%)
+  export PATCHPAL_COMPACT_THRESHOLD=0.60
+  ```
+- Find the sweet spot for your workload (balance between compaction frequency and context size)
+
+**5. Use Local Models for Zero API Costs**
+- **Best option:** Run vLLM locally to eliminate API costs entirely
+  ```bash
+  export HOSTED_VLLM_API_BASE=http://localhost:8000
+  export HOSTED_VLLM_API_KEY=token-abc123
+  patchpal --model hosted_vllm/openai/gpt-oss-20b
+  ```
+- **Alternative:** Use Ollama (requires `OLLAMA_CONTEXT_LENGTH=32768`)
+- See [Using Local Models](https://github.com/amaiya/patchpal?tab=readme-ov-file#using-local-models-vllm--ollama) for setup
+
+**6. Start Fresh When Appropriate**
+- Use `/clear` command to reset conversation history without restarting PatchPal
+- Exit and restart PatchPal between unrelated tasks to clear context completely
+- Each fresh start begins with minimal tokens (just the system prompt)
+- Better than carrying large conversation history across different tasks
+
+**7. Use Smaller Models for Simple Tasks**
+- Use less expensive models for routine tasks:
+  ```bash
+  patchpal --model anthropic/claude-3-7-sonnet-latest  # Cheaper than claude-sonnet-4-5
+  patchpal --model openai/gpt-4o-mini                  # Cheaper than gpt-4o
+  ```
+- Reserve premium models for complex reasoning tasks
+
+**Cost Monitoring Tips:**
+- Check `/status` before large operations to see current token usage
+- **Anthropic models**: Prompt caching reduces costs (system prompt + last 2 messages cached)
+- Most cloud providers offer usage dashboards showing cache hits and actual charges
+- Set up billing alerts with your provider to avoid surprises
+- Consider local models (vLLM recommended) for high-volume usage or zero API costs
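The pruning and compaction knobs referenced above are plain environment variables. As a minimal sketch of how such settings can be read, following the `os.getenv` pattern used in `patchpal/tools.py` further down and the defaults quoted in the text (40k protect, 20k minimum, 75% threshold) — the parsing below is illustrative, not PatchPal's actual configuration code:

```python
import os

# Illustrative parsing only; the variable names come from the README above,
# and the defaults mirror the figures it quotes (40k / 20k / 0.75).
PRUNE_PROTECT = int(os.getenv("PATCHPAL_PRUNE_PROTECT", 40_000))
PRUNE_MINIMUM = int(os.getenv("PATCHPAL_PRUNE_MINIMUM", 20_000))
COMPACT_THRESHOLD = float(os.getenv("PATCHPAL_COMPACT_THRESHOLD", 0.75))

print(f"prune_protect={PRUNE_PROTECT}, prune_minimum={PRUNE_MINIMUM}, "
      f"compact_threshold={COMPACT_THRESHOLD}")
```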
{patchpal-0.3.1 → patchpal-0.4.1}/README.md

@@ -868,6 +868,11 @@ You: /status
 # - Token usage breakdown
 # - Visual progress bar
 # - Auto-compaction status
+# - Session statistics:
+#   - Total LLM calls made
+#   - Cumulative input tokens (all requests combined)
+#   - Cumulative output tokens (all responses combined)
+#   - Total tokens (helps estimate API costs)

 # Manually trigger compaction
 You: /compact
@@ -879,6 +884,23 @@ You: /compact
 # Note: Requires at least 5 messages; most effective when context >50% full
 ```

+**Understanding Session Statistics:**
+
+The `/status` command shows cumulative token usage:
+
+- **Cumulative input tokens**: Total tokens sent to the LLM across all calls
+  - Each LLM call resends the entire conversation history
+  - **Note on Anthropic models**: PatchPal uses prompt caching
+    - System prompt and last 2 messages are cached
+    - Cached tokens cost much less than regular input tokens
+    - The displayed token counts show raw totals, not cache-adjusted costs
+
+- **Cumulative output tokens**: Total tokens generated by the LLM
+  - Usually much smaller than input (just the generated responses)
+  - Typically costs more per token than input
+
+**Important**: The token counts shown are raw totals and don't reflect prompt caching discounts. For accurate cost information, check your provider's usage dashboard which shows cache hits and actual billing.
+
 **Configuration:**

 See the [Configuration](https://github.com/amaiya/patchpal?tab=readme-ov-file#configuration) section for context management settings including:
@@ -967,3 +989,72 @@ The system ensures you can work for extended periods without hitting context lim
 - Context is automatically managed at 75% capacity through pruning and compaction.
 - **Note:** Token estimation may be slightly inaccurate compared to the model's actual counting. If you see this error despite auto-compaction being enabled, the 75% threshold may need to be lowered further for your workload. You can adjust it with `export PATCHPAL_COMPACT_THRESHOLD=0.70` (or lower).
 - See [Configuration](https://github.com/amaiya/patchpal?tab=readme-ov-file#configuration) for context management settings.
+
+**Reducing API Costs via Token Optimization**
+
+When using cloud LLM providers (Anthropic, OpenAI, etc.), token usage directly impacts costs. PatchPal includes several features to help minimize token consumption:
+
+**1. Use Pruning to Manage Long Sessions**
+- **Automatic pruning** removes old tool outputs while preserving conversation context
+- Configure pruning thresholds to be more aggressive:
+  ```bash
+  export PATCHPAL_PRUNE_PROTECT=20000   # Reduce from 40k to 20k tokens
+  export PATCHPAL_PRUNE_MINIMUM=10000   # Reduce minimum saved from 20k to 10k
+  ```
+- Pruning happens transparently before compaction and is much faster (no LLM call needed)
+
+**2. Monitor Session Token Usage**
+- Use `/status` to see cumulative token usage in real-time
+- **Session Statistics** section shows:
+  - Total LLM calls made
+  - Cumulative input tokens (raw totals, before caching discounts)
+  - Cumulative output tokens
+  - Total tokens for the session
+- Check periodically during long sessions to monitor usage
+- **Important**: Token counts don't reflect prompt caching discounts (Anthropic models)
+  - For actual costs, check your provider's usage dashboard which shows cache-adjusted billing
+
+**3. Manual Compaction for Cost Control**
+- Use `/status` regularly to monitor context window usage
+- Run `/compact` proactively when context grows large (before hitting auto-compact threshold)
+- Manual compaction gives you control over when the summarization LLM call happens
+
+**4. Adjust Auto-Compaction Threshold**
+- Lower threshold = more frequent compaction = smaller context = lower per-request costs
+- Higher threshold = fewer compaction calls = larger context = higher per-request costs
+  ```bash
+  # More aggressive compaction (compact at 60% instead of 75%)
+  export PATCHPAL_COMPACT_THRESHOLD=0.60
+  ```
+- Find the sweet spot for your workload (balance between compaction frequency and context size)
+
+**5. Use Local Models for Zero API Costs**
+- **Best option:** Run vLLM locally to eliminate API costs entirely
+  ```bash
+  export HOSTED_VLLM_API_BASE=http://localhost:8000
+  export HOSTED_VLLM_API_KEY=token-abc123
+  patchpal --model hosted_vllm/openai/gpt-oss-20b
+  ```
+- **Alternative:** Use Ollama (requires `OLLAMA_CONTEXT_LENGTH=32768`)
+- See [Using Local Models](https://github.com/amaiya/patchpal?tab=readme-ov-file#using-local-models-vllm--ollama) for setup
+
+**6. Start Fresh When Appropriate**
+- Use `/clear` command to reset conversation history without restarting PatchPal
+- Exit and restart PatchPal between unrelated tasks to clear context completely
+- Each fresh start begins with minimal tokens (just the system prompt)
+- Better than carrying large conversation history across different tasks
+
+**7. Use Smaller Models for Simple Tasks**
+- Use less expensive models for routine tasks:
+  ```bash
+  patchpal --model anthropic/claude-3-7-sonnet-latest  # Cheaper than claude-sonnet-4-5
+  patchpal --model openai/gpt-4o-mini                  # Cheaper than gpt-4o
+  ```
+- Reserve premium models for complex reasoning tasks
+
+**Cost Monitoring Tips:**
+- Check `/status` before large operations to see current token usage
+- **Anthropic models**: Prompt caching reduces costs (system prompt + last 2 messages cached)
+- Most cloud providers offer usage dashboards showing cache hits and actual charges
+- Set up billing alerts with your provider to avoid surprises
+- Consider local models (vLLM recommended) for high-volume usage or zero API costs
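The "Session Statistics" totals described above are raw, cache-unaware counts, so any cost arithmetic built on them gives an upper bound. A small sketch of that arithmetic, with placeholder per-million-token prices (not PatchPal code, and not any provider's actual rates — substitute your own):

```python
# Illustrative only: rough upper-bound cost from the cumulative totals /status reports.
# The prices below are hypothetical placeholders; cached input tokens are billed lower.

def estimate_session_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float = 3.00,    # placeholder $ per 1M input tokens
    output_price_per_mtok: float = 15.00,  # placeholder $ per 1M output tokens
) -> float:
    return (
        input_tokens / 1_000_000 * input_price_per_mtok
        + output_tokens / 1_000_000 * output_price_per_mtok
    )

# Example with the kind of numbers /status might show:
print(f"~${estimate_session_cost(250_000, 12_000):.2f} (upper bound, ignores caching)")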
{patchpal-0.3.1 → patchpal-0.4.1}/patchpal/agent.py

@@ -541,7 +541,7 @@ TOOLS = [
         "type": "function",
         "function": {
             "name": "run_shell",
-            "description": "Run a safe shell command in the repository. Privilege escalation (sudo, su) blocked by default unless PATCHPAL_ALLOW_SUDO=true.",
+            "description": "Run a safe shell command in the repository. Commands execute from repository root automatically (no need for 'cd'). Privilege escalation (sudo, su) blocked by default unless PATCHPAL_ALLOW_SUDO=true.",
             "parameters": {
                 "type": "object",
                 "properties": {
@@ -725,9 +725,7 @@ def _apply_prompt_caching(messages: List[Dict[str, Any]], model_id: str) -> List

     Caches:
     - System messages (first 1-2 messages with role="system")
-    - Last 2
-
-    This provides 90% cost reduction on cached content after the first request.
+    - Last 2 non-system messages (recent context, any role except system)

     Args:
         messages: List of message dictionaries
@@ -744,8 +742,8 @@ def _apply_prompt_caching(messages: List[Dict[str, Any]], model_id: str) -> List
         # Bedrock uses cachePoint
         cache_marker = {"cachePoint": {"type": "ephemeral"}}
     else:
-        # Direct Anthropic API uses
-        cache_marker = {"
+        # Direct Anthropic API uses cache_control
+        cache_marker = {"cache_control": {"type": "ephemeral"}}

     # Find system messages (usually at the start)
     system_messages = [i for i, msg in enumerate(messages) if msg.get("role") == "system"]
@@ -818,6 +816,11 @@ class PatchPalAgent:
         # Track last compaction to prevent compaction loops
         self._last_compaction_message_count = 0

+        # Track cumulative token usage across all LLM calls
+        self.total_llm_calls = 0
+        self.cumulative_input_tokens = 0
+        self.cumulative_output_tokens = 0
+
         # LiteLLM settings for models that need parameter dropping
         self.litellm_kwargs = {}
         if self.model_id.startswith("bedrock/"):
@@ -896,12 +899,22 @@ class PatchPalAgent:
             messages = [{"role": "system", "content": SYSTEM_PROMPT}] + msgs
             # Apply prompt caching for supported models
             messages = _apply_prompt_caching(messages, self.model_id)
-
+            response = litellm.completion(
                 model=self.model_id,
                 messages=messages,
                 **self.litellm_kwargs,
             )

+            # Track token usage from compaction call
+            self.total_llm_calls += 1
+            if hasattr(response, "usage") and response.usage:
+                if hasattr(response.usage, "prompt_tokens"):
+                    self.cumulative_input_tokens += response.usage.prompt_tokens
+                if hasattr(response.usage, "completion_tokens"):
+                    self.cumulative_output_tokens += response.usage.completion_tokens
+
+            return response
+
         summary_msg, summary_text = self.context_manager.create_compaction(
             self.messages,
             compaction_completion,
@@ -995,6 +1008,15 @@ class PatchPalAgent:
                 tool_choice="auto",
                 **self.litellm_kwargs,
             )
+
+            # Track token usage from this LLM call
+            self.total_llm_calls += 1
+            if hasattr(response, "usage") and response.usage:
+                if hasattr(response.usage, "prompt_tokens"):
+                    self.cumulative_input_tokens += response.usage.prompt_tokens
+                if hasattr(response.usage, "completion_tokens"):
+                    self.cumulative_output_tokens += response.usage.completion_tokens
+
         except Exception as e:
             return f"Error calling model: {e}"

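The agent.py hunks above read `response.usage` after each `litellm.completion()` call and accumulate the counts on the agent. A self-contained sketch of the same pattern, assuming only LiteLLM's standard `usage.prompt_tokens` / `usage.completion_tokens` fields — the wrapper class here is illustrative, not part of PatchPal:

```python
import litellm


class UsageTracker:
    """Accumulates raw token totals across calls, mirroring the counters added in agent.py."""

    def __init__(self) -> None:
        self.total_llm_calls = 0
        self.cumulative_input_tokens = 0
        self.cumulative_output_tokens = 0

    def completion(self, **kwargs):
        # Forward everything to LiteLLM, then record whatever usage info came back.
        response = litellm.completion(**kwargs)
        self.total_llm_calls += 1
        usage = getattr(response, "usage", None)
        if usage:
            self.cumulative_input_tokens += getattr(usage, "prompt_tokens", 0) or 0
            self.cumulative_output_tokens += getattr(usage, "completion_tokens", 0) or 0
        return response


# Usage (any LiteLLM-supported model id works here):
# tracker = UsageTracker()
# tracker.completion(model="openai/gpt-4o-mini",
#                    messages=[{"role": "user", "content": "hello"}])
# print(tracker.total_llm_calls,
#       tracker.cumulative_input_tokens,
#       tracker.cumulative_output_tokens)
```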
{patchpal-0.3.1 → patchpal-0.4.1}/patchpal/cli.py

@@ -248,7 +248,9 @@ Supported models: Any LiteLLM-supported model
         print(f"\033[1;36m🔧 Using custom system prompt: {custom_prompt_path}\033[0m")

     print("\nType 'exit' to quit.")
-    print(
+    print(
+        "Use '/status' to check context window usage, '/compact' to manually compact, '/clear' to start fresh."
+    )
     print("Use 'list skills' to see available skills or /skillname to invoke skills.")
     print("Press Ctrl-C during agent execution to interrupt the agent.\n")

@@ -360,6 +362,70 @@ Supported models: Any LiteLLM-supported model
                     "\n Auto-compaction: \033[33mDisabled\033[0m (set PATCHPAL_DISABLE_AUTOCOMPACT=false to enable)"
                 )

+            # Show cumulative token usage
+            print("\n\033[1;36mSession Statistics\033[0m")
+            print(f" LLM calls: {agent.total_llm_calls}")
+
+            # Check if usage info is available (if we have LLM calls but no token counts)
+            has_usage_info = (
+                agent.cumulative_input_tokens > 0 or agent.cumulative_output_tokens > 0
+            )
+            if agent.total_llm_calls > 0 and not has_usage_info:
+                print(
+                    " \033[2mToken usage unavailable (model doesn't report usage info)\033[0m"
+                )
+            else:
+                print(f" Cumulative input tokens: {agent.cumulative_input_tokens:,}")
+                print(f" Cumulative output tokens: {agent.cumulative_output_tokens:,}")
+                total_tokens = agent.cumulative_input_tokens + agent.cumulative_output_tokens
+                print(f" Total tokens: {total_tokens:,}")
+
+            print("=" * 70 + "\n")
+            continue
+
+        # Handle /clear command - clear conversation history
+        if user_input.lower() in ["clear", "/clear"]:
+            print("\n" + "=" * 70)
+            print("\033[1;36mClear Context\033[0m")
+            print("=" * 70)
+
+            if not agent.messages:
+                print("\033[1;33m Context is already empty.\033[0m")
+                print("=" * 70 + "\n")
+                continue
+
+            # Show current status
+            stats = agent.context_manager.get_usage_stats(agent.messages)
+            print(
+                f" Current: {len(agent.messages)} messages, {stats['total_tokens']:,} tokens"
+            )
+
+            # Confirm before clearing
+            try:
+                confirm = pt_prompt(
+                    FormattedText(
+                        [
+                            ("ansiyellow", " Clear all context and start fresh? (y/n): "),
+                            ("", ""),
+                        ]
+                    )
+                ).strip()
+                if confirm.lower() not in ["y", "yes"]:
+                    print(" Cancelled.")
+                    print("=" * 70 + "\n")
+                    continue
+            except KeyboardInterrupt:
+                print("\n Cancelled.")
+                print("=" * 70 + "\n")
+                continue
+
+            # Clear conversation history
+            agent.messages = []
+            agent._last_compaction_message_count = 0
+
+            print("\n\033[1;32m✓ Context cleared successfully!\033[0m")
+            print(" Starting fresh with empty conversation history.")
+            print(" All previous context has been removed - ready for a new task.")
             print("=" * 70 + "\n")
             continue

{patchpal-0.3.1 → patchpal-0.4.1}/patchpal/tools.py

@@ -100,6 +100,10 @@ WEB_USER_AGENT = f"PatchPal/{__version__} (AI Code Assistant)"
 # Shell command configuration
 SHELL_TIMEOUT = int(os.getenv("PATCHPAL_SHELL_TIMEOUT", 30))  # 30 seconds default

+# Output filtering configuration - reduce token usage from verbose commands
+ENABLE_OUTPUT_FILTERING = os.getenv("PATCHPAL_FILTER_OUTPUTS", "true").lower() == "true"
+MAX_OUTPUT_LINES = int(os.getenv("PATCHPAL_MAX_OUTPUT_LINES", 500))  # Max lines of output
+
 # Global flag for requiring permission on ALL operations (including reads)
 # Set via CLI flag --require-permission-for-all
 _REQUIRE_PERMISSION_FOR_ALL = False
@@ -195,10 +199,194 @@ class OperationLimiter:
         audit_logger.info(f"Operation {self.operations}/{self.max_operations}: {operation}")

     def reset(self):
-        """Reset operation counter."""
+        """Reset the operation counter (used in tests)."""
         self.operations = 0


+class OutputFilter:
+    """Filter verbose command outputs to reduce token usage.
+
+    This class implements Claude Code's strategy of filtering verbose outputs
+    to show only relevant information (e.g., test failures, error messages).
+    Can save 75% or more on output tokens for verbose commands.
+    """
+
+    @staticmethod
+    def should_filter(cmd: str) -> bool:
+        """Check if a command should have its output filtered.
+
+        Args:
+            cmd: The shell command
+
+        Returns:
+            True if filtering should be applied
+        """
+        if not ENABLE_OUTPUT_FILTERING:
+            return False
+
+        # Test runners - show only failures
+        test_patterns = [
+            "pytest",
+            "npm test",
+            "npm run test",
+            "yarn test",
+            "go test",
+            "cargo test",
+            "mvn test",
+            "gradle test",
+            "ruby -I test",
+            "rspec",
+        ]
+
+        # Version control - limit log output
+        vcs_patterns = [
+            "git log",
+            "git reflog",
+        ]
+
+        # Package managers - show only important info
+        pkg_patterns = [
+            "npm install",
+            "pip install",
+            "cargo build",
+            "go build",
+        ]
+
+        all_patterns = test_patterns + vcs_patterns + pkg_patterns
+        return any(pattern in cmd for pattern in all_patterns)
+
+    @staticmethod
+    def filter_output(cmd: str, output: str) -> str:
+        """Filter command output to reduce token usage.
+
+        Args:
+            cmd: The shell command
+            output: The raw command output
+
+        Returns:
+            Filtered output with only relevant information
+        """
+        if not output or not ENABLE_OUTPUT_FILTERING:
+            return output
+
+        lines = output.split("\n")
+        original_lines = len(lines)
+
+        # Test output - show only failures and summary
+        if any(
+            pattern in cmd
+            for pattern in ["pytest", "npm test", "yarn test", "go test", "cargo test", "rspec"]
+        ):
+            filtered_lines = []
+            in_failure = False
+            failure_context = []
+
+            for line in lines:
+                # Capture failure indicators
+                if any(
+                    keyword in line.upper()
+                    for keyword in ["FAIL", "ERROR", "FAILED", "✗", "✖", "FAILURE"]
+                ):
+                    in_failure = True
+                    failure_context = [line]
+                elif in_failure:
+                    # Capture context after failure (up to 10 lines or until next test/blank line)
+                    failure_context.append(line)
+                    # End failure context on: blank line, next test case, or 10 lines
+                    if (
+                        not line.strip()
+                        or "::" in line
+                        or line.startswith("=")
+                        or len(failure_context) >= 10
+                    ):
+                        filtered_lines.extend(failure_context)
+                        in_failure = False
+                        failure_context = []
+                # Always capture summary lines
+                elif any(
+                    keyword in line.lower()
+                    for keyword in ["passed", "failed", "error", "summary", "total"]
+                ):
+                    filtered_lines.append(line)
+
+            # Add remaining failure context
+            if failure_context:
+                filtered_lines.extend(failure_context)
+
+            # If we filtered significantly, add header
+            if filtered_lines and len(filtered_lines) < original_lines * 0.5:
+                header = f"[Filtered test output - showing failures only ({len(filtered_lines)}/{original_lines} lines)]"
+                return header + "\n" + "\n".join(filtered_lines)
+            else:
+                # Not much to filter, return original but truncated if too long
+                return OutputFilter._truncate_output(output, lines, original_lines)
+
+        # Git log - limit to reasonable number of commits
+        elif "git log" in cmd or "git reflog" in cmd:
+            # Take first 50 lines (typically ~5-10 commits with details)
+            if len(lines) > 50:
+                truncated = "\n".join(lines[:50])
+                footer = f"\n[Output truncated: showing first 50/{original_lines} lines. Use --max-count to limit commits]"
+                return truncated + footer
+            return output
+
+        # Build/install output - show only errors and final status
+        elif any(
+            pattern in cmd for pattern in ["npm install", "pip install", "cargo build", "go build"]
+        ):
+            filtered_lines = []
+
+            for line in lines:
+                # Keep error/warning lines
+                if any(
+                    keyword in line.upper()
+                    for keyword in ["ERROR", "WARN", "FAIL", "SUCCESSFULLY", "COMPLETE"]
+                ):
+                    filtered_lines.append(line)
+                # Keep final summary lines
+                elif any(
+                    keyword in line.lower()
+                    for keyword in ["installed", "built", "compiled", "finished"]
+                ):
+                    filtered_lines.append(line)
+
+            if filtered_lines and len(filtered_lines) < original_lines * 0.3:
+                header = f"[Filtered build output - showing errors and summary only ({len(filtered_lines)}/{original_lines} lines)]"
+                return header + "\n" + "\n".join(filtered_lines)
+            else:
+                return OutputFilter._truncate_output(output, lines, original_lines)
+
+        # Default: truncate if too long
+        return OutputFilter._truncate_output(output, lines, original_lines)
+
+    @staticmethod
+    def _truncate_output(output: str, lines: list, original_lines: int) -> str:
+        """Truncate output if it exceeds maximum lines.
+
+        Args:
+            output: Original output string
+            lines: Split lines
+            original_lines: Count of original lines
+
+        Returns:
+            Truncated output if necessary
+        """
+        if original_lines > MAX_OUTPUT_LINES:
+            # Show first and last portions
+            keep_start = MAX_OUTPUT_LINES // 2
+            keep_end = MAX_OUTPUT_LINES // 2
+
+            truncated_lines = (
+                lines[:keep_start]
+                + ["", f"... [truncated {original_lines - MAX_OUTPUT_LINES} lines] ...", ""]
+                + lines[-keep_end:]
+            )
+
+            return "\n".join(truncated_lines)
+
+        return output
+
+
 # Global operation limiter
 _operation_limiter = OperationLimiter()

@@ -1738,26 +1926,7 @@ def edit_file(path: str, old_string: str, new_string: str) -> str:
             f"💡 Tip: Use read_lines() to see the exact context, or use apply_patch() for multiple changes."
         )

-    #
-    permission_manager = _get_permission_manager()
-
-    # Format colored diff for permission prompt (use the matched string for accurate diff)
-    diff_display = _format_colored_diff(matched_string, new_string, file_path=path)
-
-    # Add warning if writing outside repository
-    outside_repo_warning = ""
-    if not _is_inside_repo(p):
-        outside_repo_warning = "\n ⚠️ WARNING: Writing file outside repository\n"
-
-    description = f" ● Update({path}){outside_repo_warning}\n{diff_display}"
-
-    if not permission_manager.request_permission("edit_file", description, pattern=path):
-        return "Operation cancelled by user."
-
-    # Backup if enabled
-    backup_path = _backup_file(p)
-
-    # Perform replacement using the matched string
+    # Perform indentation adjustment and trailing newline preservation BEFORE showing diff
     # Important: Adjust indentation and preserve trailing newlines to maintain file structure
     adjusted_new_string = new_string

@@ -1803,6 +1972,25 @@ def edit_file(path: str, old_string: str, new_string: str) -> str:
         trailing_newlines = len(matched_string) - len(matched_string.rstrip("\n"))
         adjusted_new_string = adjusted_new_string + ("\n" * trailing_newlines)

+    # Check permission before proceeding (use adjusted_new_string for accurate diff display)
+    permission_manager = _get_permission_manager()
+
+    # Format colored diff for permission prompt (use adjusted_new_string so user sees what will actually be written)
+    diff_display = _format_colored_diff(matched_string, adjusted_new_string, file_path=path)
+
+    # Add warning if writing outside repository
+    outside_repo_warning = ""
+    if not _is_inside_repo(p):
+        outside_repo_warning = "\n ⚠️ WARNING: Writing file outside repository\n"
+
+    description = f" ● Update({path}){outside_repo_warning}\n{diff_display}"
+
+    if not permission_manager.request_permission("edit_file", description, pattern=path):
+        return "Operation cancelled by user."
+
+    # Backup if enabled
+    backup_path = _backup_file(p)
+
     new_content = content.replace(matched_string, adjusted_new_string)

     # Write the new content
@@ -2359,4 +2547,19 @@ def run_shell(cmd: str) -> str:
     stdout = result.stdout.decode("utf-8", errors="replace") if result.stdout else ""
     stderr = result.stderr.decode("utf-8", errors="replace") if result.stderr else ""

-
+    output = stdout + stderr
+
+    # Apply output filtering to reduce token usage
+    if OutputFilter.should_filter(cmd):
+        filtered_output = OutputFilter.filter_output(cmd, output)
+        # Log if we filtered significantly
+        original_lines = len(output.split("\n"))
+        filtered_lines = len(filtered_output.split("\n"))
+        if filtered_lines < original_lines * 0.5:
+            audit_logger.info(
+                f"SHELL_FILTER: Reduced output from {original_lines} to {filtered_lines} lines "
+                f"(~{int((1 - filtered_lines / original_lines) * 100)}% reduction)"
+            )
+        return filtered_output
+
+    return output
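Because `OutputFilter` (added above) depends only on the command string, the raw output, and the `PATCHPAL_FILTER_OUTPUTS` / `PATCHPAL_MAX_OUTPUT_LINES` settings, it can be exercised on synthetic output. A usage sketch with a fabricated pytest-style transcript, assuming the default `PATCHPAL_FILTER_OUTPUTS=true`:

```python
from patchpal.tools import OutputFilter  # OutputFilter is defined at module level in tools.py above

# Fabricated pytest-style output: 40 uninteresting progress lines, one failure block, one summary.
noise = [f"tests/test_module_{i}.py ......                             [{i:3d}%]" for i in range(40)]
raw_output = "\n".join(
    noise
    + [
        "=================================== FAILURES ===================================",
        "E   AssertionError: expected 3, got 2",
        "",
        "========================= 1 failed, 240 passed in 4.21s =======================",
    ]
)

cmd = "pytest -q"
if OutputFilter.should_filter(cmd):  # True: "pytest" matches the test-runner patterns
    print(OutputFilter.filter_output(cmd, raw_output))
    # Prints a "[Filtered test output - showing failures only (4/44 lines)]" header followed by
    # the FAILURES block and the summary line; the 40 progress lines are dropped.
```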
{patchpal-0.3.1 → patchpal-0.4.1/patchpal.egg-info}/PKG-INFO

(The egg-info copy of PKG-INFO carries the same +92/-1 changes shown in the PKG-INFO diff at the top of this listing.)
{patchpal-0.3.1 → patchpal-0.4.1}/tests/test_agent.py

@@ -441,13 +441,13 @@ def test_prompt_caching_application_anthropic():
     # Test with direct Anthropic API
     cached_messages = _apply_prompt_caching(messages.copy(), "anthropic/claude-sonnet-4-5")

-    # System message should have
-    assert "
-    assert cached_messages[0]["
+    # System message should have cache_control
+    assert "cache_control" in cached_messages[0]
+    assert cached_messages[0]["cache_control"] == {"type": "ephemeral"}

-    # Last 2 messages should have
-    assert "
-    assert "
+    # Last 2 messages should have cache_control
+    assert "cache_control" in cached_messages[-1]  # Last user message
+    assert "cache_control" in cached_messages[-2]  # Last assistant message


 def test_prompt_caching_application_bedrock():
@@ -488,7 +488,7 @@ def test_prompt_caching_no_modification_for_unsupported():
     cached_messages = _apply_prompt_caching(messages.copy(), "openai/gpt-4o")

     # Messages should be unchanged
-    assert "
+    assert "cache_control" not in cached_messages[0]
     assert "cachePoint" not in cached_messages[0]
     assert cached_messages == messages

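The test changes above pin down the observable behavior of `_apply_prompt_caching`: Anthropic models get `cache_control` markers on the system message and the last two non-system messages, Bedrock models use `cachePoint`, and other providers are left untouched. A minimal sketch of logic that would satisfy those assertions — an illustration of the contract, not the package's actual implementation:

```python
from typing import Any, Dict, List


def apply_prompt_caching_sketch(messages: List[Dict[str, Any]], model_id: str) -> List[Dict[str, Any]]:
    """Attach ephemeral cache markers the way the tests above expect (illustrative only)."""
    if model_id.startswith("bedrock/"):
        marker = {"cachePoint": {"type": "ephemeral"}}
    elif model_id.startswith("anthropic/"):
        marker = {"cache_control": {"type": "ephemeral"}}
    else:
        return messages  # unsupported providers: leave messages unchanged

    key, value = next(iter(marker.items()))

    # Mark system messages (usually the first one or two entries).
    for msg in messages:
        if msg.get("role") == "system":
            msg[key] = value

    # Mark the last two non-system messages (recent context).
    non_system = [m for m in messages if m.get("role") != "system"]
    for msg in non_system[-2:]:
        msg[key] = value

    return messages
```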