patchpal 0.3.2__tar.gz → 0.4.0__tar.gz

Files changed (27)
  1. {patchpal-0.3.2/patchpal.egg-info → patchpal-0.4.0}/PKG-INFO +92 -1
  2. {patchpal-0.3.2 → patchpal-0.4.0}/README.md +91 -0
  3. {patchpal-0.3.2 → patchpal-0.4.0}/patchpal/__init__.py +1 -1
  4. {patchpal-0.3.2 → patchpal-0.4.0}/patchpal/agent.py +26 -4
  5. {patchpal-0.3.2 → patchpal-0.4.0}/patchpal/cli.py +18 -0
  6. {patchpal-0.3.2 → patchpal-0.4.0/patchpal.egg-info}/PKG-INFO +92 -1
  7. {patchpal-0.3.2 → patchpal-0.4.0}/LICENSE +0 -0
  8. {patchpal-0.3.2 → patchpal-0.4.0}/MANIFEST.in +0 -0
  9. {patchpal-0.3.2 → patchpal-0.4.0}/patchpal/context.py +0 -0
  10. {patchpal-0.3.2 → patchpal-0.4.0}/patchpal/permissions.py +0 -0
  11. {patchpal-0.3.2 → patchpal-0.4.0}/patchpal/skills.py +0 -0
  12. {patchpal-0.3.2 → patchpal-0.4.0}/patchpal/system_prompt.md +0 -0
  13. {patchpal-0.3.2 → patchpal-0.4.0}/patchpal/tools.py +0 -0
  14. {patchpal-0.3.2 → patchpal-0.4.0}/patchpal.egg-info/SOURCES.txt +0 -0
  15. {patchpal-0.3.2 → patchpal-0.4.0}/patchpal.egg-info/dependency_links.txt +0 -0
  16. {patchpal-0.3.2 → patchpal-0.4.0}/patchpal.egg-info/entry_points.txt +0 -0
  17. {patchpal-0.3.2 → patchpal-0.4.0}/patchpal.egg-info/requires.txt +0 -0
  18. {patchpal-0.3.2 → patchpal-0.4.0}/patchpal.egg-info/top_level.txt +0 -0
  19. {patchpal-0.3.2 → patchpal-0.4.0}/pyproject.toml +0 -0
  20. {patchpal-0.3.2 → patchpal-0.4.0}/setup.cfg +0 -0
  21. {patchpal-0.3.2 → patchpal-0.4.0}/tests/test_agent.py +0 -0
  22. {patchpal-0.3.2 → patchpal-0.4.0}/tests/test_cli.py +0 -0
  23. {patchpal-0.3.2 → patchpal-0.4.0}/tests/test_context.py +0 -0
  24. {patchpal-0.3.2 → patchpal-0.4.0}/tests/test_guardrails.py +0 -0
  25. {patchpal-0.3.2 → patchpal-0.4.0}/tests/test_operational_safety.py +0 -0
  26. {patchpal-0.3.2 → patchpal-0.4.0}/tests/test_skills.py +0 -0
  27. {patchpal-0.3.2 → patchpal-0.4.0}/tests/test_tools.py +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: patchpal
- Version: 0.3.2
+ Version: 0.4.0
  Summary: A lean Claude Code clone in pure Python
  Author: PatchPal Contributors
  License-Expression: Apache-2.0
@@ -905,6 +905,11 @@ You: /status
  # - Token usage breakdown
  # - Visual progress bar
  # - Auto-compaction status
+ # - Session statistics:
+ #   - Total LLM calls made
+ #   - Cumulative input tokens (all requests combined)
+ #   - Cumulative output tokens (all responses combined)
+ #   - Total tokens (helps estimate API costs)
 
  # Manually trigger compaction
  You: /compact
@@ -916,6 +921,23 @@ You: /compact
  # Note: Requires at least 5 messages; most effective when context >50% full
  ```
 
+ **Understanding Session Statistics:**
+
+ The `/status` command shows cumulative token usage:
+
+ - **Cumulative input tokens**: Total tokens sent to the LLM across all calls
+   - Each LLM call resends the entire conversation history
+   - **Note on Anthropic models**: PatchPal uses prompt caching
+     - System prompt and last 2 messages are cached
+     - Cached tokens cost much less than regular input tokens
+     - The displayed token counts show raw totals, not cache-adjusted costs
+
+ - **Cumulative output tokens**: Total tokens generated by the LLM
+   - Usually much smaller than input (just the generated responses)
+   - Typically costs more per token than input
+
+ **Important**: The token counts shown are raw totals and don't reflect prompt caching discounts. For accurate cost information, check your provider's usage dashboard which shows cache hits and actual billing.
+
  **Configuration:**
 
  See the [Configuration](https://github.com/amaiya/patchpal?tab=readme-ov-file#configuration) section for context management settings including:
@@ -1004,3 +1026,72 @@ The system ensures you can work for extended periods without hitting context lim
  - Context is automatically managed at 75% capacity through pruning and compaction.
  - **Note:** Token estimation may be slightly inaccurate compared to the model's actual counting. If you see this error despite auto-compaction being enabled, the 75% threshold may need to be lowered further for your workload. You can adjust it with `export PATCHPAL_COMPACT_THRESHOLD=0.70` (or lower).
  - See [Configuration](https://github.com/amaiya/patchpal?tab=readme-ov-file#configuration) for context management settings.
+
+ **Reducing API Costs via Token Optimization**
+
+ When using cloud LLM providers (Anthropic, OpenAI, etc.), token usage directly impacts costs. PatchPal includes several features to help minimize token consumption:
+
+ **1. Use Pruning to Manage Long Sessions**
+ - **Automatic pruning** removes old tool outputs while preserving conversation context
+ - Configure pruning thresholds to be more aggressive:
+ ```bash
+ export PATCHPAL_PRUNE_PROTECT=20000 # Reduce from 40k to 20k tokens
+ export PATCHPAL_PRUNE_MINIMUM=10000 # Reduce minimum saved from 20k to 10k
+ ```
+ - Pruning happens transparently before compaction and is much faster (no LLM call needed)
+
+ **2. Monitor Session Token Usage**
+ - Use `/status` to see cumulative token usage in real-time
+ - **Session Statistics** section shows:
+   - Total LLM calls made
+   - Cumulative input tokens (raw totals, before caching discounts)
+   - Cumulative output tokens
+   - Total tokens for the session
+ - Check periodically during long sessions to monitor usage
+ - **Important**: Token counts don't reflect prompt caching discounts (Anthropic models)
+ - For actual costs, check your provider's usage dashboard which shows cache-adjusted billing
+
+ **3. Manual Compaction for Cost Control**
+ - Use `/status` regularly to monitor context window usage
+ - Run `/compact` proactively when context grows large (before hitting auto-compact threshold)
+ - Manual compaction gives you control over when the summarization LLM call happens
+
+ **4. Adjust Auto-Compaction Threshold**
+ - Lower threshold = more frequent compaction = smaller context = lower per-request costs
+ - Higher threshold = fewer compaction calls = larger context = higher per-request costs
+ ```bash
+ # More aggressive compaction (compact at 60% instead of 75%)
+ export PATCHPAL_COMPACT_THRESHOLD=0.60
+ ```
+ - Find the sweet spot for your workload (balance between compaction frequency and context size)
+
+ **5. Use Local Models for Zero API Costs**
+ - **Best option:** Run vLLM locally to eliminate API costs entirely
+ ```bash
+ export HOSTED_VLLM_API_BASE=http://localhost:8000
+ export HOSTED_VLLM_API_KEY=token-abc123
+ patchpal --model hosted_vllm/openai/gpt-oss-20b
+ ```
+ - **Alternative:** Use Ollama (requires `OLLAMA_CONTEXT_LENGTH=32768`)
+ - See [Using Local Models](https://github.com/amaiya/patchpal?tab=readme-ov-file#using-local-models-vllm--ollama) for setup
+
+ **6. Start Fresh When Appropriate**
+ - Use `/clear` command to reset conversation history without restarting PatchPal
+ - Exit and restart PatchPal between unrelated tasks to clear context completely
+ - Each fresh start begins with minimal tokens (just the system prompt)
+ - Better than carrying large conversation history across different tasks
+
+ **7. Use Smaller Models for Simple Tasks**
+ - Use less expensive models for routine tasks:
+ ```bash
+ patchpal --model anthropic/claude-3-7-sonnet-latest # Cheaper than claude-sonnet-4-5
+ patchpal --model openai/gpt-4o-mini # Cheaper than gpt-4o
+ ```
+ - Reserve premium models for complex reasoning tasks
+
+ **Cost Monitoring Tips:**
+ - Check `/status` before large operations to see current token usage
+ - **Anthropic models**: Prompt caching reduces costs (system prompt + last 2 messages cached)
+ - Most cloud providers offer usage dashboards showing cache hits and actual charges
+ - Set up billing alerts with your provider to avoid surprises
+ - Consider local models (vLLM recommended) for high-volume usage or zero API costs
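
The added text above says that each LLM call resends the entire conversation history, which is why cumulative input tokens grow much faster than output tokens. A minimal arithmetic sketch of that effect follows. The function name `cumulative_input_tokens` and the token sizes are hypothetical illustrations, not values measured from PatchPal, and the model ignores pruning and compaction.

```python
def cumulative_input_tokens(system_tokens: int, turn_tokens: int, calls: int) -> int:
    """Sum prompt sizes over `calls` requests when history is resent each time.

    Call k (1-based) sends the system prompt plus k turns of history,
    so the running total grows quadratically with the number of calls.
    """
    total = 0
    for k in range(1, calls + 1):
        total += system_tokens + k * turn_tokens
    return total


# Hypothetical sizes: a 2,000-token system prompt and 500 tokens added per turn.
for n in (1, 10, 50):
    print(n, cumulative_input_tokens(2_000, 500, n))
```

Under these assumptions the per-call prompt grows linearly, so the cumulative figure reported by `/status` can be large even in a session with short individual replies.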
@@ -868,6 +868,11 @@ You: /status
  # - Token usage breakdown
  # - Visual progress bar
  # - Auto-compaction status
+ # - Session statistics:
+ #   - Total LLM calls made
+ #   - Cumulative input tokens (all requests combined)
+ #   - Cumulative output tokens (all responses combined)
+ #   - Total tokens (helps estimate API costs)
 
  # Manually trigger compaction
  You: /compact
@@ -879,6 +884,23 @@ You: /compact
  # Note: Requires at least 5 messages; most effective when context >50% full
  ```
 
+ **Understanding Session Statistics:**
+
+ The `/status` command shows cumulative token usage:
+
+ - **Cumulative input tokens**: Total tokens sent to the LLM across all calls
+   - Each LLM call resends the entire conversation history
+   - **Note on Anthropic models**: PatchPal uses prompt caching
+     - System prompt and last 2 messages are cached
+     - Cached tokens cost much less than regular input tokens
+     - The displayed token counts show raw totals, not cache-adjusted costs
+
+ - **Cumulative output tokens**: Total tokens generated by the LLM
+   - Usually much smaller than input (just the generated responses)
+   - Typically costs more per token than input
+
+ **Important**: The token counts shown are raw totals and don't reflect prompt caching discounts. For accurate cost information, check your provider's usage dashboard which shows cache hits and actual billing.
+
  **Configuration:**
 
  See the [Configuration](https://github.com/amaiya/patchpal?tab=readme-ov-file#configuration) section for context management settings including:
@@ -967,3 +989,72 @@ The system ensures you can work for extended periods without hitting context lim
  - Context is automatically managed at 75% capacity through pruning and compaction.
  - **Note:** Token estimation may be slightly inaccurate compared to the model's actual counting. If you see this error despite auto-compaction being enabled, the 75% threshold may need to be lowered further for your workload. You can adjust it with `export PATCHPAL_COMPACT_THRESHOLD=0.70` (or lower).
  - See [Configuration](https://github.com/amaiya/patchpal?tab=readme-ov-file#configuration) for context management settings.
+
+ **Reducing API Costs via Token Optimization**
+
+ When using cloud LLM providers (Anthropic, OpenAI, etc.), token usage directly impacts costs. PatchPal includes several features to help minimize token consumption:
+
+ **1. Use Pruning to Manage Long Sessions**
+ - **Automatic pruning** removes old tool outputs while preserving conversation context
+ - Configure pruning thresholds to be more aggressive:
+ ```bash
+ export PATCHPAL_PRUNE_PROTECT=20000 # Reduce from 40k to 20k tokens
+ export PATCHPAL_PRUNE_MINIMUM=10000 # Reduce minimum saved from 20k to 10k
+ ```
+ - Pruning happens transparently before compaction and is much faster (no LLM call needed)
+
+ **2. Monitor Session Token Usage**
+ - Use `/status` to see cumulative token usage in real-time
+ - **Session Statistics** section shows:
+   - Total LLM calls made
+   - Cumulative input tokens (raw totals, before caching discounts)
+   - Cumulative output tokens
+   - Total tokens for the session
+ - Check periodically during long sessions to monitor usage
+ - **Important**: Token counts don't reflect prompt caching discounts (Anthropic models)
+ - For actual costs, check your provider's usage dashboard which shows cache-adjusted billing
+
+ **3. Manual Compaction for Cost Control**
+ - Use `/status` regularly to monitor context window usage
+ - Run `/compact` proactively when context grows large (before hitting auto-compact threshold)
+ - Manual compaction gives you control over when the summarization LLM call happens
+
+ **4. Adjust Auto-Compaction Threshold**
+ - Lower threshold = more frequent compaction = smaller context = lower per-request costs
+ - Higher threshold = fewer compaction calls = larger context = higher per-request costs
+ ```bash
+ # More aggressive compaction (compact at 60% instead of 75%)
+ export PATCHPAL_COMPACT_THRESHOLD=0.60
+ ```
+ - Find the sweet spot for your workload (balance between compaction frequency and context size)
+
+ **5. Use Local Models for Zero API Costs**
+ - **Best option:** Run vLLM locally to eliminate API costs entirely
+ ```bash
+ export HOSTED_VLLM_API_BASE=http://localhost:8000
+ export HOSTED_VLLM_API_KEY=token-abc123
+ patchpal --model hosted_vllm/openai/gpt-oss-20b
+ ```
+ - **Alternative:** Use Ollama (requires `OLLAMA_CONTEXT_LENGTH=32768`)
+ - See [Using Local Models](https://github.com/amaiya/patchpal?tab=readme-ov-file#using-local-models-vllm--ollama) for setup
+
+ **6. Start Fresh When Appropriate**
+ - Use `/clear` command to reset conversation history without restarting PatchPal
+ - Exit and restart PatchPal between unrelated tasks to clear context completely
+ - Each fresh start begins with minimal tokens (just the system prompt)
+ - Better than carrying large conversation history across different tasks
+
+ **7. Use Smaller Models for Simple Tasks**
+ - Use less expensive models for routine tasks:
+ ```bash
+ patchpal --model anthropic/claude-3-7-sonnet-latest # Cheaper than claude-sonnet-4-5
+ patchpal --model openai/gpt-4o-mini # Cheaper than gpt-4o
+ ```
+ - Reserve premium models for complex reasoning tasks
+
+ **Cost Monitoring Tips:**
+ - Check `/status` before large operations to see current token usage
+ - **Anthropic models**: Prompt caching reduces costs (system prompt + last 2 messages cached)
+ - Most cloud providers offer usage dashboards showing cache hits and actual charges
+ - Set up billing alerts with your provider to avoid surprises
+ - Consider local models (vLLM recommended) for high-volume usage or zero API costs
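
The README text in this hunk stresses that the `/status` totals are raw counts and do not reflect prompt caching discounts. A rough sketch of how a raw total might be turned into a cost estimate is shown below. Everything here is an assumption for illustration: the function `estimate_cost`, the per-million-token prices, the cached fraction, and the 90% discount (the figure mentioned in the docstring line removed from `agent.py` in this release). It also ignores any cache-write surcharge a provider may apply, so the provider's usage dashboard remains the authoritative source.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    cached_fraction: float,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    cache_discount: float = 0.9,  # assumed discount on cached input tokens
) -> float:
    """Estimate session cost in the same currency unit as the prices."""
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    # Cached tokens are billed at (1 - cache_discount) of the normal input price.
    billable_input = uncached + cached * (1.0 - cache_discount)
    input_cost = billable_input * input_price_per_mtok / 1_000_000
    output_cost = output_tokens * output_price_per_mtok / 1_000_000
    return input_cost + output_cost


# Hypothetical prices: 3.00 per million input tokens, 15.00 per million output tokens.
raw = estimate_cost(1_000_000, 100_000, 0.0, 3.0, 15.0)
half_cached = estimate_cost(1_000_000, 100_000, 0.5, 3.0, 15.0)
print(f"no caching: {raw:.2f}, half cached: {half_cached:.2f}")
```

The gap between the two figures is the part of the bill that the raw `/status` totals cannot show.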
@@ -1,6 +1,6 @@
  """PatchPal - An open-source Claude Code clone implemented purely in Python."""
 
- __version__ = "0.3.2"
+ __version__ = "0.4.0"
 
  from patchpal.agent import create_agent
  from patchpal.tools import (
@@ -725,9 +725,7 @@ def _apply_prompt_caching(messages: List[Dict[str, Any]], model_id: str) -> List
 
      Caches:
      - System messages (first 1-2 messages with role="system")
-     - Last 2 conversation messages (recent context)
-
-     This provides 90% cost reduction on cached content after the first request.
+     - Last 2 non-system messages (recent context, any role except system)
 
      Args:
          messages: List of message dictionaries
@@ -818,6 +816,11 @@ class PatchPalAgent:
          # Track last compaction to prevent compaction loops
          self._last_compaction_message_count = 0
 
+         # Track cumulative token usage across all LLM calls
+         self.total_llm_calls = 0
+         self.cumulative_input_tokens = 0
+         self.cumulative_output_tokens = 0
+
          # LiteLLM settings for models that need parameter dropping
          self.litellm_kwargs = {}
          if self.model_id.startswith("bedrock/"):
@@ -896,12 +899,22 @@ class PatchPalAgent:
                  messages = [{"role": "system", "content": SYSTEM_PROMPT}] + msgs
                  # Apply prompt caching for supported models
                  messages = _apply_prompt_caching(messages, self.model_id)
-                 return litellm.completion(
+                 response = litellm.completion(
                      model=self.model_id,
                      messages=messages,
                      **self.litellm_kwargs,
                  )
 
+                 # Track token usage from compaction call
+                 self.total_llm_calls += 1
+                 if hasattr(response, "usage") and response.usage:
+                     if hasattr(response.usage, "prompt_tokens"):
+                         self.cumulative_input_tokens += response.usage.prompt_tokens
+                     if hasattr(response.usage, "completion_tokens"):
+                         self.cumulative_output_tokens += response.usage.completion_tokens
+
+                 return response
+
              summary_msg, summary_text = self.context_manager.create_compaction(
                  self.messages,
                  compaction_completion,
@@ -995,6 +1008,15 @@ class PatchPalAgent:
                      tool_choice="auto",
                      **self.litellm_kwargs,
                  )
+
+                 # Track token usage from this LLM call
+                 self.total_llm_calls += 1
+                 if hasattr(response, "usage") and response.usage:
+                     if hasattr(response.usage, "prompt_tokens"):
+                         self.cumulative_input_tokens += response.usage.prompt_tokens
+                     if hasattr(response.usage, "completion_tokens"):
+                         self.cumulative_output_tokens += response.usage.completion_tokens
+
              except Exception as e:
                  return f"Error calling model: {e}"
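
The two hunks above repeat the same usage-accumulation logic, once for the compaction call and once for the main agent call. One way that logic could be factored into a single helper is sketched below. The `UsageTotals` class and its `record` method are hypothetical and do not exist in PatchPal; the field names mirror the attributes added to `PatchPalAgent`, and `SimpleNamespace` stands in for a LiteLLM response object. Unlike the diff, this sketch treats a `None` token count as zero instead of raising.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any


@dataclass
class UsageTotals:
    """Hypothetical container for the counters added in this release."""

    total_llm_calls: int = 0
    cumulative_input_tokens: int = 0
    cumulative_output_tokens: int = 0

    def record(self, response: Any) -> None:
        """Count one LLM call and add its token usage, if the response reports any."""
        self.total_llm_calls += 1
        usage = getattr(response, "usage", None)
        if not usage:
            return  # model did not report usage; only the call count changes
        self.cumulative_input_tokens += getattr(usage, "prompt_tokens", 0) or 0
        self.cumulative_output_tokens += getattr(usage, "completion_tokens", 0) or 0


# Stand-in responses: one with usage data, one without.
totals = UsageTotals()
totals.record(SimpleNamespace(usage=SimpleNamespace(prompt_tokens=1200, completion_tokens=80)))
totals.record(SimpleNamespace(usage=None))
print(totals)
```

Both call sites would then reduce to a single `record(response)` call, which keeps the compaction path and the main loop from drifting apart.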
 
@@ -362,6 +362,24 @@ Supported models: Any LiteLLM-supported model
                          "\n Auto-compaction: \033[33mDisabled\033[0m (set PATCHPAL_DISABLE_AUTOCOMPACT=false to enable)"
                      )
 
+                 # Show cumulative token usage
+                 print("\n\033[1;36mSession Statistics\033[0m")
+                 print(f" LLM calls: {agent.total_llm_calls}")
+
+                 # Check if usage info is available (if we have LLM calls but no token counts)
+                 has_usage_info = (
+                     agent.cumulative_input_tokens > 0 or agent.cumulative_output_tokens > 0
+                 )
+                 if agent.total_llm_calls > 0 and not has_usage_info:
+                     print(
+                         " \033[2mToken usage unavailable (model doesn't report usage info)\033[0m"
+                     )
+                 else:
+                     print(f" Cumulative input tokens: {agent.cumulative_input_tokens:,}")
+                     print(f" Cumulative output tokens: {agent.cumulative_output_tokens:,}")
+                     total_tokens = agent.cumulative_input_tokens + agent.cumulative_output_tokens
+                     print(f" Total tokens: {total_tokens:,}")
+
                  print("=" * 70 + "\n")
                  continue
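
The `/status` hunk above distinguishes two cases: a session whose model reports no usage data, and one with real token counts. The branching can be exercised in isolation with a small pure function. `format_session_stats` is a hypothetical name used only for this sketch, and the ANSI color codes from the diff are omitted for readability.

```python
from typing import List


def format_session_stats(calls: int, input_tokens: int, output_tokens: int) -> List[str]:
    """Return the Session Statistics lines, following the branching in the hunk above."""
    lines = ["Session Statistics", f"  LLM calls: {calls}"]
    has_usage_info = input_tokens > 0 or output_tokens > 0
    if calls > 0 and not has_usage_info:
        # Calls were made but no counts accumulated: the model reported no usage.
        lines.append("  Token usage unavailable (model doesn't report usage info)")
    else:
        lines.append(f"  Cumulative input tokens: {input_tokens:,}")
        lines.append(f"  Cumulative output tokens: {output_tokens:,}")
        lines.append(f"  Total tokens: {input_tokens + output_tokens:,}")
    return lines


print("\n".join(format_session_stats(2, 12345, 678)))
print("\n".join(format_session_stats(3, 0, 0)))
```

Note that a fresh session with zero calls falls into the `else` branch and prints zeros, which matches the condition in the diff.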
 
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: patchpal
- Version: 0.3.2
+ Version: 0.4.0
  Summary: A lean Claude Code clone in pure Python
  Author: PatchPal Contributors
  License-Expression: Apache-2.0
@@ -905,6 +905,11 @@ You: /status
  # - Token usage breakdown
  # - Visual progress bar
  # - Auto-compaction status
+ # - Session statistics:
+ #   - Total LLM calls made
+ #   - Cumulative input tokens (all requests combined)
+ #   - Cumulative output tokens (all responses combined)
+ #   - Total tokens (helps estimate API costs)
 
  # Manually trigger compaction
  You: /compact
@@ -916,6 +921,23 @@ You: /compact
  # Note: Requires at least 5 messages; most effective when context >50% full
  ```
 
+ **Understanding Session Statistics:**
+
+ The `/status` command shows cumulative token usage:
+
+ - **Cumulative input tokens**: Total tokens sent to the LLM across all calls
+   - Each LLM call resends the entire conversation history
+   - **Note on Anthropic models**: PatchPal uses prompt caching
+     - System prompt and last 2 messages are cached
+     - Cached tokens cost much less than regular input tokens
+     - The displayed token counts show raw totals, not cache-adjusted costs
+
+ - **Cumulative output tokens**: Total tokens generated by the LLM
+   - Usually much smaller than input (just the generated responses)
+   - Typically costs more per token than input
+
+ **Important**: The token counts shown are raw totals and don't reflect prompt caching discounts. For accurate cost information, check your provider's usage dashboard which shows cache hits and actual billing.
+
  **Configuration:**
 
  See the [Configuration](https://github.com/amaiya/patchpal?tab=readme-ov-file#configuration) section for context management settings including:
@@ -1004,3 +1026,72 @@ The system ensures you can work for extended periods without hitting context lim
  - Context is automatically managed at 75% capacity through pruning and compaction.
  - **Note:** Token estimation may be slightly inaccurate compared to the model's actual counting. If you see this error despite auto-compaction being enabled, the 75% threshold may need to be lowered further for your workload. You can adjust it with `export PATCHPAL_COMPACT_THRESHOLD=0.70` (or lower).
  - See [Configuration](https://github.com/amaiya/patchpal?tab=readme-ov-file#configuration) for context management settings.
+
+ **Reducing API Costs via Token Optimization**
+
+ When using cloud LLM providers (Anthropic, OpenAI, etc.), token usage directly impacts costs. PatchPal includes several features to help minimize token consumption:
+
+ **1. Use Pruning to Manage Long Sessions**
+ - **Automatic pruning** removes old tool outputs while preserving conversation context
+ - Configure pruning thresholds to be more aggressive:
+ ```bash
+ export PATCHPAL_PRUNE_PROTECT=20000 # Reduce from 40k to 20k tokens
+ export PATCHPAL_PRUNE_MINIMUM=10000 # Reduce minimum saved from 20k to 10k
+ ```
+ - Pruning happens transparently before compaction and is much faster (no LLM call needed)
+
+ **2. Monitor Session Token Usage**
+ - Use `/status` to see cumulative token usage in real-time
+ - **Session Statistics** section shows:
+   - Total LLM calls made
+   - Cumulative input tokens (raw totals, before caching discounts)
+   - Cumulative output tokens
+   - Total tokens for the session
+ - Check periodically during long sessions to monitor usage
+ - **Important**: Token counts don't reflect prompt caching discounts (Anthropic models)
+ - For actual costs, check your provider's usage dashboard which shows cache-adjusted billing
+
+ **3. Manual Compaction for Cost Control**
+ - Use `/status` regularly to monitor context window usage
+ - Run `/compact` proactively when context grows large (before hitting auto-compact threshold)
+ - Manual compaction gives you control over when the summarization LLM call happens
+
+ **4. Adjust Auto-Compaction Threshold**
+ - Lower threshold = more frequent compaction = smaller context = lower per-request costs
+ - Higher threshold = fewer compaction calls = larger context = higher per-request costs
+ ```bash
+ # More aggressive compaction (compact at 60% instead of 75%)
+ export PATCHPAL_COMPACT_THRESHOLD=0.60
+ ```
+ - Find the sweet spot for your workload (balance between compaction frequency and context size)
+
+ **5. Use Local Models for Zero API Costs**
+ - **Best option:** Run vLLM locally to eliminate API costs entirely
+ ```bash
+ export HOSTED_VLLM_API_BASE=http://localhost:8000
+ export HOSTED_VLLM_API_KEY=token-abc123
+ patchpal --model hosted_vllm/openai/gpt-oss-20b
+ ```
+ - **Alternative:** Use Ollama (requires `OLLAMA_CONTEXT_LENGTH=32768`)
+ - See [Using Local Models](https://github.com/amaiya/patchpal?tab=readme-ov-file#using-local-models-vllm--ollama) for setup
+
+ **6. Start Fresh When Appropriate**
+ - Use `/clear` command to reset conversation history without restarting PatchPal
+ - Exit and restart PatchPal between unrelated tasks to clear context completely
+ - Each fresh start begins with minimal tokens (just the system prompt)
+ - Better than carrying large conversation history across different tasks
+
+ **7. Use Smaller Models for Simple Tasks**
+ - Use less expensive models for routine tasks:
+ ```bash
+ patchpal --model anthropic/claude-3-7-sonnet-latest # Cheaper than claude-sonnet-4-5
+ patchpal --model openai/gpt-4o-mini # Cheaper than gpt-4o
+ ```
+ - Reserve premium models for complex reasoning tasks
+
+ **Cost Monitoring Tips:**
+ - Check `/status` before large operations to see current token usage
+ - **Anthropic models**: Prompt caching reduces costs (system prompt + last 2 messages cached)
+ - Most cloud providers offer usage dashboards showing cache hits and actual charges
+ - Set up billing alerts with your provider to avoid surprises
+ - Consider local models (vLLM recommended) for high-volume usage or zero API costs
File without changes