patchpal 0.3.2__tar.gz → 0.4.0__tar.gz
- {patchpal-0.3.2/patchpal.egg-info → patchpal-0.4.0}/PKG-INFO +92 -1
- {patchpal-0.3.2 → patchpal-0.4.0}/README.md +91 -0
- {patchpal-0.3.2 → patchpal-0.4.0}/patchpal/__init__.py +1 -1
- {patchpal-0.3.2 → patchpal-0.4.0}/patchpal/agent.py +26 -4
- {patchpal-0.3.2 → patchpal-0.4.0}/patchpal/cli.py +18 -0
- {patchpal-0.3.2 → patchpal-0.4.0/patchpal.egg-info}/PKG-INFO +92 -1
- {patchpal-0.3.2 → patchpal-0.4.0}/LICENSE +0 -0
- {patchpal-0.3.2 → patchpal-0.4.0}/MANIFEST.in +0 -0
- {patchpal-0.3.2 → patchpal-0.4.0}/patchpal/context.py +0 -0
- {patchpal-0.3.2 → patchpal-0.4.0}/patchpal/permissions.py +0 -0
- {patchpal-0.3.2 → patchpal-0.4.0}/patchpal/skills.py +0 -0
- {patchpal-0.3.2 → patchpal-0.4.0}/patchpal/system_prompt.md +0 -0
- {patchpal-0.3.2 → patchpal-0.4.0}/patchpal/tools.py +0 -0
- {patchpal-0.3.2 → patchpal-0.4.0}/patchpal.egg-info/SOURCES.txt +0 -0
- {patchpal-0.3.2 → patchpal-0.4.0}/patchpal.egg-info/dependency_links.txt +0 -0
- {patchpal-0.3.2 → patchpal-0.4.0}/patchpal.egg-info/entry_points.txt +0 -0
- {patchpal-0.3.2 → patchpal-0.4.0}/patchpal.egg-info/requires.txt +0 -0
- {patchpal-0.3.2 → patchpal-0.4.0}/patchpal.egg-info/top_level.txt +0 -0
- {patchpal-0.3.2 → patchpal-0.4.0}/pyproject.toml +0 -0
- {patchpal-0.3.2 → patchpal-0.4.0}/setup.cfg +0 -0
- {patchpal-0.3.2 → patchpal-0.4.0}/tests/test_agent.py +0 -0
- {patchpal-0.3.2 → patchpal-0.4.0}/tests/test_cli.py +0 -0
- {patchpal-0.3.2 → patchpal-0.4.0}/tests/test_context.py +0 -0
- {patchpal-0.3.2 → patchpal-0.4.0}/tests/test_guardrails.py +0 -0
- {patchpal-0.3.2 → patchpal-0.4.0}/tests/test_operational_safety.py +0 -0
- {patchpal-0.3.2 → patchpal-0.4.0}/tests/test_skills.py +0 -0
- {patchpal-0.3.2 → patchpal-0.4.0}/tests/test_tools.py +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: patchpal
-Version: 0.
+Version: 0.4.0
 Summary: A lean Claude Code clone in pure Python
 Author: PatchPal Contributors
 License-Expression: Apache-2.0
@@ -905,6 +905,11 @@ You: /status
 # - Token usage breakdown
 # - Visual progress bar
 # - Auto-compaction status
+# - Session statistics:
+# - Total LLM calls made
+# - Cumulative input tokens (all requests combined)
+# - Cumulative output tokens (all responses combined)
+# - Total tokens (helps estimate API costs)
 
 # Manually trigger compaction
 You: /compact
@@ -916,6 +921,23 @@ You: /compact
 # Note: Requires at least 5 messages; most effective when context >50% full
 ```
 
+**Understanding Session Statistics:**
+
+The `/status` command shows cumulative token usage:
+
+- **Cumulative input tokens**: Total tokens sent to the LLM across all calls
+- Each LLM call resends the entire conversation history
+- **Note on Anthropic models**: PatchPal uses prompt caching
+- System prompt and last 2 messages are cached
+- Cached tokens cost much less than regular input tokens
+- The displayed token counts show raw totals, not cache-adjusted costs
+
+- **Cumulative output tokens**: Total tokens generated by the LLM
+- Usually much smaller than input (just the generated responses)
+- Typically costs more per token than input
+
+**Important**: The token counts shown are raw totals and don't reflect prompt caching discounts. For accurate cost information, check your provider's usage dashboard which shows cache hits and actual billing.
+
 **Configuration:**
 
 See the [Configuration](https://github.com/amaiya/patchpal?tab=readme-ov-file#configuration) section for context management settings including:
@@ -1004,3 +1026,72 @@ The system ensures you can work for extended periods without hitting context lim
 - Context is automatically managed at 75% capacity through pruning and compaction.
 - **Note:** Token estimation may be slightly inaccurate compared to the model's actual counting. If you see this error despite auto-compaction being enabled, the 75% threshold may need to be lowered further for your workload. You can adjust it with `export PATCHPAL_COMPACT_THRESHOLD=0.70` (or lower).
 - See [Configuration](https://github.com/amaiya/patchpal?tab=readme-ov-file#configuration) for context management settings.
+
+**Reducing API Costs via Token Optimization**
+
+When using cloud LLM providers (Anthropic, OpenAI, etc.), token usage directly impacts costs. PatchPal includes several features to help minimize token consumption:
+
+**1. Use Pruning to Manage Long Sessions**
+- **Automatic pruning** removes old tool outputs while preserving conversation context
+- Configure pruning thresholds to be more aggressive:
+```bash
+export PATCHPAL_PRUNE_PROTECT=20000 # Reduce from 40k to 20k tokens
+export PATCHPAL_PRUNE_MINIMUM=10000 # Reduce minimum saved from 20k to 10k
+```
+- Pruning happens transparently before compaction and is much faster (no LLM call needed)
+
+**2. Monitor Session Token Usage**
+- Use `/status` to see cumulative token usage in real-time
+- **Session Statistics** section shows:
+- Total LLM calls made
+- Cumulative input tokens (raw totals, before caching discounts)
+- Cumulative output tokens
+- Total tokens for the session
+- Check periodically during long sessions to monitor usage
+- **Important**: Token counts don't reflect prompt caching discounts (Anthropic models)
+- For actual costs, check your provider's usage dashboard which shows cache-adjusted billing
+
+**3. Manual Compaction for Cost Control**
+- Use `/status` regularly to monitor context window usage
+- Run `/compact` proactively when context grows large (before hitting auto-compact threshold)
+- Manual compaction gives you control over when the summarization LLM call happens
+
+**4. Adjust Auto-Compaction Threshold**
+- Lower threshold = more frequent compaction = smaller context = lower per-request costs
+- Higher threshold = fewer compaction calls = larger context = higher per-request costs
+```bash
+# More aggressive compaction (compact at 60% instead of 75%)
+export PATCHPAL_COMPACT_THRESHOLD=0.60
+```
+- Find the sweet spot for your workload (balance between compaction frequency and context size)
+
+**5. Use Local Models for Zero API Costs**
+- **Best option:** Run vLLM locally to eliminate API costs entirely
+```bash
+export HOSTED_VLLM_API_BASE=http://localhost:8000
+export HOSTED_VLLM_API_KEY=token-abc123
+patchpal --model hosted_vllm/openai/gpt-oss-20b
+```
+- **Alternative:** Use Ollama (requires `OLLAMA_CONTEXT_LENGTH=32768`)
+- See [Using Local Models](https://github.com/amaiya/patchpal?tab=readme-ov-file#using-local-models-vllm--ollama) for setup
+
+**6. Start Fresh When Appropriate**
+- Use `/clear` command to reset conversation history without restarting PatchPal
+- Exit and restart PatchPal between unrelated tasks to clear context completely
+- Each fresh start begins with minimal tokens (just the system prompt)
+- Better than carrying large conversation history across different tasks
+
+**7. Use Smaller Models for Simple Tasks**
+- Use less expensive models for routine tasks:
+```bash
+patchpal --model anthropic/claude-3-7-sonnet-latest # Cheaper than claude-sonnet-4-5
+patchpal --model openai/gpt-4o-mini # Cheaper than gpt-4o
+```
+- Reserve premium models for complex reasoning tasks
+
+**Cost Monitoring Tips:**
+- Check `/status` before large operations to see current token usage
+- **Anthropic models**: Prompt caching reduces costs (system prompt + last 2 messages cached)
+- Most cloud providers offer usage dashboards showing cache hits and actual charges
+- Set up billing alerts with your provider to avoid surprises
+- Consider local models (vLLM recommended) for high-volume usage or zero API costs
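The README text added in the hunk above says that `/status` reports raw token totals and that cached input tokens are billed at a lower rate. As a rough illustration of why the raw total can overstate the bill, the sketch below estimates a session cost from cumulative counts. The function name `estimate_cost`, every price, the cached fraction, and the discount are hypothetical placeholders, not values taken from PatchPal or from any provider's price list, and the sketch ignores any surcharge a provider may apply when writing to the cache.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    cached_fraction: float = 0.0,
    cached_discount: float = 0.0,
) -> float:
    """Estimate a session cost in the currency unit used by the prices.

    cached_fraction: share of input tokens served from the prompt cache (0..1).
    cached_discount: relative price reduction on cached tokens (0..1).
    Both are assumptions supplied by the caller; /status does not report them.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    if not 0.0 <= cached_discount <= 1.0:
        raise ValueError("cached_discount must be between 0 and 1")

    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    # Cached tokens are charged at the discounted rate, the rest at full price.
    billable_input = uncached + cached * (1.0 - cached_discount)
    input_cost = billable_input * input_price_per_mtok / 1_000_000
    output_cost = output_tokens * output_price_per_mtok / 1_000_000
    return input_cost + output_cost


# Hypothetical prices: 2.0 per million input tokens, 10.0 per million output tokens.
raw_estimate = estimate_cost(1_000_000, 100_000, 2.0, 10.0)
cache_adjusted = estimate_cost(1_000_000, 100_000, 2.0, 10.0,
                               cached_fraction=0.5, cached_discount=0.9)
```

With these placeholder numbers the cache-adjusted figure is lower than the raw one, which is the gap the README warns about. The provider's usage dashboard remains the authoritative source.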
@@ -868,6 +868,11 @@ You: /status
 # - Token usage breakdown
 # - Visual progress bar
 # - Auto-compaction status
+# - Session statistics:
+# - Total LLM calls made
+# - Cumulative input tokens (all requests combined)
+# - Cumulative output tokens (all responses combined)
+# - Total tokens (helps estimate API costs)
 
 # Manually trigger compaction
 You: /compact
@@ -879,6 +884,23 @@ You: /compact
 # Note: Requires at least 5 messages; most effective when context >50% full
 ```
 
+**Understanding Session Statistics:**
+
+The `/status` command shows cumulative token usage:
+
+- **Cumulative input tokens**: Total tokens sent to the LLM across all calls
+- Each LLM call resends the entire conversation history
+- **Note on Anthropic models**: PatchPal uses prompt caching
+- System prompt and last 2 messages are cached
+- Cached tokens cost much less than regular input tokens
+- The displayed token counts show raw totals, not cache-adjusted costs
+
+- **Cumulative output tokens**: Total tokens generated by the LLM
+- Usually much smaller than input (just the generated responses)
+- Typically costs more per token than input
+
+**Important**: The token counts shown are raw totals and don't reflect prompt caching discounts. For accurate cost information, check your provider's usage dashboard which shows cache hits and actual billing.
+
 **Configuration:**
 
 See the [Configuration](https://github.com/amaiya/patchpal?tab=readme-ov-file#configuration) section for context management settings including:
@@ -967,3 +989,72 @@ The system ensures you can work for extended periods without hitting context lim
 - Context is automatically managed at 75% capacity through pruning and compaction.
 - **Note:** Token estimation may be slightly inaccurate compared to the model's actual counting. If you see this error despite auto-compaction being enabled, the 75% threshold may need to be lowered further for your workload. You can adjust it with `export PATCHPAL_COMPACT_THRESHOLD=0.70` (or lower).
 - See [Configuration](https://github.com/amaiya/patchpal?tab=readme-ov-file#configuration) for context management settings.
+
+**Reducing API Costs via Token Optimization**
+
+When using cloud LLM providers (Anthropic, OpenAI, etc.), token usage directly impacts costs. PatchPal includes several features to help minimize token consumption:
+
+**1. Use Pruning to Manage Long Sessions**
+- **Automatic pruning** removes old tool outputs while preserving conversation context
+- Configure pruning thresholds to be more aggressive:
+```bash
+export PATCHPAL_PRUNE_PROTECT=20000 # Reduce from 40k to 20k tokens
+export PATCHPAL_PRUNE_MINIMUM=10000 # Reduce minimum saved from 20k to 10k
+```
+- Pruning happens transparently before compaction and is much faster (no LLM call needed)
+
+**2. Monitor Session Token Usage**
+- Use `/status` to see cumulative token usage in real-time
+- **Session Statistics** section shows:
+- Total LLM calls made
+- Cumulative input tokens (raw totals, before caching discounts)
+- Cumulative output tokens
+- Total tokens for the session
+- Check periodically during long sessions to monitor usage
+- **Important**: Token counts don't reflect prompt caching discounts (Anthropic models)
+- For actual costs, check your provider's usage dashboard which shows cache-adjusted billing
+
+**3. Manual Compaction for Cost Control**
+- Use `/status` regularly to monitor context window usage
+- Run `/compact` proactively when context grows large (before hitting auto-compact threshold)
+- Manual compaction gives you control over when the summarization LLM call happens
+
+**4. Adjust Auto-Compaction Threshold**
+- Lower threshold = more frequent compaction = smaller context = lower per-request costs
+- Higher threshold = fewer compaction calls = larger context = higher per-request costs
+```bash
+# More aggressive compaction (compact at 60% instead of 75%)
+export PATCHPAL_COMPACT_THRESHOLD=0.60
+```
+- Find the sweet spot for your workload (balance between compaction frequency and context size)
+
+**5. Use Local Models for Zero API Costs**
+- **Best option:** Run vLLM locally to eliminate API costs entirely
+```bash
+export HOSTED_VLLM_API_BASE=http://localhost:8000
+export HOSTED_VLLM_API_KEY=token-abc123
+patchpal --model hosted_vllm/openai/gpt-oss-20b
+```
+- **Alternative:** Use Ollama (requires `OLLAMA_CONTEXT_LENGTH=32768`)
+- See [Using Local Models](https://github.com/amaiya/patchpal?tab=readme-ov-file#using-local-models-vllm--ollama) for setup
+
+**6. Start Fresh When Appropriate**
+- Use `/clear` command to reset conversation history without restarting PatchPal
+- Exit and restart PatchPal between unrelated tasks to clear context completely
+- Each fresh start begins with minimal tokens (just the system prompt)
+- Better than carrying large conversation history across different tasks
+
+**7. Use Smaller Models for Simple Tasks**
+- Use less expensive models for routine tasks:
+```bash
+patchpal --model anthropic/claude-3-7-sonnet-latest # Cheaper than claude-sonnet-4-5
+patchpal --model openai/gpt-4o-mini # Cheaper than gpt-4o
+```
+- Reserve premium models for complex reasoning tasks
+
+**Cost Monitoring Tips:**
+- Check `/status` before large operations to see current token usage
+- **Anthropic models**: Prompt caching reduces costs (system prompt + last 2 messages cached)
+- Most cloud providers offer usage dashboards showing cache hits and actual charges
+- Set up billing alerts with your provider to avoid surprises
+- Consider local models (vLLM recommended) for high-volume usage or zero API costs
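The README hunk above describes auto-compaction as a threshold on context usage, 75% by default and adjustable through `PATCHPAL_COMPACT_THRESHOLD`. A minimal sketch of that check follows. The helper names `read_compact_threshold` and `should_compact` are hypothetical, and falling back to the default on an unparsable or out-of-range value is an assumption of this sketch. PatchPal's own parsing is not shown in this diff.

```python
import os
from typing import Mapping, Optional

# The README states that context is managed at 75% capacity by default.
DEFAULT_COMPACT_THRESHOLD = 0.75


def read_compact_threshold(env: Optional[Mapping[str, str]] = None) -> float:
    """Read the threshold from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    raw = env.get("PATCHPAL_COMPACT_THRESHOLD")
    if raw is None:
        return DEFAULT_COMPACT_THRESHOLD
    try:
        value = float(raw)
    except ValueError:
        # Assumption: an unparsable value falls back to the default.
        return DEFAULT_COMPACT_THRESHOLD
    if not 0.0 < value <= 1.0:
        # Assumption: only fractions in (0, 1] are meaningful.
        return DEFAULT_COMPACT_THRESHOLD
    return value


def should_compact(used_tokens: int, context_window: int, threshold: float) -> bool:
    """Return True once usage reaches the given fraction of the window."""
    return used_tokens >= threshold * context_window


# A 200k-token window with 130k tokens in use:
at_default = should_compact(130_000, 200_000, read_compact_threshold({}))
at_sixty = should_compact(
    130_000, 200_000, read_compact_threshold({"PATCHPAL_COMPACT_THRESHOLD": "0.60"})
)
```

In this example the same context triggers compaction at 0.60 but not at 0.75, which is the trade-off item 4 of the README describes: a lower threshold compacts earlier and keeps each request smaller.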
@@ -725,9 +725,7 @@ def _apply_prompt_caching(messages: List[Dict[str, Any]], model_id: str) -> List
 
     Caches:
     - System messages (first 1-2 messages with role="system")
-    - Last 2
-
-    This provides 90% cost reduction on cached content after the first request.
+    - Last 2 non-system messages (recent context, any role except system)
 
     Args:
         messages: List of message dictionaries
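The updated docstring in the hunk above names two groups of cached messages: system messages among the first one or two entries, and the last two non-system messages. The sketch below only illustrates that selection rule as stated. `cacheable_indices` is a hypothetical helper, not the body of `_apply_prompt_caching`, which this diff does not show, and it deliberately stops short of attaching any provider-specific cache marker.

```python
from typing import Any, Dict, List


def cacheable_indices(messages: List[Dict[str, Any]]) -> List[int]:
    """Return the indices of messages the docstring describes as cached."""
    # System messages, looked for only among the first two entries.
    system = [i for i, m in enumerate(messages[:2]) if m.get("role") == "system"]
    # Every non-system message, of which only the last two are kept.
    non_system = [i for i, m in enumerate(messages) if m.get("role") != "system"]
    return sorted(set(system + non_system[-2:]))


conversation = [
    {"role": "system", "content": "You are a coding assistant."},
    {"role": "user", "content": "List the files."},
    {"role": "assistant", "content": "Calling the list tool."},
    {"role": "tool", "content": "a.py b.py"},
    {"role": "user", "content": "Open a.py."},
]
selected = cacheable_indices(conversation)
```

For this five-message conversation the selection is the system prompt plus the tool result and the final user message, so a `tool` message counts toward the "last 2" because the rule excludes only the `system` role.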
@@ -818,6 +816,11 @@ class PatchPalAgent:
         # Track last compaction to prevent compaction loops
         self._last_compaction_message_count = 0
 
+        # Track cumulative token usage across all LLM calls
+        self.total_llm_calls = 0
+        self.cumulative_input_tokens = 0
+        self.cumulative_output_tokens = 0
+
         # LiteLLM settings for models that need parameter dropping
         self.litellm_kwargs = {}
         if self.model_id.startswith("bedrock/"):
@@ -896,12 +899,22 @@ class PatchPalAgent:
             messages = [{"role": "system", "content": SYSTEM_PROMPT}] + msgs
             # Apply prompt caching for supported models
             messages = _apply_prompt_caching(messages, self.model_id)
-
+            response = litellm.completion(
                 model=self.model_id,
                 messages=messages,
                 **self.litellm_kwargs,
             )
 
+            # Track token usage from compaction call
+            self.total_llm_calls += 1
+            if hasattr(response, "usage") and response.usage:
+                if hasattr(response.usage, "prompt_tokens"):
+                    self.cumulative_input_tokens += response.usage.prompt_tokens
+                if hasattr(response.usage, "completion_tokens"):
+                    self.cumulative_output_tokens += response.usage.completion_tokens
+
+            return response
+
         summary_msg, summary_text = self.context_manager.create_compaction(
             self.messages,
             compaction_completion,
@@ -995,6 +1008,15 @@ class PatchPalAgent:
                     tool_choice="auto",
                     **self.litellm_kwargs,
                 )
+
+                # Track token usage from this LLM call
+                self.total_llm_calls += 1
+                if hasattr(response, "usage") and response.usage:
+                    if hasattr(response.usage, "prompt_tokens"):
+                        self.cumulative_input_tokens += response.usage.prompt_tokens
+                    if hasattr(response.usage, "completion_tokens"):
+                        self.cumulative_output_tokens += response.usage.completion_tokens
+
             except Exception as e:
                 return f"Error calling model: {e}"
 
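The two `agent.py` hunks above add the same usage-tracking block after the compaction call and after the main completion call. One way to express that logic once is sketched below. `UsageTracker` and `record` are hypothetical names, not part of PatchPal, and the `SimpleNamespace` objects merely stand in for a response carrying a `usage` attribute with `prompt_tokens` and `completion_tokens`, as accessed in the diff. Unlike the `hasattr` checks in the diff, this sketch also treats a `None` token count as zero, which is an added assumption.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any


@dataclass
class UsageTracker:
    """Accumulates call and token counts across LLM responses."""

    total_llm_calls: int = 0
    cumulative_input_tokens: int = 0
    cumulative_output_tokens: int = 0

    def record(self, response: Any) -> None:
        # Every call is counted, even when the model reports no usage.
        self.total_llm_calls += 1
        usage = getattr(response, "usage", None)
        if not usage:
            return
        # Missing or None counts contribute zero.
        self.cumulative_input_tokens += getattr(usage, "prompt_tokens", 0) or 0
        self.cumulative_output_tokens += getattr(usage, "completion_tokens", 0) or 0


tracker = UsageTracker()
tracker.record(SimpleNamespace(usage=SimpleNamespace(prompt_tokens=120, completion_tokens=30)))
tracker.record(SimpleNamespace(usage=None))  # provider returned no usage block
tracker.record(object())                     # response without a usage attribute
```

After these three calls the tracker reports three LLM calls but only the tokens from the first response, matching the behaviour of the diff, where the call counter is incremented unconditionally.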
@@ -362,6 +362,24 @@ Supported models: Any LiteLLM-supported model
                         "\n Auto-compaction: \033[33mDisabled\033[0m (set PATCHPAL_DISABLE_AUTOCOMPACT=false to enable)"
                     )
 
+                # Show cumulative token usage
+                print("\n\033[1;36mSession Statistics\033[0m")
+                print(f" LLM calls: {agent.total_llm_calls}")
+
+                # Check if usage info is available (if we have LLM calls but no token counts)
+                has_usage_info = (
+                    agent.cumulative_input_tokens > 0 or agent.cumulative_output_tokens > 0
+                )
+                if agent.total_llm_calls > 0 and not has_usage_info:
+                    print(
+                        " \033[2mToken usage unavailable (model doesn't report usage info)\033[0m"
+                    )
+                else:
+                    print(f" Cumulative input tokens: {agent.cumulative_input_tokens:,}")
+                    print(f" Cumulative output tokens: {agent.cumulative_output_tokens:,}")
+                    total_tokens = agent.cumulative_input_tokens + agent.cumulative_output_tokens
+                    print(f" Total tokens: {total_tokens:,}")
+
                 print("=" * 70 + "\n")
                 continue
 
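The `cli.py` hunk above prints either the cumulative counts or an "unavailable" notice, depending on whether any tokens were recorded. The branching can be sketched as a pure function that returns the lines instead of printing them, which makes it easy to check. `session_statistics_lines` is a hypothetical name, the ANSI colour codes from the diff are left out, and the two-space indentation of the output lines is an assumption.

```python
from typing import List


def session_statistics_lines(
    total_llm_calls: int, input_tokens: int, output_tokens: int
) -> List[str]:
    """Build the Session Statistics lines without ANSI colour codes."""
    lines = ["Session Statistics", f"  LLM calls: {total_llm_calls}"]
    has_usage_info = input_tokens > 0 or output_tokens > 0
    if total_llm_calls > 0 and not has_usage_info:
        # Calls were made but the model never reported token counts.
        lines.append("  Token usage unavailable (model doesn't report usage info)")
    else:
        # The ',' format spec inserts thousands separators.
        lines.append(f"  Cumulative input tokens: {input_tokens:,}")
        lines.append(f"  Cumulative output tokens: {output_tokens:,}")
        lines.append(f"  Total tokens: {input_tokens + output_tokens:,}")
    return lines


report = session_statistics_lines(3, 12345, 678)
no_usage = session_statistics_lines(2, 0, 0)
fresh_session = session_statistics_lines(0, 0, 0)
```

A fresh session with no calls takes the `else` branch and shows zeros, while a session with calls but no reported usage shows the notice instead. That mirrors the condition in the diff.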
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: patchpal
-Version: 0.
+Version: 0.4.0
 Summary: A lean Claude Code clone in pure Python
 Author: PatchPal Contributors
 License-Expression: Apache-2.0
@@ -905,6 +905,11 @@ You: /status
 # - Token usage breakdown
 # - Visual progress bar
 # - Auto-compaction status
+# - Session statistics:
+# - Total LLM calls made
+# - Cumulative input tokens (all requests combined)
+# - Cumulative output tokens (all responses combined)
+# - Total tokens (helps estimate API costs)
 
 # Manually trigger compaction
 You: /compact
@@ -916,6 +921,23 @@ You: /compact
 # Note: Requires at least 5 messages; most effective when context >50% full
 ```
 
+**Understanding Session Statistics:**
+
+The `/status` command shows cumulative token usage:
+
+- **Cumulative input tokens**: Total tokens sent to the LLM across all calls
+- Each LLM call resends the entire conversation history
+- **Note on Anthropic models**: PatchPal uses prompt caching
+- System prompt and last 2 messages are cached
+- Cached tokens cost much less than regular input tokens
+- The displayed token counts show raw totals, not cache-adjusted costs
+
+- **Cumulative output tokens**: Total tokens generated by the LLM
+- Usually much smaller than input (just the generated responses)
+- Typically costs more per token than input
+
+**Important**: The token counts shown are raw totals and don't reflect prompt caching discounts. For accurate cost information, check your provider's usage dashboard which shows cache hits and actual billing.
+
 **Configuration:**
 
 See the [Configuration](https://github.com/amaiya/patchpal?tab=readme-ov-file#configuration) section for context management settings including:
@@ -1004,3 +1026,72 @@ The system ensures you can work for extended periods without hitting context lim
 - Context is automatically managed at 75% capacity through pruning and compaction.
 - **Note:** Token estimation may be slightly inaccurate compared to the model's actual counting. If you see this error despite auto-compaction being enabled, the 75% threshold may need to be lowered further for your workload. You can adjust it with `export PATCHPAL_COMPACT_THRESHOLD=0.70` (or lower).
 - See [Configuration](https://github.com/amaiya/patchpal?tab=readme-ov-file#configuration) for context management settings.
+
+**Reducing API Costs via Token Optimization**
+
+When using cloud LLM providers (Anthropic, OpenAI, etc.), token usage directly impacts costs. PatchPal includes several features to help minimize token consumption:
+
+**1. Use Pruning to Manage Long Sessions**
+- **Automatic pruning** removes old tool outputs while preserving conversation context
+- Configure pruning thresholds to be more aggressive:
+```bash
+export PATCHPAL_PRUNE_PROTECT=20000 # Reduce from 40k to 20k tokens
+export PATCHPAL_PRUNE_MINIMUM=10000 # Reduce minimum saved from 20k to 10k
+```
+- Pruning happens transparently before compaction and is much faster (no LLM call needed)
+
+**2. Monitor Session Token Usage**
+- Use `/status` to see cumulative token usage in real-time
+- **Session Statistics** section shows:
+- Total LLM calls made
+- Cumulative input tokens (raw totals, before caching discounts)
+- Cumulative output tokens
+- Total tokens for the session
+- Check periodically during long sessions to monitor usage
+- **Important**: Token counts don't reflect prompt caching discounts (Anthropic models)
+- For actual costs, check your provider's usage dashboard which shows cache-adjusted billing
+
+**3. Manual Compaction for Cost Control**
+- Use `/status` regularly to monitor context window usage
+- Run `/compact` proactively when context grows large (before hitting auto-compact threshold)
+- Manual compaction gives you control over when the summarization LLM call happens
+
+**4. Adjust Auto-Compaction Threshold**
+- Lower threshold = more frequent compaction = smaller context = lower per-request costs
+- Higher threshold = fewer compaction calls = larger context = higher per-request costs
+```bash
+# More aggressive compaction (compact at 60% instead of 75%)
+export PATCHPAL_COMPACT_THRESHOLD=0.60
+```
+- Find the sweet spot for your workload (balance between compaction frequency and context size)
+
+**5. Use Local Models for Zero API Costs**
+- **Best option:** Run vLLM locally to eliminate API costs entirely
+```bash
+export HOSTED_VLLM_API_BASE=http://localhost:8000
+export HOSTED_VLLM_API_KEY=token-abc123
+patchpal --model hosted_vllm/openai/gpt-oss-20b
+```
+- **Alternative:** Use Ollama (requires `OLLAMA_CONTEXT_LENGTH=32768`)
+- See [Using Local Models](https://github.com/amaiya/patchpal?tab=readme-ov-file#using-local-models-vllm--ollama) for setup
+
+**6. Start Fresh When Appropriate**
+- Use `/clear` command to reset conversation history without restarting PatchPal
+- Exit and restart PatchPal between unrelated tasks to clear context completely
+- Each fresh start begins with minimal tokens (just the system prompt)
+- Better than carrying large conversation history across different tasks
+
+**7. Use Smaller Models for Simple Tasks**
+- Use less expensive models for routine tasks:
+```bash
+patchpal --model anthropic/claude-3-7-sonnet-latest # Cheaper than claude-sonnet-4-5
+patchpal --model openai/gpt-4o-mini # Cheaper than gpt-4o
+```
+- Reserve premium models for complex reasoning tasks
+
+**Cost Monitoring Tips:**
+- Check `/status` before large operations to see current token usage
+- **Anthropic models**: Prompt caching reduces costs (system prompt + last 2 messages cached)
+- Most cloud providers offer usage dashboards showing cache hits and actual charges
+- Set up billing alerts with your provider to avoid surprises
+- Consider local models (vLLM recommended) for high-volume usage or zero API costs
File without changes