squeezr-ai 1.14.6 → 1.14.7

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (3)
  1. package/README.md +118 -572
  2. package/bin/squeezr.js +17 -8
  3. package/package.json +1 -1
package/README.md CHANGED
@@ -1,642 +1,188 @@
  # Squeezr
 
- [![npm version](https://badge.fury.io/js/squeezr-ai.svg)](https://www.npmjs.com/package/squeezr-ai)
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
- [![Node.js 18+](https://img.shields.io/badge/node-18%2B-brightgreen)](https://nodejs.org)
- [![Tests](https://img.shields.io/badge/tests-190%20passing-brightgreen)](https://github.com/sergioramosv/Squeezr)
+ **Token compression proxy for AI coding CLIs.** Sits between your CLI and the API, compresses context on the fly, saves thousands of tokens per session.
 
- **Squeezr is a local proxy that sits between your AI coding CLI and its API. It automatically compresses your context window on every request — saving thousands of tokens per session with zero changes to your workflow.**
+ [![npm](https://img.shields.io/npm/v/squeezr-ai)](https://www.npmjs.com/package/squeezr-ai) [![license](https://img.shields.io/npm/l/squeezr-ai)](LICENSE) [![tests](https://img.shields.io/badge/tests-190%20passing-brightgreen)]()
 
- Works with Claude Code, Codex, Aider, OpenCode, Gemini CLI, and any Ollama-powered local LLM.
+ ## Supported CLIs
 
- ---
+ | CLI | Protocol | Proxy method |
+ |-----|----------|-------------|
+ | Claude Code | HTTP to Anthropic API | `ANTHROPIC_BASE_URL=http://localhost:8080` |
+ | Aider | HTTP to Anthropic/OpenAI API | `ANTHROPIC_BASE_URL` / `openai_base_url` |
+ | OpenCode | HTTP to Anthropic/OpenAI API | `ANTHROPIC_BASE_URL` / `openai_base_url` |
+ | Gemini CLI | HTTP to Gemini API | `GEMINI_API_BASE_URL=http://localhost:8080` |
+ | Ollama | HTTP (local) | Transparent via dummy API key detection |
+ | **Codex** | **WebSocket to chatgpt.com** | **TLS-terminating MITM proxy on :8081** |
 
- ## The problem
+ ## Quick start
 
- Every time you send a message in an AI coding CLI, the entire conversation history is re-sent to the API. That includes every file you read, every `git diff`, every test output, every bash command — even from 30 messages ago when it's no longer relevant. The system prompt alone can weigh 13KB and gets sent on every single request.
+ ```bash
+ npm install -g squeezr-ai
+ squeezr setup # configures env vars, auto-start, and CA trust
+ squeezr start
+ ```
 
- The result: context fills up fast, costs spike, and sessions hit the limit sooner than they should.
+ `squeezr setup` handles everything automatically:
+ - Sets `ANTHROPIC_BASE_URL`, `GEMINI_API_BASE_URL`, `HTTPS_PROXY`, `NODE_EXTRA_CA_CERTS`, `NO_PROXY`
+ - Registers auto-start (launchd on macOS, systemd on Linux, Task Scheduler/NSSM on Windows)
+ - **Windows:** imports the MITM CA into the Windows Certificate Store (user-level, no admin required) so Rust-based CLIs like Codex trust the proxy's TLS certificates
+ - **macOS/Linux:** generates a CA bundle at `~/.squeezr/mitm-ca/bundle.crt` for `SSL_CERT_FILE`
 
- ---
+ ## How it works
 
- ## How Squeezr fixes it
+ Every request from your AI CLI passes through Squeezr on `localhost:8080`. The proxy applies three compression layers before forwarding to the upstream API:
 
- Squeezr intercepts every API request before it reaches the provider and runs multiple compression layers:
+ ### Layer 1: System prompt compression
 
- ```
- Your CLI (Claude Code / Codex / Aider / Gemini CLI / Ollama)
- |
- v
- localhost:8080 (Squeezr proxy)
- |
- |-- [1] System prompt compression
- | Compressed once on first request, cached forever.
- | ~13KB Claude Code system prompt → ~600 tokens. Never resent in full again.
- |
- |-- [2] Deterministic preprocessing — noise removal
- | Runs on every tool result before anything else:
- | strip ANSI codes, strip progress bars, strip timestamps,
- | deduplicate repeated stack traces, deduplicate repeated lines,
- | minify inline JSON, collapse whitespace.
- |
- |-- [3] Deterministic preprocessing — tool-specific patterns (~30 patterns)
- | Applied automatically to every matching output:
- | git: diff (1-line context + Changed: fn summary on large diffs)
- | log (capped, adaptive), status, branch (capped at 20)
- | cargo: test (failures only), build/check/clippy (errors only)
- | JS/TS: vitest/jest (failures + summary only)
- | playwright (✘ blocks only)
- | tsc (errors grouped by file)
- | eslint/biome (grouped, no rule URLs)
- | prettier --check (only files needing format)
- | pnpm/npm install (summary line only)
- | pnpm/npm list (direct deps only)
- | pnpm/npm outdated (capped at 30)
- | next build (route table + errors)
- | npx noise stripped
- | Python: pytest (FAILED lines + tracebacks only)
- | Go: go test (--- FAIL blocks only)
- | Terraform: resource change summary + Plan line
- | Docker: ps (compact), images (no dangling), logs (last 50 lines)
- | kubectl: get (compact alignment)
- | Prisma: strip ASCII box-drawing art
- | gh CLI: pr view, pr checks, run list, issue list (all capped)
- | Network: curl (strip verbose headers), wget (strip progress)
- | Exclusive patterns:
- | Read tool → lockfiles replaced with summary count
- | large code files (.ts/.js/.py/.go/.rs > 500 lines)
- | → imports + top-level signatures only, bodies omitted
- | files > 200 lines → head + tail with omission note
- | Grep tool → matches grouped by file, capped per file and total
- | Glob tool → > 30 files collapsed to directory summary
- | Any output → auto-extracts error lines when > 50% of content is noise
- | Stack traces → repeated crash frames collapsed across log output
- |
- |-- [4] Cross-turn Read deduplication
- | When the model reads the same file multiple times in a session,
- | earlier occurrences are replaced with a reference token.
- | Most recent copy always kept at full fidelity.
- |
- |-- [5] Adaptive AI compression
- | Old bash output, file reads, grep results compressed by a cheap model.
- | Threshold adjusts automatically based on context pressure:
- | < 50% full → compress blocks > 1,500 chars
- | 50-75% full → compress blocks > 800 chars
- | 75-90% full → compress blocks > 400 chars
- | > 90% full → compress everything > 150 chars
- | At > 90% pressure, deterministic patterns also tighten:
- | git diff → 0 context lines per hunk (vs 1)
- | git log → cap 10 commits (vs 30)
- | grep → 4 matches/file (vs 8)
- |
- |-- [6] Session cache + KV cache warming
- | Session cache: blocks identical to a previous request skip the pipeline.
- | KV warming: unchanged blocks keep deterministic IDs so Anthropic's
- | prefix cache stays warm — 90% discount on already-seen tokens.
- |
- |-- [7] expand() — lossless retrieval
- | Every compressed block is stored by ID. If the model needs the full
- | original, it calls squeezr_expand(id). Squeezr intercepts the tool call,
- | injects the original, and makes a continuation request — transparently.
- |
- v
- Your provider's API (Anthropic / OpenAI / Google / Ollama)
- ```
+ The system prompt (~13KB for Claude Code) is compressed once using an AI model and cached. Subsequent requests reuse the cached version. Saves ~3,000 tokens per request.
 
- **MCP tool results are compressed automatically.** Any tool result that passes through the proxy — including results from MCP servers (Linear, GitHub, Slack, planning tools, custom MCPs) — goes through the same compression pipeline. No configuration needed; Squeezr treats MCP tool results identically to built-in tools. In practice MCP responses are often large JSON payloads that compress 70-94%.
+ ### Layer 2: Deterministic preprocessing
 
- Recent content is always preserved untouched — by default the last 3 tool results are never compressed. Your CLI always has full context for what it's currently working on.
+ Zero-latency, rule-based transformations applied to every tool result:
 
- ### Does compression make the AI "dumber"?
+ - **Noise removal:** ANSI escape codes, progress bars, timestamps, spinner output
+ - **Deduplication:** repeated stack frames, duplicate lines, redundant git hunks
+ - **Minification:** JSON whitespace, collapsed blank lines
 
- No, it's the opposite. Without Squeezr, long sessions hit the context window limit and the CLI **silently drops old messages entirely**. You lose them with no way to get them back.
+ ### Layer 3: Tool-specific patterns (~30 rules)
 
- With Squeezr, old messages are **summarized, not deleted**. A 3,000-token git diff from message #15 becomes a ~150-token summary like:
+ Each tool result is matched against specialized compression rules:
 
- ```
- [squeezr:a3f2c1] git diff: modified src/auth.ts — validateToken:
- added expiry logging + refreshToken call. 3 files changed.
- ```
+ | Category | Tools | What it does |
+ |----------|-------|-------------|
+ | Git | diff, log, status, branch | 1-line diff context, capped log, compact status |
+ | JS/TS | vitest, jest, playwright, tsc, eslint, biome, prettier | Failures/errors only, grouped by file |
+ | Package managers | pnpm, npm | Install summary, list capped at 30, outdated only |
+ | Build | next build, cargo build | Errors only |
+ | Test | cargo test, pytest, go test | FAIL blocks + tracebacks only |
+ | Infra | terraform, docker, kubectl | Resource changes, compact tables, last 50 log lines |
+ | Other | prisma, gh CLI, curl/wget | Strip ASCII art, cap output, remove verbose headers |
 
- The AI knows *what* you did, not every exact line. And if it needs the full original, it calls `squeezr_expand(a3f2c1)` and gets it back — losslessly.
-
- | Scenario | Message #15 at turn #100 |
- |---|---|
- | **No compression** | Probably dropped by the CLI (doesn't fit) |
- | **With Squeezr** | Summarized but present, expandable on demand |
-
- The trade-off: less detail, but more memory. Without Squeezr the AI forgets entire messages. With Squeezr it has a one-line note about every decision made — and can retrieve the full context when needed.
-
- ---
-
- ## Deterministic compression engine
-
- Before any AI model is involved, Squeezr runs a full deterministic compression pipeline on every tool result. This is a zero-cost, zero-latency layer that handles the most common developer outputs with specialized parsers:
-
- | Tool output | What Squeezr does | Typical savings |
- |---|---|---|
- | **git status** | Parses staged/modified/untracked, drops noise lines | 70-85% |
- | **git diff** | Extracts changed function names, strips context lines (adaptive), summarizes large diffs | 65-92% |
- | **git log** | Compacts to `hash msg (author, date)`, caps entries by pressure | 70-90% |
- | **cargo test/build/clippy** | Extracts only failures, errors, warnings | 80-95% |
- | **vitest/jest/playwright** | Extracts failed tests and assertion errors | 80-95% |
- | **tsc** | Groups errors by file, keeps only error lines | 75-90% |
- | **eslint/biome** | Compacts to file + rule + message | 70-85% |
- | **prettier** | Keeps only files-changed summary | 80-90% |
- | **next build** | Extracts errors and route summary | 75-85% |
- | **pytest** | Extracts FAILED lines and short summaries | 80-95% |
- | **npm/pnpm install** | Strips progress bars, keeps final summary | 85-90% |
- | **npm outdated** | Compact table format | 60-75% |
- | **docker ps/images/logs** | Compact output, strips timestamps | 70-80% |
- | **kubectl get/describe/logs** | Strips timestamps, compacts tables | 70-80% |
- | **gh pr/run/issue** | Strips decorations, keeps data | 65-75% |
- | **curl/wget** | Strips progress, keeps response body/headers | 60-80% |
- | **terraform plan/apply** | Extracts changes summary | 70-85% |
- | **prisma** | Compacts migration and schema output | 65-80% |
- | **Grep results** | Groups by file, caps matches per file (adaptive) | 60-80% |
- | **Read (large files)** | >500 lines: imports + signatures only. >200 lines: head + tail | 70-95% |
- | **Glob** | Compacts file listings into directory summaries | 50-70% |
-
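The Grep row above (group matches by file, cap matches per file) can be illustrated in a few lines. This is a minimal sketch, not Squeezr's implementation: it assumes ripgrep-style `path:line:text` input, and the function name `groupGrepMatches`, the default per-file cap of 8 (borrowed from the low-pressure column of the old adaptive-pressure table) and the total cap of 40 are all hypothetical.

```javascript
// Hypothetical sketch: group `path:line:text` grep output by file and cap the matches kept.
// The caps are illustrative assumptions (8 per file mirrors the README's low-pressure grep cap).
function groupGrepMatches(output, perFile = 8, total = 40) {
  const byFile = new Map();
  for (const line of output.split("\n")) {
    const m = /^([^:]+):(\d+):(.*)$/.exec(line);
    if (!m) continue; // ignore lines that are not path:line:text
    const [, file, lineNo, text] = m;
    if (!byFile.has(file)) byFile.set(file, []);
    byFile.get(file).push(`  ${lineNo}: ${text.trim()}`);
  }

  const out = [];
  let kept = 0;
  let dropped = 0;
  for (const [file, matches] of byFile) {
    const room = Math.max(0, Math.min(perFile, total - kept)); // respect both caps
    const shown = matches.slice(0, room);
    kept += shown.length;
    dropped += matches.length - shown.length;
    if (shown.length > 0) out.push(`${file} (${matches.length})`, ...shown);
  }
  if (dropped > 0) out.push(`[${dropped} more matches omitted]`);
  return out.join("\n");
}

const sample = [
  "src/auth.ts:12:export function validateToken(token) {",
  "src/auth.ts:40:  validateToken(refreshed);",
  "src/auth.ts:77:// validateToken is called on every request",
  "src/api.ts:5:import { validateToken } from './auth';",
].join("\n");
console.log(groupGrepMatches(sample, 2));
```

With a per-file cap of 2, the third `src/auth.ts` match is dropped and reported in a single trailing note. Paths that themselves contain a colon (Windows drive letters, for instance) would need a stricter parser than this regex.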
- Additionally, on **all** outputs regardless of tool:
- - ANSI escape codes stripped
- - Progress bars and spinners removed
- - Repeated lines collapsed (`"... repeated N more times"`)
- - Duplicate stack traces deduplicated (Node.js and Python)
- - Inline JSON minified (objects >200 chars)
- - Timestamps stripped (ISO 8601, bracketed, bare time formats)
- - Excessive whitespace collapsed
-
- This engine runs in pure Node.js — microseconds per result, no API calls, no cost. It handles the bulk of the compression work. The AI layer (Haiku/GPT-4o-mini) only kicks in afterward on older messages where further summarization is needed.
-
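Several of the always-on cleanups listed above are plain line transforms. The sketch below covers three of them (ANSI stripping, collapsing runs of identical lines using the README's `"... repeated N more times"` wording, and squeezing blank-line runs), assuming the tool output arrives as one string. `cleanToolOutput` and the regex are illustrative, not Squeezr's code, and only standard CSI escape sequences are handled.

```javascript
// Hypothetical sketch of three "all outputs" cleanups: strip ANSI codes, collapse
// runs of identical lines, and squeeze runs of blank lines into one.
const ANSI_RE = /\x1b\[[0-9;?]*[A-Za-z]/g; // CSI sequences (colors, cursor movement)

function cleanToolOutput(text) {
  const lines = text
    .replace(ANSI_RE, "")
    .split("\n")
    .map((line) => line.trimEnd()); // trailing whitespace is noise
  const out = [];
  for (let i = 0; i < lines.length; ) {
    let j = i + 1;
    while (j < lines.length && lines[j] === lines[i]) j++; // end of the run of identical lines
    out.push(lines[i]); // keep one copy (a blank-line run becomes one blank line)
    if (lines[i] !== "" && j - i > 1) out.push(`... repeated ${j - i - 1} more times`);
    i = j;
  }
  return out.join("\n");
}

const raw =
  "\x1b[32mPASS\x1b[0m src/auth.test.ts   \nwarn: deprecated API\nwarn: deprecated API\nwarn: deprecated API\n\n\n\nDone in 1.2s";
console.log(cleanToolOutput(raw));
```

Because every step is a deterministic string rewrite, the same input always yields the same bytes, which is what lets this layer run on every request at negligible cost.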
- ---
-
- ## Supported CLIs and providers
-
- Squeezr auto-detects which provider each request targets from the auth headers. No configuration needed beyond pointing your CLI at the proxy.
-
- | CLI | Set this env var | Compresses with | Extra keys needed |
- |---|---|---|---|
- | **Claude Code** | `ANTHROPIC_BASE_URL=http://localhost:8080` | Claude Haiku | None |
- | **Codex CLI** | `squeezr setup` (see below) | gpt-5.4-mini (via your Codex sub) | None |
- | **Aider** (OpenAI backend) | `openai_base_url=http://localhost:8080` | GPT-4o-mini | None |
- | **Aider** (Anthropic backend) | `ANTHROPIC_BASE_URL=http://localhost:8080` | Claude Haiku | None |
- | **OpenCode** | `openai_base_url=http://localhost:8080` | GPT-4o-mini | None |
- | **Gemini CLI** | `GEMINI_API_BASE_URL=http://localhost:8080` | Gemini Flash 8B | None |
- | **Ollama** (any CLI) | `openai_base_url=http://localhost:8080` | Local model (configurable) | None |
-
- Squeezr extracts the API key from the request itself and reuses it for compression. Zero extra setup.
-
- ---
+ ### Exclusive patterns
 
- ## Quick start
+ Applied to specific content types regardless of tool:
 
- ```bash
- npm install -g squeezr-ai
- squeezr start
- ```
-
- Then point your CLI at the proxy:
-
- ```bash
- # Claude Code
- export ANTHROPIC_BASE_URL=http://localhost:8080 # macOS / Linux
- $env:ANTHROPIC_BASE_URL="http://localhost:8080" # Windows PowerShell
+ - **Lockfiles** (package-lock.json, Cargo.lock, etc.) → dependency count summary
+ - **Large code files** (>500 lines) → imports + function/class signatures only
+ - **Long output** (>200 lines) → head + tail + omission note
+ - **Grep results** → grouped by file, matches capped
+ - **Glob results** (>30 files) → directory tree summary
+ - **Noisy output** (>50% non-essential) → auto-extracted errors/warnings
 
- # Codex (uses MITM proxy — see "Codex deep compression" below)
- export HTTPS_PROXY=http://localhost:8081
- export SSL_CERT_FILE=~/.squeezr/mitm-ca/bundle.crt
+ ### Adaptive pressure
 
- # Aider / OpenCode
- export openai_base_url=http://localhost:8080
+ Compression aggressiveness scales with context window usage:
 
- # Gemini CLI
- export GEMINI_API_BASE_URL=http://localhost:8080
+ | Context usage | Threshold | Behavior |
+ |--------------|-----------|----------|
+ | < 50% | 1,500 chars | Light — only compress large results |
+ | 50–75% | 800 chars | Normal — standard compression |
+ | 75–90% | 400 chars | Aggressive — compress most results |
+ | > 90% | 150 chars | Critical — compress everything, 0 git diff context |
 
- # Ollama
- export openai_base_url=http://localhost:8080
- ```
+ ### Session optimizations
 
- Or use the shell installer to set up the env var permanently and register Squeezr as a login service:
-
- ```bash
- # macOS / Linux
- bash install.sh
+ - **Session cache:** After ~50 tool results, older results are batch-summarized into a single compact block
+ - **KV cache warming:** Deterministic MD5-based IDs keep compressed content prefix-stable across requests
+ - **Cross-turn dedup:** If the same file is read multiple times, earlier reads are replaced with reference pointers
+ - **Expand on demand:** Compressed blocks include a `squeezr_expand(id)` callback to retrieve full content
 
- # Windows (PowerShell, run as admin for Task Scheduler)
- .\install.ps1
- ```
+ ## Codex support (MITM proxy)
 
- ---
+ Codex uses WebSocket over TLS to `chatgpt.com` with OAuth authentication — it cannot be proxied via `OPENAI_BASE_URL`. Squeezr runs a TLS-terminating MITM proxy on port 8081 that intercepts and compresses WebSocket frames. See [CODEX.md](CODEX.md) for the full technical breakdown.
 
  ## Configuration
 
- ### Global config: `squeezr.toml`
-
- Located in the Squeezr install directory. Environment variables override any TOML value.
+ ### Global config: `squeezr.toml` (next to the binary)
 
  ```toml
  [proxy]
  port = 8080
 
  [compression]
- threshold = 800 # min chars to compress a tool result
- keep_recent = 3 # recent tool results to leave untouched
- disabled = false
- compress_system_prompt = true # compress the CLI's system prompt (cached)
- compress_conversation = false # also compress old user/assistant messages (aggressive)
-
- # Explicit control over which tools are compressed:
- # skip_tools = ["Read"] # never compress these tools
- # only_tools = ["Bash"] # only compress these tools (overrides skip_tools)
+ threshold = 800 # min chars to trigger compression
+ keep_recent = 3 # last N results left uncompressed
+ compress_system_prompt = true
+ compress_conversation = false # aggressive: compress assistant messages too
+ # skip_tools = ["Read"] # never compress these tools
+ # only_tools = ["Bash"] # only compress these tools
 
  [cache]
  enabled = true
- max_entries = 1000 # LRU cap for cached compressions
+ max_entries = 1000
 
  [adaptive]
  enabled = true
- low_threshold = 1500 # used when context < 50% full
- mid_threshold = 800 # 50-75%
- high_threshold = 400 # 75-90%
- critical_threshold = 150 # > 90% — compress everything
+ low_threshold = 1500
+ mid_threshold = 800
+ high_threshold = 400
+ critical_threshold = 150
 
  [local]
  enabled = true
- upstream_url = "http://localhost:11434" # your Ollama URL
- # Model used to compress tool results — must be pulled in Ollama.
- # Good options:
- # qwen2.5-coder:1.5b (best for code, ~1GB RAM) ← default
- # qwen2.5:1.5b (good general, ~1GB RAM)
- # llama3.2:1b (good English, ~800MB RAM)
- # qwen2.5:3b (better quality, ~2GB RAM)
+ upstream_url = "http://localhost:11434" # Ollama
  compression_model = "qwen2.5-coder:1.5b"
- dummy_keys = ["ollama", "lm-studio", "sk-no-key-required", "local", "none", ""]
  ```
 
- ### Per-project config: `.squeezr.toml`
+ ### Project config: `.squeezr.toml` (in project root)
 
- Drop a `.squeezr.toml` in any project root. It deep-merges over the global config, so you only need to specify what differs:
+ Project-level config is deep-merged over global config. Useful for per-repo tuning.
 
- ```toml
- # .squeezr.toml — project-level overrides
- [compression]
- threshold = 400
- skip_tools = ["Read"] # don't compress file reads in this project
- ```
-
- Squeezr logs `[squeezr] Using project config: /path/to/.squeezr.toml` when a local config is detected.
-
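Because the project file deep-merges over the global one, the example above changes only `threshold` and `skip_tools`, and every other `[compression]` key keeps its global value. A minimal sketch of that merge on already-parsed config objects (TOML parsing is left out); `deepMerge` is a hypothetical helper, and replacing arrays rather than concatenating them is an assumption about how `skip_tools` is treated.

```javascript
// Hypothetical sketch: deep-merge a parsed project config over a parsed global config.
// Assumption: nested tables merge key by key; arrays and scalars from the project replace global ones.
function isPlainObject(v) {
  return v !== null && typeof v === "object" && !Array.isArray(v);
}

function deepMerge(base, override) {
  const result = { ...base }; // shallow copy so the global config is not mutated
  for (const [key, value] of Object.entries(override)) {
    result[key] =
      isPlainObject(value) && isPlainObject(base[key])
        ? deepMerge(base[key], value) // recurse into nested tables like [compression]
        : value; // scalars and arrays are replaced outright
  }
  return result;
}

const globalConfig = {
  proxy: { port: 8080 },
  compression: { threshold: 800, keep_recent: 3, compress_system_prompt: true },
};
const projectConfig = { compression: { threshold: 400, skip_tools: ["Read"] } };

const merged = deepMerge(globalConfig, projectConfig);
console.log(merged.compression);
```

Here `merged.compression.threshold` becomes 400 and `skip_tools` is added, while `keep_recent` stays 3 and `[proxy]` is untouched.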
- ### Environment variable reference
+ ### Environment variables
 
  | Variable | Default | Description |
- |---|---|---|
- | `SQUEEZR_PORT` | `8080` | Local port |
- | `SQUEEZR_THRESHOLD` | `800` | Base compression threshold (chars) |
- | `SQUEEZR_KEEP_RECENT` | `3` | Recent tool results to skip |
- | `SQUEEZR_DISABLED` | — | Set to `1` to disable (passthrough only) |
- | `SQUEEZR_DRY_RUN` | — | Set to `1` to preview savings without compressing |
- | `SQUEEZR_LOCAL_UPSTREAM` | `http://localhost:11434` | Ollama URL |
- | `SQUEEZR_LOCAL_MODEL` | `qwen2.5-coder:1.5b` | Ollama compression model |
+ |----------|---------|-------------|
+ | `SQUEEZR_PORT` | `8080` | Proxy port (MITM port = this + 1) |
+ | `SQUEEZR_THRESHOLD` | `800` | Min chars to compress |
+ | `SQUEEZR_KEEP_RECENT` | `3` | Recent results to skip |
+ | `SQUEEZR_DISABLED` | `false` | Disable all compression |
+ | `SQUEEZR_DRY_RUN` | `false` | Log savings without compressing |
+ | `SQUEEZR_LOCAL_UPSTREAM` | `http://localhost:11434` | Ollama/LM Studio URL |
+ | `SQUEEZR_LOCAL_MODEL` | `qwen2.5-coder:1.5b` | Local model for compression |
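The old README states that environment variables override any TOML value, and the new table notes that the MITM port is the proxy port plus one. A minimal sketch of both rules, assuming a flat mapping from each `SQUEEZR_*` variable to a config section and key, and treating `1` or `true` as truthy. `applyEnvOverrides`, `ENV_MAP`, and the `dry_run` / `mitmPort` key names are hypothetical.

```javascript
// Hypothetical sketch: SQUEEZR_* variables override values parsed from squeezr.toml.
// The variable -> [section, key, parser] mapping is an assumption based on the README tables.
const truthy = (v) => v === "1" || v === "true";
const ENV_MAP = {
  SQUEEZR_PORT: ["proxy", "port", Number],
  SQUEEZR_THRESHOLD: ["compression", "threshold", Number],
  SQUEEZR_KEEP_RECENT: ["compression", "keep_recent", Number],
  SQUEEZR_DISABLED: ["compression", "disabled", truthy],
  SQUEEZR_DRY_RUN: ["compression", "dry_run", truthy],
  SQUEEZR_LOCAL_UPSTREAM: ["local", "upstream_url", String],
  SQUEEZR_LOCAL_MODEL: ["local", "compression_model", String],
};

function applyEnvOverrides(config, env = process.env) {
  const result = structuredClone(config); // leave the parsed TOML untouched
  for (const [name, [section, key, parse]] of Object.entries(ENV_MAP)) {
    if (env[name] === undefined) continue; // unset variables keep the TOML value
    result[section] = { ...result[section], [key]: parse(env[name]) };
  }
  // Per the new README table, the MITM proxy listens on the proxy port + 1.
  const port = result.proxy?.port ?? 8080;
  result.proxy = { ...result.proxy, port, mitmPort: port + 1 };
  return result;
}

const cfg = applyEnvOverrides(
  { proxy: { port: 8080 }, compression: { threshold: 800 } },
  { SQUEEZR_PORT: "9090", SQUEEZR_DISABLED: "1" },
);
console.log(cfg.proxy, cfg.compression);
```

In this example the proxy moves to 9090, so the derived MITM port becomes 9091, while the TOML `threshold` survives because no override was set for it. `structuredClone` requires Node 17 or later, which fits the Node 18+ requirement.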
 
- ---
+ ### Per-command skip
 
- ## Explicit control: skip and only
+ Add `# squeezr:skip` anywhere in a Bash command to bypass compression for that result.
 
- You can control exactly which tool results Squeezr compresses, both globally and per-command.
+ ## Compression backends
 
- ### Config-level (global or per-project)
+ Squeezr uses cheap/free models for AI compression (the deterministic layer is pure regex, no API calls):
 
- ```toml
- [compression]
- # Never compress Read or Grep results:
- skip_tools = ["Read", "Grep"]
+ | Backend | Model | Used for | Cost |
+ |---------|-------|----------|------|
+ | Anthropic | Haiku | System prompt, session cache | ~$0.0001/call |
+ | OpenAI | GPT-4o-mini | Fallback compression | ~$0.0001/call |
+ | Gemini | Flash-8B | Fallback compression | Free |
+ | Local | qwen2.5-coder:1.5b | Compression when using Ollama | Free |
+ | ChatGPT (WS) | GPT-5.4-mini | Codex frame compression | $0 (same subscription) |
 
- # Only compress Bash results — ignore everything else:
- only_tools = ["Bash"] # overrides skip_tools when set
- ```
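The `skip_tools` / `only_tools` example above, together with the inline `# squeezr:skip` marker, amounts to a small decision rule: the inline marker always wins, `only_tools` (when set) overrides `skip_tools`, and otherwise anything listed in `skip_tools` is left alone. A minimal sketch of that rule; `shouldCompress` and its argument shape are hypothetical, treating an empty `only_tools` list as "not set" is an assumption, and the size threshold is deliberately left out.

```javascript
// Hypothetical sketch of the skip/only rules: inline marker first, then only_tools, then skip_tools.
function shouldCompress(toolName, command, config = {}) {
  // `# squeezr:skip` anywhere in a Bash command bypasses compression regardless of config.
  if (toolName === "Bash" && typeof command === "string" && command.includes("# squeezr:skip")) {
    return false;
  }
  const { only_tools: onlyTools, skip_tools: skipTools } = config;
  // When only_tools is set (assumed: non-empty), it overrides skip_tools entirely.
  if (Array.isArray(onlyTools) && onlyTools.length > 0) return onlyTools.includes(toolName);
  if (Array.isArray(skipTools)) return !skipTools.includes(toolName);
  return true; // no explicit rule: the size threshold decides (not modeled here)
}

console.log(shouldCompress("Bash", "git diff HEAD~3 # squeezr:skip"));
console.log(shouldCompress("Read", null, { skip_tools: ["Read", "Grep"] }));
```

Both calls print `false`: the first because of the inline marker, the second because `Read` is listed in `skip_tools`.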
+ ### Typical savings
 
- ### Inline per-command `# squeezr:skip`
+ - **Per tool result:** 70–95% reduction depending on tool
+ - **Per session (2 hours):** ~200K tokens → ~80K tokens (60% savings)
+ - **System prompt:** ~13KB → ~600 tokens (cached)
 
- Add `# squeezr:skip` anywhere in a Bash command to prevent that specific result from being compressed, regardless of config:
+ ## CLI commands
 
  ```bash
- # This result will never be compressed, even if it's 10,000 chars:
- git diff HEAD~3 # squeezr:skip
-
- # Normal commands are compressed as usual:
- cargo test
- ```
-
- ---
-
- ## Dry-run mode
-
- Preview what Squeezr would compress without modifying any requests:
-
- ```bash
- SQUEEZR_DRY_RUN=1 squeezr start
- ```
-
- Console output shows exactly what would be compressed:
-
+ squeezr setup # configure env vars, auto-start, CA trust
+ squeezr start # start the proxy (foreground)
+ squeezr stop # stop the proxy
+ squeezr status # check if proxy is running
+ squeezr logs # show last 50 log lines
+ squeezr config # print current config
+ squeezr gain # estimate token savings for a directory
+ squeezr discover # detect which AI CLIs are installed
+ squeezr version # print version
  ```
- [squeezr dry-run] Would compress 4 block(s) | potential -12,430 chars | pressure=67% threshold=800
- [squeezr dry-run/ollama] Would compress 2 block(s) | potential -5,210 chars | model=qwen2.5-coder:1.5b
- ```
-
- ---
-
- ## Ollama — local compression
-
- Pull the compression model once, then Squeezr handles the rest:
-
- ```bash
- ollama pull qwen2.5-coder:1.5b # or any model you prefer
- ```
-
- Any CLI that sends requests with a dummy auth key (`ollama`, `lm-studio`, empty string, etc.) is automatically detected as local and routed to your Ollama instance.
-
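Local routing hinges on recognizing placeholder keys. A minimal sketch, reusing the `dummy_keys` list from the old `[local]` config example. How Squeezr actually reads the key is not documented here, so the header handling (`x-api-key` first, then an `Authorization: Bearer` token, with lowercase header names as Node's `http` module exposes them) is an assumption, and `extractApiKey` / `isLocalRequest` are hypothetical names.

```javascript
// Hypothetical sketch: classify a request as local (Ollama) when it carries a placeholder key.
// DUMMY_KEYS copies the `dummy_keys` list from the old [local] config example.
const DUMMY_KEYS = new Set(["ollama", "lm-studio", "sk-no-key-required", "local", "none", ""]);

function extractApiKey(headers) {
  // Assumption: Anthropic-style `x-api-key` first, then OpenAI-style `Authorization: Bearer <key>`.
  if (typeof headers["x-api-key"] === "string") return headers["x-api-key"].trim();
  const auth = headers["authorization"] ?? "";
  return auth.replace(/^Bearer\s*/i, "").trim();
}

function isLocalRequest(headers) {
  return DUMMY_KEYS.has(extractApiKey(headers).toLowerCase());
}

console.log(isLocalRequest({ authorization: "Bearer ollama" }));
console.log(isLocalRequest({ "x-api-key": "sk-ant-example" }));
```

Note that a request with no key at all maps to the empty string, which the list above treats as local. Gemini-style keys sent in other headers or query parameters are not covered by this sketch.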
- To use a different model:
-
- ```toml
- [local]
- compression_model = "llama3.2:1b"
- ```
-
- ---
-
- ## Live stats
-
- Each compressed request logs to console:
-
- ```
- [squeezr] 2 block(s) compressed | -4,821 chars (~1,377 tokens) (87% saved)
- [squeezr] Context pressure: 68% → threshold=800 chars
- [squeezr/haiku] System prompt compressed: -71% (13,204 → 3,849 chars) [cached]
- [squeezr/ollama] 1 block(s) compressed | -3,102 chars (~886 tokens) (79% saved)
- [squeezr] Session cache: 3 block(s) reused (KV cache preserved)
- [squeezr] Cross-turn dedup: 2 Read result(s) collapsed
- ```
-
- ### `squeezr gain` — full stats dashboard
-
- ```bash
- squeezr gain
- ```
-
- ```
- ┌─────────────────────────────────────────┐
- │ Squeezr — Token Savings │
- ├─────────────────────────────────────────┤
- │ Requests 38 │
- │ Saved chars 142,830 │
- │ Saved tokens 40,808 │
- │ Savings 73.4% │
- ├─────────────────────────────────────────┤
- │ By Tool │
- │ Bash (41x): -81% │
- │ Read (28x): -74% │
- │ Grep (14x): -69% │
- └─────────────────────────────────────────┘
- ```
-
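The sample figures above are consistent with a fixed ratio of 3.5 characters per token, rounded down: 4,821 / 3.5 ≈ 1,377.4, 3,102 / 3.5 ≈ 886.3, and 142,830 / 3.5 ≈ 40,808.6. That ratio is inferred from the logged numbers, not documented, so the sketch below is only an approximation of how the token column might be derived; real tokenizers vary with content.

```javascript
// Sketch: estimate tokens from characters with a ratio inferred from the README's sample logs.
const CHARS_PER_TOKEN = 3.5; // assumption: inferred from logged numbers, not from Squeezr's docs

function estimateTokens(chars) {
  return Math.floor(chars / CHARS_PER_TOKEN);
}

for (const chars of [4821, 3102, 142830]) {
  console.log(`${chars} chars ~ ${estimateTokens(chars)} tokens`);
}
```

The loop prints 1377, 886 and 40808 tokens, matching the three figures shown in the log lines and the dashboard.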
- Stats persist to `~/.squeezr/stats.json` across restarts.
-
- ```bash
- squeezr gain --reset # clear all saved stats
- ```
-
- Full JSON at: `http://localhost:8080/squeezr/stats`
-
- ### `squeezr discover` — pattern coverage report
-
- After a session, run:
-
- ```bash
- squeezr discover
- ```
-
- Shows which deterministic patterns fired, how many outputs hit the AI fallback, and the Read/Grep/Glob breakdown. Useful for spotting coverage gaps or misconfigured skip lists.
-
- ---
-
- ## Codex deep compression
-
- Codex CLI talks to `chatgpt.com` over WebSocket, not the standard OpenAI API. This means a regular HTTP proxy can't inspect or modify the traffic. Squeezr solves this with a TLS-terminating MITM proxy on port 8081.
-
- ### How it works
-
- 1. `squeezr setup` generates a local CA and configures `HTTPS_PROXY` + `SSL_CERT_FILE` in your shell
- 2. When Codex connects to `chatgpt.com`, Squeezr intercepts the TLS tunnel and generates a per-host certificate signed by the local CA
- 3. Squeezr strips `permessage-deflate` from the WebSocket handshake so frames arrive as plain JSON
- 4. On every client-to-server WebSocket frame, Squeezr looks for `function_call_output` messages (tool results) exceeding the compression threshold
- 5. For each large tool result, Squeezr opens a **separate** WebSocket to `chatgpt.com/backend-api/codex/responses` using the same OAuth token, and asks `gpt-5.4-mini` to summarize it
- 6. The compressed output replaces the original in the frame before forwarding to the server
-
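Step 3 works because WebSocket compression is negotiated during the opening handshake (RFC 7692): the server may only enable `permessage-deflate` if the client offered it in the `Sec-WebSocket-Extensions` request header. A minimal sketch of that header rewrite over a plain object with lowercase header names (as Node's `http` module exposes incoming headers); `stripDeflateOffer` is a hypothetical name, the sketch assumes no quoted parameter values containing commas, and the README does not show how Squeezr itself rewrites the handshake.

```javascript
// Hypothetical sketch: remove the permessage-deflate offer from a WebSocket upgrade
// request's headers so the server cannot negotiate compression (RFC 7692).
// Simplification: assumes no quoted parameter values containing commas.
function stripDeflateOffer(headers) {
  const result = { ...headers };
  const offer = result["sec-websocket-extensions"];
  if (typeof offer !== "string") return result;
  const kept = offer
    .split(",") // each comma-separated entry is one extension offer
    .map((ext) => ext.trim())
    .filter((ext) => ext !== "" && ext.split(";")[0].trim().toLowerCase() !== "permessage-deflate");
  if (kept.length > 0) result["sec-websocket-extensions"] = kept.join(", ");
  else delete result["sec-websocket-extensions"]; // nothing left to offer
  return result;
}

const upgrade = {
  host: "chatgpt.com",
  upgrade: "websocket",
  "sec-websocket-extensions": "permessage-deflate; client_max_window_bits",
};
console.log(stripDeflateOffer(upgrade));
```

With the offer removed, the server's handshake response cannot include the extension, so every later frame carries uncompressed JSON that the proxy can parse and rewrite.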
- ### Setup
-
- ```bash
- squeezr setup # auto-configures everything (HTTPS_PROXY, SSL_CERT_FILE, CA)
- ```
-
- Or manually:
-
- ```bash
- export HTTPS_PROXY=http://localhost:8081
- export SSL_CERT_FILE=~/.squeezr/mitm-ca/bundle.crt
- ```
-
- ### What it costs
-
- Nothing extra. The compression calls use `gpt-5.4-mini` through the same ChatGPT WebSocket endpoint that your Codex subscription already covers. No API key required.
-
- ### Results
-
- In testing, Codex tool results (file reads, command output) are compressed by **80-90%** per turn. A typical file read of 5,000 chars compresses to ~700 chars, saving thousands of tokens across a session.
-
- For a detailed technical explanation, see [CODEX.md](CODEX.md).
-
- ---
-
- ## How session-level optimisations work
-
- ### Session cache + differential compression
-
- Every request re-sends the full conversation history. Without deduplication, a 50-tool-result session would run 50 Haiku calls on request #51 — even though 49 of them haven't changed.
-
- Squeezr tracks a hash of each compressed block in memory for the session lifetime. Blocks identical to the previous request skip the entire pipeline (preprocessing + AI call).
-
- ```
- Without session cache: request 51 → up to 50 Haiku calls
- With session cache: request 51 → 1 Haiku call (only the new block)
- ```
-
- In a 100-request session with 40 tool results: ~4,000 Haiku calls → ~200.
-
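The session cache described above boils down to memoizing compression results by content hash, so a block that reappears unchanged in the next request skips both preprocessing and the model call. A minimal sketch, assuming a SHA-256 key and an injected, synchronous `compressFn` standing in for the Haiku call (both hypothetical choices); the call counter reproduces the "request 51 → 1 call" line.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical sketch: memoize compression per session, keyed by a hash of each block.
// `compressFn` stands in for the AI compression call (an assumption; kept synchronous here).
class SessionCache {
  constructor(compressFn) {
    this.compressFn = compressFn;
    this.entries = new Map();
    this.calls = 0; // how many times the expensive compressor actually ran
  }

  compress(block) {
    const key = createHash("sha256").update(block).digest("hex");
    if (!this.entries.has(key)) {
      this.calls++;
      this.entries.set(key, this.compressFn(block));
    }
    return this.entries.get(key);
  }
}

// Simulate the 50 tool results built up by earlier requests, then request 51,
// which re-sends all 50 plus one new block.
const cache = new SessionCache((block) => block.slice(0, 40));
const history = Array.from({ length: 50 }, (_, i) => `tool result #${i}\n`.repeat(100));
history.forEach((block) => cache.compress(block));
const before = cache.calls;
[...history, "brand-new tool result\n".repeat(100)].forEach((block) => cache.compress(block));
console.log(`compressor calls on request 51: ${cache.calls - before}`);
```

An unbounded `Map` is fine for a sketch; the `max_entries = 1000` setting in the config suggests the real cache is capped.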
- ### KV cache warming
-
- Claude charges 90% less for tokens already in its prefix cache. The cache only activates when the message prefix is byte-for-byte identical between requests. Standard compression breaks this — each call might produce different bytes, invalidating the cache.
-
- Squeezr fixes this by assigning compressed blocks a deterministic MD5-based ID. Identical content always produces the same `[squeezr:id -ratio%]` string. Unchanged blocks produce identical bytes across requests, keeping the prefix stable.
-
- ```
- Without KV warming: request N+1 → new compressed bytes → cache miss on all subsequent tokens
- With KV warming: request N+1 → same IDs for unchanged blocks → cache hit on entire history
- → pay 10% of normal price for everything already seen
- ```
-
- These two optimisations compound: session cache reduces Haiku calls, KV warming reduces charges on the main model.
-
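KV warming depends on the marker being a pure function of the block's content. A minimal sketch, assuming a 6-hex-character prefix of the MD5 digest (matching the length of the `[squeezr:a3f2c1]` example) and a `[squeezr:<id> -<ratio>%] <summary>` layout modeled on the `[squeezr:id -ratio%]` string above; the ID length, layout, and the in-memory `originals` store behind the hypothetical `expand` function are all assumptions.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical sketch: content-derived block IDs plus an ID -> original store for expansion.
// Assumptions: 6 hex chars of MD5 (matching the `[squeezr:a3f2c1]` example) and this marker layout.
const originals = new Map();

function blockId(content) {
  return createHash("md5").update(content).digest("hex").slice(0, 6);
}

function markCompressed(original, summary) {
  const id = blockId(original); // same content -> same ID on every request
  originals.set(id, original); // keep the full text for lossless retrieval
  const ratio = Math.round((1 - summary.length / original.length) * 100);
  return `[squeezr:${id} -${ratio}%] ${summary}`;
}

function expand(id) {
  return originals.get(id); // undefined for unknown IDs
}

const diff = "diff --git a/src/auth.ts b/src/auth.ts\n".repeat(50);
const first = markCompressed(diff, "git diff: modified src/auth.ts");
const second = markCompressed(diff, "git diff: modified src/auth.ts");
console.log(first === second);
```

This prints `true`. Byte-identical markers also require the summary text to be reused rather than regenerated, which is what the session cache described above provides. A 6-character ID keeps markers short but raises the collision risk in very long sessions; a longer prefix is the obvious trade-off.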
- ### Cross-turn Read deduplication
-
- When the model reads the same file multiple times (common in long refactoring sessions), every earlier occurrence is replaced with a reference token:
-
- ```
- [same file content as a later read — squeezr_expand(id) to retrieve]
- ```
-
- The most recent copy is always kept at full fidelity. The model can call `squeezr_expand(id)` to retrieve any earlier version on demand.
-
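A minimal sketch of cross-turn Read deduplication over a simplified list of tool results. It assumes each result has the shape `{ tool, path, content }` (hypothetical, not the Anthropic wire format) and a caller-supplied `idFor` function for the reference ID; the placeholder text is modeled on the README's reference token rather than copied byte for byte. Walking from newest to oldest guarantees that the most recent read of each path stays intact.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical sketch: keep the newest Read of each file in full and replace earlier
// reads with a reference placeholder. Tool results use a simplified shape { tool, path, content }.
function dedupeReads(results, idFor) {
  const seen = new Set();
  const out = new Array(results.length);
  // Walk newest -> oldest so the first Read we meet for a path is the most recent one.
  for (let i = results.length - 1; i >= 0; i--) {
    const r = results[i];
    if (r.tool === "Read" && seen.has(r.path)) {
      out[i] = { ...r, content: `[same file content as a later read; squeezr_expand(${idFor(r)}) to retrieve]` };
    } else {
      if (r.tool === "Read") seen.add(r.path);
      out[i] = r;
    }
  }
  return out;
}

const md5Id = (r) => createHash("md5").update(r.content).digest("hex").slice(0, 6);
const history = [
  { tool: "Read", path: "src/auth.ts", content: "export function validateToken() { /* v1 */ }" },
  { tool: "Bash", path: null, content: "npm test: 3 failing" },
  { tool: "Read", path: "src/auth.ts", content: "export function validateToken() { /* v2 */ }" },
];
const deduped = dedupeReads(history, md5Id);
console.log(deduped.map((r) => r.content));
```

Deriving the ID from the earlier read's own content means an older version of the file stays retrievable even after the file has changed.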
- ### Adaptive pressure
-
- As context fills up, Squeezr gets more aggressive — both in what it compresses and how aggressively the deterministic patterns behave:
-
- | Context used | Threshold | git diff context | git log cap | grep cap/file |
- |---|---|---|---|---|
- | < 50% | 1,500 chars | 1 line | 30 commits | 8 matches |
- | 50-75% | 800 chars | 1 line | 20 commits | 6 matches |
- | 75-90% | 400 chars | 1 line | 20 commits | 6 matches |
- | > 90% | 150 chars | **0 lines** | **10 commits** | **4 matches** |
-
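The table above is a step function of context usage. A minimal sketch that maps a usage fraction to the documented settings; how Squeezr measures usage is not shown, and the boundary handling (exactly 50% and 75% go to the higher tier, exactly 90% stays in the 75-90% tier because the table says "> 90%") is an assumption. `pressureSettings` and `shouldCompressBlock` are hypothetical names.

```javascript
// Sketch of the adaptive-pressure tiers from the table above.
// Assumption: 0.5 and 0.75 fall into the higher tier; 0.9 stays in the 75-90% tier.
function pressureSettings(usage) {
  if (usage > 0.9) return { threshold: 150, diffContext: 0, logCap: 10, grepPerFile: 4 };
  if (usage >= 0.75) return { threshold: 400, diffContext: 1, logCap: 20, grepPerFile: 6 };
  if (usage >= 0.5) return { threshold: 800, diffContext: 1, logCap: 20, grepPerFile: 6 };
  return { threshold: 1500, diffContext: 1, logCap: 30, grepPerFile: 8 };
}

// A block is compressed only when it is longer than the current threshold.
function shouldCompressBlock(text, usage) {
  return text.length > pressureSettings(usage).threshold;
}

console.log(pressureSettings(0.68));
```

At 68% usage this yields a threshold of 800 characters, consistent with the sample log line `Context pressure: 68% → threshold=800 chars` shown earlier.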
- ---
-
- ## The economics
-
- Compression is done by the cheapest model in each ecosystem:
-
- | Provider | Compression model | Cost vs main model |
- |---|---|---|
- | Anthropic | Claude Haiku | ~25x cheaper than Sonnet |
- | OpenAI | GPT-4o-mini | ~15x cheaper than GPT-4o |
- | Google | Gemini Flash 8B | ~10x cheaper than Gemini Pro |
- | Ollama | Your configured local model | Free |
-
- **Example:** Haiku compresses a 3,000-token tool result to 150 tokens. Cost: ~$0.0001. Saving on every subsequent Sonnet request: ~$0.009. Net savings per compression: ~98%.
-
- Typical 2-hour session (50+ tool calls): ~200K tokens without compression → ~80K with Squeezr (-60%). The session cache and KV warming compound this further in long sessions.
-
528
- ---
529
-
530
- ## Does it add latency?
531
-
532
- Barely — and in long sessions it makes things faster, not slower.
533
-
534
- **What Squeezr adds:**
535
- - Deterministic patterns (git, cargo, vitest, etc.) run in pure Node.js — microseconds, unnoticeable
536
- - AI compression (Haiku/GPT-4o-mini) adds ~200-400ms **but only once per block**, then cached forever. Every subsequent request that includes that block pays zero
537
-
538
- **Why it feels faster overall:**
539
-
540
- The time Squeezr takes to compress a block is parallel to the time you spend reading the previous response and typing the next message. By the time you send your next message, compression is already done.
541
-
542
- More importantly: sending 60-80% fewer tokens means Claude processes a smaller context and **responds faster** — especially noticeable from turn 10 onward when history accumulates.
543
-
544
- | | Without Squeezr | With Squeezr |
545
- |---|---|---|
546
- | Turn 1-3 | Fast | +200ms first compression (then cached) |
547
- | Turn 10+ | Getting slower | Stays fast — history is compressed |
548
- | Turn 30+ | Noticeably slow | Faster than turn 1 without Squeezr |
549
-
550
- ---
551
-
552
- ## Why not just use /compact?
553
-
554
- `/compact` is a nuclear option: it replaces your entire context with a single lossy summary. You lose granularity and can't go back. Squeezr is surgical — it compresses old, irrelevant content while keeping recent work at full fidelity, with lossless retrieval via `squeezr_expand` for anything that needs to be recovered.
555
-
556
- ---
557
-
558
- ## Auto-start
559
-
560
- The installer configures Squeezr to start automatically on login:
561
-
562
- | OS | Method | Fallback |
563
- |---|---|---|
564
- | macOS | launchd (`~/Library/LaunchAgents/com.squeezr.plist`) | Shell auto-heal |
565
- | Linux | systemd user service (`~/.config/systemd/user/squeezr.service`) | Shell auto-heal |
566
- | Windows | Task Scheduler (runs at login, restarts on failure) | — |
567
- | Windows (robust) | **NSSM Windows Service** (auto-restart on crash) | — |
568
- | **WSL2** | systemd → Task Scheduler (cascade) | Shell auto-heal |
569
-
570
- ### Windows: NSSM (recommended over Task Scheduler)
571
-
572
- The built-in Task Scheduler setup requires admin on every reinstall and does **not** restart Squeezr if it crashes mid-session (e.g. due to `ECONNRESET`). For a more robust setup, use [NSSM](https://nssm.cc) to run Squeezr as a proper Windows service:
573
-
574
- ```powershell
575
- # Install NSSM
576
- winget install nssm
577
-
578
- # Create the service (run as Administrator, adjust paths if needed)
579
- $node = (where.exe node | Select-Object -First 1)
580
- $script = "$(npm root -g)\squeezr-ai\bin\squeezr.js"
581
- nssm install SqueezrProxy $node $script
582
- nssm set SqueezrProxy AppExit Default Restart
583
- nssm set SqueezrProxy AppRestartDelay 3000
584
- nssm start SqueezrProxy
585
- ```
586
-
587
- NSSM gives you: auto-start on boot, automatic restart on crash, stdout/stderr logs, and control via `services.msc`.
588
-
589
- See [NSSM_WINDOWS_SERVICE.md](./NSSM_WINDOWS_SERVICE.md) for the full guide including log setup, troubleshooting, and uninstall steps.
590
-
591
- ### WSL2 support
592
-
593
- `squeezr setup` detects WSL2 automatically and configures both sides:
594
-
595
- - **WSL shell**: env vars + auto-heal guard in `.bashrc` / `.zshrc`
596
- - **Windows**: env vars via `setx` (persistent in registry)
597
- - **Auto-start**: tries systemd first (WSL2 with `systemd=true` in `/etc/wsl.conf`), falls back to Windows Task Scheduler via `powershell.exe`
598
-
599
- ### Auto-heal
600
-
601
- On every platform, `squeezr setup` adds a lightweight guard to your shell profile. Each time you open a terminal, it checks if the proxy is alive (`curl localhost:8080/squeezr/health`). If not, it starts it in the background — silently, in ~100ms. This means:
602
-
603
- - If the service manager fails, the proxy still starts on your next terminal
604
- - If the proxy crashes mid-session, the next terminal restores it
605
- - Zero manual intervention after `squeezr setup`, ever
606
-
607
- ---
608
179
 
609
180
  ## Requirements
610
181
 
611
182
  - Node.js 18+
612
- - Your AI CLI already set up and working — nothing else needed
613
-
614
- Squeezr works with **any auth method** your CLI uses:
615
-
616
- | Auth type | Example | Works? |
617
- |---|---|---|
618
- | API key | `ANTHROPIC_API_KEY=sk-ant-...` | ✅ Full pipeline |
619
- | OAuth / subscription | Claude Code via claude.ai plan | ✅ Full pipeline — OAuth token reused for Haiku |
620
- | Local / no key | Ollama, LM Studio | ✅ Full pipeline — local model for compression |
621
-
622
- No extra credentials needed. Squeezr extracts and reuses whatever auth is already in your requests.
623
-
624
- ---
625
-
626
- ## Endpoints
627
-
628
- | Endpoint | Description |
629
- |---|---|
630
- | `POST /v1/messages` | Anthropic — Claude Code |
631
- | `POST /v1/chat/completions` | OpenAI / Ollama — Codex, Aider, OpenCode, local CLIs |
632
- | `POST /v1beta/models/{model}:generateContent` | Google — Gemini CLI |
633
- | `GET /squeezr/stats` | JSON session stats + cache hit rate + pattern coverage |
634
- | `GET /squeezr/health` | Health check + version |
635
- | `GET /squeezr/expand/:id` | Retrieve original content for a compressed block |
636
- | `* /{path}` | All other endpoints forwarded unmodified to detected upstream |
637
-
638
- ---
183
+ - For Codex MITM: `HTTPS_PROXY=http://localhost:8081` (set automatically by `squeezr setup`)
184
+ - For local compression: [Ollama](https://ollama.ai) with `qwen2.5-coder:1.5b`
639
185
 
640
- ## Changelog
186
+ ## License
641
187
 
642
- See [CHANGELOG.md](CHANGELOG.md).
188
+ MIT
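The removed README text above says compressed blocks get a deterministic MD5-based ID, so identical content always yields the same `[squeezr:id -ratio%]` string and the request prefix stays byte-stable across turns. Below is a minimal Node.js sketch of that idea. The helper name `compressedTag`, the 8-character ID length, and the way the ratio is rounded are assumptions made for illustration. It is not Squeezr's actual implementation.

```javascript
const crypto = require('crypto')

// Hypothetical helper (not Squeezr's API): build a stable tag for a compressed block.
// Same original content -> same MD5 digest -> same tag bytes on every request,
// so an unchanged history prefix stays identical and remains cacheable upstream.
function compressedTag(original, compressed) {
  // Assumption: the first 8 hex chars of the MD5 digest serve as the block ID.
  const id = crypto.createHash('md5').update(original, 'utf8').digest('hex').slice(0, 8)
  // Assumption: ratio = percentage of characters removed, rounded to an integer.
  const ratio = original.length === 0
    ? 0
    : Math.round((1 - compressed.length / original.length) * 100)
  return `[squeezr:${id} -${ratio}%]`
}
```

If you call `compressedTag` twice with the same tool output, you get byte-identical strings. The KV-warming description depends on exactly this property. A random or time-based ID would change the prefix on every request and defeat prompt caching.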
package/bin/squeezr.js CHANGED
@@ -195,11 +195,14 @@ function setupWindows() {
195
195
  const caPath = path.join(os.homedir(), '.squeezr', 'mitm-ca', 'ca.crt')
196
196
  const vars = {
197
197
  ANTHROPIC_BASE_URL: `http://localhost:${port}`,
198
- openai_base_url: `http://localhost:${port}`,
198
+ // openai_base_url NOT set — Codex uses WebSocket and must go via HTTPS_PROXY/MITM,
199
+ // not through the HTTP proxy. Setting it breaks Codex's ws:// connections.
199
200
  GEMINI_API_BASE_URL: `http://localhost:${port}`,
200
201
  HTTPS_PROXY: `http://localhost:${mitmPort}`,
201
202
  NODE_EXTRA_CA_CERTS: caPath,
202
- // Bypass MITM for OpenAI auth and non-Codex endpoints only chatgpt.com needs interception
203
+ // SSL_CERT_FILE not set Rust/native apps use Windows Certificate Store instead
204
+ // (CA is imported in step 4 below via certutil)
205
+ // Bypass MITM for auth and API domains — only chatgpt.com needs interception
203
206
  NO_PROXY: 'auth.openai.com,login.openai.com,api.openai.com,api.anthropic.com,generativelanguage.googleapis.com',
204
207
  }
205
208
  for (const [key, value] of Object.entries(vars)) {
@@ -284,9 +287,9 @@ function setupWindows() {
284
287
  console.log(` [ok] Squeezr started in background (pid ${child.pid})`)
285
288
  console.log(` [ok] Logs → ${logFile}`)
286
289
 
287
- // 4. Trust MITM CA in Windows Certificate Store (for native apps like browsers)
288
- // Node.js apps (Codex) use NODE_EXTRA_CA_CERTS set above instead.
289
- // The CA is generated on first proxy start — wait briefly for it to appear
290
+ // 4. Trust MITM CA in Windows Certificate Store (for Rust apps like Codex)
291
+ // Node.js apps use NODE_EXTRA_CA_CERTS; Rust/native apps need the cert store.
292
+ // The CA is generated on first proxy start — wait briefly for it to appear.
290
293
  const waitForCa = (retries = 10, interval = 500) => new Promise(resolve => {
291
294
  const check = (n) => {
292
295
  if (fs.existsSync(caPath)) return resolve(true)
@@ -302,12 +305,18 @@ function setupWindows() {
302
305
  printDone()
303
306
  return
304
307
  }
308
+ // Try machine store (admin) first, fall back to user store (no admin)
305
309
  try {
306
310
  execSync(`certutil -addstore -f Root "${caPath}"`, { stdio: 'pipe' })
307
- console.log(` [ok] MITM CA trusted in Windows Certificate Store (Codex TLS interception ready)`)
311
+ console.log(` [ok] MITM CA trusted in Windows Certificate Store (machine-level)`)
308
312
  } catch {
309
- console.log(` [warn] Could not trust MITM CA — run as Administrator or trust manually:`)
310
- console.log(` certutil -addstore -f Root "${caPath}"`)
313
+ try {
314
+ execSync(`certutil -addstore -user Root "${caPath}"`, { stdio: 'pipe' })
315
+ console.log(` [ok] MITM CA trusted in Windows Certificate Store (user-level)`)
316
+ } catch {
317
+ console.log(` [warn] Could not trust MITM CA — trust manually:`)
318
+ console.log(` certutil -addstore -user Root "${caPath}"`)
319
+ }
311
320
  }
312
321
  printDone()
313
322
  })
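The `NO_PROXY` value set in `setupWindows()` controls which hosts skip the MITM proxy. Auth and API domains go direct. Any host not on the list, notably `chatgpt.com`, which Codex uses, goes through `HTTPS_PROXY`. The sketch below models that routing with a simplified matcher. It treats each entry as either an exact host or a parent domain, which is an assumption for illustration. Real clients such as curl, Node.js agents, and Rust HTTP crates each parse `NO_PROXY` with slightly different rules for wildcards, ports, and leading dots. The helper names `bypassesProxy` and `routeFor` are hypothetical.

```javascript
// Same list that setupWindows() writes to NO_PROXY.
const NO_PROXY = 'auth.openai.com,login.openai.com,api.openai.com,api.anthropic.com,generativelanguage.googleapis.com'

// Hypothetical, simplified matcher: exact host or parent-domain match only.
function bypassesProxy(host, noProxy = NO_PROXY) {
  const h = host.toLowerCase()
  return noProxy
    .split(',')
    .map(entry => entry.trim().toLowerCase().replace(/^\./, ''))
    .filter(Boolean)
    .some(entry => h === entry || h.endsWith('.' + entry))
}

// Hypothetical router: listed hosts go direct, everything else via the MITM proxy.
// 8081 matches the HTTPS_PROXY value given in the README; adjust if yours differs.
function routeFor(host, mitmPort = 8081) {
  return bypassesProxy(host) ? 'direct' : `http://localhost:${mitmPort}`
}
```

Under this model, `routeFor('api.openai.com')` returns `'direct'` and `routeFor('chatgpt.com')` returns `'http://localhost:8081'`. This is the behaviour the comment "only chatgpt.com needs interception" describes.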
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "squeezr-ai",
3
- "version": "1.14.6",
3
+ "version": "1.14.7",
4
4
  "description": "AI proxy that compresses Claude Code, Codex, Aider, Gemini CLI and Ollama context windows to save thousands of tokens per session",
5
5
  "keywords": [
6
6
  "claude",