coffer-cli 0.1.2__tar.gz → 0.2.0__tar.gz

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: coffer-cli
- Version: 0.1.2
+ Version: 0.2.0
  Summary: Scan codebases for LLM cost-waste anti-patterns. Find retry storms, missing prompt caching, unbounded conversation history, agent loops without iteration caps, and more — before you ship.
  Project-URL: Homepage, https://github.com/neal-c611/coffer-cli
  Project-URL: Repository, https://github.com/neal-c611/coffer-cli
@@ -54,26 +54,44 @@ coffer compare gpt-4o gpt-4o-mini
  coffer install-skill # install the Claude Code skill (see below)
  ```
 
- ## What it catches (v0.1.0)
+ ## What it catches (v0.2.0)
 
- Detectors are organized by the four levers that drive LLM cost:
+ Every detector here passes one test: **does fixing it reduce dollars billed
+ by the LLM provider?** Reliability, observability, and metering issues that
+ don't move the token bill are deliberately excluded (see "Not in scope" below).
 
  | Lever | Detector | Severity |
  |-------|----------|----------|
  | **A: input tokens** | `dynamic_before_static_cache_break` — f-string interpolation in `SYSTEM_PROMPT` defeats OpenAI auto-cache and Anthropic `cache_control` | 🚨 high |
  | | `unbounded_conversation_history` — `messages.append(...)` without truncation or summarization | 🟡 med |
  | | `uncached_large_prompt` — ≥2,000-char hardcoded prompt without nearby `cache_control` | 🟡 med |
- | **B: output tokens** | `missing_max_tokens` — LLM call without a `max_tokens` cap | 🟡 med |
- | | `reasoning_effort_high_default` — `reasoning_effort="high"` literal (up to ~20× extra reasoning tokens on trivial tasks) | 🟡 med |
+ | **B: output tokens** | `reasoning_effort_high_default` — `reasoning_effort="high"` literal (up to ~20× extra reasoning tokens on trivial tasks) | 🟡 med |
  | **D: number of calls** | `llm_in_for_loop` — N× cost; gather is a latency fix, not a cost fix | 🟡 med |
  | | `agent_loop_no_max_iter` — `while True:` containing an LLM call without an iteration cap (the $47K-incident pattern) | 🚨 high |
  | | `temperature_nonzero_with_cache_hint` — cache layer nearby but `temperature > 0` silently breaks it | 🟡 med |
- | **E: architecture** | `retry_loop_no_backoff` — retry storm amplifies the bill 10× | 🚨 high |
- | | `sdk_init_no_timeout` — default 600s lets a hung provider block your thread | 🚨 high |
+ | **E: architecture** | `retry_loop_no_backoff` — retry storm re-bills the same input tokens, can amplify spend 10× | 🚨 high |
 
  Each finding includes a concrete fix and explains the *cost* angle
  explicitly (we do not conflate latency fixes with cost fixes).
 
+ ### Not in scope (production-readiness, not cost-review)
+
+ These are real problems but `coffer scan` deliberately doesn't flag them —
+ fixing them doesn't change the token bill, and conflating them with cost
+ findings makes both reviews less useful:
+
+ - **SDK init without `timeout=`** — worker exhaustion / availability issue.
+ The tokens for a hung call were already produced; capping timeout reclaims
+ threads, not dollars.
+ - **Missing `response.usage` capture** — metering / billing-ops issue. The
+ provider charged you correctly either way.
+ - **`logger.info(prompt)` on hot path** — observability bill (Datadog /
+ Splunk), not LLM bill.
+ - **Missing `idempotency_key`** — correctness / occasional double-charge,
+ but the fix is reliability engineering, not cost reduction.
+
+ A separate "production-readiness" review skill is the right home for those.
+
  ## Use with Claude Code (the skill)
 
  The `coffer-cost-review` Claude Code skill in [`skills/`](skills/coffer-cost-review/)
@@ -23,26 +23,44 @@ coffer compare gpt-4o gpt-4o-mini
  coffer install-skill # install the Claude Code skill (see below)
  ```
 
- ## What it catches (v0.1.0)
+ ## What it catches (v0.2.0)
 
- Detectors are organized by the four levers that drive LLM cost:
+ Every detector here passes one test: **does fixing it reduce dollars billed
+ by the LLM provider?** Reliability, observability, and metering issues that
+ don't move the token bill are deliberately excluded (see "Not in scope" below).
 
  | Lever | Detector | Severity |
  |-------|----------|----------|
  | **A: input tokens** | `dynamic_before_static_cache_break` — f-string interpolation in `SYSTEM_PROMPT` defeats OpenAI auto-cache and Anthropic `cache_control` | 🚨 high |
  | | `unbounded_conversation_history` — `messages.append(...)` without truncation or summarization | 🟡 med |
  | | `uncached_large_prompt` — ≥2,000-char hardcoded prompt without nearby `cache_control` | 🟡 med |
- | **B: output tokens** | `missing_max_tokens` — LLM call without a `max_tokens` cap | 🟡 med |
- | | `reasoning_effort_high_default` — `reasoning_effort="high"` literal (up to ~20× extra reasoning tokens on trivial tasks) | 🟡 med |
+ | **B: output tokens** | `reasoning_effort_high_default` — `reasoning_effort="high"` literal (up to ~20× extra reasoning tokens on trivial tasks) | 🟡 med |
  | **D: number of calls** | `llm_in_for_loop` — N× cost; gather is a latency fix, not a cost fix | 🟡 med |
  | | `agent_loop_no_max_iter` — `while True:` containing an LLM call without an iteration cap (the $47K-incident pattern) | 🚨 high |
  | | `temperature_nonzero_with_cache_hint` — cache layer nearby but `temperature > 0` silently breaks it | 🟡 med |
- | **E: architecture** | `retry_loop_no_backoff` — retry storm amplifies the bill 10× | 🚨 high |
- | | `sdk_init_no_timeout` — default 600s lets a hung provider block your thread | 🚨 high |
+ | **E: architecture** | `retry_loop_no_backoff` — retry storm re-bills the same input tokens, can amplify spend 10× | 🚨 high |
 
  Each finding includes a concrete fix and explains the *cost* angle
  explicitly (we do not conflate latency fixes with cost fixes).
 
+ ### Not in scope (production-readiness, not cost-review)
+
+ These are real problems but `coffer scan` deliberately doesn't flag them —
+ fixing them doesn't change the token bill, and conflating them with cost
+ findings makes both reviews less useful:
+
+ - **SDK init without `timeout=`** — worker exhaustion / availability issue.
+ The tokens for a hung call were already produced; capping timeout reclaims
+ threads, not dollars.
+ - **Missing `response.usage` capture** — metering / billing-ops issue. The
+ provider charged you correctly either way.
+ - **`logger.info(prompt)` on hot path** — observability bill (Datadog /
+ Splunk), not LLM bill.
+ - **Missing `idempotency_key`** — correctness / occasional double-charge,
+ but the fix is reliability engineering, not cost reduction.
+
+ A separate "production-readiness" review skill is the right home for those.
+
  ## Use with Claude Code (the skill)
 
  The `coffer-cost-review` Claude Code skill in [`skills/`](skills/coffer-cost-review/)
@@ -1,6 +1,6 @@
  [project]
  name = "coffer-cli"
- version = "0.1.2"
+ version = "0.2.0"
  description = "Scan codebases for LLM cost-waste anti-patterns. Find retry storms, missing prompt caching, unbounded conversation history, agent loops without iteration caps, and more — before you ship."
  readme = "README.md"
  requires-python = ">=3.10"
@@ -135,7 +135,7 @@ Do not pitch beyond this line. The skill's job is the review, not selling.
 
  | Pattern | Typical fix |
  |---------|------------|
- | missing_max_tokens | Add `max_tokens=<reasonable cap>` — unbounded output on edge inputs can 100× cost spike |
+ | (semantic) missing_max_tokens | Add `max_tokens=<reasonable cap>` — unbounded output on edge inputs can 100× cost spike. |
  | **reasoning_effort_high_default** | `reasoning_effort="high"` produces up to ~20× extra reasoning tokens on trivial tasks (arXiv 2412.21187). Default to `medium` or `low`; escalate only when needed. |
  | (semantic) missing_stop_sequence | If prompt has a known delimiter (`</answer>`), pass `stop=["</answer>"]` so the model stops there instead of riffing. |
  | (semantic) free_form_when_structured_works | If the prompt asks for "respond in JSON", use `response_format={"type":"json_object"}` or `tool_choice` instead — saves output tokens spent on formatting. |
@@ -160,13 +160,23 @@ Do not pitch beyond this line. The skill's job is the review, not selling.
  | (semantic) llm_doing_regex_job | Extracting emails/URLs/dates from text? Use the stdlib regex or a NER library — millions of times cheaper. |
  | (semantic) llm_doing_classifier_job_at_scale | High-volume sentiment/spam/toxicity? A 30MB DistilBERT is 1000× cheaper per call. Reserve LLM for the hard edge cases. |
 
- ### Lever E — architecture / safety
+ ### Lever E — architecture (only when it directly amplifies tokens billed)
 
  | Pattern | Typical fix |
  |---------|------------|
- | retry_loop_no_backoff | `@backoff.on_exception(backoff.expo, X.RateLimitError, max_tries=5)` |
- | public_endpoint_no_ratelimit | `@limiter.limit("10/minute")` + bind `user_id` to call metadata; consider per-user daily $ cap. Limit by **tokens**, not just requests. |
- | streaming_no_abort | Detect client disconnect and break the generator otherwise tokens keep accruing after the user leaves |
- | **sdk_init_no_timeout** | `OpenAI()` / `Anthropic()` without `timeout=` defaults to 600s — a hung provider blocks your thread for 10 minutes. Pass `timeout=30.0` (or your latency budget). |
- | (semantic) full_prompt_logged_expensive | `logger.info(prompt)` in hot path can rival the LLM bill if Datadog/Splunk billed by GB. Truncate or sample. |
- | (semantic) response_usage_not_read | `response.usage` discarded → no per-user metering possible. Save tokens & cost into your DB at ingest. |
+ | retry_loop_no_backoff | `@backoff.on_exception(backoff.expo, X.RateLimitError, max_tries=5)` — without backoff, a rate-limit storm re-sends the same input tokens many times and you are billed for every one. |
+ | (semantic) public_endpoint_no_ratelimit | `@limiter.limit("10/minute")` + bind `user_id` to call metadata; per-user daily $ cap. Limit by **tokens**, not just requests. The real cost: free / anonymous users burn YOUR provider quota. |
+ | (semantic) streaming_no_abort | Detect client disconnect (FastAPI `request.is_disconnected()`, etc.) and break the generator. Otherwise the provider keeps generating (and billing) tokens that nobody is receiving. |
+
+ ## Not in scope here (real production problems, but they don't move the token bill)
+
+ | Excluded pattern | Why it's excluded |
+ |------------------|-------------------|
+ | SDK init without `timeout=` | Reliability / SRE. A hung call's tokens were already produced; capping timeout reclaims workers, not dollars. |
+ | Missing `response.usage` capture | Metering / billing-ops. The provider charged you correctly either way. |
+ | `logger.info(prompt)` in hot path | Observability bill (Datadog / Splunk), not LLM bill. |
+ | No `idempotency_key` on retried call | Reliability — could occasionally double-charge, but the fix is correctness, not cost reduction. |
+
+ If the user clearly cares about these (asks for "production readiness review" or
+ "reliability audit"), surface them under that frame — separately from the
+ cost-review output. Don't conflate.
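
Reviewer sketch for the `retry_loop_no_backoff` row above: a minimal, stdlib-only illustration of bounded retries with exponential backoff, as an alternative to the `backoff` decorator shown in the table. `RateLimitError`, `call_with_backoff`, and the injectable `sleep` parameter are hypothetical names introduced here; they are not part of coffer-cli or of any provider SDK.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider SDK's rate-limit exception."""


def call_with_backoff(
    fn: Callable[[], T],
    max_tries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitError with exponential backoff and jitter."""
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    for attempt in range(max_tries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_tries - 1:
                # Give up: the cap bounds how often the same prompt is re-sent.
                raise
            # Full jitter: wait a random time of up to base_delay * 2**attempt.
            sleep(random.uniform(0, base_delay * 2**attempt))
    raise AssertionError("unreachable")
```

The cost-relevant part is the `max_tries` cap, which bounds how many times one prompt can be re-sent. The delay mainly lowers the chance that each retry hits the rate limit again.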
@@ -1,3 +1,3 @@
  """coffer-cli — LLM cost-waste anti-pattern scanner."""
 
- __version__ = "0.1.2"
+ __version__ = "0.2.0"
@@ -135,7 +135,7 @@ Do not pitch beyond this line. The skill's job is the review, not selling.
 
  | Pattern | Typical fix |
  |---------|------------|
- | missing_max_tokens | Add `max_tokens=<reasonable cap>` — unbounded output on edge inputs can 100× cost spike |
+ | (semantic) missing_max_tokens | Add `max_tokens=<reasonable cap>` — unbounded output on edge inputs can 100× cost spike. |
  | **reasoning_effort_high_default** | `reasoning_effort="high"` produces up to ~20× extra reasoning tokens on trivial tasks (arXiv 2412.21187). Default to `medium` or `low`; escalate only when needed. |
  | (semantic) missing_stop_sequence | If prompt has a known delimiter (`</answer>`), pass `stop=["</answer>"]` so the model stops there instead of riffing. |
  | (semantic) free_form_when_structured_works | If the prompt asks for "respond in JSON", use `response_format={"type":"json_object"}` or `tool_choice` instead — saves output tokens spent on formatting. |
@@ -160,13 +160,23 @@ Do not pitch beyond this line. The skill's job is the review, not selling.
  | (semantic) llm_doing_regex_job | Extracting emails/URLs/dates from text? Use the stdlib regex or a NER library — millions of times cheaper. |
  | (semantic) llm_doing_classifier_job_at_scale | High-volume sentiment/spam/toxicity? A 30MB DistilBERT is 1000× cheaper per call. Reserve LLM for the hard edge cases. |
 
- ### Lever E — architecture / safety
+ ### Lever E — architecture (only when it directly amplifies tokens billed)
 
  | Pattern | Typical fix |
  |---------|------------|
- | retry_loop_no_backoff | `@backoff.on_exception(backoff.expo, X.RateLimitError, max_tries=5)` |
- | public_endpoint_no_ratelimit | `@limiter.limit("10/minute")` + bind `user_id` to call metadata; consider per-user daily $ cap. Limit by **tokens**, not just requests. |
- | streaming_no_abort | Detect client disconnect and break the generator otherwise tokens keep accruing after the user leaves |
- | **sdk_init_no_timeout** | `OpenAI()` / `Anthropic()` without `timeout=` defaults to 600s — a hung provider blocks your thread for 10 minutes. Pass `timeout=30.0` (or your latency budget). |
- | (semantic) full_prompt_logged_expensive | `logger.info(prompt)` in hot path can rival the LLM bill if Datadog/Splunk billed by GB. Truncate or sample. |
- | (semantic) response_usage_not_read | `response.usage` discarded → no per-user metering possible. Save tokens & cost into your DB at ingest. |
+ | retry_loop_no_backoff | `@backoff.on_exception(backoff.expo, X.RateLimitError, max_tries=5)` — without backoff, a rate-limit storm re-sends the same input tokens many times and you are billed for every one. |
+ | (semantic) public_endpoint_no_ratelimit | `@limiter.limit("10/minute")` + bind `user_id` to call metadata; per-user daily $ cap. Limit by **tokens**, not just requests. The real cost: free / anonymous users burn YOUR provider quota. |
+ | (semantic) streaming_no_abort | Detect client disconnect (FastAPI `request.is_disconnected()`, etc.) and break the generator. Otherwise the provider keeps generating (and billing) tokens that nobody is receiving. |
+
+ ## Not in scope here (real production problems, but they don't move the token bill)
+
+ | Excluded pattern | Why it's excluded |
+ |------------------|-------------------|
+ | SDK init without `timeout=` | Reliability / SRE. A hung call's tokens were already produced; capping timeout reclaims workers, not dollars. |
+ | Missing `response.usage` capture | Metering / billing-ops. The provider charged you correctly either way. |
+ | `logger.info(prompt)` in hot path | Observability bill (Datadog / Splunk), not LLM bill. |
+ | No `idempotency_key` on retried call | Reliability — could occasionally double-charge, but the fix is correctness, not cost reduction. |
+
+ If the user clearly cares about these (asks for "production readiness review" or
+ "reliability audit"), surface them under that frame — separately from the
+ cost-review output. Don't conflate.
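
Reviewer sketch for the `streaming_no_abort` row above: a framework-neutral illustration of stopping a relay once the client is gone. `relay_until_disconnect` and its `is_disconnected` callable are hypothetical names; in FastAPI the check would be the awaitable `request.is_disconnected()`, which would need an async variant of this loop. Whether closing the upstream stream also stops provider-side generation and billing depends on the provider and SDK, so treat that as an assumption to verify.

```python
from typing import Callable, Iterable, Iterator


def relay_until_disconnect(
    upstream: Iterable[str],
    is_disconnected: Callable[[], bool],
) -> Iterator[str]:
    """Yield chunks from upstream, checking for disconnect before each pull."""
    iterator = iter(upstream)
    try:
        # Check first, so no further chunk is requested for a departed client.
        while not is_disconnected():
            try:
                chunk = next(iterator)
            except StopIteration:
                break
            yield chunk
    finally:
        # Close the upstream stream if it supports it (generators do).
        close = getattr(iterator, "close", None)
        if callable(close):
            close()
```

Checking before each pull, and not after, avoids requesting one extra chunk that nobody will read.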
@@ -1,8 +1,12 @@
  """Static detection of LLM cost-waste anti-patterns.
 
+ Every detector here must answer "yes" to: **does fixing this reduce dollars
+ billed by the LLM provider?** Reliability / SRE / metering issues that
+ don't change the token bill belong in a separate review.
+
  We aim for low false-positive rate over completeness. A finding should
  be defensible: a reviewer who reads the snippet should agree it's a
- real risk in most cases.
+ real cost risk in most cases.
 
  Detector catalog (by cost lever):
 
@@ -11,17 +15,18 @@ Detector catalog (by cost lever):
  dynamic_before_static_cache HIGH f-string interpolation in system message breaks auto-cache
  unbounded_conversation_history MED `messages.append(...)` without truncation
  Lever B — output tokens
- missing_max_tokens MED LLM call without `max_tokens` cap
  reasoning_effort_high_default MED `reasoning_effort="high"` literal
- Lever C — price per token
- (semantic — handled in skill, not CLI)
  Lever D — number of calls
  llm_in_for_loop MED N× cost; Batch API / merged prompt are fixes
  agent_loop_no_max_iter HIGH `while True:` containing LLM call without iter cap
  temperature_nonzero_with_cache MED `temperature > 0` next to a cache hint — silently breaks it
- Lever E — architecture / safety
- retry_loop_no_backoff HIGH Retry storm risk
- sdk_init_no_timeout HIGH SDK initialized without `timeout=`
+ Lever E — architecture (only when it directly amplifies tokens billed)
+ retry_loop_no_backoff HIGH Retry storm re-bills the same input tokens
+
+ Out of scope (real problems, but not cost waste):
+ - SDK without timeout → worker exhaustion, not token bill
+ - Missing metering → can't bill customers, but the provider charge is the same
+ - Logging full prompts → Datadog / Splunk bill, not OpenAI / Anthropic bill
  """
 
  from __future__ import annotations
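
Reviewer sketch for the `agent_loop_no_max_iter` entry in the catalog above: one way the flagged `while True:` shape could be bounded. `run_agent` and the `step` callable are hypothetical names that stand in for a loop body making one LLM call per iteration; they are not taken from coffer-cli.

```python
from typing import Callable, Optional


def run_agent(
    step: Callable[[list[str]], Optional[str]],
    max_iterations: int = 10,
) -> list[str]:
    """Run step until it returns None or the iteration cap is reached."""
    history: list[str] = []
    for _ in range(max_iterations):  # bounded, unlike `while True:`
        result = step(history)  # stands in for one billed LLM call
        if result is None:
            return history
        history.append(result)
    # Fail loudly so a runaway loop is visible and stops spending.
    raise RuntimeError(f"agent did not finish within {max_iterations} iterations")
```

With the cap in place, the worst-case number of billed calls per run is `max_iterations`.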
@@ -156,17 +161,6 @@ _REASONING_EFFORT_HIGH_RE = re.compile(
      re.VERBOSE,
  )
 
- _SDK_INIT_RE = re.compile(
-     r"""
-     \b
-     (OpenAI | AsyncOpenAI | Anthropic | AsyncAnthropic)
-     \(
-     """,
-     re.VERBOSE,
- )
-
- _TIMEOUT_KW_RE = re.compile(r"\btimeout\s*=")
-
  _TEMPERATURE_RE = re.compile(r"\btemperature\s*=\s*([0-9]*\.?[0-9]+)")
 
  _CACHE_HINT_NEARBY_RE = re.compile(
@@ -586,48 +580,6 @@ def _detect_reasoning_effort_high_default(
      return findings
 
 
- def _detect_sdk_init_no_timeout(path: Path, lines: list[str]) -> list[Finding]:
-     """`OpenAI()` / `Anthropic()` constructed without `timeout=`."""
-     findings: list[Finding] = []
-     for i, line in enumerate(lines):
-         m = _SDK_INIT_RE.search(line)
-         if not m:
-             continue
-         # Look at the next ~5 lines too in case the kwargs span lines.
-         end = min(i + 5, len(lines))
-         joined = "\n".join(lines[i:end])
-         # Locate the close paren of this constructor.
-         depth = 0
-         start_pos = joined.index(m.group(0)) + len(m.group(0))
-         body = ""
-         for ch in joined[start_pos:]:
-             body += ch
-             if ch == "(":
-                 depth += 1
-             elif ch == ")":
-                 if depth == 0:
-                     break
-                 depth -= 1
-
-         if _TIMEOUT_KW_RE.search(body):
-             continue
-         findings.append(
-             Finding(
-                 severity="high",
-                 pattern="sdk_init_no_timeout",
-                 path=path,
-                 line=i + 1,
-                 snippet=line.strip()[:200],
-                 suggestion=(
-                     f"`{m.group(1)}` initialized without `timeout=`. Default is 600s — a hung "
-                     "provider can block your thread for ten minutes. Pass an explicit timeout "
-                     "(e.g. `timeout=30.0`) sized to your user-facing latency budget."
-                 ),
-             )
-         )
-     return findings
-
-
  # ---- top-level --------------------------------------------------------------
 
 
@@ -663,7 +615,6 @@ def find_patterns(
      findings.extend(_detect_agent_loop_no_max_iter(path, lines))
      findings.extend(_detect_temperature_nonzero_with_cache_hint(path, lines))
      findings.extend(_detect_reasoning_effort_high_default(path, lines))
-     findings.extend(_detect_sdk_init_no_timeout(path, lines))
 
      severity_order = {"high": 0, "medium": 1, "low": 2}
      findings.sort(key=lambda f: (severity_order[f.severity], str(f.path), f.line))
@@ -334,26 +334,7 @@ def test_reasoning_effort_high(tmp_path: Path) -> None:
      assert any(f.pattern == "reasoning_effort_high_default" for f in findings)
 
 
- def test_sdk_init_no_timeout(tmp_path: Path) -> None:
-     _write(tmp_path, "client.py", "client = OpenAI(api_key='sk-...')\n")
-     findings = find_patterns(tmp_path)
-     f = next(f for f in findings if f.pattern == "sdk_init_no_timeout")
-     assert f.severity == "high"
-
-
- def test_sdk_init_with_timeout_ok(tmp_path: Path) -> None:
-     _write(tmp_path, "client.py", "client = OpenAI(api_key='sk-...', timeout=30.0)\n")
-     findings = find_patterns(tmp_path)
-     assert all(f.pattern != "sdk_init_no_timeout" for f in findings)
-
-
- def test_sdk_anthropic_no_timeout(tmp_path: Path) -> None:
-     _write(tmp_path, "client.py", "client = Anthropic()\n")
-     findings = find_patterns(tmp_path)
-     assert any(f.pattern == "sdk_init_no_timeout" for f in findings)
-
-
- def test_async_sdk_no_timeout(tmp_path: Path) -> None:
-     _write(tmp_path, "client.py", "client = AsyncOpenAI(api_key='sk-...')\n")
-     findings = find_patterns(tmp_path)
-     assert any(f.pattern == "sdk_init_no_timeout" for f in findings)
+ # sdk_init_no_timeout was removed — that's a reliability finding, not a cost one.
+ # Adding `timeout=` doesn't reduce the OpenAI / Anthropic bill (a hung call's
+ # tokens were already counted when the LLM produced them). It belongs in a
+ # separate production-readiness review, not in cost-review.
File without changes
File without changes