cc-context-stats 1.7.0 → 1.8.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +24 -0
- package/docs/MODEL_INTELLIGENCE.md +396 -0
- package/docs/context-stats.md +1 -1
- package/package.json +1 -1
- package/pyproject.toml +1 -1
- package/scripts/statusline-full.sh +171 -37
- package/scripts/statusline.js +128 -18
- package/scripts/statusline.py +108 -24
- package/src/claude_statusline/__init__.py +1 -1
- package/src/claude_statusline/cli/context_stats.py +33 -3
- package/src/claude_statusline/cli/statusline.py +27 -12
- package/src/claude_statusline/core/config.py +17 -0
- package/src/claude_statusline/graphs/intelligence.py +162 -0
- package/src/claude_statusline/graphs/renderer.py +38 -3
- package/tests/bash/test_statusline_full.bats +5 -5
- package/tests/fixtures/mi_test_vectors.json +140 -0
- package/tests/node/intelligence.test.js +98 -0
- package/tests/node/statusline.test.js +4 -4
- package/tests/python/test_intelligence.py +314 -0
- package/tests/python/test_layout.py +4 -4
- package/tests/python/test_statusline.py +4 -4
package/CHANGELOG.md
CHANGED
@@ -7,6 +7,30 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [1.8.0] - 2026-03-15
+
+### Added
+
+- **Model Intelligence (MI) score** — Heuristic quality score estimating answer quality based on context utilization, cache efficiency, and output productivity. Inspired by the Michelangelo paper (arXiv:2409.12640). Displayed as `MI:X.XX` in the statusline with green/yellow/red color coding
+- **MI score in all implementations** — MI computation available across Python package, standalone Python, Node.js, and Bash (via `awk`) statusline scripts with full cross-implementation parity
+- **MI timeseries graph** — `context-stats --type mi` renders MI score trajectory over time as an ASCII graph with decimal Y-axis labels
+- **MI in session summary** — `context-stats` summary now shows MI score with sub-component breakdown (CPS, ES, PS) and interpretation text
+- **Shared test vectors** — `tests/fixtures/mi_test_vectors.json` with 6 vectors ensuring Python and Node.js produce identical MI scores within ±0.01 tolerance
+- **`label_fn` parameter for `render_timeseries()`** — Optional custom Y-axis label formatter, used by MI graph to display decimals instead of token counts
+- **Bash feature parity** — `statusline-full.sh` now supports custom color overrides, state file rotation, MI score display, and all config keys (`show_mi`, `mi_curve_beta`, `reduced_motion`, `show_io_tokens`)
+- **Config: `show_mi`** — Toggle MI score display (default: `true`)
+- **Config: `mi_curve_beta`** — Adjust MI degradation curve shape (default: `1.5`)
+
+### Changed
+
+- **Compact context display** — Removed "free" word from context info (`872,748 (87.3%)` instead of `872,748 free (87.3%)`) across all implementations
+- **Decoupled state reads from `show_delta`** — State file is now read when either `show_delta` or `show_mi` is enabled, allowing MI to work independently of delta display
+- **Node.js terminal width default** — Changed from `80` to `200` when no TTY is detected (matching Python behavior), preventing `fitToWidth` from dropping statusline parts in Claude Code's subprocess
+
+### Fixed
+
+- **Node.js terminal width** — Fixed `getTerminalWidth()` defaulting to 80 in Claude Code's subprocess, which caused MI, delta, AC, and session parts to be silently dropped
+
 ## [1.7.0] - 2026-03-14
 
 ### Added
package/docs/MODEL_INTELLIGENCE.md
ADDED
@@ -0,0 +1,396 @@
# Model Intelligence (MI) Metric — Implementation Plan

> Inspired by the **Michelangelo** paper: *Long Context Evaluations Beyond Haystacks via Latent Structure Queries* (arXiv:2409.12640, Google DeepMind, Sep 2024)

## Context

Users of cc-context-stats can see how much context remains, but have no indicator of **how well the model is likely performing** at the current context fill level. The Michelangelo paper demonstrates that LLM answer quality degrades predictably as context fills — with an initial sharp super-linear drop followed by linear/flat degradation. Different capabilities (reasoning, retrieval, self-awareness) degrade at different rates.

This plan introduces a **Model Intelligence (MI)** score: a [0, 1] heuristic that estimates answer quality using only data already available in the CSV state entries. It complements the existing Smart/Dumb/Wrap Up Zone indicators with a continuous, multi-dimensional score.
## Key Insights from the Paper

1. **Performance degrades with context length** — All models show significant falloff, often starting before 32K tokens. There is an initial sharp super-linear drop, followed by either flattening or continued linear degradation.

2. **Three orthogonal evaluation dimensions** (each measures different aspects):
   - **MRCR**: Understanding ordering, distinguishing similar content, reproducing context (scored via string similarity [0,1])
   - **Latent List**: Tracking a latent data structure through operations (scored via exact match + normalized error [0,1])
   - **IDK**: Knowing what the model doesn't know (scored via accuracy [0,1])

3. **Higher complexity = steeper degradation** — As task complexity increases, performance falls off faster with context length.

4. **Cross-over behavior** — Models better at short context can become worse at long context.

5. **Perplexity ≠ reasoning quality** — Low perplexity does not predict good reasoning at long contexts.
## Formula Design

### Guard Clause

If `context_window_size == 0` (old 2-field CSV entries, malformed data), return `MI = 1.0` with all sub-scores at defaults (CPS=1.0, ES=1.0, PS=0.5). This avoids division by zero and treats unknown-context entries optimistically.
### MI = 0.60 × CPS + 0.25 × ES + 0.15 × PS

Three sub-scores, each inspired by a Michelangelo evaluation dimension. Weights are **hardcoded constants** (not configurable) to minimize cross-implementation sync burden.

### 1. Context Pressure Score (CPS) — weight: 0.60

Maps to the paper's primary finding: performance degrades with context utilization.

```
u = current_used_tokens / context_window_size   (utilization ratio, 0 to 1+)
CPS = max(0, 1 - u^β)
```

Default `β = 1.5`, configurable via `mi_curve_beta`. This creates the super-linear initial drop observed in the paper:

| Utilization (u) | CPS (β=1.5) | CPS (β=1.0, linear) | CPS (β=2.0, quadratic) |
|---|---|---|---|
| 0.0 | 1.00 | 1.00 | 1.00 |
| 0.2 | 0.91 | 0.80 | 0.96 |
| 0.4 | 0.75 | 0.60 | 0.84 |
| 0.6 | 0.54 | 0.40 | 0.64 |
| 0.8 | 0.28 | 0.20 | 0.36 |
| 1.0 | 0.00 | 0.00 | 0.00 |

**Rationale:** β=1.5 reproduces the paper's observation that performance is good early, degrades significantly past ~50% utilization, and becomes severely impaired above ~80%.
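The CPS curve is small enough to sketch directly; a minimal Python version reproducing the β=1.5 column of the table (function name follows the Step 1 outline in this plan, exact signature assumed):

```python
def calculate_context_pressure(utilization: float, beta: float = 1.5) -> float:
    """CPS = max(0, 1 - u^beta), clamped so over-full contexts score 0."""
    return max(0.0, 1.0 - utilization ** beta)

# Spot-check against the beta=1.5 column of the table above
for u in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    print(u, round(calculate_context_pressure(u), 2))
```

Clamping matters because `current_used_tokens` can exceed the context window (the `0 to 1+` range above), which would otherwise push CPS negative.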
### 2. Efficiency Score (ES) — weight: 0.25

Proxies context utilization quality (analogous to MRCR — is the model effectively re-using prior context?).

```
total_context = entry.current_used_tokens   # = current_input + cache_creation + cache_read

if total_context == 0:
    ES = 1.0                                # No context yet
else:
    cache_hit_ratio = cache_read / total_context
    ES = 0.3 + 0.7 × cache_hit_ratio        # [0.3, 1.0]
```

**Note:** `total_context` is the same as `StateEntry.current_used_tokens` (state.py:132). Reuse the existing property in the package; compute inline in standalone scripts.

- Floor of 0.3 prevents penalizing early-session entries (no cache available yet)
- Full cache-read → ES=1.0

**Rationale:** High cache-read ratio indicates the model is successfully re-using previously cached context rather than re-processing, suggesting better context utilization.
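The ES branch in runnable form, as a sketch with a flattened signature (the package version takes a `StateEntry`; raw integers are used here for illustration):

```python
def calculate_efficiency(cache_read: int, total_context: int) -> float:
    """ES = 0.3 + 0.7 * cache_hit_ratio, or 1.0 when no context exists yet."""
    if total_context == 0:
        return 1.0  # no context yet: optimistic default
    return 0.3 + 0.7 * (cache_read / total_context)

# 12,000 cached of 20,000 total = 60% hit ratio
print(round(calculate_efficiency(12000, 20000), 2))  # 0.72
```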
### 3. Productivity Score (PS) — weight: 0.15

Proxies output quality (analogous to Latent List/IDK — is the model producing meaningful, actionable output?).

**Deltas are computed as consecutive entry differences** (not cumulative totals):

```
delta_lines_added    = current.lines_added - previous.lines_added
delta_lines_removed  = current.lines_removed - previous.lines_removed
delta_output_tokens  = current.total_output_tokens - previous.total_output_tokens
```

```
if no previous entry OR delta_output_tokens <= 0:
    PS = 0.5                                # Neutral
else:
    delta_lines = delta_lines_added + delta_lines_removed
    ratio = delta_lines / delta_output_tokens
    normalized = min(1.0, ratio / target)   # target: 0.2 (hardcoded)
    PS = 0.2 + 0.8 × normalized             # [0.2, 1.0]
```

- Target: 0.2 lines/token (1 line per 5 output tokens) = perfect score, **hardcoded** (not configurable)
- Floor of 0.2 for explanation-heavy sessions (still valid work)
- Lowest weight (0.15) because it's the noisiest proxy

**Rationale:** When the model produces concrete code changes relative to token expenditure, it is likely giving focused, relevant answers. When it produces many tokens with no code changes, it may be struggling.
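The PS branch in runnable form, again with a flattened signature for illustration (the package version takes two `StateEntry` objects and diffs them itself):

```python
def calculate_productivity(delta_lines, delta_output_tokens,
                           has_previous=True, target=0.2):
    """PS in [0.2, 1.0]; neutral 0.5 when no previous entry or no new output."""
    if not has_previous or delta_output_tokens <= 0:
        return 0.5
    normalized = min(1.0, (delta_lines / delta_output_tokens) / target)
    return 0.2 + 0.8 * normalized

# 150 lines over 1000 new output tokens = 0.15 lines/token
print(round(calculate_productivity(150, 1000), 2))  # 0.8
```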
### Color Thresholds (hardcoded)

| MI Range | Color | Label | Interpretation |
|---|---|---|---|
| > 0.65 | Green | High Intelligence | Model is operating well |
| 0.35–0.65 | Yellow | Degraded | Context pressure affecting quality |
| < 0.35 | Red | Critical | Severely degraded, consider new session |
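A sketch of the threshold mapping. Note the table places both endpoints (exactly 0.65 and exactly 0.35) in the yellow band, so the comparison operators below are an assumption consistent with that reading:

```python
MI_GREEN_THRESHOLD = 0.65
MI_YELLOW_THRESHOLD = 0.35

def get_mi_color(mi: float) -> str:
    """Map an MI score to a statusline color name per the table above."""
    if mi > MI_GREEN_THRESHOLD:
        return "green"
    if mi >= MI_YELLOW_THRESHOLD:
        return "yellow"
    return "red"

print(get_mi_color(0.88), get_mi_color(0.62), get_mi_color(0.30))
```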
### Example Calculations

**Fresh session** (u=0.1, 60% cache, 150 lines/1000 tokens):

- CPS = 1 - 0.1^1.5 = 0.968
- ES = 0.3 + 0.7 × 0.6 = 0.72
- PS = 0.2 + 0.8 × min(1, 0.15/0.2) = 0.80
- **MI = 0.60×0.968 + 0.25×0.72 + 0.15×0.80 = 0.581 + 0.180 + 0.120 = 0.88**

**Mid-session** (u=0.5, 40% cache, 100 lines/1000 tokens):

- CPS = 1 - 0.5^1.5 = 0.646
- ES = 0.3 + 0.7 × 0.4 = 0.58
- PS = 0.2 + 0.8 × min(1, 0.1/0.2) = 0.60
- **MI = 0.60×0.646 + 0.25×0.58 + 0.15×0.60 = 0.388 + 0.145 + 0.090 = 0.62**

**Late session** (u=0.85, 20% cache, 50 lines/1000 tokens):

- CPS = 1 - 0.85^1.5 = 0.217
- ES = 0.3 + 0.7 × 0.2 = 0.44
- PS = 0.2 + 0.8 × min(1, 0.05/0.2) = 0.40
- **MI = 0.60×0.217 + 0.25×0.44 + 0.15×0.40 = 0.130 + 0.110 + 0.060 = 0.30**
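The three worked examples can be reproduced mechanically. A short script combining the sub-formulas (the helper name `mi_from_ratios` is illustrative, not from the codebase):

```python
def mi_from_ratios(u, cache_ratio, lines_per_token, beta=1.5, target=0.2):
    """Composite MI from already-reduced ratios; weights hardcoded as above."""
    cps = max(0.0, 1.0 - u ** beta)
    es = 0.3 + 0.7 * cache_ratio
    ps = 0.2 + 0.8 * min(1.0, lines_per_token / target)
    return 0.60 * cps + 0.25 * es + 0.15 * ps

for label, args in [("fresh", (0.1, 0.6, 0.15)),
                    ("mid", (0.5, 0.4, 0.10)),
                    ("late", (0.85, 0.2, 0.05))]:
    print(label, round(mi_from_ratios(*args), 2))
```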
## Implementation Steps

### Step 1: Create `src/claude_statusline/graphs/intelligence.py`

Core MI computation module:

- `IntelligenceConfig` dataclass — beta only (weights, thresholds, productivity_target are hardcoded constants)
- `IntelligenceScore` dataclass — cps, es, ps, mi, utilization (all floats)
- `calculate_context_pressure(utilization, beta) → float`
- `calculate_efficiency(entry: StateEntry) → float`
- `calculate_productivity(current: StateEntry, previous: StateEntry | None) → float` — uses **consecutive entry diffs** for delta_lines and delta_output_tokens
- `calculate_intelligence(current: StateEntry, previous: StateEntry | None, context_window_size: int, beta?) → IntelligenceScore`
- **Guard clause:** if `context_window_size == 0`, return `IntelligenceScore(cps=1.0, es=1.0, ps=0.5, mi=1.0, utilization=0.0)`
- `get_mi_color(mi) → str` — returns "green"/"yellow"/"red" using hardcoded thresholds (0.65/0.35)
- `format_mi_score(mi) → str` — returns "0.82"

Constants (hardcoded, not configurable):

```python
MI_WEIGHT_CPS = 0.60
MI_WEIGHT_ES = 0.25
MI_WEIGHT_PS = 0.15
MI_GREEN_THRESHOLD = 0.65
MI_YELLOW_THRESHOLD = 0.35
MI_PRODUCTIVITY_TARGET = 0.2
```
### Step 2: Create `tests/python/test_intelligence.py`

Unit tests for all functions:

- **CPS**: empty/full/half context, custom beta, clamping at 0
- **CPS guard**: `context_window_size == 0` returns MI=1.0 with defaults
- **ES**: no tokens, all cache read, no cache, mixed cache
- **PS**: no previous entry, no output, high/zero/moderate productivity, capping
- **PS deltas**: verify consecutive entry diff computation (not cumulative)
- **Composite**: optimal/worst conditions, weight validation, bounds check
- **Color**: green/yellow/red thresholds, boundary values
- **Format**: two decimals, zero, one, rounding
- **Statusline integration**: `show_mi=true` + `show_delta=false` produces MI output without delta display
### Step 2b: Create `tests/fixtures/mi_test_vectors.json`

Shared test vectors for cross-implementation parity:

```json
[
  {
    "description": "Fresh session",
    "input": { "current_used": 20000, "context_window": 200000, "cache_read": 12000, "current_input": 5000, "cache_creation": 3000, "prev_lines_added": 0, "prev_lines_removed": 0, "cur_lines_added": 150, "cur_lines_removed": 10, "prev_output": 0, "cur_output": 1000, "beta": 1.5 },
    "expected": { "cps": 0.968, "es": 0.72, "ps": 0.84, "mi": 0.887 }
  }
]
```

5-6 vectors covering: fresh session, mid-session, late session, no previous entry, context_window=0, no cache. Both Python and Node.js test suites read this file and assert results within ±0.01 tolerance.
### Step 3: Modify `src/claude_statusline/core/config.py`

Add to `Config` dataclass (only 2 new fields):

- `show_mi: bool = True`
- `mi_curve_beta: float = 1.5`

All other MI parameters (weights, thresholds, productivity_target) are **hardcoded constants** in `intelligence.py`, not configurable. This minimizes cross-implementation sync burden (2 config branches vs 8).

Add parsing in `_read_config()`:

- `show_mi`: boolean, same pattern as existing keys (`value_lower != "false"`)
- `mi_curve_beta`: float, inline `try/except` — `try: self.mi_curve_beta = float(raw_value) except ValueError: pass`

Add MI section to `_create_default()` config template:

```ini
# Model Intelligence (MI) score display
show_mi=true

# MI degradation curve shape (higher = steeper initial drop)
# mi_curve_beta=1.5
```

Update `to_dict()` with both new fields.
### Step 4: Modify `src/claude_statusline/cli/statusline.py`

**Key change:** Decouple state file reads from `show_delta`. Currently, the previous entry is only read inside `if config.show_delta:`. With MI, the previous entry must be read whenever `show_mi` OR `show_delta` is enabled.

Restructure the state file logic:

```python
# Read previous entry if needed for delta OR MI
if config.show_delta or config.show_mi:
    state_file = StateFile(session_id)
    prev_entry = state_file.read_last_entry()
    # ... build current entry, append if changed ...

if config.show_delta:
    # ... existing delta_info logic ...

if config.show_mi:
    from claude_statusline.graphs.intelligence import calculate_intelligence, get_mi_color, format_mi_score
    mi_score = calculate_intelligence(entry, prev_entry, total_size, config.mi_curve_beta)
    mi_color_name = get_mi_color(mi_score.mi)
    mi_color = getattr(colors, mi_color_name)
    mi_info = f" {mi_color}MI:{format_mi_score(mi_score.mi)}{colors.reset}"
```

Add `mi_info` to the output parts list (between `delta_info` and `ac_info`). Color-code using `get_mi_color()`.
### Step 5: Modify `src/claude_statusline/graphs/renderer.py`

**Two changes:**

**5a.** Add optional `label_fn: Callable[[int], str] | None = None` parameter to `render_timeseries()`. When provided, use it instead of `format_tokens()` for Y-axis labels. This allows the MI graph to display `0.62` instead of `620` when data is scaled ×1000. Default behavior (existing graphs) is unchanged.

**5b.** In `render_summary()` (~line 318, after zone indicator), add MI score display:

```
Model Intelligence:  0.62  (Context pressure is degrading answer quality)
  CPS: 0.54  ES: 0.72  PS: 0.60
```

`render_summary()` accepts an optional `IntelligenceScore` parameter.
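The scale-by-1000 trick and the `label_fn` round-trip, sketched on their own (the renderer call itself is omitted; only the label mapping is shown):

```python
# MI scores are floats in [0, 1]; the renderer works on integers,
# so values are scaled x1000 and labels are mapped back to decimals.
scores = [0.88, 0.62, 0.30]
scaled = [int(round(mi * 1000)) for mi in scores]   # [880, 620, 300]
label_fn = lambda v: f"{v / 1000:.2f}"              # 620 -> "0.62"
print([label_fn(v) for v in scaled])
```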
### Step 6: Modify `src/claude_statusline/cli/context_stats.py`

- Add `"mi"` to `--type` choices
- In `render_once()`: compute MI scores for each entry pair and render as timeseries graph (scale ×1000 for integer renderer)
- Use `label_fn=lambda v: f"{v/1000:.2f}"` when calling `render_timeseries()` for the MI graph
- Include MI in `"all"` graph type
- Pass MI score to `render_summary()`
### Step 7: Modify `scripts/statusline.py` (standalone)

Add a single self-contained `compute_mi()` function (not 4 separate functions) that takes raw values and returns `(mi, cps, es, ps)`. This reduces the sync surface from 4 functions to 1.

```python
def compute_mi(used_tokens, context_window_size, cache_read, total_context,
               delta_lines, delta_output, beta=1.5):
    """Compute Model Intelligence score. Returns (mi, cps, es, ps)."""
    ...
```

Add `show_mi` and `mi_curve_beta` config parsing (2 keys, matching the package). Add `MI:X.XX` to output. Decouple state file read from `show_delta` (same restructuring as Step 4).
### Step 8: Modify `scripts/statusline.js` (standalone Node.js)

Port MI formula as a single `computeMI()` function (same signature as Python). Add `show_mi` and `mi_curve_beta` config parsing. Add `MI:X.XX` to output. Decouple state file read from `showDelta`.

```javascript
function computeMI(usedTokens, contextWindowSize, cacheRead, totalContext,
                   deltaLines, deltaOutput, beta = 1.5) {
  // Returns { mi, cps, es, ps }
}
```
### Step 9: Add Node.js tests and shared test vectors

- Create `tests/fixtures/mi_test_vectors.json` with 5-6 test vectors
- Port core MI formula tests to `tests/node/intelligence.test.js`, reading from shared vectors
- Update `tests/python/test_intelligence.py` to also read from shared vectors
- Ensures cross-implementation parity within ±0.01 tolerance
## Critical Files

| File | Action | Description |
|---|---|---|
| `src/claude_statusline/graphs/intelligence.py` | **Create** | Core MI computation module (hardcoded constants + configurable beta) |
| `tests/python/test_intelligence.py` | **Create** | Unit tests for MI module (incl. guard clause, integration tests) |
| `tests/fixtures/mi_test_vectors.json` | **Create** | Shared test vectors for cross-implementation parity |
| `src/claude_statusline/core/config.py` | **Modify** | Add `show_mi` (bool) and `mi_curve_beta` (float) only |
| `src/claude_statusline/cli/statusline.py` | **Modify** | Add MI score; decouple state read from `show_delta` |
| `src/claude_statusline/cli/context_stats.py` | **Modify** | Add `--type mi` graph option |
| `src/claude_statusline/graphs/renderer.py` | **Modify** | Add `label_fn` param to `render_timeseries()`; add MI to summary |
| `scripts/statusline.py` | **Modify** | Single `compute_mi()` function; decouple state read |
| `scripts/statusline.js` | **Modify** | Single `computeMI()` function; decouple state read |
| `tests/node/intelligence.test.js` | **Create** | Node.js MI tests using shared vectors |
## Existing Utilities to Reuse

- `StateEntry.current_used_tokens` property (`core/state.py:132`) — already computes `current_input_tokens + cache_creation + cache_read` (used for both CPS utilization and ES total_context)
- `ColorManager` (`core/colors.py`) — for MI color coding
- `fit_to_width()` (`formatters/layout.py`) — for statusline width management
- `Config.load()` pattern (`core/config.py`) — extended with 2 new fields
- `GraphRenderer.render_timeseries()` (`graphs/renderer.py`) — for MI graph (extended with `label_fn` parameter)

**Not used**: `format_tokens()` — MI uses its own `f"{mi:.2f}"` format. `calculate_deltas()` — MI computes its own consecutive entry diffs internally.
## Configuration Options

Only 2 config keys are exposed (weights, thresholds, and productivity target are hardcoded constants to minimize cross-implementation sync burden):

```ini
# Model Intelligence (MI) score display
# Shows a heuristic quality score based on context utilization
show_mi=true

# MI degradation curve shape (higher = steeper initial drop)
# Based on Michelangelo paper's observed performance degradation
# mi_curve_beta=1.5
```

Hardcoded constants (in `intelligence.py`, `compute_mi()`, and `computeMI()`):

- Weights: CPS=0.60, ES=0.25, PS=0.15
- Thresholds: green > 0.65, yellow > 0.35, red below
- Productivity target: 0.2 lines/token
## Display Integration

### Statusline output

```text
[Opus] myproject | main [3] | 75k free (37.5%) [+2,500] MI:0.62 [AC:45k] abc123
```

### Context Stats CLI summary

```text
Session Summary
────────────────────────────────────────────────
Context Remaining:   75,000/200,000 (37%)
>>> DUMB ZONE <<< (You are in the dumb zone - Dex Horthy says so)
Model Intelligence:  0.62  (Context pressure is degrading answer quality)
  CPS: 0.54  ES: 0.72  PS: 0.60
```

### Context Stats MI graph (`--type mi`)

ASCII timeseries graph of MI score over time, showing the degradation trajectory.
## Verification

1. **Unit tests**: `pytest tests/python/test_intelligence.py -v`
2. **All Python tests**: `source venv/bin/activate && pytest tests/python/ -v`
3. **Node.js tests**: `npm test` (includes `intelligence.test.js` with shared vectors)
4. **Cross-implementation parity**: Both Python and Node.js test suites read `tests/fixtures/mi_test_vectors.json` and assert results within ±0.01
5. **Manual statusline test**: Pipe JSON to statusline and verify `MI:X.XX` appears
6. **Manual statusline test (decoupled)**: Set `show_mi=true` + `show_delta=false` and verify MI appears without delta
7. **Manual context-stats test**: `context-stats --type mi --no-watch` — verify MI graph renders with decimal Y-axis labels
## Known Limitations

1. **Productivity Score is noisy for non-coding sessions** — Research/planning sessions have low PS even with high-quality answers. Mitigation: PS has lowest weight (0.15) and floor of 0.2.

2. **Cache hit ratio reflects API behavior, not reasoning quality** — Cache management is infrastructure, not model intelligence. Mitigation: ES weight is moderate (0.25) and presented as a heuristic proxy.

3. **Degradation curve is not calibrated to specific models** — The paper found different curves per model family. Mitigation: `mi_curve_beta` is configurable.

4. **Integer graph renderer** — MI scores (floats [0,1]) are scaled to [0, 1000] for the integer-based renderer, with Y-axis labels formatted as decimals via `label_fn`.

5. **MI adds file I/O when show_delta=false** — When `show_mi=true` and `show_delta=false`, the statusline reads the previous entry for PS calculation. Users who need minimal I/O can set `show_mi=false`.
## Review Decisions Log

Decisions made during engineering review (2026-03-14):

| # | Decision | Resolution |
|---|---|---|
| 1 | PS delta definition | Consecutive entry diffs (not cumulative totals) |
| 2 | CPS division by zero | Guard clause: return MI=1.0 when context_window_size=0 |
| 3 | MI vs show_delta coupling | Decoupled — MI reads prev entry independently |
| 4 | Config surface area | Only `show_mi` + `mi_curve_beta` (hardcode the rest) |
| 5 | MI formula DRY | Single `compute_mi()` / `computeMI()` in standalone scripts |
| 6 | Module location | Keep in `graphs/` next to `statistics.py` |
| 7 | Float config parsing | Inline try/except for `mi_curve_beta` |
| 8 | MI graph Y-axis | Add `label_fn` parameter to `render_timeseries()` |
| 9 | Cross-impl parity | Shared `tests/fixtures/mi_test_vectors.json` |
## TODOs (deferred)

- **Shared test vectors**: Create `tests/fixtures/mi_test_vectors.json` with 5-6 vectors for cross-implementation parity (agreed during review, to be built during implementation)
- **MI trend indicators**: Show `MI:0.62↓` or `MI:0.82↑` in statusline based on comparison with previous MI score (deferred — adds cross-impl sync burden)
- **Per-model beta calibration**: Map known model IDs to empirically tuned beta values (deferred — requires empirical data we don't have yet)
package/docs/context-stats.md
CHANGED
@@ -79,7 +79,7 @@ Session Summary
 Output Tokens: 43,429
 Session Duration: 2h 29m
 
-Powered by cc-context-stats v1.
+Powered by cc-context-stats v1.8.0 - https://github.com/luongnv89/cc-context-stats
 ```
 
 ## Features
package/package.json
CHANGED
package/pyproject.toml
CHANGED
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "cc-context-stats"
-version = "1.
+version = "1.8.0"
 description = "Monitor your Claude Code session context in real-time - track token usage and never run out of context"
 readme = "README.md"
 license = { text = "MIT" }