@bilalimamoglu/sift 0.3.1 → 0.3.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,436 +1,245 @@
  # sift
 
- <img src="assets/brand/sift-logo-badge-monochrome.svg" alt="sift logo" width="88" />
+ [![npm version](https://img.shields.io/npm/v/@bilalimamoglu/sift)](https://www.npmjs.com/package/@bilalimamoglu/sift)
+ [![license](https://img.shields.io/github/license/bilalimamoglu/sift)](LICENSE)
+ [![CI](https://img.shields.io/github/actions/workflow/status/bilalimamoglu/sift/ci.yml?branch=main&label=CI)](https://github.com/bilalimamoglu/sift/actions/workflows/ci.yml)
 
- `sift` turns a long terminal wall of text into a short answer you can act on.
+ <img src="assets/brand/sift-logo-minimal-teal-default.svg" alt="sift logo" width="140" />
 
- Think of it like this:
- - `standard` = map
- - `focused` or `rerun --remaining` = zoom
- - raw traceback = last resort
-
- It is a good fit when a human, agent, or CI job needs the answer faster than it needs the whole log.
-
- Common uses:
- - test failures
- - typecheck failures
- - lint failures
- - build logs
- - `git diff`
- - `npm audit`
- - `terraform plan`
+ Your AI agent should not be reading 13,000 lines of test output.
 
- Do not use it when:
- - the exact raw log is the main thing you need
- - the command is interactive or TUI-based
- - shell behavior depends on exact raw command output
-
- ## Install
-
- Requires Node.js 20 or later.
+ **Before:** 128 failures, 198K tokens, 16 tool calls, agent reconstructs the failure shape from scratch.
+ **After:** 6 lines, 129 tokens, 4 tool calls, agent acts on a grouped diagnosis immediately.
 
  ```bash
- npm install -g @bilalimamoglu/sift
- ```
-
- ## One-time setup
-
- The easiest setup path is:
-
- ```bash
- sift config setup
+ sift exec --preset test-status -- pytest -q
  ```
 
- That writes a machine-wide config to:
-
  ```text
- ~/.config/sift/config.yaml
+ - Tests did not pass.
+ - 3 tests failed. 125 errors occurred.
+ - Shared blocker: 125 errors share the same root cause - a missing test environment variable.
+   Anchor: tests/conftest.py
+   Fix: Set the required env var before rerunning DB-isolated tests.
+ - Contract drift: 3 snapshot tests are out of sync with the current API or model state.
+   Anchor: tests/contracts/test_feature_manifest_freeze.py
+   Fix: Regenerate the snapshots if the changes are intentional.
+ - Decision: stop and act.
  ```
 
- After that, any terminal on the machine can use `sift`. A repo-local config can still override it later.
+ If 125 tests fail for one reason, the agent should pay for that reason once.
 
- To switch between saved native providers without editing YAML:
+ ## Who is this for
 
- ```bash
- sift config use openai
- sift config use openrouter
- ```
+ Developers using coding agents — Claude Code, Codex, Cursor, Windsurf, Copilot, or any LLM-driven workflow that runs shell commands and reads the output.
+
+ `sift` sits between the command and the agent. It captures noisy output, groups repeated failures into root-cause buckets, and returns a short diagnosis with an anchor, a likely fix, and a decision signal. The agent gets a map instead of a wall of text.
 
- If you prefer manual setup, this is the smallest useful OpenAI setup:
+ ## Install
 
  ```bash
- export SIFT_PROVIDER=openai
- export SIFT_BASE_URL=https://api.openai.com/v1
- export SIFT_MODEL=gpt-5-nano
- export OPENAI_API_KEY=your_openai_api_key
+ npm install -g @bilalimamoglu/sift
  ```
 
- If you prefer manual setup, this is the smallest useful OpenRouter setup:
+ Requires Node.js 20+.
 
- ```bash
- export SIFT_PROVIDER=openrouter
- export OPENROUTER_API_KEY=your_openrouter_api_key
- ```
+ ## Quick start
 
- Then check it:
+ Guided setup writes a machine-wide config and verifies the provider:
 
  ```bash
+ sift config setup
  sift doctor
  ```
 
- ## Start here
-
- The default path is simple:
- 1. run the noisy command through `sift`
- 2. read the short `standard` answer first
- 3. only zoom in if `standard` clearly tells you more detail is still worth it
+ Config lives at `~/.config/sift/config.yaml`. A repo-local `sift.config.yaml` can override it later.
 
- Examples:
+ Then run noisy commands through `sift`:
 
  ```bash
+ sift exec --preset test-status -- <test command>
  sift exec "what changed?" -- git diff
- sift exec --preset test-status -- pytest -q
- sift rerun
- sift rerun --remaining --detail focused
- sift rerun --remaining --detail verbose --show-raw
- sift watch "what changed between cycles?" < watcher-output.txt
- sift exec --watch "what changed between cycles?" -- node watcher.js
- sift exec --preset typecheck-summary -- npm run typecheck
- sift exec --preset lint-failures -- eslint .
  sift exec --preset audit-critical -- npm audit
  sift exec --preset infra-risk -- terraform plan
- sift agent install codex --dry-run
  ```
 
- ## Simple workflow
+ Useful flags:
+ - `--dry-run` to preview the reduced input and prompt without calling a provider
+ - `--show-raw` to print captured raw output to `stderr`
+ - `--fail-on` to let reduced results fail CI for commands such as `npm audit` or `terraform plan`
 
- For most repos, this is the whole story:
+ If you prefer environment variables instead of setup:
 
  ```bash
- sift exec --preset test-status -- <test command>
- sift rerun
- sift rerun --remaining --detail focused
- ```
-
- Mental model:
- - `sift escalate` = same cached output, deeper render
- - `sift rerun` = rerun the cached full command at `standard` and prepend what resolved, remained, or changed
- - `sift rerun --remaining` = rerun only the remaining failing pytest node IDs for a zoomed-in view
- - `sift watch` / `sift exec --watch` = treat redraw-style output as cycles and summarize what changed
- - `Decision: stop and act` = trust the current diagnosis and go read or fix code
- - `Decision: zoom` = one deeper sift pass is justified before raw
- - `Decision: raw only if exact traceback is required` = raw is last resort, not the next default step
-
- If your project uses `pytest`, `vitest`, `jest`, `bun test`, or another test runner instead of `npm test`, use the same preset with that command.
-
- What `sift` does in `exec` mode:
- 1. runs the child command
- 2. captures `stdout` and `stderr`
- 3. keeps the useful signal
- 4. returns a short answer or JSON
- 5. preserves the child command exit code
-
- Useful debug flags:
- - `--dry-run`: show the reduced input and prompt without calling the provider
- - `--show-raw`: print the captured raw input to `stderr`
-
- ## When tests fail
+ # OpenAI
+ export SIFT_PROVIDER=openai
+ export SIFT_BASE_URL=https://api.openai.com/v1
+ export SIFT_MODEL=gpt-5-nano
+ export OPENAI_API_KEY=your_openai_api_key
 
- Start with the map:
+ # OpenRouter
+ export SIFT_PROVIDER=openrouter
+ export OPENROUTER_API_KEY=your_openrouter_api_key
 
- ```bash
- sift exec --preset test-status -- <test command>
+ # Any OpenAI-compatible endpoint
+ export SIFT_PROVIDER=openai-compatible
+ export SIFT_BASE_URL=https://your-endpoint/v1
+ export SIFT_PROVIDER_API_KEY=your_api_key
  ```
 
- If `standard` already names the main failure buckets, counts, and hints, stop there and read code.
+ ## How it works
 
- If `standard` still includes an unknown bucket or ends with `Decision: zoom`, do one deeper sift pass before you fall back to raw traceback.
+ `sift` follows a cheapest-first pipeline:
 
- Then use this order:
- 1. `sift exec --preset test-status -- <test command>`
- 2. `sift rerun`
- 3. `sift rerun --remaining --detail focused`
- 4. `sift rerun --remaining --detail verbose`
- 5. `sift rerun --remaining --detail verbose --show-raw`
- 6. raw pytest only if exact traceback lines are still needed
+ 1. Capture command output.
+ 2. Sanitize sensitive-looking material.
+ 3. Apply local heuristics for known failure shapes.
+ 4. Escalate to a cheaper provider only if needed.
+ 5. Return a short diagnosis to the main agent.
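The cheapest-first loop added above can be sketched as follows. This is an illustrative sketch only, not sift's internals; `sanitize`, `run_heuristics`, and `call_provider` are hypothetical names:

```python
import re

def sanitize(output: str) -> str:
    # Step 2: mask sensitive-looking material before anything else sees it.
    return re.sub(r"(?i)(api[_-]?key\s*=\s*)\S+", r"\1***", output)

def run_heuristics(output: str):
    # Step 3: recognize a known failure shape locally; return None if unsure.
    missing = re.findall(r"ModuleNotFoundError: No module named '([^']+)'", output)
    if missing:
        return f"Import/dependency blocker: {len(missing)} missing module(s)."
    return None

def call_provider(output: str) -> str:
    # Step 4: stand-in for the LLM call a real provider escalation would make.
    return "provider summary of: " + output[:40]

def diagnose(output: str) -> str:
    # Steps 1-5: capture -> sanitize -> heuristics -> escalate -> diagnosis.
    clean = sanitize(output)
    return run_heuristics(clean) or call_provider(clean)
```

When the heuristic recognizes the failure shape, the provider stand-in is never reached, which is the property the pipeline is built around.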
 
- The normal stop budget is `standard` first, then at most one zoom step before raw.
+ The core abstraction is a **bucket** — one distinct root cause, no matter how many tests it affects. Instead of making an agent reason over 125 repeated tracebacks, `sift` compresses them into one actionable bucket with a label, an affected count, an anchor, and a likely fix.
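As a rough illustration of that grouping (a sketch under assumed field names, not sift's actual data model):

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Bucket:
    label: str   # one distinct root cause, e.g. "missing test env var"
    count: int   # how many failures that cause produced
    anchor: str  # first file worth reading
    fix: str     # likely fix

def group_failures(failures: list[tuple[str, str]]) -> list[Bucket]:
    """Collapse (path, reason) pairs into one Bucket per distinct reason."""
    counts = Counter(reason for _, reason in failures)
    anchors: dict[str, str] = {}
    for path, reason in failures:
        anchors.setdefault(reason, path)  # keep the first path seen as the anchor
    return [Bucket(reason, n, anchors[reason], "") for reason, n in counts.items()]
```

125 failures with one reason and 3 with another collapse into exactly two buckets, which is why the agent pays for each root cause once.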
 
- If you want the older explicit compare shape, `sift exec --preset test-status --diff -- <test command>` still works. `sift rerun` is the shorter normal path for the same idea.
+ It also returns a decision signal:
+ - `stop and act` when the diagnosis is already actionable
+ - `zoom` when one deeper pass is justified
+ - raw logs only as a last resort
 
- ## Diagnose JSON
+ The deepest local coverage today is test debugging, especially `pytest`, with growing support for `vitest` and `jest`.
 
- Most of the time, you do not need JSON. Start with text first.
+ ## Built-in presets
 
- If `standard` already shows bucket-level root cause, `Anchor`, and `Fix`, do not re-verify the same bucket with raw pytest. At most do one targeted source read before you edit.
+ Every preset runs local heuristics first. When the heuristic confidently handles the output, the provider is never called: zero tokens, zero latency, fully deterministic.
 
- If diagnose output still contains an unknown bucket or `Decision: zoom`, take one sift zoom step before raw traceback.
+ | Preset | Heuristic | What it does |
+ |--------|-----------|-------------|
+ | `test-status` | Deep | Bucket/anchor/decision system for pytest, vitest, jest. 30+ failure patterns, confidence-gated stop/zoom decisions. |
+ | `typecheck-summary` | Deterministic | Parses `tsc` output (standard and pretty formats), groups by error code, returns max 5 bullets. |
+ | `lint-failures` | Deterministic | Parses ESLint stylish output, groups by rule, distinguishes errors from warnings, detects fixable hints. |
+ | `audit-critical` | Deterministic | Extracts high/critical vulnerabilities from `npm audit` or similar. |
+ | `infra-risk` | Deterministic | Detects destructive signals in `terraform plan` output. Returns pass/fail verdict. |
+ | `build-failure` | Deterministic-first | Extracts the first concrete build error for recognized webpack, esbuild/Vite, Cargo, Go, GCC/Clang, and `tsc --build` output; falls back to the provider for unsupported formats. |
+ | `diff-summary` | Provider | Summarizes changes and risks in diff output. |
+ | `log-errors` | Provider | Extracts top error signals from log output. |
 
- Use diagnose JSON only when automation or machine branching really needs it:
+ Presets marked **Deterministic** bypass the provider entirely for recognized output formats. Presets marked **Deterministic-first** try a local heuristic first and fall back to the provider only when the captured output is unsupported or ambiguous. Presets marked **Provider** always call the LLM but benefit from input sanitization and truncation.
 
  ```bash
- sift exec --preset test-status --goal diagnose --format json -- pytest -q
- sift rerun --goal diagnose --format json
- sift watch --preset test-status --goal diagnose --format json < pytest-watch.txt
+ sift exec --preset typecheck-summary -- npx tsc --noEmit
+ sift exec --preset lint-failures -- npx eslint src/
+ sift exec --preset build-failure -- npm run build
+ sift exec --preset audit-critical -- npm audit
+ sift exec --preset infra-risk -- terraform plan
  ```
 
- Default diagnose JSON is summary-first:
- - `remaining_summary` and `resolved_summary` keep the answer small
- - `read_targets` points to the first file or line worth reading
- - `read_targets.context_hint` can tell an agent to read only a small line window first
- - if `context_hint` only includes `search_hint`, search for that string before reading the whole file
- - `remaining_subset_available` tells you whether `sift rerun --remaining` can zoom safely
+ On an interactive terminal, `sift` also shows a small stderr footer so humans can see whether the provider was skipped:
 
- If an agent truly needs every raw failing test ID, opt in:
-
- ```bash
- sift exec --preset test-status --goal diagnose --format json --include-test-ids -- pytest -q
+ ```text
+ [sift: heuristic • LLM skipped • summary 47ms]
+ [sift: provider • LLM used • 380 tokens • summary 1.2s]
  ```
 
- `--goal diagnose --format json` is currently supported only for `test-status`, `rerun`, and `test-status` watch flows.
-
- ## Watch mode
-
- Use watch mode when command output redraws or repeats and you care about cycle-to-cycle change summaries more than the raw stream:
+ Suppress the footer with `--quiet`:
 
  ```bash
- sift watch "what changed between cycles?" < watcher-output.txt
- sift exec --watch "what changed between cycles?" -- node watcher.js
- sift exec --watch --preset test-status -- pytest -f
+ sift exec --preset typecheck-summary --quiet -- npx tsc --noEmit
  ```
 
- `sift watch` keeps the current summary and change summary together:
- - cycle 1 = current state
- - later cycles = what changed, what resolved, what stayed, and the next best action
- - for `test-status`, resolved tests drop out and remaining failures stay in focus
-
- If the stream clearly looks like a redraw/watch session, `sift` can auto-switch to watch handling and prints a short stderr note when it does.
+ ## Test debugging workflow
 
- ## `test-status` detail modes
+ This is where `sift` is strongest today.
 
- If you are running `npm test` and want `sift` to check the result, use `--preset test-status`.
-
- `test-status` becomes test-aware because you chose the preset. It does **not** infer “this is a test command” from `pytest`, `vitest`, `npm test`, or any other runner name.
-
- Available detail levels:
-
- - `standard`
-   - short default summary
-   - no file list
- - `focused`
-   - groups failures by error type
-   - shows a few representative failing tests or modules
- - `verbose`
-   - flat list of visible failing tests or modules and their normalized reason
-   - useful when Codex needs to know exactly what to fix first
-
- Examples:
-
- ```bash
- sift exec --preset test-status -- npm test
- sift rerun
- sift rerun --remaining --detail focused
- sift rerun --remaining --detail verbose
- sift rerun --remaining --detail verbose --show-raw
- ```
+ Think of it like this:
+ - `standard` = map
+ - `focused` = zoom
+ - raw traceback = last resort
 
- If you use a different runner, swap in your command:
+ Typical loop:
 
  ```bash
- sift exec --preset test-status -- pytest
+ sift exec --preset test-status -- <test command>
  sift rerun
  sift rerun --remaining --detail focused
- sift rerun --remaining --detail verbose --show-raw
  ```
 
- `sift rerun --remaining` currently supports only cached argv-mode `pytest ...` or `python -m pytest ...` runs. If the cached command is not subset-capable, run a narrowed pytest command manually with `sift exec --preset test-status -- <narrowed pytest command>`.
-
- Typical shapes:
+ If `standard` already gives you the root cause, anchor, and fix, stop there and act.
 
- `standard`
- ```text
- - Tests did not complete.
- - 114 errors occurred during collection.
- - Import/dependency blocker: repeated collection failures are caused by missing dependencies.
- - Anchor: path/to/failing_test.py
- - Fix: Install the missing dependencies and rerun the affected tests.
- - Decision: stop and act. Do not escalate unless you need exact traceback lines.
- - Next: Fix bucket 1 first, then rerun the full suite at standard.
- - Stop signal: diagnosis complete; raw not needed.
- ```
-
- `standard` can also separate more than one failure family in a single pass:
- ```text
- - Tests did not pass.
- - 3 tests failed. 124 errors occurred.
- - Shared blocker: DB-isolated tests are missing a required test env var.
- - Anchor: search <TEST_ENV_VAR> in path/to/test_setup.py
- - Fix: Set the required test env var and rerun the suite.
- - Contract drift: snapshot expectations are out of sync with the current API or model state.
- - Anchor: search <route-or-entity> in path/to/freeze_test.py
- - Fix: Review the drift and regenerate the snapshots if the change is intentional.
- - Decision: stop and act. Do not escalate unless you need exact traceback lines.
- - Next: Fix bucket 1 first, then rerun the full suite at standard. Secondary buckets are already visible behind it.
- - Stop signal: diagnosis complete; raw not needed.
- ```
-
- `focused`
- ```text
- - Tests did not complete.
- - 114 errors occurred during collection.
- - Import/dependency blocker: missing dependencies are blocking collection.
- - Missing modules include <module-a>, <module-b>.
- - path/to/test_a.py -> missing module: <module-a>
- - path/to/test_b.py -> missing module: <module-b>
- - Hint: Install the missing dependencies and rerun the affected tests.
- - Next: Fix bucket 1 first, then rerun the full suite at standard.
- - Stop signal: diagnosis complete; raw not needed.
- ```
-
- `verbose`
- ```text
- - Tests did not complete.
- - 114 errors occurred during collection.
- - Import/dependency blocker: missing dependencies are blocking collection.
- - path/to/test_a.py -> missing module: <module-a>
- - path/to/test_b.py -> missing module: <module-b>
- - path/to/test_c.py -> missing module: <module-c>
- - Hint: Install the missing dependencies and rerun the affected tests.
- - Next: Fix bucket 1 first, then rerun the full suite at standard.
- - Stop signal: diagnosis complete; raw not needed.
- ```
-
- Recommended debugging order for tests:
- 1. Use `standard` for the full suite first.
- 2. Treat `standard` as the map. If it already shows bucket-level root cause, `Anchor`, and `Fix`, trust it and report or act from there directly.
- 3. Use `sift escalate` only when you want a deeper render of the same cached output without rerunning the command.
- 4. After fixing something, run `sift rerun` to refresh the full-suite truth at `standard`.
- 5. Only then use `sift rerun --remaining --detail focused` as the zoom lens after the full-suite truth is refreshed.
- 6. Then use `sift rerun --remaining --detail verbose`.
- 7. Then use `sift rerun --remaining --detail verbose --show-raw`.
- 8. Fall back to the raw pytest command only if you still need exact traceback lines for the remaining failing subset.
-
- ## Built-in presets
-
- - `test-status`: summarize test runs
- - `typecheck-summary`: group blocking type errors by root cause
- - `lint-failures`: group repeated lint violations and highlight the files or rules that matter
- - `audit-critical`: extract only high and critical vulnerabilities
- - `infra-risk`: return a safety verdict for infra changes
- - `diff-summary`: summarize code changes and risks
- - `build-failure`: explain the most likely build failure
- - `log-errors`: extract the most relevant error signals
-
- List or inspect them:
-
- ```bash
- sift presets list
- sift presets show test-status
- ```
+ `sift rerun --remaining` currently supports only cached `pytest` or `python -m pytest` runs. For other runners, rerun a narrowed command manually with `sift exec --preset test-status -- <narrowed command>`.
 
  ## Agent setup
 
- If you want Codex or Claude Code to use `sift` by default, let `sift` install a managed instruction block for you.
-
- Repo scope is the default because it is safer:
+ `sift` can install a managed instruction block so coding agents use it by default for long command output:
 
  ```bash
- sift agent show codex
- sift agent show codex --raw
- sift agent install codex --dry-run
- sift agent install codex --dry-run --raw
- sift agent install codex
  sift agent install claude
+ sift agent install codex
  ```
 
- You can also install machine-wide instructions explicitly:
-
- ```bash
- sift agent install codex --scope global
- sift agent install claude --scope global
- ```
-
- Useful commands:
+ This writes a tuned set of rules into your agent's config (CLAUDE.md, AGENTS.md, etc.) so the agent routes noisy commands through `sift` automatically — no manual prompting needed.
 
  ```bash
  sift agent status
- sift agent remove codex
+ sift agent show claude
  sift agent remove claude
  ```
 
- `sift agent show ...` is a preview. It also tells you whether the managed block is already installed in the current scope.
-
- What the installer does:
- - writes to `AGENTS.md` or `CLAUDE.md` by default in the current repo
- - uses marked managed blocks instead of rewriting the whole file
- - preserves your surrounding notes and instructions
- - can use global files when you explicitly choose `--scope global`
- - keeps previews short by default
- - shows the exact managed block or final dry-run content only with `--raw`
+ ## Where `sift` helps most
 
- What the managed block tells the agent:
- - start with `sift` for long non-interactive command output so the agent spends less context-window and token budget on raw logs
- - for tests, begin with the normal `test-status` summary
- - if `standard` already identifies the main buckets, stop there instead of escalating automatically
- - use `sift escalate` only for the same cached output when more detail is needed without rerunning the command
- - after a fix, refresh the truth with `sift rerun`
- - only then zoom into the remaining failing pytest subset with `sift rerun --remaining --detail focused`, then `verbose`, then `--show-raw`
- - fall back to the raw test command only when exact traceback lines are still needed
+ `sift` is strongest when output is:
+ - long
+ - repetitive
+ - triage-heavy
+ - shaped by a small number of root causes
 
- ## CI-friendly usage
+ Good fits:
+ - large `pytest`, `vitest`, or `jest` runs (deterministic heuristics)
+ - `tsc` type errors and `eslint` lint failures (deterministic heuristics)
+ - build failures from webpack, esbuild, cargo, go, gcc
+ - `npm audit` and `terraform plan` (deterministic heuristics)
+ - repeated CI blockers
+ - noisy diffs and log streams
 
- Some commands succeed technically but should still block CI. `--fail-on` handles that for the built-in semantic presets that have stable machine-readable output:
+ ## Where it helps less
 
- ```bash
- sift exec --preset audit-critical --fail-on -- npm audit
- sift exec --preset infra-risk --fail-on -- terraform plan
- ```
+ `sift` adds less value when:
+ - the output is already short and obvious
+ - the command is interactive or TUI-based
+ - the exact raw log matters
+ - the output does not expose enough evidence for reliable grouping
 
- Supported presets for `--fail-on`:
- - `audit-critical`
- - `infra-risk`
+ When it cannot be confident, it tells you to zoom or read raw instead of pretending certainty.
 
- ## Config
+ ## Benchmark
 
- Useful commands:
+ On a real 640-test Python backend (125 repeated setup errors, 3 contract failures, 510 passing tests):
 
- ```bash
- sift config setup
- sift config use openrouter
- sift config init
- sift config show
- sift config validate
- sift doctor
- ```
+ | Metric | Raw agent | sift-first | Reduction |
+ |--------|-----------|------------|-----------|
+ | Tokens | 305K | 600 | 99.8% |
+ | Tool calls | 16 | 7 | 56% |
+ | Diagnosis | Same | Same | — |
 
- `sift config show` masks secrets by default. Use `--show-secrets` only when you explicitly need raw values.
+ The headline numbers (62% token reduction, 71% fewer tool calls, 65% faster) come from the end-to-end wall-clock comparison. The table above shows the token-level reduction on the largest real fixture.
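The table's reduction percentages follow from simple ratios, a quick check using the table's own numbers:

```python
# Figures from the benchmark table: raw agent vs sift-first.
tokens_raw, tokens_sift = 305_000, 600
calls_raw, calls_sift = 16, 7

# Reduction = (1 - after / before) * 100.
token_reduction = (1 - tokens_sift / tokens_raw) * 100  # ~99.8%
call_reduction = (1 - calls_sift / calls_raw) * 100     # ~56%

print(round(token_reduction, 1), round(call_reduction))
```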
 
- Config precedence:
- 1. CLI flags
- 2. environment variables
- 3. repo-local `sift.config.yaml` or `sift.config.yml`
- 4. machine-wide `~/.config/sift/config.yaml` or `~/.config/sift/config.yml`
- 5. built-in defaults
+ Methodology and caveats live in [BENCHMARK_NOTES.md](BENCHMARK_NOTES.md).
 
- ## Maintainer benchmark
+ ## Configuration
 
- To compare raw pytest output against the `test-status` reduction ladder on fixed fixtures, run:
+ Inspect and validate config with:
 
  ```bash
- npm run bench:test-status-ab
- npm run bench:test-status-live
+ sift config show
+ sift config show --show-secrets
+ sift config validate
  ```
 
- This uses the real `o200k_base` tokenizer and reports both:
- - command-output budget as the primary benchmark
- - deterministic recipe-budget comparisons as supporting evidence only
- - live-session scorecards for captured mixed full-suite agent transcripts
+ To switch between saved providers without editing files:
 
- The benchmark is meant to show context-window and command-output reduction first. In normal debugging flows, `test-status` should usually stop at `standard`; `focused` and `verbose` are escalation tools, and raw pytest is the last resort when exact traceback evidence is still needed.
-
- If you pass `--config <path>`, that path is strict. Missing explicit config paths are errors.
+ ```bash
+ sift config use openai
+ sift config use openrouter
+ ```
 
- Minimal config example:
+ Minimal YAML config:
 
  ```yaml
  provider:
@@ -449,63 +258,10 @@ runtime:
  rawFallback: true
  ```
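The diff abbreviates the middle of that example, and the elided lines are not recoverable here. For orientation only, a hypothetical shape built from keys that appear elsewhere in this README (`provider`, `baseUrl`, `model`, `rawFallback`); verify the real schema with `sift config show` or the CLI reference:

```yaml
# Hypothetical sketch, not the package's exact schema.
provider:
  name: openai
  baseUrl: https://api.openai.com/v1
  model: gpt-5-nano
runtime:
  rawFallback: true
```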
 
- ## OpenAI vs OpenRouter vs OpenAI-compatible
-
- Use `provider: openai` for `api.openai.com`.
-
- Use `provider: openrouter` for the native OpenRouter path. It defaults to:
- - `baseUrl: https://openrouter.ai/api/v1`
- - `model: openrouter/free`
-
- Use `provider: openai-compatible` for third-party compatible gateways or self-hosted endpoints.
-
- For OpenAI:
- ```bash
- export OPENAI_API_KEY=your_openai_api_key
- ```
-
- For OpenRouter:
- ```bash
- export OPENROUTER_API_KEY=your_openrouter_api_key
- ```
-
- For third-party compatible endpoints, use either the endpoint-native env var or:
-
- ```bash
- export SIFT_PROVIDER_API_KEY=your_provider_api_key
- ```
-
- Known compatible env fallbacks include:
- - `OPENROUTER_API_KEY`
- - `TOGETHER_API_KEY`
- - `GROQ_API_KEY`
-
- ## Safety and limits
-
- - redaction is optional and regex-based
- - retriable provider failures such as `429`, timeouts, and `5xx` are retried once
- - `sift exec` detects simple prompt-like output such as `[y/N]` or `password:` and skips reduction
- - pipe mode does not preserve upstream shell pipeline failures; use `set -o pipefail` if you need that behavior
-
- ## Releasing
-
- This repo uses a manual GitHub Actions release workflow with npm trusted publishing.
-
- Release flow:
- 1. bump `package.json`
- 2. merge to `main`
- 3. run the `release` workflow manually
-
- The workflow runs typecheck, tests, coverage, build, packaging smoke checks, npm publish, tag creation, and GitHub Release creation.
-
- ## Brand assets
-
- Curated public logo assets live in `assets/brand/`.
+ ## Docs
 
- Included SVG sets:
- - badge/app: teal, black, monochrome
- - icon-only: teal, black, monochrome
- - 24px icon: teal, black, monochrome
+ - CLI reference: [docs/cli-reference.md](docs/cli-reference.md)
+ - Benchmark methodology: [BENCHMARK_NOTES.md](BENCHMARK_NOTES.md)
 
  ## License