cto-ai-cli 7.1.0 → 8.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -2,17 +2,20 @@
  
  [![npm](https://img.shields.io/npm/v/cto-ai-cli.svg)](https://www.npmjs.com/package/cto-ai-cli)
  [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
- [![Tests](https://img.shields.io/badge/tests-606%20passing-brightgreen)](.)
+ [![Tests](https://img.shields.io/badge/tests-1133%20passing-brightgreen)](.)
  
- **Pick the right files for any AI task. Secrets auto-redacted. Learns from your feedback.**
+ **The most complete AI context selection engine in open source.** Picks the right code *chunks* (not just files), auto-redacts secrets, learns from feedback. 18 signals. Zero AI dependencies.
  
  ```bash
- cto --context "fix the auth middleware" --stdout | pbcopy  # → clipboard
- cto --context "fix auth" --prompt "Refactor to use JWT"    # → AI prompt
- cto --accept                                               # → learns
+ cto --context "fix the seller info cache invalidation on KVS delete" --stdout | pbcopy
  ```
  
- 76KB package · 606 tests · Zero AI dependencies.
+ ```
+ → 166 relevant chunks from 59 files (26K tokens, 0 secrets)
+ → Full chain: DeleteEndpoint → Router → UseCase → CacheService → KvsRepository
+ ```
+ 
+ 202KB package · 1,133 tests · 96 source modules · Zero AI dependencies.
  
  ---
  
@@ -35,18 +38,31 @@ This runs a self-contained presentation that shows: project analysis, semantic m
  
  ## Benchmark Results
  
- Tested against 8 curated tasks with ground truth (known correct files):
+ **Eval Harness v8.0** — 20-file Java enterprise project, 4 tasks with expert-labeled ground truth:
  
- | Strategy | Precision | Must-have Recall | F1 |
+ | Metric | Result |
+ |---|---|
+ | **Must-have recall** | **100%** (every critical file found) |
+ | **Precision** | **38–44%** |
+ | **F1** | **55%** |
+ | **Noise rate** | **11.3%** |
+ 
+ **Real production repos** (Mercado Libre Java monoliths):
+ 
+ | Repo | Files | Without CTO | With CTO v8.0 |
  |---|---|---|---|
- | **CTO** | 33.6% | **100.0%** | **48.7%** |
+ | fury_supply-seller-info | 219 | 212 files (97%) | **166 chunks from 59 files** |
+ | sell-sizechart-middleend | 1,719 | 230 files | **72 chunks from 37 files** |
+ | charts-backend | 1,261 | 685 files (54%) | **142 chunks from 16 files** |
+ 
+ **Internal benchmark** (8 tasks, own codebase):
+ 
+ | Strategy | Precision | Recall | F1 |
+ |---|---|---|---|
+ | **CTO + Reranker** | **96.9%** | 100% | 98.4% |
  | TF-IDF only | 54.6% | 87.5% | 62.0% |
- | Risk-only | 20.8% | 18.8% | 15.0% |
- | Alphabetical | 8.3% | 31.3% | 12.9% |
  | Random | 7.7% | 6.3% | 2.8% |
  
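As a sanity check on the tables above: F1 is the harmonic mean of precision and recall, so 100% recall at 38–44% precision lands in the 55–61% F1 range reported. A quick check in TypeScript (the helper name is ours, not part of the CLI):

```typescript
// F1 is the harmonic mean of precision and recall.
// `f1Score` is an illustrative helper, not part of cto-ai-cli.
function f1Score(precision: number, recall: number): number {
  if (precision + recall === 0) return 0;
  return (2 * precision * recall) / (precision + recall);
}

// Eval Harness v8.0: 38–44% precision at 100% must-have recall
console.log(f1Score(0.38, 1.0).toFixed(2)); // → "0.55"
console.log(f1Score(0.44, 1.0).toFixed(2)); // → "0.61"

// Internal benchmark: 96.9% precision, 100% recall
console.log(f1Score(0.969, 1.0).toFixed(3)); // → "0.984"
```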
- **CTO never misses a must-have file** (100% recall). 3.8× better F1 than alphabetical. 17× better than random.
- 
  ## ROI
  
  On a typical 130-file TypeScript project:
@@ -60,23 +76,34 @@ On a typical 130-file TypeScript project:
  
  Plus: fewer hallucinations (right context), zero secret leaks, and the learner gets smarter with every `--accept` / `--reject`.
  
- ## How it Works
+ ## How it Works (v8.0 Pipeline)
  
  ```
- Task description ──→ TF-IDF/BM25 ──→ Semantic scores ────┐
- 
- Project files ──→ Dependency graph ──→ Risk scores ──────┤──→ Composite ──→ Greedy ──→ Selection
-                                                          │      ranking     alloc
- Feedback history ──→ Bayesian learner ──→ Boosts ────────┘
+ Task ──→ Query Intent Parser ──→ structured action/entities/layers
+ 
+ 
+ BM25 (weighted) ──────┐
+ TF-IDF Embedding ─────┤──→ RRF Fusion ─→ 8-signal Boosting ─→ Reranker
+ Multi-hop (auto) ─────┘                                          │
+ 
+                           Selection ─→ Chunk Extraction ─→ Output
+                                        (methods, not files)
  ```
  
- 1. **Dependency graph** — parses imports, builds adjacency list, identifies hubs
- 2. **Risk scoring** — complexity × centrality × recency (continuous, log-scaled)
- 3. **TF-IDF/BM25 semantic matching** — task description scored against file contents + path boosting
- 4. **Composite ranking** — `finalScore = semantic × 0.55 + risk × 0.25 + learner × 0.2`
- 5. **Noise filtering** — files with zero semantic relevance are excluded (benchmark-driven optimization)
- 6. **Greedy allocation** — fills token budget top-down, cascading prune levels (full → signatures → skeleton)
- 7. **Bayesian learning** — exponential decay, Wilson score confidence, per-task-type patterns
+ **10-step pipeline:**
+ 
+ | # | Step | What it does |
+ |---|---|---|
+ | 0 | **Query Intent** | Parses "fix cache invalidation on delete" → `action:fix`, `entities:[cache,kvs]`, `layers:[cache]` |
+ | 1 | **BM25 + Embedding** | Lexical matching + TF-IDF cosine vectors, merged via Reciprocal Rank Fusion |
+ | 2 | **Multi-hop** | Complex queries auto-detected → iterative BM25 expansion via deps + call graph (2 hops) |
+ | 3 | **Path IDF Boost** | Query terms in file paths get boosted |
+ | 4 | **Layer Boost** | Architectural layer matching (controller, service, repository) |
+ | 5 | **Import Boost** | Dependencies of top-ranked files get pulled in |
+ | 6 | **Call Graph Boost** | Cross-file method calls traced (Java/TS/Python/Go) |
+ | 7 | **Git Co-Change** | Files frequently modified together (Jaccard similarity from commits) |
+ | 8 | **Reranker** | 5-signal quality gate: term coverage, specificity, bigram proximity, deps, path |
+ | 9 | **Chunk Extraction** | Extracts relevant functions/methods — not whole files. 10× token efficiency |
  
  **No AI is used for selection.** Same input → same output. Deterministic.
  
@@ -225,62 +252,103 @@ const selection = await selectContext({
  });
  ```
  
- ## v7.0 Enterprise Features
+ ## v8.0 What's New
  
- ### Precision Reranker (96.9% precision, was 33.6%)
+ ### Chunk-Level Retrieval (the big one)
  
- Multi-signal reranker between BM25 retrieval and greedy allocation:
- - **Term coverage**: fraction of unique query terms matched per file
- - **Term specificity**: IDF-weighted — rare terms matter more
- - **Bigram proximity**: query terms appearing close together in the file
- - **Dependency signal**: files in the dependency cone of top matches
- - **Quality gate**: adaptive cutoff stops filling budget with noise
+ Instead of including entire files, CTO now extracts **only the relevant functions and methods**. A 2000-line file with 1 relevant method → 50 lines included, not 2000.
  
- ### Persistent Index Cache
+ ````markdown
+ ### src/main/java/com/example/cache/CacheService.java
+ ```java
+ // L15-22: method invalidate
+ public void invalidate(String id) {
+     redis.delete("cache:seller:" + id);
+ }
+ 
+ // ... lines 23-45 omitted ...
+ 
+ // L46-52: method retrieve
+ public SellerDTO retrieve(String id) {
+     return redis.opsForValue().get("cache:seller:" + id);
+ }
+ ```
+ ````
  
- TF-IDF index persisted to `.cto/index-cache.json` with per-file mtime tracking. Subsequent queries only re-tokenize changed files. 50K-file repos go from 5s → <100ms on warm cache.
+ Supports Java, TypeScript, Python, Go.
  
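The README describes chunk extraction as regex-based. A minimal sketch of that idea in TypeScript — the regex, `Chunk` shape, and `extractChunks` name are our illustration, not cto-ai-cli's actual implementation:

```typescript
// Illustrative sketch of regex-based chunk extraction (not CTO's real code).
// Finds Java-style method declarations and returns each method's line range,
// so only relevant chunks, not whole files, go into the context.
interface Chunk {
  name: string;
  startLine: number; // 1-based
  endLine: number;
}

const METHOD_RE = /^\s*(?:public|private|protected)[\w<>,\s[\]]*\s(\w+)\s*\([^)]*\)\s*\{/;

function extractChunks(source: string): Chunk[] {
  const lines = source.split("\n");
  const chunks: Chunk[] = [];
  for (let i = 0; i < lines.length; i++) {
    const m = METHOD_RE.exec(lines[i]);
    if (!m) continue;
    // Walk forward, balancing braces, to find where the method ends.
    let depth = 0;
    for (let j = i; j < lines.length; j++) {
      for (const ch of lines[j]) {
        if (ch === "{") depth++;
        else if (ch === "}") depth--;
      }
      if (depth === 0) {
        chunks.push({ name: m[1], startLine: i + 1, endLine: j + 1 });
        i = j;
        break;
      }
    }
  }
  return chunks;
}

const java = `public class CacheService {
    public void invalidate(String id) {
        redis.delete("cache:seller:" + id);
    }
}`;
console.log(extractChunks(java));
// → [ { name: 'invalidate', startLine: 2, endLine: 4 } ]
```

A production extractor also has to handle braces inside strings and comments; the brace-balancing loop above is the core trick.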
- ### Multi-Language Dependency Graphs
+ ### Query Intent Parsing
  
- Regex-based import parsing for **Python**, **Go**, **Java**, and **Rust** alongside ts-morph for TS/JS. Enables hub detection, risk scoring, and dependency expansion for polyglot codebases.
+ Before searching, CTO parses your task into structured intent:
  
- ```bash
- # Works on Python, Go, Java, Rust projects — not just TypeScript
- cto --context "fix auth handler" /path/to/go-project
  ```
+ "fix the seller cache invalidation on KVS delete"
+ → action: fix
+ → entities: [seller, kvs] (3× weight)
+ → operations: [invalidate, delete] (2× weight)
+ → layers: [cache]
+ ```
+ 
+ Entities get 3× BM25 weight and operations get 2×, which yields much better precision on enterprise queries.
  
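A toy version of this parsing step, assuming hypothetical keyword lists — `ACTIONS`, `LAYERS`, and `OPERATIONS` are our illustration, not cto-ai-cli's actual vocabulary (the real parser would also stem words like "invalidation" → "invalidate"):

```typescript
// Toy sketch of query intent parsing; keyword lists are illustrative,
// not cto-ai-cli's actual vocabulary. No stemming, for brevity.
const ACTIONS = ["fix", "add", "refactor", "debug"];
const LAYERS = ["controller", "service", "repository", "cache"];
const OPERATIONS = ["invalidate", "delete", "create", "update"];

interface Intent {
  action?: string;
  entities: string[];   // weighted 3× in BM25
  operations: string[]; // weighted 2×
  layers: string[];
}

function parseIntent(task: string): Intent {
  const words = task.toLowerCase().match(/[a-z]+/g) ?? [];
  const stop = new Set(["the", "on", "a", "an", "of", "to", "in"]);
  const action = words.find((w) => ACTIONS.includes(w));
  const layers = words.filter((w) => LAYERS.includes(w));
  const operations = words.filter((w) => OPERATIONS.includes(w));
  // Everything else that isn't a stopword/action/operation/layer
  // becomes a candidate entity.
  const entities = words.filter(
    (w) =>
      !stop.has(w) &&
      w !== action &&
      !operations.includes(w) &&
      !layers.includes(w)
  );
  return { action, entities, operations, layers };
}

console.log(parseIntent("fix the seller cache invalidation on KVS delete"));
```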
- ### Team Authentication & SSO
+ ### Embedding Search + RRF Fusion
  
- Per-team API keys, JWT validation (HS256/RS256), rate limiting, model allowlists. Teams stored in `.cto/gateway/teams.json`.
+ TF-IDF cosine embedding vectors complement BM25 lexical matching; the two rankings are merged via Reciprocal Rank Fusion (60/40 BM25/embedding). This catches semantic similarity that BM25 misses.
  
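Weighted Reciprocal Rank Fusion is simple to sketch: each ranker contributes `weight / (k + rank)` per document, and documents are re-sorted by the summed score. The `k = 60` constant and helper names below are our assumptions, not necessarily CTO's:

```typescript
// Weighted Reciprocal Rank Fusion sketch (constant and names illustrative).
// Each ranker contributes weight / (k + rank); lists are ordered best-first.
function rrfFuse(
  rankings: { list: string[]; weight: number }[],
  k = 60
): string[] {
  const scores = new Map<string, number>();
  for (const { list, weight } of rankings) {
    list.forEach((doc, i) => {
      scores.set(doc, (scores.get(doc) ?? 0) + weight / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([doc]) => doc);
}

const bm25 = ["CacheService.java", "Router.java", "Utils.java"];
const embedding = ["KvsRepository.java", "CacheService.java", "Router.java"];

// 60/40 BM25/embedding weighting, as described above.
console.log(rrfFuse([
  { list: bm25, weight: 0.6 },
  { list: embedding, weight: 0.4 },
]));
// → [ 'CacheService.java', 'Router.java', 'Utils.java', 'KvsRepository.java' ]
```

Because only ranks (not raw scores) are fused, the two signals don't need to be on comparable scales — the usual reason RRF is chosen over score averaging.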
- ### Metrics Export
+ ### Cross-File Call Graph
  
- Prometheus exposition format at `/__cto/metrics`, Datadog JSON, and StatsD UDP. Counters, histograms, gauges for requests, tokens, cost, latency, secrets.
+ Traces method calls across files: `cacheService.invalidate()` in a UseCase finds `CacheService.java`. Regex-based; works for Java/TS/Python/Go.
  
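A minimal sketch of regex-based call tracing — the regex and helper names are ours, not cto-ai-cli's internals. Call sites of the form `receiver.method(...)` are matched against an index of which file declares each method:

```typescript
// Illustrative sketch of regex-based cross-file call tracing.
// `receiver.method(...)` call sites are matched to declaring files.
const CALL_RE = /\b\w+\.(\w+)\s*\(/g;

function calledMethods(source: string): Set<string> {
  const out = new Set<string>();
  for (const m of source.matchAll(CALL_RE)) out.add(m[1]);
  return out;
}

function callTargets(
  source: string,
  declarations: Map<string, string> // method name → file declaring it
): Set<string> {
  const files = new Set<string>();
  for (const method of calledMethods(source)) {
    const file = declarations.get(method);
    if (file) files.add(file);
  }
  return files;
}

const useCase = `
  cacheService.invalidate(sellerId);
  kvsRepository.delete(sellerId);
`;
const decls = new Map([
  ["invalidate", "CacheService.java"],
  ["delete", "KvsRepository.java"],
]);
console.log([...callTargets(useCase, decls)]);
// → [ 'CacheService.java', 'KvsRepository.java' ]
```

Matching on method name alone can over-link (two classes with the same method name); that's the trade-off of staying regex-based rather than type-aware.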
- ### Per-Team Policy Engine
+ ### Git Co-Change Signal
  
- Routing rules per team: model overrides by task type, cost caps per request, context budget limits, block rules. Preset policies: `createCostConscious()`, `createSecurityFirst()`.
+ Files frequently modified together in git history get boosted, using Jaccard similarity computed from commit co-occurrence.
  
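The co-change signal reduces to Jaccard similarity between the sets of commits touching each file. A sketch under that reading (names and the commit-log shape are illustrative, not CTO's internals):

```typescript
// Jaccard similarity between the commit sets of two files
// (illustrative names; commit data would come from `git log --name-only`).
function jaccard(a: Set<string>, b: Set<string>): number {
  let inter = 0;
  for (const x of a) if (b.has(x)) inter++;
  const union = a.size + b.size - inter;
  return union === 0 ? 0 : inter / union;
}

// commit hash → files touched
const commits: Record<string, string[]> = {
  c1: ["CacheService.java", "KvsRepository.java"],
  c2: ["CacheService.java", "KvsRepository.java", "Router.java"],
  c3: ["Router.java"],
  c4: ["CacheService.java"],
};

// Invert to file → set of commits touching it.
const touches = new Map<string, Set<string>>();
for (const [sha, files] of Object.entries(commits)) {
  for (const f of files) {
    if (!touches.has(f)) touches.set(f, new Set());
    touches.get(f)!.add(sha);
  }
}

const score = jaccard(
  touches.get("CacheService.java")!,
  touches.get("KvsRepository.java")!
);
console.log(score); // 2 shared commits out of 3 distinct → 0.666…
```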
- ### Closed-Loop A/B Testing
+ ### Multi-Hop Reasoning
  
- Real experimentation on context strategies with two-proportion z-test for statistical significance. Deterministic assignment (SHA-256 hashing), auto-conclusion when p < 0.05.
+ Complex enterprise queries are auto-detected. Iterative BM25: top matches expand via deps + call graph, then re-query. Traces full execution chains (4/4 hops).
  
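The expansion step can be sketched as a bounded breadth-first walk over dependency/call edges — the function signature and edge map below are our illustration, omitting the re-query step:

```typescript
// Sketch of multi-hop expansion (signatures illustrative): seed results
// expand through a dependency/call edge map for a fixed number of hops,
// so the full execution chain ends up in the selection.
function multiHop(
  seeds: string[],
  edges: Map<string, string[]>, // file → files it depends on / calls into
  hops = 2
): Set<string> {
  const selected = new Set(seeds);
  let frontier = seeds;
  for (let h = 0; h < hops; h++) {
    const next: string[] = [];
    for (const file of frontier) {
      for (const dep of edges.get(file) ?? []) {
        if (!selected.has(dep)) {
          selected.add(dep);
          next.push(dep);
        }
      }
    }
    frontier = next;
  }
  return selected;
}

const edges = new Map([
  ["DeleteEndpoint.java", ["Router.java"]],
  ["Router.java", ["UseCase.java"]],
  ["UseCase.java", ["CacheService.java", "KvsRepository.java"]],
]);
console.log([...multiHop(["DeleteEndpoint.java"], edges, 3)]);
// traces the full chain down to CacheService / KvsRepository
```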
- ### LSP Bridge (IDE Plugin)
+ ### Evaluation Harness
  
- JSON-RPC 2.0 server over stdin/stdout for any IDE: VS Code, JetBrains, Neovim, Emacs. Custom methods: `cto/selectContext`, `cto/score`, `cto/audit`, `cto/experiments`.
+ Ground-truth benchmark with must-have/relevant/noise labels. 100% must-have recall on the 4-task Java enterprise benchmark.
+ 
+ ## Enterprise Features
+ 
+ - **AI Gateway** — transparent HTTP proxy with context injection, secret redaction, cost tracking
+ - **Team Auth** — per-team API keys, JWT (HS256/RS256), rate limiting, OIDC discovery
+ - **Policy Engine** — model overrides by task type, cost caps, block rules
+ - **Metrics** — Prometheus, Datadog JSON, StatsD UDP
+ - **A/B Testing** — context strategy experiments with z-test significance
+ - **LSP Bridge** — JSON-RPC 2.0 for VS Code, JetBrains, Neovim
+ - **Persistent Index Cache** — 50K-file repos: 5s → <100ms on warm cache
+ 
+ ## Competitor Comparison
+ 
+ | Feature | CTO v8 | Cursor | Sourcegraph Cody |
+ |---|---|---|---|
+ | BM25 retrieval | ✅ | ✅ | ✅ |
+ | Embedding search | ✅ TF-IDF cosine + RRF | ✅ | ✅ |
+ | Chunk-level retrieval | ✅ 4 langs | ✅ | ✅ |
+ | Multi-signal RRF fusion | ✅ 8-signal | ❌ | ❌ |
+ | Cross-file call graph | ✅ | ❌ | ❌ |
+ | Git co-change signal | ✅ | ❌ | ❌ |
+ | Multi-hop reasoning | ✅ | ❌ | ❌ |
+ | Query intent parsing | ✅ | ❌ | ❌ |
+ | Feedback learning | ✅ | ❌ | ❌ |
+ | Secret redaction | ✅ | ❌ | ❌ |
+ | **Total signals** | **18** | **~3** | **~5** |
  
  ## Honest Limitations
  
- - **TypeScript/JavaScript gets AST analysis.** Python/Go/Java/Rust get regex-based import parsing (good for graphs, not AST-accurate).
- - **BM25 + reranker, not embeddings.** 96.9% precision on our benchmark. No neural model needed.
- - **Learning needs ~5 feedback cycles** to start influencing selection. First runs are pure graph + risk + semantic.
- - **Benchmarked against naive baselines** (alphabetical, random, risk-only, TF-IDF-only). Not compared against Cursor/Copilot internal context engines.
+ - **TypeScript/JavaScript gets AST analysis.** Python/Go/Java/Rust get regex-based parsing (good for graphs + chunking, not AST-precise).
+ - **Embeddings are TF-IDF cosine, not neural.** The ONNX infrastructure is ready; a neural model would add an estimated ~5–10% recall.
+ - **Learning needs ~5 feedback cycles** to start influencing selection. First runs use the pure pipeline.
+ - **Chunk extraction is regex-based** — works for standard methods/functions; may miss DSLs or deeply nested code.
+ - **Benchmarked against naive baselines.** Not compared against Cursor/Copilot internal context engines.
  
  ## Contributing
  
  ```bash
  git clone https://github.com/cto-ai/cto-ai-cli.git && cd cto-ai-cli
- npm install && npm run build && npm test  # 776 tests
+ npm install && npm run build && npm test  # 1,133 tests
  ```
  
  ## License