llm-kb 0.2.0 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/PHASE3_SPEC.md ADDED
# llm-kb — Phase 3: Auth Fix + Eval Loop + LLM Config

> **Priority 1:** Auth fix — users are bouncing because Pi isn't configured
> **Priority 2:** Eval loop — the differentiator nobody else has
> **Priority 3:** LLM config — let users pick models
> **Blog:** Part 4 of the series (eval loop)

---

## 1. Auth Fix (URGENT)

Users run `npx llm-kb run` and hit a wall because Pi SDK isn't installed or configured. 117 people saved the LinkedIn post — they're coming back soon.

### The Flow

```
User runs `npx llm-kb run ./docs`

├─ Pi SDK auth exists (~/.pi/agent/auth.json)?
│    → Use it. Done.

├─ ANTHROPIC_API_KEY env var set?
│    → Configure Pi SDK programmatically. Done.

└─ Neither?
     → Show clear error:

       No LLM authentication found.

       Option 1: Install Pi SDK (recommended)
         npm install -g @mariozechner/pi-coding-agent
         pi

       Option 2: Set your Anthropic API key
         export ANTHROPIC_API_KEY=sk-ant-...
```

### Implementation

Check auth before creating any session. Add to `cli.ts` or a new `auth.ts`:

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";
import { homedir } from "node:os";

export function checkAuth(): { ok: boolean; method: string } {
  // Check Pi SDK auth
  const piAuthPath = join(homedir(), ".pi", "agent", "auth.json");
  if (existsSync(piAuthPath)) {
    return { ok: true, method: "pi-sdk" };
  }

  // Check ANTHROPIC_API_KEY
  if (process.env.ANTHROPIC_API_KEY) {
    return { ok: true, method: "api-key" };
  }

  return { ok: false, method: "none" };
}
```

If method is `"api-key"`, configure Pi SDK's settings programmatically so `createAgentSession` works with the env var.
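Pi SDK's programmatic settings API isn't specified here, so one low-risk approach is to write the same `auth.json` the SDK already looks for. A sketch; the file schema below is an assumption and must be checked against what `pi` actually writes:

```typescript
import { mkdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";
import { homedir } from "node:os";

// ASSUMPTION: Pi SDK reads ~/.pi/agent/auth.json with an { anthropic: { apiKey } }
// shape. Verify against a real file produced by running `pi` before shipping this.
export function configureFromEnv(
  baseDir: string = join(homedir(), ".pi", "agent")
): string | null {
  const key = process.env.ANTHROPIC_API_KEY;
  if (!key) return null;
  mkdirSync(baseDir, { recursive: true });
  const authPath = join(baseDir, "auth.json");
  writeFileSync(authPath, JSON.stringify({ anthropic: { apiKey: key } }, null, 2));
  return authPath;
}
```

If Pi SDK exposes a real configuration call, prefer that over writing its files directly.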

### Definition of Done
- [ ] `ANTHROPIC_API_KEY=sk-... npx llm-kb run ./docs` works without Pi installed
- [ ] Pi SDK auth works as before (no regression)
- [ ] Clear error message when neither is available
- [ ] README updated with both auth options

---

## 2. LLM Configuration

### Config File

Auto-generated on first run at `.llm-kb/config.json`:

```json
{
  "indexModel": "claude-haiku-3-5",
  "queryModel": "claude-sonnet-4-20250514",
  "provider": "anthropic"
}
```

### Env Var Overrides

```bash
LLM_KB_INDEX_MODEL=claude-haiku-3-5 llm-kb run ./docs
LLM_KB_QUERY_MODEL=claude-sonnet-4-20250514 llm-kb query "question"
```

### Priority

```
1. Env var (LLM_KB_INDEX_MODEL, LLM_KB_QUERY_MODEL)
2. Config file (.llm-kb/config.json)
3. Defaults (Haiku for indexing, Sonnet for query)
```
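The priority chain above can live in one small pure function. A sketch using the env var names and defaults from this spec:

```typescript
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

const DEFAULTS = {
  indexModel: "claude-haiku-3-5",
  queryModel: "claude-sonnet-4-20250514",
};

export function resolveModel(kind: "index" | "query", folder: string): string {
  // 1. Env var wins
  const env = process.env[kind === "index" ? "LLM_KB_INDEX_MODEL" : "LLM_KB_QUERY_MODEL"];
  if (env) return env;

  // 2. Then the config file
  const configPath = join(folder, ".llm-kb", "config.json");
  if (existsSync(configPath)) {
    const config = JSON.parse(readFileSync(configPath, "utf8"));
    const fromFile = kind === "index" ? config.indexModel : config.queryModel;
    if (fromFile) return fromFile;
  }

  // 3. Defaults: cheap model for indexing, strong model for queries
  return kind === "index" ? DEFAULTS.indexModel : DEFAULTS.queryModel;
}
```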

### Why This Matters

- Haiku for indexing is 10x cheaper than Sonnet — users shouldn't pay Sonnet prices for one-line summaries
- Some users want GPT or local models — provider config enables that later
- Config file is portable — `.llm-kb/` travels with the documents

### Definition of Done
- [ ] `config.json` auto-generated on first run
- [ ] Index uses cheap model (Haiku), query uses strong model (Sonnet) by default
- [ ] Env vars override config file
- [ ] `llm-kb status` shows current model config

---

## 3. Eval Loop

### What Gets Traced

Every query logs a JSON file to `.llm-kb/traces/`:

```json
{
  "id": "2026-04-06T14-30-00-query",
  "timestamp": "2026-04-06T14:30:00Z",
  "question": "what are the reserve requirements?",
  "mode": "query",
  "filesRead": ["index.md", "reserve-policy.md", "q3-results.md"],
  "filesAvailable": ["reserve-policy.md", "q3-results.md", "board-deck.md", "pipeline.md"],
  "filesSkipped": ["board-deck.md", "pipeline.md"],
  "answer": "Reserve requirements are defined in two documents...",
  "citations": [
    { "file": "reserve-policy.md", "location": "p.3", "claim": "Minimum reserve ratio of 12%" },
    { "file": "q3-results.md", "location": "p.8", "claim": "Current reserve ratio is 14.2%" }
  ],
  "durationMs": 4200
}
```

### How to Capture Traces

Wrap the session to intercept tool calls:

```typescript
// Track which files the agent reads
const filesRead: string[] = [];

session.subscribe((event) => {
  if (event.type === "tool_use") {
    // Check if it's a read tool call on a source file.
    // extractPathFromToolCall is a helper to implement: pull the
    // file path out of the tool call's arguments.
    const path = extractPathFromToolCall(event);
    if (path && !filesRead.includes(path)) {
      filesRead.push(path);
    }
  }
});
```

After the session completes, write the trace JSON.
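Writing the trace itself is plain file I/O. A sketch, assuming one JSON file per query id under `.llm-kb/traces/` (the `Trace` shape mirrors the example above):

```typescript
import { mkdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Shape mirrors the trace JSON example in this spec
interface Citation { file: string; location: string; claim: string }
interface Trace {
  id: string;
  timestamp: string;
  question: string;
  mode: string;
  filesRead: string[];
  filesAvailable: string[];
  filesSkipped: string[];
  answer: string;
  citations: Citation[];
  durationMs: number;
}

export function writeTrace(folder: string, trace: Trace): string {
  const dir = join(folder, ".llm-kb", "traces");
  mkdirSync(dir, { recursive: true }); // the first trace creates the directory
  const path = join(dir, `${trace.id}.json`);
  writeFileSync(path, JSON.stringify(trace, null, 2));
  return path;
}
```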

### The Eval Command

```bash
llm-kb eval --folder ./research
llm-kb eval --folder ./research --last 20   # only check the last 20 queries
```

The eval agent is a Pi SDK session (read-only) that:

1. Reads trace files from `.llm-kb/traces/`
2. For each trace, checks:
   - **Citation validity** — does the cited file contain the claimed text?
   - **Missing sources** — were any skipped files actually relevant?
   - **Answer consistency** — does the answer contradict the cited sources?
3. Writes a report to `.llm-kb/wiki/outputs/eval-report.md`
4. The watcher detects the report and re-indexes
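Selecting which traces to check (`--last 20`) needs no LLM. A sketch, assuming trace filenames start with their timestamped ids, so a lexicographic sort is chronological:

```typescript
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Pure helper: timestamped ids sort chronologically, so "last N" is a sorted slice
export function selectLast(files: string[], last?: number): string[] {
  const sorted = [...files].sort();
  return last ? sorted.slice(-last) : sorted;
}

export function loadTraces(folder: string, last?: number): unknown[] {
  const dir = join(folder, ".llm-kb", "traces");
  const files = selectLast(
    readdirSync(dir).filter((f) => f.endsWith(".json")),
    last
  );
  return files.map((f) => JSON.parse(readFileSync(join(dir, f), "utf8")));
}
```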

### The Eval AGENTS.md

```markdown
# llm-kb Knowledge Base — Eval Mode

## Your job
Read query traces from .llm-kb/traces/ and check answer quality.

## For each trace, check:
1. Citation validity — read the cited source file. Does it actually
   contain the claimed text at the claimed location?
2. Missing sources — read the index summary for each skipped file.
   Given the question, should any skipped file have been read?
3. Consistency — does the answer contradict anything in the
   cited sources?

## Output
Write .llm-kb/wiki/outputs/eval-report.md with:
- Summary: X traces checked, Y issues found
- Per-trace findings (only flag issues, skip clean traces)
- Recommendations (e.g., "update summary for file X")
```

### Status Command

```bash
llm-kb status --folder ./research
```

```
Knowledge Base: ./research/.llm-kb/
Sources:  12 files (8 PDF, 2 XLSX, 1 DOCX, 1 TXT)
Index:    12 entries, last updated 2 min ago
Outputs:  3 saved research answers
Traces:   47 queries logged
Model:    claude-sonnet-4 (query), claude-haiku-3-5 (index)
Auth:     Pi SDK
```
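The counts in this output come from plain filesystem calls. A sketch for two of the lines, using the directory layout assumed elsewhere in this spec:

```typescript
import { existsSync, readdirSync } from "node:fs";
import { join } from "node:path";

// Count files in a directory, optionally filtered by extension; 0 if it doesn't exist
function count(dir: string, ext?: string): number {
  if (!existsSync(dir)) return 0;
  return readdirSync(dir).filter((f) => !ext || f.endsWith(ext)).length;
}

export function gatherStatus(folder: string): { traces: number; outputs: number } {
  const kb = join(folder, ".llm-kb");
  return {
    traces: count(join(kb, "traces"), ".json"),      // "Traces: N queries logged"
    outputs: count(join(kb, "wiki", "outputs"), ".md"), // "Outputs: N saved research answers"
  };
}
```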

---

## Build Order (Slices)

| Slice | What | Urgency |
|---|---|---|
| 1 | Auth check + ANTHROPIC_API_KEY fallback | 🔴 NOW — users bouncing |
| 2 | Config file (model selection) | 🟡 This week |
| 3 | Trace logging (JSON per query) | 🟡 This week |
| 4 | `status` command | 🟢 Nice to have |
| 5 | `eval` command + eval session | 🟡 This week |
| 6 | Blog Part 4 (eval loop) | After code works |

---

## Definition of Done (Full Phase 3)

- [ ] `ANTHROPIC_API_KEY` works without Pi SDK installed
- [ ] Clear error when no auth found
- [ ] Config file with model selection (index vs query model)
- [ ] Every query logs a trace to `.llm-kb/traces/`
- [ ] `llm-kb eval` checks citations and writes report
- [ ] `llm-kb status` shows KB stats + config
- [ ] README updated with auth options + eval command
- [ ] Blog Part 4 written with real eval output

---

*Phase 3 spec written April 5, 2026. DeltaXY.*
package/PHASE4_SPEC.md ADDED
# llm-kb — Phase 4: Farzapedia Pattern + Eval Loop

> **The data flywheel is already spinning (v0.3.0):**
> Query → Answer → Wiki updated → Next query answered from wiki → Faster, cheaper, compounding.
>
> Phase 4 makes the flywheel bigger: proactive compilation + eval-driven refinement.

---

## The Flywheel (what we already have)

```
┌──────────────┐
│  User asks   │
│  a question  │
└──────┬───────┘
       │
       ▼
┌───────────────────────┐
│ Agent answers from    │
│ wiki (fast) or        │
│ source files (slow)   │
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ wiki.md updated       │◄─── Haiku merges new knowledge
│ (topic-organized)     │     into existing wiki
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ Next similar query    │
│ answered from wiki    │──── 0 file reads, 2s instead of 25s
└───────────────────────┘
```

**Proven in production:**
- First query about BNS 2023: 33s, 4 files read
- Same question again: 2s, 0 files read, answered from wiki
- Follow-up "tell me about mob lynching clause": instant, from wiki context

---

## What Phase 4 adds

### The problem with reactive-only wiki

The wiki only knows what users have asked. If nobody asks about electronic evidence, that knowledge never makes it into the wiki. The first person to ask pays the full cost.

### The Farzapedia insight

> Compile the wiki **proactively** from all sources — BEFORE anyone asks.
> Then every query is fast from day one.

```
Current (reactive only):
  Sources exist → User asks → Agent reads sources → Answers → Wiki updated
  Problem: first query for every topic is slow

With compile (proactive + reactive):
  Sources exist → Compile articles → User asks → Instant answer from articles
  Plus: eval finds gaps → Articles refined → Even better answers
```

---

## Slices

### Slice 1: Article compiler (part of `run`)

**What:** After index is built, compile concept articles from all sources.
Not a separate command — just step 3.5 in the `run` flow:

```
llm-kb run ./docs
  1. Scan files
  2. Parse PDFs
  3. Build index       ← Haiku summarises sources
  4. Compile articles  ← Sonnet synthesises concepts (NEW)
  5. Start watchers
  6. Start chat        ← Agent reads articles, not source files
```

**Skip logic (same as index):**
- If `articles/` exists AND `articles/index.md` is newer than all source files → skip
- If any source is newer OR `articles/` is missing → compile
- First run always compiles. Subsequent runs are instant.
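The skip logic can be sketched as an mtime comparison against the article catalog (paths illustrative):

```typescript
import { existsSync, statSync, readdirSync } from "node:fs";
import { join } from "node:path";

export function shouldCompile(sourcesDir: string, articlesDir: string): boolean {
  const catalog = join(articlesDir, "index.md");
  if (!existsSync(catalog)) return true; // first run (or missing articles/): always compile
  const catalogMtime = statSync(catalog).mtimeMs;
  // Recompile if any source file is newer than the article catalog
  return readdirSync(sourcesDir).some(
    (f) => statSync(join(sourcesDir, f)).mtimeMs > catalogMtime
  );
}
```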

**Input:**
```
.llm-kb/wiki/sources/
  indian-penal-code-new.md         (60 pages)
  annotated-comparison-bns-ipc.md  (21 pages)
  evidence-act-new.md              (40 pages)
  ...
```

**Output:**
```
.llm-kb/wiki/articles/
  index.md                    ← concept catalog with one-line descriptions
  bns-2023-overview.md        ← what it is, structure, key changes
  murder-and-homicide.md      ← Clauses 99-106, old vs new
  mob-lynching.md             ← Clause 101(2), new provision
  electronic-evidence.md      ← Section 65B / BSB comparison
  organised-crime.md          ← Clauses 109-110, new
  sedition-removal.md         ← 124A removed, what replaces it
  offences-against-women.md   ← Chapter V, new protections
  ...
```

**Each article contains:**
```markdown
# Mob Lynching — BNS 2023, Clause 101(2)

## Overview
First-ever explicit criminalisation of mob lynching in Indian law...

## The Provision
When a group of 5+ persons acting in concert commits murder
on discriminatory grounds (race, caste, community, sex, etc.)...

## Punishment
- Death, OR life imprisonment, OR minimum 7 years + fine
- All members equally liable

## Comparison with IPC
IPC had no equivalent. Mob killings prosecuted under general S.302...

## Related Articles
- [[murder-and-homicide]] — general murder provisions
- [[bns-2023-overview]] — the full new code
- [[offences-against-women]] — other enhanced protections

*Sources: indian-penal-code-new.md (p.137), annotated-comparison-bns-ipc.md (p.15)*
```

**How it works:**
1. Agent reads index.md to understand all sources
2. Agent reads each source (or the first ~2000 chars for large files)
3. Agent identifies 10-30 key concepts across all sources
4. Agent writes one article per concept with cross-references
5. Agent writes the articles/index.md catalog

**Implementation:**
- New function: `compileArticles(folder, sourcesDir, authStorage, modelId)`
- Called from the cli.ts `run` command, after buildIndex, before chat
- Uses createAgentSession with read + write tools
- AGENTS.md instructs the agent on article format, backlinks, source citations
- Model: Sonnet (needs strong reasoning to synthesise across sources)

**Definition of done:**
- [ ] `run` compiles articles after index (with skip logic)
- [ ] articles/index.md is a concept catalog with one-line descriptions
- [ ] Each article has: overview, key details, source citations, related links
- [ ] Articles are cross-referenced with [[article-name]] backlinks

---

### Slice 2: Query uses articles

**What:** When articles/ exists, the agent reads articles/index.md instead of the source index.
It drills into specific articles rather than raw source files.

**The navigation flow:**
```
Agent reads articles/index.md (concept catalog)
  → Finds "mob-lynching.md" is relevant
  → Reads articles/mob-lynching.md (small, focused, pre-synthesised)
  → Answers instantly with cross-references
  → NO raw source files read
```

**Implementation:**
- Update buildQueryAgents() in query.ts
- If articles/index.md exists: inject it into AGENTS.md, tell the agent to use articles
- Fallback: if no articles, use current source-index + wiki.md behaviour
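A sketch of what the branch could look like; the real `buildQueryAgents()` in query.ts may have a different shape, so treat this as illustrative:

```typescript
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Illustrative shape of the AGENTS.md builder in query.ts
export function buildQueryAgents(kbDir: string): string {
  const articlesIndex = join(kbDir, "wiki", "articles", "index.md");
  if (existsSync(articlesIndex)) {
    // Articles exist: point the agent at the concept catalog
    return [
      "# Knowledge Base — Query Mode",
      "Answer from the concept articles. Start with the catalog below,",
      "then read only the specific articles you need.",
      "",
      readFileSync(articlesIndex, "utf8"),
    ].join("\n");
  }
  // Fallback: current behaviour (source index + wiki.md)
  return "# Knowledge Base — Query Mode\nUse source-index.md and wiki.md to answer.";
}
```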

**Definition of done:**
- [ ] Agent reads articles/index.md when available
- [ ] Agent navigates to specific articles, not source files
- [ ] Falls back to source-index when articles/ doesn't exist

---

### Slice 3: Incremental article updates

**What:** When a new file is dropped in, don't recompile everything.
Update only the 2-3 articles affected by the new content.

**Farza's quote:**
> "The most magical thing now is as I add new things, the system updates
> 2-3 different articles where it feels the context belongs, or just
> creates a new article. Like a super genius librarian."

**Flow:**
```
User drops "new-amendment-2024.pdf" into the folder
  → Watcher: parse PDF → sources/new-amendment-2024.md
  → Watcher: re-index (Haiku)
  → Watcher: read new source + articles/index.md
  → Agent: "This affects mob-lynching.md and bns-2023-overview.md"
  → Agent: updates those 2 articles + creates new-amendments-2024.md
  → Agent: updates articles/index.md catalog
```

**Implementation:**
- Update watcher.ts: after re-index, trigger incremental article update
- Agent reads: new source file + articles/index.md
- Agent decides: which articles to update, whether to create new ones
- Uses Sonnet (needs reasoning about where new content fits)

**Definition of done:**
- [ ] New file → parse → re-index → update relevant articles
- [ ] Agent updates 2-3 existing articles where content fits
- [ ] Agent creates new article if topic is genuinely new
- [ ] articles/index.md updated with any new entries

---

### Slice 4: Eval — session analysis + article refinement

**What:** Analyze session files to find quality issues and wiki gaps.
Then fix the articles automatically.

**Input:** `.llm-kb/sessions/*.jsonl` (raw conversation data)

**What eval checks:**

```
CORRECTNESS
- Citation validity: does the source text support the claim?
- Consistency: does the answer contradict the sources?

PERFORMANCE
- Query time breakdown: wiki hit vs file reads
- Most-read source files (candidates for better articles)
- Wasted reads: files read but not cited

WIKI GAPS
- Questions that needed source files but should be in articles
- Articles that are incomplete (queries needed to read past them)
- Missing articles (topics asked about with no article)

INDEX ISSUES
- Wrong file reads: agent read irrelevant files (bad index summary)
- Redundant reads: same file read multiple times
```
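The wasted-reads check needs no LLM judge; it is pure set arithmetic over a trace:

```typescript
// Files the agent read but never cited in its answer (the PERFORMANCE
// "wasted reads" check). Inputs mirror the trace fields from Phase 3.
export function wastedReads(
  filesRead: string[],
  citations: { file: string }[]
): string[] {
  const cited = new Set(citations.map((c) => c.file));
  return filesRead.filter((f) => !cited.has(f));
}
```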

**Output:** eval-report.md + automatic article patches

```markdown
# Eval Report — 2026-04-06

## Summary
15 sessions · 3 issues · 4 wiki gaps · estimated 120s saveable

## 🔴 Correctness Issues
1. Article "sedition-removal.md" says "retained" — source says "removed"
   → AUTO-FIX: patched article

## 🟡 Wiki Gaps (auto-filled)
1. "Electronic evidence certification" — asked 4x, no article
   → CREATED: articles/electronic-evidence-certification.md
2. "CrPC comparison" — asked 3x, article was incomplete
   → UPDATED: articles/crpc-comparison.md with missing sections

## 🟢 Performance Insights
- Wiki hit rate: 53% → 78% after gap fixes (estimated)
- Most-read source: indian-penal-code-new.md (12 reads)
  → Already well-covered by articles (reads are for exact quotes)
- Wasted reads: 8 across 15 sessions (32% waste rate)
```

**Implementation:**
- New command: `llm-kb eval`
- Reads session JSONL files (full conversation data)
- Code: extracts metrics (timing, file reads, citations)
- LLM judge (Haiku): checks citation validity, identifies gaps
- LLM writer (Haiku): patches articles with fixes
- Writes eval-report.md
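The code-side metrics extraction can be sketched as below; the session event shape is hypothetical and must be matched to the real `.jsonl` format:

```typescript
// HYPOTHETICAL event shape for one line of a session .jsonl file.
// Verify field names against what the session logger actually writes.
interface SessionEvent {
  type: string;   // e.g. "tool_use", "message"
  path?: string;  // file touched by a read tool call, when applicable
}

// Pure metrics over one session's events (the "Code: extracts metrics" step)
export function sessionMetrics(events: SessionEvent[]) {
  const reads = events.filter((e) => e.type === "tool_use" && e.path);
  const uniqueFiles = [...new Set(reads.map((e) => e.path as string))];
  return {
    totalReads: reads.length,
    uniqueFiles,
    // Same file read more than once (the INDEX ISSUES "redundant reads" check)
    redundantReads: reads.length - uniqueFiles.length,
  };
}
```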

**Definition of done:**
- [ ] `llm-kb eval` reads sessions and writes eval-report.md
- [ ] Flags: citation issues, consistency problems
- [ ] Identifies: wiki gaps, performance bottlenecks
- [ ] Auto-creates/patches articles for wiki gaps
- [ ] Reports estimated time savings

---

### Slice 5: The complete flywheel

With all slices done, the full flywheel:

```
┌───────────────┐
│   COMPILE     │  Proactive: articles from all sources
│  (once/incr)  │
└──────┬────────┘
       │
       ▼
┌───────────────────────┐
│ ARTICLES              │  Concept-organized, cross-referenced
│   articles/index.md   │  Agent navigates concepts, not files
│   articles/*.md       │
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ QUERY                 │  User asks question
│   → reads article     │  Agent reads 1 small article, not 5 large sources
│   → instant answer    │  Sessions logged
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ EVAL                  │  Analyzes sessions
│   → finds gaps        │  Creates missing articles
│   → fixes errors      │  Patches wrong articles
│   → measures speed    │  Reports optimization opportunities
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ NEW FILE DROPPED      │  Watcher detects new source
│   → incremental update│  Updates 2-3 relevant articles
│   → index updated     │  New knowledge integrated
└───────────┬───────────┘
            │
            └──────────── back to QUERY (faster every cycle)
```

**The compounding effect:**
- Day 1: compile articles from 9 PDFs → 15 articles
- Day 2: 10 queries → eval finds 3 gaps → 3 articles added/fixed
- Day 3: new PDF dropped → 2 articles updated
- Day 4: 20 queries → 90% answered from articles (2s avg vs 25s)
- Day 5: eval shows 95% wiki hit rate, 0 citation errors

---

## Build Order

| Slice | What | Effort | Priority |
|---|---|---|---|
| 1 | Article compiler (in `run`) | 2-3 hrs | 🔴 Do first |
| 2 | Query reads articles | 30 min | 🔴 Immediate follow-up |
| 3 | Incremental article updates (watcher) | 1-2 hrs | 🟡 This week |
| 4 | `llm-kb eval` (session analysis + auto-fix) | 2-3 hrs | 🔴 The big one |
| 5 | Full flywheel verification | Testing | 🟢 After all slices |

---

*Phase 4 spec written April 6, 2026. DeltaXY.*
*Inspired by Farzapedia (@FarzaTV) — Karpathy called it the best implementation of the LLM wiki pattern.*