llm-kb 0.4.0 → 0.4.2

Files changed (58)
  1. package/README.md +183 -42
  2. package/bin/anthropic-5TIU2EED.js +5515 -0
  3. package/bin/azure-openai-responses-ZVUVMK3G.js +190 -0
  4. package/bin/chunk-2WV6TQRI.js +4792 -0
  5. package/bin/chunk-3YMNGUZZ.js +262 -0
  6. package/bin/chunk-5PYKQQLA.js +14295 -0
  7. package/bin/chunk-65KFH7OI.js +31 -0
  8. package/bin/chunk-DHOXVEIR.js +7261 -0
  9. package/bin/chunk-EAQYK3U2.js +41 -0
  10. package/bin/chunk-IFS3OKBN.js +428 -0
  11. package/bin/chunk-LDHOKBJA.js +86 -0
  12. package/bin/chunk-SLYBG6ZQ.js +32681 -0
  13. package/bin/chunk-UEODFF7H.js +17 -0
  14. package/bin/chunk-XCXTZJGO.js +174 -0
  15. package/bin/chunk-XFV534WU.js +7056 -0
  16. package/bin/cli.js +30 -4
  17. package/bin/dist-3YH7P2QF.js +1244 -0
  18. package/bin/google-JFC43EFJ.js +371 -0
  19. package/bin/google-gemini-cli-K4XNMYDI.js +712 -0
  20. package/bin/google-vertex-Y42F254G.js +414 -0
  21. package/bin/indexer-KSYRIVVN.js +10 -0
  22. package/bin/mistral-ZU2JS5XZ.js +38406 -0
  23. package/bin/multipart-parser-CO464TZY.js +371 -0
  24. package/bin/openai-codex-responses-NW2LELBH.js +712 -0
  25. package/bin/openai-completions-TW3VKTHO.js +662 -0
  26. package/bin/openai-responses-VGL522MK.js +198 -0
  27. package/bin/src-Y22OHE3S.js +1408 -0
  28. package/package.json +6 -1
  29. package/PHASE2_SPEC.md +0 -274
  30. package/PHASE3_SPEC.md +0 -245
  31. package/PHASE4_SPEC.md +0 -358
  32. package/SPEC.md +0 -275
  33. package/plan.md +0 -300
  34. package/src/auth.ts +0 -55
  35. package/src/cli.ts +0 -257
  36. package/src/config.ts +0 -61
  37. package/src/eval.ts +0 -548
  38. package/src/indexer.ts +0 -152
  39. package/src/md-stream.ts +0 -133
  40. package/src/pdf.ts +0 -119
  41. package/src/query.ts +0 -408
  42. package/src/resolve-kb.ts +0 -19
  43. package/src/scan.ts +0 -59
  44. package/src/session-store.ts +0 -22
  45. package/src/session-watcher.ts +0 -89
  46. package/src/trace-builder.ts +0 -168
  47. package/src/tui-display.ts +0 -281
  48. package/src/utils.ts +0 -17
  49. package/src/watcher.ts +0 -87
  50. package/src/wiki-updater.ts +0 -136
  51. package/test/auth.test.ts +0 -65
  52. package/test/config.test.ts +0 -96
  53. package/test/md-stream.test.ts +0 -98
  54. package/test/resolve-kb.test.ts +0 -33
  55. package/test/scan.test.ts +0 -65
  56. package/test/trace-builder.test.ts +0 -215
  57. package/tsconfig.json +0 -14
  58. package/vitest.config.ts +0 -8
package/PHASE4_SPEC.md DELETED
@@ -1,358 +0,0 @@
1
- # llm-kb — Phase 4: Farzapedia Pattern + Eval Loop
2
-
3
- > **The data flywheel is already spinning (v0.3.0):**
4
- > Query → Answer → Wiki updated → Next query answered from wiki → Faster, cheaper, compounding.
5
- >
6
- > Phase 4 makes the flywheel bigger: proactive compilation + eval-driven refinement.
7
-
8
- ---
9
-
10
- ## The Flywheel (what we already have)
11
-
12
- ```
13
- ┌─────────────┐
14
- │ User asks │
15
- │ a question │
16
- └──────┬───────┘
17
-
18
-
19
- ┌───────────────────────┐
20
- │ Agent answers from │
21
- │ wiki (fast) or │
22
- │ source files (slow) │
23
- └───────────┬───────────┘
24
-
25
-
26
- ┌───────────────────────┐
27
- │ wiki.md updated │◄─── Haiku merges new knowledge
28
- │ (topic-organized) │ into existing wiki
29
- └───────────┬───────────┘
30
-
31
-
32
- ┌───────────────────────┐
33
- │ Next similar query │
34
- │ answered from wiki │──── 0 file reads, 2s instead of 25s
35
- └───────────────────────┘
36
- ```
37
-
38
- **Proven in production:**
39
- - First query about BNS 2023: 33s, 4 files read
40
- - Same question again: 2s, 0 files read, answered from wiki
41
- - Follow-up "tell me about mob lynching clause": instant, from wiki context
42
-
43
- ---
44
-
45
- ## What Phase 4 adds
46
-
47
- ### The problem with reactive-only wiki
48
-
49
- The wiki only knows what users have asked. If nobody asks about electronic evidence,
50
- that knowledge never makes it into the wiki. The first person to ask pays the full cost.
51
-
52
- ### The Farzapedia insight
53
-
54
- > Compile the wiki **proactively** from all sources — BEFORE anyone asks.
55
- > Then every query is fast from day one.
56
-
57
- ```
58
- Current (reactive only):
59
- Sources exist → User asks → Agent reads sources → Answers → Wiki updated
60
- Problem: first query for every topic is slow
61
-
62
- With compile (proactive + reactive):
63
- Sources exist → Compile articles → User asks → Instant answer from articles
64
- Plus: eval finds gaps → Articles refined → Even better answers
65
- ```
66
-
67
- ---
68
-
69
- ## Slices
70
-
71
- ### Slice 1: Article compiler (part of `run`)
72
-
73
- **What:** After index is built, compile concept articles from all sources.
74
- Not a separate command — just step 3.5 in the `run` flow:
75
-
76
- ```
77
- llm-kb run ./docs
78
- 1. Scan files
79
- 2. Parse PDFs
80
- 3. Build index ← Haiku summarises sources
81
- 4. Compile articles ← Sonnet synthesises concepts (NEW)
82
- 5. Start watchers
83
- 6. Start chat ← Agent reads articles, not source files
84
- ```
85
-
86
- **Skip logic (same as index):**
87
- - If `articles/` exists AND `articles/index.md` is newer than all source files → skip
88
- - If any source is newer OR articles/ missing → compile
89
- - First run always compiles. Subsequent runs are instant.
90
-
91
- **Input:**
92
- ```
93
- .llm-kb/wiki/sources/
94
- indian-penal-code-new.md (60 pages)
95
- annotated-comparison-bns-ipc.md (21 pages)
96
- evidence-act-new.md (40 pages)
97
- ...
98
- ```
99
-
100
- **Output:**
101
- ```
102
- .llm-kb/wiki/articles/
103
- index.md ← concept catalog with one-line descriptions
104
- bns-2023-overview.md ← what it is, structure, key changes
105
- murder-and-homicide.md ← Clauses 99-106, old vs new
106
- mob-lynching.md ← Clause 101(2), new provision
107
- electronic-evidence.md ← Section 65B / BSB comparison
108
- organised-crime.md ← Clauses 109-110, new
109
- sedition-removal.md ← 124A removed, what replaces it
110
- offences-against-women.md ← Chapter V, new protections
111
- ...
112
- ```
113
-
114
- **Each article contains:**
115
- ```markdown
116
- # Mob Lynching — BNS 2023, Clause 101(2)
117
-
118
- ## Overview
119
- First-ever explicit criminalisation of mob lynching in Indian law...
120
-
121
- ## The Provision
122
- When a group of 5+ persons acting in concert commits murder
123
- on discriminatory grounds (race, caste, community, sex, etc.)...
124
-
125
- ## Punishment
126
- - Death, OR life imprisonment, OR minimum 7 years + fine
127
- - All members equally liable
128
-
129
- ## Comparison with IPC
130
- IPC had no equivalent. Mob killings prosecuted under general S.302...
131
-
132
- ## Related Articles
133
- - [[murder-and-homicide]] — general murder provisions
134
- - [[bns-2023-overview]] — the full new code
135
- - [[offences-against-women]] — other enhanced protections
136
-
137
- *Sources: indian penal code - new.md (p.137), Annotated comparison (p.15)*
138
- ```
139
-
140
- **How it works:**
141
- 1. Agent reads index.md to understand all sources
142
- 2. Agent reads each source (or first ~2000 chars for large files)
143
- 3. Agent identifies 10-30 key concepts across all sources
144
- 4. Agent writes one article per concept with cross-references
145
- 5. Agent writes articles/index.md catalog
146
-
147
- **Implementation:**
148
- - New function: `compileArticles(folder, sourcesDir, authStorage, modelId)`
149
- - Called from cli.ts `run` command, after buildIndex, before chat
150
- - Uses createAgentSession with read + write tools
151
- - AGENTS.md instructs the agent on article format, backlinks, source citations
152
- - Model: Sonnet (needs strong reasoning to synthesise across sources)
153
-
154
- **Definition of done:**
155
- - [ ] `run` compiles articles after index (with skip logic)
156
- - [ ] articles/index.md is a concept catalog with one-line descriptions
157
- - [ ] Each article has: overview, key details, source citations, related links
158
- - [ ] Articles are cross-referenced with [[article-name]] backlinks
159
-
160
- ---
161
-
162
- ### Slice 2: Query uses articles
163
-
164
- **What:** When articles/ exists, the agent reads articles/index.md instead of source-index.
165
- It drills into specific articles rather than raw source files.
166
-
167
- **The navigation flow:**
168
- ```
169
- Agent reads articles/index.md (concept catalog)
170
- → Finds "mob-lynching.md" is relevant
171
- → Reads articles/mob-lynching.md (small, focused, pre-synthesised)
172
- → Answers instantly with cross-references
173
- → NO raw source files read
174
- ```
175
-
176
- **Implementation:**
177
- - Update buildQueryAgents() in query.ts
178
- - If articles/index.md exists: inject it into AGENTS.md, tell agent to use articles
179
- - Fallback: if no articles, use current source-index + wiki.md behaviour
180
-
181
- **Definition of done:**
182
- - [ ] Agent reads articles/index.md when available
183
- - [ ] Agent navigates to specific articles, not source files
184
- - [ ] Falls back to source-index when articles/ doesn't exist
185
-
186
- ---
187
-
188
- ### Slice 3: Incremental article updates
189
-
190
- **What:** When a new file is dropped in, don't recompile everything.
191
- Update only the 2-3 articles affected by the new content.
192
-
193
- **Farza's quote:**
194
- > "The most magical thing now is as I add new things, the system updates
195
- > 2-3 different articles where it feels the context belongs, or just
196
- > creates a new article. Like a super genius librarian."
197
-
198
- **Flow:**
199
- ```
200
- User drops "new-amendment-2024.pdf" into the folder
201
- → Watcher: parse PDF → sources/new-amendment-2024.md
202
- → Watcher: re-index (haiku)
203
- → Watcher: read new source + articles/index.md
204
- → Agent: "This affects mob-lynching.md and bns-2023-overview.md"
205
- → Agent: updates those 2 articles + creates new-amendments-2024.md
206
- → Agent: updates articles/index.md catalog
207
- ```
208
-
209
- **Implementation:**
210
- - Update watcher.ts: after re-index, trigger incremental article update
211
- - Agent reads: new source file + articles/index.md
212
- - Agent decides: which articles to update, whether to create new ones
213
- - Uses Sonnet (needs reasoning about where new content fits)
214
-
215
- **Definition of done:**
216
- - [ ] New file → parse → re-index → update relevant articles
217
- - [ ] Agent updates 2-3 existing articles where content fits
218
- - [ ] Agent creates new article if topic is genuinely new
219
- - [ ] articles/index.md updated with any new entries
220
-
221
- ---
222
-
223
- ### Slice 4: Eval — session analysis + article refinement
224
-
225
- **What:** Analyze session files to find quality issues and wiki gaps.
226
- Then fix the articles automatically.
227
-
228
- **Input:** `.llm-kb/sessions/*.jsonl` (raw conversation data)
229
-
230
- **What eval checks:**
231
-
232
- ```
233
- CORRECTNESS
234
- - Citation validity: does the source text support the claim?
235
- - Consistency: does the answer contradict the sources?
236
-
237
- PERFORMANCE
238
- - Query time breakdown: wiki hit vs file reads
239
- - Most-read source files (candidates for better articles)
240
- - Wasted reads: files read but not cited
241
-
242
- WIKI GAPS
243
- - Questions that needed source files but should be in articles
244
- - Articles that are incomplete (queries needed to read past them)
245
- - Missing articles (topics asked about with no article)
246
-
247
- INDEX ISSUES
248
- - Wrong file reads: agent read irrelevant files (bad index summary)
249
- - Redundant reads: same file read multiple times
250
- ```
251
-
252
- **Output:** eval-report.md + automatic article patches
253
-
254
- ```markdown
255
- # Eval Report — 2026-04-06
256
-
257
- ## Summary
258
- 15 sessions · 3 issues · 4 wiki gaps · estimated 120s saveable
259
-
260
- ## 🔴 Correctness Issues
261
- 1. Article "sedition-removal.md" says "retained" — source says "removed"
262
- → AUTO-FIX: patched article
263
-
264
- ## 🟡 Wiki Gaps (auto-filled)
265
- 1. "Electronic evidence certification" — asked 4x, no article
266
- → CREATED: articles/electronic-evidence-certification.md
267
- 2. "CrPC comparison" — asked 3x, article was incomplete
268
- → UPDATED: articles/crpc-comparison.md with missing sections
269
-
270
- ## 🟢 Performance Insights
271
- - Wiki hit rate: 53% → 78% after gap fixes (estimated)
272
- - Most-read source: indian-penal-code-new.md (12 reads)
273
- → Already well-covered by articles (reads are for exact quotes)
274
- - Wasted reads: 8 across 15 sessions (32% waste rate)
275
- ```
276
-
277
- **Implementation:**
278
- - New command: `llm-kb eval`
279
- - Reads session JSONL files (full conversation data)
280
- - Code: extracts metrics (timing, file reads, citations)
281
- - LLM judge (Haiku): checks citation validity, identifies gaps
282
- - LLM writer (Haiku): patches articles with fixes
283
- - Writes eval-report.md
284
-
285
- **Definition of done:**
286
- - [ ] `llm-kb eval` reads sessions and writes eval-report.md
287
- - [ ] Flags: citation issues, consistency problems
288
- - [ ] Identifies: wiki gaps, performance bottlenecks
289
- - [ ] Auto-creates/patches articles for wiki gaps
290
- - [ ] Reports estimated time savings
291
-
292
- ---
293
-
294
- ### Slice 5: The complete flywheel
295
-
296
- With all slices done, the full flywheel:
297
-
298
- ```
299
- ┌──────────────┐
300
- │ COMPILE │ Proactive: articles from all sources
301
- │ (once/incr) │
302
- └──────┬────────┘
303
-
304
-
305
- ┌───────────────────────┐
306
- │ ARTICLES │ Concept-organized, cross-referenced
307
- │ articles/index.md │ Agent navigates concepts, not files
308
- │ articles/*.md │
309
- └───────────┬───────────┘
310
-
311
-
312
- ┌───────────────────────┐
313
- │ QUERY │ User asks question
314
- │ → reads article │ Agent reads 1 small article, not 5 large sources
315
- │ → instant answer │ Sessions logged
316
- └───────────┬───────────┘
317
-
318
-
319
- ┌───────────────────────┐
320
- │ EVAL │ Analyzes sessions
321
- │ → finds gaps │ Creates missing articles
322
- │ → fixes errors │ Patches wrong articles
323
- │ → measures speed │ Reports optimization opportunities
324
- └───────────┬───────────┘
325
-
326
-
327
- ┌───────────────────────┐
328
- │ NEW FILE DROPPED │ Watcher detects new source
329
- │ → incremental update │ Updates 2-3 relevant articles
330
- │ → index updated │ New knowledge integrated
331
- └───────────┬───────────┘
332
-
333
- └──────────── back to QUERY (faster every cycle)
334
- ```
335
-
336
- **The compounding effect:**
337
- - Day 1: compile articles from 9 PDFs → 15 articles
338
- - Day 2: 10 queries → eval finds 3 gaps → 3 articles added/fixed
339
- - Day 3: new PDF dropped → 2 articles updated
340
- - Day 4: 20 queries → 90% answered from articles (2s avg vs 25s)
341
- - Day 5: eval shows 95% wiki hit rate, 0 citation errors
342
-
343
- ---
344
-
345
- ## Build Order
346
-
347
- | Slice | What | Effort | Priority |
348
- |---|---|---|---|
349
- | 1 | Article compiler (in `run`) | 2-3 hrs | 🔴 Do first |
350
- | 2 | Query reads articles | 30 min | 🔴 Immediate follow-up |
351
- | 3 | Incremental article updates (watcher) | 1-2 hrs | 🟡 This week |
352
- | 4 | `llm-kb eval` (session analysis + auto-fix) | 2-3 hrs | 🔴 The big one |
353
- | 5 | Full flywheel verification | Testing | 🟢 After all slices |
354
-
355
- ---
356
-
357
- *Phase 4 spec written April 6, 2026. DeltaXY.*
358
- *Inspired by Farzapedia (@FarzaTV) — Karpathy called it the best implementation of the LLM wiki pattern.*
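Note on the deleted PHASE4_SPEC.md above: the Slice 1 skip logic (compile unless `articles/index.md` exists and is newer than every source) can be sketched as a small pure function. This is an illustrative sketch, not code from the package. The name `needsCompile`, the millisecond-timestamp inputs, and the tie-breaking rule (equal timestamps count as up to date) are all assumptions.

```typescript
// Hypothetical helper illustrating the "skip logic (same as index)" rule.
// indexMtimeMs: modification time of articles/index.md, or null when the
//               articles/ directory or its index.md does not exist.
// sourceMtimesMs: modification times of every source file.
function needsCompile(
  indexMtimeMs: number | null,
  sourceMtimesMs: number[],
): boolean {
  // First run, or articles/ missing: always compile.
  if (indexMtimeMs === null) return true;
  // Compile when any source is strictly newer than the catalog.
  // Assumption: a source with the same timestamp counts as up to date.
  return sourceMtimesMs.some((mtime) => mtime > indexMtimeMs);
}
```

A caller could gather the timestamps with `fs.statSync(path).mtimeMs` and pass `null` when `articles/index.md` is absent. Keeping the decision separate from the filesystem calls makes the rule easy to unit-test.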
package/SPEC.md DELETED
@@ -1,275 +0,0 @@
1
- # llm-kb — Product Spec
2
-
3
- > **One-liner:** Drop files into a folder. Get a knowledge base you can query.
4
- > **npm:** `npx llm-kb run ./my-documents`
5
- > **Status:** Phase 1 complete. Ingest pipeline + CLI.
6
-
7
- ---
8
-
9
- ## Who Is This For
10
-
11
- A developer or technical researcher who has 20-200 documents (PDFs, spreadsheets, slide decks, notes) scattered across folders. They want to ask questions across all of them without building a RAG pipeline or setting up a vector database.
12
-
13
- **They will try this if:** it works in under 2 minutes with one command.
14
- **They will keep using it if:** the answers are good and the wiki compounds over time.
15
- **They will abandon it if:** setup is painful, it eats tokens without useful results, or it feels like a demo.
16
-
17
- ---
18
-
19
- ## What Success Looks Like
20
-
21
- ```bash
22
- npx llm-kb run ./research
23
- ```
24
-
25
- Terminal output:
26
- ```
27
- llm-kb v0.0.1
28
-
29
- Scanning ./research...
30
- Found 12 files (12 PDF)
31
- 12 parsed
32
-
33
- Building index...
34
- Index built: .llm-kb/wiki/index.md
35
-
36
- Output: ./research/.llm-kb/wiki/sources
37
-
38
- Watching for new files... (Ctrl+C to stop)
39
- ```
40
-
41
- Drop more files in while it's running. They get ingested automatically.
42
-
43
- **That's the whole first-run experience.** No config file. No API key prompt (uses Pi SDK auth). No Docker. Just point at a folder.
44
-
45
- ---
46
-
47
- ## Commands
48
-
49
- ### `llm-kb run <folder>` (Phase 1 ✅)
50
-
51
- The main command. Does everything:
52
-
53
- 1. Scans the folder for PDF files
54
- 2. Parses each PDF to markdown + JSON bounding boxes (via LiteParse)
55
- 3. Skips already-parsed files (mtime check — re-runs are instant)
56
- 4. Builds `index.md` from all parsed sources (via Pi SDK agent)
57
- 5. Starts a file watcher on the folder (new PDFs get auto-ingested + re-indexed)
58
-
59
- **Data layout it creates inside the folder:**
60
-
61
- ```
62
- ./my-documents/
63
- ├── (your original files — untouched)
64
- └── .llm-kb/
65
- └── wiki/
66
- ├── index.md
67
- └── sources/
68
- ├── report.md ← spatial text layout
69
- ├── report.json ← per-word bounding boxes
70
- └── ...
71
- ```
72
-
73
- **Key decision:** `.llm-kb/` lives inside the user's folder, not in a global location. The knowledge base is co-located with the documents. Delete the folder, delete the KB. Copy the folder to another machine, the KB comes with it.
74
-
75
- ### `llm-kb query <question>` (Phase 2)
76
-
77
- Query from the terminal without starting the web UI:
78
-
79
- ```bash
80
- llm-kb query "what are the reserve requirements?" --folder ./research
81
- llm-kb query "compare Q3 vs Q4 guidance" --folder ./research --save
82
- ```
83
-
84
- ### `llm-kb status` (Phase 2)
85
-
86
- Show what's in the knowledge base.
87
-
88
- ### `llm-kb eval` (Phase 4)
89
-
90
- Run the eval loop manually.
91
-
92
- ---
93
-
94
- ## File Type Strategy
95
-
96
- **Key architectural decision: PDF is the only file type parsed at ingest time.** All other file types are handled dynamically by the Pi SDK agent at query time using pre-bundled libraries.
97
-
98
- ### Why?
99
-
100
- - PDFs are binary, slow to parse, and need specialized libraries — worth pre-processing
101
- - Everything else (Excel, Word, PPT, CSV) — the Pi SDK agent can write a quick script to read them on demand
102
- - This eliminates 6 parser adapters, a router, and an adapter interface
103
- - The agent is smarter than a static adapter — it can decide what's relevant
104
-
105
- ### PDF Parsing (Ingest Time)
106
-
107
- | Extension | Parser | Output | Bounding Boxes |
108
- |---|---|---|---|
109
- | `.pdf` | @llamaindex/liteparse | `.md` + `.json` | ✅ Yes |
110
-
111
- ### Other File Types (Query Time — Agent Handles Dynamically)
112
-
113
- These libraries are pre-bundled in llm-kb and available to the agent via `NODE_PATH`:
114
-
115
- | Library | File Types |
116
- |---|---|
117
- | exceljs | `.xlsx`, `.xls` |
118
- | mammoth | `.docx` |
119
- | officeparser | `.pptx` |
120
-
121
- The agent's `AGENTS.md` context (injected via Pi SDK `agentsFilesOverride`) tells it which libraries are available and how to use them.
122
-
123
- ---
124
-
125
- ## OCR Strategy
126
-
127
- Page-level routing — only scanned pages get OCR, native text pages are free and instant.
128
-
129
- ```
130
- PDF Page → LiteParse classifies → native text? → keep local (free)
131
- → scanned? → route to OCR
132
- ```
133
-
134
- **OCR is off by default** (most PDFs have native text, avoids noisy Tesseract warnings).
135
-
136
- **Enable via env vars:**
137
- - `OCR_ENABLED=true` → local Tesseract.js (built into LiteParse)
138
- - `OCR_SERVER_URL=http://...` → remote Azure Document Intelligence bridge (faster, better quality)
139
-
140
- The OCR server is a separate project. llm-kb just calls it if the env var is set.
141
-
142
- ---
143
-
144
- ## Auth & Model
145
-
146
- **No API key handling in llm-kb.** Uses Pi SDK's `createAgentSession()` with defaults:
147
- - Auth from `~/.pi/agent/auth.json` (existing Pi installation)
148
- - Model from Pi's settings (whatever the user has configured)
149
- - No config file, no env var for model selection
150
-
151
- ```typescript
152
- const { session } = await createAgentSession({
153
- cwd: folder,
154
- resourceLoader: loader, // injects AGENTS.md
155
- tools: [readTool, bashTool, writeTool],
156
- sessionManager: SessionManager.inMemory(),
157
- });
158
- ```
159
-
160
- ---
161
-
162
- ## Tech Stack (Phase 1 — What's Built)
163
-
164
- ```
165
- TypeScript (strict)
166
- ├── CLI: Commander
167
- ├── Build: tsup (single bin/cli.js)
168
- ├── Dev: Bun
169
- ├── PDF parsing: @llamaindex/liteparse (local, bounding boxes)
170
- ├── OCR: Tesseract.js (via LiteParse) or remote OCR server
171
- ├── File watching: chokidar (debounced)
172
- ├── Indexing: @mariozechner/pi-coding-agent (createAgentSession)
173
- ├── Pre-bundled for agent: exceljs, mammoth, officeparser
174
- └── No database. No vector store. Files only.
175
- ```
176
-
177
- ---
178
-
179
- ## Project Structure (Actual)
180
-
181
- ```
182
- llm-kb/
183
- ├── bin/
184
- │ └── cli.js ← Built by tsup (single file)
185
- ├── src/
186
- │ ├── cli.ts ← Commander entry point
187
- │ ├── scan.ts ← Recursive folder scan + extension filter
188
- │ ├── pdf.ts ← LiteParse → .md + .json
189
- │ ├── indexer.ts ← Pi SDK agent → writes index.md
190
- │ └── watcher.ts ← chokidar file watcher (debounced)
191
- ├── package.json
192
- ├── tsconfig.json
193
- ├── plan.md ← Emergent build plan
194
- ├── README.md
195
- └── SPEC.md ← This file
196
- ```
197
-
198
- ---
199
-
200
- ## Constraints
201
-
202
- 1. **Zero config for first run.** `npx llm-kb run ./folder` must work with Pi SDK auth. No config file needed. No init step.
203
-
204
- 2. **No global state.** Everything lives in `.llm-kb/` inside the user's folder. Two different folders = two independent knowledge bases.
205
-
206
- 3. **Original files are never modified.** Reads from the folder, writes only to `.llm-kb/`.
207
-
208
- 4. **Graceful on bad files.** Corrupted PDF? Log a warning, skip it, continue. Show clean summary: `9 parsed, 1 failed`.
209
-
210
- 5. **Token-conscious.** Pi SDK uses whatever model the user has configured. Indexing reads first ~500 chars of each file.
211
-
212
- 6. **Offline-capable parsing.** PDF parsing runs locally via LiteParse. OCR is the only optional cloud dependency.
213
-
214
- 7. **Works on Windows, Mac, Linux.** Tested on Windows. All Node.js, no shell scripts.
215
-
216
- 8. **Skip up-to-date files.** Re-runs are instant — mtime check skips already-parsed PDFs.
217
-
218
- ---
219
-
220
- ## What We're NOT Building (Yet)
221
-
222
- - **Multi-user auth.** Personal/team tool. No login.
223
- - **Cloud hosting.** Runs locally. Docker later.
224
- - **Real-time collaboration.** One user at a time.
225
- - **Vector search.** If the wiki outgrows context windows, we add it. Not before.
226
- - **Custom embeddings.** No ML pipeline. The LLM reads markdown.
227
- - **Config file.** Nothing reads it yet. Add when Phase 2/3 needs it (model selection, port, etc).
228
- - **Static file adapters.** No Excel/Word/PPT adapters. Pi SDK agent handles them dynamically.
229
-
230
- ---
231
-
232
- ## Pre-Mortem: How This Fails
233
-
234
- | Failure | Why It Happened | Prevention |
235
- |---|---|---|
236
- | "Nobody tried it" | `npx` didn't work. Pi SDK not installed. | Clear prerequisites in README. |
237
- | "Tried it, too slow" | Indexing 20 PDFs took 5 minutes. | ✅ Progress bar. Skip up-to-date. Parse once. |
238
- | "Answers were bad" | Index summaries garbage → wrong files selected. | Test with real corpora. Eval loop (Phase 4). |
239
- | "Too expensive" | LLM burned tokens on indexing. | Agent reads first ~500 chars per file, not full content. |
240
- | "Broke on my files" | Encrypted PDF. 500MB file. | ✅ Graceful skip. Clean error messages. |
241
- | "Felt like a toy" | CLI only, no UI, no saved state. | Web UI in Phase 3. |
242
-
243
- ---
244
-
245
- ## Build Order (Maps to Blog Series)
246
-
247
- | Phase | What | Status |
248
- |---|---|---|
249
- | **1** | CLI + PDF parsing + indexer + watcher | ✅ Done |
250
- | **2** | Query + Research sessions + terminal query command | Next |
251
- | **3** | Web UI (chat, upload, sources, activity) | Planned |
252
- | **4** | Eval (trace logger, eval session, report) | Planned |
253
- | **5** | Docker + deploy | Planned |
254
- | **6** | Citations (bounding boxes → highlight in PDF) | Planned |
255
-
256
- ---
257
-
258
- ## Phase 1 — Definition of Done
259
-
260
- - [x] `llm-kb run ./folder` scans and parses PDFs
261
- - [x] Inline progress shows parsing status
262
- - [x] `.llm-kb/wiki/sources/` contains `.md` + `.json` per PDF
263
- - [x] `.llm-kb/wiki/index.md` generated with summary table
264
- - [x] File watcher auto-ingests new PDFs dropped into the folder
265
- - [x] Corrupt files skipped with warning, don't crash
266
- - [x] Re-runs skip up-to-date files (instant)
267
- - [x] OCR support via env var (local Tesseract or remote server)
268
- - [x] Auth via Pi SDK (no separate API key config)
269
- - [x] Works on Windows (tested), Mac/Linux (Node.js, should work)
270
- - [x] README has quickstart
271
- - [ ] Blog Part 2 written with real terminal output screenshots
272
-
273
- ---
274
-
275
- *Spec written April 4, 2026. Updated after Phase 1 build. DeltaXY.*
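Note on the deleted SPEC.md above: the OCR Strategy section says OCR is off by default and is enabled through `OCR_ENABLED=true` (local Tesseract.js) or `OCR_SERVER_URL=http://...` (remote server). A minimal sketch of that selection follows. It is not taken from the package: the `OcrMode` type, the `resolveOcrMode` name, and the precedence when both variables are set (remote wins) are assumptions, since the spec does not say which takes priority.

```typescript
// Hypothetical result type; the spec only names the two environment variables.
type OcrMode =
  | { kind: "off" }
  | { kind: "local" }
  | { kind: "remote"; url: string };

function resolveOcrMode(env: Record<string, string | undefined>): OcrMode {
  const url = env["OCR_SERVER_URL"]?.trim();
  // Assumption: a configured remote server takes precedence over local OCR.
  if (url) return { kind: "remote", url };
  // Local Tesseract.js only when explicitly enabled.
  if (env["OCR_ENABLED"] === "true") return { kind: "local" };
  // Spec: OCR is off by default.
  return { kind: "off" };
}
```

In a Node.js process this could be called as `resolveOcrMode(process.env)`. Taking the environment as a parameter keeps the function testable without mutating global state.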