llm-kb 0.0.1 → 0.2.0

package/PHASE2_SPEC.md ADDED
@@ -0,0 +1,274 @@
1
+ # llm-kb — Phase 2: Query Engine
2
+
3
+ > **Goal:** `llm-kb query "question" --folder ./research` works from the terminal.
4
+ > **Depends on:** Phase 1 (ingest pipeline — complete)
5
+ > **Blog:** Part 3 of the series
6
+
7
+ ---
8
+
9
+ ## What Success Looks Like
10
+
11
+ ```bash
12
+ llm-kb query "what are the reserve requirements?" --folder ./research
13
+ ```
14
+
15
+ ```
16
+ Reading index... 12 sources
17
+ Selected: reserve-policy.md, q3-results.md, board-deck.md
18
+ Reading 3 files...
19
+
20
+ Reserve requirements are defined in two documents:
21
+
22
+ 1. **Reserve Policy** (reserve-policy.md, p.3): Minimum reserve
23
+ ratio of 12% of total assets, reviewed quarterly.
24
+
25
+ 2. **Q3 Results** (q3-results.md, p.8): Current reserve ratio
26
+ is 14.2%, above the 12% minimum. Management notes this
27
+ provides a 2.2% buffer against regulatory changes.
28
+
29
+ Sources: reserve-policy.md (p.3), q3-results.md (p.8)
30
+ ```
31
+
32
+ That's the shape: file selection visible, citations inline, synthesis across sources.
33
+
34
+ ---
35
+
36
+ ## Two Modes
37
+
38
+ ### Query (read-only)
39
+
40
+ ```bash
41
+ llm-kb query "what changed in Q4 guidance?" --folder ./research
42
+ ```
43
+
44
+ The agent reads `index.md`, picks files, reads them, answers. **Cannot modify anything.** Tools: `createReadTool` only.
45
+
46
+ ### Research (read + write)
47
+
48
+ ```bash
49
+ llm-kb query "compare pipeline coverage to revenue target" --folder ./research --save
50
+ ```
51
+
52
+ Same as query, but the answer is also saved to `.llm-kb/wiki/outputs/`. The watcher detects the new file and re-indexes. Next query can reference the analysis.
53
+
54
+ Tools: `createReadTool` + `createWriteTool` + `createBashTool`.
55
+
56
+ The `--save` flag switches from query mode to research mode.
57
+
58
+ ---
59
+
60
+ ## Architecture
61
+
62
+ Same pattern as the indexer — a Pi SDK session with different tools:
63
+
64
+ ```typescript
65
+ import { join } from "path";
+ // Import names below are assumed to be the Pi SDK's public exports
+ // (package: @mariozechner/pi-coding-agent).
+ import {
+   createAgentSession,
+   createReadTool,
+   createWriteTool,
+   createBashTool,
+   DefaultResourceLoader,
+   SessionManager,
+   SettingsManager,
+ } from "@mariozechner/pi-coding-agent";
+
+ export async function query(
66
+ folder: string,
67
+ question: string,
68
+ options: { save?: boolean }
69
+ ) {
70
+ const sourcesDir = join(folder, ".llm-kb", "wiki", "sources");
71
+ const outputsDir = join(folder, ".llm-kb", "wiki", "outputs");
72
+
73
+ // Build AGENTS.md for query context
74
+ const agentsContent = buildQueryAgents(sourcesDir, options.save);
75
+
76
+ const loader = new DefaultResourceLoader({
77
+ cwd: folder,
78
+ agentsFilesOverride: (current) => ({
79
+ agentsFiles: [
80
+ ...current.agentsFiles,
81
+ { path: ".llm-kb/AGENTS.md", content: agentsContent },
82
+ ],
83
+ }),
84
+ });
85
+ await loader.reload();
86
+
87
+ const tools = [createReadTool(folder)];
88
+ if (options.save) {
89
+ tools.push(createWriteTool(folder), createBashTool(folder));
90
+ }
91
+
92
+ const { session } = await createAgentSession({
93
+ cwd: folder,
94
+ resourceLoader: loader,
95
+ tools,
96
+ sessionManager: SessionManager.inMemory(),
97
+ settingsManager: SettingsManager.inMemory({
98
+ compaction: { enabled: false },
99
+ }),
100
+ });
101
+
102
+ // Stream output to terminal
103
+ session.subscribe((event) => {
104
+ if (
105
+ event.type === "message_update" &&
106
+ event.assistantMessageEvent.type === "text_delta"
107
+ ) {
108
+ process.stdout.write(event.assistantMessageEvent.delta);
109
+ }
110
+ });
111
+
112
+ await session.prompt(question);
113
+ session.dispose();
114
+ }
115
+ ```
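The `buildQueryAgents` helper referenced in the code above is not spelled out in this spec. A minimal sketch of its assumed shape: list the markdown sources, then append the research-mode section when `--save` is set.

```typescript
import { existsSync, readdirSync } from "fs";

// Hypothetical body for buildQueryAgents: enumerate the .md sources and
// build the AGENTS.md text, adding the research-mode section for --save.
function buildQueryAgents(sourcesDir: string, save?: boolean): string {
  const sources = existsSync(sourcesDir)
    ? readdirSync(sourcesDir).filter((f) => f.endsWith(".md"))
    : [];
  const lines = [
    "# llm-kb Knowledge Base — Query Mode",
    "",
    "## Available sources",
    ...sources.map((f) => `- ${f}`),
  ];
  if (save) {
    lines.push(
      "",
      "## Research Mode",
      "You can save your analysis to .llm-kb/wiki/outputs/."
    );
  }
  return lines.join("\n");
}
```

The real helper would also include the answering rules and library hints shown in the next section.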
116
+
117
+ ### The Query AGENTS.md
118
+
119
+ The injected `AGENTS.md` for query mode tells the agent:
120
+
121
+ ```markdown
122
+ # llm-kb Knowledge Base — Query Mode
123
+
124
+ ## How to answer questions
125
+
126
+ 1. FIRST read .llm-kb/wiki/index.md to see all available sources
127
+ 2. Based on the question, select the most relevant source files
128
+ 3. Read those files in full (not just the first 500 chars)
129
+ 4. Answer with inline citations: (filename, page/section)
130
+ 5. If the answer requires cross-referencing, read additional files
131
+ 6. Prefer primary sources over previous analyses in outputs/
132
+
133
+ ## Available sources
134
+ (dynamically generated list of .md files in sources/)
135
+
136
+ ## Available libraries for non-PDF files
137
+ - exceljs — for .xlsx/.xls
138
+ - mammoth — for .docx
139
+ - officeparser — for .pptx
140
+ Write a quick Node.js script via bash to read these when needed.
141
+
142
+ ## Rules
143
+ - Always cite sources with filename and page number
144
+ - If you can't find the answer, say so — don't hallucinate
145
+ - Read the FULL file, not just the beginning
146
+ ```
147
+
148
+ For research mode, add:
149
+
150
+ ```markdown
151
+ ## Research Mode
152
+ You can save your analysis to .llm-kb/wiki/outputs/.
153
+ Use a descriptive filename (e.g., coverage-analysis.md).
154
+ The file watcher will detect it and update the index.
155
+ ```
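To make the "quick Node.js script" instruction concrete, here is a sketch of the throwaway script the agent might compose and run via its bash tool. The helper name is hypothetical; the exceljs calls in the generated source follow that library's documented `Workbook`, `xlsx.readFile`, and `eachRow` API.

```typescript
// Hypothetical helper: build the source of a one-off script that dumps an
// .xlsx file as sheet names plus JSON rows, for the agent to run via bash.
function excelDumpScript(file: string): string {
  return [
    'const ExcelJS = require("exceljs");',
    "(async () => {",
    "  const wb = new ExcelJS.Workbook();",
    `  await wb.xlsx.readFile(${JSON.stringify(file)});`,
    "  wb.eachSheet((sheet) => {",
    '    console.log("## " + sheet.name);',
    "    sheet.eachRow((row) => console.log(JSON.stringify(row.values)));",
    "  });",
    "})();",
  ].join("\n");
}
```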
156
+
157
+ ---
158
+
159
+ ## CLI Integration
160
+
161
+ Add `query` command to Commander:
162
+
163
+ ```typescript
164
+ import { program } from "commander";
+ import { resolve, join } from "path";
+ import { existsSync } from "fs";
+ import chalk from "chalk";
+ import { query } from "./query"; // hypothetical module path for query()
+
+ program
165
+ .command("query")
166
+ .description("Ask a question across your knowledge base")
167
+ .argument("<question>", "Your question")
168
+ .option("--folder <path>", "Path to document folder", ".")
169
+ .option("--save", "Save the answer to wiki/outputs/ (research mode)")
170
+ .action(async (question, options) => {
171
+ const folder = resolve(options.folder);
172
+
173
+ // Check if .llm-kb exists
174
+ if (!existsSync(join(folder, ".llm-kb"))) {
175
+ console.error(chalk.red("No knowledge base found. Run 'llm-kb run' first."));
176
+ process.exit(1);
177
+ }
178
+
179
+ await query(folder, question, { save: options.save });
180
+ });
181
+ ```
182
+
183
+ ---
184
+
185
+ ## Trace Logging (Prep for Eval — Phase 4)
186
+
187
+ Every query gets logged to `.llm-kb/traces/`:
188
+
189
+ ```json
190
+ {
191
+ "timestamp": "2026-04-05T14:30:00Z",
192
+ "question": "what are the reserve requirements?",
193
+ "mode": "query",
194
+ "filesRead": ["index.md", "reserve-policy.md", "q3-results.md"],
195
+ "filesAvailable": ["reserve-policy.md", "q3-results.md", "board-deck.md", "pipeline.md"],
196
+ "answer": "Reserve requirements are defined in two documents...",
197
+ "citations": [
198
+ { "file": "reserve-policy.md", "location": "p.3", "claim": "Minimum reserve ratio of 12%" },
199
+ { "file": "q3-results.md", "location": "p.8", "claim": "Current reserve ratio is 14.2%" }
200
+ ],
201
+ "tokensUsed": 3800,
202
+ "durationMs": 4200,
203
+ "model": "claude-sonnet-4"
204
+ }
205
+ ```
206
+
207
+ Implementation: wrap the session to intercept tool calls and capture which files were read. Save the trace JSON after the session completes.
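The persistence half can be sketched as follows. Field names follow the example trace above; collecting `filesRead` from intercepted tool calls is assumed to happen elsewhere.

```typescript
import { mkdirSync, writeFileSync } from "fs";
import { join } from "path";

// Shape of one trace entry (a subset of the fields shown above).
interface QueryTrace {
  timestamp: string;
  question: string;
  mode: "query" | "research";
  filesRead: string[];
  answer: string;
}

// Write the trace as pretty-printed JSON under .llm-kb/traces/,
// creating the directory on first use. Returns the file path.
function saveTrace(folder: string, trace: QueryTrace): string {
  const dir = join(folder, ".llm-kb", "traces");
  mkdirSync(dir, { recursive: true });
  const file = join(dir, `${Date.now()}.json`);
  writeFileSync(file, JSON.stringify(trace, null, 2));
  return file;
}
```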
208
+
209
+ The eval agent (Phase 4) reads these traces to check citations against sources.
210
+
211
+ ---
212
+
213
+ ## Streaming Output
214
+
215
+ Terminal query should stream — the user sees the answer appear word by word rather than waiting for the full response. The `session.subscribe()` handler writes deltas to stdout.
216
+
217
+ For the `run` command (when we add query to the web UI in Phase 3), streaming goes through the Vercel AI SDK protocol.
218
+
219
+ ---
220
+
221
+ ## Constraints
222
+
223
+ 1. **Query must work without the web server running.** `llm-kb query` is standalone — it reads `.llm-kb/` directly. No dependency on `llm-kb run`.
224
+
225
+ 2. **Read-only by default.** Query mode cannot modify files. Only `--save` enables write.
226
+
227
+ 3. **Index must exist.** If `.llm-kb/wiki/index.md` doesn't exist, error out: "No knowledge base found. Run 'llm-kb run' first."
228
+
229
+ 4. **Graceful on empty results.** If the agent can't find relevant files, it should say "I couldn't find sources relevant to this question" — not hallucinate.
230
+
231
+ 5. **Token-conscious.** The agent reads index.md (~200 tokens for 50 sources) first, then only the files it selects (3-7 typically). Don't read all sources.
232
+
233
+ ---
234
+
235
+ ## Build Order (Slices)
236
+
237
+ | Slice | What | Demoable? |
238
+ |---|---|---|
239
+ | 1 | `query` command + read-only session + streaming | ✅ Ask questions, get answers |
240
+ | 2 | `--save` flag + research mode + write to outputs/ | ✅ Answers compound in wiki |
241
+ | 3 | Trace logging (JSON per query) | Prep for eval |
242
+ | 4 | `status` command (show KB stats) | ✅ Nice-to-have |
243
+
244
+ ---
245
+
246
+ ## Definition of Done
247
+
248
+ - [ ] `llm-kb query "question" --folder ./research` returns a cited answer
249
+ - [ ] Answer streams to terminal (word by word, not all at once)
250
+ - [ ] Agent reads index.md first, then selects and reads relevant source files
251
+ - [ ] `--save` flag saves the answer to `.llm-kb/wiki/outputs/`
252
+ - [ ] Saved answers get detected by watcher and re-indexed
253
+ - [ ] Query traces logged to `.llm-kb/traces/` as JSON
254
+ - [ ] Error if no `.llm-kb/` exists ("run 'llm-kb run' first")
255
+ - [ ] Non-PDF files (Excel, Word) readable by agent via bundled libraries
256
+ - [ ] Blog Part 3 written with real terminal output
257
+
258
+ ---
259
+
260
+ ## What This Enables
261
+
262
+ With query working, the demo becomes:
263
+
264
+ ```bash
265
+ npx llm-kb run ./my-documents # ingest
266
+ llm-kb query "what changed?" # ask
267
+ llm-kb query "compare X vs Y" --save # research (compounds)
268
+ ```
269
+
270
+ Three commands. Ingest → Query → Research. That's a product, not a script.
271
+
272
+ ---
273
+
274
+ *Phase 2 spec written April 4, 2026. DeltaXY.*
package/README.md CHANGED
@@ -1,21 +1,133 @@
1
1
  # llm-kb
2
2
 
3
- LLM-powered knowledge base. Drop documents, build a wiki, ask questions.
3
+ Drop files into a folder. Get a knowledge base you can query.
4
4
 
5
5
  Inspired by [Karpathy's LLM Knowledge Bases](https://x.com/karpathy/status/2039805659525644595).
6
6
 
7
+ ## Quick Start
8
+
7
9
  ```bash
8
- npx llm-kb run ./my-documents
10
+ npm install -g llm-kb
11
+ llm-kb run ./my-documents
9
12
  ```
10
13
 
14
+ That's it. Your PDFs get parsed to markdown, an index is built, and a file watcher keeps it up to date.
15
+
16
+ ### Prerequisites
17
+
18
+ - **Node.js 18+**
19
+ - **Pi SDK** installed and authenticated (`npm install -g @mariozechner/pi-coding-agent` + run `pi` once to set up auth)
20
+
21
+ Pi handles the LLM auth — no separate API key configuration needed.
22
+
11
23
  ## What It Does
12
24
 
13
- - **Ingest** — drop PDFs, Excel, Word, PowerPoint, images, or text into a folder
14
- - **Parse** — automatically converts to markdown (LiteParse for PDFs, ExcelJS for spreadsheets, Mammoth for Word)
15
- - **Index** — LLM reads all sources, maintains an index with summaries and topics
16
- - **Query** — ask questions, get answers with citations
17
- - **Research** — answers saved back to the wiki, compounding knowledge
18
- - **Eval** — checks answers against sources, reports failures
25
+ ### Ingest
26
+
27
+ ```bash
28
+ llm-kb run ./my-documents
29
+ ```
30
+
31
+ ```
32
+ llm-kb v0.2.0
33
+
34
+ Scanning ./my-documents...
35
+ Found 9 files (9 PDF)
36
+ 9 parsed
37
+
38
+ Building index...
39
+ Index built: .llm-kb/wiki/index.md
40
+
41
+ Output: ./my-documents/.llm-kb/wiki/sources
42
+
43
+ Watching for new files... (Ctrl+C to stop)
44
+ ```
45
+
46
+ 1. **Scans** the folder for PDFs
47
+ 2. **Parses** each PDF to markdown + bounding boxes (using [LiteParse](https://github.com/run-llama/liteparse))
48
+ 3. **Builds an index** — Pi SDK agent reads all sources and writes `index.md` with summaries
49
+ 4. **Watches** — drop a new PDF in while it's running, it gets parsed and indexed automatically
50
+
51
+ ### Query
52
+
53
+ ```bash
54
+ # From inside the documents folder (auto-detects .llm-kb/)
55
+ llm-kb query "what are the key findings?"
56
+
57
+ # From anywhere, with explicit folder
58
+ llm-kb query "compare Q3 vs Q4" --folder ./my-documents
59
+
60
+ # Research mode — saves the answer to wiki/outputs/ and re-indexes
61
+ llm-kb query "summarize all revenue data" --save
62
+ ```
63
+
64
+ The agent reads `index.md`, selects relevant files, and streams a cited answer to the terminal.
65
+
66
+ **Query mode** — read-only. The agent can only read your files.
67
+ **Research mode** (`--save`) — read + write + bash. The agent saves answers to `outputs/`, re-indexes, and can write scripts to read Excel/Word files. Answers compound over time.
68
+
69
+ ### What It Creates
70
+
71
+ ```
72
+ ./my-documents/
73
+ ├── (your files — untouched)
74
+ └── .llm-kb/
75
+ └── wiki/
76
+ ├── index.md ← summary of all sources
77
+ └── sources/
78
+ ├── report.md ← parsed text (spatial layout)
79
+ ├── report.json ← bounding boxes (for citations)
80
+ └── ...
81
+ ```
82
+
83
+ Your original files are never modified. Delete `.llm-kb/` to start fresh.
84
+
85
+ ## OCR for Scanned PDFs
86
+
87
+ Most PDFs have native text — they just work. For scanned PDFs:
88
+
89
+ **Local (default when enabled):**
90
+ ```bash
91
+ OCR_ENABLED=true llm-kb run ./my-documents
92
+ ```
93
+ Uses Tesseract.js (built-in, slower but works everywhere).
94
+
95
+ **Remote OCR server (faster, better quality):**
96
+ ```bash
97
+ OCR_SERVER_URL="http://localhost:8080/ocr?key=YOUR_KEY" llm-kb run ./my-documents
98
+ ```
99
+ Routes scanned pages to an Azure Document Intelligence bridge. Native-text pages are still processed locally (free).
100
+
101
+ ## Non-PDF Files
102
+
103
+ PDFs are parsed at ingest time. Other file types (Excel, Word, PowerPoint, CSV, images) are handled dynamically by the Pi SDK agent at query time — it writes quick scripts using pre-bundled libraries:
104
+
105
+ | Library | File Types |
106
+ |---|---|
107
+ | exceljs | `.xlsx`, `.xls` |
108
+ | mammoth | `.docx` |
109
+ | officeparser | `.pptx` |
110
+
111
+ No separate install needed — all bundled with llm-kb.
112
+
113
+ ## How It Works
114
+
115
+ - **PDF parsing** — `@llamaindex/liteparse` extracts text with spatial layout + per-word bounding boxes. Runs locally, no cloud calls.
116
+ - **Indexing** — Pi SDK `createAgentSession` reads each source and generates a summary table in `index.md`.
117
+ - **File watching** — `chokidar` watches the folder. New/changed PDFs trigger re-parse + re-index (debounced for batch drops).
118
+ - **Auth** — uses Pi SDK's auth storage (`~/.pi/agent/auth.json`). No API keys in your project.
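The debounce around re-indexing can be sketched like this (illustrative only; the real watcher wires this to chokidar events):

```typescript
// Collapse a burst of file events into a single rebuild: each call resets
// the timer, and fn fires once after ms of quiet.
function debounce(fn: () => void, ms: number): () => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return () => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(fn, ms);
  };
}

// Usage sketch: chokidar's add/change handlers would call triggerReindex().
const triggerReindex = debounce(() => console.log("re-indexing..."), 500);
```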
119
+
120
+ ## Development
121
+
122
+ ```bash
123
+ git clone https://github.com/satish860/llm-kb
124
+ cd llm-kb
125
+ bun install
126
+ bun run build
127
+ npm link
128
+
129
+ llm-kb run ./test-folder
130
+ ```
19
131
 
20
132
  ## Tutorial
21
133
 
package/SPEC.md ADDED
@@ -0,0 +1,275 @@
1
+ # llm-kb — Product Spec
2
+
3
+ > **One-liner:** Drop files into a folder. Get a knowledge base you can query.
4
+ > **npm:** `npx llm-kb run ./my-documents`
5
+ > **Status:** Phase 1 complete. Ingest pipeline + CLI.
6
+
7
+ ---
8
+
9
+ ## Who Is This For
10
+
11
+ A developer or technical researcher who has 20-200 documents (PDFs, spreadsheets, slide decks, notes) scattered across folders. They want to ask questions across all of them without building a RAG pipeline or setting up a vector database.
12
+
13
+ **They will try this if:** it works in under 2 minutes with one command.
14
+ **They will keep using it if:** the answers are good and the wiki compounds over time.
15
+ **They will abandon it if:** setup is painful, it eats tokens without useful results, or it feels like a demo.
16
+
17
+ ---
18
+
19
+ ## What Success Looks Like
20
+
21
+ ```bash
22
+ npx llm-kb run ./research
23
+ ```
24
+
25
+ Terminal output:
26
+ ```
27
+ llm-kb v0.0.1
28
+
29
+ Scanning ./research...
30
+ Found 12 files (12 PDF)
31
+ 12 parsed
32
+
33
+ Building index...
34
+ Index built: .llm-kb/wiki/index.md
35
+
36
+ Output: ./research/.llm-kb/wiki/sources
37
+
38
+ Watching for new files... (Ctrl+C to stop)
39
+ ```
40
+
41
+ Drop more files in while it's running. They get ingested automatically.
42
+
43
+ **That's the whole first-run experience.** No config file. No API key prompt (uses Pi SDK auth). No Docker. Just point at a folder.
44
+
45
+ ---
46
+
47
+ ## Commands
48
+
49
+ ### `llm-kb run <folder>` (Phase 1 ✅)
50
+
51
+ The main command. Does everything:
52
+
53
+ 1. Scans the folder for PDF files
54
+ 2. Parses each PDF to markdown + JSON bounding boxes (via LiteParse)
55
+ 3. Skips already-parsed files (mtime check — re-runs are instant)
56
+ 4. Builds `index.md` from all parsed sources (via Pi SDK agent)
57
+ 5. Starts a file watcher on the folder (new PDFs get auto-ingested + re-indexed)
58
+
59
+ **Data layout it creates inside the folder:**
60
+
61
+ ```
62
+ ./my-documents/
63
+ ├── (your original files — untouched)
64
+ └── .llm-kb/
65
+ └── wiki/
66
+ ├── index.md
67
+ └── sources/
68
+ ├── report.md ← spatial text layout
69
+ ├── report.json ← per-word bounding boxes
70
+ └── ...
71
+ ```
72
+
73
+ **Key decision:** `.llm-kb/` lives inside the user's folder, not in a global location. The knowledge base is co-located with the documents. Delete the folder, delete the KB. Copy the folder to another machine, the KB comes with it.
74
+
75
+ ### `llm-kb query <question>` (Phase 2)
76
+
77
+ Query from the terminal without starting the web UI:
78
+
79
+ ```bash
80
+ llm-kb query "what are the reserve requirements?" --folder ./research
81
+ llm-kb query "compare Q3 vs Q4 guidance" --folder ./research --save
82
+ ```
83
+
84
+ ### `llm-kb status` (Phase 2)
85
+
86
+ Show what's in the knowledge base.
87
+
88
+ ### `llm-kb eval` (Phase 4)
89
+
90
+ Run the eval loop manually.
91
+
92
+ ---
93
+
94
+ ## File Type Strategy
95
+
96
+ **Key architectural decision: PDF is the only file type parsed at ingest time.** All other file types are handled dynamically by the Pi SDK agent at query time using pre-bundled libraries.
97
+
98
+ ### Why?
99
+
100
+ - PDFs are binary, slow to parse, and need specialized libraries — worth pre-processing
101
+ - Everything else (Excel, Word, PPT, CSV) — the Pi SDK agent can write a quick script to read them on demand
102
+ - This eliminates 6 parser adapters, a router, and an adapter interface
103
+ - The agent is smarter than a static adapter — it can decide what's relevant
104
+
105
+ ### PDF Parsing (Ingest Time)
106
+
107
+ | Extension | Parser | Output | Bounding Boxes |
108
+ |---|---|---|---|
109
+ | `.pdf` | @llamaindex/liteparse | `.md` + `.json` | ✅ Yes |
110
+
111
+ ### Other File Types (Query Time — Agent Handles Dynamically)
112
+
113
+ These libraries are pre-bundled in llm-kb and available to the agent via `NODE_PATH`:
114
+
115
+ | Library | File Types |
116
+ |---|---|
117
+ | exceljs | `.xlsx`, `.xls` |
118
+ | mammoth | `.docx` |
119
+ | officeparser | `.pptx` |
120
+
121
+ The agent's `AGENTS.md` context (injected via Pi SDK `agentsFilesOverride`) tells it which libraries are available and how to use them.
122
+
123
+ ---
124
+
125
+ ## OCR Strategy
126
+
127
+ Page-level routing — only scanned pages get OCR, native text pages are free and instant.
128
+
129
+ ```
130
+ PDF Page → LiteParse classifies → native text? → keep local (free)
131
+ → scanned? → route to OCR
132
+ ```
133
+
134
+ **OCR is off by default** (most PDFs have native text, avoids noisy Tesseract warnings).
135
+
136
+ **Enable via env vars:**
137
+ - `OCR_ENABLED=true` → local Tesseract.js (built into LiteParse)
138
+ - `OCR_SERVER_URL=http://...` → remote Azure Document Intelligence bridge (faster, better quality)
139
+
140
+ The OCR server is a separate project. llm-kb just calls it if the env var is set.
141
+
142
+ ---
143
+
144
+ ## Auth & Model
145
+
146
+ **No API key handling in llm-kb.** Uses Pi SDK's `createAgentSession()` with defaults:
147
+ - Auth from `~/.pi/agent/auth.json` (existing Pi installation)
148
+ - Model from Pi's settings (whatever the user has configured)
149
+ - No config file, no env var for model selection
150
+
151
+ ```typescript
152
+ const { session } = await createAgentSession({
153
+ cwd: folder,
154
+ resourceLoader: loader, // injects AGENTS.md
155
+ tools: [readTool, bashTool, writeTool],
156
+ sessionManager: SessionManager.inMemory(),
157
+ });
158
+ ```
159
+
160
+ ---
161
+
162
+ ## Tech Stack (Phase 1 — What's Built)
163
+
164
+ ```
165
+ TypeScript (strict)
166
+ ├── CLI: Commander
167
+ ├── Build: tsup (single bin/cli.js)
168
+ ├── Dev: Bun
169
+ ├── PDF parsing: @llamaindex/liteparse (local, bounding boxes)
170
+ ├── OCR: Tesseract.js (via LiteParse) or remote OCR server
171
+ ├── File watching: chokidar (debounced)
172
+ ├── Indexing: @mariozechner/pi-coding-agent (createAgentSession)
173
+ ├── Pre-bundled for agent: exceljs, mammoth, officeparser
174
+ └── No database. No vector store. Files only.
175
+ ```
176
+
177
+ ---
178
+
179
+ ## Project Structure (Actual)
180
+
181
+ ```
182
+ llm-kb/
183
+ ├── bin/
184
+ │ └── cli.js ← Built by tsup (single file)
185
+ ├── src/
186
+ │ ├── cli.ts ← Commander entry point
187
+ │ ├── scan.ts ← Recursive folder scan + extension filter
188
+ │ ├── pdf.ts ← LiteParse → .md + .json
189
+ │ ├── indexer.ts ← Pi SDK agent → writes index.md
190
+ │ └── watcher.ts ← chokidar file watcher (debounced)
191
+ ├── package.json
192
+ ├── tsconfig.json
193
+ ├── plan.md ← Emergent build plan
194
+ ├── README.md
195
+ └── SPEC.md ← This file
196
+ ```
197
+
198
+ ---
199
+
200
+ ## Constraints
201
+
202
+ 1. **Zero config for first run.** `npx llm-kb run ./folder` must work with Pi SDK auth. No config file needed. No init step.
203
+
204
+ 2. **No global state.** Everything lives in `.llm-kb/` inside the user's folder. Two different folders = two independent knowledge bases.
205
+
206
+ 3. **Original files are never modified.** Reads from the folder, writes only to `.llm-kb/`.
207
+
208
+ 4. **Graceful on bad files.** Corrupted PDF? Log a warning, skip it, continue. Show clean summary: `9 parsed, 1 failed`.
209
+
210
+ 5. **Token-conscious.** Pi SDK uses whatever model the user has configured. Indexing reads first ~500 chars of each file.
211
+
212
+ 6. **Offline-capable parsing.** PDF parsing runs locally via LiteParse. OCR is the only optional cloud dependency.
213
+
214
+ 7. **Works on Windows, Mac, Linux.** Tested on Windows. All Node.js, no shell scripts.
215
+
216
+ 8. **Skip up-to-date files.** Re-runs are instant — mtime check skips already-parsed PDFs.
217
+
218
+ ---
219
+
220
+ ## What We're NOT Building (Yet)
221
+
222
+ - **Multi-user auth.** Personal/team tool. No login.
223
+ - **Cloud hosting.** Runs locally. Docker later.
224
+ - **Real-time collaboration.** One user at a time.
225
+ - **Vector search.** If the wiki outgrows context windows, we add it. Not before.
226
+ - **Custom embeddings.** No ML pipeline. The LLM reads markdown.
227
+ - **Config file.** Nothing reads it yet. Add when Phase 2/3 needs it (model selection, port, etc).
228
+ - **Static file adapters.** No Excel/Word/PPT adapters. Pi SDK agent handles them dynamically.
229
+
230
+ ---
231
+
232
+ ## Pre-Mortem: How This Fails
233
+
234
+ | Failure | Why It Happened | Prevention |
235
+ |---|---|---|
236
+ | "Nobody tried it" | `npx` didn't work. Pi SDK not installed. | Clear prerequisites in README. |
237
+ | "Tried it, too slow" | Indexing 20 PDFs took 5 minutes. | ✅ Progress bar. Skip up-to-date. Parse once. |
238
+ | "Answers were bad" | Index summaries garbage → wrong files selected. | Test with real corpora. Eval loop (Phase 4). |
239
+ | "Too expensive" | LLM burned tokens on indexing. | Agent reads the first ~500 chars per file, not full content. |
240
+ | "Broke on my files" | Encrypted PDF. 500MB file. | ✅ Graceful skip. Clean error messages. |
241
+ | "Felt like a toy" | CLI only, no UI, no saved state. | Web UI in Phase 3. |
242
+
243
+ ---
244
+
245
+ ## Build Order (Maps to Blog Series)
246
+
247
+ | Phase | What | Status |
248
+ |---|---|---|
249
+ | **1** | CLI + PDF parsing + indexer + watcher | ✅ Done |
250
+ | **2** | Query + Research sessions + terminal query command | Next |
251
+ | **3** | Web UI (chat, upload, sources, activity) | Planned |
252
+ | **4** | Eval (trace logger, eval session, report) | Planned |
253
+ | **5** | Docker + deploy | Planned |
254
+ | **6** | Citations (bounding boxes → highlight in PDF) | Planned |
255
+
256
+ ---
257
+
258
+ ## Phase 1 — Definition of Done
259
+
260
+ - [x] `llm-kb run ./folder` scans and parses PDFs
261
+ - [x] Inline progress shows parsing status
262
+ - [x] `.llm-kb/wiki/sources/` contains `.md` + `.json` per PDF
263
+ - [x] `.llm-kb/wiki/index.md` generated with summary table
264
+ - [x] File watcher auto-ingests new PDFs dropped into the folder
265
+ - [x] Corrupt files skipped with warning, don't crash
266
+ - [x] Re-runs skip up-to-date files (instant)
267
+ - [x] OCR support via env var (local Tesseract or remote server)
268
+ - [x] Auth via Pi SDK (no separate API key config)
269
+ - [x] Works on Windows (tested), Mac/Linux (Node.js, should work)
270
+ - [x] README has quickstart
271
+ - [ ] Blog Part 2 written with real terminal output screenshots
272
+
273
+ ---
274
+
275
+ *Spec written April 4, 2026. Updated after Phase 1 build. DeltaXY.*