llm-kb 0.2.0 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/PHASE3_SPEC.md ADDED
# llm-kb — Phase 3: Auth Fix + Eval Loop + LLM Config

> **Priority 1:** Auth fix — users are bouncing because Pi isn't configured
> **Priority 2:** Eval loop — the differentiator nobody else has
> **Priority 3:** LLM config — let users pick models
> **Blog:** Part 4 of the series (eval loop)

---

## 1. Auth Fix (URGENT)

Users run `npx llm-kb run` and hit a wall because Pi SDK isn't installed or configured. 117 people saved the LinkedIn post — they're coming back soon.

### The Flow

```
User runs `npx llm-kb run ./docs`

├─ Pi SDK auth exists (~/.pi/agent/auth.json)?
│    → Use it. Done.

├─ ANTHROPIC_API_KEY env var set?
│    → Configure Pi SDK programmatically. Done.

└─ Neither?
     → Show clear error:

       No LLM authentication found.

       Option 1: Install Pi SDK (recommended)
         npm install -g @mariozechner/pi-coding-agent
         pi

       Option 2: Set your Anthropic API key
         export ANTHROPIC_API_KEY=sk-ant-...
```

### Implementation

Check auth before creating any session. Add to `cli.ts` or a new `auth.ts`:

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";
import { homedir } from "node:os";

export function checkAuth(): { ok: boolean; method: string } {
  // Check Pi SDK auth
  const piAuthPath = join(homedir(), ".pi", "agent", "auth.json");
  if (existsSync(piAuthPath)) {
    return { ok: true, method: "pi-sdk" };
  }

  // Check ANTHROPIC_API_KEY
  if (process.env.ANTHROPIC_API_KEY) {
    return { ok: true, method: "api-key" };
  }

  return { ok: false, method: "none" };
}
```

If method is `"api-key"`, configure Pi SDK's settings programmatically so `createAgentSession` works with the env var.
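Pi SDK's programmatic settings API isn't specified here, so one low-risk approach is to write the same `auth.json` the SDK already looks for. A sketch; the file schema below is an assumption and must be checked against what `pi` actually writes:

```typescript
import { mkdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";
import { homedir } from "node:os";

// ASSUMPTION: Pi SDK reads ~/.pi/agent/auth.json with an { anthropic: { apiKey } }
// shape. Verify against a real file produced by running `pi` before shipping this.
export function configureFromEnv(
  baseDir: string = join(homedir(), ".pi", "agent")
): string | null {
  const key = process.env.ANTHROPIC_API_KEY;
  if (!key) return null;
  mkdirSync(baseDir, { recursive: true });
  const authPath = join(baseDir, "auth.json");
  writeFileSync(authPath, JSON.stringify({ anthropic: { apiKey: key } }, null, 2));
  return authPath;
}
```

If Pi SDK exposes a real configuration call, prefer that over writing its files directly.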

### Definition of Done
- [ ] `ANTHROPIC_API_KEY=sk-... npx llm-kb run ./docs` works without Pi installed
- [ ] Pi SDK auth works as before (no regression)
- [ ] Clear error message when neither is available
- [ ] README updated with both auth options

---

## 2. LLM Configuration

### Config File

Auto-generated on first run at `.llm-kb/config.json`:

```json
{
  "indexModel": "claude-haiku-3-5",
  "queryModel": "claude-sonnet-4-20250514",
  "provider": "anthropic"
}
```

### Env Var Overrides

```bash
LLM_KB_INDEX_MODEL=claude-haiku-3-5 llm-kb run ./docs
LLM_KB_QUERY_MODEL=claude-sonnet-4-20250514 llm-kb query "question"
```

### Priority

```
1. Env var (LLM_KB_INDEX_MODEL, LLM_KB_QUERY_MODEL)
2. Config file (.llm-kb/config.json)
3. Defaults (Haiku for indexing, Sonnet for query)
```
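The priority chain above can live in one small pure function. A sketch using the env var names and defaults from this spec:

```typescript
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

const DEFAULTS = {
  indexModel: "claude-haiku-3-5",
  queryModel: "claude-sonnet-4-20250514",
};

export function resolveModel(kind: "index" | "query", folder: string): string {
  // 1. Env var wins
  const env = process.env[kind === "index" ? "LLM_KB_INDEX_MODEL" : "LLM_KB_QUERY_MODEL"];
  if (env) return env;

  // 2. Then the config file
  const configPath = join(folder, ".llm-kb", "config.json");
  if (existsSync(configPath)) {
    const config = JSON.parse(readFileSync(configPath, "utf8"));
    const fromFile = kind === "index" ? config.indexModel : config.queryModel;
    if (fromFile) return fromFile;
  }

  // 3. Defaults: cheap model for indexing, strong model for queries
  return kind === "index" ? DEFAULTS.indexModel : DEFAULTS.queryModel;
}
```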

### Why This Matters

- Haiku for indexing is 10x cheaper than Sonnet — users shouldn't pay Sonnet prices for one-line summaries
- Some users want GPT or local models — provider config enables that later
- Config file is portable — `.llm-kb/` travels with the documents

### Definition of Done
- [ ] `config.json` auto-generated on first run
- [ ] Index uses cheap model (Haiku), query uses strong model (Sonnet) by default
- [ ] Env vars override config file
- [ ] `llm-kb status` shows current model config

---

## 3. Eval Loop

### What Gets Traced

Every query logs a JSON file to `.llm-kb/traces/`:

```json
{
  "id": "2026-04-06T14-30-00-query",
  "timestamp": "2026-04-06T14:30:00Z",
  "question": "what are the reserve requirements?",
  "mode": "query",
  "filesRead": ["index.md", "reserve-policy.md", "q3-results.md"],
  "filesAvailable": ["reserve-policy.md", "q3-results.md", "board-deck.md", "pipeline.md"],
  "filesSkipped": ["board-deck.md", "pipeline.md"],
  "answer": "Reserve requirements are defined in two documents...",
  "citations": [
    { "file": "reserve-policy.md", "location": "p.3", "claim": "Minimum reserve ratio of 12%" },
    { "file": "q3-results.md", "location": "p.8", "claim": "Current reserve ratio is 14.2%" }
  ],
  "durationMs": 4200
}
```

### How to Capture Traces

Wrap the session to intercept tool calls:

```typescript
// Track which files the agent reads
const filesRead: string[] = [];

session.subscribe((event) => {
  if (event.type === "tool_use") {
    // Check if it's a read tool call on a source file.
    // extractPathFromToolCall is a helper to implement: pull the
    // file path out of the tool call's arguments.
    const path = extractPathFromToolCall(event);
    if (path && !filesRead.includes(path)) {
      filesRead.push(path);
    }
  }
});
```

After the session completes, write the trace JSON.
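Writing the trace itself is plain file I/O. A sketch, assuming one JSON file per query id under `.llm-kb/traces/` (the `Trace` shape mirrors the example above):

```typescript
import { mkdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Shape mirrors the trace JSON example in this spec
interface Citation { file: string; location: string; claim: string }
interface Trace {
  id: string;
  timestamp: string;
  question: string;
  mode: string;
  filesRead: string[];
  filesAvailable: string[];
  filesSkipped: string[];
  answer: string;
  citations: Citation[];
  durationMs: number;
}

export function writeTrace(folder: string, trace: Trace): string {
  const dir = join(folder, ".llm-kb", "traces");
  mkdirSync(dir, { recursive: true }); // the first trace creates the directory
  const path = join(dir, `${trace.id}.json`);
  writeFileSync(path, JSON.stringify(trace, null, 2));
  return path;
}
```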

### The Eval Command

```bash
llm-kb eval --folder ./research
llm-kb eval --folder ./research --last 20   # only check the last 20 queries
```

The eval agent is a Pi SDK session (read-only) that:

1. Reads trace files from `.llm-kb/traces/`
2. For each trace, checks:
   - **Citation validity** — does the cited file contain the claimed text?
   - **Missing sources** — were any skipped files actually relevant?
   - **Answer consistency** — does the answer contradict the cited sources?
3. Writes a report to `.llm-kb/wiki/outputs/eval-report.md`
4. The watcher detects the report and re-indexes
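Selecting which traces to check (`--last 20`) needs no LLM. A sketch, assuming trace filenames start with their timestamped ids, so a lexicographic sort is chronological:

```typescript
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Pure helper: timestamped ids sort chronologically, so "last N" is a sorted slice
export function selectLast(files: string[], last?: number): string[] {
  const sorted = [...files].sort();
  return last ? sorted.slice(-last) : sorted;
}

export function loadTraces(folder: string, last?: number): unknown[] {
  const dir = join(folder, ".llm-kb", "traces");
  const files = selectLast(
    readdirSync(dir).filter((f) => f.endsWith(".json")),
    last
  );
  return files.map((f) => JSON.parse(readFileSync(join(dir, f), "utf8")));
}
```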

### The Eval AGENTS.md

```markdown
# llm-kb Knowledge Base — Eval Mode

## Your job
Read query traces from .llm-kb/traces/ and check answer quality.

## For each trace, check:
1. Citation validity — read the cited source file. Does it actually
   contain the claimed text at the claimed location?
2. Missing sources — read the index summary for each skipped file.
   Given the question, should any skipped file have been read?
3. Consistency — does the answer contradict anything in the
   cited sources?

## Output
Write .llm-kb/wiki/outputs/eval-report.md with:
- Summary: X traces checked, Y issues found
- Per-trace findings (only flag issues, skip clean traces)
- Recommendations (e.g., "update summary for file X")
```

### Status Command

```bash
llm-kb status --folder ./research
```

```
Knowledge Base: ./research/.llm-kb/
Sources:  12 files (8 PDF, 2 XLSX, 1 DOCX, 1 TXT)
Index:    12 entries, last updated 2 min ago
Outputs:  3 saved research answers
Traces:   47 queries logged
Model:    claude-sonnet-4 (query), claude-haiku-3-5 (index)
Auth:     Pi SDK
```
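The counts in this output come from plain filesystem calls. A sketch for two of the lines, using the directory layout assumed elsewhere in this spec:

```typescript
import { existsSync, readdirSync } from "node:fs";
import { join } from "node:path";

// Count files in a directory, optionally filtered by extension; 0 if it doesn't exist
function count(dir: string, ext?: string): number {
  if (!existsSync(dir)) return 0;
  return readdirSync(dir).filter((f) => !ext || f.endsWith(ext)).length;
}

export function gatherStatus(folder: string): { traces: number; outputs: number } {
  const kb = join(folder, ".llm-kb");
  return {
    traces: count(join(kb, "traces"), ".json"),      // "Traces: N queries logged"
    outputs: count(join(kb, "wiki", "outputs"), ".md"), // "Outputs: N saved research answers"
  };
}
```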

---

## Build Order (Slices)

| Slice | What | Urgency |
|---|---|---|
| 1 | Auth check + ANTHROPIC_API_KEY fallback | 🔴 NOW — users bouncing |
| 2 | Config file (model selection) | 🟡 This week |
| 3 | Trace logging (JSON per query) | 🟡 This week |
| 4 | `status` command | 🟢 Nice to have |
| 5 | `eval` command + eval session | 🟡 This week |
| 6 | Blog Part 4 (eval loop) | After code works |

---

## Definition of Done (Full Phase 3)

- [ ] `ANTHROPIC_API_KEY` works without Pi SDK installed
- [ ] Clear error when no auth found
- [ ] Config file with model selection (index vs query model)
- [ ] Every query logs a trace to `.llm-kb/traces/`
- [ ] `llm-kb eval` checks citations and writes report
- [ ] `llm-kb status` shows KB stats + config
- [ ] README updated with auth options + eval command
- [ ] Blog Part 4 written with real eval output

---

*Phase 3 spec written April 5, 2026. DeltaXY.*
package/PHASE4_SPEC.md ADDED
# llm-kb — Phase 4: Farzapedia Pattern + Eval Loop

> **The data flywheel is already spinning (v0.3.0):**
> Query → Answer → Wiki updated → Next query answered from wiki → Faster, cheaper, compounding.
>
> Phase 4 makes the flywheel bigger: proactive compilation + eval-driven refinement.

---

## The Flywheel (what we already have)

```
┌──────────────┐
│  User asks   │
│  a question  │
└──────┬───────┘
       │
       ▼
┌───────────────────────┐
│ Agent answers from    │
│ wiki (fast) or        │
│ source files (slow)   │
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ wiki.md updated       │◄─── Haiku merges new knowledge
│ (topic-organized)     │     into existing wiki
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ Next similar query    │
│ answered from wiki    │──── 0 file reads, 2s instead of 25s
└───────────────────────┘
```

**Proven in production:**
- First query about BNS 2023: 33s, 4 files read
- Same question again: 2s, 0 files read, answered from wiki
- Follow-up "tell me about mob lynching clause": instant, from wiki context

---

## What Phase 4 adds

### The problem with reactive-only wiki

The wiki only knows what users have asked. If nobody asks about electronic evidence, that knowledge never makes it into the wiki. The first person to ask pays the full cost.

### The Farzapedia insight

> Compile the wiki **proactively** from all sources — BEFORE anyone asks.
> Then every query is fast from day one.

```
Current (reactive only):
  Sources exist → User asks → Agent reads sources → Answers → Wiki updated
  Problem: first query for every topic is slow

With compile (proactive + reactive):
  Sources exist → Compile articles → User asks → Instant answer from articles
  Plus: eval finds gaps → Articles refined → Even better answers
```

---

## Slices

### Slice 1: Article compiler (part of `run`)

**What:** After index is built, compile concept articles from all sources.
Not a separate command — just step 3.5 in the `run` flow:

```
llm-kb run ./docs
  1. Scan files
  2. Parse PDFs
  3. Build index       ← Haiku summarises sources
  4. Compile articles  ← Sonnet synthesises concepts (NEW)
  5. Start watchers
  6. Start chat        ← Agent reads articles, not source files
```

**Skip logic (same as index):**
- If `articles/` exists AND `articles/index.md` is newer than all source files → skip
- If any source is newer OR `articles/` is missing → compile
- First run always compiles. Subsequent runs are instant.
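The skip logic can be sketched as an mtime comparison against the article catalog (paths illustrative):

```typescript
import { existsSync, statSync, readdirSync } from "node:fs";
import { join } from "node:path";

export function shouldCompile(sourcesDir: string, articlesDir: string): boolean {
  const catalog = join(articlesDir, "index.md");
  if (!existsSync(catalog)) return true; // first run (or missing articles/): always compile
  const catalogMtime = statSync(catalog).mtimeMs;
  // Recompile if any source file is newer than the article catalog
  return readdirSync(sourcesDir).some(
    (f) => statSync(join(sourcesDir, f)).mtimeMs > catalogMtime
  );
}
```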

**Input:**
```
.llm-kb/wiki/sources/
  indian-penal-code-new.md         (60 pages)
  annotated-comparison-bns-ipc.md  (21 pages)
  evidence-act-new.md              (40 pages)
  ...
```

**Output:**
```
.llm-kb/wiki/articles/
  index.md                    ← concept catalog with one-line descriptions
  bns-2023-overview.md        ← what it is, structure, key changes
  murder-and-homicide.md      ← Clauses 99-106, old vs new
  mob-lynching.md             ← Clause 101(2), new provision
  electronic-evidence.md      ← Section 65B / BSB comparison
  organised-crime.md          ← Clauses 109-110, new
  sedition-removal.md         ← 124A removed, what replaces it
  offences-against-women.md   ← Chapter V, new protections
  ...
```

**Each article contains:**
```markdown
# Mob Lynching — BNS 2023, Clause 101(2)

## Overview
First-ever explicit criminalisation of mob lynching in Indian law...

## The Provision
When a group of 5+ persons acting in concert commits murder
on discriminatory grounds (race, caste, community, sex, etc.)...

## Punishment
- Death, OR life imprisonment, OR minimum 7 years + fine
- All members equally liable

## Comparison with IPC
IPC had no equivalent. Mob killings prosecuted under general S.302...

## Related Articles
- [[murder-and-homicide]] — general murder provisions
- [[bns-2023-overview]] — the full new code
- [[offences-against-women]] — other enhanced protections

*Sources: indian-penal-code-new.md (p.137), annotated-comparison-bns-ipc.md (p.15)*
```

**How it works:**
1. Agent reads index.md to understand all sources
2. Agent reads each source (or the first ~2000 chars for large files)
3. Agent identifies 10-30 key concepts across all sources
4. Agent writes one article per concept with cross-references
5. Agent writes the articles/index.md catalog

**Implementation:**
- New function: `compileArticles(folder, sourcesDir, authStorage, modelId)`
- Called from the cli.ts `run` command, after buildIndex, before chat
- Uses createAgentSession with read + write tools
- AGENTS.md instructs the agent on article format, backlinks, source citations
- Model: Sonnet (needs strong reasoning to synthesise across sources)

**Definition of done:**
- [ ] `run` compiles articles after index (with skip logic)
- [ ] articles/index.md is a concept catalog with one-line descriptions
- [ ] Each article has: overview, key details, source citations, related links
- [ ] Articles are cross-referenced with [[article-name]] backlinks

---

### Slice 2: Query uses articles

**What:** When articles/ exists, the agent reads articles/index.md instead of the source index.
It drills into specific articles rather than raw source files.

**The navigation flow:**
```
Agent reads articles/index.md (concept catalog)
  → Finds "mob-lynching.md" is relevant
  → Reads articles/mob-lynching.md (small, focused, pre-synthesised)
  → Answers instantly with cross-references
  → NO raw source files read
```

**Implementation:**
- Update buildQueryAgents() in query.ts
- If articles/index.md exists: inject it into AGENTS.md, tell the agent to use articles
- Fallback: if no articles, use current source-index + wiki.md behaviour
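A sketch of what the branch could look like; the real `buildQueryAgents()` in query.ts may have a different shape, so treat this as illustrative:

```typescript
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Illustrative shape of the AGENTS.md builder in query.ts
export function buildQueryAgents(kbDir: string): string {
  const articlesIndex = join(kbDir, "wiki", "articles", "index.md");
  if (existsSync(articlesIndex)) {
    // Articles exist: point the agent at the concept catalog
    return [
      "# Knowledge Base — Query Mode",
      "Answer from the concept articles. Start with the catalog below,",
      "then read only the specific articles you need.",
      "",
      readFileSync(articlesIndex, "utf8"),
    ].join("\n");
  }
  // Fallback: current behaviour (source index + wiki.md)
  return "# Knowledge Base — Query Mode\nUse source-index.md and wiki.md to answer.";
}
```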

**Definition of done:**
- [ ] Agent reads articles/index.md when available
- [ ] Agent navigates to specific articles, not source files
- [ ] Falls back to source-index when articles/ doesn't exist

---

### Slice 3: Incremental article updates

**What:** When a new file is dropped in, don't recompile everything.
Update only the 2-3 articles affected by the new content.

**Farza's quote:**
> "The most magical thing now is as I add new things, the system updates
> 2-3 different articles where it feels the context belongs, or just
> creates a new article. Like a super genius librarian."

**Flow:**
```
User drops "new-amendment-2024.pdf" into the folder
  → Watcher: parse PDF → sources/new-amendment-2024.md
  → Watcher: re-index (Haiku)
  → Watcher: read new source + articles/index.md
  → Agent: "This affects mob-lynching.md and bns-2023-overview.md"
  → Agent: updates those 2 articles + creates new-amendments-2024.md
  → Agent: updates articles/index.md catalog
```

**Implementation:**
- Update watcher.ts: after re-index, trigger incremental article update
- Agent reads: new source file + articles/index.md
- Agent decides: which articles to update, whether to create new ones
- Uses Sonnet (needs reasoning about where new content fits)

**Definition of done:**
- [ ] New file → parse → re-index → update relevant articles
- [ ] Agent updates 2-3 existing articles where content fits
- [ ] Agent creates new article if topic is genuinely new
- [ ] articles/index.md updated with any new entries

---

### Slice 4: Eval — session analysis + article refinement

**What:** Analyze session files to find quality issues and wiki gaps.
Then fix the articles automatically.

**Input:** `.llm-kb/sessions/*.jsonl` (raw conversation data)

**What eval checks:**

```
CORRECTNESS
- Citation validity: does the source text support the claim?
- Consistency: does the answer contradict the sources?

PERFORMANCE
- Query time breakdown: wiki hit vs file reads
- Most-read source files (candidates for better articles)
- Wasted reads: files read but not cited

WIKI GAPS
- Questions that needed source files but should be in articles
- Articles that are incomplete (queries needed to read past them)
- Missing articles (topics asked about with no article)

INDEX ISSUES
- Wrong file reads: agent read irrelevant files (bad index summary)
- Redundant reads: same file read multiple times
```
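The wasted-reads check needs no LLM judge; it is pure set arithmetic over a trace:

```typescript
// Files the agent read but never cited in its answer (the PERFORMANCE
// "wasted reads" check). Inputs mirror the trace fields from Phase 3.
export function wastedReads(
  filesRead: string[],
  citations: { file: string }[]
): string[] {
  const cited = new Set(citations.map((c) => c.file));
  return filesRead.filter((f) => !cited.has(f));
}
```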

**Output:** eval-report.md + automatic article patches

```markdown
# Eval Report — 2026-04-06

## Summary
15 sessions · 3 issues · 4 wiki gaps · estimated 120s saveable

## 🔴 Correctness Issues
1. Article "sedition-removal.md" says "retained" — source says "removed"
   → AUTO-FIX: patched article

## 🟡 Wiki Gaps (auto-filled)
1. "Electronic evidence certification" — asked 4x, no article
   → CREATED: articles/electronic-evidence-certification.md
2. "CrPC comparison" — asked 3x, article was incomplete
   → UPDATED: articles/crpc-comparison.md with missing sections

## 🟢 Performance Insights
- Wiki hit rate: 53% → 78% after gap fixes (estimated)
- Most-read source: indian-penal-code-new.md (12 reads)
  → Already well-covered by articles (reads are for exact quotes)
- Wasted reads: 8 across 15 sessions (32% waste rate)
```

**Implementation:**
- New command: `llm-kb eval`
- Reads session JSONL files (full conversation data)
- Code: extracts metrics (timing, file reads, citations)
- LLM judge (Haiku): checks citation validity, identifies gaps
- LLM writer (Haiku): patches articles with fixes
- Writes eval-report.md
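The code-side metrics extraction can be sketched as below; the session event shape is hypothetical and must be matched to the real `.jsonl` format:

```typescript
// HYPOTHETICAL event shape for one line of a session .jsonl file.
// Verify field names against what the session logger actually writes.
interface SessionEvent {
  type: string;   // e.g. "tool_use", "message"
  path?: string;  // file touched by a read tool call, when applicable
}

// Pure metrics over one session's events (the "Code: extracts metrics" step)
export function sessionMetrics(events: SessionEvent[]) {
  const reads = events.filter((e) => e.type === "tool_use" && e.path);
  const uniqueFiles = [...new Set(reads.map((e) => e.path as string))];
  return {
    totalReads: reads.length,
    uniqueFiles,
    // Same file read more than once (the INDEX ISSUES "redundant reads" check)
    redundantReads: reads.length - uniqueFiles.length,
  };
}
```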

**Definition of done:**
- [ ] `llm-kb eval` reads sessions and writes eval-report.md
- [ ] Flags: citation issues, consistency problems
- [ ] Identifies: wiki gaps, performance bottlenecks
- [ ] Auto-creates/patches articles for wiki gaps
- [ ] Reports estimated time savings

---

### Slice 5: The complete flywheel

With all slices done, the full flywheel:

```
┌───────────────┐
│   COMPILE     │  Proactive: articles from all sources
│  (once/incr)  │
└──────┬────────┘
       │
       ▼
┌───────────────────────┐
│ ARTICLES              │  Concept-organized, cross-referenced
│   articles/index.md   │  Agent navigates concepts, not files
│   articles/*.md       │
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ QUERY                 │  User asks question
│   → reads article     │  Agent reads 1 small article, not 5 large sources
│   → instant answer    │  Sessions logged
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ EVAL                  │  Analyzes sessions
│   → finds gaps        │  Creates missing articles
│   → fixes errors      │  Patches wrong articles
│   → measures speed    │  Reports optimization opportunities
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ NEW FILE DROPPED      │  Watcher detects new source
│   → incremental update│  Updates 2-3 relevant articles
│   → index updated     │  New knowledge integrated
└───────────┬───────────┘
            │
            └──────────── back to QUERY (faster every cycle)
```

**The compounding effect:**
- Day 1: compile articles from 9 PDFs → 15 articles
- Day 2: 10 queries → eval finds 3 gaps → 3 articles added/fixed
- Day 3: new PDF dropped → 2 articles updated
- Day 4: 20 queries → 90% answered from articles (2s avg vs 25s)
- Day 5: eval shows 95% wiki hit rate, 0 citation errors

---

## Build Order

| Slice | What | Effort | Priority |
|---|---|---|---|
| 1 | Article compiler (in `run`) | 2-3 hrs | 🔴 Do first |
| 2 | Query reads articles | 30 min | 🔴 Immediate follow-up |
| 3 | Incremental article updates (watcher) | 1-2 hrs | 🟡 This week |
| 4 | `llm-kb eval` (session analysis + auto-fix) | 2-3 hrs | 🔴 The big one |
| 5 | Full flywheel verification | Testing | 🟢 After all slices |

---

*Phase 4 spec written April 6, 2026. DeltaXY.*
*Inspired by Farzapedia (@FarzaTV) — Karpathy called it the best implementation of the LLM wiki pattern.*