npm - pdf-brain - Versions diffs - 1.1.0 → 1.1.3 - Mend

pdf-brain 1.1.0 → 1.1.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (18)

package/README.md +177 -26
package/package.json +29 -6
package/scripts/bulk-ingest.ts +22 -0
package/scripts/migration/backfill-document-concepts.ts +169 -0
package/scripts/migration/backfill-summaries.ts +231 -0
package/src/cli.test.ts +46 -0
package/src/cli.ts +10 -2
package/src/services/AutoTagger.ts +29 -1
package/src/services/ClusterConceptMapper.test.ts +57 -0
package/src/services/ClusterConceptMapper.ts +120 -0
package/src/services/ClusterSummarizer.test.ts +198 -0
package/src/services/ClusterSummarizer.ts +196 -0
package/src/services/Clustering.test.ts +487 -0
package/src/services/Clustering.ts +754 -0
package/src/services/Database.ts +18 -0
package/src/services/LibSQLDatabase.test.ts +559 -1
package/src/services/LibSQLDatabase.ts +208 -5
package/src/types.ts +4 -0

package/README.md CHANGED

@@ -1,25 +1,28 @@
 # pdf-brain
-Local PDF & Markdown knowledge base with semantic search and AI-powered enrichment.
+Local **PDF & Markdown** knowledge base with semantic search and AI-powered enrichment.
+> **Works with PDFs AND Markdown files** - Index your research papers, books, notes, docs, and any `.md` files in one unified, searchable knowledge base.
 ```
 ┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
-│   PDF/MD    │────▶│   Ollama    │────▶│   Ollama    │────▶│   libSQL    │
+│  PDF / MD   │────▶│   Ollama    │────▶│   Ollama    │────▶│   libSQL    │
 │  (extract)  │     │    (LLM)    │     │ (embeddings)│     │  (vectors)  │
 └─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │                   │
    pdf-parse         llama3.2:3b        mxbai-embed          HNSW index
-   markdown          enrichment          1024 dims           cosine sim
+   + markdown        enrichment          1024 dims           cosine sim
 ```
 ## Features
+- **PDF + Markdown** - Index `.pdf` and `.md` files with the same workflow
 - **Local-first** - Everything runs on your machine, no API costs
 - **AI enrichment** - LLM extracts titles, summaries, tags, and concepts
 - **SKOS taxonomy** - Organize documents with hierarchical concepts
 - **Vector search** - Semantic search via Ollama embeddings
 - **Hybrid search** - Combine vector similarity with full-text search
-- **Markdown support** - Index `.md` files alongside PDFs
+- **MCP server** - Use with Claude, Cursor, and other AI assistants
 ## Quick Start
@@ -105,16 +108,18 @@ pdf-brain init
 # Add a PDF
 pdf-brain add /path/to/document.pdf
-# Add from URL
+# Add a Markdown file
+pdf-brain add /path/to/notes.md
+# Add from URL (PDF or MD)
 pdf-brain add https://example.com/paper.pdf
+pdf-brain add https://raw.githubusercontent.com/user/repo/main/README.md
 # Add with manual tags
 pdf-brain add document.pdf --tags "ai,agents,research"
 # Add with AI enrichment (extracts title, summary, concepts)
 pdf-brain add document.pdf --enrich
-# Add Markdown file
 pdf-brain add notes.md --enrich
 ```
@@ -185,11 +190,16 @@ pdf-brain taxonomy seed --file data/taxonomy.json
 ### Bulk Ingest
+Recursively ingest directories containing PDFs and/or Markdown files:
 ```bash
 # Ingest a directory with full LLM enrichment
 pdf-brain ingest ~/Documents/papers --enrich
-# Ingest multiple directories
+# Ingest your Obsidian vault or notes folder
+pdf-brain ingest ~/Documents/obsidian --enrich
+# Ingest multiple directories (PDFs, Markdown, mixed)
 pdf-brain ingest ~/papers ~/books ~/notes --enrich
 # With manual tags
@@ -205,6 +215,11 @@ pdf-brain ingest ~/papers --enrich --sample 10
 pdf-brain ingest ~/papers --enrich --no-tui
 ```
+**Supported formats:**
+- `.pdf` - Research papers, books, documents
+- `.md` - Notes, documentation, Obsidian vaults, READMEs
 ## Enrichment
 When you add documents with `--enrich`, the LLM extracts:
@@ -222,16 +237,36 @@ When you add documents with `--enrich`, the LLM extracts:
 ### LLM Providers
-Enrichment tries local Ollama first, falls back to Anthropic if configured:
+Enrichment supports multiple providers via the config system:
 ```bash
-# Uses Ollama by default (llama3.2:3b)
-pdf-brain add paper.pdf --enrich
+# Check current config
+pdf-brain config show
+# Use local Ollama (default)
+pdf-brain config set enrichment.provider ollama
+pdf-brain config set enrichment.model llama3.2:3b
+# Use AI Gateway (Anthropic, OpenAI, etc.)
+pdf-brain config set enrichment.provider gateway
+pdf-brain config set enrichment.model anthropic/claude-haiku-4-5
+export AI_GATEWAY_API_KEY=your-key
-# Set Anthropic API key for fallback
-export ANTHROPIC_API_KEY=sk-ant-...
+# Provider priority: config > CLI flag > auto-detect
+pdf-brain add paper.pdf --enrich              # uses config
+pdf-brain add paper.pdf --enrich --provider ollama  # override
 ```
+### Enrichment Fallback
+If LLM enrichment fails (API error, rate limit, malformed response), pdf-brain automatically falls back to heuristic-based enrichment:
+- **Title**: Cleaned from filename
+- **Tags**: Extracted from path, filename, and content keywords
+- **Category**: Inferred from directory structure
+The actual error is logged so you can debug provider issues.
 ## Taxonomy
 The taxonomy is a hierarchical concept system for organizing documents. It ships with a starter taxonomy covering:
@@ -293,12 +328,78 @@ pdf-brain taxonomy seed --file my-taxonomy.json
 ## Configuration
-| Variable            | Default                    | Description              |
-| ------------------- | -------------------------- | ------------------------ |
-| `PDF_LIBRARY_PATH`  | `~/Documents/.pdf-library` | Library storage location |
-| `OLLAMA_HOST`       | `http://localhost:11434`   | Ollama API endpoint      |
-| `OLLAMA_MODEL`      | `mxbai-embed-large`        | Embedding model          |
-| `ANTHROPIC_API_KEY` | -                          | Fallback for enrichment  |
+### Config File
+pdf-brain stores configuration in `$PDF_LIBRARY_PATH/config.json`:
+```bash
+# Show all config
+pdf-brain config show
+# Get a specific value
+pdf-brain config get enrichment.provider
+# Set a value
+pdf-brain config set enrichment.model anthropic/claude-haiku-4-5
+```
+### Config Options
+```json
+{
+  "ollama": {
+    "host": "http://localhost:11434"
+  },
+  "embedding": {
+    "provider": "ollama",
+    "model": "mxbai-embed-large"
+  },
+  "enrichment": {
+    "provider": "gateway",
+    "model": "anthropic/claude-haiku-4-5"
+  },
+  "judge": {
+    "provider": "gateway",
+    "model": "anthropic/claude-haiku-4-5"
+  }
+}
+```
+| Setting               | Default                  | Description                          |
+| --------------------- | ------------------------ | ------------------------------------ |
+| `ollama.host`         | `http://localhost:11434` | Ollama API endpoint                  |
+| `embedding.provider`  | `ollama`                 | Embedding provider (ollama only)     |
+| `embedding.model`     | `mxbai-embed-large`      | Embedding model (1024 dims)          |
+| `enrichment.provider` | `ollama`                 | LLM provider: `ollama` or `gateway`  |
+| `enrichment.model`    | `llama3.2:3b`            | Model for document enrichment        |
+| `judge.provider`      | `ollama`                 | Provider for concept deduplication   |
+| `judge.model`         | `llama3.2:3b`            | Model for judging duplicate concepts |
+### Environment Variables
+| Variable             | Default                    | Description              |
+| -------------------- | -------------------------- | ------------------------ |
+| `PDF_LIBRARY_PATH`   | `~/Documents/.pdf-library` | Library storage location |
+| `OLLAMA_HOST`        | `http://localhost:11434`   | Ollama API endpoint      |
+| `AI_GATEWAY_API_KEY` | -                          | API key for AI Gateway   |
+### AI Gateway
+For cloud LLM providers (Anthropic, OpenAI, etc.), use the AI Gateway:
+```bash
+# Set your API key
+export AI_GATEWAY_API_KEY=your-key
+# Configure to use gateway
+pdf-brain config set enrichment.provider gateway
+pdf-brain config set enrichment.model anthropic/claude-haiku-4-5
+# Other supported models:
+# - anthropic/claude-sonnet-4-20250514
+# - openai/gpt-4o-mini
+# - openai/gpt-4o
+```
 ## Storage
@@ -310,6 +411,25 @@ pdf-brain taxonomy seed --file my-taxonomy.json
 └── downloads/          # PDFs downloaded from URLs
 ```
+### Database Size
+The database can get **large** due to vector index overhead. For ~500k chunks:
+| Component    | Size   | Notes                             |
+| ------------ | ------ | --------------------------------- |
+| Text content | ~180MB | Actual chunk text                 |
+| Embeddings   | ~1.9GB | 500k × 1024 dims × 4 bytes        |
+| Vector index | ~48GB  | HNSW neighbor graphs (~100KB/row) |
+| FTS index    | ~200MB | Full-text search                  |
+The `*_idx_shadow` tables store HNSW neighbor graphs for approximate nearest neighbor search. Each row averages ~100KB.
+**libSQL quirk**: `SELECT COUNT(*) FROM embeddings` returns 0. Always count a specific column:
+```sql
+SELECT COUNT(chunk_id) FROM embeddings  -- correct
+```
 ## How It Works
 1. **Extract** - PDF text via `pdf-parse`, Markdown parsed directly
@@ -334,13 +454,44 @@ pdf-brain ships as an MCP server for AI coding assistants:
 }
 ```
-Tools exposed:
-- `pdf-brain_add` - Add document to library
-- `pdf-brain_search` - Semantic search
-- `pdf-brain_list` - List documents
-- `pdf-brain_read` - Get document details
-- `pdf-brain_stats` - Library statistics
+### Document Tools
+| Tool                  | Description                                   |
+| --------------------- | --------------------------------------------- |
+| `pdf-brain_add`       | Add PDF/Markdown to library (supports URLs)   |
+| `pdf-brain_batch_add` | Bulk ingest from directory                    |
+| `pdf-brain_search`    | Unified semantic search (docs + concepts)     |
+| `pdf-brain_list`      | List documents, optionally filter by tag      |
+| `pdf-brain_read`      | Get document details and metadata             |
+| `pdf-brain_remove`    | Remove document from library                  |
+| `pdf-brain_tag`       | Set tags on a document                        |
+| `pdf-brain_stats`     | Library statistics (docs, chunks, embeddings) |
+### Taxonomy Tools
+| Tool                        | Description                              |
+| --------------------------- | ---------------------------------------- |
+| `pdf-brain_taxonomy_list`   | List all concepts (optional tree format) |
+| `pdf-brain_taxonomy_tree`   | Visual concept tree with box-drawing     |
+| `pdf-brain_taxonomy_add`    | Add new concept to taxonomy              |
+| `pdf-brain_taxonomy_assign` | Assign concept to document               |
+| `pdf-brain_taxonomy_search` | Search concepts by label                 |
+| `pdf-brain_taxonomy_seed`   | Load taxonomy from JSON file             |
+### Config Tools
+| Tool                    | Description               |
+| ----------------------- | ------------------------- |
+| `pdf-brain_config_show` | Display all config        |
+| `pdf-brain_config_get`  | Get specific config value |
+| `pdf-brain_config_set`  | Set config value          |
+### Utility Tools
+| Tool               | Description                   |
+| ------------------ | ----------------------------- |
+| `pdf-brain_check`  | Check if Ollama is ready      |
+| `pdf-brain_repair` | Fix database integrity issues |
 ## Troubleshooting
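
Note on the "Database Size" table in the README diff above: the embedding and vector-index figures follow from simple arithmetic. The sketch below reproduces them, assuming that "GB" in the table means GiB (1024³ bytes) and that "~100KB/row" means 100 × 1024 bytes. The function name `estimateSizes` and its parameters are hypothetical and not part of pdf-brain.

```typescript
// Rough storage estimate for the README's "Database Size" table.
// Assumptions (not from the package): binary units, 100 KiB per index row.
interface SizeEstimate {
  embeddingsBytes: number;
  vectorIndexBytes: number;
}

function estimateSizes(
  chunks: number,
  dims: number,
  bytesPerFloat = 4, // 32-bit floats
  indexBytesPerRow = 100 * 1024 // "~100KB/row" from the README
): SizeEstimate {
  return {
    embeddingsBytes: chunks * dims * bytesPerFloat,
    vectorIndexBytes: chunks * indexBytesPerRow,
  };
}

const GIB = 1024 ** 3;
const est = estimateSizes(500_000, 1024);

console.log((est.embeddingsBytes / GIB).toFixed(2)); // "1.91", matching ~1.9GB
console.log((est.vectorIndexBytes / GIB).toFixed(1)); // "47.7", matching ~48GB
```

Under these assumptions the index is roughly 25 times larger than the raw embeddings, which is why the README singles out the `*_idx_shadow` tables.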

package/package.json CHANGED

@@ -1,7 +1,7 @@
 {
   "name": "pdf-brain",
-  "version": "1.1.0",
-  "description": "Local PDF knowledge base with vector search",
+  "version": "1.1.3",
+  "description": "Local PDF & Markdown knowledge base with semantic search, AI enrichment, and SKOS taxonomy",
   "type": "module",
   "main": "src/index.ts",
   "bin": {
@@ -15,7 +15,8 @@
     "setup": "./scripts/setup.sh",
     "changeset": "changeset",
     "version": "changeset version",
-    "release": "bun test && bun run typecheck && changeset publish"
+    "release": "bun test && bun run typecheck && changeset publish",
+    "prepare": "husky"
   },
   "dependencies": {
     "@ai-sdk/anthropic": "^2.0.56",
@@ -23,6 +24,7 @@
     "@effect/platform": "^0.72.0",
     "@effect/platform-bun": "^0.52.0",
     "@effect/schema": "^0.75.0",
+    "@electric-sql/pglite": "^0.3.0",
     "@libsql/client": "^0.15.15",
     "ai": "^5.0.115",
     "dotenv": "^17.2.3",
@@ -40,10 +42,14 @@
   },
   "devDependencies": {
     "@changesets/cli": "^2.29.8",
+    "@secretlint/secretlint-rule-preset-recommend": "^11.2.5",
     "@types/bun": "latest",
     "@types/mdast": "^4.0.4",
     "@types/react": "^19.2.7",
+    "husky": "^9.1.7",
+    "lint-staged": "^16.2.7",
     "react-devtools-core": "^7.0.1",
+    "secretlint": "^11.2.5",
     "typescript": "^5.7.2"
   },
   "peerDependencies": {
@@ -58,12 +64,29 @@
   ],
   "keywords": [
     "pdf",
+    "markdown",
+    "md",
+    "knowledge-base",
     "vector-search",
+    "semantic-search",
     "embeddings",
+    "rag",
+    "ollama",
+    "local-first",
+    "ai",
+    "llm",
+    "taxonomy",
+    "skos",
     "libsql",
-    "knowledge-base",
-    "effect-ts"
+    "mcp",
+    "effect-ts",
+    "document-management",
+    "personal-knowledge-base",
+    "pkm"
   ],
   "author": "Joel Hooks",
-  "license": "MIT"
+  "license": "MIT",
+  "lint-staged": {
+    "*": "secretlint"
+  }
 }

package/scripts/bulk-ingest.ts CHANGED

@@ -283,6 +283,28 @@ ${
           })
         );
+        if (enrichResult._tag === "Left") {
+          // Check for fatal errors that should stop the entire batch
+          const errMsg = String(enrichResult.left.message || enrichResult.left);
+          const isFatalError =
+            errMsg.includes("Insufficient funds") ||
+            errMsg.includes("rate limit") ||
+            errMsg.includes("Rate limit") ||
+            errMsg.includes("quota exceeded") ||
+            errMsg.includes("API key");
+          if (isFatalError) {
+            console.error(`\n\n🛑 FATAL ERROR: ${errMsg}`);
+            console.error(
+              "Stopping batch import to avoid wasting resources.\n"
+            );
+            process.exit(1);
+          }
+          // Log the actual error so we can debug
+          console.log(`   ⚠️  Enrichment failed: ${errMsg}`);
+        }
         if (enrichResult._tag === "Right") {
           const r = enrichResult.right;
           title = r.title;
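
Note on the hunk above: the added check decides whether an enrichment failure should abort the whole batch by looking for substrings in the error message. The same idea can be sketched as a small standalone helper. This is a variation and not the package's code: `isFatalEnrichmentError` and `FATAL_MARKERS` are hypothetical names, and the match here is case-insensitive, whereas the diff lists "rate limit" and "Rate limit" separately and matches "API key" with exact casing.

```typescript
// Substrings that mark an error as fatal for the whole batch.
// The list mirrors the diff; lowercasing is an assumption of this sketch.
const FATAL_MARKERS = [
  "insufficient funds",
  "rate limit",
  "quota exceeded",
  "api key",
] as const;

function isFatalEnrichmentError(err: unknown): boolean {
  // Prefer Error.message, otherwise stringify whatever was thrown.
  const msg = err instanceof Error ? err.message : String(err);
  const lower = msg.toLowerCase();
  return FATAL_MARKERS.some((marker) => lower.includes(marker));
}

console.log(isFatalEnrichmentError(new Error("Rate limit exceeded"))); // true
console.log(isFatalEnrichmentError("malformed JSON in response")); // false
```

A caller would stop the batch when the helper returns true and otherwise log the message and continue, as the hunk does.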

package/scripts/migration/backfill-document-concepts.ts ADDED

@@ -0,0 +1,169 @@
+#!/usr/bin/env bun
+/**
+ * Backfill document_concepts from existing tags
+ *
+ * Tags are stored as the leaf part of concept IDs (e.g., "instructional-design")
+ * Concepts are stored with category prefix (e.g., "education/instructional-design")
+ *
+ * This script:
+ * 1. Builds a mapping from normalized tag -> concept ID
+ * 2. For each document, matches its tags to concepts
+ * 3. Inserts into document_concepts join table
+ *
+ * Usage:
+ *   bun run scripts/migration/backfill-document-concepts.ts
+ */
+import { createClient } from "@libsql/client";
+import { join } from "path";
+const dbPath = `file:${join(
+  process.env.HOME!,
+  "Documents/.pdf-library/library.db"
+)}`;
+interface Concept {
+  id: string;
+  pref_label: string;
+  alt_labels: string;
+}
+interface Document {
+  id: string;
+  title: string;
+  tags: string;
+}
+function normalizeTag(s: string): string {
+  return s
+    .toLowerCase()
+    .replace(/[^a-z0-9]+/g, "-")
+    .replace(/^-|-$/g, "");
+}
+async function main() {
+  console.log("=== Backfill document_concepts ===\n");
+  console.log(`Database: ${dbPath}\n`);
+  const client = createClient({ url: dbPath });
+  // Step 1: Build tag -> concept mapping
+  console.log("Building tag -> concept mapping...");
+  const conceptsResult = await client.execute(
+    "SELECT id, pref_label, alt_labels FROM concepts"
+  );
+  const tagToConcept = new Map<string, string>();
+  let conceptCount = 0;
+  for (const row of conceptsResult.rows) {
+    const concept = row as unknown as Concept;
+    conceptCount++;
+    // Extract leaf from concept ID (e.g., "education/instructional-design" -> "instructional-design")
+    const leaf = concept.id.includes("/")
+      ? concept.id.split("/").pop()!
+      : concept.id;
+    // Map normalized leaf to concept ID
+    tagToConcept.set(normalizeTag(leaf), concept.id);
+    // Also map normalized pref_label
+    tagToConcept.set(normalizeTag(concept.pref_label), concept.id);
+    // Map alt_labels
+    const altLabels: string[] = JSON.parse(concept.alt_labels || "[]");
+    for (const alt of altLabels) {
+      tagToConcept.set(normalizeTag(alt), concept.id);
+    }
+  }
+  console.log(`  ${conceptCount} concepts loaded`);
+  console.log(`  ${tagToConcept.size} tag mappings created\n`);
+  // Step 2: Get all documents with tags
+  console.log("Processing documents...");
+  const docsResult = await client.execute(
+    "SELECT id, title, tags FROM documents"
+  );
+  let docsProcessed = 0;
+  let linksCreated = 0;
+  let docsWithConcepts = 0;
+  for (const row of docsResult.rows) {
+    const doc = row as unknown as Document;
+    const tags: string[] = JSON.parse(doc.tags || "[]");
+    if (tags.length === 0) continue;
+    const matchedConcepts = new Set<string>();
+    for (const tag of tags) {
+      const normalizedTag = normalizeTag(tag);
+      const conceptId = tagToConcept.get(normalizedTag);
+      if (conceptId) {
+        matchedConcepts.add(conceptId);
+      }
+    }
+    if (matchedConcepts.size > 0) {
+      docsWithConcepts++;
+      // Insert into document_concepts
+      for (const conceptId of matchedConcepts) {
+        try {
+          await client.execute({
+            sql: `INSERT INTO document_concepts (doc_id, concept_id, confidence, source)
+                  VALUES (?, ?, ?, ?)
+                  ON CONFLICT (doc_id, concept_id) DO NOTHING`,
+            args: [doc.id, conceptId, 0.8, "backfill"],
+          });
+          linksCreated++;
+        } catch (e) {
+          // Ignore constraint violations (concept doesn't exist)
+        }
+      }
+    }
+    docsProcessed++;
+    if (docsProcessed % 100 === 0) {
+      console.log(`  Processed ${docsProcessed} documents...`);
+    }
+  }
+  // Step 3: Summary
+  const finalCount = await client.execute(
+    "SELECT COUNT(doc_id) as count FROM document_concepts"
+  );
+  const totalLinks = Number((finalCount.rows[0] as any).count || 0);
+  console.log("\n=== Complete ===");
+  console.log(`Documents processed: ${docsProcessed}`);
+  console.log(`Documents with concepts: ${docsWithConcepts}`);
+  console.log(`Links created: ${linksCreated}`);
+  console.log(`Total document_concepts: ${totalLinks}`);
+  // Show sample
+  console.log("\nSample links:");
+  const sample = await client.execute(`
+    SELECT d.title, c.pref_label
+    FROM document_concepts dc
+    JOIN documents d ON d.id = dc.doc_id
+    JOIN concepts c ON c.id = dc.concept_id
+    LIMIT 10
+  `);
+  for (const row of sample.rows) {
+    console.log(`  "${row.title}" -> ${row.pref_label}`);
+  }
+  client.close();
+}
+main().catch((e) => {
+  console.error("Backfill failed:", e);
+  process.exit(1);
+});
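
Note on the script above: its core is a pure mapping step (normalize each tag, then look it up in an index built from concept IDs, preferred labels, and alternative labels) that can be exercised without a database. The sketch below isolates that step. `normalizeTag` is copied from the diff; `buildTagIndex`, `matchConcepts`, and the sample concept are hypothetical and only illustrate the logic.

```typescript
interface Concept {
  id: string;
  pref_label: string;
  alt_labels: string; // JSON-encoded string array, as in the script
}

// Same normalization as the migration script.
function normalizeTag(s: string): string {
  return s
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-|-$/g, "");
}

// Map normalized leaf ID, pref_label, and alt_labels to the concept ID.
// Later concepts overwrite earlier ones on key collisions (Map.set semantics).
function buildTagIndex(concepts: Concept[]): Map<string, string> {
  const index = new Map<string, string>();
  for (const c of concepts) {
    const leaf = c.id.split("/").pop() ?? c.id;
    index.set(normalizeTag(leaf), c.id);
    index.set(normalizeTag(c.pref_label), c.id);
    const alts: string[] = JSON.parse(c.alt_labels || "[]");
    for (const alt of alts) index.set(normalizeTag(alt), c.id);
  }
  return index;
}

// Return the distinct concept IDs matched by a document's tags.
function matchConcepts(tags: string[], index: Map<string, string>): string[] {
  const matched = new Set<string>();
  for (const tag of tags) {
    const id = index.get(normalizeTag(tag));
    if (id) matched.add(id);
  }
  return [...matched];
}

// Hypothetical sample data.
const index = buildTagIndex([
  {
    id: "education/instructional-design",
    pref_label: "Instructional Design",
    alt_labels: '["Learning Design"]',
  },
]);

console.log(matchConcepts(["Instructional Design", "learning-design", "unknown"], index));
// [ "education/instructional-design" ]
```

Because two tags can resolve to the same concept, the result is deduplicated before insertion, which is also why the script collects matches in a `Set`.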