cozo-memory 1.0.4 → 1.0.5

package/README.md CHANGED
@@ -28,6 +28,8 @@
  - [Development](#development)
  - [User Preference Profiling](#user-preference-profiling-mem0-style)
  - [Troubleshooting](#troubleshooting)
+ - [Roadmap](#roadmap)
+ - [Contributing](#contributing)
  - [License](#license)

  ## Quick Start
@@ -224,7 +226,11 @@ graph LR

  ### Prerequisites
  - Node.js 20+ (recommended)
- - CozoDB native dependency is installed via `cozo-node`.
+ - **RAM: 1.7 GB minimum** (for the default bge-m3 model)
+   - Model download: ~600 MB
+   - Runtime memory: ~1.1 GB
+   - For lower-spec machines, see [Embedding Model Options](#embedding-model-options) below
+ - The CozoDB native dependency is installed via `cozo-node`

  ### Via npm (Easiest)

@@ -257,6 +263,62 @@ Notes:
  - On first start, `@xenova/transformers` downloads the embedding model (may take time).
  - Embeddings are processed on the CPU.

+ ### Embedding Model Options
+
+ CozoDB Memory supports multiple embedding models via the `EMBEDDING_MODEL` environment variable:
+
+ | Model | Size | RAM | Dimensions | Best For |
+ |-------|------|-----|------------|----------|
+ | `Xenova/bge-m3` (default) | ~600 MB | ~1.7 GB | 1024 | High accuracy, production use |
+ | `Xenova/all-MiniLM-L6-v2` | ~80 MB | ~400 MB | 384 | Low-spec machines, development |
+ | `Xenova/bge-small-en-v1.5` | ~130 MB | ~600 MB | 384 | Balanced performance |
+
+ **Configuration Options:**
+
+ **Option 1: Using a `.env` file (easiest for beginners)**
+
+ ```bash
+ # Copy the example file
+ cp .env.example .env
+
+ # Edit .env and set your preferred model
+ EMBEDDING_MODEL=Xenova/all-MiniLM-L6-v2
+ ```
+
+ **Option 2: MCP Server Config (for Claude Desktop / Kiro)**
+
+ ```json
+ {
+   "mcpServers": {
+     "cozo-memory": {
+       "command": "npx",
+       "args": ["cozo-memory"],
+       "env": {
+         "EMBEDDING_MODEL": "Xenova/all-MiniLM-L6-v2"
+       }
+     }
+   }
+ }
+ ```
+
+ **Option 3: Command Line**
+
+ ```bash
+ # Use the lightweight model for development
+ EMBEDDING_MODEL=Xenova/all-MiniLM-L6-v2 npm run start
+ ```
+
+ **Download the Model First (Recommended):**
+
+ ```bash
+ # Set the model in .env or via the command line, then:
+ EMBEDDING_MODEL=Xenova/all-MiniLM-L6-v2 npm run download-model
+ ```
+
+ **Note:** Changing models requires re-embedding existing data. The model is downloaded once on first use.
+
  ## Start / Integration

  ### MCP Server (stdio)
@@ -335,6 +397,14 @@ DB_ENGINE=rocksdb npm run dev
  | **RocksDB** | Prepared & Tested | For high-performance or very large datasets. |
  | **MDBX** | Not supported | Requires manual build of `cozo-node` from source. |

+ ### Environment Variables
+
+ | Variable | Default | Description |
+ |----------|---------|-------------|
+ | `DB_ENGINE` | `sqlite` | Database backend: `sqlite` or `rocksdb` |
+ | `EMBEDDING_MODEL` | `Xenova/bge-m3` | Embedding model (see [Embedding Model Options](#embedding-model-options)) |
+ | `PORT` | `3001` | HTTP API bridge port (if using `npm run bridge`) |
+
  ---

  ## Data Model
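The table above can be read as a tiny config loader. The sketch below is illustrative only (the function name `readConfig` is not from the package); the fallback values mirror the documented defaults:

```javascript
// Minimal sketch of how the documented environment variables would be
// consumed. Defaults mirror the table above; names are illustrative.
function readConfig(env) {
    return {
        dbEngine: env.DB_ENGINE || "sqlite",                    // "sqlite" or "rocksdb"
        embeddingModel: env.EMBEDDING_MODEL || "Xenova/bge-m3", // embedding model id
        port: Number(env.PORT || 3001),                         // HTTP bridge port
    };
}

console.log(readConfig(process.env));
```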
@@ -367,7 +437,10 @@ Actions:
  - `create_relation`: `{ from_id, to_id, relation_type, strength?, metadata? }`
  - `run_transaction`: `{ operations: Array<{ action, params }> }` **(New v1.2)**: Executes multiple operations atomically.
  - `add_inference_rule`: `{ name, datalog }`
- - `ingest_file`: `{ format, content, entity_id?, entity_name?, entity_type?, chunking?, metadata?, observation_metadata?, deduplicate?, max_observations? }`
+ - `ingest_file`: `{ format, file_path?, content?, entity_id?, entity_name?, entity_type?, chunking?, metadata?, observation_metadata?, deduplicate?, max_observations? }`
+   - `format` options: `"markdown"`, `"json"`, `"pdf"` **(New v1.9)**
+   - `file_path`: Optional path to a file on disk (alternative to the `content` parameter)
+   - `content`: File content as a string (required if `file_path` is not provided)
  - `chunking` options: `"none"`, `"paragraphs"` (future: `"semantic"`)

Important Details:
@@ -419,7 +492,7 @@ Example (Transitive Manager ⇒ Upper Manager):
  }
  ```

- Bulk Ingestion (Markdown/JSON):
+ Bulk Ingestion (Markdown/JSON/PDF):

  ```json
  {
@@ -433,6 +506,19 @@ Bulk Ingestion (Markdown/JSON):
  }
  ```

+ PDF Ingestion via File Path:
+
+ ```json
+ {
+   "action": "ingest_file",
+   "entity_name": "Research Paper",
+   "format": "pdf",
+   "file_path": "/path/to/document.pdf",
+   "chunking": "paragraphs",
+   "deduplicate": true
+ }
+ ```
+
  ### query_memory (Read)

  Actions:
@@ -784,6 +870,45 @@ npx ts-node test-user-pref.ts
  - Use `health` action to check cache hit rates
  - Consider RocksDB backend for datasets > 100k entities

+ ## Roadmap
+
+ CozoDB Memory is actively developed. Here's what's planned:
+
+ ### Near-Term (v1.x)
+
+ - **GPU Acceleration** - CUDA support for embedding generation (10-50x faster)
+ - **Streaming Ingestion** - Real-time data ingestion from logs, APIs, webhooks
+ - **Advanced Chunking** - Semantic chunking for `ingest_file` (paragraph-aware splitting)
+ - **Query Optimization** - Automatic query plan optimization for complex graph traversals
+ - **Additional Export Formats** - Notion, Roam Research, Logseq compatibility
+
+ ### Mid-Term (v2.x)
+
+ - **Multi-Modal Embeddings** - Image and audio embedding support via CLIP/Whisper
+ - **Distributed Mode** - Multi-node deployment with CozoDB clustering
+ - **Real-Time Sync** - WebSocket-based live updates for collaborative use cases
+ - **Advanced Inference** - Causal reasoning, temporal pattern detection
+ - **Web UI** - Optional web interface for memory exploration and visualization
+
+ ### Long-Term
+
+ - **Federated Learning** - Privacy-preserving model updates across instances
+ - **Custom Embedding Models** - Fine-tune embeddings on domain-specific data
+ - **Plugin System** - Extensible architecture for custom tools and integrations
+
+ ### Community Requests
+
+ Have a feature idea? Open an issue with the `enhancement` label or check [Low-Hanging-Fruit.md](Low-Hanging-Fruit.md) for quick wins you can contribute.
+
+ ## Contributing
+
+ Contributions are welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on:
+
+ - Setting up the development environment
+ - Coding standards and best practices
+ - Testing and documentation requirements
+ - Pull request process
+
  ## License

  This project is licensed under the Apache-2.0 License. See the [LICENSE](LICENSE) file for details.
@@ -33,12 +33,14 @@ var __importStar = (this && this.__importStar) || (function () {
      };
  })();
  Object.defineProperty(exports, "__esModule", { value: true });
+ require("dotenv/config"); // Load .env file first
  const transformers_1 = require("@xenova/transformers");
  const path = __importStar(require("path"));
  // Configure cache path
  const CACHE_DIR = path.resolve('./.cache');
  transformers_1.env.cacheDir = CACHE_DIR;
- const MODEL_ID = "Xenova/bge-m3";
+ // Read the model from an environment variable, or use the default
+ const MODEL_ID = process.env.EMBEDDING_MODEL || "Xenova/bge-m3";
  async function downloadModel() {
      console.log(`Downloading FP32 model for ${MODEL_ID}...`);
      // quantized: false forces FP32 model download
@@ -34,6 +34,7 @@ var __importStar = (this && this.__importStar) || (function () {
  })();
  Object.defineProperty(exports, "__esModule", { value: true });
  exports.EmbeddingService = void 0;
+ require("dotenv/config"); // Load .env file first
  const transformers_1 = require("@xenova/transformers");
  const ort = require('onnxruntime-node');
  const path = __importStar(require("path"));
@@ -91,11 +92,27 @@ class EmbeddingService {
      cache;
      session = null;
      tokenizer = null;
-     modelId = "Xenova/bge-m3";
-     dimensions = 1024;
+     modelId;
+     dimensions;
      queue = Promise.resolve();
      constructor() {
          this.cache = new LRUCache(1000, 3600000); // 1000 entries, 1h TTL
+         // Support multiple embedding models via environment variable
+         this.modelId = process.env.EMBEDDING_MODEL || "Xenova/bge-m3";
+         // Set dimensions based on model
+         const dimensionMap = {
+             "Xenova/bge-m3": 1024,
+             "Xenova/all-MiniLM-L6-v2": 384,
+             "Xenova/bge-small-en-v1.5": 384,
+             "Xenova/nomic-embed-text-v1": 768,
+             "onnx-community/Qwen3-Embedding-0.6B-ONNX": 1024,
+         };
+         this.dimensions = dimensionMap[this.modelId] || 1024;
+         console.log(`[EmbeddingService] Using model: ${this.modelId} (${this.dimensions} dimensions)`);
+     }
+     // Public getter for dimensions
+     getDimensions() {
+         return this.dimensions;
      }
      // Serializes embedding execution to avoid event loop blocking
      async runSerialized(task) {
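The model-to-dimension lookup in the constructor above can be exercised in isolation. A minimal standalone sketch (the helper name `resolveModel` is illustrative; the map entries and the 1024 fallback mirror the hunk above):

```javascript
// Standalone sketch of the model → dimension lookup used above.
// Unknown model ids fall back to 1024, matching the constructor logic.
const DIMENSION_MAP = {
    "Xenova/bge-m3": 1024,
    "Xenova/all-MiniLM-L6-v2": 384,
    "Xenova/bge-small-en-v1.5": 384,
    "Xenova/nomic-embed-text-v1": 768,
    "onnx-community/Qwen3-Embedding-0.6B-ONNX": 1024,
};

function resolveModel(env) {
    const modelId = env.EMBEDDING_MODEL || "Xenova/bge-m3";
    return { modelId, dimensions: DIMENSION_MAP[modelId] || 1024 };
}

console.log(resolveModel({ EMBEDDING_MODEL: "Xenova/all-MiniLM-L6-v2" }));
```

Note that the schema must be rebuilt with the new dimension count when the model changes, which is why changing models requires re-embedding existing data.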
@@ -109,21 +126,38 @@ class EmbeddingService {
          if (this.session && this.tokenizer)
              return;
          try {
-             // 1. Load Tokenizer
+             // 1. Check if the model needs to be downloaded
+             // Extract namespace and model name from modelId (e.g., "Xenova/bge-m3" or "onnx-community/Qwen3-Embedding-0.6B-ONNX")
+             const parts = this.modelId.split('/');
+             const namespace = parts[0];
+             const modelName = parts[1];
+             // Try both possible cache locations
+             let baseDir = path.join(transformers_1.env.cacheDir, namespace, modelName, 'onnx');
+             let fp32Path = path.join(baseDir, 'model.onnx');
+             let quantizedPath = path.join(baseDir, 'model_quantized.onnx');
+             // If the ONNX model files don't exist, download them
+             if (!fs.existsSync(fp32Path) && !fs.existsSync(quantizedPath)) {
+                 console.log(`[EmbeddingService] Model not found, downloading ${this.modelId}...`);
+                 console.log(`[EmbeddingService] This may take a few minutes on first run.`);
+                 // Import AutoModel dynamically to trigger the download
+                 const { AutoModel } = await import("@xenova/transformers");
+                 await AutoModel.from_pretrained(this.modelId, { quantized: false });
+                 console.log(`[EmbeddingService] Model download completed.`);
+             }
+             // 2. Load Tokenizer
              if (!this.tokenizer) {
                  this.tokenizer = await transformers_1.AutoTokenizer.from_pretrained(this.modelId);
              }
-             // 2. Determine model path
-             const baseDir = path.join(transformers_1.env.cacheDir, 'Xenova', 'bge-m3', 'onnx');
+             // 3. Determine model path
              // Priority: FP32 (model.onnx) > Quantized (model_quantized.onnx)
-             let modelPath = path.join(baseDir, 'model.onnx');
+             let modelPath = fp32Path;
              if (!fs.existsSync(modelPath)) {
-                 modelPath = path.join(baseDir, 'model_quantized.onnx');
+                 modelPath = quantizedPath;
              }
              if (!fs.existsSync(modelPath)) {
-                 throw new Error(`Model file not found at: ${modelPath}`);
+                 throw new Error(`Model file not found at: ${modelPath}. Download may have failed.`);
              }
-             // 3. Create Session
+             // 4. Create Session
              if (!this.session) {
                  const options = {
                      executionProviders: ['cpu'], // Use CPU backend to avoid native conflicts
@@ -139,7 +173,15 @@ class EmbeddingService {
      }
      async embed(text) {
          return this.runSerialized(async () => {
-             const textStr = String(text || "");
+             let textStr = String(text || "");
+             // For Qwen3-Embedding models, add an instruction prefix for better results
+             // (only for queries, not for documents being indexed)
+             if (this.modelId.includes('Qwen3-Embedding')) {
+                 // Add the instruction prefix if not already present
+                 if (!textStr.startsWith('Instruct:')) {
+                     textStr = `Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: ${textStr}`;
+                 }
+             }
              // 1. Cache lookup
              const cached = this.cache.get(textStr);
              if (cached) {
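The instruction-prefix logic added above is easiest to see in isolation. A sketch (the free function `applyQueryPrefix` is illustrative; the prefix string and the `Instruct:` guard match the hunk above):

```javascript
// Isolated sketch of the Qwen3-Embedding instruction-prefix behavior:
// non-Qwen3 models pass through unchanged, already-prefixed text is not
// prefixed twice.
function applyQueryPrefix(modelId, text) {
    if (modelId.includes('Qwen3-Embedding') && !text.startsWith('Instruct:')) {
        return `Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: ${text}`;
    }
    return text;
}

console.log(applyQueryPrefix("onnx-community/Qwen3-Embedding-0.6B-ONNX", "what is cozo?"));
```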
@@ -171,14 +213,22 @@ class EmbeddingService {
              const results = await this.session.run(feeds);
              // 5. Pooling & Normalization
              // Output name usually 'last_hidden_state' or 'logits'
-             // For BGE-M3, the first output is usually the hidden states [batch, seq_len, hidden_size]
              const outputName = this.session.outputNames[0];
              const outputTensor = results[outputName];
              // Ensure we have data
              if (!outputTensor || !attentionMaskData) {
                  throw new Error("No output data or attention mask available");
              }
-             const embedding = this.meanPooling(outputTensor.data, attentionMaskData, outputTensor.dims);
+             // Choose pooling strategy based on model
+             let embedding;
+             if (this.modelId.includes('Qwen3-Embedding')) {
+                 // Qwen3-Embedding uses last-token pooling
+                 embedding = this.lastTokenPooling(outputTensor.data, attentionMaskData, outputTensor.dims);
+             }
+             else {
+                 // BGE and other models use mean pooling
+                 embedding = this.meanPooling(outputTensor.data, attentionMaskData, outputTensor.dims);
+             }
              // Normalize
              const normalized = this.normalize(embedding);
              this.cache.set(textStr, normalized);
@@ -200,6 +250,25 @@ class EmbeddingService {
          }
          return results;
      }
+     lastTokenPooling(data, attentionMask, dims) {
+         // dims: [batch_size, seq_len, hidden_size]
+         // Extract the last valid token's hidden state
+         const [batchSize, seqLen, hiddenSize] = dims;
+         // Find the last valid token position
+         let lastValidIdx = seqLen - 1;
+         for (let i = seqLen - 1; i >= 0; i--) {
+             if (attentionMask[i] === 1n) {
+                 lastValidIdx = i;
+                 break;
+             }
+         }
+         // Extract the embedding at the last valid position
+         const embedding = new Float32Array(hiddenSize);
+         for (let j = 0; j < hiddenSize; j++) {
+             embedding[j] = data[lastValidIdx * hiddenSize + j];
+         }
+         return Array.from(embedding);
+     }
      meanPooling(data, attentionMask, dims) {
          // dims: [batch_size, seq_len, hidden_size]
          // We assume batch_size = 1 for single embedding call
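The two pooling strategies can be compared on a toy tensor. This sketch uses plain numbers for the attention mask (the real code receives BigInt64 data from onnxruntime, hence the `1n` comparison above) and free functions rather than the class methods:

```javascript
// Toy comparison of mean pooling vs. last-token pooling on a flattened
// [1, seq_len, hidden_size] tensor. Padding positions (mask 0) are skipped.
function meanPool(data, mask, [, seqLen, hidden]) {
    const out = new Array(hidden).fill(0);
    let count = 0;
    for (let i = 0; i < seqLen; i++) {
        if (mask[i] !== 1) continue; // skip padding
        count++;
        for (let j = 0; j < hidden; j++) out[j] += data[i * hidden + j];
    }
    return out.map(v => v / Math.max(count, 1));
}

function lastTokenPool(data, mask, [, seqLen, hidden]) {
    // Walk backwards to the last non-padding token
    let last = seqLen - 1;
    for (let i = seqLen - 1; i >= 0; i--) {
        if (mask[i] === 1) { last = i; break; }
    }
    return Array.from({ length: hidden }, (_, j) => data[last * hidden + j]);
}

// seq_len = 3, hidden = 2; the third position is padding (mask 0)
const data = [1, 2, 3, 4, 9, 9];
const mask = [1, 1, 0];
console.log(meanPool(data, mask, [1, 3, 2]));      // averages the two valid tokens → [2, 3]
console.log(lastTokenPool(data, mask, [1, 3, 2])); // takes the second token → [3, 4]
```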
package/dist/index.js CHANGED
@@ -12,12 +12,12 @@ const cozo_node_1 = require("cozo-node");
  const zod_1 = require("zod");
  const uuid_1 = require("uuid");
  const path_1 = __importDefault(require("path"));
+ const fs_1 = __importDefault(require("fs"));
+ const pdf_mjs_1 = require("pdfjs-dist/legacy/build/pdf.mjs");
  const hybrid_search_1 = require("./hybrid-search");
  const inference_engine_1 = require("./inference-engine");
  exports.DB_PATH = path_1.default.resolve(__dirname, "..", "memory_db.cozo");
  const DB_ENGINE = process.env.DB_ENGINE || "sqlite"; // "sqlite" or "rocksdb"
- const EMBEDDING_MODEL = "Xenova/bge-m3";
- const EMBEDDING_DIM = 1024;
  exports.USER_ENTITY_ID = "global_user_profile";
  exports.USER_ENTITY_NAME = "The User";
  exports.USER_ENTITY_TYPE = "User";
@@ -484,6 +484,9 @@ class MemoryServer {
      async setupSchema() {
          try {
              console.error("[Schema] Initializing schema...");
+             // Get embedding dimensions from the service
+             const EMBEDDING_DIM = this.embeddingService.getDimensions();
+             console.error(`[Schema] Using embedding dimensions: ${EMBEDDING_DIM}`);
              const existingRelations = await this.db.run("::relations");
              const relations = existingRelations.rows.map((r) => r[0]);
              // Entity Table
@@ -1664,7 +1667,39 @@ ids[id] <- $ids
      async ingestFile(args) {
          await this.initPromise;
          try {
-             const content = (args.content ?? "").trim();
+             // Check that either file_path or content is provided
+             if (!args.file_path && !args.content) {
+                 return { error: "Either file_path or content must be provided" };
+             }
+             // Read content from the file if file_path is provided
+             let content;
+             if (args.file_path) {
+                 try {
+                     if (args.format === "pdf") {
+                         // Read the PDF file and extract text using pdfjs-dist
+                         const data = new Uint8Array(fs_1.default.readFileSync(args.file_path));
+                         const loadingTask = (0, pdf_mjs_1.getDocument)({ data });
+                         const pdf = await loadingTask.promise;
+                         const numPages = pdf.numPages;
+                         const pageTextPromises = Array.from({ length: numPages }, async (_, i) => {
+                             const page = await pdf.getPage(i + 1);
+                             const textContent = await page.getTextContent();
+                             return textContent.items.map((item) => item.str).join(' ');
+                         });
+                         const pageTexts = await Promise.all(pageTextPromises);
+                         content = pageTexts.join('\n').trim();
+                     }
+                     else {
+                         content = fs_1.default.readFileSync(args.file_path, 'utf-8').trim();
+                     }
+                 }
+                 catch (error) {
+                     return { error: `Failed to read file: ${error.message}` };
+                 }
+             }
+             else {
+                 content = (args.content ?? "").trim();
+             }
              if (!content)
                  return { error: "Content must not be empty" };
              let entityId = undefined;
@@ -1701,7 +1736,7 @@ ids[id] <- $ids
          const deduplicate = args.deduplicate ?? true;
          const chunking = args.chunking ?? "none";
          const observations = [];
-         if (args.format === "markdown") {
+         if (args.format === "markdown" || args.format === "pdf") {
              if (chunking === "paragraphs") {
                  const parts = content
                      .split(/\r?\n\s*\r?\n+/g)
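The paragraph chunking applied to both markdown and extracted PDF text is just a blank-line split. A minimal sketch using the same regex as the hunk above (the helper name `chunkParagraphs` is illustrative):

```javascript
// Paragraph chunking as used by ingest_file with chunking: "paragraphs":
// split on blank lines (one or more), trim each chunk, drop empty ones.
function chunkParagraphs(content) {
    return content
        .split(/\r?\n\s*\r?\n+/g)
        .map(p => p.trim())
        .filter(p => p.length > 0);
}

const doc = "First paragraph.\n\nSecond paragraph,\nstill the same chunk.\n\n\nThird.";
console.log(chunkParagraphs(doc)); // three chunks; the single newline does not split
```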
@@ -2230,9 +2265,10 @@ ids[id] <- $ids
      entity_id: zod_1.z.string().optional().describe("ID of the target entity"),
      entity_name: zod_1.z.string().optional().describe("Name of the target entity (will be created if not exists)"),
      entity_type: zod_1.z.string().optional().default("Document").describe("Type of the target entity (only when creating)"),
-     format: zod_1.z.enum(["markdown", "json"]).describe("Input format"),
+     format: zod_1.z.enum(["markdown", "json", "pdf"]).describe("Input format"),
      chunking: zod_1.z.enum(["none", "paragraphs"]).optional().default("none").describe("Chunking for Markdown"),
-     content: zod_1.z.string().describe("File content (or LLM summary)"),
+     file_path: zod_1.z.string().optional().describe("Path to file on disk (alternative to content parameter)"),
+     content: zod_1.z.string().optional().describe("File content (or LLM summary) - required if file_path not provided"),
      metadata: MetadataSchema.optional().describe("Metadata for entity creation"),
      observation_metadata: MetadataSchema.optional().describe("Metadata applied to all observations"),
      deduplicate: zod_1.z.boolean().optional().default(true).describe("Skip exact duplicates"),
@@ -2240,6 +2276,9 @@ ids[id] <- $ids
  }).refine((v) => Boolean(v.entity_id) || Boolean(v.entity_name), {
      message: "entity_id or entity_name is required for ingest_file",
      path: ["entity_id"],
+ }).refine((v) => Boolean(v.file_path) || Boolean(v.content), {
+     message: "file_path or content is required for ingest_file",
+     path: ["file_path"],
  }),
  ]);
  const MutateMemoryParameters = zod_1.z.object({
@@ -2254,9 +2293,10 @@ ids[id] <- $ids
      entity_type: zod_1.z.string().optional().describe("Only when entity_name is used and entity is created new"),
      text: zod_1.z.string().optional().describe("For add_observation (required)"),
      datalog: zod_1.z.string().optional().describe("For add_inference_rule (required)"),
-     format: zod_1.z.enum(["markdown", "json"]).optional().describe("For ingest_file (required)"),
+     format: zod_1.z.enum(["markdown", "json", "pdf"]).optional().describe("For ingest_file (required)"),
      chunking: zod_1.z.enum(["none", "paragraphs"]).optional().describe("Optional for ingest_file (for markdown)"),
-     content: zod_1.z.string().optional().describe("For ingest_file (required)"),
+     file_path: zod_1.z.string().optional().describe("For ingest_file - path to file on disk (alternative to content)"),
+     content: zod_1.z.string().optional().describe("For ingest_file - file content (required if file_path not provided)"),
      observation_metadata: MetadataSchema.optional().describe("Optional for ingest_file"),
      deduplicate: zod_1.z.boolean().optional().describe("Optional for ingest_file and add_observation"),
      max_observations: zod_1.z.number().optional().describe("Optional for ingest_file"),
@@ -1,7 +1,42 @@
  "use strict";
+ var __createBinding = (this && this.__createBinding) || (Object.create ? (function(o, m, k, k2) {
+     if (k2 === undefined) k2 = k;
+     var desc = Object.getOwnPropertyDescriptor(m, k);
+     if (!desc || ("get" in desc ? !m.__esModule : desc.writable || desc.configurable)) {
+         desc = { enumerable: true, get: function() { return m[k]; } };
+     }
+     Object.defineProperty(o, k2, desc);
+ }) : (function(o, m, k, k2) {
+     if (k2 === undefined) k2 = k;
+     o[k2] = m[k];
+ }));
+ var __setModuleDefault = (this && this.__setModuleDefault) || (Object.create ? (function(o, v) {
+     Object.defineProperty(o, "default", { enumerable: true, value: v });
+ }) : function(o, v) {
+     o["default"] = v;
+ });
+ var __importStar = (this && this.__importStar) || (function () {
+     var ownKeys = function(o) {
+         ownKeys = Object.getOwnPropertyNames || function (o) {
+             var ar = [];
+             for (var k in o) if (Object.prototype.hasOwnProperty.call(o, k)) ar[ar.length] = k;
+             return ar;
+         };
+         return ownKeys(o);
+     };
+     return function (mod) {
+         if (mod && mod.__esModule) return mod;
+         var result = {};
+         if (mod != null) for (var k = ownKeys(mod), i = 0; i < k.length; i++) if (k[i] !== "default") __createBinding(result, mod, k[i]);
+         __setModuleDefault(result, mod);
+         return result;
+     };
+ })();
  Object.defineProperty(exports, "__esModule", { value: true });
  exports.MemoryService = void 0;
  const uuid_1 = require("uuid");
+ const pdf_mjs_1 = require("pdfjs-dist/legacy/build/pdf.mjs");
+ const fs = __importStar(require("fs"));
  class MemoryService {
      db;
      embeddings;
@@ -169,7 +204,7 @@ class MemoryService {
          console.error('[MemoryService] Snapshot created:', snapshotId, stats);
          return snapshotId;
      }
-     async ingestFile(content, format, entityName, entityType = 'Document', chunking = 'paragraphs') {
+     async ingestFile(content, format, entityName, entityType = 'Document', chunking = 'paragraphs', filePath) {
          const searchResults = await this.search(entityName, 1);
          let entity;
          if (searchResults.length > 0 && searchResults[0].entity.name.toLowerCase() === entityName.toLowerCase()) {
@@ -179,16 +214,64 @@ class MemoryService {
          entity = await this.createEntity(entityName, entityType, { format: format });
      }
      let chunks = [];
-     if (format === 'markdown' && chunking === 'paragraphs') {
-         chunks = content.split(/\n\s*\n/).filter((c) => c.trim().length > 0);
+     if (format === 'pdf') {
+         try {
+             let data;
+             // If filePath is provided, read from the file
+             if (filePath) {
+                 data = new Uint8Array(fs.readFileSync(filePath));
+             }
+             else {
+                 // Otherwise, assume content is base64
+                 const buffer = Buffer.from(content, 'base64');
+                 data = new Uint8Array(buffer);
+             }
+             const loadingTask = (0, pdf_mjs_1.getDocument)({ data });
+             const pdf = await loadingTask.promise;
+             const numPages = pdf.numPages;
+             const pageTextPromises = Array.from({ length: numPages }, async (_, i) => {
+                 const page = await pdf.getPage(i + 1);
+                 const textContent = await page.getTextContent();
+                 return textContent.items.map((item) => item.str).join(' ');
+             });
+             const pageTexts = await Promise.all(pageTextPromises);
+             const text = pageTexts.join('\n');
+             if (chunking === 'paragraphs') {
+                 chunks = text.split(/\n\s*\n/).filter((c) => c.trim().length > 0);
+             }
+             else {
+                 chunks = [text];
+             }
+         }
+         catch (e) {
+             console.error('[MemoryService] PDF parsing error:', e);
+             throw new Error(`Failed to parse PDF: ${e instanceof Error ? e.message : String(e)}`);
+         }
+     }
+     else if (format === 'markdown') {
+         // For markdown, also support a file path
+         let textContent = content;
+         if (filePath) {
+             textContent = fs.readFileSync(filePath, 'utf-8');
+         }
+         if (chunking === 'paragraphs') {
+             chunks = textContent.split(/\n\s*\n/).filter((c) => c.trim().length > 0);
+         }
+         else {
+             chunks = [textContent];
+         }
      }
      else if (format === 'json') {
+         let textContent = content;
+         if (filePath) {
+             textContent = fs.readFileSync(filePath, 'utf-8');
+         }
          try {
-             const data = JSON.parse(content);
+             const data = JSON.parse(textContent);
              chunks = [JSON.stringify(data, null, 2)];
          }
          catch (e) {
-             chunks = [content];
+             chunks = [textContent];
          }
      }
      else {
@@ -0,0 +1,2 @@
+ "use strict";
+ Object.defineProperty(exports, "__esModule", { value: true });
@@ -0,0 +1,2 @@
+ "use strict";
+ Object.defineProperty(exports, "__esModule", { value: true });
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "cozo-memory",
-   "version": "1.0.4",
+   "version": "1.0.5",
    "mcpName": "io.github.tobs-code/cozo-memory",
    "description": "Local-first persistent memory system for AI agents with hybrid search, graph reasoning, and MCP integration",
    "main": "dist/index.js",
@@ -55,10 +55,12 @@
    "archiver": "^7.0.1",
    "cors": "^2.8.6",
    "cozo-node": "^0.7.6",
+   "dotenv": "^17.3.1",
    "express": "^5.2.1",
    "fastmcp": "^3.31.0",
    "ollama": "^0.6.3",
    "onnxruntime-node": "^1.24.2",
+   "pdfjs-dist": "^5.4.624",
    "uuid": "^13.0.0",
    "zod": "^4.3.6"
  },