cozo-memory 1.0.4 → 1.0.5

package/README.md CHANGED
@@ -28,6 +28,8 @@
  - [Development](#development)
  - [User Preference Profiling](#user-preference-profiling-mem0-style)
  - [Troubleshooting](#troubleshooting)
+ - [Roadmap](#roadmap)
+ - [Contributing](#contributing)
  - [License](#license)

  ## Quick Start
@@ -224,7 +226,11 @@ graph LR

  ### Prerequisites
  - Node.js 20+ (recommended)
- - CozoDB native dependency is installed via `cozo-node`.
+ - **RAM: 1.7 GB minimum** (for the default bge-m3 model)
+   - Model download: ~600 MB
+   - Runtime memory: ~1.1 GB
+   - For lower-spec machines, see [Embedding Model Options](#embedding-model-options) below
+ - The CozoDB native dependency is installed via `cozo-node`

  ### Via npm (Easiest)

@@ -257,6 +263,62 @@ Notes:
  - On first start, `@xenova/transformers` downloads the embedding model (may take time).
  - Embeddings are processed on the CPU.

+ ### Embedding Model Options
+
+ CozoDB Memory supports multiple embedding models via the `EMBEDDING_MODEL` environment variable:
+
+ | Model | Size | RAM | Dimensions | Best For |
+ |-------|------|-----|------------|----------|
+ | `Xenova/bge-m3` (default) | ~600 MB | ~1.7 GB | 1024 | High accuracy, production use |
+ | `Xenova/all-MiniLM-L6-v2` | ~80 MB | ~400 MB | 384 | Low-spec machines, development |
+ | `Xenova/bge-small-en-v1.5` | ~130 MB | ~600 MB | 384 | Balanced performance |
+
+ **Configuration Options:**
+
+ **Option 1: Using a `.env` file (easiest for beginners)**
+
+ ```bash
+ # Copy the example file
+ cp .env.example .env
+
+ # Edit .env and set your preferred model
+ EMBEDDING_MODEL=Xenova/all-MiniLM-L6-v2
+ ```
+
+ **Option 2: MCP Server Config (for Claude Desktop / Kiro)**
+
+ ```json
+ {
+   "mcpServers": {
+     "cozo-memory": {
+       "command": "npx",
+       "args": ["cozo-memory"],
+       "env": {
+         "EMBEDDING_MODEL": "Xenova/all-MiniLM-L6-v2"
+       }
+     }
+   }
+ }
+ ```
+
+ **Option 3: Command Line**
+
+ ```bash
+ # Use the lightweight model for development
+ EMBEDDING_MODEL=Xenova/all-MiniLM-L6-v2 npm run start
+ ```
+
+ **Download the Model First (Recommended):**
+
+ ```bash
+ # Set the model in .env or via the command line, then:
+ EMBEDDING_MODEL=Xenova/all-MiniLM-L6-v2 npm run download-model
+ ```
+
+ **Note:** Changing models requires re-embedding existing data. The model is downloaded once on first use.
+
  ## Start / Integration

  ### MCP Server (stdio)
@@ -335,6 +397,14 @@ DB_ENGINE=rocksdb npm run dev
  | **RocksDB** | Prepared & Tested | For high-performance or very large datasets. |
  | **MDBX** | Not supported | Requires manual build of `cozo-node` from source. |

+ ### Environment Variables
+
+ | Variable | Default | Description |
+ |----------|---------|-------------|
+ | `DB_ENGINE` | `sqlite` | Database backend: `sqlite` or `rocksdb` |
+ | `EMBEDDING_MODEL` | `Xenova/bge-m3` | Embedding model (see [Embedding Model Options](#embedding-model-options)) |
+ | `PORT` | `3001` | HTTP API bridge port (if using `npm run bridge`) |
+
  ---

  ## Data Model
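The table above can be read as a tiny config loader. The sketch below is illustrative only (the function name `readConfig` is not from the package); the fallback values mirror the documented defaults:

```javascript
// Minimal sketch of how the documented environment variables would be
// consumed. Defaults mirror the table above; names are illustrative.
function readConfig(env) {
    return {
        dbEngine: env.DB_ENGINE || "sqlite",                    // "sqlite" or "rocksdb"
        embeddingModel: env.EMBEDDING_MODEL || "Xenova/bge-m3", // embedding model id
        port: Number(env.PORT || 3001),                         // HTTP bridge port
    };
}

console.log(readConfig(process.env));
```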
@@ -367,7 +437,10 @@ Actions:
  - `create_relation`: `{ from_id, to_id, relation_type, strength?, metadata? }`
  - `run_transaction`: `{ operations: Array<{ action, params }> }` **(New v1.2)**: Executes multiple operations atomically.
  - `add_inference_rule`: `{ name, datalog }`
- - `ingest_file`: `{ format, content, entity_id?, entity_name?, entity_type?, chunking?, metadata?, observation_metadata?, deduplicate?, max_observations? }`
+ - `ingest_file`: `{ format, file_path?, content?, entity_id?, entity_name?, entity_type?, chunking?, metadata?, observation_metadata?, deduplicate?, max_observations? }`
+   - `format` options: `"markdown"`, `"json"`, `"pdf"` **(New v1.9)**
+   - `file_path`: Optional path to a file on disk (alternative to the `content` parameter)
+   - `content`: File content as a string (required if `file_path` is not provided)
  - `chunking` options: `"none"`, `"paragraphs"` (future: `"semantic"`)

Important Details:
@@ -419,7 +492,7 @@ Example (Transitive Manager ⇒ Upper Manager):
  }
  ```

- Bulk Ingestion (Markdown/JSON):
+ Bulk Ingestion (Markdown/JSON/PDF):

  ```json
  {
@@ -433,6 +506,19 @@ Bulk Ingestion (Markdown/JSON):
  }
  ```

+ PDF Ingestion via File Path:
+
+ ```json
+ {
+   "action": "ingest_file",
+   "entity_name": "Research Paper",
+   "format": "pdf",
+   "file_path": "/path/to/document.pdf",
+   "chunking": "paragraphs",
+   "deduplicate": true
+ }
+ ```
+
  ### query_memory (Read)

  Actions:
@@ -784,6 +870,45 @@ npx ts-node test-user-pref.ts
  - Use `health` action to check cache hit rates
  - Consider RocksDB backend for datasets > 100k entities

+ ## Roadmap
+
+ CozoDB Memory is actively developed. Here's what's planned:
+
+ ### Near-Term (v1.x)
+
+ - **GPU Acceleration** - CUDA support for embedding generation (10-50x faster)
+ - **Streaming Ingestion** - Real-time data ingestion from logs, APIs, webhooks
+ - **Advanced Chunking** - Semantic chunking for `ingest_file` (paragraph-aware splitting)
+ - **Query Optimization** - Automatic query plan optimization for complex graph traversals
+ - **Additional Export Formats** - Notion, Roam Research, Logseq compatibility
+
+ ### Mid-Term (v2.x)
+
+ - **Multi-Modal Embeddings** - Image and audio embedding support via CLIP/Whisper
+ - **Distributed Mode** - Multi-node deployment with CozoDB clustering
+ - **Real-Time Sync** - WebSocket-based live updates for collaborative use cases
+ - **Advanced Inference** - Causal reasoning, temporal pattern detection
+ - **Web UI** - Optional web interface for memory exploration and visualization
+
+ ### Long-Term
+
+ - **Federated Learning** - Privacy-preserving model updates across instances
+ - **Custom Embedding Models** - Fine-tune embeddings on domain-specific data
+ - **Plugin System** - Extensible architecture for custom tools and integrations
+
+ ### Community Requests
+
+ Have a feature idea? Open an issue with the `enhancement` label or check [Low-Hanging-Fruit.md](Low-Hanging-Fruit.md) for quick wins you can contribute.
+
+ ## Contributing
+
+ Contributions are welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on:
+
+ - Setting up the development environment
+ - Coding standards and best practices
+ - Testing and documentation requirements
+ - Pull request process
+
  ## License

  This project is licensed under the Apache-2.0 License. See the [LICENSE](LICENSE) file for details.
@@ -33,12 +33,14 @@ var __importStar = (this && this.__importStar) || (function () {
      };
  })();
  Object.defineProperty(exports, "__esModule", { value: true });
+ require("dotenv/config"); // Load .env file first
  const transformers_1 = require("@xenova/transformers");
  const path = __importStar(require("path"));
  // Configure cache path
  const CACHE_DIR = path.resolve('./.cache');
  transformers_1.env.cacheDir = CACHE_DIR;
- const MODEL_ID = "Xenova/bge-m3";
+ // Read the model from an environment variable, or use the default
+ const MODEL_ID = process.env.EMBEDDING_MODEL || "Xenova/bge-m3";
  async function downloadModel() {
      console.log(`Downloading FP32 model for ${MODEL_ID}...`);
      // quantized: false forces FP32 model download
@@ -34,6 +34,7 @@ var __importStar = (this && this.__importStar) || (function () {
  })();
  Object.defineProperty(exports, "__esModule", { value: true });
  exports.EmbeddingService = void 0;
+ require("dotenv/config"); // Load .env file first
  const transformers_1 = require("@xenova/transformers");
  const ort = require('onnxruntime-node');
  const path = __importStar(require("path"));
@@ -91,11 +92,27 @@ class EmbeddingService {
      cache;
      session = null;
      tokenizer = null;
-     modelId = "Xenova/bge-m3";
-     dimensions = 1024;
+     modelId;
+     dimensions;
      queue = Promise.resolve();
      constructor() {
          this.cache = new LRUCache(1000, 3600000); // 1000 entries, 1h TTL
+         // Support multiple embedding models via environment variable
+         this.modelId = process.env.EMBEDDING_MODEL || "Xenova/bge-m3";
+         // Set dimensions based on model
+         const dimensionMap = {
+             "Xenova/bge-m3": 1024,
+             "Xenova/all-MiniLM-L6-v2": 384,
+             "Xenova/bge-small-en-v1.5": 384,
+             "Xenova/nomic-embed-text-v1": 768,
+             "onnx-community/Qwen3-Embedding-0.6B-ONNX": 1024,
+         };
+         this.dimensions = dimensionMap[this.modelId] || 1024;
+         console.log(`[EmbeddingService] Using model: ${this.modelId} (${this.dimensions} dimensions)`);
+     }
+     // Public getter for dimensions
+     getDimensions() {
+         return this.dimensions;
      }
      // Serializes embedding execution to avoid event loop blocking
      async runSerialized(task) {
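The model-to-dimension lookup in the constructor above can be exercised in isolation. A minimal standalone sketch (the helper name `resolveModel` is illustrative; the map entries and the 1024 fallback mirror the hunk above):

```javascript
// Standalone sketch of the model → dimension lookup used above.
// Unknown model ids fall back to 1024, matching the constructor logic.
const DIMENSION_MAP = {
    "Xenova/bge-m3": 1024,
    "Xenova/all-MiniLM-L6-v2": 384,
    "Xenova/bge-small-en-v1.5": 384,
    "Xenova/nomic-embed-text-v1": 768,
    "onnx-community/Qwen3-Embedding-0.6B-ONNX": 1024,
};

function resolveModel(env) {
    const modelId = env.EMBEDDING_MODEL || "Xenova/bge-m3";
    return { modelId, dimensions: DIMENSION_MAP[modelId] || 1024 };
}

console.log(resolveModel({ EMBEDDING_MODEL: "Xenova/all-MiniLM-L6-v2" }));
```

Note that the schema must be rebuilt with the new dimension count when the model changes, which is why changing models requires re-embedding existing data.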
@@ -109,21 +126,38 @@ class EmbeddingService {
          if (this.session && this.tokenizer)
              return;
          try {
-             // 1. Load Tokenizer
+             // 1. Check if the model needs to be downloaded
+             // Extract namespace and model name from modelId (e.g., "Xenova/bge-m3" or "onnx-community/Qwen3-Embedding-0.6B-ONNX")
+             const parts = this.modelId.split('/');
+             const namespace = parts[0];
+             const modelName = parts[1];
+             // Try both possible cache locations
+             let baseDir = path.join(transformers_1.env.cacheDir, namespace, modelName, 'onnx');
+             let fp32Path = path.join(baseDir, 'model.onnx');
+             let quantizedPath = path.join(baseDir, 'model_quantized.onnx');
+             // If the ONNX model files don't exist, download them
+             if (!fs.existsSync(fp32Path) && !fs.existsSync(quantizedPath)) {
+                 console.log(`[EmbeddingService] Model not found, downloading ${this.modelId}...`);
+                 console.log(`[EmbeddingService] This may take a few minutes on first run.`);
+                 // Import AutoModel dynamically to trigger the download
+                 const { AutoModel } = await import("@xenova/transformers");
+                 await AutoModel.from_pretrained(this.modelId, { quantized: false });
+                 console.log(`[EmbeddingService] Model download completed.`);
+             }
+             // 2. Load Tokenizer
              if (!this.tokenizer) {
                  this.tokenizer = await transformers_1.AutoTokenizer.from_pretrained(this.modelId);
              }
-             // 2. Determine model path
-             const baseDir = path.join(transformers_1.env.cacheDir, 'Xenova', 'bge-m3', 'onnx');
+             // 3. Determine model path
              // Priority: FP32 (model.onnx) > Quantized (model_quantized.onnx)
-             let modelPath = path.join(baseDir, 'model.onnx');
+             let modelPath = fp32Path;
              if (!fs.existsSync(modelPath)) {
-                 modelPath = path.join(baseDir, 'model_quantized.onnx');
+                 modelPath = quantizedPath;
              }
              if (!fs.existsSync(modelPath)) {
-                 throw new Error(`Model file not found at: ${modelPath}`);
+                 throw new Error(`Model file not found at: ${modelPath}. Download may have failed.`);
              }
-             // 3. Create Session
+             // 4. Create Session
              if (!this.session) {
                  const options = {
                      executionProviders: ['cpu'], // Use CPU backend to avoid native conflicts
@@ -139,7 +173,15 @@ class EmbeddingService {
      }
      async embed(text) {
          return this.runSerialized(async () => {
-             const textStr = String(text || "");
+             let textStr = String(text || "");
+             // For Qwen3-Embedding models, add an instruction prefix for better results
+             // (only for queries, not for documents being indexed)
+             if (this.modelId.includes('Qwen3-Embedding')) {
+                 // Add the instruction prefix if not already present
+                 if (!textStr.startsWith('Instruct:')) {
+                     textStr = `Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: ${textStr}`;
+                 }
+             }
              // 1. Cache lookup
              const cached = this.cache.get(textStr);
              if (cached) {
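The instruction-prefix logic added above is easiest to see in isolation. A sketch (the free function `applyQueryPrefix` is illustrative; the prefix string and the `Instruct:` guard match the hunk above):

```javascript
// Isolated sketch of the Qwen3-Embedding instruction-prefix behavior:
// non-Qwen3 models pass through unchanged, already-prefixed text is not
// prefixed twice.
function applyQueryPrefix(modelId, text) {
    if (modelId.includes('Qwen3-Embedding') && !text.startsWith('Instruct:')) {
        return `Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: ${text}`;
    }
    return text;
}

console.log(applyQueryPrefix("onnx-community/Qwen3-Embedding-0.6B-ONNX", "what is cozo?"));
```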
@@ -171,14 +213,22 @@ class EmbeddingService {
              const results = await this.session.run(feeds);
              // 5. Pooling & Normalization
              // Output name usually 'last_hidden_state' or 'logits'
-             // For BGE-M3, the first output is usually the hidden states [batch, seq_len, hidden_size]
              const outputName = this.session.outputNames[0];
              const outputTensor = results[outputName];
              // Ensure we have data
              if (!outputTensor || !attentionMaskData) {
                  throw new Error("No output data or attention mask available");
              }
-             const embedding = this.meanPooling(outputTensor.data, attentionMaskData, outputTensor.dims);
+             // Choose pooling strategy based on model
+             let embedding;
+             if (this.modelId.includes('Qwen3-Embedding')) {
+                 // Qwen3-Embedding uses last-token pooling
+                 embedding = this.lastTokenPooling(outputTensor.data, attentionMaskData, outputTensor.dims);
+             }
+             else {
+                 // BGE and other models use mean pooling
+                 embedding = this.meanPooling(outputTensor.data, attentionMaskData, outputTensor.dims);
+             }
              // Normalize
              const normalized = this.normalize(embedding);
              this.cache.set(textStr, normalized);
@@ -200,6 +250,25 @@ class EmbeddingService {
          }
          return results;
      }
+     lastTokenPooling(data, attentionMask, dims) {
+         // dims: [batch_size, seq_len, hidden_size]
+         // Extract the last valid token's hidden state
+         const [batchSize, seqLen, hiddenSize] = dims;
+         // Find the last valid token position
+         let lastValidIdx = seqLen - 1;
+         for (let i = seqLen - 1; i >= 0; i--) {
+             if (attentionMask[i] === 1n) {
+                 lastValidIdx = i;
+                 break;
+             }
+         }
+         // Extract the embedding at the last valid position
+         const embedding = new Float32Array(hiddenSize);
+         for (let j = 0; j < hiddenSize; j++) {
+             embedding[j] = data[lastValidIdx * hiddenSize + j];
+         }
+         return Array.from(embedding);
+     }
      meanPooling(data, attentionMask, dims) {
          // dims: [batch_size, seq_len, hidden_size]
          // We assume batch_size = 1 for single embedding call
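The two pooling strategies can be compared on a toy tensor. This sketch uses plain numbers for the attention mask (the real code receives BigInt64 data from onnxruntime, hence the `1n` comparison above) and free functions rather than the class methods:

```javascript
// Toy comparison of mean pooling vs. last-token pooling on a flattened
// [1, seq_len, hidden_size] tensor. Padding positions (mask 0) are skipped.
function meanPool(data, mask, [, seqLen, hidden]) {
    const out = new Array(hidden).fill(0);
    let count = 0;
    for (let i = 0; i < seqLen; i++) {
        if (mask[i] !== 1) continue; // skip padding
        count++;
        for (let j = 0; j < hidden; j++) out[j] += data[i * hidden + j];
    }
    return out.map(v => v / Math.max(count, 1));
}

function lastTokenPool(data, mask, [, seqLen, hidden]) {
    // Walk backwards to the last non-padding token
    let last = seqLen - 1;
    for (let i = seqLen - 1; i >= 0; i--) {
        if (mask[i] === 1) { last = i; break; }
    }
    return Array.from({ length: hidden }, (_, j) => data[last * hidden + j]);
}

// seq_len = 3, hidden = 2; the third position is padding (mask 0)
const data = [1, 2, 3, 4, 9, 9];
const mask = [1, 1, 0];
console.log(meanPool(data, mask, [1, 3, 2]));      // averages the two valid tokens → [2, 3]
console.log(lastTokenPool(data, mask, [1, 3, 2])); // takes the second token → [3, 4]
```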
package/dist/index.js CHANGED
@@ -12,12 +12,12 @@ const cozo_node_1 = require("cozo-node");
  const zod_1 = require("zod");
  const uuid_1 = require("uuid");
  const path_1 = __importDefault(require("path"));
+ const fs_1 = __importDefault(require("fs"));
+ const pdf_mjs_1 = require("pdfjs-dist/legacy/build/pdf.mjs");
  const hybrid_search_1 = require("./hybrid-search");
  const inference_engine_1 = require("./inference-engine");
  exports.DB_PATH = path_1.default.resolve(__dirname, "..", "memory_db.cozo");
  const DB_ENGINE = process.env.DB_ENGINE || "sqlite"; // "sqlite" or "rocksdb"
- const EMBEDDING_MODEL = "Xenova/bge-m3";
- const EMBEDDING_DIM = 1024;
  exports.USER_ENTITY_ID = "global_user_profile";
  exports.USER_ENTITY_NAME = "The User";
  exports.USER_ENTITY_TYPE = "User";
@@ -484,6 +484,9 @@ class MemoryServer {
      async setupSchema() {
          try {
              console.error("[Schema] Initializing schema...");
+             // Get embedding dimensions from the service
+             const EMBEDDING_DIM = this.embeddingService.getDimensions();
+             console.error(`[Schema] Using embedding dimensions: ${EMBEDDING_DIM}`);
              const existingRelations = await this.db.run("::relations");
              const relations = existingRelations.rows.map((r) => r[0]);
              // Entity Table
@@ -1664,7 +1667,39 @@ ids[id] <- $ids
      async ingestFile(args) {
          await this.initPromise;
          try {
-             const content = (args.content ?? "").trim();
+             // Check that either file_path or content is provided
+             if (!args.file_path && !args.content) {
+                 return { error: "Either file_path or content must be provided" };
+             }
+             // Read content from the file if file_path is provided
+             let content;
+             if (args.file_path) {
+                 try {
+                     if (args.format === "pdf") {
+                         // Read the PDF file and extract text using pdfjs-dist
+                         const data = new Uint8Array(fs_1.default.readFileSync(args.file_path));
+                         const loadingTask = (0, pdf_mjs_1.getDocument)({ data });
+                         const pdf = await loadingTask.promise;
+                         const numPages = pdf.numPages;
+                         const pageTextPromises = Array.from({ length: numPages }, async (_, i) => {
+                             const page = await pdf.getPage(i + 1);
+                             const textContent = await page.getTextContent();
+                             return textContent.items.map((item) => item.str).join(' ');
+                         });
+                         const pageTexts = await Promise.all(pageTextPromises);
+                         content = pageTexts.join('\n').trim();
+                     }
+                     else {
+                         content = fs_1.default.readFileSync(args.file_path, 'utf-8').trim();
+                     }
+                 }
+                 catch (error) {
+                     return { error: `Failed to read file: ${error.message}` };
+                 }
+             }
+             else {
+                 content = (args.content ?? "").trim();
+             }
              if (!content)
                  return { error: "Content must not be empty" };
              let entityId = undefined;
@@ -1701,7 +1736,7 @@ ids[id] <- $ids
          const deduplicate = args.deduplicate ?? true;
          const chunking = args.chunking ?? "none";
          const observations = [];
-         if (args.format === "markdown") {
+         if (args.format === "markdown" || args.format === "pdf") {
              if (chunking === "paragraphs") {
                  const parts = content
                      .split(/\r?\n\s*\r?\n+/g)
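The paragraph chunking applied to both markdown and extracted PDF text is just a blank-line split. A minimal sketch using the same regex as the hunk above (the helper name `chunkParagraphs` is illustrative):

```javascript
// Paragraph chunking as used by ingest_file with chunking: "paragraphs":
// split on blank lines (one or more), trim each chunk, drop empty ones.
function chunkParagraphs(content) {
    return content
        .split(/\r?\n\s*\r?\n+/g)
        .map(p => p.trim())
        .filter(p => p.length > 0);
}

const doc = "First paragraph.\n\nSecond paragraph,\nstill the same chunk.\n\n\nThird.";
console.log(chunkParagraphs(doc)); // three chunks; the single newline does not split
```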
@@ -2230,9 +2265,10 @@ ids[id] <- $ids
      entity_id: zod_1.z.string().optional().describe("ID of the target entity"),
      entity_name: zod_1.z.string().optional().describe("Name of the target entity (will be created if not exists)"),
      entity_type: zod_1.z.string().optional().default("Document").describe("Type of the target entity (only when creating)"),
-     format: zod_1.z.enum(["markdown", "json"]).describe("Input format"),
+     format: zod_1.z.enum(["markdown", "json", "pdf"]).describe("Input format"),
      chunking: zod_1.z.enum(["none", "paragraphs"]).optional().default("none").describe("Chunking for Markdown"),
-     content: zod_1.z.string().describe("File content (or LLM summary)"),
+     file_path: zod_1.z.string().optional().describe("Path to file on disk (alternative to content parameter)"),
+     content: zod_1.z.string().optional().describe("File content (or LLM summary) - required if file_path not provided"),
      metadata: MetadataSchema.optional().describe("Metadata for entity creation"),
      observation_metadata: MetadataSchema.optional().describe("Metadata applied to all observations"),
      deduplicate: zod_1.z.boolean().optional().default(true).describe("Skip exact duplicates"),
@@ -2240,6 +2276,9 @@ ids[id] <- $ids
  }).refine((v) => Boolean(v.entity_id) || Boolean(v.entity_name), {
      message: "entity_id or entity_name is required for ingest_file",
      path: ["entity_id"],
+ }).refine((v) => Boolean(v.file_path) || Boolean(v.content), {
+     message: "file_path or content is required for ingest_file",
+     path: ["file_path"],
  }),
  ]);
  const MutateMemoryParameters = zod_1.z.object({
@@ -2254,9 +2293,10 @@ ids[id] <- $ids
      entity_type: zod_1.z.string().optional().describe("Only when entity_name is used and entity is created new"),
      text: zod_1.z.string().optional().describe("For add_observation (required)"),
      datalog: zod_1.z.string().optional().describe("For add_inference_rule (required)"),
-     format: zod_1.z.enum(["markdown", "json"]).optional().describe("For ingest_file (required)"),
+     format: zod_1.z.enum(["markdown", "json", "pdf"]).optional().describe("For ingest_file (required)"),
      chunking: zod_1.z.enum(["none", "paragraphs"]).optional().describe("Optional for ingest_file (for markdown)"),
-     content: zod_1.z.string().optional().describe("For ingest_file (required)"),
+     file_path: zod_1.z.string().optional().describe("For ingest_file - path to file on disk (alternative to content)"),
+     content: zod_1.z.string().optional().describe("For ingest_file - file content (required if file_path not provided)"),
      observation_metadata: MetadataSchema.optional().describe("Optional for ingest_file"),
      deduplicate: zod_1.z.boolean().optional().describe("Optional for ingest_file and add_observation"),
      max_observations: zod_1.z.number().optional().describe("Optional for ingest_file"),
@@ -1,7 +1,42 @@
  "use strict";
+ var __createBinding = (this && this.__createBinding) || (Object.create ? (function(o, m, k, k2) {
+     if (k2 === undefined) k2 = k;
+     var desc = Object.getOwnPropertyDescriptor(m, k);
+     if (!desc || ("get" in desc ? !m.__esModule : desc.writable || desc.configurable)) {
+         desc = { enumerable: true, get: function() { return m[k]; } };
+     }
+     Object.defineProperty(o, k2, desc);
+ }) : (function(o, m, k, k2) {
+     if (k2 === undefined) k2 = k;
+     o[k2] = m[k];
+ }));
+ var __setModuleDefault = (this && this.__setModuleDefault) || (Object.create ? (function(o, v) {
+     Object.defineProperty(o, "default", { enumerable: true, value: v });
+ }) : function(o, v) {
+     o["default"] = v;
+ });
+ var __importStar = (this && this.__importStar) || (function () {
+     var ownKeys = function(o) {
+         ownKeys = Object.getOwnPropertyNames || function (o) {
+             var ar = [];
+             for (var k in o) if (Object.prototype.hasOwnProperty.call(o, k)) ar[ar.length] = k;
+             return ar;
+         };
+         return ownKeys(o);
+     };
+     return function (mod) {
+         if (mod && mod.__esModule) return mod;
+         var result = {};
+         if (mod != null) for (var k = ownKeys(mod), i = 0; i < k.length; i++) if (k[i] !== "default") __createBinding(result, mod, k[i]);
+         __setModuleDefault(result, mod);
+         return result;
+     };
+ })();
  Object.defineProperty(exports, "__esModule", { value: true });
  exports.MemoryService = void 0;
  const uuid_1 = require("uuid");
+ const pdf_mjs_1 = require("pdfjs-dist/legacy/build/pdf.mjs");
+ const fs = __importStar(require("fs"));
  class MemoryService {
      db;
      embeddings;
@@ -169,7 +204,7 @@ class MemoryService {
          console.error('[MemoryService] Snapshot created:', snapshotId, stats);
          return snapshotId;
      }
-     async ingestFile(content, format, entityName, entityType = 'Document', chunking = 'paragraphs') {
+     async ingestFile(content, format, entityName, entityType = 'Document', chunking = 'paragraphs', filePath) {
          const searchResults = await this.search(entityName, 1);
          let entity;
          if (searchResults.length > 0 && searchResults[0].entity.name.toLowerCase() === entityName.toLowerCase()) {
@@ -179,16 +214,64 @@ class MemoryService {
          entity = await this.createEntity(entityName, entityType, { format: format });
      }
      let chunks = [];
-     if (format === 'markdown' && chunking === 'paragraphs') {
-         chunks = content.split(/\n\s*\n/).filter((c) => c.trim().length > 0);
+     if (format === 'pdf') {
+         try {
+             let data;
+             // If filePath is provided, read from the file
+             if (filePath) {
+                 data = new Uint8Array(fs.readFileSync(filePath));
+             }
+             else {
+                 // Otherwise, assume content is base64
+                 const buffer = Buffer.from(content, 'base64');
+                 data = new Uint8Array(buffer);
+             }
+             const loadingTask = (0, pdf_mjs_1.getDocument)({ data });
+             const pdf = await loadingTask.promise;
+             const numPages = pdf.numPages;
+             const pageTextPromises = Array.from({ length: numPages }, async (_, i) => {
+                 const page = await pdf.getPage(i + 1);
+                 const textContent = await page.getTextContent();
+                 return textContent.items.map((item) => item.str).join(' ');
+             });
+             const pageTexts = await Promise.all(pageTextPromises);
+             const text = pageTexts.join('\n');
+             if (chunking === 'paragraphs') {
+                 chunks = text.split(/\n\s*\n/).filter((c) => c.trim().length > 0);
+             }
+             else {
+                 chunks = [text];
+             }
+         }
+         catch (e) {
+             console.error('[MemoryService] PDF parsing error:', e);
+             throw new Error(`Failed to parse PDF: ${e instanceof Error ? e.message : String(e)}`);
+         }
+     }
+     else if (format === 'markdown') {
+         // For markdown, also support a file path
+         let textContent = content;
+         if (filePath) {
+             textContent = fs.readFileSync(filePath, 'utf-8');
+         }
+         if (chunking === 'paragraphs') {
+             chunks = textContent.split(/\n\s*\n/).filter((c) => c.trim().length > 0);
+         }
+         else {
+             chunks = [textContent];
+         }
      }
      else if (format === 'json') {
+         let textContent = content;
+         if (filePath) {
+             textContent = fs.readFileSync(filePath, 'utf-8');
+         }
          try {
-             const data = JSON.parse(content);
+             const data = JSON.parse(textContent);
              chunks = [JSON.stringify(data, null, 2)];
          }
          catch (e) {
-             chunks = [content];
+             chunks = [textContent];
          }
      }
      else {
@@ -0,0 +1,2 @@
+ "use strict";
+ Object.defineProperty(exports, "__esModule", { value: true });
@@ -0,0 +1,2 @@
+ "use strict";
+ Object.defineProperty(exports, "__esModule", { value: true });
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "cozo-memory",
-   "version": "1.0.4",
+   "version": "1.0.5",
    "mcpName": "io.github.tobs-code/cozo-memory",
    "description": "Local-first persistent memory system for AI agents with hybrid search, graph reasoning, and MCP integration",
    "main": "dist/index.js",
@@ -55,10 +55,12 @@
    "archiver": "^7.0.1",
    "cors": "^2.8.6",
    "cozo-node": "^0.7.6",
+   "dotenv": "^17.3.1",
    "express": "^5.2.1",
    "fastmcp": "^3.31.0",
    "ollama": "^0.6.3",
    "onnxruntime-node": "^1.24.2",
+   "pdfjs-dist": "^5.4.624",
    "uuid": "^13.0.0",
    "zod": "^4.3.6"
  },