cozo-memory 1.0.3 → 1.0.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,8 +1,36 @@
1
1
  # CozoDB Memory MCP Server
2
2
 
3
- Persistent, local-first memory for AI agents. No cloud, no Docker, no external services – just CozoDB embedded in Node.js.
4
-
5
- A local, single-user memory system based on CozoDB with MCP (Model Context Protocol) integration. Focus: robust storage, fast hybrid search (Vector/Graph/Keyword), time-travel queries, and maintainable consolidation.
3
+ [![npm](https://img.shields.io/npm/v/cozo-memory)](https://www.npmjs.com/package/cozo-memory)
4
+ [![Node](https://img.shields.io/node/v/cozo-memory)](https://nodejs.org)
5
+ [![License](https://img.shields.io/badge/license-Apache%202.0-blue)](LICENSE)
6
+
7
+ **Local-first memory for Claude & AI agents with hybrid search, Graph-RAG, and time-travel – all in a single binary, no cloud, no Docker.**
8
+
9
+ ## Table of Contents
10
+
11
+ - [Quick Start](#quick-start)
12
+ - [Key Features](#key-features)
13
+ - [Positioning & Comparison](#positioning--comparison)
14
+ - [Performance & Benchmarks](#performance--benchmarks)
15
+ - [Architecture](#architecture)
16
+ - [Installation](#installation)
17
+ - [Start / Integration](#start--integration)
18
+ - [Configuration & Backends](#configuration--backends)
19
+ - [Data Model](#data-model)
20
+ - [MCP Tools](#mcp-tools)
21
+ - [mutate_memory (Write)](#mutate_memory-write)
22
+ - [query_memory (Read)](#query_memory-read)
23
+ - [analyze_graph (Analysis)](#analyze_graph-analysis)
24
+ - [manage_system (Maintenance)](#manage_system-maintenance)
25
+ - [Production Monitoring](#production-monitoring)
26
+ - [Technical Highlights](#technical-highlights)
27
+ - [Optional: HTTP API Bridge](#optional-http-api-bridge)
28
+ - [Development](#development)
29
+ - [User Preference Profiling](#user-preference-profiling-mem0-style)
30
+ - [Troubleshooting](#troubleshooting)
31
+ - [Roadmap](#roadmap)
32
+ - [Contributing](#contributing)
33
+ - [License](#license)
6
34
 
7
35
  ## Quick Start
8
36
 
@@ -27,13 +55,35 @@ npm run start
27
55
 
28
56
  Now you can add the server to your MCP client (e.g. Claude Desktop).
29
57
 
30
- ## Overview
58
+ ## Key Features
59
+
60
+ 🔍 **Hybrid Search (since v0.7)** - Combines semantic search (HNSW), full-text search (FTS), and graph signals via Reciprocal Rank Fusion (RRF)
61
+
62
+ 🕸️ **Graph-RAG & Graph-Walking (since v1.7)** - Advanced retrieval combining vector seeds with recursive graph traversals using optimized Datalog algorithms
63
+
64
+ 🎯 **Multi-Vector Support (since v1.7)** - Dual embeddings per entity: content-embedding for context, name-embedding for identification
65
+
66
+ ⚡ **Semantic Caching (since v0.8.5)** - Two-level cache (L1 memory + L2 persistent) with semantic query matching
67
+
68
+ ⏱️ **Time-Travel Queries** - Version all changes via CozoDB Validity; query any point in history
69
+
70
+ 🔗 **Atomic Transactions (since v1.2)** - Multi-statement transactions ensuring data consistency
31
71
 
32
- This repository contains:
33
- - An MCP server (stdio) for Claude/other MCP clients.
34
- - An optional HTTP API bridge server for UI/tools.
72
+ 📊 **Graph Algorithms (since v1.3/v1.6)** - PageRank, Betweenness Centrality, HITS, Community Detection, Shortest Path
35
73
 
36
- Key Features:
74
+ 🧹 **Janitor Service** - LLM-backed automatic cleanup with hierarchical summarization
75
+
76
+ 👤 **User Preference Profiling** - Persistent user preferences with automatic 50% search boost
77
+
78
+ 🔍 **Near-Duplicate Detection** - Automatic LSH-based deduplication to avoid redundancy
79
+
80
+ 🧠 **Inference Engine** - Implicit knowledge discovery with multiple strategies
81
+
82
+ 🏠 **100% Local** - Embeddings via ONNX/Transformers; no external services required
83
+
84
+ 📦 **Export/Import (since v1.8)** - Export to JSON, Markdown, or Obsidian-ready ZIP; import from Mem0, MemGPT, Markdown, or native format
85
+
86
+ ### Detailed Features
37
87
  - **Hybrid Search (v0.7 Optimized)**: Combination of semantic search (HNSW), **Full-Text Search (FTS)**, and graph signals, merged via Reciprocal Rank Fusion (RRF).
38
88
  - **Full-Text Search (FTS)**: Native CozoDB v0.7 FTS indices with stemming, stopword filtering, and robust query sanitizing (cleaning of `+ - * / \ ( ) ? .`) for maximum stability.
39
89
  - **Near-Duplicate Detection (LSH)**: Automatically detects very similar observations via MinHash-LSH (CozoDB v0.7) to avoid redundancy.
@@ -119,39 +169,68 @@ This tool (`src/benchmark.ts`) performs the following tests:
119
169
  3. **Search Performance**: Latency measurement for Hybrid Search vs. Raw Vector Search.
120
170
  4. **RRF Overhead**: Determination of additional computation time for fusion logic.
121
171
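+ The fusion logic whose overhead is benchmarked here is standard Reciprocal Rank Fusion: each ranked result list contributes `1 / (k + rank)` to an item's fused score. A minimal sketch (the constant `k = 60` is the common default from the RRF literature, not necessarily the value cozo-memory uses):
+
+ ```javascript
+ // Minimal Reciprocal Rank Fusion (RRF) sketch.
+ // Each ranked list contributes 1 / (k + rank) per item; k damps the
+ // influence of top ranks. Constants and weighting here are illustrative.
+ function rrfFuse(rankedLists, k = 60) {
+   const scores = new Map();
+   for (const list of rankedLists) {
+     list.forEach((id, index) => {
+       const rank = index + 1; // 1-based rank
+       scores.set(id, (scores.get(id) || 0) + 1 / (k + rank));
+     });
+   }
+   // Sort descending by fused score
+   return [...scores.entries()]
+     .sort((a, b) => b[1] - a[1])
+     .map(([id]) => id);
+ }
+
+ // Example: semantic and full-text results agree on "alice".
+ const fused = rrfFuse([
+   ["alice", "bob", "carol"],  // semantic (HNSW) ranking
+   ["alice", "carol", "dave"], // full-text (FTS) ranking
+ ]);
+ console.log(fused); // ["alice", "carol", "bob", "dave"]
+ ```
+
+ Items appearing in several lists accumulate score, which is why "carol" (ranked in both lists) outranks "bob" (ranked higher, but in only one list).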
 
122
- ## Architecture (High Level)
123
-
124
- ```
125
- ┌───────────────────────────┐
126
- MCP Client
127
- └──────────────┬────────────┘
128
- stdio
129
- ┌──────────────▼────────────┐
130
- │ MCP Server │
131
- │ FastMCP + Zod Schemas │
132
- └──────────────┬────────────┘
133
-
134
- ┌──────────────▼────────────┐
135
- │ Memory Services │
136
- │ - Embeddings (ONNX) │
137
- │ - Hybrid Search (RRF) │
138
- │ - Semantic LRU Cache │
139
- │ - Inference Engine │
140
- └──────────────┬────────────┘
141
-
142
- ┌──────────────▼────────────┐
143
- │ CozoDB (SQLite) │
144
- │ - Relations + Validity │
145
- │ - HNSW Indices │
146
- │ - Datalog/Graph Algorithms│
147
- └───────────────────────────┘
172
+ ## Architecture
173
+
174
+ ```mermaid
175
+ graph TB
176
+ Client[MCP Client<br/>Claude Desktop, etc.]
177
+ Server[MCP Server<br/>FastMCP + Zod Schemas]
178
+ Services[Memory Services]
179
+ Embeddings[Embeddings<br/>ONNX Runtime]
180
+ Search[Hybrid Search<br/>RRF Fusion]
181
+ Cache[Semantic Cache<br/>L1 + L2]
182
+ Inference[Inference Engine<br/>Multi-Strategy]
183
+ DB[(CozoDB SQLite<br/>Relations + Validity<br/>HNSW Indices<br/>Datalog/Graph)]
184
+
185
+ Client -->|stdio| Server
186
+ Server --> Services
187
+ Services --> Embeddings
188
+ Services --> Search
189
+ Services --> Cache
190
+ Services --> Inference
191
+ Services --> DB
192
+
193
+ style Client fill:#e1f5ff
194
+ style Server fill:#fff4e1
195
+ style Services fill:#f0e1ff
196
+ style DB fill:#e1ffe1
197
+ ```
198
+
199
+ ### Graph-Walking Visualization
200
+
201
+ ```mermaid
202
+ graph LR
203
+ Start([Query: What is Alice working on?])
204
+ V1[Vector Search<br/>Find: Alice]
205
+ E1[Alice<br/>Person]
206
+ E2[Project X<br/>Project]
207
+ E3[Feature Flags<br/>Technology]
208
+ E4[Bob<br/>Person]
209
+
210
+ Start --> V1
211
+ V1 -.semantic similarity.-> E1
212
+ E1 -->|works_on| E2
213
+ E2 -->|uses_tech| E3
214
+ E1 -->|colleague_of| E4
215
+ E4 -.semantic: also relevant.-> E2
216
+
217
+ style Start fill:#e1f5ff
218
+ style V1 fill:#fff4e1
219
+ style E1 fill:#ffe1e1
220
+ style E2 fill:#e1ffe1
221
+ style E3 fill:#f0e1ff
222
+ style E4 fill:#ffe1e1
148
223
  ```
149
224
 
150
225
  ## Installation
151
226
 
152
227
  ### Prerequisites
153
228
  - Node.js 20+ (recommended)
154
- - CozoDB native dependency is installed via `cozo-node`.
229
+ - **RAM: 1.7 GB minimum** (for default bge-m3 model)
230
+ - Model download: ~600 MB
231
+ - Runtime memory: ~1.1 GB
232
+ - For lower-spec machines, see [Embedding Model Options](#embedding-model-options) below
233
+ - CozoDB native dependency is installed via `cozo-node`
155
234
 
156
235
  ### Via npm (Easiest)
157
236
 
@@ -184,6 +263,62 @@ Notes:
184
263
  - On first start, `@xenova/transformers` downloads the embedding model (may take time).
185
264
  - Embeddings are processed on the CPU.
186
265
 
266
+ ### Embedding Model Options
267
+
268
+ CozoDB Memory supports multiple embedding models via the `EMBEDDING_MODEL` environment variable:
269
+
270
+ | Model | Size | RAM | Dimensions | Best For |
271
+ |-------|------|-----|------------|----------|
272
+ | `Xenova/bge-m3` (default) | ~600 MB | ~1.7 GB | 1024 | High accuracy, production use |
273
+ | `Xenova/all-MiniLM-L6-v2` | ~80 MB | ~400 MB | 384 | Low-spec machines, development |
274
+ | `Xenova/bge-small-en-v1.5` | ~130 MB | ~600 MB | 384 | Balanced performance |
275
+
276
+ **Configuration Options:**
277
+
278
+ **Option 1: Using `.env` file (Easiest for beginners)**
279
+
280
+ ```bash
281
+ # Copy the example file
282
+ cp .env.example .env
283
+
284
+ # Edit .env and set your preferred model
285
+ EMBEDDING_MODEL=Xenova/all-MiniLM-L6-v2
286
+ ```
287
+
288
+ **Option 2: MCP Server Config (For Claude Desktop / Kiro)**
289
+
290
+ ```json
291
+ {
292
+ "mcpServers": {
293
+ "cozo-memory": {
294
+ "command": "npx",
295
+ "args": ["cozo-memory"],
296
+ "env": {
297
+ "EMBEDDING_MODEL": "Xenova/all-MiniLM-L6-v2"
298
+ }
299
+ }
300
+ }
301
+ }
302
+ ```
303
+
304
+ **Option 3: Command Line**
305
+
306
+ ```bash
307
+ # Use lightweight model for development
308
+ EMBEDDING_MODEL=Xenova/all-MiniLM-L6-v2 npm run start
309
+ ```
310
+
311
+ **Download Model First (Recommended):**
312
+
313
+ ```bash
314
+ # Set model in .env or via command line, then:
315
+ EMBEDDING_MODEL=Xenova/all-MiniLM-L6-v2 npm run download-model
316
+ ```
319
+
320
+ **Note:** Changing models requires re-embedding existing data. The model is downloaded once on first use.
321
+
187
322
  ## Start / Integration
188
323
 
189
324
  ### MCP Server (stdio)
@@ -262,6 +397,14 @@ DB_ENGINE=rocksdb npm run dev
262
397
  | **RocksDB** | Prepared & Tested | For high-performance or very large datasets. |
263
398
  | **MDBX** | Not supported | Requires manual build of `cozo-node` from source. |
264
399
 
400
+ ### Environment Variables
401
+
402
+ | Variable | Default | Description |
403
+ |----------|---------|-------------|
404
+ | `DB_ENGINE` | `sqlite` | Database backend: `sqlite` or `rocksdb` |
405
+ | `EMBEDDING_MODEL` | `Xenova/bge-m3` | Embedding model (see [Embedding Model Options](#embedding-model-options)) |
406
+ | `PORT` | `3001` | HTTP API bridge port (if using `npm run bridge`) |
407
+
265
408
  ---
266
409
 
267
410
  ## Data Model
@@ -277,6 +420,13 @@ CozoDB Relations (simplified) – all write operations create new `Validity` ent
277
420
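+ Because every write creates a new `Validity` entry, past states remain queryable through CozoDB's time-travel syntax. A hypothetical CozoScript lookup (the relation and field names are illustrative, not the package's actual schema):
+
+ ```
+ ?[content] := *observations{entity_id: $id, content @ "NOW"}
+ ```
+
+ Replacing `"NOW"` with a timestamp reads back the state of the relation as it was at that point in history.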
 
278
421
  The interface is reduced to **4 consolidated tools**. The concrete operation is always chosen via `action`.
279
422
 
423
+ | Tool | Purpose | Key Actions |
424
+ |------|---------|-------------|
425
+ | `mutate_memory` | Write operations | create_entity, update_entity, delete_entity, add_observation, create_relation, run_transaction, add_inference_rule, ingest_file |
426
+ | `query_memory` | Read operations | search, advancedSearch, context, entity_details, history, graph_rag, graph_walking |
427
+ | `analyze_graph` | Graph analysis | explore, communities, pagerank, betweenness, hits, shortest_path, bridge_discovery, semantic_walk, infer_relations |
428
+ | `manage_system` | Maintenance | health, metrics, export_memory, import_memory, snapshot_create, snapshot_list, snapshot_diff, cleanup, reflect, clear_memory |
429
+
280
430
  ### mutate_memory (Write)
281
431
 
282
432
  Actions:
@@ -287,7 +437,11 @@ Actions:
287
437
  - `create_relation`: `{ from_id, to_id, relation_type, strength?, metadata? }`
288
438
  - `run_transaction`: `{ operations: Array<{ action, params }> }` **(New v1.2)**: Executes multiple operations atomically.
289
439
  - `add_inference_rule`: `{ name, datalog }`
290
- - `ingest_file`: `{ format, content, entity_id?, entity_name?, entity_type?, chunking?, metadata?, observation_metadata?, deduplicate?, max_observations? }`
440
+ - `ingest_file`: `{ format, file_path?, content?, entity_id?, entity_name?, entity_type?, chunking?, metadata?, observation_metadata?, deduplicate?, max_observations? }`
441
+ - `format` options: `"markdown"`, `"json"`, `"pdf"` **(New v1.9)**
442
+ - `file_path`: Optional path to file on disk (alternative to `content` parameter)
443
+ - `content`: File content as string (required if `file_path` not provided)
444
+ - `chunking` options: `"none"`, `"paragraphs"` (future: `"semantic"`)
291
445
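+ For example, a single atomic `run_transaction` call creating two relations (entity IDs are placeholders, as elsewhere in this README):
+
+ ```json
+ {
+   "action": "run_transaction",
+   "operations": [
+     { "action": "create_relation", "params": { "from_id": "ALICE_ID", "to_id": "PROJECT_ID", "relation_type": "works_on", "strength": 0.9 } },
+     { "action": "create_relation", "params": { "from_id": "PROJECT_ID", "to_id": "TECH_ID", "relation_type": "uses_tech" } }
+   ]
+ }
+ ```
+
+ If any operation fails, the whole transaction is rolled back, so no partial graph state is committed.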
 
292
446
  Important Details:
293
447
  - `run_transaction` supports `create_entity`, `add_observation`, and `create_relation`. Parameters are automatically suffixed to avoid collisions.
@@ -338,7 +492,7 @@ Example (Transitive Manager ⇒ Upper Manager):
338
492
  }
339
493
  ```
340
494
 
341
- Bulk Ingestion (Markdown/JSON):
495
+ Bulk Ingestion (Markdown/JSON/PDF):
342
496
 
343
497
  ```json
344
498
  {
@@ -352,6 +506,19 @@ Bulk Ingestion (Markdown/JSON):
352
506
  }
353
507
  ```
354
508
 
509
+ PDF Ingestion via File Path:
510
+
511
+ ```json
512
+ {
513
+ "action": "ingest_file",
514
+ "entity_name": "Research Paper",
515
+ "format": "pdf",
516
+ "file_path": "/path/to/document.pdf",
517
+ "chunking": "paragraphs",
518
+ "deduplicate": true
519
+ }
520
+ ```
521
+
355
522
  ### query_memory (Read)
356
523
 
357
524
  Actions:
@@ -447,7 +614,10 @@ Examples:
447
614
  ### manage_system (Maintenance)
448
615
 
449
616
  Actions:
450
- - `health`: `{}` returns DB counts + embedding cache stats.
617
+ - `health`: `{}` returns DB counts + embedding cache stats + performance metrics.
618
+ - `metrics`: `{}` returns detailed operation counts, error statistics, and performance data.
619
+ - `export_memory`: `{ format, includeMetadata?, includeRelationships?, includeObservations?, entityTypes?, since? }` exports memory to various formats.
620
+ - `import_memory`: `{ data, sourceFormat, mergeStrategy?, defaultEntityType? }` imports memory from external sources.
451
621
  - `snapshot_create`: `{ metadata? }`
452
622
  - `snapshot_list`: `{}`
453
623
  - `snapshot_diff`: `{ snapshot_id_a, snapshot_id_b }`
@@ -460,6 +630,22 @@ Janitor Cleanup Details:
460
630
  - With `confirm: true`, the Janitor becomes active:
461
631
  - **Hierarchical Summarization**: Detects isolated or old observations, has them summarized by a local LLM (Ollama), and creates a new `ExecutiveSummary` node. Old fragments are deleted to reduce noise while preserving knowledge.
462
632
 
633
+ **Before Janitor:**
634
+ ```
635
+ Entity: Project X
636
+ ├─ Observation 1: "Started in Q1" (90 days old, isolated)
637
+ ├─ Observation 2: "Uses React" (85 days old, isolated)
638
+ ├─ Observation 3: "Team of 5" (80 days old, isolated)
639
+ └─ Observation 4: "Deployed to staging" (75 days old, isolated)
640
+ ```
641
+
642
+ **After Janitor:**
643
+ ```
644
+ Entity: Project X
645
+ └─ ExecutiveSummary: "Project X is a React-based application started in Q1
646
+ with a team of 5 developers, currently deployed to staging environment."
647
+ ```
648
+
463
649
  Reflection Service Details:
464
650
  - `reflect` analyzes observations of an entity (or top 5 active entities) to find contradictions, patterns, or temporal developments.
465
651
  - Results are persisted as new observations with metadata field `{ "kind": "reflection" }` and are retrievable via `context`.
@@ -467,6 +653,97 @@ Reflection Service Details:
467
653
 
468
654
  Defaults: `older_than_days=30`, `max_observations=20`, `min_entity_degree=2`, `model="demyagent-4b-i1:Q6_K"`.
469
655
 
656
+ Export/Import Details:
657
+ - `export_memory` supports three formats:
658
+ - **JSON** (`format: "json"`): Native Cozo format, fully re-importable with all metadata and timestamps.
659
+ - **Markdown** (`format: "markdown"`): Human-readable document with entities, observations, and relationships.
660
+ - **Obsidian** (`format: "obsidian"`): ZIP archive with Wiki-Links `[[Entity]]`, YAML frontmatter, ready for Obsidian vault.
661
+ - `import_memory` supports four source formats:
662
+ - **Cozo** (`sourceFormat: "cozo"`): Import from native JSON export.
663
+ - **Mem0** (`sourceFormat: "mem0"`): Import from Mem0 format (user_id becomes entity).
664
+ - **MemGPT** (`sourceFormat: "memgpt"`): Import from MemGPT archival/recall memory.
665
+ - **Markdown** (`sourceFormat: "markdown"`): Parse markdown sections as entities with observations.
666
+ - Merge strategies: `skip` (default, skip duplicates), `overwrite` (replace existing), `merge` (combine metadata).
667
+ - Optional filters: `entityTypes` (array), `since` (Unix timestamp in ms), `includeMetadata`, `includeRelationships`, `includeObservations`.
668
+
669
+ Example Export:
670
+ ```json
671
+ {
672
+ "action": "export_memory",
673
+ "format": "obsidian",
674
+ "includeMetadata": true,
675
+ "entityTypes": ["Person", "Project"]
676
+ }
677
+ ```
678
+
679
+ Example Import:
680
+ ```json
681
+ {
682
+ "action": "import_memory",
683
+ "sourceFormat": "mem0",
684
+ "data": "{\"user_id\": \"alice\", \"memories\": [...]}",
685
+ "mergeStrategy": "skip"
686
+ }
687
+ ```
688
+
689
+ Production Monitoring Details:
690
+ - `health` provides comprehensive system status including entity/observation/relationship counts, embedding cache statistics, and performance metrics (last operation time, average operation time, total operations).
691
+ - `metrics` returns detailed operational metrics:
692
+ - **Operation Counts**: Tracks create_entity, update_entity, delete_entity, add_observation, create_relation, search, and graph_operations.
693
+ - **Error Statistics**: Total errors and breakdown by operation type.
694
+ - **Performance Metrics**: Last operation duration, average operation duration, and total operations executed.
695
+ - Delete operations now include detailed logging with verification steps and return statistics about deleted data (observations, outgoing/incoming relations).
696
+
697
+ ## Production Monitoring
698
+
699
+ The system includes comprehensive monitoring capabilities for production deployments:
700
+
701
+ ### Metrics Tracking
702
+
703
+ All operations are automatically tracked with detailed metrics:
704
+ - Operation counts by type (create, update, delete, search, etc.)
705
+ - Error tracking with breakdown by operation
706
+ - Performance metrics (latency, throughput)
707
+
708
+ ### Health Endpoint
709
+
710
+ The `health` action provides real-time system status:
711
+ ```json
712
+ { "action": "health" }
713
+ ```
714
+
715
+ Returns:
716
+ - Database counts (entities, observations, relationships)
717
+ - Embedding cache statistics (hit rate, size)
718
+ - Performance metrics (last operation time, average time, total operations)
719
+
720
+ ### Metrics Endpoint
721
+
722
+ The `metrics` action provides detailed operational metrics:
723
+ ```json
724
+ { "action": "metrics" }
725
+ ```
726
+
727
+ Returns:
728
+ - **operations**: Count of each operation type
729
+ - **errors**: Total errors and breakdown by operation
730
+ - **performance**: Last operation duration, average duration, total operations
731
+
732
+ ### Enhanced Delete Operations
733
+
734
+ Delete operations include comprehensive logging and verification:
735
+ - Detailed step-by-step logging with `[Delete]` prefix
736
+ - Counts related data before deletion
737
+ - Verification after deletion
738
+ - Returns statistics: `{ deleted: { observations: N, outgoing_relations: N, incoming_relations: N } }`
739
+
740
+ Example:
741
+ ```json
742
+ { "action": "delete_entity", "entity_id": "ENTITY_ID" }
743
+ ```
744
+
745
+ Returns deletion statistics showing exactly what was removed.
746
+
470
747
  ## Technical Highlights
471
748
 
472
749
  ### Local ONNX Embeddings (Transformers)
@@ -572,8 +849,65 @@ npx ts-node test-user-pref.ts
572
849
 
573
850
  ## Troubleshooting
574
851
 
575
- - Embedding model download may take a long time on first start (Transformers loads artifacts).
576
- - If `cleanup` is used, an Ollama service must be reachable locally and the desired model must be present.
852
+ ### Common Issues
853
+
854
+ **First Start Takes Long**
855
+ - The embedding model download can take from 30 seconds to a few minutes on first start, depending on connection (Transformers downloads ~600 MB of artifacts for the default bge-m3 model)
856
+ - This is normal and only happens once
857
+ - Subsequent starts are fast (< 2 seconds)
858
+
859
+ **Cleanup/Reflect Requires Ollama**
860
+ - If using `cleanup` or `reflect` actions, an Ollama service must be running locally
861
+ - Install Ollama from https://ollama.ai
862
+ - Pull the desired model: `ollama pull demyagent-4b-i1:Q6_K` (or your preferred model)
863
+
864
+ **Windows-Specific**
865
+ - Embeddings are processed on CPU for maximum compatibility
866
+ - RocksDB backend requires Visual C++ Redistributable if using that option
867
+
868
+ **Performance Issues**
869
+ - First query after restart is slower (cold cache)
870
+ - Use `health` action to check cache hit rates
871
+ - Consider RocksDB backend for datasets > 100k entities
872
+
873
+ ## Roadmap
874
+
875
+ CozoDB Memory is actively developed. Here's what's planned:
876
+
877
+ ### Near-Term (v1.x)
878
+
879
+ - **GPU Acceleration** - CUDA support for embedding generation (10-50x faster)
880
+ - **Streaming Ingestion** - Real-time data ingestion from logs, APIs, webhooks
881
+ - **Advanced Chunking** - Semantic chunking for `ingest_file` (paragraph-aware splitting)
882
+ - **Query Optimization** - Automatic query plan optimization for complex graph traversals
883
+ - **Additional Export Formats** - Notion, Roam Research, Logseq compatibility
884
+
885
+ ### Mid-Term (v2.x)
886
+
887
+ - **Multi-Modal Embeddings** - Image and audio embedding support via CLIP/Whisper
888
+ - **Distributed Mode** - Multi-node deployment with CozoDB clustering
889
+ - **Real-Time Sync** - WebSocket-based live updates for collaborative use cases
890
+ - **Advanced Inference** - Causal reasoning, temporal pattern detection
891
+ - **Web UI** - Optional web interface for memory exploration and visualization
892
+
893
+ ### Long-Term
894
+
895
+ - **Federated Learning** - Privacy-preserving model updates across instances
896
+ - **Custom Embedding Models** - Fine-tune embeddings on domain-specific data
897
+ - **Plugin System** - Extensible architecture for custom tools and integrations
898
+
899
+ ### Community Requests
900
+
901
+ Have a feature idea? Open an issue with the `enhancement` label or check [Low-Hanging-Fruit.md](Low-Hanging-Fruit.md) for quick wins you can contribute.
902
+
903
+ ## Contributing
904
+
905
+ Contributions are welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on:
906
+
907
+ - Setting up the development environment
908
+ - Coding standards and best practices
909
+ - Testing and documentation requirements
910
+ - Pull request process
577
911
 
578
912
  ## License
579
913
 
@@ -33,12 +33,14 @@ var __importStar = (this && this.__importStar) || (function () {
33
33
  };
34
34
  })();
35
35
  Object.defineProperty(exports, "__esModule", { value: true });
36
+ require("dotenv/config"); // Load .env file first
36
37
  const transformers_1 = require("@xenova/transformers");
37
38
  const path = __importStar(require("path"));
38
39
  // Configure cache path
39
40
  const CACHE_DIR = path.resolve('./.cache');
40
41
  transformers_1.env.cacheDir = CACHE_DIR;
41
- const MODEL_ID = "Xenova/bge-m3";
42
+ // Read model from environment variable or use default
43
+ const MODEL_ID = process.env.EMBEDDING_MODEL || "Xenova/bge-m3";
42
44
  async function downloadModel() {
43
45
  console.log(`Downloading FP32 model for ${MODEL_ID}...`);
44
46
  // quantized: false forces FP32 model download
@@ -34,6 +34,7 @@ var __importStar = (this && this.__importStar) || (function () {
34
34
  })();
35
35
  Object.defineProperty(exports, "__esModule", { value: true });
36
36
  exports.EmbeddingService = void 0;
37
+ require("dotenv/config"); // Load .env file first
37
38
  const transformers_1 = require("@xenova/transformers");
38
39
  const ort = require('onnxruntime-node');
39
40
  const path = __importStar(require("path"));
@@ -91,11 +92,27 @@ class EmbeddingService {
91
92
  cache;
92
93
  session = null;
93
94
  tokenizer = null;
94
- modelId = "Xenova/bge-m3";
95
- dimensions = 1024;
95
+ modelId;
96
+ dimensions;
96
97
  queue = Promise.resolve();
97
98
  constructor() {
98
99
  this.cache = new LRUCache(1000, 3600000); // 1000 entries, 1h TTL
100
+ // Support multiple embedding models via environment variable
101
+ this.modelId = process.env.EMBEDDING_MODEL || "Xenova/bge-m3";
102
+ // Set dimensions based on model
103
+ const dimensionMap = {
104
+ "Xenova/bge-m3": 1024,
105
+ "Xenova/all-MiniLM-L6-v2": 384,
106
+ "Xenova/bge-small-en-v1.5": 384,
107
+ "Xenova/nomic-embed-text-v1": 768,
108
+ "onnx-community/Qwen3-Embedding-0.6B-ONNX": 1024,
109
+ };
110
+ this.dimensions = dimensionMap[this.modelId] || 1024;
111
+ console.log(`[EmbeddingService] Using model: ${this.modelId} (${this.dimensions} dimensions)`);
112
+ }
113
+ // Public getter for dimensions
114
+ getDimensions() {
115
+ return this.dimensions;
99
116
  }
100
117
  // Serializes embedding execution to avoid event loop blocking
101
118
  async runSerialized(task) {
@@ -109,21 +126,38 @@ class EmbeddingService {
109
126
  if (this.session && this.tokenizer)
110
127
  return;
111
128
  try {
112
- // 1. Load Tokenizer
129
+ // 1. Check if model needs to be downloaded
130
+ // Extract namespace and model name from modelId (e.g., "Xenova/bge-m3" or "onnx-community/Qwen3-Embedding-0.6B-ONNX")
131
+ const parts = this.modelId.split('/');
132
+ const namespace = parts[0];
133
+ const modelName = parts[1];
134
+ // Try both possible cache locations
135
+ let baseDir = path.join(transformers_1.env.cacheDir, namespace, modelName, 'onnx');
136
+ let fp32Path = path.join(baseDir, 'model.onnx');
137
+ let quantizedPath = path.join(baseDir, 'model_quantized.onnx');
138
+ // If ONNX model files don't exist, download them
139
+ if (!fs.existsSync(fp32Path) && !fs.existsSync(quantizedPath)) {
140
+ console.log(`[EmbeddingService] Model not found, downloading ${this.modelId}...`);
141
+ console.log(`[EmbeddingService] This may take a few minutes on first run.`);
142
+ // Import AutoModel dynamically to trigger download
143
+ const { AutoModel } = await import("@xenova/transformers");
144
+ await AutoModel.from_pretrained(this.modelId, { quantized: false });
145
+ console.log(`[EmbeddingService] Model download completed.`);
146
+ }
147
+ // 2. Load Tokenizer
113
148
  if (!this.tokenizer) {
114
149
  this.tokenizer = await transformers_1.AutoTokenizer.from_pretrained(this.modelId);
115
150
  }
116
- // 2. Determine model path
117
- const baseDir = path.join(transformers_1.env.cacheDir, 'Xenova', 'bge-m3', 'onnx');
151
+ // 3. Determine model path
118
152
  // Priority: FP32 (model.onnx) > Quantized (model_quantized.onnx)
119
- let modelPath = path.join(baseDir, 'model.onnx');
153
+ let modelPath = fp32Path;
120
154
  if (!fs.existsSync(modelPath)) {
121
- modelPath = path.join(baseDir, 'model_quantized.onnx');
155
+ modelPath = quantizedPath;
122
156
  }
123
157
  if (!fs.existsSync(modelPath)) {
124
- throw new Error(`Model file not found at: ${modelPath}`);
158
+ throw new Error(`Model file not found at: ${modelPath}. Download may have failed.`);
125
159
  }
126
- // 3. Create Session
160
+ // 4. Create Session
127
161
  if (!this.session) {
128
162
  const options = {
129
163
  executionProviders: ['cpu'], // Use CPU backend to avoid native conflicts
@@ -139,7 +173,15 @@ class EmbeddingService {
139
173
  }
140
174
  async embed(text) {
141
175
  return this.runSerialized(async () => {
142
- const textStr = String(text || "");
176
+ let textStr = String(text || "");
177
+ // For Qwen3-Embedding models, add instruction prefix for better results
178
+ // NOTE: as written, the prefix is applied to every embed() call (documents included), not only queries
179
+ if (this.modelId.includes('Qwen3-Embedding')) {
180
+ // Add instruction prefix if not already present
181
+ if (!textStr.startsWith('Instruct:')) {
182
+ textStr = `Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: ${textStr}`;
183
+ }
184
+ }
143
185
  // 1. Cache lookup
144
186
  const cached = this.cache.get(textStr);
145
187
  if (cached) {
@@ -171,14 +213,22 @@ class EmbeddingService {
171
213
  const results = await this.session.run(feeds);
172
214
  // 5. Pooling & Normalization
173
215
  // Output name usually 'last_hidden_state' or 'logits'
174
- // For BGE-M3, the first output is usually the hidden states [batch, seq_len, hidden_size]
175
216
  const outputName = this.session.outputNames[0];
176
217
  const outputTensor = results[outputName];
177
218
  // Ensure we have data
178
219
  if (!outputTensor || !attentionMaskData) {
179
220
  throw new Error("No output data or attention mask available");
180
221
  }
181
- const embedding = this.meanPooling(outputTensor.data, attentionMaskData, outputTensor.dims);
222
+ // Choose pooling strategy based on model
223
+ let embedding;
224
+ if (this.modelId.includes('Qwen3-Embedding')) {
225
+ // Qwen3-Embedding uses last token pooling
226
+ embedding = this.lastTokenPooling(outputTensor.data, attentionMaskData, outputTensor.dims);
227
+ }
228
+ else {
229
+ // BGE and other models use mean pooling
230
+ embedding = this.meanPooling(outputTensor.data, attentionMaskData, outputTensor.dims);
231
+ }
182
232
  // Normalize
183
233
  const normalized = this.normalize(embedding);
184
234
  this.cache.set(textStr, normalized);
@@ -200,6 +250,25 @@ class EmbeddingService {
200
250
  }
201
251
  return results;
202
252
  }
253
+ lastTokenPooling(data, attentionMask, dims) {
254
+ // dims: [batch_size, seq_len, hidden_size]
255
+ // Extract the last valid token's hidden state (assumes batch_size = 1, as in meanPooling)
256
+ const [batchSize, seqLen, hiddenSize] = dims;
257
+ // Find last valid token position
258
+ let lastValidIdx = seqLen - 1;
259
+ for (let i = seqLen - 1; i >= 0; i--) {
260
+ if (attentionMask[i] === 1n) {
261
+ lastValidIdx = i;
262
+ break;
263
+ }
264
+ }
265
+ // Extract embedding at last valid position
266
+ const embedding = new Float32Array(hiddenSize);
267
+ for (let j = 0; j < hiddenSize; j++) {
268
+ embedding[j] = data[lastValidIdx * hiddenSize + j];
269
+ }
270
+ return Array.from(embedding);
271
+ }
203
272
  meanPooling(data, attentionMask, dims) {
204
273
  // dims: [batch_size, seq_len, hidden_size]
205
274
  // We assume batch_size = 1 for single embedding call