@lojban/semantic-search-mcp 1.0.12 → 1.0.14
- package/README.md +10 -7
- package/package.json +2 -1
- package/src/download-sampu-vlaste.ts +68 -0
- package/src/embeddings.ts +2 -2
- package/src/index.ts +85 -16
- package/src/path-util.ts +6 -0
- package/src/scanner.ts +84 -11
- package/src/storage.ts +61 -0
package/README.md
CHANGED
@@ -12,7 +12,7 @@ Use it in **Cursor**, **Claude Code**, or any IDE that supports MCP to search th
 
 ## How it works
 
-- **Indexing**: On startup,
+- **Indexing**: On startup, the server indexes content in the background. If **`SEMANTIC_SEARCH_INDEX_DIRS`** is set (comma-separated paths), it scans those directories. If it is *not* set, the server downloads the [lojban/sampu_vlaste](https://github.com/lojban/sampu_vlaste) repository from GitHub and indexes that instead. In both cases, the server looks for `.txt`, `.md`, `.tsv`, `.csv` files. Each non-empty line gets a vector embedding (via [Hugging Face Transformers.js](https://huggingface.co/docs/transformers.js), model `Xenova/all-MiniLM-L6-v2`) and is stored in a local SQLite database with [@dao-xyz/sqlite3-vec](https://www.npmjs.com/package/@dao-xyz/sqlite3-vec) (SQLite + sqlite-vec for Node and browser). Indexing runs asynchronously so the server stays responsive and uses bounded memory.
 - **Search**: You send a natural-language query; the server embeds it and returns the closest lines by cosine similarity.
 - **Storage**: Index is stored in your project's `.semantic-search/data/` (or set `SEMANTIC_SEARCH_DATA_DIR`). No cloud, no API keys.
 
@@ -44,13 +44,13 @@ The package is published as [**@lojban/semantic-search-mcp**](https://www.npmjs.
 }
 ```
 
-No `cwd` needed: the server stores its index in your **project directory** (`.semantic-search/data/`), so open your project in Cursor and the index is per-workspace. To use a fixed data directory instead, add `"env": { "SEMANTIC_SEARCH_DATA_DIR": "/path/to/data" }`. To have the server index directories on startup, set `"env": { "SEMANTIC_SEARCH_INDEX_DIRS": "./dictionary,./glossary" }` (comma-separated paths).
+No `cwd` needed: the server stores its index in your **project directory** (`.semantic-search/data/`), so open your project in Cursor and the index is per-workspace. To use a fixed data directory instead, add `"env": { "SEMANTIC_SEARCH_DATA_DIR": "/path/to/data" }`. To have the server index specific directories on startup, set `"env": { "SEMANTIC_SEARCH_INDEX_DIRS": "./dictionary,./glossary" }` (comma-separated paths). If you omit `SEMANTIC_SEARCH_INDEX_DIRS`, the server will download and index the [lojban/sampu_vlaste](https://github.com/lojban/sampu_vlaste) repo automatically.
 
-2. **Restart Cursor** (or reload the window).
+2. **Restart Cursor** (or reload the window). Indexing starts automatically in the background: from your configured `SEMANTIC_SEARCH_INDEX_DIRS`, or from the downloaded sampu_vlaste repo if that env is not set.
 
 3. In chat or Composer, ask the AI to use the tools:
-- **Search**: "
-- **Stats**: "
+- **Search**: "Use semantic-search tool: find combinations of words that can express the concept of …", "Use semantic-search tool: search the index for …" or "Use semantic-search tool: Find entries similar to …"
+- **Stats**: "use semantic-search mcp. run get_index_stats" — stats include progress and start time (locale-formatted) when indexing is in progress.
 
 The AI will call `search` and `get_index_stats` for you.
 
@@ -58,7 +58,7 @@ The AI will call `search` and `get_index_stats` for you.
 
 Any environment that supports MCP over stdio can use this server. Run:
 
-- **One-liner**: `npx -y @lojban/semantic-search-mcp` — dependencies are installed on first run; index is stored in the current working directory's `.semantic-search/data/`. Set env `SEMANTIC_SEARCH_INDEX_DIRS` (comma-separated paths) to index those directories on startup
+- **One-liner**: `npx -y @lojban/semantic-search-mcp` — dependencies are installed on first run; index is stored in the current working directory's `.semantic-search/data/`. Set env `SEMANTIC_SEARCH_INDEX_DIRS` (comma-separated paths) to index those directories on startup; if unset, the server downloads and indexes [lojban/sampu_vlaste](https://github.com/lojban/sampu_vlaste) from GitHub. Tools: `search`, `get_index_stats`.
 
 **From source**: Clone the repo, run `npm install` once, then use `"command": "npx", "args": ["tsx", "src/index.ts"], "cwd": "/path/to/semantic-search-mcp"` or `"command": "node", "args": ["/path/to/semantic-search-mcp/run.mjs"]` (no `cwd` needed with the latter). See [MCP_SETUP.md](MCP_SETUP.md) for details.
 
@@ -71,7 +71,10 @@ Any environment that supports MCP over stdio can use this server. Run:
 
 ### Indexing on startup
 
-Set the environment variable **`SEMANTIC_SEARCH_INDEX_DIRS`** to a comma-separated list of directories to index. When the MCP server starts, it begins indexing those directories in the background (async).
+- **With your own dirs**: Set the environment variable **`SEMANTIC_SEARCH_INDEX_DIRS`** to a comma-separated list of directories to index. When the MCP server starts, it begins indexing those directories in the background (async).
+- **Default (no env set)**: If **`SEMANTIC_SEARCH_INDEX_DIRS`** is not set, the server downloads the [lojban/sampu_vlaste](https://github.com/lojban/sampu_vlaste) repository from GitHub (as a zip), extracts it under `.semantic-search/sampu_vlaste/`, and indexes that. The download is cached; subsequent starts reuse the cached copy.
+
+The index is cleared and rebuilt each time the server starts. Use absolute paths or paths relative to the server's working directory when setting `SEMANTIC_SEARCH_INDEX_DIRS`. The server reads and indexes all supported `.txt`, `.md`, `.tsv`, `.csv` files under each directory recursively. Indexing uses bounded memory and yields to the event loop so the OS stays responsive.
 
 ## Example: Lojban dictionary gaps
 
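The README's **Search** bullet says the server embeds the query and returns the closest lines by cosine similarity. The package delegates that step to sqlite-vec; the following is only a minimal in-memory sketch of the same ranking idea. The names `IndexedLine`, `cosineSimilarity`, and `topK` are hypothetical and do not appear in the package.

```typescript
// Hypothetical in-memory stand-in for the vector index (the package uses sqlite-vec).
interface IndexedLine {
  content: string;
  embedding: Float32Array;
}

// Cosine similarity of two equal-length vectors; returns 0 for a zero vector.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every line against the query and keep the k highest-scoring ones.
function topK(
  query: Float32Array,
  lines: IndexedLine[],
  k: number
): Array<{ content: string; score: number }> {
  return lines
    .map((l) => ({ content: l.content, score: cosineSimilarity(query, l.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A brute-force scan like this is linear in the number of indexed lines, which is one reason a real index would typically push the search into the database.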
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@lojban/semantic-search-mcp",
-  "version": "1.0.
+  "version": "1.0.14",
   "description": "Local-first MCP server for semantic search using transformers.js and SQLite",
   "type": "module",
   "scripts": {
@@ -10,6 +10,7 @@
     "@dao-xyz/sqlite3-vec": "^0.0.19",
     "@huggingface/transformers": "^3.0.0",
     "@modelcontextprotocol/sdk": "^1.0.0",
+    "extract-zip": "^2.0.1",
     "glob": "^10.3.0",
     "tsx": "^4.0.0"
   },
package/src/download-sampu-vlaste.ts
ADDED
@@ -0,0 +1,68 @@
+import { mkdirSync, readdirSync, unlinkSync } from 'fs';
+import path from 'path';
+import { createWriteStream } from 'fs';
+import { pipeline } from 'stream/promises';
+import { Readable } from 'stream';
+
+const REPO_OWNER = 'lojban';
+const REPO_NAME = 'sampu_vlaste';
+const BRANCHES_TO_TRY = ['main', 'master'];
+
+/**
+ * Download the lojban/sampu_vlaste repo from GitHub as a zip and extract it.
+ * If the repo is already present under cacheDir, returns the existing path.
+ * @param cacheDir - Directory to store the downloaded repo (e.g. .semantic-search/sampu_vlaste)
+ * @returns Path to the extracted repo root (e.g. cacheDir/sampu_vlaste-main)
+ */
+export async function getSampuVlasteDir(cacheDir: string): Promise<string> {
+  mkdirSync(cacheDir, { recursive: true });
+
+  // Check for existing extraction (any branch)
+  const entries = readdirSync(cacheDir, { withFileTypes: true });
+  const existingDir = entries.find(
+    (e) => e.isDirectory() && e.name.startsWith(`${REPO_NAME}-`)
+  );
+  if (existingDir) {
+    const existingPath = path.join(cacheDir, existingDir.name);
+    return existingPath;
+  }
+
+  let lastError: Error | null = null;
+  for (const branch of BRANCHES_TO_TRY) {
+    const url = `https://github.com/${REPO_OWNER}/${REPO_NAME}/archive/refs/heads/${branch}.zip`;
+    try {
+      const response = await fetch(url, { redirect: 'follow' });
+      if (!response.ok) {
+        lastError = new Error(`HTTP ${response.status}: ${url}`);
+        continue;
+      }
+      const body = response.body;
+      if (!body) {
+        lastError = new Error('Empty response body');
+        continue;
+      }
+      const zipPath = path.join(cacheDir, `repo-${branch}.zip`);
+      const nodeStream = Readable.fromWeb(body as import('stream/web').ReadableStream);
+      const out = createWriteStream(zipPath);
+      await pipeline(nodeStream, out);
+
+      const { default: extract } = await import('extract-zip');
+      await extract(zipPath, { dir: cacheDir });
+
+      unlinkSync(zipPath);
+
+      const afterEntries = readdirSync(cacheDir, { withFileTypes: true });
+      const extracted = afterEntries.find(
+        (e) => e.isDirectory() && e.name.startsWith(`${REPO_NAME}-`)
+      );
+      if (extracted) {
+        return path.join(cacheDir, extracted.name);
+      }
+      lastError = new Error('Extracted archive had no expected top-level directory');
+    } catch (err) {
+      lastError = err instanceof Error ? err : new Error(String(err));
+    }
+  }
+
+  throw lastError ?? new Error('Failed to download sampu_vlaste');
+}
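The new file above tries one archive URL per branch and treats any directory named `sampu_vlaste-…` in the cache as a finished extraction. The two pure pieces of that logic can be sketched in isolation as follows. The helper names `archiveUrls` and `findExtractedDir` are hypothetical, introduced here only for illustration.

```typescript
// Build the GitHub branch-archive URLs to try, in order (same URL shape as in the diff).
function archiveUrls(owner: string, repo: string, branches: string[]): string[] {
  return branches.map(
    (branch) => `https://github.com/${owner}/${repo}/archive/refs/heads/${branch}.zip`
  );
}

// Given the names of the directories inside the cache dir, return the first one
// that looks like an extracted archive ("<repo>-<branch>"), or undefined.
function findExtractedDir(dirNames: string[], repo: string): string | undefined {
  return dirNames.find((name) => name.startsWith(`${repo}-`));
}
```

Because the cache check matches on a name prefix only, a partially extracted directory would also be reused; that behaviour follows from the diff and is not changed by this sketch.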
package/src/embeddings.ts
CHANGED
@@ -36,8 +36,8 @@ export async function getBatchEmbeddings(texts: string[]): Promise<Float32Array[
   const ext = await getExtractor();
   const results: Float32Array[] = [];
 
-  // Process in batches for memory; each batch is one model forward pass
-  const batchSize =
+  // Process in batches for memory; each batch is one model forward pass (smaller = lower peak RAM)
+  const batchSize = 32;
   for (let i = 0; i < texts.length; i += batchSize) {
     const batch = texts.slice(i, i + batchSize);
     const output = await ext(batch, { pooling: 'mean', normalize: true });
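The hunk above slices the input into fixed-size batches so that each model forward pass sees at most `batchSize` texts. A minimal sketch of that slicing, assuming a generic helper named `chunk` (hypothetical, not part of the package):

```typescript
// Split an array into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

With a batch size of 32, peak memory for the model input scales with 32 texts rather than with the whole corpus, at the cost of more forward passes.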
package/src/index.ts
CHANGED
@@ -8,7 +8,9 @@ import {
 import path from 'path';
 import { getEmbedding, getBatchEmbeddings } from './embeddings.js';
 import { createVectorStorage, type SearchResult } from './storage.js';
-import {
+import { normalizePath } from './path-util.js';
+import { listFilesInDirectories, readFileWithMetadata } from './scanner.js';
+import { getSampuVlasteDir } from './download-sampu-vlaste.js';
 
 // Data dir: use env, or project cwd so each workspace has its own index when run via npx from Cursor
 const dataDir =
@@ -28,6 +30,17 @@ const indexingState = {
 // Batch size kept small to avoid high RAM usage during indexing
 const INDEX_BATCH_SIZE = 256;
 
+/** True if normalized path is under one of the normalized directory paths */
+function isPathUnderAnyDir(filePath: string, dirs: string[]): boolean {
+  const normalized = normalizePath(filePath);
+  for (const dir of dirs) {
+    const d = normalizePath(dir);
+    const prefix = d.endsWith(path.sep) ? d : d + path.sep;
+    if (normalized === d || normalized.startsWith(prefix)) return true;
+  }
+  return false;
+}
+
 async function runBackgroundIndexing(
   storage: Awaited<ReturnType<typeof createVectorStorage>>,
   directories: string[]
@@ -37,12 +50,21 @@ async function runBackgroundIndexing(
   indexingState.linesIndexed = 0;
   indexingState.filesIndexed = 0;
   indexingState.error = null;
-
+
+  const normalizedDirs = directories.map((d) => normalizePath(d));
 
   try {
-
-
-
+    const [fileMetadata, indexedPaths] = await Promise.all([
+      storage.getFileMetadata(),
+      storage.getIndexedFilePaths(),
+    ]);
+
+    // Remove from DB any file whose directory is no longer in SEMANTIC_SEARCH_INDEX_DIRS
+    for (const filePath of indexedPaths) {
+      if (!isPathUnderAnyDir(filePath, normalizedDirs)) {
+        await storage.removeFile(filePath);
+      }
+    }
 
     const processBatch = async (
       batch: Array<{ filePath: string; lineNumber: number; content: string }>
@@ -57,28 +79,64 @@ async function runBackgroundIndexing(
         embedding: embeddings[idx],
       }));
       await storage.upsertLinesBatch(batchData);
-      for (const l of batch) seenFiles.add(l.filePath);
       indexingState.linesIndexed += batch.length;
-
+      const stats = await storage.getStats();
+      indexingState.filesIndexed = stats.totalFiles;
     };
 
     const yieldToEventLoop = (): Promise<void> =>
       new Promise((resolve) => setImmediate(resolve));
 
-
-
-
-
-
-
-
-
+    const currentFilesOnDisk = new Set<string>();
+    const toIndex: Array<{ filePath: string; mtimeMs: number; contentHash: string; lines: Array<{ filePath: string; lineNumber: number; content: string }> }> = [];
+
+    for await (const { filePath, mtimeMs } of listFilesInDirectories(directories)) {
+      currentFilesOnDisk.add(filePath);
+      const meta = fileMetadata.get(filePath);
+      const content = readFileWithMetadata(filePath);
+      if (!content) continue;
+      if (meta && meta.mtimeMs === content.mtimeMs && meta.contentHash === content.contentHash) {
+        continue; // unchanged, skip
+      }
+      toIndex.push({
+        filePath: content.filePath,
+        mtimeMs: content.mtimeMs,
+        contentHash: content.contentHash,
+        lines: content.lines,
+      });
+    }
+
+    // Remove from DB any file that no longer exists on disk (under current dirs)
+    for (const filePath of indexedPaths) {
+      if (isPathUnderAnyDir(filePath, normalizedDirs) && !currentFilesOnDisk.has(normalizePath(filePath))) {
+        await storage.removeFile(filePath);
+      }
+    }
+
+    let currentBatch: Array<{ filePath: string; lineNumber: number; content: string }> = [];
+    let processingPromise: Promise<void> | null = null;
+
+    for (const entry of toIndex) {
+      await storage.removeFile(entry.filePath);
+      for (const line of entry.lines) {
+        currentBatch.push(line);
+        if (currentBatch.length >= INDEX_BATCH_SIZE) {
+          if (processingPromise) await processingPromise;
+          const batchToProcess = currentBatch;
+          currentBatch = [];
+          processingPromise = processBatch(batchToProcess);
+          await yieldToEventLoop();
+        }
       }
     }
 
     if (processingPromise) await processingPromise;
     if (currentBatch.length > 0) await processBatch(currentBatch);
 
+    for (const entry of toIndex) {
+      await storage.setFileMetadata(entry.filePath, entry.mtimeMs, entry.contentHash);
+    }
+
     const stats = await storage.getStats();
     indexingState.linesIndexed = stats.totalLines;
     indexingState.filesIndexed = stats.totalFiles;
@@ -228,7 +286,18 @@ async function main() {
   console.error('Semantic Search MCP Server running on stdio');
 
   const envDirs = process.env.SEMANTIC_SEARCH_INDEX_DIRS;
-
+  let directories = envDirs ? envDirs.split(',').map((d) => d.trim()).filter(Boolean) : [];
+  if (directories.length === 0) {
+    try {
+      const sampuVlasteCacheDir = path.join(dataDir, '..', 'sampu_vlaste');
+      console.error('SEMANTIC_SEARCH_INDEX_DIRS not set; downloading lojban/sampu_vlaste from GitHub...');
+      const sampuDir = await getSampuVlasteDir(sampuVlasteCacheDir);
+      directories = [sampuDir];
+      console.error(`Using downloaded repo at ${sampuDir}`);
+    } catch (err) {
+      console.error('Failed to download sampu_vlaste:', err);
+    }
+  }
   if (directories.length > 0) {
     console.error(`Starting background indexing for ${directories.length} directories...`);
     runBackgroundIndexing(storage, directories).catch((err) => {
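The `runBackgroundIndexing` changes above amount to an incremental sync: drop index entries whose files are gone, and re-index only files whose mtime or content hash differs from the stored metadata. A minimal sketch of that decision step as a pure function follows. `FileMeta`, `SyncPlan`, and `planSync` are hypothetical names, and the sketch assumes both maps are keyed by already-normalized paths.

```typescript
interface FileMeta {
  mtimeMs: number;
  contentHash: string;
}

interface SyncPlan {
  toRemove: string[]; // indexed paths that no longer exist on disk
  toIndex: string[];  // new or changed files that need (re-)embedding
}

// Compare stored metadata against the current on-disk state.
function planSync(indexed: Map<string, FileMeta>, onDisk: Map<string, FileMeta>): SyncPlan {
  const toRemove = Array.from(indexed.keys()).filter((p) => !onDisk.has(p));
  const toIndex: string[] = [];
  onDisk.forEach((current, p) => {
    const previous = indexed.get(p);
    // A file is skipped only when BOTH mtime and hash match, as in the diff.
    const unchanged =
      previous !== undefined &&
      previous.mtimeMs === current.mtimeMs &&
      previous.contentHash === current.contentHash;
    if (!unchanged) toIndex.push(p);
  });
  return { toRemove, toIndex };
}
```

Separating the plan from the database calls makes the skip and removal rules easy to unit-test without a SQLite file.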
package/src/path-util.ts
ADDED
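The body of `path-util.ts` is not shown in this diff, so the behaviour of `normalizePath` is unknown here. As a rough illustration of how a helper like it could pair with the prefix check in `isPathUnderAnyDir`, the sketch below assumes normalization simply means resolving to an absolute path. `normalizePathSketch` and `isUnder` are hypothetical names, not the package's implementation.

```typescript
import * as path from "node:path";

// Assumption: "normalize" = resolve to an absolute, platform-native path.
// The real path-util.ts may do more (or something different).
function normalizePathSketch(p: string): string {
  return path.resolve(p);
}

// True if filePath equals dir or lies beneath it. Appending the separator
// before the prefix test keeps "data/ab" from matching the directory "data/a".
function isUnder(filePath: string, dir: string): boolean {
  const file = normalizePathSketch(filePath);
  const base = normalizePathSketch(dir);
  const prefix = base.endsWith(path.sep) ? base : base + path.sep;
  return file === base || file.startsWith(prefix);
}
```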
package/src/scanner.ts
CHANGED
@@ -1,7 +1,9 @@
-import { createReadStream, statSync } from 'fs';
-import {
+import { createReadStream, readFileSync, statSync } from 'fs';
+import { globIterate } from 'glob';
 import path from 'path';
 import readline from 'readline';
+import { createHash } from 'crypto';
+import { normalizePath } from './path-util.js';
 
 export interface FileLine {
   filePath: string;
@@ -18,6 +20,9 @@ const MIN_LINE_LENGTH = 5;
 // Maximum file size to process (skip very large files)
 const MAX_FILE_SIZE = 10 * 1024 * 1024; // 10MB
 
+// Cap line length to avoid unbounded readline buffer (e.g. file with no newlines)
+const MAX_LINE_LENGTH = 256 * 1024; // 256KB
+
 /**
  * Check if a file is a text file we should index
  */
@@ -27,17 +32,17 @@ function isTextFile(filePath: string): boolean {
 }
 
 /**
- * Scan a directory for text files and yield lines
+ * Scan a directory for text files and yield lines.
+ * Uses globIterate so file paths are streamed one-by-one (no full list in RAM).
+ * Readline has maxLineLength to avoid huge single-line buffers.
  */
 export async function* scanDirectory(dirPath: string): AsyncGenerator<FileLine> {
-  // Find all files in directory recursively
   const pattern = path.join(dirPath, '**/*');
-
-  const files = await glob(pattern, { nodir: true, absolute: true });
-
-  for (const filePath of files) {
+  for await (const filePath of globIterate(pattern, { nodir: true, absolute: true }) as AsyncIterable<string>) {
     if (!isTextFile(filePath)) continue;
 
+    let fileStream: ReturnType<typeof createReadStream> | null = null;
+    let rl: readline.Interface | null = null;
     try {
       const stats = statSync(filePath);
       if (stats.size > MAX_FILE_SIZE) {
@@ -45,11 +50,12 @@ export async function* scanDirectory(dirPath: string): AsyncGenerator<FileLine>
         continue;
       }
 
-
-
+      fileStream = createReadStream(filePath);
+      rl = readline.createInterface({
         input: fileStream,
         crlfDelay: Infinity,
-
+        maxLineLength: MAX_LINE_LENGTH,
+      } as readline.ReadLineOptions);
 
       let lineNumber = 0;
       for await (const line of rl) {
@@ -65,6 +71,9 @@ export async function* scanDirectory(dirPath: string): AsyncGenerator<FileLine>
       }
     } catch (err) {
       console.error(`Error reading file ${filePath}:`, err);
+    } finally {
+      rl?.close();
+      fileStream?.destroy();
     }
   }
 }
@@ -77,3 +86,67 @@ export async function* scanDirectories(dirPaths: string[]): AsyncGenerator<FileL
     yield* scanDirectory(dirPath);
   }
 }
+
+export interface FileWithMtime {
+  filePath: string;
+  mtimeMs: number;
+}
+
+/**
+ * List all text files in the given directories with their mtime.
+ * Paths are normalized for cross-platform consistency.
+ */
+export async function* listFilesInDirectories(dirPaths: string[]): AsyncGenerator<FileWithMtime> {
+  for (const dirPath of dirPaths) {
+    const pattern = path.join(dirPath, '**/*');
+    for await (const filePath of globIterate(pattern, { nodir: true, absolute: true }) as AsyncIterable<string>) {
+      if (!isTextFile(filePath)) continue;
+      try {
+        const stats = statSync(filePath);
+        if (stats.size > MAX_FILE_SIZE) {
+          console.error(`Skipping large file: ${filePath}`);
+          continue;
+        }
+        yield { filePath: normalizePath(filePath), mtimeMs: stats.mtimeMs };
+      } catch (err) {
+        console.error(`Error stating file ${filePath}:`, err);
+      }
+    }
+  }
+}
+
+export interface FileContentWithMetadata {
+  filePath: string;
+  mtimeMs: number;
+  contentHash: string;
+  lines: FileLine[];
+}
+
+/**
+ * Read a file once: compute content hash (sha256 of raw content) and extract indexable lines.
+ * Uses mtime from stat. Cross-platform (mtimeMs is available on Linux, Windows, Mac).
+ */
+export function readFileWithMetadata(filePath: string): FileContentWithMetadata | null {
+  try {
+    const stats = statSync(filePath);
+    if (stats.size > MAX_FILE_SIZE) return null;
+    const raw = readFileSync(filePath);
+    const contentHash = createHash('sha256').update(raw).digest('hex');
+    const mtimeMs = stats.mtimeMs;
+    const text = raw.toString('utf8');
+    const lines: FileLine[] = [];
+    const normalizedPath = normalizePath(filePath);
+    let lineNumber = 0;
+    for (const rawLine of text.split(/\r?\n/)) {
+      lineNumber++;
+      const trimmed = rawLine.trim();
+      if (trimmed.length >= MIN_LINE_LENGTH) {
+        lines.push({ filePath: normalizedPath, lineNumber, content: trimmed });
+      }
+    }
+    return { filePath: normalizedPath, mtimeMs, contentHash, lines };
+  } catch (err) {
+    console.error(`Error reading file ${filePath}:`, err);
+    return null;
+  }
+}
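`readFileWithMetadata` above does two things with a file's bytes: it hashes them with SHA-256 for change detection, and it splits the text into trimmed lines, keeping only those of at least `MIN_LINE_LENGTH` characters while preserving original line numbers. The sketch below isolates those two steps on in-memory strings. `hashContent` and `extractLines` are hypothetical names, and the minimum length is passed as a parameter here rather than read from a constant.

```typescript
import { createHash } from "node:crypto";

// SHA-256 of the content, hex-encoded (64 characters).
function hashContent(content: string | Buffer): string {
  return createHash("sha256").update(content).digest("hex");
}

// Split on LF or CRLF, trim each line, and keep lines with at least minLength
// characters. Line numbers are 1-based and count skipped lines too.
function extractLines(
  text: string,
  minLength: number
): Array<{ lineNumber: number; content: string }> {
  const result: Array<{ lineNumber: number; content: string }> = [];
  let lineNumber = 0;
  for (const rawLine of text.split(/\r?\n/)) {
    lineNumber++;
    const trimmed = rawLine.trim();
    if (trimmed.length >= minLength) {
      result.push({ lineNumber, content: trimmed });
    }
  }
  return result;
}
```

Hashing the raw content means a file that is merely touched (new mtime, same bytes) produces the same hash; in the diff such a file is still re-indexed, because the skip requires mtime and hash to both match.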
package/src/storage.ts
CHANGED
@@ -2,9 +2,15 @@ import pkg from '@dao-xyz/sqlite3-vec';
 const { createDatabase } = pkg;
 import path from 'path';
 import { mkdirSync } from 'fs';
+import { normalizePath } from './path-util.js';
 
 const EMBEDDING_DIM = 384; // all-MiniLM-L6-v2 produces 384-dim vectors
 
+export interface FileMetadata {
+  mtimeMs: number;
+  contentHash: string;
+}
+
 export interface LineRecord {
   id: number;
   file_path: string;
@@ -31,6 +37,10 @@ export class VectorStorage {
   }
 
   private init(): void {
+    // Limit SQLite page cache to avoid unbounded RAM (negative = kibibytes)
+    try {
+      this.db.exec('PRAGMA cache_size = -65536'); // 64MB max
+    } catch {}
     this.db.exec(`
       CREATE TABLE IF NOT EXISTS lines (
         id INTEGER PRIMARY KEY AUTOINCREMENT,
@@ -47,6 +57,13 @@ export class VectorStorage {
         embedding FLOAT[${EMBEDDING_DIM}]
       );
     `);
+    this.db.exec(`
+      CREATE TABLE IF NOT EXISTS file_metadata (
+        file_path TEXT PRIMARY KEY,
+        mtime_ms INTEGER NOT NULL,
+        content_hash TEXT NOT NULL
+      );
+    `);
   }
 
   /**
@@ -143,11 +160,55 @@ export class VectorStorage {
   async removeFile(filePath: string): Promise<void> {
     (await this.db.prepare('DELETE FROM vec_lines WHERE line_id IN (SELECT id FROM lines WHERE file_path = ?)')).run([filePath]);
     (await this.db.prepare('DELETE FROM lines WHERE file_path = ?')).run([filePath]);
+    (await this.db.prepare('DELETE FROM file_metadata WHERE file_path = ?')).run([filePath]);
+  }
+
+  /** Remove all indexed files under a directory (path prefix match, normalized). */
+  async removeFilesUnderDirectory(dirPath: string): Promise<void> {
+    const normalizedDir = normalizePath(dirPath);
+    const prefix = normalizedDir.endsWith(path.sep) ? normalizedDir : normalizedDir + path.sep;
+    const rows = (await this.db.prepare('SELECT DISTINCT file_path FROM lines')).all() as Array<{ file_path: string }>;
+    for (const row of rows) {
+      const p = normalizePath(row.file_path);
+      if (p === normalizedDir || p.startsWith(prefix)) {
+        await this.removeFile(row.file_path);
+      }
+    }
+  }
+
+  /** Get metadata for all indexed files (for incremental sync). */
+  async getFileMetadata(): Promise<Map<string, FileMetadata>> {
+    const rows = (await this.db.prepare('SELECT file_path, mtime_ms, content_hash FROM file_metadata')).all() as Array<{
+      file_path: string;
+      mtime_ms: number;
+      content_hash: string;
+    }>;
+    const map = new Map<string, FileMetadata>();
+    for (const row of rows) {
+      map.set(row.file_path, { mtimeMs: row.mtime_ms, contentHash: row.content_hash });
+    }
+    return map;
+  }
+
+  /** Set metadata after indexing a file. */
+  async setFileMetadata(filePath: string, mtimeMs: number, contentHash: string): Promise<void> {
+    const stmt = await this.db.prepare(
+      `INSERT INTO file_metadata (file_path, mtime_ms, content_hash) VALUES (?, ?, ?)
+       ON CONFLICT(file_path) DO UPDATE SET mtime_ms = excluded.mtime_ms, content_hash = excluded.content_hash`
+    );
+    stmt.run([filePath, mtimeMs, contentHash]);
+  }
+
+  /** Get all file paths currently in the index (for detecting deleted files). */
+  async getIndexedFilePaths(): Promise<string[]> {
+    const rows = (await this.db.prepare('SELECT DISTINCT file_path FROM lines')).all() as Array<{ file_path: string }>;
+    return rows.map((r) => r.file_path);
   }
 
   clear(): void {
     this.db.exec('DELETE FROM vec_lines');
     this.db.exec('DELETE FROM lines');
+    this.db.exec('DELETE FROM file_metadata');
   }
 
   close(): void {
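The `PRAGMA cache_size = -65536` line in `init()` relies on SQLite's convention that a negative `cache_size` is a budget in KiB rather than a page count, so `-65536` means roughly 64 MiB. A small sketch of deriving that statement from a size in MiB, using a hypothetical helper `cacheSizePragma` that is not part of the package:

```typescript
// Build a cache_size pragma for a budget given in MiB.
// SQLite reads a negative value N as about |N| * 1024 bytes of page cache.
function cacheSizePragma(mebibytes: number): string {
  if (!Number.isInteger(mebibytes) || mebibytes <= 0) {
    throw new RangeError("mebibytes must be a positive integer");
  }
  const kibibytes = mebibytes * 1024;
  return `PRAGMA cache_size = -${kibibytes}`;
}
```

Expressing the limit in KiB keeps the memory budget independent of the database's page size, which a positive page count would not.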