ownsearch 0.1.3 → 0.1.4
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +213 -47
- package/dist/{chunk-LGXCBOO4.js → chunk-ZQAY3FE3.js} +169 -11
- package/dist/cli.js +280 -30
- package/dist/mcp/server.js +3 -1
- package/package.json +6 -3
- package/skills/ownsearch-rag-search/SKILL.md +139 -0
- package/skills/ownsearch-rag-search/agents/openai.yaml +4 -0
package/README.md
CHANGED

@@ -1,36 +1,114 @@
 # ownsearch
 
-**ownsearch** is a local
+**ownsearch** is a local retrieval layer for agents.
 
-It indexes approved folders into a local Qdrant vector store, exposes retrieval through an MCP server
+It indexes approved folders into a local Qdrant vector store, embeds content with Gemini, and exposes grounded retrieval through an MCP server so agents can search private documents without shipping those documents to a hosted RAG backend.
 
-
+This package is designed for **text-first, local, agentic RAG**:
+
+- local folders instead of SaaS document ingestion
+- MCP-native access for agents
+- grounded chunk retrieval instead of opaque long-context guessing
+- predictable local storage with Docker-backed Qdrant
+
+## Why it exists
+
+Most agents waste time and tokens when they do one of two things:
+
+- search too broadly with weak semantic queries
+- skip retrieval and guess from partial context
+
+`ownsearch` is meant to reduce both failure modes by:
+
+- indexing local knowledge once
+- making retrieval cheap and reusable
+- giving agents a structured way to fetch only the chunks they need
+- improving answer quality with reranking, deduplication, and grounded chunk access
+
+## Core use cases
+
+`ownsearch` is a good fit when an agent needs to work over:
+
+- product documentation
+- technical design docs
+- code-adjacent text files
+- contracts and policy documents
+- research notes
+- knowledge bases stored in folders
+- PDF, DOCX, RTF, markdown, and plain-text heavy repositories
+
+Typical agent workflows:
+
+- answer questions over local docs
+- locate the exact source file or section for a fact
+- summarize a set of related files
+- compare policy, spec, or contract language across documents
+- support coding agents with repo-local documentation search
+- reduce token cost by retrieving only relevant chunks instead of loading entire files
 
 ## What it does
 
-
-
-
-
-
+- indexes local folders into a persistent vector store
+- chunks and embeds supported files with Gemini
+- supports incremental reindexing for changed and deleted files
+- exposes search and context retrieval through MCP
+- reranks and deduplicates result sets before returning them
+- lets agents retrieve ranked hits, exact chunks, or bundled grounded context
 
-##
+## Current power
 
-
-- extracted text from PDFs
-- Gemini `gemini-embedding-001`
-- Docker-backed Qdrant
-- stdio MCP server for local agent attachment
+What is already strong in the current package:
 
-
+- local-first setup with Docker-backed Qdrant
+- deterministic readiness checks through `ownsearch doctor`
+- multi-platform MCP config generation
+- bundled retrieval skill for better query planning
+- support for common text document formats
+- large plain text and code files are no longer blocked by the extracted-document size cap
+- repeatable smoke validation for mixed text corpora
+
+## V1 supported document types
+
+The current package is intended for text-first corpora, including:
+
+- plain text and code files
+- markdown and MDX
+- JSON, YAML, TOML, CSV, XML, HTML
+- PDF via text extraction
+- DOCX via text extraction
+- RTF via text extraction
+
+## Deployment readiness
 
-
+This package is ready to deploy for **text-first local document folders** when:
+
+- Node.js `20+` is available
+- Docker is available and Qdrant can run locally
+- `GEMINI_API_KEY` is configured
+- the document corpus is primarily text-based
+
+Installation:
+
+```bash
+npm install -g ownsearch
+```
+
+Deployment checklist:
 
 ```bash
 npm install -g ownsearch
+ownsearch setup
+ownsearch doctor
+ownsearch index C:\path\to\folder --name my-folder
+ownsearch serve-mcp
 ```
 
-
+If `ownsearch doctor` returns:
+
+- `verdict.status: "ready"` then the package is operational
+- `verdict.status: "action_required"` then follow the listed `nextSteps`
+
+## Quickstart
 
 ```bash
 ownsearch setup
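The `verdict.status` contract described above can be illustrated with a small sketch. It mirrors the decision shape of the `getDoctorVerdict` helper visible later in this diff (`dist/cli.js`), simplified for illustration; the exact checks and messages in the shipped binary differ.

```javascript
// Sketch of the readiness decision behind `ownsearch doctor`:
// every missing prerequisite contributes a next step, and the
// status is "ready" only when no step is pending.
function doctorVerdict({ geminiApiKeyPresent, qdrantReachable, rootCount }) {
  const nextSteps = [];
  if (!geminiApiKeyPresent) {
    nextSteps.push("Run `ownsearch setup` and save a Gemini API key.");
  }
  if (!qdrantReachable) {
    nextSteps.push("Run `ownsearch setup` to start the local Qdrant container.");
  }
  if (nextSteps.length === 0) {
    return {
      status: "ready",
      summary: rootCount > 0
        ? "OwnSearch is ready for indexing, search, and MCP agent use."
        : "OwnSearch is ready. Index a first folder next.",
      nextSteps: ["Run `ownsearch index <folder> --name <name>` or `ownsearch serve-mcp`."],
    };
  }
  return { status: "action_required", summary: "OwnSearch is not fully ready yet.", nextSteps };
}
```

An agent can branch on `status` and replay `nextSteps` verbatim, which is what makes the verdict deterministic rather than free-text.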
@@ -42,41 +120,89 @@ ownsearch search-context "what is this repo about?" --limit 8 --max-chars 12000
 ownsearch serve-mcp
 ```
 
-On first run, `ownsearch setup` can
+On first run, `ownsearch setup` can:
+
+- prompt for `GEMINI_API_KEY`
+- link users to Google AI Studio
+- save the key to `~/.ownsearch/.env`
+- print exact next commands for CLI and MCP usage
+- optionally print an MCP config snippet for a selected agent
+
+## Real-world fit
+
+`ownsearch` is a strong fit for:
 
-
+- engineering teams with private docs that should stay local
+- coding agents that need repo-adjacent design docs and runbooks
+- consultants or operators working across contract, policy, or knowledge folders
+- researchers who want grounded retrieval over local notes and exported reports
+- teams trying to reduce agent token burn by retrieving small grounded context bundles instead of pasting entire files
+
+It is less suitable when:
+
+- the corpus is mostly scanned documents that need OCR
+- the workflow depends on spreadsheets, slides, or legacy Office formats
+- the main requirement is hosted multi-user search rather than local agent retrieval
+
+## Agent integration
+
+To print MCP config snippets:
 
 ```bash
 ownsearch print-agent-config codex
-ownsearch print-agent-config claude-desktop
 ownsearch print-agent-config cursor
+ownsearch print-agent-config vscode
+ownsearch print-agent-config github-copilot
+ownsearch print-agent-config copilot-cli
+ownsearch print-agent-config windsurf
+ownsearch print-agent-config continue
+ownsearch print-agent-config claude-desktop
 ```
 
-
+Supported config targets currently include:
+
+- `codex`
+- `cursor`
+- `vscode`
+- `github-copilot`
+- `copilot-cli`
+- `windsurf`
+- `continue`
+- `claude-desktop`
+
+Notes:
 
-
+- `claude-desktop` currently returns guidance rather than a raw JSON snippet because current Claude Desktop docs prefer desktop extensions (`.mcpb`) over manual JSON server configs
+- all other supported targets return concrete MCP config payloads
+
+## Bundled skill
+
+The package ships with a bundled retrieval skill:
 
 ```bash
-
-npm run build
-node dist/cli.js setup
-node dist/cli.js index ./docs --name docs
-node dist/cli.js search "what is this repo about?" --limit 5
-node dist/cli.js serve-mcp
+ownsearch print-skill ownsearch-rag-search
 ```
 
+The skill is intended to help an agent:
+
+- rewrite weak user requests into stronger retrieval queries
+- decide when to use `search_context` vs `search` vs `get_chunks`
+- recover from poor first-pass retrieval
+- avoid duplicate-heavy answer synthesis
+- stay grounded when retrieval is probabilistic
+
 ## CLI commands
 
 - `ownsearch setup`
-  Starts or reconnects to the local Qdrant Docker container, creates local config,
+  Starts or reconnects to the local Qdrant Docker container, creates local config, persists `GEMINI_API_KEY`, and prints next-step commands.
 - `ownsearch doctor`
-  Checks config, Gemini key presence, Qdrant connectivity, and
+  Checks config, Gemini key presence, Qdrant connectivity, collection settings, and emits a deterministic readiness verdict.
 - `ownsearch index <folder> --name <name>`
   Indexes a folder incrementally into the local vector collection.
 - `ownsearch list-roots`
   Lists approved indexed roots.
 - `ownsearch search "<query>"`
-  Returns
+  Returns reranked search hits from the vector store.
 - `ownsearch search-context "<query>"`
   Returns a compact grounded context bundle for agents.
 - `ownsearch delete-root <rootId>`
@@ -86,7 +212,9 @@ node dist/cli.js serve-mcp
 - `ownsearch serve-mcp`
   Starts the stdio MCP server.
 - `ownsearch print-agent-config <agent>`
-  Prints
+  Prints MCP config snippets or platform guidance.
+- `ownsearch print-skill [skill]`
+  Prints a bundled OwnSearch skill.
 
 ## MCP tools
 
@@ -100,29 +228,59 @@ The MCP server currently exposes:
 - `delete_root`
 - `store_status`
 
-Recommended
+Recommended retrieval flow:
 
-1.
-2.
-3. Use `get_chunks`
+1. Use `search_context` for fast grounded retrieval.
+2. Use `search` when ranking and source inspection matter.
+3. Use `get_chunks` when exact wording or detailed comparison matters.
 
-##
+## Validation
 
-
-- Shared CLI and MCP secrets can be stored in `~/.ownsearch/.env`
-- Qdrant runs locally in Docker as `ownsearch-qdrant`
-- `GEMINI_API_KEY` may come from the shell environment, the current working directory `.env`, or `~/.ownsearch/.env`
-- Node.js `20+` is required
+The package includes a repeatable smoke suite:
 
-
+```bash
+npm run smoke:text-docs
+```
+
+That smoke run currently validates:
+
+- `.txt` retrieval
+- `.rtf` retrieval
+- `.docx` retrieval
+- `.pdf` retrieval
+- large plain text file bypass of the extracted-document byte cap
+
+## Limitations
+
+This package is deploy-ready for text-first corpora, but it is not universal document intelligence.
+
+Current hard limitations:
+
+- no OCR for image-only PDFs
+- no `.doc` support
+- no spreadsheet or presentation extraction such as `.xlsx` or `.pptx`
+- no multimodal embeddings yet
+- reranking is heuristic and local, not yet model-based
+- very large corpora can still become expensive because embedding cost scales with chunk count
+
+Operational limitations:
 
-
+- retrieval quality still depends on query quality
+- extracted document quality depends on source document quality
+- duplicate-heavy corpora are improved by current reranking, but not fully solved for all edge cases
+- scanned or low-quality PDFs may require OCR before indexing
 
+## Future scope
+
+Planned next-stage improvements:
+
+- pluggable learned rerankers
+- stronger deduplication across overlapping corpora
 - richer document extraction
-
-- watch mode for automatic reindexing
+- watch mode for automatic local reindexing
 - HTTP MCP transport
-
+- optional hosted deployment mode
+- multimodal indexing and retrieval for:
 - images
 - audio
 - video
@@ -130,6 +288,14 @@ Planned after the text-first v1:
 
 The multimodal phase will require careful collection migration because Gemini text and multimodal embedding spaces are not interchangeable across model families.
 
+## Notes
+
+- config is stored in `~/.ownsearch/config.json`
+- shared CLI and MCP secrets can be stored in `~/.ownsearch/.env`
+- Qdrant runs locally in Docker as `ownsearch-qdrant`
+- `GEMINI_API_KEY` may come from the shell environment, the current working directory `.env`, or `~/.ownsearch/.env`
+- `maxFileBytes` primarily applies to extracted document formats such as PDF, DOCX, and RTF, not to large plain text and code files
+
 ## License
 
 MIT
package/dist/{chunk-LGXCBOO4.js → chunk-ZQAY3FE3.js}
CHANGED

@@ -3,6 +3,18 @@ function buildContextBundle(query, hits, maxChars = 12e3) {
   const results = [];
   let totalChars = 0;
   for (const hit of hits) {
+    const last = results.at(-1);
+    if (last && last.rootId === hit.rootId && last.relativePath === hit.relativePath && hit.chunkIndex === last.chunkIndex + 1) {
+      const mergedContent = `${last.content}
+${hit.content}`.trim();
+      const mergedDelta = mergedContent.length - last.content.length;
+      if (totalChars + mergedDelta <= maxChars) {
+        last.content = mergedContent;
+        last.chunkIndex = hit.chunkIndex;
+        totalChars += mergedDelta;
+        continue;
+      }
+    }
     if (results.length > 0 && totalChars + hit.content.length > maxChars) {
       break;
     }
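The hunk above makes `buildContextBundle` fold consecutive chunks of the same file into one context entry while respecting the character budget. A standalone sketch of that merge policy (field names follow the hit objects in this diff; `rootId` is dropped for brevity, so this is an illustration rather than the shipped function):

```javascript
// Sketch of the adjacent-chunk merge added to buildContextBundle:
// consecutive chunks from the same file are joined into one entry
// as long as the bundle stays under the character budget.
function mergeAdjacent(hits, maxChars) {
  const results = [];
  let totalChars = 0;
  for (const hit of hits) {
    const last = results.at(-1);
    const adjacent = last &&
      last.relativePath === hit.relativePath &&
      hit.chunkIndex === last.chunkIndex + 1;
    if (adjacent) {
      const merged = `${last.content}\n${hit.content}`.trim();
      const delta = merged.length - last.content.length;
      if (totalChars + delta <= maxChars) {
        last.content = merged;        // grow the previous entry in place
        last.chunkIndex = hit.chunkIndex;
        totalChars += delta;
        continue;
      }
    }
    if (results.length > 0 && totalChars + hit.content.length > maxChars) break;
    results.push({ ...hit });
    totalChars += hit.content.length;
  }
  return results;
}
```

The point of the merge is that an agent receives one contiguous passage per file region instead of several fragments it would have to stitch back together.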
@@ -45,26 +57,32 @@ var DEFAULT_CHUNK_OVERLAP = 200;
 var DEFAULT_MAX_FILE_BYTES = 50 * 1024 * 1024;
 var SUPPORTED_TEXT_EXTENSIONS = /* @__PURE__ */ new Set([
   ".c",
+  ".conf",
   ".cpp",
   ".cs",
   ".css",
   ".csv",
+  ".docx",
   ".env",
   ".go",
   ".h",
   ".hpp",
   ".html",
+  ".ini",
   ".java",
   ".js",
   ".json",
   ".jsx",
+  ".log",
   ".md",
   ".mdx",
   ".mjs",
   ".pdf",
   ".ps1",
+  ".properties",
   ".py",
   ".rb",
+  ".rtf",
   ".rs",
   ".sh",
   ".sql",
@@ -76,6 +94,11 @@ var SUPPORTED_TEXT_EXTENSIONS = /* @__PURE__ */ new Set([
   ".yaml",
   ".yml"
 ]);
+var EXTRACTED_DOCUMENT_EXTENSIONS = /* @__PURE__ */ new Set([
+  ".pdf",
+  ".docx",
+  ".rtf"
+]);
 var IGNORED_DIRECTORIES = /* @__PURE__ */ new Set([
   ".git",
   ".hg",
@@ -146,11 +169,14 @@ function getConfigPath() {
 function getEnvPath() {
   return path2.join(getConfigDir(), ".env");
 }
+function getCwdEnvPath() {
+  return path2.resolve(process.cwd(), ".env");
+}
 async function ensureConfigDir() {
   await fs.mkdir(getConfigDir(), { recursive: true });
 }
 function loadOwnSearchEnv() {
-  for (const envPath of [
+  for (const envPath of [getCwdEnvPath(), getEnvPath()]) {
     if (!fsSync.existsSync(envPath)) {
       continue;
     }
@@ -162,6 +188,12 @@ function loadOwnSearchEnv() {
   }
 }
 }
+function readEnvFile(envPath) {
+  if (!fsSync.existsSync(envPath)) {
+    return {};
+  }
+  return dotenv.parse(fsSync.readFileSync(envPath, "utf8"));
+}
 async function loadConfig() {
   await ensureConfigDir();
   const configPath = getConfigPath();
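With this change, `loadOwnSearchEnv` consults the working-directory `.env` before `~/.ownsearch/.env`. A sketch of the resulting precedence, assuming dotenv's usual behavior of never overwriting a variable that is already set (the shipped loader works on files and `process.env`; this illustration works on already-parsed objects):

```javascript
// Sketch of .env precedence after this release: files are applied in
// priority order (cwd .env first, then ~/.ownsearch/.env) and a key
// that is already present is never overwritten.
function resolveEnv(files, existing = {}) {
  // `files` is an ordered list of parsed .env objects, highest priority first.
  const merged = { ...existing };
  for (const file of files) {
    for (const [key, value] of Object.entries(file)) {
      if (!(key in merged)) merged[key] = value;
    }
  }
  return merged;
}
```

So a project-local `.env` can pin `GEMINI_API_KEY` per repository, while the home-directory file still fills in anything the project does not define.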
@@ -293,6 +325,112 @@ async function embedQuery(query) {
 
 // src/qdrant.ts
 import { QdrantClient } from "@qdrant/js-client-rest";
+
+// src/rerank.ts
+function normalize(input) {
+  return input.toLowerCase().replace(/[^a-z0-9\s]/g, " ").replace(/\s+/g, " ").trim();
+}
+function tokenize(input) {
+  return normalize(input).split(" ").filter((token) => token.length > 1);
+}
+function unique(items) {
+  return Array.from(new Set(items));
+}
+function lexicalOverlap(queryTokens, haystack) {
+  if (queryTokens.length === 0) {
+    return 0;
+  }
+  const haystackTokens = new Set(tokenize(haystack));
+  let matches = 0;
+  for (const token of queryTokens) {
+    if (haystackTokens.has(token)) {
+      matches += 1;
+    }
+  }
+  return matches / queryTokens.length;
+}
+function nearDuplicate(a, b) {
+  const aTokens = unique(tokenize(a.content)).slice(0, 48);
+  const bTokens = unique(tokenize(b.content)).slice(0, 48);
+  if (aTokens.length === 0 || bTokens.length === 0) {
+    return false;
+  }
+  const bSet = new Set(bTokens);
+  let intersection = 0;
+  for (const token of aTokens) {
+    if (bSet.has(token)) {
+      intersection += 1;
+    }
+  }
+  const union = (/* @__PURE__ */ new Set([...aTokens, ...bTokens])).size;
+  return union > 0 && intersection / union >= 0.8;
+}
+function contentSignature(content) {
+  return tokenize(content).slice(0, 24).join(" ");
+}
+function rerankAndDeduplicate(query, hits, limit) {
+  const normalizedQuery = normalize(query);
+  const queryTokens = unique(tokenize(query));
+  const ranked = hits.map((hit) => {
+    const overlap = lexicalOverlap(queryTokens, hit.content);
+    const pathOverlap = lexicalOverlap(queryTokens, `${hit.relativePath} ${hit.rootName}`);
+    const exactPhrase = normalizedQuery.length > 0 && normalize(hit.content).includes(normalizedQuery) ? 0.2 : 0;
+    const score = hit.score + overlap * 0.22 + pathOverlap * 0.08 + exactPhrase;
+    return { ...hit, rerankScore: score };
+  }).sort((left, right) => right.rerankScore - left.rerankScore);
+  const selected = [];
+  const signatureSet = /* @__PURE__ */ new Set();
+  const perFileCounts = /* @__PURE__ */ new Map();
+  const preferredPerFileLimit = 2;
+  function canTake(hit, enforcePerFileLimit) {
+    const signature = contentSignature(hit.content);
+    if (signature && signatureSet.has(signature)) {
+      return false;
+    }
+    if (selected.some((existing) => nearDuplicate(existing, hit))) {
+      return false;
+    }
+    if (enforcePerFileLimit) {
+      const current = perFileCounts.get(hit.relativePath) ?? 0;
+      if (current >= preferredPerFileLimit) {
+        return false;
+      }
+    }
+    return true;
+  }
+  function add(hit) {
+    selected.push(hit);
+    const signature = contentSignature(hit.content);
+    if (signature) {
+      signatureSet.add(signature);
+    }
+    perFileCounts.set(hit.relativePath, (perFileCounts.get(hit.relativePath) ?? 0) + 1);
+  }
+  for (const hit of ranked) {
+    if (selected.length >= limit) {
+      break;
+    }
+    if (canTake(hit, true)) {
+      add(hit);
+    }
+  }
+  if (selected.length < limit) {
+    for (const hit of ranked) {
+      if (selected.length >= limit) {
+        break;
+      }
+      if (selected.some((existing) => existing.id === hit.id)) {
+        continue;
+      }
+      if (canTake(hit, false)) {
+        add(hit);
+      }
+    }
+  }
+  return selected.map(({ rerankScore: _rerankScore, ...hit }) => hit);
+}
+
+// src/qdrant.ts
 var OwnSearchStore = class {
   constructor(client2, collectionName, vectorSize) {
     this.client = client2;
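The reranker added above combines the vector score with lexical signals. A condensed sketch of just the scoring step (the weights 0.22, 0.08, and 0.2 are taken from the diff; the deduplication and per-file-limit passes are omitted, and `rootName` is left out of the path signal for brevity):

```javascript
// Condensed sketch of the heuristic rerank scoring in this release:
// vector score + lexical query overlap (0.22) + path overlap (0.08)
// + an exact-phrase bonus (0.2).
const tokenize = (s) =>
  s.toLowerCase().replace(/[^a-z0-9\s]/g, " ").split(/\s+/).filter((t) => t.length > 1);

function overlap(queryTokens, haystack) {
  if (queryTokens.length === 0) return 0;
  const set = new Set(tokenize(haystack));
  return queryTokens.filter((t) => set.has(t)).length / queryTokens.length;
}

function rerank(query, hits) {
  const queryTokens = [...new Set(tokenize(query))];
  const phrase = query.toLowerCase();
  return hits
    .map((hit) => ({
      ...hit,
      rerankScore:
        hit.score +
        overlap(queryTokens, hit.content) * 0.22 +
        overlap(queryTokens, hit.relativePath) * 0.08 +
        (hit.content.toLowerCase().includes(phrase) ? 0.2 : 0),
    }))
    .sort((a, b) => b.rerankScore - a.rerankScore);
}
```

Because the boosts are bounded, a hit with a clearly stronger vector score still wins; the lexical signals mostly break ties between semantically similar chunks.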
@@ -450,7 +588,7 @@ var OwnSearchStore = class {
   }
   const results = await this.client.search(this.collectionName, {
     vector,
-    limit: filters.pathSubstring ?
+    limit: Math.max(filters.pathSubstring ? limit * 8 : limit * 6, 24),
     with_payload: true,
     filter: must.length ? { must } : void 0
   });
@@ -464,11 +602,8 @@ var OwnSearchStore = class {
   chunkIndex: Number(result.payload?.chunk_index ?? 0),
   content: String(result.payload?.content ?? "")
 }));
-
-
-}
-const needle = filters.pathSubstring.toLowerCase();
-return hits.filter((hit) => hit.relativePath.toLowerCase().includes(needle)).slice(0, limit);
+const filtered = !filters.pathSubstring ? hits : hits.filter((hit) => hit.relativePath.toLowerCase().includes(filters.pathSubstring.toLowerCase()));
+return rerankAndDeduplicate(filters.queryText ?? "", filtered, limit);
 }
 async getChunks(ids) {
   if (ids.length === 0) {
@@ -523,9 +658,17 @@ function chunkText(content, chunkSize, chunkOverlap) {
 while (start < normalized.length) {
   let end = Math.min(start + chunkSize, normalized.length);
   if (end < normalized.length) {
-    const
-
-
+    const minimumBoundary = start + Math.floor(chunkSize * 0.5);
+    const newlineBoundary = normalized.lastIndexOf("\n", end);
+    const whitespaceBoundary = normalized.lastIndexOf(" ", end);
+    const punctuationBoundary = Math.max(
+      normalized.lastIndexOf(". ", end),
+      normalized.lastIndexOf("? ", end),
+      normalized.lastIndexOf("! ", end)
+    );
+    const boundary = Math.max(newlineBoundary, whitespaceBoundary, punctuationBoundary);
+    if (boundary > minimumBoundary) {
+      end = boundary;
   }
 }
 const chunk = normalized.slice(start, end).trim();
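The new boundary logic above prefers to end a chunk at a newline, space, or sentence break, but only when that break lies past half the chunk size, so chunks never collapse to tiny fragments. A minimal sketch of just the end-selection step (the `? ` and `! ` boundaries from the diff are dropped for brevity):

```javascript
// Sketch of the boundary-aware split added to chunkText: a chunk ends
// at the latest newline, space, or sentence boundary before the size
// limit, but only if that boundary sits past half the chunk size.
function splitEnd(text, start, chunkSize) {
  let end = Math.min(start + chunkSize, text.length);
  if (end < text.length) {
    const minimum = start + Math.floor(chunkSize * 0.5);
    const boundary = Math.max(
      text.lastIndexOf("\n", end),   // prefer hard line breaks
      text.lastIndexOf(" ", end),    // then any whitespace
      text.lastIndexOf(". ", end)    // then sentence punctuation
    );
    if (boundary > minimum) end = boundary; // never shrink below 50%
  }
  return end;
}
```

For example, with `chunkSize = 14` over `"alpha beta gamma delta"` the cut moves back from index 14 to the space at index 10, so the first chunk is the whole word run `"alpha beta"` rather than a word split mid-token.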
@@ -543,10 +686,14 @@ function chunkText(content, chunkSize, chunkOverlap) {
 // src/files.ts
 import fs2 from "fs/promises";
 import path3 from "path";
+import mammoth from "mammoth";
 import { PDFParse } from "pdf-parse";
 function sanitizeExtractedText(input) {
   return input.replace(/\u0000/g, "").replace(/[\u0001-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, " ").replace(/\r\n/g, "\n");
 }
+function extractRtfText(input) {
+  return input.replace(/\\par[d]?/g, "\n").replace(/\\tab/g, " ").replace(/\\'[0-9a-fA-F]{2}/g, " ").replace(/\\[a-zA-Z]+-?\d* ?/g, "").replace(/[{}]/g, " ");
+}
 async function collectTextFiles(rootPath, maxFileBytes) {
   const files = [];
   const absoluteRoot = path3.resolve(rootPath);
@@ -566,6 +713,11 @@ async function collectTextFiles(rootPath, maxFileBytes) {
   await parser.destroy();
   }
 }
+async function parseDocx(filePath) {
+  const buffer = await fs2.readFile(filePath);
+  const result = await mammoth.extractRawText({ buffer });
+  return result.value ?? "";
+}
 async function walk(currentPath) {
   const entries = await fs2.readdir(currentPath, { withFileTypes: true });
   for (const entry of entries) {
@@ -588,7 +740,7 @@ async function collectTextFiles(rootPath, maxFileBytes) {
   continue;
   }
   const stats = await fs2.stat(nextPath);
-  if (stats.size > maxFileBytes) {
+  if (EXTRACTED_DOCUMENT_EXTENSIONS.has(extension) && stats.size > maxFileBytes) {
   debugLog("skip-size", nextPath, stats.size);
   continue;
   }
@@ -596,6 +748,10 @@ async function collectTextFiles(rootPath, maxFileBytes) {
   try {
   if (extension === ".pdf") {
   content = await parsePdf(nextPath);
+  } else if (extension === ".docx") {
+  content = await parseDocx(nextPath);
+  } else if (extension === ".rtf") {
+  content = extractRtfText(await fs2.readFile(nextPath, "utf8"));
   } else {
   content = await fs2.readFile(nextPath, "utf8");
   }
@@ -739,7 +895,9 @@ export {
   buildContextBundle,
   getConfigPath,
   getEnvPath,
+  getCwdEnvPath,
   loadOwnSearchEnv,
+  readEnvFile,
   loadConfig,
   saveGeminiApiKey,
   deleteRootDefinition,
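The `extractRtfText` added above is a chain of regex substitutions, not a full RTF parser, which is why the README describes RTF support as "via text extraction". The same function, reproduced standalone with a comment per substitution:

```javascript
// The regex-based RTF-to-text pass shipped in this release, with each
// substitution annotated. It strips control words and group braces;
// it does not interpret fonts, tables, or embedded objects.
function extractRtfText(input) {
  return input
    .replace(/\\par[d]?/g, "\n")        // \par and \pard marks become newlines
    .replace(/\\tab/g, " ")             // \tab becomes a space
    .replace(/\\'[0-9a-fA-F]{2}/g, " ") // hex-escaped characters are dropped
    .replace(/\\[a-zA-Z]+-?\d* ?/g, "") // remaining control words are removed
    .replace(/[{}]/g, " ");             // group braces become spaces
}
```

On a minimal document like `{\rtf1\ansi Hello \par World}` this leaves just the readable words, separated by whitespace; non-Latin or heavily formatted RTF will lose more, matching the limitations section of the README.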
package/dist/cli.js
CHANGED

@@ -7,15 +7,18 @@ import {
   embedQuery,
   findRoot,
   getConfigPath,
+  getCwdEnvPath,
   getEnvPath,
   indexPath,
   listRoots,
   loadConfig,
   loadOwnSearchEnv,
+  readEnvFile,
   saveGeminiApiKey
-} from "./chunk-
+} from "./chunk-ZQAY3FE3.js";
 
 // src/cli.ts
+import fs from "fs/promises";
 import path from "path";
 import { spawn } from "child_process";
 import readline from "readline/promises";
@@ -67,47 +70,291 @@ async function ensureQdrantDocker() {
 loadOwnSearchEnv();
 var program = new Command();
 var PACKAGE_NAME = "ownsearch";
+var GEMINI_API_KEY_URL = "https://aistudio.google.com/apikey";
+var BUNDLED_SKILL_NAME = "ownsearch-rag-search";
+var SUPPORTED_AGENTS = [
+  "codex",
+  "claude-desktop",
+  "continue",
+  "copilot-cli",
+  "cursor",
+  "github-copilot",
+  "vscode",
+  "windsurf"
+];
 function requireGeminiKey() {
   if (!process.env.GEMINI_API_KEY) {
     throw new OwnSearchError("Set GEMINI_API_KEY before running OwnSearch.");
   }
 }
+function buildAgentConfig(agent) {
+  const stdioConfig = {
+    command: "npx",
+    args: ["-y", PACKAGE_NAME, "serve-mcp"],
+    env: {
+      GEMINI_API_KEY: "${GEMINI_API_KEY}"
+    }
+  };
+  switch (agent) {
+    case "codex":
+      return {
+        platform: "codex",
+        configScope: "Add this server entry to your Codex MCP configuration.",
+        config: { ownsearch: stdioConfig }
+      };
+    case "claude-desktop":
+      return {
+        platform: "claude-desktop",
+        installMethod: "Desktop Extension (.mcpb)",
+        note: "Current Claude Desktop documentation recommends local MCP installation through Desktop Extensions instead of manual JSON config files.",
+        nextStep: "OwnSearch does not yet ship an .mcpb bundle. Use Cursor, VS Code, Windsurf, Continue, or GitHub Copilot with the snippets below for now."
+      };
+    case "continue":
+      return {
+        platform: "continue",
+        configPath: ".continue/mcpServers/ownsearch.json",
+        note: "Continue can ingest JSON MCP configs directly.",
+        config: { ownsearch: stdioConfig }
+      };
+    case "copilot-cli":
+      return {
+        platform: "copilot-cli",
+        configPath: "~/.copilot/mcp-config.json",
+        config: {
+          mcpServers: {
+            ownsearch: {
+              type: "local",
+              command: stdioConfig.command,
+              args: stdioConfig.args,
+              env: stdioConfig.env,
+              tools: ["*"]
+            }
+          }
+        }
+      };
+    case "cursor":
+      return {
+        platform: "cursor",
+        configPath: "~/.cursor/mcp.json or .cursor/mcp.json",
+        config: { ownsearch: stdioConfig }
+      };
+    case "github-copilot":
+    case "vscode":
+      return {
+        platform: agent,
+        configPath: ".vscode/mcp.json or VS Code user profile mcp.json",
+        config: {
+          servers: {
+            ownsearch: stdioConfig
+          }
+        }
+      };
+    case "windsurf":
+      return {
+        platform: "windsurf",
+        configPath: "~/.codeium/mcp_config.json",
+        config: {
+          mcpServers: {
+            ownsearch: stdioConfig
+          }
+        }
+      };
+    default:
+      throw new OwnSearchError(`Unsupported agent: ${agent}`);
+  }
+}
+async function readBundledSkill(skillName) {
+  const currentFilePath = fileURLToPath(import.meta.url);
+  const packageRoot = path.resolve(path.dirname(currentFilePath), "..");
+  const skillPath = path.join(packageRoot, "skills", skillName, "SKILL.md");
+  return fs.readFile(skillPath, "utf8");
+}
+function getDoctorVerdict(input) {
+  const nextSteps = [];
+  if (!input.geminiApiKeyPresent) {
+    nextSteps.push("Run `ownsearch setup` and save a Gemini API key.");
+  }
+  if (!input.qdrantReachable) {
+    nextSteps.push("Run `ownsearch setup` to start or reconnect to the local Qdrant container.");
+  }
+  if (input.geminiApiKeyPresent && input.qdrantReachable && input.rootCount === 0) {
+    nextSteps.push("Run `ownsearch index C:\\path\\to\\folder --name my-folder` to add your first indexed root.");
+  }
+  if (nextSteps.length === 0) {
+    nextSteps.push("Run `ownsearch index C:\\path\\to\\folder --name my-folder` to add more content, or `ownsearch serve-mcp` to connect an agent.");
+    return {
+      status: "ready",
+      summary: input.rootCount > 0 ? "OwnSearch is ready for indexing, search, and MCP agent use." : "OwnSearch is ready. Qdrant and Gemini are configured.",
+      nextSteps
+    };
+  }
+  return {
+    status: "action_required",
+    summary: "OwnSearch is not fully ready yet.",
+    nextSteps
+  };
+}
 async function promptForGeminiKey() {
-  if (
-  return
+  if (!process.stdin.isTTY || !process.stdout.isTTY) {
+    return false;
   }
   const rl = readline.createInterface({
     input: process.stdin,
     output: process.stdout
   });
   try {
-
-
-
-
-
+    console.log(`Generate a Gemini API key here: ${GEMINI_API_KEY_URL}`);
+    console.log(`OwnSearch will save it to ${getEnvPath()}`);
+    for (; ; ) {
+      const apiKey = (await rl.question("Paste GEMINI_API_KEY and press Enter (Ctrl+C to cancel): ")).trim();
+      if (!apiKey) {
+        console.log("GEMINI_API_KEY is required for indexing and search.");
+        continue;
+      }
+      await saveGeminiApiKey(apiKey);
+      process.env.GEMINI_API_KEY = apiKey;
+      return true;
     }
-  await saveGeminiApiKey(apiKey);
-  process.env.GEMINI_API_KEY = apiKey;
-  return true;
   } finally {
     rl.close();
   }
 }
+function getGeminiApiKeySource() {
+  if (readEnvFile(getEnvPath()).GEMINI_API_KEY) {
+    return "ownsearch-env";
+  }
+  if (readEnvFile(getCwdEnvPath()).GEMINI_API_KEY) {
|
|
227
|
+
return "cwd-env";
|
|
228
|
+
}
|
|
229
|
+
if (process.env.GEMINI_API_KEY) {
|
|
230
|
+
return "process-env";
|
|
231
|
+
}
|
|
232
|
+
return "missing";
|
|
233
|
+
}
|
|
234
|
+
async function ensureManagedGeminiKey() {
|
|
235
|
+
const source = getGeminiApiKeySource();
|
|
236
|
+
if (source === "ownsearch-env") {
|
|
237
|
+
return { present: true, source, savedToManagedEnv: false };
|
|
238
|
+
}
|
|
239
|
+
if (process.env.GEMINI_API_KEY) {
|
|
240
|
+
await saveGeminiApiKey(process.env.GEMINI_API_KEY);
|
|
241
|
+
return { present: true, source, savedToManagedEnv: true };
|
|
242
|
+
}
|
|
243
|
+
const prompted = await promptForGeminiKey();
|
|
244
|
+
return {
|
|
245
|
+
present: prompted,
|
|
246
|
+
source: prompted ? "prompt" : "missing",
|
|
247
|
+
savedToManagedEnv: prompted
|
|
248
|
+
};
|
|
249
|
+
}
|
|
250
|
+
function printSetupNextSteps() {
|
|
251
|
+
console.log("");
|
|
252
|
+
console.log("Next commands:");
|
|
253
|
+
console.log(" CLI indexing:");
|
|
254
|
+
console.log(" ownsearch index C:\\path\\to\\folder --name my-folder");
|
|
255
|
+
console.log(" CLI search:");
|
|
256
|
+
console.log(' ownsearch search "your question here" --limit 5');
|
|
257
|
+
console.log(" CLI grounded context:");
|
|
258
|
+
console.log(' ownsearch search-context "your question here" --limit 8 --max-chars 12000');
|
|
259
|
+
console.log(" MCP server for agents:");
|
|
260
|
+
console.log(" ownsearch serve-mcp");
|
|
261
|
+
console.log(" Agent config snippets:");
|
|
262
|
+
console.log(" ownsearch print-agent-config codex");
|
|
263
|
+
console.log(" ownsearch print-agent-config claude-desktop");
|
|
264
|
+
console.log(" ownsearch print-agent-config cursor");
|
|
265
|
+
console.log(" ownsearch print-agent-config vscode");
|
|
266
|
+
console.log(" ownsearch print-agent-config github-copilot");
|
|
267
|
+
console.log(" ownsearch print-agent-config copilot-cli");
|
|
268
|
+
console.log(" ownsearch print-agent-config windsurf");
|
|
269
|
+
console.log(" ownsearch print-agent-config continue");
|
|
270
|
+
console.log(" Bundled retrieval skill:");
|
|
271
|
+
console.log(` ownsearch print-skill ${BUNDLED_SKILL_NAME}`);
|
|
272
|
+
}
|
|
273
|
+
async function promptForAgentChoice() {
|
|
274
|
+
if (!process.stdin.isTTY || !process.stdout.isTTY) {
|
|
275
|
+
return void 0;
|
|
276
|
+
}
|
|
277
|
+
const rl = readline.createInterface({
|
|
278
|
+
input: process.stdin,
|
|
279
|
+
output: process.stdout
|
|
280
|
+
});
|
|
281
|
+
try {
|
|
282
|
+
console.log("");
|
|
283
|
+
console.log("Connect to an agent now?");
|
|
284
|
+
console.log(" 1. codex");
|
|
285
|
+
console.log(" 2. claude-desktop");
|
|
286
|
+
console.log(" 3. cursor");
|
|
287
|
+
console.log(" 4. vscode");
|
|
288
|
+
console.log(" 5. windsurf");
|
|
289
|
+
console.log(" 6. copilot-cli");
|
|
290
|
+
console.log(" 7. continue");
|
|
291
|
+
console.log(" 8. skip");
|
|
292
|
+
for (; ; ) {
|
|
293
|
+
const answer = (await rl.question("Select 1-8: ")).trim().toLowerCase();
|
|
294
|
+
switch (answer) {
|
|
295
|
+
case "1":
|
|
296
|
+
case "codex":
|
|
297
|
+
return "codex";
|
|
298
|
+
case "2":
|
|
299
|
+
case "claude-desktop":
|
|
300
|
+
case "claude":
|
|
301
|
+
return "claude-desktop";
|
|
302
|
+
case "3":
|
|
303
|
+
case "cursor":
|
|
304
|
+
return "cursor";
|
|
305
|
+
case "4":
|
|
306
|
+
case "vscode":
|
|
307
|
+
case "github-copilot":
|
|
308
|
+
return "vscode";
|
|
309
|
+
case "5":
|
|
310
|
+
case "windsurf":
|
|
311
|
+
return "windsurf";
|
|
312
|
+
case "6":
|
|
313
|
+
case "copilot-cli":
|
|
314
|
+
case "copilot":
|
|
315
|
+
return "copilot-cli";
|
|
316
|
+
case "7":
|
|
317
|
+
case "continue":
|
|
318
|
+
return "continue";
|
|
319
|
+
case "8":
|
|
320
|
+
case "skip":
|
|
321
|
+
case "":
|
|
322
|
+
return void 0;
|
|
323
|
+
default:
|
|
324
|
+
console.log("Enter 1, 2, 3, 4, 5, 6, 7, or 8.");
|
|
325
|
+
}
|
|
326
|
+
}
|
|
327
|
+
} finally {
|
|
328
|
+
rl.close();
|
|
329
|
+
}
|
|
330
|
+
}
|
|
331
|
+
function printAgentConfigSnippet(agent) {
|
|
332
|
+
console.log("");
|
|
333
|
+
console.log(`MCP config for ${agent}:`);
|
|
334
|
+
console.log(JSON.stringify(buildAgentConfig(agent), null, 2));
|
|
335
|
+
}
|
|
97
336
|
program.name("ownsearch").description("Gemini-powered local search MCP server backed by Qdrant.").version("0.1.0");
|
|
98
337
|
program.command("setup").description("Create config and start a local Qdrant Docker container.").action(async () => {
|
|
99
338
|
const config = await loadConfig();
|
|
100
339
|
const result = await ensureQdrantDocker();
|
|
101
|
-
const
|
|
340
|
+
const gemini = await ensureManagedGeminiKey();
|
|
102
341
|
console.log(JSON.stringify({
|
|
103
342
|
configPath: getConfigPath(),
|
|
104
343
|
envPath: getEnvPath(),
|
|
105
344
|
qdrantUrl: config.qdrantUrl,
|
|
106
345
|
qdrantStarted: result.started,
|
|
107
|
-
geminiApiKeyPresent
|
|
346
|
+
geminiApiKeyPresent: gemini.present,
|
|
347
|
+
geminiApiKeySource: gemini.source,
|
|
348
|
+
geminiApiKeySavedToManagedEnv: gemini.savedToManagedEnv
|
|
108
349
|
}, null, 2));
|
|
109
|
-
if (!
|
|
350
|
+
if (!gemini.present) {
|
|
110
351
|
console.log(`GEMINI_API_KEY is not set. Re-run setup or add it to ${getEnvPath()} before indexing or search.`);
|
|
352
|
+
return;
|
|
353
|
+
}
|
|
354
|
+
printSetupNextSteps();
|
|
355
|
+
const agent = await promptForAgentChoice();
|
|
356
|
+
if (agent) {
|
|
357
|
+
printAgentConfigSnippet(agent);
|
|
111
358
|
}
|
|
112
359
|
});
|
|
113
360
|
program.command("index").argument("<folder>", "Folder path to index").option("-n, --name <name>", "Display name for the indexed root").option("--max-file-bytes <n>", "Override the file size limit for this run", (value) => Number(value)).description("Index a local folder into Qdrant using Gemini embeddings.").action(async (folder, options) => {
|
|
@@ -126,6 +373,7 @@ program.command("search").argument("<query>", "Natural language query").option("
|
|
|
126
373
|
const hits = await store.search(
|
|
127
374
|
vector,
|
|
128
375
|
{
|
|
376
|
+
queryText: query,
|
|
129
377
|
rootIds: options.rootId,
|
|
130
378
|
pathSubstring: options.path
|
|
131
379
|
},
|
|
@@ -142,6 +390,7 @@ program.command("search-context").argument("<query>", "Natural language query").
|
|
|
142
390
|
const hits = await store.search(
|
|
143
391
|
vector,
|
|
144
392
|
{
|
|
393
|
+
queryText: query,
|
|
145
394
|
rootIds: options.rootId,
|
|
146
395
|
pathSubstring: options.path
|
|
147
396
|
},
|
|
@@ -178,10 +427,17 @@ program.command("doctor").description("Check local prerequisites and package con
|
|
|
178
427
|
} catch (error) {
|
|
179
428
|
qdrantReachable = false;
|
|
180
429
|
}
|
|
430
|
+
const verdict = getDoctorVerdict({
|
|
431
|
+
geminiApiKeyPresent: Boolean(process.env.GEMINI_API_KEY),
|
|
432
|
+
qdrantReachable,
|
|
433
|
+
rootCount: roots.length
|
|
434
|
+
});
|
|
181
435
|
console.log(JSON.stringify({
|
|
436
|
+
verdict,
|
|
182
437
|
configPath: getConfigPath(),
|
|
183
438
|
envPath: getEnvPath(),
|
|
184
439
|
geminiApiKeyPresent: Boolean(process.env.GEMINI_API_KEY),
|
|
440
|
+
geminiApiKeySource: getGeminiApiKeySource(),
|
|
185
441
|
qdrantUrl: config.qdrantUrl,
|
|
186
442
|
qdrantReachable,
|
|
187
443
|
collection: config.qdrantCollection,
|
|
@@ -189,6 +445,7 @@ program.command("doctor").description("Check local prerequisites and package con
|
|
|
189
445
|
vectorSize: config.vectorSize,
|
|
190
446
|
chunkSize: config.chunkSize,
|
|
191
447
|
chunkOverlap: config.chunkOverlap,
|
|
448
|
+
maxExtractedDocumentBytes: config.maxFileBytes,
|
|
192
449
|
maxFileBytes: config.maxFileBytes,
|
|
193
450
|
rootCount: roots.length
|
|
194
451
|
}, null, 2));
|
|
@@ -204,23 +461,16 @@ program.command("serve-mcp").description("Start the stdio MCP server.").action(a
|
|
|
204
461
|
process.exitCode = code ?? 0;
|
|
205
462
|
});
|
|
206
463
|
});
|
|
207
|
-
program.command("print-agent-config").argument("<agent>", "
|
|
208
|
-
|
|
209
|
-
|
|
210
|
-
|
|
211
|
-
env: {
|
|
212
|
-
GEMINI_API_KEY: "${GEMINI_API_KEY}"
|
|
213
|
-
}
|
|
214
|
-
};
|
|
215
|
-
switch (agent) {
|
|
216
|
-
case "codex":
|
|
217
|
-
case "claude-desktop":
|
|
218
|
-
case "cursor":
|
|
219
|
-
console.log(JSON.stringify({ ownsearch: config }, null, 2));
|
|
220
|
-
return;
|
|
221
|
-
default:
|
|
222
|
-
throw new OwnSearchError(`Unsupported agent: ${agent}`);
|
|
464
|
+
program.command("print-agent-config").argument("<agent>", SUPPORTED_AGENTS.join(" | ")).description("Print an MCP config snippet for a supported agent.").action(async (agent) => {
|
|
465
|
+
if (SUPPORTED_AGENTS.includes(agent)) {
|
|
466
|
+
console.log(JSON.stringify(buildAgentConfig(agent), null, 2));
|
|
467
|
+
return;
|
|
223
468
|
}
|
|
469
|
+
throw new OwnSearchError(`Unsupported agent: ${agent}`);
|
|
470
|
+
});
|
|
471
|
+
program.command("print-skill").argument("[skill]", `Bundled skill name (default ${BUNDLED_SKILL_NAME})`).description("Print a bundled OwnSearch skill that helps agents query retrieval tools more effectively.").action(async (skill) => {
|
|
472
|
+
const skillName = skill?.trim() || BUNDLED_SKILL_NAME;
|
|
473
|
+
console.log(await readBundledSkill(skillName));
|
|
224
474
|
});
|
|
225
475
|
program.parseAsync(process.argv).catch((error) => {
|
|
226
476
|
const message = error instanceof Error ? error.message : String(error);
|
package/dist/mcp/server.js
CHANGED

@@ -9,7 +9,7 @@ import {
   indexPath,
   loadConfig,
   loadOwnSearchEnv
-} from "../chunk-
+} from "../chunk-ZQAY3FE3.js";

 // src/mcp/server.ts
 import { Server } from "@modelcontextprotocol/sdk/server/index.js";
@@ -165,6 +165,7 @@ server.setRequestHandler(CallToolRequestSchema, async (request) => {
   const hits = await store.search(
     vector,
     {
+      queryText: args.query,
       rootIds: args.rootIds,
       pathSubstring: args.pathSubstring
     },
@@ -185,6 +186,7 @@ server.setRequestHandler(CallToolRequestSchema, async (request) => {
   const hits = await store.search(
     vector,
     {
+      queryText: args.query,
       rootIds: args.rootIds,
       pathSubstring: args.pathSubstring
     },
package/package.json
CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "ownsearch",
-  "version": "0.1.
+  "version": "0.1.4",
   "description": "Text-first local document search MCP server backed by Gemini embeddings and Qdrant.",
   "type": "module",
   "bin": {
@@ -8,13 +8,15 @@
   },
   "files": [
     "dist",
-    "README.md"
+    "README.md",
+    "skills"
   ],
   "scripts": {
     "build": "tsup src/cli.ts src/mcp/server.ts --format esm --dts --clean --external pdf-parse",
     "dev": "tsx src/cli.ts",
     "prepare": "npm run build",
     "prepublishOnly": "npm run typecheck && npm run build",
+    "smoke:text-docs": "tsx scripts/smoke-text-docs.mts",
     "serve-mcp": "tsx src/mcp/server.ts",
     "typecheck": "tsc --noEmit"
   },
@@ -44,17 +46,18 @@
   },
   "dependencies": {
     "@google/genai": "^1.46.0",
-    "@grumppie/ownsearch": "^0.1.3",
     "@modelcontextprotocol/sdk": "^1.27.1",
     "@qdrant/js-client-rest": "^1.17.0",
     "commander": "^14.0.1",
     "dotenv": "^17.3.1",
+    "mammoth": "^1.12.0",
     "pdf-parse": "^2.4.5",
     "zod": "^3.25.76"
   },
   "devDependencies": {
     "@types/node": "^24.6.0",
     "@types/pdf-parse": "^1.1.5",
+    "docx": "^9.6.1",
     "tsup": "^8.5.0",
     "tsx": "^4.20.6",
     "typescript": "^5.9.3"
package/skills/ownsearch-rag-search/SKILL.md
ADDED

@@ -0,0 +1,139 @@
+---
+name: ownsearch-rag-search
+description: Improve retrieval quality when an agent uses OwnSearch MCP tools to search local documents. Use for semantic search, grounded answering, query rewriting, multi-query retrieval, exact chunk fetches, duplicate-heavy result sets, or whenever a user request must be translated into stronger OwnSearch search_context/search/get_chunks calls.
+---
+
+# OwnSearch RAG Search
+
+## Overview
+
+Use this skill to bridge the gap between what a user asks and what OwnSearch should retrieve. Treat retrieval as probabilistic: rewrite weak queries, run multiple targeted searches when needed, prefer grounded context over guesswork, and fetch exact chunks before making precise claims.
+
+## Retrieval Workflow
+
+1. Classify the user request.
+2. Generate one to four retrieval queries.
+3. Start with `search_context` for the strongest query.
+4. Expand to additional searches only if evidence is weak, duplicate-heavy, or incomplete.
+5. Use `get_chunks` after `search` when the answer needs exact wording, detailed comparison, or citation-grade grounding.
+6. Answer only from retrieved evidence. Say when the retrieved context is insufficient.
+
+## Query Planning
+
+Generate retrieval queries with these patterns:
+
+- Literal query: preserve the exact noun phrase, error string, rule name, or title the user used.
+- Canonical query: replace vague wording with domain terms likely to appear in documents.
+- Paraphrase query: restate the intent in simpler or more explicit language.
+- Source-biased query: add likely file names, section names, or path hints when the user names a source.
+
+Good examples:
+
+- User ask: "How do concentration checks work?"
+  Queries:
+  - `concentration checks`
+  - `maintain concentration after taking damage`
+  - `constitution saving throw concentration spell`
+
+- User ask: "Where does the repo explain local MCP setup?"
+  Queries:
+  - `local MCP setup`
+  - `Model Context Protocol setup`
+  - `serve-mcp agent config`
+
+- User ask: "What did the contract say about payment timing?"
+  Queries:
+  - `payment timing`
+  - `payment due within`
+  - `invoice due date net terms`
+
+## Tool Use Rules
+
+Use `search_context` when:
+
+- the user wants an answer, summary, explanation, or quick grounding
+- the answer can be supported by a few chunks
+- low latency matters more than exhaustive recall
+
+Use `search` when:
+
+- you want to inspect ranking and source distribution
+- you need to compare multiple candidates
+- you suspect duplicates or poor recall
+
+Use `get_chunks` when:
+
+- exact wording matters
+- the answer depends on adjacent details
+- you need to quote or carefully verify a claim
+- you need to compare similar hits before answering
+
+## Duplicate Handling
+
+Assume top results can still contain semantic duplicates.
+
+When results are duplicate-heavy:
+
+- keep only the strongest chunk per repeated claim unless neighboring chunks add new facts
+- prefer source diversity when multiple files say the same thing
+- if one document clearly appears authoritative, prefer that source but mention corroboration when useful
+- if the top results are all from one file and the answer still seems incomplete, issue a second query with a different phrasing
+
+## Failure Recovery
+
+If the first search is weak:
+
+- shorten the query
+- remove conversational filler
+- swap vague words for canonical terms
+- split compound questions into separate searches
+- add likely section names or file hints
+- search once for the concept and once for the expected answer shape
+
+Examples:
+
+- "Can you tell me what they said about when we can terminate this thing?"
+  Retry with:
+  - `termination`
+  - `termination notice`
+  - `right to terminate`
+  - `termination for cause`
+
+- "Why is my build exploding around env handling?"
+  Retry with:
+  - `environment variables`
+  - `dotenv`
+  - `GEMINI_API_KEY`
+  - `setup envPath`
+
+## Answering Rules
+
+- Do not invent facts that were not retrieved.
+- Prefer citing file paths or chunk provenance when the client supports it.
+- If retrieval is partial, say which part is grounded and which part is uncertain.
+- If evidence conflicts, surface the conflict instead of averaging it away.
+- If nothing relevant is retrieved after a few query variants, say so explicitly.
+
+## Minimal Playbook
+
+For a normal grounded answer:
+
+1. Derive two or three strong retrieval queries.
+2. Call `search_context` with the best query.
+3. If results look sufficient, answer from them.
+4. If results look weak or ambiguous, call `search` with another variant.
+5. Fetch exact chunks for the best IDs before making precise claims.
+
+For a locate-the-source task:
+
+1. Use `search` first.
+2. Inspect which files dominate.
+3. Use `get_chunks` on top hits.
+4. Return the most relevant files and sections, not just a prose answer.
+
+For a compare-or-summarize task:
+
+1. Run one query per subtopic.
+2. Collect grounded chunks from each.
+3. Merge only non-duplicate evidence.
+4. Summarize with explicit source-backed differences.