ownsearch 0.1.2 → 0.1.4
- package/README.md +260 -45
- package/dist/{chunk-NLETDGQ5.js → chunk-ZQAY3FE3.js} +194 -10
- package/dist/cli.js +303 -22
- package/dist/mcp/server.js +6 -3
- package/package.json +6 -2
- package/skills/ownsearch-rag-search/SKILL.md +139 -0
- package/skills/ownsearch-rag-search/agents/openai.yaml +4 -0
package/README.md
CHANGED
@@ -1,36 +1,116 @@
 # ownsearch
 
-**ownsearch** is a local
+**ownsearch** is a local retrieval layer for agents.
 
-It indexes approved folders into a local Qdrant store, exposes retrieval through an MCP server
+It indexes approved folders into a local Qdrant vector store, embeds content with Gemini, and exposes grounded retrieval through an MCP server so agents can search private documents without shipping those documents to a hosted RAG backend.
 
-
+This package is designed for **text-first, local, agentic RAG**:
+
+- local folders instead of SaaS document ingestion
+- MCP-native access for agents
+- grounded chunk retrieval instead of opaque long-context guessing
+- predictable local storage with Docker-backed Qdrant
+
+## Why it exists
+
+Most agents waste time and tokens when they do one of two things:
+
+- search too broadly with weak semantic queries
+- skip retrieval and guess from partial context
+
+`ownsearch` is meant to reduce both failure modes by:
+
+- indexing local knowledge once
+- making retrieval cheap and reusable
+- giving agents a structured way to fetch only the chunks they need
+- improving answer quality with reranking, deduplication, and grounded chunk access
+
+## Core use cases
+
+`ownsearch` is a good fit when an agent needs to work over:
+
+- product documentation
+- technical design docs
+- code-adjacent text files
+- contracts and policy documents
+- research notes
+- knowledge bases stored in folders
+- PDF, DOCX, RTF, markdown, and plain-text-heavy repositories
+
+Typical agent workflows:
+
+- answer questions over local docs
+- locate the exact source file or section for a fact
+- summarize a set of related files
+- compare policy, spec, or contract language across documents
+- support coding agents with repo-local documentation search
+- reduce token cost by retrieving only relevant chunks instead of loading entire files
 
 ## What it does
 
--
--
--
--
--
+- indexes local folders into a persistent vector store
+- chunks and embeds supported files with Gemini
+- supports incremental reindexing for changed and deleted files
+- exposes search and context retrieval through MCP
+- reranks and deduplicates result sets before returning them
+- lets agents retrieve ranked hits, exact chunks, or bundled grounded context
 
-##
+## Current strengths
 
-
-- extracted text from PDFs
-- Gemini `gemini-embedding-001`
-- Docker-backed Qdrant
-- stdio MCP server for local agent attachment
+What is already strong in the current package:
 
-
+- local-first setup with Docker-backed Qdrant
+- deterministic readiness checks through `ownsearch doctor`
+- multi-platform MCP config generation
+- bundled retrieval skill for better query planning
+- support for common text document formats
+- large plain text and code files are no longer blocked by the extracted-document size cap
+- repeatable smoke validation for mixed text corpora
+
+## V1 supported document types
+
+The current package is intended for text-first corpora, including:
+
+- plain text and code files
+- markdown and MDX
+- JSON, YAML, TOML, CSV, XML, HTML
+- PDF via text extraction
+- DOCX via text extraction
+- RTF via text extraction
+
+## Deployment readiness
+
+This package is ready to deploy for **text-first local document folders** when:
 
-
+- Node.js `20+` is available
+- Docker is available and Qdrant can run locally
+- `GEMINI_API_KEY` is configured
+- the document corpus is primarily text-based
+
+Installation:
 
 ```bash
 npm install -g ownsearch
+```
 
-
+Deployment checklist:
 
+```bash
+npm install -g ownsearch
+ownsearch setup
+ownsearch doctor
+ownsearch index C:\path\to\folder --name my-folder
+ownsearch serve-mcp
+```
+
+If `ownsearch doctor` returns:
+
+- `verdict.status: "ready"`: the package is operational
+- `verdict.status: "action_required"`: follow the listed `nextSteps`
+
+## Quickstart
+
+```bash
 ownsearch setup
 ownsearch doctor
 ownsearch index ./docs --name docs
@@ -38,48 +118,183 @@ ownsearch list-roots
 ownsearch search "what is this repo about?" --limit 5
 ownsearch search-context "what is this repo about?" --limit 8 --max-chars 12000
 ownsearch serve-mcp
+```
+
+On first run, `ownsearch setup` can:
+
+- prompt for `GEMINI_API_KEY`
+- link users to Google AI Studio
+- save the key to `~/.ownsearch/.env`
+- print exact next commands for CLI and MCP usage
+- optionally print an MCP config snippet for a selected agent
+
+## Real-world fit
+
+`ownsearch` is a strong fit for:
+
+- engineering teams with private docs that should stay local
+- coding agents that need repo-adjacent design docs and runbooks
+- consultants or operators working across contract, policy, or knowledge folders
+- researchers who want grounded retrieval over local notes and exported reports
+- teams trying to reduce agent token burn by retrieving small grounded context bundles instead of pasting entire files
 
-
+It is less suitable when:
 
+- the corpus is mostly scanned documents that need OCR
+- the workflow depends on spreadsheets, slides, or legacy Office formats
+- the main requirement is hosted multi-user search rather than local agent retrieval
+
+## Agent integration
+
+To print MCP config snippets:
+
+```bash
 ownsearch print-agent-config codex
-ownsearch print-agent-config claude-desktop
 ownsearch print-agent-config cursor
-
+ownsearch print-agent-config vscode
+ownsearch print-agent-config github-copilot
+ownsearch print-agent-config copilot-cli
+ownsearch print-agent-config windsurf
+ownsearch print-agent-config continue
+ownsearch print-agent-config claude-desktop
+```
+
+Supported config targets currently include:
+
+- `codex`
+- `cursor`
+- `vscode`
+- `github-copilot`
+- `copilot-cli`
+- `windsurf`
+- `continue`
+- `claude-desktop`
+
+Notes:
 
-
+- `claude-desktop` currently returns guidance rather than a raw JSON snippet because current Claude Desktop docs prefer desktop extensions (`.mcpb`) over manual JSON server configs
+- all other supported targets return concrete MCP config payloads
 
-
-
-
-
-
-
+## Bundled skill
+
+The package ships with a bundled retrieval skill:
+
+```bash
+ownsearch print-skill ownsearch-rag-search
+```
+
+The skill is intended to help an agent:
+
+- rewrite weak user requests into stronger retrieval queries
+- decide when to use `search_context` vs `search` vs `get_chunks`
+- recover from poor first-pass retrieval
+- avoid duplicate-heavy answer synthesis
+- stay grounded when retrieval is probabilistic
+
+## CLI commands
+
+- `ownsearch setup`
+  Starts or reconnects to the local Qdrant Docker container, creates local config, persists `GEMINI_API_KEY`, and prints next-step commands.
+- `ownsearch doctor`
+  Checks config, Gemini key presence, Qdrant connectivity, and collection settings, and emits a deterministic readiness verdict.
+- `ownsearch index <folder> --name <name>`
+  Indexes a folder incrementally into the local vector collection.
+- `ownsearch list-roots`
+  Lists approved indexed roots.
+- `ownsearch search "<query>"`
+  Returns reranked search hits from the vector store.
+- `ownsearch search-context "<query>"`
+  Returns a compact grounded context bundle for agents.
+- `ownsearch delete-root <rootId>`
+  Removes a root from config and deletes its vectors from Qdrant.
+- `ownsearch store-status`
+  Shows collection status and vector configuration.
+- `ownsearch serve-mcp`
+  Starts the stdio MCP server.
+- `ownsearch print-agent-config <agent>`
+  Prints MCP config snippets or platform guidance.
+- `ownsearch print-skill [skill]`
+  Prints a bundled OwnSearch skill.
 
 ## MCP tools
 
-
-* `search`
-* `search_context`
-* `get_chunks`
-* `list_roots`
-* `delete_root`
-* `store_status`
+The MCP server currently exposes:
 
-
+- `index_path`
+- `search`
+- `search_context`
+- `get_chunks`
+- `list_roots`
+- `delete_root`
+- `store_status`
 
-
-* Qdrant runs locally in Docker as `ownsearch-qdrant`
-* `GEMINI_API_KEY` must be available in the environment or `.env`
+Recommended retrieval flow:
 
-
+1. Use `search_context` for fast grounded retrieval.
+2. Use `search` when ranking and source inspection matter.
+3. Use `get_chunks` when exact wording or detailed comparison matters.
 
-
+## Validation
+
+The package includes a repeatable smoke suite:
+
+```bash
+npm run smoke:text-docs
+```
+
+That smoke run currently validates:
+
+- `.txt` retrieval
+- `.rtf` retrieval
+- `.docx` retrieval
+- `.pdf` retrieval
+- large plain-text file bypass of the extracted-document byte cap
+
+## Limitations
+
+This package is deploy-ready for text-first corpora, but it is not universal document intelligence.
+
+Current hard limitations:
+
+- no OCR for image-only PDFs
+- no `.doc` support
+- no spreadsheet or presentation extraction such as `.xlsx` or `.pptx`
+- no multimodal embeddings yet
+- reranking is heuristic and local, not yet model-based
+- very large corpora can still become expensive because embedding cost scales with chunk count
+
+Operational limitations:
+
+- retrieval quality still depends on query quality
+- extracted-document quality depends on source document quality
+- duplicate-heavy corpora are improved by current reranking, but not fully solved for all edge cases
+- scanned or low-quality PDFs may require OCR before indexing
+
+## Future scope
+
+Planned next-stage improvements:
+
+- pluggable learned rerankers
+- stronger deduplication across overlapping corpora
+- richer document extraction
+- watch mode for automatic local reindexing
+- HTTP MCP transport
+- optional hosted deployment mode
+- multimodal indexing and retrieval for:
+  - images
+  - audio
+  - video
+  - richer document formats
+
+The multimodal phase will require careful collection migration because Gemini text and multimodal embedding spaces are not interchangeable across model families.
+
+## Notes
 
-
-
-
-
-
+- config is stored in `~/.ownsearch/config.json`
+- shared CLI and MCP secrets can be stored in `~/.ownsearch/.env`
+- Qdrant runs locally in Docker as `ownsearch-qdrant`
+- `GEMINI_API_KEY` may come from the shell environment, the current working directory `.env`, or `~/.ownsearch/.env`
+- `maxFileBytes` primarily applies to extracted document formats such as PDF, DOCX, and RTF, not to large plain text and code files
 
 ## License
 
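The retrieval flow the README recommends (`search_context` first, `search` when ranked hits and source paths matter, `get_chunks` for exact wording) can be sketched as a small dispatch helper. `chooseRetrievalTool` is a hypothetical illustration, not a package export; only the returned tool names are real MCP tools.

```javascript
// Hypothetical helper mapping an agent's need to the recommended MCP tool.
// The tool-name strings are the real ownsearch MCP tools; the mapping
// logic itself is an illustrative sketch.
function chooseRetrievalTool(need) {
  if (need.exactWording || need.compareChunks) {
    return "get_chunks";     // exact wording or detailed comparison
  }
  if (need.inspectSources || need.needRanking) {
    return "search";         // ranked hits with file paths to inspect
  }
  return "search_context";   // default: one compact grounded context bundle
}

console.log(chooseRetrievalTool({}));                     // search_context
console.log(chooseRetrievalTool({ needRanking: true }));  // search
console.log(chooseRetrievalTool({ exactWording: true })); // get_chunks
```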
package/dist/{chunk-NLETDGQ5.js → chunk-ZQAY3FE3.js}
CHANGED
@@ -3,6 +3,18 @@ function buildContextBundle(query, hits, maxChars = 12e3) {
   const results = [];
   let totalChars = 0;
   for (const hit of hits) {
+    const last = results.at(-1);
+    if (last && last.rootId === hit.rootId && last.relativePath === hit.relativePath && hit.chunkIndex === last.chunkIndex + 1) {
+      const mergedContent = `${last.content}
+${hit.content}`.trim();
+      const mergedDelta = mergedContent.length - last.content.length;
+      if (totalChars + mergedDelta <= maxChars) {
+        last.content = mergedContent;
+        last.chunkIndex = hit.chunkIndex;
+        totalChars += mergedDelta;
+        continue;
+      }
+    }
     if (results.length > 0 && totalChars + hit.content.length > maxChars) {
       break;
     }
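The `buildContextBundle` change merges consecutive chunks from the same file (same root, same path, `chunkIndex` N followed by N+1) into one entry, so the context bundle reads as continuous text. A minimal standalone sketch of that merge, ignoring the `maxChars` budget the real code also enforces; `mergeAdjacentChunks` is illustrative, not a package export:

```javascript
// Join consecutive chunks of the same file into one entry, as the new
// buildContextBundle does (simplified: no character budget).
function mergeAdjacentChunks(hits) {
  const results = [];
  for (const hit of hits) {
    const last = results.at(-1);
    if (last && last.rootId === hit.rootId &&
        last.relativePath === hit.relativePath &&
        hit.chunkIndex === last.chunkIndex + 1) {
      last.content = `${last.content}\n${hit.content}`.trim();
      last.chunkIndex = hit.chunkIndex; // merged entry now ends at the later chunk
      continue;
    }
    results.push({ ...hit });
  }
  return results;
}

const merged = mergeAdjacentChunks([
  { rootId: "r1", relativePath: "a.md", chunkIndex: 0, content: "first" },
  { rootId: "r1", relativePath: "a.md", chunkIndex: 1, content: "second" },
  { rootId: "r1", relativePath: "b.md", chunkIndex: 0, content: "other" }
]);
console.log(merged.length); // 2
```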
@@ -25,9 +37,11 @@ function buildContextBundle(query, hits, maxChars = 12e3) {
 }
 
 // src/config.ts
+import fsSync from "fs";
 import fs from "fs/promises";
 import os from "os";
 import path2 from "path";
+import dotenv from "dotenv";
 
 // src/constants.ts
 var CONFIG_DIR_NAME = ".ownsearch";
@@ -43,26 +57,32 @@ var DEFAULT_CHUNK_OVERLAP = 200;
 var DEFAULT_MAX_FILE_BYTES = 50 * 1024 * 1024;
 var SUPPORTED_TEXT_EXTENSIONS = /* @__PURE__ */ new Set([
   ".c",
+  ".conf",
   ".cpp",
   ".cs",
   ".css",
   ".csv",
+  ".docx",
   ".env",
   ".go",
   ".h",
   ".hpp",
   ".html",
+  ".ini",
   ".java",
   ".js",
   ".json",
   ".jsx",
+  ".log",
   ".md",
   ".mdx",
   ".mjs",
   ".pdf",
   ".ps1",
+  ".properties",
   ".py",
   ".rb",
+  ".rtf",
   ".rs",
   ".sh",
   ".sql",
@@ -74,6 +94,11 @@ var SUPPORTED_TEXT_EXTENSIONS = /* @__PURE__ */ new Set([
   ".yaml",
   ".yml"
 ]);
+var EXTRACTED_DOCUMENT_EXTENSIONS = /* @__PURE__ */ new Set([
+  ".pdf",
+  ".docx",
+  ".rtf"
+]);
 var IGNORED_DIRECTORIES = /* @__PURE__ */ new Set([
   ".git",
   ".hg",
@@ -141,9 +166,34 @@ function getConfigDir() {
 function getConfigPath() {
   return path2.join(getConfigDir(), CONFIG_FILE_NAME);
 }
+function getEnvPath() {
+  return path2.join(getConfigDir(), ".env");
+}
+function getCwdEnvPath() {
+  return path2.resolve(process.cwd(), ".env");
+}
 async function ensureConfigDir() {
   await fs.mkdir(getConfigDir(), { recursive: true });
 }
+function loadOwnSearchEnv() {
+  for (const envPath of [getCwdEnvPath(), getEnvPath()]) {
+    if (!fsSync.existsSync(envPath)) {
+      continue;
+    }
+    const parsed = dotenv.parse(fsSync.readFileSync(envPath, "utf8"));
+    for (const [key, value] of Object.entries(parsed)) {
+      if (process.env[key] === void 0) {
+        process.env[key] = value;
+      }
+    }
+  }
+}
+function readEnvFile(envPath) {
+  if (!fsSync.existsSync(envPath)) {
+    return {};
+  }
+  return dotenv.parse(fsSync.readFileSync(envPath, "utf8"));
+}
 async function loadConfig() {
   await ensureConfigDir();
   const configPath = getConfigPath();
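`loadOwnSearchEnv` gives the working-directory `.env` precedence over `~/.ownsearch/.env`, and never overrides a variable already present in the process environment. A minimal sketch of that precedence on plain objects; `resolveEnv` and the `QDRANT_URL`/`OTHER` variable names are illustrative (only `GEMINI_API_KEY` is a real ownsearch variable):

```javascript
// Earlier sources win; an existing environment value is never overwritten.
// This mirrors loadOwnSearchEnv's behavior without touching real files.
function resolveEnv(processEnv, envFilesInOrder) {
  const resolved = { ...processEnv };
  for (const fileVars of envFilesInOrder) { // e.g. [cwd .env, ~/.ownsearch/.env]
    for (const [key, value] of Object.entries(fileVars)) {
      if (resolved[key] === undefined) {
        resolved[key] = value;
      }
    }
  }
  return resolved;
}

const env = resolveEnv(
  { GEMINI_API_KEY: "from-shell" },
  [
    { GEMINI_API_KEY: "from-cwd", QDRANT_URL: "http://localhost:6333" },
    { QDRANT_URL: "ignored", OTHER: "from-home" }
  ]
);
console.log(env.GEMINI_API_KEY); // from-shell
console.log(env.QDRANT_URL);     // http://localhost:6333
```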
@@ -171,6 +221,11 @@ async function saveConfig(config) {
   await fs.writeFile(getConfigPath(), `${JSON.stringify(config, null, 2)}
 `, "utf8");
 }
+async function saveGeminiApiKey(apiKey) {
+  await ensureConfigDir();
+  await fs.writeFile(getEnvPath(), `GEMINI_API_KEY=${apiKey.trim()}
+`, "utf8");
+}
 function createRootDefinition(rootPath, name) {
   const now = (/* @__PURE__ */ new Date()).toISOString();
   const rootName = name?.trim() || path2.basename(rootPath);
@@ -270,6 +325,112 @@ async function embedQuery(query) {
 
 // src/qdrant.ts
 import { QdrantClient } from "@qdrant/js-client-rest";
+
+// src/rerank.ts
+function normalize(input) {
+  return input.toLowerCase().replace(/[^a-z0-9\s]/g, " ").replace(/\s+/g, " ").trim();
+}
+function tokenize(input) {
+  return normalize(input).split(" ").filter((token) => token.length > 1);
+}
+function unique(items) {
+  return Array.from(new Set(items));
+}
+function lexicalOverlap(queryTokens, haystack) {
+  if (queryTokens.length === 0) {
+    return 0;
+  }
+  const haystackTokens = new Set(tokenize(haystack));
+  let matches = 0;
+  for (const token of queryTokens) {
+    if (haystackTokens.has(token)) {
+      matches += 1;
+    }
+  }
+  return matches / queryTokens.length;
+}
+function nearDuplicate(a, b) {
+  const aTokens = unique(tokenize(a.content)).slice(0, 48);
+  const bTokens = unique(tokenize(b.content)).slice(0, 48);
+  if (aTokens.length === 0 || bTokens.length === 0) {
+    return false;
+  }
+  const bSet = new Set(bTokens);
+  let intersection = 0;
+  for (const token of aTokens) {
+    if (bSet.has(token)) {
+      intersection += 1;
+    }
+  }
+  const union = (/* @__PURE__ */ new Set([...aTokens, ...bTokens])).size;
+  return union > 0 && intersection / union >= 0.8;
+}
+function contentSignature(content) {
+  return tokenize(content).slice(0, 24).join(" ");
+}
+function rerankAndDeduplicate(query, hits, limit) {
+  const normalizedQuery = normalize(query);
+  const queryTokens = unique(tokenize(query));
+  const ranked = hits.map((hit) => {
+    const overlap = lexicalOverlap(queryTokens, hit.content);
+    const pathOverlap = lexicalOverlap(queryTokens, `${hit.relativePath} ${hit.rootName}`);
+    const exactPhrase = normalizedQuery.length > 0 && normalize(hit.content).includes(normalizedQuery) ? 0.2 : 0;
+    const score = hit.score + overlap * 0.22 + pathOverlap * 0.08 + exactPhrase;
+    return { ...hit, rerankScore: score };
+  }).sort((left, right) => right.rerankScore - left.rerankScore);
+  const selected = [];
+  const signatureSet = /* @__PURE__ */ new Set();
+  const perFileCounts = /* @__PURE__ */ new Map();
+  const preferredPerFileLimit = 2;
+  function canTake(hit, enforcePerFileLimit) {
+    const signature = contentSignature(hit.content);
+    if (signature && signatureSet.has(signature)) {
+      return false;
+    }
+    if (selected.some((existing) => nearDuplicate(existing, hit))) {
+      return false;
+    }
+    if (enforcePerFileLimit) {
+      const current = perFileCounts.get(hit.relativePath) ?? 0;
+      if (current >= preferredPerFileLimit) {
+        return false;
+      }
+    }
+    return true;
+  }
+  function add(hit) {
+    selected.push(hit);
+    const signature = contentSignature(hit.content);
+    if (signature) {
+      signatureSet.add(signature);
+    }
+    perFileCounts.set(hit.relativePath, (perFileCounts.get(hit.relativePath) ?? 0) + 1);
+  }
+  for (const hit of ranked) {
+    if (selected.length >= limit) {
+      break;
+    }
+    if (canTake(hit, true)) {
+      add(hit);
+    }
+  }
+  if (selected.length < limit) {
+    for (const hit of ranked) {
+      if (selected.length >= limit) {
+        break;
+      }
+      if (selected.some((existing) => existing.id === hit.id)) {
+        continue;
+      }
+      if (canTake(hit, false)) {
+        add(hit);
+      }
+    }
+  }
+  return selected.map(({ rerankScore: _rerankScore, ...hit }) => hit);
+}
+
+// src/qdrant.ts
 var OwnSearchStore = class {
   constructor(client2, collectionName, vectorSize) {
     this.client = client2;
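The `nearDuplicate` check above is Jaccard similarity over token sets against a 0.8 threshold (capped at 48 unique tokens per side in the dist code). A simplified standalone sketch without the cap:

```javascript
// Jaccard similarity (|intersection| / |union|) over normalized token sets,
// simplified from the rerank step: no 48-token cap, same 0.8 threshold.
function jaccard(aText, bText) {
  const tokens = (s) => new Set(
    s.toLowerCase().replace(/[^a-z0-9\s]/g, " ").split(/\s+/).filter((t) => t.length > 1)
  );
  const a = tokens(aText);
  const b = tokens(bText);
  let intersection = 0;
  for (const t of a) if (b.has(t)) intersection += 1;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : intersection / union;
}

const isNearDuplicate = (a, b) => jaccard(a, b) >= 0.8;

console.log(isNearDuplicate("local folders are indexed", "local folders are indexed")); // true
console.log(isNearDuplicate("local folders are indexed", "qdrant runs in docker"));     // false
```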
@@ -427,7 +588,7 @@ var OwnSearchStore = class {
     }
     const results = await this.client.search(this.collectionName, {
       vector,
-      limit: filters.pathSubstring ?
+      limit: Math.max(filters.pathSubstring ? limit * 8 : limit * 6, 24),
       with_payload: true,
       filter: must.length ? { must } : void 0
     });
@@ -441,11 +602,8 @@ var OwnSearchStore = class {
       chunkIndex: Number(result.payload?.chunk_index ?? 0),
       content: String(result.payload?.content ?? "")
     }));
-
-
-    }
-    const needle = filters.pathSubstring.toLowerCase();
-    return hits.filter((hit) => hit.relativePath.toLowerCase().includes(needle)).slice(0, limit);
+    const filtered = !filters.pathSubstring ? hits : hits.filter((hit) => hit.relativePath.toLowerCase().includes(filters.pathSubstring.toLowerCase()));
+    return rerankAndDeduplicate(filters.queryText ?? "", filtered, limit);
   }
   async getChunks(ids) {
     if (ids.length === 0) {
@@ -500,9 +658,17 @@ function chunkText(content, chunkSize, chunkOverlap) {
   while (start < normalized.length) {
     let end = Math.min(start + chunkSize, normalized.length);
     if (end < normalized.length) {
-      const
-
-
+      const minimumBoundary = start + Math.floor(chunkSize * 0.5);
+      const newlineBoundary = normalized.lastIndexOf("\n", end);
+      const whitespaceBoundary = normalized.lastIndexOf(" ", end);
+      const punctuationBoundary = Math.max(
+        normalized.lastIndexOf(". ", end),
+        normalized.lastIndexOf("? ", end),
+        normalized.lastIndexOf("! ", end)
+      );
+      const boundary = Math.max(newlineBoundary, whitespaceBoundary, punctuationBoundary);
+      if (boundary > minimumBoundary) {
+        end = boundary;
       }
     }
     const chunk = normalized.slice(start, end).trim();
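The new `chunkText` boundary logic prefers to end a chunk at a newline, space, or sentence punctuation, but only when that boundary lies past half the target chunk size, so chunks never shrink below 50% of `chunkSize`. A standalone sketch of the same selection; `pickChunkEnd` is illustrative, not a package export:

```javascript
// Choose where a chunk should end: back off to the nearest natural boundary
// (newline, space, or sentence punctuation) at or before the size cap, but
// never before 50% of chunkSize, so chunks stay reasonably large.
function pickChunkEnd(text, start, chunkSize) {
  let end = Math.min(start + chunkSize, text.length);
  if (end < text.length) {
    const minimumBoundary = start + Math.floor(chunkSize * 0.5);
    const boundary = Math.max(
      text.lastIndexOf("\n", end),
      text.lastIndexOf(" ", end),
      text.lastIndexOf(". ", end),
      text.lastIndexOf("? ", end),
      text.lastIndexOf("! ", end)
    );
    if (boundary > minimumBoundary) {
      end = boundary; // cut at a natural boundary instead of mid-word
    }
  }
  return end;
}

const sample = "first sentence here. second sentence continues well past the cap";
const end = pickChunkEnd(sample, 0, 30);
console.log(sample.slice(0, end)); // "first sentence here. second"
```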
@@ -520,10 +686,14 @@ function chunkText(content, chunkSize, chunkOverlap) {
 // src/files.ts
 import fs2 from "fs/promises";
 import path3 from "path";
+import mammoth from "mammoth";
 import { PDFParse } from "pdf-parse";
 function sanitizeExtractedText(input) {
   return input.replace(/\u0000/g, "").replace(/[\u0001-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, " ").replace(/\r\n/g, "\n");
 }
+function extractRtfText(input) {
+  return input.replace(/\\par[d]?/g, "\n").replace(/\\tab/g, " ").replace(/\\'[0-9a-fA-F]{2}/g, " ").replace(/\\[a-zA-Z]+-?\d* ?/g, "").replace(/[{}]/g, " ");
+}
 async function collectTextFiles(rootPath, maxFileBytes) {
   const files = [];
   const absoluteRoot = path3.resolve(rootPath);
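The `extractRtfText` added above is a regex stripper rather than a full RTF parser: `\par`/`\pard` become newlines, `\tab` becomes a space, hex escapes and remaining control words are dropped, and braces are removed. The same chain applied to a tiny sample:

```javascript
// Standalone copy of the regex chain from the diff, for illustration.
// Order matters: \par and \tab are translated before the generic
// control-word pattern deletes everything else.
function extractRtfText(input) {
  return input
    .replace(/\\par[d]?/g, "\n")            // paragraph breaks -> newlines
    .replace(/\\tab/g, " ")                 // tabs -> spaces
    .replace(/\\'[0-9a-fA-F]{2}/g, " ")     // hex escapes dropped
    .replace(/\\[a-zA-Z]+-?\d* ?/g, "")     // other control words dropped
    .replace(/[{}]/g, " ");                 // group braces removed
}

const rtf = "{\\rtf1\\ansi Hello\\par World}";
console.log(extractRtfText(rtf).trim());
```

Because hex escapes and control words are simply dropped, non-ASCII characters encoded as `\'xx` are lost, which is consistent with the package's text-first scope.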
@@ -543,6 +713,11 @@
     await parser.destroy();
   }
 }
+async function parseDocx(filePath) {
+  const buffer = await fs2.readFile(filePath);
+  const result = await mammoth.extractRawText({ buffer });
+  return result.value ?? "";
+}
 async function walk(currentPath) {
   const entries = await fs2.readdir(currentPath, { withFileTypes: true });
   for (const entry of entries) {
@@ -565,7 +740,7 @@ async function collectTextFiles(rootPath, maxFileBytes) {
       continue;
     }
     const stats = await fs2.stat(nextPath);
-    if (stats.size > maxFileBytes) {
+    if (EXTRACTED_DOCUMENT_EXTENSIONS.has(extension) && stats.size > maxFileBytes) {
       debugLog("skip-size", nextPath, stats.size);
       continue;
     }
@@ -573,6 +748,10 @@ async function collectTextFiles(rootPath, maxFileBytes) {
     try {
       if (extension === ".pdf") {
         content = await parsePdf(nextPath);
+      } else if (extension === ".docx") {
+        content = await parseDocx(nextPath);
+      } else if (extension === ".rtf") {
+        content = extractRtfText(await fs2.readFile(nextPath, "utf8"));
       } else {
         content = await fs2.readFile(nextPath, "utf8");
       }
@@ -715,7 +894,12 @@ async function indexPath(rootPath, options = {}) {
 export {
   buildContextBundle,
   getConfigPath,
+  getEnvPath,
+  getCwdEnvPath,
+  loadOwnSearchEnv,
+  readEnvFile,
   loadConfig,
+  saveGeminiApiKey,
   deleteRootDefinition,
   findRoot,
   listRoots,
package/dist/cli.js
CHANGED
@@ -7,15 +7,21 @@ import {
   embedQuery,
   findRoot,
   getConfigPath,
+  getCwdEnvPath,
+  getEnvPath,
   indexPath,
   listRoots,
-  loadConfig
-
+  loadConfig,
+  loadOwnSearchEnv,
+  readEnvFile,
+  saveGeminiApiKey
+} from "./chunk-ZQAY3FE3.js";
 
 // src/cli.ts
-import "
+import fs from "fs/promises";
 import path from "path";
 import { spawn } from "child_process";
+import readline from "readline/promises";
 import { fileURLToPath } from "url";
 import { Command } from "commander";
 
@@ -61,23 +67,294 @@ async function ensureQdrantDocker() {
 }
 
 // src/cli.ts
+loadOwnSearchEnv();
 var program = new Command();
+var PACKAGE_NAME = "ownsearch";
+var GEMINI_API_KEY_URL = "https://aistudio.google.com/apikey";
+var BUNDLED_SKILL_NAME = "ownsearch-rag-search";
+var SUPPORTED_AGENTS = [
+  "codex",
+  "claude-desktop",
+  "continue",
+  "copilot-cli",
+  "cursor",
+  "github-copilot",
+  "vscode",
+  "windsurf"
+];
 function requireGeminiKey() {
   if (!process.env.GEMINI_API_KEY) {
     throw new OwnSearchError("Set GEMINI_API_KEY before running OwnSearch.");
   }
 }
+function buildAgentConfig(agent) {
+  const stdioConfig = {
+    command: "npx",
+    args: ["-y", PACKAGE_NAME, "serve-mcp"],
+    env: {
+      GEMINI_API_KEY: "${GEMINI_API_KEY}"
+    }
+  };
+  switch (agent) {
+    case "codex":
+      return {
+        platform: "codex",
+        configScope: "Add this server entry to your Codex MCP configuration.",
+        config: { ownsearch: stdioConfig }
+      };
+    case "claude-desktop":
+      return {
+        platform: "claude-desktop",
+        installMethod: "Desktop Extension (.mcpb)",
+        note: "Current Claude Desktop documentation recommends local MCP installation through Desktop Extensions instead of manual JSON config files.",
+        nextStep: "OwnSearch does not yet ship an .mcpb bundle. Use Cursor, VS Code, Windsurf, Continue, or GitHub Copilot with the snippets below for now."
+      };
+    case "continue":
+      return {
+        platform: "continue",
+        configPath: ".continue/mcpServers/ownsearch.json",
+        note: "Continue can ingest JSON MCP configs directly.",
+        config: { ownsearch: stdioConfig }
+      };
+    case "copilot-cli":
+      return {
+        platform: "copilot-cli",
+        configPath: "~/.copilot/mcp-config.json",
+        config: {
+          mcpServers: {
+            ownsearch: {
+              type: "local",
+              command: stdioConfig.command,
+              args: stdioConfig.args,
+              env: stdioConfig.env,
+              tools: ["*"]
+            }
+          }
+        }
+      };
+    case "cursor":
+      return {
+        platform: "cursor",
+        configPath: "~/.cursor/mcp.json or .cursor/mcp.json",
+        config: { ownsearch: stdioConfig }
+      };
+    case "github-copilot":
+    case "vscode":
+      return {
+        platform: agent,
+        configPath: ".vscode/mcp.json or VS Code user profile mcp.json",
+        config: {
+          servers: {
+            ownsearch: stdioConfig
+          }
+        }
+      };
+    case "windsurf":
+      return {
+        platform: "windsurf",
+        configPath: "~/.codeium/mcp_config.json",
+        config: {
+          mcpServers: {
+            ownsearch: stdioConfig
+          }
+        }
+      };
+    default:
+      throw new OwnSearchError(`Unsupported agent: ${agent}`);
+  }
+}
+async function readBundledSkill(skillName) {
+  const currentFilePath = fileURLToPath(import.meta.url);
+  const packageRoot = path.resolve(path.dirname(currentFilePath), "..");
+  const skillPath = path.join(packageRoot, "skills", skillName, "SKILL.md");
+  return fs.readFile(skillPath, "utf8");
+}
+function getDoctorVerdict(input) {
+  const nextSteps = [];
+  if (!input.geminiApiKeyPresent) {
+    nextSteps.push("Run `ownsearch setup` and save a Gemini API key.");
+  }
+  if (!input.qdrantReachable) {
+    nextSteps.push("Run `ownsearch setup` to start or reconnect to the local Qdrant container.");
+  }
+  if (input.geminiApiKeyPresent && input.qdrantReachable && input.rootCount === 0) {
+    nextSteps.push("Run `ownsearch index C:\\path\\to\\folder --name my-folder` to add your first indexed root.");
+  }
+  if (nextSteps.length === 0) {
+    nextSteps.push("Run `ownsearch index C:\\path\\to\\folder --name my-folder` to add more content, or `ownsearch serve-mcp` to connect an agent.");
+    return {
+      status: "ready",
+      summary: input.rootCount > 0 ? "OwnSearch is ready for indexing, search, and MCP agent use." : "OwnSearch is ready. Qdrant and Gemini are configured.",
+      nextSteps
+    };
+  }
+  return {
+    status: "action_required",
+    summary: "OwnSearch is not fully ready yet.",
|
|
194
|
+
nextSteps
|
|
195
|
+
};
|
|
196
|
+
}
|
|
197
|
+
async function promptForGeminiKey() {
|
|
198
|
+
if (!process.stdin.isTTY || !process.stdout.isTTY) {
|
|
199
|
+
return false;
|
|
200
|
+
}
|
|
201
|
+
const rl = readline.createInterface({
|
|
202
|
+
input: process.stdin,
|
|
203
|
+
output: process.stdout
|
|
204
|
+
});
|
|
205
|
+
try {
|
|
206
|
+
console.log(`Generate a Gemini API key here: ${GEMINI_API_KEY_URL}`);
|
|
207
|
+
console.log(`OwnSearch will save it to ${getEnvPath()}`);
|
|
208
|
+
for (; ; ) {
|
|
209
|
+
const apiKey = (await rl.question("Paste GEMINI_API_KEY and press Enter (Ctrl+C to cancel): ")).trim();
|
|
210
|
+
if (!apiKey) {
|
|
211
|
+
console.log("GEMINI_API_KEY is required for indexing and search.");
|
|
212
|
+
continue;
|
|
213
|
+
}
|
|
214
|
+
await saveGeminiApiKey(apiKey);
|
|
215
|
+
process.env.GEMINI_API_KEY = apiKey;
|
|
216
|
+
return true;
|
|
217
|
+
}
|
|
218
|
+
} finally {
|
|
219
|
+
rl.close();
|
|
220
|
+
}
|
|
221
|
+
}
|
|
222
|
+
function getGeminiApiKeySource() {
|
|
223
|
+
if (readEnvFile(getEnvPath()).GEMINI_API_KEY) {
|
|
224
|
+
return "ownsearch-env";
|
|
225
|
+
}
|
|
226
|
+
if (readEnvFile(getCwdEnvPath()).GEMINI_API_KEY) {
|
|
227
|
+
return "cwd-env";
|
|
228
|
+
}
|
|
229
|
+
if (process.env.GEMINI_API_KEY) {
|
|
230
|
+
return "process-env";
|
|
231
|
+
}
|
|
232
|
+
return "missing";
|
|
233
|
+
}
|
|
234
|
+
async function ensureManagedGeminiKey() {
|
|
235
|
+
const source = getGeminiApiKeySource();
|
|
236
|
+
if (source === "ownsearch-env") {
|
|
237
|
+
return { present: true, source, savedToManagedEnv: false };
|
|
238
|
+
}
|
|
239
|
+
if (process.env.GEMINI_API_KEY) {
|
|
240
|
+
await saveGeminiApiKey(process.env.GEMINI_API_KEY);
|
|
241
|
+
return { present: true, source, savedToManagedEnv: true };
|
|
242
|
+
}
|
|
243
|
+
const prompted = await promptForGeminiKey();
|
|
244
|
+
return {
|
|
245
|
+
present: prompted,
|
|
246
|
+
source: prompted ? "prompt" : "missing",
|
|
247
|
+
savedToManagedEnv: prompted
|
|
248
|
+
};
|
|
249
|
+
}
|
|
250
|
+
function printSetupNextSteps() {
|
|
251
|
+
console.log("");
|
|
252
|
+
console.log("Next commands:");
|
|
253
|
+
console.log(" CLI indexing:");
|
|
254
|
+
console.log(" ownsearch index C:\\path\\to\\folder --name my-folder");
|
|
255
|
+
console.log(" CLI search:");
|
|
256
|
+
console.log(' ownsearch search "your question here" --limit 5');
|
|
257
|
+
console.log(" CLI grounded context:");
|
|
258
|
+
console.log(' ownsearch search-context "your question here" --limit 8 --max-chars 12000');
|
|
259
|
+
console.log(" MCP server for agents:");
|
|
260
|
+
console.log(" ownsearch serve-mcp");
|
|
261
|
+
console.log(" Agent config snippets:");
|
|
262
|
+
console.log(" ownsearch print-agent-config codex");
|
|
263
|
+
console.log(" ownsearch print-agent-config claude-desktop");
|
|
264
|
+
console.log(" ownsearch print-agent-config cursor");
|
|
265
|
+
console.log(" ownsearch print-agent-config vscode");
|
|
266
|
+
console.log(" ownsearch print-agent-config github-copilot");
|
|
267
|
+
console.log(" ownsearch print-agent-config copilot-cli");
|
|
268
|
+
console.log(" ownsearch print-agent-config windsurf");
|
|
269
|
+
console.log(" ownsearch print-agent-config continue");
|
|
270
|
+
console.log(" Bundled retrieval skill:");
|
|
271
|
+
console.log(` ownsearch print-skill ${BUNDLED_SKILL_NAME}`);
|
|
272
|
+
}
|
|
273
|
+
async function promptForAgentChoice() {
|
|
274
|
+
if (!process.stdin.isTTY || !process.stdout.isTTY) {
|
|
275
|
+
return void 0;
|
|
276
|
+
}
|
|
277
|
+
const rl = readline.createInterface({
|
|
278
|
+
input: process.stdin,
|
|
279
|
+
output: process.stdout
|
|
280
|
+
});
|
|
281
|
+
try {
|
|
282
|
+
console.log("");
|
|
283
|
+
console.log("Connect to an agent now?");
|
|
284
|
+
console.log(" 1. codex");
|
|
285
|
+
console.log(" 2. claude-desktop");
|
|
286
|
+
console.log(" 3. cursor");
|
|
287
|
+
console.log(" 4. vscode");
|
|
288
|
+
console.log(" 5. windsurf");
|
|
289
|
+
console.log(" 6. copilot-cli");
|
|
290
|
+
console.log(" 7. continue");
|
|
291
|
+
console.log(" 8. skip");
|
|
292
|
+
for (; ; ) {
|
|
293
|
+
const answer = (await rl.question("Select 1-8: ")).trim().toLowerCase();
|
|
294
|
+
switch (answer) {
|
|
295
|
+
case "1":
|
|
296
|
+
case "codex":
|
|
297
|
+
return "codex";
|
|
298
|
+
case "2":
|
|
299
|
+
case "claude-desktop":
|
|
300
|
+
case "claude":
|
|
301
|
+
return "claude-desktop";
|
|
302
|
+
case "3":
|
|
303
|
+
case "cursor":
|
|
304
|
+
return "cursor";
|
|
305
|
+
case "4":
|
|
306
|
+
case "vscode":
|
|
307
|
+
case "github-copilot":
|
|
308
|
+
return "vscode";
|
|
309
|
+
case "5":
|
|
310
|
+
case "windsurf":
|
|
311
|
+
return "windsurf";
|
|
312
|
+
case "6":
|
|
313
|
+
case "copilot-cli":
|
|
314
|
+
case "copilot":
|
|
315
|
+
return "copilot-cli";
|
|
316
|
+
case "7":
|
|
317
|
+
case "continue":
|
|
318
|
+
return "continue";
|
|
319
|
+
case "8":
|
|
320
|
+
case "skip":
|
|
321
|
+
case "":
|
|
322
|
+
return void 0;
|
|
323
|
+
default:
|
|
324
|
+
console.log("Enter 1, 2, 3, 4, 5, 6, 7, or 8.");
|
|
325
|
+
}
|
|
326
|
+
}
|
|
327
|
+
} finally {
|
|
328
|
+
rl.close();
|
|
329
|
+
}
|
|
330
|
+
}
|
|
331
|
+
function printAgentConfigSnippet(agent) {
|
|
332
|
+
console.log("");
|
|
333
|
+
console.log(`MCP config for ${agent}:`);
|
|
334
|
+
console.log(JSON.stringify(buildAgentConfig(agent), null, 2));
|
|
335
|
+
}
|
|
70
336
|
program.name("ownsearch").description("Gemini-powered local search MCP server backed by Qdrant.").version("0.1.0");
|
|
71
337
|
program.command("setup").description("Create config and start a local Qdrant Docker container.").action(async () => {
|
|
72
338
|
const config = await loadConfig();
|
|
73
339
|
const result = await ensureQdrantDocker();
|
|
340
|
+
const gemini = await ensureManagedGeminiKey();
|
|
74
341
|
console.log(JSON.stringify({
|
|
75
342
|
configPath: getConfigPath(),
|
|
343
|
+
envPath: getEnvPath(),
|
|
76
344
|
qdrantUrl: config.qdrantUrl,
|
|
77
|
-
qdrantStarted: result.started
|
|
345
|
+
qdrantStarted: result.started,
|
|
346
|
+
geminiApiKeyPresent: gemini.present,
|
|
347
|
+
geminiApiKeySource: gemini.source,
|
|
348
|
+
geminiApiKeySavedToManagedEnv: gemini.savedToManagedEnv
|
|
78
349
|
}, null, 2));
|
|
79
|
-
if (!
|
|
80
|
-
console.log(
|
|
350
|
+
if (!gemini.present) {
|
|
351
|
+
console.log(`GEMINI_API_KEY is not set. Re-run setup or add it to ${getEnvPath()} before indexing or search.`);
|
|
352
|
+
return;
|
|
353
|
+
}
|
|
354
|
+
printSetupNextSteps();
|
|
355
|
+
const agent = await promptForAgentChoice();
|
|
356
|
+
if (agent) {
|
|
357
|
+
printAgentConfigSnippet(agent);
|
|
81
358
|
}
|
|
82
359
|
});
|
|
83
360
|
program.command("index").argument("<folder>", "Folder path to index").option("-n, --name <name>", "Display name for the indexed root").option("--max-file-bytes <n>", "Override the file size limit for this run", (value) => Number(value)).description("Index a local folder into Qdrant using Gemini embeddings.").action(async (folder, options) => {
|
|
@@ -96,6 +373,7 @@ program.command("search").argument("<query>", "Natural language query").option("
|
|
|
96
373
|
const hits = await store.search(
|
|
97
374
|
vector,
|
|
98
375
|
{
|
|
376
|
+
queryText: query,
|
|
99
377
|
rootIds: options.rootId,
|
|
100
378
|
pathSubstring: options.path
|
|
101
379
|
},
|
|
@@ -112,6 +390,7 @@ program.command("search-context").argument("<query>", "Natural language query").
|
|
|
112
390
|
const hits = await store.search(
|
|
113
391
|
vector,
|
|
114
392
|
{
|
|
393
|
+
queryText: query,
|
|
115
394
|
rootIds: options.rootId,
|
|
116
395
|
pathSubstring: options.path
|
|
117
396
|
},
|
|
@@ -148,9 +427,17 @@ program.command("doctor").description("Check local prerequisites and package con
|
|
|
148
427
|
} catch (error) {
|
|
149
428
|
qdrantReachable = false;
|
|
150
429
|
}
|
|
430
|
+
const verdict = getDoctorVerdict({
|
|
431
|
+
geminiApiKeyPresent: Boolean(process.env.GEMINI_API_KEY),
|
|
432
|
+
qdrantReachable,
|
|
433
|
+
rootCount: roots.length
|
|
434
|
+
});
|
|
151
435
|
console.log(JSON.stringify({
|
|
436
|
+
verdict,
|
|
152
437
|
configPath: getConfigPath(),
|
|
438
|
+
envPath: getEnvPath(),
|
|
153
439
|
geminiApiKeyPresent: Boolean(process.env.GEMINI_API_KEY),
|
|
440
|
+
geminiApiKeySource: getGeminiApiKeySource(),
|
|
154
441
|
qdrantUrl: config.qdrantUrl,
|
|
155
442
|
qdrantReachable,
|
|
156
443
|
collection: config.qdrantCollection,
|
|
@@ -158,6 +445,7 @@ program.command("doctor").description("Check local prerequisites and package con
|
|
|
158
445
|
vectorSize: config.vectorSize,
|
|
159
446
|
chunkSize: config.chunkSize,
|
|
160
447
|
chunkOverlap: config.chunkOverlap,
|
|
448
|
+
maxExtractedDocumentBytes: config.maxFileBytes,
|
|
161
449
|
maxFileBytes: config.maxFileBytes,
|
|
162
450
|
rootCount: roots.length
|
|
163
451
|
}, null, 2));
|
|
@@ -173,23 +461,16 @@ program.command("serve-mcp").description("Start the stdio MCP server.").action(a
|
|
|
173
461
|
process.exitCode = code ?? 0;
|
|
174
462
|
});
|
|
175
463
|
});
|
|
176
|
-
program.command("print-agent-config").argument("<agent>", "
|
|
177
|
-
|
|
178
|
-
|
|
179
|
-
|
|
180
|
-
env: {
|
|
181
|
-
GEMINI_API_KEY: "${GEMINI_API_KEY}"
|
|
182
|
-
}
|
|
183
|
-
};
|
|
184
|
-
switch (agent) {
|
|
185
|
-
case "codex":
|
|
186
|
-
case "claude-desktop":
|
|
187
|
-
case "cursor":
|
|
188
|
-
console.log(JSON.stringify({ ownsearch: config }, null, 2));
|
|
189
|
-
return;
|
|
190
|
-
default:
|
|
191
|
-
throw new OwnSearchError(`Unsupported agent: ${agent}`);
|
|
464
|
+
program.command("print-agent-config").argument("<agent>", SUPPORTED_AGENTS.join(" | ")).description("Print an MCP config snippet for a supported agent.").action(async (agent) => {
|
|
465
|
+
if (SUPPORTED_AGENTS.includes(agent)) {
|
|
466
|
+
console.log(JSON.stringify(buildAgentConfig(agent), null, 2));
|
|
467
|
+
return;
|
|
192
468
|
}
|
|
469
|
+
throw new OwnSearchError(`Unsupported agent: ${agent}`);
|
|
470
|
+
});
|
|
471
|
+
program.command("print-skill").argument("[skill]", `Bundled skill name (default ${BUNDLED_SKILL_NAME})`).description("Print a bundled OwnSearch skill that helps agents query retrieval tools more effectively.").action(async (skill) => {
|
|
472
|
+
const skillName = skill?.trim() || BUNDLED_SKILL_NAME;
|
|
473
|
+
console.log(await readBundledSkill(skillName));
|
|
193
474
|
});
|
|
194
475
|
program.parseAsync(process.argv).catch((error) => {
|
|
195
476
|
const message = error instanceof Error ? error.message : String(error);
|
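The new `buildAgentConfig` switch centralizes the per-agent snippets that `print-agent-config` emits. As a rough sketch of the shape `ownsearch print-agent-config cursor` would print — assuming `PACKAGE_NAME` resolves to `ownsearch` and the stdio command is `npx` (the `stdioConfig` declaration sits above this hunk and is not shown in the diff):

```javascript
// Sketch of the config shape emitted by `ownsearch print-agent-config cursor`.
// Assumptions: PACKAGE_NAME is "ownsearch" and the stdio command is "npx";
// neither is visible in this hunk.
const stdioConfig = {
  command: "npx", // assumed; declared above the visible hunk
  args: ["-y", "ownsearch", "serve-mcp"],
  env: { GEMINI_API_KEY: "${GEMINI_API_KEY}" }
};

// Mirrors the "cursor" branch of buildAgentConfig.
const cursorConfig = {
  platform: "cursor",
  configPath: "~/.cursor/mcp.json or .cursor/mcp.json",
  config: { ownsearch: stdioConfig }
};

console.log(JSON.stringify(cursorConfig, null, 2));
```

The env placeholder `${GEMINI_API_KEY}` is written literally into the snippet; the consuming client is expected to substitute the real key.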
package/dist/mcp/server.js
CHANGED
@@ -7,14 +7,15 @@ import {
   embedQuery,
   findRoot,
   indexPath,
-  loadConfig
-
+  loadConfig,
+  loadOwnSearchEnv
+} from "../chunk-ZQAY3FE3.js";

 // src/mcp/server.ts
-import "dotenv/config";
 import { Server } from "@modelcontextprotocol/sdk/server/index.js";
 import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
 import { CallToolRequestSchema, ListToolsRequestSchema } from "@modelcontextprotocol/sdk/types.js";
+loadOwnSearchEnv();
 function asText(result) {
   return {
     content: [
@@ -164,6 +165,7 @@ server.setRequestHandler(CallToolRequestSchema, async (request) => {
   const hits = await store.search(
     vector,
     {
+      queryText: args.query,
       rootIds: args.rootIds,
       pathSubstring: args.pathSubstring
     },
@@ -184,6 +186,7 @@ server.setRequestHandler(CallToolRequestSchema, async (request) => {
   const hits = await store.search(
     vector,
     {
+      queryText: args.query,
       rootIds: args.rootIds,
       pathSubstring: args.pathSubstring
     },
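The server now calls `loadOwnSearchEnv()` instead of importing `dotenv/config`, so the MCP process resolves `GEMINI_API_KEY` the same way the CLI does. A minimal sketch of that lookup order, mirroring `getGeminiApiKeySource` from `cli.js` (the real `loadOwnSearchEnv` lives in `chunk-ZQAY3FE3.js` and is not shown; the function below is illustrative only):

```javascript
// Illustrative only: managed OwnSearch env file first, then the CWD .env,
// then the inherited process environment — the same precedence that
// getGeminiApiKeySource reports in `ownsearch doctor`.
function resolveGeminiKey(managedEnv, cwdEnv, processEnv) {
  if (managedEnv.GEMINI_API_KEY) return { key: managedEnv.GEMINI_API_KEY, source: "ownsearch-env" };
  if (cwdEnv.GEMINI_API_KEY) return { key: cwdEnv.GEMINI_API_KEY, source: "cwd-env" };
  if (processEnv.GEMINI_API_KEY) return { key: processEnv.GEMINI_API_KEY, source: "process-env" };
  return { key: null, source: "missing" };
}
```

This matters for agents that spawn `serve-mcp` without a `.env` in the working directory: a key saved by `ownsearch setup` is still found.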
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "ownsearch",
-  "version": "0.1.
+  "version": "0.1.4",
   "description": "Text-first local document search MCP server backed by Gemini embeddings and Qdrant.",
   "type": "module",
   "bin": {
@@ -8,13 +8,15 @@
   },
   "files": [
     "dist",
-    "README.md"
+    "README.md",
+    "skills"
   ],
   "scripts": {
     "build": "tsup src/cli.ts src/mcp/server.ts --format esm --dts --clean --external pdf-parse",
     "dev": "tsx src/cli.ts",
     "prepare": "npm run build",
     "prepublishOnly": "npm run typecheck && npm run build",
+    "smoke:text-docs": "tsx scripts/smoke-text-docs.mts",
     "serve-mcp": "tsx src/mcp/server.ts",
     "typecheck": "tsc --noEmit"
   },
@@ -48,12 +50,14 @@
     "@qdrant/js-client-rest": "^1.17.0",
     "commander": "^14.0.1",
     "dotenv": "^17.3.1",
+    "mammoth": "^1.12.0",
     "pdf-parse": "^2.4.5",
     "zod": "^3.25.76"
   },
   "devDependencies": {
     "@types/node": "^24.6.0",
     "@types/pdf-parse": "^1.1.5",
+    "docx": "^9.6.1",
     "tsup": "^8.5.0",
     "tsx": "^4.20.6",
     "typescript": "^5.9.3"
package/skills/ownsearch-rag-search/SKILL.md
ADDED
@@ -0,0 +1,139 @@
+---
+name: ownsearch-rag-search
+description: Improve retrieval quality when an agent uses OwnSearch MCP tools to search local documents. Use for semantic search, grounded answering, query rewriting, multi-query retrieval, exact chunk fetches, duplicate-heavy result sets, or whenever a user request must be translated into stronger OwnSearch search_context/search/get_chunks calls.
+---
+
+# OwnSearch RAG Search
+
+## Overview
+
+Use this skill to bridge the gap between what a user asks and what OwnSearch should retrieve. Treat retrieval as probabilistic: rewrite weak queries, run multiple targeted searches when needed, prefer grounded context over guesswork, and fetch exact chunks before making precise claims.
+
+## Retrieval Workflow
+
+1. Classify the user request.
+2. Generate one to four retrieval queries.
+3. Start with `search_context` for the strongest query.
+4. Expand to additional searches only if evidence is weak, duplicate-heavy, or incomplete.
+5. Use `get_chunks` after `search` when the answer needs exact wording, detailed comparison, or citation-grade grounding.
+6. Answer only from retrieved evidence. Say when the retrieved context is insufficient.
+
+## Query Planning
+
+Generate retrieval queries with these patterns:
+
+- Literal query: preserve the exact noun phrase, error string, rule name, or title the user used.
+- Canonical query: replace vague wording with domain terms likely to appear in documents.
+- Paraphrase query: restate the intent in simpler or more explicit language.
+- Source-biased query: add likely file names, section names, or path hints when the user names a source.
+
+Good examples:
+
+- User ask: "How do concentration checks work?"
+  Queries:
+  - `concentration checks`
+  - `maintain concentration after taking damage`
+  - `constitution saving throw concentration spell`
+
+- User ask: "Where does the repo explain local MCP setup?"
+  Queries:
+  - `local MCP setup`
+  - `Model Context Protocol setup`
+  - `serve-mcp agent config`
+
+- User ask: "What did the contract say about payment timing?"
+  Queries:
+  - `payment timing`
+  - `payment due within`
+  - `invoice due date net terms`
+
+## Tool Use Rules
+
+Use `search_context` when:
+
+- the user wants an answer, summary, explanation, or quick grounding
+- the answer can be supported by a few chunks
+- low latency matters more than exhaustive recall
+
+Use `search` when:
+
+- you want to inspect ranking and source distribution
+- you need to compare multiple candidates
+- you suspect duplicates or poor recall
+
+Use `get_chunks` when:
+
+- exact wording matters
+- the answer depends on adjacent details
+- you need to quote or carefully verify a claim
+- you need to compare similar hits before answering
+
+## Duplicate Handling
+
+Assume top results can still contain semantic duplicates.
+
+When results are duplicate-heavy:
+
+- keep only the strongest chunk per repeated claim unless neighboring chunks add new facts
+- prefer source diversity when multiple files say the same thing
+- if one document clearly appears authoritative, prefer that source but mention corroboration when useful
+- if the top results are all from one file and the answer still seems incomplete, issue a second query with a different phrasing
+
+## Failure Recovery
+
+If the first search is weak:
+
+- shorten the query
+- remove conversational filler
+- swap vague words for canonical terms
+- split compound questions into separate searches
+- add likely section names or file hints
+- search once for the concept and once for the expected answer shape
+
+Examples:
+
+- "Can you tell me what they said about when we can terminate this thing?"
+  Retry with:
+  - `termination`
+  - `termination notice`
+  - `right to terminate`
+  - `termination for cause`
+
+- "Why is my build exploding around env handling?"
+  Retry with:
+  - `environment variables`
+  - `dotenv`
+  - `GEMINI_API_KEY`
+  - `setup envPath`
+
+## Answering Rules
+
+- Do not invent facts that were not retrieved.
+- Prefer citing file paths or chunk provenance when the client supports it.
+- If retrieval is partial, say which part is grounded and which part is uncertain.
+- If evidence conflicts, surface the conflict instead of averaging it away.
+- If nothing relevant is retrieved after a few query variants, say so explicitly.
+
+## Minimal Playbook
+
+For a normal grounded answer:
+
+1. Derive two or three strong retrieval queries.
+2. Call `search_context` with the best query.
+3. If results look sufficient, answer from them.
+4. If results look weak or ambiguous, call `search` with another variant.
+5. Fetch exact chunks for the best IDs before making precise claims.
+
+For a locate-the-source task:
+
+1. Use `search` first.
+2. Inspect which files dominate.
+3. Use `get_chunks` on top hits.
+4. Return the most relevant files and sections, not just a prose answer.
+
+For a compare-or-summarize task:
+
+1. Run one query per subtopic.
+2. Collect grounded chunks from each.
+3. Merge only non-duplicate evidence.
+4. Summarize with explicit source-backed differences.