searchsocket 0.2.0
- package/README.md +848 -0
- package/dist/cli.js +3860 -0
- package/dist/cli.js.map +1 -0
- package/dist/client.cjs +36 -0
- package/dist/client.cjs.map +1 -0
- package/dist/client.d.cts +11 -0
- package/dist/client.d.ts +11 -0
- package/dist/client.js +34 -0
- package/dist/client.js.map +1 -0
- package/dist/index.cjs +20767 -0
- package/dist/index.cjs.map +1 -0
- package/dist/index.d.cts +119 -0
- package/dist/index.d.ts +119 -0
- package/dist/index.js +20742 -0
- package/dist/index.js.map +1 -0
- package/dist/sveltekit.cjs +20578 -0
- package/dist/sveltekit.cjs.map +1 -0
- package/dist/sveltekit.d.cts +37 -0
- package/dist/sveltekit.d.ts +37 -0
- package/dist/sveltekit.js +20563 -0
- package/dist/sveltekit.js.map +1 -0
- package/dist/types-D1K46vwd.d.cts +403 -0
- package/dist/types-D1K46vwd.d.ts +403 -0
- package/package.json +86 -0
package/README.md
ADDED
|
@@ -0,0 +1,848 @@
|
|
|
1
|
+
# SearchSocket
|
|
2
|
+
|
|
3
|
+
Semantic site search and MCP retrieval for SvelteKit content projects.
|
|
4
|
+
|
|
5
|
+
**Requirements**: Node.js >= 20
|
|
6
|
+
|
|
7
|
+
## Features
|
|
8
|
+
|
|
9
|
+
- **Embeddings**: OpenAI `text-embedding-3-small` (configurable)
|
|
10
|
+
- **Vector Backend**: Turso/libSQL with vector search (local file DB for development, remote for production)
|
|
11
|
+
- **Rerank**: Optional Jina reranker for improved relevance
|
|
12
|
+
- **Page Aggregation**: Group results by page with score-weighted chunk decay
|
|
13
|
+
- **Meta Extraction**: Automatically extracts `<meta name="description">` and `<meta name="keywords">` for improved relevance
|
|
14
|
+
- **SvelteKit Integrations**:
|
|
15
|
+
- `searchsocketHandle()` for `POST /api/search` endpoint
|
|
16
|
+
- `searchsocketVitePlugin()` for build-triggered indexing
|
|
17
|
+
- **Client Library**: `createSearchClient()` for browser-side search
|
|
18
|
+
- **MCP Server**: Model Context Protocol tools for search and page retrieval
|
|
19
|
+
- **Git-Tracked Markdown Mirror**: Commit-safe deterministic markdown outputs
|
|
20
|
+
|
|
21
|
+
## Install
|
|
22
|
+
|
|
23
|
+
```bash
|
|
24
|
+
# pnpm
|
|
25
|
+
pnpm add -D searchsocket
|
|
26
|
+
|
|
27
|
+
# npm
|
|
28
|
+
npm install -D searchsocket
|
|
29
|
+
```
|
|
30
|
+
|
|
31
|
+
SearchSocket is typically a dev dependency for CLI indexing. If you use `searchsocketHandle()` at runtime (e.g., in a Node server adapter), add it as a regular dependency instead.
|
|
32
|
+
|
|
33
|
+
## Quickstart
|
|
34
|
+
|
|
35
|
+
### 1. Initialize
|
|
36
|
+
|
|
37
|
+
```bash
|
|
38
|
+
pnpm searchsocket init
|
|
39
|
+
```
|
|
40
|
+
|
|
41
|
+
This creates:
|
|
42
|
+
- `searchsocket.config.ts` — minimal config file
|
|
43
|
+
- `.searchsocket/` — state directory (added to `.gitignore`)
|
|
44
|
+
|
|
45
|
+
### 2. Configure
|
|
46
|
+
|
|
47
|
+
Minimal config (`searchsocket.config.ts`):
|
|
48
|
+
|
|
49
|
+
```ts
|
|
50
|
+
export default {
|
|
51
|
+
embeddings: { apiKeyEnv: "OPENAI_API_KEY" }
|
|
52
|
+
};
|
|
53
|
+
```
|
|
54
|
+
|
|
55
|
+
**That's it!** Turso defaults work out of the box:
|
|
56
|
+
- **Development**: Uses local file DB at `.searchsocket/vectors.db`
|
|
57
|
+
- **Production**: Set `TURSO_DATABASE_URL` and `TURSO_AUTH_TOKEN` to use remote Turso
|
|
58
|
+
|
|
59
|
+
### 3. Add SvelteKit API Hook
|
|
60
|
+
|
|
61
|
+
Create or update `src/hooks.server.ts`:
|
|
62
|
+
|
|
63
|
+
```ts
|
|
64
|
+
import { searchsocketHandle } from "searchsocket/sveltekit";
|
|
65
|
+
|
|
66
|
+
export const handle = searchsocketHandle();
|
|
67
|
+
```
|
|
68
|
+
|
|
69
|
+
This exposes `POST /api/search` with automatic scope resolution.
|
|
70
|
+
|
|
71
|
+
### 4. Set Environment Variables
|
|
72
|
+
|
|
73
|
+
The CLI automatically loads `.env` from the working directory on startup, so your existing `.env` file works out of the box — no wrapper scripts or shell exports needed.
|
|
74
|
+
|
|
75
|
+
Development (`.env`):
|
|
76
|
+
```bash
|
|
77
|
+
OPENAI_API_KEY=sk-...
|
|
78
|
+
```
|
|
79
|
+
|
|
80
|
+
Production (add these for remote Turso):
|
|
81
|
+
```bash
|
|
82
|
+
OPENAI_API_KEY=sk-...
|
|
83
|
+
TURSO_DATABASE_URL=libsql://your-db.turso.io
|
|
84
|
+
TURSO_AUTH_TOKEN=eyJ...
|
|
85
|
+
```
|
|
86
|
+
|
|
87
|
+
### 5. Index Your Content
|
|
88
|
+
|
|
89
|
+
```bash
|
|
90
|
+
pnpm searchsocket index --changed-only
|
|
91
|
+
```
|
|
92
|
+
|
|
93
|
+
SearchSocket auto-detects the source mode based on your config:
|
|
94
|
+
- **`static-output`** (default): Reads prerendered HTML from `build/`
|
|
95
|
+
- **`build`**: Discovers routes from SvelteKit build manifest and renders via preview server
|
|
96
|
+
- **`crawl`**: Fetches pages from a running HTTP server
|
|
97
|
+
- **`content-files`**: Reads markdown/svelte source files directly
|
|
98
|
+
|
|
99
|
+
The indexing pipeline:
|
|
100
|
+
- Extracts content from `<main>` (configurable), including `<meta>` description and keywords
|
|
101
|
+
- Chunks text with semantic heading boundaries
|
|
102
|
+
- Prepends page title to each chunk for embedding context
|
|
103
|
+
- Generates a synthetic summary chunk per page for identity matching
|
|
104
|
+
- Generates embeddings via OpenAI
|
|
105
|
+
- Stores vectors in Turso/libSQL with cosine similarity index
|
|
106
|
+
|
|
107
|
+
### 6. Query
|
|
108
|
+
|
|
109
|
+
**Via API:**
|
|
110
|
+
```bash
|
|
111
|
+
curl -X POST http://localhost:5173/api/search \
|
|
112
|
+
-H "content-type: application/json" \
|
|
113
|
+
-d '{"q":"getting started","topK":5,"groupBy":"page"}'
|
|
114
|
+
```
|
|
115
|
+
|
|
116
|
+
**Via client library:**
|
|
117
|
+
```ts
|
|
118
|
+
import { createSearchClient } from "searchsocket/client";
|
|
119
|
+
|
|
120
|
+
const client = createSearchClient(); // defaults to /api/search
|
|
121
|
+
const response = await client.search({
|
|
122
|
+
q: "getting started",
|
|
123
|
+
topK: 5,
|
|
124
|
+
groupBy: "page",
|
|
125
|
+
pathPrefix: "/docs"
|
|
126
|
+
});
|
|
127
|
+
```
|
|
128
|
+
|
|
129
|
+
**Via CLI:**
|
|
130
|
+
```bash
|
|
131
|
+
pnpm searchsocket search --q "getting started" --top-k 5 --path-prefix /docs
|
|
132
|
+
```
|
|
133
|
+
|
|
134
|
+
**Response** (with `groupBy: "page"`, the default):
|
|
135
|
+
```json
|
|
136
|
+
{
|
|
137
|
+
"q": "getting started",
|
|
138
|
+
"scope": "main",
|
|
139
|
+
"results": [
|
|
140
|
+
{
|
|
141
|
+
"url": "/docs/intro",
|
|
142
|
+
"title": "Getting Started",
|
|
143
|
+
"sectionTitle": "Installation",
|
|
144
|
+
"snippet": "Install SearchSocket with pnpm add searchsocket...",
|
|
145
|
+
"score": 0.89,
|
|
146
|
+
"routeFile": "src/routes/docs/intro/+page.svelte",
|
|
147
|
+
"chunks": [
|
|
148
|
+
{
|
|
149
|
+
"sectionTitle": "Installation",
|
|
150
|
+
"snippet": "Install SearchSocket with pnpm add searchsocket...",
|
|
151
|
+
"headingPath": ["Getting Started", "Installation"],
|
|
152
|
+
"score": 0.89
|
|
153
|
+
},
|
|
154
|
+
{
|
|
155
|
+
"sectionTitle": "Configuration",
|
|
156
|
+
"snippet": "Create searchsocket.config.ts with your API key...",
|
|
157
|
+
"headingPath": ["Getting Started", "Configuration"],
|
|
158
|
+
"score": 0.74
|
|
159
|
+
}
|
|
160
|
+
]
|
|
161
|
+
}
|
|
162
|
+
],
|
|
163
|
+
"meta": {
|
|
164
|
+
"timingsMs": { "embed": 120, "vector": 15, "rerank": 0, "total": 135 },
|
|
165
|
+
"usedRerank": false,
|
|
166
|
+
"modelId": "text-embedding-3-small"
|
|
167
|
+
}
|
|
168
|
+
}
|
|
169
|
+
```
|
|
170
|
+
|
|
171
|
+
The `chunks` array appears when a page has multiple matching chunks above the `minChunkScoreRatio` threshold. Use `groupBy: "chunk"` for flat per-chunk results without page aggregation.
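
A minimal TypeScript sketch of consuming the page-grouped response shown above. The interface names (`SearchResponse`, `PageResult`, `ChunkMatch`) and the `summarize` helper are hypothetical and inferred only from the sample JSON; the package's exported types may differ.

```ts
// Hypothetical types inferred from the sample response; not the package's exported types.
interface ChunkMatch {
  sectionTitle: string;
  snippet: string;
  headingPath: string[];
  score: number;
}

interface PageResult {
  url: string;
  title: string;
  snippet: string;
  score: number;
  sectionTitle?: string;
  routeFile?: string;
  chunks?: ChunkMatch[]; // present only when several chunks of the page match
}

interface SearchResponse {
  q: string;
  scope: string;
  results: PageResult[];
}

// Flatten a page-grouped response into printable lines, one per page and per sub-chunk.
function summarize(response: SearchResponse): string[] {
  const lines: string[] = [];
  for (const page of response.results) {
    lines.push(`${page.score.toFixed(2)} ${page.title} (${page.url})`);
    for (const chunk of page.chunks ?? []) {
      lines.push(`  ${chunk.score.toFixed(2)} ${chunk.headingPath.join(" > ")}`);
    }
  }
  return lines;
}

const sample: SearchResponse = {
  q: "getting started",
  scope: "main",
  results: [
    {
      url: "/docs/intro",
      title: "Getting Started",
      snippet: "Install SearchSocket with pnpm add searchsocket...",
      score: 0.89,
      chunks: [
        {
          sectionTitle: "Installation",
          snippet: "Install SearchSocket with pnpm add searchsocket...",
          headingPath: ["Getting Started", "Installation"],
          score: 0.89
        }
      ]
    }
  ]
};

console.log(summarize(sample).join("\n"));
```

Because `chunks` is optional, the `?? []` fallback keeps the loop safe for pages with a single matching chunk and for `groupBy: "chunk"` responses.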
|
|
172
|
+
|
|
173
|
+
## Source Modes
|
|
174
|
+
|
|
175
|
+
SearchSocket supports four source modes for loading pages to index.
|
|
176
|
+
|
|
177
|
+
### `static-output` (default)
|
|
178
|
+
|
|
179
|
+
Reads prerendered HTML files from SvelteKit's build output directory.
|
|
180
|
+
|
|
181
|
+
```ts
|
|
182
|
+
export default {
|
|
183
|
+
source: {
|
|
184
|
+
mode: "static-output",
|
|
185
|
+
staticOutputDir: "build"
|
|
186
|
+
}
|
|
187
|
+
};
|
|
188
|
+
```
|
|
189
|
+
|
|
190
|
+
Best for: Sites with fully prerendered pages. Run `vite build` first, then index.
|
|
191
|
+
|
|
192
|
+
### `build`
|
|
193
|
+
|
|
194
|
+
Discovers routes automatically from SvelteKit's build manifest and renders them via an ephemeral `vite preview` server. No manual route configuration needed.
|
|
195
|
+
|
|
196
|
+
```ts
|
|
197
|
+
export default {
|
|
198
|
+
source: {
|
|
199
|
+
build: {
|
|
200
|
+
outputDir: ".svelte-kit/output", // default
|
|
201
|
+
previewTimeout: 30000, // ms to wait for server (default)
|
|
202
|
+
exclude: ["/api/*", "/admin/*"], // glob patterns to skip
|
|
203
|
+
paramValues: { // values for dynamic routes
|
|
204
|
+
"/blog/[slug]": ["hello-world", "getting-started"],
|
|
205
|
+
"/docs/[category]/[page]": ["guides/quickstart", "api/search"]
|
|
206
|
+
}
|
|
207
|
+
}
|
|
208
|
+
}
|
|
209
|
+
};
|
|
210
|
+
```
|
|
211
|
+
|
|
212
|
+
Best for: CI/CD pipelines. Enables `vite build && searchsocket index` with zero route configuration.
|
|
213
|
+
|
|
214
|
+
**How it works**:
|
|
215
|
+
1. Parses `.svelte-kit/output/server/manifest-full.js` to discover all page routes
|
|
216
|
+
2. Expands dynamic routes using `paramValues` (skips dynamic routes without values)
|
|
217
|
+
3. Starts an ephemeral `vite preview` server on a random port
|
|
218
|
+
4. Fetches all routes concurrently for SSR-rendered HTML
|
|
219
|
+
5. Provides exact route-to-file mapping (no heuristic matching needed)
|
|
220
|
+
6. Shuts down the preview server
|
|
221
|
+
|
|
222
|
+
**Dynamic routes**: Each key in `paramValues` maps to a route ID (e.g., `/blog/[slug]`) or its URL equivalent. Each value in the array replaces all `[param]` segments in the URL. Routes with layout groups like `/(app)/blog/[slug]` also match the URL key `/blog/[slug]`.
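
The expansion described above can be sketched roughly as follows. This is an illustration, not the package's code: `stripLayoutGroups`, `expandRoute`, and `expandAll` are hypothetical helpers, and the sketch assumes each value is split on `/` with one piece per `[param]` segment, as the `"guides/quickstart"` example suggests. Optional parameters such as `[[lang]]` are not handled.

```ts
// Hypothetical helpers illustrating paramValues expansion; assumptions are noted inline.

// Remove SvelteKit layout groups such as "/(app)" from a route ID.
function stripLayoutGroups(routeId: string): string {
  return routeId.replace(/\/\([^/)]+\)/g, "") || "/";
}

// Fill every [param] segment, in order, from a slash-separated value (assumed format).
function expandRoute(routeId: string, value: string): string {
  const parts = value.split("/");
  let i = 0;
  return stripLayoutGroups(routeId).replace(/\[[^\]]+\]/g, () => parts[i++] ?? "");
}

// Look up values by route ID first, then by the URL form without layout groups.
// Routes with no configured values expand to nothing, i.e. they are skipped.
function expandAll(routeId: string, paramValues: Record<string, string[]>): string[] {
  const urlKey = stripLayoutGroups(routeId);
  const values = paramValues[routeId] ?? paramValues[urlKey] ?? [];
  return values.map((value) => expandRoute(routeId, value));
}

console.log(expandAll("/(app)/blog/[slug]", { "/blog/[slug]": ["hello-world", "getting-started"] }));
console.log(expandRoute("/docs/[category]/[page]", "guides/quickstart"));
```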
|
|
223
|
+
|
|
224
|
+
### `crawl`
|
|
225
|
+
|
|
226
|
+
Fetches pages from a running HTTP server.
|
|
227
|
+
|
|
228
|
+
```ts
|
|
229
|
+
export default {
|
|
230
|
+
source: {
|
|
231
|
+
crawl: {
|
|
232
|
+
baseUrl: "http://localhost:4173",
|
|
233
|
+
routes: ["/", "/docs", "/blog"], // explicit routes
|
|
234
|
+
sitemapUrl: "https://example.com/sitemap.xml" // or discover via sitemap
|
|
235
|
+
}
|
|
236
|
+
}
|
|
237
|
+
};
|
|
238
|
+
```
|
|
239
|
+
|
|
240
|
+
If `routes` is omitted and no `sitemapUrl` is set, SearchSocket crawls `["/"]` only.
|
|
241
|
+
|
|
242
|
+
### `content-files`
|
|
243
|
+
|
|
244
|
+
Reads markdown and svelte source files directly, without building or serving.
|
|
245
|
+
|
|
246
|
+
```ts
|
|
247
|
+
export default {
|
|
248
|
+
source: {
|
|
249
|
+
contentFiles: {
|
|
250
|
+
globs: ["src/routes/**/*.md", "content/**/*.md"],
|
|
251
|
+
baseDir: "."
|
|
252
|
+
}
|
|
253
|
+
}
|
|
254
|
+
};
|
|
255
|
+
```
|
|
256
|
+
|
|
257
|
+
## Client Library
|
|
258
|
+
|
|
259
|
+
SearchSocket exports a lightweight client for browser-side search:
|
|
260
|
+
|
|
261
|
+
```ts
|
|
262
|
+
import { createSearchClient } from "searchsocket/client";
|
|
263
|
+
|
|
264
|
+
const client = createSearchClient({
|
|
265
|
+
endpoint: "/api/search", // default
|
|
266
|
+
fetchImpl: fetch // default; override for SSR or testing
|
|
267
|
+
});
|
|
268
|
+
|
|
269
|
+
const response = await client.search({
|
|
270
|
+
q: "deployment guide",
|
|
271
|
+
topK: 8,
|
|
272
|
+
groupBy: "page",
|
|
273
|
+
pathPrefix: "/docs",
|
|
274
|
+
tags: ["guide"],
|
|
275
|
+
rerank: true
|
|
276
|
+
});
|
|
277
|
+
|
|
278
|
+
for (const result of response.results) {
|
|
279
|
+
console.log(result.url, result.title, result.score);
|
|
280
|
+
if (result.chunks) {
|
|
281
|
+
for (const chunk of result.chunks) {
|
|
282
|
+
console.log(" ", chunk.sectionTitle, chunk.score);
|
|
283
|
+
}
|
|
284
|
+
}
|
|
285
|
+
}
|
|
286
|
+
```
|
|
287
|
+
|
|
288
|
+
## Vector Backend: Turso/libSQL
|
|
289
|
+
|
|
290
|
+
SearchSocket uses **Turso** (libSQL) as its single vector backend, providing a unified experience across development and production.
|
|
291
|
+
|
|
292
|
+
### Local Development
|
|
293
|
+
|
|
294
|
+
By default, SearchSocket uses a **local file database**:
|
|
295
|
+
- Path: `.searchsocket/vectors.db` (configurable)
|
|
296
|
+
- No account or API keys needed
|
|
297
|
+
- Full vector search with `libsql_vector_idx` and `vector_top_k`
|
|
298
|
+
- Perfect for local development and CI testing
|
|
299
|
+
|
|
300
|
+
### Production (Remote Turso)
|
|
301
|
+
|
|
302
|
+
For production, switch to **Turso's hosted service**:
|
|
303
|
+
|
|
304
|
+
1. **Sign up for Turso** (free tier available):
|
|
305
|
+
```bash
|
|
306
|
+
# Install Turso CLI
|
|
307
|
+
brew install tursodatabase/tap/turso
|
|
308
|
+
|
|
309
|
+
# Sign up
|
|
310
|
+
turso auth signup
|
|
311
|
+
|
|
312
|
+
# Create a database
|
|
313
|
+
turso db create searchsocket-prod
|
|
314
|
+
|
|
315
|
+
# Get credentials
|
|
316
|
+
turso db show searchsocket-prod --url
|
|
317
|
+
turso db tokens create searchsocket-prod
|
|
318
|
+
```
|
|
319
|
+
|
|
320
|
+
2. **Set environment variables**:
|
|
321
|
+
```bash
|
|
322
|
+
TURSO_DATABASE_URL=libsql://searchsocket-prod-xxx.turso.io
|
|
323
|
+
TURSO_AUTH_TOKEN=eyJhbGc...
|
|
324
|
+
```
|
|
325
|
+
|
|
326
|
+
3. **Index normally** — SearchSocket auto-detects the remote URL and uses it.
|
|
327
|
+
|
|
328
|
+
### Why Turso?
|
|
329
|
+
|
|
330
|
+
- **Single backend** — no more choosing between Pinecone, Milvus, or local JSON
|
|
331
|
+
- **Local-first development** — zero external dependencies for local dev
|
|
332
|
+
- **Production-ready** — same codebase scales to remote hosted DB
|
|
333
|
+
- **Cost-effective** — Turso free tier includes 9GB storage, 500M row reads/month
|
|
334
|
+
- **Vector search native** — `F32_BLOB` vectors, cosine similarity index, `vector_top_k` ANN queries
|
|
335
|
+
|
|
336
|
+
## Embeddings: OpenAI
|
|
337
|
+
|
|
338
|
+
SearchSocket uses **OpenAI's embedding models** to convert text into semantic vectors.
|
|
339
|
+
|
|
340
|
+
### Default Model
|
|
341
|
+
|
|
342
|
+
- **Model**: `text-embedding-3-small`
|
|
343
|
+
- **Dimensions**: 1536
|
|
344
|
+
- **Cost**: ~$0.00002 per 1K tokens (~4K chars)
|
|
345
|
+
|
|
346
|
+
### How It Works
|
|
347
|
+
|
|
348
|
+
1. **Chunking**: Text is split into semantic chunks (default 2200 chars, 200 overlap)
|
|
349
|
+
2. **Title Prepend**: Page title is prepended to each chunk for better context (`chunking.prependTitle`, default: true)
|
|
350
|
+
3. **Summary Chunk**: A synthetic identity chunk is generated per page with title, URL, and first paragraph (`chunking.pageSummaryChunk`, default: true)
|
|
351
|
+
4. **Embedding**: Each chunk is sent to OpenAI's embedding API
|
|
352
|
+
5. **Batching**: Requests batched (64 texts per request) for efficiency
|
|
353
|
+
6. **Storage**: Vectors stored in Turso with metadata (URL, title, tags, depth, etc.)
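
To make the chunk-size, overlap, and batch-size numbers concrete, here is a deliberately simplified sketch. `chunkText` and `toBatches` are hypothetical names; the sketch uses plain character windows and ignores the heading boundaries, `minChars`, and `dontSplitInside` rules that the real chunker applies.

```ts
// Simplified sketch: fixed-size character windows with overlap.
// The real pipeline splits on semantic heading boundaries, which this ignores.
function chunkText(text: string, maxChars = 2200, overlapChars = 200): string[] {
  if (overlapChars >= maxChars) {
    throw new Error("overlapChars must be smaller than maxChars");
  }
  const chunks: string[] = [];
  const step = maxChars - overlapChars; // each window starts this far after the previous one
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxChars));
    if (start + maxChars >= text.length) break; // the last window reached the end
  }
  return chunks;
}

// Group texts into request-sized batches (64 per request by default).
function toBatches<T>(items: T[], batchSize = 64): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

const text = "lorem ipsum ".repeat(500); // 6000 characters
const chunks = chunkText(text);
console.log(chunks.length, toBatches(chunks).length);
```

With the defaults, consecutive chunks share their last and first 200 characters, so a sentence that straddles a boundary still appears whole in at least one chunk.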
|
|
354
|
+
|
|
355
|
+
### Cost Estimation
|
|
356
|
+
|
|
357
|
+
Use `--dry-run` to preview costs:
|
|
358
|
+
```bash
|
|
359
|
+
pnpm searchsocket index --dry-run
|
|
360
|
+
```
|
|
361
|
+
|
|
362
|
+
Output:
|
|
363
|
+
```
|
|
364
|
+
pages processed: 42
|
|
365
|
+
chunks total: 156
|
|
366
|
+
chunks changed: 156
|
|
367
|
+
embeddings created: 156
|
|
368
|
+
estimated tokens: 32,400
|
|
369
|
+
estimated cost (USD): $0.000648
|
|
370
|
+
```
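
The figures in this sample output are consistent with the default price of $0.00002 per 1K tokens. A small sketch of that arithmetic follows; `estimateCostUsd` is a hypothetical helper, not part of the package.

```ts
// Hypothetical helper: cost = tokens / 1000 * price per 1K tokens.
function estimateCostUsd(tokens: number, pricePer1kTokens = 0.00002): number {
  return (tokens / 1000) * pricePer1kTokens;
}

// 32,400 tokens at the default price, formatted to six decimals.
console.log(estimateCostUsd(32_400).toFixed(6)); // "0.000648"

// The same token count at the text-embedding-3-large price used in the Custom Model example.
console.log(estimateCostUsd(32_400, 0.00013).toFixed(6)); // "0.004212"
```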
|
|
371
|
+
|
|
372
|
+
### Custom Model
|
|
373
|
+
|
|
374
|
+
Override in config:
|
|
375
|
+
```ts
|
|
376
|
+
export default {
|
|
377
|
+
embeddings: {
|
|
378
|
+
provider: "openai",
|
|
379
|
+
model: "text-embedding-3-large", // 3072 dims, higher quality
|
|
380
|
+
apiKeyEnv: "OPENAI_API_KEY",
|
|
381
|
+
pricePer1kTokens: 0.00013
|
|
382
|
+
}
|
|
383
|
+
};
|
|
384
|
+
```
|
|
385
|
+
|
|
386
|
+
**Note**: Changing the model after indexing requires re-indexing with `--force`, because vectors produced by different models are not comparable and may differ in dimension.
|
|
387
|
+
|
|
388
|
+
## Search & Ranking
|
|
389
|
+
|
|
390
|
+
### Page Aggregation
|
|
391
|
+
|
|
392
|
+
By default (`groupBy: "page"`), SearchSocket groups chunk results by page URL and computes a page-level score:
|
|
393
|
+
|
|
394
|
+
1. The top chunk score becomes the base page score
|
|
395
|
+
2. Additional matching chunks contribute a decaying bonus: `chunk_score * decay^i`
|
|
396
|
+
3. Optional per-URL page weights are applied multiplicatively
|
|
397
|
+
|
|
398
|
+
Configure aggregation behavior:
|
|
399
|
+
|
|
400
|
+
```ts
|
|
401
|
+
export default {
|
|
402
|
+
ranking: {
|
|
403
|
+
aggregationCap: 5, // max chunks contributing to page score (default: 5)
|
|
404
|
+
aggregationDecay: 0.5, // decay factor for additional chunks (default: 0.5)
|
|
405
|
+
minChunkScoreRatio: 0.5, // threshold for sub-chunks in results (default: 0.5)
|
|
406
|
+
pageWeights: { // per-URL score multipliers
|
|
407
|
+
"/": 1.1,
|
|
408
|
+
"/docs": 1.15,
|
|
409
|
+
"/download": 1.2
|
|
410
|
+
},
|
|
411
|
+
weights: {
|
|
412
|
+
aggregation: 0.1, // weight of aggregation bonus (default: 0.1)
|
|
413
|
+
incomingLinks: 0.05, // incoming link boost weight (default: 0.05)
|
|
414
|
+
depth: 0.03, // URL depth boost weight (default: 0.03)
|
|
415
|
+
rerank: 1.0 // reranker score weight (default: 1.0)
|
|
416
|
+
}
|
|
417
|
+
}
|
|
418
|
+
};
|
|
419
|
+
```
|
|
420
|
+
|
|
421
|
+
`pageWeights` supports exact URL matches and prefix matching. A weight of `1.15` on `"/docs"` boosts all pages under `/docs/` by 15%. Use gentle values (1.05-1.2x) since they compound with aggregation.
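
The aggregation steps and page weights can be sketched as below. The exact formula is an assumption pieced together from the list and config comments above: the top chunk is the base, each further chunk (up to `aggregationCap` in total) adds `score * decay^i` scaled by `weights.aggregation`, and the page weight multiplies the result. `pageScore` and `pageWeightFor` are hypothetical names, and prefix matching is assumed to mean "URL starts with key plus `/`", so the `"/"` key only matches the home page.

```ts
// Sketch of page-level scoring under the assumptions stated in the lead-in.
interface RankingOptions {
  aggregationCap: number;
  aggregationDecay: number;
  aggregationWeight: number; // corresponds to ranking.weights.aggregation
  pageWeights: Record<string, number>;
}

const defaultRanking: RankingOptions = {
  aggregationCap: 5,
  aggregationDecay: 0.5,
  aggregationWeight: 0.1,
  pageWeights: {}
};

// Exact match first, then the longest key that is a path prefix of the URL.
function pageWeightFor(url: string, pageWeights: Record<string, number>): number {
  const exact = pageWeights[url];
  if (exact !== undefined) return exact;
  let bestKey = "";
  let bestWeight = 1;
  for (const [key, weight] of Object.entries(pageWeights)) {
    if (url.startsWith(key + "/") && key.length > bestKey.length) {
      bestKey = key;
      bestWeight = weight;
    }
  }
  return bestWeight;
}

function pageScore(url: string, chunkScores: number[], opts: RankingOptions = defaultRanking): number {
  const sorted = [...chunkScores].sort((a, b) => b - a).slice(0, opts.aggregationCap);
  if (sorted.length === 0) return 0;
  const top = sorted[0] ?? 0;
  // i starts at 1 for the second-best chunk (assumed indexing).
  const bonus = sorted
    .slice(1)
    .reduce((sum, score, idx) => sum + score * opts.aggregationDecay ** (idx + 1), 0);
  return (top + opts.aggregationWeight * bonus) * pageWeightFor(url, opts.pageWeights);
}

console.log(pageScore("/docs/intro", [0.89, 0.74], { ...defaultRanking, pageWeights: { "/docs": 1.15 } }));
```

For the two chunk scores from the sample response (0.89 and 0.74), this gives (0.89 + 0.1 × 0.74 × 0.5) × 1.15 ≈ 1.066, which shows how a gentle page weight compounds with the aggregation bonus.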
|
|
422
|
+
|
|
423
|
+
### Chunk Mode
|
|
424
|
+
|
|
425
|
+
Use `groupBy: "chunk"` for flat per-chunk results without page aggregation:
|
|
426
|
+
|
|
427
|
+
```bash
|
|
428
|
+
curl -X POST http://localhost:5173/api/search \
|
|
429
|
+
-H "content-type: application/json" \
|
|
430
|
+
-d '{"q":"vector search","topK":10,"groupBy":"chunk"}'
|
|
431
|
+
```
|
|
432
|
+
|
|
433
|
+
## Build-Triggered Indexing
|
|
434
|
+
|
|
435
|
+
Automatically index after each SvelteKit build.
|
|
436
|
+
|
|
437
|
+
**`vite.config.ts`:**
|
|
438
|
+
```ts
|
|
439
|
+
import { sveltekit } from "@sveltejs/kit/vite";
import { searchsocketVitePlugin } from "searchsocket/sveltekit";
|
|
440
|
+
|
|
441
|
+
export default {
|
|
442
|
+
plugins: [
|
|
443
|
+
sveltekit(),
|
|
444
|
+
searchsocketVitePlugin({
|
|
445
|
+
enabled: true, // or check process.env.SEARCHSOCKET_AUTO_INDEX
|
|
446
|
+
changedOnly: true, // incremental indexing (faster)
|
|
447
|
+
verbose: false
|
|
448
|
+
})
|
|
449
|
+
]
|
|
450
|
+
};
|
|
451
|
+
```
|
|
452
|
+
|
|
453
|
+
**Environment control:**
|
|
454
|
+
```bash
|
|
455
|
+
# Enable via env var
|
|
456
|
+
SEARCHSOCKET_AUTO_INDEX=1 pnpm build
|
|
457
|
+
|
|
458
|
+
# Disable via env var
|
|
459
|
+
SEARCHSOCKET_DISABLE_AUTO_INDEX=1 pnpm build
|
|
460
|
+
```
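
One way the plugin option and the two environment variables could combine is sketched below. This is an assumption, not documented behavior: the sketch lets `SEARCHSOCKET_DISABLE_AUTO_INDEX` win over everything else and treats `"1"` and `"true"` as set. `shouldAutoIndex` is a hypothetical helper.

```ts
type Env = Record<string, string | undefined>;

// Treat "1" and "true" as set; the accepted values are an assumption.
const isSet = (value: string | undefined): boolean => value === "1" || value === "true";

// Decide whether to index after a build.
function shouldAutoIndex(env: Env, enabledOption = false): boolean {
  // Assumed precedence: an explicit disable always wins.
  if (isSet(env["SEARCHSOCKET_DISABLE_AUTO_INDEX"])) return false;
  return enabledOption || isSet(env["SEARCHSOCKET_AUTO_INDEX"]);
}

console.log(shouldAutoIndex({ SEARCHSOCKET_AUTO_INDEX: "1" }));
console.log(shouldAutoIndex({ SEARCHSOCKET_DISABLE_AUTO_INDEX: "1" }, true));
```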
|
|
461
|
+
|
|
462
|
+
## Git-Tracked Markdown Mirror
|
|
463
|
+
|
|
464
|
+
Indexing writes a **deterministic markdown mirror**:
|
|
465
|
+
|
|
466
|
+
```
|
|
467
|
+
.searchsocket/pages/<scope>/<path>.md
|
|
468
|
+
```
|
|
469
|
+
|
|
470
|
+
Example:
|
|
471
|
+
```
|
|
472
|
+
.searchsocket/pages/main/docs/intro.md
|
|
473
|
+
```
|
|
474
|
+
|
|
475
|
+
Each file contains:
|
|
476
|
+
- Frontmatter: URL, title, scope, route file, metadata
|
|
477
|
+
- Markdown: Extracted content
|
|
478
|
+
|
|
479
|
+
**Why commit it?**
|
|
480
|
+
- Content workflows (edit markdown, regenerate embeddings)
|
|
481
|
+
- Version control for indexed content
|
|
482
|
+
- Debugging (see exactly what was indexed)
|
|
483
|
+
- Offline search (grep the mirror)
|
|
484
|
+
|
|
485
|
+
Add to `.gitignore` if you don't need it:
|
|
486
|
+
```
|
|
487
|
+
.searchsocket/pages/
|
|
488
|
+
```
|
|
489
|
+
|
|
490
|
+
## Commands
|
|
491
|
+
|
|
492
|
+
### `searchsocket init`
|
|
493
|
+
|
|
494
|
+
Initialize config and state directory.
|
|
495
|
+
|
|
496
|
+
```bash
|
|
497
|
+
pnpm searchsocket init
|
|
498
|
+
```
|
|
499
|
+
|
|
500
|
+
### `searchsocket index`
|
|
501
|
+
|
|
502
|
+
Index content into vectors.
|
|
503
|
+
|
|
504
|
+
```bash
|
|
505
|
+
# Incremental (only changed chunks)
|
|
506
|
+
pnpm searchsocket index --changed-only
|
|
507
|
+
|
|
508
|
+
# Full re-index
|
|
509
|
+
pnpm searchsocket index --force
|
|
510
|
+
|
|
511
|
+
# Preview cost without indexing
|
|
512
|
+
pnpm searchsocket index --dry-run
|
|
513
|
+
|
|
514
|
+
# Override source mode
|
|
515
|
+
pnpm searchsocket index --source build
|
|
516
|
+
|
|
517
|
+
# Limit for testing
|
|
518
|
+
pnpm searchsocket index --max-pages 10 --max-chunks 50
|
|
519
|
+
|
|
520
|
+
# Override scope
|
|
521
|
+
pnpm searchsocket index --scope staging
|
|
522
|
+
|
|
523
|
+
# Verbose output
|
|
524
|
+
pnpm searchsocket index --verbose
|
|
525
|
+
```
|
|
526
|
+
|
|
527
|
+
### `searchsocket status`
|
|
528
|
+
|
|
529
|
+
Show indexing status, scope, and vector health.
|
|
530
|
+
|
|
531
|
+
```bash
|
|
532
|
+
pnpm searchsocket status
|
|
533
|
+
|
|
534
|
+
# Output:
|
|
535
|
+
# project: my-site
|
|
536
|
+
# resolved scope: main
|
|
537
|
+
# embedding model: text-embedding-3-small
|
|
538
|
+
# vector backend: turso/libsql (local (.searchsocket/vectors.db))
|
|
539
|
+
# vector health: ok
|
|
540
|
+
# last indexed (main): 2025-02-23T10:30:00Z
|
|
541
|
+
# tracked chunks: 156
|
|
542
|
+
# last estimated tokens: 32,400
|
|
543
|
+
# last estimated cost: $0.000648
|
|
544
|
+
```
|
|
545
|
+
|
|
546
|
+
### `searchsocket dev`
|
|
547
|
+
|
|
548
|
+
Watch for file changes and auto-reindex.
|
|
549
|
+
|
|
550
|
+
```bash
|
|
551
|
+
pnpm searchsocket dev
|
|
552
|
+
|
|
553
|
+
# With MCP server
|
|
554
|
+
pnpm searchsocket dev --mcp --mcp-port 3338
|
|
555
|
+
```
|
|
556
|
+
|
|
557
|
+
Watches:
|
|
558
|
+
- `src/routes/**` (route files)
|
|
559
|
+
- `build/` (if static-output mode)
|
|
560
|
+
- Build output dir (if build mode)
|
|
561
|
+
- Content files (if content-files mode)
|
|
562
|
+
- `searchsocket.config.ts` (if crawl or build mode)
|
|
563
|
+
|
|
564
|
+
### `searchsocket clean`
|
|
565
|
+
|
|
566
|
+
Delete local state and optionally remote vectors.
|
|
567
|
+
|
|
568
|
+
```bash
|
|
569
|
+
# Local state only
|
|
570
|
+
pnpm searchsocket clean
|
|
571
|
+
|
|
572
|
+
# Local + remote vectors
|
|
573
|
+
pnpm searchsocket clean --remote --scope staging
|
|
574
|
+
```
|
|
575
|
+
|
|
576
|
+
### `searchsocket prune`
|
|
577
|
+
|
|
578
|
+
Delete stale scopes (e.g., deleted git branches).
|
|
579
|
+
|
|
580
|
+
```bash
|
|
581
|
+
# Dry run (shows what would be deleted)
|
|
582
|
+
pnpm searchsocket prune --older-than 30d
|
|
583
|
+
|
|
584
|
+
# Apply deletions
|
|
585
|
+
pnpm searchsocket prune --older-than 30d --apply
|
|
586
|
+
|
|
587
|
+
# Use custom scope list
|
|
588
|
+
pnpm searchsocket prune --scopes-file active-branches.txt --apply
|
|
589
|
+
```
|
|
590
|
+
|
|
591
|
+
### `searchsocket doctor`
|
|
592
|
+
|
|
593
|
+
Validate config, env vars, and connectivity.
|
|
594
|
+
|
|
595
|
+
```bash
|
|
596
|
+
pnpm searchsocket doctor
|
|
597
|
+
|
|
598
|
+
# Output:
|
|
599
|
+
# PASS config parse
|
|
600
|
+
# PASS env OPENAI_API_KEY
|
|
601
|
+
# PASS turso/libsql (local file: .searchsocket/vectors.db)
|
|
602
|
+
# PASS source: build manifest
|
|
603
|
+
# PASS source: vite binary
|
|
604
|
+
# PASS embedding provider connectivity
|
|
605
|
+
# PASS vector backend connectivity
|
|
606
|
+
# PASS vector backend write permission
|
|
607
|
+
# PASS state directory writable
|
|
608
|
+
```
|
|
609
|
+
|
|
610
|
+
### `searchsocket mcp`
|
|
611
|
+
|
|
612
|
+
Run the MCP server for Claude Desktop and other MCP clients.
|
|
613
|
+
|
|
614
|
+
```bash
|
|
615
|
+
# stdio transport (default)
|
|
616
|
+
pnpm searchsocket mcp
|
|
617
|
+
|
|
618
|
+
# HTTP transport
|
|
619
|
+
pnpm searchsocket mcp --transport http --port 3338
|
|
620
|
+
```
|
|
621
|
+
|
|
622
|
+
### `searchsocket search`
|
|
623
|
+
|
|
624
|
+
Run a search from the CLI, mainly for testing.
|
|
625
|
+
|
|
626
|
+
```bash
|
|
627
|
+
pnpm searchsocket search --q "turso vector search" --top-k 5 --rerank
|
|
628
|
+
```
|
|
629
|
+
|
|
630
|
+
## MCP (Model Context Protocol)
|
|
631
|
+
|
|
632
|
+
SearchSocket provides an **MCP server** for integration with Claude Code, Claude Desktop, and other MCP-compatible AI tools. This gives AI assistants direct access to your indexed site content for semantic search and page retrieval.
|
|
633
|
+
|
|
634
|
+
### Tools
|
|
635
|
+
|
|
636
|
+
**`search(query, opts?)`**
|
|
637
|
+
- Semantic search across indexed content
|
|
638
|
+
- Returns ranked results with URL, title, snippet, score, and routeFile
|
|
639
|
+
- Options: `scope`, `topK` (1-100), `pathPrefix`, `tags`, `groupBy` (`"page"` | `"chunk"`)
|
|
640
|
+
|
|
641
|
+
**`get_page(pathOrUrl, opts?)`**
|
|
642
|
+
- Retrieve full indexed page content as markdown with frontmatter
|
|
643
|
+
- Options: `scope`
|
|
644
|
+
|
|
645
|
+
### Setup (Claude Code)
|
|
646
|
+
|
|
647
|
+
Add a `.mcp.json` file to your project root (safe to commit — no secrets needed since the CLI auto-loads `.env`):
|
|
648
|
+
|
|
649
|
+
```json
|
|
650
|
+
{
|
|
651
|
+
"mcpServers": {
|
|
652
|
+
"searchsocket": {
|
|
653
|
+
"type": "stdio",
|
|
654
|
+
"command": "npx",
|
|
655
|
+
"args": ["searchsocket", "mcp"],
|
|
656
|
+
"env": {}
|
|
657
|
+
}
|
|
658
|
+
}
|
|
659
|
+
}
|
|
660
|
+
```
|
|
661
|
+
|
|
662
|
+
Restart Claude Code. The `search` and `get_page` tools will be available automatically. Verify with:
|
|
663
|
+
|
|
664
|
+
```bash
|
|
665
|
+
claude mcp list
|
|
666
|
+
```
|
|
667
|
+
|
|
668
|
+
### Setup (Claude Desktop)
|
|
669
|
+
|
|
670
|
+
Add to `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS; on Windows the file is `%APPDATA%\Claude\claude_desktop_config.json`):
|
|
671
|
+
|
|
672
|
+
```json
|
|
673
|
+
{
|
|
674
|
+
"mcpServers": {
|
|
675
|
+
"searchsocket": {
|
|
676
|
+
"command": "npx",
|
|
677
|
+
"args": ["searchsocket", "mcp"],
|
|
678
|
+
"cwd": "/path/to/your/project"
|
|
679
|
+
}
|
|
680
|
+
}
|
|
681
|
+
}
|
|
682
|
+
```
|
|
683
|
+
|
|
684
|
+
Restart Claude Desktop. The tools appear in the MCP menu.
|
|
685
|
+
|
|
686
|
+
### HTTP Transport
|
|
687
|
+
|
|
688
|
+
For non-stdio clients, run the MCP server over HTTP:
|
|
689
|
+
|
|
690
|
+
```bash
|
|
691
|
+
npx searchsocket mcp --transport http --port 3338
|
|
692
|
+
```
|
|
693
|
+
|
|
694
|
+
This starts a stateless server at `http://127.0.0.1:3338/mcp`. Each POST request creates a fresh server instance with no session persistence.
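
MCP tool invocations are JSON-RPC 2.0 requests with the method `tools/call`, so a request body for this endpoint might be built as in the sketch below. `buildToolCall` is a hypothetical helper, and the argument names (`query`, `topK`, `groupBy`) are assumptions taken from the tool descriptions above; the server's actual input schema may differ.

```ts
// JSON-RPC 2.0 envelope that MCP uses for tool invocation.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper that wraps a tool name and its arguments in the envelope.
function buildToolCall(id: number, name: string, args: Record<string, unknown>): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// Argument names are assumed from the tool descriptions, not from a published schema.
const requestBody = JSON.stringify(
  buildToolCall(1, "search", { query: "getting started", topK: 5, groupBy: "page" })
);
console.log(requestBody);
```

The body would be sent as a `POST` to `http://127.0.0.1:3338/mcp` with `content-type: application/json`. MCP HTTP clients typically also send an `Accept` header listing both `application/json` and `text/event-stream`.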
|
|
695
|
+
|
|
696
|
+
## Environment Variables
|
|
697
|
+
|
|
698
|
+
The CLI automatically loads `.env` from the working directory on startup. Existing `process.env` values take precedence over `.env` file values. This only applies to CLI commands (`searchsocket index`, `searchsocket mcp`, etc.) — library imports like `searchsocketHandle()` rely on your framework's own `.env` handling (Vite/SvelteKit).
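
The precedence rule can be illustrated with a small sketch. `parseDotenv` and `mergeEnv` are hypothetical helpers, and the parser is intentionally minimal (plain `KEY=VALUE` lines, `#` comments, optional surrounding quotes); it is not the CLI's loader.

```ts
// Minimal KEY=VALUE parser: skips blank lines and # comments, strips matching quotes.
function parseDotenv(contents: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const raw of contents.split(/\r?\n/)) {
    const line = raw.trim();
    if (!line || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // no key before "=" or no "=" at all
    const key = line.slice(0, eq).trim();
    const value = line.slice(eq + 1).trim().replace(/^(['"])(.*)\1$/, "$2");
    out[key] = value;
  }
  return out;
}

// Values already present in the environment win over values from the file.
function mergeEnv(
  existing: Record<string, string | undefined>,
  fromFile: Record<string, string>
): Record<string, string | undefined> {
  const merged: Record<string, string | undefined> = { ...fromFile };
  for (const [key, value] of Object.entries(existing)) {
    if (value !== undefined) merged[key] = value;
  }
  return merged;
}

const fileVars = parseDotenv('OPENAI_API_KEY=sk-from-file\nTURSO_AUTH_TOKEN="abc"\n');
console.log(mergeEnv({ OPENAI_API_KEY: "sk-from-shell" }, fileVars));
```

In this example `OPENAI_API_KEY` keeps the shell value, while `TURSO_AUTH_TOKEN` is filled in from the file.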
|
|
699
|
+
|
|
700
|
+
### Required
|
|
701
|
+
|
|
702
|
+
**OpenAI:**
|
|
703
|
+
- `OPENAI_API_KEY` — OpenAI API key for embeddings
|
|
704
|
+
|
|
705
|
+
### Optional (Turso)
|
|
706
|
+
|
|
707
|
+
**Remote Turso (production):**
|
|
708
|
+
- `TURSO_DATABASE_URL` — Turso database URL (e.g., `libsql://my-db.turso.io`)
|
|
709
|
+
- `TURSO_AUTH_TOKEN` — Turso auth token
|
|
710
|
+
|
|
711
|
+
If these are not set, SearchSocket uses the local file DB at `.searchsocket/vectors.db`.
|
|
712
|
+
|
|
713
|
+
### Optional (Rerank)
|
|
714
|
+
|
|
715
|
+
**Jina:**
|
|
716
|
+
- `JINA_API_KEY` — Jina reranker API key (if using `rerank: { provider: "jina" }`)
|
|
717
|
+
|
|
718
|
+
### Optional (Scope/Build)
|
|
719
|
+
|
|
720
|
+
- `SEARCHSOCKET_SCOPE` — Override scope (when `scope.mode: "env"`)
|
|
721
|
+
- `SEARCHSOCKET_AUTO_INDEX` — Enable build-triggered indexing
|
|
722
|
+
- `SEARCHSOCKET_DISABLE_AUTO_INDEX` — Disable build-triggered indexing
|
|
723
|
+
|
|
724
|
+
## Configuration
|
|
725
|
+
|
|
726
|
+
### Full Example
|
|
727
|
+
|
|
728
|
+
```ts
|
|
729
|
+
export default {
|
|
730
|
+
project: {
|
|
731
|
+
id: "my-site",
|
|
732
|
+
baseUrl: "https://example.com"
|
|
733
|
+
},
|
|
734
|
+
|
|
735
|
+
scope: {
|
|
736
|
+
mode: "git", // "fixed" | "git" | "env"
|
|
737
|
+
fixed: "main",
|
|
738
|
+
sanitize: true
|
|
739
|
+
},
|
|
740
|
+
|
|
741
|
+
source: {
|
|
742
|
+
mode: "build", // "static-output" | "crawl" | "content-files" | "build"
|
|
743
|
+
staticOutputDir: "build",
|
|
744
|
+
strictRouteMapping: false,
|
|
745
|
+
|
|
746
|
+
// Build mode (recommended for CI/CD)
|
|
747
|
+
build: {
|
|
748
|
+
outputDir: ".svelte-kit/output",
|
|
749
|
+
previewTimeout: 30000,
|
|
750
|
+
exclude: ["/api/*"],
|
|
751
|
+
paramValues: {
|
|
752
|
+
"/blog/[slug]": ["hello-world", "getting-started"]
|
|
753
|
+
}
|
|
754
|
+
},
|
|
755
|
+
|
|
756
|
+
// Crawl mode (alternative)
|
|
757
|
+
crawl: {
|
|
758
|
+
baseUrl: "http://localhost:4173",
|
|
759
|
+
routes: ["/", "/docs", "/blog"],
|
|
760
|
+
sitemapUrl: "https://example.com/sitemap.xml"
|
|
761
|
+
},
|
|
762
|
+
|
|
763
|
+
// Content files mode (alternative)
|
|
764
|
+
contentFiles: {
|
|
765
|
+
globs: ["src/routes/**/*.md"],
|
|
766
|
+
baseDir: "."
|
|
767
|
+
}
|
|
768
|
+
},
|
|
769
|
+
|
|
770
|
+
extract: {
|
|
771
|
+
mainSelector: "main",
|
|
772
|
+
dropTags: ["header", "nav", "footer", "aside"],
|
|
773
|
+
dropSelectors: [".sidebar", ".toc"],
|
|
774
|
+
ignoreAttr: "data-search-ignore",
|
|
775
|
+
noindexAttr: "data-search-noindex",
|
|
776
|
+
respectRobotsNoindex: true
|
|
777
|
+
},
|
|
778
|
+
|
|
779
|
+
chunking: {
|
|
780
|
+
maxChars: 2200,
|
|
781
|
+
overlapChars: 200,
|
|
782
|
+
minChars: 250,
|
|
783
|
+
headingPathDepth: 3,
|
|
784
|
+
dontSplitInside: ["code", "table", "blockquote"],
|
|
785
|
+
prependTitle: true, // prepend page title to chunk text before embedding
|
|
786
|
+
pageSummaryChunk: true // generate synthetic identity chunk per page
|
|
787
|
+
},
|
|
788
|
+
|
|
789
|
+
embeddings: {
|
|
790
|
+
provider: "openai",
|
|
791
|
+
model: "text-embedding-3-small",
|
|
792
|
+
apiKeyEnv: "OPENAI_API_KEY",
|
|
793
|
+
batchSize: 64,
|
|
794
|
+
concurrency: 4
|
|
795
|
+
},
|
|
796
|
+
|
|
797
|
+
vector: {
|
|
798
|
+
dimension: 1536, // optional, inferred from first embedding
|
|
799
|
+
turso: {
|
|
800
|
+
urlEnv: "TURSO_DATABASE_URL",
|
|
801
|
+
authTokenEnv: "TURSO_AUTH_TOKEN",
|
|
802
|
+
localPath: ".searchsocket/vectors.db"
|
|
803
|
+
}
|
|
804
|
+
},
|
|
805
|
+
|
|
806
|
+
rerank: {
|
|
807
|
+
provider: "jina", // "none" | "jina"
|
|
808
|
+
topN: 20,
|
|
809
|
+
jina: {
|
|
810
|
+
apiKeyEnv: "JINA_API_KEY",
|
|
811
|
+
model: "jina-reranker-v2-base-multilingual"
|
|
812
|
+
}
|
|
813
|
+
},
|
|
814
|
+
|
|
815
|
+
ranking: {
|
|
816
|
+
enableIncomingLinkBoost: true,
|
|
817
|
+
enableDepthBoost: true,
|
|
818
|
+
pageWeights: {
|
|
819
|
+
"/": 1.1,
|
|
820
|
+
"/docs": 1.15
|
|
821
|
+
},
|
|
822
|
+
aggregationCap: 5,
|
|
823
|
+
aggregationDecay: 0.5,
|
|
824
|
+
minChunkScoreRatio: 0.5,
|
|
825
|
+
weights: {
|
|
826
|
+
incomingLinks: 0.05,
|
|
827
|
+
depth: 0.03,
|
|
828
|
+
rerank: 1.0,
|
|
829
|
+
aggregation: 0.1
|
|
830
|
+
}
|
|
831
|
+
},
|
|
832
|
+
|
|
833
|
+
api: {
|
|
834
|
+
path: "/api/search",
|
|
835
|
+
cors: {
|
|
836
|
+
allowOrigins: ["https://example.com"]
|
|
837
|
+
},
|
|
838
|
+
rateLimit: {
|
|
839
|
+
windowMs: 60_000,
|
|
840
|
+
max: 60
|
|
841
|
+
}
|
|
842
|
+
}
|
|
843
|
+
};
|
|
844
|
+
```
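
To show what `api.rateLimit` values such as `windowMs: 60_000` and `max: 60` mean in practice, here is a sketch of a fixed-window limiter. The README does not say which algorithm the handler uses, so the fixed-window strategy, the `FixedWindowLimiter` class, and the per-client `key` are all assumptions made for illustration.

```ts
interface RateLimitOptions {
  windowMs: number; // window length in milliseconds
  max: number; // requests allowed per key within one window
}

// Fixed-window counter: at most `max` requests per key in each window.
class FixedWindowLimiter {
  private readonly hits = new Map<string, { windowStart: number; count: number }>();
  private readonly opts: RateLimitOptions;

  constructor(opts: RateLimitOptions) {
    this.opts = opts;
  }

  // Returns true if the request identified by `key` is allowed at time `now` (ms).
  allow(key: string, now: number = Date.now()): boolean {
    const entry = this.hits.get(key);
    if (!entry || now - entry.windowStart >= this.opts.windowMs) {
      this.hits.set(key, { windowStart: now, count: 1 }); // start a new window
      return true;
    }
    if (entry.count >= this.opts.max) return false;
    entry.count += 1;
    return true;
  }
}

const limiter = new FixedWindowLimiter({ windowMs: 60_000, max: 60 });
console.log(limiter.allow("203.0.113.7"));
```

With the values from the example config, each client key would be allowed 60 requests per minute before further requests are rejected until the window resets.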
|
|
845
|
+
|
|
846
|
+
## License
|
|
847
|
+
|
|
848
|
+
MIT
|