searchsocket 0.2.0

package/README.md ADDED
@@ -0,0 +1,848 @@
1
+ # SearchSocket
2
+
3
+ Semantic site search and MCP retrieval for SvelteKit content projects.
4
+
5
+ **Requirements**: Node.js >= 20
6
+
7
+ ## Features
8
+
9
+ - **Embeddings**: OpenAI `text-embedding-3-small` (configurable)
10
+ - **Vector Backend**: Turso/libSQL with vector search (local file DB for development, remote for production)
11
+ - **Rerank**: Optional Jina reranker for improved relevance
12
+ - **Page Aggregation**: Group results by page with score-weighted chunk decay
13
+ - **Meta Extraction**: Automatically extracts `<meta name="description">` and `<meta name="keywords">` for improved relevance
14
+ - **SvelteKit Integrations**:
15
+ - `searchsocketHandle()` for `POST /api/search` endpoint
16
+ - `searchsocketVitePlugin()` for build-triggered indexing
17
+ - **Client Library**: `createSearchClient()` for browser-side search
18
+ - **MCP Server**: Model Context Protocol tools for search and page retrieval
19
+ - **Git-Tracked Markdown Mirror**: Commit-safe deterministic markdown outputs
20
+
21
+ ## Install
22
+
23
+ ```bash
24
+ # pnpm
25
+ pnpm add -D searchsocket
26
+
27
+ # npm
28
+ npm install -D searchsocket
29
+ ```
30
+
31
+ SearchSocket is typically a dev dependency for CLI indexing. If you use `searchsocketHandle()` at runtime (e.g., in a Node server adapter), add it as a regular dependency instead.
32
+
33
+ ## Quickstart
34
+
35
+ ### 1. Initialize
36
+
37
+ ```bash
38
+ pnpm searchsocket init
39
+ ```
40
+
41
+ This creates:
42
+ - `searchsocket.config.ts` — minimal config file
43
+ - `.searchsocket/` — state directory (added to `.gitignore`)
44
+
45
+ ### 2. Configure
46
+
47
+ Minimal config (`searchsocket.config.ts`):
48
+
49
+ ```ts
50
+ export default {
51
+ embeddings: { apiKeyEnv: "OPENAI_API_KEY" }
52
+ };
53
+ ```
54
+
55
+ **That's it!** Turso defaults work out of the box:
56
+ - **Development**: Uses local file DB at `.searchsocket/vectors.db`
57
+ - **Production**: Set `TURSO_DATABASE_URL` and `TURSO_AUTH_TOKEN` to use remote Turso
58
+
59
+ ### 3. Add SvelteKit API Hook
60
+
61
+ Create or update `src/hooks.server.ts`:
62
+
63
+ ```ts
64
+ import { searchsocketHandle } from "searchsocket/sveltekit";
65
+
66
+ export const handle = searchsocketHandle();
67
+ ```
68
+
69
+ This exposes `POST /api/search` with automatic scope resolution.
70
+
71
+ ### 4. Set Environment Variables
72
+
73
+ The CLI automatically loads `.env` from the working directory on startup, so your existing `.env` file works out of the box — no wrapper scripts or shell exports needed.
74
+
75
+ Development (`.env`):
76
+ ```bash
77
+ OPENAI_API_KEY=sk-...
78
+ ```
79
+
80
+ Production (add these for remote Turso):
81
+ ```bash
82
+ OPENAI_API_KEY=sk-...
83
+ TURSO_DATABASE_URL=libsql://your-db.turso.io
84
+ TURSO_AUTH_TOKEN=eyJ...
85
+ ```
86
+
87
+ ### 5. Index Your Content
88
+
89
+ ```bash
90
+ pnpm searchsocket index --changed-only
91
+ ```
92
+
93
+ SearchSocket auto-detects the source mode based on your config:
94
+ - **`static-output`** (default): Reads prerendered HTML from `build/` (run `vite build` first)
95
+ - **`build`**: Discovers routes from SvelteKit build manifest and renders via preview server
96
+ - **`crawl`**: Fetches pages from a running HTTP server
97
+ - **`content-files`**: Reads markdown/Svelte source files directly
98
+
99
+ The indexing pipeline:
100
+ - Extracts content from `<main>` (configurable), including `<meta>` description and keywords
101
+ - Chunks text with semantic heading boundaries
102
+ - Prepends page title to each chunk for embedding context
103
+ - Generates a synthetic summary chunk per page for identity matching
104
+ - Generates embeddings via OpenAI
105
+ - Stores vectors in Turso/libSQL with cosine similarity index
106
+
107
+ ### 6. Query
108
+
109
+ **Via API:**
110
+ ```bash
111
+ curl -X POST http://localhost:5173/api/search \
112
+ -H "content-type: application/json" \
113
+ -d '{"q":"getting started","topK":5,"groupBy":"page"}'
114
+ ```
115
+
116
+ **Via client library:**
117
+ ```ts
118
+ import { createSearchClient } from "searchsocket/client";
119
+
120
+ const client = createSearchClient(); // defaults to /api/search
121
+ const response = await client.search({
122
+ q: "getting started",
123
+ topK: 5,
124
+ groupBy: "page",
125
+ pathPrefix: "/docs"
126
+ });
127
+ ```
128
+
129
+ **Via CLI:**
130
+ ```bash
131
+ pnpm searchsocket search --q "getting started" --top-k 5 --path-prefix /docs
132
+ ```
133
+
134
+ **Response** (with `groupBy: "page"`, the default):
135
+ ```json
136
+ {
137
+ "q": "getting started",
138
+ "scope": "main",
139
+ "results": [
140
+ {
141
+ "url": "/docs/intro",
142
+ "title": "Getting Started",
143
+ "sectionTitle": "Installation",
144
+ "snippet": "Install SearchSocket with pnpm add searchsocket...",
145
+ "score": 0.89,
146
+ "routeFile": "src/routes/docs/intro/+page.svelte",
147
+ "chunks": [
148
+ {
149
+ "sectionTitle": "Installation",
150
+ "snippet": "Install SearchSocket with pnpm add searchsocket...",
151
+ "headingPath": ["Getting Started", "Installation"],
152
+ "score": 0.89
153
+ },
154
+ {
155
+ "sectionTitle": "Configuration",
156
+ "snippet": "Create searchsocket.config.ts with your API key...",
157
+ "headingPath": ["Getting Started", "Configuration"],
158
+ "score": 0.74
159
+ }
160
+ ]
161
+ }
162
+ ],
163
+ "meta": {
164
+ "timingsMs": { "embed": 120, "vector": 15, "rerank": 0, "total": 135 },
165
+ "usedRerank": false,
166
+ "modelId": "text-embedding-3-small"
167
+ }
168
+ }
169
+ ```
170
+
171
+ The `chunks` array appears when a page has multiple matching chunks above the `minChunkScoreRatio` threshold. Use `groupBy: "chunk"` for flat per-chunk results without page aggregation.
172
+
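As a rough illustration of how the `chunks` array could be derived, the sketch below keeps only chunks whose score is at least `minChunkScoreRatio` times the page's best chunk score. Measuring the threshold relative to the top chunk is an assumption made for this example, and `ChunkHit` and `selectSubChunks` are hypothetical names, not part of SearchSocket's API.

```typescript
// Hypothetical shape for a matched chunk; not SearchSocket's actual type.
interface ChunkHit {
  sectionTitle: string;
  score: number;
}

// Keep chunks scoring at least `ratio` times the best chunk's score.
// Assumption: the threshold is relative to the top chunk on the page.
function selectSubChunks(hits: ChunkHit[], ratio = 0.5): ChunkHit[] {
  if (hits.length === 0) return [];
  const sorted = [...hits].sort((a, b) => b.score - a.score);
  const cutoff = sorted[0].score * ratio;
  return sorted.filter((hit) => hit.score >= cutoff);
}

const kept = selectSubChunks([
  { sectionTitle: "Installation", score: 0.89 },
  { sectionTitle: "Configuration", score: 0.74 },
  { sectionTitle: "Changelog", score: 0.31 },
]);
console.log(kept.map((hit) => hit.sectionTitle));
```

With the scores from the sample response, the cutoff is 0.445, so "Installation" and "Configuration" are kept and the lower-scoring chunk is dropped.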
173
+ ## Source Modes
174
+
175
+ SearchSocket supports four source modes for loading pages to index.
176
+
177
+ ### `static-output` (default)
178
+
179
+ Reads prerendered HTML files from SvelteKit's build output directory.
180
+
181
+ ```ts
182
+ export default {
183
+ source: {
184
+ mode: "static-output",
185
+ staticOutputDir: "build"
186
+ }
187
+ };
188
+ ```
189
+
190
+ Best for: Sites with fully prerendered pages. Run `vite build` first, then index.
191
+
192
+ ### `build`
193
+
194
+ Discovers routes automatically from SvelteKit's build manifest and renders them via an ephemeral `vite preview` server. No manual route configuration needed.
195
+
196
+ ```ts
197
+ export default {
198
+ source: {
199
+ build: {
200
+ outputDir: ".svelte-kit/output", // default
201
+ previewTimeout: 30000, // ms to wait for server (default)
202
+ exclude: ["/api/*", "/admin/*"], // glob patterns to skip
203
+ paramValues: { // values for dynamic routes
204
+ "/blog/[slug]": ["hello-world", "getting-started"],
205
+ "/docs/[category]/[page]": ["guides/quickstart", "api/search"]
206
+ }
207
+ }
208
+ }
209
+ };
210
+ ```
211
+
212
+ Best for: CI/CD pipelines. Enables `vite build && searchsocket index` with zero route configuration.
213
+
214
+ **How it works**:
215
+ 1. Parses `.svelte-kit/output/server/manifest-full.js` to discover all page routes
216
+ 2. Expands dynamic routes using `paramValues` (skips dynamic routes without values)
217
+ 3. Starts an ephemeral `vite preview` server on a random port
218
+ 4. Fetches all routes concurrently for SSR-rendered HTML
219
+ 5. Provides exact route-to-file mapping (no heuristic matching needed)
220
+ 6. Shuts down the preview server
221
+
222
+ **Dynamic routes**: Each key in `paramValues` maps to a route ID (e.g., `/blog/[slug]`) or its URL equivalent. Each value in the array replaces all `[param]` segments in the URL. Routes with layout groups like `/(app)/blog/[slug]` also match the URL key `/blog/[slug]`.
223
+
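The expansion described above can be sketched roughly as follows. This is not SearchSocket's implementation: `stripLayoutGroups` and `expandRoute` are hypothetical helpers, and the sketch assumes each value is split on `/` and its parts fill the `[param]` segments from left to right.

```typescript
// Remove SvelteKit layout groups such as "/(app)" from a route ID.
function stripLayoutGroups(routeId: string): string {
  const stripped = routeId.replace(/\/\([^/)]+\)/g, "");
  return stripped === "" ? "/" : stripped;
}

// Expand one dynamic route into concrete URLs.
// Assumption: value parts are assigned to [param] segments positionally.
function expandRoute(routeId: string, values: string[]): string[] {
  const segments = stripLayoutGroups(routeId).split("/");
  return values.map((value) => {
    const parts = value.split("/");
    let next = 0;
    return segments
      .map((segment) =>
        /^\[.+\]$/.test(segment) ? (parts[next++] ?? segment) : segment,
      )
      .join("/");
  });
}

console.log(expandRoute("/(app)/blog/[slug]", ["hello-world", "getting-started"]));
console.log(expandRoute("/docs/[category]/[page]", ["guides/quickstart", "api/search"]));
```

Under these assumptions, `"guides/quickstart"` on `/docs/[category]/[page]` yields `/docs/guides/quickstart`. A param segment without a matching part is left unchanged.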
224
+ ### `crawl`
225
+
226
+ Fetches pages from a running HTTP server.
227
+
228
+ ```ts
229
+ export default {
230
+ source: {
231
+ crawl: {
232
+ baseUrl: "http://localhost:4173",
233
+ routes: ["/", "/docs", "/blog"], // explicit routes
234
+ sitemapUrl: "https://example.com/sitemap.xml" // or discover via sitemap
235
+ }
236
+ }
237
+ };
238
+ ```
239
+
240
+ If `routes` is omitted and no `sitemapUrl` is set, SearchSocket crawls only `["/"]`.
241
+
242
+ ### `content-files`
243
+
244
+ Reads markdown and Svelte source files directly, without building or serving.
245
+
246
+ ```ts
247
+ export default {
248
+ source: {
249
+ contentFiles: {
250
+ globs: ["src/routes/**/*.md", "content/**/*.md"],
251
+ baseDir: "."
252
+ }
253
+ }
254
+ };
255
+ ```
256
+
257
+ ## Client Library
258
+
259
+ SearchSocket exports a lightweight client for browser-side search:
260
+
261
+ ```ts
262
+ import { createSearchClient } from "searchsocket/client";
263
+
264
+ const client = createSearchClient({
265
+ endpoint: "/api/search", // default
266
+ fetchImpl: fetch // default; override for SSR or testing
267
+ });
268
+
269
+ const response = await client.search({
270
+ q: "deployment guide",
271
+ topK: 8,
272
+ groupBy: "page",
273
+ pathPrefix: "/docs",
274
+ tags: ["guide"],
275
+ rerank: true
276
+ });
277
+
278
+ for (const result of response.results) {
279
+ console.log(result.url, result.title, result.score);
280
+ if (result.chunks) {
281
+ for (const chunk of result.chunks) {
282
+ console.log(" ", chunk.sectionTitle, chunk.score);
283
+ }
284
+ }
285
+ }
286
+ ```
287
+
288
+ ## Vector Backend: Turso/libSQL
289
+
290
+ SearchSocket uses **Turso** (libSQL) as its single vector backend, providing a unified experience across development and production.
291
+
292
+ ### Local Development
293
+
294
+ By default, SearchSocket uses a **local file database**:
295
+ - Path: `.searchsocket/vectors.db` (configurable)
296
+ - No account or API keys needed
297
+ - Full vector search with `libsql_vector_idx` and `vector_top_k`
298
+ - Perfect for local development and CI testing
299
+
300
+ ### Production (Remote Turso)
301
+
302
+ For production, switch to **Turso's hosted service**:
303
+
304
+ 1. **Sign up for Turso** (free tier available):
305
+ ```bash
306
+ # Install Turso CLI
307
+ brew install tursodatabase/tap/turso
308
+
309
+ # Sign up
310
+ turso auth signup
311
+
312
+ # Create a database
313
+ turso db create searchsocket-prod
314
+
315
+ # Get credentials
316
+ turso db show searchsocket-prod --url
317
+ turso db tokens create searchsocket-prod
318
+ ```
319
+
320
+ 2. **Set environment variables**:
321
+ ```bash
322
+ TURSO_DATABASE_URL=libsql://searchsocket-prod-xxx.turso.io
323
+ TURSO_AUTH_TOKEN=eyJhbGc...
324
+ ```
325
+
326
+ 3. **Index normally** — SearchSocket auto-detects the remote URL and uses it.
327
+
328
+ ### Why Turso?
329
+
330
+ - **Single backend** — no more choosing between Pinecone, Milvus, or local JSON
331
+ - **Local-first development** — zero external dependencies for local dev
332
+ - **Production-ready** — same codebase scales to remote hosted DB
333
+ - **Cost-effective** — Turso free tier includes 9GB storage, 500M row reads/month
334
+ - **Vector search native** — `F32_BLOB` vectors, cosine similarity index, `vector_top_k` ANN queries
335
+
336
+ ## Embeddings: OpenAI
337
+
338
+ SearchSocket uses **OpenAI's embedding models** to convert text into semantic vectors.
339
+
340
+ ### Default Model
341
+
342
+ - **Model**: `text-embedding-3-small`
343
+ - **Dimensions**: 1536
344
+ - **Cost**: ~$0.00002 per 1K tokens (~4K chars)
345
+
346
+ ### How It Works
347
+
348
+ 1. **Chunking**: Text is split into semantic chunks (default 2200 chars, 200 overlap)
349
+ 2. **Title Prepend**: Page title is prepended to each chunk for better context (`chunking.prependTitle`, default: true)
350
+ 3. **Summary Chunk**: A synthetic identity chunk is generated per page with title, URL, and first paragraph (`chunking.pageSummaryChunk`, default: true)
351
+ 4. **Embedding**: Each chunk is sent to OpenAI's embedding API
352
+ 5. **Batching**: Requests are batched (64 texts per request by default, see `embeddings.batchSize`) for efficiency
353
+ 6. **Storage**: Vectors stored in Turso with metadata (URL, title, tags, depth, etc.)
354
+
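To make the chunk size and overlap numbers concrete, here is a deliberately simplified sketch of fixed-window chunking. The real pipeline splits on semantic heading boundaries, which this example does not attempt. `chunkText` and `withTitle` are hypothetical helpers, and the blank-line separator between title and chunk is an assumption.

```typescript
// Split text into windows of at most `maxChars` characters, where each
// window starts `maxChars - overlapChars` after the previous one.
function chunkText(text: string, maxChars = 2200, overlapChars = 200): string[] {
  if (overlapChars >= maxChars) {
    throw new Error("overlapChars must be smaller than maxChars");
  }
  const step = maxChars - overlapChars;
  const pieces: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    pieces.push(text.slice(start, start + maxChars));
    if (start + maxChars >= text.length) break; // this window reached the end
  }
  return pieces;
}

// Prepend the page title to a chunk, in the spirit of `chunking.prependTitle`.
// Assumption: title and chunk are separated by a blank line.
function withTitle(title: string, chunk: string): string {
  return `${title}\n\n${chunk}`;
}

const sampleText = "a".repeat(5000);
const pieces = chunkText(sampleText);
console.log(pieces.map((piece) => piece.length));
console.log(withTitle("Getting Started", pieces[0]).slice(0, 20));
```

For a 5,000-character page with the defaults, the windows start at offsets 0, 2000, and 4000, so consecutive chunks share 200 characters.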
355
+ ### Cost Estimation
356
+
357
+ Use `--dry-run` to preview costs:
358
+ ```bash
359
+ pnpm searchsocket index --dry-run
360
+ ```
361
+
362
+ Output:
363
+ ```
364
+ pages processed: 42
365
+ chunks total: 156
366
+ chunks changed: 156
367
+ embeddings created: 156
368
+ estimated tokens: 32,400
369
+ estimated cost (USD): $0.000648
370
+ ```
371
+
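The estimate in the sample output is consistent with multiplying the token count by the per-1K-token price. The sketch below reproduces that arithmetic. `estimateTokens` and `estimateCostUsd` are hypothetical helpers, and the 4-characters-per-token ratio is only the rough rule of thumb quoted above, not how the CLI necessarily counts tokens.

```typescript
// Rough token estimate, assuming about 4 characters per token.
function estimateTokens(totalChars: number): number {
  return Math.ceil(totalChars / 4);
}

// Cost in USD for a token count at a given price per 1K tokens.
// The default price is the one listed for text-embedding-3-small above.
function estimateCostUsd(tokens: number, pricePer1kTokens = 0.00002): number {
  return (tokens / 1000) * pricePer1kTokens;
}

const costUsd = estimateCostUsd(32_400);
console.log(`estimated cost (USD): $${costUsd.toFixed(6)}`);
```

For 32,400 tokens this gives 32.4 × $0.00002 = $0.000648, matching the sample output.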
372
+ ### Custom Model
373
+
374
+ Override in config:
375
+ ```ts
376
+ export default {
377
+ embeddings: {
378
+ provider: "openai",
379
+ model: "text-embedding-3-large", // 3072 dims, higher quality
380
+ apiKeyEnv: "OPENAI_API_KEY",
381
+ pricePer1kTokens: 0.00013
382
+ }
383
+ };
384
+ ```
385
+
386
+ **Note**: Changing the model after indexing requires re-indexing with `--force`.
387
+
388
+ ## Search & Ranking
389
+
390
+ ### Page Aggregation
391
+
392
+ By default (`groupBy: "page"`), SearchSocket groups chunk results by page URL and computes a page-level score:
393
+
394
+ 1. The top chunk score becomes the base page score
395
+ 2. Additional matching chunks contribute a decaying bonus: `chunk_score * decay^i`
396
+ 3. Optional per-URL page weights are applied multiplicatively
397
+
398
+ Configure aggregation behavior:
399
+
400
+ ```ts
401
+ export default {
402
+ ranking: {
403
+ aggregationCap: 5, // max chunks contributing to page score (default: 5)
404
+ aggregationDecay: 0.5, // decay factor for additional chunks (default: 0.5)
405
+ minChunkScoreRatio: 0.5, // threshold for sub-chunks in results (default: 0.5)
406
+ pageWeights: { // per-URL score multipliers
407
+ "/": 1.1,
408
+ "/docs": 1.15,
409
+ "/download": 1.2
410
+ },
411
+ weights: {
412
+ aggregation: 0.1, // weight of aggregation bonus (default: 0.1)
413
+ incomingLinks: 0.05, // incoming link boost weight (default: 0.05)
414
+ depth: 0.03, // URL depth boost weight (default: 0.03)
415
+ rerank: 1.0 // reranker score weight (default: 1.0)
416
+ }
417
+ }
418
+ };
419
+ ```
420
+
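One plausible reading of the aggregation steps and settings above is sketched below. The exact formula is an assumption: the top chunk score, plus `weights.aggregation` times the decayed sum of the remaining chunks (up to `aggregationCap` chunks in total), multiplied by the page weight. `AggregationOptions` and `pageScore` are hypothetical names.

```typescript
interface AggregationOptions {
  aggregationCap: number; // max chunks contributing to the page score
  aggregationDecay: number; // decay factor for additional chunks
  aggregationWeight: number; // stands in for `ranking.weights.aggregation`
}

const defaultAggregation: AggregationOptions = {
  aggregationCap: 5,
  aggregationDecay: 0.5,
  aggregationWeight: 0.1,
};

// Assumed formula: best + weight * sum(score_i * decay^i) for i >= 1,
// then multiplied by the per-URL page weight.
function pageScore(
  chunkScores: number[],
  pageWeight = 1,
  opts: AggregationOptions = defaultAggregation,
): number {
  const ranked = [...chunkScores]
    .sort((a, b) => b - a)
    .slice(0, opts.aggregationCap);
  if (ranked.length === 0) return 0;
  const [best, ...rest] = ranked;
  const bonus = rest.reduce(
    (sum, score, index) => sum + score * Math.pow(opts.aggregationDecay, index + 1),
    0,
  );
  return (best + opts.aggregationWeight * bonus) * pageWeight;
}

console.log(pageScore([0.89, 0.74]).toFixed(3));
```

Under this reading, a page with chunk scores 0.89 and 0.74 scores 0.89 + 0.1 × (0.74 × 0.5) = 0.927 before any page weight is applied.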
421
+ `pageWeights` supports exact URL matches and prefix matching. A weight of `1.15` on `"/docs"` boosts all pages under `/docs/` by 15%. Use gentle values (1.05-1.2x) since they compound with aggregation.
422
+
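A minimal sketch of how a weight could be resolved for a URL follows. It rests on several assumptions that the text above does not confirm: the longest matching key wins, prefixes match on path-segment boundaries, and `"/"` applies only to the home page. `resolvePageWeight` is a hypothetical helper.

```typescript
// Resolve the multiplier for a URL from a `pageWeights` map.
// Assumptions: longest matching key wins, prefixes match whole path
// segments, and "/" matches only the home page itself.
function resolvePageWeight(url: string, pageWeights: Record<string, number>): number {
  let bestKey = "";
  for (const key of Object.keys(pageWeights)) {
    const prefix = key.endsWith("/") ? key : `${key}/`;
    const matches = url === key || (key !== "/" && url.startsWith(prefix));
    if (matches && key.length > bestKey.length) bestKey = key;
  }
  return bestKey === "" ? 1 : pageWeights[bestKey];
}

const exampleWeights = { "/": 1.1, "/docs": 1.15, "/download": 1.2 };
console.log(resolvePageWeight("/docs/intro", exampleWeights));
console.log(resolvePageWeight("/blog/post", exampleWeights));
```

Here `/docs/intro` picks up the `/docs` multiplier, while `/blog/post` matches no key and falls back to a neutral weight of 1.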
423
+ ### Chunk Mode
424
+
425
+ Use `groupBy: "chunk"` for flat per-chunk results without page aggregation:
426
+
427
+ ```bash
428
+ curl -X POST http://localhost:5173/api/search \
429
+ -H "content-type: application/json" \
430
+ -d '{"q":"vector search","topK":10,"groupBy":"chunk"}'
431
+ ```
432
+
433
+ ## Build-Triggered Indexing
434
+
435
+ Automatically index after each SvelteKit build.
436
+
437
+ **`vite.config.ts`:**
438
+ ```ts
439
+ import { sveltekit } from "@sveltejs/kit/vite";
+ import { searchsocketVitePlugin } from "searchsocket/sveltekit";
440
+
441
+ export default {
442
+ plugins: [
443
+ sveltekit(), // SvelteKit's own Vite plugin
444
+ searchsocketVitePlugin({
445
+ enabled: true, // or check process.env.SEARCHSOCKET_AUTO_INDEX
446
+ changedOnly: true, // incremental indexing (faster)
447
+ verbose: false
448
+ })
449
+ ]
450
+ };
451
+ ```
452
+
453
+ **Environment control:**
454
+ ```bash
455
+ # Enable via env var
456
+ SEARCHSOCKET_AUTO_INDEX=1 pnpm build
457
+
458
+ # Disable via env var
459
+ SEARCHSOCKET_DISABLE_AUTO_INDEX=1 pnpm build
460
+ ```
461
+
462
+ ## Git-Tracked Markdown Mirror
463
+
464
+ Indexing writes a **deterministic markdown mirror**:
465
+
466
+ ```
467
+ .searchsocket/pages/<scope>/<path>.md
468
+ ```
469
+
470
+ Example:
471
+ ```
472
+ .searchsocket/pages/main/docs/intro.md
473
+ ```
474
+
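The path layout can be expressed as a small mapping from scope and page URL to a file path. This is an illustrative sketch only: `mirrorPath` is a hypothetical helper, and writing the site root as `index.md` is an assumption.

```typescript
// Map a page URL to its mirror file path under .searchsocket/pages/.
// Assumption: the site root "/" is written as "index.md".
function mirrorPath(scope: string, url: string): string {
  const trimmed = url.replace(/^\/+|\/+$/g, "");
  const relative = trimmed === "" ? "index" : trimmed;
  return `.searchsocket/pages/${scope}/${relative}.md`;
}

console.log(mirrorPath("main", "/docs/intro"));
```

For scope `main` and URL `/docs/intro` this produces the example path shown above.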
475
+ Each file contains:
476
+ - Frontmatter: URL, title, scope, route file, metadata
477
+ - Markdown: Extracted content
478
+
479
+ **Why commit it?**
480
+ - Content workflows (edit markdown, regenerate embeddings)
481
+ - Version control for indexed content
482
+ - Debugging (see exactly what was indexed)
483
+ - Offline search (grep the mirror)
484
+
485
+ Add to `.gitignore` if you don't need it:
486
+ ```
487
+ .searchsocket/pages/
488
+ ```
489
+
490
+ ## Commands
491
+
492
+ ### `searchsocket init`
493
+
494
+ Initialize config and state directory.
495
+
496
+ ```bash
497
+ pnpm searchsocket init
498
+ ```
499
+
500
+ ### `searchsocket index`
501
+
502
+ Index content into vectors.
503
+
504
+ ```bash
505
+ # Incremental (only changed chunks)
506
+ pnpm searchsocket index --changed-only
507
+
508
+ # Full re-index
509
+ pnpm searchsocket index --force
510
+
511
+ # Preview cost without indexing
512
+ pnpm searchsocket index --dry-run
513
+
514
+ # Override source mode
515
+ pnpm searchsocket index --source build
516
+
517
+ # Limit for testing
518
+ pnpm searchsocket index --max-pages 10 --max-chunks 50
519
+
520
+ # Override scope
521
+ pnpm searchsocket index --scope staging
522
+
523
+ # Verbose output
524
+ pnpm searchsocket index --verbose
525
+ ```
526
+
527
+ ### `searchsocket status`
528
+
529
+ Show indexing status, scope, and vector health.
530
+
531
+ ```bash
532
+ pnpm searchsocket status
533
+
534
+ # Output:
535
+ # project: my-site
536
+ # resolved scope: main
537
+ # embedding model: text-embedding-3-small
538
+ # vector backend: turso/libsql (local (.searchsocket/vectors.db))
539
+ # vector health: ok
540
+ # last indexed (main): 2025-02-23T10:30:00Z
541
+ # tracked chunks: 156
542
+ # last estimated tokens: 32,400
543
+ # last estimated cost: $0.000648
544
+ ```
545
+
546
+ ### `searchsocket dev`
547
+
548
+ Watch for file changes and auto-reindex.
549
+
550
+ ```bash
551
+ pnpm searchsocket dev
552
+
553
+ # With MCP server
554
+ pnpm searchsocket dev --mcp --mcp-port 3338
555
+ ```
556
+
557
+ Watches:
558
+ - `src/routes/**` (route files)
559
+ - `build/` (if static-output mode)
560
+ - Build output dir (if build mode)
561
+ - Content files (if content-files mode)
562
+ - `searchsocket.config.ts` (if crawl or build mode)
563
+
564
+ ### `searchsocket clean`
565
+
566
+ Delete local state and optionally remote vectors.
567
+
568
+ ```bash
569
+ # Local state only
570
+ pnpm searchsocket clean
571
+
572
+ # Local + remote vectors
573
+ pnpm searchsocket clean --remote --scope staging
574
+ ```
575
+
576
+ ### `searchsocket prune`
577
+
578
+ Delete stale scopes (e.g., scopes left behind by deleted git branches).
579
+
580
+ ```bash
581
+ # Dry run (shows what would be deleted)
582
+ pnpm searchsocket prune --older-than 30d
583
+
584
+ # Apply deletions
585
+ pnpm searchsocket prune --older-than 30d --apply
586
+
587
+ # Use custom scope list
588
+ pnpm searchsocket prune --scopes-file active-branches.txt --apply
589
+ ```
590
+
591
+ ### `searchsocket doctor`
592
+
593
+ Validate config, env vars, and connectivity.
594
+
595
+ ```bash
596
+ pnpm searchsocket doctor
597
+
598
+ # Output:
599
+ # PASS config parse
600
+ # PASS env OPENAI_API_KEY
601
+ # PASS turso/libsql (local file: .searchsocket/vectors.db)
602
+ # PASS source: build manifest
603
+ # PASS source: vite binary
604
+ # PASS embedding provider connectivity
605
+ # PASS vector backend connectivity
606
+ # PASS vector backend write permission
607
+ # PASS state directory writable
608
+ ```
609
+
610
+ ### `searchsocket mcp`
611
+
612
+ Run MCP server for Claude Desktop / other MCP clients.
613
+
614
+ ```bash
615
+ # stdio transport (default)
616
+ pnpm searchsocket mcp
617
+
618
+ # HTTP transport
619
+ pnpm searchsocket mcp --transport http --port 3338
620
+ ```
621
+
622
+ ### `searchsocket search`
623
+
624
+ CLI search for testing.
625
+
626
+ ```bash
627
+ pnpm searchsocket search --q "turso vector search" --top-k 5 --rerank
628
+ ```
629
+
630
+ ## MCP (Model Context Protocol)
631
+
632
+ SearchSocket provides an **MCP server** for integration with Claude Code, Claude Desktop, and other MCP-compatible AI tools. This gives AI assistants direct access to your indexed site content for semantic search and page retrieval.
633
+
634
+ ### Tools
635
+
636
+ **`search(query, opts?)`**
637
+ - Semantic search across indexed content
638
+ - Returns ranked results with URL, title, snippet, score, and routeFile
639
+ - Options: `scope`, `topK` (1-100), `pathPrefix`, `tags`, `groupBy` (`"page"` | `"chunk"`)
640
+
641
+ **`get_page(pathOrUrl, opts?)`**
642
+ - Retrieve full indexed page content as markdown with frontmatter
643
+ - Options: `scope`
644
+
645
+ ### Setup (Claude Code)
646
+
647
+ Add a `.mcp.json` file to your project root (safe to commit — no secrets needed since the CLI auto-loads `.env`):
648
+
649
+ ```json
650
+ {
651
+ "mcpServers": {
652
+ "searchsocket": {
653
+ "type": "stdio",
654
+ "command": "npx",
655
+ "args": ["searchsocket", "mcp"],
656
+ "env": {}
657
+ }
658
+ }
659
+ }
660
+ ```
661
+
662
+ Restart Claude Code. The `search` and `get_page` tools will be available automatically. Verify with:
663
+
664
+ ```bash
665
+ claude mcp list
666
+ ```
667
+
668
+ ### Setup (Claude Desktop)
669
+
670
+ Add to `~/Library/Application Support/Claude/claude_desktop_config.json` (the macOS location):
671
+
672
+ ```json
673
+ {
674
+ "mcpServers": {
675
+ "searchsocket": {
676
+ "command": "npx",
677
+ "args": ["searchsocket", "mcp"],
678
+ "cwd": "/path/to/your/project"
679
+ }
680
+ }
681
+ }
682
+ ```
683
+
684
+ Restart Claude Desktop. The tools appear in the MCP menu.
685
+
686
+ ### HTTP Transport
687
+
688
+ For non-stdio clients, run the MCP server over HTTP:
689
+
690
+ ```bash
691
+ npx searchsocket mcp --transport http --port 3338
692
+ ```
693
+
694
+ This starts a stateless server at `http://127.0.0.1:3338/mcp`. Each POST request creates a fresh server instance with no session persistence.
695
+
696
+ ## Environment Variables
697
+
698
+ The CLI automatically loads `.env` from the working directory on startup. Existing `process.env` values take precedence over `.env` file values. This only applies to CLI commands (`searchsocket index`, `searchsocket mcp`, etc.) — library imports like `searchsocketHandle()` rely on your framework's own `.env` handling (Vite/SvelteKit).
699
+
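The precedence rule can be illustrated with a small merge function. This is a sketch of the stated behavior, not the CLI's actual loader: `parseDotEnv` and `mergeEnv` are hypothetical helpers, and the parser handles only plain `KEY=value` lines (no quoting or multiline values).

```typescript
// Parse simple KEY=value lines; blank lines and "#" comments are skipped.
function parseDotEnv(contents: string): Record<string, string> {
  const parsed: Record<string, string> = {};
  for (const line of contents.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) continue;
    const eq = trimmed.indexOf("=");
    if (eq <= 0) continue; // no key before "=", or no "=" at all
    parsed[trimmed.slice(0, eq).trim()] = trimmed.slice(eq + 1).trim();
  }
  return parsed;
}

// Values already present in the environment win over .env file values.
function mergeEnv(
  existing: Record<string, string | undefined>,
  fileValues: Record<string, string>,
): Record<string, string> {
  const merged: Record<string, string> = { ...fileValues };
  for (const [key, value] of Object.entries(existing)) {
    if (value !== undefined) merged[key] = value;
  }
  return merged;
}

const fileValues = parseDotEnv(
  "# local settings\nOPENAI_API_KEY=from-file\nTURSO_AUTH_TOKEN=token-from-file\n",
);
const mergedEnv = mergeEnv({ OPENAI_API_KEY: "from-shell" }, fileValues);
console.log(mergedEnv);
```

In this example `OPENAI_API_KEY` keeps the value from the shell, while `TURSO_AUTH_TOKEN` is filled in from the file.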
700
+ ### Required
701
+
702
+ **OpenAI:**
703
+ - `OPENAI_API_KEY` — OpenAI API key for embeddings
704
+
705
+ ### Optional (Turso)
706
+
707
+ **Remote Turso (production):**
708
+ - `TURSO_DATABASE_URL` — Turso database URL (e.g., `libsql://my-db.turso.io`)
709
+ - `TURSO_AUTH_TOKEN` — Turso auth token
710
+
711
+ If these are not set, SearchSocket uses the local file DB at `.searchsocket/vectors.db`.
712
+
713
+ ### Optional (Rerank)
714
+
715
+ **Jina:**
716
+ - `JINA_API_KEY` — Jina reranker API key (if using `rerank: { provider: "jina" }`)
717
+
718
+ ### Optional (Scope/Build)
719
+
720
+ - `SEARCHSOCKET_SCOPE` — Override scope (when `scope.mode: "env"`)
721
+ - `SEARCHSOCKET_AUTO_INDEX` — Enable build-triggered indexing
722
+ - `SEARCHSOCKET_DISABLE_AUTO_INDEX` — Disable build-triggered indexing
723
+
724
+ ## Configuration
725
+
726
+ ### Full Example
727
+
728
+ ```ts
729
+ export default {
730
+ project: {
731
+ id: "my-site",
732
+ baseUrl: "https://example.com"
733
+ },
734
+
735
+ scope: {
736
+ mode: "git", // "fixed" | "git" | "env"
737
+ fixed: "main",
738
+ sanitize: true
739
+ },
740
+
741
+ source: {
742
+ mode: "build", // "static-output" | "crawl" | "content-files" | "build"
743
+ staticOutputDir: "build",
744
+ strictRouteMapping: false,
745
+
746
+ // Build mode (recommended for CI/CD)
747
+ build: {
748
+ outputDir: ".svelte-kit/output",
749
+ previewTimeout: 30000,
750
+ exclude: ["/api/*"],
751
+ paramValues: {
752
+ "/blog/[slug]": ["hello-world", "getting-started"]
753
+ }
754
+ },
755
+
756
+ // Crawl mode (alternative)
757
+ crawl: {
758
+ baseUrl: "http://localhost:4173",
759
+ routes: ["/", "/docs", "/blog"],
760
+ sitemapUrl: "https://example.com/sitemap.xml"
761
+ },
762
+
763
+ // Content files mode (alternative)
764
+ contentFiles: {
765
+ globs: ["src/routes/**/*.md"],
766
+ baseDir: "."
767
+ }
768
+ },
769
+
770
+ extract: {
771
+ mainSelector: "main",
772
+ dropTags: ["header", "nav", "footer", "aside"],
773
+ dropSelectors: [".sidebar", ".toc"],
774
+ ignoreAttr: "data-search-ignore",
775
+ noindexAttr: "data-search-noindex",
776
+ respectRobotsNoindex: true
777
+ },
778
+
779
+ chunking: {
780
+ maxChars: 2200,
781
+ overlapChars: 200,
782
+ minChars: 250,
783
+ headingPathDepth: 3,
784
+ dontSplitInside: ["code", "table", "blockquote"],
785
+ prependTitle: true, // prepend page title to chunk text before embedding
786
+ pageSummaryChunk: true // generate synthetic identity chunk per page
787
+ },
788
+
789
+ embeddings: {
790
+ provider: "openai",
791
+ model: "text-embedding-3-small",
792
+ apiKeyEnv: "OPENAI_API_KEY",
793
+ batchSize: 64,
794
+ concurrency: 4
795
+ },
796
+
797
+ vector: {
798
+ dimension: 1536, // optional, inferred from first embedding
799
+ turso: {
800
+ urlEnv: "TURSO_DATABASE_URL",
801
+ authTokenEnv: "TURSO_AUTH_TOKEN",
802
+ localPath: ".searchsocket/vectors.db"
803
+ }
804
+ },
805
+
806
+ rerank: {
807
+ provider: "jina", // "none" | "jina"
808
+ topN: 20,
809
+ jina: {
810
+ apiKeyEnv: "JINA_API_KEY",
811
+ model: "jina-reranker-v2-base-multilingual"
812
+ }
813
+ },
814
+
815
+ ranking: {
816
+ enableIncomingLinkBoost: true,
817
+ enableDepthBoost: true,
818
+ pageWeights: {
819
+ "/": 1.1,
820
+ "/docs": 1.15
821
+ },
822
+ aggregationCap: 5,
823
+ aggregationDecay: 0.5,
824
+ minChunkScoreRatio: 0.5,
825
+ weights: {
826
+ incomingLinks: 0.05,
827
+ depth: 0.03,
828
+ rerank: 1.0,
829
+ aggregation: 0.1
830
+ }
831
+ },
832
+
833
+ api: {
834
+ path: "/api/search",
835
+ cors: {
836
+ allowOrigins: ["https://example.com"]
837
+ },
838
+ rateLimit: {
839
+ windowMs: 60_000,
840
+ max: 60
841
+ }
842
+ }
843
+ };
844
+ ```
845
+
846
+ ## License
847
+
848
+ MIT