peeky-search 1.0.10

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md ADDED
@@ -0,0 +1,256 @@
# peeky-search

**Extract the exact answers from the web - before your LLM even runs.**

Stop dumping entire webpages into your context window. peeky-search uses information-retrieval techniques (BM25 + structural heuristics) to extract only the relevant passages, reducing token usage by ~97%.

```bash
npx peeky-search --search --query "zod transform vs refine" --max 3
```

```
# Search Results for: "zod transform vs refine"

Found 3 of 3 pages with relevant content.

## Zod: .transform() vs .refine() - Stack Overflow
Source: https://stackoverflow.com/questions/73715295

Use `.refine()` when you want to add custom validation logic that returns
true/false. Use `.transform()` when you want to modify the parsed value
before it's returned.

### Key difference
`.refine()` validates and returns the same type. `.transform()` can change
the output type entirely:

const stringToNumber = z.string().transform(val => parseInt(val));
// Input: string, Output: number
```

## Who this is for

- **Agent and RAG builders** who need verifiable excerpts, not raw HTML
- **MCP users** who want passage-level evidence from web sources
- **Developers** who need to see the actual text, not a summary of it

## Evidence vs Summaries

Built-in web search tools give you **summaries** - the AI reads pages and tells you what it found.

peeky-search gives you **evidence** - the actual excerpts from sources, scored by relevance, so you can see exactly what the documentation or Stack Overflow answer says.

| What you get | peeky-search | WebSearch |
|--------------|--------------|-----------|
| Output | Extracted passages with sources | AI-generated summary |
| Verifiable | Yes, you see the actual text | No, you trust the synthesis |
| Edge cases | Surfaces gotchas buried in discussions | May summarize them away |
| Speed | ~3-4 seconds | ~20-25 seconds |

### What peeky surfaces that summaries miss

We tested across 7 query types. Here's a sample of what peeky's passage extraction found that a summary alone wouldn't:

| Query | What peeky extracted |
|-------|----------------------|
| Zod .transform() | The `as const satisfies` pattern, `readonly` array gotchas |
| OpenAI o3 benchmarks | ARC Prize cost analysis: $27/task vs $5 to pay a human |
| Next.js hydration error | Material UI gotcha: `Typography` defaults to `<p>` |
| Python requests retry | Tenacity decorator patterns for error-specific strategies |
| CVE-2024-3094 xz backdoor | Links to Filippo Valsorda's analysis and xzbot reproduction repo |
| Bun vs better-sqlite3 | GitHub discussion where a maintainer debunks the benchmark methodology |

These are the details that save you hours - but they get lost in summaries.

### Example: Niche Benchmark Query

For `Bun SQLite vs better-sqlite3 performance`:

**WebSearch** returned Bun's official claims (3-6x faster) and some skepticism.

**peeky-search** found the actual GitHub discussion where a better-sqlite3 maintainer breaks down why the benchmark is misleading - showing that for real SQLite-heavy queries, better-sqlite3 can actually be faster. This is the kind of "ground truth" that saves you from making decisions based on marketing benchmarks.

## Installation

Requires [Docker](https://docker.com) to run the SearXNG search backend.

```bash
npx peeky-search setup
```

This will:
1. Check prerequisites (Docker installed and running)
2. Start a local SearXNG instance in Docker
3. Output the MCP config to add to your client

Then add the config to your MCP client and restart it.

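For reference, the config printed by `setup` has this shape (port 8888 shown here, matching the default `searxngUrl`; it follows whatever `--port` you chose):

```json
{
  "mcpServers": {
    "peeky-search": {
      "command": "npx",
      "args": ["-y", "peeky-search", "mcp"],
      "env": {
        "SEARXNG_URL": "http://localhost:8888"
      }
    }
  }
}
```
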
### Privacy

peeky-search runs entirely locally:
- **SearXNG** runs in Docker on your machine
- **Searches don't hit Anthropic, OpenAI, or any third party**
- No query logging, no telemetry, no data collection

Built-in web search tools route queries through the AI provider. You have no visibility into what happens to those queries.

### Commands

```bash
npx peeky-search setup              # One-time setup
npx peeky-search setup --port 9999  # Use custom port
npx peeky-search status             # Check if SearXNG is running
npx peeky-search start              # Start SearXNG
npx peeky-search stop               # Stop SearXNG
npx peeky-search uninstall          # Remove everything
```

## Usage

### MCP Tool

Once configured, your MCP client will have access to `peeky_web_search`:

**Input:**
```json
{
  "query": "react useEffect cleanup function",
  "maxResults": 5,
  "diagnostics": false
}
```

| Parameter | Type | Description |
|-----------|------|-------------|
| `query` | string | Search query. Supports `site:`, `"quotes"`, `-exclude` |
| `maxResults` | number | Pages to fetch (default 5, max 10) |
| `diagnostics` | boolean | Include filtering details (default false) |

**Output:**
```json
{
  "content": [
    {
      "type": "text",
      "text": "# Search Results for: \"react useEffect cleanup\"\n\nFound 4 of 5 pages...\n\n## Understanding useEffect Cleanup\nSource: https://react.dev/learn/...\n\nThe cleanup function runs before the component unmounts and before every re-render with changed dependencies..."
    }
  ]
}
```

### CLI

**Search mode** (uses SearXNG):
```bash
npx peeky-search --search --query "prisma vs drizzle orm" --max 5
```

**URL mode** (extract from a specific page):
```bash
npx peeky-search --url "https://docs.example.com/auth" --query "JWT refresh tokens"
```

**File mode** (extract from local HTML):
```bash
npx peeky-search --query "authentication" --file docs.html --debug
```

## How It Works

```
HTML → Strip boilerplate → Extract blocks → Segment sentences
  ↓
BM25 + Heuristics
  ↓
Rank → Select anchors
  ↓
Expand context → Deduplicate
  ↓
Assemble excerpts (budget)
```

1. **Preprocess**: Strip scripts, styles, nav, ads, and boilerplate
2. **Segment**: Extract blocks (headings, paragraphs, lists, code) into sentences
3. **Score**: BM25 for term relevance + 9 structural heuristics
4. **Select**: Pick top sentences with position/content diversity
5. **Expand**: Build context around anchors, respecting section boundaries
6. **Assemble**: Fit excerpts within character budget

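The scoring in step 3 can be sketched as a weighted blend. This is illustrative only, using the default weights from the Configuration section (`bm25Weight: 0.6`, `heuristicWeight: 0.4`, `minScore: 0.25`); the function names are hypothetical, not the package's actual API, and both inputs are assumed normalized to [0, 1]:

```typescript
// Illustrative sketch: blend a normalized BM25 score and a normalized
// heuristic score using the documented default weights.
const BM25_WEIGHT = 0.6;
const HEURISTIC_WEIGHT = 0.4;
const MIN_SCORE = 0.25; // sentences below this are not anchor candidates

function combinedScore(bm25: number, heuristics: number): number {
  return BM25_WEIGHT * bm25 + HEURISTIC_WEIGHT * heuristics;
}

function isAnchorCandidate(bm25: number, heuristics: number): boolean {
  return combinedScore(bm25, heuristics) >= MIN_SCORE;
}
```

With these defaults, a sentence scoring 0.5 on BM25 alone (0.6 × 0.5 = 0.30) already clears the 0.25 anchor threshold.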
## Performance

### Token reduction

| Metric | Traditional full-page fetch | peeky-search |
|--------|-----------------------------|--------------|
| Content per page | 30-80KB | 1-4KB |
| Tokens per page | ~15,000-40,000 | ~500-2,000 |
| 5-page search | ~200KB, ~50k tokens | ~6KB, ~1,500 tokens |

That's roughly **97% fewer tokens** for multi-page searches.

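That figure is straight arithmetic on the table's 5-page estimates:

```typescript
// Rough arithmetic behind the headline number, using the table's
// 5-page estimates (~50,000 tokens raw vs ~1,500 tokens extracted).
const rawTokens = 50_000;
const extractedTokens = 1_500;
const reduction = 1 - extractedTokens / rawTokens; // 0.97
console.log(`${(reduction * 100).toFixed(0)}% fewer tokens`); // prints "97% fewer tokens"
```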
### Speed

- **Extraction**: ~20-50ms per page (pure computation)
- **Search**: ~3-4s total for 5 pages (network-bound)
- **No LLM calls**: Extraction happens before the model sees anything

## Scoring System

**BM25** (weight: 0.6): classic lexical ranking based on term frequency and inverse document frequency.

**Heuristics** (weight: 0.4):

| Heuristic | Weight | What it measures |
|-----------|--------|------------------|
| headingPath | 0.17 | Query terms in section headings |
| coverage | 0.16 | IDF-weighted term coverage |
| proximity | 0.14 | How close query terms appear |
| headingProximity | 0.11 | Distance to matching heading |
| structure | 0.11 | Block type (headings, code valued higher) |
| density | 0.09 | Query term concentration |
| outlier | 0.09 | Anomaly detection for high-value sentences |
| metaSection | 0.08 | Penalizes intro/conclusion/meta content |
| position | 0.05 | Early content bonus |

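As a sketch, a blend like this is a weighted sum; the weights below are copied from the table (they sum to 1.0), and the helper name is hypothetical, not the package's actual API:

```typescript
// Weighted blend of structural heuristic scores. Weights are taken from
// the table above and sum to 1.0; each individual score is assumed to
// be normalized to [0, 1]. Missing scores default to 0.
const HEURISTIC_WEIGHTS: Record<string, number> = {
  headingPath: 0.17,
  coverage: 0.16,
  proximity: 0.14,
  headingProximity: 0.11,
  structure: 0.11,
  density: 0.09,
  outlier: 0.09,
  metaSection: 0.08,
  position: 0.05,
};

function heuristicScore(scores: Record<string, number>): number {
  let total = 0;
  for (const [name, weight] of Object.entries(HEURISTIC_WEIGHTS)) {
    total += weight * (scores[name] ?? 0);
  }
  return total;
}
```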
### Extraction Modes

- **strict**: For single-page extraction. Requires strong multi-term matches.
- **search**: For multi-page search. Looser thresholds, accepts partial matches.

## Configuration

### Pipeline Defaults

```typescript
{
  bm25Weight: 0.6,
  heuristicWeight: 0.4,
  maxAnchors: 5,
  minScore: 0.25,
  diversityThreshold: 0.4,
  contextBefore: 5,
  contextAfter: 8,
  maxExcerpts: 3,
  charBudget: 6000
}
```

### MCP Defaults

```typescript
{
  searxngUrl: "http://localhost:8888",
  maxResults: 5,
  timeout: 5000,
  perPageCharBudget: 3000,
  totalCharBudget: 12000
}
```

## Disclaimer

This tool fetches and extracts content from publicly accessible web pages. Users are responsible for ensuring their use complies with applicable laws and the terms of service of any websites accessed. The authors are not liable for misuse.

## License

MIT
@@ -0,0 +1,227 @@
// src/setup/docker.ts
import { exec, spawn } from "child_process";
import { promisify } from "util";
import * as fs from "fs";
import * as path from "path";
import * as os from "os";

// src/setup/templates.ts
import { randomBytes } from "crypto";

// 64-character hex secret used as SearXNG's server.secret_key.
function generateSecretKey() {
  return randomBytes(32).toString("hex");
}

// docker-compose.yml mapping the chosen host port to SearXNG's internal 8080.
function getDockerComposeTemplate(port) {
  return `services:
  searxng:
    image: searxng/searxng:latest
    container_name: peeky-searxng
    restart: unless-stopped
    ports:
      - "${port}:8080"
    volumes:
      - ./settings.yml:/etc/searxng/settings.yml:ro
`;
}
function getSettingsTemplate(secretKey) {
  return `use_default_settings: true

server:
  secret_key: "${secretKey}"
  limiter: false # no rate limiting for local use

search:
  safe_search: 0 # no filtering (technical content)
  default_lang: "en" # English results
  ban_time_on_fail: 5 # initial ban time (seconds)
  max_ban_time_on_fail: 60 # max ban time (reduced from default 120)
  suspended_times:
    SearxEngineTooManyRequests: 300 # 5 min (reduced from 1 hour)
    SearxEngineAccessDenied: 3600 # 1 hour (reduced from 24 hours)
    SearxEngineCaptcha: 3600 # 1 hour (reduced from 24 hours)
  formats:
    - html
    - json

outgoing:
  request_timeout: 4.0 # slightly longer timeout for reliability
  enable_http2: true # faster connections
  retries: 1 # retry failed requests once

# Engines: Brave + Google as primary, DuckDuckGo as fallback
# Each has retry settings for resilience
engines:
  - name: brave
    disabled: false
    timeout: 5.0
    retries: 1
    retry_on_http_error: [429, 503] # retry on rate limit and unavailable
  - name: google
    disabled: false
    timeout: 4.0
    retries: 1
    retry_on_http_error: [429, 503]
  - name: duckduckgo
    disabled: false # enabled as fallback
    timeout: 4.0
    retries: 1
  - name: startpage
    disabled: true
  - name: bing
    disabled: true
  - name: qwant
    disabled: true
  - name: mojeek
    disabled: true
  - name: yahoo
    disabled: true
`;
}
// Persisted install metadata (~/.peeky-search/config.json).
function getConfigTemplate(port) {
  return JSON.stringify(
    {
      port,
      installedAt: new Date().toISOString()
    },
    null,
    2
  );
}
// MCP client config snippet printed at the end of setup.
function getMcpConfigJson(port) {
  return JSON.stringify(
    {
      mcpServers: {
        "peeky-search": {
          command: "npx",
          args: ["-y", "peeky-search", "mcp"],
          env: {
            SEARXNG_URL: `http://localhost:${port}`
          }
        }
      }
    },
    null,
    2
  );
}

// src/setup/docker.ts
var execAsync = promisify(exec);
var CONFIG_DIR = path.join(os.homedir(), ".peeky-search");
function getConfigDir() {
  return CONFIG_DIR;
}
// Returns the parsed ~/.peeky-search/config.json, or null if missing or unreadable.
function readConfig() {
  const configPath = path.join(CONFIG_DIR, "config.json");
  if (!fs.existsSync(configPath)) {
    return null;
  }
  try {
    const content = fs.readFileSync(configPath, "utf8");
    return JSON.parse(content);
  } catch {
    return null;
  }
}
// Writes docker-compose.yml, settings.yml, and config.json into CONFIG_DIR.
function createConfigFiles(port) {
  if (!fs.existsSync(CONFIG_DIR)) {
    fs.mkdirSync(CONFIG_DIR, { recursive: true });
  }
  const secretKey = generateSecretKey();
  const dockerComposePath = path.join(CONFIG_DIR, "docker-compose.yml");
  const settingsPath = path.join(CONFIG_DIR, "settings.yml");
  const configPath = path.join(CONFIG_DIR, "config.json");
  fs.writeFileSync(dockerComposePath, getDockerComposeTemplate(port));
  fs.writeFileSync(settingsPath, getSettingsTemplate(secretKey));
  fs.writeFileSync(configPath, getConfigTemplate(port));
}
// Brings up the SearXNG container via docker compose.
async function startContainer() {
  const config = readConfig();
  if (!config) {
    return {
      success: false,
      message: "Config not found. Run 'peeky-search setup' first."
    };
  }
  try {
    await execAsync("docker compose up -d", { cwd: CONFIG_DIR });
    return { success: true, message: "Container started" };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { success: false, message: `Failed to start container: ${message}` };
  }
}
// Tears the container down; succeeds trivially if setup never ran.
async function stopContainer() {
  if (!fs.existsSync(CONFIG_DIR)) {
    return { success: true, message: "No config directory found" };
  }
  try {
    await execAsync("docker compose down", { cwd: CONFIG_DIR });
    return { success: true, message: "Container stopped" };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { success: false, message: `Failed to stop container: ${message}` };
  }
}
// Reports container state and whether SearXNG answers HTTP on the configured port.
async function getStatus() {
  const config = readConfig();
  const port = config?.port ?? null;
  let containerRunning = false;
  try {
    const { stdout } = await execAsync("docker ps --format '{{.Names}}'");
    containerRunning = stdout.includes("peeky-searxng");
  } catch {
    // docker not installed or daemon not running
  }
  let searxngResponding = false;
  if (port !== null) {
    try {
      const response = await fetch(`http://localhost:${port}/`, {
        signal: AbortSignal.timeout(5000)
      });
      searxngResponding = response.ok;
    } catch {
      // connection refused or timed out
    }
  }
  return { containerRunning, searxngResponding, port };
}
// Polls the search endpoint until it returns 200 or the timeout elapses.
async function waitForReady(port, timeoutMs = 30000) {
  const startTime = Date.now();
  const pollInterval = 1000;
  while (Date.now() - startTime < timeoutMs) {
    try {
      const response = await fetch(`http://localhost:${port}/search?q=test&format=json`, {
        signal: AbortSignal.timeout(2000)
      });
      if (response.ok) {
        return true;
      }
    } catch {
      // not ready yet; keep polling
    }
    await new Promise((resolve) => setTimeout(resolve, pollInterval));
  }
  return false;
}
// Attaches to the container's log stream (docker compose logs -f).
function streamLogs() {
  if (!fs.existsSync(CONFIG_DIR)) {
    console.error("Config directory not found");
    return;
  }
  const child = spawn("docker", ["compose", "logs", "-f"], {
    cwd: CONFIG_DIR,
    stdio: "inherit"
  });
  child.on("error", (err) => {
    console.error(`Failed to stream logs: ${err.message}`);
  });
}

export {
  getMcpConfigJson,
  getConfigDir,
  readConfig,
  createConfigFiles,
  startContainer,
  stopContainer,
  getStatus,
  waitForReady,
  streamLogs
};