peeky-search 1.0.10

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md ADDED
@@ -0,0 +1,256 @@
# peeky-search

**Extract the exact answers from the web - before your LLM even runs.**

Stop dumping entire webpages into your context window. peeky-search uses information-retrieval techniques (BM25 + structural heuristics) to extract only the relevant passages, reducing token usage by ~97%.

```bash
npx peeky-search --search --query "zod transform vs refine" --max 3
```

```
# Search Results for: "zod transform vs refine"

Found 3 of 3 pages with relevant content.

## Zod: .transform() vs .refine() - Stack Overflow
Source: https://stackoverflow.com/questions/73715295

Use `.refine()` when you want to add custom validation logic that returns
true/false. Use `.transform()` when you want to modify the parsed value
before it's returned.

### Key difference
`.refine()` validates and returns the same type. `.transform()` can change
the output type entirely:

const stringToNumber = z.string().transform(val => parseInt(val));
// Input: string, Output: number
```

## Who this is for

- **Agent and RAG builders** who need verifiable excerpts, not raw HTML
- **MCP users** who want passage-level evidence from web sources
- **Developers** who need to see the actual text, not a summary of it

## Evidence vs Summaries

Built-in web search tools give you **summaries** - the AI reads pages and tells you what it found.

peeky-search gives you **evidence** - the actual excerpts from sources, scored by relevance, so you can see exactly what the documentation or Stack Overflow answer says.

| What you get | peeky-search | WebSearch |
|--------------|--------------|-----------|
| Output | Extracted passages with sources | AI-generated summary |
| Verifiable | Yes, you see the actual text | No, you trust the synthesis |
| Edge cases | Surfaces gotchas buried in discussions | May summarize them away |
| Speed | ~3-4 seconds | ~20-25 seconds |

### What peeky surfaces that summaries miss

We tested across 7 query types. Here's a sample of what peeky's passage extraction found that a summary alone wouldn't:

| Query | What peeky extracted |
|-------|----------------------|
| Zod .transform() | The `as const satisfies` pattern, `readonly` array gotchas |
| OpenAI o3 benchmarks | ARC Prize cost analysis: $27/task vs $5 to pay a human |
| Next.js hydration error | Material UI gotcha: `Typography` defaults to `<p>` |
| Python requests retry | Tenacity decorator patterns for error-specific strategies |
| CVE-2024-3094 xz backdoor | Links to Filippo Valsorda's analysis and xzbot reproduction repo |
| Bun vs better-sqlite3 | GitHub discussion where a maintainer debunks the benchmark methodology |

These are the details that save you hours - but they get lost in summaries.

### Example: Niche Benchmark Query

For `Bun SQLite vs better-sqlite3 performance`:

**WebSearch** returned Bun's official claims (3-6x faster) and some skepticism.

**peeky-search** found the actual GitHub discussion where a better-sqlite3 maintainer breaks down why the benchmark is misleading - showing that for real SQLite-heavy queries, better-sqlite3 can actually be faster. This is the kind of "ground truth" that saves you from making decisions based on marketing benchmarks.

## Installation

Requires [Docker](https://docker.com) to run the SearXNG search backend.

```bash
npx peeky-search setup
```

This will:
1. Check prerequisites (Docker installed and running)
2. Start a local SearXNG instance in Docker
3. Output the MCP config to add to your client

Then add the config to your MCP client and restart it.

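For reference, the config printed by `setup` has this shape (port 8888 shown here, matching the default `searxngUrl`; it follows whatever `--port` you chose):

```json
{
  "mcpServers": {
    "peeky-search": {
      "command": "npx",
      "args": ["-y", "peeky-search", "mcp"],
      "env": {
        "SEARXNG_URL": "http://localhost:8888"
      }
    }
  }
}
```
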
### Privacy

peeky-search runs entirely locally:
- **SearXNG** runs in Docker on your machine
- **Searches don't hit Anthropic, OpenAI, or any third party**
- No query logging, no telemetry, no data collection

Built-in web search tools route queries through the AI provider. You have no visibility into what happens to those queries.

### Commands

```bash
npx peeky-search setup              # One-time setup
npx peeky-search setup --port 9999  # Use custom port
npx peeky-search status             # Check if SearXNG is running
npx peeky-search start              # Start SearXNG
npx peeky-search stop               # Stop SearXNG
npx peeky-search uninstall          # Remove everything
```

## Usage

### MCP Tool

Once configured, your MCP client will have access to `peeky_web_search`:

**Input:**
```json
{
  "query": "react useEffect cleanup function",
  "maxResults": 5,
  "diagnostics": false
}
```

| Parameter | Type | Description |
|-----------|------|-------------|
| `query` | string | Search query. Supports `site:`, `"quotes"`, `-exclude` |
| `maxResults` | number | Pages to fetch (default 5, max 10) |
| `diagnostics` | boolean | Include filtering details (default false) |

**Output:**
```json
{
  "content": [
    {
      "type": "text",
      "text": "# Search Results for: \"react useEffect cleanup\"\n\nFound 4 of 5 pages...\n\n## Understanding useEffect Cleanup\nSource: https://react.dev/learn/...\n\nThe cleanup function runs before the component unmounts and before every re-render with changed dependencies..."
    }
  ]
}
```

### CLI

**Search mode** (uses SearXNG):
```bash
npx peeky-search --search --query "prisma vs drizzle orm" --max 5
```

**URL mode** (extract from a specific page):
```bash
npx peeky-search --url "https://docs.example.com/auth" --query "JWT refresh tokens"
```

**File mode** (extract from local HTML):
```bash
npx peeky-search --query "authentication" --file docs.html --debug
```

## How It Works

```
HTML → Strip boilerplate → Extract blocks → Segment sentences
  ↓
BM25 + Heuristics
  ↓
Rank → Select anchors
  ↓
Expand context → Deduplicate
  ↓
Assemble excerpts (budget)
```

1. **Preprocess**: Strip scripts, styles, nav, ads, and boilerplate
2. **Segment**: Extract blocks (headings, paragraphs, lists, code) into sentences
3. **Score**: BM25 for term relevance + 9 structural heuristics
4. **Select**: Pick top sentences with position/content diversity
5. **Expand**: Build context around anchors, respecting section boundaries
6. **Assemble**: Fit excerpts within character budget

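The scoring in step 3 can be sketched as a weighted blend. This is illustrative only, using the default weights from the Configuration section (`bm25Weight: 0.6`, `heuristicWeight: 0.4`, `minScore: 0.25`); the function names are hypothetical, not the package's actual API, and both inputs are assumed normalized to [0, 1]:

```typescript
// Illustrative sketch: blend a normalized BM25 score and a normalized
// heuristic score using the documented default weights.
const BM25_WEIGHT = 0.6;
const HEURISTIC_WEIGHT = 0.4;
const MIN_SCORE = 0.25; // sentences below this are not anchor candidates

function combinedScore(bm25: number, heuristics: number): number {
  return BM25_WEIGHT * bm25 + HEURISTIC_WEIGHT * heuristics;
}

function isAnchorCandidate(bm25: number, heuristics: number): boolean {
  return combinedScore(bm25, heuristics) >= MIN_SCORE;
}
```

With these defaults, a sentence scoring 0.5 on BM25 alone (0.6 × 0.5 = 0.30) already clears the 0.25 anchor threshold.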
## Performance

### Token reduction

| Metric | Traditional full-page fetch | peeky-search |
|--------|-----------------------------|--------------|
| Content per page | 30-80KB | 1-4KB |
| Tokens per page | ~15,000-40,000 | ~500-2,000 |
| 5-page search | ~200KB, ~50k tokens | ~6KB, ~1,500 tokens |

That's roughly **97% fewer tokens** for multi-page searches.

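That figure is straight arithmetic on the table's 5-page estimates:

```typescript
// Rough arithmetic behind the headline number, using the table's
// 5-page estimates (~50,000 tokens raw vs ~1,500 tokens extracted).
const rawTokens = 50_000;
const extractedTokens = 1_500;
const reduction = 1 - extractedTokens / rawTokens; // 0.97
console.log(`${(reduction * 100).toFixed(0)}% fewer tokens`); // prints "97% fewer tokens"
```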
### Speed

- **Extraction**: ~20-50ms per page (pure computation)
- **Search**: ~3-4s total for 5 pages (network-bound)
- **No LLM calls**: Extraction happens before the model sees anything

## Scoring System

**BM25** (weight: 0.6): classic lexical ranking based on term frequency and inverse document frequency.

**Heuristics** (weight: 0.4):

| Heuristic | Weight | What it measures |
|-----------|--------|------------------|
| headingPath | 0.17 | Query terms in section headings |
| coverage | 0.16 | IDF-weighted term coverage |
| proximity | 0.14 | How close query terms appear |
| headingProximity | 0.11 | Distance to matching heading |
| structure | 0.11 | Block type (headings, code valued higher) |
| density | 0.09 | Query term concentration |
| outlier | 0.09 | Anomaly detection for high-value sentences |
| metaSection | 0.08 | Penalizes intro/conclusion/meta content |
| position | 0.05 | Early content bonus |

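As a sketch, a blend like this is a weighted sum; the weights below are copied from the table (they sum to 1.0), and the helper name is hypothetical, not the package's actual API:

```typescript
// Weighted blend of structural heuristic scores. Weights are taken from
// the table above and sum to 1.0; each individual score is assumed to
// be normalized to [0, 1]. Missing scores default to 0.
const HEURISTIC_WEIGHTS: Record<string, number> = {
  headingPath: 0.17,
  coverage: 0.16,
  proximity: 0.14,
  headingProximity: 0.11,
  structure: 0.11,
  density: 0.09,
  outlier: 0.09,
  metaSection: 0.08,
  position: 0.05,
};

function heuristicScore(scores: Record<string, number>): number {
  let total = 0;
  for (const [name, weight] of Object.entries(HEURISTIC_WEIGHTS)) {
    total += weight * (scores[name] ?? 0);
  }
  return total;
}
```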
### Extraction Modes

- **strict**: For single-page extraction. Requires strong multi-term matches.
- **search**: For multi-page search. Looser thresholds, accepts partial matches.

## Configuration

### Pipeline Defaults

```typescript
{
  bm25Weight: 0.6,
  heuristicWeight: 0.4,
  maxAnchors: 5,
  minScore: 0.25,
  diversityThreshold: 0.4,
  contextBefore: 5,
  contextAfter: 8,
  maxExcerpts: 3,
  charBudget: 6000
}
```

### MCP Defaults

```typescript
{
  searxngUrl: "http://localhost:8888",
  maxResults: 5,
  timeout: 5000,
  perPageCharBudget: 3000,
  totalCharBudget: 12000
}
```

## Disclaimer

This tool fetches and extracts content from publicly accessible web pages. Users are responsible for ensuring their use complies with applicable laws and the terms of service of any websites accessed. The authors are not liable for misuse.

## License

MIT
@@ -0,0 +1,227 @@
// src/setup/docker.ts
import { exec, spawn } from "child_process";
import { promisify } from "util";
import * as fs from "fs";
import * as path from "path";
import * as os from "os";

// src/setup/templates.ts
import { randomBytes } from "crypto";

// 64-character hex secret used as SearXNG's server.secret_key.
function generateSecretKey() {
  return randomBytes(32).toString("hex");
}

// docker-compose.yml mapping the chosen host port to SearXNG's internal 8080.
function getDockerComposeTemplate(port) {
  return `services:
  searxng:
    image: searxng/searxng:latest
    container_name: peeky-searxng
    restart: unless-stopped
    ports:
      - "${port}:8080"
    volumes:
      - ./settings.yml:/etc/searxng/settings.yml:ro
`;
}
function getSettingsTemplate(secretKey) {
  return `use_default_settings: true

server:
  secret_key: "${secretKey}"
  limiter: false # no rate limiting for local use

search:
  safe_search: 0 # no filtering (technical content)
  default_lang: "en" # English results
  ban_time_on_fail: 5 # initial ban time (seconds)
  max_ban_time_on_fail: 60 # max ban time (reduced from default 120)
  suspended_times:
    SearxEngineTooManyRequests: 300 # 5 min (reduced from 1 hour)
    SearxEngineAccessDenied: 3600 # 1 hour (reduced from 24 hours)
    SearxEngineCaptcha: 3600 # 1 hour (reduced from 24 hours)
  formats:
    - html
    - json

outgoing:
  request_timeout: 4.0 # slightly longer timeout for reliability
  enable_http2: true # faster connections
  retries: 1 # retry failed requests once

# Engines: Brave + Google as primary, DuckDuckGo as fallback
# Each has retry settings for resilience
engines:
  - name: brave
    disabled: false
    timeout: 5.0
    retries: 1
    retry_on_http_error: [429, 503] # retry on rate limit and unavailable
  - name: google
    disabled: false
    timeout: 4.0
    retries: 1
    retry_on_http_error: [429, 503]
  - name: duckduckgo
    disabled: false # enabled as fallback
    timeout: 4.0
    retries: 1
  - name: startpage
    disabled: true
  - name: bing
    disabled: true
  - name: qwant
    disabled: true
  - name: mojeek
    disabled: true
  - name: yahoo
    disabled: true
`;
}
// Persisted install metadata (~/.peeky-search/config.json).
function getConfigTemplate(port) {
  return JSON.stringify(
    {
      port,
      installedAt: new Date().toISOString()
    },
    null,
    2
  );
}
// MCP client config snippet printed at the end of setup.
function getMcpConfigJson(port) {
  return JSON.stringify(
    {
      mcpServers: {
        "peeky-search": {
          command: "npx",
          args: ["-y", "peeky-search", "mcp"],
          env: {
            SEARXNG_URL: `http://localhost:${port}`
          }
        }
      }
    },
    null,
    2
  );
}

// src/setup/docker.ts
var execAsync = promisify(exec);
var CONFIG_DIR = path.join(os.homedir(), ".peeky-search");
function getConfigDir() {
  return CONFIG_DIR;
}
// Returns the parsed ~/.peeky-search/config.json, or null if missing or unreadable.
function readConfig() {
  const configPath = path.join(CONFIG_DIR, "config.json");
  if (!fs.existsSync(configPath)) {
    return null;
  }
  try {
    const content = fs.readFileSync(configPath, "utf8");
    return JSON.parse(content);
  } catch {
    return null;
  }
}
// Writes docker-compose.yml, settings.yml, and config.json into CONFIG_DIR.
function createConfigFiles(port) {
  if (!fs.existsSync(CONFIG_DIR)) {
    fs.mkdirSync(CONFIG_DIR, { recursive: true });
  }
  const secretKey = generateSecretKey();
  const dockerComposePath = path.join(CONFIG_DIR, "docker-compose.yml");
  const settingsPath = path.join(CONFIG_DIR, "settings.yml");
  const configPath = path.join(CONFIG_DIR, "config.json");
  fs.writeFileSync(dockerComposePath, getDockerComposeTemplate(port));
  fs.writeFileSync(settingsPath, getSettingsTemplate(secretKey));
  fs.writeFileSync(configPath, getConfigTemplate(port));
}
// Brings up the SearXNG container via docker compose.
async function startContainer() {
  const config = readConfig();
  if (!config) {
    return {
      success: false,
      message: "Config not found. Run 'peeky-search setup' first."
    };
  }
  try {
    await execAsync("docker compose up -d", { cwd: CONFIG_DIR });
    return { success: true, message: "Container started" };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { success: false, message: `Failed to start container: ${message}` };
  }
}
// Tears the container down; succeeds trivially if setup never ran.
async function stopContainer() {
  if (!fs.existsSync(CONFIG_DIR)) {
    return { success: true, message: "No config directory found" };
  }
  try {
    await execAsync("docker compose down", { cwd: CONFIG_DIR });
    return { success: true, message: "Container stopped" };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { success: false, message: `Failed to stop container: ${message}` };
  }
}
// Reports container state and whether SearXNG answers HTTP on the configured port.
async function getStatus() {
  const config = readConfig();
  const port = config?.port ?? null;
  let containerRunning = false;
  try {
    const { stdout } = await execAsync("docker ps --format '{{.Names}}'");
    containerRunning = stdout.includes("peeky-searxng");
  } catch {
    // docker not installed or daemon not running
  }
  let searxngResponding = false;
  if (port !== null) {
    try {
      const response = await fetch(`http://localhost:${port}/`, {
        signal: AbortSignal.timeout(5000)
      });
      searxngResponding = response.ok;
    } catch {
      // connection refused or timed out
    }
  }
  return { containerRunning, searxngResponding, port };
}
// Polls the search endpoint until it returns 200 or the timeout elapses.
async function waitForReady(port, timeoutMs = 30000) {
  const startTime = Date.now();
  const pollInterval = 1000;
  while (Date.now() - startTime < timeoutMs) {
    try {
      const response = await fetch(`http://localhost:${port}/search?q=test&format=json`, {
        signal: AbortSignal.timeout(2000)
      });
      if (response.ok) {
        return true;
      }
    } catch {
      // not ready yet; keep polling
    }
    await new Promise((resolve) => setTimeout(resolve, pollInterval));
  }
  return false;
}
// Attaches to the container's log stream (docker compose logs -f).
function streamLogs() {
  if (!fs.existsSync(CONFIG_DIR)) {
    console.error("Config directory not found");
    return;
  }
  const child = spawn("docker", ["compose", "logs", "-f"], {
    cwd: CONFIG_DIR,
    stdio: "inherit"
  });
  child.on("error", (err) => {
    console.error(`Failed to stream logs: ${err.message}`);
  });
}

export {
  getMcpConfigJson,
  getConfigDir,
  readConfig,
  createConfigFiles,
  startContainer,
  stopContainer,
  getStatus,
  waitForReady,
  streamLogs
};