npm - @staticn0va/wigolo - Versions diffs - 0.5.1 → 0.6.1 - Mend

@staticn0va/wigolo 0.5.1 → 0.6.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (48) hide show

package/README.md +46 -34
package/SKILL.md +253 -89
package/dist/cli/doctor.d.ts.map +1 -1
package/dist/cli/doctor.js +17 -15
package/dist/cli/doctor.js.map +1 -1
package/dist/cli/warmup.d.ts.map +1 -1
package/dist/cli/warmup.js +93 -13
package/dist/cli/warmup.js.map +1 -1
package/dist/embedding/subprocess.d.ts.map +1 -1
package/dist/embedding/subprocess.js +2 -1
package/dist/embedding/subprocess.js.map +1 -1
package/dist/extraction/trafilatura.d.ts.map +1 -1
package/dist/extraction/trafilatura.js +3 -2
package/dist/extraction/trafilatura.js.map +1 -1
package/dist/fetch/router.d.ts +1 -0
package/dist/fetch/router.d.ts.map +1 -1
package/dist/fetch/router.js.map +1 -1
package/dist/instructions.d.ts +3 -3
package/dist/instructions.d.ts.map +1 -1
package/dist/instructions.js +19 -4
package/dist/instructions.js.map +1 -1
package/dist/python-env.d.ts +9 -0
package/dist/python-env.d.ts.map +1 -0
package/dist/python-env.js +18 -0
package/dist/python-env.js.map +1 -0
package/dist/search/find-similar.d.ts.map +1 -1
package/dist/search/find-similar.js +136 -29
package/dist/search/find-similar.js.map +1 -1
package/dist/search/flashrank.d.ts.map +1 -1
package/dist/search/flashrank.js +2 -1
package/dist/search/flashrank.js.map +1 -1
package/dist/server/backend-status.d.ts +2 -0
package/dist/server/backend-status.d.ts.map +1 -1
package/dist/server/backend-status.js +13 -0
package/dist/server/backend-status.js.map +1 -1
package/dist/server.d.ts.map +1 -1
package/dist/server.js +51 -3
package/dist/server.js.map +1 -1
package/dist/tools/fetch.d.ts.map +1 -1
package/dist/tools/fetch.js +6 -4
package/dist/tools/fetch.js.map +1 -1
package/dist/tools/search.d.ts +2 -2
package/dist/tools/search.d.ts.map +1 -1
package/dist/tools/search.js +55 -8
package/dist/tools/search.js.map +1 -1
package/dist/types.d.ts +8 -0
package/dist/types.d.ts.map +1 -1
package/package.json +2 -1

package/README.md CHANGED Viewed

@@ -10,11 +10,12 @@ Search, fetch, crawl, cache, and extract — zero API keys, zero cloud, zero cos
 [![Node.js](https://img.shields.io/badge/node-%3E%3D20-brightgreen)](https://nodejs.org)
 [![TypeScript](https://img.shields.io/badge/TypeScript-5.x-blue)](https://www.typescriptlang.org/)
-[Quick Start](#quick-start) · [Features](#features) · [Why wigolo?](#why-wigolo) · [Roadmap](#roadmap)
+[Quick Start](#quick-start) · [Features](#features) · [Why wigolo?](#why-wigolo)
 </div>
 ```
+$ npx @staticn0va/wigolo warmup --all
 $ claude mcp add wigolo -- npx @staticn0va/wigolo
 Added MCP server wigolo
@@ -27,6 +28,30 @@ wigolo gives AI coding agents (Claude Code, Cursor, Gemini CLI, Codex, Windsurf)
 ## Quick Start
+### 1. Warm up (required)
+Install Playwright, bootstrap SearXNG, install Python extras (FlashRank, Trafilatura, sentence-transformers), then verify the setup end-to-end:
+```bash
+npx @staticn0va/wigolo warmup --all
+```
+`--all` runs verification automatically: it starts SearXNG, runs a test search, checks every Python package, then shuts SearXNG down. You see proof everything works before connecting an agent. Re-run any time with `warmup --verify`.
+Flag menu:
+```bash
+npx @staticn0va/wigolo warmup                # Playwright + SearXNG only
+npx @staticn0va/wigolo warmup --all          # + reranker + trafilatura + embeddings + lightpanda + verify
+npx @staticn0va/wigolo warmup --reranker     # Install FlashRank (ML reranking)
+npx @staticn0va/wigolo warmup --trafilatura  # Install Trafilatura (content extraction)
+npx @staticn0va/wigolo warmup --embeddings   # Install sentence-transformers
+npx @staticn0va/wigolo warmup --verify       # Start SearXNG, test search, test Python packages
+npx @staticn0va/wigolo warmup --force        # Wipe SearXNG state/install/locks and re-bootstrap
+```
+### 2. Connect your agent
 **Claude Code:**
 ```bash
 claude mcp add wigolo -- npx @staticn0va/wigolo
@@ -44,12 +69,7 @@ claude mcp add wigolo -- npx @staticn0va/wigolo
 }
 ```
-**Optional warmup (improves quality on first use):**
-```bash
-npx @staticn0va/wigolo warmup          # Downloads Playwright + SearXNG
-npx @staticn0va/wigolo warmup --all  # + ML reranking + Trafilatura extraction
-npx @staticn0va/wigolo warmup --force     # Wipe SearXNG state/install/locks and re-bootstrap
-```
+> Skipping warmup still works — wigolo will bootstrap in the background on first tool call — but early searches will be lower quality until the install finishes. Running `warmup --all` up front is strongly recommended.
 ## Diagnostics
@@ -268,31 +288,6 @@ SearXNG bootstrap failures are self-healing: wigolo retries after 30 seconds, 1
 4. Readability.js — battle-tested Mozilla algorithm
 5. Raw Turndown — last resort HTML-to-markdown
-## Roadmap
-### v2.1 — Next
-- [x] Daemon mode — persistent HTTP server, zero startup latency
-- [ ] Browser interaction — click, type, scroll before extraction
-- [ ] Content change detection — diff monitoring for cached pages
-- [ ] CDP session discovery — attach to running Chrome for seamless auth
-- [ ] Plugin system — community extractors and search engines
-### v2.2
-- [ ] Multi-browser pool — Chromium + Firefox for fingerprint diversity
-- [ ] Interactive REPL (`wigolo shell`)
-- [x] Agent skill distribution — MCP registry listings, `SKILL.md`
-### v3 — The Knowledge Engine
-- [ ] Answer synthesis — search + LLM = direct answers with citations (bring your own key)
-- [ ] Semantic search — local vector embeddings over cached content (`findSimilar`)
-- [ ] Agent endpoint — describe what you need, no URLs required
-- [ ] Streaming answers — real-time generation as results come in
-- [ ] Knowledge graph — entity and relationship extraction from crawled content
-- [ ] Auto re-crawl scheduler — keep documentation fresh automatically
-- [ ] Lightpanda browser — optional ultra-lightweight headless browser (11x less RAM than Chrome)
-- [ ] Cloud sync — share cache across machines via rclone (S3, Drive, Dropbox)
-- [ ] Team knowledge base — shared indexed content across team members
 ## Discovery
 wigolo is listed on MCP server registries for agent discovery:
@@ -309,8 +304,19 @@ See `SKILL.md` for the full tool schema in agent-discovery format.
 ## Troubleshooting
+Start with `npx @staticn0va/wigolo doctor` — it reports the state of every component and is the fastest way to find the cause.
+**First search is slow or returns odd results**
+SearXNG is still bootstrapping in the background. Either wait a minute, or (recommended) run `npx @staticn0va/wigolo warmup --all` before connecting your agent.
+**FlashRank / Trafilatura / sentence-transformers "not installed"**
+These are optional Python extras. Install them with `npx @staticn0va/wigolo warmup --all` (or per-package: `--reranker`, `--trafilatura`, `--embeddings`). wigolo uses a private venv under `~/.wigolo/searxng/venv` so your system Python stays untouched.
 **SearXNG won't start**
-Make sure `python3` is on your PATH and version 3.8+. Check with `python3 --version`. Alternatively, set `SEARXNG_MODE=docker` if Docker is available.
+Make sure `python3` is on your PATH and version 3.8+. Check with `python3 --version`. If bootstrap got interrupted, `npx @staticn0va/wigolo warmup --force` wipes the state and reinstalls. Alternatively, set `SEARXNG_MODE=docker` if Docker is available.
+**Doctor reports SearXNG "not running"**
+That's expected when you haven't made a search yet — the process starts on-demand when the MCP server needs it. Doctor only marks it degraded if the install is broken.
 **Playwright browser not found**
 Run `npx @staticn0va/wigolo warmup` to download Chromium. This is done automatically on first use but can fail behind corporate proxies.
@@ -321,6 +327,12 @@ If SearXNG and all fallback engines fail, check your network connection. Behind
 **Permission errors on `~/.wigolo/`**
 wigolo stores its cache and SearXNG installation in `~/.wigolo/`. Ensure your user has write access. Override with `WIGOLO_DATA_DIR=/your/path`.
+**Start fresh**
+```bash
+rm -rf ~/.wigolo
+npx @staticn0va/wigolo warmup --all
+```
 ## Contributing
 PRs welcome. Open an issue first to discuss what you'd like to change.
@@ -353,4 +365,4 @@ Requires the `NPM_TOKEN` repository secret (npm automation token with publish sc
 ## License
-[BSL 1.1](LICENSE) — free for individuals, small teams (under $1M revenue), education, and open source. Converts to MIT on 2029-04-12.
+[BSL 1.1](LICENSE) — free for individuals, small teams (under $1M revenue), education, and open source. Converts to AGPL-3.0 on 2029-04-12.

package/SKILL.md CHANGED Viewed

@@ -1,6 +1,6 @@
 ---
 name: wigolo
-description: Local-first web search MCP server for AI coding agents. Search, fetch, crawl, cache, extract, find similar pages, deep research, and autonomous agent mode with zero API keys.
+description: Local-first web access MCP server for AI coding agents. Eight tools for search, fetch, crawl, cache, extract, find similar, research, and agent-driven data gathering. No API keys. Results cached in local SQLite.
 author: KnockOutEZ
 license: BUSL-1.1
 repository: https://github.com/KnockOutEZ/wigolo
@@ -9,29 +9,29 @@ install: npx @staticn0va/wigolo
 runtime: node
 min_runtime_version: "20"
 tools:
-  - name: search
-    description: Search the web and return results with optional full content extraction. Supports domain filtering, date ranges, categories, and ML reranking.
   - name: fetch
-    description: Fetch a web page and return its content as clean markdown. Supports JavaScript rendering, authenticated browsing, section extraction, and caching.
+    description: Fetch one URL, return clean markdown. Auto-routes between HTTP and Playwright. Supports sections, auth, screenshots, browser actions.
+  - name: search
+    description: Search the web, return extracted markdown per result. Single query or array of query variants. Domain, category, date filters. Optional synthesized answer via MCP sampling.
   - name: crawl
-    description: Crawl a website starting from a seed URL. Supports BFS, DFS, sitemap, and map (URL-only) strategies with depth/page limits and URL filtering.
+    description: Crawl a site from a seed URL. BFS, DFS, sitemap, or map (URL-only) strategies with regex include/exclude filters.
   - name: cache
-    description: Query the local knowledge base of previously fetched content. Full-text search over cached pages by query, URL pattern, or date. Cache stats and clearing.
+    description: FTS5 search over previously fetched content. URL glob, date filters, stats, clear, and change detection via re-fetch.
   - name: extract
-    description: Extract structured data from a web page. CSS selector extraction, HTML table parsing, metadata extraction (title, author, JSON-LD), and JSON Schema heuristic matching.
+    description: Structured extraction from URL or raw HTML. Modes: selector (CSS), tables, metadata (meta + JSON-LD), schema (heuristic field matching).
   - name: find_similar
-    description: Find pages semantically similar to a given URL or concept. Uses cached embeddings and web search to discover related content.
+    description: Find pages similar to a URL or concept. Hybrid cache (FTS5 + embeddings) + optional web supplement.
   - name: research
-    description: Deep multi-step research on a question. Decomposes into sub-queries, searches in parallel, fetches sources, and synthesizes a report with citations.
+    description: Multi-step research pipeline. Question decomposition, parallel sub-search, source synthesis with citations. Quick, standard, or comprehensive depth.
   - name: agent
-    description: Autonomous data gathering agent. Plans search queries from a prompt, fetches pages within budget, optionally extracts structured data via JSON Schema, and synthesizes results.
+    description: Natural-language data gathering. Plans searches/URLs, fetches in parallel within page and time budgets, optionally applies a JSON Schema to each page.
 ---
 # wigolo
-Local-first web search MCP server for AI coding agents.
+Local-first web search MCP server for AI coding agents. Ships eight tools over stdio. All network results land in a local SQLite cache.
-## Installation
+## Quick Setup
 **Claude Code:**
 ```bash
@@ -50,147 +50,311 @@ claude mcp add wigolo -- npx @staticn0va/wigolo
 }
 ```
-**Optional warmup (improves search quality):**
+**Warmup (recommended, one-time):**
 ```bash
-npx @staticn0va/wigolo warmup
+npx @staticn0va/wigolo warmup          # installs Playwright Chromium + bootstraps SearXNG
+npx @staticn0va/wigolo warmup --all    # also installs Firefox, WebKit, reranker, embeddings, trafilatura
+npx @staticn0va/wigolo warmup --force  # wipe SearXNG state and rebuild
 ```
+Warmup flags: `--force`, `--all`, `--trafilatura`, `--reranker`, `--firefox`, `--webkit`, `--embeddings`, `--lightpanda`.
 ## Tools
-### search
-Search the web and get full markdown content in one call.
+### fetch
+Fetch a single URL and return clean markdown. Use when you already have a specific URL.
+Parameters:
+- `url` (string, required)
+- `render_js`: `"auto"` (default) | `"always"` | `"never"`
+- `use_auth`: boolean (default `false`) — reuses the user's browser session
+- `max_chars`: number
+- `section`: string — return only the content under a heading
+- `section_index`: number (default `0`) — which heading match when multiple hit
+- `screenshot`: boolean (default `false`)
+- `headers`: object
+- `force_refresh`: boolean — bypass cache
+- `actions`: array of `{type, selector, text, ms, timeout, direction, amount}` — `click`, `type`, `wait`, `wait_for`, `scroll`, `screenshot`. Forces Playwright when present.
+Example:
 ```json
-{ "query": "React Server Components best practices", "max_results": 5, "include_domains": ["react.dev"] }
+{ "url": "https://react.dev/reference/react/useState", "section": "Parameters" }
 ```
-### fetch
-Fetch any URL and get clean markdown.
+Tip: `section` is much cheaper than reading the full page. Repeat fetches of the same URL are free from cache unless `force_refresh: true`.
+### search
+Search the web and return extracted markdown per result. Use when you don't have a URL yet.
+Parameters:
+- `query` (string OR `string[]`, required) — array runs variants in parallel and dedupes
+- `max_results`: number (default `5`, cap `20`)
+- `include_content`: boolean (default `true`)
+- `content_max_chars`: number (default `30000`)
+- `max_total_chars`: number (default `50000`)
+- `time_range`: `"day"` | `"week"` | `"month"` | `"year"`
+- `include_domains` / `exclude_domains`: `string[]`
+- `from_date` / `to_date`: ISO `YYYY-MM-DD`
+- `category`: `"general"` | `"news"` | `"code"` | `"docs"` | `"papers"` | `"images"`
+- `language`: string
+- `search_engines`: `string[]` — override engine selection
+- `format`: `"full"` (default) | `"context"` (token-budgeted string) | `"answer"` (synthesized via MCP sampling) | `"stream_answer"` (answer + phase progress notifications)
+- `force_refresh`: boolean
+Example:
 ```json
-{ "url": "https://docs.react.dev/reference/react/useState", "section": "Parameters" }
+{ "query": ["react server components patterns", "RSC data fetching", "react server components streaming"], "category": "docs", "include_domains": ["react.dev"], "max_results": 5 }
 ```
+Tip: keyword queries beat natural-language questions. A 3–5 item `query` array usually finds more unique sources than one longer query.
 ### crawl
-Crawl a site from a seed URL.
+Crawl a site starting from a seed URL.
+Parameters:
+- `url` (string, required)
+- `strategy`: `"bfs"` (default) | `"dfs"` | `"sitemap"` | `"map"` (URL-only discovery, no content)
+- `max_depth`: number (default `2`)
+- `max_pages`: number (default `20`)
+- `include_patterns` / `exclude_patterns`: regex `string[]`
+- `use_auth`: boolean (default `false`)
+- `extract_links`: boolean (default `false`) — returns inter-page link graph
+- `max_total_chars`: number (default `100000`)
+Example:
 ```json
-{ "url": "https://docs.example.com", "strategy": "sitemap", "max_pages": 50 }
+{ "url": "https://docs.python.org/3/library/", "strategy": "sitemap", "max_pages": 30, "include_patterns": ["^https://docs\\.python\\.org/3/library/asyncio"] }
 ```
+Tip: `strategy: "sitemap"` is faster and more complete than BFS on doc sites. `strategy: "map"` returns URLs only — cheap way to scope before targeted fetches.
 ### cache
-Query previously fetched content without hitting the network.
+Search previously fetched content without hitting the network.
+Parameters:
+- `query`: FTS5 syntax — supports `AND`, `OR`, `NOT`, `"exact phrase"`
+- `url_pattern`: glob (e.g. `"*react.dev*"`)
+- `since`: ISO date
+- `stats`: boolean — returns total URLs, size, date range
+- `clear`: boolean — deletes matching entries (requires one of `query`, `url_pattern`, `since`)
+- `check_changes`: boolean — re-fetches matching URLs, reports changed/unchanged with diff summaries
+Example:
 ```json
-{ "query": "React hooks", "url_pattern": "*react.dev*" }
+{ "query": "useState OR useReducer", "url_pattern": "*react.dev*" }
 ```
+Tip: cache hits are instant and cross-session. Run this before `search` or `fetch` when you suspect the content is already on disk.
 ### extract
-Structured data extraction from any URL or HTML.
+Structured extraction from URL or raw HTML.
+Parameters:
+- `url` OR `html` (one required; `url` wins if both provided)
+- `mode`: `"metadata"` (default) | `"selector"` | `"tables"` | `"schema"`
+- `css_selector`: string — required for `mode: "selector"`
+- `multiple`: boolean (default `false`) — return all matches, selector mode only
+- `schema`: JSON Schema object with `properties` — required for `mode: "schema"`
+Example:
 ```json
-{ "url": "https://example.com/product", "mode": "schema", "schema": { "type": "object", "properties": { "price": { "type": "string" }, "name": { "type": "string" } } } }
+{ "url": "https://example.com/product", "mode": "schema", "schema": { "type": "object", "properties": { "price": { "type": "string" }, "name": { "type": "string" }, "sku": { "type": "string" } } } }
 ```
+Tip: `mode: "schema"` does heuristic matching over CSS classes, ARIA labels, microdata, and JSON-LD — no LLM call required.
 ### find_similar
-Find pages related to a URL or concept.
+Find pages related to a URL or a free-text concept.
+Parameters:
+- `url` OR `concept` (one required)
+- `max_results`: number (default `10`, cap `50`)
+- `include_domains` / `exclude_domains`: `string[]`
+- `include_cache`: boolean (default `true`)
+- `include_web`: boolean (default `true`)
+Example:
 ```json
-{ "url": "https://react.dev/reference/react/useState", "max_results": 5 }
+{ "url": "https://react.dev/reference/react/useState", "max_results": 8, "include_domains": ["react.dev", "developer.mozilla.org"] }
 ```
+Tip: uses hybrid 3-way search — FTS5 over titles, FTS5 over body, plus embeddings when available. Cache path is near-instant; web supplement runs only if cache yields too few results.
 ### research
-Deep multi-step research that plans queries, fetches, and synthesizes.
-```json
-{ "question": "How do modern bundlers handle tree-shaking of ESM vs CJS", "depth": "standard", "max_sources": 10 }
-```
-### agent
-Autonomous data gathering from a natural-language prompt.
+Multi-step research pipeline with decomposition, parallel search, and cited synthesis.
+Parameters:
+- `question` (string, required)
+- `depth`: `"quick"` (~15s, 2 sub-queries, 5–8 sources) | `"standard"` (~40s, default) | `"comprehensive"` (~80s, 7 sub-queries, 20–25 sources)
+- `max_sources`: number (cap `50`) — overrides depth default
+- `include_domains` / `exclude_domains`: `string[]`
+- `schema`: JSON Schema — if present, report is structured to fill these fields
+- `stream`: boolean — emit progress notifications per phase
+Example:
 ```json
-{ "prompt": "Compare authentication strategies of Supabase, Firebase, and Clerk", "max_pages": 15, "max_time_ms": 90000 }
+{ "question": "How do modern JS bundlers tree-shake ESM vs CJS?", "depth": "standard", "include_domains": ["webpack.js.org", "rollupjs.org", "esbuild.github.io", "vitejs.dev"] }
 ```
-## Workflow Patterns
+Tip: `research` checks cache internally — no need to pre-probe. Requires MCP sampling-capable client for synthesis; without sampling, returns raw sources in context format.
-Use the right tool for the right situation.
+### agent
-**When you know the URL** -- use `fetch`. One URL, clean markdown. Add `section` to read only the heading you need.
+Natural-language data gathering. Plans queries and URLs from a prompt, runs them in parallel within budget, optionally applies a schema.
-**When you need to find information** -- use `search`. Formulate a keyword query (not a natural language question). Scope with `include_domains` and `category` when you know where the answer lives.
+Parameters:
+- `prompt` (string, required)
+- `urls`: `string[]` — seed URLs to include
+- `schema`: JSON Schema — extract structured fields per page and merge
+- `max_pages`: number (default `10`, cap `100`)
+- `max_time_ms`: number (default `60000`, cap `600000`)
+- `stream`: boolean
-**When you need multiple pages from one site** -- use `crawl`. For documentation sites, use `strategy: "sitemap"`. When you just want to discover what pages exist, use `strategy: "map"` (URL list only) then follow up with targeted `fetch` calls.
+Example:
+```json
+{ "prompt": "Compare pricing tiers for Supabase, Firebase, and Clerk", "schema": { "type": "object", "properties": { "provider": { "type": "string" }, "free_tier": { "type": "string" }, "paid_start": { "type": "string" } } }, "max_pages": 12 }
+```
-**When you need structured data** -- use `extract` with `mode: "tables"` or `mode: "schema"`. Do not use `fetch` when you need prices, specs, or table rows.
+Tip: output includes a `steps` array showing every action (plan, search, fetch, extract, synthesize) with timings. Use this to debug why an agent run produced a weak result.
-**When you already have content and want related pages** -- use `find_similar`. It searches the local cache by semantic similarity. No network calls needed.
+## Workflow Patterns
-**When you need a thorough answer on a complex topic** -- use `research`. It plans multiple search queries, fetches sources, and produces a cited synthesis. Prefer this over running 5+ manual search/fetch cycles.
+Quick routing:
+- Use when `search` — you need information but don't have a URL.
+- Use when `fetch` — you already have the URL.
+- Use when `crawl` — you need multiple pages from one site.
+- Use when `cache` — you want to check whether something is already on disk.
+- Use when `extract` — you need specific fields, tables, or metadata, not the whole page.
+- Use when `find_similar` — you have a good page/concept and want related content.
+- Use when `research` — a question needs decomposition and multi-source synthesis.
+- Use when `agent` — a natural-language task needs multi-step data gathering.
+**Cache-first lookup.** Before any `fetch` or `search`, probe the cache.
+```json
+cache({ "query": "oauth2 pkce", "url_pattern": "*auth0.com*" })
+// empty? fall through to search
+search({ "query": "oauth2 pkce flow", "include_domains": ["auth0.com"] })
+```
-**When the task requires multi-step data gathering** -- use `agent`. It breaks prompts into search queries and URL fetches, respects page and time budgets, and can extract structured data via JSON Schema.
+**Fresh content (news, dashboards, changelogs).** Bypass cache explicitly.
+```json
+search({ "query": "node.js 22 release notes", "force_refresh": true, "time_range": "week" })
+fetch({ "url": "https://nodejs.org/en/blog", "force_refresh": true })
+```
-**Before any network call** -- check `cache` first. Pages from prior sessions are still there. A cache hit is instant and free.
+**Scoped documentation research.** Crawl the relevant slice, then query cache.
+```json
+crawl({ "url": "https://docs.astro.build", "strategy": "sitemap", "max_pages": 40 })
+cache({ "query": "server islands hydration", "url_pattern": "*docs.astro.build*" })
+```
-## Parameter Optimization
+**Broad exploration.** Pass a query array; dedup is automatic.
+```json
+search({ "query": ["rust async runtimes comparison", "tokio vs async-std vs smol", "rust executor benchmarks"], "max_results": 8 })
+```
-### search
-- `max_results: 3` for focused lookups, `5` for exploration (default), `10+` for broad research
-- `include_domains` narrows to trusted sources -- always use when you know the domain
-- `category: "code"` for programming, `"docs"` for library docs, `"news"` for recent events
-- `from_date` / `to_date` for time-sensitive queries
-- `format: "context"` returns a single token-budgeted string for LLM injection
+**More like this.** Start with a known-good URL, widen via `find_similar`.
+```json
+find_similar({ "url": "https://react.dev/reference/react/useMemo", "max_results": 6, "include_domains": ["react.dev"] })
+```
-### fetch
-- `section: "heading text"` extracts only content under that heading -- much cheaper than the full page
-- `render_js: "never"` is fastest for static sites; `"always"` for SPAs
-- `use_auth: true` to access pages behind login using the user's browser session
+**Complex synthesis.** One `research` call replaces 5+ manual search/fetch cycles.
+```json
+research({ "question": "Tradeoffs of vector DBs for RAG at 100M+ embeddings", "depth": "comprehensive" })
+```
-### crawl
-- `strategy: "sitemap"` is 5-10x faster than BFS for doc sites
-- `strategy: "map"` returns URLs only -- use to scope a site before targeted fetches
-- `include_patterns` / `exclude_patterns` accept regex to stay in one section
+**Structured data from multiple sources.** Use `agent` with a schema.
+```json
+agent({ "prompt": "Find latency and pricing for top 5 edge compute providers", "schema": { "type": "object", "properties": { "provider": {"type":"string"}, "cold_start_ms": {"type":"string"}, "price_per_million": {"type":"string"} } } })
+```
-### research
-- `depth: "quick"` (~15s, 2 sub-queries), `"standard"` (~40s, default), `"comprehensive"` (~80s, 7 sub-queries)
-- `max_sources` overrides the default source count for the chosen depth
+**Table extraction.** Skip markdown entirely.
+```json
+extract({ "url": "https://en.wikipedia.org/wiki/List_of_programming_languages", "mode": "tables" })
+```
-### agent
-- `max_pages` caps total page fetches (default 10, max 100)
-- `max_time_ms` caps total execution time (default 60000)
-- `schema` enables structured extraction from each page -- results are merged across sources
+## Parameter Cheat Sheet
+| Situation | Tool + parameters |
+|---|---|
+| Focused lookup, known site | `search` + `max_results: 3` + `include_domains` |
+| Broad topic survey | `search` + `query: [...3-5 variants]` + `max_results: 8` |
+| Fresh content required | any tool + `force_refresh: true` |
+| Doc site indexing | `crawl` + `strategy: "sitemap"` |
+| Site URL inventory only | `crawl` + `strategy: "map"` |
+| Single heading from long page | `fetch` + `section: "..."` |
+| Behind login | `fetch` / `crawl` + `use_auth: true` |
+| Direct answer (sampling client) | `search` + `format: "answer"` |
+| LLM-ready context blob | `search` + `format: "context"` |
+| Complex question, multi-source | `research` + `depth: "standard"` |
+| Structured multi-page extraction | `agent` + `schema` |
+| One-page structured data | `extract` + `mode: "schema"` or `"tables"` |
+| Change tracking | `cache` + `check_changes: true` |
 ## Anti-Patterns
-These waste tokens, time, and rate limits. Avoid them.
+**Do not skip the cache.** Running `search` or `fetch` without probing `cache` wastes time on content already on disk. `research` and `agent` check cache internally; manual `search`/`fetch` do not.
-**Do not retry the same query.** If `search` returns no results, reformulate with different keywords. Repeating an identical query returns the same empty results.
+**Do not send natural-language questions to `search`.** Use keywords. `"how do I debounce in React hooks"` loses to `"react useDebounce hook custom"`.
-**Do not skip the cache.** Every `fetch`, `search`, and `crawl` result is cached locally. Before any network call, run `cache` with the URL pattern or query text. Cached results return instantly.
+**Do not retry an identical failing query.** Reformulate keywords, swap `category`, or add `include_domains`. Same query → same empty result.
-**Do not send natural language questions as search queries.** Search engines work best with keywords. Instead of `"What is the best way to handle authentication in Next.js?"`, use `"Next.js authentication best practices 2025"`.
+**Do not use `agent` or `research` for one-URL lookups.** Use `fetch`. `agent` is for multi-source gathering; `research` is for decomposable questions.
-**Do not use `agent` for simple lookups.** One fact from one URL = `fetch`. Quick search result = `search`. Reserve `agent` for tasks requiring multiple search/fetch cycles.
+**Do not crawl `max_pages: 100` without filters.** Always add `include_patterns` to stay in-scope. Unfiltered crawls fetch nav, footer, and sitemap garbage.
-**Do not use `research` when you already know the URLs.** If you have URLs to read, use `fetch` or `crawl`. `research` is for when you need the tool to discover sources autonomously.
+**Do not fetch whole pages when you need one section.** `fetch` + `section` reads under one heading only.
-**Do not fetch entire pages when you need one section.** Use `fetch` with `section` to extract just the relevant part.
+**Do not set `force_refresh: true` by default.** It defeats the cache. Use it for news, status, changelogs — content that actually churns.
-**Do not crawl with high max_pages without filtering.** A `max_pages: 100` crawl without `include_patterns` fetches navigation pages, footers, and irrelevant content.
+**Do not pass a JSON Schema to `extract` without `properties`.** The handler rejects schemas that lack a `properties` key.
-**Do not ignore `format: "context"` for search.** When injecting search results into a prompt, use `format: "context"` instead of manually concatenating results.
+## CLI Commands
+```bash
+wigolo                  # default: start MCP server on stdio
+wigolo mcp              # explicit: start MCP server
+wigolo warmup [flags]   # install Playwright, bootstrap SearXNG, optional extras
+wigolo serve            # start HTTP daemon on WIGOLO_DAEMON_PORT (default 3333)
+wigolo health           # health probe, exits 0 if ok
+wigolo doctor           # environment diagnostics (Python, Docker, Playwright, SearXNG)
+wigolo auth discover    # list CDP sessions (needs WIGOLO_CDP_URL)
+wigolo auth status      # show configured auth paths
+wigolo plugin add <git-url>    # clone plugin into ~/.wigolo/plugins/
+wigolo plugin list             # list installed plugins
+wigolo plugin remove <name>    # remove a plugin
+wigolo shell [--json]   # interactive REPL against subsystems
+```
-## Key Features
+## Configuration
-- Zero API keys required
-- Zero cloud dependency -- runs entirely local
-- Authenticated browsing (Chrome profiles, session state)
-- Localhost access (develop against local servers)
-- SQLite FTS5 cache with full-text search
-- ML reranking (optional, via FlashRank)
-- Extraction ensemble: site-specific, Defuddle, Trafilatura, Readability, Turndown
+Top environment variables. All optional — defaults are safe.
-## Requirements
+| Variable | Default | Purpose |
+|---|---|---|
+| `WIGOLO_DATA_DIR` | `~/.wigolo` | Cache DB, SearXNG state, plugins, embeddings |
+| `SEARXNG_URL` | unset | Point at an existing SearXNG (skips native bootstrap) |
+| `SEARXNG_MODE` | `native` | `native` runs local Python SearXNG; `docker` runs container |
+| `WIGOLO_CHROME_PROFILE_PATH` | unset | Chrome profile for `use_auth: true` |
+| `WIGOLO_CDP_URL` | unset | Chrome DevTools endpoint (e.g. `http://localhost:9222`) |
+| `MAX_BROWSERS` | `3` | Playwright pool size |
+| `WIGOLO_BROWSER_TYPES` | `chromium` | Comma list: `chromium,firefox,webkit` |
+| `WIGOLO_RERANKER` | `none` | `flashrank` for ML reranking |
+| `WIGOLO_EMBEDDING_MODEL` | `BAAI/bge-small-en-v1.5` | Used by `find_similar` |
+| `CACHE_TTL_CONTENT` | `604800` (7d) | Seconds before cached pages expire |
+| `LOG_LEVEL` | `info` | `debug` \| `info` \| `warn` \| `error` |
-- Node.js 20+
-- Python 3.8+ (recommended, for embedded SearXNG search)
-- Docker (optional, alternative to Python for SearXNG)
+Full list: see `src/config.ts`.
 ## Links
 - Repository: https://github.com/KnockOutEZ/wigolo
 - npm: https://www.npmjs.com/package/@staticn0va/wigolo
-- License: BSL 1.1 (converts to MIT on 2029-04-12)
+- License: BUSL-1.1 (converts to open source on 2029-04-12)

package/dist/cli/doctor.d.ts.map CHANGED Viewed

	@@ -1 +1 @@
1	- {"version":3,"file":"doctor.d.ts","sourceRoot":"","sources":["../../src/cli/doctor.ts"],"names":[],"mappings":"~~AAwDA~~;;;;;GAKG;AACH,wBAAsB,SAAS,CAAC,OAAO,EAAE,MAAM,GAAG,OAAO,CAAC,MAAM,CAAC,~~CAgFhE~~"}
1	+ {"version":3,"file":"doctor.d.ts","sourceRoot":"","sources":["../../src/cli/doctor.ts"],"names":[],"mappings":"AA4DA;;;;;GAKG;AACH,wBAAsB,SAAS,CAAC,OAAO,EAAE,MAAM,GAAG,OAAO,CAAC,MAAM,CAAC,CA+EhE"}