@staticn0va/wigolo 0.5.1 → 0.6.1

Files changed (48)
  1. package/README.md +46 -34
  2. package/SKILL.md +253 -89
  3. package/dist/cli/doctor.d.ts.map +1 -1
  4. package/dist/cli/doctor.js +17 -15
  5. package/dist/cli/doctor.js.map +1 -1
  6. package/dist/cli/warmup.d.ts.map +1 -1
  7. package/dist/cli/warmup.js +93 -13
  8. package/dist/cli/warmup.js.map +1 -1
  9. package/dist/embedding/subprocess.d.ts.map +1 -1
  10. package/dist/embedding/subprocess.js +2 -1
  11. package/dist/embedding/subprocess.js.map +1 -1
  12. package/dist/extraction/trafilatura.d.ts.map +1 -1
  13. package/dist/extraction/trafilatura.js +3 -2
  14. package/dist/extraction/trafilatura.js.map +1 -1
  15. package/dist/fetch/router.d.ts +1 -0
  16. package/dist/fetch/router.d.ts.map +1 -1
  17. package/dist/fetch/router.js.map +1 -1
  18. package/dist/instructions.d.ts +3 -3
  19. package/dist/instructions.d.ts.map +1 -1
  20. package/dist/instructions.js +19 -4
  21. package/dist/instructions.js.map +1 -1
  22. package/dist/python-env.d.ts +9 -0
  23. package/dist/python-env.d.ts.map +1 -0
  24. package/dist/python-env.js +18 -0
  25. package/dist/python-env.js.map +1 -0
  26. package/dist/search/find-similar.d.ts.map +1 -1
  27. package/dist/search/find-similar.js +136 -29
  28. package/dist/search/find-similar.js.map +1 -1
  29. package/dist/search/flashrank.d.ts.map +1 -1
  30. package/dist/search/flashrank.js +2 -1
  31. package/dist/search/flashrank.js.map +1 -1
  32. package/dist/server/backend-status.d.ts +2 -0
  33. package/dist/server/backend-status.d.ts.map +1 -1
  34. package/dist/server/backend-status.js +13 -0
  35. package/dist/server/backend-status.js.map +1 -1
  36. package/dist/server.d.ts.map +1 -1
  37. package/dist/server.js +51 -3
  38. package/dist/server.js.map +1 -1
  39. package/dist/tools/fetch.d.ts.map +1 -1
  40. package/dist/tools/fetch.js +6 -4
  41. package/dist/tools/fetch.js.map +1 -1
  42. package/dist/tools/search.d.ts +2 -2
  43. package/dist/tools/search.d.ts.map +1 -1
  44. package/dist/tools/search.js +55 -8
  45. package/dist/tools/search.js.map +1 -1
  46. package/dist/types.d.ts +8 -0
  47. package/dist/types.d.ts.map +1 -1
  48. package/package.json +2 -1
package/README.md CHANGED
@@ -10,11 +10,12 @@ Search, fetch, crawl, cache, and extract — zero API keys, zero cloud, zero cos
  [![Node.js](https://img.shields.io/badge/node-%3E%3D20-brightgreen)](https://nodejs.org)
  [![TypeScript](https://img.shields.io/badge/TypeScript-5.x-blue)](https://www.typescriptlang.org/)
 
- [Quick Start](#quick-start) · [Features](#features) · [Why wigolo?](#why-wigolo) · [Roadmap](#roadmap)
+ [Quick Start](#quick-start) · [Features](#features) · [Why wigolo?](#why-wigolo)
 
  </div>
 
  ```
+ $ npx @staticn0va/wigolo warmup --all
  $ claude mcp add wigolo -- npx @staticn0va/wigolo
  Added MCP server wigolo
 
@@ -27,6 +28,30 @@ wigolo gives AI coding agents (Claude Code, Cursor, Gemini CLI, Codex, Windsurf)
 
  ## Quick Start
 
+ ### 1. Warm up (required)
+
+ Install Playwright, bootstrap SearXNG, install Python extras (FlashRank, Trafilatura, sentence-transformers), then verify the setup end-to-end:
+
+ ```bash
+ npx @staticn0va/wigolo warmup --all
+ ```
+
+ `--all` runs verification automatically: it starts SearXNG, runs a test search, checks every Python package, then shuts SearXNG down. You see proof everything works before connecting an agent. Re-run any time with `warmup --verify`.
+
+ Flag menu:
+
+ ```bash
+ npx @staticn0va/wigolo warmup # Playwright + SearXNG only
+ npx @staticn0va/wigolo warmup --all # + reranker + trafilatura + embeddings + lightpanda + verify
+ npx @staticn0va/wigolo warmup --reranker # Install FlashRank (ML reranking)
+ npx @staticn0va/wigolo warmup --trafilatura # Install Trafilatura (content extraction)
+ npx @staticn0va/wigolo warmup --embeddings # Install sentence-transformers
+ npx @staticn0va/wigolo warmup --verify # Start SearXNG, test search, test Python packages
+ npx @staticn0va/wigolo warmup --force # Wipe SearXNG state/install/locks and re-bootstrap
+ ```
+
+ ### 2. Connect your agent
+
  **Claude Code:**
  ```bash
  claude mcp add wigolo -- npx @staticn0va/wigolo
@@ -44,12 +69,7 @@ claude mcp add wigolo -- npx @staticn0va/wigolo
  }
  ```
 
- **Optional warmup (improves quality on first use):**
- ```bash
- npx @staticn0va/wigolo warmup # Downloads Playwright + SearXNG
- npx @staticn0va/wigolo warmup --all # + ML reranking + Trafilatura extraction
- npx @staticn0va/wigolo warmup --force # Wipe SearXNG state/install/locks and re-bootstrap
- ```
+ > Skipping warmup still works — wigolo will bootstrap in the background on first tool call — but early searches will be lower quality until the install finishes. Running `warmup --all` up front is strongly recommended.
 
  ## Diagnostics
 
@@ -268,31 +288,6 @@ SearXNG bootstrap failures are self-healing: wigolo retries after 30 seconds, 1
  4. Readability.js — battle-tested Mozilla algorithm
  5. Raw Turndown — last resort HTML-to-markdown
 
- ## Roadmap
-
- ### v2.1 — Next
- - [x] Daemon mode — persistent HTTP server, zero startup latency
- - [ ] Browser interaction — click, type, scroll before extraction
- - [ ] Content change detection — diff monitoring for cached pages
- - [ ] CDP session discovery — attach to running Chrome for seamless auth
- - [ ] Plugin system — community extractors and search engines
-
- ### v2.2
- - [ ] Multi-browser pool — Chromium + Firefox for fingerprint diversity
- - [ ] Interactive REPL (`wigolo shell`)
- - [x] Agent skill distribution — MCP registry listings, `SKILL.md`
-
- ### v3 — The Knowledge Engine
- - [ ] Answer synthesis — search + LLM = direct answers with citations (bring your own key)
- - [ ] Semantic search — local vector embeddings over cached content (`findSimilar`)
- - [ ] Agent endpoint — describe what you need, no URLs required
- - [ ] Streaming answers — real-time generation as results come in
- - [ ] Knowledge graph — entity and relationship extraction from crawled content
- - [ ] Auto re-crawl scheduler — keep documentation fresh automatically
- - [ ] Lightpanda browser — optional ultra-lightweight headless browser (11x less RAM than Chrome)
- - [ ] Cloud sync — share cache across machines via rclone (S3, Drive, Dropbox)
- - [ ] Team knowledge base — shared indexed content across team members
-
  ## Discovery
 
  wigolo is listed on MCP server registries for agent discovery:
@@ -309,8 +304,19 @@ See `SKILL.md` for the full tool schema in agent-discovery format.
 
  ## Troubleshooting
 
+ Start with `npx @staticn0va/wigolo doctor` — it reports the state of every component and is the fastest way to find the cause.
+
+ **First search is slow or returns odd results**
+ SearXNG is still bootstrapping in the background. Either wait a minute, or (recommended) run `npx @staticn0va/wigolo warmup --all` before connecting your agent.
+
+ **FlashRank / Trafilatura / sentence-transformers "not installed"**
+ These are optional Python extras. Install them with `npx @staticn0va/wigolo warmup --all` (or per-package: `--reranker`, `--trafilatura`, `--embeddings`). wigolo uses a private venv under `~/.wigolo/searxng/venv` so your system Python stays untouched.
+
  **SearXNG won't start**
- Make sure `python3` is on your PATH and version 3.8+. Check with `python3 --version`. Alternatively, set `SEARXNG_MODE=docker` if Docker is available.
+ Make sure `python3` is on your PATH and version 3.8+. Check with `python3 --version`. If bootstrap got interrupted, `npx @staticn0va/wigolo warmup --force` wipes the state and reinstalls. Alternatively, set `SEARXNG_MODE=docker` if Docker is available.
+
+ **Doctor reports SearXNG "not running"**
+ That's expected when you haven't made a search yet — the process starts on-demand when the MCP server needs it. Doctor only marks it degraded if the install is broken.
 
  **Playwright browser not found**
  Run `npx @staticn0va/wigolo warmup` to download Chromium. This is done automatically on first use but can fail behind corporate proxies.
@@ -321,6 +327,12 @@ If SearXNG and all fallback engines fail, check your network connection. Behind
  **Permission errors on `~/.wigolo/`**
  wigolo stores its cache and SearXNG installation in `~/.wigolo/`. Ensure your user has write access. Override with `WIGOLO_DATA_DIR=/your/path`.
 
+ **Start fresh**
+ ```bash
+ rm -rf ~/.wigolo
+ npx @staticn0va/wigolo warmup --all
+ ```
+
  ## Contributing
 
  PRs welcome. Open an issue first to discuss what you'd like to change.
@@ -353,4 +365,4 @@ Requires the `NPM_TOKEN` repository secret (npm automation token with publish sc
 
  ## License
 
- [BSL 1.1](LICENSE) — free for individuals, small teams (under $1M revenue), education, and open source. Converts to MIT on 2029-04-12.
+ [BSL 1.1](LICENSE) — free for individuals, small teams (under $1M revenue), education, and open source. Converts to AGPL-3.0 on 2029-04-12.
package/SKILL.md CHANGED
@@ -1,6 +1,6 @@
  ---
  name: wigolo
- description: Local-first web search MCP server for AI coding agents. Search, fetch, crawl, cache, extract, find similar pages, deep research, and autonomous agent mode with zero API keys.
+ description: Local-first web access MCP server for AI coding agents. Eight tools for search, fetch, crawl, cache, extract, find similar, research, and agent-driven data gathering. No API keys. Results cached in local SQLite.
  author: KnockOutEZ
  license: BUSL-1.1
  repository: https://github.com/KnockOutEZ/wigolo
@@ -9,29 +9,29 @@ install: npx @staticn0va/wigolo
  runtime: node
  min_runtime_version: "20"
  tools:
- - name: search
- description: Search the web and return results with optional full content extraction. Supports domain filtering, date ranges, categories, and ML reranking.
  - name: fetch
- description: Fetch a web page and return its content as clean markdown. Supports JavaScript rendering, authenticated browsing, section extraction, and caching.
+ description: Fetch one URL, return clean markdown. Auto-routes between HTTP and Playwright. Supports sections, auth, screenshots, browser actions.
+ - name: search
+ description: Search the web, return extracted markdown per result. Single query or array of query variants. Domain, category, date filters. Optional synthesized answer via MCP sampling.
  - name: crawl
- description: Crawl a website starting from a seed URL. Supports BFS, DFS, sitemap, and map (URL-only) strategies with depth/page limits and URL filtering.
+ description: Crawl a site from a seed URL. BFS, DFS, sitemap, or map (URL-only) strategies with regex include/exclude filters.
  - name: cache
- description: Query the local knowledge base of previously fetched content. Full-text search over cached pages by query, URL pattern, or date. Cache stats and clearing.
+ description: FTS5 search over previously fetched content. URL glob, date filters, stats, clear, and change detection via re-fetch.
  - name: extract
- description: Extract structured data from a web page. CSS selector extraction, HTML table parsing, metadata extraction (title, author, JSON-LD), and JSON Schema heuristic matching.
+ description: Structured extraction from URL or raw HTML. Modes: selector (CSS), tables, metadata (meta + JSON-LD), schema (heuristic field matching).
  - name: find_similar
- description: Find pages semantically similar to a given URL or concept. Uses cached embeddings and web search to discover related content.
+ description: Find pages similar to a URL or concept. Hybrid cache (FTS5 + embeddings) + optional web supplement.
  - name: research
- description: Deep multi-step research on a question. Decomposes into sub-queries, searches in parallel, fetches sources, and synthesizes a report with citations.
+ description: Multi-step research pipeline. Question decomposition, parallel sub-search, source synthesis with citations. Quick, standard, or comprehensive depth.
  - name: agent
- description: Autonomous data gathering agent. Plans search queries from a prompt, fetches pages within budget, optionally extracts structured data via JSON Schema, and synthesizes results.
+ description: Natural-language data gathering. Plans searches/URLs, fetches in parallel within page and time budgets, optionally applies a JSON Schema to each page.
  ---
 
  # wigolo
 
- Local-first web search MCP server for AI coding agents.
+ Local-first web search MCP server for AI coding agents. Ships eight tools over stdio. All network results land in a local SQLite cache.
 
- ## Installation
+ ## Quick Setup
 
  **Claude Code:**
  ```bash
@@ -50,147 +50,311 @@ claude mcp add wigolo -- npx @staticn0va/wigolo
  }
  ```
 
- **Optional warmup (improves search quality):**
+ **Warmup (recommended, one-time):**
  ```bash
- npx @staticn0va/wigolo warmup
+ npx @staticn0va/wigolo warmup # installs Playwright Chromium + bootstraps SearXNG
+ npx @staticn0va/wigolo warmup --all # also installs Firefox, WebKit, reranker, embeddings, trafilatura
+ npx @staticn0va/wigolo warmup --force # wipe SearXNG state and rebuild
  ```
 
+ Warmup flags: `--force`, `--all`, `--trafilatura`, `--reranker`, `--firefox`, `--webkit`, `--embeddings`, `--lightpanda`.
+
  ## Tools
 
- ### search
- Search the web and get full markdown content in one call.
+ ### fetch
+
+ Fetch a single URL and return clean markdown. Use when you already have a specific URL.
+
+ Parameters:
+ - `url` (string, required)
+ - `render_js`: `"auto"` (default) | `"always"` | `"never"`
+ - `use_auth`: boolean (default `false`) — reuses the user's browser session
+ - `max_chars`: number
+ - `section`: string — return only the content under a heading
+ - `section_index`: number (default `0`) — which heading match when multiple hit
+ - `screenshot`: boolean (default `false`)
+ - `headers`: object
+ - `force_refresh`: boolean — bypass cache
+ - `actions`: array of `{type, selector, text, ms, timeout, direction, amount}` — `click`, `type`, `wait`, `wait_for`, `scroll`, `screenshot`. Forces Playwright when present.
+
+ Example:
  ```json
- { "query": "React Server Components best practices", "max_results": 5, "include_domains": ["react.dev"] }
+ { "url": "https://react.dev/reference/react/useState", "section": "Parameters" }
  ```
 
- ### fetch
- Fetch any URL and get clean markdown.
+ Tip: `section` is much cheaper than reading the full page. Repeat fetches of the same URL are free from cache unless `force_refresh: true`.
+
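Reviewer note: the `section` / `section_index` pair added for `fetch` reads as "pick the Nth heading that matches, return what sits under it". The diff does not show how wigolo matches headings, so the following is only a minimal sketch of that idea under stated assumptions: ATX-style (`#`) headings, case-insensitive substring matching, and a section that ends at the next heading of the same or higher level. `extractSection` is a hypothetical helper, not a wigolo function.

```typescript
// Hypothetical illustration of section selection; not wigolo's implementation.
// Assumes ATX headings and ignores '#' lines inside fenced code blocks.
function extractSection(markdown: string, heading: string, index = 0): string | null {
  const lines = markdown.split("\n");
  const needle = heading.trim().toLowerCase();
  let matches = 0;
  for (let i = 0; i < lines.length; i++) {
    const m = /^(#{1,6})\s+(.*)$/.exec(lines[i] ?? "");
    // Skip non-heading lines and headings that do not contain the needle.
    if (!m || !(m[2] ?? "").toLowerCase().includes(needle)) continue;
    // Skip matches until we reach the requested index.
    if (matches++ !== index) continue;
    const level = (m[1] ?? "").length;
    let end = i + 1;
    // The section runs until the next heading of the same or a higher level.
    while (end < lines.length) {
      const next = /^(#{1,6})\s+/.exec(lines[end] ?? "");
      if (next && (next[1] ?? "").length <= level) break;
      end++;
    }
    return lines.slice(i, end).join("\n").trim();
  }
  return null; // no heading matched at that index
}
```

Calling `extractSection(page, "Parameters")` on a page's markdown would return only the block under the first matching heading, which is why the tip above describes `section` as cheaper than reading the full page.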
+ ### search
+
+ Search the web and return extracted markdown per result. Use when you don't have a URL yet.
+
+ Parameters:
+ - `query` (string OR `string[]`, required) — array runs variants in parallel and dedupes
+ - `max_results`: number (default `5`, cap `20`)
+ - `include_content`: boolean (default `true`)
+ - `content_max_chars`: number (default `30000`)
+ - `max_total_chars`: number (default `50000`)
+ - `time_range`: `"day"` | `"week"` | `"month"` | `"year"`
+ - `include_domains` / `exclude_domains`: `string[]`
+ - `from_date` / `to_date`: ISO `YYYY-MM-DD`
+ - `category`: `"general"` | `"news"` | `"code"` | `"docs"` | `"papers"` | `"images"`
+ - `language`: string
+ - `search_engines`: `string[]` — override engine selection
+ - `format`: `"full"` (default) | `"context"` (token-budgeted string) | `"answer"` (synthesized via MCP sampling) | `"stream_answer"` (answer + phase progress notifications)
+ - `force_refresh`: boolean
+
+ Example:
  ```json
- { "url": "https://docs.react.dev/reference/react/useState", "section": "Parameters" }
+ { "query": ["react server components patterns", "RSC data fetching", "react server components streaming"], "category": "docs", "include_domains": ["react.dev"], "max_results": 5 }
  ```
 
+ Tip: keyword queries beat natural-language questions. A 3–5 item `query` array usually finds more unique sources than one longer query.
+
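Reviewer note: since `query` now accepts an array and `max_results` has a documented default of 5 and cap of 20, a caller may want to tidy arguments before sending them. The sketch below shows one way to do that on the client side. `SearchArgs` and `normalizeSearchArgs` are hypothetical names, only a subset of the documented parameters is modeled, and the server presumably applies its own validation and deduplication regardless.

```typescript
// Hypothetical client-side helper; not part of the wigolo package.
type SearchArgs = {
  query: string | string[];
  max_results?: number;
  include_domains?: string[];
  category?: "general" | "news" | "code" | "docs" | "papers" | "images";
};

const MAX_RESULTS_DEFAULT = 5; // default listed in the SKILL.md parameters
const MAX_RESULTS_CAP = 20;    // cap listed in the SKILL.md parameters

function normalizeSearchArgs(args: SearchArgs): SearchArgs {
  // Trim each variant, then drop empties and case-insensitive duplicates.
  const variants = Array.isArray(args.query) ? args.query : [args.query];
  const seen = new Set<string>();
  const unique: string[] = [];
  for (const raw of variants) {
    const q = raw.trim();
    const key = q.toLowerCase();
    if (q.length > 0 && !seen.has(key)) {
      seen.add(key);
      unique.push(q);
    }
  }
  const [first, ...rest] = unique;
  if (first === undefined) {
    throw new Error("search requires at least one non-empty query");
  }
  // Clamp max_results into [1, cap]; fall back to the default when unset or invalid.
  const requested = args.max_results ?? MAX_RESULTS_DEFAULT;
  const safe = Number.isFinite(requested) ? Math.floor(requested) : MAX_RESULTS_DEFAULT;
  const max_results = Math.min(Math.max(1, safe), MAX_RESULTS_CAP);
  // Send a plain string when only one variant survives.
  return { ...args, query: rest.length === 0 ? first : unique, max_results };
}
```

Collapsing a one-element array back to a string is a stylistic choice here; the parameter list says both shapes are accepted.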
  ### crawl
- Crawl a site from a seed URL.
+
+ Crawl a site starting from a seed URL.
+
+ Parameters:
+ - `url` (string, required)
+ - `strategy`: `"bfs"` (default) | `"dfs"` | `"sitemap"` | `"map"` (URL-only discovery, no content)
+ - `max_depth`: number (default `2`)
+ - `max_pages`: number (default `20`)
+ - `include_patterns` / `exclude_patterns`: regex `string[]`
+ - `use_auth`: boolean (default `false`)
+ - `extract_links`: boolean (default `false`) — returns inter-page link graph
+ - `max_total_chars`: number (default `100000`)
+
+ Example:
  ```json
- { "url": "https://docs.example.com", "strategy": "sitemap", "max_pages": 50 }
+ { "url": "https://docs.python.org/3/library/", "strategy": "sitemap", "max_pages": 30, "include_patterns": ["^https://docs\\.python\\.org/3/library/asyncio"] }
  ```
 
+ Tip: `strategy: "sitemap"` is faster and more complete than BFS on doc sites. `strategy: "map"` returns URLs only — cheap way to scope before targeted fetches.
+
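Reviewer note: `include_patterns` / `exclude_patterns` are documented as regex strings, and it can help to preview what a pattern keeps before spending a crawl on it, for example against the URL list returned by `strategy: "map"`. The sketch below assumes a URL is kept when it matches at least one include pattern (or no include patterns are given) and matches no exclude pattern. The diff does not confirm that wigolo combines the two lists this way, and `filterUrls` is a hypothetical helper.

```typescript
// Hypothetical preview of crawl scoping; the combination rule is an assumption.
function filterUrls(urls: string[], include: string[] = [], exclude: string[] = []): string[] {
  // Compile once; no "g" flag, so RegExp.test stays stateless across calls.
  const inc = include.map((p) => new RegExp(p));
  const exc = exclude.map((p) => new RegExp(p));
  return urls.filter(
    (u) =>
      (inc.length === 0 || inc.some((re) => re.test(u))) &&
      !exc.some((re) => re.test(u)),
  );
}
```

In TypeScript source the pattern from the example above is written `"^https://docs\\.python\\.org/3/library/asyncio"`, with the same doubled backslashes as in JSON.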
  ### cache
- Query previously fetched content without hitting the network.
+
+ Search previously fetched content without hitting the network.
+
+ Parameters:
+ - `query`: FTS5 syntax — supports `AND`, `OR`, `NOT`, `"exact phrase"`
+ - `url_pattern`: glob (e.g. `"*react.dev*"`)
+ - `since`: ISO date
+ - `stats`: boolean — returns total URLs, size, date range
+ - `clear`: boolean — deletes matching entries (requires one of `query`, `url_pattern`, `since`)
+ - `check_changes`: boolean — re-fetches matching URLs, reports changed/unchanged with diff summaries
+
+ Example:
  ```json
- { "query": "React hooks", "url_pattern": "*react.dev*" }
+ { "query": "useState OR useReducer", "url_pattern": "*react.dev*" }
  ```
 
+ Tip: cache hits are instant and cross-session. Run this before `search` or `fetch` when you suspect the content is already on disk.
+
  ### extract
- Structured data extraction from any URL or HTML.
+
+ Structured extraction from URL or raw HTML.
+
+ Parameters:
+ - `url` OR `html` (one required; `url` wins if both provided)
+ - `mode`: `"metadata"` (default) | `"selector"` | `"tables"` | `"schema"`
+ - `css_selector`: string — required for `mode: "selector"`
+ - `multiple`: boolean (default `false`) — return all matches, selector mode only
+ - `schema`: JSON Schema object with `properties` — required for `mode: "schema"`
+
+ Example:
  ```json
- { "url": "https://example.com/product", "mode": "schema", "schema": { "type": "object", "properties": { "price": { "type": "string" }, "name": { "type": "string" } } } }
+ { "url": "https://example.com/product", "mode": "schema", "schema": { "type": "object", "properties": { "price": { "type": "string" }, "name": { "type": "string" }, "sku": { "type": "string" } } } }
  ```
 
+ Tip: `mode: "schema"` does heuristic matching over CSS classes, ARIA labels, microdata, and JSON-LD — no LLM call required.
+
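Reviewer note: the `extract` parameter list carries several "required when" rules (`url` or `html`, `css_selector` for selector mode, a `schema` with `properties` for schema mode). A small pre-flight check can catch these before a tool call is made. The sketch below encodes only the rules stated in the list above. `ExtractArgs` and `checkExtractArgs` are hypothetical names, and the messages are illustrative rather than wigolo's actual error text.

```typescript
// Hypothetical pre-flight check for extract arguments; not wigolo's validator.
type ExtractMode = "metadata" | "selector" | "tables" | "schema";

type ExtractArgs = {
  url?: string;
  html?: string;
  mode?: ExtractMode;
  css_selector?: string;
  multiple?: boolean;
  schema?: { [key: string]: unknown };
};

function checkExtractArgs(args: ExtractArgs): string[] {
  const problems: string[] = [];
  const mode: ExtractMode = args.mode ?? "metadata"; // documented default

  // One of url / html must be present (url wins if both are given).
  if (!args.url && !args.html) {
    problems.push("one of url or html is required");
  }
  // Selector mode needs a CSS selector.
  if (mode === "selector" && !args.css_selector) {
    problems.push("css_selector is required when mode is selector");
  }
  // Schema mode needs a JSON Schema object that has a properties key.
  if (mode === "schema") {
    const props = args.schema?.["properties"];
    if (typeof props !== "object" || props === null) {
      problems.push("schema with a properties object is required when mode is schema");
    }
  }
  return problems; // empty array means the documented rules are satisfied
}
```

Returning a list of problems rather than throwing on the first one lets a caller report everything wrong with a request in one pass.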
  ### find_similar
- Find pages related to a URL or concept.
+
+ Find pages related to a URL or a free-text concept.
+
+ Parameters:
+ - `url` OR `concept` (one required)
+ - `max_results`: number (default `10`, cap `50`)
+ - `include_domains` / `exclude_domains`: `string[]`
+ - `include_cache`: boolean (default `true`)
+ - `include_web`: boolean (default `true`)
+
+ Example:
  ```json
- { "url": "https://react.dev/reference/react/useState", "max_results": 5 }
+ { "url": "https://react.dev/reference/react/useState", "max_results": 8, "include_domains": ["react.dev", "developer.mozilla.org"] }
  ```
 
+ Tip: uses hybrid 3-way search — FTS5 over titles, FTS5 over body, plus embeddings when available. Cache path is near-instant; web supplement runs only if cache yields too few results.
+
  ### research
- Deep multi-step research that plans queries, fetches, and synthesizes.
- ```json
- { "question": "How do modern bundlers handle tree-shaking of ESM vs CJS", "depth": "standard", "max_sources": 10 }
- ```
 
- ### agent
- Autonomous data gathering from a natural-language prompt.
+ Multi-step research pipeline with decomposition, parallel search, and cited synthesis.
+
+ Parameters:
+ - `question` (string, required)
+ - `depth`: `"quick"` (~15s, 2 sub-queries, 5–8 sources) | `"standard"` (~40s, default) | `"comprehensive"` (~80s, 7 sub-queries, 20–25 sources)
+ - `max_sources`: number (cap `50`) — overrides depth default
+ - `include_domains` / `exclude_domains`: `string[]`
+ - `schema`: JSON Schema — if present, report is structured to fill these fields
+ - `stream`: boolean — emit progress notifications per phase
+
+ Example:
  ```json
- { "prompt": "Compare authentication strategies of Supabase, Firebase, and Clerk", "max_pages": 15, "max_time_ms": 90000 }
+ { "question": "How do modern JS bundlers tree-shake ESM vs CJS?", "depth": "standard", "include_domains": ["webpack.js.org", "rollupjs.org", "esbuild.github.io", "vitejs.dev"] }
  ```
 
- ## Workflow Patterns
+ Tip: `research` checks cache internally — no need to pre-probe. Requires MCP sampling-capable client for synthesis; without sampling, returns raw sources in context format.
 
- Use the right tool for the right situation.
+ ### agent
 
- **When you know the URL** -- use `fetch`. One URL, clean markdown. Add `section` to read only the heading you need.
+ Natural-language data gathering. Plans queries and URLs from a prompt, runs them in parallel within budget, optionally applies a schema.
 
- **When you need to find information** -- use `search`. Formulate a keyword query (not a natural language question). Scope with `include_domains` and `category` when you know where the answer lives.
+ Parameters:
+ - `prompt` (string, required)
+ - `urls`: `string[]` — seed URLs to include
+ - `schema`: JSON Schema — extract structured fields per page and merge
+ - `max_pages`: number (default `10`, cap `100`)
+ - `max_time_ms`: number (default `60000`, cap `600000`)
+ - `stream`: boolean
 
- **When you need multiple pages from one site** -- use `crawl`. For documentation sites, use `strategy: "sitemap"`. When you just want to discover what pages exist, use `strategy: "map"` (URL list only) then follow up with targeted `fetch` calls.
+ Example:
+ ```json
+ { "prompt": "Compare pricing tiers for Supabase, Firebase, and Clerk", "schema": { "type": "object", "properties": { "provider": { "type": "string" }, "free_tier": { "type": "string" }, "paid_start": { "type": "string" } } }, "max_pages": 12 }
+ ```
 
- **When you need structured data** -- use `extract` with `mode: "tables"` or `mode: "schema"`. Do not use `fetch` when you need prices, specs, or table rows.
+ Tip: output includes a `steps` array showing every action (plan, search, fetch, extract, synthesize) with timings. Use this to debug why an agent run produced a weak result.
 
- **When you already have content and want related pages** -- use `find_similar`. It searches the local cache by semantic similarity. No network calls needed.
+ ## Workflow Patterns
 
- **When you need a thorough answer on a complex topic** -- use `research`. It plans multiple search queries, fetches sources, and produces a cited synthesis. Prefer this over running 5+ manual search/fetch cycles.
+ Quick routing:
+ - Use when `search` — you need information but don't have a URL.
+ - Use when `fetch` — you already have the URL.
+ - Use when `crawl` — you need multiple pages from one site.
+ - Use when `cache` — you want to check whether something is already on disk.
+ - Use when `extract` — you need specific fields, tables, or metadata, not the whole page.
+ - Use when `find_similar` — you have a good page/concept and want related content.
+ - Use when `research` — a question needs decomposition and multi-source synthesis.
+ - Use when `agent` — a natural-language task needs multi-step data gathering.
+
+ **Cache-first lookup.** Before any `fetch` or `search`, probe the cache.
+ ```json
+ cache({ "query": "oauth2 pkce", "url_pattern": "*auth0.com*" })
+ // empty? fall through to search
+ search({ "query": "oauth2 pkce flow", "include_domains": ["auth0.com"] })
+ ```
 
- **When the task requires multi-step data gathering** -- use `agent`. It breaks prompts into search queries and URL fetches, respects page and time budgets, and can extract structured data via JSON Schema.
+ **Fresh content (news, dashboards, changelogs).** Bypass cache explicitly.
+ ```json
+ search({ "query": "node.js 22 release notes", "force_refresh": true, "time_range": "week" })
+ fetch({ "url": "https://nodejs.org/en/blog", "force_refresh": true })
+ ```
 
- **Before any network call** -- check `cache` first. Pages from prior sessions are still there. A cache hit is instant and free.
+ **Scoped documentation research.** Crawl the relevant slice, then query cache.
+ ```json
+ crawl({ "url": "https://docs.astro.build", "strategy": "sitemap", "max_pages": 40 })
+ cache({ "query": "server islands hydration", "url_pattern": "*docs.astro.build*" })
+ ```
 
- ## Parameter Optimization
+ **Broad exploration.** Pass a query array; dedup is automatic.
+ ```json
+ search({ "query": ["rust async runtimes comparison", "tokio vs async-std vs smol", "rust executor benchmarks"], "max_results": 8 })
+ ```
 
- ### search
- - `max_results: 3` for focused lookups, `5` for exploration (default), `10+` for broad research
- - `include_domains` narrows to trusted sources -- always use when you know the domain
- - `category: "code"` for programming, `"docs"` for library docs, `"news"` for recent events
- - `from_date` / `to_date` for time-sensitive queries
- - `format: "context"` returns a single token-budgeted string for LLM injection
+ **More like this.** Start with a known-good URL, widen via `find_similar`.
+ ```json
+ find_similar({ "url": "https://react.dev/reference/react/useMemo", "max_results": 6, "include_domains": ["react.dev"] })
+ ```
 
- ### fetch
- - `section: "heading text"` extracts only content under that heading -- much cheaper than the full page
- - `render_js: "never"` is fastest for static sites; `"always"` for SPAs
- - `use_auth: true` to access pages behind login using the user's browser session
+ **Complex synthesis.** One `research` call replaces 5+ manual search/fetch cycles.
+ ```json
+ research({ "question": "Tradeoffs of vector DBs for RAG at 100M+ embeddings", "depth": "comprehensive" })
+ ```
 
- ### crawl
- - `strategy: "sitemap"` is 5-10x faster than BFS for doc sites
- - `strategy: "map"` returns URLs only -- use to scope a site before targeted fetches
- - `include_patterns` / `exclude_patterns` accept regex to stay in one section
+ **Structured data from multiple sources.** Use `agent` with a schema.
+ ```json
+ agent({ "prompt": "Find latency and pricing for top 5 edge compute providers", "schema": { "type": "object", "properties": { "provider": {"type":"string"}, "cold_start_ms": {"type":"string"}, "price_per_million": {"type":"string"} } } })
+ ```
 
- ### research
- - `depth: "quick"` (~15s, 2 sub-queries), `"standard"` (~40s, default), `"comprehensive"` (~80s, 7 sub-queries)
- - `max_sources` overrides the default source count for the chosen depth
+ **Table extraction.** Skip markdown entirely.
+ ```json
+ extract({ "url": "https://en.wikipedia.org/wiki/List_of_programming_languages", "mode": "tables" })
+ ```
 
- ### agent
- - `max_pages` caps total page fetches (default 10, max 100)
- - `max_time_ms` caps total execution time (default 60000)
- - `schema` enables structured extraction from each page -- results are merged across sources
+ ## Parameter Cheat Sheet
+
+ | Situation | Tool + parameters |
+ |---|---|
+ | Focused lookup, known site | `search` + `max_results: 3` + `include_domains` |
+ | Broad topic survey | `search` + `query: [...3-5 variants]` + `max_results: 8` |
+ | Fresh content required | any tool + `force_refresh: true` |
+ | Doc site indexing | `crawl` + `strategy: "sitemap"` |
+ | Site URL inventory only | `crawl` + `strategy: "map"` |
+ | Single heading from long page | `fetch` + `section: "..."` |
+ | Behind login | `fetch` / `crawl` + `use_auth: true` |
+ | Direct answer (sampling client) | `search` + `format: "answer"` |
+ | LLM-ready context blob | `search` + `format: "context"` |
+ | Complex question, multi-source | `research` + `depth: "standard"` |
+ | Structured multi-page extraction | `agent` + `schema` |
+ | One-page structured data | `extract` + `mode: "schema"` or `"tables"` |
+ | Change tracking | `cache` + `check_changes: true` |
 
156
301
  ## Anti-Patterns
157
302
 
158
- These waste tokens, time, and rate limits. Avoid them.
303
+ **Do not skip the cache.** Running `search` or `fetch` without probing `cache` wastes time on content already on disk. `research` and `agent` check cache internally; manual `search`/`fetch` do not.
159
304
 
160
- **Do not retry the same query.** If `search` returns no results, reformulate with different keywords. Repeating an identical query returns the same empty results.
305
+ **Do not send natural-language questions to `search`.** Use keywords. `"how do I debounce in React hooks"` loses to `"react useDebounce hook custom"`.
161
306
 
162
- **Do not skip the cache.** Every `fetch`, `search`, and `crawl` result is cached locally. Before any network call, run `cache` with the URL pattern or query text. Cached results return instantly.
307
+ **Do not retry an identical failing query.** Reformulate keywords, swap `category`, or add `include_domains`. Same query same empty result.
 
- **Do not send natural language questions as search queries.** Search engines work best with keywords. Instead of `"What is the best way to handle authentication in Next.js?"`, use `"Next.js authentication best practices 2025"`.
+ **Do not use `agent` or `research` for one-URL lookups.** Use `fetch`. `agent` is for multi-source gathering; `research` is for decomposable questions.
 
- **Do not use `agent` for simple lookups.** One fact from one URL = `fetch`. Quick search result = `search`. Reserve `agent` for tasks requiring multiple search/fetch cycles.
+ **Do not crawl `max_pages: 100` without filters.** Always add `include_patterns` to stay in-scope. Unfiltered crawls fetch nav, footer, and sitemap garbage.
 
- **Do not use `research` when you already know the URLs.** If you have URLs to read, use `fetch` or `crawl`. `research` is for when you need the tool to discover sources autonomously.
+ **Do not fetch whole pages when you need one section.** `fetch` + `section` reads under one heading only.
 
- **Do not fetch entire pages when you need one section.** Use `fetch` with `section` to extract just the relevant part.
+ **Do not set `force_refresh: true` by default.** It defeats the cache. Use it for news, status, and changelogs: content that actually churns.
 
- **Do not crawl with high max_pages without filtering.** A `max_pages: 100` crawl without `include_patterns` fetches navigation pages, footers, and irrelevant content.
+ **Do not pass a JSON Schema to `extract` without `properties`.** The handler rejects schemas that lack a `properties` key.
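
Editor's sketch (not part of the package diff): the `properties` rule above can be checked client-side before calling `extract`. The helper name `hasProperties` and the type `JsonSchemaLike` are hypothetical, and requiring `properties` to be a plain object is an assumption; the source only says the handler rejects schemas that lack the key.

```typescript
// Hypothetical pre-flight check; this is NOT wigolo's handler code.
type JsonSchemaLike = Record<string, unknown>;

// Returns true when the schema carries a `properties` key holding a plain object.
// Assumption: an object value is required; the source only mentions the key.
function hasProperties(schema: JsonSchemaLike): boolean {
  const props = schema["properties"];
  return typeof props === "object" && props !== null && !Array.isArray(props);
}

const withProps: JsonSchemaLike = {
  type: "object",
  properties: { title: { type: "string" } },
};
const withoutProps: JsonSchemaLike = { type: "object" };

console.log(hasProperties(withProps)); // true
console.log(hasProperties(withoutProps)); // false
```

A check like this would let a caller fix the schema locally instead of spending a round trip on a request the handler is documented to reject.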
 
- **Do not ignore `format: "context"` for search.** When injecting search results into a prompt, use `format: "context"` instead of manually concatenating results.
+ ## CLI Commands
+
+ ```bash
+ wigolo # default: start MCP server on stdio
+ wigolo mcp # explicit: start MCP server
+ wigolo warmup [flags] # install Playwright, bootstrap SearXNG, optional extras
+ wigolo serve # start HTTP daemon on WIGOLO_DAEMON_PORT (default 3333)
+ wigolo health # health probe, exits 0 if ok
+ wigolo doctor # environment diagnostics (Python, Docker, Playwright, SearXNG)
+ wigolo auth discover # list CDP sessions (needs WIGOLO_CDP_URL)
+ wigolo auth status # show configured auth paths
+ wigolo plugin add <git-url> # clone plugin into ~/.wigolo/plugins/
+ wigolo plugin list # list installed plugins
+ wigolo plugin remove <name> # remove a plugin
+ wigolo shell [--json] # interactive REPL against subsystems
+ ```
 
- ## Key Features
+ ## Configuration
 
- - Zero API keys required
- - Zero cloud dependency -- runs entirely local
- - Authenticated browsing (Chrome profiles, session state)
- - Localhost access (develop against local servers)
- - SQLite FTS5 cache with full-text search
- - ML reranking (optional, via FlashRank)
- - Extraction ensemble: site-specific, Defuddle, Trafilatura, Readability, Turndown
+ Top environment variables. All optional — defaults are safe.
 
- ## Requirements
+ | Variable | Default | Purpose |
+ |---|---|---|
+ | `WIGOLO_DATA_DIR` | `~/.wigolo` | Cache DB, SearXNG state, plugins, embeddings |
+ | `SEARXNG_URL` | unset | Point at an existing SearXNG (skips native bootstrap) |
+ | `SEARXNG_MODE` | `native` | `native` runs local Python SearXNG; `docker` runs container |
+ | `WIGOLO_CHROME_PROFILE_PATH` | unset | Chrome profile for `use_auth: true` |
+ | `WIGOLO_CDP_URL` | unset | Chrome DevTools endpoint (e.g. `http://localhost:9222`) |
+ | `MAX_BROWSERS` | `3` | Playwright pool size |
+ | `WIGOLO_BROWSER_TYPES` | `chromium` | Comma list: `chromium,firefox,webkit` |
+ | `WIGOLO_RERANKER` | `none` | `flashrank` for ML reranking |
+ | `WIGOLO_EMBEDDING_MODEL` | `BAAI/bge-small-en-v1.5` | Used by `find_similar` |
+ | `CACHE_TTL_CONTENT` | `604800` (7d) | Seconds before cached pages expire |
+ | `LOG_LEVEL` | `info` | `debug` \| `info` \| `warn` \| `error` |
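
Editor's sketch (not part of the package diff): one way the documented defaults in the table above could be applied when reading the environment. The function `resolveConfig`, the `ResolvedConfig` field names, and the parsing rules (integer fallback, trimming the comma list) are all hypothetical; the package's actual logic lives in `src/config.ts` and may differ.

```typescript
// Hypothetical helper; NOT wigolo's real config loader.
type Env = Record<string, string | undefined>;

interface ResolvedConfig {
  dataDir: string;
  searxngMode: string;
  maxBrowsers: number;
  browserTypes: string[];
  reranker: string;
  cacheTtlContentSec: number;
  logLevel: string;
}

function resolveConfig(env: Env): ResolvedConfig {
  // Parse an integer variable, falling back to the documented default
  // when the value is unset or not a number (assumed behavior).
  const int = (value: string | undefined, fallback: number): number => {
    const parsed = Number.parseInt(value ?? "", 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };

  return {
    dataDir: env.WIGOLO_DATA_DIR ?? "~/.wigolo",
    searxngMode: env.SEARXNG_MODE ?? "native",
    maxBrowsers: int(env.MAX_BROWSERS, 3),
    // Comma list such as "chromium,firefox,webkit"; entries are trimmed.
    browserTypes: (env.WIGOLO_BROWSER_TYPES ?? "chromium")
      .split(",")
      .map((name) => name.trim())
      .filter((name) => name.length > 0),
    reranker: env.WIGOLO_RERANKER ?? "none",
    cacheTtlContentSec: int(env.CACHE_TTL_CONTENT, 604800), // 7 days
    logLevel: env.LOG_LEVEL ?? "info",
  };
}

const defaults = resolveConfig({});
const custom = resolveConfig({
  MAX_BROWSERS: "5",
  WIGOLO_BROWSER_TYPES: "chromium, firefox",
});
console.log(defaults.maxBrowsers, custom.browserTypes);
```

In a Node.js process the same function could presumably be called as `resolveConfig(process.env)`; taking the environment as a parameter keeps the sketch testable without touching global state.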
 
- - Node.js 20+
- - Python 3.8+ (recommended, for embedded SearXNG search)
- - Docker (optional, alternative to Python for SearXNG)
+ Full list: see `src/config.ts`.
 
  ## Links
 
  - Repository: https://github.com/KnockOutEZ/wigolo
  - npm: https://www.npmjs.com/package/@staticn0va/wigolo
- - License: BSL 1.1 (converts to MIT on 2029-04-12)
+ - License: BUSL-1.1 (converts to open source on 2029-04-12)
@@ -1 +1 @@
- {"version":3,"file":"doctor.d.ts","sourceRoot":"","sources":["../../src/cli/doctor.ts"],"names":[],"mappings":"AAwDA;;;;;GAKG;AACH,wBAAsB,SAAS,CAAC,OAAO,EAAE,MAAM,GAAG,OAAO,CAAC,MAAM,CAAC,CAgFhE"}
+ {"version":3,"file":"doctor.d.ts","sourceRoot":"","sources":["../../src/cli/doctor.ts"],"names":[],"mappings":"AA4DA;;;;;GAKG;AACH,wBAAsB,SAAS,CAAC,OAAO,EAAE,MAAM,GAAG,OAAO,CAAC,MAAM,CAAC,CA+EhE"}