pi-web-toolkit 0.2.1 → 0.3.0

package/CHANGELOG.md ADDED
@@ -0,0 +1,145 @@
1
+ # Changelog
2
+
3
+ All notable changes to this project will be documented in this file.
4
+
5
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
6
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
+
8
+ ## [Unreleased]
9
+
10
+ ## [0.3.0] - 2026-06-23
11
+
12
+ ### Added
13
+
14
+ - **Firecrawl Keyless fallback** — `web_search`, `web_fetch`, and `web_browse` now automatically retry through [Firecrawl Keyless](https://www.firecrawl.dev/blog/firecrawl-keyless-launch) (1,000 free credits/month, **no API key, no signup**) when their local backend errors out, or when `web_search` returns zero results. The fallback is keyless-only, never the primary path, and degrades gracefully to the original local-tool error if the `firecrawl-cli` is absent, the IP is flagged, the quota is exhausted, or the fallback is disabled.
15
+ - Three explicit escape-hatch tools for capabilities the local backends lack: `firecrawl_search` (sources, `github`/`research`/`pdf` categories, domain filters), `firecrawl_scrape` (anti-bot bypass, JS rendering, PDF parsing), and `firecrawl_interact` (natural-language page interaction).
16
+ - `extensions/utils/firecrawl.ts` — a deep Firecrawl CLI wrapper (scrape/search/interact argument builders, output parsers, graceful-skip failure classifier, keyless-eligibility check, and fallback-decision predicates).
17
+ - Optional external CLI dependency: `npm install -g firecrawl-cli`.
18
+ - Environment toggle `PI_WEB_FIRECRAWL_FALLBACK` (default on) to disable all Firecrawl usage.
19
+ - `test/firecrawl/test.ts` — pure-function regression tests for the firecrawl wrapper boundary (wired into `npm test` as `test:firecrawl`).
20
+ - ADR 0001 and `CONTEXT.md` glossary entries (`Firecrawl keyless`, `cloud fallback`, `free credits`, `graceful skip`) documenting the local-first → optional keyless cloud fallback architectural decision.
21
+
22
+ ### Changed
23
+
24
+ - **Default network/privacy behavior.** When a local web tool fails, it now makes a cloud request to Firecrawl (sending the URL/query and page content) before giving up. The fallback is **keyless-only** — it never reads, stores, or sends an API key, and spawns the CLI under an isolated temporary `HOME` with the key env stripped. To enforce a strict local-only / no-cloud-egress policy, set `PI_WEB_FIRECRAWL_FALLBACK=0`.
25
+ - `web_search` falls back on a SearXNG error **or** zero results; `web_fetch` falls back on a scrapling failure (incl. its HTTP-GET fallback); `web_browse` falls back only on runtime failures (missing/broken `agent-browser`), never on caller validation errors. `web_batch_fetch` has no fallback (Firecrawl batch scrape is not keyless).
26
+ - Firecrawl results report `creditsUsed` where the source provides it (search, interact); scrape responses do not surface it.
27
+ - README tagline and hero now describe the toolkit as local-first with an optional keyless cloud fallback; features table, install prompt, configuration, project structure, tool reference, and usage guide updated accordingly.
28
+ - `cli-runner` gained an optional `env` passthrough so the firecrawl CLI can be spawned keyless-only.
29
+
30
+ ## [0.2.2] - 2026-06-11
31
+
32
+ ### Added
33
+
34
+ - Self-contained README prompt that Pi users can copy to install and verify the package, SearXNG, Scrapling, and agent-browser.
35
+ - `CHANGELOG.md` to the files included in the published npm package.
36
+
37
+ ### Changed
38
+
39
+ - Corrected README, tool reference, usage guide, agent guidance, project context, and test documentation to match current repository behavior.
40
+ - Clarified when to use each tool, Scrapling fallback behavior, external dependency requirements, and SearXNG JSON API setup.
41
+ - Test scripts now use the locally installed `tsx` development dependency.
42
+ - User-visible tool descriptions now distinguish pages that need interaction from those that do not.
43
+
44
+ ### Fixed
45
+
46
+ - Corrected historical changelog dates and inaccurate claims about robots.txt enforcement, tool limits, runtime configuration, and test coverage.
47
+ - Corrected GitHub issue-reading commands in agent guidance.
48
+
49
+ ## [0.2.1] - 2026-06-10
50
+
51
+ ### Added
52
+
53
+ - README tools preview grid with screenshots for `web_search`, `web_fetch`, `web_batch_fetch`, and `web_browse`.
54
+ - Agent-browser parser regression tests covering array, wrapped, single-item, and invalid JSON output shapes.
55
+
56
+ ### Fixed
57
+
58
+ - `web_browse` now accepts multiple agent-browser batch JSON output shapes instead of assuming a top-level array.
59
+
60
+ ### Changed
61
+
62
+ - `npm test` now also runs the agent-browser parser regression suite.
63
+
64
+ ## [0.2.0] - 2026-06-09
65
+
66
+ ### Added
67
+
68
+ - `extensions/utils/cli-runner.ts` — centralized CLI process spawning with timeout and AbortSignal support.
69
+ - `extensions/utils/content-preview.ts` — intelligent content extraction from scraped pages.
70
+ - `extensions/utils/output-sink.ts` — truncation and temp-file fallback, replacing `truncateHead` + manual `writeFile`/`mkdtemp` in every tool.
71
+ - `extensions/utils/render-helpers.ts` — URL abbreviations, text normalization, and error formatting for TUI.
72
+ - `extensions/utils/tool-factory.ts` — common tool registration patterns.
73
+ - `CLAUDE.md` — symlink to `AGENTS.md` for IDE/agent integration.
74
+ - `CONTEXT.md` — project domain summary for pi runtime context.
75
+ - `test/` directory — automated test suite under `test/content-preview/` with fixtures, baselines, snapshots, and summary report.
76
+
77
+ ### Changed
78
+
79
+ - All 4 tools (`web_search`, `web_fetch`, `web_browse`, `web_batch_fetch`) refactored to use new shared utils, eliminating ~200 lines of duplicated truncate/output logic per tool.
80
+ - `scrapling.ts` and `agent-browser.ts` now use `cli-runner`, eliminating duplicate `spawn` logic.
81
+ - `web_search` — `language` default changed from `"auto"` to `""` (omits param when unset to use SearXNG default).
82
+ - `web_search` — `promptGuidelines` now recommends `web_batch_fetch` for parallel reading of 2–5 results.
83
+ - `web_batch_fetch` — added live progress tracking with per-URL status (fetching / done / error).
84
+ - `web_browse` — added step formatting and tracking (`formatBrowseStep` + `steps` in details).
85
+ - Unified TUI redesign across all 4 tools:
86
+ - Consistent `isError` rendering with `✗` status, error text, and context details.
87
+ - Enhanced `isPartial` rendering with domain/URL context and live progress indicators.
88
+ - `fullOutputPath` rendered in accent color.
89
+ - `renderCall` tags: `[stealthy]`, `[selector=...]`, `[headed]`, `concurrency`.
90
+ - `web_fetch` — content preview (500-char extract) shown in collapsed and expanded views.
91
+ - `web_browse` — expanded view shows complete step list + preview.
92
+ - `web_batch_fetch` — collapsed shows top 3 successes with previews; expanded shows full success list + failure list.
93
+
94
+ ### Meta
95
+
96
+ - Stop tracking `package-lock.json` (library project; reproducible by downstream consumers).
97
+ - Add `typecheck`, `test`, and `test:approve` scripts to `package.json`.
98
+
99
+ ## [0.1.2] - 2026-06-08
100
+
101
+ ### Added
102
+
103
+ - `utils/agent-browser.ts` — extracted from `web_browse.ts` to encapsulate all agent-browser CLI interaction (command building, process spawning, JSON parsing, session cleanup).
104
+ - `tsconfig.json` — TypeScript project configuration for CI type-checking.
105
+ - GitHub Actions CI workflow (`ci.yml`) — runs `tsc --noEmit` on every push and PR.
106
+
107
+ ### Changed
108
+
109
+ - `web_search` — `SEARXNG_URL` is now read at execute time instead of module load time, so in-process environment changes take effect without reloading the extension.
110
+ - `utils/scrapling.ts` — introduced `runScraplingWithFallback()` with configurable `noGetFallback` option, eliminating duplicate fallback logic in `web_fetch` and `web_batch_fetch`.
111
+ - `web_browse.ts` — reduced from ~400 lines to ~194 lines by moving CLI logic to `utils/agent-browser.ts`.
112
+ - README — added `## Configuration` section, `## Contributing` section, CI badge, and updated project structure with design principles.
113
+
114
+ ### Fixed
115
+
116
+ - Preserved GET fallback for `web_batch_fetch` when `stealthy: true` fails, maintaining backward compatibility with the previous batch implementation.
117
+
118
+ ## [0.1.1] - 2026-06-04
119
+
120
+ ### Added
121
+
122
+ - `web_batch_fetch` — parallel multi-page fetching via scrapling.
123
+ - Built-in output truncation with temp-file fallback for all tools.
124
+ - TUI renderers for tool calls and results.
125
+
126
+ ### Changed
127
+
128
+ - Unified extension entry point at `extensions/index.ts`.
129
+
130
+ ## [0.1.0] - 2026-06-03
131
+
132
+ ### Added
133
+
134
+ - `web_search` — SearXNG web search.
135
+ - `web_fetch` — single-page extraction via scrapling.
136
+ - `web_browse` — interactive browser automation via agent-browser.
137
+ - LLM-optimized `promptGuidelines` and `promptSnippet` for every tool.
138
+
139
+ [Unreleased]: https://github.com/Wade11s/pi-web-toolkit/compare/v0.2.2...HEAD
140
+ [0.2.2]: https://github.com/Wade11s/pi-web-toolkit/compare/v0.2.1...v0.2.2
141
+ [0.2.1]: https://github.com/Wade11s/pi-web-toolkit/compare/v0.2.0...v0.2.1
142
+ [0.2.0]: https://github.com/Wade11s/pi-web-toolkit/compare/v0.1.2...v0.2.0
143
+ [0.1.2]: https://github.com/Wade11s/pi-web-toolkit/compare/v0.1.1...v0.1.2
144
+ [0.1.1]: https://github.com/Wade11s/pi-web-toolkit/compare/v0.1.0...v0.1.1
145
+ [0.1.0]: https://github.com/Wade11s/pi-web-toolkit/releases/tag/v0.1.0
package/README.md CHANGED
@@ -1,22 +1,28 @@
1
1
  # pi-web-toolkit
2
2
 
3
3
  [![npm version](https://badge.fury.io/js/pi-web-toolkit.svg)](https://www.npmjs.com/package/pi-web-toolkit)
4
+ [![Pi package](https://img.shields.io/badge/Pi-package-111111.svg)](https://pi.dev/packages/pi-web-toolkit)
4
5
  [![CI](https://github.com/Wade11s/pi-web-toolkit/actions/workflows/ci.yml/badge.svg)](https://github.com/Wade11s/pi-web-toolkit/actions)
5
6
  [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
6
7
  ![Node.js](https://img.shields.io/badge/node-%3E%3D22-339933)
7
8
 
8
- **100% open-source. Zero API keys. Zero fees.**
9
+ **Local-first & 100% open-source. No required API keys or paid services.**
9
10
 
10
- Web research toolkit for [pi](https://pi.dev) agents. Search via SearXNG, fetch static pages with scrapling, browse interactively via agent-browser, and batch-read sources in parallel. All self-hosted, all local, all free with built-in truncation safety and LLM-optimized prompt guidelines.
11
+ Web research toolkit for [pi](https://pi.dev) agents. Search via SearXNG, fetch pages with scrapling, browse interactively via agent-browser, and batch-read sources in parallel. All primary backends run locally or are self-hosted, with an **optional Firecrawl Keyless cloud fallback** (no API key, no signup) so the local tools keep working when a backend is missing or fails. Built-in truncation safety and LLM-optimized prompt guidelines throughout.
11
12
 
12
13
  ## Features
13
14
 
14
15
  | Tool | Backend | Purpose | Current Limit |
15
16
  |------|---------|---------|---------------|
16
- | **`web_search`** | [SearXNG](https://github.com/searxng/searxng) | Search the web with scored, ranked results from multiple engines — always the first step in web research | 20 results (max 60, auto-pages up to 3 pages) |
17
- | **`web_fetch`** | [scrapling](https://github.com/D4Vinci/Scrapling) | Fetch a single static page as clean markdown | — |
18
- | **`web_batch_fetch`** | [scrapling](https://github.com/D4Vinci/Scrapling) | Fetch 2–15 pages in parallel for research synthesis | 3 concurrent (max 5) |
17
+ | **`web_search`** | [SearXNG](https://github.com/searxng/searxng) | Discover scored, ranked results from multiple engines | 20 results (max 60, auto-pages up to 3 pages) |
18
+ | **`web_fetch`** | [scrapling](https://github.com/D4Vinci/Scrapling) | Fetch a single page as clean markdown | — |
19
+ | **`web_batch_fetch`** | [scrapling](https://github.com/D4Vinci/Scrapling) | Fetch 1–15 pages in parallel for research synthesis (2–5 recommended) | 3 concurrent (max 5) |
19
20
  | **`web_browse`** | [agent-browser](https://github.com/vercel-labs/agent-browser) | Interact with a page (click, scroll, fill) then extract content | 25 actions |
21
+ | **`firecrawl_search`** | [firecrawl-cli](https://github.com/firecrawl/cli) (keyless) | Cloud search with sources/categories/domain filters | — |
22
+ | **`firecrawl_scrape`** | [firecrawl-cli](https://github.com/firecrawl/cli) (keyless) | Cloud single-page fetch (anti-bot / JS / PDF) | — |
23
+ | **`firecrawl_interact`** | [firecrawl-cli](https://github.com/firecrawl/cli) (keyless) | Cloud natural-language page interaction | — |
24
+
25
+ > **Firecrawl fallback.** `web_search`, `web_fetch`, and `web_browse` automatically retry through Firecrawl Keyless (1,000 free credits/month, no API key) when their local backend errors out or search returns nothing. The three `firecrawl_*` tools are explicit escape hatches. Disable it with `PI_WEB_FIRECRAWL_FALLBACK=0`. Install the optional CLI: `npm install -g firecrawl-cli`.
20
26
 
21
27
  ## Tools Preview
22
28
 
@@ -40,14 +46,101 @@ A quick look at how pi renders toolkit calls while an agent searches, fetches, b
40
46
  </tr>
41
47
  </table>
42
48
 
49
+ ## Install with Pi Agent
50
+
51
+ Copy and send the prompt below to Pi. It will install this package and its external dependencies for you.
52
+
53
+ ```text
54
+ Install pi-web-toolkit and its external dependencies. Complete and verify every
55
+ step yourself; do not rely on web browsing or external documentation. Inspect
56
+ the machine first and reuse working installations. Ask before using sudo,
57
+ changing shell profiles, overwriting configuration, or modifying existing
58
+ services or containers.
59
+
60
+ 1. Ensure Node.js 22+, npm, Docker, OpenSSL, curl, uv, and Pi are installed, and
61
+ that Docker is running. Install only missing or incompatible prerequisites.
62
+ 2. Configure SearXNG:
63
+ - Test SEARXNG_URL when set, then http://localhost:8080.
64
+ - Verify /search?q=test&format=json returns JSON with a results array.
65
+ - If neither endpoint works, first ensure no existing container or config
66
+ would be overwritten, then create a local-only instance by running:
67
+
68
+ mkdir -p "$HOME/.config/searxng"
69
+ cat > "$HOME/.config/searxng/settings.yml" <<'YAML'
70
+ use_default_settings: true
71
+
72
+ search:
73
+   formats:
74
+     - html
75
+     - json
76
+ YAML
77
+
78
+ docker run -d \
79
+ --name searxng \
80
+ --restart unless-stopped \
81
+ -p 127.0.0.1:8080:8080 \
82
+ -e FORCE_OWNERSHIP=false \
83
+ -e SEARXNG_SECRET="$(openssl rand -hex 32)" \
84
+ -v "$HOME/.config/searxng/settings.yml:/etc/searxng/settings.yml:ro" \
85
+ docker.io/searxng/searxng:latest
86
+
87
+ - Verify the selected endpoint by running:
88
+
89
+ SEARXNG_ENDPOINT="${SEARXNG_URL:-http://localhost:8080}"
90
+ curl -fsS --get "${SEARXNG_ENDPOINT%/}/search" \
91
+ --data-urlencode "q=test" \
92
+ --data "format=json" |
93
+ grep -q '"results"' && echo "SearXNG JSON API ready"
94
+
95
+ - Pi uses http://localhost:8080 by default. Set SEARXNG_URL before starting
96
+ Pi only when using another endpoint.
97
+ 3. Install and verify Scrapling:
98
+ uv tool install "scrapling[all]"
99
+ scrapling install
100
+ scrapling --help
101
+ 4. Install and verify agent-browser:
102
+ npm install -g agent-browser
103
+ agent-browser install
104
+ agent-browser doctor
105
+ On Linux, use agent-browser install --with-deps if required.
106
+ 5. Optionally install firecrawl-cli for the keyless cloud fallback (no API key
107
+ needed; the fallback degrades gracefully if it is absent):
108
+ npm install -g firecrawl-cli
109
+ 6. After all dependencies pass verification, install the package:
110
+ pi install npm:pi-web-toolkit
111
+
112
+ Report what was installed or reused, all verification results, the SearXNG
113
+ endpoint Pi will use, and whether Pi must be restarted. Do not report success
114
+ until every check passes.
115
+ ```
116
+
43
117
  ## Quick Start
44
118
 
45
119
  ### 1. Install external dependencies
46
120
 
121
+ The commands below assume a POSIX shell with Docker, OpenSSL, curl, uv, and Node.js 22+ with npm.
122
+
47
123
  ```bash
48
- # SearXNG (for search)
49
- docker run -d --name searxng -p 8080:8080 -v searxng:/etc/searxng searxng/searxng
50
- export SEARXNG_URL="http://localhost:8080"
124
+ # SearXNG (for search; local-only instance with the required JSON API)
125
+ mkdir -p "$HOME/.config/searxng"
126
+ cat > "$HOME/.config/searxng/settings.yml" <<'YAML'
127
+ use_default_settings: true
128
+
129
+ search:
130
+   formats:
131
+     - html
132
+     - json
133
+ YAML
134
+
135
+ docker run -d \
136
+ --name searxng \
137
+ --restart unless-stopped \
138
+ -p 127.0.0.1:8080:8080 \
139
+ -e FORCE_OWNERSHIP=false \
140
+ -e SEARXNG_SECRET="$(openssl rand -hex 32)" \
141
+ -v "$HOME/.config/searxng/settings.yml:/etc/searxng/settings.yml:ro" \
142
+ docker.io/searxng/searxng:latest
143
+ export SEARXNG_URL="http://127.0.0.1:8080"
51
144
 
52
145
  # scrapling (for fetch & batch fetch)
53
146
  uv tool install "scrapling[all]"
@@ -55,12 +148,19 @@ scrapling install
55
148
 
56
149
  # agent-browser (for browse)
57
150
  npm i -g agent-browser && agent-browser install
151
+ # On Linux hosts missing browser system libraries: agent-browser install --with-deps
152
+
153
+ # firecrawl-cli (OPTIONAL — enables the keyless cloud fallback; no API key needed)
154
+ npm i -g firecrawl-cli
58
155
  ```
59
156
 
60
157
  **Verify dependencies:**
61
158
  ```bash
62
159
  # SearXNG
63
- curl -s "$SEARXNG_URL" | head
160
+ curl -fsS --get "$SEARXNG_URL/search" \
161
+ --data-urlencode "q=searxng" \
162
+ --data "format=json" |
163
+ grep -q '"results"' && echo "SearXNG JSON API ready"
64
164
 
65
165
  # scrapling
66
166
  scrapling --help
@@ -81,44 +181,69 @@ pi install git:github.com/Wade11s/pi-web-toolkit
81
181
 
82
182
  ## Configuration
83
183
 
84
- All tools are configured via **environment variables** at runtime; no rebuild or restart required.
184
+ `web_search` reads its SearXNG endpoint from an environment variable. Set it before starting pi; no build step is required.
85
185
 
86
186
  | Variable | Default | Used By | Description |
87
187
  |----------|---------|---------|-------------|
88
188
  | `SEARXNG_URL` | `http://localhost:8080` | `web_search` | Your SearXNG instance endpoint |
189
+ | `PI_WEB_FIRECRAWL_FALLBACK` | `1` (on) | all tools | Set to `0`/`false`/`no`/`off` to disable the optional Firecrawl keyless cloud fallback for a strict local-only policy. |
89
190
 
90
191
  Set before starting pi:
91
192
 
92
193
  ```bash
93
194
  export SEARXNG_URL="https://searxng.example.com"
195
+ # Optional: disable the Firecrawl cloud fallback entirely
196
+ export PI_WEB_FIRECRAWL_FALLBACK=0
197
+ ```
198
+
199
+ ### Optional: Firecrawl keyless fallback
200
+
201
+ When a local backend (`web_search`/`web_fetch`/`web_browse`) fails or returns nothing, the tools automatically retry through [Firecrawl Keyless](https://www.firecrawl.dev/blog/firecrawl-keyless-launch) — 1,000 free credits/month, **no API key, no signup**. The `firecrawl_*` tools are explicit escape hatches for capabilities the local backends lack (search categories, cloud rendering, natural-language interaction).
202
+
203
+ Install the optional CLI (the fallback degrades gracefully if it is absent):
204
+
205
+ ```bash
206
+ npm install -g firecrawl-cli
94
207
  ```
95
208
 
209
+ The fallback is **keyless-only**: it never reads or stores an API key, and spawns the CLI under an isolated temporary `HOME` with the key env stripped. **Privacy:** when the fallback runs, the URL and page content are sent to Firecrawl's cloud.
210
+
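The toggle described in the configuration table can be pictured as a small pure function. This is only a sketch under stated assumptions, not the package's actual code: the name `isFirecrawlFallbackEnabled` is hypothetical, and the trimmed, case-insensitive comparison is an assumption. The source only says that `0`/`false`/`no`/`off` disable the fallback and that it defaults to on.

```typescript
// Hypothetical helper (not from the package): reads the documented toggle.
type Env = Record<string, string | undefined>;

// Values the README lists as disabling the fallback.
const DISABLING_VALUES = ["0", "false", "no", "off"];

function isFirecrawlFallbackEnabled(env: Env): boolean {
  const raw = env["PI_WEB_FIRECRAWL_FALLBACK"];
  // Documented default: the fallback is on when the variable is unset.
  if (raw === undefined) return true;
  // ASSUMPTION: matching ignores case and surrounding whitespace.
  return !DISABLING_VALUES.includes(raw.trim().toLowerCase());
}
```

Passing the environment in as a parameter, rather than reading a global, keeps the check easy to test; a caller would presumably hand it the process environment at execute time.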
96
211
  ## Project Structure
97
212
 
98
213
  ```
99
214
  pi-web-toolkit/
100
215
  ├── extensions/
101
- │ ├── index.ts # Unified entry point — registers all 4 tools
216
+ │ ├── index.ts # Unified entry point — registers all 7 tools (4 local + 3 Firecrawl keyless)
102
217
  │ ├── utils/
103
- │ │ ├── cli-runner.ts # Unified CLI process spawning with timeout/AbortSignal
218
+ │ │ ├── cli-runner.ts # Unified CLI process spawning with timeout/AbortSignal/env
104
219
  │ │ ├── content-preview.ts # Intelligent content extraction from scraped pages
105
220
  │ │ ├── output-sink.ts # Truncation + temp-file fallback
106
221
  │ │ ├── render-helpers.ts # URL abbreviations, text normalization, error formatting for TUI
107
222
  │ │ ├── scrapling.ts # Reusable scrapling CLI wrapper (shared by fetch + batch)
108
223
  │ │ ├── tool-factory.ts # Common tool registration patterns
109
- │ │ └── agent-browser.ts # agent-browser CLI wrapper (shared by web_browse)
110
- │ ├── web_search.ts # SearXNG search tool
111
- │ ├── web_fetch.ts # Single-page scrapling fetcher
224
+ │ │ ├── agent-browser.ts # agent-browser CLI wrapper (shared by web_browse)
225
+ │ │ └── firecrawl.ts # Firecrawl keyless CLI wrapper + fallback decisions (shared by firecrawl_* tools + fallbacks)
226
+ │ ├── web_search.ts # SearXNG search tool (+ Firecrawl fallback)
227
+ │ ├── web_fetch.ts # Single-page scrapling fetcher (+ Firecrawl fallback)
112
228
  │ ├── web_batch_fetch.ts # Parallel scrapling fetcher
113
- │ └── web_browse.ts # Interactive browser automation (agent-browser)
229
+ │ ├── web_browse.ts # Interactive browser automation (agent-browser + Firecrawl fallback)
230
+ │ ├── firecrawl_search.ts # Firecrawl keyless search (escape hatch)
231
+ │ ├── firecrawl_scrape.ts # Firecrawl keyless single-page fetch (escape hatch)
232
+ │ └── firecrawl_interact.ts # Firecrawl keyless natural-language interaction (escape hatch)
114
233
  ├── test/
115
- │ └── content-preview/ # Automated test suite with fixtures & snapshots
234
+ │ ├── agent-browser/ # agent-browser output parser regression tests
235
+ │ ├── content-preview/ # Content preview fixtures, baselines & snapshots
236
+ │ └── README.md # Test suite structure and conventions
116
237
  ├── docs/
117
238
  │ ├── tools.md # Full parameter specs
118
- │ └── guide.md # Decision tree & tool comparison
239
+ │ ├── guide.md # Decision tree & tool comparison
240
+ │ └── agents/ # Issue tracker, triage and domain guidance
241
+ ├── AGENTS.md
242
+ ├── CONTEXT.md
119
243
  ├── CHANGELOG.md
120
244
  ├── package.json
121
245
  ├── README.md
246
+ ├── tsconfig.json
122
247
  └── LICENSE
123
248
  ```
124
249
 
@@ -0,0 +1,5 @@
1
+ # Firecrawl Keyless as an optional cloud fallback
2
+
3
+ pi-web-toolkit was local-first and self-hosted by design: SearXNG, scrapling, and agent-browser all run on the user's machine, and the README guaranteed "100% open-source. No required API keys or paid services." We decided to add **Firecrawl Keyless** as a strictly optional, fallback-only cloud layer: when a local backend errors out (or `web_search` returns nothing), the same tool transparently retries through the official `firecrawl-cli` in keyless mode, and three explicit `firecrawl_*` tools are exposed for capabilities the local backends lack.
4
+
5
+ This is hard to reverse once users and the agent come to rely on the fallback, surprising to a reader who assumes a local-only toolkit, and the result of a real trade-off (zero-config reliability vs. cloud egress, a privacy surface, and a third-party dependency). The fallback defaults **on**, is **keyless-only** (no API key, no signup, no stored credentials — the CLI is spawned under an isolated temp `HOME` with the key env stripped), and is **opt-out-able** via `PI_WEB_FIRECRAWL_FALLBACK=0`. We drive `firecrawl-cli` (an official Firecrawl client) rather than hand-rolling REST because Firecrawl only grants the free keyless tier to official clients, and we restrict it to the keyless endpoints (`/search`, `/scrape`, `/interact`); API-key mode, self-hosted URLs, OAuth, and non-keyless endpoints (`/map`, `/crawl`, `/batch/scrape`, etc.) are deliberately out of scope. The decision and the graceful-skip behavior (never leave the user worse off than the local tool already did) are encoded in the Firecrawl CLI wrapper module.
@@ -5,7 +5,7 @@ Issues and PRDs for this repo live as GitHub issues. Use the `gh` CLI for all op
5
5
  ## Conventions
6
6
 
7
7
  - **Create an issue**: `gh issue create --title "..." --body "..."`. Use a heredoc for multi-line bodies.
8
- - **Read an issue**: `gh issue view <number> --comments`, filtering comments by `jq` and also fetching labels.
8
+ - **Read an issue**: `gh issue view <number> --json number,title,body,labels,comments --jq '{number, title, body, labels: [.labels[].name], comments: [.comments[].body]}'`.
9
9
  - **List issues**: `gh issue list --state open --json number,title,body,labels,comments --jq '[.[] | {number, title, body, labels: [.labels[].name], comments: [.comments[].body]}]'` with appropriate `--label` and `--state` filters.
10
10
  - **Comment on an issue**: `gh issue comment <number> --body "..."`
11
11
  - **Apply / remove labels**: `gh issue edit <number> --add-label "..."` / `--remove-label "..."`
@@ -19,4 +19,4 @@ Create a GitHub issue.
19
19
 
20
20
  ## When a skill says "fetch the relevant ticket"
21
21
 
22
- Run `gh issue view <number> --comments`.
22
+ Run `gh issue view <number> --json number,title,body,labels,comments --jq '{number, title, body, labels: [.labels[].name], comments: [.comments[].body]}'`.
package/docs/guide.md CHANGED
@@ -8,19 +8,19 @@ User asks about something external / current
8
8
  ├─→ web_search("...")
9
9
  │ │
10
10
  │ ├─→ 1 relevant result?
11
- │ │ └─→ web_fetch(url) ← static page
11
+ │ │ └─→ web_fetch(url) ← no interaction needed
12
12
  │ │ OR
13
13
  │ │ └─→ web_browse(url, actions) ← needs interaction
14
14
  │ │
15
15
  │ └─→ 2–5 relevant results?
16
- │ ├─→ All static pages?
16
+ │ ├─→ All need no interaction?
17
17
  │ │ └─→ web_batch_fetch(urls[]) ← parallel fetch
18
18
  │ └─→ Some need interaction?
19
- │ └─→ web_fetch (static ones)
19
+ │ └─→ web_fetch (no-interaction ones)
20
20
  │ web_browse (interactive ones) ← sequential
21
21
 
22
22
  └─→ User provides a URL directly
23
- ├─→ Static / loads on first request?
23
+ ├─→ No interaction needed / loads on first request?
24
24
  │ └─→ web_fetch(url)
25
25
  └─→ Needs clicking / scrolling / waiting?
26
26
  └─→ web_browse(url, actions)
@@ -32,10 +32,35 @@ User asks about something external / current
32
32
 
33
33
  | | `web_fetch` | `web_browse` | `web_batch_fetch` |
34
34
  |--|-------------|--------------|-------------------|
35
- | **Pages** | 1 | 1 | 2–15 |
36
- | **Browser** | Yes (scrapling) | Yes (agent-browser) | Yes (scrapling) |
35
+ | **Pages** | 1 | 1 | 1–15 (2–5 recommended) |
36
+ | **Browser** | Yes (Scrapling) | Yes (agent-browser) | Yes (Scrapling) |
37
37
  | **Interaction** | ❌ No | ✅ Click, fill, scroll, wait | ❌ No |
38
38
  | **Selector** | ✅ Per-URL | ✅ Final state | ✅ Applied to all |
39
- | **Stealthy** | ✅ Yes | ❌ No (planned) | ✅ Yes |
39
+ | **Stealthy** | ✅ Yes | ❌ No | ✅ Yes |
40
40
  | **Speed** | Fast | Slower (browser ops) | Medium (parallel) |
41
41
  | **Best for** | Articles, docs, blogs | SPAs, forms, pagination | Research synthesis |
42
+
43
+ `web_fetch` falls back to HTTP GET after a normal browser fetch fails, but not in stealthy mode. `web_batch_fetch` falls back to GET after failed browser fetches in all modes.
44
+
45
+ ---
46
+
47
+ ## Firecrawl Keyless fallback
48
+
49
+ When a local backend cannot do the job, the tools automatically retry through **Firecrawl Keyless** (1,000 free credits/month, no API key, no signup) before giving up. It is **fallback-only** — never the primary path — and is **opt-out-able** with `PI_WEB_FIRECRAWL_FALLBACK=0`. Requires the optional `firecrawl-cli` (`npm install -g firecrawl-cli`); if it is absent the tools simply surface the original local error.
50
+
51
+ | Tool | Falls back to Firecrawl when… |
52
+ |------|-------------------------------|
53
+ | `web_search` | SearXNG errors out **or** returns zero results |
54
+ | `web_fetch` | scrapling (incl. its HTTP-GET fallback) fails — anti-bot, heavy JS, PDFs |
55
+ | `web_browse` | agent-browser is missing or its batch fails (not on caller validation errors) |
56
+ | `web_batch_fetch` | (no fallback — Firecrawl batch scrape is not keyless) |
57
+
58
+ The three `firecrawl_*` tools are the explicit escape hatches for capabilities the local backends lack (`github`/`research`/`pdf` search categories, cloud rendering, natural-language interaction).
59
+
60
+ **Graceful skip.** If the fallback itself cannot help — the CLI is missing, the IP is flagged as suspicious, the keyless quota is exhausted, or the fallback is disabled — the tool falls through to the original local-tool error so the user is never left worse off.
61
+
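The fallback table and the graceful-skip rule above can be sketched as two pure functions. This is an illustrative sketch only: every name here (`shouldFallBack`, `pickFinalResult`, the type names and the `reason` strings) is hypothetical and not taken from `extensions/utils/firecrawl.ts`, whose real predicates may be shaped differently.

```typescript
// Hypothetical types; the real wrapper module may model this differently.
type LocalTool = "web_search" | "web_fetch" | "web_browse" | "web_batch_fetch";

interface LocalOutcome {
  failed: boolean;           // the local backend errored at runtime
  resultCount?: number;      // only meaningful for web_search
  validationError?: boolean; // the caller passed invalid arguments
}

interface FallbackGate {
  enabled: boolean;      // PI_WEB_FIRECRAWL_FALLBACK is not disabled
  cliInstalled: boolean; // firecrawl-cli was found
}

// Mirrors the "Falls back to Firecrawl when..." table.
function shouldFallBack(tool: LocalTool, outcome: LocalOutcome, gate: FallbackGate): boolean {
  if (!gate.enabled || !gate.cliInstalled) return false; // graceful skip
  switch (tool) {
    case "web_search":
      return outcome.failed || outcome.resultCount === 0;
    case "web_fetch":
      return outcome.failed;
    case "web_browse":
      return outcome.failed && !outcome.validationError;
    case "web_batch_fetch":
      return false; // batch scrape is not keyless
  }
}

type FallbackAttempt =
  | { ok: true; text: string }
  | { ok: false; reason: "ip_flagged" | "quota_exhausted" | "other" };

// Graceful skip: a failed fallback surfaces the original local error.
function pickFinalResult(localError: string, attempt: FallbackAttempt): { isError: boolean; text: string } {
  return attempt.ok
    ? { isError: false, text: attempt.text }
    : { isError: true, text: localError };
}
```

Keeping the decision separate from the CLI call means the table can be checked in pure-function tests without network access, which is in the spirit of the regression tests the changelog describes.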
62
+ **Credit budgeting.** Search ≈ 2 credits / 10 results, scrape ≈ 1 credit / page, interact ≈ 2 credits/min (code-only) or ≈ 7 credits/min (AI prompt). Results report `creditsUsed` where the source provides it. The fallback stays conservative (small limits) against the 1,000 credits/month allowance.
63
+
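To make the budgeting arithmetic concrete, the approximate rates above can be turned into a rough estimator. This is a back-of-the-envelope sketch, not part of the toolkit: the function and field names are hypothetical, and rounding searches up to whole blocks of 10 results is an assumption about how usage is counted.

```typescript
// Hypothetical usage summary for one month of fallback traffic.
interface MonthlyUsage {
  searchResults: number;         // total results returned by cloud searches
  scrapedPages: number;          // pages fetched through scrape
  interactCodeMinutes: number;   // interact sessions driven by code only
  interactPromptMinutes: number; // interact sessions driven by an AI prompt
}

const MONTHLY_ALLOWANCE = 1000; // free keyless credits per month

function estimateCredits(usage: MonthlyUsage): number {
  // ASSUMPTION: search is billed per started block of 10 results.
  const search = Math.ceil(usage.searchResults / 10) * 2;
  const scrape = usage.scrapedPages * 1;
  const interact = usage.interactCodeMinutes * 2 + usage.interactPromptMinutes * 7;
  return search + scrape + interact;
}

const example: MonthlyUsage = {
  searchResults: 200,       // 20 blocks x 2 = 40 credits
  scrapedPages: 100,        // 100 credits
  interactCodeMinutes: 10,  // 20 credits
  interactPromptMinutes: 5, // 35 credits
};

const used = estimateCredits(example);      // 195
const remaining = MONTHLY_ALLOWANCE - used; // 805
```

Under these approximate rates, prompt-driven interaction is the most expensive path per minute, which is consistent with the guide's advice to keep the fallback conservative.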
64
+ **Privacy.** Firecrawl is a cloud service: when the fallback runs, the URL/query and page content leave the machine. Set `PI_WEB_FIRECRAWL_FALLBACK=0` to enforce a strict local-only, no-cloud-egress policy. The fallback is **keyless-only** — it never reads, stores, or sends an API key, and spawns the CLI under an isolated temporary `HOME`.
65
+
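The keyless-only guarantee amounts to building a sanitized environment before the CLI is spawned. The sketch below shows one way this could look. It is an assumption-laden illustration: `buildKeylessEnv` is a hypothetical name, and `FIRECRAWL_API_KEY` is only an assumed variable name, since the guide says "the key env" without naming it.

```typescript
type Env = Record<string, string | undefined>;

// ASSUMPTION: the variable name(s) that could carry a Firecrawl API key.
const ASSUMED_KEY_VARS = ["FIRECRAWL_API_KEY"];

// Hypothetical helper: returns a copy of the environment with HOME pointed at
// an isolated directory and any key variables removed. The input is not mutated.
function buildKeylessEnv(base: Env, isolatedHome: string): Env {
  const env: Env = { ...base, HOME: isolatedHome };
  for (const name of ASSUMED_KEY_VARS) {
    delete env[name];
  }
  return env;
}

const keylessEnv = buildKeylessEnv(
  { PATH: "/usr/bin", HOME: "/home/user", FIRECRAWL_API_KEY: "secret" },
  "/tmp/pi-firecrawl-home", // illustrative path only
);
```

In practice the isolated directory would more likely be created fresh per run (for example with Node's `fs.mkdtempSync`) and the resulting object passed through the `env` option that `cli-runner` gained in 0.3.0. Pointing `HOME` elsewhere is what keeps the CLI from finding credentials stored in the user's real home directory.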
66
+ ---
package/docs/tools.md CHANGED
@@ -7,22 +7,24 @@ Search the web via SearXNG. Returns ranked results with title, URL, and snippet.
7
7
  ```typescript
8
8
  {
9
9
  query: string, // Search query
10
- language?: string, // Language code (en, de, fr...). Default: "auto"
10
+ language?: string, // Language code (en, en-US, de...). Omit to use the SearXNG default.
11
11
  results?: number, // Max results (1–60). Default: 20. Automatically pages through SearXNG (up to 3 pages) if needed.
12
12
  }
13
13
  ```
14
14
 
15
- **When to use:** The user asks about current events, facts, or anything requiring up-to-date information. This is always the **first step** of web research.
15
+ **When to use:** The user asks about current events, facts, or anything requiring up-to-date information and has not already provided the source URLs.
16
16
 
17
- **Empty results behavior:** When no results are found, `web_search` returns a list of **suggestions**: alternative queries that SearXNG believes may yield better results. The agent can use these suggestions to automatically refine and retry the search.
17
+ **Empty results behavior:** When no results are found, `web_search` includes any query **suggestions** provided by SearXNG. The agent can use them to refine and retry the search.
18
18
 
19
19
  **Pagination:** `web_search` automatically fetches up to 3 pages from SearXNG and deduplicates by URL. You do not need to call it multiple times for deeper results.
20
20
 
21
+ **Full output:** For non-empty searches, the formatted result output is always written to a temporary file. Returned text is also truncated to pi's default line/byte limits when necessary.
22
+
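The pagination behavior described above (up to 3 pages, deduplicated by URL, capped at the requested result count) can be sketched as a merge over already-fetched pages. This is a minimal illustration, not the tool's implementation: `mergePages` and `SearchResult` are hypothetical names, and exact string matching on URLs is an assumption, since the docs do not say whether URLs are normalized first.

```typescript
// Hypothetical result shape based on the documented title/URL/snippet fields.
interface SearchResult {
  title: string;
  url: string;
  snippet: string;
}

// Merges result pages in order, keeping the first occurrence of each URL.
function mergePages(pages: SearchResult[][], wanted: number, maxPages = 3): SearchResult[] {
  const seen = new Set<string>();
  const merged: SearchResult[] = [];
  for (const page of pages.slice(0, maxPages)) {
    for (const result of page) {
      if (merged.length >= wanted) return merged;
      // ASSUMPTION: URLs are compared as exact strings.
      if (seen.has(result.url)) continue;
      seen.add(result.url);
      merged.push(result);
    }
  }
  return merged;
}

// Small helper to build sample data for the usage example.
const hit = (url: string): SearchResult => ({ title: url, url, snippet: "" });

const mergedUrls = mergePages(
  [[hit("a"), hit("b")], [hit("b"), hit("c")], [hit("d")], [hit("e")]],
  10,
).map((r) => r.url); // ["a", "b", "c", "d"]
```

The fourth page is ignored because of the 3-page cap, and the repeated `b` is dropped, so callers see each URL at most once.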
21
23
  ---
22
24
 
23
25
  ## `web_fetch`
24
26
 
25
- Fetch a single page and convert it to clean markdown. Uses scrapling's browser automation for JS-heavy sites.
27
+ Fetch a single page and convert it to clean markdown. Uses Scrapling's browser-based `fetch` command first, then falls back to an HTTP GET when allowed. Stealthy mode uses `stealthy-fetch` and does not fall back to GET.
26
28
 
27
29
  ```typescript
28
30
  {
@@ -39,9 +41,9 @@ Fetch a single page and convert it to clean markdown. Uses scrapling's browser a
39
41
 
40
42
  **Example flow:**
41
43
  ```
42
- User: "What's the latest Rust release?"
43
- → web_search("latest Rust programming language release")
44
- → web_fetch("https://blog.rust-lang.org/2026/06/02/maintainers-fund/")
44
+ User: "How do I install Rust?"
45
+ → web_search("official Rust installation guide")
46
+ → web_fetch("https://www.rust-lang.org/tools/install")
45
47
  → Agent answers with full context
46
48
  ```
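The fallback order described for `web_fetch` (browser-based `fetch` first, then HTTP GET when allowed, with stealthy mode never falling back) can be sketched as an ordered plan. The names `FetchStep`, `fetchPlan`, `runPlan`, and the `getFallbackAllowed` flag are assumptions made for this illustration; the document does not say what decides whether GET is allowed.

```typescript
// Step names are illustrative labels, not a statement of the exact CLI invocation.
type FetchStep = "fetch" | "stealthy-fetch" | "http-get";

// Build the ordered list of attempts for a single page.
// `getFallbackAllowed` is a hypothetical flag standing in for "when allowed".
function fetchPlan(stealthy: boolean, getFallbackAllowed = true): FetchStep[] {
  if (stealthy) return ["stealthy-fetch"]; // stealthy mode does not fall back to GET
  return getFallbackAllowed ? ["fetch", "http-get"] : ["fetch"];
}

// Try each step in order; return the first success, otherwise rethrow the last error.
function runPlan<T>(plan: FetchStep[], attempt: (step: FetchStep) => T): T {
  let lastError: unknown = new Error("empty fetch plan");
  for (const step of plan) {
    try {
      return attempt(step);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

In a real tool `attempt` would be asynchronous; it is kept synchronous here only to keep the sketch short.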
47
49
 
@@ -51,7 +53,7 @@ User: "What's the latest Rust release?"
51
53
 
52
54
  Open a real browser, perform a chain of actions (click, fill, scroll, wait), then extract content.
53
55
 
54
- Uses the [agent-browser](https://github.com/vercel-labs/agent-browser) CLI for native browser automation via Chrome CDP.
56
+ Uses the [agent-browser](https://github.com/vercel-labs/agent-browser) CLI with batched JSON commands.
55
57
 
56
58
  ```typescript
57
59
  {
@@ -65,12 +67,15 @@ Uses the [agent-browser](https://github.com/vercel-labs/agent-browser) CLI for n
65
67
  | { type: "wait_selector", selector: string, state?: "attached" | "visible" | "hidden" }
66
68
  | { type: "scroll", direction: "down" | "up" | "bottom" | "top", amount?: number }
67
69
  >,
70
+ // Maximum: 25 actions
68
71
  selector?: string, // Extract content from final page state
69
72
  headless?: boolean, // Default: true
70
73
  timeout?: number, // Overall browser batch timeout (ms). Default: 30000
71
74
  }
72
75
  ```
73
76
 
77
+ When `selector` is omitted, the tool returns agent-browser's interactive accessibility snapshot rather than full page text.
78
+
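As a rough illustration of the 25-action limit, the sketch below checks an action list before it would be sent. Only the two action variants visible in this hunk are modelled; `BrowseAction`, `checkActions`, and the `#results` selector are hypothetical, and whether the real tool throws or returns an error message is not stated here.

```typescript
// Only the action variants shown in the schema excerpt above are modelled.
type BrowseAction =
  | { type: "wait_selector"; selector: string; state?: "attached" | "visible" | "hidden" }
  | { type: "scroll"; direction: "down" | "up" | "bottom" | "top"; amount?: number };

const MAX_ACTIONS = 25; // limit stated in the schema comment

// Reject over-long action chains up front (assumed behavior, for illustration).
function checkActions(actions: BrowseAction[]): BrowseAction[] {
  if (actions.length > MAX_ACTIONS) {
    throw new RangeError(`at most ${MAX_ACTIONS} actions allowed, got ${actions.length}`);
  }
  return actions;
}

// Example chain: scroll to the bottom, then wait for a (hypothetical) results container.
const steps = checkActions([
  { type: "scroll", direction: "bottom" },
  { type: "wait_selector", selector: "#results", state: "visible" },
]);
```

Passing a `selector` alongside such a chain would return the matched content from the final page state instead of the accessibility snapshot.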
74
79
  **When to use:**
75
80
  - The page requires **clicking** before showing target content (e.g. "Load more", pagination, tab switching)
76
81
  - The page requires **filling a form** (e.g. search box, login)
@@ -123,7 +128,7 @@ Fetch multiple pages in parallel and return aggregated content.
123
128
 
124
129
  ```typescript
125
130
  {
126
- urls: string[], // 1–10 URLs
131
+ urls: string[], // 1–15 URLs; 2–5 recommended
127
132
  selector?: string, // CSS selector applied to ALL pages
128
133
  stealthy?: boolean, // Default: false
129
134
  max_concurrency?: number // Parallel fetches (1–5). Default: 3
@@ -136,7 +141,9 @@ Fetch multiple pages in parallel and return aggregated content.
136
141
  - Comparing implementations across different docs/pages
137
142
  - Research synthesis requiring multiple sources
138
143
 
139
- **NOT for:** Single pages (use `web_fetch`: simpler and supports per-URL stealthy mode).
144
+ **NOT recommended for:** Single pages. The schema accepts one URL, but `web_fetch` is simpler and provides single-page behavior.
145
+
146
+ Each page starts with the selected Scrapling browser fetcher. Failed attempts fall back to HTTP GET, including when batch stealthy mode is enabled.
140
147
 
141
148
  **Example flow:**
142
149
  ```
@@ -153,3 +160,65 @@ User: "Compare Python asyncio, Trio, and curio"
153
160
  })
154
161
  → Agent synthesizes comparison from all 3 sources
155
162
  ```
163
+
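To make the batch limits and the per-page fallback concrete, here is a minimal sketch under stated assumptions. `BatchRequest` mirrors the schema above, while `planBatch`, `BatchStep`, and the choice to reject out-of-range values (rather than clamp them) are hypothetical. It also assumes that "the selected Scrapling browser fetcher" means `stealthy-fetch` when `stealthy` is true and `fetch` otherwise.

```typescript
interface BatchRequest {
  urls: string[];
  selector?: string;
  stealthy?: boolean;
  max_concurrency?: number;
}

// Illustrative step labels for one page's attempts.
type BatchStep = "fetch" | "stealthy-fetch" | "http-get";

// Validate the documented ranges and derive the per-page attempt order.
function planBatch(req: BatchRequest): { concurrency: number; perPage: BatchStep[] } {
  if (req.urls.length < 1 || req.urls.length > 15) {
    throw new RangeError("urls must contain 1–15 entries");
  }
  const concurrency = req.max_concurrency ?? 3; // documented default
  if (!Number.isInteger(concurrency) || concurrency < 1 || concurrency > 5) {
    throw new RangeError("max_concurrency must be an integer from 1 to 5");
  }
  const first: BatchStep = req.stealthy ? "stealthy-fetch" : "fetch";
  // Unlike single-page stealthy mode, batch mode falls back to GET either way.
  return { concurrency, perPage: [first, "http-get"] };
}
```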
164
+ ---
165
+
166
+ ## Firecrawl keyless tools (optional cloud escape hatches)
167
+
168
+ These three tools talk to [Firecrawl](https://www.firecrawl.dev) in **keyless** mode: 1,000 free credits/month, **no API key and no signup**. They require the optional `firecrawl-cli` (`npm install -g firecrawl-cli`). **Privacy:** the URL/query/page content is sent to Firecrawl's cloud.
169
+
170
+ They double as the implementation of the automatic fallback: `web_search`/`web_fetch`/`web_browse` retry through Firecrawl keyless when their local backend fails (or search returns nothing). Disable all Firecrawl usage with `PI_WEB_FIRECRAWL_FALLBACK=0`.
171
+
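The fallback rule above (retry when the local backend fails, or when search returns nothing, unless the toggle is off) can be expressed as a small predicate. This is a sketch, not the code in the package: `LocalOutcome`, `fallbackEnabled`, and `shouldFallback` are hypothetical names, and treating only the literal value `0` as "disabled" is an assumption, since the text documents only `PI_WEB_FIRECRAWL_FALLBACK=0`.

```typescript
type LocalTool = "web_search" | "web_fetch" | "web_browse";

// Hypothetical summary of what the local backend returned.
type LocalOutcome =
  | { ok: true; resultCount?: number }
  | { ok: false; error: string };

type Env = Record<string, string | undefined>;

// Default is on; only the documented value "0" is treated as off in this sketch.
function fallbackEnabled(env: Env): boolean {
  return env["PI_WEB_FIRECRAWL_FALLBACK"] !== "0";
}

function shouldFallback(tool: LocalTool, outcome: LocalOutcome, env: Env): boolean {
  if (!fallbackEnabled(env)) return false;
  if (!outcome.ok) return true; // local backend errored
  return tool === "web_search" && outcome.resultCount === 0; // empty search
}
```

Passing the environment in as an argument keeps the predicate pure; a caller would typically hand it `process.env`.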
172
+ ### `firecrawl_search`
173
+
174
+ Cloud web search via Firecrawl keyless, with capabilities the local SearXNG tool lacks.
175
+
176
+ ```typescript
177
+ {
178
+ query: string,
179
+ limit?: number, // 1–100. Default 10
180
+ sources?: Array<"web"|"images"|"news">,
181
+ categories?: Array<"github"|"research"|"pdf">,
182
+ country?: string, // ISO code, e.g. "US", "DE"
183
+ tbs?: string, // qdr:h|d|w|m|y
184
+ location?: string,
185
+ includeDomains?: string[], // hostnames; folded into the query as site: operators
186
+ excludeDomains?: string[],
187
+ }
188
+ ```
189
+
190
+ **When to use:** `web_search` failed or returned nothing; or you need `github`/`research`/`pdf` categories, images/news sources, or domain scoping that SearXNG does not provide.
191
+
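The schema comment says `includeDomains` is folded into the query as `site:` operators. One plausible way to do that is sketched below; the exact string the tool builds is not shown in this document, so the `OR` grouping and the `-site:` form used for `excludeDomains` are assumptions, and `foldDomains` is a hypothetical helper.

```typescript
// Fold domain filters into a search query string (illustrative format only).
function foldDomains(
  query: string,
  includeDomains: string[] = [],
  excludeDomains: string[] = [],
): string {
  const parts: string[] = [query.trim()];
  if (includeDomains.length === 1) {
    parts.push(`site:${includeDomains[0]}`);
  } else if (includeDomains.length > 1) {
    // Any of the included hosts may match, so group them with OR.
    parts.push(`(${includeDomains.map((d) => `site:${d}`).join(" OR ")})`);
  }
  for (const domain of excludeDomains) {
    parts.push(`-site:${domain}`); // assumed exclusion syntax
  }
  return parts.join(" ");
}

const scoped = foldDomains("async runtime", ["docs.rs", "rust-lang.org"], ["reddit.com"]);
// scoped === "async runtime (site:docs.rs OR site:rust-lang.org) -site:reddit.com"
```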
192
+ ### `firecrawl_scrape`
193
+
194
+ Cloud single-page fetch via Firecrawl keyless (anti-bot bypass, JS rendering, PDF parsing).
195
+
196
+ ```typescript
197
+ {
198
+ url: string,
199
+ waitFor?: number, // ms to wait for JS rendering
200
+ includeTags?: string[], // Firecrawl tag filter (not a CSS selector)
201
+ excludeTags?: string[],
202
+ onlyMainContent?: boolean, // Default: true
203
+ }
204
+ ```
205
+
206
+ **When to use:** `web_fetch` failed on an anti-bot-protected, JavaScript-heavy, or PDF page.
207
+
208
+ ### `firecrawl_interact`
209
+
210
+ Open a URL in a live Firecrawl browser session and drive it with a natural-language prompt (or code).
211
+
212
+ ```typescript
213
+ {
214
+ url: string,
215
+ prompt?: string, // natural-language task (required unless code is set)
216
+ code?: string, // code to run in the browser sandbox
217
+ language?: "node"|"python"|"bash",
218
+ timeout?: number, // seconds (1–300)
219
+ }
220
+ ```
221
+
222
+ **When to use:** `web_browse` cannot run (agent-browser missing / OS deps missing), or you want natural-language page interaction without hand-written CSS selectors. Write each prompt as a single, focused task.
223
+
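The constraints in the `firecrawl_interact` schema (a prompt is required unless `code` is set, and `timeout` is 1–300 seconds) can be checked before a call is made. The sketch below is illustrative only: `InteractRequest` mirrors the schema above, while `interactProblems` is a hypothetical helper and not part of the package.

```typescript
interface InteractRequest {
  url: string;
  prompt?: string;
  code?: string;
  language?: "node" | "python" | "bash";
  timeout?: number; // seconds
}

// Collect human-readable problems instead of throwing, so a caller can report them all.
function interactProblems(req: InteractRequest): string[] {
  const problems: string[] = [];
  if (!req.prompt && !req.code) {
    problems.push("either prompt or code is required");
  }
  if (req.timeout !== undefined && (req.timeout < 1 || req.timeout > 300)) {
    problems.push("timeout must be between 1 and 300 seconds");
  }
  return problems;
}

// A single, focused task, as the guidance above recommends.
const request: InteractRequest = {
  url: "https://example.com/pricing",
  prompt: "Switch the billing toggle to annual and list the plan prices",
  timeout: 60,
};
```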
224
+ ---