pi-web-toolkit 0.2.2 → 0.3.0
- package/CHANGELOG.md +20 -0
- package/README.md +39 -9
- package/docs/adr/0001-firecrawl-keyless-cloud-fallback.md +5 -0
- package/docs/guide.md +23 -0
- package/docs/tools.md +62 -0
- package/extensions/firecrawl_interact.ts +147 -0
- package/extensions/firecrawl_scrape.ts +154 -0
- package/extensions/firecrawl_search.ts +165 -0
- package/extensions/index.ts +6 -0
- package/extensions/utils/cli-runner.ts +3 -0
- package/extensions/utils/firecrawl.ts +484 -0
- package/extensions/web_browse.ts +57 -0
- package/extensions/web_fetch.ts +29 -7
- package/extensions/web_search.ts +85 -35
- package/package.json +5 -4
package/CHANGELOG.md
CHANGED
|
@@ -7,6 +7,26 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
|
|
7
7
|
|
|
8
8
|
## [Unreleased]
|
|
9
9
|
|
|
10
|
+
## [0.3.0] - 2026-06-23
|
|
11
|
+
|
|
12
|
+
### Added
|
|
13
|
+
|
|
14
|
+
- **Firecrawl Keyless fallback** — `web_search`, `web_fetch`, and `web_browse` now automatically retry through [Firecrawl Keyless](https://www.firecrawl.dev/blog/firecrawl-keyless-launch) (1,000 free credits/month, **no API key, no signup**) when their local backend errors out, or when `web_search` returns zero results. The fallback is keyless-only, never the primary path, and degrades gracefully to the original local-tool error if the `firecrawl-cli` is absent, the IP is flagged, the quota is exhausted, or the fallback is disabled.
|
|
15
|
+
- Three explicit escape-hatch tools for capabilities the local backends lack: `firecrawl_search` (sources, `github`/`research`/`pdf` categories, domain filters), `firecrawl_scrape` (anti-bot bypass, JS rendering, PDF parsing), and `firecrawl_interact` (natural-language page interaction).
|
|
16
|
+
- `extensions/utils/firecrawl.ts` — a deep Firecrawl CLI wrapper (scrape/search/interact argument builders, output parsers, graceful-skip failure classifier, keyless-eligibility check, and fallback-decision predicates).
|
|
17
|
+
- Optional external CLI dependency: `npm install -g firecrawl-cli`.
|
|
18
|
+
- Environment toggle `PI_WEB_FIRECRAWL_FALLBACK` (default on) to disable all Firecrawl usage.
|
|
19
|
+
- `test/firecrawl/test.ts` — pure-function regression tests for the firecrawl wrapper boundary (wired into `npm test` as `test:firecrawl`).
|
|
20
|
+
- ADR 0001 and `CONTEXT.md` glossary entries (`Firecrawl keyless`, `cloud fallback`, `free credits`, `graceful skip`) documenting the local-first → optional keyless cloud fallback architectural decision.
|
|
21
|
+
|
|
22
|
+
### Changed
|
|
23
|
+
|
|
24
|
+
- **Default network/privacy behavior.** When a local web tool fails, it now makes a cloud request to Firecrawl (sending the URL/query and page content) before giving up. The fallback is **keyless-only** — it never reads, stores, or sends an API key, and spawns the CLI under an isolated temporary `HOME` with the key env stripped. To enforce a strict local-only / no-cloud-egress policy, set `PI_WEB_FIRECRAWL_FALLBACK=0`.
|
|
25
|
+
- `web_search` falls back on a SearXNG error **or** zero results; `web_fetch` falls back on a scrapling failure (incl. its HTTP-GET fallback); `web_browse` falls back only on runtime failures (missing/broken `agent-browser`), never on caller validation errors. `web_batch_fetch` has no fallback (Firecrawl batch scrape is not keyless).
|
|
26
|
+
- Firecrawl results report `creditsUsed` where the source provides it (search, interact); scrape responses do not surface it.
|
|
27
|
+
- README tagline and hero now describe the toolkit as local-first with an optional keyless cloud fallback; features table, install prompt, configuration, project structure, tool reference, and usage guide updated accordingly.
|
|
28
|
+
- `cli-runner` gained an optional `env` passthrough so the firecrawl CLI can be spawned keyless-only.
|
|
29
|
+
|
|
10
30
|
## [0.2.2] - 2026-06-11
|
|
11
31
|
|
|
12
32
|
### Added
|
package/README.md
CHANGED
|
@@ -6,9 +6,9 @@
|
|
|
6
6
|
[](LICENSE)
|
|
7
7
|

|
|
8
8
|
|
|
9
|
-
**100% open-source. No required API keys or paid services.**
|
|
9
|
+
**Local-first & 100% open-source. No required API keys or paid services.**
|
|
10
10
|
|
|
11
|
-
Web research toolkit for [pi](https://pi.dev) agents. Search via SearXNG, fetch pages with scrapling, browse interactively via agent-browser, and batch-read sources in parallel. All backends run locally or are self-hosted, with
|
|
11
|
+
Web research toolkit for [pi](https://pi.dev) agents. Search via SearXNG, fetch pages with scrapling, browse interactively via agent-browser, and batch-read sources in parallel. All primary backends run locally or are self-hosted, with an **optional Firecrawl Keyless cloud fallback** (no API key, no signup) so the local tools keep working when a backend is missing or fails. Built-in truncation safety and LLM-optimized prompt guidelines throughout.
|
|
12
12
|
|
|
13
13
|
## Features
|
|
14
14
|
|
|
@@ -18,6 +18,11 @@ Web research toolkit for [pi](https://pi.dev) agents. Search via SearXNG, fetch
|
|
|
18
18
|
| **`web_fetch`** | [scrapling](https://github.com/D4Vinci/Scrapling) | Fetch a single page as clean markdown | — |
|
|
19
19
|
| **`web_batch_fetch`** | [scrapling](https://github.com/D4Vinci/Scrapling) | Fetch 1–15 pages in parallel for research synthesis (2–5 recommended) | 3 concurrent (max 5) |
|
|
20
20
|
| **`web_browse`** | [agent-browser](https://github.com/vercel-labs/agent-browser) | Interact with a page (click, scroll, fill) then extract content | 25 actions |
|
|
21
|
+
| **`firecrawl_search`** | [firecrawl-cli](https://github.com/firecrawl/cli) (keyless) | Cloud search with sources/categories/domain filters | — |
|
|
22
|
+
| **`firecrawl_scrape`** | [firecrawl-cli](https://github.com/firecrawl/cli) (keyless) | Cloud single-page fetch (anti-bot / JS / PDF) | — |
|
|
23
|
+
| **`firecrawl_interact`** | [firecrawl-cli](https://github.com/firecrawl/cli) (keyless) | Cloud natural-language page interaction | — |
|
|
24
|
+
|
|
25
|
+
> **Firecrawl fallback.** `web_search`, `web_fetch`, and `web_browse` automatically retry through Firecrawl Keyless (1,000 free credits/month, no API key) when their local backend errors out or search returns nothing. The three `firecrawl_*` tools are explicit escape hatches. Disable it with `PI_WEB_FIRECRAWL_FALLBACK=0`. Install the optional CLI: `npm install -g firecrawl-cli`.
|
|
21
26
|
|
|
22
27
|
## Tools Preview
|
|
23
28
|
|
|
@@ -98,7 +103,10 @@ curl -fsS --get "${SEARXNG_ENDPOINT%/}/search" \
|
|
|
98
103
|
agent-browser install
|
|
99
104
|
agent-browser doctor
|
|
100
105
|
On Linux, use agent-browser install --with-deps if required.
|
|
101
|
-
5.
|
|
106
|
+
5. Optionally install firecrawl-cli for the keyless cloud fallback (no API key
|
|
107
|
+
needed; the fallback degrades gracefully if it is absent):
|
|
108
|
+
npm install -g firecrawl-cli
|
|
109
|
+
6. After all dependencies pass verification, install the package:
|
|
102
110
|
pi install npm:pi-web-toolkit
|
|
103
111
|
|
|
104
112
|
Report what was installed or reused, all verification results, the SearXNG
|
|
@@ -141,6 +149,9 @@ scrapling install
|
|
|
141
149
|
# agent-browser (for browse)
|
|
142
150
|
npm i -g agent-browser && agent-browser install
|
|
143
151
|
# On Linux hosts missing browser system libraries: agent-browser install --with-deps
|
|
152
|
+
|
|
153
|
+
# firecrawl-cli (OPTIONAL — enables the keyless cloud fallback; no API key needed)
|
|
154
|
+
npm i -g firecrawl-cli
|
|
144
155
|
```
|
|
145
156
|
|
|
146
157
|
**Verify dependencies:**
|
|
@@ -175,31 +186,50 @@ pi install git:github.com/Wade11s/pi-web-toolkit
|
|
|
175
186
|
| Variable | Default | Used By | Description |
|
|
176
187
|
|----------|---------|---------|-------------|
|
|
177
188
|
| `SEARXNG_URL` | `http://localhost:8080` | `web_search` | Your SearXNG instance endpoint |
|
|
189
|
+
| `PI_WEB_FIRECRAWL_FALLBACK` | `1` (on) | all tools | Set to `0`/`false`/`no`/`off` to disable the optional Firecrawl keyless cloud fallback for a strict local-only policy. |
|
|
178
190
|
|
|
179
191
|
Set before starting pi:
|
|
180
192
|
|
|
181
193
|
```bash
|
|
182
194
|
export SEARXNG_URL="https://searxng.example.com"
|
|
195
|
+
# Optional: disable the Firecrawl cloud fallback entirely
|
|
196
|
+
export PI_WEB_FIRECRAWL_FALLBACK=0
|
|
183
197
|
```
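One way the toggle values listed in the configuration table could be interpreted is sketched here. This is an illustration under stated assumptions, not the package's actual parser: the function name `isFallbackEnabled` is hypothetical, and the trimming, case-insensitive matching, and "anything else means on" behavior are guesses that go beyond what the table states.

```typescript
// Hypothetical helper; the name and the exact matching rules are assumptions.
// Documented behavior: default is on; "0", "false", "no", "off" turn it off.
const DISABLING_VALUES = ["0", "false", "no", "off"];

function isFallbackEnabled(raw: string | undefined): boolean {
  // Unset variable: the documented default is "on".
  if (raw === undefined) return true;
  // Assumption: compare case-insensitively and ignore surrounding whitespace.
  const value = raw.trim().toLowerCase();
  return DISABLING_VALUES.indexOf(value) === -1;
}
```

A caller would pass the raw value of `PI_WEB_FIRECRAWL_FALLBACK` from the environment and skip every Firecrawl code path when the function returns `false`.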
|
|
184
198
|
|
|
199
|
+
### Optional: Firecrawl keyless fallback
|
|
200
|
+
|
|
201
|
+
When a local backend (`web_search`/`web_fetch`/`web_browse`) fails or returns nothing, the tools automatically retry through [Firecrawl Keyless](https://www.firecrawl.dev/blog/firecrawl-keyless-launch) — 1,000 free credits/month, **no API key, no signup**. The `firecrawl_*` tools are explicit escape hatches for capabilities the local backends lack (search categories, cloud rendering, natural-language interaction).
|
|
202
|
+
|
|
203
|
+
Install the optional CLI (the fallback degrades gracefully if it is absent):
|
|
204
|
+
|
|
205
|
+
```bash
|
|
206
|
+
npm install -g firecrawl-cli
|
|
207
|
+
```
|
|
208
|
+
|
|
209
|
+
The fallback is **keyless-only**: it never reads or stores an API key, and spawns the CLI under an isolated temporary `HOME` with the key env stripped. **Privacy:** when the fallback runs, the URL and page content are sent to Firecrawl's cloud.
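The keyless-only spawn described above can be pictured as a small environment-building step. The sketch below is an assumption-laden illustration, not the code in `extensions/utils/firecrawl.ts`: the helper name `buildKeylessEnv` is hypothetical, and `FIRECRAWL_API_KEY` is assumed to be the variable that would otherwise carry a key. The isolated directory is passed in as a plain string; in practice it could come from something like Node's `fs.mkdtemp`.

```typescript
// Hypothetical helper: none of these names are taken from the package itself.
type Env = Record<string, string | undefined>;

// Assumption: this is the variable that would carry an API key.
const KEY_VARS = ["FIRECRAWL_API_KEY"];

function buildKeylessEnv(
  parentEnv: Env,
  isolatedHome: string,
  keyVars: string[] = KEY_VARS,
): Env {
  // Copy first so the caller's environment object is never mutated.
  const env: Env = { ...parentEnv };
  // Strip every key-bearing variable from the copy.
  for (const name of keyVars) {
    delete env[name];
  }
  // Point HOME at the isolated directory so stored credentials are not found.
  env["HOME"] = isolatedHome;
  return env;
}
```

The returned object is the kind of value that would be handed to the optional `env` passthrough that `cli-runner` gained in this release.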
|
|
210
|
+
|
|
185
211
|
## Project Structure
|
|
186
212
|
|
|
187
213
|
```
|
|
188
214
|
pi-web-toolkit/
|
|
189
215
|
├── extensions/
|
|
190
|
-
│ ├── index.ts # Unified entry point — registers all 4
|
|
216
|
+
│ ├── index.ts # Unified entry point — registers all 7 tools (4 local + 3 Firecrawl keyless)
|
|
191
217
|
│ ├── utils/
|
|
192
|
-
│ │ ├── cli-runner.ts # Unified CLI process spawning with timeout/AbortSignal
|
|
218
|
+
│ │ ├── cli-runner.ts # Unified CLI process spawning with timeout/AbortSignal/env
|
|
193
219
|
│ │ ├── content-preview.ts # Intelligent content extraction from scraped pages
|
|
194
220
|
│ │ ├── output-sink.ts # Truncation + temp-file fallback
|
|
195
221
|
│ │ ├── render-helpers.ts # URL abbreviations, text normalization, error formatting for TUI
|
|
196
222
|
│ │ ├── scrapling.ts # Reusable scrapling CLI wrapper (shared by fetch + batch)
|
|
197
223
|
│ │ ├── tool-factory.ts # Common tool registration patterns
|
|
198
|
-
│ │
|
|
199
|
-
│
|
|
200
|
-
│ ├──
|
|
224
|
+
│ │ ├── agent-browser.ts # agent-browser CLI wrapper (shared by web_browse)
|
|
225
|
+
│ │ └── firecrawl.ts # Firecrawl keyless CLI wrapper + fallback decisions (shared by firecrawl_* tools + fallbacks)
|
|
226
|
+
│ ├── web_search.ts # SearXNG search tool (+ Firecrawl fallback)
|
|
227
|
+
│ ├── web_fetch.ts # Single-page scrapling fetcher (+ Firecrawl fallback)
|
|
201
228
|
│ ├── web_batch_fetch.ts # Parallel scrapling fetcher
|
|
202
|
-
│
|
|
229
|
+
│ ├── web_browse.ts # Interactive browser automation (agent-browser + Firecrawl fallback)
|
|
230
|
+
│ ├── firecrawl_search.ts # Firecrawl keyless search (escape hatch)
|
|
231
|
+
│ ├── firecrawl_scrape.ts # Firecrawl keyless single-page fetch (escape hatch)
|
|
232
|
+
│ └── firecrawl_interact.ts # Firecrawl keyless natural-language interaction (escape hatch)
|
|
203
233
|
├── test/
|
|
204
234
|
│ ├── agent-browser/ # agent-browser output parser regression tests
|
|
205
235
|
│ ├── content-preview/ # Content preview fixtures, baselines & snapshots
|
package/docs/adr/0001-firecrawl-keyless-cloud-fallback.md
|
@@ -0,0 +1,5 @@
|
|
|
1
|
+
# Firecrawl Keyless as an optional cloud fallback
|
|
2
|
+
|
|
3
|
+
pi-web-toolkit was local-first and self-hosted by design: SearXNG, scrapling, and agent-browser all run on the user's machine, and the README guaranteed "100% open-source. No required API keys or paid services." We decided to add **Firecrawl Keyless** as a strictly optional, fallback-only cloud layer: when a local backend errors out (or `web_search` returns nothing), the same tool transparently retries through the official `firecrawl-cli` in keyless mode, and three explicit `firecrawl_*` tools are exposed for capabilities the local backends lack.
|
|
4
|
+
|
|
5
|
+
This is hard to reverse once users and the agent come to rely on the fallback, surprising to a reader who assumes a local-only toolkit, and the result of a real trade-off (zero-config reliability vs. cloud egress, a privacy surface, and a third-party dependency). The fallback defaults **on**, is **keyless-only** (no API key, no signup, no stored credentials — the CLI is spawned under an isolated temp `HOME` with the key env stripped), and is **opt-out-able** via `PI_WEB_FIRECRAWL_FALLBACK=0`. We drive `firecrawl-cli` (an official Firecrawl client) rather than hand-rolling REST because Firecrawl only grants the free keyless tier to official clients, and we restrict it to the keyless endpoints (`/search`, `/scrape`, `/interact`); API-key mode, self-hosted URLs, OAuth, and non-keyless endpoints (`/map`, `/crawl`, `/batch/scrape`, etc.) are deliberately out of scope. The decision and the graceful-skip behavior (never leave the user worse off than the local tool already did) are encoded in the Firecrawl CLI wrapper module.
|
package/docs/guide.md
CHANGED
|
@@ -41,3 +41,26 @@ User asks about something external / current
|
|
|
41
41
|
| **Best for** | Articles, docs, blogs | SPAs, forms, pagination | Research synthesis |
|
|
42
42
|
|
|
43
43
|
`web_fetch` falls back to HTTP GET after a normal browser fetch fails, but not in stealthy mode. `web_batch_fetch` falls back to GET after failed browser fetches in all modes.
|
|
44
|
+
|
|
45
|
+
---
|
|
46
|
+
|
|
47
|
+
## Firecrawl Keyless fallback
|
|
48
|
+
|
|
49
|
+
When a local backend cannot do the job, the tools automatically retry through **Firecrawl Keyless** (1,000 free credits/month, no API key, no signup) before giving up. It is **fallback-only** — never the primary path — and is **opt-out-able** with `PI_WEB_FIRECRAWL_FALLBACK=0`. Requires the optional `firecrawl-cli` (`npm install -g firecrawl-cli`); if it is absent the tools simply surface the original local error.
|
|
50
|
+
|
|
51
|
+
| Tool | Falls back to Firecrawl when… |
|
|
52
|
+
|------|-------------------------------|
|
|
53
|
+
| `web_search` | SearXNG errors out **or** returns zero results |
|
|
54
|
+
| `web_fetch` | scrapling (incl. its HTTP-GET fallback) fails — anti-bot, heavy JS, PDFs |
|
|
55
|
+
| `web_browse` | agent-browser is missing or its batch fails (not on caller validation errors) |
|
|
56
|
+
| `web_batch_fetch` | (no fallback — Firecrawl batch scrape is not keyless) |
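The table above amounts to a small decision predicate. The following is a minimal sketch of that logic, assuming a simplified outcome model; the type names, the `shouldFallBack` function, and the choice to treat validation errors as non-retryable for every tool (the guide only states this for `web_browse`) are assumptions, not the package's real predicates.

```typescript
// Hypothetical outcome model for a local tool run (names are illustrative).
type LocalOutcome =
  | { status: "ok"; resultCount: number }
  | { status: "runtime-error" }
  | { status: "validation-error" };

type ToolName = "web_search" | "web_fetch" | "web_browse" | "web_batch_fetch";

function shouldFallBack(
  tool: ToolName,
  outcome: LocalOutcome,
  fallbackEnabled: boolean,
): boolean {
  // The opt-out switch wins over everything else.
  if (!fallbackEnabled) return false;
  // Batch fetch has no keyless equivalent, so it never falls back.
  if (tool === "web_batch_fetch") return false;
  // Assumption: caller mistakes are reported as-is for all tools.
  if (outcome.status === "validation-error") return false;
  // Any runtime failure of the local backend triggers the fallback.
  if (outcome.status === "runtime-error") return true;
  // A successful run only falls back when search found nothing.
  return tool === "web_search" && outcome.resultCount === 0;
}
```

Keeping the decision in one pure function like this makes it easy to cover with regression tests that need no network or CLI.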
|
|
57
|
+
|
|
58
|
+
The three `firecrawl_*` tools are the explicit escape hatches for capabilities the local backends lack (`github`/`research`/`pdf` search categories, cloud rendering, natural-language interaction).
|
|
59
|
+
|
|
60
|
+
**Graceful skip.** If the fallback itself cannot help — the CLI is missing, the IP is flagged as suspicious, the keyless quota is exhausted, or the fallback is disabled — the tool falls through to the original local-tool error so the user is never left worse off.
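The graceful-skip rule can be sketched as a pure function that decides what the caller finally sees. This is an illustration only: the failure-kind strings, the type names, and `surfaceResult` are hypothetical labels for the four cases named above, not identifiers from the wrapper module.

```typescript
// Hypothetical labels for the four skip cases listed in the guide.
type SkipKind = "cli-missing" | "ip-flagged" | "quota-exhausted" | "disabled";

type FallbackAttempt =
  | { ok: true; text: string }
  | { ok: false; kind: SkipKind; reason: string };

type Surfaced =
  | { ok: true; text: string; viaFirecrawl: true }
  | { ok: false; error: string };

function surfaceResult(localError: string, attempt: FallbackAttempt): Surfaced {
  if (attempt.ok) {
    // The fallback helped: return its content and mark where it came from.
    return { ok: true, text: attempt.text, viaFirecrawl: true };
  }
  // The fallback could not help: report the ORIGINAL local error, so the
  // user sees exactly what they would have seen without the fallback.
  return { ok: false, error: localError };
}
```

In this sketch the fallback's own failure reason is deliberately dropped from the surfaced error; a real implementation might still log it for diagnostics.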
|
|
61
|
+
|
|
62
|
+
**Credit budgeting.** Search ≈ 2 credits / 10 results, scrape ≈ 1 credit / page, interact ≈ 2 credits/min (code-only) or ≈ 7 credits/min (AI prompt). Results report `creditsUsed` where the source provides it. The fallback stays conservative (small limits) against the 1,000 credits/month allowance.
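The approximate rates above translate into simple arithmetic. The sketch below is a rough estimator under stated assumptions: the function and field names are hypothetical, the rates are the approximate figures quoted in this guide, and rounding search usage up to whole blocks of 10 results is a guess about how partial blocks are billed.

```typescript
// Approximate rates quoted in the guide; real billing may differ.
const MONTHLY_ALLOWANCE = 1000;

// Hypothetical usage record (field names are illustrative).
interface PlannedUsage {
  searchResults?: number;         // total search results requested
  scrapedPages?: number;          // pages scraped
  interactCodeMinutes?: number;   // interact minutes in code-only mode
  interactPromptMinutes?: number; // interact minutes with an AI prompt
}

function estimateCredits(usage: PlannedUsage): number {
  // ~2 credits per 10 results; assumption: partial blocks round up.
  const search = Math.ceil((usage.searchResults ?? 0) / 10) * 2;
  // ~1 credit per scraped page.
  const scrape = usage.scrapedPages ?? 0;
  // ~2 credits/min for code-only, ~7 credits/min for AI prompts.
  const interact =
    (usage.interactCodeMinutes ?? 0) * 2 +
    (usage.interactPromptMinutes ?? 0) * 7;
  return search + scrape + interact;
}

function remainingAfter(usage: PlannedUsage): number {
  // Credits left from the monthly keyless allowance after this usage.
  return MONTHLY_ALLOWANCE - estimateCredits(usage);
}
```

For example, 25 search results, 3 scraped pages, and 1 minute of code-only interaction come to 6 + 3 + 2 = 11 credits under these assumptions.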
|
|
63
|
+
|
|
64
|
+
**Privacy.** Firecrawl is a cloud service: when the fallback runs, the URL/query and page content leave the machine. Set `PI_WEB_FIRECRAWL_FALLBACK=0` to enforce a strict local-only, no-cloud-egress policy. The fallback is **keyless-only** — it never reads, stores, or sends an API key, and spawns the CLI under an isolated temporary `HOME`.
|
|
65
|
+
|
|
66
|
+
---
|
package/docs/tools.md
CHANGED
|
@@ -160,3 +160,65 @@ User: "Compare Python asyncio, Trio, and curio"
|
|
|
160
160
|
})
|
|
161
161
|
→ Agent synthesizes comparison from all 3 sources
|
|
162
162
|
```
|
|
163
|
+
|
|
164
|
+
---
|
|
165
|
+
|
|
166
|
+
## Firecrawl keyless tools (optional cloud escape hatches)
|
|
167
|
+
|
|
168
|
+
These three tools talk to [Firecrawl](https://www.firecrawl.dev) in **keyless** mode: 1,000 free credits/month, **no API key and no signup**. They require the optional `firecrawl-cli` (`npm install -g firecrawl-cli`). **Privacy:** the URL/query/page content is sent to Firecrawl's cloud.
|
|
169
|
+
|
|
170
|
+
They double as the implementation of the automatic fallback: `web_search`/`web_fetch`/`web_browse` retry through Firecrawl keyless when their local backend fails (or search returns nothing). Disable all Firecrawl usage with `PI_WEB_FIRECRAWL_FALLBACK=0`.
|
|
171
|
+
|
|
172
|
+
### `firecrawl_search`
|
|
173
|
+
|
|
174
|
+
Cloud web search via Firecrawl keyless, with capabilities the local SearXNG tool lacks.
|
|
175
|
+
|
|
176
|
+
```typescript
|
|
177
|
+
{
|
|
178
|
+
query: string,
|
|
179
|
+
limit?: number, // 1–100. Default 10
|
|
180
|
+
sources?: Array<"web"|"images"|"news">,
|
|
181
|
+
categories?: Array<"github"|"research"|"pdf">,
|
|
182
|
+
country?: string, // ISO code, e.g. "US", "DE"
|
|
183
|
+
tbs?: string, // qdr:h|d|w|m|y
|
|
184
|
+
location?: string,
|
|
185
|
+
includeDomains?: string[], // hostnames; folded into the query as site: operators
|
|
186
|
+
excludeDomains?: string[],
|
|
187
|
+
}
|
|
188
|
+
```
|
|
189
|
+
|
|
190
|
+
**When to use:** `web_search` failed or returned nothing; or you need `github`/`research`/`pdf` categories, images/news sources, or domain scoping that SearXNG does not provide.
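The schema comment says `includeDomains` is folded into the query as `site:` operators. A minimal sketch of what such folding might look like follows; the function name `foldDomainFilters`, the `OR` grouping for multiple included domains, and the `-site:` form for `excludeDomains` are assumptions, since the source only describes the `site:` folding for included domains.

```typescript
// Hypothetical query builder; only the "site:" folding of includeDomains
// is described in the docs, the rest is an illustrative assumption.
function foldDomainFilters(
  query: string,
  includeDomains: string[] = [],
  excludeDomains: string[] = [],
): string {
  const parts: string[] = [query.trim()];

  // Included hostnames become site: operators, OR-ed together.
  const include = includeDomains.map((d) => `site:${d}`).join(" OR ");
  if (include) {
    // Parenthesize only when there is more than one alternative.
    parts.push(includeDomains.length > 1 ? `(${include})` : include);
  }

  // Assumption: excluded hostnames become negated site: operators.
  const exclude = excludeDomains.map((d) => `-site:${d}`).join(" ");
  if (exclude) parts.push(exclude);

  return parts.join(" ");
}
```

Under these assumptions, `foldDomainFilters("asyncio", ["docs.python.org"])` yields `asyncio site:docs.python.org`.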
|
|
191
|
+
|
|
192
|
+
### `firecrawl_scrape`
|
|
193
|
+
|
|
194
|
+
Cloud single-page fetch via Firecrawl keyless (anti-bot bypass, JS rendering, PDF parsing).
|
|
195
|
+
|
|
196
|
+
```typescript
|
|
197
|
+
{
|
|
198
|
+
url: string,
|
|
199
|
+
waitFor?: number, // ms to wait for JS rendering
|
|
200
|
+
includeTags?: string[], // Firecrawl tag filter (not a CSS selector)
|
|
201
|
+
excludeTags?: string[],
|
|
202
|
+
onlyMainContent?: boolean, // Default: true
|
|
203
|
+
}
|
|
204
|
+
```
|
|
205
|
+
|
|
206
|
+
**When to use:** `web_fetch` failed on an anti-bot-protected, JavaScript-heavy, or PDF page.
|
|
207
|
+
|
|
208
|
+
### `firecrawl_interact`
|
|
209
|
+
|
|
210
|
+
Open a URL in a live Firecrawl browser session and drive it with a natural-language prompt (or code).
|
|
211
|
+
|
|
212
|
+
```typescript
|
|
213
|
+
{
|
|
214
|
+
url: string,
|
|
215
|
+
prompt?: string, // natural-language task (required unless code is set)
|
|
216
|
+
code?: string, // code to run in the browser sandbox
|
|
217
|
+
language?: "node"|"python"|"bash",
|
|
218
|
+
timeout?: number, // seconds (1–300)
|
|
219
|
+
}
|
|
220
|
+
```
|
|
221
|
+
|
|
222
|
+
**When to use:** `web_browse` cannot run (agent-browser missing / OS deps missing), or you want natural-language page interaction without hand-written CSS selectors. Write each prompt as a single, focused task.
|
|
223
|
+
|
|
224
|
+
---
|
package/extensions/firecrawl_interact.ts
|
@@ -0,0 +1,147 @@
|
|
|
1
|
+
/**
|
|
2
|
+
* Firecrawl Interact Extension — natural-language browser interaction (keyless)
|
|
3
|
+
*
|
|
4
|
+
* Provides a `firecrawl_interact` tool that scrapes a URL to start a live
|
|
5
|
+
* Firecrawl browser session, then drives the page with a natural-language
|
|
6
|
+
* prompt (or code) and returns the result. It is an escape hatch for
|
|
7
|
+
* interactive pages the local agent-browser tool cannot run (missing CLI,
|
|
8
|
+
* missing OS browser deps), and underpins the automatic `web_browse` fallback.
|
|
9
|
+
*
|
|
10
|
+
* Requires: `npm install -g firecrawl-cli` (optional; degrades gracefully).
|
|
11
|
+
* Privacy: the URL, page content, and prompt are sent to Firecrawl's cloud.
|
|
12
|
+
*/
|
|
13
|
+
|
|
14
|
+
import {
|
|
15
|
+
defineTool,
|
|
16
|
+
type ExtensionAPI,
|
|
17
|
+
formatSize,
|
|
18
|
+
DEFAULT_MAX_BYTES,
|
|
19
|
+
DEFAULT_MAX_LINES,
|
|
20
|
+
} from "@earendil-works/pi-coding-agent";
|
|
21
|
+
import { Text } from "@earendil-works/pi-tui";
|
|
22
|
+
import { Type, type Static } from "typebox";
|
|
23
|
+
import { StringEnum } from "@earendil-works/pi-ai";
|
|
24
|
+
import { interactKeyless } from "./utils/firecrawl";
|
|
25
|
+
import { writeWithFallback } from "./utils/output-sink";
|
|
26
|
+
import { abbreviateUrl, getErrorText, normalizeWhitespace } from "./utils/render-helpers";
|
|
27
|
+
|
|
28
|
+
export const FirecrawlInteractParamsSchema = Type.Object({
|
|
29
|
+
url: Type.String({ description: "Full URL to open and interact with" }),
|
|
30
|
+
prompt: Type.Optional(Type.String({ description: "Natural-language task for the AI agent (e.g. 'Click the pricing tab and return the price')" })),
|
|
31
|
+
code: Type.Optional(Type.String({ description: "Code to execute in the browser sandbox instead of a prompt" })),
|
|
32
|
+
language: Type.Optional(StringEnum(["node", "python", "bash"] as const)),
|
|
33
|
+
timeout: Type.Optional(Type.Integer({ description: "Timeout in seconds (1-300). Default: 30", minimum: 1, maximum: 300 })),
|
|
34
|
+
});
|
|
35
|
+
|
|
36
|
+
export type FirecrawlInteractInput = Static<typeof FirecrawlInteractParamsSchema>;
|
|
37
|
+
|
|
38
|
+
const firecrawlInteractTool = defineTool({
|
|
39
|
+
name: "firecrawl_interact",
|
|
40
|
+
label: "Firecrawl Interact",
|
|
41
|
+
description: [
|
|
42
|
+
"Open a URL in a live Firecrawl browser session and drive it with a natural-language",
|
|
43
|
+
"prompt (or code), returning the result. Keyless — no API key, no signup.",
|
|
44
|
+
"Use firecrawl_interact when the local web_browse cannot run, or when you want",
|
|
45
|
+
"natural-language page interaction without CSS selectors.",
|
|
46
|
+
"Privacy: the URL, page content, and prompt are sent to Firecrawl's cloud.",
|
|
47
|
+
`Output is truncated to ${DEFAULT_MAX_LINES} lines or ${formatSize(DEFAULT_MAX_BYTES)}; if truncated, full output is saved to a temp file.`,
|
|
48
|
+
].join(" "),
|
|
49
|
+
promptSnippet: "Drive a page via Firecrawl keyless (natural-language interaction)",
|
|
50
|
+
promptGuidelines: [
|
|
51
|
+
"Prefer web_browse first; reach for firecrawl_interact when web_browse can't run or you want NL interaction.",
|
|
52
|
+
"Write each prompt as a single, focused task; the session can be reused across calls.",
|
|
53
|
+
"Always pass the full URL including https://.",
|
|
54
|
+
],
|
|
55
|
+
parameters: FirecrawlInteractParamsSchema,
|
|
56
|
+
|
|
57
|
+
async execute(_toolCallId, params, signal) {
|
|
58
|
+
if (!params.prompt && !params.code) {
|
|
59
|
+
throw new Error("firecrawl_interact requires either a prompt or code.");
|
|
60
|
+
}
|
|
61
|
+
const out = await interactKeyless(
|
|
62
|
+
params.url,
|
|
63
|
+
{ prompt: params.prompt, code: params.code, language: params.language, timeout: params.timeout },
|
|
64
|
+
signal,
|
|
65
|
+
);
|
|
66
|
+
|
|
67
|
+
if (!out.ok) {
|
|
68
|
+
const reason = out.failure?.reason ?? "unknown error";
|
|
69
|
+
throw new Error(`Firecrawl interact failed (${out.failure?.kind}): ${reason}`);
|
|
70
|
+
}
|
|
71
|
+
|
|
72
|
+
const rawText = `Interacted: ${params.url}\n(via Firecrawl keyless${out.creditsUsed !== undefined ? `, ${out.creditsUsed} credits` : ""})\n${out.liveViewUrl ? `Live view: ${out.liveViewUrl}\n` : ""}\n---\n\n${out.output || "(no output)"}`;
|
|
73
|
+
const sink = await writeWithFallback(rawText, { tmpPrefix: "pi-firecrawl-interact-" });
|
|
74
|
+
const preview = (out.output || "").replace(/\s+/g, " ").trim().slice(0, 500);
|
|
75
|
+
|
|
76
|
+
return {
|
|
77
|
+
content: [{ type: "text", text: sink.text }],
|
|
78
|
+
details: {
|
|
79
|
+
url: params.url,
|
|
80
|
+
output: out.output,
|
|
81
|
+
preview,
|
|
82
|
+
fullOutputPath: sink.fullOutputPath,
|
|
83
|
+
liveViewUrl: out.liveViewUrl,
|
|
84
|
+
creditsUsed: out.creditsUsed,
|
|
85
|
+
viaFirecrawl: true,
|
|
86
|
+
},
|
|
87
|
+
};
|
|
88
|
+
},
|
|
89
|
+
|
|
90
|
+
renderCall(args, theme) {
|
|
91
|
+
let text = theme.fg("toolTitle", theme.bold("firecrawl_interact "));
|
|
92
|
+
text += theme.fg("muted", args.url);
|
|
93
|
+
if (args.prompt) text += theme.fg("dim", ` — ${args.prompt.slice(0, 60)}`);
|
|
94
|
+
return new Text(text, 0, 0);
|
|
95
|
+
},
|
|
96
|
+
|
|
97
|
+
renderResult(result, { expanded, isPartial }, theme, context) {
|
|
98
|
+
const isError = context?.isError ?? false;
|
|
99
|
+
|
|
100
|
+
if (isPartial) {
|
|
101
|
+
return new Text(theme.fg("warning", "Interacting via Firecrawl..."), 0, 0);
|
|
102
|
+
}
|
|
103
|
+
|
|
104
|
+
const details = result.details as {
|
|
105
|
+
url?: string;
|
|
106
|
+
output?: string;
|
|
107
|
+
preview?: string;
|
|
108
|
+
fullOutputPath?: string;
|
|
109
|
+
liveViewUrl?: string;
|
|
110
|
+
creditsUsed?: number;
|
|
111
|
+
} | undefined;
|
|
112
|
+
|
|
113
|
+
if (isError) {
|
|
114
|
+
const errText = getErrorText(result);
|
|
115
|
+
let text = theme.fg("error", "✗ Firecrawl interact failed");
|
|
116
|
+
if (details?.url) text += ` ${theme.fg("dim", abbreviateUrl(details.url))}`;
|
|
117
|
+
text += `\n\n ${theme.fg("toolOutput", errText)}`;
|
|
118
|
+
return new Text(text, 0, 0);
|
|
119
|
+
}
|
|
120
|
+
|
|
121
|
+
let text = theme.fg("success", "✓ Interacted");
|
|
122
|
+
text += theme.fg("accent", " [Firecrawl keyless]");
|
|
123
|
+
if (details?.url) text += ` ${theme.fg("dim", abbreviateUrl(details.url))}`;
|
|
124
|
+
if (details?.creditsUsed !== undefined) text += theme.fg("muted", ` ${details.creditsUsed} credits`);
|
|
125
|
+
|
|
126
|
+
if (!expanded && details?.preview) {
|
|
127
|
+
const snippet = normalizeWhitespace(details.preview);
|
|
128
|
+
const short = snippet.length > 160 ? snippet.slice(0, 160).replace(/\s+\S*$/, "") + "..." : snippet;
|
|
129
|
+
text += `\n\n ${theme.fg("muted", short)}`;
|
|
130
|
+
}
|
|
131
|
+
|
|
132
|
+
if (expanded) {
|
|
133
|
+
if (details?.output) {
|
|
134
|
+
text += `\n\n ${theme.fg("muted", normalizeWhitespace(details.output))}`;
|
|
135
|
+
}
|
|
136
|
+
if (details?.fullOutputPath) {
|
|
137
|
+
text += `\n\n${theme.fg("accent", `Full output: ${details.fullOutputPath}`)}`;
|
|
138
|
+
}
|
|
139
|
+
}
|
|
140
|
+
|
|
141
|
+
return new Text(text, 0, 0);
|
|
142
|
+
},
|
|
143
|
+
});
|
|
144
|
+
|
|
145
|
+
export default function (pi: ExtensionAPI) {
|
|
146
|
+
pi.registerTool(firecrawlInteractTool);
|
|
147
|
+
}
|
package/extensions/firecrawl_scrape.ts
|
@@ -0,0 +1,154 @@
|
|
|
1
|
+
/**
|
|
2
|
+
* Firecrawl Scrape Extension — single-page fetch via firecrawl-cli (keyless)
|
|
3
|
+
*
|
|
4
|
+
* Provides a `firecrawl_scrape` tool that fetches a single URL as clean
|
|
5
|
+
* markdown through the official Firecrawl CLI in keyless mode (no API key,
|
|
6
|
+
* no signup). It is an explicit escape hatch for hard targets the local
|
|
7
|
+
* scrapling fetcher cannot handle (anti-bot, heavy JS, PDFs), and also
|
|
8
|
+
* underpins the automatic `web_fetch` fallback.
|
|
9
|
+
*
|
|
10
|
+
* Requires: `npm install -g firecrawl-cli` (optional; the tool degrades
|
|
11
|
+
* gracefully and reports when the CLI is unavailable).
|
|
12
|
+
*
|
|
13
|
+
* Privacy: the URL and page content are sent to Firecrawl's cloud.
|
|
14
|
+
*/
|
|
15
|
+
|
|
16
|
+
import {
|
|
17
|
+
defineTool,
|
|
18
|
+
type ExtensionAPI,
|
|
19
|
+
formatSize,
|
|
20
|
+
DEFAULT_MAX_BYTES,
|
|
21
|
+
DEFAULT_MAX_LINES,
|
|
22
|
+
} from "@earendil-works/pi-coding-agent";
|
|
23
|
+
import { Text } from "@earendil-works/pi-tui";
|
|
24
|
+
import { Type, type Static } from "typebox";
|
|
25
|
+
import { scrapeKeyless } from "./utils/firecrawl";
|
|
26
|
+
import { extractPreview } from "./utils/content-preview";
|
|
27
|
+
import { writeWithFallback } from "./utils/output-sink";
|
|
28
|
+
import { abbreviateUrl, getErrorText, normalizeWhitespace, formatExtraction } from "./utils/render-helpers";
|
|
29
|
+
|
|
30
|
+
export const FirecrawlScrapeParamsSchema = Type.Object({
|
|
31
|
+
url: Type.String({ description: "Full URL to fetch (e.g. https://example.com/article)" }),
|
|
32
|
+
waitFor: Type.Optional(Type.Integer({ description: "Wait (ms) before scraping for JS-rendered content", minimum: 0 })),
|
|
33
|
+
includeTags: Type.Optional(Type.Array(Type.String(), { description: "HTML tags to include (Firecrawl tag filter, not a CSS selector)" })),
|
|
34
|
+
excludeTags: Type.Optional(Type.Array(Type.String(), { description: "HTML tags to exclude" })),
|
|
35
|
+
onlyMainContent: Type.Optional(Type.Boolean({ description: "Extract only main content (drop nav/footer). Default: true", default: true })),
|
|
36
|
+
});
|
|
37
|
+
|
|
38
|
+
export type FirecrawlScrapeInput = Static<typeof FirecrawlScrapeParamsSchema>;
|
|
39
|
+
|
|
40
|
+
const firecrawlScrapeTool = defineTool({
|
|
41
|
+
name: "firecrawl_scrape",
|
|
42
|
+
label: "Firecrawl Scrape",
|
|
43
|
+
description: [
|
|
44
|
+
"Fetch a single page as clean markdown via Firecrawl (keyless — no API key, no signup).",
|
|
45
|
+
"Use firecrawl_scrape when the local web_fetch fails on a hard target (anti-bot,",
|
|
46
|
+
"JavaScript-heavy pages, PDFs) or when you need Firecrawl's cloud rendering directly.",
|
|
47
|
+
"Privacy: the URL and page content are sent to Firecrawl's cloud.",
|
|
48
|
+
`Output is truncated to ${DEFAULT_MAX_LINES} lines or ${formatSize(DEFAULT_MAX_BYTES)}; if truncated, full output is saved to a temp file.`,
|
|
49
|
+
].join(" "),
|
|
50
|
+
promptSnippet: "Fetch a single page via Firecrawl keyless (anti-bot / JS / PDF fallback)",
|
|
51
|
+
promptGuidelines: [
|
|
52
|
+
"Prefer web_fetch first; reach for firecrawl_scrape when web_fetch fails or you need cloud rendering.",
|
|
53
|
+
"firecrawl_scrape handles anti-bot protection, JS-heavy SPAs, and PDFs that scrapling may miss.",
|
|
54
|
+
"Always pass the full URL including https://.",
|
|
55
|
+
],
|
|
56
|
+
parameters: FirecrawlScrapeParamsSchema,
|
|
57
|
+
|
|
58
|
+
async execute(_toolCallId, params, signal) {
|
|
59
|
+
const out = await scrapeKeyless(params.url, {
|
|
60
|
+
waitFor: params.waitFor,
|
|
61
|
+
includeTags: params.includeTags,
|
|
62
|
+
excludeTags: params.excludeTags,
|
|
63
|
+
onlyMainContent: params.onlyMainContent,
|
|
64
|
+
}, signal);
|
|
65
|
+
|
|
66
|
+
if (!out.ok) {
|
|
67
|
+
const reason = out.failure?.reason ?? "unknown error";
|
|
68
|
+
throw new Error(`Firecrawl scrape failed (${out.failure?.kind}): ${reason}`);
|
|
69
|
+
}
|
|
70
|
+
|
|
71
|
+
const preview = extractPreview(out.content, 500);
|
|
72
|
+
const rawText = `Fetched: ${params.url}\n(via Firecrawl keyless${out.creditsUsed !== undefined ? `, ${out.creditsUsed} credits` : ""})\nSize: ${out.bytes} bytes\n\n---\n\n${out.content}`;
|
|
73
|
+
const sink = await writeWithFallback(rawText, { tmpPrefix: "pi-firecrawl-scrape-full-" });
|
|
74
|
+
|
|
75
|
+
return {
|
|
76
|
+
content: [{ type: "text", text: sink.text }],
|
|
77
|
+
details: {
|
|
78
|
+
url: params.url,
|
|
79
|
+
bytes: out.bytes,
|
|
80
|
+
fullOutputPath: sink.fullOutputPath,
|
|
81
|
+
preview,
|
|
82
|
+
title: out.title,
|
|
83
|
+
creditsUsed: out.creditsUsed,
|
|
84
|
+
viaFirecrawl: true,
|
|
85
|
+
},
|
|
86
|
+
};
|
|
87
|
+
},
|
|
88
|
+
|
|
89
|
+
renderCall(args, theme) {
|
|
90
|
+
let text = theme.fg("toolTitle", theme.bold("firecrawl_scrape "));
|
|
91
|
+
text += theme.fg("muted", args.url);
|
|
92
|
+
if (args.waitFor) {
|
|
93
|
+
text += theme.fg("dim", ` [wait=${args.waitFor}]`);
|
|
94
|
+
}
|
|
95
|
+
return new Text(text, 0, 0);
|
|
96
|
+
},
|
|
97
|
+
|
|
98
|
+
renderResult(result, { expanded, isPartial }, theme, context) {
|
|
99
|
+
const isError = context?.isError ?? false;
|
|
100
|
+
|
|
101
|
+
if (isPartial) {
|
|
102
|
+
return new Text(theme.fg("warning", "Scraping via Firecrawl..."), 0, 0);
|
|
103
|
+
}
|
|
104
|
+
|
|
105
|
+
const details = result.details as {
|
|
106
|
+
url?: string;
|
|
107
|
+
bytes?: number;
|
|
108
|
+
fullOutputPath?: string;
|
|
109
|
+
preview?: string;
|
|
110
|
+
title?: string;
|
|
111
|
+
creditsUsed?: number;
|
|
112
|
+
} | undefined;
|
|
113
|
+
|
|
114
|
+
if (isError) {
|
|
115
|
+
const errText = getErrorText(result);
|
|
116
|
+
let text = theme.fg("error", "✗ Firecrawl scrape failed");
|
|
117
|
+
if (details?.url) text += ` ${theme.fg("dim", abbreviateUrl(details.url))}`;
|
|
118
|
+
text += `\n\n ${theme.fg("toolOutput", errText)}`;
|
|
119
|
+
return new Text(text, 0, 0);
|
|
120
|
+
}
|
|
121
|
+
|
|
122
|
+
let text = theme.fg("success", "✓ Fetched");
|
|
123
|
+
text += theme.fg("accent", " [Firecrawl keyless]");
|
|
124
|
+
if (details?.title) {
|
|
125
|
+
text += ` ${theme.fg("toolTitle", details.title)}`;
|
|
126
|
+
} else if (details?.url) {
|
|
127
|
+
text += ` ${theme.fg("dim", abbreviateUrl(details.url))}`;
|
|
128
|
+
}
|
|
129
|
+
if (details?.bytes && details?.preview) {
|
|
130
|
+
text += ` ${theme.fg("muted", formatExtraction(details.bytes, details.preview.length))}`;
|
|
131
|
+
}
|
|
132
|
+
|
|
133
|
+
if (!expanded && details?.preview) {
|
|
134
|
+
const snippet = normalizeWhitespace(details.preview);
|
|
135
|
+
const short = snippet.length > 160 ? snippet.slice(0, 160).replace(/\s+\S*$/, "") + "..." : snippet;
|
|
136
|
+
text += `\n\n ${theme.fg("muted", short)}`;
|
|
137
|
+
}
|
|
138
|
+
|
|
139
|
+
if (expanded) {
|
|
140
|
+
if (details?.preview) {
|
|
141
|
+
text += `\n\n ${theme.fg("muted", normalizeWhitespace(details.preview))}`;
|
|
142
|
+
}
|
|
143
|
+
if (details?.fullOutputPath) {
|
|
144
|
+
text += `\n\n${theme.fg("accent", `Full output: ${details.fullOutputPath}`)}`;
|
|
145
|
+
}
|
|
146
|
+
}
|
|
147
|
+
|
|
148
|
+
return new Text(text, 0, 0);
|
|
149
|
+
},
|
|
150
|
+
});
|
|
151
|
+
|
|
152
|
+
export default function (pi: ExtensionAPI) {
|
|
153
|
+
pi.registerTool(firecrawlScrapeTool);
|
|
154
|
+
}
|