pi-smart-fetch 0.2.22 → 0.2.24

Files changed (2)
  1. package/README.md +60 -197
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -1,30 +1,33 @@
1
1
  # pi-smart-fetch
2
2
 
3
- `pi-smart-fetch` adds adaptive, agent-friendly web fetching tools to pi.dev.
3
+ `pi-smart-fetch` adds smarter web fetching tools to pi.dev.
4
4
 
5
- ![pi Smart Fetch](demo.gif)
6
-
7
- Registers 2 tools:
8
- - `web_fetch`
9
- - `batch_web_fetch`
5
+ ![pi Smart Fetch](https://raw.githubusercontent.com/Thinkscape/agent-smart-fetch/main/packages/pi-smart-fetch/demo.gif)
10
6
 
11
7
  ## Features
12
8
 
13
- Compared with naive Node.js `fetch()`, this package gives you:
14
- - **browser-like transport fingerprints** via Thinkscape's maintained `@thinkscape/wreq-js` fork, which helps on sites that inspect TLS and HTTP client behavior
15
- - **clean readable extraction** via `Defuddle`, so agents get article content instead of raw noisy HTML
16
- - **better success on bot-defended pages** where plain server-side requests are blocked, challenged, or degraded
17
- - **useful metadata** like title, author, published date, site, and language when available
18
- - **multiple output formats**: `markdown`, `html`, `text`, or `json`
19
- - **single and batch tools**: `web_fetch` for one URL, `batch_web_fetch` for many
20
- - **pi-specific behavior** including full metadata for agents, a compact history preview for users, and defaults from pi settings
21
- - **bounded batch fan-out** with a configurable default concurrency of `8`
22
- - **a richer pi TUI for batch mode** with per-item rows, truncated URLs, statuses, small progress bars, and timer-driven spinner animation
23
- - **attachment and binary download support** when a response is an attachment or non-text payload
24
- - **temp-file output** with sanitized filenames plus returned file metadata (`URL`, `File size`, `Mime type`, `File path`)
25
- - **publish-ready packaging/test workflow** across the monorepo for safer releases
26
- - **lower overhead than browser automation** when you do not need JS execution, login, scrolling, or clicks
27
- - **clear limits**: it does not execute JavaScript or solve interactive anti-bot flows
9
+ - 🔐 **Browser-like TLS/SSL + HTTP fingerprints** - better success on bot-defended pages
10
+ - 🧹 **Defuddle extraction** - clean readable content instead of noisy HTML
11
+ - 🧠 **Useful metadata** - title, author, site, language, published date when available
12
+ - 📦 **Downloads + large file support** - stream attachments and binaries to temp files
13
+ - ⚡ **Batch fetch** - fetch many URLs with bounded concurrency
14
+ - 📝 **Multiple output formats** - `markdown`, `html`, `text`, `json`
15
+
16
+ ## Site optimisations
17
+
18
+ This package works on general web pages, but some site types benefit especially from Defuddle's extractors and cleanup:
19
+
20
+ - YouTube pages and transcripts
21
+ - Reddit posts and comment threads
22
+ - X / Twitter posts
23
+ - GitHub pages, issues, PRs, and discussions
24
+ - Hacker News threads
25
+ - Substack posts
26
+ - Pages with code blocks, footnotes, math, and callouts
27
+
28
+ Notes:
29
+ - Defuddle is the cleanup layer: it strips common page chrome like nav, sidebars, related links, share widgets, and footers
30
+ - It does **not** execute JavaScript or solve interactive anti-bot/login flows
28
31
 
29
32
  ## Install
30
33
 
@@ -41,163 +44,38 @@ gh repo clone Thinkscape/agent-smart-fetch
41
44
  pi install agent-smart-fetch/packages/pi-smart-fetch
42
45
  ```
43
46
 
44
- ## Use cases
47
+ ## Pi tools
45
48
 
46
- Use `web_fetch` when you want to:
47
- - fetch one article, doc page, or blog post with a browser-like network fingerprint
48
- - analyze readable content instead of raw HTML
49
- - reduce agent token waste on noisy page chrome
50
- - get author/title/published metadata when available
51
- - work around pages that reject ordinary server-side fetches
52
-
53
- Use `batch_web_fetch` when you want to:
54
- - fetch multiple URLs in one tool call
55
- - preserve a clear mapping between each input URL and its result
56
- - let pi show per-item progress while the batch runs
57
- - collect a mix of successes and failures without losing per-item errors
49
+ Registers:
50
+ - `web_fetch`
51
+ - `batch_web_fetch`
58
52
 
59
- ## Tool synopsis
53
+ Synopsis:
60
54
 
61
55
  ```text
62
- web_fetch(url, browser?, os?, headers?, maxChars?, format?, removeImages?, includeReplies?, proxy?, verbose?)
56
+ web_fetch(url, browser?, os?, headers?, maxChars?, timeoutMs?, format?, removeImages?, includeReplies?, proxy?, verbose?)
63
57
  batch_web_fetch(requests, verbose?)
64
58
  ```
65
59
 
66
- For `batch_web_fetch`, `requests` is an array of objects, and **each item accepts the same parameters as `web_fetch` except `verbose`**.
67
-
68
- ## Output behavior
69
-
70
- ### `web_fetch`
71
-
72
- Agent-facing tool output always includes the full non-empty metadata header plus the extracted document body.
73
-
74
- In the pi TUI backlog/history preview, user-facing metadata is intentionally brief and only shows:
75
- - Title
76
- - Published
77
-
78
- The duplicated `URL:` line is hidden from the preview because the tool call line already shows the URL.
79
-
80
- The optional `verbose` flag is retained for compatibility, but pi now always returns the full metadata header to the agent.
81
-
82
- ### `batch_web_fetch`
83
-
84
- Batch output:
85
- - starts with a batch summary (`Requests`, `Succeeded`, `Failed`, `Concurrency`)
86
- - keeps results in input order
87
- - labels each item with its ordinal and URL
88
- - includes full content for successful items
89
- - includes a bot-friendly `Error:` line for failed items
90
-
91
- In the pi TUI, batch mode also streams per-item progress rows showing:
92
- - a small spinner/check/error glyph
93
- - a truncated URL
94
- - a one-word status (`queued`, `fetching`, `extracting`, `done`, `error`)
95
- - a small progress bar
96
-
97
- ## Example tool outputs
98
-
99
- ### Agent-facing `web_fetch` output
100
-
101
- ```text
102
- > URL: https://example.com/blog/some-article
103
- > Title: Some Article
104
- > Author: Jane Doe
105
- > Published: 2026-03-12
106
- > Site: Example Blog
107
- > Language: en
108
- > Words: 1284
109
- > Browser: chrome_145/windows
110
-
111
- # Some Article
112
-
113
- This is the cleaned readable content extracted from the page.
114
- It includes the body plus the full metadata header available to the agent.
115
- ```
116
-
117
- ### pi history/backlog preview for `web_fetch`
118
-
119
- ```text
120
- web_fetch https://example.com/blog/some-article
121
- Title: Some Article
122
- Published: 2026-03-12
123
-
124
- # Some Article
125
-
126
- This is the cleaned readable content extracted from the page.
127
- ... (more lines, Ctrl+O to expand)
128
- ```
129
-
130
- ### Attachment/binary `web_fetch` output
131
-
132
- ```text
133
- > URL: https://example.com/download/report
134
- > File size: 999999
135
- > Mime type: application/pdf
136
- > File path: /absolute/path/to/temp/report.pdf
137
- ```
138
-
139
- ### `batch_web_fetch` output
140
-
141
- ```text
142
- > Requests: 2
143
- > Succeeded: 1
144
- > Failed: 1
145
- > Concurrency: 8
146
-
147
- ## [1/2] https://example.com/blog/some-article
148
- > URL: https://example.com/blog/some-article
149
- > Title: Some Article
150
- > Author: Jane Doe
151
- > Published: 2026-03-12
152
-
153
- # Some Article
154
-
155
- This is the cleaned readable content extracted from the page.
156
-
157
- ## [2/2] https://blocked.example/post
158
- > URL: https://blocked.example/post
159
- > Status: error
160
- > Error: HTTP 403 Forbidden for https://blocked.example/post
161
- ```
162
-
163
- ### Error output
164
-
165
- ```text
166
- Error: Invalid URL: not-a-url
167
- ```
168
-
169
- ## Parameters
170
-
171
- ### `web_fetch`
172
-
173
- | Parameter | Type | Default | Description |
174
- |-------------------|-------------------------------|-----------------|------------------------------------------------------------------------------|
175
- | `url` | string | required | URL to fetch |
176
- | `browser` | string | `chrome_145` | Browser profile used for transport fingerprinting |
177
- | `os` | string | `windows` | OS profile: `windows`, `macos`, `linux`, `android`, `ios` |
178
- | `headers` | object | auto | Extra request headers |
179
- | `maxChars` | number | `50000` | Maximum returned characters. Can be overridden by pi settings |
180
- | `format` | `markdown` \| `html` \| `text` \| `json` | `markdown` | Output format |
181
- | `removeImages` | boolean | `false` | Strip image references from output |
182
- | `includeReplies` | boolean \| `extractors` | `extractors` | Include replies/comments |
183
- | `proxy` | string | none | Proxy URL |
184
- | `verbose` | boolean | `false` | Compatibility flag. pi currently returns the full metadata header to the agent regardless; user history preview stays compact |
60
+ For `batch_web_fetch`, each item in `requests` accepts the same parameters as `web_fetch` except `verbose`.
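Reviewer note: the relationship between the two tool signatures can be sketched as TypeScript types. This is a hypothetical illustration, not the package's actual schema. The field names come from the synopsis in this hunk; the types are assumptions carried over from the 0.2.22 parameter table that this release removes, and the `number` type for `timeoutMs` is a guess.

```typescript
// Hypothetical parameter shapes. Field names follow the README synopsis;
// the types are assumptions based on the 0.2.22 parameter table.
type OutputFormat = "markdown" | "html" | "text" | "json";

interface WebFetchParams {
  url: string;
  browser?: string;
  os?: "windows" | "macos" | "linux" | "android" | "ios";
  headers?: Record<string, string>;
  maxChars?: number;
  timeoutMs?: number; // new in the 0.2.24 synopsis; the number type is assumed
  format?: OutputFormat;
  removeImages?: boolean;
  includeReplies?: boolean | "extractors";
  proxy?: string;
  verbose?: boolean;
}

interface BatchWebFetchParams {
  // Each item takes the web_fetch parameters except `verbose`.
  requests: Omit<WebFetchParams, "verbose">[];
  verbose?: boolean;
}

// Example batch payload using the placeholder URLs from the old README.
const batch: BatchWebFetchParams = {
  requests: [
    { url: "https://example.com/blog/some-article", format: "markdown" },
    { url: "https://example.com/download/report", timeoutMs: 30000 },
  ],
};
```

With this shape, putting `verbose` on an individual request item would be rejected at compile time, which mirrors the "except `verbose`" rule stated in the README.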
185
61
 
186
- ### `batch_web_fetch`
62
+ ## Output formats
187
63
 
188
- | Parameter | Type | Default | Description |
189
- |-------------|---------------------|-----------|-------------|
190
- | `requests` | array of objects | required | Array of fetch requests. Each item accepts the same parameters as `web_fetch` except `verbose` |
191
- | `verbose` | boolean | `false` | Compatibility flag. pi currently returns the full metadata header for successful results regardless |
64
+ | Format | What you get |
65
+ |---|---|
66
+ | `markdown` | Best default for readable page content |
67
+ | `html` | Cleaned HTML output |
68
+ | `text` | Plain text with markdown stripped |
69
+ | `json` | Structured JSON for metadata-heavy workflows |
192
70
 
193
- ## pi settings
71
+ ## Global defaults
194
72
 
195
- Optional custom settings in `~/.pi/agent/settings.json` or `.pi/settings.json`:
73
+ Optional settings in `~/.pi/agent/settings.json` or `.pi/settings.json`:
196
74
 
197
75
  ```json
198
76
  {
199
77
  "smartFetchVerboseByDefault": false,
200
- "smartFetchDefaultMaxChars": 12000,
78
+ "smartFetchDefaultMaxChars": 50000,
201
79
  "smartFetchDefaultTimeoutMs": 15000,
202
80
  "smartFetchDefaultBrowser": "chrome_145",
203
81
  "smartFetchDefaultOs": "windows",
@@ -208,37 +86,22 @@ Optional custom settings in `~/.pi/agent/settings.json` or `.pi/settings.json`:
208
86
  }
209
87
  ```
210
88
 
211
- Behavior:
212
- - `smartFetchVerboseByDefault` sets the stored default for the compatibility `verbose` flag
213
- - `smartFetchDefaultMaxChars` sets the runtime default for `maxChars`
214
- - `smartFetchDefaultTimeoutMs` sets the runtime request timeout
215
- - `smartFetchDefaultBrowser` sets the default browser fingerprint profile
216
- - `smartFetchDefaultOs` sets the default OS fingerprint profile
217
- - `smartFetchDefaultRemoveImages` sets the default for image stripping
218
- - `smartFetchDefaultIncludeReplies` sets the default replies/comments behavior
219
- - `smartFetchDefaultBatchConcurrency` sets the default bounded concurrency for `batch_web_fetch`
220
- - `smartFetchTempDir` sets the base temp directory used for attachments and binary downloads
221
- - project `.pi/settings.json` overrides global `~/.pi/agent/settings.json`
222
-
223
- Legacy aliases still supported:
224
- - `webFetchVerboseByDefault`
225
- - `webFetchDefaultMaxChars`
226
- - `webFetchDefaultBatchConcurrency`
227
- - `webFetchTempDir`
228
-
229
- ## When not to use it
230
-
231
- Do not use these tools when:
232
- - the site requires JS rendering
233
- - you need login/session flows
234
- - you need to click, scroll, or submit forms
235
- - you need a fully interactive browser session
236
-
237
- In those cases, switch to browser automation.
238
-
239
- ## Recent feature additions reflected here
240
-
241
- Recent `feat:` work added:
242
- - publish-ready TS tooling, tests, and packaging checks
243
- - timer-driven spinner animation for batch progress in the pi TUI
244
- - attachment and binary streaming into temp files with sanitized output paths
89
+ | Setting | Default | Description |
90
+ |---|---:|---|
91
+ | `smartFetchVerboseByDefault` | `false` | Stored default for the compatibility `verbose` flag |
92
+ | `smartFetchDefaultMaxChars` | `50000` | Default `maxChars` limit |
93
+ | `smartFetchDefaultTimeoutMs` | `15000` | Default request timeout in milliseconds |
94
+ | `smartFetchDefaultBrowser` | `chrome_145` | Default browser fingerprint profile |
95
+ | `smartFetchDefaultOs` | `windows` | Default OS fingerprint profile |
96
+ | `smartFetchDefaultRemoveImages` | `false` | Strip image references by default |
97
+ | `smartFetchDefaultIncludeReplies` | `extractors` | Include replies/comments only when site extractors support them |
98
+ | `smartFetchDefaultBatchConcurrency` | `8` | Default bounded concurrency for `batch_web_fetch` |
99
+ | `smartFetchTempDir` | OS temp dir | Base directory for attachment and binary downloads |
100
+
101
+ Notes:
102
+ - Project `.pi/settings.json` overrides global `~/.pi/agent/settings.json`
103
+ - Legacy `webFetch*` aliases are still supported
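Reviewer note: the precedence rules in these notes can be sketched as a small merge function. This is a minimal sketch under stated assumptions, not the extension's real loader. `resolveSettings`, `normalize`, and `LEGACY_ALIASES` are hypothetical names; the alias list is the one enumerated in the 0.2.22 README, and the rule that a current `smartFetch*` key beats its legacy alias within the same file is an assumption.

```typescript
type Settings = Record<string, unknown>;

// Legacy aliases listed in the 0.2.22 README; the mapping targets are assumed.
const LEGACY_ALIASES: Record<string, string> = {
  webFetchVerboseByDefault: "smartFetchVerboseByDefault",
  webFetchDefaultMaxChars: "smartFetchDefaultMaxChars",
  webFetchDefaultBatchConcurrency: "smartFetchDefaultBatchConcurrency",
  webFetchTempDir: "smartFetchTempDir",
};

// Defaults copied from the settings table (smartFetchTempDir is left unset here).
const DEFAULTS: Settings = {
  smartFetchVerboseByDefault: false,
  smartFetchDefaultMaxChars: 50000,
  smartFetchDefaultTimeoutMs: 15000,
  smartFetchDefaultBrowser: "chrome_145",
  smartFetchDefaultOs: "windows",
  smartFetchDefaultRemoveImages: false,
  smartFetchDefaultIncludeReplies: "extractors",
  smartFetchDefaultBatchConcurrency: 8,
};

// Rename legacy keys, then let current-name keys from the same file win (assumption).
function normalize(raw: Settings): Settings {
  const out: Settings = {};
  for (const [key, value] of Object.entries(raw)) {
    const target = LEGACY_ALIASES[key];
    if (target !== undefined) out[target] = value;
  }
  for (const [key, value] of Object.entries(raw)) {
    if (key.startsWith("smartFetch")) out[key] = value;
  }
  return out;
}

// Later spreads win: defaults, then global settings, then project settings.
function resolveSettings(globalRaw: Settings, projectRaw: Settings): Settings {
  return { ...DEFAULTS, ...normalize(globalRaw), ...normalize(projectRaw) };
}
```

For example, a global file that sets the legacy `webFetchDefaultMaxChars` and a project file that sets `smartFetchDefaultMaxChars` would resolve to the project value, while untouched keys keep the table defaults.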
104
+
105
+ ## Dev and publishing note
106
+
107
+ This repo uses Bun for local development, tests, and workspace scripts. Package publishing still goes through `npm publish` in CI so npm Trusted Publishing can be used.
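Reviewer note: the README describes `batch_web_fetch` as fetching many URLs with bounded concurrency (default `8`), and the 0.2.22 text said results keep input order and preserve per-item errors. One common way to get that behavior is a fixed pool of worker lanes pulling from a shared index. The sketch below illustrates that general technique only; `mapWithConcurrency` and `BatchResult` are hypothetical names, and nothing here is taken from the package source.

```typescript
// Per-item outcome: failures are captured as data instead of rejecting the whole batch.
type BatchResult<T> =
  | { ok: true; value: T }
  | { ok: false; error: string };

async function mapWithConcurrency<I, O>(
  items: readonly I[],
  limit: number,
  worker: (item: I, index: number) => Promise<O>,
): Promise<BatchResult<O>[]> {
  const results: BatchResult<O>[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claimed synchronously, so lanes never share an item
      try {
        results[index] = { ok: true, value: await worker(items[index], index) };
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        results[index] = { ok: false, error: message };
      }
    }
  }

  // Never start more lanes than there are items, and always start at least one.
  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results; // slots are written by index, so input order is preserved
}
```

Writing each result into its own slot keeps the output aligned with the input order regardless of which request finishes first, and a failing item only affects its own entry.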
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "pi-smart-fetch",
3
- "version": "0.2.22",
3
+ "version": "0.2.24",
4
4
  "type": "module",
5
5
  "description": "pi.dev smart fetch extension with browser-grade TLS fingerprinting and Defuddle extraction.",
6
6
  "license": "MIT",