npm - pi-smart-fetch - Versions diffs - 0.2.22 → 0.2.24 - Mend

pi-smart-fetch 0.2.22 → 0.2.24

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (2) hide show

package/README.md +60 -197
package/package.json +1 -1

package/README.md CHANGED Viewed

@@ -1,30 +1,33 @@
 # pi-smart-fetch
-`pi-smart-fetch` adds adaptive, agent-friendly web fetching tools to pi.dev.
+`pi-smart-fetch` adds smarter web fetching tools to pi.dev.
-![pi Smart Fetch](demo.gif)
-Registers 2 tools:
-- `web_fetch`
-- `batch_web_fetch`
+![pi Smart Fetch](https://raw.githubusercontent.com/Thinkscape/agent-smart-fetch/main/packages/pi-smart-fetch/demo.gif)
 ## Features
-Compared with naive Node.js `fetch()`, this package gives you:
-- **browser-like transport fingerprints** via Thinkscape's maintained `@thinkscape/wreq-js` fork, which helps on sites that inspect TLS and HTTP client behavior
-- **clean readable extraction** via `Defuddle`, so agents get article content instead of raw noisy HTML
-- **better success on bot-defended pages** where plain server-side requests are blocked, challenged, or degraded
-- **useful metadata** like title, author, published date, site, and language when available
-- **multiple output formats**: `markdown`, `html`, `text`, or `json`
-- **single and batch tools**: `web_fetch` for one URL, `batch_web_fetch` for many
-- **pi-specific behavior** including full metadata for agents, a compact history preview for users, and defaults from pi settings
-- **bounded batch fan-out** with a configurable default concurrency of `8`
-- **a richer pi TUI for batch mode** with per-item rows, truncated URLs, statuses, small progress bars, and timer-driven spinner animation
-- **attachment and binary download support** when a response is an attachment or non-text payload
-- **temp-file output** with sanitized filenames plus returned file metadata (`URL`, `File size`, `Mime type`, `File path`)
-- **publish-ready packaging/test workflow** across the monorepo for safer releases
-- **lower overhead than browser automation** when you do not need JS execution, login, scrolling, or clicks
-- **clear limits**: it does not execute JavaScript or solve interactive anti-bot flows
+- 🔐 **Browser-like TLS/SSL + HTTP fingerprints** — better success on bot-defended pages
+- 🧹 **Defuddle extraction** — clean readable content instead of noisy HTML
+- 🧠 **Useful metadata** — title, author, site, language, published date when available
+- 📦 **Downloads + large file support** — stream attachments and binaries to temp files
+- ⚡ **Batch fetch** — fetch many URLs with bounded concurrency
+- 📝 **Multiple output formats** — `markdown`, `html`, `text`, `json`
+## Site optimisations
+This package works on general web pages, but some site types benefit especially from Defuddle's extractors and cleanup:
+- YouTube pages and transcripts
+- Reddit posts and comment threads
+- X / Twitter posts
+- GitHub pages, issues, PRs, and discussions
+- Hacker News threads
+- Substack posts
+- Pages with code blocks, footnotes, math, and callouts
+Notes:
+- Defuddle is the cleanup layer: it strips common page chrome like nav, sidebars, related links, share widgets, and footers
+- It does **not** execute JavaScript or solve interactive anti-bot/login flows
 ## Install
@@ -41,163 +44,38 @@ gh repo clone Thinkscape/agent-smart-fetch
 pi install agent-smart-fetch/packages/pi-smart-fetch
 ```
-## Use cases
+## Pi tools
-Use `web_fetch` when you want to:
-- fetch one article, doc page, or blog post with a browser-like network fingerprint
-- analyze readable content instead of raw HTML
-- reduce agent token waste on noisy page chrome
-- get author/title/published metadata when available
-- work around pages that reject ordinary server-side fetches
-Use `batch_web_fetch` when you want to:
-- fetch multiple URLs in one tool call
-- preserve a clear mapping between each input URL and its result
-- let pi show per-item progress while the batch runs
-- collect a mix of successes and failures without losing per-item errors
+Registers:
+- `web_fetch`
+- `batch_web_fetch`
-## Tool synopsis
+Synopsis:
 ```text
-web_fetch(url, browser?, os?, headers?, maxChars?, format?, removeImages?, includeReplies?, proxy?, verbose?)
+web_fetch(url, browser?, os?, headers?, maxChars?, timeoutMs?, format?, removeImages?, includeReplies?, proxy?, verbose?)
 batch_web_fetch(requests, verbose?)
 ```
-For `batch_web_fetch`, `requests` is an array of objects, and **each item accepts the same parameters as `web_fetch` except `verbose`**.
-## Output behavior
-### `web_fetch`
-Agent-facing tool output always includes the full non-empty metadata header plus the extracted document body.
-In the pi TUI backlog/history preview, user-facing metadata is intentionally brief and only shows:
-- Title
-- Published
-The duplicated `URL:` line is hidden from the preview because the tool call line already shows the URL.
-The optional `verbose` flag is retained for compatibility, but pi now always returns the full metadata header to the agent.
-### `batch_web_fetch`
-Batch output:
-- starts with a batch summary (`Requests`, `Succeeded`, `Failed`, `Concurrency`)
-- keeps results in input order
-- labels each item with its ordinal and URL
-- includes full content for successful items
-- includes a bot-friendly `Error:` line for failed items
-In the pi TUI, batch mode also streams per-item progress rows showing:
-- a small spinner/check/error glyph
-- a truncated URL
-- a one-word status (`queued`, `fetching`, `extracting`, `done`, `error`)
-- a small progress bar
-## Example tool outputs
-### Agent-facing `web_fetch` output
-```text
-> URL: https://example.com/blog/some-article
-> Title: Some Article
-> Author: Jane Doe
-> Published: 2026-03-12
-> Site: Example Blog
-> Language: en
-> Words: 1284
-> Browser: chrome_145/windows
-# Some Article
-This is the cleaned readable content extracted from the page.
-It includes the body plus the full metadata header available to the agent.
-```
-### pi history/backlog preview for `web_fetch`
-```text
-web_fetch https://example.com/blog/some-article
-Title: Some Article
-Published: 2026-03-12
-# Some Article
-This is the cleaned readable content extracted from the page.
-... (more lines, Ctrl+O to expand)
-```
-### Attachment/binary `web_fetch` output
-```text
-> URL: https://example.com/download/report
-> File size: 999999
-> Mime type: application/pdf
-> File path: /absolute/path/to/temp/report.pdf
-```
-### `batch_web_fetch` output
-```text
-> Requests: 2
-> Succeeded: 1
-> Failed: 1
-> Concurrency: 8
-## [1/2] https://example.com/blog/some-article
-> URL: https://example.com/blog/some-article
-> Title: Some Article
-> Author: Jane Doe
-> Published: 2026-03-12
-# Some Article
-This is the cleaned readable content extracted from the page.
-## [2/2] https://blocked.example/post
-> URL: https://blocked.example/post
-> Status: error
-> Error: HTTP 403 Forbidden for https://blocked.example/post
-```
-### Error output
-```text
-Error: Invalid URL: not-a-url
-```
-## Parameters
-### `web_fetch`
-| Parameter         | Type                          | Default         | Description                                                                  |
-|-------------------|-------------------------------|-----------------|------------------------------------------------------------------------------|
-| `url`             | string                        | required        | URL to fetch                                                                 |
-| `browser`         | string                        | `chrome_145`    | Browser profile used for transport fingerprinting                            |
-| `os`              | string                        | `windows`       | OS profile: `windows`, `macos`, `linux`, `android`, `ios`                   |
-| `headers`         | object                        | auto            | Extra request headers                                                        |
-| `maxChars`        | number                        | `50000`         | Maximum returned characters. Can be overridden by pi settings                |
-| `format`          | `markdown` \| `html` \| `text` \| `json` | `markdown`      | Output format                                                                |
-| `removeImages`    | boolean                       | `false`         | Strip image references from output                                           |
-| `includeReplies`  | boolean \| `extractors`       | `extractors`    | Include replies/comments                                                     |
-| `proxy`           | string                        | none            | Proxy URL                                                                    |
-| `verbose`         | boolean                       | `false`         | Compatibility flag. pi currently returns the full metadata header to the agent regardless; user history preview stays compact |
+For `batch_web_fetch`, each item in `requests` accepts the same parameters as `web_fetch` except `verbose`.
-### `batch_web_fetch`
+## Output formats
-| Parameter   | Type                | Default   | Description |
-|-------------|---------------------|-----------|-------------|
-| `requests`  | array of objects    | required  | Array of fetch requests. Each item accepts the same parameters as `web_fetch` except `verbose` |
-| `verbose`   | boolean             | `false`   | Compatibility flag. pi currently returns the full metadata header for successful results regardless |
+| Format | What you get |
+|---|---|
+| `markdown` | Best default for readable page content |
+| `html` | Cleaned HTML output |
+| `text` | Plain text with markdown stripped |
+| `json` | Structured JSON for metadata-heavy workflows |
-## pi settings
+## Global defaults
-Optional custom settings in `~/.pi/agent/settings.json` or `.pi/settings.json`:
+Optional settings in `~/.pi/agent/settings.json` or `.pi/settings.json`:
 ```json
 {
   "smartFetchVerboseByDefault": false,
-  "smartFetchDefaultMaxChars": 12000,
+  "smartFetchDefaultMaxChars": 50000,
   "smartFetchDefaultTimeoutMs": 15000,
   "smartFetchDefaultBrowser": "chrome_145",
   "smartFetchDefaultOs": "windows",
@@ -208,37 +86,22 @@ Optional custom settings in `~/.pi/agent/settings.json` or `.pi/settings.json`:
 }
 ```
-Behavior:
-- `smartFetchVerboseByDefault` sets the stored default for the compatibility `verbose` flag
-- `smartFetchDefaultMaxChars` sets the runtime default for `maxChars`
-- `smartFetchDefaultTimeoutMs` sets the runtime request timeout
-- `smartFetchDefaultBrowser` sets the default browser fingerprint profile
-- `smartFetchDefaultOs` sets the default OS fingerprint profile
-- `smartFetchDefaultRemoveImages` sets the default for image stripping
-- `smartFetchDefaultIncludeReplies` sets the default replies/comments behavior
-- `smartFetchDefaultBatchConcurrency` sets the default bounded concurrency for `batch_web_fetch`
-- `smartFetchTempDir` sets the base temp directory used for attachments and binary downloads
-- project `.pi/settings.json` overrides global `~/.pi/agent/settings.json`
-Legacy aliases still supported:
-- `webFetchVerboseByDefault`
-- `webFetchDefaultMaxChars`
-- `webFetchDefaultBatchConcurrency`
-- `webFetchTempDir`
-## When not to use it
-Do not use these tools when:
-- the site requires JS rendering
-- you need login/session flows
-- you need to click, scroll, or submit forms
-- you need a fully interactive browser session
-In those cases, switch to browser automation.
-## Recent feature additions reflected here
-Recent `feat:` work added:
-- publish-ready TS tooling, tests, and packaging checks
-- timer-driven spinner animation for batch progress in the pi TUI
-- attachment and binary streaming into temp files with sanitized output paths
+| Setting | Default | Description |
+|---|---:|---|
+| `smartFetchVerboseByDefault` | `false` | Stored default for the compatibility `verbose` flag |
+| `smartFetchDefaultMaxChars` | `50000` | Default `maxChars` limit |
+| `smartFetchDefaultTimeoutMs` | `15000` | Default request timeout in milliseconds |
+| `smartFetchDefaultBrowser` | `chrome_145` | Default browser fingerprint profile |
+| `smartFetchDefaultOs` | `windows` | Default OS fingerprint profile |
+| `smartFetchDefaultRemoveImages` | `false` | Strip image references by default |
+| `smartFetchDefaultIncludeReplies` | `extractors` | Include replies/comments only when site extractors support them |
+| `smartFetchDefaultBatchConcurrency` | `8` | Default bounded concurrency for `batch_web_fetch` |
+| `smartFetchTempDir` | OS temp dir | Base directory for attachment and binary downloads |
+Notes:
+- Project `.pi/settings.json` overrides global `~/.pi/agent/settings.json`
+- Legacy `webFetch*` aliases are still supported
+## Dev and publishing note
+This repo uses Bun for local development, tests, and workspace scripts. Package publishing still goes through `npm publish` in CI so npm Trusted Publishing can be used.

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "pi-smart-fetch",
-  "version": "0.2.22",
+  "version": "0.2.24",
   "type": "module",
   "description": "pi.dev smart fetch extension with browser-grade TLS fingerprinting and Defuddle extraction.",
   "license": "MIT",