openclaw-smart-fetch 0.2.15 → 0.2.23

package/README.md CHANGED
@@ -1,52 +1,31 @@
1
1
  # openclaw-smart-fetch
2
2
 
3
- `openclaw-smart-fetch` adds smarter fetching tools for OpenClaw.
3
+ `openclaw-smart-fetch` adds smarter web fetching tools to OpenClaw.
4
4
 
5
- It registers:
6
- - `smart_fetch`
7
- - `batch_smart_fetch`
8
-
9
- It combines:
10
- - `@thinkscape/wreq-js` for browser-like transport fingerprints
11
- - `Defuddle` for readable content extraction
12
-
13
- ## Why use this instead of OpenClaw's built-in `web_fetch`
14
-
15
- Use this package when the built-in `web_fetch` is not enough.
5
+ ## Features
16
6
 
17
- Typical advantages:
18
- - **better resistance to bot detection** on sites that inspect TLS/HTTP client fingerprints
19
- - **more browser-like transport behavior** instead of a generic server-side HTTP client
20
- - **cleaner extracted content** instead of raw or noisy page output
21
- - **better article/document readability** for downstream agent analysis
22
- - **useful metadata** like title, author, published date, site, and language when available
23
- - **batch fan-out support** when you want to fetch multiple URLs in one tool call
24
- - **attachment and binary download support** when a server returns `Content-Disposition: attachment` or a non-text content type
25
- - **temp-file output** with sanitized filenames and file metadata instead of trying to render binary bytes as page text
7
+ - 🔐 **Browser-like TLS/SSL + HTTP fingerprints** - better success on bot-defended pages
8
+ - 🧹 **Defuddle extraction** - clean readable content instead of noisy HTML
9
+ - 🧠 **Useful metadata** - title, author, site, language, published date when available
10
+ - 📦 **Downloads + large file support** - stream attachments and binaries to temp files
11
+ - ⚡ **Batch fetch** - fetch many URLs with bounded concurrency
12
+ - 📝 **Multiple output formats** - `markdown`, `html`, `text`, `json`
26
13
 
27
- A good rule of thumb:
28
- - use built-in `web_fetch` for simple pages
29
- - use `smart_fetch` when pages are blocked, noisy, or extraction quality matters
30
- - use `batch_smart_fetch` when you need the same smarter fetch behavior over many URLs at once
14
+ ## Supported sites
31
15
 
32
- ## Bot-detection focus
16
+ - **Anything fetchable over server-side HTTP** - this package is not limited to a fixed allowlist
17
+ - Docs, blog posts, articles, and knowledge-base pages
18
+ - Reddit posts and comment threads
19
+ - X / Twitter posts
20
+ - YouTube pages and transcripts
21
+ - GitHub pages, issues, PRs, and discussions
22
+ - Hacker News threads
23
+ - Substack posts
24
+ - Pages with code blocks, footnotes, math, and callouts
33
25
 
34
- These tools are aimed at sites that detect bots through:
35
- - TLS/client fingerprinting
36
- - transport/header inconsistencies
37
- - non-browser HTTP behavior
38
-
39
- They do **not** execute JavaScript or solve interactive anti-bot flows.
40
-
41
- If a page requires JS execution, login, scrolling, or clicking, use browser automation instead.
42
-
43
- ## What tools it exposes
44
-
45
- This package registers:
46
- - `smart_fetch`
47
- - `batch_smart_fetch`
48
-
49
- OpenClaw keeps separate tool names because overriding/hoisting built-in `web_fetch` is not the desired path here.
26
+ Notes:
27
+ - Defuddle is the cleanup layer: it strips common page chrome like nav, sidebars, related links, share widgets, and footers
28
+ - It does **not** execute JavaScript or solve interactive anti-bot/login flows
50
29
 
51
30
  ## Install
52
31
 
@@ -62,139 +41,58 @@ From a local checkout:
62
41
  openclaw plugins install -l /absolute/path/to/agent-smart-fetch/packages/openclaw-smart-fetch
63
42
  ```
64
43
 
65
- ## Use cases
66
-
67
- Use `smart_fetch` when you want to:
68
- - fetch pages that reject naive HTTP clients
69
- - extract the readable body from articles, docs, and blog posts
70
- - reduce noise before passing content to an agent
71
- - preserve page metadata for summarization or research
72
- - use browser-like fetching without paying the cost of full browser automation
44
+ ## OpenClaw tools
73
45
 
74
- Use `batch_smart_fetch` when you want to:
75
- - fetch multiple URLs in one tool call
76
- - preserve a clear mapping between each input URL and its result
77
- - keep full content for successes while retaining per-item error strings for failures
78
- - run bounded-concurrency fetches instead of firing everything at once
46
+ Registers:
47
+ - `smart_fetch`
48
+ - `batch_smart_fetch`
79
49
 
80
- ## Tool synopsis
50
+ Synopsis:
81
51
 
82
52
  ```text
83
- smart_fetch(url, browser?, os?, headers?, maxChars?, format?, removeImages?, includeReplies?, proxy?)
53
+ smart_fetch(url, browser?, os?, headers?, maxChars?, timeoutMs?, format?, removeImages?, includeReplies?, proxy?)
84
54
  batch_smart_fetch(requests)
85
55
  ```
86
56
 
87
- For `batch_smart_fetch`, `requests` is an array of objects, and **each item accepts the same parameters as `smart_fetch`**.
88
-
89
- ## Example output
90
-
91
- ### `smart_fetch`
57
+ For `batch_smart_fetch`, each item in `requests` accepts the same parameters as `smart_fetch`.
92
58
 
93
- ```text
94
- > URL: https://example.com/blog/some-article
95
- > Title: Some Article
96
- > Author: Jane Doe
97
- > Published: 2026-03-12
98
- > Site: Example Blog
99
- > Language: en
100
- > Words: 1284
101
- > Browser: chrome_145/windows
102
-
103
- # Some Article
104
-
105
- This is the cleaned readable content extracted from the page.
106
- ```
59
+ ## Output formats
107
60
 
108
- ### `smart_fetch` attachment/binary output
61
+ | Format | What you get |
62
+ |---|---|
63
+ | `markdown` | Best default for readable page content |
64
+ | `html` | Cleaned HTML output |
65
+ | `text` | Plain text with markdown stripped |
66
+ | `json` | Structured JSON for metadata-heavy workflows |
109
67
 
110
- ```text
111
- > URL: https://example.com/download/report
112
- > File size: 999999
113
- > Mime type: application/pdf
114
- > File path: /absolute/path/to/temp/report.pdf
115
- ```
68
+ ## Plugin defaults
116
69
 
117
- ### `batch_smart_fetch`
70
+ See `openclaw.plugin.json` for the schema. The effective defaults are:
118
71
 
119
- ```text
120
- > Requests: 2
121
- > Succeeded: 1
122
- > Failed: 1
123
- > Concurrency: 8
124
-
125
- ## [1/2] https://example.com/blog/some-article
126
- > URL: https://example.com/blog/some-article
127
- > Title: Some Article
128
- > Author: Jane Doe
129
- > Published: 2026-03-12
130
- > Site: Example Blog
131
- > Language: en
132
- > Words: 1284
133
- > Browser: chrome_145/windows
134
-
135
- # Some Article
136
-
137
- This is the cleaned readable content extracted from the page.
138
-
139
- ## [2/2] https://blocked.example/post
140
- > URL: https://blocked.example/post
141
- > Status: error
142
- > Error: HTTP 403 Forbidden for https://blocked.example/post
72
+ ```json
73
+ {
74
+ "maxChars": 50000,
75
+ "timeoutMs": 15000,
76
+ "browser": "chrome_145",
77
+ "os": "windows",
78
+ "removeImages": false,
79
+ "includeReplies": "extractors",
80
+ "batchConcurrency": 8,
81
+ "tempDir": "/tmp/openclaw-smart-fetch"
82
+ }
143
83
  ```
144
84
 
145
- ## Parameters
146
-
147
- ### `smart_fetch`
148
-
149
- | Parameter | Type | Default | Description |
150
- |-------------------|-------------------------------|-----------------|-----------------------------------------------------------|
151
- | `url` | string | required | URL to fetch |
152
- | `browser` | string | `chrome_145` | Browser profile used for transport fingerprinting |
153
- | `os` | string | `windows` | OS profile: `windows`, `macos`, `linux`, `android`, `ios` |
154
- | `headers` | object | auto | Extra request headers |
155
- | `maxChars` | number | `50000` | Maximum returned characters |
156
- | `format` | `markdown` \| `html` \| `text` \| `json` | `markdown` | Output format |
157
- | `removeImages` | boolean | `false` | Strip image references from output |
158
- | `includeReplies` | boolean \| `extractors` | `extractors` | Include replies/comments |
159
- | `proxy` | string | none | Proxy URL |
160
-
161
- ### `batch_smart_fetch`
162
-
163
- | Parameter | Type | Default | Description |
164
- |-------------|------------------|-----------|-------------|
165
- | `requests` | array of objects | required | Array of fetch requests. Each item accepts the same parameters as `smart_fetch` |
166
-
167
- ## OpenClaw config
168
-
169
- See `openclaw.plugin.json` for plugin config defaults and schema.
170
-
171
- Configurable defaults include:
172
- - `maxChars`
173
- - `timeoutMs`
174
- - `browser`
175
- - `os`
176
- - `removeImages`
177
- - `includeReplies`
178
- - `batchConcurrency`
179
- - `tempDir`
180
-
181
- `batchConcurrency` defaults to `8` and controls how many `batch_smart_fetch` requests run concurrently.
182
-
183
- `tempDir` lets the OpenClaw consumer choose where attachment/binary downloads are written before the tool returns their absolute file paths.
184
-
185
- ## When not to use it
186
-
187
- Do not use these tools when:
188
- - the page requires JS rendering
189
- - you need login/session flows
190
- - you need clicks, scrolling, or form submission
191
- - a full browser session is required
192
-
193
- In those cases, use browser automation instead.
85
+ | Setting | Default | Description |
86
+ |---|---:|---|
87
+ | `maxChars` | `50000` | Default maximum returned characters |
88
+ | `timeoutMs` | `15000` | Default request timeout in milliseconds |
89
+ | `browser` | `chrome_145` | Default browser fingerprint profile |
90
+ | `os` | `windows` | Default OS fingerprint profile |
91
+ | `removeImages` | `false` | Strip image references by default |
92
+ | `includeReplies` | `extractors` | Include replies/comments only when site extractors support them |
93
+ | `batchConcurrency` | `8` | Default bounded concurrency for `batch_smart_fetch` |
94
+ | `tempDir` | OS temp dir | Directory for attachment and binary downloads |
194
95
 
195
- ## Recent feature additions reflected here
96
+ ## Dev and publishing note
196
97
 
197
- Recent `feat:` work added:
198
- - publish-ready TS/test/build packaging workflow across the monorepo
199
- - richer animated batch progress behavior in pi-facing consumers
200
- - attachment and binary download streaming with sanitized temp-file output
98
+ This repo uses Bun for local development, tests, and workspace scripts. Package publishing still goes through `npm publish` in CI so npm Trusted Publishing can be used.
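
Reviewer note on the README changes above: both versions describe `batch_smart_fetch` as running with bounded concurrency (`batchConcurrency`, default `8`), and the removed text says each input URL keeps its own result or error string. The sketch below shows one common way such behavior could be structured. It is not taken from the package: `runBatch`, `fetchOne`, and the result object shape are hypothetical names chosen for illustration, and the real implementation in `dist/index.js` may differ.

```javascript
// Hypothetical sketch of bounded-concurrency batching; not the package's code.
// `fetchOne` stands in for whatever performs a single smart_fetch-style request.
async function runBatch(requests, fetchOne, concurrency = 8) {
  // Results are stored by input index so the output order matches the input order.
  const results = new Array(requests.length);
  let next = 0;

  // Each worker claims the next unprocessed index until none are left.
  async function worker() {
    while (next < requests.length) {
      const index = next++;
      const request = requests[index];
      try {
        const content = await fetchOne(request);
        results[index] = { url: request.url, status: "ok", content };
      } catch (err) {
        // Keep a per-item error string instead of failing the whole batch.
        const message = err && err.message ? err.message : String(err);
        results[index] = { url: request.url, status: "error", error: message };
      }
    }
  }

  // Never start more workers than there are requests, and always at least one.
  const workerCount = Math.max(1, Math.min(concurrency, requests.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With this shape, a failing URL only marks its own slot as an error while the other requests still complete, and at most `concurrency` calls to `fetchOne` are in flight at any time.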
package/dist/index.js CHANGED
@@ -42,9 +42,9 @@ var __toESM = (mod, isNodeMode, target) => (target = mod != null ? __create(__ge
42
42
  mod
43
43
  ));
44
44
 
45
- // ../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/_basePropertyOf.js
45
+ // ../../node_modules/lodash/_basePropertyOf.js
46
46
  var require_basePropertyOf = __commonJS({
47
- "../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/_basePropertyOf.js"(exports$1, module) {
47
+ "../../node_modules/lodash/_basePropertyOf.js"(exports$1, module) {
48
48
  function basePropertyOf(object) {
49
49
  return function(key) {
50
50
  return object == null ? void 0 : object[key];
@@ -54,9 +54,9 @@ var require_basePropertyOf = __commonJS({
54
54
  }
55
55
  });
56
56
 
57
- // ../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/_deburrLetter.js
57
+ // ../../node_modules/lodash/_deburrLetter.js
58
58
  var require_deburrLetter = __commonJS({
59
- "../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/_deburrLetter.js"(exports$1, module) {
59
+ "../../node_modules/lodash/_deburrLetter.js"(exports$1, module) {
60
60
  var basePropertyOf = require_basePropertyOf();
61
61
  var deburredLetters = {
62
62
  // Latin-1 Supplement block.
@@ -257,17 +257,17 @@ var require_deburrLetter = __commonJS({
257
257
  }
258
258
  });
259
259
 
260
- // ../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/_freeGlobal.js
260
+ // ../../node_modules/lodash/_freeGlobal.js
261
261
  var require_freeGlobal = __commonJS({
262
- "../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/_freeGlobal.js"(exports$1, module) {
262
+ "../../node_modules/lodash/_freeGlobal.js"(exports$1, module) {
263
263
  var freeGlobal = typeof global == "object" && global && global.Object === Object && global;
264
264
  module.exports = freeGlobal;
265
265
  }
266
266
  });
267
267
 
268
- // ../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/_root.js
268
+ // ../../node_modules/lodash/_root.js
269
269
  var require_root = __commonJS({
270
- "../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/_root.js"(exports$1, module) {
270
+ "../../node_modules/lodash/_root.js"(exports$1, module) {
271
271
  var freeGlobal = require_freeGlobal();
272
272
  var freeSelf = typeof self == "object" && self && self.Object === Object && self;
273
273
  var root = freeGlobal || freeSelf || Function("return this")();
@@ -275,18 +275,18 @@ var require_root = __commonJS({
275
275
  }
276
276
  });
277
277
 
278
- // ../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/_Symbol.js
278
+ // ../../node_modules/lodash/_Symbol.js
279
279
  var require_Symbol = __commonJS({
280
- "../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/_Symbol.js"(exports$1, module) {
280
+ "../../node_modules/lodash/_Symbol.js"(exports$1, module) {
281
281
  var root = require_root();
282
282
  var Symbol2 = root.Symbol;
283
283
  module.exports = Symbol2;
284
284
  }
285
285
  });
286
286
 
287
- // ../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/_arrayMap.js
287
+ // ../../node_modules/lodash/_arrayMap.js
288
288
  var require_arrayMap = __commonJS({
289
- "../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/_arrayMap.js"(exports$1, module) {
289
+ "../../node_modules/lodash/_arrayMap.js"(exports$1, module) {
290
290
  function arrayMap(array, iteratee) {
291
291
  var index = -1, length = array == null ? 0 : array.length, result = Array(length);
292
292
  while (++index < length) {
@@ -298,17 +298,17 @@ var require_arrayMap = __commonJS({
298
298
  }
299
299
  });
300
300
 
301
- // ../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/isArray.js
301
+ // ../../node_modules/lodash/isArray.js
302
302
  var require_isArray = __commonJS({
303
- "../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/isArray.js"(exports$1, module) {
303
+ "../../node_modules/lodash/isArray.js"(exports$1, module) {
304
304
  var isArray = Array.isArray;
305
305
  module.exports = isArray;
306
306
  }
307
307
  });
308
308
 
309
- // ../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/_getRawTag.js
309
+ // ../../node_modules/lodash/_getRawTag.js
310
310
  var require_getRawTag = __commonJS({
311
- "../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/_getRawTag.js"(exports$1, module) {
311
+ "../../node_modules/lodash/_getRawTag.js"(exports$1, module) {
312
312
  var Symbol2 = require_Symbol();
313
313
  var objectProto = Object.prototype;
314
314
  var hasOwnProperty = objectProto.hasOwnProperty;
@@ -335,9 +335,9 @@ var require_getRawTag = __commonJS({
335
335
  }
336
336
  });
337
337
 
338
- // ../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/_objectToString.js
338
+ // ../../node_modules/lodash/_objectToString.js
339
339
  var require_objectToString = __commonJS({
340
- "../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/_objectToString.js"(exports$1, module) {
340
+ "../../node_modules/lodash/_objectToString.js"(exports$1, module) {
341
341
  var objectProto = Object.prototype;
342
342
  var nativeObjectToString = objectProto.toString;
343
343
  function objectToString(value) {
@@ -347,9 +347,9 @@ var require_objectToString = __commonJS({
347
347
  }
348
348
  });
349
349
 
350
- // ../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/_baseGetTag.js
350
+ // ../../node_modules/lodash/_baseGetTag.js
351
351
  var require_baseGetTag = __commonJS({
352
- "../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/_baseGetTag.js"(exports$1, module) {
352
+ "../../node_modules/lodash/_baseGetTag.js"(exports$1, module) {
353
353
  var Symbol2 = require_Symbol();
354
354
  var getRawTag = require_getRawTag();
355
355
  var objectToString = require_objectToString();
@@ -366,9 +366,9 @@ var require_baseGetTag = __commonJS({
366
366
  }
367
367
  });
368
368
 
369
- // ../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/isObjectLike.js
369
+ // ../../node_modules/lodash/isObjectLike.js
370
370
  var require_isObjectLike = __commonJS({
371
- "../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/isObjectLike.js"(exports$1, module) {
371
+ "../../node_modules/lodash/isObjectLike.js"(exports$1, module) {
372
372
  function isObjectLike(value) {
373
373
  return value != null && typeof value == "object";
374
374
  }
@@ -376,9 +376,9 @@ var require_isObjectLike = __commonJS({
376
376
  }
377
377
  });
378
378
 
379
- // ../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/isSymbol.js
379
+ // ../../node_modules/lodash/isSymbol.js
380
380
  var require_isSymbol = __commonJS({
381
- "../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/isSymbol.js"(exports$1, module) {
381
+ "../../node_modules/lodash/isSymbol.js"(exports$1, module) {
382
382
  var baseGetTag = require_baseGetTag();
383
383
  var isObjectLike = require_isObjectLike();
384
384
  var symbolTag = "[object Symbol]";
@@ -389,9 +389,9 @@ var require_isSymbol = __commonJS({
389
389
  }
390
390
  });
391
391
 
392
- // ../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/_baseToString.js
392
+ // ../../node_modules/lodash/_baseToString.js
393
393
  var require_baseToString = __commonJS({
394
- "../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/_baseToString.js"(exports$1, module) {
394
+ "../../node_modules/lodash/_baseToString.js"(exports$1, module) {
395
395
  var Symbol2 = require_Symbol();
396
396
  var arrayMap = require_arrayMap();
397
397
  var isArray = require_isArray();
@@ -415,9 +415,9 @@ var require_baseToString = __commonJS({
415
415
  }
416
416
  });
417
417
 
418
- // ../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/toString.js
418
+ // ../../node_modules/lodash/toString.js
419
419
  var require_toString = __commonJS({
420
- "../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/toString.js"(exports$1, module) {
420
+ "../../node_modules/lodash/toString.js"(exports$1, module) {
421
421
  var baseToString = require_baseToString();
422
422
  function toString(value) {
423
423
  return value == null ? "" : baseToString(value);
@@ -426,9 +426,9 @@ var require_toString = __commonJS({
426
426
  }
427
427
  });
428
428
 
429
- // ../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/deburr.js
429
+ // ../../node_modules/lodash/deburr.js
430
430
  var require_deburr = __commonJS({
431
- "../../node_modules/.bun/lodash@4.18.1/node_modules/lodash/deburr.js"(exports$1, module) {
431
+ "../../node_modules/lodash/deburr.js"(exports$1, module) {
432
432
  var deburrLetter = require_deburrLetter();
433
433
  var toString = require_toString();
434
434
  var reLatin = /[\xc0-\xd6\xd8-\xf6\xf8-\xff\u0100-\u017f]/g;
@@ -446,9 +446,9 @@ var require_deburr = __commonJS({
446
446
  }
447
447
  });
448
448
 
449
- // ../../node_modules/.bun/mime-db@1.52.0/node_modules/mime-db/db.json
449
+ // ../../node_modules/mime-db/db.json
450
450
  var require_db = __commonJS({
451
- "../../node_modules/.bun/mime-db@1.52.0/node_modules/mime-db/db.json"(exports$1, module) {
451
+ "../../node_modules/mime-db/db.json"(exports$1, module) {
452
452
  module.exports = {
453
453
  "application/1d-interleaved-parityfec": {
454
454
  source: "iana"
@@ -8971,16 +8971,16 @@ var require_db = __commonJS({
8971
8971
  }
8972
8972
  });
8973
8973
 
8974
- // ../../node_modules/.bun/mime-db@1.52.0/node_modules/mime-db/index.js
8974
+ // ../../node_modules/mime-db/index.js
8975
8975
  var require_mime_db = __commonJS({
8976
- "../../node_modules/.bun/mime-db@1.52.0/node_modules/mime-db/index.js"(exports$1, module) {
8976
+ "../../node_modules/mime-db/index.js"(exports$1, module) {
8977
8977
  module.exports = require_db();
8978
8978
  }
8979
8979
  });
8980
8980
 
8981
- // ../../node_modules/.bun/mime-types@2.1.35/node_modules/mime-types/index.js
8981
+ // ../../node_modules/mime-types/index.js
8982
8982
  var require_mime_types = __commonJS({
8983
- "../../node_modules/.bun/mime-types@2.1.35/node_modules/mime-types/index.js"(exports$1) {
8983
+ "../../node_modules/mime-types/index.js"(exports$1) {
8984
8984
  var db = require_mime_db();
8985
8985
  var extname = __require("path").extname;
8986
8986
  var EXTRACT_TYPE_REGEXP = /^\s*([^;\s]*)(?:;|\s|$)/;