openclaw-smart-fetch 0.2.30 → 0.2.32

@@ -1,10 +1,16 @@
  {
    "id": "smart-fetch",
    "name": "Smart Fetch",
-   "version": "0.2.30",
+   "version": "0.2.32",
    "description": "Clean web content extraction with browser-grade TLS fingerprinting. Uses wreq-js (Rust native bindings) for anti-bot bypass and Defuddle for superior content extraction.",
+   "skills": ["./skills"],
+   "contracts": {
+     "webFetchProviders": ["smart-fetch"],
+     "tools": ["smart_fetch", "batch_smart_fetch"]
+   },
    "configSchema": {
      "type": "object",
+     "additionalProperties": false,
      "properties": {
        "maxChars": {
          "type": "number",
@@ -46,7 +52,6 @@
          "type": "string",
          "description": "Absolute temp directory used for attachment and binary downloads. If omitted, the plugin uses an OS temp subdirectory."
        }
-     },
-     "additionalProperties": false
+     }
    }
  }
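
In the manifest hunks above, `"additionalProperties": false` only moves within `configSchema` (from after `properties` to just after `"type"`). JSON object members are unordered, so the constraint is unchanged: config keys not declared under `properties` should still be rejected. A minimal sketch of that check follows; the helper name and the key list are illustrative assumptions, since only `maxChars` is visible in this diff and the host's actual validator is not shown.

```typescript
// Hypothetical subset of the declared config keys.
// Only `maxChars` appears in the visible hunks; the real schema declares more.
const allowedKeys: ReadonlySet<string> = new Set(["maxChars"]);

// Returns the keys that a schema with `additionalProperties: false`
// would reject because they are not declared under `properties`.
function unknownConfigKeys(
  config: Record<string, unknown>,
  allowed: ReadonlySet<string>,
): string[] {
  return Object.keys(config).filter((key) => !allowed.has(key));
}
```

For example, `unknownConfigKeys({ maxChars: 1000, maxChar: 5 }, allowedKeys)` returns `["maxChar"]`, which is the kind of typo a strict schema is meant to surface.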
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "openclaw-smart-fetch",
-   "version": "0.2.30",
+   "version": "0.2.32",
    "type": "module",
    "description": "OpenClaw web fetch plugin with desktop-browser TLS impersonation and defuddle extraction.",
    "license": "MIT",
@@ -33,6 +33,7 @@
    },
    "files": [
      "dist/",
+     "skills/",
      "openclaw.plugin.json",
      "README.md",
      "LICENSE"
@@ -0,0 +1,99 @@
+ ---
+ name: smart_fetch
+ description: "Fetch web pages with browser-grade TLS fingerprinting and Defuddle extraction. Fetch X/Twitter posts, Reddit threads, YouTube, GitHub, news articles, documentation, and any site where web_fetch gets blocked or returns noisy output."
+ metadata:
+ {
+ "openclaw":
+ {
+ "emoji": "🌐",
+ "requires":
+ { "config": ["plugins.entries.smart-fetch.enabled"] },
+ },
+ }
+ ---
+
+ # Smart Fetch Tools
+
+ ## When to use which tool
+
+ | Need | Tool | When |
+ |----------------------------|---------------------|---------------------------------------------------------|
+ | Fetch a single URL | `smart_fetch` | Articles, posts, docs, any page — try this first |
+ | Fetch multiple URLs | `batch_smart_fetch` | Multiple URLs in one call, bounded concurrency |
+ | JS-heavy interactive sites | `browser` | SPAs that need JavaScript to render content |
+
+ ## Sites and pages smart_fetch handles well
+
+ smart_fetch uses **Defuddle** for content extraction and **wreq-js** for
+ browser-grade TLS fingerprinting. This combination works especially well on:
+
+ | Site / page type | What it extracts |
+ |----------------------------|---------------------------------------------------------------------|
+ | **X / Twitter posts** | Tweet text via oEmbed; detects deleted/protected tweets |
+ | **Reddit posts & threads** | Post content + comment threads (use `includeReplies`) |
+ | **YouTube** | Page metadata, transcript extraction |
+ | **GitHub** | READMEs, issues, PRs, discussions — strips chrome, keeps code |
+ | **Hacker News** | Story content + comment threads |
+ | **Substack / Medium** | Full article text, author, publish date |
+ | **Stack Overflow** | Question + answers with code blocks |
+ | **Wikipedia** | Article body with infobox cleanup |
+ | **Documentation sites** | Code blocks, callouts, footnotes, math (MathML/KaTeX) |
+ | **Blog posts & news** | Schema.org metadata, clean main-content extraction |
+ | **General web pages** | Any HTML page — strips nav, sidebars, footers, ads |
+
+ Limitations — escalate to the **browser** tool for:
+
+ - **JS-heavy SPAs** — content that only appears after JavaScript execution
+ - **Login-protected pages** — no session/cookie management
+ - **Interactive flows** — anything needing clicks, form fills, or scrolling
+
+ ## smart_fetch
+
+ | Parameter | Type | Description |
+ |--------------------|---------------------------------|-------------------------------------------------------------------------------|
+ | `url` | `string` (required) | HTTP or HTTPS URL to fetch |
+ | `browser` | `string` | TLS profile: `chrome_145`, `firefox_147`, `safari_26`, `edge_145` |
+ | `os` | `string` | OS profile: `windows`, `macos`, `linux`, `android`, `ios` |
+ | `headers` | `Record<string,string>` | Custom HTTP headers |
+ | `maxChars` | `number` | Max characters to return (default: 50000) |
+ | `timeoutMs` | `number` | Request timeout in ms (default: 15000) |
+ | `format` | `string` | Output: `markdown` (default), `html`, `text`, `json` |
+ | `removeImages` | `boolean` | Strip image references (default: false) |
+ | `includeReplies` | `boolean` or `"extractors"` | Include comments/replies (default: `"extractors"`) |
+ | `proxy` | `string` | HTTP or SOCKS5 proxy URL |
+
+ ### Why smart_fetch over web_fetch
+
+ - **TLS fingerprinting** — impersonates real browsers at the TLS/HTTP2 level (JA3/JA4).
+ Sites that return 403 or empty pages to plain HTTP clients often serve full
+ content to smart_fetch.
+ - **Better extraction** — Defuddle removes more noise (nav, sidebars, ads,
+ footers, social widgets) and keeps more signal (code blocks, footnotes,
+ math, callouts, schema.org metadata).
+ - **Richer metadata** — returns author, publish date, site name, language,
+ word count.
+ - **No API key required** — works out of the box.
+
+ ## batch_smart_fetch
+
+ | Parameter | Type | Description |
+ |-------------|--------------------|----------------------------------------------------|
+ | `requests` | array (required) | Each item accepts the same params as `smart_fetch` |
+
+ - Default concurrency: **8** parallel requests (configurable via plugin config).
+ - Results are **ordered** matching the input array — labelled `[N/total]`.
+ - Individual failures don't fail the batch — each item has its own status.
+
+ ## Workflow escalation
+
+ 1. **`smart_fetch`** — first choice for any URL.
+ 2. **`batch_smart_fetch`** — when you need multiple URLs at once.
+ 3. **`web_fetch`** — if smart_fetch is unavailable.
+ 4. **Browser tool** — JS-heavy or login-protected pages only.
+
+ ## Automatic web_fetch fallback
+
+ When the `smart-fetch` plugin is enabled, it registers as a **web fetch
+ provider**. The built-in `web_fetch` tool will automatically use smart_fetch's
+ TLS-fingerprinted pipeline when its own Readability extraction fails — no
+ configuration needed.
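
The parameter tables in the added skill file can be read as argument shapes for the two tools. The sketch below transcribes them into TypeScript types and adds a small helper that follows the first two escalation steps (one URL goes to `smart_fetch`, several go to `batch_smart_fetch`). The type names and the `planFetch` helper are hypothetical: the plugin's source and the way the host dispatches tool calls are not part of this diff.

```typescript
// Argument shape transcribed from the `smart_fetch` parameter table.
// Type names are illustrative, not taken from the plugin's source.
type SmartFetchParams = {
  url: string; // required, HTTP or HTTPS
  browser?: string; // TLS profile, e.g. "chrome_145"
  os?: string; // OS profile, e.g. "macos"
  headers?: Record<string, string>;
  maxChars?: number; // documented default: 50000
  timeoutMs?: number; // documented default: 15000
  format?: "markdown" | "html" | "text" | "json";
  removeImages?: boolean;
  includeReplies?: boolean | "extractors";
  proxy?: string;
};

// `batch_smart_fetch` takes an array of the same per-request parameters.
type BatchSmartFetchParams = { requests: SmartFetchParams[] };

type ToolCall =
  | { tool: "smart_fetch"; args: SmartFetchParams }
  | { tool: "batch_smart_fetch"; args: BatchSmartFetchParams };

// Hypothetical helper: choose the single-URL or batch tool from the number
// of URLs, applying the same optional settings to every request.
function planFetch(
  urls: string[],
  shared: Omit<SmartFetchParams, "url"> = {},
): ToolCall {
  for (const url of urls) {
    // The table documents `url` as HTTP or HTTPS only.
    if (!/^https?:\/\//i.test(url)) {
      throw new Error(`unsupported URL scheme: ${url}`);
    }
  }
  const [first, ...rest] = urls;
  if (first === undefined) {
    throw new Error("at least one URL is required");
  }
  if (rest.length === 0) {
    return { tool: "smart_fetch", args: { ...shared, url: first } };
  }
  return {
    tool: "batch_smart_fetch",
    args: { requests: urls.map((url) => ({ ...shared, url })) },
  };
}
```

Calling `planFetch(["https://example.com/a", "https://example.com/b"], { format: "text" })` yields a `batch_smart_fetch` call with two requests that both carry `format: "text"`. The later escalation steps (`web_fetch`, then the browser tool) depend on runtime outcomes, so they are left out of this sketch.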