mdrip 0.1.3 → 0.1.5

Files changed (3)
  1. package/README.md +118 -129
  2. package/dist/index.js +1 -1
  3. package/package.json +2 -1
package/README.md CHANGED
@@ -1,216 +1,205 @@
  # mdrip
 
- Fetch markdown snapshots of web pages using Cloudflare's Markdown for Agents feature, so coding agents can consume clean structured content instead of HTML.
+ Fetch clean markdown snapshots of any web page optimized for AI agents, RAG pipelines, and context-aware workflows.
 
- ## AI Skills
+ Reduces token overhead by ~90% compared to raw HTML while preserving the content structure LLMs need.
+
+ ## Why
 
- This repo also includes an AI-consumable skills catalog in `skills/`, following the [agentskills](https://agentskills.io) format.
+ AI agents and LLMs work better with markdown than HTML. Feeding raw HTML into a context window wastes tokens on tags, scripts, styles, and boilerplate. mdrip solves this by fetching any URL and returning clean, structured markdown.
 
- - Skill index: `skills/README.md`
- - mdrip skill: `skills/mdrip/SKILL.md`
+ - **~90% fewer tokens** than raw HTML
+ - **Automatic HTML-to-markdown fallback** when native markdown isn't available
+ - **Works everywhere** — CLI, Node.js, Cloudflare Workers, or via remote MCP
+ - **Token-aware** — reports estimated token counts so you can manage context budgets
 
- ### Install skills from this repo
+ Sites that support [Cloudflare's Markdown for Agents](https://developers.cloudflare.com/fundamentals/reference/markdown-for-agents/) return markdown natively at the edge. For all other sites, mdrip's built-in converter handles headings, links, lists, code blocks, tables, blockquotes, and more.
 
- If you use a Skills-compatible agent setup, you can add these skills directly:
+ ## Installation
 
  ```bash
- # install skills from this repo
- npx skills add charl-kruger/mdrip
+ npm install -g mdrip
  ```
 
- ## Why
+ Or use directly with `npx`:
+
+ ```bash
+ npx mdrip <url>
+ ```
 
- For agent workflows, markdown is often better than HTML:
- - cleaner structure
- - lower token overhead
- - easier chunking and context management
+ ## CLI Usage
 
- `mdrip` requests pages with `Accept: text/markdown`, stores the markdown locally, and tracks fetched pages in an index.
+ ### Fetch pages
 
- If a site does not return `text/markdown`, `mdrip` can automatically fall back to converting `text/html` into markdown.
- The fallback uses an in-project converter optimized for common documentation/blog content (headings, links, lists, code blocks, tables, blockquotes).
+ ```bash
+ # Fetch one page
+ mdrip https://example.com/docs/getting-started
 
- ## Why Cloudflare Markdown for Agents matters
+ # Fetch multiple pages
+ mdrip https://example.com/docs https://example.com/api
 
- Cloudflare's blog and docs describe Markdown for Agents as content negotiation at the edge:
- - clients request `Accept: text/markdown`
- - Cloudflare converts HTML to markdown in real time (for enabled zones)
- - response includes `x-markdown-tokens` for token-size awareness
+ # Custom timeout (ms)
+ mdrip https://example.com --timeout 45000
 
- For AI workflows this is high-value:
- - better structure for LLM parsing than raw HTML
- - less token waste in context windows
- - predictable markdown snapshots you can store and reuse in your repo
+ # Strict mode only accept native markdown, no HTML fallback
+ mdrip https://example.com --no-html-fallback
 
- References:
- - [Cloudflare blog: Markdown for Agents](https://blog.cloudflare.com/markdown-for-agents/)
- - [Cloudflare docs: Markdown for Agents](https://developers.cloudflare.com/fundamentals/reference/markdown-for-agents/)
+ # Raw mode — print markdown to stdout, no file writes
+ mdrip https://example.com --raw
+ ```
 
- ## Installation
+ ### List fetched pages
 
  ```bash
- npm install -g mdrip
+ mdrip list
+ mdrip list --json
  ```
 
- Or use with `npx`:
+ ### Remove pages
 
  ```bash
- npx mdrip <url>
+ mdrip remove https://example.com/docs/getting-started
  ```
 
- For programmatic usage in Node.js or Workers:
+ ### Clean snapshots
 
  ```bash
- npm install mdrip
- ```
+ # Remove all
+ mdrip clean
 
- ## Programmatic API
+ # Remove only one domain
+ mdrip clean --domain example.com
+ ```
 
- ### Node.js (fetch and store)
+ ### Raw mode for agent runtimes
 
- ```ts
- import { fetchToStore, listStoredPages } from "mdrip/node";
+ `--raw` prints markdown to stdout and skips all file writes and prompts. Useful for piping content directly into agent loops.
 
- const result = await fetchToStore("https://developers.cloudflare.com/", {
-   cwd: process.cwd(),
- });
+ ```bash
+ mdrip https://example.com --raw | your-agent-cli
+ ```
 
- if (!result.success) {
-   throw new Error(result.error || "Failed to fetch page");
- }
+ ## Programmatic API
 
- const pages = await listStoredPages(process.cwd());
- console.log(pages.map((p) => p.path));
+ ```bash
+ npm install mdrip
  ```
 
- ### Cloudflare Workers / Agent runtimes (raw in-memory markdown)
+ ### Workers / Edge / In-memory
 
  ```ts
  import { fetchMarkdown } from "mdrip";
 
- const page = await fetchMarkdown(
-   "https://blog.cloudflare.com/markdown-for-agents/",
- );
+ const page = await fetchMarkdown("https://example.com/docs");
 
- console.log(page.markdownTokens);
- console.log(page.markdown);
+ console.log(page.markdown); // clean markdown content
+ console.log(page.markdownTokens); // estimated token count
+ console.log(page.source); // "cloudflare-markdown" or "html-fallback"
  ```
 
- Available programmatic methods:
- - `mdrip` (Workers-safe): `fetchMarkdown(url, options)`, `fetchRawMarkdown(url, options)`
- - `mdrip/node` (filesystem features): `fetchToStore(url, options)`, `fetchManyToStore(urls, options)`, `listStoredPages(cwd?)`
+ ### Node.js (fetch and store to disk)
 
- ## Usage
-
- ### Fetch pages
-
- ```bash
- # Fetch one page
- mdrip https://developers.cloudflare.com/fundamentals/reference/markdown-for-agents/
-
- # Fetch multiple pages
- mdrip https://blog.cloudflare.com/markdown-for-agents/ https://developers.cloudflare.com/
+ ```ts
+ import { fetchToStore, listStoredPages } from "mdrip/node";
 
- # Optional timeout override (ms)
- mdrip https://example.com --timeout 45000
+ const result = await fetchToStore("https://example.com/docs", {
+   cwd: process.cwd(),
+ });
 
- # Disable HTML fallback (strict Cloudflare markdown only)
- mdrip https://example.com --no-html-fallback
+ if (result.success) {
+   console.log(`Saved to ${result.path}`);
+ }
 
- # Print raw page markdown to stdout (no files/settings changes, no prompts)
- mdrip https://blog.cloudflare.com/markdown-for-agents/ --raw
+ const pages = await listStoredPages(process.cwd());
  ```
 
- ### Raw mode for agents (OpenClaw, etc.)
+ ### Available exports
 
- `--raw` is designed for agent runtimes that only need in-memory content.
- It prints markdown to stdout and skips settings prompts and all file writes.
+ | Import | Environment | Functions |
+ |--------|-------------|-----------|
+ | `mdrip` | Workers, edge, browser | `fetchMarkdown()`, `fetchRawMarkdown()` |
+ | `mdrip/node` | Node.js | `fetchToStore()`, `fetchManyToStore()`, `listStoredPages()` |
 
- This is useful for flows with OpenClaw and similar AI tools where you want to pipe page content directly into your agent loop.
+ ## Remote MCP Server
 
- ```bash
- # stream markdown directly to another process
- mdrip https://blog.cloudflare.com/markdown-for-agents/ --raw
- ```
+ mdrip is available as a remote MCP server at **`mdrip.createmcp.dev`** — no install required. Any MCP-compatible client can connect and use the `fetch_markdown` and `batch_fetch_markdown` tools.
 
- ### List fetched pages
+ ### Claude Desktop
 
- ```bash
- mdrip list
- mdrip list --json
+ Add to `claude_desktop_config.json`:
+
+ ```json
+ {
+   "mcpServers": {
+     "mdrip": {
+       "command": "npx",
+       "args": ["mcp-remote", "https://mdrip.createmcp.dev/mcp"]
+     }
+   }
+ }
  ```
 
- ### Remove pages
+ ### Claude Code
 
  ```bash
- mdrip remove https://developers.cloudflare.com/fundamentals/reference/markdown-for-agents/
+ claude mcp add mdrip-remote --transport sse https://mdrip.createmcp.dev/sse
  ```
 
- ### Clean snapshots
+ ### Cloudflare AI Playground
 
- ```bash
- # Remove all
- mdrip clean
-
- # Remove only one domain
- mdrip clean --domain developers.cloudflare.com
- ```
+ Enter `mdrip.createmcp.dev/sse` at [playground.ai.cloudflare.com](https://playground.ai.cloudflare.com/).
 
  ## File modifications
 
  On first run, mdrip can optionally update:
- - `.gitignore` (adds `mdrip/`)
- - `tsconfig.json` (excludes `mdrip`)
- - `AGENTS.md` (adds a section pointing agents to snapshots)
+ - `.gitignore` adds `mdrip/`
+ - `tsconfig.json` excludes `mdrip/`
+ - `AGENTS.md` adds a section pointing agents to your snapshots
 
- Choice is stored in `mdrip/settings.json`.
+ Choice is stored in `mdrip/settings.json`. Use `--modify` or `--modify=false` to skip the prompt.
 
- Use flags to skip prompt:
+ `--raw` mode bypasses this entirely.
 
- ```bash
- # allow updates
- mdrip https://example.com --modify
+ ## Output structure
 
- # deny updates
- mdrip https://example.com --modify=false
  ```
-
- `--raw` mode bypasses this entire flow and never writes settings or snapshots.
-
- ## Output
-
- ```text
  mdrip/
  ├── settings.json
  ├── sources.json
  └── pages/
-     └── developers.cloudflare.com/
-         └── fundamentals/
-             └── reference/
-                 └── markdown-for-agents/
-                     └── index.md
+     └── example.com/
+         └── docs/
+             └── getting-started/
+                 └── index.md
  ```
 
- ## Requirements and notes
+ ## Benchmark
 
- - Node.js 18+
- - The target site must return markdown for `Accept: text/markdown` (Cloudflare Markdown for Agents enabled).
- - If a page does not return `text/markdown`, mdrip can convert `text/html` into markdown fallback unless `--no-html-fallback` is used.
+ Measured across popular pages (values vary as pages change):
 
- ## Publishing to npm
+ | Page | Mode | Chars saved | Tokens saved |
+ |------|------|------------:|-------------:|
+ | blog.cloudflare.com/markdown-for-agents | cloudflare-markdown | 94.9% | 94.9% |
+ | developers.cloudflare.com/.../markdown-for-agents | cloudflare-markdown | 95.7% | 95.7% |
+ | en.wikipedia.org/wiki/Markdown | html-fallback | 72.7% | 72.7% |
+ | github.com/cloudflare/skills | html-fallback | 96.3% | 96.3% |
+ | **Average** | | **89.9%** | **89.9%** |
 
  ```bash
- # optional package check
- pnpm publish:dry-run
+ pnpm build && pnpm benchmark
+ ```
+
+ ## AI Skills
 
- # publish to npm
- pnpm publish:npm
+ This repo includes an AI-consumable skills catalog in `skills/`, following the [agentskills](https://agentskills.io) format.
+
+ ```bash
+ npx skills add charl-kruger/mdrip
  ```
 
- `prepublishOnly` runs automatically before publish and executes:
- - `pnpm type-check`
- - `pnpm test`
- - `pnpm build`
+ ## Requirements
+
+ - Node.js 18+
 
  ## Author
 
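The Benchmark table added to the README reports per-page savings and an average, but the script that produces it (`scripts/benchmark.mjs`, referenced by the package.json change) is not part of this diff. The following is a rough sketch of the arithmetic only. It assumes each percentage is the relative reduction from the HTML size and that the average is an unweighted mean; the helper names `savedPercent`, `mean`, and `roundTo1` are hypothetical.

```typescript
// Hypothetical helpers; the real benchmark script is not shown in this diff.

// Relative reduction, in percent, going from `before` units to `after` units
// (units can be characters or estimated tokens).
function savedPercent(before: number, after: number): number {
  if (before <= 0) {
    throw new RangeError("before must be positive");
  }
  return ((before - after) / before) * 100;
}

// Unweighted arithmetic mean of a non-empty list.
function mean(values: number[]): number {
  if (values.length === 0) {
    throw new RangeError("values must not be empty");
  }
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Round to one decimal place, matching the precision used in the table.
function roundTo1(value: number): number {
  return Math.round(value * 10) / 10;
}

// The four "Tokens saved" values from the README table.
const tokensSaved = [94.9, 95.7, 72.7, 96.3];
const average = roundTo1(mean(tokensSaved));
```

Under these assumptions, `average` comes out to 89.9, which agrees with the **Average** row in the table.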
package/dist/index.js CHANGED
@@ -8,7 +8,7 @@ const program = new Command();
  program
    .name("mdrip")
    .description("Fetch markdown snapshots for URLs using Cloudflare Markdown for Agents")
-   .version("0.1.3")
+   .version("0.1.4")
    .option("--cwd <path>", "working directory (default: current directory)");
  program
    .argument("[urls...]", "URLs to fetch as markdown")
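The README text in this diff describes the fetch flow behind this CLI: request the page with `Accept: text/markdown`, use the body as-is when the site answers with `text/markdown`, and otherwise convert `text/html` unless `--no-html-fallback` is set. Below is a minimal sketch of that decision step, not mdrip's actual implementation. The names `FetchedResponse`, `PageResult`, `resolveMarkdown`, and `naiveHtmlToMarkdown` are hypothetical, and the converter is a deliberately crude stand-in for the real one.

```typescript
// Illustrative sketch only: these names are assumptions, not mdrip's exports.

// The parts of an HTTP response that matter for the decision. The request
// is assumed to have been sent with the header `Accept: text/markdown`.
interface FetchedResponse {
  contentType: string; // value of the Content-Type response header
  body: string; // response body as text
  markdownTokens?: string; // value of x-markdown-tokens, if the header is present
}

interface PageResult {
  markdown: string;
  source: "cloudflare-markdown" | "html-fallback";
  markdownTokens: number | null;
}

// Crude placeholder converter: handles <h1> and paragraph ends, strips other tags.
function naiveHtmlToMarkdown(html: string): string {
  return html
    .replace(/<h1[^>]*>([\s\S]*?)<\/h1>/gi, "# $1\n\n")
    .replace(/<\/p>/gi, "\n\n")
    .replace(/<[^>]+>/g, "")
    .trim();
}

function resolveMarkdown(res: FetchedResponse, allowHtmlFallback = true): PageResult {
  // Native markdown: pass the body through and surface the token header.
  if (res.contentType.toLowerCase().includes("text/markdown")) {
    return {
      markdown: res.body,
      source: "cloudflare-markdown",
      markdownTokens: res.markdownTokens ? Number(res.markdownTokens) : null,
    };
  }
  // Strict mode: refuse anything that is not native markdown.
  if (!allowHtmlFallback) {
    throw new Error(`Expected text/markdown but received "${res.contentType}"`);
  }
  // Fallback: convert the HTML body locally.
  return {
    markdown: naiveHtmlToMarkdown(res.body),
    source: "html-fallback",
    markdownTokens: null,
  };
}
```

In a real call, `contentType` and `markdownTokens` would be read from the response headers, and passing `false` as the second argument corresponds to the strict behaviour the README attributes to `--no-html-fallback`.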
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "mdrip",
-   "version": "0.1.3",
+   "version": "0.1.5",
    "description": "Fetch markdown snapshots of web pages using Cloudflare Markdown for Agents",
    "type": "module",
    "main": "./dist/web.js",
@@ -38,6 +38,7 @@
      "build": "tsc",
      "dev": "tsc --watch",
      "start": "node dist/index.js",
+     "benchmark": "node scripts/benchmark.mjs",
      "test": "vitest run",
      "test:watch": "vitest",
      "test:coverage": "vitest run --coverage",