md-fetch 1.0.0
- package/AGENTS.md +212 -0
- package/LICENSE +21 -0
- package/README.md +449 -0
- package/README.zh-CN.md +449 -0
- package/dist/cli.d.ts +27 -0
- package/dist/cli.d.ts.map +1 -0
- package/dist/cli.js +158 -0
- package/dist/cli.js.map +1 -0
- package/dist/constants.d.ts +9 -0
- package/dist/constants.d.ts.map +1 -0
- package/dist/constants.js +15 -0
- package/dist/constants.js.map +1 -0
- package/dist/core/browser.d.ts +23 -0
- package/dist/core/browser.d.ts.map +1 -0
- package/dist/core/browser.js +125 -0
- package/dist/core/browser.js.map +1 -0
- package/dist/core/converter.d.ts +18 -0
- package/dist/core/converter.d.ts.map +1 -0
- package/dist/core/converter.js +74 -0
- package/dist/core/converter.js.map +1 -0
- package/dist/core/extractor.d.ts +28 -0
- package/dist/core/extractor.d.ts.map +1 -0
- package/dist/core/extractor.js +151 -0
- package/dist/core/extractor.js.map +1 -0
- package/dist/core/fetcher.d.ts +24 -0
- package/dist/core/fetcher.d.ts.map +1 -0
- package/dist/core/fetcher.js +111 -0
- package/dist/core/fetcher.js.map +1 -0
- package/dist/core/processor.d.ts +22 -0
- package/dist/core/processor.d.ts.map +1 -0
- package/dist/core/processor.js +104 -0
- package/dist/core/processor.js.map +1 -0
- package/dist/core/screenshotter.d.ts +31 -0
- package/dist/core/screenshotter.d.ts.map +1 -0
- package/dist/core/screenshotter.js +222 -0
- package/dist/core/screenshotter.js.map +1 -0
- package/dist/index.d.ts +3 -0
- package/dist/index.d.ts.map +1 -0
- package/dist/index.js +14 -0
- package/dist/index.js.map +1 -0
- package/dist/screen-cli.d.ts +26 -0
- package/dist/screen-cli.d.ts.map +1 -0
- package/dist/screen-cli.js +196 -0
- package/dist/screen-cli.js.map +1 -0
- package/dist/screen.d.ts +3 -0
- package/dist/screen.d.ts.map +1 -0
- package/dist/screen.js +14 -0
- package/dist/screen.js.map +1 -0
- package/dist/types/index.d.ts +151 -0
- package/dist/types/index.d.ts.map +1 -0
- package/dist/types/index.js +42 -0
- package/dist/types/index.js.map +1 -0
- package/dist/utils/filename-sanitizer.d.ts +38 -0
- package/dist/utils/filename-sanitizer.d.ts.map +1 -0
- package/dist/utils/filename-sanitizer.js +79 -0
- package/dist/utils/filename-sanitizer.js.map +1 -0
- package/dist/utils/frontmatter.d.ts +6 -0
- package/dist/utils/frontmatter.d.ts.map +1 -0
- package/dist/utils/frontmatter.js +65 -0
- package/dist/utils/frontmatter.js.map +1 -0
- package/package.json +56 -0
- package/skills/md-fetch/SKILL.md +133 -0
- package/skills/md-fetch/references/cli-reference.md +257 -0
- package/src/cli.ts +169 -0
- package/src/constants.ts +17 -0
- package/src/core/browser.ts +161 -0
- package/src/core/converter.ts +82 -0
- package/src/core/extractor.ts +172 -0
- package/src/core/fetcher.ts +143 -0
- package/src/core/processor.ts +124 -0
- package/src/core/screenshotter.ts +289 -0
- package/src/index.ts +15 -0
- package/src/screen-cli.ts +216 -0
- package/src/screen.ts +15 -0
- package/src/types/index.ts +227 -0
- package/src/utils/filename-sanitizer.ts +88 -0
- package/src/utils/frontmatter.ts +81 -0
- package/tsconfig.json +20 -0
package/AGENTS.md
ADDED
@@ -0,0 +1,212 @@

# md-fetch - Web Content Processing CLI Tools

## Project Overview
A set of Node.js-based CLI tools, comprising:
1. **md-fetch** - converts URL content into clean Markdown
2. **md-fetch-screen** - takes high-quality screenshots of web pages

## Tech Stack
- **Language**: TypeScript (ES modules)
- **Runtime**: Node.js ≥18
- **Package manager**: pnpm
- **Core dependencies**:
  - `commander`: CLI argument parsing
  - `@mozilla/readability`: content extraction
  - `turndown`: HTML to Markdown conversion
  - `puppeteer-core`: headless browser (does not bundle Chrome)
  - `jsdom`: DOM parsing
  - `undici`: proxy support

## Architecture

### md-fetch Core Processing Flow
```
URL input → Processor
    ↓
fetch/browser (get HTML)
    ↓
Extractor (extract content + metadata)
    ↓
Converter (convert to Markdown)
    ↓
Generate Frontmatter (generate YAML)
    ↓
Output (stdout/file)
```

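The flow above can be sketched as a small orchestration function. This is only an illustration under stated assumptions: the `Stages` interface, the `Extracted` type, and `processUrl` are hypothetical names, not the actual exports of `src/core/processor.ts`. The stages are injected so the sketch stays self-contained, and the `try/catch` mirrors the documented fallback to raw HTML when extraction fails.

```typescript
// Hypothetical stage signatures; the real modules in src/core/ may differ.
interface Extracted {
  html: string;                      // main content as HTML
  metadata: Record<string, string>;  // e.g. title, url
}

interface Stages {
  fetchHtml: (url: string) => Promise<string>;        // native fetch or browser
  extract: (html: string, url: string) => Extracted;  // may throw
  convert: (html: string) => string;                  // HTML -> Markdown
  frontmatter: (meta: Record<string, string>) => string;
}

async function processUrl(url: string, stages: Stages): Promise<string> {
  const html = await stages.fetchHtml(url);

  // Assumed fallback: if extraction fails, keep the raw HTML.
  let extracted: Extracted;
  try {
    extracted = stages.extract(html, url);
  } catch {
    extracted = { html, metadata: { url } };
  }

  const markdown = stages.convert(extracted.html);
  // Frontmatter first, then a blank line, then the Markdown body.
  return `${stages.frontmatter(extracted.metadata)}\n\n${markdown}\n`;
}
```

Passing the stages in as functions keeps each step replaceable, for example swapping the fetch stage for a browser-backed one without touching the rest of the flow.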
### md-fetch-screen Screenshot Flow
```
URL input → Screenshotter
    ↓
Browser.launch (launch browser)
    ↓
Page.setViewport (set viewport + pixel ratio)
    ↓
Page.goto (navigate to URL)
    ↓
Hide elements (optional)
    ↓
Delay (optional)
    ↓
Screenshot (full page/viewport/element)
    ↓
Save file (automatic naming)
```

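A minimal sketch of the screenshot flow follows. `PageLike` and `ElementLike` are hypothetical structural interfaces that loosely mirror the Puppeteer `Page` methods named in the diagram (`setViewport`, `goto`, `addStyleTag`, `$`, `screenshot`), so the sketch can run against a stub. Hiding elements by injecting a `display: none` rule is an assumption about how the option could work, not a description of `src/core/screenshotter.ts`.

```typescript
// Hypothetical minimal view of a Puppeteer-style page.
interface ElementLike {
  screenshot(opts: { path: string }): Promise<unknown>;
}
interface PageLike {
  setViewport(v: { width: number; height: number; deviceScaleFactor: number }): Promise<unknown>;
  goto(url: string, opts: { waitUntil: string; timeout: number }): Promise<unknown>;
  addStyleTag(opts: { content: string }): Promise<unknown>;
  $(selector: string): Promise<ElementLike | null>;
  screenshot(opts: { path: string; fullPage: boolean }): Promise<unknown>;
}

interface ShotOptions {
  width: number;
  height: number;
  scale: number;
  fullPage: boolean;
  waitUntil: string;
  timeout: number;
  path: string;
  hide?: string;      // comma-separated CSS selectors
  delayMs?: number;   // extra wait before capture
  selector?: string;  // capture only this element
}

async function takeScreenshot(page: PageLike, url: string, o: ShotOptions): Promise<void> {
  await page.setViewport({ width: o.width, height: o.height, deviceScaleFactor: o.scale });
  await page.goto(url, { waitUntil: o.waitUntil, timeout: o.timeout });

  // A comma-separated selector list is itself a valid CSS selector list.
  if (o.hide) {
    await page.addStyleTag({ content: `${o.hide} { display: none !important; }` });
  }

  const delayMs = o.delayMs ?? 0;
  if (delayMs > 0) {
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }

  if (o.selector) {
    const element = await page.$(o.selector);
    if (!element) throw new Error(`No element matches ${o.selector}`);
    await element.screenshot({ path: o.path });
  } else {
    await page.screenshot({ path: o.path, fullPage: o.fullPage });
  }
}
```

Launching the browser and choosing the output filename are left out so that the ordering of the per-page steps is easy to see.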
### Key Modules
- **fetcher** (`src/core/fetcher.ts`): performs HTTP requests using native fetch
- **browser** (`src/core/browser.ts`): Puppeteer integration, used for SPA pages
- **extractor** (`src/core/extractor.ts`): extracts main content and metadata using readability
- **converter** (`src/core/converter.ts`): converts HTML to Markdown using turndown
- **processor** (`src/core/processor.ts`): orchestrates the Markdown conversion flow
- **screenshotter** (`src/core/screenshotter.ts`): core screenshot class; manages the browser and screenshot logic
- **filename-sanitizer** (`src/utils/filename-sanitizer.ts`): URL sanitization and timestamp generation
- **cli** (`src/cli.ts`): md-fetch CLI interface and argument parsing
- **screen-cli** (`src/screen-cli.ts`): md-fetch-screen CLI interface and argument parsing

## Development Commands
```bash
# Install dependencies
pnpm install

# Run in development mode
pnpm dev <url>

# Build
pnpm build

# Test
pnpm test
```

## Project Structure
```
md-fetch/
├── src/
│   ├── index.ts                  # md-fetch CLI entry point
│   ├── cli.ts                    # md-fetch CLI argument parsing
│   ├── screen.ts                 # md-fetch-screen CLI entry point
│   ├── screen-cli.ts             # md-fetch-screen CLI argument parsing
│   ├── constants.ts              # Constant definitions
│   ├── core/
│   │   ├── fetcher.ts            # HTTP fetch logic
│   │   ├── browser.ts            # Puppeteer browser management
│   │   ├── extractor.ts          # Content extraction
│   │   ├── converter.ts          # HTML to Markdown conversion
│   │   ├── processor.ts          # Main Markdown processing orchestrator
│   │   └── screenshotter.ts      # Core screenshot class
│   ├── utils/
│   │   ├── frontmatter.ts        # YAML frontmatter generation
│   │   └── filename-sanitizer.ts # Filename sanitization
│   └── types/
│       └── index.ts              # TypeScript type definitions
├── dist/                         # Build output
│   ├── index.js                  # md-fetch executable
│   └── screen.js                 # md-fetch-screen executable
└── package.json                  # Package config (defines both executables)
```

## Design Decisions
1. **puppeteer-core**: does not bundle a browser, which reduces package size; users must install Chrome themselves
2. **ES modules**: uses the modern Node.js module system
3. **TypeScript strict mode**: ensures type safety
4. **Minimal dependencies**: keeps only the dependencies required for core functionality

## CLI Usage Examples

### md-fetch (Markdown conversion)
```bash
# Basic usage
md-fetch https://example.com

# Save to a file
md-fetch https://example.com -o article.md

# Browser mode (for SPAs)
md-fetch https://react-app.com -b

# Custom selector
md-fetch https://example.com -s "article.main-content"

# Multiple URLs
md-fetch url1.com url2.com url3.com

# Custom headers
md-fetch https://example.com -H "Authorization: Bearer token"

# Verbose logging
md-fetch https://example.com --verbose
```

### md-fetch-screen (web page screenshots)
```bash
# Basic screenshot (full page, standard resolution)
md-fetch-screen https://example.com

# Viewport screenshot with a custom size
md-fetch-screen https://example.com --viewport -W 1440 -H 900

# High-DPI screenshot (2x pixel ratio)
md-fetch-screen https://example.com --scale 2

# Capture a specific element
md-fetch-screen https://example.com --selector "#main-content"

# Hide ads and popups
md-fetch-screen https://example.com --hide ".ad,.popup"

# JPEG format with a specified output directory
md-fetch-screen https://example.com --format jpeg --quality 85 --output ./screenshots

# Wait for the page to load, then delay before the screenshot
md-fetch-screen https://example.com --wait-until networkidle0 --delay 2000

# Batch screenshots
md-fetch-screen https://site1.com https://site2.com https://site3.com

# Verbose logging
md-fetch-screen https://example.com --verbose
```

## Error Handling
- **Network errors**: retried automatically 3 times with exponential backoff
- **Browser errors**: clear Chrome installation hints are shown
- **Extraction errors**: falls back to the raw HTML if readability fails
- **Batch processing**: a single failure does not affect the other URLs; a summary is reported at the end

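Two of these behaviours, retry with exponential backoff and failure-tolerant batch processing, can be sketched as follows. `withRetry`, `processAll`, and the 1000 ms base delay are hypothetical choices for illustration. The sketch also assumes that "3 times" means three retries after the initial attempt and that URLs are handled one after another; the package may do either differently.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Run `task`; on failure wait baseDelayMs, then 2x, then 4x, ... before retrying.
async function withRetry<T>(
  task: () => Promise<T>,
  retries = 3,
  baseDelayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < retries) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}

interface BatchSummary {
  succeeded: string[];
  failed: { url: string; reason: string }[];
}

// Process every URL; a failure is recorded and the loop continues.
async function processAll(
  urls: string[],
  handler: (url: string) => Promise<unknown>,
): Promise<BatchSummary> {
  const summary: BatchSummary = { succeeded: [], failed: [] };
  for (const url of urls) {
    try {
      await handler(url);
      summary.succeeded.push(url);
    } catch (err) {
      summary.failed.push({ url, reason: String(err) });
    }
  }
  return summary;
}
```

The two helpers compose naturally: `processAll(urls, (u) => withRetry(() => fetchOne(u)))`, where `fetchOne` stands for whatever per-URL work is being done.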
## Current Status

### md-fetch Implemented Features
- ✅ Fetch web content via HTTP fetch
- ✅ Readability content extraction
- ✅ HTML to Markdown conversion
- ✅ Automatic YAML frontmatter generation
- ✅ Output to stdout or a file
- ✅ Custom selector extraction
- ✅ Option to disable readability
- ✅ Custom HTTP headers
- ✅ Timeout configuration
- ✅ Verbose logging mode
- ✅ Multiple URL processing
- ✅ Automatic retry (3 times, exponential backoff)
- ✅ Browser mode (Puppeteer headless browser)
- ✅ Proxy support (environment variables HTTP_PROXY/HTTPS_PROXY/NO_PROXY)

### md-fetch-screen Implemented Features
- ✅ Full-page screenshot mode
- ✅ Viewport screenshot mode
- ✅ Custom viewport size (width/height)
- ✅ Device pixel ratio (scale) for high-DPI screenshots
- ✅ Multiple image formats (PNG/JPEG/WebP)
- ✅ Quality control (JPEG/WebP)
- ✅ Capture a specific element (CSS selector)
- ✅ Hide elements
- ✅ Delay before screenshot
- ✅ Automatic file naming (URL + timestamp)
- ✅ Batch screenshots
- ✅ Proxy support
- ✅ Verbose logging mode
- ✅ Argument validation and error handling

### Planned Features
- ⏳ Batch-read URLs from a file
package/LICENSE
ADDED
@@ -0,0 +1,21 @@

MIT License

Copyright (c) 2025

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md
ADDED
@@ -0,0 +1,449 @@

# md-fetch

[中文文档](./README.zh-CN.md)

A suite of CLI tools for web content processing:
- **md-fetch** - Convert web pages to clean Markdown format
- **md-fetch-screen** - Take high-quality screenshots of web pages

## Authors

Built by **Claude Code** & **Claude Sonnet**

## Table of Contents

- [md-fetch - Markdown Converter](#md-fetch---markdown-converter)
  - [Features](#features)
  - [Installation](#installation)
  - [Usage](#usage)
  - [CLI Options](#cli-options)
- [md-fetch-screen - Screenshot Tool](#md-fetch-screen---screenshot-tool)
  - [Features](#screenshot-features)
  - [Usage](#screenshot-usage)
  - [CLI Options](#screenshot-cli-options)
- [Tech Stack](#tech-stack)
- [Development](#development)

---

# md-fetch - Markdown Converter

## Features

- 🚀 Fetch web content using native fetch API
- 🌐 Headless browser mode (Puppeteer) for SPA pages
- 📄 Extract main content using Mozilla Readability
- ✨ Convert HTML to Markdown using Turndown
- 📋 **Auto-generate YAML frontmatter** (includes title, URL, author, publish date, and more metadata)
- 🎯 Custom CSS selector support for content extraction
- 🔒 Proxy support (HTTP_PROXY/HTTPS_PROXY environment variables)
- ⚙️ Configurable timeout, headers, and other options
- 🔄 Auto-retry (3 times with exponential backoff)
- 📦 Minimal dependencies

## Installation

### Development Setup

```bash
# Clone the repository (if you haven't already)
git clone <repo-url>
cd md-fetch

# Install dependencies
pnpm install
```

### Global Installation

**Using pnpm:**

```bash
# 1. Build the project
pnpm build

# 2. Setup pnpm (first time only)
pnpm setup

# 3. Link globally (recommended for development)
pnpm link --global

# 4. Now you can use md-fetch anywhere
md-fetch https://example.com
```

**Using npm:**

```bash
# 1. Build the project
pnpm build

# 2. Link globally
npm link

# 3. Now you can use md-fetch anywhere
md-fetch https://example.com
```

### Rebuild After Code Changes

```bash
# 1. Rebuild
pnpm build

# 2. No need to re-link, changes take effect automatically
md-fetch https://example.com
```

### Uninstall

**Using pnpm:**

```bash
# Unlink globally
pnpm unlink --global

# Optional: Clean up unused packages in pnpm global store
pnpm store prune
```

**Using npm:**

```bash
# Unlink globally
npm unlink -g md-fetch
```

**Remove project:**

```bash
# Simply delete the project directory
cd ..
rm -rf md-fetch  # Or use rmdir /s md-fetch on Windows
```

## Usage

### Development Mode

```bash
# Basic usage - output to stdout
pnpm dev -- https://example.com

# Save to file
pnpm dev -- https://example.com -o output.md

# Browser mode (for SPA pages)
pnpm dev -- -b https://react-app.example.com

# Disable readability, keep full HTML content
pnpm dev -- https://example.com -R
# Or use the full option name
pnpm dev -- https://example.com --no-readability

# Use custom CSS selector
pnpm dev -- https://example.com -s "article.main-content"

# Process multiple URLs
pnpm dev -- https://example.com https://httpbin.org/html

# Custom HTTP headers
pnpm dev -- https://example.com -H "Authorization: Bearer token"

# Use proxy
pnpm dev -- https://example.com --proxy http://proxy.example.com:8080

# Verbose logging
pnpm dev -- https://example.com --verbose

# View all options
pnpm dev -- --help
```

### Production Usage (After Global Installation)

```bash
# Basic usage
md-fetch https://example.com

# Save to file
md-fetch https://example.com -o article.md

# Browser mode
md-fetch -b https://react-app.example.com

# Use proxy (from environment variable)
export HTTPS_PROXY=http://proxy.example.com:8080
md-fetch https://example.com
```

## Output Example

md-fetch automatically adds YAML frontmatter at the beginning of the Markdown file with page metadata:

```markdown
---
title: "Example Domain"
url: https://example.com
description: "Example Domain description"
author: "John Doe"
siteName: "Example"
publishedTime: 2024-01-01T00:00:00Z
modifiedTime: 2024-01-15T10:30:00Z
keywords:
  - example
  - demo
  - test
image: https://example.com/og-image.jpg
lang: en
---

# Example Domain

This domain is for use in illustrative examples...
```

### Frontmatter Fields

- `title` - Page title (extracted from Readability, Open Graph, Twitter Cards, or `<title>` tag)
- `url` - Original URL
- `description` - Page description or excerpt
- `author` - Author information
- `siteName` - Site name
- `publishedTime` - Published date (ISO 8601 format)
- `modifiedTime` - Last modified date (ISO 8601 format)
- `keywords` - Keywords array
- `image` - Main page image (Open Graph or Twitter Cards)
- `lang` - Page language code

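As an illustration of how such a block could be produced from the fields listed above, here is a minimal sketch. `PageMetadata` and `generateFrontmatter` are hypothetical names, and the quoting rule (free-text fields double-quoted, URLs, dates and language codes left bare) is inferred from the sample output rather than taken from `src/utils/frontmatter.ts`.

```typescript
// Hypothetical metadata shape, following the field list above.
interface PageMetadata {
  title?: string;
  url: string;
  description?: string;
  author?: string;
  siteName?: string;
  publishedTime?: string;
  modifiedTime?: string;
  keywords?: string[];
  image?: string;
  lang?: string;
}

// Free-text fields are quoted in the sample output; other scalars are bare.
const QUOTED_FIELDS = new Set(["title", "description", "author", "siteName"]);

function generateFrontmatter(meta: PageMetadata): string {
  const lines: string[] = ["---"];
  for (const [key, value] of Object.entries(meta)) {
    if (value === undefined || value === "") continue; // skip missing fields
    if (Array.isArray(value)) {
      if (value.length === 0) continue;
      lines.push(`${key}:`);
      for (const item of value) lines.push(`  - ${item}`);
    } else if (QUOTED_FIELDS.has(key)) {
      // JSON string syntax is also valid as a YAML double-quoted scalar.
      lines.push(`${key}: ${JSON.stringify(value)}`);
    } else {
      lines.push(`${key}: ${value}`);
    }
  }
  lines.push("---");
  return lines.join("\n");
}
```

Fields are emitted in the order they appear on the input object, so the caller controls the layout of the block.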
## CLI Options

```
Usage: md-fetch <urls...> [options]

Arguments:
  urls                       URLs to convert to Markdown

Options:
  -V, --version              output the version number
  -o, --output <file>        Output to file instead of stdout
  -b, --browser              Use headless browser mode (for SPA pages)
  --browser-path <path>      Custom Chrome/Chromium executable path
  -R, --no-readability       Disable readability, keep full HTML content
  -s, --selector <selector>  Custom CSS selector to extract content
  -H, --header <header>      Custom HTTP header (can be repeated)
  --proxy <url>              Proxy server URL (also reads HTTP_PROXY/HTTPS_PROXY env vars)
  -t, --timeout <ms>         Request timeout in milliseconds (default: 30000)
  --user-agent <string>      Custom user agent (default: "md-fetch/1.0.0")
  --wait-until <event>       Browser wait condition (load|domcontentloaded|networkidle0|networkidle2)
  --verbose                  Enable verbose logging
  -h, --help                 display help for command
```

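Because `-H, --header` can be repeated, the raw `"Name: value"` strings have to be turned into a header map at some point. The helper below is a hypothetical sketch of that step (`parseHeaders` is not a name taken from the package); it splits on the first colon only, so values that contain colons survive intact.

```typescript
// Turn repeated "Name: value" strings into a header map.
function parseHeaders(rawHeaders: string[]): Record<string, string> {
  const headers: Record<string, string> = {};
  for (const entry of rawHeaders) {
    const separator = entry.indexOf(":");
    if (separator <= 0) {
      throw new Error(`Invalid header (expected "Name: value"): ${entry}`);
    }
    const headerName = entry.slice(0, separator).trim();
    const headerValue = entry.slice(separator + 1).trim();
    headers[headerName] = headerValue; // a later duplicate overrides an earlier one
  }
  return headers;
}
```

For example, `parseHeaders(["Authorization: Bearer token"])` yields an object whose `Authorization` property is `"Bearer token"`.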
## Tech Stack

- **TypeScript** - Type safety
- **Node.js ≥18** - Native fetch API
- **ES Modules** - Modern JavaScript
- **Commander** - CLI argument parsing
- **Mozilla Readability** - Smart content extraction
- **Turndown** - HTML to Markdown conversion
- **JSDOM** - DOM parsing
- **Puppeteer-core** - Headless browser support
- **Undici** - Proxy support

## Development

```bash
# Install dependencies
pnpm install

# Development mode
pnpm dev -- <url>

# Build
pnpm build

# Run tests
pnpm test
```

## How It Works

1. **Fetch** - Retrieve the HTML using native fetch or a Puppeteer headless browser
2. **Extract** - Extract the main content using Readability or a custom selector, along with the page metadata
3. **Convert** - Convert the extracted HTML to Markdown using Turndown
4. **Generate Frontmatter** - Generate YAML frontmatter from the extracted metadata
5. **Output** - Write the frontmatter and Markdown content to stdout or save them to a file

## Proxy Support

md-fetch automatically reads proxy configuration from environment variables:

```bash
# Set proxy
export HTTP_PROXY=http://proxy.example.com:8080
export HTTPS_PROXY=http://proxy.example.com:8080

# Exclude certain domains
export NO_PROXY=localhost,127.0.0.1,.example.com

# Or via command line argument
md-fetch https://example.com --proxy http://proxy.example.com:8080
```

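The selection logic implied by these variables can be sketched as a pure function. Everything here is an assumption for illustration: `resolveProxy` is a hypothetical name, the sketch lets an explicit `--proxy` value win over the environment, and it matches `NO_PROXY` entries by exact host or domain suffix. The package's actual precedence and matching rules may differ.

```typescript
type ProxyEnv = Record<string, string | undefined>;

// Decide which proxy URL (if any) applies to a request.
function resolveProxy(
  targetUrl: string,
  env: ProxyEnv,
  cliProxy?: string,
): string | undefined {
  if (cliProxy) return cliProxy; // assumed: the explicit flag wins

  const { protocol, hostname } = new URL(targetUrl);
  const host = hostname.toLowerCase();

  const rules = (env.NO_PROXY ?? env.no_proxy ?? "")
    .split(",")
    .map((rule) => rule.trim().toLowerCase())
    .filter((rule) => rule.length > 0);

  const bypass = rules.some((rule) => {
    // ".example.com" and "example.com" both cover the domain and its subdomains.
    const bare = rule.startsWith(".") ? rule.slice(1) : rule;
    return rule === "*" || host === bare || host.endsWith(`.${bare}`);
  });
  if (bypass) return undefined;

  return protocol === "https:"
    ? (env.HTTPS_PROXY ?? env.https_proxy)
    : (env.HTTP_PROXY ?? env.http_proxy);
}
```

In a real CLI the second argument would typically be `process.env`; taking it as a parameter keeps the function easy to test.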
---

# md-fetch-screen - Screenshot Tool

## Screenshot Features

- 📸 Take high-quality screenshots of web pages
- 🖥️ Full-page or viewport-only screenshot modes
- 📐 Customizable viewport size (width/height)
- ✨ Device scale factor support for high-DPI screenshots (Retina displays)
- 🎨 Multiple image formats (PNG, JPEG, WebP)
- 🎯 Screenshot specific elements using CSS selectors
- 🙈 Hide unwanted elements (ads, popups, etc.)
- ⏱️ Configurable delay before screenshot
- 🔒 Proxy support
- 🌐 Headless browser mode using Puppeteer
- 📁 Automatic filename generation from URL and timestamp
- 🔄 Batch screenshot multiple URLs

## Screenshot Usage

### Basic Usage

```bash
# Basic screenshot (full page, standard resolution)
md-fetch-screen https://example.com

# Viewport-only screenshot with custom size
md-fetch-screen https://example.com --viewport -W 1440 -H 900

# High-DPI screenshot (2x scale for Retina displays)
md-fetch-screen https://example.com --scale 2

# Screenshot with verbose logging
md-fetch-screen https://example.com --verbose
```

### Advanced Usage

```bash
# Screenshot specific element
md-fetch-screen https://example.com --selector "#main-content"

# Hide ads and popups
md-fetch-screen https://example.com --hide ".ad,.popup,.cookie-banner"

# JPEG format with custom quality
md-fetch-screen https://example.com --format jpeg --quality 85

# Save to specific directory
md-fetch-screen https://example.com --output ./screenshots

# Wait for page to load, then delay 2 seconds
md-fetch-screen https://example.com --wait-until networkidle0 --delay 2000

# Batch screenshot multiple URLs
md-fetch-screen https://site1.com https://site2.com https://site3.com
```

### Understanding Width, Height, and Scale

**Full-Page Mode (default):**
- Width/Height control the browser viewport size
- The screenshot captures the entire page content
- Final image dimensions depend on actual page height

```bash
# Full page with 1920px viewport width
md-fetch-screen https://example.com -W 1920 -H 1080
```

**Viewport Mode:**
- Width/Height directly control the screenshot size
- Only captures what's visible in the viewport

```bash
# Exactly 1440x900 screenshot
md-fetch-screen https://example.com --viewport -W 1440 -H 900
```

**Scale Factor (Device Pixel Ratio):**
- `--scale 1` (default): Standard resolution
  - Viewport 1920x1080 → Image 1920x1080 pixels
- `--scale 2`: High-DPI (Retina)
  - Viewport 1920x1080 → Image 3840x2160 pixels
- `--scale 3`: Ultra high-DPI
  - Viewport 1920x1080 → Image 5760x3240 pixels

```bash
# High-quality Retina screenshot
md-fetch-screen https://example.com --scale 2

# Viewport mode with 2x scale = 2880x1800 final image
md-fetch-screen https://example.com --viewport -W 1440 -H 900 --scale 2
```

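The arithmetic behind these figures is a single multiplication per axis. The helper below is a hypothetical sketch (`imagePixels` and `Dimensions` are illustrative names) that reproduces the numbers quoted above.

```typescript
interface Dimensions {
  width: number;
  height: number;
}

// CSS-pixel size times the device scale factor gives the image size in pixels.
// In viewport mode the CSS size is the viewport itself; in full-page mode the
// height is the rendered page height, which is only known after the page loads.
function imagePixels(cssSize: Dimensions, scale: number): Dimensions {
  return {
    width: Math.round(cssSize.width * scale),
    height: Math.round(cssSize.height * scale),
  };
}

const retina = imagePixels({ width: 1440, height: 900 }, 2);
console.log(`${retina.width}x${retina.height}`); // 2880x1800
```

Since both axes scale, a `--scale 2` capture holds four times as many pixels as the same capture at `--scale 1`.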
## Screenshot CLI Options

```
Usage: md-fetch-screen [options] <urls...>

Arguments:
  urls                   URLs to screenshot

Options:
  -V, --version          output the version number

Viewport & Size:
  -f, --full-page        Full page screenshot (default)
  --viewport             Viewport-only screenshot
  -W, --width <pixels>   Viewport width in pixels (default: 1920)
  -H, --height <pixels>  Viewport height in pixels (default: 1080)
  --scale <number>       Device scale factor for high-DPI (1/2/3, default: 1)

Output:
  --output <dir>         Output directory (default: ".")
  --format <type>        Image format: png|jpeg|webp (default: "png")
  --quality <number>     JPEG/WebP quality 0-100 (default: 90)

Browser:
  --browser-path <path>  Custom Chrome/Chromium executable path
  --wait-until <event>   Wait condition: load|domcontentloaded|networkidle0|networkidle2
  --timeout <ms>         Timeout in milliseconds (default: 30000)
  --user-agent <string>  Custom user agent
  --proxy <url>          Proxy server URL

Content:
  --delay <ms>           Delay before screenshot in ms (default: 0)
  --selector <css>       CSS selector to screenshot specific element
  --hide <selectors>     CSS selectors to hide (comma-separated)

Other:
  --verbose              Enable verbose logging
  -h, --help             display help for command
```

### Filename Format

Screenshots are automatically named using the following format:
```
<domain_path_50chars>_<timestamp>.png
```

Examples:
- `example.com_20251229153045.png`
- `github.com_user_repo_issues_123_20251229153045.png`

The filename includes:
- Domain and path (up to 50 characters, sanitized for filesystem safety)
- Timestamp in format: `YYYYMMDDHHmmss`
- File extension based on format

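A naming scheme of this shape could be implemented roughly as follows. This is a sketch under assumptions, not the code in `src/utils/filename-sanitizer.ts`: `screenshotFilename` and `formatTimestamp` are hypothetical names, the sanitization rule (anything other than letters, digits, dots and hyphens becomes `_`) is chosen to reproduce the two examples above, the timestamp uses local time, and the format string is used directly as the extension.

```typescript
// Format a date as YYYYMMDDHHmmss using local time (assumed).
function formatTimestamp(d: Date): string {
  const two = (n: number): string => String(n).padStart(2, "0");
  return (
    `${d.getFullYear()}${two(d.getMonth() + 1)}${two(d.getDate())}` +
    `${two(d.getHours())}${two(d.getMinutes())}${two(d.getSeconds())}`
  );
}

function screenshotFilename(url: string, when: Date, format = "png"): string {
  const { hostname, pathname } = new URL(url);
  const base = `${hostname}${pathname}`
    .replace(/[^A-Za-z0-9.-]+/g, "_") // replace unsafe runs with a single "_"
    .slice(0, 50)                     // cap the domain-and-path part at 50 chars
    .replace(/^_+|_+$/g, "");         // drop leading/trailing underscores
  return `${base}_${formatTimestamp(when)}.${format}`;
}

const shotName = screenshotFilename(
  "https://github.com/user/repo/issues/123",
  new Date(2025, 11, 29, 15, 30, 45), // 29 Dec 2025, 15:30:45 local time
);
console.log(shotName); // github.com_user_repo_issues_123_20251229153045.png
```

Query strings and fragments are ignored here because only `hostname` and `pathname` are read; whether the real tool includes them is not stated in the README.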
## License

MIT