@ariesfish/feedloom 0.1.4 → 0.2.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.en.md +151 -0
- package/README.md +81 -225
- package/dist/cli.js +26 -5
- package/dist/site-rules/x.toml +15 -0
- package/package.json +1 -1
- package/skills/feedloom/references/site-rules.md +1 -1
package/README.en.md
ADDED
@@ -0,0 +1,151 @@
+ <div align="center">
+ <img src="assets/logo.png" alt="Feedloom logo" width="160">
+ <h1>Feedloom</h1>
+ <p><strong>Archive long-form web content as clean Markdown with local assets.</strong></p>
+ <p>
+ <a href="https://www.npmjs.com/package/@ariesfish/feedloom"><img alt="npm version" src="https://img.shields.io/npm/v/@ariesfish/feedloom"></a>
+ <img alt="Node 24 or newer" src="https://img.shields.io/badge/node-24%2B-339933">
+ <img alt="MIT license" src="https://img.shields.io/badge/license-MIT-blue">
+ </p>
+ </div>
+
+ Feedloom is a CLI for saving long-form web content as clean Markdown. It accepts article URLs, URL list files, and RSS/Atom feeds, extracts readable content, downloads page images, and writes portable Markdown notes with YAML frontmatter.
+
+ ## Features
+
+ - Save articles as Markdown with local image assets.
+ - Read URLs directly, from text/Markdown files, or from RSS/Atom feeds.
+ - Deduplicate URL lists and mark completed Markdown checklist items.
+ - Use static, browser-rendered, or stealth fetching when pages need JavaScript rendering.
+ - Apply built-in site rules for common sites such as WeChat and Zhihu.
+ - Optionally use local Chrome login state for pages that require your own authenticated browser session.
+
+ ## Requirements
+
+ - Node.js >= 24
+ - npm
+ - Patchright Chromium for browser-based fetching. `doctor` can install it automatically.
+
+ ## Install or run
+
+ Run directly with `npx`:
+
+ ```bash
+ npx -y @ariesfish/feedloom "https://example.com/article"
+ ```
+
+ Or install globally:
+
+ ```bash
+ npm install -g @ariesfish/feedloom
+ feedloom "https://example.com/article"
+ ```
+
+ Check and repair the browser runtime:
+
+ ```bash
+ npx -y @ariesfish/feedloom doctor
+ ```
+
+ If the Patchright Chromium executable is missing, `doctor` runs `npx patchright install chromium` automatically.
+
+ ## Quick start
+
+ Archive one article to `clippings/`:
+
+ ```bash
+ npx -y @ariesfish/feedloom "https://example.com/article"
+ ```
+
+ Write output somewhere else:
+
+ ```bash
+ npx -y @ariesfish/feedloom --output-dir ./outputs "https://example.com/article"
+ ```
+
+ Archive a URL list:
+
+ ```bash
+ npx -y @ariesfish/feedloom urls.md --limit 10
+ ```
+
+ Archive an RSS/Atom feed:
+
+ ```bash
+ npx -y @ariesfish/feedloom "https://example.com/feed.xml" --source-kind rss-feed --since 2026-01-01
+ ```
+
+ Use browser rendering for JavaScript-heavy pages:
+
+ ```bash
+ npx -y @ariesfish/feedloom "https://example.com/article" --fetch-mode browser --wait-ms 4000 --scroll-to-bottom
+ ```
+
+ Use stealth mode only when normal static/browser fetching is insufficient:
+
+ ```bash
+ npx -y @ariesfish/feedloom "https://example.com/article" --fetch-mode stealth --solve-cloudflare
+ ```
+
+ ## Output
+
+ Generated notes are written to `clippings/` by default:
+
+ ```markdown
+ ---
+ source: "https://example.com/article"
+ author: "Author Name"
+ created: "2026-04-29"
+ ---
+
+ # Article Title
+
+ Article content...
+ ```
+
+ Images are downloaded into an `assets/` subdirectory under the output directory and rewritten as local Markdown references.
+
+ ## Fetch modes
+
+ | Mode | Use when |
+ | --- | --- |
+ | `auto` | Default. Try static fetch first, then browser/stealth fallback when content is insufficient. |
+ | `static` | The page is server-rendered and does not require JavaScript. |
+ | `browser` | The page needs JavaScript rendering, waiting, clicking, or scrolling. |
+ | `stealth` | Browser mode fails because the site has stronger bot detection. |
+
+ ## Agent Skill
+
+ Feedloom ships an Agent Skill in `skills/feedloom`, so agents that support the `skills` CLI can install the clipping workflow directly:
+
+ ```bash
+ npx skills add @ariesfish/feedloom --skill feedloom
+ ```
+
+ For a global install across supported agents:
+
+ ```bash
+ npx skills add @ariesfish/feedloom --skill feedloom --global
+ ```
+
+ After installing the skill, ask your agent to save article URLs, URL lists, or RSS feeds as Markdown. The skill runs the CLI through `npx -y @ariesfish/feedloom` by default.
+
+ ## Site rules
+
+ Feedloom ships built-in TOML site rules for common dynamic or structured sites. You can also keep private rules outside the package and pass them at runtime:
+
+ ```bash
+ npx -y @ariesfish/feedloom "https://example.com/article" --site-rules-dir ./site-rules
+ ```
+
+ ## Acknowledgements
+
+ Feedloom is inspired by:
+
+ - [Defuddle](https://github.com/kepano/defuddle), for readable content extraction ideas.
+ - [Patchright](https://github.com/Kaliiiiiiiiii-Vinyzu/patchright), for browser automation and realistic page access.
+ - [Scrapling](https://github.com/D4Vinci/Scrapling), for resilient scraping fallback ideas.
+
+ ## License
+
+ MIT License
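To make the image localization described in README.en.md concrete, here is a sketch of what a clipped note might contain after asset rewriting, assuming default settings and a note titled "Article Title" (the title, file names, and paths are illustrative; the `assets/<note-slug>/image-NNN.<ext>` pattern follows the rewrite visible in `dist/cli.js` below):

```markdown
<!-- clippings/Article Title.md: the remote <img> URL has been replaced with a local path -->
![Cover image](assets/Article%20Title/image-001.jpg)
```

The downloaded files themselves land under the output directory, e.g. `clippings/assets/Article Title/image-001.jpg`.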
package/README.md
CHANGED
@@ -1,128 +1,117 @@
- # Feedloom
-
  <div align="center">
  <img src="assets/logo.png" alt="Feedloom logo" width="160">
- <
+ <h1>Feedloom</h1>
+ <p><strong>Clip quality content quickly</strong></p>
+ <p><strong>Supports WeChat official accounts, Xiaohongshu, Zhihu, X, YouTube, and many other sites</strong></p>
  <p>
- <a href="https://www.npmjs.com/package/@ariesfish/feedloom"><img alt="npm" src="https://img.shields.io/npm/v/@ariesfish/feedloom"></a>
- <img alt="Node
- <img alt="
+ <a href="https://www.npmjs.com/package/@ariesfish/feedloom"><img alt="npm version" src="https://img.shields.io/npm/v/@ariesfish/feedloom"></a>
+ <img alt="Node 24 or newer" src="https://img.shields.io/badge/node-24%2B-339933">
+ <img alt="MIT license" src="https://img.shields.io/badge/license-MIT-blue">
  </p>
+ <p><a href="README.en.md">English</a></p>
  </div>

- Feedloom
+ Feedloom is an agent-native web clipping tool. Give it an article, a batch of links, or an RSS feed, and it extracts the main content, strips page noise, downloads images, and produces complete Markdown documents that fit a personal knowledge base, Obsidian, or an offline reading folder.
+
+ It fits these scenarios:
+
+ - You find an article worth keeping and do not want to save only a link that may go dead later.
+ - You collect blogs, WeChat official accounts, Zhihu, Xiaohongshu, X, YouTube, and other web content into your own knowledge base.
+ - You want batch clipping instead of copying and pasting articles one by one.
+ - You want local copies of images saved with each article for offline reading and migration.

- ##
+ ## Key capabilities

- -
- -
- -
- -
- -
- -
- - Support static fetch, browser-rendered fetch, and stealth fetch modes.
- - Optionally use a local Chrome profile for pages that require login state.
- - Automatically mark Markdown checklist items as done after successful processing.
+ - Save articles as Markdown with YAML frontmatter.
+ - Automatically download page images and rewrite them as local Markdown image references.
+ - Accept URLs directly, batch URL list files, and RSS feeds.
+ - Support static fetch, browser-rendered fetch, and stealth mode for pages that need JavaScript rendering.
+ - Ship built-in rules for common sites such as WeChat official accounts, Zhihu, Xiaohongshu, X, and YouTube.
+ - Optionally use local login state for pages that require sign-in or have stronger anti-bot protection.

- ##
+ ## Requirements

  - Node.js >= 24
  - npm
- -
+ - Browser-based fetching needs Patchright Chromium; the `doctor` command can check and install it automatically.

- ##
+ ## Run directly

-
+ No install needed, just use `npx`:

  ```bash
-
- cd feedloom
+ npx -y @ariesfish/feedloom "https://example.com/article"
  ```

-
+ Or install globally:

  ```bash
- npm install
+ npm install -g @ariesfish/feedloom
+ feedloom "https://example.com/article"
  ```

-
-
- If you plan to use `browser`, `stealth`, or the browser fallback in `auto` mode, install the Patchright Chromium runtime:
+ Check and repair the browser runtime:

  ```bash
- npx
+ npx -y @ariesfish/feedloom doctor
  ```

-
-
- ```bash
- npm run dev -- doctor
- ```
+ If Patchright Chromium is missing, `doctor` runs `npx patchright install chromium` automatically.

-
+ ## Quick start

-
+ Save a single article:

  ```bash
-
+ npx -y @ariesfish/feedloom "https://example.com/article"
  ```

-
+ Specify an output directory:

  ```bash
-
+ npx -y @ariesfish/feedloom --output-dir ./outputs "https://example.com/article"
  ```

-
+ Save a URL list in batch:

  ```bash
-
+ npx -y @ariesfish/feedloom urls.md --limit 10
  ```

-
+ `urls.md` can be a plain URL list or a Markdown checklist:

- ```
-
-
+ ```markdown
+ - [ ] https://example.com/a
+ - [ ] https://example.com/b
  ```

-
+ After a URL is processed successfully, the matching item is marked as done:

-
-
- ```bash
- npx skills add @ariesfish/feedloom --skill feedloom
+ ```markdown
+ - [x] https://example.com/a
  ```

-
+ Save articles from an RSS feed:

  ```bash
- npx
+ npx -y @ariesfish/feedloom "https://example.com/feed.xml" --source-kind rss-feed --since 2026-01-01
  ```

-
-
- ## Quick Start
-
- Archive a single article to the default `clippings/` directory:
+ Handle pages that need JavaScript rendering:

  ```bash
-
+ npx -y @ariesfish/feedloom "https://example.com/article" --fetch-mode browser --wait-ms 4000 --scroll-to-bottom
  ```

-
+ If the normal modes fail, try `stealth` mode:

  ```bash
-
+ npx -y @ariesfish/feedloom "https://example.com/article" --fetch-mode stealth --solve-cloudflare
  ```

-
-
- ```bash
- node dist/cli.js --output-dir ./outputs "https://example.com/article"
- ```
+ ## What the output looks like

-
+ Feedloom writes to `clippings/` by default. The generated Markdown looks roughly like this:

  ```markdown
  ---
@@ -136,188 +125,55 @@ created: "2026-04-29"
  Article content...
  ```

-
-
- ## Input Methods
-
- ### Pass multiple URLs directly
-
- ```bash
- npm run dev -- \
- "https://example.com/a" \
- "https://example.com/b"
- ```
-
- ### Read URLs from a file
-
- `urls.md` can be a plain URL list or a Markdown checklist:
-
- ```markdown
- - [ ] https://example.com/a
- - [ ] https://example.com/b
- ```
-
- Run:
-
- ```bash
- npm run dev -- --output-dir ./outputs urls.md
- ```
-
- After a URL is processed successfully, the matching checklist item is updated automatically:
-
- ```markdown
- - [x] https://example.com/a
- ```
-
- ### Process RSS/Atom feeds
-
- By default, `--source-kind auto` tries to detect whether the input is a normal HTML page or a feed. You can also specify the source kind explicitly:
+ ## Choosing a fetch mode

-
- npm run dev -- --source-kind rss-feed --since 2026-01-01 "https://example.com/feed.xml"
- ```
-
- Useful slicing options:
-
- ```bash
- npm run dev -- --start 1 --end 10 "https://example.com/feed.xml"
- npm run dev -- --limit 5 "https://example.com/feed.xml"
- ```
-
- ## Fetch Modes
-
- Use `--fetch-mode` to control how pages are fetched:
-
- | Mode | Description |
+ | Mode | Use when |
  | --- | --- |
- | `auto` |
- | `static` |
- | `browser` |
- | `stealth` |
+ | `auto` | Default. Try static fetch first, then fall back to browser/stealth when the content is insufficient. |
+ | `static` | The page is already server-rendered and needs no JavaScript. Fastest. |
+ | `browser` | The page needs JavaScript rendering, waiting for elements, clicking buttons, or scroll-loading. |
+ | `stealth` | Normal browser mode still fails because the site has stronger anti-bot detection. |

-
-
- ```bash
- npm run dev -- --fetch-mode browser "https://example.com/article"
- npm run dev -- --fetch-mode stealth --solve-cloudflare "https://example.com/article"
- ```
+ Start with the default `auto`; only pick `browser` or `stealth` explicitly when the result is incomplete.

- ##
+ ## Custom rules

-
+ Feedloom ships built-in TOML site rules for common dynamic or structured sites. You can also keep your own private rules outside the package and point to them at runtime:

  ```bash
-
+ npx -y @ariesfish/feedloom "https://example.com/article" --site-rules-dir ./site-rules
  ```

-
-
- ```bash
- npm run dev -- --fetch-mode browser --wait-selector "article" "https://example.com/article"
- ```
-
- Click popups or expand buttons before extraction:
-
- ```bash
- npm run dev -- --fetch-mode browser --click-selector "button.accept" --click-selector ".expand" "https://example.com/article"
- ```
-
- Scroll to the bottom before extraction:
-
- ```bash
- npm run dev -- --fetch-mode browser --scroll-to-bottom "https://example.com/article"
- ```
+ Private rules are a good fit for fine-tuning extraction on the sites you use most.

-
-
- ```bash
- npm run dev -- --fetch-mode stealth --proxy "http://127.0.0.1:7890" "https://example.com/article"
- ```
-
- Run with a visible browser window for debugging:
-
- ```bash
- npm run dev -- --fetch-mode browser --headful "https://example.com/article"
- ```
-
- ## Use Local Chrome Login State
-
- For pages that require an authenticated browser session, you can try using your local Chrome profile:
-
- ```bash
- npm run dev -- \
- --prefer-browser-state \
- --chrome-user-data-dir "{CHROME_INSTALL_PATH}" \
- --chrome-profile "Default" \
- --fetch-mode browser \
- "https://example.com/member-only-article"
- ```
-
- Only use this on your own device and accounts. Always respect the target site's terms of service and copyright rules.
-
- ## Common CLI Options
-
- ```text
- --output-dir <dir> Markdown output directory. Default: clippings
- --source-kind <kind> auto, html-page, or rss-feed. Default: auto
- --since <date> Keep only feed entries on or after YYYY-MM-DD
- --limit <n> Process only the first N deduplicated URLs
- --start <n> Start from the Nth deduplicated URL, 1-based
- --end <n> End at the Nth deduplicated URL, 1-based; 0 means no upper bound
- --fetch-mode <mode> auto, static, browser, or stealth. Default: auto
- --wait-ms <ms> Extra browser wait after load. Default: 2500
- --wait-selector <selector> Wait for a CSS selector
- --click-selector <selector...> Click one or more selectors after page load
- --scroll-to-bottom Scroll to the bottom before extraction
- --headful Run with a visible browser window
- --proxy <server> Proxy server for browser/stealth fetch
- --solve-cloudflare In stealth mode, try to handle Cloudflare challenges
- --disable-resources In stealth mode, block images/media/fonts/stylesheets for speed
- --prefer-browser-state Try local Chrome user state first
- --chrome-user-data-dir <path> Chrome User Data directory
- --chrome-profile <name> Chrome profile name. Default: Default
- --site-rules-dir <dir> Optional directory of private TOML site rules
- ```
-
- Run environment checks:
-
- ```bash
- npm run dev -- doctor
- ```
+ ## Agent Skill

-
+ The package ships `skills/feedloom`; agents that support the `skills` CLI can install this web-archiving capability directly:

  ```bash
-
+ npx skills add @ariesfish/feedloom --skill feedloom
  ```

-
+ Install globally for supported agents:

  ```bash
-
- npm run build
- npm run typecheck
- npm test
+ npx skills add @ariesfish/feedloom --skill feedloom --global
  ```

- ##
-
- - Respect robots.txt, website terms of service, copyright, and rate limits.
- - For dynamic pages, try `--fetch-mode browser` first.
- - For static blogs and news sites, `--fetch-mode static` is usually faster.
- - Feedloom ships bundled TOML site rules for common dynamic/structured sites such as WeChat official account articles and Zhihu. Site rules can define extraction, cleanup, and fetch preferences. For example, the bundled Zhihu rule uses browser fetch with copied Chrome state when `--chrome-user-data-dir`/`--chrome-profile` are configured.
- - If article extraction is poor for a specific site, keep private TOML site rules outside the package and pass them with `--site-rules-dir <dir>`. Private rules are loaded after bundled rules.
- - For large batches, test with `--limit` before running the full job.
+ ## Usage tips

-
+ - Before a large batch run, archive a few articles with `--limit` to confirm the results.
+ - Static blogs and news sites usually work with the default mode; try `--fetch-mode browser` for dynamic sites.
+ - Do not treat Feedloom as a high-concurrency crawler; it is built for personal clipping.
+ - Respect robots.txt, website terms of service, copyright rules, and rate limits.

-
+ ## Acknowledgements

-
- - [Patchright](https://github.com/Kaliiiiiiiiii-Vinyzu/patchright), for inspiring robust browser automation and realistic page access.
- - [Scrapling](https://github.com/D4Vinci/Scrapling), for ideas around real browser contexts, anti-detection strategies, and resilient scraping fallbacks.
+ Feedloom is inspired by these excellent projects:

-
+ - [Defuddle](https://github.com/kepano/defuddle): ideas for readable content extraction.
+ - [Patchright](https://github.com/Kaliiiiiiiiii-Vinyzu/patchright): browser automation and more realistic page access.
+ - [Scrapling](https://github.com/D4Vinci/Scrapling): ideas for more resilient scraping fallbacks.

  ## License

package/dist/cli.js
CHANGED
@@ -1024,15 +1024,24 @@ async function localizeImages(html, options) {
      }
      let rel = seen.get(absolute);
      if (!rel) {
-
+       let response;
+       try {
+         response = await fetchImage(absolute);
+       } catch {
+         continue;
+       }
        if (!response.ok) continue;
        const contentType = response.headers.get("content-type");
        if (contentType && !contentType.toLowerCase().startsWith("image/")) continue;
        const ext = extensionFrom(contentType, absolute);
        const filename = `image-${String(index).padStart(3, "0")}${ext}`;
        index += 1;
-
-
+       try {
+         await mkdir(assetDir, { recursive: true });
+         await writeFile2(join3(assetDir, filename), new Uint8Array(await response.arrayBuffer()));
+       } catch {
+         continue;
+       }
        rel = `assets/${encodeURIComponent(options.noteSlug)}/${filename}`;
        seen.set(absolute, rel);
      }

@@ -1228,6 +1237,15 @@ var DEFAULT_FEEDLOOM_PROFILE = {
      }
    };
    var DefuddleClass = DefuddleModule.default ?? DefuddleModule.Defuddle;
+   function installDefuddleDomGlobals(window) {
+     const target = globalThis;
+     const source = window;
+     for (const key of ["Node", "Element", "HTMLElement", "Document", "DocumentFragment", "Text", "Comment", "HTMLAnchorElement"]) {
+       if (!target[key] && source[key]) {
+         target[key] = source[key];
+       }
+     }
+   }
    function firstMetaContent(document, names) {
      for (const name of names) {
        const escaped = name.replace(/"/g, '\\"');

@@ -1336,7 +1354,9 @@ var HtmlCleaner = class {
      const preferredContentSelector = this.options.contentSelector ?? firstContentSelector(activeProfiles);
      const removals = [];
      const html = /<html[\s>]/i.test(rawHtml) ? rawHtml : `<!doctype html><html><body>${rawHtml}</body></html>`;
-     const
+     const window = parseHTML2(html);
+     installDefuddleDomGlobals(window);
+     const { document } = window;
      const contentSelector = preferredContentSelector && document.querySelector(preferredContentSelector) ? preferredContentSelector : void 0;
      const doc = document;
      if (this.options.baseUrl) {

@@ -2049,11 +2069,12 @@ async function processItem(item, options) {
      }
      const title = cleaned.metadata.title || item.sourceTitle || titleFromUrl(item.url);
      await cleanupExistingNote(options.outputDir, item.url);
+     const imageFetch = options.fetchImage ?? (activeProfiles.some((profile) => profile.fetch?.useProxyEnv) ? proxyAwareFetch : void 0);
      const contentHtml = options.localizeAssets === false ? cleaned.content : await localizeImages(cleaned.content, {
        outputDir: options.outputDir,
        noteSlug: sanitizeFilename(title),
        baseUrl: item.url,
-       fetchImage:
+       fetchImage: imageFetch
      });
      const markdown = demoteTopLevelHeadings(stripLeadingDateLine(stripDuplicateLeadingHeading(htmlToMarkdown(contentHtml), title)));
      const outputPath = await writeMarkdownNote(options.outputDir, {
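The `installDefuddleDomGlobals` helper added above addresses a Node-only problem: outside a real browser, DOM constructors such as `Node` and `HTMLElement` do not exist on `globalThis`, so extractor code that performs `instanceof HTMLElement` checks throws. A minimal sketch of the same pattern, assuming the bundled `parseHTML2` is linkedom's `parseHTML` (the pre-bundle import is not shown in this diff):

```js
// Sketch only: illustrates why DOM constructors must be copied onto globalThis.
import { parseHTML } from "linkedom";

const window = parseHTML("<!doctype html><html><body><article>hi</article></body></html>");
const { document } = window;

// Without the fix, this would throw "HTMLElement is not defined" in plain Node:
// document.querySelector("article") instanceof HTMLElement

// Mirror installDefuddleDomGlobals: expose the missing constructors globally.
for (const key of ["Node", "Element", "HTMLElement", "Document"]) {
  if (!globalThis[key] && window[key]) globalThis[key] = window[key];
}

console.log(document.querySelector("article") instanceof HTMLElement); // true
```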
package/dist/site-rules/x.toml
ADDED

@@ -0,0 +1,15 @@
+ [match]
+ host_suffixes = ["x.com", "twitter.com"]
+ url_regexes = ["https?://(www\\.)?(x|twitter)\\.com/[^/]+/status/"]
+
+ [fetch]
+ mode = "browser"
+ scroll_to_bottom = true
+ wait_ms = 8000
+ use_proxy_env = true
+
+ [extract]
+ require_text = true
+
+ [metadata]
+ strip_title_regexes = ["\\s*/\\s*X\\s*$", "\\s*on X:\\s*.*$"]
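With this bundled rule, clipping a post URL on x.com or twitter.com should automatically get browser fetch, scroll-to-bottom, the longer wait, and proxy-aware extractor requests. A usage sketch (the status URL is a placeholder; the proxy address matches the example used elsewhere in the README, and per `use_proxy_env` it only applies to static/extractor fetches):

```bash
# No --fetch-mode or --site-rules-dir needed: the bundled x.toml matches */status/ URLs.
export HTTPS_PROXY="http://127.0.0.1:7890"
npx -y @ariesfish/feedloom "https://x.com/someuser/status/1234567890"
```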
package/package.json
CHANGED

package/skills/feedloom/references/site-rules.md
CHANGED

@@ -66,7 +66,7 @@ Supported sections:

  Use `[fetch]` only when a site consistently needs browser rendering, local Chrome state, scrolling, waiting, clicking, or proxy-aware requests.

- `use_proxy_env = true` tells Feedloom to use `HTTP_PROXY`, `HTTPS_PROXY`, `ALL_PROXY`, and `NO_PROXY` for static fetches and Defuddle async extractor fetches. Use this for YouTube transcript capture and similar extractor-backed pages that need the user's proxy settings.
+ `use_proxy_env = true` tells Feedloom to use `HTTP_PROXY`, `HTTPS_PROXY`, `ALL_PROXY`, and `NO_PROXY` for static fetches and Defuddle async extractor fetches. Use this for YouTube transcript capture, X/Twitter posts, and similar extractor-backed pages that need the user's proxy settings.

  `prefer_browser_state = true` only tells Feedloom to use copied Chrome state for matching URLs. It does not store the local Chrome path. The command still needs Chrome state parameters when login state is required:
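For reference, a private rule placed in a `--site-rules-dir` directory can combine the same sections documented here. A sketch using a hypothetical members-only site (the key names come from the bundled `x.toml` above and from this reference; the domain and values are illustrative):

```toml
# ./site-rules/example-members.toml (illustrative)
[match]
host_suffixes = ["members.example.com"]

[fetch]
mode = "browser"
prefer_browser_state = true  # still requires --chrome-user-data-dir / --chrome-profile on the command line
wait_ms = 4000
```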