@ariesfish/feedloom 0.1.4 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.en.md ADDED
@@ -0,0 +1,151 @@
+ <div align="center">
+ <img src="assets/logo.png" alt="Feedloom logo" width="160">
+ <h1>Feedloom</h1>
+ <p><strong>Archive long-form web content as clean Markdown with local assets.</strong></p>
+ <p>
+ <a href="https://www.npmjs.com/package/@ariesfish/feedloom"><img alt="npm version" src="https://img.shields.io/npm/v/@ariesfish/feedloom"></a>
+ <img alt="Node 24 or newer" src="https://img.shields.io/badge/node-24%2B-339933">
+ <img alt="MIT license" src="https://img.shields.io/badge/license-MIT-blue">
+ </p>
+ </div>
+
+ Feedloom is a CLI for saving long-form web content as clean Markdown. It accepts article URLs, URL list files, and RSS/Atom feeds, extracts readable content, downloads page images, and writes portable Markdown notes with YAML frontmatter.
+
+ ## Features
+
+ - Save articles as Markdown with local image assets.
+ - Read URLs directly, from text/Markdown files, or from RSS/Atom feeds.
+ - Deduplicate URL lists and mark completed Markdown checklist items.
+ - Use static, browser-rendered, or stealth fetching when pages need JavaScript rendering.
+ - Apply built-in site rules for common sites such as WeChat and Zhihu.
+ - Optionally use local Chrome login state for pages that require your own authenticated browser session.
+
+ ## Requirements
+
+ - Node.js >= 24
+ - npm
+ - Patchright Chromium for browser-based fetching. `doctor` can install it automatically.
+
+ ## Install or run
+
+ Run directly with `npx`:
+
+ ```bash
+ npx -y @ariesfish/feedloom "https://example.com/article"
+ ```
+
+ Or install globally:
+
+ ```bash
+ npm install -g @ariesfish/feedloom
+ feedloom "https://example.com/article"
+ ```
+
+ Check and repair the browser runtime:
+
+ ```bash
+ npx -y @ariesfish/feedloom doctor
+ ```
+
+ If the Patchright Chromium executable is missing, `doctor` runs `npx patchright install chromium` automatically.
+
+ ## Quick start
+
+ Archive one article to `clippings/`:
+
+ ```bash
+ npx -y @ariesfish/feedloom "https://example.com/article"
+ ```
+
+ Write output somewhere else:
+
+ ```bash
+ npx -y @ariesfish/feedloom --output-dir ./outputs "https://example.com/article"
+ ```
+
+ Archive a URL list:
+
+ ```bash
+ npx -y @ariesfish/feedloom urls.md --limit 10
+ ```
+
+ Archive an RSS/Atom feed:
+
+ ```bash
+ npx -y @ariesfish/feedloom "https://example.com/feed.xml" --source-kind rss-feed --since 2026-01-01
+ ```
+
+ Use browser rendering for JavaScript-heavy pages:
+
+ ```bash
+ npx -y @ariesfish/feedloom "https://example.com/article" --fetch-mode browser --wait-ms 4000 --scroll-to-bottom
+ ```
+
+ Use stealth mode only when normal static/browser fetching is insufficient:
+
+ ```bash
+ npx -y @ariesfish/feedloom "https://example.com/article" --fetch-mode stealth --solve-cloudflare
+ ```
+
+ ## Output
+
+ Generated notes are written to `clippings/` by default:
+
+ ```markdown
+ ---
+ source: "https://example.com/article"
+ author: "Author Name"
+ created: "2026-04-29"
+ ---
+
+ # Article Title
+
+ Article content...
+ ```
+
+ Images are downloaded into an `assets/` subdirectory under the output directory and rewritten as local Markdown references.
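+
+ For example, a remote image reference is rewritten to point at the downloaded copy. The slug and filename below are illustrative; actual names follow the sanitized note title and an `image-NNN` counter:
+
+ ```markdown
+ <!-- before -->
+ ![Diagram](https://example.com/images/diagram.png)
+
+ <!-- after -->
+ ![Diagram](assets/article-title/image-001.png)
+ ```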
+
+ ## Fetch modes
+
+ | Mode | Use when |
+ | --- | --- |
+ | `auto` | Default. Try static fetch first, then browser/stealth fallback when content is insufficient. |
+ | `static` | The page is server-rendered and does not require JavaScript. |
+ | `browser` | The page needs JavaScript rendering, waiting, clicking, or scrolling. |
+ | `stealth` | Browser mode fails because the site has stronger bot detection. |
+
+ ## Agent Skill
+
+ Feedloom ships an Agent Skill in `skills/feedloom`, so agents that support the `skills` CLI can install the clipping workflow directly:
+
+ ```bash
+ npx skills add @ariesfish/feedloom --skill feedloom
+ ```
+
+ For a global install across supported agents:
+
+ ```bash
+ npx skills add @ariesfish/feedloom --skill feedloom --global
+ ```
+
+ After installing the skill, ask your agent to save article URLs, URL lists, or RSS feeds as Markdown. The skill runs the CLI through `npx -y @ariesfish/feedloom` by default.
+
+ ## Site rules
+
+ Feedloom ships built-in TOML site rules for common dynamic or structured sites. You can also keep private rules outside the package and pass them at runtime:
+
+ ```bash
+ npx -y @ariesfish/feedloom "https://example.com/article" --site-rules-dir ./site-rules
+ ```
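+
+ A private rule is a TOML file in that directory. A minimal sketch, using only keys that appear in this package's bundled rules; the filename, host, and values here are placeholders:
+
+ ```toml
+ # site-rules/example.toml — hypothetical rule for a JavaScript-heavy blog
+ [match]
+ host_suffixes = ["blog.example.com"]
+
+ [fetch]
+ mode = "browser"
+ wait_ms = 4000
+ scroll_to_bottom = true
+
+ [extract]
+ require_text = true
+ ```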
+
+ ## Acknowledgements
+
+ Feedloom is inspired by:
+
+ - [Defuddle](https://github.com/kepano/defuddle), for readable content extraction ideas.
+ - [Patchright](https://github.com/Kaliiiiiiiiii-Vinyzu/patchright), for browser automation and realistic page access.
+ - [Scrapling](https://github.com/D4Vinci/Scrapling), for resilient scraping fallback ideas.
+
+ ## License
+
+ MIT License
package/README.md CHANGED
@@ -1,128 +1,117 @@
- # Feedloom
-
  <div align="center">
  <img src="assets/logo.png" alt="Feedloom logo" width="160">
- <p><strong>Archive long-form web content as clean Markdown with local assets.</strong></p>
+ <h1>Feedloom</h1>
+ <p><strong>Clip quality content fast</strong></p>
+ <p><strong>Works with WeChat official accounts, Xiaohongshu, Zhihu, X, YouTube, and many other sites</strong></p>
  <p>
- <a href="https://www.npmjs.com/package/@ariesfish/feedloom"><img alt="npm" src="https://img.shields.io/npm/v/@ariesfish/feedloom"></a>
- <img alt="Node.js >= 24" src="https://img.shields.io/badge/node-%3E%3D24-339933">
- <img alt="License MIT" src="https://img.shields.io/badge/license-MIT-blue">
+ <a href="https://www.npmjs.com/package/@ariesfish/feedloom"><img alt="npm version" src="https://img.shields.io/npm/v/@ariesfish/feedloom"></a>
+ <img alt="Node 24 or newer" src="https://img.shields.io/badge/node-24%2B-339933">
+ <img alt="MIT license" src="https://img.shields.io/badge/license-MIT-blue">
  </p>
+ <p><a href="README.en.md">English</a></p>
  </div>
 
- Feedloom is a command-line tool for archiving long-form web content. It takes article URLs, URL list files, or RSS/Atom feeds, extracts readable article content, converts it to Markdown with YAML frontmatter, and saves page images as local assets. It is designed for personal knowledge bases, notebook vaults, and offline reading archives.
+ Feedloom is an agent-native web clipping tool. Give it an article, a batch of links, or an RSS feed, and it extracts the main content, strips page noise, downloads images, and produces complete Markdown documents ready for a personal knowledge base, Obsidian, or an offline reading directory.
+
+ It fits these scenarios:
+
+ - You find an article worth keeping and don't want to save only a link that may later go dead.
+ - You want to collect blogs, WeChat official accounts, Zhihu, Xiaohongshu, X, YouTube, and other web content into your own knowledge base.
+ - You want batch clipping instead of copying and pasting one article at a time.
+ - You want local images kept alongside saved articles for offline reading and migration.
 
- ## Features
+ ## Key capabilities
 
- - Accept one or more URLs directly from the command line.
- - Extract URLs from text or Markdown files, with automatic deduplication.
- - Expand RSS/Atom feeds and optionally filter entries by date.
- - Clean article HTML and convert it to Markdown.
- - Download and localize article images.
- - Generate Markdown notes with `source`, `author`, and `created` frontmatter.
- - Support static fetch, browser-rendered fetch, and stealth fetch modes.
- - Optionally use a local Chrome profile for pages that require login state.
- - Automatically mark Markdown checklist items as done after successful processing.
+ - Save articles as Markdown with YAML frontmatter.
+ - Automatically download page images and rewrite them as local Markdown image references.
+ - Accept URLs directly, read batch URL list files, and expand RSS feeds.
+ - Support static fetching, browser-rendered fetching, and stealth mode for pages that need JavaScript rendering.
+ - Ship built-in rules for common sites such as WeChat official accounts, Zhihu, Xiaohongshu, X, and YouTube.
+ - Optionally use local login state for pages that require login or have stronger anti-bot protection.
 
- ## Requirements
+ ## Requirements
 
  - Node.js >= 24
  - npm
- - macOS, Linux, or Windows should work; browser-based fetching depends on Patchright/Chromium.
+ - Browser-based fetching needs Patchright Chromium; the `doctor` command can check for and install it automatically.
 
- ## Installation
+ ## Run directly
 
- ### 1. Clone the repository
+ Run with `npx`, no install needed:
 
  ```bash
- git clone <this-repository-url>
- cd feedloom
+ npx -y @ariesfish/feedloom "https://example.com/article"
  ```
 
- ### 2. Install dependencies
+ You can also install globally:
 
  ```bash
- npm install
+ npm install -g @ariesfish/feedloom
+ feedloom "https://example.com/article"
  ```
 
- ### 3. Install the browser runtime
-
- If you plan to use `browser`, `stealth`, or the browser fallback in `auto` mode, install the Patchright Chromium runtime:
+ Check and repair the browser runtime:
 
  ```bash
- npx patchright install chromium
+ npx -y @ariesfish/feedloom doctor
  ```
 
- You can verify or repair the runtime later with:
-
- ```bash
- npm run dev -- doctor
- ```
+ If Patchright Chromium is missing, `doctor` runs `npx patchright install chromium` automatically.
 
- If the Patchright Chromium executable is missing, `doctor` runs `npx patchright install chromium` automatically.
+ ## Quick start
 
- ### 4. Build the CLI
+ Save a single article:
 
  ```bash
- npm run build
+ npx -y @ariesfish/feedloom "https://example.com/article"
  ```
 
- After building, run:
+ Specify an output directory:
 
  ```bash
- node dist/cli.js --help
+ npx -y @ariesfish/feedloom --output-dir ./outputs "https://example.com/article"
  ```
 
- During development, you can run the TypeScript source directly:
+ Batch-save a URL list:
 
  ```bash
- npm run dev -- --help
+ npx -y @ariesfish/feedloom urls.md --limit 10
  ```
 
- To make the CLI available globally on your machine:
+ `urls.md` can be a plain URL list or a Markdown checklist:
 
- ```bash
- npm link
- feedloom --help
+ ```markdown
+ - [ ] https://example.com/a
+ - [ ] https://example.com/b
  ```
 
- ## Agent Skill
+ After successful processing, the matching item is marked done:
 
- Feedloom ships an Agent Skill in `skills/feedloom`, so agents that support the `skills` CLI can install the clipping workflow directly from the package or repository:
-
- ```bash
- npx skills add @ariesfish/feedloom --skill feedloom
+ ```markdown
+ - [x] https://example.com/a
  ```
 
- For a global install across supported agents:
+ Save articles from an RSS feed:
 
  ```bash
- npx skills add @ariesfish/feedloom --skill feedloom --global
+ npx -y @ariesfish/feedloom "https://example.com/feed.xml" --source-kind rss-feed --since 2026-01-01
  ```
 
- After installing the skill, ask your agent to save article URLs, URL lists, or RSS feeds as Markdown. The skill runs the CLI through `npx -y @ariesfish/feedloom` by default.
-
- ## Quick Start
-
- Archive a single article to the default `clippings/` directory:
+ Handle pages that need JavaScript rendering:
 
  ```bash
- npm run dev -- "https://example.com/article"
+ npx -y @ariesfish/feedloom "https://example.com/article" --fetch-mode browser --wait-ms 4000 --scroll-to-bottom
  ```
 
- Write output to a custom directory:
+ When normal modes fail, try `stealth` mode:
 
  ```bash
- npm run dev -- --output-dir ./outputs "https://example.com/article"
+ npx -y @ariesfish/feedloom "https://example.com/article" --fetch-mode stealth --solve-cloudflare
  ```
 
- Use the built CLI:
-
- ```bash
- node dist/cli.js --output-dir ./outputs "https://example.com/article"
- ```
+ ## What the output looks like
 
- The generated Markdown will look roughly like this:
+ Feedloom writes to `clippings/` by default. The generated Markdown looks roughly like this:
 
  ```markdown
  ---
@@ -136,188 +125,55 @@ created: "2026-04-29"
  Article content...
  ```
 
- Images are downloaded into an `assets/` subdirectory under the output directory and rewritten as local Markdown references.
-
- ## Input Methods
-
- ### Pass multiple URLs directly
-
- ```bash
- npm run dev -- \
- "https://example.com/a" \
- "https://example.com/b"
- ```
-
- ### Read URLs from a file
-
- `urls.md` can be a plain URL list or a Markdown checklist:
-
- ```markdown
- - [ ] https://example.com/a
- - [ ] https://example.com/b
- ```
-
- Run:
-
- ```bash
- npm run dev -- --output-dir ./outputs urls.md
- ```
-
- After a URL is processed successfully, the matching checklist item is updated automatically:
-
- ```markdown
- - [x] https://example.com/a
- ```
-
- ### Process RSS/Atom feeds
-
- By default, `--source-kind auto` tries to detect whether the input is a normal HTML page or a feed. You can also specify the source kind explicitly:
+ ## Choosing a fetch mode
 
- ```bash
- npm run dev -- --source-kind rss-feed --since 2026-01-01 "https://example.com/feed.xml"
- ```
-
- Useful slicing options:
-
- ```bash
- npm run dev -- --start 1 --end 10 "https://example.com/feed.xml"
- npm run dev -- --limit 5 "https://example.com/feed.xml"
- ```
-
- ## Fetch Modes
-
- Use `--fetch-mode` to control how pages are fetched:
-
- | Mode | Description |
+ | Mode | Use when |
  | --- | --- |
- | `auto` | Default. Try static fetch first, then fall back to browser/stealth when content is insufficient. |
- | `static` | Use plain HTTP fetching only. Fastest option for static pages. |
- | `browser` | Render the page in a browser. Useful for JavaScript-heavy sites. |
- | `stealth` | Use a more realistic browser context. Useful for sites with stronger bot detection. |
+ | `auto` | Default. Try static fetch first, then fall back to browser/stealth when content is insufficient. |
+ | `static` | The page is already server-rendered and needs no JavaScript. Fastest. |
+ | `browser` | The page needs JavaScript rendering, waiting for elements, clicking buttons, or scroll-loading. |
+ | `stealth` | Normal browser mode still fails because the site has stronger anti-bot detection. |
 
- Examples:
-
- ```bash
- npm run dev -- --fetch-mode browser "https://example.com/article"
- npm run dev -- --fetch-mode stealth --solve-cloudflare "https://example.com/article"
- ```
+ Start with the default `auto`; only choose `browser` or `stealth` explicitly when the result is incomplete.
 
- ## Browser Options
+ ## Custom rules
 
- Wait longer after page load:
+ Feedloom ships built-in TOML site rules for common dynamic or structured sites. You can also keep your own private rules outside the package and specify them at runtime:
 
  ```bash
- npm run dev -- --fetch-mode browser --wait-ms 5000 "https://example.com/article"
+ npx -y @ariesfish/feedloom "https://example.com/article" --site-rules-dir ./site-rules
  ```
 
- Wait for a selector before extracting content:
-
- ```bash
- npm run dev -- --fetch-mode browser --wait-selector "article" "https://example.com/article"
- ```
-
- Click popups or expand buttons before extraction:
-
- ```bash
- npm run dev -- --fetch-mode browser --click-selector "button.accept" --click-selector ".expand" "https://example.com/article"
- ```
-
- Scroll to the bottom before extraction:
-
- ```bash
- npm run dev -- --fetch-mode browser --scroll-to-bottom "https://example.com/article"
- ```
+ Private rules are a good fit for tuning extraction precisely for the sites you use most.
 
- Use a proxy:
-
- ```bash
- npm run dev -- --fetch-mode stealth --proxy "http://127.0.0.1:7890" "https://example.com/article"
- ```
-
- Run with a visible browser window for debugging:
-
- ```bash
- npm run dev -- --fetch-mode browser --headful "https://example.com/article"
- ```
-
- ## Use Local Chrome Login State
-
- For pages that require an authenticated browser session, you can try using your local Chrome profile:
-
- ```bash
- npm run dev -- \
- --prefer-browser-state \
- --chrome-user-data-dir "{CHROME_INSTALL_PATH}" \
- --chrome-profile "Default" \
- --fetch-mode browser \
- "https://example.com/member-only-article"
- ```
-
- Only use this on your own device and accounts. Always respect the target site's terms of service and copyright rules.
-
- ## Common CLI Options
-
- ```text
- --output-dir <dir> Markdown output directory. Default: clippings
- --source-kind <kind> auto, html-page, or rss-feed. Default: auto
- --since <date> Keep only feed entries on or after YYYY-MM-DD
- --limit <n> Process only the first N deduplicated URLs
- --start <n> Start from the Nth deduplicated URL, 1-based
- --end <n> End at the Nth deduplicated URL, 1-based; 0 means no upper bound
- --fetch-mode <mode> auto, static, browser, or stealth. Default: auto
- --wait-ms <ms> Extra browser wait after load. Default: 2500
- --wait-selector <selector> Wait for a CSS selector
- --click-selector <selector...> Click one or more selectors after page load
- --scroll-to-bottom Scroll to the bottom before extraction
- --headful Run with a visible browser window
- --proxy <server> Proxy server for browser/stealth fetch
- --solve-cloudflare In stealth mode, try to handle Cloudflare challenges
- --disable-resources In stealth mode, block images/media/fonts/stylesheets for speed
- --prefer-browser-state Try local Chrome user state first
- --chrome-user-data-dir <path> Chrome User Data directory
- --chrome-profile <name> Chrome profile name. Default: Default
- --site-rules-dir <dir> Optional directory of private TOML site rules
- ```
-
- Run environment checks:
-
- ```bash
- npm run dev -- doctor
- ```
+ ## Agent Skill
 
- For the full option list, run:
+ Feedloom ships `skills/feedloom` with the package; agents that support the `skills` CLI can install this web-archiving capability directly:
 
  ```bash
- npm run dev -- --help
+ npx skills add @ariesfish/feedloom --skill feedloom
  ```
 
- ## Development
+ Install globally across supported agents:
 
  ```bash
- npm install
- npm run build
- npm run typecheck
- npm test
+ npx skills add @ariesfish/feedloom --skill feedloom --global
  ```
 
- ## Tips and Notes
-
- - Respect robots.txt, website terms of service, copyright, and rate limits.
- - For dynamic pages, try `--fetch-mode browser` first.
- - For static blogs and news sites, `--fetch-mode static` is usually faster.
- - Feedloom ships bundled TOML site rules for common dynamic/structured sites such as WeChat official account articles and Zhihu. Site rules can define extraction, cleanup, and fetch preferences. For example, the bundled Zhihu rule uses browser fetch with copied Chrome state when `--chrome-user-data-dir`/`--chrome-profile` are configured.
- - If article extraction is poor for a specific site, keep private TOML site rules outside the package and pass them with `--site-rules-dir <dir>`. Private rules are loaded after bundled rules.
- - For large batches, test with `--limit` before running the full job.
+ ## Usage tips
 
- ## Acknowledgements
+ - Before a large batch run, test a few articles with `--limit` to confirm the results.
+ - Static blogs and news sites usually work with the default mode; try `--fetch-mode browser` for dynamic sites.
+ - Don't treat Feedloom as a high-concurrency crawler; it is meant for personal clipping.
+ - Respect robots.txt, site terms of service, copyright, and rate limits.
 
- Feedloom is inspired by several excellent open-source projects. Special thanks to:
+ ## Acknowledgements
 
- - [Defuddle](https://github.com/kepano/defuddle), for high-quality readable content extraction ideas.
- - [Patchright](https://github.com/Kaliiiiiiiiii-Vinyzu/patchright), for inspiring robust browser automation and realistic page access.
- - [Scrapling](https://github.com/D4Vinci/Scrapling), for ideas around real browser contexts, anti-detection strategies, and resilient scraping fallbacks.
+ Feedloom is inspired by these excellent projects:
 
- Thanks also to Linkedom, Turndown, Commander, Vitest, and the wider TypeScript ecosystem for the reliable building blocks.
+ - [Defuddle](https://github.com/kepano/defuddle): readable content extraction ideas.
+ - [Patchright](https://github.com/Kaliiiiiiiiii-Vinyzu/patchright): browser automation and more realistic page access.
+ - [Scrapling](https://github.com/D4Vinci/Scrapling): more resilient scraping fallback ideas.
 
  ## License
 
package/dist/cli.js CHANGED
@@ -1024,15 +1024,24 @@ async function localizeImages(html, options) {
  }
  let rel = seen.get(absolute);
  if (!rel) {
- const response = await fetchImage(absolute);
+ let response;
+ try {
+ response = await fetchImage(absolute);
+ } catch {
+ continue;
+ }
  if (!response.ok) continue;
  const contentType = response.headers.get("content-type");
  if (contentType && !contentType.toLowerCase().startsWith("image/")) continue;
  const ext = extensionFrom(contentType, absolute);
  const filename = `image-${String(index).padStart(3, "0")}${ext}`;
  index += 1;
- await mkdir(assetDir, { recursive: true });
- await writeFile2(join3(assetDir, filename), new Uint8Array(await response.arrayBuffer()));
+ try {
+ await mkdir(assetDir, { recursive: true });
+ await writeFile2(join3(assetDir, filename), new Uint8Array(await response.arrayBuffer()));
+ } catch {
+ continue;
+ }
  rel = `assets/${encodeURIComponent(options.noteSlug)}/$(unknown)`;
  seen.set(absolute, rel);
  }
@@ -1228,6 +1237,15 @@ var DEFAULT_FEEDLOOM_PROFILE = {
  }
  };
  var DefuddleClass = DefuddleModule.default ?? DefuddleModule.Defuddle;
+ function installDefuddleDomGlobals(window) {
+ const target = globalThis;
+ const source = window;
+ for (const key of ["Node", "Element", "HTMLElement", "Document", "DocumentFragment", "Text", "Comment", "HTMLAnchorElement"]) {
+ if (!target[key] && source[key]) {
+ target[key] = source[key];
+ }
+ }
+ }
  function firstMetaContent(document, names) {
  for (const name of names) {
  const escaped = name.replace(/"/g, '\\"');
@@ -1336,7 +1354,9 @@ var HtmlCleaner = class {
  const preferredContentSelector = this.options.contentSelector ?? firstContentSelector(activeProfiles);
  const removals = [];
  const html = /<html[\s>]/i.test(rawHtml) ? rawHtml : `<!doctype html><html><body>${rawHtml}</body></html>`;
- const { document } = parseHTML2(html);
+ const window = parseHTML2(html);
+ installDefuddleDomGlobals(window);
+ const { document } = window;
  const contentSelector = preferredContentSelector && document.querySelector(preferredContentSelector) ? preferredContentSelector : void 0;
  const doc = document;
  if (this.options.baseUrl) {
@@ -2049,11 +2069,12 @@ async function processItem(item, options) {
  }
  const title = cleaned.metadata.title || item.sourceTitle || titleFromUrl(item.url);
  await cleanupExistingNote(options.outputDir, item.url);
+ const imageFetch = options.fetchImage ?? (activeProfiles.some((profile) => profile.fetch?.useProxyEnv) ? proxyAwareFetch : void 0);
  const contentHtml = options.localizeAssets === false ? cleaned.content : await localizeImages(cleaned.content, {
  outputDir: options.outputDir,
  noteSlug: sanitizeFilename(title),
  baseUrl: item.url,
- fetchImage: options.fetchImage
+ fetchImage: imageFetch
  });
  const markdown = demoteTopLevelHeadings(stripLeadingDateLine(stripDuplicateLeadingHeading(htmlToMarkdown(contentHtml), title)));
  const outputPath = await writeMarkdownNote(options.outputDir, {
@@ -0,0 +1,15 @@
+ [match]
+ host_suffixes = ["x.com", "twitter.com"]
+ url_regexes = ["https?://(www\\.)?(x|twitter)\\.com/[^/]+/status/"]
+
+ [fetch]
+ mode = "browser"
+ scroll_to_bottom = true
+ wait_ms = 8000
+ use_proxy_env = true
+
+ [extract]
+ require_text = true
+
+ [metadata]
+ strip_title_regexes = ["\\s*/\\s*X\\s*$", "\\s*on X:\\s*.*$"]
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@ariesfish/feedloom",
- "version": "0.1.4",
+ "version": "0.2.1",
  "type": "module",
  "author": "ariesfish",
  "license": "MIT",
@@ -66,7 +66,7 @@ Supported sections:
 
  Use `[fetch]` only when a site consistently needs browser rendering, local Chrome state, scrolling, waiting, clicking, or proxy-aware requests.
 
- `use_proxy_env = true` tells Feedloom to use `HTTP_PROXY`, `HTTPS_PROXY`, `ALL_PROXY`, and `NO_PROXY` for static fetches and Defuddle async extractor fetches. Use this for YouTube transcript capture and similar extractor-backed pages that need the user's proxy settings.
+ `use_proxy_env = true` tells Feedloom to use `HTTP_PROXY`, `HTTPS_PROXY`, `ALL_PROXY`, and `NO_PROXY` for static fetches and Defuddle async extractor fetches. Use this for YouTube transcript capture, X/Twitter posts, and similar extractor-backed pages that need the user's proxy settings.
 
  `prefer_browser_state = true` only tells Feedloom to use copied Chrome state for matching URLs. It does not store the local Chrome path. The command still needs Chrome state parameters when login state is required: