@ariesfish/feedloom 0.1.2 → 0.1.4
This diff shows the contents of publicly released package versions as they appear in their respective public registries. It is provided for informational purposes only.
- package/README.md +26 -1
- package/assets/logo.png +0 -0
- package/dist/cli.js +533 -61
- package/dist/site-rules/wechat.toml +5 -0
- package/dist/site-rules/xiaohongshu.toml +55 -0
- package/dist/site-rules/youtube.toml +10 -0
- package/dist/site-rules/zhihu.toml +22 -0
- package/package.json +3 -2
- package/skills/feedloom/SKILL.md +13 -3
- package/skills/feedloom/references/site-rules.md +104 -0
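The headline change in this release is the new `doctor` command described in the README changes below: probe for the Patchright Chromium executable, attempt an install when it is missing, then probe again. As a minimal sketch of that check-then-repair pattern (this is not Feedloom's API; `exists` and `install` are stand-ins for the real `fs` access check and the `npx patchright install chromium` subprocess):

```javascript
// Check-then-repair sketch: returns an overall ok flag plus the list of
// checks performed, loosely mirroring the shape of runDoctor's result.
async function ensureInstalled(exists, install) {
  const checks = [];
  if (await exists()) {
    checks.push({ name: "installation", ok: true });
    return { ok: true, checks };
  }
  checks.push({ name: "installation", ok: false });
  await install();
  const ok = await exists();
  checks.push({ name: "auto-install", ok });
  return { ok, checks };
}

// Simulated environment: the executable appears only after `install` runs.
let present = false;
ensureInstalled(async () => present, async () => { present = true; })
  .then((result) => console.log(result.ok, result.checks.length)); // true 2
```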
package/README.md
CHANGED
@@ -1,5 +1,15 @@
 # Feedloom
 
+<div align="center">
+<img src="assets/logo.png" alt="Feedloom logo" width="160">
+<p><strong>Archive long-form web content as clean Markdown with local assets.</strong></p>
+<p>
+<a href="https://www.npmjs.com/package/@ariesfish/feedloom"><img alt="npm" src="https://img.shields.io/npm/v/@ariesfish/feedloom"></a>
+<img alt="Node.js >= 24" src="https://img.shields.io/badge/node-%3E%3D24-339933">
+<img alt="License MIT" src="https://img.shields.io/badge/license-MIT-blue">
+</p>
+</div>
+
 Feedloom is a command-line tool for archiving long-form web content. It takes article URLs, URL list files, or RSS/Atom feeds, extracts readable article content, converts it to Markdown with YAML frontmatter, and saves page images as local assets. It is designed for personal knowledge bases, notebook vaults, and offline reading archives.
 
 ## Features
@@ -43,6 +53,14 @@ If you plan to use `browser`, `stealth`, or the browser fallback in `auto` mode,
 npx patchright install chromium
 ```
 
+You can verify or repair the runtime later with:
+
+```bash
+npm run dev -- doctor
+```
+
+If the Patchright Chromium executable is missing, `doctor` runs `npx patchright install chromium` automatically.
+
 ### 4. Build the CLI
 
 ```bash
@@ -261,6 +279,12 @@ Only use this on your own device and accounts. Always respect the target site's
 --site-rules-dir <dir> Optional directory of private TOML site rules
 ```
 
+Run environment checks:
+
+```bash
+npm run dev -- doctor
+```
+
 For the full option list, run:
 
 ```bash
@@ -281,7 +305,8 @@ npm test
 - Respect robots.txt, website terms of service, copyright, and rate limits.
 - For dynamic pages, try `--fetch-mode browser` first.
 - For static blogs and news sites, `--fetch-mode static` is usually faster.
--
+- Feedloom ships bundled TOML site rules for common dynamic/structured sites such as WeChat official account articles and Zhihu. Site rules can define extraction, cleanup, and fetch preferences. For example, the bundled Zhihu rule uses browser fetch with copied Chrome state when `--chrome-user-data-dir`/`--chrome-profile` are configured.
+- If article extraction is poor for a specific site, keep private TOML site rules outside the package and pass them with `--site-rules-dir <dir>`. Private rules are loaded after bundled rules.
 - For large batches, test with `--limit` before running the full job.
 
 ## Acknowledgements
package/assets/logo.png
ADDED
Binary file
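The `cli.js` diff below adds a proxy-aware fetch (`src/fetch/proxy-fetch.ts`) whose `NO_PROXY` handling follows the common convention: exact host match, parent-domain suffix match, leading-dot entries, and `*` as a wildcard. A self-contained sketch of that matching rule, with the environment variable replaced by a plain parameter so it can be exercised directly:

```javascript
// Returns true when `hostname` is covered by a comma-separated NO_PROXY
// style list, mimicking the noProxyMatches helper in the bundle below.
function noProxyMatches(hostname, noProxy) {
  const host = hostname.toLowerCase();
  return noProxy.split(",").map((e) => e.trim().toLowerCase()).some((entry) => {
    if (!entry) return false;
    if (entry === "*") return true;
    if (entry.startsWith(".")) return host === entry.slice(1) || host.endsWith(entry);
    return host === entry || host.endsWith(`.${entry}`);
  });
}

console.log(noProxyMatches("api.example.com", "localhost,example.com")); // true
console.log(noProxyMatches("example.org", "example.com")); // false
```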
|
package/dist/cli.js
CHANGED
@@ -2,7 +2,9 @@
 
 // src/cli.ts
 import { readdir as readdir2 } from "fs/promises";
-import {
+import { createRequire } from "module";
+import { dirname, join as join7, resolve as resolve2 } from "path";
+import { fileURLToPath } from "url";
 import { Command } from "commander";
 
 // src/cleaning/profiles.ts
@@ -39,10 +41,29 @@ function profileFromTomlRule(name, rule) {
     metadata: {
       fixedAuthor: rule.metadata?.fixed_author,
       titleSuffixPatterns: rule.metadata?.strip_title_regexes,
+      authorSuffixPatterns: rule.metadata?.strip_author_regexes,
       authorSelectors: rule.metadata?.author_selectors,
       authorMetaNames: rule.metadata?.author_meta_names,
       authorMetaItemprops: rule.metadata?.author_meta_itemprops,
       authorMetaProperties: rule.metadata?.author_meta_properties
+    },
+    fetch: {
+      mode: rule.fetch?.mode,
+      preferBrowserState: rule.fetch?.prefer_browser_state,
+      waitMs: rule.fetch?.wait_ms,
+      networkIdle: rule.fetch?.network_idle,
+      waitSelector: rule.fetch?.wait_selector,
+      waitSelectorState: rule.fetch?.wait_selector_state,
+      clickSelectors: rule.fetch?.click_selectors,
+      scrollToBottom: rule.fetch?.scroll_to_bottom,
+      useProxyEnv: rule.fetch?.use_proxy_env
+    },
+    media: {
+      includeMetaImages: rule.media?.include_meta_images,
+      imageMetaProperties: rule.media?.image_meta_properties
+    },
+    extraction: {
+      requireText: rule.extract?.require_text
    }
  };
}
@@ -92,11 +113,172 @@ function firstContentSelector(profiles) {
  return void 0;
}
 
+// src/doctor.ts
+import { spawn } from "child_process";
+import { access } from "fs/promises";
+import { chromium } from "patchright";
+var INSTALL_CHROMIUM_COMMAND = "npx patchright install chromium";
+var INSTALL_CHROMIUM_HINT = `Run: ${INSTALL_CHROMIUM_COMMAND}`;
+function errorMessage(error) {
+  return error instanceof Error ? error.message : String(error);
+}
+function chromiumExecutablePath() {
+  const browserType = chromium;
+  return browserType.executablePath();
+}
+function appendCheck(checks, check) {
+  checks.push(check);
+}
+async function installPatchrightChromium(events = {}) {
+  return new Promise((resolve3, reject) => {
+    const child = spawn("npx", ["patchright", "install", "chromium"], {
+      stdio: ["ignore", "pipe", "pipe"]
+    });
+    child.stdout.setEncoding("utf8");
+    child.stderr.setEncoding("utf8");
+    child.stdout.on("data", (chunk) => events.onStdout?.(chunk));
+    child.stderr.on("data", (chunk) => events.onStderr?.(chunk));
+    child.on("error", reject);
+    child.on("close", (code, signal) => resolve3({ code, signal }));
+  });
+}
+async function executableExists(path) {
+  try {
+    await access(path);
+    return { ok: true };
+  } catch (error) {
+    return { ok: false, error };
+  }
+}
+async function ensureChromiumInstalled(checks, executablePath, options) {
+  const firstCheck = await executableExists(executablePath);
+  if (firstCheck.ok) {
+    appendCheck(checks, {
+      name: "Patchright Chromium installation",
+      ok: true,
+      message: "Chromium executable exists.",
+      detail: executablePath
+    });
+    return true;
+  }
+  appendCheck(checks, {
+    name: "Patchright Chromium installation",
+    ok: false,
+    message: "Chromium executable was not found on disk. Installing Patchright Chromium...",
+    detail: `${executablePath}
+${errorMessage(firstCheck.error)}`
+  });
+  let output = "";
+  const appendOutput = (chunk) => {
+    output += chunk;
+    options.stderr.write(chunk);
+  };
+  try {
+    const result = await options.installChromium({ onStdout: appendOutput, onStderr: appendOutput });
+    if (result.code !== 0) {
+      appendCheck(checks, {
+        name: "Patchright Chromium auto-install",
+        ok: false,
+        message: `Installation command failed with exit code ${result.code ?? "null"}${result.signal ? ` and signal ${result.signal}` : ""}.`,
+        detail: output.trim() || void 0,
+        hint: INSTALL_CHROMIUM_HINT
+      });
+      return false;
+    }
+  } catch (error) {
+    appendCheck(checks, {
+      name: "Patchright Chromium auto-install",
+      ok: false,
+      message: "Installation command failed to start or crashed.",
+      detail: errorMessage(error),
+      hint: INSTALL_CHROMIUM_HINT
+    });
+    return false;
+  }
+  const secondCheck = await executableExists(executablePath);
+  appendCheck(checks, secondCheck.ok ? {
+    name: "Patchright Chromium auto-install",
+    ok: true,
+    message: "Chromium installed successfully.",
+    detail: executablePath
+  } : {
+    name: "Patchright Chromium auto-install",
+    ok: false,
+    message: "Installation finished, but Chromium executable is still missing.",
+    detail: `${executablePath}
+${errorMessage(secondCheck.error)}`,
+    hint: INSTALL_CHROMIUM_HINT
+  });
+  return secondCheck.ok;
+}
+async function runDoctor(options = {}) {
+  const checks = [];
+  const resolvedOptions = {
+    installChromium: options.installChromium ?? installPatchrightChromium,
+    stderr: options.stderr ?? process.stderr
+  };
+  let executablePath = "";
+  try {
+    executablePath = chromiumExecutablePath();
+    checks.push({
+      name: "Patchright Chromium executable path",
+      ok: true,
+      message: executablePath
+    });
+  } catch (error) {
+    checks.push({
+      name: "Patchright Chromium executable path",
+      ok: false,
+      message: "Patchright does not report a Chromium executable for this platform.",
+      detail: errorMessage(error),
+      hint: INSTALL_CHROMIUM_HINT
+    });
+  }
+  const installed = executablePath ? await ensureChromiumInstalled(checks, executablePath, resolvedOptions) : false;
+  if (installed) {
+    try {
+      const browser = await chromium.launch({ headless: true });
+      await browser.close();
+      checks.push({
+        name: "Patchright Chromium launch",
+        ok: true,
+        message: "Chromium launched successfully in headless mode."
+      });
+    } catch (error) {
+      checks.push({
+        name: "Patchright Chromium launch",
+        ok: false,
+        message: "Chromium executable exists but failed to launch.",
+        detail: errorMessage(error),
+        hint: INSTALL_CHROMIUM_HINT
+      });
+    }
+  }
+  return {
+    ok: checks.at(-1)?.ok === true,
+    checks
+  };
+}
+function formatDoctorResult(result) {
+  const lines = ["Feedloom doctor"];
+  for (const check of result.checks) {
+    lines.push(`${check.ok ? "\u2713" : "\u2717"} ${check.name}: ${check.message}`);
+    if (check.detail) {
+      lines.push(...check.detail.split("\n").map((line) => `  ${line}`));
+    }
+    if (check.hint) {
+      lines.push(`  ${check.hint}`);
+    }
+  }
+  lines.push(result.ok ? "OK" : "FAILED");
+  return lines.join("\n");
+}
+
 // src/fetch/browser.ts
 import { mkdtemp, rm } from "fs/promises";
 import { tmpdir } from "os";
 import { join } from "path";
-import { chromium } from "patchright";
+import { chromium as chromium2 } from "patchright";
 var SCRAPLING_DEFAULT_ARGS = [
   "--no-pings",
   "--no-first-run",
@@ -122,13 +304,13 @@ async function runPageActions(page, options) {
     await page.locator(selector).first().click({ timeout: 5e3 }).catch(() => void 0);
   }
   if (options.scrollToBottom) {
-    await page.evaluate(async () => {
-      const delay = (ms) => new Promise((
+    await page.evaluate(`(async () => {
+      const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
       for (let i = 0; i < 8; i += 1) {
        window.scrollTo(0, document.body.scrollHeight);
        await delay(250);
      }
-    });
+    })()`);
   }
   if (options.waitSelector) {
     await page.locator(options.waitSelector).first().waitFor({
@@ -145,7 +327,7 @@ async function launchBrowserContext(options) {
   if (options.dnsOverHttps) {
     extraArgs.push("--dns-over-https-templates=https://cloudflare-dns.com/dns-query");
   }
-  const context = await
+  const context = await chromium2.launchPersistentContext(userDataDir, {
     channel: options.channel,
     headless: options.headless ?? true,
     args: extraArgs,
@@ -222,7 +404,7 @@ async function fetchBrowserHtml(url, options = {}) {
 import { mkdtemp as mkdtemp2, rm as rm2 } from "fs/promises";
 import { tmpdir as tmpdir2 } from "os";
 import { join as join2 } from "path";
-import { chromium as
+import { chromium as chromium3 } from "patchright";
 var DEFAULT_ARGS = [
   "--no-pings",
   "--no-first-run",
@@ -365,7 +547,7 @@ async function solveCloudflare(page) {
 async function launchStealthContext(options) {
   const userDataDir = options.userDataDir ?? await mkdtemp2(join2(tmpdir2(), "feedloom-stealth-"));
   const ownsUserDataDir = options.userDataDir === void 0;
-  const context = await
+  const context = await chromium3.launchPersistentContext(userDataDir, {
     channel: "chromium",
     headless: options.headless ?? true,
     args: stealthArgs(options),
@@ -409,13 +591,13 @@ async function fetchWithStealthContext(context, url, options) {
     await page.locator(selector).first().click({ timeout: 5e3 }).catch(() => void 0);
   }
   if (options.scrollToBottom) {
-    await page.evaluate(async () => {
-      const delay = (ms) => new Promise((
+    await page.evaluate(`(async () => {
+      const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
       for (let i = 0; i < 8; i += 1) {
        window.scrollTo(0, document.body.scrollHeight);
        await delay(250);
      }
-    });
+    })()`);
   }
   if (options.waitSelector) {
     await page.locator(options.waitSelector).first().waitFor({ state: options.waitSelectorState ?? "attached", timeout: timeoutMs }).catch(() => void 0);
@@ -824,8 +1006,8 @@ function imageSource(img) {
   return first || null;
 }
 async function localizeImages(html, options) {
-  const { document
-  const images = Array.from(
+  const { document } = parseHTML(`<!doctype html><html><body>${html}</body></html>`);
+  const images = Array.from(document.querySelectorAll("img"));
   if (images.length === 0) return html;
   const fetchImage = options.fetchImage ?? fetch;
   const seen = /* @__PURE__ */ new Map();
@@ -864,7 +1046,7 @@ async function localizeImages(html, options) {
     img.removeAttribute("data-original");
     img.removeAttribute("data-src");
   }
-  return
+  return document.body.innerHTML;
 }
 
 // src/cleaning/clean-html.ts
@@ -1004,6 +1186,18 @@ function cleanupTitle(metadata, profiles) {
   }
   metadata.title = title;
 }
+function cleanupAuthor(metadata, profiles) {
+  if (!metadata.author) {
+    return;
+  }
+  let author = metadata.author;
+  for (const profile of profiles) {
+    for (const pattern of profile.metadata?.authorSuffixPatterns ?? []) {
+      author = author.replace(new RegExp(pattern, "i"), "").trim();
+    }
+  }
+  metadata.author = author;
+}
 function applySiteProfiles(root, profiles, removals) {
   removeByExactSelectors(root, profiles, removals);
   removeByPartialAttributePatterns(root, profiles, removals);
@@ -1012,6 +1206,7 @@ function applySiteProfiles(root, profiles, removals) {
 function applyMetadataProfiles(metadata, profiles) {
   applyFixedAuthor(metadata, profiles);
   cleanupTitle(metadata, profiles);
+  cleanupAuthor(metadata, profiles);
 }
 
 // src/cleaning/clean-html.ts
@@ -1033,17 +1228,17 @@ var DEFAULT_FEEDLOOM_PROFILE = {
   }
 };
 var DefuddleClass = DefuddleModule.default ?? DefuddleModule.Defuddle;
-function firstMetaContent(
+function firstMetaContent(document, names) {
   for (const name of names) {
     const escaped = name.replace(/"/g, '\\"');
-    const element =
+    const element = document.querySelector(`meta[property="${escaped}"], meta[name="${escaped}"], meta[itemprop="${escaped}"]`);
     const content = element?.getAttribute("content")?.trim();
     if (content) return content;
   }
   return void 0;
 }
-function jsonLdValue(
-  for (const script of Array.from(
+function jsonLdValue(document, keys) {
+  for (const script of Array.from(document.querySelectorAll('script[type="application/ld+json"]'))) {
     const text = script.textContent?.trim();
     if (!text) continue;
     try {
@@ -1064,12 +1259,12 @@ function jsonLdValue(document2, keys) {
   }
   return void 0;
 }
-function profileAuthorFromDocument(
+function profileAuthorFromDocument(document, profiles) {
   for (const profile of profiles) {
     const metadata = profile.metadata;
     if (!metadata) continue;
     for (const selector of metadata.authorSelectors ?? []) {
-      const author =
+      const author = document.querySelector(selector)?.textContent?.replace(/\s+/g, " ").trim();
       if (author) return author;
     }
     const metaNames = [
@@ -1079,33 +1274,54 @@ function profileAuthorFromDocument(document2, profiles) {
     ];
     for (const entry of metaNames) {
       const escaped = entry.value.replace(/"/g, '\\"');
-      const author =
+      const author = document.querySelector(`meta[${entry.attr}="${escaped}"]`)?.getAttribute("content")?.trim();
       if (author) return author;
     }
   }
   return void 0;
 }
-function toMetadata(result,
+function toMetadata(result, document, profiles) {
   return {
-    title: result.title || firstMetaContent(
-    description: result.description || firstMetaContent(
+    title: result.title || firstMetaContent(document, ["og:title", "twitter:title"]) || document.querySelector("title")?.textContent?.trim() || void 0,
+    description: result.description || firstMetaContent(document, ["description", "og:description", "twitter:description"]),
     domain: result.domain || void 0,
     favicon: result.favicon || void 0,
-    image: result.image || firstMetaContent(
-    language: result.language ||
-    published: result.published || firstMetaContent(
-    author: result.author || profileAuthorFromDocument(
-    site: result.site || firstMetaContent(
+    image: result.image || firstMetaContent(document, ["og:image", "twitter:image"]),
+    language: result.language || document.documentElement.getAttribute("lang") || void 0,
+    published: result.published || firstMetaContent(document, ["article:published_time", "date", "datePublished", "pubdate", "publishdate"]) || jsonLdValue(document, ["datePublished", "dateCreated"]),
+    author: result.author || profileAuthorFromDocument(document, profiles) || firstMetaContent(document, ["author", "article:author", "twitter:creator"]) || jsonLdValue(document, ["author", "creator"]),
+    site: result.site || firstMetaContent(document, ["og:site_name", "application-name"]),
     schemaOrgData: result.schemaOrgData,
     wordCount: result.wordCount,
     parseTime: result.parseTime
   };
 }
-function
-const
-
+function appendMetaImages(document, root, profiles) {
+  const properties = profiles.flatMap((profile) => profile.media?.includeMetaImages ? profile.media.imageMetaProperties ?? ["og:image"] : []);
+  if (properties.length === 0) {
+    return;
+  }
+  const seen = new Set(Array.from(root.querySelectorAll("img")).map((img) => img.getAttribute("src") ?? ""));
+  for (const property of properties) {
+    const escaped = property.replace(/"/g, '\\"');
+    for (const meta of Array.from(document.querySelectorAll(`meta[property="${escaped}"], meta[name="${escaped}"], meta[itemprop="${escaped}"]`))) {
+      const src = meta.getAttribute("content")?.trim();
+      if (!src || seen.has(src)) continue;
+      const img = document.createElement("img");
+      img.setAttribute("src", src);
+      img.setAttribute("alt", "");
+      root.appendChild(document.createElement("p"));
+      root.lastElementChild?.appendChild(img);
+      seen.add(src);
+    }
+  }
+}
+function serializeProfiledContent(document, content, profiles, removals) {
+  const { document: contentDocument } = parseHTML2(`<!doctype html><html><body><main data-feedloom-profile-root="true">${content}</main></body></html>`);
+  const root = contentDocument.querySelector('[data-feedloom-profile-root="true"]') ?? contentDocument.body;
+  appendMetaImages(document, root, profiles);
   applySiteProfiles(root, profiles, removals);
-  const serialized = root.innerHTML || root.outerHTML ||
+  const serialized = root.innerHTML || root.outerHTML || contentDocument.body.innerHTML;
   return serialized.trim() ? `${serialized.trim()}
 ` : "";
 }
@@ -1120,9 +1336,9 @@ var HtmlCleaner = class {
     const preferredContentSelector = this.options.contentSelector ?? firstContentSelector(activeProfiles);
     const removals = [];
     const html = /<html[\s>]/i.test(rawHtml) ? rawHtml : `<!doctype html><html><body>${rawHtml}</body></html>`;
-    const { document
-    const contentSelector = preferredContentSelector &&
-    const doc =
+    const { document } = parseHTML2(html);
+    const contentSelector = preferredContentSelector && document.querySelector(preferredContentSelector) ? preferredContentSelector : void 0;
+    const doc = document;
     if (this.options.baseUrl) {
       doc.URL = this.options.baseUrl;
     }
@@ -1139,12 +1355,14 @@ var HtmlCleaner = class {
       removeExactSelectors: this.options.removeExactSelectors,
       removePartialSelectors: this.options.removePartialSelectors,
       removeContentPatterns: this.options.removeContentPatterns,
-      standardize: this.options.standardize
+      standardize: this.options.standardize,
+      fetch: this.options.defuddleFetch,
+      language: this.options.language
     });
     const result = parser2.parseAsync ? await parser2.parseAsync() : parser2.parse();
-    const metadata = toMetadata(result,
+    const metadata = toMetadata(result, document, activeProfiles);
     applyMetadataProfiles(metadata, activeProfiles);
-    const content = serializeProfiledContent(result.content, postProfiles, removals);
+    const content = serializeProfiledContent(document, result.content, postProfiles, removals);
     return {
       content,
       contentMarkdown: result.contentMarkdown,
@@ -1161,6 +1379,212 @@ async function cleanHtml(rawHtml, options = {}) {
   return new HtmlCleaner(options).parse(rawHtml);
 }
 
+// src/fetch/proxy-fetch.ts
+import { request as httpRequest } from "http";
+import { connect as tlsConnect } from "tls";
+var REDIRECT_STATUSES = /* @__PURE__ */ new Set([301, 302, 303, 307, 308]);
+var DEFAULT_REDIRECT_LIMIT = 10;
+function envProxyForUrl(targetUrl) {
+  const raw = targetUrl.protocol === "https:" ? process.env.HTTPS_PROXY || process.env.https_proxy || process.env.ALL_PROXY || process.env.all_proxy : process.env.HTTP_PROXY || process.env.http_proxy || process.env.ALL_PROXY || process.env.all_proxy;
+  if (!raw || noProxyMatches(targetUrl.hostname)) {
+    return null;
+  }
+  try {
+    return new URL(raw);
+  } catch {
+    return null;
+  }
+}
+function noProxyMatches(hostname) {
+  const raw = process.env.NO_PROXY ?? process.env.no_proxy ?? "";
+  if (!raw) return false;
+  const host = hostname.toLowerCase();
+  return raw.split(",").map((entry) => entry.trim().toLowerCase()).some((entry) => {
+    if (!entry) return false;
+    if (entry === "*") return true;
+    if (entry.startsWith(".")) return host === entry.slice(1) || host.endsWith(entry);
+    return host === entry || host.endsWith(`.${entry}`);
+  });
+}
+function headersToRecord(headers) {
+  const record = {};
+  if (!headers) return record;
+  new Headers(headers).forEach((value, key) => {
+    record[key] = value;
+  });
+  return record;
+}
+function responseHeaders(headers) {
+  const result = new Headers();
+  for (const [key, value] of Object.entries(headers)) {
+    if (Array.isArray(value)) {
+      for (const item of value) result.append(key, item);
+    } else if (value !== void 0) {
+      result.set(key, String(value));
+    }
+  }
+  return result;
+}
+async function bodyToBuffer(body) {
+  if (body === void 0 || body === null) return void 0;
+  if (typeof ReadableStream !== "undefined" && body instanceof ReadableStream) {
+    throw new Error("proxy-aware fetch does not support streaming request bodies");
+  }
+  if (typeof body === "string") return Buffer.from(body);
+  if (body instanceof URLSearchParams) return Buffer.from(body.toString());
+  if (body instanceof ArrayBuffer) return Buffer.from(body);
+  if (ArrayBuffer.isView(body)) return Buffer.from(body.buffer, body.byteOffset, body.byteLength);
+  if (typeof Blob !== "undefined" && body instanceof Blob) return Buffer.from(await body.arrayBuffer());
+  throw new Error("proxy-aware fetch only supports buffered request bodies");
+}
+function collectResponse(res, done) {
+  const chunks = [];
+  res.on("data", (chunk) => {
+    chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
+  });
+  res.on("end", () => {
+    done(null, {
+      status: res.statusCode ?? 0,
+      statusText: res.statusMessage ?? "",
+      headers: Object.fromEntries(responseHeaders(res.headers)),
+      body: Buffer.concat(chunks)
+    });
+  });
+  res.on("error", (error) => done(error));
+}
+function proxyAuthorization(proxy) {
+  if (!proxy.username) return void 0;
+  return `Basic ${Buffer.from(`${decodeURIComponent(proxy.username)}:${decodeURIComponent(proxy.password)}`).toString("base64")}`;
+}
+function requestViaHttpProxy(targetUrl, proxy, method, headers, body, signal) {
+  if (proxy.protocol !== "http:") {
+    throw new Error(`Unsupported proxy protocol: ${proxy.protocol}`);
+  }
+  return new Promise((resolve3, reject) => {
+    let settled = false;
+    let active = null;
+    const done = (error, response) => {
+      if (settled) return;
+      settled = true;
+      signal?.removeEventListener("abort", abort);
+      if (error) reject(error);
+      else if (response) resolve3(response);
+      else reject(new Error("Proxy request ended without a response"));
+    };
+    const abort = () => active?.destroy(new Error("The operation was aborted"));
+    if (signal?.aborted) {
+      done(new Error("The operation was aborted"));
+      return;
+    }
+    signal?.addEventListener("abort", abort, { once: true });
+    const targetPort = targetUrl.port ? Number(targetUrl.port) : targetUrl.protocol === "https:" ? 443 : 80;
+    const proxyPort = proxy.port ? Number(proxy.port) : 8080;
+    const auth = proxyAuthorization(proxy);
+    const requestHeaders2 = {
+      ...headers,
+      host: targetUrl.host
+    };
+    if (body && !Object.keys(requestHeaders2).some((key) => key.toLowerCase() === "content-length")) {
+      requestHeaders2["content-length"] = String(body.byteLength);
+    }
+    if (targetUrl.protocol === "https:") {
+      const connectHeaders = { host: `${targetUrl.hostname}:${targetPort}` };
+      if (auth) connectHeaders["proxy-authorization"] = auth;
+      const connectReq = httpRequest({
+        host: proxy.hostname,
+        port: proxyPort,
+        method: "CONNECT",
+        path: `${targetUrl.hostname}:${targetPort}`,
+        headers: connectHeaders
+      });
+      active = connectReq;
+      connectReq.on("connect", (connectRes, socket) => {
+        if (connectRes.statusCode !== 200) {
+          socket.destroy();
+          done(new Error(`Proxy CONNECT failed: ${connectRes.statusCode ?? 0}`));
+          return;
+        }
+        const tlsSocket = tlsConnect({ socket, host: targetUrl.hostname, servername: targetUrl.hostname });
+        active = tlsSocket;
+        tlsSocket.on("error", (error) => done(error));
+        tlsSocket.on("secureConnect", () => {
+          const req2 = httpRequest({
+            method,
+            path: `${targetUrl.pathname}${targetUrl.search}`,
+            headers: requestHeaders2,
+            createConnection: () => tlsSocket
+          }, (res) => collectResponse(res, done));
+          active = req2;
+          req2.on("error", (error) => done(error));
+          if (body) req2.write(body);
+          req2.end();
+        });
+      });
+      connectReq.on("error", (error) => done(error));
+      connectReq.end();
+      return;
+    }
+    if (auth) requestHeaders2["proxy-authorization"] = auth;
+    const req = httpRequest({
+      host: proxy.hostname,
+      port: proxyPort,
+      method,
+      path: targetUrl.href,
+      headers: requestHeaders2
+    }, (res) => collectResponse(res, done));
+    active = req;
+    req.on("error", (error) => done(error));
+    if (body) req.write(body);
+    req.end();
+  });
+}
+function requestUrl(input) {
+  if (input instanceof URL) return input;
+  if (typeof input === "string") return new URL(input);
+  return new URL(input.url);
+}
+function requestMethod(input, init) {
+  if (init.method) return init.method.toUpperCase();
+  if (input instanceof Request) return input.method.toUpperCase();
+  return "GET";
+}
+function requestHeaders(input, init) {
+  return {
|
|
1553
|
+
...input instanceof Request ? headersToRecord(input.headers) : {},
|
|
1554
|
+
...headersToRecord(init.headers)
|
|
1555
|
+
};
|
|
1556
|
+
}
|
|
1557
|
+
async function proxyAwareFetchInternal(input, init, redirectsLeft) {
|
|
1558
|
+
const url = requestUrl(input);
|
|
1559
|
+
const proxy = envProxyForUrl(url);
|
|
1560
|
+
if (!proxy) {
|
|
1561
|
+
return fetch(input, init);
|
|
1562
|
+
}
|
|
1563
|
+
const method = requestMethod(input, init);
|
|
1564
|
+
const headers = requestHeaders(input, init);
|
|
1565
|
+
const body = await bodyToBuffer(init.body ?? (input instanceof Request ? input.body : void 0));
|
|
1566
|
+
const proxied = await requestViaHttpProxy(url, proxy, method, headers, body, init.signal ?? void 0);
|
|
1567
|
+
const location = proxied.headers.location;
|
|
1568
|
+
const redirectMode = init.redirect ?? "follow";
|
|
1569
|
+
if (redirectMode !== "manual" && REDIRECT_STATUSES.has(proxied.status) && location && redirectsLeft > 0) {
|
|
1570
|
+
const nextUrl = new URL(location, url);
|
|
1571
|
+
const nextInit = { ...init };
|
|
1572
|
+
if (proxied.status === 303) {
|
|
1573
|
+
nextInit.method = "GET";
|
|
1574
|
+
nextInit.body = void 0;
|
|
1575
|
+
}
|
|
1576
|
+
return proxyAwareFetchInternal(nextUrl, nextInit, redirectsLeft - 1);
|
|
1577
|
+
}
|
|
1578
|
+
return new Response(new Uint8Array(proxied.body), {
|
|
1579
|
+
status: proxied.status,
|
|
1580
|
+
statusText: proxied.statusText,
|
|
1581
|
+
headers: proxied.headers
|
|
1582
|
+
});
|
|
1583
|
+
}
|
|
1584
|
+
async function proxyAwareFetch(input, init = {}) {
|
|
1585
|
+
return proxyAwareFetchInternal(input, init, DEFAULT_REDIRECT_LIMIT);
|
|
1586
|
+
}
|
|
1587
|
+
|
|
1164
1588
|
// src/fetch/strategy.ts
|
|
1165
1589
|
import { writeFile as writeFile3 } from "fs/promises";
|
|
1166
1590
|
|
|
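The redirect handling in `proxyAwareFetchInternal` above follows standard fetch semantics: redirect statuses replay the request at the new location, and a 303 is rewritten to a body-less GET. A minimal sketch of just that rewrite rule (the names here are illustrative, not Feedloom internals):

```javascript
// Sketch: how a redirect status rewrites the follow-up request init,
// mirroring the 303 special case in proxyAwareFetchInternal above.
const REDIRECT_STATUSES = new Set([301, 302, 303, 307, 308]);

function nextRequestInit(status, init) {
  if (!REDIRECT_STATUSES.has(status)) return null; // not a redirect
  const nextInit = { ...init };
  if (status === 303) {
    // 303 See Other: always re-request with GET and drop the body.
    nextInit.method = "GET";
    nextInit.body = undefined;
  }
  return nextInit;
}
```

So a POST answered with 303 is replayed as a GET, while a 307 keeps its method and body unchanged.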
@@ -1175,8 +1599,8 @@ function extractPreloadedMarkdownUrl(html, baseUrl) {
   }
   return new URL(rawUrl, baseUrl).toString();
 }
-function removeNoise(
-
+function removeNoise(document) {
+  document.querySelectorAll("script, style, noscript, svg, iframe").forEach((element) => element.remove());
 }
 function normalizedTextLength(element) {
   return (element?.textContent ?? "").replace(/\s+/g, " ").trim().length;
@@ -1185,12 +1609,12 @@ function htmlHasMeaningfulContent(url, html) {
   if (extractPreloadedMarkdownUrl(html, url) !== null) {
     return true;
   }
-  const { document
-  removeNoise(
+  const { document } = parseHTML3(html);
+  removeNoise(document);
   const selectors = ["#js_content", "article", "main", "section", "div", "body"];
   let bestLength = 0;
   for (const selector of selectors) {
-
+    document.querySelectorAll(selector).forEach((element) => {
       bestLength = Math.max(bestLength, normalizedTextLength(element));
     });
     if (bestLength >= 600 && selector !== "div") {
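The meaningful-content heuristic above strips noise tags, then measures whitespace-normalized text length per candidate selector and accepts the page once a non-`div` container reaches 600 characters. The normalization step can be sketched as a pure function (the real helper reads `element.textContent` first; this version takes the string directly):

```javascript
// Sketch of the normalizedTextLength measurement: collapse runs of
// whitespace to single spaces and trim, so markup indentation and
// blank lines do not inflate the content-length estimate.
function normalizedLength(text) {
  return (text ?? "").replace(/\s+/g, " ").trim().length;
}
```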
@@ -1285,11 +1709,11 @@ async function fetchBrowserHtmlWithBrowserState(url, config) {
 }

 // src/fetch/static.ts
-async function fetchStaticHtml(url, timeoutMs = 6e4) {
+async function fetchStaticHtml(url, timeoutMs = 6e4, fetchImpl = fetch) {
   const controller = new AbortController();
   const timeout = setTimeout(() => controller.abort(), timeoutMs);
   try {
-    const response = await
+    const response = await fetchImpl(url, {
       redirect: "follow",
       signal: controller.signal,
       headers: {
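`fetchStaticHtml` above uses the standard `AbortController` timeout pattern, and the new `fetchImpl` parameter is what lets the proxy-aware fetch be injected in place of the global one. A minimal sketch of that wiring, under the assumption that the injected implementation honors `signal` (names illustrative):

```javascript
// Sketch: run an injected fetch implementation with a hard timeout.
async function fetchWithTimeout(url, timeoutMs, fetchImpl = fetch) {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), timeoutMs);
  try {
    // fetchImpl must reject when controller.abort() fires.
    return await fetchImpl(url, { signal: controller.signal });
  } finally {
    clearTimeout(timeout); // always clear, on success or failure
  }
}
```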
@@ -1318,7 +1742,7 @@ async function writeOutputIfRequested(outputPath, html) {
 }
 async function fetchHtmlResult(url, options = {}) {
   const isMeaningful = options.isMeaningful ?? htmlHasMeaningfulContent;
-  const staticFetch = options.staticFetch ?? (async (targetUrl) => (await fetchStaticHtml(targetUrl)).html);
+  const staticFetch = options.staticFetch ?? (async (targetUrl) => (await fetchStaticHtml(targetUrl, void 0, options.useProxyEnv ? proxyAwareFetch : void 0)).html);
   const browserFetch = options.browserFetch ?? ((targetUrl) => fetchBrowserHtml(targetUrl, {
     waitMs: options.waitMs,
     networkIdle: options.networkIdle,
@@ -1486,9 +1910,9 @@ ${code}
 \`\`\``).replace(/\[\s*\]\((?:#|javascript:void\(0\)|javascript:;)\)/gi, "").replace(/(^|[^\\])\$(?=\d)/g, "$1\\$").replace(/\n\s*\n\s*([-*+]\s)/g, "\n$1").replace(/[ \t]+\n/g, "\n").replace(/\n{3,}/g, "\n\n").trim();
 }
 function htmlFragmentText(fragment) {
-  const { document
-
-  return
+  const { document } = parseHTML4(`<!doctype html><html><body>${fragment}</body></html>`);
+  document.querySelectorAll("br").forEach((br) => br.replaceWith(document.createTextNode("\n")));
+  return document.body.textContent ?? "";
 }
 function fencedCodeHtml(text) {
   const escaped = text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
@@ -1584,10 +2008,45 @@ function resolveCreatedValue(item, published) {
   if (item.publishedAt) return createdFromItemDate(item.publishedAt);
   return (/* @__PURE__ */ new Date()).toISOString().replace(/\.\d{3}Z$/, "Z");
 }
+function mergeProfileFetchOptions(options, profiles) {
+  const merged = { ...options };
+  for (const profile of profiles) {
+    if (profile.fetch?.mode) merged.fetchMode = profile.fetch.mode;
+    if (profile.fetch?.waitMs !== void 0) merged.waitMs = profile.fetch.waitMs;
+    if (profile.fetch?.networkIdle !== void 0) merged.networkIdle = profile.fetch.networkIdle;
+    if (profile.fetch?.waitSelector) merged.waitSelector = profile.fetch.waitSelector;
+    if (profile.fetch?.waitSelectorState) merged.waitSelectorState = profile.fetch.waitSelectorState;
+    if (profile.fetch?.clickSelectors) merged.clickSelectors = profile.fetch.clickSelectors;
+    if (profile.fetch?.scrollToBottom !== void 0) merged.scrollToBottom = profile.fetch.scrollToBottom;
+    if (profile.fetch?.useProxyEnv !== void 0) merged.useProxyEnv = profile.fetch.useProxyEnv;
+    if (profile.fetch?.preferBrowserState && options.browserStateDefaults) {
+      merged.browserState = {
+        ...options.browserStateDefaults,
+        waitMs: merged.waitMs,
+        networkIdle: merged.networkIdle,
+        proxy: merged.proxy,
+        dnsOverHttps: merged.dnsOverHttps,
+        waitSelector: merged.waitSelector,
+        waitSelectorState: merged.waitSelectorState,
+        clickSelectors: merged.clickSelectors,
+        scrollToBottom: merged.scrollToBottom,
+        headless: merged.headless,
+        realChromeDefaults: merged.realChromeDefaults
+      };
+    }
+  }
+  return merged;
+}
 async function processItem(item, options) {
-  const
+  const urlProfiles = selectActiveProfiles(options.profiles, item.url, "");
+  const fetchOptions = mergeProfileFetchOptions(options, urlProfiles);
+  const html = await fetchHtml(item.url, fetchOptions);
   const activeProfiles = selectActiveProfiles(options.profiles, item.url, html);
-  const
+  const defuddleFetch = activeProfiles.some((profile) => profile.fetch?.useProxyEnv) ? proxyAwareFetch : void 0;
+  const cleaned = await cleanHtml(html, { baseUrl: item.url, profiles: options.profiles, activeProfiles, defuddleFetch });
+  if (activeProfiles.some((profile) => profile.extraction?.requireText) && !cleaned.content.replace(/<[^>]*>/g, "").trim()) {
+    throw new Error("matched site rule requires extracted text, but no text content was extracted");
+  }
   const title = cleaned.metadata.title || item.sourceTitle || titleFromUrl(item.url);
   await cleanupExistingNote(options.outputDir, item.url);
   const contentHtml = options.localizeAssets === false ? cleaned.content : await localizeImages(cleaned.content, {
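`mergeProfileFetchOptions` above applies matched profiles in order: a later profile's `[fetch]` values override earlier ones, and any field no profile sets falls through to the CLI defaults. A trimmed illustration of that precedence rule (two fields only; the real function handles many more):

```javascript
// Sketch of the override order: CLI options form the base, then each
// matched profile is applied in sequence; only fields a profile
// actually sets replace the current value.
function mergeFetchOptions(cliOptions, profiles) {
  const merged = { ...cliOptions };
  for (const profile of profiles) {
    if (profile.fetch?.mode) merged.fetchMode = profile.fetch.mode;
    if (profile.fetch?.waitMs !== undefined) merged.waitMs = profile.fetch.waitMs;
  }
  return merged;
}
```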
@@ -1656,10 +2115,15 @@ var ProgressTracker = class {
 };

 // src/cli.ts
+var require2 = createRequire(import.meta.url);
+var packageJson = require2("../package.json");
 var program = new Command();
 async function siteRulePathsFromDir(dir) {
   const names = await readdir2(dir);
-  return names.filter((name) => name.endsWith(".toml")).map((name) => join7(dir, name));
+  return names.filter((name) => name.endsWith(".toml")).sort().map((name) => join7(dir, name));
+}
+function builtinSiteRulesDir() {
+  return join7(dirname(fileURLToPath(import.meta.url)), "site-rules");
 }
 function positiveIntOption(value, fallback) {
   const parsed = Number(value ?? fallback);
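The added `.sort()` in `siteRulePathsFromDir` makes rule loading deterministic: `readdir` order is platform-dependent, and profile precedence depends on load order. The filter-and-order step, as a pure function:

```javascript
// Sketch: select .toml rule files and order them by name so that
// profile precedence does not depend on filesystem readdir order.
function orderedRuleNames(names) {
  return names.filter((name) => name.endsWith(".toml")).sort();
}
```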
@@ -1668,7 +2132,13 @@ function positiveIntOption(value, fallback) {
   }
   return parsed;
 }
-program.name("feedloom").description("Archive long-form web content as clean Markdown with local assets").version(
+program.name("feedloom").description("Archive long-form web content as clean Markdown with local assets").version(packageJson.version ?? "0.0.0");
+program.command("doctor").description("Check Feedloom runtime dependencies").action(async () => {
+  const result = await runDoctor();
+  console.error(formatDoctorResult(result));
+  process.exitCode = result.ok ? 0 : 1;
+});
+program.option("--output-dir <dir>", "Output directory for markdown notes", "clippings").option("--source-kind <kind>", "auto, html-page, or rss-feed", "auto").option("--since <date>", "Only keep feed entries on or after YYYY-MM-DD", "").option("--limit <n>", "Process only first N deduplicated URLs", "0").option("--start <n>", "Start from 1-based index after deduplication", "1").option("--end <n>", "End at 1-based index after deduplication", "0").option("--prefer-browser-state", "Try copied local Chrome profile before regular browser fallback", false).option("--chrome-user-data-dir <path>", "Chrome user data directory used with --prefer-browser-state", "").option("--chrome-profile <name>", "Chrome profile directory name", "Default").option("--fetch-mode <mode>", "auto, static, browser, or stealth", "auto").option("--no-network-idle", "Do not wait for browser networkidle before reading HTML").option("--wait-ms <ms>", "Extra browser wait after load", "2500").option("--solve-cloudflare", "In stealth mode, attempt Cloudflare Turnstile/interstitial challenge handling", false).option("--disable-resources", "In stealth mode, block images/media/fonts/stylesheets for speed", false).option("--proxy <server>", "Proxy server for browser/stealth fetch, e.g. http://127.0.0.1:8080", "").option("--dns-over-https", "Use Chromium Cloudflare DNS-over-HTTPS flag for browser/stealth fetch", false).option("--wait-selector <selector>", "Wait for a CSS selector after page load", "").option("--wait-selector-state <state>", "attached, detached, visible, or hidden", "attached").option("--click-selector <selector...>", "Click one or more selectors after page load", []).option("--scroll-to-bottom", "Scroll to the bottom before reading HTML", false).option("--headful", "Run browser/browser-state fetches with a visible Chrome window", false).option("--site-rules-dir <dir>", "Optional directory of private TOML site extraction/cleaning rules", "").option("--no-real-chrome-defaults", "Disable Scrapling-inspired real Chrome context defaults").option("--no-reuse-browser", "Disable batch browser/stealth context reuse").argument("[inputs...]", "URLs or files containing URLs").action(async (inputs, options) => {
   if (inputs.length === 0) {
     program.help({ error: true });
   }
@@ -1696,7 +2166,9 @@ program.name("feedloom").description("Archive long-form web content as clean Mar
     positiveIntOption(options.limit, 0)
   );
   const siteRulesDir = String(options.siteRulesDir || "");
-  const
+  const builtinRulePaths = await siteRulePathsFromDir(builtinSiteRulesDir());
+  const customRulePaths = siteRulesDir ? await siteRulePathsFromDir(resolve2(siteRulesDir)) : [];
+  const profiles = await loadSiteProfiles([...builtinRulePaths, ...customRulePaths]);
   const outputDir = String(options.outputDir ?? "clippings");
   let failures = 0;
   const tracker = new ProgressTracker(selected, outputDir);
@@ -1715,6 +2187,10 @@ program.name("feedloom").description("Archive long-form web content as clean Mar
     headless: !Boolean(options.headful),
     realChromeDefaults: options.realChromeDefaults !== false
   };
+  const browserStateDefaults = {
+    userDataDir: String(options.chromeUserDataDir || ""),
+    profile: String(options.chromeProfile || "Default")
+  };
   const sessions = options.reuseBrowser === false ? null : new BatchFetchSessions({
     browser: browserOptions,
     stealth: {
@@ -1727,15 +2203,11 @@ program.name("feedloom").description("Archive long-form web content as clean Mar
   for (const item of selected) {
     tracker.start(item.url);
     try {
-      const browserState = options.preferBrowserState ? {
-        userDataDir: String(options.chromeUserDataDir || ""),
-        profile: String(options.chromeProfile || "Default"),
-        ...browserOptions
-      } : null;
       const result = await processItem(item, {
         outputDir,
         profiles,
-        browserState,
+        browserState: options.preferBrowserState ? { ...browserStateDefaults, ...browserOptions } : null,
+        browserStateDefaults,
         fetchMode,
         ...browserOptions,
         solveCloudflare: Boolean(options.solveCloudflare),
package/dist/site-rules/xiaohongshu.toml
ADDED
@@ -0,0 +1,55 @@
+[match]
+host_suffixes = ["xiaohongshu.com"]
+
+[fetch]
+mode = "auto"
+scroll_to_bottom = true
+wait_ms = 5000
+
+[extract]
+selectors = [".note-content", "#noteContainer", ".note-container", ".note-detail"]
+
+[metadata]
+strip_title_regexes = ["\\s*-\\s*小红书\\s*$"]
+strip_author_regexes = ["关注$"]
+author_selectors = [".author-container .username", ".author-wrapper .name", ".user-name"]
+author_meta_names = ["author"]
+author_meta_properties = ["article:author"]
+
+[media]
+include_meta_images = true
+image_meta_properties = ["og:image"]
+
+[clean.remove]
+selectors = [
+  ".side-bar",
+  ".left-container",
+  ".comments-el",
+  ".interactions",
+  ".engage-bar",
+  ".note-detail-dropdown",
+  ".close-circle",
+  ".close-box",
+  ".login-container",
+  ".bottom-container .notedetail-menu",
+  ".author",
+  ".author-container",
+  ".media-container",
+  ".fraction",
+  ".arrow-controller",
+  ".pagination-media-container"
+]
+text_contains = [
+  "创作中心",
+  "业务合作",
+  "沪ICP备",
+  "营业执照",
+  "违法不良信息举报电话",
+  "行吟信息科技",
+  "小红书网页版",
+  "登录后推荐更懂你的笔记"
+]
+exact_text = ["关注", "加载中", "更多", "发现", "直播", "发布", "通知"]
+
+[clean.truncate]
+after_regexes = ["^共 \\d+ 条评论$", "^相关推荐$", "^登录后推荐更懂你的笔记$"]
package/dist/site-rules/zhihu.toml
ADDED
@@ -0,0 +1,22 @@
+[match]
+host_suffixes = ["zhihu.com"]
+
+[extract]
+selectors = ["[class*=\"Post-RichTextContainer\"]", "[class*=\"RichText ztext\"]", "[class*=\"RichContent-inner\"]"]
+
+[metadata]
+strip_title_regexes = ["\\s*-\\s*知乎\\s*$"]
+
+[fetch]
+mode = "browser"
+prefer_browser_state = true
+scroll_to_bottom = true
+wait_ms = 8000
+
+[clean.remove]
+class_contains = ["RichText-LinkCardContainer"]
+text_regexes = ["^目录收起$", "^目录收起.*References$", "^.+\\d+ 赞同 · \\d+ 评论 文章$", "^.+\\d+ 赞同 · \\d+ 评论 文章$", "^\\d+ 赞同 · \\d+ 评论 文章$"]
+exact_text = ["目录", "收起"]
+
+[clean.truncate]
+after_regexes = ["^发布于 ", "^赞同 ", "^\\d+ 条评论$", "^分享$", "^申请转载$", ".*的广告$"]
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@ariesfish/feedloom",
-  "version": "0.1.2",
+  "version": "0.1.4",
   "type": "module",
   "author": "ariesfish",
   "license": "MIT",
@@ -17,6 +17,7 @@
     "feedloom": "dist/cli.js"
   },
   "files": [
+    "assets",
     "dist",
     "skills",
     "README.md",
@@ -24,7 +25,7 @@
   ],
   "scripts": {
     "dev": "tsx src/cli.ts",
-    "build": "tsup src/cli.ts --format esm --dts --clean",
+    "build": "tsup src/cli.ts --format esm --dts --clean && rm -rf dist/site-rules && cp -R src/site-rules dist/site-rules",
     "typecheck": "tsc --noEmit",
     "test": "vitest run",
     "prepublishOnly": "npm run typecheck && npm test && npm run build"
package/skills/feedloom/SKILL.md
CHANGED
@@ -22,6 +22,12 @@ npx -y @ariesfish/feedloom <inputs...> [options]

 ## Common usage

+Before clipping with browser-based fetch modes, run `doctor` once to verify and repair the Patchright Chromium runtime. If Chromium is missing, `doctor` automatically runs `npx patchright install chromium`.
+
+```bash
+npx -y @ariesfish/feedloom doctor
+```
+
 Before running Feedloom, check whether this skill directory has a `site-rules/` directory. If it exists, always pass it with `--site-rules-dir $HOME/.agents/skills/feedloom/site-rules`; do not omit available site rules.

 ```bash
@@ -61,11 +67,13 @@ Use the least expensive mode that works:
 - `--site-rules-dir <dir>`: load optional private TOML extraction/cleaning rules from a local directory, for example `$HOME/.agents/skills/feedloom/site-rules/` reference folder.
 - `--solve-cloudflare`, `--proxy <server>`, `--dns-over-https`: use only when stealth fetching needs them.

-Run `npx -y @ariesfish/feedloom --help` for the complete option list. Do not invent unsupported options.
+Run `npx -y @ariesfish/feedloom doctor` when browser, stealth, or auto fallback fails because Chromium is missing or cannot launch. Run `npx -y @ariesfish/feedloom --help` for the complete option list. Do not invent unsupported options.
+
+## Site rules

-
+Feedloom ships built-in TOML site rules in the package for common sites such as WeChat and Zhihu. These are loaded automatically; do not pass a special option for built-in rules.

-
+Private skill rules are also supported and are mandatory to use when present next to this skill. Always check for `$HOME/.agents/skills/feedloom/site-rules/` before clipping. If that directory exists, pass it explicitly on every Feedloom command using the `$HOME`-prefixed path:

 ```bash
 npx -y @ariesfish/feedloom "https://example.com/article" --site-rules-dir $HOME/.agents/skills/feedloom/site-rules
@@ -73,6 +81,8 @@ npx -y @ariesfish/feedloom "https://example.com/article" --site-rules-dir $HOME/

 Treat rule files in `$HOME/.agents/skills/feedloom/site-rules/` as local reference material and use them whenever available; never skip an existing site-rules directory unless the user explicitly asks not to use it.

+For adding or editing private rules, read `references/site-rules.md`. It contains the TOML schema, examples, `[fetch]` behavior, and validation workflow.
+
 ## Output

 - Markdown files are written to `clippings/` by default, or to `--output-dir`.
package/skills/feedloom/references/site-rules.md
ADDED
@@ -0,0 +1,104 @@
+# Feedloom site rules
+
+Use TOML site rules when Feedloom needs a narrow site-specific selector, cleanup overlay, metadata normalization, or conservative fetch preference. Do not write ad-hoc scrapers.
+
+## Locations
+
+Private skill rules live in:
+
+```text
+$HOME/.agents/skills/feedloom/site-rules/<site>.toml
+```
+
+When the private rules directory exists, pass it on every command:
+
+```bash
+npx -y @ariesfish/feedloom "https://example.com/article" --site-rules-dir $HOME/.agents/skills/feedloom/site-rules
+```
+
+## Add a private rule
+
+Create or edit one TOML file per site:
+
+```bash
+mkdir -p $HOME/.agents/skills/feedloom/site-rules
+$EDITOR $HOME/.agents/skills/feedloom/site-rules/example.toml
+```
+
+Minimal rule:
+
+```toml
+[match]
+host_suffixes = ["example.com"]
+
+[extract]
+selectors = ["article", "main"]
+```
+
+Rule with fetch preferences:
+
+```toml
+[match]
+host_suffixes = ["zhihu.com"]
+
+[fetch]
+mode = "browser"
+prefer_browser_state = true
+scroll_to_bottom = true
+wait_ms = 8000
+
+[extract]
+selectors = ["[class*=\"Post-RichTextContainer\"]", "[class*=\"RichText ztext\"]"]
+```
+
+## Schema
+
+Supported sections:
+
+- `[match]`: `host_suffixes`, `host_regexes`, `url_regexes`, `html_markers`.
+- `[fetch]`: `mode`, `prefer_browser_state`, `wait_ms`, `network_idle`, `wait_selector`, `wait_selector_state`, `click_selectors`, `scroll_to_bottom`, `use_proxy_env`.
+- `[extract]`: `selectors`, `require_text`.
+- `[metadata]`: `fixed_author`, `strip_title_regexes`, `strip_author_regexes`, `author_selectors`, `author_meta_names`, `author_meta_itemprops`, `author_meta_properties`.
+- `[clean.remove]`: `selectors`, `class_contains`, `id_contains`, `attr_contains`, `text_contains`, `text_regexes`, `exact_text`.
+- `[clean.truncate]`: `after_contains`, `after_regexes`.
+
+## Fetch rules
+
+Use `[fetch]` only when a site consistently needs browser rendering, local Chrome state, scrolling, waiting, clicking, or proxy-aware requests.
+
+`use_proxy_env = true` tells Feedloom to use `HTTP_PROXY`, `HTTPS_PROXY`, `ALL_PROXY`, and `NO_PROXY` for static fetches and Defuddle async extractor fetches. Use this for YouTube transcript capture and similar extractor-backed pages that need the user's proxy settings.
+
+`prefer_browser_state = true` only tells Feedloom to use copied Chrome state for matching URLs. It does not store the local Chrome path. The command still needs Chrome state parameters when login state is required:
+
+```bash
+npx -y @ariesfish/feedloom \
+  --chrome-user-data-dir "$HOME/Library/Application Support/Google/Chrome" \
+  --chrome-profile Default \
+  --site-rules-dir $HOME/.agents/skills/feedloom/site-rules \
+  "https://zhuanlan.zhihu.com/p/..."
+```
+
+## Rules for writing rules
+
+- Prefer narrow domain-specific selectors over broad selectors.
+- Prefer content containers over page shells. Avoid `body` unless the HTML is already minimal.
+- Use `require_text = true` when a matched extractor-backed page should fail instead of writing an empty note.
+- Use cleanup only for repeated, stable noise inside otherwise correct content.
+- Use truncation only for stable tail markers where everything after the marker is non-article content.
+- Do not add aggressive crawling, high concurrency, repeated challenge solving, or broad stealth defaults.
+- Keep private rules outside project repos unless the user is working on Feedloom itself.
+
+## Validation
+
+After adding or editing a private rule, test one known URL and inspect the Markdown:
+
+```bash
+outdir=$(mktemp -d /tmp/feedloom-rule-test-XXXXXX)
+npx -y @ariesfish/feedloom \
+  --output-dir "$outdir" \
+  --site-rules-dir $HOME/.agents/skills/feedloom/site-rules \
+  "https://example.com/article"
+find "$outdir" -maxdepth 2 -type f | sort
+```
+
+For sites that require Chrome state, add the Chrome state options shown above.