@ariesfish/feedloom 0.1.2 → 0.1.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,5 +1,15 @@
  # Feedloom
 
+ <div align="center">
+ <img src="assets/logo.png" alt="Feedloom logo" width="160">
+ <p><strong>Archive long-form web content as clean Markdown with local assets.</strong></p>
+ <p>
+ <a href="https://www.npmjs.com/package/@ariesfish/feedloom"><img alt="npm" src="https://img.shields.io/npm/v/@ariesfish/feedloom"></a>
+ <img alt="Node.js >= 24" src="https://img.shields.io/badge/node-%3E%3D24-339933">
+ <img alt="License MIT" src="https://img.shields.io/badge/license-MIT-blue">
+ </p>
+ </div>
+
  Feedloom is a command-line tool for archiving long-form web content. It takes article URLs, URL list files, or RSS/Atom feeds, extracts readable article content, converts it to Markdown with YAML frontmatter, and saves page images as local assets. It is designed for personal knowledge bases, notebook vaults, and offline reading archives.
 
  ## Features
@@ -43,6 +53,14 @@ If you plan to use `browser`, `stealth`, or the browser fallback in `auto` mode,
  npx patchright install chromium
  ```
 
+ You can verify or repair the runtime later with:
+
+ ```bash
+ npm run dev -- doctor
+ ```
+
+ If the Patchright Chromium executable is missing, `doctor` runs `npx patchright install chromium` automatically.
+
  ### 4. Build the CLI
 
  ```bash
@@ -261,6 +279,12 @@ Only use this on your own device and accounts. Always respect the target site's
  --site-rules-dir <dir> Optional directory of private TOML site rules
  ```
 
+ Run environment checks:
+
+ ```bash
+ npm run dev -- doctor
+ ```
+
  For the full option list, run:
 
  ```bash
@@ -281,7 +305,8 @@ npm test
  - Respect robots.txt, website terms of service, copyright, and rate limits.
  - For dynamic pages, try `--fetch-mode browser` first.
  - For static blogs and news sites, `--fetch-mode static` is usually faster.
- - If article extraction is poor for a specific site, keep private TOML site rules outside the package and pass them with `--site-rules-dir <dir>`.
+ - Feedloom ships bundled TOML site rules for common dynamic/structured sites such as WeChat official account articles and Zhihu. Site rules can define extraction, cleanup, and fetch preferences. For example, the bundled Zhihu rule uses browser fetch with copied Chrome state when `--chrome-user-data-dir`/`--chrome-profile` are configured.
+ - If article extraction is poor for a specific site, keep private TOML site rules outside the package and pass them with `--site-rules-dir <dir>`. Private rules are loaded after bundled rules.
  - For large batches, test with `--limit` before running the full job.
 
  ## Acknowledgements
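The README tips above reference bundled and private TOML site rules. As an illustrative sketch only: the field names below are the ones visible in the `profileFromTomlRule` mapping in the `dist/cli.js` diff further down (`metadata`, `fetch`, `media`, and `extract` tables), but the file naming convention, the host-matching key, and all values here are assumptions, not part of this diff.

```toml
# Hypothetical private site rule for --site-rules-dir; every value is a
# placeholder, and only the key names are taken from profileFromTomlRule.
[metadata]
fixed_author = "Example Author"
strip_title_regexes = ["\\s*-\\s*Example Blog$"]
strip_author_regexes = ["\\s*\\|\\s*Staff Writer$"]

[fetch]
mode = "browser"
wait_ms = 1500
scroll_to_bottom = true
use_proxy_env = true

[media]
include_meta_images = true
image_meta_properties = ["og:image"]

[extract]
require_text = "article"
```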
Binary file
package/dist/cli.js CHANGED
@@ -2,7 +2,9 @@
 
  // src/cli.ts
  import { readdir as readdir2 } from "fs/promises";
- import { join as join7, resolve as resolve2 } from "path";
+ import { createRequire } from "module";
+ import { dirname, join as join7, resolve as resolve2 } from "path";
+ import { fileURLToPath } from "url";
  import { Command } from "commander";
 
  // src/cleaning/profiles.ts
@@ -39,10 +41,29 @@ function profileFromTomlRule(name, rule) {
  metadata: {
  fixedAuthor: rule.metadata?.fixed_author,
  titleSuffixPatterns: rule.metadata?.strip_title_regexes,
+ authorSuffixPatterns: rule.metadata?.strip_author_regexes,
  authorSelectors: rule.metadata?.author_selectors,
  authorMetaNames: rule.metadata?.author_meta_names,
  authorMetaItemprops: rule.metadata?.author_meta_itemprops,
  authorMetaProperties: rule.metadata?.author_meta_properties
+ },
+ fetch: {
+ mode: rule.fetch?.mode,
+ preferBrowserState: rule.fetch?.prefer_browser_state,
+ waitMs: rule.fetch?.wait_ms,
+ networkIdle: rule.fetch?.network_idle,
+ waitSelector: rule.fetch?.wait_selector,
+ waitSelectorState: rule.fetch?.wait_selector_state,
+ clickSelectors: rule.fetch?.click_selectors,
+ scrollToBottom: rule.fetch?.scroll_to_bottom,
+ useProxyEnv: rule.fetch?.use_proxy_env
+ },
+ media: {
+ includeMetaImages: rule.media?.include_meta_images,
+ imageMetaProperties: rule.media?.image_meta_properties
+ },
+ extraction: {
+ requireText: rule.extract?.require_text
  }
  };
  }
@@ -92,11 +113,172 @@ function firstContentSelector(profiles) {
  return void 0;
  }
 
+ // src/doctor.ts
+ import { spawn } from "child_process";
+ import { access } from "fs/promises";
+ import { chromium } from "patchright";
+ var INSTALL_CHROMIUM_COMMAND = "npx patchright install chromium";
+ var INSTALL_CHROMIUM_HINT = `Run: ${INSTALL_CHROMIUM_COMMAND}`;
+ function errorMessage(error) {
+ return error instanceof Error ? error.message : String(error);
+ }
+ function chromiumExecutablePath() {
+ const browserType = chromium;
+ return browserType.executablePath();
+ }
+ function appendCheck(checks, check) {
+ checks.push(check);
+ }
+ async function installPatchrightChromium(events = {}) {
+ return new Promise((resolve3, reject) => {
+ const child = spawn("npx", ["patchright", "install", "chromium"], {
+ stdio: ["ignore", "pipe", "pipe"]
+ });
+ child.stdout.setEncoding("utf8");
+ child.stderr.setEncoding("utf8");
+ child.stdout.on("data", (chunk) => events.onStdout?.(chunk));
+ child.stderr.on("data", (chunk) => events.onStderr?.(chunk));
+ child.on("error", reject);
+ child.on("close", (code, signal) => resolve3({ code, signal }));
+ });
+ }
+ async function executableExists(path) {
+ try {
+ await access(path);
+ return { ok: true };
+ } catch (error) {
+ return { ok: false, error };
+ }
+ }
+ async function ensureChromiumInstalled(checks, executablePath, options) {
+ const firstCheck = await executableExists(executablePath);
+ if (firstCheck.ok) {
+ appendCheck(checks, {
+ name: "Patchright Chromium installation",
+ ok: true,
+ message: "Chromium executable exists.",
+ detail: executablePath
+ });
+ return true;
+ }
+ appendCheck(checks, {
+ name: "Patchright Chromium installation",
+ ok: false,
+ message: "Chromium executable was not found on disk. Installing Patchright Chromium...",
+ detail: `${executablePath}
+ ${errorMessage(firstCheck.error)}`
+ });
+ let output = "";
+ const appendOutput = (chunk) => {
+ output += chunk;
+ options.stderr.write(chunk);
+ };
+ try {
+ const result = await options.installChromium({ onStdout: appendOutput, onStderr: appendOutput });
+ if (result.code !== 0) {
+ appendCheck(checks, {
+ name: "Patchright Chromium auto-install",
+ ok: false,
+ message: `Installation command failed with exit code ${result.code ?? "null"}${result.signal ? ` and signal ${result.signal}` : ""}.`,
+ detail: output.trim() || void 0,
+ hint: INSTALL_CHROMIUM_HINT
+ });
+ return false;
+ }
+ } catch (error) {
+ appendCheck(checks, {
+ name: "Patchright Chromium auto-install",
+ ok: false,
+ message: "Installation command failed to start or crashed.",
+ detail: errorMessage(error),
+ hint: INSTALL_CHROMIUM_HINT
+ });
+ return false;
+ }
+ const secondCheck = await executableExists(executablePath);
+ appendCheck(checks, secondCheck.ok ? {
+ name: "Patchright Chromium auto-install",
+ ok: true,
+ message: "Chromium installed successfully.",
+ detail: executablePath
+ } : {
+ name: "Patchright Chromium auto-install",
+ ok: false,
+ message: "Installation finished, but Chromium executable is still missing.",
+ detail: `${executablePath}
+ ${errorMessage(secondCheck.error)}`,
+ hint: INSTALL_CHROMIUM_HINT
+ });
+ return secondCheck.ok;
+ }
+ async function runDoctor(options = {}) {
+ const checks = [];
+ const resolvedOptions = {
+ installChromium: options.installChromium ?? installPatchrightChromium,
+ stderr: options.stderr ?? process.stderr
+ };
+ let executablePath = "";
+ try {
+ executablePath = chromiumExecutablePath();
+ checks.push({
+ name: "Patchright Chromium executable path",
+ ok: true,
+ message: executablePath
+ });
+ } catch (error) {
+ checks.push({
+ name: "Patchright Chromium executable path",
+ ok: false,
+ message: "Patchright does not report a Chromium executable for this platform.",
+ detail: errorMessage(error),
+ hint: INSTALL_CHROMIUM_HINT
+ });
+ }
+ const installed = executablePath ? await ensureChromiumInstalled(checks, executablePath, resolvedOptions) : false;
+ if (installed) {
+ try {
+ const browser = await chromium.launch({ headless: true });
+ await browser.close();
+ checks.push({
+ name: "Patchright Chromium launch",
+ ok: true,
+ message: "Chromium launched successfully in headless mode."
+ });
+ } catch (error) {
+ checks.push({
+ name: "Patchright Chromium launch",
+ ok: false,
+ message: "Chromium executable exists but failed to launch.",
+ detail: errorMessage(error),
+ hint: INSTALL_CHROMIUM_HINT
+ });
+ }
+ }
+ return {
+ ok: checks.at(-1)?.ok === true,
+ checks
+ };
+ }
+ function formatDoctorResult(result) {
+ const lines = ["Feedloom doctor"];
+ for (const check of result.checks) {
+ lines.push(`${check.ok ? "\u2713" : "\u2717"} ${check.name}: ${check.message}`);
+ if (check.detail) {
+ lines.push(...check.detail.split("\n").map((line) => ` ${line}`));
+ }
+ if (check.hint) {
+ lines.push(` ${check.hint}`);
+ }
+ }
+ lines.push(result.ok ? "OK" : "FAILED");
+ return lines.join("\n");
+ }
+
  // src/fetch/browser.ts
  import { mkdtemp, rm } from "fs/promises";
  import { tmpdir } from "os";
  import { join } from "path";
- import { chromium } from "patchright";
+ import { chromium as chromium2 } from "patchright";
  var SCRAPLING_DEFAULT_ARGS = [
  "--no-pings",
  "--no-first-run",
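The doctor code added above ends in `formatDoctorResult`, which turns the collected checks into a plain-text report. Re-typed standalone below so the report shape can be seen without launching Chromium; the function body mirrors the bundled code, while the sample `checks` input is entirely hypothetical.

```javascript
// Standalone re-typing of formatDoctorResult from the src/doctor.ts
// section of dist/cli.js above; the sample checks are made up.
function formatDoctorResult(result) {
  const lines = ["Feedloom doctor"];
  for (const check of result.checks) {
    // "\u2713" is a check mark, "\u2717" a cross, as in the bundled code
    lines.push(`${check.ok ? "\u2713" : "\u2717"} ${check.name}: ${check.message}`);
    if (check.detail) {
      lines.push(...check.detail.split("\n").map((line) => ` ${line}`));
    }
    if (check.hint) {
      lines.push(` ${check.hint}`);
    }
  }
  lines.push(result.ok ? "OK" : "FAILED");
  return lines.join("\n");
}

const report = formatDoctorResult({
  ok: false,
  checks: [
    { ok: true, name: "Patchright Chromium executable path", message: "/tmp/chromium" },
    {
      ok: false,
      name: "Patchright Chromium installation",
      message: "Chromium executable was not found on disk.",
      hint: "Run: npx patchright install chromium"
    }
  ]
});
console.log(report);
```

Note that `runDoctor` reports overall `ok` from the last check only (`checks.at(-1)?.ok === true`), so a failed intermediate check that is later repaired still yields `OK`.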
@@ -122,13 +304,13 @@ async function runPageActions(page, options) {
  await page.locator(selector).first().click({ timeout: 5e3 }).catch(() => void 0);
  }
  if (options.scrollToBottom) {
- await page.evaluate(async () => {
- const delay = (ms) => new Promise((resolve3) => setTimeout(resolve3, ms));
+ await page.evaluate(`(async () => {
+ const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
  for (let i = 0; i < 8; i += 1) {
  window.scrollTo(0, document.body.scrollHeight);
  await delay(250);
  }
- });
+ })()`);
  }
  if (options.waitSelector) {
  await page.locator(options.waitSelector).first().waitFor({
@@ -145,7 +327,7 @@ async function launchBrowserContext(options) {
  if (options.dnsOverHttps) {
  extraArgs.push("--dns-over-https-templates=https://cloudflare-dns.com/dns-query");
  }
- const context = await chromium.launchPersistentContext(userDataDir, {
+ const context = await chromium2.launchPersistentContext(userDataDir, {
  channel: options.channel,
  headless: options.headless ?? true,
  args: extraArgs,
@@ -222,7 +404,7 @@ async function fetchBrowserHtml(url, options = {}) {
  import { mkdtemp as mkdtemp2, rm as rm2 } from "fs/promises";
  import { tmpdir as tmpdir2 } from "os";
  import { join as join2 } from "path";
- import { chromium as chromium2 } from "patchright";
+ import { chromium as chromium3 } from "patchright";
  var DEFAULT_ARGS = [
  "--no-pings",
  "--no-first-run",
@@ -365,7 +547,7 @@ async function solveCloudflare(page) {
  async function launchStealthContext(options) {
  const userDataDir = options.userDataDir ?? await mkdtemp2(join2(tmpdir2(), "feedloom-stealth-"));
  const ownsUserDataDir = options.userDataDir === void 0;
- const context = await chromium2.launchPersistentContext(userDataDir, {
+ const context = await chromium3.launchPersistentContext(userDataDir, {
  channel: "chromium",
  headless: options.headless ?? true,
  args: stealthArgs(options),
@@ -409,13 +591,13 @@ async function fetchWithStealthContext(context, url, options) {
  await page.locator(selector).first().click({ timeout: 5e3 }).catch(() => void 0);
  }
  if (options.scrollToBottom) {
- await page.evaluate(async () => {
- const delay = (ms) => new Promise((resolve3) => setTimeout(resolve3, ms));
+ await page.evaluate(`(async () => {
+ const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
  for (let i = 0; i < 8; i += 1) {
  window.scrollTo(0, document.body.scrollHeight);
  await delay(250);
  }
- });
+ })()`);
  }
  if (options.waitSelector) {
  await page.locator(options.waitSelector).first().waitFor({ state: options.waitSelectorState ?? "attached", timeout: timeoutMs }).catch(() => void 0);
@@ -824,8 +1006,8 @@ function imageSource(img) {
  return first || null;
  }
  async function localizeImages(html, options) {
- const { document: document2 } = parseHTML(`<!doctype html><html><body>${html}</body></html>`);
- const images = Array.from(document2.querySelectorAll("img"));
+ const { document } = parseHTML(`<!doctype html><html><body>${html}</body></html>`);
+ const images = Array.from(document.querySelectorAll("img"));
  if (images.length === 0) return html;
  const fetchImage = options.fetchImage ?? fetch;
  const seen = /* @__PURE__ */ new Map();
@@ -864,7 +1046,7 @@ async function localizeImages(html, options) {
  img.removeAttribute("data-original");
  img.removeAttribute("data-src");
  }
- return document2.body.innerHTML;
+ return document.body.innerHTML;
  }
 
  // src/cleaning/clean-html.ts
@@ -1004,6 +1186,18 @@ function cleanupTitle(metadata, profiles) {
  }
  metadata.title = title;
  }
+ function cleanupAuthor(metadata, profiles) {
+ if (!metadata.author) {
+ return;
+ }
+ let author = metadata.author;
+ for (const profile of profiles) {
+ for (const pattern of profile.metadata?.authorSuffixPatterns ?? []) {
+ author = author.replace(new RegExp(pattern, "i"), "").trim();
+ }
+ }
+ metadata.author = author;
+ }
  function applySiteProfiles(root, profiles, removals) {
  removeByExactSelectors(root, profiles, removals);
  removeByPartialAttributePatterns(root, profiles, removals);
@@ -1012,6 +1206,7 @@ function applySiteProfiles(root, profiles, removals) {
  function applyMetadataProfiles(metadata, profiles) {
  applyFixedAuthor(metadata, profiles);
  cleanupTitle(metadata, profiles);
+ cleanupAuthor(metadata, profiles);
  }
 
  // src/cleaning/clean-html.ts
@@ -1033,17 +1228,17 @@ var DEFAULT_FEEDLOOM_PROFILE = {
  }
  };
  var DefuddleClass = DefuddleModule.default ?? DefuddleModule.Defuddle;
- function firstMetaContent(document2, names) {
+ function firstMetaContent(document, names) {
  for (const name of names) {
  const escaped = name.replace(/"/g, '\\"');
- const element = document2.querySelector(`meta[property="${escaped}"], meta[name="${escaped}"], meta[itemprop="${escaped}"]`);
+ const element = document.querySelector(`meta[property="${escaped}"], meta[name="${escaped}"], meta[itemprop="${escaped}"]`);
  const content = element?.getAttribute("content")?.trim();
  if (content) return content;
  }
  return void 0;
  }
- function jsonLdValue(document2, keys) {
- for (const script of Array.from(document2.querySelectorAll('script[type="application/ld+json"]'))) {
+ function jsonLdValue(document, keys) {
+ for (const script of Array.from(document.querySelectorAll('script[type="application/ld+json"]'))) {
  const text = script.textContent?.trim();
  if (!text) continue;
  try {
@@ -1064,12 +1259,12 @@ function jsonLdValue(document2, keys) {
  }
  return void 0;
  }
- function profileAuthorFromDocument(document2, profiles) {
+ function profileAuthorFromDocument(document, profiles) {
  for (const profile of profiles) {
  const metadata = profile.metadata;
  if (!metadata) continue;
  for (const selector of metadata.authorSelectors ?? []) {
- const author = document2.querySelector(selector)?.textContent?.replace(/\s+/g, " ").trim();
+ const author = document.querySelector(selector)?.textContent?.replace(/\s+/g, " ").trim();
  if (author) return author;
  }
  const metaNames = [
@@ -1079,33 +1274,54 @@ function profileAuthorFromDocument(document2, profiles) {
  ];
  for (const entry of metaNames) {
  const escaped = entry.value.replace(/"/g, '\\"');
- const author = document2.querySelector(`meta[${entry.attr}="${escaped}"]`)?.getAttribute("content")?.trim();
+ const author = document.querySelector(`meta[${entry.attr}="${escaped}"]`)?.getAttribute("content")?.trim();
  if (author) return author;
  }
  }
  return void 0;
  }
- function toMetadata(result, document2, profiles) {
+ function toMetadata(result, document, profiles) {
  return {
- title: result.title || firstMetaContent(document2, ["og:title", "twitter:title"]) || document2.querySelector("title")?.textContent?.trim() || void 0,
- description: result.description || firstMetaContent(document2, ["description", "og:description", "twitter:description"]),
+ title: result.title || firstMetaContent(document, ["og:title", "twitter:title"]) || document.querySelector("title")?.textContent?.trim() || void 0,
+ description: result.description || firstMetaContent(document, ["description", "og:description", "twitter:description"]),
  domain: result.domain || void 0,
  favicon: result.favicon || void 0,
- image: result.image || firstMetaContent(document2, ["og:image", "twitter:image"]),
- language: result.language || document2.documentElement.getAttribute("lang") || void 0,
- published: result.published || firstMetaContent(document2, ["article:published_time", "date", "datePublished", "pubdate", "publishdate"]) || jsonLdValue(document2, ["datePublished", "dateCreated"]),
- author: result.author || profileAuthorFromDocument(document2, profiles) || firstMetaContent(document2, ["author", "article:author", "twitter:creator"]) || jsonLdValue(document2, ["author", "creator"]),
- site: result.site || firstMetaContent(document2, ["og:site_name", "application-name"]),
+ image: result.image || firstMetaContent(document, ["og:image", "twitter:image"]),
+ language: result.language || document.documentElement.getAttribute("lang") || void 0,
+ published: result.published || firstMetaContent(document, ["article:published_time", "date", "datePublished", "pubdate", "publishdate"]) || jsonLdValue(document, ["datePublished", "dateCreated"]),
+ author: result.author || profileAuthorFromDocument(document, profiles) || firstMetaContent(document, ["author", "article:author", "twitter:creator"]) || jsonLdValue(document, ["author", "creator"]),
+ site: result.site || firstMetaContent(document, ["og:site_name", "application-name"]),
  schemaOrgData: result.schemaOrgData,
  wordCount: result.wordCount,
  parseTime: result.parseTime
  };
  }
- function serializeProfiledContent(content, profiles, removals) {
- const { document: document2 } = parseHTML2(`<!doctype html><html><body><main data-feedloom-profile-root="true">${content}</main></body></html>`);
- const root = document2.querySelector('[data-feedloom-profile-root="true"]') ?? document2.body;
+ function appendMetaImages(document, root, profiles) {
+ const properties = profiles.flatMap((profile) => profile.media?.includeMetaImages ? profile.media.imageMetaProperties ?? ["og:image"] : []);
+ if (properties.length === 0) {
+ return;
+ }
+ const seen = new Set(Array.from(root.querySelectorAll("img")).map((img) => img.getAttribute("src") ?? ""));
+ for (const property of properties) {
+ const escaped = property.replace(/"/g, '\\"');
+ for (const meta of Array.from(document.querySelectorAll(`meta[property="${escaped}"], meta[name="${escaped}"], meta[itemprop="${escaped}"]`))) {
+ const src = meta.getAttribute("content")?.trim();
+ if (!src || seen.has(src)) continue;
+ const img = document.createElement("img");
+ img.setAttribute("src", src);
+ img.setAttribute("alt", "");
+ root.appendChild(document.createElement("p"));
+ root.lastElementChild?.appendChild(img);
+ seen.add(src);
+ }
+ }
+ }
+ function serializeProfiledContent(document, content, profiles, removals) {
+ const { document: contentDocument } = parseHTML2(`<!doctype html><html><body><main data-feedloom-profile-root="true">${content}</main></body></html>`);
+ const root = contentDocument.querySelector('[data-feedloom-profile-root="true"]') ?? contentDocument.body;
+ appendMetaImages(document, root, profiles);
  applySiteProfiles(root, profiles, removals);
- const serialized = root.innerHTML || root.outerHTML || document2.body.innerHTML;
+ const serialized = root.innerHTML || root.outerHTML || contentDocument.body.innerHTML;
  return serialized.trim() ? `${serialized.trim()}
  ` : "";
  }
@@ -1120,9 +1336,9 @@ var HtmlCleaner = class {
  const preferredContentSelector = this.options.contentSelector ?? firstContentSelector(activeProfiles);
  const removals = [];
  const html = /<html[\s>]/i.test(rawHtml) ? rawHtml : `<!doctype html><html><body>${rawHtml}</body></html>`;
- const { document: document2 } = parseHTML2(html);
- const contentSelector = preferredContentSelector && document2.querySelector(preferredContentSelector) ? preferredContentSelector : void 0;
- const doc = document2;
+ const { document } = parseHTML2(html);
+ const contentSelector = preferredContentSelector && document.querySelector(preferredContentSelector) ? preferredContentSelector : void 0;
+ const doc = document;
  if (this.options.baseUrl) {
  doc.URL = this.options.baseUrl;
  }
@@ -1139,12 +1355,14 @@ var HtmlCleaner = class {
  removeExactSelectors: this.options.removeExactSelectors,
  removePartialSelectors: this.options.removePartialSelectors,
  removeContentPatterns: this.options.removeContentPatterns,
- standardize: this.options.standardize
+ standardize: this.options.standardize,
+ fetch: this.options.defuddleFetch,
+ language: this.options.language
  });
  const result = parser2.parseAsync ? await parser2.parseAsync() : parser2.parse();
- const metadata = toMetadata(result, document2, activeProfiles);
+ const metadata = toMetadata(result, document, activeProfiles);
  applyMetadataProfiles(metadata, activeProfiles);
- const content = serializeProfiledContent(result.content, postProfiles, removals);
+ const content = serializeProfiledContent(document, result.content, postProfiles, removals);
  return {
  content,
  contentMarkdown: result.contentMarkdown,
@@ -1161,6 +1379,212 @@ async function cleanHtml(rawHtml, options = {}) {
1161
1379
  return new HtmlCleaner(options).parse(rawHtml);
1162
1380
  }
1163
1381
 
1382
+ // src/fetch/proxy-fetch.ts
1383
+ import { request as httpRequest } from "http";
1384
+ import { connect as tlsConnect } from "tls";
1385
+ var REDIRECT_STATUSES = /* @__PURE__ */ new Set([301, 302, 303, 307, 308]);
1386
+ var DEFAULT_REDIRECT_LIMIT = 10;
1387
+ function envProxyForUrl(targetUrl) {
1388
+ const raw = targetUrl.protocol === "https:" ? process.env.HTTPS_PROXY || process.env.https_proxy || process.env.ALL_PROXY || process.env.all_proxy : process.env.HTTP_PROXY || process.env.http_proxy || process.env.ALL_PROXY || process.env.all_proxy;
1389
+ if (!raw || noProxyMatches(targetUrl.hostname)) {
1390
+ return null;
1391
+ }
1392
+ try {
1393
+ return new URL(raw);
1394
+ } catch {
1395
+ return null;
1396
+ }
1397
+ }
1398
+ function noProxyMatches(hostname) {
1399
+ const raw = process.env.NO_PROXY ?? process.env.no_proxy ?? "";
1400
+ if (!raw) return false;
1401
+ const host = hostname.toLowerCase();
1402
+ return raw.split(",").map((entry) => entry.trim().toLowerCase()).some((entry) => {
1403
+ if (!entry) return false;
1404
+ if (entry === "*") return true;
1405
+ if (entry.startsWith(".")) return host === entry.slice(1) || host.endsWith(entry);
1406
+ return host === entry || host.endsWith(`.${entry}`);
1407
+ });
1408
+ }
1409
+ function headersToRecord(headers) {
1410
+ const record = {};
1411
+ if (!headers) return record;
1412
+ new Headers(headers).forEach((value, key) => {
1413
+ record[key] = value;
1414
+ });
1415
+ return record;
1416
+ }
1417
+ function responseHeaders(headers) {
1418
+ const result = new Headers();
1419
+ for (const [key, value] of Object.entries(headers)) {
1420
+ if (Array.isArray(value)) {
1421
+ for (const item of value) result.append(key, item);
1422
+ } else if (value !== void 0) {
1423
+ result.set(key, String(value));
1424
+ }
1425
+ }
1426
+ return result;
1427
+ }
1428
+ async function bodyToBuffer(body) {
1429
+ if (body === void 0 || body === null) return void 0;
1430
+ if (typeof ReadableStream !== "undefined" && body instanceof ReadableStream) {
1431
+ throw new Error("proxy-aware fetch does not support streaming request bodies");
1432
+ }
1433
+ if (typeof body === "string") return Buffer.from(body);
1434
+ if (body instanceof URLSearchParams) return Buffer.from(body.toString());
1435
+ if (body instanceof ArrayBuffer) return Buffer.from(body);
1436
+ if (ArrayBuffer.isView(body)) return Buffer.from(body.buffer, body.byteOffset, body.byteLength);
1437
+ if (typeof Blob !== "undefined" && body instanceof Blob) return Buffer.from(await body.arrayBuffer());
1438
+ throw new Error("proxy-aware fetch only supports buffered request bodies");
1439
+ }
1440
+ function collectResponse(res, done) {
1441
+ const chunks = [];
1442
+ res.on("data", (chunk) => {
1443
+ chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
1444
+ });
1445
+ res.on("end", () => {
1446
+ done(null, {
1447
+ status: res.statusCode ?? 0,
1448
+ statusText: res.statusMessage ?? "",
1449
+ headers: Object.fromEntries(responseHeaders(res.headers)),
1450
+ body: Buffer.concat(chunks)
1451
+ });
1452
+ });
1453
+ res.on("error", (error) => done(error));
1454
+ }
1455
+ function proxyAuthorization(proxy) {
1456
+ if (!proxy.username) return void 0;
1457
+ return `Basic ${Buffer.from(`${decodeURIComponent(proxy.username)}:${decodeURIComponent(proxy.password)}`).toString("base64")}`;
1458
+ }
1459
+ function requestViaHttpProxy(targetUrl, proxy, method, headers, body, signal) {
1460
+ if (proxy.protocol !== "http:") {
1461
+ throw new Error(`Unsupported proxy protocol: ${proxy.protocol}`);
1462
+ }
1463
+ return new Promise((resolve3, reject) => {
1464
+ let settled = false;
1465
+ let active = null;
1466
+ const done = (error, response) => {
1467
+ if (settled) return;
1468
+ settled = true;
1469
+ signal?.removeEventListener("abort", abort);
1470
+ if (error) reject(error);
1471
+ else if (response) resolve3(response);
1472
+ else reject(new Error("Proxy request ended without a response"));
1473
+ };
1474
+ const abort = () => active?.destroy(new Error("The operation was aborted"));
1475
+ if (signal?.aborted) {
1476
+ done(new Error("The operation was aborted"));
1477
+ return;
1478
+ }
1479
+ signal?.addEventListener("abort", abort, { once: true });
1480
+ const targetPort = targetUrl.port ? Number(targetUrl.port) : targetUrl.protocol === "https:" ? 443 : 80;
1481
+ const proxyPort = proxy.port ? Number(proxy.port) : 8080;
1482
+ const auth = proxyAuthorization(proxy);
1483
+ const requestHeaders2 = {
1484
+ ...headers,
1485
+ host: targetUrl.host
1486
+ };
1487
+ if (body && !Object.keys(requestHeaders2).some((key) => key.toLowerCase() === "content-length")) {
1488
+ requestHeaders2["content-length"] = String(body.byteLength);
1489
+ }
1490
+ if (targetUrl.protocol === "https:") {
1491
+ const connectHeaders = { host: `${targetUrl.hostname}:${targetPort}` };
1492
+ if (auth) connectHeaders["proxy-authorization"] = auth;
1493
+ const connectReq = httpRequest({
1494
+ host: proxy.hostname,
1495
+ port: proxyPort,
1496
+ method: "CONNECT",
1497
+ path: `${targetUrl.hostname}:${targetPort}`,
1498
+ headers: connectHeaders
1499
+ });
1500
+ active = connectReq;
1501
+ connectReq.on("connect", (connectRes, socket) => {
+ if (connectRes.statusCode !== 200) {
+ socket.destroy();
+ done(new Error(`Proxy CONNECT failed: ${connectRes.statusCode ?? 0}`));
+ return;
+ }
+ const tlsSocket = tlsConnect({ socket, host: targetUrl.hostname, servername: targetUrl.hostname });
+ active = tlsSocket;
+ tlsSocket.on("error", (error) => done(error));
+ tlsSocket.on("secureConnect", () => {
+ const req2 = httpRequest({
+ method,
+ path: `${targetUrl.pathname}${targetUrl.search}`,
+ headers: requestHeaders2,
+ createConnection: () => tlsSocket
+ }, (res) => collectResponse(res, done));
+ active = req2;
+ req2.on("error", (error) => done(error));
+ if (body) req2.write(body);
+ req2.end();
+ });
+ });
+ connectReq.on("error", (error) => done(error));
+ connectReq.end();
+ return;
+ }
+ if (auth) requestHeaders2["proxy-authorization"] = auth;
+ const req = httpRequest({
+ host: proxy.hostname,
+ port: proxyPort,
+ method,
+ path: targetUrl.href,
+ headers: requestHeaders2
+ }, (res) => collectResponse(res, done));
+ active = req;
+ req.on("error", (error) => done(error));
+ if (body) req.write(body);
+ req.end();
+ });
+ }
+ function requestUrl(input) {
+ if (input instanceof URL) return input;
+ if (typeof input === "string") return new URL(input);
+ return new URL(input.url);
+ }
+ function requestMethod(input, init) {
+ if (init.method) return init.method.toUpperCase();
+ if (input instanceof Request) return input.method.toUpperCase();
+ return "GET";
+ }
+ function requestHeaders(input, init) {
+ return {
+ ...input instanceof Request ? headersToRecord(input.headers) : {},
+ ...headersToRecord(init.headers)
+ };
+ }
+ async function proxyAwareFetchInternal(input, init, redirectsLeft) {
+ const url = requestUrl(input);
+ const proxy = envProxyForUrl(url);
+ if (!proxy) {
+ return fetch(input, init);
+ }
+ const method = requestMethod(input, init);
+ const headers = requestHeaders(input, init);
+ const body = await bodyToBuffer(init.body ?? (input instanceof Request ? input.body : void 0));
+ const proxied = await requestViaHttpProxy(url, proxy, method, headers, body, init.signal ?? void 0);
+ const location = proxied.headers.location;
+ const redirectMode = init.redirect ?? "follow";
+ if (redirectMode !== "manual" && REDIRECT_STATUSES.has(proxied.status) && location && redirectsLeft > 0) {
+ const nextUrl = new URL(location, url);
+ const nextInit = { ...init };
+ if (proxied.status === 303) {
+ nextInit.method = "GET";
+ nextInit.body = void 0;
+ }
+ return proxyAwareFetchInternal(nextUrl, nextInit, redirectsLeft - 1);
+ }
+ return new Response(new Uint8Array(proxied.body), {
+ status: proxied.status,
+ statusText: proxied.statusText,
+ headers: proxied.headers
+ });
+ }
+ async function proxyAwareFetch(input, init = {}) {
+ return proxyAwareFetchInternal(input, init, DEFAULT_REDIRECT_LIMIT);
+ }
+
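The redirect path in `proxyAwareFetchInternal` above can be sketched as a small standalone helper (hypothetical name, for illustration only; the concrete status set is an assumption, since `REDIRECT_STATUSES` is defined outside this hunk): a relative `Location` header resolves against the current URL via `new URL(location, base)`, and a 303 response always becomes a body-less GET.

```javascript
// Sketch of the redirect decision used above (not Feedloom's actual API).
const REDIRECT_STATUSES = new Set([301, 302, 303, 307, 308]);

function nextRedirect(currentUrl, status, location, init) {
  if (!REDIRECT_STATUSES.has(status) || !location) return null;
  // A relative Location header resolves against the current URL.
  const url = new URL(location, currentUrl);
  const nextInit = { ...init };
  if (status === 303) {
    // 303 See Other: re-request with GET and drop the body.
    nextInit.method = "GET";
    nextInit.body = undefined;
  }
  return { url, init: nextInit };
}

const hop = nextRedirect("https://example.com/a/b", 303, "/login", { method: "POST", body: "x" });
console.log(hop.url.href, hop.init.method); // https://example.com/login GET
```

307 and 308 hops, by contrast, keep both method and body, which is why the code above only special-cases 303.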
  // src/fetch/strategy.ts
  import { writeFile as writeFile3 } from "fs/promises";
 
@@ -1175,8 +1599,8 @@ function extractPreloadedMarkdownUrl(html, baseUrl) {
  }
  return new URL(rawUrl, baseUrl).toString();
  }
- function removeNoise(document2) {
- document2.querySelectorAll("script, style, noscript, svg, iframe").forEach((element) => element.remove());
+ function removeNoise(document) {
+ document.querySelectorAll("script, style, noscript, svg, iframe").forEach((element) => element.remove());
  }
  function normalizedTextLength(element) {
  return (element?.textContent ?? "").replace(/\s+/g, " ").trim().length;
@@ -1185,12 +1609,12 @@ function htmlHasMeaningfulContent(url, html) {
  if (extractPreloadedMarkdownUrl(html, url) !== null) {
  return true;
  }
- const { document: document2 } = parseHTML3(html);
- removeNoise(document2);
+ const { document } = parseHTML3(html);
+ removeNoise(document);
  const selectors = ["#js_content", "article", "main", "section", "div", "body"];
  let bestLength = 0;
  for (const selector of selectors) {
- document2.querySelectorAll(selector).forEach((element) => {
+ document.querySelectorAll(selector).forEach((element) => {
  bestLength = Math.max(bestLength, normalizedTextLength(element));
  });
  if (bestLength >= 600 && selector !== "div") {
@@ -1285,11 +1709,11 @@ async function fetchBrowserHtmlWithBrowserState(url, config) {
  }
 
  // src/fetch/static.ts
- async function fetchStaticHtml(url, timeoutMs = 6e4) {
+ async function fetchStaticHtml(url, timeoutMs = 6e4, fetchImpl = fetch) {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), timeoutMs);
  try {
- const response = await fetch(url, {
+ const response = await fetchImpl(url, {
  redirect: "follow",
  signal: controller.signal,
  headers: {
@@ -1318,7 +1742,7 @@ async function writeOutputIfRequested(outputPath, html) {
  }
  async function fetchHtmlResult(url, options = {}) {
  const isMeaningful = options.isMeaningful ?? htmlHasMeaningfulContent;
- const staticFetch = options.staticFetch ?? (async (targetUrl) => (await fetchStaticHtml(targetUrl)).html);
+ const staticFetch = options.staticFetch ?? (async (targetUrl) => (await fetchStaticHtml(targetUrl, void 0, options.useProxyEnv ? proxyAwareFetch : void 0)).html);
  const browserFetch = options.browserFetch ?? ((targetUrl) => fetchBrowserHtml(targetUrl, {
  waitMs: options.waitMs,
  networkIdle: options.networkIdle,
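The `fetchImpl` parameter added to `fetchStaticHtml` above is a dependency injection point: the same `AbortController` timeout pattern wraps either the global `fetch` or the proxy-aware variant. A minimal sketch of that pattern (hypothetical helper name, not Feedloom's API):

```javascript
// Abort an injected fetch implementation after timeoutMs, mirroring the
// AbortController + setTimeout structure of fetchStaticHtml above.
async function fetchWithTimeout(url, timeoutMs, fetchImpl = fetch) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    // fetchImpl sees the same signal the timer will abort.
    return await fetchImpl(url, { signal: controller.signal, redirect: "follow" });
  } finally {
    clearTimeout(timer); // always clear, even on error
  }
}
```

Because the signal is passed through `init`, any conforming implementation (including a proxy-aware one) inherits the timeout for free.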
@@ -1486,9 +1910,9 @@ ${code}
  \`\`\``).replace(/\[\s*\]\((?:#|javascript:void\(0\)|javascript:;)\)/gi, "").replace(/(^|[^\\])\$(?=\d)/g, "$1\\$").replace(/\n\s*\n\s*([-*+]\s)/g, "\n$1").replace(/[ \t]+\n/g, "\n").replace(/\n{3,}/g, "\n\n").trim();
  }
  function htmlFragmentText(fragment) {
- const { document: document2 } = parseHTML4(`<!doctype html><html><body>${fragment}</body></html>`);
- document2.querySelectorAll("br").forEach((br) => br.replaceWith(document2.createTextNode("\n")));
- return document2.body.textContent ?? "";
+ const { document } = parseHTML4(`<!doctype html><html><body>${fragment}</body></html>`);
+ document.querySelectorAll("br").forEach((br) => br.replaceWith(document.createTextNode("\n")));
+ return document.body.textContent ?? "";
  }
  function fencedCodeHtml(text) {
  const escaped = text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
@@ -1584,10 +2008,45 @@ function resolveCreatedValue(item, published) {
  if (item.publishedAt) return createdFromItemDate(item.publishedAt);
  return (/* @__PURE__ */ new Date()).toISOString().replace(/\.\d{3}Z$/, "Z");
  }
+ function mergeProfileFetchOptions(options, profiles) {
+ const merged = { ...options };
+ for (const profile of profiles) {
+ if (profile.fetch?.mode) merged.fetchMode = profile.fetch.mode;
+ if (profile.fetch?.waitMs !== void 0) merged.waitMs = profile.fetch.waitMs;
+ if (profile.fetch?.networkIdle !== void 0) merged.networkIdle = profile.fetch.networkIdle;
+ if (profile.fetch?.waitSelector) merged.waitSelector = profile.fetch.waitSelector;
+ if (profile.fetch?.waitSelectorState) merged.waitSelectorState = profile.fetch.waitSelectorState;
+ if (profile.fetch?.clickSelectors) merged.clickSelectors = profile.fetch.clickSelectors;
+ if (profile.fetch?.scrollToBottom !== void 0) merged.scrollToBottom = profile.fetch.scrollToBottom;
+ if (profile.fetch?.useProxyEnv !== void 0) merged.useProxyEnv = profile.fetch.useProxyEnv;
+ if (profile.fetch?.preferBrowserState && options.browserStateDefaults) {
+ merged.browserState = {
+ ...options.browserStateDefaults,
+ waitMs: merged.waitMs,
+ networkIdle: merged.networkIdle,
+ proxy: merged.proxy,
+ dnsOverHttps: merged.dnsOverHttps,
+ waitSelector: merged.waitSelector,
+ waitSelectorState: merged.waitSelectorState,
+ clickSelectors: merged.clickSelectors,
+ scrollToBottom: merged.scrollToBottom,
+ headless: merged.headless,
+ realChromeDefaults: merged.realChromeDefaults
+ };
+ }
+ }
+ return merged;
+ }
  async function processItem(item, options) {
- const html = await fetchHtml(item.url, options);
+ const urlProfiles = selectActiveProfiles(options.profiles, item.url, "");
+ const fetchOptions = mergeProfileFetchOptions(options, urlProfiles);
+ const html = await fetchHtml(item.url, fetchOptions);
  const activeProfiles = selectActiveProfiles(options.profiles, item.url, html);
- const cleaned = await cleanHtml(html, { baseUrl: item.url, profiles: options.profiles, activeProfiles });
+ const defuddleFetch = activeProfiles.some((profile) => profile.fetch?.useProxyEnv) ? proxyAwareFetch : void 0;
+ const cleaned = await cleanHtml(html, { baseUrl: item.url, profiles: options.profiles, activeProfiles, defuddleFetch });
+ if (activeProfiles.some((profile) => profile.extraction?.requireText) && !cleaned.content.replace(/<[^>]*>/g, "").trim()) {
+ throw new Error("matched site rule requires extracted text, but no text content was extracted");
+ }
  const title = cleaned.metadata.title || item.sourceTitle || titleFromUrl(item.url);
  await cleanupExistingNote(options.outputDir, item.url);
  const contentHtml = options.localizeAssets === false ? cleaned.content : await localizeImages(cleaned.content, {
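The `mergeProfileFetchOptions` helper introduced above iterates matched profiles in order, so a later profile's `[fetch]` settings override an earlier one's, field by field, while untouched CLI defaults survive. A reduced sketch of that last-profile-wins precedence (two fields only, hypothetical helper name):

```javascript
// Simplified last-profile-wins merge over site-rule fetch preferences.
function mergeFetchPrefs(base, profiles) {
  const merged = { ...base };
  for (const profile of profiles) {
    // Only fields a profile actually sets override the accumulated value.
    if (profile.fetch?.mode) merged.fetchMode = profile.fetch.mode;
    if (profile.fetch?.waitMs !== undefined) merged.waitMs = profile.fetch.waitMs;
  }
  return merged;
}

const merged = mergeFetchPrefs(
  { fetchMode: "auto", waitMs: 2500 },
  [{ fetch: { mode: "browser" } }, { fetch: { waitMs: 8000 } }]
);
// merged.fetchMode === "browser", merged.waitMs === 8000
```

The per-field `!== undefined` guards matter: they let a profile set `waitMs = 0` or `scrollToBottom = false` explicitly without being mistaken for "not set".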
@@ -1656,10 +2115,15 @@ var ProgressTracker = class {
  };
 
  // src/cli.ts
+ var require2 = createRequire(import.meta.url);
+ var packageJson = require2("../package.json");
  var program = new Command();
  async function siteRulePathsFromDir(dir) {
  const names = await readdir2(dir);
- return names.filter((name) => name.endsWith(".toml")).map((name) => join7(dir, name));
+ return names.filter((name) => name.endsWith(".toml")).sort().map((name) => join7(dir, name));
+ }
+ function builtinSiteRulesDir() {
+ return join7(dirname(fileURLToPath(import.meta.url)), "site-rules");
  }
  function positiveIntOption(value, fallback) {
  const parsed = Number(value ?? fallback);
@@ -1668,7 +2132,13 @@ function positiveIntOption(value, fallback) {
  }
  return parsed;
  }
- program.name("feedloom").description("Archive long-form web content as clean Markdown with local assets").version("0.1.0").option("--output-dir <dir>", "Output directory for markdown notes", "clippings").option("--source-kind <kind>", "auto, html-page, or rss-feed", "auto").option("--since <date>", "Only keep feed entries on or after YYYY-MM-DD", "").option("--limit <n>", "Process only first N deduplicated URLs", "0").option("--start <n>", "Start from 1-based index after deduplication", "1").option("--end <n>", "End at 1-based index after deduplication", "0").option("--prefer-browser-state", "Try copied local Chrome profile before regular browser fallback", false).option("--chrome-user-data-dir <path>", "Chrome user data directory used with --prefer-browser-state", "").option("--chrome-profile <name>", "Chrome profile directory name", "Default").option("--fetch-mode <mode>", "auto, static, browser, or stealth", "auto").option("--no-network-idle", "Do not wait for browser networkidle before reading HTML").option("--wait-ms <ms>", "Extra browser wait after load", "2500").option("--solve-cloudflare", "In stealth mode, attempt Cloudflare Turnstile/interstitial challenge handling", false).option("--disable-resources", "In stealth mode, block images/media/fonts/stylesheets for speed", false).option("--proxy <server>", "Proxy server for browser/stealth fetch, e.g. http://127.0.0.1:8080", "").option("--dns-over-https", "Use Chromium Cloudflare DNS-over-HTTPS flag for browser/stealth fetch", false).option("--wait-selector <selector>", "Wait for a CSS selector after page load", "").option("--wait-selector-state <state>", "attached, detached, visible, or hidden", "attached").option("--click-selector <selector...>", "Click one or more selectors after page load", []).option("--scroll-to-bottom", "Scroll to the bottom before reading HTML", false).option("--headful", "Run browser/browser-state fetches with a visible Chrome window", false).option("--site-rules-dir <dir>", "Optional directory of private TOML site extraction/cleaning rules", "").option("--no-real-chrome-defaults", "Disable Scrapling-inspired real Chrome context defaults").option("--no-reuse-browser", "Disable batch browser/stealth context reuse").argument("[inputs...]", "URLs or files containing URLs").action(async (inputs, options) => {
+ program.name("feedloom").description("Archive long-form web content as clean Markdown with local assets").version(packageJson.version ?? "0.0.0");
+ program.command("doctor").description("Check Feedloom runtime dependencies").action(async () => {
+ const result = await runDoctor();
+ console.error(formatDoctorResult(result));
+ process.exitCode = result.ok ? 0 : 1;
+ });
+ program.option("--output-dir <dir>", "Output directory for markdown notes", "clippings").option("--source-kind <kind>", "auto, html-page, or rss-feed", "auto").option("--since <date>", "Only keep feed entries on or after YYYY-MM-DD", "").option("--limit <n>", "Process only first N deduplicated URLs", "0").option("--start <n>", "Start from 1-based index after deduplication", "1").option("--end <n>", "End at 1-based index after deduplication", "0").option("--prefer-browser-state", "Try copied local Chrome profile before regular browser fallback", false).option("--chrome-user-data-dir <path>", "Chrome user data directory used with --prefer-browser-state", "").option("--chrome-profile <name>", "Chrome profile directory name", "Default").option("--fetch-mode <mode>", "auto, static, browser, or stealth", "auto").option("--no-network-idle", "Do not wait for browser networkidle before reading HTML").option("--wait-ms <ms>", "Extra browser wait after load", "2500").option("--solve-cloudflare", "In stealth mode, attempt Cloudflare Turnstile/interstitial challenge handling", false).option("--disable-resources", "In stealth mode, block images/media/fonts/stylesheets for speed", false).option("--proxy <server>", "Proxy server for browser/stealth fetch, e.g. http://127.0.0.1:8080", "").option("--dns-over-https", "Use Chromium Cloudflare DNS-over-HTTPS flag for browser/stealth fetch", false).option("--wait-selector <selector>", "Wait for a CSS selector after page load", "").option("--wait-selector-state <state>", "attached, detached, visible, or hidden", "attached").option("--click-selector <selector...>", "Click one or more selectors after page load", []).option("--scroll-to-bottom", "Scroll to the bottom before reading HTML", false).option("--headful", "Run browser/browser-state fetches with a visible Chrome window", false).option("--site-rules-dir <dir>", "Optional directory of private TOML site extraction/cleaning rules", "").option("--no-real-chrome-defaults", "Disable Scrapling-inspired real Chrome context defaults").option("--no-reuse-browser", "Disable batch browser/stealth context reuse").argument("[inputs...]", "URLs or files containing URLs").action(async (inputs, options) => {
  if (inputs.length === 0) {
  program.help({ error: true });
  }
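The `.sort()` added to `siteRulePathsFromDir` above is a determinism fix: `readdir` order is filesystem- and platform-dependent, and since later profiles override earlier ones, rule load order is observable behavior. A tiny sketch of the filter-sort-map pipeline (illustrative file names):

```javascript
// Deterministic rule-path listing: filter to .toml, sort lexicographically,
// then map to full paths, so load order no longer depends on readdir order.
function rulePaths(dir, names) {
  return names
    .filter((name) => name.endsWith(".toml"))
    .sort()
    .map((name) => `${dir}/${name}`);
}

console.log(rulePaths("rules", ["zhihu.toml", "README.md", "weixin.toml"]));
// [ 'rules/weixin.toml', 'rules/zhihu.toml' ]
```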
@@ -1696,7 +2166,9 @@ program.name("feedloom").description("Archive long-form web content as clean Mar
  positiveIntOption(options.limit, 0)
  );
  const siteRulesDir = String(options.siteRulesDir || "");
- const profiles = siteRulesDir ? await loadSiteProfiles(await siteRulePathsFromDir(resolve2(siteRulesDir))) : [];
+ const builtinRulePaths = await siteRulePathsFromDir(builtinSiteRulesDir());
+ const customRulePaths = siteRulesDir ? await siteRulePathsFromDir(resolve2(siteRulesDir)) : [];
+ const profiles = await loadSiteProfiles([...builtinRulePaths, ...customRulePaths]);
  const outputDir = String(options.outputDir ?? "clippings");
  let failures = 0;
  const tracker = new ProgressTracker(selected, outputDir);
@@ -1715,6 +2187,10 @@ program.name("feedloom").description("Archive long-form web content as clean Mar
  headless: !Boolean(options.headful),
  realChromeDefaults: options.realChromeDefaults !== false
  };
+ const browserStateDefaults = {
+ userDataDir: String(options.chromeUserDataDir || ""),
+ profile: String(options.chromeProfile || "Default")
+ };
  const sessions = options.reuseBrowser === false ? null : new BatchFetchSessions({
  browser: browserOptions,
  stealth: {
@@ -1727,15 +2203,11 @@ program.name("feedloom").description("Archive long-form web content as clean Mar
  for (const item of selected) {
  tracker.start(item.url);
  try {
- const browserState = options.preferBrowserState ? {
- userDataDir: String(options.chromeUserDataDir || ""),
- profile: String(options.chromeProfile || "Default"),
- ...browserOptions
- } : null;
  const result = await processItem(item, {
  outputDir,
  profiles,
- browserState,
+ browserState: options.preferBrowserState ? { ...browserStateDefaults, ...browserOptions } : null,
+ browserStateDefaults,
  fetchMode,
  ...browserOptions,
  solveCloudflare: Boolean(options.solveCloudflare),
@@ -0,0 +1,5 @@
+ [match]
+ host_suffixes = ["mp.weixin.qq.com"]
+
+ [extract]
+ selectors = ["#js_content"]
@@ -0,0 +1,55 @@
+ [match]
+ host_suffixes = ["xiaohongshu.com"]
+
+ [fetch]
+ mode = "auto"
+ scroll_to_bottom = true
+ wait_ms = 5000
+
+ [extract]
+ selectors = [".note-content", "#noteContainer", ".note-container", ".note-detail"]
+
+ [metadata]
+ strip_title_regexes = ["\\s*-\\s*小红书\\s*$"]
+ strip_author_regexes = ["关注$"]
+ author_selectors = [".author-container .username", ".author-wrapper .name", ".user-name"]
+ author_meta_names = ["author"]
+ author_meta_properties = ["article:author"]
+
+ [media]
+ include_meta_images = true
+ image_meta_properties = ["og:image"]
+
+ [clean.remove]
+ selectors = [
+ ".side-bar",
+ ".left-container",
+ ".comments-el",
+ ".interactions",
+ ".engage-bar",
+ ".note-detail-dropdown",
+ ".close-circle",
+ ".close-box",
+ ".login-container",
+ ".bottom-container .notedetail-menu",
+ ".author",
+ ".author-container",
+ ".media-container",
+ ".fraction",
+ ".arrow-controller",
+ ".pagination-media-container"
+ ]
+ text_contains = [
+ "创作中心",
+ "业务合作",
+ "沪ICP备",
+ "营业执照",
+ "违法不良信息举报电话",
+ "行吟信息科技",
+ "小红书网页版",
+ "登录后推荐更懂你的笔记"
+ ]
+ exact_text = ["关注", "加载中", "更多", "发现", "直播", "发布", "通知"]
+
+ [clean.truncate]
+ after_regexes = ["^共 \\d+ 条评论$", "^相关推荐$", "^登录后推荐更懂你的笔记$"]
@@ -0,0 +1,10 @@
+ [match]
+ host_suffixes = ["youtube.com", "youtu.be"]
+ url_regexes = ["https?://(www\\.)?youtube\\.com/watch\\?", "https?://youtu\\.be/"]
+
+ [fetch]
+ mode = "auto"
+ use_proxy_env = true
+
+ [extract]
+ require_text = true
@@ -0,0 +1,22 @@
+ [match]
+ host_suffixes = ["zhihu.com"]
+
+ [extract]
+ selectors = ["[class*=\"Post-RichTextContainer\"]", "[class*=\"RichText ztext\"]", "[class*=\"RichContent-inner\"]"]
+
+ [metadata]
+ strip_title_regexes = ["\\s*-\\s*知乎\\s*$"]
+
+ [fetch]
+ mode = "browser"
+ prefer_browser_state = true
+ scroll_to_bottom = true
+ wait_ms = 8000
+
+ [clean.remove]
+ class_contains = ["RichText-LinkCardContainer"]
+ text_regexes = ["^目录收起$", "^目录收起.*References$", "^.+\\d+ 赞同 · \\d+ 评论 文章$", "^.+\\d+ 赞同 · \\d+ 评论 文章$", "^\\d+ 赞同 · \\d+ 评论 文章$"]
+ exact_text = ["目录", "收起"]
+
+ [clean.truncate]
+ after_regexes = ["^发布于 ", "^赞同 ", "^\\d+ 条评论$", "^分享$", "^申请转载$", ".*的广告$"]
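The new rule files above all key off `host_suffixes` in `[match]`. Feedloom's matcher itself is not part of this diff, but a plausible sketch of suffix matching (hypothetical helper, illustration only) is worth spelling out: matching should respect DNS label boundaries, so `weixin.qq.com` matches `mp.weixin.qq.com` but not `notweixin.qq.com`.

```javascript
// Illustrative host-suffix match with a dot-boundary check, so a suffix
// only matches the exact host or a true subdomain of it.
function matchesHostSuffix(url, suffixes) {
  const host = new URL(url).hostname;
  return suffixes.some((s) => host === s || host.endsWith(`.${s}`));
}

console.log(matchesHostSuffix("https://mp.weixin.qq.com/s/abc", ["mp.weixin.qq.com"])); // true
console.log(matchesHostSuffix("https://notweixin.qq.com/", ["weixin.qq.com"])); // false
```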
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@ariesfish/feedloom",
- "version": "0.1.2",
+ "version": "0.1.4",
  "type": "module",
  "author": "ariesfish",
  "license": "MIT",
@@ -17,6 +17,7 @@
  "feedloom": "dist/cli.js"
  },
  "files": [
+ "assets",
  "dist",
  "skills",
  "README.md",
@@ -24,7 +25,7 @@
  ],
  "scripts": {
  "dev": "tsx src/cli.ts",
- "build": "tsup src/cli.ts --format esm --dts --clean",
+ "build": "tsup src/cli.ts --format esm --dts --clean && rm -rf dist/site-rules && cp -R src/site-rules dist/site-rules",
  "typecheck": "tsc --noEmit",
  "test": "vitest run",
  "prepublishOnly": "npm run typecheck && npm test && npm run build"
@@ -22,6 +22,12 @@ npx -y @ariesfish/feedloom <inputs...> [options]
 
  ## Common usage
 
+ Before clipping with browser-based fetch modes, run `doctor` once to verify and repair the Patchright Chromium runtime. If Chromium is missing, `doctor` automatically runs `npx patchright install chromium`.
+
+ ```bash
+ npx -y @ariesfish/feedloom doctor
+ ```
+
  Before running Feedloom, check whether this skill directory has a `site-rules/` directory. If it exists, always pass it with `--site-rules-dir $HOME/.agents/skills/feedloom/site-rules`; do not omit available site rules.
 
  ```bash
@@ -61,11 +67,13 @@ Use the least expensive mode that works:
  - `--site-rules-dir <dir>`: load optional private TOML extraction/cleaning rules from a local directory, for example `$HOME/.agents/skills/feedloom/site-rules/` reference folder.
  - `--solve-cloudflare`, `--proxy <server>`, `--dns-over-https`: use only when stealth fetching needs them.
 
- Run `npx -y @ariesfish/feedloom --help` for the complete option list. Do not invent unsupported options.
+ Run `npx -y @ariesfish/feedloom doctor` when browser, stealth, or auto fallback fails because Chromium is missing or cannot launch. Run `npx -y @ariesfish/feedloom --help` for the complete option list. Do not invent unsupported options.
+
+ ## Site rules
 
- ## Private site rules
+ Feedloom ships built-in TOML site rules in the package for common sites such as WeChat and Zhihu. These are loaded automatically; do not pass a special option for built-in rules.
 
- Site-specific TOML rules are optional in the package, but mandatory to use when present next to this skill. Always check for `$HOME/.agents/skills/feedloom/site-rules/` before clipping. If that directory exists, pass it explicitly on every Feedloom command using the `$HOME`-prefixed path:
+ Private skill rules are also supported and are mandatory to use when present next to this skill. Always check for `$HOME/.agents/skills/feedloom/site-rules/` before clipping. If that directory exists, pass it explicitly on every Feedloom command using the `$HOME`-prefixed path:
 
  ```bash
  npx -y @ariesfish/feedloom "https://example.com/article" --site-rules-dir $HOME/.agents/skills/feedloom/site-rules
@@ -73,6 +81,8 @@ npx -y @ariesfish/feedloom "https://example.com/article" --site-rules-dir $HOME/
 
  Treat rule files in `$HOME/.agents/skills/feedloom/site-rules/` as local reference material and use them whenever available; never skip an existing site-rules directory unless the user explicitly asks not to use it.
 
+ For adding or editing private rules, read `references/site-rules.md`. It contains the TOML schema, examples, `[fetch]` behavior, and validation workflow.
+
  ## Output
 
  - Markdown files are written to `clippings/` by default, or to `--output-dir`.
@@ -0,0 +1,104 @@
+ # Feedloom site rules
+
+ Use TOML site rules when Feedloom needs a narrow site-specific selector, cleanup overlay, metadata normalization, or conservative fetch preference. Do not write ad-hoc scrapers.
+
+ ## Locations
+
+ Private skill rules live in:
+
+ ```text
+ $HOME/.agents/skills/feedloom/site-rules/<site>.toml
+ ```
+
+ When the private rules directory exists, pass it on every command:
+
+ ```bash
+ npx -y @ariesfish/feedloom "https://example.com/article" --site-rules-dir $HOME/.agents/skills/feedloom/site-rules
+ ```
+
+ ## Add a private rule
+
+ Create or edit one TOML file per site:
+
+ ```bash
+ mkdir -p $HOME/.agents/skills/feedloom/site-rules
+ $EDITOR $HOME/.agents/skills/feedloom/site-rules/example.toml
+ ```
+
+ Minimal rule:
+
+ ```toml
+ [match]
+ host_suffixes = ["example.com"]
+
+ [extract]
+ selectors = ["article", "main"]
+ ```
+
+ Rule with fetch preferences:
+
+ ```toml
+ [match]
+ host_suffixes = ["zhihu.com"]
+
+ [fetch]
+ mode = "browser"
+ prefer_browser_state = true
+ scroll_to_bottom = true
+ wait_ms = 8000
+
+ [extract]
+ selectors = ["[class*=\"Post-RichTextContainer\"]", "[class*=\"RichText ztext\"]"]
+ ```
+
+ ## Schema
+
+ Supported sections:
+
+ - `[match]`: `host_suffixes`, `host_regexes`, `url_regexes`, `html_markers`.
+ - `[fetch]`: `mode`, `prefer_browser_state`, `wait_ms`, `network_idle`, `wait_selector`, `wait_selector_state`, `click_selectors`, `scroll_to_bottom`, `use_proxy_env`.
+ - `[extract]`: `selectors`, `require_text`.
+ - `[metadata]`: `fixed_author`, `strip_title_regexes`, `strip_author_regexes`, `author_selectors`, `author_meta_names`, `author_meta_itemprops`, `author_meta_properties`.
+ - `[clean.remove]`: `selectors`, `class_contains`, `id_contains`, `attr_contains`, `text_contains`, `text_regexes`, `exact_text`.
+ - `[clean.truncate]`: `after_contains`, `after_regexes`.
+
+ ## Fetch rules
+
+ Use `[fetch]` only when a site consistently needs browser rendering, local Chrome state, scrolling, waiting, clicking, or proxy-aware requests.
+
+ `use_proxy_env = true` tells Feedloom to use `HTTP_PROXY`, `HTTPS_PROXY`, `ALL_PROXY`, and `NO_PROXY` for static fetches and Defuddle async extractor fetches. Use this for YouTube transcript capture and similar extractor-backed pages that need the user's proxy settings.
+
+ `prefer_browser_state = true` only tells Feedloom to use copied Chrome state for matching URLs. It does not store the local Chrome path. The command still needs Chrome state parameters when login state is required:
+
+ ```bash
+ npx -y @ariesfish/feedloom \
+ --chrome-user-data-dir "$HOME/Library/Application Support/Google/Chrome" \
+ --chrome-profile Default \
+ --site-rules-dir $HOME/.agents/skills/feedloom/site-rules \
+ "https://zhuanlan.zhihu.com/p/..."
+ ```
+
+ ## Rules for writing rules
+
+ - Prefer narrow domain-specific selectors over broad selectors.
+ - Prefer content containers over page shells. Avoid `body` unless the HTML is already minimal.
+ - Use `require_text = true` when a matched extractor-backed page should fail instead of writing an empty note.
+ - Use cleanup only for repeated, stable noise inside otherwise correct content.
+ - Use truncation only for stable tail markers where everything after the marker is non-article content.
+ - Do not add aggressive crawling, high concurrency, repeated challenge solving, or broad stealth defaults.
+ - Keep private rules outside project repos unless the user is working on Feedloom itself.
+
+ ## Validation
+
+ After adding or editing a private rule, test one known URL and inspect the Markdown:
+
+ ```bash
+ outdir=$(mktemp -d /tmp/feedloom-rule-test-XXXXXX)
+ npx -y @ariesfish/feedloom \
+ --output-dir "$outdir" \
+ --site-rules-dir $HOME/.agents/skills/feedloom/site-rules \
+ "https://example.com/article"
+ find "$outdir" -maxdepth 2 -type f | sort
+ ```
+
+ For sites that require Chrome state, add the Chrome state options shown above.