@ariesfish/feedloom 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +282 -0
- package/dist/cli.d.ts +1 -0
- package/dist/cli.js +1745 -0
- package/package.json +52 -0
package/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Quinn

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md
ADDED
@@ -0,0 +1,282 @@
# Feedloom

Feedloom is a command-line tool for archiving long-form web content. It takes article URLs, URL list files, or RSS/Atom feeds, extracts readable article content, converts it to Markdown with YAML frontmatter, and saves page images as local assets. It is designed for personal knowledge bases, notebook vaults, and offline reading archives.

## Features

- Accept one or more URLs directly from the command line.
- Extract URLs from text or Markdown files, with automatic deduplication.
- Expand RSS/Atom feeds and optionally filter entries by date.
- Clean article HTML and convert it to Markdown.
- Download and localize article images.
- Generate Markdown notes with `source`, `author`, and `created` frontmatter.
- Support static fetch, browser-rendered fetch, and stealth fetch modes.
- Optionally use a local Chrome profile for pages that require login state.
- Automatically mark Markdown checklist items as done after successful processing.

## Requirements

- Node.js >= 24
- npm
- macOS, Linux, or Windows should work; browser-based fetching depends on Patchright/Chromium.

## Installation

### 1. Clone the repository

```bash
git clone <this-repository-url>
cd feedloom
```

### 2. Install dependencies

```bash
npm install
```

### 3. Install the browser runtime

If you plan to use `browser`, `stealth`, or the browser fallback in `auto` mode, install the Patchright Chromium runtime:

```bash
npx patchright install chromium
```

### 4. Build the CLI

```bash
npm run build
```

After building, run:

```bash
node dist/cli.js --help
```

During development, you can run the TypeScript source directly:

```bash
npm run dev -- --help
```

To make the CLI available globally on your machine:

```bash
npm link
feedloom --help
```

## Quick Start

Archive a single article to the default `clippings/` directory:

```bash
npm run dev -- "https://example.com/article"
```

Write output to a custom directory:

```bash
npm run dev -- --output-dir ./outputs "https://example.com/article"
```

Use the built CLI:

```bash
node dist/cli.js --output-dir ./outputs "https://example.com/article"
```

The generated Markdown will look roughly like this:

```markdown
---
source: "https://example.com/article"
author: "Author Name"
created: "2026-04-29"
---

# Article Title

Article content...
```

Images are downloaded into an `assets/` subdirectory under the output directory and rewritten as local Markdown references.
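As an illustration, the rewrite looks roughly like this (the exact local filenames are generated by the tool, so the name below is hypothetical):

```markdown
<!-- Before: remote image reference in the extracted article -->
![Diagram](https://example.com/images/diagram.png)

<!-- After: downloaded into assets/ and referenced locally -->
![Diagram](assets/diagram.png)
```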
## Input Methods

### Pass multiple URLs directly

```bash
npm run dev -- \
  "https://example.com/a" \
  "https://example.com/b"
```

### Read URLs from a file

`urls.md` can be a plain URL list or a Markdown checklist:

```markdown
- [ ] https://example.com/a
- [ ] https://example.com/b
```

Run:

```bash
npm run dev -- --output-dir ./outputs urls.md
```

After a URL is processed successfully, the matching checklist item is updated automatically:

```markdown
- [x] https://example.com/a
```

### Process RSS/Atom feeds

By default, `--source-kind auto` tries to detect whether the input is a normal HTML page or a feed. You can also specify the source kind explicitly:

```bash
npm run dev -- --source-kind rss-feed --since 2026-01-01 "https://example.com/feed.xml"
```

Useful slicing options:

```bash
npm run dev -- --start 1 --end 10 "https://example.com/feed.xml"
npm run dev -- --limit 5 "https://example.com/feed.xml"
```

## Fetch Modes

Use `--fetch-mode` to control how pages are fetched:

| Mode | Description |
| --- | --- |
| `auto` | Default. Try static fetch first, then fall back to browser/stealth when content is insufficient. |
| `static` | Use plain HTTP fetching only. Fastest option for static pages. |
| `browser` | Render the page in a browser. Useful for JavaScript-heavy sites. |
| `stealth` | Use a more realistic browser context. Useful for sites with stronger bot detection. |

Examples:

```bash
npm run dev -- --fetch-mode browser "https://example.com/article"
npm run dev -- --fetch-mode stealth --solve-cloudflare "https://example.com/article"
```

## Browser Options

Wait longer after page load:

```bash
npm run dev -- --fetch-mode browser --wait-ms 5000 "https://example.com/article"
```

Wait for a selector before extracting content:

```bash
npm run dev -- --fetch-mode browser --wait-selector "article" "https://example.com/article"
```

Click popups or expand buttons before extraction:

```bash
npm run dev -- --fetch-mode browser --click-selector "button.accept" --click-selector ".expand" "https://example.com/article"
```

Scroll to the bottom before extraction:

```bash
npm run dev -- --fetch-mode browser --scroll-to-bottom "https://example.com/article"
```

Use a proxy:

```bash
npm run dev -- --fetch-mode stealth --proxy "http://127.0.0.1:7890" "https://example.com/article"
```

Run with a visible browser window for debugging:

```bash
npm run dev -- --fetch-mode browser --headful "https://example.com/article"
```

## Use Local Chrome Login State

For pages that require an authenticated browser session, you can try using your local Chrome profile:

```bash
npm run dev -- \
  --prefer-browser-state \
  --chrome-user-data-dir "{CHROME_INSTALL_PATH}" \
  --chrome-profile "Default" \
  --fetch-mode browser \
  "https://example.com/member-only-article"
```

Only use this on your own device and accounts. Always respect the target site's terms of service and copyright rules.

## Common CLI Options

```text
--output-dir <dir>              Markdown output directory. Default: clippings
--source-kind <kind>            auto, html-page, or rss-feed. Default: auto
--since <date>                  Keep only feed entries on or after YYYY-MM-DD
--limit <n>                     Process only the first N deduplicated URLs
--start <n>                     Start from the Nth deduplicated URL, 1-based
--end <n>                       End at the Nth deduplicated URL, 1-based; 0 means no upper bound
--fetch-mode <mode>             auto, static, browser, or stealth. Default: auto
--wait-ms <ms>                  Extra browser wait after load. Default: 2500
--wait-selector <selector>      Wait for a CSS selector
--click-selector <selector...>  Click one or more selectors after page load
--scroll-to-bottom              Scroll to the bottom before extraction
--headful                       Run with a visible browser window
--proxy <server>                Proxy server for browser/stealth fetch
--solve-cloudflare              In stealth mode, try to handle Cloudflare challenges
--disable-resources             In stealth mode, block images/media/fonts/stylesheets for speed
--prefer-browser-state          Try local Chrome user state first
--chrome-user-data-dir <path>   Chrome User Data directory
--chrome-profile <name>         Chrome profile name. Default: Default
```
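The options above compose freely. A sketch of a typical batch run combining the documented flags (the feed URL and output directory are placeholders):

```bash
# Archive up to 5 new entries from a feed, published this year,
# letting auto mode fall back to the browser when static fetch is thin.
node dist/cli.js \
  --source-kind rss-feed \
  --since 2026-01-01 \
  --limit 5 \
  --fetch-mode auto \
  --output-dir ./outputs \
  "https://example.com/feed.xml"
```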
For the full option list, run:

```bash
npm run dev -- --help
```

## Development

```bash
npm install
npm run build
npm run typecheck
npm test
```

## Tips and Notes

- Respect robots.txt, website terms of service, copyright, and rate limits.
- For dynamic pages, try `--fetch-mode browser` first.
- For static blogs and news sites, `--fetch-mode static` is usually faster.
- If article extraction is poor for a specific site, add or adjust a site rule in `src/site-rules/`.
- For large batches, test with `--limit` before running the full job.

## Acknowledgements

Feedloom is inspired by several excellent open-source projects. Special thanks to:

- [Defuddle](https://github.com/kepano/defuddle), for high-quality readable content extraction ideas.
- [Patchright](https://github.com/Kaliiiiiiiiii-Vinyzu/patchright), for inspiring robust browser automation and realistic page access.
- [Scrapling](https://github.com/D4Vinci/Scrapling), for ideas around real browser contexts, anti-detection strategies, and resilient scraping fallbacks.

Thanks also to Linkedom, Turndown, Commander, Vitest, and the wider TypeScript ecosystem for the reliable building blocks.

## License

MIT License
package/dist/cli.d.ts
ADDED
@@ -0,0 +1 @@
#!/usr/bin/env node