vault-fetch 0.1.0 → 0.3.0
- package/README.ja.md +154 -0
- package/README.md +83 -57
- package/dist/cli.js +175 -56
- package/dist/cli.js.map +1 -1
- package/dist/pdf-converter-U3SFA2HY.js +42 -0
- package/dist/pdf-converter-U3SFA2HY.js.map +1 -0
- package/package.json +2 -1
package/README.ja.md
ADDED
|
@@ -0,0 +1,154 @@
+# vault-fetch
+
+A CLI tool that uses Playwright to fetch web pages and PDF files that Obsidian Clipper cannot capture, such as pages requiring JavaScript rendering or authentication, converts them to Markdown, and saves them to an Obsidian Vault.
+
+## Features
+
+- Page fetching with JS rendering via Playwright (Chromium)
+- **PDF to Markdown conversion** (auto-detected via Content-Type)
+- Article content extraction using Readability.js (removes ads and navigation), with `--raw` mode for full-page conversion
+- Resource blocking (images, fonts, media) for faster fetching
+- Chrome User-Agent spoofing to bypass bot detection
+- Obsidian Clipper-compatible frontmatter (title, source, author, published, created, description, tags)
+- Session management (`storageState`) for fetching logged-in pages
+- 3-layer configuration resolution (CLI options > environment variables > config file)
+
+## Installation
+
+```bash
+# Global install
+npm install -g vault-fetch
+
+# The Playwright browser is also required
+npx playwright install chromium
+```
+
+## Usage
+
+You can run it without installing via `npx`:
+
+```bash
+npx vault-fetch fetch https://example.com/article --dry-run --dest /tmp
+```
+
+### Fetching and Saving Pages
+
+```bash
+# Save to an Obsidian Vault
+vault-fetch fetch https://example.com/article --dest ~/Documents/Obsidian/Clippings
+
+# Output to stdout (without saving)
+vault-fetch fetch https://example.com/article --dry-run --dest /tmp
+
+# Run in headed mode (for debugging)
+vault-fetch fetch https://example.com/article --dest ~/Documents/Obsidian/Clippings --headed
+
+# Extract only a specific CSS selector (skips Readability)
+vault-fetch fetch https://example.com/article --dest ~/Documents/Obsidian/Clippings --selector "article"
+
+# Add tags
+vault-fetch fetch https://example.com/article --dest ~/Documents/Obsidian/Clippings --tag tech --tag ai
+
+# Full-page conversion for non-article pages (skips Readability)
+vault-fetch fetch https://example.com/table-page --dest ~/Documents/Obsidian/Clippings --raw
+
+# Fetch with images (blocked by default)
+vault-fetch fetch https://example.com/article --dest ~/Documents/Obsidian/Clippings --no-block-images
+
+# Fetch a PDF and convert it to Markdown (auto-detected)
+vault-fetch fetch https://example.com/report.pdf --dest ~/Documents/Obsidian/Clippings
+```
+
+### PDF Support
+
+When the server returns `Content-Type: application/pdf`, vault-fetch automatically downloads the PDF and converts it to Markdown with [pdf2md](https://github.com/opendocsg/pdf2md). No additional flags are needed.
+
+- The title is extracted from the first `#` heading in the converted Markdown (falling back to the URL filename)
+- The `--selector` and `--raw` options cannot be combined with PDF URLs
+- Session support also covers authenticated PDF downloads
+
+### Login (Session Storage)
+
+For sites that require authentication, you can log in beforehand and save the session.
+
+```bash
+vault-fetch login https://note.com
+# → Browser opens → Log in manually → Press Enter in the terminal
+```
+
+Subsequent `fetch` commands automatically use the saved session for that domain.
+
+### Fetch Options
+
+| Option | Description |
+|---|---|
+| `--dest <path>` | Destination directory (required) |
+| `--headed` | Run with the browser visible |
+| `--selector <css>` | Extract elements by CSS selector |
+| `--timeout <sec>` | Timeout in seconds (default: 30) |
+| `--tag <name>` | Add a tag (repeatable) |
+| `--wait-until <event>` | Wait condition: `load` / `domcontentloaded` / `networkidle` (default: `networkidle`) |
+| `--skip-session` | Do not use saved sessions |
+| `--dry-run` | Output to stdout without saving |
+| `--raw` | Skip Readability and convert the full-page HTML directly |
+| `--no-block-images` | Disable image request blocking |
+| `--no-block-fonts` | Disable font request blocking |
+| `--no-block-media` | Disable media request blocking |
+
+## Configuration
+
+### Config File
+
+`~/.config/vault-fetch/config.yaml`:
+
+```yaml
+# Obsidian Vault destination
+dest: ~/Documents/Obsidian/Clippings
+
+# Default tags
+tags:
+  - clippings
+
+timeout: 30
+```
+
+### Environment Variables
+
+| Variable | Description |
+|---|---|
+| `VAULT_FETCH_DEST` | Destination directory |
+| `VAULT_FETCH_TIMEOUT` | Timeout in seconds |
+
+### Priority
+
+CLI options > Environment variables > Config file > Default values
+
+## Output Example
+
+```yaml
+---
+title: ADHDの自分が毎日クッソ集中できるようになった習慣
+source: https://note.com/simplearchitect/n/n8389e1b4fbde
+author:
+  - "[[牛尾 剛]]"
+published: 2025-06-14
+created: 2025-07-03
+description: 自分はADHDですので、もちろん集中力は暗黒です...
+tags:
+  - clippings
+---
+
+The article body continues in Markdown...
+```
+
+## Development
+
+```bash
+npm run build # Build with tsup
+npm test # Run tests with vitest
+npm run typecheck # Type checking
+```
+
+## License
+
+MIT
|
package/README.md
CHANGED
|
@@ -1,128 +1,154 @@
 # vault-fetch
 
-Obsidian Clipper
+A CLI tool that uses Playwright to fetch web pages and PDF files — pages that Obsidian Clipper cannot handle — converts them to Markdown, and saves them to your Obsidian Vault.
 
-##
+## Features
 
-- Playwright (Chromium)
-
-
-
-
+- Page fetching with JS rendering via Playwright (Chromium)
+- **PDF to Markdown conversion** (auto-detected via Content-Type)
+- Article content extraction using Readability.js (removes ads and navigation), with `--raw` mode for full-page conversion
+- Resource blocking (images, fonts, media) for faster fetching
+- Chrome User-Agent spoofing to bypass bot detection
+- Obsidian Clipper-compatible frontmatter (title, source, author, published, created, description, tags)
+- Session management (`storageState`) for fetching authenticated pages
+- 3-layer configuration resolution (CLI options > environment variables > config file)
 
-##
+## Installation
 
 ```bash
-
+# Global install
+npm install -g vault-fetch
+
+# Playwright browser is also required
 npx playwright install chromium
 ```
 
-
+## Usage
+
+You can run it without installation using `npx`:
 
 ```bash
-
-npx playwright install chromium
+npx vault-fetch fetch https://example.com/article --dry-run --dest /tmp
 ```
 
-
-
-### ページの取得・保存
+### Fetching and Saving Pages
 
 ```bash
-# Obsidian Vault
+# Save to Obsidian Vault
 vault-fetch fetch https://example.com/article --dest ~/Documents/Obsidian/Clippings
 
-#
+# Output to stdout (without saving)
 vault-fetch fetch https://example.com/article --dry-run --dest /tmp
 
-# headed
+# Run in headed mode (for debugging)
 vault-fetch fetch https://example.com/article --dest ~/Documents/Obsidian/Clippings --headed
 
-#
+# Extract only a specific CSS selector (skips Readability)
 vault-fetch fetch https://example.com/article --dest ~/Documents/Obsidian/Clippings --selector "article"
 
-#
+# Add tags
 vault-fetch fetch https://example.com/article --dest ~/Documents/Obsidian/Clippings --tag tech --tag ai
+
+# Full-page conversion for non-article pages (skips Readability)
+vault-fetch fetch https://example.com/table-page --dest ~/Documents/Obsidian/Clippings --raw
+
+# Fetch with images (blocked by default)
+vault-fetch fetch https://example.com/article --dest ~/Documents/Obsidian/Clippings --no-block-images
+
+# Fetch a PDF and convert to Markdown (auto-detected)
+vault-fetch fetch https://example.com/report.pdf --dest ~/Documents/Obsidian/Clippings
 ```
 
-###
+### PDF Support
+
+When the server returns `Content-Type: application/pdf`, vault-fetch automatically downloads the PDF and converts it to Markdown using [pdf2md](https://github.com/opendocsg/pdf2md). No additional flags are needed.
+
+- Title is extracted from the first `#` heading in the converted Markdown, or from the URL filename
+- `--selector` and `--raw` options cannot be used with PDF URLs
+- Session support works with authenticated PDF downloads
 
-
+### Login (Session Storage)
+
+For sites that require authentication, you can log in and save the session beforehand.
 
 ```bash
 vault-fetch login https://note.com
-# →
+# → Browser opens → Log in manually → Press Enter in terminal
 ```
 
-
+Subsequent `fetch` commands will automatically use the saved session for that domain.
 
-###
+### Fetch Options
 
-|
+| Option | Description |
 |---|---|
-| `--dest <path>` |
-| `--headed` |
-| `--selector <css>` | CSS
-| `--timeout <sec>` |
-| `--tag <name>` |
-| `--wait-until <event>` |
-| `--skip-session` |
-| `--dry-run` |
-
-
-
-
+| `--dest <path>` | Destination directory (required) |
+| `--headed` | Run with browser visible |
+| `--selector <css>` | Extract elements by CSS selector |
+| `--timeout <sec>` | Timeout in seconds (default: 30) |
+| `--tag <name>` | Add tags (can be specified multiple times) |
+| `--wait-until <event>` | Wait condition: `load` / `domcontentloaded` / `networkidle` (default: `networkidle`) |
+| `--skip-session` | Do not use saved sessions |
+| `--dry-run` | Output to stdout without saving |
+| `--raw` | Skip Readability and convert full-page HTML directly |
+| `--no-block-images` | Disable image request blocking |
+| `--no-block-fonts` | Disable font request blocking |
+| `--no-block-media` | Disable media request blocking |
+
+## Configuration
+
+### Config File
 
 `~/.config/vault-fetch/config.yaml`:
 
 ```yaml
-# Obsidian Vault
+# Obsidian Vault destination
 dest: ~/Documents/Obsidian/Clippings
 
-#
+# Default tags
 tags:
   - clippings
 
 timeout: 30
 ```
 
-###
+### Environment Variables
 
-|
+| Variable | Description |
 |---|---|
-| `VAULT_FETCH_DEST` |
-| `VAULT_FETCH_TIMEOUT` |
+| `VAULT_FETCH_DEST` | Destination directory |
+| `VAULT_FETCH_TIMEOUT` | Timeout in seconds |
 
-###
+### Priority
 
-CLI
+CLI options > Environment variables > Config file > Default values
 
-##
+## Output Example
 
 ```yaml
 ---
-title:
-source: https://
+title: "Thinking, Fast and Slow: Lessons for Software Engineers"
+source: https://medium.com/@example/thinking-fast-and-slow-lessons-for-engineers-abc123
 author:
-  - "[[
+  - "[[Jane Smith]]"
 published: 2025-06-14
 created: 2025-07-03
-description:
+description: How cognitive biases from Kahneman's research apply to everyday engineering decisions...
 tags:
   - clippings
 ---
 
-
+The article body continues in Markdown...
 ```
 
-##
+## Development
 
 ```bash
-npm run build # tsup
-npm test # vitest
-npm run typecheck #
+npm run build # Build with tsup
+npm test # Run tests with vitest
+npm run typecheck # Type checking
 ```
 
-##
+## License
 
 MIT
|
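The precedence chain documented above (CLI options > environment variables > config file > default values) amounts to checking each layer in order and falling through with nullish checks. A minimal sketch; `resolveTimeout` and the explicit `env` argument are illustrative, not the package's actual API (the real CLI reads `process.env`):

```javascript
// Sketch of 3-layer resolution for the timeout setting: CLI > env > file > default.
const DEFAULT_TIMEOUT = 30;

function resolveTimeout(cliOptions, env, fileConfig) {
  // Highest priority: an explicit CLI option.
  if (cliOptions.timeout !== undefined) return cliOptions.timeout;
  // Next: the environment variable (strings must parse as numbers).
  if (env.VAULT_FETCH_TIMEOUT !== undefined) {
    const parsed = Number(env.VAULT_FETCH_TIMEOUT);
    if (Number.isNaN(parsed)) throw new Error("VAULT_FETCH_TIMEOUT must be a number");
    return parsed;
  }
  // Finally: the config file, with a hard-coded default as the last resort.
  return fileConfig.timeout ?? DEFAULT_TIMEOUT;
}
```

Each layer only wins when every higher-priority layer left the value unset, which is why an empty CLI invocation still picks up `timeout: 30` from `config.yaml`.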
package/dist/cli.js
CHANGED
|
@@ -2,6 +2,7 @@
 
 // src/cli.ts
 import { Command } from "commander";
+import { once } from "events";
 import { existsSync as existsSync2 } from "fs";
 import { homedir as homedir3 } from "os";
 import { join as join3 } from "path";
@@ -20,13 +21,40 @@ function expandTilde(filePath) {
   }
   return filePath;
 }
+var VALID_WAIT_UNTIL = ["load", "domcontentloaded", "networkidle"];
+function validateWaitUntil(value) {
+  if (!VALID_WAIT_UNTIL.includes(value)) {
+    throw new Error(
+      `Invalid waitUntil value: "${value}". Must be one of: ${VALID_WAIT_UNTIL.join(", ")}`
+    );
+  }
+  return value;
+}
 function loadConfigFile(configPath) {
   const content = readFileSync(configPath, "utf-8");
   const parsed = yaml.load(content);
   if (parsed === null || typeof parsed !== "object") {
     throw new Error(`Invalid config file: ${configPath}`);
   }
-  return parsed;
+  const config = parsed;
+  if (config.timeout !== void 0 && typeof config.timeout !== "number") {
+    throw new Error(`Invalid timeout in config file: expected number, got ${typeof config.timeout}`);
+  }
+  if (config.dest !== void 0 && typeof config.dest !== "string") {
+    throw new Error(`Invalid dest in config file: expected string, got ${typeof config.dest}`);
+  }
+  if (config.waitUntil !== void 0) {
+    if (typeof config.waitUntil !== "string") {
+      throw new Error(`Invalid waitUntil in config file: expected string, got ${typeof config.waitUntil}`);
+    }
+    validateWaitUntil(config.waitUntil);
+  }
+  if (config.tags !== void 0) {
+    if (!Array.isArray(config.tags) || !config.tags.every((t) => typeof t === "string")) {
+      throw new Error("Invalid tags in config file: expected array of strings");
+    }
+  }
+  return config;
 }
 function resolveConfig(cliOptions, configPath) {
   let fileConfig = {};
@@ -53,7 +81,8 @@ function resolveConfig(cliOptions, configPath) {
   } else {
     timeout = fileConfig.timeout ?? DEFAULT_TIMEOUT;
   }
-  const waitUntil = cliOptions.waitUntil ?? fileConfig.waitUntil ?? DEFAULT_WAIT_UNTIL;
+  const rawWaitUntil = cliOptions.waitUntil ?? fileConfig.waitUntil ?? DEFAULT_WAIT_UNTIL;
+  const waitUntil = validateWaitUntil(rawWaitUntil);
   const allTags = [
     ...fileConfig.tags ?? [],
     ...cliOptions.tags ?? [],
@@ -68,7 +97,11 @@ function resolveConfig(cliOptions, configPath) {
     headed: cliOptions.headed ?? false,
     selector: cliOptions.selector ?? null,
     noSession: cliOptions.noSession ?? false,
-    dryRun: cliOptions.dryRun ?? false
+    dryRun: cliOptions.dryRun ?? false,
+    blockImages: cliOptions.blockImages ?? true,
+    blockFonts: cliOptions.blockFonts ?? true,
+    blockMedia: cliOptions.blockMedia ?? true,
+    raw: cliOptions.raw ?? false
   };
 }
 
@@ -103,23 +136,79 @@ function ensureSessionDir(sessionsDir) {
 }
 
 // src/fetcher.ts
+var CHROME_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36";
+function isPdfContentType(contentType) {
+  return contentType.toLowerCase().includes("application/pdf");
+}
+function buildBlockedResourceTypes(options) {
+  const blocked = /* @__PURE__ */ new Set();
+  if (options.blockImages) blocked.add("image");
+  if (options.blockFonts) blocked.add("font");
+  if (options.blockMedia) blocked.add("media");
+  return blocked;
+}
+var PDF_MAGIC_BYTES = "%PDF";
+function validatePdfBuffer(pdfBuffer, sourceUrl) {
+  if (pdfBuffer.length === 0) {
+    throw new Error(`Empty PDF response received from ${sourceUrl}`);
+  }
+  const header = pdfBuffer.subarray(0, PDF_MAGIC_BYTES.length).toString("ascii");
+  if (!header.startsWith(PDF_MAGIC_BYTES)) {
+    throw new Error(
+      `Response Content-Type is application/pdf but body is not valid PDF data from ${sourceUrl}`
+    );
+  }
+}
+async function downloadPdf(context, url, timeoutMs) {
+  const apiResponse = await context.request.get(url, { timeout: timeoutMs });
+  const status = apiResponse.status();
+  if (status >= 400) {
+    throw new Error(`HTTP ${status} received when downloading PDF from ${url}`);
+  }
+  const pdfBuffer = Buffer.from(await apiResponse.body());
+  const finalUrl = apiResponse.url();
+  validatePdfBuffer(pdfBuffer, finalUrl);
+  return { pdfBuffer, finalUrl };
+}
 async function fetchPage(url, config, sessionsDir) {
   const browser = await chromium.launch({
     headless: !config.headed
   });
   try {
-    const contextOptions = {};
+    const contextOptions = {
+      userAgent: CHROME_USER_AGENT
+    };
     if (!config.noSession && sessionExists(url, sessionsDir)) {
       const sessionPath = getSessionPath(url, sessionsDir);
       contextOptions.storageState = sessionPath;
     }
     const context = await browser.newContext(contextOptions);
     const page = await context.newPage();
+    const blockedTypes = buildBlockedResourceTypes(config);
+    if (blockedTypes.size > 0) {
+      await page.route("**/*", async (route) => {
+        if (blockedTypes.has(route.request().resourceType())) {
+          await route.abort();
+        } else {
+          await route.continue();
+        }
+      });
+    }
     const timeoutMs = config.timeout * 1e3;
-    const response = await page.goto(url, {
-      waitUntil: config.waitUntil,
-      timeout: timeoutMs
-    });
+    let response;
+    try {
+      response = await page.goto(url, {
+        waitUntil: config.waitUntil,
+        timeout: timeoutMs
+      });
+    } catch (error) {
+      if (error instanceof Error && error.message.includes("Download is starting")) {
+        const result = await downloadPdf(context, url, timeoutMs);
+        await context.close();
+        return { kind: "pdf", pdfBuffer: result.pdfBuffer, url, finalUrl: result.finalUrl };
+      }
+      throw error;
+    }
     if (!response) {
       throw new Error(`No response received from ${url}`);
     }
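The `validatePdfBuffer` guard in this hunk relies on the fact that every well-formed PDF begins with the four ASCII bytes `%PDF` (for example `%PDF-1.7`). A minimal standalone sketch of the same check; `looksLikePdf` is an illustrative name, not part of the package:

```javascript
// A PDF body must begin with the ASCII signature "%PDF".
const PDF_MAGIC = "%PDF";

function looksLikePdf(buffer) {
  // Anything shorter than the signature cannot be a PDF.
  if (buffer.length < PDF_MAGIC.length) return false;
  // Compare only the leading bytes, decoded as ASCII.
  return buffer.subarray(0, PDF_MAGIC.length).toString("ascii").startsWith(PDF_MAGIC);
}
```

This is why the fetcher can trust a `Content-Type: application/pdf` header only after sniffing the body: a server that mislabels an HTML error page is caught by the magic-byte check and retried via `downloadPdf`.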
@@ -128,6 +217,19 @@ async function fetchPage(url, config, sessionsDir) {
       throw new Error(`HTTP ${status} received from ${response.url()}`);
     }
     const finalUrl = response.url();
+    const contentType = response.headers()["content-type"] ?? "";
+    if (isPdfContentType(contentType)) {
+      const body = await response.body();
+      try {
+        validatePdfBuffer(body, finalUrl);
+        await context.close();
+        return { kind: "pdf", pdfBuffer: body, url, finalUrl };
+      } catch {
+        const result = await downloadPdf(context, finalUrl, timeoutMs);
+        await context.close();
+        return { kind: "pdf", pdfBuffer: result.pdfBuffer, url, finalUrl: result.finalUrl };
+      }
+    }
     const fullHtml = await page.content();
     let html;
     if (config.selector) {
@@ -140,7 +242,7 @@ async function fetchPage(url, config, sessionsDir) {
       html = fullHtml;
     }
     await context.close();
-    return { html, fullHtml, url, finalUrl };
+    return { kind: "html", html, fullHtml, url, finalUrl };
   } finally {
     await browser.close();
   }
@@ -187,54 +289,48 @@ function extractAuthors(doc, readabilityByline) {
   }
   return [];
 }
+function buildMetadata(doc, article, finalUrl) {
+  const title = article?.title ?? doc.title;
+  const authors = extractAuthors(doc, article?.byline ?? null);
+  const published = extractPublishedDate(doc);
+  const description = getMetaContent(doc, 'meta[property="og:description"]') ?? getMetaContent(doc, 'meta[name="description"]') ?? (article?.excerpt ?? null);
+  const today = (/* @__PURE__ */ new Date()).toISOString().split("T")[0];
+  return {
+    title,
+    source: finalUrl,
+    author: authors,
+    published,
+    created: today,
+    description
+  };
+}
+function parseWithReadability(html, url) {
+  const dom = new JSDOM(html, { url });
+  const reader = new Readability(dom.window.document);
+  return reader.parse();
+}
 function extract(html, finalUrl) {
   const metaDom = new JSDOM(html, { url: finalUrl });
   const doc = metaDom.window.document;
-  const readabilityDom = new JSDOM(html, { url: finalUrl });
-  const reader = new Readability(readabilityDom.window.document);
-  const article = reader.parse();
+  const article = parseWithReadability(html, finalUrl);
   if (!article) {
-    throw new Error(
+    throw new Error(
+      "Readability failed to extract content from the page. Try --raw to convert the full page, or --selector <css> to target specific content."
+    );
   }
   if (!article.content) {
     throw new Error("Readability returned empty content for the page");
   }
-  const title = article.title ?? doc.title;
-  const authors = extractAuthors(doc, article.byline ?? null);
-  const published = extractPublishedDate(doc);
-  const description = getMetaContent(doc, 'meta[property="og:description"]') ?? getMetaContent(doc, 'meta[name="description"]') ?? (article.excerpt ?? null);
-  const today = (/* @__PURE__ */ new Date()).toISOString().split("T")[0];
   return {
-    metadata: {
-      title,
-      source: finalUrl,
-      author: authors,
-      published,
-      created: today,
-      description
-    },
+    metadata: buildMetadata(doc, article, finalUrl),
     content: article.content
   };
 }
 function extractMetadata(html, finalUrl) {
   const metaDom = new JSDOM(html, { url: finalUrl });
   const doc = metaDom.window.document;
-  const readabilityDom = new JSDOM(html, { url: finalUrl });
-  const reader = new Readability(readabilityDom.window.document);
-  const article = reader.parse();
-  const title = article?.title ?? doc.title;
-  const authors = extractAuthors(doc, article?.byline ?? null);
-  const published = extractPublishedDate(doc);
-  const description = getMetaContent(doc, 'meta[property="og:description"]') ?? getMetaContent(doc, 'meta[name="description"]') ?? (article?.excerpt ?? null);
-  const today = (/* @__PURE__ */ new Date()).toISOString().split("T")[0];
-  return {
-    title,
-    source: finalUrl,
-    author: authors,
-    published,
-    created: today,
-    description
-  };
+  const article = parseWithReadability(html, finalUrl);
+  return buildMetadata(doc, article, finalUrl);
 }
 
 // src/converter.ts
@@ -253,9 +349,10 @@ import { writeFileSync } from "fs";
 import { join as join2 } from "path";
 import yaml2 from "js-yaml";
 var UNSAFE_CHARS = /[/\\:*?"<>|]/g;
+var CONTROL_CHARS = /[\x00-\x1f\x7f]/g;
 var MAX_FILENAME_LENGTH = 200;
 function sanitizeFilename(title) {
-  const sanitized = title.replace(UNSAFE_CHARS, "").replace(/\s+/g, " ").trim();
+  const sanitized = title.replace(CONTROL_CHARS, "").replace(UNSAFE_CHARS, "").replace(/\s+/g, " ").trim();
   const base = sanitized.slice(0, MAX_FILENAME_LENGTH) || "Untitled";
   return `${base}.md`;
 }
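The new `CONTROL_CHARS` pass means a page title cannot smuggle control bytes (tabs, newlines, DEL) into the saved filename: control characters are stripped first, then filesystem-unsafe characters, then runs of whitespace collapse to one space before truncation. The revised function, reproduced standalone from the dist output above for illustration:

```javascript
// Reproduction of the 0.3.0 sanitizeFilename logic shown in the hunk above.
const UNSAFE_CHARS = /[/\\:*?"<>|]/g;     // characters invalid in file names
const CONTROL_CHARS = /[\x00-\x1f\x7f]/g; // ASCII control bytes and DEL
const MAX_FILENAME_LENGTH = 200;

function sanitizeFilename(title) {
  const sanitized = title
    .replace(CONTROL_CHARS, "")
    .replace(UNSAFE_CHARS, "")
    .replace(/\s+/g, " ")
    .trim();
  // An all-garbage title still yields a usable file name.
  const base = sanitized.slice(0, MAX_FILENAME_LENGTH) || "Untitled";
  return `${base}.md`;
}
```

Order matters: stripping control characters before the whitespace collapse keeps a tab from being rewritten into a space instead of being removed.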
@@ -301,14 +398,14 @@ var CONFIG_PATH = join3(homedir3(), ".config", "vault-fetch", "config.yaml");
 var program = new Command();
 program.name("vault-fetch").description(
   "Fetch JS-rendered web pages and save as Markdown to Obsidian Vault"
-).version("0.1.0");
+).version("0.3.0");
 program.command("fetch").description("Fetch a page and save as Markdown").argument("<url>", "URL to fetch").option("--dest <path>", "Destination directory").option("--headed", "Run browser in headed mode").option("--selector <css>", "CSS selector to extract").option("--timeout <seconds>", "Timeout in seconds", parseInt).option("--tag <name>", "Add tag (repeatable)", (val, acc) => {
   acc.push(val);
   return acc;
 }, []).option(
   "--wait-until <event>",
   "Wait condition: load, domcontentloaded, networkidle"
-).option("--skip-session", "Do not use saved session").option("--dry-run", "Output to stdout instead of saving").action(async (url, options) => {
+).option("--skip-session", "Do not use saved session").option("--dry-run", "Output to stdout instead of saving").option("--no-block-images", "Do not block image requests").option("--no-block-fonts", "Do not block font requests").option("--no-block-media", "Do not block media requests").option("--raw", "Convert full page HTML without Readability extraction").action(async (url, options) => {
   try {
     const configPath = existsSync2(CONFIG_PATH) ? CONFIG_PATH : void 0;
     const config = resolveConfig(
@@ -320,26 +417,49 @@ program.command("fetch").description("Fetch a page and save as Markdown").argume
       headed: options.headed,
       selector: options.selector,
       noSession: options.skipSession,
-      dryRun: options.dryRun
+      dryRun: options.dryRun,
+      blockImages: options.blockImages,
+      blockFonts: options.blockFonts,
+      blockMedia: options.blockMedia,
+      raw: options.raw
     },
     configPath
   );
+  if (config.raw && config.selector) {
+    throw new Error("--raw and --selector cannot be used together.");
+  }
   if (!config.dryRun && !existsSync2(config.dest)) {
     throw new Error(`Destination directory does not exist: ${config.dest}`);
   }
   const sessionsDir = getSessionDir();
   const fetchResult = await fetchPage(url, config, sessionsDir);
-  let contentHtml;
+  let markdown;
   let metadata;
-  if (config.selector) {
-    contentHtml = fetchResult.html;
+  if (fetchResult.kind === "pdf") {
+    if (config.selector) {
+      throw new Error("--selector cannot be used with PDF URLs.");
+    }
+    if (config.raw) {
+      throw new Error("--raw cannot be used with PDF URLs.");
+    }
+    const { convertPdfToMarkdown } = await import("./pdf-converter-U3SFA2HY.js");
+    const pdfResult = await convertPdfToMarkdown(
+      fetchResult.pdfBuffer,
+      fetchResult.finalUrl
+    );
+    markdown = pdfResult.markdown;
+    metadata = pdfResult.metadata;
+  } else if (config.selector) {
     metadata = extractMetadata(fetchResult.fullHtml, fetchResult.finalUrl);
+    markdown = convertToMarkdown(fetchResult.html);
+  } else if (config.raw) {
+    metadata = extractMetadata(fetchResult.fullHtml, fetchResult.finalUrl);
+    markdown = convertToMarkdown(fetchResult.fullHtml);
   } else {
     const result = extract(fetchResult.html, fetchResult.finalUrl);
     metadata = result.metadata;
-    contentHtml = result.content;
+    markdown = convertToMarkdown(result.content);
   }
-  const markdown = convertToMarkdown(contentHtml);
   if (config.dryRun) {
     const frontmatter = buildFrontmatter(metadata, config.tags);
     process.stdout.write(`${frontmatter}
@@ -372,11 +492,10 @@
     const page = await context.newPage();
     await page.goto(url, { waitUntil: "networkidle", timeout: timeoutSec * 1e3 });
     console.error("Browser opened. Log in manually, then press Enter here to save session.");
-    await new Promise((resolve) => {
-      process.stdin.once("data", () => {
-        resolve();
-      });
-    });
+    process.stdin.resume();
+    await once(process.stdin, "data");
+    process.stdin.pause();
+    process.stdin.unref();
     const sessionPath = getSessionPath(url, sessionsDir);
     await context.storageState({ path: sessionPath });
     console.error(`Session saved: ${sessionPath}`);
package/dist/cli.js.map
CHANGED
|
error.message : String(error);\n console.error(`Error: ${message}`);\n process.exit(1);\n } finally {\n await browser.close();\n }\n });\n\nprogram.parse();\n","import { readFileSync } from \"node:fs\";\nimport { homedir } from \"node:os\";\nimport { resolve } from \"node:path\";\nimport yaml from \"js-yaml\";\nimport type { ResolvedConfig, WaitUntilOption } from \"./types.js\";\n\nconst DEFAULT_TIMEOUT = 30;\nconst DEFAULT_WAIT_UNTIL: WaitUntilOption = \"networkidle\";\nconst REQUIRED_TAG = \"clippings\";\n\ninterface FileConfig {\n dest?: string;\n tags?: string[];\n timeout?: number;\n waitUntil?: WaitUntilOption;\n}\n\ninterface CliOptions {\n dest?: string;\n tags?: string[];\n timeout?: number;\n waitUntil?: WaitUntilOption;\n headed?: boolean;\n selector?: string;\n noSession?: boolean;\n dryRun?: boolean;\n}\n\nfunction expandTilde(filePath: string): string {\n if (filePath.startsWith(\"~/\")) {\n return resolve(homedir(), filePath.slice(2));\n }\n return filePath;\n}\n\nfunction loadConfigFile(configPath: string): FileConfig {\n const content = readFileSync(configPath, \"utf-8\");\n const parsed = yaml.load(content);\n if (parsed === null || typeof parsed !== \"object\") {\n throw new Error(`Invalid config file: ${configPath}`);\n }\n return parsed as FileConfig;\n}\n\nexport function resolveConfig(\n cliOptions: CliOptions,\n configPath: string | undefined,\n): ResolvedConfig {\n // Layer 1: Config file\n let fileConfig: FileConfig = {};\n if (configPath) {\n fileConfig = loadConfigFile(configPath);\n }\n\n // Layer 2: Environment variables\n const envDest = process.env.VAULT_FETCH_DEST;\n const envTimeout = process.env.VAULT_FETCH_TIMEOUT;\n\n // Resolve each field: CLI > env > file > default\n const dest = cliOptions.dest ?? envDest ?? fileConfig.dest;\n if (dest === undefined) {\n throw new Error(\n \"dest is required. 
Set via --dest, VAULT_FETCH_DEST, or config file.\",\n );\n }\n\n let timeout: number;\n if (cliOptions.timeout !== undefined) {\n timeout = cliOptions.timeout;\n } else if (envTimeout !== undefined) {\n const parsed = Number(envTimeout);\n if (Number.isNaN(parsed)) {\n throw new Error(`Invalid VAULT_FETCH_TIMEOUT value: ${envTimeout}`);\n }\n timeout = parsed;\n } else {\n timeout = fileConfig.timeout ?? DEFAULT_TIMEOUT;\n }\n\n const waitUntil =\n cliOptions.waitUntil ?? fileConfig.waitUntil ?? DEFAULT_WAIT_UNTIL;\n\n // Merge tags: file tags + CLI tags + always clippings\n const allTags = [\n ...(fileConfig.tags ?? []),\n ...(cliOptions.tags ?? []),\n REQUIRED_TAG,\n ];\n const tags = [...new Set(allTags)];\n\n return {\n dest: expandTilde(dest),\n tags,\n timeout,\n waitUntil,\n headed: cliOptions.headed ?? false,\n selector: cliOptions.selector ?? null,\n noSession: cliOptions.noSession ?? false,\n dryRun: cliOptions.dryRun ?? false,\n };\n}\n","import { chromium, type BrowserContext } from \"playwright\";\nimport type { FetchResult, ResolvedConfig } from \"./types.js\";\nimport { getSessionPath, sessionExists } from \"./session.js\";\n\nexport async function fetchPage(\n url: string,\n config: ResolvedConfig,\n sessionsDir: string,\n): Promise<FetchResult> {\n const browser = await chromium.launch({\n headless: !config.headed,\n });\n\n try {\n const contextOptions: Parameters<typeof browser.newContext>[0] = {};\n\n // Load session if available and not disabled\n if (!config.noSession && sessionExists(url, sessionsDir)) {\n const sessionPath = getSessionPath(url, sessionsDir);\n contextOptions.storageState = sessionPath;\n }\n\n const context: BrowserContext = await browser.newContext(contextOptions);\n const page = await context.newPage();\n\n const timeoutMs = config.timeout * 1000;\n const response = await page.goto(url, {\n waitUntil: config.waitUntil,\n timeout: timeoutMs,\n });\n\n if (!response) {\n throw new Error(`No response received from 
${url}`);\n }\n\n const status = response.status();\n if (status >= 400) {\n throw new Error(`HTTP ${status} received from ${response.url()}`);\n }\n\n const finalUrl = response.url();\n const fullHtml = await page.content();\n let html: string;\n\n if (config.selector) {\n const element = await page.$(config.selector);\n if (!element) {\n throw new Error(`Selector not found: ${config.selector}`);\n }\n html = await element.innerHTML();\n } else {\n html = fullHtml;\n }\n\n await context.close();\n\n return { html, fullHtml, url, finalUrl };\n } finally {\n await browser.close();\n }\n}\n","import { existsSync, mkdirSync } from \"node:fs\";\nimport { join } from \"node:path\";\nimport { homedir } from \"node:os\";\n\nconst CONFIG_DIR = join(homedir(), \".config\", \"vault-fetch\");\nconst SESSIONS_DIR = join(CONFIG_DIR, \"sessions\");\n\nexport function getSessionDir(): string {\n return SESSIONS_DIR;\n}\n\nfunction extractDomain(url: string): string {\n const parsed = new URL(url);\n return parsed.hostname ?? \"\";\n}\n\nexport function getSessionPath(url: string, sessionsDir: string): string {\n const domain = extractDomain(url);\n return join(sessionsDir, `${domain}.json`);\n}\n\nexport function sessionExists(url: string, sessionsDir: string): boolean {\n const sessionPath = getSessionPath(url, sessionsDir);\n return existsSync(sessionPath);\n}\n\nexport function ensureSessionDir(sessionsDir: string): void {\n if (!existsSync(sessionsDir)) {\n mkdirSync(sessionsDir, { recursive: true });\n }\n}\n","import { Readability } from \"@mozilla/readability\";\nimport { JSDOM } from \"jsdom\";\nimport type { Metadata } from \"./types.js\";\n\nfunction getMetaContent(doc: Document, selector: string): string | null {\n const el = doc.querySelector(selector);\n return el?.getAttribute(\"content\") ?? 
null;\n}\n\nfunction formatAuthor(raw: string): string {\n return `[[${raw.trim()}]]`;\n}\n\nfunction extractPublishedDate(doc: Document): string | null {\n const published =\n getMetaContent(doc, 'meta[property=\"article:published_time\"]') ??\n getMetaContent(doc, 'meta[name=\"datePublished\"]');\n\n if (!published) {\n const jsonLd = doc.querySelector('script[type=\"application/ld+json\"]');\n if (jsonLd?.textContent) {\n try {\n const data = JSON.parse(jsonLd.textContent) as Record<string, unknown>;\n if (typeof data.datePublished === \"string\") {\n return data.datePublished.split(\"T\")[0];\n }\n } catch {\n // JSON-LD parse failed\n }\n }\n return null;\n }\n\n return published.split(\"T\")[0];\n}\n\nfunction extractAuthors(\n doc: Document,\n readabilityByline: string | null,\n): string[] {\n const articleAuthors = doc.querySelectorAll('meta[property=\"article:author\"]');\n if (articleAuthors.length > 0) {\n return Array.from(articleAuthors)\n .map((el) => el.getAttribute(\"content\"))\n .filter((v): v is string => v !== null)\n .map(formatAuthor);\n }\n\n const ogAuthor = getMetaContent(doc, 'meta[property=\"og:author\"]');\n if (ogAuthor) {\n return [formatAuthor(ogAuthor)];\n }\n\n if (readabilityByline) {\n return [formatAuthor(readabilityByline)];\n }\n\n return [];\n}\n\nexport interface ExtractResult {\n metadata: Metadata;\n content: string;\n}\n\nexport function extract(html: string, finalUrl: string): ExtractResult {\n // One JSDOM for metadata DOM queries\n const metaDom = new JSDOM(html, { url: finalUrl });\n const doc = metaDom.window.document;\n\n // One JSDOM for Readability (which mutates the DOM)\n const readabilityDom = new JSDOM(html, { url: finalUrl });\n const reader = new Readability(readabilityDom.window.document);\n const article = reader.parse();\n\n if (!article) {\n throw new Error(\"Readability failed to extract content from the page\");\n }\n\n if (!article.content) {\n throw new Error(\"Readability returned empty content for 
the page\");\n }\n\n const title = article.title ?? doc.title;\n const authors = extractAuthors(doc, article.byline ?? null);\n const published = extractPublishedDate(doc);\n\n const description =\n getMetaContent(doc, 'meta[property=\"og:description\"]') ??\n getMetaContent(doc, 'meta[name=\"description\"]') ??\n (article.excerpt ?? null);\n\n const today = new Date().toISOString().split(\"T\")[0];\n\n return {\n metadata: {\n title,\n source: finalUrl,\n author: authors,\n published,\n created: today,\n description,\n },\n content: article.content,\n };\n}\n\nexport function extractMetadata(html: string, finalUrl: string): Metadata {\n const metaDom = new JSDOM(html, { url: finalUrl });\n const doc = metaDom.window.document;\n\n const readabilityDom = new JSDOM(html, { url: finalUrl });\n const reader = new Readability(readabilityDom.window.document);\n const article = reader.parse();\n\n const title = article?.title ?? doc.title;\n const authors = extractAuthors(doc, article?.byline ?? null);\n const published = extractPublishedDate(doc);\n\n const description =\n getMetaContent(doc, 'meta[property=\"og:description\"]') ??\n getMetaContent(doc, 'meta[name=\"description\"]') ??\n (article?.excerpt ?? 
null);\n\n const today = new Date().toISOString().split(\"T\")[0];\n\n return {\n title,\n source: finalUrl,\n author: authors,\n published,\n created: today,\n description,\n };\n}\n","import TurndownService from \"turndown\";\n\nexport function convertToMarkdown(html: string): string {\n const turndown = new TurndownService({\n headingStyle: \"atx\",\n codeBlockStyle: \"fenced\",\n bulletListMarker: \"-\",\n });\n\n return turndown.turndown(html);\n}\n","import { writeFileSync } from \"node:fs\";\nimport { join } from \"node:path\";\nimport yaml from \"js-yaml\";\nimport type { Metadata } from \"./types.js\";\n\nconst UNSAFE_CHARS = /[/\\\\:*?\"<>|]/g;\nconst MAX_FILENAME_LENGTH = 200;\n\nexport function sanitizeFilename(title: string): string {\n const sanitized = title\n .replace(UNSAFE_CHARS, \"\")\n .replace(/\\s+/g, \" \")\n .trim();\n const base = sanitized.slice(0, MAX_FILENAME_LENGTH) || \"Untitled\";\n return `${base}.md`;\n}\n\nexport function buildFrontmatter(metadata: Metadata, tags: string[]): string {\n const data: Record<string, unknown> = {\n title: metadata.title,\n source: metadata.source,\n };\n\n if (metadata.author.length > 0) {\n data.author = metadata.author;\n }\n\n if (metadata.published) {\n data.published = metadata.published;\n }\n\n data.created = metadata.created;\n\n if (metadata.description) {\n data.description = metadata.description;\n }\n\n data.tags = tags;\n\n const yamlStr = yaml.dump(data, {\n quotingType: '\"',\n forceQuotes: false,\n lineWidth: -1,\n sortKeys: false,\n });\n\n return `---\\n${yamlStr}---`;\n}\n\nexport function writeMarkdownFile(\n dest: string,\n metadata: Metadata,\n markdownContent: string,\n tags: string[],\n): string {\n const filename = sanitizeFilename(metadata.title);\n const filePath = join(dest, filename);\n const frontmatter = buildFrontmatter(metadata, tags);\n const fullContent = `${frontmatter}\\n\\n${markdownContent}\\n`;\n\n writeFileSync(filePath, fullContent, \"utf-8\");\n\n return 
filePath;\n}\n"],"mappings":";;;AAAA,SAAS,eAAe;AACxB,SAAS,cAAAA,mBAAkB;AAC3B,SAAS,WAAAC,gBAAe;AACxB,SAAS,QAAAC,aAAY;;;ACHrB,SAAS,oBAAoB;AAC7B,SAAS,eAAe;AACxB,SAAS,eAAe;AACxB,OAAO,UAAU;AAGjB,IAAM,kBAAkB;AACxB,IAAM,qBAAsC;AAC5C,IAAM,eAAe;AAoBrB,SAAS,YAAY,UAA0B;AAC7C,MAAI,SAAS,WAAW,IAAI,GAAG;AAC7B,WAAO,QAAQ,QAAQ,GAAG,SAAS,MAAM,CAAC,CAAC;AAAA,EAC7C;AACA,SAAO;AACT;AAEA,SAAS,eAAe,YAAgC;AACtD,QAAM,UAAU,aAAa,YAAY,OAAO;AAChD,QAAM,SAAS,KAAK,KAAK,OAAO;AAChC,MAAI,WAAW,QAAQ,OAAO,WAAW,UAAU;AACjD,UAAM,IAAI,MAAM,wBAAwB,UAAU,EAAE;AAAA,EACtD;AACA,SAAO;AACT;AAEO,SAAS,cACd,YACA,YACgB;AAEhB,MAAI,aAAyB,CAAC;AAC9B,MAAI,YAAY;AACd,iBAAa,eAAe,UAAU;AAAA,EACxC;AAGA,QAAM,UAAU,QAAQ,IAAI;AAC5B,QAAM,aAAa,QAAQ,IAAI;AAG/B,QAAM,OAAO,WAAW,QAAQ,WAAW,WAAW;AACtD,MAAI,SAAS,QAAW;AACtB,UAAM,IAAI;AAAA,MACR;AAAA,IACF;AAAA,EACF;AAEA,MAAI;AACJ,MAAI,WAAW,YAAY,QAAW;AACpC,cAAU,WAAW;AAAA,EACvB,WAAW,eAAe,QAAW;AACnC,UAAM,SAAS,OAAO,UAAU;AAChC,QAAI,OAAO,MAAM,MAAM,GAAG;AACxB,YAAM,IAAI,MAAM,sCAAsC,UAAU,EAAE;AAAA,IACpE;AACA,cAAU;AAAA,EACZ,OAAO;AACL,cAAU,WAAW,WAAW;AAAA,EAClC;AAEA,QAAM,YACJ,WAAW,aAAa,WAAW,aAAa;AAGlD,QAAM,UAAU;AAAA,IACd,GAAI,WAAW,QAAQ,CAAC;AAAA,IACxB,GAAI,WAAW,QAAQ,CAAC;AAAA,IACxB;AAAA,EACF;AACA,QAAM,OAAO,CAAC,GAAG,IAAI,IAAI,OAAO,CAAC;AAEjC,SAAO;AAAA,IACL,MAAM,YAAY,IAAI;AAAA,IACtB;AAAA,IACA;AAAA,IACA;AAAA,IACA,QAAQ,WAAW,UAAU;AAAA,IAC7B,UAAU,WAAW,YAAY;AAAA,IACjC,WAAW,WAAW,aAAa;AAAA,IACnC,QAAQ,WAAW,UAAU;AAAA,EAC/B;AACF;;;ACpGA,SAAS,gBAAqC;;;ACA9C,SAAS,YAAY,iBAAiB;AACtC,SAAS,YAAY;AACrB,SAAS,WAAAC,gBAAe;AAExB,IAAM,aAAa,KAAKA,SAAQ,GAAG,WAAW,aAAa;AAC3D,IAAM,eAAe,KAAK,YAAY,UAAU;AAEzC,SAAS,gBAAwB;AACtC,SAAO;AACT;AAEA,SAAS,cAAc,KAAqB;AAC1C,QAAM,SAAS,IAAI,IAAI,GAAG;AAC1B,SAAO,OAAO,YAAY;AAC5B;AAEO,SAAS,eAAe,KAAa,aAA6B;AACvE,QAAM,SAAS,cAAc,GAAG;AAChC,SAAO,KAAK,aAAa,GAAG,MAAM,OAAO;AAC3C;AAEO,SAAS,cAAc,KAAa,aAA8B;AACvE,QAAM,cAAc,eAAe,KAAK,WAAW;AACnD,SAAO,WAAW,WAAW;AAC/B;AAEO,SAAS,iBAAiB,aAA2B;AAC1D,MAAI,CAAC,WAAW,WAAW,GAAG;AAC5B,cAAU,aAAa,EAAE,WAAW,KAAK,CAAC;AAAA,EAC5C;AACF;;;AD1BA,eAAsB,UACpB,KACA,QACA,aACsB;AACtB,QAAM,U
AAU,MAAM,SAAS,OAAO;AAAA,IACpC,UAAU,CAAC,OAAO;AAAA,EACpB,CAAC;AAED,MAAI;AACF,UAAM,iBAA2D,CAAC;AAGlE,QAAI,CAAC,OAAO,aAAa,cAAc,KAAK,WAAW,GAAG;AACxD,YAAM,cAAc,eAAe,KAAK,WAAW;AACnD,qBAAe,eAAe;AAAA,IAChC;AAEA,UAAM,UAA0B,MAAM,QAAQ,WAAW,cAAc;AACvE,UAAM,OAAO,MAAM,QAAQ,QAAQ;AAEnC,UAAM,YAAY,OAAO,UAAU;AACnC,UAAM,WAAW,MAAM,KAAK,KAAK,KAAK;AAAA,MACpC,WAAW,OAAO;AAAA,MAClB,SAAS;AAAA,IACX,CAAC;AAED,QAAI,CAAC,UAAU;AACb,YAAM,IAAI,MAAM,6BAA6B,GAAG,EAAE;AAAA,IACpD;AAEA,UAAM,SAAS,SAAS,OAAO;AAC/B,QAAI,UAAU,KAAK;AACjB,YAAM,IAAI,MAAM,QAAQ,MAAM,kBAAkB,SAAS,IAAI,CAAC,EAAE;AAAA,IAClE;AAEA,UAAM,WAAW,SAAS,IAAI;AAC9B,UAAM,WAAW,MAAM,KAAK,QAAQ;AACpC,QAAI;AAEJ,QAAI,OAAO,UAAU;AACnB,YAAM,UAAU,MAAM,KAAK,EAAE,OAAO,QAAQ;AAC5C,UAAI,CAAC,SAAS;AACZ,cAAM,IAAI,MAAM,uBAAuB,OAAO,QAAQ,EAAE;AAAA,MAC1D;AACA,aAAO,MAAM,QAAQ,UAAU;AAAA,IACjC,OAAO;AACL,aAAO;AAAA,IACT;AAEA,UAAM,QAAQ,MAAM;AAEpB,WAAO,EAAE,MAAM,UAAU,KAAK,SAAS;AAAA,EACzC,UAAE;AACA,UAAM,QAAQ,MAAM;AAAA,EACtB;AACF;;;AE5DA,SAAS,mBAAmB;AAC5B,SAAS,aAAa;AAGtB,SAAS,eAAe,KAAe,UAAiC;AACtE,QAAM,KAAK,IAAI,cAAc,QAAQ;AACrC,SAAO,IAAI,aAAa,SAAS,KAAK;AACxC;AAEA,SAAS,aAAa,KAAqB;AACzC,SAAO,KAAK,IAAI,KAAK,CAAC;AACxB;AAEA,SAAS,qBAAqB,KAA8B;AAC1D,QAAM,YACJ,eAAe,KAAK,yCAAyC,KAC7D,eAAe,KAAK,4BAA4B;AAElD,MAAI,CAAC,WAAW;AACd,UAAM,SAAS,IAAI,cAAc,oCAAoC;AACrE,QAAI,QAAQ,aAAa;AACvB,UAAI;AACF,cAAM,OAAO,KAAK,MAAM,OAAO,WAAW;AAC1C,YAAI,OAAO,KAAK,kBAAkB,UAAU;AAC1C,iBAAO,KAAK,cAAc,MAAM,GAAG,EAAE,CAAC;AAAA,QACxC;AAAA,MACF,QAAQ;AAAA,MAER;AAAA,IACF;AACA,WAAO;AAAA,EACT;AAEA,SAAO,UAAU,MAAM,GAAG,EAAE,CAAC;AAC/B;AAEA,SAAS,eACP,KACA,mBACU;AACV,QAAM,iBAAiB,IAAI,iBAAiB,iCAAiC;AAC7E,MAAI,eAAe,SAAS,GAAG;AAC7B,WAAO,MAAM,KAAK,cAAc,EAC7B,IAAI,CAAC,OAAO,GAAG,aAAa,SAAS,CAAC,EACtC,OAAO,CAAC,MAAmB,MAAM,IAAI,EACrC,IAAI,YAAY;AAAA,EACrB;AAEA,QAAM,WAAW,eAAe,KAAK,4BAA4B;AACjE,MAAI,UAAU;AACZ,WAAO,CAAC,aAAa,QAAQ,CAAC;AAAA,EAChC;AAEA,MAAI,mBAAmB;AACrB,WAAO,CAAC,aAAa,iBAAiB,CAAC;AAAA,EACzC;AAEA,SAAO,CAAC;AACV;AAOO,SAAS,QAAQ,MAAc,UAAiC;AAErE,QAAM,UAAU,IAAI,MAAM,MAAM,EAAE,KAAK,SAAS,CAAC;AACjD,QAAM,MAAM,QAAQ,OAAO;AAG3B,QAAM
,iBAAiB,IAAI,MAAM,MAAM,EAAE,KAAK,SAAS,CAAC;AACxD,QAAM,SAAS,IAAI,YAAY,eAAe,OAAO,QAAQ;AAC7D,QAAM,UAAU,OAAO,MAAM;AAE7B,MAAI,CAAC,SAAS;AACZ,UAAM,IAAI,MAAM,qDAAqD;AAAA,EACvE;AAEA,MAAI,CAAC,QAAQ,SAAS;AACpB,UAAM,IAAI,MAAM,iDAAiD;AAAA,EACnE;AAEA,QAAM,QAAQ,QAAQ,SAAS,IAAI;AACnC,QAAM,UAAU,eAAe,KAAK,QAAQ,UAAU,IAAI;AAC1D,QAAM,YAAY,qBAAqB,GAAG;AAE1C,QAAM,cACJ,eAAe,KAAK,iCAAiC,KACrD,eAAe,KAAK,0BAA0B,MAC7C,QAAQ,WAAW;AAEtB,QAAM,SAAQ,oBAAI,KAAK,GAAE,YAAY,EAAE,MAAM,GAAG,EAAE,CAAC;AAEnD,SAAO;AAAA,IACL,UAAU;AAAA,MACR;AAAA,MACA,QAAQ;AAAA,MACR,QAAQ;AAAA,MACR;AAAA,MACA,SAAS;AAAA,MACT;AAAA,IACF;AAAA,IACA,SAAS,QAAQ;AAAA,EACnB;AACF;AAEO,SAAS,gBAAgB,MAAc,UAA4B;AACxE,QAAM,UAAU,IAAI,MAAM,MAAM,EAAE,KAAK,SAAS,CAAC;AACjD,QAAM,MAAM,QAAQ,OAAO;AAE3B,QAAM,iBAAiB,IAAI,MAAM,MAAM,EAAE,KAAK,SAAS,CAAC;AACxD,QAAM,SAAS,IAAI,YAAY,eAAe,OAAO,QAAQ;AAC7D,QAAM,UAAU,OAAO,MAAM;AAE7B,QAAM,QAAQ,SAAS,SAAS,IAAI;AACpC,QAAM,UAAU,eAAe,KAAK,SAAS,UAAU,IAAI;AAC3D,QAAM,YAAY,qBAAqB,GAAG;AAE1C,QAAM,cACJ,eAAe,KAAK,iCAAiC,KACrD,eAAe,KAAK,0BAA0B,MAC7C,SAAS,WAAW;AAEvB,QAAM,SAAQ,oBAAI,KAAK,GAAE,YAAY,EAAE,MAAM,GAAG,EAAE,CAAC;AAEnD,SAAO;AAAA,IACL;AAAA,IACA,QAAQ;AAAA,IACR,QAAQ;AAAA,IACR;AAAA,IACA,SAAS;AAAA,IACT;AAAA,EACF;AACF;;;ACtIA,OAAO,qBAAqB;AAErB,SAAS,kBAAkB,MAAsB;AACtD,QAAM,WAAW,IAAI,gBAAgB;AAAA,IACnC,cAAc;AAAA,IACd,gBAAgB;AAAA,IAChB,kBAAkB;AAAA,EACpB,CAAC;AAED,SAAO,SAAS,SAAS,IAAI;AAC/B;;;ACVA,SAAS,qBAAqB;AAC9B,SAAS,QAAAC,aAAY;AACrB,OAAOC,WAAU;AAGjB,IAAM,eAAe;AACrB,IAAM,sBAAsB;AAErB,SAAS,iBAAiB,OAAuB;AACtD,QAAM,YAAY,MACf,QAAQ,cAAc,EAAE,EACxB,QAAQ,QAAQ,GAAG,EACnB,KAAK;AACR,QAAM,OAAO,UAAU,MAAM,GAAG,mBAAmB,KAAK;AACxD,SAAO,GAAG,IAAI;AAChB;AAEO,SAAS,iBAAiB,UAAoB,MAAwB;AAC3E,QAAM,OAAgC;AAAA,IACpC,OAAO,SAAS;AAAA,IAChB,QAAQ,SAAS;AAAA,EACnB;AAEA,MAAI,SAAS,OAAO,SAAS,GAAG;AAC9B,SAAK,SAAS,SAAS;AAAA,EACzB;AAEA,MAAI,SAAS,WAAW;AACtB,SAAK,YAAY,SAAS;AAAA,EAC5B;AAEA,OAAK,UAAU,SAAS;AAExB,MAAI,SAAS,aAAa;AACxB,SAAK,cAAc,SAAS;AAAA,EAC9B;AAEA,OAAK,OAAO;AAEZ,QAAM,UAAUA,MAAK,KAAK,MAAM;AAAA,IAC9B,aAAa;AAAA,IACb,aAAa;AAAA,IACb,WAAW;AAAA,IACX,UAAU;AAAA,EACZ,C
AAC;AAED,SAAO;AAAA,EAAQ,OAAO;AACxB;AAEO,SAAS,kBACd,MACA,UACA,iBACA,MACQ;AACR,QAAM,WAAW,iBAAiB,SAAS,KAAK;AAChD,QAAM,WAAWD,MAAK,MAAM,QAAQ;AACpC,QAAM,cAAc,iBAAiB,UAAU,IAAI;AACnD,QAAM,cAAc,GAAG,WAAW;AAAA;AAAA,EAAO,eAAe;AAAA;AAExD,gBAAc,UAAU,aAAa,OAAO;AAE5C,SAAO;AACT;;;AN/CA,IAAM,cAAcE,MAAKC,SAAQ,GAAG,WAAW,eAAe,aAAa;AAE3E,IAAM,UAAU,IAAI,QAAQ;AAE5B,QACG,KAAK,aAAa,EAClB;AAAA,EACC;AACF,EACC,QAAQ,OAAO;AAElB,QACG,QAAQ,OAAO,EACf,YAAY,mCAAmC,EAC/C,SAAS,SAAS,cAAc,EAChC,OAAO,iBAAiB,uBAAuB,EAC/C,OAAO,YAAY,4BAA4B,EAC/C,OAAO,oBAAoB,yBAAyB,EACpD,OAAO,uBAAuB,sBAAsB,QAAQ,EAC5D,OAAO,gBAAgB,wBAAwB,CAAC,KAAa,QAAkB;AAC9E,MAAI,KAAK,GAAG;AACZ,SAAO;AACT,GAAG,CAAC,CAAa,EAChB;AAAA,EACC;AAAA,EACA;AACF,EACC,OAAO,kBAAkB,0BAA0B,EACnD,OAAO,aAAa,oCAAoC,EACxD,OAAO,OAAO,KAAa,YAAqC;AAC/D,MAAI;AACF,UAAM,aAAaC,YAAW,WAAW,IAAI,cAAc;AAC3D,UAAM,SAAS;AAAA,MACb;AAAA,QACE,MAAM,QAAQ;AAAA,QACd,MAAM,QAAQ;AAAA,QACd,SAAS,QAAQ;AAAA,QACjB,WAAW,QAAQ;AAAA,QACnB,QAAQ,QAAQ;AAAA,QAChB,UAAU,QAAQ;AAAA,QAClB,WAAW,QAAQ;AAAA,QACnB,QAAQ,QAAQ;AAAA,MAClB;AAAA,MACA;AAAA,IACF;AAGA,QAAI,CAAC,OAAO,UAAU,CAACA,YAAW,OAAO,IAAI,GAAG;AAC9C,YAAM,IAAI,MAAM,yCAAyC,OAAO,IAAI,EAAE;AAAA,IACxE;AAEA,UAAM,cAAc,cAAc;AAClC,UAAM,cAAc,MAAM,UAAU,KAAK,QAAQ,WAAW;AAE5D,QAAI;AACJ,QAAI;AAEJ,QAAI,OAAO,UAAU;AAEnB,oBAAc,YAAY;AAC1B,iBAAW,gBAAgB,YAAY,UAAU,YAAY,QAAQ;AAAA,IACvE,OAAO;AACL,YAAM,SAAS,QAAQ,YAAY,MAAM,YAAY,QAAQ;AAC7D,iBAAW,OAAO;AAClB,oBAAc,OAAO;AAAA,IACvB;AAEA,UAAM,WAAW,kBAAkB,WAAW;AAE9C,QAAI,OAAO,QAAQ;AACjB,YAAM,cAAc,iBAAiB,UAAU,OAAO,IAAI;AAC1D,cAAQ,OAAO,MAAM,GAAG,WAAW;AAAA;AAAA,EAAO,QAAQ;AAAA,CAAI;AAAA,IACxD,OAAO;AACL,YAAM,WAAW;AAAA,QACf,OAAO;AAAA,QACP;AAAA,QACA;AAAA,QACA,OAAO;AAAA,MACT;AACA,cAAQ,MAAM,UAAU,QAAQ,EAAE;AAAA,IACpC;AAAA,EACF,SAAS,OAAO;AACd,UAAM,UAAU,iBAAiB,QAAQ,MAAM,UAAU,OAAO,KAAK;AACrE,YAAQ,MAAM,UAAU,OAAO,EAAE;AACjC,YAAQ,KAAK,CAAC;AAAA,EAChB;AACF,CAAC;AAEH,QACG,QAAQ,OAAO,EACf,YAAY,kCAAkC,EAC9C,SAAS,SAAS,cAAc,EAChC,OAAO,uBAAuB,4BAA4B,QAAQ,EAClE,OAAO,OAAO,KAAa,YAAqC;AAC/D,QAAM,EAAE,UAAAC,UAAS,IAAI,MAAM,OAAO,YAAY;AAC9C,QAAM,cAAc,cAAc;AAClC,mBAA
iB,WAAW;AAE5B,QAAM,aAAc,QAAQ,WAAkC;AAC9D,QAAM,UAAU,MAAMA,UAAS,OAAO,EAAE,UAAU,MAAM,CAAC;AAEzD,MAAI;AACF,UAAM,UAAU,MAAM,QAAQ,WAAW;AACzC,UAAM,OAAO,MAAM,QAAQ,QAAQ;AAEnC,UAAM,KAAK,KAAK,KAAK,EAAE,WAAW,eAAe,SAAS,aAAa,IAAK,CAAC;AAE7E,YAAQ,MAAM,yEAAyE;AAEvF,UAAM,IAAI,QAAc,CAACC,aAAY;AACnC,cAAQ,MAAM,KAAK,QAAQ,MAAM;AAC/B,QAAAA,SAAQ;AAAA,MACV,CAAC;AAAA,IACH,CAAC;AAED,UAAM,cAAc,eAAe,KAAK,WAAW;AACnD,UAAM,QAAQ,aAAa,EAAE,MAAM,YAAY,CAAC;AAChD,YAAQ,MAAM,kBAAkB,WAAW,EAAE;AAAA,EAC/C,SAAS,OAAO;AACd,UAAM,UAAU,iBAAiB,QAAQ,MAAM,UAAU,OAAO,KAAK;AACrE,YAAQ,MAAM,UAAU,OAAO,EAAE;AACjC,YAAQ,KAAK,CAAC;AAAA,EAChB,UAAE;AACA,UAAM,QAAQ,MAAM;AAAA,EACtB;AACF,CAAC;AAEH,QAAQ,MAAM;","names":["existsSync","homedir","join","homedir","join","yaml","join","homedir","existsSync","chromium","resolve"]}
|
|
1
|
+
{"version":3,"sources":["../src/cli.ts","../src/config.ts","../src/fetcher.ts","../src/session.ts","../src/extractor.ts","../src/converter.ts","../src/writer.ts"],"sourcesContent":["import { Command } from \"commander\";\nimport { once } from \"node:events\";\nimport { existsSync } from \"node:fs\";\nimport { homedir } from \"node:os\";\nimport { join } from \"node:path\";\nimport { resolveConfig } from \"./config.js\";\nimport { fetchPage } from \"./fetcher.js\";\nimport { extract, extractMetadata } from \"./extractor.js\";\nimport { convertToMarkdown } from \"./converter.js\";\nimport { writeMarkdownFile, buildFrontmatter } from \"./writer.js\";\nimport {\n getSessionDir,\n getSessionPath,\n ensureSessionDir,\n} from \"./session.js\";\nimport type { Metadata, WaitUntilOption } from \"./types.js\";\n\nconst CONFIG_PATH = join(homedir(), \".config\", \"vault-fetch\", \"config.yaml\");\n\nconst program = new Command();\n\nprogram\n .name(\"vault-fetch\")\n .description(\n \"Fetch JS-rendered web pages and save as Markdown to Obsidian Vault\",\n )\n .version(\"0.3.0\");\n\nprogram\n .command(\"fetch\")\n .description(\"Fetch a page and save as Markdown\")\n .argument(\"<url>\", \"URL to fetch\")\n .option(\"--dest <path>\", \"Destination directory\")\n .option(\"--headed\", \"Run browser in headed mode\")\n .option(\"--selector <css>\", \"CSS selector to extract\")\n .option(\"--timeout <seconds>\", \"Timeout in seconds\", parseInt)\n .option(\"--tag <name>\", \"Add tag (repeatable)\", (val: string, acc: string[]) => {\n acc.push(val);\n return acc;\n }, [] as string[])\n .option(\n \"--wait-until <event>\",\n \"Wait condition: load, domcontentloaded, networkidle\",\n )\n .option(\"--skip-session\", \"Do not use saved session\")\n .option(\"--dry-run\", \"Output to stdout instead of saving\")\n .option(\"--no-block-images\", \"Do not block image requests\")\n .option(\"--no-block-fonts\", \"Do not block font requests\")\n .option(\"--no-block-media\", \"Do not block 
media requests\")\n .option(\"--raw\", \"Convert full page HTML without Readability extraction\")\n .action(async (url: string, options: Record<string, unknown>) => {\n try {\n const configPath = existsSync(CONFIG_PATH) ? CONFIG_PATH : undefined;\n const config = resolveConfig(\n {\n dest: options.dest as string | undefined,\n tags: options.tag as string[] | undefined,\n timeout: options.timeout as number | undefined,\n waitUntil: options.waitUntil as WaitUntilOption | undefined,\n headed: options.headed as boolean | undefined,\n selector: options.selector as string | undefined,\n noSession: options.skipSession as boolean | undefined,\n dryRun: options.dryRun as boolean | undefined,\n blockImages: options.blockImages as boolean | undefined,\n blockFonts: options.blockFonts as boolean | undefined,\n blockMedia: options.blockMedia as boolean | undefined,\n raw: options.raw as boolean | undefined,\n },\n configPath,\n );\n\n if (config.raw && config.selector) {\n throw new Error(\"--raw and --selector cannot be used together.\");\n }\n\n // Validate dest directory exists\n if (!config.dryRun && !existsSync(config.dest)) {\n throw new Error(`Destination directory does not exist: ${config.dest}`);\n }\n\n const sessionsDir = getSessionDir();\n const fetchResult = await fetchPage(url, config, sessionsDir);\n\n let markdown: string;\n let metadata: Metadata;\n\n if (fetchResult.kind === \"pdf\") {\n if (config.selector) {\n throw new Error(\"--selector cannot be used with PDF URLs.\");\n }\n if (config.raw) {\n throw new Error(\"--raw cannot be used with PDF URLs.\");\n }\n const { convertPdfToMarkdown } = await import(\"./pdf-converter.js\");\n const pdfResult = await convertPdfToMarkdown(\n fetchResult.pdfBuffer,\n fetchResult.finalUrl,\n );\n markdown = pdfResult.markdown;\n metadata = pdfResult.metadata;\n } else if (config.selector) {\n // --selector mode: skip Readability, extract metadata from full page\n metadata = extractMetadata(fetchResult.fullHtml, 
fetchResult.finalUrl);\n markdown = convertToMarkdown(fetchResult.html);\n } else if (config.raw) {\n // --raw mode: skip Readability, convert full page HTML directly\n metadata = extractMetadata(fetchResult.fullHtml, fetchResult.finalUrl);\n markdown = convertToMarkdown(fetchResult.fullHtml);\n } else {\n const result = extract(fetchResult.html, fetchResult.finalUrl);\n metadata = result.metadata;\n markdown = convertToMarkdown(result.content);\n }\n\n if (config.dryRun) {\n const frontmatter = buildFrontmatter(metadata, config.tags);\n process.stdout.write(`${frontmatter}\\n\\n${markdown}\\n`);\n } else {\n const filePath = writeMarkdownFile(\n config.dest,\n metadata,\n markdown,\n config.tags,\n );\n console.error(`Saved: ${filePath}`);\n }\n } catch (error) {\n const message = error instanceof Error ? error.message : String(error);\n console.error(`Error: ${message}`);\n process.exit(1);\n }\n });\n\nprogram\n .command(\"login\")\n .description(\"Login to a site and save session\")\n .argument(\"<url>\", \"URL to login\")\n .option(\"--timeout <seconds>\", \"Login timeout in seconds\", parseInt)\n .action(async (url: string, options: Record<string, unknown>) => {\n const { chromium } = await import(\"playwright\");\n const sessionsDir = getSessionDir();\n ensureSessionDir(sessionsDir);\n\n const timeoutSec = (options.timeout as number | undefined) ?? 300;\n const browser = await chromium.launch({ headless: false });\n\n try {\n const context = await browser.newContext();\n const page = await context.newPage();\n\n await page.goto(url, { waitUntil: \"networkidle\", timeout: timeoutSec * 1000 });\n\n console.error(\"Browser opened. 
Log in manually, then press Enter here to save session.\");\n\n process.stdin.resume();\n await once(process.stdin, \"data\");\n process.stdin.pause();\n process.stdin.unref();\n\n const sessionPath = getSessionPath(url, sessionsDir);\n await context.storageState({ path: sessionPath });\n console.error(`Session saved: ${sessionPath}`);\n } catch (error) {\n const message = error instanceof Error ? error.message : String(error);\n console.error(`Error: ${message}`);\n process.exit(1);\n } finally {\n await browser.close();\n }\n });\n\nprogram.parse();\n","import { readFileSync } from \"node:fs\";\nimport { homedir } from \"node:os\";\nimport { resolve } from \"node:path\";\nimport yaml from \"js-yaml\";\nimport type { ResolvedConfig, WaitUntilOption } from \"./types.js\";\n\nconst DEFAULT_TIMEOUT = 30;\nconst DEFAULT_WAIT_UNTIL: WaitUntilOption = \"networkidle\";\nconst REQUIRED_TAG = \"clippings\";\n\ninterface FileConfig {\n dest?: string;\n tags?: string[];\n timeout?: number;\n waitUntil?: WaitUntilOption;\n}\n\ninterface CliOptions {\n dest?: string;\n tags?: string[];\n timeout?: number;\n waitUntil?: WaitUntilOption;\n headed?: boolean;\n selector?: string;\n noSession?: boolean;\n dryRun?: boolean;\n blockImages?: boolean;\n blockFonts?: boolean;\n blockMedia?: boolean;\n raw?: boolean;\n}\n\nfunction expandTilde(filePath: string): string {\n if (filePath.startsWith(\"~/\")) {\n return resolve(homedir(), filePath.slice(2));\n }\n return filePath;\n}\n\nconst VALID_WAIT_UNTIL: readonly string[] = [\"load\", \"domcontentloaded\", \"networkidle\"];\n\nfunction validateWaitUntil(value: string): WaitUntilOption {\n if (!VALID_WAIT_UNTIL.includes(value)) {\n throw new Error(\n `Invalid waitUntil value: \"${value}\". 
Must be one of: ${VALID_WAIT_UNTIL.join(\", \")}`,\n );\n }\n return value as WaitUntilOption;\n}\n\nfunction loadConfigFile(configPath: string): FileConfig {\n const content = readFileSync(configPath, \"utf-8\");\n const parsed = yaml.load(content);\n if (parsed === null || typeof parsed !== \"object\") {\n throw new Error(`Invalid config file: ${configPath}`);\n }\n const config = parsed as Record<string, unknown>;\n\n if (config.timeout !== undefined && typeof config.timeout !== \"number\") {\n throw new Error(`Invalid timeout in config file: expected number, got ${typeof config.timeout}`);\n }\n if (config.dest !== undefined && typeof config.dest !== \"string\") {\n throw new Error(`Invalid dest in config file: expected string, got ${typeof config.dest}`);\n }\n if (config.waitUntil !== undefined) {\n if (typeof config.waitUntil !== \"string\") {\n throw new Error(`Invalid waitUntil in config file: expected string, got ${typeof config.waitUntil}`);\n }\n validateWaitUntil(config.waitUntil);\n }\n if (config.tags !== undefined) {\n if (!Array.isArray(config.tags) || !config.tags.every((t: unknown) => typeof t === \"string\")) {\n throw new Error(\"Invalid tags in config file: expected array of strings\");\n }\n }\n\n return config as FileConfig;\n}\n\nexport function resolveConfig(\n cliOptions: CliOptions,\n configPath: string | undefined,\n): ResolvedConfig {\n // Layer 1: Config file\n let fileConfig: FileConfig = {};\n if (configPath) {\n fileConfig = loadConfigFile(configPath);\n }\n\n // Layer 2: Environment variables\n const envDest = process.env.VAULT_FETCH_DEST;\n const envTimeout = process.env.VAULT_FETCH_TIMEOUT;\n\n // Resolve each field: CLI > env > file > default\n const dest = cliOptions.dest ?? envDest ?? fileConfig.dest;\n if (dest === undefined) {\n throw new Error(\n \"dest is required. 
Set via --dest, VAULT_FETCH_DEST, or config file.\",\n );\n }\n\n let timeout: number;\n if (cliOptions.timeout !== undefined) {\n timeout = cliOptions.timeout;\n } else if (envTimeout !== undefined) {\n const parsed = Number(envTimeout);\n if (Number.isNaN(parsed)) {\n throw new Error(`Invalid VAULT_FETCH_TIMEOUT value: ${envTimeout}`);\n }\n timeout = parsed;\n } else {\n timeout = fileConfig.timeout ?? DEFAULT_TIMEOUT;\n }\n\n const rawWaitUntil = cliOptions.waitUntil ?? fileConfig.waitUntil ?? DEFAULT_WAIT_UNTIL;\n const waitUntil = validateWaitUntil(rawWaitUntil);\n\n // Merge tags: file tags + CLI tags + always clippings\n const allTags = [\n ...(fileConfig.tags ?? []),\n ...(cliOptions.tags ?? []),\n REQUIRED_TAG,\n ];\n const tags = [...new Set(allTags)];\n\n return {\n dest: expandTilde(dest),\n tags,\n timeout,\n waitUntil,\n headed: cliOptions.headed ?? false,\n selector: cliOptions.selector ?? null,\n noSession: cliOptions.noSession ?? false,\n dryRun: cliOptions.dryRun ?? false,\n blockImages: cliOptions.blockImages ?? true,\n blockFonts: cliOptions.blockFonts ?? true,\n blockMedia: cliOptions.blockMedia ?? true,\n raw: cliOptions.raw ?? 
false,\n };\n}\n","import { chromium, type BrowserContext } from \"playwright\";\nimport type { FetchResult, ResolvedConfig } from \"./types.js\";\nimport { getSessionPath, sessionExists } from \"./session.js\";\n\nexport const CHROME_USER_AGENT =\n \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) \" +\n \"AppleWebKit/537.36 (KHTML, like Gecko) \" +\n \"Chrome/134.0.0.0 Safari/537.36\";\n\ninterface BlockingOptions {\n blockImages: boolean;\n blockFonts: boolean;\n blockMedia: boolean;\n}\n\nexport function isPdfContentType(contentType: string): boolean {\n return contentType.toLowerCase().includes(\"application/pdf\");\n}\n\nexport function buildBlockedResourceTypes(options: BlockingOptions): Set<string> {\n const blocked = new Set<string>();\n if (options.blockImages) blocked.add(\"image\");\n if (options.blockFonts) blocked.add(\"font\");\n if (options.blockMedia) blocked.add(\"media\");\n return blocked;\n}\n\nconst PDF_MAGIC_BYTES = \"%PDF\";\n\nexport function validatePdfBuffer(pdfBuffer: Buffer, sourceUrl: string): void {\n if (pdfBuffer.length === 0) {\n throw new Error(`Empty PDF response received from ${sourceUrl}`);\n }\n const header = pdfBuffer.subarray(0, PDF_MAGIC_BYTES.length).toString(\"ascii\");\n if (!header.startsWith(PDF_MAGIC_BYTES)) {\n throw new Error(\n `Response Content-Type is application/pdf but body is not valid PDF data from ${sourceUrl}`,\n );\n }\n}\n\nasync function downloadPdf(\n context: BrowserContext,\n url: string,\n timeoutMs: number,\n): Promise<{ pdfBuffer: Buffer; finalUrl: string }> {\n const apiResponse = await context.request.get(url, { timeout: timeoutMs });\n const status = apiResponse.status();\n if (status >= 400) {\n throw new Error(`HTTP ${status} received when downloading PDF from ${url}`);\n }\n const pdfBuffer = Buffer.from(await apiResponse.body());\n const finalUrl = apiResponse.url();\n validatePdfBuffer(pdfBuffer, finalUrl);\n return { pdfBuffer, finalUrl };\n}\n\nexport async function fetchPage(\n url: string,\n 
config: ResolvedConfig,\n sessionsDir: string,\n): Promise<FetchResult> {\n const browser = await chromium.launch({\n headless: !config.headed,\n });\n\n try {\n const contextOptions: Parameters<typeof browser.newContext>[0] = {\n userAgent: CHROME_USER_AGENT,\n };\n\n // Load session if available and not disabled\n if (!config.noSession && sessionExists(url, sessionsDir)) {\n const sessionPath = getSessionPath(url, sessionsDir);\n contextOptions.storageState = sessionPath;\n }\n\n const context: BrowserContext = await browser.newContext(contextOptions);\n const page = await context.newPage();\n\n // Block specified resource types for faster loading\n const blockedTypes = buildBlockedResourceTypes(config);\n if (blockedTypes.size > 0) {\n await page.route(\"**/*\", async (route) => {\n if (blockedTypes.has(route.request().resourceType())) {\n await route.abort();\n } else {\n await route.continue();\n }\n });\n }\n\n const timeoutMs = config.timeout * 1000;\n\n // page.goto throws \"Download is starting\" when the server returns\n // Content-Disposition: attachment (common for PDF downloads).\n // Catch this and download the PDF via the context's HTTP client.\n let response;\n try {\n response = await page.goto(url, {\n waitUntil: config.waitUntil,\n timeout: timeoutMs,\n });\n } catch (error) {\n if (\n error instanceof Error &&\n error.message.includes(\"Download is starting\")\n ) {\n const result = await downloadPdf(context, url, timeoutMs);\n await context.close();\n return { kind: \"pdf\", pdfBuffer: result.pdfBuffer, url, finalUrl: result.finalUrl };\n }\n throw error;\n }\n\n if (!response) {\n throw new Error(`No response received from ${url}`);\n }\n\n const status = response.status();\n if (status >= 400) {\n throw new Error(`HTTP ${status} received from ${response.url()}`);\n }\n\n const finalUrl = response.url();\n const contentType = response.headers()[\"content-type\"] ?? 
\"\";\n\n // Inline PDF (Content-Disposition: inline or absent).\n // Try response.body() first; fall back to context.request if the\n // browser returned its PDF viewer HTML instead of the actual bytes.\n if (isPdfContentType(contentType)) {\n const body = await response.body();\n try {\n validatePdfBuffer(body, finalUrl);\n await context.close();\n return { kind: \"pdf\", pdfBuffer: body, url, finalUrl };\n } catch {\n // response.body() returned PDF viewer HTML; re-download via API\n const result = await downloadPdf(context, finalUrl, timeoutMs);\n await context.close();\n return { kind: \"pdf\", pdfBuffer: result.pdfBuffer, url, finalUrl: result.finalUrl };\n }\n }\n\n const fullHtml = await page.content();\n let html: string;\n\n if (config.selector) {\n const element = await page.$(config.selector);\n if (!element) {\n throw new Error(`Selector not found: ${config.selector}`);\n }\n html = await element.innerHTML();\n } else {\n html = fullHtml;\n }\n\n await context.close();\n\n return { kind: \"html\", html, fullHtml, url, finalUrl };\n } finally {\n await browser.close();\n }\n}\n","import { existsSync, mkdirSync } from \"node:fs\";\nimport { join } from \"node:path\";\nimport { homedir } from \"node:os\";\n\nconst CONFIG_DIR = join(homedir(), \".config\", \"vault-fetch\");\nconst SESSIONS_DIR = join(CONFIG_DIR, \"sessions\");\n\nexport function getSessionDir(): string {\n return SESSIONS_DIR;\n}\n\nfunction extractDomain(url: string): string {\n const parsed = new URL(url);\n return parsed.hostname ?? 
\"\";\n}\n\nexport function getSessionPath(url: string, sessionsDir: string): string {\n const domain = extractDomain(url);\n return join(sessionsDir, `${domain}.json`);\n}\n\nexport function sessionExists(url: string, sessionsDir: string): boolean {\n const sessionPath = getSessionPath(url, sessionsDir);\n return existsSync(sessionPath);\n}\n\nexport function ensureSessionDir(sessionsDir: string): void {\n if (!existsSync(sessionsDir)) {\n mkdirSync(sessionsDir, { recursive: true });\n }\n}\n","import { Readability } from \"@mozilla/readability\";\nimport { JSDOM } from \"jsdom\";\nimport type { Metadata } from \"./types.js\";\n\nfunction getMetaContent(doc: Document, selector: string): string | null {\n const el = doc.querySelector(selector);\n return el?.getAttribute(\"content\") ?? null;\n}\n\nfunction formatAuthor(raw: string): string {\n return `[[${raw.trim()}]]`;\n}\n\nfunction extractPublishedDate(doc: Document): string | null {\n const published =\n getMetaContent(doc, 'meta[property=\"article:published_time\"]') ??\n getMetaContent(doc, 'meta[name=\"datePublished\"]');\n\n if (!published) {\n const jsonLd = doc.querySelector('script[type=\"application/ld+json\"]');\n if (jsonLd?.textContent) {\n try {\n const data = JSON.parse(jsonLd.textContent) as Record<string, unknown>;\n if (typeof data.datePublished === \"string\") {\n return data.datePublished.split(\"T\")[0];\n }\n } catch {\n // JSON-LD parse failed\n }\n }\n return null;\n }\n\n return published.split(\"T\")[0];\n}\n\nfunction extractAuthors(\n doc: Document,\n readabilityByline: string | null,\n): string[] {\n const articleAuthors = doc.querySelectorAll('meta[property=\"article:author\"]');\n if (articleAuthors.length > 0) {\n return Array.from(articleAuthors)\n .map((el) => el.getAttribute(\"content\"))\n .filter((v): v is string => v !== null)\n .map(formatAuthor);\n }\n\n const ogAuthor = getMetaContent(doc, 'meta[property=\"og:author\"]');\n if (ogAuthor) {\n return 
[formatAuthor(ogAuthor)];\n }\n\n if (readabilityByline) {\n return [formatAuthor(readabilityByline)];\n }\n\n return [];\n}\n\nexport interface ExtractResult {\n metadata: Metadata;\n content: string;\n}\n\ninterface ReadabilityArticle {\n title: string;\n byline: string | null;\n excerpt: string;\n content: string;\n}\n\nfunction buildMetadata(\n doc: Document,\n article: ReadabilityArticle | null,\n finalUrl: string,\n): Metadata {\n const title = article?.title ?? doc.title;\n const authors = extractAuthors(doc, article?.byline ?? null);\n const published = extractPublishedDate(doc);\n\n const description =\n getMetaContent(doc, 'meta[property=\"og:description\"]') ??\n getMetaContent(doc, 'meta[name=\"description\"]') ??\n (article?.excerpt ?? null);\n\n const today = new Date().toISOString().split(\"T\")[0];\n\n return {\n title,\n source: finalUrl,\n author: authors,\n published,\n created: today,\n description,\n };\n}\n\nfunction parseWithReadability(html: string, url: string): ReadabilityArticle | null {\n const dom = new JSDOM(html, { url });\n const reader = new Readability(dom.window.document);\n return reader.parse() as ReadabilityArticle | null;\n}\n\nexport function extract(html: string, finalUrl: string): ExtractResult {\n const metaDom = new JSDOM(html, { url: finalUrl });\n const doc = metaDom.window.document;\n\n const article = parseWithReadability(html, finalUrl);\n\n if (!article) {\n throw new Error(\n \"Readability failed to extract content from the page. 
\" +\n \"Try --raw to convert the full page, or --selector <css> to target specific content.\",\n );\n }\n\n if (!article.content) {\n throw new Error(\"Readability returned empty content for the page\");\n }\n\n return {\n metadata: buildMetadata(doc, article, finalUrl),\n content: article.content,\n };\n}\n\nexport function extractMetadata(html: string, finalUrl: string): Metadata {\n const metaDom = new JSDOM(html, { url: finalUrl });\n const doc = metaDom.window.document;\n\n const article = parseWithReadability(html, finalUrl);\n\n return buildMetadata(doc, article, finalUrl);\n}\n","import TurndownService from \"turndown\";\n\nexport function convertToMarkdown(html: string): string {\n const turndown = new TurndownService({\n headingStyle: \"atx\",\n codeBlockStyle: \"fenced\",\n bulletListMarker: \"-\",\n });\n\n return turndown.turndown(html);\n}\n","import { writeFileSync } from \"node:fs\";\nimport { join } from \"node:path\";\nimport yaml from \"js-yaml\";\nimport type { Metadata } from \"./types.js\";\n\nconst UNSAFE_CHARS = /[/\\\\:*?\"<>|]/g;\nconst CONTROL_CHARS = /[\\x00-\\x1f\\x7f]/g;\nconst MAX_FILENAME_LENGTH = 200;\n\nexport function sanitizeFilename(title: string): string {\n const sanitized = title\n .replace(CONTROL_CHARS, \"\")\n .replace(UNSAFE_CHARS, \"\")\n .replace(/\\s+/g, \" \")\n .trim();\n const base = sanitized.slice(0, MAX_FILENAME_LENGTH) || \"Untitled\";\n return `${base}.md`;\n}\n\nexport function buildFrontmatter(metadata: Metadata, tags: string[]): string {\n const data: Record<string, unknown> = {\n title: metadata.title,\n source: metadata.source,\n };\n\n if (metadata.author.length > 0) {\n data.author = metadata.author;\n }\n\n if (metadata.published) {\n data.published = metadata.published;\n }\n\n data.created = metadata.created;\n\n if (metadata.description) {\n data.description = metadata.description;\n }\n\n data.tags = tags;\n\n const yamlStr = yaml.dump(data, {\n quotingType: '\"',\n forceQuotes: false,\n lineWidth: 
-1,\n sortKeys: false,\n });\n\n return `---\\n${yamlStr}---`;\n}\n\nexport function writeMarkdownFile(\n dest: string,\n metadata: Metadata,\n markdownContent: string,\n tags: string[],\n): string {\n const filename = sanitizeFilename(metadata.title);\n const filePath = join(dest, filename);\n const frontmatter = buildFrontmatter(metadata, tags);\n const fullContent = `${frontmatter}\\n\\n${markdownContent}\\n`;\n\n writeFileSync(filePath, fullContent, \"utf-8\");\n\n return filePath;\n}\n"],"mappings":";;;AAAA,SAAS,eAAe;AACxB,SAAS,YAAY;AACrB,SAAS,cAAAA,mBAAkB;AAC3B,SAAS,WAAAC,gBAAe;AACxB,SAAS,QAAAC,aAAY;;;ACJrB,SAAS,oBAAoB;AAC7B,SAAS,eAAe;AACxB,SAAS,eAAe;AACxB,OAAO,UAAU;AAGjB,IAAM,kBAAkB;AACxB,IAAM,qBAAsC;AAC5C,IAAM,eAAe;AAwBrB,SAAS,YAAY,UAA0B;AAC7C,MAAI,SAAS,WAAW,IAAI,GAAG;AAC7B,WAAO,QAAQ,QAAQ,GAAG,SAAS,MAAM,CAAC,CAAC;AAAA,EAC7C;AACA,SAAO;AACT;AAEA,IAAM,mBAAsC,CAAC,QAAQ,oBAAoB,aAAa;AAEtF,SAAS,kBAAkB,OAAgC;AACzD,MAAI,CAAC,iBAAiB,SAAS,KAAK,GAAG;AACrC,UAAM,IAAI;AAAA,MACR,6BAA6B,KAAK,sBAAsB,iBAAiB,KAAK,IAAI,CAAC;AAAA,IACrF;AAAA,EACF;AACA,SAAO;AACT;AAEA,SAAS,eAAe,YAAgC;AACtD,QAAM,UAAU,aAAa,YAAY,OAAO;AAChD,QAAM,SAAS,KAAK,KAAK,OAAO;AAChC,MAAI,WAAW,QAAQ,OAAO,WAAW,UAAU;AACjD,UAAM,IAAI,MAAM,wBAAwB,UAAU,EAAE;AAAA,EACtD;AACA,QAAM,SAAS;AAEf,MAAI,OAAO,YAAY,UAAa,OAAO,OAAO,YAAY,UAAU;AACtE,UAAM,IAAI,MAAM,wDAAwD,OAAO,OAAO,OAAO,EAAE;AAAA,EACjG;AACA,MAAI,OAAO,SAAS,UAAa,OAAO,OAAO,SAAS,UAAU;AAChE,UAAM,IAAI,MAAM,qDAAqD,OAAO,OAAO,IAAI,EAAE;AAAA,EAC3F;AACA,MAAI,OAAO,cAAc,QAAW;AAClC,QAAI,OAAO,OAAO,cAAc,UAAU;AACxC,YAAM,IAAI,MAAM,0DAA0D,OAAO,OAAO,SAAS,EAAE;AAAA,IACrG;AACA,sBAAkB,OAAO,SAAS;AAAA,EACpC;AACA,MAAI,OAAO,SAAS,QAAW;AAC7B,QAAI,CAAC,MAAM,QAAQ,OAAO,IAAI,KAAK,CAAC,OAAO,KAAK,MAAM,CAAC,MAAe,OAAO,MAAM,QAAQ,GAAG;AAC5F,YAAM,IAAI,MAAM,wDAAwD;AAAA,IAC1E;AAAA,EACF;AAEA,SAAO;AACT;AAEO,SAAS,cACd,YACA,YACgB;AAEhB,MAAI,aAAyB,CAAC;AAC9B,MAAI,YAAY;AACd,iBAAa,eAAe,UAAU;AAAA,EACxC;AAGA,QAAM,UAAU,QAAQ,IAAI;AAC5B,QAAM,aAAa,QAAQ,IAAI;AAG/B,QAAM,OAAO,WAAW,QAAQ,WAAW,WAAW;AACtD,MAAI,SAAS,QAAW;AACtB,UAAM,IAA
I;AAAA,MACR;AAAA,IACF;AAAA,EACF;AAEA,MAAI;AACJ,MAAI,WAAW,YAAY,QAAW;AACpC,cAAU,WAAW;AAAA,EACvB,WAAW,eAAe,QAAW;AACnC,UAAM,SAAS,OAAO,UAAU;AAChC,QAAI,OAAO,MAAM,MAAM,GAAG;AACxB,YAAM,IAAI,MAAM,sCAAsC,UAAU,EAAE;AAAA,IACpE;AACA,cAAU;AAAA,EACZ,OAAO;AACL,cAAU,WAAW,WAAW;AAAA,EAClC;AAEA,QAAM,eAAe,WAAW,aAAa,WAAW,aAAa;AACrE,QAAM,YAAY,kBAAkB,YAAY;AAGhD,QAAM,UAAU;AAAA,IACd,GAAI,WAAW,QAAQ,CAAC;AAAA,IACxB,GAAI,WAAW,QAAQ,CAAC;AAAA,IACxB;AAAA,EACF;AACA,QAAM,OAAO,CAAC,GAAG,IAAI,IAAI,OAAO,CAAC;AAEjC,SAAO;AAAA,IACL,MAAM,YAAY,IAAI;AAAA,IACtB;AAAA,IACA;AAAA,IACA;AAAA,IACA,QAAQ,WAAW,UAAU;AAAA,IAC7B,UAAU,WAAW,YAAY;AAAA,IACjC,WAAW,WAAW,aAAa;AAAA,IACnC,QAAQ,WAAW,UAAU;AAAA,IAC7B,aAAa,WAAW,eAAe;AAAA,IACvC,YAAY,WAAW,cAAc;AAAA,IACrC,YAAY,WAAW,cAAc;AAAA,IACrC,KAAK,WAAW,OAAO;AAAA,EACzB;AACF;;;AC3IA,SAAS,gBAAqC;;;ACA9C,SAAS,YAAY,iBAAiB;AACtC,SAAS,YAAY;AACrB,SAAS,WAAAC,gBAAe;AAExB,IAAM,aAAa,KAAKA,SAAQ,GAAG,WAAW,aAAa;AAC3D,IAAM,eAAe,KAAK,YAAY,UAAU;AAEzC,SAAS,gBAAwB;AACtC,SAAO;AACT;AAEA,SAAS,cAAc,KAAqB;AAC1C,QAAM,SAAS,IAAI,IAAI,GAAG;AAC1B,SAAO,OAAO,YAAY;AAC5B;AAEO,SAAS,eAAe,KAAa,aAA6B;AACvE,QAAM,SAAS,cAAc,GAAG;AAChC,SAAO,KAAK,aAAa,GAAG,MAAM,OAAO;AAC3C;AAEO,SAAS,cAAc,KAAa,aAA8B;AACvE,QAAM,cAAc,eAAe,KAAK,WAAW;AACnD,SAAO,WAAW,WAAW;AAC/B;AAEO,SAAS,iBAAiB,aAA2B;AAC1D,MAAI,CAAC,WAAW,WAAW,GAAG;AAC5B,cAAU,aAAa,EAAE,WAAW,KAAK,CAAC;AAAA,EAC5C;AACF;;;AD1BO,IAAM,oBACX;AAUK,SAAS,iBAAiB,aAA8B;AAC7D,SAAO,YAAY,YAAY,EAAE,SAAS,iBAAiB;AAC7D;AAEO,SAAS,0BAA0B,SAAuC;AAC/E,QAAM,UAAU,oBAAI,IAAY;AAChC,MAAI,QAAQ,YAAa,SAAQ,IAAI,OAAO;AAC5C,MAAI,QAAQ,WAAY,SAAQ,IAAI,MAAM;AAC1C,MAAI,QAAQ,WAAY,SAAQ,IAAI,OAAO;AAC3C,SAAO;AACT;AAEA,IAAM,kBAAkB;AAEjB,SAAS,kBAAkB,WAAmB,WAAyB;AAC5E,MAAI,UAAU,WAAW,GAAG;AAC1B,UAAM,IAAI,MAAM,oCAAoC,SAAS,EAAE;AAAA,EACjE;AACA,QAAM,SAAS,UAAU,SAAS,GAAG,gBAAgB,MAAM,EAAE,SAAS,OAAO;AAC7E,MAAI,CAAC,OAAO,WAAW,eAAe,GAAG;AACvC,UAAM,IAAI;AAAA,MACR,gFAAgF,SAAS;AAAA,IAC3F;AAAA,EACF;AACF;AAEA,eAAe,YACb,SACA,KACA,WACkD;AAClD,QAAM,cAAc,MAAM,QAAQ,QAAQ,IAAI,KAAK,EAAE,SAAS,UAAU,CAAC;AACzE,QAAM,SAAS,YAAY,OAAO;AAClC,MAAI,UAAU,KAAK;A
ACjB,UAAM,IAAI,MAAM,QAAQ,MAAM,uCAAuC,GAAG,EAAE;AAAA,EAC5E;AACA,QAAM,YAAY,OAAO,KAAK,MAAM,YAAY,KAAK,CAAC;AACtD,QAAM,WAAW,YAAY,IAAI;AACjC,oBAAkB,WAAW,QAAQ;AACrC,SAAO,EAAE,WAAW,SAAS;AAC/B;AAEA,eAAsB,UACpB,KACA,QACA,aACsB;AACtB,QAAM,UAAU,MAAM,SAAS,OAAO;AAAA,IACpC,UAAU,CAAC,OAAO;AAAA,EACpB,CAAC;AAED,MAAI;AACF,UAAM,iBAA2D;AAAA,MAC/D,WAAW;AAAA,IACb;AAGA,QAAI,CAAC,OAAO,aAAa,cAAc,KAAK,WAAW,GAAG;AACxD,YAAM,cAAc,eAAe,KAAK,WAAW;AACnD,qBAAe,eAAe;AAAA,IAChC;AAEA,UAAM,UAA0B,MAAM,QAAQ,WAAW,cAAc;AACvE,UAAM,OAAO,MAAM,QAAQ,QAAQ;AAGnC,UAAM,eAAe,0BAA0B,MAAM;AACrD,QAAI,aAAa,OAAO,GAAG;AACzB,YAAM,KAAK,MAAM,QAAQ,OAAO,UAAU;AACxC,YAAI,aAAa,IAAI,MAAM,QAAQ,EAAE,aAAa,CAAC,GAAG;AACpD,gBAAM,MAAM,MAAM;AAAA,QACpB,OAAO;AACL,gBAAM,MAAM,SAAS;AAAA,QACvB;AAAA,MACF,CAAC;AAAA,IACH;AAEA,UAAM,YAAY,OAAO,UAAU;AAKnC,QAAI;AACJ,QAAI;AACF,iBAAW,MAAM,KAAK,KAAK,KAAK;AAAA,QAC9B,WAAW,OAAO;AAAA,QAClB,SAAS;AAAA,MACX,CAAC;AAAA,IACH,SAAS,OAAO;AACd,UACE,iBAAiB,SACjB,MAAM,QAAQ,SAAS,sBAAsB,GAC7C;AACA,cAAM,SAAS,MAAM,YAAY,SAAS,KAAK,SAAS;AACxD,cAAM,QAAQ,MAAM;AACpB,eAAO,EAAE,MAAM,OAAO,WAAW,OAAO,WAAW,KAAK,UAAU,OAAO,SAAS;AAAA,MACpF;AACA,YAAM;AAAA,IACR;AAEA,QAAI,CAAC,UAAU;AACb,YAAM,IAAI,MAAM,6BAA6B,GAAG,EAAE;AAAA,IACpD;AAEA,UAAM,SAAS,SAAS,OAAO;AAC/B,QAAI,UAAU,KAAK;AACjB,YAAM,IAAI,MAAM,QAAQ,MAAM,kBAAkB,SAAS,IAAI,CAAC,EAAE;AAAA,IAClE;AAEA,UAAM,WAAW,SAAS,IAAI;AAC9B,UAAM,cAAc,SAAS,QAAQ,EAAE,cAAc,KAAK;AAK1D,QAAI,iBAAiB,WAAW,GAAG;AACjC,YAAM,OAAO,MAAM,SAAS,KAAK;AACjC,UAAI;AACF,0BAAkB,MAAM,QAAQ;AAChC,cAAM,QAAQ,MAAM;AACpB,eAAO,EAAE,MAAM,OAAO,WAAW,MAAM,KAAK,SAAS;AAAA,MACvD,QAAQ;AAEN,cAAM,SAAS,MAAM,YAAY,SAAS,UAAU,SAAS;AAC7D,cAAM,QAAQ,MAAM;AACpB,eAAO,EAAE,MAAM,OAAO,WAAW,OAAO,WAAW,KAAK,UAAU,OAAO,SAAS;AAAA,MACpF;AAAA,IACF;AAEA,UAAM,WAAW,MAAM,KAAK,QAAQ;AACpC,QAAI;AAEJ,QAAI,OAAO,UAAU;AACnB,YAAM,UAAU,MAAM,KAAK,EAAE,OAAO,QAAQ;AAC5C,UAAI,CAAC,SAAS;AACZ,cAAM,IAAI,MAAM,uBAAuB,OAAO,QAAQ,EAAE;AAAA,MAC1D;AACA,aAAO,MAAM,QAAQ,UAAU;AAAA,IACjC,OAAO;AACL,aAAO;AAAA,IACT;AAEA,UAAM,QAAQ,MAAM;AAEpB,WAAO,EAAE,MAAM,QAAQ,MAAM,UAAU,KAAK,SAAS;AAAA,EACvD,UAAE;AACA,UAAM,Q
AAQ,MAAM;AAAA,EACtB;AACF;;;AEnKA,SAAS,mBAAmB;AAC5B,SAAS,aAAa;AAGtB,SAAS,eAAe,KAAe,UAAiC;AACtE,QAAM,KAAK,IAAI,cAAc,QAAQ;AACrC,SAAO,IAAI,aAAa,SAAS,KAAK;AACxC;AAEA,SAAS,aAAa,KAAqB;AACzC,SAAO,KAAK,IAAI,KAAK,CAAC;AACxB;AAEA,SAAS,qBAAqB,KAA8B;AAC1D,QAAM,YACJ,eAAe,KAAK,yCAAyC,KAC7D,eAAe,KAAK,4BAA4B;AAElD,MAAI,CAAC,WAAW;AACd,UAAM,SAAS,IAAI,cAAc,oCAAoC;AACrE,QAAI,QAAQ,aAAa;AACvB,UAAI;AACF,cAAM,OAAO,KAAK,MAAM,OAAO,WAAW;AAC1C,YAAI,OAAO,KAAK,kBAAkB,UAAU;AAC1C,iBAAO,KAAK,cAAc,MAAM,GAAG,EAAE,CAAC;AAAA,QACxC;AAAA,MACF,QAAQ;AAAA,MAER;AAAA,IACF;AACA,WAAO;AAAA,EACT;AAEA,SAAO,UAAU,MAAM,GAAG,EAAE,CAAC;AAC/B;AAEA,SAAS,eACP,KACA,mBACU;AACV,QAAM,iBAAiB,IAAI,iBAAiB,iCAAiC;AAC7E,MAAI,eAAe,SAAS,GAAG;AAC7B,WAAO,MAAM,KAAK,cAAc,EAC7B,IAAI,CAAC,OAAO,GAAG,aAAa,SAAS,CAAC,EACtC,OAAO,CAAC,MAAmB,MAAM,IAAI,EACrC,IAAI,YAAY;AAAA,EACrB;AAEA,QAAM,WAAW,eAAe,KAAK,4BAA4B;AACjE,MAAI,UAAU;AACZ,WAAO,CAAC,aAAa,QAAQ,CAAC;AAAA,EAChC;AAEA,MAAI,mBAAmB;AACrB,WAAO,CAAC,aAAa,iBAAiB,CAAC;AAAA,EACzC;AAEA,SAAO,CAAC;AACV;AAcA,SAAS,cACP,KACA,SACA,UACU;AACV,QAAM,QAAQ,SAAS,SAAS,IAAI;AACpC,QAAM,UAAU,eAAe,KAAK,SAAS,UAAU,IAAI;AAC3D,QAAM,YAAY,qBAAqB,GAAG;AAE1C,QAAM,cACJ,eAAe,KAAK,iCAAiC,KACrD,eAAe,KAAK,0BAA0B,MAC7C,SAAS,WAAW;AAEvB,QAAM,SAAQ,oBAAI,KAAK,GAAE,YAAY,EAAE,MAAM,GAAG,EAAE,CAAC;AAEnD,SAAO;AAAA,IACL;AAAA,IACA,QAAQ;AAAA,IACR,QAAQ;AAAA,IACR;AAAA,IACA,SAAS;AAAA,IACT;AAAA,EACF;AACF;AAEA,SAAS,qBAAqB,MAAc,KAAwC;AAClF,QAAM,MAAM,IAAI,MAAM,MAAM,EAAE,IAAI,CAAC;AACnC,QAAM,SAAS,IAAI,YAAY,IAAI,OAAO,QAAQ;AAClD,SAAO,OAAO,MAAM;AACtB;AAEO,SAAS,QAAQ,MAAc,UAAiC;AACrE,QAAM,UAAU,IAAI,MAAM,MAAM,EAAE,KAAK,SAAS,CAAC;AACjD,QAAM,MAAM,QAAQ,OAAO;AAE3B,QAAM,UAAU,qBAAqB,MAAM,QAAQ;AAEnD,MAAI,CAAC,SAAS;AACZ,UAAM,IAAI;AAAA,MACR;AAAA,IAEF;AAAA,EACF;AAEA,MAAI,CAAC,QAAQ,SAAS;AACpB,UAAM,IAAI,MAAM,iDAAiD;AAAA,EACnE;AAEA,SAAO;AAAA,IACL,UAAU,cAAc,KAAK,SAAS,QAAQ;AAAA,IAC9C,SAAS,QAAQ;AAAA,EACnB;AACF;AAEO,SAAS,gBAAgB,MAAc,UAA4B;AACxE,QAAM,UAAU,IAAI,MAAM,MAAM,EAAE,KAAK,SAAS,CAAC;AACjD,QAAM,MAAM,QAAQ,OAAO;AAE3B,QAAM,UAAU,qBAAqB,MAAM,QAAQ;AAEnD,SAAO,cAAc,KAAK,SAAS,QAA
Q;AAC7C;;;ACtIA,OAAO,qBAAqB;AAErB,SAAS,kBAAkB,MAAsB;AACtD,QAAM,WAAW,IAAI,gBAAgB;AAAA,IACnC,cAAc;AAAA,IACd,gBAAgB;AAAA,IAChB,kBAAkB;AAAA,EACpB,CAAC;AAED,SAAO,SAAS,SAAS,IAAI;AAC/B;;;ACVA,SAAS,qBAAqB;AAC9B,SAAS,QAAAC,aAAY;AACrB,OAAOC,WAAU;AAGjB,IAAM,eAAe;AACrB,IAAM,gBAAgB;AACtB,IAAM,sBAAsB;AAErB,SAAS,iBAAiB,OAAuB;AACtD,QAAM,YAAY,MACf,QAAQ,eAAe,EAAE,EACzB,QAAQ,cAAc,EAAE,EACxB,QAAQ,QAAQ,GAAG,EACnB,KAAK;AACR,QAAM,OAAO,UAAU,MAAM,GAAG,mBAAmB,KAAK;AACxD,SAAO,GAAG,IAAI;AAChB;AAEO,SAAS,iBAAiB,UAAoB,MAAwB;AAC3E,QAAM,OAAgC;AAAA,IACpC,OAAO,SAAS;AAAA,IAChB,QAAQ,SAAS;AAAA,EACnB;AAEA,MAAI,SAAS,OAAO,SAAS,GAAG;AAC9B,SAAK,SAAS,SAAS;AAAA,EACzB;AAEA,MAAI,SAAS,WAAW;AACtB,SAAK,YAAY,SAAS;AAAA,EAC5B;AAEA,OAAK,UAAU,SAAS;AAExB,MAAI,SAAS,aAAa;AACxB,SAAK,cAAc,SAAS;AAAA,EAC9B;AAEA,OAAK,OAAO;AAEZ,QAAM,UAAUA,MAAK,KAAK,MAAM;AAAA,IAC9B,aAAa;AAAA,IACb,aAAa;AAAA,IACb,WAAW;AAAA,IACX,UAAU;AAAA,EACZ,CAAC;AAED,SAAO;AAAA,EAAQ,OAAO;AACxB;AAEO,SAAS,kBACd,MACA,UACA,iBACA,MACQ;AACR,QAAM,WAAW,iBAAiB,SAAS,KAAK;AAChD,QAAM,WAAWD,MAAK,MAAM,QAAQ;AACpC,QAAM,cAAc,iBAAiB,UAAU,IAAI;AACnD,QAAM,cAAc,GAAG,WAAW;AAAA;AAAA,EAAO,eAAe;AAAA;AAExD,gBAAc,UAAU,aAAa,OAAO;AAE5C,SAAO;AACT;;;ANhDA,IAAM,cAAcE,MAAKC,SAAQ,GAAG,WAAW,eAAe,aAAa;AAE3E,IAAM,UAAU,IAAI,QAAQ;AAE5B,QACG,KAAK,aAAa,EAClB;AAAA,EACC;AACF,EACC,QAAQ,OAAO;AAElB,QACG,QAAQ,OAAO,EACf,YAAY,mCAAmC,EAC/C,SAAS,SAAS,cAAc,EAChC,OAAO,iBAAiB,uBAAuB,EAC/C,OAAO,YAAY,4BAA4B,EAC/C,OAAO,oBAAoB,yBAAyB,EACpD,OAAO,uBAAuB,sBAAsB,QAAQ,EAC5D,OAAO,gBAAgB,wBAAwB,CAAC,KAAa,QAAkB;AAC9E,MAAI,KAAK,GAAG;AACZ,SAAO;AACT,GAAG,CAAC,CAAa,EAChB;AAAA,EACC;AAAA,EACA;AACF,EACC,OAAO,kBAAkB,0BAA0B,EACnD,OAAO,aAAa,oCAAoC,EACxD,OAAO,qBAAqB,6BAA6B,EACzD,OAAO,oBAAoB,4BAA4B,EACvD,OAAO,oBAAoB,6BAA6B,EACxD,OAAO,SAAS,uDAAuD,EACvE,OAAO,OAAO,KAAa,YAAqC;AAC/D,MAAI;AACF,UAAM,aAAaC,YAAW,WAAW,IAAI,cAAc;AAC3D,UAAM,SAAS;AAAA,MACb;AAAA,QACE,MAAM,QAAQ;AAAA,QACd,MAAM,QAAQ;AAAA,QACd,SAAS,QAAQ;AAAA,QACjB,WAAW,QAAQ;AAAA,QACnB,QAAQ,QAAQ;AAAA,QAChB,UAAU,QAAQ;AAAA,QAClB,WAAW,QAAQ;AAAA,QACnB,QAAQ,QAAQ;AAAA,QAChB,aAAa,QAAQ;AAAA,QACrB,YAA
Y,QAAQ;AAAA,QACpB,YAAY,QAAQ;AAAA,QACpB,KAAK,QAAQ;AAAA,MACf;AAAA,MACA;AAAA,IACF;AAEA,QAAI,OAAO,OAAO,OAAO,UAAU;AACjC,YAAM,IAAI,MAAM,+CAA+C;AAAA,IACjE;AAGA,QAAI,CAAC,OAAO,UAAU,CAACA,YAAW,OAAO,IAAI,GAAG;AAC9C,YAAM,IAAI,MAAM,yCAAyC,OAAO,IAAI,EAAE;AAAA,IACxE;AAEA,UAAM,cAAc,cAAc;AAClC,UAAM,cAAc,MAAM,UAAU,KAAK,QAAQ,WAAW;AAE5D,QAAI;AACJ,QAAI;AAEJ,QAAI,YAAY,SAAS,OAAO;AAC9B,UAAI,OAAO,UAAU;AACnB,cAAM,IAAI,MAAM,0CAA0C;AAAA,MAC5D;AACA,UAAI,OAAO,KAAK;AACd,cAAM,IAAI,MAAM,qCAAqC;AAAA,MACvD;AACA,YAAM,EAAE,qBAAqB,IAAI,MAAM,OAAO,6BAAoB;AAClE,YAAM,YAAY,MAAM;AAAA,QACtB,YAAY;AAAA,QACZ,YAAY;AAAA,MACd;AACA,iBAAW,UAAU;AACrB,iBAAW,UAAU;AAAA,IACvB,WAAW,OAAO,UAAU;AAE1B,iBAAW,gBAAgB,YAAY,UAAU,YAAY,QAAQ;AACrE,iBAAW,kBAAkB,YAAY,IAAI;AAAA,IAC/C,WAAW,OAAO,KAAK;AAErB,iBAAW,gBAAgB,YAAY,UAAU,YAAY,QAAQ;AACrE,iBAAW,kBAAkB,YAAY,QAAQ;AAAA,IACnD,OAAO;AACL,YAAM,SAAS,QAAQ,YAAY,MAAM,YAAY,QAAQ;AAC7D,iBAAW,OAAO;AAClB,iBAAW,kBAAkB,OAAO,OAAO;AAAA,IAC7C;AAEA,QAAI,OAAO,QAAQ;AACjB,YAAM,cAAc,iBAAiB,UAAU,OAAO,IAAI;AAC1D,cAAQ,OAAO,MAAM,GAAG,WAAW;AAAA;AAAA,EAAO,QAAQ;AAAA,CAAI;AAAA,IACxD,OAAO;AACL,YAAM,WAAW;AAAA,QACf,OAAO;AAAA,QACP;AAAA,QACA;AAAA,QACA,OAAO;AAAA,MACT;AACA,cAAQ,MAAM,UAAU,QAAQ,EAAE;AAAA,IACpC;AAAA,EACF,SAAS,OAAO;AACd,UAAM,UAAU,iBAAiB,QAAQ,MAAM,UAAU,OAAO,KAAK;AACrE,YAAQ,MAAM,UAAU,OAAO,EAAE;AACjC,YAAQ,KAAK,CAAC;AAAA,EAChB;AACF,CAAC;AAEH,QACG,QAAQ,OAAO,EACf,YAAY,kCAAkC,EAC9C,SAAS,SAAS,cAAc,EAChC,OAAO,uBAAuB,4BAA4B,QAAQ,EAClE,OAAO,OAAO,KAAa,YAAqC;AAC/D,QAAM,EAAE,UAAAC,UAAS,IAAI,MAAM,OAAO,YAAY;AAC9C,QAAM,cAAc,cAAc;AAClC,mBAAiB,WAAW;AAE5B,QAAM,aAAc,QAAQ,WAAkC;AAC9D,QAAM,UAAU,MAAMA,UAAS,OAAO,EAAE,UAAU,MAAM,CAAC;AAEzD,MAAI;AACF,UAAM,UAAU,MAAM,QAAQ,WAAW;AACzC,UAAM,OAAO,MAAM,QAAQ,QAAQ;AAEnC,UAAM,KAAK,KAAK,KAAK,EAAE,WAAW,eAAe,SAAS,aAAa,IAAK,CAAC;AAE7E,YAAQ,MAAM,yEAAyE;AAEvF,YAAQ,MAAM,OAAO;AACrB,UAAM,KAAK,QAAQ,OAAO,MAAM;AAChC,YAAQ,MAAM,MAAM;AACpB,YAAQ,MAAM,MAAM;AAEpB,UAAM,cAAc,eAAe,KAAK,WAAW;AACnD,UAAM,QAAQ,aAAa,EAAE,MAAM,YAAY,CAAC;AAChD,YAAQ,MAAM,kBAAkB,WAAW,EAAE;AAAA,EAC/C,SAAS,OAAO;AACd,UAAM,UAAU,iBAAiB,QAAQ,MAAM,UAA
U,OAAO,KAAK;AACrE,YAAQ,MAAM,UAAU,OAAO,EAAE;AACjC,YAAQ,KAAK,CAAC;AAAA,EAChB,UAAE;AACA,UAAM,QAAQ,MAAM;AAAA,EACtB;AACF,CAAC;AAEH,QAAQ,MAAM;","names":["existsSync","homedir","join","homedir","join","yaml","join","homedir","existsSync","chromium"]}
package/dist/pdf-converter-U3SFA2HY.js
ADDED

@@ -0,0 +1,42 @@
#!/usr/bin/env node

// src/pdf-converter.ts
import pdf2md from "@opendocsg/pdf2md";
async function convertPdfToMarkdown(pdfBuffer, sourceUrl) {
  const markdown = await pdf2md(pdfBuffer);
  if (!markdown.trim()) {
    throw new Error("pdf2md returned empty content from the PDF");
  }
  const title = extractTitleFromMarkdown(markdown) ?? extractTitleFromUrl(sourceUrl);
  const today = (/* @__PURE__ */ new Date()).toISOString().split("T")[0];
  return {
    markdown,
    metadata: {
      title,
      source: sourceUrl,
      author: [],
      published: null,
      created: today,
      description: null
    }
  };
}
function extractTitleFromMarkdown(markdown) {
  const match = markdown.match(/^#\s+(.+?)(?:\s+#+)?$/m);
  if (!match) return null;
  const trimmed = match[1].trim();
  return trimmed || null;
}
function extractTitleFromUrl(url) {
  const pathname = new URL(url).pathname;
  const lastSegment = pathname.split("/").filter(Boolean).pop();
  if (!lastSegment) {
    throw new Error(`Cannot extract filename from URL: ${url}`);
  }
  return decodeURIComponent(lastSegment.replace(/\.pdf$/i, ""));
}
export {
  convertPdfToMarkdown,
  extractTitleFromUrl
};
//# sourceMappingURL=pdf-converter-U3SFA2HY.js.map
package/dist/pdf-converter-U3SFA2HY.js.map
ADDED

@@ -0,0 +1 @@
{"version":3,"sources":["../src/pdf-converter.ts"],"sourcesContent":["import pdf2md from \"@opendocsg/pdf2md\";\nimport type { Metadata } from \"./types.js\";\n\nexport interface PdfConvertResult {\n markdown: string;\n metadata: Metadata;\n}\n\nexport async function convertPdfToMarkdown(\n pdfBuffer: Buffer,\n sourceUrl: string,\n): Promise<PdfConvertResult> {\n const markdown = await pdf2md(pdfBuffer);\n if (!markdown.trim()) {\n throw new Error(\"pdf2md returned empty content from the PDF\");\n }\n const title = extractTitleFromMarkdown(markdown) ?? extractTitleFromUrl(sourceUrl);\n const today = new Date().toISOString().split(\"T\")[0];\n\n return {\n markdown,\n metadata: {\n title,\n source: sourceUrl,\n author: [],\n published: null,\n created: today,\n description: null,\n },\n };\n}\n\nfunction extractTitleFromMarkdown(markdown: string): string | null {\n const match = markdown.match(/^#\\s+(.+?)(?:\\s+#+)?$/m);\n if (!match) return null;\n const trimmed = match[1].trim();\n return trimmed || null;\n}\n\nexport function extractTitleFromUrl(url: string): string {\n const pathname = new URL(url).pathname;\n const lastSegment = pathname.split(\"/\").filter(Boolean).pop();\n if (!lastSegment) {\n throw new Error(`Cannot extract filename from URL: ${url}`);\n }\n return decodeURIComponent(lastSegment.replace(/\\.pdf$/i, 
\"\"));\n}\n"],"mappings":";;;AAAA,OAAO,YAAY;AAQnB,eAAsB,qBACpB,WACA,WAC2B;AAC3B,QAAM,WAAW,MAAM,OAAO,SAAS;AACvC,MAAI,CAAC,SAAS,KAAK,GAAG;AACpB,UAAM,IAAI,MAAM,4CAA4C;AAAA,EAC9D;AACA,QAAM,QAAQ,yBAAyB,QAAQ,KAAK,oBAAoB,SAAS;AACjF,QAAM,SAAQ,oBAAI,KAAK,GAAE,YAAY,EAAE,MAAM,GAAG,EAAE,CAAC;AAEnD,SAAO;AAAA,IACL;AAAA,IACA,UAAU;AAAA,MACR;AAAA,MACA,QAAQ;AAAA,MACR,QAAQ,CAAC;AAAA,MACT,WAAW;AAAA,MACX,SAAS;AAAA,MACT,aAAa;AAAA,IACf;AAAA,EACF;AACF;AAEA,SAAS,yBAAyB,UAAiC;AACjE,QAAM,QAAQ,SAAS,MAAM,wBAAwB;AACrD,MAAI,CAAC,MAAO,QAAO;AACnB,QAAM,UAAU,MAAM,CAAC,EAAE,KAAK;AAC9B,SAAO,WAAW;AACpB;AAEO,SAAS,oBAAoB,KAAqB;AACvD,QAAM,WAAW,IAAI,IAAI,GAAG,EAAE;AAC9B,QAAM,cAAc,SAAS,MAAM,GAAG,EAAE,OAAO,OAAO,EAAE,IAAI;AAC5D,MAAI,CAAC,aAAa;AAChB,UAAM,IAAI,MAAM,qCAAqC,GAAG,EAAE;AAAA,EAC5D;AACA,SAAO,mBAAmB,YAAY,QAAQ,WAAW,EAAE,CAAC;AAC9D;","names":[]}
package/package.json
CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "vault-fetch",
-  "version": "0.1.0",
+  "version": "0.3.0",
   "description": "Fetch JS-rendered web pages with Playwright and save as Markdown to Obsidian Vault",
   "type": "module",
   "license": "MIT",
@@ -35,6 +35,7 @@
   },
   "dependencies": {
     "@mozilla/readability": "^0.6.0",
+    "@opendocsg/pdf2md": "^0.2.5",
     "commander": "^14.0.3",
     "js-yaml": "^4.1.1",
     "jsdom": "^29.0.1",