web-to-markdown-crawler 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (4)
  1. package/LICENSE +21 -0
  2. package/README.md +111 -0
  3. package/dist/index.js +68698 -0
  4. package/package.json +35 -0
package/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 leochilds
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,111 @@
+ # web-to-markdown-crawler
+
+ A CLI tool that crawls a website and converts every page to Markdown, mirroring the site's URL structure as a local directory tree. Internal links are rewritten to relative `.md` paths so the output works as a self-contained document collection.
+
+ ## Features
+
+ - Mirrors URL structure on disk (`/docs/intro` → `docs/intro.md`)
+ - Rewrites internal links to relative `.md` paths
+ - Extracts `<main>` / `<article>` / `[role="main"]` content before converting
+ - Prepends YAML frontmatter (`url`, `crawledAt`) to every file
+ - Handles redirects — the final URL is used as the canonical path
+ - Query-string URLs are disambiguated (`/search?q=foo` → `search-q-foo.md`)
+ - Produces a `nodemap.json` graph of every discovered URL and its status
+ - Graceful error handling — one bad page never aborts the crawl
+
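The path-mapping and query-string rules above can be sketched as follows. This is an illustrative reconstruction from the feature list, not the package's actual implementation; the real crawler may normalise trailing slashes or percent-encoding differently.

```typescript
// Illustrative sketch of the URL → file-path mapping described above.
// Treat this as a reading aid, not the package's real code.
function urlToOutputPath(rawUrl: string): string {
  const url = new URL(rawUrl);
  let p = url.pathname;
  if (p.endsWith("/")) p += "index"; // "/docs/" → "docs/index.md"
  if (url.search) {
    // "/search?q=foo" → "search-q-foo.md" (query-string disambiguation)
    const slug = url.search.slice(1).replace(/[^A-Za-z0-9]+/g, "-");
    p += `-${slug}`;
  }
  return p.replace(/^\//, "") + ".md";
}

// urlToOutputPath("https://example.com/docs/intro")   → "docs/intro.md"
// urlToOutputPath("https://example.com/search?q=foo") → "search-q-foo.md"
```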
+ ## Requirements
+
+ - [Bun](https://bun.sh) 1.x
+
+ ## Installation
+
+ ```bash
+ git clone https://github.com/leochilds/web-to-markdown-crawler.git
+ cd web-to-markdown-crawler
+ bun install
+ ```
+
+ ## Usage
+
+ ```
+ crawl <url> [options]
+ ```
+
+ ### Options
+
+ | Flag | Default | Description |
+ |---|---|---|
+ | `-o, --output <dir>` | `./output` | Output directory |
+ | `-c, --concurrency <n>` | `5` | Parallel fetch limit |
+ | `--max-depth <n>` | unlimited | Stop following links beyond this depth (0 = start page only) |
+ | `--max-pages <n>` | unlimited | Stop after writing this many pages |
+ | `--delay <ms>` | none | Delay between requests (polite crawling) |
+
+ ### Examples
+
+ ```bash
+ # Crawl a docs site into ./output
+ bun run dev https://docs.example.com
+
+ # Limit depth and add a polite delay
+ bun run dev https://docs.example.com --max-depth 3 --delay 500
+
+ # Custom output directory with concurrency and page limits
+ bun run dev https://docs.example.com -o ./docs-mirror -c 3 --max-pages 100
+ ```
+
+ ## Output
+
+ ```
+ output/
+   index.md          ← https://example.com/
+   about.md          ← https://example.com/about
+   docs/
+     index.md        ← https://example.com/docs/
+     intro.md        ← https://example.com/docs/intro
+   nodemap.json      ← full link graph with per-URL status
+ ```
+
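Because the tree mirrors URL paths, rewriting an internal link to a relative `.md` path reduces to ordinary path arithmetic once both pages have been mapped to output files. A minimal sketch (an assumed helper, not the package's actual code):

```typescript
import * as path from "node:path";

// Sketch of the internal-link rewriting described in the features list:
// compute the target's path relative to the source file's directory.
function rewriteInternalLink(fromFile: string, toFile: string): string {
  const rel = path.relative(path.dirname(fromFile), toFile);
  return rel.split(path.sep).join("/"); // always use "/" in Markdown links
}

// rewriteInternalLink("docs/intro.md", "about.md") → "../about.md"
```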
+ Each `.md` file begins with YAML frontmatter:
+
+ ```yaml
+ ---
+ url: https://example.com/docs/intro
+ crawledAt: 2026-04-05T09:00:00.000Z
+ ---
+ ```
+
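A hypothetical helper producing that frontmatter; the field order and formatting are taken from the README example, not from the package source:

```typescript
// Hypothetical frontmatter helper matching the YAML example above.
function withFrontmatter(markdown: string, url: string, crawledAt: Date): string {
  return [
    "---",
    `url: ${url}`,
    `crawledAt: ${crawledAt.toISOString()}`,
    "---",
    "",
    markdown,
  ].join("\n");
}
```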
+ `nodemap.json` records every URL the crawler encountered (including skipped external links and errors):
+
+ ```json
+ {
+   "startUrl": "https://example.com/",
+   "crawledAt": "2026-04-05T09:00:00.000Z",
+   "totalPages": 42,
+   "nodes": {
+     "https://example.com/": { "status": "success", "outputPath": "output/index.md", "outLinks": [...] },
+     "https://external.com/": { "status": "skipped", "outLinks": [] }
+   }
+ }
+ ```
+
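The shape of `nodemap.json` can be described with TypeScript types inferred from the example above. These names are illustrative (the package does not publish them), and the `"error"` status is an assumption based on the README's mention of errors:

```typescript
// Types inferred from the nodemap.json example; names are illustrative.
type NodeStatus = "success" | "skipped" | "error";

interface NodeEntry {
  status: NodeStatus;
  outputPath?: string; // only present when a page was written to disk
  outLinks: string[];  // URLs discovered on that page
}

interface NodeMap {
  startUrl: string;
  crawledAt: string;   // ISO-8601 timestamp
  totalPages: number;
  nodes: Record<string, NodeEntry>;
}

// Example conforming to the README snippet:
const exampleMap: NodeMap = {
  startUrl: "https://example.com/",
  crawledAt: "2026-04-05T09:00:00.000Z",
  totalPages: 42,
  nodes: {
    "https://external.com/": { status: "skipped", outLinks: [] },
  },
};
```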
+ ## Development
+
+ ```bash
+ bun run dev <url>    # run from source
+ bun run typecheck    # TypeScript type check
+ bun run test         # run the test suite (77 tests)
+ bun run build        # compile to dist/
+ ```
+
+ ## Built with
+
+ - [got](https://github.com/sindresorhus/got) — HTTP requests with retries and redirect handling
+ - [cheerio](https://github.com/cheeriojs/cheerio) — HTML parsing and link extraction
+ - [turndown](https://github.com/mixmark-io/turndown) — HTML → Markdown conversion
+ - [graphjs](https://github.com/tantalor/graphjs) — directed graph for the link nodemap
+ - [p-limit](https://github.com/sindresorhus/p-limit) — concurrency control
+
+ ---
+
+ *Built with [Claude Code](https://claude.com/claude-code)*