sitesavvy 0.1.0__tar.gz

@@ -0,0 +1,53 @@
1
+ # Byte-compiled / optimized / DLL files
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+
6
+ # Distribution / packaging
7
+ build/
8
+ dist/
9
+ *.egg-info/
10
+ *.egg
11
+ .eggs/
12
+ wheels/
13
+ *.whl
14
+
15
+ # PyInstaller
16
+ *.spec.bak
17
+ build_exe/
18
+ dist_exe/
19
+
20
+ # Virtual environments
21
+ .venv/
22
+ venv/
23
+ env/
24
+ ENV/
25
+
26
+ # Test & coverage artifacts
27
+ .pytest_cache/
28
+ .mypy_cache/
29
+ .ruff_cache/
30
+ .coverage
31
+ .coverage.*
32
+ htmlcov/
33
+ coverage.xml
34
+ test-results/
35
+
36
+ # MkDocs
37
+ site/
38
+
39
+ # IDE / OS
40
+ .idea/
41
+ .vscode/
42
+ .DS_Store
43
+ Thumbs.db
44
+
45
+ # Crawl outputs (samples)
46
+ out/
47
+ output/
48
+ *.zip
49
+ !tests/**/*.zip
50
+
51
+ # Manifests from real crawls
52
+ manifest.json
53
+ failed.log
@@ -0,0 +1,111 @@
1
+ # Changelog
2
+
3
+ All notable changes to **SiteSavvy** are documented in this file.
4
+
5
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
6
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
+ Dates are recorded in ISO 8601 (`YYYY-MM-DD`) form.
8
+
9
+ ## [Unreleased]
10
+
11
+ ### Added
12
+
13
+ - _Nothing yet._
14
+
15
+ ### Changed
16
+
17
+ - _Nothing yet._
18
+
19
+ ### Deprecated
20
+
21
+ - _Nothing yet._
22
+
23
+ ### Removed
24
+
25
+ - _Nothing yet._
26
+
27
+ ### Fixed
28
+
29
+ - _Nothing yet._
30
+
31
+ ### Security
32
+
33
+ - _Nothing yet._
34
+
35
+ ## [0.1.0] - 2025-01-15
36
+
37
+ ### Added
38
+
39
+ - **Two crawl modes** driven by `--mode`:
40
+ - `full` — recursively downloads every reachable resource (HTML, CSS, JS,
41
+ images, PDFs, fonts, …) while preserving the original site hierarchy.
42
+ - `text` — extracts the readable text of each HTML page (scripts,
43
+ navigation, ads and boilerplate are stripped) for offline reading.
44
+ - **Six output formats**, repeatable via `--format`:
45
+ `html`, `md`, `txt`, `pdf`, `epub`, `zip`.
46
+ - **Robots.txt compliance by default** with an explicit `--force` override
47
+ for crawls where you have permission to proceed.
48
+ - **Resume / incremental crawling** backed by a JSON manifest that records
49
+ every fetched URL, its local path, and `ETag` / `Last-Modified` headers.
50
+ `--resume` skips already-completed URLs; `--incremental` re-issues
51
+ conditional GETs and skips `304 Not Modified` responses.
52
+ - **Concurrency control** with a global `asyncio.Semaphore` and per-host
53
+ `asyncio.Lock`s that enforce a configurable `--delay` between requests to
54
+ the same host.
55
+ - **Automatic throttling** on `HTTP 429` and `5xx` responses
56
+ (`--rate-limit auto`, the default) with exponential back-off plus jitter
57
+ and a configurable retry budget.
58
+ - **Dry-run mode** (`--dry-run`) that BFS-enumerates the URLs that *would*
59
+ be fetched without writing any files — useful for sizing a crawl.
60
+ - **Optional Playwright headless rendering** (`--headless`) for
61
+ JavaScript-heavy pages, with transparent fall-back to `aiohttp` when the
62
+ browser binary is unavailable.
63
+ - **Fine-grained `--download-types`** filtering by coarse content category:
64
+ `html,css,js,img,pdf,other` (repeatable or comma-separated).
65
+ - **External-link gating** — by default SiteSavvy stays on the start host;
66
+ pass `--external` to follow cross-domain links (they are nested under an
67
+ `_external/<host>/` prefix to keep the archive tidy).
68
+ - **Rich command-line interface** built on Typer + Rich, with the
69
+ `crawl`, `legal` and `info` subcommands, coloured progress tables and
70
+ a `--verbose` / `-v` debug-logging switch.
71
+ - **Cross-platform CI matrix** covering Ubuntu, macOS and Windows
72
+ (Python 3.10 / 3.11 / 3.12 / 3.13).
73
+ - **Comprehensive test suite**: 111 tests at **92 %** line + branch
74
+ coverage, including integration tests against an in-process
75
+ `pytest-httpserver` miniature site.
76
+ - **MIT licence** — see [`LICENSE`](LICENSE).
77
+ - **MkDocs documentation site** (Material theme) published alongside the
78
+ source — see [`docs/`](docs/) and [`mkdocs.yml`](mkdocs.yml).
79
+
80
+ ### Changed
81
+
82
+ - Initial public release.
83
+
84
+ ### Deprecated
85
+
86
+ - _Nothing._
87
+
88
+ ### Removed
89
+
90
+ - _Nothing._
91
+
92
+ ### Fixed
93
+
94
+ - _Nothing._
95
+
96
+ ### Security
97
+
98
+ - **Default-secure behaviour**: `robots.txt` is respected by default
99
+ (`--respect-robots`), external links are not followed by default, and
100
+ the start URL is pre-flight checked against `robots.txt` before any
101
+ request is issued. A `PermissionError` is raised when
102
+ the start URL is disallowed unless the user explicitly passes `--force`.
103
+ - The `legal` subcommand prints a full ethical / legal disclaimer
104
+ reminding users of their responsibility to respect copyright, terms of
105
+ service and applicable scraping laws (e.g. EU Database Directive, US
106
+ CFAA).
107
+
108
+ ## Links
109
+
110
+ - [Unreleased]: https://github.com/your-org/sitesavvy/compare/v0.1.0...HEAD
111
+ - [0.1.0]: https://github.com/your-org/sitesavvy/releases/tag/v0.1.0
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 SiteSavvy Contributors
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,283 @@
1
+ Metadata-Version: 2.4
2
+ Name: sitesavvy
3
+ Version: 0.1.0
4
+ Summary: Capture the web, your way. A modern, async, cross-platform web scraper.
5
+ Project-URL: Homepage, https://github.com/your-org/sitesavvy
6
+ Project-URL: Documentation, https://your-org.github.io/sitesavvy/
7
+ Project-URL: Repository, https://github.com/your-org/sitesavvy
8
+ Project-URL: Issues, https://github.com/your-org/sitesavvy/issues
9
+ Project-URL: Changelog, https://github.com/your-org/sitesavvy/blob/main/CHANGELOG.md
10
+ Author: SiteSavvy Contributors
11
+ License-Expression: MIT
12
+ License-File: LICENSE
13
+ Keywords: async,crawler,epub,markdown,offline-reader,pdf,web-scraper
14
+ Classifier: Development Status :: 4 - Beta
15
+ Classifier: Environment :: Console
16
+ Classifier: Intended Audience :: Developers
17
+ Classifier: Intended Audience :: End Users/Desktop
18
+ Classifier: License :: OSI Approved :: MIT License
19
+ Classifier: Operating System :: OS Independent
20
+ Classifier: Programming Language :: Python :: 3
21
+ Classifier: Programming Language :: Python :: 3.10
22
+ Classifier: Programming Language :: Python :: 3.11
23
+ Classifier: Programming Language :: Python :: 3.12
24
+ Classifier: Programming Language :: Python :: 3.13
25
+ Classifier: Topic :: Internet :: WWW/HTTP
26
+ Classifier: Topic :: Internet :: WWW/HTTP :: Browsers
27
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
28
+ Classifier: Topic :: Utilities
29
+ Classifier: Typing :: Typed
30
+ Requires-Python: >=3.10
31
+ Requires-Dist: aiohttp>=3.9
32
+ Requires-Dist: beautifulsoup4>=4.12
33
+ Requires-Dist: ebooklib>=0.18
34
+ Requires-Dist: html2text>=2024.2.26
35
+ Requires-Dist: lxml>=5.0
36
+ Requires-Dist: markdownify>=0.13
37
+ Requires-Dist: playwright>=1.40
38
+ Requires-Dist: pyyaml>=6.0
39
+ Requires-Dist: rich>=13.7
40
+ Requires-Dist: tomlkit>=0.12
41
+ Requires-Dist: typer>=0.12
42
+ Requires-Dist: weasyprint>=60.0
43
+ Provides-Extra: dev
44
+ Requires-Dist: mypy>=1.10; extra == 'dev'
45
+ Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
46
+ Requires-Dist: pytest-cov>=5.0; extra == 'dev'
47
+ Requires-Dist: pytest-httpserver>=1.0; extra == 'dev'
48
+ Requires-Dist: pytest>=8.0; extra == 'dev'
49
+ Requires-Dist: ruff>=0.6; extra == 'dev'
50
+ Provides-Extra: docs
51
+ Requires-Dist: mkdocs-material>=9.5; extra == 'docs'
52
+ Requires-Dist: mkdocs>=1.6; extra == 'docs'
53
+ Provides-Extra: reppy
54
+ Requires-Dist: reppy>=2.0; extra == 'reppy'
55
+ Description-Content-Type: text/markdown
56
+
57
+ # SiteSavvy
58
+
59
+ > **Capture the web, your way.**
60
+
61
+ A modern, async, cross-platform web scraper that mirrors entire sites or
62
+ extracts their readable text — and exports the result as **HTML**, **Markdown**,
63
+ **plain text**, **PDF**, **EPUB** or a single **ZIP** archive.
64
+
65
+ Built with [`aiohttp`](https://docs.aiohttp.org/), `BeautifulSoup` + `lxml`,
66
+ `Typer` + `Rich`, with optional Playwright headless rendering for
67
+ JavaScript-heavy pages.
68
+
69
+ ---
70
+
71
+ ## Features
72
+
73
+ - **Two crawl modes**
74
+ - `full` — recursively download every reachable resource (HTML, CSS, JS,
75
+ images, PDFs, fonts, …) preserving the original directory hierarchy.
76
+ - `text` — extract the readable text from each HTML page (strips scripts,
77
+ navigation, ads) and store it in your chosen format.
78
+ - **Six output formats** (repeatable `--format`): `html`, `md`, `txt`, `pdf`,
79
+ `epub`, `zip`.
80
+ - **Polite by default**: respects `robots.txt`, enforces a per-host delay, and
81
+ auto-throttles on `429` / `5xx` responses.
82
+ - **Resume & incremental**: a JSON manifest records every fetched URL, its
83
+ local path and `ETag` / `Last-Modified`; `--resume` skips completed work and
84
+ `--incremental` re-downloads only what changed.
85
+ - **Concurrency control** with a global semaphore and per-host locks.
86
+ - **Dry-run** mode that lists the URLs that *would* be fetched.
87
+ - **Headless rendering** via Playwright (falls back to `aiohttp` automatically).
88
+ - **Fine-grained `--download-types`** filtering: `html,css,js,img,pdf,other`.
89
+ - **External-link gating** — stays on the start host unless you pass
90
+ `--external`.
91
+ - **Rich CLI** with progress tables and coloured output.
92
+ - **Cross-platform** — runs on Linux, macOS and Windows; ships a CI matrix for
93
+ all three.
94
+
95
+ ---
96
+
97
+ ## Installation
98
+
99
+ ### From PyPI (once published)
100
+
101
+ ```bash
102
+ pip install sitesavvy
103
+ ```
104
+
105
+ ### From source (development)
106
+
107
+ ```bash
108
+ git clone https://github.com/your-org/sitesavvy.git
109
+ cd sitesavvy
110
+ python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
111
+ pip install -e ".[dev]"
112
+ playwright install chromium # optional, only for --headless
113
+ ```
114
+
115
+ A plain `pip install -r requirements.txt` is also supported if you prefer to
116
+ skip the PEP 517 build.
117
+
118
+ ---
119
+
120
+ ## Quick start
121
+
122
+ ### Full-site mirror → ZIP
123
+
124
+ ```bash
125
+ sitesavvy crawl https://example.com --depth 2 --format html --format zip --out-dir ./out
126
+ ```
127
+
128
+ ### Text-only crawl → Markdown + EPUB
129
+
130
+ ```bash
131
+ sitesavvy crawl https://example.com --mode text --format md --format epub --out-dir ./reader
132
+ ```
133
+
134
+ ### Dry-run (list URLs only)
135
+
136
+ ```bash
137
+ sitesavvy crawl https://example.com --dry-run --depth 1
138
+ ```
139
+
140
+ ### Resume an interrupted crawl
141
+
142
+ ```bash
143
+ sitesavvy crawl https://example.com --depth 3 --resume --manifest ./out/manifest.json --out-dir ./out
144
+ ```
145
+
146
+ ### Only re-download changed resources
147
+
148
+ ```bash
149
+ sitesavvy crawl https://example.com --incremental --manifest ./out/manifest.json --out-dir ./out
150
+ ```
151
+
152
+ ### Render JavaScript pages
153
+
154
+ ```bash
155
+ sitesavvy crawl https://spa.example.com --headless --format html
156
+ ```
157
+
158
+ ---
159
+
160
+ ## Command reference
161
+
162
+ | Flag | Default | Description |
163
+ | --- | --- | --- |
164
+ | `url` *(positional)* | — | Starting URL. |
165
+ | `--depth INT` | `0` | Max link depth (`0` = unlimited). |
166
+ | `--mode {full,text}` | `full` | Full-site download or text-only extraction. |
167
+ | `--format …` | `html` | Output format, repeatable: `html md txt pdf epub zip`. |
168
+ | `--out-dir PATH` | CWD | Destination folder. |
169
+ | `--concurrency N` | `4` | Simultaneous HTTP requests. |
170
+ | `--user-agent STR` | browser-like | Custom `User-Agent` header. |
171
+ | `--respect-robots` / `--no-respect-robots` | on | Obey `robots.txt`. |
172
+ | `--delay SECS` | `0.5` | Polite delay between same-host requests. |
173
+ | `--resume` | off | Skip URLs already completed in the manifest. |
174
+ | `--manifest FILE` | `<out-dir>/manifest.json` | Manifest path. |
175
+ | `--dry-run` | off | List URLs that would be fetched. |
176
+ | `--headless` | off | Render JS pages with Playwright. |
177
+ | `--rate-limit {auto,fixed}` | `auto` | Back off on 429/5xx, or use fixed delay. |
178
+ | `--download-types …` | all | Comma-separated: `html,css,js,img,pdf,other`. |
179
+ | `--incremental` | off | Re-download only changed resources (conditional GET). |
180
+ | `--external` | off | Follow cross-domain links. |
181
+ | `--force` | off | Proceed even if `robots.txt` disallows the start URL. |
182
+ | `--timeout SECS` | `30` | Per-request timeout. |
183
+ | `--verbose` / `-v` | off | Enable debug logging. |
184
+
185
+ Auxiliary commands:
186
+
187
+ ```bash
188
+ sitesavvy legal # print the legal / ethical disclaimer
189
+ sitesavvy info # show which optional backends are installed
190
+ sitesavvy --version
191
+ ```
192
+
193
+ ---
194
+
195
+ ## Export-format matrix
196
+
197
+ | Format | Mode `full` | Mode `text` | Backend |
198
+ | --- | --- | --- | --- |
199
+ | `html` | original bytes, hierarchy preserved | — | built-in |
200
+ | `md` | — | `markdownify` (ATX headings, links absolute) | `markdownify` |
201
+ | `txt` | — | `html2text` (no hard wrap) | `html2text` |
202
+ | `pdf` | — | WeasyPrint | `weasyprint` |
203
+ | `epub` | — | `ebooklib`, one chapter per page | `ebooklib` |
204
+ | `zip` | archive of the whole crawl | archive of the whole crawl | `zipfile` |
205
+
206
+ Sample Markdown output:
207
+
208
+ ```markdown
209
+ # Page Title
210
+
211
+ ## A heading
212
+
213
+ Some paragraph text with a [link](https://example.com/page).
214
+ ```
215
+
216
+ ---
217
+
218
+ ## Architecture
219
+
220
+ ```
221
+ sitesavvy/
222
+ ├── __init__.py # package metadata
223
+ ├── __main__.py # python -m sitesavvy
224
+ ├── __about__.py # version
225
+ ├── config.py # CrawlConfig + enums
226
+ ├── models.py # CrawlItem, FetchResult, ManifestEntry
227
+ ├── url_utils.py # normalisation, link extraction, path mapping
228
+ ├── robots.py # async robots.txt (reppy or stdlib fallback)
229
+ ├── conversions.py # HTML → MD/TXT/PDF/EPUB + ZIP
230
+ ├── manifest.py # resume / incremental state
231
+ ├── headless.py # Playwright fetcher
232
+ ├── crawler.py # the Crawler engine
233
+ ├── legal.py # disclaimer text
234
+ ├── cli.py # Typer + Rich CLI
235
+ └── main.py # console-script entry point
236
+ ```
237
+
238
+ Networking layer: `aiohttp` (primary) with an optional Playwright headless
239
+ browser for JS-rendered pages. HTML parsing uses `beautifulsoup4` + `lxml`.
240
+ `robots.txt` is parsed with `reppy` when available, otherwise with the stdlib
241
+ `urllib.robotparser`.
242
+
243
+ ---
244
+
245
+ ## Troubleshooting
246
+
247
+ - **`HTTP 429 Too Many Requests`** — lower `--concurrency`, raise `--delay`,
248
+ and keep `--rate-limit auto` (default) so SiteSavvy backs off automatically.
249
+ - **Large sites** — set `--depth` to bound the crawl, run with `--dry-run`
250
+ first to estimate scope, and use `--resume` so an interruption doesn't waste
251
+ work.
252
+ - **PDF export fails** — WeasyPrint needs Pango/Cairo system libraries. On
253
+ Debian/Ubuntu: `apt install libpango-1.0-0 libpangoft2-1.0-0`. On macOS:
254
+ `brew install pango`. The other formats keep working even if PDF is missing.
255
+ - **Headless mode crashes** — run `playwright install chromium` once after
256
+ installing the package. Without it, SiteSavvy transparently falls back to
257
+ `aiohttp`.
258
+ - **`robots.txt disallows …`** — by default SiteSavvy honours `robots.txt`.
259
+ Add `--force` only if you have permission and accept responsibility.
260
+
261
+ ---
262
+
263
+ ## Legal & ethics
264
+
265
+ SiteSavvy is provided for **personal, non-commercial use only**. Respect the
266
+ copyright, terms of service, and `robots.txt` of every site you crawl. The
267
+ authors assume no liability for misuse. Run `sitesavvy legal` to read the full
268
+ disclaimer. Licensed under the [MIT License](LICENSE).
269
+
270
+ ---
271
+
272
+ ## Contributing
273
+
274
+ Pull requests are welcome! Please run the full check suite before submitting:
275
+
276
+ ```bash
277
+ ruff check .
278
+ mypy sitesavvy
279
+ pytest --cov=sitesavvy --cov-report=term-missing
280
+ ```
281
+
282
+ Coverage must stay at or above **90 %**. See the [Developer Guide](docs/developer.md)
283
+ for the project layout, release process and binary-building instructions.
@@ -0,0 +1,227 @@
1
+ # SiteSavvy
2
+
3
+ > **Capture the web, your way.**
4
+
5
+ A modern, async, cross-platform web scraper that mirrors entire sites or
6
+ extracts their readable text — and exports the result as **HTML**, **Markdown**,
7
+ **plain text**, **PDF**, **EPUB** or a single **ZIP** archive.
8
+
9
+ Built with [`aiohttp`](https://docs.aiohttp.org/), `BeautifulSoup` + `lxml`,
10
+ `Typer` + `Rich`, with optional Playwright headless rendering for
11
+ JavaScript-heavy pages.
12
+
13
+ ---
14
+
15
+ ## Features
16
+
17
+ - **Two crawl modes**
18
+ - `full` — recursively download every reachable resource (HTML, CSS, JS,
19
+ images, PDFs, fonts, …) preserving the original directory hierarchy.
20
+ - `text` — extract the readable text from each HTML page (strips scripts,
21
+ navigation, ads) and store it in your chosen format.
22
+ - **Six output formats** (repeatable `--format`): `html`, `md`, `txt`, `pdf`,
23
+ `epub`, `zip`.
24
+ - **Polite by default**: respects `robots.txt`, enforces a per-host delay, and
25
+ auto-throttles on `429` / `5xx` responses.
26
+ - **Resume & incremental**: a JSON manifest records every fetched URL, its
27
+ local path and `ETag` / `Last-Modified`; `--resume` skips completed work and
28
+ `--incremental` re-downloads only what changed.
29
+ - **Concurrency control** with a global semaphore and per-host locks.
30
+ - **Dry-run** mode that lists the URLs that *would* be fetched.
31
+ - **Headless rendering** via Playwright (falls back to `aiohttp` automatically).
32
+ - **Fine-grained `--download-types`** filtering: `html,css,js,img,pdf,other`.
33
+ - **External-link gating** — stays on the start host unless you pass
34
+ `--external`.
35
+ - **Rich CLI** with progress tables and coloured output.
36
+ - **Cross-platform** — runs on Linux, macOS and Windows; ships a CI matrix for
37
+ all three.
38
+
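The "global semaphore and per-host locks" item above can be pictured with a small sketch. This is not SiteSavvy's actual code: `HostThrottle`, its `run` method and the `fetch` callable are hypothetical names chosen for illustration, and the defaults simply mirror the documented `--concurrency 4` and `--delay 0.5`.

```python
import asyncio
import time
from collections import defaultdict
from urllib.parse import urlsplit


class HostThrottle:
    """Hypothetical helper: caps total concurrency and spaces same-host requests."""

    def __init__(self, concurrency: int = 4, delay: float = 0.5) -> None:
        self._semaphore = asyncio.Semaphore(concurrency)  # global request cap
        self._locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)  # one lock per host
        self._last_start: dict[str, float] = {}  # host -> time the last request started
        self._delay = delay

    async def run(self, url: str, fetch):
        """Await fetch(url) once a global slot is free and the host delay has elapsed."""
        host = urlsplit(url).netloc
        async with self._semaphore:
            # The host lock serialises the delay bookkeeping for this host only.
            async with self._locks[host]:
                last = self._last_start.get(host)
                if last is not None:
                    wait = self._delay - (time.monotonic() - last)
                    if wait > 0:
                        await asyncio.sleep(wait)
                self._last_start[host] = time.monotonic()
            return await fetch(url)
```

In this sketch the lock is released before the request itself runs, so two requests to the same host may overlap, but they never start less than `delay` seconds apart. A task waiting on a host lock still occupies a global slot, which keeps the code short at the cost of some throughput.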
39
+ ---
40
+
41
+ ## Installation
42
+
43
+ ### From PyPI (once published)
44
+
45
+ ```bash
46
+ pip install sitesavvy
47
+ ```
48
+
49
+ ### From source (development)
50
+
51
+ ```bash
52
+ git clone https://github.com/your-org/sitesavvy.git
53
+ cd sitesavvy
54
+ python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
55
+ pip install -e ".[dev]"
56
+ playwright install chromium # optional, only for --headless
57
+ ```
58
+
59
+ A plain `pip install -r requirements.txt` is also supported if you prefer to
60
+ skip the PEP 517 build.
61
+
62
+ ---
63
+
64
+ ## Quick start
65
+
66
+ ### Full-site mirror → ZIP
67
+
68
+ ```bash
69
+ sitesavvy crawl https://example.com --depth 2 --format html --format zip --out-dir ./out
70
+ ```
71
+
72
+ ### Text-only crawl → Markdown + EPUB
73
+
74
+ ```bash
75
+ sitesavvy crawl https://example.com --mode text --format md --format epub --out-dir ./reader
76
+ ```
77
+
78
+ ### Dry-run (list URLs only)
79
+
80
+ ```bash
81
+ sitesavvy crawl https://example.com --dry-run --depth 1
82
+ ```
83
+
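A dry run amounts to a breadth-first walk over links that records URLs instead of saving files. The sketch below illustrates that idea under stated assumptions: `enumerate_urls` and `get_links` are hypothetical names, the start page is assumed to sit at depth 0, and `max_depth=0` is treated as unlimited to match the `--depth` default in the command reference.

```python
from collections import deque
from typing import Callable, Iterable


def enumerate_urls(
    start: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 0,
) -> list[str]:
    """Return reachable URLs in breadth-first order (max_depth=0 means unlimited)."""
    seen = {start}
    order: list[str] = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        # Stop expanding once the depth limit is reached (assumed semantics).
        if max_depth and depth >= max_depth:
            continue
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

Passing a function backed by an in-memory dictionary as `get_links` is enough to try it out; a real crawler would fetch and parse each page there.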
84
+ ### Resume an interrupted crawl
85
+
86
+ ```bash
87
+ sitesavvy crawl https://example.com --depth 3 --resume --manifest ./out/manifest.json --out-dir ./out
88
+ ```
89
+
90
+ ### Only re-download changed resources
91
+
92
+ ```bash
93
+ sitesavvy crawl https://example.com --incremental --manifest ./out/manifest.json --out-dir ./out
94
+ ```
95
+
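Incremental mode relies on HTTP conditional requests: the validators stored from the previous crawl are sent back, and a `304 Not Modified` reply means the local copy is still current. A minimal sketch of that logic follows. The manifest field names `etag` and `last_modified` and both function names are assumptions for illustration, not the documented manifest schema.

```python
def conditional_headers(entry: dict) -> dict:
    """Build conditional-GET headers from a (hypothetical) manifest entry."""
    headers = {}
    if entry.get("etag"):
        headers["If-None-Match"] = entry["etag"]
    if entry.get("last_modified"):
        headers["If-Modified-Since"] = entry["last_modified"]
    return headers


def needs_download(status: int) -> bool:
    """A 304 response means the stored copy is unchanged and can be skipped."""
    return status != 304
```

An entry without validators produces no conditional headers, so the resource is fetched unconditionally.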
96
+ ### Render JavaScript pages
97
+
98
+ ```bash
99
+ sitesavvy crawl https://spa.example.com --headless --format html
100
+ ```
101
+
102
+ ---
103
+
104
+ ## Command reference
105
+
106
+ | Flag | Default | Description |
107
+ | --- | --- | --- |
108
+ | `url` *(positional)* | — | Starting URL. |
109
+ | `--depth INT` | `0` | Max link depth (`0` = unlimited). |
110
+ | `--mode {full,text}` | `full` | Full-site download or text-only extraction. |
111
+ | `--format …` | `html` | Output format, repeatable: `html md txt pdf epub zip`. |
112
+ | `--out-dir PATH` | CWD | Destination folder. |
113
+ | `--concurrency N` | `4` | Simultaneous HTTP requests. |
114
+ | `--user-agent STR` | browser-like | Custom `User-Agent` header. |
115
+ | `--respect-robots` / `--no-respect-robots` | on | Obey `robots.txt`. |
116
+ | `--delay SECS` | `0.5` | Polite delay between same-host requests. |
117
+ | `--resume` | off | Skip URLs already completed in the manifest. |
118
+ | `--manifest FILE` | `<out-dir>/manifest.json` | Manifest path. |
119
+ | `--dry-run` | off | List URLs that would be fetched. |
120
+ | `--headless` | off | Render JS pages with Playwright. |
121
+ | `--rate-limit {auto,fixed}` | `auto` | Back off on 429/5xx, or use fixed delay. |
122
+ | `--download-types …` | all | Comma-separated: `html,css,js,img,pdf,other`. |
123
+ | `--incremental` | off | Re-download only changed resources (conditional GET). |
124
+ | `--external` | off | Follow cross-domain links. |
125
+ | `--force` | off | Proceed even if `robots.txt` disallows the start URL. |
126
+ | `--timeout SECS` | `30` | Per-request timeout. |
127
+ | `--verbose` / `-v` | off | Enable debug logging. |
128
+
129
+ Auxiliary commands:
130
+
131
+ ```bash
132
+ sitesavvy legal # print the legal / ethical disclaimer
133
+ sitesavvy info # show which optional backends are installed
134
+ sitesavvy --version
135
+ ```
136
+
137
+ ---
138
+
139
+ ## Export-format matrix
140
+
141
+ | Format | Mode `full` | Mode `text` | Backend |
142
+ | --- | --- | --- | --- |
143
+ | `html` | original bytes, hierarchy preserved | — | built-in |
144
+ | `md` | — | `markdownify` (ATX headings, links absolute) | `markdownify` |
145
+ | `txt` | — | `html2text` (no hard wrap) | `html2text` |
146
+ | `pdf` | — | WeasyPrint | `weasyprint` |
147
+ | `epub` | — | `ebooklib`, one chapter per page | `ebooklib` |
148
+ | `zip` | archive of the whole crawl | archive of the whole crawl | `zipfile` |
149
+
150
+ Sample Markdown output:
151
+
152
+ ```markdown
153
+ # Page Title
154
+
155
+ ## A heading
156
+
157
+ Some paragraph text with a [link](https://example.com/page).
158
+ ```
159
+
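The `zip` row of the matrix names the standard-library `zipfile` module as its backend. As a rough illustration of archiving a whole crawl directory with it, here is a sketch in which `zip_directory` and its parameters are hypothetical names; the real exporter may differ.

```python
import zipfile
from pathlib import Path


def zip_directory(source_dir: Path, archive_path: Path) -> int:
    """Archive every file under source_dir and return the number of files added."""
    source_dir = Path(source_dir)
    archive_path = Path(archive_path)
    count = 0
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            # Skip directories and the archive itself if it lives inside source_dir.
            if path.is_file() and path.resolve() != archive_path.resolve():
                archive.write(path, arcname=path.relative_to(source_dir).as_posix())
                count += 1
    return count
```

Storing paths relative to `source_dir` keeps the site hierarchy intact inside the archive.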
160
+ ---
161
+
162
+ ## Architecture
163
+
164
+ ```
165
+ sitesavvy/
166
+ ├── __init__.py # package metadata
167
+ ├── __main__.py # python -m sitesavvy
168
+ ├── __about__.py # version
169
+ ├── config.py # CrawlConfig + enums
170
+ ├── models.py # CrawlItem, FetchResult, ManifestEntry
171
+ ├── url_utils.py # normalisation, link extraction, path mapping
172
+ ├── robots.py # async robots.txt (reppy or stdlib fallback)
173
+ ├── conversions.py # HTML → MD/TXT/PDF/EPUB + ZIP
174
+ ├── manifest.py # resume / incremental state
175
+ ├── headless.py # Playwright fetcher
176
+ ├── crawler.py # the Crawler engine
177
+ ├── legal.py # disclaimer text
178
+ ├── cli.py # Typer + Rich CLI
179
+ └── main.py # console-script entry point
180
+ ```
181
+
182
+ Networking layer: `aiohttp` (primary) with an optional Playwright headless
183
+ browser for JS-rendered pages. HTML parsing uses `beautifulsoup4` + `lxml`.
184
+ `robots.txt` is parsed with `reppy` when available, otherwise with the stdlib
185
+ `urllib.robotparser`.
186
+
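For the standard-library fallback, a robots check can be as small as the sketch below. `is_allowed` and `preflight` are hypothetical helpers, not SiteSavvy functions; the `force` parameter is an assumed stand-in for the `--force` flag, and the robots.txt body is passed in as text so the example needs no network access.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against an already-downloaded robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def preflight(robots_txt: str, user_agent: str, start_url: str, force: bool = False) -> None:
    """Raise PermissionError when the start URL is disallowed and force is not set."""
    if not force and not is_allowed(robots_txt, user_agent, start_url):
        raise PermissionError(f"robots.txt disallows {start_url}")
```

Separating the parse step from the download keeps the check easy to test and lets an async fetcher supply the file contents.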
187
+ ---
188
+
189
+ ## Troubleshooting
190
+
191
+ - **`HTTP 429 Too Many Requests`** — lower `--concurrency`, raise `--delay`,
192
+ and keep `--rate-limit auto` (default) so SiteSavvy backs off automatically.
193
+ - **Large sites** — set `--depth` to bound the crawl, run with `--dry-run`
194
+ first to estimate scope, and use `--resume` so an interruption doesn't waste
195
+ work.
196
+ - **PDF export fails** — WeasyPrint needs Pango/Cairo system libraries. On
197
+ Debian/Ubuntu: `apt install libpango-1.0-0 libpangoft2-1.0-0`. On macOS:
198
+ `brew install pango`. The other formats keep working even if PDF is missing.
199
+ - **Headless mode crashes** — run `playwright install chromium` once after
200
+ installing the package. Without it, SiteSavvy transparently falls back to
201
+ `aiohttp`.
202
+ - **`robots.txt disallows …`** — by default SiteSavvy honours `robots.txt`.
203
+ Add `--force` only if you have permission and accept responsibility.
204
+
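The automatic back-off mentioned for `429` responses can be sketched as exponential growth with a random jitter term. The names `should_retry` and `backoff_delay`, and the `base`, `cap` and jitter values, are assumptions for illustration; the source does not state SiteSavvy's actual formula or constants.

```python
import random


def should_retry(status: int) -> bool:
    """Retry on HTTP 429 and on any 5xx response."""
    return status == 429 or 500 <= status < 600


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0, rng=random) -> float:
    """Delay before retry number `attempt` (0-based): capped exponential plus jitter."""
    exponential = min(cap, base * (2 ** attempt))
    jitter = rng.uniform(0, base)  # assumed jitter range, spreads out simultaneous retries
    return exponential + jitter
```

The jitter keeps several workers that were throttled at the same moment from retrying in lockstep.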
205
+ ---
206
+
207
+ ## Legal & ethics
208
+
209
+ SiteSavvy is provided for **personal, non-commercial use only**. Respect the
210
+ copyright, terms of service, and `robots.txt` of every site you crawl. The
211
+ authors assume no liability for misuse. Run `sitesavvy legal` to read the full
212
+ disclaimer. Licensed under the [MIT License](LICENSE).
213
+
214
+ ---
215
+
216
+ ## Contributing
217
+
218
+ Pull requests are welcome! Please run the full check suite before submitting:
219
+
220
+ ```bash
221
+ ruff check .
222
+ mypy sitesavvy
223
+ pytest --cov=sitesavvy --cov-report=term-missing
224
+ ```
225
+
226
+ Coverage must stay at or above **90 %**. See the [Developer Guide](docs/developer.md)
227
+ for the project layout, release process and binary-building instructions.