sitesavvy-0.1.0.tar.gz
- sitesavvy-0.1.0/.gitignore +53 -0
- sitesavvy-0.1.0/CHANGELOG.md +111 -0
- sitesavvy-0.1.0/LICENSE +21 -0
- sitesavvy-0.1.0/PKG-INFO +283 -0
- sitesavvy-0.1.0/README.md +227 -0
- sitesavvy-0.1.0/pyproject.toml +160 -0
- sitesavvy-0.1.0/sitesavvy/__about__.py +5 -0
- sitesavvy-0.1.0/sitesavvy/__init__.py +13 -0
- sitesavvy-0.1.0/sitesavvy/__main__.py +8 -0
- sitesavvy-0.1.0/sitesavvy/cli.py +311 -0
- sitesavvy-0.1.0/sitesavvy/config.py +135 -0
- sitesavvy-0.1.0/sitesavvy/conversions.py +301 -0
- sitesavvy-0.1.0/sitesavvy/crawler.py +576 -0
- sitesavvy-0.1.0/sitesavvy/headless.py +83 -0
- sitesavvy-0.1.0/sitesavvy/legal.py +30 -0
- sitesavvy-0.1.0/sitesavvy/main.py +19 -0
- sitesavvy-0.1.0/sitesavvy/manifest.py +104 -0
- sitesavvy-0.1.0/sitesavvy/models.py +109 -0
- sitesavvy-0.1.0/sitesavvy/robots.py +144 -0
- sitesavvy-0.1.0/sitesavvy/url_utils.py +200 -0
|
sitesavvy-0.1.0/.gitignore
ADDED
|
@@ -0,0 +1,53 @@
|
|
|
1
|
+
# Byte-compiled / optimized / DLL files
|
|
2
|
+
__pycache__/
|
|
3
|
+
*.py[cod]
|
|
4
|
+
*$py.class
|
|
5
|
+
|
|
6
|
+
# Distribution / packaging
|
|
7
|
+
build/
|
|
8
|
+
dist/
|
|
9
|
+
*.egg-info/
|
|
10
|
+
*.egg
|
|
11
|
+
.eggs/
|
|
12
|
+
wheels/
|
|
13
|
+
*.whl
|
|
14
|
+
|
|
15
|
+
# PyInstaller
|
|
16
|
+
*.spec.bak
|
|
17
|
+
build_exe/
|
|
18
|
+
dist_exe/
|
|
19
|
+
|
|
20
|
+
# Virtual environments
|
|
21
|
+
.venv/
|
|
22
|
+
venv/
|
|
23
|
+
env/
|
|
24
|
+
ENV/
|
|
25
|
+
|
|
26
|
+
# Test & coverage artifacts
|
|
27
|
+
.pytest_cache/
|
|
28
|
+
.mypy_cache/
|
|
29
|
+
.ruff_cache/
|
|
30
|
+
.coverage
|
|
31
|
+
.coverage.*
|
|
32
|
+
htmlcov/
|
|
33
|
+
coverage.xml
|
|
34
|
+
test-results/
|
|
35
|
+
|
|
36
|
+
# Mkdocs
|
|
37
|
+
site/
|
|
38
|
+
|
|
39
|
+
# IDE / OS
|
|
40
|
+
.idea/
|
|
41
|
+
.vscode/
|
|
42
|
+
.DS_Store
|
|
43
|
+
Thumbs.db
|
|
44
|
+
|
|
45
|
+
# Crawl outputs (samples)
|
|
46
|
+
out/
|
|
47
|
+
output/
|
|
48
|
+
*.zip
|
|
49
|
+
!tests/**/*.zip
|
|
50
|
+
|
|
51
|
+
# Manifests from real crawls
|
|
52
|
+
manifest.json
|
|
53
|
+
failed.log
|
|
sitesavvy-0.1.0/CHANGELOG.md
ADDED
|
@@ -0,0 +1,111 @@
|
|
|
1
|
+
# Changelog
|
|
2
|
+
|
|
3
|
+
All notable changes to **SiteSavvy** are documented in this file.
|
|
4
|
+
|
|
5
|
+
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
|
|
6
|
+
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
|
7
|
+
Dates are recorded in ISO 8601 (`YYYY-MM-DD`) form.
|
|
8
|
+
|
|
9
|
+
## [Unreleased]
|
|
10
|
+
|
|
11
|
+
### Added
|
|
12
|
+
|
|
13
|
+
- _Nothing yet._
|
|
14
|
+
|
|
15
|
+
### Changed
|
|
16
|
+
|
|
17
|
+
- _Nothing yet._
|
|
18
|
+
|
|
19
|
+
### Deprecated
|
|
20
|
+
|
|
21
|
+
- _Nothing yet._
|
|
22
|
+
|
|
23
|
+
### Removed
|
|
24
|
+
|
|
25
|
+
- _Nothing yet._
|
|
26
|
+
|
|
27
|
+
### Fixed
|
|
28
|
+
|
|
29
|
+
- _Nothing yet._
|
|
30
|
+
|
|
31
|
+
### Security
|
|
32
|
+
|
|
33
|
+
- _Nothing yet._
|
|
34
|
+
|
|
35
|
+
## [0.1.0] - 2025-01-15
|
|
36
|
+
|
|
37
|
+
### Added
|
|
38
|
+
|
|
39
|
+
- **Two crawl modes** driven by `--mode`:
|
|
40
|
+
- `full` — recursively downloads every reachable resource (HTML, CSS, JS,
|
|
41
|
+
images, PDFs, fonts, …) while preserving the original site hierarchy.
|
|
42
|
+
- `text` — extracts the readable text of each HTML page (scripts,
|
|
43
|
+
navigation, ads and boilerplate are stripped) for offline reading.
|
|
44
|
+
- **Six output formats**, repeatable via `--format`:
|
|
45
|
+
`html`, `md`, `txt`, `pdf`, `epub`, `zip`.
|
|
46
|
+
- **Robots.txt compliance by default** with an explicit `--force` override
|
|
47
|
+
for crawls where you have permission to proceed.
|
|
48
|
+
- **Resume / incremental crawling** backed by a JSON manifest that records
|
|
49
|
+
every fetched URL, its local path, and `ETag` / `Last-Modified` headers.
|
|
50
|
+
`--resume` skips already-completed URLs; `--incremental` re-issues
|
|
51
|
+
conditional GETs and skips `304 Not Modified` responses.
|
|
52
|
+
- **Concurrency control** with a global `asyncio.Semaphore` and per-host
|
|
53
|
+
`asyncio.Lock`s that enforce a configurable `--delay` between requests to
|
|
54
|
+
the same host.
|
|
55
|
+
- **Automatic throttling** on `HTTP 429` and `5xx` responses
|
|
56
|
+
(`--rate-limit auto`, the default) with exponential back-off plus jitter
|
|
57
|
+
and a configurable retry budget.
|
|
58
|
+
- **Dry-run mode** (`--dry-run`) that BFS-enumerates the URLs that *would*
|
|
59
|
+
be fetched without writing any files — useful for sizing a crawl.
|
|
60
|
+
- **Optional Playwright headless rendering** (`--headless`) for
|
|
61
|
+
JavaScript-heavy pages, with transparent fall-back to `aiohttp` when the
|
|
62
|
+
browser binary is unavailable.
|
|
63
|
+
- **Fine-grained `--download-types`** filtering by coarse content category:
|
|
64
|
+
`html,css,js,img,pdf,other` (repeatable or comma-separated).
|
|
65
|
+
- **External-link gating** — by default SiteSavvy stays on the start host;
|
|
66
|
+
pass `--external` to follow cross-domain links (they are nested under an
|
|
67
|
+
`_external/<host>/` prefix to keep the archive tidy).
|
|
68
|
+
- **Rich command-line interface** built on Typer + Rich, with the
|
|
69
|
+
`crawl`, `legal` and `info` subcommands, coloured progress tables and
|
|
70
|
+
a `--verbose` / `-v` debug-logging switch.
|
|
71
|
+
- **Cross-platform CI matrix** covering Ubuntu, macOS and Windows
|
|
72
|
+
(Python 3.10 / 3.11 / 3.12 / 3.13).
|
|
73
|
+
- **Comprehensive test suite**: 111 tests at **92 %** line + branch
|
|
74
|
+
coverage, including integration tests against an in-process
|
|
75
|
+
`pytest-httpserver` miniature site.
|
|
76
|
+
- **MIT licence** — see [`LICENSE`](LICENSE).
|
|
77
|
+
- **mkdocs documentation site** (Material theme) published alongside the
|
|
78
|
+
source — see [`docs/`](docs/) and [`mkdocs.yml`](mkdocs.yml).
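The automatic-throttling entry in the list above mentions exponential back-off with jitter and a retry budget. A minimal sketch of that idea follows. The names `backoff_delay` and `should_retry`, their parameters, and the choice of the "full jitter" variant are assumptions made for illustration; they are not taken from SiteSavvy's source.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Return a sleep time in seconds for a zero-based retry attempt.

    Hypothetical helper: the ceiling doubles on every attempt (exponential
    back-off), is clamped to `cap`, and the delay is drawn uniformly from
    [0, ceiling] ("full jitter") so clients do not retry in lockstep.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)


def should_retry(status: int, attempt: int, max_retries: int = 3) -> bool:
    """Retry only on HTTP 429 and 5xx, and only while the budget lasts."""
    retryable = status == 429 or 500 <= status <= 599
    return retryable and attempt < max_retries
```

Passing a seeded `random.Random` as `rng` makes the delays reproducible in tests, while production code can rely on the module-level generator.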
|
|
79
|
+
|
|
80
|
+
### Changed
|
|
81
|
+
|
|
82
|
+
- Initial public release.
|
|
83
|
+
|
|
84
|
+
### Deprecated
|
|
85
|
+
|
|
86
|
+
- _Nothing._
|
|
87
|
+
|
|
88
|
+
### Removed
|
|
89
|
+
|
|
90
|
+
- _Nothing._
|
|
91
|
+
|
|
92
|
+
### Fixed
|
|
93
|
+
|
|
94
|
+
- _Nothing._
|
|
95
|
+
|
|
96
|
+
### Security
|
|
97
|
+
|
|
98
|
+
- **Default-secure behaviour**: `robots.txt` is respected by default
|
|
99
|
+
(`--respect-robots`), external links are not followed by default, and
|
|
100
|
+
the start URL is pre-flight checked against `robots.txt` before any
|
|
101
|
+
request is issued. A `PermissionError` is raised when
|
|
102
|
+
the start URL is disallowed, unless the user explicitly passes `--force`.
|
|
103
|
+
- The `legal` subcommand prints a full ethical / legal disclaimer
|
|
104
|
+
reminding users of their responsibility to respect copyright, terms of
|
|
105
|
+
service and applicable scraping laws (e.g. EU Database Directive, US
|
|
106
|
+
CFAA).
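The pre-flight check described in the first Security entry can be sketched with the standard library's `urllib.robotparser`. The function name `preflight` and its parameters are hypothetical, and the robots.txt body is passed in as a string so the example stays offline; the real code in `robots.py` may differ.

```python
from urllib.robotparser import RobotFileParser


def preflight(start_url: str, robots_txt: str, user_agent: str = "*",
              force: bool = False) -> None:
    """Raise PermissionError if robots.txt disallows start_url.

    `robots_txt` is the already-downloaded file body. With `force=True`
    the check is skipped, mirroring the documented --force override.
    """
    if force:
        return
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, start_url):
        raise PermissionError(f"robots.txt disallows {start_url}")
```

Running the check before any page request keeps the default behaviour conservative: a disallowed start URL stops the crawl before any content is downloaded.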
|
|
107
|
+
|
|
108
|
+
## Links
|
|
109
|
+
|
|
110
|
+
- [Unreleased]: https://github.com/your-org/sitesavvy/compare/v0.1.0...HEAD
|
|
111
|
+
- [0.1.0]: https://github.com/your-org/sitesavvy/releases/tag/v0.1.0
|
sitesavvy-0.1.0/LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2025 SiteSavvy Contributors
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
sitesavvy-0.1.0/PKG-INFO
ADDED
|
@@ -0,0 +1,283 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: sitesavvy
|
|
3
|
+
Version: 0.1.0
|
|
4
|
+
Summary: Capture the web, your way. A modern, async, cross-platform web scraper.
|
|
5
|
+
Project-URL: Homepage, https://github.com/your-org/sitesavvy
|
|
6
|
+
Project-URL: Documentation, https://your-org.github.io/sitesavvy/
|
|
7
|
+
Project-URL: Repository, https://github.com/your-org/sitesavvy
|
|
8
|
+
Project-URL: Issues, https://github.com/your-org/sitesavvy/issues
|
|
9
|
+
Project-URL: Changelog, https://github.com/your-org/sitesavvy/blob/main/CHANGELOG.md
|
|
10
|
+
Author: SiteSavvy Contributors
|
|
11
|
+
License-Expression: MIT
|
|
12
|
+
License-File: LICENSE
|
|
13
|
+
Keywords: async,crawler,epub,markdown,offline-reader,pdf,web-scraper
|
|
14
|
+
Classifier: Development Status :: 4 - Beta
|
|
15
|
+
Classifier: Environment :: Console
|
|
16
|
+
Classifier: Intended Audience :: Developers
|
|
17
|
+
Classifier: Intended Audience :: End Users/Desktop
|
|
18
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
19
|
+
Classifier: Operating System :: OS Independent
|
|
20
|
+
Classifier: Programming Language :: Python :: 3
|
|
21
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
22
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
23
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
24
|
+
Classifier: Programming Language :: Python :: 3.13
|
|
25
|
+
Classifier: Topic :: Internet :: WWW/HTTP
|
|
26
|
+
Classifier: Topic :: Internet :: WWW/HTTP :: Browsers
|
|
27
|
+
Classifier: Topic :: Software Development :: Libraries :: Python Modules
|
|
28
|
+
Classifier: Topic :: Utilities
|
|
29
|
+
Classifier: Typing :: Typed
|
|
30
|
+
Requires-Python: >=3.10
|
|
31
|
+
Requires-Dist: aiohttp>=3.9
|
|
32
|
+
Requires-Dist: beautifulsoup4>=4.12
|
|
33
|
+
Requires-Dist: ebooklib>=0.18
|
|
34
|
+
Requires-Dist: html2text>=2024.2.26
|
|
35
|
+
Requires-Dist: lxml>=5.0
|
|
36
|
+
Requires-Dist: markdownify>=0.13
|
|
37
|
+
Requires-Dist: playwright>=1.40
|
|
38
|
+
Requires-Dist: pyyaml>=6.0
|
|
39
|
+
Requires-Dist: rich>=13.7
|
|
40
|
+
Requires-Dist: tomlkit>=0.12
|
|
41
|
+
Requires-Dist: typer>=0.12
|
|
42
|
+
Requires-Dist: weasyprint>=60.0
|
|
43
|
+
Provides-Extra: dev
|
|
44
|
+
Requires-Dist: mypy>=1.10; extra == 'dev'
|
|
45
|
+
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
|
|
46
|
+
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
|
|
47
|
+
Requires-Dist: pytest-httpserver>=1.0; extra == 'dev'
|
|
48
|
+
Requires-Dist: pytest>=8.0; extra == 'dev'
|
|
49
|
+
Requires-Dist: ruff>=0.6; extra == 'dev'
|
|
50
|
+
Provides-Extra: docs
|
|
51
|
+
Requires-Dist: mkdocs-material>=9.5; extra == 'docs'
|
|
52
|
+
Requires-Dist: mkdocs>=1.6; extra == 'docs'
|
|
53
|
+
Provides-Extra: reppy
|
|
54
|
+
Requires-Dist: reppy>=2.0; extra == 'reppy'
|
|
55
|
+
Description-Content-Type: text/markdown
|
|
56
|
+
|
|
57
|
+
# SiteSavvy
|
|
58
|
+
|
|
59
|
+
> **Capture the web, your way.**
|
|
60
|
+
|
|
61
|
+
A modern, async, cross-platform web scraper that mirrors entire sites or
|
|
62
|
+
extracts their readable text — and exports the result as **HTML**, **Markdown**,
|
|
63
|
+
**plain text**, **PDF**, **EPUB** or a single **ZIP** archive.
|
|
64
|
+
|
|
65
|
+
Built with [`aiohttp`](https://docs.aiohttp.org/), `BeautifulSoup` + `lxml`,
|
|
66
|
+
`Typer` + `Rich`, with optional Playwright headless rendering for
|
|
67
|
+
JavaScript-heavy pages.
|
|
68
|
+
|
|
69
|
+
---
|
|
70
|
+
|
|
71
|
+
## Features
|
|
72
|
+
|
|
73
|
+
- **Two crawl modes**
|
|
74
|
+
- `full` — recursively download every reachable resource (HTML, CSS, JS,
|
|
75
|
+
images, PDFs, fonts, …) preserving the original directory hierarchy.
|
|
76
|
+
- `text` — extract the readable text from each HTML page (strips scripts,
|
|
77
|
+
navigation, ads) and store it in your chosen format.
|
|
78
|
+
- **Six output formats** (repeatable `--format`): `html`, `md`, `txt`, `pdf`,
|
|
79
|
+
`epub`, `zip`.
|
|
80
|
+
- **Polite by default**: respects `robots.txt`, enforces a per-host delay, and
|
|
81
|
+
auto-throttles on `429` / `5xx` responses.
|
|
82
|
+
- **Resume & incremental**: a JSON manifest records every fetched URL, its
|
|
83
|
+
local path and `ETag` / `Last-Modified`; `--resume` skips completed work and
|
|
84
|
+
`--incremental` re-downloads only what changed.
|
|
85
|
+
- **Concurrency control** with a global semaphore and per-host locks.
|
|
86
|
+
- **Dry-run** mode that lists the URLs that *would* be fetched.
|
|
87
|
+
- **Headless rendering** via Playwright (falls back to `aiohttp` automatically).
|
|
88
|
+
- **Fine-grained `--download-types`** filtering: `html,css,js,img,pdf,other`.
|
|
89
|
+
- **External-link gating** — stays on the start host unless you pass
|
|
90
|
+
`--external`.
|
|
91
|
+
- **Rich CLI** with progress tables and coloured output.
|
|
92
|
+
- **Cross-platform** — runs on Linux, macOS and Windows; ships a CI matrix for
|
|
93
|
+
all three.
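The concurrency-control and per-host delay features listed above can be illustrated with a global `asyncio.Semaphore` combined with one `asyncio.Lock` per host. This is a sketch under assumptions: the class name `PoliteLimiter`, its `run` method, and the decision to hold a semaphore slot while waiting out the delay are illustrative choices, not SiteSavvy's actual design.

```python
import asyncio
import time
from collections import defaultdict
from urllib.parse import urlsplit


class PoliteLimiter:
    """Hypothetical limiter: global concurrency cap plus per-host spacing."""

    def __init__(self, concurrency=4, delay=0.5):
        self._semaphore = asyncio.Semaphore(concurrency)
        self._locks = defaultdict(asyncio.Lock)  # one lock per host
        self._last_request = {}                  # host -> monotonic timestamp
        self._delay = delay

    async def run(self, url, fetch):
        """Call `await fetch(url)` while honouring both limits."""
        host = urlsplit(url).netloc
        async with self._semaphore:
            # The host lock serialises the "wait, then stamp" step so two
            # tasks cannot both decide the delay has already elapsed.
            async with self._locks[host]:
                last = self._last_request.get(host)
                if last is not None:
                    wait = self._delay - (time.monotonic() - last)
                    if wait > 0:
                        await asyncio.sleep(wait)
                self._last_request[host] = time.monotonic()
            return await fetch(url)
```

Create the limiter inside the running event loop (for example at the top of the main coroutine) and route every request through `limiter.run(url, fetch)`; requests to different hosts then proceed in parallel up to the concurrency cap, while requests to one host are spaced by at least the delay.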
|
|
94
|
+
|
|
95
|
+
---
|
|
96
|
+
|
|
97
|
+
## Installation
|
|
98
|
+
|
|
99
|
+
### From PyPI (once published)
|
|
100
|
+
|
|
101
|
+
```bash
|
|
102
|
+
pip install sitesavvy
|
|
103
|
+
```
|
|
104
|
+
|
|
105
|
+
### From source (development)
|
|
106
|
+
|
|
107
|
+
```bash
|
|
108
|
+
git clone https://github.com/your-org/sitesavvy.git
|
|
109
|
+
cd sitesavvy
|
|
110
|
+
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
|
|
111
|
+
pip install -e ".[dev]"
|
|
112
|
+
playwright install chromium # optional, only for --headless
|
|
113
|
+
```
|
|
114
|
+
|
|
115
|
+
A plain `pip install -r requirements.txt` is also supported if you prefer to
|
|
116
|
+
skip the PEP 517 build.
|
|
117
|
+
|
|
118
|
+
---
|
|
119
|
+
|
|
120
|
+
## Quick start
|
|
121
|
+
|
|
122
|
+
### Full-site mirror → ZIP
|
|
123
|
+
|
|
124
|
+
```bash
|
|
125
|
+
sitesavvy crawl https://example.com --depth 2 --format html --format zip --out-dir ./out
|
|
126
|
+
```
|
|
127
|
+
|
|
128
|
+
### Text-only crawl → Markdown + EPUB
|
|
129
|
+
|
|
130
|
+
```bash
|
|
131
|
+
sitesavvy crawl https://example.com --mode text --format md --format epub --out-dir ./reader
|
|
132
|
+
```
|
|
133
|
+
|
|
134
|
+
### Dry-run (list URLs only)
|
|
135
|
+
|
|
136
|
+
```bash
|
|
137
|
+
sitesavvy crawl https://example.com --dry-run --depth 1
|
|
138
|
+
```
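The dry-run command above lists URLs without downloading them. The enumeration can be pictured as a breadth-first search with a depth limit, sketched here over an in-memory link graph. The function `enumerate_urls`, the `links` mapping, and the convention that the start URL sits at depth 0 are assumptions for illustration; only the rule that a depth of `0` means unlimited comes from the command reference.

```python
from collections import deque


def enumerate_urls(start, links, max_depth=0):
    """Return URLs in BFS order; `max_depth=0` means no depth limit.

    `links` maps each URL to the URLs it references. A real crawler would
    discover these by fetching and parsing pages instead.
    """
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if max_depth and depth >= max_depth:
            continue  # do not expand pages at the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, depth + 1))
    return order
```

The `seen` set keeps cyclic links from being visited twice, which is what lets the search terminate on sites whose pages link back to each other.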
|
|
139
|
+
|
|
140
|
+
### Resume an interrupted crawl
|
|
141
|
+
|
|
142
|
+
```bash
|
|
143
|
+
sitesavvy crawl https://example.com --depth 3 --resume --manifest ./out/manifest.json --out-dir ./out
|
|
144
|
+
```
|
|
145
|
+
|
|
146
|
+
### Only re-download changed resources
|
|
147
|
+
|
|
148
|
+
```bash
|
|
149
|
+
sitesavvy crawl https://example.com --incremental --manifest ./out/manifest.json --out-dir ./out
|
|
150
|
+
```
|
|
151
|
+
|
|
152
|
+
### Render JavaScript pages
|
|
153
|
+
|
|
154
|
+
```bash
|
|
155
|
+
sitesavvy crawl https://spa.example.com --headless --format html
|
|
156
|
+
```
|
|
157
|
+
|
|
158
|
+
---
|
|
159
|
+
|
|
160
|
+
## Command reference
|
|
161
|
+
|
|
162
|
+
| Flag | Default | Description |
|
|
163
|
+
| --- | --- | --- |
|
|
164
|
+
| `url` *(positional)* | — | Starting URL. |
|
|
165
|
+
| `--depth INT` | `0` | Max link depth (`0` = unlimited). |
|
|
166
|
+
| `--mode {full,text}` | `full` | Full-site download or text-only extraction. |
|
|
167
|
+
| `--format …` | `html` | Output format, repeatable: `html md txt pdf epub zip`. |
|
|
168
|
+
| `--out-dir PATH` | CWD | Destination folder. |
|
|
169
|
+
| `--concurrency N` | `4` | Simultaneous HTTP requests. |
|
|
170
|
+
| `--user-agent STR` | browser-like | Custom `User-Agent` header. |
|
|
171
|
+
| `--respect-robots` / `--no-respect-robots` | on | Obey `robots.txt`. |
|
|
172
|
+
| `--delay SECS` | `0.5` | Polite delay between same-host requests. |
|
|
173
|
+
| `--resume` | off | Skip URLs already completed in the manifest. |
|
|
174
|
+
| `--manifest FILE` | `<out-dir>/manifest.json` | Manifest path. |
|
|
175
|
+
| `--dry-run` | off | List URLs that would be fetched. |
|
|
176
|
+
| `--headless` | off | Render JS pages with Playwright. |
|
|
177
|
+
| `--rate-limit {auto,fixed}` | `auto` | Back off on 429/5xx, or use fixed delay. |
|
|
178
|
+
| `--download-types …` | all | Comma-separated: `html,css,js,img,pdf,other`. |
|
|
179
|
+
| `--incremental` | off | Re-download only changed resources (conditional GET). |
|
|
180
|
+
| `--external` | off | Follow cross-domain links. |
|
|
181
|
+
| `--force` | off | Proceed even if `robots.txt` disallows the start URL. |
|
|
182
|
+
| `--timeout SECS` | `30` | Per-request timeout. |
|
|
183
|
+
| `--verbose` / `-v` | off | Enable debug logging. |
|
|
184
|
+
|
|
185
|
+
Auxiliary commands:
|
|
186
|
+
|
|
187
|
+
```bash
|
|
188
|
+
sitesavvy legal # print the legal / ethical disclaimer
|
|
189
|
+
sitesavvy info # show which optional backends are installed
|
|
190
|
+
sitesavvy --version
|
|
191
|
+
```
|
|
192
|
+
|
|
193
|
+
---
|
|
194
|
+
|
|
195
|
+
## Export-format matrix
|
|
196
|
+
|
|
197
|
+
| Format | Mode `full` | Mode `text` | Backend |
|
|
198
|
+
| --- | --- | --- | --- |
|
|
199
|
+
| `html` | original bytes, hierarchy preserved | — | built-in |
|
|
200
|
+
| `md` | — | `markdownify` (ATX headings, links absolute) | `markdownify` |
|
|
201
|
+
| `txt` | — | `html2text` (no hard wrap) | `html2text` |
|
|
202
|
+
| `pdf` | — | WeasyPrint | `weasyprint` |
|
|
203
|
+
| `epub` | — | `ebooklib`, one chapter per page | `ebooklib` |
|
|
204
|
+
| `zip` | archive of the whole crawl | archive of the whole crawl | `zipfile` |
|
|
205
|
+
|
|
206
|
+
Sample Markdown output:
|
|
207
|
+
|
|
208
|
+
```markdown
|
|
209
|
+
# Page Title
|
|
210
|
+
|
|
211
|
+
## A heading
|
|
212
|
+
|
|
213
|
+
Some paragraph text with a [link](https://example.com/page).
|
|
214
|
+
```
|
|
215
|
+
|
|
216
|
+
---
|
|
217
|
+
|
|
218
|
+
## Architecture
|
|
219
|
+
|
|
220
|
+
```
|
|
221
|
+
sitesavvy/
|
|
222
|
+
├── __init__.py # package metadata
|
|
223
|
+
├── __main__.py # python -m sitesavvy
|
|
224
|
+
├── __about__.py # version
|
|
225
|
+
├── config.py # CrawlConfig + enums
|
|
226
|
+
├── models.py # CrawlItem, FetchResult, ManifestEntry
|
|
227
|
+
├── url_utils.py # normalisation, link extraction, path mapping
|
|
228
|
+
├── robots.py # async robots.txt (reppy or stdlib fallback)
|
|
229
|
+
├── conversions.py # HTML → MD/TXT/PDF/EPUB + ZIP
|
|
230
|
+
├── manifest.py # resume / incremental state
|
|
231
|
+
├── headless.py # Playwright fetcher
|
|
232
|
+
├── crawler.py # the Crawler engine
|
|
233
|
+
├── legal.py # disclaimer text
|
|
234
|
+
├── cli.py # Typer + Rich CLI
|
|
235
|
+
└── main.py # console-script entry point
|
|
236
|
+
```
|
|
237
|
+
|
|
238
|
+
Networking layer: `aiohttp` (primary) with an optional Playwright headless
|
|
239
|
+
browser for JS-rendered pages. HTML parsing uses `beautifulsoup4` + `lxml`.
|
|
240
|
+
`robots.txt` is parsed with `reppy` when available, otherwise with the stdlib
|
|
241
|
+
`urllib.robotparser`.
|
|
242
|
+
|
|
243
|
+
---
|
|
244
|
+
|
|
245
|
+
## Troubleshooting
|
|
246
|
+
|
|
247
|
+
- **`HTTP 429 Too Many Requests`** — lower `--concurrency`, raise `--delay`,
|
|
248
|
+
and keep `--rate-limit auto` (default) so SiteSavvy backs off automatically.
|
|
249
|
+
- **Large sites** — set `--depth` to bound the crawl, run with `--dry-run`
|
|
250
|
+
first to estimate scope, and use `--resume` so an interruption doesn't waste
|
|
251
|
+
work.
|
|
252
|
+
- **PDF export fails**: WeasyPrint needs the Pango system libraries. On
|
|
253
|
+
Debian/Ubuntu: `apt install libpango-1.0-0 libpangoft2-1.0-0`. On macOS:
|
|
254
|
+
`brew install pango`. The other formats keep working even if PDF is missing.
|
|
255
|
+
- **Headless mode crashes** — run `playwright install chromium` once after
|
|
256
|
+
installing the package. Without it, SiteSavvy transparently falls back to
|
|
257
|
+
`aiohttp`.
|
|
258
|
+
- **`robots.txt disallows …`** — by default SiteSavvy honours `robots.txt`.
|
|
259
|
+
Add `--force` only if you have permission and accept responsibility.
|
|
260
|
+
|
|
261
|
+
---
|
|
262
|
+
|
|
263
|
+
## Legal & ethics
|
|
264
|
+
|
|
265
|
+
SiteSavvy is provided for **personal, non-commercial use only**. Respect the
|
|
266
|
+
copyright, terms of service, and `robots.txt` of every site you crawl. The
|
|
267
|
+
authors assume no liability for misuse. Run `sitesavvy legal` to read the full
|
|
268
|
+
disclaimer. Licensed under the [MIT License](LICENSE).
|
|
269
|
+
|
|
270
|
+
---
|
|
271
|
+
|
|
272
|
+
## Contributing
|
|
273
|
+
|
|
274
|
+
Pull requests are welcome! Please run the full check suite before submitting:
|
|
275
|
+
|
|
276
|
+
```bash
|
|
277
|
+
ruff check .
|
|
278
|
+
mypy sitesavvy
|
|
279
|
+
pytest --cov=sitesavvy --cov-report=term-missing
|
|
280
|
+
```
|
|
281
|
+
|
|
282
|
+
Coverage must stay at or above **90 %**. See the [Developer Guide](docs/developer.md)
|
|
283
|
+
for the project layout, release process and binary-building instructions.
|
|
sitesavvy-0.1.0/README.md
ADDED
|
@@ -0,0 +1,227 @@
|
|
|
1
|
+
# SiteSavvy
|
|
2
|
+
|
|
3
|
+
> **Capture the web, your way.**
|
|
4
|
+
|
|
5
|
+
A modern, async, cross-platform web scraper that mirrors entire sites or
|
|
6
|
+
extracts their readable text — and exports the result as **HTML**, **Markdown**,
|
|
7
|
+
**plain text**, **PDF**, **EPUB** or a single **ZIP** archive.
|
|
8
|
+
|
|
9
|
+
Built with [`aiohttp`](https://docs.aiohttp.org/), `BeautifulSoup` + `lxml`,
|
|
10
|
+
`Typer` + `Rich`, with optional Playwright headless rendering for
|
|
11
|
+
JavaScript-heavy pages.
|
|
12
|
+
|
|
13
|
+
---
|
|
14
|
+
|
|
15
|
+
## Features
|
|
16
|
+
|
|
17
|
+
- **Two crawl modes**
|
|
18
|
+
- `full` — recursively download every reachable resource (HTML, CSS, JS,
|
|
19
|
+
images, PDFs, fonts, …) preserving the original directory hierarchy.
|
|
20
|
+
- `text` — extract the readable text from each HTML page (strips scripts,
|
|
21
|
+
navigation, ads) and store it in your chosen format.
|
|
22
|
+
- **Six output formats** (repeatable `--format`): `html`, `md`, `txt`, `pdf`,
|
|
23
|
+
`epub`, `zip`.
|
|
24
|
+
- **Polite by default**: respects `robots.txt`, enforces a per-host delay, and
|
|
25
|
+
auto-throttles on `429` / `5xx` responses.
|
|
26
|
+
- **Resume & incremental**: a JSON manifest records every fetched URL, its
|
|
27
|
+
local path and `ETag` / `Last-Modified`; `--resume` skips completed work and
|
|
28
|
+
`--incremental` re-downloads only what changed.
|
|
29
|
+
- **Concurrency control** with a global semaphore and per-host locks.
|
|
30
|
+
- **Dry-run** mode that lists the URLs that *would* be fetched.
|
|
31
|
+
- **Headless rendering** via Playwright (falls back to `aiohttp` automatically).
|
|
32
|
+
- **Fine-grained `--download-types`** filtering: `html,css,js,img,pdf,other`.
|
|
33
|
+
- **External-link gating** — stays on the start host unless you pass
|
|
34
|
+
`--external`.
|
|
35
|
+
- **Rich CLI** with progress tables and coloured output.
|
|
36
|
+
- **Cross-platform** — runs on Linux, macOS and Windows; ships a CI matrix for
|
|
37
|
+
all three.
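The resume and incremental feature in the list above relies on stored `ETag` / `Last-Modified` values to issue conditional GETs. A sketch of how such validators could be turned into request headers follows. `ManifestEntry` is named in the architecture listing, but the field names used here, and the helpers `conditional_headers` and `is_unchanged`, are assumptions rather than the package's real definitions.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ManifestEntry:
    """Hypothetical shape of one manifest record (field names are assumed)."""
    url: str
    local_path: str
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(entry: Optional[ManifestEntry]) -> Dict[str, str]:
    """Build the validators for a conditional GET from a previous fetch."""
    headers: Dict[str, str] = {}
    if entry is None:
        return headers  # never fetched before: plain GET
    if entry.etag:
        headers["If-None-Match"] = entry.etag
    if entry.last_modified:
        headers["If-Modified-Since"] = entry.last_modified
    return headers


def is_unchanged(status: int) -> bool:
    """A 304 Not Modified response means the stored copy is still current."""
    return status == 304
```

When the server answers `304`, the crawler can keep the file already on disk and skip the download, which is the saving the incremental mode is meant to provide.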
|
|
38
|
+
|
|
39
|
+
---
|
|
40
|
+
|
|
41
|
+
## Installation
|
|
42
|
+
|
|
43
|
+
### From PyPI (once published)
|
|
44
|
+
|
|
45
|
+
```bash
|
|
46
|
+
pip install sitesavvy
|
|
47
|
+
```
|
|
48
|
+
|
|
49
|
+
### From source (development)
|
|
50
|
+
|
|
51
|
+
```bash
|
|
52
|
+
git clone https://github.com/your-org/sitesavvy.git
|
|
53
|
+
cd sitesavvy
|
|
54
|
+
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
|
|
55
|
+
pip install -e ".[dev]"
|
|
56
|
+
playwright install chromium # optional, only for --headless
|
|
57
|
+
```
|
|
58
|
+
|
|
59
|
+
A plain `pip install -r requirements.txt` is also supported if you prefer to
|
|
60
|
+
skip the PEP 517 build.
|
|
61
|
+
|
|
62
|
+
---
|
|
63
|
+
|
|
64
|
+
## Quick start
|
|
65
|
+
|
|
66
|
+
### Full-site mirror → ZIP
|
|
67
|
+
|
|
68
|
+
```bash
|
|
69
|
+
sitesavvy crawl https://example.com --depth 2 --format html --format zip --out-dir ./out
|
|
70
|
+
```
|
|
71
|
+
|
|
72
|
+
### Text-only crawl → Markdown + EPUB
|
|
73
|
+
|
|
74
|
+
```bash
|
|
75
|
+
sitesavvy crawl https://example.com --mode text --format md --format epub --out-dir ./reader
|
|
76
|
+
```
|
|
77
|
+
|
|
78
|
+
### Dry-run (list URLs only)
|
|
79
|
+
|
|
80
|
+
```bash
|
|
81
|
+
sitesavvy crawl https://example.com --dry-run --depth 1
|
|
82
|
+
```
|
|
83
|
+
|
|
84
|
+
### Resume an interrupted crawl
|
|
85
|
+
|
|
86
|
+
```bash
|
|
87
|
+
sitesavvy crawl https://example.com --depth 3 --resume --manifest ./out/manifest.json --out-dir ./out
|
|
88
|
+
```
|
|
89
|
+
|
|
90
|
+
### Only re-download changed resources
|
|
91
|
+
|
|
92
|
+
```bash
|
|
93
|
+
sitesavvy crawl https://example.com --incremental --manifest ./out/manifest.json --out-dir ./out
|
|
94
|
+
```
|
|
95
|
+
|
|
96
|
+
### Render JavaScript pages
|
|
97
|
+
|
|
98
|
+
```bash
|
|
99
|
+
sitesavvy crawl https://spa.example.com --headless --format html
|
|
100
|
+
```
|
|
101
|
+
|
|
102
|
+
---
|
|
103
|
+
|
|
104
|
+
## Command reference
|
|
105
|
+
|
|
106
|
+
| Flag | Default | Description |
|
|
107
|
+
| --- | --- | --- |
|
|
108
|
+
| `url` *(positional)* | — | Starting URL. |
|
|
109
|
+
| `--depth INT` | `0` | Max link depth (`0` = unlimited). |
|
|
110
|
+
| `--mode {full,text}` | `full` | Full-site download or text-only extraction. |
|
|
111
|
+
| `--format …` | `html` | Output format, repeatable: `html md txt pdf epub zip`. |
|
|
112
|
+
| `--out-dir PATH` | CWD | Destination folder. |
|
|
113
|
+
| `--concurrency N` | `4` | Simultaneous HTTP requests. |
|
|
114
|
+
| `--user-agent STR` | browser-like | Custom `User-Agent` header. |
|
|
115
|
+
| `--respect-robots` / `--no-respect-robots` | on | Obey `robots.txt`. |
|
|
116
|
+
| `--delay SECS` | `0.5` | Polite delay between same-host requests. |
|
|
117
|
+
| `--resume` | off | Skip URLs already completed in the manifest. |
|
|
118
|
+
| `--manifest FILE` | `<out-dir>/manifest.json` | Manifest path. |
|
|
119
|
+
| `--dry-run` | off | List URLs that would be fetched. |
|
|
120
|
+
| `--headless` | off | Render JS pages with Playwright. |
|
|
121
|
+
| `--rate-limit {auto,fixed}` | `auto` | Back off on 429/5xx, or use fixed delay. |
|
|
122
|
+
| `--download-types …` | all | Comma-separated: `html,css,js,img,pdf,other`. |
|
|
123
|
+
| `--incremental` | off | Re-download only changed resources (conditional GET). |
|
|
124
|
+
| `--external` | off | Follow cross-domain links. |
|
|
125
|
+
| `--force` | off | Proceed even if `robots.txt` disallows the start URL. |
|
|
126
|
+
| `--timeout SECS` | `30` | Per-request timeout. |
|
|
127
|
+
| `--verbose` / `-v` | off | Enable debug logging. |
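The `--download-types` row in the table above accepts comma-separated categories and defaults to all of them. One way such input might be normalised is sketched below. The function `parse_download_types` and its error handling are hypothetical; only the six category names come from the table.

```python
ALLOWED_TYPES = ("html", "css", "js", "img", "pdf", "other")


def parse_download_types(values):
    """Normalise repeated and/or comma-separated values into a set.

    An empty selection falls back to every category, matching the
    documented default of "all".
    """
    selected = set()
    for value in values:
        for token in value.split(","):
            token = token.strip().lower()
            if not token:
                continue  # tolerate stray commas and whitespace
            if token not in ALLOWED_TYPES:
                raise ValueError(f"unknown download type: {token!r}")
            selected.add(token)
    return selected or set(ALLOWED_TYPES)
```

Rejecting unknown tokens early gives the user a clear error for a typo instead of a crawl that silently downloads nothing.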
|
|
128
|
+
|
|
129
|
+
Auxiliary commands:
|
|
130
|
+
|
|
131
|
+
```bash
|
|
132
|
+
sitesavvy legal # print the legal / ethical disclaimer
|
|
133
|
+
sitesavvy info # show which optional backends are installed
|
|
134
|
+
sitesavvy --version
|
|
135
|
+
```
|
|
136
|
+
|
|
137
|
+
---
|
|
138
|
+
|
|
139
|
+
## Export-format matrix
|
|
140
|
+
|
|
141
|
+
| Format | Mode `full` | Mode `text` | Backend |
|
|
142
|
+
| --- | --- | --- | --- |
|
|
143
|
+
| `html` | original bytes, hierarchy preserved | — | built-in |
|
|
144
|
+
| `md` | — | `markdownify` (ATX headings, links absolute) | `markdownify` |
|
|
145
|
+
| `txt` | — | `html2text` (no hard wrap) | `html2text` |
|
|
146
|
+
| `pdf` | — | WeasyPrint | `weasyprint` |
|
|
147
|
+
| `epub` | — | `ebooklib`, one chapter per page | `ebooklib` |
|
|
148
|
+
| `zip` | archive of the whole crawl | archive of the whole crawl | `zipfile` |
|
|
149
|
+
|
|
150
|
+
Sample Markdown output:
|
|
151
|
+
|
|
152
|
+
```markdown
|
|
153
|
+
# Page Title
|
|
154
|
+
|
|
155
|
+
## A heading
|
|
156
|
+
|
|
157
|
+
Some paragraph text with a [link](https://example.com/page).
|
|
158
|
+
```
|
|
159
|
+
|
|
160
|
+
---
|
|
161
|
+
|
|
162
|
+
## Architecture
|
|
163
|
+
|
|
164
|
+
```
|
|
165
|
+
sitesavvy/
|
|
166
|
+
├── __init__.py # package metadata
|
|
167
|
+
├── __main__.py # python -m sitesavvy
|
|
168
|
+
├── __about__.py # version
|
|
169
|
+
├── config.py # CrawlConfig + enums
|
|
170
|
+
├── models.py # CrawlItem, FetchResult, ManifestEntry
|
|
171
|
+
├── url_utils.py # normalisation, link extraction, path mapping
|
|
172
|
+
├── robots.py # async robots.txt (reppy or stdlib fallback)
|
|
173
|
+
├── conversions.py # HTML → MD/TXT/PDF/EPUB + ZIP
|
|
174
|
+
├── manifest.py # resume / incremental state
|
|
175
|
+
├── headless.py # Playwright fetcher
|
|
176
|
+
├── crawler.py # the Crawler engine
|
|
177
|
+
├── legal.py # disclaimer text
|
|
178
|
+
├── cli.py # Typer + Rich CLI
|
|
179
|
+
└── main.py # console-script entry point
|
|
180
|
+
```
|
|
181
|
+
|
|
182
|
+
Networking layer: `aiohttp` (primary) with an optional Playwright headless
|
|
183
|
+
browser for JS-rendered pages. HTML parsing uses `beautifulsoup4` + `lxml`.
|
|
184
|
+
`robots.txt` is parsed with `reppy` when available, otherwise with the stdlib
|
|
185
|
+
`urllib.robotparser`.
|
|
186
|
+
|
|
187
|
+
---
|
|
188
|
+
|
|
189
|
+
## Troubleshooting
|
|
190
|
+
|
|
191
|
+
- **`HTTP 429 Too Many Requests`** — lower `--concurrency`, raise `--delay`,
|
|
192
|
+
and keep `--rate-limit auto` (default) so SiteSavvy backs off automatically.
|
|
193
|
+
- **Large sites** — set `--depth` to bound the crawl, run with `--dry-run`
|
|
194
|
+
first to estimate scope, and use `--resume` so an interruption doesn't waste
|
|
195
|
+
work.
|
|
196
|
+
- **PDF export fails**: WeasyPrint needs the Pango system libraries. On
|
|
197
|
+
Debian/Ubuntu: `apt install libpango-1.0-0 libpangoft2-1.0-0`. On macOS:
|
|
198
|
+
`brew install pango`. The other formats keep working even if PDF is missing.
|
|
199
|
+
- **Headless mode crashes** — run `playwright install chromium` once after
|
|
200
|
+
installing the package. Without it, SiteSavvy transparently falls back to
|
|
201
|
+
`aiohttp`.
|
|
202
|
+
- **`robots.txt disallows …`** — by default SiteSavvy honours `robots.txt`.
|
|
203
|
+
Add `--force` only if you have permission and accept responsibility.
|
|
204
|
+
|
|
205
|
+
---
|
|
206
|
+
|
|
207
|
+
## Legal & ethics
|
|
208
|
+
|
|
209
|
+
SiteSavvy is provided for **personal, non-commercial use only**. Respect the
|
|
210
|
+
copyright, terms of service, and `robots.txt` of every site you crawl. The
|
|
211
|
+
authors assume no liability for misuse. Run `sitesavvy legal` to read the full
|
|
212
|
+
disclaimer. Licensed under the [MIT License](LICENSE).
|
|
213
|
+
|
|
214
|
+
---
|
|
215
|
+
|
|
216
|
+
## Contributing
|
|
217
|
+
|
|
218
|
+
Pull requests are welcome! Please run the full check suite before submitting:
|
|
219
|
+
|
|
220
|
+
```bash
|
|
221
|
+
ruff check .
|
|
222
|
+
mypy sitesavvy
|
|
223
|
+
pytest --cov=sitesavvy --cov-report=term-missing
|
|
224
|
+
```
|
|
225
|
+
|
|
226
|
+
Coverage must stay at or above **90 %**. See the [Developer Guide](docs/developer.md)
|
|
227
|
+
for the project layout, release process and binary-building instructions.
|