web2md 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- web2md-0.1.0/CHANGELOG.md +12 -0
- web2md-0.1.0/LICENSE +21 -0
- web2md-0.1.0/MANIFEST.in +5 -0
- web2md-0.1.0/PKG-INFO +360 -0
- web2md-0.1.0/README.md +315 -0
- web2md-0.1.0/setup.cfg +4 -0
- web2md-0.1.0/setup.py +85 -0
- web2md-0.1.0/web2md/__init__.py +10 -0
- web2md-0.1.0/web2md/__pycache__/__init__.cpython-313.pyc +0 -0
- web2md-0.1.0/web2md/__pycache__/cli.cpython-313.pyc +0 -0
- web2md-0.1.0/web2md/__pycache__/version.cpython-313.pyc +0 -0
- web2md-0.1.0/web2md/cli.py +612 -0
- web2md-0.1.0/web2md/version.py +33 -0
- web2md-0.1.0/web2md.egg-info/PKG-INFO +360 -0
- web2md-0.1.0/web2md.egg-info/SOURCES.txt +18 -0
- web2md-0.1.0/web2md.egg-info/dependency_links.txt +1 -0
- web2md-0.1.0/web2md.egg-info/entry_points.txt +2 -0
- web2md-0.1.0/web2md.egg-info/not-zip-safe +1 -0
- web2md-0.1.0/web2md.egg-info/requires.txt +5 -0
- web2md-0.1.0/web2md.egg-info/top_level.txt +1 -0
web2md-0.1.0/CHANGELOG.md
ADDED
@@ -0,0 +1,12 @@

## 📝 Changelog

### v0.1.0 (2026-01-28)
- ✨ Initial release
- 🚀 Dynamic site crawling with Playwright
- 🔄 Recursive subpage crawling with depth/count controls
- 🖼️ Optional image and video downloads
- 🎯 Smart base URL resolution using `document.baseURI`
- 🧹 Clean Markdown output with content extraction
- 🔒 SSL certificate error handling
- 🔗 Local link conversion for offline browsing
web2md-0.1.0/LICENSE
ADDED
@@ -0,0 +1,21 @@

MIT License

Copyright (c) [2026] [Liming Xie]

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
web2md-0.1.0/MANIFEST.in
ADDED
web2md-0.1.0/PKG-INFO
ADDED
@@ -0,0 +1,360 @@

Metadata-Version: 2.4
Name: web2md
Version: 0.1.0
Summary: A CLI tool to crawl dynamic/static websites and convert content to clean Markdown
Home-page: https://github.com/floatinghotpot/web2md
Author: Liming Xie
Author-email: liming.xie@gmail.com
License: MIT
Keywords: crawler,markdown,web2md,scraper,dynamic website,html2md
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: End Users/Desktop
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Text Processing :: Markup :: HTML
Classifier: Topic :: Text Processing :: Markup :: Markdown
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: playwright>=1.40.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: markdownify>=0.11.6
Requires-Dist: lxml>=4.9.0
Requires-Dist: requests>=2.31.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary
# web2md

[MIT License](https://opensource.org/licenses/MIT)
[Python 3.8+](https://www.python.org/downloads/)

A powerful, intelligent CLI tool to crawl **dynamic and static websites** with full JavaScript rendering support and convert them to clean, well-formatted Markdown files. Perfect for archiving documentation, creating offline knowledge bases, and preserving web content.

## ✨ Key Features

- 🚀 **Dynamic Site Support**: Full JavaScript rendering via Playwright (Vue/React/Angular/Next.js)
- 🎯 **Smart Content Extraction**: Automatically identifies and extracts core content, removing navigation, ads, and sidebars
- 🔄 **Recursive Crawling**: Intelligently crawls subpages with configurable depth and count limits
- 🖼️ **Media Downloads**: Optional image and video downloading with lazy-loading support
- 📍 **Base URL Intelligence**: Uses the browser's `document.baseURI` for accurate relative path resolution
- 🔗 **Local Link Conversion**: Automatically converts HTML links to local Markdown relative paths
- 🧹 **Clean Output**: Preserves tables, code blocks, images, links, and heading hierarchies
- 🔒 **SSL Flexibility**: Handles sites with certificate issues gracefully
- 🌍 **Cross-Platform**: Works on Windows, macOS, and Linux (Python 3.8+)
- 📝 **Universal Compatibility**: Generated Markdown works with Typora, Obsidian, VS Code, and more

## 📦 Installation

### Option 1: Install from PyPI (Recommended)
```bash
pip3 install web2md
```

### Option 2: Install from Source (For Development)
```bash
git clone https://github.com/floatinghotpot/web2md.git
cd web2md
python3 -m pip install -e .
```

### Required: Install Playwright Browser
```bash
# Install Chromium driver (required for JavaScript rendering)
python3 -m playwright install chromium

# Linux only: install system dependencies
python3 -m playwright install-deps chromium
```

## 🚀 Quick Start

### Basic Usage
```bash
# Crawl a single page (auto-generated save directory)
web2md https://docs.python.org/3/tutorial/

# Specify a custom save directory
web2md https://docs.python.org/3/tutorial/ ./python-docs

# Crawl with images
web2md https://example.com/docs --picture

# Limit crawl depth and page count
web2md https://example.com/docs --depth 2 --count 10

# Crawl with images and videos
web2md https://example.com/docs --picture --video --depth 3
```

### Show Help
```bash
web2md -h
```

## 📖 Usage

### Command Syntax
```
web2md [URL] [SAVE_DIR] [OPTIONS]
```

### Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| `web_url` | ✅ Yes | Target webpage URL (must start with http/https) |
| `save_folder` | ❌ No | Local save directory (auto-generated from URL if omitted) |

### Options

| Option | Default | Description |
|--------|---------|-------------|
| `--depth N` | `5` | Maximum crawl depth relative to the base URL |
| `--count N` | `999` | Maximum number of pages to crawl (0 = unlimited) |
| `--picture` | `False` | Download and save images to a local `images/` directory |
| `--video` | `False` | Download and save videos to a local `videos/` directory |
| `-h, --help` | - | Show help message and exit |
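To make the `--depth` limit concrete, here is a rough model of depth bookkeeping relative to the base URL. This is an illustration of the documented behavior, not code from `cli.py`; the helper name `relative_depth` is hypothetical:

```python
from urllib.parse import urlsplit

def relative_depth(url: str, base_url: str) -> int:
    """Count path segments below the crawl's base URL (hypothetical helper).

    Returns -1 for URLs outside the base URL scope.
    """
    if not url.startswith(base_url):
        return -1  # out of scope, never crawled
    rest = urlsplit(url[len(base_url):]).path.strip("/")
    return 0 if not rest else rest.count("/") + 1
```

Under this model, `--depth 2` from `https://company.com/docs/` would include `/docs/api/auth` (depth 2) but not `/docs/api/auth/tokens` (depth 3).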
### Examples

#### 1. Unlimited Crawl with Depth Limit
```bash
web2md https://company.com/docs/home company-docs --depth 2
```
- Crawls all pages within 2 levels of `/docs/`
- Saves to `./company-docs/`

#### 2. Limited Page Count
```bash
web2md https://company.com/docs/home company-docs --depth 2 --count 5
```
- Stops after crawling 5 pages
- Useful for testing or sampling large sites

#### 3. Crawl with Images
```bash
web2md https://company.com/docs/home --picture --count 3
```
- Downloads images to an `images/` subdirectory
- Converts image URLs to local relative paths in the Markdown

#### 4. Auto-Generated Save Directory
```bash
web2md https://company.com/docs/home --depth 1 --count 10
```
- Auto-creates the directory `company_com_docs/`

## 🎯 How It Works

### 1. Base URL Calculation
The tool automatically determines a **base URL** from your target URL:
- Target: `https://company.com/docs/home` → Base: `https://company.com/docs/`
- All crawling is scoped to pages under this base URL
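The "drop the last path segment" rule above can be reproduced with the standard library by resolving `"."` against the target URL. A minimal sketch of the idea (not necessarily how `cli.py` computes it):

```python
from urllib.parse import urljoin

def base_url_of(target: str) -> str:
    """Resolve '.' against the target URL to get its containing directory."""
    return urljoin(target, ".")
```

`urljoin` handles trailing slashes for free: a target that already ends in `/` is its own base.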
### 2. Intelligent Path Resolution
Uses the browser's `document.baseURI` to correctly resolve relative URLs:
- Handles `<base>` tags in HTML
- Respects redirects and trailing slashes
- Resolves lazy-loaded images with `data-src`, `srcset`, etc.

### 3. Smart Content Extraction
Automatically identifies core content using priority selectors:
1. `<main>` tag
2. `.article-content` or `.article_content`
3. `#main-content`
4. `.content`
5. `<article>` tag
6. Fallback to `<body>` (with cleanup)
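A minimal sketch of that priority loop with BeautifulSoup (the candidate list mirrors the order above; `extract_core` is an illustrative name, and the real code also strips navigation tags, which is omitted here):

```python
from bs4 import BeautifulSoup

# Tried in order; the first match wins (assumed ordering, per the list above)
CANDIDATES = [
    ("main", {}),
    ("div", {"class_": "article-content"}),
    ("div", {"id": "main-content"}),
    ("div", {"class_": "content"}),
    ("article", {}),
]

def extract_core(html: str):
    soup = BeautifulSoup(html, "html.parser")
    for name, attrs in CANDIDATES:
        node = soup.find(name, **attrs)
        if node:
            return node
    return soup.body  # fallback; cleanup of nav/footer omitted in this sketch
```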
### 4. Media Handling
When `--picture` or `--video` is enabled, the tool:
- Downloads media files to `images/` or `videos/` subdirectories
- Generates unique filenames with an MD5 hash to prevent duplicates
- Converts URLs to local relative paths in the Markdown
- Supports lazy-loading attributes: `data-src`, `data-original`, `srcset`
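The hash-based naming can be sketched as follows (an illustration of the scheme, not the exact code; the digest length and the `.jpg` fallback extension are assumptions):

```python
import hashlib
import os
from urllib.parse import urlsplit

def media_filename(url: str, default_ext: str = ".jpg") -> str:
    """Derive a stable local name from a media URL (illustrative sketch)."""
    # Keep the real extension if the URL path has one
    ext = os.path.splitext(urlsplit(url).path)[1].lower() or default_ext
    # MD5 of the full URL: the same URL always maps to the same file
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:16] + ext
```

Because the name depends only on the URL, re-crawling the same page maps each image back to the same local file instead of downloading duplicates.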
### 5. Filename Generation
Markdown filenames are generated from URLs:
- Remove the base URL prefix
- Replace `/` with `_`
- Filter illegal filename characters
- Example: `https://company.com/docs/api/auth` → `api_auth.md`
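Those three steps amount to something like the following (a sketch matching the example above; edge-case handling in the real `cli.py` may differ, and the `index` fallback for the base URL itself is an assumption):

```python
import re

def md_filename(url: str, base_url: str) -> str:
    """URL -> Markdown filename, per the three steps above (sketch)."""
    rel = url[len(base_url):] if url.startswith(base_url) else url
    rel = rel.strip("/").replace("/", "_")          # flatten path separators
    rel = re.sub(r'[\\<>:"|?*]', "_", rel)          # filter illegal characters
    return (rel or "index") + ".md"                 # assumed fallback name
```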
## ⚙️ Configuration

### Built-in Settings (in `web2md/cli.py`)

#### Playwright Configuration
```python
PLAYWRIGHT_CONFIG = {
    "headless": False,               # Set to True for background crawling
    "timeout": 60000,                # Page load timeout (ms)
    "wait_for_load": "networkidle",  # Wait strategy
    "sleep_after_load": 2,           # Additional wait time (seconds)
    "user_agent": "Mozilla/5.0..."   # Custom user agent
}
```

#### Media Configuration
```python
MEDIA_CONFIG = {
    "timeout": 30000,        # Media download timeout (ms)
    "image_dir": "images",   # Image save subdirectory
    "video_dir": "videos",   # Video save subdirectory
    "allowed_img_ext": [".jpg", ".jpeg", ".png", ".gif", ".bmp", ".svg", ".webp"],
    "allowed_vid_ext": [".mp4", ".avi", ".mov", ".webm", ".flv", ".mkv"]
}
```

#### Content Filtering
```python
REMOVE_TAGS = ["nav", "header", "footer", "aside", "script", "style", "iframe", "sidebar"]

CORE_CONTENT_SELECTORS = [
    ("main", {}),
    ("div", {"class_": "article-content"}),
    ("article", {})
]
```

#### Crawl Defaults
```python
DEFAULT_CRAWL_CONFIG = {
    "max_depth": 5,     # Default max depth
    "max_count": 999,   # Default max pages
    "allowed_schemes": ["http", "https"],
    "exclude_patterns": [r"\.pdf$", r"\.zip$", r"\.exe$"]
}
```
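Taken together, `allowed_schemes`, the base URL scope, and `exclude_patterns` suggest a link filter along these lines (illustrative; `should_crawl` is not a documented function of the tool):

```python
import re
from urllib.parse import urlsplit

# Mirrors DEFAULT_CRAWL_CONFIG above
ALLOWED_SCHEMES = ["http", "https"]
EXCLUDE_PATTERNS = [r"\.pdf$", r"\.zip$", r"\.exe$"]

def should_crawl(url: str, base_url: str) -> bool:
    """Decide whether a discovered link is eligible for crawling (sketch)."""
    if urlsplit(url).scheme not in ALLOWED_SCHEMES:
        return False
    if not url.startswith(base_url):
        return False  # stay inside the base URL scope
    path = urlsplit(url).path
    return not any(re.search(p, path) for p in EXCLUDE_PATTERNS)
```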
## 🔧 Advanced Usage

### Debug Mode (Show Browser)
Edit `web2md/cli.py` and set:
```python
PLAYWRIGHT_CONFIG = {
    "headless": False,  # Shows the browser window
    ...
}
```

### Custom Content Selectors
Add site-specific selectors to `CORE_CONTENT_SELECTORS`:
```python
CORE_CONTENT_SELECTORS = [
    ("main", {}),
    ("div", {"class_": "documentation-content"}),  # Custom selector
    ("article", {})
]
```

### Anti-Bot Detection
Install and use `playwright-stealth`:
```bash
pip3 install playwright-stealth
```

Add to `get_dynamic_html()` in `web2md/cli.py`:
```python
from playwright_stealth import stealth_sync

page = context.new_page()
stealth_sync(page)  # Add this line
page.goto(url, ...)
```

### Authentication
Add login logic in `get_dynamic_html()` before `page.goto()`:
```python
page.goto("https://example.com/login")
page.fill("#username", "your-username")
page.fill("#password", "your-password")
page.click("#login-button")
time.sleep(2)
```

## 🐛 Troubleshooting

### SSL Certificate Errors
The tool automatically disables SSL verification for downloads. If you still encounter issues, check your network/firewall settings.

### Timeout Errors
Increase the timeout in `PLAYWRIGHT_CONFIG`:
```python
"timeout": 120000,  # 2 minutes
```

### Missing Content
1. Check whether the content lives in `<main>` or another common content tag
2. Add custom selectors to `CORE_CONTENT_SELECTORS`
3. Run with `headless: False` to debug visually

### Image Download Failures
- Verify the image URLs are accessible
- Check whether the images require authentication
- Some CDNs may block automated downloads

## 📋 Dependencies

Automatically installed via `pip`:
- **playwright** - Browser automation and JS rendering
- **beautifulsoup4** - HTML parsing and manipulation
- **lxml** - Fast XML/HTML parser
- **markdownify** - HTML to Markdown conversion
- **requests** - HTTP client for media downloads

## 🤝 Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Run tests (if available)
5. Commit your changes (`git commit -m 'Add amazing feature'`)
6. Push to the branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request

### Development Setup
```bash
git clone https://github.com/floatinghotpot/web2md.git
cd web2md
python3 -m pip install -e .
python3 -m playwright install chromium
```

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- [Playwright](https://playwright.dev/) for powerful browser automation
- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for HTML parsing
- [markdownify](https://github.com/matthewwithanm/markdownify) for clean Markdown conversion

---

**Made with ❤️ for developers, researchers, and documentation enthusiasts.**

If you find this tool useful, please consider giving it a ⭐ on GitHub!