web2md 0.1.0__tar.gz

@@ -0,0 +1,12 @@
1
+ ## 📊 Changelog
2
+
3
+ ### v0.1.0 (2026-01-28)
4
+ - ✨ Initial release
5
+ - 🚀 Dynamic site crawling with Playwright
6
+ - 🔗 Recursive subpage crawling with depth/count controls
7
+ - 🖼️ Optional image and video downloads
8
+ - 🎯 Smart base URL resolution using `document.baseURI`
9
+ - 🧹 Clean Markdown output with content extraction
10
+ - 🔒 SSL certificate error handling
11
+ - Local link conversion for offline browsing
12
+
web2md-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) [2026] [Liming Xie]
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,5 @@
1
+ # MANIFEST.in
2
+ include README.md
3
+ include CHANGELOG.md
4
+ include LICENSE
5
+ recursive-include web2md *
web2md-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,360 @@
1
+ Metadata-Version: 2.4
2
+ Name: web2md
3
+ Version: 0.1.0
4
+ Summary: A CLI tool to crawl dynamic/static websites and convert content to clean Markdown
5
+ Home-page: https://github.com/floatinghotpot/web2md
6
+ Author: Liming Xie
7
+ Author-email: liming.xie@gmail.com
8
+ License: MIT
9
+ Keywords: crawler,markdown,web2md,scraper,dynamic website,html2md
10
+ Classifier: Development Status :: 3 - Alpha
11
+ Classifier: Intended Audience :: Developers
12
+ Classifier: Intended Audience :: End Users/Desktop
13
+ Classifier: License :: OSI Approved :: MIT License
14
+ Classifier: Operating System :: OS Independent
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Programming Language :: Python :: 3.8
17
+ Classifier: Programming Language :: Python :: 3.9
18
+ Classifier: Programming Language :: Python :: 3.10
19
+ Classifier: Programming Language :: Python :: 3.11
20
+ Classifier: Programming Language :: Python :: 3.12
21
+ Classifier: Topic :: Internet :: WWW/HTTP
22
+ Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
23
+ Classifier: Topic :: Text Processing :: Markup :: HTML
24
+ Classifier: Topic :: Text Processing :: Markup :: Markdown
25
+ Requires-Python: >=3.8
26
+ Description-Content-Type: text/markdown
27
+ License-File: LICENSE
28
+ Requires-Dist: playwright>=1.40.0
29
+ Requires-Dist: beautifulsoup4>=4.12.0
30
+ Requires-Dist: markdownify>=0.11.6
31
+ Requires-Dist: lxml>=4.9.0
32
+ Requires-Dist: requests>=2.31.0
33
+ Dynamic: author
34
+ Dynamic: author-email
35
+ Dynamic: classifier
36
+ Dynamic: description
37
+ Dynamic: description-content-type
38
+ Dynamic: home-page
39
+ Dynamic: keywords
40
+ Dynamic: license
41
+ Dynamic: license-file
42
+ Dynamic: requires-dist
43
+ Dynamic: requires-python
44
+ Dynamic: summary
45
+
46
+ # web2md
47
+
48
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
49
+ [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
50
+
51
+ A powerful, intelligent CLI tool to crawl **dynamic and static websites** with full JavaScript rendering support and convert them to clean, well-formatted Markdown files. Perfect for archiving documentation, creating offline knowledge bases, and preserving web content.
52
+
53
+ ## ✨ Key Features
54
+
55
+ - 🚀 **Dynamic Site Support**: Full JavaScript rendering via Playwright (Vue/React/Angular/Next.js)
56
+ - 🎯 **Smart Content Extraction**: Automatically identifies and extracts core content, removing navigation, ads, and sidebars
57
+ - 🔗 **Recursive Crawling**: Intelligently crawls subpages with configurable depth and count limits
58
+ - 🖼️ **Media Downloads**: Optional image and video downloading with lazy-loading support
59
+ - **Base URL Intelligence**: Uses browser's `document.baseURI` for accurate relative path resolution
60
+ - 🔄 **Local Link Conversion**: Automatically converts HTML links to local Markdown relative paths
61
+ - 🧹 **Clean Output**: Preserves tables, code blocks, images, links, and heading hierarchies
62
+ - 🔒 **SSL Flexibility**: Handles sites with certificate issues gracefully
63
+ - **Cross-Platform**: Works on Windows, macOS, and Linux (Python 3.8+)
64
+ - 📋 **Universal Compatibility**: Generated Markdown works with Typora, Obsidian, VS Code, and more
65
+
66
+ ## 📦 Installation
67
+
68
+ ### Option 1: Install from PyPI (Recommended)
69
+ ```bash
70
+ pip3 install web2md
71
+ ```
72
+
73
+ ### Option 2: Install from Source (For Development)
74
+ ```bash
75
+ git clone https://github.com/floatinghotpot/web2md.git
76
+ cd web2md
77
+ python3 -m pip install -e .
78
+ ```
79
+
80
+ ### Required: Install Playwright Browser
81
+ ```bash
82
+ # Install Chromium driver (required for JavaScript rendering)
83
+ python3 -m playwright install chromium
84
+
85
+ # Linux only: Install system dependencies
86
+ python3 -m playwright install-deps chromium
87
+ ```
88
+
89
+ ## 🚀 Quick Start
90
+
91
+ ### Basic Usage
92
+ ```bash
93
+ # Crawl a single page (auto-generated save directory)
94
+ web2md https://docs.python.org/3/tutorial/
95
+
96
+ # Specify custom save directory
97
+ web2md https://docs.python.org/3/tutorial/ ./python-docs
98
+
99
+ # Crawl with images
100
+ web2md https://example.com/docs --picture
101
+
102
+ # Limit crawl depth and count
103
+ web2md https://example.com/docs --depth 2 --count 10
104
+
105
+ # Crawl with images and videos
106
+ web2md https://example.com/docs --picture --video --depth 3
107
+ ```
108
+
109
+ ### Show Help
110
+ ```bash
111
+ web2md -h
112
+ ```
113
+
114
+ ## 📖 Usage
115
+
116
+ ### Command Syntax
117
+ ```
118
+ web2md URL [SAVE_DIR] [OPTIONS]
119
+ ```
120
+
121
+ ### Arguments
122
+
123
+ | Argument | Required | Description |
124
+ |----------|----------|-------------|
125
+ | `web_url` | ✅ Yes | Target webpage URL (must start with http/https) |
126
+ | `save_folder` | ❌ No | Local save directory (auto-generated from URL if omitted) |
127
+
128
+ ### Options
129
+
130
+ | Option | Default | Description |
131
+ |--------|---------|-------------|
132
+ | `--depth N` | `5` | Maximum relative crawl depth from base URL |
133
+ | `--count N` | `999` | Maximum number of pages to crawl (0 = unlimited) |
134
+ | `--picture` | `False` | Download and save images to local `images/` directory |
135
+ | `--video` | `False` | Download and save videos to local `videos/` directory |
136
+ | `-h, --help` | - | Show help message and exit |
137
+
138
+ ### Examples
139
+
140
+ #### 1. Depth-Limited Crawl
141
+ ```bash
142
+ web2md https://company.com/docs/home company-docs --depth 2
143
+ ```
144
+ - Crawls all pages within 2 levels of `/docs/`
145
+ - Saves to `./company-docs/`
146
+
147
+ #### 2. Limited Page Count
148
+ ```bash
149
+ web2md https://company.com/docs/home company-docs --depth 2 --count 5
150
+ ```
151
+ - Stops after crawling 5 pages
152
+ - Useful for testing or sampling large sites
153
+
154
+ #### 3. Crawl with Images
155
+ ```bash
156
+ web2md https://company.com/docs/home --picture --count 3
157
+ ```
158
+ - Downloads images to `images/` subdirectory
159
+ - Converts image URLs to local relative paths in Markdown
160
+
161
+ #### 4. Auto-Generated Save Directory
162
+ ```bash
163
+ web2md https://company.com/docs/home --depth 1 --count 10
164
+ ```
165
+ - Auto-creates directory: `company_com_docs/`
166
+
167
+ ## 🎯 How It Works
168
+
169
+ ### 1. Base URL Calculation
170
+ The tool automatically determines a **base URL** from your target URL:
171
+ - Target: `https://company.com/docs/home` → Base: `https://company.com/docs/`
172
+ - All crawling is scoped to pages under this base URL
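The base URL rule above can be sketched in a few lines. This is a minimal illustration, assuming the base is simply the target URL truncated to its last path separator; the helper names `compute_base_url` and `in_scope` are hypothetical and are not taken from `web2md/cli.py`.

```python
from urllib.parse import urlsplit, urlunsplit


def compute_base_url(target_url: str) -> str:
    """Return the directory-style URL that scopes the crawl (assumed rule)."""
    parts = urlsplit(target_url)
    # Keep the path up to and including its last "/"; drop query and fragment.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return urlunsplit((parts.scheme, parts.netloc, directory, "", ""))


def in_scope(url: str, base_url: str) -> bool:
    """A page is in scope when its URL sits under the base URL."""
    return url.startswith(base_url)


base = compute_base_url("https://company.com/docs/home")
print(base)
print(in_scope("https://company.com/docs/api/auth", base))
print(in_scope("https://company.com/blog/post", base))
```

A prefix check like this is deliberately simple; the real tool may normalize URLs (case, default ports, trailing slashes) before comparing.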
173
+
174
+ ### 2. Intelligent Path Resolution
175
+ Uses the browser's `document.baseURI` to correctly resolve relative URLs:
176
+ - Handles `<base>` tags in HTML
177
+ - Respects redirects and trailing slashes
178
+ - Resolves lazy-loaded images with `data-src`, `srcset`, etc.
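As a rough sketch of the resolution step, the snippet below picks an image URL from lazy-loading attributes and resolves it against a base URI with the standard library. The attribute priority order, the naive comma split of `srcset`, and the function names are assumptions made for illustration; in the tool itself the base URI comes from the browser's `document.baseURI`.

```python
from urllib.parse import urljoin

# Assumed priority: lazy-loading attributes first, because `src` often
# holds only a placeholder until the page's JavaScript swaps it in.
LAZY_ATTRS = ("data-src", "data-original", "src")


def first_srcset_url(srcset: str) -> str:
    """Return the URL of the first srcset candidate (naive comma split)."""
    first = srcset.split(",")[0].strip()
    return first.split()[0] if first else ""


def resolve_image_url(attrs: dict, base_uri: str) -> str:
    """Resolve the best available image URL against the page's base URI."""
    for name in LAZY_ATTRS:
        value = attrs.get(name, "").strip()
        if value:
            return urljoin(base_uri, value)
    candidate = first_srcset_url(attrs.get("srcset", ""))
    return urljoin(base_uri, candidate) if candidate else ""


print(resolve_image_url(
    {"src": "placeholder.gif", "data-src": "../img/a.png"},
    "https://company.com/docs/guide/",
))
```

Because `urljoin` follows the same relative-reference rules a browser applies, passing the value of `document.baseURI` as `base_uri` accounts for `<base>` tags and redirects without extra logic.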
179
+
180
+ ### 3. Smart Content Extraction
181
+ Automatically identifies core content using priority selectors:
182
+ 1. `<main>` tag
183
+ 2. `.article-content` or `.article_content`
184
+ 3. `#main-content`
185
+ 4. `.content`
186
+ 5. `<article>` tag
187
+ 6. Fallback to `<body>` (with cleanup)
188
+
189
+ ### 4. Media Handling
190
+ When `--picture` or `--video` is enabled:
191
+ - Downloads media files to `images/` or `videos/` subdirectories
192
+ - Generates unique filenames with MD5 hash to prevent duplicates
193
+ - Converts URLs to local relative paths in Markdown
194
+ - Supports lazy-loading attributes: `data-src`, `data-original`, `srcset`
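One way to get collision-free media names from an MD5 hash is sketched below. Hashing the full URL, keeping the original extension, and falling back to `.jpg` for unknown extensions are all assumptions for this example; `media_filename` is a hypothetical helper, and the extension set mirrors `allowed_img_ext` from the configuration section.

```python
import hashlib
import os
from urllib.parse import urlsplit

ALLOWED_IMG_EXT = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".svg", ".webp"}


def media_filename(url: str, default_ext: str = ".jpg") -> str:
    """Build a stable filename: MD5 of the URL plus a normalized extension."""
    ext = os.path.splitext(urlsplit(url).path)[1].lower()
    if ext not in ALLOWED_IMG_EXT:
        ext = default_ext  # assumed fallback for missing or unknown extensions
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return digest + ext


name = media_filename("https://cdn.example.com/a/logo.PNG?v=2")
print(name)
```

The same URL always maps to the same name, so an image referenced from several pages is stored once, while two different URLs that share a basename such as `logo.png` do not overwrite each other.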
195
+
196
+ ### 5. Filename Generation
197
+ MD filenames are generated from URLs:
198
+ - Remove base URL prefix
199
+ - Replace `/` with `_`
200
+ - Filter illegal characters
201
+ - Example: `https://company.com/docs/api/auth` → `api_auth.md`
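The three steps above can be sketched as follows. The set of characters treated as illegal, the stripping of query strings and fragments, and the `index` fallback for an empty name are assumptions; `markdown_filename` is a hypothetical name, not a function confirmed to exist in the package.

```python
import re


def markdown_filename(page_url: str, base_url: str) -> str:
    """Derive a Markdown filename from a page URL (illustrative rule set)."""
    # Step 1: remove the base URL prefix.
    relative = page_url[len(base_url):] if page_url.startswith(base_url) else page_url
    # Drop fragment and query, then leading/trailing slashes.
    relative = relative.split("#", 1)[0].split("?", 1)[0].strip("/")
    # Step 2: replace "/" with "_".
    name = relative.replace("/", "_")
    # Step 3: filter characters outside an assumed safe set.
    name = re.sub(r"[^A-Za-z0-9._-]", "", name)
    return (name or "index") + ".md"


print(markdown_filename("https://company.com/docs/api/auth", "https://company.com/docs/"))
```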
202
+
203
+ ## ⚙️ Configuration
204
+
205
+ ### Built-in Settings (in `web2md/cli.py`)
206
+
207
+ #### Playwright Configuration
208
+ ```python
209
+ PLAYWRIGHT_CONFIG = {
210
+ "headless": False, # Set to True for background crawling
211
+ "timeout": 60000, # Page load timeout (ms)
212
+ "wait_for_load": "networkidle", # Wait strategy
213
+ "sleep_after_load": 2, # Additional wait time (seconds)
214
+ "user_agent": "Mozilla/5.0..." # Custom user agent
215
+ }
216
+ ```
217
+
218
+ #### Media Configuration
219
+ ```python
220
+ MEDIA_CONFIG = {
221
+ "timeout": 30000, # Media download timeout (ms)
222
+ "image_dir": "images", # Image save subdirectory
223
+ "video_dir": "videos", # Video save subdirectory
224
+ "allowed_img_ext": [".jpg", ".jpeg", ".png", ".gif", ".bmp", ".svg", ".webp"],
225
+ "allowed_vid_ext": [".mp4", ".avi", ".mov", ".webm", ".flv", ".mkv"]
226
+ }
227
+ ```
228
+
229
+ #### Content Filtering
230
+ ```python
231
+ REMOVE_TAGS = ["nav", "header", "footer", "aside", "script", "style", "iframe", "sidebar"]
232
+
233
+ CORE_CONTENT_SELECTORS = [
234
+ ("main", {}),
235
+ ("div", {"class_": "article-content"}),
236
+ ("article", {})
237
+ ]
238
+ ```
239
+
240
+ #### Crawl Defaults
241
+ ```python
242
+ DEFAULT_CRAWL_CONFIG = {
243
+ "max_depth": 5, # Default max depth
244
+ "max_count": 999, # Default max pages
245
+ "allowed_schemes": ["http", "https"],
246
+ "exclude_patterns": [r"\.pdf$", r"\.zip$", r"\.exe$"]
247
+ }
248
+ ```
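To show how these defaults might gate a discovered link, here is a hedged sketch. It assumes depth is counted as the number of path segments below the base URL and that exclude patterns are matched against the URL path; `relative_depth` and `should_crawl` are hypothetical helpers, and the config dictionary simply repeats the values listed above.

```python
import re
from urllib.parse import urlsplit

CRAWL_CONFIG = {
    "max_depth": 5,
    "max_count": 999,
    "allowed_schemes": ["http", "https"],
    "exclude_patterns": [r"\.pdf$", r"\.zip$", r"\.exe$"],
}


def relative_depth(url: str, base_url: str) -> int:
    """Count path segments below the base URL (assumed depth definition)."""
    rest = url[len(base_url):].split("?", 1)[0].split("#", 1)[0].strip("/")
    return len(rest.split("/")) if rest else 0


def should_crawl(url: str, base_url: str, config: dict = CRAWL_CONFIG) -> bool:
    """Apply scheme, scope, exclude-pattern, and depth checks in order."""
    parts = urlsplit(url)
    if parts.scheme not in config["allowed_schemes"]:
        return False
    if not url.startswith(base_url):
        return False
    if any(re.search(p, parts.path) for p in config["exclude_patterns"]):
        return False
    return relative_depth(url, base_url) <= config["max_depth"]


base_url = "https://company.com/docs/"
print(should_crawl("https://company.com/docs/api/auth", base_url))
print(should_crawl("https://company.com/docs/manual.pdf", base_url))
```

The page-count limit (`max_count`) is not checked here because it depends on crawl state; a crawler loop would typically stop once the number of saved pages reaches that value.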
249
+
250
+ ## 🔧 Advanced Usage
251
+
252
+ ### Debug Mode (Show Browser)
253
+ Edit `web2md/cli.py` and set:
254
+ ```python
255
+ PLAYWRIGHT_CONFIG = {
256
+ "headless": False, # Shows browser window
257
+ ...
258
+ }
259
+ ```
260
+
261
+ ### Custom Content Selectors
262
+ Add site-specific selectors to `CORE_CONTENT_SELECTORS`:
263
+ ```python
264
+ CORE_CONTENT_SELECTORS = [
265
+ ("main", {}),
266
+ ("div", {"class_": "documentation-content"}), # Custom selector
267
+ ("article", {})
268
+ ]
269
+ ```
270
+
271
+ ### Anti-Bot Detection
272
+ Install and use `playwright-stealth`:
273
+ ```bash
274
+ pip3 install playwright-stealth
275
+ ```
276
+
277
+ Add to `get_dynamic_html()` in `web2md/cli.py`:
278
+ ```python
279
+ from playwright_stealth import stealth_sync
280
+
281
+ page = context.new_page()
282
+ stealth_sync(page) # Add this line
283
+ page.goto(url, ...)
284
+ ```
285
+
286
+ ### Authentication
287
+ Add login logic in `get_dynamic_html()` before `page.goto()`:
288
+ ```python
289
+ page.goto("https://example.com/login")
290
+ page.fill("#username", "your-username")
291
+ page.fill("#password", "your-password")
292
+ page.click("#login-button")
293
+ time.sleep(2)  # Crude wait for the post-login redirect; requires `import time`
294
+ ```
295
+
296
+ ## 🐛 Troubleshooting
297
+
298
+ ### SSL Certificate Errors
299
+ The tool automatically disables SSL verification for downloads. If you encounter issues, check your network/firewall settings.
300
+
301
+ ### Timeout Errors
302
+ Increase timeout in `PLAYWRIGHT_CONFIG`:
303
+ ```python
304
+ "timeout": 120000, # 2 minutes
305
+ ```
306
+
307
+ ### Missing Content
308
+ 1. Check if content is in `<main>` or common content tags
309
+ 2. Add custom selectors to `CORE_CONTENT_SELECTORS`
310
+ 3. Run with `headless: False` to debug visually
311
+
312
+ ### Image Download Failures
313
+ - Verify image URLs are accessible
314
+ - Check if images require authentication
315
+ - Some CDNs may block automated downloads
316
+
317
+ ## 📋 Dependencies
318
+
319
+ Automatically installed via `pip`:
320
+ - **playwright** - Browser automation and JS rendering
321
+ - **beautifulsoup4** - HTML parsing and manipulation
322
+ - **lxml** - Fast XML/HTML parser
323
+ - **markdownify** - HTML to Markdown conversion
324
+ - **requests** - HTTP client (built on urllib3)
325
+
326
+ ## 🤝 Contributing
327
+
328
+ Contributions are welcome! Please follow these steps:
329
+
330
+ 1. Fork the repository
331
+ 2. Create a feature branch (`git checkout -b feature/amazing-feature`)
332
+ 3. Make your changes
333
+ 4. Run tests (if available)
334
+ 5. Commit your changes (`git commit -m 'Add amazing feature'`)
335
+ 6. Push to the branch (`git push origin feature/amazing-feature`)
336
+ 7. Open a Pull Request
337
+
338
+ ### Development Setup
339
+ ```bash
340
+ git clone https://github.com/floatinghotpot/web2md.git
341
+ cd web2md
342
+ python3 -m pip install -e .
343
+ python3 -m playwright install chromium
344
+ ```
345
+
346
+ ## 📝 License
347
+
348
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
349
+
350
+ ## 🙏 Acknowledgments
351
+
352
+ - [Playwright](https://playwright.dev/) for powerful browser automation
353
+ - [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for HTML parsing
354
+ - [markdownify](https://github.com/matthewwithanm/markdownify) for clean Markdown conversion
355
+
356
+ ---
357
+
358
+ **Made with ❤️ for developers, researchers, and documentation enthusiasts.**
359
+
360
+ If you find this tool useful, please consider giving it a ⭐ on GitHub!