crawl4md 0.1.2__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- crawl4md-0.1.2/.gitignore +57 -0
- crawl4md-0.1.2/AGENTS.md +171 -0
- crawl4md-0.1.2/CHANGELOG.md +43 -0
- crawl4md-0.1.2/LICENSE.md +21 -0
- crawl4md-0.1.2/PKG-INFO +336 -0
- crawl4md-0.1.2/README.md +320 -0
- crawl4md-0.1.2/VERSION +1 -0
- crawl4md-0.1.2/bin/check-version +19 -0
- crawl4md-0.1.2/crawl.yml.example +49 -0
- crawl4md-0.1.2/docs/.gitkeep +0 -0
- crawl4md-0.1.2/pyproject.toml +36 -0
- crawl4md-0.1.2/src/crawl4md/__init__.py +11 -0
- crawl4md-0.1.2/src/crawl4md/check.py +20 -0
- crawl4md-0.1.2/src/crawl4md/cli.py +93 -0
- crawl4md-0.1.2/src/crawl4md/config.py +54 -0
- crawl4md-0.1.2/src/crawl4md/convert/__init__.py +1 -0
- crawl4md-0.1.2/src/crawl4md/convert/markdown.py +63 -0
- crawl4md-0.1.2/src/crawl4md/convert/preprocessing/__init__.py +1 -0
- crawl4md-0.1.2/src/crawl4md/convert/preprocessing/helpers/__init__.py +1 -0
- crawl4md-0.1.2/src/crawl4md/convert/preprocessing/helpers/title_html_parser.py +40 -0
- crawl4md-0.1.2/src/crawl4md/convert/preprocessing/markdown.py +62 -0
- crawl4md-0.1.2/src/crawl4md/convert/preprocessing/rules/__init__.py +1 -0
- crawl4md-0.1.2/src/crawl4md/convert/preprocessing/rules/base/__init__.py +0 -0
- crawl4md-0.1.2/src/crawl4md/convert/preprocessing/rules/base/rule_base.py +83 -0
- crawl4md-0.1.2/src/crawl4md/convert/preprocessing/rules/ensure_h1.py +45 -0
- crawl4md-0.1.2/src/crawl4md/convert/preprocessing/rules/normalize_whitespace.py +140 -0
- crawl4md-0.1.2/src/crawl4md/convert/preprocessing/rules/remove_html_comments.py +28 -0
- crawl4md-0.1.2/src/crawl4md/convert/preprocessing/rules/remove_jump_to_content.py +68 -0
- crawl4md-0.1.2/src/crawl4md/convert/preprocessing/rules/remove_reference_sections.py +47 -0
- crawl4md-0.1.2/src/crawl4md/convert/preprocessing/rules/remove_wiki_loves_earth_banner.py +49 -0
- crawl4md-0.1.2/src/crawl4md/convert/preprocessing/rules/remove_wikipedia_subtitle.py +40 -0
- crawl4md-0.1.2/src/crawl4md/fetch/__init__.py +1 -0
- crawl4md-0.1.2/src/crawl4md/fetch/html.py +57 -0
- crawl4md-0.1.2/src/crawl4md/fetch/markdown.py +59 -0
- crawl4md-0.1.2/src/crawl4md/fetch/normalize/__init__.py +0 -0
- crawl4md-0.1.2/src/crawl4md/fetch/normalize/base/__init__.py +2 -0
- crawl4md-0.1.2/src/crawl4md/fetch/normalize/base/normalizer_base.py +16 -0
- crawl4md-0.1.2/src/crawl4md/fetch/normalize/mediawiki_entity.py +31 -0
- crawl4md-0.1.2/src/crawl4md/fetch/normalize/mediawiki_hidden_span.py +31 -0
- crawl4md-0.1.2/src/crawl4md/fetch/normalize/url.py +42 -0
- crawl4md-0.1.2/src/crawl4md/paths.py +24 -0
- crawl4md-0.1.2/src/crawl4md/sitemap.py +34 -0
- crawl4md-0.1.2/src/crawl4md/writer.py +17 -0
- crawl4md-0.1.2/tests/test_preprocessing.py +305 -0
crawl4md-0.1.2/.gitignore
ADDED

# -------------------------
# Python
# -------------------------
__pycache__/
*.py[cod]
*.pyo
*.pyd
*.so

# Virtual environments
.venv/
venv/
env/

# -------------------------
# Build / Packaging
# -------------------------
build/
dist/
*.egg-info/
.eggs/

# -------------------------
# uv
# -------------------------
.uv/
uv.lock

# -------------------------
# IDE / Editor
# -------------------------
.vscode/
.idea/
*.swp
*.swo

# -------------------------
# OS
# -------------------------
.DS_Store
Thumbs.db

# -------------------------
# Project specific
# -------------------------

# User-specific config (must NOT be committed)
crawl.yml

# Generated Markdown output
docs/*
!docs/.gitkeep

# Optional: logs or temp files (future-proof)
*.log
tmp/

crawl4md-0.1.2/AGENTS.md
ADDED

# AGENTS.md

## Purpose

crawl4md is a minimal CLI tool to crawl web pages or sitemaps and convert them into clean, deterministic Markdown files.

This document defines how contributors and automated agents (LLMs, scripts, CI tools) should interact with and extend the project.

---

## Core Principles

- Keep it simple (no overengineering)
- Deterministic output (same input → same output)
- No hidden behavior or side effects
- Clear separation of concerns:
  - config
  - crawling
  - preprocessing
  - writing
- CLI-first design

---

## Project Structure

src/crawl4md/

- cli.py → entrypoint (orchestration only)
- config.py → config models (Pydantic)
- sitemap.py → sitemap parsing
- crawler.py → crawl4ai integration
- paths.py → URL → file path mapping
- writer.py → file output
- (future) preprocessing.py → markdown cleanup

docs/

- output directory (generated, not versioned except .gitkeep)

crawl.yml.example

- example configuration (must stay in sync with config models)

---

## Responsibilities

### cli.py

- Orchestrates flow
- No business logic
- Reads config, loops URLs, prints output

### crawler.py

- Only responsible for fetching + converting to markdown
- Must not handle filesystem or preprocessing

### preprocessing (future)

- Pure functions: markdown in → markdown out
- No IO, no side effects
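
The rule contract above can be sketched as a pure function. A minimal illustration, assuming a hypothetical `remove_html_comments` rule (the real rule classes live under `convert/preprocessing/rules/` and may differ in shape):

```python
import re


def remove_html_comments(markdown: str) -> str:
    """Pure rule: markdown in -> markdown out, no IO, no side effects."""
    return re.sub(r"<!--.*?-->", "", markdown, flags=re.DOTALL)


print(remove_html_comments("Intro <!-- editor note --> text"))
```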

### writer.py

- Only writes files
- Must not modify content

---

## Configuration Rules

- All behavior must be configurable via crawl.yml
- crawl.yml is user-specific → never commit
- crawl.yml.example is canonical → always update when config changes

### Config Sections

- type: pages | sitemap
- sources: list[str]

- crawl:
    parse_type: markdown | markdown-fit

- preprocessing:
    markdown:
      enabled: bool
      remove_reference_sections: bool
      remove_html_comments: bool
      normalize_whitespace: bool
      reference_headings: list[str]
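
These sections map onto typed config models. A rough sketch, shown with stdlib dataclasses for brevity (the real config.py uses Pydantic, and these class and field names are assumptions):

```python
from dataclasses import dataclass, field


@dataclass
class MarkdownPreprocessing:
    enabled: bool = False
    remove_reference_sections: bool = False
    remove_html_comments: bool = False
    normalize_whitespace: bool = False
    reference_headings: list[str] = field(default_factory=list)


@dataclass
class CrawlSettings:
    parse_type: str = "markdown"  # "markdown" | "markdown-fit"


@dataclass
class Project:
    type: str  # "pages" | "sitemap"
    sources: list[str] = field(default_factory=list)
    crawl: CrawlSettings = field(default_factory=CrawlSettings)
    preprocessing: MarkdownPreprocessing = field(default_factory=MarkdownPreprocessing)


project = Project(type="pages", sources=["https://example.com"])
```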

---

## Coding Guidelines

- Python >= 3.11
- Use type hints everywhere
- Prefer explicit over implicit
- No global state (except logging config)
- Keep functions small and focused
- Avoid unnecessary dependencies

---

## Logging & Output

- CLI output must stay minimal and readable
- Avoid noisy logs
- External library logs (e.g. crawl4ai) should be suppressed or reduced

---

## Error Handling

- Errors per URL must not stop the entire run
- Always continue with next URL
- Provide clear summary:
  - success count
  - failure count
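
This policy can be sketched as a simple loop; `crawl_all` and `fetch` below are placeholders, not the project's actual API:

```python
def crawl_all(urls: list[str], fetch) -> tuple[int, int]:
    """Continue past per-URL failures and report a success/failure summary."""
    ok = failed = 0
    for url in urls:
        try:
            fetch(url)
            ok += 1
        except Exception as exc:  # a failing URL must not stop the run
            print(f"  failed: {url} ({exc})")
            failed += 1
    print(f"Summary: {ok} succeeded, {failed} failed")
    return ok, failed
```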

---

## File Output Rules

- Output path must be deterministic:
  docs/<project>/<url-path>.md

- URL path rules:
  - "/" → index.md
  - strip leading slash
  - ignore query params

- Always create parent directories
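
The mapping can be sketched as follows; `output_path` is a hypothetical helper (paths.py implements the real logic):

```python
from pathlib import Path
from urllib.parse import urlparse


def output_path(project: str, url: str) -> Path:
    """Deterministic mapping: docs/<project>/<url-path>.md, query params ignored."""
    path = urlparse(url).path.lstrip("/")  # domain dropped, leading slash stripped
    if not path:
        path = "index"  # "/" maps to index.md
    return Path("docs") / project / f"{path}.md"


print(output_path("planes", "https://de.wikipedia.org/wiki/Boeing_707").as_posix())
```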

---

## Extending the Project

When adding features:

1. First update config models
2. Then update crawl.yml.example
3. Keep backward compatibility (defaults!)
4. Add logic in the correct layer (do not mix concerns)

---

## Anti-Patterns (Avoid)

- Business logic inside cli.py
- Hidden transformations
- Implicit defaults not visible in config
- Mixing crawling and preprocessing
- Writing files outside writer.py

---

## Future Extensions (Planned)

- Markdown preprocessing pipeline
- Frontmatter support
- Parallel crawling
- Retry & rate limiting
- Chunking for RAG
- Database export

---

## Summary

crawl4md is intentionally simple.

Every addition must preserve:
- clarity
- determinism
- separation of concerns

crawl4md-0.1.2/CHANGELOG.md
ADDED

# Changelog

All notable changes to this project will be documented in this file.

## [0.1.2] - 2026-05-02

### Added

- Update README for PyPI package usage and clarify batch crawler setup

## [0.1.1] - 2026-05-02

### Added

- Add uv check command for tests and Ruff linting
- Export public Python API and expand README with usage and crawl4ai context

### Refactored

- Split `fetch_markdown` into fetch and convert layers
- Move markdown preprocessing from CLI into convert pipeline
- Refactor markdown fetch/convert into classes and add sync APIs

### Removed

- Crawl4AI logging

## [0.1.0] - 2026-05-02

### Added

- Initial release
- CLI for crawling single pages and sitemaps
- YAML-based project configuration
- Deterministic Markdown file output
- Support for multiple Markdown extraction modes
- Configurable Markdown preprocessing pipeline
- Automatic cleanup of common wiki and web artifacts
- Automatic removal of reference and appendix sections
- Whitespace and document structure normalization
- Automatic insertion of missing top-level headings
- Clear separation of crawling, preprocessing, and file writing
- Basic test coverage for core Markdown processing

crawl4md-0.1.2/LICENSE.md
ADDED

MIT License

Copyright (c) 2026 Björn Hempel <bjoern@hempel.li>

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

crawl4md-0.1.2/PKG-INFO
ADDED

Metadata-Version: 2.4
Name: crawl4md
Version: 0.1.2
Summary: Convert web pages and HTML to clean Markdown
Author: Björn Hempel
License: MIT
License-File: LICENSE.md
Requires-Python: >=3.11
Requires-Dist: crawl4ai
Requires-Dist: lxml
Requires-Dist: pydantic
Requires-Dist: pyyaml
Requires-Dist: requests
Requires-Dist: typer
Description-Content-Type: text/markdown

# crawl4md

crawl4md is a minimal, clean CLI tool that crawls web pages or sitemaps and converts them into structured Markdown files.

The project is intentionally designed to stay simple, deterministic, and easy to extend, without unnecessary complexity or hidden behavior.

---

## Philosophy

- **Minimal**: only what is needed, nothing more
- **Deterministic**: same input → same output
- **Transparent**: no magic, clear processing steps
- **Composable**: ideal as a building block for pipelines (e.g. RAG)

---

## Features

- Crawl from:
  - `sitemap.xml`
  - explicit page lists
- Clean Markdown output (via `crawl4ai`, markdown-fit mode)
- Deterministic file structure based on URL paths
- YAML-based project configuration
- CLI-first workflow (uv-compatible)
- Clear, readable progress output

---

## Installation

There are two ways to use `crawl4md`.

### Use the Batch Crawler

If you want to use the project directly for batch crawling via `crawl.yml`, clone the repository:

```bash
git clone git@github.com:ixnode/crawl4md.git && cd crawl4md
```

Then continue with the configuration section below.

### Use the Python Package

If you want to build your own tooling on top of `crawl4md`, install it as a package:

```bash
pip install crawl4md
```

Or with `uv`:

```bash
uv add crawl4md
```

For local development inside the repository:

```bash
uv sync
```

---

## Configuration

The CLI reads a `crawl.yml` file from the current working directory.

Create it from the example:

```bash
cp crawl.yml.example crawl.yml
```

Minimal example:

```yaml
projects:
  planes:
    type: pages
    crawl:
      parse_type: markdown-fit
    sources:
      - https://de.wikipedia.org/wiki/Boeing_707
      - https://de.wikipedia.org/wiki/Boeing_717
    preprocessing:
      markdown:
        enabled: true
        remove_html_comments: true
        normalize_whitespace: true

  pydantic:
    type: sitemap
    crawl:
      parse_type: markdown-fit
    sources:
      - https://pydantic.dev/sitemap.xml
    preprocessing:
      markdown:
        enabled: false
```

Available project settings:

- `type`: `pages` or `sitemap`
- `sources`: list of page URLs or sitemap URLs
- `crawl.parse_type`: `markdown` or `markdown-fit`
- `preprocessing.markdown.enabled`: enables Markdown cleanup
- `preprocessing.markdown.*`: optional cleanup rules such as `ensure_h1`, `remove_html_comments`, `remove_reference_sections`, and `normalize_whitespace`

For the full configuration, see [`crawl.yml.example`](crawl.yml.example).

---

## Usage

After cloning the repository and creating `crawl.yml`, use:

```bash
crawl planes
crawl pydantic
```

Or with `uv` inside the project:

```bash
uv run crawl planes
uv run crawl pydantic
```

---

## Python API

`crawl4md` can also be used as a Python package.

The public classes are:

- `MarkdownFetcher`
- `MarkdownConverter`
- `ParseType`
- `MarkdownPreprocessingConfig`

### Configure Parse Type

Use `ParseType` to control how Markdown is generated:

- `"markdown"`: raw markdown output
- `"markdown-fit"`: cleaned and reduced markdown output via `crawl4ai`

### Configure Preprocessing

Use `MarkdownPreprocessingConfig` to enable optional cleanup steps.

Simple example:

```python
from crawl4md import MarkdownPreprocessingConfig

config = MarkdownPreprocessingConfig(
    enabled=True,
    remove_html_comments=True,
    normalize_whitespace=True,
)
```

### Fetch Markdown From a URL

Use `MarkdownFetcher` if you want to fetch a page and directly receive Markdown.

```python
from crawl4md import MarkdownFetcher, MarkdownPreprocessingConfig

config = MarkdownPreprocessingConfig(enabled=True)
fetcher = MarkdownFetcher(config=config, parse_type="markdown-fit")

markdown = fetcher.fetch_sync("https://example.com")
print(markdown)
```

Async version:

```python
import asyncio

from crawl4md import MarkdownFetcher, MarkdownPreprocessingConfig

config = MarkdownPreprocessingConfig(enabled=True)
fetcher = MarkdownFetcher(config=config, parse_type="markdown-fit")

markdown = asyncio.run(fetcher.fetch("https://example.com"))
print(markdown)
```

### Convert HTML to Markdown

Use `MarkdownConverter` if you already have HTML and only want the conversion step.

```python
from crawl4md import MarkdownConverter, MarkdownPreprocessingConfig

html = "<html><body><h1>Hello</h1><p>World</p></body></html>"

config = MarkdownPreprocessingConfig(enabled=True, ensure_h1=True)
converter = MarkdownConverter(config=config, parse_type="markdown")

markdown = converter.convert_sync(html=html, url="https://example.com")
print(markdown)
```

Async version:

```python
import asyncio

from crawl4md import MarkdownConverter, MarkdownPreprocessingConfig

html = "<html><body><h1>Hello</h1><p>World</p></body></html>"

config = MarkdownPreprocessingConfig(enabled=True, ensure_h1=True)
converter = MarkdownConverter(config=config, parse_type="markdown")

markdown = asyncio.run(
    converter.convert(html=html, url="https://example.com")
)
print(markdown)
```

---

## Output Structure

Markdown files are stored deterministically based on the URL path:

```bash
docs/<project>/<url-path>.md
```

Example:

```bash
docs/planes/wiki/Boeing_707.md
```

Rules:

- Domain is ignored
- URL path is preserved
- `/` → `index.md`
- Query parameters are ignored

---

## Example Output

```bash
1/2 Crawl https://de.wikipedia.org/wiki/Boeing_707
  - Fetching ... done
  - Processing ... done
  - Writing docs/planes/wiki/Boeing_707.md ... done
```

---

## Use Cases

- RAG data ingestion
- Website snapshotting
- Knowledge base generation
- Offline documentation

---

## Project Structure

```bash
src/crawl4md/
├─ cli.py
├─ config.py
├─ sitemap.py
├─ crawler.py
├─ paths.py
└─ writer.py
```

---

## Notes

- No recursive crawling (by design)
- No hidden caching or transformations
- Focus on clean Markdown output only

---

## License

This project is licensed under the MIT License. See the [LICENSE.md](LICENSE.md) file for details.

### Authors

- Björn Hempel <bjoern@hempel.li> - _Initial work_ - [https://github.com/bjoern-hempel](https://github.com/bjoern-hempel)

---

## Built on top of crawl4ai

This project builds on the excellent [`crawl4ai`](https://github.com/unclecode/crawl4ai) library and extends it with a simpler batch-oriented workflow for repeatable Markdown exports.

Why use `crawl4md` as a complement to `crawl4ai`:

- project-based batch crawling via `crawl.yml`
- support for both page lists and sitemap-driven crawls
- deterministic output paths for generated Markdown files
- optional Markdown cleanup rules for better downstream text quality
- a small CLI and Python API focused on URL or HTML to Markdown workflows
- clearer separation between fetching, conversion, preprocessing, and writing

In short: `crawl4ai` provides the powerful crawling and Markdown generation foundation, while `crawl4md` adds a lightweight structure around it for batch jobs, cleaner output, and easier integration into documentation or RAG pipelines.