crawl4md 0.1.2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (44)
  1. crawl4md-0.1.2/.gitignore +57 -0
  2. crawl4md-0.1.2/AGENTS.md +171 -0
  3. crawl4md-0.1.2/CHANGELOG.md +43 -0
  4. crawl4md-0.1.2/LICENSE.md +21 -0
  5. crawl4md-0.1.2/PKG-INFO +336 -0
  6. crawl4md-0.1.2/README.md +320 -0
  7. crawl4md-0.1.2/VERSION +1 -0
  8. crawl4md-0.1.2/bin/check-version +19 -0
  9. crawl4md-0.1.2/crawl.yml.example +49 -0
  10. crawl4md-0.1.2/docs/.gitkeep +0 -0
  11. crawl4md-0.1.2/pyproject.toml +36 -0
  12. crawl4md-0.1.2/src/crawl4md/__init__.py +11 -0
  13. crawl4md-0.1.2/src/crawl4md/check.py +20 -0
  14. crawl4md-0.1.2/src/crawl4md/cli.py +93 -0
  15. crawl4md-0.1.2/src/crawl4md/config.py +54 -0
  16. crawl4md-0.1.2/src/crawl4md/convert/__init__.py +1 -0
  17. crawl4md-0.1.2/src/crawl4md/convert/markdown.py +63 -0
  18. crawl4md-0.1.2/src/crawl4md/convert/preprocessing/__init__.py +1 -0
  19. crawl4md-0.1.2/src/crawl4md/convert/preprocessing/helpers/__init__.py +1 -0
  20. crawl4md-0.1.2/src/crawl4md/convert/preprocessing/helpers/title_html_parser.py +40 -0
  21. crawl4md-0.1.2/src/crawl4md/convert/preprocessing/markdown.py +62 -0
  22. crawl4md-0.1.2/src/crawl4md/convert/preprocessing/rules/__init__.py +1 -0
  23. crawl4md-0.1.2/src/crawl4md/convert/preprocessing/rules/base/__init__.py +0 -0
  24. crawl4md-0.1.2/src/crawl4md/convert/preprocessing/rules/base/rule_base.py +83 -0
  25. crawl4md-0.1.2/src/crawl4md/convert/preprocessing/rules/ensure_h1.py +45 -0
  26. crawl4md-0.1.2/src/crawl4md/convert/preprocessing/rules/normalize_whitespace.py +140 -0
  27. crawl4md-0.1.2/src/crawl4md/convert/preprocessing/rules/remove_html_comments.py +28 -0
  28. crawl4md-0.1.2/src/crawl4md/convert/preprocessing/rules/remove_jump_to_content.py +68 -0
  29. crawl4md-0.1.2/src/crawl4md/convert/preprocessing/rules/remove_reference_sections.py +47 -0
  30. crawl4md-0.1.2/src/crawl4md/convert/preprocessing/rules/remove_wiki_loves_earth_banner.py +49 -0
  31. crawl4md-0.1.2/src/crawl4md/convert/preprocessing/rules/remove_wikipedia_subtitle.py +40 -0
  32. crawl4md-0.1.2/src/crawl4md/fetch/__init__.py +1 -0
  33. crawl4md-0.1.2/src/crawl4md/fetch/html.py +57 -0
  34. crawl4md-0.1.2/src/crawl4md/fetch/markdown.py +59 -0
  35. crawl4md-0.1.2/src/crawl4md/fetch/normalize/__init__.py +0 -0
  36. crawl4md-0.1.2/src/crawl4md/fetch/normalize/base/__init__.py +2 -0
  37. crawl4md-0.1.2/src/crawl4md/fetch/normalize/base/normalizer_base.py +16 -0
  38. crawl4md-0.1.2/src/crawl4md/fetch/normalize/mediawiki_entity.py +31 -0
  39. crawl4md-0.1.2/src/crawl4md/fetch/normalize/mediawiki_hidden_span.py +31 -0
  40. crawl4md-0.1.2/src/crawl4md/fetch/normalize/url.py +42 -0
  41. crawl4md-0.1.2/src/crawl4md/paths.py +24 -0
  42. crawl4md-0.1.2/src/crawl4md/sitemap.py +34 -0
  43. crawl4md-0.1.2/src/crawl4md/writer.py +17 -0
  44. crawl4md-0.1.2/tests/test_preprocessing.py +305 -0
@@ -0,0 +1,57 @@
# -------------------------
# Python
# -------------------------
__pycache__/
*.py[cod]
*.pyo
*.pyd
*.so

# Virtual environments
.venv/
venv/
env/

# -------------------------
# Build / Packaging
# -------------------------
build/
dist/
*.egg-info/
.eggs/

# -------------------------
# uv
# -------------------------
.uv/
uv.lock

# -------------------------
# IDE / Editor
# -------------------------
.vscode/
.idea/
*.swp
*.swo

# -------------------------
# OS
# -------------------------
.DS_Store
Thumbs.db

# -------------------------
# Project specific
# -------------------------

# User-specific config (must NOT be committed)
crawl.yml

# Generated Markdown output
docs/*
!docs/.gitkeep

# Optional: logs or temp files (future-proof)
*.log
tmp/
@@ -0,0 +1,171 @@
# AGENTS.md

## Purpose

crawl4md is a minimal CLI tool to crawl web pages or sitemaps and convert them into clean, deterministic Markdown files.

This document defines how contributors and automated agents (LLMs, scripts, CI tools) should interact with and extend the project.

---

## Core Principles

- Keep it simple (no overengineering)
- Deterministic output (same input → same output)
- No hidden behavior or side effects
- Clear separation of concerns:
  - config
  - crawling
  - preprocessing
  - writing
- CLI-first design

---

## Project Structure

src/crawl4md/
- cli.py → entrypoint (orchestration only)
- config.py → config models (Pydantic)
- sitemap.py → sitemap parsing
- crawler.py → crawl4ai integration
- paths.py → URL → file path mapping
- writer.py → file output
- (future) preprocessing.py → markdown cleanup

docs/
- output directory (generated, not versioned except .gitkeep)

crawl.yml.example
- example configuration (must stay in sync with config models)

---

## Responsibilities

### cli.py
- Orchestrates flow
- No business logic
- Reads config, loops URLs, prints output

### crawler.py
- Only responsible for fetching + converting to markdown
- Must not handle filesystem or preprocessing

### preprocessing (future)
- Pure functions: markdown in → markdown out
- No IO, no side effects

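This contract can be illustrated with a minimal, hypothetical rule (the real rules live under `src/crawl4md/convert/preprocessing/rules/`; this sketch is not the project's implementation):

```python
import re

def remove_html_comments(markdown: str) -> str:
    """Pure rule: Markdown in, Markdown out. No IO, no side effects."""
    # DOTALL so multi-line comments are removed as well
    return re.sub(r"<!--.*?-->", "", markdown, flags=re.DOTALL)
```

Because each rule is a pure `str -> str` function, rules compose trivially and are easy to unit-test in isolation.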
### writer.py
- Only writes files
- Must not modify content

---

## Configuration Rules

- All behavior must be configurable via crawl.yml
- crawl.yml is user-specific → never commit
- crawl.yml.example is canonical → always update when config changes

### Config Sections

- type: pages | sitemap
- sources: list[str]

- crawl:
  - parse_type: markdown | markdown-fit

- preprocessing:
  - markdown:
    - enabled: bool
    - remove_reference_sections: bool
    - remove_html_comments: bool
    - normalize_whitespace: bool
    - reference_headings: list[str]

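As an illustration only, the sections above roughly correspond to models of this shape (sketched with stdlib dataclasses to stay self-contained; the actual `config.py` uses Pydantic, and these class names are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class MarkdownPreprocessing:
    enabled: bool = False
    remove_reference_sections: bool = False
    remove_html_comments: bool = False
    normalize_whitespace: bool = False
    reference_headings: list[str] = field(default_factory=list)

@dataclass
class Project:
    type: str = "pages"          # "pages" | "sitemap"
    sources: list[str] = field(default_factory=list)
    parse_type: str = "markdown"  # crawl.parse_type: "markdown" | "markdown-fit"
    preprocessing: MarkdownPreprocessing = field(default_factory=MarkdownPreprocessing)
```

Defaults on every field keep old `crawl.yml` files valid when new options are added (see "Extending the Project" below).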
---

## Coding Guidelines

- Python >= 3.11
- Use type hints everywhere
- Prefer explicit over implicit
- No global state (except logging config)
- Keep functions small and focused
- Avoid unnecessary dependencies

---

## Logging & Output

- CLI output must stay minimal and readable
- Avoid noisy logs
- External library logs (e.g. crawl4ai) should be suppressed or reduced

---

## Error Handling

- Errors per URL must not stop the entire run
- Always continue with the next URL
- Provide a clear summary:
  - success count
  - failure count

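A hypothetical sketch of this policy (function name and output format are illustrative, not the project's CLI):

```python
def crawl_all(urls, fetch):
    """Crawl every URL; a failure on one URL never aborts the run."""
    success, failed = 0, 0
    for url in urls:
        try:
            fetch(url)
            success += 1
        except Exception as exc:  # report and continue with the next URL
            failed += 1
            print(f"FAIL {url}: {exc}")
    print(f"Summary: {success} succeeded, {failed} failed")
    return success, failed
```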
---

## File Output Rules

- Output path must be deterministic:
  docs/<project>/<url-path>.md

- URL path rules:
  - "/" → index.md
  - strip leading slash
  - ignore query params

- Always create parent directories

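Taken together, these rules amount to a small pure mapping, sketched here as a helper of the kind `paths.py` is responsible for (the function name is illustrative; directory creation belongs to the writing step):

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

def url_to_path(project: str, url: str) -> PurePosixPath:
    """Deterministic mapping: docs/<project>/<url-path>.md."""
    path = urlsplit(url).path.lstrip("/")  # query params ignored, leading slash stripped
    if not path:
        path = "index"  # "/" → index.md
    return PurePosixPath("docs") / project / f"{path}.md"
```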
---

## Extending the Project

When adding features:

1. First update config models
2. Then update crawl.yml.example
3. Keep backward compatibility (defaults!)
4. Add logic in the correct layer (do not mix concerns)

---

## Anti-Patterns (Avoid)

- Business logic inside cli.py
- Hidden transformations
- Implicit defaults not visible in config
- Mixing crawling and preprocessing
- Writing files outside writer.py

---

## Future Extensions (Planned)

- Markdown preprocessing pipeline
- Frontmatter support
- Parallel crawling
- Retry & rate limiting
- Chunking for RAG
- Database export

---

## Summary

crawl4md is intentionally simple.

Every addition must preserve:
- clarity
- determinism
- separation of concerns
@@ -0,0 +1,43 @@
# Changelog

All notable changes to this project will be documented in this file.

## [0.1.2] - 2026-05-02

### Added

- Update README for PyPI package usage and clarify batch crawler setup

## [0.1.1] - 2026-05-02

### Added

- Add uv check command for tests and Ruff linting
- Export public Python API and expand README with usage and crawl4ai context

### Refactored

- Split `fetch_markdown` into fetch and convert layers
- Move markdown preprocessing from CLI into convert pipeline
- Refactor markdown fetch/convert into classes and add sync APIs

### Removed

- Crawl4AI logging

## [0.1.0] - 2026-05-02

### Added

- Initial release
- CLI for crawling single pages and sitemaps
- YAML-based project configuration
- Deterministic Markdown file output
- Support for multiple Markdown extraction modes
- Configurable Markdown preprocessing pipeline
- Automatic cleanup of common wiki and web artifacts
- Automatic removal of reference and appendix sections
- Whitespace and document structure normalization
- Automatic insertion of missing top-level headings
- Clear separation of crawling, preprocessing, and file writing
- Basic test coverage for core Markdown processing
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Björn Hempel <bjoern@hempel.li>

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,336 @@
Metadata-Version: 2.4
Name: crawl4md
Version: 0.1.2
Summary: Convert web pages and HTML to clean Markdown
Author: Björn Hempel
License: MIT
License-File: LICENSE.md
Requires-Python: >=3.11
Requires-Dist: crawl4ai
Requires-Dist: lxml
Requires-Dist: pydantic
Requires-Dist: pyyaml
Requires-Dist: requests
Requires-Dist: typer
Description-Content-Type: text/markdown

# crawl4md

crawl4md is a minimal, clean CLI tool that crawls web pages or sitemaps and converts them into structured Markdown files.

The project is intentionally designed to stay simple, deterministic, and easy to extend, without unnecessary complexity or hidden behavior.

---

## Philosophy

- **Minimal**: only what is needed, nothing more
- **Deterministic**: same input → same output
- **Transparent**: no magic, clear processing steps
- **Composable**: ideal as a building block for pipelines (e.g. RAG)

---

## Features

- Crawl from:
  - `sitemap.xml`
  - explicit page lists
- Clean Markdown output (via `crawl4ai`, markdown-fit mode)
- Deterministic file structure based on URL paths
- YAML-based project configuration
- CLI-first workflow (uv-compatible)
- Clear, readable progress output

---

## Installation

There are two ways to use `crawl4md`.

### Use the Batch Crawler

If you want to use the project directly for batch crawling via `crawl.yml`, clone the repository:

```bash
git clone git@github.com:ixnode/crawl4md.git && cd crawl4md
```

Then continue with the configuration section below.

### Use the Python Package

If you want to build your own tooling on top of `crawl4md`, install it as a package:

```bash
pip install crawl4md
```

Or with `uv`:

```bash
uv add crawl4md
```

For local development inside the repository:

```bash
uv sync
```

---

## Configuration

The CLI reads a `crawl.yml` file from the current working directory.

Create it from the example:

```bash
cp crawl.yml.example crawl.yml
```

Minimal example:

```yaml
projects:
  planes:
    type: pages
    crawl:
      parse_type: markdown-fit
    sources:
      - https://de.wikipedia.org/wiki/Boeing_707
      - https://de.wikipedia.org/wiki/Boeing_717
    preprocessing:
      markdown:
        enabled: true
        remove_html_comments: true
        normalize_whitespace: true

  pydantic:
    type: sitemap
    crawl:
      parse_type: markdown-fit
    sources:
      - https://pydantic.dev/sitemap.xml
    preprocessing:
      markdown:
        enabled: false
```

Available project settings:

- `type`: `pages` or `sitemap`
- `sources`: list of page URLs or sitemap URLs
- `crawl.parse_type`: `markdown` or `markdown-fit`
- `preprocessing.markdown.enabled`: enables Markdown cleanup
- `preprocessing.markdown.*`: optional cleanup rules such as `ensure_h1`, `remove_html_comments`, `remove_reference_sections`, and `normalize_whitespace`

For the full configuration, see [`crawl.yml.example`](crawl.yml.example).

---

## Usage

After cloning the repository and creating `crawl.yml`, use:

```bash
crawl planes
crawl pydantic
```

Or with `uv` inside the project:

```bash
uv run crawl planes
uv run crawl pydantic
```

---

## Python API

`crawl4md` can also be used as a Python package.

The public classes are:

- `MarkdownFetcher`
- `MarkdownConverter`
- `ParseType`
- `MarkdownPreprocessingConfig`

### Configure Parse Type

Use `ParseType` to control how Markdown is generated:

- `"markdown"`: raw markdown output
- `"markdown-fit"`: cleaned and reduced markdown output via `crawl4ai`

### Configure Preprocessing

Use `MarkdownPreprocessingConfig` to enable optional cleanup steps.

Simple example:

```python
from crawl4md import MarkdownPreprocessingConfig

config = MarkdownPreprocessingConfig(
    enabled=True,
    remove_html_comments=True,
    normalize_whitespace=True,
)
```

### Fetch Markdown From a URL

Use `MarkdownFetcher` if you want to fetch a page and directly receive Markdown.

```python
from crawl4md import MarkdownFetcher, MarkdownPreprocessingConfig

config = MarkdownPreprocessingConfig(enabled=True)
fetcher = MarkdownFetcher(config=config, parse_type="markdown-fit")

markdown = fetcher.fetch_sync("https://example.com")
print(markdown)
```

Async version:

```python
import asyncio

from crawl4md import MarkdownFetcher, MarkdownPreprocessingConfig

config = MarkdownPreprocessingConfig(enabled=True)
fetcher = MarkdownFetcher(config=config, parse_type="markdown-fit")

markdown = asyncio.run(fetcher.fetch("https://example.com"))
print(markdown)
```

### Convert HTML to Markdown

Use `MarkdownConverter` if you already have HTML and only want the conversion step.

```python
from crawl4md import MarkdownConverter, MarkdownPreprocessingConfig

html = "<html><body><h1>Hello</h1><p>World</p></body></html>"

config = MarkdownPreprocessingConfig(enabled=True, ensure_h1=True)
converter = MarkdownConverter(config=config, parse_type="markdown")

markdown = converter.convert_sync(html=html, url="https://example.com")
print(markdown)
```

Async version:

```python
import asyncio

from crawl4md import MarkdownConverter, MarkdownPreprocessingConfig

html = "<html><body><h1>Hello</h1><p>World</p></body></html>"

config = MarkdownPreprocessingConfig(enabled=True, ensure_h1=True)
converter = MarkdownConverter(config=config, parse_type="markdown")

markdown = asyncio.run(
    converter.convert(html=html, url="https://example.com")
)
print(markdown)
```

---

## Output Structure

Markdown files are stored deterministically based on the URL path:

```bash
docs/<project>/<url-path>.md
```

Example:

```bash
docs/planes/wiki/Boeing_707.md
```

Rules:

- Domain is ignored
- URL path is preserved
- `/` → `index.md`
- Query parameters are ignored

---

## Example Output

```bash
1/2 Crawl https://de.wikipedia.org/wiki/Boeing_707
- Fetching ... done
- Processing ... done
- Writing docs/planes/wiki/Boeing_707.md ... done
```

---

## Use Cases

- RAG data ingestion
- Website snapshotting
- Knowledge base generation
- Offline documentation

---

## Project Structure

```bash
src/crawl4md/
├─ cli.py
├─ config.py
├─ sitemap.py
├─ crawler.py
├─ paths.py
└─ writer.py
```

---

## Notes

- No recursive crawling (by design)
- No hidden caching or transformations
- Focus on clean Markdown output only

---

## License

This project is licensed under the MIT License. See the [LICENSE.md](LICENSE.md) file for details.

### Authors

- Björn Hempel <bjoern@hempel.li> - _Initial work_ - [https://github.com/bjoern-hempel](https://github.com/bjoern-hempel)

---

## Built on top of crawl4ai

This project builds on the excellent [`crawl4ai`](https://github.com/unclecode/crawl4ai) library and extends it with a simpler batch-oriented workflow for repeatable Markdown exports.

Why use `crawl4md` as a complement to `crawl4ai`:

- project-based batch crawling via `crawl.yml`
- support for both page lists and sitemap-driven crawls
- deterministic output paths for generated Markdown files
- optional Markdown cleanup rules for better downstream text quality
- a small CLI and Python API focused on URL or HTML to Markdown workflows
- clearer separation between fetching, conversion, preprocessing, and writing

In short: `crawl4ai` provides the powerful crawling and Markdown generation foundation, while `crawl4md` adds a lightweight structure around it for batch jobs, cleaner output, and easier integration into documentation or RAG pipelines.