docpull 1.1.0__tar.gz → 1.2.0__tar.gz
- docpull-1.2.0/PKG-INFO +394 -0
- docpull-1.2.0/README.md +326 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/__init__.py +1 -1
- docpull-1.2.0/docpull/archive.py +186 -0
- docpull-1.2.0/docpull/cache.py +224 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/cli.py +345 -12
- {docpull-1.1.0 → docpull-1.2.0}/docpull/config.py +113 -1
- docpull-1.2.0/docpull/formatters/__init__.py +44 -0
- docpull-1.2.0/docpull/formatters/base.py +73 -0
- docpull-1.2.0/docpull/formatters/json.py +98 -0
- docpull-1.2.0/docpull/formatters/markdown.py +47 -0
- docpull-1.2.0/docpull/formatters/sqlite.py +264 -0
- docpull-1.2.0/docpull/formatters/toon.py +88 -0
- docpull-1.2.0/docpull/hooks.py +217 -0
- docpull-1.2.0/docpull/indexer.py +387 -0
- docpull-1.2.0/docpull/metadata.py +184 -0
- docpull-1.2.0/docpull/naming.py +259 -0
- docpull-1.2.0/docpull/orchestrator.py +253 -0
- docpull-1.2.0/docpull/processors/__init__.py +17 -0
- docpull-1.2.0/docpull/processors/base.py +150 -0
- docpull-1.2.0/docpull/processors/content_filter.py +249 -0
- docpull-1.2.0/docpull/processors/deduplicator.py +213 -0
- docpull-1.2.0/docpull/processors/language_filter.py +170 -0
- docpull-1.2.0/docpull/processors/size_limiter.py +214 -0
- docpull-1.2.0/docpull/sources_config.py +260 -0
- docpull-1.2.0/docpull/vcs.py +224 -0
- docpull-1.2.0/docpull.egg-info/PKG-INFO +394 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull.egg-info/SOURCES.txt +26 -1
- {docpull-1.1.0 → docpull-1.2.0}/docpull.egg-info/requires.txt +4 -4
- {docpull-1.1.0 → docpull-1.2.0}/pyproject.toml +12 -3
- docpull-1.2.0/tests/test_formatters.py +276 -0
- docpull-1.2.0/tests/test_orchestrator.py +330 -0
- docpull-1.2.0/tests/test_processors.py +424 -0
- docpull-1.2.0/tests/test_sources_config.py +348 -0
- docpull-1.1.0/PKG-INFO +0 -221
- docpull-1.1.0/README.md +0 -154
- docpull-1.1.0/docpull.egg-info/PKG-INFO +0 -221
- {docpull-1.1.0 → docpull-1.2.0}/LICENSE +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/__main__.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/doctor.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/fetchers/__init__.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/fetchers/async_fetcher.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/fetchers/base.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/fetchers/bun.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/fetchers/d3.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/fetchers/generic.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/fetchers/generic_async.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/fetchers/nextjs.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/fetchers/parallel_base.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/fetchers/plaid.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/fetchers/react.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/fetchers/stripe.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/fetchers/tailwind.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/fetchers/turborepo.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/profiles/__init__.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/profiles/base.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/profiles/bun.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/profiles/d3.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/profiles/nextjs.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/profiles/plaid.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/profiles/react.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/profiles/stripe.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/profiles/tailwind.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/profiles/turborepo.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/py.typed +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/utils/__init__.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/utils/file_utils.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull/utils/logging_config.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull.egg-info/dependency_links.txt +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull.egg-info/entry_points.txt +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/docpull.egg-info/top_level.txt +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/setup.cfg +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/tests/test_async_fetcher.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/tests/test_config.py +0 -0
- {docpull-1.1.0 → docpull-1.2.0}/tests/test_fetchers.py +0 -0
docpull-1.2.0/PKG-INFO
ADDED
|
@@ -0,0 +1,394 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: docpull
|
|
3
|
+
Version: 1.2.0
|
|
4
|
+
Summary: Pull documentation from the web and convert to clean markdown
|
|
5
|
+
Author-email: Zachary Roth <support@raintree.technology>
|
|
6
|
+
Maintainer-email: Raintree Technology <support@raintree.technology>
|
|
7
|
+
License-Expression: MIT
|
|
8
|
+
Project-URL: Homepage, https://github.com/raintree-technology/docpull
|
|
9
|
+
Project-URL: Documentation, https://github.com/raintree-technology/docpull#readme
|
|
10
|
+
Project-URL: Repository, https://github.com/raintree-technology/docpull
|
|
11
|
+
Project-URL: Source Code, https://github.com/raintree-technology/docpull
|
|
12
|
+
Project-URL: Bug Tracker, https://github.com/raintree-technology/docpull/issues
|
|
13
|
+
Project-URL: Changelog, https://github.com/raintree-technology/docpull/blob/main/CHANGELOG.md
|
|
14
|
+
Keywords: python,markdown,documentation,web-scraping,developer-tools,claude,ai-training-data
|
|
15
|
+
Classifier: Development Status :: 5 - Production/Stable
|
|
16
|
+
Classifier: Intended Audience :: Developers
|
|
17
|
+
Classifier: Intended Audience :: Information Technology
|
|
18
|
+
Classifier: Intended Audience :: Science/Research
|
|
19
|
+
Classifier: Intended Audience :: Education
|
|
20
|
+
Classifier: Environment :: Console
|
|
21
|
+
Classifier: Topic :: Documentation
|
|
22
|
+
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
|
|
23
|
+
Classifier: Topic :: Software Development :: Documentation
|
|
24
|
+
Classifier: Topic :: Text Processing :: Markup :: HTML
|
|
25
|
+
Classifier: Topic :: Text Processing :: Markup :: Markdown
|
|
26
|
+
Classifier: Topic :: Utilities
|
|
27
|
+
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
|
|
28
|
+
Classifier: Natural Language :: English
|
|
29
|
+
Classifier: Operating System :: OS Independent
|
|
30
|
+
Classifier: Programming Language :: Python :: 3
|
|
31
|
+
Classifier: Programming Language :: Python :: 3.9
|
|
32
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
33
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
34
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
35
|
+
Classifier: Programming Language :: Python :: 3.13
|
|
36
|
+
Classifier: Programming Language :: Python :: 3 :: Only
|
|
37
|
+
Classifier: Typing :: Typed
|
|
38
|
+
Requires-Python: >=3.9
|
|
39
|
+
Description-Content-Type: text/markdown
|
|
40
|
+
License-File: LICENSE
|
|
41
|
+
Requires-Dist: requests>=2.31.0
|
|
42
|
+
Requires-Dist: beautifulsoup4>=4.12.0
|
|
43
|
+
Requires-Dist: html2text>=2020.1.16
|
|
44
|
+
Requires-Dist: defusedxml>=0.7.1
|
|
45
|
+
Requires-Dist: aiohttp>=3.9.0
|
|
46
|
+
Requires-Dist: rich>=13.0.0
|
|
47
|
+
Requires-Dist: pyyaml>=6.0
|
|
48
|
+
Requires-Dist: gitpython>=3.1.40
|
|
49
|
+
Provides-Extra: js
|
|
50
|
+
Requires-Dist: playwright>=1.40.0; extra == "js"
|
|
51
|
+
Provides-Extra: all
|
|
52
|
+
Requires-Dist: playwright>=1.40.0; extra == "all"
|
|
53
|
+
Provides-Extra: dev
|
|
54
|
+
Requires-Dist: pytest>=7.0.0; extra == "dev"
|
|
55
|
+
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
|
|
56
|
+
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
|
|
57
|
+
Requires-Dist: black>=23.0.0; extra == "dev"
|
|
58
|
+
Requires-Dist: mypy>=1.0.0; extra == "dev"
|
|
59
|
+
Requires-Dist: ruff>=0.1.0; extra == "dev"
|
|
60
|
+
Requires-Dist: bandit>=1.7.0; extra == "dev"
|
|
61
|
+
Requires-Dist: pip-audit>=2.0.0; extra == "dev"
|
|
62
|
+
Requires-Dist: pre-commit>=3.0.0; extra == "dev"
|
|
63
|
+
Requires-Dist: types-requests>=2.31.0; extra == "dev"
|
|
64
|
+
Requires-Dist: types-beautifulsoup4>=4.12.0; extra == "dev"
|
|
65
|
+
Requires-Dist: types-defusedxml>=0.7.0; extra == "dev"
|
|
66
|
+
Requires-Dist: types-pyyaml>=6.0.0; extra == "dev"
|
|
67
|
+
Dynamic: license-file
|
|
68
|
+
|
|
69
|
+
# docpull
|
|
70
|
+
|
|
71
|
+
**Pull documentation from any website and convert it into clean, AI-ready Markdown.**
|
|
72
|
+
Fast, type-safe, secure, and optimized for building knowledge bases or training datasets.
|
|
73
|
+
|
|
74
|
+
**NEW in v1.2.0**: 15 major features including language filtering, deduplication, auto-indexing, multi-source configuration, and more. Real-world testing shows **58% size reduction** with automatic optimization.
|
|
75
|
+
|
|
76
|
+
[](https://www.python.org/downloads/)
|
|
77
|
+
[](https://badge.fury.io/py/docpull)
|
|
78
|
+
[](https://github.com/raintree-technology/docpull/blob/main/LICENSE)
|
|
79
|
+
[](https://github.com/psf/black)
|
|
80
|
+
[](http://mypy-lang.org/)
|
|
81
|
+
[](https://github.com/PyCQA/bandit)
|
|
82
|
+
|
|
83
|
+
## Why docpull?
|
|
84
|
+
|
|
85
|
+
Unlike tools like wget or httrack, docpull extracts only the main content, removing ads, navbars, and clutter. Output is clean Markdown with optional YAML frontmatter—ideal for RAG systems, offline docs, or ML pipelines.
|
|
86
|
+
|
|
87
|
+
## Key Features
|
|
88
|
+
|
|
89
|
+
### Core Features (v1.0+)
|
|
90
|
+
- Works on any documentation site
|
|
91
|
+
- Smart extraction of main content
|
|
92
|
+
- Async + parallel fetching (up to 10× faster)
|
|
93
|
+
- Optional JavaScript rendering via Playwright
|
|
94
|
+
- Sitemap + link crawling
|
|
95
|
+
- Rate limiting, timeouts, content-type checks
|
|
96
|
+
- Saves docs in structured Markdown with YAML metadata
|
|
97
|
+
- Optimized profiles for popular platforms (Stripe, Next.js, React, Plaid, Tailwind, etc.)
|
|
98
|
+
|
|
99
|
+
### NEW in v1.2.0: Advanced Optimization
|
|
100
|
+
- **Language Filtering**: Auto-detect and filter by language (skip 352+ translation files)
|
|
101
|
+
- **Deduplication**: Remove duplicates with SHA-256 hashing (save 10+ MB on duplicate content)
|
|
102
|
+
- **Auto-Index Generation**: Create navigable INDEX.md with tree/TOC/categories/stats
|
|
103
|
+
- **Size Limits**: Control file and total download size (skip/truncate oversized files)
|
|
104
|
+
- **Multi-Source Configuration**: Configure multiple docs in one YAML file
|
|
105
|
+
- **Selective Crawling**: Include/exclude URL patterns for targeted fetching
|
|
106
|
+
- **Content Filtering**: Remove verbose sections (Examples, Changelog, etc.)
|
|
107
|
+
- **Format Conversion**: Output to Markdown, TOON (compact), JSON, or SQLite
|
|
108
|
+
- **Smart Naming**: 4 naming strategies (full, short, flat, hierarchical)
|
|
109
|
+
- **Metadata Extraction**: Extract titles, URLs, stats to metadata.json
|
|
110
|
+
- **Update Detection**: Only download changed files (checksums, ETags)
|
|
111
|
+
- **Incremental Mode**: Resume interrupted downloads with checkpointing
|
|
112
|
+
- **Hooks & Plugins**: Python plugin system for custom processing
|
|
113
|
+
- **Git Integration**: Auto-commit changes with customizable messages
|
|
114
|
+
- **Archive Mode**: Create tar.gz/zip archives for distribution
|
|
115
|
+
|
|
116
|
+
**Real-world impact**: Testing with 1,914 files (31 MB) → **13 MB (58% reduction)** with all optimizations enabled.
|
|
117
|
+
|
|
118
|
+
## Quick Start
|
|
119
|
+
|
|
120
|
+
```bash
|
|
121
|
+
pip install docpull
|
|
122
|
+
docpull --doctor # verify installation
|
|
123
|
+
|
|
124
|
+
# Basic usage
|
|
125
|
+
docpull https://aptos.dev
|
|
126
|
+
docpull stripe # use a built-in profile
|
|
127
|
+
|
|
128
|
+
# NEW: Simple optimization (v1.2.0)
|
|
129
|
+
docpull https://code.claude.com/docs --language en --create-index
|
|
130
|
+
|
|
131
|
+
# NEW: Advanced optimization (v1.2.0)
|
|
132
|
+
docpull https://aptos.dev \
|
|
133
|
+
--deduplicate \
|
|
134
|
+
--keep-variant mainnet \
|
|
135
|
+
--max-file-size 200kb \
|
|
136
|
+
--create-index
|
|
137
|
+
|
|
138
|
+
# NEW: Multi-source configuration (v1.2.0)
|
|
139
|
+
docpull --sources-file examples/multi-source-optimized.yaml
|
|
140
|
+
```
|
|
141
|
+
|
|
142
|
+
### JavaScript-heavy sites
|
|
143
|
+
|
|
144
|
+
```bash
|
|
145
|
+
pip install docpull[js]
|
|
146
|
+
python -m playwright install chromium
|
|
147
|
+
docpull https://site.com --js
|
|
148
|
+
```
|
|
149
|
+
|
|
150
|
+
## Python API
|
|
151
|
+
|
|
152
|
+
```python
|
|
153
|
+
from docpull import GenericAsyncFetcher
|
|
154
|
+
|
|
155
|
+
fetcher = GenericAsyncFetcher(
|
|
156
|
+
url_or_profile="https://aptos.dev",
|
|
157
|
+
output_dir="./docs",
|
|
158
|
+
max_pages=100,
|
|
159
|
+
max_concurrent=20,
|
|
160
|
+
)
|
|
161
|
+
fetcher.fetch()
|
|
162
|
+
```
|
|
163
|
+
|
|
164
|
+
## Common Options
|
|
165
|
+
|
|
166
|
+
### Core Options
|
|
167
|
+
- `--doctor` – verify installation and dependencies
|
|
168
|
+
- `--max-pages N` – limit crawl size
|
|
169
|
+
- `--max-depth N` – restrict link depth
|
|
170
|
+
- `--max-concurrent N` – control parallel fetches
|
|
171
|
+
- `--js` – enable Playwright rendering
|
|
172
|
+
- `--output-dir DIR` – output directory
|
|
173
|
+
- `--rate-limit X` – seconds between requests
|
|
174
|
+
- `--no-skip-existing` – re-download existing files
|
|
175
|
+
- `--dry-run` – test without downloading
|
|
176
|
+
|
|
177
|
+
### NEW in v1.2.0: Optimization Options
|
|
178
|
+
- `--language LANG` – filter by language (e.g., `en`)
|
|
179
|
+
- `--exclude-languages LANG [LANG ...]` – exclude languages
|
|
180
|
+
- `--deduplicate` – remove duplicate files
|
|
181
|
+
- `--keep-variant PATTERN` – keep files matching pattern when deduplicating
|
|
182
|
+
- `--max-file-size SIZE` – max file size (e.g., `200kb`, `1mb`)
|
|
183
|
+
- `--max-total-size SIZE` – max total download size
|
|
184
|
+
- `--include-paths PATTERN [PATTERN ...]` – only crawl matching URLs
|
|
185
|
+
- `--exclude-paths PATTERN [PATTERN ...]` – skip matching URLs
|
|
186
|
+
- `--exclude-sections NAME [NAME ...]` – remove sections by header name
|
|
187
|
+
- `--format {markdown,toon,json,sqlite}` – output format
|
|
188
|
+
- `--naming-strategy {full,short,flat,hierarchical}` – file naming strategy
|
|
189
|
+
- `--create-index` – generate INDEX.md with navigation
|
|
190
|
+
- `--extract-metadata` – extract metadata to metadata.json
|
|
191
|
+
- `--update-only-changed` – only download changed files
|
|
192
|
+
- `--incremental` – enable incremental mode with resume
|
|
193
|
+
- `--git-commit` – auto-commit changes
|
|
194
|
+
- `--git-message MSG` – commit message template
|
|
195
|
+
- `--archive` – create compressed archive
|
|
196
|
+
- `--archive-format {tar.gz,tar.bz2,tar.xz,zip}` – archive format
|
|
197
|
+
- `--sources-file PATH` – multi-source configuration file
|
|
198
|
+
|
|
199
|
+
See `docpull --help` for the complete list of options.
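Size values such as `200kb` and `1mb` have to be turned into a byte count before they can be compared against a file. The sketch below shows one way to do that. `parse_size` is a hypothetical helper, not docpull's own function, and it assumes binary multiples (1 kb = 1024 bytes); this README does not say which convention docpull uses.

```python
import re

# Assumed multipliers (binary units). docpull's actual interpretation of
# "kb" and "mb" is not stated in this README.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}


def parse_size(text: str) -> int:
    """Convert a string like '200kb' or '1.5 MB' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-z]*)\s*", text.lower())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    unit = unit or "b"  # a bare number is treated as bytes
    if unit not in _UNITS:
        raise ValueError(f"unknown unit in size: {text!r}")
    return int(float(number) * _UNITS[unit])


def exceeds_limit(content: bytes, limit: str) -> bool:
    """Return True when the content is larger than the given size limit."""
    return len(content) > parse_size(limit)
```

With this helper, `exceeds_limit(page_bytes, "200kb")` is the kind of check a skip-or-truncate decision could be based on.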
|
|
200
|
+
|
|
201
|
+
## Performance
|
|
202
|
+
|
|
203
|
+
Async fetching drastically reduces runtime:
|
|
204
|
+
|
|
205
|
+
| Pages | Sync | Async | Speedup |
|
|
206
|
+
|-------|------|-------|---------|
|
|
207
|
+
| 50 | ~50s | ~6s | 8× faster |
|
|
208
|
+
|
|
209
|
+
Higher concurrency yields even better results.
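The speedup comes from overlapping the time spent waiting on the network. The sketch below illustrates the idea with the standard library only: `fake_fetch` simulates a request with `asyncio.sleep`, and `crawl` and `timed_crawl` are hypothetical names. It is not docpull's fetcher, just a minimal picture of bounded concurrency.

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float = 0.05) -> str:
    """Stand-in for an HTTP request: waits, then returns a fake body."""
    await asyncio.sleep(delay)
    return f"content of {url}"


async def crawl(urls: list[str], max_concurrent: int = 10) -> list[str]:
    """Fetch all URLs with at most `max_concurrent` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> str:
        async with semaphore:
            return await fake_fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))


def timed_crawl(urls: list[str], max_concurrent: int):
    """Run the crawl and return (pages, elapsed_seconds)."""
    start = time.perf_counter()
    pages = asyncio.run(crawl(urls, max_concurrent))
    return pages, time.perf_counter() - start
```

Calling `timed_crawl(urls, 1)` and then `timed_crawl(urls, 10)` on the same list shows the effect: with one slot the waits add up, with ten they overlap.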
|
|
210
|
+
|
|
211
|
+
## Output Format
|
|
212
|
+
|
|
213
|
+
Each downloaded page becomes a Markdown file:
|
|
214
|
+
|
|
215
|
+
```markdown
|
|
216
|
+
---
|
|
217
|
+
url: https://stripe.com/docs/payments
|
|
218
|
+
fetched: 2025-11-13
|
|
219
|
+
---
|
|
220
|
+
# Payment Intents
|
|
221
|
+
...
|
|
222
|
+
```
|
|
223
|
+
|
|
224
|
+
Directory layout mirrors the target site's structure.
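Downstream tools usually need the metadata and the body separately. A minimal sketch of splitting the frontmatter shown above, using only the standard library: `split_frontmatter` is a hypothetical helper, and it handles only flat `key: value` pairs like `url` and `fetched`. For anything richer, a real YAML parser is the safer choice.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a document into (metadata, body).

    Only flat `key: value` lines between the two `---` delimiters are read.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter at all
    try:
        end = lines.index("---", 1)  # position of the closing delimiter
    except ValueError:
        return {}, text  # unterminated block: treat everything as body
    metadata: dict[str, str] = {}
    for line in lines[1:end]:
        # partition splits at the first colon, so URLs stay intact
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:])
    return metadata, body
```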
|
|
225
|
+
|
|
226
|
+
## Configuration File
|
|
227
|
+
|
|
228
|
+
### Simple Configuration (v1.0+)
|
|
229
|
+
|
|
230
|
+
```yaml
|
|
231
|
+
output_dir: ./docs
|
|
232
|
+
rate_limit: 0.5
|
|
233
|
+
sources:
|
|
234
|
+
- stripe
|
|
235
|
+
- nextjs
|
|
236
|
+
```
|
|
237
|
+
|
|
238
|
+
Run with:
|
|
239
|
+
```bash
|
|
240
|
+
docpull --config config.yaml
|
|
241
|
+
```
|
|
242
|
+
|
|
243
|
+
### NEW: Multi-Source Configuration (v1.2.0)
|
|
244
|
+
|
|
245
|
+
```yaml
|
|
246
|
+
sources:
|
|
247
|
+
anthropic:
|
|
248
|
+
url: https://docs.anthropic.com
|
|
249
|
+
language: en
|
|
250
|
+
max_file_size: 200kb
|
|
251
|
+
create_index: true
|
|
252
|
+
|
|
253
|
+
claude-code:
|
|
254
|
+
url: https://code.claude.com/docs
|
|
255
|
+
language: en # Skips 352 translation files!
|
|
256
|
+
create_index: true
|
|
257
|
+
|
|
258
|
+
aptos:
|
|
259
|
+
url: https://aptos.dev
|
|
260
|
+
deduplicate: true
|
|
261
|
+
keep_variant: mainnet # Skips 304 duplicates!
|
|
262
|
+
max_file_size: 200kb
|
|
263
|
+
include_paths:
|
|
264
|
+
- "build/guides/*"
|
|
265
|
+
|
|
266
|
+
output_dir: ./docs
|
|
267
|
+
rate_limit: 0.5
|
|
268
|
+
git_commit: true
|
|
269
|
+
git_message: "Update docs - {date}"
|
|
270
|
+
extract_metadata: true
|
|
271
|
+
archive: true
|
|
272
|
+
```
|
|
273
|
+
|
|
274
|
+
Run with:
|
|
275
|
+
```bash
|
|
276
|
+
docpull --sources-file config.yaml
|
|
277
|
+
```
|
|
278
|
+
|
|
279
|
+
See the `examples/` directory for more configuration examples.
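Patterns such as `build/guides/*` decide which URLs are crawled. The sketch below shows one plausible reading of `include_paths` and `exclude_paths` using shell-style matching from the standard library. `should_crawl` is a hypothetical function, and the exact matching rules (glob vs. regex, precedence of exclude over include) are assumptions, not docpull's documented behavior.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def should_crawl(url: str, include_paths=(), exclude_paths=()) -> bool:
    """Decide whether a URL passes the include/exclude filters.

    Assumed semantics: patterns are matched against the URL path without
    its leading slash; an exclude match always wins; an empty include
    list means "everything".
    """
    path = urlparse(url).path.lstrip("/")
    if any(fnmatchcase(path, pattern) for pattern in exclude_paths):
        return False
    if include_paths:
        return any(fnmatchcase(path, pattern) for pattern in include_paths)
    return True
```

Note that `*` in `fnmatch` also matches `/`, so `build/guides/*` covers nested pages as well as direct children.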
|
|
280
|
+
|
|
281
|
+
## Custom Profiles
|
|
282
|
+
|
|
283
|
+
Easily define profiles for frequently scraped sites.
|
|
284
|
+
|
|
285
|
+
```python
|
|
286
|
+
from docpull.profiles.base import SiteProfile
|
|
287
|
+
|
|
288
|
+
MY_PROFILE = SiteProfile(
|
|
289
|
+
name="mysite",
|
|
290
|
+
domains={"docs.mysite.com"},
|
|
291
|
+
include_patterns=["/docs/", "/api/"],
|
|
292
|
+
)
|
|
293
|
+
```
|
|
294
|
+
|
|
295
|
+
## Security
|
|
296
|
+
|
|
297
|
+
- HTTPS-only
|
|
298
|
+
- Blocks private network IPs
|
|
299
|
+
- 50MB page size limit
|
|
300
|
+
- Timeout controls
|
|
301
|
+
- Validates content-type
|
|
302
|
+
- Playwright sandboxing
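The first two checks in the list above can be pictured with the standard library alone. This is a minimal sketch, not docpull's implementation: `is_url_allowed` is a hypothetical name, and the resolved IP address is passed in as an argument so the example does not perform DNS lookups.

```python
import ipaddress
from urllib.parse import urlparse


def is_url_allowed(url: str, resolved_ip: str) -> bool:
    """Return True if the URL is HTTPS and its address is publicly routable."""
    if urlparse(url).scheme != "https":
        return False  # HTTPS-only
    address = ipaddress.ip_address(resolved_ip)
    # Reject private ranges plus loopback, link-local and reserved blocks
    return not (
        address.is_private
        or address.is_loopback
        or address.is_link_local
        or address.is_reserved
    )
```

Checking the resolved address rather than the hostname matters because a public-looking hostname can still resolve to an internal address.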
|
|
303
|
+
|
|
304
|
+
## Troubleshooting
|
|
305
|
+
|
|
306
|
+
- **Installation issues**: Run `docpull --doctor` to diagnose problems
|
|
307
|
+
- **Missing dependencies**: See [TROUBLESHOOTING.md](TROUBLESHOOTING.md) for common fixes
|
|
308
|
+
- **Site requires JS**: install Playwright and pass `--js`
|
|
309
|
+
- **Slow or rate limited**: lower concurrency or raise `--rate-limit`
|
|
310
|
+
- **Large sites**: set `--max-pages`
|
|
311
|
+
|
|
312
|
+
For detailed troubleshooting, see [TROUBLESHOOTING.md](TROUBLESHOOTING.md).
|
|
313
|
+
|
|
314
|
+
## v1.2.0 Feature Examples
|
|
315
|
+
|
|
316
|
+
### Language Filtering
|
|
317
|
+
Automatically detect and filter documentation by language:
|
|
318
|
+
```bash
|
|
319
|
+
# English only (auto-detects /en/, _en_, docs_en_, etc.)
|
|
320
|
+
docpull https://code.claude.com/docs --language en --create-index
|
|
321
|
+
```
|
|
322
|
+
**Impact**: Claude Code docs downloaded in 9 languages = 352 unnecessary files for English-only users.
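Path-based detection of the kind described above (`/en/`, `_en_`) can be sketched with a single regular expression. `detect_language`, `keep_file`, and the `KNOWN_LANGUAGES` set are all hypothetical; the set of codes is illustrative only, since this README does not list which languages docpull recognizes.

```python
import re
from typing import Optional

# Illustrative codes only; the languages docpull detects are not listed here.
KNOWN_LANGUAGES = {"en", "de", "es", "fr", "it", "ja", "ko", "pt", "zh"}

# A two-letter token with "/" or "_" on both sides, e.g. /en/ or _en_.
_TOKEN = re.compile(r"(?<=[/_])([a-z]{2})(?=[/_])")


def detect_language(path: str) -> Optional[str]:
    """Return the first known language code found in the path, if any."""
    for code in _TOKEN.findall(path.lower()):
        if code in KNOWN_LANGUAGES:
            return code
    return None


def keep_file(path: str, language: str = "en") -> bool:
    """Keep files in the wanted language and files with no language marker."""
    detected = detect_language(path)
    return detected is None or detected == language
```

Restricting matches to a known set avoids treating unrelated two-letter segments such as `/js/` as a language.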
|
|
323
|
+
|
|
324
|
+
### Deduplication
|
|
325
|
+
Remove duplicate files based on content hash:
|
|
326
|
+
```bash
|
|
327
|
+
# Keep mainnet version, skip testnet/devnet duplicates
|
|
328
|
+
docpull https://aptos.dev --deduplicate --keep-variant mainnet --create-index
|
|
329
|
+
```
|
|
330
|
+
**Impact**: Aptos Move reference docs across 3 environments = 304 duplicate files (~10 MB saved).
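Content-hash deduplication with a preferred variant can be sketched as follows. `deduplicate` is a hypothetical function operating on an in-memory mapping of path to bytes; treating `keep_variant` as a plain substring of the path is an assumption about how `--keep-variant` matches.

```python
import hashlib
from typing import Optional


def deduplicate(files: dict[str, bytes], keep_variant: Optional[str] = None) -> dict[str, bytes]:
    """Keep one path per distinct SHA-256 content hash.

    When several paths share the same content, a path containing
    `keep_variant` wins; otherwise the first path in sorted order is kept.
    """
    chosen: dict[str, str] = {}  # digest -> path kept for that content
    for path in sorted(files):
        digest = hashlib.sha256(files[path]).hexdigest()
        current = chosen.get(digest)
        if current is None:
            chosen[digest] = path
        elif keep_variant and keep_variant in path and keep_variant not in current:
            chosen[digest] = path  # prefer the requested variant
    return {path: files[path] for path in sorted(chosen.values())}
```

For three identical copies under `devnet/`, `mainnet/`, and `testnet/`, calling `deduplicate(files, "mainnet")` keeps only the `mainnet/` copy.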
|
|
331
|
+
|
|
332
|
+
### Format Conversion
|
|
333
|
+
Convert to different formats for various use cases:
|
|
334
|
+
```bash
|
|
335
|
+
# TOON format (40-60% size reduction, optimized for LLMs)
|
|
336
|
+
docpull https://docs.anthropic.com --format toon --language en
|
|
337
|
+
|
|
338
|
+
# SQLite database with full-text search
|
|
339
|
+
docpull https://docs.anthropic.com --format sqlite --language en
|
|
340
|
+
|
|
341
|
+
# Structured JSON
|
|
342
|
+
docpull https://docs.anthropic.com --format json --language en
|
|
343
|
+
```
|
|
344
|
+
|
|
345
|
+
### Incremental Updates
|
|
346
|
+
Only download changed files:
|
|
347
|
+
```bash
|
|
348
|
+
docpull https://docs.anthropic.com \
|
|
349
|
+
--incremental \
|
|
350
|
+
--update-only-changed \
|
|
351
|
+
--git-commit \
|
|
352
|
+
--git-message "Update docs - {date}"
|
|
353
|
+
```
|
|
354
|
+
**Use case**: Regular documentation updates with minimal bandwidth usage.
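The checksum half of update detection can be pictured as a small manifest that maps each URL to the hash of its last saved content. The sketch below is an assumption about the general approach, not docpull's actual cache format, and every function name in it is hypothetical. It does not cover ETags, which let a server report "unchanged" before the body is downloaded.

```python
import hashlib
import json
from pathlib import Path


def load_manifest(path: Path) -> dict:
    """Read the URL -> SHA-256 mapping, or start empty on the first run."""
    return json.loads(path.read_text()) if path.exists() else {}


def needs_update(manifest: dict, url: str, content: bytes) -> bool:
    """True when the content differs from what was recorded for this URL."""
    return manifest.get(url) != hashlib.sha256(content).hexdigest()


def record(manifest: dict, url: str, content: bytes) -> None:
    """Remember the hash of the content just written for this URL."""
    manifest[url] = hashlib.sha256(content).hexdigest()


def save_manifest(path: Path, manifest: dict) -> None:
    """Persist the manifest so the next run can compare against it."""
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

Saving the manifest after each batch also gives an interrupted run a record of what was already written.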
|
|
355
|
+
|
|
356
|
+
### Complete Optimization Pipeline
|
|
357
|
+
Combine all optimizations:
|
|
358
|
+
```bash
|
|
359
|
+
docpull --sources-file examples/multi-source-optimized.yaml
|
|
360
|
+
```
|
|
361
|
+
See the `examples/` directory for comprehensive configuration examples.
|
|
362
|
+
|
|
363
|
+
**Real-world results**: Testing with 4 documentation sources (Anthropic, Claude Code, Aptos, Shelby):
|
|
364
|
+
- **Before**: 1,914 files, 31 MB, no navigation
|
|
365
|
+
- **After**: 1,250 files, 13 MB (58% reduction), full indexes generated
|
|
366
|
+
- **One command** instead of 4+ separate commands with manual optimization
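The percentages follow directly from the before and after figures. A short check of the arithmetic, using only the numbers quoted above (`reduction_percent` is a hypothetical helper):

```python
def reduction_percent(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100


size_cut = reduction_percent(31, 13)       # MB before and after
file_cut = reduction_percent(1914, 1250)   # file count before and after

print(f"size: {size_cut:.0f}%  files: {file_cut:.0f}%")
# prints "size: 58%  files: 35%"
```

So the 58% figure refers to total size; the file count drops by roughly 35% (664 fewer files).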
|
|
367
|
+
|
|
368
|
+
## What's New in v1.2.0
|
|
369
|
+
|
|
370
|
+
This release adds 15 major features across 4 phases. See [CHANGELOG.md](CHANGELOG.md) for complete release notes.
|
|
371
|
+
|
|
372
|
+
**Highlights**:
|
|
373
|
+
- Multi-source YAML configuration
|
|
374
|
+
- Language filtering with auto-detection
|
|
375
|
+
- SHA-256 based deduplication
|
|
376
|
+
- Auto-index generation (tree, TOC, categories, stats)
|
|
377
|
+
- 4 output formats (Markdown, TOON, JSON, SQLite)
|
|
378
|
+
- Incremental mode with resume capability
|
|
379
|
+
- Git integration and archive creation
|
|
380
|
+
- Python plugin/hook system
|
|
381
|
+
|
|
382
|
+
**Backward Compatible**: All v1.0+ workflows continue to work unchanged.
|
|
383
|
+
|
|
384
|
+
## Links
|
|
385
|
+
|
|
386
|
+
- [PyPI](https://pypi.org/project/docpull/)
|
|
387
|
+
- [GitHub](https://github.com/raintree-technology/docpull)
|
|
388
|
+
- [Issues](https://github.com/raintree-technology/docpull/issues)
|
|
389
|
+
- [Changelog](https://github.com/raintree-technology/docpull/blob/main/CHANGELOG.md)
|
|
390
|
+
- [Examples](https://github.com/raintree-technology/docpull/tree/main/examples)
|
|
391
|
+
|
|
392
|
+
## License
|
|
393
|
+
|
|
394
|
+
MIT License - see [LICENSE](LICENSE) file for details
|