doculift-cli 0.1.0__py3-none-any.whl

@@ -0,0 +1,240 @@
1
+ Metadata-Version: 2.4
2
+ Name: doculift-cli
3
+ Version: 0.1.0
4
+ Summary: A powerful CLI & web scraper that lifts documentation for Large Language Models.
5
+ Author: M.J. Shetty
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/mjshetty/doculift
8
+ Classifier: Programming Language :: Python :: 3
9
+ Classifier: License :: OSI Approved :: MIT License
10
+ Classifier: Operating System :: OS Independent
11
+ Classifier: Intended Audience :: Developers
12
+ Classifier: Topic :: Utilities
13
+ Requires-Python: >=3.10
14
+ Description-Content-Type: text/markdown
15
+ Requires-Dist: flask>=3.0.0
16
+ Requires-Dist: requests
17
+ Requires-Dist: beautifulsoup4
18
+ Requires-Dist: playwright
19
+ Requires-Dist: click
20
+ Requires-Dist: rich
21
+ Provides-Extra: dev
22
+ Requires-Dist: black; extra == "dev"
23
+ Requires-Dist: flake8; extra == "dev"
24
+ Requires-Dist: bandit; extra == "dev"
25
+ Requires-Dist: build; extra == "dev"
26
+ Requires-Dist: twine; extra == "dev"
27
+
28
+ # DocuLift
29
+
30
+ **DocuLift** is a web scraping tool that lifts documentation websites into clean, aggregated files optimized for feeding into Large Language Models like Google NotebookLM, Claude, or ChatGPT.
31
+
32
+ It handles dynamic Single Page Applications (SPAs), respects site structure, and produces output in two modes: full content extraction or URL-only extraction.
33
+
34
+ ---
35
+
36
+ ## Features
37
+
38
+ - **Two Extract Modes** — choose between extracting full page content or just collecting URLs (see [When to Use Each Mode](#when-to-use-each-mode))
39
+ - **Dynamic Content Scraping** — uses Playwright (headless Chromium) to render JavaScript-heavy sites (React, Vue, etc.) before extraction
40
+ - **Smart Scoping**:
41
+ - **Section Only** — stays within the folder boundary of the starting URL (e.g. starting at `.../docs/agents/overview` scrapes everything under `.../docs/agents/`)
42
+ - **Entire Domain** — crawls all pages under the target domain
43
+ - **Intelligent Aggregation** — combines multiple pages into single files, auto-splits at ~500KB (NotebookLM's per-file limit), generates meaningful filenames
44
+ - **Multi-URL Support** — submit multiple starting URLs in one job; each is crawled independently and produces its own output file(s)
45
+ - **Per-URL stats** — on completion, the UI shows how many pages or URLs were collected per starting URL
46
+ - **Clean Extraction** — removes navigation, footers, sidebars, ads, and scripts; focuses on main content
47
+
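The two scoping strategies above can be illustrated with a small URL check. This is a minimal sketch, not DocuLift's actual code: the helper names `section_prefix` and `in_scope` are hypothetical, and the `"domain"` value for `scope_type` is an assumption (only `"section"` appears in the API example).

```python
from urllib.parse import urlsplit


def section_prefix(start_url: str) -> str:
    """Return the folder boundary of the starting URL (hypothetical helper)."""
    parts = urlsplit(start_url)
    # Keep everything up to and including the last slash of the path.
    folder = parts.path.rsplit("/", 1)[0] + "/"
    return f"{parts.scheme}://{parts.netloc}{folder}"


def in_scope(url: str, start_url: str, scope_type: str = "section") -> bool:
    """Decide whether a discovered URL belongs to the crawl."""
    if scope_type == "domain":  # assumed name for the "Entire Domain" strategy
        return urlsplit(url).netloc == urlsplit(start_url).netloc
    return url.startswith(section_prefix(start_url))


start = "https://example.com/docs/agents/overview"
print(section_prefix(start))
print(in_scope("https://example.com/docs/agents/tools", start))
print(in_scope("https://example.com/blog/post", start))
```

A plain prefix comparison keeps the rule easy to reason about; a real crawler would probably also normalise scheme, host case, and trailing slashes before comparing.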
48
+ ---
49
+
50
+ ## When to Use Each Mode
51
+
52
+ ### Extract Content
53
+ Crawls each page and converts its content to Markdown (or text/CSV). Use this when you want to feed documentation directly into an LLM as context.
54
+
55
+ - **Best for**: NotebookLM, Claude Projects, ChatGPT — any tool that accepts uploaded documents
56
+ - **Output**: One or more `.md` files per starting URL, split at ~500KB
57
+ - **Typical workflow**: Extract content → upload files to NotebookLM → ask questions
58
+
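The ~500KB splitting described above could work roughly as follows. This is a sketch under assumptions: the function name `split_pages`, the exact threshold of `500 * 1024` bytes, and the blank-line separator are illustrative choices, not taken from the DocuLift source.

```python
MAX_BYTES = 500 * 1024  # assumed value for the "~500KB" threshold


def split_pages(pages, max_bytes=MAX_BYTES):
    """Group page texts into chunks of roughly max_bytes of UTF-8 each."""
    chunks, current, size = [], [], 0
    for page in pages:
        page_size = len(page.encode("utf-8"))
        # Start a new chunk when adding this page would exceed the budget.
        if current and size + page_size > max_bytes:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(page)
        size += page_size
    if current:
        chunks.append("\n\n".join(current))
    return chunks


parts = split_pages(["a" * 300, "b" * 300, "c" * 300], max_bytes=700)
print([len(p) for p in parts])
```

Pages are never cut in the middle, so a single page larger than the budget ends up in its own oversized chunk, and the separators are not counted toward the size. Both are simplifications that suit an approximate limit.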
59
+ ### Extract URLs Only
60
+ Crawls the site and collects every discovered URL within scope, writing them to a plain `.txt` file — one URL per line, no other content.
61
+
62
+ **Use this when NotebookLM's URL limit is the bottleneck.**
63
+
64
+ NotebookLM supports adding web URLs as sources, but caps how many you can add per notebook. When a documentation section has hundreds of pages, you'll hit that limit quickly. The recommended three-step workflow is:
65
+
66
+ 1. **Run "Extract URLs Only"** on the target documentation to get a full list of all pages within scope
67
+ 2. **Review and trim** the URL list down to the most relevant pages
68
+ 3. **Add the trimmed URLs directly to NotebookLM** as web sources — NotebookLM fetches and indexes them itself, giving you live, citable sources rather than static file uploads
69
+
70
+ This approach gives you fine-grained control over exactly which pages NotebookLM indexes, without wasting your URL quota on irrelevant pages.
71
+
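The "review and trim" step can be partly automated. The sketch below assumes the URL list is a plain text file with one URL per line, as described above; the function `trim_urls` and its `include`/`exclude` keyword filters are hypothetical and not part of DocuLift.

```python
def trim_urls(lines, include=(), exclude=()):
    """Keep unique URLs that match any include keyword and no exclude keyword."""
    seen, kept = set(), []
    for line in lines:
        url = line.strip()
        if not url or url in seen:
            continue  # skip blank lines and duplicates
        if include and not any(key in url for key in include):
            continue
        if any(key in url for key in exclude):
            continue
        seen.add(url)
        kept.append(url)
    return kept


urls = [
    "https://example.com/docs/install\n",
    "https://example.com/docs/changelog/v1\n",
    "https://example.com/docs/install\n",
    "https://example.com/docs/api/auth\n",
]
print(trim_urls(urls, include=("/docs/",), exclude=("changelog",)))
```

In practice `lines` would come from opening the generated `.txt` file, and the trimmed list would then be pasted into NotebookLM as web sources.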
72
+ ---
73
+
74
+ ## Tech Stack
75
+
76
+ | Layer | Technology |
77
+ |---|---|
78
+ | Backend | Python 3.10+, Flask |
79
+ | Scraping | Playwright (headless Chromium) |
80
+ | Parsing | BeautifulSoup4 |
81
+ | Frontend | HTML5, CSS (Glassmorphism), Vanilla JS |
82
+ | CI/CD | GitHub Actions, Black, Flake8, Bandit |
83
+
84
+ ---
85
+
86
+ ## Continuous Integration (CI/CD)
87
+
88
+ DocuLift includes a pre-configured GitHub Actions pipeline (`.github/workflows/ci.yml`) that automatically runs on every push and pull request to the `main` or `master` branches.
89
+
90
+ The pipeline executes the following checks to ensure code quality and security:
91
+
92
+ 1. **Code Formatting (Black)**
93
+ - Automatically checks that all Python files adhere to standard `black` formatting rules.
94
+ 2. **Linting (Flake8)**
95
+ - Scans for syntax errors, undefined names, and unused imports.
96
+ - Enforces a maximum line length and complexity thresholds.
97
+ 3. **Security Scanning (Bandit)**
98
+ - Analyzes Python code for common security vulnerabilities.
99
+ - Ensures safe configurations (e.g., verifying `debug=False` for Flask in production environments).
100
+
101
+ *Note: The pipeline fails if any high-severity security issue is found, which prevents insecure code from being merged.*
102
+
103
+ ---
104
+
105
+ ## Installation
106
+
107
+ ### Prerequisites
108
+ - Python 3.10 or higher
109
+ - `pip`
110
+
111
+ ### Steps
112
+
113
+ 1. **Clone the repository**
114
+ ```bash
115
+ git clone <repository-url>
116
+ cd doculift
117
+ ```
118
+
119
+ 2. **Create a virtual environment**
120
+ ```bash
121
+ python3 -m venv venv
122
+ source venv/bin/activate # Windows: venv\Scripts\activate
123
+ ```
124
+
125
+ 3. **Install dependencies**
126
+ ```bash
127
+ pip install -r requirements.txt
128
+ ```
129
+
130
+ 4. **Install Chromium**
131
+ ```bash
132
+ playwright install chromium
133
+ ```
134
+
135
+ 5. **Start the app**
136
+ ```bash
137
+ python3 app.py
138
+ ```
139
+ Open `http://127.0.0.1:5001` in your browser.
140
+
141
+ ---
142
+
143
+ ## Usage
144
+
145
+ DocuLift is a hybrid tool: you can run it through a web interface or directly from your terminal.
146
+
147
+ ### 1. Web User Interface
148
+
149
+ Start the local server:
150
+ ```bash
151
+ doculift ui
152
+ # or
153
+ doculift ui --port 5001
154
+ ```
155
+ Then open `http://127.0.0.1:5001` in your browser.
156
+
157
+ 1. **Enter target URLs** — one per line (e.g. `https://docs.docker.com/reference/`)
158
+ 2. **Choose Extract Mode** — *Extract Content* or *Extract URLs Only*
159
+ 3. **Choose Scoping Strategy** — *Section Only* (recommended) or *Entire Domain*
160
+ 4. **Choose Output Format** — Markdown, Plain Text, or CSV (applies to content mode)
161
+ 5. **Set Max Pages per URL** — default 500; each starting URL is crawled independently up to this limit
162
+ 6. **Click "Siphon Content"** and watch the progress bar
163
+ 7. On completion, per-URL stats are shown and files are available for download
164
+
165
+ ### 2. Command Line Interface (CLI)
166
+
167
+ Run an extraction directly from your terminal; a progress bar tracks the crawl. Files are saved to the `./outputs` folder automatically.
168
+
169
+ ```bash
170
+ # See all available commands and options
171
+ doculift --help
172
+
173
+ # See options specific to the scrape command
174
+ doculift scrape --help
175
+
176
+ # Example: Extract full markdown content from a documentation section
177
+ doculift scrape https://docs.docker.com/reference/
178
+
179
+ # Example: Extract only URLs from multiple sources, capped at 1000 pages per starting URL
180
+ doculift scrape https://paketo.io/docs/ https://docs.docker.com/ --mode urls --max-pages 1000
181
+ ```
182
+
183
+ ---
184
+
185
+ ## How It Works
186
+
187
+ ```
188
+ User submits URLs + config
189
+
190
+ Background thread spawned (one per job)
191
+
192
+ For each starting URL:
193
+ ├── Determine scope (section boundary or full domain)
194
+ ├── BFS crawl with Playwright (handles JS rendering)
195
+ ├── [Content mode] Clean HTML → Markdown, buffer → split files at 500KB
196
+ └── [URL mode] Collect discovered links → single .txt file
197
+
198
+ Per-URL stats displayed, files available for download
199
+ ```
200
+
201
+ **Key crawl behaviours:**
202
+ - Each starting URL gets an independent BFS with its own visited set — URLs are not cross-contaminated between starting points
203
+ - `max_pages` applies per starting URL, not globally
204
+ - Pages already scraped by an earlier starting URL in the same job are skipped to avoid duplication
205
+ - Fragment URLs (`#anchor`) are normalised and deduplicated
206
+
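The crawl behaviours listed above (breadth-first order, a per-start visited set, a per-start page cap, and fragment normalisation) can be sketched in a few lines. This is an illustration only: `crawl`, `normalise`, `get_links`, and `in_scope` are hypothetical names, and link extraction is passed in as a callable so the example runs without Playwright.

```python
from collections import deque
from urllib.parse import urldefrag


def normalise(url: str) -> str:
    """Drop the #fragment so anchors on one page count as the same URL."""
    return urldefrag(url).url


def crawl(start_url, get_links, in_scope, max_pages=500):
    """Breadth-first crawl from one starting URL; returns pages in visit order."""
    start = normalise(start_url)
    visited = {start}  # one visited set per starting URL
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            link = normalise(link)
            if link not in visited and in_scope(link):
                visited.add(link)
                queue.append(link)
    return order


# A tiny in-memory "site" standing in for rendered pages.
site = {
    "https://e.x/docs/": ["https://e.x/docs/a#top", "https://e.x/docs/b", "https://e.x/blog"],
    "https://e.x/docs/a": ["https://e.x/docs/b#sec", "https://e.x/docs/"],
}
pages = crawl(
    "https://e.x/docs/#intro",
    get_links=lambda u: site.get(u, []),
    in_scope=lambda u: u.startswith("https://e.x/docs/"),
)
print(pages)
```

Marking a URL as visited when it is queued, rather than when it is fetched, keeps the same page from being enqueued twice by different parents.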
207
+ ---
208
+
209
+ ## API
210
+
211
+ Trigger jobs programmatically:
212
+
213
+ ```bash
214
+ curl -X POST http://127.0.0.1:5001/scrape \
215
+ -H "Content-Type: application/json" \
216
+ -d '{
217
+ "urls": ["https://docs.docker.com/reference/", "https://paketo.io/docs/"],
218
+ "format": "md",
219
+ "max_pages": 200,
220
+ "scope_type": "section",
221
+ "extract_mode": "content"
222
+ }'
223
+ ```
224
+
225
+ Response:
226
+ ```json
227
+ { "job_id": "abc123" }
228
+ ```
229
+
230
+ Poll for status:
231
+ ```bash
232
+ curl http://127.0.0.1:5001/status/abc123
233
+ ```
234
+
235
+ Response fields: `status`, `progress`, `is_finished`, `files`, `per_url_stats`, `urls_extracted`.
236
+
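The submit-then-poll flow above can also be driven from Python. This is a minimal sketch using only the standard library and the endpoints shown in this section; the helper names `http_json` and `run_job`, the polling interval, and the timeout are assumptions, as is treating `is_finished` as a boolean.

```python
import json
import time
from urllib.request import Request, urlopen

BASE_URL = "http://127.0.0.1:5001"  # address used in the examples above


def http_json(url, payload=None):
    """GET the URL, or POST the payload as JSON when one is given."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    with urlopen(Request(url, data=data, headers=headers)) as resp:
        return json.load(resp)


def run_job(payload, fetch=http_json, interval=2.0, timeout=600.0):
    """Submit a scrape job and poll /status until it reports completion."""
    job_id = fetch(f"{BASE_URL}/scrape", payload)["job_id"]
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch(f"{BASE_URL}/status/{job_id}")
        if status.get("is_finished"):  # assumed to be a boolean flag
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout} seconds")
```

With the server running, a call such as `run_job({"urls": ["https://docs.docker.com/reference/"], "format": "md", "max_pages": 200, "scope_type": "section", "extract_mode": "content"})` would return the final status payload, whose `files` entries can then be fetched from the download endpoint. The `fetch` parameter exists so the polling logic can be exercised without a live server.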
237
+ Download a file:
238
+ ```
239
+ GET /download/<job_id>/<filename>
240
+ ```
@@ -0,0 +1,14 @@
1
+ doculift/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
2
+ doculift/__main__.py,sha256=8hDtWlaFZK24KhfNq_ZKgtXqYHsDQDetukOCMlsbW0Q,59
3
+ doculift/app.py,sha256=BTxwSjCR3s9BFz0iq5S4J20q80k48Q86ZXFRv5b5Y4g,2794
4
+ doculift/cli.py,sha256=A2nTbypf-cNmDeIWp1eqVLP4fwwpMGg071-uw6Q4_EQ,4861
5
+ doculift/scraper.py,sha256=9gL2VaOwVG1BlrAN3WuYHhQIK9ZSudcqZbhTX5RIhGc,16806
6
+ doculift/static/doculift_logo.png,sha256=MJHy9Xodg91R3JpRcZd-fnB57iHWdklpaMk7lXUj3MA,369863
7
+ doculift/static/css/style.css,sha256=_Gfzzw51CNqYNFPq1hP787C2BjAUHYge9qco1d0gHnA,10910
8
+ doculift/static/js/main.js,sha256=7dZWJxH1lCgqRkabR9eypX-FXKIvohhkuMEp6wi4uOw,3445
9
+ doculift/templates/index.html,sha256=JncjxLmpimlwGmdkGHI4yZ3qBbsbigp01ZlzDNiujGQ,3864
10
+ doculift_cli-0.1.0.dist-info/METADATA,sha256=WmBHHLAmWQA3B8ednB2D_lmeyKVsGyUldDETK7ogwxw,8399
11
+ doculift_cli-0.1.0.dist-info/WHEEL,sha256=YCfwYGOYMi5Jhw2fU4yNgwErybb2IX5PEwBKV4ZbdBo,91
12
+ doculift_cli-0.1.0.dist-info/entry_points.txt,sha256=nY4u9yybrlQ9jIGwL7tMemienETueN7P13KF59Tqk3w,46
13
+ doculift_cli-0.1.0.dist-info/top_level.txt,sha256=hxBMetN0Aq6vjzDjtdfoix7uK4DgkG56P92oh5mFNPY,9
14
+ doculift_cli-0.1.0.dist-info/RECORD,,
@@ -0,0 +1,5 @@
1
+ Wheel-Version: 1.0
2
+ Generator: setuptools (82.0.0)
3
+ Root-Is-Purelib: true
4
+ Tag: py3-none-any
5
+
@@ -0,0 +1,2 @@
1
+ [console_scripts]
2
+ doculift = doculift.cli:cli
@@ -0,0 +1 @@
1
+ doculift