docslight 0.1.0__py3-none-any.whl

@@ -0,0 +1,277 @@
+ Metadata-Version: 2.4
+ Name: docslight
+ Version: 0.1.0
+ Summary: Lightweight ComPDF document parsing and extraction SDK
+ Author-email: ComPDF AI <support@compdf.com>
+ License-Expression: MIT
+ Requires-Python: >=3.10
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: requests>=2.31.0
+ Requires-Dist: pydantic>=2.5.0
+ Requires-Dist: typing-extensions>=4.8.0; python_version < "3.11"
+ Requires-Dist: tomli>=2.0.1; python_version < "3.11"
+ Requires-Dist: flask>=3.1.3
+ Requires-Dist: werkzeug>=3.1.8
+ Provides-Extra: local
+ Requires-Dist: Pillow>=10.0.0; extra == "local"
+ Requires-Dist: PyMuPDF>=1.23.0; extra == "local"
+ Requires-Dist: numpy>=1.24.0; extra == "local"
+ Requires-Dist: paddlepaddle>=3.3.0; extra == "local"
+ Requires-Dist: paddleocr[doc-parser]>=3.3.0; extra == "local"
+ Requires-Dist: python-docx>=1.1.0; extra == "local"
+ Requires-Dist: python-pptx>=0.6.23; extra == "local"
+ Requires-Dist: openpyxl>=3.1.0; extra == "local"
+ Provides-Extra: local-llm
+ Requires-Dist: openai>=1.0.0; extra == "local-llm"
+ Provides-Extra: web
+ Requires-Dist: Flask>=3.0.0; extra == "web"
+ Requires-Dist: Werkzeug>=3.0.0; extra == "web"
+ Provides-Extra: web-test
+ Requires-Dist: playwright>=1.40.0; extra == "web-test"
+ Provides-Extra: dev
+ Requires-Dist: pytest>=7.4.0; extra == "dev"
+ Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
+ Requires-Dist: ruff>=0.1.0; extra == "dev"
+ Requires-Dist: mypy>=1.7.0; extra == "dev"
+ Requires-Dist: build>=1.0.0; extra == "dev"
+ Requires-Dist: twine>=4.0.0; extra == "dev"
+ Requires-Dist: Pillow>=10.0.0; extra == "dev"
+ Requires-Dist: PyMuPDF>=1.23.0; extra == "dev"
+ Requires-Dist: numpy>=1.24.0; extra == "dev"
+ Requires-Dist: python-docx>=1.1.0; extra == "dev"
+ Requires-Dist: python-pptx>=0.6.23; extra == "dev"
+ Requires-Dist: openpyxl>=3.1.0; extra == "dev"
+ Requires-Dist: Flask>=3.0.0; extra == "dev"
+ Requires-Dist: Werkzeug>=3.0.0; extra == "dev"
+ Dynamic: license-file
+
+ <p align="center">
+ <h1 align="center">DocSlight</h1>
+ <p align="center">Lightweight Python SDK & CLI for document parsing and structured extraction</p>
+ <p align="center">
+ <a href="https://pypi.org/project/docslight/"><img src="https://img.shields.io/pypi/v/docslight" alt="PyPI"></a>
+ <a href="https://pypi.org/project/docslight/"><img src="https://img.shields.io/pypi/pyversions/docslight" alt="Python versions"></a>
+ <a href="LICENSE"><img src="https://img.shields.io/github/license/kdanmobile/docslight" alt="License"></a>
+ </p>
+ </p>
+
+ ## What is DocSlight?
+
+ A lightweight Python library that turns PDFs, images, and Office documents into clean Markdown or structured JSON — with one line of code. Works with ComPDF Cloud (recommended) or fully offline with local parsers.
+
+ ```python
+ from docslight import DocSlight
+
+ client = DocSlight(api_key="your-api-key")
+ result = client.parse("invoice.pdf")
+ print(result.to_markdown())
+ ```
+
+ ## Quick Start
+
+ ```bash
+ pip install docslight
+ ```
+
+ Parse any document:
+
+ ```bash
+ docslight parse invoice.pdf --output invoice.md
+ ```
+
+ Extract specific fields:
+
+ ```bash
+ docslight extract invoice.pdf --fields invoice_number,total_amount
+ ```
+
+ Launch the local Web UI workbench:
+
+ ```bash
+ pip install "docslight[web]"
+ docslight web
+ # Open http://127.0.0.1:8000
+
+ # Or run the same Web UI directly as a module
+ python -m docslight.web_app --host 127.0.0.1 --port 8000 --debug
+ ```
+
+ ## Features
+
+ - **Dual mode** — ComPDF Cloud for production-grade results, or local CPU parsing for offline evaluation
+ - **Parse → Markdown** — Convert PDF, DOCX, PPTX, XLSX, and images (PNG, JPG, TIFF, BMP, WebP) to clean Markdown
+ - **Extract → JSON** — Pull structured data by field list, JSON Schema, or structured template (key-value + table extraction)
+ - **CLI first** — Full-featured command-line interface, script-friendly
+ - **Web UI** — Local Flask workbench with drag-and-drop, live preview with bbox highlights, and a Fields Builder UI
+ - **Batch processing** — `parse_batch()` / `extract_batch()` for multiple files
+ - **Local LLM extraction** — Ollama or any OpenAI-compatible provider for offline extraction
+ - **Document types** — Classify and route documents by type for cloud extraction
+ - **Error-safe** — Typed result objects, structured error hierarchy, no credential leaks
+
+ ## Install
+
+ | Scenario | Command |
+ |----------|---------|
+ | Core SDK & CLI | `pip install docslight` |
+ | + Local parsing (OCR, Office) | `pip install "docslight[local]"` |
+ | + Local LLM extraction | `pip install "docslight[local,local-llm]"` |
+ | + Web UI workbench | `pip install "docslight[web]"` |
+
+ > Local CPU parsing is experimental. Validate accuracy and latency on your own documents before production use.
+
+ ## SDK Usage
+
+ ### Cloud — Parse
+
+ ```python
+ from docslight import DocSlight
+
+ client = DocSlight(mode="cloud", api_key="your-api-key")
+
+ result = client.parse("invoice.pdf")
+ print(result.to_markdown())  # Clean markdown
+ print(result.to_json())  # Full result with pages + metadata
+ ```
+
+ ### Cloud — Extract
+
+ ```python
+ result = client.extract(
+     "invoice.pdf",
+     fields=["invoice_number", "invoice_date", "total_amount"],
+ )
+ print(result.to_json())
+ ```
+
+ With a JSON Schema:
+
+ ```python
+ schema = {
+     "type": "object",
+     "properties": {
+         "invoice_number": {"type": "string"},
+         "total_amount": {"type": "number"},
+     },
+     "required": ["invoice_number"],
+ }
+ result = client.extract("invoice.pdf", schema=schema)
+ ```
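To make the role of the schema more concrete, here is a minimal standard-library sketch that checks a plain dict against the `required` and `type` keywords used above. It assumes the extracted fields are already available as a dict. `check_fields` and the sample values are hypothetical and not part of DocSlight, and only a small subset of JSON Schema is handled.

```python
from typing import Any

# Map JSON Schema type names to Python types.
# Only a small subset of JSON Schema is covered in this sketch.
_JSON_TYPES: dict[str, tuple[type, ...]] = {
    "string": (str,),
    "number": (int, float),
    "integer": (int,),
    "boolean": (bool,),
    "object": (dict,),
    "array": (list,),
}


def check_fields(data: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Hypothetical helper: return a list of problems; an empty list means the data passed."""
    problems: list[str] = []
    for name in schema.get("required", []):
        if name not in data:
            problems.append(f"missing required field: {name}")
    for name, rule in schema.get("properties", {}).items():
        if name not in data:
            continue
        expected = _JSON_TYPES.get(rule.get("type", ""), ())
        value = data[name]
        # bool is a subclass of int in Python, so reject it for numeric types explicitly.
        is_bool_as_number = isinstance(value, bool) and bool not in expected
        if expected and (is_bool_as_number or not isinstance(value, expected)):
            problems.append(f"{name}: expected {rule['type']}, got {type(value).__name__}")
    return problems


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number"],
}

# Sample values are made up for illustration.
print(check_fields({"invoice_number": "INV-001", "total_amount": 99.5}, schema))
print(check_fields({"total_amount": "99.5"}, schema))
```

The first call prints an empty list. The second reports the missing `invoice_number` and the string-typed `total_amount`.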
+
+ With document type classification:
+
+ ```python
+ result = client.extract(
+     "invoice.pdf",
+     fields=["invoice_number"],
+     document_types=["invoice"],
+ )
+ ```
+
+ ### Local — Parse (Offline)
+
+ ```python
+ client = DocSlight(mode="local")
+ result = client.parse("invoice.pdf")
+ print(result.to_markdown())
+ ```
+
+ ### Local — Extract with Ollama
+
+ ```python
+ client = DocSlight(
+     mode="local",
+     local_llm={"provider": "ollama", "model": "llama3.1"},
+ )
+ result = client.extract(
+     "invoice.pdf",
+     fields=["invoice_number", "invoice_date"],
+ )
+ ```
+
+ ### Local — Extract with OpenAI-Compatible API
+
+ ```python
+ client = DocSlight(
+     mode="local",
+     local_llm={
+         "provider": "openai-compatible",
+         "base_url": "https://your-endpoint/v1",
+         "model": "your-model",
+         "api_key": "your-api-key",
+         "extra_body": {"enable_thinking": False},  # e.g., DashScope qwen3
+     },
+ )
+ ```
+
+ ### Batch Processing
+
+ ```python
+ results = client.parse_batch(["doc1.pdf", "doc2.pdf", "doc3.pdf"])
+ for r in results:
+     print(r.to_markdown()[:200])
+ ```
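A common follow-up to batch parsing is saving each result under a predictable name. The sketch below assumes only that you have pairs of source path and Markdown text. `save_markdown` is a hypothetical helper, not part of DocSlight, and the output naming scheme is this sketch's own choice.

```python
from pathlib import Path
from typing import Iterable


def save_markdown(items: Iterable[tuple[str, str]], out_dir: str) -> list[Path]:
    """Hypothetical helper: write each (source_path, markdown) pair to <out_dir>/<stem>.md."""
    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    for source, markdown in items:
        # "reports/doc1.pdf" becomes "<out_dir>/doc1.md"
        target = target_dir / (Path(source).stem + ".md")
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

With the batch call above, the pairs could be built as `zip(paths, (r.to_markdown() for r in results))`, assuming results come back in input order, which the README does not state.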
+
+ ## CLI Usage
+
+ ```bash
+ # Parse
+ docslight parse invoice.pdf --mode cloud -o invoice.md
+ docslight parse invoice.pdf --mode local -o invoice.md
+
+ # Extract
+ docslight extract invoice.pdf --mode cloud --fields invoice_number,total_amount
+ docslight extract invoice.pdf --mode local --fields invoice_number --local-llm-provider ollama --local-llm-model llama3.1
+
+ # Extract with schema
+ docslight extract invoice.pdf --schema schema.json
+
+ # Web UI
+ docslight web --host 127.0.0.1 --port 8000
+ ```
+
+ ## Web UI Workbench
+
+ DocSlight Workbench is a local Flask app for visual document processing.
+
+ ```bash
+ pip install "docslight[web]"
+ docslight web
+ # Or run the same Web UI as a module
+ python -m docslight.web_app
+ ```
+
+ - **Parse & Extract tabs** — Switch between parsing and extraction workflows
+ - **Drag-and-drop upload** — PDF, images, DOCX, PPTX, XLSX
+ - **Live preview** — PDF page rendering with bbox highlight overlays
+ - **Fields Builder** — Structured UI for building key-value and table extraction templates
+ - **Download results** — One-click download of Markdown or JSON output
+
+ ## Environment Variables
+
+ | Variable | Description |
+ |----------|-------------|
+ | `DOCSLIGHT_API_KEY` | API key for cloud mode |
+ | `DOCSLIGHT_MODE` | Processing mode: `cloud` or `local` (default: `cloud`) |
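As an illustration of how the two variables in the table interact, the following sketch resolves them with the documented default of `cloud`. `resolve_settings` is a hypothetical helper, not DocSlight's own configuration code. Normalising case and requiring a key in cloud mode are assumptions of this sketch.

```python
import os
from typing import Mapping, Optional


def resolve_settings(env: Optional[Mapping[str, str]] = None) -> dict[str, Optional[str]]:
    """Hypothetical helper: read DOCSLIGHT_MODE and DOCSLIGHT_API_KEY with the documented default."""
    source = os.environ if env is None else env
    # Default to "cloud" when the variable is unset or empty.
    mode = source.get("DOCSLIGHT_MODE", "cloud").strip().lower() or "cloud"
    if mode not in ("cloud", "local"):
        raise ValueError(f"DOCSLIGHT_MODE must be 'cloud' or 'local', got {mode!r}")
    api_key = source.get("DOCSLIGHT_API_KEY") or None
    # Assumption of this sketch: cloud mode cannot work without a key.
    if mode == "cloud" and api_key is None:
        raise ValueError("DOCSLIGHT_API_KEY is required when the mode is 'cloud'")
    return {"mode": mode, "api_key": api_key}
```

Passing a mapping instead of relying on `os.environ` keeps the helper easy to test, and the error messages never include the key itself.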
+
+ ## Supported Inputs
+
+ | Mode | Formats |
+ |------|---------|
+ | Cloud | PDF, images (PNG/JPG/TIFF/BMP/WebP), DOCX, PPTX, XLSX, and more via ComPDF Cloud API |
+ | Local | PDF, images (PNG/JPG/TIFF/BMP/WebP), DOCX, PPTX, XLSX |
+
+ > Legacy Office formats (`.doc`, `.ppt`, `.xls`) must be converted to DOCX/PPTX/XLSX for local processing.
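The local-mode rules above can be turned into a small pre-flight check before files are handed to a parser. This is a sketch only: `local_support` is a hypothetical helper, the `.jpeg` and `.tif` aliases are assumptions, and matching extensions case-insensitively is this sketch's choice rather than documented DocSlight behaviour.

```python
from pathlib import Path

# Extensions taken from the table above; ".jpeg" and ".tif" are assumed aliases.
LOCAL_FORMATS = {
    ".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".tif", ".bmp", ".webp",
    ".docx", ".pptx", ".xlsx",
}
# Legacy Office formats and the modern format each should be converted to.
LEGACY_TO_MODERN = {".doc": ".docx", ".ppt": ".pptx", ".xls": ".xlsx"}


def local_support(path: str) -> str:
    """Hypothetical helper: 'supported', 'convert to <ext>', or 'unsupported' for local mode."""
    suffix = Path(path).suffix.lower()
    if suffix in LOCAL_FORMATS:
        return "supported"
    if suffix in LEGACY_TO_MODERN:
        return f"convert to {LEGACY_TO_MODERN[suffix]}"
    return "unsupported"


for name in ["invoice.PDF", "report.doc", "notes.txt"]:
    print(name, "->", local_support(name))
```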
+
+ ## Development
+
+ ```bash
+ pip install -e ".[dev]"
+ ruff check .
+ mypy docslight
+ pytest
+ python -m build
+ ```
+
+ ## License
+
+ MIT License. See `LICENSE`.
@@ -0,0 +1,39 @@
+ docslight/__init__.py,sha256=OCbA3TldIR0aIKDzpGi96iRAuexkQrjnPMIvS8DTZiI,957
+ docslight/cli.py,sha256=ozpMtFpmkS6CSbiTdrkOJlj7TVuc4Tj34tvtE80mSWE,7459
+ docslight/client.py,sha256=Yfa4MqAe8wTJjJw6LI00KL10UdnB6Igblm1eLtl45UM,3534
+ docslight/config.py,sha256=BAuRkKuVZVnki-LvBYIeKAU30FrOoqns3bPq0vpVmnQ,3807
+ docslight/exceptions.py,sha256=sFk-EP_Oe-0AR1b1YlZLekt7miTIcRO8y29XywxuMQg,1703
+ docslight/preview.py,sha256=lqpPbcLoxKJx1M7FeO86Agw0GBUCBPa3p9iWdq6Nk9g,1788
+ docslight/result.py,sha256=C4Hry53cVAKXIij0Yv-GX6wQUfEPnfGxmsnHb5qd2s8,2987
+ docslight/standard_json.py,sha256=xiKOAxtbDybiFg5LLOKLrFrkCH9lLtfEO8CsJV4fh6c,12528
+ docslight/web_app.py,sha256=OkDPKZuHNe52A3wyk_NWjPjL2FPxqaC9DfLgZ4wp7Xg,13518
+ docslight/cloud/__init__.py,sha256=Hk2-B1809THMXRTm9QUj8Q2y2J40cTIeYGx0yKlXz7s,123
+ docslight/cloud/client.py,sha256=9xy-9XUpUGHZmFHKrcZ4rP3AC_PfdXtfQzqfRCpE24k,22517
+ docslight/local/__init__.py,sha256=cOlFSTlE_4NJDvxFydjukFTb5dh03WtZdmgEj-I0l4w,809
+ docslight/local/layout_blocks.py,sha256=TPBS4WqjWGe625Mxq4SwVDakOfuJLQsXwit9yB4T5oc,2975
+ docslight/local/llm_extractor.py,sha256=yGxqWGtc-fo1xXMoSAhc7oWKWJoc3vFf1YP0Yk8MzLw,8703
+ docslight/local/loaders.py,sha256=hK2apRUeG6U5FNIezwRRDOpPXNeWEbfQt2FG24plP9k,3275
+ docslight/local/markdown.py,sha256=nH4vN1qk7y7RUxUk2qD1PRMMZRaJhm0y13bN-Q6iLi8,626
+ docslight/local/office_loader.py,sha256=JmrX3ne1eT0GGLDBRp6MdYAf4scL4ybUODFcSAYZ4V4,5382
+ docslight/local/paddle_parser.py,sha256=-707LgB9ih6Uieu7D6Cr0tXkNG1f5PMNNqS3YSPbYYo,6454
+ docslight/local/pipeline.py,sha256=actIKjfaJ3AykymrmzgBkFYIuqJTnOHGzK3bbZPrqvM,8668
+ docslight/providers/__init__.py,sha256=WavbBqBDcGs65B0tZqTG7u1sEkoWbqmoolPREJlIytI,254
+ docslight/providers/ollama.py,sha256=7zEeuGhsQrU4jo7O0b4JSR3pdQHZU21GJIGwXaqe628,843
+ docslight/providers/openai_compatible.py,sha256=y9DR_yL2rLBqv1cManzahYNb74SaskMXRhGHHl2iQM8,2275
+ docslight/schemas/__init__.py,sha256=-HJLylY01HoX6HV4wP2llZ_moXUtjW92_CQwuxLB6Q0,191
+ docslight/schemas/fields.py,sha256=vhC3LpbRt0CpGz49YzT6NnV0sAQ_Zx9KYCq4zUmWNQY,7094
+ docslight/static/styles.css,sha256=woDz4dUYQLl4sgfGjVwxvfLUpaskv3Kla9bBnGkzuWg,16287
+ docslight/static/app/common.js,sha256=ooBbE7NQOzgzLA4Ef4KsKH9XB9nwiNbLPm1Y1UtVG2o,23697
+ docslight/static/app/docslight-extract.json,sha256=J8Dv2XIkeh_283pklNVP8izc5gN9xKB7-z5PUR7kaiY,6282
+ docslight/static/app/extract.js,sha256=dh5_Pi7Ll1t1rPo9vgyPa_AVl4b9ttJOFzFE_m0ysls,14495
+ docslight/static/app/i18n.js,sha256=cFv7oZlTpxjs57ad_JZC61xqakhzrJbKfAL-W7wENkU,17775
+ docslight/static/app/parse.js,sha256=qFk3iCouHWR51afjJS94cE45yMt1U_XPVBInfYFO1eg,5441
+ docslight/templates/base.html,sha256=sXpWTxP_uvp1aCmyt52RIetPHNuaxYy1TnVN_ONSB_c,1694
+ docslight/templates/extract.html,sha256=LMl_xuxnTXbqSO7RMJswEVDe1zLBXR1AxyD0PxqoLOk,6277
+ docslight/templates/parse.html,sha256=Sm5b2GsovJUPq2eWTiesj04phXKFVrkJVKHtFT7NQ5Q,4130
+ docslight-0.1.0.dist-info/licenses/LICENSE,sha256=C5PGcCZEjsiOF7rAof4gU_VahlwqfLIUeL7eHnDZT34,1087
+ docslight-0.1.0.dist-info/METADATA,sha256=ZTCWvDRt-AcDWbGqhuviFTZWpT3nHfP8rWvM2wDRRVY,8449
+ docslight-0.1.0.dist-info/WHEEL,sha256=aeYiig01lYGDzBgS8HxWXOg3uV61G9ijOsup-k9o1sk,91
+ docslight-0.1.0.dist-info/entry_points.txt,sha256=ekw6ynMNLGFXp0K-r1XUDLcS6BN2nzoAVzCUDEHFu8c,49
+ docslight-0.1.0.dist-info/top_level.txt,sha256=_u4jxf9wMEplc-nj0eJOYLTH6x3sFogHbbBcTn45SFA,10
+ docslight-0.1.0.dist-info/RECORD,,
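Each RECORD row above is a path, a hash, and a size in bytes. In the wheel format the hash is the SHA-256 digest encoded as URL-safe base64 with the trailing `=` padding removed. The sketch below rebuilds one row from raw bytes. `record_entry` is a hypothetical helper, and the file content used in the demo (`docslight` followed by a newline, which would match the listed size of 10 bytes) is an assumption, so no expected hash is claimed here.

```python
import base64
import hashlib


def record_entry(path: str, data: bytes) -> str:
    """Hypothetical helper: build one RECORD row as path,sha256=<digest>,<size>."""
    digest = hashlib.sha256(data).digest()
    # URL-safe base64 with the "=" padding stripped, as the wheel format specifies.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


# Assumed content for top_level.txt; compare the printed row with the RECORD above.
line = record_entry("docslight-0.1.0.dist-info/top_level.txt", b"docslight\n")
print(line)
```

Comparing a recomputed row with the published one is a quick integrity check for a single file in an unpacked wheel.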
@@ -0,0 +1,5 @@
+ Wheel-Version: 1.0
+ Generator: setuptools (82.0.1)
+ Root-Is-Purelib: true
+ Tag: py3-none-any
+
@@ -0,0 +1,2 @@
+ [console_scripts]
+ docslight = docslight.cli:main
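The `console_scripts` line above maps the `docslight` command to a callable, written as `module:attribute`. As a small illustration, the standard library's `importlib.metadata.EntryPoint` can split that declaration without the package being installed. The object is constructed by hand here purely for demonstration.

```python
from importlib.metadata import EntryPoint

# Rebuild the console-script declaration shown above as an EntryPoint object.
ep = EntryPoint(name="docslight", value="docslight.cli:main", group="console_scripts")

print(ep.module)  # module that gets imported when the command runs
print(ep.attr)    # attribute looked up on that module and called

# ep.load() would import docslight.cli and return its `main` object;
# it is left out so this sketch runs without the package installed.
```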
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 ComPDF AI
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
@@ -0,0 +1 @@
+ docslight