docslight 0.1.1__tar.gz → 0.1.3__tar.gz
- {docslight-0.1.1 → docslight-0.1.3}/PKG-INFO +19 -18
- {docslight-0.1.1 → docslight-0.1.3}/README.md +46 -45
- {docslight-0.1.1 → docslight-0.1.3}/docslight/__init__.py +1 -1
- {docslight-0.1.1 → docslight-0.1.3}/docslight/cli.py +63 -33
- {docslight-0.1.1 → docslight-0.1.3}/docslight/cloud/client.py +6 -6
- {docslight-0.1.1 → docslight-0.1.3}/docslight/preview.py +1 -1
- {docslight-0.1.1 → docslight-0.1.3}/docslight/schemas/fields.py +2 -2
- {docslight-0.1.1 → docslight-0.1.3}/docslight/web_app.py +32 -31
- {docslight-0.1.1 → docslight-0.1.3}/docslight.egg-info/PKG-INFO +19 -18
- {docslight-0.1.1 → docslight-0.1.3}/docslight.egg-info/SOURCES.txt +12 -9
- {docslight-0.1.1 → docslight-0.1.3}/pyproject.toml +4 -7
- docslight-0.1.3/tests/test_cli.py +450 -0
- docslight-0.1.3/tests/test_cli_entrypoint.py +15 -0
- docslight-0.1.3/tests/test_client.py +255 -0
- docslight-0.1.3/tests/test_cloud_client.py +771 -0
- docslight-0.1.3/tests/test_config_result.py +231 -0
- docslight-0.1.3/tests/test_examples.py +20 -0
- docslight-0.1.3/tests/test_local_llm.py +401 -0
- docslight-0.1.3/tests/test_local_loader_parser.py +300 -0
- docslight-0.1.3/tests/test_local_office_loader.py +108 -0
- docslight-0.1.3/tests/test_local_pipeline.py +825 -0
- docslight-0.1.3/tests/test_schema_helpers.py +117 -0
- docslight-0.1.3/tests/test_web_app.py +442 -0
- docslight-0.1.1/docslight/static/app/common.js +0 -668
- docslight-0.1.1/docslight/static/app/docslight-extract.json +0 -307
- docslight-0.1.1/docslight/static/app/extract.js +0 -394
- docslight-0.1.1/docslight/static/app/i18n.js +0 -405
- docslight-0.1.1/docslight/static/app/parse.js +0 -161
- docslight-0.1.1/docslight/static/styles.css +0 -878
- docslight-0.1.1/docslight/templates/base.html +0 -36
- docslight-0.1.1/docslight/templates/extract.html +0 -123
- docslight-0.1.1/docslight/templates/parse.html +0 -81
- {docslight-0.1.1 → docslight-0.1.3}/LICENSE +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/docslight/client.py +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/docslight/cloud/__init__.py +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/docslight/config.py +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/docslight/exceptions.py +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/docslight/local/__init__.py +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/docslight/local/layout_blocks.py +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/docslight/local/llm_extractor.py +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/docslight/local/loaders.py +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/docslight/local/markdown.py +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/docslight/local/office_loader.py +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/docslight/local/paddle_parser.py +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/docslight/local/pipeline.py +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/docslight/providers/__init__.py +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/docslight/providers/ollama.py +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/docslight/providers/openai_compatible.py +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/docslight/result.py +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/docslight/schemas/__init__.py +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/docslight/standard_json.py +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/docslight.egg-info/dependency_links.txt +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/docslight.egg-info/entry_points.txt +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/docslight.egg-info/requires.txt +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/docslight.egg-info/top_level.txt +0 -0
- {docslight-0.1.1 → docslight-0.1.3}/setup.cfg +0 -0
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: docslight
|
|
3
|
-
Version: 0.1.
|
|
3
|
+
Version: 0.1.3
|
|
4
4
|
Summary: Lightweight ComPDF document parsing and extraction SDK
|
|
5
5
|
Author-email: ComPDF AI <support@compdf.com>
|
|
6
6
|
License-Expression: MIT
|
|
@@ -78,6 +78,7 @@ Parse any document:
|
|
|
78
78
|
|
|
79
79
|
```bash
|
|
80
80
|
docslight parse invoice.pdf --output invoice.md
|
|
81
|
+
docslight parse invoice.pdf --format zip --output invoice.zip
|
|
81
82
|
```
|
|
82
83
|
|
|
83
84
|
Extract specific fields:
|
|
@@ -86,14 +87,13 @@ Extract specific fields:
|
|
|
86
87
|
docslight extract invoice.pdf --fields invoice_number,total_amount
|
|
87
88
|
```
|
|
88
89
|
|
|
89
|
-
Launch the local
|
|
90
|
+
Launch the local API server:
|
|
90
91
|
|
|
91
92
|
```bash
|
|
92
|
-
pip install "docslight[web]"
|
|
93
93
|
docslight web
|
|
94
|
-
#
|
|
94
|
+
# Health: http://127.0.0.1:8000/api/health
|
|
95
95
|
|
|
96
|
-
# Or run the same
|
|
96
|
+
# Or run the same API server directly as a module
|
|
97
97
|
python -m docslight.web_app --host 0.0.0.0 --port 8000 --debug
|
|
98
98
|
```
|
|
99
99
|
|
|
@@ -103,7 +103,7 @@ python -m docslight.web_app --host 0.0.0.0 --port 8000 --debug
|
|
|
103
103
|
- **Parse → Markdown** — Convert PDF, DOCX, PPTX, XLSX, and images (PNG, JPG, TIFF, BMP, WebP) to clean Markdown
|
|
104
104
|
- **Extract → JSON** — Pull structured data by field list, JSON Schema, or structured template (key-value + table extraction)
|
|
105
105
|
- **CLI first** — Full-featured command-line interface, script-friendly
|
|
106
|
-
- **
|
|
106
|
+
- **API server** — Local Flask backend exposing parse, extract, preview, health, and system-info endpoints
|
|
107
107
|
- **Batch processing** — `parse_batch()` / `extract_batch()` for multiple files
|
|
108
108
|
- **Local LLM extraction** — Ollama or any OpenAI-compatible provider for offline extraction
|
|
109
109
|
- **Document types** — Classify and route documents by type for cloud extraction
|
|
@@ -116,7 +116,7 @@ python -m docslight.web_app --host 0.0.0.0 --port 8000 --debug
|
|
|
116
116
|
| Core SDK & CLI | `pip install docslight` |
|
|
117
117
|
| + Local parsing (OCR, Office) | `pip install "docslight[local]"` |
|
|
118
118
|
| + Local LLM extraction | `pip install "docslight[local,local-llm]"` |
|
|
119
|
-
| +
|
|
119
|
+
| + API server | `pip install "docslight[web]"` |
|
|
120
120
|
|
|
121
121
|
> Local CPU parsing is experimental. Validate accuracy and latency on your own documents before production use.
|
|
122
122
|
|
|
@@ -216,35 +216,36 @@ for r in results:
|
|
|
216
216
|
|
|
217
217
|
```bash
|
|
218
218
|
# Parse
|
|
219
|
-
docslight parse invoice.pdf --mode cloud -o invoice.
|
|
220
|
-
docslight parse invoice.pdf --mode
|
|
219
|
+
docslight parse invoice.pdf --mode cloud -o invoice.zip
|
|
220
|
+
docslight parse invoice.pdf --mode cloud --format zip -o invoice.zip
|
|
221
|
+
docslight parse invoice.pdf --mode local -o invoice.zip
|
|
221
222
|
|
|
222
223
|
# Extract
|
|
223
224
|
docslight extract invoice.pdf --mode cloud --fields invoice_number,total_amount
|
|
224
225
|
docslight extract invoice.pdf --mode local --fields invoice_number --local-llm-provider ollama --local-llm-model llama3.1
|
|
226
|
+
docslight extract "D:\pdf\invoice\1.pdf" --mode local --fields invoice_number --local-llm-provider ollama
|
|
225
227
|
|
|
226
228
|
# Extract with schema
|
|
227
229
|
docslight extract invoice.pdf --schema schema.json
|
|
228
230
|
|
|
229
|
-
#
|
|
231
|
+
# API server
|
|
230
232
|
docslight web --host 127.0.0.1 --port 8000
|
|
231
233
|
```
|
|
232
234
|
|
|
233
|
-
##
|
|
235
|
+
## API Server
|
|
234
236
|
|
|
235
|
-
DocSlight
|
|
237
|
+
DocSlight includes a local Flask API server for document processing. Frontend assets are not bundled in this package.
|
|
236
238
|
|
|
237
239
|
```bash
|
|
238
|
-
pip install "docslight[web]"
|
|
239
240
|
docslight web
|
|
240
241
|
python -m docslight.web_app
|
|
241
242
|
```
|
|
242
243
|
|
|
243
|
-
-
|
|
244
|
-
-
|
|
245
|
-
-
|
|
246
|
-
-
|
|
247
|
-
-
|
|
244
|
+
- `GET /api/health`
|
|
245
|
+
- `GET /api/system-info`
|
|
246
|
+
- `POST /api/parse`
|
|
247
|
+
- `POST /api/extract`
|
|
248
|
+
- `POST /api/preview`
|
|
248
249
|
|
|
249
250
|
## Environment Variables
|
|
250
251
|
|
|
@@ -1,14 +1,14 @@
|
|
|
1
1
|
<p align="center">
|
|
2
|
-
<h1 align="center">DocSlight</h1>
|
|
2
|
+
<h1 align="center">DocSlight</h1>
|
|
3
3
|
<p align="center">Lightweight Python SDK & CLI for document parsing and structured extraction</p>
|
|
4
4
|
<p align="center">
|
|
5
|
-
<a href="https://pypi.org/project/docslight/"><img src="https://img.shields.io/pypi/v/docslight" alt="PyPI"></a>
|
|
6
|
-
<a href="https://pypi.org/project/docslight/"><img src="https://img.shields.io/pypi/pyversions/docslight" alt="Python versions"></a>
|
|
7
|
-
<a href="LICENSE"><img src="https://img.shields.io/github/license/kdanmobile/docslight" alt="License"></a>
|
|
5
|
+
<a href="https://pypi.org/project/docslight/"><img src="https://img.shields.io/pypi/v/docslight" alt="PyPI"></a>
|
|
6
|
+
<a href="https://pypi.org/project/docslight/"><img src="https://img.shields.io/pypi/pyversions/docslight" alt="Python versions"></a>
|
|
7
|
+
<a href="LICENSE"><img src="https://img.shields.io/github/license/kdanmobile/docslight" alt="License"></a>
|
|
8
8
|
</p>
|
|
9
9
|
</p>
|
|
10
10
|
|
|
11
|
-
## What is DocSlight?
|
|
11
|
+
## What is DocSlight?
|
|
12
12
|
|
|
13
13
|
A lightweight Python library that turns PDFs, images, and Office documents into clean Markdown or structured JSON — with one line of code. Works with ComPDF Cloud (recommended) or fully offline with local parsers.
|
|
14
14
|
|
|
@@ -23,13 +23,14 @@ print(result.to_markdown())
|
|
|
23
23
|
## Quick Start
|
|
24
24
|
|
|
25
25
|
```bash
|
|
26
|
-
pip install docslight
|
|
26
|
+
pip install docslight
|
|
27
27
|
```
|
|
28
28
|
|
|
29
29
|
Parse any document:
|
|
30
30
|
|
|
31
31
|
```bash
|
|
32
32
|
docslight parse invoice.pdf --output invoice.md
|
|
33
|
+
docslight parse invoice.pdf --format zip --output invoice.zip
|
|
33
34
|
```
|
|
34
35
|
|
|
35
36
|
Extract specific fields:
|
|
@@ -38,16 +39,15 @@ Extract specific fields:
|
|
|
38
39
|
docslight extract invoice.pdf --fields invoice_number,total_amount
|
|
39
40
|
```
|
|
40
41
|
|
|
41
|
-
Launch the local
|
|
42
|
-
|
|
43
|
-
```bash
|
|
44
|
-
|
|
45
|
-
|
|
46
|
-
|
|
47
|
-
|
|
48
|
-
|
|
49
|
-
|
|
50
|
-
```
|
|
42
|
+
Launch the local API server:
|
|
43
|
+
|
|
44
|
+
```bash
|
|
45
|
+
docslight web
|
|
46
|
+
# Health: http://127.0.0.1:8000/api/health
|
|
47
|
+
|
|
48
|
+
# Or run the same API server directly as a module
|
|
49
|
+
python -m docslight.web_app --host 0.0.0.0 --port 8000 --debug
|
|
50
|
+
```
|
|
51
51
|
|
|
52
52
|
## Features
|
|
53
53
|
|
|
@@ -55,7 +55,7 @@ python -m docslight.web_app --host 0.0.0.0 --port 8000 --debug
|
|
|
55
55
|
- **Parse → Markdown** — Convert PDF, DOCX, PPTX, XLSX, and images (PNG, JPG, TIFF, BMP, WebP) to clean Markdown
|
|
56
56
|
- **Extract → JSON** — Pull structured data by field list, JSON Schema, or structured template (key-value + table extraction)
|
|
57
57
|
- **CLI first** — Full-featured command-line interface, script-friendly
|
|
58
|
-
- **
|
|
58
|
+
- **API server** — Local Flask backend exposing parse, extract, preview, health, and system-info endpoints
|
|
59
59
|
- **Batch processing** — `parse_batch()` / `extract_batch()` for multiple files
|
|
60
60
|
- **Local LLM extraction** — Ollama or any OpenAI-compatible provider for offline extraction
|
|
61
61
|
- **Document types** — Classify and route documents by type for cloud extraction
|
|
@@ -65,10 +65,10 @@ python -m docslight.web_app --host 0.0.0.0 --port 8000 --debug
|
|
|
65
65
|
|
|
66
66
|
| Scenario | Command |
|
|
67
67
|
|----------|---------|
|
|
68
|
-
| Core SDK & CLI | `pip install docslight` |
|
|
69
|
-
| + Local parsing (OCR, Office) | `pip install "docslight[local]"` |
|
|
70
|
-
| + Local LLM extraction | `pip install "docslight[local,local-llm]"` |
|
|
71
|
-
| +
|
|
68
|
+
| Core SDK & CLI | `pip install docslight` |
|
|
69
|
+
| + Local parsing (OCR, Office) | `pip install "docslight[local]"` |
|
|
70
|
+
| + Local LLM extraction | `pip install "docslight[local,local-llm]"` |
|
|
71
|
+
| + API server | `pip install "docslight[web]"` |
|
|
72
72
|
|
|
73
73
|
> Local CPU parsing is experimental. Validate accuracy and latency on your own documents before production use.
|
|
74
74
|
|
|
@@ -159,44 +159,45 @@ client = DocSlight(
|
|
|
159
159
|
### Batch Processing
|
|
160
160
|
|
|
161
161
|
```python
|
|
162
|
-
results = client.parse_batch(["doc1.pdf", "doc2.pdf", "doc3.pdf"])
|
|
163
|
-
for r in results:
|
|
164
|
-
print(r.to_markdown()[:200])
|
|
162
|
+
results = client.parse_batch(["doc1.pdf", "doc2.pdf", "doc3.pdf"])
|
|
163
|
+
for r in results:
|
|
164
|
+
print(r.to_markdown()[:200])
|
|
165
165
|
```
|
|
166
166
|
|
|
167
167
|
## CLI Usage
|
|
168
168
|
|
|
169
169
|
```bash
|
|
170
170
|
# Parse
|
|
171
|
-
docslight parse invoice.pdf --mode cloud -o invoice.
|
|
172
|
-
docslight parse invoice.pdf --mode
|
|
171
|
+
docslight parse invoice.pdf --mode cloud -o invoice.zip
|
|
172
|
+
docslight parse invoice.pdf --mode cloud --format zip -o invoice.zip
|
|
173
|
+
docslight parse invoice.pdf --mode local -o invoice.zip
|
|
173
174
|
|
|
174
175
|
# Extract
|
|
175
176
|
docslight extract invoice.pdf --mode cloud --fields invoice_number,total_amount
|
|
176
177
|
docslight extract invoice.pdf --mode local --fields invoice_number --local-llm-provider ollama --local-llm-model llama3.1
|
|
178
|
+
docslight extract "D:\pdf\invoice\1.pdf" --mode local --fields invoice_number --local-llm-provider ollama
|
|
177
179
|
|
|
178
180
|
# Extract with schema
|
|
179
181
|
docslight extract invoice.pdf --schema schema.json
|
|
180
182
|
|
|
181
|
-
#
|
|
182
|
-
docslight web --host 127.0.0.1 --port 8000
|
|
183
|
-
```
|
|
184
|
-
|
|
185
|
-
##
|
|
186
|
-
|
|
187
|
-
DocSlight
|
|
188
|
-
|
|
189
|
-
```bash
|
|
190
|
-
|
|
191
|
-
docslight
|
|
192
|
-
|
|
193
|
-
|
|
194
|
-
|
|
195
|
-
-
|
|
196
|
-
-
|
|
197
|
-
-
|
|
198
|
-
-
|
|
199
|
-
- **Download results** — One-click download of Markdown or JSON output
|
|
183
|
+
# API server
|
|
184
|
+
docslight web --host 127.0.0.1 --port 8000
|
|
185
|
+
```
|
|
186
|
+
|
|
187
|
+
## API Server
|
|
188
|
+
|
|
189
|
+
DocSlight includes a local Flask API server for document processing. Frontend assets are not bundled in this package.
|
|
190
|
+
|
|
191
|
+
```bash
|
|
192
|
+
docslight web
|
|
193
|
+
python -m docslight.web_app
|
|
194
|
+
```
|
|
195
|
+
|
|
196
|
+
- `GET /api/health`
|
|
197
|
+
- `GET /api/system-info`
|
|
198
|
+
- `POST /api/parse`
|
|
199
|
+
- `POST /api/extract`
|
|
200
|
+
- `POST /api/preview`
|
|
200
201
|
|
|
201
202
|
## Environment Variables
|
|
202
203
|
|
|
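The README hunk above lists the API server routes but does not show a client call. A minimal sketch of one, using only the standard library, might look like the following. The helper names `api_url` and `fetch_json` are hypothetical, the host and port are the `docslight web` defaults shown in the diff, and the response body of `/api/health` is assumed (not confirmed by this diff) to be JSON.

```python
import json
from urllib.request import urlopen


def api_url(path: str, host: str = "127.0.0.1", port: int = 8000) -> str:
    """Build a URL for the local API server (hypothetical helper).

    The defaults mirror the --host/--port defaults shown in the diff.
    """
    return f"http://{host}:{port}/{path.lstrip('/')}"


def fetch_json(path: str, host: str = "127.0.0.1", port: int = 8000,
               timeout: float = 5.0):
    """GET a route and decode the body as JSON.

    Assumption: the route answers with a JSON body. The diff only
    confirms that for "/", not for "/api/health".
    """
    with urlopen(api_url(path, host, port), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server started through `docslight web`, calling `fetch_json("/api/health")` would issue the request; the call raises `urllib.error.URLError` when nothing is listening on the port.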
@@ -81,20 +81,34 @@ def _client_from_args(args: argparse.Namespace) -> DocSlight:
     )
 
 
-def _write_output(content: str, output_path: str | None) -> None:
-    if output_path is None:
-        sys.stdout.write(content)
-        if not content.endswith("\n"):
-            sys.stdout.write("\n")
-        return
-    Path(output_path)
+def _write_output(content: str, output_path: str | None) -> None:
+    if output_path is None:
+        sys.stdout.write(content)
+        if not content.endswith("\n"):
+            sys.stdout.write("\n")
+        return
+    path = Path(output_path)
+    path.write_text(content, encoding="utf-8")
+    sys.stderr.write(f"Wrote {path.resolve()}\n")
+
+
+def _write_binary_output(content: bytes, output_path: str | None) -> None:
+    if output_path is None:
+        output_buffer = getattr(sys.stdout, "buffer", None)
+        if output_buffer is None:
+            raise CLIUsageError("binary output requires --output")
+        output_buffer.write(content)
+        return
+    path = Path(output_path)
+    path.write_bytes(content)
+    sys.stderr.write(f"Wrote {path.resolve()}\n")
 
 
 def _to_pretty_json(data: Any) -> str:
     return json.dumps(data, ensure_ascii=False, indent=2)
 
 
-def run_web_app(host: str, port: int, debug: bool) -> None:
+def run_web_app(host: str, port: int, debug: bool) -> None:
     """Run the optional Flask web application."""
     if importlib.util.find_spec("docslight.web_app") is None:
         raise CLIUsageError(WEB_EXTRA_ERROR)
|
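The new `_write_binary_output` helper in the hunk above writes bytes through `sys.stdout.buffer` and refuses when no binary buffer is available. The sketch below reproduces that pattern in isolation. `UsageError` is a hypothetical stand-in for the package's `CLIUsageError`, the `stdout` parameter is added here only so the behavior can be exercised without touching the real stream, and the "Wrote ..." message on stderr is omitted.

```python
import sys
from pathlib import Path
from typing import Optional


class UsageError(Exception):
    """Hypothetical stand-in for docslight's CLIUsageError."""


def write_binary_output(content: bytes, output_path: Optional[str],
                        stdout=None) -> None:
    """Write bytes to a file, or to the binary layer of stdout.

    `stdout` is an illustrative extra parameter; the function in the
    diff always uses sys.stdout.
    """
    stream = sys.stdout if stdout is None else stdout
    if output_path is None:
        # Text streams expose their byte layer as `.buffer`; replaced
        # or captured streams (e.g. io.StringIO) may not have one.
        buffer = getattr(stream, "buffer", None)
        if buffer is None:
            raise UsageError("binary output requires --output")
        buffer.write(content)
        return
    Path(output_path).write_bytes(content)
```

Looking the attribute up with `getattr` rather than accessing `sys.stdout.buffer` directly keeps the failure a usage error instead of an `AttributeError` when stdout has been swapped for a text-only object.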
@@ -105,8 +119,8 @@ def run_web_app(host: str, port: int, debug: bool) -> None:
         if exc.name in {"flask", "werkzeug"}:
             raise CLIUsageError(WEB_EXTRA_ERROR) from exc
         raise
-    _run_web_app = web_app.run_web_app
-    _run_web_app(host, port, debug)
+    _run_web_app = web_app.run_web_app
+    _run_web_app(host, port, debug)
 
 
 def _print_cli_error(error: Exception) -> int:
@@ -125,11 +139,11 @@ def build_parser() -> argparse.ArgumentParser:
     parse_parser = subparsers.add_parser("parse", help="Parse a document")
     parse_parser.add_argument("input")
     parse_parser.add_argument("--output", "-o")
-    parse_parser.add_argument(
-        "--format",
-        choices=("markdown", "json", "standard-json"),
-        default=
-    )
+    parse_parser.add_argument(
+        "--format",
+        choices=("markdown", "json", "standard-json", "zip"),
+        default=None,
+    )
     _add_common_options(parse_parser)
     parse_parser.set_defaults(func=_run_parse)
|
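The hunk above adds `zip` to the `--format` choices and sets the default to `None`. A small standalone sketch of why a `None` default is useful: it lets later code tell "flag not passed" apart from an explicit choice. The parser below is a hypothetical reduction with only the three arguments visible in the diff; it is not the package's full `build_parser()`.

```python
import argparse


def build_demo_parser() -> argparse.ArgumentParser:
    """Reduced parser mirroring the arguments shown in the hunk above."""
    parser = argparse.ArgumentParser(prog="demo-parse")
    parser.add_argument("input")
    parser.add_argument("--output", "-o")
    parser.add_argument(
        "--format",
        choices=("markdown", "json", "standard-json", "zip"),
        default=None,  # None means "the user did not pass --format"
    )
    return parser


# No --format given: the value stays None, so it can be inferred later.
inferred = build_demo_parser().parse_args(["invoice.pdf", "-o", "invoice.zip"])

# Explicit --format: the chosen value is stored as-is.
explicit = build_demo_parser().parse_args(["invoice.pdf", "--format", "json"])
```

Values outside `choices` make `argparse` exit with a usage error, so downstream code only ever sees one of the four names or `None`.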
@@ -151,25 +165,41 @@ def build_parser() -> argparse.ArgumentParser:
     extract_parser.set_defaults(func=_run_extract)
 
     web_parser = subparsers.add_parser("web", help="Run the web application")
-    web_parser.add_argument("--host", default="127.0.0.1")
-    web_parser.add_argument("--port", type=int, default=8000)
-    web_parser.add_argument("--debug", action="store_true")
-    web_parser.set_defaults(func=_run_web)
+    web_parser.add_argument("--host", default="127.0.0.1")
+    web_parser.add_argument("--port", type=int, default=8000)
+    web_parser.add_argument("--debug", action="store_true")
+    web_parser.set_defaults(func=_run_web)
 
     return parser
 
 
-def _run_parse(args: argparse.Namespace) -> int:
-
-
-
-
-
-
-
-
-
-
+def _run_parse(args: argparse.Namespace) -> int:
+    parse_format = _resolve_parse_format(args.format, args.output)
+    parse_output = "json" if parse_format == "standard-json" else "markdown"
+    result = _client_from_args(args).parse(args.input, output=parse_output)
+    if parse_format == "zip":
+        raw_archive = getattr(result, "raw_archive", None)
+        if not isinstance(raw_archive, bytes):
+            raise CLIUsageError("parse result did not include a ZIP archive")
+        _write_binary_output(raw_archive, args.output)
+    elif parse_format == "json":
+        content = _to_pretty_json(result.to_json())
+        _write_output(content, args.output)
+    elif parse_format == "standard-json":
+        content = _to_pretty_json(result.to_standard_json())
+        _write_output(content, args.output)
+    else:
+        content = result.to_markdown()
+        _write_output(content, args.output)
+    return 0
+
+
+def _resolve_parse_format(parse_format: str | None, output_path: str | None) -> str:
+    if parse_format is not None:
+        return parse_format
+    if output_path is not None and Path(output_path).suffix.lower() == ".zip":
+        return "zip"
+    return "markdown"
 
 
 def _run_convert_parse_json(args: argparse.Namespace) -> int:
|
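The format resolution added in the hunk above can be exercised on its own. The sketch below copies the logic of `_resolve_parse_format` and the `parse_output` selection from `_run_parse` into two standalone functions; the public names `resolve_parse_format` and `backend_output` are chosen here for illustration and are not part of the package.

```python
from pathlib import Path
from typing import Optional


def resolve_parse_format(parse_format: Optional[str],
                         output_path: Optional[str]) -> str:
    """Pick the CLI output format, as in the diff's _resolve_parse_format."""
    if parse_format is not None:
        # An explicit --format always wins over the file extension.
        return parse_format
    if output_path is not None and Path(output_path).suffix.lower() == ".zip":
        # No --format given, but the output file ends in .zip.
        return "zip"
    return "markdown"


def backend_output(parse_format: str) -> str:
    """Value passed as `output=` to the client's parse() call in the diff."""
    return "json" if parse_format == "standard-json" else "markdown"
```

Under this logic the README command `docslight parse invoice.pdf --mode local -o invoice.zip` resolves to `zip` without an explicit `--format`, while `--format json -o invoice.zip` would still produce JSON text in a file named `invoice.zip`.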
@@ -197,9 +227,9 @@ def _run_extract(args: argparse.Namespace) -> int:
     return 0
 
 
-def _run_web(args: argparse.Namespace) -> int:
-    run_web_app(args.host, args.port, args.debug)
-    return 0
+def _run_web(args: argparse.Namespace) -> int:
+    run_web_app(args.host, args.port, args.debug)
+    return 0
 
 
 def main(argv: Sequence[str] | None = None) -> int:
@@ -171,11 +171,11 @@ class CloudClient:
|
|
|
171
171
|
return _read_downloaded_result_payload(content)
|
|
172
172
|
return self._response_json(response), None
|
|
173
173
|
|
|
174
|
-
def _prepare_options(self, operation: str, options: dict[str, Any]) -> dict[str, Any]:
|
|
175
|
-
if operation != "extract"
|
|
176
|
-
return options
|
|
177
|
-
|
|
178
|
-
prepared = dict(options)
|
|
174
|
+
def _prepare_options(self, operation: str, options: dict[str, Any]) -> dict[str, Any]:
|
|
175
|
+
if operation != "extract":
|
|
176
|
+
return options
|
|
177
|
+
|
|
178
|
+
prepared = dict(options)
|
|
179
179
|
if "extract_fields" in prepared:
|
|
180
180
|
return prepared
|
|
181
181
|
|
|
@@ -219,7 +219,7 @@ class CloudClient:
|
|
|
219
219
|
return compacted
|
|
220
220
|
|
|
221
221
|
def _headers(self) -> dict[str, str]:
|
|
222
|
-
headers = {"User-Agent": "docslight/0.1.
|
|
222
|
+
headers = {"User-Agent": "docslight/0.1.2"}
|
|
223
223
|
if self.api_key:
|
|
224
224
|
headers["Authorization"] = f"Bearer {self.api_key}"
|
|
225
225
|
headers["x-api-key"] = self.api_key
|
|
@@ -11,8 +11,8 @@ NormalizedFields = list[str] | StructuredFields | None
|
|
|
11
11
|
ExtractSchema = dict[str, Any]
|
|
12
12
|
|
|
13
13
|
|
|
14
|
-
def normalize_fields(fields: list[str] | str | StructuredFields | None) -> NormalizedFields:
|
|
15
|
-
"""Normalize extraction fields from SDK, CLI, or
|
|
14
|
+
def normalize_fields(fields: list[str] | str | StructuredFields | None) -> NormalizedFields:
|
|
15
|
+
"""Normalize extraction fields from SDK, CLI, or API inputs."""
|
|
16
16
|
if fields is None:
|
|
17
17
|
return None
|
|
18
18
|
if isinstance(fields, str):
|
|
@@ -2,19 +2,19 @@
|
|
|
2
2
|
|
|
3
3
|
from __future__ import annotations
|
|
4
4
|
|
|
5
|
-
import argparse
|
|
6
|
-
import base64
|
|
7
|
-
import json
|
|
8
|
-
import logging
|
|
9
|
-
import sys
|
|
10
|
-
import tempfile
|
|
5
|
+
import argparse
|
|
6
|
+
import base64
|
|
7
|
+
import json
|
|
8
|
+
import logging
|
|
9
|
+
import sys
|
|
10
|
+
import tempfile
|
|
11
11
|
from collections.abc import Callable
|
|
12
12
|
from io import BytesIO
|
|
13
13
|
from json import JSONDecodeError
|
|
14
14
|
from pathlib import Path
|
|
15
15
|
from typing import Any, cast
|
|
16
16
|
|
|
17
|
-
from flask import Flask, Response, jsonify,
|
|
17
|
+
from flask import Flask, Response, jsonify, request, send_file
|
|
18
18
|
from werkzeug.datastructures import FileStorage
|
|
19
19
|
from werkzeug.utils import secure_filename
|
|
20
20
|
|
|
@@ -58,21 +58,18 @@ OFFICE_PREVIEW_UNSUPPORTED_MESSAGE = (
|
|
|
58
58
|
LOG_FORMAT = "%(levelname)s:%(name)s:%(message)s"
|
|
59
59
|
|
|
60
60
|
|
|
61
|
-
def create_app(docslight_factory: Callable[..., Any] = DocSlight) -> Flask:
|
|
62
|
-
"""Create the local DocSlight Flask application."""
|
|
63
|
-
app = Flask(__name__)
|
|
64
|
-
|
|
65
|
-
@app.get("/")
|
|
66
|
-
def index() -> Any:
|
|
67
|
-
return
|
|
68
|
-
|
|
69
|
-
|
|
70
|
-
|
|
71
|
-
|
|
72
|
-
|
|
73
|
-
@app.get("/extract")
|
|
74
|
-
def extract_page() -> str:
|
|
75
|
-
return render_template("extract.html", active_page="extract")
|
|
61
|
+
def create_app(docslight_factory: Callable[..., Any] = DocSlight) -> Flask:
|
|
62
|
+
"""Create the local DocSlight Flask application."""
|
|
63
|
+
app = Flask(__name__)
|
|
64
|
+
|
|
65
|
+
@app.get("/")
|
|
66
|
+
def index() -> Any:
|
|
67
|
+
return jsonify(
|
|
68
|
+
{
|
|
69
|
+
"status": "healthy",
|
|
70
|
+
"service": "docslight-web",
|
|
71
|
+
}
|
|
72
|
+
)
|
|
76
73
|
|
|
77
74
|
@app.get("/api/health")
|
|
78
75
|
def health() -> Any:
|
|
@@ -160,7 +157,11 @@ def create_app(docslight_factory: Callable[..., Any] = DocSlight) -> Flask:
|
|
|
160
157
|
return app
|
|
161
158
|
|
|
162
159
|
|
|
163
|
-
def run_web_app(
|
|
160
|
+
def run_web_app(
|
|
161
|
+
host: str = "127.0.0.1",
|
|
162
|
+
port: int = 8000,
|
|
163
|
+
debug: bool = False,
|
|
164
|
+
) -> None:
|
|
164
165
|
"""Run the local DocSlight web application."""
|
|
165
166
|
_configure_web_logging(debug)
|
|
166
167
|
create_app().run(host=host, port=port, debug=debug)
|
|
@@ -183,17 +184,17 @@ def build_parser() -> argparse.ArgumentParser:
|
|
|
183
184
|
prog="python -m docslight.web_app",
|
|
184
185
|
description="Run the DocSlight web application.",
|
|
185
186
|
)
|
|
186
|
-
parser.add_argument("--host", default="127.0.0.1")
|
|
187
|
-
parser.add_argument("--port", type=int, default=8000)
|
|
188
|
-
parser.add_argument("--debug", action="store_true")
|
|
189
|
-
return parser
|
|
187
|
+
parser.add_argument("--host", default="127.0.0.1")
|
|
188
|
+
parser.add_argument("--port", type=int, default=8000)
|
|
189
|
+
parser.add_argument("--debug", action="store_true")
|
|
190
|
+
return parser
|
|
190
191
|
|
|
191
192
|
|
|
192
193
|
def main(argv: list[str] | None = None) -> int:
|
|
193
|
-
"""Run the standalone DocSlight web application entrypoint."""
|
|
194
|
-
args = build_parser().parse_args(argv)
|
|
195
|
-
run_web_app(args.host, args.port, args.debug)
|
|
196
|
-
return 0
|
|
194
|
+
"""Run the standalone DocSlight web application entrypoint."""
|
|
195
|
+
args = build_parser().parse_args(argv)
|
|
196
|
+
run_web_app(args.host, args.port, args.debug)
|
|
197
|
+
return 0
|
|
197
198
|
|
|
198
199
|
|
|
199
200
|
def local_llm_from_form(form: Any) -> dict[str, str] | None:
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: docslight
|
|
3
|
-
Version: 0.1.
|
|
3
|
+
Version: 0.1.3
|
|
4
4
|
Summary: Lightweight ComPDF document parsing and extraction SDK
|
|
5
5
|
Author-email: ComPDF AI <support@compdf.com>
|
|
6
6
|
License-Expression: MIT
|
|
@@ -78,6 +78,7 @@ Parse any document:
|
|
|
78
78
|
|
|
79
79
|
```bash
|
|
80
80
|
docslight parse invoice.pdf --output invoice.md
|
|
81
|
+
docslight parse invoice.pdf --format zip --output invoice.zip
|
|
81
82
|
```
|
|
82
83
|
|
|
83
84
|
Extract specific fields:
|
|
@@ -86,14 +87,13 @@ Extract specific fields:
|
|
|
86
87
|
docslight extract invoice.pdf --fields invoice_number,total_amount
|
|
87
88
|
```
|
|
88
89
|
|
|
89
|
-
Launch the local
|
|
90
|
+
Launch the local API server:
|
|
90
91
|
|
|
91
92
|
```bash
|
|
92
|
-
pip install "docslight[web]"
|
|
93
93
|
docslight web
|
|
94
|
-
#
|
|
94
|
+
# Health: http://127.0.0.1:8000/api/health
|
|
95
95
|
|
|
96
|
-
# Or run the same
|
|
96
|
+
# Or run the same API server directly as a module
|
|
97
97
|
python -m docslight.web_app --host 0.0.0.0 --port 8000 --debug
|
|
98
98
|
```
|
|
99
99
|
|
|
@@ -103,7 +103,7 @@ python -m docslight.web_app --host 0.0.0.0 --port 8000 --debug
|
|
|
103
103
|
- **Parse → Markdown** — Convert PDF, DOCX, PPTX, XLSX, and images (PNG, JPG, TIFF, BMP, WebP) to clean Markdown
|
|
104
104
|
- **Extract → JSON** — Pull structured data by field list, JSON Schema, or structured template (key-value + table extraction)
|
|
105
105
|
- **CLI first** — Full-featured command-line interface, script-friendly
|
|
106
|
-
- **
|
|
106
|
+
- **API server** — Local Flask backend exposing parse, extract, preview, health, and system-info endpoints
|
|
107
107
|
- **Batch processing** — `parse_batch()` / `extract_batch()` for multiple files
|
|
108
108
|
- **Local LLM extraction** — Ollama or any OpenAI-compatible provider for offline extraction
|
|
109
109
|
- **Document types** — Classify and route documents by type for cloud extraction
|
|
@@ -116,7 +116,7 @@ python -m docslight.web_app --host 0.0.0.0 --port 8000 --debug
|
|
|
116
116
|
| Core SDK & CLI | `pip install docslight` |
|
|
117
117
|
| + Local parsing (OCR, Office) | `pip install "docslight[local]"` |
|
|
118
118
|
| + Local LLM extraction | `pip install "docslight[local,local-llm]"` |
|
|
119
|
-
| +
|
|
119
|
+
| + API server | `pip install "docslight[web]"` |
|
|
120
120
|
|
|
121
121
|
> Local CPU parsing is experimental. Validate accuracy and latency on your own documents before production use.
|
|
122
122
|
|
|
@@ -216,35 +216,36 @@ for r in results:
|
|
|
216
216
|
|
|
217
217
|
```bash
|
|
218
218
|
# Parse
|
|
219
|
-
docslight parse invoice.pdf --mode cloud -o invoice.
|
|
220
|
-
docslight parse invoice.pdf --mode
|
|
219
|
+
docslight parse invoice.pdf --mode cloud -o invoice.zip
|
|
220
|
+
docslight parse invoice.pdf --mode cloud --format zip -o invoice.zip
|
|
221
|
+
docslight parse invoice.pdf --mode local -o invoice.zip
|
|
221
222
|
|
|
222
223
|
# Extract
|
|
223
224
|
docslight extract invoice.pdf --mode cloud --fields invoice_number,total_amount
|
|
224
225
|
docslight extract invoice.pdf --mode local --fields invoice_number --local-llm-provider ollama --local-llm-model llama3.1
|
|
226
|
+
docslight extract "D:\pdf\invoice\1.pdf" --mode local --fields invoice_number --local-llm-provider ollama
|
|
225
227
|
|
|
226
228
|
# Extract with schema
|
|
227
229
|
docslight extract invoice.pdf --schema schema.json
|
|
228
230
|
|
|
229
|
-
#
|
|
231
|
+
# API server
|
|
230
232
|
docslight web --host 127.0.0.1 --port 8000
|
|
231
233
|
```
|
|
232
234
|
|
|
233
|
-
##
|
|
235
|
+
## API Server
|
|
234
236
|
|
|
235
|
-
DocSlight
|
|
237
|
+
DocSlight includes a local Flask API server for document processing. Frontend assets are not bundled in this package.
|
|
236
238
|
|
|
237
239
|
```bash
|
|
238
|
-
pip install "docslight[web]"
|
|
239
240
|
docslight web
|
|
240
241
|
python -m docslight.web_app
|
|
241
242
|
```
|
|
242
243
|
|
|
243
|
-
-
|
|
244
|
-
-
|
|
245
|
-
-
|
|
246
|
-
-
|
|
247
|
-
-
|
|
244
|
+
- `GET /api/health`
|
|
245
|
+
- `GET /api/system-info`
|
|
246
|
+
- `POST /api/parse`
|
|
247
|
+
- `POST /api/extract`
|
|
248
|
+
- `POST /api/preview`
|
|
248
249
|
|
|
249
250
|
## Environment Variables
|
|
250
251
|
|