preocr 1.2.2__tar.gz → 1.3.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {preocr-1.2.2 → preocr-1.3.1}/PKG-INFO +139 -60
- {preocr-1.2.2 → preocr-1.3.1}/README.md +127 -53
- {preocr-1.2.2 → preocr-1.3.1}/preocr/__init__.py +2 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/constants.py +23 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/core/decision.py +101 -5
- {preocr-1.2.2 → preocr-1.3.1}/preocr/core/detector.py +70 -5
- {preocr-1.2.2 → preocr-1.3.1}/preocr/core/signals.py +3 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/extraction/schemas.py +4 -3
- preocr-1.3.1/preocr/planner/__init__.py +15 -0
- preocr-1.3.1/preocr/planner/_extract.py +79 -0
- preocr-1.3.1/preocr/planner/config.py +99 -0
- preocr-1.3.1/preocr/planner/decision.py +126 -0
- preocr-1.3.1/preocr/planner/intent.py +131 -0
- preocr-1.3.1/preocr/planner/models.py +101 -0
- preocr-1.3.1/preocr/planner/planner.py +231 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/version.py +1 -1
- {preocr-1.2.2 → preocr-1.3.1}/preocr.egg-info/PKG-INFO +139 -60
- {preocr-1.2.2 → preocr-1.3.1}/preocr.egg-info/SOURCES.txt +9 -0
- {preocr-1.2.2 → preocr-1.3.1}/pyproject.toml +28 -18
- {preocr-1.2.2 → preocr-1.3.1}/tests/test_batch.py +18 -19
- {preocr-1.2.2 → preocr-1.3.1}/tests/test_config_thresholds.py +20 -20
- {preocr-1.2.2 → preocr-1.3.1}/tests/test_decision.py +30 -9
- {preocr-1.2.2 → preocr-1.3.1}/tests/test_layout_aware_needs_ocr.py +26 -25
- preocr-1.3.1/tests/test_planner.py +102 -0
- {preocr-1.2.2 → preocr-1.3.1}/tests/test_signals.py +26 -1
- preocr-1.3.1/tests/test_skip_opencv_heuristics.py +77 -0
- {preocr-1.2.2 → preocr-1.3.1}/LICENSE +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/analysis/__init__.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/analysis/layout_analyzer.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/analysis/opencv_layout.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/analysis/page_detection.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/core/__init__.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/core/extractor.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/exceptions.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/extraction/__init__.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/extraction/base.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/extraction/formatters.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/extraction/office_extractor.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/extraction/pdf_extractor.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/extraction/text_extractor.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/probes/__init__.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/probes/image_probe.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/probes/office_probe.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/probes/pdf_probe.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/probes/text_probe.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/py.typed +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/reason_codes.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/utils/__init__.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/utils/batch.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/utils/cache.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/utils/filetype.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr/utils/logger.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr.egg-info/dependency_links.txt +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr.egg-info/requires.txt +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/preocr.egg-info/top_level.txt +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/setup.cfg +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/tests/test_detector.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/tests/test_filetype.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/tests/test_hybrid_pipeline.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/tests/test_image_probe.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/tests/test_integration.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/tests/test_layout_analyzer.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/tests/test_office_probe.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/tests/test_opencv_layout.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/tests/test_page_detection.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/tests/test_pdf_probe.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/tests/test_reason_codes.py +0 -0
- {preocr-1.2.2 → preocr-1.3.1}/tests/test_text_probe.py +0 -0
@@ -1,24 +1,29 @@
 Metadata-Version: 2.4
 Name: preocr
-Version: 1.
-Summary: A fast,
-Author:
-License
+Version: 1.3.1
+Summary: A fast, layout-aware OCR decision engine for document processing pipelines. Detects whether files truly require OCR before expensive processing, reducing unnecessary OCR calls while preserving extraction reliability.
+Author: Yuvaraj Kannan
+License: Apache-2.0
 Project-URL: Homepage, https://github.com/yuvaraj3855/preocr
 Project-URL: Documentation, https://github.com/yuvaraj3855/preocr#readme
 Project-URL: Repository, https://github.com/yuvaraj3855/preocr
 Project-URL: Issues, https://github.com/yuvaraj3855/preocr/issues
-Keywords: ocr,
-Classifier: Development Status ::
+Keywords: ocr,ocr-decision,pre-ocr,document-processing,pdf-analysis,layout-analysis,ocr-optimization,ocr-routing,pdf-processing,computer-vision,document-ai
+Classifier: Development Status :: 5 - Production/Stable
 Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Information Technology
 Classifier: Programming Language :: Python :: 3
 Classifier: Programming Language :: Python :: 3.9
 Classifier: Programming Language :: Python :: 3.10
 Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: License :: OSI Approved :: Apache Software License
 Classifier: Topic :: Software Development :: Libraries :: Python Modules
 Classifier: Topic :: Text Processing
-Classifier: Topic ::
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
 Classifier: Topic :: Scientific/Engineering :: Image Recognition
+Classifier: Topic :: Multimedia :: Graphics :: Capture
+Classifier: Topic :: Utilities
 Classifier: Operating System :: OS Independent
 Requires-Python: >=3.9
 Description-Content-Type: text/markdown
@@ -45,11 +50,11 @@ Provides-Extra: batch
 Requires-Dist: tqdm>=4.65.0; extra == "batch"
 Dynamic: license-file
 
-# PreOCR
+# PreOCR – Python OCR Detection Library | Skip OCR for Digital PDFs
 
 <div align="center">
 
-**
+**Open-source Python library for OCR detection and document extraction. Detect if PDFs need OCR before expensive processing—save 50–70% on OCR costs.**
 
 [](https://www.python.org/downloads/)
 [](LICENSE)
@@ -57,32 +62,53 @@ Dynamic: license-file
 [](https://pepy.tech/project/preocr)
 [](https://github.com/psf/black)
 
-*
+*2–10× faster than alternatives • 100% accuracy on benchmark • CPU-only, no GPU required*
 
-**🌐
+**🌐 [preocr.io](https://preocr.io)** • [Installation](#-installation) • [Quick Start](#-quick-start) • [API Reference](#-api-reference) • [Examples](#-usage-examples) • [Performance](#-performance)
 
 </div>
 
 ---
 
-
+### ⚡ TL;DR
 
-
+| Metric | Result |
+|--------|--------|
+| **Accuracy** | 100% (TP=1, FP=0, TN=9, FN=0) |
+| **Latency** | ~2.7s mean, ~1.9s median (≤1MB PDFs) |
+| **Office docs** | ~7ms |
+| **Focus** | Zero false positives. Zero missed scans. |
 
-
+---
+
+## What is PreOCR? Python OCR Detection & Document Processing
+
+**PreOCR** is an open-source **Python OCR detection library** that determines whether documents need OCR before you run expensive processing. It analyzes **PDFs**, **Office documents** (DOCX, PPTX, XLSX), **images**, and text files to detect if they're already machine-readable—helping you **skip OCR** for 50–70% of documents and cut costs.
+
+Use PreOCR to filter documents before Tesseract, AWS Textract, Google Vision, Azure Document Intelligence, or MinerU. Works offline, CPU-only, with 100% accuracy on validation benchmarks.
+
+**🌐 [preocr.io](https://preocr.io)**
 
 ### Key Benefits
 
-- ⚡ **Fast**: CPU-only
-- 🎯 **Accurate**: 92
-- 💰 **Cost-Effective**: Skip OCR for 50
-- 📊 **Structured Extraction**:
+- ⚡ **Fast**: CPU-only, typically < 1 second per file—no GPU needed
+- 🎯 **Accurate**: 92–95% accuracy (100% on validation benchmark)
+- 💰 **Cost-Effective**: Skip OCR for 50–70% of documents
+- 📊 **Structured Extraction**: Tables, forms, images, semantic data—Pydantic models, JSON, or Markdown
 - 🔒 **Type-Safe**: Full Pydantic models with IDE autocomplete
-- 🚀 **Production-Ready**:
+- 🚀 **Offline & Production-Ready**: No API keys; battle-tested error handling
+
+### Use Cases: When to Use PreOCR
+
+- **Document pipelines**: Filter PDFs before OCR (Tesseract, AWS Textract, Google Vision)
+- **RAG / LLM ingestion**: Decide which documents need OCR vs. native text extraction
+- **Batch processing**: Process thousands of PDFs with page-level OCR decisions
+- **Cost optimization**: Reduce cloud OCR API costs by skipping digital documents
+- **Medical / legal**: Intent-aware planner for prescriptions, discharge summaries, lab reports
 
 ---
 
-##
+## Quick Comparison: PreOCR vs. Alternatives
 
 | Feature | PreOCR 🏆 | Unstructured.io | Docugami |
 |---------|-----------|-----------------|----------|
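
The TL;DR table added in the hunk above quotes a confusion matrix of TP=1, FP=0, TN=9, FN=0. As a quick cross-check, the sketch below recomputes the headline metrics from those four counts using the standard definitions. The function name `classification_metrics` is illustrative only and is not part of PreOCR.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts quoted in the README's TL;DR table.
metrics = classification_metrics(tp=1, fp=0, tn=9, fn=0)
print(metrics)
```

All four metrics come out to 1.0 on these counts. Precision and recall each rest on the single positive (OCR-needed) document in the ten-file set.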
@@ -157,6 +183,17 @@ results.print_summary()
 - **Page-Level Granularity**: Analyze PDFs page-by-page for precise detection
 - **Confidence Scores**: Per-decision confidence with reason codes
 - **Hybrid Pipeline**: Fast heuristics + OpenCV refinement for edge cases
+- **OpenCV Skip Heuristics**: Skips OpenCV for clearly digital documents (file size, page count, text coverage) to improve performance
+- **Digital/Table Bias**: Reduces false positives on high-text PDFs (product manuals, marketing docs) via configurable rules
+
+### Intent-Aware OCR Planner (`plan_ocr_for_document`)
+
+- **Medical Domain**: Terminal overrides for prescriptions, diagnosis, discharge summaries, lab reports
+- **Weighted Scoring**: Configurable threshold with safety/balanced/cost modes
+- **Explainability**: Per-page score breakdown (intent, image_dominance, text_weakness)
+- **Evaluation**: Threshold sweep and confusion matrix for calibration
+
+See [docs/OCR_DECISION_MODEL.md](docs/OCR_DECISION_MODEL.md) for the full specification.
 
 ### Document Extraction (`extract_native_data`)
 
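
The planner bullets in the hunk above name three score components (`intent`, `image_dominance`, `text_weakness`), a configurable threshold with safety/balanced/cost modes, and terminal overrides. One way such a decision could be wired together is sketched below. This is an illustration under stated assumptions, not PreOCR's implementation: the weights, the per-mode thresholds, the `PageSignals` type, the `plan_page` function, the `decision_type` values, and the idea that a terminal override forces OCR are all hypothetical. The actual model is specified in docs/OCR_DECISION_MODEL.md.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical weights and thresholds, chosen only for illustration.
WEIGHTS: Dict[str, float] = {"intent": 0.5, "image_dominance": 0.3, "text_weakness": 0.2}
MODE_THRESHOLDS: Dict[str, float] = {"safety": 0.4, "balanced": 0.5, "cost": 0.6}


@dataclass
class PageSignals:
    intent: float             # 0..1, how strongly the page matches a critical intent
    image_dominance: float    # 0..1, share of the page covered by images
    text_weakness: float      # 0..1, how little usable native text the page has
    terminal_override: bool = False  # assumed to force OCR regardless of score


def plan_page(signals: PageSignals, mode: str = "balanced") -> Dict[str, object]:
    """Return a per-page decision together with its score breakdown."""
    breakdown = {
        "intent": WEIGHTS["intent"] * signals.intent,
        "image_dominance": WEIGHTS["image_dominance"] * signals.image_dominance,
        "text_weakness": WEIGHTS["text_weakness"] * signals.text_weakness,
    }
    score = sum(breakdown.values())
    if signals.terminal_override:
        # Override short-circuits the threshold comparison.
        return {"needs_ocr": True, "decision_type": "terminal_override",
                "score": score, "breakdown": breakdown}
    needs_ocr = score >= MODE_THRESHOLDS[mode]
    return {"needs_ocr": needs_ocr, "decision_type": "scored",
            "score": score, "breakdown": breakdown}


page = PageSignals(intent=0.6, image_dominance=0.5, text_weakness=0.5)
for mode in ("safety", "balanced", "cost"):
    print(mode, plan_page(page, mode)["needs_ocr"])
```

With these made-up numbers the page scores roughly 0.55, so it is sent to OCR in the safety and balanced modes but not in the cost mode. That is the kind of trade-off a threshold sweep is meant to expose.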
@@ -222,6 +259,18 @@ print(f"Confidence: {result['confidence']:.2f}")
 print(f"Reason: {result['reason']}")
 ```
 
+#### Intent-Aware Planner (Medical/Domain-Specific)
+
+```python
+from preocr import plan_ocr_for_document
+
+result = plan_ocr_for_document("hospital_discharge.pdf")
+print(f"Needs OCR (any page): {result['needs_ocr_any']}")
+for page in result["pages"]:
+    print(f" Page {page['page_number']}: needs_ocr={page['needs_ocr']} "
+          f"type={page['decision_type']} score={page['debug']['score']:.2f}")
+```
+
 #### Layout-Aware Detection
 
 ```python
@@ -352,8 +401,6 @@ PreOCR supports **20+ file formats** for OCR detection and extraction:
 | **Text** | ✅ Yes | ✅ Yes | TXT, CSV, HTML |
 | **Structured** | ✅ Yes | ✅ Yes | JSON, XML |
 
-See [Supported Formats](SUPPORTED_FORMATS.md) for complete list.
-
 ---
 
 ## ⚙️ Configuration
@@ -377,6 +424,10 @@ result = needs_ocr("document.pdf", config=config)
 - `min_text_length`: Minimum text length (default: 50)
 - `min_office_text_length`: Minimum office text length (default: 100)
 - `layout_refinement_threshold`: OpenCV trigger threshold (default: 0.9)
+- `skip_opencv_if_file_size_mb`: Skip OpenCV when file size ≥ N MB (default: None)
+- `skip_opencv_if_page_count`: Skip OpenCV when page count ≥ N (default: None)
+- `digital_bias_text_coverage_min`: Force no-OCR when text_coverage ≥ this and image_coverage is low (default: 65)
+- `table_bias_text_density_min`: For mixed layout, treat as digital when text_density ≥ this (default: 1.5)
 
 ---
 
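
The four options added in the hunk above describe threshold rules in prose. A minimal sketch of how those rules read as code follows, using the documented defaults. `ThresholdSketch`, `should_skip_opencv`, `is_digital_by_bias`, and the `image_coverage_max` cutoff are hypothetical names used for illustration. They mirror the README wording rather than PreOCR's actual `Config` class or internals, and combining the rules with a logical OR is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdSketch:
    # Field names and defaults follow the README bullets; the class itself is hypothetical.
    skip_opencv_if_file_size_mb: Optional[float] = None
    skip_opencv_if_page_count: Optional[int] = None
    digital_bias_text_coverage_min: float = 65.0
    table_bias_text_density_min: float = 1.5


def should_skip_opencv(cfg: ThresholdSketch, file_size_mb: float, page_count: int) -> bool:
    """Skip OpenCV when either configured limit is reached; None disables a rule."""
    by_size = (cfg.skip_opencv_if_file_size_mb is not None
               and file_size_mb >= cfg.skip_opencv_if_file_size_mb)
    by_pages = (cfg.skip_opencv_if_page_count is not None
                and page_count >= cfg.skip_opencv_if_page_count)
    return by_size or by_pages


def is_digital_by_bias(cfg: ThresholdSketch, text_coverage: float, image_coverage: float,
                       text_density: float, mixed_layout: bool,
                       image_coverage_max: float = 10.0) -> bool:
    """Apply the digital and table bias rules; image_coverage_max is an assumed cutoff."""
    digital_bias = (text_coverage >= cfg.digital_bias_text_coverage_min
                    and image_coverage <= image_coverage_max)
    table_bias = mixed_layout and text_density >= cfg.table_bias_text_density_min
    return digital_bias or table_bias


cfg = ThresholdSketch(skip_opencv_if_file_size_mb=5.0)
print(should_skip_opencv(cfg, file_size_mb=6.0, page_count=3))
print(is_digital_by_bias(cfg, text_coverage=80.0, image_coverage=5.0,
                         text_density=0.5, mixed_layout=False))
```

Leaving both skip options at `None` keeps OpenCV refinement available for every file, which matches the documented defaults.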
@@ -415,15 +466,34 @@ if result["reason_code"] == "PDF_MIXED":
 |----------|------|----------|
 | Fast Path (Heuristics) | < 150ms | ~99% |
 | OpenCV Refinement | 150-300ms | 92-96% |
-| **
+| **Typical (single file)** | **< 1 second** | **94-97%** |
+
+*Typical: most PDFs finish in under 1 second. Heuristics-only files: 120–180ms avg. Large or mixed documents may take 1–3s with OpenCV.*
+
+### Benchmark Results (≤1MB Dataset)
+
+<p align="center">
+<img src="docs/benchmarks/avg-time-by-type.png" alt="Average processing time by file type" width="500">
+<br><em>Average Processing Time by File Type</em>
+</p>
+
+<p align="center">
+<img src="docs/benchmarks/latency-summary.png" alt="Latency summary for PDFs" width="500">
+<br><em>Latency Summary (Mean, Median, P95)</em>
+</p>
 
 ### Accuracy Metrics
 
-- **Overall Accuracy**: 92-95% (100% on
+- **Overall Accuracy**: 92-95% (100% on validation benchmark)
 - **Precision**: 100% (all flagged files actually need OCR)
 - **Recall**: 100% (all OCR-needed files detected)
 - **F1-Score**: 100%
 
+<p align="center">
+<img src="docs/benchmarks/confusion-matrix.png" alt="Confusion matrix - 100% accuracy" width="500">
+<br><em>Confusion Matrix (TP:1, FP:0, TN:9, FN:0)</em>
+</p>
+
 ### Performance Factors
 
 - **File size**: Larger files take longer
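
The latency figures in the hunk above are summarised as mean, median, and P95. A minimal sketch of producing that summary from per-file timings with the Python standard library is shown below. `summarize_latencies` and the sample timings are made up for illustration and are not PreOCR benchmark data.

```python
import statistics
from typing import Dict, Sequence


def summarize_latencies(seconds: Sequence[float]) -> Dict[str, float]:
    """Summarise per-file timings as mean, median and 95th percentile."""
    if len(seconds) < 2:
        raise ValueError("need at least two samples to estimate a percentile")
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    # method="inclusive" keeps the estimate inside the observed range.
    p95 = statistics.quantiles(seconds, n=20, method="inclusive")[-1]
    return {
        "mean": statistics.mean(seconds),
        "median": statistics.median(seconds),
        "p95": p95,
    }


# Illustrative timings in seconds, with one slow outlier.
samples = [1.0, 2.0, 3.0, 4.0, 10.0]
summary = summarize_latencies(samples)
print(summary)
```

A skewed sample like this one has a mean above its median, the same pattern as the README's ~2.7s mean against a ~1.9s median.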
@@ -528,12 +598,14 @@ Batch processor for multiple files with parallel processing.
 ### When to Choose PreOCR
 
 ✅ **Choose PreOCR when:**
-- You
-- You
-- You
-- You
-
-
+- You're building **document ingestion pipelines** or **RAG/LLM systems**—decide which files need OCR vs. native extraction
+- You need **speed** (< 1 second per file) and **cost optimization** (skip OCR for 50–70% of documents)
+- You want **page-level granularity** (which pages need OCR in mixed PDFs)
+- You prefer **type safety** (Pydantic models) and **edge deployment** (CPU-only, no GPU)
+
+### Switched from Unstructured.io or another library?
+
+PreOCR focuses on **OCR routing**—it doesn't perform extraction by default. Use it as a pre-filter: call `needs_ocr()` first, then route to your OCR engine or to `extract_native_data()` for digital documents. The API is simple: `needs_ocr(path)`, `extract_native_data(path)`, `BatchProcessor`.
 
 ---
 
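
The pre-filter flow described in the hunk above (call `needs_ocr()` first, then route to an OCR engine or to `extract_native_data()`) could be sketched as follows. The detector, OCR engine, and native extractor are passed in as callables so the example runs without PreOCR installed. `route_document` and the `fake_*` stubs are hypothetical, and the `"needs_ocr"` key on the result dict is an assumption that this diff does not confirm.

```python
from typing import Any, Callable, Dict, Tuple


def route_document(
    path: str,
    detect: Callable[[str], Dict[str, Any]],
    run_ocr: Callable[[str], Any],
    extract_native: Callable[[str], Any],
) -> Tuple[str, Any]:
    """Run the detector first, then dispatch to OCR or native extraction."""
    decision = detect(path)
    # Assumed key; fall back to OCR if the detector result does not carry it.
    if decision.get("needs_ocr", True):
        return "ocr", run_ocr(path)
    return "native", extract_native(path)


# Stand-in callables for the demo. With PreOCR installed, `detect` and
# `extract_native` would presumably be needs_ocr and extract_native_data.
def fake_detect(path: str) -> Dict[str, Any]:
    return {"needs_ocr": path.endswith("scan.pdf"), "confidence": 0.9}


def fake_ocr(path: str) -> str:
    return f"ocr-text:{path}"


def fake_native(path: str) -> str:
    return f"native-text:{path}"


for name in ("invoice_scan.pdf", "report_digital.pdf"):
    print(route_document(name, fake_detect, fake_ocr, fake_native))
```

Injecting the three callables keeps the routing logic independent of any particular OCR engine, so swapping Tesseract for a cloud API only changes the `run_ocr` argument.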
@@ -560,22 +632,25 @@ Batch processor for multiple files with parallel processing.
 
 ---
 
-##
+## Frequently Asked Questions (FAQ)
 
-**
-
+**Does PreOCR perform OCR?**
+No. PreOCR is an **OCR detection** library—it analyzes files to determine if OCR is needed. It does not run OCR itself. Use it to decide whether to call Tesseract, Textract, or another OCR engine.
 
-**
-
+**How accurate is PreOCR for PDF OCR detection?**
+PreOCR achieves 92–95% accuracy with the hybrid pipeline. Validation on benchmark datasets reached 100% accuracy (10/10 PDFs correct).
 
-**
-
+**Can I use PreOCR with AWS Textract, Google Vision, or Azure Document Intelligence?**
+Yes. PreOCR is ideal for filtering documents before sending them to cloud OCR APIs. Skip OCR for digital PDFs to reduce API costs.
 
-**
-
+**Does PreOCR work offline?**
+Yes. PreOCR is CPU-only and runs fully offline—no API keys or internet required.
 
-**
-
+**How do I customize OCR detection thresholds?**
+Use the `Config` class or pass threshold parameters to `BatchProcessor`. See [Configuration](#-configuration).
+
+**Is there an HTTP/REST API?**
+PreOCR is a Python library. For HTTP APIs, wrap it in FastAPI or Flask—see [preocr.io](https://preocr.io) for hosted options.
 
 ---
 
@@ -592,6 +667,10 @@ pip install -e ".[dev]"
 # Run tests
 pytest
 
+# Run benchmarks (add PDFs to datasets/ for testing)
+python scripts/benchmark_accuracy.py datasets -g scripts/ground_truth_data_source_formats.json --layout-aware --page-level
+python scripts/benchmark_planner.py datasets
+
 # Run linting
 ruff check preocr/
 black --check preocr/
@@ -605,20 +684,20 @@ See [CHANGELOG.md](docs/CHANGELOG.md) for complete version history.
 
 ### Recent Updates
 
-**
-- ✅ **
-- ✅ **
-- ✅ **
-- ✅ **
-- ✅ **
-
-
+**v2.0.0** - Accuracy & Performance (Latest)
+- ✅ **100% Accuracy**: Fixed false positives on digital PDFs; benchmark validation at 100%
+- ✅ **OpenCV Skip Heuristics**: Skip OpenCV for clearly digital documents (configurable by file size, page count)
+- ✅ **Digital/Table Bias Rules**: New config options to reduce false positives on product manuals, marketing PDFs
+- ✅ **Unified Datasets**: Consolidated `benchmarkdata` and `data-source-formats` into `datasets/` directory
+- ✅ **Page Count in Signals**: PDF analysis includes page count for smarter heuristics
+
+**v1.1.0** - Invoice Intelligence & Advanced Extraction
+- ✅ Semantic deduplication, invoice intelligence, text merging
+- ✅ Table stitching, finance validation, reversed text detection
 
 **v1.0.0** - Structured Data Extraction
-- ✅ Comprehensive extraction
-- ✅ Element classification
-- ✅ Table, form, and image extraction
-- ✅ Multiple output formats (Pydantic, JSON, Markdown)
+- ✅ Comprehensive extraction for PDFs, Office docs, text files
+- ✅ Element classification, table/form/image extraction
 
 ---
 
@@ -634,19 +713,19 @@ Apache License 2.0 - see [LICENSE](LICENSE) for details.
 
 ---
 
-##
+## Links & Resources
 
--
-- **
-- **
-- **
+- **Website**: [preocr.io](https://preocr.io) – Python OCR detection and document processing
+- **PyPI**: [pypi.org/project/preocr](https://pypi.org/project/preocr) – Install with `pip install preocr`
+- **GitHub**: [github.com/yuvaraj3855/preocr](https://github.com/yuvaraj3855/preocr) – Source code and issues
+- **Documentation**: [CHANGELOG](docs/CHANGELOG.md) • [OCR Decision Model](docs/OCR_DECISION_MODEL.md) • [Contributing](docs/CONTRIBUTING.md)
 
 ---
 
 <div align="center">
 
-**
+**PreOCR – Python OCR detection library. Skip OCR for digital PDFs. Save time and money.**
 
-[
+[Website](https://preocr.io) · [GitHub](https://github.com/yuvaraj3855/preocr) · [PyPI](https://pypi.org/project/preocr) · [Report Issue](https://github.com/yuvaraj3855/preocr/issues)
 
 </div>
@@ -1,8 +1,8 @@
-# PreOCR
+# PreOCR – Python OCR Detection Library | Skip OCR for Digital PDFs
 
 <div align="center">
 
-**
+**Open-source Python library for OCR detection and document extraction. Detect if PDFs need OCR before expensive processing—save 50–70% on OCR costs.**
 
 [](https://www.python.org/downloads/)
 [](LICENSE)
@@ -10,32 +10,53 @@
 [](https://pepy.tech/project/preocr)
 [](https://github.com/psf/black)
 
-*
+*2–10× faster than alternatives • 100% accuracy on benchmark • CPU-only, no GPU required*
 
-**🌐
+**🌐 [preocr.io](https://preocr.io)** • [Installation](#-installation) • [Quick Start](#-quick-start) • [API Reference](#-api-reference) • [Examples](#-usage-examples) • [Performance](#-performance)
 
 </div>
 
 ---
 
-
+### ⚡ TL;DR
 
-
+| Metric | Result |
+|--------|--------|
+| **Accuracy** | 100% (TP=1, FP=0, TN=9, FN=0) |
+| **Latency** | ~2.7s mean, ~1.9s median (≤1MB PDFs) |
+| **Office docs** | ~7ms |
+| **Focus** | Zero false positives. Zero missed scans. |
 
-
+---
+
+## What is PreOCR? Python OCR Detection & Document Processing
+
+**PreOCR** is an open-source **Python OCR detection library** that determines whether documents need OCR before you run expensive processing. It analyzes **PDFs**, **Office documents** (DOCX, PPTX, XLSX), **images**, and text files to detect if they're already machine-readable—helping you **skip OCR** for 50–70% of documents and cut costs.
+
+Use PreOCR to filter documents before Tesseract, AWS Textract, Google Vision, Azure Document Intelligence, or MinerU. Works offline, CPU-only, with 100% accuracy on validation benchmarks.
+
+**🌐 [preocr.io](https://preocr.io)**
 
 ### Key Benefits
 
-- ⚡ **Fast**: CPU-only
-- 🎯 **Accurate**: 92
-- 💰 **Cost-Effective**: Skip OCR for 50
-- 📊 **Structured Extraction**:
+- ⚡ **Fast**: CPU-only, typically < 1 second per file—no GPU needed
+- 🎯 **Accurate**: 92–95% accuracy (100% on validation benchmark)
+- 💰 **Cost-Effective**: Skip OCR for 50–70% of documents
+- 📊 **Structured Extraction**: Tables, forms, images, semantic data—Pydantic models, JSON, or Markdown
 - 🔒 **Type-Safe**: Full Pydantic models with IDE autocomplete
-- 🚀 **Production-Ready**:
+- 🚀 **Offline & Production-Ready**: No API keys; battle-tested error handling
+
+### Use Cases: When to Use PreOCR
+
+- **Document pipelines**: Filter PDFs before OCR (Tesseract, AWS Textract, Google Vision)
+- **RAG / LLM ingestion**: Decide which documents need OCR vs. native text extraction
+- **Batch processing**: Process thousands of PDFs with page-level OCR decisions
+- **Cost optimization**: Reduce cloud OCR API costs by skipping digital documents
+- **Medical / legal**: Intent-aware planner for prescriptions, discharge summaries, lab reports
 
 ---
 
-##
+## Quick Comparison: PreOCR vs. Alternatives
 
 | Feature | PreOCR 🏆 | Unstructured.io | Docugami |
 |---------|-----------|-----------------|----------|
@@ -110,6 +131,17 @@ results.print_summary()
 - **Page-Level Granularity**: Analyze PDFs page-by-page for precise detection
 - **Confidence Scores**: Per-decision confidence with reason codes
 - **Hybrid Pipeline**: Fast heuristics + OpenCV refinement for edge cases
+- **OpenCV Skip Heuristics**: Skips OpenCV for clearly digital documents (file size, page count, text coverage) to improve performance
+- **Digital/Table Bias**: Reduces false positives on high-text PDFs (product manuals, marketing docs) via configurable rules
+
+### Intent-Aware OCR Planner (`plan_ocr_for_document`)
+
+- **Medical Domain**: Terminal overrides for prescriptions, diagnosis, discharge summaries, lab reports
+- **Weighted Scoring**: Configurable threshold with safety/balanced/cost modes
+- **Explainability**: Per-page score breakdown (intent, image_dominance, text_weakness)
+- **Evaluation**: Threshold sweep and confusion matrix for calibration
+
+See [docs/OCR_DECISION_MODEL.md](docs/OCR_DECISION_MODEL.md) for the full specification.
 
 ### Document Extraction (`extract_native_data`)
 
@@ -175,6 +207,18 @@ print(f"Confidence: {result['confidence']:.2f}")
 print(f"Reason: {result['reason']}")
 ```
 
+#### Intent-Aware Planner (Medical/Domain-Specific)
+
+```python
+from preocr import plan_ocr_for_document
+
+result = plan_ocr_for_document("hospital_discharge.pdf")
+print(f"Needs OCR (any page): {result['needs_ocr_any']}")
+for page in result["pages"]:
+    print(f" Page {page['page_number']}: needs_ocr={page['needs_ocr']} "
+          f"type={page['decision_type']} score={page['debug']['score']:.2f}")
+```
+
 #### Layout-Aware Detection
 
 ```python
@@ -305,8 +349,6 @@ PreOCR supports **20+ file formats** for OCR detection and extraction:
 | **Text** | ✅ Yes | ✅ Yes | TXT, CSV, HTML |
 | **Structured** | ✅ Yes | ✅ Yes | JSON, XML |
 
-See [Supported Formats](SUPPORTED_FORMATS.md) for complete list.
-
 ---
 
 ## ⚙️ Configuration
@@ -330,6 +372,10 @@ result = needs_ocr("document.pdf", config=config)
 - `min_text_length`: Minimum text length (default: 50)
 - `min_office_text_length`: Minimum office text length (default: 100)
 - `layout_refinement_threshold`: OpenCV trigger threshold (default: 0.9)
+- `skip_opencv_if_file_size_mb`: Skip OpenCV when file size ≥ N MB (default: None)
+- `skip_opencv_if_page_count`: Skip OpenCV when page count ≥ N (default: None)
+- `digital_bias_text_coverage_min`: Force no-OCR when text_coverage ≥ this and image_coverage is low (default: 65)
+- `table_bias_text_density_min`: For mixed layout, treat as digital when text_density ≥ this (default: 1.5)
 
 ---
 
@@ -368,15 +414,34 @@ if result["reason_code"] == "PDF_MIXED":
 |----------|------|----------|
 | Fast Path (Heuristics) | < 150ms | ~99% |
 | OpenCV Refinement | 150-300ms | 92-96% |
-| **
+| **Typical (single file)** | **< 1 second** | **94-97%** |
+
+*Typical: most PDFs finish in under 1 second. Heuristics-only files: 120–180ms avg. Large or mixed documents may take 1–3s with OpenCV.*
+
+### Benchmark Results (≤1MB Dataset)
+
+<p align="center">
+<img src="docs/benchmarks/avg-time-by-type.png" alt="Average processing time by file type" width="500">
+<br><em>Average Processing Time by File Type</em>
+</p>
+
+<p align="center">
+<img src="docs/benchmarks/latency-summary.png" alt="Latency summary for PDFs" width="500">
+<br><em>Latency Summary (Mean, Median, P95)</em>
+</p>
 
 ### Accuracy Metrics
 
-- **Overall Accuracy**: 92-95% (100% on
+- **Overall Accuracy**: 92-95% (100% on validation benchmark)
 - **Precision**: 100% (all flagged files actually need OCR)
 - **Recall**: 100% (all OCR-needed files detected)
 - **F1-Score**: 100%
 
+<p align="center">
+<img src="docs/benchmarks/confusion-matrix.png" alt="Confusion matrix - 100% accuracy" width="500">
+<br><em>Confusion Matrix (TP:1, FP:0, TN:9, FN:0)</em>
+</p>
+
 ### Performance Factors
 
 - **File size**: Larger files take longer
@@ -481,12 +546,14 @@ Batch processor for multiple files with parallel processing.
 ### When to Choose PreOCR
 
 ✅ **Choose PreOCR when:**
-- You
-- You
-- You
-- You
-
-
+- You're building **document ingestion pipelines** or **RAG/LLM systems**—decide which files need OCR vs. native extraction
+- You need **speed** (< 1 second per file) and **cost optimization** (skip OCR for 50–70% of documents)
+- You want **page-level granularity** (which pages need OCR in mixed PDFs)
+- You prefer **type safety** (Pydantic models) and **edge deployment** (CPU-only, no GPU)
+
+### Switched from Unstructured.io or another library?
+
+PreOCR focuses on **OCR routing**—it doesn't perform extraction by default. Use it as a pre-filter: call `needs_ocr()` first, then route to your OCR engine or to `extract_native_data()` for digital documents. The API is simple: `needs_ocr(path)`, `extract_native_data(path)`, `BatchProcessor`.
 
 ---
 
@@ -513,22 +580,25 @@ Batch processor for multiple files with parallel processing.
 
 ---
 
-##
+## Frequently Asked Questions (FAQ)
 
-**
-
+**Does PreOCR perform OCR?**
+No. PreOCR is an **OCR detection** library—it analyzes files to determine if OCR is needed. It does not run OCR itself. Use it to decide whether to call Tesseract, Textract, or another OCR engine.
 
-**
-
+**How accurate is PreOCR for PDF OCR detection?**
+PreOCR achieves 92–95% accuracy with the hybrid pipeline. Validation on benchmark datasets reached 100% accuracy (10/10 PDFs correct).
 
-**
-
+**Can I use PreOCR with AWS Textract, Google Vision, or Azure Document Intelligence?**
+Yes. PreOCR is ideal for filtering documents before sending them to cloud OCR APIs. Skip OCR for digital PDFs to reduce API costs.
 
-**
-
+**Does PreOCR work offline?**
+Yes. PreOCR is CPU-only and runs fully offline—no API keys or internet required.
 
-**
-
+**How do I customize OCR detection thresholds?**
+Use the `Config` class or pass threshold parameters to `BatchProcessor`. See [Configuration](#-configuration).
+
+**Is there an HTTP/REST API?**
+PreOCR is a Python library. For HTTP APIs, wrap it in FastAPI or Flask—see [preocr.io](https://preocr.io) for hosted options.
 
 ---
 
|
|
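The FAQ entry above on cloud OCR APIs says that skipping digital PDFs reduces API cost, and the README text earlier in this diff quotes a 50–70% skip range. A rough way to see what that means for a batch is sketched below. The per-document price is a hypothetical placeholder, not a quote from any provider, and `estimate_ocr_savings` is an illustrative name rather than a PreOCR function.

```python
from typing import Dict


def estimate_ocr_savings(total_docs: int, skip_fraction: float, cents_per_doc: int) -> Dict[str, int]:
    """Estimate OCR spend (in cents) with and without a pre-filter that skips some documents."""
    if total_docs < 0 or cents_per_doc < 0:
        raise ValueError("total_docs and cents_per_doc must be non-negative")
    if not 0.0 <= skip_fraction <= 1.0:
        raise ValueError("skip_fraction must be between 0 and 1")
    skipped = round(total_docs * skip_fraction)  # documents that never reach the OCR API
    sent = total_docs - skipped
    return {
        "sent_to_ocr": sent,
        "skipped": skipped,
        "cost_without_filter": total_docs * cents_per_doc,
        "cost_with_filter": sent * cents_per_doc,
        "saved": skipped * cents_per_doc,
    }


# HYPOTHETICAL inputs: 10,000 documents at 2 cents each, at both ends of the quoted range.
for fraction in (0.5, 0.7):
    print(fraction, estimate_ocr_savings(10_000, fraction, cents_per_doc=2))
```

Working in integer cents avoids floating-point rounding in the totals. Actual savings depend on the document mix and on the provider's real pricing, neither of which is stated in this diff.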
@@ -545,6 +615,10 @@ pip install -e ".[dev]"
 # Run tests
 pytest

+# Run benchmarks (add PDFs to datasets/ for testing)
+python scripts/benchmark_accuracy.py datasets -g scripts/ground_truth_data_source_formats.json --layout-aware --page-level
+python scripts/benchmark_planner.py datasets
+
 # Run linting
 ruff check preocr/
 black --check preocr/
@@ -558,20 +632,20 @@ See [CHANGELOG.md](docs/CHANGELOG.md) for complete version history.

 ### Recent Updates

-**
-- ✅ **
-- ✅ **
-- ✅ **
-- ✅ **
-- ✅ **
-
-
+**v2.0.0** - Accuracy & Performance (Latest)
+- ✅ **100% Accuracy**: Fixed false positives on digital PDFs; benchmark validation at 100%
+- ✅ **OpenCV Skip Heuristics**: Skip OpenCV for clearly digital documents (configurable by file size, page count)
+- ✅ **Digital/Table Bias Rules**: New config options to reduce false positives on product manuals, marketing PDFs
+- ✅ **Unified Datasets**: Consolidated `benchmarkdata` and `data-source-formats` into `datasets/` directory
+- ✅ **Page Count in Signals**: PDF analysis includes page count for smarter heuristics
+
+**v1.1.0** - Invoice Intelligence & Advanced Extraction
+- ✅ Semantic deduplication, invoice intelligence, text merging
+- ✅ Table stitching, finance validation, reversed text detection

 **v1.0.0** - Structured Data Extraction
-- ✅ Comprehensive extraction
-- ✅ Element classification
-- ✅ Table, form, and image extraction
-- ✅ Multiple output formats (Pydantic, JSON, Markdown)
+- ✅ Comprehensive extraction for PDFs, Office docs, text files
+- ✅ Element classification, table/form/image extraction

 ---

@@ -587,19 +661,19 @@ Apache License 2.0 - see [LICENSE](LICENSE) for details.

 ---

-##
+## Links & Resources

--
-- **
-- **
-- **
+- **Website**: [preocr.io](https://preocr.io) – Python OCR detection and document processing
+- **PyPI**: [pypi.org/project/preocr](https://pypi.org/project/preocr) – Install with `pip install preocr`
+- **GitHub**: [github.com/yuvaraj3855/preocr](https://github.com/yuvaraj3855/preocr) – Source code and issues
+- **Documentation**: [CHANGELOG](docs/CHANGELOG.md) • [OCR Decision Model](docs/OCR_DECISION_MODEL.md) • [Contributing](docs/CONTRIBUTING.md)

 ---

 <div align="center">

-**
+**PreOCR – Python OCR detection library. Skip OCR for digital PDFs. Save time and money.**

-[
+[Website](https://preocr.io) · [GitHub](https://github.com/yuvaraj3855/preocr) · [PyPI](https://pypi.org/project/preocr) · [Report Issue](https://github.com/yuvaraj3855/preocr/issues)

 </div>