preocr 1.2.2__tar.gz → 1.3.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (68)
  1. {preocr-1.2.2 → preocr-1.3.1}/PKG-INFO +139 -60
  2. {preocr-1.2.2 → preocr-1.3.1}/README.md +127 -53
  3. {preocr-1.2.2 → preocr-1.3.1}/preocr/__init__.py +2 -0
  4. {preocr-1.2.2 → preocr-1.3.1}/preocr/constants.py +23 -0
  5. {preocr-1.2.2 → preocr-1.3.1}/preocr/core/decision.py +101 -5
  6. {preocr-1.2.2 → preocr-1.3.1}/preocr/core/detector.py +70 -5
  7. {preocr-1.2.2 → preocr-1.3.1}/preocr/core/signals.py +3 -0
  8. {preocr-1.2.2 → preocr-1.3.1}/preocr/extraction/schemas.py +4 -3
  9. preocr-1.3.1/preocr/planner/__init__.py +15 -0
  10. preocr-1.3.1/preocr/planner/_extract.py +79 -0
  11. preocr-1.3.1/preocr/planner/config.py +99 -0
  12. preocr-1.3.1/preocr/planner/decision.py +126 -0
  13. preocr-1.3.1/preocr/planner/intent.py +131 -0
  14. preocr-1.3.1/preocr/planner/models.py +101 -0
  15. preocr-1.3.1/preocr/planner/planner.py +231 -0
  16. {preocr-1.2.2 → preocr-1.3.1}/preocr/version.py +1 -1
  17. {preocr-1.2.2 → preocr-1.3.1}/preocr.egg-info/PKG-INFO +139 -60
  18. {preocr-1.2.2 → preocr-1.3.1}/preocr.egg-info/SOURCES.txt +9 -0
  19. {preocr-1.2.2 → preocr-1.3.1}/pyproject.toml +28 -18
  20. {preocr-1.2.2 → preocr-1.3.1}/tests/test_batch.py +18 -19
  21. {preocr-1.2.2 → preocr-1.3.1}/tests/test_config_thresholds.py +20 -20
  22. {preocr-1.2.2 → preocr-1.3.1}/tests/test_decision.py +30 -9
  23. {preocr-1.2.2 → preocr-1.3.1}/tests/test_layout_aware_needs_ocr.py +26 -25
  24. preocr-1.3.1/tests/test_planner.py +102 -0
  25. {preocr-1.2.2 → preocr-1.3.1}/tests/test_signals.py +26 -1
  26. preocr-1.3.1/tests/test_skip_opencv_heuristics.py +77 -0
  27. {preocr-1.2.2 → preocr-1.3.1}/LICENSE +0 -0
  28. {preocr-1.2.2 → preocr-1.3.1}/preocr/analysis/__init__.py +0 -0
  29. {preocr-1.2.2 → preocr-1.3.1}/preocr/analysis/layout_analyzer.py +0 -0
  30. {preocr-1.2.2 → preocr-1.3.1}/preocr/analysis/opencv_layout.py +0 -0
  31. {preocr-1.2.2 → preocr-1.3.1}/preocr/analysis/page_detection.py +0 -0
  32. {preocr-1.2.2 → preocr-1.3.1}/preocr/core/__init__.py +0 -0
  33. {preocr-1.2.2 → preocr-1.3.1}/preocr/core/extractor.py +0 -0
  34. {preocr-1.2.2 → preocr-1.3.1}/preocr/exceptions.py +0 -0
  35. {preocr-1.2.2 → preocr-1.3.1}/preocr/extraction/__init__.py +0 -0
  36. {preocr-1.2.2 → preocr-1.3.1}/preocr/extraction/base.py +0 -0
  37. {preocr-1.2.2 → preocr-1.3.1}/preocr/extraction/formatters.py +0 -0
  38. {preocr-1.2.2 → preocr-1.3.1}/preocr/extraction/office_extractor.py +0 -0
  39. {preocr-1.2.2 → preocr-1.3.1}/preocr/extraction/pdf_extractor.py +0 -0
  40. {preocr-1.2.2 → preocr-1.3.1}/preocr/extraction/text_extractor.py +0 -0
  41. {preocr-1.2.2 → preocr-1.3.1}/preocr/probes/__init__.py +0 -0
  42. {preocr-1.2.2 → preocr-1.3.1}/preocr/probes/image_probe.py +0 -0
  43. {preocr-1.2.2 → preocr-1.3.1}/preocr/probes/office_probe.py +0 -0
  44. {preocr-1.2.2 → preocr-1.3.1}/preocr/probes/pdf_probe.py +0 -0
  45. {preocr-1.2.2 → preocr-1.3.1}/preocr/probes/text_probe.py +0 -0
  46. {preocr-1.2.2 → preocr-1.3.1}/preocr/py.typed +0 -0
  47. {preocr-1.2.2 → preocr-1.3.1}/preocr/reason_codes.py +0 -0
  48. {preocr-1.2.2 → preocr-1.3.1}/preocr/utils/__init__.py +0 -0
  49. {preocr-1.2.2 → preocr-1.3.1}/preocr/utils/batch.py +0 -0
  50. {preocr-1.2.2 → preocr-1.3.1}/preocr/utils/cache.py +0 -0
  51. {preocr-1.2.2 → preocr-1.3.1}/preocr/utils/filetype.py +0 -0
  52. {preocr-1.2.2 → preocr-1.3.1}/preocr/utils/logger.py +0 -0
  53. {preocr-1.2.2 → preocr-1.3.1}/preocr.egg-info/dependency_links.txt +0 -0
  54. {preocr-1.2.2 → preocr-1.3.1}/preocr.egg-info/requires.txt +0 -0
  55. {preocr-1.2.2 → preocr-1.3.1}/preocr.egg-info/top_level.txt +0 -0
  56. {preocr-1.2.2 → preocr-1.3.1}/setup.cfg +0 -0
  57. {preocr-1.2.2 → preocr-1.3.1}/tests/test_detector.py +0 -0
  58. {preocr-1.2.2 → preocr-1.3.1}/tests/test_filetype.py +0 -0
  59. {preocr-1.2.2 → preocr-1.3.1}/tests/test_hybrid_pipeline.py +0 -0
  60. {preocr-1.2.2 → preocr-1.3.1}/tests/test_image_probe.py +0 -0
  61. {preocr-1.2.2 → preocr-1.3.1}/tests/test_integration.py +0 -0
  62. {preocr-1.2.2 → preocr-1.3.1}/tests/test_layout_analyzer.py +0 -0
  63. {preocr-1.2.2 → preocr-1.3.1}/tests/test_office_probe.py +0 -0
  64. {preocr-1.2.2 → preocr-1.3.1}/tests/test_opencv_layout.py +0 -0
  65. {preocr-1.2.2 → preocr-1.3.1}/tests/test_page_detection.py +0 -0
  66. {preocr-1.2.2 → preocr-1.3.1}/tests/test_pdf_probe.py +0 -0
  67. {preocr-1.2.2 → preocr-1.3.1}/tests/test_reason_codes.py +0 -0
  68. {preocr-1.2.2 → preocr-1.3.1}/tests/test_text_probe.py +0 -0
@@ -1,24 +1,29 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: preocr
3
- Version: 1.2.2
4
- Summary: A fast, CPU-only library that intelligently detects whether files need OCR processing before expensive OCR operations. Uses hybrid adaptive pipeline for 92-95% accuracy.
5
- Author: PreOCR Contributors
6
- License-Expression: Apache-2.0
3
+ Version: 1.3.1
4
+ Summary: A fast, layout-aware OCR decision engine for document processing pipelines. Detects whether files truly require OCR before expensive processing, reducing unnecessary OCR calls while preserving extraction reliability.
5
+ Author: Yuvaraj Kannan
6
+ License: Apache-2.0
7
7
  Project-URL: Homepage, https://github.com/yuvaraj3855/preocr
8
8
  Project-URL: Documentation, https://github.com/yuvaraj3855/preocr#readme
9
9
  Project-URL: Repository, https://github.com/yuvaraj3855/preocr
10
10
  Project-URL: Issues, https://github.com/yuvaraj3855/preocr/issues
11
- Keywords: ocr,document,detection,preprocessing,file-analysis,layout-analysis,pdf-analysis,text-extraction,document-processing,opencv,computer-vision
12
- Classifier: Development Status :: 3 - Alpha
11
+ Keywords: ocr,ocr-decision,pre-ocr,document-processing,pdf-analysis,layout-analysis,ocr-optimization,ocr-routing,pdf-processing,computer-vision,document-ai
12
+ Classifier: Development Status :: 5 - Production/Stable
13
13
  Classifier: Intended Audience :: Developers
14
+ Classifier: Intended Audience :: Information Technology
14
15
  Classifier: Programming Language :: Python :: 3
15
16
  Classifier: Programming Language :: Python :: 3.9
16
17
  Classifier: Programming Language :: Python :: 3.10
17
18
  Classifier: Programming Language :: Python :: 3.11
19
+ Classifier: Programming Language :: Python :: 3.12
20
+ Classifier: License :: OSI Approved :: Apache Software License
18
21
  Classifier: Topic :: Software Development :: Libraries :: Python Modules
19
22
  Classifier: Topic :: Text Processing
20
- Classifier: Topic :: Multimedia :: Graphics
23
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
21
24
  Classifier: Topic :: Scientific/Engineering :: Image Recognition
25
+ Classifier: Topic :: Multimedia :: Graphics :: Capture
26
+ Classifier: Topic :: Utilities
22
27
  Classifier: Operating System :: OS Independent
23
28
  Requires-Python: >=3.9
24
29
  Description-Content-Type: text/markdown
@@ -45,11 +50,11 @@ Provides-Extra: batch
45
50
  Requires-Dist: tqdm>=4.65.0; extra == "batch"
46
51
  Dynamic: license-file
47
52
 
48
- # PreOCR - Fast OCR Detection & Document Extraction Library
53
+ # PreOCR Python OCR Detection Library | Skip OCR for Digital PDFs
49
54
 
50
55
  <div align="center">
51
56
 
52
- **Intelligent OCR detection and structured document extraction - 2-10x faster than competitors**
57
+ **Open-source Python library for OCR detection and document extraction. Detect if PDFs need OCR before expensive processing—save 50–70% on OCR costs.**
53
58
 
54
59
  [![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
55
60
  [![License](https://img.shields.io/badge/license-Apache%202.0-green.svg)](LICENSE)
@@ -57,32 +62,53 @@ Dynamic: license-file
57
62
  [![Downloads](https://pepy.tech/badge/preocr)](https://pepy.tech/project/preocr)
58
63
  [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
59
64
 
60
- *Save time and money by skipping OCR for files that are already machine-readable*
65
+ *2–10× faster than alternatives • 100% accuracy on benchmark • CPU-only, no GPU required*
61
66
 
62
- **🌐 Website**: [preocr.io](https://preocr.io) • **[Installation](#-installation)** • **[Quick Start](#-quick-start)** • **[Documentation](#-api-reference)** • **[Examples](#-usage-examples)** • **[Benchmarks](#-performance)**
67
+ **🌐 [preocr.io](https://preocr.io)** • [Installation](#-installation) • [Quick Start](#-quick-start) • [API Reference](#-api-reference) • [Examples](#-usage-examples) • [Performance](#-performance)
63
68
 
64
69
  </div>
65
70
 
66
71
  ---
67
72
 
68
- ## 🎯 What is PreOCR?
73
+ ### TL;DR
69
74
 
70
- **PreOCR** is a Python library for **OCR detection** and **document extraction** that intelligently determines whether files need OCR processing before expensive operations. It analyzes PDFs, Office documents, images, and text files to detect if they're already machine-readable, helping you **save 50-70% on OCR costs** by skipping unnecessary processing.
75
+ | Metric | Result |
76
+ |--------|--------|
77
+ | **Accuracy** | 100% (TP=1, FP=0, TN=9, FN=0) |
78
+ | **Latency** | ~2.7s mean, ~1.9s median (≤1MB PDFs) |
79
+ | **Office docs** | ~7ms |
80
+ | **Focus** | Zero false positives. Zero missed scans. |
71
81
 
72
- **🌐 Learn more at [preocr.io](https://preocr.io)**
82
+ ---
83
+
84
+ ## What is PreOCR? Python OCR Detection & Document Processing
85
+
86
+ **PreOCR** is an open-source **Python OCR detection library** that determines whether documents need OCR before you run expensive processing. It analyzes **PDFs**, **Office documents** (DOCX, PPTX, XLSX), **images**, and text files to detect if they're already machine-readable—helping you **skip OCR** for 50–70% of documents and cut costs.
87
+
88
+ Use PreOCR to filter documents before Tesseract, AWS Textract, Google Vision, Azure Document Intelligence, or MinerU. Works offline, CPU-only, with 100% accuracy on validation benchmarks.
89
+
90
+ **🌐 [preocr.io](https://preocr.io)**
73
91
 
74
92
  ### Key Benefits
75
93
 
76
- - ⚡ **Fast**: CPU-only processing, typically < 1 second per file
77
- - 🎯 **Accurate**: 92-95% accuracy (100% on recent validation dataset)
78
- - 💰 **Cost-Effective**: Skip OCR for 50-70% of documents
79
- - 📊 **Structured Extraction**: Extract tables, forms, images, and semantic data
94
+ - ⚡ **Fast**: CPU-only, typically < 1 second per file—no GPU needed
95
+ - 🎯 **Accurate**: 92–95% accuracy (100% on validation benchmark)
96
+ - 💰 **Cost-Effective**: Skip OCR for 50–70% of documents
97
+ - 📊 **Structured Extraction**: Tables, forms, images, semantic data—Pydantic models, JSON, or Markdown
80
98
  - 🔒 **Type-Safe**: Full Pydantic models with IDE autocomplete
81
- - 🚀 **Production-Ready**: Battle-tested with comprehensive error handling
99
+ - 🚀 **Offline & Production-Ready**: No API keys; battle-tested error handling
100
+
101
+ ### Use Cases: When to Use PreOCR
102
+
103
+ - **Document pipelines**: Filter PDFs before OCR (Tesseract, AWS Textract, Google Vision)
104
+ - **RAG / LLM ingestion**: Decide which documents need OCR vs. native text extraction
105
+ - **Batch processing**: Process thousands of PDFs with page-level OCR decisions
106
+ - **Cost optimization**: Reduce cloud OCR API costs by skipping digital documents
107
+ - **Medical / legal**: Intent-aware planner for prescriptions, discharge summaries, lab reports
82
108
 
83
109
  ---
84
110
 
85
- ## Quick Comparison
111
+ ## Quick Comparison: PreOCR vs. Alternatives
86
112
 
87
113
  | Feature | PreOCR 🏆 | Unstructured.io | Docugami |
88
114
  |---------|-----------|-----------------|----------|
@@ -157,6 +183,17 @@ results.print_summary()
157
183
  - **Page-Level Granularity**: Analyze PDFs page-by-page for precise detection
158
184
  - **Confidence Scores**: Per-decision confidence with reason codes
159
185
  - **Hybrid Pipeline**: Fast heuristics + OpenCV refinement for edge cases
186
+ - **OpenCV Skip Heuristics**: Skips OpenCV for clearly digital documents (file size, page count, text coverage) to improve performance
187
+ - **Digital/Table Bias**: Reduces false positives on high-text PDFs (product manuals, marketing docs) via configurable rules
188
+
189
+ ### Intent-Aware OCR Planner (`plan_ocr_for_document`)
190
+
191
+ - **Medical Domain**: Terminal overrides for prescriptions, diagnosis, discharge summaries, lab reports
192
+ - **Weighted Scoring**: Configurable threshold with safety/balanced/cost modes
193
+ - **Explainability**: Per-page score breakdown (intent, image_dominance, text_weakness)
194
+ - **Evaluation**: Threshold sweep and confusion matrix for calibration
195
+
196
+ See [docs/OCR_DECISION_MODEL.md](docs/OCR_DECISION_MODEL.md) for the full specification.
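
A minimal sketch of how a weighted per-page score with mode-dependent thresholds could be combined. The component names (`intent`, `image_dominance`, `text_weakness`) and the mode names (safety, balanced, cost) come from the feature list above. The weights, the threshold values, and the `score_page` helper are hypothetical and are not taken from the preocr source; the linked specification is the authority on the actual model.

```python
from dataclasses import dataclass

# Hypothetical weights and thresholds, chosen only for illustration.
WEIGHTS = {"intent": 0.5, "image_dominance": 0.3, "text_weakness": 0.2}
MODE_THRESHOLDS = {"safety": 0.4, "balanced": 0.5, "cost": 0.6}


@dataclass
class PageScore:
    score: float      # weighted sum of all components
    breakdown: dict   # per-component contribution, for explainability
    needs_ocr: bool   # True when the score reaches the mode's threshold


def score_page(signals: dict, mode: str = "balanced") -> PageScore:
    """Combine per-page signals (each assumed to lie in [0, 1]) into one score."""
    breakdown = {name: WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS}
    score = sum(breakdown.values())
    return PageScore(score, breakdown, score >= MODE_THRESHOLDS[mode])


page = score_page({"intent": 0.9, "image_dominance": 0.0, "text_weakness": 0.0}, mode="safety")
print(page.needs_ocr, page.breakdown)
```

In this sketch a lower threshold (safety) sends more pages to OCR and a higher one (cost) sends fewer, which is one plausible reading of the three modes.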
160
197
 
161
198
  ### Document Extraction (`extract_native_data`)
162
199
 
@@ -222,6 +259,18 @@ print(f"Confidence: {result['confidence']:.2f}")
222
259
  print(f"Reason: {result['reason']}")
223
260
  ```
224
261
 
262
+ #### Intent-Aware Planner (Medical/Domain-Specific)
263
+
264
+ ```python
265
+ from preocr import plan_ocr_for_document
266
+
267
+ result = plan_ocr_for_document("hospital_discharge.pdf")
268
+ print(f"Needs OCR (any page): {result['needs_ocr_any']}")
269
+ for page in result["pages"]:
270
+     print(f" Page {page['page_number']}: needs_ocr={page['needs_ocr']} "
271
+           f"type={page['decision_type']} score={page['debug']['score']:.2f}")
272
+ ```
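
The per-page result above can be reduced to a list of pages to send to an OCR engine. The sketch below assumes only the keys visible in the README snippet (`pages`, `page_number`, `needs_ocr`); the `pages_needing_ocr` helper and the sample dictionary are hypothetical and are not part of preocr.

```python
def pages_needing_ocr(plan: dict) -> list:
    """Return the page numbers flagged for OCR in a planner-style result."""
    return [p["page_number"] for p in plan.get("pages", []) if p["needs_ocr"]]


# Hand-built stand-in for a planner result, using only the keys shown above.
sample_plan = {
    "needs_ocr_any": True,
    "pages": [
        {"page_number": 1, "needs_ocr": False},
        {"page_number": 2, "needs_ocr": True},
    ],
}
print(pages_needing_ocr(sample_plan))
```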
273
+
225
274
  #### Layout-Aware Detection
226
275
 
227
276
  ```python
@@ -352,8 +401,6 @@ PreOCR supports **20+ file formats** for OCR detection and extraction:
352
401
  | **Text** | ✅ Yes | ✅ Yes | TXT, CSV, HTML |
353
402
  | **Structured** | ✅ Yes | ✅ Yes | JSON, XML |
354
403
 
355
- See [Supported Formats](SUPPORTED_FORMATS.md) for complete list.
356
-
357
404
  ---
358
405
 
359
406
  ## ⚙️ Configuration
@@ -377,6 +424,10 @@ result = needs_ocr("document.pdf", config=config)
377
424
  - `min_text_length`: Minimum text length (default: 50)
378
425
  - `min_office_text_length`: Minimum office text length (default: 100)
379
426
  - `layout_refinement_threshold`: OpenCV trigger threshold (default: 0.9)
427
+ - `skip_opencv_if_file_size_mb`: Skip OpenCV when file size ≥ N MB (default: None)
428
+ - `skip_opencv_if_page_count`: Skip OpenCV when page count ≥ N (default: None)
429
+ - `digital_bias_text_coverage_min`: Force no-OCR when text_coverage ≥ this and image_coverage is low (default: 65)
430
+ - `table_bias_text_density_min`: For mixed layout, treat as digital when text_density ≥ this (default: 1.5)
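
The two skip options can be read as simple "greater than or equal" rules. The sketch below illustrates that reading only: `SkipRules` and `should_skip_opencv` are hypothetical names, not preocr's `Config` class or its internal logic. It also assumes that `None` disables a rule and that either rule alone is enough to skip OpenCV.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SkipRules:
    """Hypothetical stand-in mirroring two of the documented options."""
    skip_opencv_if_file_size_mb: Optional[float] = None  # assumed: None disables the rule
    skip_opencv_if_page_count: Optional[int] = None      # assumed: None disables the rule


def should_skip_opencv(file_size_mb: float, page_count: int, rules: SkipRules) -> bool:
    """Return True when any enabled rule is met (size >= N MB or pages >= N)."""
    size_limit = rules.skip_opencv_if_file_size_mb
    page_limit = rules.skip_opencv_if_page_count
    if size_limit is not None and file_size_mb >= size_limit:
        return True
    if page_limit is not None and page_count >= page_limit:
        return True
    return False


rules = SkipRules(skip_opencv_if_file_size_mb=20, skip_opencv_if_page_count=50)
print(should_skip_opencv(file_size_mb=3.2, page_count=120, rules=rules))
```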
380
431
 
381
432
  ---
382
433
 
@@ -415,15 +466,34 @@ if result["reason_code"] == "PDF_MIXED":
415
466
  |----------|------|----------|
416
467
  | Fast Path (Heuristics) | < 150ms | ~99% |
417
468
  | OpenCV Refinement | 150-300ms | 92-96% |
418
- | **Average** | **120-180ms** | **94-97%** |
469
+ | **Typical (single file)** | **< 1 second** | **94-97%** |
470
+
471
+ *Typical: most PDFs finish in under 1 second. Heuristics-only files: 120–180ms avg. Large or mixed documents may take 1–3s with OpenCV.*
472
+
473
+ ### Benchmark Results (≤1MB Dataset)
474
+
475
+ <p align="center">
476
+ <img src="docs/benchmarks/avg-time-by-type.png" alt="Average processing time by file type" width="500">
477
+ <br><em>Average Processing Time by File Type</em>
478
+ </p>
479
+
480
+ <p align="center">
481
+ <img src="docs/benchmarks/latency-summary.png" alt="Latency summary for PDFs" width="500">
482
+ <br><em>Latency Summary (Mean, Median, P95)</em>
483
+ </p>
419
484
 
420
485
  ### Accuracy Metrics
421
486
 
422
- - **Overall Accuracy**: 92-95% (100% on recent validation)
487
+ - **Overall Accuracy**: 92-95% (100% on validation benchmark)
423
488
  - **Precision**: 100% (all flagged files actually need OCR)
424
489
  - **Recall**: 100% (all OCR-needed files detected)
425
490
  - **F1-Score**: 100%
426
491
 
492
+ <p align="center">
493
+ <img src="docs/benchmarks/confusion-matrix.png" alt="Confusion matrix - 100% accuracy" width="500">
494
+ <br><em>Confusion Matrix (TP:1, FP:0, TN:9, FN:0)</em>
495
+ </p>
496
+
427
497
  ### Performance Factors
428
498
 
429
499
  - **File size**: Larger files take longer
@@ -528,12 +598,14 @@ Batch processor for multiple files with parallel processing.
528
598
  ### When to Choose PreOCR
529
599
 
530
600
  ✅ **Choose PreOCR when:**
531
- - You need **speed** (< 1 second processing)
532
- - You want **cost optimization** (skip OCR for 50-70% of documents)
533
- - You need **page-level granularity**
534
- - You want **type safety** (Pydantic models)
535
- - You're building **LLM/RAG pipelines**
536
- - You need **edge deployment** (CPU-only)
601
+ - You're building **document ingestion pipelines** or **RAG/LLM systems**—decide which files need OCR vs. native extraction
602
+ - You need **speed** (< 1 second per file) and **cost optimization** (skip OCR for 50–70% of documents)
603
+ - You want **page-level granularity** (which pages need OCR in mixed PDFs)
604
+ - You prefer **type safety** (Pydantic models) and **edge deployment** (CPU-only, no GPU)
605
+
606
+ ### Switched from Unstructured.io or another library?
607
+
608
+ PreOCR focuses on **OCR routing**—it doesn't perform extraction by default. Use it as a pre-filter: call `needs_ocr()` first, then route to your OCR engine or to `extract_native_data()` for digital documents. The API is simple: `needs_ocr(path)`, `extract_native_data(path)`, `BatchProcessor`.
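
The pre-filter pattern described above can be sketched as a small routing function. To keep the example runnable without preocr installed, the detector and both back ends are passed in as callables. The `route_document` helper is hypothetical, and the boolean `needs_ocr` key on the detection result is an assumption that this diff does not show.

```python
from typing import Callable


def route_document(
    path: str,
    detect: Callable[[str], dict],
    run_ocr: Callable[[str], object],
    extract_native: Callable[[str], object],
) -> object:
    """Send a file to OCR only when the detector says it is needed."""
    decision = detect(path)
    # Assumption: the detection result is a dict with a boolean "needs_ocr" key.
    if decision["needs_ocr"]:
        return run_ocr(path)
    return extract_native(path)


# Stub callables standing in for the real detector and back ends.
result = route_document(
    "scan.pdf",
    detect=lambda p: {"needs_ocr": p.startswith("scan")},
    run_ocr=lambda p: f"ocr:{p}",
    extract_native=lambda p: f"native:{p}",
)
print(result)
```

With preocr installed, `detect` would be `needs_ocr` and `extract_native` would be `extract_native_data`, while `run_ocr` wraps whichever OCR engine the pipeline uses.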
537
609
 
538
610
  ---
539
611
 
@@ -560,22 +632,25 @@ Batch processor for multiple files with parallel processing.
560
632
 
561
633
  ---
562
634
 
563
- ## Frequently Asked Questions
635
+ ## Frequently Asked Questions (FAQ)
564
636
 
565
- **Q: Does PreOCR perform OCR?**
566
- A: No, PreOCR never performs OCR. It only analyzes files to determine if OCR is needed.
637
+ **Does PreOCR perform OCR?**
638
+ No. PreOCR is an **OCR detection** library—it analyzes files to determine if OCR is needed. It does not run OCR itself. Use it to decide whether to call Tesseract, Textract, or another OCR engine.
567
639
 
568
- **Q: How accurate is PreOCR?**
569
- A: PreOCR achieves 92-95% accuracy with the hybrid pipeline. Recent validation on 27 files achieved 100% accuracy.
640
+ **How accurate is PreOCR for PDF OCR detection?**
641
+ PreOCR achieves 92–95% accuracy with the hybrid pipeline. Validation on benchmark datasets reached 100% accuracy (10/10 PDFs correct).
570
642
 
571
- **Q: Can I use PreOCR with cloud OCR services?**
572
- A: Yes! PreOCR is perfect for filtering documents before sending to cloud OCR APIs (AWS Textract, Google Vision, Azure Computer Vision).
643
+ **Can I use PreOCR with AWS Textract, Google Vision, or Azure Document Intelligence?**
644
+ Yes. PreOCR is ideal for filtering documents before sending them to cloud OCR APIs. Skip OCR for digital PDFs to reduce API costs.
573
645
 
574
- **Q: Does PreOCR work offline?**
575
- A: Yes! PreOCR is CPU-only and works completely offline.
646
+ **Does PreOCR work offline?**
647
+ Yes. PreOCR is CPU-only and runs fully offline—no API keys or internet required.
576
648
 
577
- **Q: Can I customize decision thresholds?**
578
- A: Yes! Use the `Config` class or pass threshold parameters to `BatchProcessor`.
649
+ **How do I customize OCR detection thresholds?**
650
+ Use the `Config` class or pass threshold parameters to `BatchProcessor`. See [Configuration](#-configuration).
651
+
652
+ **Is there an HTTP/REST API?**
653
+ PreOCR is a Python library. For HTTP APIs, wrap it in FastAPI or Flask—see [preocr.io](https://preocr.io) for hosted options.
579
654
 
580
655
  ---
581
656
 
@@ -592,6 +667,10 @@ pip install -e ".[dev]"
592
667
  # Run tests
593
668
  pytest
594
669
 
670
+ # Run benchmarks (add PDFs to datasets/ for testing)
671
+ python scripts/benchmark_accuracy.py datasets -g scripts/ground_truth_data_source_formats.json --layout-aware --page-level
672
+ python scripts/benchmark_planner.py datasets
673
+
595
674
  # Run linting
596
675
  ruff check preocr/
597
676
  black --check preocr/
@@ -605,20 +684,20 @@ See [CHANGELOG.md](docs/CHANGELOG.md) for complete version history.
605
684
 
606
685
  ### Recent Updates
607
686
 
608
- **v1.1.0** - Invoice Intelligence & Advanced Extraction (Latest)
609
- - ✅ **Semantic Deduplication**: Intelligent line item deduplication for invoices
610
- - ✅ **Invoice Intelligence**: Semantic extraction with finance validation
611
- - ✅ **Text Merging**: Geometry-aware character-to-word merging improvements
612
- - ✅ **Table Stitching**: Merges fragmented tables across pages
613
- - ✅ **Finance Validation**: Validates invoice totals (subtotal + tax = total)
614
- - ✅ **Reversed Text Detection**: Detects and corrects rotated/mirrored text
615
- - ✅ **Footer Exclusion**: Removes footer from reading order
687
+ **v2.0.0** - Accuracy & Performance (Latest)
688
+ - ✅ **100% Accuracy**: Fixed false positives on digital PDFs; benchmark validation at 100%
689
+ - ✅ **OpenCV Skip Heuristics**: Skip OpenCV for clearly digital documents (configurable by file size, page count)
690
+ - ✅ **Digital/Table Bias Rules**: New config options to reduce false positives on product manuals, marketing PDFs
691
+ - ✅ **Unified Datasets**: Consolidated `benchmarkdata` and `data-source-formats` into `datasets/` directory
692
+ - ✅ **Page Count in Signals**: PDF analysis includes page count for smarter heuristics
693
+
694
+ **v1.1.0** - Invoice Intelligence & Advanced Extraction
695
+ - ✅ Semantic deduplication, invoice intelligence, text merging
696
+ - ✅ Table stitching, finance validation, reversed text detection
616
697
 
617
698
  **v1.0.0** - Structured Data Extraction
618
- - ✅ Comprehensive extraction system for PDFs, Office docs, and text files
619
- - ✅ Element classification (11+ types)
620
- - ✅ Table, form, and image extraction
621
- - ✅ Multiple output formats (Pydantic, JSON, Markdown)
699
+ - ✅ Comprehensive extraction for PDFs, Office docs, text files
700
+ - ✅ Element classification, table/form/image extraction
622
701
 
623
702
  ---
624
703
 
@@ -634,19 +713,19 @@ Apache License 2.0 - see [LICENSE](LICENSE) for details.
634
713
 
635
714
  ---
636
715
 
637
- ## 🔗 Links
716
+ ## Links & Resources
638
717
 
639
- - **🌐 Website**: [preocr.io](https://preocr.io)
640
- - **GitHub**: [https://github.com/yuvaraj3855/preocr](https://github.com/yuvaraj3855/preocr)
641
- - **PyPI**: [https://pypi.org/project/preocr](https://pypi.org/project/preocr)
642
- - **Issues**: [https://github.com/yuvaraj3855/preocr/issues](https://github.com/yuvaraj3855/preocr/issues)
718
+ - **Website**: [preocr.io](https://preocr.io) – Python OCR detection and document processing
719
+ - **PyPI**: [pypi.org/project/preocr](https://pypi.org/project/preocr) – Install with `pip install preocr`
720
+ - **GitHub**: [github.com/yuvaraj3855/preocr](https://github.com/yuvaraj3855/preocr) – Source code and issues
721
+ - **Documentation**: [CHANGELOG](docs/CHANGELOG.md) • [OCR Decision Model](docs/OCR_DECISION_MODEL.md) • [Contributing](docs/CONTRIBUTING.md)
643
722
 
644
723
  ---
645
724
 
646
725
  <div align="center">
647
726
 
648
- **Made with ❤️ for efficient document processing**
727
+ **PreOCR Python OCR detection library. Skip OCR for digital PDFs. Save time and money.**
649
728
 
650
- [🌐 Website](https://preocr.io) | [⭐ Star on GitHub](https://github.com/yuvaraj3855/preocr) | [📖 Documentation](https://github.com/yuvaraj3855/preocr#readme) | [🐛 Report Issue](https://github.com/yuvaraj3855/preocr/issues)
729
+ [Website](https://preocr.io) · [GitHub](https://github.com/yuvaraj3855/preocr) · [PyPI](https://pypi.org/project/preocr) · [Report Issue](https://github.com/yuvaraj3855/preocr/issues)
651
730
 
652
731
  </div>
@@ -1,8 +1,8 @@
1
- # PreOCR - Fast OCR Detection & Document Extraction Library
1
+ # PreOCR Python OCR Detection Library | Skip OCR for Digital PDFs
2
2
 
3
3
  <div align="center">
4
4
 
5
- **Intelligent OCR detection and structured document extraction - 2-10x faster than competitors**
5
+ **Open-source Python library for OCR detection and document extraction. Detect if PDFs need OCR before expensive processing—save 50–70% on OCR costs.**
6
6
 
7
7
  [![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
8
8
  [![License](https://img.shields.io/badge/license-Apache%202.0-green.svg)](LICENSE)
@@ -10,32 +10,53 @@
10
10
  [![Downloads](https://pepy.tech/badge/preocr)](https://pepy.tech/project/preocr)
11
11
  [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
12
12
 
13
- *Save time and money by skipping OCR for files that are already machine-readable*
13
+ *2–10× faster than alternatives • 100% accuracy on benchmark • CPU-only, no GPU required*
14
14
 
15
- **🌐 Website**: [preocr.io](https://preocr.io) • **[Installation](#-installation)** • **[Quick Start](#-quick-start)** • **[Documentation](#-api-reference)** • **[Examples](#-usage-examples)** • **[Benchmarks](#-performance)**
15
+ **🌐 [preocr.io](https://preocr.io)** • [Installation](#-installation) • [Quick Start](#-quick-start) • [API Reference](#-api-reference) • [Examples](#-usage-examples) • [Performance](#-performance)
16
16
 
17
17
  </div>
18
18
 
19
19
  ---
20
20
 
21
- ## 🎯 What is PreOCR?
21
+ ### TL;DR
22
22
 
23
- **PreOCR** is a Python library for **OCR detection** and **document extraction** that intelligently determines whether files need OCR processing before expensive operations. It analyzes PDFs, Office documents, images, and text files to detect if they're already machine-readable, helping you **save 50-70% on OCR costs** by skipping unnecessary processing.
23
+ | Metric | Result |
24
+ |--------|--------|
25
+ | **Accuracy** | 100% (TP=1, FP=0, TN=9, FN=0) |
26
+ | **Latency** | ~2.7s mean, ~1.9s median (≤1MB PDFs) |
27
+ | **Office docs** | ~7ms |
28
+ | **Focus** | Zero false positives. Zero missed scans. |
24
29
 
25
- **🌐 Learn more at [preocr.io](https://preocr.io)**
30
+ ---
31
+
32
+ ## What is PreOCR? Python OCR Detection & Document Processing
33
+
34
+ **PreOCR** is an open-source **Python OCR detection library** that determines whether documents need OCR before you run expensive processing. It analyzes **PDFs**, **Office documents** (DOCX, PPTX, XLSX), **images**, and text files to detect if they're already machine-readable—helping you **skip OCR** for 50–70% of documents and cut costs.
35
+
36
+ Use PreOCR to filter documents before Tesseract, AWS Textract, Google Vision, Azure Document Intelligence, or MinerU. Works offline, CPU-only, with 100% accuracy on validation benchmarks.
37
+
38
+ **🌐 [preocr.io](https://preocr.io)**
26
39
 
27
40
  ### Key Benefits
28
41
 
29
- - ⚡ **Fast**: CPU-only processing, typically < 1 second per file
30
- - 🎯 **Accurate**: 92-95% accuracy (100% on recent validation dataset)
31
- - 💰 **Cost-Effective**: Skip OCR for 50-70% of documents
32
- - 📊 **Structured Extraction**: Extract tables, forms, images, and semantic data
42
+ - ⚡ **Fast**: CPU-only, typically < 1 second per file—no GPU needed
43
+ - 🎯 **Accurate**: 92–95% accuracy (100% on validation benchmark)
44
+ - 💰 **Cost-Effective**: Skip OCR for 50–70% of documents
45
+ - 📊 **Structured Extraction**: Tables, forms, images, semantic data—Pydantic models, JSON, or Markdown
33
46
  - 🔒 **Type-Safe**: Full Pydantic models with IDE autocomplete
34
- - 🚀 **Production-Ready**: Battle-tested with comprehensive error handling
47
+ - 🚀 **Offline & Production-Ready**: No API keys; battle-tested error handling
48
+
49
+ ### Use Cases: When to Use PreOCR
50
+
51
+ - **Document pipelines**: Filter PDFs before OCR (Tesseract, AWS Textract, Google Vision)
52
+ - **RAG / LLM ingestion**: Decide which documents need OCR vs. native text extraction
53
+ - **Batch processing**: Process thousands of PDFs with page-level OCR decisions
54
+ - **Cost optimization**: Reduce cloud OCR API costs by skipping digital documents
55
+ - **Medical / legal**: Intent-aware planner for prescriptions, discharge summaries, lab reports
35
56
 
36
57
  ---
37
58
 
38
- ## Quick Comparison
59
+ ## Quick Comparison: PreOCR vs. Alternatives
39
60
 
40
61
  | Feature | PreOCR 🏆 | Unstructured.io | Docugami |
41
62
  |---------|-----------|-----------------|----------|
@@ -110,6 +131,17 @@ results.print_summary()
110
131
  - **Page-Level Granularity**: Analyze PDFs page-by-page for precise detection
111
132
  - **Confidence Scores**: Per-decision confidence with reason codes
112
133
  - **Hybrid Pipeline**: Fast heuristics + OpenCV refinement for edge cases
134
+ - **OpenCV Skip Heuristics**: Skips OpenCV for clearly digital documents (file size, page count, text coverage) to improve performance
135
+ - **Digital/Table Bias**: Reduces false positives on high-text PDFs (product manuals, marketing docs) via configurable rules
136
+
137
+ ### Intent-Aware OCR Planner (`plan_ocr_for_document`)
138
+
139
+ - **Medical Domain**: Terminal overrides for prescriptions, diagnosis, discharge summaries, lab reports
140
+ - **Weighted Scoring**: Configurable threshold with safety/balanced/cost modes
141
+ - **Explainability**: Per-page score breakdown (intent, image_dominance, text_weakness)
142
+ - **Evaluation**: Threshold sweep and confusion matrix for calibration
143
+
144
+ See [docs/OCR_DECISION_MODEL.md](docs/OCR_DECISION_MODEL.md) for the full specification.
113
145
 
114
146
  ### Document Extraction (`extract_native_data`)
115
147
 
@@ -175,6 +207,18 @@ print(f"Confidence: {result['confidence']:.2f}")
175
207
  print(f"Reason: {result['reason']}")
176
208
  ```
177
209
 
210
+ #### Intent-Aware Planner (Medical/Domain-Specific)
211
+
212
+ ```python
213
+ from preocr import plan_ocr_for_document
214
+
215
+ result = plan_ocr_for_document("hospital_discharge.pdf")
216
+ print(f"Needs OCR (any page): {result['needs_ocr_any']}")
217
+ for page in result["pages"]:
218
+     print(f" Page {page['page_number']}: needs_ocr={page['needs_ocr']} "
219
+           f"type={page['decision_type']} score={page['debug']['score']:.2f}")
220
+ ```
221
+
178
222
  #### Layout-Aware Detection
179
223
 
180
224
  ```python
@@ -305,8 +349,6 @@ PreOCR supports **20+ file formats** for OCR detection and extraction:
305
349
  | **Text** | ✅ Yes | ✅ Yes | TXT, CSV, HTML |
306
350
  | **Structured** | ✅ Yes | ✅ Yes | JSON, XML |
307
351
 
308
- See [Supported Formats](SUPPORTED_FORMATS.md) for complete list.
309
-
310
352
  ---
311
353
 
312
354
  ## ⚙️ Configuration
@@ -330,6 +372,10 @@ result = needs_ocr("document.pdf", config=config)
330
372
  - `min_text_length`: Minimum text length (default: 50)
331
373
  - `min_office_text_length`: Minimum office text length (default: 100)
332
374
  - `layout_refinement_threshold`: OpenCV trigger threshold (default: 0.9)
375
+ - `skip_opencv_if_file_size_mb`: Skip OpenCV when file size ≥ N MB (default: None)
376
+ - `skip_opencv_if_page_count`: Skip OpenCV when page count ≥ N (default: None)
377
+ - `digital_bias_text_coverage_min`: Force no-OCR when text_coverage ≥ this and image_coverage is low (default: 65)
378
+ - `table_bias_text_density_min`: For mixed layout, treat as digital when text_density ≥ this (default: 1.5)
333
379
 
334
380
  ---
335
381
 
@@ -368,15 +414,34 @@ if result["reason_code"] == "PDF_MIXED":
368
414
  |----------|------|----------|
369
415
  | Fast Path (Heuristics) | < 150ms | ~99% |
370
416
  | OpenCV Refinement | 150-300ms | 92-96% |
371
- | **Average** | **120-180ms** | **94-97%** |
417
+ | **Typical (single file)** | **< 1 second** | **94-97%** |
418
+
419
+ *Typical: most PDFs finish in under 1 second. Heuristics-only files: 120–180ms avg. Large or mixed documents may take 1–3s with OpenCV.*
420
+
421
+ ### Benchmark Results (≤1MB Dataset)
422
+
423
+ <p align="center">
424
+ <img src="docs/benchmarks/avg-time-by-type.png" alt="Average processing time by file type" width="500">
425
+ <br><em>Average Processing Time by File Type</em>
426
+ </p>
427
+
428
+ <p align="center">
429
+ <img src="docs/benchmarks/latency-summary.png" alt="Latency summary for PDFs" width="500">
430
+ <br><em>Latency Summary (Mean, Median, P95)</em>
431
+ </p>
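The mean, median, and P95 shown in the latency chart can be computed from raw per-file timings with the standard library. This is a sketch under stated assumptions: the function name `latency_summary` and the sample timings are invented for illustration, and P95 uses the nearest-rank method, which may differ from whatever the benchmark scripts use.

```python
import statistics
from typing import Dict, Sequence


def latency_summary(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summarise latencies (in milliseconds) as mean, median, and nearest-rank P95."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Nearest-rank percentile: smallest rank covering at least 95% of samples.
    # Integer arithmetic avoids floating-point rounding in ceil(0.95 * n).
    rank = (95 * len(ordered) + 99) // 100
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
    }


# Made-up timings: nine fast heuristic-only files and one slow OpenCV outlier.
timings_ms = [100, 120, 140, 160, 180, 200, 220, 240, 260, 1000]
summary = latency_summary(timings_ms)
print(summary)
```

A single slow file pulls the mean well above the median, which is why the three figures are worth reporting together.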
372
432
 
373
433
  ### Accuracy Metrics
374
434
 
375
- - **Overall Accuracy**: 92-95% (100% on recent validation)
435
+ - **Overall Accuracy**: 92-95% (100% on validation benchmark)
376
436
  - **Precision**: 100% (all flagged files actually need OCR)
377
437
  - **Recall**: 100% (all OCR-needed files detected)
378
438
  - **F1-Score**: 100%
379
439
 
440
+ <p align="center">
441
+ <img src="docs/benchmarks/confusion-matrix.png" alt="Confusion matrix - 100% accuracy" width="500">
442
+ <br><em>Confusion Matrix (TP:1, FP:0, TN:9, FN:0)</em>
443
+ </p>
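The accuracy figures follow directly from the confusion matrix counts in the caption (TP:1, FP:0, TN:9, FN:0). The sketch below recomputes them with the standard definitions; the function name `classification_metrics` is illustrative.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from confusion matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts from the confusion matrix caption: 10 PDFs, one of which needs OCR.
metrics = classification_metrics(tp=1, fp=0, tn=9, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
```

Note that precision and recall here rest on a single OCR-needed file, so one misclassified positive would move both from 100% to 0%.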
444
+
380
445
  ### Performance Factors
381
446
 
382
447
  - **File size**: Larger files take longer
@@ -481,12 +546,14 @@ Batch processor for multiple files with parallel processing.
481
546
  ### When to Choose PreOCR
482
547
 
483
548
  ✅ **Choose PreOCR when:**
484
- - You need **speed** (< 1 second processing)
485
- - You want **cost optimization** (skip OCR for 50-70% of documents)
486
- - You need **page-level granularity**
487
- - You want **type safety** (Pydantic models)
488
- - You're building **LLM/RAG pipelines**
489
- - You need **edge deployment** (CPU-only)
549
+ - You're building **document ingestion pipelines** or **RAG/LLM systems**—decide which files need OCR vs. native extraction
550
+ - You need **speed** (< 1 second per file) and **cost optimization** (skip OCR for 50–70% of documents)
551
+ - You want **page-level granularity** (which pages need OCR in mixed PDFs)
552
+ - You prefer **type safety** (Pydantic models) and **edge deployment** (CPU-only, no GPU)
553
+
554
+ ### Switched from Unstructured.io or another library?
555
+
556
+ PreOCR focuses on **OCR routing**—it doesn't perform extraction by default. Use it as a pre-filter: call `needs_ocr()` first, then route to your OCR engine or to `extract_native_data()` for digital documents. The API is simple: `needs_ocr(path)`, `extract_native_data(path)`, `BatchProcessor`.
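The pre-filter pattern described here can be sketched as a small routing function. It takes the detector, the native extractor, and the OCR engine as callables so the example runs without PreOCR installed. It assumes the detection result is a dict with a boolean `needs_ocr` key; `route_document` and the stub functions are hypothetical names used only for illustration.

```python
from typing import Any, Callable, Dict


def route_document(
    path: str,
    detect: Callable[[str], Dict[str, Any]],
    extract_native: Callable[[str], Any],
    run_ocr: Callable[[str], Any],
) -> Dict[str, Any]:
    """Run detection first, then send the file down exactly one of two paths."""
    decision = detect(path)
    # Assumption: the detector returns a dict exposing a boolean "needs_ocr".
    if decision["needs_ocr"]:
        return {"path": path, "route": "ocr", "output": run_ocr(path)}
    return {"path": path, "route": "native", "output": extract_native(path)}


# Stand-ins so the sketch is runnable on its own; replace with real callables.
def fake_detect(path: str) -> Dict[str, Any]:
    return {"needs_ocr": path.startswith("scan_")}


def fake_extract_native(path: str) -> str:
    return f"native text from {path}"


def fake_run_ocr(path: str) -> str:
    return f"OCR text from {path}"


for name in ["scan_invoice.pdf", "report.pdf"]:
    routed = route_document(name, fake_detect, fake_extract_native, fake_run_ocr)
    print(routed["route"], "->", routed["output"])
```

In a real pipeline, `needs_ocr` and `extract_native_data` would be passed as the first two callables, and `run_ocr` would wrap whichever OCR engine you use.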
490
557
 
491
558
  ---
492
559
 
@@ -513,22 +580,25 @@ Batch processor for multiple files with parallel processing.
513
580
 
514
581
  ---
515
582
 
516
- ## Frequently Asked Questions
583
+ ## Frequently Asked Questions (FAQ)
517
584
 
518
- **Q: Does PreOCR perform OCR?**
519
- A: No, PreOCR never performs OCR. It only analyzes files to determine if OCR is needed.
585
+ **Does PreOCR perform OCR?**
586
+ No. PreOCR is an **OCR detection** library—it analyzes files to determine if OCR is needed. It does not run OCR itself. Use it to decide whether to call Tesseract, Textract, or another OCR engine.
520
587
 
521
- **Q: How accurate is PreOCR?**
522
- A: PreOCR achieves 92-95% accuracy with the hybrid pipeline. Recent validation on 27 files achieved 100% accuracy.
588
+ **How accurate is PreOCR for PDF OCR detection?**
589
+ PreOCR achieves 92–95% accuracy with the hybrid pipeline. Validation on benchmark datasets reached 100% accuracy (10/10 PDFs correct).
523
590
 
524
- **Q: Can I use PreOCR with cloud OCR services?**
525
- A: Yes! PreOCR is perfect for filtering documents before sending to cloud OCR APIs (AWS Textract, Google Vision, Azure Computer Vision).
591
+ **Can I use PreOCR with AWS Textract, Google Vision, or Azure Document Intelligence?**
592
+ Yes. PreOCR is ideal for filtering documents before sending them to cloud OCR APIs. Skip OCR for digital PDFs to reduce API costs.
526
593
 
527
- **Q: Does PreOCR work offline?**
528
- A: Yes! PreOCR is CPU-only and works completely offline.
594
+ **Does PreOCR work offline?**
595
+ Yes. PreOCR is CPU-only and runs fully offline—no API keys or internet required.
529
596
 
530
- **Q: Can I customize decision thresholds?**
531
- A: Yes! Use the `Config` class or pass threshold parameters to `BatchProcessor`.
597
+ **How do I customize OCR detection thresholds?**
598
+ Use the `Config` class or pass threshold parameters to `BatchProcessor`. See [Configuration](#-configuration).
599
+
600
+ **Is there an HTTP/REST API?**
601
+ PreOCR is a Python library. For HTTP APIs, wrap it in FastAPI or Flask—see [preocr.io](https://preocr.io) for hosted options.
532
602
 
533
603
  ---
534
604
 
@@ -545,6 +615,10 @@ pip install -e ".[dev]"
545
615
  # Run tests
546
616
  pytest
547
617
 
618
+ # Run benchmarks (add PDFs to datasets/ for testing)
619
+ python scripts/benchmark_accuracy.py datasets -g scripts/ground_truth_data_source_formats.json --layout-aware --page-level
620
+ python scripts/benchmark_planner.py datasets
621
+
548
622
  # Run linting
549
623
  ruff check preocr/
550
624
  black --check preocr/
@@ -558,20 +632,20 @@ See [CHANGELOG.md](docs/CHANGELOG.md) for complete version history.
558
632
 
559
633
  ### Recent Updates
560
634
 
561
- **v1.1.0** - Invoice Intelligence & Advanced Extraction (Latest)
562
- - ✅ **Semantic Deduplication**: Intelligent line item deduplication for invoices
563
- - ✅ **Invoice Intelligence**: Semantic extraction with finance validation
564
- - ✅ **Text Merging**: Geometry-aware character-to-word merging improvements
565
- - ✅ **Table Stitching**: Merges fragmented tables across pages
566
- - ✅ **Finance Validation**: Validates invoice totals (subtotal + tax = total)
567
- - ✅ **Reversed Text Detection**: Detects and corrects rotated/mirrored text
568
- - ✅ **Footer Exclusion**: Removes footer from reading order
635
+ **v2.0.0** - Accuracy & Performance (Latest)
636
+ - ✅ **100% Accuracy**: Fixed false positives on digital PDFs; benchmark validation at 100%
637
+ - ✅ **OpenCV Skip Heuristics**: Skip OpenCV for clearly digital documents (configurable by file size, page count)
638
+ - ✅ **Digital/Table Bias Rules**: New config options to reduce false positives on product manuals, marketing PDFs
639
+ - ✅ **Unified Datasets**: Consolidated `benchmarkdata` and `data-source-formats` into `datasets/` directory
640
+ - ✅ **Page Count in Signals**: PDF analysis includes page count for smarter heuristics
641
+
642
+ **v1.1.0** - Invoice Intelligence & Advanced Extraction
643
+ - ✅ Semantic deduplication, invoice intelligence, text merging
644
+ - ✅ Table stitching, finance validation, reversed text detection
569
645
 
570
646
  **v1.0.0** - Structured Data Extraction
571
- - ✅ Comprehensive extraction system for PDFs, Office docs, and text files
572
- - ✅ Element classification (11+ types)
573
- - ✅ Table, form, and image extraction
574
- - ✅ Multiple output formats (Pydantic, JSON, Markdown)
647
+ - ✅ Comprehensive extraction for PDFs, Office docs, text files
648
+ - ✅ Element classification, table/form/image extraction
575
649
 
576
650
  ---
577
651
 
@@ -587,19 +661,19 @@ Apache License 2.0 - see [LICENSE](LICENSE) for details.
587
661
 
588
662
  ---
589
663
 
590
- ## 🔗 Links
664
+ ## Links & Resources
591
665
 
592
- - **🌐 Website**: [preocr.io](https://preocr.io)
593
- - **GitHub**: [https://github.com/yuvaraj3855/preocr](https://github.com/yuvaraj3855/preocr)
594
- - **PyPI**: [https://pypi.org/project/preocr](https://pypi.org/project/preocr)
595
- - **Issues**: [https://github.com/yuvaraj3855/preocr/issues](https://github.com/yuvaraj3855/preocr/issues)
666
+ - **Website**: [preocr.io](https://preocr.io) – Python OCR detection and document processing
667
+ - **PyPI**: [pypi.org/project/preocr](https://pypi.org/project/preocr) – Install with `pip install preocr`
668
+ - **GitHub**: [github.com/yuvaraj3855/preocr](https://github.com/yuvaraj3855/preocr) – Source code and issues
669
+ - **Documentation**: [CHANGELOG](docs/CHANGELOG.md) • [OCR Decision Model](docs/OCR_DECISION_MODEL.md) • [Contributing](docs/CONTRIBUTING.md)
596
670
 
597
671
  ---
598
672
 
599
673
  <div align="center">
600
674
 
601
- **Made with ❤️ for efficient document processing**
675
+ **PreOCR: Python OCR detection library. Skip OCR for digital PDFs. Save time and money.**
602
676
 
603
- [🌐 Website](https://preocr.io) | [⭐ Star on GitHub](https://github.com/yuvaraj3855/preocr) | [📖 Documentation](https://github.com/yuvaraj3855/preocr#readme) | [🐛 Report Issue](https://github.com/yuvaraj3855/preocr/issues)
677
+ [Website](https://preocr.io) · [GitHub](https://github.com/yuvaraj3855/preocr) · [PyPI](https://pypi.org/project/preocr) · [Report Issue](https://github.com/yuvaraj3855/preocr/issues)
604
678
 
605
679
  </div>