natural-pdf 0.1.8__py3-none-any.whl → 0.1.9__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- natural_pdf/__init__.py +1 -0
- natural_pdf/analyzers/layout/base.py +1 -5
- natural_pdf/analyzers/layout/gemini.py +61 -51
- natural_pdf/analyzers/layout/layout_analyzer.py +40 -11
- natural_pdf/analyzers/layout/layout_manager.py +26 -84
- natural_pdf/analyzers/layout/layout_options.py +7 -0
- natural_pdf/analyzers/layout/pdfplumber_table_finder.py +142 -0
- natural_pdf/analyzers/layout/surya.py +46 -123
- natural_pdf/analyzers/layout/tatr.py +51 -4
- natural_pdf/analyzers/text_structure.py +3 -5
- natural_pdf/analyzers/utils.py +3 -3
- natural_pdf/classification/manager.py +230 -151
- natural_pdf/classification/mixin.py +49 -35
- natural_pdf/classification/results.py +64 -46
- natural_pdf/collections/mixins.py +68 -20
- natural_pdf/collections/pdf_collection.py +177 -64
- natural_pdf/core/element_manager.py +30 -14
- natural_pdf/core/highlighting_service.py +13 -22
- natural_pdf/core/page.py +423 -101
- natural_pdf/core/pdf.py +633 -190
- natural_pdf/elements/base.py +134 -40
- natural_pdf/elements/collections.py +503 -131
- natural_pdf/elements/region.py +659 -90
- natural_pdf/elements/text.py +1 -1
- natural_pdf/export/mixin.py +137 -0
- natural_pdf/exporters/base.py +3 -3
- natural_pdf/exporters/paddleocr.py +4 -3
- natural_pdf/extraction/manager.py +50 -49
- natural_pdf/extraction/mixin.py +90 -57
- natural_pdf/extraction/result.py +9 -23
- natural_pdf/ocr/__init__.py +5 -5
- natural_pdf/ocr/engine_doctr.py +346 -0
- natural_pdf/ocr/ocr_factory.py +24 -4
- natural_pdf/ocr/ocr_manager.py +61 -25
- natural_pdf/ocr/ocr_options.py +70 -10
- natural_pdf/ocr/utils.py +6 -4
- natural_pdf/search/__init__.py +20 -34
- natural_pdf/search/haystack_search_service.py +309 -265
- natural_pdf/search/haystack_utils.py +99 -75
- natural_pdf/search/search_service_protocol.py +11 -12
- natural_pdf/selectors/parser.py +219 -143
- natural_pdf/utils/debug.py +3 -3
- natural_pdf/utils/identifiers.py +1 -1
- natural_pdf/utils/locks.py +1 -1
- natural_pdf/utils/packaging.py +8 -6
- natural_pdf/utils/text_extraction.py +24 -16
- natural_pdf/utils/tqdm_utils.py +18 -10
- natural_pdf/utils/visualization.py +18 -0
- natural_pdf/widgets/viewer.py +4 -25
- {natural_pdf-0.1.8.dist-info → natural_pdf-0.1.9.dist-info}/METADATA +12 -3
- natural_pdf-0.1.9.dist-info/RECORD +80 -0
- {natural_pdf-0.1.8.dist-info → natural_pdf-0.1.9.dist-info}/WHEEL +1 -1
- {natural_pdf-0.1.8.dist-info → natural_pdf-0.1.9.dist-info}/top_level.txt +0 -2
- docs/api/index.md +0 -386
- docs/assets/favicon.png +0 -3
- docs/assets/favicon.svg +0 -3
- docs/assets/javascripts/custom.js +0 -17
- docs/assets/logo.svg +0 -3
- docs/assets/sample-screen.png +0 -0
- docs/assets/social-preview.png +0 -17
- docs/assets/social-preview.svg +0 -17
- docs/assets/stylesheets/custom.css +0 -65
- docs/categorizing-documents/index.md +0 -168
- docs/data-extraction/index.md +0 -87
- docs/document-qa/index.ipynb +0 -435
- docs/document-qa/index.md +0 -79
- docs/element-selection/index.ipynb +0 -969
- docs/element-selection/index.md +0 -249
- docs/finetuning/index.md +0 -176
- docs/index.md +0 -189
- docs/installation/index.md +0 -69
- docs/interactive-widget/index.ipynb +0 -962
- docs/interactive-widget/index.md +0 -12
- docs/layout-analysis/index.ipynb +0 -818
- docs/layout-analysis/index.md +0 -185
- docs/ocr/index.md +0 -256
- docs/pdf-navigation/index.ipynb +0 -314
- docs/pdf-navigation/index.md +0 -97
- docs/regions/index.ipynb +0 -816
- docs/regions/index.md +0 -294
- docs/tables/index.ipynb +0 -658
- docs/tables/index.md +0 -144
- docs/text-analysis/index.ipynb +0 -370
- docs/text-analysis/index.md +0 -105
- docs/text-extraction/index.ipynb +0 -1478
- docs/text-extraction/index.md +0 -292
- docs/tutorials/01-loading-and-extraction.ipynb +0 -1873
- docs/tutorials/01-loading-and-extraction.md +0 -95
- docs/tutorials/02-finding-elements.ipynb +0 -417
- docs/tutorials/02-finding-elements.md +0 -149
- docs/tutorials/03-extracting-blocks.ipynb +0 -152
- docs/tutorials/03-extracting-blocks.md +0 -48
- docs/tutorials/04-table-extraction.ipynb +0 -119
- docs/tutorials/04-table-extraction.md +0 -50
- docs/tutorials/05-excluding-content.ipynb +0 -275
- docs/tutorials/05-excluding-content.md +0 -109
- docs/tutorials/06-document-qa.ipynb +0 -337
- docs/tutorials/06-document-qa.md +0 -91
- docs/tutorials/07-layout-analysis.ipynb +0 -293
- docs/tutorials/07-layout-analysis.md +0 -66
- docs/tutorials/07-working-with-regions.ipynb +0 -414
- docs/tutorials/07-working-with-regions.md +0 -151
- docs/tutorials/08-spatial-navigation.ipynb +0 -513
- docs/tutorials/08-spatial-navigation.md +0 -190
- docs/tutorials/09-section-extraction.ipynb +0 -2439
- docs/tutorials/09-section-extraction.md +0 -256
- docs/tutorials/10-form-field-extraction.ipynb +0 -517
- docs/tutorials/10-form-field-extraction.md +0 -201
- docs/tutorials/11-enhanced-table-processing.ipynb +0 -59
- docs/tutorials/11-enhanced-table-processing.md +0 -9
- docs/tutorials/12-ocr-integration.ipynb +0 -3712
- docs/tutorials/12-ocr-integration.md +0 -137
- docs/tutorials/13-semantic-search.ipynb +0 -1718
- docs/tutorials/13-semantic-search.md +0 -77
- docs/visual-debugging/index.ipynb +0 -2970
- docs/visual-debugging/index.md +0 -157
- docs/visual-debugging/region.png +0 -0
- natural_pdf/templates/finetune/fine_tune_paddleocr.md +0 -420
- natural_pdf/templates/spa/css/style.css +0 -334
- natural_pdf/templates/spa/index.html +0 -31
- natural_pdf/templates/spa/js/app.js +0 -472
- natural_pdf/templates/spa/words.txt +0 -235976
- natural_pdf/widgets/frontend/viewer.js +0 -88
- natural_pdf-0.1.8.dist-info/RECORD +0 -156
- notebooks/Examples.ipynb +0 -1293
- pdfs/.gitkeep +0 -0
- pdfs/01-practice.pdf +0 -543
- pdfs/0500000US42001.pdf +0 -0
- pdfs/0500000US42007.pdf +0 -0
- pdfs/2014 Statistics.pdf +0 -0
- pdfs/2019 Statistics.pdf +0 -0
- pdfs/Atlanta_Public_Schools_GA_sample.pdf +0 -0
- pdfs/needs-ocr.pdf +0 -0
- {natural_pdf-0.1.8.dist-info → natural_pdf-0.1.9.dist-info}/licenses/LICENSE +0 -0
docs/layout-analysis/index.md
DELETED
@@ -1,185 +0,0 @@
# Document Layout Analysis

Natural PDF can automatically detect the structure of a document (titles, paragraphs, tables, figures) using layout analysis models. This guide shows how to use this feature.

## Setup

We'll use a sample PDF that includes various layout elements.

```python
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]

page.to_image(width=700)
```

## Running Basic Layout Analysis

Use the `analyze_layout()` method. By default, it uses the YOLO model.

```python
# Analyze the layout using the default engine (YOLO)
# This adds 'region' elements to the page
page.analyze_layout()
```

```python
# Find all detected regions
regions = page.find_all('region')
len(regions)  # Show how many regions were detected
```

```python
first_region = regions[0]
f"First region: type='{first_region.type}', confidence={first_region.confidence:.2f}"
```

## Visualizing Detected Layout

Use `highlight()` or `show()` on the detected regions.

```python
# Highlight all detected regions, colored by type
regions.highlight(group_by='type')
page.to_image(width=700)
```

## Finding Specific Region Types

Use attribute selectors to find regions of a specific type.

```python
# Find all detected titles
titles = page.find_all('region[type=title]')
titles
```

```python
titles.show()
```

```python
page.find_all('region[type=table]').show()
```

## Working with Layout Regions

Detected regions behave like any other `Region` object: you can extract text from them, find elements within them, and so on.

```python
page.find('region[type=table]').extract_text(layout=True)
```

## Using Different Layout Models

Natural PDF supports multiple engines (`yolo`, `paddle`, `tatr`, `docling`, `surya`). Specify the engine when calling `analyze_layout()`.

*Note: Using different engines requires installing the corresponding extras (e.g., `natural-pdf[layout_paddle]`).* `yolo` is the default.

```python
page.clear_detected_layout_regions()
page.clear_highlights()

page.analyze_layout(engine="paddle")
page.find_all('region[model=paddle]').highlight(group_by='region_type')
page.to_image(width=700)
```

```python
# Analyze using Table Transformer (TATR) - specialized for tables
page.clear_detected_layout_regions()
page.clear_highlights()

page.analyze_layout(engine="tatr")
page.find_all('region[model=tatr]').highlight(group_by='region_type')
page.to_image(width=700)
```

```python
# Analyze using Docling
page.clear_detected_layout_regions()
page.clear_highlights()

page.analyze_layout(engine="docling")
page.find_all('region[model=docling]').highlight(group_by='region_type')
page.to_image(width=700)
```

```python
# Analyze using Surya
page.clear_detected_layout_regions()
page.clear_highlights()

page.analyze_layout(engine="surya")
page.find_all('region[model=surya]').highlight(group_by='region_type')
page.to_image(width=700)
```

*Note: Calling `analyze_layout` multiple times (even with the same engine) can add duplicate regions. You might want to use `page.clear_detected_layout_regions()` first, or filter by model using `region[model=yolo]`.*

## Controlling Confidence Threshold

Filter detections by their confidence score.

```python
# Re-run YOLO analysis (clearing previous results first is good practice)
page.clear_detected_layout_regions()
page.analyze_layout(engine="yolo")

# Find only high-confidence regions (e.g., >= 0.8)
high_conf_regions = page.find_all('region[confidence>=0.8]')
len(high_conf_regions)
```
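The selector string does the filtering for you, but the same threshold logic is easy to express in plain Python once you hold a list of detections. A minimal sketch, using stand-in dicts shaped like the `type` and `confidence` attributes the selectors above filter on (the dict shape and helper name are illustrative, not part of natural-pdf):

```python
from collections import Counter

def summarize_regions(regions, min_confidence=0.8):
    """Count detected region types, keeping only detections at or above the threshold."""
    kept = [r for r in regions if r["confidence"] >= min_confidence]
    return Counter(r["type"] for r in kept)

# Stand-in detections shaped like the attributes layout analysis produces
detections = [
    {"type": "title", "confidence": 0.95},
    {"type": "table", "confidence": 0.88},
    {"type": "plain text", "confidence": 0.42},
]
print(summarize_regions(detections))  # Counter({'title': 1, 'table': 1})
```

Lowering `min_confidence` trades precision for recall, which is the same trade-off the `region[confidence>=...]` selector makes.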
## Table Structure with TATR

The TATR engine provides detailed table structure elements (`table`, `table-row`, `table-column`, `table-column-header`). This is very useful for precise table extraction.

```python
# Ensure TATR analysis has been run
page.clear_detected_layout_regions()
page.clear_highlights()

page.analyze_layout(engine="tatr")
page.find_all('region[model=tatr]').highlight(group_by='region_type')
page.to_image(width=700)
```

```python
# Find different structural elements from TATR
tables = page.find_all('region[type=table][model=tatr]')
rows = page.find_all('region[type=table-row][model=tatr]')
cols = page.find_all('region[type=table-column][model=tatr]')
hdrs = page.find_all('region[type=table-column-header][model=tatr]')

f"Found: {len(tables)} tables, {len(rows)} rows, {len(cols)} columns, {len(hdrs)} headers (from TATR)"
```

### Enhanced Table Extraction with TATR

When a `region[type=table]` comes from the TATR model, `extract_table()` can use the underlying row/column structure for more robust extraction.

```python
# Find the TATR table region again
tatr_table = page.find('region[type=table][model=tatr]')

# This extraction uses the detected rows/columns
tatr_table.extract_table()
```

If you'd like the standard pdfplumber approach instead of the "intelligent" one, you can ask for it explicitly:

```python
# This extraction uses pdfplumber's algorithm instead of the detected rows/columns
tatr_table.extract_table(method='pdfplumber')
```
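Either way, the extracted table comes back as plain rows, and reshaping them into records is ordinary Python. A small sketch assuming the result is a list of row lists with the first row acting as the header (that convention, and the helper name, are illustrative assumptions, not part of natural-pdf):

```python
def rows_to_records(rows):
    """Convert a header-plus-rows table into a list of dicts keyed by column name."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]

# A stand-in table in the header-plus-rows shape
table = [
    ["Name", "Amount"],
    ["Alice", "120"],
    ["Bob", "85"],
]
print(rows_to_records(table))
# [{'Name': 'Alice', 'Amount': '120'}, {'Name': 'Bob', 'Amount': '85'}]
```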
## Next Steps

Layout analysis provides regions that you can use for:

- [Table Extraction](../tables/index.ipynb): Especially powerful with TATR regions.
- [Text Extraction](../text-extraction/index.ipynb): Extract text only from specific region types (e.g., paragraphs).
- [Document QA](../document-qa/index.ipynb): Focus question answering on specific detected regions.
docs/ocr/index.md
DELETED
@@ -1,256 +0,0 @@
# OCR Integration

Natural PDF includes OCR (Optical Character Recognition) to extract text from scanned documents or images embedded in PDFs.

## OCR Engine Comparison

Natural PDF supports multiple OCR engines:

| Feature | EasyOCR | PaddleOCR | Surya OCR | Gemini (Layout + potential OCR) |
|----------------------|------------------------------------|------------------------------------------|---------------------------------------|--------------------------------------|
| **Installation** | `natural-pdf[easyocr]` | `natural-pdf[paddle]` | `natural-pdf[surya]` | `natural-pdf[gemini]` |
| **Primary Strength** | Good general performance, simpler | Excellent Asian language support, speed | High accuracy, multilingual lines | Advanced layout analysis (via API) |
| **Speed** | Moderate | Fast | Moderate (GPU recommended) | API latency |
| **Memory Usage** | Higher | Efficient | Higher (GPU recommended) | N/A (API) |
| **Paragraph Detect** | Yes (via option) | No | No (focuses on lines) | Yes (layout model) |
| **Handwritten** | Better support | Limited | Limited | Potentially (API model dependent) |
| **Small Text** | Moderate | Good | Good | Potentially (API model dependent) |
| **When to Use** | General documents, handwritten text | Asian languages, speed-critical tasks | Highest accuracy needed, line-level | Complex layouts, API integration |

## Basic OCR Usage

Apply OCR directly to a page or region:

```python
from natural_pdf import PDF

# Assume 'page' is a Page object from a PDF
page = pdf.pages[0]

# Apply OCR using the default engine (or specify one)
ocr_elements = page.apply_ocr(languages=['en'])

# Extract text (will use the results from apply_ocr if run previously)
text = page.extract_text()
print(text)
```

## Configuring OCR

Specify the engine and basic options directly:

```python
# Use PaddleOCR for Chinese and English
ocr_elements = page.apply_ocr(engine='paddle', languages=['zh-cn', 'en'])

# Use EasyOCR with a lower confidence threshold
ocr_elements = page.apply_ocr(engine='easyocr', languages=['en'], min_confidence=0.3)
```

For advanced, engine-specific settings, use the Options classes:

```python
from natural_pdf.ocr import PaddleOCROptions, EasyOCROptions, SuryaOCROptions
from natural_pdf.analyzers.layout import GeminiOptions  # Note: Gemini is primarily a layout engine

# --- Configure PaddleOCR ---
paddle_opts = PaddleOCROptions(
    languages=['en', 'zh-cn'],
    use_gpu=True,         # Explicitly enable GPU if available
    use_angle_cls=False,  # Disable text direction classification (if text is upright)
    det_db_thresh=0.25,   # Lower detection threshold (more boxes, potentially noisy)
    rec_batch_num=16,     # Increase recognition batch size for potential speedup on GPU
    # rec_char_dict_path='/path/to/custom_dict.txt'  # Optional: path to a custom character dictionary
    # See PaddleOCROptions documentation or source code for all parameters
)
ocr_elements = page.apply_ocr(engine='paddle', options=paddle_opts)

# --- Configure EasyOCR ---
easy_opts = EasyOCROptions(
    languages=['en', 'fr'],
    gpu=True,            # Explicitly enable GPU if available
    paragraph=True,      # Group results into paragraphs (if structure is clear)
    detail=1,            # Ensure bounding boxes are returned (required)
    text_threshold=0.6,  # Confidence threshold for text detection
    link_threshold=0.4,  # Standard EasyOCR param, passed through to the engine
    low_text=0.4,        # Standard EasyOCR param, passed through to the engine
    batch_size=8,        # Processing batch size (adjust based on memory)
    # See EasyOCROptions documentation or source code for all parameters
)
ocr_elements = page.apply_ocr(engine='easyocr', options=easy_opts)

# --- Configure Surya OCR ---
# Surya focuses on line detection and recognition
surya_opts = SuryaOCROptions(
    languages=['en', 'de'],  # Specify languages for recognition
    min_confidence=0.4,      # Example: adjust minimum confidence for results
    # Core Surya options like device ('cuda' or 'cpu'), batch size, and thresholds
    # are typically set via environment variables such as TORCH_DEVICE.
)
ocr_elements = page.apply_ocr(engine='surya', options=surya_opts)

# --- Configure Gemini (as a layout analyzer, can be used with OCR) ---
# Gemini requires an API key (GOOGLE_API_KEY environment variable)
# Note: Gemini is used via analyze_layout, but its options can influence OCR if used together
gemini_opts = GeminiOptions(
    prompt="Extract text content and identify document elements.",
    # model_name="gemini-1.5-flash-latest"  # Specify a model if needed
    # See GeminiOptions documentation for more parameters
)
# Typically used like this (layout first, then potentially OCR on regions)
layout_elements = page.analyze_layout(engine='gemini', options=gemini_opts)
# If Gemini also performed OCR or you want to OCR layout regions:
# ocr_elements = some_region.apply_ocr(...)

# It can sometimes be used directly if the model supports it, but this is less common:
# try:
#     ocr_elements = page.apply_ocr(engine='gemini', options=gemini_opts)
# except Exception as e:
#     print(f"Gemini might not be configured for direct OCR via apply_ocr: {e}")
```

## Applying OCR Directly

The `page.apply_ocr(...)` and `region.apply_ocr(...)` methods are the primary way to run OCR:

```python
# Apply OCR to a page and get the OCR elements
ocr_elements = page.apply_ocr(engine='easyocr')
print(f"Found {len(ocr_elements)} text elements via OCR")

# Apply OCR to a specific region
title = page.find('text:contains("Title")')
content_region = title.below(height=300)
region_ocr_elements = content_region.apply_ocr(engine='paddle', languages=['en'])

# Note: Re-applying OCR to the same page or region will remove any
# previously generated OCR elements for that area before adding the new ones.
```

## OCR Engines

Choose the engine best suited for your document and language requirements using the `engine` parameter in `apply_ocr`.

## Finding and Working with OCR Text

After applying OCR, work with the text just like regular text:

```python
# Find all OCR text elements
ocr_text = page.find_all('text[source=ocr]')

# Find high-confidence OCR text
high_conf = page.find_all('text[source=ocr][confidence>=0.8]')

# Extract text only from OCR elements
ocr_text_content = page.find_all('text[source=ocr]').extract_text()

# Filter OCR text by content
names = page.find_all('text[source=ocr]:contains("Smith")', case=False)
```

## Visualizing OCR Results

Visualize OCR results to help debug issues:

```python
# Apply OCR
ocr_elements = page.apply_ocr()

# Highlight all OCR elements, colored by confidence
for element in ocr_elements:
    if element.confidence >= 0.8:
        color = "green"   # High confidence
    elif element.confidence >= 0.5:
        color = "yellow"  # Medium confidence
    else:
        color = "red"     # Low confidence

    element.highlight(color=color, label=f"OCR ({element.confidence:.2f})")

# Get the visualization as an image
image = page.to_image(labels=True)
# Just return the image in a Jupyter cell
image

# Highlight only high-confidence elements
high_conf = page.find_all('text[source=ocr][confidence>=0.8]')
high_conf.highlight(color="green", label="High Confidence OCR")
```
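The confidence-to-color branching in the loop can be factored into a small standalone helper so the thresholds live in one place (plain Python, independent of the library; the function name is illustrative):

```python
def confidence_color(confidence):
    """Map an OCR confidence score to a highlight color using the thresholds above."""
    if confidence >= 0.8:
        return "green"   # high confidence
    if confidence >= 0.5:
        return "yellow"  # medium confidence
    return "red"         # low confidence

# The highlighting loop then reduces to a single call per element, e.g.:
# element.highlight(color=confidence_color(element.confidence),
#                   label=f"OCR ({element.confidence:.2f})")
print(confidence_color(0.91))  # green
```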
## Detect + LLM OCR

Sometimes you have a difficult piece of content where you want a local model to detect where the text is, then send each detected piece off to an LLM to be recognized. You can do this with Natural PDF!

```python
from natural_pdf import PDF
from natural_pdf.ocr.utils import direct_ocr_llm
import openai

pdf = PDF("needs-ocr.pdf")
page = pdf.pages[0]

# Detect text regions only (no recognition)
page.apply_ocr('paddle', resolution=120, detect_only=True)

# Build the framework
client = openai.OpenAI(base_url="https://api.anthropic.com/v1/", api_key='sk-XXXXX')
prompt = """OCR this image. Return only the exact text from the image. Include misspellings,
punctuation, etc. Do not surround it with quotation marks. Do not include translations or comments.
The text is from a Greek spreadsheet, so most likely content is Modern Greek or numeric."""

# This returns the cleaned-up text for a single detected region
def correct(region):
    return direct_ocr_llm(region, client, prompt=prompt, resolution=300, model="claude-3-5-haiku-20241022")

# Run 'correct' on each detected text element
page.correct_ocr(correct)

# You're done!
```

## Interactive OCR Correction / Debugging

Natural PDF includes a utility to package a PDF and its detected elements, along with an interactive web application (SPA) for reviewing and correcting OCR results.

1. **Package the data:**
   Use the `create_correction_task_package` function to create a zip file containing the necessary data for the SPA.

    ```python
    from natural_pdf.utils.packaging import create_correction_task_package

    # Assuming 'pdf' is your loaded PDF object after running apply_ocr or analyze_layout
    create_correction_task_package(pdf, "correction_package.zip", overwrite=True)
    ```

2. **Run the SPA:**
   The correction SPA is bundled with the library. Run a simple web server from the directory containing the SPA's files. The exact location depends on your installation, but you can typically find the files within the installed `natural_pdf` package directory under `templates/spa`.

    *Example using Python's built-in server (run from your terminal):*

    ```bash
    # Find the path to the installed natural_pdf package
    # (This command might vary depending on your environment)
    NATURAL_PDF_PATH=$(python -c "import site; print(site.getsitepackages()[0])")/natural_pdf

    # Navigate to the SPA directory
    cd $NATURAL_PDF_PATH/templates/spa

    # Start the web server (e.g., on port 8000)
    python -m http.server 8000
    ```

3. **Use the SPA:**
   Open your web browser to `http://localhost:8000`. Once the SPA loads, drag and drop the `correction_package.zip` file you created into the application to view and edit the OCR results.

## Next Steps

With OCR capabilities, you can explore:

- [Layout Analysis](../layout-analysis/index.ipynb) for automatically detecting document structure
- [Document QA](../document-qa/index.ipynb) for asking questions about your documents
- [Visual Debugging](../visual-debugging/index.ipynb) for visualizing OCR results