natural-pdf 0.1.8__py3-none-any.whl → 0.1.9__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (134)
  1. natural_pdf/__init__.py +1 -0
  2. natural_pdf/analyzers/layout/base.py +1 -5
  3. natural_pdf/analyzers/layout/gemini.py +61 -51
  4. natural_pdf/analyzers/layout/layout_analyzer.py +40 -11
  5. natural_pdf/analyzers/layout/layout_manager.py +26 -84
  6. natural_pdf/analyzers/layout/layout_options.py +7 -0
  7. natural_pdf/analyzers/layout/pdfplumber_table_finder.py +142 -0
  8. natural_pdf/analyzers/layout/surya.py +46 -123
  9. natural_pdf/analyzers/layout/tatr.py +51 -4
  10. natural_pdf/analyzers/text_structure.py +3 -5
  11. natural_pdf/analyzers/utils.py +3 -3
  12. natural_pdf/classification/manager.py +230 -151
  13. natural_pdf/classification/mixin.py +49 -35
  14. natural_pdf/classification/results.py +64 -46
  15. natural_pdf/collections/mixins.py +68 -20
  16. natural_pdf/collections/pdf_collection.py +177 -64
  17. natural_pdf/core/element_manager.py +30 -14
  18. natural_pdf/core/highlighting_service.py +13 -22
  19. natural_pdf/core/page.py +423 -101
  20. natural_pdf/core/pdf.py +633 -190
  21. natural_pdf/elements/base.py +134 -40
  22. natural_pdf/elements/collections.py +503 -131
  23. natural_pdf/elements/region.py +659 -90
  24. natural_pdf/elements/text.py +1 -1
  25. natural_pdf/export/mixin.py +137 -0
  26. natural_pdf/exporters/base.py +3 -3
  27. natural_pdf/exporters/paddleocr.py +4 -3
  28. natural_pdf/extraction/manager.py +50 -49
  29. natural_pdf/extraction/mixin.py +90 -57
  30. natural_pdf/extraction/result.py +9 -23
  31. natural_pdf/ocr/__init__.py +5 -5
  32. natural_pdf/ocr/engine_doctr.py +346 -0
  33. natural_pdf/ocr/ocr_factory.py +24 -4
  34. natural_pdf/ocr/ocr_manager.py +61 -25
  35. natural_pdf/ocr/ocr_options.py +70 -10
  36. natural_pdf/ocr/utils.py +6 -4
  37. natural_pdf/search/__init__.py +20 -34
  38. natural_pdf/search/haystack_search_service.py +309 -265
  39. natural_pdf/search/haystack_utils.py +99 -75
  40. natural_pdf/search/search_service_protocol.py +11 -12
  41. natural_pdf/selectors/parser.py +219 -143
  42. natural_pdf/utils/debug.py +3 -3
  43. natural_pdf/utils/identifiers.py +1 -1
  44. natural_pdf/utils/locks.py +1 -1
  45. natural_pdf/utils/packaging.py +8 -6
  46. natural_pdf/utils/text_extraction.py +24 -16
  47. natural_pdf/utils/tqdm_utils.py +18 -10
  48. natural_pdf/utils/visualization.py +18 -0
  49. natural_pdf/widgets/viewer.py +4 -25
  50. {natural_pdf-0.1.8.dist-info → natural_pdf-0.1.9.dist-info}/METADATA +12 -3
  51. natural_pdf-0.1.9.dist-info/RECORD +80 -0
  52. {natural_pdf-0.1.8.dist-info → natural_pdf-0.1.9.dist-info}/WHEEL +1 -1
  53. {natural_pdf-0.1.8.dist-info → natural_pdf-0.1.9.dist-info}/top_level.txt +0 -2
  54. docs/api/index.md +0 -386
  55. docs/assets/favicon.png +0 -3
  56. docs/assets/favicon.svg +0 -3
  57. docs/assets/javascripts/custom.js +0 -17
  58. docs/assets/logo.svg +0 -3
  59. docs/assets/sample-screen.png +0 -0
  60. docs/assets/social-preview.png +0 -17
  61. docs/assets/social-preview.svg +0 -17
  62. docs/assets/stylesheets/custom.css +0 -65
  63. docs/categorizing-documents/index.md +0 -168
  64. docs/data-extraction/index.md +0 -87
  65. docs/document-qa/index.ipynb +0 -435
  66. docs/document-qa/index.md +0 -79
  67. docs/element-selection/index.ipynb +0 -969
  68. docs/element-selection/index.md +0 -249
  69. docs/finetuning/index.md +0 -176
  70. docs/index.md +0 -189
  71. docs/installation/index.md +0 -69
  72. docs/interactive-widget/index.ipynb +0 -962
  73. docs/interactive-widget/index.md +0 -12
  74. docs/layout-analysis/index.ipynb +0 -818
  75. docs/layout-analysis/index.md +0 -185
  76. docs/ocr/index.md +0 -256
  77. docs/pdf-navigation/index.ipynb +0 -314
  78. docs/pdf-navigation/index.md +0 -97
  79. docs/regions/index.ipynb +0 -816
  80. docs/regions/index.md +0 -294
  81. docs/tables/index.ipynb +0 -658
  82. docs/tables/index.md +0 -144
  83. docs/text-analysis/index.ipynb +0 -370
  84. docs/text-analysis/index.md +0 -105
  85. docs/text-extraction/index.ipynb +0 -1478
  86. docs/text-extraction/index.md +0 -292
  87. docs/tutorials/01-loading-and-extraction.ipynb +0 -1873
  88. docs/tutorials/01-loading-and-extraction.md +0 -95
  89. docs/tutorials/02-finding-elements.ipynb +0 -417
  90. docs/tutorials/02-finding-elements.md +0 -149
  91. docs/tutorials/03-extracting-blocks.ipynb +0 -152
  92. docs/tutorials/03-extracting-blocks.md +0 -48
  93. docs/tutorials/04-table-extraction.ipynb +0 -119
  94. docs/tutorials/04-table-extraction.md +0 -50
  95. docs/tutorials/05-excluding-content.ipynb +0 -275
  96. docs/tutorials/05-excluding-content.md +0 -109
  97. docs/tutorials/06-document-qa.ipynb +0 -337
  98. docs/tutorials/06-document-qa.md +0 -91
  99. docs/tutorials/07-layout-analysis.ipynb +0 -293
  100. docs/tutorials/07-layout-analysis.md +0 -66
  101. docs/tutorials/07-working-with-regions.ipynb +0 -414
  102. docs/tutorials/07-working-with-regions.md +0 -151
  103. docs/tutorials/08-spatial-navigation.ipynb +0 -513
  104. docs/tutorials/08-spatial-navigation.md +0 -190
  105. docs/tutorials/09-section-extraction.ipynb +0 -2439
  106. docs/tutorials/09-section-extraction.md +0 -256
  107. docs/tutorials/10-form-field-extraction.ipynb +0 -517
  108. docs/tutorials/10-form-field-extraction.md +0 -201
  109. docs/tutorials/11-enhanced-table-processing.ipynb +0 -59
  110. docs/tutorials/11-enhanced-table-processing.md +0 -9
  111. docs/tutorials/12-ocr-integration.ipynb +0 -3712
  112. docs/tutorials/12-ocr-integration.md +0 -137
  113. docs/tutorials/13-semantic-search.ipynb +0 -1718
  114. docs/tutorials/13-semantic-search.md +0 -77
  115. docs/visual-debugging/index.ipynb +0 -2970
  116. docs/visual-debugging/index.md +0 -157
  117. docs/visual-debugging/region.png +0 -0
  118. natural_pdf/templates/finetune/fine_tune_paddleocr.md +0 -420
  119. natural_pdf/templates/spa/css/style.css +0 -334
  120. natural_pdf/templates/spa/index.html +0 -31
  121. natural_pdf/templates/spa/js/app.js +0 -472
  122. natural_pdf/templates/spa/words.txt +0 -235976
  123. natural_pdf/widgets/frontend/viewer.js +0 -88
  124. natural_pdf-0.1.8.dist-info/RECORD +0 -156
  125. notebooks/Examples.ipynb +0 -1293
  126. pdfs/.gitkeep +0 -0
  127. pdfs/01-practice.pdf +0 -543
  128. pdfs/0500000US42001.pdf +0 -0
  129. pdfs/0500000US42007.pdf +0 -0
  130. pdfs/2014 Statistics.pdf +0 -0
  131. pdfs/2019 Statistics.pdf +0 -0
  132. pdfs/Atlanta_Public_Schools_GA_sample.pdf +0 -0
  133. pdfs/needs-ocr.pdf +0 -0
  134. {natural_pdf-0.1.8.dist-info → natural_pdf-0.1.9.dist-info}/licenses/LICENSE +0 -0
docs/layout-analysis/index.md DELETED
@@ -1,185 +0,0 @@
# Document Layout Analysis

Natural PDF can automatically detect the structure of a document (titles, paragraphs, tables, figures) using layout analysis models. This guide shows how to use this feature.

## Setup

We'll use a sample PDF that includes various layout elements.

```python
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]

page.to_image(width=700)
```

## Running Basic Layout Analysis

Use the `analyze_layout()` method. By default, it uses the YOLO model.

```python
# Analyze the layout using the default engine (YOLO)
# This adds 'region' elements to the page
page.analyze_layout()
```

```python
# Find all detected regions
regions = page.find_all('region')
len(regions)  # Show how many regions were detected
```

```python
first_region = regions[0]
f"First region: type='{first_region.type}', confidence={first_region.confidence:.2f}"
```

## Visualizing Detected Layout

Use `highlight()` or `show()` on the detected regions.

```python
# Highlight all detected regions, colored by type
regions.highlight(group_by='type')
page.to_image(width=700)
```

## Finding Specific Region Types

Use attribute selectors to find regions of a specific type.

```python
# Find all detected titles
titles = page.find_all('region[type=title]')
titles
```

```python
titles.show()
```

```python
page.find_all('region[type=table]').show()
```

## Working with Layout Regions

Detected regions are like any other `Region` object. You can extract text, find elements within them, etc.

```python
page.find('region[type=table]').extract_text(layout=True)
```
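
Since a detected region behaves like any other `Region`, you can also search inside it with the usual selectors. A minimal sketch (assuming detected regions expose `find_all()` like manually created regions do):

```python
# Look at the individual text elements inside the detected table region
table_region = page.find('region[type=table]')
table_text_elements = table_region.find_all('text')

print(f"{len(table_text_elements)} text elements inside the detected table")
print(table_text_elements.extract_text()[:200])  # quick preview of the contents
```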

## Using Different Layout Models

Natural PDF supports multiple engines (`yolo`, `paddle`, `tatr`, `docling`, `surya`). Specify the engine when calling `analyze_layout`.

*Note: Using different engines requires installing the corresponding extras (e.g., `natural-pdf[layout_paddle]`); `yolo` is the default.*

```python
# Analyze using the Paddle layout engine
page.clear_detected_layout_regions()
page.clear_highlights()

page.analyze_layout(engine="paddle")
page.find_all('region[model=paddle]').highlight(group_by='region_type')
page.to_image(width=700)
```

```python
# Analyze using Table Transformer (TATR) - specialized for tables
page.clear_detected_layout_regions()
page.clear_highlights()

page.analyze_layout(engine="tatr")
page.find_all('region[model=tatr]').highlight(group_by='region_type')
page.to_image(width=700)
```

```python
# Analyze using Docling
page.clear_detected_layout_regions()
page.clear_highlights()

page.analyze_layout(engine="docling")
page.find_all('region[model=docling]').highlight(group_by='region_type')
page.to_image(width=700)
```

```python
# Analyze using Surya
page.clear_detected_layout_regions()
page.clear_highlights()

page.analyze_layout(engine="surya")
page.find_all('region[model=surya]').highlight(group_by='region_type')
page.to_image(width=700)
```

*Note: Calling `analyze_layout` multiple times (even with the same engine) can add duplicate regions. You might want to use `page.clear_detected_layout_regions()` first, or filter by model using `region[model=yolo]`.*

## Controlling Confidence Threshold

Filter detections by their confidence score.

```python
# Re-run YOLO analysis (clearing previous results might be good practice)
page.clear_detected_layout_regions()
page.analyze_layout(engine="yolo")

# Find only high-confidence regions (e.g., >= 0.8)
high_conf_regions = page.find_all('region[confidence>=0.8]')
len(high_conf_regions)
```

## Table Structure with TATR

The TATR engine provides detailed table structure elements (`table`, `table-row`, `table-column`, `table-column-header`). This is very useful for precise table extraction.

```python
# Ensure TATR analysis has been run
page.clear_detected_layout_regions()
page.clear_highlights()

page.analyze_layout(engine="tatr")
page.find_all('region[model=tatr]').highlight(group_by='region_type')
page.to_image(width=700)
```

```python
# Find different structural elements from TATR
tables = page.find_all('region[type=table][model=tatr]')
rows = page.find_all('region[type=table-row][model=tatr]')
cols = page.find_all('region[type=table-column][model=tatr]')
hdrs = page.find_all('region[type=table-column-header][model=tatr]')

f"Found: {len(tables)} tables, {len(rows)} rows, {len(cols)} columns, {len(hdrs)} headers (from TATR)"
```
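
Each of these structural elements is a region in its own right, so you can, for example, read the table row by row. A minimal sketch (assuming the collection is iterable and each row region supports `extract_text()`):

```python
# Preview the first few TATR-detected rows as plain text
for i, row in enumerate(rows):
    if i >= 3:
        break
    print(f"Row {i}: {row.extract_text()}")
```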

### Enhanced Table Extraction with TATR

When a `region[type=table]` comes from the TATR model, `extract_table()` can use the underlying row/column structure for more robust extraction.

```python
# Find the TATR table region again
tatr_table = page.find('region[type=table][model=tatr]')

# This extraction uses the detected rows/columns
tatr_table.extract_table()
```
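
From there the result can go straight into your usual tooling. A minimal sketch (assuming `extract_table()` returns a list of row lists and that pandas is installed):

```python
import pandas as pd

data = tatr_table.extract_table()

# Treat the first detected row as the header; adjust if your table has none
df = pd.DataFrame(data[1:], columns=data[0])
df.head()
```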

If you'd like the normal approach instead of the "intelligent" one, you can explicitly ask for pdfplumber.

```python
# This extraction uses pdfplumber's standard algorithm instead of the detected rows/columns
tatr_table.extract_table(method='pdfplumber')
```

## Next Steps

Layout analysis provides regions that you can use for:

- [Table Extraction](../tables/index.ipynb): Especially powerful with TATR regions.
- [Text Extraction](../text-extraction/index.ipynb): Extract text only from specific region types (e.g., paragraphs).
- [Document QA](../document-qa/index.ipynb): Focus question answering on specific detected regions.
docs/ocr/index.md DELETED
@@ -1,256 +0,0 @@
# OCR Integration

Natural PDF includes OCR (Optical Character Recognition) to extract text from scanned documents or images embedded in PDFs.

## OCR Engine Comparison

Natural PDF supports multiple OCR engines:

| Feature | EasyOCR | PaddleOCR | Surya OCR | Gemini (Layout + potential OCR) |
|---------|---------|-----------|-----------|---------------------------------|
| **Installation** | `natural-pdf[easyocr]` | `natural-pdf[paddle]` | `natural-pdf[surya]` | `natural-pdf[gemini]` |
| **Primary Strength** | Good general performance, simpler | Excellent Asian language support, speed | High accuracy, multilingual lines | Advanced layout analysis (via API) |
| **Speed** | Moderate | Fast | Moderate (GPU recommended) | API latency |
| **Memory Usage** | Higher | Efficient | Higher (GPU recommended) | N/A (API) |
| **Paragraph Detect** | Yes (via option) | No | No (focuses on lines) | Yes (layout model) |
| **Handwritten** | Better support | Limited | Limited | Potentially (API model dependent) |
| **Small Text** | Moderate | Good | Good | Potentially (API model dependent) |
| **When to Use** | General documents, handwritten text | Asian languages, speed-critical tasks | Highest accuracy needed, line-level | Complex layouts, API integration |

## Basic OCR Usage

Apply OCR directly to a page or region:

```python
from natural_pdf import PDF

# Assume 'pdf' is an already-loaded PDF object (e.g., a scanned document)
page = pdf.pages[0]

# Apply OCR using the default engine (or specify one)
ocr_elements = page.apply_ocr(languages=['en'])

# Extract text (will use the results from apply_ocr if run previously)
text = page.extract_text()
print(text)
```

## Configuring OCR

Specify the engine and basic options directly:

```python
# Use PaddleOCR for Chinese and English
ocr_elements = page.apply_ocr(engine='paddle', languages=['zh-cn', 'en'])

# Use EasyOCR with a lower confidence threshold
ocr_elements = page.apply_ocr(engine='easyocr', languages=['en'], min_confidence=0.3)
```

For advanced, engine-specific settings, use the Options classes:

```python
from natural_pdf.ocr import PaddleOCROptions, EasyOCROptions, SuryaOCROptions
from natural_pdf.analyzers.layout import GeminiOptions  # Note: Gemini is primarily a layout engine

# --- Configure PaddleOCR ---
paddle_opts = PaddleOCROptions(
    languages=['en', 'zh-cn'],
    use_gpu=True,         # Explicitly enable GPU if available
    use_angle_cls=False,  # Disable text direction classification (if text is upright)
    det_db_thresh=0.25,   # Lower detection threshold (more boxes, potentially noisy)
    rec_batch_num=16,     # Increase recognition batch size for potential speedup on GPU
    # rec_char_dict_path='/path/to/custom_dict.txt'  # Optional: path to a custom character dictionary
    # See PaddleOCROptions documentation or source code for all parameters
)
ocr_elements = page.apply_ocr(engine='paddle', options=paddle_opts)

# --- Configure EasyOCR ---
easy_opts = EasyOCROptions(
    languages=['en', 'fr'],
    gpu=True,             # Explicitly enable GPU if available
    paragraph=True,       # Group results into paragraphs (if structure is clear)
    detail=1,             # Ensure bounding boxes are returned (required)
    text_threshold=0.6,   # Confidence threshold for text detection
    link_threshold=0.4,   # Standard EasyOCR detection param (check that the wrapper passes it through)
    low_text=0.4,         # Standard EasyOCR detection param (check that the wrapper passes it through)
    batch_size=8,         # Processing batch size (adjust based on memory)
    # See EasyOCROptions documentation or source code for all parameters
)
ocr_elements = page.apply_ocr(engine='easyocr', options=easy_opts)

# --- Configure Surya OCR ---
# Surya focuses on line detection and recognition
surya_opts = SuryaOCROptions(
    languages=['en', 'de'],  # Specify languages for recognition
    # device='cuda',         # Use GPU ('cuda') or CPU ('cpu') <-- set via env var TORCH_DEVICE
    min_confidence=0.4,      # Example: adjust minimum confidence for results
    # Core Surya options like device, batch size, and thresholds are typically
    # set via environment variables.
)
ocr_elements = page.apply_ocr(engine='surya', options=surya_opts)

# --- Configure Gemini (as layout analyzer, can be used with OCR) ---
# Gemini requires an API key (GOOGLE_API_KEY environment variable)
# Note: Gemini is used via apply_layout, but its options can influence OCR if used together
gemini_opts = GeminiOptions(
    prompt="Extract text content and identify document elements.",
    # model_name="gemini-1.5-flash-latest"  # Specify a model if needed
    # See GeminiOptions documentation for more parameters
)
# Typically used like this (layout first, then potentially OCR on regions)
layout_elements = page.apply_layout(engine='gemini', options=gemini_opts)
# If Gemini also performed OCR or you want to OCR layout regions:
# ocr_elements = some_region.apply_ocr(...)

# It can sometimes be used directly if the model supports it, but this is less common:
# try:
#     ocr_elements = page.apply_ocr(engine='gemini', options=gemini_opts)
# except Exception as e:
#     print(f"Gemini might not be configured for direct OCR via apply_ocr: {e}")
```
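
The Surya comments above note that device selection happens through environment variables rather than the options object. A minimal sketch of that pattern (assuming your Surya install reads `TORCH_DEVICE`, as the comment suggests):

```python
import os

# Pick the device before Surya loads its models (environment-variable based,
# per the note in the options above)
os.environ["TORCH_DEVICE"] = "cuda"  # or "cpu"

ocr_elements = page.apply_ocr(engine='surya', options=surya_opts)
```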

## Applying OCR Directly

The `page.apply_ocr(...)` and `region.apply_ocr(...)` methods are the primary way to run OCR:

```python
# Apply OCR to a page and get the OCR elements
ocr_elements = page.apply_ocr(engine='easyocr')
print(f"Found {len(ocr_elements)} text elements via OCR")

# Apply OCR to a specific region
title = page.find('text:contains("Title")')
content_region = title.below(height=300)
region_ocr_elements = content_region.apply_ocr(engine='paddle', languages=['en'])

# Note: Re-applying OCR to the same page or region will remove any
# previously generated OCR elements for that area before adding the new ones.
```

## OCR Engines

Choose the engine best suited for your document and language requirements using the `engine` parameter in `apply_ocr`.

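For example, you can re-run the same page through a couple of engines while you evaluate which one reads your documents best (a minimal sketch using only the `engine` parameter shown above; each run replaces the previous OCR elements):

```python
# Try two engines on the same page and compare how much text each finds
for engine in ['easyocr', 'paddle']:
    elements = page.apply_ocr(engine=engine, languages=['en'])
    print(f"{engine}: {len(elements)} text elements")
```
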
## Finding and Working with OCR Text

After applying OCR, work with the text just like regular text:

```python
# Find all OCR text elements
ocr_text = page.find_all('text[source=ocr]')

# Find high-confidence OCR text
high_conf = page.find_all('text[source=ocr][confidence>=0.8]')

# Extract text only from OCR elements
ocr_text_content = page.find_all('text[source=ocr]').extract_text()

# Filter OCR text by content
names = page.find_all('text[source=ocr]:contains("Smith")', case=False)
```
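
For instance, the same selectors give a quick sense of how much of the recognized text clears a confidence bar (a minimal sketch using only the calls shown above):

```python
# Compare total OCR output against the high-confidence subset
all_ocr = page.find_all('text[source=ocr]')
confident = page.find_all('text[source=ocr][confidence>=0.8]')

print(f"{len(confident)} of {len(all_ocr)} OCR elements are high confidence")
print(confident.extract_text())
```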

## Visualizing OCR Results

See OCR results to help debug issues:

```python
# Apply OCR
ocr_elements = page.apply_ocr()

# Highlight all OCR elements
for element in ocr_elements:
    # Color based on confidence
    if element.confidence >= 0.8:
        color = "green"   # High confidence
    elif element.confidence >= 0.5:
        color = "yellow"  # Medium confidence
    else:
        color = "red"     # Low confidence

    element.highlight(color=color, label=f"OCR ({element.confidence:.2f})")

# Get the visualization as an image
image = page.to_image(labels=True)
# Just return the image in a Jupyter cell
image

# Highlight only high-confidence elements
high_conf = page.find_all('text[source=ocr][confidence>=0.8]')
high_conf.highlight(color="green", label="High Confidence OCR")
```
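
If you want to keep that rendering around (for a bug report, say), you can save it to disk; this assumes `to_image()` returns a PIL-style image object with a `save()` method.

```python
# Persist the highlighted debug view for later inspection
debug_image = page.to_image(labels=True)
debug_image.save("ocr_debug.png")  # assumes a PIL Image is returned
```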

## Detect + LLM OCR

Sometimes you have a difficult piece of content where a local model can reliably detect where the text is, but you want an LLM to do the actual transcription region by region. You can do this with Natural PDF!

```python
from natural_pdf import PDF
from natural_pdf.ocr.utils import direct_ocr_llm
import openai

pdf = PDF("needs-ocr.pdf")
page = pdf.pages[0]

# Detect text regions only (recognition happens later via the LLM)
page.apply_ocr('paddle', resolution=120, detect_only=True)

# Build the framework
client = openai.OpenAI(base_url="https://api.anthropic.com/v1/", api_key='sk-XXXXX')
prompt = """OCR this image. Return only the exact text from the image. Include misspellings,
punctuation, etc. Do not surround it with quotation marks. Do not include translations or comments.
The text is from a Greek spreadsheet, so most likely content is Modern Greek or numeric."""

# This returns the cleaned-up text
def correct(region):
    return direct_ocr_llm(region, client, prompt=prompt, resolution=300, model="claude-3-5-haiku-20241022")

# Run 'correct' on each text element
page.correct_ocr(correct)

# You're done!
```

## Interactive OCR Correction / Debugging

Natural PDF includes a utility to package a PDF and its detected elements, along with an interactive web application (SPA) for reviewing and correcting OCR results.

1. **Package the data:**
   Use the `create_correction_task_package` function to create a zip file containing the necessary data for the SPA.

   ```python
   from natural_pdf.utils.packaging import create_correction_task_package

   # Assuming 'pdf' is your loaded PDF object after running apply_ocr or apply_layout
   create_correction_task_package(pdf, "correction_package.zip", overwrite=True)
   ```

2. **Run the SPA:**
   The correction SPA is bundled with the library. You need to run a simple web server from the directory containing the SPA's files. The location of these files depends on your installation, but you can typically find them within the installed `natural_pdf` package directory under `templates/spa`.

   *Example using Python's built-in server (run from your terminal):*

   ```bash
   # Find the path to the installed natural_pdf package
   # (This command might vary depending on your environment)
   NATURAL_PDF_PATH=$(python -c "import site; print(site.getsitepackages()[0])")/natural_pdf

   # Navigate to the SPA directory
   cd $NATURAL_PDF_PATH/templates/spa

   # Start the web server (e.g., on port 8000)
   python -m http.server 8000
   ```

3. **Use the SPA:**
   Open your web browser to `http://localhost:8000`. The SPA should load, allowing you to drag and drop the `correction_package.zip` file you created into the application to view and edit the OCR results.

## Next Steps

With OCR capabilities, you can explore:

- [Layout Analysis](../layout-analysis/index.ipynb) for automatically detecting document structure
- [Document QA](../document-qa/index.ipynb) for asking questions about your documents
- [Visual Debugging](../visual-debugging/index.ipynb) for visualizing OCR results