natural-pdf 0.1.7__py3-none-any.whl → 0.1.9__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (134)
  1. natural_pdf/__init__.py +3 -0
  2. natural_pdf/analyzers/layout/base.py +1 -5
  3. natural_pdf/analyzers/layout/gemini.py +61 -51
  4. natural_pdf/analyzers/layout/layout_analyzer.py +40 -11
  5. natural_pdf/analyzers/layout/layout_manager.py +26 -84
  6. natural_pdf/analyzers/layout/layout_options.py +7 -0
  7. natural_pdf/analyzers/layout/pdfplumber_table_finder.py +142 -0
  8. natural_pdf/analyzers/layout/surya.py +46 -123
  9. natural_pdf/analyzers/layout/tatr.py +51 -4
  10. natural_pdf/analyzers/text_structure.py +3 -5
  11. natural_pdf/analyzers/utils.py +3 -3
  12. natural_pdf/classification/manager.py +422 -0
  13. natural_pdf/classification/mixin.py +163 -0
  14. natural_pdf/classification/results.py +80 -0
  15. natural_pdf/collections/mixins.py +111 -0
  16. natural_pdf/collections/pdf_collection.py +434 -15
  17. natural_pdf/core/element_manager.py +83 -0
  18. natural_pdf/core/highlighting_service.py +13 -22
  19. natural_pdf/core/page.py +578 -93
  20. natural_pdf/core/pdf.py +912 -460
  21. natural_pdf/elements/base.py +134 -40
  22. natural_pdf/elements/collections.py +712 -109
  23. natural_pdf/elements/region.py +722 -69
  24. natural_pdf/elements/text.py +4 -1
  25. natural_pdf/export/mixin.py +137 -0
  26. natural_pdf/exporters/base.py +3 -3
  27. natural_pdf/exporters/paddleocr.py +5 -4
  28. natural_pdf/extraction/manager.py +135 -0
  29. natural_pdf/extraction/mixin.py +279 -0
  30. natural_pdf/extraction/result.py +23 -0
  31. natural_pdf/ocr/__init__.py +5 -5
  32. natural_pdf/ocr/engine_doctr.py +346 -0
  33. natural_pdf/ocr/engine_easyocr.py +6 -3
  34. natural_pdf/ocr/ocr_factory.py +24 -4
  35. natural_pdf/ocr/ocr_manager.py +122 -26
  36. natural_pdf/ocr/ocr_options.py +94 -11
  37. natural_pdf/ocr/utils.py +19 -6
  38. natural_pdf/qa/document_qa.py +0 -4
  39. natural_pdf/search/__init__.py +20 -34
  40. natural_pdf/search/haystack_search_service.py +309 -265
  41. natural_pdf/search/haystack_utils.py +99 -75
  42. natural_pdf/search/search_service_protocol.py +11 -12
  43. natural_pdf/selectors/parser.py +431 -230
  44. natural_pdf/utils/debug.py +3 -3
  45. natural_pdf/utils/identifiers.py +1 -1
  46. natural_pdf/utils/locks.py +8 -0
  47. natural_pdf/utils/packaging.py +8 -6
  48. natural_pdf/utils/text_extraction.py +60 -1
  49. natural_pdf/utils/tqdm_utils.py +51 -0
  50. natural_pdf/utils/visualization.py +18 -0
  51. natural_pdf/widgets/viewer.py +4 -25
  52. {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/METADATA +17 -3
  53. natural_pdf-0.1.9.dist-info/RECORD +80 -0
  54. {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/WHEEL +1 -1
  55. {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/top_level.txt +0 -2
  56. docs/api/index.md +0 -386
  57. docs/assets/favicon.png +0 -3
  58. docs/assets/favicon.svg +0 -3
  59. docs/assets/javascripts/custom.js +0 -17
  60. docs/assets/logo.svg +0 -3
  61. docs/assets/sample-screen.png +0 -0
  62. docs/assets/social-preview.png +0 -17
  63. docs/assets/social-preview.svg +0 -17
  64. docs/assets/stylesheets/custom.css +0 -65
  65. docs/document-qa/index.ipynb +0 -435
  66. docs/document-qa/index.md +0 -79
  67. docs/element-selection/index.ipynb +0 -915
  68. docs/element-selection/index.md +0 -229
  69. docs/finetuning/index.md +0 -176
  70. docs/index.md +0 -170
  71. docs/installation/index.md +0 -69
  72. docs/interactive-widget/index.ipynb +0 -962
  73. docs/interactive-widget/index.md +0 -12
  74. docs/layout-analysis/index.ipynb +0 -818
  75. docs/layout-analysis/index.md +0 -185
  76. docs/ocr/index.md +0 -209
  77. docs/pdf-navigation/index.ipynb +0 -314
  78. docs/pdf-navigation/index.md +0 -97
  79. docs/regions/index.ipynb +0 -816
  80. docs/regions/index.md +0 -294
  81. docs/tables/index.ipynb +0 -658
  82. docs/tables/index.md +0 -144
  83. docs/text-analysis/index.ipynb +0 -370
  84. docs/text-analysis/index.md +0 -105
  85. docs/text-extraction/index.ipynb +0 -1478
  86. docs/text-extraction/index.md +0 -292
  87. docs/tutorials/01-loading-and-extraction.ipynb +0 -194
  88. docs/tutorials/01-loading-and-extraction.md +0 -95
  89. docs/tutorials/02-finding-elements.ipynb +0 -340
  90. docs/tutorials/02-finding-elements.md +0 -149
  91. docs/tutorials/03-extracting-blocks.ipynb +0 -147
  92. docs/tutorials/03-extracting-blocks.md +0 -48
  93. docs/tutorials/04-table-extraction.ipynb +0 -114
  94. docs/tutorials/04-table-extraction.md +0 -50
  95. docs/tutorials/05-excluding-content.ipynb +0 -270
  96. docs/tutorials/05-excluding-content.md +0 -109
  97. docs/tutorials/06-document-qa.ipynb +0 -332
  98. docs/tutorials/06-document-qa.md +0 -91
  99. docs/tutorials/07-layout-analysis.ipynb +0 -288
  100. docs/tutorials/07-layout-analysis.md +0 -66
  101. docs/tutorials/07-working-with-regions.ipynb +0 -413
  102. docs/tutorials/07-working-with-regions.md +0 -151
  103. docs/tutorials/08-spatial-navigation.ipynb +0 -508
  104. docs/tutorials/08-spatial-navigation.md +0 -190
  105. docs/tutorials/09-section-extraction.ipynb +0 -2434
  106. docs/tutorials/09-section-extraction.md +0 -256
  107. docs/tutorials/10-form-field-extraction.ipynb +0 -512
  108. docs/tutorials/10-form-field-extraction.md +0 -201
  109. docs/tutorials/11-enhanced-table-processing.ipynb +0 -54
  110. docs/tutorials/11-enhanced-table-processing.md +0 -9
  111. docs/tutorials/12-ocr-integration.ipynb +0 -604
  112. docs/tutorials/12-ocr-integration.md +0 -175
  113. docs/tutorials/13-semantic-search.ipynb +0 -1328
  114. docs/tutorials/13-semantic-search.md +0 -77
  115. docs/visual-debugging/index.ipynb +0 -2970
  116. docs/visual-debugging/index.md +0 -157
  117. docs/visual-debugging/region.png +0 -0
  118. natural_pdf/templates/finetune/fine_tune_paddleocr.md +0 -415
  119. natural_pdf/templates/spa/css/style.css +0 -334
  120. natural_pdf/templates/spa/index.html +0 -31
  121. natural_pdf/templates/spa/js/app.js +0 -472
  122. natural_pdf/templates/spa/words.txt +0 -235976
  123. natural_pdf/widgets/frontend/viewer.js +0 -88
  124. natural_pdf-0.1.7.dist-info/RECORD +0 -145
  125. notebooks/Examples.ipynb +0 -1293
  126. pdfs/.gitkeep +0 -0
  127. pdfs/01-practice.pdf +0 -543
  128. pdfs/0500000US42001.pdf +0 -0
  129. pdfs/0500000US42007.pdf +0 -0
  130. pdfs/2014 Statistics.pdf +0 -0
  131. pdfs/2019 Statistics.pdf +0 -0
  132. pdfs/Atlanta_Public_Schools_GA_sample.pdf +0 -0
  133. pdfs/needs-ocr.pdf +0 -0
  134. {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/licenses/LICENSE +0 -0
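The per-file `+N -M` counts in the list above could be approximated locally along the following lines. This is a minimal sketch, not the registry's actual diff tooling: `wheel_line_counts` is a hypothetical helper name, every archive member is decoded as UTF-8 text (so binary members such as the PDFs and PNGs would produce meaningless counts and ought to be skipped in real use), and the counts come from `difflib`'s matching, so they may not agree exactly with the numbers shown here.

```python
import difflib
import zipfile


def _read_lines(archive, name):
    """Read one archive member as a list of text lines (lossy for binary data)."""
    return archive.read(name).decode("utf-8", errors="replace").splitlines()


def wheel_line_counts(old_whl, new_whl):
    """Return {path: (added, removed)} for members that differ between two wheels.

    Both arguments may be file paths or file-like objects, since a wheel is
    an ordinary zip archive.
    """
    changes = {}
    with zipfile.ZipFile(old_whl) as old, zipfile.ZipFile(new_whl) as new:
        old_names = set(old.namelist())
        new_names = set(new.namelist())
        for name in sorted(old_names | new_names):
            # A member missing on one side is treated as an empty file.
            a = _read_lines(old, name) if name in old_names else []
            b = _read_lines(new, name) if name in new_names else []
            if a == b:
                continue
            added = removed = 0
            matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
            for tag, i1, i2, j1, j2 in matcher.get_opcodes():
                if tag in ("replace", "delete"):
                    removed += i2 - i1
                if tag in ("replace", "insert"):
                    added += j2 - j1
            changes[name] = (added, removed)
    return changes
```

Called with the two downloaded wheel files (under whatever local filenames they were saved), it should return a mapping from each changed path to an `(added, removed)` pair. A renamed directory such as the `.dist-info` folder would show up as one deleted path plus one added path, since this sketch does no rename detection.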
docs/tutorials/12-ocr-integration.md +0 -175
@@ -1,175 +0,0 @@
- # OCR Integration for Scanned Documents
-
- Optical Character Recognition (OCR) allows you to extract text from scanned documents where the text isn't embedded in the PDF. This tutorial demonstrates how to work with scanned documents.
-
- ```python
- #%pip install "natural-pdf[all]"
- ```
-
- ```python
- from natural_pdf import PDF
-
- # Load a PDF
- pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf")
- page = pdf.pages[0]
-
- # Try extracting text without OCR
- text_without_ocr = page.extract_text()
- f"Without OCR: {len(text_without_ocr)} characters extracted"
- ```
-
- ## Finding Text Elements with OCR
-
- ```python
- # Convert text-as-image to text elements
- page.apply_ocr()
-
- # Select all text pieces on the page
- text_elements = page.find_all('text')
- f"Found {len(text_elements)} text elements"
-
- # Visualize the elements
- text_elements.highlight()
- ```
-
- ## OCR Configuration Options
-
- ```python
- # Set OCR configuration for better results
- page.ocr_config = {
-     'language': 'eng', # English
-     'dpi': 300, # Higher resolution
- }
-
- # Extract text with the improved configuration
- improved_text = page.extract_text()
-
- # Preview the text
- improved_text[:200] + "..." if len(improved_text) > 200 else improved_text
- ```
-
- ## Working with Multi-language Documents
-
- ```python
- # Configure for multiple languages
- page.ocr_config = {
-     'language': 'eng+fra+deu', # English, French, German
-     'dpi': 300
- }
-
- # Extract text with multi-language support
- multilang_text = page.extract_text()
- multilang_text[:200]
- ```
-
- ## Extracting Tables from Scanned Documents
-
- ```python
- # Enable OCR and analyze the document layout
- page.use_ocr = True
- page.analyze_layout()
-
- # Find table regions
- table_regions = page.find_all('region[type=table]')
-
- # Visualize any detected tables
- table_regions.highlight()
-
- # Extract the first table if found
- if table_regions:
-     table_data = table_regions[0].extract_table()
-     table_data
- else:
-     "No tables found in the document"
- ```
-
- ## Finding Form Fields in Scanned Documents
-
- ```python
- # Look for potential form labels (containing a colon)
- labels = page.find_all('text:contains(":")')
-
- # Visualize the labels
- labels.highlight()
-
- # Extract form data by looking to the right of each label
- form_data = {}
- for label in labels:
-     # Clean the label text
-     field_name = label.text.strip().rstrip(':')
-
-     # Find the value to the right
-     value_element = label.right(width=200)
-     value = value_element.extract_text().strip()
-
-     # Add to our dictionary
-     form_data[field_name] = value
-
- # Display the extracted data
- form_data
- ```
-
- ## Combining OCR with Layout Analysis
-
- ```python
- # Apply OCR and analyze layout
- page.use_ocr = True
- page.analyze_layout()
-
- # Find document structure elements
- headings = page.find_all('region[type=heading]')
- paragraphs = page.find_all('region[type=paragraph]')
-
- # Visualize the structure
- headings.highlight(color="red", label="Headings")
- paragraphs.highlight(color="blue", label="Paragraphs")
-
- # Create a simple document outline
- document_outline = []
- for heading in headings:
-     heading_text = heading.extract_text()
-     document_outline.append(heading_text)
-
- document_outline
- ```
-
- ## Working with Multiple Pages
-
- ```python
- # Process all pages in the document
- all_text = []
-
- for i, page in enumerate(pdf.pages):
-     # Enable OCR for each page
-     page.use_ocr = True
-
-     # Extract text
-     page_text = page.extract_text()
-
-     # Add to our collection with page number
-     all_text.append(f"Page {i+1}: {page_text[:100]}...")
-
- # Show the first few pages
- all_text
- ```
-
- ## Saving PDFs with Searchable Text
-
- After applying OCR to a PDF, you can save a new version of the PDF where the recognized text is embedded as an invisible layer. This makes the text searchable and copyable in standard PDF viewers.
-
- Use the `save_searchable()` method on the `PDF` object:
-
- ```python
- from natural_pdf import PDF
-
- input_pdf_path = "https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf"
-
- pdf = PDF(input_pdf_path)
- pdf.apply_ocr()
-
- pdf.save_searchable("needs-ocr-searchable.pdf")
- ```
-
- This creates `needs-ocr-searchable.pdf`, which looks identical to the original but now has a text layer corresponding to the OCR results. You can adjust the rendering resolution used during saving with the `dpi` parameter (default is 300).
-
- OCR integration enables you to work with scanned documents, historical archives, and image-based PDFs that don't have embedded text. By combining OCR with natural-pdf's layout analysis capabilities, you can turn any document into structured, searchable data.
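
The hunk header `@@ -1,175 +0,0 @@` for this file encodes the full deletion: the old side starts at line 1 and spans 175 lines, while the new side has start 0 and length 0, meaning nothing remains. As a small illustration, the sketch below parses a unified-diff hunk header into its four numbers. `parse_hunk_header` is a hypothetical helper written for this note; it assumes the standard unified format, in which an omitted length defaults to 1.

```python
import re

# Matches headers such as "@@ -1,175 +0,0 @@" or "@@ -3 +3 @@".
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return (old_start, old_len, new_start, new_len) for a hunk header line."""
    match = HUNK_RE.match(line)
    if match is None:
        raise ValueError(f"not a unified-diff hunk header: {line!r}")
    old_start, old_len, new_start, new_len = match.groups()
    return (
        int(old_start),
        int(old_len) if old_len is not None else 1,  # omitted length means 1
        int(new_start),
        int(new_len) if new_len is not None else 1,
    )


print(parse_hunk_header("@@ -1,175 +0,0 @@"))  # (1, 175, 0, 0)
```

A new-side length of 0 together with an old-side range covering the whole file is what a complete file removal looks like in this format, which matches the `+0 -175` entry for the tutorial in the file list.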