natural-pdf 0.1.8__py3-none-any.whl → 0.1.10__py3-none-any.whl

Files changed (134)
  1. natural_pdf/__init__.py +1 -0
  2. natural_pdf/analyzers/layout/base.py +1 -5
  3. natural_pdf/analyzers/layout/gemini.py +61 -51
  4. natural_pdf/analyzers/layout/layout_analyzer.py +40 -11
  5. natural_pdf/analyzers/layout/layout_manager.py +26 -84
  6. natural_pdf/analyzers/layout/layout_options.py +7 -0
  7. natural_pdf/analyzers/layout/pdfplumber_table_finder.py +142 -0
  8. natural_pdf/analyzers/layout/surya.py +46 -123
  9. natural_pdf/analyzers/layout/tatr.py +51 -4
  10. natural_pdf/analyzers/text_structure.py +3 -5
  11. natural_pdf/analyzers/utils.py +3 -3
  12. natural_pdf/classification/manager.py +241 -158
  13. natural_pdf/classification/mixin.py +52 -38
  14. natural_pdf/classification/results.py +71 -45
  15. natural_pdf/collections/mixins.py +85 -20
  16. natural_pdf/collections/pdf_collection.py +245 -100
  17. natural_pdf/core/element_manager.py +30 -14
  18. natural_pdf/core/highlighting_service.py +13 -22
  19. natural_pdf/core/page.py +423 -101
  20. natural_pdf/core/pdf.py +694 -195
  21. natural_pdf/elements/base.py +134 -40
  22. natural_pdf/elements/collections.py +610 -134
  23. natural_pdf/elements/region.py +659 -90
  24. natural_pdf/elements/text.py +1 -1
  25. natural_pdf/export/mixin.py +137 -0
  26. natural_pdf/exporters/base.py +3 -3
  27. natural_pdf/exporters/paddleocr.py +4 -3
  28. natural_pdf/extraction/manager.py +50 -49
  29. natural_pdf/extraction/mixin.py +90 -57
  30. natural_pdf/extraction/result.py +9 -23
  31. natural_pdf/ocr/__init__.py +5 -5
  32. natural_pdf/ocr/engine_doctr.py +346 -0
  33. natural_pdf/ocr/ocr_factory.py +24 -4
  34. natural_pdf/ocr/ocr_manager.py +61 -25
  35. natural_pdf/ocr/ocr_options.py +70 -10
  36. natural_pdf/ocr/utils.py +6 -4
  37. natural_pdf/search/__init__.py +20 -34
  38. natural_pdf/search/haystack_search_service.py +309 -265
  39. natural_pdf/search/haystack_utils.py +99 -75
  40. natural_pdf/search/search_service_protocol.py +11 -12
  41. natural_pdf/selectors/parser.py +219 -143
  42. natural_pdf/utils/debug.py +3 -3
  43. natural_pdf/utils/identifiers.py +1 -1
  44. natural_pdf/utils/locks.py +1 -1
  45. natural_pdf/utils/packaging.py +8 -6
  46. natural_pdf/utils/text_extraction.py +24 -16
  47. natural_pdf/utils/tqdm_utils.py +18 -10
  48. natural_pdf/utils/visualization.py +18 -0
  49. natural_pdf/widgets/viewer.py +4 -25
  50. {natural_pdf-0.1.8.dist-info → natural_pdf-0.1.10.dist-info}/METADATA +12 -3
  51. natural_pdf-0.1.10.dist-info/RECORD +80 -0
  52. {natural_pdf-0.1.8.dist-info → natural_pdf-0.1.10.dist-info}/WHEEL +1 -1
  53. {natural_pdf-0.1.8.dist-info → natural_pdf-0.1.10.dist-info}/top_level.txt +0 -2
  54. docs/api/index.md +0 -386
  55. docs/assets/favicon.png +0 -3
  56. docs/assets/favicon.svg +0 -3
  57. docs/assets/javascripts/custom.js +0 -17
  58. docs/assets/logo.svg +0 -3
  59. docs/assets/sample-screen.png +0 -0
  60. docs/assets/social-preview.png +0 -17
  61. docs/assets/social-preview.svg +0 -17
  62. docs/assets/stylesheets/custom.css +0 -65
  63. docs/categorizing-documents/index.md +0 -168
  64. docs/data-extraction/index.md +0 -87
  65. docs/document-qa/index.ipynb +0 -435
  66. docs/document-qa/index.md +0 -79
  67. docs/element-selection/index.ipynb +0 -969
  68. docs/element-selection/index.md +0 -249
  69. docs/finetuning/index.md +0 -176
  70. docs/index.md +0 -189
  71. docs/installation/index.md +0 -69
  72. docs/interactive-widget/index.ipynb +0 -962
  73. docs/interactive-widget/index.md +0 -12
  74. docs/layout-analysis/index.ipynb +0 -818
  75. docs/layout-analysis/index.md +0 -185
  76. docs/ocr/index.md +0 -256
  77. docs/pdf-navigation/index.ipynb +0 -314
  78. docs/pdf-navigation/index.md +0 -97
  79. docs/regions/index.ipynb +0 -816
  80. docs/regions/index.md +0 -294
  81. docs/tables/index.ipynb +0 -658
  82. docs/tables/index.md +0 -144
  83. docs/text-analysis/index.ipynb +0 -370
  84. docs/text-analysis/index.md +0 -105
  85. docs/text-extraction/index.ipynb +0 -1478
  86. docs/text-extraction/index.md +0 -292
  87. docs/tutorials/01-loading-and-extraction.ipynb +0 -1873
  88. docs/tutorials/01-loading-and-extraction.md +0 -95
  89. docs/tutorials/02-finding-elements.ipynb +0 -417
  90. docs/tutorials/02-finding-elements.md +0 -149
  91. docs/tutorials/03-extracting-blocks.ipynb +0 -152
  92. docs/tutorials/03-extracting-blocks.md +0 -48
  93. docs/tutorials/04-table-extraction.ipynb +0 -119
  94. docs/tutorials/04-table-extraction.md +0 -50
  95. docs/tutorials/05-excluding-content.ipynb +0 -275
  96. docs/tutorials/05-excluding-content.md +0 -109
  97. docs/tutorials/06-document-qa.ipynb +0 -337
  98. docs/tutorials/06-document-qa.md +0 -91
  99. docs/tutorials/07-layout-analysis.ipynb +0 -293
  100. docs/tutorials/07-layout-analysis.md +0 -66
  101. docs/tutorials/07-working-with-regions.ipynb +0 -414
  102. docs/tutorials/07-working-with-regions.md +0 -151
  103. docs/tutorials/08-spatial-navigation.ipynb +0 -513
  104. docs/tutorials/08-spatial-navigation.md +0 -190
  105. docs/tutorials/09-section-extraction.ipynb +0 -2439
  106. docs/tutorials/09-section-extraction.md +0 -256
  107. docs/tutorials/10-form-field-extraction.ipynb +0 -517
  108. docs/tutorials/10-form-field-extraction.md +0 -201
  109. docs/tutorials/11-enhanced-table-processing.ipynb +0 -59
  110. docs/tutorials/11-enhanced-table-processing.md +0 -9
  111. docs/tutorials/12-ocr-integration.ipynb +0 -3712
  112. docs/tutorials/12-ocr-integration.md +0 -137
  113. docs/tutorials/13-semantic-search.ipynb +0 -1718
  114. docs/tutorials/13-semantic-search.md +0 -77
  115. docs/visual-debugging/index.ipynb +0 -2970
  116. docs/visual-debugging/index.md +0 -157
  117. docs/visual-debugging/region.png +0 -0
  118. natural_pdf/templates/finetune/fine_tune_paddleocr.md +0 -420
  119. natural_pdf/templates/spa/css/style.css +0 -334
  120. natural_pdf/templates/spa/index.html +0 -31
  121. natural_pdf/templates/spa/js/app.js +0 -472
  122. natural_pdf/templates/spa/words.txt +0 -235976
  123. natural_pdf/widgets/frontend/viewer.js +0 -88
  124. natural_pdf-0.1.8.dist-info/RECORD +0 -156
  125. notebooks/Examples.ipynb +0 -1293
  126. pdfs/.gitkeep +0 -0
  127. pdfs/01-practice.pdf +0 -543
  128. pdfs/0500000US42001.pdf +0 -0
  129. pdfs/0500000US42007.pdf +0 -0
  130. pdfs/2014 Statistics.pdf +0 -0
  131. pdfs/2019 Statistics.pdf +0 -0
  132. pdfs/Atlanta_Public_Schools_GA_sample.pdf +0 -0
  133. pdfs/needs-ocr.pdf +0 -0
  134. {natural_pdf-0.1.8.dist-info → natural_pdf-0.1.10.dist-info}/licenses/LICENSE +0 -0
@@ -1,292 +0,0 @@
1
- # Text Extraction Guide
2
-
3
- This guide demonstrates various ways to extract text from PDFs using Natural PDF, from simple page dumps to targeted extraction based on elements, regions, and styles.
4
-
5
- ## Setup
6
-
7
- First, let's import the necessary libraries and load a sample PDF. We'll use `01-practice.pdf` from the repository's `pdfs` directory. *Adjust the path if your setup differs.*
8
-
9
- ```python
10
- from natural_pdf import PDF
11
-
12
- # Load the PDF
13
- pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
14
-
15
- # Select the first page for initial examples
16
- page = pdf.pages[0]
17
-
18
- # Display the first page
19
- page.show(width=700)
20
- ```
21
-
22
- ## Basic Text Extraction
23
-
24
- Get all text from a page or the entire document.
25
-
26
- ```python
27
- # Extract all text from the first page
28
- # Displaying first 500 characters
29
- print(page.extract_text()[:500])
30
- ```
31
-
32
- You can also preserve layout with `layout=True`.
33
-
34
- ```python
35
- # Extract text from the first page, preserving its layout
36
- # Displaying first 2000 characters
37
- print(page.extract_text(layout=True)[:2000])
38
- ```
39
-
40
- ## Extracting Text from Specific Elements
41
-
42
- Use selectors with `find()` or `find_all()` to target specific elements. *Selectors like `:contains("Summary")` are examples; adapt them to your PDF.*
43
-
44
- ```python
45
- # Find a single element, e.g., the text containing "Site"
46
- # Adjust selector as needed
47
- date_element = page.find('text:contains("Site")')
48
- date_element # Display the found element object
49
- ```
50
-
51
- ```python
52
- date_element.show()
53
- ```
54
-
55
- ```python
56
- date_element.text
57
- ```
58
-
59
- ```python
60
- # Find multiple elements, e.g., bold headings (size >= 8)
61
- heading_elements = page.find_all('text[size>=8]:bold')
62
- heading_elements
63
- ```
64
-
65
- ```python
66
- page.find_all('text[size>=8]:bold').show()
67
- ```
68
-
69
- ```python
70
- # Pull out all of their text (why? I don't know!)
71
- print(heading_elements.extract_text())
72
- ```
73
-
74
- ## Advanced Text Searches
75
-
76
- ```python
77
- # Exact phrase (case-sensitive)
78
- page.find('text:contains("Hazardous Materials")').text
79
- ```
80
-
81
- ```python
82
- # Same phrase, case-insensitive
83
- page.find('text:contains("HAZARDOUS MATERIALS")', case=False).text
84
- ```
85
-
86
- ```python
87
- # Regular expression (digits, a comma, then a four-digit number)
88
- regex = r"\d+, \d{4}"
89
- page.find(f'text:contains("{regex}")', regex=True)
90
- ```
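As a rough illustration of what that pattern matches, here is a plain-Python sketch using only the standard `re` module. It does not touch natural-pdf, and the sample lines are hypothetical rather than taken from the practice PDF.

```python
import re

# Raw string so the backslashes reach the regex engine unchanged.
pattern = re.compile(r"\d+, \d{4}")

# Hypothetical sample lines, not taken from the practice PDF.
lines = ["Date: February 3, 1905", "Violation Count: 7", "Site: Durham"]

# Keep only the lines that contain "digits, comma, four digits".
matches = [line for line in lines if pattern.search(line)]
print(matches)
```

Writing the pattern as a raw string (`r"..."`) avoids Python treating `\d` as an invalid string escape.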
91
-
92
- ```python
93
- # Attribute selectors: match on font name and size
94
- page.find_all('text[fontname="Helvetica"][size=10]')
95
- ```
96
-
97
- ## Regions
98
-
99
- ```python
100
- # Region below an element (e.g., below "Summary")
101
- # Adjust selector as needed
102
- page.find('text:contains("Summary")').below(include_element=True).show()
103
- ```
104
-
105
- ```python
106
- (
107
- page
108
- .find('text:contains("Summary")')
109
- .below(include_element=True)
110
- .extract_text()
111
- [:500]
112
- )
113
- ```
114
-
115
- ```python
116
- (
117
- page
118
- .find('text:contains("Summary")')
119
- .below(include_element=True, until='line:horizontal')
120
- .show()
121
- )
122
- ```
123
-
124
- ```python
125
- # Manually defined region via coordinates (x0, top, x1, bottom)
126
- manual_region = page.create_region(30, 60, 600, 300)
127
- manual_region.show()
128
- ```
129
-
130
- ```python
131
- # Extract text from the manual region
132
- manual_region.extract_text()[:500]
133
- ```
134
-
135
- ## Filtering Out Headers and Footers
136
-
137
- Use Exclusion Zones to remove unwanted content before extraction. *Adjust selectors for typical header/footer content.*
138
-
139
- ```python
140
- header_content = page.find('rect')
141
- footer_content = page.find_all('line')[-1].below()
142
-
143
- header_content.highlight()
144
- footer_content.highlight()
145
- page.to_image()
146
- ```
147
-
148
- ```python
149
- page.extract_text()[:500]
150
- ```
151
-
152
- ```python
153
- page.add_exclusion(header_content)
154
- page.add_exclusion(footer_content)
155
- ```
156
-
157
- ```python
158
- page.extract_text()[:500]
159
- ```
160
-
161
- ```python
162
- full_text_no_exclusions = page.extract_text(use_exclusions=False)
163
- clean_text = page.extract_text()
164
- f"Original length: {len(full_text_no_exclusions)}, Excluded length: {len(clean_text)}"
165
- ```
166
-
167
- ```python
168
- page.clear_exclusions()
169
- ```
170
-
171
- *Exclusions can also be defined globally at the PDF level using `pdf.add_exclusion()` with a function.*
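The idea behind exclusion zones can be sketched in plain Python: drop any word whose bounding box overlaps an excluded rectangle. This is only a conceptual sketch under assumed data; the `Box` type, the `overlaps` helper, and the sample words are hypothetical and are not natural-pdf's implementation.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x0: float
    top: float
    x1: float
    bottom: float


def overlaps(a: Box, b: Box) -> bool:
    """Return True when the two rectangles share any area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.top < b.bottom and b.top < a.bottom


# Hypothetical words with bounding boxes (text, box).
words = [
    ("CONFIDENTIAL", Box(50, 10, 200, 30)),  # sits in the header band
    ("Summary", Box(50, 100, 120, 115)),     # body text
    ("Page 1", Box(280, 760, 320, 775)),     # sits in the footer band
]

# Hypothetical exclusion zones: a header band and a footer band.
exclusions = [Box(0, 0, 612, 40), Box(0, 750, 612, 792)]

# Keep only the words that touch no exclusion zone.
kept = [
    text
    for text, box in words
    if not any(overlaps(box, zone) for zone in exclusions)
]
print(kept)
```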
172
-
173
- ## Controlling Whitespace
174
-
175
- Manage how spaces and blank lines are handled during extraction using `layout`.
176
-
177
- ```python
178
- print(page.extract_text())
179
- ```
180
-
181
- ```python
182
- print(page.extract_text(use_exclusions=False, layout=True))
183
- ```
184
-
185
- ## Font Information Access
186
-
187
- Inspect font details of text elements.
188
-
189
- ```python
190
- # Grab a text element to inspect (index 1, the second on the page)
191
- first_text = page.find_all('text')[1]
192
- first_text # Display basic info
193
- ```
194
-
195
- ```python
196
- # Highlight the selected text element
197
- first_text.show()
198
- ```
199
-
200
- ```python
201
- # Get detailed font properties dictionary
202
- first_text.font_info()
203
- ```
204
-
205
- ```python
206
- # Check specific style properties directly
207
- f"Is Bold: {first_text.bold}, Is Italic: {first_text.italic}, Font: {first_text.fontname}, Size: {first_text.size}"
208
- ```
209
-
210
- ```python
211
- # Find elements by font attributes (adjust selectors)
212
- # Example: Find Helvetica fonts
213
- helvetica_text = page.find_all('text[fontname*=Helvetica]')
214
- helvetica_text # Display list of found elements
215
- ```
216
-
217
- ```python
218
- # Example: Find large text (e.g., size >= 12)
219
- large_text = page.find_all('text[size>=12]')
220
- large_text
221
- ```
222
-
223
- ```python
224
- # Example: Find bold text
225
- bold_text = page.find_all('text:bold')
226
- bold_text
227
- ```
228
-
229
- ## Working with Font Styles
230
-
231
- Analyze and group text elements by their computed font *style*, which combines attributes like font name, size, boldness, etc., into logical groups.
232
-
233
- ```python
234
- # Analyze styles on the page
235
- # This returns a dictionary mapping style names to ElementList objects
236
- page.analyze_text_styles()
237
- page.text_style_labels
238
- ```
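One way to picture style grouping is to build a label from each element's attributes and bucket elements by that label. The sketch below is plain Python over hypothetical dictionaries. The `"8.0pt Helvetica"` label shape follows the guide's own example, but the `style_label` helper and the bold label format are assumptions, not the library's actual logic.

```python
from collections import defaultdict

# Hypothetical text elements with the attributes a style is built from.
elements = [
    {"text": "Inspection Report", "fontname": "Helvetica", "size": 10.0, "bold": True},
    {"text": "Date:", "fontname": "Helvetica", "size": 8.0, "bold": False},
    {"text": "Site:", "fontname": "Helvetica", "size": 8.0, "bold": False},
]


def style_label(el: dict) -> str:
    """Build a label such as '8.0pt Helvetica' (the bold format is assumed)."""
    label = f"{el['size']:.1f}pt "
    if el["bold"]:
        label += "Bold "
    return label + el["fontname"]


# Bucket element texts by their computed style label.
groups = defaultdict(list)
for el in elements:
    groups[style_label(el)].append(el["text"])

for label, texts in groups.items():
    print(label, texts)
```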
239
-
240
- ```python
241
- page.find_all('text').highlight(group_by='style_label').to_image()
242
- ```
243
-
244
- ```python
245
- page.find_all('text[style_label="8.0pt Helvetica"]')
246
- ```
247
-
248
- ```python
249
- page.find_all('text[fontname="Helvetica"][size=8]')
250
- ```
251
-
252
- *Font variants (e.g., `AAAAAB+FontName`) are also accessible via the `font-variant` attribute selector: `page.find_all('text[font-variant="AAAAAB"]')`.*
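For context, a name like `AAAAAB+FontName` is a subset-font tag: six uppercase letters, a plus sign, then the base font name. A small plain-Python sketch of splitting the two parts follows; the `split_subset_prefix` helper is hypothetical and is not part of natural-pdf.

```python
from typing import Optional, Tuple


def split_subset_prefix(fontname: str) -> Tuple[Optional[str], str]:
    """Split 'AAAAAB+Helvetica' into ('AAAAAB', 'Helvetica').

    Returns (None, fontname) when there is no six-letter uppercase tag.
    """
    prefix, sep, base = fontname.partition("+")
    if sep and len(prefix) == 6 and prefix.isalpha() and prefix.isupper():
        return prefix, base
    return None, fontname


print(split_subset_prefix("AAAAAB+Helvetica"))
print(split_subset_prefix("Helvetica"))
```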
253
-
254
- ## Reading Order
255
-
256
- Text extraction respects a pathetic attempt at natural reading order (top-to-bottom, left-to-right by default). `page.find_all('text')` returns elements already sorted this way.
257
-
258
- ```python
259
- # Get first 5 text elements in reading order
260
- elements_in_order = page.find_all('text')
261
- elements_in_order[:5]
262
- ```
263
-
264
- ```python
265
- # Text extracted via page.extract_text() respects this order automatically
266
- # (Result already shown in Basic Text Extraction section)
267
- page.extract_text()[:100]
268
- ```
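A top-to-bottom, left-to-right ordering can be approximated by sorting on the `top` coordinate and then on `x0`. The sketch below uses hypothetical dictionaries and is not the library's implementation. A real implementation would likely need a tolerance so that words on the same visual line with slightly different `top` values stay together.

```python
# Hypothetical text elements, deliberately out of order.
elements = [
    {"text": "world", "top": 10, "x0": 80},
    {"text": "Second line", "top": 30, "x0": 20},
    {"text": "Hello", "top": 10, "x0": 20},
]

# Sort by vertical position first, then by horizontal position.
ordered = sorted(elements, key=lambda e: (e["top"], e["x0"]))
texts = [e["text"] for e in ordered]
print(texts)
```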
269
-
270
- ## Element Navigation
271
-
272
- Move between elements sequentially based on reading order using `.next()` and `.previous()`.
273
-
274
- ```python
275
- page.clear_highlights()
276
-
277
- start = page.find('text:contains("Date")')
278
- start.highlight(label='Date label')
279
- start.next().highlight(label='Maybe the date', color='green')
280
- start.next(r'text:contains("\d")', regex=True).highlight(label='Probably the date')
281
-
282
- page.to_image()
283
- ```
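The difference between a plain "next element" and "next element matching a pattern" can be sketched over a list of strings that is already in reading order. The `next_matching` helper and the sample texts are hypothetical; this only illustrates the idea and is not how `.next()` is implemented.

```python
import re
from typing import List, Optional


def next_matching(texts: List[str], start_index: int, pattern: str) -> Optional[str]:
    """Return the first text after start_index that matches the regex."""
    regex = re.compile(pattern)
    for text in texts[start_index + 1:]:
        if regex.search(text):
            return text
    return None


# Hypothetical texts in reading order.
texts = ["Date:", "Site:", "February 3, 1905"]

plain_next = texts[1]                          # simply the following element
digit_next = next_matching(texts, 0, r"\d")    # first later element containing a digit
print(plain_next, "|", digit_next)
```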
284
-
285
- ## Next Steps
286
-
287
- Now that you know how to extract text, you might want to explore:
288
-
289
- - [Working with regions](../regions/index.ipynb) for more precise extraction
290
- - [OCR capabilities](../ocr/index.md) for scanned documents
291
- - [Document layout analysis](../layout-analysis/index.ipynb) for automatic structure detection
292
- - [Document QA](../document-qa/index.ipynb) for asking questions directly to your documents