natural-pdf 0.1.8__py3-none-any.whl → 0.1.9__py3-none-any.whl

Files changed (134)
  1. natural_pdf/__init__.py +1 -0
  2. natural_pdf/analyzers/layout/base.py +1 -5
  3. natural_pdf/analyzers/layout/gemini.py +61 -51
  4. natural_pdf/analyzers/layout/layout_analyzer.py +40 -11
  5. natural_pdf/analyzers/layout/layout_manager.py +26 -84
  6. natural_pdf/analyzers/layout/layout_options.py +7 -0
  7. natural_pdf/analyzers/layout/pdfplumber_table_finder.py +142 -0
  8. natural_pdf/analyzers/layout/surya.py +46 -123
  9. natural_pdf/analyzers/layout/tatr.py +51 -4
  10. natural_pdf/analyzers/text_structure.py +3 -5
  11. natural_pdf/analyzers/utils.py +3 -3
  12. natural_pdf/classification/manager.py +230 -151
  13. natural_pdf/classification/mixin.py +49 -35
  14. natural_pdf/classification/results.py +64 -46
  15. natural_pdf/collections/mixins.py +68 -20
  16. natural_pdf/collections/pdf_collection.py +177 -64
  17. natural_pdf/core/element_manager.py +30 -14
  18. natural_pdf/core/highlighting_service.py +13 -22
  19. natural_pdf/core/page.py +423 -101
  20. natural_pdf/core/pdf.py +633 -190
  21. natural_pdf/elements/base.py +134 -40
  22. natural_pdf/elements/collections.py +503 -131
  23. natural_pdf/elements/region.py +659 -90
  24. natural_pdf/elements/text.py +1 -1
  25. natural_pdf/export/mixin.py +137 -0
  26. natural_pdf/exporters/base.py +3 -3
  27. natural_pdf/exporters/paddleocr.py +4 -3
  28. natural_pdf/extraction/manager.py +50 -49
  29. natural_pdf/extraction/mixin.py +90 -57
  30. natural_pdf/extraction/result.py +9 -23
  31. natural_pdf/ocr/__init__.py +5 -5
  32. natural_pdf/ocr/engine_doctr.py +346 -0
  33. natural_pdf/ocr/ocr_factory.py +24 -4
  34. natural_pdf/ocr/ocr_manager.py +61 -25
  35. natural_pdf/ocr/ocr_options.py +70 -10
  36. natural_pdf/ocr/utils.py +6 -4
  37. natural_pdf/search/__init__.py +20 -34
  38. natural_pdf/search/haystack_search_service.py +309 -265
  39. natural_pdf/search/haystack_utils.py +99 -75
  40. natural_pdf/search/search_service_protocol.py +11 -12
  41. natural_pdf/selectors/parser.py +219 -143
  42. natural_pdf/utils/debug.py +3 -3
  43. natural_pdf/utils/identifiers.py +1 -1
  44. natural_pdf/utils/locks.py +1 -1
  45. natural_pdf/utils/packaging.py +8 -6
  46. natural_pdf/utils/text_extraction.py +24 -16
  47. natural_pdf/utils/tqdm_utils.py +18 -10
  48. natural_pdf/utils/visualization.py +18 -0
  49. natural_pdf/widgets/viewer.py +4 -25
  50. {natural_pdf-0.1.8.dist-info → natural_pdf-0.1.9.dist-info}/METADATA +12 -3
  51. natural_pdf-0.1.9.dist-info/RECORD +80 -0
  52. {natural_pdf-0.1.8.dist-info → natural_pdf-0.1.9.dist-info}/WHEEL +1 -1
  53. {natural_pdf-0.1.8.dist-info → natural_pdf-0.1.9.dist-info}/top_level.txt +0 -2
  54. docs/api/index.md +0 -386
  55. docs/assets/favicon.png +0 -3
  56. docs/assets/favicon.svg +0 -3
  57. docs/assets/javascripts/custom.js +0 -17
  58. docs/assets/logo.svg +0 -3
  59. docs/assets/sample-screen.png +0 -0
  60. docs/assets/social-preview.png +0 -17
  61. docs/assets/social-preview.svg +0 -17
  62. docs/assets/stylesheets/custom.css +0 -65
  63. docs/categorizing-documents/index.md +0 -168
  64. docs/data-extraction/index.md +0 -87
  65. docs/document-qa/index.ipynb +0 -435
  66. docs/document-qa/index.md +0 -79
  67. docs/element-selection/index.ipynb +0 -969
  68. docs/element-selection/index.md +0 -249
  69. docs/finetuning/index.md +0 -176
  70. docs/index.md +0 -189
  71. docs/installation/index.md +0 -69
  72. docs/interactive-widget/index.ipynb +0 -962
  73. docs/interactive-widget/index.md +0 -12
  74. docs/layout-analysis/index.ipynb +0 -818
  75. docs/layout-analysis/index.md +0 -185
  76. docs/ocr/index.md +0 -256
  77. docs/pdf-navigation/index.ipynb +0 -314
  78. docs/pdf-navigation/index.md +0 -97
  79. docs/regions/index.ipynb +0 -816
  80. docs/regions/index.md +0 -294
  81. docs/tables/index.ipynb +0 -658
  82. docs/tables/index.md +0 -144
  83. docs/text-analysis/index.ipynb +0 -370
  84. docs/text-analysis/index.md +0 -105
  85. docs/text-extraction/index.ipynb +0 -1478
  86. docs/text-extraction/index.md +0 -292
  87. docs/tutorials/01-loading-and-extraction.ipynb +0 -1873
  88. docs/tutorials/01-loading-and-extraction.md +0 -95
  89. docs/tutorials/02-finding-elements.ipynb +0 -417
  90. docs/tutorials/02-finding-elements.md +0 -149
  91. docs/tutorials/03-extracting-blocks.ipynb +0 -152
  92. docs/tutorials/03-extracting-blocks.md +0 -48
  93. docs/tutorials/04-table-extraction.ipynb +0 -119
  94. docs/tutorials/04-table-extraction.md +0 -50
  95. docs/tutorials/05-excluding-content.ipynb +0 -275
  96. docs/tutorials/05-excluding-content.md +0 -109
  97. docs/tutorials/06-document-qa.ipynb +0 -337
  98. docs/tutorials/06-document-qa.md +0 -91
  99. docs/tutorials/07-layout-analysis.ipynb +0 -293
  100. docs/tutorials/07-layout-analysis.md +0 -66
  101. docs/tutorials/07-working-with-regions.ipynb +0 -414
  102. docs/tutorials/07-working-with-regions.md +0 -151
  103. docs/tutorials/08-spatial-navigation.ipynb +0 -513
  104. docs/tutorials/08-spatial-navigation.md +0 -190
  105. docs/tutorials/09-section-extraction.ipynb +0 -2439
  106. docs/tutorials/09-section-extraction.md +0 -256
  107. docs/tutorials/10-form-field-extraction.ipynb +0 -517
  108. docs/tutorials/10-form-field-extraction.md +0 -201
  109. docs/tutorials/11-enhanced-table-processing.ipynb +0 -59
  110. docs/tutorials/11-enhanced-table-processing.md +0 -9
  111. docs/tutorials/12-ocr-integration.ipynb +0 -3712
  112. docs/tutorials/12-ocr-integration.md +0 -137
  113. docs/tutorials/13-semantic-search.ipynb +0 -1718
  114. docs/tutorials/13-semantic-search.md +0 -77
  115. docs/visual-debugging/index.ipynb +0 -2970
  116. docs/visual-debugging/index.md +0 -157
  117. docs/visual-debugging/region.png +0 -0
  118. natural_pdf/templates/finetune/fine_tune_paddleocr.md +0 -420
  119. natural_pdf/templates/spa/css/style.css +0 -334
  120. natural_pdf/templates/spa/index.html +0 -31
  121. natural_pdf/templates/spa/js/app.js +0 -472
  122. natural_pdf/templates/spa/words.txt +0 -235976
  123. natural_pdf/widgets/frontend/viewer.js +0 -88
  124. natural_pdf-0.1.8.dist-info/RECORD +0 -156
  125. notebooks/Examples.ipynb +0 -1293
  126. pdfs/.gitkeep +0 -0
  127. pdfs/01-practice.pdf +0 -543
  128. pdfs/0500000US42001.pdf +0 -0
  129. pdfs/0500000US42007.pdf +0 -0
  130. pdfs/2014 Statistics.pdf +0 -0
  131. pdfs/2019 Statistics.pdf +0 -0
  132. pdfs/Atlanta_Public_Schools_GA_sample.pdf +0 -0
  133. pdfs/needs-ocr.pdf +0 -0
  134. {natural_pdf-0.1.8.dist-info → natural_pdf-0.1.9.dist-info}/licenses/LICENSE +0 -0
docs/pdf-navigation/index.ipynb +0 -314
@@ -1,314 +0,0 @@
- {
- "cells": [
- {
- "cell_type": "markdown",
- "id": "bba1860e",
- "metadata": {},
- "source": [
- "# PDF Navigation\n",
- "\n",
- "This guide covers the basics of working with PDFs in Natural PDF - opening documents, accessing pages, and navigating through content.\n",
- "\n",
- "## Opening a PDF\n",
- "\n",
- "The main entry point to Natural PDF is the `PDF` class:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "id": "56d12ab5",
- "metadata": {
- "execution": {
- "iopub.execute_input": "2025-04-03T14:50:38.434157Z",
- "iopub.status.busy": "2025-04-03T14:50:38.433170Z",
- "iopub.status.idle": "2025-04-03T14:50:49.768101Z",
- "shell.execute_reply": "2025-04-03T14:50:49.767384Z"
- }
- },
- "outputs": [],
- "source": [
- "from natural_pdf import PDF\n",
- "\n",
- "# Open a PDF file\n",
- "pdf = PDF(\"https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/0500000US42001.pdf\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "c425482a",
- "metadata": {},
- "source": [
- "## Accessing Pages\n",
- "\n",
- "Once you have a PDF object, you can access its pages:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "id": "a3405aa9",
- "metadata": {
- "execution": {
- "iopub.execute_input": "2025-04-03T14:50:49.770604Z",
- "iopub.status.busy": "2025-04-03T14:50:49.770419Z",
- "iopub.status.idle": "2025-04-03T14:50:50.700808Z",
- "shell.execute_reply": "2025-04-03T14:50:50.699634Z"
- }
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "This PDF has 153 pages\n",
- "Page 1 has 985 characters\n",
- "Page 2 has 778 characters\n",
- "Page 3 has 522 characters\n",
- "Page 4 has 984 characters\n",
- "Page 5 has 778 characters\n",
- "Page 6 has 523 characters\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Page 7 has 982 characters\n",
- "Page 8 has 772 characters\n",
- "Page 9 has 522 characters\n",
- "Page 10 has 1008 characters\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Page 11 has 796 characters\n",
- "Page 12 has 532 characters\n",
- "Page 13 has 986 characters\n",
- "Page 14 has 780 characters\n",
- "Page 15 has 523 characters\n",
- "Page 16 has 990 characters\n",
- "Page 17 has 782 characters\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Page 18 has 520 characters\n",
- "Page 19 has 1006 characters\n",
- "Page 20 has 795 characters\n"
- ]
- }
- ],
- "source": [
- "# Get the total number of pages\n",
- "num_pages = len(pdf)\n",
- "print(f\"This PDF has {num_pages} pages\")\n",
- "\n",
- "# Get a specific page (0-indexed)\n",
- "first_page = pdf.pages[0]\n",
- "last_page = pdf.pages[-1]\n",
- "\n",
- "# Iterate through the first 20 pages\n",
- "for page in pdf.pages[:20]:\n",
- "    print(f\"Page {page.number} has {len(page.extract_text())} characters\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "2eca7327",
- "metadata": {},
- "source": [
- "## Page Properties\n",
- "\n",
- "Each `Page` object has useful properties:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "id": "348f28d7",
- "metadata": {
- "execution": {
- "iopub.execute_input": "2025-04-03T14:50:50.713325Z",
- "iopub.status.busy": "2025-04-03T14:50:50.711638Z",
- "iopub.status.idle": "2025-04-03T14:50:50.738737Z",
- "shell.execute_reply": "2025-04-03T14:50:50.726839Z"
- }
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "612 792\n",
- "20\n",
- "19\n"
- ]
- }
- ],
- "source": [
- "# Page dimensions in points (1/72 inch)\n",
- "print(page.width, page.height)\n",
- "\n",
- "# Page number (1-indexed as shown in PDF viewers)\n",
- "print(page.number)\n",
- "\n",
- "# Page index (0-indexed position in the PDF)\n",
- "print(page.index)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "c7cf1839",
- "metadata": {},
- "source": [
- "## Working Across Pages\n",
- "\n",
- "Natural PDF makes it easy to work with content across multiple pages:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "id": "71a8f1ec",
- "metadata": {
- "execution": {
- "iopub.execute_input": "2025-04-03T14:50:50.765495Z",
- "iopub.status.busy": "2025-04-03T14:50:50.764444Z",
- "iopub.status.idle": "2025-04-03T14:50:57.735494Z",
- "shell.execute_reply": "2025-04-03T14:50:57.726489Z"
- }
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "<natural_pdf.core.pdf.PDF at 0x1045224d0>"
- ]
- },
- "execution_count": 4,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Extract text from all pages\n",
- "all_text = pdf.extract_text()\n",
- "\n",
- "# Find elements across all pages\n",
- "all_headings = pdf.find_all('text[size>=14]:bold')\n",
- "\n",
- "# Add exclusion zones to all pages (like headers/footers)\n",
- "pdf.add_exclusion(\n",
- "    lambda page: page.find('text:contains(\"CONFIDENTIAL\")').above() if page.find('text:contains(\"CONFIDENTIAL\")') else None,\n",
- "    label=\"header\"\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "e18051a4",
- "metadata": {},
- "source": [
- "## The Page Collection\n",
- "\n",
- "The `pdf.pages` object is a `PageCollection` that allows batch operations on pages:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "id": "e5f1c662",
- "metadata": {
- "execution": {
- "iopub.execute_input": "2025-04-03T14:50:57.752240Z",
- "iopub.status.busy": "2025-04-03T14:50:57.751868Z",
- "iopub.status.idle": "2025-04-03T14:50:57.770738Z",
- "shell.execute_reply": "2025-04-03T14:50:57.759415Z"
- }
- },
- "outputs": [],
- "source": [
- "# Extract text from specific pages\n",
- "text = pdf.pages[2:5].extract_text()\n",
- "\n",
- "# Find elements across specific pages\n",
- "elements = pdf.pages[2:5].find_all('text:contains(\"Annual Report\")')"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9713e392",
- "metadata": {},
- "source": [
- "## Document Sections Across Pages\n",
- "\n",
- "You can extract sections that span across multiple pages:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "id": "d5b89a2b",
- "metadata": {
- "execution": {
- "iopub.execute_input": "2025-04-03T14:50:57.782621Z",
- "iopub.status.busy": "2025-04-03T14:50:57.781776Z",
- "iopub.status.idle": "2025-04-03T14:50:57.811508Z",
- "shell.execute_reply": "2025-04-03T14:50:57.805310Z"
- }
- },
- "outputs": [],
- "source": [
- "# Get sections with headings as section starts\n",
- "sections = pdf.pages.get_sections(\n",
- "    start_elements='text[size>=14]:bold',\n",
- "    new_section_on_page_break=False\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "f51594ce",
- "metadata": {},
- "source": [
- "## Next Steps\n",
- "\n",
- "Now that you know how to navigate PDFs, you can:\n",
- "\n",
- "- [Find elements using selectors](../element-selection/index.ipynb)\n",
- "- [Extract text from your documents](../text-extraction/index.ipynb)\n",
- "- [Work with specific regions](../regions/index.ipynb)"
- ]
- }
- ],
- "metadata": {
- "jupytext": {
- "cell_metadata_filter": "-all",
- "main_language": "python",
- "notebook_metadata_filter": "-all",
- "text_representation": {
- "extension": ".md",
- "format_name": "markdown"
- }
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.10.13"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
- }
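The recorded output of the removed notebook's Page Properties cell (`612 792`, then `20` and `19`) can be read with a little arithmetic. The sketch below is not part of either release. It assumes only what the cell's comments state: dimensions are in points of 1/72 inch, `page.number` is 1-indexed, and `page.index` is 0-indexed. `POINTS_PER_INCH` and `points_to_inches` are hypothetical names introduced for illustration.

```python
# Hypothetical helper, not part of natural-pdf: convert PDF points to inches.
POINTS_PER_INCH = 72


def points_to_inches(points):
    """Convert a length in PDF points (1/72 inch) to inches."""
    return points / POINTS_PER_INCH


# Values taken from the notebook's recorded output ("612 792").
width_in = points_to_inches(612)
height_in = points_to_inches(792)
print(width_in, height_in)  # prints: 8.5 11.0

# The recorded output also shows number 20 next to index 19, which is
# consistent with number == index + 1 (an assumption drawn from that output).
page_index = 19
page_number = page_index + 1
```

A 612 x 792 point page is therefore 8.5 x 11 inches, which is US Letter size.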
docs/pdf-navigation/index.md +0 -97
@@ -1,97 +0,0 @@
- # PDF Navigation
-
- This guide covers the basics of working with PDFs in Natural PDF - opening documents, accessing pages, and navigating through content.
-
- ## Opening a PDF
-
- The main entry point to Natural PDF is the `PDF` class:
-
- ```python
- from natural_pdf import PDF
-
- # Open a PDF file
- pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/0500000US42001.pdf")
- ```
-
- ## Accessing Pages
-
- Once you have a PDF object, you can access its pages:
-
- ```python
- # Get the total number of pages
- num_pages = len(pdf)
- print(f"This PDF has {num_pages} pages")
-
- # Get a specific page (0-indexed)
- first_page = pdf.pages[0]
- last_page = pdf.pages[-1]
-
- # Iterate through the first 20 pages
- for page in pdf.pages[:20]:
-     print(f"Page {page.number} has {len(page.extract_text())} characters")
- ```
-
- ## Page Properties
-
- Each `Page` object has useful properties:
-
- ```python
- # Page dimensions in points (1/72 inch)
- print(page.width, page.height)
-
- # Page number (1-indexed as shown in PDF viewers)
- print(page.number)
-
- # Page index (0-indexed position in the PDF)
- print(page.index)
- ```
-
- ## Working Across Pages
-
- Natural PDF makes it easy to work with content across multiple pages:
-
- ```python
- # Extract text from all pages
- all_text = pdf.extract_text()
-
- # Find elements across all pages
- all_headings = pdf.find_all('text[size>=14]:bold')
-
- # Add exclusion zones to all pages (like headers/footers)
- pdf.add_exclusion(
-     lambda page: page.find('text:contains("CONFIDENTIAL")').above() if page.find('text:contains("CONFIDENTIAL")') else None,
-     label="header"
- )
- ```
-
- ## The Page Collection
-
- The `pdf.pages` object is a `PageCollection` that allows batch operations on pages:
-
- ```python
- # Extract text from specific pages
- text = pdf.pages[2:5].extract_text()
-
- # Find elements across specific pages
- elements = pdf.pages[2:5].find_all('text:contains("Annual Report")')
- ```
-
- ## Document Sections Across Pages
-
- You can extract sections that span across multiple pages:
-
- ```python
- # Get sections with headings as section starts
- sections = pdf.pages.get_sections(
-     start_elements='text[size>=14]:bold',
-     new_section_on_page_break=False
- )
- ```
-
- ## Next Steps
-
- Now that you know how to navigate PDFs, you can:
-
- - [Find elements using selectors](../element-selection/index.ipynb)
- - [Extract text from your documents](../text-extraction/index.ipynb)
- - [Work with specific regions](../regions/index.ipynb)
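The exclusion callback in the removed guide calls `page.find(...)` twice per page, once to test for a match and once to use it. Below is a minimal sketch of a single-lookup variant. It assumes only what the removed guide shows: `page.find` returns a falsy value when nothing matches, and a match exposes `.above()`. `FakePage`, `FakeElement`, and `header_exclusion` are hypothetical stand-ins so the snippet runs without natural-pdf installed; they are not the library's classes.

```python
class FakeElement:
    """Hypothetical stand-in for an element returned by page.find()."""

    def __init__(self, text):
        self.text = text

    def above(self):
        # The real method would return a region; a string is enough here.
        return f"region above {self.text!r}"


class FakePage:
    """Hypothetical stand-in for a page that counts find() calls."""

    def __init__(self, texts):
        self.texts = texts
        self.find_calls = 0

    def find(self, selector):
        self.find_calls += 1
        # Crude matching for this sketch: take the quoted needle from the
        # selector and look for it in each text item.
        needle = selector.split('"')[1]
        for text in self.texts:
            if needle in text:
                return FakeElement(text)
        return None


def header_exclusion(page):
    """Return the region above the CONFIDENTIAL marker, or None if absent."""
    marker = page.find('text:contains("CONFIDENTIAL")')
    return marker.above() if marker else None


with_marker = FakePage(["CONFIDENTIAL", "Body text"])
without_marker = FakePage(["Body text"])
result_with = header_exclusion(with_marker)
result_without = header_exclusion(without_marker)
```

With natural-pdf installed, a function of this shape could presumably be passed in place of the lambda, as in `pdf.add_exclusion(header_exclusion, label="header")`, mirroring the call shown in the removed guide.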