natural-pdf 0.1.7__py3-none-any.whl → 0.1.9__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (134)
  1. natural_pdf/__init__.py +3 -0
  2. natural_pdf/analyzers/layout/base.py +1 -5
  3. natural_pdf/analyzers/layout/gemini.py +61 -51
  4. natural_pdf/analyzers/layout/layout_analyzer.py +40 -11
  5. natural_pdf/analyzers/layout/layout_manager.py +26 -84
  6. natural_pdf/analyzers/layout/layout_options.py +7 -0
  7. natural_pdf/analyzers/layout/pdfplumber_table_finder.py +142 -0
  8. natural_pdf/analyzers/layout/surya.py +46 -123
  9. natural_pdf/analyzers/layout/tatr.py +51 -4
  10. natural_pdf/analyzers/text_structure.py +3 -5
  11. natural_pdf/analyzers/utils.py +3 -3
  12. natural_pdf/classification/manager.py +422 -0
  13. natural_pdf/classification/mixin.py +163 -0
  14. natural_pdf/classification/results.py +80 -0
  15. natural_pdf/collections/mixins.py +111 -0
  16. natural_pdf/collections/pdf_collection.py +434 -15
  17. natural_pdf/core/element_manager.py +83 -0
  18. natural_pdf/core/highlighting_service.py +13 -22
  19. natural_pdf/core/page.py +578 -93
  20. natural_pdf/core/pdf.py +912 -460
  21. natural_pdf/elements/base.py +134 -40
  22. natural_pdf/elements/collections.py +712 -109
  23. natural_pdf/elements/region.py +722 -69
  24. natural_pdf/elements/text.py +4 -1
  25. natural_pdf/export/mixin.py +137 -0
  26. natural_pdf/exporters/base.py +3 -3
  27. natural_pdf/exporters/paddleocr.py +5 -4
  28. natural_pdf/extraction/manager.py +135 -0
  29. natural_pdf/extraction/mixin.py +279 -0
  30. natural_pdf/extraction/result.py +23 -0
  31. natural_pdf/ocr/__init__.py +5 -5
  32. natural_pdf/ocr/engine_doctr.py +346 -0
  33. natural_pdf/ocr/engine_easyocr.py +6 -3
  34. natural_pdf/ocr/ocr_factory.py +24 -4
  35. natural_pdf/ocr/ocr_manager.py +122 -26
  36. natural_pdf/ocr/ocr_options.py +94 -11
  37. natural_pdf/ocr/utils.py +19 -6
  38. natural_pdf/qa/document_qa.py +0 -4
  39. natural_pdf/search/__init__.py +20 -34
  40. natural_pdf/search/haystack_search_service.py +309 -265
  41. natural_pdf/search/haystack_utils.py +99 -75
  42. natural_pdf/search/search_service_protocol.py +11 -12
  43. natural_pdf/selectors/parser.py +431 -230
  44. natural_pdf/utils/debug.py +3 -3
  45. natural_pdf/utils/identifiers.py +1 -1
  46. natural_pdf/utils/locks.py +8 -0
  47. natural_pdf/utils/packaging.py +8 -6
  48. natural_pdf/utils/text_extraction.py +60 -1
  49. natural_pdf/utils/tqdm_utils.py +51 -0
  50. natural_pdf/utils/visualization.py +18 -0
  51. natural_pdf/widgets/viewer.py +4 -25
  52. {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/METADATA +17 -3
  53. natural_pdf-0.1.9.dist-info/RECORD +80 -0
  54. {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/WHEEL +1 -1
  55. {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/top_level.txt +0 -2
  56. docs/api/index.md +0 -386
  57. docs/assets/favicon.png +0 -3
  58. docs/assets/favicon.svg +0 -3
  59. docs/assets/javascripts/custom.js +0 -17
  60. docs/assets/logo.svg +0 -3
  61. docs/assets/sample-screen.png +0 -0
  62. docs/assets/social-preview.png +0 -17
  63. docs/assets/social-preview.svg +0 -17
  64. docs/assets/stylesheets/custom.css +0 -65
  65. docs/document-qa/index.ipynb +0 -435
  66. docs/document-qa/index.md +0 -79
  67. docs/element-selection/index.ipynb +0 -915
  68. docs/element-selection/index.md +0 -229
  69. docs/finetuning/index.md +0 -176
  70. docs/index.md +0 -170
  71. docs/installation/index.md +0 -69
  72. docs/interactive-widget/index.ipynb +0 -962
  73. docs/interactive-widget/index.md +0 -12
  74. docs/layout-analysis/index.ipynb +0 -818
  75. docs/layout-analysis/index.md +0 -185
  76. docs/ocr/index.md +0 -209
  77. docs/pdf-navigation/index.ipynb +0 -314
  78. docs/pdf-navigation/index.md +0 -97
  79. docs/regions/index.ipynb +0 -816
  80. docs/regions/index.md +0 -294
  81. docs/tables/index.ipynb +0 -658
  82. docs/tables/index.md +0 -144
  83. docs/text-analysis/index.ipynb +0 -370
  84. docs/text-analysis/index.md +0 -105
  85. docs/text-extraction/index.ipynb +0 -1478
  86. docs/text-extraction/index.md +0 -292
  87. docs/tutorials/01-loading-and-extraction.ipynb +0 -194
  88. docs/tutorials/01-loading-and-extraction.md +0 -95
  89. docs/tutorials/02-finding-elements.ipynb +0 -340
  90. docs/tutorials/02-finding-elements.md +0 -149
  91. docs/tutorials/03-extracting-blocks.ipynb +0 -147
  92. docs/tutorials/03-extracting-blocks.md +0 -48
  93. docs/tutorials/04-table-extraction.ipynb +0 -114
  94. docs/tutorials/04-table-extraction.md +0 -50
  95. docs/tutorials/05-excluding-content.ipynb +0 -270
  96. docs/tutorials/05-excluding-content.md +0 -109
  97. docs/tutorials/06-document-qa.ipynb +0 -332
  98. docs/tutorials/06-document-qa.md +0 -91
  99. docs/tutorials/07-layout-analysis.ipynb +0 -288
  100. docs/tutorials/07-layout-analysis.md +0 -66
  101. docs/tutorials/07-working-with-regions.ipynb +0 -413
  102. docs/tutorials/07-working-with-regions.md +0 -151
  103. docs/tutorials/08-spatial-navigation.ipynb +0 -508
  104. docs/tutorials/08-spatial-navigation.md +0 -190
  105. docs/tutorials/09-section-extraction.ipynb +0 -2434
  106. docs/tutorials/09-section-extraction.md +0 -256
  107. docs/tutorials/10-form-field-extraction.ipynb +0 -512
  108. docs/tutorials/10-form-field-extraction.md +0 -201
  109. docs/tutorials/11-enhanced-table-processing.ipynb +0 -54
  110. docs/tutorials/11-enhanced-table-processing.md +0 -9
  111. docs/tutorials/12-ocr-integration.ipynb +0 -604
  112. docs/tutorials/12-ocr-integration.md +0 -175
  113. docs/tutorials/13-semantic-search.ipynb +0 -1328
  114. docs/tutorials/13-semantic-search.md +0 -77
  115. docs/visual-debugging/index.ipynb +0 -2970
  116. docs/visual-debugging/index.md +0 -157
  117. docs/visual-debugging/region.png +0 -0
  118. natural_pdf/templates/finetune/fine_tune_paddleocr.md +0 -415
  119. natural_pdf/templates/spa/css/style.css +0 -334
  120. natural_pdf/templates/spa/index.html +0 -31
  121. natural_pdf/templates/spa/js/app.js +0 -472
  122. natural_pdf/templates/spa/words.txt +0 -235976
  123. natural_pdf/widgets/frontend/viewer.js +0 -88
  124. natural_pdf-0.1.7.dist-info/RECORD +0 -145
  125. notebooks/Examples.ipynb +0 -1293
  126. pdfs/.gitkeep +0 -0
  127. pdfs/01-practice.pdf +0 -543
  128. pdfs/0500000US42001.pdf +0 -0
  129. pdfs/0500000US42007.pdf +0 -0
  130. pdfs/2014 Statistics.pdf +0 -0
  131. pdfs/2019 Statistics.pdf +0 -0
  132. pdfs/Atlanta_Public_Schools_GA_sample.pdf +0 -0
  133. pdfs/needs-ocr.pdf +0 -0
  134. {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/licenses/LICENSE +0 -0
@@ -1,48 +0,0 @@
- # Extracting Text Blocks
-
- Often, you need a specific section, like a paragraph between two headings. You can find a starting element and select everything below it until an ending element.
-
- Let's extract the "Summary" section from `01-practice.pdf`. It starts after "Summary:" and ends before the thick horizontal line.
-
- ```python
- #%pip install "natural-pdf[all]"
- ```
-
-
- ```python
- from natural_pdf import PDF
-
- # Load the PDF and get the page
- pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
- page = pdf.pages[0]
-
- # Find the starting element ("Summary:")
- start_marker = page.find('text:contains("Summary:")')
-
- # Select elements below the start_marker, stopping *before*
- # the thick horizontal line (a line with height > 1).
- summary_elements = start_marker.below(
-     include_element=True, # Include the "Summary:" text itself
-     until="line[height > 1]"
- )
-
- # Visualize the elements found in this block
- summary_elements.highlight(color="lightgreen", label="Summary Block")
-
- # Extract and display the text from the collection of summary elements
- summary_elements.extract_text()
-
- ```
-
- ```python
- # Display the page image to see the visualization
- page.to_image()
- ```
-
- This selects the elements using `.below(until=...)` and extracts their text. The second code block displays the page image with the visualized section.
-
- <div class="admonition note">
- <p class="admonition-title">Selector Specificity</p>
-
- We used `line[height > 1]` to find the thick horizontal line. You might need to adjust selectors based on the specific PDF structure. Inspecting element properties can help you find reliable start and end markers.
- </div>
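The removed tutorial above selects everything below a start marker until an end marker. As a rough, library-free sketch of that idea (this is not natural-pdf's implementation; the `Element` dataclass, its fields, the `select_below` helper, and the sample elements are all hypothetical and exist only for illustration), elements can be sorted top to bottom and collected until the first one that matches the stop condition:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Element:
    """Hypothetical stand-in for a page element (not a natural-pdf class)."""
    kind: str      # e.g. "text" or "line"
    top: float     # distance from the top of the page
    height: float
    text: str = ""


def select_below(
    elements: List[Element],
    start: Element,
    until: Callable[[Element], bool],
    include_start: bool = True,
) -> List[Element]:
    """Collect elements at or below `start`, stopping before the first `until` match."""
    selected = []
    for el in sorted(elements, key=lambda e: e.top):
        # Skip anything above the marker, and the marker itself if excluded.
        if el.top < start.top or (el is start and not include_start):
            continue
        # Stop *before* the first element that satisfies the end condition.
        if el is not start and until(el):
            break
        selected.append(el)
    return selected


# Made-up sample data: a heading, a "Summary:" marker, body text, a thick rule.
page_elements = [
    Element("text", 10, 12, "Report header"),
    Element("text", 30, 12, "Summary:"),
    Element("text", 45, 12, "Body text of the summary."),
    Element("line", 70, 2),
    Element("text", 80, 12, "Next section"),
]

start = page_elements[1]
block = select_below(
    page_elements,
    start,
    until=lambda e: e.kind == "line" and e.height > 1,
)
summary_text = " ".join(e.text for e in block)
```

The `until` predicate plays the role of the tutorial's `line[height > 1]` selector: the thick line ends the block and is not included in it.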
@@ -1,114 +0,0 @@
- {
-  "cells": [
-   {
-    "cell_type": "markdown",
-    "id": "24111eee",
-    "metadata": {},
-    "source": [
-     "# Basic Table Extraction\n",
-     "\n",
-     "PDFs often contain tables, and `natural-pdf` provides methods to extract their data, building on `pdfplumber`'s capabilities.\n",
-     "\n",
-     "Let's extract the \"Violations\" table from our practice PDF."
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": 1,
-    "id": "75f17900",
-    "metadata": {
-     "execution": {
-      "iopub.execute_input": "2025-04-21T21:23:59.967091Z",
-      "iopub.status.busy": "2025-04-21T21:23:59.966933Z",
-      "iopub.status.idle": "2025-04-21T21:23:59.971753Z",
-      "shell.execute_reply": "2025-04-21T21:23:59.970980Z"
-     },
-     "lines_to_next_cell": 2
-    },
-    "outputs": [],
-    "source": [
-     "#%pip install \"natural-pdf[all]\""
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": 2,
-    "id": "f1b71280",
-    "metadata": {
-     "execution": {
-      "iopub.execute_input": "2025-04-21T21:23:59.974183Z",
-      "iopub.status.busy": "2025-04-21T21:23:59.973996Z",
-      "iopub.status.idle": "2025-04-21T21:24:06.847197Z",
-      "shell.execute_reply": "2025-04-21T21:24:06.846712Z"
-     }
-    },
-    "outputs": [],
-    "source": [
-     "from natural_pdf import PDF\n",
-     "\n",
-     "# Load a PDF\n",
-     "pdf = PDF(\"https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf\")\n",
-     "page = pdf.pages[0]\n",
-     "\n",
-     "# Use extract_tables() to find all tables on the page.\n",
-     "# It returns a list of tables, where each table is a list of lists.\n",
-     "tables_data = page.extract_tables()\n",
-     "\n",
-     "# Display the first table found\n",
-     "tables_data[0] if tables_data else \"No tables found\"\n",
-     "\n",
-     "# You can also visualize the general area of the first table \n",
-     "# by finding elements in that region\n",
-     "if tables_data:\n",
-     "    # Find a header element in the table\n",
-     "    statute_header = page.find('text:contains(\"Statute\")')\n",
-     "    if statute_header:\n",
-     "        # Show the area\n",
-     "        statute_header.below(height=100).highlight(color=\"green\", label=\"Table Area\")\n",
-     "        page.to_image()"
-    ]
-   },
-   {
-    "cell_type": "markdown",
-    "id": "5c80e397",
-    "metadata": {},
-    "source": [
-     "This code uses `page.extract_tables()` which attempts to automatically detect tables based on visual cues like lines and whitespace. The result is a list of lists, representing the rows and cells of the table.\n",
-     "\n",
-     "<div class=\"admonition note\">\n",
-     "<p class=\"admonition-title\">Table Settings and Limitations</p>\n",
-     "\n",
-     " The default `extract_tables()` works well for simple, clearly defined tables. However, it might struggle with:\n",
-     " * Tables without clear borders or lines.\n",
-     " * Complex merged cells.\n",
-     " * Tables spanning multiple pages.\n",
-     "\n",
-     " `pdfplumber` (and thus `natural-pdf`) allows passing `table_settings` dictionaries to `extract_tables()` for more control over the detection strategy (e.g., `\"vertical_strategy\": \"text\"`, `\"horizontal_strategy\": \"text\"`).\n",
-     "\n",
-     " For even more robust table detection, especially for tables without explicit lines, using Layout Analysis (like `page.analyze_layout(engine='tatr')`) first, finding the table `region`, and then calling `region.extract_table()` can yield better results. We'll explore layout analysis in a later tutorial.\n",
-     "</div> "
-    ]
-   }
-  ],
-  "metadata": {
-   "jupytext": {
-    "cell_metadata_filter": "-all",
-    "main_language": "python",
-    "notebook_metadata_filter": "-all"
-   },
-   "language_info": {
-    "codemirror_mode": {
-     "name": "ipython",
-     "version": 3
-    },
-    "file_extension": ".py",
-    "mimetype": "text/x-python",
-    "name": "python",
-    "nbconvert_exporter": "python",
-    "pygments_lexer": "ipython3",
-    "version": "3.10.13"
-   }
-  },
-  "nbformat": 4,
-  "nbformat_minor": 5
- }
@@ -1,50 +0,0 @@
- # Basic Table Extraction
-
- PDFs often contain tables, and `natural-pdf` provides methods to extract their data, building on `pdfplumber`'s capabilities.
-
- Let's extract the "Violations" table from our practice PDF.
-
- ```python
- #%pip install "natural-pdf[all]"
- ```
-
-
- ```python
- from natural_pdf import PDF
-
- # Load a PDF
- pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
- page = pdf.pages[0]
-
- # Use extract_tables() to find all tables on the page.
- # It returns a list of tables, where each table is a list of lists.
- tables_data = page.extract_tables()
-
- # Display the first table found
- tables_data[0] if tables_data else "No tables found"
-
- # You can also visualize the general area of the first table
- # by finding elements in that region
- if tables_data:
-     # Find a header element in the table
-     statute_header = page.find('text:contains("Statute")')
-     if statute_header:
-         # Show the area
-         statute_header.below(height=100).highlight(color="green", label="Table Area")
-         page.to_image()
- ```
-
- This code uses `page.extract_tables()` which attempts to automatically detect tables based on visual cues like lines and whitespace. The result is a list of lists, representing the rows and cells of the table.
-
- <div class="admonition note">
- <p class="admonition-title">Table Settings and Limitations</p>
-
- The default `extract_tables()` works well for simple, clearly defined tables. However, it might struggle with:
- * Tables without clear borders or lines.
- * Complex merged cells.
- * Tables spanning multiple pages.
-
- `pdfplumber` (and thus `natural-pdf`) allows passing `table_settings` dictionaries to `extract_tables()` for more control over the detection strategy (e.g., `"vertical_strategy": "text"`, `"horizontal_strategy": "text"`).
-
- For even more robust table detection, especially for tables without explicit lines, using Layout Analysis (like `page.analyze_layout(engine='tatr')`) first, finding the table `region`, and then calling `region.extract_table()` can yield better results. We'll explore layout analysis in a later tutorial.
- </div>
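The removed tutorial describes each extracted table as a list of rows, where every row is a list of cell values. A minimal sketch of turning that shape into records keyed by the header row, using only the standard library, might look like the following. The `table_to_records` helper and the sample `table` values are made up for illustration and are not part of natural-pdf or pdfplumber; cells are typed as optional on the assumption that empty cells may come back as `None`.

```python
from typing import Dict, List, Optional

Row = List[Optional[str]]


def table_to_records(table: List[Row]) -> List[Dict[str, Optional[str]]]:
    """Use the first row as headers and map every later row onto them."""
    if not table:
        return []
    # Normalize header cells so a missing header becomes an empty string.
    headers = [(h or "").strip() for h in table[0]]
    records = []
    for row in table[1:]:
        # Pad short rows so every header gets a value (None for missing cells).
        padded = list(row) + [None] * (len(headers) - len(row))
        records.append(dict(zip(headers, padded)))
    return records


# Hypothetical sample in the "list of lists" shape described above.
table = [
    ["Statute", "Description", "Level"],
    ["1.2.3", "Example violation", "Serious"],
    ["4.5.6", "Another example violation"],
]
records = table_to_records(table)
```

Keying rows by header makes later steps, such as filtering by a column, less dependent on column order. Rows that are longer than the header row are truncated by `zip`, which may or may not be what a given table needs.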