natural-pdf 0.1.7__py3-none-any.whl → 0.1.9__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (134)
  1. natural_pdf/__init__.py +3 -0
  2. natural_pdf/analyzers/layout/base.py +1 -5
  3. natural_pdf/analyzers/layout/gemini.py +61 -51
  4. natural_pdf/analyzers/layout/layout_analyzer.py +40 -11
  5. natural_pdf/analyzers/layout/layout_manager.py +26 -84
  6. natural_pdf/analyzers/layout/layout_options.py +7 -0
  7. natural_pdf/analyzers/layout/pdfplumber_table_finder.py +142 -0
  8. natural_pdf/analyzers/layout/surya.py +46 -123
  9. natural_pdf/analyzers/layout/tatr.py +51 -4
  10. natural_pdf/analyzers/text_structure.py +3 -5
  11. natural_pdf/analyzers/utils.py +3 -3
  12. natural_pdf/classification/manager.py +422 -0
  13. natural_pdf/classification/mixin.py +163 -0
  14. natural_pdf/classification/results.py +80 -0
  15. natural_pdf/collections/mixins.py +111 -0
  16. natural_pdf/collections/pdf_collection.py +434 -15
  17. natural_pdf/core/element_manager.py +83 -0
  18. natural_pdf/core/highlighting_service.py +13 -22
  19. natural_pdf/core/page.py +578 -93
  20. natural_pdf/core/pdf.py +912 -460
  21. natural_pdf/elements/base.py +134 -40
  22. natural_pdf/elements/collections.py +712 -109
  23. natural_pdf/elements/region.py +722 -69
  24. natural_pdf/elements/text.py +4 -1
  25. natural_pdf/export/mixin.py +137 -0
  26. natural_pdf/exporters/base.py +3 -3
  27. natural_pdf/exporters/paddleocr.py +5 -4
  28. natural_pdf/extraction/manager.py +135 -0
  29. natural_pdf/extraction/mixin.py +279 -0
  30. natural_pdf/extraction/result.py +23 -0
  31. natural_pdf/ocr/__init__.py +5 -5
  32. natural_pdf/ocr/engine_doctr.py +346 -0
  33. natural_pdf/ocr/engine_easyocr.py +6 -3
  34. natural_pdf/ocr/ocr_factory.py +24 -4
  35. natural_pdf/ocr/ocr_manager.py +122 -26
  36. natural_pdf/ocr/ocr_options.py +94 -11
  37. natural_pdf/ocr/utils.py +19 -6
  38. natural_pdf/qa/document_qa.py +0 -4
  39. natural_pdf/search/__init__.py +20 -34
  40. natural_pdf/search/haystack_search_service.py +309 -265
  41. natural_pdf/search/haystack_utils.py +99 -75
  42. natural_pdf/search/search_service_protocol.py +11 -12
  43. natural_pdf/selectors/parser.py +431 -230
  44. natural_pdf/utils/debug.py +3 -3
  45. natural_pdf/utils/identifiers.py +1 -1
  46. natural_pdf/utils/locks.py +8 -0
  47. natural_pdf/utils/packaging.py +8 -6
  48. natural_pdf/utils/text_extraction.py +60 -1
  49. natural_pdf/utils/tqdm_utils.py +51 -0
  50. natural_pdf/utils/visualization.py +18 -0
  51. natural_pdf/widgets/viewer.py +4 -25
  52. {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/METADATA +17 -3
  53. natural_pdf-0.1.9.dist-info/RECORD +80 -0
  54. {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/WHEEL +1 -1
  55. {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/top_level.txt +0 -2
  56. docs/api/index.md +0 -386
  57. docs/assets/favicon.png +0 -3
  58. docs/assets/favicon.svg +0 -3
  59. docs/assets/javascripts/custom.js +0 -17
  60. docs/assets/logo.svg +0 -3
  61. docs/assets/sample-screen.png +0 -0
  62. docs/assets/social-preview.png +0 -17
  63. docs/assets/social-preview.svg +0 -17
  64. docs/assets/stylesheets/custom.css +0 -65
  65. docs/document-qa/index.ipynb +0 -435
  66. docs/document-qa/index.md +0 -79
  67. docs/element-selection/index.ipynb +0 -915
  68. docs/element-selection/index.md +0 -229
  69. docs/finetuning/index.md +0 -176
  70. docs/index.md +0 -170
  71. docs/installation/index.md +0 -69
  72. docs/interactive-widget/index.ipynb +0 -962
  73. docs/interactive-widget/index.md +0 -12
  74. docs/layout-analysis/index.ipynb +0 -818
  75. docs/layout-analysis/index.md +0 -185
  76. docs/ocr/index.md +0 -209
  77. docs/pdf-navigation/index.ipynb +0 -314
  78. docs/pdf-navigation/index.md +0 -97
  79. docs/regions/index.ipynb +0 -816
  80. docs/regions/index.md +0 -294
  81. docs/tables/index.ipynb +0 -658
  82. docs/tables/index.md +0 -144
  83. docs/text-analysis/index.ipynb +0 -370
  84. docs/text-analysis/index.md +0 -105
  85. docs/text-extraction/index.ipynb +0 -1478
  86. docs/text-extraction/index.md +0 -292
  87. docs/tutorials/01-loading-and-extraction.ipynb +0 -194
  88. docs/tutorials/01-loading-and-extraction.md +0 -95
  89. docs/tutorials/02-finding-elements.ipynb +0 -340
  90. docs/tutorials/02-finding-elements.md +0 -149
  91. docs/tutorials/03-extracting-blocks.ipynb +0 -147
  92. docs/tutorials/03-extracting-blocks.md +0 -48
  93. docs/tutorials/04-table-extraction.ipynb +0 -114
  94. docs/tutorials/04-table-extraction.md +0 -50
  95. docs/tutorials/05-excluding-content.ipynb +0 -270
  96. docs/tutorials/05-excluding-content.md +0 -109
  97. docs/tutorials/06-document-qa.ipynb +0 -332
  98. docs/tutorials/06-document-qa.md +0 -91
  99. docs/tutorials/07-layout-analysis.ipynb +0 -288
  100. docs/tutorials/07-layout-analysis.md +0 -66
  101. docs/tutorials/07-working-with-regions.ipynb +0 -413
  102. docs/tutorials/07-working-with-regions.md +0 -151
  103. docs/tutorials/08-spatial-navigation.ipynb +0 -508
  104. docs/tutorials/08-spatial-navigation.md +0 -190
  105. docs/tutorials/09-section-extraction.ipynb +0 -2434
  106. docs/tutorials/09-section-extraction.md +0 -256
  107. docs/tutorials/10-form-field-extraction.ipynb +0 -512
  108. docs/tutorials/10-form-field-extraction.md +0 -201
  109. docs/tutorials/11-enhanced-table-processing.ipynb +0 -54
  110. docs/tutorials/11-enhanced-table-processing.md +0 -9
  111. docs/tutorials/12-ocr-integration.ipynb +0 -604
  112. docs/tutorials/12-ocr-integration.md +0 -175
  113. docs/tutorials/13-semantic-search.ipynb +0 -1328
  114. docs/tutorials/13-semantic-search.md +0 -77
  115. docs/visual-debugging/index.ipynb +0 -2970
  116. docs/visual-debugging/index.md +0 -157
  117. docs/visual-debugging/region.png +0 -0
  118. natural_pdf/templates/finetune/fine_tune_paddleocr.md +0 -415
  119. natural_pdf/templates/spa/css/style.css +0 -334
  120. natural_pdf/templates/spa/index.html +0 -31
  121. natural_pdf/templates/spa/js/app.js +0 -472
  122. natural_pdf/templates/spa/words.txt +0 -235976
  123. natural_pdf/widgets/frontend/viewer.js +0 -88
  124. natural_pdf-0.1.7.dist-info/RECORD +0 -145
  125. notebooks/Examples.ipynb +0 -1293
  126. pdfs/.gitkeep +0 -0
  127. pdfs/01-practice.pdf +0 -543
  128. pdfs/0500000US42001.pdf +0 -0
  129. pdfs/0500000US42007.pdf +0 -0
  130. pdfs/2014 Statistics.pdf +0 -0
  131. pdfs/2019 Statistics.pdf +0 -0
  132. pdfs/Atlanta_Public_Schools_GA_sample.pdf +0 -0
  133. pdfs/needs-ocr.pdf +0 -0
  134. {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.9.dist-info}/licenses/LICENSE +0 -0
docs/tables/index.md DELETED
@@ -1,144 +0,0 @@
- # Table Extraction
-
- Extracting tables from PDFs can range from straightforward to complex. Natural PDF provides several tools and methods to handle different scenarios, leveraging both rule-based (`pdfplumber`) and model-based (`TATR`) approaches.
-
- ## Setup
-
- Let's load a PDF containing tables.
-
- ```python
- from natural_pdf import PDF
-
- # Load the PDF
- pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
-
- # Select the first page
- page = pdf.pages[0]
-
- # Display the page
- page.show()
- ```
-
- ## Basic Table Extraction (No Detection)
-
- If you know a table exists, you can try `extract_table()` directly on the page or a region. This uses `pdfplumber` behind the scenes.
-
- ```python
- # Extract the first table found on the page using pdfplumber
- # This works best for simple tables with clear lines
- table_data = page.extract_table() # Returns a list of lists
- table_data
- ```
-
- *This might fail or give poor results if there are multiple tables or the table structure is complex.*
-
- ## Layout Analysis for Table Detection
-
- A more robust approach can be to first *detect* the table boundaries using layout analysis.
-
- ### Using YOLO (Default)
-
- The default YOLO model finds the overall bounding box of tables.
-
- ```python
- # Detect layout elements using YOLO (default)
- page.analyze_layout(engine='yolo')
-
- # Find regions detected as tables
- table_regions_yolo = page.find_all('region[type=table][model=yolo]')
- table_regions_yolo.show()
- ```
-
- ```python
- table_regions_yolo[0].extract_table()
- ```
-
- ### Using TATR (Table Transformer)
-
- The TATR model provides detailed table structure (rows, columns, headers).
-
- ```python
- page.clear_detected_layout_regions() # Clear previous YOLO regions for clarity
- page.analyze_layout(engine='tatr')
- ```
-
- ```python
- # Find the main table region(s) detected by TATR
- tatr_table = page.find('region[type=table][model=tatr]')
- tatr_table.show()
- ```
-
- ```python
- # Find rows, columns, headers detected by TATR
- rows = page.find_all('region[type=table-row][model=tatr]')
- cols = page.find_all('region[type=table-column][model=tatr]')
- hdrs = page.find_all('region[type=table-column-header][model=tatr]')
- f"TATR found: {len(rows)} rows, {len(cols)} columns, {len(hdrs)} headers"
- ```
-
- ## Controlling Extraction Method (`plumber` vs `tatr`)
-
- When you call `extract_table()` on a region:
- - If the region was detected by **YOLO** (or not detected at all), it uses the `plumber` method.
- - If the region was detected by **TATR**, it defaults to the `tatr` method, which uses the detected row/column structure.
-
- You can override this using the `method` argument.
-
- ```python
- tatr_table = page.find('region[type=table][model=tatr]')
- tatr_table.extract_table(method='tatr')
- ```
-
- ```python
- # Force using pdfplumber even on a TATR-detected region
- # (Might be useful for comparison or if TATR structure is flawed)
- tatr_table = page.find('region[type=table][model=tatr]')
- tatr_table.extract_table(method='pdfplumber')
- ```
-
- ### When to Use Which Method?
-
- - **`pdfplumber`**: Good for simple tables with clear grid lines. Faster.
- - **`tatr`**: Better for tables without clear lines, complex cell merging, or irregular layouts. Leverages the model's understanding of rows and columns.
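
The default-selection rule described under "Controlling Extraction Method" can be modelled as a small standalone function. This is an illustrative sketch only: it is not part of the deleted file and it does not call natural-pdf. The name `choose_table_method` and its arguments are hypothetical, and the real logic inside `extract_table()` may differ.

```python
from typing import Optional


def choose_table_method(detected_by: Optional[str], method: Optional[str] = None) -> str:
    """Return the extraction method name for a region (hypothetical helper).

    Mirrors the rule stated in the text: an explicit `method` wins;
    otherwise TATR-detected regions use 'tatr', and everything else
    (YOLO-detected or undetected regions) uses the 'plumber' path.
    """
    if method is not None:
        return method
    if detected_by == "tatr":
        return "tatr"
    return "plumber"


print(choose_table_method("yolo"))
print(choose_table_method(None))
print(choose_table_method("tatr"))
print(choose_table_method("tatr", method="pdfplumber"))
```

The last call corresponds to the override case in the deleted doc, where a TATR-detected region is forced through the pdfplumber-based extraction.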
-
- ## Customizing `pdfplumber` Settings
-
- If using the `pdfplumber` method (explicitly or implicitly), you can pass `pdfplumber` settings via `table_settings`.
-
- ```python
- # Example: Use text alignment for vertical lines, explicit lines for horizontal
- # See pdfplumber documentation for all settings
- table_settings = {
- "vertical_strategy": "text",
- "horizontal_strategy": "lines",
- "intersection_x_tolerance": 5, # Increase tolerance for intersections
- }
-
- results = page.extract_table(
- table_settings=table_settings
- )
- ```
-
- ## Saving Extracted Tables
-
- You can easily save the extracted data (list of lists) to common formats.
-
- ```python
- import pandas as pd
-
- pd.DataFrame(page.extract_table())
- ```
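
The snippet in the deleted doc builds a DataFrame but does not write anything to disk. As a minimal sketch of the saving step (not part of the deleted file), the following uses only the standard library and assumes the table is a list of lists in which empty cells may be `None`. The helper name `save_table_csv` and the sample rows are placeholders, not natural-pdf API or real data.

```python
import csv
import io
from typing import List, Optional, TextIO

Table = List[List[Optional[str]]]


def save_table_csv(table: Table, out: TextIO) -> None:
    """Write a list-of-lists table as CSV, replacing None cells with ''."""
    writer = csv.writer(out, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else cell for cell in row])


# Placeholder data standing in for the result of page.extract_table()
sample = [["Name", "Count"], ["Widgets", "3"], ["Gadgets", None]]

buffer = io.StringIO()
save_table_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a file instead of an in-memory buffer, pass a handle opened with `open("table.csv", "w", newline="")`.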
-
- ## Working Directly with TATR Cells
-
- The TATR engine implicitly creates cell regions at the intersection of detected rows and columns. You can access these for fine-grained control.
-
- ```python
- # This doesn't work! I forget why, I should troubleshoot later.
- # tatr_table.cells
- ```
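
The idea of cells as row/column intersections can be sketched without the library. This is a geometric illustration only, not part of the deleted file and not how natural-pdf necessarily builds its cells. It assumes each row and column is an axis-aligned box given as `(x0, top, x1, bottom)`; `intersect` and `cell_grid` are hypothetical names.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def intersect(a: Box, b: Box) -> Box:
    """Return the overlapping rectangle of two boxes (assumes they overlap)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def cell_grid(rows: List[Box], cols: List[Box]) -> List[List[Box]]:
    """Build a rows-by-columns grid of cell boxes from row and column boxes."""
    return [[intersect(row, col) for col in cols] for row in rows]


# Two rows and two columns covering a 100 x 40 table area (made-up numbers)
rows = [(0, 0, 100, 20), (0, 20, 100, 40)]
cols = [(0, 0, 50, 40), (50, 0, 100, 40)]

grid = cell_grid(rows, cols)
print(grid[1][0])  # second row, first column
```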
140
-
141
- ## Next Steps
142
-
143
- - [Layout Analysis](../layout-analysis/index.ipynb): Understand how table detection fits into overall document structure analysis.
144
- - [Working with Regions](../regions/index.ipynb): Manually define table areas if detection fails.