natural-pdf 0.1.4__tar.gz → 0.1.6__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (288)
  1. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/.gitignore +5 -0
  2. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/CLAUDE.md +1 -1
  3. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/PKG-INFO +53 -17
  4. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/README.md +5 -1
  5. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/ocr/index.md +34 -47
  6. natural_pdf-0.1.6/docs/tutorials/01-loading-and-extraction.ipynb +1710 -0
  7. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/02-finding-elements.ipynb +42 -42
  8. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/03-extracting-blocks.ipynb +18 -18
  9. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/04-table-extraction.ipynb +12 -12
  10. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/05-excluding-content.ipynb +32 -32
  11. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/06-document-qa.ipynb +44 -44
  12. natural_pdf-0.1.6/docs/tutorials/07-layout-analysis.ipynb +288 -0
  13. natural_pdf-0.1.6/docs/tutorials/07-working-with-regions.ipynb +413 -0
  14. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/07-working-with-regions.md +2 -2
  15. natural_pdf-0.1.6/docs/tutorials/08-spatial-navigation.ipynb +508 -0
  16. natural_pdf-0.1.6/docs/tutorials/09-section-extraction.ipynb +2434 -0
  17. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/10-form-field-extraction.ipynb +91 -63
  18. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/11-enhanced-table-processing.ipynb +6 -6
  19. natural_pdf-0.1.6/docs/tutorials/12-ocr-integration.ipynb +604 -0
  20. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/12-ocr-integration.md +0 -13
  21. natural_pdf-0.1.6/docs/tutorials/13-semantic-search.ipynb +1328 -0
  22. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/execute_notebooks.py +120 -68
  23. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/__init__.py +50 -33
  24. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/analyzers/__init__.py +2 -1
  25. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/base.py +32 -24
  26. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/docling.py +131 -72
  27. natural_pdf-0.1.6/natural_pdf/analyzers/layout/gemini.py +264 -0
  28. natural_pdf-0.1.6/natural_pdf/analyzers/layout/layout_analyzer.py +298 -0
  29. natural_pdf-0.1.6/natural_pdf/analyzers/layout/layout_manager.py +270 -0
  30. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/layout_options.py +43 -17
  31. natural_pdf-0.1.6/natural_pdf/analyzers/layout/paddle.py +297 -0
  32. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/surya.py +164 -92
  33. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/tatr.py +149 -84
  34. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/yolo.py +89 -45
  35. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/analyzers/text_options.py +22 -15
  36. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/analyzers/text_structure.py +131 -85
  37. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/analyzers/utils.py +30 -23
  38. natural_pdf-0.1.6/natural_pdf/collections/pdf_collection.py +308 -0
  39. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/core/__init__.py +1 -1
  40. natural_pdf-0.1.6/natural_pdf/core/element_manager.py +539 -0
  41. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/core/highlighting_service.py +268 -196
  42. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/core/page.py +1044 -521
  43. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/core/pdf.py +516 -313
  44. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/elements/__init__.py +1 -1
  45. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/elements/base.py +307 -225
  46. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/elements/collections.py +805 -543
  47. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/elements/line.py +39 -36
  48. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/elements/rect.py +32 -30
  49. natural_pdf-0.1.6/natural_pdf/elements/region.py +1730 -0
  50. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/elements/text.py +127 -99
  51. natural_pdf-0.1.6/natural_pdf/exporters/searchable_pdf.py +411 -0
  52. natural_pdf-0.1.6/natural_pdf/ocr/__init__.py +78 -0
  53. natural_pdf-0.1.6/natural_pdf/ocr/engine.py +208 -0
  54. natural_pdf-0.1.6/natural_pdf/ocr/engine_easyocr.py +175 -0
  55. natural_pdf-0.1.6/natural_pdf/ocr/engine_paddle.py +147 -0
  56. natural_pdf-0.1.6/natural_pdf/ocr/engine_surya.py +108 -0
  57. natural_pdf-0.1.6/natural_pdf/ocr/ocr_factory.py +114 -0
  58. natural_pdf-0.1.6/natural_pdf/ocr/ocr_manager.py +189 -0
  59. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/ocr/ocr_options.py +16 -20
  60. natural_pdf-0.1.6/natural_pdf/ocr/utils.py +98 -0
  61. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/qa/__init__.py +1 -1
  62. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/qa/document_qa.py +119 -111
  63. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/search/__init__.py +37 -31
  64. natural_pdf-0.1.6/natural_pdf/search/haystack_search_service.py +643 -0
  65. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/search/haystack_utils.py +186 -122
  66. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/search/search_options.py +25 -14
  67. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/search/search_service_protocol.py +12 -6
  68. natural_pdf-0.1.6/natural_pdf/search/searchable_mixin.py +549 -0
  69. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/selectors/__init__.py +2 -1
  70. natural_pdf-0.1.6/natural_pdf/selectors/parser.py +411 -0
  71. natural_pdf-0.1.6/natural_pdf/templates/__init__.py +1 -0
  72. natural_pdf-0.1.6/natural_pdf/templates/spa/css/style.css +334 -0
  73. natural_pdf-0.1.6/natural_pdf/templates/spa/index.html +31 -0
  74. natural_pdf-0.1.6/natural_pdf/templates/spa/js/app.js +472 -0
  75. natural_pdf-0.1.6/natural_pdf/templates/spa/words.txt +235976 -0
  76. natural_pdf-0.1.6/natural_pdf/utils/debug.py +32 -0
  77. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/utils/highlighting.py +8 -2
  78. natural_pdf-0.1.6/natural_pdf/utils/identifiers.py +29 -0
  79. natural_pdf-0.1.6/natural_pdf/utils/packaging.py +418 -0
  80. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/utils/reading_order.py +65 -63
  81. natural_pdf-0.1.6/natural_pdf/utils/text_extraction.py +195 -0
  82. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/utils/visualization.py +70 -61
  83. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/widgets/__init__.py +2 -3
  84. natural_pdf-0.1.6/natural_pdf/widgets/viewer.py +796 -0
  85. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf.egg-info/PKG-INFO +53 -17
  86. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf.egg-info/SOURCES.txt +15 -103
  87. natural_pdf-0.1.6/natural_pdf.egg-info/requires.txt +83 -0
  88. natural_pdf-0.1.6/natural_pdf.egg-info/top_level.txt +8 -0
  89. natural_pdf-0.1.6/noxfile.py +78 -0
  90. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/pyproject.toml +79 -24
  91. natural_pdf-0.1.6/tests/test_loading.py +50 -0
  92. natural_pdf-0.1.6/tests/test_optional_deps.py +298 -0
  93. natural_pdf-0.1.4/docs/tutorials/01-loading-and-extraction.ipynb +0 -1700
  94. natural_pdf-0.1.4/docs/tutorials/07-layout-analysis.ipynb +0 -260
  95. natural_pdf-0.1.4/docs/tutorials/07-working-with-regions.ipynb +0 -409
  96. natural_pdf-0.1.4/docs/tutorials/08-spatial-navigation.ipynb +0 -508
  97. natural_pdf-0.1.4/docs/tutorials/09-section-extraction.ipynb +0 -2428
  98. natural_pdf-0.1.4/docs/tutorials/12-ocr-integration.ipynb +0 -601
  99. natural_pdf-0.1.4/docs/tutorials/13-semantic-search.ipynb +0 -1904
  100. natural_pdf-0.1.4/natural_pdf/analyzers/layout/layout_analyzer.py +0 -255
  101. natural_pdf-0.1.4/natural_pdf/analyzers/layout/layout_manager.py +0 -203
  102. natural_pdf-0.1.4/natural_pdf/analyzers/layout/paddle.py +0 -240
  103. natural_pdf-0.1.4/natural_pdf/collections/pdf_collection.py +0 -259
  104. natural_pdf-0.1.4/natural_pdf/core/element_manager.py +0 -457
  105. natural_pdf-0.1.4/natural_pdf/elements/region.py +0 -1720
  106. natural_pdf-0.1.4/natural_pdf/exporters/__init__.py +0 -1
  107. natural_pdf-0.1.4/natural_pdf/exporters/searchable_pdf.py +0 -252
  108. natural_pdf-0.1.4/natural_pdf/ocr/__init__.py +0 -56
  109. natural_pdf-0.1.4/natural_pdf/ocr/engine.py +0 -104
  110. natural_pdf-0.1.4/natural_pdf/ocr/engine_easyocr.py +0 -179
  111. natural_pdf-0.1.4/natural_pdf/ocr/engine_paddle.py +0 -204
  112. natural_pdf-0.1.4/natural_pdf/ocr/engine_surya.py +0 -171
  113. natural_pdf-0.1.4/natural_pdf/ocr/ocr_manager.py +0 -191
  114. natural_pdf-0.1.4/natural_pdf/search/haystack_search_service.py +0 -520
  115. natural_pdf-0.1.4/natural_pdf/search/searchable_mixin.py +0 -464
  116. natural_pdf-0.1.4/natural_pdf/selectors/parser.py +0 -568
  117. natural_pdf-0.1.4/natural_pdf/templates/__init__.py +0 -1
  118. natural_pdf-0.1.4/natural_pdf/templates/ocr_debug.html +0 -517
  119. natural_pdf-0.1.4/natural_pdf/widgets/viewer.py +0 -765
  120. natural_pdf-0.1.4/natural_pdf.egg-info/requires.txt +0 -45
  121. natural_pdf-0.1.4/natural_pdf.egg-info/top_level.txt +0 -1
  122. natural_pdf-0.1.4/output/all_detected_regions.png +0 -0
  123. natural_pdf-0.1.4/output/all_elements.png +0 -0
  124. natural_pdf-0.1.4/output/basic_highlighting.png +0 -0
  125. natural_pdf-0.1.4/output/chainable_layout.png +0 -0
  126. natural_pdf-0.1.4/output/chained_analysis.png +0 -0
  127. natural_pdf-0.1.4/output/color_names.png +0 -0
  128. natural_pdf-0.1.4/output/color_names_with_boxes.png +0 -0
  129. natural_pdf-0.1.4/output/conf_display_highlight_all.png +0 -0
  130. natural_pdf-0.1.4/output/conf_display_highlight_layout.png +0 -0
  131. natural_pdf-0.1.4/output/conf_display_layout_only.png +0 -0
  132. natural_pdf-0.1.4/output/confidence_color_coded.png +0 -0
  133. natural_pdf-0.1.4/output/debug_page_image.png +0 -0
  134. natural_pdf-0.1.4/output/detected_table.png +0 -0
  135. natural_pdf-0.1.4/output/dimension_analysis.txt +0 -48
  136. natural_pdf-0.1.4/output/direct_ocr_debug.png +0 -0
  137. natural_pdf-0.1.4/output/easyocr_debug_input.png +0 -0
  138. natural_pdf-0.1.4/output/easyocr_results.png +0 -0
  139. natural_pdf-0.1.4/output/easyocr_test_input.png +0 -0
  140. natural_pdf-0.1.4/output/exclusion_optimization_regions.png +0 -0
  141. natural_pdf-0.1.4/output/explicit_confidence_display.png +0 -0
  142. natural_pdf-0.1.4/output/footer_overlap_test.png +0 -0
  143. natural_pdf-0.1.4/output/highlight_all.png +0 -0
  144. natural_pdf-0.1.4/output/highlight_all_styles.png +0 -0
  145. natural_pdf-0.1.4/output/highlight_all_with_all_layouts.png +0 -0
  146. natural_pdf-0.1.4/output/highlight_all_with_attrs.png +0 -0
  147. natural_pdf-0.1.4/output/highlight_all_with_yolo.png +0 -0
  148. natural_pdf-0.1.4/output/highlight_by_confidence.png +0 -0
  149. natural_pdf-0.1.4/output/highlight_color_test_1.png +0 -0
  150. natural_pdf-0.1.4/output/highlight_color_test_2.png +0 -0
  151. natural_pdf-0.1.4/output/highlight_color_test_3.png +0 -0
  152. natural_pdf-0.1.4/output/highlight_color_test_4.png +0 -0
  153. natural_pdf-0.1.4/output/highlight_layout_method.png +0 -0
  154. natural_pdf-0.1.4/output/highlight_multiple.png +0 -0
  155. natural_pdf-0.1.4/output/highlight_no_attrs.png +0 -0
  156. natural_pdf-0.1.4/output/highlight_region.png +0 -0
  157. natural_pdf-0.1.4/output/highlight_single.png +0 -0
  158. natural_pdf-0.1.4/output/highlight_specific_types.png +0 -0
  159. natural_pdf-0.1.4/output/highlight_specific_types_with_boxes.png +0 -0
  160. natural_pdf-0.1.4/output/highlight_specific_types_with_tables.png +0 -0
  161. natural_pdf-0.1.4/output/highlight_test.png +0 -0
  162. natural_pdf-0.1.4/output/highlight_test_colors.png +0 -0
  163. natural_pdf-0.1.4/output/highlight_test_individual.png +0 -0
  164. natural_pdf-0.1.4/output/highlight_test_individual_annotated.png +0 -0
  165. natural_pdf-0.1.4/output/highlight_test_individual_with_structure.png +0 -0
  166. natural_pdf-0.1.4/output/highlight_test_individual_with_structure_yolo.png +0 -0
  167. natural_pdf-0.1.4/output/highlight_test_individual_with_tables.png +0 -0
  168. natural_pdf-0.1.4/output/highlight_with_attrs.png +0 -0
  169. natural_pdf-0.1.4/output/layout_conf_default.png +0 -0
  170. natural_pdf-0.1.4/output/layout_detection.png +0 -0
  171. natural_pdf-0.1.4/output/layout_fix_test.png +0 -0
  172. natural_pdf-0.1.4/output/layout_fix_test2.png +0 -0
  173. natural_pdf-0.1.4/output/layout_fix_test3.png +0 -0
  174. natural_pdf-0.1.4/output/layout_fix_test4.png +0 -0
  175. natural_pdf-0.1.4/output/model_comparison.png +0 -0
  176. natural_pdf-0.1.4/output/multiple_attributes_display.png +0 -0
  177. natural_pdf-0.1.4/output/ocr_confidence_visualization.png +0 -0
  178. natural_pdf-0.1.4/output/ocr_debug.png +0 -0
  179. natural_pdf-0.1.4/output/ocr_debug_page.html +0 -517
  180. natural_pdf-0.1.4/output/ocr_highlight_all_test.png +0 -0
  181. natural_pdf-0.1.4/output/ocr_highlight_test.png +0 -0
  182. natural_pdf-0.1.4/output/ocr_highlighted.png +0 -0
  183. natural_pdf-0.1.4/output/ocr_simplified.png +0 -0
  184. natural_pdf-0.1.4/output/ocr_threshold_comparison.png +0 -0
  185. natural_pdf-0.1.4/output/ocr_visualization_clean.png +0 -0
  186. natural_pdf-0.1.4/output/ocr_visualization_highlights.png +0 -0
  187. natural_pdf-0.1.4/output/ocr_visualization_text.png +0 -0
  188. natural_pdf-0.1.4/output/paddle_layout_detection.png +0 -0
  189. natural_pdf-0.1.4/output/paddle_layout_polygons.png +0 -0
  190. natural_pdf-0.1.4/output/paddle_layout_sources.png +0 -0
  191. natural_pdf-0.1.4/output/paddle_layout_with_text.png +0 -0
  192. natural_pdf-0.1.4/output/paddle_layout_without_text.png +0 -0
  193. natural_pdf-0.1.4/output/paddleocr_highlights.png +0 -0
  194. natural_pdf-0.1.4/output/paddleocr_results.png +0 -0
  195. natural_pdf-0.1.4/output/paddleocr_test_input.png +0 -0
  196. natural_pdf-0.1.4/output/page_1_for_ocr.png +0 -0
  197. natural_pdf-0.1.4/output/page_4_for_ocr.png +0 -0
  198. natural_pdf-0.1.4/output/region_exclusion_test.png +0 -0
  199. natural_pdf-0.1.4/output/region_management_test.png +0 -0
  200. natural_pdf-0.1.4/output/region_ocr_cropped.png +0 -0
  201. natural_pdf-0.1.4/output/region_ocr_debug.png +0 -0
  202. natural_pdf-0.1.4/output/region_ocr_full_page.png +0 -0
  203. natural_pdf-0.1.4/output/region_ocr_highlighted.png +0 -0
  204. natural_pdf-0.1.4/output/spatial_navigation.png +0 -0
  205. natural_pdf-0.1.4/output/standard_highlight_all.png +0 -0
  206. natural_pdf-0.1.4/output/table_no_ocr.csv +0 -54
  207. natural_pdf-0.1.4/output/table_structure.png +0 -0
  208. natural_pdf-0.1.4/output/table_structure_detail.png +0 -0
  209. natural_pdf-0.1.4/output/table_with_ocr.csv +0 -54
  210. natural_pdf-0.1.4/output/tatr_cells_test.png +0 -0
  211. natural_pdf-0.1.4/output/tatr_ocr_table_test.png +0 -0
  212. natural_pdf-0.1.4/output/tatr_regions.png +0 -0
  213. natural_pdf-0.1.4/output/tatr_regions.txt +0 -16
  214. natural_pdf-0.1.4/output/text_styles.png +0 -0
  215. natural_pdf-0.1.4/output/titles_only.png +0 -0
  216. natural_pdf-0.1.4/output/width_1200px.png +0 -0
  217. natural_pdf-0.1.4/output/width_800px.png +0 -0
  218. natural_pdf-0.1.4/output/width_default.png +0 -0
  219. natural_pdf-0.1.4/output/width_with_scale.png +0 -0
  220. natural_pdf-0.1.4/output/yolo_regions.png +0 -0
  221. natural_pdf-0.1.4/output/yolo_regions.txt +0 -9
  222. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/.github/workflows/docs.yml +0 -0
  223. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/LICENSE +0 -0
  224. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/MANIFEST.in +0 -0
  225. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/check_run_md.sh +0 -0
  226. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/api/index.md +0 -0
  227. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/assets/favicon.png +0 -0
  228. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/assets/favicon.svg +0 -0
  229. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/assets/javascripts/custom.js +0 -0
  230. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/assets/logo.svg +0 -0
  231. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/assets/sample-screen.png +0 -0
  232. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/assets/social-preview.png +0 -0
  233. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/assets/social-preview.svg +0 -0
  234. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/assets/stylesheets/custom.css +0 -0
  235. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/document-qa/index.ipynb +0 -0
  236. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/document-qa/index.md +0 -0
  237. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/element-selection/index.ipynb +0 -0
  238. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/element-selection/index.md +0 -0
  239. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/index.md +0 -0
  240. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/installation/index.md +0 -0
  241. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/interactive-widget/index.ipynb +0 -0
  242. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/interactive-widget/index.md +0 -0
  243. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/layout-analysis/index.ipynb +0 -0
  244. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/layout-analysis/index.md +0 -0
  245. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/pdf-navigation/index.ipynb +0 -0
  246. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/pdf-navigation/index.md +0 -0
  247. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/regions/index.ipynb +0 -0
  248. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/regions/index.md +0 -0
  249. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tables/index.ipynb +0 -0
  250. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tables/index.md +0 -0
  251. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/text-analysis/index.ipynb +0 -0
  252. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/text-analysis/index.md +0 -0
  253. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/text-extraction/index.ipynb +0 -0
  254. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/text-extraction/index.md +0 -0
  255. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/01-loading-and-extraction.md +0 -0
  256. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/02-finding-elements.md +0 -0
  257. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/03-extracting-blocks.md +0 -0
  258. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/04-table-extraction.md +0 -0
  259. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/05-excluding-content.md +0 -0
  260. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/06-document-qa.md +0 -0
  261. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/07-layout-analysis.md +0 -0
  262. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/08-spatial-navigation.md +0 -0
  263. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/09-section-extraction.md +0 -0
  264. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/10-form-field-extraction.md +0 -0
  265. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/11-enhanced-table-processing.md +0 -0
  266. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/13-semantic-search.md +0 -0
  267. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/visual-debugging/index.ipynb +0 -0
  268. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/visual-debugging/index.md +0 -0
  269. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/visual-debugging/region.png +0 -0
  270. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/mkdocs.yml +0 -0
  271. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/__init__.py +0 -0
  272. /natural_pdf-0.1.4/output/layout_conf_high.png → /natural_pdf-0.1.6/natural_pdf/exporters/__init__.py +0 -0
  273. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/utils/__init__.py +0 -0
  274. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/widgets/frontend/viewer.js +0 -0
  275. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf.egg-info/dependency_links.txt +0 -0
  276. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/notebooks/Examples.ipynb +0 -0
  277. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/pdfs/.gitkeep +0 -0
  278. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/pdfs/01-practice.pdf +0 -0
  279. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/pdfs/0500000US42001.pdf +0 -0
  280. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/pdfs/0500000US42007.pdf +0 -0
  281. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/pdfs/2014 Statistics.pdf +0 -0
  282. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/pdfs/2019 Statistics.pdf +0 -0
  283. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/pdfs/Atlanta_Public_Schools_GA_sample.pdf +0 -0
  284. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/pdfs/needs-ocr.pdf +0 -0
  285. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/publish.sh +0 -0
  286. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/run_all_tutorials.sh +0 -0
  287. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/sample-screen.png +0 -0
  288. {natural_pdf-0.1.4 → natural_pdf-0.1.6}/setup.cfg +0 -0
--- natural_pdf-0.1.4/.gitignore
+++ natural_pdf-0.1.6/.gitignore
@@ -1,4 +1,6 @@
 .notebook_cache.json
+.venv
+output
 Untitled.ipynb
 conversation.md
 docs/tutorials/pdfs
@@ -10,6 +12,9 @@ results
 docs/tutorials/needs-ocr-searchable.pdf
 sample.py
 sample2.py
+requirements.lock
+pdfs/hidden
+*.hocr
 
 # Created by https://www.toptal.com/developers/gitignore/api/python,macos,visualstudiocode,jupyternotebooks
 # Edit at https://www.toptal.com/developers/gitignore?templates=python,macos,visualstudiocode,jupyternotebooks
--- natural_pdf-0.1.4/CLAUDE.md
+++ natural_pdf-0.1.6/CLAUDE.md
@@ -213,7 +213,7 @@ region = page.create_region(50, 50, page.width - 50, page.height - 50)
 sections = region.get_sections(start_elements='text:bold')
 
 # Expand the region around a section
-expanded_section = sections[0].expand(left=20, right=20, top_expand=10, bottom_expand=30)
+expanded_section = sections[0].expand(left=20, right=20, top=10, bottom=30)
 
 # Use percentage-based expansion
 expanded_section = sections[0].expand(width_factor=1.5, height_factor=1.2) # 50% wider, 20% taller
--- natural_pdf-0.1.4/PKG-INFO
+++ natural_pdf-0.1.6/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: natural-pdf
-Version: 0.1.4
+Version: 0.1.6
 Summary: A more intuitive interface for working with PDFs
 Author-email: Jonathan Soma <jonathan.soma@gmail.com>
 License-Expression: MIT
@@ -16,38 +16,70 @@ Requires-Dist: Pillow
 Requires-Dist: colour
 Requires-Dist: numpy
 Requires-Dist: urllib3
-Requires-Dist: torch
-Requires-Dist: torchvision
-Requires-Dist: transformers
-Requires-Dist: huggingface_hub
-Requires-Dist: ocrmypdf
-Requires-Dist: pikepdf
+Requires-Dist: tqdm
 Provides-Extra: interactive
 Requires-Dist: ipywidgets<9.0.0,>=7.0.0; extra == "interactive"
 Provides-Extra: haystack
 Requires-Dist: haystack-ai; extra == "haystack"
 Requires-Dist: chroma-haystack; extra == "haystack"
 Requires-Dist: sentence-transformers; extra == "haystack"
+Requires-Dist: protobuf<4; extra == "haystack"
+Requires-Dist: natural-pdf[core-ml]; extra == "haystack"
 Provides-Extra: easyocr
 Requires-Dist: easyocr; extra == "easyocr"
+Requires-Dist: natural-pdf[core-ml]; extra == "easyocr"
 Provides-Extra: paddle
 Requires-Dist: paddlepaddle; extra == "paddle"
 Requires-Dist: paddleocr; extra == "paddle"
 Provides-Extra: layout-yolo
 Requires-Dist: doclayout_yolo; extra == "layout-yolo"
+Requires-Dist: natural-pdf[core-ml]; extra == "layout-yolo"
 Provides-Extra: surya
 Requires-Dist: surya-ocr; extra == "surya"
+Requires-Dist: natural-pdf[core-ml]; extra == "surya"
 Provides-Extra: qa
+Requires-Dist: natural-pdf[core-ml]; extra == "qa"
+Provides-Extra: docling
+Requires-Dist: docling; extra == "docling"
+Requires-Dist: natural-pdf[core-ml]; extra == "docling"
+Provides-Extra: llm
+Requires-Dist: openai>=1.0; extra == "llm"
+Requires-Dist: pydantic; extra == "llm"
+Provides-Extra: test
+Requires-Dist: pytest; extra == "test"
+Provides-Extra: dev
+Requires-Dist: black; extra == "dev"
+Requires-Dist: isort; extra == "dev"
+Requires-Dist: mypy; extra == "dev"
+Requires-Dist: pytest; extra == "dev"
+Requires-Dist: nox; extra == "dev"
+Requires-Dist: nox-uv; extra == "dev"
+Requires-Dist: build; extra == "dev"
+Requires-Dist: uv; extra == "dev"
+Requires-Dist: pipdeptree; extra == "dev"
+Requires-Dist: nbformat; extra == "dev"
+Requires-Dist: jupytext; extra == "dev"
+Requires-Dist: nbclient; extra == "dev"
 Provides-Extra: all
-Requires-Dist: ipywidgets<9.0.0,>=7.0.0; extra == "all"
-Requires-Dist: easyocr; extra == "all"
-Requires-Dist: paddlepaddle; extra == "all"
-Requires-Dist: paddleocr; extra == "all"
-Requires-Dist: doclayout_yolo; extra == "all"
-Requires-Dist: surya-ocr; extra == "all"
-Requires-Dist: haystack-ai; extra == "all"
-Requires-Dist: chroma-haystack; extra == "all"
-Requires-Dist: sentence-transformers; extra == "all"
+Requires-Dist: natural-pdf[interactive]; extra == "all"
+Requires-Dist: natural-pdf[haystack]; extra == "all"
+Requires-Dist: natural-pdf[easyocr]; extra == "all"
+Requires-Dist: natural-pdf[paddle]; extra == "all"
+Requires-Dist: natural-pdf[layout_yolo]; extra == "all"
+Requires-Dist: natural-pdf[surya]; extra == "all"
+Requires-Dist: natural-pdf[qa]; extra == "all"
+Requires-Dist: natural-pdf[ocr-export]; extra == "all"
+Requires-Dist: natural-pdf[docling]; extra == "all"
+Requires-Dist: natural-pdf[llm]; extra == "all"
+Requires-Dist: natural-pdf[test]; extra == "all"
+Provides-Extra: core-ml
+Requires-Dist: torch; extra == "core-ml"
+Requires-Dist: torchvision; extra == "core-ml"
+Requires-Dist: transformers; extra == "core-ml"
+Requires-Dist: huggingface_hub; extra == "core-ml"
+Provides-Extra: ocr-export
+Requires-Dist: ocrmypdf; extra == "ocr-export"
+Requires-Dist: pikepdf; extra == "ocr-export"
 Dynamic: license-file
 
 # Natural PDF
@@ -75,6 +107,10 @@ pip install natural-pdf[easyocr]
 pip install natural-pdf[surya]
 pip install natural-pdf[paddle]
 
+# Example: Install support for features using Large Language Models (e.g., via OpenAI-compatible APIs)
+pip install natural-pdf[llm]
+# (May require setting API key environment variables, e.g., GOOGLE_API_KEY for Gemini)
+
 # Example: Install with interactive viewer support
 pip install natural-pdf[interactive]
 
@@ -127,7 +163,7 @@ Natural PDF offers a range of features for working with PDFs:
 * **Spatial Navigation:** Select content relative to other elements (`heading.below()`, `element.select_until(...)`).
 * **Text & Table Extraction:** Get clean text or structured table data, automatically handling exclusions.
 * **OCR Integration:** Extract text from scanned documents using engines like EasyOCR, PaddleOCR, or Surya.
-* **Layout Analysis:** Detect document structures (titles, paragraphs, tables) using AI models.
+* **Layout Analysis:** Detect document structures (titles, paragraphs, tables) using various engines (e.g., YOLO, Paddle, LLM via API).
 * **Document QA:** Ask natural language questions about your document's content.
 * **Semantic Search:** Index PDFs and find relevant pages or documents based on semantic meaning using Haystack.
 * **Visual Debugging:** Highlight elements and use an interactive viewer or save images to understand your selections.
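The PKG-INFO changes move the heavy dependencies (torch, transformers, ocrmypdf, and so on) out of the base requirements and into extras such as `core-ml` and `ocr-export`. One way to check a regrouping like this is to bucket the `Requires-Dist` entries by their `extra` marker. The following is a minimal sketch using only the standard library; the `requirements_by_extra` helper is hypothetical, the embedded metadata is a short excerpt of the 0.1.6 lines rather than the full file, and the regular expression only handles the simple `extra == "name"` markers that appear here.

```python
import re
from collections import defaultdict
from email import message_from_string

# Excerpt of the 0.1.6 metadata from the diff (assumption: not the full file).
PKG_INFO = """\
Metadata-Version: 2.4
Name: natural-pdf
Version: 0.1.6
Requires-Dist: numpy
Requires-Dist: tqdm
Requires-Dist: easyocr; extra == "easyocr"
Requires-Dist: natural-pdf[core-ml]; extra == "easyocr"
Requires-Dist: torch; extra == "core-ml"
Requires-Dist: torchvision; extra == "core-ml"
Requires-Dist: ocrmypdf; extra == "ocr-export"
Requires-Dist: pikepdf; extra == "ocr-export"
"""

# Matches only the simple marker form used in this metadata.
EXTRA_MARKER = re.compile(r'extra\s*==\s*"([^"]+)"')


def requirements_by_extra(metadata_text):
    """Group Requires-Dist entries by extra; base requirements go under None."""
    message = message_from_string(metadata_text)
    grouped = defaultdict(list)
    for entry in message.get_all("Requires-Dist", []):
        requirement, _, marker = entry.partition(";")
        match = EXTRA_MARKER.search(marker)
        extra = match.group(1) if match else None
        grouped[extra].append(requirement.strip())
    return dict(grouped)


grouped = requirements_by_extra(PKG_INFO)
for extra, requirements in grouped.items():
    print(extra or "(base)", "->", ", ".join(requirements))
```

Run against the full metadata of both versions, the same helper would show `torch` listed under the base key for 0.1.4 and under `core-ml` for 0.1.6.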
--- natural_pdf-0.1.4/README.md
+++ natural_pdf-0.1.6/README.md
@@ -23,6 +23,10 @@ pip install natural-pdf[easyocr]
 pip install natural-pdf[surya]
 pip install natural-pdf[paddle]
 
+# Example: Install support for features using Large Language Models (e.g., via OpenAI-compatible APIs)
+pip install natural-pdf[llm]
+# (May require setting API key environment variables, e.g., GOOGLE_API_KEY for Gemini)
+
 # Example: Install with interactive viewer support
 pip install natural-pdf[interactive]
 
@@ -75,7 +79,7 @@ Natural PDF offers a range of features for working with PDFs:
 * **Spatial Navigation:** Select content relative to other elements (`heading.below()`, `element.select_until(...)`).
 * **Text & Table Extraction:** Get clean text or structured table data, automatically handling exclusions.
 * **OCR Integration:** Extract text from scanned documents using engines like EasyOCR, PaddleOCR, or Surya.
-* **Layout Analysis:** Detect document structures (titles, paragraphs, tables) using AI models.
+* **Layout Analysis:** Detect document structures (titles, paragraphs, tables) using various engines (e.g., YOLO, Paddle, LLM via API).
 * **Document QA:** Ask natural language questions about your document's content.
 * **Semantic Search:** Index PDFs and find relevant pages or documents based on semantic meaning using Haystack.
 * **Visual Debugging:** Highlight elements and use an interactive viewer or save images to understand your selections.
--- natural_pdf-0.1.4/docs/ocr/index.md
+++ natural_pdf-0.1.6/docs/ocr/index.md
@@ -92,26 +92,6 @@ surya_opts = SuryaOCROptions(
 ocr_elements = page.apply_ocr(engine='surya', options=surya_opts)
 ```
 
-## Multiple Languages
-
-OCR supports multiple languages:
-
-```python
-# Recognize English and Spanish text
-pdf = PDF('multilingual.pdf', ocr={
-'enabled': True,
-'languages': ['en', 'es']
-})
-
-# Multiple languages with PaddleOCR
-pdf = PDF('multilingual_document.pdf',
-ocr_engine='paddleocr',
-ocr={
-'enabled': True,
-'languages': ['zh', 'ja', 'ko', 'en'] # Chinese, Japanese, Korean, English
-})
-```
-
 ## Applying OCR Directly
 
 The `page.apply_ocr(...)` and `region.apply_ocr(...)` methods are the primary way to run OCR:
@@ -179,39 +159,46 @@ high_conf = page.find_all('text[source=ocr][confidence>=0.8]')
 high_conf.highlight(color="green", label="High Confidence OCR")
 ```
 
-## OCR Debugging
+## Detect + LLM OCR
+
+Sometimes you have a difficult piece of content where you need to use a local model to identify the content, then send it off in pieces to be identified by the LLM. You can do this with Natural PDF!
+
+```python
+from natural_pdf import PDF
+from natural_pdf.ocr.utils import direct_ocr_llm
+import openai
+
+pdf = PDF("needs-ocr.pdf")
+page = pdf.pages[0]
+
+# Detect
+page.apply_ocr('paddle', resolution=120, detect_only=True)
+
+# Build the framework
+client = openai.OpenAI(base_url="https://api.anthropic.com/v1/", api_key='sk-XXXXX')
+prompt = """OCR this image. Return only the exact text from the image. Include misspellings,
+punctuation, etc. Do not surround it with quotation marks. Do not include translations or comments.
+The text is from a Greek spreadsheet, so most likely content is Modern Greek or numeric."""
+
+# This returns the cleaned-up text
+def correct(region):
+    return direct_ocr_llm(region, client, prompt=prompt, resolution=300, model="claude-3-5-haiku-20241022")
+
+# Run 'correct' on each text element
+page.correct_ocr(correct)
+
+# You're done!
+```
 
-For troubleshooting OCR problems:
+## Debugging OCR
 
 ```python
-# Create an interactive HTML debug report
-pdf.debug_ocr("ocr_debug.html")
+from natural_pdf.utils.packaging import create_correction_task_package
 
-# Specify which pages to include
-pdf.debug_ocr("ocr_debug.html", pages=[0, 1, 2])
+create_correction_task_package(pdf, "original.zip", overwrite=True)
 ```
 
-The debug report shows:
-- The original image
-- Text found with confidence scores
-- Boxes around each detected word
-- Options to sort and filter results
-
-## OCR Parameter Tuning
-
-### Parameter Recommendation Table
-
-| Issue | Engine | Parameter | Recommended Value | Effect |
-|-------|--------|-----------|-------------------|--------|
-| Missing text | EasyOCR | `text_threshold` | 0.1 - 0.3 (default: 0.7) | Lower values detect more text but may increase false positives |
-| Missing text | PaddleOCR | `det_db_thresh` | 0.1 - 0.3 (default: 0.3) | Lower values detect more text areas |
-| Low quality scan | EasyOCR | `contrast_ths` | 0.05 - 0.1 (default: 0.1) | Lower values help with low contrast documents |
-| Low quality scan | PaddleOCR | `det_limit_side_len` | 1280 - 2560 (default: 960) | Higher values improve detail detection |
-| Accuracy vs. speed | EasyOCR | `decoder` | "wordbeamsearch" (accuracy)<br>"greedy" (speed) | Word beam search is more accurate but slower |
-| Accuracy vs. speed | PaddleOCR | `rec_batch_num` | 1 (accuracy)<br>8+ (speed) | Larger batches process faster but use more memory |
-| Small text | Both | `min_confidence` | 0.3 - 0.4 (default: 0.5) | Lower confidence threshold to capture small/blurry text |
-| Text orientation | PaddleOCR | `use_angle_cls` | `True` | Enable angle classification for rotated text |
-| Asian languages | PaddleOCR | `lang` | "ch", "japan", "korea" | Use PaddleOCR for Asian languages |
+This will at *some point* be official-ized, but for now you can look at `templates/spa` and see the correction package.
 
 ## Next Steps
217
204