natural-pdf 0.1.5__tar.gz → 0.1.6__tar.gz

Files changed (170)
  1. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/.gitignore +1 -0
  2. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/CLAUDE.md +1 -1
  3. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/PKG-INFO +41 -19
  4. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/README.md +5 -1
  5. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/ocr/index.md +34 -47
  6. natural_pdf-0.1.6/docs/tutorials/01-loading-and-extraction.ipynb +1710 -0
  7. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/02-finding-elements.ipynb +42 -42
  8. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/03-extracting-blocks.ipynb +17 -17
  9. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/04-table-extraction.ipynb +12 -12
  10. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/05-excluding-content.ipynb +30 -30
  11. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/06-document-qa.ipynb +28 -28
  12. natural_pdf-0.1.6/docs/tutorials/07-layout-analysis.ipynb +288 -0
  13. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/07-working-with-regions.ipynb +55 -51
  14. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/07-working-with-regions.md +2 -2
  15. natural_pdf-0.1.6/docs/tutorials/08-spatial-navigation.ipynb +508 -0
  16. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/09-section-extraction.ipynb +113 -113
  17. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/10-form-field-extraction.ipynb +78 -50
  18. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/11-enhanced-table-processing.ipynb +6 -6
  19. natural_pdf-0.1.6/docs/tutorials/12-ocr-integration.ipynb +604 -0
  20. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/12-ocr-integration.md +0 -13
  21. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/13-semantic-search.ipynb +313 -873
  22. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/execute_notebooks.py +3 -2
  23. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/__init__.py +21 -23
  24. natural_pdf-0.1.6/natural_pdf/analyzers/layout/gemini.py +264 -0
  25. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/layout_manager.py +28 -1
  26. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/layout_options.py +11 -0
  27. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/yolo.py +6 -2
  28. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/collections/pdf_collection.py +21 -0
  29. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/core/element_manager.py +16 -13
  30. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/core/page.py +165 -36
  31. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/core/pdf.py +146 -41
  32. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/elements/base.py +11 -17
  33. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/elements/collections.py +100 -38
  34. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/elements/region.py +77 -38
  35. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/elements/text.py +5 -0
  36. natural_pdf-0.1.6/natural_pdf/ocr/__init__.py +78 -0
  37. natural_pdf-0.1.6/natural_pdf/ocr/engine.py +208 -0
  38. natural_pdf-0.1.6/natural_pdf/ocr/engine_easyocr.py +175 -0
  39. natural_pdf-0.1.6/natural_pdf/ocr/engine_paddle.py +147 -0
  40. natural_pdf-0.1.6/natural_pdf/ocr/engine_surya.py +108 -0
  41. natural_pdf-0.1.6/natural_pdf/ocr/ocr_factory.py +114 -0
  42. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/ocr/ocr_manager.py +65 -93
  43. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/ocr/ocr_options.py +7 -17
  44. natural_pdf-0.1.6/natural_pdf/ocr/utils.py +98 -0
  45. natural_pdf-0.1.6/natural_pdf/templates/spa/css/style.css +334 -0
  46. natural_pdf-0.1.6/natural_pdf/templates/spa/index.html +31 -0
  47. natural_pdf-0.1.6/natural_pdf/templates/spa/js/app.js +472 -0
  48. natural_pdf-0.1.6/natural_pdf/templates/spa/words.txt +235976 -0
  49. natural_pdf-0.1.6/natural_pdf/utils/debug.py +32 -0
  50. natural_pdf-0.1.6/natural_pdf/utils/identifiers.py +29 -0
  51. natural_pdf-0.1.6/natural_pdf/utils/packaging.py +418 -0
  52. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf.egg-info/PKG-INFO +41 -19
  53. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf.egg-info/SOURCES.txt +10 -1
  54. natural_pdf-0.1.6/natural_pdf.egg-info/requires.txt +83 -0
  55. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf.egg-info/top_level.txt +0 -1
  56. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/pyproject.toml +68 -37
  57. natural_pdf-0.1.5/docs/tutorials/01-loading-and-extraction.ipynb +0 -1696
  58. natural_pdf-0.1.5/docs/tutorials/07-layout-analysis.ipynb +0 -260
  59. natural_pdf-0.1.5/docs/tutorials/08-spatial-navigation.ipynb +0 -508
  60. natural_pdf-0.1.5/docs/tutorials/12-ocr-integration.ipynb +0 -586
  61. natural_pdf-0.1.5/natural_pdf/ocr/__init__.py +0 -65
  62. natural_pdf-0.1.5/natural_pdf/ocr/engine.py +0 -113
  63. natural_pdf-0.1.5/natural_pdf/ocr/engine_easyocr.py +0 -195
  64. natural_pdf-0.1.5/natural_pdf/ocr/engine_paddle.py +0 -233
  65. natural_pdf-0.1.5/natural_pdf/ocr/engine_surya.py +0 -181
  66. natural_pdf-0.1.5/natural_pdf/templates/ocr_debug.html +0 -517
  67. natural_pdf-0.1.5/natural_pdf.egg-info/requires.txt +0 -61
  68. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/.github/workflows/docs.yml +0 -0
  69. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/LICENSE +0 -0
  70. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/MANIFEST.in +0 -0
  71. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/check_run_md.sh +0 -0
  72. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/api/index.md +0 -0
  73. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/assets/favicon.png +0 -0
  74. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/assets/favicon.svg +0 -0
  75. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/assets/javascripts/custom.js +0 -0
  76. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/assets/logo.svg +0 -0
  77. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/assets/sample-screen.png +0 -0
  78. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/assets/social-preview.png +0 -0
  79. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/assets/social-preview.svg +0 -0
  80. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/assets/stylesheets/custom.css +0 -0
  81. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/document-qa/index.ipynb +0 -0
  82. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/document-qa/index.md +0 -0
  83. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/element-selection/index.ipynb +0 -0
  84. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/element-selection/index.md +0 -0
  85. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/index.md +0 -0
  86. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/installation/index.md +0 -0
  87. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/interactive-widget/index.ipynb +0 -0
  88. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/interactive-widget/index.md +0 -0
  89. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/layout-analysis/index.ipynb +0 -0
  90. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/layout-analysis/index.md +0 -0
  91. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/pdf-navigation/index.ipynb +0 -0
  92. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/pdf-navigation/index.md +0 -0
  93. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/regions/index.ipynb +0 -0
  94. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/regions/index.md +0 -0
  95. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tables/index.ipynb +0 -0
  96. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tables/index.md +0 -0
  97. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/text-analysis/index.ipynb +0 -0
  98. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/text-analysis/index.md +0 -0
  99. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/text-extraction/index.ipynb +0 -0
  100. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/text-extraction/index.md +0 -0
  101. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/01-loading-and-extraction.md +0 -0
  102. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/02-finding-elements.md +0 -0
  103. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/03-extracting-blocks.md +0 -0
  104. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/04-table-extraction.md +0 -0
  105. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/05-excluding-content.md +0 -0
  106. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/06-document-qa.md +0 -0
  107. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/07-layout-analysis.md +0 -0
  108. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/08-spatial-navigation.md +0 -0
  109. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/09-section-extraction.md +0 -0
  110. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/10-form-field-extraction.md +0 -0
  111. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/11-enhanced-table-processing.md +0 -0
  112. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/tutorials/13-semantic-search.md +0 -0
  113. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/visual-debugging/index.ipynb +0 -0
  114. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/visual-debugging/index.md +0 -0
  115. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/docs/visual-debugging/region.png +0 -0
  116. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/mkdocs.yml +0 -0
  117. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/analyzers/__init__.py +0 -0
  118. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/__init__.py +0 -0
  119. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/base.py +0 -0
  120. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/docling.py +0 -0
  121. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/layout_analyzer.py +0 -0
  122. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/paddle.py +0 -0
  123. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/surya.py +0 -0
  124. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/tatr.py +0 -0
  125. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/analyzers/text_options.py +0 -0
  126. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/analyzers/text_structure.py +0 -0
  127. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/analyzers/utils.py +0 -0
  128. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/core/__init__.py +0 -0
  129. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/core/highlighting_service.py +0 -0
  130. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/elements/__init__.py +0 -0
  131. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/elements/line.py +0 -0
  132. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/elements/rect.py +0 -0
  133. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/exporters/__init__.py +0 -0
  134. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/exporters/searchable_pdf.py +0 -0
  135. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/qa/__init__.py +0 -0
  136. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/qa/document_qa.py +0 -0
  137. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/search/__init__.py +0 -0
  138. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/search/haystack_search_service.py +0 -0
  139. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/search/haystack_utils.py +0 -0
  140. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/search/search_options.py +0 -0
  141. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/search/search_service_protocol.py +0 -0
  142. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/search/searchable_mixin.py +0 -0
  143. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/selectors/__init__.py +0 -0
  144. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/selectors/parser.py +0 -0
  145. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/templates/__init__.py +0 -0
  146. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/utils/__init__.py +0 -0
  147. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/utils/highlighting.py +0 -0
  148. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/utils/reading_order.py +0 -0
  149. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/utils/text_extraction.py +0 -0
  150. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/utils/visualization.py +0 -0
  151. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/widgets/__init__.py +0 -0
  152. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/widgets/frontend/viewer.js +0 -0
  153. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf/widgets/viewer.py +0 -0
  154. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/natural_pdf.egg-info/dependency_links.txt +0 -0
  155. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/notebooks/Examples.ipynb +0 -0
  156. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/noxfile.py +0 -0
  157. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/pdfs/.gitkeep +0 -0
  158. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/pdfs/01-practice.pdf +0 -0
  159. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/pdfs/0500000US42001.pdf +0 -0
  160. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/pdfs/0500000US42007.pdf +0 -0
  161. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/pdfs/2014 Statistics.pdf +0 -0
  162. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/pdfs/2019 Statistics.pdf +0 -0
  163. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/pdfs/Atlanta_Public_Schools_GA_sample.pdf +0 -0
  164. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/pdfs/needs-ocr.pdf +0 -0
  165. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/publish.sh +0 -0
  166. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/run_all_tutorials.sh +0 -0
  167. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/sample-screen.png +0 -0
  168. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/setup.cfg +0 -0
  169. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/tests/test_loading.py +0 -0
  170. {natural_pdf-0.1.5 → natural_pdf-0.1.6}/tests/test_optional_deps.py +0 -0
@@ -12,6 +12,7 @@ results
  docs/tutorials/needs-ocr-searchable.pdf
  sample.py
  sample2.py
+ requirements.lock
  pdfs/hidden
  *.hocr
 
@@ -213,7 +213,7 @@ region = page.create_region(50, 50, page.width - 50, page.height - 50)
  sections = region.get_sections(start_elements='text:bold')
 
  # Expand the region around a section
- expanded_section = sections[0].expand(left=20, right=20, top_expand=10, bottom_expand=30)
+ expanded_section = sections[0].expand(left=20, right=20, top=10, bottom=30)
 
  # Use percentage-based expansion
  expanded_section = sections[0].expand(width_factor=1.5, height_factor=1.2) # 50% wider, 20% taller
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: natural-pdf
- Version: 0.1.5
+ Version: 0.1.6
  Summary: A more intuitive interface for working with PDFs
  Author-email: Jonathan Soma <jonathan.soma@gmail.com>
  License-Expression: MIT
@@ -16,12 +16,7 @@ Requires-Dist: Pillow
  Requires-Dist: colour
  Requires-Dist: numpy
  Requires-Dist: urllib3
- Requires-Dist: torch
- Requires-Dist: torchvision
- Requires-Dist: transformers
- Requires-Dist: huggingface_hub
- Requires-Dist: ocrmypdf
- Requires-Dist: pikepdf
+ Requires-Dist: tqdm
  Provides-Extra: interactive
  Requires-Dist: ipywidgets<9.0.0,>=7.0.0; extra == "interactive"
  Provides-Extra: haystack
@@ -29,16 +24,27 @@ Requires-Dist: haystack-ai; extra == "haystack"
  Requires-Dist: chroma-haystack; extra == "haystack"
  Requires-Dist: sentence-transformers; extra == "haystack"
  Requires-Dist: protobuf<4; extra == "haystack"
+ Requires-Dist: natural-pdf[core-ml]; extra == "haystack"
  Provides-Extra: easyocr
  Requires-Dist: easyocr; extra == "easyocr"
+ Requires-Dist: natural-pdf[core-ml]; extra == "easyocr"
  Provides-Extra: paddle
  Requires-Dist: paddlepaddle; extra == "paddle"
  Requires-Dist: paddleocr; extra == "paddle"
  Provides-Extra: layout-yolo
  Requires-Dist: doclayout_yolo; extra == "layout-yolo"
+ Requires-Dist: natural-pdf[core-ml]; extra == "layout-yolo"
  Provides-Extra: surya
  Requires-Dist: surya-ocr; extra == "surya"
+ Requires-Dist: natural-pdf[core-ml]; extra == "surya"
  Provides-Extra: qa
+ Requires-Dist: natural-pdf[core-ml]; extra == "qa"
+ Provides-Extra: docling
+ Requires-Dist: docling; extra == "docling"
+ Requires-Dist: natural-pdf[core-ml]; extra == "docling"
+ Provides-Extra: llm
+ Requires-Dist: openai>=1.0; extra == "llm"
+ Requires-Dist: pydantic; extra == "llm"
  Provides-Extra: test
  Requires-Dist: pytest; extra == "test"
  Provides-Extra: dev
@@ -50,18 +56,30 @@ Requires-Dist: nox; extra == "dev"
  Requires-Dist: nox-uv; extra == "dev"
  Requires-Dist: build; extra == "dev"
  Requires-Dist: uv; extra == "dev"
+ Requires-Dist: pipdeptree; extra == "dev"
+ Requires-Dist: nbformat; extra == "dev"
+ Requires-Dist: jupytext; extra == "dev"
+ Requires-Dist: nbclient; extra == "dev"
  Provides-Extra: all
- Requires-Dist: ipywidgets<9.0.0,>=7.0.0; extra == "all"
- Requires-Dist: easyocr; extra == "all"
- Requires-Dist: paddlepaddle; extra == "all"
- Requires-Dist: paddleocr; extra == "all"
- Requires-Dist: doclayout_yolo; extra == "all"
- Requires-Dist: surya-ocr; extra == "all"
- Requires-Dist: haystack-ai; extra == "all"
- Requires-Dist: chroma-haystack; extra == "all"
- Requires-Dist: sentence-transformers; extra == "all"
- Requires-Dist: protobuf<4; extra == "all"
- Requires-Dist: pytest; extra == "all"
+ Requires-Dist: natural-pdf[interactive]; extra == "all"
+ Requires-Dist: natural-pdf[haystack]; extra == "all"
+ Requires-Dist: natural-pdf[easyocr]; extra == "all"
+ Requires-Dist: natural-pdf[paddle]; extra == "all"
+ Requires-Dist: natural-pdf[layout_yolo]; extra == "all"
+ Requires-Dist: natural-pdf[surya]; extra == "all"
+ Requires-Dist: natural-pdf[qa]; extra == "all"
+ Requires-Dist: natural-pdf[ocr-export]; extra == "all"
+ Requires-Dist: natural-pdf[docling]; extra == "all"
+ Requires-Dist: natural-pdf[llm]; extra == "all"
+ Requires-Dist: natural-pdf[test]; extra == "all"
+ Provides-Extra: core-ml
+ Requires-Dist: torch; extra == "core-ml"
+ Requires-Dist: torchvision; extra == "core-ml"
+ Requires-Dist: transformers; extra == "core-ml"
+ Requires-Dist: huggingface_hub; extra == "core-ml"
+ Provides-Extra: ocr-export
+ Requires-Dist: ocrmypdf; extra == "ocr-export"
+ Requires-Dist: pikepdf; extra == "ocr-export"
  Dynamic: license-file
 
  # Natural PDF
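
Note on the metadata hunks above: `torch`, `torchvision`, `transformers`, and `huggingface_hub` move out of the unconditional requirements and into a new `core-ml` extra, so a plain `pip install natural-pdf` of 0.1.6 may no longer pull them in. A minimal sketch of how calling code might check for them before using ML-backed features follows. `missing_core_ml_modules` is a hypothetical helper written for illustration and is not part of natural-pdf. It also assumes each distribution's import name matches the name listed in the metadata.

```python
from importlib.util import find_spec

# Assumption: these import names correspond to the distributions listed
# under the "core-ml" extra in the 0.1.6 metadata.
CORE_ML_MODULES = ("torch", "torchvision", "transformers", "huggingface_hub")


def missing_core_ml_modules(modules=CORE_ML_MODULES):
    """Return the module names that cannot be located in this environment.

    Hypothetical helper for illustration; not part of natural-pdf.
    """
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_core_ml_modules()
    if missing:
        print("Missing optional modules:", ", ".join(missing))
        print('Consider: pip install "natural-pdf[core-ml]"')
    else:
        print("All core-ml modules are importable.")
```

`find_spec` only locates a module without importing it, so the check stays cheap even when a heavy package such as `torch` is installed.
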
@@ -89,6 +107,10 @@ pip install natural-pdf[easyocr]
  pip install natural-pdf[surya]
  pip install natural-pdf[paddle]
 
+ # Example: Install support for features using Large Language Models (e.g., via OpenAI-compatible APIs)
+ pip install natural-pdf[llm]
+ # (May require setting API key environment variables, e.g., GOOGLE_API_KEY for Gemini)
+
  # Example: Install with interactive viewer support
  pip install natural-pdf[interactive]
 
@@ -141,7 +163,7 @@ Natural PDF offers a range of features for working with PDFs:
  * **Spatial Navigation:** Select content relative to other elements (`heading.below()`, `element.select_until(...)`).
  * **Text & Table Extraction:** Get clean text or structured table data, automatically handling exclusions.
  * **OCR Integration:** Extract text from scanned documents using engines like EasyOCR, PaddleOCR, or Surya.
- * **Layout Analysis:** Detect document structures (titles, paragraphs, tables) using AI models.
+ * **Layout Analysis:** Detect document structures (titles, paragraphs, tables) using various engines (e.g., YOLO, Paddle, LLM via API).
  * **Document QA:** Ask natural language questions about your document's content.
  * **Semantic Search:** Index PDFs and find relevant pages or documents based on semantic meaning using Haystack.
  * **Visual Debugging:** Highlight elements and use an interactive viewer or save images to understand your selections.
@@ -23,6 +23,10 @@ pip install natural-pdf[easyocr]
  pip install natural-pdf[surya]
  pip install natural-pdf[paddle]
 
+ # Example: Install support for features using Large Language Models (e.g., via OpenAI-compatible APIs)
+ pip install natural-pdf[llm]
+ # (May require setting API key environment variables, e.g., GOOGLE_API_KEY for Gemini)
+
  # Example: Install with interactive viewer support
  pip install natural-pdf[interactive]
 
@@ -75,7 +79,7 @@ Natural PDF offers a range of features for working with PDFs:
  * **Spatial Navigation:** Select content relative to other elements (`heading.below()`, `element.select_until(...)`).
  * **Text & Table Extraction:** Get clean text or structured table data, automatically handling exclusions.
  * **OCR Integration:** Extract text from scanned documents using engines like EasyOCR, PaddleOCR, or Surya.
- * **Layout Analysis:** Detect document structures (titles, paragraphs, tables) using AI models.
+ * **Layout Analysis:** Detect document structures (titles, paragraphs, tables) using various engines (e.g., YOLO, Paddle, LLM via API).
  * **Document QA:** Ask natural language questions about your document's content.
  * **Semantic Search:** Index PDFs and find relevant pages or documents based on semantic meaning using Haystack.
  * **Visual Debugging:** Highlight elements and use an interactive viewer or save images to understand your selections.
@@ -92,26 +92,6 @@ surya_opts = SuryaOCROptions(
  ocr_elements = page.apply_ocr(engine='surya', options=surya_opts)
  ```
 
- ## Multiple Languages
-
- OCR supports multiple languages:
-
- ```python
- # Recognize English and Spanish text
- pdf = PDF('multilingual.pdf', ocr={
- 'enabled': True,
- 'languages': ['en', 'es']
- })
-
- # Multiple languages with PaddleOCR
- pdf = PDF('multilingual_document.pdf',
- ocr_engine='paddleocr',
- ocr={
- 'enabled': True,
- 'languages': ['zh', 'ja', 'ko', 'en'] # Chinese, Japanese, Korean, English
- })
- ```
-
  ## Applying OCR Directly
 
  The `page.apply_ocr(...)` and `region.apply_ocr(...)` methods are the primary way to run OCR:
@@ -179,39 +159,46 @@ high_conf = page.find_all('text[source=ocr][confidence>=0.8]')
  high_conf.highlight(color="green", label="High Confidence OCR")
  ```
 
- ## OCR Debugging
+ ## Detect + LLM OCR
+
+ Sometimes you have a difficult piece of content where you need to use a local model to identify the content, then send it off in pieces to be identified by the LLM. You can do this with Natural PDF!
+
+ ```python
+ from natural_pdf import PDF
+ from natural_pdf.ocr.utils import direct_ocr_llm
+ import openai
+
+ pdf = PDF("needs-ocr.pdf")
+ page = pdf.pages[0]
+
+ # Detect
+ page.apply_ocr('paddle', resolution=120, detect_only=True)
+
+ # Build the framework
+ client = openai.OpenAI(base_url="https://api.anthropic.com/v1/", api_key='sk-XXXXX')
+ prompt = """OCR this image. Return only the exact text from the image. Include misspellings,
+ punctuation, etc. Do not surround it with quotation marks. Do not include translations or comments.
+ The text is from a Greek spreadsheet, so most likely content is Modern Greek or numeric."""
+
+ # This returns the cleaned-up text
+ def correct(region):
+     return direct_ocr_llm(region, client, prompt=prompt, resolution=300, model="claude-3-5-haiku-20241022")
+
+ # Run 'correct' on each text element
+ page.correct_ocr(correct)
+
+ # You're done!
+ ```
 
- For troubleshooting OCR problems:
+ ## Debugging OCR
 
  ```python
- # Create an interactive HTML debug report
- pdf.debug_ocr("ocr_debug.html")
+ from natural_pdf.utils.packaging import create_correction_task_package
 
- # Specify which pages to include
- pdf.debug_ocr("ocr_debug.html", pages=[0, 1, 2])
+ create_correction_task_package(pdf, "original.zip", overwrite=True)
  ```
 
- The debug report shows:
- - The original image
- - Text found with confidence scores
- - Boxes around each detected word
- - Options to sort and filter results
-
- ## OCR Parameter Tuning
-
- ### Parameter Recommendation Table
-
- | Issue | Engine | Parameter | Recommended Value | Effect |
- |-------|--------|-----------|-------------------|--------|
- | Missing text | EasyOCR | `text_threshold` | 0.1 - 0.3 (default: 0.7) | Lower values detect more text but may increase false positives |
- | Missing text | PaddleOCR | `det_db_thresh` | 0.1 - 0.3 (default: 0.3) | Lower values detect more text areas |
- | Low quality scan | EasyOCR | `contrast_ths` | 0.05 - 0.1 (default: 0.1) | Lower values help with low contrast documents |
- | Low quality scan | PaddleOCR | `det_limit_side_len` | 1280 - 2560 (default: 960) | Higher values improve detail detection |
- | Accuracy vs. speed | EasyOCR | `decoder` | "wordbeamsearch" (accuracy)<br>"greedy" (speed) | Word beam search is more accurate but slower |
- | Accuracy vs. speed | PaddleOCR | `rec_batch_num` | 1 (accuracy)<br>8+ (speed) | Larger batches process faster but use more memory |
- | Small text | Both | `min_confidence` | 0.3 - 0.4 (default: 0.5) | Lower confidence threshold to capture small/blurry text |
- | Text orientation | PaddleOCR | `use_angle_cls` | `True` | Enable angle classification for rotated text |
- | Asian languages | PaddleOCR | `lang` | "ch", "japan", "korea" | Use PaddleOCR for Asian languages |
+ This will at *some point* be official-ized, but for now you can look at `templates/spa` and see the correction package.
 
  ## Next Steps