natural-pdf 0.1.4.tar.gz → 0.1.6.tar.gz
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/.gitignore +5 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/CLAUDE.md +1 -1
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/PKG-INFO +53 -17
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/README.md +5 -1
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/ocr/index.md +34 -47
- natural_pdf-0.1.6/docs/tutorials/01-loading-and-extraction.ipynb +1710 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/02-finding-elements.ipynb +42 -42
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/03-extracting-blocks.ipynb +18 -18
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/04-table-extraction.ipynb +12 -12
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/05-excluding-content.ipynb +32 -32
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/06-document-qa.ipynb +44 -44
- natural_pdf-0.1.6/docs/tutorials/07-layout-analysis.ipynb +288 -0
- natural_pdf-0.1.6/docs/tutorials/07-working-with-regions.ipynb +413 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/07-working-with-regions.md +2 -2
- natural_pdf-0.1.6/docs/tutorials/08-spatial-navigation.ipynb +508 -0
- natural_pdf-0.1.6/docs/tutorials/09-section-extraction.ipynb +2434 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/10-form-field-extraction.ipynb +91 -63
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/11-enhanced-table-processing.ipynb +6 -6
- natural_pdf-0.1.6/docs/tutorials/12-ocr-integration.ipynb +604 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/12-ocr-integration.md +0 -13
- natural_pdf-0.1.6/docs/tutorials/13-semantic-search.ipynb +1328 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/execute_notebooks.py +120 -68
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/__init__.py +50 -33
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/analyzers/__init__.py +2 -1
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/base.py +32 -24
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/docling.py +131 -72
- natural_pdf-0.1.6/natural_pdf/analyzers/layout/gemini.py +264 -0
- natural_pdf-0.1.6/natural_pdf/analyzers/layout/layout_analyzer.py +298 -0
- natural_pdf-0.1.6/natural_pdf/analyzers/layout/layout_manager.py +270 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/layout_options.py +43 -17
- natural_pdf-0.1.6/natural_pdf/analyzers/layout/paddle.py +297 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/surya.py +164 -92
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/tatr.py +149 -84
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/yolo.py +89 -45
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/analyzers/text_options.py +22 -15
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/analyzers/text_structure.py +131 -85
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/analyzers/utils.py +30 -23
- natural_pdf-0.1.6/natural_pdf/collections/pdf_collection.py +308 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/core/__init__.py +1 -1
- natural_pdf-0.1.6/natural_pdf/core/element_manager.py +539 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/core/highlighting_service.py +268 -196
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/core/page.py +1044 -521
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/core/pdf.py +516 -313
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/elements/__init__.py +1 -1
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/elements/base.py +307 -225
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/elements/collections.py +805 -543
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/elements/line.py +39 -36
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/elements/rect.py +32 -30
- natural_pdf-0.1.6/natural_pdf/elements/region.py +1730 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/elements/text.py +127 -99
- natural_pdf-0.1.6/natural_pdf/exporters/searchable_pdf.py +411 -0
- natural_pdf-0.1.6/natural_pdf/ocr/__init__.py +78 -0
- natural_pdf-0.1.6/natural_pdf/ocr/engine.py +208 -0
- natural_pdf-0.1.6/natural_pdf/ocr/engine_easyocr.py +175 -0
- natural_pdf-0.1.6/natural_pdf/ocr/engine_paddle.py +147 -0
- natural_pdf-0.1.6/natural_pdf/ocr/engine_surya.py +108 -0
- natural_pdf-0.1.6/natural_pdf/ocr/ocr_factory.py +114 -0
- natural_pdf-0.1.6/natural_pdf/ocr/ocr_manager.py +189 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/ocr/ocr_options.py +16 -20
- natural_pdf-0.1.6/natural_pdf/ocr/utils.py +98 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/qa/__init__.py +1 -1
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/qa/document_qa.py +119 -111
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/search/__init__.py +37 -31
- natural_pdf-0.1.6/natural_pdf/search/haystack_search_service.py +643 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/search/haystack_utils.py +186 -122
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/search/search_options.py +25 -14
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/search/search_service_protocol.py +12 -6
- natural_pdf-0.1.6/natural_pdf/search/searchable_mixin.py +549 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/selectors/__init__.py +2 -1
- natural_pdf-0.1.6/natural_pdf/selectors/parser.py +411 -0
- natural_pdf-0.1.6/natural_pdf/templates/__init__.py +1 -0
- natural_pdf-0.1.6/natural_pdf/templates/spa/css/style.css +334 -0
- natural_pdf-0.1.6/natural_pdf/templates/spa/index.html +31 -0
- natural_pdf-0.1.6/natural_pdf/templates/spa/js/app.js +472 -0
- natural_pdf-0.1.6/natural_pdf/templates/spa/words.txt +235976 -0
- natural_pdf-0.1.6/natural_pdf/utils/debug.py +32 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/utils/highlighting.py +8 -2
- natural_pdf-0.1.6/natural_pdf/utils/identifiers.py +29 -0
- natural_pdf-0.1.6/natural_pdf/utils/packaging.py +418 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/utils/reading_order.py +65 -63
- natural_pdf-0.1.6/natural_pdf/utils/text_extraction.py +195 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/utils/visualization.py +70 -61
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/widgets/__init__.py +2 -3
- natural_pdf-0.1.6/natural_pdf/widgets/viewer.py +796 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf.egg-info/PKG-INFO +53 -17
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf.egg-info/SOURCES.txt +15 -103
- natural_pdf-0.1.6/natural_pdf.egg-info/requires.txt +83 -0
- natural_pdf-0.1.6/natural_pdf.egg-info/top_level.txt +8 -0
- natural_pdf-0.1.6/noxfile.py +78 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/pyproject.toml +79 -24
- natural_pdf-0.1.6/tests/test_loading.py +50 -0
- natural_pdf-0.1.6/tests/test_optional_deps.py +298 -0
- natural_pdf-0.1.4/docs/tutorials/01-loading-and-extraction.ipynb +0 -1700
- natural_pdf-0.1.4/docs/tutorials/07-layout-analysis.ipynb +0 -260
- natural_pdf-0.1.4/docs/tutorials/07-working-with-regions.ipynb +0 -409
- natural_pdf-0.1.4/docs/tutorials/08-spatial-navigation.ipynb +0 -508
- natural_pdf-0.1.4/docs/tutorials/09-section-extraction.ipynb +0 -2428
- natural_pdf-0.1.4/docs/tutorials/12-ocr-integration.ipynb +0 -601
- natural_pdf-0.1.4/docs/tutorials/13-semantic-search.ipynb +0 -1904
- natural_pdf-0.1.4/natural_pdf/analyzers/layout/layout_analyzer.py +0 -255
- natural_pdf-0.1.4/natural_pdf/analyzers/layout/layout_manager.py +0 -203
- natural_pdf-0.1.4/natural_pdf/analyzers/layout/paddle.py +0 -240
- natural_pdf-0.1.4/natural_pdf/collections/pdf_collection.py +0 -259
- natural_pdf-0.1.4/natural_pdf/core/element_manager.py +0 -457
- natural_pdf-0.1.4/natural_pdf/elements/region.py +0 -1720
- natural_pdf-0.1.4/natural_pdf/exporters/__init__.py +0 -1
- natural_pdf-0.1.4/natural_pdf/exporters/searchable_pdf.py +0 -252
- natural_pdf-0.1.4/natural_pdf/ocr/__init__.py +0 -56
- natural_pdf-0.1.4/natural_pdf/ocr/engine.py +0 -104
- natural_pdf-0.1.4/natural_pdf/ocr/engine_easyocr.py +0 -179
- natural_pdf-0.1.4/natural_pdf/ocr/engine_paddle.py +0 -204
- natural_pdf-0.1.4/natural_pdf/ocr/engine_surya.py +0 -171
- natural_pdf-0.1.4/natural_pdf/ocr/ocr_manager.py +0 -191
- natural_pdf-0.1.4/natural_pdf/search/haystack_search_service.py +0 -520
- natural_pdf-0.1.4/natural_pdf/search/searchable_mixin.py +0 -464
- natural_pdf-0.1.4/natural_pdf/selectors/parser.py +0 -568
- natural_pdf-0.1.4/natural_pdf/templates/__init__.py +0 -1
- natural_pdf-0.1.4/natural_pdf/templates/ocr_debug.html +0 -517
- natural_pdf-0.1.4/natural_pdf/widgets/viewer.py +0 -765
- natural_pdf-0.1.4/natural_pdf.egg-info/requires.txt +0 -45
- natural_pdf-0.1.4/natural_pdf.egg-info/top_level.txt +0 -1
- natural_pdf-0.1.4/output/all_detected_regions.png +0 -0
- natural_pdf-0.1.4/output/all_elements.png +0 -0
- natural_pdf-0.1.4/output/basic_highlighting.png +0 -0
- natural_pdf-0.1.4/output/chainable_layout.png +0 -0
- natural_pdf-0.1.4/output/chained_analysis.png +0 -0
- natural_pdf-0.1.4/output/color_names.png +0 -0
- natural_pdf-0.1.4/output/color_names_with_boxes.png +0 -0
- natural_pdf-0.1.4/output/conf_display_highlight_all.png +0 -0
- natural_pdf-0.1.4/output/conf_display_highlight_layout.png +0 -0
- natural_pdf-0.1.4/output/conf_display_layout_only.png +0 -0
- natural_pdf-0.1.4/output/confidence_color_coded.png +0 -0
- natural_pdf-0.1.4/output/debug_page_image.png +0 -0
- natural_pdf-0.1.4/output/detected_table.png +0 -0
- natural_pdf-0.1.4/output/dimension_analysis.txt +0 -48
- natural_pdf-0.1.4/output/direct_ocr_debug.png +0 -0
- natural_pdf-0.1.4/output/easyocr_debug_input.png +0 -0
- natural_pdf-0.1.4/output/easyocr_results.png +0 -0
- natural_pdf-0.1.4/output/easyocr_test_input.png +0 -0
- natural_pdf-0.1.4/output/exclusion_optimization_regions.png +0 -0
- natural_pdf-0.1.4/output/explicit_confidence_display.png +0 -0
- natural_pdf-0.1.4/output/footer_overlap_test.png +0 -0
- natural_pdf-0.1.4/output/highlight_all.png +0 -0
- natural_pdf-0.1.4/output/highlight_all_styles.png +0 -0
- natural_pdf-0.1.4/output/highlight_all_with_all_layouts.png +0 -0
- natural_pdf-0.1.4/output/highlight_all_with_attrs.png +0 -0
- natural_pdf-0.1.4/output/highlight_all_with_yolo.png +0 -0
- natural_pdf-0.1.4/output/highlight_by_confidence.png +0 -0
- natural_pdf-0.1.4/output/highlight_color_test_1.png +0 -0
- natural_pdf-0.1.4/output/highlight_color_test_2.png +0 -0
- natural_pdf-0.1.4/output/highlight_color_test_3.png +0 -0
- natural_pdf-0.1.4/output/highlight_color_test_4.png +0 -0
- natural_pdf-0.1.4/output/highlight_layout_method.png +0 -0
- natural_pdf-0.1.4/output/highlight_multiple.png +0 -0
- natural_pdf-0.1.4/output/highlight_no_attrs.png +0 -0
- natural_pdf-0.1.4/output/highlight_region.png +0 -0
- natural_pdf-0.1.4/output/highlight_single.png +0 -0
- natural_pdf-0.1.4/output/highlight_specific_types.png +0 -0
- natural_pdf-0.1.4/output/highlight_specific_types_with_boxes.png +0 -0
- natural_pdf-0.1.4/output/highlight_specific_types_with_tables.png +0 -0
- natural_pdf-0.1.4/output/highlight_test.png +0 -0
- natural_pdf-0.1.4/output/highlight_test_colors.png +0 -0
- natural_pdf-0.1.4/output/highlight_test_individual.png +0 -0
- natural_pdf-0.1.4/output/highlight_test_individual_annotated.png +0 -0
- natural_pdf-0.1.4/output/highlight_test_individual_with_structure.png +0 -0
- natural_pdf-0.1.4/output/highlight_test_individual_with_structure_yolo.png +0 -0
- natural_pdf-0.1.4/output/highlight_test_individual_with_tables.png +0 -0
- natural_pdf-0.1.4/output/highlight_with_attrs.png +0 -0
- natural_pdf-0.1.4/output/layout_conf_default.png +0 -0
- natural_pdf-0.1.4/output/layout_detection.png +0 -0
- natural_pdf-0.1.4/output/layout_fix_test.png +0 -0
- natural_pdf-0.1.4/output/layout_fix_test2.png +0 -0
- natural_pdf-0.1.4/output/layout_fix_test3.png +0 -0
- natural_pdf-0.1.4/output/layout_fix_test4.png +0 -0
- natural_pdf-0.1.4/output/model_comparison.png +0 -0
- natural_pdf-0.1.4/output/multiple_attributes_display.png +0 -0
- natural_pdf-0.1.4/output/ocr_confidence_visualization.png +0 -0
- natural_pdf-0.1.4/output/ocr_debug.png +0 -0
- natural_pdf-0.1.4/output/ocr_debug_page.html +0 -517
- natural_pdf-0.1.4/output/ocr_highlight_all_test.png +0 -0
- natural_pdf-0.1.4/output/ocr_highlight_test.png +0 -0
- natural_pdf-0.1.4/output/ocr_highlighted.png +0 -0
- natural_pdf-0.1.4/output/ocr_simplified.png +0 -0
- natural_pdf-0.1.4/output/ocr_threshold_comparison.png +0 -0
- natural_pdf-0.1.4/output/ocr_visualization_clean.png +0 -0
- natural_pdf-0.1.4/output/ocr_visualization_highlights.png +0 -0
- natural_pdf-0.1.4/output/ocr_visualization_text.png +0 -0
- natural_pdf-0.1.4/output/paddle_layout_detection.png +0 -0
- natural_pdf-0.1.4/output/paddle_layout_polygons.png +0 -0
- natural_pdf-0.1.4/output/paddle_layout_sources.png +0 -0
- natural_pdf-0.1.4/output/paddle_layout_with_text.png +0 -0
- natural_pdf-0.1.4/output/paddle_layout_without_text.png +0 -0
- natural_pdf-0.1.4/output/paddleocr_highlights.png +0 -0
- natural_pdf-0.1.4/output/paddleocr_results.png +0 -0
- natural_pdf-0.1.4/output/paddleocr_test_input.png +0 -0
- natural_pdf-0.1.4/output/page_1_for_ocr.png +0 -0
- natural_pdf-0.1.4/output/page_4_for_ocr.png +0 -0
- natural_pdf-0.1.4/output/region_exclusion_test.png +0 -0
- natural_pdf-0.1.4/output/region_management_test.png +0 -0
- natural_pdf-0.1.4/output/region_ocr_cropped.png +0 -0
- natural_pdf-0.1.4/output/region_ocr_debug.png +0 -0
- natural_pdf-0.1.4/output/region_ocr_full_page.png +0 -0
- natural_pdf-0.1.4/output/region_ocr_highlighted.png +0 -0
- natural_pdf-0.1.4/output/spatial_navigation.png +0 -0
- natural_pdf-0.1.4/output/standard_highlight_all.png +0 -0
- natural_pdf-0.1.4/output/table_no_ocr.csv +0 -54
- natural_pdf-0.1.4/output/table_structure.png +0 -0
- natural_pdf-0.1.4/output/table_structure_detail.png +0 -0
- natural_pdf-0.1.4/output/table_with_ocr.csv +0 -54
- natural_pdf-0.1.4/output/tatr_cells_test.png +0 -0
- natural_pdf-0.1.4/output/tatr_ocr_table_test.png +0 -0
- natural_pdf-0.1.4/output/tatr_regions.png +0 -0
- natural_pdf-0.1.4/output/tatr_regions.txt +0 -16
- natural_pdf-0.1.4/output/text_styles.png +0 -0
- natural_pdf-0.1.4/output/titles_only.png +0 -0
- natural_pdf-0.1.4/output/width_1200px.png +0 -0
- natural_pdf-0.1.4/output/width_800px.png +0 -0
- natural_pdf-0.1.4/output/width_default.png +0 -0
- natural_pdf-0.1.4/output/width_with_scale.png +0 -0
- natural_pdf-0.1.4/output/yolo_regions.png +0 -0
- natural_pdf-0.1.4/output/yolo_regions.txt +0 -9
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/.github/workflows/docs.yml +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/LICENSE +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/MANIFEST.in +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/check_run_md.sh +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/api/index.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/assets/favicon.png +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/assets/favicon.svg +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/assets/javascripts/custom.js +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/assets/logo.svg +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/assets/sample-screen.png +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/assets/social-preview.png +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/assets/social-preview.svg +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/assets/stylesheets/custom.css +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/document-qa/index.ipynb +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/document-qa/index.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/element-selection/index.ipynb +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/element-selection/index.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/index.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/installation/index.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/interactive-widget/index.ipynb +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/interactive-widget/index.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/layout-analysis/index.ipynb +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/layout-analysis/index.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/pdf-navigation/index.ipynb +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/pdf-navigation/index.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/regions/index.ipynb +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/regions/index.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tables/index.ipynb +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tables/index.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/text-analysis/index.ipynb +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/text-analysis/index.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/text-extraction/index.ipynb +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/text-extraction/index.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/01-loading-and-extraction.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/02-finding-elements.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/03-extracting-blocks.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/04-table-extraction.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/05-excluding-content.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/06-document-qa.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/07-layout-analysis.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/08-spatial-navigation.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/09-section-extraction.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/10-form-field-extraction.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/11-enhanced-table-processing.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/tutorials/13-semantic-search.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/visual-debugging/index.ipynb +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/visual-debugging/index.md +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/visual-debugging/region.png +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/mkdocs.yml +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/analyzers/layout/__init__.py +0 -0
- /natural_pdf-0.1.4/output/layout_conf_high.png → /natural_pdf-0.1.6/natural_pdf/exporters/__init__.py +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/utils/__init__.py +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf/widgets/frontend/viewer.js +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/natural_pdf.egg-info/dependency_links.txt +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/notebooks/Examples.ipynb +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/pdfs/.gitkeep +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/pdfs/01-practice.pdf +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/pdfs/0500000US42001.pdf +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/pdfs/0500000US42007.pdf +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/pdfs/2014 Statistics.pdf +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/pdfs/2019 Statistics.pdf +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/pdfs/Atlanta_Public_Schools_GA_sample.pdf +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/pdfs/needs-ocr.pdf +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/publish.sh +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/run_all_tutorials.sh +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/sample-screen.png +0 -0
- {natural_pdf-0.1.4 → natural_pdf-0.1.6}/setup.cfg +0 -0
{natural_pdf-0.1.4 → natural_pdf-0.1.6}/.gitignore
@@ -1,4 +1,6 @@
|
|
1
1
|
.notebook_cache.json
|
2
|
+
.venv
|
3
|
+
output
|
2
4
|
Untitled.ipynb
|
3
5
|
conversation.md
|
4
6
|
docs/tutorials/pdfs
|
@@ -10,6 +12,9 @@ results
|
|
10
12
|
docs/tutorials/needs-ocr-searchable.pdf
|
11
13
|
sample.py
|
12
14
|
sample2.py
|
15
|
+
requirements.lock
|
16
|
+
pdfs/hidden
|
17
|
+
*.hocr
|
13
18
|
|
14
19
|
# Created by https://www.toptal.com/developers/gitignore/api/python,macos,visualstudiocode,jupyternotebooks
|
15
20
|
# Edit at https://www.toptal.com/developers/gitignore?templates=python,macos,visualstudiocode,jupyternotebooks
|
{natural_pdf-0.1.4 → natural_pdf-0.1.6}/CLAUDE.md
@@ -213,7 +213,7 @@ region = page.create_region(50, 50, page.width - 50, page.height - 50)
|
|
213
213
|
sections = region.get_sections(start_elements='text:bold')
|
214
214
|
|
215
215
|
# Expand the region around a section
|
216
|
-
expanded_section = sections[0].expand(left=20, right=20,
|
216
|
+
expanded_section = sections[0].expand(left=20, right=20, top=10, bottom=30)
|
217
217
|
|
218
218
|
# Use percentage-based expansion
|
219
219
|
expanded_section = sections[0].expand(width_factor=1.5, height_factor=1.2) # 50% wider, 20% taller
|
{natural_pdf-0.1.4 → natural_pdf-0.1.6}/PKG-INFO
@@ -1,6 +1,6 @@
|
|
1
1
|
Metadata-Version: 2.4
|
2
2
|
Name: natural-pdf
|
3
|
-
Version: 0.1.
|
3
|
+
Version: 0.1.6
|
4
4
|
Summary: A more intuitive interface for working with PDFs
|
5
5
|
Author-email: Jonathan Soma <jonathan.soma@gmail.com>
|
6
6
|
License-Expression: MIT
|
@@ -16,38 +16,70 @@ Requires-Dist: Pillow
|
|
16
16
|
Requires-Dist: colour
|
17
17
|
Requires-Dist: numpy
|
18
18
|
Requires-Dist: urllib3
|
19
|
-
Requires-Dist:
|
20
|
-
Requires-Dist: torchvision
|
21
|
-
Requires-Dist: transformers
|
22
|
-
Requires-Dist: huggingface_hub
|
23
|
-
Requires-Dist: ocrmypdf
|
24
|
-
Requires-Dist: pikepdf
|
19
|
+
Requires-Dist: tqdm
|
25
20
|
Provides-Extra: interactive
|
26
21
|
Requires-Dist: ipywidgets<9.0.0,>=7.0.0; extra == "interactive"
|
27
22
|
Provides-Extra: haystack
|
28
23
|
Requires-Dist: haystack-ai; extra == "haystack"
|
29
24
|
Requires-Dist: chroma-haystack; extra == "haystack"
|
30
25
|
Requires-Dist: sentence-transformers; extra == "haystack"
|
26
|
+
Requires-Dist: protobuf<4; extra == "haystack"
|
27
|
+
Requires-Dist: natural-pdf[core-ml]; extra == "haystack"
|
31
28
|
Provides-Extra: easyocr
|
32
29
|
Requires-Dist: easyocr; extra == "easyocr"
|
30
|
+
Requires-Dist: natural-pdf[core-ml]; extra == "easyocr"
|
33
31
|
Provides-Extra: paddle
|
34
32
|
Requires-Dist: paddlepaddle; extra == "paddle"
|
35
33
|
Requires-Dist: paddleocr; extra == "paddle"
|
36
34
|
Provides-Extra: layout-yolo
|
37
35
|
Requires-Dist: doclayout_yolo; extra == "layout-yolo"
|
36
|
+
Requires-Dist: natural-pdf[core-ml]; extra == "layout-yolo"
|
38
37
|
Provides-Extra: surya
|
39
38
|
Requires-Dist: surya-ocr; extra == "surya"
|
39
|
+
Requires-Dist: natural-pdf[core-ml]; extra == "surya"
|
40
40
|
Provides-Extra: qa
|
41
|
+
Requires-Dist: natural-pdf[core-ml]; extra == "qa"
|
42
|
+
Provides-Extra: docling
|
43
|
+
Requires-Dist: docling; extra == "docling"
|
44
|
+
Requires-Dist: natural-pdf[core-ml]; extra == "docling"
|
45
|
+
Provides-Extra: llm
|
46
|
+
Requires-Dist: openai>=1.0; extra == "llm"
|
47
|
+
Requires-Dist: pydantic; extra == "llm"
|
48
|
+
Provides-Extra: test
|
49
|
+
Requires-Dist: pytest; extra == "test"
|
50
|
+
Provides-Extra: dev
|
51
|
+
Requires-Dist: black; extra == "dev"
|
52
|
+
Requires-Dist: isort; extra == "dev"
|
53
|
+
Requires-Dist: mypy; extra == "dev"
|
54
|
+
Requires-Dist: pytest; extra == "dev"
|
55
|
+
Requires-Dist: nox; extra == "dev"
|
56
|
+
Requires-Dist: nox-uv; extra == "dev"
|
57
|
+
Requires-Dist: build; extra == "dev"
|
58
|
+
Requires-Dist: uv; extra == "dev"
|
59
|
+
Requires-Dist: pipdeptree; extra == "dev"
|
60
|
+
Requires-Dist: nbformat; extra == "dev"
|
61
|
+
Requires-Dist: jupytext; extra == "dev"
|
62
|
+
Requires-Dist: nbclient; extra == "dev"
|
41
63
|
Provides-Extra: all
|
42
|
-
Requires-Dist:
|
43
|
-
Requires-Dist:
|
44
|
-
Requires-Dist:
|
45
|
-
Requires-Dist:
|
46
|
-
Requires-Dist:
|
47
|
-
Requires-Dist: surya
|
48
|
-
Requires-Dist:
|
49
|
-
Requires-Dist:
|
50
|
-
Requires-Dist:
|
64
|
+
Requires-Dist: natural-pdf[interactive]; extra == "all"
|
65
|
+
Requires-Dist: natural-pdf[haystack]; extra == "all"
|
66
|
+
Requires-Dist: natural-pdf[easyocr]; extra == "all"
|
67
|
+
Requires-Dist: natural-pdf[paddle]; extra == "all"
|
68
|
+
Requires-Dist: natural-pdf[layout_yolo]; extra == "all"
|
69
|
+
Requires-Dist: natural-pdf[surya]; extra == "all"
|
70
|
+
Requires-Dist: natural-pdf[qa]; extra == "all"
|
71
|
+
Requires-Dist: natural-pdf[ocr-export]; extra == "all"
|
72
|
+
Requires-Dist: natural-pdf[docling]; extra == "all"
|
73
|
+
Requires-Dist: natural-pdf[llm]; extra == "all"
|
74
|
+
Requires-Dist: natural-pdf[test]; extra == "all"
|
75
|
+
Provides-Extra: core-ml
|
76
|
+
Requires-Dist: torch; extra == "core-ml"
|
77
|
+
Requires-Dist: torchvision; extra == "core-ml"
|
78
|
+
Requires-Dist: transformers; extra == "core-ml"
|
79
|
+
Requires-Dist: huggingface_hub; extra == "core-ml"
|
80
|
+
Provides-Extra: ocr-export
|
81
|
+
Requires-Dist: ocrmypdf; extra == "ocr-export"
|
82
|
+
Requires-Dist: pikepdf; extra == "ocr-export"
|
51
83
|
Dynamic: license-file
|
52
84
|
|
53
85
|
# Natural PDF
|
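The metadata hunk above drops torchvision, transformers, huggingface_hub, ocrmypdf and pikepdf from the unconditional requirements and lists them under the new `core-ml` and `ocr-export` extras. Below is a minimal sketch of how calling code might check whether such optional modules are importable before using a feature. The `EXTRA_MODULES` mapping and both helper functions are hypothetical names for illustration, not part of natural-pdf, and the import names are assumed to match the distribution names shown in the hunk.

```python
from importlib.util import find_spec

# Hypothetical mapping from extra name to the modules it is assumed to provide.
EXTRA_MODULES = {
    "core-ml": ["torch", "torchvision", "transformers", "huggingface_hub"],
    "ocr-export": ["ocrmypdf", "pikepdf"],
    "llm": ["openai", "pydantic"],
}


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    # find_spec locates a top-level module without importing it.
    return [name for name in module_names if find_spec(name) is None]


def describe_extra(extra):
    """Build a one-line status message for a single extra."""
    missing = missing_modules(EXTRA_MODULES[extra])
    if not missing:
        return f"{extra}: all modules importable"
    return f"{extra}: missing {', '.join(missing)} (try: pip install natural-pdf[{extra}])"


if __name__ == "__main__":
    for extra_name in EXTRA_MODULES:
        print(describe_extra(extra_name))
```

Because `find_spec` only looks a module up, the check stays cheap even for heavy packages such as torch.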
@@ -75,6 +107,10 @@ pip install natural-pdf[easyocr]
|
|
75
107
|
pip install natural-pdf[surya]
|
76
108
|
pip install natural-pdf[paddle]
|
77
109
|
|
110
|
+
# Example: Install support for features using Large Language Models (e.g., via OpenAI-compatible APIs)
|
111
|
+
pip install natural-pdf[llm]
|
112
|
+
# (May require setting API key environment variables, e.g., GOOGLE_API_KEY for Gemini)
|
113
|
+
|
78
114
|
# Example: Install with interactive viewer support
|
79
115
|
pip install natural-pdf[interactive]
|
80
116
|
|
@@ -127,7 +163,7 @@ Natural PDF offers a range of features for working with PDFs:
|
|
127
163
|
* **Spatial Navigation:** Select content relative to other elements (`heading.below()`, `element.select_until(...)`).
|
128
164
|
* **Text & Table Extraction:** Get clean text or structured table data, automatically handling exclusions.
|
129
165
|
* **OCR Integration:** Extract text from scanned documents using engines like EasyOCR, PaddleOCR, or Surya.
|
130
|
-
* **Layout Analysis:** Detect document structures (titles, paragraphs, tables) using
|
166
|
+
* **Layout Analysis:** Detect document structures (titles, paragraphs, tables) using various engines (e.g., YOLO, Paddle, LLM via API).
|
131
167
|
* **Document QA:** Ask natural language questions about your document's content.
|
132
168
|
* **Semantic Search:** Index PDFs and find relevant pages or documents based on semantic meaning using Haystack.
|
133
169
|
* **Visual Debugging:** Highlight elements and use an interactive viewer or save images to understand your selections.
|
{natural_pdf-0.1.4 → natural_pdf-0.1.6}/README.md
@@ -23,6 +23,10 @@ pip install natural-pdf[easyocr]
|
|
23
23
|
pip install natural-pdf[surya]
|
24
24
|
pip install natural-pdf[paddle]
|
25
25
|
|
26
|
+
# Example: Install support for features using Large Language Models (e.g., via OpenAI-compatible APIs)
|
27
|
+
pip install natural-pdf[llm]
|
28
|
+
# (May require setting API key environment variables, e.g., GOOGLE_API_KEY for Gemini)
|
29
|
+
|
26
30
|
# Example: Install with interactive viewer support
|
27
31
|
pip install natural-pdf[interactive]
|
28
32
|
|
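The README lines added above note that the `llm` extra may need an API key in the environment. Below is a small sketch of failing early when that key is absent, assuming it lives in `GOOGLE_API_KEY` as in the README's own example. `require_api_key` is a hypothetical helper, not a natural-pdf function.

```python
import os


def require_api_key(var_name="GOOGLE_API_KEY", environ=None):
    """Return the API key stored in var_name, or raise if it is unset or blank."""
    # Fall back to the real process environment unless a mapping is supplied.
    env = os.environ if environ is None else environ
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it before using LLM-backed features"
        )
    return key
```

Passing `environ` explicitly keeps the helper easy to exercise without touching the real environment.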
@@ -75,7 +79,7 @@ Natural PDF offers a range of features for working with PDFs:
|
|
75
79
|
* **Spatial Navigation:** Select content relative to other elements (`heading.below()`, `element.select_until(...)`).
|
76
80
|
* **Text & Table Extraction:** Get clean text or structured table data, automatically handling exclusions.
|
77
81
|
* **OCR Integration:** Extract text from scanned documents using engines like EasyOCR, PaddleOCR, or Surya.
|
78
|
-
* **Layout Analysis:** Detect document structures (titles, paragraphs, tables) using
|
82
|
+
* **Layout Analysis:** Detect document structures (titles, paragraphs, tables) using various engines (e.g., YOLO, Paddle, LLM via API).
|
79
83
|
* **Document QA:** Ask natural language questions about your document's content.
|
80
84
|
* **Semantic Search:** Index PDFs and find relevant pages or documents based on semantic meaning using Haystack.
|
81
85
|
* **Visual Debugging:** Highlight elements and use an interactive viewer or save images to understand your selections.
|
{natural_pdf-0.1.4 → natural_pdf-0.1.6}/docs/ocr/index.md
@@ -92,26 +92,6 @@ surya_opts = SuryaOCROptions(
|
|
92
92
|
ocr_elements = page.apply_ocr(engine='surya', options=surya_opts)
|
93
93
|
```
|
94
94
|
|
95
|
-
## Multiple Languages
|
96
|
-
|
97
|
-
OCR supports multiple languages:
|
98
|
-
|
99
|
-
```python
|
100
|
-
# Recognize English and Spanish text
|
101
|
-
pdf = PDF('multilingual.pdf', ocr={
|
102
|
-
'enabled': True,
|
103
|
-
'languages': ['en', 'es']
|
104
|
-
})
|
105
|
-
|
106
|
-
# Multiple languages with PaddleOCR
|
107
|
-
pdf = PDF('multilingual_document.pdf',
|
108
|
-
ocr_engine='paddleocr',
|
109
|
-
ocr={
|
110
|
-
'enabled': True,
|
111
|
-
'languages': ['zh', 'ja', 'ko', 'en'] # Chinese, Japanese, Korean, English
|
112
|
-
})
|
113
|
-
```
|
114
|
-
|
115
95
|
## Applying OCR Directly
|
116
96
|
|
117
97
|
The `page.apply_ocr(...)` and `region.apply_ocr(...)` methods are the primary way to run OCR:
|
@@ -179,39 +159,46 @@ high_conf = page.find_all('text[source=ocr][confidence>=0.8]')
|
|
179
159
|
high_conf.highlight(color="green", label="High Confidence OCR")
|
180
160
|
```
|
181
161
|
|
182
|
-
## OCR
|
162
|
+
## Detect + LLM OCR
|
163
|
+
|
164
|
+
Sometimes you have a difficult piece of content where you need to use a local model to identify the content, then send it off in pieces to be identified by the LLM. You can do this with Natural PDF!
|
165
|
+
|
166
|
+
```python
|
167
|
+
from natural_pdf import PDF
|
168
|
+
from natural_pdf.ocr.utils import direct_ocr_llm
|
169
|
+
import openai
|
170
|
+
|
171
|
+
pdf = PDF("needs-ocr.pdf")
|
172
|
+
page = pdf.pages[0]
|
173
|
+
|
174
|
+
# Detect
|
175
|
+
page.apply_ocr('paddle', resolution=120, detect_only=True)
|
176
|
+
|
177
|
+
# Build the framework
|
178
|
+
client = openai.OpenAI(base_url="https://api.anthropic.com/v1/", api_key='sk-XXXXX')
|
179
|
+
prompt = """OCR this image. Return only the exact text from the image. Include misspellings,
|
180
|
+
punctuation, etc. Do not surround it with quotation marks. Do not include translations or comments.
|
181
|
+
The text is from a Greek spreadsheet, so most likely content is Modern Greek or numeric."""
|
182
|
+
|
183
|
+
# This returns the cleaned-up text
|
184
|
+
def correct(region):
|
185
|
+
return direct_ocr_llm(region, client, prompt=prompt, resolution=300, model="claude-3-5-haiku-20241022")
|
186
|
+
|
187
|
+
# Run 'correct' on each text element
|
188
|
+
page.correct_ocr(correct)
|
189
|
+
|
190
|
+
# You're done!
|
191
|
+
```
|
183
192
|
|
184
|
-
|
193
|
+
## Debugging OCR
|
185
194
|
|
186
195
|
```python
|
187
|
-
|
188
|
-
pdf.debug_ocr("ocr_debug.html")
|
196
|
+
from natural_pdf.utils.packaging import create_correction_task_package
|
189
197
|
|
190
|
-
|
191
|
-
pdf.debug_ocr("ocr_debug.html", pages=[0, 1, 2])
|
198
|
+
create_correction_task_package(pdf, "original.zip", overwrite=True)
|
192
199
|
```
|
193
200
|
|
194
|
-
|
195
|
-
- The original image
|
196
|
-
- Text found with confidence scores
|
197
|
-
- Boxes around each detected word
|
198
|
-
- Options to sort and filter results
|
199
|
-
|
200
|
-
## OCR Parameter Tuning
|
201
|
-
|
202
|
-
### Parameter Recommendation Table
|
203
|
-
|
204
|
-
| Issue | Engine | Parameter | Recommended Value | Effect |
|
205
|
-
|-------|--------|-----------|-------------------|--------|
|
206
|
-
| Missing text | EasyOCR | `text_threshold` | 0.1 - 0.3 (default: 0.7) | Lower values detect more text but may increase false positives |
|
207
|
-
| Missing text | PaddleOCR | `det_db_thresh` | 0.1 - 0.3 (default: 0.3) | Lower values detect more text areas |
|
208
|
-
| Low quality scan | EasyOCR | `contrast_ths` | 0.05 - 0.1 (default: 0.1) | Lower values help with low contrast documents |
|
209
|
-
| Low quality scan | PaddleOCR | `det_limit_side_len` | 1280 - 2560 (default: 960) | Higher values improve detail detection |
|
210
|
-
| Accuracy vs. speed | EasyOCR | `decoder` | "wordbeamsearch" (accuracy)<br>"greedy" (speed) | Word beam search is more accurate but slower |
|
211
|
-
| Accuracy vs. speed | PaddleOCR | `rec_batch_num` | 1 (accuracy)<br>8+ (speed) | Larger batches process faster but use more memory |
|
212
|
-
| Small text | Both | `min_confidence` | 0.3 - 0.4 (default: 0.5) | Lower confidence threshold to capture small/blurry text |
|
213
|
-
| Text orientation | PaddleOCR | `use_angle_cls` | `True` | Enable angle classification for rotated text |
|
214
|
-
| Asian languages | PaddleOCR | `lang` | "ch", "japan", "korea" | Use PaddleOCR for Asian languages |
|
201
|
+
This will at *some point* be official-ized, but for now you can look at `templates/spa` and see the correction package.
|
215
202
|
|
216
203
|
## Next Steps
|
217
204
|
|