natural-pdf 0.1.11__tar.gz → 0.1.13__tar.gz
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/01-execute_notebooks.py +2 -2
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/MANIFEST.in +3 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/PKG-INFO +55 -49
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/README.md +29 -29
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/check_run_md.sh +2 -2
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/categorizing-documents/index.md +1 -1
- natural_pdf-0.1.13/docs/element-selection/index.ipynb +1112 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/element-selection/index.md +31 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/index.md +3 -3
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/installation/index.md +32 -14
- natural_pdf-0.1.13/docs/loops-and-groups/index.ipynb +476 -0
- natural_pdf-0.1.13/docs/loops-and-groups/index.md +84 -0
- natural_pdf-0.1.13/docs/reflowing-pages/index.ipynb +360 -0
- natural_pdf-0.1.13/docs/reflowing-pages/index.md +80 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/text-extraction/index.ipynb +234 -220
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/text-extraction/index.md +2 -2
- natural_pdf-0.1.13/docs/tutorials/01-loading-and-extraction.ipynb +3082 -0
- natural_pdf-0.1.13/docs/tutorials/02-finding-elements.ipynb +352 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/tutorials/02-finding-elements.md +1 -1
- natural_pdf-0.1.13/docs/tutorials/03-extracting-blocks.ipynb +159 -0
- natural_pdf-0.1.13/docs/tutorials/04-table-extraction.ipynb +209 -0
- natural_pdf-0.1.13/docs/tutorials/05-excluding-content.ipynb +8402 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/tutorials/06-document-qa.ipynb +45 -31
- natural_pdf-0.1.13/docs/tutorials/07-layout-analysis.ipynb +262 -0
- natural_pdf-0.1.13/docs/tutorials/07-working-with-regions.ipynb +477 -0
- natural_pdf-0.1.13/docs/tutorials/08-spatial-navigation.ipynb +520 -0
- natural_pdf-0.1.13/docs/tutorials/09-section-extraction.ipynb +2474 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/tutorials/09-section-extraction.md +1 -1
- natural_pdf-0.1.13/docs/tutorials/10-form-field-extraction.ipynb +496 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/tutorials/11-enhanced-table-processing.ipynb +9 -9
- natural_pdf-0.1.13/docs/tutorials/12-ocr-integration.ipynb +3448 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/tutorials/12-ocr-integration.md +1 -1
- natural_pdf-0.1.13/docs/tutorials/13-semantic-search.ipynb +706 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/tutorials/13-semantic-search.md +4 -3
- natural_pdf-0.1.13/docs/tutorials/14-categorizing-documents.ipynb +2142 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/tutorials/14-categorizing-documents.md +21 -29
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/mkdocs.yml +4 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/__init__.py +7 -2
- natural_pdf-0.1.13/natural_pdf/analyzers/shape_detection_mixin.py +1092 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/analyzers/text_options.py +9 -1
- natural_pdf-0.1.13/natural_pdf/analyzers/text_structure.py +627 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/classification/manager.py +3 -4
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/collections/pdf_collection.py +19 -39
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/core/element_manager.py +11 -1
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/core/highlighting_service.py +146 -75
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/core/page.py +287 -188
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/core/pdf.py +57 -42
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/elements/base.py +51 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/elements/collections.py +362 -67
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/elements/line.py +5 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/elements/region.py +396 -23
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/exporters/hocr.py +40 -61
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/exporters/hocr_font.py +7 -13
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/exporters/original_pdf.py +10 -13
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/exporters/paddleocr.py +51 -11
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/exporters/searchable_pdf.py +0 -10
- natural_pdf-0.1.13/natural_pdf/flows/__init__.py +12 -0
- natural_pdf-0.1.13/natural_pdf/flows/collections.py +533 -0
- natural_pdf-0.1.13/natural_pdf/flows/element.py +382 -0
- natural_pdf-0.1.13/natural_pdf/flows/flow.py +216 -0
- natural_pdf-0.1.13/natural_pdf/flows/region.py +458 -0
- natural_pdf-0.1.13/natural_pdf/search/__init__.py +99 -0
- natural_pdf-0.1.13/natural_pdf/search/lancedb_search_service.py +325 -0
- natural_pdf-0.1.13/natural_pdf/search/numpy_search_service.py +255 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/search/searchable_mixin.py +25 -71
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/selectors/parser.py +163 -8
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/templates/finetune/fine_tune_paddleocr.md +84 -5
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/widgets/viewer.py +22 -31
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf.egg-info/PKG-INFO +55 -49
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf.egg-info/SOURCES.txt +25 -5
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf.egg-info/requires.txt +26 -21
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf.egg-info/top_level.txt +0 -1
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/noxfile.py +39 -8
- natural_pdf-0.1.13/pdfs/.gitkeep +0 -0
- natural_pdf-0.1.13/pdfs/anexo_edital_6604_1743480-table.pdf +0 -0
- natural_pdf-0.1.13/pdfs/geometry.pdf +0 -0
- natural_pdf-0.1.13/pdfs/multicolumn.pdf +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/pyproject.toml +37 -27
- natural_pdf-0.1.13/tests/conftest.py +140 -0
- natural_pdf-0.1.13/tests/test_core/test_containment_geometry.py +26 -0
- natural_pdf-0.1.13/tests/test_core/test_elements.py +169 -0
- natural_pdf-0.1.13/tests/test_core/test_loading.py +86 -0
- natural_pdf-0.1.13/tests/test_core/test_spatial.py +201 -0
- natural_pdf-0.1.13/tests/test_core/test_text_extraction.py +118 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/tests/test_optional_deps.py +1 -66
- natural_pdf-0.1.13/uv.lock +56 -0
- natural_pdf-0.1.11/docs/element-selection/index.ipynb +0 -957
- natural_pdf-0.1.11/docs/tutorials/01-loading-and-extraction.ipynb +0 -1628
- natural_pdf-0.1.11/docs/tutorials/02-finding-elements.ipynb +0 -374
- natural_pdf-0.1.11/docs/tutorials/03-extracting-blocks.ipynb +0 -152
- natural_pdf-0.1.11/docs/tutorials/04-table-extraction.ipynb +0 -195
- natural_pdf-0.1.11/docs/tutorials/05-excluding-content.ipynb +0 -275
- natural_pdf-0.1.11/docs/tutorials/07-layout-analysis.ipynb +0 -269
- natural_pdf-0.1.11/docs/tutorials/07-working-with-regions.ipynb +0 -470
- natural_pdf-0.1.11/docs/tutorials/08-spatial-navigation.ipynb +0 -513
- natural_pdf-0.1.11/docs/tutorials/09-section-extraction.ipynb +0 -2439
- natural_pdf-0.1.11/docs/tutorials/10-form-field-extraction.ipynb +0 -503
- natural_pdf-0.1.11/docs/tutorials/12-ocr-integration.ipynb +0 -3556
- natural_pdf-0.1.11/docs/tutorials/13-semantic-search.ipynb +0 -1411
- natural_pdf-0.1.11/docs/tutorials/14-categorizing-documents.ipynb +0 -2399
- natural_pdf-0.1.11/natural_pdf/analyzers/text_structure.py +0 -314
- natural_pdf-0.1.11/natural_pdf/search/__init__.py +0 -86
- natural_pdf-0.1.11/natural_pdf/search/haystack_search_service.py +0 -687
- natural_pdf-0.1.11/natural_pdf/search/haystack_utils.py +0 -474
- natural_pdf-0.1.11/natural_pdf/utils/tqdm_utils.py +0 -51
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/.cursor/rules/analysis_framework.mdc +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/.cursor/rules/coding-style.mdc +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/.cursor/rules/edit-md-instead-of-ipynb.mdc +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/.cursor/rules/minimal-comments.mdc +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/.cursor/rules/natural-pdf-overview.mdc +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/.cursor/rules/user-friendly-library-code.mdc +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/.github/workflows/docs.yml +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/.gitignore +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/02-run_all_tutorials.sh +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/CLAUDE.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/LICENSE +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/audit_packaging.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/api/index.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/assets/favicon.png +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/assets/favicon.svg +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/assets/javascripts/custom.js +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/assets/logo.svg +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/assets/sample-screen.png +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/assets/social-preview.png +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/assets/social-preview.svg +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/assets/stylesheets/custom.css +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/data-extraction/index.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/document-qa/index.ipynb +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/document-qa/index.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/finetuning/index.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/interactive-widget/index.ipynb +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/interactive-widget/index.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/layout-analysis/index.ipynb +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/layout-analysis/index.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/ocr/index.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/pdf-navigation/index.ipynb +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/pdf-navigation/index.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/regions/index.ipynb +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/regions/index.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/tables/index.ipynb +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/tables/index.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/text-analysis/index.ipynb +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/text-analysis/index.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/tutorials/01-loading-and-extraction.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/tutorials/03-extracting-blocks.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/tutorials/04-table-extraction.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/tutorials/05-excluding-content.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/tutorials/06-document-qa.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/tutorials/07-layout-analysis.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/tutorials/07-working-with-regions.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/tutorials/08-spatial-navigation.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/tutorials/10-form-field-extraction.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/tutorials/11-enhanced-table-processing.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/visual-debugging/index.ipynb +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/visual-debugging/index.md +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/docs/visual-debugging/region.png +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/analyzers/__init__.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/analyzers/layout/__init__.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/analyzers/layout/base.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/analyzers/layout/docling.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/analyzers/layout/gemini.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/analyzers/layout/layout_analyzer.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/analyzers/layout/layout_manager.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/analyzers/layout/layout_options.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/analyzers/layout/paddle.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/analyzers/layout/pdfplumber_table_finder.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/analyzers/layout/surya.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/analyzers/layout/tatr.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/analyzers/layout/yolo.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/analyzers/utils.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/classification/mixin.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/classification/results.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/collections/mixins.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/core/__init__.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/elements/__init__.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/elements/rect.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/elements/text.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/export/mixin.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/exporters/__init__.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/exporters/base.py +0 -0
- /natural_pdf-0.1.11/pdfs/.gitkeep → /natural_pdf-0.1.13/natural_pdf/exporters/data/__init__.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/exporters/data/pdf.ttf +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/exporters/data/sRGB.icc +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/extraction/manager.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/extraction/mixin.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/extraction/result.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/ocr/__init__.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/ocr/engine.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/ocr/engine_doctr.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/ocr/engine_easyocr.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/ocr/engine_paddle.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/ocr/engine_surya.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/ocr/ocr_factory.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/ocr/ocr_manager.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/ocr/ocr_options.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/ocr/utils.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/qa/__init__.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/qa/document_qa.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/search/search_options.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/search/search_service_protocol.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/selectors/__init__.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/templates/__init__.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/templates/spa/css/style.css +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/templates/spa/index.html +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/templates/spa/js/app.js +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/templates/spa/words.txt +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/utils/__init__.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/utils/debug.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/utils/highlighting.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/utils/identifiers.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/utils/locks.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/utils/packaging.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/utils/reading_order.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/utils/text_extraction.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/utils/visualization.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/widgets/__init__.py +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf/widgets/frontend/viewer.js +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/natural_pdf.egg-info/dependency_links.txt +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/pdfs/01-practice.pdf +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/pdfs/0500000US42001.pdf +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/pdfs/0500000US42007.pdf +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/pdfs/2014 Statistics.pdf +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/pdfs/2019 Statistics.pdf +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/pdfs/Atlanta_Public_Schools_GA_sample.pdf +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/pdfs/cia-doc.pdf +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/pdfs/needs-ocr.pdf +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/publish.sh +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/sample-screen.png +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/setup.cfg +0 -0
- {natural_pdf-0.1.11 → natural_pdf-0.1.13}/tests/exporters/test_paddleocr_exporter.py +0 -0
- /natural_pdf-0.1.11/tests/test_loading.py → /natural_pdf-0.1.13/tests/test_loading_original.py +0 -0
@@ -30,7 +30,7 @@ EXCLUDE_PATTERNS = [
     "finetuning/index.md",
     "categorizing-documents/index.md",
     "data-extraction/index.md",
-    "*.ipynb_checkpoints*"
+    "*.ipynb_checkpoints*",
 ]
 MAX_WORKERS = os.cpu_count()

@@ -178,7 +178,7 @@ def process_notebook(md_file_path_str: str, log_level: int) -> Dict[str, Any]:
 client = NotebookClient(
     notebook,
     timeout=600,
-    kernel_name="natural-pdf",
+    kernel_name="natural-pdf-project-venv",
     resources={"metadata": {"path": str(cwd)}},
 )
 client.execute()  # Modifies 'notebook' object
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: natural-pdf
-Version: 0.1.
+Version: 0.1.13
 Summary: A more intuitive interface for working with PDFs
 Author-email: Jonathan Soma <jonathan.soma@gmail.com>
 License-Expression: MIT
@@ -12,20 +12,17 @@ Requires-Python: >=3.9
 Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: pdfplumber
-Requires-Dist:
+Requires-Dist: pillow
 Requires-Dist: colour
 Requires-Dist: numpy
 Requires-Dist: urllib3
 Requires-Dist: tqdm
 Requires-Dist: pydantic
-
-Requires-Dist:
-
-
-Requires-Dist:
-Requires-Dist: lancedb; extra == "haystack"
-Requires-Dist: sentence-transformers; extra == "haystack"
-Requires-Dist: natural-pdf[core-ml]; extra == "haystack"
+Requires-Dist: jenkspy
+Requires-Dist: pikepdf>=9.7.0
+Requires-Dist: scipy
+Provides-Extra: viewer
+Requires-Dist: ipywidgets<9.0.0,>=7.0.0; extra == "viewer"
 Provides-Extra: easyocr
 Requires-Dist: easyocr; extra == "easyocr"
 Requires-Dist: natural-pdf[core-ml]; extra == "easyocr"
@@ -41,19 +38,25 @@ Requires-Dist: natural-pdf[core-ml]; extra == "surya"
 Provides-Extra: doctr
 Requires-Dist: python-doctr[torch]; extra == "doctr"
 Requires-Dist: natural-pdf[core-ml]; extra == "doctr"
-Provides-Extra: qa
-Requires-Dist: natural-pdf[core-ml]; extra == "qa"
 Provides-Extra: docling
 Requires-Dist: docling; extra == "docling"
 Requires-Dist: natural-pdf[core-ml]; extra == "docling"
 Provides-Extra: llm
 Requires-Dist: openai>=1.0; extra == "llm"
-Provides-Extra: classification
-Requires-Dist: sentence-transformers; extra == "classification"
-Requires-Dist: timm; extra == "classification"
-Requires-Dist: natural-pdf[core-ml]; extra == "classification"
 Provides-Extra: test
 Requires-Dist: pytest; extra == "test"
+Provides-Extra: search
+Requires-Dist: lancedb; extra == "search"
+Requires-Dist: pyarrow; extra == "search"
+Provides-Extra: favorites
+Requires-Dist: natural-pdf[deskew]; extra == "favorites"
+Requires-Dist: natural-pdf[llm]; extra == "favorites"
+Requires-Dist: natural-pdf[surya]; extra == "favorites"
+Requires-Dist: natural-pdf[easyocr]; extra == "favorites"
+Requires-Dist: natural-pdf[layout_yolo]; extra == "favorites"
+Requires-Dist: natural-pdf[ocr-export]; extra == "favorites"
+Requires-Dist: natural-pdf[viewer]; extra == "favorites"
+Requires-Dist: natural-pdf[search]; extra == "favorites"
 Provides-Extra: dev
 Requires-Dist: black; extra == "dev"
 Requires-Dist: isort; extra == "dev"
@@ -67,29 +70,32 @@ Requires-Dist: pipdeptree; extra == "dev"
 Requires-Dist: nbformat; extra == "dev"
 Requires-Dist: jupytext; extra == "dev"
 Requires-Dist: nbclient; extra == "dev"
+Requires-Dist: ipykernel; extra == "dev"
 Provides-Extra: deskew
 Requires-Dist: deskew>=1.5; extra == "deskew"
 Requires-Dist: img2pdf; extra == "deskew"
 Provides-Extra: all
-Requires-Dist: natural-pdf[
-Requires-Dist: natural-pdf[haystack]; extra == "all"
+Requires-Dist: natural-pdf[viewer]; extra == "all"
 Requires-Dist: natural-pdf[easyocr]; extra == "all"
 Requires-Dist: natural-pdf[paddle]; extra == "all"
 Requires-Dist: natural-pdf[layout_yolo]; extra == "all"
 Requires-Dist: natural-pdf[surya]; extra == "all"
 Requires-Dist: natural-pdf[doctr]; extra == "all"
-Requires-Dist: natural-pdf[qa]; extra == "all"
 Requires-Dist: natural-pdf[ocr-export]; extra == "all"
 Requires-Dist: natural-pdf[docling]; extra == "all"
 Requires-Dist: natural-pdf[llm]; extra == "all"
-Requires-Dist: natural-pdf[
+Requires-Dist: natural-pdf[core-ml]; extra == "all"
 Requires-Dist: natural-pdf[deskew]; extra == "all"
 Requires-Dist: natural-pdf[test]; extra == "all"
+Requires-Dist: natural-pdf[search]; extra == "all"
 Provides-Extra: core-ml
 Requires-Dist: torch; extra == "core-ml"
 Requires-Dist: torchvision; extra == "core-ml"
 Requires-Dist: transformers[sentencepiece]; extra == "core-ml"
 Requires-Dist: huggingface_hub; extra == "core-ml"
+Requires-Dist: sentence-transformers; extra == "core-ml"
+Requires-Dist: numpy; extra == "core-ml"
+Requires-Dist: timm; extra == "core-ml"
 Provides-Extra: ocr-export
 Requires-Dist: pikepdf; extra == "ocr-export"
 Provides-Extra: export-extras
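The metadata hunks above regroup the optional dependencies: `haystack`, `qa`, and `classification` are dropped, while `viewer`, `search`, and `favorites` are added. As a rough reading aid, the sketch below groups `Requires-Dist` lines by their extra. It is not part of the package: the function name `group_by_extra` and the `METADATA` sample are illustrative, and it assumes every marker has the simple form `extra == "name"` seen in these hunks.

```python
import re
from collections import defaultdict

# Sample lines copied from the new-side PKG-INFO hunks (a subset, for illustration).
METADATA = """\
Requires-Dist: pdfplumber
Requires-Dist: pillow
Provides-Extra: search
Requires-Dist: lancedb; extra == "search"
Requires-Dist: pyarrow; extra == "search"
Provides-Extra: core-ml
Requires-Dist: torch; extra == "core-ml"
Requires-Dist: timm; extra == "core-ml"
"""

# Assumption: markers are only ever of the form extra == "name".
EXTRA_MARKER = re.compile(r'extra\s*==\s*"([^"]+)"')


def group_by_extra(metadata: str) -> dict:
    """Map each extra name to its requirements; unconditional ones go under None."""
    groups = defaultdict(list)
    for line in metadata.splitlines():
        if not line.startswith("Requires-Dist:"):
            continue  # skip Provides-Extra and any other metadata fields
        spec = line[len("Requires-Dist:"):].strip()
        requirement, _, marker = spec.partition(";")
        match = EXTRA_MARKER.search(marker)
        groups[match.group(1) if match else None].append(requirement.strip())
    return dict(groups)


groups = group_by_extra(METADATA)
for extra, requirements in groups.items():
    print(extra or "(core)", "->", ", ".join(requirements))
```

Running the same function over the old-side and new-side lines separately would make the moved requirements (for example `sentence-transformers` and `timm` landing in `core-ml`) easier to spot than reading the hunks row by row.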
@@ -114,26 +120,11 @@ Natural PDF lets you find and extract content from PDFs using simple code that m
 pip install natural-pdf
 ```

-For optional features like specific OCR engines, layout analysis models, or the interactive Jupyter widget, you can install extras:
+For optional features like specific OCR engines, layout analysis models, or the interactive Jupyter widget, you can install one to two million different extras. If you just want the greatest hits:

 ```bash
-#
-pip install natural-pdf[
-pip install natural-pdf[surya]
-pip install natural-pdf[paddle]
-
-# Example: Install support for features using Large Language Models (e.g., via OpenAI-compatible APIs)
-pip install natural-pdf[llm]
-# (May require setting API key environment variables, e.g., GOOGLE_API_KEY for Gemini)
-
-# Example: Install with interactive viewer support
-pip install natural-pdf[interactive]
-
-# Example: Install with semantic search support (Haystack)
-pip install natural-pdf[haystack]
-
-# Install everything
-pip install natural-pdf[all]
+# deskewing, OCR (surya) + layout analysis (yolo), interactive browsing
+pip install natural-pdf[favorites]
 ```

 See the [installation guide](https://jsoma.github.io/natural-pdf/installation/) for more details on extras.
@@ -147,25 +138,26 @@ from natural_pdf import PDF
 pdf = PDF('document.pdf')
 page = pdf.pages[0]

+# Extract all of the text on the page
+page.extract_text()
+
 # Find elements using CSS-like selectors
 heading = page.find('text:contains("Summary"):bold')

 # Extract content below the heading
 content = heading.below().extract_text()
-print("Content below Summary:", content[:100] + "...")

-#
-
-page.add_exclusion(page.find('text:contains("CONFIDENTIAL")').above())
-page.add_exclusion(page.find_all('line')[-1].below())
+# Examine all the bold text on the page
+page.find_all('text:bold').show()

-#
-
-
+# Exclude parts of the page from selectors/extractors
+header = page.find('text:contains("CONFIDENTIAL")').above()
+footer = page.find_all('line')[-1].below()
+page.add_exclusion(header)
+page.add_exclusion(footer)

-#
-
-page.to_image()
+# Extract clean text from the page ignoring exclusions
+clean_text = page.extract_text()
 ```

 And as a fun bonus, `page.viewer()` will provide an interactive method to explore the PDF.
@@ -186,3 +178,17 @@ Natural PDF offers a range of features for working with PDFs:
 ## Learn More

 Dive deeper into the features and explore advanced usage in the [**Complete Documentation**](https://jsoma.github.io/natural-pdf).
+
+## Best friends
+
+Natural PDF sits on top of a *lot* of fantastic tools and mdoels, some of which are:
+
+- [pdfplumber](https://github.com/jsvine/pdfplumber)
+- [EasyOCR](https://www.jaided.ai/easyocr/)
+- [PaddleOCR](https://paddlepaddle.github.io/PaddleOCR/latest/en/index.html)
+- [Surya](https://github.com/VikParuchuri/surya)
+- A specific [YOLO](https://github.com/opendatalab/DocLayout-YOLO)
+- [deskew](https://github.com/sbrunner/deskew)
+- [doctr](https://github.com/mindee/doctr)
+- [docling](https://github.com/docling-project/docling)
+- [Hugging Face](https://huggingface.co/models)
@@ -15,26 +15,11 @@ Natural PDF lets you find and extract content from PDFs using simple code that m
 pip install natural-pdf
 ```

-For optional features like specific OCR engines, layout analysis models, or the interactive Jupyter widget, you can install extras:
+For optional features like specific OCR engines, layout analysis models, or the interactive Jupyter widget, you can install one to two million different extras. If you just want the greatest hits:

 ```bash
-#
-pip install natural-pdf[
-pip install natural-pdf[surya]
-pip install natural-pdf[paddle]
-
-# Example: Install support for features using Large Language Models (e.g., via OpenAI-compatible APIs)
-pip install natural-pdf[llm]
-# (May require setting API key environment variables, e.g., GOOGLE_API_KEY for Gemini)
-
-# Example: Install with interactive viewer support
-pip install natural-pdf[interactive]
-
-# Example: Install with semantic search support (Haystack)
-pip install natural-pdf[haystack]
-
-# Install everything
-pip install natural-pdf[all]
+# deskewing, OCR (surya) + layout analysis (yolo), interactive browsing
+pip install natural-pdf[favorites]
 ```

 See the [installation guide](https://jsoma.github.io/natural-pdf/installation/) for more details on extras.
@@ -48,25 +33,26 @@ from natural_pdf import PDF
 pdf = PDF('document.pdf')
 page = pdf.pages[0]

+# Extract all of the text on the page
+page.extract_text()
+
 # Find elements using CSS-like selectors
 heading = page.find('text:contains("Summary"):bold')

 # Extract content below the heading
 content = heading.below().extract_text()
-print("Content below Summary:", content[:100] + "...")

-#
-
-page.add_exclusion(page.find('text:contains("CONFIDENTIAL")').above())
-page.add_exclusion(page.find_all('line')[-1].below())
+# Examine all the bold text on the page
+page.find_all('text:bold').show()

-#
-
-
+# Exclude parts of the page from selectors/extractors
+header = page.find('text:contains("CONFIDENTIAL")').above()
+footer = page.find_all('line')[-1].below()
+page.add_exclusion(header)
+page.add_exclusion(footer)

-#
-
-page.to_image()
+# Extract clean text from the page ignoring exclusions
+clean_text = page.extract_text()
 ```

 And as a fun bonus, `page.viewer()` will provide an interactive method to explore the PDF.
@@ -87,3 +73,17 @@ Natural PDF offers a range of features for working with PDFs:
 ## Learn More

 Dive deeper into the features and explore advanced usage in the [**Complete Documentation**](https://jsoma.github.io/natural-pdf).
+
+## Best friends
+
+Natural PDF sits on top of a *lot* of fantastic tools and mdoels, some of which are:
+
+- [pdfplumber](https://github.com/jsvine/pdfplumber)
+- [EasyOCR](https://www.jaided.ai/easyocr/)
+- [PaddleOCR](https://paddlepaddle.github.io/PaddleOCR/latest/en/index.html)
+- [Surya](https://github.com/VikParuchuri/surya)
+- A specific [YOLO](https://github.com/opendatalab/DocLayout-YOLO)
+- [deskew](https://github.com/sbrunner/deskew)
+- [doctr](https://github.com/mindee/doctr)
+- [docling](https://github.com/docling-project/docling)
+- [Hugging Face](https://huggingface.co/models)
@@ -9,7 +9,7 @@ fi

 MARKDOWN_FILE=$1
 NOTEBOOK_FILE="${MARKDOWN_FILE%.md}.ipynb"
-KERNEL_NAME="natural-pdf"
+KERNEL_NAME="natural-pdf-project-venv"

 echo "Converting $MARKDOWN_FILE to notebook..."
 # Jupytext will now automatically add tags based on markdown metadata
@@ -29,6 +29,6 @@ EOF


 echo "Executing notebook $NOTEBOOK_FILE..."
-jupyter execute "$NOTEBOOK_FILE" --inplace --ExecutePreprocessor.kernel_name=natural-pdf || { echo "Execution failed"; exit 1; }
+jupyter execute "$NOTEBOOK_FILE" --inplace --ExecutePreprocessor.kernel_name=natural-pdf-project-venv || { echo "Execution failed"; exit 1; }

 echo "Success! Notebook executed and results saved to $NOTEBOOK_FILE"
@@ -7,7 +7,7 @@ Natural PDF allows you to automatically categorize pages or specific regions wit
 To use the classification features, you need to install the optional dependencies:

 ```bash
-pip install "natural-pdf[
+pip install "natural-pdf[core-ml]"
 ```

 This installs necessary libraries like `torch`, `transformers`, and others.
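Since the classification docs now point at the `core-ml` extra, a quick way to see whether those libraries are present is to probe for them without importing them. The sketch below is only an illustration, not something shipped in the package: `missing_modules` and `CORE_ML_MODULES` are hypothetical names, and the list assumes each distribution in the `core-ml` extra is imported under the module name shown (for example `sentence-transformers` as `sentence_transformers`).

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the names from `names` that cannot be located for import."""
    return [name for name in names if find_spec(name) is None]


# Assumed import names for the distributions listed under the core-ml extra.
CORE_ML_MODULES = [
    "torch",
    "torchvision",
    "transformers",
    "huggingface_hub",
    "sentence_transformers",
    "timm",
]

missing = missing_modules(CORE_ML_MODULES)
if missing:
    # Nothing is imported or installed here; this only reports what was not found.
    print("Not found:", ", ".join(missing))
    print('Try: pip install "natural-pdf[core-ml]"')
else:
    print("All core-ml modules were found.")
```

`find_spec` only locates a top-level module, so the check stays cheap even for heavy packages such as `torch`.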