natural-pdf 0.1.6__py3-none-any.whl → 0.1.8__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (66)
  1. docs/categorizing-documents/index.md +168 -0
  2. docs/data-extraction/index.md +87 -0
  3. docs/element-selection/index.ipynb +218 -164
  4. docs/element-selection/index.md +20 -0
  5. docs/finetuning/index.md +176 -0
  6. docs/index.md +19 -0
  7. docs/ocr/index.md +63 -16
  8. docs/tutorials/01-loading-and-extraction.ipynb +411 -248
  9. docs/tutorials/02-finding-elements.ipynb +123 -46
  10. docs/tutorials/03-extracting-blocks.ipynb +24 -19
  11. docs/tutorials/04-table-extraction.ipynb +17 -12
  12. docs/tutorials/05-excluding-content.ipynb +37 -32
  13. docs/tutorials/06-document-qa.ipynb +36 -31
  14. docs/tutorials/07-layout-analysis.ipynb +45 -40
  15. docs/tutorials/07-working-with-regions.ipynb +61 -60
  16. docs/tutorials/08-spatial-navigation.ipynb +76 -71
  17. docs/tutorials/09-section-extraction.ipynb +160 -155
  18. docs/tutorials/10-form-field-extraction.ipynb +71 -66
  19. docs/tutorials/11-enhanced-table-processing.ipynb +11 -6
  20. docs/tutorials/12-ocr-integration.ipynb +3420 -312
  21. docs/tutorials/12-ocr-integration.md +68 -106
  22. docs/tutorials/13-semantic-search.ipynb +641 -251
  23. natural_pdf/__init__.py +3 -0
  24. natural_pdf/analyzers/layout/gemini.py +63 -47
  25. natural_pdf/classification/manager.py +343 -0
  26. natural_pdf/classification/mixin.py +149 -0
  27. natural_pdf/classification/results.py +62 -0
  28. natural_pdf/collections/mixins.py +63 -0
  29. natural_pdf/collections/pdf_collection.py +326 -17
  30. natural_pdf/core/element_manager.py +73 -4
  31. natural_pdf/core/page.py +255 -83
  32. natural_pdf/core/pdf.py +385 -367
  33. natural_pdf/elements/base.py +1 -3
  34. natural_pdf/elements/collections.py +279 -49
  35. natural_pdf/elements/region.py +106 -21
  36. natural_pdf/elements/text.py +5 -2
  37. natural_pdf/exporters/__init__.py +4 -0
  38. natural_pdf/exporters/base.py +61 -0
  39. natural_pdf/exporters/paddleocr.py +345 -0
  40. natural_pdf/extraction/manager.py +134 -0
  41. natural_pdf/extraction/mixin.py +246 -0
  42. natural_pdf/extraction/result.py +37 -0
  43. natural_pdf/ocr/__init__.py +16 -8
  44. natural_pdf/ocr/engine.py +46 -30
  45. natural_pdf/ocr/engine_easyocr.py +86 -42
  46. natural_pdf/ocr/engine_paddle.py +39 -28
  47. natural_pdf/ocr/engine_surya.py +32 -16
  48. natural_pdf/ocr/ocr_factory.py +34 -23
  49. natural_pdf/ocr/ocr_manager.py +98 -34
  50. natural_pdf/ocr/ocr_options.py +38 -10
  51. natural_pdf/ocr/utils.py +59 -33
  52. natural_pdf/qa/document_qa.py +0 -4
  53. natural_pdf/selectors/parser.py +363 -238
  54. natural_pdf/templates/finetune/fine_tune_paddleocr.md +420 -0
  55. natural_pdf/utils/debug.py +4 -2
  56. natural_pdf/utils/identifiers.py +9 -5
  57. natural_pdf/utils/locks.py +8 -0
  58. natural_pdf/utils/packaging.py +172 -105
  59. natural_pdf/utils/text_extraction.py +96 -65
  60. natural_pdf/utils/tqdm_utils.py +43 -0
  61. natural_pdf/utils/visualization.py +1 -1
  62. {natural_pdf-0.1.6.dist-info → natural_pdf-0.1.8.dist-info}/METADATA +10 -3
  63. {natural_pdf-0.1.6.dist-info → natural_pdf-0.1.8.dist-info}/RECORD +66 -51
  64. {natural_pdf-0.1.6.dist-info → natural_pdf-0.1.8.dist-info}/WHEEL +1 -1
  65. {natural_pdf-0.1.6.dist-info → natural_pdf-0.1.8.dist-info}/licenses/LICENSE +0 -0
  66. {natural_pdf-0.1.6.dist-info → natural_pdf-0.1.8.dist-info}/top_level.txt +0 -0
docs/categorizing-documents/index.md
@@ -0,0 +1,168 @@
# Categorizing Pages and Regions

Natural PDF allows you to automatically categorize pages or specific regions within a page using machine learning models. This is incredibly useful for filtering large collections of documents or understanding the structure and content of individual PDFs.

## Installation

To use the classification features, you need to install the optional dependencies:

```bash
pip install "natural-pdf[classification]"
```

This installs necessary libraries like `torch`, `transformers`, and others.
## Core Concept: The `.classify()` Method

The primary way to perform categorization is using the `.classify()` method available on `Page` and `Region` objects.

```python
from natural_pdf import PDF

# Example: Classify a Page
pdf = PDF("pdfs/01-practice.pdf")
page = pdf.pages[0]
categories = ["invoice", "letter", "report cover", "data table"]
results = page.classify(categories=categories, model="text")

# Access the top result
print(f"Top Category: {page.category}")
print(f"Confidence: {page.category_confidence:.3f}")

# Access all results
# print(page.classification_results)
```

**Key Arguments:**

* `categories` (required): A list of strings representing the potential categories you want to classify the item into.
* `model` (optional): Specifies which classification model or strategy to use. Defaults to `"text"`.
    * `"text"`: Uses a text-based model (default: `facebook/bart-large-mnli`) suitable for classifying based on language content.
    * `"vision"`: Uses a vision-based model (default: `openai/clip-vit-base-patch32`) suitable for classifying based on visual layout and appearance.
    * Specific model ID: You can provide a Hugging Face model ID (e.g., `"google/siglip-base-patch16-224"`, `"MoritzLaurer/mDeBERTa-v3-base-mnli-xnli"`) compatible with zero-shot text or image classification. The library attempts to infer whether it is a text or vision model, but you may need to set `using` explicitly (see the sketch after this list).
* `using` (optional): Explicitly set to `"text"` or `"vision"` if the automatic inference based on the `model` ID fails or is ambiguous.
* `min_confidence` (optional): A float between 0.0 and 1.0. Only categories with a confidence score greater than or equal to this threshold are included in the results (default: 0.0).
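
To see these arguments together, here is a small sketch that passes one of the Hugging Face model IDs mentioned above and sets `using` and `min_confidence` explicitly (the categories here are placeholders):

```python
# Sketch: zero-shot text classification with an explicit Hugging Face model ID.
# `using="text"` removes any ambiguity about the model type, and
# `min_confidence` drops low-scoring categories from the results.
page.classify(
    categories=["contract", "meeting minutes", "press release"],
    model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli",
    using="text",
    min_confidence=0.3,
)
print(page.category, page.category_confidence)
```
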
## Text vs. Vision Classification

Choosing the right model type depends on your goal:

### Text Classification (`model="text"`)

* **How it works:** Extracts the text from the page or region and analyzes the language content.
* **Best for:**
    * **Topic Identification:** Determining what a page or section is *about* (e.g., "budget discussion," "environmental impact," "legal terms").
    * **Content-Driven Document Types:** Identifying document types primarily defined by their text (e.g., emails, meeting minutes, news articles, reports).
* **Data Journalism Example:** You have thousands of pages of government reports. You can use text classification to find all pages discussing "public health funding" or classify paragraphs within environmental impact statements to find mentions of specific endangered species.

```python
# Find pages related to finance
financial_categories = ["budget", "revenue", "expenditure", "forecast"]
pdf.classify_pages(categories=financial_categories, model="text")
budget_pages = [p for p in pdf.pages if p.category == "budget"]
```

### Vision Classification (`model="vision"`)

* **How it works:** Renders the page or region as an image and analyzes its visual layout, structure, and appearance.
* **Best for:**
    * **Layout-Driven Document Types:** Identifying documents recognizable by their structure (e.g., invoices, receipts, forms, presentation slides, title pages).
    * **Identifying Visual Elements:** Distinguishing between pages dominated by text, tables, charts, or images.
* **Data Journalism Example:** You have a scanned archive of campaign finance filings containing various document types. You can use vision classification to quickly isolate all the pages that look like donation receipts or expenditure forms, even if the OCR quality is poor.

```python
# Find pages that look like invoices or receipts
visual_categories = ["invoice", "receipt", "letter", "form"]
page.classify(categories=visual_categories, model="vision")
if page.category in ["invoice", "receipt"]:
    print(f"Page {page.number} looks like an invoice or receipt.")
```

## Classifying Specific Objects

### Pages (`page.classify(...)`)

Classifying a whole page is useful for sorting documents or identifying the overall purpose of a page within a larger document.

```python
# Classify the first page
page = pdf.pages[0]
page_types = ["cover page", "table of contents", "chapter start", "appendix"]
page.classify(categories=page_types, model="vision")  # Vision often good for page structure
print(f"Page 1 Type: {page.category}")
```

### Regions (`region.classify(...)`)

Classifying a specific region allows for more granular analysis within a page. You might first detect regions using Layout Analysis and then classify those regions.

```python
# Assume layout analysis has run, find paragraphs
paragraphs = page.find_all("region[type=paragraph]")
if paragraphs:
    # Classify the topic of the first paragraph
    topic_categories = ["introduction", "methodology", "results", "conclusion"]
    # Use text model for topic
    paragraphs[0].classify(categories=topic_categories, model="text")
    print(f"First paragraph category: {paragraphs[0].category}")
```

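
If layout analysis has not been run yet, the typical flow is to detect regions first and then classify them. Here is a short sketch of that flow, assuming the `analyze_layout()` helper covered in the layout analysis tutorial (the region type and categories are purely illustrative):

```python
# Detect layout regions, then classify one of the detected tables by topic
page.analyze_layout()
tables = page.find_all("region[type=table]")
if tables:
    tables[0].classify(categories=["budget", "staffing", "schedule"], model="text")
    print(f"First table topic: {tables[0].category}")
```
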
## Accessing Classification Results

After running `.classify()`, you can access the results:

* `page.category` or `region.category`: Returns the string label of the category with the highest confidence score from the *last* classification run. Returns `None` if no classification has been run or no category met the threshold.
* `page.category_confidence` or `region.category_confidence`: Returns the float confidence score (0.0-1.0) for the top category. Returns `None` otherwise.
* `page.classification_results` or `region.classification_results`: Returns the full result dictionary stored in the object's `.metadata['classification']`, containing the model used, engine type, categories provided, timestamp, and a list of all scores above the threshold sorted by confidence. Returns `None` if no classification has been run.

```python
results = page.classify(categories=["invoice", "letter"], model="text", min_confidence=0.5)

if page.category == "invoice":
    print(f"Found an invoice with confidence {page.category_confidence:.2f}")

# See all results above the threshold
# print(page.classification_results['scores'])
```

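
Once several pages have been classified, the same attributes make it easy to build a quick summary. Here is a small sketch using only the `category`, `category_confidence`, and `number` attributes shown above:

```python
# Print a one-line summary for every page that has been classified
for p in pdf.pages:
    if p.category is not None:
        print(f"Page {p.number}: {p.category} ({p.category_confidence:.2f})")
```
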
## Classifying Collections

For batch processing, use the `.classify_all()` method on `PDFCollection` or `ElementCollection` objects. This displays a progress bar tracking individual items (pages or elements).

### PDFCollection (`collection.classify_all(...)`)

Classifies pages across all PDFs in the collection. Use `max_workers` for parallel processing across different PDF files.

```python
import natural_pdf

collection = natural_pdf.PDFCollection.from_directory("./documents/")
categories = ["form", "datasheet", "image", "text document"]

# Classify all pages using the vision model, processing 4 PDFs concurrently
collection.classify_all(categories=categories, model="vision", max_workers=4)

# Filter PDFs containing forms
form_pdfs = []
for pdf in collection:
    if any(p.category == "form" for p in pdf.pages if p.category):
        form_pdfs.append(pdf.path)
    pdf.close()  # Remember to close PDFs

print(f"Found forms in: {form_pdfs}")
```

### ElementCollection (`element_collection.classify_all(...)`)

Classifies all classifiable elements (currently `Page` and `Region`) within the collection.

```python
# Assume 'pdf' is loaded and 'layout_regions' is an ElementCollection of Regions
layout_regions = pdf.find_all("region")
region_types = ["paragraph", "list", "table", "figure", "caption"]

# Classify all detected regions based on vision
layout_regions.classify_all(categories=region_types, model="vision")

# Count table regions
table_count = sum(1 for r in layout_regions if r.category == "table")
print(f"Found {table_count} regions classified as tables.")
```
docs/data-extraction/index.md
@@ -0,0 +1,87 @@
# Structured Data Extraction

Extracting specific, structured information (like invoice numbers, dates, or addresses) from documents often requires more than simple text extraction. Natural PDF integrates with Large Language Models (LLMs) via Pydantic schemas to achieve this.

## Introduction

This feature allows you to define the exact data structure you want using a Pydantic model and then instruct an LLM to populate that structure based on the content of a PDF element (like a `Page` or `Region`).

## Basic Extraction

1. **Define a Schema:** Create a Pydantic model for your desired data.
2. **Extract:** Use the `.extract()` method on a `PDF`, `Page`, or `Region` object.
3. **Access:** Use the `.extracted()` method to retrieve the results.

```python
from typing import Optional

from natural_pdf import PDF
from pydantic import BaseModel, Field
from openai import OpenAI  # Example client

# Example: Initialize your LLM client
client = OpenAI()

# Load the PDF
pdf = PDF("path/to/your/document.pdf")
page = pdf.pages[0]

# 1. Define your schema
class InvoiceInfo(BaseModel):
    invoice_number: str = Field(description="The main invoice identifier")
    total_amount: float = Field(description="The final amount due")
    company_name: Optional[str] = Field(None, description="The name of the issuing company")

# 2. Extract data (using default analysis_key="default-structured")
page.extract(schema=InvoiceInfo, client=client)

# 3. Access the results
# Access the full result object
full_data = page.extracted()
print(full_data)

# Access a single field
inv_num = page.extracted('invoice_number')
print(f"Invoice Number: {inv_num}")
```

## Keys and Overwriting

- By default, results are stored under the key `"default-structured"` in the element's `.analyses` dictionary.
- Use the `analysis_key` parameter in `.extract()` to store results under a different name (e.g., `analysis_key="customer_details"`).
- Attempting to extract using an existing `analysis_key` will raise an error unless `overwrite=True` is specified.

```python
# Extract using a specific key
page.extract(InvoiceInfo, client, analysis_key="invoice_header")

# Access using the specific key
header_data = page.extracted(analysis_key="invoice_header")
company = page.extracted('company_name', analysis_key="invoice_header")
```
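
Because re-running `.extract()` with an `analysis_key` that already holds results raises an error, a repeated run needs `overwrite=True`. A minimal sketch:

```python
# First run stores results under "invoice_header"
page.extract(InvoiceInfo, client, analysis_key="invoice_header")

# A second run with the same key must explicitly allow overwriting,
# otherwise an error is raised
page.extract(InvoiceInfo, client, analysis_key="invoice_header", overwrite=True)
```
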

## Applying to Regions and Collections

The `.extract()` and `.extracted()` methods work identically on `Region` objects, allowing you to target specific areas of a page for structured data extraction.

```python
# Assuming 'header_region' is a Region object you defined
header_region.extract(InvoiceInfo, client)
company = header_region.extracted('company_name')
```

Furthermore, you can apply extraction to collections of elements (like `pdf.pages`, or the result of `pdf.find_all(...)`) using the `.apply()` method. This iterates through the collection and calls `.extract()` on each item.

```python
# Example: Extract InvoiceInfo from the first 5 pages
results = pdf.pages[:5].apply(
    'extract',
    schema=InvoiceInfo,
    client=client,
    analysis_key="page_invoice_info",  # Use a specific key for batch results
    overwrite=True  # Allow overwriting if run multiple times
)

# Access results for the first page in the collection
first_page_company = results[0].extracted('company_name', analysis_key="page_invoice_info")
```

This provides a powerful way to turn unstructured PDF content into structured, usable data.