natural-pdf 0.1.7__py3-none-any.whl → 0.1.8__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- docs/categorizing-documents/index.md +168 -0
- docs/data-extraction/index.md +87 -0
- docs/element-selection/index.ipynb +218 -164
- docs/element-selection/index.md +20 -0
- docs/index.md +19 -0
- docs/ocr/index.md +63 -16
- docs/tutorials/01-loading-and-extraction.ipynb +1713 -34
- docs/tutorials/02-finding-elements.ipynb +123 -46
- docs/tutorials/03-extracting-blocks.ipynb +24 -19
- docs/tutorials/04-table-extraction.ipynb +17 -12
- docs/tutorials/05-excluding-content.ipynb +37 -32
- docs/tutorials/06-document-qa.ipynb +36 -31
- docs/tutorials/07-layout-analysis.ipynb +45 -40
- docs/tutorials/07-working-with-regions.ipynb +61 -60
- docs/tutorials/08-spatial-navigation.ipynb +76 -71
- docs/tutorials/09-section-extraction.ipynb +160 -155
- docs/tutorials/10-form-field-extraction.ipynb +71 -66
- docs/tutorials/11-enhanced-table-processing.ipynb +11 -6
- docs/tutorials/12-ocr-integration.ipynb +3420 -312
- docs/tutorials/12-ocr-integration.md +68 -106
- docs/tutorials/13-semantic-search.ipynb +641 -251
- natural_pdf/__init__.py +2 -0
- natural_pdf/classification/manager.py +343 -0
- natural_pdf/classification/mixin.py +149 -0
- natural_pdf/classification/results.py +62 -0
- natural_pdf/collections/mixins.py +63 -0
- natural_pdf/collections/pdf_collection.py +321 -15
- natural_pdf/core/element_manager.py +67 -0
- natural_pdf/core/page.py +227 -64
- natural_pdf/core/pdf.py +387 -378
- natural_pdf/elements/collections.py +272 -41
- natural_pdf/elements/region.py +99 -15
- natural_pdf/elements/text.py +5 -2
- natural_pdf/exporters/paddleocr.py +1 -1
- natural_pdf/extraction/manager.py +134 -0
- natural_pdf/extraction/mixin.py +246 -0
- natural_pdf/extraction/result.py +37 -0
- natural_pdf/ocr/engine_easyocr.py +6 -3
- natural_pdf/ocr/ocr_manager.py +85 -25
- natural_pdf/ocr/ocr_options.py +33 -10
- natural_pdf/ocr/utils.py +14 -3
- natural_pdf/qa/document_qa.py +0 -4
- natural_pdf/selectors/parser.py +363 -238
- natural_pdf/templates/finetune/fine_tune_paddleocr.md +10 -5
- natural_pdf/utils/locks.py +8 -0
- natural_pdf/utils/text_extraction.py +52 -1
- natural_pdf/utils/tqdm_utils.py +43 -0
- {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.8.dist-info}/METADATA +6 -1
- {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.8.dist-info}/RECORD +52 -41
- {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.8.dist-info}/WHEEL +1 -1
- {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.8.dist-info}/licenses/LICENSE +0 -0
- {natural_pdf-0.1.7.dist-info → natural_pdf-0.1.8.dist-info}/top_level.txt +0 -0
@@ -0,0 +1,168 @@
# Categorizing Pages and Regions

Natural PDF allows you to automatically categorize pages or specific regions within a page using machine learning models. This is useful for filtering large collections of documents or for understanding the structure and content of individual PDFs.

## Installation

To use the classification features, install the optional dependencies:

```bash
pip install "natural-pdf[classification]"
```

This installs the necessary libraries, such as `torch` and `transformers`.

## Core Concept: The `.classify()` Method

The primary way to perform categorization is the `.classify()` method, available on `Page` and `Region` objects.

```python
from natural_pdf import PDF

# Example: classify a page
pdf = PDF("pdfs/01-practice.pdf")
page = pdf.pages[0]
categories = ["invoice", "letter", "report cover", "data table"]
results = page.classify(categories=categories, model="text")

# Access the top result
print(f"Top Category: {page.category}")
print(f"Confidence: {page.category_confidence:.3f}")

# Access all results
# print(page.classification_results)
```

**Key Arguments:**

* `categories` (required): A list of strings naming the potential categories you want to classify the item into.
* `model` (optional): Specifies which classification model or strategy to use. Defaults to `"text"`.
    * `"text"`: Uses a text-based model (default: `facebook/bart-large-mnli`), suitable for classifying based on language content.
    * `"vision"`: Uses a vision-based model (default: `openai/clip-vit-base-patch32`), suitable for classifying based on visual layout and appearance.
    * Specific model ID: You can provide a Hugging Face model ID (e.g., `"google/siglip-base-patch16-224"`, `"MoritzLaurer/mDeBERTa-v3-base-mnli-xnli"`) compatible with zero-shot text or image classification. The library attempts to infer whether the model is text- or vision-based, but you may need to set `using` explicitly.
* `using` (optional): Explicitly set to `"text"` or `"vision"` if the automatic inference from the `model` ID fails or is ambiguous.
* `min_confidence` (optional): A float between 0.0 and 1.0. Only categories with a confidence score greater than or equal to this threshold are included in the results (default: 0.0).

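Conceptually, `min_confidence` is just a cutoff applied to the per-category scores the zero-shot model returns. A minimal sketch of that filtering in plain Python (not natural-pdf's actual implementation; the score values here are invented):

```python
def filter_scores(scores, min_confidence=0.0):
    """Keep only categories at or above the threshold, sorted by confidence."""
    kept = {label: s for label, s in scores.items() if s >= min_confidence}
    return sorted(kept.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical zero-shot output for four candidate labels
raw = {"invoice": 0.81, "letter": 0.12, "report cover": 0.05, "data table": 0.02}

print(filter_scores(raw, min_confidence=0.1))
# [('invoice', 0.81), ('letter', 0.12)]
```

With the default threshold of 0.0 every candidate survives, and only the ordering matters.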
## Text vs. Vision Classification

Choosing the right model type depends on your goal:

### Text Classification (`model="text"`)

* **How it works:** Extracts the text from the page or region and analyzes the language content.
* **Best for:**
    * **Topic Identification:** Determining what a page or section is *about* (e.g., "budget discussion," "environmental impact," "legal terms").
    * **Content-Driven Document Types:** Identifying document types primarily defined by their text (e.g., emails, meeting minutes, news articles, reports).
* **Data Journalism Example:** You have thousands of pages of government reports. You can use text classification to find all pages discussing "public health funding," or classify paragraphs within environmental impact statements to find mentions of specific endangered species.

```python
# Find pages related to finance
financial_categories = ["budget", "revenue", "expenditure", "forecast"]
pdf.classify_pages(categories=financial_categories, model="text")
budget_pages = [p for p in pdf.pages if p.category == "budget"]
```

### Vision Classification (`model="vision"`)

* **How it works:** Renders the page or region as an image and analyzes its visual layout, structure, and appearance.
* **Best for:**
    * **Layout-Driven Document Types:** Identifying documents recognizable by their structure (e.g., invoices, receipts, forms, presentation slides, title pages).
    * **Identifying Visual Elements:** Distinguishing between pages dominated by text, tables, charts, or images.
* **Data Journalism Example:** You have a scanned archive of campaign finance filings containing various document types. You can use vision classification to quickly isolate all the pages that look like donation receipts or expenditure forms, even when the OCR quality is poor.

```python
# Find pages that look like invoices or receipts
visual_categories = ["invoice", "receipt", "letter", "form"]
page.classify(categories=visual_categories, model="vision")
if page.category in ["invoice", "receipt"]:
    print(f"Page {page.number} looks like an invoice or receipt.")
```

## Classifying Specific Objects

### Pages (`page.classify(...)`)

Classifying a whole page is useful for sorting documents or identifying the overall purpose of a page within a larger document.

```python
# Classify the first page
page = pdf.pages[0]
page_types = ["cover page", "table of contents", "chapter start", "appendix"]
page.classify(categories=page_types, model="vision")  # Vision is often good for page structure
print(f"Page 1 Type: {page.category}")
```

### Regions (`region.classify(...)`)

Classifying a specific region allows for more granular analysis within a page. You might first detect regions using layout analysis and then classify those regions.

```python
# Assume layout analysis has run; find paragraphs
paragraphs = page.find_all("region[type=paragraph]")
if paragraphs:
    # Classify the topic of the first paragraph
    topic_categories = ["introduction", "methodology", "results", "conclusion"]
    # Use a text model for topics
    paragraphs[0].classify(categories=topic_categories, model="text")
    print(f"First paragraph category: {paragraphs[0].category}")
```

## Accessing Classification Results

After running `.classify()`, you can access the results:

* `page.category` or `region.category`: Returns the string label of the category with the highest confidence score from the *last* classification run. Returns `None` if no classification has been run or no category met the threshold.
* `page.category_confidence` or `region.category_confidence`: Returns the float confidence score (0.0–1.0) for the top category. Returns `None` otherwise.
* `page.classification_results` or `region.classification_results`: Returns the full result dictionary stored in the object's `.metadata['classification']`, containing the model used, engine type, categories provided, timestamp, and a list of all scores above the threshold, sorted by confidence. Returns `None` if no classification has been run.

```python
results = page.classify(categories=["invoice", "letter"], model="text", min_confidence=0.5)

if page.category == "invoice":
    print(f"Found an invoice with confidence {page.category_confidence:.2f}")

# See all results above the threshold
# print(page.classification_results['scores'])
```

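The description above names the fields that `classification_results` carries. A plausible shape for that dictionary, purely for illustration (the exact field names are assumptions; inspect `.metadata['classification']` on your own objects):

```python
from datetime import datetime, timezone

# Hypothetical result dict mirroring the fields described above
classification = {
    "model": "facebook/bart-large-mnli",
    "using": "text",
    "categories": ["invoice", "letter"],
    "timestamp": datetime.now(timezone.utc).isoformat(),
    # Scores above min_confidence, sorted by confidence descending
    "scores": [
        {"label": "invoice", "confidence": 0.91},
        {"label": "letter", "confidence": 0.07},
    ],
}

top = classification["scores"][0]
print(top["label"], top["confidence"])
```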
## Classifying Collections

For batch processing, use the `.classify_all()` method on `PDFCollection` or `ElementCollection` objects. It displays a progress bar tracking individual items (pages or elements).

### PDFCollection (`collection.classify_all(...)`)

Classifies pages across all PDFs in the collection. Use `max_workers` for parallel processing across different PDF files.

```python
import natural_pdf

collection = natural_pdf.PDFCollection.from_directory("./documents/")
categories = ["form", "datasheet", "image", "text document"]

# Classify all pages using the vision model, processing 4 PDFs concurrently
collection.classify_all(categories=categories, model="vision", max_workers=4)

# Filter PDFs containing forms
form_pdfs = []
for pdf in collection:
    if any(p.category == "form" for p in pdf.pages if p.category):
        form_pdfs.append(pdf.path)
    pdf.close()  # Remember to close PDFs

print(f"Found forms in: {form_pdfs}")
```

### ElementCollection (`element_collection.classify_all(...)`)

Classifies all classifiable elements (currently `Page` and `Region`) within the collection.

```python
# Assume 'pdf' is loaded; collect detected layout regions
layout_regions = pdf.find_all("region")
region_types = ["paragraph", "list", "table", "figure", "caption"]

# Classify all detected regions based on vision
layout_regions.classify_all(categories=region_types, model="vision")

# Count table regions
table_count = sum(1 for r in layout_regions if r.category == "table")
print(f"Found {table_count} regions classified as tables.")
```
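The docs don't spell out how `max_workers` parallelizes the work, but conceptually it is a worker pool fanning classification out across files. A rough standard-library sketch of that pattern (`classify_file` is a made-up stand-in, not a natural-pdf function):

```python
from concurrent.futures import ThreadPoolExecutor

def classify_file(path):
    """Stand-in for classifying every page of one PDF file."""
    # Real work would open the PDF and call .classify() per page.
    return (path, "form" if "form" in path else "text document")

paths = ["a_form.pdf", "report.pdf", "another_form.pdf"]

# Up to 4 files are processed concurrently, mirroring max_workers=4
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(classify_file, paths))

form_files = [p for p, cat in results.items() if cat == "form"]
print(form_files)
# ['a_form.pdf', 'another_form.pdf']
```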
@@ -0,0 +1,87 @@
# Structured Data Extraction

Extracting specific, structured information (like invoice numbers, dates, or addresses) from documents often requires more than simple text extraction. Natural PDF integrates with Large Language Models (LLMs) via Pydantic schemas to achieve this.

## Introduction

This feature allows you to define the exact data structure you want as a Pydantic model and then instruct an LLM to populate that structure from the content of a PDF element (such as a `Page` or `Region`).

## Basic Extraction

1. **Define a schema:** Create a Pydantic model for your desired data.
2. **Extract:** Use the `.extract()` method on a `PDF`, `Page`, or `Region` object.
3. **Access:** Use the `.extracted()` method to retrieve the results.

```python
from typing import Optional

from natural_pdf import PDF
from pydantic import BaseModel, Field
from openai import OpenAI  # Example client

# Example: initialize your LLM client
client = OpenAI()

# Load the PDF
pdf = PDF("path/to/your/document.pdf")
page = pdf.pages[0]

# 1. Define your schema
class InvoiceInfo(BaseModel):
    invoice_number: str = Field(description="The main invoice identifier")
    total_amount: float = Field(description="The final amount due")
    company_name: Optional[str] = Field(None, description="The name of the issuing company")

# 2. Extract data (using the default analysis_key="default-structured")
page.extract(schema=InvoiceInfo, client=client)

# 3. Access the results
# Access the full result object
full_data = page.extracted()
print(full_data)

# Access a single field
inv_num = page.extracted('invoice_number')
print(f"Invoice Number: {inv_num}")
```

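Schema-driven extraction of this kind typically hands the LLM a machine-readable description of the model. With Pydantic v2 that description is one call away (standalone sketch; whether natural-pdf uses exactly this mechanism is an assumption):

```python
from typing import Optional

from pydantic import BaseModel, Field

class InvoiceInfo(BaseModel):
    invoice_number: str = Field(description="The main invoice identifier")
    total_amount: float = Field(description="The final amount due")
    company_name: Optional[str] = Field(None, description="The name of the issuing company")

# JSON Schema the LLM can be asked to conform to; the Field
# descriptions above become per-property "description" entries.
schema = InvoiceInfo.model_json_schema()
print(sorted(schema["properties"]))
print(schema["required"])  # fields without defaults
```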
## Keys and Overwriting

- By default, results are stored under the key `"default-structured"` in the element's `.analyses` dictionary.
- Use the `analysis_key` parameter of `.extract()` to store results under a different name (e.g., `analysis_key="customer_details"`).
- Attempting to extract using an existing `analysis_key` raises an error unless `overwrite=True` is specified.

```python
# Extract using a specific key
page.extract(InvoiceInfo, client, analysis_key="invoice_header")

# Access using the specific key
header_data = page.extracted(analysis_key="invoice_header")
company = page.extracted('company_name', analysis_key="invoice_header")
```

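The overwrite rule above amounts to a guarded write into the `.analyses` dictionary. A toy sketch of that behavior (illustrative only, not natural-pdf's code):

```python
analyses = {}  # stand-in for an element's .analyses dict

def store_result(analysis_key, result, overwrite=False):
    """Refuse to clobber an existing key unless overwrite=True."""
    if analysis_key in analyses and not overwrite:
        raise ValueError(
            f"Analysis key {analysis_key!r} already exists; pass overwrite=True to replace it."
        )
    analyses[analysis_key] = result

store_result("invoice_header", {"company_name": "Acme Corp"})
store_result("invoice_header", {"company_name": "Acme Corp"}, overwrite=True)  # fine
try:
    store_result("invoice_header", {})  # no overwrite flag -> error
except ValueError as e:
    print(e)
```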
## Applying to Regions and Collections

The `.extract()` and `.extracted()` methods work identically on `Region` objects, allowing you to target specific areas of a page for structured data extraction.

```python
# Assuming 'header_region' is a Region object you defined
header_region.extract(InvoiceInfo, client)
company = header_region.extracted('company_name')
```

Furthermore, you can apply extraction to collections of elements (like `pdf.pages`, or the result of `pdf.find_all(...)`) using the `.apply()` method. This iterates through the collection and calls `.extract()` on each item.

```python
# Example: extract InvoiceInfo from the first 5 pages
results = pdf.pages[:5].apply(
    'extract',
    schema=InvoiceInfo,
    client=client,
    analysis_key="page_invoice_info",  # Use a specific key for batch results
    overwrite=True,  # Allow overwriting if run multiple times
)

# Access results for the first page in the collection
first_page_company = results[0].extracted('company_name', analysis_key="page_invoice_info")
```

This provides a powerful way to turn unstructured PDF content into structured, usable data.