mistocr 0.1.5__tar.gz → 0.2.0__tar.gz

mistocr-0.2.0/PKG-INFO ADDED
@@ -0,0 +1,253 @@
1
+ Metadata-Version: 2.4
2
+ Name: mistocr
3
+ Version: 0.2.0
4
+ Summary: Simple batch OCR for PDFs using Mistral's state-of-the-art vision model
5
+ Home-page: https://github.com/franckalbinet/mistocr
6
+ Author: Solveit
7
+ Author-email: nobody@fast.ai
8
+ License: Apache Software License 2.0
9
+ Keywords: nbdev jupyter notebook python
10
+ Classifier: Development Status :: 4 - Beta
11
+ Classifier: Intended Audience :: Developers
12
+ Classifier: Natural Language :: English
13
+ Classifier: Programming Language :: Python :: 3.9
14
+ Classifier: Programming Language :: Python :: 3.10
15
+ Classifier: Programming Language :: Python :: 3.11
16
+ Classifier: Programming Language :: Python :: 3.12
17
+ Classifier: License :: OSI Approved :: Apache Software License
18
+ Requires-Python: >=3.9
19
+ Description-Content-Type: text/markdown
20
+ License-File: LICENSE
21
+ Requires-Dist: fastcore
22
+ Requires-Dist: mistralai
23
+ Requires-Dist: pillow
24
+ Requires-Dist: dotenv
25
+ Requires-Dist: lisette
26
+ Provides-Extra: dev
27
+ Dynamic: author
28
+ Dynamic: author-email
29
+ Dynamic: classifier
30
+ Dynamic: description
31
+ Dynamic: description-content-type
32
+ Dynamic: home-page
33
+ Dynamic: keywords
34
+ Dynamic: license
35
+ Dynamic: license-file
36
+ Dynamic: provides-extra
37
+ Dynamic: requires-dist
38
+ Dynamic: requires-python
39
+ Dynamic: summary
40
+
41
+ # mistocr
42
+
43
+
44
+ <!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->
45
+
46
+ **PDF OCR is a critical bottleneck in AI pipelines.** It’s often
47
+ mentioned in passing, as if it’s a trivial step. Practice shows it’s far
48
+ from it. Poorly converted PDFs mean garbage-in-garbage-out for
49
+ downstream AI systems (RAG, …).
50
+
51
+ When [Mistral AI](https://mistral.ai) released their [state-of-the-art
52
+ OCR model](https://mistral.ai/fr/news/mistral-ocr) in March 2025, it
53
+ opened new possibilities for large-scale document processing. While
54
+ alternatives like [datalab.to](https://www.datalab.to) and
55
+ [docling.ai](https://www.docling.ai) offer viable solutions, Mistral OCR
56
+ delivers exceptional accuracy at a compelling price point.
57
+
58
+ **mistocr** emerged from months of real-world usage across projects
59
+ requiring large-scale processing of niche-domain PDFs. It addresses three
60
+ fundamental challenges that raw OCR output leaves unsolved:
61
+
62
+ - **Heading hierarchy restoration**: Even state-of-the-art OCR sometimes
63
+ produces inconsistent heading levels in large documents—a complex task
64
+ to get right. mistocr uses LLM-based analysis to restore proper
65
+ document structure, essential for downstream AI tasks.
66
+
67
+ - **Visual content integration**: Charts, figures and diagrams are
68
+ automatically classified and described, then integrated into the
69
+ markdown. This makes visual information searchable and accessible for
70
+ downstream applications.
71
+
72
+ - **Cost-efficient batch processing**: By exclusively using Mistral’s
73
+ batch API, mistocr cuts costs by 50% (\$0.50 vs \$1.00 per 1000 pages)
74
+ while eliminating the boilerplate code typically required.
75
+
76
+ **In short**: Production-ready batch OCR with intelligent postprocessing
77
+ that ensures your documents are actually usable for AI systems.
78
+
79
+ ## Get Started
80
+
81
+ Install latest from [pypi](https://pypi.org/project/mistocr), then:
82
+
83
+ ``` sh
84
+ $ pip install mistocr
85
+ ```
86
+
87
+ Set your API keys:
88
+
89
+ ``` python
90
+ import os
91
+ os.environ['MISTRAL_API_KEY'] = 'your-key-here'
92
+ os.environ['ANTHROPIC_API_KEY'] = 'your-key-here' # for refine features (see Advanced Usage for other LLMs)
93
+ ```
94
+
95
+ ### Complete Pipeline
96
+
97
+ Full pipeline with all features:
98
+
99
+ ``` python
100
+ from mistocr.pipeline import pdf_to_md
101
+ await pdf_to_md('files/test/resnet.pdf', 'files/test/md_test')
102
+ ```
103
+
104
+ Step 1/3: Running OCR on files/test/resnet.pdf...
105
+ Mistral batch job status: QUEUED
106
+ Mistral batch job status: RUNNING
107
+ Mistral batch job status: RUNNING
108
+ Step 2/3: Fixing heading hierarchy...
109
+ Step 3/3: Adding image descriptions...
110
+ Describing 7 images...
111
+ Saved descriptions to ocr_temp/resnet/img_descriptions.json
112
+ Adding descriptions to 12 pages...
113
+ Done! Enriched pages saved to files/test/md_test
114
+ Done!
115
+
116
+ This will (as indicated by the output):
117
+
118
+ 1. OCR the PDF using Mistral’s batch API
119
+ 2. Fix heading hierarchy inconsistencies
120
+ 3. Describe images (charts, diagrams) and add those descriptions into the markdown
121
+ 4. Save everything to `files/test/md_test`
122
+
123
+ The output structure will be:
124
+
125
+ files/test/md_test/
126
+ ├── img/
127
+ │ ├── img-0.jpeg
128
+ │ ├── img-1.jpeg
129
+ │ └── ...
130
+ ├── page_1.md
131
+ ├── page_2.md
132
+ └── ...
133
+
134
+ Each page’s markdown will include inline image descriptions:
135
+
136
+ ```` markdown
137
+ ```markdown
138
+ ![Figure 1](img/img-0.jpeg)
139
+ AI-generated image description:
140
+ ___
141
+ A residual learning block...
142
+ ___
143
+ ```
144
+ ````
145
+
146
+ To read and print the fully processed markdown, use the
147
+ [`read_pgs`](https://franckalbinet.github.io/mistocr/core.html#read_pgs)
148
+ function:
151
+
152
+ ``` python
153
+ from mistocr.pipeline import read_pgs
154
+ md = read_pgs('files/test/md_test')
155
+ print(md[:500])
156
+ ```
157
+
158
+ # Deep Residual Learning for Image Recognition ... page 1
159
+
160
+ Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun<br>Microsoft Research<br>\{kahe, v-xiangz, v-shren, jiansun\}@microsoft.com
161
+
162
+
163
+ ## Abstract ... page 1
164
+
165
+ Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, ins
166
+
167
+ By default,
168
+ [`read_pgs()`](https://franckalbinet.github.io/mistocr/core.html#read_pgs)
169
+ joins all pages. Pass `join=False` to get a list of individual pages
170
+ instead.
171
+
172
+ ### Advanced Usage
173
+
174
+ **Batch process entire folders:**
175
+
176
+ ``` python
177
+ from mistocr.core import ocr_pdf
178
+
179
+ # Process all PDFs in a folder
180
+ output_dirs = ocr_pdf('path/to/pdf_folder', dst='output_folder')
181
+ ```
182
+
183
+ **Custom models and prompts for heading fixes:**
184
+
185
+ ``` python
186
+ from mistocr.refine import fix_hdgs
187
+
188
+ # Use a different model or custom prompt
189
+ fix_hdgs('ocr_output/doc1',
190
+ model='gpt-4o',
191
+ prompt=your_custom_prompt)
192
+ ```
193
+
194
+ **Custom image description with rate limiting:**
195
+
196
+ ``` python
197
+ from mistocr.refine import add_img_descs
198
+
199
+ # Control API usage and customize descriptions
200
+ await add_img_descs('ocr_output/doc1',
201
+ model='claude-opus-4',
202
+ semaphore=5, # More concurrent requests
203
+ delay=0.5) # Shorter delay between calls
204
+ ```
205
+
206
+ For complete control over each pipeline step, see the
207
+ [core](https://fr.anckalbi.net/mistocr/core.html),
208
+ [refine](https://fr.anckalbi.net/mistocr/refine.html), and
209
+ [pipeline](https://fr.anckalbi.net/mistocr/pipeline.html) module
210
+ documentation.
211
+
212
+ ## Known Limitations & Future Work
213
+
214
+ `mistocr` is under active development. Current limitations include:
215
+
216
+ - **No timeout on batch jobs**: Jobs poll indefinitely until completion.
217
+ If a job stalls, manual intervention is required.
218
+ - **Limited error handling**: When batch jobs fail, error reporting and
219
+ recovery options are minimal.
220
+ - **Progress monitoring**: Currently limited to periodic status prints.
221
+ Future versions will support callbacks or streaming updates for better
222
+ real-time monitoring.
223
+
224
+ Contributions are welcome! If you encounter issues or have ideas for
225
+ improvements, please open an issue or discussion on
226
+ [GitHub](https://github.com/franckalbinet/mistocr).
227
+
228
+ ## Developer Guide
229
+
230
+ If you are new to using `nbdev` here are some useful pointers to get you
231
+ started.
232
+
233
+ ### Install mistocr in Development mode
234
+
235
+ ``` sh
236
+ # make sure mistocr package is installed in development mode
237
+ $ pip install -e .
238
+
239
+ # make changes under nbs/ directory
240
+ # ...
241
+
242
+ # compile to have changes apply to mistocr
243
+ $ nbdev_prepare
244
+ ```
245
+
246
+ ### Documentation
247
+
248
+ Documentation can be found hosted on this GitHub
249
+ [repository](https://github.com/franckalbinet/mistocr)’s
250
+ [pages](https://franckalbinet.github.io/mistocr/). Additionally you can
251
+ find package manager specific guidelines on
252
+ [conda](https://anaconda.org/franckalbinet/mistocr) and
253
+ [pypi](https://pypi.org/project/mistocr/) respectively.
@@ -0,0 +1,213 @@
1
+ # mistocr
2
+
3
+
4
+ <!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->
5
+
6
+ **PDF OCR is a critical bottleneck in AI pipelines.** It’s often
7
+ mentioned in passing, as if it’s a trivial step. Practice shows it’s far
8
+ from it. Poorly converted PDFs mean garbage-in-garbage-out for
9
+ downstream AI systems (RAG, …).
10
+
11
+ When [Mistral AI](https://mistral.ai) released their [state-of-the-art
12
+ OCR model](https://mistral.ai/fr/news/mistral-ocr) in March 2025, it
13
+ opened new possibilities for large-scale document processing. While
14
+ alternatives like [datalab.to](https://www.datalab.to) and
15
+ [docling.ai](https://www.docling.ai) offer viable solutions, Mistral OCR
16
+ delivers exceptional accuracy at a compelling price point.
17
+
18
+ **mistocr** emerged from months of real-world usage across projects
19
+ requiring large-scale processing of niche-domain PDFs. It addresses three
20
+ fundamental challenges that raw OCR output leaves unsolved:
21
+
22
+ - **Heading hierarchy restoration**: Even state-of-the-art OCR sometimes
23
+ produces inconsistent heading levels in large documents—a complex task
24
+ to get right. mistocr uses LLM-based analysis to restore proper
25
+ document structure, essential for downstream AI tasks.
26
+
27
+ - **Visual content integration**: Charts, figures and diagrams are
28
+ automatically classified and described, then integrated into the
29
+ markdown. This makes visual information searchable and accessible for
30
+ downstream applications.
31
+
32
+ - **Cost-efficient batch processing**: By exclusively using Mistral’s
33
+ batch API, mistocr cuts costs by 50% (\$0.50 vs \$1.00 per 1000 pages)
34
+ while eliminating the boilerplate code typically required.
35
+
36
+ **In short**: Production-ready batch OCR with intelligent postprocessing
37
+ that ensures your documents are actually usable for AI systems.
38
+
39
+ ## Get Started
40
+
41
+ Install latest from [pypi](https://pypi.org/project/mistocr), then:
42
+
43
+ ``` sh
44
+ $ pip install mistocr
45
+ ```
46
+
47
+ Set your API keys:
48
+
49
+ ``` python
50
+ import os
51
+ os.environ['MISTRAL_API_KEY'] = 'your-key-here'
52
+ os.environ['ANTHROPIC_API_KEY'] = 'your-key-here' # for refine features (see Advanced Usage for other LLMs)
53
+ ```
54
+
55
+ ### Complete Pipeline
56
+
57
+ Full pipeline with all features:
58
+
59
+ ``` python
60
+ from mistocr.pipeline import pdf_to_md
61
+ await pdf_to_md('files/test/resnet.pdf', 'files/test/md_test')
62
+ ```
63
+
64
+ Step 1/3: Running OCR on files/test/resnet.pdf...
65
+ Mistral batch job status: QUEUED
66
+ Mistral batch job status: RUNNING
67
+ Mistral batch job status: RUNNING
68
+ Step 2/3: Fixing heading hierarchy...
69
+ Step 3/3: Adding image descriptions...
70
+ Describing 7 images...
71
+ Saved descriptions to ocr_temp/resnet/img_descriptions.json
72
+ Adding descriptions to 12 pages...
73
+ Done! Enriched pages saved to files/test/md_test
74
+ Done!
75
+
76
+ This will (as indicated by the output):
77
+
78
+ 1. OCR the PDF using Mistral’s batch API
79
+ 2. Fix heading hierarchy inconsistencies
80
+ 3. Describe images (charts, diagrams) and add those descriptions into the markdown
81
+ 4. Save everything to `files/test/md_test`
82
+
83
+ The output structure will be:
84
+
85
+ files/test/md_test/
86
+ ├── img/
87
+ │ ├── img-0.jpeg
88
+ │ ├── img-1.jpeg
89
+ │ └── ...
90
+ ├── page_1.md
91
+ ├── page_2.md
92
+ └── ...
93
+
94
+ Each page’s markdown will include inline image descriptions:
95
+
96
+ ```` markdown
97
+ ```markdown
98
+ ![Figure 1](img/img-0.jpeg)
99
+ AI-generated image description:
100
+ ___
101
+ A residual learning block...
102
+ ___
103
+ ```
104
+ ````
105
+
106
+ To read and print the fully processed markdown, use the
107
+ [`read_pgs`](https://franckalbinet.github.io/mistocr/core.html#read_pgs)
108
+ function:
111
+
112
+ ``` python
113
+ from mistocr.pipeline import read_pgs
114
+ md = read_pgs('files/test/md_test')
115
+ print(md[:500])
116
+ ```
117
+
118
+ # Deep Residual Learning for Image Recognition ... page 1
119
+
120
+ Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun<br>Microsoft Research<br>\{kahe, v-xiangz, v-shren, jiansun\}@microsoft.com
121
+
122
+
123
+ ## Abstract ... page 1
124
+
125
+ Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, ins
126
+
127
+ By default,
128
+ [`read_pgs()`](https://franckalbinet.github.io/mistocr/core.html#read_pgs)
129
+ joins all pages. Pass `join=False` to get a list of individual pages
130
+ instead.
131
+
132
+ ### Advanced Usage
133
+
134
+ **Batch process entire folders:**
135
+
136
+ ``` python
137
+ from mistocr.core import ocr_pdf
138
+
139
+ # Process all PDFs in a folder
140
+ output_dirs = ocr_pdf('path/to/pdf_folder', dst='output_folder')
141
+ ```
142
+
143
+ **Custom models and prompts for heading fixes:**
144
+
145
+ ``` python
146
+ from mistocr.refine import fix_hdgs
147
+
148
+ # Use a different model or custom prompt
149
+ fix_hdgs('ocr_output/doc1',
150
+ model='gpt-4o',
151
+ prompt=your_custom_prompt)
152
+ ```
153
+
154
+ **Custom image description with rate limiting:**
155
+
156
+ ``` python
157
+ from mistocr.refine import add_img_descs
158
+
159
+ # Control API usage and customize descriptions
160
+ await add_img_descs('ocr_output/doc1',
161
+ model='claude-opus-4',
162
+ semaphore=5, # More concurrent requests
163
+ delay=0.5) # Shorter delay between calls
164
+ ```
165
+
166
+ For complete control over each pipeline step, see the
167
+ [core](https://fr.anckalbi.net/mistocr/core.html),
168
+ [refine](https://fr.anckalbi.net/mistocr/refine.html), and
169
+ [pipeline](https://fr.anckalbi.net/mistocr/pipeline.html) module
170
+ documentation.
171
+
172
+ ## Known Limitations & Future Work
173
+
174
+ `mistocr` is under active development. Current limitations include:
175
+
176
+ - **No timeout on batch jobs**: Jobs poll indefinitely until completion.
177
+ If a job stalls, manual intervention is required.
178
+ - **Limited error handling**: When batch jobs fail, error reporting and
179
+ recovery options are minimal.
180
+ - **Progress monitoring**: Currently limited to periodic status prints.
181
+ Future versions will support callbacks or streaming updates for better
182
+ real-time monitoring.
183
+
184
+ Contributions are welcome! If you encounter issues or have ideas for
185
+ improvements, please open an issue or discussion on
186
+ [GitHub](https://github.com/franckalbinet/mistocr).
187
+
188
+ ## Developer Guide
189
+
190
+ If you are new to using `nbdev` here are some useful pointers to get you
191
+ started.
192
+
193
+ ### Install mistocr in Development mode
194
+
195
+ ``` sh
196
+ # make sure mistocr package is installed in development mode
197
+ $ pip install -e .
198
+
199
+ # make changes under nbs/ directory
200
+ # ...
201
+
202
+ # compile to have changes apply to mistocr
203
+ $ nbdev_prepare
204
+ ```
205
+
206
+ ### Documentation
207
+
208
+ Documentation can be found hosted on this GitHub
209
+ [repository](https://github.com/franckalbinet/mistocr)’s
210
+ [pages](https://franckalbinet.github.io/mistocr/). Additionally you can
211
+ find package manager specific guidelines on
212
+ [conda](https://anaconda.org/franckalbinet/mistocr) and
213
+ [pypi](https://pypi.org/project/mistocr/) respectively.
@@ -0,0 +1 @@
+ __version__ = "0.2.0"
@@ -11,7 +11,7 @@ d = { 'settings': { 'branch': 'main',
  'mistocr.core.create_batch_entry': ('core.html#create_batch_entry', 'mistocr/core.py'),
  'mistocr.core.download_results': ('core.html#download_results', 'mistocr/core.py'),
  'mistocr.core.get_api_key': ('core.html#get_api_key', 'mistocr/core.py'),
- 'mistocr.core.ocr': ('core.html#ocr', 'mistocr/core.py'),
+ 'mistocr.core.ocr_pdf': ('core.html#ocr_pdf', 'mistocr/core.py'),
  'mistocr.core.prep_pdf_batch': ('core.html#prep_pdf_batch', 'mistocr/core.py'),
  'mistocr.core.read_pgs': ('core.html#read_pgs', 'mistocr/core.py'),
  'mistocr.core.save_images': ('core.html#save_images', 'mistocr/core.py'),
@@ -20,12 +20,22 @@ d = { 'settings': { 'branch': 'main',
  'mistocr.core.submit_batch': ('core.html#submit_batch', 'mistocr/core.py'),
  'mistocr.core.upload_pdf': ('core.html#upload_pdf', 'mistocr/core.py'),
  'mistocr.core.wait_for_job': ('core.html#wait_for_job', 'mistocr/core.py')},
+ 'mistocr.pipeline': {'mistocr.pipeline.pdf_to_md': ('pipeline.html#pdf_to_md', 'mistocr/pipeline.py')},
  'mistocr.refine': { 'mistocr.refine.HeadingCorrections': ('refine.html#headingcorrections', 'mistocr/refine.py'),
+     'mistocr.refine.ImgDescription': ('refine.html#imgdescription', 'mistocr/refine.py'),
+     'mistocr.refine.add_descs_to_pg': ('refine.html#add_descs_to_pg', 'mistocr/refine.py'),
+     'mistocr.refine.add_descs_to_pgs': ('refine.html#add_descs_to_pgs', 'mistocr/refine.py'),
+     'mistocr.refine.add_img_descs': ('refine.html#add_img_descs', 'mistocr/refine.py'),
      'mistocr.refine.add_pg_hdgs': ('refine.html#add_pg_hdgs', 'mistocr/refine.py'),
      'mistocr.refine.apply_hdg_fixes': ('refine.html#apply_hdg_fixes', 'mistocr/refine.py'),
+     'mistocr.refine.describe_img': ('refine.html#describe_img', 'mistocr/refine.py'),
+     'mistocr.refine.describe_imgs': ('refine.html#describe_imgs', 'mistocr/refine.py'),
      'mistocr.refine.fix_hdg_hierarchy': ('refine.html#fix_hdg_hierarchy', 'mistocr/refine.py'),
-     'mistocr.refine.fix_md_hdgs': ('refine.html#fix_md_hdgs', 'mistocr/refine.py'),
+     'mistocr.refine.fix_hdgs': ('refine.html#fix_hdgs', 'mistocr/refine.py'),
      'mistocr.refine.fmt_hdgs_idx': ('refine.html#fmt_hdgs_idx', 'mistocr/refine.py'),
      'mistocr.refine.get_hdgs': ('refine.html#get_hdgs', 'mistocr/refine.py'),
+     'mistocr.refine.limit': ('refine.html#limit', 'mistocr/refine.py'),
      'mistocr.refine.mk_fixes_lut': ('refine.html#mk_fixes_lut', 'mistocr/refine.py'),
-     'mistocr.refine.read_pgs_pg': ('refine.html#read_pgs_pg', 'mistocr/refine.py')}}}
+     'mistocr.refine.parse_r': ('refine.html#parse_r', 'mistocr/refine.py'),
+     'mistocr.refine.read_pgs_pg': ('refine.html#read_pgs_pg', 'mistocr/refine.py'),
+     'mistocr.refine.save_img_descs': ('refine.html#save_img_descs', 'mistocr/refine.py')}}}
@@ -4,7 +4,7 @@
  
  # %% auto 0
  __all__ = ['ocr_model', 'ocr_endpoint', 'get_api_key', 'upload_pdf', 'create_batch_entry', 'prep_pdf_batch', 'submit_batch',
-            'wait_for_job', 'download_results', 'save_images', 'save_page', 'save_pages', 'ocr', 'read_pgs']
+            'wait_for_job', 'download_results', 'save_images', 'save_page', 'save_pages', 'ocr_pdf', 'read_pgs']
  
  # %% ../nbs/00_core.ipynb 3
  from fastcore.all import *
@@ -79,10 +79,11 @@ def submit_batch(
  def wait_for_job(
      job:dict, # Job dict,
      c:Mistral=None, # Mistral client,
-     poll_interval:int=10 # Poll interval in seconds
+     poll_interval:int=1 # Poll interval in seconds
  ) -> dict: # Job dict (with status)
      "Poll job until completion and return final job status"
      while job.status in ["QUEUED", "RUNNING"]:
+         print(f'Mistral batch job status: {job.status}')
          time.sleep(poll_interval)
          job = c.batch.jobs.get(job_id=job.id)
      return job
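The hunk above shortens the default poll interval and prints the status on every poll, but the loop still has no upper bound, which the README lists as a known limitation. As a minimal sketch of how a caller could bound the wait, the helper below polls until the job leaves `QUEUED`/`RUNNING` or a deadline passes. The names `wait_with_timeout`, `JobTimeout`, and `get_job` are hypothetical and not part of mistocr; the stand-in job objects and the `"SUCCESS"` status string are placeholders for illustration only.

```python
import time
from types import SimpleNamespace


class JobTimeout(Exception):
    """Raised when a job is still pending once the allowed time has elapsed."""


def wait_with_timeout(job, get_job, poll_interval=1.0, timeout=600.0):
    """Poll `get_job(job.id)` until the job is no longer pending or `timeout` seconds pass.

    `get_job` is a hypothetical callable; with a Mistral client `c` it could be
    `lambda job_id: c.batch.jobs.get(job_id=job_id)`, the call used in the hunk above.
    """
    deadline = time.monotonic() + timeout
    while job.status in ("QUEUED", "RUNNING"):
        if time.monotonic() >= deadline:
            raise JobTimeout(f"job {job.id} still {job.status} after {timeout}s")
        time.sleep(poll_interval)
        job = get_job(job.id)
    return job


# Stand-in for the remote API: reports RUNNING once, then a terminal status.
_statuses = iter(["RUNNING", "SUCCESS"])
fake_get = lambda job_id: SimpleNamespace(id=job_id, status=next(_statuses))

job = wait_with_timeout(SimpleNamespace(id="j1", status="QUEUED"), fake_get, poll_interval=0)
print(job.status)
```

Using `time.monotonic()` for the deadline keeps the check unaffected by system clock adjustments; whether to raise or return the pending job on timeout is a design choice left to the caller.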
@@ -161,7 +162,7 @@ def _run_batch(entries:list[dict], c:Mistral, poll_interval:int=2) -> list[dict]
      return download_results(job, c)
  
  # %% ../nbs/00_core.ipynb 43
- def ocr(
+ def ocr_pdf(
      path:str, # Path to PDF file or folder,
      dst:str='md', # Directory to save markdown pages,
      inc_img:bool=True, # Include image in response,
@@ -174,7 +175,7 @@ def ocr(
      results = _run_batch(entries, c, poll_interval)
      return L([save_pages(r['response']['body'], dst, r['custom_id']) for r in results])
  
- # %% ../nbs/00_core.ipynb 48
+ # %% ../nbs/00_core.ipynb 47
  def read_pgs(
      path:str, # OCR output directory,
      join:bool=True # Join pages into single string
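The body of `read_pgs` falls outside this hunk, so only its signature (`path`, `join=True`) is visible here. As a rough illustration of the behaviour the README describes (joined text by default, a list of pages with `join=False`), the sketch below reads `page_N.md` files in numeric order. The function name `read_pages_sketch`, the blank-line separator, and the numeric sort are assumptions, not the package's actual implementation.

```python
import re
import tempfile
from pathlib import Path

PAGE_RE = re.compile(r"page_(\d+)\.md")


def read_pages_sketch(path, join=True):
    """Read page_N.md files from `path` in numeric page order."""
    numbered = []
    for f in Path(path).iterdir():
        m = PAGE_RE.fullmatch(f.name)
        if m:  # skip img/ and any other non-page entries
            numbered.append((int(m.group(1)), f))
    pages = [f.read_text(encoding="utf-8") for _, f in sorted(numbered)]
    # Assumption: pages are separated by a blank line when joined.
    return "\n\n".join(pages) if join else pages


# Small demonstration on a throwaway directory mimicking the README layout.
with tempfile.TemporaryDirectory() as d:
    for n in (1, 2, 10):
        (Path(d) / f"page_{n}.md").write_text(f"content of page {n}", encoding="utf-8")
    (Path(d) / "img").mkdir()
    pages = read_pages_sketch(d, join=False)
    joined = read_pages_sketch(d)
```

Sorting on the parsed page number rather than the file name keeps `page_10.md` after `page_2.md`, which a plain lexicographic sort would not.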
@@ -0,0 +1,37 @@
+ """End-to-End Pipeline: PDF OCR, Markdown Heading Correction, and AI Image Descriptions"""
+
+ # AUTOGENERATED! DO NOT EDIT! File to edit: ../nbs/02_pipeline.ipynb.
+
+ # %% auto 0
+ __all__ = ['pdf_to_md']
+
+ # %% ../nbs/02_pipeline.ipynb 3
+ from fastcore.all import *
+ from .core import read_pgs, ocr_pdf
+ from .refine import add_img_descs, fix_hdgs
+ from pathlib import Path
+ from asyncio import Semaphore, gather, sleep
+ import os, json, shutil
+
+ # %% ../nbs/02_pipeline.ipynb 4
+ @delegates(add_img_descs)
+ async def pdf_to_md(
+     pdf_path:str, # Path to input PDF file
+     dst:str, # Destination directory for output markdown
+     ocr_output:str=None, # Optional OCR output directory (defaults to pdf_path stem)
+     model:str='claude-sonnet-4-5', # Model to use for heading fixes and image descriptions
+     add_img_desc:bool=True, # Whether to add image descriptions
+     progress:bool=True, # Whether to show progress messages
+     **kwargs):
+     "Convert PDF to markdown with OCR, fixed heading hierarchy, and optional image descriptions"
+     n_steps = 3 if add_img_desc else 2
+     if progress: print(f"Step 1/{n_steps}: Running OCR on {pdf_path}...")
+     ocr_dirs = ocr_pdf(pdf_path, ocr_output or 'ocr_temp')
+     ocr_dir = ocr_dirs[0]
+     if progress: print(f"Step 2/{n_steps}: Fixing heading hierarchy...")
+     fix_hdgs(ocr_dir, model=model)
+     if add_img_desc:
+         if progress: print(f"Step 3/{n_steps}: Adding image descriptions...")
+         await add_img_descs(ocr_dir, dst=dst, model=model, progress=progress, **kwargs)
+     elif dst and Path(dst) != ocr_dir: shutil.copytree(ocr_dir, dst, dirs_exist_ok=True)
+     if progress: print("Done!")
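`pdf_to_md` forwards extra keyword arguments to `add_img_descs`, which the README calls with `semaphore` and `delay` to limit API usage. How `mistocr.refine` applies those two values is not shown in this diff, so the following is only a sketch of the general pattern: an `asyncio.Semaphore` caps the number of concurrent calls, and a short sleep spaces them out. The names `limited_gather` and `fake_describe` are hypothetical; the worker stands in for a real image-description request.

```python
import asyncio

active = 0  # calls currently in flight
peak = 0    # highest concurrency observed


async def fake_describe(name):
    """Stand-in for an image-description API call."""
    global active, peak
    active += 1
    peak = max(peak, active)
    await asyncio.sleep(0.01)  # simulated network latency
    active -= 1
    return f"description of {name}"


async def limited_gather(items, worker, max_concurrent=2, delay=0.0):
    """Run `worker` over `items` with at most `max_concurrent` calls in flight."""
    sem = asyncio.Semaphore(max_concurrent)

    async def run_one(item):
        async with sem:
            result = await worker(item)
            await asyncio.sleep(delay)  # keep the slot briefly to space out calls
            return result

    # gather returns results in the same order as `items`.
    return await asyncio.gather(*(run_one(i) for i in items))


images = [f"img-{i}.jpeg" for i in range(5)]
results = asyncio.run(limited_gather(images, fake_describe, max_concurrent=2))
```

The README examples use top-level `await`, which works in notebooks; in a plain script, a coroutine such as `pdf_to_md(...)` would need to be wrapped in `asyncio.run(...)` as done here.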