mistocr 0.0.3__tar.gz → 0.1.2__tar.gz

@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: mistocr
3
- Version: 0.0.3
3
+ Version: 0.1.2
4
4
  Summary: Simple batch OCR for PDFs using Mistral's state-of-the-art vision model
5
5
  Home-page: https://github.com/franckalbinet/mistocr
6
6
  Author: Solveit
@@ -22,6 +22,7 @@ Requires-Dist: fastcore
22
22
  Requires-Dist: mistralai
23
23
  Requires-Dist: pillow
24
24
  Requires-Dist: dotenv
25
+ Requires-Dist: lisette
25
26
  Provides-Extra: dev
26
27
  Dynamic: author
27
28
  Dynamic: author-email
@@ -54,10 +55,11 @@ for large document sets.
54
55
  **Cost savings**: Batch OCR mode reduces costs from \$1/1000 pages to
55
56
  \$0.50/1000 pages - a 50% reduction compared to synchronous processing.
56
57
 
57
- **Simplicity**: A single `ocr()` function handles everything -
58
- uploading, batch submission, polling for completion, and saving results
59
- as markdown with extracted images. Process one PDF or an entire folder
60
- with the same simple interface.
58
+ **Simplicity**: A single
59
+ [`ocr()`](https://franckalbinet.github.io/mistocr/core.html#ocr)
60
+ function handles everything - uploading, batch submission, polling for
61
+ completion, and saving results as markdown with extracted images.
62
+ Process one PDF or an entire folder with the same simple interface.
61
63
 
62
64
  **Organized output**: Each PDF is automatically saved to its own folder
63
65
  with pages as separate markdown files and images in an `img` subfolder,
@@ -80,57 +82,60 @@ $ pip install mistocr
80
82
 
81
83
  ## How to use
82
84
 
85
+ ### Basic usage
86
+
87
+ Process a single PDF:
88
+
83
89
  ``` python
84
90
  from mistocr.core import ocr
85
- ```
86
-
87
- - **Process a single PDF:**
88
91
 
89
- <!-- -->
92
+ fname = 'files/test/attention-is-all-you-need.pdf'
93
+ result = ocr(fname)
94
+ ```
90
95
 
91
- fname = 'files/test/attention-is-all-you-need.pdf'
92
- result = ocr(fname)
96
+ Or process an entire folder:
93
97
 
94
98
  ``` python
99
+ results = ocr('files/test')
95
100
  ```
96
101
 
97
- files/test/md/attention-is-all-you-need:
98
- img/ page_11.md page_14.md page_3.md page_6.md page_9.md
99
- page_1.md page_12.md page_15.md page_4.md page_7.md
100
- page_10.md page_13.md page_2.md page_5.md page_8.md
102
+ ### Output structure
101
103
 
102
- files/test/md/attention-is-all-you-need/img:
103
- img-0.jpeg img-1.jpeg img-2.jpeg img-3.jpeg img-4.jpeg
104
+ Each PDF is saved to its own folder with pages as separate markdown
105
+ files and images in an `img` subfolder:
104
106
 
105
- - **Or process an entire folder:**
107
+ files/test/md/
108
+ ├── attention-is-all-you-need/
109
+ │ ├── img/
110
+ │ │ ├── img-0.jpeg
111
+ │ │ ├── img-1.jpeg
112
+ │ │ └── ...
113
+ │ ├── page_1.md
114
+ │ ├── page_2.md
115
+ │ └── ...
116
+ └── resnet/
117
+ ├── img/
118
+ └── ...
106
119
 
107
- ``` python
108
- results = ocr('files/test')
109
- ```
120
+ ### Reading results
110
121
 
111
- ``` python
112
- ```
122
+ Read all pages from a processed PDF:
113
123
 
114
- files/test/md:
115
- attention-is-all-you-need/ resnet/
124
+ ``` python
125
+ from mistocr.core import read_pgs
116
126
 
117
- files/test/md/attention-is-all-you-need:
118
- img/ page_11.md page_14.md page_3.md page_6.md page_9.md
119
- page_1.md page_12.md page_15.md page_4.md page_7.md
120
- page_10.md page_13.md page_2.md page_5.md page_8.md
127
+ text = read_pgs('files/test/md/attention-is-all-you-need')
128
+ ```
121
129
 
122
- files/test/md/attention-is-all-you-need/img:
123
- img-0.jpeg img-1.jpeg img-2.jpeg img-3.jpeg img-4.jpeg
130
+ Or read a specific page:
124
131
 
125
- files/test/md/resnet:
126
- img/ page_10.md page_12.md page_3.md page_5.md page_7.md page_9.md
127
- page_1.md page_11.md page_2.md page_4.md page_6.md page_8.md
132
+ ``` python
133
+ text = read_pgs('files/test/md/attention-is-all-you-need', 10)
134
+ ```
128
135
 
129
- files/test/md/resnet/img:
130
- img-0.jpeg img-2.jpeg img-4.jpeg img-6.jpeg
131
- img-1.jpeg img-3.jpeg img-5.jpeg
136
+ ### Customization
132
137
 
133
- - **Customize the output:**
138
+ Customize output directory, image inclusion, and polling interval:
134
139
 
135
140
  ``` python
136
141
  results = ocr('files/test', out_dir='output', inc_img=False, poll_interval=5)
@@ -15,10 +15,11 @@ for large document sets.
15
15
  **Cost savings**: Batch OCR mode reduces costs from \$1/1000 pages to
16
16
  \$0.50/1000 pages - a 50% reduction compared to synchronous processing.
17
17
 
18
- **Simplicity**: A single `ocr()` function handles everything -
19
- uploading, batch submission, polling for completion, and saving results
20
- as markdown with extracted images. Process one PDF or an entire folder
21
- with the same simple interface.
18
+ **Simplicity**: A single
19
+ [`ocr()`](https://franckalbinet.github.io/mistocr/core.html#ocr)
20
+ function handles everything - uploading, batch submission, polling for
21
+ completion, and saving results as markdown with extracted images.
22
+ Process one PDF or an entire folder with the same simple interface.
22
23
 
23
24
  **Organized output**: Each PDF is automatically saved to its own folder
24
25
  with pages as separate markdown files and images in an `img` subfolder,
@@ -41,57 +42,60 @@ $ pip install mistocr
41
42
 
42
43
  ## How to use
43
44
 
45
+ ### Basic usage
46
+
47
+ Process a single PDF:
48
+
44
49
  ``` python
45
50
  from mistocr.core import ocr
46
- ```
47
-
48
- - **Process a single PDF:**
49
51
 
50
- <!-- -->
52
+ fname = 'files/test/attention-is-all-you-need.pdf'
53
+ result = ocr(fname)
54
+ ```
51
55
 
52
- fname = 'files/test/attention-is-all-you-need.pdf'
53
- result = ocr(fname)
56
+ Or process an entire folder:
54
57
 
55
58
  ``` python
59
+ results = ocr('files/test')
56
60
  ```
57
61
 
58
- files/test/md/attention-is-all-you-need:
59
- img/ page_11.md page_14.md page_3.md page_6.md page_9.md
60
- page_1.md page_12.md page_15.md page_4.md page_7.md
61
- page_10.md page_13.md page_2.md page_5.md page_8.md
62
+ ### Output structure
62
63
 
63
- files/test/md/attention-is-all-you-need/img:
64
- img-0.jpeg img-1.jpeg img-2.jpeg img-3.jpeg img-4.jpeg
64
+ Each PDF is saved to its own folder with pages as separate markdown
65
+ files and images in an `img` subfolder:
65
66
 
66
- - **Or process an entire folder:**
67
+ files/test/md/
68
+ ├── attention-is-all-you-need/
69
+ │ ├── img/
70
+ │ │ ├── img-0.jpeg
71
+ │ │ ├── img-1.jpeg
72
+ │ │ └── ...
73
+ │ ├── page_1.md
74
+ │ ├── page_2.md
75
+ │ └── ...
76
+ └── resnet/
77
+ ├── img/
78
+ └── ...
67
79
 
68
- ``` python
69
- results = ocr('files/test')
70
- ```
80
+ ### Reading results
71
81
 
72
- ``` python
73
- ```
82
+ Read all pages from a processed PDF:
74
83
 
75
- files/test/md:
76
- attention-is-all-you-need/ resnet/
84
+ ``` python
85
+ from mistocr.core import read_pgs
77
86
 
78
- files/test/md/attention-is-all-you-need:
79
- img/ page_11.md page_14.md page_3.md page_6.md page_9.md
80
- page_1.md page_12.md page_15.md page_4.md page_7.md
81
- page_10.md page_13.md page_2.md page_5.md page_8.md
87
+ text = read_pgs('files/test/md/attention-is-all-you-need')
88
+ ```
82
89
 
83
- files/test/md/attention-is-all-you-need/img:
84
- img-0.jpeg img-1.jpeg img-2.jpeg img-3.jpeg img-4.jpeg
90
+ Or read a specific page:
85
91
 
86
- files/test/md/resnet:
87
- img/ page_10.md page_12.md page_3.md page_5.md page_7.md page_9.md
88
- page_1.md page_11.md page_2.md page_4.md page_6.md page_8.md
92
+ ``` python
93
+ text = read_pgs('files/test/md/attention-is-all-you-need', 10)
94
+ ```
89
95
 
90
- files/test/md/resnet/img:
91
- img-0.jpeg img-2.jpeg img-4.jpeg img-6.jpeg
92
- img-1.jpeg img-3.jpeg img-5.jpeg
96
+ ### Customization
93
97
 
94
- - **Customize the output:**
98
+ Customize output directory, image inclusion, and polling interval:
95
99
 
96
100
  ``` python
97
101
  results = ocr('files/test', out_dir='output', inc_img=False, poll_interval=5)
@@ -0,0 +1 @@
1
+ __version__ = "0.1.2"
@@ -13,9 +13,17 @@ d = { 'settings': { 'branch': 'main',
13
13
  'mistocr.core.get_api_key': ('core.html#get_api_key', 'mistocr/core.py'),
14
14
  'mistocr.core.ocr': ('core.html#ocr', 'mistocr/core.py'),
15
15
  'mistocr.core.prep_pdf_batch': ('core.html#prep_pdf_batch', 'mistocr/core.py'),
16
+ 'mistocr.core.read_pgs': ('core.html#read_pgs', 'mistocr/core.py'),
16
17
  'mistocr.core.save_images': ('core.html#save_images', 'mistocr/core.py'),
17
18
  'mistocr.core.save_page': ('core.html#save_page', 'mistocr/core.py'),
18
19
  'mistocr.core.save_pages': ('core.html#save_pages', 'mistocr/core.py'),
19
20
  'mistocr.core.submit_batch': ('core.html#submit_batch', 'mistocr/core.py'),
20
21
  'mistocr.core.upload_pdf': ('core.html#upload_pdf', 'mistocr/core.py'),
21
- 'mistocr.core.wait_for_job': ('core.html#wait_for_job', 'mistocr/core.py')}}}
22
+ 'mistocr.core.wait_for_job': ('core.html#wait_for_job', 'mistocr/core.py')},
23
+ 'mistocr.refine': { 'mistocr.refine.HeadingCorrections': ('refine.html#headingcorrections', 'mistocr/refine.py'),
24
+ 'mistocr.refine.apply_hdg_fixes': ('refine.html#apply_hdg_fixes', 'mistocr/refine.py'),
25
+ 'mistocr.refine.fix_hdg_hierarchy': ('refine.html#fix_hdg_hierarchy', 'mistocr/refine.py'),
26
+ 'mistocr.refine.fix_md_hdgs': ('refine.html#fix_md_hdgs', 'mistocr/refine.py'),
27
+ 'mistocr.refine.fmt_hdgs_idx': ('refine.html#fmt_hdgs_idx', 'mistocr/refine.py'),
28
+ 'mistocr.refine.get_hdgs': ('refine.html#get_hdgs', 'mistocr/refine.py'),
29
+ 'mistocr.refine.mk_fixes_lut': ('refine.html#mk_fixes_lut', 'mistocr/refine.py')}}}
@@ -4,21 +4,17 @@
4
4
 
5
5
  # %% auto 0
6
6
  __all__ = ['ocr_model', 'ocr_endpoint', 'get_api_key', 'upload_pdf', 'create_batch_entry', 'prep_pdf_batch', 'submit_batch',
7
- 'wait_for_job', 'download_results', 'save_images', 'save_page', 'save_pages', 'ocr']
7
+ 'wait_for_job', 'download_results', 'save_images', 'save_page', 'save_pages', 'ocr', 'read_pgs']
8
8
 
9
9
  # %% ../nbs/00_core.ipynb 3
10
10
  from fastcore.all import *
11
- from dotenv import load_dotenv
12
- import os, json, time, base64, tempfile
11
+ import os, re, json, time, base64, tempfile, logging
13
12
  from io import BytesIO
14
13
  from pathlib import Path
15
14
  from PIL import Image
16
15
  from mistralai import Mistral
17
16
 
18
17
  # %% ../nbs/00_core.ipynb 6
19
- load_dotenv()
20
-
21
- # %% ../nbs/00_core.ipynb 7
22
18
  def get_api_key(
23
19
  key:str=None # Mistral API key
24
20
  ):
@@ -27,11 +23,11 @@ def get_api_key(
27
23
  if not key: raise ValueError("MISTRAL_API_KEY not found")
28
24
  return key
29
25
 
30
- # %% ../nbs/00_core.ipynb 8
26
+ # %% ../nbs/00_core.ipynb 7
31
27
  ocr_model = "mistral-ocr-latest"
32
28
  ocr_endpoint = "/v1/ocr"
33
29
 
34
- # %% ../nbs/00_core.ipynb 11
30
+ # %% ../nbs/00_core.ipynb 10
35
31
  def upload_pdf(
36
32
  path:str, # Path to PDF file
37
33
  key:str=None # Mistral API key
@@ -42,11 +38,11 @@ def upload_pdf(
42
38
  uploaded = c.files.upload(file=dict(file_name=path.stem, content=path.read_bytes()), purpose="ocr")
43
39
  return c.files.get_signed_url(file_id=uploaded.id).url, c
44
40
 
45
- # %% ../nbs/00_core.ipynb 16
41
+ # %% ../nbs/00_core.ipynb 15
46
42
  def create_batch_entry(
47
43
  path:str, # Path to PDF file,
48
44
  url:str, # Mistral signed URL
49
- cid:str=None, # Custom ID (by default using the file name without extention)
45
+ cid:str=None, # Custom ID (by default using the file name without extension)
50
46
  inc_img:bool=True # Include image in response
51
47
  ) -> dict[str, str | dict[str, str | bool]]: # Batch entry dict
52
48
  "Create a batch entry dict for OCR"
@@ -54,7 +50,7 @@ def create_batch_entry(
54
50
  if not cid: cid = path.stem
55
51
  return dict(custom_id=cid, body=dict(document=dict(type="document_url", document_url=url), include_image_base64=inc_img))
56
52
 
57
- # %% ../nbs/00_core.ipynb 18
53
+ # %% ../nbs/00_core.ipynb 17
58
54
  def prep_pdf_batch(
59
55
  path:str, # Path to PDF file,
60
56
  cid:str=None, # Custom ID (by default using the file name without extension)
@@ -65,7 +61,7 @@ def prep_pdf_batch(
65
61
  url, c = upload_pdf(path, key)
66
62
  return create_batch_entry(path, url, cid, inc_img), c
67
63
 
68
- # %% ../nbs/00_core.ipynb 22
64
+ # %% ../nbs/00_core.ipynb 21
69
65
  def submit_batch(
70
66
  entries:list[dict], # List of batch entries,
71
67
  c:Mistral=None, # Mistral client,
@@ -79,7 +75,7 @@ def submit_batch(
79
75
  batch_data = c.files.upload(file=dict(file_name="batch.jsonl", content=open(f.name, "rb")), purpose="batch")
80
76
  return c.batch.jobs.create(input_files=[batch_data.id], model=model, endpoint=endpoint)
81
77
 
82
- # %% ../nbs/00_core.ipynb 25
78
+ # %% ../nbs/00_core.ipynb 24
83
79
  def wait_for_job(
84
80
  job:dict, # Job dict,
85
81
  c:Mistral=None, # Mistral client,
@@ -91,7 +87,7 @@ def wait_for_job(
91
87
  job = c.batch.jobs.get(job_id=job.id)
92
88
  return job
93
89
 
94
- # %% ../nbs/00_core.ipynb 27
90
+ # %% ../nbs/00_core.ipynb 26
95
91
  def download_results(
96
92
  job:dict, # Job dict,
97
93
  c:Mistral=None # Mistral client
@@ -100,7 +96,7 @@ def download_results(
100
96
  content = c.files.download(file_id=job.output_file).read().decode('utf-8')
101
97
  return [json.loads(line) for line in content.strip().split('\n') if line]
102
98
 
103
- # %% ../nbs/00_core.ipynb 32
99
+ # %% ../nbs/00_core.ipynb 31
104
100
  def save_images(
105
101
  page:dict, # Page dict,
106
102
  img_dir:str='img' # Directory to save images
@@ -111,32 +107,32 @@ def save_images(
111
107
  img_bytes = base64.b64decode(img['image_base64'].split(',')[1])
112
108
  Image.open(BytesIO(img_bytes)).save(img_dir / img['id'])
113
109
 
114
- # %% ../nbs/00_core.ipynb 33
110
+ # %% ../nbs/00_core.ipynb 32
115
111
  def save_page(
116
112
  page:dict, # Page dict,
117
- out_dir:str, # Directory to save page
113
+ dst:str, # Directory to save page
118
114
  img_dir:str='img' # Directory to save images
119
115
  ) -> None:
120
116
  "Save single page markdown and images"
121
- (out_dir / f"page_{page['index']+1}.md").write_text(page['markdown'])
117
+ (dst / f"page_{page['index']+1}.md").write_text(page['markdown'])
122
118
  if page.get('images'):
123
119
  img_dir.mkdir(exist_ok=True)
124
120
  save_images(page, img_dir)
125
121
 
126
- # %% ../nbs/00_core.ipynb 35
122
+ # %% ../nbs/00_core.ipynb 34
127
123
  def save_pages(
128
124
  ocr_resp:dict, # OCR response,
129
- out_dir:str, # Directory to save pages,
125
+ dst:str, # Directory to save pages,
130
126
  cid:str # Custom ID
131
127
  ) -> Path: # Output directory
132
128
  "Save markdown pages and images from OCR response to output directory"
133
- out_dir = Path(out_dir) / cid
134
- out_dir.mkdir(parents=True, exist_ok=True)
135
- img_dir = out_dir / 'img'
136
- for page in ocr_resp['pages']: save_page(page, out_dir, img_dir)
137
- return out_dir
129
+ dst = Path(dst) / cid
130
+ dst.mkdir(parents=True, exist_ok=True)
131
+ img_dir = dst / 'img'
132
+ for page in ocr_resp['pages']: save_page(page, dst, img_dir)
133
+ return dst
138
134
 
139
- # %% ../nbs/00_core.ipynb 41
135
+ # %% ../nbs/00_core.ipynb 40
140
136
  def _get_paths(path:str) -> list[Path]:
141
137
  "Get list of PDFs from file or folder"
142
138
  path = Path(path)
@@ -147,7 +143,7 @@ def _get_paths(path:str) -> list[Path]:
147
143
  return pdfs
148
144
  raise ValueError(f"Path not found: {path}")
149
145
 
150
- # %% ../nbs/00_core.ipynb 42
146
+ # %% ../nbs/00_core.ipynb 41
151
147
  def _prep_batch(pdfs:list[Path], inc_img:bool=True, key:str=None) -> tuple[list[dict], Mistral]:
152
148
  "Prepare batch entries for list of PDFs"
153
149
  entries, c = [], None
@@ -156,7 +152,7 @@ def _prep_batch(pdfs:list[Path], inc_img:bool=True, key:str=None) -> tuple[list[
156
152
  entries.append(entry)
157
153
  return entries, c
158
154
 
159
- # %% ../nbs/00_core.ipynb 43
155
+ # %% ../nbs/00_core.ipynb 42
160
156
  def _run_batch(entries:list[dict], c:Mistral, poll_interval:int=2) -> list[dict]:
161
157
  "Submit batch, wait for completion, and download results"
162
158
  job = submit_batch(entries, c)
@@ -164,10 +160,10 @@ def _run_batch(entries:list[dict], c:Mistral, poll_interval:int=2) -> list[dict]
164
160
  if job.status != 'SUCCESS': raise Exception(f"Job failed with status: {job.status}")
165
161
  return download_results(job, c)
166
162
 
167
- # %% ../nbs/00_core.ipynb 44
163
+ # %% ../nbs/00_core.ipynb 43
168
164
  def ocr(
169
165
  path:str, # Path to PDF file or folder,
170
- out_dir:str='md', # Directory to save markdown pages,
166
+ dst:str='md', # Directory to save markdown pages,
171
167
  inc_img:bool=True, # Include image in response,
172
168
  key:str=None, # API key,
173
169
  poll_interval:int=2 # Poll interval in seconds
@@ -176,4 +172,15 @@ def ocr(
      pdfs = _get_paths(path)
      entries, c = _prep_batch(pdfs, inc_img, key)
      results = _run_batch(entries, c, poll_interval)
-     return L([save_pages(r['response']['body'], out_dir, r['custom_id']) for r in results])
+     return L([save_pages(r['response']['body'], dst, r['custom_id']) for r in results])
+
+ # %% ../nbs/00_core.ipynb 48
+ def read_pgs(
+     path:str, # OCR output directory,
+     join:bool=True # Join pages into single string
+ ) -> str|list[str]: # Joined string or list of page contents
+     "Read specific page or all pages from OCR output directory"
+     path = Path(path)
+     pgs = sorted(path.glob('page_*.md'), key=lambda p: int(p.stem.split('_')[1]))
+     contents = L([p.read_text() for p in pgs])
+     return '\n\n'.join(contents) if join else contents
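The `read_pgs` hunk above orders page files by the integer after the underscore instead of by filename, which matters once a document has ten or more pages. The sketch below isolates that ordering. The helper name `sort_pages` and the sample filenames are illustrative assumptions, not part of mistocr.

```python
from pathlib import Path

def sort_pages(paths):
    "Sort page_N.md paths by the integer N (hypothetical helper, for illustration only)."
    return sorted(paths, key=lambda p: int(Path(p).stem.split('_')[1]))

names = ['page_10.md', 'page_2.md', 'page_1.md']

lexical = sorted(names)      # plain string ordering puts page_10 before page_2
numeric = sort_pages(names)  # ordering by page number, as in the hunk above
```

In the signature shown in this hunk the second positional parameter is `join`, so a call such as `read_pgs(path, 10)` would presumably bind `10` to `join` rather than select page 10. The same diff renames the `out_dir` parameter of `ocr` to `dst`, so a keyword call using `out_dir=` would likely need updating as well.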
@@ -0,0 +1,117 @@
+ """Postprocess markdown files by fixing heading hierarchy and describing images"""
+
+ # AUTOGENERATED! DO NOT EDIT! File to edit: ../nbs/01_refine.ipynb.
+
+ # %% auto 0
+ __all__ = ['prompt_fix_hdgs', 'get_hdgs', 'fmt_hdgs_idx', 'HeadingCorrections', 'fix_hdg_hierarchy', 'mk_fixes_lut',
+            'apply_hdg_fixes', 'fix_md_hdgs']
+
+ # %% ../nbs/01_refine.ipynb 3
+ from fastcore.all import *
+ from .core import read_pgs
+ from re import sub, findall, MULTILINE
+ from pydantic import BaseModel
+ from lisette.core import completion
+ import os
+ import json
+
+ # %% ../nbs/01_refine.ipynb 7
+ def get_hdgs(
+     md:str # Markdown file string
+ ):
+     "Return the markdown headings"
+     # Strip fenced code blocks first so '#' comments inside snippets are not read as headings
+     md = sub(r'```[\s\S]*?```', '', md)
+     return L(findall(r'^#{1,6} .+$', md, MULTILINE))
+
+
+
+ # %% ../nbs/01_refine.ipynb 10
+ def fmt_hdgs_idx(
+     hdgs: list[str] # List of markdown headings
+ ) -> str: # Formatted string with index
+     "Format the headings with index"
+     return '\n'.join(f"{i}. {h}" for i, h in enumerate(hdgs))
+
+
+ # %% ../nbs/01_refine.ipynb 13
+ class HeadingCorrections(BaseModel):
+     corrections: dict[int, str] # index → corrected heading
+
+ # %% ../nbs/01_refine.ipynb 15
+ prompt_fix_hdgs = """Fix markdown heading hierarchy errors while preserving the document's intended structure.
+
+ INPUT FORMAT: Each heading is prefixed with its index number (e.g., "0. # Title")
+
+ RULES - Apply these fixes in order:
+
+ 1. **Single H1 rule**: Documents must have exactly ONE # heading (the title/main heading)
+ - All other headings should be ## or deeper
+
+ 2. **Infer depth from numbering patterns**: If headings contain section numbers, deeper nesting means deeper heading level
+ - Parent section (e.g., "1", "2", "A") should be shallower than child (e.g., "1.1", "2.a", "A.1")
+ - Child section should be one # deeper than parent
+ - Works with any numbering: "1/1.1/1.1.1", "A/A.1/A.1.a", "I/I.A/I.A.1", etc.
+
+ 3. **Level jumps**: Headings can only increase by one # at a time when moving deeper
+ - Wrong: ## Section → ##### Subsection
+ - Fixed: ## Section → ### Subsection
+
+ 4. **Decreasing levels is OK**: Moving back up the hierarchy (### to ##) is valid for new sections
+
+ OUTPUT: Return a Python dictionary mapping index to corrected heading (without the index prefix).
+ Only include entries that need changes.
+
+ Headings to analyze:
+ {headings_list}
+ """
+
+ # %% ../nbs/01_refine.ipynb 16
+ def fix_hdg_hierarchy(
+     hdgs: list[str], # List of markdown headings
+     prompt: str=prompt_fix_hdgs, # Prompt to use
+     model: str='claude-sonnet-4-5', # Model to use
+     api_key: str=os.getenv('ANTHROPIC_API_KEY') # API key
+ ) -> dict[int, str]: # Dictionary of index → corrected heading
+     "Fix the heading hierarchy"
+     r = completion(
+         model=model,
+         # Format the `prompt` argument so a caller-supplied prompt takes effect
+         messages=[{"role": "user", "content": prompt.format(headings_list=fmt_hdgs_idx(hdgs))}],
+         response_format=HeadingCorrections,
+         api_key=api_key
+     )
+     return json.loads(r.choices[0].message.content)['corrections']
+
+ # %% ../nbs/01_refine.ipynb 19
+ def mk_fixes_lut(
+     hdgs: list[str], # List of markdown headings
+     model: str='claude-sonnet-4-5', # Model to use
+     api_key: str=os.getenv('ANTHROPIC_API_KEY') # API key
+ ) -> dict[str, str]: # Dictionary of old → new heading
+     "Make a lookup table of fixes"
+     # Keyword arguments: the second positional parameter of fix_hdg_hierarchy is `prompt`
+     fixes = fix_hdg_hierarchy(hdgs, model=model, api_key=api_key)
+     return {hdgs[int(k)]:v for k,v in fixes.items()}
+
+ # %% ../nbs/01_refine.ipynb 22
+ def apply_hdg_fixes(
+     p:str, # Page to fix
+     lut_fixes: dict[str, str], # Lookup table of fixes
+     pg: int=None, # Optionally specify the page number to append to original heading
+ ) -> str: # Page with fixes applied
+     "Apply the fixes to the page"
+     for old in get_hdgs(p): p = p.replace(old, lut_fixes.get(old, old) + (f' .... page {pg}' if pg else ''))
+     return p
+
+ # %% ../nbs/01_refine.ipynb 25
+ def fix_md_hdgs(
+     src:str, # Source directory with markdown pages
+     model:str='claude-sonnet-4-5', # Model
+     dst:str=None, # Destination directory (None=overwrite)
+     pg_nums:bool=True # Add page numbers
+ ):
+     "Fix heading hierarchy in markdown document"
+     src_path,dst_path = Path(src),Path(dst) if dst else Path(src)
+     if dst_path != src_path: dst_path.mkdir(parents=True, exist_ok=True)
+     lut = mk_fixes_lut(get_hdgs(read_pgs(src_path)), model)
+     for i,p in enumerate(read_pgs(src_path, join=False), 1):
+         (dst_path/f'page_{i}.md').write_text(apply_hdg_fixes(p, lut, pg=i if pg_nums else None))
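The new `refine.py` module chains three steps: extract headings, ask a model for index-keyed corrections, then rewrite each page through a lookup table. The sketch below walks through the extraction and rewrite steps offline, using only the standard library. The `reply` string is a hypothetical stand-in for a model response, and `apply_fixes` is an illustrative simplification, so this approximates the data flow and is not the package's code.

```python
import json
from re import sub, findall, MULTILINE

FENCE = '`' * 3  # built programmatically so this example contains no literal fence

def get_hdgs(md):
    "Return markdown headings, ignoring anything inside fenced code blocks."
    md = sub(FENCE + r'[\s\S]*?' + FENCE, '', md)
    return findall(r'^#{1,6} .+$', md, MULTILINE)

def apply_fixes(page, lut, pg=None):
    "Replace each heading with its corrected form; optionally append a page marker."
    for old in get_hdgs(page):
        page = page.replace(old, lut.get(old, old) + (f' .... page {pg}' if pg else ''))
    return page

page = "# Title\n\n# 1 Introduction\n\ntext\n\n#### 1.1 Background\n"
hdgs = get_hdgs(page)

# Hypothetical model reply: JSON object keys are always strings, hence int(k) below.
reply = '{"corrections": {"1": "## 1 Introduction", "2": "### 1.1 Background"}}'
corrections = json.loads(reply)['corrections']
lut = {hdgs[int(k)]: v for k, v in corrections.items()}

fixed = apply_fixes(page, lut)
```

Because `str.replace` substitutes every occurrence of the old text, a heading whose text also appears inside a deeper heading (for example `# Intro` within `## Intro`) could be rewritten more than once. A line-anchored substitution may be more robust where that matters.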
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: mistocr
3
- Version: 0.0.3
3
+ Version: 0.1.2
4
4
  Summary: Simple batch OCR for PDFs using Mistral's state-of-the-art vision model
5
5
  Home-page: https://github.com/franckalbinet/mistocr
6
6
  Author: Solveit
@@ -22,6 +22,7 @@ Requires-Dist: fastcore
22
22
  Requires-Dist: mistralai
23
23
  Requires-Dist: pillow
24
24
  Requires-Dist: dotenv
25
+ Requires-Dist: lisette
25
26
  Provides-Extra: dev
26
27
  Dynamic: author
27
28
  Dynamic: author-email
@@ -54,10 +55,11 @@ for large document sets.
54
55
  **Cost savings**: Batch OCR mode reduces costs from \$1/1000 pages to
55
56
  \$0.50/1000 pages - a 50% reduction compared to synchronous processing.
56
57
 
57
- **Simplicity**: A single `ocr()` function handles everything -
58
- uploading, batch submission, polling for completion, and saving results
59
- as markdown with extracted images. Process one PDF or an entire folder
60
- with the same simple interface.
58
+ **Simplicity**: A single
59
+ [`ocr()`](https://franckalbinet.github.io/mistocr/core.html#ocr)
60
+ function handles everything - uploading, batch submission, polling for
61
+ completion, and saving results as markdown with extracted images.
62
+ Process one PDF or an entire folder with the same simple interface.
61
63
 
62
64
  **Organized output**: Each PDF is automatically saved to its own folder
63
65
  with pages as separate markdown files and images in an `img` subfolder,
@@ -80,57 +82,60 @@ $ pip install mistocr
80
82
 
81
83
  ## How to use
82
84
 
85
+ ### Basic usage
86
+
87
+ Process a single PDF:
88
+
83
89
  ``` python
84
90
  from mistocr.core import ocr
85
- ```
86
-
87
- - **Process a single PDF:**
88
91
 
89
- <!-- -->
92
+ fname = 'files/test/attention-is-all-you-need.pdf'
93
+ result = ocr(fname)
94
+ ```
90
95
 
91
- fname = 'files/test/attention-is-all-you-need.pdf'
92
- result = ocr(fname)
96
+ Or process an entire folder:
93
97
 
94
98
  ``` python
99
+ results = ocr('files/test')
95
100
  ```
96
101
 
97
- files/test/md/attention-is-all-you-need:
98
- img/ page_11.md page_14.md page_3.md page_6.md page_9.md
99
- page_1.md page_12.md page_15.md page_4.md page_7.md
100
- page_10.md page_13.md page_2.md page_5.md page_8.md
102
+ ### Output structure
101
103
 
102
- files/test/md/attention-is-all-you-need/img:
103
- img-0.jpeg img-1.jpeg img-2.jpeg img-3.jpeg img-4.jpeg
104
+ Each PDF is saved to its own folder with pages as separate markdown
105
+ files and images in an `img` subfolder:
104
106
 
105
- - **Or process an entire folder:**
107
+ files/test/md/
108
+ ├── attention-is-all-you-need/
109
+ │ ├── img/
110
+ │ │ ├── img-0.jpeg
111
+ │ │ ├── img-1.jpeg
112
+ │ │ └── ...
113
+ │ ├── page_1.md
114
+ │ ├── page_2.md
115
+ │ └── ...
116
+ └── resnet/
117
+ ├── img/
118
+ └── ...
106
119
 
107
- ``` python
108
- results = ocr('files/test')
109
- ```
120
+ ### Reading results
110
121
 
111
- ``` python
112
- ```
122
+ Read all pages from a processed PDF:
113
123
 
114
- files/test/md:
115
- attention-is-all-you-need/ resnet/
124
+ ``` python
125
+ from mistocr.core import read_pgs
116
126
 
117
- files/test/md/attention-is-all-you-need:
118
- img/ page_11.md page_14.md page_3.md page_6.md page_9.md
119
- page_1.md page_12.md page_15.md page_4.md page_7.md
120
- page_10.md page_13.md page_2.md page_5.md page_8.md
127
+ text = read_pgs('files/test/md/attention-is-all-you-need')
128
+ ```
121
129
 
122
- files/test/md/attention-is-all-you-need/img:
123
- img-0.jpeg img-1.jpeg img-2.jpeg img-3.jpeg img-4.jpeg
130
+ Or read a specific page:
124
131
 
125
- files/test/md/resnet:
126
- img/ page_10.md page_12.md page_3.md page_5.md page_7.md page_9.md
127
- page_1.md page_11.md page_2.md page_4.md page_6.md page_8.md
132
+ ``` python
133
+ text = read_pgs('files/test/md/attention-is-all-you-need', 10)
134
+ ```
128
135
 
129
- files/test/md/resnet/img:
130
- img-0.jpeg img-2.jpeg img-4.jpeg img-6.jpeg
131
- img-1.jpeg img-3.jpeg img-5.jpeg
136
+ ### Customization
132
137
 
133
- - **Customize the output:**
138
+ Customize output directory, image inclusion, and polling interval:
134
139
 
135
140
  ``` python
136
141
  results = ocr('files/test', out_dir='output', inc_img=False, poll_interval=5)
@@ -7,6 +7,7 @@ setup.py
7
7
  mistocr/__init__.py
8
8
  mistocr/_modidx.py
9
9
  mistocr/core.py
10
+ mistocr/refine.py
10
11
  mistocr.egg-info/PKG-INFO
11
12
  mistocr.egg-info/SOURCES.txt
12
13
  mistocr.egg-info/dependency_links.txt
@@ -2,5 +2,6 @@ fastcore
2
2
  mistralai
3
3
  pillow
4
4
  dotenv
5
+ lisette
5
6
 
6
7
  [dev]
@@ -1,7 +1,7 @@
1
1
  [DEFAULT]
2
2
  repo = mistocr
3
3
  lib_name = mistocr
4
- version = 0.0.3
4
+ version = 0.1.2
5
5
  min_python = 3.9
6
6
  license = apache2
7
7
  black_formatting = False
@@ -27,7 +27,7 @@ keywords = nbdev jupyter notebook python
27
27
  language = English
28
28
  status = 3
29
29
  user = franckalbinet
30
- requirements = fastcore mistralai pillow dotenv
30
+ requirements = fastcore mistralai pillow dotenv lisette
31
31
  readme_nb = index.ipynb
32
32
  allowed_metadata_keys =
33
33
  allowed_cell_metadata_keys =
@@ -1 +0,0 @@
1
- __version__ = "0.0.3"
File without changes