mistral-ai-ocr 1.0__tar.gz

@@ -0,0 +1,172 @@
1
+ Metadata-Version: 2.1
2
+ Name: mistral-ai-ocr
3
+ Version: 1.0
4
+ Description-Content-Type: text/markdown
5
+
6
+ # Mistral AI OCR
7
+ This is a simple script that uses the Mistral AI OCR API to extract text from a PDF or image file
8
+
9
+ ## Modes
10
+
11
+ | Value | Name |
12
+ |-|-|
13
+ | 0 | FULL |
14
+ | 1 | FULL_ALT |
15
+ | 2 | FULL_NO_DIR |
16
+ | 3 | FULL_NO_PAGES |
17
+ | 4 | TEXT |
18
+ | 5 | TEXT_NO_PAGES |
19
+
20
+ Given the input file `paper.pdf`, the directory structure for each mode is shown below:
21
+
22
+ ### 0 - `FULL`
23
+
24
+ Structure
25
+ ```
26
+ paper
27
+ ├── full
28
+ │ ├── image1.png
29
+ │ ├── image2.png
30
+ │ ├── image3.png
31
+ │ └── paper.md
32
+ ├── page_0
33
+ │ ├── image1.png
34
+ │ └── paper.md
35
+ ├── page_1
36
+ │ ├── image2.png
37
+ │ └── paper.md
38
+ └── page_2
39
+ ├── image3.png
40
+ └── paper.md
41
+ ```
42
+
43
+ ### 1 - `FULL_ALT`
44
+
45
+ Structure
46
+ ```
47
+ paper
48
+ ├── image1.png
49
+ ├── image2.png
50
+ ├── image3.png
51
+ ├── paper.md
52
+ ├── page_0
53
+ │ ├── image1.png
54
+ │ └── paper.md
55
+ ├── page_1
56
+ │ ├── image2.png
57
+ │ └── paper.md
58
+ └── page_2
59
+ ├── image3.png
60
+ └── paper.md
61
+ ```
62
+
63
+ ### 2 - `FULL_NO_DIR`
64
+
65
+ Structure
66
+ ```
67
+ paper
68
+ ├── image1.png
69
+ ├── image2.png
70
+ ├── image3.png
71
+ ├── paper.md
72
+ ├── paper0.md
73
+ ├── paper1.md
74
+ └── paper2.md
75
+ ```
76
+
77
+ ### 3 - `FULL_NO_PAGES` *default*
78
+
79
+ Structure
80
+ ```
81
+ paper
82
+ ├── image1.png
83
+ ├── image2.png
84
+ ├── image3.png
85
+ └── paper.md
86
+ ```
87
+
88
+ ### 4 - `TEXT`
89
+
90
+ Structure
91
+ ```
92
+ paper
93
+ ├── paper.md
94
+ ├── paper0.md
95
+ ├── paper1.md
96
+ └── paper2.md
97
+ ```
98
+
99
+ ### 5 - `TEXT_NO_PAGES`
100
+
101
+ Structure
102
+ ```
103
+ paper
104
+ └── paper.md
105
+ ```
106
+
107
+ By default, the JSON response from the Mistral AI OCR API is saved in the output directory. To disable JSON output, use the `-n` or `--no-json` argument. To experiment with a different **mode** without making additional API calls, pass a previously saved JSON response with `-j` instead of the original input file.
108
+
109
+ # Usage
110
+
111
+ ## Install the Requirements
112
+
113
+ To install the necessary requirements, run the following command:
114
+
115
+ ```sh
116
+ pip install mistral-ai-ocr
117
+ ```
118
+
119
+ ## Typical Usage
120
+
121
+ ```sh
122
+ mistral-ai-ocr paper.pdf
123
+ mistral-ai-ocr paper.pdf --api-key jrWjJE5lFketfB2sA6vvhQK2SoHQ6R39
124
+ mistral-ai-ocr paper.pdf -o revision
125
+ mistral-ai-ocr paper.pdf -e
126
+ mistral-ai-ocr paper.pdf -m FULL
127
+ mistral-ai-ocr page74.jpg -e
128
+ mistral-ai-ocr -j paper.json
129
+ mistral-ai-ocr -j paper.json -m TEXT_NO_PAGES -n
130
+ ```
131
+
132
+ ## Arguments
133
+
134
+ | Short | Long | Description |
135
+ |-|-|-|
136
+ | `input` | | input PDF or image file (positional) |
137
+ | -k API_KEY | --api-key API_KEY | Mistral API key; can also be set via the **MISTRAL_API_KEY** environment variable |
138
+ | -o OUTPUT | --output OUTPUT | output directory path. If not set, a directory named after the input file's stem (filename without extension) is created next to the input file |
139
+ | -j JSON_OCR_RESPONSE | --json-ocr-response JSON_OCR_RESPONSE | path from which to load a pre-existing JSON OCR response (any input file will be ignored) |
140
+ | -m MODE | --mode MODE | mode of operation: either the name or numerical value of the mode. _Defaults to FULL_NO_PAGES_ |
141
+ | -s PAGE_SEPARATOR | --page-separator PAGE_SEPARATOR | page separator to use when writing the Markdown file. _Defaults to `\n`_ |
142
+ | -n | --no-json | do not write the JSON OCR response to a file. By default, the response is written |
143
+ | -e | --load-dot-env | load the .env file from the current directory using [`python-dotenv`](https://pypi.org/project/python-dotenv/), to retrieve the Mistral API key |
144
+
145
+ ### Mistral AI API Key
146
+
147
+ To obtain an API key, you need a [Mistral AI](https://auth.mistral.ai/ui/registration) account. Then visit [https://admin.mistral.ai/organization/api-keys](https://admin.mistral.ai/organization/api-keys) and click the **Create new key** button
148
+
149
+ To avoid using `-e` to load the `.env` file, you can create one at `$HOME/.mistral_ai_ocr.env` (where `$HOME` is your home directory). It will then be automatically loaded at the start of the script
150
+
151
+ For example, for a user called `vavilov`, the path would look like this:
152
+
153
+ * **Linux**
154
+ ```
155
+ /home/vavilov/.mistral_ai_ocr.env
156
+ ```
157
+
158
+ * **macOS**
159
+ ```
160
+ /Users/vavilov/.mistral_ai_ocr.env
161
+ ```
162
+
163
+ * **Windows**
164
+ ```
165
+ C:\Users\vavilov\.mistral_ai_ocr.env
166
+ ```
167
+
168
+ and the content will be something like this:
169
+
170
+ ```
171
+ MISTRAL_API_KEY=jrWjJE5lFketfB2sA6vvhQK2SoHQ6R39
172
+ ```
@@ -0,0 +1,167 @@
1
+ # Mistral AI OCR
2
+ This is a simple script that uses the Mistral AI OCR API to extract text from a PDF or image file
3
+
4
+ ## Modes
5
+
6
+ | Value | Name |
7
+ |-|-|
8
+ | 0 | FULL |
9
+ | 1 | FULL_ALT |
10
+ | 2 | FULL_NO_DIR |
11
+ | 3 | FULL_NO_PAGES |
12
+ | 4 | TEXT |
13
+ | 5 | TEXT_NO_PAGES |
14
+
15
+ Given the input file `paper.pdf`, the directory structure for each mode is shown below:
16
+
17
+ ### 0 - `FULL`
18
+
19
+ Structure
20
+ ```
21
+ paper
22
+ ├── full
23
+ │ ├── image1.png
24
+ │ ├── image2.png
25
+ │ ├── image3.png
26
+ │ └── paper.md
27
+ ├── page_0
28
+ │ ├── image1.png
29
+ │ └── paper.md
30
+ ├── page_1
31
+ │ ├── image2.png
32
+ │ └── paper.md
33
+ └── page_2
34
+ ├── image3.png
35
+ └── paper.md
36
+ ```
37
+
38
+ ### 1 - `FULL_ALT`
39
+
40
+ Structure
41
+ ```
42
+ paper
43
+ ├── image1.png
44
+ ├── image2.png
45
+ ├── image3.png
46
+ ├── paper.md
47
+ ├── page_0
48
+ │ ├── image1.png
49
+ │ └── paper.md
50
+ ├── page_1
51
+ │ ├── image2.png
52
+ │ └── paper.md
53
+ └── page_2
54
+ ├── image3.png
55
+ └── paper.md
56
+ ```
57
+
58
+ ### 2 - `FULL_NO_DIR`
59
+
60
+ Structure
61
+ ```
62
+ paper
63
+ ├── image1.png
64
+ ├── image2.png
65
+ ├── image3.png
66
+ ├── paper.md
67
+ ├── paper0.md
68
+ ├── paper1.md
69
+ └── paper2.md
70
+ ```
71
+
72
+ ### 3 - `FULL_NO_PAGES` *default*
73
+
74
+ Structure
75
+ ```
76
+ paper
77
+ ├── image1.png
78
+ ├── image2.png
79
+ ├── image3.png
80
+ └── paper.md
81
+ ```
82
+
83
+ ### 4 - `TEXT`
84
+
85
+ Structure
86
+ ```
87
+ paper
88
+ ├── paper.md
89
+ ├── paper0.md
90
+ ├── paper1.md
91
+ └── paper2.md
92
+ ```
93
+
94
+ ### 5 - `TEXT_NO_PAGES`
95
+
96
+ Structure
97
+ ```
98
+ paper
99
+ └── paper.md
100
+ ```
101
+
102
+ By default, the JSON response from the Mistral AI OCR API is saved in the output directory. To disable JSON output, use the `-n` or `--no-json` argument. To experiment with a different **mode** without making additional API calls, pass a previously saved JSON response with `-j` instead of the original input file.
103
+
104
+ # Usage
105
+
106
+ ## Install the Requirements
107
+
108
+ To install the necessary requirements, run the following command:
109
+
110
+ ```sh
111
+ pip install mistral-ai-ocr
112
+ ```
113
+
114
+ ## Typical Usage
115
+
116
+ ```sh
117
+ mistral-ai-ocr paper.pdf
118
+ mistral-ai-ocr paper.pdf --api-key jrWjJE5lFketfB2sA6vvhQK2SoHQ6R39
119
+ mistral-ai-ocr paper.pdf -o revision
120
+ mistral-ai-ocr paper.pdf -e
121
+ mistral-ai-ocr paper.pdf -m FULL
122
+ mistral-ai-ocr page74.jpg -e
123
+ mistral-ai-ocr -j paper.json
124
+ mistral-ai-ocr -j paper.json -m TEXT_NO_PAGES -n
125
+ ```
126
+
127
+ ## Arguments
128
+
129
+ | Short | Long | Description |
130
+ |-|-|-|
131
+ | `input` | | input PDF or image file (positional) |
132
+ | -k API_KEY | --api-key API_KEY | Mistral API key; can also be set via the **MISTRAL_API_KEY** environment variable |
133
+ | -o OUTPUT | --output OUTPUT | output directory path. If not set, a directory named after the input file's stem (filename without extension) is created next to the input file |
134
+ | -j JSON_OCR_RESPONSE | --json-ocr-response JSON_OCR_RESPONSE | path from which to load a pre-existing JSON OCR response (any input file will be ignored) |
135
+ | -m MODE | --mode MODE | mode of operation: either the name or numerical value of the mode. _Defaults to FULL_NO_PAGES_ |
136
+ | -s PAGE_SEPARATOR | --page-separator PAGE_SEPARATOR | page separator to use when writing the Markdown file. _Defaults to `\n`_ |
137
+ | -n | --no-json | do not write the JSON OCR response to a file. By default, the response is written |
138
+ | -e | --load-dot-env | load the .env file from the current directory using [`python-dotenv`](https://pypi.org/project/python-dotenv/), to retrieve the Mistral API key |
139
+
140
+ ### Mistral AI API Key
141
+
142
+ To obtain an API key, you need a [Mistral AI](https://auth.mistral.ai/ui/registration) account. Then visit [https://admin.mistral.ai/organization/api-keys](https://admin.mistral.ai/organization/api-keys) and click the **Create new key** button
143
+
144
+ To avoid using `-e` to load the `.env` file, you can create one at `$HOME/.mistral_ai_ocr.env` (where `$HOME` is your home directory). It will then be automatically loaded at the start of the script
145
+
146
+ For example, for a user called `vavilov`, the path would look like this:
147
+
148
+ * **Linux**
149
+ ```
150
+ /home/vavilov/.mistral_ai_ocr.env
151
+ ```
152
+
153
+ * **macOS**
154
+ ```
155
+ /Users/vavilov/.mistral_ai_ocr.env
156
+ ```
157
+
158
+ * **Windows**
159
+ ```
160
+ C:\Users\vavilov\.mistral_ai_ocr.env
161
+ ```
162
+
163
+ and the content will be something like this:
164
+
165
+ ```
166
+ MISTRAL_API_KEY=jrWjJE5lFketfB2sA6vvhQK2SoHQ6R39
167
+ ```
@@ -0,0 +1,255 @@
1
+ #!/usr/bin/env python
2
+ import base64
3
+ from typing import List
4
+ from mistralai import Mistral, OCRImageObject
5
+ from mistralai.models import OCRResponse
6
+ from pathlib import Path
7
+ import mimetypes
8
+ from enum import Enum
9
+ import sys
10
+
11
+ class Modes(Enum):
12
+ FULL = 0
13
+ FULL_ALT = 1
14
+ FULL_NO_DIR = 2
15
+ FULL_NO_PAGES = 3
16
+ TEXT = 4
17
+ TEXT_NO_PAGES = 5
18
+
19
+ def get_mode_from_string(mode_str: str):
20
+ for mode in Modes:
21
+ if mode.name == mode_str.upper() or str(mode.value) == mode_str:
22
+ return mode
23
+ raise ValueError(f"Unknown mode: {mode_str}")
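The lookup above accepts either a mode name or its numeric value. Because argparse hands the value over as a string while `Modes` members carry integer values, the numeric branch can only match if the value is converted to a string first. A minimal, self-contained sketch of that lookup follows; the function name `parse_mode` is hypothetical and not part of the package.

```python
from enum import Enum


class Modes(Enum):
    FULL = 0
    FULL_ALT = 1
    FULL_NO_DIR = 2
    FULL_NO_PAGES = 3
    TEXT = 4
    TEXT_NO_PAGES = 5


def parse_mode(mode_str: str) -> Modes:
    """Resolve a mode from its name (case-insensitive) or its numeric value given as a string."""
    text = mode_str.strip()
    for mode in Modes:
        # Compare the numeric value as a string, e.g. "3" == str(3).
        if mode.name == text.upper() or str(mode.value) == text:
            return mode
    raise ValueError(f"Unknown mode: {mode_str}")


print(parse_mode("3"))     # Modes.FULL_NO_PAGES
print(parse_mode("text"))  # Modes.TEXT
```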
24
+
25
+ def b64encode_document(document_path: Path):
26
+ try:
27
+ with open(document_path, "rb") as doc_file:
28
+ return base64.b64encode(doc_file.read()).decode('utf-8')
29
+ except FileNotFoundError:
30
+ return None
31
+ except Exception as e:
32
+ return None
33
+
34
+ def b64decode_document(base64_data: str, output_path: Path):
35
+ if ',' in base64_data:
36
+ _, base64_str = base64_data.split(',', 1)
37
+ else:
38
+ base64_str = base64_data
39
+ try:
40
+ image_data = base64.b64decode(base64_str)
41
+ except (base64.binascii.Error, ValueError) as e:
42
+ print(f"Error decoding base64 data: {e}", file=sys.stderr)
43
+ return
44
+
45
+ output_path.parent.mkdir(parents=True, exist_ok=True)
46
+
47
+ with open(output_path, 'wb') as f:
48
+ f.write(image_data)
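`b64decode_document` accepts either a bare base64 string or a data URI such as `data:image/png;base64,...`, and keeps only the part after the first comma. A small sketch of that splitting step, using the hypothetical helper names `strip_data_uri_prefix` and `decode_image`:

```python
import base64


def strip_data_uri_prefix(data: str) -> str:
    """Return the base64 payload of a data URI, or the input unchanged if it has no prefix."""
    # The base64 alphabet contains no comma, so the first comma ends the header.
    return data.split(",", 1)[1] if "," in data else data


def decode_image(data: str) -> bytes:
    """Decode a bare base64 string or a base64 data URI into raw bytes."""
    return base64.b64decode(strip_data_uri_prefix(data))


payload = base64.b64encode(b"\x89PNG").decode("ascii")
# Both forms decode to the same bytes.
print(decode_image(f"data:image/png;base64,{payload}") == decode_image(payload))
```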
49
+
50
+ class Page:
51
+ def __init__(self, index, markdown=None, images:List[OCRImageObject]=None):
52
+ self.index = index
53
+ self.markdown = markdown
54
+ self.images = images if images is not None else []
55
+
56
+ def write_markdown(self, output_path: Path, append: bool = False, insert = None):
57
+ if self.markdown:
58
+ output_path.parent.mkdir(parents=True, exist_ok=True)
59
+ mode = 'a' if append else 'w'
60
+ with open(output_path, mode, encoding="utf-8") as md_file:
61
+ if insert:
62
+ md_file.write(insert)
63
+ md_file.write(self.markdown)
64
+
65
+ def write_images(self, output_directory: Path):
66
+ if not self.images:
67
+ return
68
+
69
+ for image in self.images:
70
+ if image and image.image_base64:
71
+ image_name = image.id
72
+ image_path = output_directory / image_name
73
+ b64decode_document(image.image_base64, image_path)
74
+
75
+ class MistralOCRDocument:
76
+ def __init__(self,
77
+ document_path: Path,
78
+ api_key: str,
79
+ include_images=True,
80
+ output_directory: Path = None,
81
+ generate_pages=True,
82
+ full_directory_name="full",
83
+ page_separator="\n",
84
+ page_directory_name="page_<index>",
85
+ page_text_name="<stem>.md",
86
+ json_ocr_response_path=None,
87
+ save_json=True,
88
+ ):
89
+ self.document_path = document_path
90
+ self.api_key = api_key
91
+ self.include_images = include_images
92
+ self.generate_pages = generate_pages
93
+ self.save_json = save_json
94
+ self.full_directory_name = full_directory_name
95
+ self.page_separator = page_separator
96
+ self.page_directory_name = page_directory_name
97
+ self.page_text_name = page_text_name
98
+ self.json_ocr_response_path = json_ocr_response_path
99
+ if output_directory is None:
100
+ self.output_directory = self.get_input_path().parent / self.get_input_path().stem
101
+ else:
102
+ self.output_directory = output_directory
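When no output directory is given, the constructor above derives one from the input path: the parent directory of the input joined with its stem. The sketch below illustrates this with a hypothetical helper `default_output_directory`, using `PurePosixPath` so the result does not depend on the host OS.

```python
from pathlib import PurePosixPath


def default_output_directory(input_path: PurePosixPath) -> PurePosixPath:
    """Mirror `input_path.parent / input_path.stem` from the constructor."""
    return input_path.parent / input_path.stem


print(default_output_directory(PurePosixPath("paper.pdf")))              # paper
print(default_output_directory(PurePosixPath("/data/scans/paper.pdf")))  # /data/scans/paper
```

For an input given as a bare filename this resolves to a directory in the current working directory; for an input located elsewhere, the output directory lands next to the input file.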
103
+
104
+ def get_ocr_response(self, mimetype, base64_document):
105
+ client = Mistral(api_key=self.api_key)
106
+ if mimetype.startswith("image/"):
107
+ document_type = "image_url"
108
+ elif mimetype.startswith("application/pdf"):
109
+ document_type = "document_url"
110
+ else:
111
+ raise ValueError(f"Unsupported MIME type: {mimetype}. Only image and PDF files are supported.")
112
+ self.ocr_response = client.ocr.process(
113
+ model="mistral-ocr-latest",
114
+ document={
115
+ "type": document_type,
116
+ document_type: f"data:{mimetype};base64,{base64_document}"
117
+ },
118
+ include_image_base64=self.include_images
119
+ )
120
+
121
+ def process_document(self):
122
+ if not self.document_path.exists():
123
+ raise FileNotFoundError(f"The document {self.document_path} does not exist.")
124
+ if not self.document_path.is_file():
125
+ raise ValueError(f"The path {self.document_path} is not a valid file.")
126
+
127
+ mimetype, _ = mimetypes.guess_type(self.document_path)
128
+ if mimetype is None:
129
+ raise ValueError(f"Could not determine the MIME type for {self.document_path}.")
130
+
131
+ self.get_ocr_response(mimetype, b64encode_document(self.document_path))
132
+ self.write_json()
133
+ self.process_ocr_response()
134
+
135
+ def process_json_response(self):
136
+ if self.json_ocr_response_path is None or not self.json_ocr_response_path.exists():
137
+ raise FileNotFoundError(f"The JSON OCR response {self.json_ocr_response_path} does not exist.")
138
+
139
+ with open(self.json_ocr_response_path, "r", encoding="utf-8") as json_file:
140
+ self.ocr_response = OCRResponse.model_validate_json(json_file.read())
141
+ self.write_json()
142
+ self.process_ocr_response()
143
+
144
+ def process(self):
145
+ if self.json_ocr_response_path is not None:
146
+ self.process_json_response()
147
+ else:
148
+ self.process_document()
149
+
150
+ def get_input_path(self):
151
+ if self.json_ocr_response_path is not None:
152
+ return self.json_ocr_response_path
153
+ return self.document_path
154
+
155
+ def write_json(self):
156
+ if self.save_json:
157
+ output_path = (self.output_directory / self.get_input_path().stem).with_suffix(".json")
158
+ self.output_directory.mkdir(parents=True, exist_ok=True)
159
+ with open(output_path, "w", encoding="utf-8") as text_file:
160
+ text_file.write(self.ocr_response.model_dump_json(indent=2))
161
+
162
+ def process_ocr_response(self):
163
+ response_pages = self.ocr_response.pages
164
+ if not response_pages:
165
+ print("No pages found in the OCR response.")
166
+ return
167
+
168
+ pages = []
169
+
170
+ full_dir = self.output_directory / self.full_directory_name
171
+
172
+ for r_page in response_pages:
173
+ page = Page(
174
+ index=r_page.index,
175
+ markdown=r_page.markdown,
176
+ images=r_page.images
177
+ )
178
+ if self.generate_pages:
179
+ page_dir = self.output_directory / self.page_directory_name.replace("<index>", str(page.index))
180
+ page.write_markdown((
181
+ page_dir / self.page_text_name.
182
+ replace("<stem>", self.get_input_path().stem).
183
+ replace("<index>", str(page.index))
184
+ ).with_suffix(".md"))
185
+ if self.include_images:
186
+ page.write_images(page_dir)
187
+ if self.include_images:
188
+ page.write_images(full_dir)
189
+ pages.append(page)
190
+ for i, page in enumerate(sorted(pages, key=lambda p: p.index)):
191
+ first = i == 0
192
+ md_file = (full_dir / self.get_input_path().stem).with_suffix(".md")
193
+ insert = self.page_separator if not first else None
194
+ page.write_markdown(md_file, append=not first, insert=insert)
195
+
196
+ def construct_from_mode(
197
+ document_path: Path,
198
+ api_key: str,
199
+ output_directory: Path = None,
200
+ json_ocr_response_path: Path = None,
201
+ page_separator: str = "\n",
202
+ write_json: bool = True,
203
+ mode: Modes = Modes.FULL
204
+ ):
205
+ kwargs = dict(
206
+ document_path=document_path,
207
+ api_key=api_key,
208
+ output_directory=output_directory,
209
+ json_ocr_response_path=json_ocr_response_path,
210
+ page_separator=page_separator,
211
+ save_json=write_json
212
+ )
213
+ match mode:
214
+ case Modes.FULL:
215
+ kwargs.update(
216
+ include_images=True,
217
+ generate_pages=True
218
+ )
219
+ case Modes.FULL_ALT:
220
+ kwargs.update(
221
+ include_images=True,
222
+ generate_pages=True,
223
+ full_directory_name="."
224
+ )
225
+ case Modes.FULL_NO_DIR:
226
+ kwargs.update(
227
+ include_images=True,
228
+ generate_pages=True,
229
+ full_directory_name=".",
230
+ page_directory_name=".",
231
+ page_text_name="<stem><index>.md"
232
+ )
233
+ case Modes.FULL_NO_PAGES:
234
+ kwargs.update(
235
+ include_images=True,
236
+ generate_pages=False,
237
+ full_directory_name="."
238
+ )
239
+ case Modes.TEXT:
240
+ kwargs.update(
241
+ include_images=False,
242
+ generate_pages=True,
243
+ full_directory_name=".",
244
+ page_directory_name=".",
245
+ page_text_name="<stem><index>.md"
246
+ )
247
+ case Modes.TEXT_NO_PAGES:
248
+ kwargs.update(
249
+ include_images=False,
250
+ generate_pages=False,
251
+ full_directory_name="."
252
+ )
253
+ case _:
254
+ raise ValueError(f"Unknown mode: {mode}")
255
+ return MistralOCRDocument(**kwargs)
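The mode-to-layout mapping in `construct_from_mode` can be sketched as a small table of settings. The example below lists only the Markdown files a run would produce. It assumes the default output directory (named after the input stem) and zero-based, contiguous page indices, as in the README trees. `LAYOUTS` and `markdown_paths` are hypothetical names introduced for illustration.

```python
from pathlib import PurePosixPath

# (full_directory_name, page_directory_name, page_text_name, generate_pages)
# Values mirror construct_from_mode and the MistralOCRDocument defaults.
LAYOUTS = {
    "FULL":          ("full", "page_<index>", "<stem>.md", True),
    "FULL_ALT":      (".", "page_<index>", "<stem>.md", True),
    "FULL_NO_DIR":   (".", ".", "<stem><index>.md", True),
    "FULL_NO_PAGES": (".", "page_<index>", "<stem>.md", False),
    "TEXT":          (".", ".", "<stem><index>.md", True),
    "TEXT_NO_PAGES": (".", "page_<index>", "<stem>.md", False),
}


def markdown_paths(stem: str, page_count: int, mode: str) -> list[str]:
    """List the Markdown files written for `page_count` pages (images omitted)."""
    full_dir, page_dir, page_name, generate_pages = LAYOUTS[mode]
    root = PurePosixPath(stem)  # assumed default output directory
    # Combined document, written into the "full" directory.
    paths = [(root / full_dir / stem).with_suffix(".md")]
    if generate_pages:
        for index in range(page_count):
            directory = root / page_dir.replace("<index>", str(index))
            name = page_name.replace("<stem>", stem).replace("<index>", str(index))
            paths.append((directory / name).with_suffix(".md"))
    return [str(p) for p in paths]


print(markdown_paths("paper", 3, "FULL_NO_DIR"))
```

A `full_directory_name` or `page_directory_name` of `"."` collapses into the output directory itself, which is how the flat layouts are produced.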
@@ -0,0 +1,74 @@
1
+ #!/usr/bin/env python
2
+ from pathlib import Path
3
+ from . import Modes, construct_from_mode, get_mode_from_string
4
+ import argparse
5
+ import codecs
6
+ from os import getenv
7
+ from dotenv import load_dotenv
8
+
9
+ try:
10
+ load_dotenv(dotenv_path=Path.home() / ".mistral_ai_ocr.env")
11
+ except Exception:
12
+ pass
13
+
14
+ _mode_choices = [mode.name for mode in Modes] + [str(mode.value) for mode in Modes]
15
+
16
+ def _unescape(s: str) -> str:
17
+ return codecs.decode(s, 'unicode_escape')
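`_unescape` lets the page separator be written on the command line with backslash escapes, since the shell passes `\n` as two literal characters rather than a newline. A short sketch of the effect; the `join_pages` helper is hypothetical and only stands in for the way pages are concatenated.

```python
import codecs


def unescape(text: str) -> str:
    """Turn backslash escapes typed on the command line (e.g. \\n) into real characters."""
    return codecs.decode(text, "unicode_escape")


def join_pages(pages: list[str], separator: str) -> str:
    """Join per-page Markdown with an already-unescaped separator."""
    return separator.join(pages)


raw = r"\n\n---\n\n"  # what the program receives for: -s '\n\n---\n\n'
print(join_pages(["# Page 1", "# Page 2"], unescape(raw)))
```

Note that the `unicode_escape` codec is designed for ASCII-style escape sequences, so non-ASCII characters in a separator may not survive this round trip unchanged.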
18
+
19
+ def main():
20
+ example_text = (
21
+ 'examples:\n\n'
22
+ '%(prog)s paper.pdf\n'
23
+ '%(prog)s paper.pdf --api-key jrWjJE5lFketfB2sA6vvhQK2SoHQ6R39\n'
24
+ '%(prog)s paper.pdf -o revision\n'
25
+ '%(prog)s paper.pdf -e\n'
26
+ '%(prog)s paper.pdf -m FULL\n'
27
+ '%(prog)s -j paper.json\n'
28
+ '%(prog)s -j paper.json -m TEXT_NO_PAGES -n\n'
29
+ )
30
+ parser = argparse.ArgumentParser(
31
+ description="A simple script that uses the Mistral AI OCR API to get the Markdown text from a PDF or image file.",
32
+ epilog=example_text,
33
+ formatter_class=argparse.RawDescriptionHelpFormatter
34
+ )
35
+
36
+ parser.add_argument("input", type=Path, nargs="?", help="input PDF or image file", default=None)
37
+ parser.add_argument("-k", "--api-key", help="Mistral API key, can be set via the MISTRAL_API_KEY environment variable", default=None)
38
+ parser.add_argument("-o", "--output", type=Path, help="output directory path. If not set, a directory named after the input file's stem (filename without extension) is created next to the input file", default=None)
39
+ parser.add_argument("-j", "--json-ocr-response", type=Path, help="path from which to load a pre-existing JSON OCR response (any input file will be ignored)", default=None)
40
+ parser.add_argument("-m", "--mode", type=str, choices=_mode_choices, default="FULL_NO_PAGES",
41
+ help="mode of operation: either the name or numerical value of the mode. Defaults to FULL_NO_PAGES")
42
+ parser.add_argument("-s", "--page-separator", type=str, default="\n",
43
+ help="page separator to use when writing the Markdown file. Defaults to '\\n'")
44
+ parser.add_argument("-n", "--no-json", action="store_false", dest="write_json",
45
+ help="do not write the JSON OCR response to a file. By default, the response is written")
46
+ parser.add_argument("-e", "--load-dot-env", action="store_true",
47
+ help="load the .env file from the current directory using python-dotenv, to retrieve the Mistral API key")
48
+ args = parser.parse_args()
49
+
50
+ if args.load_dot_env:
51
+ load_dotenv()
52
+
53
+ if args.api_key is None:
54
+ args.api_key = getenv("MISTRAL_API_KEY")
55
+ if args.api_key is None:
56
+ parser.error("API key is required. Set it with --api-key, via the MISTRAL_API_KEY environment variable, or load it from a .env file with -e/--load-dot-env")
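The key lookup above boils down to: use `--api-key` if given, otherwise read `MISTRAL_API_KEY` from the environment. That environment may have been populated from `$HOME/.mistral_ai_ocr.env` and, with `-e`, from a `.env` file in the current directory. Assuming python-dotenv's default of not overriding variables that are already set, a real environment variable would win over both files. A minimal sketch with a hypothetical `resolve_api_key` helper:

```python
from typing import Mapping, Optional


def resolve_api_key(cli_key: Optional[str], environ: Mapping[str, str]) -> Optional[str]:
    """Prefer the --api-key value; otherwise fall back to MISTRAL_API_KEY from the environment."""
    if cli_key is not None:
        return cli_key
    return environ.get("MISTRAL_API_KEY")


# Placeholder values, not real keys.
print(resolve_api_key("from-cli", {"MISTRAL_API_KEY": "from-env"}))  # from-cli
print(resolve_api_key(None, {"MISTRAL_API_KEY": "from-env"}))        # from-env
print(resolve_api_key(None, {}))                                     # None
```

In the script itself, the equivalent call would pass `os.environ` as the mapping, and a `None` result corresponds to the `parser.error` branch.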
57
+
58
+ try:
59
+ construct_from_mode(
60
+ document_path=args.input,
61
+ api_key=args.api_key,
62
+ output_directory=args.output,
63
+ json_ocr_response_path=args.json_ocr_response,
64
+ page_separator=_unescape(args.page_separator),
65
+ write_json=args.write_json,
66
+ mode=get_mode_from_string(args.mode)
67
+ ).process()
68
+ except FileNotFoundError as e:
69
+ parser.error(e)
70
+ except ValueError as e:
71
+ parser.error(e)
72
+
73
+ if __name__ == "__main__":
74
+ main()
@@ -0,0 +1,172 @@
1
+ Metadata-Version: 2.1
2
+ Name: mistral-ai-ocr
3
+ Version: 1.0
4
+ Description-Content-Type: text/markdown
5
+
6
+ # Mistral AI OCR
7
+ This is a simple script that uses the Mistral AI OCR API to extract text from a PDF or image file
8
+
9
+ ## Modes
10
+
11
+ | Value | Name |
12
+ |-|-|
13
+ | 0 | FULL |
14
+ | 1 | FULL_ALT |
15
+ | 2 | FULL_NO_DIR |
16
+ | 3 | FULL_NO_PAGES |
17
+ | 4 | TEXT |
18
+ | 5 | TEXT_NO_PAGES |
19
+
20
+ Given the input file `paper.pdf`, the directory structure for each mode is shown below:
21
+
22
+ ### 0 - `FULL`
23
+
24
+ Structure
25
+ ```
26
+ paper
27
+ ├── full
28
+ │ ├── image1.png
29
+ │ ├── image2.png
30
+ │ ├── image3.png
31
+ │ └── paper.md
32
+ ├── page_0
33
+ │ ├── image1.png
34
+ │ └── paper.md
35
+ ├── page_1
36
+ │ ├── image2.png
37
+ │ └── paper.md
38
+ └── page_2
39
+ ├── image3.png
40
+ └── paper.md
41
+ ```
42
+
43
+ ### 1 - `FULL_ALT`
44
+
45
+ Structure
46
+ ```
47
+ paper
48
+ ├── image1.png
49
+ ├── image2.png
50
+ ├── image3.png
51
+ ├── paper.md
52
+ ├── page_0
53
+ │ ├── image1.png
54
+ │ └── paper.md
55
+ ├── page_1
56
+ │ ├── image2.png
57
+ │ └── paper.md
58
+ └── page_2
59
+ ├── image3.png
60
+ └── paper.md
61
+ ```
62
+
63
+ ### 2 - `FULL_NO_DIR`
64
+
65
+ Structure
66
+ ```
67
+ paper
68
+ ├── image1.png
69
+ ├── image2.png
70
+ ├── image3.png
71
+ ├── paper.md
72
+ ├── paper0.md
73
+ ├── paper1.md
74
+ └── paper2.md
75
+ ```
76
+
77
+ ### 3 - `FULL_NO_PAGES` *default*
78
+
79
+ Structure
80
+ ```
81
+ paper
82
+ ├── image1.png
83
+ ├── image2.png
84
+ ├── image3.png
85
+ └── paper.md
86
+ ```
87
+
88
+ ### 4 - `TEXT`
89
+
90
+ Structure
91
+ ```
92
+ paper
93
+ ├── paper.md
94
+ ├── paper0.md
95
+ ├── paper1.md
96
+ └── paper2.md
97
+ ```
98
+
99
+ ### 5 - `TEXT_NO_PAGES`
100
+
101
+ Structure
102
+ ```
103
+ paper
104
+ └── paper.md
105
+ ```
106
+
107
+ By default, the JSON response from the Mistral AI OCR API is saved in the output directory. To disable JSON output, use the `-n` or `--no-json` argument. To experiment with a different **mode** without making additional API calls, pass a previously saved JSON response with `-j` instead of the original input file.
108
+
109
+ # Usage
110
+
111
+ ## Install the Requirements
112
+
113
+ To install the necessary requirements, run the following command:
114
+
115
+ ```sh
116
+ pip install mistral-ai-ocr
117
+ ```
118
+
119
+ ## Typical Usage
120
+
121
+ ```sh
122
+ mistral-ai-ocr paper.pdf
123
+ mistral-ai-ocr paper.pdf --api-key jrWjJE5lFketfB2sA6vvhQK2SoHQ6R39
124
+ mistral-ai-ocr paper.pdf -o revision
125
+ mistral-ai-ocr paper.pdf -e
126
+ mistral-ai-ocr paper.pdf -m FULL
127
+ mistral-ai-ocr page74.jpg -e
128
+ mistral-ai-ocr -j paper.json
129
+ mistral-ai-ocr -j paper.json -m TEXT_NO_PAGES -n
130
+ ```
131
+
132
+ ## Arguments
133
+
134
+ | Short | Long | Description |
135
+ |-|-|-|
136
+ | `input` | | input PDF or image file (positional) |
137
+ | -k API_KEY | --api-key API_KEY | Mistral API key; can also be set via the **MISTRAL_API_KEY** environment variable |
138
+ | -o OUTPUT | --output OUTPUT | output directory path. If not set, a directory named after the input file's stem (filename without extension) is created next to the input file |
139
+ | -j JSON_OCR_RESPONSE | --json-ocr-response JSON_OCR_RESPONSE | path from which to load a pre-existing JSON OCR response (any input file will be ignored) |
140
+ | -m MODE | --mode MODE | mode of operation: either the name or numerical value of the mode. _Defaults to FULL_NO_PAGES_ |
141
+ | -s PAGE_SEPARATOR | --page-separator PAGE_SEPARATOR | page separator to use when writing the Markdown file. _Defaults to `\n`_ |
142
+ | -n | --no-json | do not write the JSON OCR response to a file. By default, the response is written |
143
+ | -e | --load-dot-env | load the .env file from the current directory using [`python-dotenv`](https://pypi.org/project/python-dotenv/), to retrieve the Mistral API key |
144
+
145
+ ### Mistral AI API Key
146
+
147
+ To obtain an API key, you need a [Mistral AI](https://auth.mistral.ai/ui/registration) account. Then visit [https://admin.mistral.ai/organization/api-keys](https://admin.mistral.ai/organization/api-keys) and click the **Create new key** button
148
+
149
+ To avoid using `-e` to load the `.env` file, you can create one at `$HOME/.mistral_ai_ocr.env` (where `$HOME` is your home directory). It will then be automatically loaded at the start of the script
150
+
151
+ For example, for a user called `vavilov`, the path would look like this:
152
+
153
+ * **Linux**
154
+ ```
155
+ /home/vavilov/.mistral_ai_ocr.env
156
+ ```
157
+
158
+ * **macOS**
159
+ ```
160
+ /Users/vavilov/.mistral_ai_ocr.env
161
+ ```
162
+
163
+ * **Windows**
164
+ ```
165
+ C:\Users\vavilov\.mistral_ai_ocr.env
166
+ ```
167
+
168
+ and the content will be something like this:
169
+
170
+ ```
171
+ MISTRAL_API_KEY=jrWjJE5lFketfB2sA6vvhQK2SoHQ6R39
172
+ ```
@@ -0,0 +1,10 @@
1
+ README.md
2
+ setup.py
3
+ mistral_ai_ocr/__init__.py
4
+ mistral_ai_ocr/__main__.py
5
+ mistral_ai_ocr.egg-info/PKG-INFO
6
+ mistral_ai_ocr.egg-info/SOURCES.txt
7
+ mistral_ai_ocr.egg-info/dependency_links.txt
8
+ mistral_ai_ocr.egg-info/entry_points.txt
9
+ mistral_ai_ocr.egg-info/requires.txt
10
+ mistral_ai_ocr.egg-info/top_level.txt
@@ -0,0 +1,2 @@
1
+ [console_scripts]
2
+ mistral-ai-ocr = mistral_ai_ocr.__main__:main
@@ -0,0 +1,2 @@
1
+ mistralai
2
+ python-dotenv
@@ -0,0 +1 @@
1
+ mistral_ai_ocr
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
@@ -0,0 +1,22 @@
1
+ #!/usr/bin/env python
2
+ from setuptools import setup, find_packages
3
+
4
+ with open("README.md", "r", encoding="utf-8") as fh:
5
+ long_description = fh.read()
6
+
7
+ setup(
8
+ name="mistral-ai-ocr",
9
+ version="1.0",
10
+ packages=find_packages(),
11
+ entry_points={
12
+ 'console_scripts': [
13
+ 'mistral-ai-ocr = mistral_ai_ocr.__main__:main',
14
+ ],
15
+ },
16
+ install_requires=[
17
+ 'mistralai',
18
+ 'python-dotenv'
19
+ ],
20
+ long_description=long_description,
21
+ long_description_content_type="text/markdown"
22
+ )