mistral-ai-ocr-1.0.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- mistral-ai-ocr-1.0/PKG-INFO +172 -0
- mistral-ai-ocr-1.0/README.md +167 -0
- mistral-ai-ocr-1.0/mistral_ai_ocr/__init__.py +255 -0
- mistral-ai-ocr-1.0/mistral_ai_ocr/__main__.py +74 -0
- mistral-ai-ocr-1.0/mistral_ai_ocr.egg-info/PKG-INFO +172 -0
- mistral-ai-ocr-1.0/mistral_ai_ocr.egg-info/SOURCES.txt +10 -0
- mistral-ai-ocr-1.0/mistral_ai_ocr.egg-info/dependency_links.txt +1 -0
- mistral-ai-ocr-1.0/mistral_ai_ocr.egg-info/entry_points.txt +2 -0
- mistral-ai-ocr-1.0/mistral_ai_ocr.egg-info/requires.txt +2 -0
- mistral-ai-ocr-1.0/mistral_ai_ocr.egg-info/top_level.txt +1 -0
- mistral-ai-ocr-1.0/setup.cfg +4 -0
- mistral-ai-ocr-1.0/setup.py +22 -0
@@ -0,0 +1,172 @@ mistral-ai-ocr-1.0/PKG-INFO

Metadata-Version: 2.1
Name: mistral-ai-ocr
Version: 1.0
Description-Content-Type: text/markdown

# Mistral AI OCR
This is a simple script that uses the Mistral AI OCR API to extract text from a PDF or image file.

## Modes

| Value | Name |
|-|-|
| 0 | FULL |
| 1 | FULL_ALT |
| 2 | FULL_NO_DIR |
| 3 | FULL_NO_PAGES |
| 4 | TEXT |
| 5 | TEXT_NO_PAGES |

Given the input file `paper.pdf`, the directory structure for each mode is shown below:

### 0 - `FULL`

Structure
```
paper
├── full
│   ├── image1.png
│   ├── image2.png
│   ├── image3.png
│   └── paper.md
├── page_0
│   ├── image1.png
│   └── paper.md
├── page_1
│   ├── image2.png
│   └── paper.md
└── page_2
    ├── image3.png
    └── paper.md
```

### 1 - `FULL_ALT`

Structure
```
paper
├── image1.png
├── image2.png
├── image3.png
├── paper.md
├── page_0
│   ├── image1.png
│   └── paper.md
├── page_1
│   ├── image2.png
│   └── paper.md
└── page_2
    ├── image3.png
    └── paper.md
```

### 2 - `FULL_NO_DIR`

Structure
```
paper
├── image1.png
├── image2.png
├── image3.png
├── paper.md
├── paper0.md
├── paper1.md
└── paper2.md
```

### 3 - `FULL_NO_PAGES` *default*

Structure
```
paper
├── image1.png
├── image2.png
├── image3.png
└── paper.md
```

### 4 - `TEXT`

Structure
```
paper
├── paper.md
├── paper0.md
├── paper1.md
└── paper2.md
```

### 5 - `TEXT_NO_PAGES`

Structure
```
paper
└── paper.md
```

By default, the JSON response from the Mistral AI OCR API is saved in the output directory. To disable JSON output, use the `-n` or `--no-json` argument. To experiment with a different **mode** without spending additional API calls, reuse an existing JSON response (via `-j`/`--json-ocr-response`) instead of the original input file.

# Usage

## Installation

To install the package and its dependencies, run the following command:

```sh
pip install mistral-ai-ocr
```

## Typical Usage

```sh
mistral-ai-ocr paper.pdf
mistral-ai-ocr paper.pdf --api-key jrWjJE5lFketfB2sA6vvhQK2SoHQ6R39
mistral-ai-ocr paper.pdf -o revision
mistral-ai-ocr paper.pdf -e
mistral-ai-ocr paper.pdf -m FULL
mistral-ai-ocr page74.jpg -e
mistral-ai-ocr -j paper.json
mistral-ai-ocr -j paper.json -m TEXT_NO_PAGES -n
```

## Arguments

| Short | Long | Description |
|-|-|-|
| | | input PDF or image file (positional) |
| -k API_KEY | --api-key API_KEY | Mistral API key; can also be set via the **MISTRAL_API_KEY** environment variable |
| -o OUTPUT | --output OUTPUT | output directory path. If not set, a directory will be created in the current working directory using the same stem (filename without extension) as the input file |
| -j JSON_OCR_RESPONSE | --json-ocr-response JSON_OCR_RESPONSE | path from which to load a pre-existing JSON OCR response (any input file will be ignored) |
| -m MODE | --mode MODE | mode of operation: either the name or numerical value of the mode. _Defaults to FULL_NO_PAGES_ |
| -s PAGE_SEPARATOR | --page-separator PAGE_SEPARATOR | page separator to use when writing the Markdown file. _Defaults to `\n`_ |
| -n | --no-json | do not write the JSON OCR response to a file. By default, the response is written |
| -e | --load-dot-env | load the .env file from the current directory using [`python-dotenv`](https://pypi.org/project/python-dotenv/), to retrieve the Mistral API key |

### Mistral AI API Key

To obtain an API key, you need a [Mistral AI](https://auth.mistral.ai/ui/registration) account. Then visit [https://admin.mistral.ai/organization/api-keys](https://admin.mistral.ai/organization/api-keys) and click the **Create new key** button.

To avoid using `-e` to load the `.env` file, you can create one at `$HOME/.mistral_ai_ocr.env` (where `$HOME` is your home directory). It will then be loaded automatically when the script starts.

For example, for a user called `vavilov`, the path would look like this:

* **Linux**
  ```
  /home/vavilov/.mistral_ai_ocr.env
  ```

* **macOS**
  ```
  /Users/vavilov/.mistral_ai_ocr.env
  ```

* **Windows**
  ```
  C:\Users\vavilov\.mistral_ai_ocr.env
  ```

and its content would look something like this:

```
MISTRAL_API_KEY=jrWjJE5lFketfB2sA6vvhQK2SoHQ6R39
```
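As a rough sketch of the key lookup described above (the same calls appear in `mistral_ai_ocr/__main__.py` further down), this is how the per-user env file and the `MISTRAL_API_KEY` variable combine; the variable name and the `~/.mistral_ai_ocr.env` path are the ones this package uses:

```python
from os import getenv
from pathlib import Path
from dotenv import load_dotenv

# Load the per-user file if it exists (silently skipped otherwise),
# then read the key from the environment.
load_dotenv(dotenv_path=Path.home() / ".mistral_ai_ocr.env")
api_key = getenv("MISTRAL_API_KEY")
if api_key is None:
    raise SystemExit("MISTRAL_API_KEY is not set")
```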
@@ -0,0 +1,167 @@ mistral-ai-ocr-1.0/README.md

(The README.md content is identical to the Markdown body of the PKG-INFO above, i.e. everything after the four metadata header lines.)
@@ -0,0 +1,255 @@ mistral-ai-ocr-1.0/mistral_ai_ocr/__init__.py

#!/usr/bin/env python
import base64
from typing import List
from mistralai import Mistral, OCRImageObject
from mistralai.models import OCRResponse
from pathlib import Path
import mimetypes
from enum import Enum
import sys

class Modes(Enum):
    FULL = 0
    FULL_ALT = 1
    FULL_NO_DIR = 2
    FULL_NO_PAGES = 3
    TEXT = 4
    TEXT_NO_PAGES = 5

def get_mode_from_string(mode_str: str):
    # Accept either the mode name (case-insensitive) or its numeric value;
    # argparse passes every choice as a string, so compare against str(value).
    for mode in Modes:
        if mode.name == mode_str.upper() or str(mode.value) == mode_str:
            return mode
    raise ValueError(f"Unknown mode: {mode_str}")

def b64encode_document(document_path: Path):
    try:
        with open(document_path, "rb") as doc_file:
            return base64.b64encode(doc_file.read()).decode('utf-8')
    except FileNotFoundError:
        return None
    except Exception:
        return None

def b64decode_document(base64_data: str, output_path: Path):
    # Strip a possible "data:<mime>;base64," prefix before decoding.
    if ',' in base64_data:
        _, base64_str = base64_data.split(',', 1)
    else:
        base64_str = base64_data
    try:
        image_data = base64.b64decode(base64_str)
    except (base64.binascii.Error, ValueError) as e:
        print(f"Error decoding base64 data: {e}", file=sys.stderr)
        return

    output_path.parent.mkdir(parents=True, exist_ok=True)

    with open(output_path, 'wb') as f:
        f.write(image_data)

class Page:
    def __init__(self, index, markdown=None, images: List[OCRImageObject] = None):
        self.index = index
        self.markdown = markdown
        self.images = images if images is not None else []

    def write_markdown(self, output_path: Path, append: bool = False, insert=None):
        if self.markdown:
            output_path.parent.mkdir(parents=True, exist_ok=True)
            mode = 'a' if append else 'w'
            with open(output_path, mode) as md_file:
                if insert:
                    md_file.write(insert)
                md_file.write(self.markdown)

    def write_images(self, output_directory: Path):
        if not self.images:
            return

        for image in self.images:
            if image and image.image_base64:
                image_name = image.id
                image_path = output_directory / image_name
                b64decode_document(image.image_base64, image_path)

class MistralOCRDocument:
    def __init__(self,
                 document_path: Path,
                 api_key: str,
                 include_images=True,
                 output_directory: Path = None,
                 generate_pages=True,
                 full_directory_name="full",
                 page_separator="\n",
                 page_directory_name="page_<index>",
                 page_text_name="<stem>.md",
                 json_ocr_response_path=None,
                 save_json=True,
                 ):
        self.document_path = document_path
        self.api_key = api_key
        self.include_images = include_images
        self.generate_pages = generate_pages
        self.save_json = save_json
        self.full_directory_name = full_directory_name
        self.page_separator = page_separator
        self.page_directory_name = page_directory_name
        self.page_text_name = page_text_name
        self.json_ocr_response_path = json_ocr_response_path
        if output_directory is None:
            self.output_directory = self.get_input_path().parent / self.get_input_path().stem
        else:
            self.output_directory = output_directory

    def get_ocr_response(self, mimetype, base64_document):
        client = Mistral(api_key=self.api_key)
        if mimetype.startswith("image/"):
            document_type = "image_url"
        elif mimetype.startswith("application/pdf"):
            document_type = "document_url"
        else:
            raise ValueError(f"Unsupported MIME type: {mimetype}. Only image and PDF files are supported.")
        self.ocr_response = client.ocr.process(
            model="mistral-ocr-latest",
            document={
                "type": document_type,
                document_type: f"data:{mimetype};base64,{base64_document}"
            },
            include_image_base64=self.include_images
        )

    def process_document(self):
        if not self.document_path.exists():
            raise FileNotFoundError(f"The document {self.document_path} does not exist.")
        if not self.document_path.is_file():
            raise ValueError(f"The path {self.document_path} is not a valid file.")

        mimetype, _ = mimetypes.guess_type(self.document_path)
        if mimetype is None:
            raise ValueError(f"Could not determine the MIME type for {self.document_path}.")

        self.get_ocr_response(mimetype, b64encode_document(self.document_path))
        self.write_json()
        self.process_ocr_response()

    def process_json_response(self):
        if self.json_ocr_response_path is None or not self.json_ocr_response_path.exists():
            raise FileNotFoundError(f"The JSON OCR response {self.json_ocr_response_path} does not exist.")

        with open(self.json_ocr_response_path, "r") as json_file:
            self.ocr_response = OCRResponse.model_validate_json(json_file.read())
        self.write_json()
        self.process_ocr_response()

    def process(self):
        if self.json_ocr_response_path is not None:
            self.process_json_response()
        else:
            self.process_document()

    def get_input_path(self):
        if self.json_ocr_response_path is not None:
            return self.json_ocr_response_path
        return self.document_path

    def write_json(self):
        if self.save_json:
            output_path = (self.output_directory / self.get_input_path().stem).with_suffix(".json")
            self.output_directory.mkdir(parents=True, exist_ok=True)
            with open(output_path, "w") as text_file:
                text_file.write(self.ocr_response.model_dump_json(indent=2))

    def process_ocr_response(self):
        response_pages = self.ocr_response.pages
        if not response_pages:
            print("No pages found in the OCR response.")
            return

        pages = []

        full_dir = self.output_directory / self.full_directory_name

        for r_page in response_pages:
            page = Page(
                index=r_page.index,
                markdown=r_page.markdown,
                images=r_page.images
            )
            if self.generate_pages:
                page_dir = self.output_directory / self.page_directory_name.replace("<index>", str(page.index))
                page.write_markdown((
                    page_dir / self.page_text_name.
                    replace("<stem>", self.get_input_path().stem).
                    replace("<index>", str(page.index))
                ).with_suffix(".md"))
                if self.include_images:
                    page.write_images(page_dir)
            if self.include_images:
                page.write_images(full_dir)
            pages.append(page)
        # Concatenate all pages (in index order) into a single Markdown file,
        # inserting the configured page separator between consecutive pages.
        for i, page in enumerate(sorted(pages, key=lambda p: p.index)):
            first = i == 0
            md_file = (full_dir / self.get_input_path().stem).with_suffix(".md")
            insert = self.page_separator if not first else None
            page.write_markdown(md_file, append=not first, insert=insert)

def construct_from_mode(
        document_path: Path,
        api_key: str,
        output_directory: Path = None,
        json_ocr_response_path: Path = None,
        page_separator: str = "\n",
        write_json: bool = True,
        mode: Modes = Modes.FULL
):
    kwargs = dict(
        document_path=document_path,
        api_key=api_key,
        output_directory=output_directory,
        json_ocr_response_path=json_ocr_response_path,
        page_separator=page_separator,
        save_json=write_json
    )
    match mode:
        case Modes.FULL:
            kwargs.update(
                include_images=True,
                generate_pages=True
            )
        case Modes.FULL_ALT:
            kwargs.update(
                include_images=True,
                generate_pages=True,
                full_directory_name="."
            )
        case Modes.FULL_NO_DIR:
            kwargs.update(
                include_images=True,
                generate_pages=True,
                full_directory_name=".",
                page_directory_name=".",
                page_text_name="<stem><index>.md"
            )
        case Modes.FULL_NO_PAGES:
            kwargs.update(
                include_images=True,
                generate_pages=False,
                full_directory_name="."
            )
        case Modes.TEXT:
            kwargs.update(
                include_images=False,
                generate_pages=True,
                full_directory_name=".",
                page_directory_name=".",
                page_text_name="<stem><index>.md"
            )
        case Modes.TEXT_NO_PAGES:
            kwargs.update(
                include_images=False,
                generate_pages=False,
                full_directory_name="."
            )
        case _:
            raise ValueError(f"Unknown mode: {mode}")
    return MistralOCRDocument(**kwargs)
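For reference, a minimal sketch of driving this module programmatically instead of through the CLI; it mirrors what `__main__.py` below does. The input file name and the API key are placeholders, and a network call to the Mistral OCR API is made when `process()` runs:

```python
from pathlib import Path
from mistral_ai_ocr import Modes, construct_from_mode

# Build a document handler for the default mode and run the whole pipeline:
# OCR request, JSON dump, Markdown and image extraction.
doc = construct_from_mode(
    document_path=Path("paper.pdf"),   # placeholder input file
    api_key="YOUR_MISTRAL_API_KEY",    # placeholder key
    mode=Modes.FULL_NO_PAGES,
)
doc.process()
```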
@@ -0,0 +1,74 @@ mistral-ai-ocr-1.0/mistral_ai_ocr/__main__.py

#!/usr/bin/env python
from pathlib import Path
from . import Modes, construct_from_mode, get_mode_from_string
import argparse
import codecs
from os import getenv
from dotenv import load_dotenv

# Load the optional per-user configuration file ($HOME/.mistral_ai_ocr.env) at import time.
try:
    load_dotenv(dotenv_path=Path.home() / ".mistral_ai_ocr.env")
except Exception:
    pass

_mode_choices = [mode.name for mode in Modes] + [str(mode.value) for mode in Modes]

def _unescape(s: str) -> str:
    # Turn literal escape sequences typed on the command line (e.g. "\n") into real characters.
    return codecs.decode(s, 'unicode_escape')

def main():
    example_text = (
        'examples:\n\n'
        '%(prog)s paper.pdf\n'
        '%(prog)s paper.pdf --api-key jrWjJE5lFketfB2sA6vvhQK2SoHQ6R39\n'
        '%(prog)s paper.pdf -o revision\n'
        '%(prog)s paper.pdf -e\n'
        '%(prog)s paper.pdf -m FULL\n'
        '%(prog)s -j paper.json\n'
        '%(prog)s -j paper.json -m TEXT_NO_PAGES -n\n'
    )
    parser = argparse.ArgumentParser(
        description="A simple script that uses the Mistral AI OCR API to get the Markdown text from a PDF or image file.",
        epilog=example_text,
        formatter_class=argparse.RawDescriptionHelpFormatter
    )

    parser.add_argument("input", type=Path, nargs="?", help="input PDF or image file", default=None)
    parser.add_argument("-k", "--api-key", help="Mistral API key, can be set via the MISTRAL_API_KEY environment variable", default=None)
    parser.add_argument("-o", "--output", type=Path, help="output directory path. If not set, a directory will be created in the current working directory using the same stem (filename without extension) as the input file", default=None)
    parser.add_argument("-j", "--json-ocr-response", type=Path, help="path from which to load a pre-existing JSON OCR response (any input file will be ignored)", default=None)
    parser.add_argument("-m", "--mode", type=str, choices=_mode_choices, default="FULL_NO_PAGES",
                        help="mode of operation: either the name or numerical value of the mode. Defaults to FULL_NO_PAGES")
    parser.add_argument("-s", "--page-separator", type=str, default="\n",
                        help="page separator to use when writing the Markdown file. Defaults to '\\n'")
    parser.add_argument("-n", "--no-json", action="store_false", dest="write_json",
                        help="do not write the JSON OCR response to a file. By default, the response is written")
    parser.add_argument("-e", "--load-dot-env", action="store_true",
                        help="load the .env file from the current directory using python-dotenv, to retrieve the Mistral API key")
    args = parser.parse_args()

    if args.load_dot_env:
        load_dotenv()

    if args.api_key is None:
        args.api_key = getenv("MISTRAL_API_KEY")
    if args.api_key is None:
        parser.error("API key is required. Set it with --api-key, via the MISTRAL_API_KEY environment variable, or load it from a .env file with -e/--load-dot-env")

    try:
        construct_from_mode(
            document_path=args.input,
            api_key=args.api_key,
            output_directory=args.output,
            json_ocr_response_path=args.json_ocr_response,
            page_separator=_unescape(args.page_separator),
            write_json=args.write_json,
            mode=get_mode_from_string(args.mode)
        ).process()
    except FileNotFoundError as e:
        parser.error(e)
    except ValueError as e:
        parser.error(e)

if __name__ == "__main__":
    main()
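One detail worth calling out from the parser above: the `-s/--page-separator` value is passed through `_unescape`, so escape sequences typed literally on the command line become real characters before the separator is written between pages. A small illustration (the `---` separator value is only an example):

```python
import codecs

# A user runs:  mistral-ai-ocr paper.pdf -s "\n---\n"
# The shell passes the backslashes through literally; unicode_escape decoding
# turns them into real newlines around the "---" divider.
raw = r"\n---\n"
sep = codecs.decode(raw, "unicode_escape")
assert sep == "\n---\n"
```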
@@ -0,0 +1,172 @@ mistral-ai-ocr-1.0/mistral_ai_ocr.egg-info/PKG-INFO

(Verbatim duplicate of mistral-ai-ocr-1.0/PKG-INFO shown above.)
@@ -0,0 +1,10 @@ mistral-ai-ocr-1.0/mistral_ai_ocr.egg-info/SOURCES.txt

README.md
setup.py
mistral_ai_ocr/__init__.py
mistral_ai_ocr/__main__.py
mistral_ai_ocr.egg-info/PKG-INFO
mistral_ai_ocr.egg-info/SOURCES.txt
mistral_ai_ocr.egg-info/dependency_links.txt
mistral_ai_ocr.egg-info/entry_points.txt
mistral_ai_ocr.egg-info/requires.txt
mistral_ai_ocr.egg-info/top_level.txt
@@ -0,0 +1 @@ mistral-ai-ocr-1.0/mistral_ai_ocr.egg-info/dependency_links.txt

(single empty line)
@@ -0,0 +1 @@ mistral-ai-ocr-1.0/mistral_ai_ocr.egg-info/top_level.txt

mistral_ai_ocr
@@ -0,0 +1,22 @@ mistral-ai-ocr-1.0/setup.py

#!/usr/bin/env python
from setuptools import setup, find_packages

with open("README.md", "r", encoding="utf-8") as fh:
    long_description = fh.read()

setup(
    name="mistral-ai-ocr",
    version="1.0",
    packages=find_packages(),
    entry_points={
        'console_scripts': [
            'mistral-ai-ocr = mistral_ai_ocr.__main__:main',
        ],
    },
    install_requires=[
        'mistralai',
        'python-dotenv'
    ],
    long_description=long_description,
    long_description_content_type="text/markdown"
)
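Because setup.py maps the `mistral-ai-ocr` console script to `mistral_ai_ocr.__main__:main`, the installed command and a direct call to `main()` behave the same. A sketch of the latter, assuming the package is installed, an API key is configured, and a `paper.pdf` exists in the working directory:

```python
import sys
from mistral_ai_ocr.__main__ import main

# Equivalent to running: mistral-ai-ocr paper.pdf -m TEXT_NO_PAGES
sys.argv = ["mistral-ai-ocr", "paper.pdf", "-m", "TEXT_NO_PAGES"]
main()
```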