docreader-ocr 0.2.3__tar.gz → 0.2.5__tar.gz
- docreader_ocr-0.2.5/PKG-INFO +69 -0
- docreader_ocr-0.2.5/README.md +292 -0
- docreader_ocr-0.2.5/README_pypi.md +36 -0
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/pyproject.toml +3 -2
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/src/docreader/config.py +19 -0
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/src/docreader/factory.py +39 -0
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/src/docreader/hub.py +9 -1
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/src/docreader/pipeline.py +92 -30
- docreader_ocr-0.2.5/src/docreader/resolver/__init__.py +4 -0
- docreader_ocr-0.2.5/src/docreader/resolver/base.py +40 -0
- docreader_ocr-0.2.5/src/docreader/resolver/lvl_resolver.py +195 -0
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/src/docreader/schemas.py +5 -1
- docreader_ocr-0.2.3/PKG-INFO +0 -33
- docreader_ocr-0.2.3/README.md +0 -1
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/.github/workflows/publish.yaml +0 -0
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/.gitignore +0 -0
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/LICENSE +0 -0
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/src/docreader/__init__.py +0 -0
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/src/docreader/classifier/__init__.py +0 -0
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/src/docreader/classifier/base.py +0 -0
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/src/docreader/classifier/yolo_classifier.py +0 -0
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/src/docreader/detector/__init__.py +0 -0
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/src/docreader/detector/base.py +0 -0
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/src/docreader/detector/yolo_obb.py +0 -0
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/src/docreader/ocr/__init__.py +0 -0
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/src/docreader/ocr/base.py +0 -0
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/src/docreader/ocr/easyocr_engine.py +0 -0
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/src/docreader/preprocessing/__init__.py +0 -0
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/src/docreader/preprocessing/geometry.py +0 -0
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/src/docreader/utils.py +0 -0
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/tests/test_hub.py +0 -0
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/tests/test_pipeline.py +0 -0
- {docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/tests/test_run.py +0 -0
docreader_ocr-0.2.5/PKG-INFO
@@ -0,0 +1,69 @@
+Metadata-Version: 2.4
+Name: docreader-ocr
+Version: 0.2.5
+Summary: Document OCR pipeline: classify → detect fields → recognize text
+Project-URL: Homepage, https://github.com/mishanyacorleone/docreader
+Project-URL: Repository, https://github.com/mishanyacorleone/docreader
+Project-URL: Issues, https://github.com/mishanyacorleone/docreader/issues
+Author-email: Mikhail Kardash <mishutqac@mail.ru>, Ruslan Abzelilov <ruslanr26@mail.ru>, Ekaterina Karmanova <monitor81@mail.ru>
+License: MIT
+License-File: LICENSE
+Keywords: document,ocr,recognition,yolo
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Topic :: Scientific/Engineering :: Image Recognition
+Requires-Python: >=3.9
+Requires-Dist: easyocr>=1.7
+Requires-Dist: numpy>=1.24
+Requires-Dist: opencv-python>=4.8
+Requires-Dist: rapidfuzz>=3.14.0
+Requires-Dist: requests>=2.28
+Requires-Dist: torch>=2.0
+Requires-Dist: torchvision>=0.15
+Requires-Dist: tqdm>=4.65
+Requires-Dist: ultralytics>=8.0
+Provides-Extra: dev
+Requires-Dist: mypy; extra == 'dev'
+Requires-Dist: pytest-cov; extra == 'dev'
+Requires-Dist: pytest>=7.0; extra == 'dev'
+Requires-Dist: ruff; extra == 'dev'
+Description-Content-Type: text/markdown
+
+# docreader-ocr
+
+Python library for automatic recognition of Russian documents.
+
+```python
+from docreader import DocReader
+
+result = DocReader().process("passport.jpg")
+print(result.documents[0].fields)
+# {"surname": "Иванов", "firstname": "Иван", "passport_num": "1234 567890", ...}
+```
+
+## Installation
+
+```bash
+pip install docreader-ocr
+```
+
+Models are downloaded automatically on first run.
+
+## Supported documents
+
+- Russian passport
+- SNILS
+- School certificate (attestat)
+- Higher education diploma
+
+## How it works
+
+Three-stage pipeline: the **classifier** (YOLO OBB, accuracy 97.5%) determines the document type → the **zone detector** (YOLO OBB, mAP@50 = 90%) finds the fields → **OCR** (EasyOCR, word accuracy 87.3%) recognizes the text.
+
+Data is processed locally: no external servers, full compliance with Federal Law 152-FZ.
+
+## Documentation
+
+The full README, examples, and API are on [GitHub](https://github.com/mishanyacorleone/docreader).
docreader_ocr-0.2.5/README.md
@@ -0,0 +1,292 @@
+# docreader-ocr
+
+[](https://pypi.org/project/docreader-ocr/)
+[](https://www.python.org/downloads/)
+[](https://opensource.org/licenses/MIT)
+[](https://github.com/mishanyacorleone/docreader/stargazers)
+
+Python library for automatic recognition of Russian documents. Take a photo, get structured data.
+
+```python
+from docreader import DocReader
+
+result = DocReader().process("passport.jpg")
+print(result.documents[0].fields)
+# {"surname": "Иванов", "firstname": "Иван", "passport_num": "1234 567890", ...}
+```
+
+---
+
+## Supported documents
+
+| Document | Recognized fields |
+|----------|-------------------|
+| Russian passport | surname, firstname, middlename, dateOfBirth, birthCity, sex, passport_num, issued_by, issued_date, issued_code |
+| SNILS | fio, snils, date_of_birth, sex, location, reg_date |
+| School certificate (attestat) | fio, lvl, number, issue_date, grad_year, school_name, gerb |
+| Diploma | fio, lvl, series_numbers, reg_num, issue_date, spec_name, university_name, gerb, stamp |
+
+---
+
+## Installation
+
+```bash
+pip install docreader-ocr
+```
+
+Models are downloaded automatically on first run and cached in `~/.cache/docreader/models/`.
+
+### Requirements
+
+- Python 3.12+
+- PyTorch (CPU or CUDA)
+- ~200 MB of disk space for models
+
+---
+
+## Quick start
+
+### Basic usage
+
+```python
+from docreader import DocReader
+
+reader = DocReader()
+result = reader.process("photo.jpg")
+
+for doc in result.documents:
+    print(f"Type: {doc.doc_type}")
+    print(f"Fields: {doc.fields}")
+```
+
+### Batch processing
+
+```python
+results = reader.process_batch(["doc1.jpg", "doc2.jpg", "doc3.jpg"])
+
+for page in results:
+    for doc in page.documents:
+        print(doc.fields)
+```
+
+### Using a numpy array
+
+```python
+import cv2
+from docreader import DocReader
+
+image = cv2.imread("passport.jpg")
+result = DocReader().process(image)
+```
+
+### Getting zone crops
+
+```python
+result = DocReader().process("passport.jpg", return_crops=True)
+
+for doc in result.documents:
+    for zone in doc.zones:
+        print(f"{zone.name}: {zone.text}")
+        # zone.crop_image is a numpy array with the cropped zone
+```
+
+### Using the GPU
+
+```python
+from docreader import DocReader
+from docreader.config import PipelineConfig
+
+config = PipelineConfig(device="cuda")
+reader = DocReader(config=config)
+```
+
+---
+
+## Architecture
+
+The library implements a three-stage pipeline:
+
+```
+Photo → [Classifier] → [Zone detector] → [OCR] → Field dictionary
+        YOLO OBB       YOLO OBB          EasyOCR
+        97.5% acc      mAP@50=90%        CER=0.15
+```
+
+**Classifier**: determines the document type and crops the document out of an arbitrary photo. Works under any lighting and viewing angle.
+
+**Zone detector**: a dedicated YOLO OBB model for each document type. Locates fields with mAP@50 = 90%.
+
+**OCR engine**: EasyOCR fine-tuned on Russian-language documents. Structured fields (numbers, dates, codes) reach 85–92% Exact Match.
+
+### Resolver for attestat/diplom
+
+The school certificate and the diploma are visually identical, so telling them apart is handled by a separate component, `LvlSubtypeResolver`. It detects the `lvl` field, reads it with OCR, and uses fuzzy matching to determine the document subtype.
+
+---
+
+## Quality metrics
+
+### Classifier
+
+| Class | Precision | Recall | F1 |
+|-------|-----------|--------|----|
+| passport | 0.968 | 1.000 | 0.984 |
+| snils | 1.000 | 0.903 | 0.949 |
+| attestat | 1.000 | 1.000 | 1.000 |
+| diplom | 1.000 | 1.000 | 1.000 |
+| **Overall accuracy** | | | **97.5%** |
+
+### Zone detector
+
+| Metric | Value |
+|--------|-------|
+| mAP@50 | 90.0% |
+| Best zone (gerb) | F1 = 99.1% |
+| Weakest zone (location) | F1 = 82.5% |
+
+### OCR
+
+| Metric | Value |
+|--------|-------|
+| Mean CER | 0.146 |
+| Mean WER | 0.276 |
+| Mean Exact Match | 58.8% |
+| Exact Match (series_numbers) | 92.3% |
+| Exact Match (fio) | 88.4% |
+
+---
+
+## Customization
+
+### Custom configuration
+
+```python
+from docreader import DocReader
+from docreader.config import PipelineConfig
+
+config = PipelineConfig(
+    device="cuda",
+    classifier_confidence=0.5,
+    detector_confidence=0.3,
+    enable_deskew=True,
+    return_crops=False,
+    skip_ocr_zones={"stamp", "gerb"},
+)
+reader = DocReader(config=config)
+```
+
+### Using individual components
+
+```python
+from docreader.factory import create_classifier, create_detector, create_ocr
+
+# Classifier only
+clf = create_classifier()
+docs = clf.classify("photo.jpg")
+
+# Detector only
+det = create_detector()
+zones = det.detect(image, doc_type="passport")
+
+# OCR only
+ocr = create_ocr()
+result = ocr.recognize(crop_image)
+print(result.text, result.confidence)
+```
+
+### Plugging in your own OCR engine
+
+```python
+from docreader.ocr.base import BaseOcrEngine, OcrResult
+import numpy as np
+
+class MyOcrEngine(BaseOcrEngine):
+    def recognize(self, image: np.ndarray) -> OcrResult:
+        # your implementation
+        return OcrResult(text="...", confidence=0.95)
+
+reader = DocReader(ocr_engine=MyOcrEngine())
+```
+
+### Adding a new document type
+
+```python
+from docreader.config import PipelineConfig
+
+config = PipelineConfig(
+    detector_weights={
+        "passport": "passport.pt",
+        "snils": "snils.pt",
+        "attestat": "attestat.pt",
+        "diplom": "diplom.pt",
+        "inn": "/path/to/your/inn.pt",  # your model
+    }
+)
+reader = DocReader(config=config)
+```
+
+---
+
+## Result structure
+
+```python
+PageResult
+└── documents: list[DocumentResult]
+    ├── doc_type: str             # "passport", "snils", "attestat", "diplom"
+    ├── doc_confidence: float     # classifier confidence
+    ├── doc_bbox: list[float]     # document coordinates in the source image
+    ├── doc_crop: np.ndarray      # cropped document (if return_crops=True)
+    ├── fields: dict              # {zone_name: text}, convenient access to fields
+    ├── resolve_meta: dict        # resolver diagnostics (for attestat/diplom)
+    └── zones: list[ZoneResult]
+        ├── name: str
+        ├── text: str
+        ├── confidence: float
+        ├── bbox: list[float]
+        └── crop_image: np.ndarray  # if return_crops=True
+```
+
+---
+
+## Model management
+
+```python
+from docreader.hub import get_model_status, ensure_all_models
+
+# Status of all models
+status = get_model_status()
+for name, info in status.items():
+    print(f"{name}: {'✓' if info['downloaded'] else '✗'} ({info['size_mb']} MB)")
+
+# Download all models in advance
+ensure_all_models()
+
+# Custom cache directory
+import os
+os.environ["DOCREADER_CACHE"] = "/path/to/custom/cache"
+```
+
+---
+
+## Why a library rather than a service
+
+**Data stays in-house.** Personal data never leaves the organization's infrastructure. Full compliance with Federal Law 152-FZ. No external servers.
+
+**Integration without rewrites.** Embeds into any existing system (1C, CRM, ERP, a mobile app) with two lines of code.
+
+**Full control.** New document types are added through fine-tuning, with no vendor involvement. The IT department controls everything: models, data, updates.
+
+**No operating costs.** Unlike cloud APIs, there are no subscription fees and no request limits.
+
+---
+
+## License
+
+MIT License, see [LICENSE](LICENSE).
+
+---
+
+## Links
+
+- [PyPI](https://pypi.org/project/docreader-ocr/)
+- [GitHub](https://github.com/mishanyacorleone/docreader)
docreader_ocr-0.2.5/README_pypi.md
@@ -0,0 +1,36 @@
+# docreader-ocr
+
+Python library for automatic recognition of Russian documents.
+
+```python
+from docreader import DocReader
+
+result = DocReader().process("passport.jpg")
+print(result.documents[0].fields)
+# {"surname": "Иванов", "firstname": "Иван", "passport_num": "1234 567890", ...}
+```
+
+## Installation
+
+```bash
+pip install docreader-ocr
+```
+
+Models are downloaded automatically on first run.
+
+## Supported documents
+
+- Russian passport
+- SNILS
+- School certificate (attestat)
+- Higher education diploma
+
+## How it works
+
+Three-stage pipeline: the **classifier** (YOLO OBB, accuracy 97.5%) determines the document type → the **zone detector** (YOLO OBB, mAP@50 = 90%) finds the fields → **OCR** (EasyOCR, word accuracy 87.3%) recognizes the text.
+
+Data is processed locally: no external servers, full compliance with Federal Law 152-FZ.
+
+## Documentation
+
+The full README, examples, and API are on [GitHub](https://github.com/mishanyacorleone/docreader).
{docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/pyproject.toml
@@ -4,9 +4,9 @@ build-backend = "hatchling.build"
 
 [project]
 name = "docreader-ocr"
-version = "0.2.
+version = "0.2.5"
 description = "Document OCR pipeline: classify → detect fields → recognize text"
-readme = "
+readme = "README_pypi.md"
 license = {text = "MIT"}
 requires-python = ">=3.9"
 authors = [
@@ -32,6 +32,7 @@ dependencies = [
     "numpy>=1.24",
     "requests>=2.28",
     "tqdm>=4.65",
+    "RapidFuzz>=3.14.0"
 ]
 
 [project.optional-dependencies]
{docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/src/docreader/config.py
@@ -6,6 +6,13 @@ from dataclasses import dataclass, field
 
 DEFAULT_SKIP_OCR_ZONES = frozenset({"stamp", "gerb"})
 
+DEFAULT_AMBIGUOUS_CLASSES = frozenset({"attestat/diplom"})
+
+DEFAULT_SUBTYPE_KEYWORDS: dict[str, list[str]] = {
+    "attestat": ["аттестат"],
+    "diplom": ["диплом"]
+}
+
 
 @dataclass
 class PipelineConfig:
@@ -35,6 +42,18 @@ class PipelineConfig:
     ocr_download_enabled: bool = False
     skip_ocr_zones: frozenset[str] = DEFAULT_SKIP_OCR_ZONES
 
+    # Resolver
+    ambiguous_classes: frozenset[str] = field(
+        default_factory=lambda: DEFAULT_AMBIGUOUS_CLASSES
+    )
+    resolver_weights: str = "lvl_detector.pt"
+    resolver_confidence: float = 0.25
+    resolver_subtype_keywords: dict[str, list[str]] = field(
+        default_factory=lambda: dict(DEFAULT_SUBTYPE_KEYWORDS)
+    )
+    resolver_fuzzy_threshold: float = 60.0
+    resolver_fallback: str | None = None
+
     enable_deskew: bool = True  # Deskew using Hough lines
     return_crops: bool = True  # Store zone crops in the result
 
|
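The two `field(default_factory=...)` wrappers in the config hunk above are worth a note: a dataclass rejects a plain `dict` as a field default, and copying `DEFAULT_SUBTYPE_KEYWORDS` with `dict(...)` gives each config instance its own top-level mapping. The following is a minimal sketch of that behaviour, not the package's code; `ResolverSettings` is a hypothetical stand-in for `PipelineConfig`.

```python
from dataclasses import dataclass, field

DEFAULT_SUBTYPE_KEYWORDS: dict[str, list[str]] = {
    "attestat": ["аттестат"],
    "diplom": ["диплом"],
}


@dataclass
class ResolverSettings:  # hypothetical stand-in for PipelineConfig
    fuzzy_threshold: float = 60.0
    # Each instance gets a fresh top-level dict copied from the module default.
    subtype_keywords: dict[str, list[str]] = field(
        default_factory=lambda: dict(DEFAULT_SUBTYPE_KEYWORDS)
    )


a = ResolverSettings()
b = ResolverSettings()
a.subtype_keywords["inn"] = ["инн"]  # adds a key to a's copy only

print("inn" in b.subtype_keywords)        # b is unaffected
print("inn" in DEFAULT_SUBTYPE_KEYWORDS)  # the module default is unaffected
```

Note that `dict(...)` is a shallow copy: the keyword lists themselves are still shared between instances, so appending to `a.subtype_keywords["attestat"]` would be visible everywhere.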
{docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/src/docreader/factory.py
@@ -19,6 +19,7 @@ from docreader.hub import ensure_model
 from docreader.classifier.yolo_classifier import DocClassifier
 from docreader.detector.yolo_obb import ZoneDetector
 from docreader.ocr.easyocr_engine import TextRecognizer
+from docreader.resolver.lvl_resolver import LvlSubtypeResolver
 
 
 def create_classifier(
@@ -127,3 +128,41 @@ def create_ocr(
     }
     defaults.update(kwargs)
     return TextRecognizer(**defaults)
+
+
+def create_resolver(
+    config: PipelineConfig | None = None,
+    ocr_engine: TextRecognizer | None = None,
+    **kwargs
+) -> LvlSubtypeResolver:
+    """
+    Creates a document subtype resolver (attestat/diplom).
+
+    Args:
+        config: configuration (if None, the default one is used).
+        ocr_engine: a ready OCR engine (if None, a new one is created).
+        **kwargs: overrides for LvlSubtypeResolver parameters
+            (weights_path, fuzzy_threshold, confidence_threshold, device).
+
+    Returns:
+        A ready-to-use LvlSubtypeResolver.
+
+    Examples:
+        resolver = create_resolver()
+        resolver = create_resolver(fuzzy_threshold=70.0)
+        resolver = create_resolver(ocr_engine=my_ocr)
+    """
+    cfg = config or PipelineConfig()
+    ocr = ocr_engine or create_ocr(cfg)
+
+    defaults = {
+        "weights_path": cfg.resolver_weights,
+        "ocr_engine": ocr,
+        "subtype_keywords": cfg.resolver_subtype_keywords,
+        "fuzzy_threshold": cfg.resolver_fuzzy_threshold,
+        "confidence_threshold": cfg.resolver_confidence,
+        "fallback": cfg.resolver_fallback,
+        "device": cfg.resolve_device(),
+    }
+    defaults.update(kwargs)
+    return LvlSubtypeResolver(**defaults)
|
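The factory functions in this file share one pattern: build a dict of defaults from the config, let caller keyword arguments overwrite entries, then unpack the dict into the constructor. A minimal sketch of that pattern follows, under the assumption that the target constructor accepts only named parameters; `FakeResolver` and `make_resolver` are hypothetical names, not part of the package.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakeResolver:  # hypothetical stand-in for LvlSubtypeResolver
    weights_path: str
    fuzzy_threshold: float
    fallback: Optional[str]


def make_resolver(**kwargs: Any) -> FakeResolver:
    # Defaults come first; caller-supplied kwargs win on key collisions.
    defaults: dict[str, Any] = {
        "weights_path": "lvl_detector.pt",
        "fuzzy_threshold": 60.0,
        "fallback": None,
    }
    defaults.update(kwargs)
    return FakeResolver(**defaults)


default_resolver = make_resolver()
strict_resolver = make_resolver(fuzzy_threshold=70.0)
print(default_resolver.fuzzy_threshold, strict_resolver.fuzzy_threshold)
```

One consequence of this design is that a keyword the constructor does not know is not silently ignored: it passes through `update` and raises `TypeError` at construction time, so override names have to match the constructor's parameter names exactly.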
{docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/src/docreader/hub.py
@@ -17,6 +17,7 @@ from tqdm import tqdm
 
 logger = logging.getLogger(__name__)
 
+_BASE_LVL_DETECTOR = "https://github.com/mishanyacorleone/docreader/releases/download/v0.2.2"
 _BASE_URL_CLASSIFIER = "https://github.com/mishanyacorleone/docreader/releases/download/v0.2.1"
 _BASE_URL = "https://github.com/mishanyacorleone/docreader/releases/download/v0.1.0"
 
@@ -56,7 +57,14 @@ MODEL_REGISTRY: dict[str, dict] = {
         "sha256": "832ce5a7f3a1086d81beb1c991347e3f545a425646bc87f3f576ae06fecd2420",
         "size_mb": 87.1,
         "extract_to": "easyocr"
-    }
+    },
+
+    # === Resolver ===
+    "lvl_detector.pt": {
+        "url": f"{_BASE_LVL_DETECTOR}/lvl_detector.pt",
+        "sha256": "10bc71dbf8de891bc591154c3c369d8db2daa329249ef4e5b4b15508e8441ba4",
+        "size_mb": 5.63,
+    },
 }
 
 def get_cache_dir() -> Path:
|
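Each registry entry above carries a `sha256` digest next to its download URL. How `hub.py` consumes that digest is not visible in this diff, so the following is only a sketch of one plausible check, assuming the digest is the hex SHA-256 of the downloaded file; `sha256_of` and `matches_registry` are hypothetical helper names.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files never load into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_registry(path: Path, entry: dict) -> bool:
    """Compare the file's digest with the 'sha256' value of a registry-style entry."""
    return sha256_of(path) == entry["sha256"]


# Demo with a throwaway file instead of a real model download.
with tempfile.TemporaryDirectory() as tmp:
    fake_model = Path(tmp) / "fake_model.pt"
    fake_model.write_bytes(b"not a real model")
    good_entry = {"sha256": hashlib.sha256(b"not a real model").hexdigest()}
    bad_entry = {"sha256": "0" * 64}
    ok = matches_registry(fake_model, good_entry)
    corrupted = not matches_registry(fake_model, bad_entry)

print(ok, corrupted)
```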
{docreader_ocr-0.2.3 → docreader_ocr-0.2.5}/src/docreader/pipeline.py
@@ -14,10 +14,9 @@ from docreader.hub import ensure_model
 from docreader.preprocessing import deskew_image, crop_obb_region
 
 from docreader.classifier.base import BaseClassifier
-
 from docreader.detector.base import BaseDetector
-
 from docreader.ocr.base import BaseOcrEngine
+from docreader.resolver.base import BaseSubtypeResolver
 
 logger = logging.getLogger(__name__)
 
@@ -47,6 +46,7 @@ class DocReader:
         classifier: Optional[BaseClassifier] = None,
         detector: Optional[BaseDetector] = None,
         ocr_engine: Optional[BaseOcrEngine] = None,
+        subtype_resolver: Optional[BaseSubtypeResolver] = None
     ):
         self._config = config or PipelineConfig()
         self._device = self._config.resolve_device()
@@ -56,6 +56,7 @@ class DocReader:
         self._classifier = classifier or self._build_classifier()
         self._detector = detector or self._build_detector()
         self._ocr = ocr_engine or self._build_ocr()
+        self._resolver = subtype_resolver or self._build_resolver()
 
     def _build_classifier(self) -> BaseClassifier:
         """Creates the classifier from the config."""
@@ -101,6 +102,28 @@ class DocReader:
             recog_network=self._config.ocr_recog_network,
             download_enabled=self._config.ocr_download_enabled,
         )
+
+    def _build_resolver(self) -> Optional[BaseSubtypeResolver]:
+        """
+        Creates a resolver only if the config lists ambiguous classes.
+        Returns None if no resolver is needed.
+        """
+        if not self._config.ambiguous_classes:
+            return None
+
+        from docreader.resolver.lvl_resolver import LvlSubtypeResolver
+
+        weights_path = ensure_model(self._config.resolver_weights)
+
+        return LvlSubtypeResolver(
+            weights_path=weights_path,
+            ocr_engine=self._ocr,
+            subtype_keywords=self._config.resolver_subtype_keywords,
+            fuzzy_threshold=self._config.resolver_fuzzy_threshold,
+            confidence_threshold=self._config.resolver_confidence,
+            fallback=self._config.resolver_fallback,
+            device=self._device,
+        )
 
     # === Public API ===
 
@@ -111,11 +134,11 @@ class DocReader:
     ) -> PageResult:
         """
         Full pipeline: finds all documents and recognizes them.
-
+
         Args:
             source: file path or numpy array (BGR).
             return_crops: whether to store crops.
-
+
         Returns:
             PageResult with the list of documents found.
         """
@@ -124,17 +147,14 @@ class DocReader:
             if return_crops is not None
             else self._config.return_crops
         )
-
+
         image = load_image(source)
-
-        # 1. Classification
         classified_docs = self._classifier.classify(image)
-
+
         if not classified_docs:
             logger.info("No documents found")
             return PageResult(documents=[])
-
-        # 2. Process each document
+
         documents: list[DocumentResult] = []
         for doc in classified_docs:
             result = self._process_single_document(
@@ -145,11 +165,11 @@ class DocReader:
                 save_crops=save_crops,
             )
             documents.append(result)
-
+
         page_result = PageResult(documents=documents)
         logger.info(f"Complete: {page_result}")
         return page_result
-
+
     def process_batch(
         self,
         sources: list[ImageSource],
@@ -159,6 +179,45 @@ class DocReader:
         return [self.process(src, return_crops) for src in sources]
 
     # === Internal logic ===
+
+    def _resolve_doc_type(
+        self,
+        doc_type: str,
+        doc_image: np.ndarray
+    ) -> tuple[str, dict]:
+        """
+        Refines the document type via the resolver if the class is ambiguous.
+
+        Returns:
+            Tuple of (refined doc_type, resolve metadata).
+        """
+        if (
+            doc_type not in self._config.ambiguous_classes
+            or self._resolver is None
+        ):
+            return doc_type, {}
+
+        resolve_result = self._resolver.resolve(doc_image)
+        meta = {
+            "resolver_ocr_text": resolve_result.ocr_text,
+            "resolver_ocr_confidence": resolve_result.confidence,
+            "resolver_fuzzy_score": resolve_result.fuzzy_score
+        }
+
+        if resolve_result.resolve:
+            logger.info(
+                f"Resolved '{doc_type}' -> '{resolve_result.subtype}' "
+                f"(text='{resolve_result.ocr_text}', "
+                f"fuzzy={resolve_result.fuzzy_score:.1f})"
+            )
+            return resolve_result.subtype, meta
+
+        logger.warning(
+            f"Could not resolve subtype for '{doc_type}': "
+            f"text='{resolve_result.ocr_text}', "
+            f"score={resolve_result.fuzzy_score:.1f}"
+        )
+        return doc_type, meta
 
     def _process_single_document(
         self,
@@ -169,42 +228,44 @@ class DocReader:
         save_crops: bool,
     ) -> DocumentResult:
         """Processes a single document."""
-
-
-
+        if self._config.enable_deskew:
+            doc_image = deskew_image(doc_image)
+
+        resolved_type, resolve_meta = self._resolve_doc_type(doc_type, doc_image)
+
+        if resolved_type not in self._detector.supported_doc_types:
+            logger.warning(f"No detector for '{resolved_type}'")
             return DocumentResult(
-                doc_type=
+                doc_type=resolved_type,
                 doc_confidence=doc_confidence,
                 zones=[],
                 doc_bbox=doc_bbox.tolist(),
                 doc_crop=doc_image if save_crops else None,
+                resolve_meta=resolve_meta,
             )
-
-
-
-
-        detections = self._detector.detect(doc_image, doc_type)
-        logger.info(f"'{doc_type}': {len(detections)} zones")
-
+
+        detections = self._detector.detect(doc_image, resolved_type)
+        logger.info(f"'{resolved_type}': {len(detections)} zones")
+
         zones: list[ZoneResult] = []
         for det in detections:
             zone = self._process_zone(doc_image, det, save_crops)
             if zone is not None:
                 zones.append(zone)
-
+
         return DocumentResult(
-            doc_type=
+            doc_type=resolved_type,
             doc_confidence=doc_confidence,
             zones=zones,
             doc_bbox=doc_bbox.tolist(),
             doc_crop=doc_image if save_crops else None,
+            resolve_meta=resolve_meta,
         )
 
     def _process_zone(self, image, detection, save_crops):
         """Processes a single zone."""
-
         zone_name = detection.zone_name
-
+
         if zone_name in self._config.skip_ocr_zones:
             return ZoneResult(
                 name=zone_name,
@@ -212,14 +273,14 @@ class DocReader:
                 confidence=detection.confidence,
                 bbox=detection.obb_points.tolist(),
             )
-
+
         crop = crop_obb_region(image, detection.obb_points)
         if crop is None or crop.size == 0:
             logger.warning(f"Empty crop for '{zone_name}'")
             return None
-
+
         ocr_result = self._ocr.recognize(crop)
-
+
         return ZoneResult(
             name=zone_name,
             text=ocr_result.text,
@@ -239,6 +300,7 @@ class DocReader:
         self._classifier = None
         self._detector = None
         self._ocr = None
+        self._resolver = None
         try:
             import gc
             gc.collect()
|
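The control flow that `_resolve_doc_type` adds to the pipeline can be exercised in isolation. The sketch below mirrors its three outcomes (pass-through, resolved, unresolved) with stand-in classes; `FakeResolveResult`, `FakeResolver`, and the free function `resolve_doc_type` are hypothetical and take the place of the real `ResolveResult`, `LvlSubtypeResolver`, and the `DocReader` method.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakeResolveResult:  # hypothetical stand-in for ResolveResult
    subtype: Optional[str]
    ocr_text: str
    fuzzy_score: float


class FakeResolver:  # hypothetical resolver that returns a canned result
    def __init__(self, result: FakeResolveResult) -> None:
        self._result = result

    def resolve(self, image: Any) -> FakeResolveResult:
        return self._result


def resolve_doc_type(
    doc_type: str,
    image: Any,
    resolver: Optional[FakeResolver],
    ambiguous_classes: frozenset,
) -> tuple:
    # Unambiguous class, or no resolver configured: pass through untouched.
    if doc_type not in ambiguous_classes or resolver is None:
        return doc_type, {}

    result = resolver.resolve(image)
    meta = {
        "resolver_ocr_text": result.ocr_text,
        "resolver_fuzzy_score": result.fuzzy_score,
    }
    # Resolved: swap in the subtype. Unresolved: keep the ambiguous label.
    if result.subtype is not None:
        return result.subtype, meta
    return doc_type, meta


AMBIGUOUS = frozenset({"attestat/diplom"})
hit = FakeResolver(FakeResolveResult("diplom", "диплом", 100.0))
miss = FakeResolver(FakeResolveResult(None, "", 0.0))

passthrough = resolve_doc_type("passport", None, hit, AMBIGUOUS)
resolved = resolve_doc_type("attestat/diplom", None, hit, AMBIGUOUS)
unresolved = resolve_doc_type("attestat/diplom", None, miss, AMBIGUOUS)
print(passthrough[0], resolved[0], unresolved[0])
```

In the unresolved case the label stays `attestat/diplom`. In the pipeline hunk above, such a document would then fall into the "No detector for ..." branch unless the detector supports that combined label, so `resolve_meta` is the place to look when a certificate or diploma comes back with no zones.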
@@ -0,0 +1,40 @@
|
|
|
1
|
+
"""
|
|
2
|
+
Абстрактный интерфейс resolver'a подтипа документа.
|
|
3
|
+
"""
|
|
4
|
+
|
|
5
|
+
from abc import ABC, abstractmethod
|
|
6
|
+
from dataclasses import dataclass
|
|
7
|
+
from typing import Optional
|
|
8
|
+
|
|
9
|
+
import numpy as np
|
|
10
|
+
|
|
11
|
+
|
|
12
|
+
@dataclass
|
|
13
|
+
class ResolveResult:
|
|
14
|
+
"""
|
|
15
|
+
Результат определения подтипа документа
|
|
16
|
+
"""
|
|
17
|
+
subtype: Optional[str]
|
|
18
|
+
ocr_text: str
|
|
19
|
+
confidence: float
|
|
20
|
+
fuzzy_score: float
|
|
21
|
+
|
|
22
|
+
@property
|
|
23
|
+
def resolve(self) -> None:
|
|
24
|
+
return self.subtype is not None
|
|
25
|
+
|
|
26
|
+
|
|
27
|
+
class BaseSubtypeResolver(ABC):
|
|
28
|
+
"""
|
|
29
|
+
Интерфейс для определения подтипа документа.
|
|
30
|
+
|
|
31
|
+
Используется когда классификатор не может различать 2 похожих
|
|
32
|
+
класса (attestat/diplom) и требуется дополнительный шаг
|
|
33
|
+
"""
|
|
34
|
+
|
|
35
|
+
@abstractmethod
|
|
36
|
+
def resolve(self, image: np.ndarray) -> ResolveResult:
|
|
37
|
+
"""
|
|
38
|
+
Resolves the document subtype from its crop.
|
|
39
|
+
"""
|
|
40
|
+
...
|
|
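A minimal sketch of how the interface above might be implemented. It assumes the read-only property on `ResolveResult` is meant to be called `resolved` and to return a `bool`, since the usage example in `lvl_resolver.py` reads `result.resolved`. `ConstantResolver` is a hypothetical stand-in that is not part of the package, and `Any` replaces `np.ndarray` so the sketch runs without NumPy.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ResolveResult:
    """Mirror of the dataclass in the diff, for illustration only."""
    subtype: Optional[str]
    ocr_text: str
    confidence: float
    fuzzy_score: float

    @property
    def resolved(self) -> bool:
        # True when a subtype (possibly a fallback) was assigned.
        return self.subtype is not None


class BaseSubtypeResolver(ABC):
    @abstractmethod
    def resolve(self, image: Any) -> ResolveResult:
        ...


class ConstantResolver(BaseSubtypeResolver):
    """Hypothetical resolver that always returns the same subtype."""

    def __init__(self, subtype: Optional[str]):
        self._subtype = subtype

    def resolve(self, image: Any) -> ResolveResult:
        # The image is ignored; a real resolver would inspect it.
        return ResolveResult(
            subtype=self._subtype,
            ocr_text="",
            confidence=0.0,
            fuzzy_score=0.0,
        )
```

A resolver like this could serve as a test double wherever the pipeline expects a `BaseSubtypeResolver`.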
@@ -0,0 +1,195 @@
|
|
|
1
|
+
"""
|
|
2
|
+
Document subtype resolver based on lvl field detection + OCR + fuzzy matching.
|
|
3
|
+
"""
|
|
4
|
+
|
|
5
|
+
import logging
|
|
6
|
+
from typing import Optional
|
|
7
|
+
|
|
8
|
+
import numpy as np
|
|
9
|
+
from rapidfuzz import process, fuzz
|
|
10
|
+
from ultralytics import YOLO
|
|
11
|
+
|
|
12
|
+
from docreader.ocr.base import BaseOcrEngine
|
|
13
|
+
from docreader.preprocessing.geometry import crop_obb_region
|
|
14
|
+
from docreader.resolver.base import BaseSubtypeResolver, ResolveResult
|
|
15
|
+
|
|
16
|
+
logger = logging.getLogger(__name__)
|
|
17
|
+
|
|
18
|
+
|
|
19
|
+
class LvlSubtypeResolver(BaseSubtypeResolver):
|
|
20
|
+
"""
|
|
21
|
+
Resolves the document subtype (attestat/diplom) via:
|
|
22
|
+
1. YOLO OBB: detects the lvl field on the document crop
|
|
23
|
+
2. OCR: recognizes the text of the field
|
|
24
|
+
3. Fuzzy matching: matches the text against the subtype keywords
|
|
25
|
+
|
|
26
|
+
Examples:
|
|
27
|
+
resolver = LvlSubtypeResolver(
|
|
28
|
+
weights_path="/path/to/lvl_detector.pt",
|
|
29
|
+
ocr_engine=ocr,
|
|
30
|
+
subtype_keywords={
|
|
31
|
+
"attestat": ["аттестат", "attestat"],
|
|
32
|
+
"diplom": ["диплом", "diplom"],
|
|
33
|
+
},
|
|
34
|
+
)
|
|
35
|
+
result = resolver.resolve(doc_crop)
|
|
36
|
+
if result.resolved:
|
|
37
|
+
print(result.subtype) # "attestat" or "diplom"
|
|
38
|
+
|
|
39
|
+
Args:
|
|
40
|
+
weights_path: path to the YOLO model used to detect the lvl field.
|
|
41
|
+
ocr_engine: OCR engine (BaseOcrEngine).
|
|
42
|
+
subtype_keywords: dictionary of {subtype: [keywords]}.
|
|
43
|
+
fuzzy_threshold: minimum score for a match to be accepted (0–100).
|
|
44
|
+
confidence_threshold: minimum detector confidence.
|
|
45
|
+
fallback: default subtype if resolution fails (None = unresolved).
|
|
46
|
+
device: inference device ("cpu", "cuda").
|
|
47
|
+
"""
|
|
48
|
+
|
|
49
|
+
def __init__(
|
|
50
|
+
self,
|
|
51
|
+
weights_path: str,
|
|
52
|
+
ocr_engine: BaseOcrEngine,
|
|
53
|
+
subtype_keywords: dict[str, list[str]],
|
|
54
|
+
fuzzy_threshold: float = 60.0,
|
|
55
|
+
confidence_threshold: float = 0.25,
|
|
56
|
+
fallback: Optional[str] = None,
|
|
57
|
+
device: str = "cpu",
|
|
58
|
+
):
|
|
59
|
+
self._model = YOLO(weights_path)
|
|
60
|
+
self._ocr = ocr_engine
|
|
61
|
+
self._fuzzy_threshold = fuzzy_threshold
|
|
62
|
+
self._confidence_threshold = confidence_threshold
|
|
63
|
+
self._fallback = fallback
|
|
64
|
+
self._device = device
|
|
65
|
+
|
|
66
|
+
# Flat list of keywords and a keyword → subtype mapping
|
|
67
|
+
self._keywords: list[str] = []
|
|
68
|
+
self._keyword_to_subtype: dict[str, str] = {}
|
|
69
|
+
for subtype, words in subtype_keywords.items():
|
|
70
|
+
for word in words:
|
|
71
|
+
normalized = word.lower()
|
|
72
|
+
self._keywords.append(normalized)
|
|
73
|
+
self._keyword_to_subtype[normalized] = subtype
|
|
74
|
+
|
|
75
|
+
logger.info(
|
|
76
|
+
f"LvlSubtypeResolver initialized: "
|
|
77
|
+
f"subtypes={list(subtype_keywords.keys())}, "
|
|
78
|
+
f"threshold={fuzzy_threshold}, fallback={fallback}"
|
|
79
|
+
)
|
|
80
|
+
|
|
81
|
+
def resolve(self, image: np.ndarray) -> ResolveResult:
|
|
82
|
+
"""
|
|
83
|
+
Resolves the document subtype from a crop.
|
|
84
|
+
|
|
85
|
+
Args:
|
|
86
|
+
image: BGR image of the document.
|
|
87
|
+
|
|
88
|
+
Returns:
|
|
89
|
+
ResolveResult with the subtype and diagnostic information.
|
|
90
|
+
"""
|
|
91
|
+
lvl_crop = self._detect_lvl_field(image)
|
|
92
|
+
|
|
93
|
+
if lvl_crop is None:
|
|
94
|
+
logger.warning("lvl field not detected, using fallback")
|
|
95
|
+
return ResolveResult(
|
|
96
|
+
subtype=self._fallback,
|
|
97
|
+
ocr_text="",
|
|
98
|
+
confidence=0.0,
|
|
99
|
+
fuzzy_score=0.0,
|
|
100
|
+
)
|
|
101
|
+
|
|
102
|
+
ocr_result = self._ocr.recognize(lvl_crop)
|
|
103
|
+
logger.debug(
|
|
104
|
+
f"lvl OCR: text='{ocr_result.text}', conf={ocr_result.confidence:.3f}"
|
|
105
|
+
)
|
|
106
|
+
|
|
107
|
+
if not ocr_result.text.strip():
|
|
108
|
+
logger.warning("lvl OCR returned empty text, using fallback")
|
|
109
|
+
return ResolveResult(
|
|
110
|
+
subtype=self._fallback,
|
|
111
|
+
ocr_text=ocr_result.text,
|
|
112
|
+
confidence=ocr_result.confidence,
|
|
113
|
+
fuzzy_score=0.0,
|
|
114
|
+
)
|
|
115
|
+
|
|
116
|
+
subtype, fuzzy_score = self._match_subtype(ocr_result.text)
|
|
117
|
+
|
|
118
|
+
if subtype is None:
|
|
119
|
+
logger.warning(
|
|
120
|
+
f"Fuzzy match below threshold: "
|
|
121
|
+
f"text='{ocr_result.text}', score={fuzzy_score:.1f}, "
|
|
122
|
+
f"threshold={self._fuzzy_threshold}, fallback={self._fallback}"
|
|
123
|
+
)
|
|
124
|
+
|
|
125
|
+
return ResolveResult(
|
|
126
|
+
subtype=subtype if subtype is not None else self._fallback,
|
|
127
|
+
ocr_text=ocr_result.text,
|
|
128
|
+
confidence=ocr_result.confidence,
|
|
129
|
+
fuzzy_score=fuzzy_score,
|
|
130
|
+
)
|
|
131
|
+
|
|
132
|
+
def _detect_lvl_field(self, image: np.ndarray) -> Optional[np.ndarray]:
|
|
133
|
+
"""
|
|
134
|
+
Detects the lvl field and returns its crop.
|
|
135
|
+
"""
|
|
136
|
+
|
|
137
|
+
results = self._model(image, device=self._device, verbose=False)
|
|
138
|
+
|
|
139
|
+
if results[0].obb is None:
|
|
140
|
+
return None
|
|
141
|
+
|
|
142
|
+
best_conf = -1.0
|
|
143
|
+
best_crop = None
|
|
144
|
+
|
|
145
|
+
for det in results[0].obb:
|
|
146
|
+
confidence = float(det.conf.cpu())
|
|
147
|
+
if confidence < self._confidence_threshold:
|
|
148
|
+
continue
|
|
149
|
+
|
|
150
|
+
zone_name = self._model.names[int(det.cls.cpu())]
|
|
151
|
+
if zone_name != "lvl":
|
|
152
|
+
continue
|
|
153
|
+
|
|
154
|
+
if confidence <= best_conf:
|
|
155
|
+
continue
|
|
156
|
+
|
|
157
|
+
obb_points = det.xyxyxyxy.cpu().numpy().flatten()
|
|
158
|
+
crop = crop_obb_region(image, obb_points)
|
|
159
|
+
|
|
160
|
+
if crop is not None and crop.size > 0:
|
|
161
|
+
best_conf = confidence
|
|
162
|
+
best_crop = crop
|
|
163
|
+
|
|
164
|
+
return best_crop
|
|
165
|
+
|
|
166
|
+
def _match_subtype(self, text: str) -> tuple[Optional[str], float]:
|
|
167
|
+
"""
|
|
168
|
+
Matches the OCR text against the keywords using fuzzy matching.
|
|
169
|
+
|
|
170
|
+
Returns:
|
|
171
|
+
Tuple of (subtype or None, fuzzy score).
|
|
172
|
+
"""
|
|
173
|
+
normalized = text.lower().strip()
|
|
174
|
+
|
|
175
|
+
match = process.extractOne(
|
|
176
|
+
normalized,
|
|
177
|
+
self._keywords,
|
|
178
|
+
scorer=fuzz.WRatio,
|
|
179
|
+
)
|
|
180
|
+
|
|
181
|
+
if match is None:
|
|
182
|
+
return None, 0.0
|
|
183
|
+
|
|
184
|
+
best_keyword, score, _ = match
|
|
185
|
+
|
|
186
|
+
if score < self._fuzzy_threshold:
|
|
187
|
+
return None, float(score)
|
|
188
|
+
|
|
189
|
+
subtype = self._keyword_to_subtype[best_keyword]
|
|
190
|
+
logger.debug(
|
|
191
|
+
f"Fuzzy matched: '{normalized}' -> '{best_keyword}' "
|
|
192
|
+
f"(subtype={subtype}, score={score:.1f})"
|
|
193
|
+
)
|
|
194
|
+
return subtype, float(score)
|
|
195
|
+
|
|
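The keyword matching step in `_match_subtype` can be sketched without third-party dependencies. The resolver in the diff uses `rapidfuzz.process.extractOne` with `fuzz.WRatio`; the sketch below swaps in `difflib.SequenceMatcher` from the standard library, so the scores will differ from the real ones. `build_index`, `match_subtype`, and `SUBTYPE_KEYWORDS` are illustrative names and are not part of the package.

```python
from difflib import SequenceMatcher
from typing import Optional

# Same shape as the subtype_keywords argument shown in the docstring example.
SUBTYPE_KEYWORDS = {
    "attestat": ["аттестат", "attestat"],
    "diplom": ["диплом", "diplom"],
}


def build_index(subtype_keywords: dict[str, list[str]]) -> dict[str, str]:
    """Flatten {subtype: [keywords]} into a lowercase keyword -> subtype map."""
    return {
        word.lower(): subtype
        for subtype, words in subtype_keywords.items()
        for word in words
    }


def match_subtype(
    text: str,
    keyword_to_subtype: dict[str, str],
    threshold: float = 60.0,
) -> tuple[Optional[str], float]:
    """Return (subtype or None, best score on a 0-100 scale)."""
    normalized = text.lower().strip()
    best_keyword, best_score = None, 0.0
    for keyword in keyword_to_subtype:
        # Stand-in similarity measure; the diff uses fuzz.WRatio instead.
        score = 100.0 * SequenceMatcher(None, normalized, keyword).ratio()
        if score > best_score:
            best_keyword, best_score = keyword, score
    # Below the threshold the text is treated as unresolved.
    if best_keyword is None or best_score < threshold:
        return None, best_score
    return keyword_to_subtype[best_keyword], best_score
```

Normalizing both the keywords and the OCR text to lowercase, as the resolver does, keeps the comparison case-insensitive; the threshold then decides whether a noisy OCR reading is close enough to count.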
@@ -33,6 +33,7 @@ class DocumentResult:
|
|
|
33
33
|
zones: list[ZoneResult] = field(default_factory=list)
|
|
34
34
|
doc_bbox: Optional[list[float]] = None # document coordinates in the source image
|
|
35
35
|
doc_crop: Optional[np.ndarray] = None # document crop
|
|
36
|
+
resolve_meta: dict = field(default_factory=dict) # resolver diagnostics
|
|
36
37
|
|
|
37
38
|
@property
|
|
38
39
|
def fields(self) -> dict[str, str]:
|
|
@@ -42,7 +43,7 @@ class DocumentResult:
|
|
|
42
43
|
return {zone.name: zone.text for zone in self.zones}
|
|
43
44
|
|
|
44
45
|
def to_dict(self) -> dict:
|
|
45
|
-
|
|
46
|
+
result = {
|
|
46
47
|
"document": {
|
|
47
48
|
"doc_type": self.doc_type,
|
|
48
49
|
"doc_confidence": round(self.doc_confidence, 4),
|
|
@@ -50,6 +51,9 @@ class DocumentResult:
|
|
|
50
51
|
"fields": self.fields
|
|
51
52
|
}
|
|
52
53
|
}
|
|
54
|
+
if self.resolve_meta:
|
|
55
|
+
result["document"]["resolve_meta"] = self.resolve_meta
|
|
56
|
+
return result
|
|
53
57
|
|
|
54
58
|
def __repr__(self) -> str:
|
|
55
59
|
return (
|
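The `to_dict` change above attaches resolver diagnostics only when `resolve_meta` is non-empty. A reduced sketch of that behavior follows, assuming a simplified stand-in class: `DocumentResultSketch` is hypothetical, takes `fields` as a plain dictionary rather than deriving it from zones, and the `fuzzy_score` key used below is only an example of what the metadata might contain.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentResultSketch:
    """Hypothetical, reduced stand-in for DocumentResult."""
    doc_type: str
    doc_confidence: float
    fields: dict = field(default_factory=dict)
    resolve_meta: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        result = {
            "document": {
                "doc_type": self.doc_type,
                "doc_confidence": round(self.doc_confidence, 4),
                "fields": self.fields,
            }
        }
        # Resolver diagnostics are included only when something was recorded.
        if self.resolve_meta:
            result["document"]["resolve_meta"] = self.resolve_meta
        return result


plain = DocumentResultSketch("passport", 0.9).to_dict()
with_meta = DocumentResultSketch(
    "diplom", 0.9, resolve_meta={"fuzzy_score": 100.0}
).to_dict()
```

Keeping the key optional means the serialized output for documents that never pass through a resolver keeps the same shape as in 0.2.3.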
docreader_ocr-0.2.3/PKG-INFO
DELETED
|
@@ -1,33 +0,0 @@
|
|
|
1
|
-
Metadata-Version: 2.4
|
|
2
|
-
Name: docreader-ocr
|
|
3
|
-
Version: 0.2.3
|
|
4
|
-
Summary: Document OCR pipeline: classify → detect fields → recognize text
|
|
5
|
-
Project-URL: Homepage, https://github.com/mishanyacorleone/docreader
|
|
6
|
-
Project-URL: Repository, https://github.com/mishanyacorleone/docreader
|
|
7
|
-
Project-URL: Issues, https://github.com/mishanyacorleone/docreader/issues
|
|
8
|
-
Author-email: Mikhail Kardash <mishutqac@mail.ru>, Ruslan Abzelilov <ruslanr26@mail.ru>, Ekaterina Karmanova <monitor81@mail.ru>
|
|
9
|
-
License: MIT
|
|
10
|
-
License-File: LICENSE
|
|
11
|
-
Keywords: document,ocr,recognition,yolo
|
|
12
|
-
Classifier: Development Status :: 3 - Alpha
|
|
13
|
-
Classifier: Intended Audience :: Developers
|
|
14
|
-
Classifier: License :: OSI Approved :: MIT License
|
|
15
|
-
Classifier: Programming Language :: Python :: 3
|
|
16
|
-
Classifier: Topic :: Scientific/Engineering :: Image Recognition
|
|
17
|
-
Requires-Python: >=3.9
|
|
18
|
-
Requires-Dist: easyocr>=1.7
|
|
19
|
-
Requires-Dist: numpy>=1.24
|
|
20
|
-
Requires-Dist: opencv-python>=4.8
|
|
21
|
-
Requires-Dist: requests>=2.28
|
|
22
|
-
Requires-Dist: torch>=2.0
|
|
23
|
-
Requires-Dist: torchvision>=0.15
|
|
24
|
-
Requires-Dist: tqdm>=4.65
|
|
25
|
-
Requires-Dist: ultralytics>=8.0
|
|
26
|
-
Provides-Extra: dev
|
|
27
|
-
Requires-Dist: mypy; extra == 'dev'
|
|
28
|
-
Requires-Dist: pytest-cov; extra == 'dev'
|
|
29
|
-
Requires-Dist: pytest>=7.0; extra == 'dev'
|
|
30
|
-
Requires-Dist: ruff; extra == 'dev'
|
|
31
|
-
Description-Content-Type: text/markdown
|
|
32
|
-
|
|
33
|
-
Placeholder
|
docreader_ocr-0.2.3/README.md
DELETED
|
@@ -1 +0,0 @@
|
|
|
1
|
-
Placeholder
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|