ocr-stringdist 0.2.0__pp310-pypy310_pp73-musllinux_1_1_i686.whl → 0.2.1__pp310-pypy310_pp73-musllinux_1_1_i686.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,81 @@
+ Metadata-Version: 2.4
+ Name: ocr_stringdist
+ Version: 0.2.1
+ Classifier: Programming Language :: Rust
+ Classifier: Programming Language :: Python
+ Classifier: Operating System :: OS Independent
+ License-File: LICENSE
+ Requires-Python: >=3.9
+ Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
+ Project-URL: repository, https://github.com/NiklasvonM/ocr-stringdist
+
+ # OCR-StringDist
+
+ A Python library for fast string distance calculations that account for common OCR (optical character recognition) errors.
+
+ Documentation: https://niklasvonm.github.io/ocr-stringdist/
+
+ [![PyPI](https://img.shields.io/badge/PyPI-Package-blue)](https://pypi.org/project/ocr-stringdist/)
+ [![License](https://img.shields.io/badge/License-MIT-green)](LICENSE)
+
+ ## Overview
+
+ Standard string distances (like Levenshtein) treat all character substitutions equally. This is suboptimal for text read from images via OCR, where errors like `O` vs `0` are far more common than, say, `O` vs `X`.
+
+ OCR-StringDist uses a **weighted Levenshtein distance**, assigning lower costs to common OCR errors.
+
+ **Example:** Matching against the correct word `CODE`:
+
+ * **Standard Levenshtein:**
+     * $d(\text{"CODE"}, \text{"C0DE"}) = 1$ (O → 0)
+     * $d(\text{"CODE"}, \text{"CXDE"}) = 1$ (O → X)
+     * Result: Both appear equally likely/distant.
+
+ * **OCR-StringDist (Weighted):**
+     * $d(\text{"CODE"}, \text{"C0DE"}) \approx 0.1$ (common error, low cost)
+     * $d(\text{"CODE"}, \text{"CXDE"}) = 1.0$ (unlikely error, high cost)
+     * Result: Correctly identifies `C0DE` as a much closer match.
+
+ This makes it ideal for matching potentially incorrect OCR output against known values (e.g., product codes, database entries).
+
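+ For illustration, here is a minimal sketch of that comparison in code, using the default OCR distance map; the exact numbers depend on the built-in costs, so the comments only indicate the expected order of magnitude:
+
+ ```python
+ import ocr_stringdist as osd
+
+ # Both candidates differ from "CODE" by one substitution, but the default
+ # cost map should rate the O -> 0 confusion as far cheaper than O -> X.
+ print(osd.weighted_levenshtein_distance("CODE", "C0DE"))  # expected well below 1.0
+ print(osd.weighted_levenshtein_distance("CODE", "CXDE"))  # expected close to 1.0
+ ```
+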
+ > **Note:** This project is in early development. APIs may change in future releases.
+
+ ## Installation
+
+ ```bash
+ pip install ocr-stringdist
+ ```
+
+ ## Features
+
+ - **Weighted Levenshtein Distance**: Calculates Levenshtein distance with customizable costs for substitutions, insertions, and deletions. Includes an efficient batch version (`batch_weighted_levenshtein_distance`) for comparing one string against many candidates.
+ - **Substitution of Multiple Characters**: Substitution costs can be defined not only for single character pairs but for whole string pairs, for example the Korean syllable "이" read as the two Latin letters "OI".
+ - **Pre-defined OCR Distance Map**: A built-in distance map for common OCR confusions (e.g., "0" vs "O", "1" vs "l", "5" vs "S").
+ - **Unicode Support**: Works with arbitrary Unicode strings.
+ - **Best Match Finder**: Includes a utility function `find_best_candidate` to efficiently find the best match from a list of candidates based on _any_ distance function; a usage sketch is given under Usage below.
+
+ ## Usage
+
+ ### Weighted Levenshtein Distance
+
+ ```python
+ import ocr_stringdist as osd
+
+ # Using default OCR distance map
+ distance = osd.weighted_levenshtein_distance("OCR5", "OCRS")
+ print(f"Distance between 'OCR5' and 'OCRS': {distance}")  # Will be less than 1.0
+
+ # Custom cost map
+ substitution_costs = {("In", "h"): 0.5}
+ distance = osd.weighted_levenshtein_distance(
+     "hi", "Ini",
+     substitution_costs=substitution_costs,
+     symmetric_substitution=True,
+ )
+ print(f"Distance with custom map: {distance}")
+ ```
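+
+ ### Batch Comparison and Best Match
+
+ The batch helper and the match finder mentioned under Features can be used roughly as sketched below. The exact signatures of `batch_weighted_levenshtein_distance` and `find_best_candidate` (argument order, keyword arguments and return types) are assumed here rather than taken from the API reference, so please check the documentation linked above.
+
+ ```python
+ import ocr_stringdist as osd
+
+ candidates = ["CODE", "CORE", "CASE"]
+
+ # Assumed: one OCR reading compared against many candidates,
+ # returning one distance per candidate.
+ distances = osd.batch_weighted_levenshtein_distance("C0DE", candidates)
+ print(distances)
+
+ # Assumed: picks the candidate that minimizes the given distance function.
+ best = osd.find_best_candidate("C0DE", candidates, osd.weighted_levenshtein_distance)
+ print(best)
+ ```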
+
+ ## Acknowledgements
+
+ This project is inspired by [jellyfish](https://github.com/jamesturk/jellyfish), which provided the base implementations of the algorithms used here.
+
@@ -1,11 +1,11 @@
- ocr_stringdist-0.2.0.dist-info/METADATA,sha256=OVF3jUKVM038ogWfwZIHmpu3eUXdeuS1Cy-t96G8Tgo,304
- ocr_stringdist-0.2.0.dist-info/WHEEL,sha256=xaqx4XMkAU9rqVgVtKWB2ZWHvVuVjQz1tlBEy4WblNs,112
- ocr_stringdist-0.2.0.dist-info/licenses/LICENSE,sha256=5BPRcjlnbl2t4TidSgpfGrtC_birSf8JlZfA-qmVoQE,1072
+ ocr_stringdist-0.2.1.dist-info/METADATA,sha256=dIjhLqdKIzSgqyX45jQ6mZTzZjm3UOcghfe2zYoKeS0,3320
+ ocr_stringdist-0.2.1.dist-info/WHEEL,sha256=xaqx4XMkAU9rqVgVtKWB2ZWHvVuVjQz1tlBEy4WblNs,112
+ ocr_stringdist-0.2.1.dist-info/licenses/LICENSE,sha256=5BPRcjlnbl2t4TidSgpfGrtC_birSf8JlZfA-qmVoQE,1072
  ocr_stringdist.libs/libgcc_s-b5472b99.so.1,sha256=wh8CpjXz9IccAyeERcB7YDEx7NH2jF-PykwOyYNeRRI,453841
  ocr_stringdist/__init__.py,sha256=ApxqraLRcWAkzXhGJXSf3EqGEVFbxghrYrfJ9dmQjQU,467
- ocr_stringdist/_rust_stringdist.pypy310-pp73-x86-linux-gnu.so,sha256=x5q1wNRNE0-ZytF1oLHjl7dKAsIDWwTHCs5qEmnaWuw,780805
+ ocr_stringdist/_rust_stringdist.pypy310-pp73-x86-linux-gnu.so,sha256=f3o1rGAEDokVtJyQUQ2cl-cubfmDg2BDZAyfKErIizI,780805
  ocr_stringdist/default_ocr_distances.py,sha256=oSu-TpHjPA4jxKpLAfmap8z0ZsC99jsOjnRVHW7Hj_Y,1033
  ocr_stringdist/levenshtein.py,sha256=Jypg31BQyULipJ_Yh3dcBQDKNnbvEIlmf28tDr_gySw,11243
  ocr_stringdist/matching.py,sha256=rr8R63Ttu2hTf5Mni7_P8aGBbjWs6t2QPV3wxKXspAs,3293
  ocr_stringdist/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
- ocr_stringdist-0.2.0.dist-info/RECORD,,
+ ocr_stringdist-0.2.1.dist-info/RECORD,,
@@ -1,9 +0,0 @@
- Metadata-Version: 2.4
- Name: ocr_stringdist
- Version: 0.2.0
- Classifier: Programming Language :: Rust
- Classifier: Programming Language :: Python
- Classifier: Operating System :: OS Independent
- License-File: LICENSE
- Requires-Python: >=3.9
- Project-URL: repository, https://github.com/NiklasvonM/ocr-stringdist