ocr-stringdist 0.2.1__cp313-cp313-macosx_10_12_x86_64.whl → 0.2.2__cp313-cp313-macosx_10_12_x86_64.whl

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
- Name: ocr_stringdist
- Version: 0.2.1
+ Name: ocr-stringdist
+ Version: 0.2.2
  Classifier: Programming Language :: Rust
  Classifier: Programming Language :: Python
  Classifier: Operating System :: OS Independent
@@ -8,6 +8,7 @@ License-File: LICENSE
  Requires-Python: >=3.9
  Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
  Project-URL: repository, https://github.com/NiklasvonM/ocr-stringdist
+ Project-URL: documentation, https://niklasvonm.github.io/ocr-stringdist/
 
  # OCR-StringDist
 
@@ -38,8 +39,6 @@ OCR-StringDist uses a **weighted Levenshtein distance**, assigning lower costs t
 
  This makes it ideal for matching potentially incorrect OCR output against known values (e.g., product codes, database entries).
 
- > **Note:** This project is in early development. APIs may change in future releases.
-
  ## Installation
 
  ```bash
@@ -48,7 +47,9 @@ pip install ocr-stringdist
 
  ## Features
 
+ - **High Performance**: The core logic is implemented in Rust with speed in mind.
  - **Weighted Levenshtein Distance**: Calculates Levenshtein distance with customizable costs for substitutions, insertions, and deletions. Includes an efficient batch version (`batch_weighted_levenshtein_distance`) for comparing one string against many candidates.
+ - **Explainable Edit Path**: Returns the optimal sequence of edit operations (substitutions, insertions, and deletions) used to transform one string into another.
  - **Substitution of Multiple Characters**: Not just character pairs, but string pairs may be substituted, for example the Korean syllable "이" for the two letters "OI".
  - **Pre-defined OCR Distance Map**: A built-in distance map for common OCR confusions (e.g., "0" vs "O", "1" vs "l", "5" vs "S").
  - **Unicode Support**: Works with arbitrary Unicode strings.
@@ -56,25 +57,45 @@ pip install ocr-stringdist
 
  ## Usage
 
- ### Weighted Levenshtein Distance
+ ### Basic usage
+
+ ```python
+ from ocr_stringdist import WeightedLevenshtein
+
+ # Default substitution costs are ocr_stringdist.ocr_distance_map.
+ wl = WeightedLevenshtein()
+
+ print(wl.distance("CXDE", "CODE")) # == 1
+ print(wl.distance("C0DE", "CODE")) # < 1
+ ```
+
+ ### Explain the Edit Path
+
+ ```python
+ edit_path = wl.explain("C0DE", "CODE")
+ print(edit_path)
+ # [EditOperation(op_type='substitute', source_token='0', target_token='O', cost=0.1)]
+ ```
+
+ ### Fast Batch Calculations
+
+ Quickly compare a string to a list of candidates.
 
  ```python
- import ocr_stringdist as osd
-
- # Using default OCR distance map
- distance = osd.weighted_levenshtein_distance("OCR5", "OCRS")
- print(f"Distance between 'OCR5' and 'OCRS': {distance}") # Will be less than 1.0
-
- # Custom cost map
- substitution_costs = {("In", "h"): 0.5}
- distance = osd.weighted_levenshtein_distance(
-     "hi", "Ini",
-     substitution_costs=substitution_costs,
-     symmetric_substitution=True,
- )
- print(f"Distance with custom map: {distance}")
+ distances: list[float] = wl.batch_distance("CODE", ["CXDE", "C0DE"])
+ # [1.0, 0.1]
  ```
 
+ ### Multi-character Substitutions
+
+ ```python
+ # Custom costs with multi-character substitution
+ wl = WeightedLevenshtein(substitution_costs={("In", "h"): 0.5})
+
+ print(wl.distance("hi", "Ini")) # 0.5
+ ```
+
+
  ## Acknowledgements
 
  This project is inspired by [jellyfish](https://github.com/jamesturk/jellyfish), providing the base implementations of the algorithms used here.
@@ -1,10 +1,10 @@
- ocr_stringdist-0.2.1.dist-info/METADATA,sha256=dIjhLqdKIzSgqyX45jQ6mZTzZjm3UOcghfe2zYoKeS0,3320
- ocr_stringdist-0.2.1.dist-info/WHEEL,sha256=iXfRWk7-127zCPB-_BNFDQE-qLd9Rsj-fJMRKNRg-kg,106
- ocr_stringdist-0.2.1.dist-info/licenses/LICENSE,sha256=5BPRcjlnbl2t4TidSgpfGrtC_birSf8JlZfA-qmVoQE,1072
+ ocr_stringdist-0.2.2.dist-info/METADATA,sha256=2KjG6DHqpsannN0lPK4EwkYBbY3adZrl1oTCq-elnL8,3868
+ ocr_stringdist-0.2.2.dist-info/WHEEL,sha256=iXfRWk7-127zCPB-_BNFDQE-qLd9Rsj-fJMRKNRg-kg,106
+ ocr_stringdist-0.2.2.dist-info/licenses/LICENSE,sha256=5BPRcjlnbl2t4TidSgpfGrtC_birSf8JlZfA-qmVoQE,1072
  ocr_stringdist/__init__.py,sha256=ApxqraLRcWAkzXhGJXSf3EqGEVFbxghrYrfJ9dmQjQU,467
- ocr_stringdist/_rust_stringdist.cpython-313-darwin.so,sha256=nzhPB5QAg1-oCplNhL5Mw9rGLIrseTpHZJP4ZsUkpcQ,708016
+ ocr_stringdist/_rust_stringdist.cpython-313-darwin.so,sha256=hrK-1mgOjJZNE8JEOs1ynxKGghzMoBlYGTeRYkv1Bcs,708016
  ocr_stringdist/default_ocr_distances.py,sha256=oSu-TpHjPA4jxKpLAfmap8z0ZsC99jsOjnRVHW7Hj_Y,1033
  ocr_stringdist/levenshtein.py,sha256=Jypg31BQyULipJ_Yh3dcBQDKNnbvEIlmf28tDr_gySw,11243
  ocr_stringdist/matching.py,sha256=rr8R63Ttu2hTf5Mni7_P8aGBbjWs6t2QPV3wxKXspAs,3293
  ocr_stringdist/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
- ocr_stringdist-0.2.1.dist-info/RECORD,,
+ ocr_stringdist-0.2.2.dist-info/RECORD,,