PyPI - ocr-stringdist - Versions diffs - 0.2.2__tar.gz → 1.0.0__tar.gz - Mend

ocr-stringdist 0.2.2tar.gz → 1.0.0tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (56)

{ocr_stringdist-0.2.2 → ocr_stringdist-1.0.0}/.github/workflows/CI.yml RENAMED

@@ -13,10 +13,12 @@ permissions:
 jobs:
   lint_and_test:
+    name: Lint & Test
     runs-on: ubuntu-latest
     strategy:
+      fail-fast: false
       matrix:
-        python-version: ["3.9", "3.13", "pypy3.11"]
+        python-version: ["3.9", "3.13"]
     steps:
       - uses: actions/checkout@v4
         with:
@@ -24,17 +26,8 @@ jobs:
       - uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
-      - name: Build wheels
-        uses: PyO3/maturin-action@v1
-        with:
-          target: ${{ matrix.target }}
-          args: --release --out dist -i ${{ matrix.python-version }}
-          sccache: "true"
-      - name: Install Just
-        uses: extractions/setup-just@v2
       - name: Run Cargo Tests
-        run: |
-          cargo test
+        run: cargo test --features python -- --nocapture --test-threads=1
       - name: Run pytest
         run: |
           # just venv pytest
@@ -43,20 +36,21 @@ jobs:
           . .venv/bin/activate
           .venv/bin/pip install wheel pytest maturin
           maturin develop
-          .venv/bin/pytest
+          .venv/bin/pytest python/tests
   linux:
+    name: Build Wheels (Linux)
     runs-on: ubuntu-latest
     needs: lint_and_test
     strategy:
       matrix:
         platform:
           - target: x64
-            interpreter: 3.9 3.10 3.11 3.12 3.13 pypy3.9 pypy3.10 pypy3.11
+            interpreter: 3.9 3.10 3.11 3.12 3.13
           - target: aarch64
-            interpreter: 3.9 3.10 3.11 3.12 3.13 pypy3.9 pypy3.10 pypy3.11
+            interpreter: 3.9 3.10 3.11 3.12 3.13
           - target: armv7
-            interpreter: 3.9 3.10 3.11 3.12 3.13 pypy3.9 pypy3.10 pypy3.11
+            interpreter: 3.9 3.10 3.11 3.12 3.13
     steps:
       - uses: actions/checkout@v4
         with:
@@ -71,9 +65,11 @@ jobs:
       - name: Upload wheels
         uses: actions/upload-artifact@v4
         with:
-          name: wheels-linux-${{ strategy.job-index }}
+          name: wheels-linux-${{ matrix.platform.target }}
           path: dist
   musllinux:
+    name: Build Wheels (musllinux)
     runs-on: ubuntu-latest
     needs: lint_and_test
     strategy:
@@ -81,13 +77,13 @@ jobs:
         platform:
           - target: x86_64-unknown-linux-musl
             arch: x86_64
-            interpreter: 3.9 3.10 3.11 3.12 3.13 pypy3.9 pypy3.10 pypy3.11
+            interpreter: 3.9 3.10 3.11 3.12 3.13
           - target: i686-unknown-linux-musl
             arch: x86
-            interpreter: 3.9 3.10 3.11 3.12 3.13 pypy3.9 pypy3.10 pypy3.11
+            interpreter: 3.9 3.10 3.11 3.12 3.13
           - target: aarch64-unknown-linux-musl
             arch: aarch64
-            interpreter: 3.9 3.10 3.11 3.12 3.13 pypy3.9 pypy3.10 pypy3.11
+            interpreter: 3.9 3.10 3.11 3.12 3.13
         # all values: [x86_64, x86, aarch64, armhf, armv7, ppc64le, riscv64, s390x]
         # { target: "armv7-unknown-linux-musleabihf", image_tag: "armv7" },
         # { target: "powerpc64le-unknown-linux-musl", image_tag: "ppc64le" },
@@ -107,10 +103,11 @@ jobs:
       - name: Upload wheels
         uses: actions/upload-artifact@v4
         with:
-          name: wheels-musl-${{ strategy.job-index }}
+          name: wheels-musl-${{ matrix.platform.arch }}
           path: dist
   windows:
+    name: Build Wheels (Windows)
     runs-on: windows-latest
     needs: lint_and_test
     strategy:
@@ -124,6 +121,16 @@ jobs:
       - uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.interpreter }}
+          architecture: ${{ matrix.target }}
+      - name: Ensure pythonXY.lib exists (for PyO3 on Windows)
+        shell: pwsh
+        run: |
+          $py = "${{ matrix.interpreter }}"
+          $libPath = "${{ env.pythonLocation }}\\libs\\python$($py.Replace('.', '')).lib"
+          if (!(Test-Path $libPath)) {
+            Write-Host "pythonXY.lib missing, generating..."
+            & "${{ env.pythonLocation }}\\python.exe" -c "import distutils.sysconfig, shutil, sys; libdir = distutils.sysconfig.get_config_var('LIBDIR'); libname = distutils.sysconfig.get_config_var('LDLIBRARY'); src = libdir + '\\\\' + libname; dst = sys.prefix + '\\\\libs\\\\' + libname; shutil.copyfile(src, dst)"
+          }
       - name: Build wheels
         uses: PyO3/maturin-action@v1
         with:
@@ -133,19 +140,20 @@ jobs:
       - name: Upload wheels
         uses: actions/upload-artifact@v4
         with:
+          name: wheels-win-${{ matrix.target }}-${{ matrix.interpreter }}
           path: dist
-          name: wheels-win-${{ strategy.job-index }}
   macos:
+    name: Build Wheels (macOS)
     runs-on: macos-latest
     needs: lint_and_test
     strategy:
       matrix:
         platform:
           - target: x64
-            interpreter: 3.9 3.10 3.11 3.12 3.13 pypy3.9 pypy3.10 pypy3.11
+            interpreter: 3.9 3.10 3.11 3.12 3.13
           - target: aarch64
-            interpreter: 3.9 3.10 3.11 3.12 3.13 pypy3.9 pypy3.10 pypy3.11
+            interpreter: 3.9 3.10 3.11 3.12 3.13
     steps:
       - uses: actions/checkout@v4
         with:
@@ -159,10 +167,11 @@ jobs:
       - name: Upload wheels
         uses: actions/upload-artifact@v4
         with:
-          name: wheels-mac-${{ strategy.job-index }}
+          name: wheels-mac-${{ matrix.platform.target }}
           path: dist
   sdist:
+    name: Build Source Distribution
     runs-on: ubuntu-latest
     needs: lint_and_test
     steps:
@@ -177,23 +186,30 @@ jobs:
       - name: Upload sdist
         uses: actions/upload-artifact@v4
         with:
-          name: wheels-sdist-${{ strategy.job-index }}
+          name: wheels-sdist
           path: dist
   release:
-    name: Release
+    name: Release to PyPI
     runs-on: ubuntu-latest
     if: "startsWith(github.ref, 'refs/tags/')"
     needs: [linux, windows, macos, sdist, musllinux]
     steps:
-      - uses: actions/download-artifact@v4
+      - name: Download all wheels and sdist
+        uses: actions/download-artifact@v4
         with:
           pattern: wheels-*
           merge-multiple: true
+      - name: Move packages to dist directory
+        run: |
+          mkdir -p dist
+          mv *.whl *.tar.gz dist/
+          echo "Moved packages to dist/:"
+          ls -l dist
       - name: Publish to PyPI
         uses: PyO3/maturin-action@v1
         env:
           MATURIN_PYPI_TOKEN: ${{ secrets.PYPI_API_TOKEN }}
         with:
           command: upload
-          args: --skip-existing *
+          args: --skip-existing dist/*

ocr_stringdist-1.0.0/CHANGELOG.md ADDED

@@ -0,0 +1,61 @@
+# Changelog
+All notable changes to this project will be documented in this file.
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [1.0.0] - 2025-09-20
+### Changed
+- Rename "Learner" to "CostLearner".
+- Rework and fix the cost learning algorithm.
+- Remove `with_cost_function` from `CostLearner`.
+- Remove the functional interface in favour of `WeightedLevenshtein` class.
+### Added
+- Add `calculate_for_unseen` parameter to `CostLearner.fit()`.
+- Add input validation in `WeightedLevenshtein.__init__`.
+- Add `to_dict` and `from_dict` methods to `WeightedLevenshtein`.
+## [0.3.0] - 2025-09-14
+### Added
+- Add the option to include the matched characters in the `explain` method via the `filter_matches` parameter.
+- Add the option to learn the costs from a dataset of pairs (OCR result, ground truth) via the `WeightedLevenshtein.learn_from` method and the `Learner` class.
+### Changed
+- Drop support for PyPy due to issues with PyO3.
+## [0.2.2] - 2025-09-01
+### Changed
+- Improve documentation.
+## [0.2.1] - 2025-08-31
+### Fixed
+- Documentation for PyPI
+## [0.2.0] - 2025-08-31
+### Added
+- `WeightedLevenshtein` class for reusable configuration.
+- Explanation of edit operations via `WeightedLevenshtein.explain` and `explain_weighted_levenshtein`.
+## [0.1.0] - 2025-04-26
+### Added
+- Custom insertion and deletion costs for weighted Levenshtein distance.
+### Changed
+- Breaking changes to Levenshtein distance functions signatures.

{ocr_stringdist-0.2.2 → ocr_stringdist-1.0.0}/Cargo.lock RENAMED

@@ -74,7 +74,7 @@ dependencies = [
 [[package]]
 name = "ocr_stringdist"
-version = "0.2.2"
+version = "1.0.0"
 dependencies = [
  "pyo3",
  "rayon",

{ocr_stringdist-0.2.2 → ocr_stringdist-1.0.0}/Cargo.toml RENAMED

@@ -1,6 +1,6 @@
 [package]
 name = "ocr_stringdist"
-version = "0.2.2"
+version = "1.0.0"
 edition = "2021"
 description = "String distances considering OCR errors."
 authors = ["Niklas von Moers <niklasvmoers@protonmail.com>"]
@@ -14,7 +14,7 @@ name = "ocr_stringdist"
 crate-type = ["cdylib"]
 [dependencies]
-pyo3 = { version = "0.24.0", features = [] }
+pyo3 = { version = "0.24.0", features = ["auto-initialize"] }
 rayon = "1.10.0"
 [features]

ocr_stringdist-1.0.0/Justfile ADDED

@@ -0,0 +1,41 @@
+venv:
+    rm -rf .venv
+    uv venv
+    uv sync --all-groups
+pytest:
+    uv run maturin develop
+    uv run pytest --cov=python/ocr_stringdist python/tests
+test:
+    cargo llvm-cov --features python
+    #cargo test --features python
+mypy:
+    uv run mypy .
+lint:
+    uv run ruff check . --fix
+doc:
+    uv run make -C docs html
+# Usage: just release v1.0.0
+# Make sure to update the version in Cargo.toml first.
+release version:
+    # Fail if the current branch is not 'main'
+    @if [ "$(git symbolic-ref --short HEAD)" != "main" ]; then \
+        echo "Error: Must be on 'main' branch to release."; \
+        exit 1; \
+    fi
+    # Fail if the working directory is not clean
+    @if ! git diff --quiet --exit-code; then \
+        echo "Error: Working directory is not clean. Commit or stash changes before releasing."; \
+        exit 1; \
+    fi
+    git tag -a {{version}} -m "Release version {{version}}"
+    git push origin {{version}}
+    @echo "Successfully tagged and pushed version {{version}}"

ocr_stringdist-1.0.0/PKG-INFO ADDED

@@ -0,0 +1,94 @@
+Metadata-Version: 2.4
+Name: ocr-stringdist
+Version: 1.0.0
+Classifier: Programming Language :: Rust
+Classifier: Programming Language :: Python :: Implementation :: CPython
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+License-File: LICENSE
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
+Project-URL: repository, https://github.com/NiklasvonM/ocr-stringdist
+Project-URL: documentation, https://niklasvonm.github.io/ocr-stringdist/
+# OCR-StringDist
+A Python library to learn, model, explain and correct OCR errors using a fast string distance engine.
+Documentation: https://niklasvonm.github.io/ocr-stringdist/
+[![PyPI badge](https://badge.fury.io/py/ocr-stringdist.svg)](https://badge.fury.io/py/ocr-stringdist)
+[![License](https://img.shields.io/badge/License-MIT-green)](LICENSE)
+## Overview
+Standard string distances (like Levenshtein) treat all character substitutions equally. This is suboptimal for text read from images via OCR, where errors like `O` vs `0` are far more common than, say, `O` vs `X`.
+OCR-StringDist provides a learnable **weighted Levenshtein distance**, implementing part of the **Noisy Channel model**.
+**Example:** Matching against the correct word `CODE`:
+* **Standard Levenshtein:**
+    * $d(\text{"CODE"}, \text{"C0DE"}) = 1$ (O → 0)
+    * $d(\text{"CODE"}, \text{"CXDE"}) = 1$ (O → X)
+    * Result: Both appear equally likely/distant.
+* **OCR-StringDist (Channel Model):**
+    * $d(\text{"CODE"}, \text{"C0DE"}) \approx 0.1$ (common error, low cost)
+    * $d(\text{"CODE"}, \text{"CXDE"}) = 1.0$ (unlikely error, high cost)
+    * Result: Correctly identifies `C0DE` as a much closer match.
+This makes it ideal for matching potentially incorrect OCR output against known values (e.g., product codes). By combining this *channel model* with a *source model* (e.g., product code frequencies), you can build a complete and robust OCR correction system.
+## Installation
+```bash
+pip install ocr-stringdist
+```
+## Features
+- **Learnable Costs**: Automatically learn substitution, insertion, and deletion costs from a dataset of (OCR string, ground truth string) pairs.
+- **Weighted Levenshtein Distance**: Models OCR error patterns by assigning custom costs to specific edit operations.
+- **High Performance**: Core logic in Rust and a batch_distance function for efficiently comparing one string against thousands of candidates.
+- **Substitution of Multiple Characters**: Not just character pairs, but string pairs may be substituted, for example the Korean syllable "이" for the two letters "OI".
+- **Explainable Edit Path**: Returns the optimal sequence of edit operations (substitutions, insertions, and deletions) used to transform one string into another.
+- **Pre-defined OCR Distance Map**: A built-in distance map for common OCR confusions (e.g., "0" vs "O", "1" vs "l", "5" vs "S").
+- **Full Unicode Support**: Works with arbitrary Unicode strings.
+## Core Workflow
+The typical workflow involves
+- learning costs from your data and then
+- using the resulting model to find the best match from a list of candidates.
+```python
+from ocr_stringdist import WeightedLevenshtein
+# 1. LEARN costs from your own data
+training_data = [
+    ("128", "123"),
+    ("567", "567"),
+]
+wl = WeightedLevenshtein.learn_from(training_data)
+# The engine has now learned that '8' -> '3' is a low-cost substitution
+print(f"Learned cost for ('8', '3'): {wl.substitution_costs[('8', '3')]:.2f}")
+# 2. MATCH new OCR output against a list of candidates
+ocr_output = "Product Code 128"
+candidates = [
+    "Product Code 123",
+    "Product Code 523",  # '5' -> '1' is an unlikely error
+]
+distances = wl.batch_distance(ocr_output, candidates)
+# Find the best match
+min_distance = min(distances)
+best_match = candidates[distances.index(min_distance)]
+print(f"Best match for '{ocr_output}': '{best_match}' (Cost: {min_distance:.2f})")
+```

ocr_stringdist-1.0.0/README.md ADDED

@@ -0,0 +1,80 @@
+# OCR-StringDist
+A Python library to learn, model, explain and correct OCR errors using a fast string distance engine.
+Documentation: https://niklasvonm.github.io/ocr-stringdist/
+[![PyPI badge](https://badge.fury.io/py/ocr-stringdist.svg)](https://badge.fury.io/py/ocr-stringdist)
+[![License](https://img.shields.io/badge/License-MIT-green)](LICENSE)
+## Overview
+Standard string distances (like Levenshtein) treat all character substitutions equally. This is suboptimal for text read from images via OCR, where errors like `O` vs `0` are far more common than, say, `O` vs `X`.
+OCR-StringDist provides a learnable **weighted Levenshtein distance**, implementing part of the **Noisy Channel model**.
+**Example:** Matching against the correct word `CODE`:
+* **Standard Levenshtein:**
+    * $d(\text{"CODE"}, \text{"C0DE"}) = 1$ (O → 0)
+    * $d(\text{"CODE"}, \text{"CXDE"}) = 1$ (O → X)
+    * Result: Both appear equally likely/distant.
+* **OCR-StringDist (Channel Model):**
+    * $d(\text{"CODE"}, \text{"C0DE"}) \approx 0.1$ (common error, low cost)
+    * $d(\text{"CODE"}, \text{"CXDE"}) = 1.0$ (unlikely error, high cost)
+    * Result: Correctly identifies `C0DE` as a much closer match.
+This makes it ideal for matching potentially incorrect OCR output against known values (e.g., product codes). By combining this *channel model* with a *source model* (e.g., product code frequencies), you can build a complete and robust OCR correction system.
+## Installation
+```bash
+pip install ocr-stringdist
+```
+## Features
+- **Learnable Costs**: Automatically learn substitution, insertion, and deletion costs from a dataset of (OCR string, ground truth string) pairs.
+- **Weighted Levenshtein Distance**: Models OCR error patterns by assigning custom costs to specific edit operations.
+- **High Performance**: Core logic in Rust and a batch_distance function for efficiently comparing one string against thousands of candidates.
+- **Substitution of Multiple Characters**: Not just character pairs, but string pairs may be substituted, for example the Korean syllable "이" for the two letters "OI".
+- **Explainable Edit Path**: Returns the optimal sequence of edit operations (substitutions, insertions, and deletions) used to transform one string into another.
+- **Pre-defined OCR Distance Map**: A built-in distance map for common OCR confusions (e.g., "0" vs "O", "1" vs "l", "5" vs "S").
+- **Full Unicode Support**: Works with arbitrary Unicode strings.
+## Core Workflow
+The typical workflow involves
+- learning costs from your data and then
+- using the resulting model to find the best match from a list of candidates.
+```python
+from ocr_stringdist import WeightedLevenshtein
+# 1. LEARN costs from your own data
+training_data = [
+    ("128", "123"),
+    ("567", "567"),
+]
+wl = WeightedLevenshtein.learn_from(training_data)
+# The engine has now learned that '8' -> '3' is a low-cost substitution
+print(f"Learned cost for ('8', '3'): {wl.substitution_costs[('8', '3')]:.2f}")
+# 2. MATCH new OCR output against a list of candidates
+ocr_output = "Product Code 128"
+candidates = [
+    "Product Code 123",
+    "Product Code 523",  # '5' -> '1' is an unlikely error
+]
+distances = wl.batch_distance(ocr_output, candidates)
+# Find the best match
+min_distance = min(distances)
+best_match = candidates[distances.index(min_distance)]
+print(f"Best match for '{ocr_output}': '{best_match}' (Cost: {min_distance:.2f})")
+```
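
Note on the README's `CODE` example: the contrast between standard and weighted Levenshtein can be reproduced with a few lines of plain Python. The sketch below is not the library's Rust implementation. It is a minimal dynamic-programming illustration that assumes single-character substitutions only, a hypothetical function name `weighted_levenshtein`, and a default cost of 1.0 for every operation that has no entry in the cost map.

```python
from typing import Dict, Tuple


def weighted_levenshtein(
    source: str,
    target: str,
    substitution_costs: Dict[Tuple[str, str], float],
    default_cost: float = 1.0,
) -> float:
    """Illustrative weighted Levenshtein distance (single characters only)."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]

    # First column: delete every source character.
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + default_cost
    # First row: insert every target character.
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + default_cost

    for i in range(1, rows):
        for j in range(1, cols):
            s, t = source[i - 1], target[j - 1]
            # Matching characters are free; otherwise look up the learned
            # or configured cost and fall back to the default.
            sub = 0.0 if s == t else substitution_costs.get((s, t), default_cost)
            dist[i][j] = min(
                dist[i - 1][j] + default_cost,  # deletion
                dist[i][j - 1] + default_cost,  # insertion
                dist[i - 1][j - 1] + sub,       # substitution or match
            )
    return dist[-1][-1]


# Assumed cost map: 'O' misread as '0' is cheap, everything else costs 1.0.
ocr_costs = {("O", "0"): 0.1}

print(weighted_levenshtein("CODE", "C0DE", {}))         # 1.0 (standard)
print(weighted_levenshtein("CODE", "CXDE", {}))         # 1.0 (standard)
print(weighted_levenshtein("CODE", "C0DE", ocr_costs))  # 0.1
print(weighted_levenshtein("CODE", "CXDE", ocr_costs))  # 1.0
```

With an empty cost map both candidates are one edit away. With the assumed `ocr_costs` map, `C0DE` becomes the clearly closer match, which is the behaviour the README describes.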

{ocr_stringdist-0.2.2 → ocr_stringdist-1.0.0}/docs/source/api/index.rst RENAMED

@@ -1,14 +1,17 @@
 .. _api_reference:
-API Reference
-=============
+===============
+ API Reference
+===============
 .. autoclass:: ocr_stringdist.WeightedLevenshtein
    :members:
-.. autofunction:: ocr_stringdist.weighted_levenshtein_distance
-.. autofunction:: ocr_stringdist.batch_weighted_levenshtein_distance
-.. autofunction:: ocr_stringdist.explain_weighted_levenshtein
+.. autoclass:: ocr_stringdist.learner.CostLearner
+   :members:
+.. autoclass:: ocr_stringdist.edit_operation.EditOperation
+   :members:
 .. automodule:: ocr_stringdist.matching
    :members:

ocr_stringdist-1.0.0/docs/source/cost_learning_model.rst ADDED

@@ -0,0 +1,62 @@
+=====================
+ Cost Learning Model
+=====================
+The ``CostLearner`` class calculates edit costs using a probabilistic model. The cost of an edit operation is defined by its **surprisal**: a measure of how unlikely that event is based on the training data. This value, derived from the negative log-likelihood :math:`-\log(P(e))`, quantifies the information contained in observing an event :math:`e`.
+A common, high-probability error will have low surprisal and thus a low cost. A rare, low-probability error will have high surprisal and a high cost.
+-------------------
+Probabilistic Model
+-------------------
+The model estimates the probability of edit operations and transforms them into normalized, comparable costs. The smoothing parameter :math:`k` (set via ``with_smoothing()``) allows for a continuous transition between a Maximum Likelihood Estimation and a smoothed Bayesian model.
+General Notation
+~~~~~~~~~~~~~~~~
+- :math:`c(e)`: The observed count of a specific event :math:`e`. For example, :math:`c(s \to t)` is the count of source character :math:`s` being substituted by target character :math:`t`.
+- :math:`C(x)`: The total count of a specific context character :math:`x`. For example, :math:`C(s)` is the total number of times the source character :math:`s` appeared in the OCR outputs.
+- :math:`V`: The total number of unique characters in the vocabulary.
+Probability of an Edit Operation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The model treats all edit operations within the same probabilistic framework. An insertion is modeled as a substitution from a ground-truth character to an "empty" character, and a deletion is a substitution from an OCR character to an empty character.
+This means that for any given character (either from the source or the target), there are :math:`V+1` possible outcomes: a transformation into any of the :math:`V` vocabulary characters or a transformation into an empty character.
+The smoothed conditional probability for any edit event :math:`e` given a context character :math:`x` (where :math:`x` is a source character for substitutions/deletions or a target character for insertions) is:
+.. math:: P(e|x) = \frac{c(e) + k}{C(x) + k \cdot (V+1)}
+Bayesian Interpretation
+~~~~~~~~~~~~~~~~~~~~~~~
+When :math:`k > 0`, the parameter acts as the concentration parameter of a **symmetric Dirichlet prior distribution**. This represents a prior belief that every possible error is equally likely and has a "pseudo-count" of `k`.
+Normalization
+~~~~~~~~~~~~~
+The costs are normalized by a ceiling :math:`Z` that depends on the size of the unified outcome space. It is the a priori surprisal of any single event, assuming a uniform probability distribution over all :math:`V+1` possible outcomes.
+.. math:: Z = -\log(\frac{1}{V+1}) = \log(V+1)
+This normalization contextualizes the cost relative to the complexity of the character set.
+Final Cost
+~~~~~~~~~~
+The final cost :math:`w(e)` is the base surprisal scaled by the normalization ceiling:
+.. math:: w(e) = \frac{-\log(P(e|x))}{Z}
+This cost is a relative measure. Costs can be greater than 1.0, which indicates the observed event was less probable than the uniform a priori assumption.
+Asymptotic Properties
+~~~~~~~~~~~~~~~~~~~~~
+As the amount of training data grows, the learned cost for an operation with a stable frequency ("share") converges to a fixed value - independent of :math:`k`:
+.. math:: w(e) \approx \frac{-\log(\text{share})}{\log(V+1)}
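
Note on the cost formulas in `cost_learning_model.rst`: the smoothed probability, the normalization ceiling and the final cost can be checked numerically with the standard library alone. The following is a sketch of the formulas as written in the document, not the `CostLearner` code itself. The function name `edit_cost`, its parameter names, and the choice to return infinity for a zero probability are assumptions made for illustration.

```python
import math


def edit_cost(event_count: int, context_count: int, vocab_size: int, k: float = 1.0) -> float:
    """Normalized surprisal w(e) = -log(P(e|x)) / log(V + 1).

    event_count   -- c(e), how often the edit was observed
    context_count -- C(x), how often the context character was observed
    vocab_size    -- V, number of unique characters
    k             -- smoothing parameter (pseudo-count per outcome)
    """
    outcomes = vocab_size + 1  # V characters plus the "empty" character
    denominator = context_count + k * outcomes
    if denominator <= 0:
        raise ValueError("no observations and no smoothing: probability undefined")
    probability = (event_count + k) / denominator
    if probability == 0.0:
        # Assumption of this sketch: an unseen event without smoothing
        # is treated as infinitely surprising.
        return math.inf
    ceiling = math.log(outcomes)  # Z = log(V + 1)
    return -math.log(probability) / ceiling


# An event as likely as the uniform prior costs about 1.0:
# c = 25, C = 100, V = 3, k = 1 gives P = 26 / 104 = 0.25 = 1 / (V + 1).
print(edit_cost(25, 100, vocab_size=3))

# Asymptotic behaviour: an edit that makes up a stable 50% share of its
# context converges to -log(0.5) / log(V + 1), regardless of k.
vocab_size = 99
limit = -math.log(0.5) / math.log(vocab_size + 1)
for total in (100, 10_000, 1_000_000):
    cost = edit_cost(total // 2, total, vocab_size, k=1.0)
    print(f"C(x) = {total:>9}: w(e) = {cost:.4f} (limit {limit:.4f})")
```

With little data the smoothing term pulls the estimate toward the uniform prior, so the cost of a frequent edit starts above its limit and approaches it as the counts grow.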
