Fuzzylookup 0.0.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- fuzzylookup-0.0.1/Fuzzylookup.egg-info/PKG-INFO +80 -0
- fuzzylookup-0.0.1/Fuzzylookup.egg-info/SOURCES.txt +10 -0
- fuzzylookup-0.0.1/Fuzzylookup.egg-info/dependency_links.txt +1 -0
- fuzzylookup-0.0.1/Fuzzylookup.egg-info/top_level.txt +1 -0
- fuzzylookup-0.0.1/LICENSE +21 -0
- fuzzylookup-0.0.1/PKG-INFO +80 -0
- fuzzylookup-0.0.1/README.md +70 -0
- fuzzylookup-0.0.1/fuzzylookup/__init__.py +4 -0
- fuzzylookup-0.0.1/fuzzylookup/core.py +306 -0
- fuzzylookup-0.0.1/fuzzylookup/setup.py +15 -0
- fuzzylookup-0.0.1/pyproject.toml +15 -0
- fuzzylookup-0.0.1/setup.cfg +4 -0
@@ -0,0 +1,80 @@
+Metadata-Version: 2.4
+Name: Fuzzylookup
+Version: 0.0.1
+Summary: A package for Fuzzy Lookup
+Project-URL: Homepage, https://github.com/Moda141/Fuzzylookup
+Requires-Python: >=3.7
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Dynamic: license-file
+
+# FuzzyLookup Documentation
+
+## Overview
+`fuzzylookup` (version 0.1.0) is a Python package designed for fuzzy matching and lookups against CSV or Excel datasets [cite: 1, 2, 3]. It explicitly supports both Arabic and English text [cite: 2, 3]. Under the hood, it leverages `pandas` for data handling and `rapidfuzz` for high-performance string matching [cite: 2, 3].
+
+## Installation
+The package requires Python 3.8 or higher [cite: 3].
+Dependencies include:
+* `pandas>=1.3` [cite: 3]
+* `openpyxl>=3.0` (for Excel support) [cite: 3]
+* `rapidfuzz>=3.0` [cite: 3]
+
+## Core Features
+
+### 1. Arabic Text Normalization
+By default, `fuzzylookup` applies normalization to Arabic text (`normalize_arabic=True`) [cite: 2]. This process improves match quality by:
+* Removing *tashkeel* (diacritics) [cite: 2].
+* Normalizing Alef variants (أ, إ, آ, ٱ) to a standard Alef (ا) [cite: 2].
+* Normalizing Teh marbuta (ة) to Heh (ه) [cite: 2].
+* Normalizing Alef maqsura (ى) to Yeh (ي) [cite: 2].
+
+### 2. Positional Name-Aware Scoring
+A standout feature is the `name_aware` scoring algorithm [cite: 2]. Standard fuzzy matching algorithms might score "محمد كمال" and "كمال محمد" similarly, but `fuzzylookup` can punish wrong token orders when `name_aware=True` is enabled [cite: 2].
+* It compares names token-by-token in order [cite: 2].
+* The first token accounts for 60% of the score, and the remaining tokens account for 40% [cite: 2].
+* It blends this positional score with `WRatio` to handle both typos and correct word order gracefully [cite: 2].
+* If WRatio exceeds the positional score by more than 15 points, it indicates token reordering is inflating the score, triggering a penalty [cite: 2].
+
+## API Reference
+
+### `FuzzyLookup` Class
+The primary entry point is the `FuzzyLookup` class [cite: 1, 2].
+
+**Parameters:**
+* `source` (str, Path, or pd.DataFrame): Path to the CSV/Excel file or an existing DataFrame [cite: 2].
+* `column` (str): The column name to match against [cite: 2].
+* `scorer` (str): Matching algorithm to use (`ratio`, `partial`, `token_sort`, `token_set`, `wratio`). Defaults to `wratio` [cite: 2].
+* `normalize_arabic` (bool): Whether to strip diacritics and normalize characters. Defaults to `True` [cite: 2].
+* `name_aware` (bool): Enables positional name scoring. Recommended for full person names. Defaults to `False` [cite: 2].
+* `encoding` (str): File encoding for CSVs. Defaults to `utf-8` [cite: 2].
+
+### Methods
+
+* `lookup(query: str, top_n: int = 5, min_score: float = 0.0, columns: list = None)`: Returns the top `N` best matches for the search string [cite: 2]. Returns a list of dictionaries with row data, the calculated `score`, and `_index` [cite: 2].
+* `lookup_best(query: str, min_score: float = 0.0, columns: list = None)`: Returns only the single best match, or `None` if no match meets the minimum score [cite: 2].
+* `lookup_many(queries: list, top_n: int = 1, min_score: float = 0.0, columns: list = None)`: Batch lookup for multiple queries. Returns a dictionary mapping each query to its list of results [cite: 2].
+
+### Properties
+* `columns`: Returns a list of the columns available in the loaded dataframe [cite: 2].
+* `shape`: Returns a tuple representing the shape of the underlying dataframe [cite: 2].
+
+## Workflow Example
+
+Below is a typical workflow demonstrating how to instantiate the lookup object and perform queries [cite: 2]:
+
+```python
+from fuzzylookup import FuzzyLookup
+
+# 1. Initialize the lookup instance with a dataset
+fl = FuzzyLookup("names.csv", column="name", name_aware=True)
+
+# 2. Perform a standard lookup for the top 3 matches
+results = fl.lookup("محمد كمال", top_n=3)
+
+# 3. Perform a strict lookup requiring a high match score
+best_match = fl.lookup_best("كمال محمد", min_score=85.0)
+
+# 4. Batch processing multiple names
+batch_results = fl.lookup_many(["أحمد علي", "محمود حسن"], top_n=1)
+```
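The Arabic normalization steps listed in the README text of this hunk can be sketched with the standard library alone. The function name `normalize_arabic_sketch` is hypothetical and chosen only for this illustration. The diacritic character class is the one that appears in `core.py` later in this diff. Treat it as a minimal sketch of the described steps, not as the package's own entry point.

```python
import re
import unicodedata

# Tashkeel (diacritics) character class, as it appears in core.py in this diff.
_TASHKEEL = re.compile(
    r"[\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06DC\u06DF-\u06E4\u06E7\u06E8\u06EA-\u06ED]"
)


def normalize_arabic_sketch(text: str) -> str:
    """Hypothetical stand-alone version of the README's normalization steps."""
    text = unicodedata.normalize("NFC", text)
    text = _TASHKEEL.sub("", text)        # remove tashkeel (diacritics)
    text = re.sub("[أإآٱ]", "ا", text)    # Alef variants -> bare Alef
    text = text.replace("ة", "ه")         # Teh marbuta -> Heh
    text = text.replace("ى", "ي")         # Alef maqsura -> Yeh
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


if __name__ == "__main__":
    for raw in ["أحمد", "فاطمة", "مصطفى"]:
        print(raw, "->", normalize_arabic_sketch(raw))
```

Both the query and the stored values would need to pass through the same function so that spelling variants compare as equal before any fuzzy scoring takes place.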
@@ -0,0 +1 @@
+
@@ -0,0 +1 @@
+fuzzylookup
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Mohammed kamal
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,80 @@
+Metadata-Version: 2.4
+Name: Fuzzylookup
+Version: 0.0.1
+Summary: A package for Fuzzy Lookup
+Project-URL: Homepage, https://github.com/Moda141/Fuzzylookup
+Requires-Python: >=3.7
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Dynamic: license-file
+
+# FuzzyLookup Documentation
+
+## Overview
+`fuzzylookup` (version 0.1.0) is a Python package designed for fuzzy matching and lookups against CSV or Excel datasets [cite: 1, 2, 3]. It explicitly supports both Arabic and English text [cite: 2, 3]. Under the hood, it leverages `pandas` for data handling and `rapidfuzz` for high-performance string matching [cite: 2, 3].
+
+## Installation
+The package requires Python 3.8 or higher [cite: 3].
+Dependencies include:
+* `pandas>=1.3` [cite: 3]
+* `openpyxl>=3.0` (for Excel support) [cite: 3]
+* `rapidfuzz>=3.0` [cite: 3]
+
+## Core Features
+
+### 1. Arabic Text Normalization
+By default, `fuzzylookup` applies normalization to Arabic text (`normalize_arabic=True`) [cite: 2]. This process improves match quality by:
+* Removing *tashkeel* (diacritics) [cite: 2].
+* Normalizing Alef variants (أ, إ, آ, ٱ) to a standard Alef (ا) [cite: 2].
+* Normalizing Teh marbuta (ة) to Heh (ه) [cite: 2].
+* Normalizing Alef maqsura (ى) to Yeh (ي) [cite: 2].
+
+### 2. Positional Name-Aware Scoring
+A standout feature is the `name_aware` scoring algorithm [cite: 2]. Standard fuzzy matching algorithms might score "محمد كمال" and "كمال محمد" similarly, but `fuzzylookup` can punish wrong token orders when `name_aware=True` is enabled [cite: 2].
+* It compares names token-by-token in order [cite: 2].
+* The first token accounts for 60% of the score, and the remaining tokens account for 40% [cite: 2].
+* It blends this positional score with `WRatio` to handle both typos and correct word order gracefully [cite: 2].
+* If WRatio exceeds the positional score by more than 15 points, it indicates token reordering is inflating the score, triggering a penalty [cite: 2].
+
+## API Reference
+
+### `FuzzyLookup` Class
+The primary entry point is the `FuzzyLookup` class [cite: 1, 2].
+
+**Parameters:**
+* `source` (str, Path, or pd.DataFrame): Path to the CSV/Excel file or an existing DataFrame [cite: 2].
+* `column` (str): The column name to match against [cite: 2].
+* `scorer` (str): Matching algorithm to use (`ratio`, `partial`, `token_sort`, `token_set`, `wratio`). Defaults to `wratio` [cite: 2].
+* `normalize_arabic` (bool): Whether to strip diacritics and normalize characters. Defaults to `True` [cite: 2].
+* `name_aware` (bool): Enables positional name scoring. Recommended for full person names. Defaults to `False` [cite: 2].
+* `encoding` (str): File encoding for CSVs. Defaults to `utf-8` [cite: 2].
+
+### Methods
+
+* `lookup(query: str, top_n: int = 5, min_score: float = 0.0, columns: list = None)`: Returns the top `N` best matches for the search string [cite: 2]. Returns a list of dictionaries with row data, the calculated `score`, and `_index` [cite: 2].
+* `lookup_best(query: str, min_score: float = 0.0, columns: list = None)`: Returns only the single best match, or `None` if no match meets the minimum score [cite: 2].
+* `lookup_many(queries: list, top_n: int = 1, min_score: float = 0.0, columns: list = None)`: Batch lookup for multiple queries. Returns a dictionary mapping each query to its list of results [cite: 2].
+
+### Properties
+* `columns`: Returns a list of the columns available in the loaded dataframe [cite: 2].
+* `shape`: Returns a tuple representing the shape of the underlying dataframe [cite: 2].
+
+## Workflow Example
+
+Below is a typical workflow demonstrating how to instantiate the lookup object and perform queries [cite: 2]:
+
+```python
+from fuzzylookup import FuzzyLookup
+
+# 1. Initialize the lookup instance with a dataset
+fl = FuzzyLookup("names.csv", column="name", name_aware=True)
+
+# 2. Perform a standard lookup for the top 3 matches
+results = fl.lookup("محمد كمال", top_n=3)
+
+# 3. Perform a strict lookup requiring a high match score
+best_match = fl.lookup_best("كمال محمد", min_score=85.0)
+
+# 4. Batch processing multiple names
+batch_results = fl.lookup_many(["أحمد علي", "محمود حسن"], top_n=1)
+```
@@ -0,0 +1,70 @@
+# FuzzyLookup Documentation
+
+## Overview
+`fuzzylookup` (version 0.1.0) is a Python package designed for fuzzy matching and lookups against CSV or Excel datasets [cite: 1, 2, 3]. It explicitly supports both Arabic and English text [cite: 2, 3]. Under the hood, it leverages `pandas` for data handling and `rapidfuzz` for high-performance string matching [cite: 2, 3].
+
+## Installation
+The package requires Python 3.8 or higher [cite: 3].
+Dependencies include:
+* `pandas>=1.3` [cite: 3]
+* `openpyxl>=3.0` (for Excel support) [cite: 3]
+* `rapidfuzz>=3.0` [cite: 3]
+
+## Core Features
+
+### 1. Arabic Text Normalization
+By default, `fuzzylookup` applies normalization to Arabic text (`normalize_arabic=True`) [cite: 2]. This process improves match quality by:
+* Removing *tashkeel* (diacritics) [cite: 2].
+* Normalizing Alef variants (أ, إ, آ, ٱ) to a standard Alef (ا) [cite: 2].
+* Normalizing Teh marbuta (ة) to Heh (ه) [cite: 2].
+* Normalizing Alef maqsura (ى) to Yeh (ي) [cite: 2].
+
+### 2. Positional Name-Aware Scoring
+A standout feature is the `name_aware` scoring algorithm [cite: 2]. Standard fuzzy matching algorithms might score "محمد كمال" and "كمال محمد" similarly, but `fuzzylookup` can punish wrong token orders when `name_aware=True` is enabled [cite: 2].
+* It compares names token-by-token in order [cite: 2].
+* The first token accounts for 60% of the score, and the remaining tokens account for 40% [cite: 2].
+* It blends this positional score with `WRatio` to handle both typos and correct word order gracefully [cite: 2].
+* If WRatio exceeds the positional score by more than 15 points, it indicates token reordering is inflating the score, triggering a penalty [cite: 2].
+
+## API Reference
+
+### `FuzzyLookup` Class
+The primary entry point is the `FuzzyLookup` class [cite: 1, 2].
+
+**Parameters:**
+* `source` (str, Path, or pd.DataFrame): Path to the CSV/Excel file or an existing DataFrame [cite: 2].
+* `column` (str): The column name to match against [cite: 2].
+* `scorer` (str): Matching algorithm to use (`ratio`, `partial`, `token_sort`, `token_set`, `wratio`). Defaults to `wratio` [cite: 2].
+* `normalize_arabic` (bool): Whether to strip diacritics and normalize characters. Defaults to `True` [cite: 2].
+* `name_aware` (bool): Enables positional name scoring. Recommended for full person names. Defaults to `False` [cite: 2].
+* `encoding` (str): File encoding for CSVs. Defaults to `utf-8` [cite: 2].
+
+### Methods
+
+* `lookup(query: str, top_n: int = 5, min_score: float = 0.0, columns: list = None)`: Returns the top `N` best matches for the search string [cite: 2]. Returns a list of dictionaries with row data, the calculated `score`, and `_index` [cite: 2].
+* `lookup_best(query: str, min_score: float = 0.0, columns: list = None)`: Returns only the single best match, or `None` if no match meets the minimum score [cite: 2].
+* `lookup_many(queries: list, top_n: int = 1, min_score: float = 0.0, columns: list = None)`: Batch lookup for multiple queries. Returns a dictionary mapping each query to its list of results [cite: 2].
+
+### Properties
+* `columns`: Returns a list of the columns available in the loaded dataframe [cite: 2].
+* `shape`: Returns a tuple representing the shape of the underlying dataframe [cite: 2].
+
+## Workflow Example
+
+Below is a typical workflow demonstrating how to instantiate the lookup object and perform queries [cite: 2]:
+
+```python
+from fuzzylookup import FuzzyLookup
+
+# 1. Initialize the lookup instance with a dataset
+fl = FuzzyLookup("names.csv", column="name", name_aware=True)
+
+# 2. Perform a standard lookup for the top 3 matches
+results = fl.lookup("محمد كمال", top_n=3)
+
+# 3. Perform a strict lookup requiring a high match score
+best_match = fl.lookup_best("كمال محمد", min_score=85.0)
+
+# 4. Batch processing multiple names
+batch_results = fl.lookup_many(["أحمد علي", "محمود حسن"], top_n=1)
+```
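The 60/40 positional weighting described in the README can be worked through in plain Python. In this sketch, `lcs_length`, `indel_ratio`, and `positional_score` are hypothetical names. `indel_ratio` is a pure-Python stand-in for a 0 to 100 similarity ratio, assumed here to behave like the normalized Indel similarity behind `rapidfuzz.fuzz.ratio`, so the sketch does not need the package or `rapidfuzz` to run. The blend with `WRatio` and the 15-point penalty rule are left out.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """Stand-in similarity on a 0-100 scale (assumption: Indel-based ratio)."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100.0 * 2 * lcs_length(a, b) / total


def positional_score(query: str, candidate: str,
                     first_weight: float = 0.6,
                     rest_weight: float = 0.4) -> float:
    """First token carries 60% of the score, the remaining tokens 40%."""
    q_tokens, c_tokens = query.split(), candidate.split()
    # With a single token on either side, fall back to a plain ratio.
    if len(q_tokens) <= 1 or len(c_tokens) <= 1:
        return indel_ratio(query, candidate)
    first = indel_ratio(q_tokens[0], c_tokens[0])
    # The remaining tokens are compared after sorting, so their order is ignored.
    rest = indel_ratio(" ".join(sorted(q_tokens[1:])),
                       " ".join(sorted(c_tokens[1:])))
    return first * first_weight + rest * rest_weight


if __name__ == "__main__":
    print(positional_score("محمد كمال", "محمد كمال"))  # same order
    print(positional_score("محمد كمال", "كمال محمد"))  # swapped order
```

Because the first token is compared only against the first token, the swapped pair scores well below the identical pair, which is the behaviour the README attributes to `name_aware=True`.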
@@ -0,0 +1,306 @@
+"""
+fuzzylookup - Fuzzy matching lookup for CSV/Excel datasets
+Supports Arabic and English text, with positional name-aware scoring.
+"""
+
+from __future__ import annotations
+
+import re
+import unicodedata
+from pathlib import Path
+from typing import Any, Optional, Union
+
+try:
+    import pandas as pd
+except ImportError:
+    raise ImportError("pandas is required: pip install pandas openpyxl")
+
+try:
+    from rapidfuzz import fuzz, process
+except ImportError:
+    raise ImportError("rapidfuzz is required: pip install rapidfuzz")
+
+
+# ---------------------------------------------------------------------------
+# Text normalization
+# ---------------------------------------------------------------------------
+
+def _normalize_arabic(text: str) -> str:
+    """Normalize Arabic text for better matching."""
+    if not isinstance(text, str):
+        return str(text)
+    text = unicodedata.normalize("NFC", text)
+    # Remove tashkeel (diacritics)
+    text = re.sub(r"[\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06DC\u06DF-\u06E4\u06E7\u06E8\u06EA-\u06ED]", "", text)
+    text = re.sub(r"[أإآٱ]", "ا", text)  # Alef variants
+    text = re.sub(r"ة", "ه", text)  # Teh marbuta
+    text = re.sub(r"ى", "ي", text)  # Alef maqsura
+    text = re.sub(r"\s+", " ", text).strip()
+    return text
+
+
+def _normalize(text: str, arabic: bool = True) -> str:
+    if not isinstance(text, str):
+        text = str(text)
+    text = text.strip().lower()
+    if arabic:
+        text = _normalize_arabic(text)
+    return text
+
+
+# ---------------------------------------------------------------------------
+# Scorer aliases
+# ---------------------------------------------------------------------------
+
+SCORERS = {
+    "ratio": fuzz.ratio,
+    "partial": fuzz.partial_ratio,
+    "token_sort": fuzz.token_sort_ratio,
+    "token_set": fuzz.token_set_ratio,
+    "wratio": fuzz.WRatio,
+}
+
+
+# ---------------------------------------------------------------------------
+# Positional Name Scoring
+# ---------------------------------------------------------------------------
+
+def _tokenize(name: str) -> list[str]:
+    tokens = name.strip().split()
+    return tokens if tokens else [""]
+
+
+def _positional_name_score(
+    query: str,
+    candidate: str,
+    first_weight: float = 0.6,
+    rest_weight: float = 0.4,
+) -> float:
+    """
+    Compare names token-by-token in order.
+
+    - First token vs first token → 60% of score
+    - Remaining tokens joined → 40% of score
+
+    Example:
+        "محمد كمال" vs "محمد كمال" → ~100
+        "محمد كمال" vs "كمال محمد" → ~50 (first tokens differ)
+        "محمد كمال" vs "محمد علي" → ~65 (first token matches, the rest differ)
+    """
+    q_tokens = _tokenize(query)
+    c_tokens = _tokenize(candidate)
+
+    # Single token on either side → plain ratio
+    if len(q_tokens) == 1 or len(c_tokens) == 1:
+        return fuzz.ratio(query, candidate)
+
+    # First token score
+    first_score = fuzz.ratio(q_tokens[0], c_tokens[0])
+
+    # Rest tokens (join and compare)
+    q_rest = " ".join(q_tokens[1:])
+    c_rest = " ".join(c_tokens[1:])
+    rest_score = fuzz.token_sort_ratio(q_rest, c_rest)
+
+    return (first_score * first_weight) + (rest_score * rest_weight)
+
+
+def _smart_name_score(query: str, candidate: str) -> float:
+    """
+    Blend positional + WRatio for best of both worlds:
+    - Positional punishes wrong token order (محمد كمال ≠ كمال محمد)
+    - WRatio handles typos and partial names
+
+    If WRatio >> positional by >15 pts → token reordering is inflating WRatio
+    → apply penalty so swapped names don't score the same as correct order.
+    """
+    positional = _positional_name_score(query, candidate)
+    wratio = fuzz.WRatio(query, candidate)
+
+    diff = wratio - positional
+    if diff > 15:
+        # WRatio is inflating because of token reordering — penalize
+        return (positional * 0.7) + (wratio * 0.3)
+    else:
+        return (positional * 0.5) + (wratio * 0.5)
+
+
+# ---------------------------------------------------------------------------
+# FuzzyLookup
+# ---------------------------------------------------------------------------
+
+class FuzzyLookup:
+    """
+    Fuzzy lookup over a CSV or Excel dataset.
+
+    Parameters
+    ----------
+    source : str | Path | pd.DataFrame
+        Path to CSV/Excel file, or an already-loaded DataFrame.
+    column : str
+        The column to match against.
+    scorer : str
+        Matching algorithm: ratio, partial, token_sort, token_set, wratio (default).
+    normalize_arabic : bool
+        Strip diacritics & normalize Arabic characters (default True).
+    name_aware : bool
+        Enable positional name scoring so that "محمد كمال" and "كمال محمد"
+        score differently. First token is weighted higher than the rest.
+        Recommended when the column contains full person names (default False).
+    encoding : str
+        File encoding for CSV (default 'utf-8').
+
+    Examples
+    --------
+    >>> fl = FuzzyLookup("names.csv", column="name", name_aware=True)
+    >>> fl.lookup("محمد كمال", top_n=3)
+    """
+
+    def __init__(
+        self,
+        source: Union[str, Path, "pd.DataFrame"],
+        column: str,
+        scorer: str = "wratio",
+        normalize_arabic: bool = True,
+        name_aware: bool = False,
+        encoding: str = "utf-8",
+    ):
+        self.column = column
+        self.scorer = SCORERS.get(scorer, fuzz.WRatio)
+        self.normalize_arabic = normalize_arabic
+        self.name_aware = name_aware
+
+        # Load data
+        if isinstance(source, (str, Path)):
+            path = Path(source)
+            if path.suffix.lower() in {".xlsx", ".xls"}:
+                self._df = pd.read_excel(path)
+            else:
+                self._df = pd.read_csv(path, encoding=encoding)
+        elif isinstance(source, pd.DataFrame):
+            self._df = source.copy()
+        else:
+            raise TypeError("source must be a file path or a pandas DataFrame")
+
+        if column not in self._df.columns:
+            raise ValueError(
+                f"Column '{column}' not found. Available: {list(self._df.columns)}"
+            )
+
+        self._choices: list[str] = (
+            self._df[column].fillna("").astype(str).tolist()
+        )
+        self._normalized_choices: list[str] = [
+            _normalize(c, arabic=self.normalize_arabic) for c in self._choices
+        ]
+
+    # ------------------------------------------------------------------
+    # Internal scoring
+    # ------------------------------------------------------------------
+
+    def _score(self, query: str, candidate: str) -> float:
+        """Score a query against a candidate string."""
+        if self.name_aware:
+            return _smart_name_score(query, candidate)
+        return self.scorer(query, candidate)
+
+    # ------------------------------------------------------------------
+    # Public API
+    # ------------------------------------------------------------------
+
+    def lookup(
+        self,
+        query: str,
+        top_n: int = 5,
+        min_score: float = 0.0,
+        columns: Optional[list[str]] = None,
+    ) -> list[dict[str, Any]]:
+        """
+        Return the top-N best matches for *query*.
+
+        Parameters
+        ----------
+        query : str — search string
+        top_n : int — max results (default 5)
+        min_score : float — minimum score 0–100 (default 0)
+        columns : list — which columns to return (default: all)
+        """
+        norm_query = _normalize(query, arabic=self.normalize_arabic)
+        cols = columns or list(self._df.columns)
+
+        if self.name_aware:
+            # Manual scoring loop (rapidfuzz process doesn't support custom scorers easily)
+            scored = [
+                (i, self._score(norm_query, cand))
+                for i, cand in enumerate(self._normalized_choices)
+            ]
+            scored = [(i, s) for i, s in scored if s >= min_score]
+            scored.sort(key=lambda x: x[1], reverse=True)
+            scored = scored[:top_n]
+
+            results = []
+            for idx, score in scored:
+                row = self._df.iloc[idx][cols].to_dict()
+                row["score"] = round(score, 2)
+                row["_index"] = int(idx)
+                results.append(row)
+        else:
+            matches = process.extract(
+                norm_query,
+                self._normalized_choices,
+                scorer=self.scorer,
+                limit=top_n,
+                score_cutoff=min_score,
+            )
+            results = []
+            for _matched_str, score, idx in matches:
+                row = self._df.iloc[idx][cols].to_dict()
+                row["score"] = round(score, 2)
+                row["_index"] = int(idx)
+                results.append(row)
+            results.sort(key=lambda r: r["score"], reverse=True)
+
+        return results
+
+    def lookup_best(
+        self,
+        query: str,
+        min_score: float = 0.0,
+        columns: Optional[list[str]] = None,
+    ) -> Optional[dict[str, Any]]:
+        """Return only the single best match, or None if below min_score."""
+        results = self.lookup(query, top_n=1, min_score=min_score, columns=columns)
+        return results[0] if results else None
+
+    def lookup_many(
+        self,
+        queries: list[str],
+        top_n: int = 1,
+        min_score: float = 0.0,
+        columns: Optional[list[str]] = None,
+    ) -> dict[str, list[dict[str, Any]]]:
+        """Batch lookup for multiple queries."""
+        return {
+            q: self.lookup(q, top_n=top_n, min_score=min_score, columns=columns)
+            for q in queries
+        }
+
+    # ------------------------------------------------------------------
+    # Convenience
+    # ------------------------------------------------------------------
+
+    @property
+    def columns(self) -> list[str]:
+        return list(self._df.columns)
+
+    @property
+    def shape(self) -> tuple[int, int]:
+        return self._df.shape
+
+    def __repr__(self) -> str:
+        mode = "name_aware" if self.name_aware else self.scorer.__name__
+        return (
+            f"FuzzyLookup(column='{self.column}', "
+            f"rows={self._df.shape[0]}, "
+            f"mode='{mode}')"
+        )
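The manual scoring path in `lookup` above (score every candidate, drop those under `min_score`, sort descending, keep the first `top_n`, then attach `score` and `_index` to each row) can be sketched without pandas or rapidfuzz. Here `stand_in_score` and `top_matches` are hypothetical names, the sample rows are invented for illustration, and `difflib.SequenceMatcher` is swapped in as the scorer, which is not what the package uses.

```python
from difflib import SequenceMatcher


def stand_in_score(query: str, candidate: str) -> float:
    """Stand-in scorer on a 0-100 scale (assumption: difflib, not rapidfuzz)."""
    return 100.0 * SequenceMatcher(None, query, candidate).ratio()


def top_matches(query, rows, column, top_n=5, min_score=0.0):
    """Score all rows, filter by min_score, sort descending, keep top_n."""
    scored = [
        (i, stand_in_score(query.lower(), str(row[column]).lower()))
        for i, row in enumerate(rows)
    ]
    scored = [(i, s) for i, s in scored if s >= min_score]
    scored.sort(key=lambda pair: pair[1], reverse=True)

    results = []
    for i, s in scored[:top_n]:
        result = dict(rows[i])       # copy the row so the input is not mutated
        result["score"] = round(s, 2)
        result["_index"] = i         # position of the row in the input
        results.append(result)
    return results


# Hypothetical sample data, for illustration only.
rows = [
    {"name": "Mohammed Kamal", "city": "Cairo"},
    {"name": "Kamal Mohammed", "city": "Giza"},
    {"name": "Ahmed Ali", "city": "Alexandria"},
]
matches = top_matches("mohammed kamal", rows, column="name", top_n=2, min_score=50.0)

if __name__ == "__main__":
    for match in matches:
        print(match)
```

Each result is a copy of the original row plus the two extra keys, which matches the list-of-dictionaries shape that the `lookup` method returns.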
@@ -0,0 +1,15 @@
+from setuptools import setup, find_packages
+
+setup(
+    name="fuzzylookup",
+    version="0.1.0",
+    description="Fuzzy matching lookup for CSV/Excel datasets (Arabic + English)",
+    license="MIT",
+    packages=find_packages(),
+    python_requires=">=3.8",
+    install_requires=[
+        "pandas>=1.3",
+        "openpyxl>=3.0",
+        "rapidfuzz>=3.0",
+    ],
+)
@@ -0,0 +1,15 @@
+[build-system]
+requires = ["setuptools>=61.0", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "Fuzzylookup"
+version = "0.0.1"
+description = "A package for Fuzzy Lookup"
+readme = "README.md"
+requires-python = ">=3.7"
+# If you have external libraries in a requirements.txt file,
+# put them here, for example: dependencies = ["pandas", "numpy"]
+
+[project.urls]
+Homepage = "https://github.com/Moda141/Fuzzylookup"