Fuzzylookup 0.0.1__tar.gz

@@ -0,0 +1,80 @@
1
+ Metadata-Version: 2.4
2
+ Name: Fuzzylookup
3
+ Version: 0.0.1
4
+ Summary: A package for Fuzzy Lookup
5
+ Project-URL: Homepage, https://github.com/Moda141/Fuzzylookup
6
+ Requires-Python: >=3.7
7
+ Description-Content-Type: text/markdown
8
+ License-File: LICENSE
9
+ Dynamic: license-file
10
+
11
+ # FuzzyLookup Documentation
12
+
13
+ ## Overview
14
+ `fuzzylookup` is a Python package for fuzzy matching and lookups against CSV or Excel datasets. It supports both Arabic and English text, using `pandas` for data handling and `rapidfuzz` for string matching. The source sets `__version__ = "0.1.0"`, while the published package metadata gives the version as 0.0.1.
15
+
16
+ ## Installation
17
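+ The bundled `setup.py` requires Python 3.8 or higher, while the published package metadata declares `Requires-Python: >=3.7`.
18
+ The published metadata lists no dependencies, so the following must be installed separately:
19
+ * `pandas>=1.3`
20
+ * `openpyxl>=3.0` (for Excel support)
21
+ * `rapidfuzz>=3.0`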
22
+
23
+ ## Core Features
24
+
25
+ ### 1. Arabic Text Normalization
26
+ By default, `fuzzylookup` applies normalization to Arabic text (`normalize_arabic=True`). This process improves match quality by:
27
+ * Removing *tashkeel* (diacritics).
28
+ * Normalizing Alef variants (أ, إ, آ, ٱ) to a standard Alef (ا).
29
+ * Normalizing Teh marbuta (ة) to Heh (ه).
30
+ * Normalizing Alef maqsura (ى) to Yeh (ي).
31
+
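The four normalization steps listed above can be sketched with the standard library alone. The function name `normalize_arabic_text` is illustrative and not part of the package's public API, and the diacritic range is deliberately narrowed to the common tashkeel marks plus the superscript Alef, which is an assumption made for brevity.

```python
import re
import unicodedata

# Common tashkeel marks plus the superscript Alef. This is narrower than a
# full list of Quranic annotation marks (an assumption made for brevity).
_TASHKEEL = re.compile("[\u064B-\u065F\u0670]")


def normalize_arabic_text(text: str) -> str:
    """Illustrative helper: apply the four normalization steps in order."""
    text = unicodedata.normalize("NFC", text)
    text = _TASHKEEL.sub("", text)                               # drop diacritics
    text = re.sub("[\u0623\u0625\u0622\u0671]", "\u0627", text)  # Alef variants -> bare Alef
    text = text.replace("\u0629", "\u0647")                      # Teh marbuta -> Heh
    text = text.replace("\u0649", "\u064A")                      # Alef maqsura -> Yeh
    return re.sub(r"\s+", " ", text).strip()                     # collapse whitespace


if __name__ == "__main__":
    # A vocalized name and a name ending in Teh marbuta.
    print(normalize_arabic_text("\u0623\u064E\u062D\u0652\u0645\u064E\u062F"))
    print(normalize_arabic_text("\u0641\u0627\u0637\u0645\u0629"))
```

Applying the same function to both the query and the stored values is what lets spelling variants of one name compare as equal before any fuzzy scoring happens.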
32
+ ### 2. Positional Name-Aware Scoring
33
+ The `name_aware` option enables order-sensitive scoring for names. Standard fuzzy matching algorithms might score "محمد كمال" and "كمال محمد" similarly; with `name_aware=True`, `fuzzylookup` penalizes a wrong token order.
34
+ * It compares names token-by-token in order.
35
+ * The first token accounts for 60% of the positional score, and the remaining tokens, compared as a group, account for 40%.
36
+ * It blends this positional score with `WRatio` so that both typos and word order are taken into account.
37
+ * If `WRatio` exceeds the positional score by more than 15 points, token reordering is assumed to be inflating the score, and the blend shifts toward the positional score (70/30 instead of 50/50).
38
+
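To make the weighting concrete, the arithmetic can be sketched without the matching library. The per-token similarities are passed in as plain numbers on a 0-100 scale, the function names are illustrative, and the 0.7/0.3 and 0.5/0.5 blend weights are taken from the `_smart_name_score` helper as it appears in this release.

```python
def positional_score(first_sim: float, rest_sim: float) -> float:
    """Weight the first-token similarity at 60% and the rest at 40%."""
    return 0.6 * first_sim + 0.4 * rest_sim


def blended_score(positional: float, wratio: float) -> float:
    """Blend the positional score with WRatio, leaning on the positional
    score when WRatio exceeds it by more than 15 points."""
    if wratio - positional > 15:
        return 0.7 * positional + 0.3 * wratio
    return 0.5 * positional + 0.5 * wratio


if __name__ == "__main__":
    # Hypothetical similarities for a swapped name: both positions match
    # poorly, yet an order-insensitive scorer still reports a high value.
    swapped = positional_score(first_sim=25.0, rest_sim=25.0)
    print(round(blended_score(swapped, wratio=95.0), 2))

    # Hypothetical similarities for a correctly ordered name with one typo.
    ordered = positional_score(first_sim=100.0, rest_sim=75.0)
    print(round(blended_score(ordered, wratio=92.0), 2))
```

With these made-up inputs the swapped name lands well below the correctly ordered one, which is the effect the penalty is meant to produce.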
39
+ ## API Reference
40
+
41
+ ### `FuzzyLookup` Class
42
+ The primary entry point is the `FuzzyLookup` class.
43
+
44
+ **Parameters:**
45
+ * `source` (str, Path, or pd.DataFrame): Path to the CSV/Excel file or an existing DataFrame.
46
+ * `column` (str): The column name to match against.
47
+ * `scorer` (str): Matching algorithm to use (`ratio`, `partial`, `token_sort`, `token_set`, `wratio`). Defaults to `wratio`; unrecognized names also fall back to `wratio`.
48
+ * `normalize_arabic` (bool): Whether to strip diacritics and normalize characters. Defaults to `True`.
49
+ * `name_aware` (bool): Enables positional name scoring, in which case the `scorer` setting is not used. Recommended for full person names. Defaults to `False`.
50
+ * `encoding` (str): File encoding for CSVs. Defaults to `utf-8`.
51
+
52
+ ### Methods
53
+
54
+ * `lookup(query: str, top_n: int = 5, min_score: float = 0.0, columns: list = None)`: Returns up to `top_n` matches scoring at least `min_score`, best first. Each match is a dictionary containing the row data (optionally restricted to `columns`), the calculated `score`, and `_index`.
55
+ * `lookup_best(query: str, min_score: float = 0.0, columns: list = None)`: Returns only the single best match, or `None` if no match meets the minimum score.
56
+ * `lookup_many(queries: list, top_n: int = 1, min_score: float = 0.0, columns: list = None)`: Batch lookup for multiple queries. Returns a dictionary mapping each query to its list of results.
57
+
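Because each result is a plain dictionary, downstream code can separate the row payload from the two bookkeeping keys. The sketch below assumes the result shape described above (row columns plus `score` and `_index`); the sample records and the helper names `split_result` and `best_or_none` are made up for illustration and do not come from a real lookup.

```python
from typing import Any, Dict, List, Optional, Tuple


def split_result(result: Dict[str, Any]) -> Tuple[Dict[str, Any], float, int]:
    """Separate row data from the `score` and `_index` bookkeeping keys."""
    row = {k: v for k, v in result.items() if k not in ("score", "_index")}
    return row, float(result["score"]), int(result["_index"])


def best_or_none(results: List[Dict[str, Any]], min_score: float) -> Optional[Dict[str, Any]]:
    """Mirror the documented `lookup_best` behavior on a best-first list."""
    if results and results[0]["score"] >= min_score:
        return results[0]
    return None


if __name__ == "__main__":
    # Hand-written sample shaped like the documented return value.
    sample = [
        {"name": "Mohamed Kamal", "city": "Cairo", "score": 96.5, "_index": 12},
        {"name": "Mohamed Kamel", "city": "Giza", "score": 88.0, "_index": 3},
    ]
    row, score, index = split_result(sample[0])
    print(row, score, index)
    print(best_or_none(sample, min_score=99.0))
```

Checking for `None` before indexing into the best match avoids a `TypeError` when a strict `min_score` filters out every candidate.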
58
+ ### Properties
59
+ * `columns`: Returns a list of the columns available in the loaded dataframe.
60
+ * `shape`: Returns a tuple representing the shape of the underlying dataframe.
61
+
62
+ ## Workflow Example
63
+
64
+ A typical workflow creates the lookup object and then runs queries against it:
65
+
66
+ ```python
67
+ from fuzzylookup import FuzzyLookup
68
+
69
+ # 1. Initialize the lookup instance with a dataset
70
+ fl = FuzzyLookup("names.csv", column="name", name_aware=True)
71
+
72
+ # 2. Perform a standard lookup for the top 3 matches
73
+ results = fl.lookup("محمد كمال", top_n=3)
74
+
75
+ # 3. Perform a strict lookup requiring a high match score
76
+ best_match = fl.lookup_best("كمال محمد", min_score=85.0)
77
+
78
+ # 4. Batch processing multiple names
79
+ batch_results = fl.lookup_many(["أحمد علي", "محمود حسن"], top_n=1)
80
+ ```
@@ -0,0 +1,10 @@
1
+ LICENSE
2
+ README.md
3
+ pyproject.toml
4
+ Fuzzylookup.egg-info/PKG-INFO
5
+ Fuzzylookup.egg-info/SOURCES.txt
6
+ Fuzzylookup.egg-info/dependency_links.txt
7
+ Fuzzylookup.egg-info/top_level.txt
8
+ fuzzylookup/__init__.py
9
+ fuzzylookup/core.py
10
+ fuzzylookup/setup.py
@@ -0,0 +1 @@
1
+ fuzzylookup
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Mohammed kamal
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,70 @@
1
+ # FuzzyLookup Documentation
2
+
3
+ ## Overview
4
+ `fuzzylookup` is a Python package for fuzzy matching and lookups against CSV or Excel datasets. It supports both Arabic and English text, using `pandas` for data handling and `rapidfuzz` for string matching. The source sets `__version__ = "0.1.0"`, while the published package metadata gives the version as 0.0.1.
5
+
6
+ ## Installation
7
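+ The bundled `setup.py` requires Python 3.8 or higher, while the published package metadata declares `Requires-Python: >=3.7`.
8
+ The published metadata lists no dependencies, so the following must be installed separately:
9
+ * `pandas>=1.3`
10
+ * `openpyxl>=3.0` (for Excel support)
11
+ * `rapidfuzz>=3.0`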
12
+
13
+ ## Core Features
14
+
15
+ ### 1. Arabic Text Normalization
16
+ By default, `fuzzylookup` applies normalization to Arabic text (`normalize_arabic=True`). This process improves match quality by:
17
+ * Removing *tashkeel* (diacritics).
18
+ * Normalizing Alef variants (أ, إ, آ, ٱ) to a standard Alef (ا).
19
+ * Normalizing Teh marbuta (ة) to Heh (ه).
20
+ * Normalizing Alef maqsura (ى) to Yeh (ي).
21
+
22
+ ### 2. Positional Name-Aware Scoring
23
+ The `name_aware` option enables order-sensitive scoring for names. Standard fuzzy matching algorithms might score "محمد كمال" and "كمال محمد" similarly; with `name_aware=True`, `fuzzylookup` penalizes a wrong token order.
24
+ * It compares names token-by-token in order.
25
+ * The first token accounts for 60% of the positional score, and the remaining tokens, compared as a group, account for 40%.
26
+ * It blends this positional score with `WRatio` so that both typos and word order are taken into account.
27
+ * If `WRatio` exceeds the positional score by more than 15 points, token reordering is assumed to be inflating the score, and the blend shifts toward the positional score (70/30 instead of 50/50).
28
+
29
+ ## API Reference
30
+
31
+ ### `FuzzyLookup` Class
32
+ The primary entry point is the `FuzzyLookup` class.
33
+
34
+ **Parameters:**
35
+ * `source` (str, Path, or pd.DataFrame): Path to the CSV/Excel file or an existing DataFrame.
36
+ * `column` (str): The column name to match against.
37
+ * `scorer` (str): Matching algorithm to use (`ratio`, `partial`, `token_sort`, `token_set`, `wratio`). Defaults to `wratio`; unrecognized names also fall back to `wratio`.
38
+ * `normalize_arabic` (bool): Whether to strip diacritics and normalize characters. Defaults to `True`.
39
+ * `name_aware` (bool): Enables positional name scoring, in which case the `scorer` setting is not used. Recommended for full person names. Defaults to `False`.
40
+ * `encoding` (str): File encoding for CSVs. Defaults to `utf-8`.
41
+
42
+ ### Methods
43
+
44
+ * `lookup(query: str, top_n: int = 5, min_score: float = 0.0, columns: list = None)`: Returns up to `top_n` matches scoring at least `min_score`, best first. Each match is a dictionary containing the row data (optionally restricted to `columns`), the calculated `score`, and `_index`.
45
+ * `lookup_best(query: str, min_score: float = 0.0, columns: list = None)`: Returns only the single best match, or `None` if no match meets the minimum score.
46
+ * `lookup_many(queries: list, top_n: int = 1, min_score: float = 0.0, columns: list = None)`: Batch lookup for multiple queries. Returns a dictionary mapping each query to its list of results.
47
+
48
+ ### Properties
49
+ * `columns`: Returns a list of the columns available in the loaded dataframe.
50
+ * `shape`: Returns a tuple representing the shape of the underlying dataframe.
51
+
52
+ ## Workflow Example
53
+
54
+ A typical workflow creates the lookup object and then runs queries against it:
55
+
56
+ ```python
57
+ from fuzzylookup import FuzzyLookup
58
+
59
+ # 1. Initialize the lookup instance with a dataset
60
+ fl = FuzzyLookup("names.csv", column="name", name_aware=True)
61
+
62
+ # 2. Perform a standard lookup for the top 3 matches
63
+ results = fl.lookup("محمد كمال", top_n=3)
64
+
65
+ # 3. Perform a strict lookup requiring a high match score
66
+ best_match = fl.lookup_best("كمال محمد", min_score=85.0)
67
+
68
+ # 4. Batch processing multiple names
69
+ batch_results = fl.lookup_many(["أحمد علي", "محمود حسن"], top_n=1)
70
+ ```
@@ -0,0 +1,4 @@
1
+ from .core import FuzzyLookup
2
+
3
+ __all__ = ["FuzzyLookup"]
4
+ __version__ = "0.1.0"
@@ -0,0 +1,306 @@
1
+ """
2
+ fuzzylookup - Fuzzy matching lookup for CSV/Excel datasets
3
+ Supports Arabic and English text, with positional name-aware scoring.
4
+ """
5
+
6
+ from __future__ import annotations
7
+
8
+ import re
9
+ import unicodedata
10
+ from pathlib import Path
11
+ from typing import Any, Optional, Union
12
+
13
+ try:
14
+     import pandas as pd
15
+ except ImportError:
16
+     raise ImportError("pandas is required: pip install pandas openpyxl")
17
+
18
+ try:
19
+     from rapidfuzz import fuzz, process
20
+ except ImportError:
21
+     raise ImportError("rapidfuzz is required: pip install rapidfuzz")
22
+
23
+
24
+ # ---------------------------------------------------------------------------
25
+ # Text normalization
26
+ # ---------------------------------------------------------------------------
27
+
28
+ def _normalize_arabic(text: str) -> str:
29
+     """Normalize Arabic text for better matching."""
30
+     if not isinstance(text, str):
31
+         return str(text)
32
+     text = unicodedata.normalize("NFC", text)
33
+     # Remove tashkeel (diacritics)
34
+     text = re.sub(r"[\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06DC\u06DF-\u06E4\u06E7\u06E8\u06EA-\u06ED]", "", text)
35
+     text = re.sub(r"[أإآٱ]", "ا", text)  # Alef variants
36
+     text = re.sub(r"ة", "ه", text)  # Teh marbuta
37
+     text = re.sub(r"ى", "ي", text)  # Alef maqsura
38
+     text = re.sub(r"\s+", " ", text).strip()
39
+     return text
40
+
41
+
42
+ def _normalize(text: str, arabic: bool = True) -> str:
43
+     if not isinstance(text, str):
44
+         text = str(text)
45
+     text = text.strip().lower()
46
+     if arabic:
47
+         text = _normalize_arabic(text)
48
+     return text
49
+
50
+
51
+ # ---------------------------------------------------------------------------
52
+ # Scorer aliases
53
+ # ---------------------------------------------------------------------------
54
+
55
+ SCORERS = {
56
+     "ratio": fuzz.ratio,
57
+     "partial": fuzz.partial_ratio,
58
+     "token_sort": fuzz.token_sort_ratio,
59
+     "token_set": fuzz.token_set_ratio,
60
+     "wratio": fuzz.WRatio,
61
+ }
62
+
63
+
64
+ # ---------------------------------------------------------------------------
65
+ # Positional Name Scoring
66
+ # ---------------------------------------------------------------------------
67
+
68
+ def _tokenize(name: str) -> list[str]:
69
+     tokens = name.strip().split()
70
+     return tokens if tokens else [""]
71
+
72
+
73
+ def _positional_name_score(
74
+     query: str,
75
+     candidate: str,
76
+     first_weight: float = 0.6,
77
+     rest_weight: float = 0.4,
78
+ ) -> float:
79
+     """
80
+     Compare names token-by-token in order.
81
+
82
+     - First token vs first token → 60% of score
83
+     - Remaining tokens joined → 40% of score
84
+
85
+     Example:
86
+         "محمد كمال" vs "محمد كمال" → ~100
87
+         "محمد كمال" vs "كمال محمد" → ~50 (first tokens differ)
88
+         "محمد كمال" vs "محمد علي" → ~65 (first token matches, the rest differs)
89
+     """
90
+     q_tokens = _tokenize(query)
91
+     c_tokens = _tokenize(candidate)
92
+
93
+     # Single token on either side → plain ratio
94
+     if len(q_tokens) == 1 or len(c_tokens) == 1:
95
+         return fuzz.ratio(query, candidate)
96
+
97
+     # First token score
98
+     first_score = fuzz.ratio(q_tokens[0], c_tokens[0])
99
+
100
+     # Rest tokens (join and compare)
101
+     q_rest = " ".join(q_tokens[1:])
102
+     c_rest = " ".join(c_tokens[1:])
103
+     rest_score = fuzz.token_sort_ratio(q_rest, c_rest)
104
+
105
+     return (first_score * first_weight) + (rest_score * rest_weight)
106
+
107
+
108
+ def _smart_name_score(query: str, candidate: str) -> float:
109
+     """
110
+     Blend positional + WRatio for best of both worlds:
111
+     - Positional punishes wrong token order (محمد كمال ≠ كمال محمد)
112
+     - WRatio handles typos and partial names
113
+
114
+     If WRatio >> positional by >15 pts → token reordering is inflating WRatio
115
+     → apply penalty so swapped names don't score the same as correct order.
116
+     """
117
+     positional = _positional_name_score(query, candidate)
118
+     wratio = fuzz.WRatio(query, candidate)
119
+
120
+     diff = wratio - positional
121
+     if diff > 15:
122
+         # WRatio is inflated by token reordering, so weight the positional score more
123
+         return (positional * 0.7) + (wratio * 0.3)
124
+     else:
125
+         return (positional * 0.5) + (wratio * 0.5)
126
+
127
+
128
+ # ---------------------------------------------------------------------------
129
+ # FuzzyLookup
130
+ # ---------------------------------------------------------------------------
131
+
132
+ class FuzzyLookup:
133
+     """
134
+     Fuzzy lookup over a CSV or Excel dataset.
135
+
136
+     Parameters
137
+     ----------
138
+     source : str | Path | pd.DataFrame
139
+         Path to CSV/Excel file, or an already-loaded DataFrame.
140
+     column : str
141
+         The column to match against.
142
+     scorer : str
143
+         Matching algorithm: ratio, partial, token_sort, token_set, wratio (default).
144
+     normalize_arabic : bool
145
+         Strip diacritics & normalize Arabic characters (default True).
146
+     name_aware : bool
147
+         Enable positional name scoring so that "محمد كمال" and "كمال محمد"
148
+         score differently. First token is weighted higher than the rest.
149
+         Recommended when the column contains full person names (default False).
150
+     encoding : str
151
+         File encoding for CSV (default 'utf-8').
152
+
153
+     Examples
154
+     --------
155
+     >>> fl = FuzzyLookup("names.csv", column="name", name_aware=True)
156
+     >>> fl.lookup("محمد كمال", top_n=3)
157
+     """
158
+
159
+     def __init__(
160
+         self,
161
+         source: Union[str, Path, "pd.DataFrame"],
162
+         column: str,
163
+         scorer: str = "wratio",
164
+         normalize_arabic: bool = True,
165
+         name_aware: bool = False,
166
+         encoding: str = "utf-8",
167
+     ):
168
+         self.column = column
169
+         self.scorer = SCORERS.get(scorer, fuzz.WRatio)
170
+         self.normalize_arabic = normalize_arabic
171
+         self.name_aware = name_aware
172
+
173
+         # Load data
174
+         if isinstance(source, (str, Path)):
175
+             path = Path(source)
176
+             if path.suffix.lower() in {".xlsx", ".xls"}:
177
+                 self._df = pd.read_excel(path)
178
+             else:
179
+                 self._df = pd.read_csv(path, encoding=encoding)
180
+         elif isinstance(source, pd.DataFrame):
181
+             self._df = source.copy()
182
+         else:
183
+             raise TypeError("source must be a file path or a pandas DataFrame")
184
+
185
+         if column not in self._df.columns:
186
+             raise ValueError(
187
+                 f"Column '{column}' not found. Available: {list(self._df.columns)}"
188
+             )
189
+
190
+         self._choices: list[str] = (
191
+             self._df[column].fillna("").astype(str).tolist()
192
+         )
193
+         self._normalized_choices: list[str] = [
194
+             _normalize(c, arabic=self.normalize_arabic) for c in self._choices
195
+         ]
196
+
197
+     # ------------------------------------------------------------------
198
+     # Internal scoring
199
+     # ------------------------------------------------------------------
200
+
201
+     def _score(self, query: str, candidate: str) -> float:
202
+         """Score a query against a candidate string."""
203
+         if self.name_aware:
204
+             return _smart_name_score(query, candidate)
205
+         return self.scorer(query, candidate)
206
+
207
+     # ------------------------------------------------------------------
208
+     # Public API
209
+     # ------------------------------------------------------------------
210
+
211
+     def lookup(
212
+         self,
213
+         query: str,
214
+         top_n: int = 5,
215
+         min_score: float = 0.0,
216
+         columns: Optional[list[str]] = None,
217
+     ) -> list[dict[str, Any]]:
218
+         """
219
+         Return the top-N best matches for *query*.
220
+
221
+         Parameters
222
+         ----------
223
+         query : str - search string
224
+         top_n : int - max results (default 5)
225
+         min_score : float - minimum score 0–100 (default 0)
226
+         columns : list - which columns to return (default: all)
227
+         """
228
+         norm_query = _normalize(query, arabic=self.normalize_arabic)
229
+         cols = columns or list(self._df.columns)
230
+
231
+         if self.name_aware:
232
+             # Manual scoring loop (rapidfuzz process doesn't support custom scorers easily)
233
+             scored = [
234
+                 (i, self._score(norm_query, cand))
235
+                 for i, cand in enumerate(self._normalized_choices)
236
+             ]
237
+             scored = [(i, s) for i, s in scored if s >= min_score]
238
+             scored.sort(key=lambda x: x[1], reverse=True)
239
+             scored = scored[:top_n]
240
+
241
+             results = []
242
+             for idx, score in scored:
243
+                 row = self._df.iloc[idx][cols].to_dict()
244
+                 row["score"] = round(score, 2)
245
+                 row["_index"] = int(idx)
246
+                 results.append(row)
247
+         else:
248
+             matches = process.extract(
249
+                 norm_query,
250
+                 self._normalized_choices,
251
+                 scorer=self.scorer,
252
+                 limit=top_n,
253
+                 score_cutoff=min_score,
254
+             )
255
+             results = []
256
+             for _matched_str, score, idx in matches:
257
+                 row = self._df.iloc[idx][cols].to_dict()
258
+                 row["score"] = round(score, 2)
259
+                 row["_index"] = int(idx)
260
+                 results.append(row)
261
+             results.sort(key=lambda r: r["score"], reverse=True)
262
+
263
+         return results
264
+
265
+     def lookup_best(
266
+         self,
267
+         query: str,
268
+         min_score: float = 0.0,
269
+         columns: Optional[list[str]] = None,
270
+     ) -> Optional[dict[str, Any]]:
271
+         """Return only the single best match, or None if below min_score."""
272
+         results = self.lookup(query, top_n=1, min_score=min_score, columns=columns)
273
+         return results[0] if results else None
274
+
275
+     def lookup_many(
276
+         self,
277
+         queries: list[str],
278
+         top_n: int = 1,
279
+         min_score: float = 0.0,
280
+         columns: Optional[list[str]] = None,
281
+     ) -> dict[str, list[dict[str, Any]]]:
282
+         """Batch lookup for multiple queries."""
283
+         return {
284
+             q: self.lookup(q, top_n=top_n, min_score=min_score, columns=columns)
285
+             for q in queries
286
+         }
287
+
288
+     # ------------------------------------------------------------------
289
+     # Convenience
290
+     # ------------------------------------------------------------------
291
+
292
+     @property
293
+     def columns(self) -> list[str]:
294
+         return list(self._df.columns)
295
+
296
+     @property
297
+     def shape(self) -> tuple[int, int]:
298
+         return self._df.shape
299
+
300
+     def __repr__(self) -> str:
301
+         mode = "name_aware" if self.name_aware else self.scorer.__name__
302
+         return (
303
+             f"FuzzyLookup(column='{self.column}', "
304
+             f"rows={self._df.shape[0]}, "
305
+             f"mode='{mode}')"
306
+         )
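One behavior in the constructor above is easy to miss: the scorer is resolved with `SCORERS.get(scorer, fuzz.WRatio)`, so a misspelled scorer name silently falls back to `WRatio`. A stricter resolution step could look like the sketch below. The helper `resolve_scorer` and the stand-in registry are hypothetical and are not part of this package; they only illustrate the alternative design.

```python
from typing import Callable, Dict

Scorer = Callable[[str, str], float]


def resolve_scorer(name: str, registry: Dict[str, Scorer]) -> Scorer:
    """Return the scorer registered under `name`, or raise on a typo."""
    try:
        return registry[name]
    except KeyError:
        valid = ", ".join(sorted(registry))
        raise ValueError(f"Unknown scorer {name!r}; expected one of: {valid}") from None


if __name__ == "__main__":
    # Stand-in registry so the sketch runs without third-party packages.
    demo = {"exact": lambda a, b: 100.0 if a == b else 0.0}
    print(resolve_scorer("exact", demo)("x", "x"))
```

Raising early trades leniency for visibility: a typo such as `"token-sort"` surfaces at construction time instead of quietly changing which algorithm produces the scores.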
@@ -0,0 +1,15 @@
1
+ from setuptools import setup, find_packages
2
+
3
+ setup(
4
+     name="fuzzylookup",
5
+     version="0.1.0",
6
+     description="Fuzzy matching lookup for CSV/Excel datasets (Arabic + English)",
7
+     license="MIT",
8
+     packages=find_packages(),
9
+     python_requires=">=3.8",
10
+     install_requires=[
11
+         "pandas>=1.3",
12
+         "openpyxl>=3.0",
13
+         "rapidfuzz>=3.0",
14
+     ],
15
+ )
@@ -0,0 +1,15 @@
1
+ [build-system]
2
+ requires = ["setuptools>=61.0", "wheel"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "Fuzzylookup"
7
+ version = "0.0.1"
8
+ description = "A package for Fuzzy Lookup"
9
+ readme = "README.md"
10
+ requires-python = ">=3.7"
11
+ # If you have external libraries listed in a requirements.txt file,
12
+ # put them here, for example: dependencies = ["pandas", "numpy"]
13
+
14
+ [project.urls]
15
+ Homepage = "https://github.com/Moda141/Fuzzylookup"
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+