gngram-lookup 0.2.1__tar.gz → 0.2.3__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {gngram_lookup-0.2.1 → gngram_lookup-0.2.3}/PKG-INFO +12 -10
- {gngram_lookup-0.2.1 → gngram_lookup-0.2.3}/README.md +8 -8
- gngram_lookup-0.2.3/gngram_counter/lookup.py +246 -0
- gngram_lookup-0.2.3/gngram_counter/normalize.py +50 -0
- {gngram_lookup-0.2.1 → gngram_lookup-0.2.3}/pyproject.toml +2 -2
- gngram_lookup-0.2.3/setup.py +35 -0
- gngram_lookup-0.2.1/gngram_counter/lookup.py +0 -138
- gngram_lookup-0.2.1/setup.py +0 -35
- {gngram_lookup-0.2.1 → gngram_lookup-0.2.3}/LICENSE +0 -0
- {gngram_lookup-0.2.1 → gngram_lookup-0.2.3}/gngram_counter/__init__.py +0 -0
- {gngram_lookup-0.2.1 → gngram_lookup-0.2.3}/gngram_counter/cli.py +0 -0
- {gngram_lookup-0.2.1 → gngram_lookup-0.2.3}/gngram_counter/data.py +0 -0
- {gngram_lookup-0.2.1 → gngram_lookup-0.2.3}/gngram_counter/download_data.py +0 -0

{gngram_lookup-0.2.1 → gngram_lookup-0.2.3}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: gngram-lookup
-Version: 0.2.1
+Version: 0.2.3
 Summary: Static Hash-Based Lookup for Google Ngram Frequencies
 Home-page: https://github.com/craigtrim/gngram-lookup
 License: Proprietary
@@ -9,7 +9,7 @@ Author: Craig Trim
 Author-email: craigtrim@gmail.com
 Maintainer: Craig Trim
 Maintainer-email: craigtrim@gmail.com
-Requires-Python: >=3.11,<4.0
+Requires-Python: >=3.9,<4.0
 Classifier: Development Status :: 4 - Beta
 Classifier: Intended Audience :: Developers
 Classifier: Intended Audience :: Science/Research
@@ -17,6 +17,8 @@ Classifier: License :: Other/Proprietary License
 Classifier: Natural Language :: English
 Classifier: Operating System :: OS Independent
 Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
 Classifier: Programming Language :: Python :: 3.11
 Classifier: Programming Language :: Python :: 3.12
 Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
@@ -32,8 +34,8 @@ Description-Content-Type: text/markdown
 [](https://badge.fury.io/py/gngram-lookup)
 [](https://pepy.tech/project/gngram-lookup)
 [](https://pepy.tech/project/gngram-lookup)
-[](tests/)
-[](https://www.python.org/downloads/)
+[](https://github.com/craigtrim/gngram-lookup/tree/main/tests)
+[](https://www.python.org/downloads/)
 
 Word frequency from 500 years of books. O(1) lookup. 5 million words.
 
@@ -74,11 +76,11 @@ gngram-freq computer
 
 ## Docs
 
-- [API Reference](docs/api.md)
-- [CLI Reference](docs/cli.md)
-- [Data Format](docs/data-format.md)
-- [Use Cases](docs/use-cases.md)
-- [Development](docs/development.md)
+- [API Reference](https://github.com/craigtrim/gngram-lookup/blob/main/docs/api.md)
+- [CLI Reference](https://github.com/craigtrim/gngram-lookup/blob/main/docs/cli.md)
+- [Data Format](https://github.com/craigtrim/gngram-lookup/blob/main/docs/data-format.md)
+- [Use Cases](https://github.com/craigtrim/gngram-lookup/blob/main/docs/use-cases.md)
+- [Development](https://github.com/craigtrim/gngram-lookup/blob/main/docs/development.md)
 
 ## See Also
 
@@ -91,5 +93,5 @@ Data derived from the [Google Books Ngram](https://books.google.com/ngrams) data
 
 ## License
 
-Proprietary. See [LICENSE](LICENSE).
+Proprietary. See [LICENSE](https://github.com/craigtrim/gngram-lookup/blob/main/LICENSE).
 

{gngram_lookup-0.2.1 → gngram_lookup-0.2.3}/README.md

@@ -3,8 +3,8 @@
 [](https://badge.fury.io/py/gngram-lookup)
 [](https://pepy.tech/project/gngram-lookup)
 [](https://pepy.tech/project/gngram-lookup)
-[](tests/)
-[](https://www.python.org/downloads/)
+[](https://github.com/craigtrim/gngram-lookup/tree/main/tests)
+[](https://www.python.org/downloads/)
 
 Word frequency from 500 years of books. O(1) lookup. 5 million words.
 
@@ -45,11 +45,11 @@ gngram-freq computer
 
 ## Docs
 
-- [API Reference](docs/api.md)
-- [CLI Reference](docs/cli.md)
-- [Data Format](docs/data-format.md)
-- [Use Cases](docs/use-cases.md)
-- [Development](docs/development.md)
+- [API Reference](https://github.com/craigtrim/gngram-lookup/blob/main/docs/api.md)
+- [CLI Reference](https://github.com/craigtrim/gngram-lookup/blob/main/docs/cli.md)
+- [Data Format](https://github.com/craigtrim/gngram-lookup/blob/main/docs/data-format.md)
+- [Use Cases](https://github.com/craigtrim/gngram-lookup/blob/main/docs/use-cases.md)
+- [Development](https://github.com/craigtrim/gngram-lookup/blob/main/docs/development.md)
 
 ## See Also
 
@@ -62,4 +62,4 @@ Data derived from the [Google Books Ngram](https://books.google.com/ngrams) data
 
 ## License
 
-Proprietary. See [LICENSE](LICENSE).
+Proprietary. See [LICENSE](https://github.com/craigtrim/gngram-lookup/blob/main/LICENSE).

gngram_lookup-0.2.3/gngram_counter/lookup.py
ADDED

@@ -0,0 +1,246 @@
+"""
+High-level lookup API for gngram-counter.
+
+Provides simple functions for word frequency lookups similar to bnc-lookup.
+
+Includes contraction fallback: if a contraction like "don't" is not found
+directly, the stem ("do") is looked up instead. The ngram corpus only
+contains pure alphabetic words, so contractions and their suffix parts
+(n't, 'll, etc.) are absent — but the stems are present.
+"""
+
+from __future__ import annotations
+
+import hashlib
+from functools import lru_cache
+from typing import TypedDict
+
+import polars as pl
+
+from gngram_counter.data import get_hash_file, is_data_installed
+from gngram_counter.normalize import normalize
+
+
+class FrequencyData(TypedDict):
+    """Frequency data for a word."""
+
+    peak_tf: int  # Decade with highest term frequency
+    peak_df: int  # Decade with highest document frequency
+    sum_tf: int  # Total term frequency across all decades
+    sum_df: int  # Total document frequency across all decades
+
+
+# Contraction suffixes stored as separate tokens in the ngram corpus
+# Order matters: longer suffixes must be checked before shorter ones
+CONTRACTION_SUFFIXES = ("n't", "'ll", "'re", "'ve", "'m", "'d")
+
+# Specific stems that form 's contractions (where 's = "is" or "has").
+# NOT generalized — 's is ambiguous with possessive, so only known
+# contraction stems are listed here. Ported from bnc-lookup.
+S_CONTRACTION_STEMS = frozenset({
+    # Pronouns (unambiguously 's = "is" or "has", never possessive)
+    'it', 'he', 'she', 'that', 'what', 'who',
+    # Adverbs / demonstratives
+    'where', 'how', 'here', 'there',
+    # "let's" = "let us"
+    'let',
+    # Indefinite pronouns
+    'somebody', 'everybody', 'everyone', 'nobody',
+    'anywhere', 'nowhere',
+})
+
+
+@lru_cache(maxsize=256)
+def _load_bucket(prefix: str) -> pl.DataFrame:
+    """Load and cache a parquet bucket file."""
+    return pl.read_parquet(get_hash_file(prefix))
+
+
+def _hash_word(word: str) -> tuple[str, str]:
+    """Hash a word and return (prefix, suffix)."""
+    h = hashlib.md5(normalize(word).encode("utf-8")).hexdigest()
+    return h[:2], h[2:]
+
+
+def _lookup_frequency(word: str) -> FrequencyData | None:
+    """Look up frequency data for a single word form (no fallbacks)."""
+    if not word:
+        return None
+    prefix, suffix = _hash_word(word)
+    try:
+        df = _load_bucket(prefix)
+    except FileNotFoundError:
+        return None
+    row = df.filter(pl.col("hash") == suffix)
+    if len(row) == 0:
+        return None
+    return FrequencyData(
+        peak_tf=row["peak_tf"][0],
+        peak_df=row["peak_df"][0],
+        sum_tf=row["sum_tf"][0],
+        sum_df=row["sum_df"][0],
+    )
+
+
+def _split_contraction(word: str) -> tuple[str, str] | None:
+    """Split a contraction into its component parts if possible.
+
+    The ngram corpus tokenizes contractions separately (e.g., "we'll" -> "we" + "'ll").
+    This function reverses that split for fallback lookup.
+
+    Returns:
+        Tuple of (stem, suffix) if the word matches a contraction pattern,
+        or None if no contraction pattern matches.
+    """
+    for suffix in CONTRACTION_SUFFIXES:
+        if word.endswith(suffix):
+            stem = word[:-len(suffix)]
+            if stem:
+                return (stem, suffix)
+
+    # Specific 's contractions from curated allowlist (not possessives)
+    if word.endswith("'s"):
+        stem = word[:-2]
+        if stem in S_CONTRACTION_STEMS:
+            return (stem, "'s")
+
+    return None
+
+
+def exists(word: str) -> bool:
+    """Check if a word exists in the ngram data.
+
+    Performs case-insensitive lookup with automatic fallbacks:
+    1. Direct lookup of the normalized word
+    2. Contraction fallback: if word is a contraction, check if both
+       components exist (e.g., "don't" -> "do" + "n't")
+
+    Args:
+        word: The word to check (case-insensitive)
+
+    Returns:
+        True if the word exists, False otherwise
+
+    Raises:
+        FileNotFoundError: If data files are not installed
+    """
+    if not is_data_installed():
+        raise FileNotFoundError(
+            "Data files not installed. Run: python -m gngram_counter.download_data"
+        )
+
+    word = normalize(word)
+
+    if _lookup_frequency(word) is not None:
+        return True
+
+    # Contraction fallback: check if the stem exists
+    parts = _split_contraction(word)
+    if parts:
+        stem, _ = parts
+        if _lookup_frequency(stem) is not None:
+            return True
+
+    return False
+
+
+def frequency(word: str) -> FrequencyData | None:
+    """Get frequency data for a word.
+
+    Performs case-insensitive lookup with contraction fallback.
+    For contractions, returns the stem's frequency data.
+
+    Args:
+        word: The word to look up (case-insensitive)
+
+    Returns:
+        FrequencyData dict with peak_tf, peak_df, sum_tf, sum_df, or None if not found
+
+    Raises:
+        FileNotFoundError: If data files are not installed
+    """
+    if not is_data_installed():
+        raise FileNotFoundError(
+            "Data files not installed. Run: python -m gngram_counter.download_data"
+        )
+
+    word = normalize(word)
+
+    result = _lookup_frequency(word)
+    if result is not None:
+        return result
+
+    # Contraction fallback: return the stem's frequency
+    parts = _split_contraction(word)
+    if parts:
+        stem, _ = parts
+        stem_freq = _lookup_frequency(stem)
+        if stem_freq is not None:
+            return stem_freq
+
+    return None
+
+
+def batch_frequency(words: list[str]) -> dict[str, FrequencyData | None]:
+    """Get frequency data for multiple words.
+
+    Args:
+        words: List of words to look up (case-insensitive)
+
+    Returns:
+        Dict mapping each word to its FrequencyData or None if not found
+
+    Raises:
+        FileNotFoundError: If data files are not installed
+    """
+    if not is_data_installed():
+        raise FileNotFoundError(
+            "Data files not installed. Run: python -m gngram_counter.download_data"
+        )
+
+    # Group words by bucket prefix for efficient batch lookups
+    by_prefix: dict[str, list[tuple[str, str, str]]] = {}
+    contraction_words: list[str] = []
+
+    for word in words:
+        normalized = normalize(word)
+        prefix, suffix = _hash_word(normalized)
+        if prefix not in by_prefix:
+            by_prefix[prefix] = []
+        by_prefix[prefix].append((word, normalized, suffix))
+
+    results: dict[str, FrequencyData | None] = {}
+
+    for prefix, entries in by_prefix.items():
+        df = _load_bucket(prefix)
+        suffixes = [s for _, _, s in entries]
+
+        # Filter to all matching suffixes at once
+        matches = df.filter(pl.col("hash").is_in(suffixes))
+        match_dict = {row["hash"]: row for row in matches.iter_rows(named=True)}
+
+        for word, normalized, suffix in entries:
+            if suffix in match_dict:
+                row = match_dict[suffix]
+                results[word] = FrequencyData(
+                    peak_tf=row["peak_tf"],
+                    peak_df=row["peak_df"],
+                    sum_tf=row["sum_tf"],
+                    sum_df=row["sum_df"],
+                )
+            else:
+                # Mark for contraction fallback
+                results[word] = None
+                contraction_words.append(word)
+
+    # Contraction fallback for words not found directly
+    for word in contraction_words:
+        normalized = normalize(word)
+        parts = _split_contraction(normalized)
+        if parts:
+            stem, _ = parts
+            stem_freq = _lookup_frequency(stem)
+            if stem_freq is not None:
+                results[word] = stem_freq
+
+    return results
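
The bucket scheme itself is unchanged: MD5 the normalized word, use the first two hex digits to pick one of 256 parquet buckets (matching the `lru_cache(maxsize=256)` above), and match the remaining 30 digits against the bucket's `hash` column. What 0.2.3 adds in front of that lookup is normalization plus the contraction fallback. A minimal usage sketch, assuming the data files are installed; the import path follows the sdist layout shown above, and `xyznotaword` is just a stand-in for an out-of-vocabulary token:

```python
from gngram_counter import lookup

# Direct hit: a pure alphabetic token present in the corpus.
print(lookup.exists("computer"))  # True

# Contraction fallback: "don't" is absent from the corpus, so the
# stem "do" is looked up and its FrequencyData returned instead.
print(lookup.frequency("don't") == lookup.frequency("do"))  # True

# Batch lookup groups words by bucket prefix so each parquet file
# is loaded at most once per call; misses come back as None.
print(lookup.batch_frequency(["the", "algorithm", "xyznotaword"]))
```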

gngram_lookup-0.2.3/gngram_counter/normalize.py
ADDED

@@ -0,0 +1,50 @@
+"""Text normalization utilities for gngram-counter.
+
+Handles normalization of Unicode apostrophe variants and other text
+transformations to ensure consistent matching against the ngram corpus.
+
+Ported from bnc-lookup normalize.py.
+"""
+
+from __future__ import annotations
+
+# Unicode characters that should normalize to ASCII apostrophe (U+0027)
+# Ordered by likelihood of occurrence in English text
+APOSTROPHE_VARIANTS = (
+    '\u2019',  # RIGHT SINGLE QUOTATION MARK (most common smart quote)
+    '\u2018',  # LEFT SINGLE QUOTATION MARK
+    '\u0060',  # GRAVE ACCENT
+    '\u00B4',  # ACUTE ACCENT
+    '\u201B',  # SINGLE HIGH-REVERSED-9 QUOTATION MARK
+    '\u2032',  # PRIME
+    '\u2035',  # REVERSED PRIME
+    '\u02B9',  # MODIFIER LETTER PRIME
+    '\u02BC',  # MODIFIER LETTER APOSTROPHE
+    '\u02C8',  # MODIFIER LETTER VERTICAL LINE
+    '\u0313',  # COMBINING COMMA ABOVE
+    '\u0315',  # COMBINING COMMA ABOVE RIGHT
+    '\u055A',  # ARMENIAN APOSTROPHE
+    '\u05F3',  # HEBREW PUNCTUATION GERESH
+    '\u07F4',  # NKO HIGH TONE APOSTROPHE
+    '\u07F5',  # NKO LOW TONE APOSTROPHE
+    '\uFF07',  # FULLWIDTH APOSTROPHE
+    '\u1FBF',  # GREEK PSILI
+    '\u1FBD',  # GREEK KORONIS
+    '\uA78C',  # LATIN SMALL LETTER SALTILLO
+)
+
+# Pre-compiled translation table for fast apostrophe normalization
+_APOSTROPHE_TABLE = str.maketrans({char: "'" for char in APOSTROPHE_VARIANTS})
+
+
+def normalize_apostrophes(text: str) -> str:
+    """Normalize Unicode apostrophe variants to ASCII apostrophe."""
+    return text.translate(_APOSTROPHE_TABLE)
+
+
+def normalize(text: str) -> str:
+    """Normalize text for ngram lookup.
+
+    Applies: apostrophe variant conversion, lowercase, strip whitespace.
+    """
+    return normalize_apostrophes(text).lower().strip()
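
Since the lookup key is an MD5 over raw UTF-8 bytes, a curly apostrophe would otherwise hash a word into a row that cannot exist; `normalize` repairs the variant before the digest is taken. Two illustrative assertions against the functions above:

```python
from gngram_counter.normalize import normalize, normalize_apostrophes

# U+2019 (RIGHT SINGLE QUOTATION MARK) becomes the ASCII apostrophe.
assert normalize_apostrophes("don\u2019t") == "don't"

# Full normalization also lowercases and strips outer whitespace.
assert normalize("  Don\u2019t ") == "don't"
```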

{gngram_lookup-0.2.1 → gngram_lookup-0.2.3}/pyproject.toml

@@ -1,7 +1,7 @@
 [tool.poetry]
 name = "gngram-lookup"
 packages = [{include = "gngram_counter"}]
-version = "0.2.1"
+version = "0.2.3"
 description = "Static Hash-Based Lookup for Google Ngram Frequencies"
 authors = ["Craig Trim <craigtrim@gmail.com>"]
 maintainers = ["Craig Trim <craigtrim@gmail.com>"]
@@ -39,7 +39,7 @@ gngram-freq = "gngram_counter.cli:gngram_freq"
 generate-setup-file = true
 
 [tool.poetry.dependencies]
-python = "^3.11"
+python = "^3.9"
 polars = "^1.0"
 pyarrow = "^18.0"
 
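
Lowering the floor from `^3.11` to `^3.9` works because both new modules begin with `from __future__ import annotations` (PEP 563): annotations are stored as strings rather than evaluated, so `FrequencyData | None` (PEP 604 union syntax, otherwise 3.10+ at runtime) parses fine on 3.9. A self-contained illustration of the pattern; the function here is hypothetical, not part of the package:

```python
from __future__ import annotations  # PEP 563: postpone annotation evaluation


def maybe_length(word: str) -> int | None:  # "int | None" is never evaluated
    return len(word) or None


print(maybe_length("gngram"))        # 6
print(maybe_length.__annotations__)  # {'word': 'str', 'return': 'int | None'}
```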

gngram_lookup-0.2.3/setup.py
ADDED

@@ -0,0 +1,35 @@
+# -*- coding: utf-8 -*-
+from setuptools import setup
+
+packages = \
+['gngram_counter']
+
+package_data = \
+{'': ['*']}
+
+install_requires = \
+['polars>=1.0,<2.0', 'pyarrow>=18.0,<19.0']
+
+entry_points = \
+{'console_scripts': ['gngram-exists = gngram_counter.cli:gngram_exists',
+                     'gngram-freq = gngram_counter.cli:gngram_freq']}
+
+setup_kwargs = {
+    'name': 'gngram-lookup',
+    'version': '0.2.3',
+    'description': 'Static Hash-Based Lookup for Google Ngram Frequencies',
+    'long_description': "# gngram-lookup\n\n[](https://badge.fury.io/py/gngram-lookup)\n[](https://pepy.tech/project/gngram-lookup)\n[](https://pepy.tech/project/gngram-lookup)\n[](https://github.com/craigtrim/gngram-lookup/tree/main/tests)\n[](https://www.python.org/downloads/)\n\nWord frequency from 500 years of books. O(1) lookup. 5 million words.\n\n## Install\n\n```bash\npip install gngram-lookup\npython -m gngram_lookup.download_data\n```\n\n## Python\n\n```python\nimport gngram_lookup as ng\n\nng.exists('computer') # True\nng.exists('xyznotaword') # False\n\nng.frequency('computer')\n# {'peak_tf': 2000, 'peak_df': 2000, 'sum_tf': 892451, 'sum_df': 312876}\n\nng.batch_frequency(['the', 'algorithm', 'xyznotaword'])\n# {'the': {...}, 'algorithm': {...}, 'xyznotaword': None}\n```\n\n## CLI\n\n```bash\ngngram-exists computer # True, exit 0\ngngram-exists xyznotaword # False, exit 1\n\ngngram-freq computer\n# peak_tf_decade: 2000\n# peak_df_decade: 2000\n# sum_tf: 892451\n# sum_df: 312876\n```\n\n## Docs\n\n- [API Reference](https://github.com/craigtrim/gngram-lookup/blob/main/docs/api.md)\n- [CLI Reference](https://github.com/craigtrim/gngram-lookup/blob/main/docs/cli.md)\n- [Data Format](https://github.com/craigtrim/gngram-lookup/blob/main/docs/data-format.md)\n- [Use Cases](https://github.com/craigtrim/gngram-lookup/blob/main/docs/use-cases.md)\n- [Development](https://github.com/craigtrim/gngram-lookup/blob/main/docs/development.md)\n\n## See Also\n\n- [bnc-lookup](https://pypi.org/project/bnc-lookup/) - O(1) lookup for British National Corpus\n- [wordnet-lookup](https://pypi.org/project/wordnet-lookup/) - O(1) lookup for WordNet\n\n## Attribution\n\nData derived from the [Google Books Ngram](https://books.google.com/ngrams) dataset.\n\n## License\n\nProprietary. See [LICENSE](https://github.com/craigtrim/gngram-lookup/blob/main/LICENSE).\n",
+    'author': 'Craig Trim',
+    'author_email': 'craigtrim@gmail.com',
+    'maintainer': 'Craig Trim',
+    'maintainer_email': 'craigtrim@gmail.com',
+    'url': 'https://github.com/craigtrim/gngram-lookup',
+    'packages': packages,
+    'package_data': package_data,
+    'install_requires': install_requires,
+    'entry_points': entry_points,
+    'python_requires': '>=3.9,<4.0',
+}
+
+
+setup(**setup_kwargs)

gngram_lookup-0.2.1/gngram_counter/lookup.py
DELETED

@@ -1,138 +0,0 @@
-"""
-High-level lookup API for gngram-counter.
-
-Provides simple functions for word frequency lookups similar to bnc-lookup.
-"""
-
-import hashlib
-from functools import lru_cache
-from typing import TypedDict
-
-import polars as pl
-
-from gngram_counter.data import get_hash_file, is_data_installed
-
-
-class FrequencyData(TypedDict):
-    """Frequency data for a word."""
-
-    peak_tf: int  # Decade with highest term frequency
-    peak_df: int  # Decade with highest document frequency
-    sum_tf: int  # Total term frequency across all decades
-    sum_df: int  # Total document frequency across all decades
-
-
-@lru_cache(maxsize=256)
-def _load_bucket(prefix: str) -> pl.DataFrame:
-    """Load and cache a parquet bucket file."""
-    return pl.read_parquet(get_hash_file(prefix))
-
-
-def _hash_word(word: str) -> tuple[str, str]:
-    """Hash a word and return (prefix, suffix)."""
-    h = hashlib.md5(word.lower().encode("utf-8")).hexdigest()
-    return h[:2], h[2:]
-
-
-def exists(word: str) -> bool:
-    """Check if a word exists in the ngram data.
-
-    Args:
-        word: The word to check (case-insensitive)
-
-    Returns:
-        True if the word exists, False otherwise
-
-    Raises:
-        FileNotFoundError: If data files are not installed
-    """
-    if not is_data_installed():
-        raise FileNotFoundError(
-            "Data files not installed. Run: python -m gngram_counter.download_data"
-        )
-
-    prefix, suffix = _hash_word(word)
-    df = _load_bucket(prefix)
-    return len(df.filter(pl.col("hash") == suffix)) > 0
-
-
-def frequency(word: str) -> FrequencyData | None:
-    """Get frequency data for a word.
-
-    Args:
-        word: The word to look up (case-insensitive)
-
-    Returns:
-        FrequencyData dict with peak_tf, peak_df, sum_tf, sum_df, or None if not found
-
-    Raises:
-        FileNotFoundError: If data files are not installed
-    """
-    if not is_data_installed():
-        raise FileNotFoundError(
-            "Data files not installed. Run: python -m gngram_counter.download_data"
-        )
-
-    prefix, suffix = _hash_word(word)
-    df = _load_bucket(prefix)
-    row = df.filter(pl.col("hash") == suffix)
-
-    if len(row) == 0:
-        return None
-
-    return FrequencyData(
-        peak_tf=row["peak_tf"][0],
-        peak_df=row["peak_df"][0],
-        sum_tf=row["sum_tf"][0],
-        sum_df=row["sum_df"][0],
-    )
-
-
-def batch_frequency(words: list[str]) -> dict[str, FrequencyData | None]:
-    """Get frequency data for multiple words.
-
-    Args:
-        words: List of words to look up (case-insensitive)
-
-    Returns:
-        Dict mapping each word to its FrequencyData or None if not found
-
-    Raises:
-        FileNotFoundError: If data files are not installed
-    """
-    if not is_data_installed():
-        raise FileNotFoundError(
-            "Data files not installed. Run: python -m gngram_counter.download_data"
-        )
-
-    # Group words by bucket prefix for efficient batch lookups
-    by_prefix: dict[str, list[tuple[str, str]]] = {}
-    for word in words:
-        prefix, suffix = _hash_word(word)
-        if prefix not in by_prefix:
-            by_prefix[prefix] = []
-        by_prefix[prefix].append((word, suffix))
-
-    results: dict[str, FrequencyData | None] = {}
-
-    for prefix, word_suffix_pairs in by_prefix.items():
-        df = _load_bucket(prefix)
-        suffixes = [s for _, s in word_suffix_pairs]
-
-        # Filter to all matching suffixes at once
-        matches = df.filter(pl.col("hash").is_in(suffixes))
-        match_dict = {row["hash"]: row for row in matches.iter_rows(named=True)}
-
-        for word, suffix in word_suffix_pairs:
-            if suffix in match_dict:
-                row = match_dict[suffix]
-                results[word] = FrequencyData(
-                    peak_tf=row["peak_tf"],
-                    peak_df=row["peak_df"],
-                    sum_tf=row["sum_tf"],
-                    sum_df=row["sum_df"],
-                )
-            else:
-                results[word] = None
-
-    return results
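
Read against the new module above, the deletion makes the 0.2.1 → 0.2.3 behavior change concrete: `_hash_word` used to digest `word.lower()` verbatim (no apostrophe repair, no whitespace stripping), a missing bucket file raised `FileNotFoundError` out of `exists` rather than reading as a miss, and there was no contraction fallback at all. A before/after sketch, with illustrative word choices and a standard data install assumed:

```python
from gngram_counter import lookup

# 0.2.1 hashed the raw lowercased input, so both of these missed;
# 0.2.3 normalizes first and falls back to the contraction stem.
print(lookup.exists("don\u2019t"))    # 0.2.1: False -> 0.2.3: True
print(lookup.exists("  computer  ")) # 0.2.1: False -> 0.2.3: True
```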
gngram_lookup-0.2.1/setup.py
DELETED

@@ -1,35 +0,0 @@
-# -*- coding: utf-8 -*-
-from setuptools import setup
-
-packages = \
-['gngram_counter']
-
-package_data = \
-{'': ['*']}
-
-install_requires = \
-['polars>=1.0,<2.0', 'pyarrow>=18.0,<19.0']
-
-entry_points = \
-{'console_scripts': ['gngram-exists = gngram_counter.cli:gngram_exists',
-                     'gngram-freq = gngram_counter.cli:gngram_freq']}
-
-setup_kwargs = {
-    'name': 'gngram-lookup',
-    'version': '0.2.1',
-    'description': 'Static Hash-Based Lookup for Google Ngram Frequencies',
-    'long_description': "# gngram-lookup\n\n[](https://badge.fury.io/py/gngram-lookup)\n[](https://pepy.tech/project/gngram-lookup)\n[](https://pepy.tech/project/gngram-lookup)\n[](tests/)\n[](https://www.python.org/downloads/)\n\nWord frequency from 500 years of books. O(1) lookup. 5 million words.\n\n## Install\n\n```bash\npip install gngram-lookup\npython -m gngram_lookup.download_data\n```\n\n## Python\n\n```python\nimport gngram_lookup as ng\n\nng.exists('computer') # True\nng.exists('xyznotaword') # False\n\nng.frequency('computer')\n# {'peak_tf': 2000, 'peak_df': 2000, 'sum_tf': 892451, 'sum_df': 312876}\n\nng.batch_frequency(['the', 'algorithm', 'xyznotaword'])\n# {'the': {...}, 'algorithm': {...}, 'xyznotaword': None}\n```\n\n## CLI\n\n```bash\ngngram-exists computer # True, exit 0\ngngram-exists xyznotaword # False, exit 1\n\ngngram-freq computer\n# peak_tf_decade: 2000\n# peak_df_decade: 2000\n# sum_tf: 892451\n# sum_df: 312876\n```\n\n## Docs\n\n- [API Reference](docs/api.md)\n- [CLI Reference](docs/cli.md)\n- [Data Format](docs/data-format.md)\n- [Use Cases](docs/use-cases.md)\n- [Development](docs/development.md)\n\n## See Also\n\n- [bnc-lookup](https://pypi.org/project/bnc-lookup/) - O(1) lookup for British National Corpus\n- [wordnet-lookup](https://pypi.org/project/wordnet-lookup/) - O(1) lookup for WordNet\n\n## Attribution\n\nData derived from the [Google Books Ngram](https://books.google.com/ngrams) dataset.\n\n## License\n\nProprietary. See [LICENSE](LICENSE).\n",
-    'author': 'Craig Trim',
-    'author_email': 'craigtrim@gmail.com',
-    'maintainer': 'Craig Trim',
-    'maintainer_email': 'craigtrim@gmail.com',
-    'url': 'https://github.com/craigtrim/gngram-lookup',
-    'packages': packages,
-    'package_data': package_data,
-    'install_requires': install_requires,
-    'entry_points': entry_points,
-    'python_requires': '>=3.11,<4.0',
-}
-
-
-setup(**setup_kwargs)