gngram-lookup 0.2.1__tar.gz → 0.2.3__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: gngram-lookup
- Version: 0.2.1
+ Version: 0.2.3
  Summary: Static Hash-Based Lookup for Google Ngram Frequencies
  Home-page: https://github.com/craigtrim/gngram-lookup
  License: Proprietary
@@ -9,7 +9,7 @@ Author: Craig Trim
  Author-email: craigtrim@gmail.com
  Maintainer: Craig Trim
  Maintainer-email: craigtrim@gmail.com
- Requires-Python: >=3.11,<4.0
+ Requires-Python: >=3.9,<4.0
  Classifier: Development Status :: 4 - Beta
  Classifier: Intended Audience :: Developers
  Classifier: Intended Audience :: Science/Research
@@ -17,6 +17,8 @@ Classifier: License :: Other/Proprietary License
  Classifier: Natural Language :: English
  Classifier: Operating System :: OS Independent
  Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
  Classifier: Programming Language :: Python :: 3.11
  Classifier: Programming Language :: Python :: 3.12
  Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
@@ -32,8 +34,8 @@ Description-Content-Type: text/markdown
  [![PyPI version](https://badge.fury.io/py/gngram-lookup.svg)](https://badge.fury.io/py/gngram-lookup)
  [![Downloads](https://pepy.tech/badge/gngram-lookup)](https://pepy.tech/project/gngram-lookup)
  [![Downloads/Month](https://pepy.tech/badge/gngram-lookup/month)](https://pepy.tech/project/gngram-lookup)
- [![Tests](https://img.shields.io/badge/tests-58-brightgreen)](tests/)
- [![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/)
+ [![Tests](https://img.shields.io/badge/tests-131-brightgreen)](https://github.com/craigtrim/gngram-lookup/tree/main/tests)
+ [![Python 3.9+](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/downloads/)

  Word frequency from 500 years of books. O(1) lookup. 5 million words.

@@ -74,11 +76,11 @@ gngram-freq computer

  ## Docs

- - [API Reference](docs/api.md)
- - [CLI Reference](docs/cli.md)
- - [Data Format](docs/data-format.md)
- - [Use Cases](docs/use-cases.md)
- - [Development](docs/development.md)
+ - [API Reference](https://github.com/craigtrim/gngram-lookup/blob/main/docs/api.md)
+ - [CLI Reference](https://github.com/craigtrim/gngram-lookup/blob/main/docs/cli.md)
+ - [Data Format](https://github.com/craigtrim/gngram-lookup/blob/main/docs/data-format.md)
+ - [Use Cases](https://github.com/craigtrim/gngram-lookup/blob/main/docs/use-cases.md)
+ - [Development](https://github.com/craigtrim/gngram-lookup/blob/main/docs/development.md)

  ## See Also

@@ -91,5 +93,5 @@ Data derived from the [Google Books Ngram](https://books.google.com/ngrams) data

  ## License

- Proprietary. See [LICENSE](LICENSE).
+ Proprietary. See [LICENSE](https://github.com/craigtrim/gngram-lookup/blob/main/LICENSE).

@@ -3,8 +3,8 @@
  [![PyPI version](https://badge.fury.io/py/gngram-lookup.svg)](https://badge.fury.io/py/gngram-lookup)
  [![Downloads](https://pepy.tech/badge/gngram-lookup)](https://pepy.tech/project/gngram-lookup)
  [![Downloads/Month](https://pepy.tech/badge/gngram-lookup/month)](https://pepy.tech/project/gngram-lookup)
- [![Tests](https://img.shields.io/badge/tests-58-brightgreen)](tests/)
- [![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/)
+ [![Tests](https://img.shields.io/badge/tests-131-brightgreen)](https://github.com/craigtrim/gngram-lookup/tree/main/tests)
+ [![Python 3.9+](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/downloads/)

  Word frequency from 500 years of books. O(1) lookup. 5 million words.

@@ -45,11 +45,11 @@ gngram-freq computer

  ## Docs

- - [API Reference](docs/api.md)
- - [CLI Reference](docs/cli.md)
- - [Data Format](docs/data-format.md)
- - [Use Cases](docs/use-cases.md)
- - [Development](docs/development.md)
+ - [API Reference](https://github.com/craigtrim/gngram-lookup/blob/main/docs/api.md)
+ - [CLI Reference](https://github.com/craigtrim/gngram-lookup/blob/main/docs/cli.md)
+ - [Data Format](https://github.com/craigtrim/gngram-lookup/blob/main/docs/data-format.md)
+ - [Use Cases](https://github.com/craigtrim/gngram-lookup/blob/main/docs/use-cases.md)
+ - [Development](https://github.com/craigtrim/gngram-lookup/blob/main/docs/development.md)

  ## See Also

@@ -62,4 +62,4 @@ Data derived from the [Google Books Ngram](https://books.google.com/ngrams) data

  ## License

- Proprietary. See [LICENSE](LICENSE).
+ Proprietary. See [LICENSE](https://github.com/craigtrim/gngram-lookup/blob/main/LICENSE).
@@ -0,0 +1,246 @@
+ """
+ High-level lookup API for gngram-counter.
+
+ Provides simple functions for word frequency lookups similar to bnc-lookup.
+
+ Includes contraction fallback: if a contraction like "don't" is not found
+ directly, the stem ("do") is looked up instead. The ngram corpus only
+ contains pure alphabetic words, so contractions and their suffix parts
+ (n't, 'll, etc.) are absent — but the stems are present.
+ """
+
+ from __future__ import annotations
+
+ import hashlib
+ from functools import lru_cache
+ from typing import TypedDict
+
+ import polars as pl
+
+ from gngram_counter.data import get_hash_file, is_data_installed
+ from gngram_counter.normalize import normalize
+
+
+ class FrequencyData(TypedDict):
+     """Frequency data for a word."""
+
+     peak_tf: int  # Decade with highest term frequency
+     peak_df: int  # Decade with highest document frequency
+     sum_tf: int  # Total term frequency across all decades
+     sum_df: int  # Total document frequency across all decades
+
+
+ # Contraction suffixes stored as separate tokens in the ngram corpus
+ # Order matters: longer suffixes must be checked before shorter ones
+ CONTRACTION_SUFFIXES = ("n't", "'ll", "'re", "'ve", "'m", "'d")
+
+ # Specific stems that form 's contractions (where 's = "is" or "has").
+ # NOT generalized — 's is ambiguous with possessive, so only known
+ # contraction stems are listed here. Ported from bnc-lookup.
+ S_CONTRACTION_STEMS = frozenset({
+     # Pronouns (unambiguously 's = "is" or "has", never possessive)
+     'it', 'he', 'she', 'that', 'what', 'who',
+     # Adverbs / demonstratives
+     'where', 'how', 'here', 'there',
+     # "let's" = "let us"
+     'let',
+     # Indefinite pronouns
+     'somebody', 'everybody', 'everyone', 'nobody',
+     'anywhere', 'nowhere',
+ })
+
+
+ @lru_cache(maxsize=256)
+ def _load_bucket(prefix: str) -> pl.DataFrame:
+     """Load and cache a parquet bucket file."""
+     return pl.read_parquet(get_hash_file(prefix))
+
+
+ def _hash_word(word: str) -> tuple[str, str]:
+     """Hash a word and return (prefix, suffix)."""
+     h = hashlib.md5(normalize(word).encode("utf-8")).hexdigest()
+     return h[:2], h[2:]
+
+
+ def _lookup_frequency(word: str) -> FrequencyData | None:
+     """Look up frequency data for a single word form (no fallbacks)."""
+     if not word:
+         return None
+     prefix, suffix = _hash_word(word)
+     try:
+         df = _load_bucket(prefix)
+     except FileNotFoundError:
+         return None
+     row = df.filter(pl.col("hash") == suffix)
+     if len(row) == 0:
+         return None
+     return FrequencyData(
+         peak_tf=row["peak_tf"][0],
+         peak_df=row["peak_df"][0],
+         sum_tf=row["sum_tf"][0],
+         sum_df=row["sum_df"][0],
+     )
+
+
+ def _split_contraction(word: str) -> tuple[str, str] | None:
+     """Split a contraction into its component parts if possible.
+
+     The ngram corpus tokenizes contractions separately (e.g., "we'll" -> "we" + "'ll").
+     This function reverses that split for fallback lookup.
+
+     Returns:
+         Tuple of (stem, suffix) if the word matches a contraction pattern,
+         or None if no contraction pattern matches.
+     """
+     for suffix in CONTRACTION_SUFFIXES:
+         if word.endswith(suffix):
+             stem = word[:-len(suffix)]
+             if stem:
+                 return (stem, suffix)
+
+     # Specific 's contractions from curated allowlist (not possessives)
+     if word.endswith("'s"):
+         stem = word[:-2]
+         if stem in S_CONTRACTION_STEMS:
+             return (stem, "'s")
+
+     return None
+
+
+ def exists(word: str) -> bool:
+     """Check if a word exists in the ngram data.
+
+     Performs case-insensitive lookup with automatic fallbacks:
+     1. Direct lookup of the normalized word
+     2. Contraction fallback: if word is a contraction, check if both
+        components exist (e.g., "don't" -> "do" + "n't")
+
+     Args:
+         word: The word to check (case-insensitive)
+
+     Returns:
+         True if the word exists, False otherwise
+
+     Raises:
+         FileNotFoundError: If data files are not installed
+     """
+     if not is_data_installed():
+         raise FileNotFoundError(
+             "Data files not installed. Run: python -m gngram_counter.download_data"
+         )
+
+     word = normalize(word)
+
+     if _lookup_frequency(word) is not None:
+         return True
+
+     # Contraction fallback: check if the stem exists
+     parts = _split_contraction(word)
+     if parts:
+         stem, _ = parts
+         if _lookup_frequency(stem) is not None:
+             return True
+
+     return False
+
+
+ def frequency(word: str) -> FrequencyData | None:
+     """Get frequency data for a word.
+
+     Performs case-insensitive lookup with contraction fallback.
+     For contractions, returns the stem's frequency data.
+
+     Args:
+         word: The word to look up (case-insensitive)
+
+     Returns:
+         FrequencyData dict with peak_tf, peak_df, sum_tf, sum_df, or None if not found
+
+     Raises:
+         FileNotFoundError: If data files are not installed
+     """
+     if not is_data_installed():
+         raise FileNotFoundError(
+             "Data files not installed. Run: python -m gngram_counter.download_data"
+         )
+
+     word = normalize(word)
+
+     result = _lookup_frequency(word)
+     if result is not None:
+         return result
+
+     # Contraction fallback: return the stem's frequency
+     parts = _split_contraction(word)
+     if parts:
+         stem, _ = parts
+         stem_freq = _lookup_frequency(stem)
+         if stem_freq is not None:
+             return stem_freq
+
+     return None
+
+
+ def batch_frequency(words: list[str]) -> dict[str, FrequencyData | None]:
+     """Get frequency data for multiple words.
+
+     Args:
+         words: List of words to look up (case-insensitive)
+
+     Returns:
+         Dict mapping each word to its FrequencyData or None if not found
+
+     Raises:
+         FileNotFoundError: If data files are not installed
+     """
+     if not is_data_installed():
+         raise FileNotFoundError(
+             "Data files not installed. Run: python -m gngram_counter.download_data"
+         )
+
+     # Group words by bucket prefix for efficient batch lookups
+     by_prefix: dict[str, list[tuple[str, str, str]]] = {}
+     contraction_words: list[str] = []
+
+     for word in words:
+         normalized = normalize(word)
+         prefix, suffix = _hash_word(normalized)
+         if prefix not in by_prefix:
+             by_prefix[prefix] = []
+         by_prefix[prefix].append((word, normalized, suffix))
+
+     results: dict[str, FrequencyData | None] = {}
+
+     for prefix, entries in by_prefix.items():
+         df = _load_bucket(prefix)
+         suffixes = [s for _, _, s in entries]
+
+         # Filter to all matching suffixes at once
+         matches = df.filter(pl.col("hash").is_in(suffixes))
+         match_dict = {row["hash"]: row for row in matches.iter_rows(named=True)}
+
+         for word, normalized, suffix in entries:
+             if suffix in match_dict:
+                 row = match_dict[suffix]
+                 results[word] = FrequencyData(
+                     peak_tf=row["peak_tf"],
+                     peak_df=row["peak_df"],
+                     sum_tf=row["sum_tf"],
+                     sum_df=row["sum_df"],
+                 )
+             else:
+                 # Mark for contraction fallback
+                 results[word] = None
+                 contraction_words.append(word)
+
+     # Contraction fallback for words not found directly
+     for word in contraction_words:
+         normalized = normalize(word)
+         parts = _split_contraction(normalized)
+         if parts:
+             stem, _ = parts
+             stem_freq = _lookup_frequency(stem)
+             if stem_freq is not None:
+                 results[word] = stem_freq
+
+     return results
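
The fallback logic added above is small enough to restate outside the package. Below is a minimal, self-contained sketch mirroring `_split_contraction` (renamed `split_contraction` here), with the two constants copied verbatim from the diff; nothing imports gngram-lookup, so the printed results follow purely from the rules as written:

```python
# Self-contained restatement of the contraction splitting introduced in 0.2.3.
# CONTRACTION_SUFFIXES and S_CONTRACTION_STEMS are copied from the diff above.
from __future__ import annotations

CONTRACTION_SUFFIXES = ("n't", "'ll", "'re", "'ve", "'m", "'d")
S_CONTRACTION_STEMS = frozenset({
    'it', 'he', 'she', 'that', 'what', 'who',
    'where', 'how', 'here', 'there', 'let',
    'somebody', 'everybody', 'everyone', 'nobody',
    'anywhere', 'nowhere',
})

def split_contraction(word: str) -> tuple[str, str] | None:
    # Suffix order matters: the longer "n't" is checked before "'m"/"'d".
    for suffix in CONTRACTION_SUFFIXES:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if stem:
                return (stem, suffix)
    # "'s" is ambiguous with the possessive, so only allowlisted stems split.
    if word.endswith("'s") and word[:-2] in S_CONTRACTION_STEMS:
        return (word[:-2], "'s")
    return None

print(split_contraction("don't"))  # ('do', "n't") -> frequency("don't") returns stats for "do"
print(split_contraction("let's"))  # ('let', "'s")
print(split_contraction("dog's"))  # None -- possessive, no fallback
```

In `exists` and `frequency` above, this split only runs after the direct lookup misses, so ordinary corpus words stay on the fast path.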
@@ -0,0 +1,50 @@
+ """Text normalization utilities for gngram-counter.
+
+ Handles normalization of Unicode apostrophe variants and other text
+ transformations to ensure consistent matching against the ngram corpus.
+
+ Ported from bnc-lookup normalize.py.
+ """
+
+ from __future__ import annotations
+
+ # Unicode characters that should normalize to ASCII apostrophe (U+0027)
+ # Ordered by likelihood of occurrence in English text
+ APOSTROPHE_VARIANTS = (
+     '\u2019'  # RIGHT SINGLE QUOTATION MARK (most common smart quote)
+     '\u2018'  # LEFT SINGLE QUOTATION MARK
+     '\u0060'  # GRAVE ACCENT
+     '\u00B4'  # ACUTE ACCENT
+     '\u201B'  # SINGLE HIGH-REVERSED-9 QUOTATION MARK
+     '\u2032'  # PRIME
+     '\u2035'  # REVERSED PRIME
+     '\u02B9'  # MODIFIER LETTER PRIME
+     '\u02BC'  # MODIFIER LETTER APOSTROPHE
+     '\u02C8'  # MODIFIER LETTER VERTICAL LINE
+     '\u0313'  # COMBINING COMMA ABOVE
+     '\u0315'  # COMBINING COMMA ABOVE RIGHT
+     '\u055A'  # ARMENIAN APOSTROPHE
+     '\u05F3'  # HEBREW PUNCTUATION GERESH
+     '\u07F4'  # NKO HIGH TONE APOSTROPHE
+     '\u07F5'  # NKO LOW TONE APOSTROPHE
+     '\uFF07'  # FULLWIDTH APOSTROPHE
+     '\u1FBF'  # GREEK PSILI
+     '\u1FBD'  # GREEK KORONIS
+     '\uA78C'  # LATIN SMALL LETTER SALTILLO
+ )
+
+ # Pre-compiled translation table for fast apostrophe normalization
+ _APOSTROPHE_TABLE = str.maketrans({char: "'" for char in APOSTROPHE_VARIANTS})
+
+
+ def normalize_apostrophes(text: str) -> str:
+     """Normalize Unicode apostrophe variants to ASCII apostrophe."""
+     return text.translate(_APOSTROPHE_TABLE)
+
+
+ def normalize(text: str) -> str:
+     """Normalize text for ngram lookup.
+
+     Applies: apostrophe variant conversion, lowercase, strip whitespace.
+     """
+     return normalize_apostrophes(text).lower().strip()
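
This normalization pass feeds both `_hash_word` and the contraction splitter, so a curly apostrophe never causes a spurious miss. A trimmed restatement for illustration; the variant string below is only a subset of the twenty characters listed in the diff:

```python
# Trimmed restatement of normalize() from the diff above. Only a subset of
# the apostrophe variants is included here, for brevity.
APOSTROPHE_VARIANTS = '\u2019\u2018\u0060\u00B4\u201B\u2032\u2035'
_APOSTROPHE_TABLE = str.maketrans({c: "'" for c in APOSTROPHE_VARIANTS})

def normalize(text: str) -> str:
    # Apostrophe variants -> ASCII apostrophe, then lowercase and strip.
    return text.translate(_APOSTROPHE_TABLE).lower().strip()

print(normalize('Don\u2019t '))  # "don't" -- smart quote, case, whitespace all normalized
print(normalize('LET\u2018S'))   # "let's" -- now matches the 's allowlist downstream
```

Note that the parenthesized variant list in the diff has no commas between the literals, so it is a single implicitly concatenated string rather than a tuple; since `str.maketrans` consumes it character by character, the resulting table is the same either way.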
@@ -1,7 +1,7 @@
  [tool.poetry]
  name = "gngram-lookup"
  packages = [{include = "gngram_counter"}]
- version = "0.2.1"
+ version = "0.2.3"
  description = "Static Hash-Based Lookup for Google Ngram Frequencies"
  authors = ["Craig Trim <craigtrim@gmail.com>"]
  maintainers = ["Craig Trim <craigtrim@gmail.com>"]
@@ -39,7 +39,7 @@ gngram-freq = "gngram_counter.cli:gngram_freq"
  generate-setup-file = true

  [tool.poetry.dependencies]
- python = "^3.11"
+ python = "^3.9"
  polars = "^1.0"
  pyarrow = "^18.0"

@@ -0,0 +1,35 @@
+ # -*- coding: utf-8 -*-
+ from setuptools import setup
+
+ packages = \
+ ['gngram_counter']
+
+ package_data = \
+ {'': ['*']}
+
+ install_requires = \
+ ['polars>=1.0,<2.0', 'pyarrow>=18.0,<19.0']
+
+ entry_points = \
+ {'console_scripts': ['gngram-exists = gngram_counter.cli:gngram_exists',
+                      'gngram-freq = gngram_counter.cli:gngram_freq']}
+
+ setup_kwargs = {
+     'name': 'gngram-lookup',
+     'version': '0.2.3',
+     'description': 'Static Hash-Based Lookup for Google Ngram Frequencies',
+     'long_description': "# gngram-lookup\n\n[![PyPI version](https://badge.fury.io/py/gngram-lookup.svg)](https://badge.fury.io/py/gngram-lookup)\n[![Downloads](https://pepy.tech/badge/gngram-lookup)](https://pepy.tech/project/gngram-lookup)\n[![Downloads/Month](https://pepy.tech/badge/gngram-lookup/month)](https://pepy.tech/project/gngram-lookup)\n[![Tests](https://img.shields.io/badge/tests-131-brightgreen)](https://github.com/craigtrim/gngram-lookup/tree/main/tests)\n[![Python 3.9+](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/downloads/)\n\nWord frequency from 500 years of books. O(1) lookup. 5 million words.\n\n## Install\n\n```bash\npip install gngram-lookup\npython -m gngram_lookup.download_data\n```\n\n## Python\n\n```python\nimport gngram_lookup as ng\n\nng.exists('computer') # True\nng.exists('xyznotaword') # False\n\nng.frequency('computer')\n# {'peak_tf': 2000, 'peak_df': 2000, 'sum_tf': 892451, 'sum_df': 312876}\n\nng.batch_frequency(['the', 'algorithm', 'xyznotaword'])\n# {'the': {...}, 'algorithm': {...}, 'xyznotaword': None}\n```\n\n## CLI\n\n```bash\ngngram-exists computer # True, exit 0\ngngram-exists xyznotaword # False, exit 1\n\ngngram-freq computer\n# peak_tf_decade: 2000\n# peak_df_decade: 2000\n# sum_tf: 892451\n# sum_df: 312876\n```\n\n## Docs\n\n- [API Reference](https://github.com/craigtrim/gngram-lookup/blob/main/docs/api.md)\n- [CLI Reference](https://github.com/craigtrim/gngram-lookup/blob/main/docs/cli.md)\n- [Data Format](https://github.com/craigtrim/gngram-lookup/blob/main/docs/data-format.md)\n- [Use Cases](https://github.com/craigtrim/gngram-lookup/blob/main/docs/use-cases.md)\n- [Development](https://github.com/craigtrim/gngram-lookup/blob/main/docs/development.md)\n\n## See Also\n\n- [bnc-lookup](https://pypi.org/project/bnc-lookup/) - O(1) lookup for British National Corpus\n- [wordnet-lookup](https://pypi.org/project/wordnet-lookup/) - O(1) lookup for WordNet\n\n## Attribution\n\nData derived from the [Google Books Ngram](https://books.google.com/ngrams) dataset.\n\n## License\n\nProprietary. See [LICENSE](https://github.com/craigtrim/gngram-lookup/blob/main/LICENSE).\n",
+     'author': 'Craig Trim',
+     'author_email': 'craigtrim@gmail.com',
+     'maintainer': 'Craig Trim',
+     'maintainer_email': 'craigtrim@gmail.com',
+     'url': 'https://github.com/craigtrim/gngram-lookup',
+     'packages': packages,
+     'package_data': package_data,
+     'install_requires': install_requires,
+     'entry_points': entry_points,
+     'python_requires': '>=3.9,<4.0',
+ }
+
+
+ setup(**setup_kwargs)
@@ -1,138 +0,0 @@
- """
- High-level lookup API for gngram-counter.
-
- Provides simple functions for word frequency lookups similar to bnc-lookup.
- """
-
- import hashlib
- from functools import lru_cache
- from typing import TypedDict
-
- import polars as pl
-
- from gngram_counter.data import get_hash_file, is_data_installed
-
-
- class FrequencyData(TypedDict):
-     """Frequency data for a word."""
-
-     peak_tf: int  # Decade with highest term frequency
-     peak_df: int  # Decade with highest document frequency
-     sum_tf: int  # Total term frequency across all decades
-     sum_df: int  # Total document frequency across all decades
-
-
- @lru_cache(maxsize=256)
- def _load_bucket(prefix: str) -> pl.DataFrame:
-     """Load and cache a parquet bucket file."""
-     return pl.read_parquet(get_hash_file(prefix))
-
-
- def _hash_word(word: str) -> tuple[str, str]:
-     """Hash a word and return (prefix, suffix)."""
-     h = hashlib.md5(word.lower().encode("utf-8")).hexdigest()
-     return h[:2], h[2:]
-
-
- def exists(word: str) -> bool:
-     """Check if a word exists in the ngram data.
-
-     Args:
-         word: The word to check (case-insensitive)
-
-     Returns:
-         True if the word exists, False otherwise
-
-     Raises:
-         FileNotFoundError: If data files are not installed
-     """
-     if not is_data_installed():
-         raise FileNotFoundError(
-             "Data files not installed. Run: python -m gngram_counter.download_data"
-         )
-
-     prefix, suffix = _hash_word(word)
-     df = _load_bucket(prefix)
-     return len(df.filter(pl.col("hash") == suffix)) > 0
-
-
- def frequency(word: str) -> FrequencyData | None:
-     """Get frequency data for a word.
-
-     Args:
-         word: The word to look up (case-insensitive)
-
-     Returns:
-         FrequencyData dict with peak_tf, peak_df, sum_tf, sum_df, or None if not found
-
-     Raises:
-         FileNotFoundError: If data files are not installed
-     """
-     if not is_data_installed():
-         raise FileNotFoundError(
-             "Data files not installed. Run: python -m gngram_counter.download_data"
-         )
-
-     prefix, suffix = _hash_word(word)
-     df = _load_bucket(prefix)
-     row = df.filter(pl.col("hash") == suffix)
-
-     if len(row) == 0:
-         return None
-
-     return FrequencyData(
-         peak_tf=row["peak_tf"][0],
-         peak_df=row["peak_df"][0],
-         sum_tf=row["sum_tf"][0],
-         sum_df=row["sum_df"][0],
-     )
-
-
- def batch_frequency(words: list[str]) -> dict[str, FrequencyData | None]:
-     """Get frequency data for multiple words.
-
-     Args:
-         words: List of words to look up (case-insensitive)
-
-     Returns:
-         Dict mapping each word to its FrequencyData or None if not found
-
-     Raises:
-         FileNotFoundError: If data files are not installed
-     """
-     if not is_data_installed():
-         raise FileNotFoundError(
-             "Data files not installed. Run: python -m gngram_counter.download_data"
-         )
-
-     # Group words by bucket prefix for efficient batch lookups
-     by_prefix: dict[str, list[tuple[str, str]]] = {}
-     for word in words:
-         prefix, suffix = _hash_word(word)
-         if prefix not in by_prefix:
-             by_prefix[prefix] = []
-         by_prefix[prefix].append((word, suffix))
-
-     results: dict[str, FrequencyData | None] = {}
-
-     for prefix, word_suffix_pairs in by_prefix.items():
-         df = _load_bucket(prefix)
-         suffixes = [s for _, s in word_suffix_pairs]
-
-         # Filter to all matching suffixes at once
-         matches = df.filter(pl.col("hash").is_in(suffixes))
-         match_dict = {row["hash"]: row for row in matches.iter_rows(named=True)}
-
-         for word, suffix in word_suffix_pairs:
-             if suffix in match_dict:
-                 row = match_dict[suffix]
-                 results[word] = FrequencyData(
-                     peak_tf=row["peak_tf"],
-                     peak_df=row["peak_df"],
-                     sum_tf=row["sum_tf"],
-                     sum_df=row["sum_df"],
-                 )
-             else:
-                 results[word] = None
-
-     return results
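
Comparing the removed module with its replacement: the hash-bucket scheme is identical in both versions; only the hash input changed, from `word.lower()` in 0.2.1 to `normalize(word)` in 0.2.3. A quick sketch of the shared scheme, using the 0.2.1 input for concreteness:

```python
import hashlib

def hash_word(word: str) -> tuple[str, str]:
    """Split a word's MD5 hex digest into (bucket prefix, stored suffix)."""
    # 0.2.1 hashed word.lower(); 0.2.3 hashes normalize(word) instead.
    h = hashlib.md5(word.lower().encode("utf-8")).hexdigest()
    return h[:2], h[2:]

prefix, suffix = hash_word("computer")
print(prefix)       # 2 hex chars -> selects one of 256 parquet bucket files
print(len(suffix))  # 30 remaining hex chars, matched against the bucket's "hash" column
```

This is what makes the lookup O(1): each word touches at most one bucket file, and `_load_bucket`'s `lru_cache(maxsize=256)` can hold every possible two-hex-character bucket at once.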
@@ -1,35 +0,0 @@
1
- # -*- coding: utf-8 -*-
2
- from setuptools import setup
3
-
4
- packages = \
5
- ['gngram_counter']
6
-
7
- package_data = \
8
- {'': ['*']}
9
-
10
- install_requires = \
11
- ['polars>=1.0,<2.0', 'pyarrow>=18.0,<19.0']
12
-
13
- entry_points = \
14
- {'console_scripts': ['gngram-exists = gngram_counter.cli:gngram_exists',
15
- 'gngram-freq = gngram_counter.cli:gngram_freq']}
16
-
17
- setup_kwargs = {
18
- 'name': 'gngram-lookup',
19
- 'version': '0.2.1',
20
- 'description': 'Static Hash-Based Lookup for Google Ngram Frequencies',
21
- 'long_description': "# gngram-lookup\n\n[![PyPI version](https://badge.fury.io/py/gngram-lookup.svg)](https://badge.fury.io/py/gngram-lookup)\n[![Downloads](https://pepy.tech/badge/gngram-lookup)](https://pepy.tech/project/gngram-lookup)\n[![Downloads/Month](https://pepy.tech/badge/gngram-lookup/month)](https://pepy.tech/project/gngram-lookup)\n[![Tests](https://img.shields.io/badge/tests-58-brightgreen)](tests/)\n[![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/)\n\nWord frequency from 500 years of books. O(1) lookup. 5 million words.\n\n## Install\n\n```bash\npip install gngram-lookup\npython -m gngram_lookup.download_data\n```\n\n## Python\n\n```python\nimport gngram_lookup as ng\n\nng.exists('computer') # True\nng.exists('xyznotaword') # False\n\nng.frequency('computer')\n# {'peak_tf': 2000, 'peak_df': 2000, 'sum_tf': 892451, 'sum_df': 312876}\n\nng.batch_frequency(['the', 'algorithm', 'xyznotaword'])\n# {'the': {...}, 'algorithm': {...}, 'xyznotaword': None}\n```\n\n## CLI\n\n```bash\ngngram-exists computer # True, exit 0\ngngram-exists xyznotaword # False, exit 1\n\ngngram-freq computer\n# peak_tf_decade: 2000\n# peak_df_decade: 2000\n# sum_tf: 892451\n# sum_df: 312876\n```\n\n## Docs\n\n- [API Reference](docs/api.md)\n- [CLI Reference](docs/cli.md)\n- [Data Format](docs/data-format.md)\n- [Use Cases](docs/use-cases.md)\n- [Development](docs/development.md)\n\n## See Also\n\n- [bnc-lookup](https://pypi.org/project/bnc-lookup/) - O(1) lookup for British National Corpus\n- [wordnet-lookup](https://pypi.org/project/wordnet-lookup/) - O(1) lookup for WordNet\n\n## Attribution\n\nData derived from the [Google Books Ngram](https://books.google.com/ngrams) dataset.\n\n## License\n\nProprietary. See [LICENSE](LICENSE).\n",
22
- 'author': 'Craig Trim',
23
- 'author_email': 'craigtrim@gmail.com',
24
- 'maintainer': 'Craig Trim',
25
- 'maintainer_email': 'craigtrim@gmail.com',
26
- 'url': 'https://github.com/craigtrim/gngram-lookup',
27
- 'packages': packages,
28
- 'package_data': package_data,
29
- 'install_requires': install_requires,
30
- 'entry_points': entry_points,
31
- 'python_requires': '>=3.11,<4.0',
32
- }
33
-
34
-
35
- setup(**setup_kwargs)