gngram-lookup 0.2.0__tar.gz → 0.2.2__tar.gz

This diff compares the contents of two publicly released versions of the package, exactly as they appear in the public registry. It is provided for informational purposes only.
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: gngram-lookup
- Version: 0.2.0
+ Version: 0.2.2
  Summary: Static Hash-Based Lookup for Google Ngram Frequencies
  Home-page: https://github.com/craigtrim/gngram-lookup
  License: Proprietary
@@ -30,6 +30,9 @@ Description-Content-Type: text/markdown
  # gngram-lookup

  [![PyPI version](https://badge.fury.io/py/gngram-lookup.svg)](https://badge.fury.io/py/gngram-lookup)
+ [![Downloads](https://pepy.tech/badge/gngram-lookup)](https://pepy.tech/project/gngram-lookup)
+ [![Downloads/Month](https://pepy.tech/badge/gngram-lookup/month)](https://pepy.tech/project/gngram-lookup)
+ [![Tests](https://img.shields.io/badge/tests-131-brightgreen)](https://github.com/craigtrim/gngram-lookup/tree/main/tests)
  [![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/)

  Word frequency from 500 years of books. O(1) lookup. 5 million words.
@@ -71,11 +74,16 @@ gngram-freq computer

  ## Docs

- - [API Reference](docs/api.md)
- - [CLI Reference](docs/cli.md)
- - [Data Format](docs/data-format.md)
- - [Use Cases](docs/use-cases.md)
- - [Development](docs/development.md)
+ - [API Reference](https://github.com/craigtrim/gngram-lookup/blob/main/docs/api.md)
+ - [CLI Reference](https://github.com/craigtrim/gngram-lookup/blob/main/docs/cli.md)
+ - [Data Format](https://github.com/craigtrim/gngram-lookup/blob/main/docs/data-format.md)
+ - [Use Cases](https://github.com/craigtrim/gngram-lookup/blob/main/docs/use-cases.md)
+ - [Development](https://github.com/craigtrim/gngram-lookup/blob/main/docs/development.md)
+
+ ## See Also
+
+ - [bnc-lookup](https://pypi.org/project/bnc-lookup/) - O(1) lookup for British National Corpus
+ - [wordnet-lookup](https://pypi.org/project/wordnet-lookup/) - O(1) lookup for WordNet

  ## Attribution

@@ -83,5 +91,5 @@ Data derived from the [Google Books Ngram](https://books.google.com/ngrams) data

  ## License

- Proprietary. See [LICENSE](LICENSE).
+ Proprietary. See [LICENSE](https://github.com/craigtrim/gngram-lookup/blob/main/LICENSE).

@@ -1,6 +1,9 @@
  # gngram-lookup

  [![PyPI version](https://badge.fury.io/py/gngram-lookup.svg)](https://badge.fury.io/py/gngram-lookup)
+ [![Downloads](https://pepy.tech/badge/gngram-lookup)](https://pepy.tech/project/gngram-lookup)
+ [![Downloads/Month](https://pepy.tech/badge/gngram-lookup/month)](https://pepy.tech/project/gngram-lookup)
+ [![Tests](https://img.shields.io/badge/tests-131-brightgreen)](https://github.com/craigtrim/gngram-lookup/tree/main/tests)
  [![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/)

  Word frequency from 500 years of books. O(1) lookup. 5 million words.
@@ -42,11 +45,16 @@ gngram-freq computer

  ## Docs

- - [API Reference](docs/api.md)
- - [CLI Reference](docs/cli.md)
- - [Data Format](docs/data-format.md)
- - [Use Cases](docs/use-cases.md)
- - [Development](docs/development.md)
+ - [API Reference](https://github.com/craigtrim/gngram-lookup/blob/main/docs/api.md)
+ - [CLI Reference](https://github.com/craigtrim/gngram-lookup/blob/main/docs/cli.md)
+ - [Data Format](https://github.com/craigtrim/gngram-lookup/blob/main/docs/data-format.md)
+ - [Use Cases](https://github.com/craigtrim/gngram-lookup/blob/main/docs/use-cases.md)
+ - [Development](https://github.com/craigtrim/gngram-lookup/blob/main/docs/development.md)
+
+ ## See Also
+
+ - [bnc-lookup](https://pypi.org/project/bnc-lookup/) - O(1) lookup for British National Corpus
+ - [wordnet-lookup](https://pypi.org/project/wordnet-lookup/) - O(1) lookup for WordNet

  ## Attribution

@@ -54,4 +62,4 @@ Data derived from the [Google Books Ngram](https://books.google.com/ngrams) data

  ## License

- Proprietary. See [LICENSE](LICENSE).
+ Proprietary. See [LICENSE](https://github.com/craigtrim/gngram-lookup/blob/main/LICENSE).
@@ -0,0 +1,244 @@
+ """
+ High-level lookup API for gngram-counter.
+
+ Provides simple functions for word frequency lookups similar to bnc-lookup.
+
+ Includes contraction fallback: if a contraction like "don't" is not found
+ directly, the stem ("do") is looked up instead. The ngram corpus only
+ contains pure alphabetic words, so contractions and their suffix parts
+ (n't, 'll, etc.) are absent — but the stems are present.
+ """
+
+ import hashlib
+ from functools import lru_cache
+ from typing import TypedDict
+
+ import polars as pl
+
+ from gngram_counter.data import get_hash_file, is_data_installed
+ from gngram_counter.normalize import normalize
+
+
+ class FrequencyData(TypedDict):
+     """Frequency data for a word."""
+
+     peak_tf: int  # Decade with highest term frequency
+     peak_df: int  # Decade with highest document frequency
+     sum_tf: int  # Total term frequency across all decades
+     sum_df: int  # Total document frequency across all decades
+
+
+ # Contraction suffixes stored as separate tokens in the ngram corpus
+ # Order matters: longer suffixes must be checked before shorter ones
+ CONTRACTION_SUFFIXES = ("n't", "'ll", "'re", "'ve", "'m", "'d")
+
+ # Specific stems that form 's contractions (where 's = "is" or "has").
+ # NOT generalized — 's is ambiguous with possessive, so only known
+ # contraction stems are listed here. Ported from bnc-lookup.
+ S_CONTRACTION_STEMS = frozenset({
+     # Pronouns (unambiguously 's = "is" or "has", never possessive)
+     'it', 'he', 'she', 'that', 'what', 'who',
+     # Adverbs / demonstratives
+     'where', 'how', 'here', 'there',
+     # "let's" = "let us"
+     'let',
+     # Indefinite pronouns
+     'somebody', 'everybody', 'everyone', 'nobody',
+     'anywhere', 'nowhere',
+ })
+
+
+ @lru_cache(maxsize=256)
+ def _load_bucket(prefix: str) -> pl.DataFrame:
+     """Load and cache a parquet bucket file."""
+     return pl.read_parquet(get_hash_file(prefix))
+
+
+ def _hash_word(word: str) -> tuple[str, str]:
+     """Hash a word and return (prefix, suffix)."""
+     h = hashlib.md5(normalize(word).encode("utf-8")).hexdigest()
+     return h[:2], h[2:]
+
+
+ def _lookup_frequency(word: str) -> FrequencyData | None:
+     """Look up frequency data for a single word form (no fallbacks)."""
+     if not word:
+         return None
+     prefix, suffix = _hash_word(word)
+     try:
+         df = _load_bucket(prefix)
+     except FileNotFoundError:
+         return None
+     row = df.filter(pl.col("hash") == suffix)
+     if len(row) == 0:
+         return None
+     return FrequencyData(
+         peak_tf=row["peak_tf"][0],
+         peak_df=row["peak_df"][0],
+         sum_tf=row["sum_tf"][0],
+         sum_df=row["sum_df"][0],
+     )
+
+
+ def _split_contraction(word: str) -> tuple[str, str] | None:
+     """Split a contraction into its component parts if possible.
+
+     The ngram corpus tokenizes contractions separately (e.g., "we'll" -> "we" + "'ll").
+     This function reverses that split for fallback lookup.
+
+     Returns:
+         Tuple of (stem, suffix) if the word matches a contraction pattern,
+         or None if no contraction pattern matches.
+     """
+     for suffix in CONTRACTION_SUFFIXES:
+         if word.endswith(suffix):
+             stem = word[:-len(suffix)]
+             if stem:
+                 return (stem, suffix)
+
+     # Specific 's contractions from curated allowlist (not possessives)
+     if word.endswith("'s"):
+         stem = word[:-2]
+         if stem in S_CONTRACTION_STEMS:
+             return (stem, "'s")
+
+     return None
+
+
+ def exists(word: str) -> bool:
+     """Check if a word exists in the ngram data.
+
+     Performs case-insensitive lookup with automatic fallbacks:
+     1. Direct lookup of the normalized word
+     2. Contraction fallback: if word is a contraction, check if both
+        components exist (e.g., "don't" -> "do" + "n't")
+
+     Args:
+         word: The word to check (case-insensitive)
+
+     Returns:
+         True if the word exists, False otherwise
+
+     Raises:
+         FileNotFoundError: If data files are not installed
+     """
+     if not is_data_installed():
+         raise FileNotFoundError(
+             "Data files not installed. Run: python -m gngram_counter.download_data"
+         )
+
+     word = normalize(word)
+
+     if _lookup_frequency(word) is not None:
+         return True
+
+     # Contraction fallback: check if the stem exists
+     parts = _split_contraction(word)
+     if parts:
+         stem, _ = parts
+         if _lookup_frequency(stem) is not None:
+             return True
+
+     return False
+
+
+ def frequency(word: str) -> FrequencyData | None:
+     """Get frequency data for a word.
+
+     Performs case-insensitive lookup with contraction fallback.
+     For contractions, returns the stem's frequency data.
+
+     Args:
+         word: The word to look up (case-insensitive)
+
+     Returns:
+         FrequencyData dict with peak_tf, peak_df, sum_tf, sum_df, or None if not found
+
+     Raises:
+         FileNotFoundError: If data files are not installed
+     """
+     if not is_data_installed():
+         raise FileNotFoundError(
+             "Data files not installed. Run: python -m gngram_counter.download_data"
+         )
+
+     word = normalize(word)
+
+     result = _lookup_frequency(word)
+     if result is not None:
+         return result
+
+     # Contraction fallback: return the stem's frequency
+     parts = _split_contraction(word)
+     if parts:
+         stem, _ = parts
+         stem_freq = _lookup_frequency(stem)
+         if stem_freq is not None:
+             return stem_freq
+
+     return None
+
+
+ def batch_frequency(words: list[str]) -> dict[str, FrequencyData | None]:
+     """Get frequency data for multiple words.
+
+     Args:
+         words: List of words to look up (case-insensitive)
+
+     Returns:
+         Dict mapping each word to its FrequencyData or None if not found
+
+     Raises:
+         FileNotFoundError: If data files are not installed
+     """
+     if not is_data_installed():
+         raise FileNotFoundError(
+             "Data files not installed. Run: python -m gngram_counter.download_data"
+         )
+
+     # Group words by bucket prefix for efficient batch lookups
+     by_prefix: dict[str, list[tuple[str, str, str]]] = {}
+     contraction_words: list[str] = []
+
+     for word in words:
+         normalized = normalize(word)
+         prefix, suffix = _hash_word(normalized)
+         if prefix not in by_prefix:
+             by_prefix[prefix] = []
+         by_prefix[prefix].append((word, normalized, suffix))
+
+     results: dict[str, FrequencyData | None] = {}
+
+     for prefix, entries in by_prefix.items():
+         df = _load_bucket(prefix)
+         suffixes = [s for _, _, s in entries]
+
+         # Filter to all matching suffixes at once
+         matches = df.filter(pl.col("hash").is_in(suffixes))
+         match_dict = {row["hash"]: row for row in matches.iter_rows(named=True)}
+
+         for word, normalized, suffix in entries:
+             if suffix in match_dict:
+                 row = match_dict[suffix]
+                 results[word] = FrequencyData(
+                     peak_tf=row["peak_tf"],
+                     peak_df=row["peak_df"],
+                     sum_tf=row["sum_tf"],
+                     sum_df=row["sum_df"],
+                 )
+             else:
+                 # Mark for contraction fallback
+                 results[word] = None
+                 contraction_words.append(word)
+
+     # Contraction fallback for words not found directly
+     for word in contraction_words:
+         normalized = normalize(word)
+         parts = _split_contraction(normalized)
+         if parts:
+             stem, _ = parts
+             stem_freq = _lookup_frequency(stem)
+             if stem_freq is not None:
+                 results[word] = stem_freq
+
+     return results
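
The functional change in this release is the contraction fallback wired through `exists`, `frequency`, and `batch_frequency` in the new module above. A minimal usage sketch of the fallback path, assuming the data files have been downloaded; the import alias follows the package README, and the example inputs are illustrative rather than tested outputs:

```python
import gngram_lookup as ng  # import name as shown in the package README

# Direct hit: pure alphabetic tokens resolve exactly as in 0.2.0.
ng.frequency("computer")

# Fallback: "don't" has no bucket entry, so _split_contraction()
# yields ("do", "n't") and the stem "do" supplies the frequency data.
ng.frequency("don't")

# Unicode apostrophes are normalized before hashing, so U+2019
# behaves the same as the ASCII form.
ng.exists("don\u2019t")  # same result as ng.exists("don't")

# "'s" splits only for the curated stems; possessives get no fallback.
ng.exists("let's")     # True via the "let" stem
ng.frequency("cat's")  # None: "cat" is not in S_CONTRACTION_STEMS
```

Keeping the `'s` allowlist closed is a deliberate choice: it avoids silently mapping possessives like "cat's" onto their noun stems.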
@@ -0,0 +1,48 @@
+ """Text normalization utilities for gngram-counter.
+
+ Handles normalization of Unicode apostrophe variants and other text
+ transformations to ensure consistent matching against the ngram corpus.
+
+ Ported from bnc-lookup normalize.py.
+ """
+
+ # Unicode characters that should normalize to ASCII apostrophe (U+0027)
+ # Ordered by likelihood of occurrence in English text
+ APOSTROPHE_VARIANTS = (
+     '\u2019'  # RIGHT SINGLE QUOTATION MARK (most common smart quote)
+     '\u2018'  # LEFT SINGLE QUOTATION MARK
+     '\u0060'  # GRAVE ACCENT
+     '\u00B4'  # ACUTE ACCENT
+     '\u201B'  # SINGLE HIGH-REVERSED-9 QUOTATION MARK
+     '\u2032'  # PRIME
+     '\u2035'  # REVERSED PRIME
+     '\u02B9'  # MODIFIER LETTER PRIME
+     '\u02BC'  # MODIFIER LETTER APOSTROPHE
+     '\u02C8'  # MODIFIER LETTER VERTICAL LINE
+     '\u0313'  # COMBINING COMMA ABOVE
+     '\u0315'  # COMBINING COMMA ABOVE RIGHT
+     '\u055A'  # ARMENIAN APOSTROPHE
+     '\u05F3'  # HEBREW PUNCTUATION GERESH
+     '\u07F4'  # NKO HIGH TONE APOSTROPHE
+     '\u07F5'  # NKO LOW TONE APOSTROPHE
+     '\uFF07'  # FULLWIDTH APOSTROPHE
+     '\u1FBF'  # GREEK PSILI
+     '\u1FBD'  # GREEK KORONIS
+     '\uA78C'  # LATIN SMALL LETTER SALTILLO
+ )
+
+ # Pre-compiled translation table for fast apostrophe normalization
+ _APOSTROPHE_TABLE = str.maketrans({char: "'" for char in APOSTROPHE_VARIANTS})
+
+
+ def normalize_apostrophes(text: str) -> str:
+     """Normalize Unicode apostrophe variants to ASCII apostrophe."""
+     return text.translate(_APOSTROPHE_TABLE)
+
+
+ def normalize(text: str) -> str:
+     """Normalize text for ngram lookup.
+
+     Applies: apostrophe variant conversion, lowercase, strip whitespace.
+     """
+     return normalize_apostrophes(text).lower().strip()
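
One subtlety in the module above: `APOSTROPHE_VARIANTS` is written without commas, so Python concatenates the literals into a single 20-character string rather than a tuple. Iterating it in the `maketrans` comprehension still visits one variant per character, so the table is built correctly either way. A short sketch of the resulting behavior, with the module path taken from the import in the lookup module above:

```python
from gngram_counter.normalize import normalize, normalize_apostrophes

# U+2019 (right single quotation mark) maps to the ASCII apostrophe.
assert normalize_apostrophes("don\u2019t") == "don't"

# Full normalization: apostrophe variants, then lowercase, then strip.
assert normalize("  Don\u2019t  ") == "don't"
```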
@@ -1,7 +1,7 @@
  [tool.poetry]
  name = "gngram-lookup"
  packages = [{include = "gngram_counter"}]
- version = "0.2.0"
+ version = "0.2.2"
  description = "Static Hash-Based Lookup for Google Ngram Frequencies"
  authors = ["Craig Trim <craigtrim@gmail.com>"]
  maintainers = ["Craig Trim <craigtrim@gmail.com>"]
@@ -0,0 +1,35 @@
+ # -*- coding: utf-8 -*-
+ from setuptools import setup
+
+ packages = \
+ ['gngram_counter']
+
+ package_data = \
+ {'': ['*']}
+
+ install_requires = \
+ ['polars>=1.0,<2.0', 'pyarrow>=18.0,<19.0']
+
+ entry_points = \
+ {'console_scripts': ['gngram-exists = gngram_counter.cli:gngram_exists',
+                      'gngram-freq = gngram_counter.cli:gngram_freq']}
+
+ setup_kwargs = {
+     'name': 'gngram-lookup',
+     'version': '0.2.2',
+     'description': 'Static Hash-Based Lookup for Google Ngram Frequencies',
+     'long_description': "# gngram-lookup\n\n[![PyPI version](https://badge.fury.io/py/gngram-lookup.svg)](https://badge.fury.io/py/gngram-lookup)\n[![Downloads](https://pepy.tech/badge/gngram-lookup)](https://pepy.tech/project/gngram-lookup)\n[![Downloads/Month](https://pepy.tech/badge/gngram-lookup/month)](https://pepy.tech/project/gngram-lookup)\n[![Tests](https://img.shields.io/badge/tests-131-brightgreen)](https://github.com/craigtrim/gngram-lookup/tree/main/tests)\n[![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/)\n\nWord frequency from 500 years of books. O(1) lookup. 5 million words.\n\n## Install\n\n```bash\npip install gngram-lookup\npython -m gngram_lookup.download_data\n```\n\n## Python\n\n```python\nimport gngram_lookup as ng\n\nng.exists('computer') # True\nng.exists('xyznotaword') # False\n\nng.frequency('computer')\n# {'peak_tf': 2000, 'peak_df': 2000, 'sum_tf': 892451, 'sum_df': 312876}\n\nng.batch_frequency(['the', 'algorithm', 'xyznotaword'])\n# {'the': {...}, 'algorithm': {...}, 'xyznotaword': None}\n```\n\n## CLI\n\n```bash\ngngram-exists computer # True, exit 0\ngngram-exists xyznotaword # False, exit 1\n\ngngram-freq computer\n# peak_tf_decade: 2000\n# peak_df_decade: 2000\n# sum_tf: 892451\n# sum_df: 312876\n```\n\n## Docs\n\n- [API Reference](https://github.com/craigtrim/gngram-lookup/blob/main/docs/api.md)\n- [CLI Reference](https://github.com/craigtrim/gngram-lookup/blob/main/docs/cli.md)\n- [Data Format](https://github.com/craigtrim/gngram-lookup/blob/main/docs/data-format.md)\n- [Use Cases](https://github.com/craigtrim/gngram-lookup/blob/main/docs/use-cases.md)\n- [Development](https://github.com/craigtrim/gngram-lookup/blob/main/docs/development.md)\n\n## See Also\n\n- [bnc-lookup](https://pypi.org/project/bnc-lookup/) - O(1) lookup for British National Corpus\n- [wordnet-lookup](https://pypi.org/project/wordnet-lookup/) - O(1) lookup for WordNet\n\n## Attribution\n\nData derived from the [Google Books Ngram](https://books.google.com/ngrams) dataset.\n\n## License\n\nProprietary. See [LICENSE](https://github.com/craigtrim/gngram-lookup/blob/main/LICENSE).\n",
+     'author': 'Craig Trim',
+     'author_email': 'craigtrim@gmail.com',
+     'maintainer': 'Craig Trim',
+     'maintainer_email': 'craigtrim@gmail.com',
+     'url': 'https://github.com/craigtrim/gngram-lookup',
+     'packages': packages,
+     'package_data': package_data,
+     'install_requires': install_requires,
+     'entry_points': entry_points,
+     'python_requires': '>=3.11,<4.0',
+ }
+
+
+ setup(**setup_kwargs)
@@ -1,138 +0,0 @@
- """
- High-level lookup API for gngram-counter.
-
- Provides simple functions for word frequency lookups similar to bnc-lookup.
- """
-
- import hashlib
- from functools import lru_cache
- from typing import TypedDict
-
- import polars as pl
-
- from gngram_counter.data import get_hash_file, is_data_installed
-
-
- class FrequencyData(TypedDict):
-     """Frequency data for a word."""
-
-     peak_tf: int  # Decade with highest term frequency
-     peak_df: int  # Decade with highest document frequency
-     sum_tf: int  # Total term frequency across all decades
-     sum_df: int  # Total document frequency across all decades
-
-
- @lru_cache(maxsize=256)
- def _load_bucket(prefix: str) -> pl.DataFrame:
-     """Load and cache a parquet bucket file."""
-     return pl.read_parquet(get_hash_file(prefix))
-
-
- def _hash_word(word: str) -> tuple[str, str]:
-     """Hash a word and return (prefix, suffix)."""
-     h = hashlib.md5(word.lower().encode("utf-8")).hexdigest()
-     return h[:2], h[2:]
-
-
- def exists(word: str) -> bool:
-     """Check if a word exists in the ngram data.
-
-     Args:
-         word: The word to check (case-insensitive)
-
-     Returns:
-         True if the word exists, False otherwise
-
-     Raises:
-         FileNotFoundError: If data files are not installed
-     """
-     if not is_data_installed():
-         raise FileNotFoundError(
-             "Data files not installed. Run: python -m gngram_counter.download_data"
-         )
-
-     prefix, suffix = _hash_word(word)
-     df = _load_bucket(prefix)
-     return len(df.filter(pl.col("hash") == suffix)) > 0
-
-
- def frequency(word: str) -> FrequencyData | None:
-     """Get frequency data for a word.
-
-     Args:
-         word: The word to look up (case-insensitive)
-
-     Returns:
-         FrequencyData dict with peak_tf, peak_df, sum_tf, sum_df, or None if not found
-
-     Raises:
-         FileNotFoundError: If data files are not installed
-     """
-     if not is_data_installed():
-         raise FileNotFoundError(
-             "Data files not installed. Run: python -m gngram_counter.download_data"
-         )
-
-     prefix, suffix = _hash_word(word)
-     df = _load_bucket(prefix)
-     row = df.filter(pl.col("hash") == suffix)
-
-     if len(row) == 0:
-         return None
-
-     return FrequencyData(
-         peak_tf=row["peak_tf"][0],
-         peak_df=row["peak_df"][0],
-         sum_tf=row["sum_tf"][0],
-         sum_df=row["sum_df"][0],
-     )
-
-
- def batch_frequency(words: list[str]) -> dict[str, FrequencyData | None]:
-     """Get frequency data for multiple words.
-
-     Args:
-         words: List of words to look up (case-insensitive)
-
-     Returns:
-         Dict mapping each word to its FrequencyData or None if not found
-
-     Raises:
-         FileNotFoundError: If data files are not installed
-     """
-     if not is_data_installed():
-         raise FileNotFoundError(
-             "Data files not installed. Run: python -m gngram_counter.download_data"
-         )
-
-     # Group words by bucket prefix for efficient batch lookups
-     by_prefix: dict[str, list[tuple[str, str]]] = {}
-     for word in words:
-         prefix, suffix = _hash_word(word)
-         if prefix not in by_prefix:
-             by_prefix[prefix] = []
-         by_prefix[prefix].append((word, suffix))
-
-     results: dict[str, FrequencyData | None] = {}
-
-     for prefix, word_suffix_pairs in by_prefix.items():
-         df = _load_bucket(prefix)
-         suffixes = [s for _, s in word_suffix_pairs]
-
-         # Filter to all matching suffixes at once
-         matches = df.filter(pl.col("hash").is_in(suffixes))
-         match_dict = {row["hash"]: row for row in matches.iter_rows(named=True)}
-
-         for word, suffix in word_suffix_pairs:
-             if suffix in match_dict:
-                 row = match_dict[suffix]
-                 results[word] = FrequencyData(
-                     peak_tf=row["peak_tf"],
-                     peak_df=row["peak_df"],
-                     sum_tf=row["sum_tf"],
-                     sum_df=row["sum_df"],
-                 )
-             else:
-                 results[word] = None
-
-     return results
@@ -1,35 +0,0 @@
- # -*- coding: utf-8 -*-
- from setuptools import setup
-
- packages = \
- ['gngram_counter']
-
- package_data = \
- {'': ['*']}
-
- install_requires = \
- ['polars>=1.0,<2.0', 'pyarrow>=18.0,<19.0']
-
- entry_points = \
- {'console_scripts': ['gngram-exists = gngram_counter.cli:gngram_exists',
-                      'gngram-freq = gngram_counter.cli:gngram_freq']}
-
- setup_kwargs = {
-     'name': 'gngram-lookup',
-     'version': '0.2.0',
-     'description': 'Static Hash-Based Lookup for Google Ngram Frequencies',
-     'long_description': "# gngram-lookup\n\n[![PyPI version](https://badge.fury.io/py/gngram-lookup.svg)](https://badge.fury.io/py/gngram-lookup)\n[![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/)\n\nWord frequency from 500 years of books. O(1) lookup. 5 million words.\n\n## Install\n\n```bash\npip install gngram-lookup\npython -m gngram_lookup.download_data\n```\n\n## Python\n\n```python\nimport gngram_lookup as ng\n\nng.exists('computer') # True\nng.exists('xyznotaword') # False\n\nng.frequency('computer')\n# {'peak_tf': 2000, 'peak_df': 2000, 'sum_tf': 892451, 'sum_df': 312876}\n\nng.batch_frequency(['the', 'algorithm', 'xyznotaword'])\n# {'the': {...}, 'algorithm': {...}, 'xyznotaword': None}\n```\n\n## CLI\n\n```bash\ngngram-exists computer # True, exit 0\ngngram-exists xyznotaword # False, exit 1\n\ngngram-freq computer\n# peak_tf_decade: 2000\n# peak_df_decade: 2000\n# sum_tf: 892451\n# sum_df: 312876\n```\n\n## Docs\n\n- [API Reference](docs/api.md)\n- [CLI Reference](docs/cli.md)\n- [Data Format](docs/data-format.md)\n- [Use Cases](docs/use-cases.md)\n- [Development](docs/development.md)\n\n## Attribution\n\nData derived from the [Google Books Ngram](https://books.google.com/ngrams) dataset.\n\n## License\n\nProprietary. See [LICENSE](LICENSE).\n",
-     'author': 'Craig Trim',
-     'author_email': 'craigtrim@gmail.com',
-     'maintainer': 'Craig Trim',
-     'maintainer_email': 'craigtrim@gmail.com',
-     'url': 'https://github.com/craigtrim/gngram-lookup',
-     'packages': packages,
-     'package_data': package_data,
-     'install_requires': install_requires,
-     'entry_points': entry_points,
-     'python_requires': '>=3.11,<4.0',
- }
-
-
- setup(**setup_kwargs)