trigram-llm 0.1.0__tar.gz
- trigram_llm-0.1.0/LICENSE +21 -0
- trigram_llm-0.1.0/MANIFEST.in +6 -0
- trigram_llm-0.1.0/PKG-INFO +306 -0
- trigram_llm-0.1.0/README.md +273 -0
- trigram_llm-0.1.0/README_PYTHON.md +273 -0
- trigram_llm-0.1.0/pyproject.toml +46 -0
- trigram_llm-0.1.0/setup.cfg +4 -0
- trigram_llm-0.1.0/setup.py +149 -0
- trigram_llm-0.1.0/tests/test_advanced.py +203 -0
- trigram_llm-0.1.0/tests/test_basic.py +235 -0
- trigram_llm-0.1.0/tests/test_edge_cases.py +228 -0
- trigram_llm-0.1.0/tests/test_persistence.py +160 -0
- trigram_llm-0.1.0/trigram/__init__.py +25 -0
- trigram_llm-0.1.0/trigram/_lib.py +205 -0
- trigram_llm-0.1.0/trigram/_trigram_c.dylib +0 -0
- trigram_llm-0.1.0/trigram/model.py +756 -0
- trigram_llm-0.1.0/trigram/utils.py +80 -0
- trigram_llm-0.1.0/trigram_frontend_api/generate_graphs.py +238 -0
- trigram_llm-0.1.0/trigram_llm/include/hashmap.h +31 -0
- trigram_llm-0.1.0/trigram_llm/include/queue.h +23 -0
- trigram_llm-0.1.0/trigram_llm/include/reader.h +9 -0
- trigram_llm-0.1.0/trigram_llm/include/sll.h +23 -0
- trigram_llm-0.1.0/trigram_llm/include/tree.h +52 -0
- trigram_llm-0.1.0/trigram_llm/include/trigram.h +17 -0
- trigram_llm-0.1.0/trigram_llm/include/trigram_py.h +99 -0
- trigram_llm-0.1.0/trigram_llm/src/hashmap.c +166 -0
- trigram_llm-0.1.0/trigram_llm/src/main.c +219 -0
- trigram_llm-0.1.0/trigram_llm/src/queue.c +114 -0
- trigram_llm-0.1.0/trigram_llm/src/reader.c +63 -0
- trigram_llm-0.1.0/trigram_llm/src/sll.c +79 -0
- trigram_llm-0.1.0/trigram_llm/src/tree.c +780 -0
- trigram_llm-0.1.0/trigram_llm/src/trigram.c +177 -0
- trigram_llm-0.1.0/trigram_llm/src/trigram_py.c +209 -0
- trigram_llm-0.1.0/trigram_llm.egg-info/PKG-INFO +306 -0
- trigram_llm-0.1.0/trigram_llm.egg-info/SOURCES.txt +37 -0
- trigram_llm-0.1.0/trigram_llm.egg-info/dependency_links.txt +1 -0
- trigram_llm-0.1.0/trigram_llm.egg-info/not-zip-safe +1 -0
- trigram_llm-0.1.0/trigram_llm.egg-info/requires.txt +4 -0
- trigram_llm-0.1.0/trigram_llm.egg-info/top_level.txt +5 -0
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2026 Raghottam Girish Nadgoudar
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
|
@@ -0,0 +1,306 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: trigram-llm
|
|
3
|
+
Version: 0.1.0
|
|
4
|
+
Summary: Fast C-backed trigram language model for word prediction and sentence completion
|
|
5
|
+
Author: Raghottam Girish Nadgoudar
|
|
6
|
+
License-Expression: MIT
|
|
7
|
+
Project-URL: Homepage, https://github.com/ROHITH-KUMAR-L/Trigrams
|
|
8
|
+
Project-URL: Repository, https://github.com/ROHITH-KUMAR-L/Trigrams
|
|
9
|
+
Project-URL: Bug Tracker, https://github.com/ROHITH-KUMAR-L/Trigrams/issues
|
|
10
|
+
Keywords: nlp,language-model,trigram,autocomplete,prediction,nlp,text
|
|
11
|
+
Classifier: Development Status :: 3 - Alpha
|
|
12
|
+
Classifier: Intended Audience :: Developers
|
|
13
|
+
Classifier: Intended Audience :: Education
|
|
14
|
+
Classifier: Intended Audience :: Science/Research
|
|
15
|
+
Classifier: Programming Language :: Python :: 3
|
|
16
|
+
Classifier: Programming Language :: Python :: 3.8
|
|
17
|
+
Classifier: Programming Language :: Python :: 3.9
|
|
18
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
19
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
20
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
21
|
+
Classifier: Programming Language :: Python :: 3.13
|
|
22
|
+
Classifier: Programming Language :: C
|
|
23
|
+
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
|
|
24
|
+
Classifier: Topic :: Text Processing :: Linguistic
|
|
25
|
+
Requires-Python: >=3.8
|
|
26
|
+
Description-Content-Type: text/markdown
|
|
27
|
+
License-File: LICENSE
|
|
28
|
+
Provides-Extra: dev
|
|
29
|
+
Requires-Dist: pytest>=7.0; extra == "dev"
|
|
30
|
+
Requires-Dist: pytest-cov; extra == "dev"
|
|
31
|
+
Dynamic: license-file
|
|
32
|
+
Dynamic: requires-python
|
|
33
|
+
|
|
34
|
+
# trigram-llm 🧠
|
|
35
|
+
|
|
36
|
+
A **fast, production-ready Python library** for next-word prediction and sentence completion, powered by a hand-written C engine using a Prefix Trie, DJB2 HashMap, and Stupid Backoff smoothing.
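
The Stupid Backoff scheme named above can be illustrated in a few lines of pure Python. This is a minimal sketch, not the C engine's implementation: the helper names `train` and `score` are hypothetical, and the back-off factor `ALPHA = 0.4` is the value commonly used for Stupid Backoff; whether the engine uses the same value is not stated here.

```python
from collections import Counter

ALPHA = 0.4  # assumed back-off factor; the engine's actual value is not documented here


def train(tokens):
    """Count unigrams, bigrams, and trigrams from a token list."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for i, w in enumerate(tokens):
        uni[w] += 1
        if i >= 1:
            bi[(tokens[i - 1], w)] += 1
        if i >= 2:
            tri[(tokens[i - 2], tokens[i - 1], w)] += 1
    return uni, bi, tri


def score(w1, w2, w3, uni, bi, tri):
    """Stupid Backoff score for w3 after (w1, w2): a relative score, not a normalised probability."""
    if tri[(w1, w2, w3)] > 0:
        # Trigram seen: use its relative frequency within the bigram context.
        return tri[(w1, w2, w3)] / bi[(w1, w2)]
    if bi[(w2, w3)] > 0:
        # Back off to the bigram, discounted once.
        return ALPHA * bi[(w2, w3)] / uni[w2]
    # Back off to the unigram, discounted twice.
    total = sum(uni.values())
    return ALPHA * ALPHA * uni[w3] / total if total else 0.0


tokens = "the quick brown fox jumps over the lazy dog the quick brown fox was nimble".split()
uni, bi, tri = train(tokens)
print(score("the", "quick", "brown", uni, bi, tri))  # trigram was seen
print(score("a", "quick", "brown", uni, bi, tri))    # falls back to the bigram
```

Because the scores are not renormalised after backing off, they rank candidates well but do not sum to one over the vocabulary.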
|
|
37
|
+
|
|
38
|
+
> Sub-millisecond predictions · Zero dependencies · Pure ctypes · Thread-safe
|
|
39
|
+
|
|
40
|
+
---
|
|
41
|
+
|
|
42
|
+
## Features
|
|
43
|
+
|
|
44
|
+
| Feature | Description |
|
|
45
|
+
|---|---|
|
|
46
|
+
| `train_from_text(text)` | Train from any Python string |
|
|
47
|
+
| `train_from_file(path)` | Train from a text file (incremental) |
|
|
48
|
+
| `train_from_list(words)` | Train from a pre-tokenised word list |
|
|
49
|
+
| `predict_next(w1, w2)` | Greedy single-word prediction (< 1ms) |
|
|
50
|
+
| `predict_top_n(w1, w2, n, temperature)` | Top-N predictions with probabilities |
|
|
51
|
+
| `complete_sentence(prompt, num_words, beam_width)` | Beam search sentence generation |
|
|
52
|
+
| `greedy_generate(prompt, num_words)` | Fastest sentence completion |
|
|
53
|
+
| `perplexity(text)` | Evaluate model quality on held-out text |
|
|
54
|
+
| `vocabulary()` | Returns all known words as a Python `set` |
|
|
55
|
+
| `get_stats()` | Dict with trigram count, vocab size, etc. |
|
|
56
|
+
| `save(path)` / `TrigramModel.load(path)` | Binary model persistence |
|
|
57
|
+
| `reset()` | Clear the model so it can be retrained from scratch |
|
|
58
|
+
| `"the quick" in model` | Check if a bigram context was seen |
|
|
59
|
+
| `len(model)` | Total number of stored trigrams |
|
|
60
|
+
| Thread-safe | All predictions guarded by a `threading.Lock` |
|
|
61
|
+
| Context manager | `with TrigramModel.load(path) as m:` |
|
|
62
|
+
|
|
63
|
+
---
|
|
64
|
+
|
|
65
|
+
## Installation
|
|
66
|
+
|
|
67
|
+
### Prerequisites
|
|
68
|
+
- Python 3.8+
|
|
69
|
+
- GCC or a compatible C compiler (macOS: `xcode-select --install`, which installs Apple Clang; Ubuntu: `sudo apt install gcc`)
|
|
70
|
+
|
|
71
|
+
### Install (one command)
|
|
72
|
+
|
|
73
|
+
```bash
|
|
74
|
+
cd /path/to/Trigrams
|
|
75
|
+
pip install -e .
|
|
76
|
+
```
|
|
77
|
+
|
|
78
|
+
This compiles the C engine into `trigram/_trigram_c.dylib` (or `.so` on Linux) and installs the package in editable mode.
|
|
79
|
+
|
|
80
|
+
---
|
|
81
|
+
|
|
82
|
+
## Quickstart
|
|
83
|
+
|
|
84
|
+
```python
|
|
85
|
+
from trigram import TrigramModel
|
|
86
|
+
|
|
87
|
+
# 1. Create and train
|
|
88
|
+
model = TrigramModel()
|
|
89
|
+
model.train_from_text("""
|
|
90
|
+
The quick brown fox jumps over the lazy dog.
|
|
91
|
+
The quick brown fox was nimble and swift.
|
|
92
|
+
The lazy dog slept peacefully under the old oak tree.
|
|
93
|
+
""")
|
|
94
|
+
|
|
95
|
+
# 2. Predict next word (greedy)
|
|
96
|
+
word = model.predict_next("the", "quick")
|
|
97
|
+
print(word) # → "brown"
|
|
98
|
+
|
|
99
|
+
# 3. Top-N predictions with probabilities (output below is illustrative: "red" does not occur in the corpus above)
|
|
100
|
+
preds = model.predict_top_n("the", "quick", n=3, temperature=1.0)
|
|
101
|
+
# [{"word": "brown", "probability": 0.75, "count": 2},
|
|
102
|
+
# {"word": "red", "probability": 0.25, "count": 1}]
|
|
103
|
+
|
|
104
|
+
# 4. Sentence completion (beam search)
|
|
105
|
+
completions = model.complete_sentence("the quick", num_words=4, beam_width=3)
|
|
106
|
+
# [{"sentence": "the quick brown fox jumps", "probability": 0.012}, ...]
|
|
107
|
+
|
|
108
|
+
# 5. Greedy generation (fastest)
|
|
109
|
+
sentence = model.greedy_generate("the quick", num_words=3)
|
|
110
|
+
# "the quick brown fox"
|
|
111
|
+
|
|
112
|
+
# 6. Evaluate quality
|
|
113
|
+
ppl = model.perplexity("the quick brown fox")
|
|
114
|
+
print(f"Perplexity: {ppl:.2f}")
|
|
115
|
+
|
|
116
|
+
# 7. Inspect model
|
|
117
|
+
print(len(model)) # → total trigrams
|
|
118
|
+
print("the quick" in model) # → True
|
|
119
|
+
print(model.vocabulary()) # → {"the", "quick", "brown", ...}
|
|
120
|
+
print(model.get_stats()) # → {"total_trigrams": 7, "unique_first_words": 3, ...}
|
|
121
|
+
```
|
|
122
|
+
|
|
123
|
+
---
|
|
124
|
+
|
|
125
|
+
## Training from a File
|
|
126
|
+
|
|
127
|
+
```python
|
|
128
|
+
model = TrigramModel()
|
|
129
|
+
model.train_from_file("path/to/my_corpus.txt")
|
|
130
|
+
|
|
131
|
+
# Incremental training: add more data later
|
|
132
|
+
model.train_from_file("path/to/more_data.txt")
|
|
133
|
+
```
|
|
134
|
+
|
|
135
|
+
---
|
|
136
|
+
|
|
137
|
+
## Saving and Loading Models
|
|
138
|
+
|
|
139
|
+
```python
|
|
140
|
+
# Save
|
|
141
|
+
model.save("my_model.bin")
|
|
142
|
+
|
|
143
|
+
# Load (class method)
|
|
144
|
+
model2 = TrigramModel.load("my_model.bin")
|
|
145
|
+
|
|
146
|
+
# Context manager (auto-frees on exit)
|
|
147
|
+
with TrigramModel.load("my_model.bin") as m:
|
|
148
|
+
print(m.predict_next("the", "quick"))
|
|
149
|
+
```
|
|
150
|
+
|
|
151
|
+
---
|
|
152
|
+
|
|
153
|
+
## Temperature Sampling
|
|
154
|
+
|
|
155
|
+
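The `temperature` parameter controls how sharply the returned probabilities favour the most frequent words; lower values are more conservative, higher values more diverse: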
|
|
156
|
+
|
|
157
|
+
```python
|
|
158
|
+
# Low temperature: probability mass concentrates on the most common word
|
|
159
|
+
model.predict_top_n("the", "quick", temperature=0.1)
|
|
160
|
+
|
|
161
|
+
# Standard probability distribution
|
|
162
|
+
model.predict_top_n("the", "quick", temperature=1.0)
|
|
163
|
+
|
|
164
|
+
# More diverse / creative
|
|
165
|
+
model.predict_top_n("the", "quick", temperature=2.0)
|
|
166
|
+
```
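
To make the effect of `temperature` concrete, the sketch below rescales a table of raw counts. It assumes one common formulation (each count raised to the power `1 / temperature`, then normalised); the library's exact formula is not documented here, and `apply_temperature` is a hypothetical helper, not part of the package.

```python
def apply_temperature(counts, temperature=1.0):
    """Rescale a {word: count} table into probabilities at the given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Raising counts to 1/T sharpens the distribution for T < 1 and flattens it for T > 1.
    weights = {w: c ** (1.0 / temperature) for w, c in counts.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}


counts = {"brown": 3, "red": 1}  # illustrative counts, not taken from a real model
for t in (0.1, 1.0, 2.0):
    print(t, apply_temperature(counts, t))
```

At `temperature=1.0` the result is the plain relative frequency; at `0.1` nearly all mass moves to "brown"; at `2.0` the two words move closer together.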
|
|
167
|
+
|
|
168
|
+
---
|
|
169
|
+
|
|
170
|
+
## Advanced Usage
|
|
171
|
+
|
|
172
|
+
### Train from a word list (custom tokenisation)
|
|
173
|
+
|
|
174
|
+
```python
|
|
175
|
+
import nltk
|
|
176
|
+
tokens = nltk.word_tokenize("The quick brown fox")
|
|
177
|
+
tokens = [t.lower() for t in tokens if t.isalpha()]
|
|
178
|
+
|
|
179
|
+
model = TrigramModel()
|
|
180
|
+
model.train_from_list(tokens)
|
|
181
|
+
```
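
If NLTK is not available, a regular expression can produce a comparable word list with the standard library only. This is a rough sketch: `tokenize` and `trigrams` are hypothetical helpers, and the pattern keeps only runs of ASCII letters, so it splits contractions such as "don't" and drops digits, which differs from `nltk.word_tokenize`.

```python
import re


def tokenize(text):
    """Lower-case the text and keep runs of ASCII letters only."""
    return re.findall(r"[a-z]+", text.lower())


def trigrams(tokens):
    """Return every consecutive (w1, w2, w3) triple in the token list."""
    return list(zip(tokens, tokens[1:], tokens[2:]))


tokens = tokenize("The quick brown fox, the QUICK brown dog!")
print(tokens)
print(trigrams(tokens)[:2])
```

The resulting `tokens` list is the kind of input that `train_from_list` is described as accepting.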
|
|
182
|
+
|
|
183
|
+
### Thread-safe batch prediction
|
|
184
|
+
|
|
185
|
+
```python
|
|
186
|
+
import threading
|
|
187
|
+
|
|
188
|
+
def worker(model, results, idx):
|
|
189
|
+
results[idx] = model.predict_top_n("the", "quick", n=5)
|
|
190
|
+
|
|
191
|
+
model = TrigramModel.load("model.bin")
|
|
192
|
+
results = [None] * 10
|
|
193
|
+
threads = [threading.Thread(target=worker, args=(model, results, i)) for i in range(10)]
|
|
194
|
+
for t in threads: t.start()
|
|
195
|
+
for t in threads: t.join()
|
|
196
|
+
```
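
The same pattern can be written with `concurrent.futures`, which collects results in input order without a shared list. The sketch below uses a hypothetical `LockedPredictor` stand-in (a dictionary lookup guarded by a `threading.Lock`) so that it runs without the compiled engine; it only illustrates the locking idea and is not the package's class.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LockedPredictor:
    """Hypothetical stand-in for a model whose predictions are guarded by a lock."""

    def __init__(self, table):
        self._table = table
        self._lock = threading.Lock()

    def predict_next(self, w1, w2):
        # Only one thread at a time may read the underlying table.
        with self._lock:
            return self._table.get((w1, w2))


predictor = LockedPredictor({("the", "quick"): "brown", ("quick", "brown"): "fox"})
contexts = [("the", "quick"), ("quick", "brown"), ("lazy", "cat")] * 4

with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map returns results in the same order as the inputs.
    results = list(pool.map(lambda ctx: predictor.predict_next(*ctx), contexts))

print(results[:3])
```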
|
|
197
|
+
|
|
198
|
+
### Check if a context exists before predicting
|
|
199
|
+
|
|
200
|
+
```python
|
|
201
|
+
if "the quick" in model:
|
|
202
|
+
result = model.predict_next("the", "quick")
|
|
203
|
+
```
|
|
204
|
+
|
|
205
|
+
---
|
|
206
|
+
|
|
207
|
+
## API Reference
|
|
208
|
+
|
|
209
|
+
### `TrigramModel()`
|
|
210
|
+
Creates a new empty model.
|
|
211
|
+
|
|
212
|
+
### `train_from_text(text: str) → int`
|
|
213
|
+
Train on a raw text string. Returns the number of trigrams inserted.
|
|
214
|
+
|
|
215
|
+
### `train_from_file(path) → int`
|
|
216
|
+
Train from a text file. Returns the number of trigrams inserted.
|
|
217
|
+
|
|
218
|
+
### `train_from_list(words: list) → int`
|
|
219
|
+
Train from a pre-tokenised word list. Returns the number of trigrams inserted.
|
|
220
|
+
|
|
221
|
+
### `predict_next(w1, w2) → str | None`
|
|
222
|
+
Return the single most-likely next word or `None`.
|
|
223
|
+
|
|
224
|
+
### `predict_top_n(w1, w2, n=5, temperature=1.0) → list[dict]`
|
|
225
|
+
Return up to N predictions sorted by probability descending.
|
|
226
|
+
Each dict: `{"word": str, "probability": float, "count": int}`.
|
|
227
|
+
|
|
228
|
+
### `complete_sentence(prompt, num_words=5, beam_width=3) → list[dict]`
|
|
229
|
+
Generate sentence completions via beam search.
|
|
230
|
+
Each dict: `{"sentence": str, "probability": float}`.
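
For readers unfamiliar with beam search, the sketch below shows the general technique over a toy lookup table. It is an illustration under stated assumptions, not the engine's code: `beam_search` and the `table` layout (`(w1, w2)` mapped to `{next_word: probability}`) are hypothetical, and the probabilities are made up.

```python
import heapq
import math


def beam_search(table, prompt, num_words=3, beam_width=2):
    """Extend the prompt word by word, keeping the beam_width most probable sequences."""
    beams = [(0.0, prompt.lower().split())]  # (log-probability, words so far)
    for _ in range(num_words):
        candidates = []
        for logp, words in beams:
            options = table.get(tuple(words[-2:]), {}) if len(words) >= 2 else {}
            if not options:
                candidates.append((logp, words))  # dead end: carry forward unchanged
                continue
            for word, p in options.items():
                candidates.append((logp + math.log(p), words + [word]))
        # Keep only the highest-scoring candidates for the next step.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return [{"sentence": " ".join(w), "probability": math.exp(lp)} for lp, w in beams]


table = {
    ("the", "quick"): {"brown": 0.75, "red": 0.25},
    ("quick", "brown"): {"fox": 1.0},
    ("quick", "red"): {"fox": 0.5, "hen": 0.5},
}
for result in beam_search(table, "the quick", num_words=2, beam_width=2):
    print(result)
```

Summing log-probabilities instead of multiplying raw probabilities avoids underflow on longer completions; a wider beam explores more alternatives at a proportional cost.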
|
|
231
|
+
|
|
232
|
+
### `greedy_generate(prompt, num_words=5) → str`
|
|
233
|
+
Fastest sentence completion using greedy decoding.
|
|
234
|
+
|
|
235
|
+
### `perplexity(text) → float`
|
|
236
|
+
Compute per-token perplexity on held-out text. Lower = better.
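
Per-token perplexity is the exponential of the average negative log-probability that the model assigns to each token. The sketch below computes it from a list of probabilities; `perplexity_from_probs` is a hypothetical helper, and how the engine handles unseen trigrams (for example via back-off scores) is not covered here.

```python
import math


def perplexity_from_probs(probabilities):
    """Per-token perplexity from the probability assigned to each token."""
    if not probabilities:
        raise ValueError("need at least one token probability")
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))


print(perplexity_from_probs([0.5, 0.5, 0.5, 0.5]))  # about 2: like a fair coin per token
print(perplexity_from_probs([0.9, 0.8, 0.95]))      # closer to 1: the model is rarely surprised
```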
|
|
237
|
+
|
|
238
|
+
### `vocabulary() → set[str]`
|
|
239
|
+
All words seen in the first-word position of training trigrams.
|
|
240
|
+
|
|
241
|
+
### `get_stats() → dict`
|
|
242
|
+
`{"total_trigrams": int, "unique_first_words": int, "vocabulary_size": int}`.
|
|
243
|
+
|
|
244
|
+
### `save(path) → None`
|
|
245
|
+
Save model to binary file. Compatible with the C CLI tool.
|
|
246
|
+
|
|
247
|
+
### `TrigramModel.load(path) → TrigramModel` (classmethod)
|
|
248
|
+
Load a pre-trained binary model. Supports context manager protocol.
|
|
249
|
+
|
|
250
|
+
### `reset() → None`
|
|
251
|
+
Clear all training data.
|
|
252
|
+
|
|
253
|
+
### `len(model)` → int
|
|
254
|
+
Total stored trigrams.
|
|
255
|
+
|
|
256
|
+
### `"w1 w2" in model` / `("w1", "w2") in model` → bool
|
|
257
|
+
Check if a bigram context exists.
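
Supporting both membership forms usually comes down to normalising the key inside `__contains__`. The sketch below shows one way to do that with a hypothetical `ContextIndex` class backed by a Python set; lower-casing string keys is an assumption, and the real model delegates the lookup to the C engine.

```python
class ContextIndex:
    """Hypothetical container illustrating the two membership forms."""

    def __init__(self, contexts):
        self._contexts = set(contexts)

    def __contains__(self, key):
        # Accept "w1 w2" as well as ("w1", "w2").
        if isinstance(key, str):
            key = tuple(key.lower().split())
        return len(key) == 2 and tuple(key) in self._contexts


index = ContextIndex({("the", "quick"), ("quick", "brown")})
print("the quick" in index)
print(("quick", "brown") in index)
print("the" in index)
```

A key that does not split into exactly two words is reported as absent instead of raising an error.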
|
|
258
|
+
|
|
259
|
+
### `repr(model)`
|
|
260
|
+
`TrigramModel(trigrams=11,062,203, vocab=97,277)`
|
|
261
|
+
|
|
262
|
+
---
|
|
263
|
+
|
|
264
|
+
## Performance
|
|
265
|
+
|
|
266
|
+
| Operation | Latency |
|
|
267
|
+
|---|---|
|
|
268
|
+
| Single word prediction | < 1ms |
|
|
269
|
+
| Top-5 predictions | 1–2ms |
|
|
270
|
+
| Beam search (5 words, width 3) | 5–10ms |
|
|
271
|
+
| Training (1M words) | ~30s |
|
|
272
|
+
|
|
273
|
+
---
|
|
274
|
+
|
|
275
|
+
## Running Tests
|
|
276
|
+
|
|
277
|
+
```bash
|
|
278
|
+
pip install pytest
|
|
279
|
+
pytest tests/ -v
|
|
280
|
+
```
|
|
281
|
+
|
|
282
|
+
---
|
|
283
|
+
|
|
284
|
+
## Project Structure
|
|
285
|
+
|
|
286
|
+
```
|
|
287
|
+
Trigrams/
|
|
288
|
+
├── trigram/ # Python library
|
|
289
|
+
│   ├── __init__.py
|
|
290
|
+
│   ├── _lib.py # ctypes bindings
|
|
291
|
+
│   ├── model.py # TrigramModel class
|
|
292
|
+
│   ├── utils.py # Text preprocessing
|
|
293
|
+
│   └── _trigram_c.dylib # Compiled C engine (auto-generated)
|
|
294
|
+
├── trigram_llm/
|
|
295
|
+
│   ├── src/ # C source files
|
|
296
|
+
│   └── include/ # C headers
|
|
297
|
+
├── tests/ # pytest test suite
|
|
298
|
+
├── setup.py # Build script
|
|
299
|
+
└── pyproject.toml
|
|
300
|
+
```
|
|
301
|
+
|
|
302
|
+
---
|
|
303
|
+
|
|
304
|
+
## License
|
|
305
|
+
|
|
306
|
+
MIT License. Feel free to use, modify, and distribute.
|
|
@@ -0,0 +1,273 @@
|
|
|
1
|
+
# trigram-llm 🧠
|
|
2
|
+
|
|
3
|
+
A **fast, production-ready Python library** for next-word prediction and sentence completion, powered by a hand-written C engine using a Prefix Trie, DJB2 HashMap, and Stupid Backoff smoothing.
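
The DJB2 hash mentioned above is short enough to show in full. The sketch below is the classic formulation (start at 5381, multiply by 33, add each byte); truncating to 32 bits and the `bucket_index` helper are assumptions made for illustration, since the width of the C engine's hash type and its table size are not stated here.

```python
def djb2(text: str) -> int:
    """Classic DJB2 string hash (h = h * 33 + byte), truncated to 32 bits."""
    h = 5381
    for byte in text.encode("utf-8"):
        # (h << 5) + h is h * 33; the mask emulates 32-bit unsigned overflow.
        h = ((h << 5) + h + byte) & 0xFFFFFFFF
    return h


def bucket_index(word: str, num_buckets: int = 1024) -> int:
    """Map a word to a bucket of a hypothetical fixed-size hash table."""
    return djb2(word) % num_buckets


for word in ("the", "quick", "brown"):
    print(word, djb2(word), bucket_index(word))
```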
|
|
4
|
+
|
|
5
|
+
> Sub-millisecond predictions · Zero dependencies · Pure ctypes · Thread-safe
|
|
6
|
+
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
## Features
|
|
10
|
+
|
|
11
|
+
| Feature | Description |
|
|
12
|
+
|---|---|
|
|
13
|
+
| `train_from_text(text)` | Train from any Python string |
|
|
14
|
+
| `train_from_file(path)` | Train from a text file (incremental) |
|
|
15
|
+
| `train_from_list(words)` | Train from a pre-tokenised word list |
|
|
16
|
+
| `predict_next(w1, w2)` | Greedy single-word prediction (< 1ms) |
|
|
17
|
+
| `predict_top_n(w1, w2, n, temperature)` | Top-N predictions with probabilities |
|
|
18
|
+
| `complete_sentence(prompt, num_words, beam_width)` | Beam search sentence generation |
|
|
19
|
+
| `greedy_generate(prompt, num_words)` | Fastest sentence completion |
|
|
20
|
+
| `perplexity(text)` | Evaluate model quality on held-out text |
|
|
21
|
+
| `vocabulary()` | Returns all known words as a Python `set` |
|
|
22
|
+
| `get_stats()` | Dict with trigram count, vocab size, etc. |
|
|
23
|
+
| `save(path)` / `TrigramModel.load(path)` | Binary model persistence |
|
|
24
|
+
| `reset()` | Clear the model so it can be retrained from scratch |
|
|
25
|
+
| `"the quick" in model` | Check if a bigram context was seen |
|
|
26
|
+
| `len(model)` | Total number of stored trigrams |
|
|
27
|
+
| Thread-safe | All predictions guarded by a `threading.Lock` |
|
|
28
|
+
| Context manager | `with TrigramModel.load(path) as m:` |
|
|
29
|
+
|
|
30
|
+
---
|
|
31
|
+
|
|
32
|
+
## Installation
|
|
33
|
+
|
|
34
|
+
### Prerequisites
|
|
35
|
+
- Python 3.8+
|
|
36
|
+
- GCC or a compatible C compiler (macOS: `xcode-select --install`, which installs Apple Clang; Ubuntu: `sudo apt install gcc`)
|
|
37
|
+
|
|
38
|
+
### Install (one command)
|
|
39
|
+
|
|
40
|
+
```bash
|
|
41
|
+
cd /path/to/Trigrams
|
|
42
|
+
pip install -e .
|
|
43
|
+
```
|
|
44
|
+
|
|
45
|
+
This compiles the C engine into `trigram/_trigram_c.dylib` (or `.so` on Linux) and installs the package in editable mode.
|
|
46
|
+
|
|
47
|
+
---
|
|
48
|
+
|
|
49
|
+
## Quickstart
|
|
50
|
+
|
|
51
|
+
```python
|
|
52
|
+
from trigram import TrigramModel
|
|
53
|
+
|
|
54
|
+
# 1. Create and train
|
|
55
|
+
model = TrigramModel()
|
|
56
|
+
model.train_from_text("""
|
|
57
|
+
The quick brown fox jumps over the lazy dog.
|
|
58
|
+
The quick brown fox was nimble and swift.
|
|
59
|
+
The lazy dog slept peacefully under the old oak tree.
|
|
60
|
+
""")
|
|
61
|
+
|
|
62
|
+
# 2. Predict next word (greedy)
|
|
63
|
+
word = model.predict_next("the", "quick")
|
|
64
|
+
print(word) # → "brown"
|
|
65
|
+
|
|
66
|
+
# 3. Top-N predictions with probabilities (output below is illustrative: "red" does not occur in the corpus above)
|
|
67
|
+
preds = model.predict_top_n("the", "quick", n=3, temperature=1.0)
|
|
68
|
+
# [{"word": "brown", "probability": 0.75, "count": 2},
|
|
69
|
+
# {"word": "red", "probability": 0.25, "count": 1}]
|
|
70
|
+
|
|
71
|
+
# 4. Sentence completion (beam search)
|
|
72
|
+
completions = model.complete_sentence("the quick", num_words=4, beam_width=3)
|
|
73
|
+
# [{"sentence": "the quick brown fox jumps", "probability": 0.012}, ...]
|
|
74
|
+
|
|
75
|
+
# 5. Greedy generation (fastest)
|
|
76
|
+
sentence = model.greedy_generate("the quick", num_words=3)
|
|
77
|
+
# "the quick brown fox"
|
|
78
|
+
|
|
79
|
+
# 6. Evaluate quality
|
|
80
|
+
ppl = model.perplexity("the quick brown fox")
|
|
81
|
+
print(f"Perplexity: {ppl:.2f}")
|
|
82
|
+
|
|
83
|
+
# 7. Inspect model
|
|
84
|
+
print(len(model)) # → total trigrams
|
|
85
|
+
print("the quick" in model) # → True
|
|
86
|
+
print(model.vocabulary()) # → {"the", "quick", "brown", ...}
|
|
87
|
+
print(model.get_stats()) # → {"total_trigrams": 7, "unique_first_words": 3, ...}
|
|
88
|
+
```
|
|
89
|
+
|
|
90
|
+
---
|
|
91
|
+
|
|
92
|
+
## Training from a File
|
|
93
|
+
|
|
94
|
+
```python
|
|
95
|
+
model = TrigramModel()
|
|
96
|
+
model.train_from_file("path/to/my_corpus.txt")
|
|
97
|
+
|
|
98
|
+
# Incremental training: add more data later
|
|
99
|
+
model.train_from_file("path/to/more_data.txt")
|
|
100
|
+
```
|
|
101
|
+
|
|
102
|
+
---
|
|
103
|
+
|
|
104
|
+
## Saving and Loading Models
|
|
105
|
+
|
|
106
|
+
```python
|
|
107
|
+
# Save
|
|
108
|
+
model.save("my_model.bin")
|
|
109
|
+
|
|
110
|
+
# Load (class method)
|
|
111
|
+
model2 = TrigramModel.load("my_model.bin")
|
|
112
|
+
|
|
113
|
+
# Context manager (auto-frees on exit)
|
|
114
|
+
with TrigramModel.load("my_model.bin") as m:
|
|
115
|
+
print(m.predict_next("the", "quick"))
|
|
116
|
+
```
|
|
117
|
+
|
|
118
|
+
---
|
|
119
|
+
|
|
120
|
+
## Temperature Sampling
|
|
121
|
+
|
|
122
|
+
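The `temperature` parameter controls how sharply the returned probabilities favour the most frequent words; lower values are more conservative, higher values more diverse: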
|
|
123
|
+
|
|
124
|
+
```python
|
|
125
|
+
# Low temperature: probability mass concentrates on the most common word
|
|
126
|
+
model.predict_top_n("the", "quick", temperature=0.1)
|
|
127
|
+
|
|
128
|
+
# Standard probability distribution
|
|
129
|
+
model.predict_top_n("the", "quick", temperature=1.0)
|
|
130
|
+
|
|
131
|
+
# More diverse / creative
|
|
132
|
+
model.predict_top_n("the", "quick", temperature=2.0)
|
|
133
|
+
```
|
|
134
|
+
|
|
135
|
+
---
|
|
136
|
+
|
|
137
|
+
## Advanced Usage
|
|
138
|
+
|
|
139
|
+
### Train from a word list (custom tokenisation)
|
|
140
|
+
|
|
141
|
+
```python
|
|
142
|
+
import nltk
|
|
143
|
+
tokens = nltk.word_tokenize("The quick brown fox")
|
|
144
|
+
tokens = [t.lower() for t in tokens if t.isalpha()]
|
|
145
|
+
|
|
146
|
+
model = TrigramModel()
|
|
147
|
+
model.train_from_list(tokens)
|
|
148
|
+
```
|
|
149
|
+
|
|
150
|
+
### Thread-safe batch prediction
|
|
151
|
+
|
|
152
|
+
```python
|
|
153
|
+
import threading
|
|
154
|
+
|
|
155
|
+
def worker(model, results, idx):
|
|
156
|
+
results[idx] = model.predict_top_n("the", "quick", n=5)
|
|
157
|
+
|
|
158
|
+
model = TrigramModel.load("model.bin")
|
|
159
|
+
results = [None] * 10
|
|
160
|
+
threads = [threading.Thread(target=worker, args=(model, results, i)) for i in range(10)]
|
|
161
|
+
for t in threads: t.start()
|
|
162
|
+
for t in threads: t.join()
|
|
163
|
+
```
|
|
164
|
+
|
|
165
|
+
### Check if a context exists before predicting
|
|
166
|
+
|
|
167
|
+
```python
|
|
168
|
+
if "the quick" in model:
|
|
169
|
+
result = model.predict_next("the", "quick")
|
|
170
|
+
```
|
|
171
|
+
|
|
172
|
+
---
|
|
173
|
+
|
|
174
|
+
## API Reference
|
|
175
|
+
|
|
176
|
+
### `TrigramModel()`
|
|
177
|
+
Creates a new empty model.
|
|
178
|
+
|
|
179
|
+
### `train_from_text(text: str) → int`
|
|
180
|
+
Train on a raw text string. Returns the number of trigrams inserted.
|
|
181
|
+
|
|
182
|
+
### `train_from_file(path) → int`
|
|
183
|
+
Train from a text file. Returns the number of trigrams inserted.
|
|
184
|
+
|
|
185
|
+
### `train_from_list(words: list) → int`
|
|
186
|
+
Train from a pre-tokenised word list. Returns the number of trigrams inserted.
|
|
187
|
+
|
|
188
|
+
### `predict_next(w1, w2) → str | None`
|
|
189
|
+
Return the single most-likely next word or `None`.
|
|
190
|
+
|
|
191
|
+
### `predict_top_n(w1, w2, n=5, temperature=1.0) → list[dict]`
|
|
192
|
+
Return up to N predictions sorted by probability descending.
|
|
193
|
+
Each dict: `{"word": str, "probability": float, "count": int}`.
|
|
194
|
+
|
|
195
|
+
### `complete_sentence(prompt, num_words=5, beam_width=3) → list[dict]`
|
|
196
|
+
Generate sentence completions via beam search.
|
|
197
|
+
Each dict: `{"sentence": str, "probability": float}`.
|
|
198
|
+
|
|
199
|
+
### `greedy_generate(prompt, num_words=5) → str`
|
|
200
|
+
Fastest sentence completion using greedy decoding.
|
|
201
|
+
|
|
202
|
+
### `perplexity(text) → float`
|
|
203
|
+
Compute per-token perplexity on held-out text. Lower = better.
|
|
204
|
+
|
|
205
|
+
### `vocabulary() → set[str]`
|
|
206
|
+
All words seen in the first-word position of training trigrams.
|
|
207
|
+
|
|
208
|
+
### `get_stats() → dict`
|
|
209
|
+
`{"total_trigrams": int, "unique_first_words": int, "vocabulary_size": int}`.
|
|
210
|
+
|
|
211
|
+
### `save(path) → None`
|
|
212
|
+
Save model to binary file. Compatible with the C CLI tool.
|
|
213
|
+
|
|
214
|
+
### `TrigramModel.load(path) → TrigramModel` (classmethod)
|
|
215
|
+
Load a pre-trained binary model. Supports context manager protocol.
|
|
216
|
+
|
|
217
|
+
### `reset() → None`
|
|
218
|
+
Clear all training data.
|
|
219
|
+
|
|
220
|
+
### `len(model)` → int
|
|
221
|
+
Total stored trigrams.
|
|
222
|
+
|
|
223
|
+
### `"w1 w2" in model` / `("w1", "w2") in model` → bool
|
|
224
|
+
Check if a bigram context exists.
|
|
225
|
+
|
|
226
|
+
### `repr(model)`
|
|
227
|
+
`TrigramModel(trigrams=11,062,203, vocab=97,277)`
|
|
228
|
+
|
|
229
|
+
---
|
|
230
|
+
|
|
231
|
+
## Performance
|
|
232
|
+
|
|
233
|
+
| Operation | Latency |
|
|
234
|
+
|---|---|
|
|
235
|
+
| Single word prediction | < 1ms |
|
|
236
|
+
| Top-5 predictions | 1–2ms |
|
|
237
|
+
| Beam search (5 words, width 3) | 5–10ms |
|
|
238
|
+
| Training (1M words) | ~30s |
|
|
239
|
+
|
|
240
|
+
---
|
|
241
|
+
|
|
242
|
+
## Running Tests
|
|
243
|
+
|
|
244
|
+
```bash
|
|
245
|
+
pip install pytest
|
|
246
|
+
pytest tests/ -v
|
|
247
|
+
```
|
|
248
|
+
|
|
249
|
+
---
|
|
250
|
+
|
|
251
|
+
## Project Structure
|
|
252
|
+
|
|
253
|
+
```
|
|
254
|
+
Trigrams/
|
|
255
|
+
├── trigram/ # Python library
|
|
256
|
+
│   ├── __init__.py
|
|
257
|
+
│   ├── _lib.py # ctypes bindings
|
|
258
|
+
│   ├── model.py # TrigramModel class
|
|
259
|
+
│   ├── utils.py # Text preprocessing
|
|
260
|
+
│   └── _trigram_c.dylib # Compiled C engine (auto-generated)
|
|
261
|
+
├── trigram_llm/
|
|
262
|
+
│   ├── src/ # C source files
|
|
263
|
+
│   └── include/ # C headers
|
|
264
|
+
├── tests/ # pytest test suite
|
|
265
|
+
├── setup.py # Build script
|
|
266
|
+
└── pyproject.toml
|
|
267
|
+
```
|
|
268
|
+
|
|
269
|
+
---
|
|
270
|
+
|
|
271
|
+
## License
|
|
272
|
+
|
|
273
|
+
MIT License. Feel free to use, modify, and distribute.
|