topolm 0.0.11__tar.gz

topolm-0.0.11/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 JadeyGraham96
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
topolm-0.0.11/PKG-INFO ADDED
@@ -0,0 +1,475 @@
1
+ Metadata-Version: 2.4
2
+ Name: topolm
3
+ Version: 0.0.11
4
+ Summary: Topology-native explainable language model prototype powered by Topologist
5
+ Author: Robert McMenemy
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/Arkay92/TopoLM
8
+ Project-URL: Repository, https://github.com/Arkay92/TopoLM.git
9
+ Project-URL: Bug Tracker, https://github.com/Arkay92/TopoLM/issues
10
+ Classifier: Development Status :: 4 - Beta
11
+ Classifier: Intended Audience :: Developers
12
+ Classifier: Intended Audience :: Science/Research
13
+ Classifier: License :: OSI Approved :: MIT License
14
+ Classifier: Programming Language :: Python :: 3
15
+ Classifier: Programming Language :: Python :: 3.10
16
+ Classifier: Programming Language :: Python :: 3.11
17
+ Classifier: Programming Language :: Python :: 3.12
18
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
19
+ Requires-Python: >=3.10
20
+ Description-Content-Type: text/markdown
21
+ License-File: LICENSE
22
+ Requires-Dist: numpy>=1.23
23
+ Requires-Dist: networkx>=3.0
24
+ Requires-Dist: topologist>=0.4.0
25
+ Provides-Extra: ml
26
+ Requires-Dist: scikit-learn>=1.3; extra == "ml"
27
+ Requires-Dist: torch>=2.0; extra == "ml"
28
+ Provides-Extra: hf
29
+ Requires-Dist: datasets>=2.18; extra == "hf"
30
+ Provides-Extra: dev
31
+ Requires-Dist: pytest>=7.0; extra == "dev"
32
+ Requires-Dist: ruff>=0.5; extra == "dev"
33
+ Requires-Dist: build>=0.10; extra == "dev"
34
+ Requires-Dist: twine>=4.0; extra == "dev"
35
+ Dynamic: license-file
36
+
37
+ # TopoLM
38
+
39
+ <p align="center">
40
+ A topology-native, explainable language model prototype powered by <a href="https://github.com/Arkay92/Topologist">topologist</a>.
41
+ </p>
42
+
43
+ <p align="center">
44
+ <img width="256" height="256" alt="ChatGPT Image Jun 7, 2026, 11_38_36 AM" src="https://github.com/user-attachments/assets/969082e0-bb1c-4cda-9551-9cefdd23a06b" />
45
+ </p>
46
+
47
+ <p align="center">
48
+ <a href="https://github.com/Arkay92/TopoLM/actions/workflows/publish.yml"><img alt="Publish" src="https://github.com/Arkay92/TopoLM/actions/workflows/publish.yml/badge.svg" /></a>
49
+ <a href="https://pypi.org/project/topolm/"><img alt="PyPI" src="https://img.shields.io/pypi/v/topolm.svg" /></a>
50
+ <img alt="Python" src="https://img.shields.io/pypi/pyversions/topolm.svg" />
51
+ <img alt="Downloads" src="https://img.shields.io/pypi/dm/topolm.svg" />
52
+ <img alt="License" src="https://img.shields.io/pypi/l/topolm.svg" />
53
+ </p>
54
+
55
+ **TopoLM** combines:
56
+ - **Topology-native graph memory** using `topologist` and NetworkX.
57
+ - **Hyperdimensional encoding** for unit, domain, and sentence representations.
58
+ - **Evidence-based candidate retrieval** from phrase continuations, direct edges, and retrieved contexts.
59
+ - **Explainable scoring** with breakdowns of evidence, domain match, POS grammar, and repetition penalties.
60
+ - **Generation with multiple decoding strategies** (nucleus, beam, greedy) and phrase-tail detection.
61
+ - **Hugging Face dataset support** for training on large text corpora.
62
+ - **Persistence** with full state save/load, graph serialization, and memory reconstruction.
63
+
64
+ ---
65
+
66
+ ## Why Topology for Language Models?
67
+
68
+ Most neural LMs are **opaque black boxes**. Most symbolic systems are **brittle and limited**.
69
+
70
+ TopoLM sits between:
71
+
72
+ ```
73
+ Input text
74
+ -> Tokenize & domain detect
75
+ -> Build symbolic graph (units, phrases, domains, POS)
76
+ -> HDC encoding for each node
77
+ -> Topological memory state
78
+
79
+ -> Inference (next-token prediction, generation)
80
+ -> Explainable evidence trails
81
+ -> Drift detection & refinement
82
+ ```
83
+
84
+ Each token, phrase, and domain relationship is stored **explicitly** in the graph, **encoded** into a high-dimensional bipolar vector, and **scored** by evidence, topology, and confidence. This gives you a language model that is:
85
+ - **Interpretable**: see exactly why a prediction was made.
86
+ - **Grounded**: candidates come from relations stored in the graph, which limits nonsense outputs.
87
+ - **Efficient**: no matrix multiplications; inference relies on graph queries and HDC similarity.
88
+ - **Debuggable**: modify graph state, track provenance, refine confidence.
89
+
90
+ ---
91
+
92
+ ## Architecture
93
+
94
+ ```
95
+ Text input
96
+ |
97
+ v
98
+ Tokenizer (unit, POS, domain, entity recognition)
99
+ |
100
+ v
101
+ Graph builder
102
+ - Unit nodes (with frequency, domain, POS)
103
+ - Phrase nodes (with multi-gram spans)
104
+ - Domain nodes
105
+ - Relations (next_unit, appears_near, likely_next, domain_related, has_pos)
106
+ |
107
+ v
108
+ HDC Memory (Topologist + fallback NetworkX)
109
+ - Encode units, phrases, domains, positions into {-1,+1}^D vectors
110
+ - Store graph topology
111
+ - Bundled snapshots for drift
112
+ |
113
+ v
114
+ Inference (Predict or Generate)
115
+ - Context Index (HDC similarity retrieval)
116
+ - Candidate retrieval (phrase continuation, direct edges, domain priors, unigrams)
117
+ - Evidence scoring (weighted by source: phrase, direct, RAG, domain, frequency)
118
+ - Grammar validation (POS sequences)
119
+ - Sampling (nucleus, beam, greedy)
120
+ ```
121
+
122
+ ---
123
+
124
+ ## Install
125
+
126
+ ```bash
127
+ pip install topolm
128
+ ```
129
+
130
+ For Hugging Face dataset support:
131
+
132
+ ```bash
133
+ pip install "topolm[hf]"
134
+ ```
135
+
136
+ For development:
137
+
138
+ ```bash
139
+ pip install -e ".[dev]"
140
+ pytest -q
141
+ python -m build
142
+ twine check dist/*
143
+ ```
144
+
145
+ ---
146
+
147
+ ## Quick Start
148
+
149
+ ### Basic Training and Prediction
150
+
151
+ ```python
152
+ from topolm import TopoLM, Config
153
+
154
+ corpus = """
155
+ The cat sat on the mat.
156
+ The dog sat on the floor.
157
+ CYP3A4 inhibition increases drug exposure.
158
+ Clarithromycin inhibits CYP3A4.
159
+ """
160
+
161
+ model = TopoLM(Config()).fit(corpus)
162
+
163
+ # Get next-token predictions
164
+ preds = model.distribution("clarithromycin inhibits", top_k=5)
165
+ for p in preds:
166
+     print(f" {p.text:20s} prob={p.probability:.3f} score={p.score:.3f}")
167
+
168
+ # Generate fluent text
169
+ generated = model.generate("cyp3a4 inhibition", decoding="beam")
170
+ print(generated)
171
+ ```
172
+
173
+ ### Training from Text List
174
+
175
+ ```python
176
+ texts = [
177
+ "Sentence one.",
178
+ "Another sentence.",
179
+ "Third sentence here.",
180
+ ]
181
+ model = TopoLM(Config()).fit_texts(texts)
182
+ ```
183
+
184
+ ### Training from Hugging Face Dataset
185
+
186
+ ```python
187
+ from topolm import load_hf_dataset
188
+
189
+ texts = load_hf_dataset(
190
+ "wikitext",
191
+ split="train",
192
+ text_field="text",
193
+ sample_size=1000
194
+ )
195
+ model = TopoLM(Config()).fit_texts(texts)
196
+ ```
197
+
198
+ ### Save and Load
199
+
200
+ ```python
201
+ import tempfile
202
+ from pathlib import Path
203
+
204
+ with tempfile.TemporaryDirectory() as tmpdir:
205
+     path = model.save(tmpdir)
206
+     loaded = TopoLM.load(path)
207
+     print(loaded.distribution("clarithromycin inhibits", 3))
208
+ ```
209
+
210
+ ### Model Explanation
211
+
212
+ ```python
213
+ explanation = model.explain("clarithromycin inhibits", "cyp3a4")
214
+ print(f"Score: {explanation['score']:.3f}")
215
+ print(f"Breakdown: {explanation['breakdown']}")
216
+ print(f"Evidence paths: {explanation['paths'][:3]}")
217
+ ```
218
+
219
+ ---
220
+
221
+ ## CLI
222
+
223
+ Train and interact with a demo model:
224
+
225
+ ```bash
226
+ topolm demo
227
+ ```
228
+
229
+ Make predictions:
230
+
231
+ ```bash
232
+ topolm predict "clarithromycin inhibits"
233
+ ```
234
+
235
+ Generate text:
236
+
237
+ ```bash
238
+ topolm generate "cyp3a4 inhibition" --decoding beam
239
+ ```
240
+
241
+ ---
242
+
243
+ ## Main Features
244
+
245
+ ### 1. **Hyperdimensional Unit Memory**
246
+
247
+ Tokens and phrases are encoded into stable bipolar vectors using seeded random generation:
248
+
249
+ ```python
250
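+ from topolm import Config
+ from topolm.core import HDC  # assumed import path: core.py is listed as defining HDC
+ config = Config(dim=1024, seed=42)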
251
+ hdc = HDC(dim=1024, seed=42)
252
+ vector = hdc.get("unit:clarithromycin") # {-1, +1}^1024
253
+ ```
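The idea of seeded bipolar encoding can be illustrated without the library. The sketch below is not TopoLM's `HDC` class; `bipolar_vector`, `similarity`, and `bundle` are hypothetical helpers, and hashing the name together with the seed is an assumption about how stability might be achieved.

```python
import hashlib
import random


def bipolar_vector(name: str, dim: int = 1024, seed: int = 42) -> list[int]:
    """Return a reproducible vector in {-1, +1}^dim for a symbolic name."""
    # Mix the global seed with the name so each symbol gets its own stream,
    # and the same (name, seed) pair always yields the same vector.
    digest = hashlib.sha256(f"{seed}:{name}".encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [1 if rng.random() < 0.5 else -1 for _ in range(dim)]


def similarity(a: list[int], b: list[int]) -> float:
    """Normalized dot product; 1.0 for identical bipolar vectors."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


def bundle(vectors: list[list[int]]) -> list[int]:
    """Element-wise majority vote; ties resolve to +1."""
    return [1 if sum(column) >= 0 else -1 for column in zip(*vectors)]


v1 = bipolar_vector("unit:clarithromycin")
v2 = bipolar_vector("unit:clarithromycin")
v3 = bipolar_vector("unit:cyp3a4")
print(similarity(v1, v2))  # identical inputs give 1.0
print(similarity(v1, v3))  # unrelated symbols land close to 0
```

Random bipolar vectors of this size are nearly orthogonal to each other, which is what lets similarity act as a retrieval signal.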
254
+
255
+ ### 2. **Symbolic Graph Topology**
256
+
257
+ Units, phrases, and domains are connected via typed relations:
258
+
259
+ - `next_unit`: direct token transitions
260
+ - `appears_near`: positional co-occurrence
261
+ - `likely_next`: phrase continuation
262
+ - `domain_related`: domain affinity
263
+ - `has_pos`: part-of-speech tagging
264
+
265
+ ```python
266
+ g = model.graph
267
+ edges = list(g.out_edges("unit:clarithromycin", data=True))
268
+ for s, t, d in edges:
269
+     print(f"{s} --{d['relation']}--> {t} (conf={d.get('confidence', 0.0):.2f})")
270
+ ```
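To make the relation types concrete, here is a minimal sketch of how `next_unit` and `appears_near` edges could be counted from tokenized sentences. It uses a plain dictionary rather than NetworkX or `topologist`; `build_edges` is a hypothetical helper, and treating the co-occurrence window as "the next `window` tokens" is an assumption.

```python
from collections import defaultdict

# (source, relation, target) -> observation count
EdgeCounts = dict[tuple[str, str, str], int]


def build_edges(sentences: list[list[str]], window: int = 8) -> EdgeCounts:
    """Count typed edges between unit nodes for each tokenized sentence."""
    edges: EdgeCounts = defaultdict(int)
    for tokens in sentences:
        for i, token in enumerate(tokens):
            source = f"unit:{token}"
            # next_unit: the token that directly follows.
            if i + 1 < len(tokens):
                edges[(source, "next_unit", f"unit:{tokens[i + 1]}")] += 1
            # appears_near: any later token inside the co-occurrence window.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                edges[(source, "appears_near", f"unit:{tokens[j]}")] += 1
    return dict(edges)


sentences = [
    ["clarithromycin", "inhibits", "cyp3a4"],
    ["cyp3a4", "inhibition", "increases", "drug", "exposure"],
]
edges = build_edges(sentences)
for (source, relation, target), count in sorted(edges.items()):
    print(f"{source} --{relation}--> {target} (count={count})")
```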
271
+
272
+ ### 3. **Evidence-Based Candidate Retrieval**
273
+
274
+ Candidates are scored by multiple overlapping sources:
275
+
276
+ - **Phrase-based**: exact n-gram continuations from the graph
277
+ - **Direct edges**: observed next-token relations
278
+ - **Retrieved context**: HDC similarity to past sentences
279
+ - **Domain priors**: units from matching domain
280
+ - **Entity copy**: repeat entities from input
281
+ - **Frequency**: unigram statistics
282
+
283
+ ```python
284
+ candidates = model.retrieve_candidates(
285
+ units=["clarithromycin", "inhibits"],
286
+ domain="drug_interaction",
287
+ context_text="clarithromycin inhibits"
288
+ )
289
+ ```
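One way to combine overlapping sources is a weighted sum that keeps a per-source breakdown. The sketch below only illustrates that idea: `SOURCE_WEIGHTS`, `merge_evidence`, and the weight values are hypothetical and are not taken from TopoLM.

```python
from collections import defaultdict

# Hypothetical per-source weights, chosen only for illustration.
SOURCE_WEIGHTS = {
    "phrase": 1.0,
    "direct": 0.8,
    "rag": 0.6,
    "domain": 0.3,
    "frequency": 0.1,
}


def merge_evidence(hits):
    """Combine (candidate, source, strength) hits into scored candidates.

    Returns a list of (candidate, score, breakdown) sorted by score, where
    breakdown maps each source to its weighted contribution.
    """
    breakdowns = defaultdict(lambda: defaultdict(float))
    for candidate, source, strength in hits:
        breakdowns[candidate][source] += SOURCE_WEIGHTS[source] * strength
    scored = [
        (candidate, sum(parts.values()), dict(parts))
        for candidate, parts in breakdowns.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


hits = [
    ("cyp3a4", "phrase", 0.5),
    ("cyp3a4", "direct", 1.0),
    ("drug", "frequency", 1.0),
]
ranked = merge_evidence(hits)
for candidate, score, breakdown in ranked:
    print(candidate, round(score, 3), breakdown)
```

Keeping the breakdown next to the score is what makes each ranking decision traceable to its sources.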
290
+
291
+ ### 4. **Explainable Scoring**
292
+
293
+ Each prediction includes a breakdown:
294
+
295
+ ```python
296
+ pred = model.distribution("clarithromycin inhibits", top_k=1)[0]
297
+ print(f"Text: {pred.text}")
298
+ print(f"Score: {pred.score:.3f}")
299
+ print(f"Probability: {pred.probability:.3f}")
300
+ print(f"Breakdown: {pred.breakdown}")
301
+ # {'evidence': 0.5, 'phrase': 0.35, 'direct': 0.0, 'freq': 0.0, 'pos': 0.45, 'domain': 1.0, ...}
302
+ ```
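Each prediction carries both a `score` and a `probability`, and `Config` exposes a softmax `temperature`. How TopoLM maps one to the other is not shown in this README; the sketch below assumes a standard temperature softmax, with `softmax` as a hypothetical helper.

```python
import math


def softmax(scores: dict[str, float], temperature: float = 0.75) -> dict[str, float]:
    """Turn raw candidate scores into probabilities that sum to 1."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum score before exponentiating for numerical stability.
    peak = max(scores.values())
    exps = {k: math.exp((s - peak) / temperature) for k, s in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


scores = {"cyp3a4": 1.3, "drug": 0.1}
probs = softmax(scores)                      # moderate sharpening
sharper = softmax(scores, temperature=0.25)  # lower temperature, peakier output
print(probs)
print(sharper)
```

Lower temperatures push probability mass toward the top-scoring candidate; higher ones flatten the distribution.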
303
+
304
+ ### 5. **Multiple Decoding Strategies**
305
+
306
+ Generate text using nucleus sampling, beam search, or greedy selection:
307
+
308
+ ```python
309
+ # Nucleus sampling (default)
310
+ text = model.generate("prompt", decoding="nucleus", top_p=0.88)
311
+
312
+ # Beam search
313
+ text = model.generate("prompt", decoding="beam", beam_width=4)
314
+
315
+ # Greedy
316
+ text = model.generate("prompt", decoding="greedy")
317
+ ```
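For readers unfamiliar with these strategies, the sketch below shows generic nucleus (top-p) filtering and greedy selection over a probability table. It is not TopoLM's decoder; `nucleus_filter`, `sample_nucleus`, and `pick_greedy` are hypothetical names.

```python
import random


def nucleus_filter(probs: dict[str, float], top_p: float = 0.88) -> dict[str, float]:
    """Keep the smallest top-ranked set whose mass reaches top_p, renormalized."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    return {token: p / mass for token, p in kept}


def sample_nucleus(probs: dict[str, float], top_p: float = 0.88, rng=None) -> str:
    """Sample one token from the nucleus, proportionally to its probability."""
    rng = rng or random.Random()
    pool = nucleus_filter(probs, top_p)
    tokens = list(pool)
    return rng.choices(tokens, weights=[pool[t] for t in tokens], k=1)[0]


def pick_greedy(probs: dict[str, float]) -> str:
    """Always take the single most probable token."""
    return max(probs, key=probs.get)


probs = {"cyp3a4": 0.6, "drug": 0.25, "exposure": 0.1, "mat": 0.05}
pool = nucleus_filter(probs)
print(sorted(pool))  # the low-probability tail ("mat") is dropped
print(pick_greedy(probs))
print(sample_nucleus(probs, rng=random.Random(0)))
```

Beam search differs from both in that it keeps several partial sequences alive per step instead of committing to one token.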
318
+
319
+ ### 6. **Domain Detection and Grounding**
320
+
321
+ Automatic domain detection prevents category confusion:
322
+
323
+ ```python
324
+ domains = {
325
+ "domestic": ["cat", "dog", "mat", "floor"],
326
+ "cybersecurity": ["attacker", "exploit", "vulnerability"],
327
+ "drug_interaction": ["cyp3a4", "clarithromycin", "inhibits"],
328
+ "lm_research": ["language", "model", "topological"],
329
+ }
330
+ domain = model.tok.domain(["clarithromycin", "inhibits"]) # "drug_interaction"
331
+ ```
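A simple way to picture keyword-driven domain detection is to count how many input units appear in each domain's keyword list. This is a sketch under that assumption, not the tokenizer's actual logic; `detect_domain` and the `"general"` fallback label are hypothetical.

```python
# Keyword lists copied from the example above.
DOMAINS = {
    "domestic": ["cat", "dog", "mat", "floor"],
    "cybersecurity": ["attacker", "exploit", "vulnerability"],
    "drug_interaction": ["cyp3a4", "clarithromycin", "inhibits"],
    "lm_research": ["language", "model", "topological"],
}


def detect_domain(units, domains=DOMAINS, default="general"):
    """Return the domain with the most keyword hits, or `default` if none match."""
    tokens = {unit.lower() for unit in units}
    best, best_hits = default, 0
    for name, keywords in domains.items():
        hits = len(tokens & set(keywords))
        # Strict comparison keeps the first domain on ties.
        if hits > best_hits:
            best, best_hits = name, hits
    return best


print(detect_domain(["clarithromycin", "inhibits"]))  # drug_interaction
print(detect_domain(["The", "cat"]))                  # domestic
print(detect_domain(["unrelated"]))                   # general
```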
332
+
333
+ ### 7. **Full State Persistence**
334
+
335
+ Save and restore the complete model state, including graph and HDC memory:
336
+
337
+ ```python
338
+ path = model.save("./model_checkpoint")
339
+ restored = TopoLM.load(path)
340
+ # Full parity: same predictions, same graph, same counts
341
+ ```
342
+
343
+ ### 8. **Graph Compaction**
344
+
345
+ Remove low-frequency edges to reduce memory:
346
+
347
+ ```python
348
+ stats = model.mem.compact(min_edge_frequency=2)
349
+ print(f"Removed {stats['removed_edges']} edges")
350
+ ```
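Conceptually, compaction is a frequency filter over edges. The sketch below applies that filter to a plain edge-count dictionary; `compact_edges` is a hypothetical stand-in and does not reproduce what `model.mem.compact` does internally.

```python
def compact_edges(edges, min_edge_frequency=2):
    """Drop edges seen fewer than min_edge_frequency times.

    `edges` maps (source, relation, target) to an observation count.
    Returns the surviving edges and a small stats dictionary.
    """
    kept = {edge: count for edge, count in edges.items() if count >= min_edge_frequency}
    stats = {"removed_edges": len(edges) - len(kept), "kept_edges": len(kept)}
    return kept, stats


edge_counts = {
    ("unit:sat", "next_unit", "unit:on"): 2,
    ("unit:on", "next_unit", "unit:the"): 2,
    ("unit:the", "next_unit", "unit:mat"): 1,
}
kept, stats = compact_edges(edge_counts)
print(f"Removed {stats['removed_edges']} edges, kept {stats['kept_edges']}")
```

Raising the threshold trades recall of rare continuations for a smaller graph.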
351
+
352
+ ---
353
+
354
+ ## Configuration
355
+
356
+ Tune behavior via `Config`:
357
+
358
+ ```python
359
+ from topolm import Config
360
+
361
+ config = Config(
362
+ dim=1024, # HDC vector dimension
363
+ seed=42, # Reproducibility
364
+ window=8, # Co-occurrence window
365
+ phrase_lengths=(2, 3, 4, 5), # Phrase n-gram sizes
366
+ max_candidates=96, # Retrieval pool size
367
+ inference_candidates=48, # Top-k for scoring
368
+ temperature=0.75, # Softmax temperature
369
+ default_top_p=0.88, # Nucleus threshold
370
+ default_beam_width=4, # Beam search width
371
+ fast_dev_mode=True, # Disable slow features
372
+ )
373
+ model = TopoLM(config).fit(corpus)  # `corpus` as defined in Quick Start
374
+ ```
375
+
376
+ ---
377
+
378
+ ## Examples
379
+
380
+ - [basic_demo.py](examples/basic_demo.py): Simple in-memory training and generation.
381
+ - [hf_dataset_demo.py](examples/hf_dataset_demo.py): Load and train on Hugging Face datasets.
382
+
383
+ ---
384
+
385
+ ## Project Structure
386
+
387
+ ```
388
+ topolm/
389
+   __init__.py           # Public API
390
+   config.py             # Configuration dataclass
391
+   core.py               # TopoLM, Memory, Tokenizer, HDC
392
+   cli.py                # Command-line interface
393
+   datasets.py           # Hugging Face dataset loaders
394
+ examples/
395
+   basic_demo.py         # In-memory example
396
+   hf_dataset_demo.py    # Hugging Face example
397
+ tests/
398
+   test_smoke.py         # Smoke tests
399
+ .github/
400
+   workflows/
401
+     publish.yml         # PyPI publishing workflow
402
+ pyproject.toml          # Project metadata and dependencies
403
+ ```
404
+
405
+ ---
406
+
407
+ ## Development
408
+
409
+ ```bash
410
+ # Install with dev extras
411
+ pip install -e ".[dev]"
412
+
413
+ # Lint
414
+ ruff check .
415
+
416
+ # Run tests
417
+ pytest -q
418
+
419
+ # Build package
420
+ python -m build
421
+
422
+ # Check distributions
423
+ twine check dist/*
424
+ ```
425
+
426
+ ---
427
+
428
+ ## Limitations and Future Work
429
+
430
+ - **No fine-tuning**: TopoLM learns from corpus statistics; no gradient-based learning.
431
+ - **Limited scalability**: Designed for interpretability at the cost of training speed.
432
+ - **Topologist dependency**: Depends on `topologist>=0.4.0` for graph reasoning, with NetworkX as a fallback.
433
+ - **English-focused tokenization**: Custom regex tokenizer; non-English text may need adaptation.
434
+
435
+ Future improvements:
436
+ - Domain-specific confidence tuning.
437
+ - Multi-hop inference over learned relations.
438
+ - Tensor-backed HDC for GPU acceleration.
439
+ - Streaming/online updates.
440
+
441
+ ---
442
+
443
+ ## License
444
+
445
+ MIT
446
+
447
+ ---
448
+
449
+ ## Contributing
450
+
451
+ Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) (if applicable) or open an issue.
452
+
453
+ ---
454
+
455
+ ## Citation
456
+
457
+ If you use TopoLM in research, please cite:
458
+
459
+ ```bibtex
460
+ @software{topolm2024,
461
+ title={TopoLM: A Topology-Native Explainable Language Model},
462
+ author={McMenemy, Robert},
463
+ url={https://github.com/Arkay92/TopoLM},
464
+ year={2024},
465
+ version={0.0.11},
466
+ }
467
+ ```
468
+
469
+ ---
470
+
471
+ ## Acknowledgments
472
+
473
+ - [topologist](https://github.com/Arkay92/Topologist) for the hyperdimensional graph engine.
474
+ - [networkx](https://networkx.org/) for core graph algorithms.
475
+ - [huggingface/datasets](https://huggingface.co/docs/datasets/) for dataset loading.