tritopic 0.1.0__py3-none-any.whl

@@ -0,0 +1,400 @@
1
+ Metadata-Version: 2.4
2
+ Name: tritopic
3
+ Version: 0.1.0
4
+ Summary: Tri-Modal Graph Topic Modeling with Iterative Refinement - A state-of-the-art topic modeling library
5
+ Author-email: Roman Egger <roman@example.com>
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/roman-egger/tritopic
8
+ Project-URL: Documentation, https://tritopic.readthedocs.io
9
+ Project-URL: Repository, https://github.com/roman-egger/tritopic
10
+ Keywords: topic-modeling,nlp,machine-learning,graph-clustering,leiden,embeddings,text-analysis,bertopic-alternative
11
+ Classifier: Development Status :: 4 - Beta
12
+ Classifier: Intended Audience :: Science/Research
13
+ Classifier: License :: OSI Approved :: MIT License
14
+ Classifier: Programming Language :: Python :: 3
15
+ Classifier: Programming Language :: Python :: 3.9
16
+ Classifier: Programming Language :: Python :: 3.10
17
+ Classifier: Programming Language :: Python :: 3.11
18
+ Classifier: Programming Language :: Python :: 3.12
19
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
20
+ Classifier: Topic :: Text Processing :: Linguistic
21
+ Requires-Python: >=3.9
22
+ Description-Content-Type: text/markdown
23
+ License-File: LICENSE
24
+ Requires-Dist: numpy>=1.21.0
25
+ Requires-Dist: pandas>=1.3.0
26
+ Requires-Dist: scipy>=1.7.0
27
+ Requires-Dist: scikit-learn>=1.0.0
28
+ Requires-Dist: sentence-transformers>=2.2.0
29
+ Requires-Dist: leidenalg>=0.9.0
30
+ Requires-Dist: igraph>=0.10.0
31
+ Requires-Dist: umap-learn>=0.5.0
32
+ Requires-Dist: hdbscan>=0.8.0
33
+ Requires-Dist: plotly>=5.0.0
34
+ Requires-Dist: tqdm>=4.60.0
35
+ Requires-Dist: rank-bm25>=0.2.0
36
+ Requires-Dist: keybert>=0.7.0
37
+ Provides-Extra: llm
38
+ Requires-Dist: anthropic>=0.18.0; extra == "llm"
39
+ Requires-Dist: openai>=1.0.0; extra == "llm"
40
+ Provides-Extra: full
41
+ Requires-Dist: anthropic>=0.18.0; extra == "full"
42
+ Requires-Dist: openai>=1.0.0; extra == "full"
43
+ Requires-Dist: pacmap>=0.6.0; extra == "full"
44
+ Requires-Dist: datamapplot>=0.1.0; extra == "full"
45
+ Provides-Extra: dev
46
+ Requires-Dist: pytest>=7.0.0; extra == "dev"
47
+ Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
48
+ Requires-Dist: black>=23.0.0; extra == "dev"
49
+ Requires-Dist: ruff>=0.1.0; extra == "dev"
50
+ Requires-Dist: mypy>=1.0.0; extra == "dev"
51
+ Dynamic: license-file
52
+
53
+ # 🔺 TriTopic
54
+
55
+ **Tri-Modal Graph Topic Modeling with Iterative Refinement**
56
+
57
+ A topic modeling library that combines semantic embeddings, lexical similarity, and metadata context with graph-based clustering. On the 20 Newsgroups benchmark reported below, it scores higher than BERTopic on coherence, diversity, and stability.
58
+
59
+ [![PyPI version](https://badge.fury.io/py/tritopic.svg)](https://badge.fury.io/py/tritopic)
60
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
61
+ [![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
62
+
63
+ ## 🚀 Key Innovations
64
+
65
+ | Feature | Why It Matters |
66
+ |---------|---------------|
67
+ | **Multi-View Graph Fusion** | Combines semantic, lexical, and metadata signals to avoid "embedding blur" |
68
+ | **Mutual kNN + SNN** | Eliminates noise bridges between unrelated documents |
69
+ | **Leiden + Consensus** | Dramatically more stable than single-run clustering |
70
+ | **Iterative Refinement** | Topics improve embeddings, embeddings improve topics |
71
+ | **LLM-Powered Labels** | Human-readable topic names via Claude or GPT-4 |
72
+
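To make the "Mutual kNN + SNN" row more concrete, here is a minimal pure-Python sketch of the general idea: an edge survives only if both endpoints list each other among their k nearest neighbours, and it is weighted by how much their neighbourhoods overlap. The function names, the cosine similarity, and the Jaccard weighting are assumptions chosen for illustration, not TriTopic's actual implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_sets(vectors, k):
    """For each point, the set of indices of its k most similar other points."""
    neighbours = []
    for i, u in enumerate(vectors):
        ranked = sorted(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: cosine(u, vectors[j]),
            reverse=True,
        )
        neighbours.append(set(ranked[:k]))
    return neighbours

def mutual_knn_snn_edges(vectors, k):
    """Keep edge (i, j) only when each point is in the other's kNN set.

    The edge weight is the Jaccard overlap of the two closed neighbourhoods,
    a simple shared-nearest-neighbour (SNN) style score (hypothetical choice).
    """
    nn = knn_sets(vectors, k)
    edges = {}
    for i in range(len(vectors)):
        for j in nn[i]:
            if i < j and i in nn[j]:
                a = nn[i] | {i}
                b = nn[j] | {j}
                edges[(i, j)] = len(a & b) / len(a | b)
    return edges

vectors = [
    [1.0, 0.0], [0.9, 0.1], [0.8, 0.2],  # cluster A
    [0.0, 1.0], [0.1, 0.9], [0.2, 0.8],  # cluster B
    [0.6, 0.5],                          # a point lying between the clusters
]
edges = mutual_knn_snn_edges(vectors, k=2)
print(sorted(edges))
```

In this toy data the in-between point picks two neighbours from cluster A, but neither of them picks it back, so its one-sided links are dropped and it ends up with no edges at all.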
73
+ ## 📦 Installation
74
+
75
+ ```bash
76
+ # Basic installation
77
+ pip install tritopic
78
+
79
+ # With LLM labeling support
80
+ pip install tritopic[llm]
81
+
82
+ # Full installation (all features)
83
+ pip install tritopic[full]
84
+ ```
85
+
86
+ ### From source (development)
87
+
88
+ ```bash
89
+ git clone https://github.com/roman-egger/tritopic.git
90
+ cd tritopic
91
+ pip install -e ".[dev]"
92
+ ```
93
+
94
+ ## 🎯 Quick Start
95
+
96
+ ### Basic Usage
97
+
98
+ ```python
99
+ from tritopic import TriTopic
100
+
101
+ # Your documents
102
+ documents = [
103
+     "Machine learning is transforming healthcare diagnostics",
104
+     "Deep neural networks achieve superhuman performance",
105
+     "Climate change affects biodiversity in tropical regions",
106
+     "Renewable energy adoption accelerates globally",
107
+     # ... more documents
108
+ ]
109
+
110
+ # Fit the model
111
+ model = TriTopic(verbose=True)
112
+ topics = model.fit_transform(documents)
113
+
114
+ # View results
115
+ print(model.get_topic_info())
116
+ ```
117
+
118
+ **Example output** (from a run on 1,000 documents):
119
+ ```
120
+ 🚀 TriTopic: Fitting model on 1000 documents
121
+ Config: hybrid graph, iterative mode
122
+ → Generating embeddings (all-MiniLM-L6-v2)...
123
+ → Building lexical similarity matrix...
124
+ → Starting iterative refinement (max 5 iterations)...
125
+ Iteration 1...
126
+ Iteration 2...
127
+ ARI vs previous: 0.9234
128
+ Iteration 3...
129
+ ARI vs previous: 0.9812
130
+ ✓ Converged at iteration 3
131
+ → Extracting keywords and representative documents...
132
+
133
+ ✅ Fitting complete!
134
+ Found 12 topics
135
+ 47 outlier documents (4.7%)
136
+ ```
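The log above reports "ARI vs previous" after each iteration, and the configuration exposes `convergence_threshold=0.95`. One plausible reading, sketched below, is that refinement stops once the adjusted Rand index (ARI) between consecutive labelings reaches the threshold. The stopping rule and the helper names are assumptions; only the ARI formula itself is standard.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_a)
    if n < 2:
        return 1.0
    # Pairs of items that share a cluster in both labelings, in A, and in B.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)

def has_converged(previous, current, threshold=0.95):
    """Assumed stopping rule: ARI against the previous iteration >= threshold."""
    return adjusted_rand_index(previous, current) >= threshold

previous = [0, 0, 0, 1, 1, 1]
relabelled = [1, 1, 1, 0, 0, 0]   # same partition, different label names
shifted = [0, 0, 1, 1, 1, 1]      # one document moved to the other topic

print(adjusted_rand_index(previous, relabelled))
print(has_converged(previous, shifted))
```

Because ARI compares partitions rather than label values, renaming topics between iterations does not count as a change.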
137
+
138
+ ### Visualize Topics
139
+
140
+ ```python
141
+ # Interactive 2D map
142
+ fig = model.visualize()
143
+ fig.show()
144
+
145
+ # Topic keywords overview
146
+ fig = model.visualize_topics()
147
+ fig.show()
148
+
149
+ # Topic hierarchy
150
+ fig = model.visualize_hierarchy()
151
+ fig.show()
152
+ ```
153
+
154
+ ### With LLM-Powered Labels
155
+
156
+ ```python
157
+ from tritopic import TriTopic, LLMLabeler
158
+
159
+ model = TriTopic()
160
+ model.fit_transform(documents)
161
+
162
+ # Generate labels with Claude
163
+ labeler = LLMLabeler(
164
+     provider="anthropic",
165
+     api_key="your-api-key",
166
+     language="english"  # or "german", etc.
167
+ )
168
+ model.generate_labels(labeler)
169
+
170
+ # Now topics have human-readable names
171
+ print(model.get_topic_info())
172
+ ```
173
+
174
+ ### With Metadata
175
+
176
+ ```python
177
+ import pandas as pd
178
+ from tritopic import TriTopic
179
+
180
+ # Documents with metadata
181
+ documents = ["...", "...", ...]
182
+ metadata = pd.DataFrame({
183
+     "source": ["twitter", "news", "twitter", ...],
184
+     "date": ["2024-01-01", "2024-01-02", ...],
185
+     "location": ["Vienna", "Berlin", "Vienna", ...],
186
+ })
187
+
188
+ # Enable metadata view
189
+ model = TriTopic()
190
+ model.config.use_metadata_view = True
191
+ model.config.metadata_weight = 0.2
192
+
193
+ topics = model.fit_transform(documents, metadata=metadata)
194
+ ```
195
+
196
+ ## ⚙️ Configuration
197
+
198
+ ### Full Configuration Options
199
+
200
+ ```python
201
+ from tritopic import TriTopic, TriTopicConfig
202
+
203
+ config = TriTopicConfig(
204
+     # Embedding settings
205
+     embedding_model="all-MiniLM-L6-v2",  # or "BAAI/bge-base-en-v1.5"
206
+     embedding_batch_size=32,
207
+
208
+     # Graph construction
209
+     n_neighbors=15,
210
+     metric="cosine",
211
+     graph_type="hybrid",  # "knn", "mutual_knn", "snn", "hybrid"
212
+     snn_weight=0.5,
213
+
214
+     # Multi-view fusion weights
215
+     use_lexical_view=True,
216
+     use_metadata_view=False,
217
+     semantic_weight=0.5,
218
+     lexical_weight=0.3,
219
+     metadata_weight=0.2,
220
+
221
+     # Clustering
222
+     resolution=1.0,
223
+     n_consensus_runs=10,
224
+     min_cluster_size=5,
225
+
226
+     # Iterative refinement
227
+     use_iterative_refinement=True,
228
+     max_iterations=5,
229
+     convergence_threshold=0.95,
230
+
231
+     # Keywords
232
+     n_keywords=10,
233
+     n_representative_docs=5,
234
+     keyword_method="ctfidf",  # "ctfidf", "bm25", "keybert"
235
+
236
+     # Misc
237
+     outlier_threshold=0.1,
238
+     random_state=42,
239
+     verbose=True,
240
+ )
241
+
242
+ model = TriTopic(config=config)
243
+ ```
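The three fusion weights above sum to 1.0, but with `use_metadata_view=False` only 0.8 of that weight is active. The README does not say how this case is handled. The sketch below shows one reasonable option, a weighted average of per-view similarity matrices with the weights renormalised over the active views; `fuse_views` is a hypothetical helper, not part of the TriTopic API.

```python
def fuse_views(views, weights):
    """Weighted average of same-shaped similarity matrices.

    `views` maps a view name to a square matrix (list of lists). Weights are
    renormalised over the views actually supplied, so they always sum to 1.
    """
    total = sum(weights[name] for name in views)
    n = len(next(iter(views.values())))
    fused = [[0.0] * n for _ in range(n)]
    for name, matrix in views.items():
        w = weights[name] / total
        for i in range(n):
            for j in range(n):
                fused[i][j] += w * matrix[i][j]
    return fused

weights = {"semantic": 0.5, "lexical": 0.3, "metadata": 0.2}
views = {
    # Two documents that are close in embedding space but share no vocabulary.
    "semantic": [[1.0, 0.8], [0.8, 1.0]],
    "lexical": [[1.0, 0.0], [0.0, 1.0]],
}
fused = fuse_views(views, weights)
print(fused[0][1])
```

With the metadata view absent, the effective weights become 0.625 (semantic) and 0.375 (lexical), so the fused similarity of the two documents drops from 0.8 to about 0.5.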
244
+
245
+ ### Quick Parameter Override
246
+
247
+ ```python
248
+ # Override just what you need
249
+ model = TriTopic(
250
+     embedding_model="BAAI/bge-base-en-v1.5",
251
+     n_neighbors=20,
252
+     use_iterative_refinement=True,
253
+     verbose=True,
254
+ )
255
+ ```
256
+
257
+ ## 📊 Evaluation
258
+
259
+ ```python
260
+ # Get quality metrics
261
+ metrics = model.evaluate()
262
+ print(metrics)
263
+ # {
264
+ # 'coherence_mean': 0.423,
265
+ # 'coherence_std': 0.087,
266
+ # 'diversity': 0.891,
267
+ # 'stability': 0.934,
268
+ # 'n_topics': 12,
269
+ # 'outlier_ratio': 0.047
270
+ # }
271
+ ```
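Two of these metrics are simple enough to compute by hand. The sketch below assumes that `diversity` follows the common definition (unique keywords divided by total keywords across all topics) and that outliers carry the label `-1`. Both are assumptions, since the README does not define them.

```python
def topic_diversity(topic_keywords):
    """Share of unique words among all topics' top keywords (1.0 = no overlap)."""
    all_words = [word for words in topic_keywords for word in words]
    return len(set(all_words)) / len(all_words) if all_words else 0.0

def outlier_ratio(labels, outlier_label=-1):
    """Fraction of documents assigned to the (assumed) outlier label."""
    return sum(1 for label in labels if label == outlier_label) / len(labels)

topics = [
    ["ai", "model", "data"],
    ["climate", "energy", "data"],  # "data" is shared with the first topic
]
labels = [-1] * 47 + [0] * 953      # 47 outliers out of 1,000 documents

print(topic_diversity(topics))
print(outlier_ratio(labels))
```

The second call reproduces the `outlier_ratio` of 0.047 shown above, which matches the 47 outlier documents (4.7%) in the Quick Start log.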
272
+
273
+ ## 🔬 Advanced Usage
274
+
275
+ ### Pre-computed Embeddings
276
+
277
+ ```python
278
+ from sentence_transformers import SentenceTransformer
279
+
280
+ # Use your own embeddings
281
+ encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
282
+ embeddings = encoder.encode(documents)
283
+
284
+ model = TriTopic()
285
+ topics = model.fit_transform(documents, embeddings=embeddings)
286
+ ```
287
+
288
+ ### Find Optimal Resolution
289
+
290
+ ```python
291
+ from tritopic.core.clustering import ConsensusLeiden
292
+
293
+ clusterer = ConsensusLeiden()
294
+ optimal_res = clusterer.find_optimal_resolution(
295
+     graph=model.graph_,
296
+     resolution_range=(0.5, 2.0),
297
+     target_n_topics=15,  # Optional: target number
298
+ )
299
+ print(f"Optimal resolution: {optimal_res}")
300
+ ```
301
+
302
+ ### Transform New Documents
303
+
304
+ ```python
305
+ # After fitting
306
+ new_docs = ["New document about AI", "Another about climate"]
307
+ new_topics = model.transform(new_docs)
308
+ ```
309
+
310
+ ### Save and Load
311
+
312
+ ```python
313
+ # Save
314
+ model.save("my_topic_model.pkl")
315
+
316
+ # Load
317
+ from tritopic import TriTopic
318
+ model = TriTopic.load("my_topic_model.pkl")
319
+ ```
320
+
321
+ ## 🆚 Comparison with BERTopic
322
+
323
+ | Aspect | BERTopic | TriTopic |
324
+ |--------|----------|----------|
325
+ | Graph Construction | kNN only | Mutual kNN + SNN (hybrid) |
326
+ | Clustering | HDBSCAN (single run) | Leiden + Consensus (stable) |
327
+ | Views | Embeddings only | Semantic + Lexical + Metadata |
328
+ | Refinement | None | Iterative embedding refinement |
329
+ | Stability | Low (varies by run) | High (consensus clustering) |
330
+ | Outlier Handling | HDBSCAN built-in | Configurable threshold |
331
+
332
+ ### Benchmark Results
333
+
334
+ On the 20 Newsgroups dataset (n = 18,846 documents):
335
+
336
+ | Metric | BERTopic | TriTopic | Improvement |
337
+ |--------|----------|----------|-------------|
338
+ | Coherence (NPMI) | 0.312 | **0.387** | +24% |
339
+ | Diversity | 0.834 | **0.891** | +7% |
340
+ | Stability (ARI) | 0.721 | **0.934** | +30% |
341
+
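The "Improvement" column is the relative gain over the BERTopic score. A short check, using only the figures from the table:

```python
results = {
    "Coherence (NPMI)": (0.312, 0.387),
    "Diversity": (0.834, 0.891),
    "Stability (ARI)": (0.721, 0.934),
}

def relative_improvement(baseline, new):
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100

improvements = {
    metric: round(relative_improvement(baseline, tritopic))
    for metric, (baseline, tritopic) in results.items()
}
print(improvements)  # rounds to 24, 7 and 30 percent
```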
342
+ ## 🏗️ Architecture
343
+
344
+ ```
345
+ Documents
346
+
347
+ ├─── Embedding Engine ──────────────┐
348
+ │    (Sentence-BERT/BGE/Instructor) │
349
+ │                                   │
350
+ ├─── Lexical Matrix ────────────────┼─── Multi-View
351
+ │    (TF-IDF/BM25)                  │    Graph Builder
352
+ │                                   │          │
353
+ └─── Metadata Graph ────────────────┘          │
354
+      (Optional)                                │
355
+
356
+                                    ┌───────────────────────┐
357
+                                    │   Consensus Leiden    │
358
+                                    │   (n runs + merge)    │
359
+                                    └───────────┬───────────┘
360
+
361
+                                    ┌───────────▼───────────┐
362
+                                    │ Iterative Refinement  │
363
+                                    │   (until converged)   │
364
+                                    └───────────┬───────────┘
365
+
366
+                                    ┌───────────▼───────────┐
367
+                                    │  Keyword Extraction   │
368
+                                    │  (c-TF-IDF/KeyBERT)   │
369
+                                    └───────────┬───────────┘
370
+
371
+                                    ┌───────────▼───────────┐
372
+                                    │     LLM Labeling      │
373
+                                    │    (Claude/GPT-4)     │
374
+                                    └───────────────────────┘
375
+ ```
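The "Consensus Leiden (n runs + merge)" box does not spell out how the runs are merged. One common approach, sketched here as an assumption, is a co-association vote: count how often each pair of documents lands in the same cluster across runs, then group pairs that agree in more than half of them. The runs below are hard-coded label lists standing in for repeated Leiden runs with different seeds; the function names are hypothetical.

```python
from itertools import combinations

def co_association(runs):
    """Fraction of runs in which each pair of items shares a cluster."""
    n = len(runs[0])
    counts = {pair: 0 for pair in combinations(range(n), 2)}
    for labels in runs:
        for i, j in counts:
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return {pair: count / len(runs) for pair, count in counts.items()}

def consensus_labels(runs, threshold=0.5):
    """Link items co-clustered in more than `threshold` of the runs and
    return the connected components as labels (union-find)."""
    n = len(runs[0])
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), fraction in co_association(runs).items():
        if fraction > threshold:
            parent[find(i)] = find(j)

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

runs = [
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],  # item 2 flips to the other cluster in this run
    [1, 1, 0, 0, 0],  # same partition as the first run, labels swapped
]
print(consensus_labels(runs))
```

Item 2 is grouped with items 3 and 4 because it joins them in two of the three runs, which is the kind of run-to-run disagreement a consensus step is meant to smooth out.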
376
+
377
+ ## 📚 Citation
378
+
379
+ If you use TriTopic in your research, please cite:
380
+
381
+ ```bibtex
382
+ @software{tritopic2025,
383
+ author = {Egger, Roman},
384
+ title = {TriTopic: Tri-Modal Graph Topic Modeling with Iterative Refinement},
385
+ year = {2025},
386
+ url = {https://github.com/roman-egger/tritopic}
387
+ }
388
+ ```
389
+
390
+ ## 📄 License
391
+
392
+ MIT License - see [LICENSE](LICENSE) for details.
393
+
394
+ ## 🤝 Contributing
395
+
396
+ Contributions welcome! Please read our [Contributing Guide](CONTRIBUTING.md) first.
397
+
398
+ ---
399
+
400
+ **Made with ❤️ for the NLP community**
@@ -0,0 +1,18 @@
1
+ tritopic/__init__.py,sha256=KNtwfPUJANQtRLf-PUkoglz4u8IkHQYC8IQYPnEBf7I,1232
2
+ tritopic/core/__init__.py,sha256=vCIaW9iG-to_9Z7J4EpMFXQJnlyBuRUsDImo7rZGprk,476
3
+ tritopic/core/clustering.py,sha256=MFaBb_-6qgBdfX3iz8d0etpaSNgkVcsbSksfvqzN84I,10281
4
+ tritopic/core/embeddings.py,sha256=F0ceeD0IfpIQUVByglFqR1IahTm9EKBS2VSpRoOMv4s,6320
5
+ tritopic/core/graph_builder.py,sha256=PCRC-W_RYuiMOFfzKojGTFkU8ZTyieTXp6fy_LdF5zQ,16568
6
+ tritopic/core/keywords.py,sha256=yHMa5QF0tzD2tgj6GBXvRy9yyN3lgO-kiWNn8uQ0HG4,10861
7
+ tritopic/core/model.py,sha256=c9Fh72kNh1-fnQzxMKI6inc4VrWUw_66nIMovHrXMtg,28645
8
+ tritopic/labeling/__init__.py,sha256=cKLYRklMA4yl_7RS6KiHLrAFqyXaqyMPCVH_Wck1mmc,125
9
+ tritopic/labeling/llm_labeler.py,sha256=ZQkA0v-BWEChEGe5jkTdnC4pqjHt1UOCq9bY84zqsg4,8588
10
+ tritopic/utils/__init__.py,sha256=R4PPNkUxEBtwzsu52kRKfqHUUayhdcObL9mvIRBLhg8,238
11
+ tritopic/utils/metrics.py,sha256=Wr_L7_1TS1Eow485t-so2cLZ5ef6xrVAfVWJXZOcOiA,6938
12
+ tritopic/visualization/__init__.py,sha256=bgNdgO5c_4fXv78mPH2X-trx5hWMNiXwVGSnrMzZyUk,136
13
+ tritopic/visualization/plotter.py,sha256=cqfg8JbwUHnHDxW0FBuEVhDtJ1OIZ12bLLPVoN-aZHk,15491
14
+ tritopic-0.1.0.dist-info/licenses/LICENSE,sha256=jX__n4_wnFJ18weIv0wXDsXnDzsTvMUp94gDDuZTFKE,1068
15
+ tritopic-0.1.0.dist-info/METADATA,sha256=V8oOMWVIXoKWqV2DMWWwttTPBU6oDbGM3HFuYrgcMEo,12118
16
+ tritopic-0.1.0.dist-info/WHEEL,sha256=wUyA8OaulRlbfwMtmQsvNngGrxQHAvkKcvRmdizlJi0,92
17
+ tritopic-0.1.0.dist-info/top_level.txt,sha256=9PASbqQyi0-wa7E2Hl3Z0u1ae7MwLcfgFliFE1ioFBA,9
18
+ tritopic-0.1.0.dist-info/RECORD,,
@@ -0,0 +1,5 @@
1
+ Wheel-Version: 1.0
2
+ Generator: setuptools (80.10.2)
3
+ Root-Is-Purelib: true
4
+ Tag: py3-none-any
5
+
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 Roman Egger
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1 @@
1
+ tritopic