tomoto 0.4.0-aarch64-linux
- checksums.yaml +7 -0
- data/CHANGELOG.md +65 -0
- data/LICENSE.txt +22 -0
- data/README.md +154 -0
- data/ext/tomoto/ct.cpp +58 -0
- data/ext/tomoto/dmr.cpp +69 -0
- data/ext/tomoto/dt.cpp +91 -0
- data/ext/tomoto/extconf.rb +42 -0
- data/ext/tomoto/gdmr.cpp +42 -0
- data/ext/tomoto/hdp.cpp +47 -0
- data/ext/tomoto/hlda.cpp +71 -0
- data/ext/tomoto/hpa.cpp +32 -0
- data/ext/tomoto/lda.cpp +281 -0
- data/ext/tomoto/llda.cpp +46 -0
- data/ext/tomoto/mglda.cpp +81 -0
- data/ext/tomoto/pa.cpp +32 -0
- data/ext/tomoto/plda.cpp +33 -0
- data/ext/tomoto/slda.cpp +48 -0
- data/ext/tomoto/tomoto.cpp +48 -0
- data/ext/tomoto/utils.h +30 -0
- data/lib/tomoto/3.0/tomoto.so +0 -0
- data/lib/tomoto/3.1/tomoto.so +0 -0
- data/lib/tomoto/3.2/tomoto.so +0 -0
- data/lib/tomoto/3.3/tomoto.so +0 -0
- data/lib/tomoto/ct.rb +24 -0
- data/lib/tomoto/dmr.rb +27 -0
- data/lib/tomoto/dt.rb +15 -0
- data/lib/tomoto/gdmr.rb +15 -0
- data/lib/tomoto/hdp.rb +11 -0
- data/lib/tomoto/hlda.rb +56 -0
- data/lib/tomoto/hpa.rb +11 -0
- data/lib/tomoto/lda.rb +186 -0
- data/lib/tomoto/llda.rb +15 -0
- data/lib/tomoto/mglda.rb +15 -0
- data/lib/tomoto/pa.rb +11 -0
- data/lib/tomoto/plda.rb +15 -0
- data/lib/tomoto/slda.rb +37 -0
- data/lib/tomoto/version.rb +3 -0
- data/lib/tomoto.rb +27 -0
- data/vendor/EigenRand/EigenRand/EigenRand +24 -0
- data/vendor/EigenRand/LICENSE +21 -0
- data/vendor/EigenRand/README.md +430 -0
- data/vendor/eigen/COPYING.APACHE +203 -0
- data/vendor/eigen/COPYING.BSD +26 -0
- data/vendor/eigen/COPYING.GPL +674 -0
- data/vendor/eigen/COPYING.LGPL +502 -0
- data/vendor/eigen/COPYING.MINPACK +51 -0
- data/vendor/eigen/COPYING.MPL2 +373 -0
- data/vendor/eigen/COPYING.README +18 -0
- data/vendor/eigen/Eigen/Cholesky +45 -0
- data/vendor/eigen/Eigen/CholmodSupport +48 -0
- data/vendor/eigen/Eigen/Core +384 -0
- data/vendor/eigen/Eigen/Dense +7 -0
- data/vendor/eigen/Eigen/Eigen +2 -0
- data/vendor/eigen/Eigen/Eigenvalues +60 -0
- data/vendor/eigen/Eigen/Geometry +59 -0
- data/vendor/eigen/Eigen/Householder +29 -0
- data/vendor/eigen/Eigen/IterativeLinearSolvers +48 -0
- data/vendor/eigen/Eigen/Jacobi +32 -0
- data/vendor/eigen/Eigen/KLUSupport +41 -0
- data/vendor/eigen/Eigen/LU +47 -0
- data/vendor/eigen/Eigen/MetisSupport +35 -0
- data/vendor/eigen/Eigen/OrderingMethods +70 -0
- data/vendor/eigen/Eigen/PaStiXSupport +49 -0
- data/vendor/eigen/Eigen/PardisoSupport +35 -0
- data/vendor/eigen/Eigen/QR +50 -0
- data/vendor/eigen/Eigen/QtAlignedMalloc +39 -0
- data/vendor/eigen/Eigen/SPQRSupport +34 -0
- data/vendor/eigen/Eigen/SVD +50 -0
- data/vendor/eigen/Eigen/Sparse +34 -0
- data/vendor/eigen/Eigen/SparseCholesky +37 -0
- data/vendor/eigen/Eigen/SparseCore +69 -0
- data/vendor/eigen/Eigen/SparseLU +50 -0
- data/vendor/eigen/Eigen/SparseQR +36 -0
- data/vendor/eigen/Eigen/StdDeque +27 -0
- data/vendor/eigen/Eigen/StdList +26 -0
- data/vendor/eigen/Eigen/StdVector +27 -0
- data/vendor/eigen/Eigen/SuperLUSupport +64 -0
- data/vendor/eigen/Eigen/UmfPackSupport +40 -0
- data/vendor/eigen/README.md +5 -0
- data/vendor/eigen/bench/README.txt +55 -0
- data/vendor/eigen/bench/btl/COPYING +340 -0
- data/vendor/eigen/bench/btl/README +154 -0
- data/vendor/eigen/bench/tensors/README +20 -0
- data/vendor/eigen/blas/README.txt +6 -0
- data/vendor/eigen/ci/README.md +56 -0
- data/vendor/eigen/demos/mandelbrot/README +10 -0
- data/vendor/eigen/demos/mix_eigen_and_c/README +9 -0
- data/vendor/eigen/demos/opengl/README +13 -0
- data/vendor/eigen/unsupported/Eigen/CXX11/src/Tensor/README.md +1815 -0
- data/vendor/eigen/unsupported/README.txt +50 -0
- data/vendor/tomotopy/LICENSE +21 -0
- data/vendor/tomotopy/README.kr.rst +536 -0
- data/vendor/tomotopy/README.rst +555 -0
- data/vendor/variant/LICENSE +25 -0
- data/vendor/variant/LICENSE_1_0.txt +23 -0
- data/vendor/variant/README.md +102 -0
- metadata +141 -0
@@ -0,0 +1,555 @@
tomotopy
========

.. image:: https://badge.fury.io/py/tomotopy.svg
    :target: https://pypi.python.org/pypi/tomotopy

.. image:: https://zenodo.org/badge/186155463.svg
    :target: https://zenodo.org/badge/latestdoi/186155463

🎌
**English**,
`한국어`_.

.. _한국어: README.kr.rst

What is tomotopy?
------------------

`tomotopy` is a Python extension of `tomoto` (Topic Modeling Tool), a Gibbs-sampling based topic model library written in C++.
It utilizes the vectorization features of modern CPUs to maximize speed.
The current version of `tomoto` supports several major topic models, including:

* Latent Dirichlet Allocation (`tomotopy.LDAModel`)
* Labeled LDA (`tomotopy.LLDAModel`)
* Partially Labeled LDA (`tomotopy.PLDAModel`)
* Supervised LDA (`tomotopy.SLDAModel`)
* Dirichlet Multinomial Regression (`tomotopy.DMRModel`)
* Generalized Dirichlet Multinomial Regression (`tomotopy.GDMRModel`)
* Hierarchical Dirichlet Process (`tomotopy.HDPModel`)
* Hierarchical LDA (`tomotopy.HLDAModel`)
* Multi Grain LDA (`tomotopy.MGLDAModel`)
* Pachinko Allocation (`tomotopy.PAModel`)
* Hierarchical PA (`tomotopy.HPAModel`)
* Correlated Topic Model (`tomotopy.CTModel`)
* Dynamic Topic Model (`tomotopy.DTModel`)
* Pseudo-document based Topic Model (`tomotopy.PTModel`)

Please visit https://bab2min.github.io/tomotopy for more information.

Getting Started
---------------
You can install tomotopy easily using pip. (https://pypi.org/project/tomotopy/)
::

    $ pip install --upgrade pip
    $ pip install tomotopy

The supported OS and Python versions are:

* Linux (x86-64) with Python >= 3.6
* macOS >= 10.13 with Python >= 3.6
* Windows 7 or later (x86, x86-64) with Python >= 3.6
* Other OS with Python >= 3.6: Compilation from source code required (with a C++14 compatible compiler)

After installing, you can start tomotopy by just importing it.
::

    import tomotopy as tp
    print(tp.isa) # prints 'avx2', 'avx', 'sse2' or 'none'

Currently, tomotopy can exploit the AVX2, AVX or SSE2 SIMD instruction sets to maximize performance.
When the package is imported, it checks the available instruction sets and selects the best option.
If `tp.isa` reports `none`, training iterations may take a long time.
But since most modern Intel and AMD CPUs provide SIMD instructions, the SIMD acceleration usually gives a big improvement.

Here is sample code that runs a simple LDA training on the texts in the file 'sample.txt'.
::

    import tomotopy as tp
    mdl = tp.LDAModel(k=20)
    for line in open('sample.txt'):
        mdl.add_doc(line.strip().split())

    for i in range(0, 100, 10):
        mdl.train(10)
        print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

    for k in range(mdl.k):
        print('Top 10 words of topic #{}'.format(k))
        print(mdl.get_topic_words(k, top_n=10))

    mdl.summary()

Performance of tomotopy
-----------------------
`tomotopy` uses Collapsed Gibbs-Sampling (CGS) to infer the distribution of topics and the distribution of words.
Generally CGS converges more slowly than the Variational Bayes (VB) that `gensim's LdaModel`_ uses, but each iteration can be computed much faster.
In addition, `tomotopy` can take advantage of multicore CPUs and SIMD instruction sets, which allows even faster iterations.

.. _gensim's LdaModel: https://radimrehurek.com/gensim/models/ldamodel.html

The following chart compares the running time of an LDA model between `tomotopy` and `gensim`.
The input data consists of 1000 random documents from English Wikipedia with 1,506,966 words (about 10.1 MB).
`tomotopy` trains for 200 iterations and `gensim` trains for 10 iterations.

.. image:: https://bab2min.github.io/tomotopy/images/tmt_i5.png

Performance on Intel i5-6600, x86-64 (4 cores)

.. image:: https://bab2min.github.io/tomotopy/images/tmt_xeon.png

Performance on Intel Xeon E5-2620 v4, x86-64 (8 cores, 16 threads)

Although `tomotopy` iterated 20 times more, the overall running time was 5~10 times faster than `gensim`, and it yields a stable result.

It is difficult to compare CGS and VB directly because they are totally different techniques.
But from a practical point of view, we can compare their speed and their results.
The following chart shows the log-likelihood per word of the two models' results.

.. image:: https://bab2min.github.io/tomotopy/images/LLComp.png

The SIMD instruction set has a great effect on performance. The following is a comparison between SIMD instruction sets.

.. image:: https://bab2min.github.io/tomotopy/images/SIMDComp.png

Fortunately, most recent x86-64 CPUs provide the AVX2 instruction set, so we can enjoy the performance of AVX2.

Model Save and Load
-------------------
`tomotopy` provides `save` and `load` methods for each topic model class,
so you can save the model to a file whenever you want, and re-load it from that file later.
::

    import tomotopy as tp

    mdl = tp.HDPModel()
    for line in open('sample.txt'):
        mdl.add_doc(line.strip().split())

    for i in range(0, 100, 10):
        mdl.train(10)
        print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

    # save into file
    mdl.save('sample_hdp_model.bin')

    # load from file
    mdl = tp.HDPModel.load('sample_hdp_model.bin')
    for k in range(mdl.k):
        if not mdl.is_live_topic(k): continue
        print('Top 10 words of topic #{}'.format(k))
        print(mdl.get_topic_words(k, top_n=10))

    # the saved model is an HDP model,
    # so loading it with the LDA model class will raise an exception
    mdl = tp.LDAModel.load('sample_hdp_model.bin')

When you load a model from a file, the model type stored in the file must match the class whose `load` method you call.

See more at `tomotopy.LDAModel.save` and `tomotopy.LDAModel.load`.

Documents in the Model and out of the Model
-------------------------------------------
We can use topic models for two major purposes.
The basic one is to discover topics from a set of documents as a result of training a model,
and the more advanced one is to infer topic distributions for unseen documents using a trained model.

We call a document used for model training (the former purpose) a **document in the model**,
and a document unseen during training (the latter purpose) a **document out of the model**.

In `tomotopy`, these two kinds of document are created differently.
A **document in the model** is created by the `tomotopy.LDAModel.add_doc` method.
`add_doc` can only be called before `tomotopy.LDAModel.train` starts.
In other words, after `train` has been called, `add_doc` cannot add a document to the model because the set of documents used for training has become fixed.

To acquire the instance of a created document, use `tomotopy.LDAModel.docs` like:

::

    mdl = tp.LDAModel(k=20)
    idx = mdl.add_doc(words)
    if idx < 0: raise RuntimeError("Failed to add doc")
    doc_inst = mdl.docs[idx]
    # doc_inst is an instance of the added document

A **document out of the model** is created by the `tomotopy.LDAModel.make_doc` method. `make_doc` can only be called after `train` starts.
If you use `make_doc` before the set of documents used for training has become fixed, you may get wrong results.
Since `make_doc` returns the document instance directly, you can use its return value for other operations.

::

    mdl = tp.LDAModel(k=20)
    # add_doc ...
    mdl.train(100)
    doc_inst = mdl.make_doc(unseen_doc) # doc_inst is an instance of the unseen document

Inference for Unseen Documents
------------------------------
If a new document is created by `tomotopy.LDAModel.make_doc`, its topic distribution can be inferred by the model.
Inference for unseen documents is performed using the `tomotopy.LDAModel.infer` method.

::

    mdl = tp.LDAModel(k=20)
    # add_doc ...
    mdl.train(100)
    doc_inst = mdl.make_doc(unseen_doc)
    topic_dist, ll = mdl.infer(doc_inst)
    print("Topic Distribution for Unseen Docs: ", topic_dist)
    print("Log-likelihood of inference: ", ll)

The `infer` method accepts either a single instance of `tomotopy.Document` or a `list` of instances.
See more at `tomotopy.LDAModel.infer`.

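As a minimal sketch of the list form mentioned above: several unseen documents can be wrapped with `make_doc` and passed to `infer` in one call. The sample word lists and the variable names `unseen_docs`, `doc_insts` are made up for illustration, and the sketch assumes the batch call returns one topic distribution and one log-likelihood per input document.

::

    # hypothetical batch inference over several unseen documents
    unseen_docs = [
        "church religion faith".split(),
        "football match goal".split(),
    ]
    doc_insts = [mdl.make_doc(words) for words in unseen_docs]
    # assumption: results are returned per document, in input order
    topic_dists, lls = mdl.infer(doc_insts)
    for dist, ll in zip(topic_dists, lls):
        print(dist, ll)
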
Corpus and transform
--------------------
Every topic model in `tomotopy` has its own internal document type.
A document suitable for each model can be created and added through that model's `add_doc` method.
However, adding the same list of documents to several different models becomes quite inconvenient,
because `add_doc` has to be called on each model for the same list of documents.
Thus, `tomotopy` provides the `tomotopy.utils.Corpus` class, which holds a list of documents.
A `tomotopy.utils.Corpus` can be inserted into any model by passing it as the `corpus` argument of `__init__` or of each model's `add_corpus` method.
Inserting a `tomotopy.utils.Corpus` has the same effect as inserting each document the corpus holds.

Some topic models require different data for their documents.
For example, `tomotopy.DMRModel` requires the argument `metadata` of type `str`,
but `tomotopy.PLDAModel` requires the argument `labels` of type `List[str]`.
Since `tomotopy.utils.Corpus` holds an independent set of documents rather than being tied to a specific topic model,
the data attached to a corpus may not match what a topic model requires when the corpus is added to it.
In this case, the miscellaneous data can be transformed to fit the target topic model using the `transform` argument.
See more details in the following code:

::

    from tomotopy import DMRModel
    from tomotopy.utils import Corpus

    corpus = Corpus()
    corpus.add_doc("a b c d e".split(), a_data=1)
    corpus.add_doc("e f g h i".split(), a_data=2)
    corpus.add_doc("i j k l m".split(), a_data=3)

    model = DMRModel(k=10)
    model.add_corpus(corpus)
    # You lose the `a_data` field in `corpus`,
    # and the `metadata` that `DMRModel` requires is filled with the default value, an empty str.

    assert model.docs[0].metadata == ''
    assert model.docs[1].metadata == ''
    assert model.docs[2].metadata == ''

    def transform_a_data_to_metadata(misc: dict):
        return {'metadata': str(misc['a_data'])}
    # this function transforms `a_data` to `metadata`

    model = DMRModel(k=10)
    model.add_corpus(corpus, transform=transform_a_data_to_metadata)
    # Now the docs in `model` have non-default `metadata`, generated from the `a_data` field.

    assert model.docs[0].metadata == '1'
    assert model.docs[1].metadata == '2'
    assert model.docs[2].metadata == '3'

Parallel Sampling Algorithms
----------------------------
Since version 0.5.0, `tomotopy` allows you to choose a parallelism algorithm.
The algorithm provided in versions prior to 0.4.2 is `COPY_MERGE`, which is available for all topic models.
The new algorithm `PARTITION`, available since 0.5.0, makes training generally faster and more memory-efficient, but it is not available for all topic models.

The following charts show the speed difference between the two algorithms depending on the number of topics and the number of workers.

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp.png

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp2.png

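As a minimal sketch, the algorithm can be selected through the `parallel` argument of `train` (and `infer`), described in the 0.5.0 entry of the History section below; `tomotopy.ParallelScheme.PARTITION` also comes from that entry. The document-adding step is omitted, and the comment about `workers=0` meaning "use all available cores" is an assumption rather than something stated in this README.

::

    import tomotopy as tp

    mdl = tp.LDAModel(k=20)
    # add documents here ...

    # workers=0 is assumed to use all available cores;
    # request the PARTITION scheme explicitly instead of the default choice
    mdl.train(200, workers=0, parallel=tp.ParallelScheme.PARTITION)
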
Performance by Version
----------------------
Performance changes by version are shown in the following graphs.
The time taken to train the LDA model for 1000 iterations was measured.
(Docs: 11314, Vocab: 60382, Words: 2364724, Intel Xeon Gold 5120 @2.2GHz)

.. image:: https://bab2min.github.io/tomotopy/images/lda-perf-t1.png

.. image:: https://bab2min.github.io/tomotopy/images/lda-perf-t4.png

.. image:: https://bab2min.github.io/tomotopy/images/lda-perf-t8.png

Pinning Topics using Word Priors
--------------------------------
Since version 0.6.0, a new method `tomotopy.LDAModel.set_word_prior` has been added. It allows you to control the word prior for each topic.
For example, we can set the weight of the word 'church' to 1.0 in topic 0, and its weight to 0.1 in the rest of the topics, with the following code.
This means that the probability of the word 'church' being assigned to topic 0 is 10 times higher than the probability of it being assigned to any other topic.
Therefore, most occurrences of 'church' are assigned to topic 0, so topic 0 contains many words related to 'church'.
This allows you to pin particular topics to specific topic numbers.

::

    import tomotopy as tp
    mdl = tp.LDAModel(k=20)

    # add documents into `mdl`

    # setting word prior
    mdl.set_word_prior('church', [1.0 if k == 0 else 0.1 for k in range(20)])

See `word_prior_example` in `example.py` for more details.

Examples
--------
You can find example Python code for tomotopy at https://github.com/bab2min/tomotopy/blob/main/examples/ .

You can also get the data file used in the example code at https://drive.google.com/file/d/18OpNijd4iwPyYZ2O7pQoPyeTAKEXa71J/view .

License
---------
`tomotopy` is licensed under the terms of the MIT License,
meaning you can use it for any reasonable purpose and remain in complete ownership of all the documentation you produce.

History
-------
* 0.12.7 (2023-12-19)
    * New features
        * Added the Topic Model Viewer `tomotopy.viewer.open_viewer()`
        * Optimized the performance of `tomotopy.utils.Corpus.process()`
    * Bug fixes
        * `Document.span` now returns ranges in character units, not in byte units.

* 0.12.6 (2023-12-11)
    * New features
        * Added some convenience features to `tomotopy.LDAModel.train` and `tomotopy.LDAModel.set_word_prior`.
        * `LDAModel.train` now has new arguments `callback`, `callback_interval` and `show_progress` to monitor the training progress.
        * `LDAModel.set_word_prior` now accepts a `Dict[int, float]` as its `prior` argument.

* 0.12.5 (2023-08-03)
    * New features
        * Added support for the Linux ARM64 architecture.

* 0.12.4 (2023-01-22)
    * New features
        * Added support for the macOS ARM64 architecture.
    * Bug fixes
        * Fixed an issue where `tomotopy.Document.get_sub_topic_dist()` raised a bad argument exception.
        * Fixed an issue where raising an exception sometimes caused crashes.

* 0.12.3 (2022-07-19)
    * New features
        * Now, inserting an empty document using `tomotopy.LDAModel.add_doc()` just ignores it instead of raising an exception. If the newly added argument `ignore_empty_words` is set to False, an exception is raised as before.
        * The `tomotopy.HDPModel.purge_dead_topics()` method was added to remove non-live topics from the model.
    * Bug fixes
        * Fixed an issue that prevented setting user-defined values for `nuSq` in `tomotopy.SLDAModel` (by @jucendrero).
        * Fixed an issue where `tomotopy.utils.Coherence` did not work for `tomotopy.DTModel`.
        * Fixed an issue that often crashed when calling `make_dic()` before calling `train()`.
        * Resolved the problem that the results of `tomotopy.DMRModel` and `tomotopy.GDMRModel` differed even when the seed was fixed.
        * The parameter optimization process of `tomotopy.DMRModel` and `tomotopy.GDMRModel` has been improved.
        * Fixed an issue that sometimes crashed when calling `tomotopy.PTModel.copy()`.

* 0.12.2 (2021-09-06)
    * An issue where calling `convert_to_lda` of `tomotopy.HDPModel` with `min_cf > 0`, `min_df > 0` or `rm_top > 0` caused a crash has been fixed.
    * A new argument `from_pseudo_doc` was added to `tomotopy.Document.get_topics` and `tomotopy.Document.get_topic_dist`.
      This argument is only valid for documents of `PTModel`; it enables controlling the source used for computing the topic distribution.
    * The default value of the argument `p` of `tomotopy.PTModel` has been changed. The new default value is `k * 10`.
    * Using documents generated by `make_doc` without calling `infer` no longer causes a crash, but just prints warning messages.
    * An issue where the internal C++ code did not compile in a clang C++17 environment has been fixed.

* 0.12.1 (2021-06-20)
    * An issue where `tomotopy.LDAModel.set_word_prior()` caused a crash has been fixed.
    * Now `tomotopy.LDAModel.perplexity` and `tomotopy.LDAModel.ll_per_word` return accurate values when `TermWeight` is not `ONE`.
    * `tomotopy.LDAModel.used_vocab_weighted_freq` was added, which returns term-weighted frequencies of words.
    * Now `tomotopy.LDAModel.summary()` shows not only the entropy of words, but also the entropy of term-weighted words.

* 0.12.0 (2021-04-26)
    * Now `tomotopy.DMRModel` and `tomotopy.GDMRModel` support multiple values of metadata (see https://github.com/bab2min/tomotopy/blob/main/examples/dmr_multi_label.py )
    * The performance of `tomotopy.GDMRModel` was improved.
    * A `copy()` method has been added to all topic models to perform a deep copy.
    * An issue was fixed where words that are excluded from training (by `min_cf`, `min_df`) had incorrect topic ids. Now all excluded words have `-1` as their topic id.
    * Now all exceptions and warnings generated by `tomotopy` follow standard Python types.
    * Compiler requirements have been raised to C++14.

* 0.11.1 (2021-03-28)
    * A critical bug with asymmetric alphas was fixed. Due to this bug, version 0.11.0 has been removed from releases.

* 0.11.0 (2021-03-26) (removed)
    * A new topic model for short texts, `tomotopy.PTModel`, was added to the package.
    * An issue was fixed where `tomotopy.HDPModel.infer` sometimes caused a segmentation fault.
    * A mismatch of the NumPy API version was fixed.
    * Now asymmetric document-topic priors are supported.
    * Serializing topic models to `bytes` in memory is supported.
    * An argument `normalize` was added to `get_topic_dist()`, `get_topic_word_dist()` and `get_sub_topic_dist()` for controlling normalization of results.
    * Now `tomotopy.DMRModel.lambdas` and `tomotopy.DMRModel.alpha` give correct values.
    * Categorical metadata support for `tomotopy.GDMRModel` was added (see https://github.com/bab2min/tomotopy/blob/main/examples/gdmr_both_categorical_and_numerical.py ).
    * Python 3.5 support was dropped.

* 0.10.2 (2021-02-16)
    * An issue was fixed where `tomotopy.CTModel.train` fails with large K.
    * An issue was fixed where `tomotopy.utils.Corpus` loses its `uid` values.

* 0.10.1 (2021-02-14)
    * An issue was fixed where `tomotopy.utils.Corpus.extract_ngrams` crashes with empty input.
    * An issue was fixed where `tomotopy.LDAModel.infer` raises an exception with valid input.
    * An issue was fixed where `tomotopy.HLDAModel.infer` generates a wrong `tomotopy.Document.path`.
    * A new parameter `freeze_topics` was added to `tomotopy.HLDAModel.train`, so you can control whether new topics are created during training.

* 0.10.0 (2020-12-19)
    * The interfaces of `tomotopy.utils.Corpus` and of `tomotopy.LDAModel.docs` were unified. Now you can access the documents in a corpus in the same manner.
    * `__getitem__` of `tomotopy.utils.Corpus` was improved. Indexing not only by int but also by Iterable[int] and by slicing is supported. Indexing by uid is also supported.
    * New methods `tomotopy.utils.Corpus.extract_ngrams` and `tomotopy.utils.Corpus.concat_ngrams` were added. They extract n-gram collocations using PMI and concatenate them into single words.
    * A new method `tomotopy.LDAModel.add_corpus` was added, and `tomotopy.LDAModel.infer` can receive a corpus as input.
    * A new module `tomotopy.coherence` was added. It provides a way to calculate the coherence of the model.
    * A parameter `window_size` was added to `tomotopy.label.FoRelevance`.
    * An issue was fixed where NaN often occurs when training `tomotopy.HDPModel`.
    * Now Python 3.9 is supported.
    * The dependency on py-cpuinfo was removed and the initialization of the module was improved.

* 0.9.1 (2020-08-08)
    * Memory leaks of version 0.9.0 were fixed.
    * `tomotopy.CTModel.summary()` was fixed.

* 0.9.0 (2020-08-04)
    * The `tomotopy.LDAModel.summary()` method, which prints a human-readable summary of the model, has been added.
    * The random number generator of the package has been replaced with `EigenRand`_. It speeds up random number generation and solves the result differences between platforms.
    * Due to the above, even if `seed` is the same, the model training result may differ from versions before 0.9.0.
    * Fixed a training error in `tomotopy.HDPModel`.
    * `tomotopy.DMRModel.alpha` now shows the Dirichlet prior of the per-document topic distribution by metadata.
    * `tomotopy.DTModel.get_count_by_topics()` has been modified to return a 2-dimensional `ndarray`.
    * `tomotopy.DTModel.alpha` has been modified to return the same value as `tomotopy.DTModel.get_alpha()`.
    * Fixed an issue where the `metadata` value could not be obtained for documents of `tomotopy.GDMRModel`.
    * `tomotopy.HLDAModel.alpha` now shows the Dirichlet prior of the per-document depth distribution.
    * `tomotopy.LDAModel.global_step` has been added.
    * `tomotopy.MGLDAModel.get_count_by_topics()` now returns the word count for both global and local topics.
    * `tomotopy.PAModel.alpha`, `tomotopy.PAModel.subalpha`, and `tomotopy.PAModel.get_count_by_super_topic()` have been added.

.. _EigenRand: https://github.com/bab2min/EigenRand

* 0.8.2 (2020-07-14)
    * New properties `tomotopy.DTModel.num_timepoints` and `tomotopy.DTModel.num_docs_by_timepoint` have been added.
    * A bug which caused different results on different platforms even with the same `seed` was partially fixed.
      As a result of this fix, `tomotopy` on 32 bit now yields different training results from earlier versions.

* 0.8.1 (2020-06-08)
    * A bug where `tomotopy.LDAModel.used_vocabs` returned an incorrect value was fixed.
    * Now `tomotopy.CTModel.prior_cov` returns a covariance matrix with shape `[k, k]`.
    * Now `tomotopy.CTModel.get_correlations` with empty arguments returns a correlation matrix with shape `[k, k]`.

* 0.8.0 (2020-06-06)
    * Since NumPy was introduced in tomotopy, many methods and properties of tomotopy now return `numpy.ndarray` instead of just `list`.
    * Tomotopy has a new dependency, `NumPy >= 1.10.0`.
    * A wrong estimation of `tomotopy.HDPModel.infer` was fixed.
    * A new method for converting an HDPModel to an LDAModel was added.
    * New properties including `tomotopy.LDAModel.used_vocabs`, `tomotopy.LDAModel.used_vocab_freq` and `tomotopy.LDAModel.used_vocab_df` were added to topic models.
    * A new g-DMR topic model (`tomotopy.GDMRModel`) was added.
    * An error when initializing `tomotopy.label.FoRelevance` on macOS was fixed.
    * An error that occurred when using a `tomotopy.utils.Corpus` created without the `raw` parameter was fixed.

* 0.7.1 (2020-05-08)
    * `tomotopy.Document.path` was added for `tomotopy.HLDAModel`.
    * A memory corruption bug in `tomotopy.label.PMIExtractor` was fixed.
    * A compile error on gcc 7 was fixed.

* 0.7.0 (2020-04-18)
    * `tomotopy.DTModel` was added to the package.
    * A bug in `tomotopy.utils.Corpus.save` was fixed.
    * A new method `tomotopy.Document.get_count_vector` was added to the Document class.
    * Now Linux distributions use manylinux2010 and an additional optimization is applied.

* 0.6.2 (2020-03-28)
    * A critical bug related to `save` and `load` was fixed. Versions 0.6.0 and 0.6.1 have been removed from releases.

* 0.6.1 (2020-03-22) (removed)
    * A bug related to module loading was fixed.

* 0.6.0 (2020-03-22) (removed)
    * The `tomotopy.utils.Corpus` class, which manages multiple documents easily, was added.
    * The `tomotopy.LDAModel.set_word_prior` method, which controls word-topic priors of topic models, was added.
    * A new argument `min_df`, which filters words based on document frequency, was added to every topic model's `__init__`.
    * `tomotopy.label`, a submodule for topic labeling, was added. Currently, only `tomotopy.label.FoRelevance` is provided.

* 0.5.2 (2020-03-01)
    * A segmentation fault problem was fixed in `tomotopy.LLDAModel.add_doc`.
    * A bug was fixed where `infer` of `tomotopy.HDPModel` sometimes crashed the program.
    * A crash issue of `tomotopy.LDAModel.infer` with ps=tomotopy.ParallelScheme.PARTITION, together=True was fixed.

* 0.5.1 (2020-01-11)
    * A bug was fixed where `tomotopy.SLDAModel.make_doc` didn't support missing values for `y`.
    * Now `tomotopy.SLDAModel` fully supports missing values for the response variables `y`. Documents with missing values (NaN) are included in topic modeling, but excluded from the regression of response variables.

* 0.5.0 (2019-12-30)
    * Now `tomotopy.PAModel.infer` returns both the topic distribution and the sub-topic distribution.
    * New methods `get_sub_topics` and `get_sub_topic_dist` were added to `tomotopy.Document` (for PAModel).
    * A new parameter `parallel` was added to the `tomotopy.LDAModel.train` and `tomotopy.LDAModel.infer` methods. You can select the parallelism algorithm by changing this parameter.
    * `tomotopy.ParallelScheme.PARTITION`, a new algorithm, was added. It works efficiently when the number of workers is large, or when the number of topics or the size of the vocabulary is big.
    * A bug where `rm_top` didn't work when `min_cf` < 2 was fixed.

* 0.4.2 (2019-11-30)
    * Wrong topic assignments of `tomotopy.LLDAModel` and `tomotopy.PLDAModel` were fixed.
    * A readable `__repr__` of `tomotopy.Document` and `tomotopy.Dictionary` was implemented.

* 0.4.1 (2019-11-27)
    * A bug in the init function of `tomotopy.PLDAModel` was fixed.

* 0.4.0 (2019-11-18)
    * New models including `tomotopy.PLDAModel` and `tomotopy.HLDAModel` were added to the package.

* 0.3.1 (2019-11-05)
    * An issue where `get_topic_dist()` returns incorrect values when `min_cf` or `rm_top` is set was fixed.
    * The return value of `get_topic_dist()` of `tomotopy.MGLDAModel` documents was fixed to include local topics.
    * The estimation speed with `tw=ONE` was improved.

* 0.3.0 (2019-10-06)
    * A new model, `tomotopy.LLDAModel`, was added to the package.
    * A crashing issue of `HDPModel` was fixed.
    * Since hyperparameter estimation for `HDPModel` was implemented, the results of `HDPModel` may differ from those of previous versions.
      If you want to turn off hyperparameter estimation of HDPModel, set `optim_interval` to zero.

* 0.2.0 (2019-08-18)
    * New models including `tomotopy.CTModel` and `tomotopy.SLDAModel` were added to the package.
    * A new parameter option `rm_top` was added for all topic models.
    * Problems in the `save` and `load` methods for `PAModel` and `HPAModel` were fixed.
    * An occasional crash when loading `HDPModel` was fixed.
    * The problem that `ll_per_word` was calculated incorrectly when `min_cf` > 0 was fixed.

* 0.1.6 (2019-08-09)
    * Compilation errors with clang in the macOS environment were fixed.

* 0.1.4 (2019-08-05)
    * The issue when `add_doc` receives an empty list as input was fixed.
    * The issue that `tomotopy.PAModel.get_topic_words` doesn't extract the word distribution of sub-topics was fixed.

* 0.1.3 (2019-05-19)
    * The parameter `min_cf` and its stopword-removing function were added to all topic models.

* 0.1.0 (2019-05-12)
    * First version of **tomotopy**

Bindings for Other Languages
------------------------------
* Ruby: https://github.com/ankane/tomoto

Bundled Libraries and Their Licenses
------------------------------------
* Eigen:
  This application uses the MPL2-licensed features of Eigen, a C++ template library for linear algebra.
  A copy of the MPL2 license is available at https://www.mozilla.org/en-US/MPL/2.0/.
  The source code of the Eigen library can be obtained at http://eigen.tuxfamily.org/.

* EigenRand: `MIT License
  <licenses_bundled/EigenRand>`_

* Mapbox Variant: `BSD License
  <licenses_bundled/MapboxVariant>`_

Citation
---------
::

    @software{minchul_lee_2022_6868418,
      author       = {Minchul Lee},
      title        = {bab2min/tomotopy: 0.12.3},
      month        = jul,
      year         = 2022,
      publisher    = {Zenodo},
      version      = {v0.12.3},
      doi          = {10.5281/zenodo.6868418},
      url          = {https://doi.org/10.5281/zenodo.6868418}
    }

@@ -0,0 +1,25 @@
Copyright (c) MapBox
All rights reserved.

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

- Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this
  list of conditions and the following disclaimer in the documentation and/or
  other materials provided with the distribution.
- Neither the name "MapBox" nor the names of its contributors may be
  used to endorse or promote products derived from this software without
  specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

@@ -0,0 +1,23 @@
Boost Software License - Version 1.0 - August 17th, 2003

Permission is hereby granted, free of charge, to any person or organization
obtaining a copy of the software and accompanying documentation covered by
this license (the "Software") to use, reproduce, display, distribute,
execute, and transmit the Software, and to prepare derivative works of the
Software, and to permit third-parties to whom the Software is furnished to
do so, all subject to the following:

The copyright notices in the Software and this entire statement, including
the above license grant, this restriction and the following disclaimer,
must be included in all copies of the Software, in whole or in part, and
all derivative works of the Software, unless such copies or derivative
works are solely in the form of machine-executable object code generated by
a source language processor.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT
SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE
FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.