lda-ruby 0.3.8 → 0.3.9
- checksums.yaml +15 -0
- data/{CHANGELOG → CHANGELOG.md} +12 -0
- data/{README.markdown → README.md} +2 -2
- data/VERSION.yml +3 -3
- data/lda-ruby.gemspec +0 -0
- data/lib/lda-ruby/corpus/text_corpus.rb +14 -9
- data/lib/lda-ruby/document/document.rb +0 -2
- metadata +38 -61
- data/README +0 -21
checksums.yaml
ADDED
@@ -0,0 +1,15 @@
+---
+!binary "U0hBMQ==":
+  metadata.gz: !binary |-
+    M2IzN2IwMmE1YWFhNTJjYjk0ZjMyYzhjMWQyZDdjZjY2YTA0OGNjZA==
+  data.tar.gz: !binary |-
+    OTNjZjE5MGNmOGI2YzY3YzhlNDRiYTBlNDM5NmUwYmY4Mjc2ZmNkNQ==
+SHA512:
+  metadata.gz: !binary |-
+    MDAxZmQ1N2U5MGQ1MDA2MGE4YTU0N2NkN2Y4N2Q4YzlmNTdmYTkxN2FlYzU0
+    ZmFjZDZiZWIwZmRiMDJlYjNjYTg5YWQ0N2RlOTY1MTg0YzZjZTc0NGQ0YmI1
+    ZDljMjljNTA3ODdlZjBjNjNjMTc0MGRlMjBhODQzZTg3YWM5OWE=
+  data.tar.gz: !binary |-
+    OGFiMzVhMzY4OTFmZTkzMmM3YjQxMzNkM2JlMjVjOTRjM2ZhNTc1YmUxZTRj
+    Y2M3YzZiNzNiYmVkNDI1ZTUxODhkMjhlYTQ4MmRhZjVkZjQxMDhjOTk5Njc4
+    OTZhOTM5ZTQ3OTFhY2U1YjRhODQyYzBkZWNlYzhjNzBjMWFlNGU=
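The `!binary` values in checksums.yaml are Base64-encoded strings. As a minimal sketch (not part of the gem), the snippet below decodes the first key and the two `metadata.gz` entries copied from the hunk above, using Ruby's core `String#unpack1`. The variable names are illustrative only.

```ruby
# Decode the Base64 payloads copied from checksums.yaml above.
# 'm' is the Base64 directive of String#unpack1 (core Ruby, 2.4+).
algorithm_key = "U0hBMQ==".unpack1('m')

sha1_metadata = "M2IzN2IwMmE1YWFhNTJjYjk0ZjMyYzhjMWQyZDdjZjY2YTA0OGNjZA==".unpack1('m')

# The SHA512 value is wrapped over three lines in the YAML block scalar;
# join the pieces before decoding.
sha512_metadata = %w[
  MDAxZmQ1N2U5MGQ1MDA2MGE4YTU0N2NkN2Y4N2Q4YzlmNTdmYTkxN2FlYzU0
  ZmFjZDZiZWIwZmRiMDJlYjNjYTg5YWQ0N2RlOTY1MTg0YzZjZTc0NGQ0YmI1
  ZDljMjljNTA3ODdlZjBjNjNjMTc0MGRlMjBhODQzZTg3YWM5OWE=
].join.unpack1('m')

puts algorithm_key           # name of the digest algorithm
puts sha1_metadata.length    # a SHA1 hex digest has 40 characters
puts sha512_metadata.length  # a SHA512 hex digest has 128 characters
```

Each decoded value is a hex digest of the corresponding archive member, so it can be compared against the output of `Digest::SHA1.hexdigest` or `Digest::SHA512.hexdigest` on the file contents.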
data/{CHANGELOG → CHANGELOG.md}
RENAMED
@@ -1,3 +1,15 @@
+version 0.3.9
+=============
+
+- merge pull request from @rishabh-tripathi allowing text corpus objects to also be built with an array of strings
+- couple minor code refinements
+
+version 0.3.8
+=============
+
+- tokenization changes to support German (courtesy of @LeFnord)
+- user defined stop word list (also via @LeFnord)
+
 version 0.3.7
 =============
 
data/{README.markdown → README.md}
RENAMED
@@ -13,7 +13,7 @@ The original C code relied on files for the input and output. We felt it was nec
 lda = Lda::Lda.new(corpus) # create an Lda object for training
 lda.em("random") # run EM algorithm using random starting points
 lda.load_vocabulary("data/vocab.txt")
-lda.print_topics(20) # print
+lda.print_topics(20) # print all topics with up to 20 words per topic
 
 If you have general questions about Latent Dirichlet Allocation, I urge you to use the [topic models mailing list][topic-models], since the people who monitor that are very knowledgeable. If you encounter bugs specific to lda-ruby, please post an issue on the Github project.
 
@@ -34,4 +34,4 @@ Blei, David M., Ng, Andrew Y., and Jordan, Michael I. 2003. Latent dirichlet all
 [wikipedia]: http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
 [ap-data]: http://www.cs.princeton.edu/~blei/lda-c/ap.tgz
 [pdf]: http://www.cs.princeton.edu/picasso/mats/BleiNgJordan2003_blei.pdf
-[topic-models]: https://lists.cs.princeton.edu/mailman/listinfo/topic-models
+[topic-models]: https://lists.cs.princeton.edu/mailman/listinfo/topic-models
data/VERSION.yml
CHANGED
data/lda-ruby.gemspec
CHANGED
Binary file
data/lib/lda-ruby/corpus/text_corpus.rb
CHANGED
@@ -2,18 +2,23 @@ module Lda
   class TextCorpus < Corpus
     attr_reader :filename
 
-    #
-    def initialize(
+    # Loads text documents from a YAML file or an array of strings
+    def initialize(input_data)
       super()
 
-
-
-
-
-
+      docs = if input_data.is_a?(String) && File.exists?(input_data)
+        # yaml file containing an array of strings representing each document
+        YAML.load_file(input_data)
+      elsif input_data.is_a?(Array)
+        # an array of strings representing each document
+        input_data.dup
+      elsif input_data.is_a?(String)
+        # a single string representing one document
+        [input_data]
+      else
+        raise "Unknown input type: please pass in a valid filename or an array of strings."
+      end
 
-    def load_from_file
-      docs = YAML.load_file(@filename)
       docs.each do |doc|
         add_document(TextDocument.new(self, doc))
       end
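The branch order in the new `initialize` can be tried outside the gem. The sketch below is a stand-alone approximation, not the gem's code: `normalize_corpus_input` is a hypothetical helper name, and it calls `File.exist?` rather than the `File.exists?` alias used in the hunk, because that alias was removed in Ruby 3.2.

```ruby
require 'yaml'
require 'tempfile'

# Hypothetical helper mirroring the branch order of TextCorpus#initialize above.
# Returns an array of document strings for each supported input type.
def normalize_corpus_input(input_data)
  if input_data.is_a?(String) && File.exist?(input_data)
    # path to a YAML file containing an array of document strings
    YAML.load_file(input_data)
  elsif input_data.is_a?(Array)
    # array of document strings; dup so the caller's array is not shared
    input_data.dup
  elsif input_data.is_a?(String)
    # any other string is treated as a single document
    [input_data]
  else
    raise "Unknown input type: please pass in a valid filename or an array of strings."
  end
end

from_array  = normalize_corpus_input(["first document", "second document"])
from_string = normalize_corpus_input("a single document")

# Write a small YAML file to exercise the filename branch.
from_file = Tempfile.create(["docs", ".yml"]) do |file|
  file.write(["doc one", "doc two"].to_yaml)
  file.flush
  normalize_corpus_input(file.path)
end

p from_array
p from_string
p from_file
```

One consequence of this ordering is that a string naming a file that does not exist is silently treated as a one-document corpus instead of raising an error.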
data/lib/lda-ruby/document/document.rb
CHANGED
@@ -32,9 +32,7 @@ module Lda
     end
 
     def tokenize(text)
-      # now respects Umlaute
       clean_text = text.gsub(/[^a-zäöüß'-]+/i, ' ').gsub(/\s+/, ' ').downcase # remove everything but letters and ' and leave only single spaces
-      # clean_text = text.gsub(/[^A-Za-z'\s]+/, ' ').gsub(/\s+/, ' ').downcase # remove everything but letters and ' and leave only single spaces
       @tokens = handle(clean_text.split(' '))
       nil
     end
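To see what the retained cleaning expression does, the sketch below applies the same two `gsub` calls and `downcase` to a short German sample. `clean_and_split` is a hypothetical name, and the stop-word filtering done by the gem's `handle` method is left out. Note that the character class keeps hyphens as well as apostrophes, which the inline comment does not mention.

```ruby
# Hypothetical stand-alone version of the cleaning step in tokenize above.
# Keeps ASCII letters, German umlauts, ß, apostrophes and hyphens;
# every other run of characters collapses to a single space.
def clean_and_split(text)
  clean_text = text.gsub(/[^a-zäöüß'-]+/i, ' ').gsub(/\s+/, ' ').downcase
  clean_text.split(' ')
end

tokens = clean_and_split("Die Größe der Straße: 42 km, don't-stop!")
p tokens
```

Digits and punctuation disappear, while `don't-stop` survives as one token because both `'` and `-` are inside the character class.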
metadata
CHANGED
@@ -1,52 +1,41 @@
---- !ruby/object:Gem::Specification
+--- !ruby/object:Gem::Specification
 name: lda-ruby
-version: !ruby/object:Gem::Version
-
-  prerelease:
-  segments:
-  - 0
-  - 3
-  - 8
-  version: 0.3.8
+version: !ruby/object:Gem::Version
+  version: 0.3.9
 platform: ruby
-authors:
+authors:
 - David Blei
 - Jason Adams
 - Rio Akasaka
 autorequire:
 bindir: bin
 cert_chain: []
-
-
-
-dependencies:
-- !ruby/object:Gem::Dependency
+date: 2015-02-11 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
   name: shoulda
-
-
-
-
-
-      - !ruby/object:Gem::Version
-        hash: 3
-        segments:
-        - 0
-        version: "0"
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
   type: :runtime
-
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
 description: Ruby port of Latent Dirichlet Allocation by David M. Blei. See http://www.cs.princeton.edu/~blei/lda-c/.
 email: jasonmadams@gmail.com
 executables: []
-
-extensions:
+extensions:
 - ext/lda-ruby/extconf.rb
-extra_rdoc_files:
-- README
-
-
--
-- README
-- README.markdown
+extra_rdoc_files:
+- README.md
+files:
+- CHANGELOG.md
+- README.md
 - Rakefile
 - VERSION.yml
 - ext/lda-ruby/Makefile
@@ -84,40 +73,28 @@ files:
 - test/simple_test.rb
 - test/simple_yaml.rb
 - test/test_helper.rb
-has_rdoc: true
 homepage: http://github.com/ealdent/lda-ruby
 licenses: []
-
+metadata: {}
 post_install_message:
 rdoc_options: []
-
-require_paths:
+require_paths:
 - lib
 - ext
-required_ruby_version: !ruby/object:Gem::Requirement
-
-
-
-
-
-
-
-
-
-  none: false
-  requirements:
-  - - ">="
-    - !ruby/object:Gem::Version
-      hash: 3
-      segments:
-      - 0
-      version: "0"
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
 requirements: []
-
 rubyforge_project:
-rubygems_version:
+rubygems_version: 2.4.5
 signing_key:
-specification_version:
+specification_version: 4
 summary: Ruby port of Latent Dirichlet Allocation by David M. Blei.
 test_files: []
-
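The `>= 0` requirements and the version bump recorded in the metadata above can be inspected with the RubyGems classes that ship with Ruby. This is only an illustrative sketch using `Gem::Version` and `Gem::Requirement`. The variable names are not taken from the gem.

```ruby
require 'rubygems'

old_version = Gem::Version.new('0.3.8')
new_version = Gem::Version.new('0.3.9')

# '>= 0' is the open-ended constraint recorded for shoulda,
# required_ruby_version and required_rubygems_version.
open_requirement = Gem::Requirement.new('>= 0')

puts new_version > old_version                     # version ordering
puts open_requirement.satisfied_by?(new_version)   # any release satisfies '>= 0'
p new_version.segments                             # numeric parts of the version string
```

The older metadata format stored the `segments` array explicitly, while the newer format keeps only the version string and derives the segments on demand.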
data/README
DELETED
@@ -1,21 +0,0 @@
-Latent Dirichlet Allocation – Ruby Wrapper
-
-This wrapper is based on C-code by David M. Blei. In a nutshell, it can be used to automatically cluster documents into topics. The number of topics are chosen beforehand and the topics found are usually fairly intuitive. Details of the implementation can be found in the paper by Blei, Ng, and Jordan.
-
-The original C code relied on files for the input and output. We felt it was necessary to depart from that model and use Ruby objects for these steps instead. The only file necessary will be the data file (in a format similar to that used by SVMlight). Optionally you may need a vocabulary file to be able to extract the words belonging to topics.
-
-Example usage:
-
-require 'lda'
-corpus = Lda::DataCorpus.new("data/data_file.dat")
-lda = Lda::Lda.new(corpus) # create an Lda object for training
-lda.em("random") # run EM algorithm using random starting points
-lda.load_vocabulary("data/vocab.txt")
-lda.print_topics(20) # print the topic 20 words per topic
-
-You can check out the mailing list for this project if you have any questions or mail lda-ruby@groups.google.com [email link]. If you have general questions about Latent Dirichlet Allocation, I urge you to use the topic models mailing list, since the people who monitor that are very knowledgeable.
-
-
-References
-
-Blei, David M., Ng, Andrew Y., and Jordan, Michael I. 2003. Latent dirichlet allocation. Journal of Machine Learning Research. 3 (Mar. 2003), 993-1022.