lda-ruby 0.3.8 → 0.3.9

@@ -0,0 +1,15 @@
+ ---
+ !binary "U0hBMQ==":
+   metadata.gz: !binary |-
+     M2IzN2IwMmE1YWFhNTJjYjk0ZjMyYzhjMWQyZDdjZjY2YTA0OGNjZA==
+   data.tar.gz: !binary |-
+     OTNjZjE5MGNmOGI2YzY3YzhlNDRiYTBlNDM5NmUwYmY4Mjc2ZmNkNQ==
+ SHA512:
+   metadata.gz: !binary |-
+     MDAxZmQ1N2U5MGQ1MDA2MGE4YTU0N2NkN2Y4N2Q4YzlmNTdmYTkxN2FlYzU0
+     ZmFjZDZiZWIwZmRiMDJlYjNjYTg5YWQ0N2RlOTY1MTg0YzZjZTc0NGQ0YmI1
+     ZDljMjljNTA3ODdlZjBjNjNjMTc0MGRlMjBhODQzZTg3YWM5OWE=
+   data.tar.gz: !binary |-
+     OGFiMzVhMzY4OTFmZTkzMmM3YjQxMzNkM2JlMjVjOTRjM2ZhNTc1YmUxZTRj
+     Y2M3YzZiNzNiYmVkNDI1ZTUxODhkMjhlYTQ4MmRhZjVkZjQxMDhjOTk5Njc4
+     OTZhOTM5ZTQ3OTFhY2U1YjRhODQyYzBkZWNlYzhjNzBjMWFlNGU=
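The `!binary` values in the checksum hunk above are Base64-encoded text. As a minimal sketch, the first key and the first `metadata.gz` value can be decoded with Ruby's core `String#unpack1`. The variable names are illustrative only; the two strings are copied from the hunk.

```ruby
# Strings copied verbatim from the checksums hunk above.
algorithm_key = "U0hBMQ=="
metadata_sha1 = "M2IzN2IwMmE1YWFhNTJjYjk0ZjMyYzhjMWQyZDdjZjY2YTA0OGNjZA=="

# String#unpack1("m") decodes Base64 and needs no extra require.
decoded_key    = algorithm_key.unpack1("m")
decoded_digest = metadata_sha1.unpack1("m")

puts decoded_key     # name of the hash algorithm
puts decoded_digest  # hex digest recorded for metadata.gz
```

The decoded key names the hash algorithm, and the decoded value is the hex digest stored for that file.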
@@ -1,3 +1,15 @@
+ version 0.3.9
+ =============
+
+ - merge pull request from @rishabh-tripathi allowing text corpus objects to also be built with an array of strings
+ - couple minor code refinements
+
+ version 0.3.8
+ =============
+
+ - tokenization changes to support German (courtesy of @LeFnord)
+ - user defined stop word list (also via @LeFnord)
+
  version 0.3.7
  =============

@@ -13,7 +13,7 @@ The original C code relied on files for the input and output. We felt it was nec
  lda = Lda::Lda.new(corpus) # create an Lda object for training
  lda.em("random") # run EM algorithm using random starting points
  lda.load_vocabulary("data/vocab.txt")
- lda.print_topics(20) # print the topic 20 words per topic
+ lda.print_topics(20) # print all topics with up to 20 words per topic

  If you have general questions about Latent Dirichlet Allocation, I urge you to use the [topic models mailing list][topic-models], since the people who monitor that are very knowledgeable. If you encounter bugs specific to lda-ruby, please post an issue on the Github project.

@@ -34,4 +34,4 @@ Blei, David M., Ng, Andrew Y., and Jordan, Michael I. 2003. Latent dirichlet all
  [wikipedia]: http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
  [ap-data]: http://www.cs.princeton.edu/~blei/lda-c/ap.tgz
  [pdf]: http://www.cs.princeton.edu/picasso/mats/BleiNgJordan2003_blei.pdf
- [topic-models]: https://lists.cs.princeton.edu/mailman/listinfo/topic-models
+ [topic-models]: https://lists.cs.princeton.edu/mailman/listinfo/topic-models
@@ -1,5 +1,5 @@
- ---
- :build:
+ ---
  :major: 0
  :minor: 3
- :patch: 8
+ :patch: 9
+ :build:
Binary file
@@ -2,18 +2,23 @@ module Lda
    class TextCorpus < Corpus
      attr_reader :filename

-     # Load text documents from YAML file if filename is given.
-     def initialize(filename)
+     # Loads text documents from a YAML file or an array of strings
+     def initialize(input_data)
        super()

-       @filename = filename
-       load_from_file
-     end
-
-     protected
+       docs = if input_data.is_a?(String) && File.exists?(input_data)
+         # yaml file containing an array of strings representing each document
+         YAML.load_file(input_data)
+       elsif input_data.is_a?(Array)
+         # an array of strings representing each document
+         input_data.dup
+       elsif input_data.is_a?(String)
+         # a single string representing one document
+         [input_data]
+       else
+         raise "Unknown input type: please pass in a valid filename or an array of strings."
+       end

-     def load_from_file
-       docs = YAML.load_file(@filename)
        docs.each do |doc|
          add_document(TextDocument.new(self, doc))
        end
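The input handling introduced in the hunk above can be tried in isolation. The following is a minimal sketch, not the gem's code: `load_documents` is a hypothetical helper that mirrors the branch order of the new `TextCorpus#initialize`, and it calls `File.exist?` because `File.exists?` was removed in Ruby 3.2.

```ruby
require "yaml"
require "tempfile"

# Hypothetical helper (not part of lda-ruby) mirroring the branch order
# of the new TextCorpus#initialize in the hunk above.
def load_documents(input_data)
  if input_data.is_a?(String) && File.exist?(input_data)
    YAML.load_file(input_data)   # path to a YAML file holding an array of strings
  elsif input_data.is_a?(Array)
    input_data.dup               # array of strings, one per document
  elsif input_data.is_a?(String)
    [input_data]                 # any other string is treated as a single document
  else
    raise "Unknown input type: please pass in a valid filename or an array of strings."
  end
end

from_array  = load_documents(["first document", "second document"])
from_string = load_documents("just one document")

# Write a small YAML file to exercise the filename branch.
from_file = Tempfile.create(["docs", ".yml"]) do |f|
  f.write(["doc a", "doc b"].to_yaml)
  f.flush
  load_documents(f.path)
end
```

One consequence of this branch order is that a string naming a file that does not exist falls through to the third branch, so a mistyped filename yields a one-document corpus whose only text is the path itself rather than an error.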
@@ -32,9 +32,7 @@ module Lda
      end

      def tokenize(text)
-       # now respects Umlaute
        clean_text = text.gsub(/[^a-zäöüß'-]+/i, ' ').gsub(/\s+/, ' ').downcase # remove everything but letters and ' and leave only single spaces
-       # clean_text = text.gsub(/[^A-Za-z'\s]+/, ' ').gsub(/\s+/, ' ').downcase # remove everything but letters and ' and leave only single spaces
        @tokens = handle(clean_text.split(' '))
        nil
      end
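To see what the cleaning expression in `tokenize` keeps, here is a small standalone sketch. `clean` is a hypothetical wrapper around the same two `gsub` calls and `downcase`; the gem's `handle` step, which post-processes the split words, is not reproduced, and the sample sentence is made up.

```ruby
# Hypothetical standalone copy of the cleaning step from tokenize above.
def clean(text)
  # Replace every run of characters other than letters, umlauts, ß,
  # apostrophe and hyphen with one space, squeeze whitespace, downcase.
  text.gsub(/[^a-zäöüß'-]+/i, ' ').gsub(/\s+/, ' ').downcase
end

cleaned = clean("Die Größe der Bäume, 2015! It's well-known.")
words   = cleaned.split(' ')

puts words.inspect
```

Digits and punctuation are dropped, while umlauts, `ß`, apostrophes and hyphens survive inside words, which matches the German tokenization support noted in the 0.3.8 changelog entry.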
metadata CHANGED
@@ -1,52 +1,41 @@
- --- !ruby/object:Gem::Specification
+ --- !ruby/object:Gem::Specification
  name: lda-ruby
- version: !ruby/object:Gem::Version
-   hash: 3
-   prerelease:
-   segments:
-   - 0
-   - 3
-   - 8
-   version: 0.3.8
+ version: !ruby/object:Gem::Version
+   version: 0.3.9
  platform: ruby
- authors:
+ authors:
  - David Blei
  - Jason Adams
  - Rio Akasaka
  autorequire:
  bindir: bin
  cert_chain: []
-
- date: 2011-10-18 00:00:00 -04:00
- default_executable:
- dependencies:
- - !ruby/object:Gem::Dependency
+ date: 2015-02-11 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
    name: shoulda
-   prerelease: false
-   requirement: &id001 !ruby/object:Gem::Requirement
-     none: false
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         hash: 3
-         segments:
-         - 0
-         version: "0"
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ! '>='
+       - !ruby/object:Gem::Version
+         version: '0'
    type: :runtime
-   version_requirements: *id001
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ! '>='
+       - !ruby/object:Gem::Version
+         version: '0'
  description: Ruby port of Latent Dirichlet Allocation by David M. Blei. See http://www.cs.princeton.edu/~blei/lda-c/.
  email: jasonmadams@gmail.com
  executables: []
-
- extensions:
+ extensions:
  - ext/lda-ruby/extconf.rb
- extra_rdoc_files:
- - README
- - README.markdown
- files:
- - CHANGELOG
- - README
- - README.markdown
+ extra_rdoc_files:
+ - README.md
+ files:
+ - CHANGELOG.md
+ - README.md
  - Rakefile
  - VERSION.yml
  - ext/lda-ruby/Makefile
@@ -84,40 +73,28 @@ files:
  - test/simple_test.rb
  - test/simple_yaml.rb
  - test/test_helper.rb
- has_rdoc: true
  homepage: http://github.com/ealdent/lda-ruby
  licenses: []
-
+ metadata: {}
  post_install_message:
  rdoc_options: []
-
- require_paths:
+ require_paths:
  - lib
  - ext
- required_ruby_version: !ruby/object:Gem::Requirement
-   none: false
-   requirements:
-   - - ">="
-     - !ruby/object:Gem::Version
-       hash: 3
-       segments:
-       - 0
-       version: "0"
- required_rubygems_version: !ruby/object:Gem::Requirement
-   none: false
-   requirements:
-   - - ">="
-     - !ruby/object:Gem::Version
-       hash: 3
-       segments:
-       - 0
-       version: "0"
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ! '>='
+     - !ruby/object:Gem::Version
+       version: '0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ! '>='
+     - !ruby/object:Gem::Version
+       version: '0'
  requirements: []
-
  rubyforge_project:
- rubygems_version: 1.5.2
+ rubygems_version: 2.4.5
  signing_key:
- specification_version: 3
+ specification_version: 4
  summary: Ruby port of Latent Dirichlet Allocation by David M. Blei.
  test_files: []
-
data/README DELETED
@@ -1,21 +0,0 @@
- Latent Dirichlet Allocation – Ruby Wrapper
-
- This wrapper is based on C-code by David M. Blei. In a nutshell, it can be used to automatically cluster documents into topics. The number of topics are chosen beforehand and the topics found are usually fairly intuitive. Details of the implementation can be found in the paper by Blei, Ng, and Jordan.
-
- The original C code relied on files for the input and output. We felt it was necessary to depart from that model and use Ruby objects for these steps instead. The only file necessary will be the data file (in a format similar to that used by SVMlight). Optionally you may need a vocabulary file to be able to extract the words belonging to topics.
-
- Example usage:
-
- require 'lda'
- corpus = Lda::DataCorpus.new("data/data_file.dat")
- lda = Lda::Lda.new(corpus) # create an Lda object for training
- lda.em("random") # run EM algorithm using random starting points
- lda.load_vocabulary("data/vocab.txt")
- lda.print_topics(20) # print the topic 20 words per topic
-
- You can check out the mailing list for this project if you have any questions or mail lda-ruby@groups.google.com [email link]. If you have general questions about Latent Dirichlet Allocation, I urge you to use the topic models mailing list, since the people who monitor that are very knowledgeable.
-
-
- References
-
- Blei, David M., Ng, Andrew Y., and Jordan, Michael I. 2003. Latent dirichlet allocation. Journal of Machine Learning Research. 3 (Mar. 2003), 993-1022.