classifier-reborn 2.0.0.rc1 → 2.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 85ca221ce7678d94a2d7020e37b260991e64aa84
4
- data.tar.gz: 3260eeef860f793d575da189c3acc87f6ed96089
3
+ metadata.gz: b5bf47d31418ceab033a1cb67ef62dd0958d619f
4
+ data.tar.gz: 9fbfb8cfa723f0c124130aa15ef831e8cb30e317
5
5
  SHA512:
6
- metadata.gz: 88de7ba1a323cabcf871e98f2607bc25263830d4a9808d4c2d8d9d880d36ac85547e25e93a3eb555419c543d99edd455b9c6b95599c707e2e12421d02a33dafc
7
- data.tar.gz: 3c346d1aff42973651869e4738adaf673c2d258b8fdd5cf9f724f0659590ae8173abe4c5ac21a06c098b0085c3c155e1bff0f4cce16e7a1bbf3f50c869720b8c
6
+ metadata.gz: 79d2af7643a0184efc1f4173f4cce6e966d6e02d8ddc60ebfa03c8aed1f34d6c33ec6cb75077e46f466acf91eabe9122af4ff611d5f96f3d7bfba29923de90db
7
+ data.tar.gz: 80bc672d0e87fdb0a5064dcd6d8d80746923f7713f38da1c34decb3e3b4a7e11fe92e16e2c8230a1837905b91b3db3de7368c56f0f89654500f4ea242291775a
data/README.markdown CHANGED
@@ -1,22 +1,35 @@
1
- ## Welcome to Classifier
1
+ ## Welcome to Classifier Reborn
2
2
 
3
3
  Classifier is a general module to allow Bayesian and other types of classifications.
4
4
 
5
+ Classifier Reborn is a fork of cardmagic/classifier under more active development.
6
+
5
7
  ## Download
6
8
 
7
- * https://github.com/cardmagic/classifier
8
- * gem install classifier
9
- * git clone https://github.com/cardmagic/classifier.git
9
+ Add this line to your application's Gemfile:
10
+
11
+ gem 'classifier-reborn'
12
+
13
+ And then execute:
14
+
15
+ $ bundle
16
+
17
+ Or install it yourself as:
18
+
19
+ $ gem install classifier-reborn
10
20
 
11
21
  ## Dependencies
12
22
 
13
- If you install Classifier from source, you'll need to install Roman Shterenzon's fast-stemmer gem with RubyGems as follows:
23
+ The only runtime dependency you'll need to install is Roman Shterenzon's fast-stemmer gem:
14
24
 
15
25
  gem install fast-stemmer
16
26
 
27
+ This should install automatically with RubyGems.
28
+
17
29
  If you would like to speed up LSI classification by at least 10x, please install the following libraries:
18
- GNU GSL:: http://www.gnu.org/software/gsl
19
- rb-gsl:: http://rb-gsl.rubyforge.org
30
+
31
+ * [GNU GSL](http://www.gnu.org/software/gsl)
32
+ * [rb-gsl](http://rb-gsl.rubyforge.org)
20
33
 
21
34
  Notice that LSI will work without these libraries, but as soon as they are installed, Classifier will make use of them. No configuration changes are needed, we like to keep things ridiculously easy for you.
22
35
 
@@ -26,20 +39,22 @@ A Bayesian classifier by Lucas Carlson. Bayesian Classifiers are accurate, fast,
26
39
 
27
40
  ### Usage
28
41
 
29
- require 'classifier'
30
- b = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting'
31
- b.train_interesting "here are some good words. I hope you love them"
32
- b.train_uninteresting "here are some bad words, I hate you"
33
- b.classify "I hate bad words and you" # returns 'Uninteresting'
34
-
35
- require 'madeleine'
36
- m = SnapshotMadeleine.new("bayes_data") {
37
- ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting'
38
- }
39
- m.system.train_interesting "here are some good words. I hope you love them"
40
- m.system.train_uninteresting "here are some bad words, I hate you"
41
- m.take_snapshot
42
- m.system.classify "I love you" # returns 'Interesting'
42
+ ```ruby
43
+ require 'classifier'
44
+ b = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting'
45
+ b.train_interesting "here are some good words. I hope you love them"
46
+ b.train_uninteresting "here are some bad words, I hate you"
47
+ b.classify "I hate bad words and you" # returns 'Uninteresting'
48
+
49
+ require 'madeleine' # use madeline to persist the data
50
+ m = SnapshotMadeleine.new("bayes_data") {
51
+ ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting'
52
+ }
53
+ m.system.train_interesting "here are some good words. I hope you love them"
54
+ m.system.train_uninteresting "here are some bad words, I hate you"
55
+ m.take_snapshot
56
+ m.system.classify "I love you" # returns 'Interesting'
57
+ ```
43
58
 
44
59
  Using Madeleine, your application can persist the learned data over time.
45
60
 
@@ -52,33 +67,35 @@ Using Madeleine, your application can persist the learned data over time.
52
67
  ## LSI
53
68
 
54
69
  A Latent Semantic Indexer by David Fayram. Latent Semantic Indexing engines
55
- are not as fast or as small as Bayesian classifiers, but are more flexible, providing
56
- fast search and clustering detection as well as semantic analysis of the text that
70
+ are not as fast or as small as Bayesian classifiers, but are more flexible, providing
71
+ fast search and clustering detection as well as semantic analysis of the text that
57
72
  theoretically simulates human learning.
58
73
 
59
74
  ### Usage
60
75
 
61
- require 'classifier'
62
- lsi = ClassifierReborn::LSI.new
63
- strings = [ ["This text deals with dogs. Dogs.", :dog],
64
- ["This text involves dogs too. Dogs! ", :dog],
65
- ["This text revolves around cats. Cats.", :cat],
66
- ["This text also involves cats. Cats!", :cat],
67
- ["This text involves birds. Birds.",:bird ]]
68
- strings.each {|x| lsi.add_item x.first, x.last}
69
-
70
- lsi.search("dog", 3)
71
- # returns => ["This text deals with dogs. Dogs.", "This text involves dogs too. Dogs! ",
72
- # "This text also involves cats. Cats!"]
73
-
74
- lsi.find_related(strings[2], 2)
75
- # returns => ["This text revolves around cats. Cats.", "This text also involves cats. Cats!"]
76
-
77
- lsi.classify "This text is also about dogs!"
78
- # returns => :dog
79
-
76
+ ```ruby
77
+ require 'classifier'
78
+ lsi = ClassifierReborn::LSI.new
79
+ strings = [ ["This text deals with dogs. Dogs.", :dog],
80
+ ["This text involves dogs too. Dogs! ", :dog],
81
+ ["This text revolves around cats. Cats.", :cat],
82
+ ["This text also involves cats. Cats!", :cat],
83
+ ["This text involves birds. Birds.",:bird ]]
84
+ strings.each {|x| lsi.add_item x.first, x.last}
85
+
86
+ lsi.search("dog", 3)
87
+ # returns => ["This text deals with dogs. Dogs.", "This text involves dogs too. Dogs! ",
88
+ # "This text also involves cats. Cats!"]
89
+
90
+ lsi.find_related(strings[2], 2)
91
+ # returns => ["This text revolves around cats. Cats.", "This text also involves cats. Cats!"]
92
+
93
+ lsi.classify "This text is also about dogs!"
94
+ # returns => :dog
95
+ ```
96
+
80
97
  Please see the ClassifierReborn::LSI documentation for more information. It is possible to index, search and classify
81
- with more than just simple strings.
98
+ with more than just simple strings.
82
99
 
83
100
  ### Latent Semantic Indexing
84
101
 
@@ -86,12 +103,12 @@ with more than just simple strings.
86
103
  * http://www.chadfowler.com/index.cgi/Computing/LatentSemanticIndexing.rdoc
87
104
  * http://en.wikipedia.org/wiki/Latent_semantic_analysis
88
105
 
89
- ## Authors
106
+ ## Authors
90
107
 
91
108
  * Lucas Carlson (lucas@rufy.com)
92
109
  * David Fayram II (dfayram@gmail.com)
93
110
  * Cameron McBride (cameron.mcbride@gmail.com)
94
111
  * Ivan Acosta-Rubio (ivan@softwarecriollo.com)
112
+ * Parker Moore (email@byparker.com)
95
113
 
96
114
  This library is released under the terms of the GNU LGPL. See LICENSE for more details.
97
-
data/lib/classifier-reborn/bayes.rb CHANGED
@@ -2,6 +2,8 @@
2
2
  # Copyright:: Copyright (c) 2005 Lucas Carlson
3
3
  # License:: LGPL
4
4
 
5
+ require_relative 'extensions/string'
6
+
5
7
  module ClassifierReborn
6
8
  class Bayes
7
9
  # The class can be created with one or more categories, each of which will be
@@ -23,7 +25,7 @@ module ClassifierReborn
23
25
  def train(category, text)
24
26
  category = category.prepare_category_name
25
27
  @category_counts[category] += 1
26
- text.word_hash.each do |word, count|
28
+ Hasher.word_hash(text).each do |word, count|
27
29
  @categories[category][word] ||= 0
28
30
  @categories[category][word] += count
29
31
  @total_words += count
@@ -39,12 +41,12 @@ module ClassifierReborn
39
41
  # b.untrain :this, "This text"
40
42
  def untrain(category, text)
41
43
  category = category.prepare_category_name
42
- @category_counts[category] -= 1
43
- text.word_hash.each do |word, count|
44
+ @category_counts[category] -= 1
45
+ Hasher.word_hash(text).each do |word, count|
44
46
  if @total_words >= 0
45
- orig = @categories[category][word]
46
- @categories[category][word] ||= 0
47
- @categories[category][word] -= count
47
+ orig = @categories[category][word] || 0
48
+ @categories[category][word] ||= 0
49
+ @categories[category][word] -= count
48
50
  if @categories[category][word] <= 0
49
51
  @categories[category].delete(word)
50
52
  count = orig
@@ -64,7 +66,7 @@ module ClassifierReborn
64
66
  @categories.each do |category, category_words|
65
67
  score[category.to_s] = 0
66
68
  total = category_words.values.inject(0) {|sum, element| sum+element}
67
- text.word_hash.each do |word, count|
69
+ Hasher.word_hash(text).each do |word, count|
68
70
  s = category_words.has_key?(word) ? category_words[word] : 0.1
69
71
  score[category.to_s] += Math.log(s/total.to_f)
70
72
  end
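The scoring loop in the hunk above sums log-ratios per category and substitutes 0.1 for words a category has never seen. A minimal sketch of that arithmetic follows. The trained counts and the `word_counts` hash are hypothetical stand-ins for real training data and for `Hasher.word_hash(text)`, and picking the highest score as the winner is an assumption about how the scores are used afterwards.

```ruby
# Hypothetical trained counts per category (word => occurrences).
categories = {
  "Interesting"   => { good: 2, love: 1 },
  "Uninteresting" => { bad: 2, hate: 1 }
}
word_counts = { hate: 1, bad: 1 } # stands in for Hasher.word_hash(text)

scores = {}
categories.each do |category, category_words|
  total = category_words.values.inject(0) { |sum, n| sum + n }
  scores[category] = word_counts.keys.inject(0.0) do |score, word|
    # Unseen words get a small floor of 0.1 so Math.log never receives zero.
    s = category_words.key?(word) ? category_words[word] : 0.1
    score + Math.log(s / total.to_f)
  end
end

# Assumption: the category with the highest (least negative) score wins.
best = scores.max_by { |_, score| score }.first
```

Because both query words appear only in the second category, its score stays close to zero while the first category pays the 0.1 penalty twice.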
@@ -0,0 +1,134 @@
1
+ # Author:: Lucas Carlson (mailto:lucas@rufy.com)
2
+ # Copyright:: Copyright (c) 2005 Lucas Carlson
3
+ # License:: LGPL
4
+
5
+ require "set"
6
+
7
+ module ClassifierReborn
8
+ module Hasher
9
+ extend self
10
+
11
+ # Removes common punctuation symbols, returning a new string.
12
+ # E.g.,
13
+ # "Hello (greeting's), with {braces} < >...?".without_punctuation
14
+ # => "Hello greetings with braces "
15
+ def without_punctuation(str)
16
+ str .tr( ',?.!;:"@#$%^&*()_=+[]{}\|<>/`~', " " ) .tr( "'\-", "")
17
+ end
18
+
19
+ # Return a Hash of strings => ints. Each word in the string is stemmed,
20
+ # interned, and indexes to its frequency in the document.
21
+ def word_hash(str)
22
+ word_hash = clean_word_hash(str)
23
+ symbol_hash = word_hash_for_symbols(str.gsub(/[\w]/," ").split)
24
+ return clean_word_hash(str).merge(symbol_hash)
25
+ end
26
+
27
+ # Return a word hash without extra punctuation or short symbols, just stemmed words
28
+ def clean_word_hash(str)
29
+ word_hash_for_words str.gsub(/[^\w\s]/,"").split
30
+ end
31
+
32
+ def word_hash_for_words(words)
33
+ d = Hash.new(0)
34
+ words.each do |word|
35
+ word.downcase!
36
+ if ! CORPUS_SKIP_WORDS.include?(word) && word.length > 2
37
+ d[word.stem.intern] += 1
38
+ end
39
+ end
40
+ return d
41
+ end
42
+
43
+ def word_hash_for_symbols(words)
44
+ d = Hash.new(0)
45
+ words.each do |word|
46
+ d[word.intern] += 1
47
+ end
48
+ return d
49
+ end
50
+
51
+ CORPUS_SKIP_WORDS = Set.new(%w[
52
+ a
53
+ again
54
+ all
55
+ along
56
+ are
57
+ also
58
+ an
59
+ and
60
+ as
61
+ at
62
+ but
63
+ by
64
+ came
65
+ can
66
+ cant
67
+ couldnt
68
+ did
69
+ didn
70
+ didnt
71
+ do
72
+ doesnt
73
+ dont
74
+ ever
75
+ first
76
+ from
77
+ have
78
+ her
79
+ here
80
+ him
81
+ how
82
+ i
83
+ if
84
+ in
85
+ into
86
+ is
87
+ isnt
88
+ it
89
+ itll
90
+ just
91
+ last
92
+ least
93
+ like
94
+ most
95
+ my
96
+ new
97
+ no
98
+ not
99
+ now
100
+ of
101
+ on
102
+ or
103
+ should
104
+ sinc
105
+ so
106
+ some
107
+ th
108
+ than
109
+ this
110
+ that
111
+ the
112
+ their
113
+ then
114
+ those
115
+ to
116
+ told
117
+ too
118
+ true
119
+ try
120
+ until
121
+ url
122
+ us
123
+ were
124
+ when
125
+ whether
126
+ while
127
+ with
128
+ within
129
+ yes
130
+ you
131
+ youll
132
+ ])
133
+ end
134
+ end
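The new file above moves the word-hashing helpers out of `String` and into a module that exposes them as module-level functions through `extend self`. A minimal sketch of that pattern follows. `MiniHasher`, its short skip list, and the omission of stemming are assumptions made so the example runs without the fast-stemmer gem; it is not the gem's implementation.

```ruby
require "set"

# Hypothetical stand-in for ClassifierReborn::Hasher: same shape, no stemming.
module MiniHasher
  extend self # every method below becomes callable as MiniHasher.method_name

  SKIP_WORDS = Set.new(%w[a and are the this with you]) # illustrative subset

  # Strip punctuation, then count words that are neither short nor common.
  def clean_word_hash(str)
    counts = Hash.new(0)
    str.gsub(/[^\w\s]/, "").split.each do |word|
      word = word.downcase
      next if SKIP_WORDS.include?(word) || word.length <= 2
      counts[word.intern] += 1
    end
    counts
  end
end

result = MiniHasher.clean_word_hash("Dogs chase cats. Cats chase the dogs!")
# dogs, chase and cats are each counted twice; "the" is skipped.
```

Callers pass the string in explicitly, so core classes stay untouched, which matches the `Hasher.word_hash(text)` call sites elsewhere in this diff.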
data/lib/classifier-reborn/extensions/string.rb CHANGED
@@ -3,7 +3,7 @@
3
3
  # License:: LGPL
4
4
 
5
5
  require 'fast_stemmer'
6
- require 'classifier-reborn/extensions/word_hash'
6
+ require 'classifier-reborn/extensions/hasher'
7
7
 
8
8
  class Object
9
9
  def prepare_category_name; to_s.gsub("_"," ").capitalize.intern end
data/lib/classifier-reborn/extensions/vector.rb CHANGED
@@ -4,7 +4,6 @@
4
4
  # These are extensions to the std-lib 'matrix' to allow an all ruby SVD
5
5
 
6
6
  require 'matrix'
7
- require 'mathn'
8
7
 
9
8
  class Array
10
9
  def sum(identity = 0, &block)
@@ -13,7 +12,7 @@ class Array
13
12
  if block_given?
14
13
  map(&block).sum
15
14
  else
16
- reduce(:+)
15
+ reduce(:+) || 0
17
16
  end
18
17
  end
19
18
  end
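The `|| 0` added above guards the empty-array case: `reduce(:+)` with no initial value returns `nil` when there is nothing to fold. A short illustration using plain arrays (the variable names are only for this example):

```ruby
# Array#reduce with a symbol and no initial value returns nil for an empty array.
empty_total = [].reduce(:+)

# Falling back to 0 keeps the result numeric, as in the patched line.
safe_total  = [].reduce(:+) || 0
usual_total = [1, 2, 3].reduce(:+) || 0
```

Without the fallback, any arithmetic on the sum of an empty vector would raise `NoMethodError` on `nil`.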
data/lib/classifier-reborn/lsi.rb CHANGED
@@ -6,16 +6,16 @@ begin
6
6
  raise LoadError if ENV['NATIVE_VECTOR'] == "true" # to test the native vector class, try `rake test NATIVE_VECTOR=true`
7
7
 
8
8
  require 'gsl' # requires http://rb-gsl.rubyforge.org/
9
- require 'classifier-reborn/extensions/vector_serialize'
9
+ require_relative 'extensions/vector_serialize'
10
10
  $GSL = true
11
11
 
12
12
  rescue LoadError
13
- require 'classifier-reborn/extensions/vector'
13
+ require_relative 'extensions/vector'
14
14
  end
15
15
 
16
- require 'classifier-reborn/lsi/word_list'
17
- require 'classifier-reborn/lsi/content_node'
18
- require 'classifier-reborn/lsi/summary'
16
+ require_relative 'lsi/word_list'
17
+ require_relative 'lsi/content_node'
18
+ require_relative 'lsi/summarizer'
19
19
 
20
20
  module ClassifierReborn
21
21
 
@@ -58,7 +58,7 @@ module ClassifierReborn
58
58
  # lsi.add_item ar, *ar.categories { |x| ar.content }
59
59
  #
60
60
  def add_item( item, *categories, &block )
61
- clean_word_hash = block ? block.call(item).clean_word_hash : item.to_s.clean_word_hash
61
+ clean_word_hash = Hasher.clean_word_hash(block ? block.call(item) : item.to_s)
62
62
  @items[item] = ContentNode.new(clean_word_hash, *categories)
63
63
  @version += 1
64
64
  build_index if @auto_rebuild
@@ -82,8 +82,8 @@ module ClassifierReborn
82
82
  # Removes an item from the database, if it is indexed.
83
83
  #
84
84
  def remove_item( item )
85
- if @items.keys.contain? item
86
- @items.remove item
85
+ if @items.key? item
86
+ @items.delete item
87
87
  @version += 1
88
88
  end
89
89
  end
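The `remove_item` fix above swaps two calls that core Ruby does not define (`Array#contain?` and `Hash#remove`) for `Hash#key?` and `Hash#delete`. A small check with a hypothetical items hash:

```ruby
items = { "doc one" => :node_a, "doc two" => :node_b } # hypothetical index

# Neither of the old calls is defined on core Array or Hash.
old_calls_exist = items.keys.respond_to?(:contain?) || items.respond_to?(:remove)

# The replacement calls exist and remove the entry in place.
removed = items.delete("doc one") if items.key?("doc one")
```

`Hash#delete` returns the removed value, so the caller can tell whether anything was actually dropped.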
@@ -293,7 +293,7 @@ module ClassifierReborn
293
293
  if @items[item]
294
294
  return @items[item]
295
295
  else
296
- clean_word_hash = block ? block.call(item).clean_word_hash : item.to_s.clean_word_hash
296
+ clean_word_hash = Hasher.clean_word_hash(block ? block.call(item) : item.to_s)
297
297
 
298
298
  cn = ContentNode.new(clean_word_hash, &block) # make the node and extract the data
299
299
 
data/lib/classifier-reborn/lsi/content_node.rb CHANGED
@@ -43,12 +43,14 @@ module ClassifierReborn
43
43
  vec[word_list[word]] = @word_hash[word] if word_list[word]
44
44
  end
45
45
 
46
- # Perform the scaling transform
47
- total_words = vec.sum
46
+ # Perform the scaling transform and force floating point arithmetic
47
+ total_words = vec.sum.to_f
48
+
49
+ total_unique_words = vec.count{|word| word != 0}
48
50
 
49
51
  # Perform first-order association transform if this vector has more
50
- # than one word in it.
51
- if total_words > 1.0
52
+ # then one word in it.
53
+ if total_words > 1.0 && total_unique_words > 1
52
54
  weighted_total = 0.0
53
55
  vec.each do |term|
54
56
  if ( term > 0 )
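The hunk above converts `total_words` to a Float and skips the transform when only one distinct word is present. A likely motivation, stated here as an assumption rather than taken from a changelog: `mathn` is no longer required elsewhere in this diff, so Integer division truncates again, and a vector with a single distinct term produces an entropy-style sum of exactly zero, which any later division by that sum would turn into Infinity or NaN. The sketch uses a plain Array in place of the gem's vector classes.

```ruby
vec = [3, 0, 0] # hypothetical term-frequency vector with one distinct word

int_total   = vec.reduce(:+)      # Integer
float_total = vec.reduce(:+).to_f # Float, as in the patched line

truncated = 1 / int_total   # Integer division drops the fraction
ratio     = 1 / float_total # Float division keeps it

total_unique_words = vec.count { |term| term != 0 }

# Sum of p * log(p) over the non-zero terms; it is zero when one term holds all the mass.
weighted_total = vec.select { |t| t > 0 }
                    .map { |t| (t / float_total) * Math.log(t / float_total) }
                    .reduce(:+)
```

Checking `total_unique_words > 1` is a cheap way to avoid that zero before it is used as a divisor.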
@@ -0,0 +1,33 @@
1
+ # Author:: Lucas Carlson (mailto:lucas@rufy.com)
2
+ # Copyright:: Copyright (c) 2005 Lucas Carlson
3
+ # License:: LGPL
4
+
5
+ module ClassifierReborn
6
+ module Summarizer
7
+ extend self
8
+
9
+ def summary( str, count=10, separator=" [...] " )
10
+ perform_lsi split_sentences(str), count, separator
11
+ end
12
+
13
+ def paragraph_summary( str, count=1, separator=" [...] " )
14
+ perform_lsi split_paragraphs(str), count, separator
15
+ end
16
+
17
+ def split_sentences(str)
18
+ str.split /(\.|\!|\?)/ # TODO: make this less primitive
19
+ end
20
+
21
+ def split_paragraphs(str)
22
+ str.split /(\n\n|\r\r|\r\n\r\n)/ # TODO: make this less primitive
23
+ end
24
+
25
+ def perform_lsi(chunks, count, separator)
26
+ lsi = ClassifierReborn::LSI.new :auto_rebuild => false
27
+ chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 }
28
+ lsi.build_index
29
+ summaries = lsi.highest_relative_content count
30
+ return summaries.reject { |chunk| !summaries.include? chunk }.map { |x| x.strip }.join(separator)
31
+ end
32
+ end
33
+ end
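`split_sentences` above splits on a regular expression with a capture group, so the delimiters come back as separate elements; the single-token filter in `perform_lsi` is what discards them. A standalone illustration with a made-up sentence:

```ruby
text = "Ruby is fun. Is LSI slow? It depends!"

# With a capture group, String#split keeps each matched delimiter as its own element.
chunks = text.split(/(\.|\!|\?)/)

# Same filter as perform_lsi: drop empty chunks and chunks with a single token.
kept = chunks.reject { |c| c.strip.empty? || c.strip.split.size == 1 }
```

The same filter also drops genuine one-word sentences, which is part of why the source marks the splitter as primitive.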
data/lib/classifier-reborn/version.rb CHANGED
@@ -1,3 +1,3 @@
1
1
  module ClassifierReborn
2
- VERSION = '2.0.0.rc1'
2
+ VERSION = '2.0.0'
3
3
  end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: classifier-reborn
3
3
  version: !ruby/object:Gem::Version
4
- version: 2.0.0.rc1
4
+ version: 2.0.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Lucas Carlson
@@ -9,22 +9,22 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2014-08-10 00:00:00.000000000 Z
12
+ date: 2014-08-13 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: fast-stemmer
16
16
  requirement: !ruby/object:Gem::Requirement
17
17
  requirements:
18
- - - ">="
18
+ - - "~>"
19
19
  - !ruby/object:Gem::Version
20
- version: 1.0.0
20
+ version: '1.0'
21
21
  type: :runtime
22
22
  prerelease: false
23
23
  version_requirements: !ruby/object:Gem::Requirement
24
24
  requirements:
25
- - - ">="
25
+ - - "~>"
26
26
  - !ruby/object:Gem::Version
27
- version: 1.0.0
27
+ version: '1.0'
28
28
  - !ruby/object:Gem::Dependency
29
29
  name: rake
30
30
  requirement: !ruby/object:Gem::Requirement
@@ -71,13 +71,13 @@ files:
71
71
  - bin/summarize.rb
72
72
  - lib/classifier-reborn.rb
73
73
  - lib/classifier-reborn/bayes.rb
74
+ - lib/classifier-reborn/extensions/hasher.rb
74
75
  - lib/classifier-reborn/extensions/string.rb
75
76
  - lib/classifier-reborn/extensions/vector.rb
76
77
  - lib/classifier-reborn/extensions/vector_serialize.rb
77
- - lib/classifier-reborn/extensions/word_hash.rb
78
78
  - lib/classifier-reborn/lsi.rb
79
79
  - lib/classifier-reborn/lsi/content_node.rb
80
- - lib/classifier-reborn/lsi/summary.rb
80
+ - lib/classifier-reborn/lsi/summarizer.rb
81
81
  - lib/classifier-reborn/lsi/word_list.rb
82
82
  - lib/classifier-reborn/version.rb
83
83
  homepage: https://github.com/jekyll/classifier-reborn
@@ -96,9 +96,9 @@ required_ruby_version: !ruby/object:Gem::Requirement
96
96
  version: 1.9.3
97
97
  required_rubygems_version: !ruby/object:Gem::Requirement
98
98
  requirements:
99
- - - ">"
99
+ - - ">="
100
100
  - !ruby/object:Gem::Version
101
- version: 1.3.1
101
+ version: '0'
102
102
  requirements: []
103
103
  rubyforge_project:
104
104
  rubygems_version: 2.2.2
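The fast-stemmer constraint in the metadata above changes from `>= 1.0.0` to the pessimistic `~> 1.0`, which still accepts any 1.x release but excludes 2.0 and later. The difference can be checked with RubyGems' own classes; the version numbers tested here are hypothetical and are not a claim about which fast-stemmer releases exist.

```ruby
require "rubygems" # usually loaded already; kept explicit for clarity

old_req = Gem::Requirement.new(">= 1.0.0")
new_req = Gem::Requirement.new("~> 1.0")

v1_5 = Gem::Version.new("1.5.0") # hypothetical release
v2_0 = Gem::Version.new("2.0.0") # hypothetical release

old_allows_2   = old_req.satisfied_by?(v2_0)
new_allows_2   = new_req.satisfied_by?(v2_0)
new_allows_1_5 = new_req.satisfied_by?(v1_5)
```

The old requirement would accept a future major release; the new one would not.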
@@ -1,136 +0,0 @@
1
- # Author:: Lucas Carlson (mailto:lucas@rufy.com)
2
- # Copyright:: Copyright (c) 2005 Lucas Carlson
3
- # License:: LGPL
4
-
5
- require "set"
6
-
7
- # These are extensions to the String class to provide convenience
8
- # methods for the Classifier package.
9
- class String
10
-
11
- # Removes common punctuation symbols, returning a new string.
12
- # E.g.,
13
- # "Hello (greeting's), with {braces} < >...?".without_punctuation
14
- # => "Hello greetings with braces "
15
- def without_punctuation
16
- tr( ',?.!;:"@#$%^&*()_=+[]{}\|<>/`~', " " ) .tr( "'\-", "")
17
- end
18
-
19
- # Return a Hash of strings => ints. Each word in the string is stemmed,
20
- # interned, and indexes to its frequency in the document.
21
- def word_hash
22
- word_hash = clean_word_hash()
23
- symbol_hash = word_hash_for_symbols(gsub(/[\w]/," ").split)
24
- return word_hash.merge(symbol_hash)
25
- end
26
-
27
- # Return a word hash without extra punctuation or short symbols, just stemmed words
28
- def clean_word_hash
29
- word_hash_for_words gsub(/[^\w\s]/,"").split
30
- end
31
-
32
- private
33
-
34
- def word_hash_for_words(words)
35
- d = Hash.new(0)
36
- words.each do |word|
37
- word.downcase!
38
- if ! CORPUS_SKIP_WORDS.include?(word) && word.length > 2
39
- d[word.stem.intern] += 1
40
- end
41
- end
42
- return d
43
- end
44
-
45
-
46
- def word_hash_for_symbols(words)
47
- d = Hash.new(0)
48
- words.each do |word|
49
- d[word.intern] += 1
50
- end
51
- return d
52
- end
53
-
54
- CORPUS_SKIP_WORDS = Set.new([
55
- "a",
56
- "again",
57
- "all",
58
- "along",
59
- "are",
60
- "also",
61
- "an",
62
- "and",
63
- "as",
64
- "at",
65
- "but",
66
- "by",
67
- "came",
68
- "can",
69
- "cant",
70
- "couldnt",
71
- "did",
72
- "didn",
73
- "didnt",
74
- "do",
75
- "doesnt",
76
- "dont",
77
- "ever",
78
- "first",
79
- "from",
80
- "have",
81
- "her",
82
- "here",
83
- "him",
84
- "how",
85
- "i",
86
- "if",
87
- "in",
88
- "into",
89
- "is",
90
- "isnt",
91
- "it",
92
- "itll",
93
- "just",
94
- "last",
95
- "least",
96
- "like",
97
- "most",
98
- "my",
99
- "new",
100
- "no",
101
- "not",
102
- "now",
103
- "of",
104
- "on",
105
- "or",
106
- "should",
107
- "sinc",
108
- "so",
109
- "some",
110
- "th",
111
- "than",
112
- "this",
113
- "that",
114
- "the",
115
- "their",
116
- "then",
117
- "those",
118
- "to",
119
- "told",
120
- "too",
121
- "true",
122
- "try",
123
- "until",
124
- "url",
125
- "us",
126
- "were",
127
- "when",
128
- "whether",
129
- "while",
130
- "with",
131
- "within",
132
- "yes",
133
- "you",
134
- "youll",
135
- ])
136
- end
@@ -1,31 +0,0 @@
1
- # Author:: Lucas Carlson (mailto:lucas@rufy.com)
2
- # Copyright:: Copyright (c) 2005 Lucas Carlson
3
- # License:: LGPL
4
-
5
- class String
6
- def summary( count=10, separator=" [...] " )
7
- perform_lsi split_sentences, count, separator
8
- end
9
-
10
- def paragraph_summary( count=1, separator=" [...] " )
11
- perform_lsi split_paragraphs, count, separator
12
- end
13
-
14
- def split_sentences
15
- split /(\.|\!|\?)/ # TODO: make this less primitive
16
- end
17
-
18
- def split_paragraphs
19
- split /(\n\n|\r\r|\r\n\r\n)/ # TODO: make this less primitive
20
- end
21
-
22
- private
23
-
24
- def perform_lsi(chunks, count, separator)
25
- lsi = ClassifierReborn::LSI.new :auto_rebuild => false
26
- chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 }
27
- lsi.build_index
28
- summaries = lsi.highest_relative_content count
29
- return summaries.reject { |chunk| !summaries.include? chunk }.map { |x| x.strip }.join(separator)
30
- end
31
- end