ankusa 0.0.6 → 0.0.7
- data/README.rdoc +80 -6
- data/Rakefile +22 -10
- data/docs/classes/Ankusa.html +29 -1
- data/docs/classes/Ankusa/CassandraStorage.html +615 -0
- data/docs/classes/Ankusa/Classifier.html +23 -131
- data/docs/classes/Ankusa/HBaseStorage.html +102 -102
- data/docs/classes/Ankusa/KLDivergenceClassifier.html +194 -0
- data/docs/classes/Ankusa/MemoryStorage.html +84 -84
- data/docs/classes/Ankusa/NaiveBayesClassifier.html +231 -0
- data/docs/classes/Ankusa/TextHash.html +30 -30
- data/docs/created.rid +1 -1
- data/docs/files/README_rdoc.html +132 -11
- data/docs/files/lib/ankusa/cassandra_storage_rb.html +108 -0
- data/docs/files/lib/ankusa/classifier_rb.html +1 -1
- data/docs/files/lib/ankusa/kl_divergence_rb.html +101 -0
- data/docs/files/lib/ankusa/naive_bayes_rb.html +101 -0
- data/docs/files/lib/ankusa_rb.html +3 -3
- data/docs/fr_class_index.html +3 -0
- data/docs/fr_file_index.html +3 -0
- data/docs/fr_method_index.html +59 -42
- data/lib/ankusa.rb +2 -2
- data/lib/ankusa/cassandra_storage.rb +194 -0
- data/lib/ankusa/classifier.rb +1 -39
- data/lib/ankusa/kl_divergence.rb +31 -0
- data/lib/ankusa/naive_bayes.rb +46 -0
- metadata +19 -26
data/README.rdoc
CHANGED
@@ -1,21 +1,28 @@
|
|
1
1
|
= ankusa
|
2
2
|
|
3
|
-
Ankusa is a text classifier in Ruby that
|
3
|
+
Ankusa is a text classifier in Ruby that can use either Hadoop's HBase or Cassandra for storage. Because it uses HBase or Cassandra as a backend, the training corpus can be many terabytes in size.
|
4
4
|
|
5
|
-
Ankusa currently
|
5
|
+
Ankusa currently provides both a Naive Bayes and a Kullback-Leibler divergence classifier. It ignores common words (a.k.a. stop words) and stems all others. Additionally, it uses Laplacian smoothing in both classification methods.
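To make the Laplacian smoothing mentioned above concrete, here is a minimal sketch in plain Ruby. It is not taken from Ankusa's source: the helper name `smoothed_log_likelihood` and the hand-written word counts are assumptions for illustration only, and Ankusa's own computation may differ in detail.

```ruby
# Hypothetical helper (not part of Ankusa): log-likelihood of a bag of words
# under one class, using Laplace (add-one) smoothing so that words never seen
# in the class still get a small non-zero probability.
def smoothed_log_likelihood(doc_counts, class_counts, vocab_size)
  class_total = class_counts.values.inject(0) { |sum, n| sum + n }.to_f
  doc_counts.inject(0.0) do |acc, (word, n)|
    prob = (class_counts.fetch(word, 0) + 1) / (class_total + vocab_size)
    acc + n * Math.log(prob)
  end
end

# Illustrative per-class word counts (assumed already stemmed and stop-word filtered).
spam = { "spam" => 3, "text" => 1 }
good = { "good" => 2, "text" => 2 }
doc  = { "spam" => 1, "text" => 1 }
vocab_size = (spam.keys | good.keys).size

spam_ll = smoothed_log_likelihood(doc, spam, vocab_size)
good_ll = smoothed_log_likelihood(doc, good, vocab_size)
puts spam_ll > good_ll ? "spam" : "good"
```

The word "spam" never occurs in the `good` counts, yet its smoothed probability there is 1/7 rather than 0, so the logarithm stays finite.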
|
6
6
|
|
7
7
|
== Installation
|
8
|
-
First, install HBase
|
8
|
+
First, install HBase/Hadoop or Cassandra (>= 0.7.0-rc2). Then, install the appropriate gem:
|
9
|
+
gem install hbaserb
|
10
|
+
# or
|
11
|
+
gem install cassandra
|
9
12
|
|
13
|
+
If you're using HBase, make sure the HBase Thrift interface has been started as well. Then:
|
10
14
|
gem install ankusa
|
11
15
|
|
12
16
|
== Basic Usage
|
17
|
+
Using the naive Bayes classifier:
|
18
|
+
|
13
19
|
require 'rubygems'
|
14
20
|
require 'ankusa'
|
21
|
+
require 'ankusa/hbase_storage'
|
15
22
|
|
16
23
|
# connect to HBase
|
17
24
|
storage = Ankusa::HBaseStorage.new 'localhost'
|
18
|
-
c = Ankusa::
|
25
|
+
c = Ankusa::NaiveBayesClassifier.new storage
|
19
26
|
|
20
27
|
# Each of these calls will return a bag-of-words
|
21
28
|
# hash with stemmed words as keys and counts as values
|
@@ -35,7 +42,74 @@ First, install HBase / Hadoop. Make sure the HBase Thrift interface has been st
|
|
35
42
|
puts c.log_likelihoods "This is some spammy text"
|
36
43
|
|
37
44
|
# get a list of all classes
|
38
|
-
puts c.
|
45
|
+
puts c.classnames
|
39
46
|
|
40
47
|
# close connection
|
41
|
-
storage.close
|
48
|
+
storage.close
|
49
|
+
|
50
|
+
|
51
|
+
== KL Divergence Classifier
|
52
|
+
There is a Kullback–Leibler divergence classifier as well. KL divergence is a distance measure (though not a true metric, because it is not symmetric and does not satisfy the triangle inequality). The KL classifier simply measures the relative entropy between the text you want to classify and each of the classes. The class with the shortest "distance" is the best class. You may find that for an especially large corpus it is slightly faster to use this classifier (since prior probabilities are never calculated, only likelihoods).
|
53
|
+
|
54
|
+
The API is the same as that of the NaiveBayesClassifier, except that if you want actual numbers you call "distances" rather than "classifications".
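As a rough illustration of what a per-class "distance" means here, the following plain-Ruby sketch computes the KL divergence between a document's word distribution and a Laplace-smoothed class distribution. It is a sketch under stated assumptions, not Ankusa's implementation: `kl_distance` and the sample counts are hypothetical names and data.

```ruby
# Hypothetical helper (not part of Ankusa): KL divergence D(P || Q), where P is
# the document's empirical word distribution and Q is the Laplace-smoothed word
# distribution of one class.
def kl_distance(doc_counts, class_counts, vocab_size)
  doc_total   = doc_counts.values.inject(0) { |sum, n| sum + n }.to_f
  class_total = class_counts.values.inject(0) { |sum, n| sum + n }.to_f
  doc_counts.inject(0.0) do |acc, (word, n)|
    p_doc   = n / doc_total
    q_class = (class_counts.fetch(word, 0) + 1) / (class_total + vocab_size)
    acc + p_doc * Math.log(p_doc / q_class)
  end
end

# Illustrative training counts per class and one document to classify.
training = {
  :spam => { "spam" => 3, "text" => 1 },
  :good => { "good" => 2, "text" => 2 }
}
doc = { "spam" => 1, "text" => 1 }
vocab_size = training.values.map { |counts| counts.keys }.flatten.uniq.size

distances = {}
training.each { |klass, counts| distances[klass] = kl_distance(doc, counts, vocab_size) }

# The class with the smallest distance is the best match.
best = distances.min_by { |_klass, d| d }.first
puts best
```

No class prior appears anywhere in the calculation, which is consistent with the note above that only likelihoods are needed.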
|
55
|
+
|
56
|
+
require 'rubygems'
|
57
|
+
require 'ankusa'
|
58
|
+
require 'ankusa/hbase_storage'
|
59
|
+
|
60
|
+
# connect to HBase
|
61
|
+
storage = Ankusa::HBaseStorage.new 'localhost'
|
62
|
+
c = Ankusa::KLDivergenceClassifier.new storage
|
63
|
+
|
64
|
+
# Each of these calls will return a bag-of-words
|
65
|
+
# hash with stemmed words as keys and counts as values
|
66
|
+
c.train :spam, "This is some spammy text"
|
67
|
+
c.train :good, "This is not the bad stuff"
|
68
|
+
|
69
|
+
# This will return the most likely class (as symbol)
|
70
|
+
puts c.classify "This is some spammy text"
|
71
|
+
|
72
|
+
# This will return Hash with classes as keys and
|
73
|
+
# distances >= 0 as values
|
74
|
+
puts c.distances "This is some spammy text"
|
75
|
+
|
76
|
+
# get a list of all classes
|
77
|
+
puts c.classnames
|
78
|
+
|
79
|
+
# close connection
|
80
|
+
storage.close
|
81
|
+
|
82
|
+
== Storage Methods
|
83
|
+
Ankusa has a generalized storage interface that has been implemented for HBase, Cassandra, and in-memory storage.
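To give a feel for what a storage backend has to provide, here is a hypothetical, heavily reduced in-memory store. The method names are borrowed from the CassandraStorage method list elsewhere in this diff, but the class name `TinyStorage`, the argument order of `incr_word_count`, and the internal layout are assumptions; this is not Ankusa's MemoryStorage.

```ruby
# Hypothetical sketch (not Ankusa's MemoryStorage): a tiny in-memory store
# exposing a subset of the storage method names listed in the generated docs.
class TinyStorage
  def initialize
    # word => { class => count }
    @word_counts = Hash.new { |hash, word| hash[word] = Hash.new(0) }
    @total_word_counts = Hash.new(0)
    @doc_counts = Hash.new(0)
  end

  # Names of the distinct classes seen so far, e.g. [:spam, :good]
  def classnames
    @doc_counts.keys
  end

  # Assumed signature; the real one is not shown in this diff.
  def incr_word_count(klass, word, count)
    @word_counts[word][klass] += count
  end

  def incr_total_word_count(klass, count)
    @total_word_counts[klass] += count
  end

  def incr_doc_count(klass, count)
    @doc_counts[klass] += count
  end

  # Hash of class => count for one word; empty if the word is unknown.
  def get_word_counts(word)
    @word_counts.key?(word) ? @word_counts[word].dup : {}
  end

  def get_total_word_count(klass)
    @total_word_counts[klass]
  end

  def get_doc_count(klass)
    @doc_counts[klass]
  end

  # Nothing to release for an in-memory store.
  def close
  end
end

store = TinyStorage.new
store.incr_doc_count(:spam, 1)
store.incr_word_count(:spam, "text", 2)
store.incr_total_word_count(:spam, 2)
puts store.get_word_counts("text").inspect
```

Keeping one row per word with one column per class mirrors the layout described for `get_word_counts` in the Cassandra docs, where the column name is the class and the column value is the count.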
|
84
|
+
|
85
|
+
Memory storage can be used when you have a very small corpus:
|
86
|
+
require 'ankusa/memory_storage'
|
87
|
+
storage = Ankusa::MemoryStorage.new
|
88
|
+
|
89
|
+
HBase storage:
|
90
|
+
require 'ankusa/hbase_storage'
|
91
|
+
# defaults: host='localhost', port=9090, frequency_tablename="ankusa_word_frequencies", summary_tablename="ankusa_summary"
|
92
|
+
storage = Ankusa::HBaseStorage.new host, port, frequency_tablename, summary_tablename
|
93
|
+
|
94
|
+
For Cassandra storage:
|
95
|
+
* You will need Cassandra version 0.7.0-rc2 or greater.
|
96
|
+
* You will need to set a maximum number of classes, since the current implementation of the Ruby Cassandra client doesn't support table scans.
|
97
|
+
* Prior to using the Cassandra storage you will need to run the following command from the cassandra-cli: "create keyspace ankusa with replication_factor = 1". This should be fixed with a new release candidate for Cassandra.
|
98
|
+
|
99
|
+
To use the Cassandra storage class:
|
100
|
+
require 'ankusa/cassandra_storage'
|
101
|
+
# defaults: host='127.0.0.1', port=9160, keyspace = 'ankusa', max_classes = 100
|
102
|
+
storage = Ankusa::CassandraStorage.new host, port, keyspace, max_classes
|
103
|
+
|
104
|
+
|
105
|
+
== Running Tests
|
106
|
+
You can run the tests for any of the three storage methods. For instance, for memory storage:
|
107
|
+
rake test_memory
|
108
|
+
|
109
|
+
For the other methods you will need to edit the file test/config.yml and set the configuration params. Then:
|
110
|
+
rake test_hbase
|
111
|
+
# or
|
112
|
+
rake test_cassandra
|
113
|
+
|
114
|
+
|
115
|
+
|
data/Rakefile
CHANGED
@@ -12,28 +12,40 @@ Rake::RDocTask.new("doc") { |rdoc|
|
|
12
12
|
rdoc.rdoc_files.include('lib/**/*.rb')
|
13
13
|
}
|
14
14
|
|
15
|
-
|
16
|
-
|
17
|
-
Rake::TestTask.new("test") { |t|
|
15
|
+
desc "Run all unit tests with memory storage"
|
16
|
+
Rake::TestTask.new("test_memory") { |t|
|
18
17
|
t.libs << "lib"
|
19
|
-
t.test_files = FileList['test
|
18
|
+
t.test_files = FileList['test/hasher_test.rb', 'test/memory_classifier_test.rb']
|
19
|
+
t.verbose = true
|
20
|
+
}
|
21
|
+
|
22
|
+
desc "Run all unit tests with HBase storage"
|
23
|
+
Rake::TestTask.new("test_hbase") { |t|
|
24
|
+
t.libs << "lib"
|
25
|
+
t.test_files = FileList['test/hasher_test.rb', 'test/memory_hbase_test.rb']
|
26
|
+
t.verbose = true
|
27
|
+
}
|
28
|
+
|
29
|
+
desc "Run all unit tests with Cassandra storage"
|
30
|
+
Rake::TestTask.new("test_cassandra") { |t|
|
31
|
+
t.libs << "lib"
|
32
|
+
t.test_files = FileList['test/hasher_test.rb', 'test/cassandra_classifier_test.rb']
|
20
33
|
t.verbose = true
|
21
34
|
}
|
22
35
|
|
23
36
|
spec = Gem::Specification.new do |s|
|
24
37
|
s.name = "ankusa"
|
25
|
-
s.version = "0.0.
|
38
|
+
s.version = "0.0.7"
|
26
39
|
s.authors = ["Brian Muller"]
|
27
|
-
s.date = %q{2010-12-
|
28
|
-
s.description = "Text classifier with HBase storage"
|
29
|
-
s.summary = "Text classifier in Ruby that uses Hadoop's HBase for storage"
|
40
|
+
s.date = %q{2010-12-12}
|
41
|
+
s.description = "Text classifier with HBase or Cassandra storage"
|
42
|
+
s.summary = "Text classifier in Ruby that uses Hadoop's HBase or Cassandra for storage"
|
30
43
|
s.email = "brian.muller@livingsocial.com"
|
31
44
|
s.files = FileList["lib/**/*", "[A-Z]*", "Rakefile", "docs/**/*"]
|
32
45
|
s.homepage = "https://github.com/livingsocial/ankusa"
|
33
46
|
s.require_paths = ["lib"]
|
34
|
-
s.rubygems_version = "1.3.5"
|
35
|
-
s.add_dependency('hbaserb', '>= 0.0.3')
|
36
47
|
s.add_dependency('fast-stemmer', '>= 1.0.0')
|
48
|
+
s.requirements << "Either hbaserb >= 0.0.3 or cassandra >= 0.7"
|
37
49
|
end
|
38
50
|
|
39
51
|
Rake::GemPackageTask.new(spec) do |pkg|
|
data/docs/classes/Ankusa.html
CHANGED
@@ -55,6 +55,10 @@
|
|
55
55
|
<tr class="top-aligned-row">
|
56
56
|
<td><strong>In:</strong></td>
|
57
57
|
<td>
|
58
|
+
<a href="../files/lib/ankusa/cassandra_storage_rb.html">
|
59
|
+
lib/ankusa/cassandra_storage.rb
|
60
|
+
</a>
|
61
|
+
<br />
|
58
62
|
<a href="../files/lib/ankusa/classifier_rb.html">
|
59
63
|
lib/ankusa/classifier.rb
|
60
64
|
</a>
|
@@ -66,10 +70,18 @@
|
|
66
70
|
<a href="../files/lib/ankusa/hbase_storage_rb.html">
|
67
71
|
lib/ankusa/hbase_storage.rb
|
68
72
|
</a>
|
73
|
+
<br />
|
74
|
+
<a href="../files/lib/ankusa/kl_divergence_rb.html">
|
75
|
+
lib/ankusa/kl_divergence.rb
|
76
|
+
</a>
|
69
77
|
<br />
|
70
78
|
<a href="../files/lib/ankusa/memory_storage_rb.html">
|
71
79
|
lib/ankusa/memory_storage.rb
|
72
80
|
</a>
|
81
|
+
<br />
|
82
|
+
<a href="../files/lib/ankusa/naive_bayes_rb.html">
|
83
|
+
lib/ankusa/naive_bayes.rb
|
84
|
+
</a>
|
73
85
|
<br />
|
74
86
|
<a href="../files/lib/ankusa/stopwords_rb.html">
|
75
87
|
lib/ankusa/stopwords.rb
|
@@ -88,6 +100,19 @@
|
|
88
100
|
|
89
101
|
<div id="contextContent">
|
90
102
|
|
103
|
+
<div id="description">
|
104
|
+
<p>
|
105
|
+
At the moment you‘ll have to do:
|
106
|
+
</p>
|
107
|
+
<p>
|
108
|
+
create keyspace ankusa with replication_factor = 1
|
109
|
+
</p>
|
110
|
+
<p>
|
111
|
+
from the cassandra-cli. This should be fixed with new release candidate for
|
112
|
+
cassandra
|
113
|
+
</p>
|
114
|
+
|
115
|
+
</div>
|
91
116
|
|
92
117
|
|
93
118
|
</div>
|
@@ -103,9 +128,12 @@
|
|
103
128
|
<div id="class-list">
|
104
129
|
<h3 class="section-bar">Classes and Modules</h3>
|
105
130
|
|
106
|
-
|
131
|
+
Module <a href="Ankusa/Classifier.html" class="link">Ankusa::Classifier</a><br />
|
132
|
+
Class <a href="Ankusa/CassandraStorage.html" class="link">Ankusa::CassandraStorage</a><br />
|
107
133
|
Class <a href="Ankusa/HBaseStorage.html" class="link">Ankusa::HBaseStorage</a><br />
|
134
|
+
Class <a href="Ankusa/KLDivergenceClassifier.html" class="link">Ankusa::KLDivergenceClassifier</a><br />
|
108
135
|
Class <a href="Ankusa/MemoryStorage.html" class="link">Ankusa::MemoryStorage</a><br />
|
136
|
+
Class <a href="Ankusa/NaiveBayesClassifier.html" class="link">Ankusa::NaiveBayesClassifier</a><br />
|
109
137
|
Class <a href="Ankusa/TextHash.html" class="link">Ankusa::TextHash</a><br />
|
110
138
|
|
111
139
|
</div>
|
@@ -0,0 +1,615 @@
|
|
1
|
+
<?xml version="1.0" encoding="iso-8859-1"?>
|
2
|
+
<!DOCTYPE html
|
3
|
+
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
|
4
|
+
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
|
5
|
+
|
6
|
+
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
|
7
|
+
<head>
|
8
|
+
<title>Class: Ankusa::CassandraStorage</title>
|
9
|
+
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
|
10
|
+
<meta http-equiv="Content-Script-Type" content="text/javascript" />
|
11
|
+
<link rel="stylesheet" href="../.././rdoc-style.css" type="text/css" media="screen" />
|
12
|
+
<script type="text/javascript">
|
13
|
+
// <![CDATA[
|
14
|
+
|
15
|
+
function popupCode( url ) {
|
16
|
+
window.open(url, "Code", "resizable=yes,scrollbars=yes,toolbar=no,status=no,height=150,width=400")
|
17
|
+
}
|
18
|
+
|
19
|
+
function toggleCode( id ) {
|
20
|
+
if ( document.getElementById )
|
21
|
+
elem = document.getElementById( id );
|
22
|
+
else if ( document.all )
|
23
|
+
elem = eval( "document.all." + id );
|
24
|
+
else
|
25
|
+
return false;
|
26
|
+
|
27
|
+
elemStyle = elem.style;
|
28
|
+
|
29
|
+
if ( elemStyle.display != "block" ) {
|
30
|
+
elemStyle.display = "block"
|
31
|
+
} else {
|
32
|
+
elemStyle.display = "none"
|
33
|
+
}
|
34
|
+
|
35
|
+
return true;
|
36
|
+
}
|
37
|
+
|
38
|
+
// Make codeblocks hidden by default
|
39
|
+
document.writeln( "<style type=\"text/css\">div.method-source-code { display: none }</style>" )
|
40
|
+
|
41
|
+
// ]]>
|
42
|
+
</script>
|
43
|
+
|
44
|
+
</head>
|
45
|
+
<body>
|
46
|
+
|
47
|
+
|
48
|
+
|
49
|
+
<div id="classHeader">
|
50
|
+
<table class="header-table">
|
51
|
+
<tr class="top-aligned-row">
|
52
|
+
<td><strong>Class</strong></td>
|
53
|
+
<td class="class-name-in-header">Ankusa::CassandraStorage</td>
|
54
|
+
</tr>
|
55
|
+
<tr class="top-aligned-row">
|
56
|
+
<td><strong>In:</strong></td>
|
57
|
+
<td>
|
58
|
+
<a href="../../files/lib/ankusa/cassandra_storage_rb.html">
|
59
|
+
lib/ankusa/cassandra_storage.rb
|
60
|
+
</a>
|
61
|
+
<br />
|
62
|
+
</td>
|
63
|
+
</tr>
|
64
|
+
|
65
|
+
<tr class="top-aligned-row">
|
66
|
+
<td><strong>Parent:</strong></td>
|
67
|
+
<td>
|
68
|
+
Object
|
69
|
+
</td>
|
70
|
+
</tr>
|
71
|
+
</table>
|
72
|
+
</div>
|
73
|
+
<!-- banner header -->
|
74
|
+
|
75
|
+
<div id="bodyContent">
|
76
|
+
|
77
|
+
|
78
|
+
|
79
|
+
<div id="contextContent">
|
80
|
+
|
81
|
+
|
82
|
+
|
83
|
+
</div>
|
84
|
+
|
85
|
+
<div id="method-list">
|
86
|
+
<h3 class="section-bar">Methods</h3>
|
87
|
+
|
88
|
+
<div class="name-list">
|
89
|
+
<a href="#M000010">classnames</a>
|
90
|
+
<a href="#M000022">close</a>
|
91
|
+
<a href="#M000021">doc_count_totals</a>
|
92
|
+
<a href="#M000012">drop_tables</a>
|
93
|
+
<a href="#M000017">get_doc_count</a>
|
94
|
+
<a href="#M000023">get_summary</a>
|
95
|
+
<a href="#M000016">get_total_word_count</a>
|
96
|
+
<a href="#M000015">get_vocabulary_sizes</a>
|
97
|
+
<a href="#M000014">get_word_counts</a>
|
98
|
+
<a href="#M000020">incr_doc_count</a>
|
99
|
+
<a href="#M000019">incr_total_word_count</a>
|
100
|
+
<a href="#M000018">incr_word_count</a>
|
101
|
+
<a href="#M000013">init_tables</a>
|
102
|
+
<a href="#M000009">new</a>
|
103
|
+
<a href="#M000011">reset</a>
|
104
|
+
</div>
|
105
|
+
</div>
|
106
|
+
|
107
|
+
</div>
|
108
|
+
|
109
|
+
|
110
|
+
<!-- if includes -->
|
111
|
+
|
112
|
+
<div id="section">
|
113
|
+
|
114
|
+
|
115
|
+
|
116
|
+
|
117
|
+
|
118
|
+
<div id="attribute-list">
|
119
|
+
<h3 class="section-bar">Attributes</h3>
|
120
|
+
|
121
|
+
<div class="name-list">
|
122
|
+
<table>
|
123
|
+
<tr class="top-aligned-row context-row">
|
124
|
+
<td class="context-item-name">cassandra</td>
|
125
|
+
<td class="context-item-value"> [R] </td>
|
126
|
+
<td class="context-item-desc"></td>
|
127
|
+
</tr>
|
128
|
+
</table>
|
129
|
+
</div>
|
130
|
+
</div>
|
131
|
+
|
132
|
+
|
133
|
+
|
134
|
+
<!-- if method_list -->
|
135
|
+
<div id="methods">
|
136
|
+
<h3 class="section-bar">Public Class methods</h3>
|
137
|
+
|
138
|
+
<div id="method-M000009" class="method-detail">
|
139
|
+
<a name="M000009"></a>
|
140
|
+
|
141
|
+
<div class="method-heading">
|
142
|
+
<a href="#M000009" class="method-signature">
|
143
|
+
<span class="method-name">new</span><span class="method-args">(host='127.0.0.1', port=9160, keyspace = 'ankusa', max_classes = 100)</span>
|
144
|
+
</a>
|
145
|
+
</div>
|
146
|
+
|
147
|
+
<div class="method-description">
|
148
|
+
<p>
|
149
|
+
Necessary to set max classes since current implementation of ruby cassandra
|
150
|
+
client doesn‘t support table scans. Using crufty get_range method at
|
151
|
+
the moment.
|
152
|
+
</p>
|
153
|
+
<p><a class="source-toggle" href="#"
|
154
|
+
onclick="toggleCode('M000009-source');return false;">[Source]</a></p>
|
155
|
+
<div class="method-source-code" id="M000009-source">
|
156
|
+
<pre>
|
157
|
+
<span class="ruby-comment cmt"># File lib/ankusa/cassandra_storage.rb, line 21</span>
|
158
|
+
<span class="ruby-keyword kw">def</span> <span class="ruby-identifier">initialize</span>(<span class="ruby-identifier">host</span>=<span class="ruby-value str">'127.0.0.1'</span>, <span class="ruby-identifier">port</span>=<span class="ruby-value">9160</span>, <span class="ruby-identifier">keyspace</span> = <span class="ruby-value str">'ankusa'</span>, <span class="ruby-identifier">max_classes</span> = <span class="ruby-value">100</span>)
|
159
|
+
<span class="ruby-ivar">@cassandra</span> = <span class="ruby-constant">Cassandra</span>.<span class="ruby-identifier">new</span>(<span class="ruby-value str">'system'</span>, <span class="ruby-node">"#{host}:#{port}"</span>)
|
160
|
+
<span class="ruby-ivar">@klass_word_counts</span> = {}
|
161
|
+
<span class="ruby-ivar">@klass_doc_counts</span> = {}
|
162
|
+
<span class="ruby-ivar">@keyspace</span> = <span class="ruby-identifier">keyspace</span>
|
163
|
+
<span class="ruby-ivar">@max_classes</span> = <span class="ruby-identifier">max_classes</span>
|
164
|
+
<span class="ruby-identifier">init_tables</span>
|
165
|
+
<span class="ruby-keyword kw">end</span>
|
166
|
+
</pre>
|
167
|
+
</div>
|
168
|
+
</div>
|
169
|
+
</div>
|
170
|
+
|
171
|
+
<h3 class="section-bar">Public Instance methods</h3>
|
172
|
+
|
173
|
+
<div id="method-M000010" class="method-detail">
|
174
|
+
<a name="M000010"></a>
|
175
|
+
|
176
|
+
<div class="method-heading">
|
177
|
+
<a href="#M000010" class="method-signature">
|
178
|
+
<span class="method-name">classnames</span><span class="method-args">()</span>
|
179
|
+
</a>
|
180
|
+
</div>
|
181
|
+
|
182
|
+
<div class="method-description">
|
183
|
+
<p>
|
184
|
+
Fetch the names of the distinct classes for classification: eg. :spam,
|
185
|
+
:good, etc
|
186
|
+
</p>
|
187
|
+
<p><a class="source-toggle" href="#"
|
188
|
+
onclick="toggleCode('M000010-source');return false;">[Source]</a></p>
|
189
|
+
<div class="method-source-code" id="M000010-source">
|
190
|
+
<pre>
|
191
|
+
<span class="ruby-comment cmt"># File lib/ankusa/cassandra_storage.rb, line 34</span>
|
192
|
+
<span class="ruby-keyword kw">def</span> <span class="ruby-identifier">classnames</span>
|
193
|
+
<span class="ruby-ivar">@cassandra</span>.<span class="ruby-identifier">get_range</span>(<span class="ruby-identifier">:totals</span>, {<span class="ruby-identifier">:start</span> =<span class="ruby-operator">></span> <span class="ruby-value str">''</span>, <span class="ruby-identifier">:finish</span> =<span class="ruby-operator">></span> <span class="ruby-value str">''</span>, <span class="ruby-identifier">:count</span> =<span class="ruby-operator">></span> <span class="ruby-ivar">@max_classes</span>}).<span class="ruby-identifier">inject</span>([]) <span class="ruby-keyword kw">do</span> <span class="ruby-operator">|</span><span class="ruby-identifier">cs</span>, <span class="ruby-identifier">key_slice</span><span class="ruby-operator">|</span>
|
194
|
+
<span class="ruby-identifier">cs</span> <span class="ruby-operator"><<</span> <span class="ruby-identifier">key_slice</span>.<span class="ruby-identifier">key</span>.<span class="ruby-identifier">to_sym</span>
|
195
|
+
<span class="ruby-keyword kw">end</span>
|
196
|
+
<span class="ruby-keyword kw">end</span>
|
197
|
+
</pre>
|
198
|
+
</div>
|
199
|
+
</div>
|
200
|
+
</div>
|
201
|
+
|
202
|
+
<div id="method-M000022" class="method-detail">
|
203
|
+
<a name="M000022"></a>
|
204
|
+
|
205
|
+
<div class="method-heading">
|
206
|
+
<a href="#M000022" class="method-signature">
|
207
|
+
<span class="method-name">close</span><span class="method-args">()</span>
|
208
|
+
</a>
|
209
|
+
</div>
|
210
|
+
|
211
|
+
<div class="method-description">
|
212
|
+
<p>
|
213
|
+
Doesn‘t do anything
|
214
|
+
</p>
|
215
|
+
<p><a class="source-toggle" href="#"
|
216
|
+
onclick="toggleCode('M000022-source');return false;">[Source]</a></p>
|
217
|
+
<div class="method-source-code" id="M000022-source">
|
218
|
+
<pre>
|
219
|
+
<span class="ruby-comment cmt"># File lib/ankusa/cassandra_storage.rb, line 174</span>
|
220
|
+
<span class="ruby-keyword kw">def</span> <span class="ruby-identifier">close</span>
|
221
|
+
<span class="ruby-keyword kw">end</span>
|
222
|
+
</pre>
|
223
|
+
</div>
|
224
|
+
</div>
|
225
|
+
</div>
|
226
|
+
|
227
|
+
<div id="method-M000021" class="method-detail">
|
228
|
+
<a name="M000021"></a>
|
229
|
+
|
230
|
+
<div class="method-heading">
|
231
|
+
<a href="#M000021" class="method-signature">
|
232
|
+
<span class="method-name">doc_count_totals</span><span class="method-args">()</span>
|
233
|
+
</a>
|
234
|
+
</div>
|
235
|
+
|
236
|
+
<div class="method-description">
|
237
|
+
<p><a class="source-toggle" href="#"
|
238
|
+
onclick="toggleCode('M000021-source');return false;">[Source]</a></p>
|
239
|
+
<div class="method-source-code" id="M000021-source">
|
240
|
+
<pre>
|
241
|
+
<span class="ruby-comment cmt"># File lib/ankusa/cassandra_storage.rb, line 167</span>
|
242
|
+
<span class="ruby-keyword kw">def</span> <span class="ruby-identifier">doc_count_totals</span>
|
243
|
+
<span class="ruby-identifier">get_summary</span> <span class="ruby-value str">"doc_count"</span>
|
244
|
+
<span class="ruby-keyword kw">end</span>
|
245
|
+
</pre>
|
246
|
+
</div>
|
247
|
+
</div>
|
248
|
+
</div>
|
249
|
+
|
250
|
+
<div id="method-M000012" class="method-detail">
|
251
|
+
<a name="M000012"></a>
|
252
|
+
|
253
|
+
<div class="method-heading">
|
254
|
+
<a href="#M000012" class="method-signature">
|
255
|
+
<span class="method-name">drop_tables</span><span class="method-args">()</span>
|
256
|
+
</a>
|
257
|
+
</div>
|
258
|
+
|
259
|
+
<div class="method-description">
|
260
|
+
<p>
|
261
|
+
Drop ankusa keyspace, <a href="CassandraStorage.html#M000011">reset</a>
|
262
|
+
internal caches
|
263
|
+
</p>
|
264
|
+
<p>
|
265
|
+
FIXME: truncate doesn‘t work with cassandra-beta2
|
266
|
+
</p>
|
267
|
+
<p><a class="source-toggle" href="#"
|
268
|
+
onclick="toggleCode('M000012-source');return false;">[Source]</a></p>
|
269
|
+
<div class="method-source-code" id="M000012-source">
|
270
|
+
<pre>
|
271
|
+
<span class="ruby-comment cmt"># File lib/ankusa/cassandra_storage.rb, line 50</span>
|
272
|
+
<span class="ruby-keyword kw">def</span> <span class="ruby-identifier">drop_tables</span>
|
273
|
+
<span class="ruby-ivar">@cassandra</span>.<span class="ruby-identifier">truncate!</span>(<span class="ruby-value str">'classes'</span>)
|
274
|
+
<span class="ruby-ivar">@cassandra</span>.<span class="ruby-identifier">truncate!</span>(<span class="ruby-value str">'totals'</span>)
|
275
|
+
<span class="ruby-ivar">@cassandra</span>.<span class="ruby-identifier">drop_keyspace</span>(<span class="ruby-ivar">@keyspace</span>)
|
276
|
+
<span class="ruby-ivar">@klass_word_counts</span> = {}
|
277
|
+
<span class="ruby-ivar">@klass_doc_counts</span> = {}
|
278
|
+
<span class="ruby-keyword kw">end</span>
|
279
|
+
</pre>
|
280
|
+
</div>
|
281
|
+
</div>
|
282
|
+
</div>
|
283
|
+
|
284
|
+
<div id="method-M000017" class="method-detail">
|
285
|
+
<a name="M000017"></a>
|
286
|
+
|
287
|
+
<div class="method-heading">
|
288
|
+
<a href="#M000017" class="method-signature">
|
289
|
+
<span class="method-name">get_doc_count</span><span class="method-args">(klass)</span>
|
290
|
+
</a>
|
291
|
+
</div>
|
292
|
+
|
293
|
+
<div class="method-description">
|
294
|
+
<p>
|
295
|
+
Fetch total documents for a given class and cache it
|
296
|
+
</p>
|
297
|
+
<p><a class="source-toggle" href="#"
|
298
|
+
onclick="toggleCode('M000017-source');return false;">[Source]</a></p>
|
299
|
+
<div class="method-source-code" id="M000017-source">
|
300
|
+
<pre>
|
301
|
+
<span class="ruby-comment cmt"># File lib/ankusa/cassandra_storage.rb, line 109</span>
|
302
|
+
<span class="ruby-keyword kw">def</span> <span class="ruby-identifier">get_doc_count</span>(<span class="ruby-identifier">klass</span>)
|
303
|
+
<span class="ruby-ivar">@klass_doc_counts</span>[<span class="ruby-identifier">klass</span>] = <span class="ruby-ivar">@cassandra</span>.<span class="ruby-identifier">get</span>(<span class="ruby-identifier">:totals</span>, <span class="ruby-identifier">klass</span>.<span class="ruby-identifier">to_s</span>, <span class="ruby-value str">"doc_count"</span>).<span class="ruby-identifier">values</span>.<span class="ruby-identifier">last</span>.<span class="ruby-identifier">to_f</span>
|
304
|
+
<span class="ruby-keyword kw">end</span>
|
305
|
+
</pre>
|
306
|
+
</div>
|
307
|
+
</div>
|
308
|
+
</div>
|
309
|
+
|
310
|
+
<div id="method-M000016" class="method-detail">
|
311
|
+
<a name="M000016"></a>
|
312
|
+
|
313
|
+
<div class="method-heading">
|
314
|
+
<a href="#M000016" class="method-signature">
|
315
|
+
<span class="method-name">get_total_word_count</span><span class="method-args">(klass)</span>
|
316
|
+
</a>
|
317
|
+
</div>
|
318
|
+
|
319
|
+
<div class="method-description">
|
320
|
+
<p>
|
321
|
+
Fetch total word count for a given class and cache it
|
322
|
+
</p>
|
323
|
+
<p><a class="source-toggle" href="#"
|
324
|
+
onclick="toggleCode('M000016-source');return false;">[Source]</a></p>
|
325
|
+
<div class="method-source-code" id="M000016-source">
|
326
|
+
<pre>
|
327
|
+
<span class="ruby-comment cmt"># File lib/ankusa/cassandra_storage.rb, line 102</span>
|
328
|
+
<span class="ruby-keyword kw">def</span> <span class="ruby-identifier">get_total_word_count</span>(<span class="ruby-identifier">klass</span>)
|
329
|
+
<span class="ruby-ivar">@klass_word_counts</span>[<span class="ruby-identifier">klass</span>] = <span class="ruby-ivar">@cassandra</span>.<span class="ruby-identifier">get</span>(<span class="ruby-identifier">:totals</span>, <span class="ruby-identifier">klass</span>.<span class="ruby-identifier">to_s</span>, <span class="ruby-value str">"wordcount"</span>).<span class="ruby-identifier">values</span>.<span class="ruby-identifier">last</span>.<span class="ruby-identifier">to_f</span>
|
330
|
+
<span class="ruby-keyword kw">end</span>
|
331
|
+
</pre>
|
332
|
+
</div>
|
333
|
+
</div>
|
334
|
+
</div>
|
335
|
+
|
336
|
+
<div id="method-M000015" class="method-detail">
|
337
|
+
<a name="M000015"></a>
|
338
|
+
|
339
|
+
<div class="method-heading">
|
340
|
+
<a href="#M000015" class="method-signature">
|
341
|
+
<span class="method-name">get_vocabulary_sizes</span><span class="method-args">()</span>
|
342
|
+
</a>
|
343
|
+
</div>
|
344
|
+
|
345
|
+
<div class="method-description">
|
346
|
+
<p>
|
347
|
+
Does a table ‘scan’ of summary table pulling out the
|
348
|
+
‘vocabsize’ column from each row. Generates a hash of (class,
|
349
|
+
vocab_size) key value pairs
|
350
|
+
</p>
|
351
|
+
<p><a class="source-toggle" href="#"
|
352
|
+
onclick="toggleCode('M000015-source');return false;">[Source]</a></p>
|
353
|
+
<div class="method-source-code" id="M000015-source">
|
354
|
+
<pre>
|
355
|
+
<span class="ruby-comment cmt"># File lib/ankusa/cassandra_storage.rb, line 95</span>
|
356
|
+
<span class="ruby-keyword kw">def</span> <span class="ruby-identifier">get_vocabulary_sizes</span>
|
357
|
+
<span class="ruby-identifier">get_summary</span> <span class="ruby-value str">"vocabsize"</span>
|
358
|
+
<span class="ruby-keyword kw">end</span>
|
359
|
+
</pre>
|
360
|
+
</div>
|
361
|
+
</div>
|
362
|
+
</div>
|
363
|
+
|
364
|
+
<div id="method-M000014" class="method-detail">
|
365
|
+
<a name="M000014"></a>
|
366
|
+
|
367
|
+
<div class="method-heading">
|
368
|
+
<a href="#M000014" class="method-signature">
|
369
|
+
<span class="method-name">get_word_counts</span><span class="method-args">(word)</span>
|
370
|
+
</a>
|
371
|
+
</div>
|
372
|
+
|
373
|
+
<div class="method-description">
|
374
|
+
<p>
|
375
|
+
Fetch hash of word counts as a single row from cassandra. Here column_name
|
376
|
+
is the class and column value is the count
|
377
|
+
</p>
|
378
|
+
<p><a class="source-toggle" href="#"
|
379
|
+
onclick="toggleCode('M000014-source');return false;">[Source]</a></p>
|
380
|
+
<div class="method-source-code" id="M000014-source">
|
381
|
+
<pre>
|
382
|
+
<span class="ruby-comment cmt"># File lib/ankusa/cassandra_storage.rb, line 84</span>
|
383
|
+
<span class="ruby-keyword kw">def</span> <span class="ruby-identifier">get_word_counts</span>(<span class="ruby-identifier">word</span>)
|
384
|
+
<span class="ruby-comment cmt"># fetch all (class,count) pairs for a given word</span>
|
385
|
+
<span class="ruby-identifier">row</span> = <span class="ruby-ivar">@cassandra</span>.<span class="ruby-identifier">get</span>(<span class="ruby-identifier">:classes</span>, <span class="ruby-identifier">word</span>.<span class="ruby-identifier">to_s</span>)
|
386
|
+
<span class="ruby-keyword kw">return</span> <span class="ruby-identifier">row</span>.<span class="ruby-identifier">to_hash</span> <span class="ruby-keyword kw">if</span> <span class="ruby-identifier">row</span>.<span class="ruby-identifier">empty?</span>
|
387
|
+
<span class="ruby-identifier">row</span>.<span class="ruby-identifier">inject</span>({}){<span class="ruby-operator">|</span><span class="ruby-identifier">counts</span>, <span class="ruby-identifier">col</span><span class="ruby-operator">|</span> <span class="ruby-identifier">counts</span>[<span class="ruby-identifier">col</span>.<span class="ruby-identifier">first</span>.<span class="ruby-identifier">to_sym</span>] = [<span class="ruby-identifier">col</span>.<span class="ruby-identifier">last</span>.<span class="ruby-identifier">to_f</span>,<span class="ruby-value">0</span>].<span class="ruby-identifier">max</span>; <span class="ruby-identifier">counts</span>}
|
388
|
+
<span class="ruby-keyword kw">end</span>
|
389
|
+
</pre>
|
390
|
+
</div>
|
391
|
+
</div>
|
392
|
+
</div>
|
393
|
+
|
394
|
+
<div id="method-M000020" class="method-detail">
|
395
|
+
<a name="M000020"></a>
|
396
|
+
|
397
|
+
<div class="method-heading">
|
398
|
+
<a href="#M000020" class="method-signature">
|
399
|
+
<span class="method-name">incr_doc_count</span><span class="method-args">(klass, count)</span>
|
400
|
+
</a>
|
401
|
+
</div>
|
402
|
+
|
403
|
+
<div class="method-description">
|
404
|
+
<p>
|
405
|
+
Increment total document count for a given class by ‘count‘
|
406
|
+
</p>
|
407
|
+
<p><a class="source-toggle" href="#"
|
408
|
+
onclick="toggleCode('M000020-source');return false;">[Source]</a></p>
|
409
|
+
<div class="method-source-code" id="M000020-source">
|
410
|
+
<pre>
|
411
|
+
<span class="ruby-comment cmt"># File lib/ankusa/cassandra_storage.rb, line 159</span>
|
412
|
+
<span class="ruby-keyword kw">def</span> <span class="ruby-identifier">incr_doc_count</span>(<span class="ruby-identifier">klass</span>, <span class="ruby-identifier">count</span>)
|
413
|
+
<span class="ruby-identifier">klass</span> = <span class="ruby-identifier">klass</span>.<span class="ruby-identifier">to_s</span>
|
414
|
+
<span class="ruby-identifier">doc_count</span> = <span class="ruby-ivar">@cassandra</span>.<span class="ruby-identifier">get</span>(<span class="ruby-identifier">:totals</span>, <span class="ruby-identifier">klass</span>, <span class="ruby-value str">"doc_count"</span>).<span class="ruby-identifier">values</span>.<span class="ruby-identifier">last</span>.<span class="ruby-identifier">to_i</span>
|
415
|
+
<span class="ruby-identifier">doc_count</span> <span class="ruby-operator">+=</span> <span class="ruby-identifier">count</span>
|
416
|
+
<span class="ruby-ivar">@cassandra</span>.<span class="ruby-identifier">insert</span>(<span class="ruby-identifier">:totals</span>, <span class="ruby-identifier">klass</span>, {<span class="ruby-value str">"doc_count"</span> =<span class="ruby-operator">></span> <span class="ruby-identifier">doc_count</span>.<span class="ruby-identifier">to_s</span>})
|
417
|
+
<span class="ruby-ivar">@klass_doc_counts</span>[<span class="ruby-identifier">klass</span>.<span class="ruby-identifier">to_sym</span>] = <span class="ruby-identifier">doc_count</span>
|
418
|
+
<span class="ruby-keyword kw">end</span>
|
419
|
+
</pre>
|
420
|
+
</div>
|
421
|
+
</div>
|
422
|
+
</div>
|
423
|
+
|
424
|
+
<div id="method-M000019" class="method-detail">
|
425
|
+
<a name="M000019"></a>
|
426
|
+
|
427
|
+
<div class="method-heading">
|
428
|
+
<a href="#M000019" class="method-signature">
|
429
|
+
<span class="method-name">incr_total_word_count</span><span class="method-args">(klass, count)</span>
|
430
|
+
</a>
|
431
|
+
</div>
|
432
|
+
|
433
|
+
<div class="method-description">
|
434
|
+
<p>
|
435
|
+
Increment the total word count for a given class by <tt>count</tt>.
|
436
|
+
</p>
|
437
|
+
<p><a class="source-toggle" href="#"
|
438
|
+
onclick="toggleCode('M000019-source');return false;">[Source]</a></p>
|
439
|
+
<div class="method-source-code" id="M000019-source">
|
440
|
+
<pre>
|
441
|
+
<span class="ruby-comment cmt"># File lib/ankusa/cassandra_storage.rb, line 148</span>
|
442
|
+
<span class="ruby-keyword kw">def</span> <span class="ruby-identifier">incr_total_word_count</span>(<span class="ruby-identifier">klass</span>, <span class="ruby-identifier">count</span>)
|
443
|
+
<span class="ruby-identifier">klass</span> = <span class="ruby-identifier">klass</span>.<span class="ruby-identifier">to_s</span>
|
444
|
+
<span class="ruby-identifier">wordcount</span> = <span class="ruby-ivar">@cassandra</span>.<span class="ruby-identifier">get</span>(<span class="ruby-identifier">:totals</span>, <span class="ruby-identifier">klass</span>, <span class="ruby-value str">"wordcount"</span>).<span class="ruby-identifier">values</span>.<span class="ruby-identifier">last</span>.<span class="ruby-identifier">to_i</span>
|
445
|
+
<span class="ruby-identifier">wordcount</span> <span class="ruby-operator">+=</span> <span class="ruby-identifier">count</span>
|
446
|
+
<span class="ruby-ivar">@cassandra</span>.<span class="ruby-identifier">insert</span>(<span class="ruby-identifier">:totals</span>, <span class="ruby-identifier">klass</span>, {<span class="ruby-value str">"wordcount"</span> =<span class="ruby-operator">></span> <span class="ruby-identifier">wordcount</span>.<span class="ruby-identifier">to_s</span>})
|
447
|
+
<span class="ruby-ivar">@klass_word_counts</span>[<span class="ruby-identifier">klass</span>.<span class="ruby-identifier">to_sym</span>] = <span class="ruby-identifier">wordcount</span>
|
448
|
+
<span class="ruby-keyword kw">end</span>
|
449
|
+
</pre>
|
450
|
+
</div>
|
451
|
+
</div>
|
452
|
+
</div>
|
453
|
+
|
454
|
+
<div id="method-M000018" class="method-detail">
|
455
|
+
<a name="M000018"></a>
|
456
|
+
|
457
|
+
<div class="method-heading">
|
458
|
+
<a href="#M000018" class="method-signature">
|
459
|
+
<span class="method-name">incr_word_count</span><span class="method-args">(klass, word, count)</span>
|
460
|
+
</a>
|
461
|
+
</div>
|
462
|
+
|
463
|
+
<div class="method-description">
|
464
|
+
<p>
|
465
|
+
Increment the count for a given (word, class) pair. The Cassandra version
|
466
|
+
targeted here has no atomic increment/decrement, so the count is read,
|
467
|
+
adjusted, and written back; concurrent writers can lose updates. HBase offers atomic increments.
|
468
|
+
</p>
|
469
|
+
<p><a class="source-toggle" href="#"
|
470
|
+
onclick="toggleCode('M000018-source');return false;">[Source]</a></p>
|
471
|
+
<div class="method-source-code" id="M000018-source">
|
472
|
+
<pre>
|
473
|
+
<span class="ruby-comment cmt"># File lib/ankusa/cassandra_storage.rb, line 118</span>
|
474
|
+
<span class="ruby-keyword kw">def</span> <span class="ruby-identifier">incr_word_count</span>(<span class="ruby-identifier">klass</span>, <span class="ruby-identifier">word</span>, <span class="ruby-identifier">count</span>)
|
475
|
+
<span class="ruby-comment cmt"># Only wants strings</span>
|
476
|
+
<span class="ruby-identifier">klass</span> = <span class="ruby-identifier">klass</span>.<span class="ruby-identifier">to_s</span>
|
477
|
+
<span class="ruby-identifier">word</span> = <span class="ruby-identifier">word</span>.<span class="ruby-identifier">to_s</span>
|
478
|
+
|
479
|
+
<span class="ruby-identifier">prior_count</span> = <span class="ruby-ivar">@cassandra</span>.<span class="ruby-identifier">get</span>(<span class="ruby-identifier">:classes</span>, <span class="ruby-identifier">word</span>, <span class="ruby-identifier">klass</span>).<span class="ruby-identifier">values</span>.<span class="ruby-identifier">last</span>.<span class="ruby-identifier">to_i</span>
|
480
|
+
<span class="ruby-identifier">new_count</span> = <span class="ruby-identifier">prior_count</span> <span class="ruby-operator">+</span> <span class="ruby-identifier">count</span>
|
481
|
+
<span class="ruby-ivar">@cassandra</span>.<span class="ruby-identifier">insert</span>(<span class="ruby-identifier">:classes</span>, <span class="ruby-identifier">word</span>, {<span class="ruby-identifier">klass</span> =<span class="ruby-operator">></span> <span class="ruby-identifier">new_count</span>.<span class="ruby-identifier">to_s</span>})
|
482
|
+
|
483
|
+
<span class="ruby-keyword kw">if</span> (<span class="ruby-identifier">prior_count</span> <span class="ruby-operator">==</span> <span class="ruby-value">0</span> <span class="ruby-operator">&&</span> <span class="ruby-identifier">count</span> <span class="ruby-operator">></span> <span class="ruby-value">0</span>)
|
484
|
+
<span class="ruby-comment cmt">#</span>
|
485
|
+
<span class="ruby-comment cmt"># we've never seen this word before and we're not trying to unlearn it</span>
|
486
|
+
<span class="ruby-comment cmt">#</span>
|
487
|
+
<span class="ruby-identifier">vocab_size</span> = <span class="ruby-ivar">@cassandra</span>.<span class="ruby-identifier">get</span>(<span class="ruby-identifier">:totals</span>, <span class="ruby-identifier">klass</span>, <span class="ruby-value str">"vocabsize"</span>).<span class="ruby-identifier">values</span>.<span class="ruby-identifier">last</span>.<span class="ruby-identifier">to_i</span>
|
488
|
+
<span class="ruby-identifier">vocab_size</span> <span class="ruby-operator">+=</span> <span class="ruby-value">1</span>
|
489
|
+
<span class="ruby-ivar">@cassandra</span>.<span class="ruby-identifier">insert</span>(<span class="ruby-identifier">:totals</span>, <span class="ruby-identifier">klass</span>, {<span class="ruby-value str">"vocabsize"</span> =<span class="ruby-operator">></span> <span class="ruby-identifier">vocab_size</span>.<span class="ruby-identifier">to_s</span>})
|
490
|
+
<span class="ruby-keyword kw">elsif</span> <span class="ruby-identifier">new_count</span> <span class="ruby-operator">==</span> <span class="ruby-value">0</span>
|
491
|
+
<span class="ruby-comment cmt">#</span>
|
492
|
+
<span class="ruby-comment cmt"># we've seen this word before but we're trying to unlearn it</span>
|
493
|
+
<span class="ruby-comment cmt">#</span>
|
494
|
+
<span class="ruby-identifier">vocab_size</span> = <span class="ruby-ivar">@cassandra</span>.<span class="ruby-identifier">get</span>(<span class="ruby-identifier">:totals</span>, <span class="ruby-identifier">klass</span>, <span class="ruby-value str">"vocabsize"</span>).<span class="ruby-identifier">values</span>.<span class="ruby-identifier">last</span>.<span class="ruby-identifier">to_i</span>
|
495
|
+
<span class="ruby-identifier">vocab_size</span> <span class="ruby-operator">-=</span> <span class="ruby-value">1</span>
|
496
|
+
<span class="ruby-ivar">@cassandra</span>.<span class="ruby-identifier">insert</span>(<span class="ruby-identifier">:totals</span>, <span class="ruby-identifier">klass</span>, {<span class="ruby-value str">"vocabsize"</span> =<span class="ruby-operator">></span> <span class="ruby-identifier">vocab_size</span>.<span class="ruby-identifier">to_s</span>})
|
497
|
+
<span class="ruby-keyword kw">end</span>
|
498
|
+
<span class="ruby-identifier">new_count</span>
|
499
|
+
<span class="ruby-keyword kw">end</span>
|
500
|
+
</pre>
|
501
|
+
</div>
|
502
|
+
</div>
|
503
|
+
</div>
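The read-modify-write pattern and the vocabulary bookkeeping in <tt>incr_word_count</tt> can be sketched without a running Cassandra node. This is an illustrative sketch only: <tt>new_store</tt> and the nested-hash layout are hypothetical stand-ins for the <tt>:classes</tt> and <tt>:totals</tt> column families, not the cassandra gem's API, and the branch conditions simply mirror the method shown above.

```ruby
# Hypothetical in-memory stand-in for the :classes and :totals column
# families: store[column_family][row_key][column] => string value.
def new_store
  Hash.new { |cfs, cf| cfs[cf] = Hash.new { |rows, key| rows[key] = {} } }
end

# Read the prior count, write back the new one, and adjust the per-class
# vocabulary size when a word appears for the first time or drops to zero.
def incr_word_count(store, klass, word, count)
  klass = klass.to_s
  word = word.to_s

  prior_count = store[:classes][word][klass].to_i # nil.to_i is 0
  new_count = prior_count + count
  store[:classes][word][klass] = new_count.to_s

  delta = if prior_count == 0 && count > 0
            1   # first sighting of this word for this class
          elsif new_count == 0
            -1  # the word has been fully unlearned
          else
            0
          end
  unless delta.zero?
    vocab_size = store[:totals][klass]["vocabsize"].to_i + delta
    store[:totals][klass]["vocabsize"] = vocab_size.to_s
  end
  new_count
end

store = new_store
incr_word_count(store, :spam, "viagra", 2)
incr_word_count(store, :spam, "viagra", 1)
incr_word_count(store, :spam, "offer", 1)
incr_word_count(store, :spam, "offer", -1)
```

Because the read and the write are separate steps, two processes incrementing the same (word, class) pair at the same time could each read the same prior count and one update would be lost; the sketch shares that limitation by design.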
|
504
|
+
|
505
|
+
<div id="method-M000013" class="method-detail">
|
506
|
+
<a name="M000013"></a>
|
507
|
+
|
508
|
+
<div class="method-heading">
|
509
|
+
<a href="#M000013" class="method-signature">
|
510
|
+
<span class="method-name">init_tables</span><span class="method-args">()</span>
|
511
|
+
</a>
|
512
|
+
</div>
|
513
|
+
|
514
|
+
<div class="method-description">
|
515
|
+
<p>
|
516
|
+
Create required keyspace and column families
|
517
|
+
</p>
|
518
|
+
<p><a class="source-toggle" href="#"
|
519
|
+
onclick="toggleCode('M000013-source');return false;">[Source]</a></p>
|
520
|
+
<div class="method-source-code" id="M000013-source">
|
521
|
+
<pre>
|
522
|
+
<span class="ruby-comment cmt"># File lib/ankusa/cassandra_storage.rb, line 62</span>
|
523
|
+
<span class="ruby-keyword kw">def</span> <span class="ruby-identifier">init_tables</span>
|
524
|
+
<span class="ruby-comment cmt"># If the keyspace already exists, just select it</span>
|
525
|
+
<span class="ruby-keyword kw">if</span> <span class="ruby-ivar">@cassandra</span>.<span class="ruby-identifier">keyspaces</span>.<span class="ruby-identifier">include?</span>(<span class="ruby-ivar">@keyspace</span>)
|
526
|
+
<span class="ruby-ivar">@cassandra</span>.<span class="ruby-identifier">keyspace</span> = <span class="ruby-ivar">@keyspace</span>
|
527
|
+
<span class="ruby-keyword kw">else</span>
|
528
|
+
<span class="ruby-identifier">freq_table</span> = <span class="ruby-constant">Cassandra</span><span class="ruby-operator">::</span><span class="ruby-constant">ColumnFamily</span>.<span class="ruby-identifier">new</span>({<span class="ruby-identifier">:keyspace</span> =<span class="ruby-operator">></span> <span class="ruby-ivar">@keyspace</span>, <span class="ruby-identifier">:name</span> =<span class="ruby-operator">></span> <span class="ruby-value str">"classes"</span>}) <span class="ruby-comment cmt"># word => {classname => count}</span>
|
529
|
+
<span class="ruby-identifier">summary_table</span> = <span class="ruby-constant">Cassandra</span><span class="ruby-operator">::</span><span class="ruby-constant">ColumnFamily</span>.<span class="ruby-identifier">new</span>({<span class="ruby-identifier">:keyspace</span> =<span class="ruby-operator">></span> <span class="ruby-ivar">@keyspace</span>, <span class="ruby-identifier">:name</span> =<span class="ruby-operator">></span> <span class="ruby-value str">"totals"</span>}) <span class="ruby-comment cmt"># class => {wordcount => count}</span>
|
530
|
+
<span class="ruby-identifier">ks_def</span> = <span class="ruby-constant">Cassandra</span><span class="ruby-operator">::</span><span class="ruby-constant">Keyspace</span>.<span class="ruby-identifier">new</span>({
|
531
|
+
<span class="ruby-identifier">:name</span> =<span class="ruby-operator">></span> <span class="ruby-ivar">@keyspace</span>,
|
532
|
+
<span class="ruby-identifier">:strategy_class</span> =<span class="ruby-operator">></span> <span class="ruby-value str">'org.apache.cassandra.locator.SimpleStrategy'</span>,
|
533
|
+
<span class="ruby-identifier">:replication_factor</span> =<span class="ruby-operator">></span> <span class="ruby-value">1</span>,
|
534
|
+
<span class="ruby-identifier">:cf_defs</span> =<span class="ruby-operator">></span> [<span class="ruby-identifier">freq_table</span>, <span class="ruby-identifier">summary_table</span>]
|
535
|
+
})
|
536
|
+
<span class="ruby-ivar">@cassandra</span>.<span class="ruby-identifier">add_keyspace</span> <span class="ruby-identifier">ks_def</span>
|
537
|
+
<span class="ruby-ivar">@cassandra</span>.<span class="ruby-identifier">keyspace</span> = <span class="ruby-ivar">@keyspace</span>
|
538
|
+
<span class="ruby-keyword kw">end</span>
|
539
|
+
<span class="ruby-keyword kw">end</span>
|
540
|
+
</pre>
|
541
|
+
</div>
|
542
|
+
</div>
|
543
|
+
</div>
|
544
|
+
|
545
|
+
<div id="method-M000011" class="method-detail">
|
546
|
+
<a name="M000011"></a>
|
547
|
+
|
548
|
+
<div class="method-heading">
|
549
|
+
<a href="#M000011" class="method-signature">
|
550
|
+
<span class="method-name">reset</span><span class="method-args">()</span>
|
551
|
+
</a>
|
552
|
+
</div>
|
553
|
+
|
554
|
+
<div class="method-description">
|
555
|
+
<p><a class="source-toggle" href="#"
|
556
|
+
onclick="toggleCode('M000011-source');return false;">[Source]</a></p>
|
557
|
+
<div class="method-source-code" id="M000011-source">
|
558
|
+
<pre>
|
559
|
+
<span class="ruby-comment cmt"># File lib/ankusa/cassandra_storage.rb, line 40</span>
|
560
|
+
<span class="ruby-keyword kw">def</span> <span class="ruby-identifier">reset</span>
|
561
|
+
<span class="ruby-identifier">drop_tables</span>
|
562
|
+
<span class="ruby-identifier">init_tables</span>
|
563
|
+
<span class="ruby-keyword kw">end</span>
|
564
|
+
</pre>
|
565
|
+
</div>
|
566
|
+
</div>
|
567
|
+
</div>
|
568
|
+
|
569
|
+
<h3 class="section-bar">Protected Instance methods</h3>
|
570
|
+
|
571
|
+
<div id="method-M000023" class="method-detail">
|
572
|
+
<a name="M000023"></a>
|
573
|
+
|
574
|
+
<div class="method-heading">
|
575
|
+
<a href="#M000023" class="method-signature">
|
576
|
+
<span class="method-name">get_summary</span><span class="method-args">(name)</span>
|
577
|
+
</a>
|
578
|
+
</div>
|
579
|
+
|
580
|
+
<div class="method-description">
|
581
|
+
<p>
|
582
|
+
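Fetch up to <tt>@max_classes</tt> rows from the summary table and return the named value for each class; raise <tt>@max_classes</tt> if there are more classes than that.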
|
583
|
+
</p>
|
584
|
+
<p><a class="source-toggle" href="#"
|
585
|
+
onclick="toggleCode('M000023-source');return false;">[Source]</a></p>
|
586
|
+
<div class="method-source-code" id="M000023-source">
|
587
|
+
<pre>
|
588
|
+
<span class="ruby-comment cmt"># File lib/ankusa/cassandra_storage.rb, line 182</span>
|
589
|
+
<span class="ruby-keyword kw">def</span> <span class="ruby-identifier">get_summary</span>(<span class="ruby-identifier">name</span>)
|
590
|
+
<span class="ruby-identifier">counts</span> = {}
|
591
|
+
<span class="ruby-ivar">@cassandra</span>.<span class="ruby-identifier">get_range</span>(<span class="ruby-identifier">:totals</span>, {<span class="ruby-identifier">:start</span> =<span class="ruby-operator">></span> <span class="ruby-value str">''</span>, <span class="ruby-identifier">:finish</span> =<span class="ruby-operator">></span> <span class="ruby-value str">''</span>, <span class="ruby-identifier">:count</span> =<span class="ruby-operator">></span> <span class="ruby-ivar">@max_classes</span>}).<span class="ruby-identifier">each</span> <span class="ruby-keyword kw">do</span> <span class="ruby-operator">|</span><span class="ruby-identifier">key_slice</span><span class="ruby-operator">|</span>
|
592
|
+
<span class="ruby-comment cmt"># keyslice is a clunky thrift object, map into a ruby hash</span>
|
593
|
+
<span class="ruby-identifier">row</span> = <span class="ruby-identifier">key_slice</span>.<span class="ruby-identifier">columns</span>.<span class="ruby-identifier">inject</span>({}){<span class="ruby-operator">|</span><span class="ruby-identifier">hsh</span>, <span class="ruby-identifier">c</span><span class="ruby-operator">|</span> <span class="ruby-identifier">hsh</span>[<span class="ruby-identifier">c</span>.<span class="ruby-identifier">column</span>.<span class="ruby-identifier">name</span>] = <span class="ruby-identifier">c</span>.<span class="ruby-identifier">column</span>.<span class="ruby-identifier">value</span>; <span class="ruby-identifier">hsh</span>}
|
594
|
+
<span class="ruby-identifier">counts</span>[<span class="ruby-identifier">key_slice</span>.<span class="ruby-identifier">key</span>.<span class="ruby-identifier">to_sym</span>] = <span class="ruby-identifier">row</span>[<span class="ruby-identifier">name</span>].<span class="ruby-identifier">to_f</span>
|
595
|
+
<span class="ruby-keyword kw">end</span>
|
596
|
+
<span class="ruby-identifier">counts</span>
|
597
|
+
<span class="ruby-keyword kw">end</span>
|
598
|
+
</pre>
|
599
|
+
</div>
|
600
|
+
</div>
|
601
|
+
</div>
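The row-to-hash mapping in <tt>get_summary</tt> can be tried in isolation. In this sketch the three <tt>Struct</tt> types and the <tt>summarize</tt> helper are hypothetical stand-ins that only imitate the <tt>key</tt>, <tt>columns</tt>, <tt>column.name</tt> and <tt>column.value</tt> accessors used above; they are not the Thrift classes themselves.

```ruby
# Hypothetical stand-ins for the Thrift objects yielded by get_range.
column = Struct.new(:name, :value)
column_wrapper = Struct.new(:column)
key_slice = Struct.new(:key, :columns)

# Collect one named summary value per class, as floats keyed by symbol.
def summarize(key_slices, name)
  key_slices.each_with_object({}) do |slice, counts|
    # Flatten the slice's columns into a plain { name => value } hash.
    row = slice.columns.each_with_object({}) do |c, hsh|
      hsh[c.column.name] = c.column.value
    end
    counts[slice.key.to_sym] = row[name].to_f # a missing column gives 0.0
  end
end

slices = [
  key_slice.new("spam", [column_wrapper.new(column.new("wordcount", "12")),
                         column_wrapper.new(column.new("doc_count", "3"))]),
  key_slice.new("ham",  [column_wrapper.new(column.new("wordcount", "7"))])
]
word_counts = summarize(slices, "wordcount")
doc_counts = summarize(slices, "doc_count")
```

Here <tt>each_with_object</tt> replaces the <tt>inject</tt> call from the original; both build the same hash, but <tt>each_with_object</tt> does not need the accumulator returned from the block.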
|
602
|
+
|
603
|
+
|
604
|
+
</div>
|
605
|
+
|
606
|
+
|
607
|
+
</div>
|
608
|
+
|
609
|
+
|
610
|
+
<div id="validator-badges">
|
611
|
+
<p><small><a href="http://validator.w3.org/check/referer">[Validate]</a></small></p>
|
612
|
+
</div>
|
613
|
+
|
614
|
+
</body>
|
615
|
+
</html>
|