wukong 0.1.1 → 0.1.4

data/README.textile CHANGED
@@ -12,6 +12,22 @@ Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig
12
12
 
13
13
  The main documentation -- including tutorials and tips for working with big data -- lives on the "Wukong Pages":http://mrflip.github.com/wukong and there is some supplemental information on the "wukong wiki.":http://wiki.github.com/mrflip/wukong
14
14
 
15
+ h2. Install
16
+
17
+ Wukong is still under active development. The newest version is available at
18
+
19
+ http://github.com/mrflip/wukong
20
+
21
+ A gem is available from "github:":http://gems.github.com
22
+
23
+ gem install mrflip-wukong --source=http://gems.github.com
24
+
25
+ or from "gemcutter":http://gemcutter.org
26
+
27
+ gem install wukong --source=http://gemcutter.org
28
+
29
+ Phil Ripperger has prepared "instructions on getting wukong to work on the Amazon AWS cloud.":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart Thanks Phil!
30
+
15
31
  h2. How to write a Wukong script
16
32
 
17
33
  Here's a script to count words in a text stream:
@@ -94,6 +110,61 @@ You can also use structs to treat your dataset as a stream of objects:
94
110
  Wukong::Script.new( TwitBlog::Mapper, nil ).run # identity reducer
95
111
  </code></pre>
96
112
 
113
+ h3. Advanced Patterns
114
+
115
+ Wukong has a good collection of map/reduce patterns. For example, it's quite common to accumulate all records for a given key and emit some result based on the whole group.
116
+
117
+ The AccumulatingReducer calls start! on the first record for each key, calls accumulate() on every record for that key (including the first), and calls finalize() once the last record for that key has been seen.
118
+
119
+ Here's an AccumulatingReducer that takes a long list of key-value pairs and emits, for each key, all its corresponding values in one line.
120
+
121
+ <pre><code> #
122
+ # Roll up all values for each key into a single line
123
+ #
124
+ class GroupByReducer < Wukong::Streamer::AccumulatingReducer
125
+ attr_accessor :values
126
+
127
+ # Start with an empty list
128
+ def start! *args
129
+ self.values = []
130
+ end
131
+
132
+ # Aggregate each value in turn
133
+ def accumulate key, value
134
+ self.values << value
135
+ end
136
+
137
+ # Emit the key and all values, tab-separated
138
+ def finalize
139
+ yield [key, values].flatten
140
+ end
141
+ end
142
+ </code></pre>
143
+
144
+ So given adjacency pairs for the following directed friend graph:
145
+
146
+ <pre><code>
147
+ @jerry @elaine
148
+ @elaine @jerry
149
+ @jerry @kramer
150
+ @kramer @jerry
151
+ @kramer @bobsacamato
152
+ @kramer @newman
153
+ @jerry @superman
154
+ @newman @kramer
155
+ @newman @elaine
156
+ @newman @jerry
157
+ </code></pre>
158
+
159
+ You'd end up with
160
+
161
+ <pre><code>
162
+ @elaine @jerry
163
+ @jerry @elaine @kramer @superman
164
+ @kramer @bobsacamato @jerry @newman
165
+ @newman @elaine @jerry @kramer
166
+ </code></pre>
167
+
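To make the lifecycle concrete, here is a minimal plain-Ruby sketch that imitates it without Wukong or Hadoop. The @LifecycleSketch@ class and its @run@ driver are hypothetical stand-ins, not Wukong's implementation. The sketch assumes the pairs reach the reducer sorted by key, as Hadoop would deliver them between the map and reduce phases.

```ruby
# Hypothetical stand-in for the reducer lifecycle described above.
# Not Wukong's implementation: it only imitates start!/accumulate/finalize.
class LifecycleSketch
  attr_accessor :key, :values

  # Start with an empty list for each new key
  def start!(*_args)
    self.values = []
  end

  # Collect each value in turn
  def accumulate(_key, value)
    values << value
  end

  # Emit the key followed by all of its values
  def finalize
    yield [key, values].flatten
  end

  # Assumed driver: sort the pairs by key (Hadoop would do this between
  # map and reduce), then walk them, calling the three hooks in order.
  def run(pairs)
    lines = []
    self.key = nil
    pairs.sort.each do |k, v|
      if k != key
        finalize { |record| lines << record.join("\t") } unless key.nil?
        self.key = k
        start!(k, v)
      end
      accumulate(k, v)
    end
    finalize { |record| lines << record.join("\t") } unless key.nil?
    lines
  end
end

pairs = [
  %w[@jerry @elaine], %w[@elaine @jerry], %w[@jerry @kramer],
  %w[@kramer @jerry], %w[@kramer @bobsacamato], %w[@kramer @newman],
  %w[@jerry @superman], %w[@newman @kramer], %w[@newman @elaine],
  %w[@newman @jerry]
]
puts LifecycleSketch.new.run(pairs)
```

Run on the friend graph above, the sketch prints one tab-separated line per key, in key order.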
97
168
  h3. More info
98
169
 
99
170
  There are many useful examples (including an actually-useful version of the WordCount script) in the examples/ directory.
@@ -110,7 +181,6 @@ h2. Setup
110
181
 
111
182
  2. Add wukong's @bin/@ directory to your $PATH, so that you may use its filesystem shortcuts.
112
183
 
113
-
114
184
  h2. How to run a Wukong script
115
185
 
116
186
  To run your script using local files and no connection to a hadoop cluster,
@@ -26,6 +26,14 @@ module Wukong
26
26
  def self.[] *args
27
27
  new *args
28
28
  end
29
+
30
+ # returns the value corresponding to that string representation
31
+ def index *args
32
+ # delegate
33
+ self.class.names.index *args
34
+ end
35
+
36
+ # Representations:
29
37
  def to_i
30
38
  val
31
39
  end
@@ -33,18 +41,23 @@ module Wukong
33
41
  return nil if val.nil?
34
42
  self.class.names[val]
35
43
  end
44
+
36
45
  def inspect
37
46
  "<#{self.class.to_s} #{to_i} (#{to_s})>"
38
47
  end
39
- # returns the value corresponding to that string representation
40
- def index *args
41
- # delegate
42
- self.class.names.index *args
43
- end
48
+
44
49
  def to_flat
45
50
  to_s #to_i
46
51
  end
47
52
 
53
+ def self.to_sql_str
54
+ "ENUM('#{names.join("', '")}')"
55
+ end
56
+
57
+ def self.to_pig
58
+ 'chararray'
59
+ end
60
+
48
61
  #
49
62
  # Use enumerates to set the class' names
50
63
  #
@@ -57,17 +70,8 @@ module Wukong
57
70
  def self.enumerates *names
58
71
  self.names = names.map(&:to_s)
59
72
  end
60
-
61
- def self.to_sql_str
62
- "ENUM('#{names.join("', '")}')"
63
- end
64
-
65
- def self.typify
66
- 'chararray'
67
- end
68
73
  end
69
74
 
70
-
71
75
  #
72
76
  # Note that bin 0 is
73
77
  #
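The enum methods touched in the hunks above can be tried in isolation. What follows is a rough sketch under stated assumptions: @SketchEnum@ and @Weekday@ are hypothetical names, and the class copies only the method bodies visible in this diff rather than the full Wukong enum.

```ruby
# Hypothetical mini-Enum that mirrors the methods shown in this diff.
# It is a sketch, not the Wukong class itself.
class SketchEnum
  class << self
    attr_accessor :names

    # Set the class's names from symbols or strings
    def enumerates(*names)
      self.names = names.map(&:to_s)
    end

    # Column type for a MySQL CREATE TABLE statement
    def to_sql_str
      "ENUM('#{names.join("', '")}')"
    end

    # Pig sees the enum as its string name
    def to_pig
      'chararray'
    end
  end

  attr_reader :val

  def initialize(val)
    @val = val
  end

  def to_i
    val
  end

  # The name at position val, or nil when there is no value
  def to_s
    return nil if val.nil?
    self.class.names[val]
  end

  # Position of a name within the class's names
  def index(*args)
    self.class.names.index(*args)
  end
end

# Weekday is an assumed example subclass
class Weekday < SketchEnum
  enumerates :mon, :tue, :wed
end

puts Weekday.to_sql_str            # ENUM('mon', 'tue', 'wed')
puts Weekday.new(1).to_s           # tue
puts Weekday.new(0).index('wed')   # 2
```

Note that @index@ ignores the instance's own value: it simply looks the given name up in the class-level list.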
@@ -0,0 +1,17 @@
1
+ module Wukong
2
+ module Datatypes
3
+ class Text < String ; end unless defined?(Text)
4
+ class Blob < String ; end unless defined?(Blob)
5
+ class Boolean < Integer ; end unless defined?(Boolean)
6
+ class BigDecimal < Float ; end unless defined?(BigDecimal)
7
+ class EpochTime < Integer ; end unless defined?(EpochTime)
8
+ class FilePath < String ; end unless defined?(FilePath)
9
+ class Flag < String ; end unless defined?(Flag)
10
+ class IPAddress < String ; end unless defined?(IPAddress)
11
+ class URI < String ; end unless defined?(URI)
12
+ class Csv < String ; end unless defined?(Csv)
13
+ class Yaml < String ; end unless defined?(Yaml)
14
+ class Json < String ; end unless defined?(Json)
15
+ class Regex < Regexp ; end unless defined?(Regex)
16
+ end
17
+ end
data/lib/wukong/schema.rb CHANGED
@@ -1,37 +1,220 @@
1
+ require 'extlib/inflection'
2
+ require 'wukong'
3
+
4
+
5
+ #
6
+ # Basic types: SQL conversion
7
+ #
8
+ class << Integer ; def to_sql() 'INT' end ; end
9
+ class << Bignum ; def to_sql() 'BIGINT' end ; end
10
+ class << String ; def to_sql() 'VARCHAR(255) CHARACTER SET ASCII' end ; end
11
+ class << Symbol ; def to_sql() 'VARCHAR(255) CHARACTER SET ASCII' end ; end
12
+ class << BigDecimal ; def to_sql() 'DECIMAL' end ; end if defined?(BigDecimal)
13
+ class << EpochTime ; def to_sql() 'INT' end ; end if defined?(EpochTime)
14
+ class << FilePath ; def to_sql() 'VARCHAR(255) CHARACTER SET ASCII' end ; end if defined?(FilePath)
15
+ class << Flag ; def to_sql() 'CHAR(1) CHARACTER SET ASCII' end ; end if defined?(Flag)
16
+ class << IPAddress ; def to_sql() 'CHAR(15) CHARACTER SET ASCII' end ; end if defined?(IPAddress)
17
+ class << URI ; def to_sql() 'VARCHAR(255) CHARACTER SET ASCII' end ; end if defined?(URI)
18
+ class << Csv ; def to_sql() 'TEXT' end ; end if defined?(Csv)
19
+ class << Yaml ; def to_sql() 'TEXT' end ; end if defined?(Yaml)
20
+ class << Json ; def to_sql() 'TEXT' end ; end if defined?(Json)
21
+ class << Regex ; def to_sql() 'TEXT' end ; end if defined?(Regex)
22
+ class String ; def to_sql() self ; end ; end
23
+ class Symbol ; def to_sql() self.to_s.upcase ; end ; end
24
+
25
+ #
26
+ # Basic types: Pig conversion
27
+ #
28
+ class << Integer ; def to_pig() 'int' end ; end
29
+ class << Bignum ; def to_pig() 'long' end ; end
30
+ class << Float ; def to_pig() 'float' end ; end
31
+ class << Symbol ; def to_pig() 'chararray' end ; end
32
+ class << Date ; def to_pig() 'long' end ; end
33
+ class << Time ; def to_pig() 'long' end ; end
34
+ class << DateTime ; def to_pig() 'long' end ; end
35
+ class << String ; def to_pig() 'chararray' end ; end
36
+ class << Text ; def to_pig() 'chararray' end ; end if defined?(Text)
37
+ class << Blob ; def to_pig() 'bytearray' end ; end if defined?(Blob)
38
+ class << Boolean ; def to_pig() 'bytearray' end ; end if defined?(Boolean)
39
+ class String ; def to_pig() self.to_s ; end ; end
40
+ class Symbol ; def to_pig() self.to_s ; end ; end
41
+
42
+ class << BigDecimal ; def to_pig() 'long' end ; end if defined?(BigDecimal)
43
+ class << EpochTime ; def to_pig() 'int' end ; end if defined?(EpochTime)
44
+ class << FilePath ; def to_pig() 'chararray' end ; end if defined?(FilePath)
45
+ class << Flag ; def to_pig() 'chararray' end ; end if defined?(Flag)
46
+ class << IPAddress ; def to_pig() 'chararray' end ; end if defined?(IPAddress)
47
+ class << URI ; def to_pig() 'chararray' end ; end if defined?(URI)
48
+ class << Csv ; def to_pig() 'chararray' end ; end if defined?(Csv)
49
+ class << Yaml ; def to_pig() 'chararray' end ; end if defined?(Yaml)
50
+ class << Json ; def to_pig() 'chararray' end ; end if defined?(Json)
51
+ class << Regex ; def to_pig() 'chararray' end ; end if defined?(Regex)
52
+
1
53
  module Wukong
2
54
  #
3
- # Export model's structure for other data frameworks:
4
- # SQL and Pig
55
+ # Export model's structure for loading and manipulating in other frameworks,
56
+ # such as SQL and Pig
57
+ #
58
+ # Your class should support the #resource_name and #mtypes methods
59
+ # An easy way to do this is by being a TypedStruct.
60
+ #
61
+ # You can use this to do silly stunts like
62
+ #
63
+ # % ruby -rubygems -r'wukong/schema' -e 'require "/path/to/user_model.rb" ; puts User.pig_load ; '
64
+ #
65
+ # If you include the classes from Wukong::Datatypes::MoreTypes, you can draw
66
+ # on a richer set of type definitions
67
+ #
68
+ # require 'wukong/datatypes/more_types'
69
+ # include Wukong::Datatypes::MoreTypes
70
+ # require 'wukong/schema'
71
+ #
72
+ # (if you're using Wukong to bulk-process Datamapper records, these should
73
+ # fall right in line as well -- make sure *not* to include
74
+ # Wukong::Datatypes::MoreTypes, and to require 'dm-more' before 'wukong/schema')
5
75
  #
6
76
  module Schema
7
- def to_sql
8
- end
77
+ module ClassMethods
9
78
 
79
+ #
80
+ # Table name for this class
81
+ #
82
+ def table_name
83
+ resource_name.to_s.pluralize
84
+ end
10
85
 
11
- # Export schema as Pig
12
- def to_pig
13
- members.zip(mtypes).map do |member, type|
14
- member.to_s + ': ' + type.to_pig
15
- end.join(', ')
16
- end
86
+ # ===========================================================================
87
+ #
88
+ # Pig
89
+ #
17
90
 
18
- def pig_klass
19
- self.to_s.gsub(/.*::/, '')
20
- end
91
+ # Export schema as Pig
92
+ #
93
+ # Won't correctly handle complex types (e.g. a struct with a struct as a member)
94
+ #
95
+ def to_pig
96
+ members.zip(mtypes).map do |member, type|
97
+ member.to_s + ': ' + type.to_pig
98
+ end.join(', ')
99
+ end
100
+
101
+ #
102
+ # A pig snippet to load a tsv file containing
103
+ # serialized instances of this class.
104
+ #
105
+ # Assumes the first column is the resource name (you can, and probably
106
+ # should, follow with an immediate GENERATE to ditch that field.)
107
+ #
108
+ def pig_load filename=nil
109
+ filename ||= table_name+'.tsv'
110
+ cmd = [
111
+ "%-23s" % resource_name,
112
+ "= LOAD", filename,
113
+ "AS ( rsrc:chararray,", self.to_pig, ')',
114
+ ].join(" ")
115
+ end
116
+
117
+ # ===========================================================================
118
+ #
119
+ # SQL
21
120
 
22
- def pig_load filename=nil
23
- cmd = [
24
- "%-23s" % pig_klass,
25
- "= LOAD", filename || pig_klass.underscore.pluralize,
26
- "AS ( rsrc:chararray,", self.to_pig, ')',
27
- ].join(" ")
121
+ #
122
+ # Schema definition for use in a CREATE TABLE statement
123
+ #
124
+ def to_sql
125
+ sql_str = []
126
+ members.zip(mtypes).each do |attr, type|
127
+ type_str = type.respond_to?(:to_sql) ? type.to_sql : type.to_s.upcase
128
+ sql_str << " %-21s\t%s" %["`#{attr}`", type_str]
129
+ end
130
+ sql_str.join(",\n")
131
+ end
132
+
133
+ #
134
+ # List off member names, to be stuffed into a SELECT or a LOAD DATA
135
+ #
136
+ def sql_members
137
+ members.map{|attr| "`#{attr}`" }.join(", ")
138
+ end
139
+
140
+ #
141
+ # Creates a table for the wukong class.
142
+ #
143
+ # * primary_key gives the name of one column to be set as the primary key
144
+ #
145
+ # * if drop_first is given, a "DROP TABLE IF EXISTS" statement will
146
+ # precede the snippet.
147
+ #
148
+ # * table_options sets the table parameters. Useful table_options for a
149
+ # read-only database in MySQL:
150
+ # ENGINE=MyISAM PACK_KEYS=0
151
+ #
152
+ def sql_create_table primary_key=nil, drop_first=nil, table_options=''
153
+ str = []
154
+ str << %Q{DROP TABLE IF EXISTS `#{self.table_name}`; } if drop_first
155
+ str << %Q{CREATE TABLE `#{self.table_name}` ( }
156
+ str << self.to_sql
157
+ if primary_key then str.last << ',' ; str << %Q{ PRIMARY KEY \t(`#{primary_key}`)} ; end
158
+ str << %Q{ ) #{table_options} ;}
159
+ str.join("\n")
160
+ end
161
+
162
+ #
163
+ # A mysql snippet to bulk load the tab-separated-values file emitted by a
164
+ # Wukong script.
165
+ #
166
+ # Let's say your class is ClickLog; its resource_name is "click_log"
167
+ # and thus its table_name is 'click_logs'. sql_load_mysql will:
168
+ #
169
+ # * disable indexing on the table
170
+ # * import the file, replacing any existing rows. (Replacement is governed
171
+ # by primary key and unique index constraints -- see the mysql docs).
172
+ # * re-enable indexing on that table
173
+ # * show the number of rows now in the table
174
+ #
175
+ # The load portion will
176
+ #
177
+ # * Load into a table named click_logs
178
+ # * from a file named click_logs.tsv
179
+ # * where all rows have the string 'click_logs' in their first column
180
+ # * and all remaining fields in their #members order
181
+ # * assuming strings are wukong_encode'd and so shouldn't be escaped or enclosed.
182
+ #
183
+ # Why the "LINES STARTING BY" part? For map/reduce outputs that have many
184
+ # different objects jumbled together, you can just dump in the whole file,
185
+ # landing each object in its correct table.
186
+ #
187
+ def sql_load_mysql
188
+ str = []
189
+ # disable indexing during bulk load
190
+ str << %Q{ALTER TABLE `#{self.resource_name}` DISABLE KEYS; }
191
+ # Bulk load the tab-separated-values file.
192
+ str << %Q{LOAD DATA LOCAL INFILE '#{self.resource_name}.tsv'}
193
+ str << %Q{ REPLACE INTO TABLE `#{self.resource_name}` }
194
+ str << %Q{ COLUMNS }
195
+ str << %Q{ TERMINATED BY '\\t' }
196
+ str << %Q{ OPTIONALLY ENCLOSED BY '' }
197
+ str << %Q{ ESCAPED BY '' }
198
+ str << %Q{ LINES STARTING BY '#{self.resource_name}' }
199
+ str << %Q{ ( @dummy,\n }
200
+ str << ' '+self.sql_members
201
+ str << %Q{\n ); }
202
+ # Re-enable indexing
203
+ str << %Q{ALTER TABLE `#{self.resource_name}` ENABLE KEYS ; }
204
+ # Show it loaded correctly
205
+ str << %Q{SELECT '#{self.resource_name}', NOW(), COUNT(*) FROM `#{self.resource_name}`; }
206
+ str.join("\n")
207
+ end
208
+
209
+ end
210
+ # standard stanza for making methods appear on the class itself on include
211
+ def self.included base
212
+ base.class_eval{ extend ClassMethods }
28
213
  end
29
214
  end
30
215
  end
31
216
 
32
- class << Integer ; def to_pig() 'int' end ; end
33
- class << Bignum ; def to_pig() 'long' end ; end
34
- class << Float ; def to_pig() 'float' end ; end
35
- class << String ; def to_pig() 'chararray' end ; end
36
- class << Symbol ; def to_pig() self end ; end
37
- class << Date ; def to_pig() 'long' end ; end
217
+ #
218
+ # TypedStructs are class-schematizeable
219
+ #
220
+ Struct.class_eval do include(Wukong::Schema) ; end
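As a rough illustration of what the schema methods above emit, here is a self-contained sketch that does not load Wukong. Everything in it is an assumption made for the example: @ClickLog@ is a hypothetical model, the @PIG_TYPES@ and @SQL_TYPES@ tables stand in for the per-class @to_pig@ and @to_sql@ methods, and @table_name@ uses naive pluralization in place of extlib's inflector. Like the diff, it passes the filename to LOAD unquoted; it drops the @%-23s@ padding and the drop_first and table_options arguments for brevity.

```ruby
# Assumed type tables standing in for the per-class to_pig / to_sql methods
PIG_TYPES = { Integer => 'int', Float => 'float', String => 'chararray' }.freeze
SQL_TYPES = { Integer => 'INT', Float => 'FLOAT',
              String => 'VARCHAR(255) CHARACTER SET ASCII' }.freeze

# ClickLog is a hypothetical model; a real one would be a TypedStruct
ClickLog = Struct.new(:id, :url, :duration) do
  def self.mtypes
    [Integer, String, Float]
  end

  def self.resource_name
    'click_log'
  end

  # Naive pluralization, in place of extlib's String#pluralize
  def self.table_name
    "#{resource_name}s"
  end

  # Field list for a Pig LOAD ... AS clause
  def self.to_pig
    members.zip(mtypes).map { |member, type| "#{member}: #{PIG_TYPES.fetch(type)}" }.join(', ')
  end

  # Pig snippet to load a TSV whose first column is the resource name
  def self.pig_load(filename = nil)
    filename ||= "#{table_name}.tsv"
    [resource_name, '= LOAD', filename, 'AS ( rsrc:chararray,', to_pig, ')'].join(' ')
  end

  # Column definitions for a CREATE TABLE statement
  def self.to_sql
    members.zip(mtypes).map do |attr, type|
      "  %-21s\t%s" % ["`#{attr}`", SQL_TYPES.fetch(type)]
    end.join(",\n")
  end

  def self.sql_create_table(primary_key = nil)
    str = ["CREATE TABLE `#{table_name}` ("]
    str << to_sql
    if primary_key
      str.last << ','
      str << "  PRIMARY KEY \t(`#{primary_key}`)"
    end
    str << ') ;'
    str.join("\n")
  end
end

puts ClickLog.pig_load
puts ClickLog.sql_create_table(:id)
```

A lookup table keeps the sketch from reopening core classes; the diff instead adds singleton methods to @Integer@, @String@ and friends so that any type can answer @to_pig@ directly.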
@@ -1,20 +1,31 @@
1
1
  module Wukong
2
2
  module Streamer
3
3
  #
4
- # Emit each unique key and the count of its occurrences
4
+ # Roll up all records from a given key into a single list.
5
5
  #
6
6
  class ListReducer < Wukong::Streamer::AccumulatingReducer
7
7
  attr_accessor :values
8
8
 
9
- # reset the counter to zero
9
+ # start with an empty list
10
10
  def start! *args
11
11
  self.values = []
12
12
  end
13
13
 
14
- # record one more for this key
14
+ # aggregate all records.
15
+ # note that this accumulates the full *record* -- key, value, everything.
15
16
  def accumulate *record
16
17
  self.values << record
17
18
  end
19
+
20
+ # emit the key and all records, tab-separated
21
+ #
22
+ # you will almost certainly want to override this method to do something
23
+ # interesting with the values (or override accumulate to gather scalar
24
+ # values)
25
+ #
26
+ def finalize
27
+ yield [key, values.to_flat.join(";")].flatten
28
+ end
18
29
  end
19
30
  end
20
31
  end
@@ -1,13 +1,20 @@
1
1
  module Wukong
2
2
  module Streamer
3
3
  #
4
- # Accumulate acts like an insecure high-school kid, for each key adopting in
5
- # turn the latest value seen. It then emits the last (in sort order) value
6
- # for that key.
4
+ # UniqByLastReducer accepts all records for a given key and emits only the
5
+ # last-seen.
7
6
  #
8
- # For example, to extract the *latest* value for each property, set hadoop
9
- # to use <resource, item_id, timestamp> as sort fields and <resource,
10
- # item_id> as key fields.
7
+ # It acts like an insecure high-school kid: for each record of a given key
8
+ # it discards whatever record it's holding and adopts this new value. When a
9
+ # new key comes on the scene it emits the last record, like an older brother
10
+ # handing off his Depeche Mode collection.
11
+ #
12
+ # For example, to extract the *latest* value for each property, emit your
13
+ # records as
14
+ #
15
+ # [resource_type, key, timestamp, ... fields ...]
16
+ #
17
+ # then set :sort_fields to 3 and :partition_fields to 2.
11
18
  #
12
19
  class UniqByLastReducer < Wukong::Streamer::AccumulatingReducer
13
20
  attr_accessor :final_value
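The keep-the-last-record idea described in the comment above can be sketched in plain Ruby, outside of Hadoop. The @uniq_by_last@ helper below is hypothetical, as are its keyword arguments, which merely echo the :sort_fields and :partition_fields settings named in the comment. It assumes records shaped like @[resource_type, key, timestamp, ... fields ...]@ and does the sorting itself, where Hadoop would otherwise sort between map and reduce.

```ruby
# Hypothetical helper, not part of Wukong: keep only the last record
# (in sort order) for each distinct run of partition fields.
def uniq_by_last(records, partition_fields: 2, sort_fields: 3)
  last_seen = nil
  kept = []
  # Sort on the leading sort_fields columns, as Hadoop would
  records.sort_by { |rec| rec.first(sort_fields) }.each do |rec|
    # A new partition key has arrived: hand off the record we were holding
    if last_seen && rec.first(partition_fields) != last_seen.first(partition_fields)
      kept << last_seen
    end
    # Discard whatever we held and adopt the newest record
    last_seen = rec
  end
  kept << last_seen if last_seen
  kept
end

records = [
  ['tweet', 'a', 3, 'third'],
  ['tweet', 'a', 1, 'first'],
  ['tweet', 'b', 2, 'only'],
  ['user',  'a', 5, 'profile'],
  ['tweet', 'a', 2, 'second']
]
uniq_by_last(records).each { |rec| puts rec.join("\t") }
```

Because the timestamp is the third sort field but not part of the two-field partition key, the record with the highest timestamp is the last one seen for each key, and so the one emitted.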
data/wukong.gemspec CHANGED
@@ -5,11 +5,11 @@
5
5
 
6
6
  Gem::Specification.new do |s|
7
7
  s.name = %q{wukong}
8
- s.version = "0.1.1"
8
+ s.version = "0.1.4"
9
9
 
10
10
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
11
11
  s.authors = ["Philip (flip) Kromer"]
12
- s.date = %q{2009-09-28}
12
+ s.date = %q{2009-10-05}
13
13
  s.description = %q{ Treat your dataset like a:
14
14
 
15
15
  * stream of lines when it’s efficient to process by lines
@@ -97,6 +97,7 @@ Gem::Specification.new do |s|
97
97
  "lib/wukong/boot.rb",
98
98
  "lib/wukong/datatypes.rb",
99
99
  "lib/wukong/datatypes/enum.rb",
100
+ "lib/wukong/datatypes/fake_types.rb",
100
101
  "lib/wukong/dfs.rb",
101
102
  "lib/wukong/encoding.rb",
102
103
  "lib/wukong/extensions.rb",
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: wukong
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.1
4
+ version: 0.1.4
5
5
  platform: ruby
6
6
  authors:
7
7
  - Philip (flip) Kromer
@@ -9,7 +9,7 @@ autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
11
 
12
- date: 2009-09-28 00:00:00 -05:00
12
+ date: 2009-10-05 00:00:00 -05:00
13
13
  default_executable:
14
14
  dependencies: []
15
15
 
@@ -120,6 +120,7 @@ files:
120
120
  - lib/wukong/boot.rb
121
121
  - lib/wukong/datatypes.rb
122
122
  - lib/wukong/datatypes/enum.rb
123
+ - lib/wukong/datatypes/fake_types.rb
123
124
  - lib/wukong/dfs.rb
124
125
  - lib/wukong/encoding.rb
125
126
  - lib/wukong/extensions.rb