RubyGems - wukong - Versions diffs - 0.1.1 → 0.1.4 - Mend

wukong 0.1.1 → 0.1.4

Files changed (8)

data/README.textile +71 -1
data/lib/wukong/datatypes/enum.rb +18 -14
data/lib/wukong/datatypes/fake_types.rb +17 -0
data/lib/wukong/schema.rb +208 -25
data/lib/wukong/streamer/list_reducer.rb +14 -3
data/lib/wukong/streamer/uniq_by_last_reducer.rb +13 -6
data/wukong.gemspec +3 -2
metadata +3 -2

data/README.textile CHANGED

@@ -12,6 +12,22 @@ Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig
 The main documentation -- including tutorials and tips for working with big data -- lives on the "Wukong Pages":http://mrflip.github.com/wukong and there is some supplemental information on the "wukong wiki.":http://wiki.github.com/mrflip/wukong
+h2. Install
+Wukong is still under active development.  The newest version is available at
+    http://github.com/mrflip/wukong
+A gem is available from "github:":http://gems.github.com
+    gem install mrflip-wukong --source=http://gems.github.com
+or from "gemcutter":http://gemcutter.org
+    gem install wukong --source=http://gemcutter.org
+Phil Ripperger has prepared "instructions on getting wukong to work on the Amazon AWS cloud.":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart Thanks Phil!
 h2. How to write a Wukong script
 Here's a script to count words in a text stream:
@@ -94,6 +110,61 @@ You can also use structs to treat your dataset as a stream of objects:
     Wukong::Script.new( TwitBlog::Mapper, nil ).run # identity reducer
 </code></pre>
+h3. Advanced Patterns
+Wukong has a good collection of map/reduce patterns. For example, it's quite common to accumulate all records for a given key and emit some result based on the whole group.
+The AccumulatingReducer calls start! on the first record for each key, calls accumulate() on every record for that key (including the first), and calls finalize() once the last record for that key is seen.
+Here's an AccumulatingReducer that takes a long list of key-value pairs and emits, for each key, all its corresponding values in one line.
+<pre><code>    #
+    # Roll up all values for each key into a single line
+    #
+    class GroupByReducer < Wukong::Streamer::AccumulatingReducer
+      attr_accessor :values
+      # Start with an empty list
+      def start! *args
+        self.values = []
+      end
+      # Aggregate each value in turn
+      def accumulate key, value
+        self.values << value
+      end
+      # Emit the key and all values, tab-separated
+      def finalize
+        yield [key, values].flatten
+      end
+    end
+</code></pre>
+So given adjacency pairs for the following directed friend graph:
+<pre><code>
+    @jerry      @elaine
+    @elaine     @jerry
+    @jerry      @kramer
+    @kramer     @jerry
+    @kramer     @bobsacamato
+    @kramer     @newman
+    @jerry      @superman
+    @newman     @kramer
+    @newman     @elaine
+    @newman     @jerry
+</code></pre>
+You'd end up with
+<pre><code>
+    @elaine     @jerry
+    @jerry      @elaine      @kramer     @superman
+    @kramer     @bobsacamato @jerry      @newman
+    @newman     @elaine      @jerry      @kramer
+</code></pre>
 h3. More info
 There are many useful examples (including an actually-useful version of the WordCount script) in the examples/ directory.
@@ -110,7 +181,6 @@ h2. Setup
 2. Add wukong's @bin/@ directory to your $PATH, so that you may use its filesystem shortcuts.
 h2. How to run a Wukong script
 To run your script using local files and no connection to a hadoop cluster,
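
The start!/accumulate/finalize lifecycle described in the README's Advanced Patterns section can be sketched without the gem installed. This is a minimal stand-in, not Wukong's implementation: @MiniAccumulatingReducer@ and its @run@ method are hypothetical names invented here, and the sketch assumes the records arrive sorted by key, as they would after Hadoop's sort phase.

```ruby
# Hypothetical stand-in for the accumulating-reducer lifecycle.
# It does not load or call the wukong gem.
class MiniAccumulatingReducer
  attr_accessor :key

  # Feed key-sorted [key, value] records; collect whatever finalize yields.
  def run(records)
    out = []
    records.each do |rec_key, value|
      if rec_key != key
        # A new key arrived: flush the previous group, then start a fresh one.
        finalize { |result| out << result } unless key.nil?
        self.key = rec_key
        start!(rec_key, value)
      end
      accumulate(rec_key, value)
    end
    # Flush the final group once the input is exhausted.
    finalize { |result| out << result } unless key.nil?
    out
  end
end

# Same three methods as the README's GroupByReducer.
class MiniGroupByReducer < MiniAccumulatingReducer
  attr_accessor :values

  def start!(*_args)
    self.values = []
  end

  def accumulate(_key, value)
    values << value
  end

  def finalize
    yield [key, values].flatten
  end
end

PAIRS = [
  %w[@jerry @elaine], %w[@elaine @jerry], %w[@jerry @kramer],
  %w[@kramer @jerry], %w[@kramer @bobsacamato], %w[@kramer @newman],
  %w[@jerry @superman], %w[@newman @kramer], %w[@newman @elaine],
  %w[@newman @jerry]
].freeze

# Sorting here mimics the sort Hadoop performs before the reduce step.
GROUPED = MiniGroupByReducer.new.run(PAIRS.sort)
GROUPED.each { |row| puts row.join("\t") }
```

Run on the friend graph from the README, this prints one line per user followed by that user's friends, matching the rolled-up listing shown there.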

data/lib/wukong/datatypes/enum.rb CHANGED

@@ -26,6 +26,14 @@ module Wukong
       def self.[] *args
         new *args
       end
+      # returns the value corresponding to that string representation
+      def index *args
+        # delegate
+        self.class.names.index *args
+      end
+      # Representations:
       def to_i
         val
       end
@@ -33,18 +41,23 @@ module Wukong
         return nil if val.nil?
         self.class.names[val]
       end
       def inspect
         "<#{self.class.to_s} #{to_i} (#{to_s})>"
       end
-      # returns the value corresponding to that string representation
-      def index *args
-        # delegate
-        self.class.names.index *args
-      end
       def to_flat
         to_s #to_i
       end
+      def self.to_sql_str
+        "ENUM('#{names.join("', '")}')"
+      end
+      def self.to_pig
+        'chararray'
+      end
       #
       # Use enumerates to set the class' names
       #
@@ -57,17 +70,8 @@ module Wukong
       def self.enumerates *names
         self.names = names.map(&:to_s)
       end
-      def self.to_sql_str
-        "ENUM('#{names.join("', '")}')"
-      end
-      def self.typify
-        'chararray'
-      end
     end
     #
     # Note that bin 0 is
     #
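
To see how the methods touched in this hunk fit together, here is a small self-contained sketch. @MiniEnum@ and @Weekday@ are hypothetical names; the diff does not show the real class name or its constructor, so the @initialize@ below is an assumption. The method bodies follow the ones visible in the hunk.

```ruby
# Hypothetical stand-in for the enum class edited above.
class MiniEnum
  class << self
    # Per-class list of string names; an enum's integer value indexes into it.
    attr_accessor :names

    def enumerates(*names)
      self.names = names.map(&:to_s)
    end

    # Column type for a MySQL CREATE TABLE statement.
    def to_sql_str
      "ENUM('#{names.join("', '")}')"
    end

    # Pig sees the flattened (string) representation.
    def to_pig
      'chararray'
    end
  end

  attr_reader :val

  # Assumed constructor: stores the integer value.
  def initialize(val)
    @val = val
  end

  def to_i
    val
  end

  def to_s
    return nil if val.nil?
    self.class.names[val]
  end

  # Integer value corresponding to a string representation.
  def index(*args)
    self.class.names.index(*args)
  end
end

class Weekday < MiniEnum
  enumerates :mon, :tue, :wed
end

puts Weekday.new(1).to_s
puts Weekday.to_sql_str
```

Renaming @typify@ to @to_pig@ gives enums the same class-level method name that schema.rb defines on the basic types, so an enum member can be handled like any other type when a Pig schema is generated.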

data/lib/wukong/datatypes/fake_types.rb ADDED

@@ -0,0 +1,17 @@
+module Wukong
+  module Datatypes
+    class Text       < String  ; end unless defined?(Text)
+    class Blob       < String  ; end unless defined?(Blob)
+    class Boolean    < Integer ; end unless defined?(Boolean)
+    class BigDecimal < Float   ; end unless defined?(BigDecimal)
+    class EpochTime  < Integer ; end unless defined?(EpochTime)
+    class FilePath   < String  ; end unless defined?(FilePath)
+    class Flag       < String  ; end unless defined?(Flag)
+    class IPAddress  < String  ; end unless defined?(IPAddress)
+    class URI        < String  ; end unless defined?(URI)
+    class Csv        < String  ; end unless defined?(Csv)
+    class Yaml       < String  ; end unless defined?(Yaml)
+    class Json       < String  ; end unless defined?(Json)
+    class Regex      < Regexp  ; end unless defined?(Regex)
+  end
+end

data/lib/wukong/schema.rb CHANGED

@@ -1,37 +1,220 @@
+require 'extlib/inflection'
+require 'wukong'
+#
+# Basic types: SQL conversion
+#
+class << Integer    ; def to_sql() 'INT'                              end ; end
+class << Bignum     ; def to_sql() 'BIGINT'                           end ; end
+class << String     ; def to_sql() 'VARCHAR(255) CHARACTER SET ASCII' end ; end
+class << Symbol     ; def to_sql() 'VARCHAR(255) CHARACTER SET ASCII' end ; end
+class << BigDecimal ; def to_sql() 'DECIMAL'                          end ; end if defined?(BigDecimal)
+class << EpochTime  ; def to_sql() 'INT'                              end ; end if defined?(EpochTime)
+class << FilePath   ; def to_sql() 'VARCHAR(255) CHARACTER SET ASCII' end ; end if defined?(FilePath)
+class << Flag       ; def to_sql() 'CHAR(1)      CHARACTER SET ASCII' end ; end if defined?(Flag)
+class << IPAddress  ; def to_sql() 'CHAR(15)     CHARACTER SET ASCII' end ; end if defined?(IPAddress)
+class << URI        ; def to_sql() 'VARCHAR(255) CHARACTER SET ASCII' end ; end if defined?(URI)
+class << Csv        ; def to_sql() 'TEXT'                             end ; end if defined?(Csv)
+class << Yaml       ; def to_sql() 'TEXT'                             end ; end if defined?(Yaml)
+class << Json       ; def to_sql() 'TEXT'                             end ; end if defined?(Json)
+class << Regex      ; def to_sql() 'TEXT'                             end ; end if defined?(Regex)
+class String        ; def to_sql() self             ; end ; end
+class Symbol        ; def to_sql() self.to_s.upcase ; end ; end
+#
+# Basic types: Pig conversion
+#
+class << Integer    ; def to_pig() 'int'           end ; end
+class << Bignum     ; def to_pig() 'long'          end ; end
+class << Float      ; def to_pig() 'float'         end ; end
+class << Symbol     ; def to_pig() 'chararray'     end ; end
+class << Date       ; def to_pig() 'long'          end ; end
+class << Time       ; def to_pig() 'long'          end ; end
+class << DateTime   ; def to_pig() 'long'          end ; end
+class << String     ; def to_pig() 'chararray'     end ; end
+class << Text       ; def to_pig() 'chararray'     end ; end if defined?(Text)
+class << Blob       ; def to_pig() 'bytearray'     end ; end if defined?(Blob)
+class << Boolean    ; def to_pig() 'bytearray'     end ; end if defined?(Boolean)
+class String        ; def to_pig() self.to_s ; end ; end
+class Symbol        ; def to_pig() self.to_s ; end ; end
+class << BigDecimal ; def to_pig() 'long'          end ; end if defined?(BigDecimal)
+class << EpochTime  ; def to_pig() 'int'           end ; end if defined?(EpochTime)
+class << FilePath   ; def to_pig() 'chararray'     end ; end if defined?(FilePath)
+class << Flag       ; def to_pig() 'chararray'     end ; end if defined?(Flag)
+class << IPAddress  ; def to_pig() 'chararray'     end ; end if defined?(IPAddress)
+class << URI        ; def to_pig() 'chararray'     end ; end if defined?(URI)
+class << Csv        ; def to_pig() 'chararray'     end ; end if defined?(Csv)
+class << Yaml       ; def to_pig() 'chararray'     end ; end if defined?(Yaml)
+class << Json       ; def to_pig() 'chararray'     end ; end if defined?(Json)
+class << Regex      ; def to_pig() 'chararray'     end ; end if defined?(Regex)
 module Wukong
   #
-  # Export model's structure for other data frameworks:
-  # SQL and Pig
+  # Export model's structure for loading and manipulating in other frameworks,
+  # such as SQL and Pig
+  #
+  # Your class should support the #resource_name and #mtypes methods
+  # An easy way to do this is by being a TypedStruct.
+  #
+  # You can use this to do silly stunts like
+  #
+  #      % ruby -rubygems -r'wukong/schema' -e 'require "/path/to/user_model.rb" ; puts User.pig_load ; '
+  #
+  # If you include the classes from Wukong::Datatypes::MoreTypes, you can draw
+  # on a richer set of type definitions
+  #
+  #     require 'wukong/datatypes/more_types'
+  #     include Wukong::Datatypes::MoreTypes
+  #     require 'wukong/schema'
+  #
+  # (if you're using Wukong to bulk-process Datamapper records, these should
+  # fall right in line as well -- make sure *not* to include
+  # Wukong::Datatypes::MoreTypes, and to require 'dm-more' before 'wukong/schema')
   #
   module Schema
-    def to_sql
-    end
+    module ClassMethods
+      #
+      # Table name for this class
+      #
+      def table_name
+        resource_name.to_s.pluralize
+      end
-    # Export schema as Pig
-    def to_pig
-      members.zip(mtypes).map do |member, type|
-        member.to_s + ': ' + type.to_pig
-      end.join(', ')
-    end
+      # ===========================================================================
+      #
+      # Pig
+      #
-    def pig_klass
-      self.to_s.gsub(/.*::/, '')
-    end
+      # Export schema as Pig
+      #
+      # Won't correctly handle complex types (struct having struct as member, eg)
+      #
+      def to_pig
+        members.zip(mtypes).map do |member, type|
+          member.to_s + ': ' + type.to_pig
+        end.join(', ')
+      end
+      #
+      # A pig snippet to load a tsv file containing
+      # serialized instances of this class.
+      #
+      # Assumes the first column is the resource name (you can, and probably
+      # should, follow with an immediate GENERATE to ditch that field.)
+      #
+      def pig_load filename=nil
+        filename ||= table_name+'.tsv'
+        cmd = [
+          "%-23s" % resource_name,
+          "= LOAD", filename,
+          "AS ( rsrc:chararray,", self.to_pig, ')',
+        ].join(" ")
+      end
+      # ===========================================================================
+      #
+      # SQL
-    def pig_load filename=nil
-      cmd = [
-        "%-23s" % pig_klass,
-        "= LOAD", filename || pig_klass.underscore.pluralize,
-        "AS ( rsrc:chararray,",     self.to_pig, ')',
-      ].join(" ")
+      #
+      # Schema definition for use in a CREATE TABLE statement
+      #
+      def to_sql
+        sql_str = []
+        members.zip(mtypes).each do |attr, type|
+          type_str = type.respond_to?(:to_sql) ? type.to_sql : type.to_s.upcase
+          sql_str << "  %-21s\t%s" %["`#{attr}`", type_str]
+        end
+        sql_str.join(",\n")
+      end
+      #
+      # List off member names, to be stuffed into a SELECT or a LOAD DATA
+      #
+      def sql_members
+        members.map{|attr| "`#{attr}`" }.join(", ")
+      end
+      #
+      # Creates a table for the wukong class.
+      #
+      # * primary_key gives the name of one column to be set as the primary key
+      #
+      # * if drop_first is given, a "DROP TABLE IF EXISTS" statement will
+      #   precede the snippet.
+      #
+      # * table_options sets the table parameters. Useful table_options for a
+      #   read-only database in MySQL:
+      #     ENGINE=MyISAM PACK_KEYS=0
+      #
+      def sql_create_table primary_key=nil, drop_first=nil, table_options=''
+        str = []
+        str << %Q{DROP TABLE IF EXISTS `#{self.table_name}`;  } if drop_first
+        str << %Q{CREATE TABLE         `#{self.table_name}` ( }
+        str << self.to_sql
+        if primary_key then str.last << ',' ; str << %Q{  PRIMARY KEY     \t(`#{primary_key}`)} ; end
+        str << %Q{  ) #{table_options} ;}
+        str.join("\n")
+      end
+      #
+      # A mysql snippet to bulk load the tab-separated-values file emitted by a
+      # Wukong script.
+      #
+      # Let's say your class is ClickLog; its resource_name is "click_log"
+      # and thus its table_name is 'click_logs'. sql_load_mysql will:
+      #
+      # * disable indexing on the table
+      # * import the file, replacing any existing rows. (Replacement is governed
+      #   by primary key and unique index constraints -- see the mysql docs).
+      # * re-enable indexing on that table
+      # * show the number of rows now in the table
+      #
+      # The load portion will
+      #
+      # * Load into a table named click_logs
+      # * from a file named click_logs.tsv
+      # * where all rows have the string 'click_log' in their first column
+      # * and all remaining fields in their #members order
+      # * assuming strings are wukong_encode'd and so shouldn't be escaped or enclosed.
+      #
+      # Why the "LINES STARTING BY" part? For map/reduce outputs that have many
+      # different objects jumbled together, you can just dump in the whole file,
+      # landing each object in its correct table.
+      #
+      def sql_load_mysql
+        str = []
+        # disable indexing during bulk load
+        str << %Q{ALTER TABLE            `#{self.resource_name}` DISABLE KEYS; }
+        # Bulk load the tab-separated-values file.
+        str << %Q{LOAD DATA LOCAL INFILE '#{self.resource_name}.tsv'}
+        str << %Q{  REPLACE INTO TABLE   `#{self.resource_name}`    }
+        str << %Q{  COLUMNS                                         }
+        str << %Q{    TERMINATED BY           '\\t'                 }
+        str << %Q{    OPTIONALLY ENCLOSED BY  ''                    }
+        str << %Q{    ESCAPED BY              ''                    }
+        str << %Q{  LINES STARTING BY     '#{self.resource_name}'   }
+        str << %Q{  ( @dummy,\n }
+        str << '    '+self.sql_members
+        str << %Q{\n  ); }
+        # Re-enable indexing
+        str << %Q{ALTER TABLE `#{self.resource_name}` ENABLE KEYS ; }
+        # Show it loaded correctly
+        str << %Q{SELECT '#{self.resource_name}', NOW(), COUNT(*) FROM `#{self.resource_name}`; }
+        str.join("\n")
+      end
+    end
+    # standard stanza for making methods appear on the class itself on include
+    def self.included base
+      base.class_eval{ extend ClassMethods }
     end
   end
 end
-class << Integer ; def to_pig() 'int'           end ; end
-class << Bignum  ; def to_pig() 'long'          end ; end
-class << Float   ; def to_pig() 'float'         end ; end
-class << String  ; def to_pig() 'chararray'     end ; end
-class << Symbol  ; def to_pig() self            end ; end
-class << Date    ; def to_pig() 'long'          end ; end
+#
+# TypedStructs are class-schematizeable
+#
+Struct.class_eval do include(Wukong::Schema) ; end
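
The pattern in this file (class-level helpers mixed in through an @included@ hook, then driven by @members@ and @mtypes@) can be tried out in a reduced form. Everything named below is an assumption made for illustration: @MiniSchema@, the @ClickLog@ fields, and the two lookup hashes, which replace the singleton @to_pig@/@to_sql@ methods the real file adds to core classes. The naive "add an s" plural stands in for extlib's @pluralize@.

```ruby
# Hypothetical reduced version of the Schema mixin; it does not load wukong.
module MiniSchema
  # Lookup tables used here instead of patching Integer, String and Float.
  PIG_TYPES = { Integer => 'int', String => 'chararray', Float => 'float' }.freeze
  SQL_TYPES = { Integer => 'INT', String => 'VARCHAR(255) CHARACTER SET ASCII' }.freeze

  module ClassMethods
    # Naive plural; the source calls pluralize from extlib/inflection.
    def table_name
      "#{resource_name}s"
    end

    # "name: type" pairs for a Pig AS (...) clause.
    def to_pig
      members.zip(mtypes).map { |member, type| "#{member}: #{PIG_TYPES.fetch(type)}" }.join(', ')
    end

    # Same string assembly as pig_load in the diff.
    def pig_load(filename = nil)
      filename ||= table_name + '.tsv'
      ["%-23s" % resource_name, "= LOAD", filename,
       "AS ( rsrc:chararray,", to_pig, ')'].join(' ')
    end

    # Column definitions; unknown types fall back to their upcased class name.
    def to_sql
      members.zip(mtypes).map do |attr, type|
        "  %-21s\t%s" % ["`#{attr}`", SQL_TYPES.fetch(type) { type.to_s.upcase }]
      end.join(",\n")
    end
  end

  # Make the helpers appear on the including class itself.
  def self.included(base)
    base.extend(ClassMethods)
  end
end

ClickLog = Struct.new(:url, :hits, :score) do
  include MiniSchema

  def self.resource_name
    :click_log
  end

  def self.mtypes
    [String, Integer, Float]
  end
end

puts ClickLog.pig_load
puts ClickLog.to_sql
```

Because the helpers live in @ClassMethods@ and are attached with @extend@, they are called on the class (@ClickLog.to_pig@) rather than on an instance, which is what the move from plain module methods to the @ClassMethods@ stanza in this release achieves.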

data/lib/wukong/streamer/list_reducer.rb CHANGED

@@ -1,20 +1,31 @@
 module Wukong
   module Streamer
     #
-    # Emit each unique key and the count of its occurrences
+    # Roll up all records from a given key into a single list.
     #
     class ListReducer < Wukong::Streamer::AccumulatingReducer
       attr_accessor :values
-      # reset the counter to zero
+      # start with an empty list
       def start! *args
         self.values = []
       end
-      # record one more for this key
+      # aggregate all records.
+      # note that this accumulates the full *record* -- key, value, everything.
       def accumulate *record
         self.values << record
       end
+      # emit the key and all records, tab-separated
+      #
+      # you will almost certainly want to override this method to do something
+      # interesting with the values (or override accumulate to gather scalar
+      # values)
+      #
+      def finalize
+        yield [key, values.to_flat.join(";")].flatten
+      end
     end
   end
 end

data/lib/wukong/streamer/uniq_by_last_reducer.rb CHANGED

@@ -1,13 +1,20 @@
 module Wukong
   module Streamer
     #
-    # Accumulate acts like an insecure high-school kid, for each key adopting in
-    # turn the latest value seen. It then emits the last (in sort order) value
-    # for that key.
+    # UniqByLastReducer accepts all records for a given key and emits only the
+    # last-seen.
     #
-    # For example, to extract the *latest* value for each property, set hadoop
-    # to use <resource, item_id, timestamp> as sort fields and <resource,
-    # item_id> as key fields.
+    # It acts like an insecure high-school kid: for each record of a given key
+    # it discards whatever record it's holding and adopts this new value. When a
+    # new key comes on the scene it emits the last record, like an older brother
+    # handing off his Depeche Mode collection.
+    #
+    # For example, to extract the *latest* value for each property, emit your
+    # records as
+    #
+    #    [resource_type, key, timestamp, ... fields ...]
+    #
+    # then set :sort_fields to 3 and :partition_fields to 2.
     #
     class UniqByLastReducer < Wukong::Streamer::AccumulatingReducer
       attr_accessor :final_value
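
The behaviour described in the rewritten comment can be sketched as a plain function. This is an illustration only: @uniq_by_last@ is a hypothetical helper, the sample records are made up, and the explicit @sort_by@ stands in for the sort Hadoop would perform with three sort fields and two partition fields.

```ruby
# Hypothetical helper mirroring "hold the latest record, emit on key change".
def uniq_by_last(sorted_records, key_fields = 2)
  emitted = []
  current_key = nil
  final_value = nil
  sorted_records.each do |record|
    key = record.first(key_fields)
    if key != current_key
      # A new key came on the scene: hand off the last record held.
      emitted << final_value unless current_key.nil?
      current_key = key
    end
    # Discard whatever was held and adopt the new record.
    final_value = record
  end
  emitted << final_value unless current_key.nil?
  emitted
end

# Records shaped as [resource_type, key, timestamp, ... fields ...]
RECORDS = [
  ['user', 'alice', 20091001, 'followers=10'],
  ['user', 'bob',   20091003, 'followers=7'],
  ['user', 'alice', 20091005, 'followers=12'],
  ['user', 'bob',   20091002, 'followers=5']
].freeze

# Sort on the first three fields, group on the first two.
LATEST = uniq_by_last(RECORDS.sort_by { |rec| rec.first(3) }, 2)
LATEST.each { |rec| puts rec.join("\t") }
```

Sorting on one more field than the grouping key is what makes "last seen" mean "latest timestamp": within each group the records arrive in timestamp order, so the final one held is the newest.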

data/wukong.gemspec CHANGED

@@ -5,11 +5,11 @@
 Gem::Specification.new do |s|
   s.name = %q{wukong}
-  s.version = "0.1.1"
+  s.version = "0.1.4"
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Philip (flip) Kromer"]
-  s.date = %q{2009-09-28}
+  s.date = %q{2009-10-05}
   s.description = %q{  Treat your dataset like a:
       * stream of lines when it’s efficient to process by lines
@@ -97,6 +97,7 @@ Gem::Specification.new do |s|
      "lib/wukong/boot.rb",
      "lib/wukong/datatypes.rb",
      "lib/wukong/datatypes/enum.rb",
+     "lib/wukong/datatypes/fake_types.rb",
      "lib/wukong/dfs.rb",
      "lib/wukong/encoding.rb",
      "lib/wukong/extensions.rb",

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: wukong
 version: !ruby/object:Gem::Version
-  version: 0.1.1
+  version: 0.1.4
 platform: ruby
 authors:
 - Philip (flip) Kromer
@@ -9,7 +9,7 @@ autorequire:
 bindir: bin
 cert_chain: []
-date: 2009-09-28 00:00:00 -05:00
+date: 2009-10-05 00:00:00 -05:00
 default_executable:
 dependencies: []
@@ -120,6 +120,7 @@ files:
 - lib/wukong/boot.rb
 - lib/wukong/datatypes.rb
 - lib/wukong/datatypes/enum.rb
+- lib/wukong/datatypes/fake_types.rb
 - lib/wukong/dfs.rb
 - lib/wukong/encoding.rb
 - lib/wukong/extensions.rb