RubyGems - wukong - Versions diffs - 0.1.4 → 1.4.0 - Mend

wukong 0.1.4 → 1.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (63)

data/INSTALL.textile +89 -0
data/README.textile +41 -74
data/docpages/INSTALL.textile +94 -0
data/{doc → docpages}/LICENSE.textile +0 -0
data/{doc → docpages}/README-wulign.textile +6 -0
data/docpages/UsingWukong-part1-get_ready.textile +17 -0
data/{doc/overview.textile → docpages/UsingWukong-part2-ThinkingBigData.textile} +8 -24
data/{doc → docpages}/UsingWukong-part3-parsing.textile +8 -2
data/docpages/_config.yml +39 -0
data/{doc/tips.textile → docpages/bigdata-tips.textile} +71 -44
data/{doc → docpages}/code/api_response_example.txt +0 -0
data/{doc → docpages}/code/parser_skeleton.rb +0 -0
data/{doc/intro_to_map_reduce → docpages/diagrams}/MapReduceDiagram.graffle +0 -0
data/docpages/favicon.ico +0 -0
data/docpages/gem.css +16 -0
data/docpages/hadoop-tips.textile +83 -0
data/docpages/index.textile +90 -0
data/docpages/intro.textile +8 -0
data/docpages/moreinfo.textile +174 -0
data/docpages/news.html +24 -0
data/{doc → docpages}/pig/PigLatinExpressionsList.txt +0 -0
data/{doc → docpages}/pig/PigLatinReferenceManual.html +0 -0
data/{doc → docpages}/pig/PigLatinReferenceManual.txt +0 -0
data/docpages/tutorial.textile +283 -0
data/docpages/usage.textile +195 -0
data/docpages/wutils.textile +263 -0
data/wukong.gemspec +80 -50
metadata +87 -54
data/doc/INSTALL.textile +0 -41
data/doc/README-tutorial.textile +0 -163
data/doc/README-wutils.textile +0 -128
data/doc/TODO.textile +0 -61
data/doc/UsingWukong-part1-setup.textile +0 -2
data/doc/UsingWukong-part2-scraping.textile +0 -2
data/doc/hadoop-nfs.textile +0 -51
data/doc/hadoop-setup.textile +0 -29
data/doc/index.textile +0 -124
data/doc/links.textile +0 -42
data/doc/usage.textile +0 -102
data/doc/utils.textile +0 -48
data/examples/and_pig/sample_queries.rb +0 -128
data/lib/wukong/and_pig.rb +0 -62
data/lib/wukong/and_pig/README.textile +0 -12
data/lib/wukong/and_pig/as.rb +0 -37
data/lib/wukong/and_pig/data_types.rb +0 -30
data/lib/wukong/and_pig/functions.rb +0 -50
data/lib/wukong/and_pig/generate.rb +0 -85
data/lib/wukong/and_pig/generate/variable_inflections.rb +0 -82
data/lib/wukong/and_pig/junk.rb +0 -51
data/lib/wukong/and_pig/operators.rb +0 -8
data/lib/wukong/and_pig/operators/compound.rb +0 -29
data/lib/wukong/and_pig/operators/evaluators.rb +0 -7
data/lib/wukong/and_pig/operators/execution.rb +0 -15
data/lib/wukong/and_pig/operators/file_methods.rb +0 -29
data/lib/wukong/and_pig/operators/foreach.rb +0 -98
data/lib/wukong/and_pig/operators/groupies.rb +0 -212
data/lib/wukong/and_pig/operators/load_store.rb +0 -65
data/lib/wukong/and_pig/operators/meta.rb +0 -42
data/lib/wukong/and_pig/operators/relational.rb +0 -129
data/lib/wukong/and_pig/pig_struct.rb +0 -48
data/lib/wukong/and_pig/pig_var.rb +0 -95
data/lib/wukong/and_pig/symbol.rb +0 -29
data/lib/wukong/and_pig/utils.rb +0 -0

data/docpages/news.html ADDED

@@ -0,0 +1,24 @@
+---
+layout: default
+title:  edamame news
+collapse: true
+---
+<h1 class="gemheader">{% if site.gemname %}{{ site.gemname }}{% else %}mrflip{% endif %}<span class="small">:: news</span></h1>
+<div id="news">
+  {% for t in site.posts %} {% assign has_posts = true %}{% endfor %}{% if has_posts %}
+  {% for post in site.posts %}
+  <div class="toggle" id="news-{{ post.id }}">
+    <h2><a href="{{ post.url }}">{{ post.title }}</a><span class="postdate"> &raquo; {{ post.date | date_to_string }}</span></h2>
+    {{ post.content }}
+  </div>
+  {% endfor %}
+  {% else %}
+  <p class="heavy">
+    <em>(no news. good news?)</em>
+  </p>
+  {% endif %}
+</div>

data/{doc → docpages}/pig/PigLatinExpressionsList.txt RENAMED

File without changes

data/{doc → docpages}/pig/PigLatinReferenceManual.html RENAMED

File without changes

data/{doc → docpages}/pig/PigLatinReferenceManual.txt RENAMED

File without changes

data/docpages/tutorial.textile ADDED

@@ -0,0 +1,283 @@
+---
+layout: default
+title:  mrflip.github.com/wukong - Tutorial
+collapse: false
+---
+h1(gemheader). Tutorial by Examples
+<notextile><div class="toggle"></notextile>
+h2(#wordcount). Count Words
+Here's a script to count words in a text stream:
+{% highlight ruby %}
+    require 'wukong'
+    module WordCount
+      class Mapper < Wukong::Streamer::LineStreamer
+        # Emit each word in the line.
+        def process line
+          words = line.strip.split(/\W+/).reject(&:blank?)
+          words.each{|word| yield [word, 1] }
+        end
+      end
+      class Reducer < Wukong::Streamer::ListReducer
+        def finalize
+          yield [ key, values.map(&:last).map(&:to_i).sum ]
+        end
+      end
+    end
+    Wukong::Script.new(
+      WordCount::Mapper,
+      WordCount::Reducer
+      ).run # Execute the script
+{% endhighlight %}
+The first class, the Mapper, eats lines and craps @[word, count]@ records. Here
+the /key/ is the word, and the /value/ is its count.
+The second class is an example of an accumulated list reducer. The values for
+each key are stacked up into a list; then the record(s) yielded by @#finalize@
+are emitted.
+Here's another way to write the Reducer: accumulate the count of each line, then
+yield the sum in @#finalize@:
+{% highlight ruby %}
+    class Reducer2 < Wukong::Streamer::AccumulatingReducer
+      attr_accessor :key_count
+      def start! *args
+        self.key_count = 0
+      end
+      def accumulate(word, count)
+        self.key_count += count.to_i
+      end
+      def finalize
+        yield [ key, key_count ]
+      end
+    end
+{% endhighlight %}
+Of course you can be really lazy (i.e. smart) and write your script as
+{% highlight ruby %}
+    class Script < Wukong::Script
+      def reducer_command
+        'uniq -c'
+      end
+    end
+{% endhighlight %}
+h2(#structstream). Structured data
+The previous example dealt with unstructured data.  Wukong also lets you view your data as a stream of structured objects.
+Let's say you have a blog; its records look like
+{% highlight ruby %}
+    Post    = Struct.new( :id, :created_at, :user_id, :title, :body, :link )
+    Comment = Struct.new( :id, :created_at, :post_id, :user_id, :body )
+    User    = Struct.new( :id, :username, :fullname, :homepage, :description )
+    UserLoc = Struct.new( :user_id, :text, :lat, :lng )
+{% endhighlight %}
+You've been using "twitter":http://twitter.com for a long time, and you've written something that from now on will inject all your tweets as Posts, and all replies to them as Comments (by a common 'twitter_bot' account on your blog). What about the past two years' worth of tweets?  Let's assume you're so chatty that a Map/Reduce script is warranted to handle the volume. (Actually, wukong makes a really nice ETL package, so this may be convenient even at small scale.)
+Cook up something that scrapes your tweets and all replies to your tweets:
+{% highlight ruby %}
+    Tweet = Struct.new( :id, :created_at, :twitter_user_id,
+      :in_reply_to_user_id, :in_reply_to_status_id, :text )
+    TwitterUser  = Struct.new( :id, :username, :fullname,
+      :homepage, :location, :description )
+{% endhighlight %}
+Now we'll just process all those in a big pile, converting to Posts, Comments and Users as appropriate. Serialize your scrape results so that each Tweet and each TwitterUser is a single line containing first the class name ('tweet' or 'twitter_user') followed by its constituent fields, in order, separated by tabs.
+The RecordStreamer takes each such line, constructs its corresponding class, and instantiates it with the remaining fields:
+{% highlight ruby %}
+    require 'wukong'
+    require 'my_blog' #defines the blog models
+    module TwitBlog
+      class Mapper < Wukong::Streamer::RecordStreamer
+        # Watch for tweets by me
+        MY_USER_ID = 24601
+        # structs for our input objects
+        Tweet = Struct.new( :id, :created_at, :twitter_user_id,
+          :in_reply_to_user_id, :in_reply_to_status_id, :text )
+        TwitterUser  = Struct.new( :id, :username, :fullname,
+          :homepage, :location, :description )
+        #
+        # If this tweet is by me, convert it to a Post.
+        #
+        # If it is a tweet not by me, convert it to a Comment that
+        # will be paired with the correct Post.
+        #
+        # If it is a TwitterUser, convert it to a User record and
+        # a user_location record
+        #
+        def process record
+          case record
+          when TwitterUser
+            user     = MyBlog::User.new.merge(record) # grab the fields in common
+            user_loc = MyBlog::UserLoc.new(record.id, record.location, nil, nil)
+            yield user
+            yield user_loc
+          when Tweet
+            if record.twitter_user_id == MY_USER_ID
+              post = MyBlog::Post.new.merge record
+              post.link = "http://twitter.com/statuses/show/#{record.id}"
+              post.body = record.text
+              post.title = record.text[0..65] + "..."
+              yield post
+            else
+              comment = MyBlog::Comment.new.merge record
+              comment.body    = record.text
+              comment.post_id = record.in_reply_to_status_id
+              yield comment
+            end
+          end
+        end
+      end
+    end
+    Wukong::Script.new( TwitBlog::Mapper, nil ).run # identity reducer
+{% endhighlight %}
+h2(#accumulators). Accumulators
+h3(#uniqifying). A Uniqifying Accumulator
+The script above uses the identity reducer: every record from the mapper is sent
+to the output.  But what if you had grabbed the replying user's record every time you saw a reply?
+You'd like to just pass it through @uniq@. But if something has changed in the interim, or if you record a timestamp for each sample, you won't be able to use the simple @uniq@ command.  You'd like to just get one example for each key!
+Wukong includes just such a reducer, the UniqByLastReducer:
+{% highlight ruby %}
+    #
+    # UniqByLastReducer accepts all records for a given key and emits only the
+    # last-seen.
+    #
+    # It acts like an insecure high-school kid: for each record of a given key
+    # it discards whatever record it's holding and adopts this new value. When a
+    # new key comes on the scene it emits the last record, like an older brother
+    # handing off his Depeche Mode collection.
+    #
+    # For example, to extract the *latest* value for each property, emit your
+    # records as
+    #
+    #    [resource_type, key, timestamp, ... fields ...]
+    #
+    # then set :sort_fields to 3 and :partition_fields to 2.
+    #
+    class UniqByLastReducer < Wukong::Streamer::AccumulatingReducer
+      attr_accessor :final_value
+      #
+      # Use first two fields as keys by default
+      #
+      def get_key *vals
+        vals[0..1]
+      end
+      #
+      # Adopt each value in turn: the last one's the one you want.
+      #
+      def accumulate *vals
+        self.final_value = vals
+      end
+      #
+      # Emit the last-seen value
+      #
+      def finalize
+        yield final_value if final_value
+      end
+      #
+      # Clear state on reset
+      #
+      def start! *args
+        self.final_value = nil
+      end
+    end
+{% endhighlight %}
+h3(#groupby). A GroupBy Accumulator
+Wukong has a good collection of map/reduce patterns. For example, it's quite common to accumulate all records for a given key and emit some result based on the whole group.
+The AccumulatingReducer calls @#start!@ on the first record for each key, calls @#accumulate@ on every record for that key (including the first), and calls @#finalize@ once the last record for that key is seen.
+Here's an AccumulatingReducer that takes a long list of key-value pairs and emits, for each key, all its corresponding values in one line.
+{% highlight ruby %}
+    #
+    # Roll up all values for each key into a single line
+    #
+    class GroupByReducer < Wukong::Streamer::AccumulatingReducer
+      attr_accessor :values
+      # Start with an empty list
+      def start! *args
+        self.values = []
+      end
+      # Aggregate each value in turn
+      def accumulate key, value
+        self.values << value
+      end
+      # Emit the key and all values, tab-separated
+      def finalize
+        yield [key, values].flatten
+      end
+    end
+{% endhighlight %}
+So given adjacency pairs for the following directed friend graph:
+<pre>
+    @jerry      @elaine
+    @elaine     @jerry
+    @jerry      @kramer
+    @kramer     @jerry
+    @kramer     @bobsacamato
+    @kramer     @newman
+    @jerry      @superman
+    @newman     @kramer
+    @newman     @elaine
+    @newman     @jerry
+</pre>
+You'd end up with
+<pre>    @elaine     @jerry
+    @jerry      @elaine      @kramer     @superman
+    @kramer     @bobsacamato @jerry      @newman
+    @newman     @elaine      @jerry      @kramer
+</pre>
+h2. A note about keys
+Now we're going to write this using the synthetic keys already extant in the
+twitter records, making the unwarranted assumption that they won't collide with
+the keys in your database.
+The Map/Reduce paradigm does badly with synthetic keys. Synthetic keys demand
+locality, and map/reduce's remarkable scaling comes from not assuming
+locality. In general, write your map/reduce scripts to use natural keys (the scre
+h2. More...
+There are many useful examples (including an actually-useful version of this
+WordCount script) in the "examples/ directory.":http://github.com/mrflip/wukong/tree/master/examples
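
Annotation on the tutorial's accumulator section: the @start!@ / @accumulate@ / @finalize@ lifecycle can be traced without a Hadoop cluster or the gem itself. The sketch below is a plain-Ruby stand-in, not Wukong's implementation. The names MiniAccumulatingReducer, MiniGroupBy, PAIRS and group_friends are hypothetical, and the records are assumed to arrive already sorted by key, as they would after the shuffle/sort step.

```ruby
# Hypothetical stand-in for an accumulating reducer; NOT the gem's code.
# It only mirrors the lifecycle described in the tutorial text.
class MiniAccumulatingReducer
  attr_accessor :key

  # By default the key is the first field of each record.
  def get_key(*fields)
    fields.first
  end

  # Hooks for subclasses to override.
  def start!(*fields); end
  def accumulate(*fields); end
  def finalize; end

  # Walk key-sorted records; whatever #finalize yields goes to the block.
  def run(records, &emit)
    records.each do |fields|
      this_key = get_key(*fields)
      if this_key != key
        finalize(&emit) unless key.nil? # close out the previous key
        self.key = this_key
        start!(*fields)                 # first record for the new key
      end
      accumulate(*fields)               # every record, including the first
    end
    finalize(&emit) unless key.nil?     # close out the final key
    self.key = nil
  end
end

# Same shape as the tutorial's GroupByReducer, built on the stand-in above.
class MiniGroupBy < MiniAccumulatingReducer
  attr_accessor :values

  def start!(*_fields)
    self.values = []
  end

  def accumulate(_key, value)
    values << value
  end

  def finalize
    yield [key, values].flatten
  end
end

# The directed friend graph from the tutorial, as [from, to] pairs.
PAIRS = [
  %w[@jerry @elaine], %w[@elaine @jerry], %w[@jerry @kramer],
  %w[@kramer @jerry], %w[@kramer @bobsacamato], %w[@kramer @newman],
  %w[@jerry @superman], %w[@newman @kramer], %w[@newman @elaine],
  %w[@newman @jerry]
].freeze

# Sort the pairs (standing in for the sort between map and reduce), then group.
def group_friends(pairs)
  grouped = []
  MiniGroupBy.new.run(pairs.sort) { |row| grouped << row }
  grouped
end

group_friends(PAIRS).each { |row| puts row.join("\t") }
```

Run on its own, this groups the ten pairs into the same four rows the tutorial lists (apart from column padding). The sort matters: without it, a key that reappears later would be started and finalized a second time.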

data/docpages/usage.textile ADDED

@@ -0,0 +1,195 @@
+---
+layout: default
+title:  Usage notes
+---
+h1(gemheader). {{ site.gemname }} %(small):: usage%
+** "How to run a Wukong script":#running
+** "How to test your scripts":#testing
+** "Wukong Plays nicely with others":#playnice
+** "Schema export":#schema_export to Pig or SQL
+** "Wukong's internal workflow":#workflow
+** "Using wukong with internal streaming":#stayinruby
+** "Using wukong to Batch-Process ActiveRecord Objects":#activerecord
+<notextile><div class="toggle"></notextile>
+h2(#running). How to run a Wukong script
+To run your script using local files and no connection to a hadoop cluster,
+pre. your/script.rb --run=local path/to/input_files path/to/output_dir
+To run the command across a Hadoop cluster,
+pre. your/script.rb --run=hadoop path/to/input_files path/to/output_dir
+You can set the default in the config/wukong-site.yaml file, and then just use @--run@ instead of @--run=something@ --it will just use the default run mode.
+If you're running @--run=hadoop@, all file paths are HDFS paths. If you're running @--run=local@, all file paths are local paths.  (The script itself, of course, always lives on the local filesystem.)
+You can supply arbitrary command line arguments (they wind up as key-value pairs in the options hash your mapper and reducer receive), and you can use the hadoop syntax to specify more than one input file:
+pre. ./path/to/your/script.rb --any_specific_options --options=can_have_vals \
+   --run "input_dir/part_*,input_file2.tsv,etc.tsv" path/to/output_dir
+Note that all @--options@ must precede (in any order) all non-options.
+<notextile></div><div class="toggle"></notextile>
+h2(#testing). How to test your scripts
+To run the mapper on its own:
+pre. cat ./local/test/input.tsv | ./examples/word_count.rb --map | more
+or if your test data lies on the HDFS,
+pre. hdp-cat test/input.tsv | ./examples/word_count.rb --map | more
+Next, graduate to running in @--run=local@ mode so you can inspect the reducer.
+<notextile></div><div class="toggle"></notextile>
+h2(#playnice). Wukong Plays nicely with others
+Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line.  It even has limited support for "martinis":http://datamapper.org (Datamapper) and "express trains":http://wiki.rubyonrails.org/rails/pages/ActiveRecord (ActiveRecord).
+* "Export Wukong classes to SQL or Pig":#schema_export -- easily bulk-load and define SQL tables, or kickstart your pig scripts
+* "Batch-Process records from ActiveRecord":#activerecord (the datamapper case is similar)
+* Cascade Mappers and Reducers "purely in ruby":#stayinruby -- reportedly useful in an "ETL":http://en.wikipedia.org/wiki/Extract,_transform,_load context.
+h3(#schema_export). Schema export to Pig or SQL
+There is preliminary support for dumping wukong classes as schemata for other tools. For example, given the following:
+{% highlight ruby %}
+    require "wukong"
+    require "wukong/schema"
+    User = TypedStruct.new(
+        [:id,                     Integer],
+        [:scraped_at,             Bignum],
+        [:screen_name,            String],
+        [:followers_count,        Integer],
+        [:created_at,             Bignum]
+        )
+{% endhighlight %}
+You can make a snippet for loading into pig with @puts User.load_pig@:
+<pre>    LOAD users.tsv AS ( rsrc:chararray, id: int, scraped_at: long, screen_name: chararray, followers_count: int, created_at: long )</pre>
+Export to SQL with @puts User.sql_create_table ; puts User.sql_load_mysql@:
+{% highlight sql %}
+    CREATE TABLE         `users` (
+      `id`                  INT,
+      `scraped_at`          BIGINT,
+      `screen_name`         VARCHAR(255) CHARACTER SET ASCII,
+      `followers_count`     INT,
+      `created_at`          BIGINT
+      )  ;
+    ALTER TABLE            `user` DISABLE KEYS;
+    LOAD DATA LOCAL INFILE 'user.tsv'
+      REPLACE INTO TABLE   `user`
+      COLUMNS
+        TERMINATED BY           '\t'
+        OPTIONALLY ENCLOSED BY  ''
+        ESCAPED BY              ''
+      LINES STARTING BY     'user'
+      ( @dummy,
+        `id`, `scraped_at`, `screen_name`, `followers_count`, `created_at`
+      );
+    ALTER TABLE `user` ENABLE KEYS ;
+    SELECT 'user', NOW(), COUNT(*) FROM `user`;
+{% endhighlight %}
+<notextile></div><div class="toggle"></notextile>
+h2(#workflow). Wukong's internal workflow
+Here's a somewhat detailed overview of a wukong script's internal workflow.
+# You call @./myscript.rb --run infile outfile@
+# Execution begins in the run method of the Script class (@wukong/script.rb@). It launches (depending on whether you're local or remote) one of
+** @cat infile | ./myscript.rb --map | sort | ./myscript.rb --reduce > outfile@
+** @hadoop [a_crapton_of_streaming_args] -mapper './myscript.rb --map' -reducer './myscript.rb --reduce' @
+# In either case, the effect is to spawn the exact same script you ran at the command line: one or more times with the --map command in place of the --run command, and one or more times with the --reduce command in place of the --run command. %(quiet)(well, unless you specify no reducers or a :map_command or something)%
+# With the @--map@ or @--reduce@ flag given, the Script class turns over control to the corresponding class: either @mapper_klass.new(self.options).stream@ or @reducer_klass.new(self.options).stream@
+When in @--map@ or @--reduce@ mode (we'll just use @--map@ as an example):
+# The mapper_klass is usually a subclass of @Streamer::Base@, but in actual fact it can be anything that initializes from a hash of options and responds to #stream.
+# The default #stream method
+** calls the before_stream hook
+** reads each line from stdin ; #recordizes it ; passes it (if non-nil) to #process ; and emits each object yielded by #process
+** calls its after_stream hook
+# You typically leave #stream alone and just override #process.
+# The accumulator classes build on these patterns (they're proper subclasses of Streamer::Base), but are used differently. With an accumulator, you should implement some or all of
+** #start! -- called at the start of each accumulation, passing in the first record for that key
+** #accumulate -- called on each record (including that first one)
+** #finalize --  called when the last record of this accumulation is seen.
+** #get_key -- called on each record to recover its key.
+h3(#stayinruby). Using wukong with internal streaming
+If you're using wukong in local mode, you may not want to spawn new processes all over the place.  Or your records may arrive not from the command line but from, say, a database call.
+In that case, just override #stream.  The original:
+{% highlight ruby %}
+      #
+      # Pass each record to +#process+
+      #
+      def stream
+        before_stream
+        $stdin.each do |line|
+          record = recordize(line.chomp)
+          next unless record
+          process(*record) do |output_record|
+            emit output_record
+          end
+        end
+        after_stream
+      end
+{% endhighlight %}
+h3(#activerecord). Using wukong to Batch-Process ActiveRecord Objects
+Here's a stream method, overridden to batch-process ActiveRecord objects (untested sample code):
+{% highlight ruby %}
+    class Mapper < Wukong::Streamer::Base
+      # Set record_klass to the ActiveRecord class you'd like to batch process
+      cattr_accessor :record_klass
+      # Size of each batch to pull from the database
+      cattr_accessor :batch_size
+      #
+      # Grab records from the database in batches,
+      # pass each record to +#process+
+      #
+      # Everything downstream of this is agnostic of the fact that
+      # records are coming from the database and not $stdin
+      #
+      def stream
+        before_stream
+        record_klass.find_in_batches(:batch_size => batch_size ) do |record_batch|
+          record_batch.each do |record|
+            process(record.id, record) do |output_record|
+              emit output_record
+            end
+          end
+        end
+        after_stream
+      end
+      # ....
+    end
+{% endhighlight %}
+<notextile></div></notextile>
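
Annotation on the usage notes' workflow section: the stream loop (before_stream, recordize, process, emit, after_stream) is easy to try in isolation. The following is a minimal sketch in plain Ruby that does not load the gem. MiniStreamer, WordMapper and run_mapper are hypothetical names, and the @input:@ / @output:@ keyword arguments are an assumption made here so the loop can be fed from a string instead of $stdin; the real classes may be wired differently.

```ruby
require 'stringio'

# Hypothetical stand-in that mirrors the stream loop described in the
# usage notes; NOT the gem's Streamer::Base.
class MiniStreamer
  attr_reader :options, :input, :output

  # input/output are injectable (an assumption of this sketch) so the
  # loop can run against in-memory data as well as $stdin/$stdout.
  def initialize(options = {}, input: $stdin, output: $stdout)
    @options = options
    @input   = input
    @output  = output
  end

  # Hooks called once before and after the main loop.
  def before_stream; end
  def after_stream; end

  # Turn a raw line into an array of fields; return nil to skip the line.
  def recordize(line)
    line.empty? ? nil : line.split("\t")
  end

  # Default behaviour: pass the record through unchanged.
  def process(*fields)
    yield fields
  end

  # Write one output record as a tab-separated line.
  def emit(record)
    output.puts Array(record).join("\t")
  end

  # The loop itself: read, recordize, process, emit.
  def stream
    before_stream
    input.each_line do |line|
      record = recordize(line.chomp)
      next unless record
      process(*record) { |out| emit(out) }
    end
    after_stream
  end
end

# A word-count mapper: only #recordize and #process are overridden.
class WordMapper < MiniStreamer
  # Treat the whole line as a single field.
  def recordize(line)
    line.empty? ? nil : [line]
  end

  def process(line)
    line.strip.split(/\W+/).reject(&:empty?).each { |word| yield [word, 1] }
  end
end

# Run the mapper over a string and return everything it emitted.
def run_mapper(text)
  out = StringIO.new
  WordMapper.new({}, input: StringIO.new(text), output: out).stream
  out.string
end

print run_mapper("the cat\n\nthe hat\n")
```

The blank line in the sample input is dropped because @recordize@ returns nil for it, which is the "passes it (if non-nil) to #process" step in the workflow list. Swapping the input object for another source plays the same role as overriding @#stream@ in the ActiveRecord example, just with less code to replace.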