wukong 1.5.4 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (87)
  1. data/CHANGELOG.textile +32 -0
  2. data/README.textile +58 -12
  3. data/TODO.textile +0 -8
  4. data/bin/hdp-bzip +12 -17
  5. data/bin/hdp-kill-task +1 -1
  6. data/bin/hdp-sort +7 -7
  7. data/bin/hdp-stream +7 -7
  8. data/bin/hdp-stream-flat +2 -3
  9. data/bin/setcat +11 -0
  10. data/bin/uniq-ord +59 -0
  11. data/examples/corpus/bucket_counter.rb +47 -0
  12. data/examples/corpus/dbpedia_abstract_to_sentences.rb +85 -0
  13. data/examples/corpus/sentence_coocurrence.rb +70 -0
  14. data/examples/emr/README.textile +110 -0
  15. data/examples/emr/dot_wukong_dir/emr_bootstrap.sh +1 -0
  16. data/examples/emr/elastic_mapreduce_example.rb +2 -2
  17. data/examples/ignore_me/counting.rb +56 -0
  18. data/examples/ignore_me/grouper.rb +71 -0
  19. data/examples/network_graph/adjacency_list.rb +2 -2
  20. data/examples/network_graph/breadth_first_search.rb +14 -21
  21. data/examples/network_graph/gen_multi_edge.rb +22 -13
  22. data/examples/pagerank/pagerank.rb +1 -1
  23. data/examples/pagerank/pagerank_initialize.rb +6 -10
  24. data/examples/sample_records.rb +6 -16
  25. data/examples/server_logs/apache_log_parser.rb +7 -22
  26. data/examples/server_logs/breadcrumbs.rb +39 -0
  27. data/examples/server_logs/logline.rb +27 -0
  28. data/examples/size.rb +3 -2
  29. data/examples/{binning_percentile_estimator.rb → stats/binning_percentile_estimator.rb} +9 -11
  30. data/examples/{rank_and_bin.rb → stats/rank_and_bin.rb} +2 -2
  31. data/examples/stupidly_simple_filter.rb +11 -14
  32. data/examples/word_count.rb +16 -36
  33. data/lib/wukong/and_pig.rb +2 -15
  34. data/lib/wukong/logger.rb +7 -28
  35. data/lib/wukong/periodic_monitor.rb +24 -9
  36. data/lib/wukong/script/emr_command.rb +1 -0
  37. data/lib/wukong/script/hadoop_command.rb +31 -29
  38. data/lib/wukong/script.rb +19 -14
  39. data/lib/wukong/store/cassandra_model.rb +2 -1
  40. data/lib/wukong/streamer/accumulating_reducer.rb +5 -9
  41. data/lib/wukong/streamer/base.rb +44 -3
  42. data/lib/wukong/streamer/counting_reducer.rb +12 -12
  43. data/lib/wukong/streamer/filter.rb +2 -2
  44. data/lib/wukong/streamer/list_reducer.rb +3 -3
  45. data/lib/wukong/streamer/reducer.rb +11 -0
  46. data/lib/wukong/streamer.rb +7 -3
  47. data/lib/wukong.rb +7 -3
  48. data/{examples → old}/cassandra_streaming/berlitz_for_cassandra.textile +0 -0
  49. data/{examples → old}/cassandra_streaming/client_interface_notes.textile +0 -0
  50. data/{examples → old}/cassandra_streaming/client_schema.textile +0 -0
  51. data/{examples → old}/cassandra_streaming/tuning.textile +0 -0
  52. data/wukong.gemspec +257 -285
  53. metadata +45 -62
  54. data/examples/cassandra_streaming/avromapper.rb +0 -85
  55. data/examples/cassandra_streaming/cassandra.avpr +0 -468
  56. data/examples/cassandra_streaming/cassandra_random_partitioner.rb +0 -62
  57. data/examples/cassandra_streaming/catter.sh +0 -45
  58. data/examples/cassandra_streaming/client_schema.avpr +0 -211
  59. data/examples/cassandra_streaming/foofile.avr +0 -0
  60. data/examples/cassandra_streaming/pymap.sh +0 -1
  61. data/examples/cassandra_streaming/pyreduce.sh +0 -1
  62. data/examples/cassandra_streaming/smutation.avpr +0 -188
  63. data/examples/cassandra_streaming/streamer.sh +0 -51
  64. data/examples/cassandra_streaming/struct_loader.rb +0 -24
  65. data/examples/count_keys.rb +0 -56
  66. data/examples/count_keys_at_mapper.rb +0 -57
  67. data/examples/emr/README-elastic_map_reduce.textile +0 -26
  68. data/examples/keystore/cassandra_batch_test.rb +0 -41
  69. data/examples/keystore/conditional_outputter_example.rb +0 -70
  70. data/examples/store/chunked_store_example.rb +0 -18
  71. data/lib/wukong/dfs.rb +0 -81
  72. data/lib/wukong/keystore/cassandra_conditional_outputter.rb +0 -122
  73. data/lib/wukong/keystore/redis_db.rb +0 -24
  74. data/lib/wukong/keystore/tyrant_db.rb +0 -137
  75. data/lib/wukong/keystore/tyrant_notes.textile +0 -145
  76. data/lib/wukong/models/graph.rb +0 -25
  77. data/lib/wukong/monitor/chunked_store.rb +0 -23
  78. data/lib/wukong/monitor/periodic_logger.rb +0 -34
  79. data/lib/wukong/monitor/periodic_monitor.rb +0 -70
  80. data/lib/wukong/monitor.rb +0 -7
  81. data/lib/wukong/rdf.rb +0 -104
  82. data/lib/wukong/streamer/cassandra_streamer.rb +0 -61
  83. data/lib/wukong/streamer/count_keys.rb +0 -30
  84. data/lib/wukong/streamer/count_lines.rb +0 -26
  85. data/lib/wukong/streamer/em_streamer.rb +0 -7
  86. data/lib/wukong/streamer/preprocess_with_pipe_streamer.rb +0 -22
  87. data/lib/wukong/wukong_class.rb +0 -21
data/CHANGELOG.textile CHANGED
@@ -1,3 +1,35 @@
+ h2. Wukong v2.0.0
+
+ h4. Important changes
+
+ * Passing options to streamers is now deprecated. Use @Settings@ instead.
+
+ * Streamers now have a periodic monitor by default, which logs (to STDERR by default) every 10_000 lines or 30 seconds.
+
+ * Examples cleaned up; they should all run.
+
+ h4. Simplified syntax
+
+ * You can now pass Script.new an *instance* of a Streamer to use as mapper or reducer.
+ * Added an experimental bit of sugar:
+
+ <pre>
+     #!/usr/bin/env ruby
+     require 'wukong/script'
+
+     LineStreamer.map do |line|
+       emit line.reverse
+     end.run
+ </pre>
+
+   Note that you can now tweet a wukong script.
+
+ * It's now recommended that at the top of a wukong script you say
+ <pre>
+     require 'wukong/script'
+ </pre>
+   Among other benefits, this lets you refer to wukong streamers without a prefix.
+
  h2. Wukong v1.5.4
 
  * EMR support now works very well
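
To make the new conventions concrete, here is a minimal sketch of the @Settings@-based style and of handing Script.new a streamer *instance*, both announced above. The @min_count@ flag and the WordFilter class are hypothetical illustrations, and the exact Configliere calls may differ slightly from what the gem ships; treat this as a sketch, not the gem's own example.

<pre>
#!/usr/bin/env ruby
require 'wukong/script'

# Hypothetical flag: with Settings you define options globally rather than
# passing an options hash into the streamer (now deprecated, per the changelog).
Settings.define :min_count, :default => 2, :description => 'Drop records seen fewer times than this'

class WordFilter < Wukong::Streamer::RecordStreamer
  def process word, count
    emit [word, count] if count.to_i >= Settings[:min_count]
  end
end

# Script.new now also accepts streamer *instances*, not just classes:
Wukong::Script.new(WordFilter.new, nil).run
</pre>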
data/README.textile CHANGED
@@ -19,18 +19,6 @@ The **main documentation** lives on the "Wukong Pages.":http://mrflip.github.com
  * Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop)
  * "More info":http://mrflip.github.com/wukong/moreinfo.html
 
- h2. Imminent Changes
-
- I'm pushing to release "Wukong 3.0 the actual 1.0 release".
-
- * For reducing/uniqing, a notion of mutable_fields and immutable_fields and extrinsic_fields: two objects compare the same/differently if their mutable fields compare the same/differently
- * Methods on TypedStruct to
-
- * Make to_flat(false) the default, with the sort_fields / partition_fields defaulting to 2 each and very prominently documented
- * Standardize the notion that wukong classes have a "key"; by default, it will be to_a.first for Structs/TypedStructs. This shouldn't break anything.
- * May make some things that are derived classes into mixin'ed modules
- * Will probably change the name of AccumulatingReducer into just Accumulator, and have all Accumulator-derived classes include Accumulator; I'll make sure the old names continue to work though.
-
 
  h2. Help!
 
@@ -193,6 +181,64 @@ You'd end up with
  @newman @elaine @jerry @kramer
  </code></pre>
 
+ h2. Gotchas
+
+ h4. RecordStreamer dies on blank lines with "wrong number of arguments"
+
+ If your lines don't always have a full complement of fields, and you define #process() to take fixed named arguments, then Ruby will complain when some of them don't show up:
+
+ <pre>
+     class MyUnhappyMapper < Wukong::Streamer::RecordStreamer
+       # this will fail if the line has more or fewer than 3 fields:
+       def process x, y, z
+         p [x, y, z]
+       end
+     end
+ </pre>
+
+ The cleanest way I know to fix this is with recordize, which you should recall always returns an array of fields:
+
+ <pre>
+     class MyHappyMapper < Wukong::Streamer::RecordStreamer
+       # always extracts three fields; any missing fields are nil, any extra fields are discarded
+       # @example
+       #   recordize("a")            # ["a", nil, nil]
+       #   recordize("a\tb\tc")      # ["a", "b", "c"]
+       #   recordize("a\tb\tc\td")   # ["a", "b", "c"]
+       def recordize raw_record
+         x, y, z = super(raw_record)
+         [x, y, z]
+       end
+
+       # Now all lines produce exactly three args
+       def process x, y, z
+         p [x, y, z]
+       end
+     end
+ </pre>
+
+ If you want to preserve any extra fields, use the extra argument to #split():
+
+ <pre>
+     class MyMoreThanHappyMapper < Wukong::Streamer::RecordStreamer
+       # always extracts three fields; any missing fields are nil, and the final field
+       # holds a tab-separated string of all trailing fields
+       # @example
+       #   recordize("a")            # ["a", nil, nil]
+       #   recordize("a\tb\tc")      # ["a", "b", "c"]
+       #   recordize("a\tb\tc\td")   # ["a", "b", "c\td"]
+       def recordize raw_record
+         x, y, z = split(raw_record, "\t", 3)
+         [x, y, z]
+       end
+
+       # Now all lines produce exactly three args
+       def process x, y, z
+         p [x, y, z]
+       end
+     end
+ </pre>
+
+
  h2. Why is it called Wukong?
 
  Hadoop, as you may know, is "named after a stuffed elephant.":http://en.wikipedia.org/wiki/Hadoop Since Wukong was started by the "infochimps":http://infochimps.org team, we needed a simian analog. A Monkey King who journeyed to the land of the Elephant seems to fit the bill:
data/TODO.textile CHANGED
@@ -1,13 +1,5 @@
-
-
-
  * add GEM_PATH to hadoop_recycle_env
 
- * Hadoop_command function received an array for the input_path parameter
-
  ** We should be able to specify comma *or* space separated paths; the last
  space-separated path in Settings.rest becomes the output file, the others are
  used as the input_file list.
-
- * Make configliere Settings and streamer_instance.options() be the same
- thing. (instead of almost-but-confusingly-not-always the same thing).
data/bin/hdp-bzip CHANGED
@@ -2,27 +2,22 @@
 
  HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
- OUTPUT="$1" ; shift
+ input_file=${1} ; shift
+ output_file=${1} ; shift
 
- INPUTS=''
- for foo in $@; do
-   INPUTS="$INPUTS -input $foo\
- "
- done
+ if [ "$output_file" == "" ] ; then echo "$0 input_file output_file" ; exit ; fi
 
- echo "Removing output directory $OUTPUT"
- hadoop fs -rmr $OUTPUT
+ HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
  cmd="${HADOOP_HOME}/bin/hadoop \
- jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar \
- -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
- -jobconf mapred.output.compress=true \
- -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
- -jobconf mapred.reduce.tasks=1 \
- -mapper \"/bin/cat\" \
- -reducer \"/bin/cat\" \
- $INPUTS
- -output $OUTPUT \
+ jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar \
+ -Dmapred.output.compress=true \
+ -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
+ -Dmapred.reduce.tasks=1 \
+ -mapper \"/bin/cat\" \
+ -reducer \"/bin/cat\" \
+ -input \"$input_file\" \
+ -output \"$output_file\" \
  "
  echo $cmd
  $cmd
data/bin/hdp-kill-task CHANGED
@@ -1,3 +1,3 @@
  #!/usr/bin/env bash
 
- exec hadoop fs -kill-task "$1"
+ exec hadoop job -kill-task "$1"
data/bin/hdp-sort CHANGED
@@ -1,5 +1,4 @@
  #!/usr/bin/env bash
- # hadoop dfs -rmr out/parsed-followers
 
  input_file=${1} ; shift
  output_file=${1} ; shift
@@ -13,17 +12,18 @@ if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/
  HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
  cmd="${HADOOP_HOME}/bin/hadoop \
- jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar
+ jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar
+ $@
+ -D num.key.fields.for.partition=\"$partfields\"
+ -D stream.num.map.output.key.fields=\"$sortfields\"
+ -D stream.map.output.field.separator=\"'/t'\"
+ -D mapred.text.key.partitioner.options=\"-k1,$partfields\"
+ -D mapred.job.name=\"`basename $0`-$map_script-$input_file-$output_file\"
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
- -jobconf num.key.fields.for.partition=\"$partfields\"
- -jobconf stream.num.map.output.key.fields=\"$sortfields\"
- -jobconf stream.map.output.field.separator=\"'/t'\"
- -jobconf mapred.text.key.partitioner.options=\"-k1,$partfields\"
  -mapper \"$map_script\"
  -reducer \"$reduce_script\"
  -input \"$input_file\"
  -output \"$output_file\"
- $@
  "
 
  echo "$cmd"
data/bin/hdp-stream CHANGED
@@ -1,5 +1,4 @@
  #!/usr/bin/env bash
- # hadoop dfs -rmr out/parsed-followers
 
  input_file=${1} ; shift
  output_file=${1} ; shift
@@ -13,17 +12,18 @@ if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/
  HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
  cmd="${HADOOP_HOME}/bin/hadoop \
- jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar
+ jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar
+ $@
+ -D num.key.fields.for.partition=\"$partfields\"
+ -D stream.num.map.output.key.fields=\"$sortfields\"
+ -D stream.map.output.field.separator=\"'/t'\"
+ -D mapred.text.key.partitioner.options=\"-k1,$partfields\"
+ -D mapred.job.name=\"`basename $0`-$map_script-$input_file-$output_file\"
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
- -jobconf num.key.fields.for.partition=\"$partfields\"
- -jobconf stream.num.map.output.key.fields=\"$sortfields\"
- -jobconf stream.map.output.field.separator=\"'/t'\"
- -jobconf mapred.text.key.partitioner.options=\"-k1,$partfields\"
  -mapper \"$map_script\"
  -reducer \"$reduce_script\"
  -input \"$input_file\"
  -output \"$output_file\"
- $@
  "
 
  echo "$cmd"
data/bin/hdp-stream-flat CHANGED
@@ -10,13 +10,12 @@ if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/
  HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
  # Can add fun stuff like
- # -jobconf mapred.map.tasks=3 \
- # -jobconf mapred.reduce.tasks=3 \
+ # -Dmapred.reduce.tasks=0 \
 
  exec ${HADOOP_HOME}/bin/hadoop \
  jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar \
  "$@" \
- -jobconf "mapred.job.name=`basename $0`-$map_script-$input_file-$output_file" \
+ -Dmapred.job.name=`basename $0`-$map_script-$input_file-$output_file \
  -mapper "$map_script" \
  -reducer "$reduce_script" \
  -input "$input_file" \
data/bin/setcat ADDED
@@ -0,0 +1,11 @@
+ #!/usr/bin/env bash
+
+ #
+ # This script is useful for debugging: it dumps your environment to STDERR
+ # and otherwise runs as `cat`.
+ #
+
+ set >&2
+
+ cat
+ true
data/bin/uniq-ord ADDED
@@ -0,0 +1,59 @@
+ #!/usr/bin/env ruby
+ # encoding: ASCII-8BIT
+ require 'set'
+
+ unless ARGV.empty?
+   unless ARGV.include?('--help')
+     puts "\n**\nSorry, uniq-ord only works in-line: cat foo.txt bar.tsv | uniq-ord\n**" ; puts
+   end
+   puts <<USAGE
+ uniq-ord is like the uniq command but doesn't depend on prior sorting: it tracks
+ each line and only emits the first-seen instance of that line.
+
+ The algorithm is /very/ simplistic: it uses ruby's built-in hash to track lines.
+ This can produce false positives, meaning that a line of output might be removed
+ even if it hasn't been seen before. It may also consume an unbounded amount of
+ memory (though less than the input text). With a million lines it will consume
+ about 70 MB of memory and have more than a 1 in a million chance of a false
+ positive. On a billion lines it will consume many GB and have over 25% odds of
+ incorrectly skipping a line.
+
+ However, it's really handy for dealing with in-order lists from the command line.
+ USAGE
+   exit(0)
+ end
+
+ # # Logging
+ #
+ # MB = 1024*1024
+ # LOG_INTERVAL = 100_000
+ # $start = Time.now; $iter = 0; $size = 0
+ # def log_line
+ #   elapsed = (Time.now - $start).to_f
+ #   $stderr.puts("%5d s\t%10.1f l/s\t%5dk<\t%5dk>\t%5d MB\t%9.1f MB/s\t%11d b/l"%[ elapsed, $iter/elapsed, $iter/1000, LINES.count/1000, $size/MB, ($size/MB)/elapsed, $size/$iter ])
+ # end
+
+ LINES = Set.new
+ $stdin.each do |line|
+   next if LINES.include?(line.hash)
+   puts line
+   LINES << line.hash
+   # $iter += 1 ; $size += line.length
+   # log_line if ($iter % LOG_INTERVAL == 0)
+ end
+ # log_line
+
+ #
+ # # 2.1 GB data, 1M lines, 2000 avg chars/line
+ #
+ # # Used:  RSS: 71_988 kB   VSZ: 2_509_152 kB
+ # # Stats: 38 s   25_859.1 l/s   1000k<   1000k>   1976 MB   51.1 MB/s   2072 b/l
+ # # Time:  real 0m41.4 s   user 0m31.6 s   sys 0m8.3 s   pct 96.48
+ #
+ # # 4.1 GB data, 5.6M lines, 800 avg chars/line
+ #
+ # # Used:  RSS: 330_644 kB   VSZ: 2_764_236 kB
+ # # Stats: 861   6_538.2 l/s   5632k<   5632k>   4158 MB   4.8 MB/s   774 b/l
+ # # Time:  real 14m24.6 s   user 13m8.8 s   sys 0m12. s   pct 92.61
+ #
+
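
The help text above is explicit about the trade-off: storing only @line.hash@ keeps memory down but allows false positives. Where exactness matters more than memory, a collision-free variant is nearly as short; this sketch is an illustration, not part of the gem:

<pre>
#!/usr/bin/env ruby
require 'set'

# Order-preserving uniq with no false positives: stores each distinct line
# verbatim, so memory grows with the number of distinct lines.
seen = Set.new
$stdin.each do |line|
  next if seen.include?(line)
  puts line
  seen << line
end
</pre>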
data/examples/corpus/bucket_counter.rb ADDED
@@ -0,0 +1,47 @@
+
+ class BucketCounter
+   BUCKET_SIZE = 2**24
+   attr_reader :total
+
+   def initialize
+     @hsh = Hash.new{|h,k| h[k] = 0 }
+     @total = 0
+   end
+
+   # def [] val
+   #   @hsh[val]
+   # end
+   # def << val
+   #   @hsh[val] += 1; @total += 1 ; self
+   # end
+
+   def [] val
+     @hsh[val.hash % BUCKET_SIZE]
+   end
+   def << val
+     @hsh[val.hash % BUCKET_SIZE] += 1; @total += 1 ; self
+   end
+
+   def insert *words
+     words.flatten.each{|word| self << word }
+   end
+   def clear
+     @hsh.clear
+     @total = 0
+   end
+
+   def stats
+     { :total => total,
+       :size  => size,
+     }
+   end
+   def size() @hsh.size end
+
+   def full?
+     size.to_f / BUCKET_SIZE > 0.5
+   end
+
+   def each *args, &block
+     @hsh.each(*args, &block)
+   end
+ end
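
A rough usage sketch for the class above (the words and counts are invented); note that lookups go through @val.hash % BUCKET_SIZE@, so distinct values can very occasionally share a bucket:

<pre>
counter = BucketCounter.new
counter.insert %w[apple banana apple cherry]
counter['apple']   # => 2, barring a hash-bucket collision
counter.total      # => 4
counter.stats      # => { :total => 4, :size => 3 }
counter.full?      # => false until the table passes half of BUCKET_SIZE
counter.clear
</pre>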
data/examples/corpus/dbpedia_abstract_to_sentences.rb ADDED
@@ -0,0 +1,85 @@
+ #!/usr/bin/env ruby
+ require 'wukong/script'
+
+ #
+ # Use the Stanford NLP parser to split a piece of text into sentences
+ #
+ # @example
+ #   SentenceParser.split("Beware the Jabberwock, my son! The jaws that bite, the claws that catch! Beware the Jubjub bird, and shun The frumious Bandersnatch!")
+ #   # => [["Beware", "the", "Jabberwock", ",", "my", "son", "!"], ["The", "jaws", "that", "bite", ",", "the", "claws", "that", "catch", "!"], ["Beware", "the", "Jubjub", "bird", ",", "and", "shun", "The", "frumious", "Bandersnatch", "!"]]
+ #
+ class SentenceParser
+   def self.processor
+     return @processor if @processor
+     require 'rubygems'
+     require 'stanfordparser'
+     @processor = StanfordParser::DocumentPreprocessor.new
+   end
+
+   def self.split line
+     processor.getSentencesFromString(line).map{|s| s.map{|w| w.to_s } }
+   end
+ end
+
+ #
+ # Takes one document per line and splits it into sentences
+ #
+ class WordNGrams < Wukong::Streamer::LineStreamer
+   def recordize line
+     line.strip!
+     line.gsub!(%r{^<http://dbpedia.org/resource/([^>]+)> <[^>]+> \"}, '') ; title = $1
+     line.gsub!(%r{\"@en \.},'')
+     [title, SentenceParser.split(line)]
+   end
+
+   def process title, sentences
+     sentences.each_with_index do |words, idx|
+       yield [title, idx, words].flatten
+     end
+   end
+ end
+
+ Wukong.run WordNGrams, nil, :partition_fields => 1, :sort_fields => 2
+
+ # ---------------------------------------------------------------------------
+ #
+ # Run Time:
+ #
+ # Job Name: dbpedia_abstract_to_sentences.rb---/data/rawd/encyc/dbpedia/dbpedia_dumps/short_abstracts_en.nt---/data/rawd/encyc/dbpedia/dbpedia_parsed/short_abstract_sentences
+ # Status: Succeeded
+ # Started at: Fri Jan 28 03:14:45 UTC 2011
+ # Finished in: 41mins, 50sec
+ # 3 machines: master m1.xlarge, 2 c1.xlarge workers; was having some over-memory issues on the c1.xls
+ #
+ # Counter                 Reduce           Total
+ # SLOTS_MILLIS_MAPS       0                10 126 566
+ # Launched map tasks      0                15
+ # Data-local map tasks    0                15
+ # SLOTS_MILLIS_REDUCES    0                1 217
+ # HDFS_BYTES_READ         1 327 116 133    1 327 116 133
+ # HDFS_BYTES_WRITTEN      1 229 841 020    1 229 841 020
+ # Map input records       3 261 096        3 261 096
+ # Spilled Records         0                0
+ # Map input bytes         1 326 524 800    1 326 524 800
+ # SPLIT_RAW_BYTES         1 500            1 500
+ # Map output records      9 026 343        9 026 343
+ #
+ # Job Name: dbpedia_abstract_to_sentences.rb---/data/rawd/encyc/dbpedia/dbpedia_dumps/long_abstracts_en.nt---/data/rawd/encyc/dbpedia/dbpedia_parsed/long_abstract_sentences
+ # Status: Succeeded
+ # Started at: Fri Jan 28 03:23:08 UTC 2011
+ # Finished in: 41mins, 11sec
+ # 3 machines: master m1.xlarge, 2 c1.xlarge workers; was having some over-memory issues on the c1.xls
+ #
+ # Counter                 Reduce           Total
+ # SLOTS_MILLIS_MAPS       0                19 872 357
+ # Launched map tasks      0                29
+ # Data-local map tasks    0                29
+ # SLOTS_MILLIS_REDUCES    0                5 504
+ # HDFS_BYTES_READ         2 175 900 769    2 175 900 769
+ # HDFS_BYTES_WRITTEN      2 280 332 736    2 280 332 736
+ # Map input records       3 261 096        3 261 096
+ # Spilled Records         0                0
+ # Map input bytes         2 174 849 644    2 174 849 644
+ # SPLIT_RAW_BYTES         2 533            2 533
+ # Map output records      15 425 467       15 425 467
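
For orientation, here is roughly what recordize and process above do to a single (shortened, made-up) line of the DBpedia abstracts dump; the tokenization shown is what the Stanford splitter would plausibly produce, not captured output:

<pre>
raw = '<http://dbpedia.org/resource/Jabberwocky> <http://dbpedia.org/ontology/abstract> "Beware the Jabberwock, my son! The jaws that bite!"@en .'

# recordize strips the subject/predicate and the trailing "@en ." wrapper,
# keeping the resource name as the title:
#   title     => "Jabberwocky"
#   sentences => [["Beware", "the", "Jabberwock", ",", "my", "son", "!"],
#                 ["The", "jaws", "that", "bite", "!"]]
#
# process then yields one flattened, tab-separated record per sentence:
#   Jabberwocky  0  Beware  the  Jabberwock  ,  my  son  !
#   Jabberwocky  1  The  jaws  that  bite  !
</pre>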
data/examples/corpus/sentence_coocurrence.rb ADDED
@@ -0,0 +1,70 @@
+ #!/usr/bin/env ruby
+ $: << File.dirname(__FILE__)
+ require 'rubygems'
+ require 'wukong/script'
+ require 'bucket_counter'
+
+ #
+ # Co-occurrence counts
+ #
+
+ #
+ # Input is a list of document-idx-sentences; each field is tab-separated:
+ #   title  idx  word_a  word_b  word_c ...
+ #
+ # This emits each co-occurring pair exactly once; in the case of a three-word
+ # sentence the output would be
+ #
+ #   word_a word_b
+ #   word_a word_c
+ #   word_b word_c
+ #
+ class SentenceCoocurrence < Wukong::Streamer::RecordStreamer
+   def initialize *args
+     super *args
+     @bucket = BucketCounter.new
+   end
+
+   def process title, idx, *words
+     words.each_with_index do |word_a, idx|
+       words[(idx+1) .. -1].each do |word_b|
+         @bucket << [word_a, word_b]
+       end
+     end
+     dump_bucket if @bucket.full?
+   end
+
+   def dump_bucket
+     @bucket.each do |pair_key, count|
+       emit [pair_key, count]
+     end
+     $stderr.puts "bucket stats: #{@bucket.stats.inspect}"
+     @bucket.clear
+   end
+
+   def after_stream
+     dump_bucket
+   end
+ end
+
+ #
+ # Combine multiple bucket counts into a single one
+ #
+ class CombineBuckets < Wukong::Streamer::AccumulatingReducer
+   def start! *args
+     @total = 0
+   end
+   def accumulate word, count
+     @total += count.to_i
+   end
+   def finalize
+     yield [@total, key] if @total > 20
+   end
+ end
+
+ Wukong.run(
+   SentenceCoocurrence,
+   CombineBuckets,
+   :io_sort_record_percent => 0.3,
+   :io_sort_mb             => 300
+ )
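
Schematically, the in-mapper BucketCounter turns repeated pairs into partial counts and CombineBuckets sums them per pair, keeping only pairs seen more than 20 times; the values below are invented:

<pre>
# mapper output (pair key, partial count), possibly spread over several bucket flushes:
#   jaws  bite    15
#   jaws  bite    9
#   jaws  claws   1
#
# CombineBuckets accumulates per key and applies the > 20 cutoff:
#   yields [24, "jaws bite"]   # 15 + 9
#   drops  "jaws claws"        # a total of 1 falls below the cutoff
</pre>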
data/examples/emr/README.textile ADDED
@@ -0,0 +1,110 @@
+ h1. Using Elastic Map-Reduce in Wukong
+
+ h2. Initial Setup
+
+ # Sign up for Elastic MapReduce and S3 at Amazon AWS.
+
+ # Download the Amazon elastic-mapreduce runner: either the official version at http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip or the infochimps fork (which has support for Ruby 1.9) at http://github.com/infochimps/elastic-mapreduce .
+
+ # Create a bucket and path to hold your EMR logs, scripts and other ephemera. For instance you might choose 'emr.yourdomain.com' as the bucket and '/wukong' as a scoping path within that bucket. In that case you will refer to it with a path like s3://emr.yourdomain.com/wukong (see notes below about s3n:// vs. s3:// URLs).
+
+ # Copy the contents of wukong/examples/emr/dot_wukong_dir to ~/.wukong
+
+ # Edit emr.yaml and credentials.json, adding your keys where appropriate and following the other instructions. Start with a single-node m1.small cluster, as you'll probably have some false starts before the flow of logging in, checking the logs, etc. becomes clear.
+
+ # You should now be good to launch a program. We'll give it the @--alive@ flag so that the machine sticks around if there were any issues:
+
+   ./elastic_mapreduce_example.rb --run=emr --alive s3://emr.yourdomain.com/wukong/data/input s3://emr.yourdomain.com/wukong/data/output
+
+ # If you visit the "AWS console":http://bit.ly/awsconsole you should now see a jobflow with two steps. The first sets up debugging for the job; the second is your hadoop task.
+
+ # The "AWS console":http://bit.ly/awsconsole also has the public IP of the master node. You can log in to the machine directly:
+
+ <pre>
+     ssh -i /path/to/your/keypair.pem hadoop@ec2-148-37-14-128.compute-1.amazonaws.com
+ </pre>
+
+ h3. Lorkbong
+
+ Lorkbong (named after the staff carried by Sun Wukong) is a very, very simple example Heroku app that lets you trigger showing job status or launching a new job, either by visiting a special URL or by triggering a rake task. Get its code from
+
+   http://github.com/mrflip/lorkbong
+
+ h3. s3n:// vs. s3:// URLs
+
+ Many external tools use a URI convention to address files in S3; they typically use the 's3://' scheme, which makes a lot of sense:
+   s3://emr.yourcompany.com/wukong/happy_job_1/logs/whatever-20100808.log
+
+ Hadoop can maintain an HDFS on Amazon S3: it uses a block structure and has optimizations for streaming, no file size limitation, and other goodness. However, only hadoop tools can interpret the contents of those blocks -- to everything else it just looks like a soup of blocks labelled block_-8675309 and so forth. Hadoop unfortunately chose the 's3://' scheme for URIs in this filesystem:
+   s3://s3hdfs.yourcompany.com/path/to/data
+
+ Hadoop is happy to read s3 native files -- 'native' as in, you can look at them with a browser and upload and download them with any S3 tool out there. There's a 5GB limit on file size, and in some cases a performance hit (but not in our experience enough to worry about). You refer to these files with the 's3n://' scheme ('n' as in 'native'):
+   s3n://emr.yourcompany.com/wukong/happy_job_1/code/happy_job_1-mapper.rb
+   s3n://emr.yourcompany.com/wukong/happy_job_1/code/happy_job_1-reducer.rb
+   s3n://emr.yourcompany.com/wukong/happy_job_1/logs/whatever-20100808.log
+
+ Wukong will coerce things to the right scheme when it knows what that scheme should be (eg. code should be s3n://). It will otherwise leave the path alone. Specifically, if you use a URI scheme for input and output paths you must use 's3n://' for normal s3 files.
+
+ h2. Advanced Tips n' Tricks for common usage
+
+ h3. Direct access to logs using your browser
+
+ Each Hadoop component exposes a web dashboard for you to access. Use the following ports:
+
+ * 9100: Job tracker (master only)
+ * 9101: Namenode (master only)
+ * 9102: Datanodes
+ * 9103: Task trackers
+
+ They will only, however, respond to web requests from within the private cluster
+ subnet. You can browse the cluster by creating a persistent tunnel to the hadoop master node, and configuring your
+ browser to use it as a proxy.
+
+ h4. Create a tunneling proxy to your cluster
+
+ To create a tunnel from your local machine to the master node, substitute the keypair and the master node's address into this command:
+
+ <pre><code>
+   ssh -i ~/.wukong/keypairs/KEYPAIR.pem -f -N -D 6666 -o StrictHostKeyChecking=no -o "ConnectTimeout=10" -o "ServerAliveInterval=60" -o "ControlPath=none" ubuntu@MASTER_NODE_PUBLIC_IP
+ </code></pre>
+
+ The command will silently background itself if it worked.
+
+ h4. Make your browser use the proxy (but only for cluster machines)
+
+ You can access basic information by pointing your browser to "this Proxy
+ Auto-Configuration (PAC)
+ file.":http://github.com/infochimps/cluster_chef/raw/master/config/proxy.pac
+ You'll have issues if you browse around though, because many of the in-page
+ links will refer to addresses that only resolve within the cluster's private
+ namespace.
+
+ h4. Setup FoxyProxy
+
+ To fix this, use "FoxyProxy":https://addons.mozilla.org/en-US/firefox/addon/2464 .
+ It allows you to manage multiple proxy configurations and to use the proxy for
+ DNS resolution (curing the private address problem).
+
+ Once you've installed the FoxyProxy extension and restarted Firefox,
+
+ * Set FoxyProxy to 'Use Proxies based on their pre-defined patterns and priorities'
+ * Create a new proxy, called 'EC2 Socks Proxy' or something
+ * Automatic proxy configuration URL: http://github.com/infochimps/cluster_chef/raw/master/config/proxy.pac
+ * Under 'General', check yes for 'Perform remote DNS lookups on host'
+ * Add the following URL patterns as 'whitelist' using 'Wildcards' (not regular expression):
+
+ * <code>*.compute-*.internal*</code>
+ * <code>*ec2.internal*</code>
+ * <code>*domu*.internal*</code>
+ * <code>*ec2*.amazonaws.com*</code>
+ * <code>*://10.*</code>
+
+ And this one as blacklist:
+
+ * <code>https://us-*st-1.ec2.amazonaws.com/*</code>
+
+
+ h3. Pulling to your local machine
+
+   s3cmd sync s3://s3n.infinitemonkeys.info/emr/elastic_mapreduce_example/log/ /tmp/emr_log/
data/examples/emr/dot_wukong_dir/emr_bootstrap.sh CHANGED
@@ -1,4 +1,5 @@
  #!/usr/bin/env bash
+ set -x # turn on tracing
 
  # A url directory with the scripts you'd like to stuff into the machine
  REMOTE_FILE_URL_BASE="http://github.com/infochimps/wukong"
data/examples/emr/elastic_mapreduce_example.rb CHANGED
@@ -1,7 +1,8 @@
  #!/usr/bin/env ruby
  Dir[File.dirname(__FILE__)+'/vendor/**/lib'].each{|dir| $: << dir }
  require 'rubygems'
- require 'wukong'
+ require 'wukong/script'
+ require 'wukong/script/emr_command'
 
  #
  # * Copy the emr.yaml from here into ~/.wukong/emr.yaml
@@ -24,5 +25,4 @@ class FooStreamer < Wukong::Streamer::LineStreamer
  end
  end
 
- Settings.resolve!
  Wukong::Script.new(FooStreamer, FooStreamer).run