wukong 1.5.4 → 2.0.0

Files changed (87)
  1. data/CHANGELOG.textile +32 -0
  2. data/README.textile +58 -12
  3. data/TODO.textile +0 -8
  4. data/bin/hdp-bzip +12 -17
  5. data/bin/hdp-kill-task +1 -1
  6. data/bin/hdp-sort +7 -7
  7. data/bin/hdp-stream +7 -7
  8. data/bin/hdp-stream-flat +2 -3
  9. data/bin/setcat +11 -0
  10. data/bin/uniq-ord +59 -0
  11. data/examples/corpus/bucket_counter.rb +47 -0
  12. data/examples/corpus/dbpedia_abstract_to_sentences.rb +85 -0
  13. data/examples/corpus/sentence_coocurrence.rb +70 -0
  14. data/examples/emr/README.textile +110 -0
  15. data/examples/emr/dot_wukong_dir/emr_bootstrap.sh +1 -0
  16. data/examples/emr/elastic_mapreduce_example.rb +2 -2
  17. data/examples/ignore_me/counting.rb +56 -0
  18. data/examples/ignore_me/grouper.rb +71 -0
  19. data/examples/network_graph/adjacency_list.rb +2 -2
  20. data/examples/network_graph/breadth_first_search.rb +14 -21
  21. data/examples/network_graph/gen_multi_edge.rb +22 -13
  22. data/examples/pagerank/pagerank.rb +1 -1
  23. data/examples/pagerank/pagerank_initialize.rb +6 -10
  24. data/examples/sample_records.rb +6 -16
  25. data/examples/server_logs/apache_log_parser.rb +7 -22
  26. data/examples/server_logs/breadcrumbs.rb +39 -0
  27. data/examples/server_logs/logline.rb +27 -0
  28. data/examples/size.rb +3 -2
  29. data/examples/{binning_percentile_estimator.rb → stats/binning_percentile_estimator.rb} +9 -11
  30. data/examples/{rank_and_bin.rb → stats/rank_and_bin.rb} +2 -2
  31. data/examples/stupidly_simple_filter.rb +11 -14
  32. data/examples/word_count.rb +16 -36
  33. data/lib/wukong/and_pig.rb +2 -15
  34. data/lib/wukong/logger.rb +7 -28
  35. data/lib/wukong/periodic_monitor.rb +24 -9
  36. data/lib/wukong/script/emr_command.rb +1 -0
  37. data/lib/wukong/script/hadoop_command.rb +31 -29
  38. data/lib/wukong/script.rb +19 -14
  39. data/lib/wukong/store/cassandra_model.rb +2 -1
  40. data/lib/wukong/streamer/accumulating_reducer.rb +5 -9
  41. data/lib/wukong/streamer/base.rb +44 -3
  42. data/lib/wukong/streamer/counting_reducer.rb +12 -12
  43. data/lib/wukong/streamer/filter.rb +2 -2
  44. data/lib/wukong/streamer/list_reducer.rb +3 -3
  45. data/lib/wukong/streamer/reducer.rb +11 -0
  46. data/lib/wukong/streamer.rb +7 -3
  47. data/lib/wukong.rb +7 -3
  48. data/{examples → old}/cassandra_streaming/berlitz_for_cassandra.textile +0 -0
  49. data/{examples → old}/cassandra_streaming/client_interface_notes.textile +0 -0
  50. data/{examples → old}/cassandra_streaming/client_schema.textile +0 -0
  51. data/{examples → old}/cassandra_streaming/tuning.textile +0 -0
  52. data/wukong.gemspec +257 -285
  53. metadata +45 -62
  54. data/examples/cassandra_streaming/avromapper.rb +0 -85
  55. data/examples/cassandra_streaming/cassandra.avpr +0 -468
  56. data/examples/cassandra_streaming/cassandra_random_partitioner.rb +0 -62
  57. data/examples/cassandra_streaming/catter.sh +0 -45
  58. data/examples/cassandra_streaming/client_schema.avpr +0 -211
  59. data/examples/cassandra_streaming/foofile.avr +0 -0
  60. data/examples/cassandra_streaming/pymap.sh +0 -1
  61. data/examples/cassandra_streaming/pyreduce.sh +0 -1
  62. data/examples/cassandra_streaming/smutation.avpr +0 -188
  63. data/examples/cassandra_streaming/streamer.sh +0 -51
  64. data/examples/cassandra_streaming/struct_loader.rb +0 -24
  65. data/examples/count_keys.rb +0 -56
  66. data/examples/count_keys_at_mapper.rb +0 -57
  67. data/examples/emr/README-elastic_map_reduce.textile +0 -26
  68. data/examples/keystore/cassandra_batch_test.rb +0 -41
  69. data/examples/keystore/conditional_outputter_example.rb +0 -70
  70. data/examples/store/chunked_store_example.rb +0 -18
  71. data/lib/wukong/dfs.rb +0 -81
  72. data/lib/wukong/keystore/cassandra_conditional_outputter.rb +0 -122
  73. data/lib/wukong/keystore/redis_db.rb +0 -24
  74. data/lib/wukong/keystore/tyrant_db.rb +0 -137
  75. data/lib/wukong/keystore/tyrant_notes.textile +0 -145
  76. data/lib/wukong/models/graph.rb +0 -25
  77. data/lib/wukong/monitor/chunked_store.rb +0 -23
  78. data/lib/wukong/monitor/periodic_logger.rb +0 -34
  79. data/lib/wukong/monitor/periodic_monitor.rb +0 -70
  80. data/lib/wukong/monitor.rb +0 -7
  81. data/lib/wukong/rdf.rb +0 -104
  82. data/lib/wukong/streamer/cassandra_streamer.rb +0 -61
  83. data/lib/wukong/streamer/count_keys.rb +0 -30
  84. data/lib/wukong/streamer/count_lines.rb +0 -26
  85. data/lib/wukong/streamer/em_streamer.rb +0 -7
  86. data/lib/wukong/streamer/preprocess_with_pipe_streamer.rb +0 -22
  87. data/lib/wukong/wukong_class.rb +0 -21
data/CHANGELOG.textile CHANGED
@@ -1,3 +1,35 @@
+ h2. Wukong v2.0.0
+
+ h4. Important changes
+
+ * Passing options to streamers is now deprecated. Use @Settings@ instead.
+
+ * Streamer by default has a periodic monitor that logs (to STDERR by default) every 10_000 lines or 30 seconds
+
+ * Examples cleaned up, should all run
+
+ h4. Simplified syntax
+
+ * you can now pass Script.new an *instance* of Streamer to use as mapper or reducer
+ * Adding an experimental sugar:
+
+ <pre>
+ #!/usr/bin/env ruby
+ require 'wukong/script'
+
+ LineStreamer.map do |line|
+   emit line.reverse
+ end.run
+ </pre>
+
+ Note that you can now tweet a wukong script.
+
+ * It's now recommended that at the top of a wukong script you say
+ <pre>
+ require 'wukong/script'
+ </pre>
+ Among other benefits, this lets you refer to wukong streamers without prefix.
+
  h2. Wukong v1.5.4
 
  * EMR support now works very well
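As a companion to the notes above, a minimal sketch of the recommended 2.0.0 style: @require 'wukong/script'@ at the top, configuration read from @Settings@ rather than passed as streamer options, and streamers referenced without the @Wukong::Streamer::@ prefix. (The @min_len@ setting and @TokenFilter@ class are invented for illustration.)

<pre><code>
#!/usr/bin/env ruby
require 'wukong/script'

# configliere-style setting, used instead of passing options to the streamer
Settings.define :min_len, :default => 3, :description => 'shortest token to keep'

# per the changelog above, LineStreamer can be referenced without its prefix
class TokenFilter < LineStreamer
  def process line
    line.split("\t").each do |token|
      emit token if token.length >= Settings[:min_len]
    end
  end
end

Wukong.run TokenFilter, nil
</code></pre>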
data/README.textile CHANGED
@@ -19,18 +19,6 @@ The **main documentation** lives on the "Wukong Pages.":http://mrflip.github.com
  * Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop)
  * "More info":http://mrflip.github.com/wukong/moreinfo.html
 
- h2. Imminent Changes
-
- I'm pushing to release "Wukong 3.0 the actual 1.0 release".
-
- * For reducing/uniqing, a notion of mutable_fields and immutable_fields and extrinsic_fields: two objects compare the same/differently if their mutable fields compare the same/differently
- * Methods on TypedStruct to
-
- * Make to_flat(false) the default, with the sort_fields / partition_fields defaulting to 2 each and very prominently documented
- * Standardize the notion that wukong classes have a "key"; by default, it will be to_a.first for Structs/TypedStructs. This shouldn't break anything.
- * May make some things that are derived classes into mixin'ed modules
- * Will probably change the name of AccumulatingReducer into just Accumulator, and have all Accumulator-derived classes include Accumulator; I'll make sure the old names continue to work though.
-
 
  h2. Help!
 
@@ -193,6 +181,64 @@ You'd end up with
  @newman @elaine @jerry @kramer
  </code></pre>
 
+ h2. Gotchas
+
+ h4. RecordStreamer dies on blank lines with "wrong number of arguments"
+
+ If your lines don't always have a full complement of fields, and you define #process() to take fixed named arguments, then ruby will complain when some of them don't show up:
+
+ <pre>
+ class MyUnhappyMapper < Wukong::Streamer::RecordStreamer
+   # this will fail if the line has more or fewer than 3 fields:
+   def process x, y, z
+     p [x, y, z]
+   end
+ end
+ </pre>
+
+ The cleanest way I know to fix this is with recordize, which you should recall always returns an array of fields:
+
+ <pre>
+ class MyHappyMapper < Wukong::Streamer::RecordStreamer
+   # extracts three fields always; any missing fields are nil, any extra fields discarded
+   # @example
+   #   recordize("a")          # ["a", nil, nil]
+   #   recordize("a\tb\tc")    # ["a", "b", "c"]
+   #   recordize("a\tb\tc\td") # ["a", "b", "c"]
+   def recordize raw_record
+     x, y, z = super(raw_record)
+     [x, y, z]
+   end
+
+   # Now all lines produce exactly three args
+   def process x, y, z
+     p [x, y, z]
+   end
+ end
+ </pre>
+
+ If you want to preserve any extra fields, use the extra argument to #split():
+
+ <pre>
+ class MyMoreThanHappyMapper < Wukong::Streamer::RecordStreamer
+   # extracts three fields always; any missing fields are nil, the final field will contain a tab-separated string of all trailing fields
+   # @example
+   #   recordize("a")          # ["a", nil, nil]
+   #   recordize("a\tb\tc")    # ["a", "b", "c"]
+   #   recordize("a\tb\tc\td") # ["a", "b", "c\td"]
+   def recordize raw_record
+     x, y, z = raw_record.split("\t", 3)
+     [x, y, z]
+   end
+
+   # Now all lines produce exactly three args
+   def process x, y, z
+     p [x, y, z]
+   end
+ end
+ </pre>
+
+
  h2. Why is it called Wukong?
 
  Hadoop, as you may know, is "named after a stuffed elephant.":http://en.wikipedia.org/wiki/Hadoop Since Wukong was started by the "infochimps":http://infochimps.org team, we needed a simian analog. A Monkey King who journeyed to the land of the Elephant seems to fit the bill:
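A third workaround, not in the README itself but worth noting: declare #process with a splat so its arity can never mismatch the line (the class name here is made up):

<pre>
class MyLenientMapper < Wukong::Streamer::RecordStreamer
  # *fields soaks up however many tab-separated fields the line happens to have,
  # so blank or ragged lines no longer raise "wrong number of arguments"
  def process *fields
    x, y, z = fields
    p [x, y, z]
  end
end
</pre>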
data/TODO.textile CHANGED
@@ -1,13 +1,5 @@
-
-
-
  * add GEM_PATH to hadoop_recycle_env
 
- * Hadoop_command function received an array for the input_path parameter
-
  ** We should be able to specify comma *or* space separated paths; the last
  space-separated path in Settings.rest becomes the output file, the others are
  used as the input_file list.
-
- * Make configliere Settings and streamer_instance.options() be the same
-   thing. (instead of almost-but-confusingly-not-always the same thing).
data/bin/hdp-bzip CHANGED
@@ -2,27 +2,22 @@
 
  HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
- OUTPUT="$1" ; shift
+ input_file=${1} ; shift
+ output_file=${1} ; shift
 
- INPUTS=''
- for foo in $@; do
-   INPUTS="$INPUTS -input $foo\
- "
- done
+ if [ "$output_file" == "" ] ; then echo "$0 input_file output_file" ; exit ; fi
 
- echo "Removing output directory $OUTPUT"
- hadoop fs -rmr $OUTPUT
+ HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
  cmd="${HADOOP_HOME}/bin/hadoop \
-   jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar \
-   -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-   -jobconf mapred.output.compress=true \
-   -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
-   -jobconf mapred.reduce.tasks=1 \
-   -mapper \"/bin/cat\" \
-   -reducer \"/bin/cat\" \
-   $INPUTS
-   -output $OUTPUT \
+   jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar \
+   -Dmapred.output.compress=true \
+   -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
+   -Dmapred.reduce.tasks=1 \
+   -mapper \"/bin/cat\" \
+   -reducer \"/bin/cat\" \
+   -input \"$input_file\" \
+   -output \"$output_file\" \
 "
  echo $cmd
  $cmd
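A usage sketch for the rewritten script above (HDFS paths are made up): it streams the input through an identity mapper and reducer and writes a single BZip2-compressed part file:

<pre><code>
# squash an HDFS directory into one bzip2-compressed output file
hdp-bzip /data/logs/2011-01-28 /data/logs/2011-01-28-bz2
</code></pre>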
data/bin/hdp-kill-task CHANGED
@@ -1,3 +1,3 @@
  #!/usr/bin/env bash
 
- exec hadoop fs -kill-task "$1"
+ exec hadoop job -kill-task "$1"
data/bin/hdp-sort CHANGED
@@ -1,5 +1,4 @@
  #!/usr/bin/env bash
- # hadoop dfs -rmr out/parsed-followers
 
  input_file=${1} ; shift
  output_file=${1} ; shift
@@ -13,17 +12,18 @@ if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/
  HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
  cmd="${HADOOP_HOME}/bin/hadoop \
-   jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar
+   jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar
+   $@
+   -D num.key.fields.for.partition=\"$partfields\"
+   -D stream.num.map.output.key.fields=\"$sortfields\"
+   -D stream.map.output.field.separator=\"'/t'\"
+   -D mapred.text.key.partitioner.options=\"-k1,$partfields\"
+   -D mapred.job.name=\"`basename $0`-$map_script-$input_file-$output_file\"
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
-   -jobconf num.key.fields.for.partition=\"$partfields\"
-   -jobconf stream.num.map.output.key.fields=\"$sortfields\"
-   -jobconf stream.map.output.field.separator=\"'/t'\"
-   -jobconf mapred.text.key.partitioner.options=\"-k1,$partfields\"
    -mapper \"$map_script\"
    -reducer \"$reduce_script\"
    -input \"$input_file\"
    -output \"$output_file\"
-   $@
 "
 
  echo "$cmd"
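A usage sketch for hdp-sort (file names are made up; the trailing-argument order of mapper, reducer, partition-field count, and sort-field count is an assumption based on the variable names in the script):

<pre><code>
# group tab-separated records: partition and sort on the first 2 fields
hdp-sort /data/pairs /data/pairs_sorted /bin/cat /bin/cat 2 2
</code></pre>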
data/bin/hdp-stream CHANGED
@@ -1,5 +1,4 @@
  #!/usr/bin/env bash
- # hadoop dfs -rmr out/parsed-followers
 
  input_file=${1} ; shift
  output_file=${1} ; shift
@@ -13,17 +12,18 @@ if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/
  HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
  cmd="${HADOOP_HOME}/bin/hadoop \
-   jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar
+   jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar
+   $@
+   -D num.key.fields.for.partition=\"$partfields\"
+   -D stream.num.map.output.key.fields=\"$sortfields\"
+   -D stream.map.output.field.separator=\"'/t'\"
+   -D mapred.text.key.partitioner.options=\"-k1,$partfields\"
+   -D mapred.job.name=\"`basename $0`-$map_script-$input_file-$output_file\"
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
-   -jobconf num.key.fields.for.partition=\"$partfields\"
-   -jobconf stream.num.map.output.key.fields=\"$sortfields\"
-   -jobconf stream.map.output.field.separator=\"'/t'\"
-   -jobconf mapred.text.key.partitioner.options=\"-k1,$partfields\"
    -mapper \"$map_script\"
    -reducer \"$reduce_script\"
    -input \"$input_file\"
    -output \"$output_file\"
-   $@
 "
 
  echo "$cmd"
data/bin/hdp-stream-flat CHANGED
@@ -10,13 +10,12 @@ if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/
  HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
  # Can add fun stuff like
- #   -jobconf mapred.map.tasks=3 \
- #   -jobconf mapred.reduce.tasks=3 \
+ #   -Dmapred.reduce.tasks=0 \
 
  exec ${HADOOP_HOME}/bin/hadoop \
    jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar \
    "$@" \
-   -jobconf "mapred.job.name=`basename $0`-$map_script-$input_file-$output_file" \
+   -Dmapred.job.name=`basename $0`-$map_script-$input_file-$output_file \
    -mapper "$map_script" \
    -reducer "$reduce_script" \
    -input "$input_file" \
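A usage sketch for hdp-stream-flat with the map-only setting mentioned in the comment above (paths are made up, and the positional order of mapper and reducer is an assumption based on the variable names):

<pre><code>
# map-only job: grab the first field of each line, run zero reducers
hdp-stream-flat /data/raw_logs /data/log_ips 'cut -f 1' /bin/cat -Dmapred.reduce.tasks=0
</code></pre>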
data/bin/setcat ADDED
@@ -0,0 +1,11 @@
+ #!/usr/bin/env bash
+
+ #
+ # This script is useful for debugging. It dumps your environment to STDERR
+ # and otherwise runs as `cat`
+ #
+
+ set >&2
+
+ cat
+ true
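A sketch of how setcat might be used (paths are made up; the streaming job assumes setcat is on the task nodes' PATH):

<pre><code>
# locally: dump the environment to stderr, pass stdin through untouched
echo 'hello' | setcat > /dev/null

# in a streaming job, swap it in as the mapper to inspect each task's environment
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar \
    -mapper setcat -reducer /bin/cat -input /data/input -output /data/env_check
</code></pre>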
data/bin/uniq-ord ADDED
@@ -0,0 +1,59 @@
+ #!/usr/bin/env ruby
+ # encoding: ASCII-8BIT
+ require 'set'
+
+ unless ARGV.empty?
+   unless ARGV.include?('--help')
+     puts "\n**\nSorry, uniq-ord only works in-line: cat foo.txt bar.tsv | uniq-ord\n**" ; puts
+   end
+   puts <<USAGE
+ uniq-ord is like the uniq command but doesn't depend on prior sorting: it tracks
+ each line and only emits the first-seen instance of that line.
+
+ The algorithm is /very/ simplistic: it uses ruby's built-in hash to track lines.
+ This can produce false positives, meaning that a line of output might be removed
+ even if it hasn't been seen before. It may also consume an unbounded amount of
+ memory (though less than the input text). With a million lines it will consume
+ about 70 MB of memory and have more than 1 in a million chance of false
+ positive. On a billion lines it will consume many GB and have over 25% odds of
+ incorrectly skipping a line.
+
+ However, it's really handy for dealing with in-order lists from the command line.
+ USAGE
+   exit(0)
+ end
+
+ # # Logging
+ #
+ # MB = 1024*1024
+ # LOG_INTERVAL = 100_000
+ # $start = Time.now; $iter = 0; $size = 0
+ # def log_line
+ #   elapsed = (Time.now - $start).to_f
+ #   $stderr.puts("%5d s\t%10.1f l/s\t%5dk<\t%5dk>\t%5d MB\t%9.1f MB/s\t%11d b/l"%[ elapsed, $iter/elapsed, $iter/1000, LINES.count/1000, $size/MB, ($size/MB)/elapsed, $size/$iter ])
+ # end
+
+ LINES = Set.new
+ $stdin.each do |line|
+   next if LINES.include?(line.hash)
+   puts line
+   LINES << line.hash
+   # $iter += 1 ; $size += line.length
+   # log_line if ($iter % LOG_INTERVAL == 0)
+ end
+ # log_line
+
+ #
+ # # 2.1 GB data, 1M lines, 2000 avg chars/line
+ #
+ # # Used:  RSS: 71_988 kB    VSZ: 2_509_152 kB
+ # # Stats: 38 s   25_859.1 l/s   1000k<   1000k>   1976 MB   51.1 MB/s   2072 b/l
+ # # Time:  real 0m41.4 s   user 0m31.6 s   sys 0m8.3 s   pct 96.48
+ #
+ # # 4.1 GB data, 5.6M lines, 800 avg chars/line
+ #
+ # # Used:  RSS: 330_644 kB   VSZ: 2_764_236 kB
+ # # Stats: 861   6_538.2 l/s   5632k<   5632k>   4158 MB   4.8 MB/s   774 b/l
+ # # Time:  real 14m24.6 s   user 13m8.8 s   sys 0m12. s   pct 92.61
+ #
+
data/examples/corpus/bucket_counter.rb ADDED
@@ -0,0 +1,47 @@
+
+ class BucketCounter
+   BUCKET_SIZE = 2**24
+   attr_reader :total
+
+   def initialize
+     @hsh = Hash.new{|h,k| h[k] = 0 }
+     @total = 0
+   end
+
+   # def [] val
+   #   @hsh[val]
+   # end
+   # def << val
+   #   @hsh[val] += 1; @total += 1 ; self
+   # end
+
+   def [] val
+     @hsh[val.hash % BUCKET_SIZE]
+   end
+   def << val
+     @hsh[val.hash % BUCKET_SIZE] += 1; @total += 1 ; self
+   end
+
+   def insert *words
+     words.flatten.each{|word| self << word }
+   end
+   def clear
+     @hsh.clear
+     @total = 0
+   end
+
+   def stats
+     { :total => total,
+       :size  => size,
+     }
+   end
+   def size() @hsh.size end
+
+   def full?
+     size.to_f / BUCKET_SIZE > 0.5
+   end
+
+   def each *args, &block
+     @hsh.each(*args, &block)
+   end
+ end
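A quick sketch of exercising BucketCounter on its own (values are invented); because keys are bucketed by val.hash % BUCKET_SIZE, counts for distinct values can occasionally collide:

<pre><code>
$: << File.dirname(__FILE__)   # assumes you are alongside bucket_counter.rb
require 'bucket_counter'

bucket = BucketCounter.new
bucket.insert %w[apple banana apple cherry]
bucket << 'apple'

puts bucket['apple']       # => 3, barring a hash-bucket collision
puts bucket.total          # => 5
puts bucket.stats.inspect  # => {:total=>5, :size=>3}
bucket.clear
</code></pre>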
data/examples/corpus/dbpedia_abstract_to_sentences.rb ADDED
@@ -0,0 +1,85 @@
+ #!/usr/bin/env ruby
+ require 'wukong/script'
+
+ #
+ # Use the Stanford NLP parser to split a piece of text into sentences
+ #
+ # @example
+ #   SentenceParser.split("Beware the Jabberwock, my son! The jaws that bite, the claws that catch! Beware the Jubjub bird, and shun The frumious Bandersnatch!")
+ #   # => [["Beware", "the", "Jabberwock", ",", "my", "son", "!"], ["The", "jaws", "that", "bite", ",", "the", "claws", "that", "catch", "!"], ["Beware", "the", "Jubjub", "bird", ",", "and", "shun", "The", "frumious", "Bandersnatch", "!"]]
+ #
+ class SentenceParser
+   def self.processor
+     return @processor if @processor
+     require 'rubygems'
+     require 'stanfordparser'
+     @processor = StanfordParser::DocumentPreprocessor.new
+   end
+
+   def self.split line
+     processor.getSentencesFromString(line).map{|s| s.map{|w| w.to_s } }
+   end
+ end
+
+ #
+ # takes one document per line
+ # splits into sentences
+ #
+ class WordNGrams < Wukong::Streamer::LineStreamer
+   def recordize line
+     line.strip!
+     line.gsub!(%r{^<http://dbpedia.org/resource/([^>]+)> <[^>]+> \"}, '') ; title = $1
+     line.gsub!(%r{\"@en \.},'')
+     [title, SentenceParser.split(line)]
+   end
+
+   def process title, sentences
+     sentences.each_with_index do |words, idx|
+       yield [title, idx, words].flatten
+     end
+   end
+ end
+
+ Wukong.run WordNGrams, nil, :partition_fields => 1, :sort_fields => 2
+
+ # ---------------------------------------------------------------------------
+ #
+ # Run Time:
+ #
+ # Job Name: dbpedia_abstract_to_sentences.rb---/data/rawd/encyc/dbpedia/dbpedia_dumps/short_abstracts_en.nt---/data/rawd/encyc/dbpedia/dbpedia_parsed/short_abstract_sentences
+ # Status:      Succeeded
+ # Started at:  Fri Jan 28 03:14:45 UTC 2011
+ # Finished in: 41mins, 50sec
+ # 3 machines: master m1.xlarge, 2 c1.xlarge workers; was having some over-memory issues on the c1.xls
+ #
+ # Counter                 Reduce           Total
+ # SLOTS_MILLIS_MAPS       0                10 126 566
+ # Launched map tasks      0                15
+ # Data-local map tasks    0                15
+ # SLOTS_MILLIS_REDUCES    0                1 217
+ # HDFS_BYTES_READ         1 327 116 133    1 327 116 133
+ # HDFS_BYTES_WRITTEN      1 229 841 020    1 229 841 020
+ # Map input records       3 261 096        3 261 096
+ # Spilled Records         0                0
+ # Map input bytes         1 326 524 800    1 326 524 800
+ # SPLIT_RAW_BYTES         1 500            1 500
+ # Map output records      9 026 343        9 026 343
+ #
+ # Job Name: dbpedia_abstract_to_sentences.rb---/data/rawd/encyc/dbpedia/dbpedia_dumps/long_abstracts_en.nt---/data/rawd/encyc/dbpedia/dbpedia_parsed/long_abstract_sentences
+ # Status:      Succeeded
+ # Started at:  Fri Jan 28 03:23:08 UTC 2011
+ # Finished in: 41mins, 11sec
+ # 3 machines: master m1.xlarge, 2 c1.xlarge workers; was having some over-memory issues on the c1.xls
+ #
+ # Counter                 Reduce           Total
+ # SLOTS_MILLIS_MAPS       0                19 872 357
+ # Launched map tasks      0                29
+ # Data-local map tasks    0                29
+ # SLOTS_MILLIS_REDUCES    0                5 504
+ # HDFS_BYTES_READ         2 175 900 769    2 175 900 769
+ # HDFS_BYTES_WRITTEN      2 280 332 736    2 280 332 736
+ # Map input records       3 261 096        3 261 096
+ # Spilled Records         0                0
+ # Map input bytes         2 174 849 644    2 174 849 644
+ # SPLIT_RAW_BYTES         2 533            2 533
+ # Map output records      15 425 467       15 425 467
data/examples/corpus/sentence_coocurrence.rb ADDED
@@ -0,0 +1,70 @@
+ #!/usr/bin/env ruby
+ $: << File.dirname(__FILE__)
+ require 'rubygems'
+ require 'wukong/script'
+ require 'bucket_counter'
+
+ #
+ # Coocurrence counts
+ #
+
+ #
+ # Input is a list of document-idx-sentences, each field is tab-separated
+ #   title  idx  word_a  word_b  word_c ...
+ #
+ # This emits each co-occurring pair exactly once; in the case of a three-word
+ # sentence the output would be
+ #
+ #   word_a  word_b
+ #   word_a  word_c
+ #   word_b  word_c
+ #
+ class SentenceCoocurrence < Wukong::Streamer::RecordStreamer
+   def initialize *args
+     super *args
+     @bucket = BucketCounter.new
+   end
+
+   def process title, idx, *words
+     words.each_with_index do |word_a, idx|
+       words[(idx+1) .. -1].each do |word_b|
+         @bucket << [word_a, word_b]
+       end
+     end
+     dump_bucket if @bucket.full?
+   end
+
+   def dump_bucket
+     @bucket.each do |pair_key, count|
+       emit [pair_key, count]
+     end
+     $stderr.puts "bucket stats: #{@bucket.stats.inspect}"
+     @bucket.clear
+   end
+
+   def after_stream
+     dump_bucket
+   end
+ end
+
+ #
+ # Combine multiple bucket counts into a single one
+ #
+ class CombineBuckets < Wukong::Streamer::AccumulatingReducer
+   def start! *args
+     @total = 0
+   end
+   def accumulate word, count
+     @total += count.to_i
+   end
+   def finalize
+     yield [@total, key] if @total > 20
+   end
+ end
+
+ Wukong.run(
+   SentenceCoocurrence,
+   CombineBuckets,
+   :io_sort_record_percent => 0.3,
+   :io_sort_mb             => 300
+ )
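A sketch of a local test run of the script above, using Wukong's local mode so no Hadoop cluster is needed (file names are made up):

<pre><code>
# input lines look like: title <tab> sentence_idx <tab> word_a <tab> word_b ...
./sentence_coocurrence.rb --run=local sentences.tsv pair_counts.tsv
</code></pre>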
data/examples/emr/README.textile ADDED
@@ -0,0 +1,110 @@
+ h1. Using Elastic Map-Reduce in Wukong
+
+ h2. Initial Setup
+
+ # Sign up for elastic map reduce and S3 at Amazon AWS.
+
+ # Download the Amazon elastic-mapreduce runner: either the official version at http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip or the infochimps fork (which has support for Ruby 1.9) at http://github.com/infochimps/elastic-mapreduce .
+
+ # Create a bucket and path to hold your EMR logs, scripts and other ephemera. For instance you might choose 'emr.yourdomain.com' as the bucket and '/wukong' as a scoping path within that bucket. In that case you will refer to it with a path like s3://emr.yourdomain.com/wukong (see notes below about s3n:// vs. s3:// URLs).
+
+ # Copy the contents of wukong/examples/emr/dot_wukong_dir to ~/.wukong
+
+ # Edit emr.yaml and credentials.json, adding your keys where appropriate and following the other instructions. Start with a single-node m1.small cluster, as you'll probably have some false starts before the flow of logging in, checking the logs, etc. becomes clear.
+
+ # You should now be good to launch a program. We'll give it the @--alive@ flag so that the machine sticks around if there were any issues:
+
+   ./elastic_mapreduce_example.rb --run=emr --alive s3://emr.yourdomain.com/wukong/data/input s3://emr.yourdomain.com/wukong/data/output
+
+ # If you visit the "AWS console":http://bit.ly/awsconsole you should now see a jobflow with two steps. The first sets up debugging for the job; the second is your hadoop task.
+
+ # The "AWS console":http://bit.ly/awsconsole also has the public IP of the master node. You can log in to the machine directly:
+
+ <pre>
+ ssh -i /path/to/your/keypair.pem hadoop@ec2-148-37-14-128.compute-1.amazonaws.com
+ </pre>
+
+ h3. Lorkbong
+
+ Lorkbong (named after the staff carried by Sun Wukong) is a very very simple example Heroku app that lets you trigger showing job status or launching a new job, either by visiting a special URL or by triggering a rake task. Get its code from
+
+   http://github.com/mrflip/lorkbong
+
+ h3. s3n:// vs. s3:// URLs
+
+ Many external tools use a URI convention to address files in S3; they typically use the 's3://' scheme, which makes a lot of sense:
+   s3://emr.yourcompany.com/wukong/happy_job_1/logs/whatever-20100808.log
+
+ Hadoop can maintain an HDFS on Amazon S3: it uses a block structure and has optimizations for streaming, no file size limitation, and other goodness. However, only hadoop tools can interpret the contents of those blocks -- to everything else it just looks like a soup of blocks labelled block_-8675309 and so forth. Hadoop unfortunately chose the 's3://' scheme for URIs in this filesystem:
+   s3://s3hdfs.yourcompany.com/path/to/data
+
+ Hadoop is happy to read s3 native files -- 'native' as in, you can look at them with a browser and upload and download them with any S3 tool out there. There's a 5GB limit on file size, and in some cases a performance hit (but not in our experience enough to worry about). You refer to these files with the 's3n://' scheme ('n' as in 'native'):
+   s3n://emr.yourcompany.com/wukong/happy_job_1/code/happy_job_1-mapper.rb
+   s3n://emr.yourcompany.com/wukong/happy_job_1/code/happy_job_1-reducer.rb
+   s3n://emr.yourcompany.com/wukong/happy_job_1/logs/whatever-20100808.log
+
+ Wukong will coerce things to the right scheme when it knows what that scheme should be (eg. code should be s3n://). It will otherwise leave the path alone. Specifically, if you use a URI scheme for input and output paths you must use 's3n://' for normal s3 files.
+
+ h2. Advanced Tips n' Tricks for common usage
+
+ h3. Direct access to logs using your browser
+
+ Each Hadoop component exposes a web dashboard for you to access. Use the following ports:
+
+ * 9100: Job tracker (master only)
+ * 9101: Namenode (master only)
+ * 9102: Datanodes
+ * 9103: Task trackers
+
+ They will only, however, respond to web requests from within the private cluster
+ subnet. You can browse the cluster by creating a persistent tunnel to the hadoop master node, and configuring your
+ browser to use it as a proxy.
+
+ h4. Create a tunneling proxy to your cluster
+
+ To create a tunnel from your local machine to the master node, substitute the keypair and the master node's address into this command:
+
+ <pre><code>
+ ssh -i ~/.wukong/keypairs/KEYPAIR.pem -f -N -D 6666 -o StrictHostKeyChecking=no -o "ConnectTimeout=10" -o "ServerAliveInterval=60" -o "ControlPath=none" ubuntu@MASTER_NODE_PUBLIC_IP
+ </code></pre>
+
+ The command will silently background itself if it worked.
+
+ h4. Make your browser use the proxy (but only for cluster machines)
+
+ You can access basic information by pointing your browser to "this Proxy
+ Auto-Configuration (PAC)
+ file.":http://github.com/infochimps/cluster_chef/raw/master/config/proxy.pac
+ You'll have issues if you browse around though, because many of the in-page
+ links will refer to addresses that only resolve within the cluster's private
+ namespace.
+
+ h4. Setup FoxyProxy
+
+ To fix this, use "FoxyProxy":https://addons.mozilla.org/en-US/firefox/addon/2464
+ It allows you to manage multiple proxy configurations and to use the proxy for
+ DNS resolution (curing the private address problem).
+
+ Once you've installed the FoxyProxy extension and restarted Firefox,
+
+ * Set FoxyProxy to 'Use Proxies based on their pre-defined patterns and priorities'
+ * Create a new proxy, called 'EC2 Socks Proxy' or something
+ * Automatic proxy configuration URL: http://github.com/infochimps/cluster_chef/raw/master/config/proxy.pac
+ * Under 'General', check yes for 'Perform remote DNS lookups on host'
+ * Add the following URL patterns as 'whitelist' using 'Wildcards' (not regular expression):
+
+ * <code>*.compute-*.internal*</code>
+ * <code>*ec2.internal*</code>
+ * <code>*domu*.internal*</code>
+ * <code>*ec2*.amazonaws.com*</code>
+ * <code>*://10.*</code>
+
+ And this one as blacklist:
+
+ * <code>https://us-*st-1.ec2.amazonaws.com/*</code>
+
+
+ h3. Pulling to your local machine
+
+   s3cmd sync s3://s3n.infinitemonkeys.info/emr/elastic_mapreduce_example/log/ /tmp/emr_log/
+
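Putting the s3n:// advice above together with the launch command from Initial Setup, a full EMR invocation might look like this (bucket and paths are made up):

<pre><code>
./elastic_mapreduce_example.rb --run=emr --alive \
    s3n://emr.yourdomain.com/wukong/data/input \
    s3n://emr.yourdomain.com/wukong/data/output
</code></pre>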
data/examples/emr/dot_wukong_dir/emr_bootstrap.sh CHANGED
@@ -1,4 +1,5 @@
  #!/usr/bin/env bash
+ set -x # turn on tracing
 
  # A url directory with the scripts you'd like to stuff into the machine
  REMOTE_FILE_URL_BASE="http://github.com/infochimps/wukong"
data/examples/emr/elastic_mapreduce_example.rb CHANGED
@@ -1,7 +1,8 @@
  #!/usr/bin/env ruby
  Dir[File.dirname(__FILE__)+'/vendor/**/lib'].each{|dir| $: << dir }
  require 'rubygems'
- require 'wukong'
+ require 'wukong/script'
+ require 'wukong/script/emr_command'
 
  #
  # * Copy the emr.yaml from here into ~/.wukong/emr.yaml
@@ -24,5 +25,4 @@ class FooStreamer < Wukong::Streamer::LineStreamer
  end
  end
 
- Settings.resolve!
  Wukong::Script.new(FooStreamer, FooStreamer).run
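Based on the diff above, a minimal EMR-runnable Wukong script now needs only the two requires and a streamer. A sketch (the class and its behavior are illustrative, not the gem's FooStreamer):

<pre><code>
#!/usr/bin/env ruby
require 'rubygems'
require 'wukong/script'
require 'wukong/script/emr_command'

# Upcases every input line; stands in for whatever your real mapper does
class ShoutStreamer < Wukong::Streamer::LineStreamer
  def process line
    emit line.upcase
  end
end

# Note: no Settings.resolve! needed any more, per the removal above
Wukong::Script.new(ShoutStreamer, nil).run

# Launch with, e.g.:
#   ./shouter.rb --run=emr s3n://your-bucket/input s3n://your-bucket/output
</code></pre>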