wukong 1.4.0 → 1.4.1
- data/README.textile +34 -7
- data/bin/cutc +1 -1
- data/bin/cuttab +1 -1
- data/bin/greptrue +1 -3
- data/bin/hdp-cat +1 -1
- data/bin/hdp-catd +1 -1
- data/bin/hdp-du +11 -6
- data/bin/hdp-get +1 -1
- data/bin/hdp-kill +1 -1
- data/bin/hdp-ls +1 -1
- data/bin/hdp-mkdir +1 -1
- data/bin/hdp-mv +1 -1
- data/bin/hdp-ps +1 -1
- data/bin/hdp-put +1 -1
- data/bin/hdp-rm +1 -1
- data/bin/hdp-sort +39 -19
- data/bin/hdp-stream +39 -19
- data/bin/hdp-stream-flat +9 -5
- data/bin/hdp-stream2 +39 -0
- data/bin/tabchar +1 -1
- data/bin/wu-date +13 -0
- data/bin/wu-datetime +13 -0
- data/bin/wu-plus +9 -0
- data/docpages/INSTALL.textile +0 -2
- data/docpages/index.textile +4 -2
- data/examples/apache_log_parser.rb +26 -14
- data/examples/graph/gen_symmetric_links.rb +10 -0
- data/examples/sample_records.rb +6 -8
- data/lib/wukong/datatypes/enum.rb +2 -2
- data/lib/wukong/dfs.rb +10 -9
- data/lib/wukong/encoding.rb +22 -4
- data/lib/wukong/extensions/emittable.rb +1 -1
- data/lib/wukong/extensions/hash_keys.rb +16 -0
- data/lib/wukong/extensions/hash_like.rb +17 -0
- data/lib/wukong/models/graph.rb +18 -20
- data/lib/wukong/schema.rb +13 -11
- data/lib/wukong/script.rb +26 -8
- data/lib/wukong/script/hadoop_command.rb +108 -2
- data/lib/wukong/streamer.rb +2 -0
- data/lib/wukong/streamer/base.rb +1 -0
- data/lib/wukong/streamer/record_streamer.rb +14 -0
- data/lib/wukong/streamer/struct_streamer.rb +2 -2
- data/spec/data/a_atsigns_b.tsv +64 -0
- data/spec/data/a_follows_b.tsv +53 -0
- data/spec/data/tweet.tsv +167 -0
- data/spec/data/twitter_user.tsv +55 -0
- data/wukong.gemspec +13 -3
- metadata +13 -23
data/README.textile
CHANGED
@@ -19,6 +19,21 @@ The **main documentation** lives on the "Wukong Pages.":http://mrflip.github.com
 * Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop)
 * "More info":http://mrflip.github.com/wukong/moreinfo.html
 
+h2. Imminent Changes
+
+I'm pushing to release "Wukong 3.0 the actual 1.0 release".
+
+* For reducing/uniqing, a notion of mutable_fields and immutable_fields and extrinsic_fields: two objects compare the same/differently if their mutable fields compare the same/differently
+* Methods on TypedStruct to
+
+* Make to_flat(false) the default, with the sort_fields / partition_fields defaulting to 2 each and very prominently documented
+* Standardize the notion that wukong classes have a "key"; by default, it will be to_a.first for Structs/TypedStructs. This shouldn't break anything.
+* May make some things that are derived classes into mixin'ed modules
+* Will probably change the name of AccumulatingReducer into just Accumulator, and have all Accumulator-derived classes include Accumulator; I'll make sure the old names continue to work though.
+
+
+*
+
 h2. Help!
 
 Send Wukong questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
@@ -29,17 +44,17 @@ h2. Install
 
 h3. Get the code
 
-We're still actively developing
+We're still actively developing wukong. The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/wukong
 
-pre. $ git clone git://github.com/mrflip/
+pre. $ git clone git://github.com/mrflip/wukong
 
-A gem is available from "gemcutter:":http://gemcutter.org/gems/
+A gem is available from "gemcutter:":http://gemcutter.org/gems/wukong
 
-pre. $ sudo gem install
+pre. $ sudo gem install wukong --source=http://gemcutter.org
 
 (don't use the gems.github.com version -- it's way out of date.)
 
-You can instead download this project in either "zip":http://github.com/mrflip/
+You can instead download this project in either "zip":http://github.com/mrflip/wukong/zipball/master or "tar":http://github.com/mrflip/wukong/tarball/master formats.
 
 h3. Dependencies and setup
 
@@ -190,9 +205,15 @@ bq. Sun Wukong possesses incredible strength, being able to lift his 13,500 jīn
 
 The "Jaime Hewlett / Damon Albarn short":http://news.bbc.co.uk/sport1/hi/olympics/monkey that the BBC made for their 2008 Olympics coverage gives the general idea.
 
-
+<notextile><div class="toggle"></notextile>
 
-
+h2. More info
+
+There are many useful examples in the examples/ directory.
+
+h3. Credits
+
+Monkeyshines was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org / "@mrflip":http://twitter.com/mrflip) for the "infochimps project":http://infochimps.org
 
 Patches submitted by:
 * gemified by Ben Woosley (ben.woosley with the gmails)
@@ -201,3 +222,9 @@ Patches submitted by:
 Thanks to:
 * "Brad Heintz":http://www.bradheintz.com/no1thing/talks/ for his early feedback
 * "Phil Ripperger":http://blog.pdatasolutions.com for his "wukong in the Amazon AWS cloud":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart tutorial.
+
+h3. Help!
+
+Send monkeyshines questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
+
+<notextile></div></notextile>
data/bin/cutc
CHANGED
data/bin/cuttab
CHANGED
data/bin/greptrue
CHANGED
@@ -1,8 +1,6 @@
 #!/usr/bin/env bash
 
 # runs grep but always returns a true exit status. (Otherwise hadoop vomits)
+# You can set a command line var in hadoop instead, but we'll leave this around
 grep "$@"
 true
-# runs grep but always returns a true exit status. (Otherwise hadoop vomits)
-egrep "$@"
-true
data/bin/hdp-cat
CHANGED
data/bin/hdp-catd
CHANGED
data/bin/hdp-du
CHANGED
@@ -5,7 +5,7 @@ OPTIONS={}
 #
 # grok options
 #
-if ARGV[0] =~
+if ARGV[0] =~ /\A-[sh]+\z/
   flags = ARGV.shift
   OPTIONS[:summary] = flags.include?('s')
   OPTIONS[:humanize] = flags.include?('h')
@@ -16,7 +16,7 @@ end
 #
 def prepare_command
   dfs_cmd = OPTIONS[:summary] ? 'dus' : 'du'
-  dfs_args =
+  dfs_args = ((!ARGV[0]) || ARGV[0]=='') ? '.' : "'#{ARGV.join("' '")}'"
   %Q{ hadoop dfs -#{dfs_cmd} #{dfs_args} }
 end
 
@@ -61,21 +61,26 @@ def number_to_human_size(size, precision=1)
   when size < 1.gigabyte; "%.#{precision}f MB" % (size / 1.0.megabyte)
   when size < 1.terabyte; "%.#{precision}f GB" % (size / 1.0.gigabyte)
   else "%.#{precision}f TB" % (size / 1.0.terabyte)
-  end
+  end #.sub(/([0-9]\.\d*?)0+ /, '\1 ' ).sub(/\. /,' ')
 rescue
   nil
 end
 
+OUTPUT_LINE_FMT = "%-71s\t%15d\t%15s"
 def format_output file, size
-  human_size = number_to_human_size(size) ||
+  human_size = number_to_human_size(size) || ""
   file = file.gsub(%r{hdfs://[^/]+/}, '/') # kill off hdfs paths, otherwise leave it alone
-
+  OUTPUT_LINE_FMT % [file, size.to_i, human_size]
 end
 
-
+entries_count = 0
+total_size = 0
 %x{ #{prepare_command} }.split("\n").each do |line|
   if line =~ /^Found \d+ items$/ then puts line ; next end
   info = line.split(/\s+/)
   if OPTIONS[:summary] then file, size = info else size, file = info end
   puts format_output(file, size)
+  total_size += size.to_i
+  entries_count += 1
 end
+$stderr.puts OUTPUT_LINE_FMT%[" #{"%55d"%entries_count} entries", total_size, number_to_human_size(total_size)]
data/bin/hdp-get
CHANGED
data/bin/hdp-kill
CHANGED
data/bin/hdp-ls
CHANGED
data/bin/hdp-mkdir
CHANGED
data/bin/hdp-mv
CHANGED
data/bin/hdp-ps
CHANGED
data/bin/hdp-put
CHANGED
data/bin/hdp-rm
CHANGED
data/bin/hdp-sort
CHANGED
@@ -4,26 +4,46 @@
 input_file=${1} ; shift
 output_file=${1} ; shift
 map_script=${1-/bin/cat} ; shift
-reduce_script=${1-/usr/bin/uniq}
-
+reduce_script=${1-/usr/bin/uniq} ; shift
+partfields=${1-2} ; shift
+sortfields=${1-2} ; shift
 
-if [ "$
+if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/cat] [reducer=/usr/bin/uniq] [sortfields=2] [partfields=1] [extra_args]" ; exit ; fi
 
 HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
-${HADOOP_HOME}/bin/hadoop \
-  jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar
-  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
-  -jobconf
-  -jobconf
-  -
-  -
-  -
-  -
-
-
-
-
-
-
-
+cmd="${HADOOP_HOME}/bin/hadoop \
+  jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar
+  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
+  -jobconf num.key.fields.for.partition=\"$partfields\"
+  -jobconf stream.num.map.output.key.fields=\"$sortfields\"
+  -mapper \"$map_script\"
+  -reducer \"$reduce_script\"
+  -input \"$input_file\"
+  -output \"$output_file\"
+  $@
+  "
+
+echo "$cmd"
+
+$cmd
+
+# -jobconf mapred.text.key.partitioner.options="-k1,$partfields" \
+# -jobconf stream.map.output.field.separator='\t' \
+# -jobconf map.output.key.field.separator='\t' \
+# -jobconf mapred.map.tasks=3 \
+# -jobconf mapred.reduce.tasks=3 \
+
+#
+# TODO:
+# http://issues.apache.org/jira/browse/MAPREDUCE-594
+# http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/KeyValueTextInputFormat.html
+# Instead of /bin/cat, Identity can be (I think)
+# -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
+# -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
+# ...
+#
+# TODO
+#
+# New-style secondary sort:
+# http://hadoop.apache.org/common/docs/r0.20.0/streaming.html
data/bin/hdp-stream
CHANGED
@@ -4,26 +4,46 @@
 input_file=${1} ; shift
 output_file=${1} ; shift
 map_script=${1-/bin/cat} ; shift
-reduce_script=${1-/usr/bin/uniq}
-
+reduce_script=${1-/usr/bin/uniq} ; shift
+partfields=${1-2} ; shift
+sortfields=${1-2} ; shift
 
-if [ "$
+if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/cat] [reducer=/usr/bin/uniq] [sortfields=2] [partfields=1] [extra_args]" ; exit ; fi
 
 HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
-${HADOOP_HOME}/bin/hadoop \
-  jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar
-  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
-  -jobconf
-  -jobconf
-  -
-  -
-  -
-  -
-
-
-
-
-
-
-
+cmd="${HADOOP_HOME}/bin/hadoop \
+  jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar
+  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
+  -jobconf num.key.fields.for.partition=\"$partfields\"
+  -jobconf stream.num.map.output.key.fields=\"$sortfields\"
+  -mapper \"$map_script\"
+  -reducer \"$reduce_script\"
+  -input \"$input_file\"
+  -output \"$output_file\"
+  $@
+  "
+
+echo "$cmd"
+
+$cmd
+
+# -jobconf mapred.text.key.partitioner.options="-k1,$partfields" \
+# -jobconf stream.map.output.field.separator='\t' \
+# -jobconf map.output.key.field.separator='\t' \
+# -jobconf mapred.map.tasks=3 \
+# -jobconf mapred.reduce.tasks=3 \
+
+#
+# TODO:
+# http://issues.apache.org/jira/browse/MAPREDUCE-594
+# http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/KeyValueTextInputFormat.html
+# Instead of /bin/cat, Identity can be (I think)
+# -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
+# -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
+# ...
+#
+# TODO
+#
+# New-style secondary sort:
+# http://hadoop.apache.org/common/docs/r0.20.0/streaming.html
|
data/bin/hdp-stream-flat
CHANGED
@@ -5,14 +5,18 @@ output_file=${1} ; shift
 map_script=${1-/bin/cat} ; shift
 reduce_script=${1-/usr/bin/uniq} ; shift
 
-if [ "$
+if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/cat] [reducer=/usr/bin/uniq] [extra_args]" ; exit ; fi
 
-
+HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
+
+# Can add fun stuff like
+# -jobconf mapred.map.tasks=3 \
+# -jobconf mapred.reduce.tasks=3 \
+
+exec ${HADOOP_HOME}/bin/hadoop \
+  jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar \
   -mapper "$map_script" \
   -reducer "$reduce_script" \
   -input "$input_file" \
   -output "$output_file" \
   "$@"
-
-# -jobconf mapred.map.tasks=3 \
-# -jobconf mapred.reduce.tasks=3 \
data/bin/hdp-stream2
ADDED
@@ -0,0 +1,39 @@
+#!/usr/bin/env ruby
+require 'wukong'
+
+# Example usage:
+#
+# ~/ics/wukong/bin/hdp-stream2 input_path1,input_path2 output_path \
+#   "`which cuttab` 2,3,7" "`which uniq` -c" 1 3 -jobconf mapred.reduce.tasks=23
+
+
+# options = Wukong::CONFIG[:runner_defaults].dup
+
+# cmdline_opts = Hash.zip(
+#   [ :input_file, :output_file,
+#     :map_command, :reduce_command,
+#     :partition_fields, :sort_fields],
+#   ARGV.map{|s| s.blank? ? nil : s }
+#   )
+# argvs = ARGV.slice!(0..5) ;
+# ARGV.unshift cmdline_opts[:input_file];
+# ARGV.unshift cmdline_opts[:output_file]
+# p [argvs, ARGV]
+#
+# # cmdline_opts[:map_command] = `which cat`.chomp if cmdline_opts[:map_command].blank?
+# # cmdline_opts[:reduce_command] = nil if cmdline_opts[:reduce_command].blank?
+# cmdline_opts[:dry_run] = true
+# cmdline_opts[:run] = true
+
+#p cmdline_opts, Wukong::CONFIG[:runner_defaults]
+
+# Go script go!
+runner = Wukong::Script.new(
+  nil, # use mapper_command
+  nil, # use reducer_command
+  :run => true
+  )
+# runner.options.merge cmdline_opts
+runner.options[:reuse_jvms] = true if runner.options[:reuse_jvms].blank?
+
+runner.run
data/bin/tabchar
CHANGED
data/bin/wu-date
ADDED
data/bin/wu-datetime
ADDED