RubyGems - wukong - Versions diffs - 1.4.0 → 1.4.1 - Mend

wukong 1.4.0 → 1.4.1


Files changed (48)

data/README.textile +34 -7
data/bin/cutc +1 -1
data/bin/cuttab +1 -1
data/bin/greptrue +1 -3
data/bin/hdp-cat +1 -1
data/bin/hdp-catd +1 -1
data/bin/hdp-du +11 -6
data/bin/hdp-get +1 -1
data/bin/hdp-kill +1 -1
data/bin/hdp-ls +1 -1
data/bin/hdp-mkdir +1 -1
data/bin/hdp-mv +1 -1
data/bin/hdp-ps +1 -1
data/bin/hdp-put +1 -1
data/bin/hdp-rm +1 -1
data/bin/hdp-sort +39 -19
data/bin/hdp-stream +39 -19
data/bin/hdp-stream-flat +9 -5
data/bin/hdp-stream2 +39 -0
data/bin/tabchar +1 -1
data/bin/wu-date +13 -0
data/bin/wu-datetime +13 -0
data/bin/wu-plus +9 -0
data/docpages/INSTALL.textile +0 -2
data/docpages/index.textile +4 -2
data/examples/apache_log_parser.rb +26 -14
data/examples/graph/gen_symmetric_links.rb +10 -0
data/examples/sample_records.rb +6 -8
data/lib/wukong/datatypes/enum.rb +2 -2
data/lib/wukong/dfs.rb +10 -9
data/lib/wukong/encoding.rb +22 -4
data/lib/wukong/extensions/emittable.rb +1 -1
data/lib/wukong/extensions/hash_keys.rb +16 -0
data/lib/wukong/extensions/hash_like.rb +17 -0
data/lib/wukong/models/graph.rb +18 -20
data/lib/wukong/schema.rb +13 -11
data/lib/wukong/script.rb +26 -8
data/lib/wukong/script/hadoop_command.rb +108 -2
data/lib/wukong/streamer.rb +2 -0
data/lib/wukong/streamer/base.rb +1 -0
data/lib/wukong/streamer/record_streamer.rb +14 -0
data/lib/wukong/streamer/struct_streamer.rb +2 -2
data/spec/data/a_atsigns_b.tsv +64 -0
data/spec/data/a_follows_b.tsv +53 -0
data/spec/data/tweet.tsv +167 -0
data/spec/data/twitter_user.tsv +55 -0
data/wukong.gemspec +13 -3
metadata +13 -23

data/README.textile CHANGED

@@ -19,6 +19,21 @@ The **main documentation** lives on the "Wukong Pages.":http://mrflip.github.com
 * Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop)
 * "More info":http://mrflip.github.com/wukong/moreinfo.html
+h2. Imminent Changes
+I'm pushing to release "Wukong 3.0 the actual 1.0 release".
+* For reducing/uniqing, a notion of mutable_fields and immutable_fields and extrinsic_fields: two objects compare the same/differently if their mutable fields compare the same/differently
+* Methods on TypedStruct to
+    * Make to_flat(false) the default, with the sort_fields / partition_fields defaulting to 2 each and very prominently documented
+    * Standardize the notion that wukong classes have a "key"; by default, it will be to_a.first for Structs/TypedStructs. This shouldn't break anything.
+    * May make some things that are derived classes into mixin'ed modules
+    * Will probably change the name of AccumulatingReducer into just Accumulator, and have all Accumulator-derived classes include Accumulator; I'll make sure the old names continue to work though.
+*
 h2. Help!
 Send Wukong questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
@@ -29,17 +44,17 @@ h2. Install
 h3. Get the code
-We're still actively developing {{ site.gemname }}.  The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/{{ site.gemname }}
+We're still actively developing wukong.  The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/wukong
-pre. $ git clone git://github.com/mrflip/{{ site.gemname }}
+pre. $ git clone git://github.com/mrflip/wukong
-A gem is available from "gemcutter:":http://gemcutter.org/gems/{{ site.gemname }}
+A gem is available from "gemcutter:":http://gemcutter.org/gems/wukong
-pre. $ sudo gem install {{ site.gemname }} --source=http://gemcutter.org
+pre. $ sudo gem install wukong --source=http://gemcutter.org
 (don't use the gems.github.com version -- it's way out of date.)
-You can instead download this project in either "zip":http://github.com/mrflip/{{ site.gemname }}/zipball/master or "tar":http://github.com/mrflip/{{ site.gemname }}/tarball/master formats.
+You can instead download this project in either "zip":http://github.com/mrflip/wukong/zipball/master or "tar":http://github.com/mrflip/wukong/tarball/master formats.
 h3. Dependencies and setup
@@ -190,9 +205,15 @@ bq. Sun Wukong possesses incredible strength, being able to lift his 13,500 jīn
 The "Jaime Hewlett / Damon Albarn short":http://news.bbc.co.uk/sport1/hi/olympics/monkey that the BBC made for their 2008 Olympics coverage gives the general idea.
-h2. Credits
+<notextile><div class="toggle"></notextile>
-Wukong was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org) for the "infochimps project":http://infochimps.org
+h2. More info
+There are many useful examples in the examples/ directory.
+h3. Credits
+Monkeyshines was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org / "@mrflip":http://twitter.com/mrflip) for the "infochimps project":http://infochimps.org
 Patches submitted by:
 * gemified by Ben Woosley (ben.woosley with the gmails)
@@ -201,3 +222,9 @@ Patches submitted by:
 Thanks to:
 * "Brad Heintz":http://www.bradheintz.com/no1thing/talks/ for his early feedback
 * "Phil Ripperger":http://blog.pdatasolutions.com for his "wukong in the Amazon AWS cloud":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart tutorial.
+h3. Help!
+Send monkeyshines questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
+<notextile></div></notextile>

data/bin/cutc CHANGED

@@ -27,4 +27,4 @@ shift
 #
 # Do the cuttin'
 #
-cut -c"${cutchars}" "$@"
+exec cut -c"${cutchars}" "$@"
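
Reviewer note: most of the one-line wrappers in this release gain a leading exec (cutc here, and the hdp-* scripts further down). exec replaces the wrapper's shell process with the command, so nothing after that line runs and the caller sees the command's own exit status. That is presumably also why greptrue, which still has to run true after grep, is left without an exec. A minimal sketch of the effect, with echo and false as stand-ins for the wrapped commands:

```bash
#!/usr/bin/env bash
# Stand-in demo: `echo` and `false` play the role of cut / hadoop.

# Without exec the shell survives the command and runs the next line.
without_exec=$(bash -c 'echo wrapped; echo "still in the wrapper"')

# With exec the shell is replaced, so the second echo never runs.
with_exec=$(bash -c 'exec echo wrapped; echo "never reached"')

# The exit status seen by the caller is that of the exec'd command.
bash -c 'exec false'
exec_status=$?

printf 'without exec:\n%s\n' "$without_exec"
printf 'with exec:\n%s\n' "$with_exec"
printf 'status of exec false: %s\n' "$exec_status"
```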

data/bin/cuttab CHANGED

@@ -2,4 +2,4 @@
 fields=${1-"1-"}
 shift
-cut  -d'	' -f"$fields" "$@"
+exec cut  -d'	' -f"$fields" "$@"

data/bin/greptrue CHANGED

@@ -1,8 +1,6 @@
 #!/usr/bin/env bash
 # runs grep but always returns a true exit status. (Otherwise hadoop vomits)
+# You can set a command line var in hadoop instead, but we'll leave this around
 grep "$@"
 true
-# runs grep but always returns a true exit status. (Otherwise hadoop vomits)
-egrep "$@"
-true

data/bin/hdp-cat CHANGED

@@ -1,3 +1,3 @@
 #!/usr/bin/env bash
-hadoop dfs -cat "$@"
+exec hadoop dfs -cat "$@"

data/bin/hdp-catd CHANGED

@@ -1,3 +1,3 @@
 #!/usr/bin/env bash
 args=`echo "$@" | ruby -ne 'a = $_.split(/\s+/); puts a.map{|arg| arg+"/[^_]*" }.join(" ")'`
-hadoop dfs -cat $args
+exec hadoop dfs -cat $args

data/bin/hdp-du CHANGED

@@ -5,7 +5,7 @@ OPTIONS={}
 #
 # grok options
 #
-if ARGV[0] =~ /-[a-z]+/
+if ARGV[0] =~ /\A-[sh]+\z/
   flags = ARGV.shift
   OPTIONS[:summary]  = flags.include?('s')
   OPTIONS[:humanize] = flags.include?('h')
@@ -16,7 +16,7 @@ end
 #
 def prepare_command
   dfs_cmd  = OPTIONS[:summary] ? 'dus' : 'du'
-  dfs_args = "'" + ARGV.join("' '") + "'"
+  dfs_args = ((!ARGV[0]) || ARGV[0]=='') ? '.' : "'#{ARGV.join("' '")}'"
   %Q{ hadoop dfs -#{dfs_cmd} #{dfs_args} }
 end
@@ -61,21 +61,26 @@ def number_to_human_size(size, precision=1)
   when size < 1.gigabyte; "%.#{precision}f MB"  % (size / 1.0.megabyte)
   when size < 1.terabyte; "%.#{precision}f GB"  % (size / 1.0.gigabyte)
   else                    "%.#{precision}f TB"  % (size / 1.0.terabyte)
-  end.sub(/([0-9]\.\d*?)0+ /, '\1 ' ).sub(/\. /,' ')
+  end #.sub(/([0-9]\.\d*?)0+ /, '\1 ' ).sub(/\. /,' ')
 rescue
   nil
 end
+OUTPUT_LINE_FMT = "%-71s\t%15d\t%15s"
 def format_output file, size
-  human_size = number_to_human_size(size) || 3
+  human_size = number_to_human_size(size) || ""
   file = file.gsub(%r{hdfs://[^/]+/}, '/') # kill off hdfs paths, otherwise leave it alone
-  "%-71s\t%15d\t%15s" % [file, size.to_i, human_size]
+  OUTPUT_LINE_FMT % [file, size.to_i, human_size]
 end
+entries_count = 0
+total_size  = 0
 %x{ #{prepare_command} }.split("\n").each do |line|
   if line =~ /^Found \d+ items$/ then puts line ; next end
   info = line.split(/\s+/)
   if OPTIONS[:summary] then file, size = info else size, file = info end
   puts format_output(file, size)
+  total_size  += size.to_i
+  entries_count += 1
 end
+$stderr.puts OUTPUT_LINE_FMT%[" #{"%55d"%entries_count} entries", total_size, number_to_human_size(total_size)]
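
Reviewer note: the tightened option test is the main behavioral change in hdp-du. The old pattern was unanchored, so a first argument that merely contained a dash followed by a lowercase letter (a path such as my-dir) was shifted off as a flags argument. The new pattern accepts only a whole argument made of s and h flags. The sketch below restates both patterns in bash purely for illustration; the script itself is Ruby, and is_flags_old / is_flags_new are hypothetical names.

```bash
#!/usr/bin/env bash
# Bash restatement of the two Ruby patterns; function names are illustrative.
old_pattern='-[a-z]+'     # 1.4.0: /-[a-z]+/      (unanchored)
new_pattern='^-[sh]+$'    # 1.4.1: /\A-[sh]+\z/   (whole argument, only s and h)

is_flags_old() { [[ $1 =~ $old_pattern ]]; }
is_flags_new() { [[ $1 =~ $new_pattern ]]; }

# Compare how each pattern classifies a few first arguments.
for arg in -sh -s my-dir -x; do
  old=no; new=no
  is_flags_old "$arg" && old=yes
  is_flags_new "$arg" && new=yes
  printf '%-8s old=%-3s new=%s\n' "$arg" "$old" "$new"
done
```

Only -sh and -s are treated as flags by the new pattern, while the old one also accepts my-dir and -x.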

data/bin/hdp-get CHANGED

@@ -1,3 +1,3 @@
 #!/usr/bin/env bash
-hadoop dfs -copyToLocal "$1" "$2"
+exec hadoop dfs -copyToLocal "$1" "$2"

data/bin/hdp-kill CHANGED

@@ -1,3 +1,3 @@
 #!/usr/bin/env bash
-hadoop job -kill "$@"
+exec hadoop job -kill "$@"

data/bin/hdp-ls CHANGED

@@ -7,4 +7,4 @@ else
     action=ls
 fi
-hadoop dfs -$action "$@"
+exec hadoop dfs -$action "$@"

data/bin/hdp-mkdir CHANGED

@@ -1,3 +1,3 @@
 #!/usr/bin/env bash
-hadoop dfs -mkdir "$@"
+exec hadoop dfs -mkdir "$@"

data/bin/hdp-mv CHANGED

@@ -1,3 +1,3 @@
 #!/usr/bin/env bash
-hadoop dfs -mv "$@"
+exec hadoop dfs -mv "$@"

data/bin/hdp-ps CHANGED

@@ -1,3 +1,3 @@
 #!/usr/bin/env bash
-hadoop job -list all
+exec hadoop job -list all

data/bin/hdp-put CHANGED

@@ -1,3 +1,3 @@
 #!/usr/bin/env bash
-hadoop dfs -put "$1" "$2"
+exec hadoop dfs -put "$1" "$2"

data/bin/hdp-rm CHANGED

@@ -8,4 +8,4 @@ else
 fi
 echo hadoop dfs -$action "$@"
 # read -p "Hit ctrl-C to abort or enter to do this...."
-hadoop dfs -$action "$@"
+exec hadoop dfs -$action "$@"

data/bin/hdp-sort CHANGED

@@ -4,26 +4,46 @@
 input_file=${1} 		; shift
 output_file=${1} 		; shift
 map_script=${1-/bin/cat}	; shift
-reduce_script=${1-/usr/bin/uniq}	; shift
-fields=${1-2} 			; shift
+reduce_script=${1-/usr/bin/uniq} ; shift
+partfields=${1-2} 		; shift
+sortfields=${1-2} 		; shift
-if [ "$reduce_script" == "" ] ; then echo "$0 input_file output_file [sort_fields] [mapper] [reducer] [args]" ; exit ; fi
+if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/cat] [reducer=/usr/bin/uniq] [sortfields=2] [partfields=1] [extra_args]" ; exit ; fi
 HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
-${HADOOP_HOME}/bin/hadoop \
-     jar         ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar		\
-    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner 			\
-    -jobconf     map.output.key.field.separator='\t'					\
-    -jobconf     num.key.fields.for.partition=1 					\
-    -jobconf 	 stream.map.output.field.separator='\t'					\
-    -jobconf 	 stream.num.map.output.key.fields="$fields"				\
-    -mapper  	 "$map_script"  							\
-    -reducer	 "$reduce_script"							\
-    -input       "$input_file"								\
-    -output  	 "$output_file"								\
-    "$@"
-# -jobconf mapred.map.tasks=3                                                       \
-# -jobconf mapred.reduce.tasks=3                                                    \
+cmd="${HADOOP_HOME}/bin/hadoop \
+     jar         ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar
+    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
+    -jobconf     num.key.fields.for.partition=\"$partfields\"
+    -jobconf 	 stream.num.map.output.key.fields=\"$sortfields\"
+    -mapper  	 \"$map_script\"
+    -reducer	 \"$reduce_script\"
+    -input       \"$input_file\"
+    -output  	 \"$output_file\"
+    $@
+    "
+echo "$cmd"
+$cmd
+# -jobconf      mapred.text.key.partitioner.options="-k1,$partfields"                   \
+# -jobconf      stream.map.output.field.separator='\t'                                  \
+# -jobconf      map.output.key.field.separator='\t'                                     \
+# -jobconf      mapred.map.tasks=3                                                      \
+# -jobconf      mapred.reduce.tasks=3                                                   \
+#
+# TODO:
+#   http://issues.apache.org/jira/browse/MAPREDUCE-594
+#   http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/KeyValueTextInputFormat.html
+#   Instead of /bin/cat, Identity can be (I think)
+#     -inputformat    org.apache.hadoop.mapred.KeyValueTextInputFormat \
+#     -mapper         org.apache.hadoop.mapred.lib.IdentityMapper      \
+#     ...
+#
+# TODO
+#
+#   New-style secondary sort:
+#     http://hadoop.apache.org/common/docs/r0.20.0/streaming.html
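
Reviewer note: one thing to watch in the new cmd string is that the escaped quotes are stored as literal characters, and an unquoted $cmd expansion only word-splits; it does not re-parse quotes. In plain bash terms, a mapper such as "cuttab 2,3" would be passed on as two words carrying stray quote characters, whereas the 1.4.0 form passed "$map_script" as a single argument. The sketch below shows the difference and an array-based alternative; show_args is a hypothetical stand-in for the hadoop binary.

```bash
#!/usr/bin/env bash
# show_args is a stand-in for hadoop: it prints one argument per line.
show_args() { printf '<%s>\n' "$@"; }

map_script='cuttab 2,3'

# 1.4.1 style: build a string, then expand it unquoted.
cmd="show_args -mapper \"$map_script\""
string_result=$($cmd)

# Alternative: keep each argument as one array element.
cmd_array=(show_args -mapper "$map_script")
array_result=$("${cmd_array[@]}")

printf 'string form:\n%s\n' "$string_result"
printf 'array form:\n%s\n' "$array_result"
# The array can still be logged before running: printf '%q ' "${cmd_array[@]}"
```

The string form delivers three arguments, two of them with a literal quote attached; the array form delivers -mapper and the intact mapper command.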

data/bin/hdp-stream CHANGED

@@ -4,26 +4,46 @@
 input_file=${1} 		; shift
 output_file=${1} 		; shift
 map_script=${1-/bin/cat}	; shift
-reduce_script=${1-/usr/bin/uniq}	; shift
-fields=${1-2} 			; shift
+reduce_script=${1-/usr/bin/uniq} ; shift
+partfields=${1-2} 		; shift
+sortfields=${1-2} 		; shift
-if [ "$reduce_script" == "" ] ; then echo "$0 input_file output_file [sort_fields] [mapper] [reducer] [args]" ; exit ; fi
+if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/cat] [reducer=/usr/bin/uniq] [sortfields=2] [partfields=1] [extra_args]" ; exit ; fi
 HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
-${HADOOP_HOME}/bin/hadoop \
-     jar         ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar		\
-    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner 			\
-    -jobconf     map.output.key.field.separator='\t'					\
-    -jobconf     num.key.fields.for.partition=1 					\
-    -jobconf 	 stream.map.output.field.separator='\t'					\
-    -jobconf 	 stream.num.map.output.key.fields="$fields"				\
-    -mapper  	 "$map_script"  							\
-    -reducer	 "$reduce_script"							\
-    -input       "$input_file"								\
-    -output  	 "$output_file"								\
-    "$@"
-# -jobconf mapred.map.tasks=3                                                       \
-# -jobconf mapred.reduce.tasks=3                                                    \
+cmd="${HADOOP_HOME}/bin/hadoop \
+     jar         ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar
+    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
+    -jobconf     num.key.fields.for.partition=\"$partfields\"
+    -jobconf 	 stream.num.map.output.key.fields=\"$sortfields\"
+    -mapper  	 \"$map_script\"
+    -reducer	 \"$reduce_script\"
+    -input       \"$input_file\"
+    -output  	 \"$output_file\"
+    $@
+    "
+echo "$cmd"
+$cmd
+# -jobconf      mapred.text.key.partitioner.options="-k1,$partfields"                   \
+# -jobconf      stream.map.output.field.separator='\t'                                  \
+# -jobconf      map.output.key.field.separator='\t'                                     \
+# -jobconf      mapred.map.tasks=3                                                      \
+# -jobconf      mapred.reduce.tasks=3                                                   \
+#
+# TODO:
+#   http://issues.apache.org/jira/browse/MAPREDUCE-594
+#   http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/KeyValueTextInputFormat.html
+#   Instead of /bin/cat, Identity can be (I think)
+#     -inputformat    org.apache.hadoop.mapred.KeyValueTextInputFormat \
+#     -mapper         org.apache.hadoop.mapred.lib.IdentityMapper      \
+#     ...
+#
+# TODO
+#
+#   New-style secondary sort:
+#     http://hadoop.apache.org/common/docs/r0.20.0/streaming.html
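
Reviewer note: the usage string and the code disagree in this hunk. The message lists [sortfields=2] [partfields=1], but the positional parameters are read as partfields first and sortfields second, and both default to 2. The sketch below isolates that parsing so the order is easy to check; parse_args is a hypothetical wrapper around the same parameter expansions.

```bash
#!/usr/bin/env bash
# Same ${1-default} expansions as the script, wrapped in a function for testing.
parse_args() {
  input_file=${1};                  shift
  output_file=${1};                 shift
  map_script=${1-/bin/cat};         shift
  reduce_script=${1-/usr/bin/uniq}; shift
  partfields=${1-2};                shift
  sortfields=${1-2};                shift
}

# Fifth positional argument lands in partfields, sixth in sortfields.
parse_args in.tsv out_dir /bin/cat /usr/bin/uniq 1 3
echo "partfields=$partfields sortfields=$sortfields"
```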

data/bin/hdp-stream-flat CHANGED

@@ -5,14 +5,18 @@ output_file=${1} 			; shift
 map_script=${1-/bin/cat}		; shift
 reduce_script=${1-/usr/bin/uniq}	; shift
-if [ "$reduce_script" == "" ] ; then echo "$0 input_file output_file [sort_fields] [mapper] [reducer] [args]" ; exit ; fi
+if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/cat] [reducer=/usr/bin/uniq] [extra_args]" ; exit ; fi
-hadoop jar /home/flip/hadoop/h/contrib/streaming/hadoop-*-streaming.jar			\
+HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
+# Can add fun stuff like
+# -jobconf mapred.map.tasks=3                                                       \
+# -jobconf mapred.reduce.tasks=3                                                    \
+exec ${HADOOP_HOME}/bin/hadoop \
+     jar         ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar		\
     -mapper  	"$map_script"  								\
     -reducer	"$reduce_script"							\
     -input      "$input_file"								\
     -output  	"$output_file"								\
     "$@"
-# -jobconf mapred.map.tasks=3                                                       \
-# -jobconf mapred.reduce.tasks=3                                                    \

data/bin/hdp-stream2 ADDED

@@ -0,0 +1,39 @@
+#!/usr/bin/env ruby
+require 'wukong'
+# Example usage:
+#
+#  ~/ics/wukong/bin/hdp-stream2 input_path1,input_path2 output_path  \
+#     "`which cuttab` 2,3,7" "`which uniq` -c" 1 3 -jobconf mapred.reduce.tasks=23
+# options = Wukong::CONFIG[:runner_defaults].dup
+# cmdline_opts = Hash.zip(
+#   [ :input_file, :output_file,
+#     :map_command, :reduce_command,
+#     :partition_fields, :sort_fields],
+#   ARGV.map{|s| s.blank? ? nil : s }
+#   )
+# argvs = ARGV.slice!(0..5) ;
+# ARGV.unshift cmdline_opts[:input_file];
+# ARGV.unshift cmdline_opts[:output_file]
+# p [argvs, ARGV]
+#
+# # cmdline_opts[:map_command]    = `which cat`.chomp if cmdline_opts[:map_command].blank?
+# # cmdline_opts[:reduce_command] = nil               if cmdline_opts[:reduce_command].blank?
+# cmdline_opts[:dry_run] = true
+# cmdline_opts[:run]     = true
+#p cmdline_opts, Wukong::CONFIG[:runner_defaults]
+# Go script go!
+runner = Wukong::Script.new(
+  nil, # use mapper_command
+  nil, # use reducer_command
+  :run => true
+  )
+# runner.options.merge cmdline_opts
+runner.options[:reuse_jvms] = true if runner.options[:reuse_jvms].blank?
+runner.run

data/bin/tabchar CHANGED

@@ -2,4 +2,4 @@
 # insert a tab char from the command line:
 # echo "hi$(tabchar)there"
 # # => "hi	there"
-echo -n -e '\t'
+exec echo -n -e '\t'

data/bin/wu-date ADDED

@@ -0,0 +1,13 @@
+#!/bin/sh
+#
+# Outputs a compact wukong-style date:
+#
+#
+#	$ date
+#       Sun Nov  8 03:21:37 CST 2009
+#	$ wu-date
+#	20091108
+#
+exec date +"%Y%m%d"

data/bin/wu-datetime ADDED

@@ -0,0 +1,13 @@
+#!/bin/sh
+#
+# Outputs a compact wukong-style datetime:
+#
+#
+#	$ date
+#       Sun Nov  8 03:21:37 CST 2009
+#	$ wu-datetime
+#	20091108032137
+#
+exec date +"%Y%m%d%H%M%D"
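
Reviewer note: the format string added here ends in %D, which date expands to the mm/dd/yy date rather than to seconds, so the output will not match the 20091108032137 shown in the header comment (for that timestamp it would come out as 20091108032111/08/09). A sketch of the presumably intended format, with %S for seconds; the function names are hypothetical and exist only so the snippet can be tested.

```bash
#!/usr/bin/env bash
# Assumed intent: year, month, day, hour, minute, second as 14 digits.
wu_datetime() { date +"%Y%m%d%H%M%S"; }

# As released in 1.4.1, for comparison: %D expands to mm/dd/yy.
wu_datetime_as_released() { date +"%Y%m%d%H%M%D"; }

wu_datetime
wu_datetime_as_released
```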

data/bin/wu-plus ADDED

@@ -0,0 +1,9 @@
+#!/usr/bin/env ruby
+sum   = 0.0
+lines = 0
+$stdin.each do |n|
+  sum   += n.to_f
+  lines += 1
+end
+puts "%15d\t%15d"%[sum, lines]