wukong 1.4.0 → 1.4.1

Files changed (48)
  1. data/README.textile +34 -7
  2. data/bin/cutc +1 -1
  3. data/bin/cuttab +1 -1
  4. data/bin/greptrue +1 -3
  5. data/bin/hdp-cat +1 -1
  6. data/bin/hdp-catd +1 -1
  7. data/bin/hdp-du +11 -6
  8. data/bin/hdp-get +1 -1
  9. data/bin/hdp-kill +1 -1
  10. data/bin/hdp-ls +1 -1
  11. data/bin/hdp-mkdir +1 -1
  12. data/bin/hdp-mv +1 -1
  13. data/bin/hdp-ps +1 -1
  14. data/bin/hdp-put +1 -1
  15. data/bin/hdp-rm +1 -1
  16. data/bin/hdp-sort +39 -19
  17. data/bin/hdp-stream +39 -19
  18. data/bin/hdp-stream-flat +9 -5
  19. data/bin/hdp-stream2 +39 -0
  20. data/bin/tabchar +1 -1
  21. data/bin/wu-date +13 -0
  22. data/bin/wu-datetime +13 -0
  23. data/bin/wu-plus +9 -0
  24. data/docpages/INSTALL.textile +0 -2
  25. data/docpages/index.textile +4 -2
  26. data/examples/apache_log_parser.rb +26 -14
  27. data/examples/graph/gen_symmetric_links.rb +10 -0
  28. data/examples/sample_records.rb +6 -8
  29. data/lib/wukong/datatypes/enum.rb +2 -2
  30. data/lib/wukong/dfs.rb +10 -9
  31. data/lib/wukong/encoding.rb +22 -4
  32. data/lib/wukong/extensions/emittable.rb +1 -1
  33. data/lib/wukong/extensions/hash_keys.rb +16 -0
  34. data/lib/wukong/extensions/hash_like.rb +17 -0
  35. data/lib/wukong/models/graph.rb +18 -20
  36. data/lib/wukong/schema.rb +13 -11
  37. data/lib/wukong/script.rb +26 -8
  38. data/lib/wukong/script/hadoop_command.rb +108 -2
  39. data/lib/wukong/streamer.rb +2 -0
  40. data/lib/wukong/streamer/base.rb +1 -0
  41. data/lib/wukong/streamer/record_streamer.rb +14 -0
  42. data/lib/wukong/streamer/struct_streamer.rb +2 -2
  43. data/spec/data/a_atsigns_b.tsv +64 -0
  44. data/spec/data/a_follows_b.tsv +53 -0
  45. data/spec/data/tweet.tsv +167 -0
  46. data/spec/data/twitter_user.tsv +55 -0
  47. data/wukong.gemspec +13 -3
  48. metadata +13 -23
data/README.textile CHANGED
@@ -19,6 +19,21 @@ The **main documentation** lives on the "Wukong Pages.":http://mrflip.github.com
  * Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop)
  * "More info":http://mrflip.github.com/wukong/moreinfo.html
 
+ h2. Imminent Changes
+
+ I'm pushing to release "Wukong 3.0 the actual 1.0 release".
+
+ * For reducing/uniqing, a notion of mutable_fields and immutable_fields and extrinsic_fields: two objects compare the same/differently if their mutable fields compare the same/differently
+ * Methods on TypedStruct to
+
+ * Make to_flat(false) the default, with the sort_fields / partition_fields defaulting to 2 each and very prominently documented
+ * Standardize the notion that wukong classes have a "key"; by default, it will be to_a.first for Structs/TypedStructs. This shouldn't break anything.
+ * May make some things that are derived classes into mixin'ed modules
+ * Will probably change the name of AccumulatingReducer into just Accumulator, and have all Accumulator-derived classes include Accumulator; I'll make sure the old names continue to work though.
+
+
+ *
+
  h2. Help!
 
  Send Wukong questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
@@ -29,17 +44,17 @@ h2. Install
 
  h3. Get the code
 
- We're still actively developing {{ site.gemname }}. The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/{{ site.gemname }}
+ We're still actively developing wukong. The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/wukong
 
- pre. $ git clone git://github.com/mrflip/{{ site.gemname }}
+ pre. $ git clone git://github.com/mrflip/wukong
 
- A gem is available from "gemcutter:":http://gemcutter.org/gems/{{ site.gemname }}
+ A gem is available from "gemcutter:":http://gemcutter.org/gems/wukong
 
- pre. $ sudo gem install {{ site.gemname }} --source=http://gemcutter.org
+ pre. $ sudo gem install wukong --source=http://gemcutter.org
 
  (don't use the gems.github.com version -- it's way out of date.)
 
- You can instead download this project in either "zip":http://github.com/mrflip/{{ site.gemname }}/zipball/master or "tar":http://github.com/mrflip/{{ site.gemname }}/tarball/master formats.
+ You can instead download this project in either "zip":http://github.com/mrflip/wukong/zipball/master or "tar":http://github.com/mrflip/wukong/tarball/master formats.
 
  h3. Dependencies and setup
 
@@ -190,9 +205,15 @@ bq. Sun Wukong possesses incredible strength, being able to lift his 13,500 jīn
 
  The "Jaime Hewlett / Damon Albarn short":http://news.bbc.co.uk/sport1/hi/olympics/monkey that the BBC made for their 2008 Olympics coverage gives the general idea.
 
- h2. Credits
+ <notextile><div class="toggle"></notextile>
 
- Wukong was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org) for the "infochimps project":http://infochimps.org
+ h2. More info
+
+ There are many useful examples in the examples/ directory.
+
+ h3. Credits
+
+ Monkeyshines was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org / "@mrflip":http://twitter.com/mrflip) for the "infochimps project":http://infochimps.org
 
  Patches submitted by:
  * gemified by Ben Woosley (ben.woosley with the gmails)
@@ -201,3 +222,9 @@ Patches submitted by:
  Thanks to:
  * "Brad Heintz":http://www.bradheintz.com/no1thing/talks/ for his early feedback
  * "Phil Ripperger":http://blog.pdatasolutions.com for his "wukong in the Amazon AWS cloud":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart tutorial.
+
+ h3. Help!
+
+ Send monkeyshines questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
data/bin/cutc CHANGED
@@ -27,4 +27,4 @@ shift
  #
  # Do the cuttin'
  #
- cut -c"${cutchars}" "$@"
+ exec cut -c"${cutchars}" "$@"
data/bin/cuttab CHANGED
@@ -2,4 +2,4 @@
 
  fields=${1-"1-"}
  shift
- cut -d' ' -f"$fields" "$@"
+ exec cut -d' ' -f"$fields" "$@"
data/bin/greptrue CHANGED
@@ -1,8 +1,6 @@
  #!/usr/bin/env bash
 
  # runs grep but always returns a true exit status. (Otherwise hadoop vomits)
+ # You can set a command line var in hadoop instead, but we'll leave this around
  grep "$@"
  true
- # runs grep but always returns a true exit status. (Otherwise hadoop vomits)
- egrep "$@"
- true
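
Note: this wrapper exists because Hadoop Streaming treats a nonzero exit status from a mapper as a task failure, and grep exits 1 when nothing matches. A purely illustrative invocation (paths and pattern made up) would be:

    hdp-stream-flat input_path output_path "`which greptrue` -i wukong" "`which uniq`"
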
data/bin/hdp-cat CHANGED
@@ -1,3 +1,3 @@
  #!/usr/bin/env bash
 
- hadoop dfs -cat "$@"
+ exec hadoop dfs -cat "$@"
data/bin/hdp-catd CHANGED
@@ -1,3 +1,3 @@
  #!/usr/bin/env bash
  args=`echo "$@" | ruby -ne 'a = $_.split(/\s+/); puts a.map{|arg| arg+"/[^_]*" }.join(" ")'`
- hadoop dfs -cat $args
+ exec hadoop dfs -cat $args
data/bin/hdp-du CHANGED
@@ -5,7 +5,7 @@ OPTIONS={}
  #
  # grok options
  #
- if ARGV[0] =~ /-[a-z]+/
+ if ARGV[0] =~ /\A-[sh]+\z/
  flags = ARGV.shift
  OPTIONS[:summary] = flags.include?('s')
  OPTIONS[:humanize] = flags.include?('h')
@@ -16,7 +16,7 @@ end
  #
  def prepare_command
  dfs_cmd = OPTIONS[:summary] ? 'dus' : 'du'
- dfs_args = "'" + ARGV.join("' '") + "'"
+ dfs_args = ((!ARGV[0]) || ARGV[0]=='') ? '.' : "'#{ARGV.join("' '")}'"
  %Q{ hadoop dfs -#{dfs_cmd} #{dfs_args} }
  end
 
@@ -61,21 +61,26 @@ def number_to_human_size(size, precision=1)
  when size < 1.gigabyte; "%.#{precision}f MB" % (size / 1.0.megabyte)
  when size < 1.terabyte; "%.#{precision}f GB" % (size / 1.0.gigabyte)
  else "%.#{precision}f TB" % (size / 1.0.terabyte)
- end.sub(/([0-9]\.\d*?)0+ /, '\1 ' ).sub(/\. /,' ')
+ end #.sub(/([0-9]\.\d*?)0+ /, '\1 ' ).sub(/\. /,' ')
  rescue
  nil
  end
 
+ OUTPUT_LINE_FMT = "%-71s\t%15d\t%15s"
  def format_output file, size
- human_size = number_to_human_size(size) || 3
+ human_size = number_to_human_size(size) || ""
  file = file.gsub(%r{hdfs://[^/]+/}, '/') # kill off hdfs paths, otherwise leave it alone
- "%-71s\t%15d\t%15s" % [file, size.to_i, human_size]
+ OUTPUT_LINE_FMT % [file, size.to_i, human_size]
  end
 
-
+ entries_count = 0
+ total_size = 0
  %x{ #{prepare_command} }.split("\n").each do |line|
  if line =~ /^Found \d+ items$/ then puts line ; next end
  info = line.split(/\s+/)
  if OPTIONS[:summary] then file, size = info else size, file = info end
  puts format_output(file, size)
+ total_size += size.to_i
+ entries_count += 1
  end
+ $stderr.puts OUTPUT_LINE_FMT%[" #{"%55d"%entries_count} entries", total_size, number_to_human_size(total_size)]
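
An illustrative run of the updated script (path is made up); with no path argument it now defaults to '.', and the totals line goes to stderr:

    hdp-du -sh /user/flip/logs
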
data/bin/hdp-get CHANGED
@@ -1,3 +1,3 @@
  #!/usr/bin/env bash
 
- hadoop dfs -copyToLocal "$1" "$2"
+ exec hadoop dfs -copyToLocal "$1" "$2"
data/bin/hdp-kill CHANGED
@@ -1,3 +1,3 @@
  #!/usr/bin/env bash
 
- hadoop job -kill "$@"
+ exec hadoop job -kill "$@"
data/bin/hdp-ls CHANGED
@@ -7,4 +7,4 @@ else
  action=ls
  fi
 
- hadoop dfs -$action "$@"
+ exec hadoop dfs -$action "$@"
data/bin/hdp-mkdir CHANGED
@@ -1,3 +1,3 @@
  #!/usr/bin/env bash
 
- hadoop dfs -mkdir "$@"
+ exec hadoop dfs -mkdir "$@"
data/bin/hdp-mv CHANGED
@@ -1,3 +1,3 @@
  #!/usr/bin/env bash
 
- hadoop dfs -mv "$@"
+ exec hadoop dfs -mv "$@"
data/bin/hdp-ps CHANGED
@@ -1,3 +1,3 @@
  #!/usr/bin/env bash
 
- hadoop job -list all
+ exec hadoop job -list all
data/bin/hdp-put CHANGED
@@ -1,3 +1,3 @@
  #!/usr/bin/env bash
 
- hadoop dfs -put "$1" "$2"
+ exec hadoop dfs -put "$1" "$2"
data/bin/hdp-rm CHANGED
@@ -8,4 +8,4 @@ else
  fi
  echo hadoop dfs -$action "$@"
  # read -p "Hit ctrl-C to abort or enter to do this...."
- hadoop dfs -$action "$@"
+ exec hadoop dfs -$action "$@"
data/bin/hdp-sort CHANGED
@@ -4,26 +4,46 @@
  input_file=${1} ; shift
  output_file=${1} ; shift
  map_script=${1-/bin/cat} ; shift
- reduce_script=${1-/usr/bin/uniq} ; shift
- fields=${1-2} ; shift
+ reduce_script=${1-/usr/bin/uniq} ; shift
+ partfields=${1-2} ; shift
+ sortfields=${1-2} ; shift
 
- if [ "$reduce_script" == "" ] ; then echo "$0 input_file output_file [sort_fields] [mapper] [reducer] [args]" ; exit ; fi
+ if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/cat] [reducer=/usr/bin/uniq] [sortfields=2] [partfields=1] [extra_args]" ; exit ; fi
 
  HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
- ${HADOOP_HOME}/bin/hadoop \
- jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar \
- -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
- -jobconf map.output.key.field.separator='\t' \
- -jobconf num.key.fields.for.partition=1 \
- -jobconf stream.map.output.field.separator='\t' \
- -jobconf stream.num.map.output.key.fields="$fields" \
- -mapper "$map_script" \
- -reducer "$reduce_script" \
- -input "$input_file" \
- -output "$output_file" \
- "$@"
-
-
- # -jobconf mapred.map.tasks=3 \
- # -jobconf mapred.reduce.tasks=3 \
+ cmd="${HADOOP_HOME}/bin/hadoop \
+ jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar
+ -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
+ -jobconf num.key.fields.for.partition=\"$partfields\"
+ -jobconf stream.num.map.output.key.fields=\"$sortfields\"
+ -mapper \"$map_script\"
+ -reducer \"$reduce_script\"
+ -input \"$input_file\"
+ -output \"$output_file\"
+ $@
+ "
+
+ echo "$cmd"
+
+ $cmd
+
+ # -jobconf mapred.text.key.partitioner.options="-k1,$partfields" \
+ # -jobconf stream.map.output.field.separator='\t' \
+ # -jobconf map.output.key.field.separator='\t' \
+ # -jobconf mapred.map.tasks=3 \
+ # -jobconf mapred.reduce.tasks=3 \
+
+ #
+ # TODO:
+ # http://issues.apache.org/jira/browse/MAPREDUCE-594
+ # http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/KeyValueTextInputFormat.html
+ # Instead of /bin/cat, Identity can be (I think)
+ # -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
+ # -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
+ # ...
+ #
+ # TODO
+ #
+ # New-style secondary sort:
+ # http://hadoop.apache.org/common/docs/r0.20.0/streaming.html
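
The positional arguments now end with the partition and sort field counts (partfields, then sortfields, in the order the script reads them). An illustrative invocation of hdp-sort (or the identical hdp-stream below), with made-up paths and field counts:

    hdp-sort input_path output_path "`which cuttab` 1,2,5" "`which uniq` -c" 1 2 -jobconf mapred.reduce.tasks=20
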
data/bin/hdp-stream CHANGED
@@ -4,26 +4,46 @@
  input_file=${1} ; shift
  output_file=${1} ; shift
  map_script=${1-/bin/cat} ; shift
- reduce_script=${1-/usr/bin/uniq} ; shift
- fields=${1-2} ; shift
+ reduce_script=${1-/usr/bin/uniq} ; shift
+ partfields=${1-2} ; shift
+ sortfields=${1-2} ; shift
 
- if [ "$reduce_script" == "" ] ; then echo "$0 input_file output_file [sort_fields] [mapper] [reducer] [args]" ; exit ; fi
+ if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/cat] [reducer=/usr/bin/uniq] [sortfields=2] [partfields=1] [extra_args]" ; exit ; fi
 
  HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
- ${HADOOP_HOME}/bin/hadoop \
- jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar \
- -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
- -jobconf map.output.key.field.separator='\t' \
- -jobconf num.key.fields.for.partition=1 \
- -jobconf stream.map.output.field.separator='\t' \
- -jobconf stream.num.map.output.key.fields="$fields" \
- -mapper "$map_script" \
- -reducer "$reduce_script" \
- -input "$input_file" \
- -output "$output_file" \
- "$@"
-
-
- # -jobconf mapred.map.tasks=3 \
- # -jobconf mapred.reduce.tasks=3 \
+ cmd="${HADOOP_HOME}/bin/hadoop \
+ jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar
+ -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
+ -jobconf num.key.fields.for.partition=\"$partfields\"
+ -jobconf stream.num.map.output.key.fields=\"$sortfields\"
+ -mapper \"$map_script\"
+ -reducer \"$reduce_script\"
+ -input \"$input_file\"
+ -output \"$output_file\"
+ $@
+ "
+
+ echo "$cmd"
+
+ $cmd
+
+ # -jobconf mapred.text.key.partitioner.options="-k1,$partfields" \
+ # -jobconf stream.map.output.field.separator='\t' \
+ # -jobconf map.output.key.field.separator='\t' \
+ # -jobconf mapred.map.tasks=3 \
+ # -jobconf mapred.reduce.tasks=3 \
+
+ #
+ # TODO:
+ # http://issues.apache.org/jira/browse/MAPREDUCE-594
+ # http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/KeyValueTextInputFormat.html
+ # Instead of /bin/cat, Identity can be (I think)
+ # -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
+ # -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
+ # ...
+ #
+ # TODO
+ #
+ # New-style secondary sort:
+ # http://hadoop.apache.org/common/docs/r0.20.0/streaming.html
data/bin/hdp-stream-flat CHANGED
@@ -5,14 +5,18 @@ output_file=${1} ; shift
  map_script=${1-/bin/cat} ; shift
  reduce_script=${1-/usr/bin/uniq} ; shift
 
- if [ "$reduce_script" == "" ] ; then echo "$0 input_file output_file [sort_fields] [mapper] [reducer] [args]" ; exit ; fi
+ if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/cat] [reducer=/usr/bin/uniq] [extra_args]" ; exit ; fi
 
- hadoop jar /home/flip/hadoop/h/contrib/streaming/hadoop-*-streaming.jar \
+ HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
+
+ # Can add fun stuff like
+ # -jobconf mapred.map.tasks=3 \
+ # -jobconf mapred.reduce.tasks=3 \
+
+ exec ${HADOOP_HOME}/bin/hadoop \
+ jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar \
  -mapper "$map_script" \
  -reducer "$reduce_script" \
  -input "$input_file" \
  -output "$output_file" \
  "$@"
-
- # -jobconf mapred.map.tasks=3 \
- # -jobconf mapred.reduce.tasks=3 \
data/bin/hdp-stream2 ADDED
@@ -0,0 +1,39 @@
+ #!/usr/bin/env ruby
+ require 'wukong'
+
+ # Example usage:
+ #
+ # ~/ics/wukong/bin/hdp-stream2 input_path1,input_path2 output_path \
+ # "`which cuttab` 2,3,7" "`which uniq` -c" 1 3 -jobconf mapred.reduce.tasks=23
+
+
+ # options = Wukong::CONFIG[:runner_defaults].dup
+
+ # cmdline_opts = Hash.zip(
+ # [ :input_file, :output_file,
+ # :map_command, :reduce_command,
+ # :partition_fields, :sort_fields],
+ # ARGV.map{|s| s.blank? ? nil : s }
+ # )
+ # argvs = ARGV.slice!(0..5) ;
+ # ARGV.unshift cmdline_opts[:input_file];
+ # ARGV.unshift cmdline_opts[:output_file]
+ # p [argvs, ARGV]
+ #
+ # # cmdline_opts[:map_command] = `which cat`.chomp if cmdline_opts[:map_command].blank?
+ # # cmdline_opts[:reduce_command] = nil if cmdline_opts[:reduce_command].blank?
+ # cmdline_opts[:dry_run] = true
+ # cmdline_opts[:run] = true
+
+ #p cmdline_opts, Wukong::CONFIG[:runner_defaults]
+
+ # Go script go!
+ runner = Wukong::Script.new(
+ nil, # use mapper_command
+ nil, # use reducer_command
+ :run => true
+ )
+ # runner.options.merge cmdline_opts
+ runner.options[:reuse_jvms] = true if runner.options[:reuse_jvms].blank?
+
+ runner.run
data/bin/tabchar CHANGED
@@ -2,4 +2,4 @@
  # insert a tab char from the command line:
  # echo "hi$(tabchar)there"
  # # => "hi there"
- echo -n -e '\t'
+ exec echo -n -e '\t'
data/bin/wu-date ADDED
@@ -0,0 +1,13 @@
+ #!/bin/sh
+
+ #
+ # Outputs a compact wukong-style date:
+ #
+ #
+ # $ date
+ # Sun Nov 8 03:21:37 CST 2009
+ # $ wu-date
+ # 20091108
+ #
+
+ exec date +"%Y%m%d"
data/bin/wu-datetime ADDED
@@ -0,0 +1,13 @@
+ #!/bin/sh
+
+ #
+ # Outputs a compact wukong-style datetime:
+ #
+ #
+ # $ date
+ # Sun Nov 8 03:21:37 CST 2009
+ # $ wu-datetime
+ # 20091108032137
+ #
+
+ exec date +"%Y%m%d%H%M%S"
data/bin/wu-plus ADDED
@@ -0,0 +1,9 @@
+ #!/usr/bin/env ruby
+
+ sum = 0.0
+ lines = 0
+ $stdin.each do |n|
+ sum += n.to_f
+ lines += 1
+ end
+ puts "%15d\t%15d"%[sum, lines]