wukong 1.4.0 → 1.4.1
- data/README.textile +34 -7
- data/bin/cutc +1 -1
- data/bin/cuttab +1 -1
- data/bin/greptrue +1 -3
- data/bin/hdp-cat +1 -1
- data/bin/hdp-catd +1 -1
- data/bin/hdp-du +11 -6
- data/bin/hdp-get +1 -1
- data/bin/hdp-kill +1 -1
- data/bin/hdp-ls +1 -1
- data/bin/hdp-mkdir +1 -1
- data/bin/hdp-mv +1 -1
- data/bin/hdp-ps +1 -1
- data/bin/hdp-put +1 -1
- data/bin/hdp-rm +1 -1
- data/bin/hdp-sort +39 -19
- data/bin/hdp-stream +39 -19
- data/bin/hdp-stream-flat +9 -5
- data/bin/hdp-stream2 +39 -0
- data/bin/tabchar +1 -1
- data/bin/wu-date +13 -0
- data/bin/wu-datetime +13 -0
- data/bin/wu-plus +9 -0
- data/docpages/INSTALL.textile +0 -2
- data/docpages/index.textile +4 -2
- data/examples/apache_log_parser.rb +26 -14
- data/examples/graph/gen_symmetric_links.rb +10 -0
- data/examples/sample_records.rb +6 -8
- data/lib/wukong/datatypes/enum.rb +2 -2
- data/lib/wukong/dfs.rb +10 -9
- data/lib/wukong/encoding.rb +22 -4
- data/lib/wukong/extensions/emittable.rb +1 -1
- data/lib/wukong/extensions/hash_keys.rb +16 -0
- data/lib/wukong/extensions/hash_like.rb +17 -0
- data/lib/wukong/models/graph.rb +18 -20
- data/lib/wukong/schema.rb +13 -11
- data/lib/wukong/script.rb +26 -8
- data/lib/wukong/script/hadoop_command.rb +108 -2
- data/lib/wukong/streamer.rb +2 -0
- data/lib/wukong/streamer/base.rb +1 -0
- data/lib/wukong/streamer/record_streamer.rb +14 -0
- data/lib/wukong/streamer/struct_streamer.rb +2 -2
- data/spec/data/a_atsigns_b.tsv +64 -0
- data/spec/data/a_follows_b.tsv +53 -0
- data/spec/data/tweet.tsv +167 -0
- data/spec/data/twitter_user.tsv +55 -0
- data/wukong.gemspec +13 -3
- metadata +13 -23
data/README.textile
CHANGED
@@ -19,6 +19,21 @@ The **main documentation** lives on the "Wukong Pages.":http://mrflip.github.com
 * Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop)
 * "More info":http://mrflip.github.com/wukong/moreinfo.html
 
+h2. Imminent Changes
+
+I'm pushing to release "Wukong 3.0 the actual 1.0 release".
+
+* For reducing/uniqing, a notion of mutable_fields and immutable_fields and extrinsic_fields: two objects compare the same/differently if their mutable fields compare the same/differently
+* Methods on TypedStruct to
+
+* Make to_flat(false) the default, with the sort_fields / partition_fields defaulting to 2 each and very prominently documented
+* Standardize the notion that wukong classes have a "key"; by default, it will be to_a.first for Structs/TypedStructs. This shouldn't break anything.
+* May make some things that are derived classes into mixin'ed modules
+* Will probably change the name of AccumulatingReducer into just Accumulator, and have all Accumulator-derived classes include Accumulator; I'll make sure the old names continue to work though.
+
+
+*
+
 h2. Help!
 
 Send Wukong questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
@@ -29,17 +44,17 @@ h2. Install
 
 h3. Get the code
 
-We're still actively developing
+We're still actively developing wukong. The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/wukong
 
-pre. $ git clone git://github.com/mrflip/
+pre. $ git clone git://github.com/mrflip/wukong
 
-A gem is available from "gemcutter:":http://gemcutter.org/gems/
+A gem is available from "gemcutter:":http://gemcutter.org/gems/wukong
 
-pre. $ sudo gem install
+pre. $ sudo gem install wukong --source=http://gemcutter.org
 
 (don't use the gems.github.com version -- it's way out of date.)
 
-You can instead download this project in either "zip":http://github.com/mrflip/
+You can instead download this project in either "zip":http://github.com/mrflip/wukong/zipball/master or "tar":http://github.com/mrflip/wukong/tarball/master formats.
 
 h3. Dependencies and setup
 
@@ -190,9 +205,15 @@ bq. Sun Wukong possesses incredible strength, being able to lift his 13,500 jīn
 
 The "Jaime Hewlett / Damon Albarn short":http://news.bbc.co.uk/sport1/hi/olympics/monkey that the BBC made for their 2008 Olympics coverage gives the general idea.
 
-
+<notextile><div class="toggle"></notextile>
 
-
+h2. More info
+
+There are many useful examples in the examples/ directory.
+
+h3. Credits
+
+Monkeyshines was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org / "@mrflip":http://twitter.com/mrflip) for the "infochimps project":http://infochimps.org
 
 Patches submitted by:
 * gemified by Ben Woosley (ben.woosley with the gmails)
@@ -201,3 +222,9 @@ Patches submitted by:
 Thanks to:
 * "Brad Heintz":http://www.bradheintz.com/no1thing/talks/ for his early feedback
 * "Phil Ripperger":http://blog.pdatasolutions.com for his "wukong in the Amazon AWS cloud":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart tutorial.
+
+h3. Help!
+
+Send monkeyshines questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
+
+<notextile></div></notextile>
data/bin/cutc
CHANGED
data/bin/cuttab
CHANGED
data/bin/greptrue
CHANGED
@@ -1,8 +1,6 @@
 #!/usr/bin/env bash
 
 # runs grep but always returns a true exit status. (Otherwise hadoop vomits)
+# You can set a command line var in hadoop instead, but we'll leave this around
 grep "$@"
 true
-# runs grep but always returns a true exit status. (Otherwise hadoop vomits)
-egrep "$@"
-true
data/bin/hdp-cat
CHANGED
data/bin/hdp-catd
CHANGED
data/bin/hdp-du
CHANGED
@@ -5,7 +5,7 @@ OPTIONS={}
 #
 # grok options
 #
-if ARGV[0] =~
+if ARGV[0] =~ /\A-[sh]+\z/
   flags = ARGV.shift
   OPTIONS[:summary] = flags.include?('s')
   OPTIONS[:humanize] = flags.include?('h')
@@ -16,7 +16,7 @@ end
 #
 def prepare_command
   dfs_cmd = OPTIONS[:summary] ? 'dus' : 'du'
-  dfs_args =
+  dfs_args = ((!ARGV[0]) || ARGV[0]=='') ? '.' : "'#{ARGV.join("' '")}'"
   %Q{ hadoop dfs -#{dfs_cmd} #{dfs_args} }
 end
 
@@ -61,21 +61,26 @@ def number_to_human_size(size, precision=1)
   when size < 1.gigabyte; "%.#{precision}f MB" % (size / 1.0.megabyte)
   when size < 1.terabyte; "%.#{precision}f GB" % (size / 1.0.gigabyte)
   else "%.#{precision}f TB" % (size / 1.0.terabyte)
-  end
+  end #.sub(/([0-9]\.\d*?)0+ /, '\1 ' ).sub(/\. /,' ')
 rescue
   nil
 end
 
+OUTPUT_LINE_FMT = "%-71s\t%15d\t%15s"
 def format_output file, size
-  human_size = number_to_human_size(size) ||
+  human_size = number_to_human_size(size) || ""
   file = file.gsub(%r{hdfs://[^/]+/}, '/') # kill off hdfs paths, otherwise leave it alone
-
+  OUTPUT_LINE_FMT % [file, size.to_i, human_size]
 end
 
-
+entries_count = 0
+total_size = 0
 %x{ #{prepare_command} }.split("\n").each do |line|
   if line =~ /^Found \d+ items$/ then puts line ; next end
   info = line.split(/\s+/)
   if OPTIONS[:summary] then file, size = info else size, file = info end
   puts format_output(file, size)
+  total_size += size.to_i
+  entries_count += 1
 end
+$stderr.puts OUTPUT_LINE_FMT%[" #{"%55d"%entries_count} entries", total_size, number_to_human_size(total_size)]
data/bin/hdp-get
CHANGED
data/bin/hdp-kill
CHANGED
data/bin/hdp-ls
CHANGED
data/bin/hdp-mkdir
CHANGED
data/bin/hdp-mv
CHANGED
data/bin/hdp-ps
CHANGED
data/bin/hdp-put
CHANGED
data/bin/hdp-rm
CHANGED
data/bin/hdp-sort
CHANGED
@@ -4,26 +4,46 @@
 input_file=${1} ; shift
 output_file=${1} ; shift
 map_script=${1-/bin/cat} ; shift
-reduce_script=${1-/usr/bin/uniq}
-
+reduce_script=${1-/usr/bin/uniq} ; shift
+partfields=${1-2} ; shift
+sortfields=${1-2} ; shift
 
-if [ "$
+if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/cat] [reducer=/usr/bin/uniq] [sortfields=2] [partfields=1] [extra_args]" ; exit ; fi
 
 HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
-${HADOOP_HOME}/bin/hadoop \
-  jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar
-  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
-  -jobconf
-  -jobconf
-  -
-  -
-  -
-  -
-
-
-
-
-
-
-
+cmd="${HADOOP_HOME}/bin/hadoop \
+  jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar
+  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
+  -jobconf num.key.fields.for.partition=\"$partfields\"
+  -jobconf stream.num.map.output.key.fields=\"$sortfields\"
+  -mapper \"$map_script\"
+  -reducer \"$reduce_script\"
+  -input \"$input_file\"
+  -output \"$output_file\"
+  $@
+  "
+
+echo "$cmd"
+
+$cmd
+
+# -jobconf mapred.text.key.partitioner.options="-k1,$partfields" \
+# -jobconf stream.map.output.field.separator='\t' \
+# -jobconf map.output.key.field.separator='\t' \
+# -jobconf mapred.map.tasks=3 \
+# -jobconf mapred.reduce.tasks=3 \
+
+#
+# TODO:
+# http://issues.apache.org/jira/browse/MAPREDUCE-594
+# http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/KeyValueTextInputFormat.html
+# Instead of /bin/cat, Identity can be (I think)
+# -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
+# -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
+# ...
+#
+# TODO
+#
+# New-style secondary sort:
+# http://hadoop.apache.org/common/docs/r0.20.0/streaming.html
data/bin/hdp-stream
CHANGED
@@ -4,26 +4,46 @@
 input_file=${1} ; shift
 output_file=${1} ; shift
 map_script=${1-/bin/cat} ; shift
-reduce_script=${1-/usr/bin/uniq}
-
+reduce_script=${1-/usr/bin/uniq} ; shift
+partfields=${1-2} ; shift
+sortfields=${1-2} ; shift
 
-if [ "$
+if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/cat] [reducer=/usr/bin/uniq] [sortfields=2] [partfields=1] [extra_args]" ; exit ; fi
 
 HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
-${HADOOP_HOME}/bin/hadoop \
-  jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar
-  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
-  -jobconf
-  -jobconf
-  -
-  -
-  -
-  -
-
-
-
-
-
-
-
+cmd="${HADOOP_HOME}/bin/hadoop \
+  jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar
+  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
+  -jobconf num.key.fields.for.partition=\"$partfields\"
+  -jobconf stream.num.map.output.key.fields=\"$sortfields\"
+  -mapper \"$map_script\"
+  -reducer \"$reduce_script\"
+  -input \"$input_file\"
+  -output \"$output_file\"
+  $@
+  "
+
+echo "$cmd"
+
+$cmd
+
+# -jobconf mapred.text.key.partitioner.options="-k1,$partfields" \
+# -jobconf stream.map.output.field.separator='\t' \
+# -jobconf map.output.key.field.separator='\t' \
+# -jobconf mapred.map.tasks=3 \
+# -jobconf mapred.reduce.tasks=3 \
+
+#
+# TODO:
+# http://issues.apache.org/jira/browse/MAPREDUCE-594
+# http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/KeyValueTextInputFormat.html
+# Instead of /bin/cat, Identity can be (I think)
+# -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
+# -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
+# ...
+#
+# TODO
+#
+# New-style secondary sort:
+# http://hadoop.apache.org/common/docs/r0.20.0/streaming.html
|
data/bin/hdp-stream-flat
CHANGED
@@ -5,14 +5,18 @@ output_file=${1} ; shift
 map_script=${1-/bin/cat} ; shift
 reduce_script=${1-/usr/bin/uniq} ; shift
 
-if [ "$
+if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/cat] [reducer=/usr/bin/uniq] [extra_args]" ; exit ; fi
 
-
+HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
+
+# Can add fun stuff like
+# -jobconf mapred.map.tasks=3 \
+# -jobconf mapred.reduce.tasks=3 \
+
+exec ${HADOOP_HOME}/bin/hadoop \
+  jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar \
   -mapper "$map_script" \
   -reducer "$reduce_script" \
   -input "$input_file" \
   -output "$output_file" \
   "$@"
-
-# -jobconf mapred.map.tasks=3 \
-# -jobconf mapred.reduce.tasks=3 \
data/bin/hdp-stream2
ADDED
@@ -0,0 +1,39 @@
+#!/usr/bin/env ruby
+require 'wukong'
+
+# Example usage:
+#
+# ~/ics/wukong/bin/hdp-stream2 input_path1,input_path2 output_path \
+#   "`which cuttab` 2,3,7" "`which uniq` -c" 1 3 -jobconf mapred.reduce.tasks=23
+
+
+# options = Wukong::CONFIG[:runner_defaults].dup
+
+# cmdline_opts = Hash.zip(
+#   [ :input_file, :output_file,
+#     :map_command, :reduce_command,
+#     :partition_fields, :sort_fields],
+#   ARGV.map{|s| s.blank? ? nil : s }
+#   )
+# argvs = ARGV.slice!(0..5) ;
+# ARGV.unshift cmdline_opts[:input_file];
+# ARGV.unshift cmdline_opts[:output_file]
+# p [argvs, ARGV]
+#
+# # cmdline_opts[:map_command] = `which cat`.chomp if cmdline_opts[:map_command].blank?
+# # cmdline_opts[:reduce_command] = nil if cmdline_opts[:reduce_command].blank?
+# cmdline_opts[:dry_run] = true
+# cmdline_opts[:run] = true
+
+#p cmdline_opts, Wukong::CONFIG[:runner_defaults]
+
+# Go script go!
+runner = Wukong::Script.new(
+  nil, # use mapper_command
+  nil, # use reducer_command
+  :run => true
+  )
+# runner.options.merge cmdline_opts
+runner.options[:reuse_jvms] = true if runner.options[:reuse_jvms].blank?
+
+runner.run
data/bin/tabchar
CHANGED
data/bin/wu-date
ADDED
data/bin/wu-datetime
ADDED