wukong 1.4.0 → 1.4.1
- data/README.textile +34 -7
- data/bin/cutc +1 -1
- data/bin/cuttab +1 -1
- data/bin/greptrue +1 -3
- data/bin/hdp-cat +1 -1
- data/bin/hdp-catd +1 -1
- data/bin/hdp-du +11 -6
- data/bin/hdp-get +1 -1
- data/bin/hdp-kill +1 -1
- data/bin/hdp-ls +1 -1
- data/bin/hdp-mkdir +1 -1
- data/bin/hdp-mv +1 -1
- data/bin/hdp-ps +1 -1
- data/bin/hdp-put +1 -1
- data/bin/hdp-rm +1 -1
- data/bin/hdp-sort +39 -19
- data/bin/hdp-stream +39 -19
- data/bin/hdp-stream-flat +9 -5
- data/bin/hdp-stream2 +39 -0
- data/bin/tabchar +1 -1
- data/bin/wu-date +13 -0
- data/bin/wu-datetime +13 -0
- data/bin/wu-plus +9 -0
- data/docpages/INSTALL.textile +0 -2
- data/docpages/index.textile +4 -2
- data/examples/apache_log_parser.rb +26 -14
- data/examples/graph/gen_symmetric_links.rb +10 -0
- data/examples/sample_records.rb +6 -8
- data/lib/wukong/datatypes/enum.rb +2 -2
- data/lib/wukong/dfs.rb +10 -9
- data/lib/wukong/encoding.rb +22 -4
- data/lib/wukong/extensions/emittable.rb +1 -1
- data/lib/wukong/extensions/hash_keys.rb +16 -0
- data/lib/wukong/extensions/hash_like.rb +17 -0
- data/lib/wukong/models/graph.rb +18 -20
- data/lib/wukong/schema.rb +13 -11
- data/lib/wukong/script.rb +26 -8
- data/lib/wukong/script/hadoop_command.rb +108 -2
- data/lib/wukong/streamer.rb +2 -0
- data/lib/wukong/streamer/base.rb +1 -0
- data/lib/wukong/streamer/record_streamer.rb +14 -0
- data/lib/wukong/streamer/struct_streamer.rb +2 -2
- data/spec/data/a_atsigns_b.tsv +64 -0
- data/spec/data/a_follows_b.tsv +53 -0
- data/spec/data/tweet.tsv +167 -0
- data/spec/data/twitter_user.tsv +55 -0
- data/wukong.gemspec +13 -3
- metadata +13 -23
data/README.textile
CHANGED
@@ -19,6 +19,21 @@ The **main documentation** lives on the "Wukong Pages.":http://mrflip.github.com
 * Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop)
 * "More info":http://mrflip.github.com/wukong/moreinfo.html
 
+h2. Imminent Changes
+
+I'm pushing to release "Wukong 3.0 the actual 1.0 release".
+
+* For reducing/uniqing, a notion of mutable_fields and immutable_fields and extrinsic_fields: two objects compare the same/differently if their mutable fields compare the same/differently
+* Methods on TypedStruct to
+
+* Make to_flat(false) the default, with the sort_fields / partition_fields defaulting to 2 each and very prominently documented
+* Standardize the notion that wukong classes have a "key"; by default, it will be to_a.first for Structs/TypedStructs. This shouldn't break anything.
+* May make some things that are derived classes into mixin'ed modules
+* Will probably change the name of AccumulatingReducer into just Accumulator, and have all Accumulator-derived classes include Accumulator; I'll make sure the old names continue to work though.
+
+
+*
+
 h2. Help!
 
 Send Wukong questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
|
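The "key" convention listed under Imminent Changes in the hunk above might look roughly like the following. This is only a sketch of the stated default (to_a.first for Structs); the HasKey module and the Tweet struct are hypothetical names chosen for illustration and are not taken from the Wukong source.

```ruby
# Hypothetical mixin illustrating the proposed default: a record's key is
# its first field. Neither HasKey nor Tweet comes from the Wukong source.
module HasKey
  def key
    to_a.first
  end
end

Tweet = Struct.new(:id, :user_id, :text) do
  include HasKey
end

tweets = [
  Tweet.new(12, 7, "hello"),
  Tweet.new(12, 7, "hello again"),
  Tweet.new(13, 8, "other"),
]

# Group records that share a key, roughly as a reducer would see them.
by_key = tweets.group_by(&:key)
by_key.each { |key, records| puts "#{key}\t#{records.length}" }
```

Keeping the default in a mixin rather than a base class would also fit the README's note about turning derived classes into modules, though that is a guess about the direction rather than something the diff shows.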
@@ -29,17 +44,17 @@ h2. Install
 
 h3. Get the code
 
-We're still actively developing
+We're still actively developing wukong. The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/wukong
 
-pre. $ git clone git://github.com/mrflip/
+pre. $ git clone git://github.com/mrflip/wukong
 
-A gem is available from "gemcutter:":http://gemcutter.org/gems/
+A gem is available from "gemcutter:":http://gemcutter.org/gems/wukong
 
-pre. $ sudo gem install
+pre. $ sudo gem install wukong --source=http://gemcutter.org
 
 (don't use the gems.github.com version -- it's way out of date.)
 
-You can instead download this project in either "zip":http://github.com/mrflip/
+You can instead download this project in either "zip":http://github.com/mrflip/wukong/zipball/master or "tar":http://github.com/mrflip/wukong/tarball/master formats.
 
 h3. Dependencies and setup
 
@@ -190,9 +205,15 @@ bq. Sun Wukong possesses incredible strength, being able to lift his 13,500 jīn
 
 The "Jaime Hewlett / Damon Albarn short":http://news.bbc.co.uk/sport1/hi/olympics/monkey that the BBC made for their 2008 Olympics coverage gives the general idea.
 
-
+<notextile><div class="toggle"></notextile>
 
-
+h2. More info
+
+There are many useful examples in the examples/ directory.
+
+h3. Credits
+
+Monkeyshines was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org / "@mrflip":http://twitter.com/mrflip) for the "infochimps project":http://infochimps.org
 
 Patches submitted by:
 * gemified by Ben Woosley (ben.woosley with the gmails)
@@ -201,3 +222,9 @@ Patches submitted by:
 Thanks to:
 * "Brad Heintz":http://www.bradheintz.com/no1thing/talks/ for his early feedback
 * "Phil Ripperger":http://blog.pdatasolutions.com for his "wukong in the Amazon AWS cloud":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart tutorial.
+
+h3. Help!
+
+Send monkeyshines questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
+
+<notextile></div></notextile>
data/bin/cutc
CHANGED
data/bin/cuttab
CHANGED
data/bin/greptrue
CHANGED
@@ -1,8 +1,6 @@
 #!/usr/bin/env bash
 
 # runs grep but always returns a true exit status. (Otherwise hadoop vomits)
+# You can set a command line var in hadoop instead, but we'll leave this around
 grep "$@"
 true
-# runs grep but always returns a true exit status. (Otherwise hadoop vomits)
-egrep "$@"
-true
data/bin/hdp-cat
CHANGED
data/bin/hdp-catd
CHANGED
data/bin/hdp-du
CHANGED
@@ -5,7 +5,7 @@ OPTIONS={}
 #
 # grok options
 #
-if ARGV[0] =~
+if ARGV[0] =~ /\A-[sh]+\z/
 flags = ARGV.shift
 OPTIONS[:summary] = flags.include?('s')
 OPTIONS[:humanize] = flags.include?('h')
@@ -16,7 +16,7 @@ end
 #
 def prepare_command
 dfs_cmd = OPTIONS[:summary] ? 'dus' : 'du'
-dfs_args =
+dfs_args = ((!ARGV[0]) || ARGV[0]=='') ? '.' : "'#{ARGV.join("' '")}'"
 %Q{ hadoop dfs -#{dfs_cmd} #{dfs_args} }
 end
 
|
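The two hdp-du hunks above cover flag parsing and argument quoting. A minimal standalone sketch of that behaviour follows; it takes the argument list as a parameter instead of reading the global ARGV, and the method names parse_flags and dfs_args are illustrative rather than the script's own.

```ruby
# Sketch of the option handling shown in the hunks above, written as pure
# functions over an argv array. Method names are hypothetical.
def parse_flags(argv)
  options = { summary: false, humanize: false }
  # Only a leading argument made up solely of 's' and 'h' flags is consumed.
  if argv[0] =~ /\A-[sh]+\z/
    flags = argv.shift
    options[:summary]  = flags.include?('s')
    options[:humanize] = flags.include?('h')
  end
  options
end

def dfs_args(argv)
  # With no path given, fall back to the current directory; otherwise
  # wrap each path in single quotes so spaces survive the shell.
  (argv[0].nil? || argv[0] == '') ? '.' : "'#{argv.join("' '")}'"
end

argv = ['-sh', '/data/a', '/data/b c']
options = parse_flags(argv)
dfs_cmd = options[:summary] ? 'dus' : 'du'
puts "hadoop dfs -#{dfs_cmd} #{dfs_args(argv)}"
```

Note that single-quote wrapping of this kind does not protect paths that themselves contain a single quote.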
@@ -61,21 +61,26 @@ def number_to_human_size(size, precision=1)
 when size < 1.gigabyte; "%.#{precision}f MB" % (size / 1.0.megabyte)
 when size < 1.terabyte; "%.#{precision}f GB" % (size / 1.0.gigabyte)
 else "%.#{precision}f TB" % (size / 1.0.terabyte)
-end
+end #.sub(/([0-9]\.\d*?)0+ /, '\1 ' ).sub(/\. /,' ')
 rescue
 nil
 end
 
+OUTPUT_LINE_FMT = "%-71s\t%15d\t%15s"
 def format_output file, size
-human_size = number_to_human_size(size) ||
+human_size = number_to_human_size(size) || ""
 file = file.gsub(%r{hdfs://[^/]+/}, '/') # kill off hdfs paths, otherwise leave it alone
-
+OUTPUT_LINE_FMT % [file, size.to_i, human_size]
 end
 
-
+entries_count = 0
+total_size = 0
 %x{ #{prepare_command} }.split("\n").each do |line|
 if line =~ /^Found \d+ items$/ then puts line ; next end
 info = line.split(/\s+/)
 if OPTIONS[:summary] then file, size = info else size, file = info end
 puts format_output(file, size)
+total_size += size.to_i
+entries_count += 1
 end
+$stderr.puts OUTPUT_LINE_FMT%[" #{"%55d"%entries_count} entries", total_size, number_to_human_size(total_size)]
|
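The last hdp-du hunk adds a fixed-width output format and a running total. The sketch below reproduces that flow on sample input without calling hadoop. The format string is copied from the diff; human_size is a hypothetical stand-in for number_to_human_size that uses plain integer constants (the script's 1.megabyte-style helpers are not assumed here), the footer label is simplified, and the sample lines and paths are made up.

```ruby
# Format string copied from the diff: path, byte count, human-readable size.
OUTPUT_LINE_FMT = "%-71s\t%15d\t%15s"

# Hypothetical replacement for number_to_human_size (binary units assumed).
UNITS = [["TB", 1024**4], ["GB", 1024**3], ["MB", 1024**2], ["KB", 1024]].freeze

def human_size(size, precision = 1)
  size = Integer(size.to_s, 10)
  unit, factor = UNITS.find { |_, f| size >= f }
  return "#{size} Bytes" if unit.nil?
  "%.#{precision}f %s" % [size.to_f / factor, unit]
rescue ArgumentError, TypeError
  nil # unparseable sizes yield nil, as in the script's rescue clause
end

def format_output(file, size)
  file = file.gsub(%r{hdfs://[^/]+/}, '/') # strip the hdfs://host prefix
  OUTPUT_LINE_FMT % [file, size.to_i, human_size(size) || ""]
end

# Turn "size path" lines into formatted rows plus a totals footer.
def report(du_lines)
  entries = 0
  total = 0
  rows = du_lines.map do |line|
    size, file = line.split(/\s+/)
    entries += 1
    total += size.to_i
    format_output(file, size)
  end
  footer = OUTPUT_LINE_FMT % ["#{entries} entries", total, human_size(total)]
  [rows, footer]
end

sample = [
  "1048576 hdfs://namenode/user/flip/a.tsv",
  "512 hdfs://namenode/user/flip/b.tsv",
]
rows, footer = report(sample)
puts rows
$stderr.puts footer
```

Writing the footer to stderr, as the diff does, keeps stdout limited to one row per entry so it can still be piped into other tools.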
data/bin/hdp-get
CHANGED
data/bin/hdp-kill
CHANGED
data/bin/hdp-ls
CHANGED
data/bin/hdp-mkdir
CHANGED
data/bin/hdp-mv
CHANGED
data/bin/hdp-ps
CHANGED
data/bin/hdp-put
CHANGED
data/bin/hdp-rm
CHANGED
data/bin/hdp-sort
CHANGED
@@ -4,26 +4,46 @@
 input_file=${1} ; shift
 output_file=${1} ; shift
 map_script=${1-/bin/cat} ; shift
-reduce_script=${1-/usr/bin/uniq}
-
+reduce_script=${1-/usr/bin/uniq} ; shift
+partfields=${1-2} ; shift
+sortfields=${1-2} ; shift
 
-if [ "$
+if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/cat] [reducer=/usr/bin/uniq] [sortfields=2] [partfields=1] [extra_args]" ; exit ; fi
 
 HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
-${HADOOP_HOME}/bin/hadoop \
-jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar
--partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
--jobconf
--jobconf
---
---
---
---
-
-
-
-
-
-
-
+cmd="${HADOOP_HOME}/bin/hadoop \
+jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar
+-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
+-jobconf num.key.fields.for.partition=\"$partfields\"
+-jobconf stream.num.map.output.key.fields=\"$sortfields\"
+-mapper \"$map_script\"
+-reducer \"$reduce_script\"
+-input \"$input_file\"
+-output \"$output_file\"
+$@
+"
+
+echo "$cmd"
+
+$cmd
+
+# -jobconf mapred.text.key.partitioner.options="-k1,$partfields" \
+# -jobconf stream.map.output.field.separator='\t' \
+# -jobconf map.output.key.field.separator='\t' \
+# -jobconf mapred.map.tasks=3 \
+# -jobconf mapred.reduce.tasks=3 \
+
+#
+# TODO:
+# http://issues.apache.org/jira/browse/MAPREDUCE-594
+# http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/KeyValueTextInputFormat.html
+# Instead of /bin/cat, Identity can be (I think)
+# -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
+# -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
+# ...
+#
+# TODO
+#
+# New-style secondary sort:
+# http://hadoop.apache.org/common/docs/r0.20.0/streaming.html
|
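The new hdp-sort builds the whole command as one string, echoes it, and then runs it through unquoted expansion. In that pattern the shell does not re-interpret quote characters embedded in the string, so a mapper such as "cut -f1" may end up split into separate words. As a hedged alternative, the sketch below builds the same argument list as a Ruby array, which keeps each argument intact and still allows a printable echo. The method name streaming_argv and its keyword parameters are illustrative; the option names and defaults are the ones visible in the diff.

```ruby
require 'shellwords'

# Build the streaming command as an argv array (a sketch, not the script's
# own code). Positional defaults mirror the diff: partfields=2, sortfields=2.
def streaming_argv(input_file, output_file,
                   map_script: '/bin/cat', reduce_script: '/usr/bin/uniq',
                   partfields: 2, sortfields: 2, extra_args: [],
                   hadoop_home: ENV.fetch('HADOOP_HOME', '/usr/lib/hadoop'))
  jar_glob = "#{hadoop_home}/contrib/streaming/hadoop-*-streaming.jar"
  # No shell is involved, so expand the glob here; if nothing matches,
  # keep the pattern so the printed command still shows the intent.
  jar = Dir.glob(jar_glob).first || jar_glob
  [
    "#{hadoop_home}/bin/hadoop", 'jar', jar,
    '-partitioner', 'org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner',
    '-jobconf', "num.key.fields.for.partition=#{partfields}",
    '-jobconf', "stream.num.map.output.key.fields=#{sortfields}",
    '-mapper',  map_script,
    '-reducer', reduce_script,
    '-input',   input_file,
    '-output',  output_file,
    *extra_args
  ]
end

argv = streaming_argv('in.tsv', 'out', map_script: 'cut -f1',
                      hadoop_home: '/nonexistent/hadoop')
puts Shellwords.join(argv)  # printable, shell-escaped form of the command
# system(*argv) would then run the job without re-splitting "cut -f1".
```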
data/bin/hdp-stream
CHANGED
@@ -4,26 +4,46 @@
 input_file=${1} ; shift
 output_file=${1} ; shift
 map_script=${1-/bin/cat} ; shift
-reduce_script=${1-/usr/bin/uniq}
-
+reduce_script=${1-/usr/bin/uniq} ; shift
+partfields=${1-2} ; shift
+sortfields=${1-2} ; shift
 
-if [ "$
+if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/cat] [reducer=/usr/bin/uniq] [sortfields=2] [partfields=1] [extra_args]" ; exit ; fi
 
 HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
-${HADOOP_HOME}/bin/hadoop \
-jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar
--partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
--jobconf
--jobconf
---
---
---
---
-
-
-
-
-
-
-
+cmd="${HADOOP_HOME}/bin/hadoop \
+jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar
+-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
+-jobconf num.key.fields.for.partition=\"$partfields\"
+-jobconf stream.num.map.output.key.fields=\"$sortfields\"
+-mapper \"$map_script\"
+-reducer \"$reduce_script\"
+-input \"$input_file\"
+-output \"$output_file\"
+$@
+"
+
+echo "$cmd"
+
+$cmd
+
+# -jobconf mapred.text.key.partitioner.options="-k1,$partfields" \
+# -jobconf stream.map.output.field.separator='\t' \
+# -jobconf map.output.key.field.separator='\t' \
+# -jobconf mapred.map.tasks=3 \
+# -jobconf mapred.reduce.tasks=3 \
+
+#
+# TODO:
+# http://issues.apache.org/jira/browse/MAPREDUCE-594
+# http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/KeyValueTextInputFormat.html
+# Instead of /bin/cat, Identity can be (I think)
+# -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
+# -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
+# ...
+#
+# TODO
+#
+# New-style secondary sort:
+# http://hadoop.apache.org/common/docs/r0.20.0/streaming.html
data/bin/hdp-stream-flat
CHANGED
@@ -5,14 +5,18 @@ output_file=${1} ; shift
 map_script=${1-/bin/cat} ; shift
 reduce_script=${1-/usr/bin/uniq} ; shift
 
-if [ "$
+if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/cat] [reducer=/usr/bin/uniq] [extra_args]" ; exit ; fi
 
-
+HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
+
+# Can add fun stuff like
+# -jobconf mapred.map.tasks=3 \
+# -jobconf mapred.reduce.tasks=3 \
+
+exec ${HADOOP_HOME}/bin/hadoop \
+jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar \
 -mapper "$map_script" \
 -reducer "$reduce_script" \
 -input "$input_file" \
 -output "$output_file" \
 "$@"
-
-# -jobconf mapred.map.tasks=3 \
-# -jobconf mapred.reduce.tasks=3 \
data/bin/hdp-stream2
ADDED
@@ -0,0 +1,39 @@
+#!/usr/bin/env ruby
+require 'wukong'
+
+# Example usage:
+#
+# ~/ics/wukong/bin/hdp-stream2 input_path1,input_path2 output_path \
+# "`which cuttab` 2,3,7" "`which uniq` -c" 1 3 -jobconf mapred.reduce.tasks=23
+
+
+# options = Wukong::CONFIG[:runner_defaults].dup
+
+# cmdline_opts = Hash.zip(
+# [ :input_file, :output_file,
+# :map_command, :reduce_command,
+# :partition_fields, :sort_fields],
+# ARGV.map{|s| s.blank? ? nil : s }
+# )
+# argvs = ARGV.slice!(0..5) ;
+# ARGV.unshift cmdline_opts[:input_file];
+# ARGV.unshift cmdline_opts[:output_file]
+# p [argvs, ARGV]
+#
+# # cmdline_opts[:map_command] = `which cat`.chomp if cmdline_opts[:map_command].blank?
+# # cmdline_opts[:reduce_command] = nil if cmdline_opts[:reduce_command].blank?
+# cmdline_opts[:dry_run] = true
+# cmdline_opts[:run] = true
+
+#p cmdline_opts, Wukong::CONFIG[:runner_defaults]
+
+# Go script go!
+runner = Wukong::Script.new(
+nil, # use mapper_command
+nil, # use reducer_command
+:run => true
+)
+# runner.options.merge cmdline_opts
+runner.options[:reuse_jvms] = true if runner.options[:reuse_jvms].blank?
+
+runner.run
|
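The commented-out block in hdp-stream2 pairs six positional arguments with option names. A plain-Ruby sketch of that mapping is below. It uses Array#zip and to_h in place of the Hash.zip and blank? helpers referenced in the comments (those are not assumed to be available here), and the method name named_args is hypothetical.

```ruby
# Option names in the order the commented-out code lists them.
ARG_NAMES = [:input_file, :output_file,
             :map_command, :reduce_command,
             :partition_fields, :sort_fields].freeze

# Split argv into a hash of named positional options and the leftover
# arguments (for example extra -jobconf flags). Blank strings become nil.
def named_args(argv)
  positional = argv.first(ARG_NAMES.length).map do |s|
    (s.nil? || s.strip.empty?) ? nil : s
  end
  # zip pads with nil when fewer than six arguments were supplied.
  [ARG_NAMES.zip(positional).to_h, argv.drop(ARG_NAMES.length)]
end

opts, rest = named_args(
  ['in1,in2', 'out', 'cuttab 2,3,7', 'uniq -c', '1', '3',
   '-jobconf', 'mapred.reduce.tasks=23']
)
p opts
p rest
```

The sample arguments follow the "Example usage" comment in the script, with the backtick `which` lookups replaced by bare command names.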
data/bin/tabchar
CHANGED
data/bin/wu-date
ADDED
data/bin/wu-datetime
ADDED