wukong 1.5.4 → 2.0.0
- data/CHANGELOG.textile +32 -0
- data/README.textile +58 -12
- data/TODO.textile +0 -8
- data/bin/hdp-bzip +12 -17
- data/bin/hdp-kill-task +1 -1
- data/bin/hdp-sort +7 -7
- data/bin/hdp-stream +7 -7
- data/bin/hdp-stream-flat +2 -3
- data/bin/setcat +11 -0
- data/bin/uniq-ord +59 -0
- data/examples/corpus/bucket_counter.rb +47 -0
- data/examples/corpus/dbpedia_abstract_to_sentences.rb +85 -0
- data/examples/corpus/sentence_coocurrence.rb +70 -0
- data/examples/emr/README.textile +110 -0
- data/examples/emr/dot_wukong_dir/emr_bootstrap.sh +1 -0
- data/examples/emr/elastic_mapreduce_example.rb +2 -2
- data/examples/ignore_me/counting.rb +56 -0
- data/examples/ignore_me/grouper.rb +71 -0
- data/examples/network_graph/adjacency_list.rb +2 -2
- data/examples/network_graph/breadth_first_search.rb +14 -21
- data/examples/network_graph/gen_multi_edge.rb +22 -13
- data/examples/pagerank/pagerank.rb +1 -1
- data/examples/pagerank/pagerank_initialize.rb +6 -10
- data/examples/sample_records.rb +6 -16
- data/examples/server_logs/apache_log_parser.rb +7 -22
- data/examples/server_logs/breadcrumbs.rb +39 -0
- data/examples/server_logs/logline.rb +27 -0
- data/examples/size.rb +3 -2
- data/examples/{binning_percentile_estimator.rb → stats/binning_percentile_estimator.rb} +9 -11
- data/examples/{rank_and_bin.rb → stats/rank_and_bin.rb} +2 -2
- data/examples/stupidly_simple_filter.rb +11 -14
- data/examples/word_count.rb +16 -36
- data/lib/wukong/and_pig.rb +2 -15
- data/lib/wukong/logger.rb +7 -28
- data/lib/wukong/periodic_monitor.rb +24 -9
- data/lib/wukong/script/emr_command.rb +1 -0
- data/lib/wukong/script/hadoop_command.rb +31 -29
- data/lib/wukong/script.rb +19 -14
- data/lib/wukong/store/cassandra_model.rb +2 -1
- data/lib/wukong/streamer/accumulating_reducer.rb +5 -9
- data/lib/wukong/streamer/base.rb +44 -3
- data/lib/wukong/streamer/counting_reducer.rb +12 -12
- data/lib/wukong/streamer/filter.rb +2 -2
- data/lib/wukong/streamer/list_reducer.rb +3 -3
- data/lib/wukong/streamer/reducer.rb +11 -0
- data/lib/wukong/streamer.rb +7 -3
- data/lib/wukong.rb +7 -3
- data/{examples → old}/cassandra_streaming/berlitz_for_cassandra.textile +0 -0
- data/{examples → old}/cassandra_streaming/client_interface_notes.textile +0 -0
- data/{examples → old}/cassandra_streaming/client_schema.textile +0 -0
- data/{examples → old}/cassandra_streaming/tuning.textile +0 -0
- data/wukong.gemspec +257 -285
- metadata +45 -62
- data/examples/cassandra_streaming/avromapper.rb +0 -85
- data/examples/cassandra_streaming/cassandra.avpr +0 -468
- data/examples/cassandra_streaming/cassandra_random_partitioner.rb +0 -62
- data/examples/cassandra_streaming/catter.sh +0 -45
- data/examples/cassandra_streaming/client_schema.avpr +0 -211
- data/examples/cassandra_streaming/foofile.avr +0 -0
- data/examples/cassandra_streaming/pymap.sh +0 -1
- data/examples/cassandra_streaming/pyreduce.sh +0 -1
- data/examples/cassandra_streaming/smutation.avpr +0 -188
- data/examples/cassandra_streaming/streamer.sh +0 -51
- data/examples/cassandra_streaming/struct_loader.rb +0 -24
- data/examples/count_keys.rb +0 -56
- data/examples/count_keys_at_mapper.rb +0 -57
- data/examples/emr/README-elastic_map_reduce.textile +0 -26
- data/examples/keystore/cassandra_batch_test.rb +0 -41
- data/examples/keystore/conditional_outputter_example.rb +0 -70
- data/examples/store/chunked_store_example.rb +0 -18
- data/lib/wukong/dfs.rb +0 -81
- data/lib/wukong/keystore/cassandra_conditional_outputter.rb +0 -122
- data/lib/wukong/keystore/redis_db.rb +0 -24
- data/lib/wukong/keystore/tyrant_db.rb +0 -137
- data/lib/wukong/keystore/tyrant_notes.textile +0 -145
- data/lib/wukong/models/graph.rb +0 -25
- data/lib/wukong/monitor/chunked_store.rb +0 -23
- data/lib/wukong/monitor/periodic_logger.rb +0 -34
- data/lib/wukong/monitor/periodic_monitor.rb +0 -70
- data/lib/wukong/monitor.rb +0 -7
- data/lib/wukong/rdf.rb +0 -104
- data/lib/wukong/streamer/cassandra_streamer.rb +0 -61
- data/lib/wukong/streamer/count_keys.rb +0 -30
- data/lib/wukong/streamer/count_lines.rb +0 -26
- data/lib/wukong/streamer/em_streamer.rb +0 -7
- data/lib/wukong/streamer/preprocess_with_pipe_streamer.rb +0 -22
- data/lib/wukong/wukong_class.rb +0 -21
data/CHANGELOG.textile
CHANGED
@@ -1,3 +1,35 @@
+h2. Wukong v2.0.0
+
+h4. Important changes
+
+* Passing options to streamers is now deprecated. Use @Settings@ instead.
+
+* Streamer by default has a periodic monitor that logs (to STDERR by default) every 10_000 lines or 30 seconds
+
+* Examples cleaned up, should all run
+
+h4. Simplified syntax
+
+* you can now pass Script.new an *instance* of Streamer to use as mapper or reducer
+* Adding an experimental sugar:
+
+<pre>
+#!/usr/bin/env ruby
+require 'wukong/script'
+
+LineStreamer.map do |line|
+  emit line.reverse
+end.run
+</pre>
+
+Note that you can now tweet a wukong script.
+
+* It's now recommended that at the top of a wukong script you say
+<pre>
+require 'wukong/script'
+</pre>
+Among other benefits, this lets you refer to wukong streamers without prefix.
+
 h2. Wukong v1.5.4
 
 * EMR support now works very well
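To make the @Settings@ change concrete, here is a minimal sketch (not taken from the gem) of a mapper that reads a tuning value from configliere's @Settings@ rather than from per-streamer options; the @min_length@ flag and the streamer body are illustrative assumptions:

<pre>
#!/usr/bin/env ruby
require 'wukong/script'

# Hypothetical flag: `./long_words.rb --min_length=5 input output`
Settings.define :min_length, :default => 3, :description => 'drop words shorter than this'

class LongWordMapper < Wukong::Streamer::LineStreamer
  # emit each sufficiently long word on its own line
  def process line
    line.split(/\W+/).each do |word|
      emit word if word.length >= Settings[:min_length]
    end
  end
end

Wukong::Script.new(LongWordMapper, nil).run
</pre>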
data/README.textile
CHANGED
@@ -19,18 +19,6 @@ The **main documentation** lives on the "Wukong Pages.":http://mrflip.github.com
 * Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop)
 * "More info":http://mrflip.github.com/wukong/moreinfo.html
 
-h2. Imminent Changes
-
-I'm pushing to release "Wukong 3.0 the actual 1.0 release".
-
-* For reducing/uniqing, a notion of mutable_fields and immutable_fields and extrinsic_fields: two objects compare the same/differently if their mutable fields compare the same/differently
-* Methods on TypedStruct to
-
-* Make to_flat(false) the default, with the sort_fields / partition_fields defaulting to 2 each and very prominently documented
-* Standardize the notion that wukong classes have a "key"; by default, it will be to_a.first for Structs/TypedStructs. This shouldn't break anything.
-* May make some things that are derived classes into mixin'ed modules
-* Will probably change the name of AccumulatingReducer into just Accumulator, and have all Accumulator-derived classes include Accumulator; I'll make sure the old names continue to work though.
-
 
 h2. Help!
 
@@ -193,6 +181,64 @@ You'd end up with
 @newman @elaine @jerry @kramer
 </code></pre>
 
+h2. Gotchas
+
+h4. RecordStreamer dies on blank lines with "wrong number of arguments"
+
+If your lines don't always have a full complement of fields, and you define #process() to take fixed named arguments, then ruby will complain when some of them don't show up:
+
+<pre>
+class MyUnhappyMapper < Wukong::Streamer::RecordStreamer
+  # this will fail if the line has more or fewer than 3 fields:
+  def process x, y, z
+    p [x, y, z]
+  end
+end
+</pre>
+
+The cleanest way I know to fix this is with recordize, which you should recall always returns an array of fields:
+
+<pre>
+class MyHappyMapper < Wukong::Streamer::RecordStreamer
+  # extracts three fields always; any missing fields are nil, any extra fields discarded
+  # @example
+  #   recordize("a")           # ["a", nil, nil]
+  #   recordize("a\t\b\tc")    # ["a", "b", "c"]
+  #   recordize("a\t\b\tc\td") # ["a", "b", "c"]
+  def recordize raw_record
+    x, y, z = super(raw_record)
+    [x, y, z]
+  end
+
+  # Now all lines produce exactly three args
+  def process x, y, z
+    p [x, y, z]
+  end
+end
+</pre>
+
+If you want to preserve any extra fields, use the extra argument to #split():
+
+<pre>
+class MyMoreThanHappyMapper < Wukong::Streamer::RecordStreamer
+  # extracts three fields always; any missing fields are nil, the final field will contain a tab-separated string of all trailing fields
+  # @example
+  #   recordize("a")           # ["a", nil, nil]
+  #   recordize("a\t\b\tc")    # ["a", "b", "c"]
+  #   recordize("a\t\b\tc\td") # ["a", "b", "c\td"]
+  def recordize raw_record
+    x, y, z = split(raw_record, "\t", 3)
+    [x, y, z]
+  end
+
+  # Now all lines produce exactly three args
+  def process x, y, z
+    p [x, y, z]
+  end
+end
+</pre>
+
+
 h2. Why is it called Wukong?
 
 Hadoop, as you may know, is "named after a stuffed elephant.":http://en.wikipedia.org/wiki/Hadoop Since Wukong was started by the "infochimps":http://infochimps.org team, we needed a simian analog. A Monkey King who journeyed to the land of the Elephant seems to fit the bill:
data/TODO.textile
CHANGED
@@ -1,13 +1,5 @@
-
-
-
 * add GEM_PATH to hadoop_recycle_env
 
-* Hadoop_command function received an array for the input_path parameter
-
 ** We should be able to specify comma *or* space separated paths; the last
 space-separated path in Settings.rest becomes the output file, the others are
 used as the input_file list.
-
-* Make configliere Settings and streamer_instance.options() be the same
-thing. (instead of almost-but-confusingly-not-always the same thing).
data/bin/hdp-bzip
CHANGED
@@ -2,27 +2,22 @@
 
 HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
-
+input_file=${1} ; shift
+output_file=${1} ; shift
 
-
-for foo in $@; do
-  INPUTS="$INPUTS -input $foo\
-"
-done
+if [ "$output_file" == "" ] ; then echo "$0 input_file output_file" ; exit ; fi
 
-
-hadoop fs -rmr $OUTPUT
+HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
 cmd="${HADOOP_HOME}/bin/hadoop \
-  jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar
-  -
-  -
-  -
-  -
-  -
-  -
-  $
-  -output $OUTPUT \
+  jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar \
+  -Dmapred.output.compress=true \
+  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
+  -Dmapred.reduce.tasks=1 \
+  -mapper \"/bin/cat\" \
+  -reducer \"/bin/cat\" \
+  -input \"$input_file\" \
+  -output \"$output_file\" \
 "
 echo $cmd
 $cmd
data/bin/hdp-kill-task
CHANGED
data/bin/hdp-sort
CHANGED
@@ -1,5 +1,4 @@
 #!/usr/bin/env bash
-# hadoop dfs -rmr out/parsed-followers
 
 input_file=${1} ; shift
 output_file=${1} ; shift
@@ -13,17 +12,18 @@ if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/
 HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
 cmd="${HADOOP_HOME}/bin/hadoop \
-  jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar
+  jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar
+  $@
+  -D num.key.fields.for.partition=\"$partfields\"
+  -D stream.num.map.output.key.fields=\"$sortfields\"
+  -D stream.map.output.field.separator=\"'/t'\"
+  -D mapred.text.key.partitioner.options=\"-k1,$partfields\"
+  -D mapred.job.name=\"`basename $0`-$map_script-$input_file-$output_file\"
   -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
-  -jobconf num.key.fields.for.partition=\"$partfields\"
-  -jobconf stream.num.map.output.key.fields=\"$sortfields\"
-  -jobconf stream.map.output.field.separator=\"'/t'\"
-  -jobconf mapred.text.key.partitioner.options=\"-k1,$partfields\"
   -mapper  \"$map_script\"
   -reducer \"$reduce_script\"
   -input   \"$input_file\"
   -output  \"$output_file\"
-  $@
 "
 
 echo "$cmd"
data/bin/hdp-stream
CHANGED
@@ -1,5 +1,4 @@
 #!/usr/bin/env bash
-# hadoop dfs -rmr out/parsed-followers
 
 input_file=${1} ; shift
 output_file=${1} ; shift
@@ -13,17 +12,18 @@ if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/
 HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
 cmd="${HADOOP_HOME}/bin/hadoop \
-  jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar
+  jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar
+  $@
+  -D num.key.fields.for.partition=\"$partfields\"
+  -D stream.num.map.output.key.fields=\"$sortfields\"
+  -D stream.map.output.field.separator=\"'/t'\"
+  -D mapred.text.key.partitioner.options=\"-k1,$partfields\"
+  -D mapred.job.name=\"`basename $0`-$map_script-$input_file-$output_file\"
   -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
-  -jobconf num.key.fields.for.partition=\"$partfields\"
-  -jobconf stream.num.map.output.key.fields=\"$sortfields\"
-  -jobconf stream.map.output.field.separator=\"'/t'\"
-  -jobconf mapred.text.key.partitioner.options=\"-k1,$partfields\"
   -mapper  \"$map_script\"
   -reducer \"$reduce_script\"
   -input   \"$input_file\"
   -output  \"$output_file\"
-  $@
 "
 
 echo "$cmd"
data/bin/hdp-stream-flat
CHANGED
@@ -10,13 +10,12 @@ if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/
 HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
 # Can add fun stuff like
-# -
-# -jobconf mapred.reduce.tasks=3 \
+# -Dmapred.reduce.tasks=0 \
 
 exec ${HADOOP_HOME}/bin/hadoop \
   jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar \
   "$@" \
-  -
+  -Dmapred.job.name=`basename $0`-$map_script-$input_file-$output_file \
   -mapper  "$map_script" \
   -reducer "$reduce_script" \
   -input   "$input_file" \
data/bin/setcat
ADDED
data/bin/uniq-ord
ADDED
@@ -0,0 +1,59 @@
+#!/usr/bin/env ruby
+# encoding: ASCII-8BIT
+require 'set'
+
+unless ARGV.empty?
+  unless ARGV.include?('--help')
+    puts "\n**\nSorry, uniq-ord only works in-line: cat foo.txt bar.tsv | uniq-ord\n**" ; puts
+  end
+  puts <<USAGE
+uniq-ord is like the uniq command but doesn't depend on prior sorting: it tracks
+each line and only emits the first-seen instance of that line.
+
+The algorithm is /very/ simplistic: it uses ruby's built-in hash to track lines.
+This can produce false positives, meaning that a line of output might be removed
+even if it hasn't been seen before. It may also consume an unbounded amount of
+memory (though less than the input text). With a million lines it will consume
+about 70 MB of memory and have more than 1 in a million chance of false
+positive. On a billion lines it will consume many GB and have over 25% odds of
+incorrectly skipping a line.
+
+However, it's really handy for dealing with in-order lists from the command line.
+USAGE
+  exit(0)
+end
+
+# # Logging
+#
+# MB = 1024*1024
+# LOG_INTERVAL = 100_000
+# $start = Time.now; $iter = 0; $size = 0
+# def log_line
+#   elapsed = (Time.now - $start).to_f
+#   $stderr.puts("%5d s\t%10.1f l/s\t%5dk<\t%5dk>\t%5d MB\t%9.1f MB/s\t%11d b/l"%[ elapsed, $iter/elapsed, $iter/1000, LINES.count/1000, $size/MB, ($size/MB)/elapsed, $size/$iter ])
+# end
+
+LINES = Set.new
+$stdin.each do |line|
+  next if LINES.include?(line.hash)
+  puts line
+  LINES << line.hash
+  # $iter += 1 ; $size += line.length
+  # log_line if ($iter % LOG_INTERVAL == 0)
+end
+# log_line
+
+#
+# # 2.1 GB data, 1M lines, 2000 avg chars/line
+#
+# # Used:  RSS: 71_988 kB   VSZ: 2_509_152 kB
+# # Stats: 38 s   25_859.1 l/s   1000k<   1000k>   1976 MB   51.1 MB/s   2072 b/l
+# # Time:  real 0m41.4 s   user 0m31.6 s   sys 0m8.3 s   pct 96.48
+#
+# # 4.1 GB data, 5.6M lines, 800 avg chars/line
+#
+# # Used:  RSS: 330_644 kB  VSZ: 2_764_236 kB
+# # Stats: 861   6_538.2 l/s   5632k<   5632k>   4158 MB   4.8 MB/s   774 b/l
+# # Time:  real 14m24.6 s   user 13m8.8 s   sys 0m12. s   pct 92.61
+#
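The false-positive figures in the usage text above follow from the birthday bound. A small sketch of that arithmetic (plain Ruby; the hash-space size @m@ is an assumption about the interpreter, not something the script states):

<pre>
# Probability that n distinct lines produce at least one colliding hash value,
# assuming line.hash is roughly uniform over m possible values:
#   P(collision) ~= 1 - exp(-n**2 / (2*m))
def collision_probability(n, m)
  1 - Math.exp(-(n.to_f ** 2) / (2.0 * m))
end

# e.g., with an assumed 62-bit hash space:
collision_probability(1_000_000,     2**62)
collision_probability(1_000_000_000, 2**62)
</pre>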
data/examples/corpus/bucket_counter.rb ADDED
@@ -0,0 +1,47 @@
+
+class BucketCounter
+  BUCKET_SIZE = 2**24
+  attr_reader :total
+
+  def initialize
+    @hsh = Hash.new{|h,k| h[k] = 0 }
+    @total = 0
+  end
+
+  # def [] val
+  #   @hsh[val]
+  # end
+  # def << val
+  #   @hsh[val] += 1; @total += 1 ; self
+  # end
+
+  def [] val
+    @hsh[val.hash % BUCKET_SIZE]
+  end
+  def << val
+    @hsh[val.hash % BUCKET_SIZE] += 1; @total += 1 ; self
+  end
+
+  def insert *words
+    words.flatten.each{|word| self << word }
+  end
+  def clear
+    @hsh.clear
+    @total = 0
+  end
+
+  def stats
+    { :total => total,
+      :size  => size,
+    }
+  end
+  def size() @hsh.size end
+
+  def full?
+    size.to_f / BUCKET_SIZE > 0.5
+  end
+
+  def each *args, &block
+    @hsh.each(*args, &block)
+  end
+end
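A quick usage sketch of the BucketCounter above (illustrative, not part of the gem): counts are keyed by the value's hash modulo BUCKET_SIZE, so distinct values can collide and the tallies are approximate.

<pre>
$: << File.dirname(__FILE__)
require 'bucket_counter'

bucket = BucketCounter.new
bucket.insert %w[apple banana apple cherry]
bucket['apple']   # => 2, barring a hash-bucket collision
bucket.total      # => 4
bucket.stats      # => {:total=>4, :size=>3}
bucket.full?      # drives the periodic dump_bucket flush in the co-occurrence example below
</pre>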
data/examples/corpus/dbpedia_abstract_to_sentences.rb ADDED
@@ -0,0 +1,85 @@
+#!/usr/bin/env ruby
+require 'wukong/script'
+
+#
+# Use the stanford NLP parse to split a piece of text into sentences
+#
+# @example
+#   SentenceParser.split("Beware the Jabberwock, my son! The jaws that bite, the claws that catch! Beware the Jubjub bird, and shun The frumious Bandersnatch!")
+#   # => [["Beware", "the", "Jabberwock", ",", "my", "son", "!"], ["The", "jaws", "that", "bite", ",", "the", "claws", "that", "catch", "!"], ["Beware", "the", "Jubjub", "bird", ",", "and", "shun", "The", "frumious", "Bandersnatch", "!"]]
+#
+class SentenceParser
+  def self.processor
+    return @processor if @processor
+    require 'rubygems'
+    require 'stanfordparser'
+    @processor = StanfordParser::DocumentPreprocessor.new
+  end
+
+  def self.split line
+    processor.getSentencesFromString(line).map{|s| s.map{|w| w.to_s } }
+  end
+end
+
+#
+# takes one document per line
+# splits into sentences
+#
+class WordNGrams < Wukong::Streamer::LineStreamer
+  def recordize line
+    line.strip!
+    line.gsub!(%r{^<http://dbpedia.org/resource/([^>]+)> <[^>]+> \"}, '') ; title = $1
+    line.gsub!(%r{\"@en \.},'')
+    [title, SentenceParser.split(line)]
+  end
+
+  def process title, sentences
+    sentences.each_with_index do |words, idx|
+      yield [title, idx, words].flatten
+    end
+  end
+end
+
+Wukong.run WordNGrams, nil, :partition_fields => 1, :sort_fields => 2
+
+# ---------------------------------------------------------------------------
+#
+# Run Time:
+#
+# Job Name: dbpedia_abstract_to_sentences.rb---/data/rawd/encyc/dbpedia/dbpedia_dumps/short_abstracts_en.nt---/data/rawd/encyc/dbpedia/dbpedia_parsed/short_abstract_sentences
+# Status: Succeeded
+# Started at: Fri Jan 28 03:14:45 UTC 2011
+# Finished in: 41mins, 50sec
+# 3 machines: master m1.xlarge, 2 c1.xlarge workers; was having some over-memory issues on the c1.xls
+#
+# Counter                 Reduce          Total
+# SLOTS_MILLIS_MAPS       0               10 126 566
+# Launched map tasks      0               15
+# Data-local map tasks    0               15
+# SLOTS_MILLIS_REDUCES    0               1 217
+# HDFS_BYTES_READ         1 327 116 133   1 327 116 133
+# HDFS_BYTES_WRITTEN      1 229 841 020   1 229 841 020
+# Map input records       3 261 096       3 261 096
+# Spilled Records         0               0
+# Map input bytes         1 326 524 800   1 326 524 800
+# SPLIT_RAW_BYTES         1 500           1 500
+# Map output records      9 026 343       9 026 343
+#
+# Job Name: dbpedia_abstract_to_sentences.rb---/data/rawd/encyc/dbpedia/dbpedia_dumps/long_abstracts_en.nt---/data/rawd/encyc/dbpedia/dbpedia_parsed/long_abstract_sentences
+# Status: Succeeded
+# Started at: Fri Jan 28 03:23:08 UTC 2011
+# Finished in: 41mins, 11sec
+# 3 machines: master m1.xlarge, 2 c1.xlarge workers; was having some over-memory issues on the c1.xls
+#
+# Counter                 Reduce          Total
+# SLOTS_MILLIS_MAPS       0               19 872 357
+# Launched map tasks      0               29
+# Data-local map tasks    0               29
+# SLOTS_MILLIS_REDUCES    0               5 504
+# HDFS_BYTES_READ         2 175 900 769   2 175 900 769
+# HDFS_BYTES_WRITTEN      2 280 332 736   2 280 332 736
+# Map input records       3 261 096       3 261 096
+# Spilled Records         0               0
+# Map input bytes         2 174 849 644   2 174 849 644
+# SPLIT_RAW_BYTES         2 533           2533
+# Map output records      15 425 467      15 425 467
data/examples/corpus/sentence_coocurrence.rb ADDED
@@ -0,0 +1,70 @@
+#!/usr/bin/env ruby
+$: << File.dirname(__FILE__)
+require 'rubygems'
+require 'wukong/script'
+require 'bucket_counter'
+
+#
+# Coocurrence counts
+#
+
+#
+# Input is a list of document-idx-sentences, each field is tab-separated
+#   title  idx  word_a  word_b  word_c ...
+#
+# This emits each co-occurring pair exactly once; in the case of a three-word
+# sentence the output would be
+#
+#   word_a  word_b
+#   word_a  word_c
+#   word_b  word_c
+#
+class SentenceCoocurrence < Wukong::Streamer::RecordStreamer
+  def initialize *args
+    super *args
+    @bucket = BucketCounter.new
+  end
+
+  def process title, idx, *words
+    words.each_with_index do |word_a, idx|
+      words[(idx+1) .. -1].each do |word_b|
+        @bucket << [word_a, word_b]
+      end
+    end
+    dump_bucket if @bucket.full?
+  end
+
+  def dump_bucket
+    @bucket.each do |pair_key, count|
+      emit [pair_key, count]
+    end
+    $stderr.puts "bucket stats: #{@bucket.stats.inspect}"
+    @bucket.clear
+  end
+
+  def after_stream
+    dump_bucket
+  end
+end
+
+#
+# Combine multiple bucket counts into a single one
+#
+class CombineBuckets < Wukong::Streamer::AccumulatingReducer
+  def start! *args
+    @total = 0
+  end
+  def accumulate word, count
+    @total += count.to_i
+  end
+  def finalize
+    yield [@total, key] if @total > 20
+  end
+end
+
+Wukong.run(
+  SentenceCoocurrence,
+  CombineBuckets,
+  :io_sort_record_percent => 0.3,
+  :io_sort_mb             => 300
+)
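As a standalone illustration of the pair-generation loop in SentenceCoocurrence#process above, here is the same nested iteration run on a three-word sentence in plain Ruby, outside the streaming framework:

<pre>
words = %w[beware the jabberwock]
pairs = []
words.each_with_index do |word_a, idx|
  words[(idx + 1)..-1].each{|word_b| pairs << [word_a, word_b] }
end
pairs  # => [["beware", "the"], ["beware", "jabberwock"], ["the", "jabberwock"]]
</pre>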
data/examples/emr/README.textile ADDED
@@ -0,0 +1,110 @@
+h1. Using Elastic Map-Reduce in Wukong
+
+h2. Initial Setup
+
+# Sign up for elastic map reduce and S3 at Amazon AWS.
+
+# Download the Amazon elastic-mapreduce runner: either the official version at http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip or the infochimps fork (which has support for Ruby 1.9) at http://github.com/infochimps/elastic-mapreduce .
+
+# Create a bucket and path to hold your EMR logs, scripts and other ephemera. For instance you might choose 'emr.yourdomain.com' as the bucket and '/wukong' as a scoping path within that bucket. In that case you will refer to it with a path like s3://emr.yourdomain.com/wukong (see notes below about s3n:// vs. s3:// URLs).
+
+# Copy the contents of wukong/examples/emr/dot_wukong_dir to ~/.wukong
+
+# Edit emr.yaml and credentials.json, adding your keys where appropriate and following the other instructions. Start with a single-node m1.small cluster as you'll probably have some false starts before the flow of logging in, checking the logs, etc becomes clear.
+
+# You should now be good to launch a program. We'll give it the @--alive@ flag so that the machine sticks around if there were any issues:
+
+./elastic_mapreduce_example.rb --run=emr --alive s3://emr.yourdomain.com/wukong/data/input s3://emr.yourdomain.com/wukong/data/output
+
+# If you visit the "AWS console":http://bit.ly/awsconsole you should now see a jobflow with two steps. The first sets up debugging for the job; the second is your hadoop task.
+
+# The "AWS console":http://bit.ly/awsconsole also has the public IP of the master node. You can log in to the machine directly:
+
+<pre>
+ssh -i /path/to/your/keypair.pem hadoop@ec2-148-37-14-128.compute-1.amazonaws.com
+</pre>
+
+h3. Lorkbong
+
+Lorkbong (named after the staff carried by Sun Wukong) is a very very simple example Heroku app that lets you trigger showing job status or launching a new job, either by visiting a special URL or by triggering a rake task. Get its code from
+
+http://github.com/mrflip/lorkbong
+
+h3. s3n:// vs. s3:// URLs
+
+Many external tools use a URI convention to address files in S3; they typically use the 's3://' scheme, which makes a lot of sense:
+s3://emr.yourcompany.com/wukong/happy_job_1/logs/whatever-20100808.log
+
+Hadoop can maintain an HDFS on the Amazon S3: it uses a block structure and has optimizations for streaming, no file size limitation, and other goodness. However, only hadoop tools can interpret the contents of those blocks -- to everything else it just looks like a soup of blocks labelled block_-8675309 and so forth. Hadoop unfortunately chose the 's3://' scheme for URIs in this filesystem:
+s3://s3hdfs.yourcompany.com/path/to/data
+
+Hadoop is happy to read s3 native files -- 'native' as in, you can look at them with a browser and upload them and download them with any S3 tool out there. There's a 5GB limit on file size, and in some cases a performance hit (but not in our experience enough to worry about). You refer to these files with the 's3n://' scheme ('n' as in 'native'):
+s3n://emr.yourcompany.com/wukong/happy_job_1/code/happy_job_1-mapper.rb
+s3n://emr.yourcompany.com/wukong/happy_job_1/code/happy_job_1-reducer.rb
+s3n://emr.yourcompany.com/wukong/happy_job_1/logs/whatever-20100808.log
+
+Wukong will coerce things to the right scheme when it knows what that scheme should be (eg. code should be s3n://). It will otherwise leave the path alone. Specifically, if you use a URI scheme for input and output paths you must use 's3n://' for normal s3 files.
+
+h2. Advanced Tips n' Tricks for common usage
+
+h3. Direct access to logs using your browser
+
+Each Hadoop component exposes a web dashboard for you to access. Use the following ports:
+
+* 9100: Job tracker (master only)
+* 9101: Namenode (master only)
+* 9102: Datanodes
+* 9103: Task trackers
+
+They will only, however, respond to web requests from within the private cluster
+subnet. You can browse the cluster by creating a persistent tunnel to the hadoop master node, and configuring your
+browser to use it as a proxy.
+
+h4. Create a tunneling proxy to your cluster
+
+To create a tunnel from your local machine to the master node, substitute the keypair and the master node's address into this command:
+
+<pre><code>
+ssh -i ~/.wukong/keypairs/KEYPAIR.pem -f -N -D 6666 -o StrictHostKeyChecking=no -o "ConnectTimeout=10" -o "ServerAliveInterval=60" -o "ControlPath=none" ubuntu@MASTER_NODE_PUBLIC_IP
+</code></pre>
+
+The command will silently background itself if it worked.
+
+h4. Make your browser use the proxy (but only for cluster machines)
+
+You can access basic information by pointing your browser to "this Proxy
+Auto-Configuration (PAC)
+file.":http://github.com/infochimps/cluster_chef/raw/master/config/proxy.pac
+You'll have issues if you browse around though, because many of the in-page
+links will refer to addresses that only resolve within the cluster's private
+namespace.
+
+h4. Setup Foxy Proxy
+
+To fix this, use "FoxyProxy":https://addons.mozilla.org/en-US/firefox/addon/2464
+It allows you to manage multiple proxy configurations and to use the proxy for
+DNS resolution (curing the private address problem).
+
+Once you've installed the FoxyProxy extension and restarted Firefox,
+
+* Set FoxyProxy to 'Use Proxies based on their pre-defined patterns and priorities'
+* Create a new proxy, called 'EC2 Socks Proxy' or something
+* Automatic proxy configuration URL: http://github.com/infochimps/cluster_chef/raw/master/config/proxy.pac
+* Under 'General', check yes for 'Perform remote DNS lookups on host'
+* Add the following URL patterns as 'whitelist' using 'Wildcards' (not regular expression):
+
+* <code>*.compute-*.internal*</code>
+* <code>*ec2.internal*</code>
+* <code>*domu*.internal*</code>
+* <code>*ec2*.amazonaws.com*</code>
+* <code>*://10.*</code>
+
+And this one as blacklist:
+
+* <code>https://us-*st-1.ec2.amazonaws.com/*</code>
+
+
+h3. Pulling to your local machine
+
+s3cmd sync s3://s3n.infinitemonkeys.info/emr/elastic_mapreduce_example/log/ /tmp/emr_log/
+
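For reference, the minimal script skeleton the walkthrough above assumes, mirroring the elastic_mapreduce_example.rb diff below (the streamer body is an illustrative placeholder):

<pre>
#!/usr/bin/env ruby
require 'rubygems'
require 'wukong/script'
require 'wukong/script/emr_command'   # enables --run=emr

# Placeholder mapper/reducer: tag each line so you can see the job ran.
class FooStreamer < Wukong::Streamer::LineStreamer
  def process line
    emit [line, 'saw this line on EMR']
  end
end

Wukong::Script.new(FooStreamer, FooStreamer).run
</pre>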
data/examples/emr/elastic_mapreduce_example.rb CHANGED
@@ -1,7 +1,8 @@
 #!/usr/bin/env ruby
 Dir[File.dirname(__FILE__)+'/vendor/**/lib'].each{|dir| $: << dir }
 require 'rubygems'
-require 'wukong'
+require 'wukong/script'
+require 'wukong/script/emr_command'
 
 #
 # * Copy the emr.yaml from here into ~/.wukong/emr.yaml
@@ -24,5 +25,4 @@ class FooStreamer < Wukong::Streamer::LineStreamer
   end
 end
 
-Settings.resolve!
 Wukong::Script.new(FooStreamer, FooStreamer).run