wukong 1.5.4 → 2.0.0

Files changed (87)
  1. data/CHANGELOG.textile +32 -0
  2. data/README.textile +58 -12
  3. data/TODO.textile +0 -8
  4. data/bin/hdp-bzip +12 -17
  5. data/bin/hdp-kill-task +1 -1
  6. data/bin/hdp-sort +7 -7
  7. data/bin/hdp-stream +7 -7
  8. data/bin/hdp-stream-flat +2 -3
  9. data/bin/setcat +11 -0
  10. data/bin/uniq-ord +59 -0
  11. data/examples/corpus/bucket_counter.rb +47 -0
  12. data/examples/corpus/dbpedia_abstract_to_sentences.rb +85 -0
  13. data/examples/corpus/sentence_coocurrence.rb +70 -0
  14. data/examples/emr/README.textile +110 -0
  15. data/examples/emr/dot_wukong_dir/emr_bootstrap.sh +1 -0
  16. data/examples/emr/elastic_mapreduce_example.rb +2 -2
  17. data/examples/ignore_me/counting.rb +56 -0
  18. data/examples/ignore_me/grouper.rb +71 -0
  19. data/examples/network_graph/adjacency_list.rb +2 -2
  20. data/examples/network_graph/breadth_first_search.rb +14 -21
  21. data/examples/network_graph/gen_multi_edge.rb +22 -13
  22. data/examples/pagerank/pagerank.rb +1 -1
  23. data/examples/pagerank/pagerank_initialize.rb +6 -10
  24. data/examples/sample_records.rb +6 -16
  25. data/examples/server_logs/apache_log_parser.rb +7 -22
  26. data/examples/server_logs/breadcrumbs.rb +39 -0
  27. data/examples/server_logs/logline.rb +27 -0
  28. data/examples/size.rb +3 -2
  29. data/examples/{binning_percentile_estimator.rb → stats/binning_percentile_estimator.rb} +9 -11
  30. data/examples/{rank_and_bin.rb → stats/rank_and_bin.rb} +2 -2
  31. data/examples/stupidly_simple_filter.rb +11 -14
  32. data/examples/word_count.rb +16 -36
  33. data/lib/wukong/and_pig.rb +2 -15
  34. data/lib/wukong/logger.rb +7 -28
  35. data/lib/wukong/periodic_monitor.rb +24 -9
  36. data/lib/wukong/script/emr_command.rb +1 -0
  37. data/lib/wukong/script/hadoop_command.rb +31 -29
  38. data/lib/wukong/script.rb +19 -14
  39. data/lib/wukong/store/cassandra_model.rb +2 -1
  40. data/lib/wukong/streamer/accumulating_reducer.rb +5 -9
  41. data/lib/wukong/streamer/base.rb +44 -3
  42. data/lib/wukong/streamer/counting_reducer.rb +12 -12
  43. data/lib/wukong/streamer/filter.rb +2 -2
  44. data/lib/wukong/streamer/list_reducer.rb +3 -3
  45. data/lib/wukong/streamer/reducer.rb +11 -0
  46. data/lib/wukong/streamer.rb +7 -3
  47. data/lib/wukong.rb +7 -3
  48. data/{examples → old}/cassandra_streaming/berlitz_for_cassandra.textile +0 -0
  49. data/{examples → old}/cassandra_streaming/client_interface_notes.textile +0 -0
  50. data/{examples → old}/cassandra_streaming/client_schema.textile +0 -0
  51. data/{examples → old}/cassandra_streaming/tuning.textile +0 -0
  52. data/wukong.gemspec +257 -285
  53. metadata +45 -62
  54. data/examples/cassandra_streaming/avromapper.rb +0 -85
  55. data/examples/cassandra_streaming/cassandra.avpr +0 -468
  56. data/examples/cassandra_streaming/cassandra_random_partitioner.rb +0 -62
  57. data/examples/cassandra_streaming/catter.sh +0 -45
  58. data/examples/cassandra_streaming/client_schema.avpr +0 -211
  59. data/examples/cassandra_streaming/foofile.avr +0 -0
  60. data/examples/cassandra_streaming/pymap.sh +0 -1
  61. data/examples/cassandra_streaming/pyreduce.sh +0 -1
  62. data/examples/cassandra_streaming/smutation.avpr +0 -188
  63. data/examples/cassandra_streaming/streamer.sh +0 -51
  64. data/examples/cassandra_streaming/struct_loader.rb +0 -24
  65. data/examples/count_keys.rb +0 -56
  66. data/examples/count_keys_at_mapper.rb +0 -57
  67. data/examples/emr/README-elastic_map_reduce.textile +0 -26
  68. data/examples/keystore/cassandra_batch_test.rb +0 -41
  69. data/examples/keystore/conditional_outputter_example.rb +0 -70
  70. data/examples/store/chunked_store_example.rb +0 -18
  71. data/lib/wukong/dfs.rb +0 -81
  72. data/lib/wukong/keystore/cassandra_conditional_outputter.rb +0 -122
  73. data/lib/wukong/keystore/redis_db.rb +0 -24
  74. data/lib/wukong/keystore/tyrant_db.rb +0 -137
  75. data/lib/wukong/keystore/tyrant_notes.textile +0 -145
  76. data/lib/wukong/models/graph.rb +0 -25
  77. data/lib/wukong/monitor/chunked_store.rb +0 -23
  78. data/lib/wukong/monitor/periodic_logger.rb +0 -34
  79. data/lib/wukong/monitor/periodic_monitor.rb +0 -70
  80. data/lib/wukong/monitor.rb +0 -7
  81. data/lib/wukong/rdf.rb +0 -104
  82. data/lib/wukong/streamer/cassandra_streamer.rb +0 -61
  83. data/lib/wukong/streamer/count_keys.rb +0 -30
  84. data/lib/wukong/streamer/count_lines.rb +0 -26
  85. data/lib/wukong/streamer/em_streamer.rb +0 -7
  86. data/lib/wukong/streamer/preprocess_with_pipe_streamer.rb +0 -22
  87. data/lib/wukong/wukong_class.rb +0 -21
data/CHANGELOG.textile CHANGED
@@ -1,3 +1,35 @@
+ h2. Wukong v2.0.0
+
+ h4. Important changes
+
+ * Passing options to streamers is now deprecated. Use @Settings@ instead.
+
+ * Streamer by default has a periodic monitor that logs (to STDERR by default) every 10_000 lines or 30 seconds
+
+ * Examples cleaned up, should all run
+
+ h4. Simplified syntax
+
+ * you can now pass Script.new an *instance* of Streamer to use as mapper or reducer
+ * Adding an experimental sugar:
+
+ <pre>
+ #!/usr/bin/env ruby
+ require 'wukong/script'
+
+ LineStreamer.map do |line|
+   emit line.reverse
+ end.run
+ </pre>
+
+ Note that you can now tweet a wukong script.
+
+ * It's now recommended that at the top of a wukong script you say
+ <pre>
+ require 'wukong/script'
+ </pre>
+ Among other benefits, this lets you refer to wukong streamers without prefix.
+
  h2. Wukong v1.5.4
 
  * EMR support now works very well
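As a companion to the notes above, a minimal sketch of the recommended 2.0.0 style: @require 'wukong/script'@ at the top, configuration read from @Settings@ rather than passed as streamer options, and streamers referenced without the @Wukong::Streamer::@ prefix. (The @min_len@ setting and @TokenFilter@ class are invented for illustration.)

<pre><code>
#!/usr/bin/env ruby
require 'wukong/script'

# configliere-style setting, used instead of passing options to the streamer
Settings.define :min_len, :default => 3, :description => 'shortest token to keep'

# per the changelog above, LineStreamer can be referenced without its prefix
class TokenFilter < LineStreamer
  def process line
    line.split("\t").each do |token|
      emit token if token.length >= Settings[:min_len]
    end
  end
end

Wukong.run TokenFilter, nil
</code></pre>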
data/README.textile CHANGED
@@ -19,18 +19,6 @@ The **main documentation** lives on the "Wukong Pages.":http://mrflip.github.com
  * Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop)
  * "More info":http://mrflip.github.com/wukong/moreinfo.html
 
- h2. Imminent Changes
-
- I'm pushing to release "Wukong 3.0 the actual 1.0 release".
-
- * For reducing/uniqing, a notion of mutable_fields and immutable_fields and extrinsic_fields: two objects compare the same/differently if their mutable fields compare the same/differently
- * Methods on TypedStruct to
-
- * Make to_flat(false) the default, with the sort_fields / partition_fields defaulting to 2 each and very prominently documented
- * Standardize the notion that wukong classes have a "key"; by default, it will be to_a.first for Structs/TypedStructs. This shouldn't break anything.
- * May make some things that are derived classes into mixin'ed modules
- * Will probably change the name of AccumulatingReducer into just Accumulator, and have all Accumulator-derived classes include Accumulator; I'll make sure the old names continue to work though.
-
 
  h2. Help!
 
@@ -193,6 +181,64 @@ You'd end up with
  @newman @elaine @jerry @kramer
  </code></pre>
 
+ h2. Gotchas
+
+ h4. RecordStreamer dies on blank lines with "wrong number of arguments"
+
+ If your lines don't always have a full complement of fields, and you define #process() to take fixed named arguments, then ruby will complain when some of them don't show up:
+
+ <pre>
+ class MyUnhappyMapper < Wukong::Streamer::RecordStreamer
+   # this will fail if the line has more or fewer than 3 fields:
+   def process x, y, z
+     p [x, y, z]
+   end
+ end
+ </pre>
+
+ The cleanest way I know to fix this is with recordize, which you should recall always returns an array of fields:
+
+ <pre>
+ class MyHappyMapper < Wukong::Streamer::RecordStreamer
+   # extracts three fields always; any missing fields are nil, any extra fields discarded
+   # @example
+   #   recordize("a")          # ["a", nil, nil]
+   #   recordize("a\tb\tc")    # ["a", "b", "c"]
+   #   recordize("a\tb\tc\td") # ["a", "b", "c"]
+   def recordize raw_record
+     x, y, z = super(raw_record)
+     [x, y, z]
+   end
+
+   # Now all lines produce exactly three args
+   def process x, y, z
+     p [x, y, z]
+   end
+ end
+ </pre>
+
+ If you want to preserve any extra fields, use the extra argument to #split():
+
+ <pre>
+ class MyMoreThanHappyMapper < Wukong::Streamer::RecordStreamer
+   # extracts three fields always; any missing fields are nil, the final field will contain a tab-separated string of all trailing fields
+   # @example
+   #   recordize("a")          # ["a", nil, nil]
+   #   recordize("a\tb\tc")    # ["a", "b", "c"]
+   #   recordize("a\tb\tc\td") # ["a", "b", "c\td"]
+   def recordize raw_record
+     x, y, z = raw_record.split("\t", 3)
+     [x, y, z]
+   end
+
+   # Now all lines produce exactly three args
+   def process x, y, z
+     p [x, y, z]
+   end
+ end
+ </pre>
+
+
  h2. Why is it called Wukong?
 
  Hadoop, as you may know, is "named after a stuffed elephant.":http://en.wikipedia.org/wiki/Hadoop Since Wukong was started by the "infochimps":http://infochimps.org team, we needed a simian analog. A Monkey King who journeyed to the land of the Elephant seems to fit the bill:
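A third workaround, not in the README itself but worth noting: declare #process with a splat so its arity can never mismatch the line (the class name here is made up):

<pre>
class MyLenientMapper < Wukong::Streamer::RecordStreamer
  # *fields soaks up however many tab-separated fields the line happens to have,
  # so blank or ragged lines no longer raise "wrong number of arguments"
  def process *fields
    x, y, z = fields
    p [x, y, z]
  end
end
</pre>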
data/TODO.textile CHANGED
@@ -1,13 +1,5 @@
-
-
-
  * add GEM_PATH to hadoop_recycle_env
 
- * Hadoop_command function received an array for the input_path parameter
-
  ** We should be able to specify comma *or* space separated paths; the last
  space-separated path in Settings.rest becomes the output file, the others are
  used as the input_file list.
-
- * Make configliere Settings and streamer_instance.options() be the same
-   thing. (instead of almost-but-confusingly-not-always the same thing).
data/bin/hdp-bzip CHANGED
@@ -2,27 +2,22 @@
 
  HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
- OUTPUT="$1" ; shift
+ input_file=${1} ; shift
+ output_file=${1} ; shift
 
- INPUTS=''
- for foo in $@; do
-   INPUTS="$INPUTS -input $foo\
- "
- done
+ if [ "$output_file" == "" ] ; then echo "$0 input_file output_file" ; exit ; fi
 
- echo "Removing output directory $OUTPUT"
- hadoop fs -rmr $OUTPUT
+ HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
  cmd="${HADOOP_HOME}/bin/hadoop \
-   jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar \
-   -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-   -jobconf mapred.output.compress=true \
-   -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
-   -jobconf mapred.reduce.tasks=1 \
-   -mapper \"/bin/cat\" \
-   -reducer \"/bin/cat\" \
-   $INPUTS
-   -output $OUTPUT \
+   jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar \
+   -Dmapred.output.compress=true \
+   -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
+   -Dmapred.reduce.tasks=1 \
+   -mapper \"/bin/cat\" \
+   -reducer \"/bin/cat\" \
+   -input \"$input_file\" \
+   -output \"$output_file\" \
 "
  echo $cmd
  $cmd
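A usage sketch for the rewritten script above (HDFS paths are made up): it streams the input through an identity mapper and reducer and writes a single BZip2-compressed part file:

<pre><code>
# squash an HDFS directory into one bzip2-compressed output file
hdp-bzip /data/logs/2011-01-28 /data/logs/2011-01-28-bz2
</code></pre>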
data/bin/hdp-kill-task CHANGED
@@ -1,3 +1,3 @@
  #!/usr/bin/env bash
 
- exec hadoop fs -kill-task "$1"
+ exec hadoop job -kill-task "$1"
data/bin/hdp-sort CHANGED
@@ -1,5 +1,4 @@
  #!/usr/bin/env bash
- # hadoop dfs -rmr out/parsed-followers
 
  input_file=${1} ; shift
  output_file=${1} ; shift
@@ -13,17 +12,18 @@ if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/
  HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
  cmd="${HADOOP_HOME}/bin/hadoop \
-   jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar
+   jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar
+   $@
+   -D num.key.fields.for.partition=\"$partfields\"
+   -D stream.num.map.output.key.fields=\"$sortfields\"
+   -D stream.map.output.field.separator=\"'/t'\"
+   -D mapred.text.key.partitioner.options=\"-k1,$partfields\"
+   -D mapred.job.name=\"`basename $0`-$map_script-$input_file-$output_file\"
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
-   -jobconf num.key.fields.for.partition=\"$partfields\"
-   -jobconf stream.num.map.output.key.fields=\"$sortfields\"
-   -jobconf stream.map.output.field.separator=\"'/t'\"
-   -jobconf mapred.text.key.partitioner.options=\"-k1,$partfields\"
    -mapper \"$map_script\"
    -reducer \"$reduce_script\"
    -input \"$input_file\"
    -output \"$output_file\"
-   $@
 "
 
  echo "$cmd"
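A usage sketch for hdp-sort (file names are made up; the trailing-argument order of mapper, reducer, partition-field count, and sort-field count is an assumption based on the variable names in the script):

<pre><code>
# group tab-separated records: partition and sort on the first 2 fields
hdp-sort /data/pairs /data/pairs_sorted /bin/cat /bin/cat 2 2
</code></pre>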
data/bin/hdp-stream CHANGED
@@ -1,5 +1,4 @@
  #!/usr/bin/env bash
- # hadoop dfs -rmr out/parsed-followers
 
  input_file=${1} ; shift
  output_file=${1} ; shift
@@ -13,17 +12,18 @@ if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/
  HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
  cmd="${HADOOP_HOME}/bin/hadoop \
-   jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar
+   jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar
+   $@
+   -D num.key.fields.for.partition=\"$partfields\"
+   -D stream.num.map.output.key.fields=\"$sortfields\"
+   -D stream.map.output.field.separator=\"'/t'\"
+   -D mapred.text.key.partitioner.options=\"-k1,$partfields\"
+   -D mapred.job.name=\"`basename $0`-$map_script-$input_file-$output_file\"
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
-   -jobconf num.key.fields.for.partition=\"$partfields\"
-   -jobconf stream.num.map.output.key.fields=\"$sortfields\"
-   -jobconf stream.map.output.field.separator=\"'/t'\"
-   -jobconf mapred.text.key.partitioner.options=\"-k1,$partfields\"
    -mapper \"$map_script\"
    -reducer \"$reduce_script\"
    -input \"$input_file\"
    -output \"$output_file\"
-   $@
 "
 
  echo "$cmd"
data/bin/hdp-stream-flat CHANGED
@@ -10,13 +10,12 @@ if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/
  HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
  # Can add fun stuff like
- #   -jobconf mapred.map.tasks=3 \
- #   -jobconf mapred.reduce.tasks=3 \
+ #   -Dmapred.reduce.tasks=0 \
 
  exec ${HADOOP_HOME}/bin/hadoop \
    jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar \
    "$@" \
-   -jobconf "mapred.job.name=`basename $0`-$map_script-$input_file-$output_file" \
+   -Dmapred.job.name=`basename $0`-$map_script-$input_file-$output_file \
    -mapper "$map_script" \
    -reducer "$reduce_script" \
    -input "$input_file" \
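A usage sketch for hdp-stream-flat with the map-only setting mentioned in the comment above (paths are made up, and the positional order of mapper and reducer is an assumption based on the variable names):

<pre><code>
# map-only job: grab the first field of each line, run zero reducers
hdp-stream-flat /data/raw_logs /data/log_ips 'cut -f 1' /bin/cat -Dmapred.reduce.tasks=0
</code></pre>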
data/bin/setcat ADDED
@@ -0,0 +1,11 @@
+ #!/usr/bin/env bash
+
+ #
+ # This script is useful for debugging. It dumps your environment to STDERR
+ # and otherwise runs as `cat`
+ #
+
+ set >&2
+
+ cat
+ true
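A sketch of how setcat might be used (paths are made up; the streaming job assumes setcat is on the task nodes' PATH):

<pre><code>
# locally: dump the environment to stderr, pass stdin through untouched
echo 'hello' | setcat > /dev/null

# in a streaming job, swap it in as the mapper to inspect each task's environment
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar \
    -mapper setcat -reducer /bin/cat -input /data/input -output /data/env_check
</code></pre>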
data/bin/uniq-ord ADDED
@@ -0,0 +1,59 @@
+ #!/usr/bin/env ruby
+ # encoding: ASCII-8BIT
+ require 'set'
+
+ unless ARGV.empty?
+   unless ARGV.include?('--help')
+     puts "\n**\nSorry, uniq-ord only works in-line: cat foo.txt bar.tsv | uniq-ord\n**" ; puts
+   end
+   puts <<USAGE
+ uniq-ord is like the uniq command but doesn't depend on prior sorting: it tracks
+ each line and only emits the first-seen instance of that line.
+
+ The algorithm is /very/ simplistic: it uses ruby's built-in hash to track lines.
+ This can produce false positives, meaning that a line of output might be removed
+ even if it hasn't been seen before. It may also consume an unbounded amount of
+ memory (though less than the input text). With a million lines it will consume
+ about 70 MB of memory and have more than 1 in a million chance of false
+ positive. On a billion lines it will consume many GB and have over 25% odds of
+ incorrectly skipping a line.
+
+ However, it's really handy for dealing with in-order lists from the command line.
+ USAGE
+   exit(0)
+ end
+
+ # # Logging
+ #
+ # MB = 1024*1024
+ # LOG_INTERVAL = 100_000
+ # $start = Time.now; $iter = 0; $size = 0
+ # def log_line
+ #   elapsed = (Time.now - $start).to_f
+ #   $stderr.puts("%5d s\t%10.1f l/s\t%5dk<\t%5dk>\t%5d MB\t%9.1f MB/s\t%11d b/l"%[ elapsed, $iter/elapsed, $iter/1000, LINES.count/1000, $size/MB, ($size/MB)/elapsed, $size/$iter ])
+ # end
+
+ LINES = Set.new
+ $stdin.each do |line|
+   next if LINES.include?(line.hash)
+   puts line
+   LINES << line.hash
+   # $iter += 1 ; $size += line.length
+   # log_line if ($iter % LOG_INTERVAL == 0)
+ end
+ # log_line
+
+ #
+ # # 2.1 GB data, 1M lines, 2000 avg chars/line
+ #
+ # # Used:  RSS: 71_988 kB    VSZ: 2_509_152 kB
+ # # Stats: 38 s   25_859.1 l/s   1000k<   1000k>   1976 MB   51.1 MB/s   2072 b/l
+ # # Time:  real 0m41.4 s   user 0m31.6 s   sys 0m8.3 s   pct 96.48
+ #
+ # # 4.1 GB data, 5.6M lines, 800 avg chars/line
+ #
+ # # Used:  RSS: 330_644 kB   VSZ: 2_764_236 kB
+ # # Stats: 861   6_538.2 l/s   5632k<   5632k>   4158 MB   4.8 MB/s   774 b/l
+ # # Time:  real 14m24.6 s   user 13m8.8 s   sys 0m12. s   pct 92.61
+ #
+
data/examples/corpus/bucket_counter.rb ADDED
@@ -0,0 +1,47 @@
+
+ class BucketCounter
+   BUCKET_SIZE = 2**24
+   attr_reader :total
+
+   def initialize
+     @hsh = Hash.new{|h,k| h[k] = 0 }
+     @total = 0
+   end
+
+   # def [] val
+   #   @hsh[val]
+   # end
+   # def << val
+   #   @hsh[val] += 1; @total += 1 ; self
+   # end
+
+   def [] val
+     @hsh[val.hash % BUCKET_SIZE]
+   end
+   def << val
+     @hsh[val.hash % BUCKET_SIZE] += 1; @total += 1 ; self
+   end
+
+   def insert *words
+     words.flatten.each{|word| self << word }
+   end
+   def clear
+     @hsh.clear
+     @total = 0
+   end
+
+   def stats
+     { :total => total,
+       :size  => size,
+     }
+   end
+   def size() @hsh.size end
+
+   def full?
+     size.to_f / BUCKET_SIZE > 0.5
+   end
+
+   def each *args, &block
+     @hsh.each(*args, &block)
+   end
+ end
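A quick sketch of exercising BucketCounter on its own (values are invented); because keys are bucketed by val.hash % BUCKET_SIZE, counts for distinct values can occasionally collide:

<pre><code>
$: << File.dirname(__FILE__)   # assumes you are alongside bucket_counter.rb
require 'bucket_counter'

bucket = BucketCounter.new
bucket.insert %w[apple banana apple cherry]
bucket << 'apple'

puts bucket['apple']       # => 3, barring a hash-bucket collision
puts bucket.total          # => 5
puts bucket.stats.inspect  # => {:total=>5, :size=>3}
bucket.clear
</code></pre>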
data/examples/corpus/dbpedia_abstract_to_sentences.rb ADDED
@@ -0,0 +1,85 @@
+ #!/usr/bin/env ruby
+ require 'wukong/script'
+
+ #
+ # Use the Stanford NLP parser to split a piece of text into sentences
+ #
+ # @example
+ #   SentenceParser.split("Beware the Jabberwock, my son! The jaws that bite, the claws that catch! Beware the Jubjub bird, and shun The frumious Bandersnatch!")
+ #   # => [["Beware", "the", "Jabberwock", ",", "my", "son", "!"], ["The", "jaws", "that", "bite", ",", "the", "claws", "that", "catch", "!"], ["Beware", "the", "Jubjub", "bird", ",", "and", "shun", "The", "frumious", "Bandersnatch", "!"]]
+ #
+ class SentenceParser
+   def self.processor
+     return @processor if @processor
+     require 'rubygems'
+     require 'stanfordparser'
+     @processor = StanfordParser::DocumentPreprocessor.new
+   end
+
+   def self.split line
+     processor.getSentencesFromString(line).map{|s| s.map{|w| w.to_s } }
+   end
+ end
+
+ #
+ # takes one document per line
+ # splits into sentences
+ #
+ class WordNGrams < Wukong::Streamer::LineStreamer
+   def recordize line
+     line.strip!
+     line.gsub!(%r{^<http://dbpedia.org/resource/([^>]+)> <[^>]+> \"}, '') ; title = $1
+     line.gsub!(%r{\"@en \.},'')
+     [title, SentenceParser.split(line)]
+   end
+
+   def process title, sentences
+     sentences.each_with_index do |words, idx|
+       yield [title, idx, words].flatten
+     end
+   end
+ end
+
+ Wukong.run WordNGrams, nil, :partition_fields => 1, :sort_fields => 2
+
+ # ---------------------------------------------------------------------------
+ #
+ # Run Time:
+ #
+ # Job Name: dbpedia_abstract_to_sentences.rb---/data/rawd/encyc/dbpedia/dbpedia_dumps/short_abstracts_en.nt---/data/rawd/encyc/dbpedia/dbpedia_parsed/short_abstract_sentences
+ # Status:      Succeeded
+ # Started at:  Fri Jan 28 03:14:45 UTC 2011
+ # Finished in: 41mins, 50sec
+ # 3 machines: master m1.xlarge, 2 c1.xlarge workers; was having some over-memory issues on the c1.xls
+ #
+ # Counter                 Reduce           Total
+ # SLOTS_MILLIS_MAPS       0                10 126 566
+ # Launched map tasks      0                15
+ # Data-local map tasks    0                15
+ # SLOTS_MILLIS_REDUCES    0                1 217
+ # HDFS_BYTES_READ         1 327 116 133    1 327 116 133
+ # HDFS_BYTES_WRITTEN      1 229 841 020    1 229 841 020
+ # Map input records       3 261 096        3 261 096
+ # Spilled Records         0                0
+ # Map input bytes         1 326 524 800    1 326 524 800
+ # SPLIT_RAW_BYTES         1 500            1 500
+ # Map output records      9 026 343        9 026 343
+ #
+ # Job Name: dbpedia_abstract_to_sentences.rb---/data/rawd/encyc/dbpedia/dbpedia_dumps/long_abstracts_en.nt---/data/rawd/encyc/dbpedia/dbpedia_parsed/long_abstract_sentences
+ # Status:      Succeeded
+ # Started at:  Fri Jan 28 03:23:08 UTC 2011
+ # Finished in: 41mins, 11sec
+ # 3 machines: master m1.xlarge, 2 c1.xlarge workers; was having some over-memory issues on the c1.xls
+ #
+ # Counter                 Reduce           Total
+ # SLOTS_MILLIS_MAPS       0                19 872 357
+ # Launched map tasks      0                29
+ # Data-local map tasks    0                29
+ # SLOTS_MILLIS_REDUCES    0                5 504
+ # HDFS_BYTES_READ         2 175 900 769    2 175 900 769
+ # HDFS_BYTES_WRITTEN      2 280 332 736    2 280 332 736
+ # Map input records       3 261 096        3 261 096
+ # Spilled Records         0                0
+ # Map input bytes         2 174 849 644    2 174 849 644
+ # SPLIT_RAW_BYTES         2 533            2 533
+ # Map output records      15 425 467       15 425 467
data/examples/corpus/sentence_coocurrence.rb ADDED
@@ -0,0 +1,70 @@
+ #!/usr/bin/env ruby
+ $: << File.dirname(__FILE__)
+ require 'rubygems'
+ require 'wukong/script'
+ require 'bucket_counter'
+
+ #
+ # Coocurrence counts
+ #
+
+ #
+ # Input is a list of document-idx-sentences, each field is tab-separated
+ #   title  idx  word_a  word_b  word_c ...
+ #
+ # This emits each co-occurring pair exactly once; in the case of a three-word
+ # sentence the output would be
+ #
+ #   word_a  word_b
+ #   word_a  word_c
+ #   word_b  word_c
+ #
+ class SentenceCoocurrence < Wukong::Streamer::RecordStreamer
+   def initialize *args
+     super *args
+     @bucket = BucketCounter.new
+   end
+
+   def process title, idx, *words
+     words.each_with_index do |word_a, idx|
+       words[(idx+1) .. -1].each do |word_b|
+         @bucket << [word_a, word_b]
+       end
+     end
+     dump_bucket if @bucket.full?
+   end
+
+   def dump_bucket
+     @bucket.each do |pair_key, count|
+       emit [pair_key, count]
+     end
+     $stderr.puts "bucket stats: #{@bucket.stats.inspect}"
+     @bucket.clear
+   end
+
+   def after_stream
+     dump_bucket
+   end
+ end
+
+ #
+ # Combine multiple bucket counts into a single one
+ #
+ class CombineBuckets < Wukong::Streamer::AccumulatingReducer
+   def start! *args
+     @total = 0
+   end
+   def accumulate word, count
+     @total += count.to_i
+   end
+   def finalize
+     yield [@total, key] if @total > 20
+   end
+ end
+
+ Wukong.run(
+   SentenceCoocurrence,
+   CombineBuckets,
+   :io_sort_record_percent => 0.3,
+   :io_sort_mb             => 300
+ )
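A sketch of a local test run of the script above, using Wukong's local mode so no Hadoop cluster is needed (file names are made up):

<pre><code>
# input lines look like: title <tab> sentence_idx <tab> word_a <tab> word_b ...
./sentence_coocurrence.rb --run=local sentences.tsv pair_counts.tsv
</code></pre>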
data/examples/emr/README.textile ADDED
@@ -0,0 +1,110 @@
+ h1. Using Elastic Map-Reduce in Wukong
+
+ h2. Initial Setup
+
+ # Sign up for elastic map reduce and S3 at Amazon AWS.
+
+ # Download the Amazon elastic-mapreduce runner: either the official version at http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip or the infochimps fork (which has support for Ruby 1.9) at http://github.com/infochimps/elastic-mapreduce .
+
+ # Create a bucket and path to hold your EMR logs, scripts and other ephemera. For instance you might choose 'emr.yourdomain.com' as the bucket and '/wukong' as a scoping path within that bucket. In that case you will refer to it with a path like s3://emr.yourdomain.com/wukong (see notes below about s3n:// vs. s3:// URLs).
+
+ # Copy the contents of wukong/examples/emr/dot_wukong_dir to ~/.wukong
+
+ # Edit emr.yaml and credentials.json, adding your keys where appropriate and following the other instructions. Start with a single-node m1.small cluster, as you'll probably have some false starts before the flow of logging in, checking the logs, etc. becomes clear.
+
+ # You should now be good to launch a program. We'll give it the @--alive@ flag so that the machine sticks around if there were any issues:
+
+   ./elastic_mapreduce_example.rb --run=emr --alive s3://emr.yourdomain.com/wukong/data/input s3://emr.yourdomain.com/wukong/data/output
+
+ # If you visit the "AWS console":http://bit.ly/awsconsole you should now see a jobflow with two steps. The first sets up debugging for the job; the second is your hadoop task.
+
+ # The "AWS console":http://bit.ly/awsconsole also has the public IP of the master node. You can log in to the machine directly:
+
+ <pre>
+ ssh -i /path/to/your/keypair.pem hadoop@ec2-148-37-14-128.compute-1.amazonaws.com
+ </pre>
+
+ h3. Lorkbong
+
+ Lorkbong (named after the staff carried by Sun Wukong) is a very very simple example Heroku app that lets you trigger showing job status or launching a new job, either by visiting a special URL or by triggering a rake task. Get its code from
+
+   http://github.com/mrflip/lorkbong
+
+ h3. s3n:// vs. s3:// URLs
+
+ Many external tools use a URI convention to address files in S3; they typically use the 's3://' scheme, which makes a lot of sense:
+   s3://emr.yourcompany.com/wukong/happy_job_1/logs/whatever-20100808.log
+
+ Hadoop can maintain an HDFS on Amazon S3: it uses a block structure and has optimizations for streaming, no file size limitation, and other goodness. However, only hadoop tools can interpret the contents of those blocks -- to everything else it just looks like a soup of blocks labelled block_-8675309 and so forth. Hadoop unfortunately chose the 's3://' scheme for URIs in this filesystem:
+   s3://s3hdfs.yourcompany.com/path/to/data
+
+ Hadoop is happy to read s3 native files -- 'native' as in, you can look at them with a browser and upload and download them with any S3 tool out there. There's a 5GB limit on file size, and in some cases a performance hit (but not in our experience enough to worry about). You refer to these files with the 's3n://' scheme ('n' as in 'native'):
+   s3n://emr.yourcompany.com/wukong/happy_job_1/code/happy_job_1-mapper.rb
+   s3n://emr.yourcompany.com/wukong/happy_job_1/code/happy_job_1-reducer.rb
+   s3n://emr.yourcompany.com/wukong/happy_job_1/logs/whatever-20100808.log
+
+ Wukong will coerce things to the right scheme when it knows what that scheme should be (eg. code should be s3n://). It will otherwise leave the path alone. Specifically, if you use a URI scheme for input and output paths you must use 's3n://' for normal s3 files.
+
+ h2. Advanced Tips n' Tricks for common usage
+
+ h3. Direct access to logs using your browser
+
+ Each Hadoop component exposes a web dashboard for you to access. Use the following ports:
+
+ * 9100: Job tracker (master only)
+ * 9101: Namenode (master only)
+ * 9102: Datanodes
+ * 9103: Task trackers
+
+ They will only, however, respond to web requests from within the private cluster
+ subnet. You can browse the cluster by creating a persistent tunnel to the hadoop master node, and configuring your
+ browser to use it as a proxy.
+
+ h4. Create a tunneling proxy to your cluster
+
+ To create a tunnel from your local machine to the master node, substitute the keypair and the master node's address into this command:
+
+ <pre><code>
+ ssh -i ~/.wukong/keypairs/KEYPAIR.pem -f -N -D 6666 -o StrictHostKeyChecking=no -o "ConnectTimeout=10" -o "ServerAliveInterval=60" -o "ControlPath=none" ubuntu@MASTER_NODE_PUBLIC_IP
+ </code></pre>
+
+ The command will silently background itself if it worked.
+
+ h4. Make your browser use the proxy (but only for cluster machines)
+
+ You can access basic information by pointing your browser to "this Proxy
+ Auto-Configuration (PAC)
+ file.":http://github.com/infochimps/cluster_chef/raw/master/config/proxy.pac
+ You'll have issues if you browse around though, because many of the in-page
+ links will refer to addresses that only resolve within the cluster's private
+ namespace.
+
+ h4. Setup FoxyProxy
+
+ To fix this, use "FoxyProxy":https://addons.mozilla.org/en-US/firefox/addon/2464
+ It allows you to manage multiple proxy configurations and to use the proxy for
+ DNS resolution (curing the private address problem).
+
+ Once you've installed the FoxyProxy extension and restarted Firefox,
+
+ * Set FoxyProxy to 'Use Proxies based on their pre-defined patterns and priorities'
+ * Create a new proxy, called 'EC2 Socks Proxy' or something
+ * Automatic proxy configuration URL: http://github.com/infochimps/cluster_chef/raw/master/config/proxy.pac
+ * Under 'General', check yes for 'Perform remote DNS lookups on host'
+ * Add the following URL patterns as 'whitelist' using 'Wildcards' (not regular expression):
+
+ * <code>*.compute-*.internal*</code>
+ * <code>*ec2.internal*</code>
+ * <code>*domu*.internal*</code>
+ * <code>*ec2*.amazonaws.com*</code>
+ * <code>*://10.*</code>
+
+ And this one as blacklist:
+
+ * <code>https://us-*st-1.ec2.amazonaws.com/*</code>
+
+
+ h3. Pulling to your local machine
+
+   s3cmd sync s3://s3n.infinitemonkeys.info/emr/elastic_mapreduce_example/log/ /tmp/emr_log/
+
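Putting the s3n:// advice above together with the launch command from Initial Setup, a full EMR invocation might look like this (bucket and paths are made up):

<pre><code>
./elastic_mapreduce_example.rb --run=emr --alive \
    s3n://emr.yourdomain.com/wukong/data/input \
    s3n://emr.yourdomain.com/wukong/data/output
</code></pre>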
data/examples/emr/dot_wukong_dir/emr_bootstrap.sh CHANGED
@@ -1,4 +1,5 @@
  #!/usr/bin/env bash
+ set -x # turn on tracing
 
  # A url directory with the scripts you'd like to stuff into the machine
  REMOTE_FILE_URL_BASE="http://github.com/infochimps/wukong"
data/examples/emr/elastic_mapreduce_example.rb CHANGED
@@ -1,7 +1,8 @@
  #!/usr/bin/env ruby
  Dir[File.dirname(__FILE__)+'/vendor/**/lib'].each{|dir| $: << dir }
  require 'rubygems'
- require 'wukong'
+ require 'wukong/script'
+ require 'wukong/script/emr_command'
 
  #
  # * Copy the emr.yaml from here into ~/.wukong/emr.yaml
@@ -24,5 +25,4 @@ class FooStreamer < Wukong::Streamer::LineStreamer
  end
  end
 
- Settings.resolve!
  Wukong::Script.new(FooStreamer, FooStreamer).run
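Based on the diff above, a minimal EMR-runnable Wukong script now needs only the two requires and a streamer. A sketch (the class and its behavior are illustrative, not the gem's FooStreamer):

<pre><code>
#!/usr/bin/env ruby
require 'rubygems'
require 'wukong/script'
require 'wukong/script/emr_command'

# Upcases every input line; stands in for whatever your real mapper does
class ShoutStreamer < Wukong::Streamer::LineStreamer
  def process line
    emit line.upcase
  end
end

# Note: no Settings.resolve! needed any more, per the removal above
Wukong::Script.new(ShoutStreamer, nil).run

# Launch with, e.g.:
#   ./shouter.rb --run=emr s3n://your-bucket/input s3n://your-bucket/output
</code></pre>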