wukong 1.5.4 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (87)
  1. data/CHANGELOG.textile +32 -0
  2. data/README.textile +58 -12
  3. data/TODO.textile +0 -8
  4. data/bin/hdp-bzip +12 -17
  5. data/bin/hdp-kill-task +1 -1
  6. data/bin/hdp-sort +7 -7
  7. data/bin/hdp-stream +7 -7
  8. data/bin/hdp-stream-flat +2 -3
  9. data/bin/setcat +11 -0
  10. data/bin/uniq-ord +59 -0
  11. data/examples/corpus/bucket_counter.rb +47 -0
  12. data/examples/corpus/dbpedia_abstract_to_sentences.rb +85 -0
  13. data/examples/corpus/sentence_coocurrence.rb +70 -0
  14. data/examples/emr/README.textile +110 -0
  15. data/examples/emr/dot_wukong_dir/emr_bootstrap.sh +1 -0
  16. data/examples/emr/elastic_mapreduce_example.rb +2 -2
  17. data/examples/ignore_me/counting.rb +56 -0
  18. data/examples/ignore_me/grouper.rb +71 -0
  19. data/examples/network_graph/adjacency_list.rb +2 -2
  20. data/examples/network_graph/breadth_first_search.rb +14 -21
  21. data/examples/network_graph/gen_multi_edge.rb +22 -13
  22. data/examples/pagerank/pagerank.rb +1 -1
  23. data/examples/pagerank/pagerank_initialize.rb +6 -10
  24. data/examples/sample_records.rb +6 -16
  25. data/examples/server_logs/apache_log_parser.rb +7 -22
  26. data/examples/server_logs/breadcrumbs.rb +39 -0
  27. data/examples/server_logs/logline.rb +27 -0
  28. data/examples/size.rb +3 -2
  29. data/examples/{binning_percentile_estimator.rb → stats/binning_percentile_estimator.rb} +9 -11
  30. data/examples/{rank_and_bin.rb → stats/rank_and_bin.rb} +2 -2
  31. data/examples/stupidly_simple_filter.rb +11 -14
  32. data/examples/word_count.rb +16 -36
  33. data/lib/wukong/and_pig.rb +2 -15
  34. data/lib/wukong/logger.rb +7 -28
  35. data/lib/wukong/periodic_monitor.rb +24 -9
  36. data/lib/wukong/script/emr_command.rb +1 -0
  37. data/lib/wukong/script/hadoop_command.rb +31 -29
  38. data/lib/wukong/script.rb +19 -14
  39. data/lib/wukong/store/cassandra_model.rb +2 -1
  40. data/lib/wukong/streamer/accumulating_reducer.rb +5 -9
  41. data/lib/wukong/streamer/base.rb +44 -3
  42. data/lib/wukong/streamer/counting_reducer.rb +12 -12
  43. data/lib/wukong/streamer/filter.rb +2 -2
  44. data/lib/wukong/streamer/list_reducer.rb +3 -3
  45. data/lib/wukong/streamer/reducer.rb +11 -0
  46. data/lib/wukong/streamer.rb +7 -3
  47. data/lib/wukong.rb +7 -3
  48. data/{examples → old}/cassandra_streaming/berlitz_for_cassandra.textile +0 -0
  49. data/{examples → old}/cassandra_streaming/client_interface_notes.textile +0 -0
  50. data/{examples → old}/cassandra_streaming/client_schema.textile +0 -0
  51. data/{examples → old}/cassandra_streaming/tuning.textile +0 -0
  52. data/wukong.gemspec +257 -285
  53. metadata +45 -62
  54. data/examples/cassandra_streaming/avromapper.rb +0 -85
  55. data/examples/cassandra_streaming/cassandra.avpr +0 -468
  56. data/examples/cassandra_streaming/cassandra_random_partitioner.rb +0 -62
  57. data/examples/cassandra_streaming/catter.sh +0 -45
  58. data/examples/cassandra_streaming/client_schema.avpr +0 -211
  59. data/examples/cassandra_streaming/foofile.avr +0 -0
  60. data/examples/cassandra_streaming/pymap.sh +0 -1
  61. data/examples/cassandra_streaming/pyreduce.sh +0 -1
  62. data/examples/cassandra_streaming/smutation.avpr +0 -188
  63. data/examples/cassandra_streaming/streamer.sh +0 -51
  64. data/examples/cassandra_streaming/struct_loader.rb +0 -24
  65. data/examples/count_keys.rb +0 -56
  66. data/examples/count_keys_at_mapper.rb +0 -57
  67. data/examples/emr/README-elastic_map_reduce.textile +0 -26
  68. data/examples/keystore/cassandra_batch_test.rb +0 -41
  69. data/examples/keystore/conditional_outputter_example.rb +0 -70
  70. data/examples/store/chunked_store_example.rb +0 -18
  71. data/lib/wukong/dfs.rb +0 -81
  72. data/lib/wukong/keystore/cassandra_conditional_outputter.rb +0 -122
  73. data/lib/wukong/keystore/redis_db.rb +0 -24
  74. data/lib/wukong/keystore/tyrant_db.rb +0 -137
  75. data/lib/wukong/keystore/tyrant_notes.textile +0 -145
  76. data/lib/wukong/models/graph.rb +0 -25
  77. data/lib/wukong/monitor/chunked_store.rb +0 -23
  78. data/lib/wukong/monitor/periodic_logger.rb +0 -34
  79. data/lib/wukong/monitor/periodic_monitor.rb +0 -70
  80. data/lib/wukong/monitor.rb +0 -7
  81. data/lib/wukong/rdf.rb +0 -104
  82. data/lib/wukong/streamer/cassandra_streamer.rb +0 -61
  83. data/lib/wukong/streamer/count_keys.rb +0 -30
  84. data/lib/wukong/streamer/count_lines.rb +0 -26
  85. data/lib/wukong/streamer/em_streamer.rb +0 -7
  86. data/lib/wukong/streamer/preprocess_with_pipe_streamer.rb +0 -22
  87. data/lib/wukong/wukong_class.rb +0 -21
data/CHANGELOG.textile CHANGED
@@ -1,3 +1,35 @@
+ h2. Wukong v2.0.0
+
+ h4. Important changes
+
+ * Passing options to streamers is now deprecated. Use @Settings@ instead.
+
+ * Streamers now have a periodic monitor by default, which logs (to STDERR by default) every 10_000 lines or 30 seconds.
+
+ * Examples cleaned up; they should all run.
+
+ h4. Simplified syntax
+
+ * You can now pass Script.new an *instance* of a Streamer to use as mapper or reducer.
+ * Added an experimental bit of sugar:
+
+ <pre>
+     #!/usr/bin/env ruby
+     require 'wukong/script'
+
+     LineStreamer.map do |line|
+       emit line.reverse
+     end.run
+ </pre>
+
+   Note that you can now tweet a wukong script.
+
+ * It's now recommended that at the top of a wukong script you say
+ <pre>
+     require 'wukong/script'
+ </pre>
+   Among other benefits, this lets you refer to wukong streamers without a prefix.
+
  h2. Wukong v1.5.4
 
  * EMR support now works very well
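
To make the new conventions concrete, here is a minimal sketch of the @Settings@-based style and of handing Script.new a streamer *instance*, both announced above. The @min_count@ flag and the WordFilter class are hypothetical illustrations, and the exact Configliere calls may differ slightly from what the gem ships; treat this as a sketch, not the gem's own example.

<pre>
#!/usr/bin/env ruby
require 'wukong/script'

# Hypothetical flag: with Settings you define options globally rather than
# passing an options hash into the streamer (now deprecated, per the changelog).
Settings.define :min_count, :default => 2, :description => 'Drop records seen fewer times than this'

class WordFilter < Wukong::Streamer::RecordStreamer
  def process word, count
    emit [word, count] if count.to_i >= Settings[:min_count]
  end
end

# Script.new now also accepts streamer *instances*, not just classes:
Wukong::Script.new(WordFilter.new, nil).run
</pre>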
data/README.textile CHANGED
@@ -19,18 +19,6 @@ The **main documentation** lives on the "Wukong Pages.":http://mrflip.github.com
  * Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop)
  * "More info":http://mrflip.github.com/wukong/moreinfo.html
 
- h2. Imminent Changes
-
- I'm pushing to release "Wukong 3.0 the actual 1.0 release".
-
- * For reducing/uniqing, a notion of mutable_fields and immutable_fields and extrinsic_fields: two objects compare the same/differently if their mutable fields compare the same/differently
- * Methods on TypedStruct to
-
- * Make to_flat(false) the default, with the sort_fields / partition_fields defaulting to 2 each and very prominently documented
- * Standardize the notion that wukong classes have a "key"; by default, it will be to_a.first for Structs/TypedStructs. This shouldn't break anything.
- * May make some things that are derived classes into mixin'ed modules
- * Will probably change the name of AccumulatingReducer into just Accumulator, and have all Accumulator-derived classes include Accumulator; I'll make sure the old names continue to work though.
-
 
  h2. Help!
 
@@ -193,6 +181,64 @@ You'd end up with
  @newman @elaine @jerry @kramer
  </code></pre>
 
+ h2. Gotchas
+
+ h4. RecordStreamer dies on blank lines with "wrong number of arguments"
+
+ If your lines don't always have a full complement of fields, and you define #process() to take fixed named arguments, then Ruby will complain when some of them don't show up:
+
+ <pre>
+     class MyUnhappyMapper < Wukong::Streamer::RecordStreamer
+       # this will fail if the line has more or fewer than 3 fields:
+       def process x, y, z
+         p [x, y, z]
+       end
+     end
+ </pre>
+
+ The cleanest way I know to fix this is with recordize, which you should recall always returns an array of fields:
+
+ <pre>
+     class MyHappyMapper < Wukong::Streamer::RecordStreamer
+       # always extracts three fields; any missing fields are nil, any extra fields are discarded
+       # @example
+       #   recordize("a")            # ["a", nil, nil]
+       #   recordize("a\tb\tc")      # ["a", "b", "c"]
+       #   recordize("a\tb\tc\td")   # ["a", "b", "c"]
+       def recordize raw_record
+         x, y, z = super(raw_record)
+         [x, y, z]
+       end
+
+       # Now all lines produce exactly three args
+       def process x, y, z
+         p [x, y, z]
+       end
+     end
+ </pre>
+
+ If you want to preserve any extra fields, use the extra argument to #split():
+
+ <pre>
+     class MyMoreThanHappyMapper < Wukong::Streamer::RecordStreamer
+       # always extracts three fields; any missing fields are nil, and the final field
+       # holds a tab-separated string of all trailing fields
+       # @example
+       #   recordize("a")            # ["a", nil, nil]
+       #   recordize("a\tb\tc")      # ["a", "b", "c"]
+       #   recordize("a\tb\tc\td")   # ["a", "b", "c\td"]
+       def recordize raw_record
+         x, y, z = split(raw_record, "\t", 3)
+         [x, y, z]
+       end
+
+       # Now all lines produce exactly three args
+       def process x, y, z
+         p [x, y, z]
+       end
+     end
+ </pre>
+
+
  h2. Why is it called Wukong?
 
  Hadoop, as you may know, is "named after a stuffed elephant.":http://en.wikipedia.org/wiki/Hadoop Since Wukong was started by the "infochimps":http://infochimps.org team, we needed a simian analog. A Monkey King who journeyed to the land of the Elephant seems to fit the bill:
data/TODO.textile CHANGED
@@ -1,13 +1,5 @@
-
-
-
  * add GEM_PATH to hadoop_recycle_env
 
- * Hadoop_command function received an array for the input_path parameter
-
  ** We should be able to specify comma *or* space separated paths; the last
  space-separated path in Settings.rest becomes the output file, the others are
  used as the input_file list.
-
- * Make configliere Settings and streamer_instance.options() be the same
- thing. (instead of almost-but-confusingly-not-always the same thing).
data/bin/hdp-bzip CHANGED
@@ -2,27 +2,22 @@
 
  HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
- OUTPUT="$1" ; shift
+ input_file=${1} ; shift
+ output_file=${1} ; shift
 
- INPUTS=''
- for foo in $@; do
-   INPUTS="$INPUTS -input $foo\
- "
- done
+ if [ "$output_file" == "" ] ; then echo "$0 input_file output_file" ; exit ; fi
 
- echo "Removing output directory $OUTPUT"
- hadoop fs -rmr $OUTPUT
+ HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
  cmd="${HADOOP_HOME}/bin/hadoop \
- jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar \
- -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
- -jobconf mapred.output.compress=true \
- -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
- -jobconf mapred.reduce.tasks=1 \
- -mapper \"/bin/cat\" \
- -reducer \"/bin/cat\" \
- $INPUTS
- -output $OUTPUT \
+ jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar \
+ -Dmapred.output.compress=true \
+ -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
+ -Dmapred.reduce.tasks=1 \
+ -mapper \"/bin/cat\" \
+ -reducer \"/bin/cat\" \
+ -input \"$input_file\" \
+ -output \"$output_file\" \
  "
  echo $cmd
  $cmd
data/bin/hdp-kill-task CHANGED
@@ -1,3 +1,3 @@
  #!/usr/bin/env bash
 
- exec hadoop fs -kill-task "$1"
+ exec hadoop job -kill-task "$1"
data/bin/hdp-sort CHANGED
@@ -1,5 +1,4 @@
  #!/usr/bin/env bash
- # hadoop dfs -rmr out/parsed-followers
 
  input_file=${1} ; shift
  output_file=${1} ; shift
@@ -13,17 +12,18 @@ if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/
  HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
  cmd="${HADOOP_HOME}/bin/hadoop \
- jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar
+ jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar
+ $@
+ -D num.key.fields.for.partition=\"$partfields\"
+ -D stream.num.map.output.key.fields=\"$sortfields\"
+ -D stream.map.output.field.separator=\"'/t'\"
+ -D mapred.text.key.partitioner.options=\"-k1,$partfields\"
+ -D mapred.job.name=\"`basename $0`-$map_script-$input_file-$output_file\"
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
- -jobconf num.key.fields.for.partition=\"$partfields\"
- -jobconf stream.num.map.output.key.fields=\"$sortfields\"
- -jobconf stream.map.output.field.separator=\"'/t'\"
- -jobconf mapred.text.key.partitioner.options=\"-k1,$partfields\"
  -mapper \"$map_script\"
  -reducer \"$reduce_script\"
  -input \"$input_file\"
  -output \"$output_file\"
- $@
  "
 
  echo "$cmd"
data/bin/hdp-stream CHANGED
@@ -1,5 +1,4 @@
  #!/usr/bin/env bash
- # hadoop dfs -rmr out/parsed-followers
 
  input_file=${1} ; shift
  output_file=${1} ; shift
@@ -13,17 +12,18 @@ if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/
  HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
  cmd="${HADOOP_HOME}/bin/hadoop \
- jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar
+ jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar
+ $@
+ -D num.key.fields.for.partition=\"$partfields\"
+ -D stream.num.map.output.key.fields=\"$sortfields\"
+ -D stream.map.output.field.separator=\"'/t'\"
+ -D mapred.text.key.partitioner.options=\"-k1,$partfields\"
+ -D mapred.job.name=\"`basename $0`-$map_script-$input_file-$output_file\"
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
- -jobconf num.key.fields.for.partition=\"$partfields\"
- -jobconf stream.num.map.output.key.fields=\"$sortfields\"
- -jobconf stream.map.output.field.separator=\"'/t'\"
- -jobconf mapred.text.key.partitioner.options=\"-k1,$partfields\"
  -mapper \"$map_script\"
  -reducer \"$reduce_script\"
  -input \"$input_file\"
  -output \"$output_file\"
- $@
  "
 
  echo "$cmd"
data/bin/hdp-stream-flat CHANGED
@@ -10,13 +10,12 @@ if [ "$output_file" == "" ] ; then echo "$0 input_file output_file [mapper=/bin/
  HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
  # Can add fun stuff like
- # -jobconf mapred.map.tasks=3 \
- # -jobconf mapred.reduce.tasks=3 \
+ # -Dmapred.reduce.tasks=0 \
 
  exec ${HADOOP_HOME}/bin/hadoop \
  jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar \
  "$@" \
- -jobconf "mapred.job.name=`basename $0`-$map_script-$input_file-$output_file" \
+ -Dmapred.job.name=`basename $0`-$map_script-$input_file-$output_file \
  -mapper "$map_script" \
  -reducer "$reduce_script" \
  -input "$input_file" \
data/bin/setcat ADDED
@@ -0,0 +1,11 @@
+ #!/usr/bin/env bash
+
+ #
+ # This script is useful for debugging: it dumps your environment to STDERR
+ # and otherwise runs as `cat`.
+ #
+
+ set >&2
+
+ cat
+ true
data/bin/uniq-ord ADDED
@@ -0,0 +1,59 @@
+ #!/usr/bin/env ruby
+ # encoding: ASCII-8BIT
+ require 'set'
+
+ unless ARGV.empty?
+   unless ARGV.include?('--help')
+     puts "\n**\nSorry, uniq-ord only works in-line: cat foo.txt bar.tsv | uniq-ord\n**" ; puts
+   end
+   puts <<USAGE
+ uniq-ord is like the uniq command but doesn't depend on prior sorting: it tracks
+ each line and only emits the first-seen instance of that line.
+
+ The algorithm is /very/ simplistic: it uses ruby's built-in hash to track lines.
+ This can produce false positives, meaning that a line of output might be removed
+ even if it hasn't been seen before. It may also consume an unbounded amount of
+ memory (though less than the input text). With a million lines it will consume
+ about 70 MB of memory and have more than a 1 in a million chance of a false
+ positive. On a billion lines it will consume many GB and have over 25% odds of
+ incorrectly skipping a line.
+
+ However, it's really handy for dealing with in-order lists from the command line.
+ USAGE
+   exit(0)
+ end
+
+ # # Logging
+ #
+ # MB = 1024*1024
+ # LOG_INTERVAL = 100_000
+ # $start = Time.now; $iter = 0; $size = 0
+ # def log_line
+ #   elapsed = (Time.now - $start).to_f
+ #   $stderr.puts("%5d s\t%10.1f l/s\t%5dk<\t%5dk>\t%5d MB\t%9.1f MB/s\t%11d b/l"%[ elapsed, $iter/elapsed, $iter/1000, LINES.count/1000, $size/MB, ($size/MB)/elapsed, $size/$iter ])
+ # end
+
+ LINES = Set.new
+ $stdin.each do |line|
+   next if LINES.include?(line.hash)
+   puts line
+   LINES << line.hash
+   # $iter += 1 ; $size += line.length
+   # log_line if ($iter % LOG_INTERVAL == 0)
+ end
+ # log_line
+
+ #
+ # # 2.1 GB data, 1M lines, 2000 avg chars/line
+ #
+ # # Used:  RSS: 71_988 kB   VSZ: 2_509_152 kB
+ # # Stats: 38 s   25_859.1 l/s   1000k<   1000k>   1976 MB   51.1 MB/s   2072 b/l
+ # # Time:  real 0m41.4 s   user 0m31.6 s   sys 0m8.3 s   pct 96.48
+ #
+ # # 4.1 GB data, 5.6M lines, 800 avg chars/line
+ #
+ # # Used:  RSS: 330_644 kB   VSZ: 2_764_236 kB
+ # # Stats: 861   6_538.2 l/s   5632k<   5632k>   4158 MB   4.8 MB/s   774 b/l
+ # # Time:  real 14m24.6 s   user 13m8.8 s   sys 0m12. s   pct 92.61
+ #
+
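
The help text above is explicit about the trade-off: storing only @line.hash@ keeps memory down but allows false positives. Where exactness matters more than memory, a collision-free variant is nearly as short; this sketch is an illustration, not part of the gem:

<pre>
#!/usr/bin/env ruby
require 'set'

# Order-preserving uniq with no false positives: stores each distinct line
# verbatim, so memory grows with the number of distinct lines.
seen = Set.new
$stdin.each do |line|
  next if seen.include?(line)
  puts line
  seen << line
end
</pre>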
data/examples/corpus/bucket_counter.rb ADDED
@@ -0,0 +1,47 @@
+
+ class BucketCounter
+   BUCKET_SIZE = 2**24
+   attr_reader :total
+
+   def initialize
+     @hsh = Hash.new{|h,k| h[k] = 0 }
+     @total = 0
+   end
+
+   # def [] val
+   #   @hsh[val]
+   # end
+   # def << val
+   #   @hsh[val] += 1; @total += 1 ; self
+   # end
+
+   def [] val
+     @hsh[val.hash % BUCKET_SIZE]
+   end
+   def << val
+     @hsh[val.hash % BUCKET_SIZE] += 1; @total += 1 ; self
+   end
+
+   def insert *words
+     words.flatten.each{|word| self << word }
+   end
+   def clear
+     @hsh.clear
+     @total = 0
+   end
+
+   def stats
+     { :total => total,
+       :size  => size,
+     }
+   end
+   def size() @hsh.size end
+
+   def full?
+     size.to_f / BUCKET_SIZE > 0.5
+   end
+
+   def each *args, &block
+     @hsh.each(*args, &block)
+   end
+ end
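
A rough usage sketch for the class above (the words and counts are invented); note that lookups go through @val.hash % BUCKET_SIZE@, so distinct values can very occasionally share a bucket:

<pre>
counter = BucketCounter.new
counter.insert %w[apple banana apple cherry]
counter['apple']   # => 2, barring a hash-bucket collision
counter.total      # => 4
counter.stats      # => { :total => 4, :size => 3 }
counter.full?      # => false until the table passes half of BUCKET_SIZE
counter.clear
</pre>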
data/examples/corpus/dbpedia_abstract_to_sentences.rb ADDED
@@ -0,0 +1,85 @@
+ #!/usr/bin/env ruby
+ require 'wukong/script'
+
+ #
+ # Use the Stanford NLP parser to split a piece of text into sentences
+ #
+ # @example
+ #   SentenceParser.split("Beware the Jabberwock, my son! The jaws that bite, the claws that catch! Beware the Jubjub bird, and shun The frumious Bandersnatch!")
+ #   # => [["Beware", "the", "Jabberwock", ",", "my", "son", "!"], ["The", "jaws", "that", "bite", ",", "the", "claws", "that", "catch", "!"], ["Beware", "the", "Jubjub", "bird", ",", "and", "shun", "The", "frumious", "Bandersnatch", "!"]]
+ #
+ class SentenceParser
+   def self.processor
+     return @processor if @processor
+     require 'rubygems'
+     require 'stanfordparser'
+     @processor = StanfordParser::DocumentPreprocessor.new
+   end
+
+   def self.split line
+     processor.getSentencesFromString(line).map{|s| s.map{|w| w.to_s } }
+   end
+ end
+
+ #
+ # Takes one document per line and splits it into sentences
+ #
+ class WordNGrams < Wukong::Streamer::LineStreamer
+   def recordize line
+     line.strip!
+     line.gsub!(%r{^<http://dbpedia.org/resource/([^>]+)> <[^>]+> \"}, '') ; title = $1
+     line.gsub!(%r{\"@en \.},'')
+     [title, SentenceParser.split(line)]
+   end
+
+   def process title, sentences
+     sentences.each_with_index do |words, idx|
+       yield [title, idx, words].flatten
+     end
+   end
+ end
+
+ Wukong.run WordNGrams, nil, :partition_fields => 1, :sort_fields => 2
+
+ # ---------------------------------------------------------------------------
+ #
+ # Run Time:
+ #
+ # Job Name: dbpedia_abstract_to_sentences.rb---/data/rawd/encyc/dbpedia/dbpedia_dumps/short_abstracts_en.nt---/data/rawd/encyc/dbpedia/dbpedia_parsed/short_abstract_sentences
+ # Status: Succeeded
+ # Started at: Fri Jan 28 03:14:45 UTC 2011
+ # Finished in: 41mins, 50sec
+ # 3 machines: master m1.xlarge, 2 c1.xlarge workers; was having some over-memory issues on the c1.xls
+ #
+ # Counter                 Reduce           Total
+ # SLOTS_MILLIS_MAPS       0                10 126 566
+ # Launched map tasks      0                15
+ # Data-local map tasks    0                15
+ # SLOTS_MILLIS_REDUCES    0                1 217
+ # HDFS_BYTES_READ         1 327 116 133    1 327 116 133
+ # HDFS_BYTES_WRITTEN      1 229 841 020    1 229 841 020
+ # Map input records       3 261 096        3 261 096
+ # Spilled Records         0                0
+ # Map input bytes         1 326 524 800    1 326 524 800
+ # SPLIT_RAW_BYTES         1 500            1 500
+ # Map output records      9 026 343        9 026 343
+ #
+ # Job Name: dbpedia_abstract_to_sentences.rb---/data/rawd/encyc/dbpedia/dbpedia_dumps/long_abstracts_en.nt---/data/rawd/encyc/dbpedia/dbpedia_parsed/long_abstract_sentences
+ # Status: Succeeded
+ # Started at: Fri Jan 28 03:23:08 UTC 2011
+ # Finished in: 41mins, 11sec
+ # 3 machines: master m1.xlarge, 2 c1.xlarge workers; was having some over-memory issues on the c1.xls
+ #
+ # Counter                 Reduce           Total
+ # SLOTS_MILLIS_MAPS       0                19 872 357
+ # Launched map tasks      0                29
+ # Data-local map tasks    0                29
+ # SLOTS_MILLIS_REDUCES    0                5 504
+ # HDFS_BYTES_READ         2 175 900 769    2 175 900 769
+ # HDFS_BYTES_WRITTEN      2 280 332 736    2 280 332 736
+ # Map input records       3 261 096        3 261 096
+ # Spilled Records         0                0
+ # Map input bytes         2 174 849 644    2 174 849 644
+ # SPLIT_RAW_BYTES         2 533            2 533
+ # Map output records      15 425 467       15 425 467
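
For orientation, here is roughly what recordize and process above do to a single (shortened, made-up) line of the DBpedia abstracts dump; the tokenization shown is what the Stanford splitter would plausibly produce, not captured output:

<pre>
raw = '<http://dbpedia.org/resource/Jabberwocky> <http://dbpedia.org/ontology/abstract> "Beware the Jabberwock, my son! The jaws that bite!"@en .'

# recordize strips the subject/predicate and the trailing "@en ." wrapper,
# keeping the resource name as the title:
#   title     => "Jabberwocky"
#   sentences => [["Beware", "the", "Jabberwock", ",", "my", "son", "!"],
#                 ["The", "jaws", "that", "bite", "!"]]
#
# process then yields one flattened, tab-separated record per sentence:
#   Jabberwocky  0  Beware  the  Jabberwock  ,  my  son  !
#   Jabberwocky  1  The  jaws  that  bite  !
</pre>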
data/examples/corpus/sentence_coocurrence.rb ADDED
@@ -0,0 +1,70 @@
+ #!/usr/bin/env ruby
+ $: << File.dirname(__FILE__)
+ require 'rubygems'
+ require 'wukong/script'
+ require 'bucket_counter'
+
+ #
+ # Co-occurrence counts
+ #
+
+ #
+ # Input is a list of document-idx-sentences; each field is tab-separated:
+ #   title  idx  word_a  word_b  word_c ...
+ #
+ # This emits each co-occurring pair exactly once; in the case of a three-word
+ # sentence the output would be
+ #
+ #   word_a word_b
+ #   word_a word_c
+ #   word_b word_c
+ #
+ class SentenceCoocurrence < Wukong::Streamer::RecordStreamer
+   def initialize *args
+     super *args
+     @bucket = BucketCounter.new
+   end
+
+   def process title, idx, *words
+     words.each_with_index do |word_a, idx|
+       words[(idx+1) .. -1].each do |word_b|
+         @bucket << [word_a, word_b]
+       end
+     end
+     dump_bucket if @bucket.full?
+   end
+
+   def dump_bucket
+     @bucket.each do |pair_key, count|
+       emit [pair_key, count]
+     end
+     $stderr.puts "bucket stats: #{@bucket.stats.inspect}"
+     @bucket.clear
+   end
+
+   def after_stream
+     dump_bucket
+   end
+ end
+
+ #
+ # Combine multiple bucket counts into a single one
+ #
+ class CombineBuckets < Wukong::Streamer::AccumulatingReducer
+   def start! *args
+     @total = 0
+   end
+   def accumulate word, count
+     @total += count.to_i
+   end
+   def finalize
+     yield [@total, key] if @total > 20
+   end
+ end
+
+ Wukong.run(
+   SentenceCoocurrence,
+   CombineBuckets,
+   :io_sort_record_percent => 0.3,
+   :io_sort_mb             => 300
+ )
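
Schematically, the in-mapper BucketCounter turns repeated pairs into partial counts and CombineBuckets sums them per pair, keeping only pairs seen more than 20 times; the values below are invented:

<pre>
# mapper output (pair key, partial count), possibly spread over several bucket flushes:
#   jaws  bite    15
#   jaws  bite    9
#   jaws  claws   1
#
# CombineBuckets accumulates per key and applies the > 20 cutoff:
#   yields [24, "jaws bite"]   # 15 + 9
#   drops  "jaws claws"        # a total of 1 falls below the cutoff
</pre>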
data/examples/emr/README.textile ADDED
@@ -0,0 +1,110 @@
+ h1. Using Elastic Map-Reduce in Wukong
+
+ h2. Initial Setup
+
+ # Sign up for Elastic MapReduce and S3 at Amazon AWS.
+
+ # Download the Amazon elastic-mapreduce runner: either the official version at http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip or the infochimps fork (which has support for Ruby 1.9) at http://github.com/infochimps/elastic-mapreduce .
+
+ # Create a bucket and path to hold your EMR logs, scripts and other ephemera. For instance you might choose 'emr.yourdomain.com' as the bucket and '/wukong' as a scoping path within that bucket. In that case you will refer to it with a path like s3://emr.yourdomain.com/wukong (see notes below about s3n:// vs. s3:// URLs).
+
+ # Copy the contents of wukong/examples/emr/dot_wukong_dir to ~/.wukong
+
+ # Edit emr.yaml and credentials.json, adding your keys where appropriate and following the other instructions. Start with a single-node m1.small cluster, as you'll probably have some false starts before the flow of logging in, checking the logs, etc. becomes clear.
+
+ # You should now be good to launch a program. We'll give it the @--alive@ flag so that the machine sticks around if there were any issues:
+
+   ./elastic_mapreduce_example.rb --run=emr --alive s3://emr.yourdomain.com/wukong/data/input s3://emr.yourdomain.com/wukong/data/output
+
+ # If you visit the "AWS console":http://bit.ly/awsconsole you should now see a jobflow with two steps. The first sets up debugging for the job; the second is your hadoop task.
+
+ # The "AWS console":http://bit.ly/awsconsole also has the public IP of the master node. You can log in to the machine directly:
+
+ <pre>
+     ssh -i /path/to/your/keypair.pem hadoop@ec2-148-37-14-128.compute-1.amazonaws.com
+ </pre>
+
+ h3. Lorkbong
+
+ Lorkbong (named after the staff carried by Sun Wukong) is a very, very simple example Heroku app that lets you trigger showing job status or launching a new job, either by visiting a special URL or by triggering a rake task. Get its code from
+
+   http://github.com/mrflip/lorkbong
+
+ h3. s3n:// vs. s3:// URLs
+
+ Many external tools use a URI convention to address files in S3; they typically use the 's3://' scheme, which makes a lot of sense:
+   s3://emr.yourcompany.com/wukong/happy_job_1/logs/whatever-20100808.log
+
+ Hadoop can maintain an HDFS on Amazon S3: it uses a block structure and has optimizations for streaming, no file size limitation, and other goodness. However, only hadoop tools can interpret the contents of those blocks -- to everything else it just looks like a soup of blocks labelled block_-8675309 and so forth. Hadoop unfortunately chose the 's3://' scheme for URIs in this filesystem:
+   s3://s3hdfs.yourcompany.com/path/to/data
+
+ Hadoop is happy to read s3 native files -- 'native' as in, you can look at them with a browser and upload and download them with any S3 tool out there. There's a 5GB limit on file size, and in some cases a performance hit (but not in our experience enough to worry about). You refer to these files with the 's3n://' scheme ('n' as in 'native'):
+   s3n://emr.yourcompany.com/wukong/happy_job_1/code/happy_job_1-mapper.rb
+   s3n://emr.yourcompany.com/wukong/happy_job_1/code/happy_job_1-reducer.rb
+   s3n://emr.yourcompany.com/wukong/happy_job_1/logs/whatever-20100808.log
+
+ Wukong will coerce things to the right scheme when it knows what that scheme should be (eg. code should be s3n://). It will otherwise leave the path alone. Specifically, if you use a URI scheme for input and output paths you must use 's3n://' for normal s3 files.
+
+ h2. Advanced Tips n' Tricks for common usage
+
+ h3. Direct access to logs using your browser
+
+ Each Hadoop component exposes a web dashboard for you to access. Use the following ports:
+
+ * 9100: Job tracker (master only)
+ * 9101: Namenode (master only)
+ * 9102: Datanodes
+ * 9103: Task trackers
+
+ They will only, however, respond to web requests from within the private cluster
+ subnet. You can browse the cluster by creating a persistent tunnel to the hadoop master node, and configuring your
+ browser to use it as a proxy.
+
+ h4. Create a tunneling proxy to your cluster
+
+ To create a tunnel from your local machine to the master node, substitute the keypair and the master node's address into this command:
+
+ <pre><code>
+   ssh -i ~/.wukong/keypairs/KEYPAIR.pem -f -N -D 6666 -o StrictHostKeyChecking=no -o "ConnectTimeout=10" -o "ServerAliveInterval=60" -o "ControlPath=none" ubuntu@MASTER_NODE_PUBLIC_IP
+ </code></pre>
+
+ The command will silently background itself if it worked.
+
+ h4. Make your browser use the proxy (but only for cluster machines)
+
+ You can access basic information by pointing your browser to "this Proxy
+ Auto-Configuration (PAC)
+ file.":http://github.com/infochimps/cluster_chef/raw/master/config/proxy.pac
+ You'll have issues if you browse around though, because many of the in-page
+ links will refer to addresses that only resolve within the cluster's private
+ namespace.
+
+ h4. Setup FoxyProxy
+
+ To fix this, use "FoxyProxy":https://addons.mozilla.org/en-US/firefox/addon/2464 .
+ It allows you to manage multiple proxy configurations and to use the proxy for
+ DNS resolution (curing the private address problem).
+
+ Once you've installed the FoxyProxy extension and restarted Firefox,
+
+ * Set FoxyProxy to 'Use Proxies based on their pre-defined patterns and priorities'
+ * Create a new proxy, called 'EC2 Socks Proxy' or something
+ * Automatic proxy configuration URL: http://github.com/infochimps/cluster_chef/raw/master/config/proxy.pac
+ * Under 'General', check yes for 'Perform remote DNS lookups on host'
+ * Add the following URL patterns as 'whitelist' using 'Wildcards' (not regular expression):
+
+ * <code>*.compute-*.internal*</code>
+ * <code>*ec2.internal*</code>
+ * <code>*domu*.internal*</code>
+ * <code>*ec2*.amazonaws.com*</code>
+ * <code>*://10.*</code>
+
+ And this one as blacklist:
+
+ * <code>https://us-*st-1.ec2.amazonaws.com/*</code>
+
+
+ h3. Pulling to your local machine
+
+   s3cmd sync s3://s3n.infinitemonkeys.info/emr/elastic_mapreduce_example/log/ /tmp/emr_log/
data/examples/emr/dot_wukong_dir/emr_bootstrap.sh CHANGED
@@ -1,4 +1,5 @@
  #!/usr/bin/env bash
+ set -x # turn on tracing
 
  # A url directory with the scripts you'd like to stuff into the machine
  REMOTE_FILE_URL_BASE="http://github.com/infochimps/wukong"
data/examples/emr/elastic_mapreduce_example.rb CHANGED
@@ -1,7 +1,8 @@
  #!/usr/bin/env ruby
  Dir[File.dirname(__FILE__)+'/vendor/**/lib'].each{|dir| $: << dir }
  require 'rubygems'
- require 'wukong'
+ require 'wukong/script'
+ require 'wukong/script/emr_command'
 
  #
  # * Copy the emr.yaml from here into ~/.wukong/emr.yaml
@@ -24,5 +25,4 @@ class FooStreamer < Wukong::Streamer::LineStreamer
  end
  end
 
- Settings.resolve!
  Wukong::Script.new(FooStreamer, FooStreamer).run