wukong-hadoop 0.0.1

Files changed (59)
  1. data/.gitignore +59 -0
  2. data/.rspec +2 -0
  3. data/Gemfile +3 -0
  4. data/README.md +339 -0
  5. data/Rakefile +13 -0
  6. data/bin/hdp-bin +44 -0
  7. data/bin/hdp-bzip +23 -0
  8. data/bin/hdp-cat +3 -0
  9. data/bin/hdp-catd +3 -0
  10. data/bin/hdp-cp +3 -0
  11. data/bin/hdp-du +86 -0
  12. data/bin/hdp-get +3 -0
  13. data/bin/hdp-kill +3 -0
  14. data/bin/hdp-kill-task +3 -0
  15. data/bin/hdp-ls +11 -0
  16. data/bin/hdp-mkdir +2 -0
  17. data/bin/hdp-mkdirp +12 -0
  18. data/bin/hdp-mv +3 -0
  19. data/bin/hdp-parts_to_keys.rb +77 -0
  20. data/bin/hdp-ps +3 -0
  21. data/bin/hdp-put +3 -0
  22. data/bin/hdp-rm +32 -0
  23. data/bin/hdp-sort +40 -0
  24. data/bin/hdp-stream +40 -0
  25. data/bin/hdp-stream-flat +22 -0
  26. data/bin/hdp-stream2 +39 -0
  27. data/bin/hdp-sync +17 -0
  28. data/bin/hdp-wc +67 -0
  29. data/bin/wu-hadoop +14 -0
  30. data/examples/counter.rb +17 -0
  31. data/examples/map_only.rb +28 -0
  32. data/examples/processors.rb +4 -0
  33. data/examples/sonnet_18.txt +14 -0
  34. data/examples/tokenizer.rb +28 -0
  35. data/examples/word_count.rb +44 -0
  36. data/features/step_definitions/wu_hadoop_steps.rb +4 -0
  37. data/features/support/env.rb +1 -0
  38. data/features/wu_hadoop.feature +113 -0
  39. data/lib/wukong-hadoop.rb +21 -0
  40. data/lib/wukong-hadoop/configuration.rb +133 -0
  41. data/lib/wukong-hadoop/driver.rb +190 -0
  42. data/lib/wukong-hadoop/driver/hadoop_invocation.rb +184 -0
  43. data/lib/wukong-hadoop/driver/inputs_and_outputs.rb +27 -0
  44. data/lib/wukong-hadoop/driver/local_invocation.rb +48 -0
  45. data/lib/wukong-hadoop/driver/map_logic.rb +104 -0
  46. data/lib/wukong-hadoop/driver/reduce_logic.rb +129 -0
  47. data/lib/wukong-hadoop/extensions.rb +2 -0
  48. data/lib/wukong-hadoop/hadoop_env_methods.rb +80 -0
  49. data/lib/wukong-hadoop/version.rb +6 -0
  50. data/spec/spec_helper.rb +21 -0
  51. data/spec/support/driver_helper.rb +15 -0
  52. data/spec/support/integration_helper.rb +39 -0
  53. data/spec/wukong-hadoop/driver_spec.rb +117 -0
  54. data/spec/wukong-hadoop/hadoop_env_methods_spec.rb +14 -0
  55. data/spec/wukong-hadoop/hadoop_mode_spec.rb +78 -0
  56. data/spec/wukong-hadoop/local_mode_spec.rb +22 -0
  57. data/spec/wukong-hadoop/wu_hadoop_spec.rb +34 -0
  58. data/wukong-hadoop.gemspec +33 -0
  59. metadata +168 -0
data/.gitignore ADDED
@@ -0,0 +1,59 @@
+ ## OS
+ .DS_Store
+ Icon
+ nohup.out
+ .bak
+
+ *.pem
+
+ ## EDITORS
+ \#*
+ .\#*
+ \#*\#
+ *~
+ *.swp
+ REVISION
+ TAGS*
+ tmtags
+ *_flymake.*
+ *_flymake
+ *.tmproj
+ .project
+ .settings
+
+ ## COMPILED
+ a.out
+ *.o
+ *.pyc
+ *.so
+
+ ## OTHER SCM
+ .bzr
+ .hg
+ .svn
+
+ ## PROJECT::GENERAL
+
+ log/*
+ tmp/*
+ pkg/*
+
+ coverage
+ rdoc
+ doc
+ pkg
+ .rake_test_cache
+ .bundle
+ .yardoc
+
+ .vendor
+
+ ## PROJECT::SPECIFIC
+
+ old/*
+ docpages
+ away
+
+ .rbx
+ Gemfile.lock
+ Backup*of*.numbers
data/.rspec ADDED
@@ -0,0 +1,2 @@
+ --format=progress
+ --color
data/Gemfile ADDED
@@ -0,0 +1,3 @@
+ source :rubygems
+
+ gemspec
data/README.md ADDED
@@ -0,0 +1,339 @@
+ # Wukong-Hadoop
+
+ The Hadoop plugin for Wukong lets you run <a
+ href="http://github.com/infochimps-labs/wukong">Wukong processors</a>
+ through <a href="http://hadoop.apache.org/">Hadoop's</a> command-line
+ <a
+ href="http://hadoop.apache.org/docs/r0.15.2/streaming.html">streaming
+ interface</a>.
+
+ Before you use Wukong-Hadoop to develop, test, and run your Hadoop
+ jobs, you might want to read about <a href="http://github.com/infochimps-labs/wukong">Wukong</a>, write some
+ <a href="http://github.com/infochimps-labs/wukong#processors">simple processors</a>, and read about the structure of a <a href="http://en.wikipedia.org/wiki/MapReduce">map/reduce job</a>.
+
+ You might also want to check out some other projects which enrich the
+ Wukong and Hadoop experience:
+
+ * <a href="http://github.com/infochimps-labs/wonderdog">wonderdog</a>: Connect Wukong processors running within Hadoop to Elasticsearch as either a source or sink for data.
+ * <a href="http://github.com/infochimps-labs/wukong-deploy">wukong-deploy</a>: Orchestrate Wukong and other wu-tools together to support an application running on the Infochimps Platform.
+
+ <a name="installation"></a>
+ ## Installation & Setup
+
+ Wukong-Hadoop can be installed as a RubyGem:
+
+ ```
+ $ sudo gem install wukong-hadoop
+ ```
+
+ If you actually want to run your map/reduce jobs on a Hadoop cluster,
+ you'll of course need one handy. <a
+ href="http://github.com/infochimps-labs/ironfan">Ironfan</a> is a
+ great tool for building and managing Hadoop clusters and other
+ distributed infrastructure quickly and easily.
+
+ To run Hadoop jobs through Wukong-Hadoop, you'll need to move your
+ Wukong code to each member of the Hadoop cluster, install
+ Wukong-Hadoop on each, and log in and launch your job from one of
+ them. Ironfan again helps with configuring this.
+
+ <a name="anatomy"></a>
+ ## Anatomy of a map/reduce job
+
+ A map/reduce job consists of two separate phases, the **map** phase
+ and the **reduce** phase, which are connected by an intermediate
+ **sort** phase.
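+
+ In local mode (described below) this shape is literally a shell
+ pipeline; schematically:
+
+ ```
+ cat input.txt | mapper | sort | reducer > output.txt
+ ```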
+
+ The <tt>wu-hadoop</tt> command-line tool is used to run Wukong
+ processors in the shape of a map/reduce job, whether locally or on a
+ Hadoop cluster.
+
+ The examples used in this README are all taken from the
+ <tt>/examples</tt> directory within the Wukong-Hadoop source code.
+ They implement the usual "word count" example.
+
+ <a name="local"></a>
+ ## Test and Develop Map/Reduce Jobs Locally
+
+ Hadoop is a powerful tool designed to process huge amounts of data
+ very quickly. It's not designed to make developing Hadoop jobs
+ iterative and simple. Wukong-Hadoop lets you define a map/reduce job
+ and execute it locally on small amounts of sample data, then launch
+ that job into a Hadoop cluster once you're sure it works.
+
+ <a name="processors_to_mappers_and_reducers"></a>
+ ### From Processors to Mappers & Reducers
+
+ Wukong processors can be used either for the map phase or the reduce
+ phase of a map/reduce job. Different processors can be defined in
+ different <tt>.rb</tt> files or within the same one.
+
+ Map-phase processors typically filter, transform, or otherwise modify
+ input records to get them ready for the reduce phase. Reduce-phase
+ processors typically perform aggregate operations like counting,
+ grouping, averaging, &c.
+
+ Given that you've already created a map/reduce job (just like the
+ word count example that comes with Wukong-Hadoop), the first thing to
+ try is to run the job locally on sample input data in flat files. The
+ <tt>--mode=local</tt> flag tells <tt>wu-hadoop</tt> to run in local
+ mode, suitable for development and testing of jobs:
+
+ ```
+ $ wu-hadoop examples/word_count.rb --mode=local --input=examples/sonnet_18.txt
+ a 2
+ all 1
+ and 2
+ And 3
+ art 1
+ ...
+ ```
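+
+ For reference, here's a minimal sketch of what the processors in a
+ file like <tt>word_count.rb</tt> might look like (a hypothetical
+ reconstruction for illustration, not the actual example shipped in
+ <tt>examples/</tt>):
+
+ ```
+ require 'wukong'
+
+ # Emit each whitespace-separated token on its own line.
+ Wukong.processor(:mapper) do
+   def process(line)
+     line.split.each { |word| yield word }
+   end
+ end
+
+ # Count runs of identical lines; this relies on the sort phase having
+ # grouped identical words together.
+ Wukong.processor(:reducer) do
+   def process(word)
+     if word == @current
+       @count += 1
+     else
+       yield [@current, @count].join("\t") if @current
+       @current, @count = word, 1
+     end
+   end
+   # NOTE: a real reducer would also emit the final word once input is
+   # exhausted (Wukong processors have a finalize hook for this).
+ end
+ ```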
+
+ Wukong-Hadoop looks for processors named <tt>:mapper</tt> and
+ <tt>:reducer</tt> in the <tt>word_count.rb</tt> file. To understand
+ what's going on under the hood, pass the <tt>--dry_run</tt> option:
+
+ ```
+ $ wu-hadoop examples/word_count.rb --mode=local --input=examples/sonnet_18.txt --dry_run
+ I, [2012-11-27T19:24:21.238429 #20104] INFO -- : Dry run:
+ cat examples/sonnet_18.txt | wu-local /home/user/wukong-hadoop/examples/word_count.rb --run=mapper | sort | wu-local /home/user/wukong-hadoop/examples/word_count.rb --run=reducer
+ ```
+
+ which shows that <tt>wu-hadoop</tt> is ultimately relying on
+ <tt>wu-local</tt> to do the heavy lifting. You can copy, paste, and
+ run this longer command (or a portion of it) when debugging.
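+
+ For example, to debug just the map phase, run the first two stages of
+ that pipeline on their own:
+
+ ```
+ $ cat examples/sonnet_18.txt | wu-local examples/word_count.rb --run=mapper
+ ```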
+
+ You can also pass options to your processors:
+
+ ```
+ $ wu-hadoop examples/word_count.rb --mode=local --input=examples/sonnet_18.txt --fold_case --min_length=3
+ all 1
+ and 5
+ art 1
+ brag 1
+ ...
+ ```
+
+ Sometimes you may want to use a given processor in multiple jobs. You
+ can therefore define each processor in separate files if you want. If
+ Wukong-Hadoop doesn't find processors named <tt>:mapper</tt> and
+ <tt>:reducer</tt>, it will try to use processors named after the files
+ you pass it:
+
+ ```
+ $ wu-hadoop examples/tokenizer.rb examples/counter.rb --mode=local --input=examples/sonnet_18.txt
+ a 2
+ all 1
+ and 2
+ And 3
+ art 1
+ ...
+ ```
+
+ You can also just specify the processors you want to run using the
+ <tt>--mapper</tt> and <tt>--reducer</tt> options:
+
+ ```
+ $ wu-hadoop examples/processors.rb --mode=local --input=examples/sonnet_18.txt --mapper=tokenizer --reducer=counter
+ a 2
+ all 1
+ and 2
+ And 3
+ art 1
+ ...
+ ```
+
+ <a name="map_only"></a>
+ ### Map-Only Jobs
+
+ If Wukong-Hadoop can't find a processor named <tt>:reducer</tt> (and
+ you didn't give it two files explicitly) then it will run a map-only
+ job:
+
+ ```
+ $ wu-hadoop examples/tokenizer.rb --mode=local --input=examples/sonnet_18.txt
+ Shall
+ I
+ compare
+ thee
+ ...
+ ```
+
+ You can force this behavior by passing the <tt>--reduce_tasks=0</tt>
+ option:
+
+ ```
+ $ wu-hadoop examples/word_count.rb --mode=local --input=examples/sonnet_18.txt --reduce_tasks=0
+ Shall
+ I
+ compare
+ thee
+ ...
+ ```
+
+ <a name="sort_options"></a>
+ ### Sort Options
+
+ For some kinds of jobs, you may have special requirements about how
+ you sort. You can specify an explicit <tt>--sort_command</tt> option:
+
+ ```
+ $ wu-hadoop examples/word_count.rb --mode=local --input=examples/sonnet_18.txt --sort_command='sort -r'
+ winds 1
+ When 1
+ wander'st 1
+ untrimm'd 1
+ ...
+ ```
+
+ <a name="non_wukong"></a>
+ ### Something Other than Wukong/Ruby?
+
+ Wukong-Hadoop even lets you use mappers and reducers which aren't
+ themselves Wukong processors or even Ruby code. Here the <tt>:counter</tt>
+ processor is replaced by good old <tt>uniq</tt>:
+
+ ```
+ $ wu-hadoop examples/processors.rb --mode=local --input=examples/sonnet_18.txt --mapper=tokenizer --reduce_command='uniq -c'
+ 2 a
+ 1 all
+ 2 and
+ 3 And
+ 1 art
+ ...
+ ```
+
+ This is a good method for getting a little performance bump (if your
+ job is CPU-bound) or even for lifting other non-Hadoop, non-Wukong
+ code into the Hadoop world:
+
+ ```
+ $ wu-hadoop --mode=local --input=examples/sonnet_18.txt --map_command='python tokenizer.py' --reduce_command='python counter.py'
+ a 2
+ all 1
+ and 2
+ And 3
+ art 1
+ ...
+ ```
+
+ The only requirement on <tt>tokenizer.py</tt> and <tt>counter.py</tt>
+ is that they work the same way as their Ruby
+ <tt>Wukong::Processor</tt> equivalents: one line at a time from STDIN
+ to STDOUT.
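+
+ In other words, any executable that reads lines from STDIN and writes
+ lines to STDOUT will do. As a sketch, a plain-Ruby stand-in for a
+ hypothetical <tt>tokenizer.py</tt> could be as simple as:
+
+ ```
+ #!/usr/bin/env ruby
+ # tokenizer.rb: read lines from STDIN, emit one token per line on STDOUT.
+ STDIN.each_line do |line|
+   line.strip.split(/\W+/).reject(&:empty?).each { |token| puts token }
+ end
+ ```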
+
+ <a name="hadoop"></a>
+ ## Running in Hadoop
+
+ Once you've got your code working locally, you can easily make it run
+ inside of Hadoop by just changing the <tt>--mode</tt> option. You'll
+ also need to specify <tt>--input</tt> and <tt>--output</tt> paths that
+ Hadoop can access, either on the <a
+ href="http://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_Distributed_File_System">HDFS</a>
+ or on something like Amazon's <a
+ href="http://aws.amazon.com/s3/">S3</a> if you're using AWS and have
+ properly configured your Hadoop cluster.
+
+ Here's the very first example from the <a href="#local">Local</a>
+ section above, but executed within a Hadoop cluster, reading and writing data on the HDFS:
+
+ ```
+ $ wu-hadoop examples/word_count.rb --mode=hadoop --input=/data/sonnet_18.txt --output=/data/word_count.tsv
+ I, [2012-11-27T19:27:18.872645 #20142] INFO -- : Launching Hadoop!
+ I, [2012-11-27T19:27:18.873477 #20142] INFO -- : Running
+
+ /usr/lib/hadoop/bin/hadoop \
+ jar /usr/lib/hadoop/contrib/streaming/hadoop-*streaming*.jar \
+ -D mapred.job.name='word_count---/data/sonnet_18.txt---/data/word_count.tsv' \
+ -mapper 'wu-local /home/user/wukong-hadoop/examples/word_count.rb --run=mapper' \
+ -reducer 'wu-local /home/user/wukong-hadoop/examples/word_count.rb --run=reducer' \
+ -input '/data/sonnet_18.txt' \
+ -output '/data/word_count.tsv'
+ 12/11/28 01:32:09 INFO mapred.FileInputFormat: Total input paths to process : 1
+ 12/11/28 01:32:10 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local, /mnt2/hadoop/mapred/local]
+ 12/11/28 01:32:10 INFO streaming.StreamJob: Running job: job_201210241848_0043
+ 12/11/28 01:32:10 INFO streaming.StreamJob: To kill this job, run:
+ 12/11/28 01:32:10 INFO streaming.StreamJob: /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=10.124.54.254:8021 -kill job_201210241848_0043
+ 12/11/28 01:32:10 INFO streaming.StreamJob: Tracking URL: http://ip-10-124-54-254.ec2.internal:50030/jobdetails.jsp?jobid=job_201210241848_0043
+ 12/11/28 01:32:11 INFO streaming.StreamJob: map 0% reduce 0%
+ ...
+ ```
+
+ Hadoop throws an error if your output path already exists. If you're
+ running the same job over and over, it can be annoying to constantly
+ have to remember to delete the output path from your last run. Use
+ the <tt>--rm</tt> option in this case to automatically remove the
+ output path before launching a Hadoop job (this only works in Hadoop
+ mode).
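+
+ For example:
+
+ ```
+ $ wu-hadoop examples/word_count.rb --mode=hadoop --input=/data/sonnet_18.txt --output=/data/word_count.tsv --rm
+ ```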
+
+ ### Advanced Hadoop Usage
+
+ For small or lightweight jobs, all you have to do to move from local
+ to Hadoop is change the <tt>--mode</tt> flag when executing your jobs
+ with <tt>wu-hadoop</tt>.
+
+ More complicated jobs may require special code to be available
+ (new input/output formats, <tt>CLASSPATH</tt> or <tt>RUBYLIB</tt>
+ hacking, &c.) or tuning at the level of Hadoop itself to run
+ efficiently.
+
+ #### Other Input/Output Formats
+
+ Hadoop streaming uses the <a
+ href="http://hadoop.apache.org/docs/r0.20.1/api/org/apache/hadoop/mapred/TextInputFormat.html">TextInputFormat</a>
+ and <a
+ href="http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/output/TextOutputFormat.html">TextOutputFormat</a>
+ by default. These turn all input/output data into newline-delimited
+ string records, a perfect match for the command line and for
+ the local mode of Wukong-Hadoop.
+
+ Other input and output formats can be specified with the
+ <tt>--input_format</tt> and <tt>--output_format</tt> options.
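+
+ For example, to read SequenceFiles as text you might pass Hadoop's
+ <tt>SequenceFileAsTextInputFormat</tt> (a hypothetical invocation;
+ adjust paths to taste):
+
+ ```
+ $ wu-hadoop examples/word_count.rb --mode=hadoop --input=/data/seqfiles --output=/data/word_count.tsv --input_format=org.apache.hadoop.mapred.SequenceFileAsTextInputFormat
+ ```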
+
+ #### Tuning
+
+ Hadoop offers many, many options for configuring a particular Hadoop
+ job as well as the Hadoop cluster itself. Wukong-Hadoop wraps many of
+ these familiar options (<tt>mapred.map.tasks</tt>,
+ <tt>mapred.reduce.tasks</tt>, <tt>mapred.task.timeout</tt>, &c.) with
+ friendlier names (<tt>map_tasks</tt>, <tt>reduce_tasks</tt>,
+ <tt>timeout</tt>, &c.). See a complete list using <tt>wu-hadoop
+ --help</tt>.
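+
+ For example (a hypothetical invocation using the friendlier names;
+ <tt>mapred.task.timeout</tt> is measured in milliseconds):
+
+ ```
+ $ wu-hadoop examples/word_count.rb --mode=hadoop --input=/data/sonnet_18.txt --output=/data/word_count.tsv --reduce_tasks=10 --timeout=600000
+ ```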
+
+ Java options themselves can be set directly using the
+ <tt>--java_opts</tt> flag. You can also use the <tt>--dry_run</tt>
+ option again to see the constructed Hadoop invocation without running
+ it:
+
+ ```
+ $ wu-hadoop examples/word_count.rb --mode=hadoop --input=/data/sonnet_18.txt --output=/data/word_count.tsv --java_opts='-D foo.bar=3 -D something.else=hello' --dry_run
+ I, [2012-11-27T19:47:08.872784 #20512] INFO -- : Launching Hadoop!
+ I, [2012-11-27T19:47:08.873630 #20512] INFO -- : Dry run:
+ /usr/lib/hadoop/bin/hadoop \
+ jar /usr/lib/hadoop/contrib/streaming/hadoop-*streaming*.jar \
+ -D mapred.job.name='word_count---/data/sonnet_18.txt---/data/word_count.tsv' \
+ -D foo.bar=3 \
+ -D something.else=hello \
+ -mapper 'wu-local /home/user/wukong-hadoop/examples/word_count.rb --run=mapper' \
+ -reducer 'wu-local /home/user/wukong-hadoop/examples/word_count.rb --run=reducer' \
+ -input '/data/sonnet_18.txt' \
+ -output '/data/word_count.tsv'
+ ```
+
+ #### Accessing Hadoop Runtime Data
+
+ Hadoop streaming exposes several environment variables to the scripts
+ it executes, including the mapper and reducer scripts launched by
+ <tt>wu-hadoop</tt>. Instead of manually inspecting the <tt>ENV</tt>
+ within your Wukong processors, you can use the following methods,
+ defined for commonly accessed parameters:
+
+ * <tt>input_file</tt>: Path of the (data) file currently being processed.
+ * <tt>input_dir</tt>: Directory of the (data) file currently being processed.
+ * <tt>map_input_start_offset</tt>: Offset of the chunk currently being processed within the current input file.
+ * <tt>map_input_length</tt>: Length of the chunk currently being processed within the current input file.
+ * <tt>attempt_id</tt>: ID of the current map/reduce attempt.
+ * <tt>curr_task_id</tt>: ID of the current map/reduce task.
+
+ or use the <tt>hadoop_streaming_parameter</tt> method for the others.
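+
+ For instance (a sketch building on the earlier word-count mapper), you
+ could tag each emitted token with the file it came from:
+
+ ```
+ Wukong.processor(:mapper) do
+   def process(line)
+     # input_file comes from Hadoop streaming's environment; in local
+     # mode it may be unavailable, hence the fallback.
+     source = input_file || 'unknown'
+     line.split.each { |word| yield [source, word].join("\t") }
+   end
+ end
+ ```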
data/Rakefile ADDED
@@ -0,0 +1,13 @@
+ require 'bundler'
+ Bundler::GemHelper.install_tasks
+
+ require 'rspec/core/rake_task'
+ RSpec::Core::RakeTask.new(:specs)
+
+ require 'yard'
+ YARD::Rake::YardocTask.new
+
+ require 'cucumber/rake/task'
+ Cucumber::Rake::Task.new(:features)
+
+ task :default => [:specs]
data/bin/hdp-bin ADDED
@@ -0,0 +1,44 @@
+ #!/usr/bin/env ruby
+
+ require 'rubygems'
+ require 'wukong'
+ require 'wukong/streamer/count_keys'
+
+ #
+ # Run locally for testing:
+ #
+ #   hdp-cat /hdfs/sometable.tsv | head -n100 | ./hdp-bin --column=4 --bin_width=0.1 --map | sort | ./hdp-bin --reduce
+ #
+ # Run on a giant dataset:
+ #
+ #   hdp-bin --run --column=4 --bin_width=0.1 /hdfs/sometable.tsv /hdfs/sometable_col4_binned
+ #
+
+ Settings.define :column,    :default => 1,   :type => Integer, :description => "The column to bin"
+ Settings.define :bin_width, :default => 0.5, :type => Float,   :description => "What should the bin width be?"
+
+ module HadoopBinning
+
+   class Mapper < Wukong::Streamer::RecordStreamer
+
+     def initialize *args
+       super(*args)
+       @bin_width = options.bin_width
+       @column    = options.column
+     end
+
+     # Emit the binned value of the chosen column; the record's fields
+     # arrive as the splat args.
+     def process *args
+       yield bin_field(args[@column])
+     end
+
+     # Round the field to the nearest multiple of the bin width.
+     def bin_field field
+       (field.to_f / @bin_width).round * @bin_width
+     end
+
+   end
+
+   # Counting identical binned values yields the histogram.
+   class Reducer < Wukong::Streamer::CountKeys; end
+
+ end
+
+ Wukong::Script.new(HadoopBinning::Mapper, HadoopBinning::Reducer).run