wukong-hadoop 0.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (59)
  1. data/.gitignore +59 -0
  2. data/.rspec +2 -0
  3. data/Gemfile +3 -0
  4. data/README.md +339 -0
  5. data/Rakefile +13 -0
  6. data/bin/hdp-bin +44 -0
  7. data/bin/hdp-bzip +23 -0
  8. data/bin/hdp-cat +3 -0
  9. data/bin/hdp-catd +3 -0
  10. data/bin/hdp-cp +3 -0
  11. data/bin/hdp-du +86 -0
  12. data/bin/hdp-get +3 -0
  13. data/bin/hdp-kill +3 -0
  14. data/bin/hdp-kill-task +3 -0
  15. data/bin/hdp-ls +11 -0
  16. data/bin/hdp-mkdir +2 -0
  17. data/bin/hdp-mkdirp +12 -0
  18. data/bin/hdp-mv +3 -0
  19. data/bin/hdp-parts_to_keys.rb +77 -0
  20. data/bin/hdp-ps +3 -0
  21. data/bin/hdp-put +3 -0
  22. data/bin/hdp-rm +32 -0
  23. data/bin/hdp-sort +40 -0
  24. data/bin/hdp-stream +40 -0
  25. data/bin/hdp-stream-flat +22 -0
  26. data/bin/hdp-stream2 +39 -0
  27. data/bin/hdp-sync +17 -0
  28. data/bin/hdp-wc +67 -0
  29. data/bin/wu-hadoop +14 -0
  30. data/examples/counter.rb +17 -0
  31. data/examples/map_only.rb +28 -0
  32. data/examples/processors.rb +4 -0
  33. data/examples/sonnet_18.txt +14 -0
  34. data/examples/tokenizer.rb +28 -0
  35. data/examples/word_count.rb +44 -0
  36. data/features/step_definitions/wu_hadoop_steps.rb +4 -0
  37. data/features/support/env.rb +1 -0
  38. data/features/wu_hadoop.feature +113 -0
  39. data/lib/wukong-hadoop.rb +21 -0
  40. data/lib/wukong-hadoop/configuration.rb +133 -0
  41. data/lib/wukong-hadoop/driver.rb +190 -0
  42. data/lib/wukong-hadoop/driver/hadoop_invocation.rb +184 -0
  43. data/lib/wukong-hadoop/driver/inputs_and_outputs.rb +27 -0
  44. data/lib/wukong-hadoop/driver/local_invocation.rb +48 -0
  45. data/lib/wukong-hadoop/driver/map_logic.rb +104 -0
  46. data/lib/wukong-hadoop/driver/reduce_logic.rb +129 -0
  47. data/lib/wukong-hadoop/extensions.rb +2 -0
  48. data/lib/wukong-hadoop/hadoop_env_methods.rb +80 -0
  49. data/lib/wukong-hadoop/version.rb +6 -0
  50. data/spec/spec_helper.rb +21 -0
  51. data/spec/support/driver_helper.rb +15 -0
  52. data/spec/support/integration_helper.rb +39 -0
  53. data/spec/wukong-hadoop/driver_spec.rb +117 -0
  54. data/spec/wukong-hadoop/hadoop_env_methods_spec.rb +14 -0
  55. data/spec/wukong-hadoop/hadoop_mode_spec.rb +78 -0
  56. data/spec/wukong-hadoop/local_mode_spec.rb +22 -0
  57. data/spec/wukong-hadoop/wu_hadoop_spec.rb +34 -0
  58. data/wukong-hadoop.gemspec +33 -0
  59. metadata +168 -0
data/.gitignore ADDED
@@ -0,0 +1,59 @@
+ ## OS
+ .DS_Store
+ Icon
+ nohup.out
+ .bak
+
+ *.pem
+
+ ## EDITORS
+ \#*
+ .\#*
+ \#*\#
+ *~
+ *.swp
+ REVISION
+ TAGS*
+ tmtags
+ *_flymake.*
+ *_flymake
+ *.tmproj
+ .project
+ .settings
+
+ ## COMPILED
+ a.out
+ *.o
+ *.pyc
+ *.so
+
+ ## OTHER SCM
+ .bzr
+ .hg
+ .svn
+
+ ## PROJECT::GENERAL
+
+ log/*
+ tmp/*
+ pkg/*
+
+ coverage
+ rdoc
+ doc
+ pkg
+ .rake_test_cache
+ .bundle
+ .yardoc
+
+ .vendor
+
+ ## PROJECT::SPECIFIC
+
+ old/*
+ docpages
+ away
+
+ .rbx
+ Gemfile.lock
+ Backup*of*.numbers
data/.rspec ADDED
@@ -0,0 +1,2 @@
+ --format=progress
+ --color
data/Gemfile ADDED
@@ -0,0 +1,3 @@
+ source :rubygems
+
+ gemspec
data/README.md ADDED
@@ -0,0 +1,339 @@
+ # Wukong-Hadoop
+
+ The Hadoop plugin for Wukong lets you run
+ <a href="http://github.com/infochimps-labs/wukong">Wukong processors</a>
+ through <a href="http://hadoop.apache.org/">Hadoop's</a> command-line
+ <a href="http://hadoop.apache.org/docs/r0.15.2/streaming.html">streaming interface</a>.
+
+ Before you use Wukong-Hadoop to develop, test, and run your Hadoop
+ jobs, you might want to read about <a href="http://github.com/infochimps-labs/wukong">Wukong</a>, write some
+ <a href="http://github.com/infochimps-labs/wukong#processors">simple processors</a>, and read about the structure of a <a href="http://en.wikipedia.org/wiki/MapReduce">map/reduce job</a>.
+
+ You might also want to check out some other projects which enrich the
+ Wukong and Hadoop experience:
+
+ * <a href="http://github.com/infochimps-labs/wonderdog">wonderdog</a>: Connect Wukong processors running within Hadoop to Elasticsearch as either a source or a sink for data.
+ * <a href="http://github.com/infochimps-labs/wukong-deploy">wukong-deploy</a>: Orchestrate Wukong and other wu-tools together to support an application running on the Infochimps Platform.
+
+ <a name="installation"></a>
+ ## Installation & Setup
+
+ Wukong-Hadoop can be installed as a RubyGem:
+
+ ```
+ $ sudo gem install wukong-hadoop
+ ```
+
+ If you actually want to run your map/reduce jobs on a Hadoop cluster,
+ you'll of course need one handy.
+ <a href="http://github.com/infochimps-labs/ironfan">Ironfan</a> is a
+ great tool for building and managing Hadoop clusters and other
+ distributed infrastructure quickly and easily.
+
+ To run Hadoop jobs through Wukong-Hadoop, you'll need to move your
+ Wukong code to each member of the Hadoop cluster, install
+ Wukong-Hadoop on each, and log in and launch your job from one of
+ them. Ironfan again helps with configuring this.
+
+ <a name="anatomy"></a>
+ ## Anatomy of a map/reduce job
+
+ A map/reduce job consists of two separate phases, the **map** phase
+ and the **reduce** phase, which are connected by an intermediary
+ **sort** phase. Locally, this takes the literal shape
+ <tt>cat input | mapper | sort | reducer</tt>, as the
+ <tt>--dry_run</tt> example below shows.
+
+ The <tt>wu-hadoop</tt> command-line tool is used to run Wukong
+ processors in the shape of a map/reduce job, whether locally or on a
+ Hadoop cluster.
+
+ The examples used in this README are all taken from the
+ <tt>/examples</tt> directory within the Wukong-Hadoop source code.
+ They implement the usual "word count" example.
+
+ <a name="local"></a>
+ ## Test and Develop Map/Reduce Jobs Locally
+
+ Hadoop is a powerful tool designed to process huge amounts of data
+ very quickly. It's not designed to make developing Hadoop jobs
+ iterative and simple. Wukong-Hadoop lets you define a map/reduce job
+ and execute it locally, on small amounts of sample data, then launch
+ that job into a Hadoop cluster once you're sure it works.
+
+ <a name="processors_to_mappers_and_reducers"></a>
+ ### From Processors to Mappers & Reducers
+
+ Wukong processors can be used either for the map phase or the reduce
+ phase of a map/reduce job. Different processors can be defined in
+ different <tt>.rb</tt> files or within the same one.
+
+ Map-phase processors filter, transform, or otherwise modify input
+ records, getting them ready for the reduce phase. Reduce-phase
+ processors typically perform aggregative operations like counting,
+ grouping, averaging, &c.
+
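+ As a rough sketch (hypothetical code using Wukong's processor DSL,
+ not the shipped examples verbatim), a word-count mapper/reducer pair
+ might look like:
+
+ ```ruby
+ require 'wukong'
+
+ Wukong.processor(:mapper) do
+   # Map phase: emit each whitespace-separated token on its own line.
+   def process(line)
+     line.split(/\s+/).each { |word| yield word }
+   end
+ end
+
+ Wukong.processor(:reducer) do
+   # Reduce phase: count runs of identical words, relying on the
+   # intermediary sort phase to group identical words together.
+   def process(word)
+     if word == @current
+       @count += 1
+     else
+       yield [@current, @count].join("\t") if @current
+       @current, @count = word, 1
+     end
+   end
+
+   # Emit the final pending count once input is exhausted.
+   def finalize
+     yield [@current, @count].join("\t") if @current
+   end
+ end
+ ```
+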
+ Given that you've already created a map/reduce job (just like this
+ word count example that comes with Wukong-Hadoop), the first thing to
+ try is to run the job locally on sample input data in flat files. The
+ <tt>--mode=local</tt> flag tells <tt>wu-hadoop</tt> to run in local
+ mode, suitable for development and testing of jobs:
+
+ ```
+ $ wu-hadoop examples/word_count.rb --mode=local --input=examples/sonnet_18.txt
+ a 2
+ all 1
+ and 2
+ And 3
+ art 1
+ ...
+ ```
+
+ Wukong-Hadoop looks for processors named <tt>:mapper</tt> and
+ <tt>:reducer</tt> in the <tt>word_count.rb</tt> file. To understand
+ what's going on under the hood, pass the <tt>--dry_run</tt> option:
+
+ ```
+ $ wu-hadoop examples/word_count.rb --mode=local --input=examples/sonnet_18.txt --dry_run
+ I, [2012-11-27T19:24:21.238429 #20104] INFO -- : Dry run:
+ cat examples/sonnet_18.txt | wu-local /home/user/wukong-hadoop/examples/word_count.rb --run=mapper | sort | wu-local /home/user/wukong-hadoop/examples/word_count.rb --run=reducer
+ ```
+
+ This shows that <tt>wu-hadoop</tt> is ultimately relying on
+ <tt>wu-local</tt> to do the heavy lifting. You can copy, paste, and
+ run this longer command (or a portion of it) when debugging.
+
+ You can also pass options to your processors:
+
+ ```
+ $ wu-hadoop examples/word_count.rb --mode=local --input=examples/sonnet_18.txt --fold_case --min_length=3
+ all 1
+ and 5
+ art 1
+ brag 1
+ ...
+ ```
+
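+ Options like <tt>--fold_case</tt> and <tt>--min_length</tt> above are
+ defined by the processors themselves. As a rough sketch (hypothetical
+ field names, assuming Wukong's Gorillib-style field declarations), a
+ tokenizer might accept them like so:
+
+ ```ruby
+ Wukong.processor(:tokenizer) do
+   # Each declared field becomes a command-line option of the same name.
+   field :fold_case,  :boolean, :default => false
+   field :min_length, Integer,  :default => 1
+
+   def process(line)
+     line.split(/\s+/).each do |word|
+       word = word.downcase if fold_case
+       yield word if word.length >= min_length
+     end
+   end
+ end
+ ```
+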
+ Sometimes you may want to use a given processor in multiple jobs, so
+ you can define each processor in a separate file if you like. If
+ Wukong-Hadoop doesn't find processors named <tt>:mapper</tt> and
+ <tt>:reducer</tt>, it will try to use processors named after the
+ files you pass it:
+
+ ```
+ $ wu-hadoop examples/tokenizer.rb examples/counter.rb --mode=local --input=examples/sonnet_18.txt
+ a 2
+ all 1
+ and 2
+ And 3
+ art 1
+ ...
+ ```
+
+ You can also just specify the processors you want to run using the
+ <tt>--mapper</tt> and <tt>--reducer</tt> options:
+
+ ```
+ $ wu-hadoop examples/processors.rb --mode=local --input=examples/sonnet_18.txt --mapper=tokenizer --reducer=counter
+ a 2
+ all 1
+ and 2
+ And 3
+ art 1
+ ...
+ ```
+
+ <a name="map_only"></a>
+ ### Map-Only Jobs
+
+ If Wukong-Hadoop can't find a processor named <tt>:reducer</tt> (and
+ you didn't give it two files explicitly) then it will run a map-only
+ job:
+
+ ```
+ $ wu-hadoop examples/tokenizer.rb --mode=local --input=examples/sonnet_18.txt
+ Shall
+ I
+ compare
+ thee
+ ...
+ ```
+
+ You can force this behavior using the <tt>--reduce_tasks</tt>
+ option:
+
+ ```
+ $ wu-hadoop examples/word_count.rb --mode=local --input=examples/sonnet_18.txt --reduce_tasks=0
+ Shall
+ I
+ compare
+ thee
+ ...
+ ```
+
+ <a name="sort_options"></a>
+ ### Sort Options
+
+ For some kinds of jobs, you may have special requirements about how
+ you sort. You can specify an explicit <tt>--sort_command</tt> option:
+
+ ```
+ $ wu-hadoop examples/word_count.rb --mode=local --input=examples/sonnet_18.txt --sort_command='sort -r'
+ winds 1
+ When 1
+ wander'st 1
+ untrimm'd 1
+ ...
+ ```
+
+ <a name="non_wukong"></a>
+ ### Something Other than Wukong/Ruby?
+
+ Wukong-Hadoop even lets you use mappers and reducers which aren't
+ themselves Wukong processors or even Ruby code. Here the
+ <tt>:counter</tt> processor is replaced by good old <tt>uniq</tt>:
+
+ ```
+ $ wu-hadoop examples/processors.rb --mode=local --input=examples/sonnet_18.txt --mapper=tokenizer --reduce_command='uniq -c'
+ 2 a
+ 1 all
+ 2 and
+ 3 And
+ 1 art
+ ...
+ ```
+
+ This is a good method for getting a little performance bump (if your
+ job is CPU-bound) or even for lifting other non-Hadoop- or
+ non-Wukong-aware code into the Hadoop world:
+
+ ```
+ $ wu-hadoop --mode=local --input=examples/sonnet_18.txt --map_command='python tokenizer.py' --reduce_command='python counter.py'
+ a 2
+ all 1
+ and 2
+ And 3
+ art 1
+ ...
+ ```
+
+ The only requirement on <tt>tokenizer.py</tt> and <tt>counter.py</tt>
+ is that they work the same way as their Ruby
+ <tt>Wukong::Processor</tt> equivalents: one line at a time from STDIN
+ to STDOUT.
+
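+ To make that contract concrete, here's a hypothetical stand-in for
+ <tt>counter.py</tt>, written as a plain Ruby script with no Wukong at
+ all:
+
+ ```ruby
+ #!/usr/bin/env ruby
+ # Reads sorted words from STDIN, one per line, and writes
+ # "word<TAB>count" to STDOUT -- the same contract as a reducer.
+ current, count = nil, 0
+ ARGF.each_line do |line|
+   word = line.chomp
+   if word == current
+     count += 1
+   else
+     puts "#{current}\t#{count}" if current
+     current, count = word, 1
+   end
+ end
+ puts "#{current}\t#{count}" if current
+ ```
+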
+ <a name="hadoop"></a>
+ ## Running in Hadoop
+
+ Once you've got your code working locally, you can easily make it run
+ inside of Hadoop by just changing the <tt>--mode</tt> option. You'll
+ also need to specify <tt>--input</tt> and <tt>--output</tt> paths that
+ Hadoop can access, either on the <a href="http://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_Distributed_File_System">HDFS</a>
+ or on something like Amazon's <a href="http://aws.amazon.com/s3/">S3</a>
+ if you're using AWS and have properly configured your Hadoop cluster.
+
+ Here's the very first example from the <a href="#local">Local</a>
+ section above, but executed within a Hadoop cluster, reading and
+ writing data on the HDFS:
+
+ ```
+ $ wu-hadoop examples/word_count.rb --mode=hadoop --input=/data/sonnet_18.txt --output=/data/word_count.tsv
+ I, [2012-11-27T19:27:18.872645 #20142] INFO -- : Launching Hadoop!
+ I, [2012-11-27T19:27:18.873477 #20142] INFO -- : Running
+
+ /usr/lib/hadoop/bin/hadoop \
+ jar /usr/lib/hadoop/contrib/streaming/hadoop-*streaming*.jar \
+ -D mapred.job.name='word_count---/data/sonnet_18.txt---/data/word_count.tsv' \
+ -mapper 'wu-local /home/user/wukong-hadoop/examples/word_count.rb --run=mapper' \
+ -reducer 'wu-local /home/user/wukong-hadoop/examples/word_count.rb --run=reducer' \
+ -input '/data/sonnet_18.txt' \
+ -output '/data/word_count.tsv' \
+ 12/11/28 01:32:09 INFO mapred.FileInputFormat: Total input paths to process : 1
+ 12/11/28 01:32:10 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local, /mnt2/hadoop/mapred/local]
+ 12/11/28 01:32:10 INFO streaming.StreamJob: Running job: job_201210241848_0043
+ 12/11/28 01:32:10 INFO streaming.StreamJob: To kill this job, run:
+ 12/11/28 01:32:10 INFO streaming.StreamJob: /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=10.124.54.254:8021 -kill job_201210241848_0043
+ 12/11/28 01:32:10 INFO streaming.StreamJob: Tracking URL: http://ip-10-124-54-254.ec2.internal:50030/jobdetails.jsp?jobid=job_201210241848_0043
+ 12/11/28 01:32:11 INFO streaming.StreamJob: map 0% reduce 0%
+ ...
+ ```
+
+ Hadoop throws an error if your output path already exists. If you're
+ running the same job over and over, it can be annoying to have to
+ remember to delete the output path from your last run. In this case,
+ use the <tt>--rm</tt> option to automatically remove the output path
+ before launching the Hadoop job (this only works in Hadoop mode).
+
+ ### Advanced Hadoop Usage
+
+ For small or lightweight jobs, all you have to do to move from local
+ mode to Hadoop is change the <tt>--mode</tt> flag when executing your
+ jobs with <tt>wu-hadoop</tt>.
+
+ More complicated jobs may require special code to be available (new
+ input/output formats, <tt>CLASSPATH</tt> or <tt>RUBYLIB</tt> hacking,
+ &c.) or tuning at the level of Hadoop itself to run efficiently.
+
+ #### Other Input/Output Formats
+
+ Hadoop streaming uses the <a href="http://hadoop.apache.org/docs/r0.20.1/api/org/apache/hadoop/mapred/TextInputFormat.html">TextInputFormat</a>
+ and <a href="http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/output/TextOutputFormat.html">TextOutputFormat</a>
+ by default. These turn all input/output data into newline-delimited
+ string records, which makes a perfect match for the command line and
+ for Wukong-Hadoop's local mode.
+
+ Other input and output formats can be specified with the
+ <tt>--input_format</tt> and <tt>--output_format</tt> options.
+
+ #### Tuning
+
+ Hadoop offers many, many options for configuring a particular Hadoop
+ job as well as the Hadoop cluster itself. Wukong-Hadoop wraps many of
+ these familiar options (<tt>mapred.map.tasks</tt>,
+ <tt>mapred.reduce.tasks</tt>, <tt>mapred.task.timeout</tt>, &c.) with
+ friendlier names (<tt>map_tasks</tt>, <tt>reduce_tasks</tt>,
+ <tt>timeout</tt>, &c.). See the complete list with <tt>wu-hadoop
+ --help</tt>.
+
+ Java options themselves can be set directly using the
+ <tt>--java_opts</tt> flag. You can also use the <tt>--dry_run</tt>
+ option again to see the constructed Hadoop invocation without running
+ it:
+
+ ```
+ $ wu-hadoop examples/word_count.rb --mode=hadoop --input=/data/sonnet_18.txt --output=/data/word_count.tsv --java_opts='-D foo.bar=3 -D something.else=hello' --dry_run
+ I, [2012-11-27T19:47:08.872784 #20512] INFO -- : Launching Hadoop!
+ I, [2012-11-27T19:47:08.873630 #20512] INFO -- : Dry run:
+ /usr/lib/hadoop/bin/hadoop \
+ jar /usr/lib/hadoop/contrib/streaming/hadoop-*streaming*.jar \
+ -D mapred.job.name='word_count---/data/sonnet_18.txt---/data/word_count.tsv' \
+ -D foo.bar=3 \
+ -D something.else=hello \
+ -mapper 'wu-local /home/user/wukong-hadoop/examples/word_count.rb --run=mapper' \
+ -reducer 'wu-local /home/user/wukong-hadoop/examples/word_count.rb --run=reducer' \
+ -input '/data/sonnet_18.txt' \
+ -output '/data/word_count.tsv' \
+ ```
+
+ #### Accessing Hadoop Runtime Data
+
+ Hadoop streaming exposes several environment variables to the scripts
+ it executes, including the mapper and reducer scripts launched by
+ <tt>wu-hadoop</tt>. Instead of manually inspecting the <tt>ENV</tt>
+ within your Wukong processors, you can use the following methods,
+ defined for commonly accessed parameters:
+
+ * <tt>input_file</tt>: Path of the (data) file currently being processed.
+ * <tt>input_dir</tt>: Directory of the (data) file currently being processed.
+ * <tt>map_input_start_offset</tt>: Offset of the chunk currently being processed within the current input file.
+ * <tt>map_input_length</tt>: Length of the chunk currently being processed within the current input file.
+ * <tt>attempt_id</tt>: ID of the current map/reduce attempt.
+ * <tt>curr_task_id</tt>: ID of the current map/reduce task.
+
+ or use the <tt>hadoop_streaming_parameter</tt> method for the others.
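+
+ For example (a hypothetical sketch, assuming these helpers are
+ available as methods inside your processor as listed above), a mapper
+ might tag each record with the file it came from:
+
+ ```ruby
+ Wukong.processor(:mapper) do
+   # input_file is only meaningful when running under Hadoop streaming,
+   # which sets the underlying map_input_file environment variable.
+   def process(line)
+     yield [input_file, line].join("\t")
+   end
+ end
+ ```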
data/Rakefile ADDED
@@ -0,0 +1,13 @@
+ require 'bundler'
+ Bundler::GemHelper.install_tasks
+
+ require 'rspec/core/rake_task'
+ RSpec::Core::RakeTask.new(:specs)
+
+ require 'yard'
+ YARD::Rake::YardocTask.new
+
+ require 'cucumber/rake/task'
+ Cucumber::Rake::Task.new(:features)
+
+ task :default => [:specs]
data/bin/hdp-bin ADDED
@@ -0,0 +1,44 @@
+ #!/usr/bin/env ruby
+
+ require 'rubygems'
+ require 'wukong'
+ require 'wukong/streamer/count_keys'
+
+ #
+ # Run locally for testing:
+ #
+ #   hdp-cat /hdfs/sometable.tsv | head -n100 | ./hdp-bin --column=4 --bin_width=0.1 --map | sort | ./hdp-bin --reduce
+ #
+ # Run on a giant dataset:
+ #
+ #   hdp-bin --run --column=4 --bin_width=0.1 /hdfs/sometable.tsv /hdfs/sometable_col4_binned
+ #
+
+ Settings.define :column,    :default => 1,   :type => Integer, :description => "The column to bin"
+ Settings.define :bin_width, :default => 0.5, :type => Float,   :description => "What should the bin width be?"
+
+ module HadoopBinning
+
+   class Mapper < Wukong::Streamer::RecordStreamer
+
+     def initialize *args
+       super(*args)
+       @bin_width = options.bin_width
+       @column    = options.column
+     end
+
+     # Emit the binned value of the configured column for each record.
+     def process *args
+       yield bin_field(args[@column])
+     end
+
+     # Round a value to the nearest multiple of the bin width,
+     # e.g. with a bin width of 0.5, 0.37 falls into bin 0.5.
+     def bin_field field
+       (field.to_f/@bin_width).round*@bin_width
+     end
+
+   end
+
+   # Counts how many records fall into each bin.
+   class Reducer < Wukong::Streamer::CountKeys; end
+
+ end
+
+ Wukong::Script.new(HadoopBinning::Mapper, HadoopBinning::Reducer).run