wukong-hadoop 0.0.1
- data/.gitignore +59 -0
- data/.rspec +2 -0
- data/Gemfile +3 -0
- data/README.md +339 -0
- data/Rakefile +13 -0
- data/bin/hdp-bin +44 -0
- data/bin/hdp-bzip +23 -0
- data/bin/hdp-cat +3 -0
- data/bin/hdp-catd +3 -0
- data/bin/hdp-cp +3 -0
- data/bin/hdp-du +86 -0
- data/bin/hdp-get +3 -0
- data/bin/hdp-kill +3 -0
- data/bin/hdp-kill-task +3 -0
- data/bin/hdp-ls +11 -0
- data/bin/hdp-mkdir +2 -0
- data/bin/hdp-mkdirp +12 -0
- data/bin/hdp-mv +3 -0
- data/bin/hdp-parts_to_keys.rb +77 -0
- data/bin/hdp-ps +3 -0
- data/bin/hdp-put +3 -0
- data/bin/hdp-rm +32 -0
- data/bin/hdp-sort +40 -0
- data/bin/hdp-stream +40 -0
- data/bin/hdp-stream-flat +22 -0
- data/bin/hdp-stream2 +39 -0
- data/bin/hdp-sync +17 -0
- data/bin/hdp-wc +67 -0
- data/bin/wu-hadoop +14 -0
- data/examples/counter.rb +17 -0
- data/examples/map_only.rb +28 -0
- data/examples/processors.rb +4 -0
- data/examples/sonnet_18.txt +14 -0
- data/examples/tokenizer.rb +28 -0
- data/examples/word_count.rb +44 -0
- data/features/step_definitions/wu_hadoop_steps.rb +4 -0
- data/features/support/env.rb +1 -0
- data/features/wu_hadoop.feature +113 -0
- data/lib/wukong-hadoop.rb +21 -0
- data/lib/wukong-hadoop/configuration.rb +133 -0
- data/lib/wukong-hadoop/driver.rb +190 -0
- data/lib/wukong-hadoop/driver/hadoop_invocation.rb +184 -0
- data/lib/wukong-hadoop/driver/inputs_and_outputs.rb +27 -0
- data/lib/wukong-hadoop/driver/local_invocation.rb +48 -0
- data/lib/wukong-hadoop/driver/map_logic.rb +104 -0
- data/lib/wukong-hadoop/driver/reduce_logic.rb +129 -0
- data/lib/wukong-hadoop/extensions.rb +2 -0
- data/lib/wukong-hadoop/hadoop_env_methods.rb +80 -0
- data/lib/wukong-hadoop/version.rb +6 -0
- data/spec/spec_helper.rb +21 -0
- data/spec/support/driver_helper.rb +15 -0
- data/spec/support/integration_helper.rb +39 -0
- data/spec/wukong-hadoop/driver_spec.rb +117 -0
- data/spec/wukong-hadoop/hadoop_env_methods_spec.rb +14 -0
- data/spec/wukong-hadoop/hadoop_mode_spec.rb +78 -0
- data/spec/wukong-hadoop/local_mode_spec.rb +22 -0
- data/spec/wukong-hadoop/wu_hadoop_spec.rb +34 -0
- data/wukong-hadoop.gemspec +33 -0
- metadata +168 -0
data/.gitignore
ADDED
@@ -0,0 +1,59 @@
## OS
.DS_Store
Icon
nohup.out
.bak

*.pem

## EDITORS
\#*
.\#*
\#*\#
*~
*.swp
REVISION
TAGS*
tmtags
*_flymake.*
*_flymake
*.tmproj
.project
.settings

## COMPILED
a.out
*.o
*.pyc
*.so

## OTHER SCM
.bzr
.hg
.svn

## PROJECT::GENERAL

log/*
tmp/*
pkg/*

coverage
rdoc
doc
pkg
.rake_test_cache
.bundle
.yardoc

.vendor

## PROJECT::SPECIFIC

old/*
docpages
away

.rbx
Gemfile.lock
Backup*of*.numbers
data/.rspec
ADDED
data/Gemfile
ADDED
data/README.md
ADDED
@@ -0,0 +1,339 @@
# Wukong-Hadoop

The Hadoop plugin for Wukong lets you run <a href="http://github.com/infochimps-labs/wukong">Wukong processors</a> through <a href="http://hadoop.apache.org/">Hadoop's</a> command-line <a href="http://hadoop.apache.org/docs/r0.15.2/streaming.html">streaming interface</a>.

Before you use Wukong-Hadoop to develop, test, and write your Hadoop jobs, you might want to read about <a href="http://github.com/infochimps-labs/wukong">Wukong</a>, write some <a href="http://github.com/infochimps-labs/wukong#processors">simple processors</a>, and read about the structure of a <a href="http://en.wikipedia.org/wiki/MapReduce">map/reduce job</a>.

You might also want to check out some other projects which enrich the Wukong and Hadoop experience:

* <a href="http://github.com/infochimps-labs/wonderdog">wonderdog</a>: Connect Wukong processors running within Hadoop to Elasticsearch as either a source or sink for data.
* <a href="http://github.com/infochimps-labs/wukong-deploy">wukong-deploy</a>: Orchestrate Wukong and other wu-tools together to support an application running on the Infochimps Platform.

<a name="installation"></a>
## Installation & Setup

Wukong-Hadoop can be installed as a RubyGem:

```
$ sudo gem install wukong-hadoop
```

If you actually want to run your map/reduce jobs on a Hadoop cluster, you'll of course need one handy. <a href="http://github.com/infochimps-labs/ironfan">Ironfan</a> is a great tool for building and managing Hadoop clusters and other distributed infrastructure quickly and easily.

To run Hadoop jobs through Wukong-Hadoop, you'll need to move your Wukong code to each member of the Hadoop cluster, install Wukong-Hadoop on each, and log in and launch your job from one of them. Ironfan again helps with configuring this.

<a name="anatomy"></a>
## Anatomy of a map/reduce job

A map/reduce job consists of two separate phases, the **map** phase and the **reduce** phase, connected by an intermediary **sort** phase.

The <tt>wu-hadoop</tt> command-line tool is used to run Wukong processors in the shape of a map/reduce job, whether locally or on a Hadoop cluster.

The examples used in this README are all taken from the <tt>/examples</tt> directory within the Wukong-Hadoop source code. They implement the usual "word count" example.

<a name="local"></a>
## Test and Develop Map/Reduce Jobs Locally

Hadoop is a powerful tool designed to process huge amounts of data very quickly. It's not designed to make developing Hadoop jobs iterative and simple. Wukong-Hadoop lets you define a map/reduce job and execute it locally, on small amounts of sample data, then launch that job into a Hadoop cluster once you're sure it works.

<a name="processors_to_mappers_and_reducers"></a>
### From Processors to Mappers & Reducers

Wukong processors can be used either for the map phase or the reduce phase of a map/reduce job. Different processors can be defined in different <tt>.rb</tt> files or within the same one.

Map-phase processors filter, transform, or otherwise modify input records, getting them ready for the reduce. Reduce-phase processors typically perform aggregative operations like counting, grouping, averaging, &c.

Given that you've already created a map/reduce job (just like the word count example that comes with Wukong-Hadoop), the first thing to try is to run the job locally on sample input data in flat files. The <tt>--mode=local</tt> flag tells <tt>wu-hadoop</tt> to run in local mode, suitable for development and testing of jobs:

```
$ wu-hadoop examples/word_count.rb --mode=local --input=examples/sonnet_18.txt
a 2
all 1
and 2
And 3
art 1
...
```

Wukong-Hadoop looks for processors named <tt>:mapper</tt> and <tt>:reducer</tt> in the <tt>word_count.rb</tt> file. To understand what's going on under the hood, pass the <tt>--dry_run</tt> option:

```
$ wu-hadoop examples/word_count.rb --mode=local --input=examples/sonnet_18.txt --dry_run
I, [2012-11-27T19:24:21.238429 #20104]  INFO -- : Dry run:
cat examples/sonnet_18.txt | wu-local /home/user/wukong-hadoop/examples/word_count.rb --run=mapper | sort | wu-local /home/user/wukong-hadoop/examples/word_count.rb --run=reducer
```

which shows that <tt>wu-hadoop</tt> is ultimately relying on <tt>wu-local</tt> to do the heavy lifting. You can copy, paste, and run this longer command (or a portion of it) when debugging.
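
The dry-run output makes the local contract explicit: lines flow through the mapper, through <tt>sort</tt>, then through the reducer. As a rough sketch of that contract in plain Ruby (not the Wukong DSL; the method names here are illustrative):

```ruby
# A dependency-free sketch of what the dry-run pipeline above does:
# the mapper turns lines into tokens, `sort` brings identical tokens
# together, and the reducer counts each group.

# Mapper: one input line in, one token out per word.
def map_lines(lines)
  lines.flat_map { |line| line.split(/\W+/).reject(&:empty?) }
end

# Reducer: consumes the *sorted* token stream; identical tokens are
# adjacent, so counting a group is just measuring each run of equal values.
def reduce_sorted(tokens)
  tokens.slice_when { |a, b| a != b }.map { |run| "#{run.first}\t#{run.size}" }
end

puts reduce_sorted(map_lines(["Shall I compare thee to a summer's day?"]).sort)
```

Because the sort phase groups identical tokens, the reducer never needs more than the adjacent lines in hand.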

You can also pass options to your processors:

```
$ wu-hadoop examples/word_count.rb --mode=local --input=examples/sonnet_18.txt --fold_case --min_length=3
all 1
and 5
art 1
brag 1
...
```

Sometimes you may want to use a given processor in multiple jobs. You can therefore define each processor in a separate file if you want. If Wukong-Hadoop doesn't find processors named <tt>:mapper</tt> and <tt>:reducer</tt>, it will try to use processors named after the files you pass it:

```
$ wu-hadoop examples/tokenizer.rb examples/counter.rb --mode=local --input=examples/sonnet_18.txt
a 2
all 1
and 2
And 3
art 1
...
```

You can also just specify the processors you want to run using the <tt>--mapper</tt> and <tt>--reducer</tt> options:

```
$ wu-hadoop examples/processors.rb --mode=local --input=examples/sonnet_18.txt --mapper=tokenizer --reducer=counter
a 2
all 1
and 2
And 3
art 1
...
```

<a name="map_only"></a>
### Map-Only Jobs

If Wukong-Hadoop can't find a processor named <tt>:reducer</tt> (and you didn't give it two files explicitly), then it will run a map-only job:

```
$ wu-hadoop examples/tokenizer.rb --mode=local --input=examples/sonnet_18.txt
Shall
I
compare
thee
...
```

You can force this behavior using the <tt>--reduce_tasks</tt> option:

```
$ wu-hadoop examples/word_count.rb --mode=local --input=examples/sonnet_18.txt --reduce_tasks=0
Shall
I
compare
thee
...
```

<a name="sort_options"></a>
### Sort Options

For some kinds of jobs, you may have special requirements about how you sort. You can specify an explicit <tt>--sort_command</tt> option:

```
$ wu-hadoop examples/word_count.rb --mode=local --input=examples/sonnet_18.txt --sort_command='sort -r'
winds 1
When 1
wander'st 1
untrimm'd 1
...
```

<a name="non_wukong"></a>
### Something Other than Wukong/Ruby?

Wukong-Hadoop even lets you use mappers and reducers which aren't themselves Wukong processors, or even Ruby code. The <tt>:counter</tt> processor is here replaced by good old <tt>uniq</tt>:

```
$ wu-hadoop examples/processors.rb --mode=local --input=examples/sonnet_18.txt --mapper=tokenizer --reduce_command='uniq -c'
2 a
1 all
2 and
3 And
1 art
...
```

This is a good way to get a little performance bump (if your job is CPU-bound) or even to lift other, non-Hadoop- and non-Wukong-aware code into the Hadoop world:

```
$ wu-hadoop --mode=local --input=examples/sonnet_18.txt --map_command='python tokenizer.py' --reduce_command='python counter.py'
a 2
all 1
and 2
And 3
art 1
...
```

The only requirement on <tt>tokenizer.py</tt> and <tt>counter.py</tt> is that they work the same way as their Ruby <tt>Wukong::Processor</tt> equivalents: one line at a time from STDIN to STDOUT.
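
In the same spirit, a hypothetical counter written as a plain Ruby script (standing in for a <tt>counter.py</tt>; the file name and structure are illustrative, not part of this gem) only needs to read sorted lines from STDIN and write counts to STDOUT:

```ruby
#!/usr/bin/env ruby
# counter.rb (illustrative) -- a reducer obeying the streaming contract:
# sorted words in on STDIN, "word<TAB>count" lines out on STDOUT.
# Because identical words arrive adjacent to one another in a sorted
# stream, one word of state is enough for a single streaming pass.

def count_sorted(words)
  results = []
  current, count = nil, 0
  words.each do |word|
    if word == current
      count += 1
    else
      # flush the finished group before starting the next one
      results << "#{current}\t#{count}" if current
      current, count = word, 1
    end
  end
  results << "#{current}\t#{count}" if current  # flush the last group
  results
end

puts count_sorted($stdin.each_line.map(&:chomp)) if $0 == __FILE__
```

Any program with this shape, in any language, can serve as a <tt>--reduce_command</tt>.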

<a name="hadoop"></a>
## Running in Hadoop

Once you've got your code working locally, you can easily make it run inside of Hadoop by just changing the <tt>--mode</tt> option. You'll also need to specify <tt>--input</tt> and <tt>--output</tt> paths that Hadoop can access, either on the <a href="http://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_Distributed_File_System">HDFS</a> or on something like Amazon's <a href="http://aws.amazon.com/s3/">S3</a> if you're using AWS and have properly configured your Hadoop cluster.

Here's the very first example from the <a href="#local">Local</a> section above, but executed within a Hadoop cluster, reading and writing data on the HDFS:

```
$ wu-hadoop examples/word_count.rb --mode=hadoop --input=/data/sonnet_18.txt --output=/data/word_count.tsv
I, [2012-11-27T19:27:18.872645 #20142]  INFO -- : Launching Hadoop!
I, [2012-11-27T19:27:18.873477 #20142]  INFO -- : Running

/usr/lib/hadoop/bin/hadoop \
  jar /usr/lib/hadoop/contrib/streaming/hadoop-*streaming*.jar \
  -D mapred.job.name='word_count---/data/sonnet_18.txt---/data/word_count.tsv' \
  -mapper 'wu-local /home/user/wukong-hadoop/examples/word_count.rb --run=mapper' \
  -reducer 'wu-local /home/user/wukong-hadoop/examples/word_count.rb --run=reducer' \
  -input '/data/sonnet_18.txt' \
  -output '/data/word_count.tsv' \
12/11/28 01:32:09 INFO mapred.FileInputFormat: Total input paths to process : 1
12/11/28 01:32:10 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local, /mnt2/hadoop/mapred/local]
12/11/28 01:32:10 INFO streaming.StreamJob: Running job: job_201210241848_0043
12/11/28 01:32:10 INFO streaming.StreamJob: To kill this job, run:
12/11/28 01:32:10 INFO streaming.StreamJob: /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=10.124.54.254:8021 -kill job_201210241848_0043
12/11/28 01:32:10 INFO streaming.StreamJob: Tracking URL: http://ip-10-124-54-254.ec2.internal:50030/jobdetails.jsp?jobid=job_201210241848_0043
12/11/28 01:32:11 INFO streaming.StreamJob:  map 0%  reduce 0%
...
```

Hadoop throws an error if your output path already exists. If you're running the same job over and over, it can be annoying to constantly have to remember to delete the output path from your last run. Use the <tt>--rm</tt> option in this case to automatically remove the output path before launching a Hadoop job (this only works in Hadoop mode).

### Advanced Hadoop Usage

For small or lightweight jobs, all you have to do to move from local to Hadoop is change the <tt>--mode</tt> flag when executing your jobs with <tt>wu-hadoop</tt>.

More complicated jobs may require special code to be available (new input/output formats, <tt>CLASSPATH</tt> or <tt>RUBYLIB</tt> hacking, &c.) or tuning at the level of Hadoop itself to run efficiently.

#### Other Input/Output Formats

Hadoop streaming uses the <a href="http://hadoop.apache.org/docs/r0.20.1/api/org/apache/hadoop/mapred/TextInputFormat.html">TextInputFormat</a> and <a href="http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/output/TextOutputFormat.html">TextOutputFormat</a> by default. These turn all input/output data into newline-delimited string records, a perfect match for the command-line and for Wukong-Hadoop's local mode.

Other input and output formats can be specified with the <tt>--input_format</tt> and <tt>--output_format</tt> options.

#### Tuning

Hadoop offers many, many options for configuring a particular Hadoop job as well as the Hadoop cluster itself. Wukong-Hadoop wraps many of these familiar options (<tt>mapred.map.tasks</tt>, <tt>mapred.reduce.tasks</tt>, <tt>mapred.task.timeout</tt>, &c.) with friendlier names (<tt>map_tasks</tt>, <tt>reduce_tasks</tt>, <tt>timeout</tt>, &c.). See a complete list using <tt>wu-hadoop --help</tt>.

Java options themselves can be set directly using the <tt>--java_opts</tt> flag. You can also use the <tt>--dry_run</tt> option again to see the constructed Hadoop invocation without running it:

```
$ wu-hadoop examples/word_count.rb --mode=hadoop --input=/data/sonnet_18.txt --output=/data/word_count.tsv --java_opts='-D foo.bar=3 -D something.else=hello' --dry_run
I, [2012-11-27T19:47:08.872784 #20512]  INFO -- : Launching Hadoop!
I, [2012-11-27T19:47:08.873630 #20512]  INFO -- : Dry run:
/usr/lib/hadoop/bin/hadoop \
  jar /usr/lib/hadoop/contrib/streaming/hadoop-*streaming*.jar \
  -D mapred.job.name='word_count---/data/sonnet_18.txt---/data/word_count.tsv' \
  -D foo.bar=3 \
  -D something.else=hello \
  -mapper 'wu-local /home/user/wukong-hadoop/examples/word_count.rb --run=mapper' \
  -reducer 'wu-local /home/user/wukong-hadoop/examples/word_count.rb --run=reducer' \
  -input '/data/sonnet_18.txt' \
  -output '/data/word_count.tsv' \
```
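
A sketch of the friendly-name translation described above, with a deliberately tiny, illustrative mapping table (the gem's real table lives in its driver code and covers many more options):

```ruby
# Illustrative only: translate Wukong-Hadoop's friendly option names
# into the `-D` definitions that appear on the streaming invocation.
FRIENDLY_TO_HADOOP = {
  :map_tasks    => 'mapred.map.tasks',
  :reduce_tasks => 'mapred.reduce.tasks',
  :timeout      => 'mapred.task.timeout',
}

def hadoop_defines(options)
  options.map do |name, value|
    "-D #{FRIENDLY_TO_HADOOP.fetch(name)}=#{value}"
  end.sort
end

puts hadoop_defines(:map_tasks => 10, :timeout => 600_000)
```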

#### Accessing Hadoop Runtime Data

Hadoop streaming exposes several environment variables to scripts it executes, including mapper and reducer scripts launched by <tt>wu-hadoop</tt>. Instead of manually inspecting the <tt>ENV</tt> within your Wukong processors, you can use the following methods defined for commonly accessed parameters:

* <tt>input_file</tt>: Path of the (data) file currently being processed.
* <tt>input_dir</tt>: Directory of the (data) file currently being processed.
* <tt>map_input_start_offset</tt>: Offset of the chunk currently being processed within the current input file.
* <tt>map_input_length</tt>: Length of the chunk currently being processed within the current input file.
* <tt>attempt_id</tt>: ID of the current map/reduce attempt.
* <tt>curr_task_id</tt>: ID of the current map/reduce task.

or use the <tt>hadoop_streaming_parameter</tt> method for the others.
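
Under the hood, Hadoop streaming exports job parameters to child processes with non-alphanumeric characters in the parameter name replaced by underscores (so <tt>map.input.file</tt> shows up as <tt>ENV['map_input_file']</tt>). A minimal sketch of that lookup, independent of the gem's own helpers (the env-hash parameter is an illustrative convenience for testing outside a real task):

```ruby
# Hadoop streaming passes job parameters as environment variables,
# rewriting non-alphanumeric characters in the name to underscores.
def hadoop_streaming_parameter(name, env = ENV)
  env[name.gsub(/\W/, '_')]
end

# e.g., inside a map task:
#   hadoop_streaming_parameter('map.input.file')  # path of current input file
#   hadoop_streaming_parameter('mapred.task.id')  # current task ID
```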
data/Rakefile
ADDED
@@ -0,0 +1,13 @@
require 'bundler'
Bundler::GemHelper.install_tasks

require 'rspec/core/rake_task'
RSpec::Core::RakeTask.new(:specs)

require 'yard'
YARD::Rake::YardocTask.new

require 'cucumber/rake/task'
Cucumber::Rake::Task.new(:features)

task :default => [:specs]
data/bin/hdp-bin
ADDED
@@ -0,0 +1,44 @@
#!/usr/bin/env ruby

require 'rubygems'
require 'wukong'
require 'wukong/streamer/count_keys'

#
# Run locally for testing:
#
#   hdp-cat /hdfs/sometable.tsv | head -n100 | ./hdp-bin --column=4 --bin_width=0.1 --map | sort | ./hdp-bin --reduce
#
# Run on a giant dataset:
#
#   hdp-bin --run --column=4 --bin_width=0.1 /hdfs/sometable.tsv /hdfs/sometable_col4_binned
#

Settings.define :column,    :default => 1,   :type => Integer, :description => "The column to bin"
Settings.define :bin_width, :default => 0.5, :type => Float,   :description => "What should the bin width be?"

module HadoopBinning

  class Mapper < Wukong::Streamer::RecordStreamer

    def initialize *args
      super(*args)
      @bin_width = options.bin_width
      @column    = options.column
    end

    def process *args
      yield bin_field(args[@column])
    end

    def bin_field field
      (field.to_f / @bin_width).round * @bin_width
    end

  end

  class Reducer < Wukong::Streamer::CountKeys; end

end

Wukong::Script.new(HadoopBinning::Mapper, HadoopBinning::Reducer).run
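
The heart of hdp-bin is its <tt>bin_field</tt> method, which snaps a numeric field to the nearest multiple of the bin width. Extracted from the Wukong streamer machinery so it can be tried standalone:

```ruby
# The binning arithmetic from hdp-bin's Mapper#bin_field: divide by the
# bin width, round to the nearest integer, then scale back up.
def bin_field(field, bin_width)
  (field.to_f / bin_width).round * bin_width
end

bin_field(1.23, 0.5)  # => 1.0  (1.23/0.5 = 2.46, rounds to 2, times 0.5)
bin_field(7.81, 0.5)  # => 8.0
```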