wukong 0.1.4 → 1.4.0
- data/INSTALL.textile +89 -0
- data/README.textile +41 -74
- data/docpages/INSTALL.textile +94 -0
- data/{doc → docpages}/LICENSE.textile +0 -0
- data/{doc → docpages}/README-wulign.textile +6 -0
- data/docpages/UsingWukong-part1-get_ready.textile +17 -0
- data/{doc/overview.textile → docpages/UsingWukong-part2-ThinkingBigData.textile} +8 -24
- data/{doc → docpages}/UsingWukong-part3-parsing.textile +8 -2
- data/docpages/_config.yml +39 -0
- data/{doc/tips.textile → docpages/bigdata-tips.textile} +71 -44
- data/{doc → docpages}/code/api_response_example.txt +0 -0
- data/{doc → docpages}/code/parser_skeleton.rb +0 -0
- data/{doc/intro_to_map_reduce → docpages/diagrams}/MapReduceDiagram.graffle +0 -0
- data/docpages/favicon.ico +0 -0
- data/docpages/gem.css +16 -0
- data/docpages/hadoop-tips.textile +83 -0
- data/docpages/index.textile +90 -0
- data/docpages/intro.textile +8 -0
- data/docpages/moreinfo.textile +174 -0
- data/docpages/news.html +24 -0
- data/{doc → docpages}/pig/PigLatinExpressionsList.txt +0 -0
- data/{doc → docpages}/pig/PigLatinReferenceManual.html +0 -0
- data/{doc → docpages}/pig/PigLatinReferenceManual.txt +0 -0
- data/docpages/tutorial.textile +283 -0
- data/docpages/usage.textile +195 -0
- data/docpages/wutils.textile +263 -0
- data/wukong.gemspec +80 -50
- metadata +87 -54
- data/doc/INSTALL.textile +0 -41
- data/doc/README-tutorial.textile +0 -163
- data/doc/README-wutils.textile +0 -128
- data/doc/TODO.textile +0 -61
- data/doc/UsingWukong-part1-setup.textile +0 -2
- data/doc/UsingWukong-part2-scraping.textile +0 -2
- data/doc/hadoop-nfs.textile +0 -51
- data/doc/hadoop-setup.textile +0 -29
- data/doc/index.textile +0 -124
- data/doc/links.textile +0 -42
- data/doc/usage.textile +0 -102
- data/doc/utils.textile +0 -48
- data/examples/and_pig/sample_queries.rb +0 -128
- data/lib/wukong/and_pig.rb +0 -62
- data/lib/wukong/and_pig/README.textile +0 -12
- data/lib/wukong/and_pig/as.rb +0 -37
- data/lib/wukong/and_pig/data_types.rb +0 -30
- data/lib/wukong/and_pig/functions.rb +0 -50
- data/lib/wukong/and_pig/generate.rb +0 -85
- data/lib/wukong/and_pig/generate/variable_inflections.rb +0 -82
- data/lib/wukong/and_pig/junk.rb +0 -51
- data/lib/wukong/and_pig/operators.rb +0 -8
- data/lib/wukong/and_pig/operators/compound.rb +0 -29
- data/lib/wukong/and_pig/operators/evaluators.rb +0 -7
- data/lib/wukong/and_pig/operators/execution.rb +0 -15
- data/lib/wukong/and_pig/operators/file_methods.rb +0 -29
- data/lib/wukong/and_pig/operators/foreach.rb +0 -98
- data/lib/wukong/and_pig/operators/groupies.rb +0 -212
- data/lib/wukong/and_pig/operators/load_store.rb +0 -65
- data/lib/wukong/and_pig/operators/meta.rb +0 -42
- data/lib/wukong/and_pig/operators/relational.rb +0 -129
- data/lib/wukong/and_pig/pig_struct.rb +0 -48
- data/lib/wukong/and_pig/pig_var.rb +0 -95
- data/lib/wukong/and_pig/symbol.rb +0 -29
- data/lib/wukong/and_pig/utils.rb +0 -0
data/INSTALL.textile
ADDED
@@ -0,0 +1,89 @@
|
|
1
|
+
---
|
2
|
+
layout: default
|
3
|
+
title: Install Notes
|
4
|
+
collapse: false
|
5
|
+
---
|
6
|
+
h1(gemheader). {{ site.gemname }} %(small):: install%
|
7
|
+
|
8
|
+
** "Get the code":#getcode
|
9
|
+
** "Setup":#setup
|
10
|
+
** "Installing and Running Wukong with Hadoop":#gethadoop
|
11
|
+
** "Installing and Running Wukong with Datamapper, ActiveRecord, the command-line and more":#others
|
12
|
+
|
13
|
+
|
14
|
+
<notextile><div class="toggle"></notextile>
|
15
|
+
|
16
|
+
h2(#getcode). Get the code
|
17
|
+
|
18
|
+
Wukong is still under active development. The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/{{ site.gemname }}
|
19
|
+
|
20
|
+
pre. $ git clone git://github.com/mrflip/{{ site.gemname }}
|
21
|
+
|
22
|
+
A gem is available from "github:":http://gems.github.com
|
23
|
+
|
24
|
+
pre. $ sudo gem install mrflip-{{ site.gemname }} --source=http://gems.github.com
|
25
|
+
|
26
|
+
or from "gemcutter":http://gemcutter.org
|
27
|
+
|
28
|
+
pre. $ sudo gem install {{ site.gemname }} --source=http://gemcutter.org
|
29
|
+
|
30
|
+
You can instead download this project in either "zip":http://github.com/mrflip/{{ site.gemname }}/zipball/master or "tar":http://github.com/mrflip/{{ site.gemname }}/tarball/master formats.
|
31
|
+
|
32
|
+
h3. Get the Dependencies
|
33
|
+
|
34
|
+
* Hadoop, pig
|
35
|
+
* extlib, YAML, JSON
|
36
|
+
* Optional gems: trollop, addressable/uri, htmlentities
|
37
|
+
|
38
|
+
<notextile></div><div class="toggle"></notextile>
|
39
|
+
|
40
|
+
h2(#setup). Setup
|
41
|
+
|
42
|
+
1. Allow Wukong to discover where his elephant friend lives by setting a $HADOOP_HOME environment variable: @export HADOOP_HOME="/usr/local/share/hadoop"@
|
43
|
+
2. Add wukong's @bin/@ directory to your $PATH if you'd like to use the "wutils":wutils.html
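As a rough illustration of step 1, a script could check the variable before trying to launch anything. This is only a sketch: @hadoop_home@ is a hypothetical helper written for this note, not a function Wukong provides, and the path is the example value from the step above.

```ruby
# Hypothetical helper (not part of Wukong): resolve the Hadoop install
# directory from the HADOOP_HOME environment variable, failing loudly
# when it is missing or blank.
def hadoop_home(env = ENV)
  home = env['HADOOP_HOME'].to_s.strip
  raise ArgumentError, 'HADOOP_HOME is not set' if home.empty?
  home
end
```

Calling @hadoop_home@ with no arguments reads the real environment; passing a hash makes the check easy to try without touching your shell settings.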
|
44
|
+
|
45
|
+
<i>(see also: "Ruby Hadoop Quickstart":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart)</i>
|
46
|
+
|
47
|
+
<notextile></div><div class="toggle"></notextile>
|
48
|
+
|
49
|
+
h2(#gethadoop). Installing and Running Wukong with Hadoop
|
50
|
+
|
51
|
+
Wukong was primarily developed for Hadoop, and we think it's the best way to use Hadoop (it's certainly the most fun!).
|
52
|
+
|
53
|
+
h3. Run Wukong on the Amazon AWS EC2 Cloud
|
54
|
+
|
55
|
+
h3. Hadoop Infrastructure
|
56
|
+
|
57
|
+
Even if you have a bunch of machines with spare cycles, lots of RAM, and a shared filesystem... do yourself a favor and start out using the "Cloudera AMIs on Amazon's EC2 cloud.":http://www.cloudera.com/hadoop-ec2 There are an overwhelming number of fiddly little parameters and you'll be glad for the user experience before you get into server setup. If it's still mid-late 2009 when you read this, ignore prudence and jump straight to using Hadoop 0.20. It will be a) more fun, b) much more robust (trust me, at "v0.20" you want to live on the bleeding edge), and c) you won't have to suffer through migrating your HDFS two weeks after setup.
|
58
|
+
|
59
|
+
To set up hadoop, your best bet is the Cloudera AMIs on Amazon's EC2 compute cloud:
|
60
|
+
|
61
|
+
* http://www.cloudera.com/hadoop-ec2
|
62
|
+
* http://www.cloudera.com/hadoop-ec2-ebs-beta
|
63
|
+
|
64
|
+
EC2 means anyone with a $10 bill can rent a 10-machine cluster with 1TB of distributed storage for 8 hours.
|
65
|
+
|
66
|
+
h3. Run Wukong using Amazon AWS Elastic MapReduce
|
67
|
+
|
68
|
+
AWS Elastic MapReduce saves the trouble of even setting up a cluster: click, bam, there it is.
|
69
|
+
|
70
|
+
Phil Ripperger has prepared a "Ruby Hadoop Quickstart":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart explaining how to get started with Wukong, Hadoop and the Amazon Elastic MapReduce cloud -- it's better than anything we could put here. Thanks Phil!
|
71
|
+
|
72
|
+
h3. Set up a Hadoop cluster
|
73
|
+
|
74
|
+
If you have a local cluster, or just want to experiment with a single-machine install, check out the Cloudera packages for both Debian/Ubuntu-based and Redhat/RPM-based Linux systems.
|
75
|
+
|
76
|
+
h3. More Hadoop Notes
|
77
|
+
|
78
|
+
I've braindumped some random notes on configuring and using hadoop "over here":hadoop-tips.html
|
79
|
+
|
80
|
+
<notextile></div><div class="toggle"></notextile>
|
81
|
+
|
82
|
+
h2(#others). Wukong isn't just Hadoop: Datamapper, ActiveRecord, command-line usage and more
|
83
|
+
|
84
|
+
Wukong is used by many in a non-Hadoop environment -- anywhere you can stream data records, you can unleash its monkey power.
|
85
|
+
|
86
|
+
Please see the "usage notes":usage.html#playnice for more!
|
87
|
+
|
88
|
+
|
89
|
+
<notextile></div></notextile>
|
data/README.textile
CHANGED
@@ -2,34 +2,53 @@ h1. Wukong
|
|
2
2
|
|
3
3
|
Wukong makes "Hadoop":http://hadoop.apache.org/core so easy a chimpanzee can use it.
|
4
4
|
|
5
|
-
Treat your dataset
|
6
|
-
|
5
|
+
Treat your dataset like a
|
7
6
|
* stream of lines when it's efficient to process by lines
|
8
7
|
* stream of field arrays when it's efficient to deal directly with fields
|
9
8
|
* stream of lightweight objects when it's efficient to deal with objects
|
10
9
|
|
11
|
-
Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line.
|
10
|
+
Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line.
|
11
|
+
|
12
|
+
The **main documentation** lives on the "Wukong Pages.":http://mrflip.github.com/wukong Please feel free to add supplemental information to the "wukong wiki.":http://wiki.github.com/mrflip/wukong
|
13
|
+
|
14
|
+
* "Install and set up wukong":http://mrflip.github.com/wukong/INSTALL.html
|
15
|
+
* "Tutorial":http://mrflip.github.com/wukong/tutorial.html
|
16
|
+
* "Usage notes":http://mrflip.github.com/wukong/usage.html
|
17
|
+
* "Wutils":http://mrflip.github.com/wukong/wutils.html -- command-line utilities for working with data
|
18
|
+
* Links and tips for "configuring and working with hadoop":http://mrflip.github.com/wukong/hadoop-tips.html
|
19
|
+
* Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop)
|
20
|
+
* "More info":http://mrflip.github.com/wukong/moreinfo.html
|
12
21
|
|
13
|
-
|
22
|
+
h2. Help!
|
23
|
+
|
24
|
+
Send Wukong questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
|
14
25
|
|
15
26
|
h2. Install
|
16
27
|
|
17
|
-
|
28
|
+
** "Main Install and Setup Documentation":http://mrflip.github.com/wukong/INSTALL.html **
|
29
|
+
|
30
|
+
h3. Get the code
|
18
31
|
|
19
|
-
|
32
|
+
We're still actively developing {{ site.gemname }}. The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/{{ site.gemname }}
|
20
33
|
|
21
|
-
|
34
|
+
pre. $ git clone git://github.com/mrflip/{{ site.gemname }}
|
22
35
|
|
23
|
-
|
36
|
+
A gem is available from "gemcutter:":http://gemcutter.org/gems/{{ site.gemname }}
|
24
37
|
|
25
|
-
|
38
|
+
pre. $ sudo gem install {{ site.gemname }} --source=http://gemcutter.org
|
26
39
|
|
27
|
-
|
40
|
+
(don't use the gems.github.com version -- it's way out of date.)
|
28
41
|
|
29
|
-
|
42
|
+
You can instead download this project in either "zip":http://github.com/mrflip/{{ site.gemname }}/zipball/master or "tar":http://github.com/mrflip/{{ site.gemname }}/tarball/master formats.
|
43
|
+
|
44
|
+
h3. Dependencies and setup
|
45
|
+
|
46
|
+
To finish setting up, see the "detailed setup instructions":http://mrflip.github.com/wukong/INSTALL.html and then read the "usage notes":http://mrflip.github.com/wukong/usage.html
|
30
47
|
|
31
48
|
h2. How to write a Wukong script
|
32
49
|
|
50
|
+
** "Tutorial By Example":http://mrflip.github.com/wukong/tutorial.html **
|
51
|
+
|
33
52
|
Here's a script to count words in a text stream:
|
34
53
|
|
35
54
|
<pre><code> require 'wukong'
|
@@ -112,11 +131,7 @@ You can also use structs to treat your dataset as a stream of objects:
|
|
112
131
|
|
113
132
|
h3. Advanced Patterns
|
114
133
|
|
115
|
-
Wukong has a good collection of map/reduce patterns.
|
116
|
-
|
117
|
-
The AccumulatingReducer calls start! on the first record for each key, calls accumulate() on every example for that key (including the first), and calls finalize() once the last record for that key is seen.
|
118
|
-
|
119
|
-
Here's an AccumulatingReducer that takes a long list of key-value pairs and emits, for each key, all its corresponding values in one line.
|
134
|
+
Wukong has a good collection of map/reduce patterns. Here's an AccumulatingReducer that takes a long list of key-value pairs and emits, for each key, all its corresponding values in one line.
|
120
135
|
|
121
136
|
<pre><code> #
|
122
137
|
# Roll up all values for each key into a single line
|
@@ -165,62 +180,6 @@ You'd end up with
|
|
165
180
|
@newman @elaine @jerry @kramer
|
166
181
|
</code></pre>
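The roll-up behind that output follows the sequence the README describes for an accumulating reducer: call start! on the first record for a key, accumulate every record for that key, and finalize once the key changes. The plain-Ruby sketch below imitates that sequence without loading Wukong. @RollupSketch@ and its @run@ method are illustrative names, not Wukong's API, and the sketch assumes the input lines are already sorted by key and tab-separated.

```ruby
# Illustrative stand-in for the start!/accumulate/finalize sequence.
# This is NOT Wukong's AccumulatingReducer class.
class RollupSketch
  def initialize
    @key = nil
    @values = []
  end

  # Called on the first record seen for each key.
  def start!(key)
    @key = key
    @values = []
  end

  # Called on every record for the current key, including the first.
  def accumulate(value)
    @values << value
  end

  # Called once the last record for the current key has been seen.
  def finalize
    [@key, *@values].join("\t")
  end

  # Drives the three hooks over "key\tvalue" lines sorted by key and
  # returns one rolled-up line per key.
  def run(sorted_lines)
    out = []
    sorted_lines.each do |line|
      key, value = line.chomp.split("\t", 2)
      if key != @key
        out << finalize unless @key.nil?
        start!(key)
      end
      accumulate(value)
    end
    out << finalize unless @key.nil?
    out
  end
end
```

Keeping the key-change detection in one driver method means the three hooks stay small, which is the appeal of the pattern.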
|
167
182
|
|
168
|
-
h3. More info
|
169
|
-
|
170
|
-
There are many useful examples (including an actually-useful version of the WordCount script) in examples/ directory.
|
171
|
-
|
172
|
-
h2. Setup
|
173
|
-
|
174
|
-
1. Allow Wukong to discover where his elephant friend lives: either
|
175
|
-
|
176
|
-
* set a @$HADOOP_HOME@ environment variable,
|
177
|
-
|
178
|
-
* or create a file 'config/wukong-site.yaml' with a line that points to the top-level directory of your hadoop install:
|
179
|
-
|
180
|
-
@:hadoop_home: /usr/local/share/hadoop@
|
181
|
-
|
182
|
-
2. Add wukong's @bin/@ directory to your $PATH, so that you may use its filesystem shortcuts.
|
183
|
-
|
184
|
-
h2. How to run a Wukong script
|
185
|
-
|
186
|
-
To run your script using local files and no connection to a hadoop cluster,
|
187
|
-
|
188
|
-
@your/script.rb --run=local path/to/input_files path/to/output_dir@
|
189
|
-
|
190
|
-
To run the command across a Hadoop cluster,
|
191
|
-
|
192
|
-
@your/script.rb --run=hadoop path/to/input_files path/to/output_dir@
|
193
|
-
|
194
|
-
You can set the default in the config/wukong-site.yaml file, and then just use @--run@ instead of @--run=something@ --it will just use the default run mode.
|
195
|
-
|
196
|
-
If you're running @--run=hadoop@, all file paths are HDFS paths. If you're running @--run=local@, all file paths are local paths. (your/script path, of course, lives on the local filesystem).
|
197
|
-
|
198
|
-
You can supply arbitrary command line arguments (they wind up as key-value pairs in the options path your mapper and reducer receive), and you can use the hadoop syntax to specify more than one input file:
|
199
|
-
|
200
|
-
./path/to/your/script.rb --any_specific_options --options=can_have_vals \
|
201
|
-
--run "input_dir/part_*,input_file2.tsv,etc.tsv" path/to/output_dir
|
202
|
-
|
203
|
-
Note that all @--options@ must precede (in any order) all non-options.
|
204
|
-
|
205
|
-
h2. How to test your scripts
|
206
|
-
|
207
|
-
To run mapper on its own:
|
208
|
-
|
209
|
-
cat ./local/test/input.tsv | ./examples/word_count.rb --map | more
|
210
|
-
|
211
|
-
or if your test data lies on the HDFS,
|
212
|
-
|
213
|
-
hdp-cat test/input.tsv | ./examples/word_count.rb --map | more
|
214
|
-
|
215
|
-
Next graduate to running @--run=local@ mode so you can inspect the reducer.
|
216
|
-
|
217
|
-
|
218
|
-
h2. What's up with Wukong::AndPig?
|
219
|
-
|
220
|
-
@Wukong::AndPig@ is a small library to more easily generate code for the
|
221
|
-
"Pig":http://hadoop.apache.org/pig data analysis language. See its
|
222
|
-
"README":wukong/and_pig/README.textile for more.
|
223
|
-
|
224
183
|
h2. Why is it called Wukong?
|
225
184
|
|
226
185
|
Hadoop, as you may know, is "named after a stuffed elephant.":http://en.wikipedia.org/wiki/Hadoop Since Wukong was started by the "infochimps":http://infochimps.org team, we needed a simian analog. A Monkey King who journeyed to the land of the Elephant seems to fit the bill:
|
@@ -231,6 +190,14 @@ bq. Sun Wukong possesses incredible strength, being able to lift his 13,500 jīn
|
|
231
190
|
|
232
191
|
The "Jamie Hewlett / Damon Albarn short":http://news.bbc.co.uk/sport1/hi/olympics/monkey that the BBC made for their 2008 Olympics coverage gives the general idea.
|
233
192
|
|
234
|
-
h2.
|
193
|
+
h2. Credits
|
194
|
+
|
195
|
+
Wukong was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org) for the "infochimps project":http://infochimps.org
|
196
|
+
|
197
|
+
Patches submitted by:
|
198
|
+
* gemified by Ben Woosley (ben.woosley with the gmails)
|
199
|
+
* ruby interpreter path fix by "Yuichiro MASUI":http://github.com/masuidrive - masui at masuidrive.jp - http://blog.masuidrive.jp/
|
235
200
|
|
236
|
-
|
201
|
+
Thanks to:
|
202
|
+
* "Brad Heintz":http://www.bradheintz.com/no1thing/talks/ for his early feedback
|
203
|
+
* "Phil Ripperger":http://blog.pdatasolutions.com for his "wukong in the Amazon AWS cloud":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart tutorial.
|
@@ -0,0 +1,94 @@
|
|
1
|
+
---
|
2
|
+
layout: default
|
3
|
+
title: Install Notes
|
4
|
+
collapse: false
|
5
|
+
---
|
6
|
+
h1(gemheader). {{ site.gemname }} %(small):: install%
|
7
|
+
|
8
|
+
** "Get the code":#getcode
|
9
|
+
** "Setup":#setup
|
10
|
+
** "Installing and Running Wukong with Hadoop":#gethadoop
|
11
|
+
** "Installing and Running Wukong with Datamapper, ActiveRecord, the command-line and more":#others
|
12
|
+
|
13
|
+
|
14
|
+
<notextile><div class="toggle"></notextile>
|
15
|
+
|
16
|
+
h2(#getcode). Get the code
|
17
|
+
|
18
|
+
We're still actively developing {{ site.gemname }}. The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/{{ site.gemname }}
|
19
|
+
|
20
|
+
pre. $ git clone git://github.com/mrflip/{{ site.gemname }}
|
21
|
+
|
22
|
+
A gem is available from "gemcutter:":http://gemcutter.org/gems/{{ site.gemname }}
|
23
|
+
|
24
|
+
pre. $ sudo gem install {{ site.gemname }} --source=http://gemcutter.org
|
25
|
+
|
26
|
+
(don't use the gems.github.com version -- it's way out of date.)
|
27
|
+
|
28
|
+
You can instead download this project in either "zip":http://github.com/mrflip/{{ site.gemname }}/zipball/master or "tar":http://github.com/mrflip/{{ site.gemname }}/tarball/master formats.
|
29
|
+
|
30
|
+
<notextile></div><div class="toggle"></notextile>
|
31
|
+
|
32
|
+
h3. Get the Dependencies
|
33
|
+
|
34
|
+
* Hadoop
|
35
|
+
* Pig (optional)
|
36
|
+
* Parts of {{ site.gemname }} require these gems:
|
37
|
+
** addressable/uri
|
38
|
+
** htmlentities
|
39
|
+
** extlib
|
40
|
+
** YAML
|
41
|
+
** JSON
|
42
|
+
|
43
|
+
<notextile></div><div class="toggle"></notextile>
|
44
|
+
|
45
|
+
h2(#setup). Setup
|
46
|
+
|
47
|
+
1. Allow Wukong to discover where his elephant friend lives by setting a $HADOOP_HOME environment variable: @export HADOOP_HOME="/usr/local/share/hadoop"@
|
48
|
+
2. Add wukong's @bin/@ directory to your $PATH if you'd like to use the "wutils":wutils.html
|
49
|
+
|
50
|
+
<i>(see also: "Ruby Hadoop Quickstart":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart)</i>
|
51
|
+
|
52
|
+
<notextile></div><div class="toggle"></notextile>
|
53
|
+
|
54
|
+
h2(#gethadoop). Installing and Running Wukong with Hadoop
|
55
|
+
|
56
|
+
Wukong was primarily developed for Hadoop, and we think it's the best way to use Hadoop (it's certainly the most fun!).
|
57
|
+
|
58
|
+
h3. Run Wukong on the Amazon AWS EC2 Cloud
|
59
|
+
|
60
|
+
h3. Hadoop Infrastructure
|
61
|
+
|
62
|
+
Even if you have a bunch of machines with spare cycles, lots of RAM, and a shared filesystem... do yourself a favor and start out using the "Cloudera AMIs on Amazon's EC2 cloud.":http://www.cloudera.com/hadoop-ec2 There are an overwhelming number of fiddly little parameters and you'll be glad for the user experience before you get into server setup. If it's still mid-late 2009 when you read this, ignore prudence and jump straight to using Hadoop 0.20. It will be a) more fun, b) much more robust (trust me, at "v0.20" you want to live on the bleeding edge), and c) you won't have to suffer through migrating your HDFS two weeks after setup.
|
63
|
+
|
64
|
+
To set up hadoop, your best bet is the Cloudera AMIs on Amazon's EC2 compute cloud:
|
65
|
+
|
66
|
+
* http://www.cloudera.com/hadoop-ec2
|
67
|
+
* http://www.cloudera.com/hadoop-ec2-ebs-beta
|
68
|
+
|
69
|
+
EC2 means anyone with a $10 bill can rent a 10-machine cluster with 1TB of distributed storage for 8 hours.
|
70
|
+
|
71
|
+
h3. Run Wukong using Amazon AWS Elastic MapReduce
|
72
|
+
|
73
|
+
AWS Elastic MapReduce saves the trouble of even setting up a cluster: click, bam, there it is.
|
74
|
+
|
75
|
+
Phil Ripperger has prepared a "Ruby Hadoop Quickstart":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart explaining how to get started with Wukong, Hadoop and the Amazon Elastic MapReduce cloud -- it's better than anything we could put here. Thanks Phil!
|
76
|
+
|
77
|
+
h3. Set up a Hadoop cluster
|
78
|
+
|
79
|
+
If you have a local cluster, or just want to experiment with a single-machine install, check out the Cloudera packages for both Debian/Ubuntu-based and Redhat/RPM-based Linux systems.
|
80
|
+
|
81
|
+
h3. More Hadoop Notes
|
82
|
+
|
83
|
+
I've braindumped some random notes on configuring and using hadoop "over here":hadoop-tips.html
|
84
|
+
|
85
|
+
<notextile></div><div class="toggle"></notextile>
|
86
|
+
|
87
|
+
h2(#others). Wukong isn't just Hadoop: Datamapper, ActiveRecord, command-line usage and more
|
88
|
+
|
89
|
+
Wukong is used by many in a non-Hadoop environment -- anywhere you can stream data records, you can unleash its monkey power.
|
90
|
+
|
91
|
+
Please see the "usage notes":usage.html#playnice for more!
|
92
|
+
|
93
|
+
|
94
|
+
<notextile></div></notextile>
|
File without changes
|
@@ -1,3 +1,9 @@
|
|
1
|
+
---
|
2
|
+
layout: default
|
3
|
+
title: mrflip.github.com/wukong - wu-lign utility
|
4
|
+
collapse: false
|
5
|
+
---
|
6
|
+
|
1
7
|
h1. wu-lign -- format a tab-separated file as aligned columns
|
2
8
|
|
3
9
|
wu-lign will intelligently reformat a tab-separated file into a tab-separated, space aligned file that is still suitable for further processing. For example, given the log-file input
|
@@ -0,0 +1,17 @@
|
|
1
|
+
---
|
2
|
+
layout: default
|
3
|
+
title: mrflip.github.com/wukong - Using Wukong and Wuclan, Part 1 - Setup
|
4
|
+
collapse: false
|
5
|
+
---
|
6
|
+
|
7
|
+
h1. Using Wukong and Wuclan, Part 0 - Setup
|
8
|
+
|
9
|
+
Please follow the "installation and setup directions":setup.html for wukong, hadoop and a compute cluster.
|
10
|
+
|
11
|
+
h1. Using Wukong and Wuclan, Part 1 - Scraping
|
12
|
+
|
13
|
+
This part needs writing.
|
14
|
+
|
15
|
+
Later, it will tell you how to get a large corpus of data to use in part 2.
|
16
|
+
|
17
|
+
In the meantime check out http://mrflip.github.com/monkeyshines/ and http://mrflip.github.com/wuclan/ -- in particular the "Twitter Search Scraper":http://github.com/mrflip/wuclan/tree/master/examples/twitter/scrape_twitter_search/ example. We use this in production to gather and analyze tens of gigabytes of twitter conversations.
|
@@ -1,5 +1,12 @@
|
|
1
|
+
---
|
2
|
+
layout: default
|
3
|
+
title: mrflip.github.com/wukong - Overview
|
4
|
+
collapse: false
|
5
|
+
---
|
1
6
|
|
2
|
-
|
7
|
+
h1. Thinking Big Data
|
8
|
+
|
9
|
+
h2. There's lots of data, Wukong and Hadoop can help
|
3
10
|
|
4
11
|
|
5
12
|
There are two disruptive
|
@@ -13,9 +20,6 @@ There are two disruptive
|
|
13
20
|
** Old frontier computing: expensive, N log N, SUUUUUUCKS
|
14
21
|
** It's cheap, it's scaleable and it's fun
|
15
22
|
|
16
|
-
h2. Wukong + Hadoop can help
|
17
|
-
|
18
|
-
|
19
23
|
h2. == Map|Reduce ==
|
20
24
|
|
21
25
|
h3. cat input.tsv | mapper.sh | sort | reducer.sh > output.tsv
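The pipeline in that heading can be imitated in a few lines of plain Ruby, which may help when reasoning about what each stage contributes. This is a sketch under assumptions: @map_words@, @reduce_counts@ and @word_count@ are hypothetical names standing in for mapper.sh, reducer.sh and the whole pipe, and word counting is just a convenient example job.

```ruby
# Plain-Ruby imitation of `cat input | mapper | sort | reducer`.
# All three method names are illustrative, not part of Wukong or Hadoop.

# Mapper stage: emit a [word, 1] pair for every word on every line.
def map_words(lines)
  lines.flat_map { |line| line.downcase.scan(/\w+/).map { |word| [word, 1] } }
end

# Reducer stage: pairs arrive sorted, so equal keys are adjacent and can
# be summed one run at a time.
def reduce_counts(sorted_pairs)
  sorted_pairs
    .chunk_while { |a, b| a[0] == b[0] }
    .map { |group| [group[0][0], group.sum { |_, n| n }] }
end

# The whole pipeline: map, then sort by key, then reduce.
def word_count(lines)
  reduce_counts(map_words(lines).sort)
end
```

The sort in the middle is what lets the reducer work on one key at a time without holding the whole dataset in memory.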
|
@@ -69,23 +73,3 @@ h2. == Mechanics, HDFS ==
|
|
69
73
|
|
70
74
|
x M _
|
71
75
|
_ M y
|
72
|
-
|
73
|
-
h2. == More Reading ==
|
74
|
-
|
75
|
-
h3. Hadoop
|
76
|
-
|
77
|
-
* "Hadoop, The Definitive Guide":http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979
|
78
|
-
* "":
|
79
|
-
|
80
|
-
* "Cloudera Blog":http://www.cloudera.com/blog/
|
81
|
-
|
82
|
-
h3. Hadoop|Streaming Frameworks
|
83
|
-
|
84
|
-
* infochimps.org's "Wukong":http://github.com/mrflip/wukong -- ruby; object-oriented *and* record-oriented
|
85
|
-
* NYTimes' "MRToolkit":http://code.google.com/p/mrtoolkit/ -- ruby; much more log-oriented
|
86
|
-
* Freebase's "Happy":http://code.google.com/p/happy/ -- python; the most performant, as it can use Jython to make direct API calls.
|
87
|
-
* Last.fm's "Dumbo":http://wiki.github.com/klbostee/dumbo -- python
|
88
|
-
|
89
|
-
h3. Hadoop Infrastructure
|
90
|
-
|
91
|
-
Even if you have a bunch of machines with spare cycles, lots of RAM, and a shared filesystem... do yourself a favor and start out using the "Cloudera AMIs on Amazon's EC2 cloud.":http://www.cloudera.com/hadoop-ec2 There are an overwhelming number of fiddly little parameters and you'll be glad for the user experience before you get into server setup. Actually, if it's still June 2009 when you read this, profile your scripts with Wukong on the command line and kill some time before Hadoop 0.20 comes out. It will be a) more fun, b) much more robust (trust me, at "v0.20" you want to live on the bleeding edge), and c) you won't have to suffer through migrating your HDFS two weeks after setup.
|
@@ -1,6 +1,12 @@
|
|
1
|
-
|
1
|
+
---
|
2
|
+
layout: default
|
3
|
+
title: mrflip.github.com/wukong - Using Wukong and Wuclan, Part 3 - Parsing
|
4
|
+
collapse: false
|
5
|
+
---
|
2
6
|
|
3
|
-
|
7
|
+
h1. Using Wukong and Wuclan - Parsing
|
8
|
+
|
9
|
+
In part 1 we began a scraper to trawl our desired part of the social web. Now
|
4
10
|
we're ready to start using Wukong to process the files.
|
5
11
|
|
6
12
|
Files come off the wire as
|
@@ -0,0 +1,39 @@
|
|
1
|
+
---
|
2
|
+
permalink: ":year-:month/:title.html"
|
3
|
+
markdown: rdiscount
|
4
|
+
pygments: true
|
5
|
+
auto: true
|
6
|
+
server: true
|
7
|
+
server_port: 4000
|
8
|
+
maruku:
|
9
|
+
use_tex: false
|
10
|
+
use_divs: false
|
11
|
+
png_dir: images/latex
|
12
|
+
png_url: /images/latex
|
13
|
+
|
14
|
+
header_ref: '.html' # .html for subdirs, / for main.
|
15
|
+
assets_path: '/' # http://github.mrflip.com
|
16
|
+
|
17
|
+
gemuser: mrflip
|
18
|
+
gemname: wukong
|
19
|
+
gemversion: 0.1.1
|
20
|
+
title: mrflip.github.com/wukong
|
21
|
+
|
22
|
+
keywords: [ 'wukong,hadoop,ruby,mrflip,infochimps,map,reduce,streaming,dumbo,happy,mrtoolkit,script,simple' ]
|
23
|
+
description: "Wukong: Hadoop made so easy a Chimpanzee could run it."
|
24
|
+
header_files:
|
25
|
+
- INSTALL
|
26
|
+
- LICENSE
|
27
|
+
- usage
|
28
|
+
- wutils
|
29
|
+
- moreinfo
|
30
|
+
- tutorial
|
31
|
+
|
32
|
+
credits:
|
33
|
+
<p>Wukong image courtesy
|
34
|
+
<a href="http://www.curtbusse.com/okavango/page1/oka1.html">Curt Busse</a> under
|
35
|
+
an <a href="http://www.curtbusse.com/copyright.html">open license</a>.
|
36
|
+
It's a Chacma Baboon from the Okavango site. Make sure to read the
|
37
|
+
<a href="http://www.curtbusse.com/okavango/page1/oka1.html#note3">story at the bottom of that page</a>.
|
38
|
+
</p>
|
39
|
+
|