wukong 0.1.4 → 1.4.0

Files changed (63)
  1. data/INSTALL.textile +89 -0
  2. data/README.textile +41 -74
  3. data/docpages/INSTALL.textile +94 -0
  4. data/{doc → docpages}/LICENSE.textile +0 -0
  5. data/{doc → docpages}/README-wulign.textile +6 -0
  6. data/docpages/UsingWukong-part1-get_ready.textile +17 -0
  7. data/{doc/overview.textile → docpages/UsingWukong-part2-ThinkingBigData.textile} +8 -24
  8. data/{doc → docpages}/UsingWukong-part3-parsing.textile +8 -2
  9. data/docpages/_config.yml +39 -0
  10. data/{doc/tips.textile → docpages/bigdata-tips.textile} +71 -44
  11. data/{doc → docpages}/code/api_response_example.txt +0 -0
  12. data/{doc → docpages}/code/parser_skeleton.rb +0 -0
  13. data/{doc/intro_to_map_reduce → docpages/diagrams}/MapReduceDiagram.graffle +0 -0
  14. data/docpages/favicon.ico +0 -0
  15. data/docpages/gem.css +16 -0
  16. data/docpages/hadoop-tips.textile +83 -0
  17. data/docpages/index.textile +90 -0
  18. data/docpages/intro.textile +8 -0
  19. data/docpages/moreinfo.textile +174 -0
  20. data/docpages/news.html +24 -0
  21. data/{doc → docpages}/pig/PigLatinExpressionsList.txt +0 -0
  22. data/{doc → docpages}/pig/PigLatinReferenceManual.html +0 -0
  23. data/{doc → docpages}/pig/PigLatinReferenceManual.txt +0 -0
  24. data/docpages/tutorial.textile +283 -0
  25. data/docpages/usage.textile +195 -0
  26. data/docpages/wutils.textile +263 -0
  27. data/wukong.gemspec +80 -50
  28. metadata +87 -54
  29. data/doc/INSTALL.textile +0 -41
  30. data/doc/README-tutorial.textile +0 -163
  31. data/doc/README-wutils.textile +0 -128
  32. data/doc/TODO.textile +0 -61
  33. data/doc/UsingWukong-part1-setup.textile +0 -2
  34. data/doc/UsingWukong-part2-scraping.textile +0 -2
  35. data/doc/hadoop-nfs.textile +0 -51
  36. data/doc/hadoop-setup.textile +0 -29
  37. data/doc/index.textile +0 -124
  38. data/doc/links.textile +0 -42
  39. data/doc/usage.textile +0 -102
  40. data/doc/utils.textile +0 -48
  41. data/examples/and_pig/sample_queries.rb +0 -128
  42. data/lib/wukong/and_pig.rb +0 -62
  43. data/lib/wukong/and_pig/README.textile +0 -12
  44. data/lib/wukong/and_pig/as.rb +0 -37
  45. data/lib/wukong/and_pig/data_types.rb +0 -30
  46. data/lib/wukong/and_pig/functions.rb +0 -50
  47. data/lib/wukong/and_pig/generate.rb +0 -85
  48. data/lib/wukong/and_pig/generate/variable_inflections.rb +0 -82
  49. data/lib/wukong/and_pig/junk.rb +0 -51
  50. data/lib/wukong/and_pig/operators.rb +0 -8
  51. data/lib/wukong/and_pig/operators/compound.rb +0 -29
  52. data/lib/wukong/and_pig/operators/evaluators.rb +0 -7
  53. data/lib/wukong/and_pig/operators/execution.rb +0 -15
  54. data/lib/wukong/and_pig/operators/file_methods.rb +0 -29
  55. data/lib/wukong/and_pig/operators/foreach.rb +0 -98
  56. data/lib/wukong/and_pig/operators/groupies.rb +0 -212
  57. data/lib/wukong/and_pig/operators/load_store.rb +0 -65
  58. data/lib/wukong/and_pig/operators/meta.rb +0 -42
  59. data/lib/wukong/and_pig/operators/relational.rb +0 -129
  60. data/lib/wukong/and_pig/pig_struct.rb +0 -48
  61. data/lib/wukong/and_pig/pig_var.rb +0 -95
  62. data/lib/wukong/and_pig/symbol.rb +0 -29
  63. data/lib/wukong/and_pig/utils.rb +0 -0
data/INSTALL.textile ADDED
@@ -0,0 +1,89 @@
1
+ ---
2
+ layout: default
3
+ title: Install Notes
4
+ collapse: false
5
+ ---
6
+ h1(gemheader). {{ site.gemname }} %(small):: install%
7
+
8
+ ** "Get the code":#getcode
9
+ ** "Setup":#setup
10
+ ** "Installing and Running Wukong with Hadoop":#gethadoop
11
+ ** "Installing and Running Wukong with Datamapper, ActiveRecord, the command-line and more":#others
12
+
13
+
14
+ <notextile><div class="toggle"></notextile>
15
+
16
+ h2(#getcode). Get the code
17
+
18
+ Wukong is still under active development. The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/{{ site.gemname }}
19
+
20
+ pre. $ git clone git://github.com/mrflip/{{ site.gemname }}
21
+
22
+ A gem is available from "github:":http://gems.github.com
23
+
24
+ pre. $ sudo gem install mrflip-{{ site.gemname }} --source=http://gems.github.com
25
+
26
+ or from "gemcutter":http://gemcutter.org
27
+
28
+ pre. $ sudo gem install {{ site.gemname }} --source=http://gemcutter.org
29
+
30
+ You can instead download this project in either "zip":http://github.com/mrflip/{{ site.gemname }}/zipball/master or "tar":http://github.com/mrflip/{{ site.gemname }}/tarball/master formats.
31
+
32
+ h3. Get the Dependencies
33
+
34
+ * Hadoop, pig
35
+ * extlib, YAML, JSON
36
+ * Optional gems: trollop, addressable/uri, htmlentities
37
+
38
+ <notextile></div><div class="toggle"></notextile>
39
+
40
+ h2(#setup). Setup
41
+
42
+ 1. Allow Wukong to discover where his elephant friend lives by setting a $HADOOP_HOME environment variable: @export HADOOP_HOME="/usr/local/share/hadoop"@
43
+ 2. Add wukong's @bin/@ directory to your $PATH if you'd like to use the "wutils":wutils.html
44
+
45
+ <i>(see also: "Ruby Hadoop Quickstart":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart)</i>
46
+
47
+ <notextile></div><div class="toggle"></notextile>
48
+
49
+ h2(#gethadoop). Installing and Running Wukong with Hadoop
50
+
51
+ Wukong was primarily developed for Hadoop, and we think it's the best way to use Hadoop (it's certainly the most fun!).
52
+
53
+ h3. Run Wukong on the Amazon AWS EC2 Cloud
54
+
55
+ h3. Hadoop Infrastructure
56
+
57
+ Even if you have a bunch of machines with spare cycles, lots of RAM, and a shared filesystem... do yourself a favor and start out using the "Cloudera AMIs on Amazon's EC2 cloud.":http://www.cloudera.com/hadoop-ec2 There are an overwhelming number of fiddly little parameters and you'll be glad for the user experience before you get into server setup. If it's still mid-late 2009 when you read this, ignore prudence and jump straight to using Hadoop 0.20. It will be a) more fun, b) much more robust (trust me, at "v0.20" you want to live on the bleeding edge), and c) you won't have to suffer through migrating your HDFS two weeks after setup.
58
+
59
+ To set up hadoop, your best bet is the Cloudera AMIs on Amazon's EC2 compute cloud:
60
+
61
+ * http://www.cloudera.com/hadoop-ec2
62
+ * http://www.cloudera.com/hadoop-ec2-ebs-beta
63
+
64
+ EC2 means anyone with a $10 bill can rent a 10-machine cluster with 1TB of distributed storage for 8 hours.
65
+
66
+ h3. Run Wukong using Amazon AWS Elastic MapReduce
67
+
68
+ AWS Elastic MapReduce saves the trouble of even setting up a cluster: click, bam, there it is.
69
+
70
+ Phil Ripperger has prepared a "Ruby Hadoop Quickstart":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart explaining how to get started with Wukong, Hadoop and the Amazon Elastic MapReduce cloud -- it's better than anything we could put here. Thanks Phil!
71
+
72
+ h3. Set up a Hadoop cluster
73
+
74
+ If you have a local cluster, or just want to experiment with a single-machine install, check out the Cloudera packages for both Debian/Ubuntu-based and Redhat/RPM-based Linux systems.
75
+
76
+ h3. More Hadoop Notes
77
+
78
+ I've braindumped some random notes on configuring and using hadoop "over here":hadoop-tips.html
79
+
80
+ <notextile></div><div class="toggle"></notextile>
81
+
82
+ h2(#others). Wukong isn't just Hadoop: Datamapper, ActiveRecord, command-line usage and more
83
+
84
+ Wukong is used by many in a non-Hadoop environment -- anywhere you can stream data records, you can unleash its monkey power.
85
+
86
+ Please see the "usage notes":usage.html#playnice for more!
87
+
88
+
89
+ <notextile></div></notextile>
data/README.textile CHANGED
@@ -2,34 +2,53 @@ h1. Wukong
2
2
 
3
3
  Wukong makes "Hadoop":http://hadoop.apache.org/core so easy a chimpanzee can use it.
4
4
 
5
- Treat your dataset as a
6
-
5
+ Treat your dataset like a
7
6
  * stream of lines when it's efficient to process by lines
8
7
  * stream of field arrays when it's efficient to deal directly with fields
9
8
  * stream of lightweight objects when it's efficient to deal with objects
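As an illustration of the three views listed above, here is a minimal plain-Ruby sketch. It does not use Wukong's API: the @Visit@ struct, the helper method names, and the sample log data are all assumptions made for this example.

```ruby
require 'stringio'

# Hypothetical record type for illustration; Wukong's own structs may differ.
Visit = Struct.new(:ip, :path, :status)

# View 1: a lazy stream of lines, newline stripped.
def as_lines(io)
  io.each_line.lazy.map(&:chomp)
end

# View 2: a lazy stream of field arrays, split on tabs.
def as_fields(io)
  as_lines(io).map { |line| line.split("\t") }
end

# View 3: a lazy stream of lightweight objects built from the fields.
def as_objects(io, klass)
  as_fields(io).map { |fields| klass.new(*fields) }
end

# Sample usage with made-up, tab-separated log data.
sample = "10.0.0.1\t/index.html\t200\n10.0.0.2\t/missing\t404\n"
as_objects(StringIO.new(sample), Visit).each do |visit|
  puts "#{visit.ip} requested #{visit.path} (#{visit.status})"
end
```

Each view is built on the one before it, so a script can stop at whichever level is cheapest for the job at hand.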
10
9
 
11
- Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line.
10
+ Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line.
11
+
12
+ The **main documentation** lives on the "Wukong Pages.":http://mrflip.github.com/wukong Please feel free to add supplemental information to the "wukong wiki.":http://wiki.github.com/mrflip/wukong
13
+
14
+ * "Install and set up wukong":http://mrflip.github.com/wukong/INSTALL.html
15
+ * "Tutorial":http://mrflip.github.com/wukong/tutorial.html
16
+ * "Usage notes":http://mrflip.github.com/wukong/usage.html
17
+ * "Wutils":http://mrflip.github.com/wukong/wutils.html -- command-line utilities for working with data
18
+ * Links and tips for "configuring and working with hadoop":http://mrflip.github.com/wukong/hadoop-tips.html
19
+ * Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop)
20
+ * "More info":http://mrflip.github.com/wukong/moreinfo.html
12
21
 
13
- The main documentation -- including tutorials and tips for working with big data -- lives on the "Wukong Pages":http://mrflip.github.com/wukong and there is some supplemental information on the "wukong wiki.":http://wiki.github.com/mrflip/wukong
22
+ h2. Help!
23
+
24
+ Send Wukong questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
14
25
 
15
26
  h2. Install
16
27
 
17
- Wukong is still under active development. The newest version is available at
28
+ ** "Main Install and Setup Documentation":http://mrflip.github.com/wukong/INSTALL.html **
29
+
30
+ h3. Get the code
18
31
 
19
- http://github.com/mrflip/wukong
32
+ We're still actively developing {{ site.gemname }}. The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/{{ site.gemname }}
20
33
 
21
- A gem is available from "github:":http://gems.github.com
34
+ pre. $ git clone git://github.com/mrflip/{{ site.gemname }}
22
35
 
23
- gem install mrflip-wukong --source=http://gems.github.com
36
+ A gem is available from "gemcutter:":http://gemcutter.org/gems/{{ site.gemname }}
24
37
 
25
- or from "gemcutter":http://gemcutter.org
38
+ pre. $ sudo gem install {{ site.gemname }} --source=http://gemcutter.org
26
39
 
27
- gem install wukong --source=http://gemcutter.org
40
+ (don't use the gems.github.com version -- it's way out of date.)
28
41
 
29
- Phil Ripperger has prepared "instructions on getting wukong to work on the Amazon AWS cloud.":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart Thanks Phil!
42
+ You can instead download this project in either "zip":http://github.com/mrflip/{{ site.gemname }}/zipball/master or "tar":http://github.com/mrflip/{{ site.gemname }}/tarball/master formats.
43
+
44
+ h3. Dependencies and setup
45
+
46
+ To finish setting up, see the "detailed setup instructions":http://mrflip.github.com/wukong/INSTALL.html and then read the "usage notes":http://mrflip.github.com/wukong/usage.html
30
47
 
31
48
  h2. How to write a Wukong script
32
49
 
50
+ ** "Tutorial By Example":http://mrflip.github.com/wukong/tutorial.html **
51
+
33
52
  Here's a script to count words in a text stream:
34
53
 
35
54
  <pre><code> require 'wukong'
@@ -112,11 +131,7 @@ You can also use structs to treat your dataset as a stream of objects:
112
131
 
113
132
  h3. Advanced Patterns
114
133
 
115
- Wukong has a good collection of map/reduce patterns. For example, it's quite common to accumulate all records for a given key and emit some result based on the whole group.
116
-
117
- The AccumulatingReducer calls start! on the first record for each key, calls accumulate() on every example for that key (including the first), and calls finalize() once the last record for that key is seen.
118
-
119
- Here's an AccumulatingReducer that takes a long list of key-value pairs and emits, for each key, all its corresponding values in one line.
134
+ Wukong has a good collection of map/reduce patterns. Here's an AccumulatingReducer that takes a long list of key-value pairs and emits, for each key, all its corresponding values in one line.
120
135
 
121
136
  <pre><code> #
122
137
  # Roll up all values for each key into a single line
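The life cycle described in the removed README lines of this hunk (@start!@ on the first record for a key, @accumulate@ on every record including the first, @finalize@ after the last) can be sketched in standalone Ruby. This is not Wukong's AccumulatingReducer: the @KeyRollup@ class, its @run@ driver, and the method signatures are assumptions chosen only to show the control flow over key-sorted input.

```ruby
# Standalone sketch of the accumulate-by-key pattern.
# Illustrative only; this is NOT Wukong's implementation.
class KeyRollup
  attr_reader :output

  def initialize
    @output      = []
    @current_key = nil
    @values      = nil
  end

  # Called on the first record seen for each key.
  def start!(key)
    @current_key = key
    @values      = []
  end

  # Called on every record for the key, including the first.
  def accumulate(_key, value)
    @values << value
  end

  # Called once the last record for the key has been seen.
  def finalize
    @output << ([@current_key] + @values).join("\t")
  end

  # Drive the callbacks over [key, value] pairs that are already sorted by key.
  def run(sorted_pairs)
    sorted_pairs.each do |key, value|
      if key != @current_key
        finalize unless @current_key.nil?
        start!(key)
      end
      accumulate(key, value)
    end
    finalize unless @current_key.nil?
    @output
  end
end

pairs = [["@jerry", "@elaine"], ["@jerry", "@kramer"], ["@newman", "@jerry"]]
puts KeyRollup.new.run(pairs)
```

The sketch relies on the input being sorted by key, which is what the sort between the map and reduce phases provides.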
@@ -165,62 +180,6 @@ You'd end up with
165
180
  @newman @elaine @jerry @kramer
166
181
  </code></pre>
167
182
 
168
- h3. More info
169
-
170
- There are many useful examples (including an actually-useful version of the WordCount script) in examples/ directory.
171
-
172
- h2. Setup
173
-
174
- 1. Allow Wukong to discover where his elephant friend lives: either
175
-
176
- * set a @$HADOOP_HOME@ environment variable,
177
-
178
- * or create a file 'config/wukong-site.yaml' with a line that points to the top-level directory of your hadoop install:
179
-
180
- @:hadoop_home: /usr/local/share/hadoop@
181
-
182
- 2. Add wukong's @bin/@ directory to your $PATH, so that you may use its filesystem shortcuts.
183
-
184
- h2. How to run a Wukong script
185
-
186
- To run your script using local files and no connection to a hadoop cluster,
187
-
188
- @your/script.rb --run=local path/to/input_files path/to/output_dir@
189
-
190
- To run the command across a Hadoop cluster,
191
-
192
- @your/script.rb --run=hadoop path/to/input_files path/to/output_dir@
193
-
194
- You can set the default in the config/wukong-site.yaml file, and then just use @--run@ instead of @--run=something@ --it will just use the default run mode.
195
-
196
- If you're running @--run=hadoop@, all file paths are HDFS paths. If you're running @--run=local@, all file paths are local paths. (your/script path, of course, lives on the local filesystem).
197
-
198
- You can supply arbitrary command line arguments (they wind up as key-value pairs in the options path your mapper and reducer receive), and you can use the hadoop syntax to specify more than one input file:
199
-
200
- ./path/to/your/script.rb --any_specific_options --options=can_have_vals \
201
- --run "input_dir/part_*,input_file2.tsv,etc.tsv" path/to/output_dir
202
-
203
- Note that all @--options@ must precede (in any order) all non-options.
204
-
205
- h2. How to test your scripts
206
-
207
- To run mapper on its own:
208
-
209
- cat ./local/test/input.tsv | ./examples/word_count.rb --map | more
210
-
211
- or if your test data lies on the HDFS,
212
-
213
- hdp-cat test/input.tsv | ./examples/word_count.rb --map | more
214
-
215
- Next graduate to running @--run=local@ mode so you can inspect the reducer.
216
-
217
-
218
- h2. What's up with Wukong::AndPig?
219
-
220
- @Wukong::AndPig@ is a small library to more easily generate code for the
221
- "Pig":http://hadoop.apache.org/pig data analysis language. See its
222
- "README":wukong/and_pig/README.textile for more.
223
-
224
183
  h2. Why is it called Wukong?
225
184
 
226
185
  Hadoop, as you may know, is "named after a stuffed elephant.":http://en.wikipedia.org/wiki/Hadoop Since Wukong was started by the "infochimps":http://infochimps.org team, we needed a simian analog. A Monkey King who journeyed to the land of the Elephant seems to fit the bill:
@@ -231,6 +190,14 @@ bq. Sun Wukong possesses incredible strength, being able to lift his 13,500 jīn
231
190
 
232
191
  The "Jamie Hewlett / Damon Albarn short":http://news.bbc.co.uk/sport1/hi/olympics/monkey that the BBC made for their 2008 Olympics coverage gives the general idea.
233
192
 
234
- h2. What tools does Wukong work with?
193
+ h2. Credits
194
+
195
+ Wukong was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org) for the "infochimps project":http://infochimps.org
196
+
197
+ Patches submitted by:
198
+ * gemified by Ben Woosley (ben.woosley with the gmails)
199
+ * ruby interpreter path fix by "Yuichiro MASUI":http://github.com/masuidrive - masui at masuidrive.jp - http://blog.masuidrive.jp/
235
200
 
236
- Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line. We're looking forward to being friends with "martinis":http://datamapper.org and "express trains":http://wiki.rubyonrails.org/rails/pages/ActiveRecord down the road.
201
+ Thanks to:
202
+ * "Brad Heintz":http://www.bradheintz.com/no1thing/talks/ for his early feedback
203
+ * "Phil Ripperger":http://blog.pdatasolutions.com for his "wukong in the Amazon AWS cloud":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart tutorial.
data/docpages/INSTALL.textile ADDED
@@ -0,0 +1,94 @@
1
+ ---
2
+ layout: default
3
+ title: Install Notes
4
+ collapse: false
5
+ ---
6
+ h1(gemheader). {{ site.gemname }} %(small):: install%
7
+
8
+ ** "Get the code":#getcode
9
+ ** "Setup":#setup
10
+ ** "Installing and Running Wukong with Hadoop":#gethadoop
11
+ ** "Installing and Running Wukong with Datamapper, ActiveRecord, the command-line and more":#others
12
+
13
+
14
+ <notextile><div class="toggle"></notextile>
15
+
16
+ h2(#getcode). Get the code
17
+
18
+ We're still actively developing {{ site.gemname }}. The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/{{ site.gemname }}
19
+
20
+ pre. $ git clone git://github.com/mrflip/{{ site.gemname }}
21
+
22
+ A gem is available from "gemcutter:":http://gemcutter.org/gems/{{ site.gemname }}
23
+
24
+ pre. $ sudo gem install {{ site.gemname }} --source=http://gemcutter.org
25
+
26
+ (don't use the gems.github.com version -- it's way out of date.)
27
+
28
+ You can instead download this project in either "zip":http://github.com/mrflip/{{ site.gemname }}/zipball/master or "tar":http://github.com/mrflip/{{ site.gemname }}/tarball/master formats.
29
+
30
+ <notextile></div><div class="toggle"></notextile>
31
+
32
+ h3. Get the Dependencies
33
+
34
+ * Hadoop
35
+ * Pig (optional)
36
+ * Parts of {{ site.gemname }} require these gems:
37
+ ** addressable/uri
38
+ ** htmlentities
39
+ ** extlib
40
+ ** YAML
41
+ ** JSON
42
+
43
+ <notextile></div><div class="toggle"></notextile>
44
+
45
+ h2(#setup). Setup
46
+
47
+ 1. Allow Wukong to discover where his elephant friend lives by setting a $HADOOP_HOME environment variable: @export HADOOP_HOME="/usr/local/share/hadoop"@
48
+ 2. Add wukong's @bin/@ directory to your $PATH if you'd like to use the "wutils":wutils.html
49
+
50
+ <i>(see also: "Ruby Hadoop Quickstart":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart)</i>
51
+
52
+ <notextile></div><div class="toggle"></notextile>
53
+
54
+ h2(#gethadoop). Installing and Running Wukong with Hadoop
55
+
56
+ Wukong was primarily developed for Hadoop, and we think it's the best way to use Hadoop (it's certainly the most fun!).
57
+
58
+ h3. Run Wukong on the Amazon AWS EC2 Cloud
59
+
60
+ h3. Hadoop Infrastructure
61
+
62
+ Even if you have a bunch of machines with spare cycles, lots of RAM, and a shared filesystem... do yourself a favor and start out using the "Cloudera AMIs on Amazon's EC2 cloud.":http://www.cloudera.com/hadoop-ec2 There are an overwhelming number of fiddly little parameters and you'll be glad for the user experience before you get into server setup. If it's still mid-late 2009 when you read this, ignore prudence and jump straight to using Hadoop 0.20. It will be a) more fun, b) much more robust (trust me, at "v0.20" you want to live on the bleeding edge), and c) you won't have to suffer through migrating your HDFS two weeks after setup.
63
+
64
+ To set up hadoop, your best bet is the Cloudera AMIs on Amazon's EC2 compute cloud:
65
+
66
+ * http://www.cloudera.com/hadoop-ec2
67
+ * http://www.cloudera.com/hadoop-ec2-ebs-beta
68
+
69
+ EC2 means anyone with a $10 bill can rent a 10-machine cluster with 1TB of distributed storage for 8 hours.
70
+
71
+ h3. Run Wukong using Amazon AWS Elastic MapReduce
72
+
73
+ AWS Elastic MapReduce saves the trouble of even setting up a cluster: click, bam, there it is.
74
+
75
+ Phil Ripperger has prepared a "Ruby Hadoop Quickstart":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart explaining how to get started with Wukong, Hadoop and the Amazon Elastic MapReduce cloud -- it's better than anything we could put here. Thanks Phil!
76
+
77
+ h3. Set up a Hadoop cluster
78
+
79
+ If you have a local cluster, or just want to experiment with a single-machine install, check out the Cloudera packages for both Debian/Ubuntu-based and Redhat/RPM-based Linux systems.
80
+
81
+ h3. More Hadoop Notes
82
+
83
+ I've braindumped some random notes on configuring and using hadoop "over here":hadoop-tips.html
84
+
85
+ <notextile></div><div class="toggle"></notextile>
86
+
87
+ h2(#others). Wukong isn't just Hadoop: Datamapper, ActiveRecord, command-line usage and more
88
+
89
+ Wukong is used by many in a non-Hadoop environment -- anywhere you can stream data records, you can unleash its monkey power.
90
+
91
+ Please see the "usage notes":usage.html#playnice for more!
92
+
93
+
94
+ <notextile></div></notextile>
File without changes
@@ -1,3 +1,9 @@
1
+ ---
2
+ layout: default
3
+ title: mrflip.github.com/wukong - wu-lign utility
4
+ collapse: false
5
+ ---
6
+
1
7
  h1. wu-lign -- format a tab-separated file as aligned columns
2
8
 
3
9
  wu-lign will intelligently reformat a tab-separated file into a tab-separated, space aligned file that is still suitable for further processing. For example, given the log-file input
data/docpages/UsingWukong-part1-get_ready.textile ADDED
@@ -0,0 +1,17 @@
1
+ ---
2
+ layout: default
3
+ title: mrflip.github.com/wukong - Using Wukong and Wuclan, Part 1 - Setup
4
+ collapse: false
5
+ ---
6
+
7
+ h1. Using Wukong and Wuclan, Part 0 - Setup
8
+
9
+ Please follow the "installation and setup directions":setup.html for wukong, hadoop and a compute cluster.
10
+
11
+ h1. Using Wukong and Wuclan, Part 1 - Scraping
12
+
13
+ This part needs writing.
14
+
15
+ Later, it will tell you how to get a large corpus of data to use in part 2.
16
+
17
+ In the meantime check out http://mrflip.github.com/monkeyshines/ and http://mrflip.github.com/wuclan/ -- in particular the "Twitter Search Scraper":http://github.com/mrflip/wuclan/tree/master/examples/twitter/scrape_twitter_search/ example. We use this in production to gather and analyze tens of gigabytes of twitter conversations.
@@ -1,5 +1,12 @@
1
+ ---
2
+ layout: default
3
+ title: mrflip.github.com/wukong - Overview
4
+ collapse: false
5
+ ---
1
6
 
2
- h2. There's lots of data
7
+ h1. Thinking Big Data
8
+
9
+ h2. There's lots of data, Wukong and Hadoop can help
3
10
 
4
11
 
5
12
  There are two disruptive
@@ -13,9 +20,6 @@ There are two disruptive
13
20
  ** Old frontier computing: expensive, N log N, SUUUUUUCKS
14
21
  ** It's cheap, it's scaleable and it's fun
15
22
 
16
- h2. Wukong + Hadoop can help
17
-
18
-
19
23
  h2. == Map|Reduce ==
20
24
 
21
25
  h3. cat input.tsv | mapper.sh | sort | reducer.sh > output.tsv
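The pipeline in the heading above (map, then sort, then reduce) can be mimicked in a few lines of plain Ruby. This is a rough sketch using word count as the job; the method names are invented for illustration, and nothing here is Wukong or Hadoop API.

```ruby
# Map step: emit one [word, 1] pair per word (the role of mapper.sh).
def map_lines(lines)
  lines.flat_map { |line| line.split.map { |word| [word.downcase, 1] } }
end

# Shuffle step: sorting brings equal keys next to each other (the role of `sort`).
def shuffle(pairs)
  pairs.sort_by { |key, _count| key }
end

# Reduce step: sum each run of adjacent pairs sharing a key (the role of reducer.sh).
def reduce_counts(sorted_pairs)
  sorted_pairs
    .chunk_while { |a, b| a[0] == b[0] }
    .map { |run| [run[0][0], run.sum { |_key, count| count }] }
end

# The whole pipeline: cat | map | sort | reduce.
def word_count(lines)
  reduce_counts(shuffle(map_lines(lines)))
end

word_count(["the cat", "The hat"]).each do |word, count|
  puts "#{word}\t#{count}"
end
```

Because the reducer only ever compares neighbouring records, it never needs to hold more than one key's worth of data at a time, which is why the sort step matters.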
@@ -69,23 +73,3 @@ h2. == Mechanics, HDFS ==
69
73
 
70
74
  x M _
71
75
  _ M y
72
-
73
- h2. == More Reading ==
74
-
75
- h3. Hadoop
76
-
77
- * "Hadoop, The Definitive Guide":http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979
78
- * "":
79
-
80
- * "Cloudera Blog":http://www.cloudera.com/blog/
81
-
82
- h3. Hadoop|Streaming Frameworks
83
-
84
- * infochimps.org's "Wukong":http://github.com/mrflip/wukong -- ruby; object-oriented *and* record-oriented
85
- * NYTimes' "MRToolkit":http://code.google.com/p/mrtoolkit/ -- ruby; much more log-oriented
86
- * Freebase's "Happy":http://code.google.com/p/happy/ -- python; the most performant, as it can use Jython to make direct API calls.
87
- * Last.fm's "Dumbo":http://wiki.github.com/klbostee/dumbo -- python
88
-
89
- h3. Hadoop Infrastructure
90
-
91
- Even if you have a bunch of machines with spare cycles, lots of RAM, and a shared filesystem... do yourself a favor and start out using the "Cloudera AMIs on Amazon's EC2 cloud.":http://www.cloudera.com/hadoop-ec2 There are an overwhelming number of fiddly little parameters and you'll be glad for the user experience before you get into server setup. Actually, if it's still June 2009 when you read this, profile your scripts with Wukong on the command line and kill some time before Hadoop 0.20 comes out. It will be a) more fun, b) much more robust (trust me, at "v0.20" you want to live on the bleeding edge), and c) you won't have to suffer through migrating your HDFS two weeks after setup.
@@ -1,6 +1,12 @@
1
- h1. Using Wukong and Wuclan, Part 3 - Parsing
1
+ ---
2
+ layout: default
3
+ title: mrflip.github.com/wukong - Using Wukong and Wuclan, Part 3 - Parsing
4
+ collapse: false
5
+ ---
2
6
 
3
- In part 2 we begain a scraper to trawl our desired part of the social web. Now
7
+ h1. Using Wukong and Wuclan - Parsing
8
+
9
+ In part 1 we began a scraper to trawl our desired part of the social web. Now
4
10
  we're ready to start using Wukong to process the files.
5
11
 
6
12
  Files come off the wire as
data/docpages/_config.yml ADDED
@@ -0,0 +1,39 @@
1
+ ---
2
+ permalink: ":year-:month/:title.html"
3
+ markdown: rdiscount
4
+ pygments: true
5
+ auto: true
6
+ server: true
7
+ server_port: 4000
8
+ maruku:
9
+   use_tex: false
10
+   use_divs: false
11
+   png_dir: images/latex
12
+   png_url: /images/latex
13
+
14
+ header_ref: '.html' # .html for subdirs, / for main.
15
+ assets_path: '/' # http://github.mrflip.com
16
+
17
+ gemuser: mrflip
18
+ gemname: wukong
19
+ gemversion: 0.1.1
20
+ title: mrflip.github.com/wukong
21
+
22
+ keywords: [ 'wukong,hadoop,ruby,mrflip,infochimps,map,reduce,streaming,dumbo,happy,mrtoolkit,script,simple' ]
23
+ description: "Wukong: Hadoop made so easy a Chimpanzee could run it."
24
+ header_files:
25
+ - INSTALL
26
+ - LICENSE
27
+ - usage
28
+ - wutils
29
+ - moreinfo
30
+ - tutorial
31
+
32
+ credits:
33
+ <p>Wukong image courtesy
34
+ <a href="http://www.curtbusse.com/okavango/page1/oka1.html">Curt Busse</a> under
35
+ an <a href="http://www.curtbusse.com/copyright.html">open license</a>.
36
+ It's a Chacma Baboon from the Okavango site. Make sure to read the
37
+ <a href="http://www.curtbusse.com/okavango/page1/oka1.html#note3">story at the bottom of that page</a>.
38
+ </p>
39
+