wukong 0.1.4 → 1.4.0

Files changed (63)
  1. data/INSTALL.textile +89 -0
  2. data/README.textile +41 -74
  3. data/docpages/INSTALL.textile +94 -0
  4. data/{doc → docpages}/LICENSE.textile +0 -0
  5. data/{doc → docpages}/README-wulign.textile +6 -0
  6. data/docpages/UsingWukong-part1-get_ready.textile +17 -0
  7. data/{doc/overview.textile → docpages/UsingWukong-part2-ThinkingBigData.textile} +8 -24
  8. data/{doc → docpages}/UsingWukong-part3-parsing.textile +8 -2
  9. data/docpages/_config.yml +39 -0
  10. data/{doc/tips.textile → docpages/bigdata-tips.textile} +71 -44
  11. data/{doc → docpages}/code/api_response_example.txt +0 -0
  12. data/{doc → docpages}/code/parser_skeleton.rb +0 -0
  13. data/{doc/intro_to_map_reduce → docpages/diagrams}/MapReduceDiagram.graffle +0 -0
  14. data/docpages/favicon.ico +0 -0
  15. data/docpages/gem.css +16 -0
  16. data/docpages/hadoop-tips.textile +83 -0
  17. data/docpages/index.textile +90 -0
  18. data/docpages/intro.textile +8 -0
  19. data/docpages/moreinfo.textile +174 -0
  20. data/docpages/news.html +24 -0
  21. data/{doc → docpages}/pig/PigLatinExpressionsList.txt +0 -0
  22. data/{doc → docpages}/pig/PigLatinReferenceManual.html +0 -0
  23. data/{doc → docpages}/pig/PigLatinReferenceManual.txt +0 -0
  24. data/docpages/tutorial.textile +283 -0
  25. data/docpages/usage.textile +195 -0
  26. data/docpages/wutils.textile +263 -0
  27. data/wukong.gemspec +80 -50
  28. metadata +87 -54
  29. data/doc/INSTALL.textile +0 -41
  30. data/doc/README-tutorial.textile +0 -163
  31. data/doc/README-wutils.textile +0 -128
  32. data/doc/TODO.textile +0 -61
  33. data/doc/UsingWukong-part1-setup.textile +0 -2
  34. data/doc/UsingWukong-part2-scraping.textile +0 -2
  35. data/doc/hadoop-nfs.textile +0 -51
  36. data/doc/hadoop-setup.textile +0 -29
  37. data/doc/index.textile +0 -124
  38. data/doc/links.textile +0 -42
  39. data/doc/usage.textile +0 -102
  40. data/doc/utils.textile +0 -48
  41. data/examples/and_pig/sample_queries.rb +0 -128
  42. data/lib/wukong/and_pig.rb +0 -62
  43. data/lib/wukong/and_pig/README.textile +0 -12
  44. data/lib/wukong/and_pig/as.rb +0 -37
  45. data/lib/wukong/and_pig/data_types.rb +0 -30
  46. data/lib/wukong/and_pig/functions.rb +0 -50
  47. data/lib/wukong/and_pig/generate.rb +0 -85
  48. data/lib/wukong/and_pig/generate/variable_inflections.rb +0 -82
  49. data/lib/wukong/and_pig/junk.rb +0 -51
  50. data/lib/wukong/and_pig/operators.rb +0 -8
  51. data/lib/wukong/and_pig/operators/compound.rb +0 -29
  52. data/lib/wukong/and_pig/operators/evaluators.rb +0 -7
  53. data/lib/wukong/and_pig/operators/execution.rb +0 -15
  54. data/lib/wukong/and_pig/operators/file_methods.rb +0 -29
  55. data/lib/wukong/and_pig/operators/foreach.rb +0 -98
  56. data/lib/wukong/and_pig/operators/groupies.rb +0 -212
  57. data/lib/wukong/and_pig/operators/load_store.rb +0 -65
  58. data/lib/wukong/and_pig/operators/meta.rb +0 -42
  59. data/lib/wukong/and_pig/operators/relational.rb +0 -129
  60. data/lib/wukong/and_pig/pig_struct.rb +0 -48
  61. data/lib/wukong/and_pig/pig_var.rb +0 -95
  62. data/lib/wukong/and_pig/symbol.rb +0 -29
  63. data/lib/wukong/and_pig/utils.rb +0 -0
data/{doc/tips.textile → docpages/bigdata-tips.textile} CHANGED
@@ -1,5 +1,21 @@
+ ---
+ layout: default
+ title: mrflip.github.com/wukong - Lessons Learned working with Big Data
+ collapse: false
+ ---

- h3. Don't Drop ACID while exploring Big Data
+ h2. Random Thoughts on Big Data
+
+ Stuff changes when you cross the 100GB barrier. Here are random musings on why it might make sense to:
+
+ * Sort everything
+ * Don't do any error handling
+ * Catch errors and emit them along with your data
+ * Make everything ASCII
+ * Abandon integer keys
+ * Use bash as your data-analysis IDE.
+
+ h2(#dropacid). Drop ACID, explore Big Data

  The traditional "ACID quartet":http://en.wikipedia.org/wiki/ACID for relational databases can be re-interpreted in a Big Data context:

@@ -18,7 +34,7 @@ Finally, where possible leave things in sort order by some appropriate index. Cl

  Note: for files that will live on the DFS, you should usually *not* do a total sort,

- h3. If it's not broken, it's wrong
+ h2. If it's not broken, it's wrong

  Something that goes wrong one in a five million times will crop up hundreds of times in a billion-record collection.

@@ -26,38 +42,6 @@ h3. Error is not normally distributed

  What's more, errors introduced will not in general be normally distributed and their impact may not decrease with increasing data size.

- h3. Encode once, and carefully.
-
- Encoding violates idempotence. Data brought in from elsewhere *must* be considered unparsable, ill-formatted and rife with illegal characters.
-
- * Immediately fix a copy of the original data with as minimal encoding as possible.
- * Follow this with a separate parse stage to emit perfectly well-formed, tab-separated / newline delimited data
- * In this parse stage, encode the data to 7-bits, free of internal tabs, backslashes, carriage return/line feed or control characters. You want your encoding scheme to be
- ** perfectly reversible
- ** widely implemented
- ** easily parseable
- ** recognizable: incoming data that is mostly inoffensive (a json record, or each line of a document such as this one) should be minimally altered from its original. This lets you do rough exploration with sort/cut/grep and friends.
- ** !! Involve **NO QUOTING**, only escaping. I can write a simple regexp to decode entities such as %10, \n or &#10;. This regexp will behave harmlessly with ill-formed data (eg %%10 or &&; or \ at end of line) and is robust against data being split or interpolated. Schemes such as "quoting: it's bad", %Q{quoting: "just say no"} or <notextile><notextile>tagged markup</notextile></notextile> require a recursive parser. An extra or missing quote mark is almost impossible to backtrack. And av
-
- In the absence of some lightweight, mostly-transparent, ASCII-compatible *AND* idempotent encoding scheme lurking in a back closet of some algorithms book -- how to handle the initial lousy payload coming off the wire?
-
- * For data that is *mostly* text in a western language, you'll do well wiht XML encoding (with <notextile>[\n\r\t\\]</notextile> forced to encode as entities)
- * URL encoding isn't as recognizable, but is also safe. Use this for things like URIs and filenames, or if you want to be /really/ paranoid about escaping.
- * For binary data, Binhex is efficient enough and every toolkit can handle it. There are more data-efficient ascii-compatible encoding schemes but it's not worth the hassle for the 10% or whatever gain in size.
- * If your payload itself is XML data, consider using \0 (nul) between records, with a fixed number of tab-separated metadata fields leading the XML data, which can then include tabs, newlines, or whatever the hell it wants. No changes are made to the data apart from a quick gsub to remove any (highly illegal) \0 in the XML data itself. A later parse round will convert it to structured hadoop-able data. Ex:
-
- {% highlight html %}
- feed_request 20090809101112 200 OK <?xml version='1.0' encoding='utf-8' ?>
- <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
- <html lang='en' xml:lang='en' xmlns='http://www.w3.org/1999/xhtml'>
- <head>
- <title>infochimps.org &mdash; Find Any Dataset in the World</title>
- {% endhighlight %}
-
- p. Many of the command line utilities (@cat@, @grep@, etc.) will accept nul-delimited files.
-
- You may be tempted to use XML around your XML so you can XML while you XML, but this is ultimately only done right by parsing or scrubbing the inputm and at that point you should just translate directly to a reasonable tab/newline format. (Even if that format is tsv-compatible JSON).
-
  h3. Do your exception handling in-band

  A large, heavily-used cluster will want to have ganglia or "scribe":http://www.cloudera.com/blog/2008/11/02/configuring-and-using-scribe-for-hadoop-log-collection/ or the like collecting and managing log data. "Splunk":http://www.splunk.com/ is a compelling option I haven't myself used, but it is "broadly endorsed.":http://www.igvita.com/2008/10/22/distributed-logging-syslog-ng-splunk/
@@ -66,14 +50,7 @@ However, it's worth considering another extremely efficient, simple and powerful

  Wukong gives you a BadRecord class -- just rescue errors, pass the full or partial contents of the offending input, and emit the BadRecord instance in-band. They'll be serialized out along with the rest, and at your preference can be made to reduce to a single instance. Do analysis on them at your leisure; by default, any StructStreamer will silently discard *inbound* BadRecords -- they won't survive past the current generation.

- h3. Don't be afraid to use the command line as an IDE
-
- %{ highlight sh %}
- cat /data/foo.tsv | ruby -ne 'puts $_.chomp.scan(/text="([^"]+)"/).join("\t")'
- {% endhighlight %}
-
-
- h3. Keys
+ h2(#keys). Keys

  * Artificial key: assigned externally, key is not a function of the object's intrinsic values. A social security number is an artificial key. Artificial

@@ -91,7 +68,6 @@ h4. Natural keys are right for big data

  Synthetic keys suck. They demand locality or a central keymaster.

-
  * Use the natural key
  * Hash the natural key. This has some drawbacks

@@ -112,5 +88,56 @@ How do you get a unique prefix?
  fact that uniqueness was achieved. Use the birthday party formula to find out
  how often this will happen. (In practice, almost never.)

+ You can consider your fields to be one of three types:
+
+ * Key
+ ** natural: a unique username, a URL, the MD5 hash of a URL
+ ** synthetic: an integer generated by some central keymaster
+ * Mutable:
+ ** eg a user’s ‘bio’ section.
+ * Immutable:
+ ** A user’s created_at date is immutable: it doesn’t help identify the person but it will never change.
+
+ The meaning of a key depends on its semantics. Is a URL a key?
+
+ * A location: (compare: "The head of household residing at 742 Evergreen Terr, Springfield USA")
+ * An entity handle (URI): (compare: "Homer J Simpson (aka Max Power)")
+ * An observation of that entity: Many URLs are handles to a __stream__ -- http://twitter.com/mrflip names the resource "mrflip's twitter stream", but loading that page offers only the last 20 entries in that stream. (compare: "The collection of all words spoken by the residents of 742 Evergreen Terr, Springfield USA")
+
+ h2(#bashide). The command line is an IDE
+
+ {% highlight sh %}
+ cat /data/foo.tsv | ruby -ne 'puts $_.chomp.scan(/text="([^"]+)"/).join("\t")'
+ {% endhighlight %}
+
+ h2(#encoding). Encode once, and carefully.
+
+ Encoding violates idempotence. Data brought in from elsewhere *must* be considered unparsable, ill-formatted and rife with illegal characters.
+
+ * Immediately fix a copy of the original data with as minimal encoding as possible.
+ * Follow this with a separate parse stage to emit perfectly well-formed, tab-separated / newline delimited data
+ * In this parse stage, encode the data to 7-bits, free of internal tabs, backslashes, carriage return/line feed or control characters. You want your encoding scheme to be
+ ** perfectly reversible
+ ** widely implemented
+ ** easily parseable
+ ** recognizable: incoming data that is mostly inoffensive (a json record, or each line of a document such as this one) should be minimally altered from its original. This lets you do rough exploration with sort/cut/grep and friends.
+ ** !! Involve **NO QUOTING**, only escaping. I can write a simple regexp to decode entities such as %10, \n or &#10;. This regexp will behave harmlessly with ill-formed data (eg %%10 or &&; or \ at end of line) and is robust against data being split or interpolated. Schemes such as "quoting: it's bad", %Q{quoting: "just say no"} or <notextile><notextile>tagged markup</notextile></notextile> require a recursive parser. An extra or missing quote mark is almost impossible to backtrack. And av
+
+ In the absence of some lightweight, mostly-transparent, ASCII-compatible *AND* idempotent encoding scheme lurking in a back closet of some algorithms book -- how to handle the initial lousy payload coming off the wire?
+
+ * For data that is *mostly* text in a western language, you'll do well with XML encoding (with <notextile>[\n\r\t\\]</notextile> forced to encode as entities)
+ * URL encoding isn't as recognizable, but is also safe. Use this for things like URIs and filenames, or if you want to be /really/ paranoid about escaping.
+ * For binary data, Binhex is efficient enough and every toolkit can handle it. There are more data-efficient ascii-compatible encoding schemes but it's not worth the hassle for the 10% or whatever gain in size.
+ * If your payload itself is XML data, consider using \0 (nul) between records, with a fixed number of tab-separated metadata fields leading the XML data, which can then include tabs, newlines, or whatever the hell it wants. No changes are made to the data apart from a quick gsub to remove any (highly illegal) \0 in the XML data itself. A later parse round will convert it to structured hadoop-able data. Ex:
+
+ {% highlight html %}
+ feed_request 20090809101112 200 OK <?xml version='1.0' encoding='utf-8' ?>
+ <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
+ <html lang='en' xml:lang='en' xmlns='http://www.w3.org/1999/xhtml'>
+ <head>
+ <title>infochimps.org &mdash; Find Any Dataset in the World</title>
+ {% endhighlight %}
+
+ p. Many of the command line utilities (@cat@, @grep@, etc.) will accept nul-delimited files.

- Working with records that change over time,
+ You may be tempted to use XML around your XML so you can XML while you XML. Ultimately, you'll find this can only be done right by doing a full parse of the input -- and at that point you should just translate directly to a reasonable tab/newline format. (Even if that format is tsv-compatible JSON).
File without changes
File without changes
Binary file
data/docpages/gem.css ADDED
@@ -0,0 +1,16 @@
+ #header a { color: #00a; }
+ hr { border-color: #66a ; }
+ h2 { border-color: #acc ; }
+ h1 { border-color: #acc ; }
+ .download { border-color: #acc ; }
+ #footer { border-color: #a0e0e8 ; }
+
+ #header a { margin-left:0.125em; margin-right:0.125em; }
+ h1.gemheader {
+ margin: -30px 0 0.5em -65px ;
+ text-indent: 65px ;
+ height: 90px ;
+ padding: 50px 0 10px 0px;
+ background: url('/images/wukong.png') no-repeat 0px 0px ;
+ }
+ .quiet { font-size: 0.85em ; color: #777 ; font-style: italic }
data/docpages/hadoop-tips.textile ADDED
@@ -0,0 +1,83 @@
+ ---
+ layout: default
+ title: mrflip.github.com/wukong - NFS on Hadoop FTW
+ collapse: false
+ ---
+
+ h2. Hadoop Config Tips
+
+ h3(#hadoopnfs). Setup NFS within the cluster
+
+ If you're lazy, I recommend setting up NFS -- it makes dispatching simple config and script files much easier. (And if you're not lazy, what the hell are you doing using Wukong?). Be careful though -- used unwisely, a swarm of NFS requests will mount a devastatingly effective denial of service attack on your poor old master node.
+
+ Installing NFS to share files along the cluster gives the following conveniences:
+ * You don't have to bundle everything up with each run: any path in ~coder/ will refer back via NFS to the filesystem on master.
+ * The user can now passwordless ssh among the nodes, since there's only one shared home directory and since we included the user's own public key in the authorized_keys2 file. This lets you easily rsync files among the nodes.
+
+ First, you need to take note of the _internal_ name for your master, perhaps something like @domU-xx-xx-xx-xx-xx-xx.compute-1.internal@.
+
+ As root, on the master (change @compute-1.internal@ to match your setup):
+
+ <pre>
+ apt-get install nfs-kernel-server
+ echo "/home *.compute-1.internal(rw)" >> /etc/exports ;
+ /etc/init.d/nfs-kernel-server stop ;
+ </pre>
+
+ (The @*.compute-1.internal@ part limits host access, but you should take a look at the security settings of both EC2 and the built-in portmapper as well.)
+
+ Next, set up a regular user account on the *master only*. In this case our user will be named 'chimpy':
+
+ <pre>
+ visudo # uncomment the last line, to allow group sudo to sudo
+ groupadd admin
+ adduser chimpy
+ usermod -a -G sudo,admin chimpy
+ su chimpy # now you are the new user
+ ssh-keygen -t rsa # accept all the defaults
+ cat ~/.ssh/id_rsa.pub # can paste this public key into your github, etc
+ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys2
+ </pre>
+
+ Then on each slave (replacing domU-xx-... by the internal name for the master node):
+
+ <pre>
+ apt-get install nfs-common ;
+ echo "domU-xx-xx-xx-xx-xx-xx.compute-1.internal:/home /mnt/home nfs rw 0 0" >> /etc/fstab
+ /etc/init.d/nfs-common restart
+ mkdir /mnt/home
+ mount /mnt/home
+ ln -s /mnt/home/chimpy /home/chimpy
+ </pre>
+
+ You should now be in business.
+
+ Performance tradeoffs should be small as long as you're just sending code files and gems around. *Don't* write out log entries or data to NFS partitions, or you'll effectively perform a denial-of-service attack on the master node.
+
+ * http://nfs.sourceforge.net/nfs-howto/ar01s03.html
+ * The "Setting up an NFS Server HOWTO":http://nfs.sourceforge.net/nfs-howto/index.html was an immense help, and I recommend reading it carefully.
+
+ h3(#awstools). Tools for EC2 and S3 Management
+
+ * http://s3sync.net/wiki
+ * http://jets3t.s3.amazonaws.com/applications/applications.html#uploader
+ * "ElasticFox"
+ * "S3Fox (S3 Organizer)":
+ * "FoxyProxy":
+
+
+ h3. Random EC2 notes
+
+ * "How to Mount EBS volume at launch":http://clouddevelopertips.blogspot.com/2009/08/mount-ebs-volume-created-from-snapshot.html
+
+ * The Cloudera AMIs and distribution include BZip2 support. This means that if you have input files with a .bz2 extension, they will be naturally un-bzipped and streamed. (Note that there is a non-trivial penalty for doing so: each bzip'ed file must go, in whole, to a single mapper; and the CPU load for un-bzipping is sizeable.)
+
+ * To _produce_ bzip2 files, specify the @--compress_output=@ flag. If you have the BZip2 patches installed, you can give @--compress_output=bz2@; everyone should be able to use @--compress_output=gz@.
+
+ * For excellent performance you can patch your install for "Parallel LZO Splitting":http://www.cloudera.com/blog/2009/06/24/parallel-lzo-splittable-compression-for-hadoop/
+
+ * If you're using XFS, consider setting the nobarrier option
+ /dev/sdf /mnt/data2 xfs noatime,nodiratime,nobarrier 0 0
+
+ * The first write to any disk location is about 5x slower than later writes. Explanation, and how to pre-soften a volume, here: http://docs.amazonwebservices.com/AWSEC2/latest/DeveloperGuide/index.html?instance-storage.html
+
data/docpages/index.textile ADDED
@@ -0,0 +1,90 @@
+ ---
+ layout: default
+ title: mrflip.github.com/wukong
+ collapse: true
+ ---
+ h1(gemheader). wukong %(small):: hadoop made easy%
+
+ p(description). {{ site.description }}
+
+ Treat your dataset like a
+ * stream of lines when it's efficient to process by lines
+ * stream of field arrays when it's efficient to deal directly with fields
+ * stream of lightweight objects when it's efficient to deal with objects
+
+ Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line.
+
+ Send Wukong questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
+
+ <notextile><div class="toggle"></notextile>
+
+ h2. Documentation index
+
+ * "Install and set up wukong":INSTALL.html
+ ** "Get the code":INSTALL.html#getcode
+ ** "Setup":INSTALL.html#setup
+ ** "Installing and Running Wukong with Hadoop":INSTALL.html#gethadoop
+ ** "Installing and Running Wukong with Datamapper, ActiveRecord, the command-line and more":INSTALL.html#others
+
+ * "Tutorial":tutorial.html
+ ** "Count Words":tutorial.html#wordcount
+ ** "Structured data":tutorial.html#structstream
+ ** "Accumulators":tutorial.html#accumulators including a UniqByLastReducer and a GroupBy reducer.
+
+ * "Usage notes":usage.html
+ ** "How to run a Wukong script":usage.html#running
+ ** "How to test your scripts":usage.html#testing
+ ** "Wukong Plays nicely with others":usage.html#playnice
+ ** "Schema export":usage.html#schema_export to Pig and SQL
+ ** "Using wukong with internal streaming":usage.html#stayinruby
+ ** "Using wukong to Batch-Process ActiveRecord Objects":usage.html#activerecord
+
+ * "Wutils":wutils.html -- command-line utilities for working with data from the command line
+ ** "Overview of wutils":wutils.html#wutils -- command listing
+ ** "Stupid command-line tricks":wutils.html#cmdlinetricks using the wutils
+ ** "wu-lign":wutils.html#wulign -- present a tab-separated file as aligned columns
+ ** Dear Lazyweb, please build us a "tab-oriented version of the Textutils library":wutils.html#wutilsinc
+
+ * Links and tips for "configuring and working with hadoop":hadoop-tips.html
+ * Some opinionated "thoughts on working with big data,":bigdata-tips.html on why you should drop acid, treat exceptions as records, and happily embrace variable-length strings as primary keys.
+ * Wukong is licensed under the "Apache License":LICENSE.html (same as Hadoop)
+
+ * "More info":moreinfo.html
+ ** "Why is it called Wukong?":moreinfo.html#name
+ ** "Don't Use Wukong, use this instead":moreinfo.html#whateverdude
+ ** "Further Reading and useful links":moreinfo.html#links
+ ** "Note on Patches/Pull Requests":moreinfo.html#patches
+ ** "What's up with Wukong::AndPig?":moreinfo.html#andpig
+ ** "Map/Reduce Algorithms":moreinfo.html#algorithms
+ ** "TODOs":moreinfo.html#TODO
+
+ * Work in progress: an intro to data processing with wukong:
+ ** "Part 1, Get Ready":UsingWukong-part1-getready.html
+ ** "Part 2, Thinking Big Data":UsingWukong-part2-ThinkingBigData.html
+ ** "Part 3, Parsing":UsingWukong-part3-parsing.html
+
+ <notextile></div></notextile>
+
+ {% include intro.textile %}
+
+ <notextile><div class="toggle"></notextile>
+
+ h2. Credits
+
+ Wukong was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org) for the "infochimps project":http://infochimps.org
+
+ Patches submitted by:
+ * gemified by Ben Woosley (ben.woosley@gmail.com)
+ * ruby interpreter path fix by "Yuichiro MASUI":http://github.com/masuidrive - masui@masuidrive.jp - http://blog.masuidrive.jp/
+
+ Thanks to:
+ * "Brad Heintz":http://www.bradheintz.com/no1thing/talks/ for his early feedback
+ * "Phil Ripperger":http://blog.pdatasolutions.com for his "wukong in the Amazon AWS cloud":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart tutorial.
+
+ <notextile><div class="toggle"></notextile>
+
+ h2. Help!
+
+ Send Wukong questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
+
+ <notextile></div></notextile>
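The three views listed above -- stream of lines, stream of field arrays, stream of lightweight objects -- can be illustrated with plain Hadoop-streaming-style Ruby, independent of Wukong's own streamer classes (the @Tweet@ struct and its fields are made up for illustration):

{% highlight ruby %}
# The same dataset seen three ways: lines, field arrays, lightweight objects.
Tweet = Struct.new(:user, :timestamp, :text)

STDIN.each_line do |line|          # 1. stream of lines
  fields = line.chomp.split("\t")  # 2. stream of field arrays
  tweet  = Tweet.new(*fields)      # 3. stream of lightweight objects
  puts [tweet.user, tweet.text.to_s.length].join("\t")
end
{% endhighlight %}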
data/docpages/intro.textile ADDED
@@ -0,0 +1,8 @@
+ ---
+ layout: default
+ title: mrflip.github.com/wukong
+ collapse: false
+ ---
+ h1(gemheader). Intro %(small):: 3 simple examples%
+
+ {% include intro.textile %}
data/docpages/moreinfo.textile ADDED
@@ -0,0 +1,174 @@
+ ---
+ layout: default
+ title: mrflip.github.com/wukong - TODO
+ collapse: false
+ ---
+
+
+ h1(gemheader). Wukong More Info
+
+ ** "Why is it called Wukong?":#name
+ ** "Don't Use Wukong, use this instead":#whateverdude
+ ** "Further Reading and useful links":#links
+ ** "Note on Patches/Pull Requests":#patches
+ ** "What's up with Wukong::AndPig?":#andpig
+ ** "Map/Reduce Algorithms":#algorithms
+ ** "TODOs":#TODO
+
+
+ <notextile><div class="toggle"></notextile>
+
+ h2(#name). Why is it called Wukong?
+
+ Hadoop, as you may know, is "named after a stuffed elephant.":http://en.wikipedia.org/wiki/Hadoop Since Wukong was started by the "infochimps":http://infochimps.org team, we needed a simian analog. A Monkey King who journeyed to the land of the Elephant seems to fit the bill:
+
+ bq. Sun Wukong (孙悟空), known in the West as the Monkey King, is the main character in the classical Chinese epic novel Journey to the West. In the novel, he accompanies the monk Xuanzang on the journey to retrieve Buddhist sutras from India.
+
+ bq. Sun Wukong possesses incredible strength, being able to lift his 13,500 jīn (8,100 kg) Ruyi Jingu Bang with ease. He also has superb speed, traveling 108,000 li (54,000 kilometers) in one somersault. Sun knows 72 transformations, which allows him to transform into various animals and objects; he is, however, shown with slight problems transforming into other people, since he is unable to complete the transformation of his tail. He is a skilled fighter, capable of holding his own against the best generals of heaven. Each of his hairs possesses magical properties, and is capable of transforming into a clone of the Monkey King himself, or various weapons, animals, and other objects. He also knows various spells in order to command wind, part water, conjure protective circles against demons, freeze humans, demons, and gods alike. -- ["Sun Wukong's Wikipedia entry":http://en.wikipedia.org/wiki/Wukong]
+
+ The "Jaime Hewlett / Damon Albarn short":http://news.bbc.co.uk/sport1/hi/olympics/monkey that the BBC made for their 2008 Olympics coverage gives the general idea.
+
+ <notextile></div><div class="toggle"></notextile>
+
+ h2(#algorithms). Map/Reduce Algorithms
+
+ Example graph scripts:
+
+ * Multigraph
+ * Pagerank (done)
+ * Breadth-first search
+ * Triangle enumeration
+ * Clustering
+
+ h3. K-Nearest Neighbors
+
+ More example hadoop algorithms:
+ * Bigram counts: http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/exercises/bigrams.html
+ * Inverted index construction: http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/exercises/indexer.html
+ * Pagerank : http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/exercises/pagerank.html
+ * SIPs, Median, classifiers and more : http://matpalm.com/
+ * Brad Heintz's "Distributed Computing with Ruby":http://www.bradheintz.com/no1thing/talks/ demonstrates Travelling Salesman in map/reduce.
+
+ * "Clustering billions of images with large scale nearest neighbor search":http://scholar.google.com/scholar?cluster=2473742255769621469&hl=en uses three map/reduce passes:
+ ** Subsample to build a "spill tree" that roughly localizes each object
+ ** Use the spill tree on the full dataset to group each object with its potential neighbors
+ ** Calculate the metrics and emit only the k-nearest neighbors
+
+ Example scripts (from http://www.cloudera.com/resources/learning-mapreduce):
+
+ 1. Find the [number of] hits by 5 minute timeslot for a website given its access logs.
+ 2. Find the pages with over 1 million hits in a day for a website given its access logs.
+ 3. Find the pages that link to each page in a collection of webpages.
+ 4. Calculate the proportion of lines that match a given regular expression for a collection of documents.
+ 5. Sort tabular data by a primary and secondary column.
+ 6. Find the most popular pages for a website given its access logs.
+
+ <notextile></div><div class="toggle"></notextile>
+
+ h2(#whateverdude). Don't Use Wukong, use this instead
+
+ There are several worthy Hadoop|Streaming Frameworks:
+
+ * infochimps.org's "Wukong":http://github.com/mrflip/wukong -- ruby; object-oriented *and* record-oriented
+ * NYTimes' "MRToolkit":http://code.google.com/p/mrtoolkit/ -- ruby; much more log-oriented
+ * Freebase's "Happy":http://code.google.com/p/happy/ -- python; the most performant, as it can use Jython to make direct API calls.
+ * Last.fm's "Dumbo":http://wiki.github.com/klbostee/dumbo -- python
+
+ Most people use Wukong / one of the above (or straight Java Hadoop, poor souls) for heavy lifting, and several of the following hadoop tools for efficiency:
+
+ * Pig OR
+ * Hive -- hive is more SQL-ish, Pig is more elegant (in a brushed-metal kind of way). I greatly prefer Pig, because I hate SQL; you may feel differently.
+ * Sqoop
+ * Mahout
+
+ <notextile></div><div class="toggle"></notextile>
+
+ h2(#links). Further Reading and useful links:
+
+ * "Ruby Hadoop Quickstart":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart - dive right in with Wukong, Hadoop and the Amazon Elastic MapReduce cloud. Once you get bored with the command line, this is the fastest path to Wukong power.
+ * "Distributed Computing with Ruby":http://www.bradheintz.com/no1thing/talks/ has some raw ruby, some Wukong and some JRuby/Hadoop integration -- it demonstrates a Travelling Salesman in map/reduce. Cool!
+
+ * "Hadoop, The Definitive Guide":http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979
+
+ * "Running Hadoop On Ubuntu Linux (Single-Node Cluster)":http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster) and "Running Hadoop On Ubuntu Linux (Multi-Node Cluster).":http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
+ * "Running Hadoop MapReduce on Amazon EC2 and S3":http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873
+
+ * "Hadoop Overview by Doug Cutting":http://video.google.com/videoplay?docid=-4912926263813234341 - the founder of the Hadoop project. (49m video)
+
+ * "Cluster Computing and Map|Reduce":http://www.youtube.com/results?search_query=cluster+computing+and+mapreduce
+ ** "Lecture 1: Overview":http://www.youtube.com/watch?v=yjPBkvYh-ss
+ ** "Lecture 2 (technical): Map|Reduce":http://www.youtube.com/watch?v=-vD6PUdf3Js
+ ** "Lecture 3 (technical): GFS (Google File System)":http://www.youtube.com/watch?v=5Eib_H_zCEY
+ ** "Lecture 4 (theoretical): Canopy Clustering":http://www.youtube.com/watch?v=1ZDybXl212Q
+ ** "Lecture 5 (theoretical): Breadth-First Search":http://www.youtube.com/watch?v=BT-piFBP4fE
+
+ * "Cloudera Hadoop Training:":http://www.cloudera.com/hadoop-training
+ ** "Thinking at Scale":http://www.cloudera.com/hadoop-training-thinking-at-scale
+ ** "Mapreduce and HDFS":http://www.cloudera.com/hadoop-training-mapreduce-hdfs
+ ** "A Tour of the Hadoop Ecosystem":http://www.cloudera.com/hadoop-training-ecosystem-tour
+ ** "Programming with Hadoop":http://www.cloudera.com/hadoop-training-programming-with-hadoop
+ ** "Hadoop and Hive: introduction":http://www.cloudera.com/hadoop-training-hive-introduction
+ ** "Hadoop and Hive: tutorial":http://www.cloudera.com/hadoop-training-hive-tutorial
+ ** "Hadoop and Pig: Introduction":http://www.cloudera.com/hadoop-training-pig-introduction
+ ** "Hadoop and Pig: Tutorial":http://www.cloudera.com/hadoop-training-pig-tutorial
+ ** "Mapreduce Algorithms":http://www.cloudera.com/hadoop-training-mapreduce-algorithms
+ ** "Exercise: Getting started with Hadoop":http://www.cloudera.com/hadoop-training-exercise-getting-started-with-hadoop
+ ** "Exercise: Writing mapreduce programs":http://www.cloudera.com/hadoop-training-exercise-writing-mapreduce-programs
+ ** "Cloudera Blog":http://www.cloudera.com/blog/
+
+ * "Hadoop Wiki: Hadoop Streaming":http://wiki.apache.org/hadoop/HadoopStreaming
+ * "Hadoop Docs: Hadoop Streaming":http://hadoop.apache.org/common/docs/current/streaming.html
+
+ * A "dimwitted screed on Ruby, Hadoop and Starling":http://www.theregister.co.uk/2008/08/11/hadoop_dziuba/ seemingly written with jockstrap on head.
+
+ <notextile></div><div class="toggle"></notextile>
+
+ h2(#patches). Note on Patches/Pull Requests
+
+ * Fork the project.
+ * Make your feature addition or bug fix.
+ * Add tests for it. This is important so I don't break it in a future version unintentionally.
+ * Commit; do not mess with the rakefile, version, or history. (If you want to have your own version, that is fine -- but bump the version in a commit by itself so I can ignore it when I pull.)
+ * Send me a pull request. Bonus points for topic branches.
+
+ <notextile></div><div class="toggle"></notextile>
+
+ h2(#andpig). What's up with Wukong::AndPig?
+
+ @Wukong::AndPig@ is a small library to more easily generate code for the "Pig":http://hadoop.apache.org/pig data analysis language. See its "README":http://github.com/mrflip/wukong/tree/master/lib/wukong/and_pig/README.textile for more.
+
+ It's **not really being worked on**, and you should probably **ignore it**.
+
+ <notextile></div><div class="toggle"></notextile>
+
+ h2(#todo). TODOs
+
+ Utility
+
+ * columnizing / reconstituting
+
+ * Set up with JRuby
+ * Allow for direct HDFS operations
+ * Make the dfs commands slightly less stupid
+ * add more standard options
+ * Allow for combiners
+ * JobStarter / JobSteps
+ * might as well take dumbo's command line args
+
+ BUGS:
+
+ * Can't do multiple input files in local mode
+
+ Patterns to implement:
+
+ * Stats reducer
+ ** basic sum, avg, max, min, std.dev of a numeric field
+ ** the "running standard deviation":http://www.johndcook.com/standard_deviation.html
+
+ * Efficient median (and other order statistics)
+
+ * Make StructRecordizer work generically with other reducers (spec. AccumulatingReducer)
+
+ Make wutils: tsv-oriented implementations of the coreutils (eg uniq, sort, cut, nl, wc, split, ls, df and du) to intrinsically accept and emit tab-separated records.
+
+ <notextile></div></notextile>
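For the "running standard deviation" TODO item, the incremental method from the linked johndcook.com page looks roughly like this in Ruby (a sketch of the algorithm only, not an actual Wukong reducer):

{% highlight ruby %}
# Welford's online algorithm: one pass, no need to hold all values in memory,
# numerically stabler than accumulating a raw sum and sum-of-squares.
class RunningStats
  attr_reader :n, :mean
  def initialize
    @n = 0; @mean = 0.0; @m2 = 0.0
  end
  def add(x)
    @n    += 1
    delta  = x - @mean
    @mean += delta / @n
    @m2   += delta * (x - @mean)
    self
  end
  def variance
    @n > 1 ? @m2 / (@n - 1) : 0.0
  end
  def stddev
    Math.sqrt(variance)
  end
end

stats = RunningStats.new
[2, 4, 4, 4, 5, 5, 7, 9].each { |x| stats.add(x) }
puts [stats.n, stats.mean, stats.stddev].join("\t")  # 8  5.0  ~2.14 (sample std.dev)
{% endhighlight %}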