wukong 0.1.4 → 1.4.0

Files changed (63)
  1. data/INSTALL.textile +89 -0
  2. data/README.textile +41 -74
  3. data/docpages/INSTALL.textile +94 -0
  4. data/{doc → docpages}/LICENSE.textile +0 -0
  5. data/{doc → docpages}/README-wulign.textile +6 -0
  6. data/docpages/UsingWukong-part1-get_ready.textile +17 -0
  7. data/{doc/overview.textile → docpages/UsingWukong-part2-ThinkingBigData.textile} +8 -24
  8. data/{doc → docpages}/UsingWukong-part3-parsing.textile +8 -2
  9. data/docpages/_config.yml +39 -0
  10. data/{doc/tips.textile → docpages/bigdata-tips.textile} +71 -44
  11. data/{doc → docpages}/code/api_response_example.txt +0 -0
  12. data/{doc → docpages}/code/parser_skeleton.rb +0 -0
  13. data/{doc/intro_to_map_reduce → docpages/diagrams}/MapReduceDiagram.graffle +0 -0
  14. data/docpages/favicon.ico +0 -0
  15. data/docpages/gem.css +16 -0
  16. data/docpages/hadoop-tips.textile +83 -0
  17. data/docpages/index.textile +90 -0
  18. data/docpages/intro.textile +8 -0
  19. data/docpages/moreinfo.textile +174 -0
  20. data/docpages/news.html +24 -0
  21. data/{doc → docpages}/pig/PigLatinExpressionsList.txt +0 -0
  22. data/{doc → docpages}/pig/PigLatinReferenceManual.html +0 -0
  23. data/{doc → docpages}/pig/PigLatinReferenceManual.txt +0 -0
  24. data/docpages/tutorial.textile +283 -0
  25. data/docpages/usage.textile +195 -0
  26. data/docpages/wutils.textile +263 -0
  27. data/wukong.gemspec +80 -50
  28. metadata +87 -54
  29. data/doc/INSTALL.textile +0 -41
  30. data/doc/README-tutorial.textile +0 -163
  31. data/doc/README-wutils.textile +0 -128
  32. data/doc/TODO.textile +0 -61
  33. data/doc/UsingWukong-part1-setup.textile +0 -2
  34. data/doc/UsingWukong-part2-scraping.textile +0 -2
  35. data/doc/hadoop-nfs.textile +0 -51
  36. data/doc/hadoop-setup.textile +0 -29
  37. data/doc/index.textile +0 -124
  38. data/doc/links.textile +0 -42
  39. data/doc/usage.textile +0 -102
  40. data/doc/utils.textile +0 -48
  41. data/examples/and_pig/sample_queries.rb +0 -128
  42. data/lib/wukong/and_pig.rb +0 -62
  43. data/lib/wukong/and_pig/README.textile +0 -12
  44. data/lib/wukong/and_pig/as.rb +0 -37
  45. data/lib/wukong/and_pig/data_types.rb +0 -30
  46. data/lib/wukong/and_pig/functions.rb +0 -50
  47. data/lib/wukong/and_pig/generate.rb +0 -85
  48. data/lib/wukong/and_pig/generate/variable_inflections.rb +0 -82
  49. data/lib/wukong/and_pig/junk.rb +0 -51
  50. data/lib/wukong/and_pig/operators.rb +0 -8
  51. data/lib/wukong/and_pig/operators/compound.rb +0 -29
  52. data/lib/wukong/and_pig/operators/evaluators.rb +0 -7
  53. data/lib/wukong/and_pig/operators/execution.rb +0 -15
  54. data/lib/wukong/and_pig/operators/file_methods.rb +0 -29
  55. data/lib/wukong/and_pig/operators/foreach.rb +0 -98
  56. data/lib/wukong/and_pig/operators/groupies.rb +0 -212
  57. data/lib/wukong/and_pig/operators/load_store.rb +0 -65
  58. data/lib/wukong/and_pig/operators/meta.rb +0 -42
  59. data/lib/wukong/and_pig/operators/relational.rb +0 -129
  60. data/lib/wukong/and_pig/pig_struct.rb +0 -48
  61. data/lib/wukong/and_pig/pig_var.rb +0 -95
  62. data/lib/wukong/and_pig/symbol.rb +0 -29
  63. data/lib/wukong/and_pig/utils.rb +0 -0
data/doc/INSTALL.textile DELETED
@@ -1,41 +0,0 @@
- ---
- layout: default
- title: Install Notes
- ---
-
-
- h1(gemheader). {{ site.gemname }} %(small):: install%
-
- <notextile><div class="toggle"></notextile>
-
- h2. Get the code
-
- This code is available as a gem:
-
- pre. $ sudo gem install mrflip-{{ site.gemname }}
-
- You can instead download this project in either "zip":http://github.com/mrflip/{{ site.gemname }}/zipball/master or "tar":http://github.com/mrflip/{{ site.gemname }}/tarball/master formats.
-
- Better yet, you can also clone the project with "Git":http://git-scm.com by running:
-
- pre. $ git clone git://github.com/mrflip/{{ site.gemname }}
-
- <notextile></div><div class="toggle"></notextile>
-
- h2. Get the Dependencies
-
- * Hadoop, pig
- * extlib, YAML, JSON
- * Optional gems: trollop, addressable/uri, htmlentities
-
-
- <notextile></div><div class="toggle"></notextile>
-
- h2. Setup
-
- 1. Allow Wukong to discover where his elephant friend lives: either
- ** set a $HADOOP_HOME environment variable,
- ** or create a file 'config/wukong-site.yaml' with a line that points to the top-level directory of your hadoop install: @:hadoop_home: /usr/local/share/hadoop@
- 2. Add wukong's @bin/@ directory to your $PATH, so that you may use its filesystem shortcuts.
-
- <notextile></div></notextile>
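
The setup steps above describe how Wukong locates your Hadoop install. As a rough sketch (not Wukong's actual code; the helper name is hypothetical), the lookup they describe amounts to:

<code><pre>
require 'yaml'

# Resolve the hadoop install directory: prefer $HADOOP_HOME, then fall back
# to the :hadoop_home: entry in config/wukong-site.yaml if that file exists.
def hadoop_home
  return ENV['HADOOP_HOME'] if ENV['HADOOP_HOME']
  site_file = 'config/wukong-site.yaml'
  YAML.load_file(site_file)[:hadoop_home] if File.exist?(site_file)
end
</pre></code>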
data/doc/README-tutorial.textile DELETED
@@ -1,163 +0,0 @@
- Here's a script to count words in a text stream:
-
-     require 'wukong'
-     module WordCount
-       class Mapper < Wukong::Streamer::LineStreamer
-         # Emit each word in the line.
-         def process line
-           words = line.strip.split(/\W+/).reject(&:blank?)
-           words.each{|word| yield [word, 1] }
-         end
-       end
-
-       class Reducer < Wukong::Streamer::ListReducer
-         def finalize
-           yield [ key, values.map(&:last).map(&:to_i).sum ]
-         end
-       end
-     end
-
-     Wukong::Script.new(
-       WordCount::Mapper,
-       WordCount::Reducer
-     ).run # Execute the script
-
- The first class, the Mapper, eats lines and craps @[word, count]@ records. Here
- the /key/ is the word, and the /value/ is its count.
-
- The second class is an example of an accumulated list reducer. The values for
- each key are stacked up into a list; then the record(s) yielded by @#finalize@
- are emitted.
-
- Here's another way to write the Reducer: accumulate the count from each line, then
- yield the sum in @#finalize@:
-
-     class Reducer2 < Wukong::Streamer::AccumulatingReducer
-       attr_accessor :key_count
-       def start! *args
-         self.key_count = 0
-       end
-       def accumulate(word, count)
-         self.key_count += count.to_i
-       end
-       def finalize
-         yield [ key, key_count ]
-       end
-     end
-
- Of course you can be really lazy (that is, smart) and write your script instead as
-
-     class Script < Wukong::Script
-       def reducer_command
-         'uniq -c'
-       end
-     end
-
-
- h2. Structured data
-
- All of these deal with unstructured data. Wukong also lets you view your data
- as a stream of structured objects.
-
- Let's say you have a blog; its records look like
-
-     Post    = Struct.new( :id, :created_at, :user_id, :title, :body, :link )
-     Comment = Struct.new( :id, :created_at, :post_id, :user_id, :body )
-     User    = Struct.new( :id, :username, :fullname, :homepage, :description )
-     UserLoc = Struct.new( :user_id, :text, :lat, :lng )
-
- You've been using "twitter":http://twitter.com for a long time, and you've
- written something that from now on will inject all your tweets as Posts, and all
- replies to them as Comments (by a common 'twitter_bot' account on your blog).
- What about the past two years' worth of tweets? Let's assume you're so chatty that
- a Map/Reduce script is warranted to handle the volume.
-
- Cook up something that scrapes your tweets and all replies to your tweets:
-
-     Tweet = Struct.new( :id, :created_at, :twitter_user_id,
-       :in_reply_to_user_id, :in_reply_to_status_id, :text )
-     TwitterUser = Struct.new( :id, :username, :fullname,
-       :homepage, :location, :description )
-
- Now we'll just process all those in a big pile, converting to Posts, Comments
- and Users as appropriate. Serialize your scrape results so that each Tweet and
- each TwitterUser is a single line containing first the class name ('tweet' or
- 'twitter_user') followed by its constituent fields, in order, separated by tabs.
-
- The RecordStreamer takes each such line, constructs its corresponding class, and
- instantiates it with the remaining fields, in order:
-
-     require 'wukong'
-     require 'my_blog' # defines the blog models
-     module TwitBlog
-       class Mapper < Wukong::Streamer::RecordStreamer
-         # Watch for tweets by me
-         MY_USER_ID = 24601
-         # structs for our input objects
-         Tweet = Struct.new( :id, :created_at, :twitter_user_id,
-           :in_reply_to_user_id, :in_reply_to_status_id, :text )
-         TwitterUser = Struct.new( :id, :username, :fullname,
-           :homepage, :location, :description )
-         #
-         # If this is a tweet by me, convert it to a Post.
-         #
-         # If it is a tweet not by me, convert it to a Comment that
-         # will be paired with the correct Post.
-         #
-         # If it is a TwitterUser, convert it to a User record and
-         # a user_location record.
-         #
-         def process record
-           case record
-           when TwitterUser
-             user     = MyBlog::User.new.merge(record) # grab the fields in common
-             user_loc = MyBlog::UserLoc.new(record.id, record.location, nil, nil)
-             yield user
-             yield user_loc
-           when Tweet
-             if record.twitter_user_id == MY_USER_ID
-               post = MyBlog::Post.new.merge record
-               post.link  = "http://twitter.com/statuses/show/#{record.id}"
-               post.body  = record.text
-               post.title = record.text[0..65] + "..."
-               yield post
-             else
-               comment = MyBlog::Comment.new.merge record
-               comment.body    = record.text
-               comment.post_id = record.in_reply_to_status_id
-               yield comment
-             end
-           end
-         end
-       end
-     end
-     Wukong::Script.new( TwitBlog::Mapper, nil ).run # identity reducer
-
- h2. Uniqifying
-
- The script above uses the identity reducer: every record from the mapper is sent
- to the output. But what if you had grabbed the replying user's record every time
- you saw a reply?
-
- Fine, so pass it through @uniq@. But what if a user updated their location or
- description during this time? You'll probably want to use a UniqByLastReducer.
-
- For location, you might want to take the most /frequent/ value, and perhaps
- geolocate the location text as well: use a ListReducer, find the most frequent
- element, then finally call the expensive geolocation method.
-
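
As a rough illustration of the "keep the last record per key" idea (a sketch against the AccumulatingReducer interface shown above, not Wukong's shipped UniqByLastReducer):

<code><pre>
class UniqByLast < Wukong::Streamer::AccumulatingReducer
  attr_accessor :final_record
  def start! *args
    self.final_record = nil
  end
  def accumulate *record
    self.final_record = record   # each later record overwrites the earlier one
  end
  def finalize
    yield final_record if final_record
  end
end
</pre></code>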
- h2. A note about keys
-
- Now we're going to write this using the synthetic keys already extant in the
- twitter records, making the unwarranted assumption that they won't collide with
- the keys in your database.
-
- The Map/Reduce paradigm does badly with synthetic keys. Synthetic keys demand
- locality, and map/reduce's remarkable scaling comes from not assuming
- locality. In general, write your map/reduce scripts to use natural keys (the screen name, for example).
-
- h1. More info
-
- There are many useful examples (including an actually-useful version of this
- WordCount script) in the examples/ directory.
-
data/doc/README-wutils.textile DELETED
@@ -1,128 +0,0 @@
- h1. Wukong Utility Scripts
-
- h2. Stupid command-line tricks
-
- h3. Histogram
-
- Given data with a date column:
-
-     message 235623 20090423012345 Now is the winter of our discontent Made glorious summer by this son of York
-     message 235623 20080101230900 These pretzels are making me THIRSTY!
-     ...
-
- You can calculate the number of messages sent per day with
-
-     cat messages | cuttab 3 | cutc 8 | sort | uniq -c
-
- (see the wuhist command, below.)
-
- h3. Simple intersection, union, etc
-
- For two datasets (batch_1 and batch_2) with unique entries (no repeated lines),
-
- * Their union is simple:
-
-     cat batch_1 batch_2 | sort -u
-
-
- * Their intersection:
-
-     cat batch_1 batch_2 | sort | uniq -c | egrep -v '^ *1 '
-
- This concatenates the two sets and filters out everything that only occurred once.
-
- * For the complement of the intersection, use "... | egrep '^ *1 '"
-
- * In both cases, if the files are each internally sorted, the command-line sort takes a --merge flag:
-
-     sort --merge -u batch_1 batch_2
-
- h2. Command Listing
-
- h3. cutc
-
- @cutc [colnum]@
-
- Ex.
-
-     echo -e 'foo\tbar\tbaz' | cutc 6
-     foo ba
-
- Cuts from the beginning of the line to the given column (default 200). A tab counts as one character, so the right margin can still be ragged.
-
- h3. cuttab
-
- @cuttab [colspec]@
-
- Cuts the given tab-separated columns. You can give a comma-separated list of numbers
- or ranges such as 1-4; columns are numbered from 1.
-
- Ex.
-
-     echo -e 'foo\tbar\tbaz' | cuttab 1,3
-     foo baz
-
- h3. hdp-*
-
- These perform the corresponding commands on the HDFS filesystem. In general,
- where they accept command-line flags, they go with the GNU-style ones, not the
- hadoop-style: so, @hdp-du -s dir@ or @hdp-rm -r foo/@
-
- * @hdp-cat@
- * @hdp-catd@ -- cats the files that don't start with '_' in a directory. Use this for a pile of @.../part-00000@ files
- * @hdp-du@
- * @hdp-get@
- * @hdp-kill@
- * @hdp-ls@
- * @hdp-mkdir@
- * @hdp-mv@
- * @hdp-ps@
- * @hdp-put@
- * @hdp-rm@
- * @hdp-sync@
-
- h3. hdp-sort, hdp-stream, hdp-stream-flat
-
- * @hdp-sort@
- * @hdp-stream@
- * @hdp-stream-flat@
-
- <code><pre>
- hdp-stream input_filespec output_file map_cmd reduce_cmd num_key_fields
- </pre></code>
-
- h3. tabchar
-
- Outputs a single tab character.
-
- h3. wuhist
-
- Occasionally useful to gather a lexical histogram of a single column:
-
- Ex.
-
- <code><pre>
- $ echo -e 'foo\nbar\nbar\nfoo\nfoo\nfoo\n7' | ./wuhist
- 4 foo
- 2 bar
- 1 7
- </pre></code>
-
- (the output will have a tab between the first and second column, for further processing.)
-
- h3. wulign
-
- Intelligently format a tab-separated file into aligned columns (while remaining tab-separated for further processing). See README-wulign.textile.
-
- h3. hdp-parts_to_keys.rb
-
- A *very* clumsy script to rename reduced hadoop output files by their initial key.
-
- If your output records have an initial key in the first column and you pass them
- through hdp-sort, the keys will be distributed across reducers and thus across
- output files. (Because of the way hadoop hashes the keys, there's no guarantee that
- each file will get a distinct key. You could have 2 keys with a million entries
- and they could land sequentially on the same reducer, always fun.)
-
- If you're willing to roll the dice, this script will rename files according to
- the first key in the first line.
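
For illustration, the renaming idea described above amounts to something like the following (a rough local-filesystem sketch, not the actual hdp-parts_to_keys.rb):

<code><pre>
# Rename each part-NNNNN file after the first tab-separated key on its first line.
Dir['part-*'].each do |part|
  first_line = File.open(part, &:gets) or next
  key = first_line.split("\t").first.strip
  File.rename(part, "#{key}-#{part}")   # e.g. part-00007 -> mykey-part-00007
end
</pre></code>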
data/doc/TODO.textile DELETED
@@ -1,61 +0,0 @@
- Utility
-
- * columnizing / reconstituting
-
- * Set up with JRuby
- * Allow for direct HDFS operations
- * Make the dfs commands slightly less stupid
- * add more standard options
- * Allow for combiners
- * JobStarter / JobSteps
- * might as well take dumbo's command line args
-
- BUGS:
-
- * Can't do multiple input files in local mode
-
- Patterns to implement:
-
- * Stats reducer (takes sum, avg, max, min, std.dev of a numeric field)
- * Make StructRecordizer work generically with other reducers (spec. AccumulatingReducer)
-
- Example graph scripts:
-
- * Multigraph
- * Pagerank (done)
- * Breadth-first search
- * Triangle enumeration
- * Clustering
-
- Example scripts (from http://www.cloudera.com/resources/learning-mapreduce):
-
- 1. Find the [number of] hits by 5-minute timeslot for a website given its access logs.
-
- 2. Find the pages with over 1 million hits in a day for a website given its access logs.
-
- 3. Find the pages that link to each page in a collection of webpages.
-
- 4. Calculate the proportion of lines that match a given regular expression for a collection of documents.
-
- 5. Sort tabular data by a primary and secondary column.
-
- 6. Find the most popular pages for a website given its access logs.
-
- /can use
-
-
- ---------------------------------------------------------------------------
-
- Add statistics helpers
-
- * including "running standard deviation":http://www.johndcook.com/standard_deviation.html
-
-
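
As a sketch of the "running standard deviation" helper suggested above (Welford's online algorithm; the class name is hypothetical):

<code><pre>
class RunningStats
  attr_reader :n, :mean
  def initialize
    @n = 0; @mean = 0.0; @m2 = 0.0
  end
  # Fold in one sample, updating the running mean and sum of squared deviations.
  def add(x)
    @n    += 1
    delta  = x - @mean
    @mean += delta / @n
    @m2   += delta * (x - @mean)
  end
  def variance
    @n > 1 ? @m2 / (@n - 1) : 0.0
  end
  def stddev
    Math.sqrt(variance)
  end
end
</pre></code>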
- ---------------------------------------------------------------------------
-
- Make wutils: tsv-oriented implementations of the coreutils (e.g. uniq, sort, cut, nl, wc, split, ls, df and du) to intrinsically accept and emit tab-separated records.
-
- More example hadoop algorithms:
- * Bigram counts: http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/exercises/bigrams.html
- * Inverted index construction: http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/exercises/indexer.html
- * Pagerank: http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/exercises/pagerank.html
data/doc/UsingWukong-part1-setup.textile DELETED
@@ -1,2 +0,0 @@
- h1. Using Wukong and Wuclan, Part 1 - Setup
-
data/doc/UsingWukong-part2-scraping.textile DELETED
@@ -1,2 +0,0 @@
- h1. Using Wukong and Wuclan, Part 2 - Scraping
-
data/doc/hadoop-nfs.textile DELETED
@@ -1,51 +0,0 @@
- The "Cloudera Hadoop AMI Instances":http://www.cloudera.com/hadoop-ec2 for Amazon's EC2 compute cloud are the fastest, easiest way to get up and running with hadoop. Unfortunately, streaming scripts can be a pain to run, especially if you're doing iterative development.
-
- Installing NFS to share files across the cluster gives the following conveniences:
-
- * You don't have to bundle everything up with each run: any path in ~coder/ will refer back via NFS to the filesystem on master.
-
- * The user can now ssh among the nodes without a password, since there's only one shared home directory and since we included the user's own public key in the authorized_keys2 file. This lets you easily rsync files among the nodes.
-
- First, you need to take note of the _internal_ name for your master, perhaps something like @domU-xx-xx-xx-xx-xx-xx.compute-1.internal@.
-
- As root, on the master (change @compute-1.internal@ to match your setup):
-
- <pre>
- apt-get install nfs-kernel-server
- echo "/home *.compute-1.internal(rw)" >> /etc/exports ;
- /etc/init.d/nfs-kernel-server stop ;
- </pre>
-
- (The @*.compute-1.internal@ part limits host access, but you should take a look at the security settings of both EC2 and the built-in portmapper as well.)
-
- Next, set up a regular user account on the *master only*. In this case our user will be named 'chimpy':
-
- <pre>
- visudo                          # uncomment the last line, to allow group sudo to sudo
- groupadd admin
- adduser chimpy
- usermod -a -G sudo,admin chimpy
- su chimpy                       # now you are the new user
- ssh-keygen -t rsa               # accept all the defaults
- cat ~/.ssh/id_rsa.pub           # can paste this public key into your github, etc
- cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys2
- </pre>
-
- Then on each slave (replacing domU-xx-... by the internal name for the master node):
-
- <pre>
- apt-get install nfs-common ;
- echo "domU-xx-xx-xx-xx-xx-xx.compute-1.internal:/home /mnt/home nfs rw 0 0" >> /etc/fstab
- /etc/init.d/nfs-common restart
- mkdir /mnt/home
- mount /mnt/home
- ln -s /mnt/home/chimpy /home/chimpy
- </pre>
-
- You should now be in business.
-
- Performance tradeoffs should be small as long as you're just sending code files and gems around. *Don't* write out log entries or data to NFS partitions, or you'll effectively perform a denial-of-service attack on the master node.
-
- ------------------------------
-
- The "Setting up an NFS Server HOWTO":http://nfs.sourceforge.net/nfs-howto/index.html was an immense help, and I recommend reading it carefully.