wukong 0.1.4 → 1.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (63)
  1. data/INSTALL.textile +89 -0
  2. data/README.textile +41 -74
  3. data/docpages/INSTALL.textile +94 -0
  4. data/{doc → docpages}/LICENSE.textile +0 -0
  5. data/{doc → docpages}/README-wulign.textile +6 -0
  6. data/docpages/UsingWukong-part1-get_ready.textile +17 -0
  7. data/{doc/overview.textile → docpages/UsingWukong-part2-ThinkingBigData.textile} +8 -24
  8. data/{doc → docpages}/UsingWukong-part3-parsing.textile +8 -2
  9. data/docpages/_config.yml +39 -0
  10. data/{doc/tips.textile → docpages/bigdata-tips.textile} +71 -44
  11. data/{doc → docpages}/code/api_response_example.txt +0 -0
  12. data/{doc → docpages}/code/parser_skeleton.rb +0 -0
  13. data/{doc/intro_to_map_reduce → docpages/diagrams}/MapReduceDiagram.graffle +0 -0
  14. data/docpages/favicon.ico +0 -0
  15. data/docpages/gem.css +16 -0
  16. data/docpages/hadoop-tips.textile +83 -0
  17. data/docpages/index.textile +90 -0
  18. data/docpages/intro.textile +8 -0
  19. data/docpages/moreinfo.textile +174 -0
  20. data/docpages/news.html +24 -0
  21. data/{doc → docpages}/pig/PigLatinExpressionsList.txt +0 -0
  22. data/{doc → docpages}/pig/PigLatinReferenceManual.html +0 -0
  23. data/{doc → docpages}/pig/PigLatinReferenceManual.txt +0 -0
  24. data/docpages/tutorial.textile +283 -0
  25. data/docpages/usage.textile +195 -0
  26. data/docpages/wutils.textile +263 -0
  27. data/wukong.gemspec +80 -50
  28. metadata +87 -54
  29. data/doc/INSTALL.textile +0 -41
  30. data/doc/README-tutorial.textile +0 -163
  31. data/doc/README-wutils.textile +0 -128
  32. data/doc/TODO.textile +0 -61
  33. data/doc/UsingWukong-part1-setup.textile +0 -2
  34. data/doc/UsingWukong-part2-scraping.textile +0 -2
  35. data/doc/hadoop-nfs.textile +0 -51
  36. data/doc/hadoop-setup.textile +0 -29
  37. data/doc/index.textile +0 -124
  38. data/doc/links.textile +0 -42
  39. data/doc/usage.textile +0 -102
  40. data/doc/utils.textile +0 -48
  41. data/examples/and_pig/sample_queries.rb +0 -128
  42. data/lib/wukong/and_pig.rb +0 -62
  43. data/lib/wukong/and_pig/README.textile +0 -12
  44. data/lib/wukong/and_pig/as.rb +0 -37
  45. data/lib/wukong/and_pig/data_types.rb +0 -30
  46. data/lib/wukong/and_pig/functions.rb +0 -50
  47. data/lib/wukong/and_pig/generate.rb +0 -85
  48. data/lib/wukong/and_pig/generate/variable_inflections.rb +0 -82
  49. data/lib/wukong/and_pig/junk.rb +0 -51
  50. data/lib/wukong/and_pig/operators.rb +0 -8
  51. data/lib/wukong/and_pig/operators/compound.rb +0 -29
  52. data/lib/wukong/and_pig/operators/evaluators.rb +0 -7
  53. data/lib/wukong/and_pig/operators/execution.rb +0 -15
  54. data/lib/wukong/and_pig/operators/file_methods.rb +0 -29
  55. data/lib/wukong/and_pig/operators/foreach.rb +0 -98
  56. data/lib/wukong/and_pig/operators/groupies.rb +0 -212
  57. data/lib/wukong/and_pig/operators/load_store.rb +0 -65
  58. data/lib/wukong/and_pig/operators/meta.rb +0 -42
  59. data/lib/wukong/and_pig/operators/relational.rb +0 -129
  60. data/lib/wukong/and_pig/pig_struct.rb +0 -48
  61. data/lib/wukong/and_pig/pig_var.rb +0 -95
  62. data/lib/wukong/and_pig/symbol.rb +0 -29
  63. data/lib/wukong/and_pig/utils.rb +0 -0
data/doc/INSTALL.textile DELETED
@@ -1,41 +0,0 @@
- ---
- layout: default
- title: Install Notes
- ---
-
-
- h1(gemheader). {{ site.gemname }} %(small):: install%
-
- <notextile><div class="toggle"></notextile>
-
- h2. Get the code
-
- This code is available as a gem:
-
- pre. $ sudo gem install mrflip-{{ site.gemname }}
-
- You can instead download this project in either "zip":http://github.com/mrflip/{{ site.gemname }}/zipball/master or "tar":http://github.com/mrflip/{{ site.gemname }}/tarball/master formats.
-
- Better yet, you can also clone the project with "Git":http://git-scm.com by running:
-
- pre. $ git clone git://github.com/mrflip/{{ site.gemname }}
-
- <notextile></div><div class="toggle"></notextile>
-
- h2. Get the Dependencies
-
- * Hadoop, pig
- * extlib, YAML, JSON
- * Optional gems: trollop, addressable/uri, htmlentities
-
-
- <notextile></div><div class="toggle"></notextile>
-
- h2. Setup
-
- 1. Allow Wukong to discover where his elephant friend lives: either
- ** set a $HADOOP_HOME environment variable,
- ** or create a file 'config/wukong-site.yaml' with a line that points to the top-level directory of your hadoop install: @:hadoop_home: /usr/local/share/hadoop@
- 2. Add wukong's @bin/@ directory to your $PATH, so that you may use its filesystem shortcuts.
-
- <notextile></div></notextile>
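
Editor's note: the setup step above looks for Hadoop via @$HADOOP_HOME@ or a @:hadoop_home:@ entry in @config/wukong-site.yaml@. The following is a minimal, illustrative sketch of reading that setting from Ruby -- the file name and key come from the install notes, but the loader itself is an assumption, not Wukong's actual implementation:

<code><pre>
require 'yaml'

# Locate Hadoop the two ways the install notes describe: a $HADOOP_HOME
# environment variable, or a :hadoop_home: entry in config/wukong-site.yaml.
def hadoop_home(site_file = 'config/wukong-site.yaml')
  env = ENV['HADOOP_HOME']
  return env unless env.nil? || env.empty?
  return nil unless File.exist?(site_file)
  # Psych 4 (Ruby >= 3.1) needs unsafe_load_file to return the :hadoop_home symbol key
  config = YAML.respond_to?(:unsafe_load_file) ? YAML.unsafe_load_file(site_file) : YAML.load_file(site_file)
  config && config[:hadoop_home]
end

puts(hadoop_home || abort("Set $HADOOP_HOME or :hadoop_home: in config/wukong-site.yaml"))
</pre></code>
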
data/doc/README-tutorial.textile DELETED
@@ -1,163 +0,0 @@
- Here's a script to count words in a text stream:
-
-     require 'wukong'
-     module WordCount
-       class Mapper < Wukong::Streamer::LineStreamer
-         # Emit each word in the line.
-         def process line
-           words = line.strip.split(/\W+/).reject(&:blank?)
-           words.each{|word| yield [word, 1] }
-         end
-       end
-
-       class Reducer < Wukong::Streamer::ListReducer
-         def finalize
-           yield [ key, values.map(&:last).map(&:to_i).sum ]
-         end
-       end
-     end
-
-     Wukong::Script.new(
-       WordCount::Mapper,
-       WordCount::Reducer
-     ).run # Execute the script
-
- The first class, the Mapper, eats lines and craps @[word, count]@ records. Here
- the /key/ is the word, and the /value/ is its count.
-
- The second class is an example of an accumulated list reducer. The values for
- each key are stacked up into a list; then the record(s) yielded by @#finalize@
- are emitted.
-
- Here's another way to write the Reducer: accumulate the count of each line, then
- yield the sum in @#finalize@:
-
-     class Reducer2 < Wukong::Streamer::AccumulatingReducer
-       attr_accessor :key_count
-       def start! *args
-         self.key_count = 0
-       end
-       def accumulate(word, count)
-         self.key_count += count.to_i
-       end
-       def finalize
-         yield [ key, key_count ]
-       end
-     end
-
- Of course you can be really lazy (that is, smart) and write your script instead as
-
-     class Script < Wukong::Script
-       def reducer_command
-         'uniq -c'
-       end
-     end
-
-
- h2. Structured data
-
- All of these deal with unstructured data. Wukong also lets you view your data
- as a stream of structured objects.
-
- Let's say you have a blog; its records look like
-
-     Post    = Struct.new( :id, :created_at, :user_id, :title, :body, :link )
-     Comment = Struct.new( :id, :created_at, :post_id, :user_id, :body )
-     User    = Struct.new( :id, :username, :fullname, :homepage, :description )
-     UserLoc = Struct.new( :user_id, :text, :lat, :lng )
-
- You've been using "twitter":http://twitter.com for a long time, and you've
- written something that from now on will inject all your tweets as Posts, and all
- replies to them as Comments (by a common 'twitter_bot' account on your blog).
- What about the past two years' worth of tweets? Let's assume you're so chatty that
- a Map/Reduce script is warranted to handle the volume.
-
- Cook up something that scrapes your tweets and all replies to your tweets:
-
-     Tweet       = Struct.new( :id, :created_at, :twitter_user_id,
-       :in_reply_to_user_id, :in_reply_to_status_id, :text )
-     TwitterUser = Struct.new( :id, :username, :fullname,
-       :homepage, :location, :description )
-
- Now we'll just process all those in a big pile, converting to Posts, Comments
- and Users as appropriate. Serialize your scrape results so that each Tweet and
- each TwitterUser is a single line containing first the class name ('tweet' or
- 'twitter_user') followed by its constituent fields, in order, separated by tabs.
-
- The RecordStreamer takes each such line, constructs its corresponding class, and
- instantiates it with the remaining fields on the line:
-
-     require 'wukong'
-     require 'my_blog' # defines the blog models
-     module TwitBlog
-       class Mapper < Wukong::Streamer::RecordStreamer
-         # Watch for tweets by me
-         MY_USER_ID = 24601
-         # structs for our input objects
-         Tweet = Struct.new( :id, :created_at, :twitter_user_id,
-           :in_reply_to_user_id, :in_reply_to_status_id, :text )
-         TwitterUser = Struct.new( :id, :username, :fullname,
-           :homepage, :location, :description )
-         #
-         # If this is a tweet by me, convert it to a Post.
-         #
-         # If it is a tweet not by me, convert it to a Comment that
-         # will be paired with the correct Post.
-         #
-         # If it is a TwitterUser, convert it to a User record and
-         # a user_location record
-         #
-         def process record
-           case record
-           when TwitterUser
-             user     = MyBlog::User.new.merge(record) # grab the fields in common
-             user_loc = MyBlog::UserLoc.new(record.id, record.location, nil, nil)
-             yield user
-             yield user_loc
-           when Tweet
-             if record.twitter_user_id == MY_USER_ID
-               post = MyBlog::Post.new.merge record
-               post.link  = "http://twitter.com/statuses/show/#{record.id}"
-               post.body  = record.text
-               post.title = record.text[0..65] + "..."
-               yield post
-             else
-               comment = MyBlog::Comment.new.merge record
-               comment.body    = record.text
-               comment.post_id = record.in_reply_to_status_id
-               yield comment
-             end
-           end
-         end
-       end
-     end
-     Wukong::Script.new( TwitBlog::Mapper, nil ).run # identity reducer
-
- h2. Uniqifying
-
- The script above uses the identity reducer: every record from the mapper is sent
- to the output. But what if you had grabbed the replying user's record every time
- you saw a reply?
-
- Fine, so pass it through @uniq@. But what if a user updated their location or
- description during this time? You'll probably want to use a UniqByLastReducer.
-
- For location, you might want to take the most /frequent/ value, and perhaps also
- geolocate the location text. Use a ListReducer, find the most frequent element,
- then finally call the expensive geolocation method.
-
- h2. A note about keys
-
- Now we're going to write this using the synthetic keys already extant in the
- twitter records, making the unwarranted assumption that they won't collide with
- the keys in your database.
-
- The Map/Reduce paradigm does badly with synthetic keys. Synthetic keys demand
- locality, and map/reduce's remarkable scaling comes from not assuming
- locality. In general, write your map/reduce scripts to use natural keys (the scre
-
- h1. More info
-
- There are many useful examples (including an actually-useful version of this
- WordCount script) in the examples/ directory.
-
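
Editor's note: the Uniqifying section above mentions a UniqByLastReducer without showing one. Below is a minimal sketch of the idea, built only on the @AccumulatingReducer@ interface the tutorial itself demonstrates (@start!@ / @accumulate@ / @finalize@); the class name and field handling are illustrative assumptions, not the gem's shipped implementation:

<code><pre>
require 'wukong'

module TwitBlog
  # Keep only the last record seen for each key, so a user scraped many
  # times (say, once per reply) is emitted exactly once, with the most
  # recent values winning.
  class UniqByLastReducer < Wukong::Streamer::AccumulatingReducer
    attr_accessor :last_record
    def start!(*fields)
      self.last_record = nil
    end
    def accumulate(*fields)
      self.last_record = fields    # each later record overwrites the earlier one
    end
    def finalize
      yield last_record if last_record
    end
  end
end

# Pair it with the TwitBlog::Mapper from the tutorial above:
#   Wukong::Script.new( TwitBlog::Mapper, TwitBlog::UniqByLastReducer ).run
</pre></code>
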
data/doc/README-wutils.textile DELETED
@@ -1,128 +0,0 @@
- h1. Wukong Utility Scripts
-
- h2. Stupid command-line tricks
-
- h3. Histogram
-
- Given data with a date column:
-
-     message 235623 20090423012345 Now is the winter of our discontent Made glorious summer by this son of York
-     message 235623 20080101230900 These pretzels are making me THIRSTY!
-     ...
-
- You can calculate the number of messages sent per day with
-
-     cat messages | cuttab 3 | cutc 8 | sort | uniq -c
-
- (see the @wuhist@ command, below.)
-
- h3. Simple intersection, union, etc
-
- For two datasets (batch_1 and batch_2) with unique entries (no repeated lines):
-
- * Their union is simple:
-
-     cat batch_1 batch_2 | sort -u
-
-
- * Their intersection:
-
-     cat batch_1 batch_2 | sort | uniq -c | egrep -v '^ *1 '
-
- This concatenates the two sets and filters out everything that only occurred once.
-
- * For the complement of the intersection, use "... | egrep '^ *1 '"
-
- * In both cases, if the files are each internally sorted, the command-line sort takes a --merge flag:
-
-     sort --merge -u batch_1 batch_2
-
- h2. Command Listing
-
- h3. cutc
-
- @cutc [colnum]@
-
- Ex.
-
-     echo -e 'foo\tbar\tbaz' | cutc 6
-     foo ba
-
- Cuts from the beginning of the line to the given column (default 200). A tab counts as one character, so the right margin can still be ragged.
-
- h3. cuttab
-
- @cuttab [colspec]@
-
- Cuts the given tab-separated columns. You can give a comma-separated list of numbers
- or ranges such as 1-4. Columns are numbered from 1.
-
- Ex.
-
-     echo -e 'foo\tbar\tbaz' | cuttab 1,3
-     foo baz
-
- h3. hdp-*
-
- These perform the corresponding commands on the HDFS filesystem. In general,
- where they accept command-line flags, they go with the GNU-style ones, not the
- hadoop-style: so, @hdp-du -s dir@ or @hdp-rm -r foo/@
-
- * @hdp-cat@
- * @hdp-catd@ -- cats the files that don't start with '_' in a directory. Use this for a pile of @.../part-00000@ files
- * @hdp-du@
- * @hdp-get@
- * @hdp-kill@
- * @hdp-ls@
- * @hdp-mkdir@
- * @hdp-mv@
- * @hdp-ps@
- * @hdp-put@
- * @hdp-rm@
- * @hdp-sync@
-
- h3. hdp-sort, hdp-stream, hdp-stream-flat
-
- * @hdp-sort@
- * @hdp-stream@
- * @hdp-stream-flat@
-
- <code><pre>
- hdp-stream input_filespec output_file map_cmd reduce_cmd num_key_fields
- </pre></code>
-
- h3. tabchar
-
- Outputs a single tab character.
-
- h3. wuhist
-
- It's occasionally useful to gather a lexical histogram of a single column:
-
- Ex.
-
- <code><pre>
- $ echo -e 'foo\nbar\nbar\nfoo\nfoo\nfoo\n7' | ./wuhist
- 4 foo
- 2 bar
- 1 7
- </pre></code>
-
- (the output will have a tab between the first and second column, for further processing.)
-
- h3. wulign
-
- Intelligently formats a tab-separated file into aligned columns (while remaining tab-separated for further processing). See README-wulign.textile.
-
- h3. hdp-parts_to_keys.rb
-
- A *very* clumsy script to rename reduced hadoop output files by their initial key.
-
- If your output has an initial key in the first column and you pass it
- through hdp-sort, the keys will be distributed across reducers and thus across output
- files. (Because of the way hadoop hashes the keys, there's no guarantee that
- each file will get a distinct key. You could have 2 keys with a million entries
- and they could land sequentially on the same reducer, always fun.)
-
- If you're willing to roll the dice, this script will rename files according to
- the first key in the first line.
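
Editor's note: for reference, here is a rough Ruby equivalent of what the @wuhist@ example above shows -- count identical values on stdin and emit tab-separated count/value pairs, most frequent first. It is reconstructed from the example output, not taken from the script itself:

<code><pre>
#!/usr/bin/env ruby
# Count each distinct input line and print "count<TAB>value",
# most frequent first (cf. the wuhist example output above).
counts = Hash.new(0)
$stdin.each_line { |line| counts[line.chomp] += 1 }
counts.sort_by { |_value, count| -count }.each do |value, count|
  puts "#{count}\t#{value}"
end
</pre></code>
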
data/doc/TODO.textile DELETED
@@ -1,61 +0,0 @@
- Utility
-
- * columnizing / reconstituting
-
- * Set up with JRuby
- * Allow for direct HDFS operations
- * Make the dfs commands slightly less stupid
- * add more standard options
- * Allow for combiners
- * JobStarter / JobSteps
- * might as well take dumbo's command line args
-
- BUGS:
-
- * Can't do multiple input files in local mode
-
- Patterns to implement:
-
- * Stats reducer (takes sum, avg, max, min, std.dev of a numeric field)
- * Make StructRecordizer work generically with other reducers (spec. AccumulatingReducer)
-
- Example graph scripts:
-
- * Multigraph
- * Pagerank (done)
- * Breadth-first search
- * Triangle enumeration
- * Clustering
-
- Example example scripts (from http://www.cloudera.com/resources/learning-mapreduce):
-
- 1. Find the [number of] hits by 5-minute timeslot for a website given its access logs.
-
- 2. Find the pages with over 1 million hits in a day for a website given its access logs.
-
- 3. Find the pages that link to each page in a collection of webpages.
-
- 4. Calculate the proportion of lines that match a given regular expression for a collection of documents.
-
- 5. Sort tabular data by a primary and secondary column.
-
- 6. Find the most popular pages for a website given its access logs.
-
- /can use
-
-
- ---------------------------------------------------------------------------
-
- Add statistics helpers
-
- * including "running standard deviation":http://www.johndcook.com/standard_deviation.html
-
-
- ---------------------------------------------------------------------------
-
- Make wutils: tsv-oriented implementations of the coreutils (e.g. uniq, sort, cut, nl, wc, split, ls, df and du) to intrinsically accept and emit tab-separated records.
-
- More example hadoop algorithms:
- * Bigram counts: http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/exercises/bigrams.html
- * Inverted index construction: http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/exercises/indexer.html
- * Pagerank: http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/exercises/pagerank.html
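
Editor's note: the "statistics helpers" item above links to John D. Cook's running standard deviation write-up; the recurrence it describes (Welford's method) looks roughly like this in Ruby -- an illustrative sketch, not code from the gem:

<code><pre>
# Welford's one-pass recurrence: track count, running mean, and the sum of
# squared deviations (M2); no need to keep the values themselves.
class RunningStdDev
  def initialize
    @n, @mean, @m2 = 0, 0.0, 0.0
  end
  def add(x)
    @n    += 1
    delta  = x - @mean
    @mean += delta / @n
    @m2   += delta * (x - @mean)
  end
  def stddev
    @n > 1 ? Math.sqrt(@m2 / (@n - 1)) : 0.0
  end
end

stats = RunningStdDev.new
[3, 5, 7, 9].each { |x| stats.add(x) }
puts stats.stddev    # => ~2.58
</pre></code>
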
data/doc/UsingWukong-part1-setup.textile DELETED
@@ -1,2 +0,0 @@
- h1. Using Wukong and Wuclan, Part 1 - Setup
-
data/doc/UsingWukong-part2-scraping.textile DELETED
@@ -1,2 +0,0 @@
- h1. Using Wukong and Wuclan, Part 2 - Scraping
-
data/doc/hadoop-nfs.textile DELETED
@@ -1,51 +0,0 @@
- The "Cloudera Hadoop AMI Instances":http://www.cloudera.com/hadoop-ec2 for Amazon's EC2 compute cloud are the fastest, easiest way to get up and running with hadoop. Unfortunately, running streaming scripts can be a pain, especially if you're doing iterative development.
-
- Installing NFS to share files across the cluster gives the following conveniences:
-
- * You don't have to bundle everything up with each run: any path in ~coder/ will refer back via NFS to the filesystem on the master.
-
- * The user can now ssh among the nodes without a password, since there's only one shared home directory and since we included the user's own public key in the authorized_keys2 file. This lets you easily rsync files among the nodes.
-
- First, you need to take note of the _internal_ name for your master, perhaps something like @domU-xx-xx-xx-xx-xx-xx.compute-1.internal@.
-
- As root, on the master (change @compute-1.internal@ to match your setup):
-
- <pre>
- apt-get install nfs-kernel-server
- echo "/home *.compute-1.internal(rw)" >> /etc/exports ;
- /etc/init.d/nfs-kernel-server stop ;
- </pre>
-
- (The @*.compute-1.internal@ part limits host access, but you should take a look at the security settings of both EC2 and the built-in portmapper as well.)
-
- Next, set up a regular user account on the *master only*. In this case our user will be named 'chimpy':
-
- <pre>
- visudo # uncomment the last line, to allow group sudo to sudo
- groupadd admin
- adduser chimpy
- usermod -a -G sudo,admin chimpy
- su chimpy # now you are the new user
- ssh-keygen -t rsa # accept all the defaults
- cat ~/.ssh/id_rsa.pub # can paste this public key into your github, etc
- cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys2
- </pre>
-
- Then on each slave (replacing domU-xx-... by the internal name for the master node):
-
- <pre>
- apt-get install nfs-common ;
- echo "domU-xx-xx-xx-xx-xx-xx.compute-1.internal:/home /mnt/home nfs rw 0 0" >> /etc/fstab
- /etc/init.d/nfs-common restart
- mkdir /mnt/home
- mount /mnt/home
- ln -s /mnt/home/chimpy /home/chimpy
- </pre>
-
- You should now be in business.
-
- Performance tradeoffs should be small as long as you're just sending code files and gems around. *Don't* write out log entries or data to NFS partitions, or you'll effectively perform a denial-of-service attack on the master node.
-
- ------------------------------
-
- The "Setting up an NFS Server HOWTO":http://nfs.sourceforge.net/nfs-howto/index.html was an immense help, and I recommend reading it carefully.