wukong 0.1.4 → 1.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (63)
  1. data/INSTALL.textile +89 -0
  2. data/README.textile +41 -74
  3. data/docpages/INSTALL.textile +94 -0
  4. data/{doc → docpages}/LICENSE.textile +0 -0
  5. data/{doc → docpages}/README-wulign.textile +6 -0
  6. data/docpages/UsingWukong-part1-get_ready.textile +17 -0
  7. data/{doc/overview.textile → docpages/UsingWukong-part2-ThinkingBigData.textile} +8 -24
  8. data/{doc → docpages}/UsingWukong-part3-parsing.textile +8 -2
  9. data/docpages/_config.yml +39 -0
  10. data/{doc/tips.textile → docpages/bigdata-tips.textile} +71 -44
  11. data/{doc → docpages}/code/api_response_example.txt +0 -0
  12. data/{doc → docpages}/code/parser_skeleton.rb +0 -0
  13. data/{doc/intro_to_map_reduce → docpages/diagrams}/MapReduceDiagram.graffle +0 -0
  14. data/docpages/favicon.ico +0 -0
  15. data/docpages/gem.css +16 -0
  16. data/docpages/hadoop-tips.textile +83 -0
  17. data/docpages/index.textile +90 -0
  18. data/docpages/intro.textile +8 -0
  19. data/docpages/moreinfo.textile +174 -0
  20. data/docpages/news.html +24 -0
  21. data/{doc → docpages}/pig/PigLatinExpressionsList.txt +0 -0
  22. data/{doc → docpages}/pig/PigLatinReferenceManual.html +0 -0
  23. data/{doc → docpages}/pig/PigLatinReferenceManual.txt +0 -0
  24. data/docpages/tutorial.textile +283 -0
  25. data/docpages/usage.textile +195 -0
  26. data/docpages/wutils.textile +263 -0
  27. data/wukong.gemspec +80 -50
  28. metadata +87 -54
  29. data/doc/INSTALL.textile +0 -41
  30. data/doc/README-tutorial.textile +0 -163
  31. data/doc/README-wutils.textile +0 -128
  32. data/doc/TODO.textile +0 -61
  33. data/doc/UsingWukong-part1-setup.textile +0 -2
  34. data/doc/UsingWukong-part2-scraping.textile +0 -2
  35. data/doc/hadoop-nfs.textile +0 -51
  36. data/doc/hadoop-setup.textile +0 -29
  37. data/doc/index.textile +0 -124
  38. data/doc/links.textile +0 -42
  39. data/doc/usage.textile +0 -102
  40. data/doc/utils.textile +0 -48
  41. data/examples/and_pig/sample_queries.rb +0 -128
  42. data/lib/wukong/and_pig.rb +0 -62
  43. data/lib/wukong/and_pig/README.textile +0 -12
  44. data/lib/wukong/and_pig/as.rb +0 -37
  45. data/lib/wukong/and_pig/data_types.rb +0 -30
  46. data/lib/wukong/and_pig/functions.rb +0 -50
  47. data/lib/wukong/and_pig/generate.rb +0 -85
  48. data/lib/wukong/and_pig/generate/variable_inflections.rb +0 -82
  49. data/lib/wukong/and_pig/junk.rb +0 -51
  50. data/lib/wukong/and_pig/operators.rb +0 -8
  51. data/lib/wukong/and_pig/operators/compound.rb +0 -29
  52. data/lib/wukong/and_pig/operators/evaluators.rb +0 -7
  53. data/lib/wukong/and_pig/operators/execution.rb +0 -15
  54. data/lib/wukong/and_pig/operators/file_methods.rb +0 -29
  55. data/lib/wukong/and_pig/operators/foreach.rb +0 -98
  56. data/lib/wukong/and_pig/operators/groupies.rb +0 -212
  57. data/lib/wukong/and_pig/operators/load_store.rb +0 -65
  58. data/lib/wukong/and_pig/operators/meta.rb +0 -42
  59. data/lib/wukong/and_pig/operators/relational.rb +0 -129
  60. data/lib/wukong/and_pig/pig_struct.rb +0 -48
  61. data/lib/wukong/and_pig/pig_var.rb +0 -95
  62. data/lib/wukong/and_pig/symbol.rb +0 -29
  63. data/lib/wukong/and_pig/utils.rb +0 -0
data/doc/INSTALL.textile DELETED
@@ -1,41 +0,0 @@
- ---
- layout: default
- title: Install Notes
- ---
-
-
- h1(gemheader). {{ site.gemname }} %(small):: install%
-
- <notextile><div class="toggle"></notextile>
-
- h2. Get the code
-
- This code is available as a gem:
-
- pre. $ sudo gem install mrflip-{{ site.gemname }}
-
- You can instead download this project in either "zip":http://github.com/mrflip/{{ site.gemname }}/zipball/master or "tar":http://github.com/mrflip/{{ site.gemname }}/tarball/master formats.
-
- Better yet, you can also clone the project with "Git":http://git-scm.com by running:
-
- pre. $ git clone git://github.com/mrflip/{{ site.gemname }}
-
- <notextile></div><div class="toggle"></notextile>
-
- h2. Get the Dependencies
-
- * Hadoop, pig
- * extlib, YAML, JSON
- * Optional gems: trollop, addressable/uri, htmlentities
-
-
- <notextile></div><div class="toggle"></notextile>
-
- h2. Setup
-
- 1. Allow Wukong to discover where his elephant friend lives: either
- ** set a $HADOOP_HOME environment variable,
- ** or create a file 'config/wukong-site.yaml' with a line that points to the top-level directory of your hadoop install: @:hadoop_home: /usr/local/share/hadoop@
- 2. Add wukong's @bin/@ directory to your $PATH, so that you may use its filesystem shortcuts.
-
- <notextile></div></notextile>
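
Editor's note: the setup step above looks for Hadoop via @$HADOOP_HOME@ or a @:hadoop_home:@ entry in @config/wukong-site.yaml@. The following is a minimal, illustrative sketch of reading that setting from Ruby -- the file name and key come from the install notes, but the loader itself is an assumption, not Wukong's actual implementation:

<code><pre>
require 'yaml'

# Locate Hadoop the two ways the install notes describe: a $HADOOP_HOME
# environment variable, or a :hadoop_home: entry in config/wukong-site.yaml.
def hadoop_home(site_file = 'config/wukong-site.yaml')
  env = ENV['HADOOP_HOME']
  return env unless env.nil? || env.empty?
  return nil unless File.exist?(site_file)
  # Psych 4 (Ruby >= 3.1) needs unsafe_load_file to return the :hadoop_home symbol key
  config = YAML.respond_to?(:unsafe_load_file) ? YAML.unsafe_load_file(site_file) : YAML.load_file(site_file)
  config && config[:hadoop_home]
end

puts(hadoop_home || abort("Set $HADOOP_HOME or :hadoop_home: in config/wukong-site.yaml"))
</pre></code>
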
data/doc/README-tutorial.textile DELETED
@@ -1,163 +0,0 @@
- Here's a script to count words in a text stream:
-
-     require 'wukong'
-     module WordCount
-       class Mapper < Wukong::Streamer::LineStreamer
-         # Emit each word in the line.
-         def process line
-           words = line.strip.split(/\W+/).reject(&:blank?)
-           words.each{|word| yield [word, 1] }
-         end
-       end
-
-       class Reducer < Wukong::Streamer::ListReducer
-         def finalize
-           yield [ key, values.map(&:last).map(&:to_i).sum ]
-         end
-       end
-     end
-
-     Wukong::Script.new(
-       WordCount::Mapper,
-       WordCount::Reducer
-     ).run # Execute the script
-
- The first class, the Mapper, eats lines and craps @[word, count]@ records. Here
- the /key/ is the word, and the /value/ is its count.
-
- The second class is an example of an accumulated list reducer. The values for
- each key are stacked up into a list; then the record(s) yielded by @#finalize@
- are emitted.
-
- Here's another way to write the Reducer: accumulate the count of each line, then
- yield the sum in @#finalize@:
-
-     class Reducer2 < Wukong::Streamer::AccumulatingReducer
-       attr_accessor :key_count
-       def start! *args
-         self.key_count = 0
-       end
-       def accumulate(word, count)
-         self.key_count += count.to_i
-       end
-       def finalize
-         yield [ key, key_count ]
-       end
-     end
-
- Of course you can be really lazy (that is, smart) and write your script instead as
-
-     class Script < Wukong::Script
-       def reducer_command
-         'uniq -c'
-       end
-     end
-
-
- h2. Structured data
-
- All of these deal with unstructured data. Wukong also lets you view your data
- as a stream of structured objects.
-
- Let's say you have a blog; its records look like
-
-     Post    = Struct.new( :id, :created_at, :user_id, :title, :body, :link )
-     Comment = Struct.new( :id, :created_at, :post_id, :user_id, :body )
-     User    = Struct.new( :id, :username, :fullname, :homepage, :description )
-     UserLoc = Struct.new( :user_id, :text, :lat, :lng )
-
- You've been using "twitter":http://twitter.com for a long time, and you've
- written something that from now on will inject all your tweets as Posts, and all
- replies to them as Comments (by a common 'twitter_bot' account on your blog).
- What about the past two years' worth of tweets? Let's assume you're so chatty that
- a Map/Reduce script is warranted to handle the volume.
-
- Cook up something that scrapes your tweets and all replies to your tweets:
-
-     Tweet       = Struct.new( :id, :created_at, :twitter_user_id,
-       :in_reply_to_user_id, :in_reply_to_status_id, :text )
-     TwitterUser = Struct.new( :id, :username, :fullname,
-       :homepage, :location, :description )
-
- Now we'll just process all those in a big pile, converting to Posts, Comments
- and Users as appropriate. Serialize your scrape results so that each Tweet and
- each TwitterUser is a single line containing first the class name ('tweet' or
- 'twitter_user') followed by its constituent fields, in order, separated by tabs.
-
- The RecordStreamer takes each such line, constructs its corresponding class, and
- instantiates it with the remaining fields on the line:
-
-     require 'wukong'
-     require 'my_blog' # defines the blog models
-     module TwitBlog
-       class Mapper < Wukong::Streamer::RecordStreamer
-         # Watch for tweets by me
-         MY_USER_ID = 24601
-         # structs for our input objects
-         Tweet = Struct.new( :id, :created_at, :twitter_user_id,
-           :in_reply_to_user_id, :in_reply_to_status_id, :text )
-         TwitterUser = Struct.new( :id, :username, :fullname,
-           :homepage, :location, :description )
-         #
-         # If this is a tweet by me, convert it to a Post.
-         #
-         # If it is a tweet not by me, convert it to a Comment that
-         # will be paired with the correct Post.
-         #
-         # If it is a TwitterUser, convert it to a User record and
-         # a user_location record
-         #
-         def process record
-           case record
-           when TwitterUser
-             user     = MyBlog::User.new.merge(record) # grab the fields in common
-             user_loc = MyBlog::UserLoc.new(record.id, record.location, nil, nil)
-             yield user
-             yield user_loc
-           when Tweet
-             if record.twitter_user_id == MY_USER_ID
-               post = MyBlog::Post.new.merge record
-               post.link  = "http://twitter.com/statuses/show/#{record.id}"
-               post.body  = record.text
-               post.title = record.text[0..65] + "..."
-               yield post
-             else
-               comment = MyBlog::Comment.new.merge record
-               comment.body    = record.text
-               comment.post_id = record.in_reply_to_status_id
-               yield comment
-             end
-           end
-         end
-       end
-     end
-     Wukong::Script.new( TwitBlog::Mapper, nil ).run # identity reducer
-
- h2. Uniqifying
-
- The script above uses the identity reducer: every record from the mapper is sent
- to the output. But what if you had grabbed the replying user's record every time
- you saw a reply?
-
- Fine, so pass it through @uniq@. But what if a user updated their location or
- description during this time? You'll probably want to use a UniqByLastReducer.
-
- For location, you might want to take the most /frequent/ value, and perhaps also
- geolocate the location text. Use a ListReducer, find the most frequent element,
- then finally call the expensive geolocation method.
-
- h2. A note about keys
-
- Now we're going to write this using the synthetic keys already extant in the
- twitter records, making the unwarranted assumption that they won't collide with
- the keys in your database.
-
- The Map/Reduce paradigm does badly with synthetic keys. Synthetic keys demand
- locality, and map/reduce's remarkable scaling comes from not assuming
- locality. In general, write your map/reduce scripts to use natural keys (the scre
-
- h1. More info
-
- There are many useful examples (including an actually-useful version of this
- WordCount script) in the examples/ directory.
-
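
Editor's note: the Uniqifying section above mentions a UniqByLastReducer without showing one. Below is a minimal sketch of the idea, built only on the @AccumulatingReducer@ interface the tutorial itself demonstrates (@start!@ / @accumulate@ / @finalize@); the class name and field handling are illustrative assumptions, not the gem's shipped implementation:

<code><pre>
require 'wukong'

module TwitBlog
  # Keep only the last record seen for each key, so a user scraped many
  # times (say, once per reply) is emitted exactly once, with the most
  # recent values winning.
  class UniqByLastReducer < Wukong::Streamer::AccumulatingReducer
    attr_accessor :last_record
    def start!(*fields)
      self.last_record = nil
    end
    def accumulate(*fields)
      self.last_record = fields    # each later record overwrites the earlier one
    end
    def finalize
      yield last_record if last_record
    end
  end
end

# Pair it with the TwitBlog::Mapper from the tutorial above:
#   Wukong::Script.new( TwitBlog::Mapper, TwitBlog::UniqByLastReducer ).run
</pre></code>
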
data/doc/README-wutils.textile DELETED
@@ -1,128 +0,0 @@
- h1. Wukong Utility Scripts
-
- h2. Stupid command-line tricks
-
- h3. Histogram
-
- Given data with a date column:
-
-     message 235623 20090423012345 Now is the winter of our discontent Made glorious summer by this son of York
-     message 235623 20080101230900 These pretzels are making me THIRSTY!
-     ...
-
- You can calculate the number of messages sent per day with
-
-     cat messages | cuttab 3 | cutc 8 | sort | uniq -c
-
- (see the @wuhist@ command, below.)
-
- h3. Simple intersection, union, etc
-
- For two datasets (batch_1 and batch_2) with unique entries (no repeated lines):
-
- * Their union is simple:
-
-     cat batch_1 batch_2 | sort -u
-
-
- * Their intersection:
-
-     cat batch_1 batch_2 | sort | uniq -c | egrep -v '^ *1 '
-
- This concatenates the two sets and filters out everything that only occurred once.
-
- * For the complement of the intersection, use "... | egrep '^ *1 '"
-
- * In both cases, if the files are each internally sorted, the command-line sort takes a --merge flag:
-
-     sort --merge -u batch_1 batch_2
-
- h2. Command Listing
-
- h3. cutc
-
- @cutc [colnum]@
-
- Ex.
-
-     echo -e 'foo\tbar\tbaz' | cutc 6
-     foo ba
-
- Cuts from the beginning of the line to the given column (default 200). A tab counts as one character, so the right margin can still be ragged.
-
- h3. cuttab
-
- @cuttab [colspec]@
-
- Cuts the given tab-separated columns. You can give a comma-separated list of numbers
- or ranges such as 1-4. Columns are numbered from 1.
-
- Ex.
-
-     echo -e 'foo\tbar\tbaz' | cuttab 1,3
-     foo baz
-
- h3. hdp-*
-
- These perform the corresponding commands on the HDFS filesystem. In general,
- where they accept command-line flags, they go with the GNU-style ones, not the
- hadoop-style: so, @hdp-du -s dir@ or @hdp-rm -r foo/@
-
- * @hdp-cat@
- * @hdp-catd@ -- cats the files that don't start with '_' in a directory. Use this for a pile of @.../part-00000@ files
- * @hdp-du@
- * @hdp-get@
- * @hdp-kill@
- * @hdp-ls@
- * @hdp-mkdir@
- * @hdp-mv@
- * @hdp-ps@
- * @hdp-put@
- * @hdp-rm@
- * @hdp-sync@
-
- h3. hdp-sort, hdp-stream, hdp-stream-flat
-
- * @hdp-sort@
- * @hdp-stream@
- * @hdp-stream-flat@
-
- <code><pre>
- hdp-stream input_filespec output_file map_cmd reduce_cmd num_key_fields
- </pre></code>
-
- h3. tabchar
-
- Outputs a single tab character.
-
- h3. wuhist
-
- It's occasionally useful to gather a lexical histogram of a single column:
-
- Ex.
-
- <code><pre>
- $ echo -e 'foo\nbar\nbar\nfoo\nfoo\nfoo\n7' | ./wuhist
- 4 foo
- 2 bar
- 1 7
- </pre></code>
-
- (the output will have a tab between the first and second column, for further processing.)
-
- h3. wulign
-
- Intelligently formats a tab-separated file into aligned columns (while remaining tab-separated for further processing). See README-wulign.textile.
-
- h3. hdp-parts_to_keys.rb
-
- A *very* clumsy script to rename reduced hadoop output files by their initial key.
-
- If your output has an initial key in the first column and you pass it
- through hdp-sort, the keys will be distributed across reducers and thus across output
- files. (Because of the way hadoop hashes the keys, there's no guarantee that
- each file will get a distinct key. You could have 2 keys with a million entries
- and they could land sequentially on the same reducer, always fun.)
-
- If you're willing to roll the dice, this script will rename files according to
- the first key in the first line.
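
Editor's note: for reference, here is a rough Ruby equivalent of what the @wuhist@ example above shows -- count identical values on stdin and emit tab-separated count/value pairs, most frequent first. It is reconstructed from the example output, not taken from the script itself:

<code><pre>
#!/usr/bin/env ruby
# Count each distinct input line and print "count<TAB>value",
# most frequent first (cf. the wuhist example output above).
counts = Hash.new(0)
$stdin.each_line { |line| counts[line.chomp] += 1 }
counts.sort_by { |_value, count| -count }.each do |value, count|
  puts "#{count}\t#{value}"
end
</pre></code>
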
data/doc/TODO.textile DELETED
@@ -1,61 +0,0 @@
- Utility
-
- * columnizing / reconstituting
-
- * Set up with JRuby
- * Allow for direct HDFS operations
- * Make the dfs commands slightly less stupid
- * add more standard options
- * Allow for combiners
- * JobStarter / JobSteps
- * might as well take dumbo's command line args
-
- BUGS:
-
- * Can't do multiple input files in local mode
-
- Patterns to implement:
-
- * Stats reducer (takes sum, avg, max, min, std.dev of a numeric field)
- * Make StructRecordizer work generically with other reducers (spec. AccumulatingReducer)
-
- Example graph scripts:
-
- * Multigraph
- * Pagerank (done)
- * Breadth-first search
- * Triangle enumeration
- * Clustering
-
- Example example scripts (from http://www.cloudera.com/resources/learning-mapreduce):
-
- 1. Find the [number of] hits by 5-minute timeslot for a website given its access logs.
-
- 2. Find the pages with over 1 million hits in a day for a website given its access logs.
-
- 3. Find the pages that link to each page in a collection of webpages.
-
- 4. Calculate the proportion of lines that match a given regular expression for a collection of documents.
-
- 5. Sort tabular data by a primary and secondary column.
-
- 6. Find the most popular pages for a website given its access logs.
-
- /can use
-
-
- ---------------------------------------------------------------------------
-
- Add statistics helpers
-
- * including "running standard deviation":http://www.johndcook.com/standard_deviation.html
-
-
- ---------------------------------------------------------------------------
-
- Make wutils: tsv-oriented implementations of the coreutils (e.g. uniq, sort, cut, nl, wc, split, ls, df and du) to intrinsically accept and emit tab-separated records.
-
- More example hadoop algorithms:
- * Bigram counts: http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/exercises/bigrams.html
- * Inverted index construction: http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/exercises/indexer.html
- * Pagerank: http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/exercises/pagerank.html
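
Editor's note: the "statistics helpers" item above links to John D. Cook's running standard deviation write-up; the recurrence it describes (Welford's method) looks roughly like this in Ruby -- an illustrative sketch, not code from the gem:

<code><pre>
# Welford's one-pass recurrence: track count, running mean, and the sum of
# squared deviations (M2); no need to keep the values themselves.
class RunningStdDev
  def initialize
    @n, @mean, @m2 = 0, 0.0, 0.0
  end
  def add(x)
    @n    += 1
    delta  = x - @mean
    @mean += delta / @n
    @m2   += delta * (x - @mean)
  end
  def stddev
    @n > 1 ? Math.sqrt(@m2 / (@n - 1)) : 0.0
  end
end

stats = RunningStdDev.new
[3, 5, 7, 9].each { |x| stats.add(x) }
puts stats.stddev    # => ~2.58
</pre></code>
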
data/doc/UsingWukong-part1-setup.textile DELETED
@@ -1,2 +0,0 @@
- h1. Using Wukong and Wuclan, Part 1 - Setup
-
data/doc/UsingWukong-part2-scraping.textile DELETED
@@ -1,2 +0,0 @@
- h1. Using Wukong and Wuclan, Part 2 - Scraping
-
data/doc/hadoop-nfs.textile DELETED
@@ -1,51 +0,0 @@
- The "Cloudera Hadoop AMI Instances":http://www.cloudera.com/hadoop-ec2 for Amazon's EC2 compute cloud are the fastest, easiest way to get up and running with hadoop. Unfortunately, running streaming scripts can be a pain, especially if you're doing iterative development.
-
- Installing NFS to share files across the cluster gives the following conveniences:
-
- * You don't have to bundle everything up with each run: any path in ~coder/ will refer back via NFS to the filesystem on the master.
-
- * The user can now ssh among the nodes without a password, since there's only one shared home directory and since we included the user's own public key in the authorized_keys2 file. This lets you easily rsync files among the nodes.
-
- First, you need to take note of the _internal_ name for your master, perhaps something like @domU-xx-xx-xx-xx-xx-xx.compute-1.internal@.
-
- As root, on the master (change @compute-1.internal@ to match your setup):
-
- <pre>
- apt-get install nfs-kernel-server
- echo "/home *.compute-1.internal(rw)" >> /etc/exports ;
- /etc/init.d/nfs-kernel-server stop ;
- </pre>
-
- (The @*.compute-1.internal@ part limits host access, but you should take a look at the security settings of both EC2 and the built-in portmapper as well.)
-
- Next, set up a regular user account on the *master only*. In this case our user will be named 'chimpy':
-
- <pre>
- visudo # uncomment the last line, to allow group sudo to sudo
- groupadd admin
- adduser chimpy
- usermod -a -G sudo,admin chimpy
- su chimpy # now you are the new user
- ssh-keygen -t rsa # accept all the defaults
- cat ~/.ssh/id_rsa.pub # can paste this public key into your github, etc
- cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys2
- </pre>
-
- Then on each slave (replacing domU-xx-... by the internal name for the master node):
-
- <pre>
- apt-get install nfs-common ;
- echo "domU-xx-xx-xx-xx-xx-xx.compute-1.internal:/home /mnt/home nfs rw 0 0" >> /etc/fstab
- /etc/init.d/nfs-common restart
- mkdir /mnt/home
- mount /mnt/home
- ln -s /mnt/home/chimpy /home/chimpy
- </pre>
-
- You should now be in business.
-
- Performance tradeoffs should be small as long as you're just sending code files and gems around. *Don't* write out log entries or data to NFS partitions, or you'll effectively perform a denial-of-service attack on the master node.
-
- ------------------------------
-
- The "Setting up an NFS Server HOWTO":http://nfs.sourceforge.net/nfs-howto/index.html was an immense help, and I recommend reading it carefully.