RubyGems - wukong - Versions diffs - 0.1.4 → 1.4.0 - Mend

wukong 0.1.4 → 1.4.0

Files changed (63)

data/INSTALL.textile +89 -0
data/README.textile +41 -74
data/docpages/INSTALL.textile +94 -0
data/{doc → docpages}/LICENSE.textile +0 -0
data/{doc → docpages}/README-wulign.textile +6 -0
data/docpages/UsingWukong-part1-get_ready.textile +17 -0
data/{doc/overview.textile → docpages/UsingWukong-part2-ThinkingBigData.textile} +8 -24
data/{doc → docpages}/UsingWukong-part3-parsing.textile +8 -2
data/docpages/_config.yml +39 -0
data/{doc/tips.textile → docpages/bigdata-tips.textile} +71 -44
data/{doc → docpages}/code/api_response_example.txt +0 -0
data/{doc → docpages}/code/parser_skeleton.rb +0 -0
data/{doc/intro_to_map_reduce → docpages/diagrams}/MapReduceDiagram.graffle +0 -0
data/docpages/favicon.ico +0 -0
data/docpages/gem.css +16 -0
data/docpages/hadoop-tips.textile +83 -0
data/docpages/index.textile +90 -0
data/docpages/intro.textile +8 -0
data/docpages/moreinfo.textile +174 -0
data/docpages/news.html +24 -0
data/{doc → docpages}/pig/PigLatinExpressionsList.txt +0 -0
data/{doc → docpages}/pig/PigLatinReferenceManual.html +0 -0
data/{doc → docpages}/pig/PigLatinReferenceManual.txt +0 -0
data/docpages/tutorial.textile +283 -0
data/docpages/usage.textile +195 -0
data/docpages/wutils.textile +263 -0
data/wukong.gemspec +80 -50
metadata +87 -54
data/doc/INSTALL.textile +0 -41
data/doc/README-tutorial.textile +0 -163
data/doc/README-wutils.textile +0 -128
data/doc/TODO.textile +0 -61
data/doc/UsingWukong-part1-setup.textile +0 -2
data/doc/UsingWukong-part2-scraping.textile +0 -2
data/doc/hadoop-nfs.textile +0 -51
data/doc/hadoop-setup.textile +0 -29
data/doc/index.textile +0 -124
data/doc/links.textile +0 -42
data/doc/usage.textile +0 -102
data/doc/utils.textile +0 -48
data/examples/and_pig/sample_queries.rb +0 -128
data/lib/wukong/and_pig.rb +0 -62
data/lib/wukong/and_pig/README.textile +0 -12
data/lib/wukong/and_pig/as.rb +0 -37
data/lib/wukong/and_pig/data_types.rb +0 -30
data/lib/wukong/and_pig/functions.rb +0 -50
data/lib/wukong/and_pig/generate.rb +0 -85
data/lib/wukong/and_pig/generate/variable_inflections.rb +0 -82
data/lib/wukong/and_pig/junk.rb +0 -51
data/lib/wukong/and_pig/operators.rb +0 -8
data/lib/wukong/and_pig/operators/compound.rb +0 -29
data/lib/wukong/and_pig/operators/evaluators.rb +0 -7
data/lib/wukong/and_pig/operators/execution.rb +0 -15
data/lib/wukong/and_pig/operators/file_methods.rb +0 -29
data/lib/wukong/and_pig/operators/foreach.rb +0 -98
data/lib/wukong/and_pig/operators/groupies.rb +0 -212
data/lib/wukong/and_pig/operators/load_store.rb +0 -65
data/lib/wukong/and_pig/operators/meta.rb +0 -42
data/lib/wukong/and_pig/operators/relational.rb +0 -129
data/lib/wukong/and_pig/pig_struct.rb +0 -48
data/lib/wukong/and_pig/pig_var.rb +0 -95
data/lib/wukong/and_pig/symbol.rb +0 -29
data/lib/wukong/and_pig/utils.rb +0 -0

data/INSTALL.textile ADDED

@@ -0,0 +1,89 @@
+---
+layout: default
+title:  Install Notes
+collapse: false
+---
+h1(gemheader). {{ site.gemname }} %(small):: install%
+** "Get the code":#getcode
+** "Setup":#setup
+** "Installing and Running Wukong with Hadoop":#gethadoop
+** "Installing and Running Wukong with Datamapper, ActiveRecord, the command-line and more":#others
+<notextile><div class="toggle"></notextile>
+h2(#getcode). Get the code
+Wukong is still under active development.  The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/{{ site.gemname }}
+pre. $ git clone git://github.com/mrflip/{{ site.gemname }}
+A gem is available from "github:":http://gems.github.com
+pre. $ sudo gem install mrflip-{{ site.gemname }} --source=http://gems.github.com
+or from "gemcutter":http://gemcutter.org
+pre. $ sudo gem install {{ site.gemname }} --source=http://gemcutter.org
+You can instead download this project in either "zip":http://github.com/mrflip/{{ site.gemname }}/zipball/master or "tar":http://github.com/mrflip/{{ site.gemname }}/tarball/master formats.
+h3. Get the Dependencies
+* Hadoop, pig
+* extlib, YAML, JSON
+* Optional gems: trollop, addressable/uri, htmlentities
+<notextile></div><div class="toggle"></notextile>
+h2(#setup). Setup
+1. Allow Wukong to discover where his elephant friend lives by setting a $HADOOP_HOME environment variable:  @export HADOOP_HOME="/usr/local/share/hadoop"@
+2. Add wukong's @bin/@ directory to your $PATH if you'd like to use the "wutils":wutils.html
+<i>(see also: "Ruby Hadoop Quickstart":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart)</i>
+<notextile></div><div class="toggle"></notextile>
+h2(#gethadoop). Installing and Running Wukong with Hadoop
+Wukong was primarily developed for Hadoop, and we think it's the best way to use Hadoop (it's certainly the most fun!).
+h3. Run Wukong on the Amazon AWS EC2 Cloud
+h3. Hadoop Infrastructure
+Even if you have a bunch of machines with spare cycles, lots of RAM, and a shared filesystem... do yourself a favor and start out using the "Cloudera AMIs on Amazon's EC2 cloud.":http://www.cloudera.com/hadoop-ec2 There are an overwhelming number of fiddly little parameters and you'll be glad for the user experience before you get into server setup. If it's still mid-late 2009 when you read this, ignore prudence and jump straight to using Hadoop 0.20.  It will be a) more fun, b) much more robust (trust me, at "v0.20" you want to live on the bleeding edge), and c) you won't have to suffer through migrating your HDFS two weeks after setup.
+To set up hadoop, your best bet is the Cloudera AMIs on Amazon's EC2 compute cloud:
+* http://www.cloudera.com/hadoop-ec2
+* http://www.cloudera.com/hadoop-ec2-ebs-beta
+EC2 means anyone with a $10 bill can rent a 10-machine cluster with 1TB of distributed storage for 8 hours.
+h3. Run Wukong using Amazon AWS Elastic MapReduce
+AWS Elastic MapReduce saves the trouble of even setting up a cluster: click, bam, there it is.
+Phil Ripperger has prepared a "Ruby Hadoop Quickstart":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart explaining how to get started with Wukong, Hadoop and the Amazon Elastic MapReduce cloud -- it's better than anything we could put here. Thanks Phil!
+h3. Set up a Hadoop cluster
+If you have a local cluster, or just want to experiment with a single-machine install, check out the Cloudera packages for both Debian/Ubuntu-based and Redhat/RPM-based Linux systems.
+h3. More Hadoop Notes
+I've braindumped some random notes on configuring and using hadoop "over here":hadoop-tips.html
+<notextile></div><div class="toggle"></notextile>
+h2(#others). Wukong isn't just Hadoop: Datamapper, ActiveRecord, command-line usage and more
+Wukong is used by many in a non-Hadoop environment -- anywhere you can stream data records, you can unleash its monkey power.
+Please see the "usage notes":usage.html#playnice for more!
+<notextile></div></notextile>
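
Note on the Setup step in this file: the text says Wukong finds Hadoop through the `$HADOOP_HOME` environment variable. A minimal sketch of that kind of lookup in plain Ruby follows. The `hadoop_home` helper is hypothetical and written for illustration only; it is not Wukong's own lookup code, and the error message is an assumption.

```ruby
# Hypothetical helper (not part of Wukong): read the Hadoop install
# directory from the environment and fail loudly when it is missing.
def hadoop_home(env = ENV)
  home = env["HADOOP_HOME"].to_s.strip
  raise ArgumentError, "HADOOP_HOME is not set" if home.empty?
  home
end
```

Passing the environment in as an argument keeps the helper easy to exercise with a plain Hash instead of the real `ENV`.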

data/README.textile CHANGED

@@ -2,34 +2,53 @@ h1. Wukong
 Wukong makes "Hadoop":http://hadoop.apache.org/core so easy a chimpanzee can use it.
-Treat your dataset as a
+Treat your dataset like a
 * stream of lines when it's efficient to process by lines
 * stream of field arrays when it's efficient to deal directly with fields
 * stream of lightweight objects when it's efficient to deal with objects
-Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line.
+Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line.
+The **main documentation** lives on the "Wukong Pages.":http://mrflip.github.com/wukong Please feel free to add supplemental information to the "wukong wiki.":http://wiki.github.com/mrflip/wukong
+* "Install and set up wukong":http://mrflip.github.com/wukong/INSTALL.html
+* "Tutorial":http://mrflip.github.com/wukong/tutorial.html
+* "Usage notes":http://mrflip.github.com/wukong/usage.html
+* "Wutils":http://mrflip.github.com/wukong/wutils.html -- command-line utilities for working with data
+* Links and tips for "configuring and working with hadoop":http://mrflip.github.com/wukong/hadoop-tips.html
+* Wukong is licensed under the "Apache License":http://mrflip.github.com/wukong/LICENSE.html (same as Hadoop)
+* "More info":http://mrflip.github.com/wukong/moreinfo.html
-The main documentation -- including tutorials and tips for working with big data -- lives on the "Wukong Pages":http://mrflip.github.com/wukong and there is some supplemental information on the "wukong wiki.":http://wiki.github.com/mrflip/wukong
+h2. Help!
+Send Wukong questions to the "Infinite Monkeywrench mailing list":http://groups.google.com/group/infochimps-code
 h2. Install
-Wukong is still under active development.  The newest version is available at
+** "Main Install and Setup Documentation":http://mrflip.github.com/wukong/INSTALL.html **
+h3. Get the code
-    http://github.com/mrflip/wukong
+We're still actively developing {{ site.gemname }}.  The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/{{ site.gemname }}
-A gem is available from "github:":http://gems.github.com
+pre. $ git clone git://github.com/mrflip/{{ site.gemname }}
-    gem install mrflip-wukong --source=http://gems.github.com
+A gem is available from "gemcutter:":http://gemcutter.org/gems/{{ site.gemname }}
-or from "gemcutter":http://gemcutter.org
+pre. $ sudo gem install {{ site.gemname }} --source=http://gemcutter.org
-    gem install wukong --source=http://gemcutter.org
+(don't use the gems.github.com version -- it's way out of date.)
-Phil Ripperger has prepared "instructions on getting wukong to work on the Amazon AWS cloud.":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart Thanks Phil!
+You can instead download this project in either "zip":http://github.com/mrflip/{{ site.gemname }}/zipball/master or "tar":http://github.com/mrflip/{{ site.gemname }}/tarball/master formats.
+h3. Dependencies and setup
+To finish setting up, see the "detailed setup instructions":http://mrflip.github.com/wukong/INSTALL.html and then read the "usage notes":http://mrflip.github.com/wukong/usage.html
 h2. How to write a Wukong script
+** "Tutorial By Example":http://mrflip.github.com/wukong/tutorial.html **
 Here's a script to count words in a text stream:
 <pre><code>    require 'wukong'
@@ -112,11 +131,7 @@ You can also use structs to treat your dataset as a stream of objects:
 h3. Advanced Patterns
-Wukong has a good collection of map/reduce patterns. For example, it's quite common to accumulate all records for a given key and emit some result based on the whole group.
-The AccumulatingReducer calls start! on the first record for each key, calls accumulate() on every example for that key (including the first), and calls finalize() once the last record for that key is seen.
-Here's an AccumulatingReducer that takes a long list of key-value pairs and emits, for each key, all its corresponding values in one line.
+Wukong has a good collection of map/reduce patterns. Here's an AccumulatingReducer that takes a long list of key-value pairs and emits, for each key, all its corresponding values in one line.
 <pre><code>    #
     # Roll up all values for each key into a single line
@@ -165,62 +180,6 @@ You'd end up with
     @newman     @elaine      @jerry      @kramer
 </code></pre>
-h3. More info
-There are many useful examples (including an actually-useful version of the WordCount script) in examples/ directory.
-h2. Setup
-1. Allow Wukong to discover where his elephant friend lives: either
-  * set a @$HADOOP_HOME@ environment variable,
-  * or create a file 'config/wukong-site.yaml' with a line that points to the top-level directory of your hadoop install:
-      @:hadoop_home: /usr/local/share/hadoop@
-2. Add wukong's @bin/@ directory to your $PATH, so that you may use its filesystem shortcuts.
-h2. How to run a Wukong script
-To run your script using local files and no connection to a hadoop cluster,
-  @your/script.rb --run=local path/to/input_files path/to/output_dir@
-To run the command across a Hadoop cluster,
-  @your/script.rb --run=hadoop path/to/input_files path/to/output_dir@
-You can set the default in the config/wukong-site.yaml file, and then just use @--run@ instead of @--run=something@ --it will just use the default run mode.
-If you're running @--run=hadoop@, all file paths are HDFS paths. If you're running @--run=local@, all file paths are local paths.  (your/script path, of course, lives on the local filesystem).
-You can supply arbitrary command line arguments (they wind up as key-value pairs in the options path your mapper and reducer receive), and you can use the hadoop syntax to specify more than one input file:
-    ./path/to/your/script.rb --any_specific_options --options=can_have_vals \
-         --run "input_dir/part_*,input_file2.tsv,etc.tsv" path/to/output_dir
-Note that all @--options@ must precede (in any order) all non-options.
-h2. How to test your scripts
-To run mapper on its own:
-  cat ./local/test/input.tsv | ./examples/word_count.rb --map | more
-or if your test data lies on the HDFS,
-  hdp-cat test/input.tsv | ./examples/word_count.rb --map | more
-Next graduate to running @--run=local@ mode so you can inspect the reducer.
-h2. What's up with Wukong::AndPig?
-@Wukong::AndPig@ is a small library to more easily generate code for the
-"Pig":http://hadoop.apache.org/pig data analysis language.  See its
-"README":wukong/and_pig/README.textile for more.
 h2. Why is it called Wukong?
 Hadoop, as you may know, is "named after a stuffed elephant.":http://en.wikipedia.org/wiki/Hadoop Since Wukong was started by the "infochimps":http://infochimps.org team, we needed a simian analog.  A Monkey King who journeyed to the land of the Elephant seems to fit the bill:
@@ -231,6 +190,14 @@ bq. Sun Wukong possesses incredible strength, being able to lift his 13,500 jīn
 The "Jaime Hewlett / Damon Albarn short":http://news.bbc.co.uk/sport1/hi/olympics/monkey that the BBC made for their 2008 Olympics coverage gives the general idea.
-h2. What tools does Wukong work with?
+h2. Credits
+Wukong was written by "Philip (flip) Kromer":http://mrflip.com (flip@infochimps.org) for the "infochimps project":http://infochimps.org
+Patches submitted by:
+* gemified by Ben Woosley (ben.woosley with the gmails)
+* ruby interpreter path fix by "Yuichiro MASUI":http://github.com/masuidrive - masui at masuidrive.jp - http://blog.masuidrive.jp/
-Wukong is friends with "Hadoop":http://hadoop.apache.org/core the elephant, "Pig":http://hadoop.apache.org/pig/ the query language, and the @cat@ on your command line.  We're looking forward to being friends with "martinis":http://datamapper.org and "express trains":http://wiki.rubyonrails.org/rails/pages/ActiveRecord down the road.
+Thanks to:
+* "Brad Heintz":http://www.bradheintz.com/no1thing/talks/ for his early feedback
+* "Phil Ripperger":http://blog.pdatasolutions.com for his "wukong in the Amazon AWS cloud":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart tutorial.
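
Note on the "Advanced Patterns" hunk of this README: the removed text describes the AccumulatingReducer lifecycle as `start!` on the first record for each key, `accumulate` on every record for that key (including the first), and `finalize` once the last record for that key is seen. The sketch below imitates that lifecycle in plain Ruby so the call order is easy to follow. `RollupSketch` and its `run` driver are hypothetical names; this is not Wukong's class, and the real method signatures may differ.

```ruby
# Hypothetical stand-in for the lifecycle described in the README;
# it does not use or reproduce Wukong's AccumulatingReducer.
class RollupSketch
  attr_reader :output

  def initialize
    @output = []
    @key    = nil
    @values = nil   # nil means "no group has been started yet"
  end

  # Called once, on the first record of each key.
  def start!(key)
    @key    = key
    @values = []
  end

  # Called for every record of the key, including the first.
  def accumulate(_key, value)
    @values << value
  end

  # Called once the last record for the key has been seen.
  def finalize
    @output << ([@key] + @values).join("\t")
  end

  # Drive the callbacks over [key, value] pairs already sorted by key.
  def run(sorted_pairs)
    sorted_pairs.each do |key, value|
      if @values.nil? || key != @key
        finalize unless @values.nil?
        start!(key)
      end
      accumulate(key, value)
    end
    finalize unless @values.nil?
    output
  end
end
```

Fed key-sorted pairs, `run` returns one tab-joined line per key, with the key followed by all of its values.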

data/docpages/INSTALL.textile ADDED

@@ -0,0 +1,94 @@
+---
+layout: default
+title:  Install Notes
+collapse: false
+---
+h1(gemheader). {{ site.gemname }} %(small):: install%
+** "Get the code":#getcode
+** "Setup":#setup
+** "Installing and Running Wukong with Hadoop":#gethadoop
+** "Installing and Running Wukong with Datamapper, ActiveRecord, the command-line and more":#others
+<notextile><div class="toggle"></notextile>
+h2(#getcode). Get the code
+We're still actively developing {{ site.gemname }}.  The newest version is available via "Git":http://git-scm.com on "github:":http://github.com/mrflip/{{ site.gemname }}
+pre. $ git clone git://github.com/mrflip/{{ site.gemname }}
+A gem is available from "gemcutter:":http://gemcutter.org/gems/{{ site.gemname }}
+pre. $ sudo gem install {{ site.gemname }} --source=http://gemcutter.org
+(don't use the gems.github.com version -- it's way out of date.)
+You can instead download this project in either "zip":http://github.com/mrflip/{{ site.gemname }}/zipball/master or "tar":http://github.com/mrflip/{{ site.gemname }}/tarball/master formats.
+<notextile></div><div class="toggle"></notextile>
+h3. Get the Dependencies
+* Hadoop
+* Pig (optional)
+* Parts of {{ site.gemname }} require these gems:
+** addressable/uri
+** htmlentities
+** extlib
+** YAML
+** JSON
+<notextile></div><div class="toggle"></notextile>
+h2(#setup). Setup
+1. Allow Wukong to discover where his elephant friend lives by setting a $HADOOP_HOME environment variable:  @export HADOOP_HOME="/usr/local/share/hadoop"@
+2. Add wukong's @bin/@ directory to your $PATH if you'd like to use the "wutils":wutils.html
+<i>(see also: "Ruby Hadoop Quickstart":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart)</i>
+<notextile></div><div class="toggle"></notextile>
+h2(#gethadoop). Installing and Running Wukong with Hadoop
+Wukong was primarily developed for Hadoop, and we think it's the best way to use Hadoop (it's certainly the most fun!).
+h3. Run Wukong on the Amazon AWS EC2 Cloud
+h3. Hadoop Infrastructure
+Even if you have a bunch of machines with spare cycles, lots of RAM, and a shared filesystem... do yourself a favor and start out using the "Cloudera AMIs on Amazon's EC2 cloud.":http://www.cloudera.com/hadoop-ec2 There are an overwhelming number of fiddly little parameters and you'll be glad for the user experience before you get into server setup. If it's still mid-late 2009 when you read this, ignore prudence and jump straight to using Hadoop 0.20.  It will be a) more fun, b) much more robust (trust me, at "v0.20" you want to live on the bleeding edge), and c) you won't have to suffer through migrating your HDFS two weeks after setup.
+To set up hadoop, your best bet is the Cloudera AMIs on Amazon's EC2 compute cloud:
+* http://www.cloudera.com/hadoop-ec2
+* http://www.cloudera.com/hadoop-ec2-ebs-beta
+EC2 means anyone with a $10 bill can rent a 10-machine cluster with 1TB of distributed storage for 8 hours.
+h3. Run Wukong using Amazon AWS Elastic MapReduce
+AWS Elastic MapReduce saves the trouble of even setting up a cluster: click, bam, there it is.
+Phil Ripperger has prepared a "Ruby Hadoop Quickstart":http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart explaining how to get started with Wukong, Hadoop and the Amazon Elastic MapReduce cloud -- it's better than anything we could put here. Thanks Phil!
+h3. Set up a Hadoop cluster
+If you have a local cluster, or just want to experiment with a single-machine install, check out the Cloudera packages for both Debian/Ubuntu-based and Redhat/RPM-based Linux systems.
+h3. More Hadoop Notes
+I've braindumped some random notes on configuring and using hadoop "over here":hadoop-tips.html
+<notextile></div><div class="toggle"></notextile>
+h2(#others). Wukong isn't just Hadoop: Datamapper, ActiveRecord, command-line usage and more
+Wukong is used by many in a non-Hadoop environment -- anywhere you can stream data records, you can unleash its monkey power.
+Please see the "usage notes":usage.html#playnice for more!
+<notextile></div></notextile>

data/{doc → docpages}/LICENSE.textile RENAMED

File without changes

data/{doc → docpages}/README-wulign.textile RENAMED

@@ -1,3 +1,9 @@
+---
+layout: default
+title:  mrflip.github.com/wukong - wu-lign utility
+collapse: false
+---
 h1. wu-lign -- format a tab-separated file as aligned columns
 wu-lign will intelligently reformat a tab-separated file into a tab-separated, space aligned file that is still suitable for further processing. For example, given the log-file input
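
Note on the wu-lign description in this hunk: the idea of keeping a file tab-separated while padding fields so the columns line up can be sketched in a few lines of plain Ruby. `align_tsv` is a hypothetical helper written for illustration. It is not wu-lign's implementation, and it assumes simple left-aligned padding with no special handling of numbers.

```ruby
# Hypothetical helper (not wu-lign itself): pad every tab-separated field
# to the widest value in its column, keeping the tab separators so the
# result can still be split on tabs.
def align_tsv(lines)
  rows   = lines.map { |line| line.chomp.split("\t", -1) }
  ncols  = rows.map(&:size).max || 0
  widths = (0...ncols).map { |i| rows.map { |row| (row[i] || "").length }.max }
  rows.map do |row|
    row.each_with_index.map { |field, i| field.ljust(widths[i]) }.join("\t")
  end
end
```

Because the tabs are preserved, a downstream step can recover the original fields by splitting on tabs and stripping trailing spaces.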

data/docpages/UsingWukong-part1-get_ready.textile ADDED

@@ -0,0 +1,17 @@
+---
+layout: default
+title:  mrflip.github.com/wukong - Using Wukong and Wuclan, Part 1 - Setup
+collapse: false
+---
+h1. Using Wukong and Wuclan, Part 0 - Setup
+Please follow the "installation and setup directions":setup.html for wukong, hadoop and a compute cluster.
+h1. Using Wukong and Wuclan, Part 1 - Scraping
+This part needs writing.
+Later, it will tell you how to get a large corpus of data to use in part 2.
+In the meantime check out http://mrflip.github.com/monkeyshines/ and http://mrflip.github.com/wuclan/ -- in particular the "Twitter Search Scraper":http://github.com/mrflip/wuclan/tree/master/examples/twitter/scrape_twitter_search/ example.  We use this in production to gather and analyze tens of gigabytes of twitter conversations.

data/{doc/overview.textile → docpages/UsingWukong-part2-ThinkingBigData.textile} RENAMED

@@ -1,5 +1,12 @@
+---
+layout: default
+title:  mrflip.github.com/wukong - Overview
+collapse: false
+---
-h2. There's lots of data
+h1. Thinking Big Data
+h2. There's lots of data, Wukong and Hadoop can help
 There are two disruptive
@@ -13,9 +20,6 @@ There are two disruptive
 ** Old frontier computing: expensive, N log N, SUUUUUUCKS
 ** It's cheap, it's scaleable and it's fun
-h2. Wukong + Hadoop can help
 h2. == Map|Reduce ==
 h3. cat input.tsv | mapper.sh | sort | reducer.sh > output.tsv
@@ -69,23 +73,3 @@ h2. == Mechanics, HDFS ==
 x M _
 _ M y
-h2. == More Reading ==
-h3. Hadoop
-* "Hadoop, The Definitive Guide":http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979
-* "":
-* "Cloudera Blog":http://www.cloudera.com/blog/
-h3. Hadoop|Streaming Frameworks
-* infochimps.org's "Wukong":http://github.com/mrflip/wukong -- ruby; object-oriented *and* record-oriented
-* NYTimes' "MRToolkit":http://code.google.com/p/mrtoolkit/ -- ruby; much more log-oriented
-* Freebase's "Happy":http://code.google.com/p/happy/ -- python; the most performant, as it can use Jython to make direct API calls.
-* Last.fm's "Dumbo":http://wiki.github.com/klbostee/dumbo -- python
- h3. Hadoop Infrastructure
-Even if you have a bunch of machines with spare cycles, lots of RAM, and a shared filesystem... do yourself a favor and start out using the "Cloudera AMIs on Amazon's EC2 cloud.":http://www.cloudera.com/hadoop-ec2 There are an overwhelming number of fiddly little parameters and you'll be glad for the user experience before you get into server setup. Actually, if it's still June 2009 when you read this, profile your scripts with Wukong on the command line and kill some time before Hadoop 0.20 comes out.  It will be a) more fun, b) much more robust (trust me, at "v0.20" you want to live on the bleeding edge), and c) you won't have to suffer through migrating your HDFS two weeks after setup.
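
Note on the `cat input.tsv | mapper.sh | sort | reducer.sh > output.tsv` heading in this file: the same map, sort, reduce shape can be tried out in memory with plain Ruby. The sketch below counts words. The function names are hypothetical and nothing here uses Wukong or Hadoop; it only illustrates why the sort step lets the reducer see all records for a key together.

```ruby
# Mapper: emit one [word, 1] pair per word in each line.
def map_words(lines)
  lines.flat_map { |line| line.downcase.scan(/[a-z']+/).map { |word| [word, 1] } }
end

# Reducer: walk key-sorted pairs and sum the counts in each run of equal keys.
def reduce_counts(sorted_pairs)
  sorted_pairs
    .chunk_while { |a, b| a[0] == b[0] }
    .map { |group| [group.first[0], group.sum { |_, n| n }] }
end

# The whole pipeline: map, then sort (standing in for `sort`), then reduce.
def word_count(lines)
  reduce_counts(map_words(lines).sort)
end
```

The reducer never needs a hash of all keys. Since equal keys arrive next to each other after sorting, it only has to hold one group at a time, which is the property that lets the pattern scale to large inputs.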

data/{doc → docpages}/UsingWukong-part3-parsing.textile RENAMED

@@ -1,6 +1,12 @@
-h1. Using Wukong and Wuclan, Part 3 - Parsing
+---
+layout: default
+title:  mrflip.github.com/wukong - Using Wukong and Wuclan, Part 3 - Parsing
+collapse: false
+---
-In part 2 we begain a scraper to trawl our desired part of the social web. Now
+h1. Using Wukong and Wuclan - Parsing
+In part 1 we began a scraper to trawl our desired part of the social web. Now
 we're ready to start using Wukong to process the files.
 Files come off the wire as

data/docpages/_config.yml ADDED

@@ -0,0 +1,39 @@
+---
+permalink:      ":year-:month/:title.html"
+markdown:       rdiscount
+pygments:       true
+auto:           true
+server:         true
+server_port:    4000
+maruku:
+  use_tex:      false
+  use_divs:     false
+  png_dir:      images/latex
+  png_url:      /images/latex
+header_ref:     '.html'    # .html for subdirs, / for main.
+assets_path:    '/'        # http://github.mrflip.com
+gemuser:        mrflip
+gemname:        wukong
+gemversion:     0.1.1
+title:          mrflip.github.com/wukong
+keywords:       [ 'wukong,hadoop,ruby,mrflip,infochimps,map,reduce,streaming,dumbo,happy,mrtoolkit,script,simple' ]
+description:    "Wukong: Hadoop made so easy a Chimpanzee could run it."
+header_files:
+  - INSTALL
+  - LICENSE
+  - usage
+  - wutils
+  - moreinfo
+  - tutorial
+credits:
+  <p>Wukong image courtesy
+  <a href="http://www.curtbusse.com/okavango/page1/oka1.html">Curt Busse</a> under
+  an <a href="http://www.curtbusse.com/copyright.html">open license</a>.
+  It's a Chacma Baboon from the Okavango site. Make sure to read the
+  <a href="http://www.curtbusse.com/okavango/page1/oka1.html#note3">story at the bottom of that page</a>.
+  </p>