wukong 1.4.7 → 1.4.9

Files changed (62)
  1. data/CHANGELOG.textile +9 -0
  2. data/README.textile +1 -1
  3. data/bin/hdp-bzip +28 -0
  4. data/bin/hdp-mkdir +1 -1
  5. data/bin/hdp-stream-flat +3 -2
  6. data/bin/wu-lign +32 -18
  7. data/docpages/pig/cookbook.html +481 -0
  8. data/docpages/pig/images/hadoop-logo.jpg +0 -0
  9. data/docpages/pig/images/instruction_arrow.png +0 -0
  10. data/docpages/pig/images/pig-logo.gif +0 -0
  11. data/docpages/pig/piglatin_ref1.html +1103 -0
  12. data/docpages/pig/piglatin_ref2.html +14340 -0
  13. data/docpages/pig/setup.html +505 -0
  14. data/docpages/pig/skin/basic.css +166 -0
  15. data/docpages/pig/skin/breadcrumbs.js +237 -0
  16. data/docpages/pig/skin/fontsize.js +166 -0
  17. data/docpages/pig/skin/getBlank.js +40 -0
  18. data/docpages/pig/skin/getMenu.js +45 -0
  19. data/docpages/pig/skin/images/chapter.gif +0 -0
  20. data/docpages/pig/skin/images/chapter_open.gif +0 -0
  21. data/docpages/pig/skin/images/current.gif +0 -0
  22. data/docpages/pig/skin/images/external-link.gif +0 -0
  23. data/docpages/pig/skin/images/header_white_line.gif +0 -0
  24. data/docpages/pig/skin/images/page.gif +0 -0
  25. data/docpages/pig/skin/images/pdfdoc.gif +0 -0
  26. data/docpages/pig/skin/images/rc-b-l-15-1body-2menu-3menu.png +0 -0
  27. data/docpages/pig/skin/images/rc-b-r-15-1body-2menu-3menu.png +0 -0
  28. data/docpages/pig/skin/images/rc-b-r-5-1header-2tab-selected-3tab-selected.png +0 -0
  29. data/docpages/pig/skin/images/rc-t-l-5-1header-2searchbox-3searchbox.png +0 -0
  30. data/docpages/pig/skin/images/rc-t-l-5-1header-2tab-selected-3tab-selected.png +0 -0
  31. data/docpages/pig/skin/images/rc-t-l-5-1header-2tab-unselected-3tab-unselected.png +0 -0
  32. data/docpages/pig/skin/images/rc-t-r-15-1body-2menu-3menu.png +0 -0
  33. data/docpages/pig/skin/images/rc-t-r-5-1header-2searchbox-3searchbox.png +0 -0
  34. data/docpages/pig/skin/images/rc-t-r-5-1header-2tab-selected-3tab-selected.png +0 -0
  35. data/docpages/pig/skin/images/rc-t-r-5-1header-2tab-unselected-3tab-unselected.png +0 -0
  36. data/docpages/pig/skin/print.css +54 -0
  37. data/docpages/pig/skin/profile.css +181 -0
  38. data/docpages/pig/skin/screen.css +587 -0
  39. data/docpages/pig/tutorial.html +1059 -0
  40. data/docpages/pig/udf.html +1509 -0
  41. data/examples/keystore/conditional_outputter_example.rb +70 -0
  42. data/examples/{graph → network_graph}/adjacency_list.rb +0 -0
  43. data/examples/{graph → network_graph}/breadth_first_search.rb +0 -0
  44. data/examples/{graph → network_graph}/gen_2paths.rb +0 -0
  45. data/examples/{graph → network_graph}/gen_multi_edge.rb +0 -0
  46. data/examples/{graph → network_graph}/gen_symmetric_links.rb +0 -0
  47. data/examples/pagerank/run_pagerank.sh +10 -8
  48. data/examples/{apache_log_parser.rb → server_logs/apache_log_parser.rb} +0 -0
  49. data/examples/stupidly_simple_filter.rb +43 -0
  50. data/lib/wukong/extensions/hash.rb +13 -0
  51. data/lib/wukong/extensions/hash_like.rb +7 -0
  52. data/lib/wukong/keystore/cassandra_conditional_outputter.rb +122 -0
  53. data/lib/wukong/script.rb +27 -22
  54. data/lib/wukong/script/hadoop_command.rb +5 -3
  55. data/lib/wukong/streamer/accumulating_reducer.rb +2 -1
  56. data/wukong.gemspec +64 -26
  57. metadata +89 -31
  58. data/docpages/pig/PigLatinReferenceManual.html +0 -19134
  59. data/examples/foo.rb +0 -9
  60. data/examples/package-local.rb +0 -100
  61. data/examples/package.rb +0 -96
  62. data/examples/run_all.sh +0 -47
data/CHANGELOG.textile CHANGED
@@ -1,3 +1,12 @@
+ h2. Wukong v1.4.9 2010-06-05
+
+ * made scripts inject a helpful job name using mapred.job.name
+ * Hash.compact_blank! and HashLike.compact_blank! -- eliminate all key-value pairs whose value is blank?
+
+ h2. Wukong v1.4.8 2010-05-17
+
+ * fixed a bug in passing command-line args down to map and reduce child processes
+
  h2. Wukong v1.4.7 2010-03-05
 
  Lots more examples:
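
The compact_blank! extensions themselves live in data/lib/wukong/extensions/hash.rb and hash_like.rb, whose bodies this page doesn't display; a minimal Ruby sketch of the behavior the changelog describes, assuming blank? means nil, empty, or whitespace-only:

# Hypothetical sketch only -- the shipped implementation is in
# data/lib/wukong/extensions/hash.rb, which is not shown on this page.
class Hash
  # Destructively remove every pair whose value is blank:
  # nil, empty, or whitespace-only.
  def compact_blank!
    reject!{|_key, val| val.nil? || val.to_s.strip.empty? }
    self
  end
end

h = { :name => 'wukong', :note => '  ', :tags => nil }
h.compact_blank!  # => {:name=>"wukong"}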
data/README.textile CHANGED
@@ -1,6 +1,6 @@
  h1. Wukong
 
- Wukong makes "Hadoop":http://hadoop.apache.org/core so easy a chimpanzee can use it.
+ Wukong is Ruby for Hadoop -- it makes "Hadoop":http://hadoop.apache.org/core so easy a chimpanzee can use it.
 
  Treat your dataset like a
  * stream of lines when it's efficient to process by lines
data/bin/hdp-bzip ADDED
@@ -0,0 +1,28 @@
+ #!/bin/bash
+
+ HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
+
+ OUTPUT="$1" ; shift
+
+ INPUTS=''
+ for foo in $@; do
+   INPUTS="$INPUTS -input $foo\
+ "
+ done
+
+ echo "Removing output directory $OUTPUT"
+ hadoop fs -rmr $OUTPUT
+
+ cmd="${HADOOP_HOME}/bin/hadoop \
+   jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar \
+   -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
+   -jobconf mapred.output.compress=true \
+   -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
+   -jobconf mapred.reduce.tasks=1 \
+   -mapper \"/bin/cat\" \
+   -reducer \"/usr/bin/uniq\" \
+   $INPUTS
+   -output $OUTPUT \
+ "
+ echo $cmd
+ $cmd
data/bin/hdp-mkdir CHANGED
@@ -1,3 +1,3 @@
  #!/usr/bin/env bash
 
- exec hadoop dfs -mkdir "$@"
+ exec hadoop fs -mkdir "$@"
data/bin/hdp-stream-flat CHANGED
@@ -15,8 +15,9 @@ HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
  exec ${HADOOP_HOME}/bin/hadoop \
    jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar \
+   "$@" \
+   -jobconf "mapred.job.name=`basename $0`-$map_script-$input_file-$output_file" \
    -mapper "$map_script" \
    -reducer "$reduce_script" \
    -input "$input_file" \
-   -output "$output_file" \
-   "$@"
+   -output "$output_file"
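
The Ruby side of the same job-name change lands in data/lib/wukong/script/hadoop_command.rb, which this page only summarizes; a rough sketch, with hypothetical method and variable names, of how a runner could assemble the equivalent flag:

# Hypothetical sketch -- not the actual hadoop_command.rb change,
# which is not shown in full on this page.
def job_name_jobconf script_file, input_paths, output_path
  job_name = [File.basename(script_file), input_paths, output_path].join('-')
  "-jobconf 'mapred.job.name=#{job_name}'"
end

puts job_name_jobconf('wordcount.rb', 'logs/2010', 'out/wordcount')
# prints: -jobconf 'mapred.job.name=wordcount.rb-logs/2010-out/wordcount'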
data/bin/wu-lign CHANGED
@@ -101,22 +101,26 @@ end
  #
  FORMAT_GUESSING_LINES = 500
  # widest column to set
- MAX_MAX_WIDTH = 70
+ MAX_MAX_WIDTH = 100
 
  INT_RE   = /\A\d+\z/
  FLOAT_RE = /\A(\d+)(?:\.(\d+))?(?:e-?\d+)?\z/
 
- def consensus_type val, alltype
-   return :mixed if alltype == :mixed
+ def get_type val
    case
    when val == ''       then type = nil
    when val =~ INT_RE   then type = :int
    when val =~ FLOAT_RE then type = :float
-   else                      type = :str end
-   return if ! type
+   else                      type = :str end
+ end
+
+ def consensus_type val, alltype, is_first
+   return :mixed if alltype == :mixed
+   type = get_type(val) or return
    case
-   when alltype.nil?    then type
-   when alltype == type then type
+   when alltype.nil?                  then type
+   when is_first && (alltype == :str) then type
+   when alltype == type               then type
    when ( ((alltype==:float) && (type == :int)) || ((alltype == :int) && (type == :float)) )
      :float
    else :mixed
@@ -134,24 +138,27 @@ col_minmag = []
  col_maxmag = []
  rows = []
  skip_col = []
+ has_header = false
  ARGV.each_with_index{|v,i| next if (v == '') ; maxw[i] = 0; skip_col[i] = true }
  FORMAT_GUESSING_LINES.times do
    line = $stdin.readline rescue nil
    break unless line
-   cols = line.chomp.split("\t").map{|s| s.strip }
-   col_widths = cols.map{|col| col.length }
+   row = line.chomp.split("\t").map{|s| s.strip }
+   col_widths = row.map{|col| col.length }
    col_widths.each_with_index{|cw,i| maxw[i] = [[cw,maxw[i]].compact.max, MAX_MAX_WIDTH].min }
-   cols.each_with_index{|col,i|
+   row.each_with_index{|col,i|
      next if skip_col[i]
-     col_types[i] = consensus_type(col, col_types[i])
+     # Let the first row be text (headers)
+     col_types[i] = consensus_type(col, col_types[i], rows.length == 1)
      if col_types[i] == :float
        mantissa, radix = f_width(col)
        col_minmag[i] = [radix, col_minmag[i], 1].compact.max
        col_maxmag[i] = [mantissa, col_maxmag[i], 1].compact.max
      end
    }
-   # p [maxw, col_types, col_minmag, col_maxmag, col_widths, cols]
-   rows << cols
+   # p [rows.length, has_header, maxw, col_types, col_minmag, col_maxmag, col_widths, row]
+   has_header = true if row.all?{|col| get_type(col) == :str } && rows.length == 0
+   rows << row
  end
 
  format = maxw.zip(col_types, col_minmag, col_maxmag, ARGV).map do |width, type, minmag, maxmag, default|
@@ -160,18 +167,25 @@ format = maxw.zip(col_types, col_minmag, col_maxmag, ARGV).map do |width, type,
    when :mixed, nil then lambda{|s| "%-#{width}s" % s }
    when :str        then lambda{|s| "%-#{width}s" % s }
    when :int        then lambda{|s| "%#{width}d"  % s.to_i }
-   when :float      then lambda{|s| "%#{maxmag+minmag+1}.#{minmag}f" % s.to_f }
+   when :float      then lambda{|s| "%#{maxmag+minmag+2}.#{minmag}f" % s.to_f }
    else raise "oops type #{type}" end
  end
- # p [maxw, col_types, col_minmag, col_maxmag, format]
+
+ def dump_row row, format
+   puts row.zip(format).map{|c,f| f.call(c) rescue c }.join("\t")
+ end
+ def dump_header row, maxw
+   puts row.zip(maxw).map{|col, width| "%-#{width}s" % col.to_s }.join("\t")
+ end
 
  pad = [''] * maxw.length
+ dump_header(rows.shift, maxw) if has_header
  rows.each do |row|
    # note -- strips trailing columns
-   puts row.zip(format).map{|c,f| f.call(c) }.join("\t")
+   dump_row(row, format)
  end
  $stdin.each do |line|
-   cols = line.chomp.split("\t").map{|s| s.strip }
+   row = line.chomp.split("\t").map{|s| s.strip }
    # note -- strips trailing columns
-   puts cols.zip(format).map{|c,f| f.call(c) rescue c }.join("\t")
+   dump_row(row, format)
  end
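
The header-detection change above is easiest to see by running the type guessing by hand; a self-contained Ruby sketch (reusing the INT_RE/FLOAT_RE regexes from the script) of how sampled rows settle each column's type:

# Standalone sketch of wu-lign's column type guessing, using the same
# regexes as the script above.
INT_RE   = /\A\d+\z/
FLOAT_RE = /\A(\d+)(?:\.(\d+))?(?:e-?\d+)?\z/

def get_type val
  case val
  when ''       then nil
  when INT_RE   then :int
  when FLOAT_RE then :float
  else               :str
  end
end

rows = [
  %w[id score name],   # all-string first row => wu-lign treats it as a header
  %w[1 0.25 wukong],
  %w[2 3.5 hadoop],
]
rows.each{|row| p row.map{|col| get_type(col) } }
# [:str, :str, :str]   <- flags has_header
# [:int, :float, :str]
# [:int, :float, :str]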
data/docpages/pig/cookbook.html ADDED
@@ -0,0 +1,481 @@
+ <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+ <html>
+ <head>
+ <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
+ <meta content="Apache Forrest" name="Generator">
+ <meta name="Forrest-version" content="0.8">
+ <meta name="Forrest-skin-name" content="pelt">
+ <title>Pig Cookbook</title>
+ <link type="text/css" href="skin/basic.css" rel="stylesheet">
+ <link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
+ <link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
+ <link type="text/css" href="skin/profile.css" rel="stylesheet">
+ <script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script>
+ <link rel="shortcut icon" href="">
+ </head>
+ <body onload="init()">
+ <script type="text/javascript">ndeSetTextSize();</script>
+ <div id="top">
+ <!--+
+ |breadtrail
+ +-->
+ <div class="breadtrail">
+ <a href="http://www.apache.org/">Apache</a> &gt; <a href="http://hadoop.apache.org/">Hadoop</a> &gt; <a href="http://hadoop.apache.org/pig/">Pig</a><script src="skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script>
+ </div>
+ <!--+
+ |header
+ +-->
+ <div class="header">
+ <!--+
+ |start group logo
+ +-->
+ <div class="grouplogo">
+ <a href="http://hadoop.apache.org/"><img class="logoImage" alt="Hadoop" src="images/hadoop-logo.jpg" title="Apache Hadoop"></a>
+ </div>
+ <!--+
+ |end group logo
+ +-->
+ <!--+
+ |start Project Logo
+ +-->
+ <div class="projectlogo">
+ <a href="http://hadoop.apache.org/pig/"><img class="logoImage" alt="Pig" src="images/pig-logo.gif" title="A platform for analyzing large datasets."></a>
+ </div>
+ <!--+
+ |end Project Logo
+ +-->
+ <!--+
+ |start Search
+ +-->
+ <div class="searchbox">
+ <form action="http://www.google.com/search" method="get" class="roundtopsmall">
+ <input value="" name="sitesearch" type="hidden"><input onFocus="getBlank (this, 'Search the site with google');" size="25" name="q" id="query" type="text" value="Search the site with google">&nbsp;
+ <input name="Search" value="Search" type="submit">
+ </form>
+ </div>
+ <!--+
+ |end search
+ +-->
+ <!--+
+ |start Tabs
+ +-->
+ <ul id="tabs">
+ <li>
+ <a class="unselected" href="http://hadoop.apache.org/pig/">Project</a>
+ </li>
+ <li>
+ <a class="unselected" href="http://wiki.apache.org/pig/">Wiki</a>
+ </li>
+ <li class="current">
+ <a class="selected" href="index.html">Pig 0.7.0 Documentation</a>
+ </li>
+ </ul>
+ <!--+
+ |end Tabs
+ +-->
+ </div>
+ </div>
+ <div id="main">
+ <div id="publishedStrip">
+ <!--+
+ |start Subtabs
+ +-->
+ <div id="level2tabs"></div>
+ <!--+
+ |end Endtabs
+ +-->
+ <script type="text/javascript"><!--
+ document.write("Last Published: " + document.lastModified);
+ // --></script>
+ </div>
+ <!--+
+ |breadtrail
+ +-->
+ <div class="breadtrail">
+
+ &nbsp;
+ </div>
+ <!--+
+ |start Menu, mainarea
+ +-->
+ <!--+
+ |start Menu
+ +-->
+ <div id="menu">
+ <div onclick="SwitchMenu('menu_selected_1.1', 'skin/')" id="menu_selected_1.1Title" class="menutitle" style="background-image: url('skin/images/chapter_open.gif');">Pig</div>
+ <div id="menu_selected_1.1" class="selectedmenuitemgroup" style="display: block;">
+ <div class="menuitem">
+ <a href="index.html">Overview</a>
+ </div>
+ <div class="menuitem">
+ <a href="setup.html">Setup</a>
+ </div>
+ <div class="menuitem">
+ <a href="tutorial.html">Tutorial</a>
+ </div>
+ <div class="menuitem">
+ <a href="piglatin_ref1.html">Pig Latin 1</a>
+ </div>
+ <div class="menuitem">
+ <a href="piglatin_ref2.html">Pig Latin 2</a>
+ </div>
+ <div class="menupage">
+ <div class="menupagetitle">Cookbook</div>
+ </div>
+ <div class="menuitem">
+ <a href="udf.html">UDFs</a>
+ </div>
+ </div>
+ <div onclick="SwitchMenu('menu_1.2', 'skin/')" id="menu_1.2Title" class="menutitle">Zebra</div>
+ <div id="menu_1.2" class="menuitemgroup">
+ <div class="menuitem">
+ <a href="zebra_overview.html">Zebra Overview </a>
+ </div>
+ <div class="menuitem">
+ <a href="zebra_users.html">Zebra Users </a>
+ </div>
+ <div class="menuitem">
+ <a href="zebra_reference.html">Zebra Reference </a>
+ </div>
+ <div class="menuitem">
+ <a href="zebra_mapreduce.html">Zebra MapReduce </a>
+ </div>
+ <div class="menuitem">
+ <a href="zebra_pig.html">Zebra Pig </a>
+ </div>
+ <div class="menuitem">
+ <a href="zebra_stream.html">Zebra Streaming </a>
+ </div>
+ </div>
+ <div onclick="SwitchMenu('menu_1.3', 'skin/')" id="menu_1.3Title" class="menutitle">Miscellaneous</div>
+ <div id="menu_1.3" class="menuitemgroup">
+ <div class="menuitem">
+ <a href="api/">API Docs</a>
+ </div>
+ <div class="menuitem">
+ <a href="http://wiki.apache.org/pig/">Wiki</a>
+ </div>
+ <div class="menuitem">
+ <a href="http://wiki.apache.org/pig/FAQ">FAQ</a>
+ </div>
+ <div class="menuitem">
+ <a href="http://hadoop.apache.org/pig/releases.html">Release Notes</a>
+ </div>
+ </div>
+ <div id="credit"></div>
+ <div id="roundbottom">
+ <img style="display: none" class="corner" height="15" width="15" alt="" src="skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
+ <!--+
+ |alternative credits
+ +-->
+ <div id="credit2"></div>
+ </div>
+ <!--+
+ |end Menu
+ +-->
+ <!--+
+ |start content
+ +-->
+ <div id="content">
+ <div title="Portable Document Format" class="pdflink">
+ <a class="dida" href="cookbook.pdf"><img alt="PDF -icon" src="skin/images/pdfdoc.gif" class="skin"><br>
+ PDF</a>
+ </div>
+ <h1>Pig Cookbook</h1>
+ <div id="minitoc-area">
+ <ul class="minitoc">
+ <li>
+ <a href="#Overview">Overview</a>
+ </li>
+ <li>
+ <a href="#Performance+Enhancers">Performance Enhancers</a>
+ <ul class="minitoc">
+ <li>
+ <a href="#Use+Optimization">Use Optimization</a>
+ </li>
+ <li>
+ <a href="#Use+Types">Use Types</a>
+ </li>
+ <li>
+ <a href="#Project+Early+and+Often">Project Early and Often </a>
+ </li>
+ <li>
+ <a href="#Filter+Early+and+Often">Filter Early and Often</a>
+ </li>
+ <li>
+ <a href="#Reduce+Your+Operator+Pipeline">Reduce Your Operator Pipeline</a>
+ </li>
+ <li>
+ <a href="#Make+Your+UDFs+Algebraic">Make Your UDFs Algebraic</a>
+ </li>
+ <li>
+ <a href="#Implement+the+Aggregator+Interface">Implement the Aggregator Interface</a>
+ </li>
+ <li>
+ <a href="#Drop+Nulls+Before+a+Join">Drop Nulls Before a Join</a>
+ </li>
+ <li>
+ <a href="#Take+Advantage+of+Join+Optimizations">Take Advantage of Join Optimizations</a>
+ </li>
+ <li>
+ <a href="#Use+the+PARALLEL+Clause">Use the PARALLEL Clause</a>
+ </li>
+ <li>
+ <a href="#Use+the+LIMIT+Operator">Use the LIMIT Operator</a>
+ </li>
+ <li>
+ <a href="#Prefer+DISTINCT+over+GROUP+BY+-+GENERATE">Prefer DISTINCT over GROUP BY - GENERATE</a>
+ </li>
+ </ul>
+ </li>
+ </ul>
+ </div>
+
+
+ <a name="N1000D"></a><a name="Overview"></a>
+ <h2 class="h3">Overview</h2>
+ <div class="section">
+ <p>This document provides hints and tips for Pig users. </p>
+ </div>
+
+
+ <a name="N10017"></a><a name="Performance+Enhancers"></a>
+ <h2 class="h3">Performance Enhancers</h2>
+ <div class="section">
+ <a name="N1001D"></a><a name="Use+Optimization"></a>
+ <h3 class="h4">Use Optimization</h3>
+ <p>Pig supports various <a href="piglatin_ref1.html#Optimization+Rules">optimization rules</a> which are turned on by default.
+ Become familiar with these rules.</p>
+ <a name="N1002B"></a><a name="Use+Types"></a>
+ <h3 class="h4">Use Types</h3>
+ <p>If types are not specified in the load statement, Pig assumes the type of double for numeric computations.
+ A lot of the time your data would fit a much smaller type, such as integer or long. Specifying the real type will help with the
+ speed of arithmetic computation. It has an additional advantage of early error detection. </p>
+ <pre class="code">
+ --Query 1
+ A = load 'myfile' as (t, u, v);
+ B = foreach A generate t + u;
+
+ --Query 2
+ A = load 'myfile' as (t: int, u: int, v);
+ B = foreach A generate t + u;
+ </pre>
+ <p>The second query will run more efficiently than the first. In some of our queries we have seen a 2x speedup. </p>
+ <a name="N1003C"></a><a name="Project+Early+and+Often"></a>
+ <h3 class="h4">Project Early and Often </h3>
+ <p>Pig does not (yet) determine when a field is no longer needed and drop the field from the row. For example, say you have a query like: </p>
+ <pre class="code">
+ A = load 'myfile' as (t, u, v);
+ B = load 'myotherfile' as (x, y, z);
+ C = join A by t, B by x;
+ D = group C by u;
+ E = foreach D generate group, COUNT($1);
+ </pre>
+ <p>There is no need for v, y, or z to participate in this query. And there is no need to carry both t and x past the join, just one will suffice. Changing the query above to the query below will greatly reduce the amount of data being carried through the map and reduce phases by Pig. </p>
+ <pre class="code">
+ A = load 'myfile' as (t, u, v);
+ A1 = foreach A generate t, u;
+ B = load 'myotherfile' as (x, y, z);
+ B1 = foreach B generate x;
+ C = join A1 by t, B1 by x;
+ C1 = foreach C generate t, u;
+ D = group C1 by u;
+ E = foreach D generate group, COUNT($1);
+ </pre>
+ <p>Depending on your data, this can produce significant time savings. In queries similar to the example shown here we have seen total time drop by 50%.</p>
+ <a name="N10054"></a><a name="Filter+Early+and+Often"></a>
+ <h3 class="h4">Filter Early and Often</h3>
+ <p>As with early projection, in most cases it is beneficial to apply filters as early as possible to reduce the amount of data flowing through the pipeline. </p>
+ <pre class="code">
+ -- Query 1
+ A = load 'myfile' as (t, u, v);
+ B = load 'myotherfile' as (x, y, z);
+ C = filter A by t == 1;
+ D = join C by t, B by x;
+ E = group D by u;
+ F = foreach E generate group, COUNT($1);
+
+ -- Query 2
+ A = load 'myfile' as (t, u, v);
+ B = load 'myotherfile' as (x, y, z);
+ C = join A by t, B by x;
+ D = group C by u;
+ E = foreach D generate group, COUNT($1);
+ F = filter E by C.t == 1;
+ </pre>
+ <p>The first query is clearly more efficient than the second one because it reduces the amount of data going into the join. </p>
+ <p>One case where pushing filters up might not be a good idea is if the cost of applying the filter is very high and only a small amount of data is filtered out. </p>
+ <a name="N10068"></a><a name="Reduce+Your+Operator+Pipeline"></a>
+ <h3 class="h4">Reduce Your Operator Pipeline</h3>
+ <p>For clarity of your script, you might choose to split your projections into several steps, for instance: </p>
+ <pre class="code">
+ A = load 'data' as (in: map[]);
+ -- get key out of the map
+ B = foreach A generate in#k1 as k1, in#k2 as k2;
+ -- concatenate the keys
+ C = foreach B generate CONCAT(k1, k2);
+ .......
+ </pre>
+ <p>While the example above is easier to read, you might want to consider combining the two foreach statements to improve your query performance: </p>
+ <pre class="code">
+ A = load 'data' as (in: map[]);
+ -- concatenate the keys from the map
+ B = foreach A generate CONCAT(in#k1, in#k2);
+ ....
+ </pre>
+ <p>The same goes for filters. </p>
+ <a name="N10080"></a><a name="Make+Your+UDFs+Algebraic"></a>
+ <h3 class="h4">Make Your UDFs Algebraic</h3>
+ <p>Queries that can take advantage of the combiner generally run much faster (sometimes several times faster) than the versions that don't. The latest code significantly improves combiner usage; however, you need to make sure you do your part. If you have a UDF that works on grouped data and is, by nature, algebraic (meaning its computation can be decomposed into multiple steps), make sure you implement it as such. For details on how to write algebraic UDFs, see the Pig UDF Manual and <a href="udf.html#Aggregate+Functions">Aggregate Functions</a>.</p>
+ <pre class="code">
+ A = load 'data' as (x, y, z)
+ B = group A by x;
+ C = foreach B generate group, MyUDF(A);
+ ....
+ </pre>
+ <p>If <span class="codefrag">MyUDF</span> is algebraic, the query will use the combiner and run much faster. You can run the <span class="codefrag">explain</span> command on your query to make sure the combiner is used. </p>
+ <a name="N1009B"></a><a name="Implement+the+Aggregator+Interface"></a>
+ <h3 class="h4">Implement the Aggregator Interface</h3>
+ <p>
+ If your UDF can't be made Algebraic but is able to deal with getting input in chunks rather than all at once, consider implementing the Aggregator interface to reduce the amount of memory used by your script. If your function <em>is</em> Algebraic and can be used in conjunction with Accumulator functions, you will need to implement the Accumulator interface as well as the Algebraic interface. For more information, see the Pig UDF Manual and <a href="udf.html#Accumulator+Interface">Accumulator Interface</a>.
+ </p>
+ <a name="N100AC"></a><a name="Drop+Nulls+Before+a+Join"></a>
+ <h3 class="h4">Drop Nulls Before a Join</h3>
+ <p>With the introduction of nulls, join and cogroup semantics were altered to work with nulls. The semantic for cogrouping with nulls is that nulls from a given input are grouped together, but nulls across inputs are not grouped together. This preserves the semantics of grouping (nulls are collected together from a single input to be passed to aggregate functions like COUNT) and the semantics of join (nulls are not joined across inputs). Since flattening an empty bag results in an empty row, in a standard join the rows with a null key will always be dropped. The join: </p>
+ <pre class="code">
+ A = load 'myfile' as (t, u, v);
+ B = load 'myotherfile' as (x, y, z);
+ C = join A by t, B by x;
+ </pre>
+ <p>is rewritten by Pig to </p>
+ <pre class="code">
+ A = load 'myfile' as (t, u, v);
+ B = load 'myotherfile' as (x, y, z);
+ C1 = cogroup A by t INNER, B by x INNER;
+ C = foreach C1 generate flatten(A), flatten(B);
+ </pre>
+ <p>Since the nulls from A and B won't be collected together, when the nulls are flattened we're guaranteed to have an empty bag, which will result in no output. So the null keys will be dropped. But they will not be dropped until the last possible moment. If the query is rewritten to </p>
+ <pre class="code">
+ A = load 'myfile' as (t, u, v);
+ B = load 'myotherfile' as (x, y, z);
+ A1 = filter A by t is not null;
+ B1 = filter B by x is not null;
+ C = join A1 by t, B1 by x;
+ </pre>
+ <p>then the nulls will be dropped before the join. Since all null keys go to a single reducer, if your key is null even a small percentage of the time the gain can be significant. In one test where the key was null 7% of the time and the data was spread across 200 reducers, we saw about a 10x speedup in the query by adding the early filters. </p>
+ <a name="N100CB"></a><a name="Take+Advantage+of+Join+Optimizations"></a>
+ <h3 class="h4">Take Advantage of Join Optimizations</h3>
+ <a name="N100D1"></a><a name="Regular+Join+Optimizations"></a>
+ <h4>Regular Join Optimizations</h4>
+ <p>Optimization for regular joins ensures that the last table in the join is not brought into memory but streamed through instead. Optimization reduces the amount of memory used, which means you can avoid spilling data and should be able to scale your query to larger data volumes. </p>
+ <p>To take advantage of this optimization, make sure that the table with the largest number of tuples per key is the last table in your query.
+ In some of our tests we saw a 10x performance improvement as the result of this optimization.</p>
+ <pre class="code">
+ small = load 'small_file' as (t, u, v);
+ large = load 'large_file' as (x, y, z);
+ C = join small by t, large by x;
+ </pre>
+ <a name="N100E2"></a><a name="Specialized+Join+Optimizations"></a>
+ <h4>Specialized Join Optimizations</h4>
+ <p>Optimization can also be achieved using fragment replicate joins, skewed joins, and merge joins.
+ For more information see <a href="piglatin_ref1.html#Specialized+Joins">Specialized Joins</a>.</p>
+ <a name="N100F1"></a><a name="Use+the+PARALLEL+Clause"></a>
+ <h3 class="h4">Use the PARALLEL Clause</h3>
+ <p>Use the PARALLEL clause to increase the parallelism of a job:</p>
+ <ul>
+
+ <li>PARALLEL sets the number of reduce tasks for the MapReduce jobs generated by Pig. The default value is 1 (one reduce task).</li>
+
+ <li>PARALLEL only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block. </li>
+
+ <li>If you don&rsquo;t specify PARALLEL, you still get the same map parallelism but only one reduce task.</li>
+
+ </ul>
+ <p></p>
+ <p>As noted, the default value for PARALLEL is 1 (one reduce task). However, the number of reducers you need for a particular construct in Pig that forms a MapReduce boundary depends entirely on (1) your data and the number of intermediate keys you are generating in your mappers and (2) the partitioner and distribution of map (combiner) output keys. In the best cases we have seen that a reducer processing about 500 MB of data behaves efficiently.</p>
+ <p>You can include the PARALLEL clause with any operator that starts a reduce phase (see the example below). This includes
+ <a href="piglatin_ref2.html#COGROUP">COGROUP</a>,
+ <a href="piglatin_ref2.html#CROSS">CROSS</a>,
+ <a href="piglatin_ref2.html#DISTINCT">DISTINCT</a>,
+ <a href="piglatin_ref2.html#GROUP">GROUP</a>,
+ <a href="piglatin_ref2.html#JOIN+%28inner%29">JOIN (inner)</a>,
+ <a href="piglatin_ref2.html#JOIN+%28outer%29">JOIN (outer)</a>, and
+ <a href="piglatin_ref2.html#ORDER">ORDER</a>.
+ </p>
+ <p>You can also set the value of PARALLEL for all Pig scripts using the <a href="piglatin_ref2.html#set">set default parallel</a> command.</p>
+ <p>In this example PARALLEL is used with the GROUP operator. </p>
+ <pre class="code">
+ A = LOAD 'myfile' AS (t, u, v);
+ B = GROUP A BY t PARALLEL 18;
+ .....
+ </pre>
+ <p>In this example all the MapReduce jobs that get launched use 20 reducers.</p>
+ <pre class="code">
+ SET DEFAULT_PARALLEL 20;
+ A = LOAD &lsquo;myfile.txt&rsquo; USING PigStorage() AS (t, u, v);
+ B = GROUP A BY t;
+ C = FOREACH B GENERATE group, COUNT(A.t) as mycount;
+ D = ORDER C BY mycount;
+ STORE D INTO &lsquo;mysortedcount&rsquo; USING PigStorage();
+ </pre>
+ <a name="N10140"></a><a name="Use+the+LIMIT+Operator"></a>
+ <h3 class="h4">Use the LIMIT Operator</h3>
+ <p>Often you are not interested in the entire output but rather a sample or the top results. In such cases, using LIMIT can yield much better performance, as Pig pushes the limit as high up the pipeline as possible to minimize the amount of data travelling through it. </p>
+ <p>Sample:
+ </p>
+ <pre class="code">
+ A = load 'myfile' as (t, u, v);
+ B = limit A 500;
+ </pre>
+ <p>Top results: </p>
+ <pre class="code">
+ A = load 'myfile' as (t, u, v);
+ B = order A by t;
+ C = limit B 500;
+ </pre>
+ <a name="N10158"></a><a name="Prefer+DISTINCT+over+GROUP+BY+-+GENERATE"></a>
+ <h3 class="h4">Prefer DISTINCT over GROUP BY - GENERATE</h3>
+ <p>When it comes to extracting the unique values from a column in a relation, one of two approaches can be used: </p>
+ <p>Example Using GROUP BY - GENERATE</p>
+ <pre class="code">
+ A = load 'myfile' as (t, u, v);
+ B = foreach A generate u;
+ C = group B by u;
+ D = foreach C generate group as uniquekey;
+ dump D;
+ </pre>
+ <p>Example Using DISTINCT</p>
+ <pre class="code">
+ A = load 'myfile' as (t, u, v);
+ B = foreach A generate u;
+ C = distinct B;
+ dump C;
+ </pre>
+ <p>In Pig 0.1.x, DISTINCT is just GROUP BY/PROJECT under the hood. In Pig 0.2.0 it is not, and it is much faster and more efficient (depending on your key cardinality, up to 20x faster in the Pig team's tests). Therefore, the use of DISTINCT is recommended over GROUP BY - GENERATE. </p>
+ </div>
+
+ </div>
+ <!--+
+ |end content
+ +-->
+ <div class="clearboth">&nbsp;</div>
+ </div>
+ <div id="footer">
+ <!--+
+ |start bottomstrip
+ +-->
+ <div class="lastmodified">
+ <script type="text/javascript"><!--
+ document.write("Last Published: " + document.lastModified);
+ // --></script>
+ </div>
+ <div class="copyright">
+ Copyright &copy;
+ 2007-2010 <a href="http://www.apache.org/licenses/">The Apache Software Foundation.</a>
+ </div>
+ <!--+
+ |end bottomstrip
+ +-->
+ </div>
+ </body>
+ </html>