wukong 1.4.7 → 1.4.9

Files changed (62)
  1. data/CHANGELOG.textile +9 -0
  2. data/README.textile +1 -1
  3. data/bin/hdp-bzip +28 -0
  4. data/bin/hdp-mkdir +1 -1
  5. data/bin/hdp-stream-flat +3 -2
  6. data/bin/wu-lign +32 -18
  7. data/docpages/pig/cookbook.html +481 -0
  8. data/docpages/pig/images/hadoop-logo.jpg +0 -0
  9. data/docpages/pig/images/instruction_arrow.png +0 -0
  10. data/docpages/pig/images/pig-logo.gif +0 -0
  11. data/docpages/pig/piglatin_ref1.html +1103 -0
  12. data/docpages/pig/piglatin_ref2.html +14340 -0
  13. data/docpages/pig/setup.html +505 -0
  14. data/docpages/pig/skin/basic.css +166 -0
  15. data/docpages/pig/skin/breadcrumbs.js +237 -0
  16. data/docpages/pig/skin/fontsize.js +166 -0
  17. data/docpages/pig/skin/getBlank.js +40 -0
  18. data/docpages/pig/skin/getMenu.js +45 -0
  19. data/docpages/pig/skin/images/chapter.gif +0 -0
  20. data/docpages/pig/skin/images/chapter_open.gif +0 -0
  21. data/docpages/pig/skin/images/current.gif +0 -0
  22. data/docpages/pig/skin/images/external-link.gif +0 -0
  23. data/docpages/pig/skin/images/header_white_line.gif +0 -0
  24. data/docpages/pig/skin/images/page.gif +0 -0
  25. data/docpages/pig/skin/images/pdfdoc.gif +0 -0
  26. data/docpages/pig/skin/images/rc-b-l-15-1body-2menu-3menu.png +0 -0
  27. data/docpages/pig/skin/images/rc-b-r-15-1body-2menu-3menu.png +0 -0
  28. data/docpages/pig/skin/images/rc-b-r-5-1header-2tab-selected-3tab-selected.png +0 -0
  29. data/docpages/pig/skin/images/rc-t-l-5-1header-2searchbox-3searchbox.png +0 -0
  30. data/docpages/pig/skin/images/rc-t-l-5-1header-2tab-selected-3tab-selected.png +0 -0
  31. data/docpages/pig/skin/images/rc-t-l-5-1header-2tab-unselected-3tab-unselected.png +0 -0
  32. data/docpages/pig/skin/images/rc-t-r-15-1body-2menu-3menu.png +0 -0
  33. data/docpages/pig/skin/images/rc-t-r-5-1header-2searchbox-3searchbox.png +0 -0
  34. data/docpages/pig/skin/images/rc-t-r-5-1header-2tab-selected-3tab-selected.png +0 -0
  35. data/docpages/pig/skin/images/rc-t-r-5-1header-2tab-unselected-3tab-unselected.png +0 -0
  36. data/docpages/pig/skin/print.css +54 -0
  37. data/docpages/pig/skin/profile.css +181 -0
  38. data/docpages/pig/skin/screen.css +587 -0
  39. data/docpages/pig/tutorial.html +1059 -0
  40. data/docpages/pig/udf.html +1509 -0
  41. data/examples/keystore/conditional_outputter_example.rb +70 -0
  42. data/examples/{graph → network_graph}/adjacency_list.rb +0 -0
  43. data/examples/{graph → network_graph}/breadth_first_search.rb +0 -0
  44. data/examples/{graph → network_graph}/gen_2paths.rb +0 -0
  45. data/examples/{graph → network_graph}/gen_multi_edge.rb +0 -0
  46. data/examples/{graph → network_graph}/gen_symmetric_links.rb +0 -0
  47. data/examples/pagerank/run_pagerank.sh +10 -8
  48. data/examples/{apache_log_parser.rb → server_logs/apache_log_parser.rb} +0 -0
  49. data/examples/stupidly_simple_filter.rb +43 -0
  50. data/lib/wukong/extensions/hash.rb +13 -0
  51. data/lib/wukong/extensions/hash_like.rb +7 -0
  52. data/lib/wukong/keystore/cassandra_conditional_outputter.rb +122 -0
  53. data/lib/wukong/script.rb +27 -22
  54. data/lib/wukong/script/hadoop_command.rb +5 -3
  55. data/lib/wukong/streamer/accumulating_reducer.rb +2 -1
  56. data/wukong.gemspec +64 -26
  57. metadata +89 -31
  58. data/docpages/pig/PigLatinReferenceManual.html +0 -19134
  59. data/examples/foo.rb +0 -9
  60. data/examples/package-local.rb +0 -100
  61. data/examples/package.rb +0 -96
  62. data/examples/run_all.sh +0 -47
data/CHANGELOG.textile CHANGED
@@ -1,3 +1,12 @@
+ h2. Wukong v1.4.9 2010-06-05
+
+ * made scripts inject a helpful job name using mapred.job.name
+ * Hash.compact_blank! and HashLike.compact_blank! -- eliminate all key-value pairs whose value is blank?
+
+ h2. Wukong v1.4.8 2010-05-17
+
+ * fixed a bug in passing command-line args down to map and reduce child processes
+
  h2. Wukong v1.4.7 2010-03-05
 
  Lots more examples:
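
The compact_blank! extensions themselves live in data/lib/wukong/extensions/hash.rb and hash_like.rb, whose bodies this page doesn't display; a minimal Ruby sketch of the behavior the changelog describes, assuming blank? means nil, empty, or whitespace-only:

# Hypothetical sketch only -- the shipped implementation is in
# data/lib/wukong/extensions/hash.rb, which is not shown on this page.
class Hash
  # Destructively remove every pair whose value is blank:
  # nil, empty, or whitespace-only.
  def compact_blank!
    reject!{|_key, val| val.nil? || val.to_s.strip.empty? }
    self
  end
end

h = { :name => 'wukong', :note => '  ', :tags => nil }
h.compact_blank!  # => {:name=>"wukong"}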
data/README.textile CHANGED
@@ -1,6 +1,6 @@
  h1. Wukong
 
- Wukong makes "Hadoop":http://hadoop.apache.org/core so easy a chimpanzee can use it.
+ Wukong is Ruby for Hadoop -- it makes "Hadoop":http://hadoop.apache.org/core so easy a chimpanzee can use it.
 
  Treat your dataset like a
  * stream of lines when it's efficient to process by lines
data/bin/hdp-bzip ADDED
@@ -0,0 +1,28 @@
+ #!/bin/bash
+
+ HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
+
+ OUTPUT="$1" ; shift
+
+ INPUTS=''
+ for foo in $@; do
+   INPUTS="$INPUTS -input $foo\
+ "
+ done
+
+ echo "Removing output directory $OUTPUT"
+ hadoop fs -rmr $OUTPUT
+
+ cmd="${HADOOP_HOME}/bin/hadoop \
+   jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar \
+   -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
+   -jobconf mapred.output.compress=true \
+   -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
+   -jobconf mapred.reduce.tasks=1 \
+   -mapper \"/bin/cat\" \
+   -reducer \"/usr/bin/uniq\" \
+   $INPUTS
+   -output $OUTPUT \
+ "
+ echo $cmd
+ $cmd
data/bin/hdp-mkdir CHANGED
@@ -1,3 +1,3 @@
  #!/usr/bin/env bash
 
- exec hadoop dfs -mkdir "$@"
+ exec hadoop fs -mkdir "$@"
data/bin/hdp-stream-flat CHANGED
@@ -15,8 +15,9 @@ HADOOP_HOME=${HADOOP_HOME-/usr/lib/hadoop}
 
  exec ${HADOOP_HOME}/bin/hadoop \
    jar ${HADOOP_HOME}/contrib/streaming/hadoop-*-streaming.jar \
+   "$@" \
+   -jobconf "mapred.job.name=`basename $0`-$map_script-$input_file-$output_file" \
    -mapper "$map_script" \
    -reducer "$reduce_script" \
    -input "$input_file" \
-   -output "$output_file" \
-   "$@"
+   -output "$output_file"
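
The Ruby side of the same job-name change lands in data/lib/wukong/script/hadoop_command.rb, which this page only summarizes; a rough sketch, with hypothetical method and variable names, of how a runner could assemble the equivalent flag:

# Hypothetical sketch -- not the actual hadoop_command.rb change,
# which is not shown in full on this page.
def job_name_jobconf script_file, input_paths, output_path
  job_name = [File.basename(script_file), input_paths, output_path].join('-')
  "-jobconf 'mapred.job.name=#{job_name}'"
end

puts job_name_jobconf('wordcount.rb', 'logs/2010', 'out/wordcount')
# prints: -jobconf 'mapred.job.name=wordcount.rb-logs/2010-out/wordcount'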
data/bin/wu-lign CHANGED
@@ -101,22 +101,26 @@ end
  #
  FORMAT_GUESSING_LINES = 500
  # widest column to set
- MAX_MAX_WIDTH = 70
+ MAX_MAX_WIDTH = 100
 
  INT_RE   = /\A\d+\z/
  FLOAT_RE = /\A(\d+)(?:\.(\d+))?(?:e-?\d+)?\z/
 
- def consensus_type val, alltype
-   return :mixed if alltype == :mixed
+ def get_type val
    case
    when val == ''       then type = nil
    when val =~ INT_RE   then type = :int
    when val =~ FLOAT_RE then type = :float
-   else                      type = :str end
-   return if ! type
+   else                      type = :str end
+ end
+
+ def consensus_type val, alltype, is_first
+   return :mixed if alltype == :mixed
+   type = get_type(val) or return
    case
-   when alltype.nil?    then type
-   when alltype == type then type
+   when alltype.nil?                  then type
+   when is_first && (alltype == :str) then type
+   when alltype == type               then type
    when ( ((alltype==:float) && (type == :int)) || ((alltype == :int) && (type == :float)) )
      :float
    else :mixed
@@ -134,24 +138,27 @@ col_minmag = []
  col_maxmag = []
  rows = []
  skip_col = []
+ has_header = false
  ARGV.each_with_index{|v,i| next if (v == '') ; maxw[i] = 0; skip_col[i] = true }
  FORMAT_GUESSING_LINES.times do
    line = $stdin.readline rescue nil
    break unless line
-   cols = line.chomp.split("\t").map{|s| s.strip }
-   col_widths = cols.map{|col| col.length }
+   row = line.chomp.split("\t").map{|s| s.strip }
+   col_widths = row.map{|col| col.length }
    col_widths.each_with_index{|cw,i| maxw[i] = [[cw,maxw[i]].compact.max, MAX_MAX_WIDTH].min }
-   cols.each_with_index{|col,i|
+   row.each_with_index{|col,i|
      next if skip_col[i]
-     col_types[i] = consensus_type(col, col_types[i])
+     # Let the first row be text (headers)
+     col_types[i] = consensus_type(col, col_types[i], rows.length == 1)
      if col_types[i] == :float
        mantissa, radix = f_width(col)
        col_minmag[i] = [radix, col_minmag[i], 1].compact.max
        col_maxmag[i] = [mantissa, col_maxmag[i], 1].compact.max
      end
    }
-   # p [maxw, col_types, col_minmag, col_maxmag, col_widths, cols]
-   rows << cols
+   # p [rows.length, has_header, maxw, col_types, col_minmag, col_maxmag, col_widths, row]
+   has_header = true if row.all?{|col| get_type(col) == :str } && rows.length == 0
+   rows << row
  end
 
  format = maxw.zip(col_types, col_minmag, col_maxmag, ARGV).map do |width, type, minmag, maxmag, default|
@@ -160,18 +167,25 @@ format = maxw.zip(col_types, col_minmag, col_maxmag, ARGV).map do |width, type,
    when :mixed, nil then lambda{|s| "%-#{width}s" % s }
    when :str        then lambda{|s| "%-#{width}s" % s }
    when :int        then lambda{|s| "%#{width}d"  % s.to_i }
-   when :float      then lambda{|s| "%#{maxmag+minmag+1}.#{minmag}f" % s.to_f }
+   when :float      then lambda{|s| "%#{maxmag+minmag+2}.#{minmag}f" % s.to_f }
    else raise "oops type #{type}" end
  end
- # p [maxw, col_types, col_minmag, col_maxmag, format]
+
+ def dump_row row, format
+   puts row.zip(format).map{|c,f| f.call(c) rescue c }.join("\t")
+ end
+ def dump_header row, maxw
+   puts row.zip(maxw).map{|col, width| "%-#{width}s" % col.to_s }.join("\t")
+ end
 
  pad = [''] * maxw.length
+ dump_header(rows.shift, maxw) if has_header
  rows.each do |row|
    # note -- strips trailing columns
-   puts row.zip(format).map{|c,f| f.call(c) }.join("\t")
+   dump_row(row, format)
  end
  $stdin.each do |line|
-   cols = line.chomp.split("\t").map{|s| s.strip }
+   row = line.chomp.split("\t").map{|s| s.strip }
    # note -- strips trailing columns
-   puts cols.zip(format).map{|c,f| f.call(c) rescue c }.join("\t")
+   dump_row(row, format)
  end
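
The header-detection change above is easiest to see by running the type guessing by hand; a self-contained Ruby sketch (reusing the INT_RE/FLOAT_RE regexes from the script) of how sampled rows settle each column's type:

# Standalone sketch of wu-lign's column type guessing, using the same
# regexes as the script above.
INT_RE   = /\A\d+\z/
FLOAT_RE = /\A(\d+)(?:\.(\d+))?(?:e-?\d+)?\z/

def get_type val
  case val
  when ''       then nil
  when INT_RE   then :int
  when FLOAT_RE then :float
  else               :str
  end
end

rows = [
  %w[id score name],   # all-string first row => wu-lign treats it as a header
  %w[1 0.25 wukong],
  %w[2 3.5 hadoop],
]
rows.each{|row| p row.map{|col| get_type(col) } }
# [:str, :str, :str]   <- flags has_header
# [:int, :float, :str]
# [:int, :float, :str]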
data/docpages/pig/cookbook.html ADDED
@@ -0,0 +1,481 @@
+ <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+ <html>
+ <head>
+ <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
+ <meta content="Apache Forrest" name="Generator">
+ <meta name="Forrest-version" content="0.8">
+ <meta name="Forrest-skin-name" content="pelt">
+ <title>Pig Cookbook</title>
+ <link type="text/css" href="skin/basic.css" rel="stylesheet">
+ <link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
+ <link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
+ <link type="text/css" href="skin/profile.css" rel="stylesheet">
+ <script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script>
+ <link rel="shortcut icon" href="">
+ </head>
+ <body onload="init()">
+ <script type="text/javascript">ndeSetTextSize();</script>
+ <div id="top">
+ <!--+
+ |breadtrail
+ +-->
+ <div class="breadtrail">
+ <a href="http://www.apache.org/">Apache</a> &gt; <a href="http://hadoop.apache.org/">Hadoop</a> &gt; <a href="http://hadoop.apache.org/pig/">Pig</a><script src="skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script>
+ </div>
+ <!--+
+ |header
+ +-->
+ <div class="header">
+ <!--+
+ |start group logo
+ +-->
+ <div class="grouplogo">
+ <a href="http://hadoop.apache.org/"><img class="logoImage" alt="Hadoop" src="images/hadoop-logo.jpg" title="Apache Hadoop"></a>
+ </div>
+ <!--+
+ |end group logo
+ +-->
+ <!--+
+ |start Project Logo
+ +-->
+ <div class="projectlogo">
+ <a href="http://hadoop.apache.org/pig/"><img class="logoImage" alt="Pig" src="images/pig-logo.gif" title="A platform for analyzing large datasets."></a>
+ </div>
+ <!--+
+ |end Project Logo
+ +-->
+ <!--+
+ |start Search
+ +-->
+ <div class="searchbox">
+ <form action="http://www.google.com/search" method="get" class="roundtopsmall">
+ <input value="" name="sitesearch" type="hidden"><input onFocus="getBlank (this, 'Search the site with google');" size="25" name="q" id="query" type="text" value="Search the site with google">&nbsp;
+ <input name="Search" value="Search" type="submit">
+ </form>
+ </div>
+ <!--+
+ |end search
+ +-->
+ <!--+
+ |start Tabs
+ +-->
+ <ul id="tabs">
+ <li>
+ <a class="unselected" href="http://hadoop.apache.org/pig/">Project</a>
+ </li>
+ <li>
+ <a class="unselected" href="http://wiki.apache.org/pig/">Wiki</a>
+ </li>
+ <li class="current">
+ <a class="selected" href="index.html">Pig 0.7.0 Documentation</a>
+ </li>
+ </ul>
+ <!--+
+ |end Tabs
+ +-->
+ </div>
+ </div>
+ <div id="main">
+ <div id="publishedStrip">
+ <!--+
+ |start Subtabs
+ +-->
+ <div id="level2tabs"></div>
+ <!--+
+ |end Endtabs
+ +-->
+ <script type="text/javascript"><!--
+ document.write("Last Published: " + document.lastModified);
+ // --></script>
+ </div>
+ <!--+
+ |breadtrail
+ +-->
+ <div class="breadtrail">
+
+ &nbsp;
+ </div>
+ <!--+
+ |start Menu, mainarea
+ +-->
+ <!--+
+ |start Menu
+ +-->
+ <div id="menu">
+ <div onclick="SwitchMenu('menu_selected_1.1', 'skin/')" id="menu_selected_1.1Title" class="menutitle" style="background-image: url('skin/images/chapter_open.gif');">Pig</div>
+ <div id="menu_selected_1.1" class="selectedmenuitemgroup" style="display: block;">
+ <div class="menuitem">
+ <a href="index.html">Overview</a>
+ </div>
+ <div class="menuitem">
+ <a href="setup.html">Setup</a>
+ </div>
+ <div class="menuitem">
+ <a href="tutorial.html">Tutorial</a>
+ </div>
+ <div class="menuitem">
+ <a href="piglatin_ref1.html">Pig Latin 1</a>
+ </div>
+ <div class="menuitem">
+ <a href="piglatin_ref2.html">Pig Latin 2</a>
+ </div>
+ <div class="menupage">
+ <div class="menupagetitle">Cookbook</div>
+ </div>
+ <div class="menuitem">
+ <a href="udf.html">UDFs</a>
+ </div>
+ </div>
+ <div onclick="SwitchMenu('menu_1.2', 'skin/')" id="menu_1.2Title" class="menutitle">Zebra</div>
+ <div id="menu_1.2" class="menuitemgroup">
+ <div class="menuitem">
+ <a href="zebra_overview.html">Zebra Overview </a>
+ </div>
+ <div class="menuitem">
+ <a href="zebra_users.html">Zebra Users </a>
+ </div>
+ <div class="menuitem">
+ <a href="zebra_reference.html">Zebra Reference </a>
+ </div>
+ <div class="menuitem">
+ <a href="zebra_mapreduce.html">Zebra MapReduce </a>
+ </div>
+ <div class="menuitem">
+ <a href="zebra_pig.html">Zebra Pig </a>
+ </div>
+ <div class="menuitem">
+ <a href="zebra_stream.html">Zebra Streaming </a>
+ </div>
+ </div>
+ <div onclick="SwitchMenu('menu_1.3', 'skin/')" id="menu_1.3Title" class="menutitle">Miscellaneous</div>
+ <div id="menu_1.3" class="menuitemgroup">
+ <div class="menuitem">
+ <a href="api/">API Docs</a>
+ </div>
+ <div class="menuitem">
+ <a href="http://wiki.apache.org/pig/">Wiki</a>
+ </div>
+ <div class="menuitem">
+ <a href="http://wiki.apache.org/pig/FAQ">FAQ</a>
+ </div>
+ <div class="menuitem">
+ <a href="http://hadoop.apache.org/pig/releases.html">Release Notes</a>
+ </div>
+ </div>
+ <div id="credit"></div>
+ <div id="roundbottom">
+ <img style="display: none" class="corner" height="15" width="15" alt="" src="skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
+ <!--+
+ |alternative credits
+ +-->
+ <div id="credit2"></div>
+ </div>
+ <!--+
+ |end Menu
+ +-->
+ <!--+
+ |start content
+ +-->
+ <div id="content">
+ <div title="Portable Document Format" class="pdflink">
+ <a class="dida" href="cookbook.pdf"><img alt="PDF -icon" src="skin/images/pdfdoc.gif" class="skin"><br>
+ PDF</a>
+ </div>
+ <h1>Pig Cookbook</h1>
+ <div id="minitoc-area">
+ <ul class="minitoc">
+ <li>
+ <a href="#Overview">Overview</a>
+ </li>
+ <li>
+ <a href="#Performance+Enhancers">Performance Enhancers</a>
+ <ul class="minitoc">
+ <li>
+ <a href="#Use+Optimization">Use Optimization</a>
+ </li>
+ <li>
+ <a href="#Use+Types">Use Types</a>
+ </li>
+ <li>
+ <a href="#Project+Early+and+Often">Project Early and Often </a>
+ </li>
+ <li>
+ <a href="#Filter+Early+and+Often">Filter Early and Often</a>
+ </li>
+ <li>
+ <a href="#Reduce+Your+Operator+Pipeline">Reduce Your Operator Pipeline</a>
+ </li>
+ <li>
+ <a href="#Make+Your+UDFs+Algebraic">Make Your UDFs Algebraic</a>
+ </li>
+ <li>
+ <a href="#Implement+the+Aggregator+Interface">Implement the Aggregator Interface</a>
+ </li>
+ <li>
+ <a href="#Drop+Nulls+Before+a+Join">Drop Nulls Before a Join</a>
+ </li>
+ <li>
+ <a href="#Take+Advantage+of+Join+Optimizations">Take Advantage of Join Optimizations</a>
+ </li>
+ <li>
+ <a href="#Use+the+PARALLEL+Clause">Use the PARALLEL Clause</a>
+ </li>
+ <li>
+ <a href="#Use+the+LIMIT+Operator">Use the LIMIT Operator</a>
+ </li>
+ <li>
+ <a href="#Prefer+DISTINCT+over+GROUP+BY+-+GENERATE">Prefer DISTINCT over GROUP BY - GENERATE</a>
+ </li>
+ </ul>
+ </li>
+ </ul>
+ </div>
+
+
+ <a name="N1000D"></a><a name="Overview"></a>
+ <h2 class="h3">Overview</h2>
+ <div class="section">
+ <p>This document provides hints and tips for Pig users. </p>
+ </div>
+
+
+ <a name="N10017"></a><a name="Performance+Enhancers"></a>
+ <h2 class="h3">Performance Enhancers</h2>
+ <div class="section">
+ <a name="N1001D"></a><a name="Use+Optimization"></a>
+ <h3 class="h4">Use Optimization</h3>
+ <p>Pig supports various <a href="piglatin_ref1.html#Optimization+Rules">optimization rules</a> which are turned on by default.
+ Become familiar with these rules.</p>
+ <a name="N1002B"></a><a name="Use+Types"></a>
+ <h3 class="h4">Use Types</h3>
+ <p>If types are not specified in the load statement, Pig assumes the type of double for numeric computations.
+ A lot of the time your data would fit a much smaller type, such as integer or long. Specifying the real type will help with the
+ speed of arithmetic computation. It has an additional advantage of early error detection. </p>
+ <pre class="code">
+ --Query 1
+ A = load 'myfile' as (t, u, v);
+ B = foreach A generate t + u;
+
+ --Query 2
+ A = load 'myfile' as (t: int, u: int, v);
+ B = foreach A generate t + u;
+ </pre>
+ <p>The second query will run more efficiently than the first. In some of our queries we have seen a 2x speedup. </p>
+ <a name="N1003C"></a><a name="Project+Early+and+Often"></a>
+ <h3 class="h4">Project Early and Often </h3>
+ <p>Pig does not (yet) determine when a field is no longer needed and drop the field from the row. For example, say you have a query like: </p>
+ <pre class="code">
+ A = load 'myfile' as (t, u, v);
+ B = load 'myotherfile' as (x, y, z);
+ C = join A by t, B by x;
+ D = group C by u;
+ E = foreach D generate group, COUNT($1);
+ </pre>
+ <p>There is no need for v, y, or z to participate in this query. And there is no need to carry both t and x past the join, just one will suffice. Changing the query above to the query below will greatly reduce the amount of data being carried through the map and reduce phases by Pig. </p>
+ <pre class="code">
+ A = load 'myfile' as (t, u, v);
+ A1 = foreach A generate t, u;
+ B = load 'myotherfile' as (x, y, z);
+ B1 = foreach B generate x;
+ C = join A1 by t, B1 by x;
+ C1 = foreach C generate t, u;
+ D = group C1 by u;
+ E = foreach D generate group, COUNT($1);
+ </pre>
+ <p>Depending on your data, this can produce significant time savings. In queries similar to the example shown here we have seen total time drop by 50%.</p>
+ <a name="N10054"></a><a name="Filter+Early+and+Often"></a>
+ <h3 class="h4">Filter Early and Often</h3>
+ <p>As with early projection, in most cases it is beneficial to apply filters as early as possible to reduce the amount of data flowing through the pipeline. </p>
+ <pre class="code">
+ -- Query 1
+ A = load 'myfile' as (t, u, v);
+ B = load 'myotherfile' as (x, y, z);
+ C = filter A by t == 1;
+ D = join C by t, B by x;
+ E = group D by u;
+ F = foreach E generate group, COUNT($1);
+
+ -- Query 2
+ A = load 'myfile' as (t, u, v);
+ B = load 'myotherfile' as (x, y, z);
+ C = join A by t, B by x;
+ D = group C by u;
+ E = foreach D generate group, COUNT($1);
+ F = filter E by C.t == 1;
+ </pre>
+ <p>The first query is clearly more efficient than the second one because it reduces the amount of data going into the join. </p>
+ <p>One case where pushing filters up might not be a good idea is if the cost of applying the filter is very high and only a small amount of data is filtered out. </p>
+ <a name="N10068"></a><a name="Reduce+Your+Operator+Pipeline"></a>
+ <h3 class="h4">Reduce Your Operator Pipeline</h3>
+ <p>For clarity of your script, you might choose to split your projections into several steps, for instance: </p>
+ <pre class="code">
+ A = load 'data' as (in: map[]);
+ -- get key out of the map
+ B = foreach A generate in#k1 as k1, in#k2 as k2;
+ -- concatenate the keys
+ C = foreach B generate CONCAT(k1, k2);
+ .......
+ </pre>
+ <p>While the example above is easier to read, you might want to consider combining the two foreach statements to improve your query performance: </p>
+ <pre class="code">
+ A = load 'data' as (in: map[]);
+ -- concatenate the keys from the map
+ B = foreach A generate CONCAT(in#k1, in#k2);
+ ....
+ </pre>
+ <p>The same goes for filters. </p>
+ <a name="N10080"></a><a name="Make+Your+UDFs+Algebraic"></a>
+ <h3 class="h4">Make Your UDFs Algebraic</h3>
+ <p>Queries that can take advantage of the combiner generally run much faster (sometimes several times faster) than the versions that don't. The latest code significantly improves combiner usage; however, you need to make sure you do your part. If you have a UDF that works on grouped data and is, by nature, algebraic (meaning its computation can be decomposed into multiple steps), make sure you implement it as such. For details on how to write algebraic UDFs, see the Pig UDF Manual and <a href="udf.html#Aggregate+Functions">Aggregate Functions</a>.</p>
+ <pre class="code">
+ A = load 'data' as (x, y, z)
+ B = group A by x;
+ C = foreach B generate group, MyUDF(A);
+ ....
+ </pre>
+ <p>If <span class="codefrag">MyUDF</span> is algebraic, the query will use the combiner and run much faster. You can run the <span class="codefrag">explain</span> command on your query to make sure the combiner is used. </p>
+ <a name="N1009B"></a><a name="Implement+the+Aggregator+Interface"></a>
+ <h3 class="h4">Implement the Aggregator Interface</h3>
+ <p>
+ If your UDF can't be made Algebraic but is able to deal with getting input in chunks rather than all at once, consider implementing the Aggregator interface to reduce the amount of memory used by your script. If your function <em>is</em> Algebraic and can be used in conjunction with Accumulator functions, you will need to implement the Accumulator interface as well as the Algebraic interface. For more information, see the Pig UDF Manual and <a href="udf.html#Accumulator+Interface">Accumulator Interface</a>.
+ </p>
+ <a name="N100AC"></a><a name="Drop+Nulls+Before+a+Join"></a>
+ <h3 class="h4">Drop Nulls Before a Join</h3>
+ <p>With the introduction of nulls, join and cogroup semantics were altered to work with nulls. The semantic for cogrouping with nulls is that nulls from a given input are grouped together, but nulls across inputs are not grouped together. This preserves the semantics of grouping (nulls are collected together from a single input to be passed to aggregate functions like COUNT) and the semantics of join (nulls are not joined across inputs). Since flattening an empty bag results in an empty row, in a standard join the rows with a null key will always be dropped. The join: </p>
+ <pre class="code">
+ A = load 'myfile' as (t, u, v);
+ B = load 'myotherfile' as (x, y, z);
+ C = join A by t, B by x;
+ </pre>
+ <p>is rewritten by Pig to </p>
+ <pre class="code">
+ A = load 'myfile' as (t, u, v);
+ B = load 'myotherfile' as (x, y, z);
+ C1 = cogroup A by t INNER, B by x INNER;
+ C = foreach C1 generate flatten(A), flatten(B);
+ </pre>
+ <p>Since the nulls from A and B won't be collected together, when the nulls are flattened we're guaranteed to have an empty bag, which will result in no output. So the null keys will be dropped. But they will not be dropped until the last possible moment. If the query is rewritten to </p>
+ <pre class="code">
+ A = load 'myfile' as (t, u, v);
+ B = load 'myotherfile' as (x, y, z);
+ A1 = filter A by t is not null;
+ B1 = filter B by x is not null;
+ C = join A1 by t, B1 by x;
+ </pre>
+ <p>then the nulls will be dropped before the join. Since all null keys go to a single reducer, if your key is null even a small percentage of the time the gain can be significant. In one test where the key was null 7% of the time and the data was spread across 200 reducers, we saw about a 10x speedup in the query by adding the early filters. </p>
+ <a name="N100CB"></a><a name="Take+Advantage+of+Join+Optimizations"></a>
+ <h3 class="h4">Take Advantage of Join Optimizations</h3>
+ <a name="N100D1"></a><a name="Regular+Join+Optimizations"></a>
+ <h4>Regular Join Optimizations</h4>
+ <p>Optimization for regular joins ensures that the last table in the join is not brought into memory but streamed through instead. Optimization reduces the amount of memory used, which means you can avoid spilling data and should be able to scale your query to larger data volumes. </p>
+ <p>To take advantage of this optimization, make sure that the table with the largest number of tuples per key is the last table in your query.
+ In some of our tests we saw a 10x performance improvement as the result of this optimization.</p>
+ <pre class="code">
+ small = load 'small_file' as (t, u, v);
+ large = load 'large_file' as (x, y, z);
+ C = join small by t, large by x;
+ </pre>
+ <a name="N100E2"></a><a name="Specialized+Join+Optimizations"></a>
+ <h4>Specialized Join Optimizations</h4>
+ <p>Optimization can also be achieved using fragment replicate joins, skewed joins, and merge joins.
+ For more information see <a href="piglatin_ref1.html#Specialized+Joins">Specialized Joins</a>.</p>
+ <a name="N100F1"></a><a name="Use+the+PARALLEL+Clause"></a>
+ <h3 class="h4">Use the PARALLEL Clause</h3>
+ <p>Use the PARALLEL clause to increase the parallelism of a job:</p>
+ <ul>
+
+ <li>PARALLEL sets the number of reduce tasks for the MapReduce jobs generated by Pig. The default value is 1 (one reduce task).</li>
+
+ <li>PARALLEL only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block. </li>
+
+ <li>If you don&rsquo;t specify PARALLEL, you still get the same map parallelism but only one reduce task.</li>
+
+ </ul>
+ <p></p>
+ <p>As noted, the default value for PARALLEL is 1 (one reduce task). However, the number of reducers you need for a particular construct in Pig that forms a MapReduce boundary depends entirely on (1) your data and the number of intermediate keys you are generating in your mappers and (2) the partitioner and distribution of map (combiner) output keys. In the best cases we have seen that a reducer processing about 500 MB of data behaves efficiently.</p>
+ <p>You can include the PARALLEL clause with any operator that starts a reduce phase (see the example below). This includes
+ <a href="piglatin_ref2.html#COGROUP">COGROUP</a>,
+ <a href="piglatin_ref2.html#CROSS">CROSS</a>,
+ <a href="piglatin_ref2.html#DISTINCT">DISTINCT</a>,
+ <a href="piglatin_ref2.html#GROUP">GROUP</a>,
+ <a href="piglatin_ref2.html#JOIN+%28inner%29">JOIN (inner)</a>,
+ <a href="piglatin_ref2.html#JOIN+%28outer%29">JOIN (outer)</a>, and
+ <a href="piglatin_ref2.html#ORDER">ORDER</a>.
+ </p>
+ <p>You can also set the value of PARALLEL for all Pig scripts using the <a href="piglatin_ref2.html#set">set default parallel</a> command.</p>
+ <p>In this example PARALLEL is used with the GROUP operator. </p>
+ <pre class="code">
+ A = LOAD 'myfile' AS (t, u, v);
+ B = GROUP A BY t PARALLEL 18;
+ .....
+ </pre>
+ <p>In this example all the MapReduce jobs that get launched use 20 reducers.</p>
+ <pre class="code">
+ SET DEFAULT_PARALLEL 20;
+ A = LOAD &lsquo;myfile.txt&rsquo; USING PigStorage() AS (t, u, v);
+ B = GROUP A BY t;
+ C = FOREACH B GENERATE group, COUNT(A.t) as mycount;
+ D = ORDER C BY mycount;
+ STORE D INTO &lsquo;mysortedcount&rsquo; USING PigStorage();
+ </pre>
+ <a name="N10140"></a><a name="Use+the+LIMIT+Operator"></a>
+ <h3 class="h4">Use the LIMIT Operator</h3>
+ <p>Often you are not interested in the entire output but rather a sample or the top results. In such cases, using LIMIT can yield much better performance, as Pig pushes the limit as high up the pipeline as possible to minimize the amount of data travelling through it. </p>
+ <p>Sample:
+ </p>
+ <pre class="code">
+ A = load 'myfile' as (t, u, v);
+ B = limit A 500;
+ </pre>
+ <p>Top results: </p>
+ <pre class="code">
+ A = load 'myfile' as (t, u, v);
+ B = order A by t;
+ C = limit B 500;
+ </pre>
+ <a name="N10158"></a><a name="Prefer+DISTINCT+over+GROUP+BY+-+GENERATE"></a>
+ <h3 class="h4">Prefer DISTINCT over GROUP BY - GENERATE</h3>
+ <p>When it comes to extracting the unique values from a column in a relation, one of two approaches can be used: </p>
+ <p>Example Using GROUP BY - GENERATE</p>
+ <pre class="code">
+ A = load 'myfile' as (t, u, v);
+ B = foreach A generate u;
+ C = group B by u;
+ D = foreach C generate group as uniquekey;
+ dump D;
+ </pre>
+ <p>Example Using DISTINCT</p>
+ <pre class="code">
+ A = load 'myfile' as (t, u, v);
+ B = foreach A generate u;
+ C = distinct B;
+ dump C;
+ </pre>
+ <p>In Pig 0.1.x, DISTINCT is just GROUP BY/PROJECT under the hood. In Pig 0.2.0 it is not, and it is much faster and more efficient (depending on your key cardinality, up to 20x faster in the Pig team's tests). Therefore, the use of DISTINCT is recommended over GROUP BY - GENERATE. </p>
+ </div>
+
+ </div>
+ <!--+
+ |end content
+ +-->
+ <div class="clearboth">&nbsp;</div>
+ </div>
+ <div id="footer">
+ <!--+
+ |start bottomstrip
+ +-->
+ <div class="lastmodified">
+ <script type="text/javascript"><!--
+ document.write("Last Published: " + document.lastModified);
+ // --></script>
+ </div>
+ <div class="copyright">
+ Copyright &copy;
+ 2007-2010 <a href="http://www.apache.org/licenses/">The Apache Software Foundation.</a>
+ </div>
+ <!--+
+ |end bottomstrip
+ +-->
+ </div>
+ </body>
+ </html>