piglet 0.2.0 → 0.2.2

data/README.rdoc CHANGED
@@ -45,9 +45,14 @@ will output
 
  or
 
-   Piglet::Interpreter.new { store(load('input'), 'output') }.to_pig_latin
+   puts Piglet::Interpreter.new { store(load('input'), 'output') }.to_pig_latin
 
- will print
+ or
+
+   @interpreter = Piglet::Interpreter.new
+   puts @interpreter.to_pig_latin { store(load('input'), 'output') }
+
+ will print
 
    relation_1 = LOAD 'input';
    STORE relation_1 INTO 'output';
@@ -97,7 +102,7 @@ In fact, a whole script can be written without using variables at all:
 
    store(load('input', :schema => [:x, :y, :z]).sample(3).group(:x))
 
- The relation operators are meant to be close to the Pig Latin syntax, but there are obvious limitations and tradeoffs, see the documentation for the Piglet::Relation mixin for syntax examples.
+ The relation operators are meant to be close to the Pig Latin syntax, but there are obvious limitations and tradeoffs, see the documentation for the Piglet::Relation::Relation mixin for syntax examples.
 
  === <code>load</code>
 
@@ -112,6 +117,8 @@ But if you want types, then you need to pass an array of arrays, where the inner
 
  This is a bit inconvenient. I would like to use a hash, like this: <code>{:a => :chararray, :b => :long}</code>, but since the order of the keys isn't guaranteed in Ruby 1.8, it's not possible. I'm working on something better.
 
+ If you need to specify tuples or bags in a schema you can use the special syntax <code>[:field_name, :tuple, [[:a, :int], [:b, :float]]]</code>, i.e. the field name, the field type (<code>:tuple</code> or <code>:bag</code>) and the schema of the tuple or bag. See “Types & schemas” below for more info.
+
  You can also specify a load function by passing the <code>:using</code> option:
 
    load('input', :using => :pig_storage)
@@ -125,6 +132,118 @@ Piglet knows to translate <code>:pig_storage</code> to <code>PigStorage</code>,
 
  +dump+, +describe+, +illustrate+ and +explain+ all take a relation as sole argument. +explain+ can be called without argument (see the Pig Latin manual for what +EXPLAIN+ without argument does).
 
+ == +cross+, +distinct+, +limit+, +sample+, +union+
+
+ These operators are the most straightforward in Piglet. To do the equivalent of
+
+   b = DISTINCT a;
+
+ you write
+
+   b = a.distinct
+
+ in Piglet. More examples:
+
+   a.cross(b) # => CROSS a, b
+   a.limit(4) # => LIMIT a 4
+   a.sample(0.1) # => SAMPLE a 0.1
+   a.union(b, c) # => UNION a, b, c
+
+ you get the pattern.
+
+ == +order+
+
+ +order+ works more or less like the operators above, with some extra features: to specify ascending or descending order you can pass an array with two elements instead of a field name -- the first element is the field name, the second <code>:asc</code> or <code>:desc</code>:
+
+   a.order(:x, [:y, :desc]) # => ORDER a BY x, y DESC
+
+ == +group+
+
+ In light of the above +group+ works exactly as you would expect: <code>a.group(:b)</code> becomes <code>GROUP a BY b</code>. You can specify which fields to group by either by passing them as separate arguments, or by passing an array as the first parameter. These statements are equivalent:
+
+   a.group(:x, :y)
+   a.group([:x, :y])
+   a.group(%w(x y))
+
+ == +filter+
+
+ +filter+ works a little bit differently from the operators discussed above. It takes a block in which you specify the arguments to the operator. The block receives a parameter which is the relation that the operation is performed on -- this may sound odd, but since operations can be chained in Piglet there are situations where you otherwise wouldn't have a reference to the relation, e.g. <code>a.limit(4).filter { |r| … }</code>.
+
+ The thing that sets +filter+ apart from the operators above is that it needs to support field expressions, for example the <code>x == 3</code> in <code>FILTER a BY x == 3</code>. Piglet supports simple field operators like <code>==</code> or <code>%</code> quite transparently, but more complex expressions can be less elegant. For example <code>a.filter { |r| r.x == 3 }</code> works fine, but <code>a.filter { |r| r.x != 3 }</code> doesn't (it has to do with how Ruby parses expressions, unfortunately). To express "not equals" you can use either <code>r.x.ne(3)</code> or <code>(r.x == 3).not</code>. See “Limitations” below for more info on field expressions.
+
+ The way field expressions are done in Piglet is that you ask the relation (the object passed to the block) for a field, and then call methods on that object to build up an expression. Some Ruby operators can be used, but other operations are only available as methods; again, see “Limitations” below for a complete reference.
+
+   a.filter { |r| r.x == 3 } # => FILTER a BY x == 3
+   a.filter { |r| (r.x > 4).or(r.y < 2) } # => FILTER a BY x > 4 OR y < 2
+
+ == +foreach+
+
+ <code>FOREACH … GENERATE</code> is probably the most complex operator in Pig Latin. Piglet tries its best to support most of it, but there are things that are still missing -- see “Limitations”. Most things should work without problems though. The operator in Piglet is called simply +foreach+, and just like +filter+ it takes a block, which receives the relation as a parameter.
+
+ In contrast to +filter+, +foreach+ should return an array of field references and expressions. This array describes the schema of the new relation. The expressions used in +foreach+ are usually not the same as those used in +filter+, although all are of course available in both situations. In +foreach+ common operators to use are the aggregate functions (called “eval functions” in the Pig Latin manual) like +MAX+, +MIN+, +COUNT+, +SUM+, etc. In Piglet these are method calls on field objects. Let's look at an example (I like to use lots of whitespace and newlines for +foreach+ operations, because otherwise it gets very messy):
+
+   a.foreach do |r|
+     [
+       r.x.max,
+       r.y.min,
+       r.z.count,
+       r.w + r.q
+     ]
+   end
+
+ This would be translated into:
+
+   FOREACH a GENERATE
+     MAX(x),
+     MIN(y),
+     COUNT(z),
+     w + q;
+
+ Pretty straightforward. What if you want to give the fields of the new relation proper names? In Pig Latin you would write <code>MAX(x) AS (x_max)</code>, and in Piglet you can write <code>r.x.max.as(:x_max)</code>. This is such a common thing to do that I'm thinking of adding some kind of feature that automatically adds <code>AS</code> clauses where appropriate, but it's not there yet.
+
+ +foreach+ is a very complex beast, and this is just an overview, so I'll just give you a few more examples that are not obvious:
+
+ Literal values can be specified using +literal+:
+
+   a.foreach { |r| [literal('hello').as(:hello)] } # => FOREACH a GENERATE 'hello' AS hello
+
+ Binary conditionals, a.k.a. the ternary operator, are supported through +test+ (unfortunately the Ruby ternary operator can't be overridden):
+
+   a.foreach { |r| [test(r.x == 3, r.y, r.z)] } # => FOREACH a GENERATE (x == 3 ? y : z)
+
+ The first argument to +test+ is the test expression, the second is the if-true expression and the third is the if-false expression.
+
+ == +split+
+
+ The syntax of +split+ shouldn't be surprising if you've read this far, but there are perhaps some details that aren't obvious. To split a relation into a number of parts you call +split+ on the relation and pass a block in which you specify the expressions describing each shard. Just as with +filter+ and +foreach+ the block receives the relation as an argument. +split+ returns an array containing the relation shards and you can use parallel assignment to make it look really nice:
+
+   b, c = a.split { |r| [r.x > 2, r.y == 3] } # => SPLIT a INTO b IF x > 2, c IF y == 3
+
+ == +cogroup+ & +join+
+
+ These two operators are the different ways to join relations in Pig Latin. They take the relations to join, and the keys to join them on. In Piglet you specify the join expression using a hash: the keys are the relations, and the values are the fields on which to join:
+
+   a.join(b => :y, a => :x) # => JOIN b BY y, a BY x
+   a.cogroup(b => :y, a => :x) # => COGROUP b BY y, a BY x
+
+ Notice that you have to specify the +a+ relation twice: you call the method on it, but you also have to pass it as a key to the join description. I'm working on an alternative syntax.
+
+ If you're joining on more than one field, simply pass an array of field names:
+
+   a.join(b => [:y, :z], a => [:x, :w]) # => JOIN b BY (y, z), a BY (x, w)
+
+ I'm not absolutely sure that it is legal to join or cogroup on more than one field, the Pig Latin manual isn't entirely clear on this, but Piglet supports it for the time being.
+
+ <code>COGROUP</code> lets you specify <code>INNER</code> and <code>OUTER</code> for join fields, and in Piglet you can do this by passing <code>:inner</code> or <code>:outer</code> as the last element in the array that is the value in the join description:
+
+   a.cogroup(b => [:y, :inner], a => [:z, :outer]) # => COGROUP b BY y INNER, a BY z OUTER
+
+ == <code>:parallel</code>
+
+ For some operators in Pig Latin you can specify the <code>PARALLEL</code> keyword to tell Pig how many reducers to use.
+
+ For +cogroup+, +cross+, +distinct+, +group+, +join+ and +order+ you can pass <code>:parallel => <em>n</em></code> as the last parameter to specify the amount of parallelism, e.g. <code>a.group(:x, :y, :z, :parallel => 5)</code>.
+
  === Putting it all together
 
  Let's look at a more complex example:
@@ -194,9 +313,9 @@ But in Piglet it's as simple as looping over the names of the dimensions. You co
      sum_dimension(input, dimension)
    end
 
- You can even define your own relation operations if you want, just add them to Piglet::Relation:
+ You can even define your own relation operations if you want, just add them to Piglet::Relation::Relation:
 
-   module Piglet::Relation
+   module Piglet::Relation::Relation
      # Returns a list of sampled relations for each given sample size
      def samples(*sizes)
        sizes.map { |s| sample(s) }
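
Note on the +filter+ section above: the README describes field expressions as objects built by asking the relation for a field and then calling methods on it. The following is a minimal, self-contained sketch of that idea in plain Ruby. It is not Piglet's implementation: SketchRelation and SketchExpression are hypothetical names, only a handful of operators are covered, and the relation alias is hard-coded to "a".

```ruby
# Hypothetical stand-in for a field or field expression (not Piglet's class).
class SketchExpression
  def initialize(text)
    @text = text
  end

  def to_s
    @text
  end

  # Each operator returns a new expression instead of evaluating anything.
  def ==(other)
    SketchExpression.new("#{self} == #{other}")
  end

  def >(other)
    SketchExpression.new("#{self} > #{other}")
  end

  def <(other)
    SketchExpression.new("#{self} < #{other}")
  end

  # Method form of "not equals", mirroring the README's r.x.ne(3).
  def ne(other)
    SketchExpression.new("#{self} != #{other}")
  end

  def or(other)
    SketchExpression.new("#{self} OR #{other}")
  end
end

# Hypothetical stand-in for a relation: any unknown zero-argument method
# name is treated as a field name.
class SketchRelation
  def method_missing(name, *args)
    return super if name.to_s.start_with?('to_') || !args.empty?
    SketchExpression.new(name.to_s)
  end

  def respond_to_missing?(name, include_private = false)
    !name.to_s.start_with?('to_') || super
  end

  # The block receives the relation and returns an expression.
  def filter
    "FILTER a BY #{yield(self)}"
  end
end

a = SketchRelation.new
puts a.filter { |r| r.x == 3 }               # FILTER a BY x == 3
puts a.filter { |r| (r.x > 4).or(r.y < 2) }  # FILTER a BY x > 4 OR y < 2
puts a.filter { |r| r.x.ne(3) }              # FILTER a BY x != 3
puts a.filter { |r| r.x != 3 }               # FILTER a BY false
```

The last line suggests why a method such as +ne+ is needed. On Ruby 1.8, which the README mentions, <code>a != b</code> is parsed as <code>!(a == b)</code> and cannot be defined on its own; on later versions the inherited <code>!=</code> negates the result of <code>==</code>. Either way, in this sketch the expression object is lost and only +false+ remains.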
data/lib/piglet.rb CHANGED
@@ -1,6 +1,6 @@
  # :main: README.rdoc
  module Piglet # :nodoc:
-   VERSION = '0.2.0'
+   VERSION = '0.2.2'
 
    class PigletError < StandardError; end
    class NotSupportedError < PigletError; end
@@ -4,7 +4,7 @@ module Piglet
  SYMBOLIC_OPERATORS = [:==, :>, :<, :>=, :<=, :%, :+, :-, :*, :/]
  FUNCTIONS = [:avg, :count, :max, :min, :size, :sum, :tokenize]
 
- attr_reader :name, :type
+ attr_reader :name, :type, :operator
 
  FUNCTIONS.each do |fun|
    define_method(fun) do
@@ -18,7 +18,18 @@ module Piglet
  end
 
  def to_s
-   "#{parenthesise(@left_expression)} #{@operator} #{parenthesise(@right_expression)}"
+   left = @left_expression
+   right = @right_expression
+
+   if left.respond_to?(:operator) && left.operator != @operator
+     left = parenthesise(left)
+   end
+
+   if right.respond_to?(:operator) && right.operator != @operator
+     right = parenthesise(right)
+   end
+
+   "#{left} #{@operator} #{right}"
  end
 
  private
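
Note on the hunk above: the new +to_s+ only wraps an operand in parentheses when that operand is itself an expression with a different operator. The rule can be tried out in isolation with the sketch below. SketchBinary is a hypothetical class written for this note, and its +wrap+ helper stands in for Piglet's +parenthesise+, whose body is not shown in the diff, so treat this as an approximation of the behaviour rather than the library's code.

```ruby
# Hypothetical binary expression node (not Piglet's class).
class SketchBinary
  attr_reader :operator

  def initialize(operator, left, right)
    @operator = operator
    @left = left
    @right = right
  end

  def to_s
    "#{wrap(@left)} #{@operator} #{wrap(@right)}"
  end

  private

  # Parenthesise only nested expressions whose operator differs from ours;
  # plain operands (here: strings) have no operator and are left alone.
  def wrap(operand)
    if operand.respond_to?(:operator) && operand.operator != @operator
      "(#{operand})"
    else
      operand.to_s
    end
  end
end

same = SketchBinary.new('AND',
         SketchBinary.new('AND', 'x', SketchBinary.new('AND', 'y', 'z')),
         'w')
mixed = SketchBinary.new('AND',
          SketchBinary.new('AND', 'x', SketchBinary.new('OR', 'y', 'z')),
          'w')
nested_minus = SketchBinary.new('-', 'x', SketchBinary.new('-', 'y', 'z'))

puts same          # x AND y AND z AND w
puts mixed         # x AND (y OR z) AND w
puts nested_minus  # x - y - z
```

The first two results match the strings expected by the new 'field expressions' specs. The third is worth a second look: for a non-associative operator such as <code>-</code>, dropping the parentheses changes the meaning, since <code>x - (y - z)</code> is not <code>x - y - z</code>. The sketch has this property by construction; whether it matters in Piglet depends on code outside this diff.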
@@ -17,7 +17,9 @@ module Piglet
    self
  end
 
- def to_pig_latin
+ def to_pig_latin(&block)
+   interpret(&block) if block_given?
+
    return '' if @stores.empty?
 
    handled_relations = Set.new
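
Note on the hunk above: +to_pig_latin+ now accepts an optional block and interprets it before generating output, which is what makes the second README form (<code>@interpreter.to_pig_latin { … }</code>) work. A minimal sketch of the same pattern follows. SketchInterpreter is a hypothetical class that supports only +load+ and +store+ and keeps a flat list of statements; Piglet's real interpreter tracks relations and stores differently.

```ruby
# Hypothetical, heavily simplified interpreter (not Piglet's class).
class SketchInterpreter
  def initialize(&block)
    @statements = []
    @counter = 0
    interpret(&block) if block_given?
  end

  # Evaluate the block with self as the receiver, so that load/store
  # inside the block resolve to the private methods below.
  def interpret(&block)
    instance_eval(&block) if block_given?
    self
  end

  # Accepts an optional block, interprets it first, then renders.
  def to_pig_latin(&block)
    interpret(&block) if block_given?
    return '' if @statements.empty?
    @statements.join("\n")
  end

  private

  def load(path)
    name = "relation_#{@counter += 1}"
    @statements << "#{name} = LOAD '#{path}';"
    name
  end

  def store(name, path)
    @statements << "STORE #{name} INTO '#{path}';"
  end
end

# Both call styles from the README produce the same script here.
via_new = SketchInterpreter.new { store(load('input'), 'output') }.to_pig_latin
via_to_pig_latin = SketchInterpreter.new.to_pig_latin { store(load('input'), 'output') }

puts via_new
puts via_new == via_to_pig_latin
```

Forwarding the block with <code>&block</code> and guarding with <code>block_given?</code> keeps the no-block call path unchanged, so existing callers of +to_pig_latin+ are unaffected.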
@@ -75,8 +75,10 @@ describe Piglet::Field::Reference do
      @field1.send(op, @field2).to_s.should eql("field1 #{op} field2")
    end
 
-   it "supports #{op} on an expression" do
-     (@field1 + (@field1.send(op, @field2))).to_s.should eql("field1 + (field1 #{op} field2)")
+   if op != :+ # + is already covered in all other iterations, and it parenthesizes differently
+     it "supports #{op} on an expression" do
+       (@field1 + (@field1.send(op, @field2))).to_s.should eql("field1 + (field1 #{op} field2)")
+     end
    end
  end
 
@@ -8,15 +8,19 @@ describe Piglet::Interpreter do
  end
 
  context 'basic usage' do
-   it 'interprets a block given to #new' do
+   it 'interprets a block given to #new so that a subsequent call to #to_pig_latin returns the Pig Latin code' do
      output = Piglet::Interpreter.new { store(load('some/path'), 'out') }
      output.to_pig_latin.should_not be_empty
    end
 
-   it 'interprets a block given to #interpret' do
+   it 'interprets a block given to #interpret so that a subsequent call to #to_pig_latin returns the Pig Latin code' do
      output = @interpreter.interpret { store(load('some/path'), 'out') }
      output.to_pig_latin.should_not be_empty
    end
+
+   it 'interprets a block given to #to_pig_latin and returns the Pig Latin code' do
+     @interpreter.to_pig_latin { store(load('some/path'), 'out') }.should_not be_empty
+   end
 
    it 'does nothing with no commands' do
      @interpreter.interpret.to_pig_latin.should be_empty
data/spec/piglet_spec.rb CHANGED
@@ -393,6 +393,22 @@ describe Piglet do
      @interpreter.to_pig_latin.should match(/(\w+) = LOAD 'in';\n(\w+) = DISTINCT \1;\nSTORE \2 INTO 'out1';\nSTORE \2 INTO 'out2';/)
    end
  end
+
+ context 'field expressions' do
+   it 'doesn\'t parenthesize expressions with the same operator' do
+     output = @interpreter.to_pig_latin do
+       store(load('in').filter { |r| r.x.and(r.y.and(r.z)).and(r.w) }, 'out')
+     end
+     output.should include('x AND y AND z AND w')
+   end
+
+   it 'parenthesizes expressions with different operators' do
+     output = @interpreter.to_pig_latin do
+       store(load('in').filter { |r| r.x.and(r.y.or(r.z)).and(r.w) }, 'out')
+     end
+     output.should include('x AND (y OR z) AND w')
+   end
+ end
 
  context 'long and complex scripts' do
    before do
@@ -579,7 +595,9 @@ describe Piglet do
      throw :relations, [relation1, relation2, relation3]
    end
  end
- relation3.schema.field_names.should eql([:group, relation1.alias.to_sym, relation2.alias.to_sym])
+ relation3.schema.field_names[0].should eql(:group)
+ relation3.schema.field_names.should include(relation1.alias.to_sym)
+ relation3.schema.field_names.should include(relation2.alias.to_sym)
  relation3.schema.field_type(relation1.alias.to_sym).should be_a(Piglet::Schema::Bag)
  relation3.schema.field_type(relation2.alias.to_sym).should be_a(Piglet::Schema::Bag)
  relation3.schema.field_type(relation1.alias.to_sym).field_names.should eql([:a, :b])
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: piglet
  version: !ruby/object:Gem::Version
-   version: 0.2.0
+   version: 0.2.2
  platform: ruby
  authors:
  - Theo Hultberg
@@ -9,7 +9,7 @@ autorequire:
  bindir: bin
  cert_chain: []
 
- date: 2010-01-11 00:00:00 +01:00
+ date: 2010-01-12 00:00:00 +01:00
  default_executable: piglet
  dependencies:
  - !ruby/object:Gem::Dependency