piglet 0.2.0 → 0.2.2
- data/README.rdoc +124 -5
- data/lib/piglet.rb +1 -1
- data/lib/piglet/field/field.rb +1 -1
- data/lib/piglet/field/infix_expression.rb +12 -1
- data/lib/piglet/interpreter.rb +3 -1
- data/spec/piglet/field/reference_spec.rb +4 -2
- data/spec/piglet/interpreter_spec.rb +6 -2
- data/spec/piglet_spec.rb +19 -1
- metadata +2 -2
data/README.rdoc
CHANGED
@@ -45,9 +45,14 @@ will output
 
 or
 
-    Piglet::Interpreter.new { store(load('input'), 'output') }.to_pig_latin
+    puts Piglet::Interpreter.new { store(load('input'), 'output') }.to_pig_latin
 
-
+or
+
+    @interpreter = Piglet::Interpreter.new
+    puts @interpreter.to_pig_latin { store(load('input'), 'output') }
+
+will print
 
     relation_1 = LOAD 'input';
     STORE relation_1 INTO 'output';
@@ -97,7 +102,7 @@ In fact, a whole script can be written without using variables at all:
 
     store(load('input', :schema => [:x, :y, :z]).sample(3).group(:x))
 
-The relation operators are meant to be close to the Pig Latin syntax, but there are obvious limitations and tradeoffs, see the documentation for the Piglet::Relation mixin for syntax examples.
+The relation operators are meant to be close to the Pig Latin syntax, but there are obvious limitations and tradeoffs, see the documentation for the Piglet::Relation::Relation mixin for syntax examples.
 
 === <code>load</code>
 
@@ -112,6 +117,8 @@ But if you want types, then you need to pass an array of arrays, where the inner
 
 This is a bit inconvenient. I would like to use a hash, like this: <code>{:a => :chararray, :b => :long}</code>, but since the order of the keys isn't guaranteed in Ruby 1.8, it's not possible. I'm working on something better.
 
+If you need to specify tuples or bags in a schema you can use the special syntax <code>[:field_name, :tuple, [[:a, :int], [:b, :float]]]</code>, i.e. the field name, the field type (<code>:tuple</code> or <code>:bag</code>) and the schema of the tuple or bag. See “Types & schemas” below for more info.
+
 You can also specify a load function by passing the <code>:using</code> option:
 
     load('input', :using => :pig_storage)
@@ -125,6 +132,118 @@ Piglet knows to translate <code>:pig_storage</code> to <code>PigStorage</code>,
 
 +dump+, +describe+, +illustrate+ and +explain+ all take a relation as sole argument. +explain+ can be called without argument (see the Pig Latin manual for what +EXPLAIN+ without argument does).
 
+== +cross+, +distinct+, +limit+, +sample+, +union+
+
+These operators are the most straightforward in Piglet. To do the equivalent of
+
+    b = DISTINCT a;
+
+you write
+
+    b = a.distinct
+
+in Piglet. More examples:
+
+    a.cross(b) # => CROSS a, b
+    a.limit(4) # => LIMIT a 4
+    a.sample(0.1) # => SAMPLE a 0.1
+    a.union(b, c) # => UNION a, b, c
+
+you get the pattern.
+
+== +order+
+
++order+ works more or less like the operators above, with some extra features: to specify ascending or descending order you can pass an array with two elements instead of a field name -- the first element is the field name, the second <code>:asc</code> or <code>:desc</code>:
+
+    a.order(:x, [:y, :desc]) # => ORDER a BY x, y DESC
+
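A minimal sketch of how the +order+ argument convention could map onto an <code>ORDER</code> statement. The helper +order_clause+ is hypothetical and is not part of Piglet; it only illustrates the difference between a plain field name and a <code>[field, direction]</code> pair.

```ruby
# Hypothetical helper, not Piglet's implementation: builds an ORDER statement
# from plain field names and [field, :asc/:desc] pairs.
def order_clause(relation_alias, *fields)
  parts = fields.map do |field|
    if field.is_a?(Array)
      # Two-element form: field name followed by the sort direction.
      name, direction = field
      "#{name} #{direction.to_s.upcase}"
    else
      # Plain field name: no explicit direction is emitted.
      field.to_s
    end
  end
  "ORDER #{relation_alias} BY #{parts.join(', ')}"
end

puts order_clause(:a, :x, [:y, :desc]) # prints: ORDER a BY x, y DESC
```

Treating any array argument as a pair keeps the common case (a bare field name) free of extra syntax.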
+== +group+
+
+In light of the above +group+ works exactly as you would expect: <code>a.group(:b)</code> becomes <code>GROUP a BY b</code>. You can specify which fields to group by either by passing them as separate arguments, or by passing an array as the first parameter. These statements are equivalent:
+
+    a.group(:x, :y)
+    a.group([:x, :y])
+    a.group(%w(x y))
+
+== +filter+
+
++filter+ works a little differently from the operators discussed above. It takes a block in which you specify the arguments to the operator. The block receives a parameter which is the relation that the operation is performed on -- this may sound odd, but since operations can be chained in Piglet there are situations where you otherwise wouldn't have a reference to the relation, e.g. <code>a.limit(4).filter { |r| … }</code>.
+
+The thing that sets +filter+ apart from the operators above is that it needs to support field expressions, for example the <code>x == 3</code> in <code>FILTER a BY x == 3</code>. Piglet supports simple field operators like <code>==</code> or <code>%</code> quite transparently, but more complex expressions can be less elegant, see “Limitations” below. For example <code>a.filter { |r| r.x == 3 }</code> works fine, but <code>a.filter { |r| r.x != 3 }</code> doesn't (it has to do with how Ruby parses expressions, unfortunately). To express not-equals you can either do <code>r.x.ne(3)</code> or <code>(r.x == 3).not</code>. See “Limitations” below for more info on field expressions.
+
+The way field expressions are done in Piglet is that you ask the relation (the object passed to the block) for a field, and then call methods on that object to build up an expression. Some Ruby operators can be used, but other operations are only available as methods. Again, see “Limitations” below for a complete reference.
+
+    a.filter { |r| r.x == 3 } # => FILTER a BY x == 3
+    a.filter { |r| (r.x > 4).or(r.y < 2) } # => FILTER a BY x > 4 OR y < 2
+
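A rough sketch of why <code>==</code> can build an expression while <code>!=</code> cannot, unless it is defined separately. The +Expression+ class below is a hypothetical stand-in, not Piglet's field class, and the text it renders for +not+ is an assumption.

```ruby
# Hypothetical stand-in for a Piglet field object, for illustration only.
class Expression
  def initialize(text)
    @text = text
  end

  def to_s
    @text
  end

  # Ruby lets a class define ==, so `x == 3` can return an expression object.
  def ==(other)
    Expression.new("#{self} == #{other}")
  end

  # Explicit not-equals method, used instead of the != operator.
  def ne(other)
    Expression.new("#{self} != #{other}")
  end

  # Negation of a whole expression (the rendered form is an assumption).
  def not
    Expression.new("NOT (#{self})")
  end
end

x = Expression.new('x')
puts((x == 3).to_s)      # prints: x == 3
puts x.ne(3).to_s        # prints: x != 3
puts((x == 3).not.to_s)  # prints: NOT (x == 3)

# The inherited != negates the result of ==. The expression object is truthy,
# so the result is a plain boolean and the expression is lost.
p(x != 3)                # prints: false
```

On Ruby 1.8 <code>!=</code> is handled by the parser; on later versions it is an ordinary method that could be overridden, but the sketch leaves the default in place to show the failure mode.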
+== +foreach+
+
+<code>FOREACH … GENERATE</code> is probably the most complex operator in Pig Latin. Piglet tries its best to support most of it, but there are things that are still missing -- see “Limitations”. Most things should work without problems though. The operator in Piglet is called simply +foreach+, and just like +filter+ it takes a block, which receives the relation as a parameter.
+
+In contrast to +filter+, the block passed to +foreach+ should return an array of field references and expressions. This array describes the schema of the new relation. The expressions used in +foreach+ are usually not the same as those used in +filter+, although all are of course available in both situations. Common operators in +foreach+ are the aggregate functions (called “eval functions” in the Pig Latin manual) like +MAX+, +MIN+, +COUNT+, +SUM+, etc. In Piglet these are method calls on field objects. Let's look at an example (I like to use lots of whitespace and newlines for +foreach+ operations, because otherwise it gets very messy):
+
+    a.foreach do |r|
+      [
+        r.x.max,
+        r.y.min,
+        r.z.count,
+        r.w + r.q
+      ]
+    end
+
+this would be translated into:
+
+    FOREACH a GENERATE
+      MAX(x),
+      MIN(y),
+      COUNT(z),
+      w + q;
+
+Pretty straightforward. What if you want to give the fields of the new relation proper names? In Pig Latin you would write <code>MAX(x) AS (x_max)</code>, and in Piglet you can write <code>r.x.max.as(:x_max)</code>. This is such a common thing to do that I'm thinking of adding some kind of feature that automatically adds <code>AS</code> clauses where appropriate, but it's not there yet.
+
++foreach+ is a very complex beast, and this is just an overview, so I'll just give you a few more examples that are not obvious:
+
+Literal values can be specified using +literal+:
+
+    a.foreach { |r| [literal('hello').as(:hello)] } # => FOREACH a GENERATE 'hello' AS hello
+
+Binary conditionals, a.k.a. the ternary operator, are supported through +test+ (unfortunately the Ruby ternary operator can't be overridden):
+
+    a.foreach { |r| [test(r.x == 3, r.y, r.z)] } # => FOREACH a GENERATE (x == 3 ? y : z)
+
+The first argument to +test+ is the test expression, the second is the if-true expression and the third is the if-false expression.
+
+== +split+
+
+The syntax of +split+ shouldn't be surprising if you've read this far, but there are perhaps some details that aren't obvious. To split a relation into a number of parts you call +split+ on the relation and pass a block in which you specify the expressions describing each shard. Just as with +filter+ and +foreach+ the block receives the relation as an argument. +split+ returns an array containing the relation shards and you can use parallel assignment to make it look really nice:
+
+    b, c = a.split { |r| [r.x > 2, r.y == 3] } # => SPLIT a INTO b IF x > 2, c IF y == 3
+
+== +cogroup+ & +join+
+
+These two operators are the two ways to join relations in Pig Latin. They take the relations to join and the keys to join them on. In Piglet you specify the join expression using a hash: the keys are the relations, and the values are the fields on which to join:
+
+    a.join(b => :y, a => :x) # => JOIN b BY y, a BY x
+    a.cogroup(b => :y, a => :x) # => COGROUP b BY y, a BY x
+
+Notice that you have to specify the +a+ relation twice: you call the method on it, but you also have to pass it as a key to the join description. I'm working on an alternative syntax.
+
+If you're joining on more than one field, simply pass an array of field names:
+
+    a.join(b => [:y, :z], a => [:x, :w]) # => JOIN b BY (y, z), a BY (x, w)
+
+I'm not absolutely sure that it is legal to join or cogroup on more than one field; the Pig Latin manual isn't entirely clear on this, but Piglet supports it for the time being.
+
+<code>COGROUP</code> lets you specify <code>INNER</code> and <code>OUTER</code> for join fields, and in Piglet you can do this by passing <code>:inner</code> or <code>:outer</code> as the last element in the array that is the value in the join description:
+
+    a.cogroup(b => [:y, :inner], a => [:z, :outer]) # => COGROUP b BY y INNER, a BY z OUTER
+
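A small sketch of how a join description hash could be turned into a <code>JOIN</code> statement. The helper +join_clause+ is hypothetical, and it uses symbols as hash keys where Piglet uses relation objects; it does not handle the <code>:inner</code> and <code>:outer</code> markers.

```ruby
# Hypothetical helper, not Piglet's implementation: renders a join description
# ({relation_alias => field or [fields]}) as a JOIN statement.
def join_clause(description)
  parts = description.map do |relation_alias, fields|
    fields = Array(fields) # accept a single field or an array of fields
    key = fields.size == 1 ? fields.first.to_s : "(#{fields.join(', ')})"
    "#{relation_alias} BY #{key}"
  end
  "JOIN #{parts.join(', ')}"
end

puts join_clause(:b => :y, :a => :x)
# prints: JOIN b BY y, a BY x
puts join_clause(:b => [:y, :z], :a => [:x, :w])
# prints: JOIN b BY (y, z), a BY (x, w)
```

The clause order follows the hash's iteration order. Ruby 1.9 and later iterate hashes in insertion order, while Ruby 1.8 gives no such guarantee.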
+== <code>:parallel</code>
+
+For some operators in Pig Latin you can specify the <code>PARALLEL</code> keyword to tell Pig how many reducers to use.
+
+For the +cogroup+, +cross+, +distinct+, +group+, +join+ and +order+ operators you can pass <code>:parallel => <em>n</em></code> as the last parameter to specify the amount of parallelism, e.g. <code>a.group(:x, :y, :z, :parallel => 5)</code>.
+
 === Putting it all together
 
 Let's look at a more complex example:
@@ -194,9 +313,9 @@ But in Piglet it's as simple as looping over the names of the dimensions. You co
       sum_dimension(input, dimension)
     end
 
-You can even define your own relation operations if you want, just add them to Piglet::Relation:
+You can even define your own relation operations if you want, just add them to Piglet::Relation::Relation:
 
-    module Piglet::Relation
+    module Piglet::Relation::Relation
       # Returns a list of sampled relations for each given sample size
       def samples(*sizes)
         sizes.map { |s| sample(s) }
data/lib/piglet.rb
CHANGED
data/lib/piglet/field/field.rb
CHANGED
data/lib/piglet/field/infix_expression.rb
CHANGED
@@ -18,7 +18,18 @@ module Piglet
       end
 
       def to_s
-
+        left = @left_expression
+        right = @right_expression
+
+        if left.respond_to?(:operator) && left.operator != @operator
+          left = parenthesise(left)
+        end
+
+        if right.respond_to?(:operator) && right.operator != @operator
+          right = parenthesise(right)
+        end
+
+        "#{left} #{@operator} #{right}"
       end
 
       private
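The parenthesising rule in this hunk can be tried out in isolation. The class below is a simplified stand-in: the +to_s+ logic follows the added lines, while the constructor and the body of +parenthesise+ are assumptions, since neither appears in the diff.

```ruby
# Simplified stand-in for the infix expression class; only to_s mirrors the
# hunk, the constructor and parenthesise are assumed.
class InfixExpression
  attr_reader :operator

  def initialize(operator, left_expression, right_expression)
    @operator = operator
    @left_expression = left_expression
    @right_expression = right_expression
  end

  def to_s
    left = @left_expression
    right = @right_expression
    # Only nested expressions with a *different* operator get parentheses.
    left = parenthesise(left) if left.respond_to?(:operator) && left.operator != @operator
    right = parenthesise(right) if right.respond_to?(:operator) && right.operator != @operator
    "#{left} #{@operator} #{right}"
  end

  private

  def parenthesise(expression)
    "(#{expression})"
  end
end

same = InfixExpression.new('AND', 'x', InfixExpression.new('AND', 'y', 'z'))
mixed = InfixExpression.new('AND', 'x', InfixExpression.new('OR', 'y', 'z'))
puts same   # prints: x AND y AND z
puts mixed  # prints: x AND (y OR z)
```

Because the rule compares operators only, a right-nested expression with a non-associative operator such as <code>-</code> would also lose its parentheses in this sketch.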
data/lib/piglet/interpreter.rb
CHANGED
data/spec/piglet/field/reference_spec.rb
CHANGED
@@ -75,8 +75,10 @@ describe Piglet::Field::Reference do
         @field1.send(op, @field2).to_s.should eql("field1 #{op} field2")
       end
 
-
-
+      if op != :+ # + is already covered in all other iterations, and it parenthesizes differently
+        it "supports #{op} on an expression" do
+          (@field1 + (@field1.send(op, @field2))).to_s.should eql("field1 + (field1 #{op} field2)")
+        end
       end
     end
 
data/spec/piglet/interpreter_spec.rb
CHANGED
@@ -8,15 +8,19 @@ describe Piglet::Interpreter do
   end
 
   context 'basic usage' do
-    it 'interprets a block given to #new' do
+    it 'interprets a block given to #new so that a subsequent call to #to_pig_latin returns the Pig Latin code' do
       output = Piglet::Interpreter.new { store(load('some/path'), 'out') }
       output.to_pig_latin.should_not be_empty
     end
 
-    it 'interprets a block given to #interpret' do
+    it 'interprets a block given to #interpret so that a subsequent call to #to_pig_latin returns the Pig Latin code' do
       output = @interpreter.interpret { store(load('some/path'), 'out') }
       output.to_pig_latin.should_not be_empty
     end
+
+    it 'interprets a block given to #to_pig_latin and returns the Pig Latin code' do
+      @interpreter.to_pig_latin { store(load('some/path'), 'out') }.should_not be_empty
+    end
 
     it 'does nothing with no commands' do
       @interpreter.interpret.to_pig_latin.should be_empty
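The three entry points exercised by these specs (a block given to +new+, to +interpret+, or to +to_pig_latin+) can be sketched with a toy class. +ToyInterpreter+ is hypothetical: it records statements eagerly as strings and supports only +load+ and +store+, which is an assumption made for brevity and not how Piglet is implemented.

```ruby
# Toy stand-in for Piglet::Interpreter; it only records statements as strings.
class ToyInterpreter
  def initialize(&block)
    @statements = []
    @counter = 0
    interpret(&block) if block
  end

  # Evaluates the block in the interpreter's context and returns self,
  # so that #to_pig_latin can be chained onto the result.
  def interpret(&block)
    instance_eval(&block) if block
    self
  end

  # Optionally interprets a block first, then returns the generated code.
  def to_pig_latin(&block)
    interpret(&block) if block
    @statements.join("\n")
  end

  private

  def load(path)
    @counter += 1
    name = "relation_#{@counter}"
    @statements << "#{name} = LOAD '#{path}';"
    name
  end

  def store(relation, path)
    @statements << "STORE #{relation} INTO '#{path}';"
  end
end

puts ToyInterpreter.new { store(load('input'), 'output') }.to_pig_latin
puts ToyInterpreter.new.interpret { store(load('input'), 'output') }.to_pig_latin
puts ToyInterpreter.new.to_pig_latin { store(load('input'), 'output') }
```

Each of the three calls prints the same two statements, because every path funnels the block through +interpret+.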
data/spec/piglet_spec.rb
CHANGED
@@ -393,6 +393,22 @@ describe Piglet do
       @interpreter.to_pig_latin.should match(/(\w+) = LOAD 'in';\n(\w+) = DISTINCT \1;\nSTORE \2 INTO 'out1';\nSTORE \2 INTO 'out2';/)
     end
   end
+
+  context 'field expressions' do
+    it 'doesn\'t parenthesize expressions with the same operator' do
+      output = @interpreter.to_pig_latin do
+        store(load('in').filter { |r| r.x.and(r.y.and(r.z)).and(r.w) }, 'out')
+      end
+      output.should include('x AND y AND z AND w')
+    end
+
+    it 'parenthesizes expressions with different operators' do
+      output = @interpreter.to_pig_latin do
+        store(load('in').filter { |r| r.x.and(r.y.or(r.z)).and(r.w) }, 'out')
+      end
+      output.should include('x AND (y OR z) AND w')
+    end
+  end
 
   context 'long and complex scripts' do
     before do
@@ -579,7 +595,9 @@ describe Piglet do
           throw :relations, [relation1, relation2, relation3]
         end
       end
-      relation3.schema.field_names.should eql(
+      relation3.schema.field_names[0].should eql(:group)
+      relation3.schema.field_names.should include(relation1.alias.to_sym)
+      relation3.schema.field_names.should include(relation2.alias.to_sym)
       relation3.schema.field_type(relation1.alias.to_sym).should be_a(Piglet::Schema::Bag)
       relation3.schema.field_type(relation2.alias.to_sym).should be_a(Piglet::Schema::Bag)
       relation3.schema.field_type(relation1.alias.to_sym).field_names.should eql([:a, :b])
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: piglet
 version: !ruby/object:Gem::Version
-  version: 0.2.
+  version: 0.2.2
 platform: ruby
 authors:
 - Theo Hultberg
@@ -9,7 +9,7 @@ autorequire:
 bindir: bin
 cert_chain: []
 
-date: 2010-01-
+date: 2010-01-12 00:00:00 +01:00
 default_executable: piglet
 dependencies:
 - !ruby/object:Gem::Dependency