RubyGems - piglet - Versions diffs - 0.2.5 → 0.3.0 - Mend

piglet 0.2.5 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (46)

data/.gitignore +5 -4
data/Gemfile +10 -0
data/Gemfile.lock +53 -0
data/README.rdoc +74 -38
data/Rakefile +10 -1
data/lib/piglet.rb +5 -1
data/lib/piglet/field/call_expression.rb +7 -2
data/lib/piglet/field/direct_expression.rb +28 -0
data/lib/piglet/field/field.rb +73 -3
data/lib/piglet/field/infix_expression.rb +14 -9
data/lib/piglet/field/map_value.rb +17 -0
data/lib/piglet/field/prefix_expression.rb +6 -3
data/lib/piglet/field/reference.rb +5 -7
data/lib/piglet/field/rename.rb +7 -5
data/lib/piglet/field/suffix_expression.rb +4 -2
data/lib/piglet/field/udf_expression.rb +19 -2
data/lib/piglet/inout/load.rb +2 -2
data/lib/piglet/interpreter.rb +8 -18
data/lib/piglet/relation/block_context.rb +41 -0
data/lib/piglet/relation/cogroup.rb +2 -1
data/lib/piglet/relation/cross.rb +2 -2
data/lib/piglet/relation/distinct.rb +2 -2
data/lib/piglet/relation/filter.rb +2 -2
data/lib/piglet/relation/foreach.rb +2 -2
data/lib/piglet/relation/group.rb +2 -2
data/lib/piglet/relation/join.rb +2 -1
data/lib/piglet/relation/limit.rb +2 -2
data/lib/piglet/relation/nested_foreach.rb +60 -0
data/lib/piglet/relation/order.rb +4 -2
data/lib/piglet/relation/relation.rb +43 -32
data/lib/piglet/relation/sample.rb +2 -2
data/lib/piglet/relation/split.rb +5 -5
data/lib/piglet/relation/stream.rb +2 -1
data/lib/piglet/relation/union.rb +2 -2
data/piglet.gemspec +126 -0
data/spec/piglet/field/field_spec.rb +7 -2
data/spec/piglet/interpreter_spec.rb +6 -6
data/spec/piglet/relation/relation_spec.rb +7 -4
data/spec/piglet/relation/split_spec.rb +3 -1
data/spec/piglet/relation/union_spec.rb +5 -7
data/spec/piglet_spec.rb +76 -31
data/spec/spec_helper.rb +9 -0
data/tasks/gem.rake +16 -19
data/tasks/rdoc.rake +1 -3
metadata +34 -11
data/TODO +0 -2

data/.gitignore CHANGED

@@ -14,10 +14,11 @@ tmtags
 *.swp
 ## PROJECT::GENERAL
-coverage
-rdoc
-doc
-pkg
+/coverage
+/rdoc
+/doc
+/pkg
+/.bundle
 ## PROJECT::SPECIFIC
 pig_*.log

data/Gemfile ADDED

@@ -0,0 +1,10 @@
+source :rubygems
+group :development do
+  gem 'rake'
+  gem 'rspec'
+  gem 'jeweler'
+  gem 'rdoc', '>= 2.4.0'
+  gem 'sdoc'
+  gem 'rcov'
+end

data/Gemfile.lock ADDED

@@ -0,0 +1,53 @@
+---
+dependencies:
+  rake:
+    group:
+    - :development
+    version: ">= 0"
+  rspec:
+    group:
+    - :development
+    version: ">= 0"
+  sdoc:
+    group:
+    - :development
+    version: ">= 0"
+  rcov:
+    group:
+    - :development
+    version: ">= 0"
+  jeweler:
+    group:
+    - :development
+    version: ">= 0"
+  rdoc:
+    group:
+    - :development
+    version: ">= 2.4.0"
+specs:
+- rake:
+    version: 0.8.7
+- json_pure:
+    version: 1.4.3
+- gemcutter:
+    version: 0.5.0
+- git:
+    version: 1.2.5
+- rubyforge:
+    version: 2.0.4
+- jeweler:
+    version: 1.4.0
+- json:
+    version: 1.4.3
+- rcov:
+    version: 0.9.8
+- rdoc:
+    version: 2.5.8
+- rspec:
+    version: 1.3.0
+- sdoc:
+    version: 0.2.19
+hash: 4f040144929b22ea17e9c74ab3cc8dd9db5a6fcf
+sources:
+- Rubygems:
+    uri: http://gemcutter.org

data/README.rdoc CHANGED

@@ -71,7 +71,7 @@ to standard out.
   a = load 'input', :schema => [:a, :b, :c]
   b = a.group :c
-  c = b.foreach { |r| [r[0], r[1].a.max, r[1].b.max] }
+  c = b.foreach { [self[0], self[1].a.max, self[1].b.max] }
   store c, 'output'
 will result in the following Pig Latin:
@@ -175,27 +175,29 @@ In light of the above +group+ works exactly as you would expect: <code>a.group(:
 == +filter+
-+filter+ works a little bit different from the operators discussed above. It takes a block in which you specify the arguments to the operator. The block receives a parameter which is the relation that the operation is performed on -- this may sound odd, but since operations can be chained in Piglet there are situations where you otherwise wouldn't have a reference to the relation, e.g. <code>a.limit(4).filter { |r| … }</code>.
++filter+ works a little bit different from the operators discussed above. It takes a block in which you specify the arguments to the operator. The block is interpreted in the context of the relation it's performed on.
-The thing that sets +filter+ apart from the operators above is it needs to support field expressions. For example the <code>x == 3</code> in <code>FILTER a BY x == 3</code>. Piglet supports simple field operators like <code>==</code> or <code>%</code> quite transparently, but more complex expressions can be less elegant, see ”Limitations” below. For example <code>a.filter { |r| r.x == 3 }</code> works fine, but <code>a.filter { |r| r.x != 3 }</code> doesn't (it has to do with how Ruby parses expressions, unfortunately). To do not equals you can either do <code>r.x.ne(3)</code> or <code>(r.x == 3).not</code>. See “Limitations” below for more info on field expressions.
+The thing that sets +filter+ apart from the operators above is it needs to support field expressions. For example the <code>x == 3</code> in <code>FILTER a BY x == 3</code>. Piglet supports simple field operators like <code>==</code> or <code>%</code> quite transparently, but more complex expressions can be less elegant, see ”Limitations” below. For example <code>a.filter { x == 3 }</code> works fine, but <code>a.filter { x != 3 }</code> doesn't (it has to do with how Ruby parses expressions, unfortunately). To do not equals you can either do <code>x.ne(3)</code> or <code>(x == 3).not</code>. See “Limitations” below for more info on field expressions.
-The way field expressions are done in Piglet is that you ask the relation (the object passed to the block) for a field, and then call methods on that object to build up an expression. Some Ruby operators can be used, but other operations are only available as methods, again, see “Limitations” below for a complete reference.
+The way field expressions are done in Piglet is that you simply use fields as if they were existing local variables, and then call methods on those to build up an expression. Some Ruby operators can be used, but other operations are only available as methods, again, see “Limitations” below for a complete reference.
-  a.filter { |r| r.x == 3 }              # => FILTER a BY x == 3
-  a.filter { |r| (r.x > 4).or(r.y < 2) } # => FILTER a BY x > 4 OR r < 2
+  a.filter { x == 3 }            # => FILTER a BY x == 3
+  a.filter { (x > 4).or(y < 2) } # => FILTER a BY x > 4 OR y < 2
+Be careful about the names of the fields. Ruby's scoping rules apply, which means that if there's already a variable defined outside of the block with the name +x+ Ruby will assume you meant that variable. If you get strange results, try prefixing with +self+, e.g. +self.x+.
 == +foreach+
-<code>FOREACH … GENERATE</code> is probably the most complex operator in Pig Latin. Piglet tries its best to support most of it, but there are things that are still missing -- see “Limitations”. Most things should work without problems though. The operator in Piglet is called simply +foreach+, and just as +filter+ it takes a block, which receives the relation as a parameter.
+<code>FOREACH … GENERATE</code> is probably the most complex operator in Pig Latin. Piglet tries its best to support most of it, but there are things that are still missing -- see “Limitations”. Most things should work without problems though. The operator in Piglet is called simply +foreach+, and just as +filter+ it takes a block, which is interpreted in the context of the relation +foreach+ was called on.
 In contrast to +filter+, +foreach+ should return an array of field references and expressions. This array describes the schema of the new relation. The expressions used in +foreach+ are usually not the same as those used in +filter+, although all are of course available in both situations. In +foreach+ common operators to use are the aggregate functions (called “eval functions” in the Pig Latin manual) like +MAX+, +MIN+, +COUNT+, +SUM+, etc. In Piglet these are method calls on field objects. Let's look at an example (I like to use lots of whitespace and newlines for +foreach+ operations, because otherwise it gets very messy):
-  a.foreach do |r|
+  a.foreach do
     [
-      r.x.max,
-      r.y.min,
-      r.z.count,
-      r.w + r.q
+      x.max,
+      y.min,
+      z.count,
+      w + q
     ]
   end
@@ -207,27 +209,45 @@ this would be translated into:
     COUNT(z),
     w + q;
-pretty straight forward. What if you want to give the fields of the new relation proper names? In Pig Latin you would write <code>MAX(x) AS (x_max)</code>, and in Piglet you can write <code>r.x.max.as(:x_max)</code>. This is such a common thing to do that I'm thinking of adding some kind of feature that automatically adds <code>AS</code> clauses where appropriate, but it's not there yet.
+pretty straight forward. What if you want to give the fields of the new relation proper names? In Pig Latin you would write <code>MAX(x) AS (x_max)</code>, and in Piglet you can write <code>x.max.as(:x_max)</code>. This is such a common thing to do that I'm thinking of adding some kind of feature that automatically adds <code>AS</code> clauses where appropriate, but it's not there yet.
+If you want to access fields with $0, $1, etc. you can use +self[0]+, +self[1]+:
+  a.foreach { [self[0].as(:x)] } # => FOREACH a GENERATE $0
 +foreach+ is a very complex beast, and this is just an overview, so I'll just give you a few more examples that are not obvious:
 Literal values can be specified using +literal+:
-  a.foreach { |r| [literal('hello').as(:hello)] } # => FOREACH a GENERATE 'hello' AS hello
+  a.foreach { [literal('hello').as(:hello)] } # => FOREACH a GENERATE 'hello' AS hello
 Binary conditionals, a.k.a. the ternary operator are supported through +test+ (unfortunately the Ruby ternary operator can't be overridden):
-  a.foreach { |r| [test(r.x == 3, r.y, r.z)] } # => FOREACH a GENERATE (x == 3 ? y : z)
+  a.foreach { [test(x == 3, y, z)] } # => FOREACH a GENERATE (x == 3 ? y : z)
 The first argument to +test+ is the test expression, the second is the if-true expression and the third is the if-false expression.
-<code>FOREACH { … } GENERATE</code>, a.k.a. nested foreach or foreach-with-inner-bag is currently not supported. Nor is specifying the schema of the resulting relation, or +FLATTEN+ operations.
+== +nested_foreach+
+In Pig Latin you can use a different syntax if you have a relation with an inner bag, e.g:
+  x = FOREACH b {
+    S = FILTER a BY c == 'xyz';
+    GENERATE COUNT(s.z);
+  }
+In Piglet you would write this as
+  x.nested_foreach {
+    s = a.filter { c == 'xyz' }
+    [s.z.count]
+  }
 == +split+
-The syntax of +split+ shouldn't be surprising if you've read this far, but there's perhaps some details that aren't obvious. To split a relation into a number of parts you call +split+ on the relation and pass a block in which you specify the expressions describing each shard. Just as with +filter+ and +foreach+ the block receives the relation as an argument. +split+ returns an array containing the relation shards and you can use parallel assignment to make it look really nice:
+The syntax of +split+ shouldn't be surprising if you've read this far, but there's perhaps some details that aren't obvious. To split a relation into a number of parts you call +split+ on the relation and pass a block in which you specify the expressions describing each shard. Just as with +filter+ and +foreach+ the block operates in the context of the relation +split+ is called on. +split+ returns an array containing the relation shards and you can use parallel assignment to make it look really nice:
-  b, c = a.split { |r| [r.x > 2, r.y == 3] } # => SPLIT a INTO b IF x > 2, c IF y == 3
+  b, c = a.split { [x > 2, y == 3] } # => SPLIT a INTO b IF x > 2, c IF y == 3
 == +cogroup+ & +join+
@@ -269,7 +289,7 @@ When you define a UDF it becomes available as a method in the interpreter scope.
   define :awesome, :function => 'my.awesome.Function' # => DEFINE awesome my.awesome.Function
   …
-  b = a.foreach { |r| [awesome(r[0]).as(:something_special)] } # => b = FOREACH a GENERATE awesome($0) AS something_special
+  b = a.foreach { [awesome(self[0]).as(:something_special)] } # => b = FOREACH a GENERATE awesome($0) AS something_special
 If you need to register a JAR you can use +register+:
@@ -307,10 +327,10 @@ If you want to quote the value with backticks, pass <code>:backticks => true</co
 Let's look at a more complex example:
   students = load('students.txt', :schema => [%w(student chararray), %w(age int), %w(grade int)])
-  top_acheivers = students.filter { |r| r.grade == 5 }
-  name_and_age = top_acheivers.foreach { |r| [r.student.as(:name), r.age] }
+  top_acheivers = students.filter { grade == 5 }
+  name_and_age = top_acheivers.foreach { [student.as(:name), age] }
   name_by_age = name_and_age.group(:age)
-  count_by_age = name_by_age.foreach { |r| [r[0].as(:age), r[1].name.count.as(:count)]}
+  count_by_age = name_by_age.foreach { [self[0].as(:age), self[1].name.count.as(:count)]}
   store(count_by_age, 'student_counts_by_age.txt', :using => :pig_storage)
 We load the file <code>students.txt</code> as a relation with three fields: <code>student</code>, a string, <code>age</code> an integer and <code>grade</code> another integer. Next we filter out the top acheivers with +filter+. +filter+ takes a block and that block gets a referece to the relation (the one +filter+ was called on), the result of the block will be the filter expression, in this case it's <code>grade == 5</code>.
@@ -338,8 +358,8 @@ My goal with Piglet was to add control of flow and reuse mechanisms to Pig, so I
   input = load('input', :schema => %w(country browser site visit_duration))
   %w(country browser site).each do |dimension|
-    grouped = input.group(dimension).foreach do |r|
-      [r[0], r[1].visit_duration.sum]
+    grouped = input.group(dimension).foreach do
+      [self[0], self[1].visit_duration.sum]
     end
     store(grouped, "output-#{dimension}")
   end
@@ -360,8 +380,8 @@ We load a file that contains an ID field, three dimensions (country, browser and
 But in Piglet it's as simple as looping over the names of the dimensions. You could even define a method that encapsulates the grouping, summing and storing (although in this case it would be a bit overkill):
   def sum_dimension(relation, dimension)
-    grouped = relation.group(dimension).foreach do |r|
-      [r[0], r[1].visit_duration.sum]
+    grouped = relation.group(dimension).foreach do
+      [self[0], self[1].visit_duration.sum]
     end
     store(grouped, "output-#{dimension}")
   end
@@ -384,6 +404,19 @@ and then use them just as any other operator:
   small, medium, large = input.samples(0.01, 0.1, 0.5)
+or what about an operator that returns the top _n_ items by some field:
+	module Piglet::Relation::Relation
+	  # Returns the top _n_ tuples from a relation, ordered by _field_
+	  def top(n, field)
+	    order([field, :desc]).limit(n)
+	  end
+	end
+which can be used as
+  input.top(10, :score)
 nifty, huh?
 === Types & schemas
@@ -393,7 +426,7 @@ Piglet knows the schema of relations, so you can do something else that Pig lack
   relation = load('in', :schema => [:a, :b, :c])
   relation.schema.field_names.each do |field|
     grouped = relation.group(field)
-    counted = grouped.foreach { |r| [r[1].count] }
+    counted = grouped.foreach { [self[1].count] }
     store(counted, "out-#{field}")
   end
@@ -415,7 +448,7 @@ The following Pig operators are supported:
 * +DUMP+
 * +EXPLAIN+
 * +FILTER+
-* <code>FOREACH … GENERATE</code>
+* <code>FOREACH … GENERATE</code> (including <code>FOREACH { … GENERATE }</code>)
 * +GROUP+
 * +ILLUSTRATE+
 * +JOIN+
@@ -429,13 +462,9 @@ The following Pig operators are supported:
 * +STREAM+
 * +UNION+
-The following is currently not supported (but will be soon):
-* <code>FOREACH { … } GENERATE</code>
 The file commands (+cd+, +cat+, etc.) will probably not be supported for the forseeable future.
-All the aggregate functions except two are supported:
+All the aggregate functions except one are supported:
 * +AVG+
 * +CONCAT+
@@ -446,11 +475,9 @@ All the aggregate functions except two are supported:
 * +SIZE+
 * +SUM+
 * +TOKENIZE+
-These are not supported yet:
-* +DIFF+
 * +FLATTEN+
++DIFF+ is not supported yet.
 Piglet only supports most arithmetic and logic operators (see below) on fields -- but check the output and make sure that it's doing what you expect because some it's tricky to see where Piglet hijacks the operators and when it's Ruby that is running the show. I'm doing the best I can, but there are many things that can't be done, at least not in Ruby 1.8.
@@ -491,6 +518,10 @@ In the future I may add a way of manually suggesting relation aliases, so that t
 You may also wonder why the relation aliases aren't in consecutive order. The reason is that they get their names in the order they are evaluated, and the interpreter walks the relation ancestry upwards from a +store+ (and it only evaluates a relation once).
+=== Why the verbosity in the code generated from a nested +FOREACH+?
+I'm working on it.
 === Why aren’t all operations included in the output?
 If you try this Piglet code:
@@ -508,6 +539,11 @@ As a side effect of using +store+ and the other output operators as the trigger
 Please contact me and give me the Piglet code and what you think the output should be. I'll try to either fix your Piglet code, or fix Piglet to do what you expect it to do.
+== Contributors
+* Theo Hultberg
+* Ning Liang
 == Copyright
-© 2009-2010 Theo Hultberg / Iconara. See LICENSE for details.
+© 2009-2010 Theo Hultberg / Iconara and contributors. See LICENSE for details.
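
Note on the README changes above: the block-parameter form (<code>{ |r| r.x == 3 }</code>) is replaced by blocks that are interpreted in the context of the relation (<code>{ x == 3 }</code>), and the README warns that an outer local variable named +x+ wins over the field. The following is a minimal sketch of how such a DSL can behave, assuming +instance_eval+ combined with +method_missing+. +SketchField+, +SketchContext+ and +sketch_filter+ are hypothetical names made up for this illustration; they are not Piglet's classes and this is not necessarily how Piglet implements it.

```ruby
# Hypothetical stand-in for a field expression (not Piglet's Field module).
class SketchField
  def initialize(expr)
    @expr = expr
  end

  # Build a Pig-style comparison string instead of comparing values.
  def ==(other)
    SketchField.new("#{@expr} == #{other}")
  end

  def to_s
    @expr
  end
end

# Hypothetical block context: unknown bare names become field references.
class SketchContext
  def method_missing(name, *args)
    if args.empty? && name.to_s =~ /\A\w+\z/
      SketchField.new(name.to_s)
    else
      super
    end
  end

  def respond_to_missing?(name, include_private = false)
    name.to_s =~ /\A\w+\z/ ? true : super
  end
end

# Evaluate the block with a SketchContext as self and render the result.
def sketch_filter(&block)
  "FILTER a BY #{SketchContext.new.instance_eval(&block)}"
end

# No local named x exists yet, so x is resolved through method_missing.
plain = sketch_filter { x == 3 }         # => "FILTER a BY x == 3"

x = 42 # an outer local variable that shadows the field name
shadowed = sketch_filter { x == 3 }      # => "FILTER a BY false"
explicit = sketch_filter { self.x == 3 } # => "FILTER a BY x == 3"
```

The +shadowed+ case shows why the README suggests prefixing with +self+: Ruby decides at parse time that +x+ is the local variable, so the block never reaches the context object unless the receiver is written out.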

data/Rakefile CHANGED

@@ -1,4 +1,13 @@
-require 'lib/piglet'
+$: << File.expand_path('../lib', __FILE__)
+unless defined?(Bundler)
+  require 'rubygems'
+  require 'bundler'
+end
+Bundler.setup
+require 'piglet'
 task :default => :spec

data/lib/piglet.rb CHANGED

@@ -2,7 +2,7 @@
 # :main: README.rdoc
 module Piglet # :nodoc:
-  VERSION = '0.2.5'
+  VERSION = '0.3.0'
   class PigletError < StandardError; end
   class NotSupportedError < PigletError; end
@@ -21,11 +21,13 @@ module Piglet # :nodoc:
   end
   module Relation
+    autoload :BlockContext, 'piglet/relation/block_context'
     autoload :Cogroup, 'piglet/relation/cogroup'
     autoload :Cross, 'piglet/relation/cross'
     autoload :Distinct, 'piglet/relation/distinct'
     autoload :Filter, 'piglet/relation/filter'
     autoload :Foreach, 'piglet/relation/foreach'
+    autoload :NestedForeach, 'piglet/relation/nested_foreach'
     autoload :Group, 'piglet/relation/group'
     autoload :Join, 'piglet/relation/join'
     autoload :Limit, 'piglet/relation/limit'
@@ -41,8 +43,10 @@ module Piglet # :nodoc:
     autoload :BinaryConditional, 'piglet/field/binary_conditional'
     autoload :CallExpression, 'piglet/field/call_expression'
     autoload :InfixExpression, 'piglet/field/infix_expression'
+    autoload :DirectExpression, 'piglet/field/direct_expression'
     autoload :Literal, 'piglet/field/literal'
     autoload :Field, 'piglet/field/field'
+    autoload :MapValue, 'piglet/field/map_value'
     autoload :PrefixExpression, 'piglet/field/prefix_expression'
     autoload :Reference, 'piglet/field/reference'
     autoload :Rename, 'piglet/field/rename'

data/lib/piglet/field/call_expression.rb CHANGED

@@ -9,14 +9,19 @@ module Piglet
         options ||= {}
         @function_name, @inner_expression = function_name, inner_expression
         @type = options[:type] || inner_expression.type
+        @predecessors = [inner_expression]
       end
       def simple?
         false
       end
-      def to_s
-        "#{@function_name}(#{@inner_expression})"
+      def to_s(inner=false)
+        if inner
+          "#{@function_name}(#{@inner_expression.field_alias})"
+        else
+          "#{@function_name}(#{@inner_expression})"
+        end
       end
     end
   end

data/lib/piglet/field/direct_expression.rb ADDED

@@ -0,0 +1,28 @@
+# encoding: utf-8
+module Piglet
+  module Field
+    class DirectExpression
+      include Field
+      attr_reader :string
+      def initialize(string, predecessor)
+        @string = string
+        @predecessors = [predecessor]
+      end
+      def to_s(inner=false)
+        @string
+      end
+      def method_missing(name, *args)
+        if name.to_s =~ /^\w+$/ && args.empty?
+          field(name)
+        else
+          super
+        end
+      end
+    end
+  end
+end

data/lib/piglet/field/field.rb CHANGED

@@ -6,7 +6,7 @@ module Piglet
       SYMBOLIC_OPERATORS = [:==, :>, :<, :>=, :<=, :%, :+, :-, :*, :/]
       FUNCTIONS = [:avg, :count, :max, :min, :size, :sum, :tokenize]
-      attr_reader :name, :type
+      attr_reader :name, :type, :predecessors
       FUNCTIONS.each do |fun|
         define_method(fun) do
@@ -69,9 +69,63 @@ module Piglet
           InfixExpression.new(op.to_s, self, other, :type => symbolic_operator_return_type(op, self, other))
         end
       end
+      def generate_field_alias
+        if @parent.respond_to?(:next_field_alias)
+          @parent.next_field_alias
+        elsif predecessors.first.respond_to?(:generate_field_alias)
+          predecessors.first.generate_field_alias
+        end
+      end
+      def field_alias
+        @field_alias ||= generate_field_alias
+      end
+      def predecessors
+        @predecessors ||= []
+      end
+      def distinct
+        DirectExpression.new("DISTINCT #{field_alias}", self)
+      end
+      def limit(size)
+        DirectExpression.new("LIMIT #{field_alias} #{size}", self)
+      end
+      def sample(rate)
+        DirectExpression.new("SAMPLE #{field_alias} #{rate}", self)
+      end
+      def order(*args)
+        fields, options = split_at_options(args)
+        fields = *fields
+        expression = Relation::Order.new(self, @interpreter, fields, options).to_s
+        DirectExpression.new(expression, self)
+      end
+      def filter(&block)
+        dummy_relation = DummyRelation.new(self.send(:alias))
+        context = Relation::BlockContext.new(dummy_relation, @interpreter)
+        expression = context.instance_eval(&block)
+        DirectExpression.new("FILTER #{field_alias} BY #{expression}", self)
+      end
+      def flatten
+        DirectExpression.new("FLATTEN(#{field_alias})", self)
+      end
+      def field(name)
+        Reference.new(name, self, :explicit_ancestry => true)
+      end
+      def get(key)
+        MapValue.new(key, self)
+      end
     protected
       def parenthesise(expr)
         if expr.respond_to?(:simple?) && ! expr.simple?
           "(#{expr})"
@@ -131,6 +185,22 @@ module Piglet
           nil
         end
       end
-    end
+      def split_at_options(parameters)
+        if parameters.last.is_a? Hash
+          [parameters[0..-2], parameters.last]
+        else
+          [parameters, nil]
+        end
+      end
+      class DummyRelation
+        include Relation::Relation
+        attr_reader :alias
+        def initialize(ali4s)
+          @alias = ali4s
+        end
+      end
+    end
   end
 end
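
Note on the +split_at_options+ helper added in the hunk above: it separates a trailing options hash from the positional arguments, which is what lets +order(*args)+ accept field names followed by optional settings. The sketch below copies the same logic into a standalone method so it can be run outside the +Field+ module. The option key <code>:some_option</code> is a placeholder chosen for this illustration, not an option taken from Piglet.

```ruby
# Same logic as the helper in the diff, lifted out of the Field module
# so it can run on its own.
def split_at_options(parameters)
  if parameters.last.is_a?(Hash)
    # The last element is the options hash; everything before it is positional.
    [parameters[0..-2], parameters.last]
  else
    # No trailing hash: all elements are positional and options are nil.
    [parameters, nil]
  end
end

with_options    = split_at_options([:x, :y, { :some_option => 1 }])
without_options = split_at_options([:x, :y])
empty           = split_at_options([])
```

Callers have to handle a +nil+ options value, since the helper does not substitute an empty hash when no options are given.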