cascading.jruby 0.0.7 → 0.0.8

data/.travis.yml ADDED
@@ -0,0 +1,6 @@
1
+ language: ruby
2
+ rvm:
3
+ - jruby-18mode
4
+ notifications:
5
+ recipients:
6
+ - mwalker@etsy.com
data/Gemfile ADDED
@@ -0,0 +1,6 @@
1
+ source :rubygems
2
+
3
+ group :test do
4
+ gem 'rake', '0.8.7'
5
+ gem 'rspec', '1.1.11'
6
+ end
data/Gemfile.lock ADDED
@@ -0,0 +1,12 @@
1
+ GEM
2
+ remote: http://rubygems.org/
3
+ specs:
4
+ rake (0.8.7)
5
+ rspec (1.1.11)
6
+
7
+ PLATFORMS
8
+ java
9
+
10
+ DEPENDENCIES
11
+ rake (= 0.8.7)
12
+ rspec (= 1.1.11)
data/HACKING.md CHANGED
@@ -2,13 +2,21 @@
2
2
 
3
3
  Some hacking info on `cascading.jruby`:
4
4
 
5
- `cascading.jruby` can be packaged as a gem. To do so, you must generate the necessary packaging files:
5
+ For local development, install with (requires [bundler](http://gembundler.com/)):
6
6
 
7
- jruby -S rake gem
7
+ bundle install
8
8
 
9
- will produce the gem in the pkg/ sub-directory. After that, just cd to this directory and:
9
+ To run the tests (will download Cascading and Hadoop jars):
10
10
 
11
- jruby -S rake install cascading.jruby-xxx.gem
11
+ jruby -S bundle exec rake
12
+
13
+ To create the gem:
14
+
15
+ jruby -S bundle exec rake gem
16
+
17
+ To install it locally:
18
+
19
+ jruby -S gem install pkg/cascading.jruby-xxx.gem
12
20
 
13
21
  The `Cascading::Operations` module is mixed-in the `Cascading::Assembly` class to provide some shortcuts for common operations.
14
22
 
data/History.txt CHANGED
@@ -1,3 +1,11 @@
1
+ 0.0.8 - AggregateBy and upgrade to Cascading 2.0.0 wip-286
2
+
3
+ This release upgrades to Cascading 2.0.0 wip-286, but again does not implement
4
+ local mode. It does, however, replace group_by/union which contain only count,
5
+ sum, and average with the corresponding composite AggregateBy. Various
6
+ refactorings were required to support this, including the addition of limited
7
+ support for SubAssemblies.
8
+
1
9
  0.0.7 - Upgrade to Cascading 2.0.0 wip-255
2
10
 
3
11
  This release upgrades to Cascading 2.0.0 wip-255, but implements neither local
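The 0.0.8 entry above says that a group_by/union containing only count, sum, and average is replaced by a composite AggregateBy. A minimal pure-Ruby sketch of that eligibility bookkeeping may help; it mirrors the `aggregate_bys` handling in the `Aggregations` class from this release, but `AggregateByTracker` and the symbols passed to it are hypothetical stand-ins, and no Cascading or JRuby classes are involved.

```ruby
# Hypothetical model of the AggregateBy eligibility bookkeeping.
# Symbols such as :count_by stand in for real AggregateBy instances.
class AggregateByTracker
  attr_reader :aggregate_bys

  # group_by: true when the leading group is a GroupBy (not a CoGroup)
  def initialize(group_by)
    @aggregate_bys = group_by ? [] : nil
  end

  # Record one aggregation; pass nil when it has no AggregateBy counterpart.
  # A single ineligible aggregation disables the optimization for good.
  def every(aggregate_by = nil)
    if aggregate_by && aggregate_bys
      aggregate_bys << aggregate_by
    else
      @aggregate_bys = nil
    end
    self
  end

  # Eligible only if tracking is still alive and at least one entry exists
  def can_aggregate_by?
    !aggregate_bys.nil? && !aggregate_bys.empty?
  end
end

only_simple = AggregateByTracker.new(true).every(:count_by).every(:sum_by)
with_min    = AggregateByTracker.new(true).every(:count_by).every(nil)
co_group    = AggregateByTracker.new(false).every(:count_by)

puts only_simple.can_aggregate_by? # true
puts with_min.can_aggregate_by?    # false
puts co_group.can_aggregate_by?    # false
```

In the release itself, sorting on the group also rules out the optimization; this sketch leaves that check out.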
data/README.md CHANGED
@@ -1,8 +1,8 @@
1
- # Cascading.JRuby
1
+ # Cascading.JRuby [![Build Status](https://secure.travis-ci.org/mrwalker/cascading.jruby.png)](http://travis-ci.org/mrwalker/cascading.jruby)
2
2
 
3
3
  `cascading.jruby` is a small DSL above [Cascading](http://www.cascading.org/).
4
4
 
5
- It requires Hadoop (>= 0.20.2) and Cascading (>= 2.0.0) to be set via the environment variables: `HADOOP_HOME` and `CASCADING_HOME`
5
+ It requires Hadoop (>= 0.20.2) and [Cascading 2.0.0-wip-286](http://files.concurrentinc.com/cascading/2.0/cascading-2.0.0-wip-286-hadoop-0.20.2%2B.tgz) to be set via the environment variables: `HADOOP_HOME` and `CASCADING_HOME`
6
6
 
7
7
  It has been tested on JRuby versions 1.2.0, 1.4.0, 1.5.3, and 1.6.5.
8
8
 
data/Rakefile CHANGED
@@ -18,8 +18,9 @@ end
18
18
  ensure_in_path 'lib'
19
19
 
20
20
  require 'cascading'
21
+ require 'bundler/setup'
21
22
 
22
- task :default => 'test:run'
23
+ task :default => 'test'
23
24
 
24
25
  task :run do
25
26
  # ensure_in_path "samples"
@@ -28,7 +29,7 @@ task :run do
28
29
  end
29
30
 
30
31
  desc 'Remove gem and Java build files'
31
- task :clean => ['ant:clean', 'gem:clean'] do
32
+ task :clean => ['ant:clean', 'gem:clean', 'samples:clean'] do
32
33
  puts 'Build files removed'
33
34
  end
34
35
 
data/TODO CHANGED
@@ -1,13 +1,13 @@
1
1
  Documentation
2
2
 
3
- Assembly -> Each/Every refactor
4
- Look into totally eliminating registries
3
+ Import local_job/Job from Etsy, which can eliminate registries
5
4
 
6
5
  Bug fixes on github
7
6
  Enforce more runtime rules at composition time
8
7
  Standardize helper contracts
9
8
  Possibly combine unit tests...into unit tests because RSpec sucks and swallows stack traces
10
9
 
10
+ (Jading)
11
11
  Split out runner
12
12
  Make runner implement Tool
13
13
  Create build tool for job jar
data/ivy.xml ADDED
@@ -0,0 +1,25 @@
1
+ <?xml version="1.0" encoding="ISO-8859-1"?>
2
+ <ivy-module version="2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
3
+ xsi:noNamespaceSchemaLocation="http://ant.apache.org/ivy/schemas/ivy.xsd">
4
+ <info organisation="com.etsy" module="cascading.jruby" status="integration" />
5
+
6
+ <configurations>
7
+ <conf name="default" visibility="public" description="runtime dependencies and master artifact can be used with this conf" extends="runtime,master" />
8
+ <conf name="master" visibility="public" description="contains only the artifact published by this module itself, with no transitive dependencies" />
9
+ <conf name="compile" visibility="public" description="this is the default scope, used if none is specified. Compile dependencies are available in all classpaths." />
10
+ <conf name="provided" visibility="public" description="this is much like compile, but indicates you expect the JDK or a container to provide it. It is only available on the compilation classpath, and is not transitive." />
11
+ <conf name="runtime" visibility="public" description="this scope indicates that the dependency is not required for compilation, but is for execution. It is in the runtime and test classpaths, but not the compile classpath." extends="compile" />
12
+ <conf name="test" visibility="private" description="this scope indicates that the dependency is not required for normal use of the application, and is only available for the test compilation and execution phases." extends="runtime" />
13
+ <conf name="system" visibility="public" description="this scope is similar to provided except that you have to provide the JAR which contains it explicitly. The artifact is always available and is not looked up in a repository." />
14
+ <conf name="sources" visibility="public" description="this configuration contains the source artifact of this module, if any." />
15
+ <conf name="javadoc" visibility="public" description="this configuration contains the javadoc artifact of this module, if any." />
16
+ <conf name="optional" visibility="public" description="contains all optional dependencies" />
17
+ </configurations>
18
+
19
+ <dependencies>
20
+ <dependency org="cascading" name="cascading-core" rev="2.0.0-wip-286" conf="default" />
21
+ <dependency org="cascading" name="cascading-local" rev="2.0.0-wip-286" conf="default" />
22
+ <dependency org="cascading" name="cascading-hadoop" rev="2.0.0-wip-286" conf="default" />
23
+ <dependency org="org.jruby" name="jruby" rev="1.6.5" conf="default" />
24
+ </dependencies>
25
+ </ivy-module>
data/ivysettings.xml ADDED
@@ -0,0 +1,7 @@
1
+ <ivysettings>
2
+ <include url="http://conjars.org/repo/ivysettings.xml"/>
3
+
4
+ <resolvers>
5
+ <ibiblio name="public" m2compatible="true"/>
6
+ </resolvers>
7
+ </ivysettings>
data/lib/cascading.rb CHANGED
@@ -6,7 +6,7 @@ require 'java'
6
6
 
7
7
  module Cascading
8
8
  # :stopdoc:
9
- VERSION = '0.0.7'
9
+ VERSION = '0.0.8'
10
10
  LIBPATH = ::File.expand_path(::File.dirname(__FILE__)) + ::File::SEPARATOR
11
11
  PATH = ::File.dirname(LIBPATH) + ::File::SEPARATOR
12
12
  CASCADING_HOME = ENV['CASCADING_HOME']
@@ -0,0 +1,190 @@
1
+ require 'cascading/operations'
2
+ require 'cascading/scope'
3
+ require 'cascading/ext/array'
4
+
5
+ module Cascading
6
+ # Rules enforced by Aggregations:
7
+ # Contains either 1 Buffer or >= 1 Aggregator (explicitly checked)
8
+ # No GroupBys, CoGroups, Joins, or Merges (methods for these pipes do not exist on Aggregations)
9
+ # No Eaches (Aggregations#each does not exist)
10
+ # Aggregations may not branch (Aggregations#branch does not exist)
11
+ #
12
+ # Externally enforced rules:
13
+ # May be empty (in which case, Aggregations is not instantiated)
14
+ # Must follow a GroupBy or CoGroup (not a Join or Merge)
15
+ #
16
+ # Optimizations:
17
+ # If the leading Group is a GroupBy and all subsequent Everies are
18
+ # Aggregators that have a corresponding AggregateBy, Aggregations can
19
+ # replace the GroupBy/Aggregator pipe with a single composite AggregateBy.
20
+ class Aggregations
21
+ include Operations
22
+
23
+ attr_reader :assembly, :tail_pipe, :scope, :aggregate_bys
24
+
25
+ def initialize(assembly, group, incoming_scopes)
26
+ @assembly = assembly
27
+ @tail_pipe = group
28
+ @scope = Scope.outgoing_scope(tail_pipe, incoming_scopes)
29
+
30
+ # AggregateBy optimization only applies to GroupBy
31
+ @aggregate_bys = tail_pipe.is_group_by ? [] : nil
32
+ end
33
+
34
+ def debug_scope
35
+ puts "Current scope of aggregations for '#{assembly.name}':\n #{scope}\n----------\n"
36
+ end
37
+
38
+ def make_pipe(type, parameters)
39
+ pipe = type.new(*parameters)
40
+
41
+ # Enforce 1 Buffer or >= 1 Aggregator rule
42
+ if tail_pipe.kind_of?(Java::CascadingPipe::Every)
43
+ raise 'Buffer must be sole aggregation' if tail_pipe.buffer? || (tail_pipe.aggregator? && pipe.buffer?)
44
+ end
45
+
46
+ @tail_pipe = pipe
47
+ @scope = Scope.outgoing_scope(tail_pipe, [scope])
48
+ end
49
+ private :make_pipe
50
+
51
+ # We can replace these aggregations with the corresponding composite
52
+ # AggregateBy if the leading Group was a GroupBy and all subsequent
53
+ # Aggregators had a corresponding AggregateBy (which we've encoded in the
54
+ # list of aggregate_bys being a non-empty array).
55
+ def can_aggregate_by?
56
+ !aggregate_bys.nil? && !aggregate_bys.empty?
57
+ end
58
+
59
+ # "Fix" out values fields after a sequence of Everies. This is a field
60
+ # name metadata fix which is why the Identity is not planned into the
61
+ # resulting Cascading pipe. Without it, all values fields would propagate
62
+ # through non-empty aggregations, which doesn't match Cascading's planner's
63
+ # behavior.
64
+ def finalize
65
+ discard_each = Java::CascadingPipe::Each.new(tail_pipe, all_fields, Java::CascadingOperation::Identity.new)
66
+ @scope = Scope.outgoing_scope(discard_each, [scope])
67
+ end
68
+
69
+ # Builds an every pipe and adds it to the current list of aggregations.
70
+ # Note that this list may be either exactly 1 Buffer or any number of
71
+ # Aggregators.
72
+ def every(*args)
73
+ options = args.extract_options!
74
+
75
+ in_fields = fields(args)
76
+ out_fields = fields(options[:output])
77
+ operation = options[:aggregator] || options[:buffer]
78
+
79
+ if options[:aggregate_by] && aggregate_bys
80
+ aggregate_bys << options[:aggregate_by]
81
+ else
82
+ @aggregate_bys = nil
83
+ end
84
+
85
+ parameters = [tail_pipe, in_fields, operation, out_fields].compact
86
+ make_pipe(Java::CascadingPipe::Every, parameters)
87
+ end
88
+
89
+ def assert_group(*args)
90
+ options = args.extract_options!
91
+
92
+ assertion = args[0]
93
+ assertion_level = options[:level] || Java::CascadingOperation::AssertionLevel::STRICT
94
+
95
+ parameters = [tail_pipe, assertion_level, assertion]
96
+ make_pipe(Java::CascadingPipe::Every, parameters)
97
+ end
98
+
99
+ def assert_group_size_equals(*args)
100
+ options = args.extract_options!
101
+
102
+ assertion = Java::CascadingOperationAssertion::AssertGroupSizeEquals.new(args[0])
103
+ assert_group(assertion, options)
104
+ end
105
+
106
+ # Builds a series of every pipes for aggregation.
107
+ #
108
+ # Args can either be a list of fields to aggregate and an options hash or
109
+ # a hash that maps input field name to output field name (similar to
110
+ # insert) and an options hash.
111
+ #
112
+ # Options include:
113
+ # * <tt>:ignore</tt> a Java Array of Objects (for min and max) or Tuples
114
+ # (for first and last) of values for the aggregator to ignore
115
+ #
116
+ # <tt>function</tt> is a symbol that is the method to call to construct the Cascading Aggregator.
117
+ def composite_aggregator(args, function)
118
+ field_map, options = extract_field_map(args)
119
+
120
+ field_map.each do |in_field, out_field|
121
+ agg = self.send(function, out_field, options)
122
+ every(in_field, :aggregator => agg, :output => all_fields)
123
+ end
124
+ raise "Composite aggregator '#{function.to_s.gsub('_function', '')}' invoked on 0 fields" if field_map.empty?
125
+ end
126
+
127
+ def min(*args); composite_aggregator(args, :min_function); end
128
+ def max(*args); composite_aggregator(args, :max_function); end
129
+ def first(*args); composite_aggregator(args, :first_function); end
130
+ def last(*args); composite_aggregator(args, :last_function); end
131
+
132
+ # Counts elements of a group. May optionally specify the name of the
133
+ # output count field (defaults to 'count').
134
+ def count(name = 'count')
135
+ count_aggregator = Java::CascadingOperationAggregator::Count.new(fields(name))
136
+ count_by = Java::CascadingPipeAssembly::CountBy.new(fields(name))
137
+ every(last_grouping_fields, :aggregator => count_aggregator, :output => all_fields, :aggregate_by => count_by)
138
+ end
139
+
140
+ # Sums one or more fields. Fields to be summed may either be provided as
141
+ # the arguments to sum (in which case they will be aggregated into a field
142
+ # of the same name in the given order), or via a hash using the :mapping
143
+ # parameter (in which case they will be aggregated from the field named by
144
+ # the key into the field named by the value after being sorted). The type
145
+ # of the output sum may be controlled with the :type parameter.
146
+ def sum(*args)
147
+ options = args.extract_options!
148
+ type = JAVA_TYPE_MAP[options[:type]]
149
+
150
+ mapping = options[:mapping] ? options[:mapping].sort : args.zip(args)
151
+ mapping.each do |in_field, out_field|
152
+ sum_aggregator = Java::CascadingOperationAggregator::Sum.new(*[fields(out_field), type].compact)
153
+ # NOTE: SumBy requires a type in wip-286, unlike Sum (see Sum.java line 42 for default)
154
+ sum_by = Java::CascadingPipeAssembly::SumBy.new(fields(in_field), fields(out_field), type || Java::double.java_class)
155
+ every(in_field, :aggregator => sum_aggregator, :output => all_fields, :aggregate_by => sum_by)
156
+ end
157
+ raise "sum invoked on 0 fields (note :mapping must be provided to explicitly rename fields)" if mapping.empty?
158
+ end
159
+
160
+ # Averages one or more fields. The contract of average is identical to
161
+ # that of other composite aggregators, but it accepts no options.
162
+ def average(*args)
163
+ field_map, _ = extract_field_map(args)
164
+
165
+ field_map.each do |in_field, out_field|
166
+ average_aggregator = Java::CascadingOperationAggregator::Average.new(fields(out_field))
167
+ average_by = Java::CascadingPipeAssembly::AverageBy.new(fields(in_field), fields(out_field))
168
+ every(in_field, :aggregator => average_aggregator, :output => all_fields, :aggregate_by => average_by)
169
+ end
170
+ raise "average invoked on 0 fields" if field_map.empty?
171
+ end
172
+
173
+ private
174
+
175
+ # Extracts a field mapping, input field => output field, by accepting a
176
+ # hash in the first argument. If no hash is provided, then maps arguments
177
+ # onto themselves which names outputs the same as inputs. Additionally
178
+ # extracts options from args.
179
+ def extract_field_map(args)
180
+ if !args.empty? && args.first.kind_of?(Hash)
181
+ field_map = args.shift.sort
182
+ options = args.extract_options!
183
+ else
184
+ options = args.extract_options!
185
+ field_map = args.zip(args)
186
+ end
187
+ [field_map, options]
188
+ end
189
+ end
190
+ end
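The argument handling in `extract_field_map` above can be tried outside JRuby with a small sketch. It assumes that `extract_options!` from `cascading/ext/array` pops a trailing Hash and otherwise returns an empty Hash; the free-standing `extract_options!(args)` below is a hypothetical stand-in for that helper, not the library's own code.

```ruby
# Hypothetical stand-in for Array#extract_options!: assumed to pop a
# trailing Hash from args, or return {} when there is none.
def extract_options!(args)
  args.last.is_a?(Hash) ? args.pop : {}
end

# Same branching as the method in the diff: a leading Hash is an explicit
# input => output mapping (sorted by key); otherwise fields map to themselves.
def extract_field_map(args)
  if !args.empty? && args.first.is_a?(Hash)
    field_map = args.shift.sort
    options = extract_options!(args)
  else
    options = extract_options!(args)
    field_map = args.zip(args)
  end
  [field_map, options]
end

# Plain field names: each output is named after its input
p extract_field_map(['a', 'b'])

# Explicit mapping plus an options hash: pairs come back sorted by input name
p extract_field_map([{ 'b' => 'max_b', 'a' => 'max_a' }, { :ignore => nil }])
```

Sorting the mapping gives a deterministic order of Every pipes regardless of how the Hash was written.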
@@ -1,21 +1,23 @@
1
+ # -*- coding: utf-8 -*-
1
2
  # Copyright 2009, Grégoire Marabout. All Rights Reserved.
2
3
  #
3
4
  # This is free software. Please see the LICENSE and COPYING files for details.
4
5
 
5
6
  require 'cascading/base'
6
7
  require 'cascading/operations'
8
+ require 'cascading/aggregations'
9
+ require 'cascading/sub_assembly'
7
10
  require 'cascading/ext/array'
8
11
 
9
12
  module Cascading
10
13
  class Assembly < Cascading::Node
11
14
  include Operations
12
15
 
13
- attr_accessor :head_pipe, :tail_pipe, :incoming_scopes, :outgoing_scopes
16
+ attr_reader :head_pipe, :tail_pipe
14
17
 
15
18
  def initialize(name, parent, outgoing_scopes = {})
16
19
  super(name, parent)
17
20
 
18
- @every_applied = false
19
21
  @outgoing_scopes = outgoing_scopes
20
22
  if parent.kind_of?(Assembly)
21
23
  @head_pipe = Java::CascadingPipe::Pipe.new(name, parent.tail_pipe)
@@ -26,13 +28,13 @@ module Cascading
26
28
  @head_pipe = Java::CascadingPipe::Pipe.new(name)
27
29
  @outgoing_scopes[name] ||= Scope.empty_scope(name)
28
30
  end
29
- @tail_pipe = @head_pipe
31
+ @tail_pipe = head_pipe
30
32
  @incoming_scopes = [scope]
31
33
  end
32
34
 
33
35
  def describe(offset = '')
34
- incoming_scopes_desc = "#{incoming_scopes.map{ |incoming_scope| incoming_scope.values_fields.to_a.inspect }.join(', ')}"
35
- incoming_scopes_desc = "(#{incoming_scopes_desc})" unless incoming_scopes.size == 1
36
+ incoming_scopes_desc = "#{@incoming_scopes.map{ |incoming_scope| incoming_scope.values_fields.to_a.inspect }.join(', ')}"
37
+ incoming_scopes_desc = "(#{incoming_scopes_desc})" unless @incoming_scopes.size == 1
36
38
  description = "#{offset}#{name}:assembly :: #{incoming_scopes_desc} -> #{scope.values_fields.to_a.inspect}"
37
39
  description += "\n#{child_names.map{ |child| children[child].describe("#{offset} ") }.join("\n")}" unless children.empty?
38
40
  description
@@ -51,108 +53,122 @@ module Cascading
51
53
  puts "Current scope for '#{name}':\n #{scope}\n----------\n"
52
54
  end
53
55
 
54
- def make_each(type, *parameters)
55
- make_pipe(type, parameters)
56
- @every_applied = false
57
- end
56
+ def make_pipe(type, parameters)
57
+ @tail_pipe = type.new(*parameters)
58
+ @outgoing_scopes[name] = Scope.outgoing_scope(tail_pipe, [scope])
58
59
 
59
- def make_every(type, *parameters)
60
- make_pipe(type, parameters, scope.grouping_key_fields)
61
- @every_applied = true
60
+ tail_pipe
62
61
  end
62
+ private :make_pipe
63
63
 
64
- def every_applied?
65
- @every_applied
64
+ def populate_incoming_scopes(assembly_names, group_fields_args = {})
65
+ # NOTE: this overrides the existing incoming_scopes, which changes the
66
+ # way describe will function on this assembly
67
+ pipes, @incoming_scopes, group_fields = [], [], []
68
+ assembly_names.each do |assembly_name|
69
+ assembly = parent_flow.find_child(assembly_name)
70
+ raise "Could not find assembly '#{assembly_name}' from '#{name}'" unless assembly
71
+
72
+ pipes << assembly.tail_pipe
73
+ @incoming_scopes << assembly.scope
74
+ group_fields << fields(group_fields_args[assembly_name]) if group_fields_args[assembly_name]
75
+ end
76
+ [pipes, group_fields]
66
77
  end
78
+ private :populate_incoming_scopes
67
79
 
68
- def do_every_block_and_rename_fields(group_fields, incoming_scopes, &block)
69
- return unless block
80
+ def apply_aggregations(group, incoming_scopes, &block)
81
+ aggregations = Aggregations.new(self, group, incoming_scopes)
82
+ aggregations.instance_eval(&block) if block_given?
70
83
 
71
- # TODO: this should really be instance evaled on an object
72
- # that only allows aggregation and buffer operations.
73
- instance_eval &block
84
+ # Sorting of any type means that we cannot use the AggregateBy optimization
85
+ if aggregations.can_aggregate_by? && !group.is_sorted && !group.is_sort_reversed
86
+ grouping_fields = group.key_selectors.values.first
87
+ group.key_selectors.values.each do |key_fields|
88
+ raise "Grouping fields mismatch: #{grouping_fields} expected; #{key_fields} found from #{group.key_selectors}" unless key_fields == grouping_fields
89
+ end
74
90
 
75
- bind_names scope.grouping_fields.to_a if every_applied?
76
- end
91
+ aggregate_by = sub_assembly(Java::CascadingPipeAssembly::AggregateBy.new(
92
+ name,
93
+ group.previous,
94
+ grouping_fields,
95
+ aggregations.aggregate_bys.to_java(Java::CascadingPipeAssembly::AggregateBy)
96
+ ), group.previous, incoming_scopes)
77
97
 
78
- def make_pipe(type, parameters, grouping_key_fields = [], incoming_scopes = [scope])
79
- @tail_pipe = type.new(*parameters)
80
- @outgoing_scopes[name] = Scope.outgoing_scope(@tail_pipe, incoming_scopes, grouping_key_fields, every_applied?)
98
+ aggregate_by
99
+ else
100
+ aggregations.finalize if block_given?
101
+ @tail_pipe = aggregations.tail_pipe
102
+ @outgoing_scopes[name] = aggregations.scope
103
+
104
+ group
105
+ end
81
106
  end
107
+ private :apply_aggregations
82
108
 
83
109
  def to_s
84
- "#{name} : head pipe : #{@head_pipe} - tail pipe: #{@tail_pipe}"
110
+ "#{name} : head pipe : #{head_pipe} - tail pipe: #{tail_pipe}"
85
111
  end
86
112
 
87
- # Builds a join (CoGroup) pipe. Requires a list of assembly names to join.
113
+ # Builds a join (CoGroup) pipe. Requires a list of assembly names to join
114
+ # and :on to specify the group_fields.
88
115
  def join(*args, &block)
89
116
  options = args.extract_options!
90
117
 
91
- pipes, @incoming_scopes = [], []
92
- args.each do |assembly_name|
93
- assembly = parent_flow.find_child(assembly_name)
94
- raise "Could not find assembly '#{assembly_name}' in join" unless assembly
118
+ pipes, _ = populate_incoming_scopes(args)
95
119
 
96
- pipes << assembly.tail_pipe
97
- incoming_scopes << @outgoing_scopes[assembly.name]
98
- end
120
+ group_fields_args = options[:on]
121
+ raise 'join requires :on parameter' unless group_fields_args
99
122
 
100
- group_fields_args = options.delete(:on)
101
123
  if group_fields_args.kind_of?(String)
102
124
  group_fields_args = [group_fields_args]
103
125
  end
104
- group_fields_names = group_fields_args.to_a
126
+
105
127
  group_fields = []
106
128
  if group_fields_args.kind_of?(Array)
107
129
  pipes.size.times do
108
130
  group_fields << fields(group_fields_args)
109
131
  end
110
132
  elsif group_fields_args.kind_of?(Hash)
111
- pipes, @incoming_scopes = [], []
112
- keys = group_fields_args.keys.sort
113
- keys.each do |assembly_name|
114
- v = group_fields_args[assembly_name]
115
- assembly = parent_flow.find_child(assembly_name)
116
- raise "Could not find assembly '#{assembly_name}' in join" unless assembly
117
-
118
- pipes << assembly.tail_pipe
119
- incoming_scopes << @outgoing_scopes[assembly.name]
120
- group_fields << fields(v)
121
- group_fields_names = group_fields_args[keys.first].to_a
122
- end
133
+ pipes, group_fields = populate_incoming_scopes(group_fields_args.keys.sort, group_fields_args)
134
+ else
135
+ raise "Unsupported data type for :on in join: '#{group_fields_args.class}'"
123
136
  end
124
137
 
138
+ raise 'join requires non-empty :on parameter' if group_fields_args.empty?
125
139
  group_fields = group_fields.to_java(Java::CascadingTuple::Fields)
126
- incoming_fields = incoming_scopes.map{ |s| s.values_fields }
140
+ incoming_fields = @incoming_scopes.map{ |s| s.values_fields }
127
141
  declared_fields = fields(options[:declared_fields] || dedup_fields(*incoming_fields))
128
- joiner = options.delete(:joiner)
129
-
130
- if declared_fields
131
- case joiner
132
- when :inner, "inner", nil
133
- joiner = Java::CascadingPipeJoiner::InnerJoin.new
134
- when :left, "left"
135
- joiner = Java::CascadingPipeJoiner::LeftJoin.new
136
- when :right, "right"
137
- joiner = Java::CascadingPipeJoiner::RightJoin.new
138
- when :outer, "outer"
139
- joiner = Java::CascadingPipeJoiner::OuterJoin.new
140
- when Array
141
- joiner = joiner.map do |t|
142
- case t
143
- when true, 1, :inner then true
144
- when false, 0, :outer then false
145
- else fail "invalid mixed joiner entry: #{t}"
146
- end
142
+ joiner = options[:joiner]
143
+
144
+ case joiner
145
+ when :inner, 'inner', nil
146
+ joiner = Java::CascadingPipeJoiner::InnerJoin.new
147
+ when :left, 'left'
148
+ joiner = Java::CascadingPipeJoiner::LeftJoin.new
149
+ when :right, 'right'
150
+ joiner = Java::CascadingPipeJoiner::RightJoin.new
151
+ when :outer, 'outer'
152
+ joiner = Java::CascadingPipeJoiner::OuterJoin.new
153
+ when Array
154
+ joiner = joiner.map do |t|
155
+ case t
156
+ when true, 1, :inner then true
157
+ when false, 0, :outer then false
158
+ else fail "invalid mixed joiner entry: #{t}"
147
159
  end
148
- joiner = Java::CascadingPipeJoiner::MixedJoin.new(joiner.to_java(:boolean))
149
160
  end
161
+ joiner = Java::CascadingPipeJoiner::MixedJoin.new(joiner.to_java(:boolean))
150
162
  end
151
-
152
- parameters = [pipes.to_java(Java::CascadingPipe::Pipe), group_fields, declared_fields, joiner].compact
153
- grouping_key_fields = group_fields[0] # Left key group wins
154
- make_pipe(Java::CascadingPipe::CoGroup, parameters, grouping_key_fields, incoming_scopes)
155
- do_every_block_and_rename_fields(group_fields_names, incoming_scopes, &block)
163
+ result_group_fields = dedup_fields(*group_fields)
164
+ parameters = [
165
+ pipes.to_java(Java::CascadingPipe::Pipe),
166
+ group_fields,
167
+ declared_fields,
168
+ result_group_fields,
169
+ joiner
170
+ ]
171
+ apply_aggregations(Java::CascadingPipe::CoGroup.new(*parameters), @incoming_scopes, &block)
156
172
  end
157
173
  alias co_group join
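When `:joiner` is an Array, `join` above normalizes each entry to a boolean before building the MixedJoin. The sketch below isolates that normalization in plain Ruby; `normalize_mixed_joiner` is a hypothetical name, and the conversion to a Java boolean array via `to_java(:boolean)` is omitted because it needs JRuby.

```ruby
# Hypothetical helper isolating the Array branch of the :joiner handling.
# true/1/:inner mark an inner side, false/0/:outer mark an outer side.
def normalize_mixed_joiner(entries)
  entries.map do |t|
    case t
    when true, 1, :inner then true
    when false, 0, :outer then false
    else raise ArgumentError, "invalid mixed joiner entry: #{t}"
    end
  end
end

p normalize_mixed_joiner([:inner, :outer, 1, 0, true, false])
# prints [true, false, true, false, true, false]
```

Any other entry, such as the string 'inner', raises instead of being coerced silently.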
158
174
 
@@ -193,66 +209,76 @@ module Cascading
193
209
  assembly
194
210
  end
195
211
 
196
- # Builds a new _group_by_ pipe. The fields used for grouping are specified in the args
197
- # array.
212
+ # Builds a new GroupBy pipe that groups on the fields given in args.
213
+ # Any block passed to this method should contain only Everies.
198
214
  def group_by(*args, &block)
199
215
  options = args.extract_options!
200
-
201
216
  group_fields = fields(args)
202
-
203
- sort_fields = fields(options[:sort_by] || args)
217
+ sort_fields = fields(options[:sort_by])
204
218
  reverse = options[:reverse]
205
219
 
206
- parameters = [@tail_pipe, group_fields, sort_fields, reverse].compact
207
- make_pipe(Java::CascadingPipe::GroupBy, parameters, group_fields)
208
- do_every_block_and_rename_fields(args, [scope], &block)
220
+ parameters = [tail_pipe, group_fields, sort_fields, reverse].compact
221
+ apply_aggregations(Java::CascadingPipe::GroupBy.new(*parameters), [scope], &block)
209
222
  end
210
223
 
211
- # Unifies several pipes sharing the same field structure.
212
- # This actually creates a GroupBy pipe.
213
- # It expects a list of assembly names as parameter.
214
- def union_pipes(*args)
215
- pipes, @incoming_scopes = [], []
216
- args[0].each do |assembly_name|
217
- assembly = parent_flow.find_child(assembly_name)
218
- pipes << assembly.tail_pipe
219
- incoming_scopes << @outgoing_scopes[assembly.name]
220
- end
224
+ # Unifies multiple incoming pipes sharing the same field structure using a
225
+ # GroupBy. Accepts :on like join and :sort_by and :reverse like group_by,
226
+ # as well as a block which may be used for a sequence of Every
227
+ # aggregations.
228
+ #
229
+ # By default, groups only on the first field (see line 189 of GroupBy.java)
230
+ def union(*args, &block)
231
+ options = args.extract_options!
232
+ group_fields = fields(options[:on])
233
+ sort_fields = fields(options[:sort_by])
234
+ reverse = options[:reverse]
221
235
 
222
- # Groups only on the 1st field (see line 186 of GroupBy.java)
223
- grouping_key_fields = fields(incoming_scopes.first.values_fields.get(0))
224
- make_pipe(Java::CascadingPipe::GroupBy, [pipes.to_java(Java::CascadingPipe::Pipe)], grouping_key_fields, incoming_scopes)
225
- # TODO: Shouldn't union_pipes accept an every block?
226
- #do_every_block_and_rename_fields(args, incoming_scopes, &block)
236
+ pipes, _ = populate_incoming_scopes(args)
237
+
238
+ # Must provide group_fields to ensure field name propagation
239
+ group_fields = fields(@incoming_scopes.first.values_fields.get(0)) unless group_fields
240
+
241
+ # FIXME: GroupBy is missing a constructor for union in wip-255
242
+ sort_fields = group_fields if !sort_fields && !reverse.nil?
243
+
244
+ parameters = [pipes.to_java(Java::CascadingPipe::Pipe), group_fields, sort_fields, reverse].compact
245
+ apply_aggregations(Java::CascadingPipe::GroupBy.new(*parameters), @incoming_scopes, &block)
227
246
  end
247
+ alias :union_pipes :union
228
248
 
229
- # Builds an basic _every_ pipe, and adds it to the current assembly.
230
- def every(*args)
231
- options = args.extract_options!
249
+ # Allows you to plug in c.p.SubAssemblies to a cascading.jruby Assembly
250
+ # under certain assumptions. Note the default is to extend the tail pipe
251
+ # of this Assembly using a linear SubAssembly. See SubAssembly class for
252
+ # details.
253
+ def sub_assembly(sub_assembly, pipes = [tail_pipe], incoming_scopes = [scope])
254
+ sub_assembly = SubAssembly.new(self, sub_assembly)
255
+ sub_assembly.finalize(pipes, incoming_scopes)
232
256
 
233
- in_fields = fields(args)
234
- out_fields = fields(options[:output])
235
- operation = options[:aggregator] || options[:buffer]
257
+ @tail_pipe = sub_assembly.tail_pipe
258
+ @outgoing_scopes[name] = sub_assembly.scope
236
259
 
237
- parameters = [@tail_pipe, in_fields, operation, out_fields].compact
238
- make_every(Java::CascadingPipe::Every, *parameters)
260
+ sub_assembly
239
261
  end
240
262
 
241
263
  # Builds a basic _each_ pipe, and adds it to the current assembly.
242
264
  # --
243
265
  # Example:
244
- # each "line", :filter=>regex_splitter(["name", "val1", "val2", "id"],
245
- # :pattern => /[.,]*\s+/),
246
- # :output=>["id", "name", "val1", "val2"]
266
+ # each 'line', :function => regex_splitter(['name', 'val1', 'val2', 'id'], :pattern => /[.,]*\s+/), :output => ['id', 'name', 'val1', 'val2']
247
267
  def each(*args)
248
268
  options = args.extract_options!
249
269
 
250
270
  in_fields = fields(args)
251
271
  out_fields = fields(options[:output])
272
+
252
273
  operation = options[:filter] || options[:function]
274
+ raise 'c.p.Each does not support applying an output selector to a c.o.Filter' if options[:filter] && options[:output]
275
+
276
+ parameters = [tail_pipe, in_fields, operation, out_fields].compact
277
+ each = make_pipe(Java::CascadingPipe::Each, parameters)
278
+ raise ':function specified but c.o.Filter provided' if options[:function] && each.is_filter
279
+ raise ':filter specified but c.o.Function provided' if options[:filter] && each.is_function
253
280
 
254
- parameters = [@tail_pipe, in_fields, operation, out_fields].compact
255
- make_each(Java::CascadingPipe::Each, *parameters)
281
+ each
256
282
  end
257
283
 
258
284
  # Restricts the current assembly to the specified fields.
@@ -273,14 +299,6 @@ module Cascading
273
299
  project(*keep_fields.to_a)
274
300
  end
275
301
 
276
- # Assign new names to initial fields in positional order.
277
- # --
278
- # Example:
279
- # bind_names "field1", "field2"
280
- def bind_names(*new_names)
281
- each all_fields, :function => Java::CascadingOperation::Identity.new(fields(new_names))
282
- end
283
-
284
302
  # Renames fields according to the mapping provided.
285
303
  # --
286
304
  # Example:
@@ -318,14 +336,9 @@ module Cascading
318
336
  options = args.extract_options!
319
337
  assertion = args[0]
320
338
  assertion_level = options[:level] || Java::CascadingOperation::AssertionLevel::STRICT
321
- make_each(Java::CascadingPipe::Each, @tail_pipe, assertion_level, assertion)
322
- end
323
339
 
324
- def assert_group(*args)
325
- options = args.extract_options!
326
- assertion = args[0]
327
- assertion_level = options[:level] || Java::CascadingOperation::AssertionLevel::STRICT
328
- make_every(Java::CascadingPipe::Every, @tail_pipe, assertion_level, assertion)
340
+ parameters = [tail_pipe, assertion_level, assertion]
341
+ make_pipe(Java::CascadingPipe::Each, parameters)
329
342
  end
330
343
 
331
344
  # Builds a debugging pipe.
@@ -362,67 +375,6 @@ module Cascading
362
375
  assert(assertion, options)
363
376
  end
364
377
 
365
- def assert_group_size_equals(*args)
366
- options = args.extract_options!
367
- assertion = Java::CascadingOperationAssertion::AssertGroupSizeEquals.new(args[0])
368
- assert_group(assertion, options)
369
- end
370
-
371
- # Builds a series of every pipes for aggregation.
372
- #
373
- # Args can either be a list of fields to aggregate and an options hash or
374
- # a hash that maps input field name to output field name (similar to
375
- # insert) and an options hash.
376
- #
377
- # Options include:
378
- # * <tt>:ignore</tt> a Java Array of Objects (for min and max) or Tuples
379
- # (for first and last) of values for the aggregator to ignore
380
- #
381
- # <tt>function</tt> is a symbol that is the method to call to construct the Cascading Aggregator.
382
- def composite_aggregator(args, function)
383
- if !args.empty? && args.first.kind_of?(Hash)
384
- field_map = args.shift.sort
385
- options = args.extract_options!
386
- else
387
- options = args.extract_options!
388
- field_map = args.zip(args)
389
- end
390
- field_map.each do |in_field, out_field|
391
- agg = self.send(function, out_field, options)
392
- every(in_field, :aggregator => agg, :output => all_fields)
393
- end
394
- puts "WARNING: composite aggregator '#{function.to_s.gsub('_function', '')}' invoked on 0 fields; will be ignored" if field_map.empty?
395
- end
396
-
397
- def min(*args); composite_aggregator(args, :min_function); end
398
- def max(*args); composite_aggregator(args, :max_function); end
399
- def first(*args); composite_aggregator(args, :first_function); end
400
- def last(*args); composite_aggregator(args, :last_function); end
401
- def average(*args); composite_aggregator(args, :average_function); end
402
-
403
- # Counts elements of a group. First unnamed parameter is the name of the
404
- # output count field (defaults to 'count' if it is not provided).
405
- def count(*args)
406
- options = args.extract_options!
407
- name = args[0] || 'count'
408
- every(last_grouping_fields, :aggregator => count_function(name, options), :output => all_fields)
409
- end
410
-
411
- # Fields to be summed may either be provided as an array, in which case
412
- # they will be aggregated into the same field in the given order, or as a
413
- # hash, in which case they will be aggregated from the field named by the
414
- # key into the field named by the value after being sorted.
415
- def sum(*args)
416
- options = args.extract_options!
417
- type = JAVA_TYPE_MAP[options[:type]]
418
- raise "No type specified for sum" unless type
419
-
420
- mapping = options[:mapping] ? options[:mapping].sort : args.zip(args)
421
- mapping.each do |in_field, out_field|
422
- every(in_field, :aggregator => sum_function(out_field, :type => type), :output => all_fields)
423
- end
424
- end
425
-
426
378
  # Builds a _parse_ pipe. This pipe will parse the fields specified in input (first unnamed arguments),
427
379
  # using a specified regex pattern.
428
380
  #
@@ -437,7 +389,7 @@ module Cascading
437
389
  fields = args || all_fields
438
390
  pattern = options[:pattern]
439
391
  output = options[:output] || all_fields
440
- each(fields, :filter => regex_parser(pattern, options), :output => output)
392
+ each(fields, :function => regex_parser(pattern, options), :output => output)
441
393
  end
442
394
 
443
395
  # Builds a pipe that splits a field into other fields, using a specified regular expression.
@@ -692,14 +644,6 @@ module Cascading
692
644
  pass
693
645
  end
694
646
 
695
- # Builds a pipe that will unify (merge) pipes. The method accepts the list of pipes as argument.
696
- # Tuples unified must share the same fields.
697
- def union(*args)
698
- options = args.extract_options!
699
- pipes = args
700
- union_pipes pipes
701
- end
702
-
703
647
  def join_fields(*args)
704
648
  options = args.extract_options!
705
649
  output = options[:output] || all_fields