cascading.jruby 0.0.10 → 1.0.0

@@ -1,165 +1,18 @@
1
- GNU LESSER GENERAL PUBLIC LICENSE
2
- Version 3, 29 June 2007
1
+ License:
2
+ Project and contact information: http://github.com/mrwalker/cascading.jruby
3
3
 
4
- Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
5
- Everyone is permitted to copy and distribute verbatim copies
6
- of this license document, but changing it is not allowed.
4
+ Licensed under the Apache License, Version 2.0 (the "License");
5
+ you may not use this file except in compliance with the License.
6
+ You may obtain a copy of the License at
7
7
 
8
+ http://www.apache.org/licenses/LICENSE-2.0
8
9
 
9
- This version of the GNU Lesser General Public License incorporates
10
- the terms and conditions of version 3 of the GNU General Public
11
- License, supplemented by the additional permissions listed below.
10
+ Unless required by applicable law or agreed to in writing, software
11
+ distributed under the License is distributed on an "AS IS" BASIS,
12
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ See the License for the specific language governing permissions and
14
+ limitations under the License.
12
15
 
13
- 0. Additional Definitions.
16
+ Third-party Licenses:
14
17
 
15
- As used herein, "this License" refers to version 3 of the GNU Lesser
16
- General Public License, and the "GNU GPL" refers to version 3 of the GNU
17
- General Public License.
18
-
19
- "The Library" refers to a covered work governed by this License,
20
- other than an Application or a Combined Work as defined below.
21
-
22
- An "Application" is any work that makes use of an interface provided
23
- by the Library, but which is not otherwise based on the Library.
24
- Defining a subclass of a class defined by the Library is deemed a mode
25
- of using an interface provided by the Library.
26
-
27
- A "Combined Work" is a work produced by combining or linking an
28
- Application with the Library. The particular version of the Library
29
- with which the Combined Work was made is also called the "Linked
30
- Version".
31
-
32
- The "Minimal Corresponding Source" for a Combined Work means the
33
- Corresponding Source for the Combined Work, excluding any source code
34
- for portions of the Combined Work that, considered in isolation, are
35
- based on the Application, and not on the Linked Version.
36
-
37
- The "Corresponding Application Code" for a Combined Work means the
38
- object code and/or source code for the Application, including any data
39
- and utility programs needed for reproducing the Combined Work from the
40
- Application, but excluding the System Libraries of the Combined Work.
41
-
42
- 1. Exception to Section 3 of the GNU GPL.
43
-
44
- You may convey a covered work under sections 3 and 4 of this License
45
- without being bound by section 3 of the GNU GPL.
46
-
47
- 2. Conveying Modified Versions.
48
-
49
- If you modify a copy of the Library, and, in your modifications, a
50
- facility refers to a function or data to be supplied by an Application
51
- that uses the facility (other than as an argument passed when the
52
- facility is invoked), then you may convey a copy of the modified
53
- version:
54
-
55
- a) under this License, provided that you make a good faith effort to
56
- ensure that, in the event an Application does not supply the
57
- function or data, the facility still operates, and performs
58
- whatever part of its purpose remains meaningful, or
59
-
60
- b) under the GNU GPL, with none of the additional permissions of
61
- this License applicable to that copy.
62
-
63
- 3. Object Code Incorporating Material from Library Header Files.
64
-
65
- The object code form of an Application may incorporate material from
66
- a header file that is part of the Library. You may convey such object
67
- code under terms of your choice, provided that, if the incorporated
68
- material is not limited to numerical parameters, data structure
69
- layouts and accessors, or small macros, inline functions and templates
70
- (ten or fewer lines in length), you do both of the following:
71
-
72
- a) Give prominent notice with each copy of the object code that the
73
- Library is used in it and that the Library and its use are
74
- covered by this License.
75
-
76
- b) Accompany the object code with a copy of the GNU GPL and this license
77
- document.
78
-
79
- 4. Combined Works.
80
-
81
- You may convey a Combined Work under terms of your choice that,
82
- taken together, effectively do not restrict modification of the
83
- portions of the Library contained in the Combined Work and reverse
84
- engineering for debugging such modifications, if you also do each of
85
- the following:
86
-
87
- a) Give prominent notice with each copy of the Combined Work that
88
- the Library is used in it and that the Library and its use are
89
- covered by this License.
90
-
91
- b) Accompany the Combined Work with a copy of the GNU GPL and this license
92
- document.
93
-
94
- c) For a Combined Work that displays copyright notices during
95
- execution, include the copyright notice for the Library among
96
- these notices, as well as a reference directing the user to the
97
- copies of the GNU GPL and this license document.
98
-
99
- d) Do one of the following:
100
-
101
- 0) Convey the Minimal Corresponding Source under the terms of this
102
- License, and the Corresponding Application Code in a form
103
- suitable for, and under terms that permit, the user to
104
- recombine or relink the Application with a modified version of
105
- the Linked Version to produce a modified Combined Work, in the
106
- manner specified by section 6 of the GNU GPL for conveying
107
- Corresponding Source.
108
-
109
- 1) Use a suitable shared library mechanism for linking with the
110
- Library. A suitable mechanism is one that (a) uses at run time
111
- a copy of the Library already present on the user's computer
112
- system, and (b) will operate properly with a modified version
113
- of the Library that is interface-compatible with the Linked
114
- Version.
115
-
116
- e) Provide Installation Information, but only if you would otherwise
117
- be required to provide such information under section 6 of the
118
- GNU GPL, and only to the extent that such information is
119
- necessary to install and execute a modified version of the
120
- Combined Work produced by recombining or relinking the
121
- Application with a modified version of the Linked Version. (If
122
- you use option 4d0, the Installation Information must accompany
123
- the Minimal Corresponding Source and Corresponding Application
124
- Code. If you use option 4d1, you must provide the Installation
125
- Information in the manner specified by section 6 of the GNU GPL
126
- for conveying Corresponding Source.)
127
-
128
- 5. Combined Libraries.
129
-
130
- You may place library facilities that are a work based on the
131
- Library side by side in a single library together with other library
132
- facilities that are not Applications and are not covered by this
133
- License, and convey such a combined library under terms of your
134
- choice, if you do both of the following:
135
-
136
- a) Accompany the combined library with a copy of the same work based
137
- on the Library, uncombined with any other library facilities,
138
- conveyed under the terms of this License.
139
-
140
- b) Give prominent notice with the combined library that part of it
141
- is a work based on the Library, and explaining where to find the
142
- accompanying uncombined form of the same work.
143
-
144
- 6. Revised Versions of the GNU Lesser General Public License.
145
-
146
- The Free Software Foundation may publish revised and/or new versions
147
- of the GNU Lesser General Public License from time to time. Such new
148
- versions will be similar in spirit to the present version, but may
149
- differ in detail to address new problems or concerns.
150
-
151
- Each version is given a distinguishing version number. If the
152
- Library as you received it specifies that a certain numbered version
153
- of the GNU Lesser General Public License "or any later version"
154
- applies to it, you have the option of following the terms and
155
- conditions either of that published version or of any later version
156
- published by the Free Software Foundation. If the Library as you
157
- received it does not specify a version number of the GNU Lesser
158
- General Public License, you may choose any version of the GNU Lesser
159
- General Public License ever published by the Free Software Foundation.
160
-
161
- If the Library as you received it specifies that a proxy can decide
162
- whether future versions of the GNU Lesser General Public License shall
163
- apply, that proxy's public statement of acceptance of any version is
164
- permanent authorization for you to choose that version for the
165
- Library.
18
+ All third-party dependencies are listed in ivy.xml.
@@ -0,0 +1,35 @@
1
+ # Cascading.JRuby [![Build Status](https://secure.travis-ci.org/mrwalker/cascading.jruby.png)](http://travis-ci.org/mrwalker/cascading.jruby)
2
+
3
+ cascading.jruby is a DSL for [Cascading](http://www.cascading.org/), which is a dataflow API written in Java. With cascading.jruby, Ruby programmers can rapidly script efficient MapReduce jobs for Hadoop.
4
+
5
+ To give you a quick idea of what a cascading.jruby job looks like, here's word count:
6
+
7
+ ```ruby
8
+ require 'rubygems'
9
+ require 'cascading'
10
+
11
+ input_path = ARGV.shift || (raise 'input_path required')
12
+
13
+ cascade 'wordcount', :mode => :local do
14
+ flow 'wordcount' do
15
+ source 'input', tap(input_path)
16
+
17
+ assembly 'input' do
18
+ split_rows 'line', /[.,]*\s+/, 'word', :output => 'word'
19
+ group_by 'word' do
20
+ count
21
+ end
22
+ end
23
+
24
+ sink 'input', tap('output/wordcount', :sink_mode => :replace)
25
+ end
26
+ end.complete
27
+ ```
28
+
29
+ cascading.jruby provides a clean Ruby interface to Cascading, but doesn't attempt to add abstractions on top of it. Therefore, you should be acquainted with the [Cascading](http://docs.cascading.org/cascading/2.0/userguide/html/) [API](http://docs.cascading.org/cascading/2.0/javadoc/) before you begin.
30
+
31
+ For operations you can apply to your dataflow within a pipe assembly, see the [Assembly](http://rubydoc.info/gems/cascading.jruby/1.0.0/Cascading/Assembly) class. For operations available within a block passed to a group_by, union, or join, see the [Aggregations](http://rubydoc.info/gems/cascading.jruby/1.0.0/Cascading/Aggregations) class.
32
+
33
+ Note that the Ruby code you write merely constructs a Cascading job, so no JRuby runtime is required on your cluster. This stands in contrast with writing [Hadoop streaming jobs in Ruby](http://www.quora.com/How-do-the-different-options-for-Ruby-on-Hadoop-compare). To run cascading.jruby applications on a Hadoop cluster, you must use [Jading](https://github.com/mrwalker/jading) to package them into a job jar.
34
+
35
+ cascading.jruby has been tested on JRuby versions 1.2.0, 1.4.0, 1.5.3, 1.6.5, 1.6.7.2, 1.7.0, and 1.7.3.
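To make the word count dataflow in the README concrete, here is a plain-Ruby approximation of what that job computes. It is only a sketch: it uses no Cascading classes, `word_count` and `WORD_DELIMITER` are names invented for this illustration, and skipping empty tokens is a choice made here rather than documented behavior of `split_rows`.

```ruby
# Plain-Ruby approximation of the README word count; not the cascading.jruby API.
# The pattern is the one passed to split_rows in the README example.
WORD_DELIMITER = /[.,]*\s+/

# Splits each line into words and counts occurrences of each word,
# mirroring split_rows followed by a group_by 'word' block containing count.
def word_count(lines)
  counts = Hash.new(0)
  lines.each do |line|
    line.split(WORD_DELIMITER).each do |word|
      next if word.empty? # assumption of this sketch: ignore empty tokens
      counts[word] += 1
    end
  end
  counts
end

counts = word_count(['the quick fox, the lazy dog', 'the end'])
counts.sort.each { |word, count| puts "#{word}\t#{count}" }
```

The real job differs in that the Ruby code only describes the pipeline; Cascading does the splitting, grouping, and counting when the cascade runs.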
@@ -2,59 +2,26 @@ require 'java'
2
2
 
3
3
  module Cascading
4
4
  # :stopdoc:
5
- VERSION = '0.0.10'
6
- LIBPATH = ::File.expand_path(::File.dirname(__FILE__)) + ::File::SEPARATOR
7
- PATH = ::File.dirname(LIBPATH) + ::File::SEPARATOR
8
- CASCADING_HOME = ENV['CASCADING_HOME']
9
- HADOOP_HOME = ENV['HADOOP_HOME']
10
-
11
- # :startdoc:
12
-
13
- # Returns the version string for the library.
14
- #
15
- def self.version
16
- VERSION
17
- end
18
-
19
- # Returns the library path for the module. If any arguments are given,
20
- # they will be joined to the end of the library path using
21
- # <tt>File.join</tt>.
22
- #
23
- def self.libpath( *args )
24
- args.empty? ? LIBPATH : ::File.join(LIBPATH, args.flatten)
25
- end
26
-
27
- # Returns the path for the module. If any arguments are given,
28
- # they will be joined to the end of the path using
29
- # <tt>File.join</tt>.
30
- #
31
- def self.path( *args )
32
- args.empty? ? PATH : ::File.join(PATH, args.flatten)
33
- end
34
-
35
- def self.require_all_jars(from = ::File.join(::File.dirname(__FILE__), "..", "jars"))
36
- search_me = ::File.expand_path(
37
- ::File.join(from, '**', '*.jar'))
38
- Dir.glob(search_me).sort.each do |jar|
39
- require jar
40
- end
41
- end
5
+ VERSION = '1.0.0'
42
6
  end
43
7
 
44
- Cascading.require_all_jars(Cascading::HADOOP_HOME) if Cascading::HADOOP_HOME
45
- Cascading.require_all_jars(Cascading::CASCADING_HOME) if Cascading::CASCADING_HOME
46
-
8
+ require 'cascading/aggregations'
47
9
  require 'cascading/assembly'
48
10
  require 'cascading/base'
49
11
  require 'cascading/cascade'
50
12
  require 'cascading/cascading'
51
13
  require 'cascading/cascading_exception'
52
14
  require 'cascading/expr_stub'
15
+ require 'cascading/filter_operations'
53
16
  require 'cascading/flow'
17
+ require 'cascading/identity_operations'
54
18
  require 'cascading/mode'
55
19
  require 'cascading/operations'
20
+ require 'cascading/regex_operations'
56
21
  require 'cascading/scope'
22
+ require 'cascading/sub_assembly'
57
23
  require 'cascading/tap'
24
+ require 'cascading/text_operations'
58
25
 
59
- # include module to make them available at top package
26
+ # include module to make it available at top level
60
27
  include Cascading
@@ -1,28 +1,39 @@
1
- require 'cascading/operations'
2
1
  require 'cascading/scope'
3
2
  require 'cascading/ext/array'
4
3
 
5
4
  module Cascading
5
+ # Aggregations is the context available to you within the block of a group_by,
6
+ # union, or join that allows you to apply Every pipes to the result of those
7
+ # operations. You may apply aggregators and buffers within this context
8
+ # subject to several rules laid out by Cascading.
9
+ #
6
10
  # Rules enforced by Aggregations:
7
11
  # * Contains either 1 Buffer or >= 1 Aggregator (explicitly checked)
8
- # * No GroupBys, CoGroups, Joins, or Merges (methods for these pipes do not
9
- # exist on Aggregations)
12
+ # * No GroupBys, CoGroups, Joins, or Merges (methods for these pipes do not exist on Aggregations)
10
13
  # * No Eaches (Aggregations#each does not exist)
11
14
  # * Aggregations may not branch (Aggregations#branch does not exist)
12
15
  #
13
16
  # Externally enforced rules:
14
17
  # * May be empty (in which case, Aggregations is not instantiated)
15
- # * Must follow a GroupBy or CoGroup (not a Join or Merge)
18
+ # * Must follow a GroupBy or CoGroup (not a HashJoin or Merge)
16
19
  #
17
20
  # Optimizations:
18
- # * If the leading Group is a GroupBy and all subsequent Everies are
19
- # Aggregators that have a corresponding AggregateBy, Aggregations can replace
20
- # the GroupBy/Aggregator pipe with a single composite AggregateBy
21
+ # * If the leading Group is a GroupBy and all subsequent Everies are Aggregators that have a corresponding AggregateBy, Aggregations can replace the GroupBy/Aggregator pipe with a single composite AggregateBy
22
+ #
23
+ # Aggregator and buffer DSL standard optional parameter names:
24
+ # [input] c.p.Every argument selector
25
+ # [into] c.o.Operation field declaration
26
+ # [output] c.p.Every output selector
21
27
  class Aggregations
22
- include Operations
23
-
24
28
  attr_reader :assembly, :tail_pipe, :scope, :aggregate_bys
25
29
 
30
+ # Do not use this constructor directly; instead, pass a block containing
31
+ # the desired aggregations to a group_by, union, or join and it will be
32
+ # instantiated for you.
33
+ #
34
+ # Builds the context in which a sequence of Every aggregations may be
35
+ # evaluated in the given assembly appended to the given group pipe and with
36
+ # the given incoming_scopes.
26
37
  def initialize(assembly, group, incoming_scopes)
27
38
  @assembly = assembly
28
39
  @tail_pipe = group
@@ -32,23 +43,14 @@ module Cascading
32
43
  @aggregate_bys = tail_pipe.is_group_by ? [] : nil
33
44
  end
34
45
 
46
+ # Prints information about the scope of these Aggregations at the point at
47
+ # which it is called. This allows you to trace the propagation of field
48
+ # names through your job and is handy for debugging. See Scope for
49
+ # details.
35
50
  def debug_scope
36
51
  puts "Current scope of aggregations for '#{assembly.name}':\n #{scope}\n----------\n"
37
52
  end
38
53
 
39
- def make_pipe(type, parameters)
40
- pipe = type.new(*parameters)
41
-
42
- # Enforce 1 Buffer or >= 1 Aggregator rule
43
- if tail_pipe.kind_of?(Java::CascadingPipe::Every)
44
- raise 'Buffer must be sole aggregation' if tail_pipe.buffer? || (tail_pipe.aggregator? && pipe.buffer?)
45
- end
46
-
47
- @tail_pipe = pipe
48
- @scope = Scope.outgoing_scope(tail_pipe, [scope])
49
- end
50
- private :make_pipe
51
-
52
54
  # We can replace these aggregations with the corresponding composite
53
55
  # AggregateBy if the leading Group was a GroupBy and all subsequent
54
56
  # Aggregators had a corresponding AggregateBy (which we've encoded in the
@@ -69,13 +71,27 @@ module Cascading
69
71
 
70
72
  # Builds an every pipe and adds it to the current list of aggregations.
71
73
  # Note that this list may be either exactly 1 Buffer or any number of
72
- # Aggregators.
73
- def every(*args)
74
- options = args.extract_options!
75
-
76
- in_fields = fields(args)
74
+ # Aggregators. Exactly one of :aggregator or :buffer must be specified and
75
+ # :aggregator may be accompanied by a corresponding :aggregate_by.
76
+ #
77
+ # The named options are:
78
+ # [aggregator] A Cascading Aggregator, mutually exclusive with :buffer.
79
+ # [aggregate_by] A Cascading AggregateBy that corresponds to the given
80
+ # :aggregator. Only makes sense with the :aggregator option
81
+ # and does not exist for all Aggregators. Providing nothing
82
+ # or nil will cause all Aggregations to operate as normal,
83
+ # without being compiled into a composite AggregateBy.
84
+ # [buffer] A Cascading Buffer, mutually exclusive with :aggregator.
85
+ # [output] c.p.Every output selector.
86
+ #
87
+ # Example:
88
+ # every 'field1', 'field2', :aggregator => sum_aggregator, :aggregate_by => sum_by, :output => all_fields
89
+ # every fields(input_fields), :buffer => Java::SomePackage::SomeBuffer.new, :output => all_fields
90
+ def every(*args_with_options)
91
+ options, in_fields = args_with_options.extract_options!, fields(args_with_options)
77
92
  out_fields = fields(options[:output])
78
93
  operation = options[:aggregator] || options[:buffer]
94
+ raise 'every requires either :aggregator or :buffer' unless operation
79
95
 
80
96
  if options[:aggregate_by] && aggregate_bys
81
97
  aggregate_bys << options[:aggregate_by]
@@ -84,71 +100,152 @@ module Cascading
84
100
  end
85
101
 
86
102
  parameters = [tail_pipe, in_fields, operation, out_fields].compact
87
- make_pipe(Java::CascadingPipe::Every, parameters)
88
- end
103
+ every = make_pipe(Java::CascadingPipe::Every, parameters)
104
+ raise ':aggregator specified but c.o.Buffer provided' if options[:aggregator] && every.is_buffer
105
+ raise ':buffer specified but c.o.Aggregator provided' if options[:buffer] && every.is_aggregator
89
106
 
90
- def assert_group(*args)
91
- options = args.extract_options!
107
+ every
108
+ end
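The first line of `every` relies on `extract_options!` removing the trailing options hash from the argument array before `fields` reads it. Below is a minimal plain-Ruby sketch of that pattern, assuming `extract_options!` behaves like the ActiveSupport method of the same name (its definition, presumably supplied by `cascading/ext/array`, is not shown in this diff). Both `extract_options!` as written here and `split_args_and_options` are hypothetical stand-ins.

```ruby
# Hypothetical stand-in for Array#extract_options!: pops a trailing Hash, if any.
def extract_options!(args)
  args.last.is_a?(Hash) ? args.pop : {}
end

# Mirrors `options, in_fields = args.extract_options!, fields(args)`:
# Ruby evaluates the right-hand side left to right, so the options hash is
# already removed by the time the positional arguments are read.
def split_args_and_options(*args_with_options)
  options, in_fields = extract_options!(args_with_options), args_with_options.dup
  [in_fields, options]
end

in_fields, options = split_args_and_options('field1', 'field2', :output => 'sum')
puts in_fields.inspect
puts options.inspect
```

Because the one-line form depends on evaluation order, swapping the two right-hand expressions would leave the options hash among the field names.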
92
109
 
93
- assertion = args[0]
110
+ # Builds an every assertion pipe given a c.o.a.Assertion and adds it to the
111
+ # current list of aggregations. Note this breaks a chain of AggregateBys.
112
+ #
113
+ # The named options are:
114
+ # [level] The assertion level; defaults to strict.
115
+ def assert_group(assertion, options = {})
94
116
  assertion_level = options[:level] || Java::CascadingOperation::AssertionLevel::STRICT
95
117
 
96
118
  parameters = [tail_pipe, assertion_level, assertion]
97
119
  make_pipe(Java::CascadingPipe::Every, parameters)
98
120
  end
99
121
 
100
- def assert_group_size_equals(*args)
101
- options = args.extract_options!
102
-
103
- assertion = Java::CascadingOperationAssertion::AssertGroupSizeEquals.new(args[0])
122
+ # Builds a pipe that asserts the size of the current group is the specified
123
+ # size for all groups.
124
+ def assert_group_size_equals(size, options = {})
125
+ assertion = Java::CascadingOperationAssertion::AssertGroupSizeEquals.new(size)
104
126
  assert_group(assertion, options)
105
127
  end
106
128
 
107
- # Builds a series of every pipes for aggregation.
129
+ # Computes the minima of the specified fields within each group. Fields
130
+ # may be a list or a map for renaming. Note that fields are sorted by
131
+ # input name when a map is provided.
108
132
  #
109
- # Args can either be a list of fields to aggregate and an options hash or
110
- # a hash that maps input field name to output field name (similar to
111
- # insert) and an options hash.
133
+ # The named options are:
134
+ # [ignore] Java Array of Objects of values to be ignored.
112
135
  #
113
- # Options include:
114
- # * <tt>:ignore</tt> a Java Array of Objects (for min and max) or Tuples
115
- # (for first and last) of values for the aggregator to ignore
116
- # * <tt>function</tt> is a symbol that is the method to call to construct
117
- # the Cascading Aggregator.
118
- def composite_aggregator(args, function)
119
- field_map, options = extract_field_map(args)
136
+ # Examples:
137
+ # assembly 'aggregate' do
138
+ # ...
139
+ # insert 'const' => 1
140
+ # group_by 'const' do
141
+ # min 'field1', 'field2'
142
+ # min 'field3' => 'fieldA', 'field4' => 'fieldB'
143
+ # end
144
+ # discard 'const'
145
+ # end
146
+ def min(*args_with_options)
147
+ composite_aggregator(args_with_options, Java::CascadingOperationAggregator::Min)
148
+ end
120
149
 
121
- field_map.each do |in_field, out_field|
122
- agg = self.send(function, out_field, options)
123
- every(in_field, :aggregator => agg, :output => all_fields)
124
- end
125
- raise "Composite aggregator '#{function.to_s.gsub('_function', '')}' invoked on 0 fields" if field_map.empty?
150
+ # Computes the maxima of the specified fields within each group. Fields
151
+ # may be a list or a map for renaming. Note that fields are sorted by
152
+ # input name when a map is provided.
153
+ #
154
+ # The named options are:
155
+ # [ignore] Java Array of Objects of values to be ignored.
156
+ #
157
+ # Examples:
158
+ # assembly 'aggregate' do
159
+ # ...
160
+ # insert 'const' => 1
161
+ # group_by 'const' do
162
+ # max 'field1', 'field2'
163
+ # max 'field3' => 'fieldA', 'field4' => 'fieldB'
164
+ # end
165
+ # discard 'const'
166
+ # end
167
+ def max(*args_with_options)
168
+ composite_aggregator(args_with_options, Java::CascadingOperationAggregator::Max)
169
+ end
170
+
171
+ # Returns the first value within each group for the specified fields.
172
+ # Fields may be a list or a map for renaming. Note that fields are sorted
173
+ # by input name when a map is provided.
174
+ #
175
+ # The named options are:
176
+ # [ignore] Java Array of Tuples which should be ignored
177
+ #
178
+ # Examples:
179
+ # assembly 'aggregate' do
180
+ # ...
181
+ # group_by 'key1', 'key2' do
182
+ # first 'field1', 'field2'
183
+ # first 'field3' => 'fieldA', 'field4' => 'fieldB'
184
+ # end
185
+ # end
186
+ def first(*args_with_options)
187
+ composite_aggregator(args_with_options, Java::CascadingOperationAggregator::First)
126
188
  end
127
189
 
128
- def min(*args); composite_aggregator(args, :min_function); end
129
- def max(*args); composite_aggregator(args, :max_function); end
130
- def first(*args); composite_aggregator(args, :first_function); end
131
- def last(*args); composite_aggregator(args, :last_function); end
190
+ # Returns the last value within each group for the specified fields.
191
+ # Fields may be a list or a map for renaming. Note that fields are sorted
192
+ # by input name when a map is provided.
193
+ #
194
+ # The named options are:
195
+ # [ignore] Java Array of Tuples which should be ignored
196
+ #
197
+ # Examples:
198
+ # assembly 'aggregate' do
199
+ # ...
200
+ # group_by 'key1', 'key2' do
201
+ # last 'field1', 'field2'
202
+ # last 'field3' => 'fieldA', 'field4' => 'fieldB'
203
+ # end
204
+ # end
205
+ def last(*args_with_options)
206
+ composite_aggregator(args_with_options, Java::CascadingOperationAggregator::Last)
207
+ end
132
208
 
133
- # Counts elements of a group. May optionally specify the name of the
134
- # output count field (defaults to 'count').
209
+ # Counts elements of each group. May optionally specify the name of the
210
+ # output count field, which defaults to 'count'.
211
+ #
212
+ # Examples:
213
+ # assembly 'aggregate' do
214
+ # ...
215
+ # group_by 'key1', 'key2' do
216
+ # count
217
+ # count 'key1_key2_count'
218
+ # end
219
+ # end
135
220
  def count(name = 'count')
136
221
  count_aggregator = Java::CascadingOperationAggregator::Count.new(fields(name))
137
222
  count_by = Java::CascadingPipeAssembly::CountBy.new(fields(name))
138
223
  every(last_grouping_fields, :aggregator => count_aggregator, :output => all_fields, :aggregate_by => count_by)
139
224
  end
140
225
 
141
- # Sums one or more fields. Fields to be summed may either be provided as
142
- # the arguments to sum (in which case they will be aggregated into a field
143
- # of the same name in the given order), or via a hash using the :mapping
144
- # parameter (in which case they will be aggregated from the field named by
145
- # the key into the field named by the value after being sorted). The type
146
- # of the output sum may be controlled with the :type parameter.
147
- def sum(*args)
148
- options = args.extract_options!
226
+ # Sums the specified fields within each group. Fields may be a list or
227
+ # provided through the :mapping option for renaming. Note that fields are
228
+ # sorted by name when a map is provided.
229
+ #
230
+ # The named options are:
231
+ # [mapping] Map of input to output field names if renaming is desired.
232
+ # Results in output fields sorted by input field.
233
+ # [type] Controls the type of the output, specified using values from the
234
+ # Cascading::JAVA_TYPE_MAP as in Janino expressions (:double, :long, etc.)
235
+ #
236
+ # Examples:
237
+ # assembly 'aggregate' do
238
+ # ...
239
+ # group_by 'key1', 'key2' do
240
+ # sum 'field1', 'field2', :type => :long
241
+ # sum :mapping => { 'field3' => 'fieldA', 'field4' => 'fieldB' }, :type => :double
242
+ # end
243
+ # end
244
+ def sum(*args_with_options)
245
+ options, in_fields = args_with_options.extract_options!, args_with_options
149
246
  type = JAVA_TYPE_MAP[options[:type]]
150
247
 
151
- mapping = options[:mapping] ? options[:mapping].sort : args.zip(args)
248
+ mapping = options[:mapping] ? options[:mapping].sort : in_fields.zip(in_fields)
152
249
  mapping.each do |in_field, out_field|
153
250
  sum_aggregator = Java::CascadingOperationAggregator::Sum.new(*[fields(out_field), type].compact)
154
251
  # NOTE: SumBy requires a type in wip-286, unlike Sum (see Sum.java line 42 for default)
@@ -158,10 +255,22 @@ module Cascading
158
255
  raise "sum invoked on 0 fields (note :mapping must be provided to explicitly rename fields)" if mapping.empty?
159
256
  end
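How `sum` decides which input field feeds which output field can be sketched in plain Ruby. `resolve_sum_mapping` is a hypothetical helper that isolates the mapping line and the empty check from the method; it builds no Cascading pipes, and it raises `ArgumentError` only for clarity (the original raises with a plain message string).

```ruby
# Hypothetical helper isolating the field-mapping logic of sum.
# Returns an array of [input_field, output_field] pairs.
def resolve_sum_mapping(in_fields, options = {})
  # With :mapping, pairs are sorted by input field name; otherwise each
  # field is summed into a field of the same name, in the given order.
  mapping = options[:mapping] ? options[:mapping].sort : in_fields.zip(in_fields)
  raise ArgumentError, 'sum invoked on 0 fields' if mapping.empty?
  mapping
end

p resolve_sum_mapping(['field2', 'field1'])
p resolve_sum_mapping([], :mapping => { 'field4' => 'fieldB', 'field3' => 'fieldA' })
```

Note the asymmetry: a plain list keeps the caller's order, while a `:mapping` hash is reordered by input field name.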
160
257
 
161
- # Averages one or more fields. The contract of average is identical to
162
- # that of other composite aggregators, but it accepts no options.
163
- def average(*args)
164
- field_map, _ = extract_field_map(args)
258
+ # Averages the specified fields within each group. Fields may be a list or
259
+ # a map for renaming. Note that fields are sorted by input name when a map
260
+ # is provided.
261
+ #
262
+ # Examples:
263
+ # assembly 'aggregate' do
264
+ # ...
265
+ # insert 'const' => 1
266
+ # group_by 'const' do
267
+ # average 'field1', 'field2'
268
+ # average 'field3' => 'fieldA', 'field4' => 'fieldB'
269
+ # end
270
+ # discard 'const'
271
+ # end
272
+ def average(*fields_or_field_map)
273
+ field_map, _ = extract_field_map(fields_or_field_map)
165
274
 
166
275
  field_map.each do |in_field, out_field|
167
276
  average_aggregator = Java::CascadingOperationAggregator::Average.new(fields(out_field))
@@ -173,6 +282,42 @@ module Cascading
173
282
 
174
283
  private
175
284
 
285
+ def make_pipe(type, parameters)
286
+ pipe = type.new(*parameters)
287
+
288
+ # Enforce 1 Buffer or >= 1 Aggregator rule
289
+ if tail_pipe.kind_of?(Java::CascadingPipe::Every)
290
+ raise 'Buffer must be sole aggregation' if tail_pipe.buffer? || (tail_pipe.aggregator? && pipe.buffer?)
291
+ end
292
+
293
+ @tail_pipe = pipe
294
+ @scope = Scope.outgoing_scope(tail_pipe, [scope])
295
+
296
+ tail_pipe
297
+ end
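The "Buffer must be sole aggregation" check in `make_pipe` can be modeled without Cascading. In this sketch, `AggregationChain` and the `:aggregator` / `:buffer` symbols are stand-ins invented for illustration; the real method inspects `c.p.Every` pipes.

```ruby
# Hypothetical model of the "1 Buffer or >= 1 Aggregator" rule.
class AggregationChain
  attr_reader :kinds

  def initialize
    @kinds = []
  end

  # kind is :aggregator or :buffer, standing in for an Every pipe's operation.
  def add(kind)
    raise ArgumentError, "unknown kind #{kind.inspect}" unless [:aggregator, :buffer].include?(kind)

    tail = kinds.last
    # Same condition as make_pipe: nothing may follow a Buffer, and a Buffer
    # may not follow an Aggregator.
    raise 'Buffer must be sole aggregation' if tail == :buffer || (tail == :aggregator && kind == :buffer)

    kinds << kind
    self
  end
end

chain = AggregationChain.new.add(:aggregator).add(:aggregator)
puts chain.kinds.inspect
```

Any number of aggregators may be chained, but a buffer is accepted only as the first and only entry.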
298
+
299
+ # Builds a series of every pipes for aggregation.
300
+ #
301
+ # Args can either be a list of fields to aggregate and an options hash or
302
+ # a hash that maps input field name to output field name (similar to
303
+ # insert) and an options hash.
304
+ #
305
+ # The named options are:
306
+ # [ignore] Java Array of Objects (for min and max) or Tuples (for first and
307
+ # last) of values for the aggregator to ignore.
308
+ def composite_aggregator(args, aggregator)
309
+ field_map, options = extract_field_map(args)
310
+
311
+ field_map.each do |in_field, out_field|
312
+ every(
313
+ in_field,
314
+ :aggregator => aggregator.new(*[fields(out_field), options[:ignore]].compact),
315
+ :output => all_fields
316
+ )
317
+ end
318
+ raise "Composite aggregator '#{aggregator}' invoked on 0 fields" if field_map.empty?
319
+ end
320
+
176
321
  # Extracts a field mapping, input field => output field, by accepting a
177
322
  # hash in the first argument. If no hash is provided, then maps arguments
178
323
  # onto themselves which names outputs the same as inputs. Additionally