cascading.jruby 0.0.8 → 0.0.9

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/HACKING.md CHANGED
@@ -4,7 +4,7 @@ Some hacking info on `cascading.jruby`:
4
4
 
5
5
  For local development, install with (requires [bundler](http://gembundler.com/)):
6
6
 
7
- bundle install
7
+ jruby -S bundle install
8
8
 
9
9
  To run the tests (will download Cascading and Hadoop jars):
10
10
 
data/History.txt CHANGED
@@ -1,3 +1,9 @@
1
+ 0.0.9 - Cascading local mode and upgrade to Cascading 2.0.0
2
+
3
+ This release upgrades to Cascading 2.0.0 (final) and introduces Cascading local
4
+ mode. Ambiguous node names are now prohibited and the insert helper correctly
5
+ supports constants.
6
+
1
7
  0.0.8 - AggregateBy and upgrade to Cascading 2.0.0 wip-286
2
8
 
3
9
  This release upgrades to Cascading 2.0.0 wip-286, but again does not implement
data/README.md CHANGED
@@ -2,7 +2,7 @@
2
2
 
3
3
  `cascading.jruby` is a small DSL above [Cascading](http://www.cascading.org/).
4
4
 
5
- It requires Hadoop (>= 0.20.2) and [Cascading 2.0.0-wip-286](http://files.concurrentinc.com/cascading/2.0/cascading-2.0.0-wip-286-hadoop-0.20.2%2B.tgz) to be set via the environment variables: `HADOOP_HOME` and `CASCADING_HOME`
5
+ It requires Hadoop (>= 0.20.2) and [Cascading 2.0.0](http://files.cascading.org/cascading/2.0/cascading-2.0.0.tgz) to be set via the environment variables: `HADOOP_HOME` and `CASCADING_HOME`
6
6
 
7
7
  It has been tested on JRuby versions 1.2.0, 1.4.0, 1.5.3, and 1.6.5.
8
8
 
data/ivy.xml CHANGED
@@ -17,9 +17,9 @@
17
17
  </configurations>
18
18
 
19
19
  <dependencies>
20
- <dependency org="cascading" name="cascading-core" rev="2.0.0-wip-286" conf="default" />
21
- <dependency org="cascading" name="cascading-local" rev="2.0.0-wip-286" conf="default" />
22
- <dependency org="cascading" name="cascading-hadoop" rev="2.0.0-wip-286" conf="default" />
20
+ <dependency org="cascading" name="cascading-core" rev="2.0.0" conf="default" />
21
+ <dependency org="cascading" name="cascading-local" rev="2.0.0" conf="default" />
22
+ <dependency org="cascading" name="cascading-hadoop" rev="2.0.0" conf="default" />
23
23
  <dependency org="org.jruby" name="jruby" rev="1.6.5" conf="default" />
24
24
  </dependencies>
25
25
  </ivy-module>
@@ -4,19 +4,20 @@ require 'cascading/ext/array'
4
4
 
5
5
  module Cascading
6
6
  # Rules enforced by Aggregations:
7
- # Contains either 1 Buffer or >= 1 Aggregator (explicitly checked)
8
- # No GroupBys, CoGroups, Joins, or Merges (methods for these pipes do not exist on Aggregations)
9
- # No Eaches (Aggregations#each does not exist)
10
- # Aggregations may not branch (Aggregations#branch does not exist)
7
+ # * Contains either 1 Buffer or >= 1 Aggregator (explicitly checked)
8
+ # * No GroupBys, CoGroups, Joins, or Merges (methods for these pipes do not
9
+ # exist on Aggregations)
10
+ # * No Eaches (Aggregations#each does not exist)
11
+ # * Aggregations may not branch (Aggregations#branch does not exist)
11
12
  #
12
13
  # Externally enforced rules:
13
- # May be empty (in which case, Aggregations is not instantiated)
14
- # Must follow a GroupBy or CoGroup (not a Join or Merge)
14
+ # * May be empty (in which case, Aggregations is not instantiated)
15
+ # * Must follow a GroupBy or CoGroup (not a Join or Merge)
15
16
  #
16
17
  # Optimizations:
17
- # If the leading Group is a GroupBy and all subsequent Everies are
18
- # Aggregators that have a corresponding AggregateBy, Aggregations can
19
- # replace the GroupBy/Aggregator pipe with a single composite AggregateBy.
18
+ # * If the leading Group is a GroupBy and all subsequent Everies are
19
+ # Aggregators that have a corresponding AggregateBy, Aggregations can replace
20
+ # the GroupBy/Aggregator pipe with a single composite AggregateBy
20
21
  class Aggregations
21
22
  include Operations
22
23
 
@@ -110,10 +111,10 @@ module Cascading
110
111
  # insert) and an options hash.
111
112
  #
112
113
  # Options include:
113
- # * <tt>:ignore</tt> a Java Array of Objects (for min and max) or Tuples
114
- # (for first and last) of values for the aggregator to ignore
115
- #
116
- # <tt>function</tt> is a symbol that is the method to call to construct the Cascading Aggregator.
114
+ # * <tt>:ignore</tt> a Java Array of Objects (for min and max) or Tuples
115
+ # (for first and last) of values for the aggregator to ignore
116
+ # * <tt>function</tt> is a symbol that is the method to call to construct
117
+ # the Cascading Aggregator.
117
118
  def composite_aggregator(args, function)
118
119
  field_map, options = extract_field_map(args)
119
120
 
@@ -14,9 +14,12 @@ module Cascading
14
14
  @last_child = nil
15
15
  end
16
16
 
17
+ # Children must be uniquely named within the scope of each Node. This
18
+ # ensures, for example, two assemblies are not created within the same flow
19
+ # with the same name, causing joins, unions, and sinks on them to be
20
+ # ambiguous.
17
21
  def add_child(node)
18
- child = root.find_child(node.name)
19
- warn "WARNING: adding '#{node.qualified_name}', but node named '#{node.name}' already exists at '#{child.qualified_name}'" if child
22
+ raise AmbiguousNodeNameException.new("Attempted to add '#{node.qualified_name}', but node named '#{node.name}' already exists") if @children[node.name]
20
23
 
21
24
  @children[node.name] = node
22
25
  @child_names << node.name
@@ -33,21 +36,36 @@ module Cascading
33
36
  end
34
37
  alias desc describe
35
38
 
39
+ # In order to find a child, we require it to be uniquely named within this
40
+ # Node and its children. This ensures, for example, branches in peer
41
+ # assemblies or branches and assemblies do not conflict in joins, unions,
42
+ # and sinks.
36
43
  def find_child(name)
37
- children.each do |child_name, child|
38
- return child if child_name == name
39
- result = child.find_child(name)
40
- return result if result
41
- end
42
- nil
44
+ all_children_with_name = find_all_children_with_name(name)
45
+ qualified_names = all_children_with_name.map{ |child| child.qualified_name }
46
+ raise AmbiguousNodeNameException.new("Ambiguous lookup of child by name '#{name}'; found '#{qualified_names.join("', '")}'") if all_children_with_name.size > 1
47
+
48
+ all_children_with_name.first
43
49
  end
44
50
 
45
51
  def root
46
52
  return self unless parent
47
53
  parent.root
48
54
  end
55
+
56
+ protected
57
+
58
+ def find_all_children_with_name(name)
59
+ child_names.map do |child_name|
60
+ children[child_name] if child_name == name
61
+ end.compact + child_names.map do |child_name|
62
+ children[child_name].find_all_children_with_name(name)
63
+ end.flatten
64
+ end
49
65
  end
50
66
 
67
+ class AmbiguousNodeNameException < StandardError; end
68
+
51
69
  # A module to add auto-registration capability
52
70
  module Registerable
53
71
  def all
@@ -69,7 +87,7 @@ module Cascading
69
87
 
70
88
  def add(name, instance)
71
89
  @registered ||= {}
72
- warn "WARNING: node named '#{name}' already registered in #{self}" if @registered[name]
90
+ warn "WARNING: Node named '#{name}' already registered in #{self}" if @registered[name]
73
91
  @registered[name] = instance
74
92
  end
75
93
 
@@ -9,14 +9,23 @@ module Cascading
9
9
  class Cascade < Cascading::Node
10
10
  extend Registerable
11
11
 
12
- def initialize(name)
12
+ attr_reader :mode
13
+
14
+ # Builds a cascade given the specified name. Optionally accepts a :mode
15
+ # which will be used as the default mode for all child flows. See
16
+ # Cascading::Mode.parse for details.
17
+ def initialize(name, params = {})
18
+ @mode = params[:mode]
13
19
  super(name, nil) # A Cascade cannot have a parent
14
20
  self.class.add(name, self)
15
21
  end
16
22
 
17
- def flow(name, &block)
23
+ # Builds a child flow given a name and block. Optionally accepts a :mode,
24
+ # which will override the default mode stored in this cascade.
25
+ def flow(name, params = {}, &block)
18
26
  raise "Could not build flow '#{name}'; block required" unless block_given?
19
- flow = Flow.new(name, self)
27
+ params[:mode] ||= mode
28
+ flow = Flow.new(name, self, params)
20
29
  add_child(flow)
21
30
  flow.instance_eval(&block)
22
31
  flow
@@ -12,17 +12,21 @@ module Cascading
12
12
  :float => java.lang.Float.java_class, :string => java.lang.String.java_class,
13
13
  }
14
14
 
15
- def cascade(name, &block)
15
+ # Builds a top-level cascade given a name and a block. Optionally accepts a
16
+ # :mode, as explained in Cascading::Cascade#initialize.
17
+ def cascade(name, params = {}, &block)
16
18
  raise "Could not build cascade '#{name}'; block required" unless block_given?
17
- cascade = Cascade.new(name)
19
+ cascade = Cascade.new(name, params)
18
20
  cascade.instance_eval(&block)
19
21
  cascade
20
22
  end
21
23
 
22
- # For applications built of Flows with no Cascades
23
- def flow(name, &block)
24
+ # Builds a top-level flow given a name and block for applications built of
25
+ # flows with no cascades. Optionally accepts a :mode, as explained in
26
+ # Cascading::Flow#initialize.
27
+ def flow(name, params = {}, &block)
24
28
  raise "Could not build flow '#{name}'; block required" unless block_given?
25
- flow = Flow.new(name, nil)
29
+ flow = Flow.new(name, nil, params)
26
30
  flow.instance_eval(&block)
27
31
  flow
28
32
  end
@@ -91,7 +95,9 @@ module Cascading
91
95
  Java::CascadingTuple::Fields::RESULTS
92
96
  end
93
97
 
94
- # Creates a c.s.h.TextLine scheme. Positional args are used if <tt>:source_fields</tt> is not provided.
98
+ # Creates a TextLine scheme (can be used in both Cascading local and hadoop
99
+ # modes). Positional args are used if <tt>:source_fields</tt> is not
100
+ # provided.
95
101
  #
96
102
  # The named options are:
97
103
  # * <tt>:source_fields</tt> a string or array of strings. Specifies the
@@ -100,7 +106,7 @@ module Cascading
100
106
  # to be written to a sink with this scheme. Defaults to all_fields.
101
107
  # * <tt>:compression</tt> a symbol, either <tt>:enable</tt> or
102
108
  # <tt>:disable</tt>, that governs the TextLine scheme's compression. Defaults
103
- # to the default TextLine compression.
109
+ # to the default TextLine compression (only applies to c.s.h.TextLine).
104
110
  def text_line_scheme(*args)
105
111
  options = args.extract_options!
106
112
  source_fields = fields(options[:source_fields] || (args.empty? ? ['offset', 'line'] : args))
@@ -111,55 +117,40 @@ module Cascading
111
117
  else Java::CascadingSchemeHadoop::TextLine::Compress::DEFAULT
112
118
  end
113
119
 
114
- Java::CascadingSchemeHadoop::TextLine.new(source_fields, sink_fields, sink_compression)
120
+ {
121
+ :local_scheme => Java::CascadingSchemeLocal::TextLine.new(source_fields, sink_fields),
122
+ :hadoop_scheme => Java::CascadingSchemeHadoop::TextLine.new(source_fields, sink_fields, sink_compression),
123
+ }
115
124
  end
116
125
 
117
- # Creates a c.s.h.SequenceFile scheme instance from the specified fields.
126
+ # Creates a c.s.h.SequenceFile scheme instance from the specified fields. A
127
+ # local SequenceFile scheme is not provided by Cascading, so this scheme
128
+ # cannot be used in Cascading local mode.
118
129
  def sequence_file_scheme(*fields)
119
- unless fields.empty?
120
- fields = fields(fields)
121
- return Java::CascadingSchemeHadoop::SequenceFile.new(fields)
122
- else
123
- return Java::CascadingSchemeHadoop::SequenceFile.new(all_fields)
124
- end
130
+ {
131
+ :local_scheme => nil,
132
+ :hadoop_scheme => Java::CascadingSchemeHadoop::SequenceFile.new(fields.empty? ? all_fields : fields(fields)),
133
+ }
125
134
  end
126
135
 
127
136
  def multi_source_tap(*taps)
128
- Java::CascadingTap::MultiSourceTap.new(taps.to_java('cascading.tap.Tap'))
137
+ MultiTap.multi_source_tap(taps)
129
138
  end
130
139
 
131
140
  def multi_sink_tap(*taps)
132
- Java::CascadingTap::MultiSinkTap.new(taps.to_java('cascading.tap.Tap'))
133
- end
134
-
135
- # Generic method for creating taps.
136
- # It expects a ":kind" argument pointing to the type of tap to create.
137
- def tap(*args)
138
- opts = args.extract_options!
139
- path = args.empty? ? opts[:path] : args[0]
140
- scheme = opts[:scheme] || text_line_scheme
141
- sink_mode = opts[:sink_mode] || :keep
142
- sink_mode = case sink_mode
143
- when :keep, 'keep' then Java::CascadingTap::SinkMode::KEEP
144
- when :replace, 'replace' then Java::CascadingTap::SinkMode::REPLACE
145
- when :append, 'append' then Java::CascadingTap::SinkMode::APPEND
146
- else raise "Unrecognized sink mode '#{sink_mode}'"
147
- end
148
- fs = opts[:kind] || :hfs
149
- klass = case fs
150
- when :hfs, 'hfs' then Java::CascadingTapHadoop::Hfs
151
- when :dfs, 'dfs' then Java::CascadingTapHadoop::Dfs
152
- when :lfs, 'lfs' then Java::CascadingTapHadoop::Lfs
153
- else raise "Unrecognized kind of tap '#{fs}'"
154
- end
155
- parameters = [scheme, path, sink_mode]
156
- klass.new(*parameters)
141
+ MultiTap.multi_sink_tap(taps)
142
+ end
143
+
144
+ # Creates a Cascading::Tap given a path and optional :scheme and :sink_mode.
145
+ def tap(path, params = {})
146
+ Tap.new(path, params)
157
147
  end
158
148
 
159
149
  # Constructs properties to be passed to Flow#complete or Cascade#complete
160
- # which will locate temporary Hadoop files in base_dir. It is necessary
161
- # to pass these properties only when executing local scripts via JRuby's main
162
- # method, which confuses Cascading's attempt to find the containing jar.
150
+ # which will locate temporary Hadoop files in base_dir. It is necessary to
151
+ # pass these properties only when executing scripts in Hadoop local mode via
152
+ # JRuby's main method, which confuses Cascading's attempt to find the
153
+ # containing jar. When using Cascading local mode, these are unnecessary.
163
154
  def local_properties(base_dir)
164
155
  dirs = {
165
156
  'test.build.data' => "#{base_dir}/build",
@@ -1,9 +1,9 @@
1
- # NativeException wrapper that prints the full nested stack trace of the Java
2
- # exception and all of its causes wrapped by the NativeException.
3
- # NativeException by default reveals only the first cause, which is
4
- # insufficient for tracing cascading.jruby errors into JRuby code or revealing
5
- # underlying Janino expression problems.
6
1
  module Cascading
2
+ # NativeException wrapper that prints the full nested stack trace of the Java
3
+ # exception and all of its causes wrapped by the NativeException.
4
+ # NativeException by default reveals only the first cause, which is
5
+ # insufficient for tracing cascading.jruby errors into JRuby code or
6
+ # revealing underlying Janino expression problems.
7
7
  class CascadingException < StandardError
8
8
  attr_accessor :ne, :depth
9
9
 
@@ -10,9 +10,15 @@ module Cascading
10
10
  extend Registerable
11
11
 
12
12
  attr_accessor :properties, :sources, :sinks, :incoming_scopes, :outgoing_scopes, :listeners
13
+ attr_reader :mode
13
14
 
14
- def initialize(name, parent)
15
+ # Builds a flow given a name and a parent node (a cascade or nil).
16
+ # Optionally accepts a :mode which will determine the execution mode of
17
+ # this flow. See Cascading::Mode.parse for details.
18
+ def initialize(name, parent, params = {})
15
19
  @properties, @sources, @sinks, @incoming_scopes, @outgoing_scopes, @listeners = {}, {}, {}, {}, {}, []
20
+ @mode = Mode.parse(params[:mode])
21
+ @flow_scope = Scope.flow_scope(name)
16
22
  super(name, parent)
17
23
  self.class.add(name, self)
18
24
  end
@@ -25,30 +31,18 @@ module Cascading
25
31
  assembly
26
32
  end
27
33
 
28
- # Create a new sink for this flow, with the specified name.
29
- # "tap" can be either a tap (see Cascading.tap) or a string that will
30
- # reference a path.
31
- def sink(*args)
32
- if (args.size == 2)
33
- sinks[args[0]] = args[1]
34
- elsif (args.size == 1)
35
- sinks[name] = args[0]
36
- end
34
+ # Create a new source for this flow, using the specified name and
35
+ # Cascading::Tap
36
+ def source(name, tap)
37
+ sources[name] = tap
38
+ incoming_scopes[name] = Scope.source_scope(name, mode.source_tap(name, tap), @flow_scope)
39
+ outgoing_scopes[name] = incoming_scopes[name]
37
40
  end
38
41
 
39
- # Create a new source for this flow, with the specified name.
40
- # "tap" can be either a tap (see Cascading.tap) or a string that will
41
- # reference a path.
42
- def source(*args)
43
- if (args.size == 2)
44
- sources[args[0]] = args[1]
45
- incoming_scopes[args[0]] = Scope.tap_scope(args[1], args[0])
46
- outgoing_scopes[args[0]] = incoming_scopes[args[0]]
47
- elsif (args.size == 1)
48
- sources[name] = args[0]
49
- incoming_scopes[name] = Scope.empty_scope(name)
50
- outgoing_scopes[name] = incoming_scopes[name]
51
- end
42
+ # Create a new sink for this flow, using the specified name and
43
+ # Cascading::Tap
44
+ def sink(name, tap)
45
+ sinks[name] = tap
52
46
  end
53
47
 
54
48
  def describe(offset = '')
@@ -149,12 +143,10 @@ module Cascading
149
143
  Java::CascadingProperty::AppProps.setApplicationName(properties, name)
150
144
  Java::CascadingProperty::AppProps.setApplicationVersion(properties, '0.0.0')
151
145
 
152
- Java::CascadingFlowHadoop::HadoopFlowConnector.new(properties).connect(
153
- name,
154
- make_tap_parameter(@sources),
155
- make_tap_parameter(@sinks),
156
- make_pipes
157
- )
146
+ sources = make_tap_parameter(@sources, :head_pipe)
147
+ sinks = make_tap_parameter(@sinks, :tail_pipe)
148
+ pipes = make_pipes
149
+ mode.connect_flow(properties, name, sources, sinks, pipes)
158
150
  end
159
151
 
160
152
  def complete(properties = nil)
@@ -169,12 +161,11 @@ module Cascading
169
161
 
170
162
  private
171
163
 
172
- def make_tap_parameter(taps)
164
+ def make_tap_parameter(taps, pipe_accessor)
173
165
  taps.inject({}) do |map, (name, tap)|
174
166
  assembly = find_child(name)
175
167
  raise "Could not find assembly '#{name}' to connect to tap: #{tap}" unless assembly
176
-
177
- map[assembly.tail_pipe.name] = tap
168
+ map[assembly.send(pipe_accessor).name] = tap
178
169
  map
179
170
  end
180
171
  end
@@ -0,0 +1,78 @@
1
+ module Cascading
2
+ # A Cascading::Mode encapsulates the idea of the execution mode for your
3
+ # flows. The default is Hadoop mode, but you can request that your code run
4
+ # in Cascading local mode. If you subsequently use a tap or a scheme that
5
+ # has no local implementation, the mode will be converted back to Hadoop
6
+ # mode.
7
+ class Mode
8
+ attr_reader :local
9
+
10
+ # Hadoop mode is the default. You must explicitly request Cascading local
11
+ # mode with values 'local' or :local.
12
+ def self.parse(mode)
13
+ case mode
14
+ when 'local', :local then Mode.new(true)
15
+ else Mode.new(false)
16
+ end
17
+ end
18
+
19
+ def initialize(local)
20
+ @local = local
21
+ end
22
+
23
+ # Attempts to select the appropriate tap given the current mode. If that
24
+ # tap does not exist, it fails over to the other tap with a warning.
25
+ def source_tap(name, tap)
26
+ warn "WARNING: No local tap for source '#{name}' in tap #{tap}" if local && !tap.local?
27
+ warn "WARNING: No Hadoop tap for source '#{name}' in tap #{tap}" if !local && !tap.hadoop?
28
+
29
+ if local
30
+ tap.local_tap || tap.hadoop_tap
31
+ else
32
+ tap.hadoop_tap || tap.local_tap
33
+ end
34
+ end
35
+
36
+ # Builds a c.f.Flow given properties, name, sources, sinks, and pipes from
37
+ # a Cascading::Flow. The current mode is adjusted based on the taps and
38
+ # schemes of the sources and sinks, then the correct taps are selected
39
+ # before building the flow.
40
+ def connect_flow(properties, name, sources, sinks, pipes)
41
+ update_local_mode(sources, sinks)
42
+ sources = select_taps(sources)
43
+ sinks = select_taps(sinks)
44
+ flow_connector_class.new(properties).connect(name, sources, sinks, pipes)
45
+ end
46
+
47
+ private
48
+
49
+ # Updates this mode based upon your sources and sinks. It's possible that
50
+ # you asked for Cascading local mode, but that request cannot be fulfilled
51
+ # because you used taps or schemes which have no local implementation.
52
+ def update_local_mode(sources, sinks)
53
+ local_supported = sources.all?{ |name, tap| tap.local? } && sinks.all?{ |name, tap| tap.local? }
54
+
55
+ if local && !local_supported
56
+ non_local_sources = sources.reject{ |name, tap| tap.local? }
57
+ non_local_sinks = sinks.reject{ |name, tap| tap.local? }
58
+ warn "WARNING: Cascading local mode requested but these sources: #{non_local_sources.inspect} and these sinks: #{non_local_sinks.inspect} do not support it"
59
+ @local = false
60
+ end
61
+
62
+ local
63
+ end
64
+
65
+ # Given a tap map, extracts the correct taps for the current mode
66
+ def select_taps(tap_map)
67
+ tap_map.inject({}) do |map, (name, tap)|
68
+ map[name] = tap.send(local ? :local_tap : :hadoop_tap)
69
+ map
70
+ end
71
+ end
72
+
73
+ # Chooses the correct FlowConnector class for the current mode
74
+ def flow_connector_class
75
+ local ? Java::CascadingFlowLocal::LocalFlowConnector : Java::CascadingFlowHadoop::HadoopFlowConnector
76
+ end
77
+ end
78
+ end
@@ -107,15 +107,21 @@ module Cascading
107
107
 
108
108
  def to_java_comparable_array(arr)
109
109
  (arr.map do |v|
110
- case v.class
110
+ coerce_to_java(v)
111
+ end).to_java(java.lang.Comparable)
112
+ end
113
+
114
+ def coerce_to_java(v)
115
+ case v
111
116
  when Fixnum
112
- java.lang.Integer.new(v)
117
+ java.lang.Long.new(v)
113
118
  when Float
114
119
  java.lang.Double.new(v)
120
+ when NilClass
121
+ nil
115
122
  else
116
123
  java.lang.String.new(v.to_s)
117
- end
118
- end).to_java(java.lang.Comparable)
124
+ end
119
125
  end
120
126
 
121
127
  def expression_filter(*args)
@@ -10,12 +10,18 @@ module Cascading
10
10
  Scope.new(Java::CascadingFlowPlanner::Scope.new(@scope))
11
11
  end
12
12
 
13
+ def self.flow_scope(name)
14
+ Java::CascadingFlowPlanner::Scope.new(name)
15
+ end
16
+
13
17
  def self.empty_scope(name)
14
18
  Scope.new(Java::CascadingFlowPlanner::Scope.new(name))
15
19
  end
16
20
 
17
- def self.tap_scope(tap, name)
18
- java_scope = outgoing_scope_for(tap, java.util.HashSet.new)
21
+ def self.source_scope(name, tap, flow_scope)
22
+ incoming_scopes = java.util.HashSet.new
23
+ incoming_scopes.add(flow_scope)
24
+ java_scope = outgoing_scope_for(tap, incoming_scopes)
19
25
  # Taps and Pipes don't name their outgoing scopes like other FlowElements
20
26
  java_scope.name = name
21
27
  Scope.new(java_scope)
@@ -4,12 +4,12 @@ module Cascading
4
4
  # Allows you to plugin c.p.SubAssemblies to a cascading.jruby Assembly.
5
5
  #
6
6
  # Assumptions:
7
- # * You will either use the tail_pipe of the calling Assembly, or overwrite
8
- # its incoming_scopes (as do join and union).
9
- # * Your subassembly will have only 1 tail pipe; branching is not
10
- # supported. This allows you to continue operating upon the tail of the
11
- # SubAssembly within the calling Assembly.
12
- # * You will not use nested c.p.SubAssemblies.
7
+ # * You will either use the tail_pipe of the calling Assembly, or overwrite
8
+ # its incoming_scopes (as do join and union)
9
+ # * Your subassembly will have only 1 tail pipe; branching is not
10
+ # supported. This allows you to continue operating upon the tail of the
11
+ # SubAssembly within the calling Assembly
12
+ # * You will not use nested c.p.SubAssemblies
13
13
  #
14
14
  # This is a low-level tool, so be careful.
15
15
  class SubAssembly
@@ -0,0 +1,81 @@
1
+ module Cascading
2
+ # A Cascading::BaseTap wraps up a pair of Cascading taps, one for Cascading
3
+ # local mode and the other for Hadoop mode.
4
+ class BaseTap
5
+ attr_reader :local_tap, :hadoop_tap
6
+
7
+ def initialize(local_tap, hadoop_tap)
8
+ @local_tap = local_tap
9
+ @hadoop_tap = hadoop_tap
10
+ end
11
+
12
+ def local?
13
+ !local_tap.nil?
14
+ end
15
+
16
+ def hadoop?
17
+ !hadoop_tap.nil?
18
+ end
19
+ end
20
+
21
+ # A Cascading::Tap represents a non-aggregate tap with a scheme, path, and
22
+ # optional sink_mode. c.t.l.FileTap is used in Cascading local mode and
23
+ # c.t.h.Hfs is used in Hadoop mode. Whether or not these can be created is
24
+ # governed by the :scheme parameter, which must contain at least one of
25
+ # :local_scheme or :hadoop_scheme. Schemes like TextLine are supported in
26
+ # both modes (by Cascading), but SequenceFile is only supported in Hadoop
27
+ # mode.
28
+ class Tap < BaseTap
29
+ attr_reader :scheme, :path, :sink_mode
30
+
31
+ def initialize(path, params = {})
32
+ @path = path
33
+
34
+ @scheme = params[:scheme] || text_line_scheme
35
+ raise "Scheme must provide one of :local_scheme or :hadoop_scheme; received: '#{scheme.inspect}'" unless scheme[:local_scheme] || scheme[:hadoop_scheme]
36
+
37
+ @sink_mode = case params[:sink_mode] || :keep
38
+ when :keep, 'keep' then Java::CascadingTap::SinkMode::KEEP
39
+ when :replace, 'replace' then Java::CascadingTap::SinkMode::REPLACE
40
+ when :append, 'append' then Java::CascadingTap::SinkMode::APPEND
41
+ else raise "Unrecognized sink mode '#{params[:sink_mode]}'"
42
+ end
43
+
44
+ local_scheme = scheme[:local_scheme]
45
+ @local_tap = local_scheme ? Java::CascadingTapLocal::FileTap.new(local_scheme, path, sink_mode) : nil
46
+
47
+ hadoop_scheme = scheme[:hadoop_scheme]
48
+ @hadoop_tap = hadoop_scheme ? Java::CascadingTapHadoop::Hfs.new(hadoop_scheme, path, sink_mode) : nil
49
+ end
50
+ end
51
+
52
+ # A Cascading::MultiTap represents one of Cascading's aggregate taps and is
53
+ # built via static constructors that accept an array of Cascading::Taps. In
54
+ # order for a mode (Cascading local or Hadoop) to be supported, all provided
55
+ # taps must support it.
56
+ class MultiTap < BaseTap
57
+ def initialize(local_tap, hadoop_tap)
58
+ super(local_tap, hadoop_tap)
59
+ end
60
+
61
+ def self.multi_source_tap(taps)
62
+ multi_tap(taps, Java::CascadingTap::MultiSourceTap)
63
+ end
64
+
65
+ def self.multi_sink_tap(taps)
66
+ multi_tap(taps, Java::CascadingTap::MultiSinkTap)
67
+ end
68
+
69
+ private
70
+
71
+ def self.multi_tap(taps, klass)
72
+ local_supported = taps.all?{ |tap| tap.local? }
73
+ local_tap = local_supported ? klass.new(taps.map{ |tap| tap.local_tap }.to_java('cascading.tap.Tap')) : nil
74
+
75
+ hadoop_supported = taps.all?{ |tap| tap.hadoop? }
76
+ hadoop_tap = hadoop_supported ? klass.new(taps.map{ |tap| tap.hadoop_tap }.to_java('cascading.tap.Tap')) : nil
77
+
78
+ MultiTap.new(local_tap, hadoop_tap)
79
+ end
80
+ end
81
+ end
data/lib/cascading.rb CHANGED
@@ -6,7 +6,7 @@ require 'java'
6
6
 
7
7
  module Cascading
8
8
  # :stopdoc:
9
- VERSION = '0.0.8'
9
+ VERSION = '0.0.9'
10
10
  LIBPATH = ::File.expand_path(::File.dirname(__FILE__)) + ::File::SEPARATOR
11
11
  PATH = ::File.dirname(LIBPATH) + ::File::SEPARATOR
12
12
  CASCADING_HOME = ENV['CASCADING_HOME']
@@ -55,8 +55,10 @@ require 'cascading/cascading'
55
55
  require 'cascading/cascading_exception'
56
56
  require 'cascading/expr_stub'
57
57
  require 'cascading/flow'
58
+ require 'cascading/mode'
58
59
  require 'cascading/operations'
59
60
  require 'cascading/scope'
61
+ require 'cascading/tap'
60
62
 
61
63
  # include module to make them available at top package
62
64
  include Cascading