arc-furnace 0.1.0 → 0.1.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 38e438f57e6350ddce2937cf2ef03d6d93ea20a8
-  data.tar.gz: 17ee74aba7edef53647c2a2e4efe6b4d3bda477f
+  metadata.gz: 2ab49afa557ed851dd866c6dd1486bde91062890
+  data.tar.gz: f66fa651a0e3ac6df2c3824d5daecfe1244859c6
 SHA512:
-  metadata.gz: ee8d88b55edd486dbd496d2831096ed53eaad124e6043629747b9503bd6473267c1f18086edae0425152b08638fc7ed4961cfecc79b72a92d709768c49316b25
-  data.tar.gz: 39114078eab439a660c83479b9251b0e9f1690aa8e0110b51a4dc061aee5f7d64421189c69f0705349d95339b14e7e5b118124b1822a7c3e747446313c585ec7
+  metadata.gz: 236b5f55f9914dc37a89579cc543304880af745d49096186e94397e6eca5ea41024437bb16461e5c457786e47fbc3db67fed268a424f72b7c7c057be026d796e
+  data.tar.gz: 6833e0e9e9e84442ce686cdd68e55efdd3c8535ec57b7fe225c10f2811e390e5af02525086ab90cac8ee158533fe619f7984f4f0cb834fda25f2dd57670ae694
data/README.md CHANGED
@@ -1,6 +1,8 @@
 # ArcFurnace
+[![Gem Version](https://badge.fury.io/rb/arc-furnace.png)][gem]
 [![Build Status](https://travis-ci.org/salsify/arc-furnace.svg?branch=master)][travis]

+[gem]: https://rubygems.org/gems/arc-furnace
 [travis]: http://travis-ci.org/salsify/arc-furnace

 ArcFurnace melts, melds, and transforms your scrap data into perfectly crafted data for ingest into applications,
@@ -71,7 +73,7 @@ require a stream of data (`Hash`, `Transform`, `Join`, `Sink`) will have one.

 #### Hashes

-A `Hash` provides indexed access to a `Source` but pre-computing the index based on a key. The processing happens during the
+A `Hash` provides indexed access to a `Source` by pre-computing the index based on a key. The processing happens during the
 prepare stage of pipeline processing. Hashes have a simple interface, `#get(primary_key)`, for requesting data. Hashes
 are almost exclusively used as inputs to one side of joins.

@@ -82,6 +84,12 @@ key is the key that the hash was rolled-up on, however, the `key_column` option
 may override this. Note the default join is an inner join, which will drop source rows if the hash does not contain
 a matching row.

+#### Filters
+
+A `Filter` acts as a source; however, it takes a source as an input and determines whether to pass each row to
+the next downstream node by calling the `#filter` method on itself. There is an associated `BlockFilter` and
+sugar on `Pipeline` to make this easy.
+
 #### Transforms

 A `Transform` acts as a source; however, it takes a source as an input and transforms each input. The `BlockTransform` and
@@ -100,7 +108,7 @@ subscribe to the `#row(hash)` interface--each output row is passed to this method

 ### General pipeline development process

-1. Define a source. Choose an existing `Source` implementation in this library (`CSVSource` or `ExcelSource`),
+1. Define a source. Choose an existing `Source` implementation in this library (`CSVSource` or `ExcelSource`),
 extend the `EnumeratorSource`, or implement the `row()` method for a new source.
 2. Define any transformations or joins. This may cause you to revisit #1.
 3. Define the sink. This is generally custom, or may be one of the provided `CSVSink` types.
@@ -114,9 +122,8 @@ To install this gem onto your local machine, run `bundle exec rake install`. To

 ## TODOs

-1. Add a `filter` node and implementation to `Pipeline`
-2. Add examples for `ErrorHandler` interface.
-3. Add sugar to define a `BlockTransform` on a `Source` definition in a `Pipeline`.
+1. Add examples for `ErrorHandler` interface.
+2. Add sugar to define a `BlockTransform` on a `Source` definition in a `Pipeline`.

 ## Contributing

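The README's new Filters section pairs with the `filter` sugar added to `Pipeline` later in this diff. Below is a minimal sketch (not part of the diff) of a pipeline that keeps only in-stock rows; the file names, the `'status'` column, and the `filename:` parameter name are assumptions for illustration:

```ruby
require 'arc-furnace'

# Hypothetical pipeline: read a CSV, keep only in-stock rows, write all fields out.
class InStockPipeline < ArcFurnace::Pipeline
  # :input_file resolves to the value passed to .instance below.
  source :products, type: ArcFurnace::CSVSource, params: { filename: :input_file }

  # The block receives each row hash; a truthy result passes the row downstream.
  filter :in_stock, params: { source: :products } do |row|
    row['status'] == 'in_stock'
  end

  sink type: ArcFurnace::AllFieldsCSVSink,
       source: :in_stock,
       params: { filename: :output_file }
end

InStockPipeline.instance(input_file: 'products.csv', output_file: 'in_stock.csv').execute
```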
data/lib/arc-furnace/block_filter.rb ADDED
@@ -0,0 +1,18 @@
+require 'arc-furnace/filter'
+
+module ArcFurnace
+  class BlockFilter < Filter
+    private_attr_reader :block
+
+    def initialize(source:, block:)
+      raise 'Must specify a block' if block.nil?
+      @block = block
+      super(source: source)
+    end
+
+    def filter(row)
+      block.call(row)
+    end
+
+  end
+end
data/lib/arc-furnace/enumerator_source.rb CHANGED
@@ -30,7 +30,7 @@ module ArcFurnace

     # Return the enumerator
     def build_enumerator
-      raise "Unimplemented!"
+      raise "Unimplemented"
    end
  end
end
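For context, `EnumeratorSource` subclasses supply rows by overriding `build_enumerator`. A hypothetical sketch (the class name and data are illustrative, not from this diff):

```ruby
require 'arc-furnace/enumerator_source'

# Illustrative subclass: yields three row hashes from an in-memory range.
class CountingSource < ArcFurnace::EnumeratorSource
  def build_enumerator
    Enumerator.new do |yielder|
      (1..3).each { |id| yielder << { 'id' => id } }
    end
  end
end
```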
data/lib/arc-furnace/filter.rb ADDED
@@ -0,0 +1,37 @@
+require 'arc-furnace/source'
+
+# Filters limit rows to downstream nodes. They act just like Enumerable#filter:
+# when the #filter method returns true, the row is passed downstream; when
+# it returns false, the row is skipped.
+module ArcFurnace
+  class Filter < Source
+
+    private_attr_reader :source
+    attr_reader :value
+
+    def initialize(source:)
+      @source = source
+      advance
+    end
+
+    # Given a row from the source, tell if it should be passed down to the next
+    # node downstream from this node.
+    #
+    # This method must return a boolean.
+    def filter(row)
+      raise "Unimplemented"
+    end
+
+    def empty?
+      value.nil? && source.empty?
+    end
+
+    def advance
+      loop do
+        @value = source.row
+        break if value.nil? || filter(value)
+      end
+    end
+
+  end
+end
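Subclassing `Filter` directly only requires overriding `#filter`. A hypothetical example (the `'price'` column is illustrative):

```ruby
require 'arc-furnace/filter'

# Illustrative Filter subclass: drop rows without a positive price.
class PositivePriceFilter < ArcFurnace::Filter
  def filter(row)
    row['price'].to_f > 0
  end
end
```

Note that `#advance` keeps pulling from the upstream source until a row passes or the source is exhausted, so `#value` never exposes a rejected row.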
data/lib/arc-furnace.rb CHANGED
@@ -1,10 +1,12 @@
 require 'arc-furnace/all_fields_csv_sink'
 require 'arc-furnace/binary_key_merging_hash'
+require 'arc-furnace/block_filter'
 require 'arc-furnace/block_transform'
 require 'arc-furnace/block_unfold'
 require 'arc-furnace/csv_sink'
 require 'arc-furnace/csv_source'
 require 'arc-furnace/enumerator_source'
+require 'arc-furnace/filter'
 require 'arc-furnace/fixed_column_csv_sink'
 require 'arc-furnace/hash'
 require 'arc-furnace/inner_join'
data/lib/arc-furnace/pipeline.rb CHANGED
@@ -23,58 +23,70 @@ module ArcFurnace
      end

      @sink_node = -> do
-       type.new(resolve_parameters(params))
+       type.new(resolve_parameters(:sink, params))
      end
      @sink_source = source
    end

    # Define a hash node, processing all rows from its source and caching them
    # in-memory.
-   def self.hash_node(name, type: ArcFurnace::Hash, params:)
-     define_intermediate(name, type: type, params: params)
+   def self.hash_node(node_id, type: ArcFurnace::Hash, params:)
+     define_intermediate(node_id, type: type, params: params)
    end

    # A source that has row semantics, delivering a hash per row (or per entity)
    # for the source.
-   def self.source(name, type:, params:)
+   def self.source(node_id, type:, params:)
      raise "Source #{type} is not a Source!" unless type <= Source
-     define_intermediate(name, type: type, params: params)
+     define_intermediate(node_id, type: type, params: params)
    end

    # Define an inner join node where rows from the source are dropped
    # if an associated entity is not found in the hash for the join key
-   def self.inner_join(name, type: ArcFurnace::InnerJoin, params:)
-     define_intermediate(name, type: type, params: params)
+   def self.inner_join(node_id, type: ArcFurnace::InnerJoin, params:)
+     define_intermediate(node_id, type: type, params: params)
    end

    # Define an outer join node where rows from the source are kept
    # even if an associated entity is not found in the hash for the join key
-   def self.outer_join(name, type: ArcFurnace::OuterJoin, params:)
-     define_intermediate(name, type: type, params: params)
+   def self.outer_join(node_id, type: ArcFurnace::OuterJoin, params:)
+     define_intermediate(node_id, type: type, params: params)
    end

    # Define a node that transforms rows. By default you get a BlockTransform
    # (and when this metaprogramming method is passed a block) that will be passed
    # a hash for each row. The result of the block becomes the row for the next
    # downstream node.
-   def self.transform(name, type: BlockTransform, params: {}, &block)
-     if block
+   def self.transform(node_id, type: BlockTransform, params: {}, &block)
+     if block_given? && type <= BlockTransform
        params[:block] = block
      end
      raise "Transform #{type} is not a Transform!" unless type <= Transform
-     define_intermediate(name, type: type, params: params)
+     define_intermediate(node_id, type: type, params: params)
    end

-   # Define a node that unfolds rows. By default you get a BlocUnfold
+   # Define a node that unfolds rows. By default you get a BlockUnfold
    # (and when this metaprogramming method is passed a block) that will be passed
    # a hash for each row. The result of the block becomes the set of rows for the next
    # downstream node.
-   def self.unfold(name, type: BlockUnfold, params: {}, &block)
-     if block
+   def self.unfold(node_id, type: BlockUnfold, params: {}, &block)
+     if block_given? && type <= BlockUnfold
        params[:block] = block
      end
      raise "Unfold #{type} is not an Unfold!" unless type <= Unfold
-     define_intermediate(name, type: type, params: params)
+     define_intermediate(node_id, type: type, params: params)
+   end
+
+   # Define a node that filters rows. By default you get a BlockFilter
+   # (and when this metaprogramming method is passed a block) that will be passed
+   # a hash for each row. The result of the block determines if a given row
+   # flows to a downstream node.
+   def self.filter(node_id, type: BlockFilter, params: {}, &block)
+     if block_given? && type <= BlockFilter
+       params[:block] = block
+     end
+     raise "Filter #{type} is not a Filter!" unless type <= Filter
+     define_intermediate(node_id, type: type, params: params)
    end

    # Create an instance to run a transformation, passing the parameters to
@@ -82,18 +94,18 @@ module ArcFurnace
    # will have a single public method--#execute, which will perform the
    # transformation.
    def self.instance(params = {})
-     DSLInstance.new(self, params)
+     PipelineInstance.new(self, params)
    end

    private

-   def self.define_intermediate(name, type:, params:)
-     intermediates_map[name] = -> do
-       type.new(resolve_parameters(params))
+   def self.define_intermediate(node_id, type:, params:)
+     intermediates_map[node_id] = -> do
+       type.new(resolve_parameters(node_id, params))
      end
    end

-   class DSLInstance
+   class PipelineInstance
      attr_reader :sink_node, :sink_source, :intermediates_map, :params, :dsl_class, :error_handler

      def initialize(dsl_class, error_handler: ErrorHandler.new, **params)
@@ -135,22 +147,22 @@ module ArcFurnace
        @sink_source = intermediates_map[dsl_class.sink_source]
      end

-     def resolve_parameters(params_to_resolve)
+     def resolve_parameters(node_id, params_to_resolve)
        params_to_resolve.each_with_object({}) do |(key, value), result|
          result[key] =
            if value.is_a?(Symbol)
              # Allow resolution of intermediates
-             resolve_parameter(value)
+             resolve_parameter(node_id, value)
            elsif value.nil?
-             resolve_parameter(key)
+             resolve_parameter(node_id, key)
            else
              value
            end
        end
      end

-     def resolve_parameter(key)
-       self.params[key] || self.intermediates_map[key] || (raise "Unknown key #{key}!")
+     def resolve_parameter(node_id, key)
+       self.params[key] || self.intermediates_map[key] || (raise "When processing node #{node_id}: Unknown key #{key}!")
      end

    end
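The `resolve_parameters` change above threads the node id into error messages; the resolution convention itself is unchanged: a `Symbol` value resolves to another node or, failing that, to a value passed to `.instance`, while a `nil` value resolves by its own key. A hedged sketch of that convention (node names and the `filename:`/`key_column:` parameters are assumptions, not from this diff):

```ruby
require 'arc-furnace'

# Hypothetical join pipeline illustrating symbol resolution in params.
class PricedItemsPipeline < ArcFurnace::Pipeline
  # :items_file and :prices_file resolve to values passed to .instance.
  source :items, type: ArcFurnace::CSVSource, params: { filename: :items_file }
  source :prices_csv, type: ArcFurnace::CSVSource, params: { filename: :prices_file }

  # :prices_csv resolves to the source node defined above.
  hash_node :prices, params: { source: :prices_csv, key_column: 'sku' }

  inner_join :priced_items, params: { source: :items, hash: :prices }

  sink type: ArcFurnace::AllFieldsCSVSink,
       source: :priced_items,
       params: { filename: :output_file }
end

PricedItemsPipeline.instance(
  items_file: 'items.csv',
  prices_file: 'prices.csv',
  output_file: 'priced_items.csv'
).execute
```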
data/lib/arc-furnace/source.rb CHANGED
@@ -4,6 +4,8 @@ module ArcFurnace
   class Source < Node
     extend Forwardable

+    # Called to prepare anything this source needs to do before providing rows.
+    # For instance, opening a source file or database connection.
     def prepare

     end
@@ -17,23 +19,23 @@ module ArcFurnace

     # Is this source empty?
     def empty?
-
+      raise 'Unimplemented'
     end

     # The current value this source points at.
     # This is generally the only method required to implement a source.
     def value
-
+      raise 'Unimplemented'
     end

-    # Close the source
+    # Close the source. Called by the framework at the end of processing.
     def close

     end

     # Advance this source by one. #advance specifies no return value contract.
     def advance
-
+      raise 'Unimplemented'
     end

   end
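Taken together, `empty?`, `value`, and `advance` form the contract a custom source must satisfy. A hypothetical minimal implementation (illustrative only, not part of the gem):

```ruby
require 'arc-furnace/source'

# Illustrative in-memory source: #value is the current row, #advance moves on.
class ArraySource < ArcFurnace::Source
  def initialize(rows:)
    @rows = rows.dup
  end

  def empty?
    @rows.empty?
  end

  def value
    @rows.first
  end

  def advance
    @rows.shift
  end
end
```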
data/lib/arc-furnace/version.rb CHANGED
@@ -1,3 +1,3 @@
 module ArcFurnace
-  VERSION = "0.1.0"
+  VERSION = "0.1.3"
 end
metadata CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: arc-furnace
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.1.3
 platform: ruby
 authors:
 - Daniel Spangenberger
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2015-10-04 00:00:00.000000000 Z
+date: 2015-10-22 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: msgpack
@@ -129,6 +129,7 @@ files:
 - lib/arc-furnace/abstract_join.rb
 - lib/arc-furnace/all_fields_csv_sink.rb
 - lib/arc-furnace/binary_key_merging_hash.rb
+- lib/arc-furnace/block_filter.rb
 - lib/arc-furnace/block_transform.rb
 - lib/arc-furnace/block_unfold.rb
 - lib/arc-furnace/csv_sink.rb
@@ -138,6 +139,7 @@ files:
 - lib/arc-furnace/enumerator_source.rb
 - lib/arc-furnace/error_handler.rb
 - lib/arc-furnace/excel_source.rb
+- lib/arc-furnace/filter.rb
 - lib/arc-furnace/fixed_column_csv_sink.rb
 - lib/arc-furnace/hash.rb
 - lib/arc-furnace/inner_join.rb