data_collector 0.17.0 → 0.18.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 351bb040c33b9010903681a117df29f02ba3663e440ed30b37520d5a8aa98b30
4
- data.tar.gz: a307677b46ecce478fed2206cf5e9173b67bcec8a8b772ca1f12d71fd33a6fad
3
+ metadata.gz: 9b2dda800a0c468ee0db8c4a4546f98c8baa005f0cff2df603e613d404021315
4
+ data.tar.gz: c505eb5354999645eb5ea9fbb5200ce100a37d9f3e0eac85bf9416d21cd3514a
5
5
  SHA512:
6
- metadata.gz: 6298f438cf8030be76ac85f9652aead672dedd42c6f5c6324b6004bae175442bb9565e31ec3aee430da601e3bcdc43eab08dbf19e9cfb89fe54f00ef61758325
7
- data.tar.gz: 7b77100e03002764c58e4f2b0d3c962b445422fe7109aa14f0ac7f389628399392db0e172fe7141a35967a6f87c26cce2a8ee7865084454bb9e5587550085ae7
6
+ metadata.gz: a44557a687028b74b495236a47b4d802a4a6e130526a639ddf63b7b6a8a07b090f5197c23a36b2b4c9628bcfa33a0d38e2451c1a3224a45fa63d388f6922624e
7
+ data.tar.gz: b98a223f063f24b8f78e1358faeb02e33e365edd77b0fba2d28649fa0ad17d79f386ff216326040f3ec87390cb595f41382733ea042c5357c9cf48a23481d8c7
data/README.md CHANGED
@@ -1,39 +1,91 @@
1
1
  # DataCollector
2
- Convenience module to Extract, Transform and Load your data.
3
- You have main objects that help you to 'INPUT', 'OUTPUT' and 'FILTER' data. The basic ETL components.
4
- Support objects like CONFIG, LOG, RULES and the new RULES_NG just to make life easier.
2
+ Convenience module to Extract, Transform and Load your data in a Pipeline.
3
+ The 'INPUT', 'OUTPUT' and 'FILTER' objects help you read, transform and output your data.
4
+ Support objects like CONFIG, LOG, ERROR and RULES help you write manageable rules to transform and log your data.
5
+ Including the DataCollector::Core module in your application gives you access to these objects.
6
+ ```ruby
7
+ include DataCollector::Core
8
+ ```
5
9
 
6
- Including the DataCollector::Core module into your application gives you access to these objects.
7
-
8
- The RULES and RULES_NG objects work in a very simple concept. Rules exist of 3 components:
9
- - a destination tag
10
- - a jsonpath filter to get the data
11
- - a lambda to execute on every filter hit
10
+ Every object can be used on its own.
11
+
12
+
13
+ #### Pipeline
14
+ Allows you to create a simple pipeline of operations to process data. With a data pipeline, you can collect, process, and transform data, and then transfer it to various systems and applications.
15
+
16
+ You can set a schedule for pipelines that are triggered by new data, specifying how often the pipeline should be
17
+ executed in the [ISO8601 duration format](https://www.digi.com/resources/documentation/digidocs//90001488-13/reference/r_iso_8601_duration_format.htm). The processing logic is then executed.
18
+ ###### methods:
19
+ - .new(options): options can be schedule in [ISO8601 duration format](https://www.digi.com/resources/documentation/digidocs//90001488-13/reference/r_iso_8601_duration_format.htm) and name
20
+ - .run: start the pipeline. blocking if a schedule is supplied
21
+ - .stop: stop the pipeline
22
+ - .pause: pause the pipeline. Restart using .run
23
+ - .running?: is pipeline running
24
+ - .stopped?: is pipeline not running
25
+ - .paused?: is pipeline paused
26
+ - .name: name of the pipe
27
+ - .run_count: number of times the pipe has run
28
+ - .on_message: handle to run every time a trigger event happens
29
+ ###### example:
30
+ ```ruby
31
+ # create a pipeline scheduled to run every 10 minutes
32
+ pipeline = Pipeline.new(schedule: 'PT10M')
12
33
 
34
+ pipeline.on_message do |input, output|
35
+ # logic
36
+ end
13
37
 
14
- #### input
15
- Read input from an URI. This URI can have a http, https or file scheme
38
+ pipeline.run
39
+ ```
40
+
41
+ #### input
42
+ The input component is part of the processing logic. All data is converted into a Hash, Array, ... accessible using plain Ruby or JSONPath using the filter object.
43
+ The input component can fetch data from various URIs, such as files, URLs, directories, queues, ...
44
+ For a push input component, a listener is created with a processing logic block that is executed whenever new data is available.
45
+ A push happens when new data is created in a directory, message queue, ...
16
46
 
17
- **Public methods**
18
47
  ```ruby
19
48
  from_uri(source, options = {:raw, :content_type})
20
49
  ```
21
- - source: an uri with a scheme of http, https, file
50
+ - source: an uri with a scheme of http, https, file, amqp
22
51
  - options:
23
52
  - raw: _boolean_ do not parse
24
53
  - content_type: _string_ force a content_type if the 'Content-Type' returned by the http server is incorrect
25
54
 
26
- example:
55
+ ###### example:
27
56
  ```ruby
57
+ # read from an http endpoint
28
58
  input.from_uri("http://www.libis.be")
29
59
  input.from_uri("file://hello.txt")
30
60
  input.from_uri("http://www.libis.be/record.jsonld", content_type: 'application/ld+json')
31
- ```
32
61
 
62
+ # read data from a RabbitMQ queue
63
+ listener = input.from_uri('amqp://user:password@localhost?channel=hello')
64
+ listener.on_message do |input, output, message|
65
+ puts message
66
+ end
67
+ listener.run
68
+
69
+ # read data from a directory
70
+ listener = input.from_uri('file://this/is/directory')
71
+ listener.on_message do |input, output, filename|
72
+ puts filename
73
+ end
74
+ listener.run
75
+ ```
33
76
 
77
+ Inputs can be JSON, XML, CSV, or XML inside a TAR.GZ file
34
78
 
79
+ ###### listener from input.from_uri(directory|message queue)
80
+ A listener that is triggered by a push event, such as a message arriving on a queue or a file being written to a directory, has these extra methods.
35
81
 
36
- Inputs can be JSON, XML or CSV or XML in a TAR.GZ file
82
+ - .run: start the listener. Blocking when called with true as the first argument
83
+ - .stop: stop the listener
84
+ - .pause: pause the listener. Restart using .run
85
+ - .running?: is listener running
86
+ - .stopped?: is listener not running
87
+ - .paused?: is listener paused
88
+ - .on_message: handle to run every time a trigger event happens
37
89
 
38
90
  ### output
39
91
  Output is an object you can store key/value pairs that needs to be written to an output stream.
@@ -45,7 +97,7 @@ Output is an object you can store key/value pairs that needs to be written to an
45
97
  Write output to a file, string use an ERB file as a template
46
98
  example:
47
99
  ___test.erb___
48
- ```ruby
100
+ ```erbruby
49
101
  <names>
50
102
  <combined><%= data[:name] %> <%= data[:last_name] %></combined>
51
103
  <%= print data, :name, :first_name %>
@@ -53,7 +105,7 @@ ___test.erb___
53
105
  </names>
54
106
  ```
55
107
  will produce
56
- ```ruby
108
+ ```html
57
109
  <names>
58
110
  <combined>John Doe</combined>
59
111
  <first_name>John</first_name>
@@ -97,41 +149,11 @@ filter data from a hash using [JSONPath](http://goessner.net/articles/JsonPath/i
97
149
  filtered_data = filter(data, "$..metadata.record")
98
150
  ```
99
151
 
100
- #### rules (depricated)
101
- See newer rules_ng object
102
- ~~Allows you to define a simple lambda structure to run against a JSONPath filter~~
103
-
104
- ~~A rule is made up of a Hash the key is the map key field its value is a Hash with a JSONPath filter and options to apply a convert method on the filtered results.~~
105
- ~~Available convert methods are: time, map, each, call, suffix, text~~
106
- ~~- time: parses a given time/date string into a Time object~~
107
- ~~- map: applies a mapping to a filter~~
108
- ~~- suffix: adds a suffix to a result~~
109
- ~~- call: executes a lambda on the filter~~
110
- ~~- each: runs a lambda on each row of a filter~~
111
- ~~- text: passthrough method. Returns value unchanged~~
112
-
113
- ~~example:~~
114
- ```ruby
115
- my_rules = {
116
- 'identifier' => {"filter" => '$..id'},
117
- 'language' => {'filter' => '$..lang',
118
- 'options' => {'convert' => 'map',
119
- 'map' => {'nl' => 'dut', 'fr' => 'fre', 'de' => 'ger', 'en' => 'eng'}
120
- }
121
- },
122
- 'subject' => {'filter' => '$..keywords',
123
- options' => {'convert' => 'each',
124
- 'lambda' => lambda {|d| d.split(',')}
125
- }
126
- },
127
- 'creationdate' => {'filter' => '$..published_date', 'convert' => 'time'}
128
- }
129
-
130
- rules.run(my_rules, record, output)
131
- ```
132
-
133
- #### rules_ng
134
- !!! not compatible with RULES object
152
+ #### rules
153
+ The RULES object is built on a simple concept. A rule consists of 3 components:
154
+ - a destination tag
155
+ - a jsonpath filter to get the data
156
+ - a lambda to execute on every filter hit
135
157
 
136
158
  TODO: work in progress see test for examples on how to use
137
159
 
@@ -202,15 +224,15 @@ Here you find different rule combination that are possible
202
224
  }
203
225
  ```
204
226
 
205
- Here is an example on how to call last RULESET "rs_hash_with_json_filter_and_option".
206
227
 
207
- ***rules_ng.run*** can have 4 parameters. First 3 are mandatory. The last one ***options*** can hold data static to a rule set or engine directives.
228
+ ***rules.run*** can have 4 parameters. First 3 are mandatory. The last one ***options*** can hold data static to a rule set or engine directives.
208
229
 
209
- List of engine directives:
230
+ ##### List of engine directives:
210
231
  - _no_array_with_one_element: defaults to false. if the result is an array with 1 element just return the element.
211
232
 
212
-
233
+ ###### example:
213
234
  ```ruby
235
+ # apply RULESET "rs_hash_with_json_filter_and_option" to data
214
236
  include DataCollector::Core
215
237
  output.clear
216
238
  data = {'subject' => ['water', 'thermodynamics']}
@@ -315,7 +337,32 @@ Or install it yourself as:
315
337
 
316
338
  ## Usage
317
339
 
318
- TODO: Write usage instructions here
340
+ ```ruby
341
+ require 'data_collector'
342
+
343
+ include DataCollector::Core
344
+ # including core gives you a pipeline, input, output, filter, config, log, error object to work with
345
+ RULES = {
346
+ 'title' => '$..vertitle'
347
+ }
348
+ #create a PULL pipeline and schedule it to run every 5 seconds
349
+ pipeline = DataCollector::Pipeline.new(schedule: 'PT5S')
350
+
351
+ pipeline.on_message do |input, output|
352
+ data = input.from_uri('https://services3.libis.be/primo_artefact/lirias3611609')
353
+ rules.run(RULES, data, output)
354
+ #puts JSON.pretty_generate(input.raw)
355
+ puts JSON.pretty_generate(output.raw)
356
+ output.clear
357
+
358
+ if pipeline.run_count > 2
359
+ log('stopping pipeline after three runs')
360
+ pipeline.stop
361
+ end
362
+ end
363
+ pipeline.run
364
+
365
+ ```
319
366
 
320
367
  ## Development
321
368
 
@@ -43,11 +43,14 @@ Gem::Specification.new do |spec|
43
43
  spec.add_runtime_dependency 'jsonpath', '~> 1.1'
44
44
  spec.add_runtime_dependency 'mime-types', '~> 3.4'
45
45
  spec.add_runtime_dependency 'minitar', '= 0.9'
46
- spec.add_runtime_dependency 'nokogiri', '~> 1.13'
46
+ spec.add_runtime_dependency 'nokogiri', '~> 1.14'
47
47
  spec.add_runtime_dependency 'nori', '~> 2.6'
48
+ spec.add_runtime_dependency 'iso8601', '~> 0.13'
49
+ spec.add_runtime_dependency 'listen', '~> 3.8'
50
+ spec.add_runtime_dependency 'bunny', '~> 2.20'
48
51
 
49
52
  spec.add_development_dependency 'bundler', '~> 2.3'
50
- spec.add_development_dependency 'minitest', '~> 5.16'
53
+ spec.add_development_dependency 'minitest', '~> 5.18'
51
54
  spec.add_development_dependency 'rake', '~> 13.0'
52
55
  spec.add_development_dependency 'webmock', '~> 3.18'
53
56
  end
data/examples/marc.rb ADDED
@@ -0,0 +1,27 @@
1
+ $LOAD_PATH << '../lib'
2
+ require 'data_collector'
3
+
4
+ # including the module gives us a pipeline, input, output, filter, log and error object to work with
5
+ include DataCollector::Core
6
+
7
+ RULES = {
8
+ "title" => {'$.record.datafield[?(@._tag == "245")]' => lambda do |d, o|
9
+ subfields = d['subfield']
10
+ subfields = [subfields] unless subfields.is_a?(Array)
11
+ subfields.map{|m| m["$text"]}.join(' ')
12
+ end
13
+ },
14
+ "author" => {'$..datafield[?(@._tag == "100")]' => lambda do |d, o|
15
+ subfields = d['subfield']
16
+ subfields = [subfields] unless subfields.is_a?(Array)
17
+ subfields.map{|m| m["$text"]}.join(' ')
18
+ end
19
+ }
20
+ }
21
+
22
+ # read remote record with logging enabled
23
+ data = input.from_uri('https://gist.githubusercontent.com/kefo/796b39925e234fb6d912/raw/3df2ce329a947864ae8555f214253f956d679605/sample-marc-with-xsd.xml', {logging: true})
24
+ # apply rules to data and if result contains only 1 entry do not return an array
25
+ rules.run(RULES, data, output, {_no_array_with_one_element: true})
26
+ # print result
27
+ puts JSON.pretty_generate(output.raw)
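The lambdas in the MARC example wrap a lone subfield in an array before mapping over it, because a parsed datafield holds a single Hash when it has one subfield and an Array when it has several. A minimal sketch of that normalization in plain Ruby follows. The helper name `join_subfields` and the sample records are assumptions made for illustration and are not part of the gem.

```ruby
# Hypothetical helper mirroring the body of the "title" and "author" lambdas.
def join_subfields(datafield)
  subfields = datafield['subfield']
  # A single subfield arrives as a Hash, several arrive as an Array of Hashes.
  subfields = [subfields] unless subfields.is_a?(Array)
  subfields.map { |subfield| subfield['$text'] }.join(' ')
end

# Made-up sample data shaped like a parsed MARC datafield.
single = { '_tag' => '245', 'subfield' => { '$text' => 'Ruby basics' } }
multiple = { '_tag' => '245', 'subfield' => [{ '$text' => 'Ruby' }, { '$text' => 'in practice' }] }

puts join_subfields(single)
puts join_subfields(multiple)
```

The explicit `is_a?(Array)` check is used instead of `Array(subfields)` because `Array()` would turn a Hash into a list of key/value pairs.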
@@ -10,6 +10,14 @@ require_relative 'config_file'
10
10
 
11
11
  module DataCollector
12
12
  module Core
13
+ # Pipeline for your data pipeline
14
+ # example: pipeline.on_message do |input, output|
15
+ # ** processing logic here **
16
+ # end
17
+ def pipeline
18
+ @pipeline ||= DataCollector::Pipeline.new
19
+ end
20
+ module_function :pipeline
13
21
  # Read input from an URI
14
22
  # example: input.from_uri("http://www.libis.be")
15
23
  # input.from_uri("file://hello.txt")
@@ -79,6 +87,8 @@ module DataCollector
79
87
  # }
80
88
  # rules.run(my_rules, input, output)
81
89
  def rules
90
+ #DataCollector::Core.log('RULES deprecated using RULESNG')
91
+ #rules_ng
82
92
  @rules ||= Rules.new
83
93
  end
84
94
  module_function :rules
@@ -121,6 +131,12 @@ module DataCollector
121
131
  end
122
132
  module_function :log
123
133
 
134
+ def error(message)
135
+ @logger ||= Logger.new(STDOUT)
136
+ @logger.error(message)
137
+ end
138
+ module_function :error
139
+
124
140
  end
125
141
 
126
142
  end
@@ -0,0 +1,28 @@
1
+ require_relative 'generic'
2
+ require 'listen'
3
+
4
+ module DataCollector
5
+ class Input
6
+ class Dir < Generic
7
+ def initialize(uri, options)
8
+ super
9
+ end
10
+
11
+ def running?
12
+ @listener.processing?
13
+ end
14
+
15
+ private
16
+
17
+ def create_listener
18
+ @listener ||= Listen.to("#{@uri.host}/#{@uri.path}", @options) do |modified, added, _|
19
+ files = added | modified
20
+ files.each do |filename|
21
+ handle_on_message(input, output, filename)
22
+ end
23
+ end
24
+ end
25
+
26
+ end
27
+ end
28
+ end
@@ -0,0 +1,77 @@
1
+ require 'listen'
2
+
3
+ module DataCollector
4
+ class Input
5
+ class Generic
6
+ def initialize(uri, options)
7
+ @uri = uri
8
+ @options = options
9
+
10
+ @input = DataCollector::Input.new
11
+ @output = DataCollector::Output.new
12
+
13
+ @listener = create_listener
14
+ end
15
+
16
+ def run(should_block = false, &block)
17
+ raise DataCollector::Error, 'Please supply a on_message block' if @on_message_callback.nil?
18
+ @listener.start
19
+
20
+ if should_block
21
+ while running?
22
+ yield block if block_given?
23
+ sleep 2
24
+ end
25
+ else
26
+ yield block if block_given?
27
+ end
28
+
29
+ end
30
+
31
+ def stop
32
+ @listener.stop
33
+ end
34
+
35
+ def pause
36
+ @listener.pause
37
+ end
38
+
39
+ def running?
40
+ @listener.running?
41
+ end
42
+
43
+ def stopped?
44
+ @listener.stopped?
45
+ end
46
+
47
+ def paused?
48
+ @listener.paused?
49
+ end
50
+
51
+ def on_message(&block)
52
+ @on_message_callback = block
53
+ end
54
+
55
+ private
56
+
57
+ def create_listener
58
+ raise DataCollector::Error, 'Please implement a listener'
59
+ end
60
+
61
+ def handle_on_message(input, output, data)
62
+ if (callback = @on_message_callback)
63
+ timing = Time.now
64
+ begin
65
+ callback.call(input, output, data)
66
+ rescue StandardError => e
67
+ DataCollector::Core.error("INPUT #{e.message}")
68
+ puts e.backtrace.join("\n")
69
+ ensure
70
+ DataCollector::Core.log("INPUT ran for #{((Time.now.to_f - timing.to_f).to_f * 1000.0).to_i}ms")
71
+ end
72
+ end
73
+ end
74
+
75
+ end
76
+ end
77
+ end
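The generic input above stores one `on_message` block and later invokes it inside a `begin/rescue/ensure`, so a failing callback is reported instead of killing the listener. The following is a minimal stand-alone sketch of that pattern. The class name `CallbackRunner`, its `errors` list and `last_duration_ms` reader are assumptions made for illustration; the gem itself logs through `DataCollector::Core`.

```ruby
# Hypothetical class showing the register-then-dispatch callback pattern.
class CallbackRunner
  attr_reader :errors, :last_duration_ms

  def initialize
    @errors = []
  end

  # Remember the block; it is called on every later #handle.
  def on_message(&block)
    @on_message_callback = block
  end

  # Call the stored block, collecting errors and timing the call.
  def handle(*args)
    callback = @on_message_callback
    return nil unless callback

    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    begin
      callback.call(*args)
    rescue StandardError => e
      @errors << e.message
      nil
    ensure
      elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
      @last_duration_ms = (elapsed * 1000).to_i
    end
  end
end

runner = CallbackRunner.new
runner.on_message { |filename| "seen #{filename}" }
puts runner.handle('record.xml')
```

Keeping the rescue inside the dispatcher means one bad message does not stop later messages from being handled.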
@@ -0,0 +1,60 @@
1
+ require_relative 'generic'
2
+ require 'bunny'
3
+ require 'active_support/core_ext/hash'
4
+
5
+ module DataCollector
6
+ class Input
7
+ class Queue < Generic
8
+ def initialize(uri, options)
9
+ super
10
+
11
+ if running?
12
+ create_channel unless @channel
13
+ create_queue unless @queue
14
+ end
15
+ end
16
+
17
+ def running?
18
+ @listener.open?
19
+ end
20
+
21
+ def send(message)
22
+ if running?
23
+ @queue.publish(message)
24
+ end
25
+ end
26
+
27
+ private
28
+
29
+ def create_listener
30
+ @listener ||= begin
31
+ connection = Bunny.new(@uri.to_s)
32
+ connection.start
33
+
34
+ connection
35
+ rescue StandardError => e
36
+ raise DataCollector::Error, "Unable to connect to RabbitMQ. #{e.message}"
37
+ end
38
+ end
39
+
40
+ def create_channel
41
+ raise DataCollector::Error, 'Connection to RabbitMQ is closed' if @listener.closed?
42
+ @channel ||= @listener.create_channel
43
+ end
44
+
45
+ def create_queue
46
+ @queue ||= begin
47
+ options = CGI.parse(@uri.query).with_indifferent_access
48
+ raise DataCollector::Error, '"channel" query parameter missing from uri.' unless options.include?(:channel)
49
+ queue = @channel.queue(options[:channel].first)
50
+
51
+ queue.subscribe do |delivery_info, metadata, payload|
52
+ handle_on_message(input, output, payload)
53
+ end if queue
54
+
55
+ queue
56
+ end
57
+ end
58
+ end
59
+ end
60
+ end
@@ -12,6 +12,8 @@ require 'active_support/core_ext/hash'
12
12
  require 'zlib'
13
13
  require 'minitar'
14
14
  require 'csv'
15
+ require_relative 'input/dir'
16
+ require_relative 'input/queue'
15
17
 
16
18
  #require_relative 'ext/xml_utility_node'
17
19
  module DataCollector
@@ -34,7 +36,13 @@ module DataCollector
34
36
  when 'https'
35
37
  data = from_https(uri, options)
36
38
  when 'file'
37
- data = from_file(uri, options)
39
+ if File.directory?("#{uri.host}/#{uri.path}")
40
+ return from_dir(uri, options)
41
+ else
42
+ data = from_file(uri, options)
43
+ end
44
+ when 'amqp'
45
+ data = from_queue(uri,options)
38
46
  else
39
47
  raise "Do not know how to process #{source}"
40
48
  end
@@ -61,7 +69,10 @@ module DataCollector
61
69
 
62
70
  def from_https(uri, options = {})
63
71
  data = nil
64
- HTTP.default_options = HTTP::Options.new(features: { logging: { logger: @logger } })
72
+ if options.with_indifferent_access.include?(:logging) && options.with_indifferent_access[:logging]
73
+ HTTP.default_options = HTTP::Options.new(features: { logging: { logger: @logger } })
74
+ end
75
+
65
76
  http = HTTP
66
77
 
67
78
  #http.use(logging: {logger: @logger})
@@ -157,6 +168,14 @@ module DataCollector
157
168
  data
158
169
  end
159
170
 
171
+ def from_dir(uri, options = {})
172
+ DataCollector::Input::Dir.new(uri, options)
173
+ end
174
+
175
+ def from_queue(uri, options = {})
176
+ DataCollector::Input::Queue.new(uri, options)
177
+ end
178
+
160
179
  def xml_to_hash(data)
161
180
  #gsub('&lt;\/', '&lt; /') outherwise wrong XML-parsing (see records lirias1729192 )
162
181
  data = data.gsub /&lt;/, '&lt; /'
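The input hunks above dispatch on the URI scheme and build a local path from `uri.host` and `uri.path`, which is needed because in `file://this/is/directory` the first path segment is parsed as the host. A rough sketch of that dispatch using only Ruby's standard `uri` library follows. The method name `describe_source` and the returned hashes are assumptions made for illustration; the gem returns parsed data or listener objects instead.

```ruby
require 'uri'

# Hypothetical helper: report how a source string would be routed.
def describe_source(source)
  uri = URI.parse(source)
  case uri.scheme
  when 'http', 'https'
    { kind: :http, host: uri.host }
  when 'file'
    # "file://hello.txt" parses with host "hello.txt" and an empty path.
    local_path = "#{uri.host}#{uri.path}"
    { kind: File.directory?(local_path) ? :directory : :file, path: local_path }
  when 'amqp'
    # The queue name travels in the "channel" query parameter.
    params = URI.decode_www_form(uri.query.to_s).to_h
    { kind: :queue, host: uri.host, channel: params['channel'] }
  else
    raise ArgumentError, "Do not know how to process #{source}"
  end
end

p describe_source('amqp://user:password@localhost?channel=hello')
p describe_source('file://this/is/directory')
```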
@@ -38,8 +38,10 @@ module DataCollector
38
38
  data[k] << v
39
39
  end
40
40
  else
41
- t = data[k]
42
- data[k] = Array.new([t, v])
41
+ data[k] = v
42
+ # HELP: why am I creating an array here?
43
+ # t = data[k]
44
+ # data[k] = Array.new([t, v])
43
45
  end
44
46
  else
45
47
  data[k] = v
@@ -152,7 +154,6 @@ module DataCollector
152
154
  result
153
155
  rescue Exception => e
154
156
  raise "unable to transform to text: #{e.message}"
155
- ""
156
157
  end
157
158
 
158
159
  def to_tmp_file(erb_file, records_dir)
@@ -0,0 +1,91 @@
1
+ require 'iso8601'
2
+
3
+ module DataCollector
4
+ class Pipeline
5
+ attr_reader :run_count, :name
6
+ def initialize(options = {})
7
+ @running = false
8
+ @paused = false
9
+
10
+ @input = DataCollector::Input.new
11
+ @output = DataCollector::Output.new
12
+ @run_count = 0
13
+
14
+ @schedule = options[:schedule] || {}
15
+ @name = options[:name] || "#{Time.now.to_i}-#{rand(10000)}"
16
+ end
17
+
18
+ def on_message(&block)
19
+ @on_message_callback = block
20
+ end
21
+
22
+ def run
23
+ if paused? && @running
24
+ @paused = false
25
+ end
26
+
27
+ @running = true
28
+ if @schedule && !@schedule.empty?
29
+ while running?
30
+ @run_count += 1
31
+ start_time = ISO8601::DateTime.new(Time.now.to_datetime.to_s)
32
+ begin
33
+ duration = ISO8601::Duration.new(@schedule)
34
+ rescue StandardError => e
35
+ raise DataCollector::Error, "PIPELINE - bad schedule: #{e.message}"
36
+ end
37
+ interval = ISO8601::TimeInterval.from_duration(start_time, duration)
38
+
39
+ DataCollector::Core.log("PIPELINE running in #{interval.size} seconds")
40
+ sleep interval.size
41
+ handle_on_message(@input, @output) unless paused?
42
+ end
43
+ else # run once
44
+ @run_count += 1
45
+ DataCollector::Core.log("PIPELINE running once")
46
+ handle_on_message(@input, @output)
47
+ end
48
+ rescue StandardError => e
49
+ DataCollector::Core.error("PIPELINE run failed: #{e.message}")
50
+ raise e
51
+ #puts e.backtrace.join("\n")
52
+ end
53
+
54
+ def stop
55
+ @running = false
56
+ @paused = false
57
+ end
58
+
59
+ def pause
60
+ @paused = !@paused if @running
61
+ end
62
+
63
+ def running?
64
+ @running
65
+ end
66
+
67
+ def stopped?
68
+ !@running
69
+ end
70
+
71
+ def paused?
72
+ @paused
73
+ end
74
+
75
+ private
76
+
77
+ def handle_on_message(input, output)
78
+ if (callback = @on_message_callback)
79
+ timing = Time.now
80
+ begin
81
+ callback.call(input, output)
82
+ rescue StandardError => e
83
+ DataCollector::Core.error("PIPELINE #{e.message}")
84
+ ensure
85
+ DataCollector::Core.log("PIPELINE ran for #{((Time.now.to_f - timing.to_f).to_f * 1000.0).to_i}ms")
86
+ end
87
+ end
88
+ end
89
+
90
+ end
91
+ end
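The pipeline turns its `schedule` option, an ISO 8601 duration such as `PT10M`, into a number of seconds to sleep between runs, and it relies on the `iso8601` gem to do so. To make the format concrete, here is a simplified plain-Ruby sketch that handles only days, hours, minutes and whole seconds. The constant and method names are assumptions made for illustration, and the sketch is not a replacement for the gem's parser.

```ruby
# Simplified pattern: P[nD][T[nH][nM][nS]]; years, months, weeks and fractions are not handled.
SIMPLE_DURATION = /\AP(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?\z/

# Hypothetical helper: convert a simple ISO 8601 duration to seconds.
def duration_to_seconds(text)
  match = SIMPLE_DURATION.match(text)
  if match.nil? || match.captures.compact.empty?
    raise ArgumentError, "bad schedule: #{text}"
  end

  # Missing components are nil and nil.to_i is 0.
  days, hours, minutes, seconds = match.captures.map(&:to_i)
  days * 86_400 + hours * 3_600 + minutes * 60 + seconds
end

puts duration_to_seconds('PT10M') # => 600
puts duration_to_seconds('PT5S')  # => 5
```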
@@ -1,130 +1,9 @@
1
- require 'logger'
1
+ require_relative 'rules_ng'
2
2
 
3
3
  module DataCollector
4
- class Rules
5
- def initialize()
6
- @logger = Logger.new(STDOUT)
4
+ class Rules < RulesNg
5
+ def initialize(logger = Logger.new(STDOUT))
6
+ super
7
7
  end
8
-
9
- def run(rule_map, from_record, to_record, options = {})
10
- rule_map.each do |map_to_key, rule|
11
- if rule.is_a?(Array)
12
- rule.each do |sub_rule|
13
- apply_rule(map_to_key, sub_rule, from_record, to_record, options)
14
- end
15
- else
16
- apply_rule(map_to_key, rule, from_record, to_record, options)
17
- end
18
- end
19
-
20
- to_record.each do |element|
21
- element = element.delete_if do |k, v|
22
- v != false && (v.nil?)
23
- end
24
- end
25
- end
26
-
27
- private
28
-
29
- def apply_rule(map_to_key, rule, from_record, to_record, options = {})
30
- if rule.has_key?('text')
31
- suffix = (rule && rule.key?('options') && rule['options'].key?('suffix')) ? rule['options']['suffix'] : ''
32
- to_record << { map_to_key.to_sym => add_suffix(rule['text'], suffix) }
33
- elsif rule.has_key?('options') && rule['options'].has_key?('convert') && rule['options']['convert'].eql?('each')
34
- result = get_value_for(map_to_key, rule['filter'], from_record, rule['options'], options)
35
-
36
- if result.is_a?(Array)
37
- result.each do |m|
38
- to_record << {map_to_key.to_sym => m}
39
- end
40
- else
41
- to_record << {map_to_key.to_sym => result}
42
- end
43
- else
44
- result = get_value_for(map_to_key, rule['filter'], from_record, rule['options'], options)
45
- return if result && result.empty?
46
-
47
- to_record << {map_to_key.to_sym => result}
48
- end
49
- end
50
-
51
- def get_value_for(tag_key, filter_path, record, rule_options = {}, options = {})
52
- data = nil
53
- if record
54
- if filter_path.is_a?(Array) && !record.is_a?(Array)
55
- record = [record]
56
- end
57
-
58
- data = Core::filter(record, filter_path)
59
-
60
- if data && rule_options
61
- if rule_options.key?('convert')
62
- case rule_options['convert']
63
- when 'time'
64
- result = []
65
- data = [data] unless data.is_a?(Array)
66
- data.each do |d|
67
- result << Time.parse(d)
68
- end
69
- data = result
70
- when 'map'
71
- if data.is_a?(Array)
72
- data = data.map do |r|
73
- rule_options['map'][r] if rule_options['map'].key?(r)
74
- end
75
-
76
- data.compact!
77
- data.flatten! if rule_options.key?('flatten') && rule_options['flatten']
78
- else
79
- return rule_options['map'][data] if rule_options['map'].key?(data)
80
- end
81
- when 'each'
82
- data = [data] unless data.is_a?(Array)
83
- if options.empty?
84
- data = data.map { |d| rule_options['lambda'].call(d) }
85
- else
86
- data = data.map { |d| rule_options['lambda'].call(d, options) }
87
- end
88
- data.flatten! if rule_options.key?('flatten') && rule_options['flatten']
89
- when 'call'
90
- if options.empty?
91
- data = rule_options['lambda'].call(data)
92
- else
93
- data = rule_options['lambda'].call(data, options)
94
- end
95
- return data
96
- end
97
- end
98
-
99
- if rule_options.key?('suffix')
100
- data = add_suffix(data, rule_options['suffix'])
101
- end
102
-
103
- end
104
-
105
- end
106
-
107
- return data
108
- end
109
-
110
- def add_suffix(data, suffix)
111
- case data.class.name
112
- when 'Array'
113
- result = []
114
- data.each do |d|
115
- result << add_suffix(d, suffix)
116
- end
117
- data = result
118
- when 'Hash'
119
- data.each do |k, v|
120
- data[k] = add_suffix(v, suffix)
121
- end
122
- else
123
- data = data.to_s
124
- data += suffix
125
- end
126
- data
127
- end
128
-
129
8
  end
130
- end
9
+ end
@@ -0,0 +1,130 @@
1
+ require 'logger'
2
+
3
+ module DataCollector
4
+ class Rules
5
+ def initialize()
6
+ @logger = Logger.new(STDOUT)
7
+ end
8
+
9
+ def run(rule_map, from_record, to_record, options = {})
10
+ rule_map.each do |map_to_key, rule|
11
+ if rule.is_a?(Array)
12
+ rule.each do |sub_rule|
13
+ apply_rule(map_to_key, sub_rule, from_record, to_record, options)
14
+ end
15
+ else
16
+ apply_rule(map_to_key, rule, from_record, to_record, options)
17
+ end
18
+ end
19
+
20
+ to_record.each do |element|
21
+ element = element.delete_if do |k, v|
22
+ v != false && (v.nil?)
23
+ end
24
+ end
25
+ end
26
+
27
+ private
28
+
29
+ def apply_rule(map_to_key, rule, from_record, to_record, options = {})
30
+ if rule.has_key?('text')
31
+ suffix = (rule && rule.key?('options') && rule['options'].key?('suffix')) ? rule['options']['suffix'] : ''
32
+ to_record << { map_to_key.to_sym => add_suffix(rule['text'], suffix) }
33
+ elsif rule.has_key?('options') && rule['options'].has_key?('convert') && rule['options']['convert'].eql?('each')
34
+ result = get_value_for(map_to_key, rule['filter'], from_record, rule['options'], options)
35
+
36
+ if result.is_a?(Array)
37
+ result.each do |m|
38
+ to_record << {map_to_key.to_sym => m}
39
+ end
40
+ else
41
+ to_record << {map_to_key.to_sym => result}
42
+ end
43
+ else
44
+ result = get_value_for(map_to_key, rule['filter'], from_record, rule['options'], options)
45
+ return if result && result.empty?
46
+
47
+ to_record << {map_to_key.to_sym => result}
48
+ end
49
+ end
50
+
51
+ def get_value_for(tag_key, filter_path, record, rule_options = {}, options = {})
52
+ data = nil
53
+ if record
54
+ if filter_path.is_a?(Array) && !record.is_a?(Array)
55
+ record = [record]
56
+ end
57
+
58
+ data = Core::filter(record, filter_path)
59
+
60
+ if data && rule_options
61
+ if rule_options.key?('convert')
62
+ case rule_options['convert']
63
+ when 'time'
64
+ result = []
65
+ data = [data] unless data.is_a?(Array)
66
+ data.each do |d|
67
+ result << Time.parse(d)
68
+ end
69
+ data = result
70
+ when 'map'
71
+ if data.is_a?(Array)
72
+ data = data.map do |r|
73
+ rule_options['map'][r] if rule_options['map'].key?(r)
74
+ end
75
+
76
+ data.compact!
77
+ data.flatten! if rule_options.key?('flatten') && rule_options['flatten']
78
+ else
79
+ return rule_options['map'][data] if rule_options['map'].key?(data)
80
+ end
81
+ when 'each'
82
+ data = [data] unless data.is_a?(Array)
83
+ if options.empty?
84
+ data = data.map { |d| rule_options['lambda'].call(d) }
85
+ else
86
+ data = data.map { |d| rule_options['lambda'].call(d, options) }
87
+ end
88
+ data.flatten! if rule_options.key?('flatten') && rule_options['flatten']
89
+ when 'call'
90
+ if options.empty?
91
+ data = rule_options['lambda'].call(data)
92
+ else
93
+ data = rule_options['lambda'].call(data, options)
94
+ end
95
+ return data
96
+ end
97
+ end
98
+
99
+ if rule_options.key?('suffix')
100
+ data = add_suffix(data, rule_options['suffix'])
101
+ end
102
+
103
+ end
104
+
105
+ end
106
+
107
+ return data
108
+ end
109
+
110
+ def add_suffix(data, suffix)
111
+ case data.class.name
112
+ when 'Array'
113
+ result = []
114
+ data.each do |d|
115
+ result << add_suffix(d, suffix)
116
+ end
117
+ data = result
118
+ when 'Hash'
119
+ data.each do |k, v|
120
+ data[k] = add_suffix(v, suffix)
121
+ end
122
+ else
123
+ data = data.to_s
124
+ data += suffix
125
+ end
126
+ data
127
+ end
128
+
129
+ end
130
+ end
@@ -53,28 +53,38 @@ module DataCollector
53
53
 
54
54
  output_data << {tag.to_sym => data} unless data.nil? || (data.is_a?(Array) && data.empty?)
55
55
  rescue StandardError => e
56
- puts "error running rule '#{tag}'\n\t#{e.message}"
57
- puts e.backtrace.join("\n")
56
+ # puts "error running rule '#{tag}'\n\t#{e.message}"
57
+ # puts e.backtrace.join("\n")
58
+ raise DataCollector::Error, "error running rule '#{tag}'\n\t#{e.message}"
58
59
  end
59
60
 
60
61
  def apply_filtered_data_on_payload(input_data, payload, options = {})
61
62
  return nil if input_data.nil?
62
63
 
64
+ normalized_options = options.select{|k,v| k !~ /^_/ }.with_indifferent_access
63
65
  output_data = nil
64
66
  case payload.class.name
65
67
  when 'Proc'
66
68
  data = input_data.is_a?(Array) ? input_data : [input_data]
67
- output_data = if options.empty?
69
+ output_data = if normalized_options.empty?
68
70
  data.map { |d| payload.call(d) }
69
71
  else
70
- data.map { |d| payload.call(d, options) }
72
+ data.map { |d| payload.call(d, normalized_options) }
71
73
  end
72
74
  when 'Hash'
73
75
  input_data = [input_data] unless input_data.is_a?(Array)
74
76
  if input_data.is_a?(Array)
75
77
  output_data = input_data.map do |m|
76
78
  if payload.key?('suffix')
77
- "#{m}#{payload['suffix']}"
79
+ if (m.is_a?(Hash))
80
+ m.transform_values{|v| v.is_a?(String) ? "#{v}#{payload['suffix']}" : v}
81
+ elsif m.is_a?(Array)
82
+ m.map{|n| n.is_a?(String) ? "#{n}#{payload['suffix']}": n}
83
+ elsif m.methods.include?(:to_s)
84
+ "#{m}#{payload['suffix']}"
85
+ else
86
+ m
87
+ end
78
88
  else
79
89
  payload[m]
80
90
  end
@@ -83,7 +93,7 @@ module DataCollector
83
93
  when 'Array'
84
94
  output_data = input_data
85
95
  payload.each do |p|
86
- output_data = apply_filtered_data_on_payload(output_data, p, options)
96
+ output_data = apply_filtered_data_on_payload(output_data, p, normalized_options)
87
97
  end
88
98
  else
89
99
  output_data = [input_data]
@@ -97,12 +107,16 @@ module DataCollector
97
107
  output_data = output_data.first
98
108
  end
99
109
 
100
- if options.key?('_no_array_with_one_element') && options['_no_array_with_one_element'] &&
110
+ if options.with_indifferent_access.key?('_no_array_with_one_element') && options.with_indifferent_access['_no_array_with_one_element'] &&
101
111
  output_data.is_a?(Array) && output_data.size == 1
102
112
  output_data = output_data.first
103
113
  end
104
114
 
105
115
  output_data
116
+ rescue StandardError => e
117
+ # puts "error applying filtered data on payload'#{payload.to_json}'\n\t#{e.message}"
118
+ # puts e.backtrace.join("\n")
119
+ raise DataCollector::Error, "error applying filtered data on payload'#{payload.to_json}'\n\t#{e.message}"
106
120
  end
107
121
 
108
122
  def json_path_filter(filter, input_data)
@@ -111,6 +125,10 @@ module DataCollector
111
125
  return input_data if input_data.is_a?(String)
112
126
 
113
127
  Core.filter(input_data, filter)
128
+ rescue StandardError => e
129
+ puts "error running filter '#{filter}'\n\t#{e.message}"
130
+ puts e.backtrace.join("\n")
131
+ raise DataCollector::Error, "error running filter '#{filter}'\n\t#{e.message}"
114
132
  end
115
133
  end
116
134
  end
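Two behaviours changed in the rules engine hunks above: options whose key starts with an underscore are treated as engine directives and are no longer passed to rule lambdas, and a `suffix` payload is now applied per value inside a Hash or Array instead of to the stringified container. The sketch below illustrates both in plain Ruby. The method names `split_options` and `add_suffix` are assumptions made for illustration and do not exist under these names in the gem.

```ruby
# Hypothetical helper: separate engine directives (keys starting with "_") from static rule data.
def split_options(options)
  directives, static = options.partition { |key, _value| key.to_s.start_with?('_') }
  [directives.to_h, static.to_h]
end

# Hypothetical helper: append a suffix to strings, descending one level into Hashes and Arrays.
def add_suffix(value, suffix)
  case value
  when Hash
    value.transform_values { |v| v.is_a?(String) ? "#{v}#{suffix}" : v }
  when Array
    value.map { |v| v.is_a?(String) ? "#{v}#{suffix}" : v }
  else
    "#{value}#{suffix}"
  end
end

directives, static = split_options({ _no_array_with_one_element: true, 'lang' => 'en' })
p directives
p static
p add_suffix({ 'name' => 'water', 'count' => 2 }, '-suffix')
```

Non-string values inside a Hash or Array are left untouched, which matches the type checks in the hunk.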
@@ -29,7 +29,6 @@ module DataCollector
        puts e.message
        puts e.backtrace.join("\n")
      ensure
-       # output.tar_file.close unless output.tar_file.closed?
        @logger.info("Finished in #{((Time.now - @time_start)*1000).to_i} ms")
      end

@@ -1,4 +1,4 @@
  # encoding: utf-8
  module DataCollector
-   VERSION = "0.17.0"
+   VERSION = "0.18.0"
  end
@@ -4,6 +4,7 @@ require 'logger'

  require 'data_collector/version'
  require 'data_collector/runner'
+ require 'data_collector/pipeline'
  require 'data_collector/ext/xml_utility_node'

  module DataCollector
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: data_collector
  version: !ruby/object:Gem::Version
-   version: 0.17.0
+   version: 0.18.0
  platform: ruby
  authors:
  - Mehmet Celik
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2023-03-16 00:00:00.000000000 Z
+ date: 2023-04-18 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: activesupport
@@ -114,14 +114,14 @@ dependencies:
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: '1.13'
+         version: '1.14'
    type: :runtime
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: '1.13'
+         version: '1.14'
  - !ruby/object:Gem::Dependency
    name: nori
    requirement: !ruby/object:Gem::Requirement
@@ -136,6 +136,48 @@ dependencies:
      - - "~>"
        - !ruby/object:Gem::Version
          version: '2.6'
+ - !ruby/object:Gem::Dependency
+   name: iso8601
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.13'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.13'
+ - !ruby/object:Gem::Dependency
+   name: listen
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '3.8'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '3.8'
+ - !ruby/object:Gem::Dependency
+   name: bunny
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '2.20'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '2.20'
  - !ruby/object:Gem::Dependency
    name: bundler
    requirement: !ruby/object:Gem::Requirement
@@ -156,14 +198,14 @@ dependencies:
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: '5.16'
+         version: '5.18'
    type: :development
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: '5.16'
+         version: '5.18'
  - !ruby/object:Gem::Dependency
    name: rake
    requirement: !ruby/object:Gem::Requirement
@@ -208,13 +250,19 @@ files:
  - bin/console
  - bin/setup
  - data_collector.gemspec
+ - examples/marc.rb
  - lib/data_collector.rb
  - lib/data_collector/config_file.rb
  - lib/data_collector/core.rb
  - lib/data_collector/ext/xml_utility_node.rb
  - lib/data_collector/input.rb
+ - lib/data_collector/input/dir.rb
+ - lib/data_collector/input/generic.rb
+ - lib/data_collector/input/queue.rb
  - lib/data_collector/output.rb
+ - lib/data_collector/pipeline.rb
  - lib/data_collector/rules.rb
+ - lib/data_collector/rules.rb.depricated
  - lib/data_collector/rules_ng.rb
  - lib/data_collector/runner.rb
  - lib/data_collector/version.rb
@@ -240,7 +288,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
- rubygems_version: 3.1.6
+ rubygems_version: 3.4.10
  signing_key:
  specification_version: 4
  summary: ETL helper library
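
Review note: the three runtime dependencies added in this release (iso8601, listen, bunny) all use the pessimistic `~>` operator. As a rough illustration of which versions those constraints admit, the sketch below checks a few candidate versions with RubyGems' own `Gem::Requirement`; the `satisfies?` helper and the candidate version numbers are made up for the example.

```ruby
require 'rubygems' # ships with Ruby; provides Gem::Requirement and Gem::Version

# Constraints copied from the added runtime dependencies in the metadata diff.
constraints = { 'iso8601' => '~> 0.13', 'listen' => '~> 3.8', 'bunny' => '~> 2.20' }

# Hypothetical helper: does `version` satisfy `constraint`?
def satisfies?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

satisfies?(constraints['listen'], '3.9.0')   # => true  (>= 3.8 and < 4.0)
satisfies?(constraints['listen'], '4.0.0')   # => false (next major is excluded)
satisfies?(constraints['iso8601'], '0.13.0') # => true
satisfies?(constraints['iso8601'], '1.0.0')  # => false (~> 0.13 stops before 1.0)
satisfies?(constraints['bunny'], '2.19.0')   # => false (below the 2.20 floor)
```

With a two-segment constraint such as `~> 3.8`, the last listed segment may grow but the one before it may not, so minor and patch updates are accepted while the next major version is rejected.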