data_collector 0.17.0 → 0.19.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 351bb040c33b9010903681a117df29f02ba3663e440ed30b37520d5a8aa98b30
4
- data.tar.gz: a307677b46ecce478fed2206cf5e9173b67bcec8a8b772ca1f12d71fd33a6fad
3
+ metadata.gz: 35d57ff2998ab1343a4d6e906bcd76bd67951a0eae9a6db69387e4de7dbba285
4
+ data.tar.gz: 702a6447c28533d2dcdce237cc209417963ea2827eaed4f4ad0ab56a62c42783
5
5
  SHA512:
6
- metadata.gz: 6298f438cf8030be76ac85f9652aead672dedd42c6f5c6324b6004bae175442bb9565e31ec3aee430da601e3bcdc43eab08dbf19e9cfb89fe54f00ef61758325
7
- data.tar.gz: 7b77100e03002764c58e4f2b0d3c962b445422fe7109aa14f0ac7f389628399392db0e172fe7141a35967a6f87c26cce2a8ee7865084454bb9e5587550085ae7
6
+ metadata.gz: 80e487e0d8bfa19cec43a607b3c58698c37e23fd6385be3102d3ca87584348d585241f1794f848c460c02751f29f3a28d1365c472b8e3a532a922f9104fb2e06
7
+ data.tar.gz: 0366f4350e54e1bf985f68d3d0532b6fb00394aad23a3641b07fd65594e7e0b16ddf5da90574251c75b83c63fe09455aa988b7395706fe76e995b17fc79cb2fe
data/README.md CHANGED
@@ -1,39 +1,91 @@
1
1
  # DataCollector
2
- Convenience module to Extract, Transform and Load your data.
3
- You have main objects that help you to 'INPUT', 'OUTPUT' and 'FILTER' data. The basic ETL components.
4
- Support objects like CONFIG, LOG, RULES and the new RULES_NG just to make life easier.
2
+ Convenience module to Extract, Transform and Load your data in a Pipeline.
3
+ The 'INPUT', 'OUTPUT' and 'FILTER' objects help you read, transform and output your data.
4
+ Support objects like CONFIG, LOG, ERROR and RULES help you write manageable rules to transform and log your data.
5
+ Including the DataCollector::Core module in your application gives you access to these objects.
6
+ ```ruby
7
+ include DataCollector::Core
8
+ ```
5
9
 
6
- Including the DataCollector::Core module into your application gives you access to these objects.
7
-
8
- The RULES and RULES_NG objects work in a very simple concept. Rules exist of 3 components:
9
- - a destination tag
10
- - a jsonpath filter to get the data
11
- - a lambda to execute on every filter hit
10
+ Every object can be used on its own.
11
+
12
+
13
+ #### Pipeline
14
+ Allows you to create a simple pipeline of operations that collects, processes and transforms data, and then transfers it to various systems and applications.
15
+
16
+ You can schedule a pipeline, specifying how often it should be
17
+ executed in the [ISO8601 duration format](https://www.digi.com/resources/documentation/digidocs//90001488-13/reference/r_iso_8601_duration_format.htm). The processing logic then runs on that interval.
18
+ ###### methods:
19
+ - .new(options): options can be a schedule in [ISO8601 duration format](https://www.digi.com/resources/documentation/digidocs//90001488-13/reference/r_iso_8601_duration_format.htm) and a name
20
+ - .run: start the pipeline; blocks if a schedule is supplied
21
+ - .stop: stop the pipeline
22
+ - .pause: pause the pipeline. Restart using .run
23
+ - .running?: is pipeline running
24
+ - .stopped?: is pipeline not running
25
+ - .paused?: is pipeline paused
26
+ - .name: name of the pipeline
27
+ - .run_count: number of times the pipeline has run
28
+ - .on_message: handler run every time a trigger event happens
29
+ ###### example:
30
+ ```ruby
31
+ # create a pipeline scheduled to run every 10 minutes
32
+ pipeline = Pipeline.new(schedule: 'PT10M')
33
+
34
+ pipeline.on_message do |input, output|
35
+ # logic
36
+ end
12
37
 
38
+ pipeline.run
39
+ ```
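The schedule string above ('PT10M') is an ISO8601 duration. As a rough, stdlib-only sketch of how such a duration maps to an interval in seconds (the gem itself delegates this to the iso8601 library; `duration_to_seconds` is a hypothetical helper for illustration):

```ruby
# Minimal ISO8601 duration parser covering the common PnDTnHnMnS forms.
# Illustration only; the gem uses ISO8601::Duration from the iso8601 gem.
def duration_to_seconds(duration)
  m = duration.match(/\AP(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?\z/)
  raise ArgumentError, "bad duration: #{duration}" unless m

  # nil captures coerce to 0 via NilClass#to_i
  days, hours, minutes, seconds = m.captures.map(&:to_i)
  days * 86_400 + hours * 3_600 + minutes * 60 + seconds
end

duration_to_seconds('PT10M') # => 600
```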
13
40
 
14
- #### input
15
- Read input from an URI. This URI can have a http, https or file scheme
41
+ #### input
42
+ The input component is part of the processing logic. All data is converted into a Hash, Array, ... accessible using plain Ruby or, via the filter object, JSONPath.
43
+ The input component can fetch data from various URIs, such as files, URLs, directories, queues, ...
44
+ For a push input component, a listener is created with a processing logic block that is executed whenever new data is available.
45
+ A push happens when new data is created in a directory, message queue, ...
16
46
 
17
- **Public methods**
18
47
  ```ruby
19
48
  from_uri(source, options = {:raw, :content_type})
20
49
  ```
21
- - source: an uri with a scheme of http, https, file
50
+ - source: a URI with a scheme of http, https, file, amqp
22
51
  - options:
23
52
  - raw: _boolean_ do not parse
24
53
  - content_type: _string_ force a content_type if the 'Content-Type' returned by the http server is incorrect
25
54
 
26
- example:
55
+ ###### example:
27
56
  ```ruby
57
+ # read from an http endpoint
28
58
  input.from_uri("http://www.libis.be")
29
59
  input.from_uri("file://hello.txt")
30
60
  input.from_uri("http://www.libis.be/record.jsonld", content_type: 'application/ld+json')
31
- ```
32
61
 
62
+ # read data from a RabbitMQ queue
63
+ listener = input.from_uri('amqp://user:password@localhost?channel=hello')
64
+ listener.on_message do |input, output, message|
65
+ puts message
66
+ end
67
+ listener.start
68
+
69
+ # read data from a directory
70
+ listener = input.from_uri('file://this/is/directory')
71
+ listener.on_message do |input, output, filename|
72
+ puts filename
73
+ end
74
+ listener.start
75
+ ```
33
76
 
77
+ Inputs can be JSON, XML or CSV, or XML in a TAR.GZ file
34
78
 
79
+ ###### listener from input.from_uri(directory|message queue)
80
+ When a listener is defined that is triggered by an event (PUSH), like a message queue or files written to a directory, you have these extra methods.
35
81
 
36
- Inputs can be JSON, XML or CSV or XML in a TAR.GZ file
82
+ - .run: start the listener; blocks if a schedule is supplied
83
+ - .stop: stop the listener
84
+ - .pause: pause the listener. Restart using .run
85
+ - .running?: is listener running
86
+ - .stopped?: is listener not running
87
+ - .paused?: is listener paused
88
+ - .on_message: handler run every time a trigger event happens
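The lifecycle methods above behave like a small state machine. A stdlib-only sketch of the intended semantics (hypothetical `Listener` class for illustration, not the gem's implementation):

```ruby
# Minimal state machine mirroring the listener lifecycle flags.
class Listener
  def initialize
    @running = false
    @paused = false
  end

  # start, or restart after a pause
  def run
    @running = true
    @paused = false
  end

  # pausing only makes sense while running
  def pause
    @paused = true if @running
  end

  def stop
    @running = false
    @paused = false
  end

  def running?
    @running && !@paused
  end

  def paused?
    @paused
  end

  def stopped?
    !@running
  end
end
```

Calling .run after .pause resumes processing; .stop resets both flags.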
37
89
 
38
90
  ### output
39
91
  Output is an object you can store key/value pairs that needs to be written to an output stream.
@@ -45,7 +97,7 @@ Output is an object you can store key/value pairs that needs to be written to an
45
97
  Write output to a file, string use an ERB file as a template
46
98
  example:
47
99
  ___test.erb___
48
- ```ruby
100
+ ```erb
49
101
  <names>
50
102
  <combined><%= data[:name] %> <%= data[:last_name] %></combined>
51
103
  <%= print data, :name, :first_name %>
@@ -53,7 +105,7 @@ ___test.erb___
53
105
  </names>
54
106
  ```
55
107
  will produce
56
- ```ruby
108
+ ```html
57
109
  <names>
58
110
  <combined>John Doe</combined>
59
111
  <first_name>John</first_name>
@@ -97,41 +149,11 @@ filter data from a hash using [JSONPath](http://goessner.net/articles/JsonPath/i
97
149
  filtered_data = filter(data, "$..metadata.record")
98
150
  ```
99
151
 
100
- #### rules (depricated)
101
- See newer rules_ng object
102
- ~~Allows you to define a simple lambda structure to run against a JSONPath filter~~
103
-
104
- ~~A rule is made up of a Hash the key is the map key field its value is a Hash with a JSONPath filter and options to apply a convert method on the filtered results.~~
105
- ~~Available convert methods are: time, map, each, call, suffix, text~~
106
- ~~- time: parses a given time/date string into a Time object~~
107
- ~~- map: applies a mapping to a filter~~
108
- ~~- suffix: adds a suffix to a result~~
109
- ~~- call: executes a lambda on the filter~~
110
- ~~- each: runs a lambda on each row of a filter~~
111
- ~~- text: passthrough method. Returns value unchanged~~
112
-
113
- ~~example:~~
114
- ```ruby
115
- my_rules = {
116
- 'identifier' => {"filter" => '$..id'},
117
- 'language' => {'filter' => '$..lang',
118
- 'options' => {'convert' => 'map',
119
- 'map' => {'nl' => 'dut', 'fr' => 'fre', 'de' => 'ger', 'en' => 'eng'}
120
- }
121
- },
122
- 'subject' => {'filter' => '$..keywords',
123
- options' => {'convert' => 'each',
124
- 'lambda' => lambda {|d| d.split(',')}
125
- }
126
- },
127
- 'creationdate' => {'filter' => '$..published_date', 'convert' => 'time'}
128
- }
129
-
130
- rules.run(my_rules, record, output)
131
- ```
132
-
133
- #### rules_ng
134
- !!! not compatible with RULES object
152
+ #### rules
153
+ The RULES object is built around a simple concept. A rule consists of 3 components:
154
+ - a destination tag
155
+ - a jsonpath filter to get the data
156
+ - a lambda to execute on every filter hit
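As a plain-Ruby illustration of the concept — a simplified dot-path lookup stands in for the real JSONPath filter here (the gem uses the jsonpath library); `apply_rules` and the dot-path syntax are assumptions for this sketch:

```ruby
# A rule: destination tag => { filter => lambda executed on every hit }.
RULES = {
  'title' => { 'record.title' => ->(hit) { hit.upcase } }
}

# Simplified stand-in for a JSONPath filter: follow a dot-separated path.
def apply_rules(rules, record)
  rules.each_with_object({}) do |(tag, spec), out|
    spec.each do |path, fn|
      hit = path.split('.').reduce(record) { |data, key| data && data[key] }
      out[tag] = fn.call(hit) if hit
    end
  end
end

apply_rules(RULES, { 'record' => { 'title' => 'hello' } })
# => {"title"=>"HELLO"}
```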
135
157
 
136
158
  TODO: work in progress see test for examples on how to use
137
159
 
@@ -202,15 +224,15 @@ Here you find different rule combination that are possible
202
224
  }
203
225
  ```
204
226
 
205
- Here is an example on how to call last RULESET "rs_hash_with_json_filter_and_option".
206
227
 
207
- ***rules_ng.run*** can have 4 parameters. First 3 are mandatory. The last one ***options*** can hold data static to a rule set or engine directives.
228
+ ***rules.run*** can have 4 parameters. The first 3 are mandatory. The last one ***options*** can hold data static to a rule set or engine directives.
208
229
 
209
- List of engine directives:
230
+ ##### List of engine directives:
210
231
  - _no_array_with_one_element: defaults to false. if the result is an array with 1 element just return the element.
211
232
 
212
-
233
+ ###### example:
213
234
  ```ruby
235
+ # apply RULESET "rs_hash_with_json_filter_and_option" to data
214
236
  include DataCollector::Core
215
237
  output.clear
216
238
  data = {'subject' => ['water', 'thermodynamics']}
@@ -247,8 +269,11 @@ Log to stdout
247
269
  ```ruby
248
270
  log("hello world")
249
271
  ```
250
-
251
-
272
+ #### error
273
+ Log an error
274
+ ```ruby
275
+ error("if you have an issue take a tissue")
276
+ ```
252
277
  ## Example
253
278
  Input data ___test.csv___
254
279
  ```csv
@@ -315,7 +340,32 @@ Or install it yourself as:
315
340
 
316
341
  ## Usage
317
342
 
318
- TODO: Write usage instructions here
343
+ ```ruby
344
+ require 'data_collector'
345
+
346
+ include DataCollector::Core
347
+ # including core gives you pipeline, input, output, filter, config, log and error objects to work with
348
+ RULES = {
349
+ 'title' => '$..vertitle'
350
+ }
351
+ # create a PULL pipeline and schedule it to run every 5 seconds
352
+ pipeline = DataCollector::Pipeline.new(schedule: 'PT5S')
353
+
354
+ pipeline.on_message do |input, output|
355
+ data = input.from_uri('https://services3.libis.be/primo_artefact/lirias3611609')
356
+ rules.run(RULES, data, output)
357
+ #puts JSON.pretty_generate(input.raw)
358
+ puts JSON.pretty_generate(output.raw)
359
+ output.clear
360
+
361
+ if pipeline.run_count > 2
362
+ log('stopping pipeline after 3 runs')
363
+ pipeline.stop
364
+ end
365
+ end
366
+ pipeline.run
367
+
368
+ ```
319
369
 
320
370
  ## Development
321
371
 
@@ -43,11 +43,14 @@ Gem::Specification.new do |spec|
43
43
  spec.add_runtime_dependency 'jsonpath', '~> 1.1'
44
44
  spec.add_runtime_dependency 'mime-types', '~> 3.4'
45
45
  spec.add_runtime_dependency 'minitar', '= 0.9'
46
- spec.add_runtime_dependency 'nokogiri', '~> 1.13'
46
+ spec.add_runtime_dependency 'nokogiri', '~> 1.14'
47
47
  spec.add_runtime_dependency 'nori', '~> 2.6'
48
+ spec.add_runtime_dependency 'iso8601', '~> 0.13'
49
+ spec.add_runtime_dependency 'listen', '~> 3.8'
50
+ spec.add_runtime_dependency 'bunny', '~> 2.20'
48
51
 
49
52
  spec.add_development_dependency 'bundler', '~> 2.3'
50
- spec.add_development_dependency 'minitest', '~> 5.16'
53
+ spec.add_development_dependency 'minitest', '~> 5.18'
51
54
  spec.add_development_dependency 'rake', '~> 13.0'
52
55
  spec.add_development_dependency 'webmock', '~> 3.18'
53
56
  end
data/examples/marc.rb ADDED
@@ -0,0 +1,27 @@
1
+ $LOAD_PATH << '../lib'
2
+ require 'data_collector'
3
+
4
+ # include module gives us pipeline, input, output, filter, log and error objects to work with
5
+ include DataCollector::Core
6
+
7
+ RULES = {
8
+ "title" => {'$.record.datafield[?(@._tag == "245")]' => lambda do |d, o|
9
+ subfields = d['subfield']
10
+ subfields = [subfields] unless subfields.is_a?(Array)
11
+ subfields.map{|m| m["$text"]}.join(' ')
12
+ end
13
+ },
14
+ "author" => {'$..datafield[?(@._tag == "100")]' => lambda do |d, o|
15
+ subfields = d['subfield']
16
+ subfields = [subfields] unless subfields.is_a?(Array)
17
+ subfields.map{|m| m["$text"]}.join(' ')
18
+ end
19
+ }
20
+ }
21
+
22
+ # read remote record and enable logging
23
+ data = input.from_uri('https://gist.githubusercontent.com/kefo/796b39925e234fb6d912/raw/3df2ce329a947864ae8555f214253f956d679605/sample-marc-with-xsd.xml', {logging: true})
24
+ # apply rules to data and if result contains only 1 entry do not return an array
25
+ rules.run(RULES, data, output, {_no_array_with_one_element: true})
26
+ # print result
27
+ puts JSON.pretty_generate(output.raw)
@@ -10,6 +10,14 @@ require_relative 'config_file'
10
10
 
11
11
  module DataCollector
12
12
  module Core
13
+ # Pipeline to orchestrate your data processing
14
+ # example: pipeline.on_message do |input, output|
15
+ # ** processing logic here **
16
+ # end
17
+ def pipeline
18
+ @pipeline ||= DataCollector::Pipeline.new
19
+ end
20
+ module_function :pipeline
13
21
  # Read input from an URI
14
22
  # example: input.from_uri("http://www.libis.be")
15
23
  # input.from_uri("file://hello.txt")
@@ -79,6 +87,8 @@ module DataCollector
79
87
  # }
80
88
  # rules.run(my_rules, input, output)
81
89
  def rules
90
+ #DataCollector::Core.log('RULES deprecated, use RULES_NG')
91
+ #rules_ng
82
92
  @rules ||= Rules.new
83
93
  end
84
94
  module_function :rules
@@ -121,6 +131,12 @@ module DataCollector
121
131
  end
122
132
  module_function :log
123
133
 
134
+ def error(message)
135
+ @logger ||= Logger.new(STDOUT)
136
+ @logger.error(message)
137
+ end
138
+ module_function :error
139
+
124
140
  end
125
141
 
126
142
  end
@@ -0,0 +1,28 @@
1
+ require_relative 'generic'
2
+ require 'listen'
3
+
4
+ module DataCollector
5
+ class Input
6
+ class Dir < Generic
7
+ def initialize(uri, options)
8
+ super
9
+ end
10
+
11
+ def running?
12
+ @listener.processing?
13
+ end
14
+
15
+ private
16
+
17
+ def create_listener
18
+ @listener ||= Listen.to("#{@uri.host}/#{@uri.path}", @options) do |modified, added, _|
19
+ files = added | modified
20
+ files.each do |filename|
21
+ handle_on_message(@input, @output, filename)
22
+ end
23
+ end
24
+ end
25
+
26
+ end
27
+ end
28
+ end
@@ -0,0 +1,77 @@
1
+ require 'listen'
2
+
3
+ module DataCollector
4
+ class Input
5
+ class Generic
6
+ def initialize(uri, options)
7
+ @uri = uri
8
+ @options = options
9
+
10
+ @input = DataCollector::Input.new
11
+ @output = DataCollector::Output.new
12
+
13
+ @listener = create_listener
14
+ end
15
+
16
+ def run(should_block = false, &block)
17
+ raise DataCollector::Error, 'Please supply a on_message block' if @on_message_callback.nil?
18
+ @listener.start
19
+
20
+ if should_block
21
+ while running?
22
+ yield block if block_given?
23
+ sleep 2
24
+ end
25
+ else
26
+ yield block if block_given?
27
+ end
28
+
29
+ end
30
+
31
+ def stop
32
+ @listener.stop
33
+ end
34
+
35
+ def pause
36
+ @listener.pause
37
+ end
38
+
39
+ def running?
40
+ @listener.running?
41
+ end
42
+
43
+ def stopped?
44
+ @listener.stopped?
45
+ end
46
+
47
+ def paused?
48
+ @listener.paused?
49
+ end
50
+
51
+ def on_message(&block)
52
+ @on_message_callback = block
53
+ end
54
+
55
+ private
56
+
57
+ def create_listener
58
+ raise DataCollector::Error, 'Please implement a listener'
59
+ end
60
+
61
+ def handle_on_message(input, output, data)
62
+ if (callback = @on_message_callback)
63
+ timing = Time.now
64
+ begin
65
+ callback.call(input, output, data)
66
+ rescue StandardError => e
67
+ DataCollector::Core.error("INPUT #{e.message}")
68
+ puts e.backtrace.join("\n")
69
+ ensure
70
+ DataCollector::Core.log("INPUT ran for #{((Time.now.to_f - timing.to_f).to_f * 1000.0).to_i}ms")
71
+ end
72
+ end
73
+ end
74
+
75
+ end
76
+ end
77
+ end
@@ -0,0 +1,60 @@
1
+ require_relative 'generic'
2
+ require 'bunny'
3
+ require 'active_support/core_ext/hash'
4
+
5
+ module DataCollector
6
+ class Input
7
+ class Queue < Generic
8
+ def initialize(uri, options)
9
+ super
10
+
11
+ if running?
12
+ create_channel unless @channel
13
+ create_queue unless @queue
14
+ end
15
+ end
16
+
17
+ def running?
18
+ @listener.open?
19
+ end
20
+
21
+ def send(message)
22
+ if running?
23
+ @queue.publish(message)
24
+ end
25
+ end
26
+
27
+ private
28
+
29
+ def create_listener
30
+ @listener ||= begin
31
+ connection = Bunny.new(@uri.to_s)
32
+ connection.start
33
+
34
+ connection
35
+ rescue StandardError => e
36
+ raise DataCollector::Error, "Unable to connect to RabbitMQ. #{e.message}"
37
+ end
38
+ end
39
+
40
+ def create_channel
41
+ raise DataCollector::Error, 'Connection to RabbitMQ is closed' if @listener.closed?
42
+ @channel ||= @listener.create_channel
43
+ end
44
+
45
+ def create_queue
46
+ @queue ||= begin
47
+ options = CGI.parse(@uri.query).with_indifferent_access
48
+ raise DataCollector::Error, '"channel" query parameter missing from uri.' unless options.include?(:channel)
49
+ queue = @channel.queue(options[:channel].first)
50
+
51
+ queue.subscribe do |delivery_info, metadata, payload|
52
+ handle_on_message(@input, @output, payload)
53
+ end if queue
54
+
55
+ queue
56
+ end
57
+ end
58
+ end
59
+ end
60
+ end
@@ -12,6 +12,8 @@ require 'active_support/core_ext/hash'
12
12
  require 'zlib'
13
13
  require 'minitar'
14
14
  require 'csv'
15
+ require_relative 'input/dir'
16
+ require_relative 'input/queue'
15
17
 
16
18
  #require_relative 'ext/xml_utility_node'
17
19
  module DataCollector
@@ -34,7 +36,15 @@ module DataCollector
34
36
  when 'https'
35
37
  data = from_https(uri, options)
36
38
  when 'file'
37
- data = from_file(uri, options)
39
+ if File.directory?("#{uri.host}/#{uri.path}")
40
+ raise DataCollector::Error, "#{uri.host}/#{uri.path} not found" unless File.exist?("#{uri.host}/#{uri.path}")
41
+ return from_dir(uri, options)
42
+ else
43
+ raise DataCollector::Error, "#{uri.host}/#{uri.path} not found" unless File.exist?("#{uri.host}/#{uri.path}")
44
+ data = from_file(uri, options)
45
+ end
46
+ when 'amqp'
47
+ data = from_queue(uri,options)
38
48
  else
39
49
  raise "Do not know how to process #{source}"
40
50
  end
@@ -61,7 +71,10 @@ module DataCollector
61
71
 
62
72
  def from_https(uri, options = {})
63
73
  data = nil
64
- HTTP.default_options = HTTP::Options.new(features: { logging: { logger: @logger } })
74
+ if options.with_indifferent_access.include?(:logging) && options.with_indifferent_access[:logging]
75
+ HTTP.default_options = HTTP::Options.new(features: { logging: { logger: @logger } })
76
+ end
77
+
65
78
  http = HTTP
66
79
 
67
80
  #http.use(logging: {logger: @logger})
@@ -157,6 +170,14 @@ module DataCollector
157
170
  data
158
171
  end
159
172
 
173
+ def from_dir(uri, options = {})
174
+ DataCollector::Input::Dir.new(uri, options)
175
+ end
176
+
177
+ def from_queue(uri, options = {})
178
+ DataCollector::Input::Queue.new(uri, options)
179
+ end
180
+
160
181
  def xml_to_hash(data)
161
182
  #gsub('&lt;\/', '&lt; /') outherwise wrong XML-parsing (see records lirias1729192 )
162
183
  data = data.gsub /&lt;/, '&lt; /'
@@ -38,8 +38,10 @@ module DataCollector
38
38
  data[k] << v
39
39
  end
40
40
  else
41
- t = data[k]
42
- data[k] = Array.new([t, v])
41
+ data[k] = v
42
+ # HELP: why am I creating an array here?
43
+ # t = data[k]
44
+ # data[k] = Array.new([t, v])
43
45
  end
44
46
  else
45
47
  data[k] = v
@@ -152,7 +154,6 @@ module DataCollector
152
154
  result
153
155
  rescue Exception => e
154
156
  raise "unable to transform to text: #{e.message}"
155
- ""
156
157
  end
157
158
 
158
159
  def to_tmp_file(erb_file, records_dir)
@@ -0,0 +1,116 @@
1
+ require 'iso8601'
2
+
3
+ module DataCollector
4
+ class Pipeline
5
+ attr_reader :run_count, :name
6
+ def initialize(options = {})
7
+ @running = false
8
+ @paused = false
9
+
10
+ @input = DataCollector::Input.new
11
+ @output = DataCollector::Output.new
12
+ @run_count = 0
13
+
14
+ @schedule = options[:schedule] || {}
15
+ @name = options[:name] || "#{Time.now.to_i}-#{rand(10000)}"
16
+ @options = options
17
+ @listeners = []
18
+ end
19
+
20
+ def on_message(&block)
21
+ @on_message_callback = block
22
+ end
23
+
24
+ def run
25
+ if paused? && @running
26
+ @paused = false
27
+ @listeners.each do |listener|
28
+ listener.run if listener.paused?
29
+ end
30
+ end
31
+
32
+ @running = true
33
+ if @schedule && !@schedule.empty?
34
+ while running?
35
+ @run_count += 1
36
+ start_time = ISO8601::DateTime.new(Time.now.to_datetime.to_s)
37
+ begin
38
+ duration = ISO8601::Duration.new(@schedule)
39
+ rescue StandardError => e
40
+ raise DataCollector::Error, "PIPELINE - bad schedule: #{e.message}"
41
+ end
42
+ interval = ISO8601::TimeInterval.from_duration(start_time, duration)
43
+
44
+ DataCollector::Core.log("PIPELINE running in #{interval.size} seconds")
45
+ sleep interval.size
46
+ handle_on_message(@input, @output) unless paused?
47
+ end
48
+ else # run once
49
+ @run_count += 1
50
+ if @options.key?(:uri)
51
+ listener = Input.new.from_uri(@options[:uri], @options)
52
+ listener.on_message do |input, output, filename|
53
+ DataCollector::Core.log("PIPELINE triggered by #{filename}")
54
+ handle_on_message(@input, @output, filename)
55
+ end
56
+ @listeners << listener
57
+
58
+ listener.run(true)
59
+
60
+ else
61
+ DataCollector::Core.log("PIPELINE running once")
62
+ handle_on_message(@input, @output)
63
+ end
64
+ end
65
+ rescue StandardError => e
66
+ DataCollector::Core.error("PIPELINE run failed: #{e.message}")
67
+ raise e
68
+ #puts e.backtrace.join("\n")
69
+ end
70
+
71
+ def stop
72
+ @running = false
73
+ @paused = false
74
+ @listeners.each do |listener|
75
+ listener.stop if listener.running?
76
+ end
77
+ end
78
+
79
+ def pause
80
+ if @running
81
+ @paused = !@paused
82
+ @listeners.each do |listener|
83
+ listener.pause if listener.running?
84
+ end
85
+ end
86
+ end
87
+
88
+ def running?
89
+ @running
90
+ end
91
+
92
+ def stopped?
93
+ !@running
94
+ end
95
+
96
+ def paused?
97
+ @paused
98
+ end
99
+
100
+ private
101
+
102
+ def handle_on_message(input, output, filename = nil)
103
+ if (callback = @on_message_callback)
104
+ timing = Time.now
105
+ begin
106
+ callback.call(input, output, filename)
107
+ rescue StandardError => e
108
+ DataCollector::Core.error("PIPELINE #{e.message}")
109
+ ensure
110
+ DataCollector::Core.log("PIPELINE ran for #{((Time.now.to_f - timing.to_f).to_f * 1000.0).to_i}ms")
111
+ end
112
+ end
113
+ end
114
+
115
+ end
116
+ end
@@ -1,130 +1,9 @@
1
- require 'logger'
1
+ require_relative 'rules_ng'
2
2
 
3
3
  module DataCollector
4
- class Rules
5
- def initialize()
6
- @logger = Logger.new(STDOUT)
4
+ class Rules < RulesNg
5
+ def initialize(logger = Logger.new(STDOUT))
6
+ super
7
7
  end
8
-
9
- def run(rule_map, from_record, to_record, options = {})
10
- rule_map.each do |map_to_key, rule|
11
- if rule.is_a?(Array)
12
- rule.each do |sub_rule|
13
- apply_rule(map_to_key, sub_rule, from_record, to_record, options)
14
- end
15
- else
16
- apply_rule(map_to_key, rule, from_record, to_record, options)
17
- end
18
- end
19
-
20
- to_record.each do |element|
21
- element = element.delete_if do |k, v|
22
- v != false && (v.nil?)
23
- end
24
- end
25
- end
26
-
27
- private
28
-
29
- def apply_rule(map_to_key, rule, from_record, to_record, options = {})
30
- if rule.has_key?('text')
31
- suffix = (rule && rule.key?('options') && rule['options'].key?('suffix')) ? rule['options']['suffix'] : ''
32
- to_record << { map_to_key.to_sym => add_suffix(rule['text'], suffix) }
33
- elsif rule.has_key?('options') && rule['options'].has_key?('convert') && rule['options']['convert'].eql?('each')
34
- result = get_value_for(map_to_key, rule['filter'], from_record, rule['options'], options)
35
-
36
- if result.is_a?(Array)
37
- result.each do |m|
38
- to_record << {map_to_key.to_sym => m}
39
- end
40
- else
41
- to_record << {map_to_key.to_sym => result}
42
- end
43
- else
44
- result = get_value_for(map_to_key, rule['filter'], from_record, rule['options'], options)
45
- return if result && result.empty?
46
-
47
- to_record << {map_to_key.to_sym => result}
48
- end
49
- end
50
-
51
- def get_value_for(tag_key, filter_path, record, rule_options = {}, options = {})
52
- data = nil
53
- if record
54
- if filter_path.is_a?(Array) && !record.is_a?(Array)
55
- record = [record]
56
- end
57
-
58
- data = Core::filter(record, filter_path)
59
-
60
- if data && rule_options
61
- if rule_options.key?('convert')
62
- case rule_options['convert']
63
- when 'time'
64
- result = []
65
- data = [data] unless data.is_a?(Array)
66
- data.each do |d|
67
- result << Time.parse(d)
68
- end
69
- data = result
70
- when 'map'
71
- if data.is_a?(Array)
72
- data = data.map do |r|
73
- rule_options['map'][r] if rule_options['map'].key?(r)
74
- end
75
-
76
- data.compact!
77
- data.flatten! if rule_options.key?('flatten') && rule_options['flatten']
78
- else
79
- return rule_options['map'][data] if rule_options['map'].key?(data)
80
- end
81
- when 'each'
82
- data = [data] unless data.is_a?(Array)
83
- if options.empty?
84
- data = data.map { |d| rule_options['lambda'].call(d) }
85
- else
86
- data = data.map { |d| rule_options['lambda'].call(d, options) }
87
- end
88
- data.flatten! if rule_options.key?('flatten') && rule_options['flatten']
89
- when 'call'
90
- if options.empty?
91
- data = rule_options['lambda'].call(data)
92
- else
93
- data = rule_options['lambda'].call(data, options)
94
- end
95
- return data
96
- end
97
- end
98
-
99
- if rule_options.key?('suffix')
100
- data = add_suffix(data, rule_options['suffix'])
101
- end
102
-
103
- end
104
-
105
- end
106
-
107
- return data
108
- end
109
-
110
- def add_suffix(data, suffix)
111
- case data.class.name
112
- when 'Array'
113
- result = []
114
- data.each do |d|
115
- result << add_suffix(d, suffix)
116
- end
117
- data = result
118
- when 'Hash'
119
- data.each do |k, v|
120
- data[k] = add_suffix(v, suffix)
121
- end
122
- else
123
- data = data.to_s
124
- data += suffix
125
- end
126
- data
127
- end
128
-
129
8
  end
130
- end
9
+ end
@@ -0,0 +1,130 @@
1
+ require 'logger'
2
+
3
+ module DataCollector
4
+ class Rules
5
+ def initialize()
6
+ @logger = Logger.new(STDOUT)
7
+ end
8
+
9
+ def run(rule_map, from_record, to_record, options = {})
10
+ rule_map.each do |map_to_key, rule|
11
+ if rule.is_a?(Array)
12
+ rule.each do |sub_rule|
13
+ apply_rule(map_to_key, sub_rule, from_record, to_record, options)
14
+ end
15
+ else
16
+ apply_rule(map_to_key, rule, from_record, to_record, options)
17
+ end
18
+ end
19
+
20
+ to_record.each do |element|
21
+ element = element.delete_if do |k, v|
22
+ v != false && (v.nil?)
23
+ end
24
+ end
25
+ end
26
+
27
+ private
28
+
29
+ def apply_rule(map_to_key, rule, from_record, to_record, options = {})
30
+ if rule.has_key?('text')
31
+ suffix = (rule && rule.key?('options') && rule['options'].key?('suffix')) ? rule['options']['suffix'] : ''
32
+ to_record << { map_to_key.to_sym => add_suffix(rule['text'], suffix) }
33
+ elsif rule.has_key?('options') && rule['options'].has_key?('convert') && rule['options']['convert'].eql?('each')
34
+ result = get_value_for(map_to_key, rule['filter'], from_record, rule['options'], options)
35
+
36
+ if result.is_a?(Array)
37
+ result.each do |m|
38
+ to_record << {map_to_key.to_sym => m}
39
+ end
40
+ else
41
+ to_record << {map_to_key.to_sym => result}
42
+ end
43
+ else
44
+ result = get_value_for(map_to_key, rule['filter'], from_record, rule['options'], options)
45
+ return if result && result.empty?
46
+
47
+ to_record << {map_to_key.to_sym => result}
48
+ end
49
+ end
50
+
51
+ def get_value_for(tag_key, filter_path, record, rule_options = {}, options = {})
52
+ data = nil
53
+ if record
54
+ if filter_path.is_a?(Array) && !record.is_a?(Array)
55
+ record = [record]
56
+ end
57
+
58
+ data = Core::filter(record, filter_path)
59
+
60
+ if data && rule_options
61
+ if rule_options.key?('convert')
62
+ case rule_options['convert']
63
+ when 'time'
64
+ result = []
65
+ data = [data] unless data.is_a?(Array)
66
+ data.each do |d|
67
+ result << Time.parse(d)
68
+ end
69
+ data = result
70
+ when 'map'
71
+ if data.is_a?(Array)
72
+ data = data.map do |r|
73
+ rule_options['map'][r] if rule_options['map'].key?(r)
74
+ end
75
+
76
+ data.compact!
77
+ data.flatten! if rule_options.key?('flatten') && rule_options['flatten']
78
+ else
79
+ return rule_options['map'][data] if rule_options['map'].key?(data)
80
+ end
81
+ when 'each'
82
+ data = [data] unless data.is_a?(Array)
83
+ if options.empty?
84
+ data = data.map { |d| rule_options['lambda'].call(d) }
85
+ else
86
+ data = data.map { |d| rule_options['lambda'].call(d, options) }
87
+ end
88
+ data.flatten! if rule_options.key?('flatten') && rule_options['flatten']
89
+ when 'call'
90
+ if options.empty?
91
+ data = rule_options['lambda'].call(data)
92
+ else
93
+ data = rule_options['lambda'].call(data, options)
94
+ end
95
+ return data
96
+ end
97
+ end
98
+
99
+ if rule_options.key?('suffix')
100
+ data = add_suffix(data, rule_options['suffix'])
101
+ end
102
+
103
+ end
104
+
105
+ end
106
+
107
+ return data
108
+ end
109
+
110
+ def add_suffix(data, suffix)
111
+ case data.class.name
112
+ when 'Array'
113
+ result = []
114
+ data.each do |d|
115
+ result << add_suffix(d, suffix)
116
+ end
117
+ data = result
118
+ when 'Hash'
119
+ data.each do |k, v|
120
+ data[k] = add_suffix(v, suffix)
121
+ end
122
+ else
123
+ data = data.to_s
124
+ data += suffix
125
+ end
126
+ data
127
+ end
128
+
129
+ end
130
+ end
@@ -51,30 +51,51 @@ module DataCollector

  data = apply_filtered_data_on_payload(data, rule_payload, options)

- output_data << {tag.to_sym => data} unless data.nil? || (data.is_a?(Array) && data.empty?)
+ output_data << { tag.to_sym => data } unless data.nil? || (data.is_a?(Array) && data.empty?)
  rescue StandardError => e
- puts "error running rule '#{tag}'\n\t#{e.message}"
- puts e.backtrace.join("\n")
+ # puts "error running rule '#{tag}'\n\t#{e.message}"
+ # puts e.backtrace.join("\n")
+ raise DataCollector::Error, "error running rule '#{tag}'\n\t#{e.message}"
  end

  def apply_filtered_data_on_payload(input_data, payload, options = {})
  return nil if input_data.nil?

+ normalized_options = options.select { |k, v| k !~ /^_/ }.with_indifferent_access
  output_data = nil
  case payload.class.name
  when 'Proc'
  data = input_data.is_a?(Array) ? input_data : [input_data]
- output_data = if options.empty?
- data.map { |d| payload.call(d) }
+ output_data = if normalized_options.empty?
+ # data.map { |d| payload.curry.call(d).call(d) }
+ data.map { |d|
+ loop do
+ payload_result = payload.curry.call(d)
+ break payload_result unless payload_result.is_a?(Proc)
+ end
+ }
  else
- data.map { |d| payload.call(d, options) }
+ data.map { |d|
+ loop do
+ payload_result = payload.curry.call(d, normalized_options)
+ break payload_result unless payload_result.is_a?(Proc)
+ end
+ }
  end
  when 'Hash'
  input_data = [input_data] unless input_data.is_a?(Array)
  if input_data.is_a?(Array)
  output_data = input_data.map do |m|
  if payload.key?('suffix')
- "#{m}#{payload['suffix']}"
+ if (m.is_a?(Hash))
+ m.transform_values { |v| v.is_a?(String) ? "#{v}#{payload['suffix']}" : v }
+ elsif m.is_a?(Array)
+ m.map { |n| n.is_a?(String) ? "#{n}#{payload['suffix']}" : n }
+ elsif m.methods.include?(:to_s)
+ "#{m}#{payload['suffix']}"
+ else
+ m
+ end
  else
  payload[m]
  end
@@ -83,7 +104,7 @@ module DataCollector
  when 'Array'
  output_data = input_data
  payload.each do |p|
- output_data = apply_filtered_data_on_payload(output_data, p, options)
+ output_data = apply_filtered_data_on_payload(output_data, p, normalized_options)
  end
  else
  output_data = [input_data]
@@ -92,17 +113,21 @@ module DataCollector
  output_data.compact! if output_data.is_a?(Array)
  output_data.flatten! if output_data.is_a?(Array)
  if output_data.is_a?(Array) &&
- output_data.size == 1 &&
- (output_data.first.is_a?(Array) || output_data.first.is_a?(Hash))
+ output_data.size == 1 &&
+ (output_data.first.is_a?(Array) || output_data.first.is_a?(Hash))
  output_data = output_data.first
  end

- if options.key?('_no_array_with_one_element') && options['_no_array_with_one_element'] &&
+ if options.with_indifferent_access.key?('_no_array_with_one_element') && options.with_indifferent_access['_no_array_with_one_element'] &&
  output_data.is_a?(Array) && output_data.size == 1
  output_data = output_data.first
  end

  output_data
+ rescue StandardError => e
+ # puts "error applying filtered data on payload'#{payload.to_json}'\n\t#{e.message}"
+ # puts e.backtrace.join("\n")
+ raise DataCollector::Error, "error applying filtered data on payload'#{payload.to_json}'\n\t#{e.message}"
  end

  def json_path_filter(filter, input_data)
@@ -111,6 +136,10 @@ module DataCollector
  return input_data if input_data.is_a?(String)

  Core.filter(input_data, filter)
+ rescue StandardError => e
+ puts "error running filter '#{filter}'\n\t#{e.message}"
+ puts e.backtrace.join("\n")
+ raise DataCollector::Error, "error running filter '#{filter}'\n\t#{e.message}"
  end
  end
 end
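The Proc branch above switches from a plain `payload.call` to `Proc#curry` inside a loop, so a rule lambda may declare either one parameter (the record) or two (record plus options): currying a 2-arity lambda with only the record yields an intermediate Proc rather than raising an ArgumentError. A hypothetical standalone sketch of that idea (`run_lambda` and the rule names are illustrative, not the gem's code):

```ruby
# Hypothetical sketch: invoke a rule lambda via Proc#curry and keep
# applying the options hash until a concrete (non-Proc) value comes back.
def run_lambda(rule, record, options = {})
  result = options.empty? ? rule.curry.call(record) : rule.curry.call(record, options)
  # A 2-arity lambda given only the record yields an intermediate Proc;
  # complete the call with the options hash.
  result = result.call(options) while result.is_a?(Proc)
  result
end

upcase_rule = ->(d) { d.upcase }
suffix_rule = ->(d, opts) { "#{d}#{opts['suffix']}" }

run_lambda(upcase_rule, 'abc')                   # => "ABC"
run_lambda(suffix_rule, 'abc', 'suffix' => '!')  # => "abc!"
```

Note this sketch completes a partial Proc by applying the remaining argument; the loop in the diff instead breaks as soon as the curried call stops returning a Proc.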
@@ -29,7 +29,6 @@ module DataCollector
  puts e.message
  puts e.backtrace.join("\n")
  ensure
- # output.tar_file.close unless output.tar_file.closed?
  @logger.info("Finished in #{((Time.now - @time_start)*1000).to_i} ms")
  end

@@ -1,4 +1,4 @@
  # encoding: utf-8
  module DataCollector
- VERSION = "0.17.0"
+ VERSION = "0.19.0"
  end
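Several rescue blocks earlier in this diff stop printing errors and instead re-raise them as a library-specific `DataCollector::Error`, so callers can rescue a single class uniformly. A self-contained sketch of that pattern (the `run_rule` wrapper is hypothetical; the `Error` class is redefined here only to keep the sketch runnable on its own):

```ruby
# Sketch of the new error-handling pattern: wrap rule execution and
# re-raise any failure as DataCollector::Error, keeping the rule tag
# and the original message in the error text.
module DataCollector
  class Error < StandardError; end
end

def run_rule(tag)
  yield
rescue StandardError => e
  raise DataCollector::Error, "error running rule '#{tag}'\n\t#{e.message}"
end

begin
  run_rule('title') { raise 'boom' }
rescue DataCollector::Error => e
  e.message  # => "error running rule 'title'\n\tboom"
end
```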
@@ -4,6 +4,7 @@ require 'logger'

  require 'data_collector/version'
  require 'data_collector/runner'
+ require 'data_collector/pipeline'
  require 'data_collector/ext/xml_utility_node'

  module DataCollector
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: data_collector
  version: !ruby/object:Gem::Version
- version: 0.17.0
+ version: 0.19.0
  platform: ruby
  authors:
  - Mehmet Celik
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2023-03-16 00:00:00.000000000 Z
+ date: 2023-05-08 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: activesupport
@@ -114,14 +114,14 @@ dependencies:
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '1.13'
+ version: '1.14'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '1.13'
+ version: '1.14'
  - !ruby/object:Gem::Dependency
  name: nori
  requirement: !ruby/object:Gem::Requirement
@@ -136,6 +136,48 @@ dependencies:
  - - "~>"
  - !ruby/object:Gem::Version
  version: '2.6'
+ - !ruby/object:Gem::Dependency
+ name: iso8601
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '0.13'
+ type: :runtime
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '0.13'
+ - !ruby/object:Gem::Dependency
+ name: listen
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '3.8'
+ type: :runtime
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '3.8'
+ - !ruby/object:Gem::Dependency
+ name: bunny
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '2.20'
+ type: :runtime
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '2.20'
  - !ruby/object:Gem::Dependency
  name: bundler
  requirement: !ruby/object:Gem::Requirement
@@ -156,14 +198,14 @@ dependencies:
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '5.16'
+ version: '5.18'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '5.16'
+ version: '5.18'
  - !ruby/object:Gem::Dependency
  name: rake
  requirement: !ruby/object:Gem::Requirement
@@ -208,13 +250,19 @@ files:
  - bin/console
  - bin/setup
  - data_collector.gemspec
+ - examples/marc.rb
  - lib/data_collector.rb
  - lib/data_collector/config_file.rb
  - lib/data_collector/core.rb
  - lib/data_collector/ext/xml_utility_node.rb
  - lib/data_collector/input.rb
+ - lib/data_collector/input/dir.rb
+ - lib/data_collector/input/generic.rb
+ - lib/data_collector/input/queue.rb
  - lib/data_collector/output.rb
+ - lib/data_collector/pipeline.rb
  - lib/data_collector/rules.rb
+ - lib/data_collector/rules.rb.depricated
  - lib/data_collector/rules_ng.rb
  - lib/data_collector/runner.rb
  - lib/data_collector/version.rb
@@ -240,7 +288,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  - !ruby/object:Gem::Version
  version: '0'
  requirements: []
- rubygems_version: 3.1.6
+ rubygems_version: 3.4.10
  signing_key:
  specification_version: 4
  summary: ETL helper library