RubyGems - data_collector - Versions diffs - 0.17.0 → 0.18.0 - Mend

data_collector 0.17.0 → 0.18.0

Files changed (18)

checksums.yaml +4 -4
data/README.md +105 -58
data/data_collector.gemspec +5 -2
data/examples/marc.rb +27 -0
data/lib/data_collector/core.rb +16 -0
data/lib/data_collector/input/dir.rb +28 -0
data/lib/data_collector/input/generic.rb +77 -0
data/lib/data_collector/input/queue.rb +60 -0
data/lib/data_collector/input.rb +21 -2
data/lib/data_collector/output.rb +4 -3
data/lib/data_collector/pipeline.rb +91 -0
data/lib/data_collector/rules.rb +5 -126
data/lib/data_collector/rules.rb.depricated +130 -0
data/lib/data_collector/rules_ng.rb +25 -7
data/lib/data_collector/runner.rb +0 -1
data/lib/data_collector/version.rb +1 -1
data/lib/data_collector.rb +1 -0
metadata +55 -7

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 351bb040c33b9010903681a117df29f02ba3663e440ed30b37520d5a8aa98b30
-  data.tar.gz: a307677b46ecce478fed2206cf5e9173b67bcec8a8b772ca1f12d71fd33a6fad
+  metadata.gz: 9b2dda800a0c468ee0db8c4a4546f98c8baa005f0cff2df603e613d404021315
+  data.tar.gz: c505eb5354999645eb5ea9fbb5200ce100a37d9f3e0eac85bf9416d21cd3514a
 SHA512:
-  metadata.gz: 6298f438cf8030be76ac85f9652aead672dedd42c6f5c6324b6004bae175442bb9565e31ec3aee430da601e3bcdc43eab08dbf19e9cfb89fe54f00ef61758325
-  data.tar.gz: 7b77100e03002764c58e4f2b0d3c962b445422fe7109aa14f0ac7f389628399392db0e172fe7141a35967a6f87c26cce2a8ee7865084454bb9e5587550085ae7
+  metadata.gz: a44557a687028b74b495236a47b4d802a4a6e130526a639ddf63b7b6a8a07b090f5197c23a36b2b4c9628bcfa33a0d38e2451c1a3224a45fa63d388f6922624e
+  data.tar.gz: b98a223f063f24b8f78e1358faeb02e33e365edd77b0fba2d28649fa0ad17d79f386ff216326040f3ec87390cb595f41382733ea042c5357c9cf48a23481d8c7

data/README.md CHANGED

@@ -1,39 +1,91 @@
 # DataCollector
-Convenience module to Extract, Transform and Load your data.
-You have main objects that help you to 'INPUT', 'OUTPUT' and 'FILTER' data. The basic ETL components.
-Support objects like CONFIG, LOG, RULES and the new RULES_NG just to make life easier.
+Convenience module to Extract, Transform and Load your data in a Pipeline.
+The 'INPUT', 'OUTPUT' and 'FILTER' objects help you read, transform and output your data.
+Support objects like CONFIG, LOG, ERROR and RULES help you write manageable rules to transform and log your data.
+Including the DataCollector::Core module in your application gives you access to these objects.
+```ruby
+include DataCollector::Core
+```
-Including the DataCollector::Core module into your application gives you access to these objects.
-The RULES and RULES_NG objects work in a very simple concept. Rules exist of 3 components:
- - a destination tag
- - a jsonpath filter to get the data
- - a lambda to execute on every filter hit
+Every object can be used on its own.
+#### Pipeline
+Allows you to create a simple pipeline of operations to process data. With a data pipeline, you can collect, process, and transform data, and then transfer it to various systems and applications.
+You can set a schedule for pipelines that are triggered by new data, specifying how often the pipeline should be
+executed in the [ISO8601 duration format](https://www.digi.com/resources/documentation/digidocs//90001488-13/reference/r_iso_8601_duration_format.htm). The processing logic is then executed.
+###### methods:
+ - .new(options): options can be schedule, in [ISO8601 duration format](https://www.digi.com/resources/documentation/digidocs//90001488-13/reference/r_iso_8601_duration_format.htm), and name
+ - .run: start the pipeline. Blocking if a schedule is supplied
+ - .stop: stop the pipeline
+ - .pause: pause the pipeline. Restart using .run
+ - .running?: is pipeline running
+ - .stopped?: is pipeline not running
+ - .paused?: is pipeline paused
+ - .name: name of the pipe
+ - .run_count: number of times the pipe has run
+ - .on_message: handler to run every time a trigger event happens
+###### example:
+```ruby
+# create a pipeline scheduled to run every 10 minutes
+pipeline = Pipeline.new(schedule: 'PT10M')
+pipeline.on_message do |input, output|
+  # logic
+end
-#### input
-Read input from an URI. This URI can have a http, https or file scheme
+pipeline.run
+```
+#### input
+The input component is part of the processing logic. All data is converted into a Hash, Array, ... accessible using plain Ruby or JSONPath using the filter object.
+The input component can fetch data from various URIs, such as files, URLs, directories, queues, ...
+For a push input component, a listener is created with a processing logic block that is executed whenever new data is available.
+A push happens when new data is created in a directory, message queue, ...
-**Public methods**
 ```ruby
   from_uri(source, options = {:raw, :content_type})
 ```
-- source: an uri with a scheme of http, https, file
+- source: a URI with a scheme of http, https, file, amqp
 - options:
     - raw: _boolean_ do not parse
     - content_type: _string_ force a content_type if the 'Content-Type' returned by the http server is incorrect
-example:
+###### example:
 ```ruby
+# read from an http endpoint
     input.from_uri("http://www.libis.be")
     input.from_uri("file://hello.txt")
     input.from_uri("http://www.libis.be/record.jsonld", content_type: 'application/ld+json')
-```
+# read data from a RabbitMQ queue
+    listener = input.from_uri('amqp://user:password@localhost?channel=hello')
+    listener.on_message do |input, output, message|
+      puts message
+    end
+    listener.run
+# read data from a directory
+    listener = input.from_uri('file://this/is/directory')
+    listener.on_message do |input, output, filename|
+      puts filename
+    end
+    listener.run
+```
+Inputs can be JSON, XML, CSV, or XML in a TAR.GZ file
+###### listener from input.from_uri(directory|message queue)
+When a listener is triggered by an event (PUSH), such as a message on a queue or a file written to a directory, you have these extra methods.
-Inputs can be JSON, XML or CSV or XML in a TAR.GZ file
+- .run: start the listener. Blocking when called with true as its first argument
+- .stop: stop the listener
+- .pause: pause the listener. Restart using .run
+- .running?: is listener running
+- .stopped?: is listener not running
+- .paused?: is listener paused
+- .on_message: handler to run every time a trigger event happens
  ### output
 Output is an object you can store key/value pairs that needs to be written to an output stream.
@@ -45,7 +97,7 @@ Output is an object you can store key/value pairs that needs to be written to an
 Write output to a file, string use an ERB file as a template
 example:
 ___test.erb___
-```ruby
+```erbruby
 <names>
     <combined><%= data[:name] %> <%= data[:last_name] %></combined>
     <%= print data, :name, :first_name %>
@@ -53,7 +105,7 @@ ___test.erb___
 </names>
 ```
 will produce
-```ruby
+```html
    <names>
      <combined>John Doe</combined>
      <first_name>John</first_name>
@@ -97,41 +149,11 @@ filter data from a hash using [JSONPath](http://goessner.net/articles/JsonPath/i
     filtered_data = filter(data, "$..metadata.record")
 ```
-#### rules (deprecated)
-    See newer rules_ng object
-~~Allows you to define a simple lambda structure to run against a JSONPath filter~~
-~~A rule is made up of a Hash the key is the map key field its value is a Hash with a JSONPath filter and options to apply a convert method on the filtered results.~~
-~~Available convert methods are: time, map, each, call, suffix, text~~
-~~- time: parses a given time/date string into a Time object~~
-~~- map: applies a mapping to a filter~~
-~~- suffix: adds a suffix to a result~~
-~~- call: executes a lambda on the filter~~
-~~- each: runs a lambda on each row of a filter~~
-~~- text: passthrough method. Returns value unchanged~~
-~~example:~~
-```ruby
- my_rules = {
-   'identifier' => {"filter" => '$..id'},
-   'language' => {'filter' => '$..lang',
-                  'options' => {'convert' => 'map',
-                                'map' => {'nl' => 'dut', 'fr' => 'fre', 'de' => 'ger', 'en' => 'eng'}
-                               }
-                 },
-   'subject' => {'filter' => '$..keywords',
-                 'options' => {'convert' => 'each',
-                              'lambda' => lambda {|d| d.split(',')}
-                             }
-                },
-   'creationdate' => {'filter' => '$..published_date', 'convert' => 'time'}
- }
-rules.run(my_rules, record, output)
-```
-#### rules_ng
-!!! not compatible with RULES object
+#### rules
+The RULES object works on a simple concept. Rules consist of 3 components:
+- a destination tag
+- a jsonpath filter to get the data
+- a lambda to execute on every filter hit
 TODO: work in progress see test for examples on how to use
@@ -202,15 +224,15 @@ Here you find different rule combination that are possible
       }
 ```
-Here is an example on how to call last RULESET "rs_hash_with_json_filter_and_option".
-***rules_ng.run*** can have 4 parameters. First 3 are mandatory. The last one ***options*** can hold data static to a rule set or engine directives.
+***rules.run*** can have 4 parameters. First 3 are mandatory. The last one ***options*** can hold data static to a rule set or engine directives.
-List of engine directives:
+##### List of engine directives:
   - _no_array_with_one_element: defaults to false. if the result is an array with 1 element just return the element.
+###### example:
 ```ruby
+# apply RULESET "rs_hash_with_json_filter_and_option" to data
     include DataCollector::Core
     output.clear
     data = {'subject' => ['water', 'thermodynamics']}
@@ -315,7 +337,32 @@ Or install it yourself as:
 ## Usage
-TODO: Write usage instructions here
+```ruby
+require 'data_collector'
+include DataCollector::Core
+# including core gives you a pipeline, input, output, filter, config, log, error object to work with
+RULES = {
+        'title' => '$..vertitle'
+}
+#create a PULL pipeline and schedule it to run every 5 seconds
+pipeline = DataCollector::Pipeline.new(schedule: 'PT5S')
+pipeline.on_message do |input, output|
+  data = input.from_uri('https://services3.libis.be/primo_artefact/lirias3611609')
+  rules.run(RULES, data, output)
+  #puts JSON.pretty_generate(input.raw)
+  puts JSON.pretty_generate(output.raw)
+  output.clear
+  if pipeline.run_count > 2
+    log('stopping pipeline after three runs')
+    pipeline.stop
+  end
+end
+pipeline.run
+```
 ## Development

data/data_collector.gemspec CHANGED

@@ -43,11 +43,14 @@ Gem::Specification.new do |spec|
   spec.add_runtime_dependency 'jsonpath', '~> 1.1'
   spec.add_runtime_dependency 'mime-types', '~> 3.4'
   spec.add_runtime_dependency 'minitar', '= 0.9'
-  spec.add_runtime_dependency 'nokogiri', '~> 1.13'
+  spec.add_runtime_dependency 'nokogiri', '~> 1.14'
   spec.add_runtime_dependency 'nori', '~> 2.6'
+  spec.add_runtime_dependency 'iso8601', '~> 0.13'
+  spec.add_runtime_dependency 'listen', '~> 3.8'
+  spec.add_runtime_dependency 'bunny', '~> 2.20'
   spec.add_development_dependency 'bundler', '~> 2.3'
-  spec.add_development_dependency 'minitest', '~> 5.16'
+  spec.add_development_dependency 'minitest', '~> 5.18'
   spec.add_development_dependency 'rake', '~> 13.0'
   spec.add_development_dependency 'webmock', '~> 3.18'
 end

data/examples/marc.rb ADDED

@@ -0,0 +1,27 @@
+$LOAD_PATH << '../lib'
+require 'data_collector'
+# including the module gives us a pipeline, input, output, filter, log and error object to work with
+include DataCollector::Core
+RULES = {
+  "title" => {'$.record.datafield[?(@._tag == "245")]' => lambda do |d, o|
+    subfields = d['subfield']
+    subfields = [subfields] unless subfields.is_a?(Array)
+    subfields.map{|m| m["$text"]}.join(' ')
+  end
+  },
+  "author" => {'$..datafield[?(@._tag == "100")]' => lambda do |d, o|
+    subfields = d['subfield']
+    subfields = [subfields] unless subfields.is_a?(Array)
+    subfields.map{|m| m["$text"]}.join(' ')
+  end
+  }
+}
+# read remote record, enable logging
+data = input.from_uri('https://gist.githubusercontent.com/kefo/796b39925e234fb6d912/raw/3df2ce329a947864ae8555f214253f956d679605/sample-marc-with-xsd.xml', {logging: true})
+# apply rules to data and if result contains only 1 entry do not return an array
+rules.run(RULES, data, output, {_no_array_with_one_element: true})
+# print result
+puts JSON.pretty_generate(output.raw)

data/lib/data_collector/core.rb CHANGED

@@ -10,6 +10,14 @@ require_relative 'config_file'
 module DataCollector
   module Core
+    # Pipeline for your data pipeline
+    # example:  pipeline.on_message do |input, output|
+    #            ** processing logic here **
+    #           end
+    def pipeline
+      @pipeline ||= DataCollector::Pipeline.new
+    end
+    module_function :pipeline
     # Read input from an URI
     # example:  input.from_uri("http://www.libis.be")
     #           input.from_uri("file://hello.txt")
@@ -79,6 +87,8 @@ module DataCollector
     # }
     # rules.run(my_rules, input, output)
     def rules
+      #DataCollector::Core.log('RULES deprecated using RULESNG')
+      #rules_ng
       @rules ||= Rules.new
     end
     module_function :rules
@@ -121,6 +131,12 @@ module DataCollector
     end
     module_function :log
+    def error(message)
+      @logger ||= Logger.new(STDOUT)
+      @logger.error(message)
+    end
+    module_function :error
   end
 end

data/lib/data_collector/input/dir.rb ADDED

@@ -0,0 +1,28 @@
+require_relative 'generic'
+require 'listen'
+module DataCollector
+  class Input
+    class Dir < Generic
+      def initialize(uri, options)
+        super
+      end
+      def running?
+        @listener.processing?
+      end
+      private
+      def create_listener
+        @listener ||= Listen.to("#{@uri.host}/#{@uri.path}", @options) do |modified, added, _|
+          files = added | modified
+          files.each do |filename|
+            handle_on_message(@input, @output, filename)
+          end
+        end
+      end
+    end
+  end
+end

data/lib/data_collector/input/generic.rb ADDED

@@ -0,0 +1,77 @@
+require 'listen'
+module DataCollector
+  class Input
+    class Generic
+      def initialize(uri, options)
+        @uri = uri
+        @options = options
+        @input = DataCollector::Input.new
+        @output = DataCollector::Output.new
+        @listener = create_listener
+      end
+      def run(should_block = false, &block)
+        raise DataCollector::Error, 'Please supply an on_message block' if @on_message_callback.nil?
+        @listener.start
+        if should_block
+          while running?
+            yield block if block_given?
+            sleep 2
+          end
+        else
+          yield block if block_given?
+        end
+      end
+      def stop
+        @listener.stop
+      end
+      def pause
+        @listener.pause
+      end
+      def running?
+        @listener.running?
+      end
+      def stopped?
+        @listener.stopped?
+      end
+      def paused?
+        @listener.paused?
+      end
+      def on_message(&block)
+        @on_message_callback = block
+      end
+      private
+      def create_listener
+        raise DataCollector::Error, 'Please implement a listener'
+      end
+      def handle_on_message(input, output, data)
+        if (callback = @on_message_callback)
+          timing = Time.now
+          begin
+            callback.call(input, output, data)
+          rescue StandardError => e
+            DataCollector::Core.error("INPUT #{e.message}")
+            puts e.backtrace.join("\n")
+          ensure
+            DataCollector::Core.log("INPUT ran for #{((Time.now.to_f - timing.to_f).to_f * 1000.0).to_i}ms")
+          end
+        end
+      end
+    end
+  end
+end
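
Generic leaves `create_listener` to its subclasses and routes every event through `handle_on_message`. The sketch below shows that contract in miniature, using only the standard library. `SketchInput`, `ArrayInput` and its `Listener` struct are hypothetical stand-ins, not classes from the gem, and the "listener" simply replays a fixed list instead of watching a directory or queue.

```ruby
# Minimal stand-in for the Generic contract; not the gem's class.
class SketchInput
  def initialize
    @listener = create_listener
  end

  def on_message(&block)
    @on_message_callback = block
  end

  def run
    raise 'Please supply an on_message block' if @on_message_callback.nil?

    @listener.start
  end

  private

  # Subclasses must return an object that responds to #start.
  def create_listener
    raise NotImplementedError, 'Please implement a listener'
  end

  def handle_on_message(data)
    @on_message_callback&.call(data)
  end
end

# Hypothetical subclass: "listens" to a fixed list of items.
class ArrayInput < SketchInput
  Listener = Struct.new(:items, :handler) do
    def start
      items.each { |item| handler.call(item) }
    end
  end

  def initialize(items)
    @items = items
    super()
  end

  private

  def create_listener
    Listener.new(@items, method(:handle_on_message))
  end
end

received = []
source = ArrayInput.new(%w[a.xml b.xml])
source.on_message { |filename| received << filename }
source.run
```

The design keeps start/stop bookkeeping in one place, so a new input type only has to say how events are produced.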

data/lib/data_collector/input/queue.rb ADDED

@@ -0,0 +1,60 @@
+require_relative 'generic'
+require 'bunny'
+require 'active_support/core_ext/hash'
+module DataCollector
+  class Input
+    class Queue < Generic
+      def initialize(uri, options)
+        super
+        if running?
+          create_channel unless @channel
+          create_queue unless @queue
+        end
+      end
+      def running?
+        @listener.open?
+      end
+      def send(message)
+        if running?
+          @queue.publish(message)
+        end
+      end
+      private
+      def create_listener
+        @listener ||= begin
+                        connection = Bunny.new(@uri.to_s)
+                        connection.start
+                        connection
+                      rescue StandardError => e
+                        raise DataCollector::Error, "Unable to connect to RabbitMQ. #{e.message}"
+                      end
+      end
+      def create_channel
+        raise DataCollector::Error, 'Connection to RabbitMQ is closed' if @listener.closed?
+        @channel ||= @listener.create_channel
+      end
+      def create_queue
+        @queue ||= begin
+                     options = CGI.parse(@uri.query).with_indifferent_access
+                     raise DataCollector::Error, '"channel" query parameter missing from uri.' unless options.include?(:channel)
+                     queue = @channel.queue(options[:channel].first)
+                     queue.subscribe do |delivery_info, metadata, payload|
+                       handle_on_message(@input, @output, payload)
+                     end if queue
+                     queue
+                   end
+      end
+    end
+  end
+end
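
`create_queue` takes the queue name from the `channel` query parameter of the amqp URI. A small sketch of that lookup follows, assuming the standard library's `URI.decode_www_form` in place of the `CGI.parse` call used in the diff; `channel_name` is a hypothetical helper, not part of the gem.

```ruby
require 'uri'

# Hypothetical helper: extract the queue name from an amqp URI.
def channel_name(source)
  uri = URI(source)
  # A URI without a query string yields nil, hence the to_s.
  params = URI.decode_www_form(uri.query.to_s).to_h
  params.fetch('channel') do
    raise ArgumentError, '"channel" query parameter missing from uri.'
  end
end

puts channel_name('amqp://user:password@localhost?channel=hello')
```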

data/lib/data_collector/input.rb CHANGED

@@ -12,6 +12,8 @@ require 'active_support/core_ext/hash'
 require 'zlib'
 require 'minitar'
 require 'csv'
+require_relative 'input/dir'
+require_relative 'input/queue'
 #require_relative 'ext/xml_utility_node'
 module DataCollector
@@ -34,7 +36,13 @@ module DataCollector
         when 'https'
           data = from_https(uri, options)
         when 'file'
-          data = from_file(uri, options)
+          if File.directory?("#{uri.host}/#{uri.path}")
+            return from_dir(uri, options)
+          else
+            data = from_file(uri, options)
+          end
+        when 'amqp'
+          data = from_queue(uri,options)
         else
           raise "Do not know how to process #{source}"
         end
@@ -61,7 +69,10 @@ module DataCollector
     def from_https(uri, options = {})
       data = nil
-      HTTP.default_options = HTTP::Options.new(features: { logging: { logger: @logger } })
+      if options.with_indifferent_access.include?(:logging) && options.with_indifferent_access[:logging]
+        HTTP.default_options = HTTP::Options.new(features: { logging: { logger: @logger } })
+      end
       http = HTTP
       #http.use(logging: {logger: @logger})
@@ -157,6 +168,14 @@ module DataCollector
       data
     end
+    def from_dir(uri, options = {})
+      DataCollector::Input::Dir.new(uri, options)
+    end
+    def from_queue(uri, options = {})
+      DataCollector::Input::Queue.new(uri, options)
+    end
     def xml_to_hash(data)
       #gsub('&lt;\/', '&lt; /') otherwise wrong XML-parsing (see records lirias1729192 )
       data = data.gsub /&lt;/, '&lt; /'
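
With this change `from_uri` returns either parsed data or a listener, depending on the scheme and on whether a file URI points at a directory. The sketch below only names the branch that would be taken; `input_kind` and the returned symbols are hypothetical and exist purely for illustration.

```ruby
require 'uri'

# Hypothetical helper: name the branch from_uri would take for a source.
def input_kind(source)
  uri = URI(source)
  case uri.scheme
  when 'http', 'https'
    :http
  when 'file'
    # Same directory test as in the diff: host and path are joined.
    File.directory?("#{uri.host}/#{uri.path}") ? :directory_listener : :file
  when 'amqp'
    :queue_listener
  else
    raise ArgumentError, "Do not know how to process #{source}"
  end
end

puts input_kind('http://www.libis.be')
puts input_kind('amqp://user:password@localhost?channel=hello')
```

Callers therefore need to know up front whether they asked for a pull source (data comes back) or a push source (a listener comes back and needs an `on_message` block).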

data/lib/data_collector/output.rb CHANGED

@@ -38,8 +38,10 @@ module DataCollector
               data[k] << v
             end
           else
-            t = data[k]
-            data[k] = Array.new([t, v])
+            data[k] = v
+            # HELP: why am I creating an array here?
+            # t = data[k]
+            # data[k] = Array.new([t, v])
           end
         else
           data[k] = v
@@ -152,7 +154,6 @@ module DataCollector
       result
     rescue Exception => e
       raise "unable to transform to text: #{e.message}"
-      ""
     end
     def to_tmp_file(erb_file, records_dir)

data/lib/data_collector/pipeline.rb ADDED

@@ -0,0 +1,91 @@
+require 'iso8601'
+module DataCollector
+  class Pipeline
+    attr_reader :run_count, :name
+    def initialize(options = {})
+      @running = false
+      @paused = false
+      @input = DataCollector::Input.new
+      @output = DataCollector::Output.new
+      @run_count = 0
+      @schedule = options[:schedule] || {}
+      @name = options[:name] || "#{Time.now.to_i}-#{rand(10000)}"
+    end
+    def on_message(&block)
+      @on_message_callback = block
+    end
+    def run
+      if paused? && @running
+        @paused = false
+      end
+      @running = true
+      if @schedule && !@schedule.empty?
+        while running?
+          @run_count += 1
+          start_time = ISO8601::DateTime.new(Time.now.to_datetime.to_s)
+          begin
+            duration = ISO8601::Duration.new(@schedule)
+          rescue StandardError => e
+            raise DataCollector::Error, "PIPELINE - bad schedule: #{e.message}"
+          end
+          interval = ISO8601::TimeInterval.from_duration(start_time, duration)
+          DataCollector::Core.log("PIPELINE running in #{interval.size} seconds")
+          sleep interval.size
+          handle_on_message(@input, @output) unless paused?
+        end
+      else # run once
+        @run_count += 1
+        DataCollector::Core.log("PIPELINE running once")
+        handle_on_message(@input, @output)
+      end
+    rescue StandardError => e
+      DataCollector::Core.error("PIPELINE run failed: #{e.message}")
+      raise e
+      #puts e.backtrace.join("\n")
+    end
+    def stop
+      @running = false
+      @paused = false
+    end
+    def pause
+      @paused = !@paused if @running
+    end
+    def running?
+      @running
+    end
+    def stopped?
+      !@running
+    end
+    def paused?
+      @paused
+    end
+    private
+    def handle_on_message(input, output)
+      if (callback = @on_message_callback)
+        timing = Time.now
+        begin
+          callback.call(input, output)
+        rescue StandardError => e
+          DataCollector::Core.error("PIPELINE #{e.message}")
+        ensure
+          DataCollector::Core.log("PIPELINE ran for #{((Time.now.to_f - timing.to_f).to_f * 1000.0).to_i}ms")
+        end
+      end
+    end
+  end
+end
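
In the scheduled branch, `run` sleeps for the length of the ISO 8601 duration before each callback, so 'PT10M' means a ten minute wait between runs. The gem delegates the parsing to the iso8601 gem; the sketch below is a standard-library approximation that assumes only hour, minute and second components. `duration_seconds` and `DURATION_PATTERN` are hypothetical names.

```ruby
# Hypothetical helper: convert a simple ISO 8601 time duration to seconds.
# Only the PT..H..M..S form is handled; dates, weeks and fractions are not.
DURATION_PATTERN = /\APT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?\z/

def duration_seconds(schedule)
  match = DURATION_PATTERN.match(schedule)
  if match.nil? || match.captures.compact.empty?
    raise ArgumentError, "unsupported duration: #{schedule}"
  end

  # Missing components are nil, and nil.to_i is 0.
  hours, minutes, seconds = match.captures.map(&:to_i)
  hours * 3600 + minutes * 60 + seconds
end

puts duration_seconds('PT10M')
puts duration_seconds('PT5S')
```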

data/lib/data_collector/rules.rb CHANGED

@@ -1,130 +1,9 @@
-require 'logger'
+require_relative 'rules_ng'
 module DataCollector
-  class Rules
-    def initialize()
-      @logger = Logger.new(STDOUT)
+  class Rules < RulesNg
+    def initialize(logger =  Logger.new(STDOUT))
+      super
     end
-    def run(rule_map, from_record, to_record, options = {})
-      rule_map.each do |map_to_key, rule|
-        if rule.is_a?(Array)
-          rule.each do |sub_rule|
-            apply_rule(map_to_key, sub_rule, from_record, to_record, options)
-          end
-        else
-          apply_rule(map_to_key, rule, from_record, to_record, options)
-        end
-      end
-      to_record.each do |element|
-        element = element.delete_if do |k, v|
-          v != false && (v.nil?)
-        end
-      end
-    end
-    private
-    def apply_rule(map_to_key, rule, from_record, to_record, options = {})
-      if rule.has_key?('text')
-        suffix = (rule && rule.key?('options') && rule['options'].key?('suffix')) ? rule['options']['suffix'] : ''
-        to_record << { map_to_key.to_sym => add_suffix(rule['text'], suffix) }
-      elsif rule.has_key?('options') && rule['options'].has_key?('convert') && rule['options']['convert'].eql?('each')
-        result = get_value_for(map_to_key, rule['filter'], from_record, rule['options'], options)
-        if result.is_a?(Array)
-          result.each do |m|
-            to_record << {map_to_key.to_sym => m}
-          end
-        else
-          to_record << {map_to_key.to_sym => result}
-        end
-      else
-        result = get_value_for(map_to_key, rule['filter'], from_record, rule['options'], options)
-        return if result && result.empty?
-        to_record << {map_to_key.to_sym => result}
-      end
-    end
-    def get_value_for(tag_key, filter_path, record, rule_options = {}, options = {})
-      data = nil
-      if record
-        if filter_path.is_a?(Array) && !record.is_a?(Array)
-          record = [record]
-        end
-        data = Core::filter(record, filter_path)
-        if data && rule_options
-          if rule_options.key?('convert')
-            case rule_options['convert']
-            when 'time'
-              result = []
-              data = [data] unless data.is_a?(Array)
-              data.each do |d|
-                result << Time.parse(d)
-              end
-              data = result
-            when 'map'
-              if data.is_a?(Array)
-                data = data.map do |r|
-                  rule_options['map'][r] if rule_options['map'].key?(r)
-                end
-                data.compact!
-                data.flatten! if rule_options.key?('flatten') && rule_options['flatten']
-              else
-                return rule_options['map'][data] if rule_options['map'].key?(data)
-              end
-            when 'each'
-              data = [data] unless data.is_a?(Array)
-              if options.empty?
-                data = data.map { |d| rule_options['lambda'].call(d) }
-              else
-                data = data.map { |d| rule_options['lambda'].call(d, options) }
-              end
-              data.flatten! if rule_options.key?('flatten') && rule_options['flatten']
-            when 'call'
-              if options.empty?
-                data = rule_options['lambda'].call(data)
-              else
-                data = rule_options['lambda'].call(data, options)
-              end
-              return data
-            end
-          end
-          if rule_options.key?('suffix')
-            data = add_suffix(data, rule_options['suffix'])
-          end
-        end
-      end
-      return data
-    end
-    def add_suffix(data, suffix)
-      case data.class.name
-      when 'Array'
-        result = []
-        data.each do |d|
-          result <<  add_suffix(d, suffix)
-        end
-        data = result
-      when 'Hash'
-        data.each do |k, v|
-          data[k] = add_suffix(v, suffix)
-        end
-      else
-        data = data.to_s
-        data += suffix
-      end
-      data
-    end
   end
-end
+end

data/lib/data_collector/rules.rb.depricated ADDED

@@ -0,0 +1,130 @@
+require 'logger'
+module DataCollector
+  class Rules
+    def initialize()
+      @logger = Logger.new(STDOUT)
+    end
+    def run(rule_map, from_record, to_record, options = {})
+      rule_map.each do |map_to_key, rule|
+        if rule.is_a?(Array)
+          rule.each do |sub_rule|
+            apply_rule(map_to_key, sub_rule, from_record, to_record, options)
+          end
+        else
+          apply_rule(map_to_key, rule, from_record, to_record, options)
+        end
+      end
+      to_record.each do |element|
+        element = element.delete_if do |k, v|
+          v != false && (v.nil?)
+        end
+      end
+    end
+    private
+    def apply_rule(map_to_key, rule, from_record, to_record, options = {})
+      if rule.has_key?('text')
+        suffix = (rule && rule.key?('options') && rule['options'].key?('suffix')) ? rule['options']['suffix'] : ''
+        to_record << { map_to_key.to_sym => add_suffix(rule['text'], suffix) }
+      elsif rule.has_key?('options') && rule['options'].has_key?('convert') && rule['options']['convert'].eql?('each')
+        result = get_value_for(map_to_key, rule['filter'], from_record, rule['options'], options)
+        if result.is_a?(Array)
+          result.each do |m|
+            to_record << {map_to_key.to_sym => m}
+          end
+        else
+          to_record << {map_to_key.to_sym => result}
+        end
+      else
+        result = get_value_for(map_to_key, rule['filter'], from_record, rule['options'], options)
+        return if result && result.empty?
+        to_record << {map_to_key.to_sym => result}
+      end
+    end
+    def get_value_for(tag_key, filter_path, record, rule_options = {}, options = {})
+      data = nil
+      if record
+        if filter_path.is_a?(Array) && !record.is_a?(Array)
+          record = [record]
+        end
+        data = Core::filter(record, filter_path)
+        if data && rule_options
+          if rule_options.key?('convert')
+            case rule_options['convert']
+            when 'time'
+              result = []
+              data = [data] unless data.is_a?(Array)
+              data.each do |d|
+                result << Time.parse(d)
+              end
+              data = result
+            when 'map'
+              if data.is_a?(Array)
+                data = data.map do |r|
+                  rule_options['map'][r] if rule_options['map'].key?(r)
+                end
+                data.compact!
+                data.flatten! if rule_options.key?('flatten') && rule_options['flatten']
+              else
+                return rule_options['map'][data] if rule_options['map'].key?(data)
+              end
+            when 'each'
+              data = [data] unless data.is_a?(Array)
+              if options.empty?
+                data = data.map { |d| rule_options['lambda'].call(d) }
+              else
+                data = data.map { |d| rule_options['lambda'].call(d, options) }
+              end
+              data.flatten! if rule_options.key?('flatten') && rule_options['flatten']
+            when 'call'
+              if options.empty?
+                data = rule_options['lambda'].call(data)
+              else
+                data = rule_options['lambda'].call(data, options)
+              end
+              return data
+            end
+          end
+          if rule_options.key?('suffix')
+            data = add_suffix(data, rule_options['suffix'])
+          end
+        end
+      end
+      return data
+    end
+    def add_suffix(data, suffix)
+      case data.class.name
+      when 'Array'
+        result = []
+        data.each do |d|
+          result <<  add_suffix(d, suffix)
+        end
+        data = result
+      when 'Hash'
+        data.each do |k, v|
+          data[k] = add_suffix(v, suffix)
+        end
+      else
+        data = data.to_s
+        data += suffix
+      end
+      data
+    end
+  end
+end

data/lib/data_collector/rules_ng.rb CHANGED

@@ -53,28 +53,38 @@ module DataCollector
       output_data << {tag.to_sym => data} unless data.nil? || (data.is_a?(Array) && data.empty?)
     rescue StandardError => e
-      puts "error running rule '#{tag}'\n\t#{e.message}"
-      puts e.backtrace.join("\n")
+      # puts "error running rule '#{tag}'\n\t#{e.message}"
+      # puts e.backtrace.join("\n")
+      raise DataCollector::Error, "error running rule '#{tag}'\n\t#{e.message}"
     end
     def apply_filtered_data_on_payload(input_data, payload, options = {})
       return nil if input_data.nil?
+      normalized_options = options.select{|k,v| k !~ /^_/ }.with_indifferent_access
       output_data = nil
       case payload.class.name
       when 'Proc'
         data = input_data.is_a?(Array) ? input_data : [input_data]
-        output_data = if options.empty?
+        output_data = if normalized_options.empty?
                         data.map { |d| payload.call(d) }
                       else
-                        data.map { |d| payload.call(d, options) }
+                        data.map { |d| payload.call(d, normalized_options) }
                       end
       when 'Hash'
         input_data = [input_data] unless input_data.is_a?(Array)
         if input_data.is_a?(Array)
           output_data = input_data.map do |m|
             if payload.key?('suffix')
-              "#{m}#{payload['suffix']}"
+              if (m.is_a?(Hash))
+                m.transform_values{|v| v.is_a?(String) ? "#{v}#{payload['suffix']}" : v}
+              elsif m.is_a?(Array)
+                m.map{|n| n.is_a?(String) ? "#{n}#{payload['suffix']}": n}
+              elsif m.methods.include?(:to_s)
+                "#{m}#{payload['suffix']}"
+              else
+                m
+              end
             else
               payload[m]
             end
@@ -83,7 +93,7 @@ module DataCollector
       when 'Array'
         output_data = input_data
         payload.each do |p|
-          output_data = apply_filtered_data_on_payload(output_data, p, options)
+          output_data = apply_filtered_data_on_payload(output_data, p, normalized_options)
         end
       else
         output_data = [input_data]
@@ -97,12 +107,16 @@ module DataCollector
         output_data = output_data.first
       end
-      if options.key?('_no_array_with_one_element') && options['_no_array_with_one_element'] &&
+      if options.with_indifferent_access.key?('_no_array_with_one_element') && options.with_indifferent_access['_no_array_with_one_element'] &&
         output_data.is_a?(Array) && output_data.size == 1
         output_data = output_data.first
       end
       output_data
+    rescue StandardError => e
+      # puts "error applying filtered data on payload'#{payload.to_json}'\n\t#{e.message}"
+      # puts e.backtrace.join("\n")
+      raise DataCollector::Error, "error applying filtered data on payload'#{payload.to_json}'\n\t#{e.message}"
     end
     def json_path_filter(filter, input_data)
@@ -111,6 +125,10 @@ module DataCollector
       return input_data if input_data.is_a?(String)
       Core.filter(input_data, filter)
+    rescue StandardError => e
+      puts "error running filter '#{filter}'\n\t#{e.message}"
+      puts e.backtrace.join("\n")
+      raise DataCollector::Error, "error running filter '#{filter}'\n\t#{e.message}"
     end
   end
 end
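
Note on the rules_ng.rb change above: the 'suffix' payload now descends one level into Hash and Array inputs instead of interpolating the whole object into a string, and options whose key starts with an underscore are filtered out before they reach a Proc payload. The following is a minimal standalone sketch of those two behaviours, not the gem's code. `append_suffix` and `public_options` are hypothetical helper names, and the sketch uses plain Ruby where the gem calls ActiveSupport's `with_indifferent_access`.

```ruby
# Hypothetical helper mirroring the new 'suffix' branch in the diff above.
def append_suffix(value, suffix)
  if value.is_a?(Hash)
    # Only String values receive the suffix; other values pass through.
    value.transform_values { |v| v.is_a?(String) ? "#{v}#{suffix}" : v }
  elsif value.is_a?(Array)
    # Same rule for array elements.
    value.map { |n| n.is_a?(String) ? "#{n}#{suffix}" : n }
  elsif value.respond_to?(:to_s)
    # Scalars are interpolated, as in 0.17.0.
    "#{value}#{suffix}"
  else
    value
  end
end

# Hypothetical helper approximating the normalized_options filter:
# keys that start with an underscore are treated as internal and dropped.
def public_options(options)
  options.reject { |k, _| k.to_s.start_with?('_') }
end

p append_suffix({ 'a' => 'x', 'b' => 1 }, '.txt')
p append_suffix(['x', 2], '.txt')
p public_options({ '_no_array_with_one_element' => true, 'lang' => 'en' })
```

In 0.17.0 the same branch was the single line `"#{m}#{payload['suffix']}"`, so a Hash or Array input was stringified as a whole before the suffix was appended.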

data/lib/data_collector/runner.rb CHANGED

@@ -29,7 +29,6 @@ module DataCollector
       puts e.message
       puts e.backtrace.join("\n")
     ensure
-#    output.tar_file.close unless output.tar_file.closed?
       @logger.info("Finished in #{((Time.now - @time_start)*1000).to_i} ms")
     end

data/lib/data_collector/version.rb CHANGED

@@ -1,4 +1,4 @@
 # encoding: utf-8
 module DataCollector
-  VERSION = "0.17.0"
+  VERSION = "0.18.0"
 end

data/lib/data_collector.rb CHANGED

@@ -4,6 +4,7 @@ require 'logger'
 require 'data_collector/version'
 require 'data_collector/runner'
+require 'data_collector/pipeline'
 require 'data_collector/ext/xml_utility_node'
 module DataCollector

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: data_collector
 version: !ruby/object:Gem::Version
-  version: 0.17.0
+  version: 0.18.0
 platform: ruby
 authors:
 - Mehmet Celik
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2023-03-16 00:00:00.000000000 Z
+date: 2023-04-18 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activesupport
@@ -114,14 +114,14 @@ dependencies:
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.13'
+        version: '1.14'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.13'
+        version: '1.14'
 - !ruby/object:Gem::Dependency
   name: nori
   requirement: !ruby/object:Gem::Requirement
@@ -136,6 +136,48 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '2.6'
+- !ruby/object:Gem::Dependency
+  name: iso8601
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.13'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.13'
+- !ruby/object:Gem::Dependency
+  name: listen
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.8'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.8'
+- !ruby/object:Gem::Dependency
+  name: bunny
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.20'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.20'
 - !ruby/object:Gem::Dependency
   name: bundler
   requirement: !ruby/object:Gem::Requirement
@@ -156,14 +198,14 @@ dependencies:
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '5.16'
+        version: '5.18'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '5.16'
+        version: '5.18'
 - !ruby/object:Gem::Dependency
   name: rake
   requirement: !ruby/object:Gem::Requirement
@@ -208,13 +250,19 @@ files:
 - bin/console
 - bin/setup
 - data_collector.gemspec
+- examples/marc.rb
 - lib/data_collector.rb
 - lib/data_collector/config_file.rb
 - lib/data_collector/core.rb
 - lib/data_collector/ext/xml_utility_node.rb
 - lib/data_collector/input.rb
+- lib/data_collector/input/dir.rb
+- lib/data_collector/input/generic.rb
+- lib/data_collector/input/queue.rb
 - lib/data_collector/output.rb
+- lib/data_collector/pipeline.rb
 - lib/data_collector/rules.rb
+- lib/data_collector/rules.rb.depricated
 - lib/data_collector/rules_ng.rb
 - lib/data_collector/runner.rb
 - lib/data_collector/version.rb
@@ -240,7 +288,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.1.6
+rubygems_version: 3.4.10
 signing_key:
 specification_version: 4
 summary: ETL helper library
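
Note on the metadata change above: 0.18.0 adds three runtime dependencies (iso8601, listen, bunny), each with a pessimistic `~>` constraint. The sketch below uses RubyGems' own `Gem::Requirement` and `Gem::Version` classes to show which versions such a constraint accepts. The helper names `new_runtime_deps` and `allowed?` are hypothetical and exist only for this illustration.

```ruby
require 'rubygems'

# Runtime dependencies added in 0.18.0, copied from the metadata diff above.
def new_runtime_deps
  {
    'iso8601' => '~> 0.13',
    'listen'  => '~> 3.8',
    'bunny'   => '~> 2.20'
  }
end

# True when the version string satisfies the constraint recorded for the gem.
def allowed?(gem_name, version)
  requirement = Gem::Requirement.new(new_runtime_deps.fetch(gem_name))
  requirement.satisfied_by?(Gem::Version.new(version))
end

puts allowed?('bunny', '2.22.0') # inside the 2.x range, at or above 2.20
puts allowed?('bunny', '3.0.0')  # next major version, rejected
puts allowed?('bunny', '2.9.0')  # 2.9 sorts below 2.20, rejected
```

A two-segment constraint such as `~> 2.20` allows any later 2.x release but not 3.0, so upgrading to 0.18.0 can pull in newer minor versions of these gems.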