data_collector 0.17.0 → 0.19.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 351bb040c33b9010903681a117df29f02ba3663e440ed30b37520d5a8aa98b30
4
- data.tar.gz: a307677b46ecce478fed2206cf5e9173b67bcec8a8b772ca1f12d71fd33a6fad
3
+ metadata.gz: 35d57ff2998ab1343a4d6e906bcd76bd67951a0eae9a6db69387e4de7dbba285
4
+ data.tar.gz: 702a6447c28533d2dcdce237cc209417963ea2827eaed4f4ad0ab56a62c42783
5
5
  SHA512:
6
- metadata.gz: 6298f438cf8030be76ac85f9652aead672dedd42c6f5c6324b6004bae175442bb9565e31ec3aee430da601e3bcdc43eab08dbf19e9cfb89fe54f00ef61758325
7
- data.tar.gz: 7b77100e03002764c58e4f2b0d3c962b445422fe7109aa14f0ac7f389628399392db0e172fe7141a35967a6f87c26cce2a8ee7865084454bb9e5587550085ae7
6
+ metadata.gz: 80e487e0d8bfa19cec43a607b3c58698c37e23fd6385be3102d3ca87584348d585241f1794f848c460c02751f29f3a28d1365c472b8e3a532a922f9104fb2e06
7
+ data.tar.gz: 0366f4350e54e1bf985f68d3d0532b6fb00394aad23a3641b07fd65594e7e0b16ddf5da90574251c75b83c63fe09455aa988b7395706fe76e995b17fc79cb2fe
data/README.md CHANGED
@@ -1,39 +1,91 @@
1
1
  # DataCollector
2
- Convenience module to Extract, Transform and Load your data.
3
- You have main objects that help you to 'INPUT', 'OUTPUT' and 'FILTER' data. The basic ETL components.
4
- Support objects like CONFIG, LOG, RULES and the new RULES_NG just to make life easier.
2
+ Convenience module to Extract, Transform and Load your data in a Pipeline.
3
+ The 'INPUT', 'OUTPUT' and 'FILTER' objects help you read, transform and output your data.
4
+ Support objects like CONFIG, LOG, ERROR and RULES help you write manageable rules to transform and log your data.
5
+ Including the DataCollector::Core module in your application gives you access to these objects.
6
+ ```ruby
7
+ include DataCollector::Core
8
+ ```
5
9
 
6
- Including the DataCollector::Core module into your application gives you access to these objects.
7
-
8
- The RULES and RULES_NG objects work in a very simple concept. Rules exist of 3 components:
9
- - a destination tag
10
- - a jsonpath filter to get the data
11
- - a lambda to execute on every filter hit
10
+ Every object can be used on its own.
11
+
12
+
13
+ #### Pipeline
14
+ Allows you to create a simple pipeline of operations that collects, processes and transforms data, and then transfers it to various systems and applications.
15
+
16
+ You can schedule a pipeline, specifying how often it should be
17
+ executed in the [ISO8601 duration format](https://www.digi.com/resources/documentation/digidocs//90001488-13/reference/r_iso_8601_duration_format.htm). The processing logic then runs on that interval.
18
+ ###### methods:
19
+ - .new(options): options can be a schedule in [ISO8601 duration format](https://www.digi.com/resources/documentation/digidocs//90001488-13/reference/r_iso_8601_duration_format.htm) and a name
20
+ - .run: start the pipeline; blocks if a schedule is supplied
21
+ - .stop: stop the pipeline
22
+ - .pause: pause the pipeline. Restart using .run
23
+ - .running?: is pipeline running
24
+ - .stopped?: is pipeline not running
25
+ - .paused?: is pipeline paused
26
+ - .name: name of the pipeline
27
+ - .run_count: number of times the pipeline has run
28
+ - .on_message: handler run every time a trigger event happens
29
+ ###### example:
30
+ ```ruby
31
+ # create a pipeline scheduled to run every 10 minutes
32
+ pipeline = Pipeline.new(schedule: 'PT10M')
33
+
34
+ pipeline.on_message do |input, output|
35
+ # logic
36
+ end
12
37
 
38
+ pipeline.run
39
+ ```
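The schedule string above ('PT10M') is an ISO8601 duration. As a rough, stdlib-only sketch of how such a duration maps to an interval in seconds (the gem itself delegates this to the iso8601 library; `duration_to_seconds` is a hypothetical helper for illustration):

```ruby
# Minimal ISO8601 duration parser covering the common PnDTnHnMnS forms.
# Illustration only; the gem uses ISO8601::Duration from the iso8601 gem.
def duration_to_seconds(duration)
  m = duration.match(/\AP(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?\z/)
  raise ArgumentError, "bad duration: #{duration}" unless m

  # nil captures coerce to 0 via NilClass#to_i
  days, hours, minutes, seconds = m.captures.map(&:to_i)
  days * 86_400 + hours * 3_600 + minutes * 60 + seconds
end

duration_to_seconds('PT10M') # => 600
```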
13
40
 
14
- #### input
15
- Read input from an URI. This URI can have a http, https or file scheme
41
+ #### input
42
+ The input component is part of the processing logic. All data is converted into a Hash, Array, ... accessible using plain Ruby or, via the filter object, JSONPath.
43
+ The input component can fetch data from various URIs, such as files, URLs, directories, queues, ...
44
+ For a push input component, a listener is created with a processing logic block that is executed whenever new data is available.
45
+ A push happens when new data is created in a directory, message queue, ...
16
46
 
17
- **Public methods**
18
47
  ```ruby
19
48
  from_uri(source, options = {:raw, :content_type})
20
49
  ```
21
- - source: an uri with a scheme of http, https, file
50
+ - source: a URI with a scheme of http, https, file, amqp
22
51
  - options:
23
52
  - raw: _boolean_ do not parse
24
53
  - content_type: _string_ force a content_type if the 'Content-Type' returned by the http server is incorrect
25
54
 
26
- example:
55
+ ###### example:
27
56
  ```ruby
57
+ # read from an http endpoint
28
58
  input.from_uri("http://www.libis.be")
29
59
  input.from_uri("file://hello.txt")
30
60
  input.from_uri("http://www.libis.be/record.jsonld", content_type: 'application/ld+json')
31
- ```
32
61
 
62
+ # read data from a RabbitMQ queue
63
+ listener = input.from_uri('amqp://user:password@localhost?channel=hello')
64
+ listener.on_message do |input, output, message|
65
+ puts message
66
+ end
67
+ listener.start
68
+
69
+ # read data from a directory
70
+ listener = input.from_uri('file://this/is/directory')
71
+ listener.on_message do |input, output, filename|
72
+ puts filename
73
+ end
74
+ listener.start
75
+ ```
33
76
 
77
+ Inputs can be JSON, XML or CSV, or XML in a TAR.GZ file
34
78
 
79
+ ###### listener from input.from_uri(directory|message queue)
80
+ When a listener is defined that is triggered by an event (PUSH), like a message queue or files written to a directory, you have these extra methods.
35
81
 
36
- Inputs can be JSON, XML or CSV or XML in a TAR.GZ file
82
+ - .run: start the listener; blocks if a schedule is supplied
83
+ - .stop: stop the listener
84
+ - .pause: pause the listener. Restart using .run
85
+ - .running?: is listener running
86
+ - .stopped?: is listener not running
87
+ - .paused?: is listener paused
88
+ - .on_message: handler run every time a trigger event happens
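The lifecycle methods above behave like a small state machine. A stdlib-only sketch of the intended semantics (hypothetical `Listener` class for illustration, not the gem's implementation):

```ruby
# Minimal state machine mirroring the listener lifecycle flags.
class Listener
  def initialize
    @running = false
    @paused = false
  end

  # start, or restart after a pause
  def run
    @running = true
    @paused = false
  end

  # pausing only makes sense while running
  def pause
    @paused = true if @running
  end

  def stop
    @running = false
    @paused = false
  end

  def running?
    @running && !@paused
  end

  def paused?
    @paused
  end

  def stopped?
    !@running
  end
end
```

Calling .run after .pause resumes processing; .stop resets both flags.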
37
89
 
38
90
  ### output
39
91
  Output is an object you can store key/value pairs that needs to be written to an output stream.
@@ -45,7 +97,7 @@ Output is an object you can store key/value pairs that needs to be written to an
45
97
  Write output to a file, string use an ERB file as a template
46
98
  example:
47
99
  ___test.erb___
48
- ```ruby
100
+ ```erb
49
101
  <names>
50
102
  <combined><%= data[:name] %> <%= data[:last_name] %></combined>
51
103
  <%= print data, :name, :first_name %>
@@ -53,7 +105,7 @@ ___test.erb___
53
105
  </names>
54
106
  ```
55
107
  will produce
56
- ```ruby
108
+ ```html
57
109
  <names>
58
110
  <combined>John Doe</combined>
59
111
  <first_name>John</first_name>
@@ -97,41 +149,11 @@ filter data from a hash using [JSONPath](http://goessner.net/articles/JsonPath/i
97
149
  filtered_data = filter(data, "$..metadata.record")
98
150
  ```
99
151
 
100
- #### rules (depricated)
101
- See newer rules_ng object
102
- ~~Allows you to define a simple lambda structure to run against a JSONPath filter~~
103
-
104
- ~~A rule is made up of a Hash the key is the map key field its value is a Hash with a JSONPath filter and options to apply a convert method on the filtered results.~~
105
- ~~Available convert methods are: time, map, each, call, suffix, text~~
106
- ~~- time: parses a given time/date string into a Time object~~
107
- ~~- map: applies a mapping to a filter~~
108
- ~~- suffix: adds a suffix to a result~~
109
- ~~- call: executes a lambda on the filter~~
110
- ~~- each: runs a lambda on each row of a filter~~
111
- ~~- text: passthrough method. Returns value unchanged~~
112
-
113
- ~~example:~~
114
- ```ruby
115
- my_rules = {
116
- 'identifier' => {"filter" => '$..id'},
117
- 'language' => {'filter' => '$..lang',
118
- 'options' => {'convert' => 'map',
119
- 'map' => {'nl' => 'dut', 'fr' => 'fre', 'de' => 'ger', 'en' => 'eng'}
120
- }
121
- },
122
- 'subject' => {'filter' => '$..keywords',
123
- options' => {'convert' => 'each',
124
- 'lambda' => lambda {|d| d.split(',')}
125
- }
126
- },
127
- 'creationdate' => {'filter' => '$..published_date', 'convert' => 'time'}
128
- }
129
-
130
- rules.run(my_rules, record, output)
131
- ```
132
-
133
- #### rules_ng
134
- !!! not compatible with RULES object
152
+ #### rules
153
+ The RULES object is built around a simple concept. A rule consists of 3 components:
154
+ - a destination tag
155
+ - a jsonpath filter to get the data
156
+ - a lambda to execute on every filter hit
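As a plain-Ruby illustration of the concept — a simplified dot-path lookup stands in for the real JSONPath filter here (the gem uses the jsonpath library); `apply_rules` and the dot-path syntax are assumptions for this sketch:

```ruby
# A rule: destination tag => { filter => lambda executed on every hit }.
RULES = {
  'title' => { 'record.title' => ->(hit) { hit.upcase } }
}

# Simplified stand-in for a JSONPath filter: follow a dot-separated path.
def apply_rules(rules, record)
  rules.each_with_object({}) do |(tag, spec), out|
    spec.each do |path, fn|
      hit = path.split('.').reduce(record) { |data, key| data && data[key] }
      out[tag] = fn.call(hit) if hit
    end
  end
end

apply_rules(RULES, { 'record' => { 'title' => 'hello' } })
# => {"title"=>"HELLO"}
```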
135
157
 
136
158
  TODO: work in progress see test for examples on how to use
137
159
 
@@ -202,15 +224,15 @@ Here you find different rule combination that are possible
202
224
  }
203
225
  ```
204
226
 
205
- Here is an example on how to call last RULESET "rs_hash_with_json_filter_and_option".
206
227
 
207
- ***rules_ng.run*** can have 4 parameters. First 3 are mandatory. The last one ***options*** can hold data static to a rule set or engine directives.
228
+ ***rules.run*** can have 4 parameters. The first 3 are mandatory. The last one ***options*** can hold data static to a rule set or engine directives.
208
229
 
209
- List of engine directives:
230
+ ##### List of engine directives:
210
231
  - _no_array_with_one_element: defaults to false. if the result is an array with 1 element just return the element.
211
232
 
212
-
233
+ ###### example:
213
234
  ```ruby
235
+ # apply RULESET "rs_hash_with_json_filter_and_option" to data
214
236
  include DataCollector::Core
215
237
  output.clear
216
238
  data = {'subject' => ['water', 'thermodynamics']}
@@ -247,8 +269,11 @@ Log to stdout
247
269
  ```ruby
248
270
  log("hello world")
249
271
  ```
250
-
251
-
272
+ #### error
273
+ Log an error
274
+ ```ruby
275
+ error("if you have an issue take a tissue")
276
+ ```
252
277
  ## Example
253
278
  Input data ___test.csv___
254
279
  ```csv
@@ -315,7 +340,32 @@ Or install it yourself as:
315
340
 
316
341
  ## Usage
317
342
 
318
- TODO: Write usage instructions here
343
+ ```ruby
344
+ require 'data_collector'
345
+
346
+ include DataCollector::Core
347
+ # including core gives you pipeline, input, output, filter, config, log and error objects to work with
348
+ RULES = {
349
+ 'title' => '$..vertitle'
350
+ }
351
+ # create a PULL pipeline and schedule it to run every 5 seconds
352
+ pipeline = DataCollector::Pipeline.new(schedule: 'PT5S')
353
+
354
+ pipeline.on_message do |input, output|
355
+ data = input.from_uri('https://services3.libis.be/primo_artefact/lirias3611609')
356
+ rules.run(RULES, data, output)
357
+ #puts JSON.pretty_generate(input.raw)
358
+ puts JSON.pretty_generate(output.raw)
359
+ output.clear
360
+
361
+ if pipeline.run_count > 2
362
+ log('stopping pipeline after 3 runs')
363
+ pipeline.stop
364
+ end
365
+ end
366
+ pipeline.run
367
+
368
+ ```
319
369
 
320
370
  ## Development
321
371
 
@@ -43,11 +43,14 @@ Gem::Specification.new do |spec|
43
43
  spec.add_runtime_dependency 'jsonpath', '~> 1.1'
44
44
  spec.add_runtime_dependency 'mime-types', '~> 3.4'
45
45
  spec.add_runtime_dependency 'minitar', '= 0.9'
46
- spec.add_runtime_dependency 'nokogiri', '~> 1.13'
46
+ spec.add_runtime_dependency 'nokogiri', '~> 1.14'
47
47
  spec.add_runtime_dependency 'nori', '~> 2.6'
48
+ spec.add_runtime_dependency 'iso8601', '~> 0.13'
49
+ spec.add_runtime_dependency 'listen', '~> 3.8'
50
+ spec.add_runtime_dependency 'bunny', '~> 2.20'
48
51
 
49
52
  spec.add_development_dependency 'bundler', '~> 2.3'
50
- spec.add_development_dependency 'minitest', '~> 5.16'
53
+ spec.add_development_dependency 'minitest', '~> 5.18'
51
54
  spec.add_development_dependency 'rake', '~> 13.0'
52
55
  spec.add_development_dependency 'webmock', '~> 3.18'
53
56
  end
data/examples/marc.rb ADDED
@@ -0,0 +1,27 @@
1
+ $LOAD_PATH << '../lib'
2
+ require 'data_collector'
3
+
4
+ # include module gives us pipeline, input, output, filter, log and error objects to work with
5
+ include DataCollector::Core
6
+
7
+ RULES = {
8
+ "title" => {'$.record.datafield[?(@._tag == "245")]' => lambda do |d, o|
9
+ subfields = d['subfield']
10
+ subfields = [subfields] unless subfields.is_a?(Array)
11
+ subfields.map{|m| m["$text"]}.join(' ')
12
+ end
13
+ },
14
+ "author" => {'$..datafield[?(@._tag == "100")]' => lambda do |d, o|
15
+ subfields = d['subfield']
16
+ subfields = [subfields] unless subfields.is_a?(Array)
17
+ subfields.map{|m| m["$text"]}.join(' ')
18
+ end
19
+ }
20
+ }
21
+
22
+ # read remote record and enable logging
23
+ data = input.from_uri('https://gist.githubusercontent.com/kefo/796b39925e234fb6d912/raw/3df2ce329a947864ae8555f214253f956d679605/sample-marc-with-xsd.xml', {logging: true})
24
+ # apply rules to data and if result contains only 1 entry do not return an array
25
+ rules.run(RULES, data, output, {_no_array_with_one_element: true})
26
+ # print result
27
+ puts JSON.pretty_generate(output.raw)
@@ -10,6 +10,14 @@ require_relative 'config_file'
10
10
 
11
11
  module DataCollector
12
12
  module Core
13
+ # Pipeline to orchestrate your data processing
14
+ # example: pipeline.on_message do |input, output|
15
+ # ** processing logic here **
16
+ # end
17
+ def pipeline
18
+ @pipeline ||= DataCollector::Pipeline.new
19
+ end
20
+ module_function :pipeline
13
21
  # Read input from an URI
14
22
  # example: input.from_uri("http://www.libis.be")
15
23
  # input.from_uri("file://hello.txt")
@@ -79,6 +87,8 @@ module DataCollector
79
87
  # }
80
88
  # rules.run(my_rules, input, output)
81
89
  def rules
90
+ #DataCollector::Core.log('RULES deprecated, use RULES_NG')
91
+ #rules_ng
82
92
  @rules ||= Rules.new
83
93
  end
84
94
  module_function :rules
@@ -121,6 +131,12 @@ module DataCollector
121
131
  end
122
132
  module_function :log
123
133
 
134
+ def error(message)
135
+ @logger ||= Logger.new(STDOUT)
136
+ @logger.error(message)
137
+ end
138
+ module_function :error
139
+
124
140
  end
125
141
 
126
142
  end
@@ -0,0 +1,28 @@
1
+ require_relative 'generic'
2
+ require 'listen'
3
+
4
+ module DataCollector
5
+ class Input
6
+ class Dir < Generic
7
+ def initialize(uri, options)
8
+ super
9
+ end
10
+
11
+ def running?
12
+ @listener.processing?
13
+ end
14
+
15
+ private
16
+
17
+ def create_listener
18
+ @listener ||= Listen.to("#{@uri.host}/#{@uri.path}", @options) do |modified, added, _|
19
+ files = added | modified
20
+ files.each do |filename|
21
+ handle_on_message(@input, @output, filename)
22
+ end
23
+ end
24
+ end
25
+
26
+ end
27
+ end
28
+ end
@@ -0,0 +1,77 @@
1
+ require 'listen'
2
+
3
+ module DataCollector
4
+ class Input
5
+ class Generic
6
+ def initialize(uri, options)
7
+ @uri = uri
8
+ @options = options
9
+
10
+ @input = DataCollector::Input.new
11
+ @output = DataCollector::Output.new
12
+
13
+ @listener = create_listener
14
+ end
15
+
16
+ def run(should_block = false, &block)
17
+ raise DataCollector::Error, 'Please supply a on_message block' if @on_message_callback.nil?
18
+ @listener.start
19
+
20
+ if should_block
21
+ while running?
22
+ yield block if block_given?
23
+ sleep 2
24
+ end
25
+ else
26
+ yield block if block_given?
27
+ end
28
+
29
+ end
30
+
31
+ def stop
32
+ @listener.stop
33
+ end
34
+
35
+ def pause
36
+ @listener.pause
37
+ end
38
+
39
+ def running?
40
+ @listener.running?
41
+ end
42
+
43
+ def stopped?
44
+ @listener.stopped?
45
+ end
46
+
47
+ def paused?
48
+ @listener.paused?
49
+ end
50
+
51
+ def on_message(&block)
52
+ @on_message_callback = block
53
+ end
54
+
55
+ private
56
+
57
+ def create_listener
58
+ raise DataCollector::Error, 'Please implement a listener'
59
+ end
60
+
61
+ def handle_on_message(input, output, data)
62
+ if (callback = @on_message_callback)
63
+ timing = Time.now
64
+ begin
65
+ callback.call(input, output, data)
66
+ rescue StandardError => e
67
+ DataCollector::Core.error("INPUT #{e.message}")
68
+ puts e.backtrace.join("\n")
69
+ ensure
70
+ DataCollector::Core.log("INPUT ran for #{((Time.now.to_f - timing.to_f).to_f * 1000.0).to_i}ms")
71
+ end
72
+ end
73
+ end
74
+
75
+ end
76
+ end
77
+ end
@@ -0,0 +1,60 @@
1
+ require_relative 'generic'
2
+ require 'bunny'
3
+ require 'active_support/core_ext/hash'
4
+
5
+ module DataCollector
6
+ class Input
7
+ class Queue < Generic
8
+ def initialize(uri, options)
9
+ super
10
+
11
+ if running?
12
+ create_channel unless @channel
13
+ create_queue unless @queue
14
+ end
15
+ end
16
+
17
+ def running?
18
+ @listener.open?
19
+ end
20
+
21
+ def send(message)
22
+ if running?
23
+ @queue.publish(message)
24
+ end
25
+ end
26
+
27
+ private
28
+
29
+ def create_listener
30
+ @listener ||= begin
31
+ connection = Bunny.new(@uri.to_s)
32
+ connection.start
33
+
34
+ connection
35
+ rescue StandardError => e
36
+ raise DataCollector::Error, "Unable to connect to RabbitMQ. #{e.message}"
37
+ end
38
+ end
39
+
40
+ def create_channel
41
+ raise DataCollector::Error, 'Connection to RabbitMQ is closed' if @listener.closed?
42
+ @channel ||= @listener.create_channel
43
+ end
44
+
45
+ def create_queue
46
+ @queue ||= begin
47
+ options = CGI.parse(@uri.query).with_indifferent_access
48
+ raise DataCollector::Error, '"channel" query parameter missing from uri.' unless options.include?(:channel)
49
+ queue = @channel.queue(options[:channel].first)
50
+
51
+ queue.subscribe do |delivery_info, metadata, payload|
52
+ handle_on_message(@input, @output, payload)
53
+ end if queue
54
+
55
+ queue
56
+ end
57
+ end
58
+ end
59
+ end
60
+ end
@@ -12,6 +12,8 @@ require 'active_support/core_ext/hash'
12
12
  require 'zlib'
13
13
  require 'minitar'
14
14
  require 'csv'
15
+ require_relative 'input/dir'
16
+ require_relative 'input/queue'
15
17
 
16
18
  #require_relative 'ext/xml_utility_node'
17
19
  module DataCollector
@@ -34,7 +36,15 @@ module DataCollector
34
36
  when 'https'
35
37
  data = from_https(uri, options)
36
38
  when 'file'
37
- data = from_file(uri, options)
39
+ if File.directory?("#{uri.host}/#{uri.path}")
40
+ raise DataCollector::Error, "#{uri.host}/#{uri.path} not found" unless File.exist?("#{uri.host}/#{uri.path}")
41
+ return from_dir(uri, options)
42
+ else
43
+ raise DataCollector::Error, "#{uri.host}/#{uri.path} not found" unless File.exist?("#{uri.host}/#{uri.path}")
44
+ data = from_file(uri, options)
45
+ end
46
+ when 'amqp'
47
+ data = from_queue(uri,options)
38
48
  else
39
49
  raise "Do not know how to process #{source}"
40
50
  end
@@ -61,7 +71,10 @@ module DataCollector
61
71
 
62
72
  def from_https(uri, options = {})
63
73
  data = nil
64
- HTTP.default_options = HTTP::Options.new(features: { logging: { logger: @logger } })
74
+ if options.with_indifferent_access.include?(:logging) && options.with_indifferent_access[:logging]
75
+ HTTP.default_options = HTTP::Options.new(features: { logging: { logger: @logger } })
76
+ end
77
+
65
78
  http = HTTP
66
79
 
67
80
  #http.use(logging: {logger: @logger})
@@ -157,6 +170,14 @@ module DataCollector
157
170
  data
158
171
  end
159
172
 
173
+ def from_dir(uri, options = {})
174
+ DataCollector::Input::Dir.new(uri, options)
175
+ end
176
+
177
+ def from_queue(uri, options = {})
178
+ DataCollector::Input::Queue.new(uri, options)
179
+ end
180
+
160
181
  def xml_to_hash(data)
161
182
  #gsub('&lt;\/', '&lt; /') outherwise wrong XML-parsing (see records lirias1729192 )
162
183
  data = data.gsub /&lt;/, '&lt; /'
@@ -38,8 +38,10 @@ module DataCollector
38
38
  data[k] << v
39
39
  end
40
40
  else
41
- t = data[k]
42
- data[k] = Array.new([t, v])
41
+ data[k] = v
42
+ # HELP: why am I creating an array here?
43
+ # t = data[k]
44
+ # data[k] = Array.new([t, v])
43
45
  end
44
46
  else
45
47
  data[k] = v
@@ -152,7 +154,6 @@ module DataCollector
152
154
  result
153
155
  rescue Exception => e
154
156
  raise "unable to transform to text: #{e.message}"
155
- ""
156
157
  end
157
158
 
158
159
  def to_tmp_file(erb_file, records_dir)
@@ -0,0 +1,116 @@
1
+ require 'iso8601'
2
+
3
+ module DataCollector
4
+ class Pipeline
5
+ attr_reader :run_count, :name
6
+ def initialize(options = {})
7
+ @running = false
8
+ @paused = false
9
+
10
+ @input = DataCollector::Input.new
11
+ @output = DataCollector::Output.new
12
+ @run_count = 0
13
+
14
+ @schedule = options[:schedule] || {}
15
+ @name = options[:name] || "#{Time.now.to_i}-#{rand(10000)}"
16
+ @options = options
17
+ @listeners = []
18
+ end
19
+
20
+ def on_message(&block)
21
+ @on_message_callback = block
22
+ end
23
+
24
+ def run
25
+ if paused? && @running
26
+ @paused = false
27
+ @listeners.each do |listener|
28
+ listener.run if listener.paused?
29
+ end
30
+ end
31
+
32
+ @running = true
33
+ if @schedule && !@schedule.empty?
34
+ while running?
35
+ @run_count += 1
36
+ start_time = ISO8601::DateTime.new(Time.now.to_datetime.to_s)
37
+ begin
38
+ duration = ISO8601::Duration.new(@schedule)
39
+ rescue StandardError => e
40
+ raise DataCollector::Error, "PIPELINE - bad schedule: #{e.message}"
41
+ end
42
+ interval = ISO8601::TimeInterval.from_duration(start_time, duration)
43
+
44
+ DataCollector::Core.log("PIPELINE running in #{interval.size} seconds")
45
+ sleep interval.size
46
+ handle_on_message(@input, @output) unless paused?
47
+ end
48
+ else # run once
49
+ @run_count += 1
50
+ if @options.key?(:uri)
51
+ listener = Input.new.from_uri(@options[:uri], @options)
52
+ listener.on_message do |input, output, filename|
53
+ DataCollector::Core.log("PIPELINE triggered by #{filename}")
54
+ handle_on_message(@input, @output, filename)
55
+ end
56
+ @listeners << listener
57
+
58
+ listener.run(true)
59
+
60
+ else
61
+ DataCollector::Core.log("PIPELINE running once")
62
+ handle_on_message(@input, @output)
63
+ end
64
+ end
65
+ rescue StandardError => e
66
+ DataCollector::Core.error("PIPELINE run failed: #{e.message}")
67
+ raise e
68
+ #puts e.backtrace.join("\n")
69
+ end
70
+
71
+ def stop
72
+ @running = false
73
+ @paused = false
74
+ @listeners.each do |listener|
75
+ listener.stop if listener.running?
76
+ end
77
+ end
78
+
79
+ def pause
80
+ if @running
81
+ @paused = !@paused
82
+ @listeners.each do |listener|
83
+ listener.pause if listener.running?
84
+ end
85
+ end
86
+ end
87
+
88
+ def running?
89
+ @running
90
+ end
91
+
92
+ def stopped?
93
+ !@running
94
+ end
95
+
96
+ def paused?
97
+ @paused
98
+ end
99
+
100
+ private
101
+
102
+ def handle_on_message(input, output, filename = nil)
103
+ if (callback = @on_message_callback)
104
+ timing = Time.now
105
+ begin
106
+ callback.call(input, output, filename)
107
+ rescue StandardError => e
108
+ DataCollector::Core.error("PIPELINE #{e.message}")
109
+ ensure
110
+ DataCollector::Core.log("PIPELINE ran for #{((Time.now.to_f - timing.to_f).to_f * 1000.0).to_i}ms")
111
+ end
112
+ end
113
+ end
114
+
115
+ end
116
+ end
@@ -1,130 +1,9 @@
1
- require 'logger'
1
+ require_relative 'rules_ng'
2
2
 
3
3
  module DataCollector
4
- class Rules
5
- def initialize()
6
- @logger = Logger.new(STDOUT)
4
+ class Rules < RulesNg
5
+ def initialize(logger = Logger.new(STDOUT))
6
+ super
7
7
  end
8
-
9
- def run(rule_map, from_record, to_record, options = {})
10
- rule_map.each do |map_to_key, rule|
11
- if rule.is_a?(Array)
12
- rule.each do |sub_rule|
13
- apply_rule(map_to_key, sub_rule, from_record, to_record, options)
14
- end
15
- else
16
- apply_rule(map_to_key, rule, from_record, to_record, options)
17
- end
18
- end
19
-
20
- to_record.each do |element|
21
- element = element.delete_if do |k, v|
22
- v != false && (v.nil?)
23
- end
24
- end
25
- end
26
-
27
- private
28
-
29
- def apply_rule(map_to_key, rule, from_record, to_record, options = {})
30
- if rule.has_key?('text')
31
- suffix = (rule && rule.key?('options') && rule['options'].key?('suffix')) ? rule['options']['suffix'] : ''
32
- to_record << { map_to_key.to_sym => add_suffix(rule['text'], suffix) }
33
- elsif rule.has_key?('options') && rule['options'].has_key?('convert') && rule['options']['convert'].eql?('each')
34
- result = get_value_for(map_to_key, rule['filter'], from_record, rule['options'], options)
35
-
36
- if result.is_a?(Array)
37
- result.each do |m|
38
- to_record << {map_to_key.to_sym => m}
39
- end
40
- else
41
- to_record << {map_to_key.to_sym => result}
42
- end
43
- else
44
- result = get_value_for(map_to_key, rule['filter'], from_record, rule['options'], options)
45
- return if result && result.empty?
46
-
47
- to_record << {map_to_key.to_sym => result}
48
- end
49
- end
50
-
51
- def get_value_for(tag_key, filter_path, record, rule_options = {}, options = {})
52
- data = nil
53
- if record
54
- if filter_path.is_a?(Array) && !record.is_a?(Array)
55
- record = [record]
56
- end
57
-
58
- data = Core::filter(record, filter_path)
59
-
60
- if data && rule_options
61
- if rule_options.key?('convert')
62
- case rule_options['convert']
63
- when 'time'
64
- result = []
65
- data = [data] unless data.is_a?(Array)
66
- data.each do |d|
67
- result << Time.parse(d)
68
- end
69
- data = result
70
- when 'map'
71
- if data.is_a?(Array)
72
- data = data.map do |r|
73
- rule_options['map'][r] if rule_options['map'].key?(r)
74
- end
75
-
76
- data.compact!
77
- data.flatten! if rule_options.key?('flatten') && rule_options['flatten']
78
- else
79
- return rule_options['map'][data] if rule_options['map'].key?(data)
80
- end
81
- when 'each'
82
- data = [data] unless data.is_a?(Array)
83
- if options.empty?
84
- data = data.map { |d| rule_options['lambda'].call(d) }
85
- else
86
- data = data.map { |d| rule_options['lambda'].call(d, options) }
87
- end
88
- data.flatten! if rule_options.key?('flatten') && rule_options['flatten']
89
- when 'call'
90
- if options.empty?
91
- data = rule_options['lambda'].call(data)
92
- else
93
- data = rule_options['lambda'].call(data, options)
94
- end
95
- return data
96
- end
97
- end
98
-
99
- if rule_options.key?('suffix')
100
- data = add_suffix(data, rule_options['suffix'])
101
- end
102
-
103
- end
104
-
105
- end
106
-
107
- return data
108
- end
109
-
110
- def add_suffix(data, suffix)
111
- case data.class.name
112
- when 'Array'
113
- result = []
114
- data.each do |d|
115
- result << add_suffix(d, suffix)
116
- end
117
- data = result
118
- when 'Hash'
119
- data.each do |k, v|
120
- data[k] = add_suffix(v, suffix)
121
- end
122
- else
123
- data = data.to_s
124
- data += suffix
125
- end
126
- data
127
- end
128
-
129
8
  end
130
- end
9
+ end
@@ -0,0 +1,130 @@
1
+ require 'logger'
2
+
3
+ module DataCollector
4
+ class Rules
5
+ def initialize()
6
+ @logger = Logger.new(STDOUT)
7
+ end
8
+
9
+ def run(rule_map, from_record, to_record, options = {})
10
+ rule_map.each do |map_to_key, rule|
11
+ if rule.is_a?(Array)
12
+ rule.each do |sub_rule|
13
+ apply_rule(map_to_key, sub_rule, from_record, to_record, options)
14
+ end
15
+ else
16
+ apply_rule(map_to_key, rule, from_record, to_record, options)
17
+ end
18
+ end
19
+
20
+ to_record.each do |element|
21
+ element = element.delete_if do |k, v|
22
+ v != false && (v.nil?)
23
+ end
24
+ end
25
+ end
26
+
27
+ private
28
+
29
+ def apply_rule(map_to_key, rule, from_record, to_record, options = {})
30
+ if rule.has_key?('text')
31
+ suffix = (rule && rule.key?('options') && rule['options'].key?('suffix')) ? rule['options']['suffix'] : ''
32
+ to_record << { map_to_key.to_sym => add_suffix(rule['text'], suffix) }
33
+ elsif rule.has_key?('options') && rule['options'].has_key?('convert') && rule['options']['convert'].eql?('each')
34
+ result = get_value_for(map_to_key, rule['filter'], from_record, rule['options'], options)
35
+
36
+ if result.is_a?(Array)
37
+ result.each do |m|
38
+ to_record << {map_to_key.to_sym => m}
39
+ end
40
+ else
41
+ to_record << {map_to_key.to_sym => result}
42
+ end
43
+ else
44
+ result = get_value_for(map_to_key, rule['filter'], from_record, rule['options'], options)
45
+ return if result && result.empty?
46
+
47
+ to_record << {map_to_key.to_sym => result}
48
+ end
49
+ end
50
+
51
+ def get_value_for(tag_key, filter_path, record, rule_options = {}, options = {})
52
+ data = nil
53
+ if record
54
+ if filter_path.is_a?(Array) && !record.is_a?(Array)
55
+ record = [record]
56
+ end
57
+
58
+ data = Core::filter(record, filter_path)
59
+
60
+ if data && rule_options
61
+ if rule_options.key?('convert')
62
+ case rule_options['convert']
63
+ when 'time'
64
+ result = []
65
+ data = [data] unless data.is_a?(Array)
66
+ data.each do |d|
67
+ result << Time.parse(d)
68
+ end
69
+ data = result
70
+ when 'map'
71
+ if data.is_a?(Array)
72
+ data = data.map do |r|
73
+ rule_options['map'][r] if rule_options['map'].key?(r)
74
+ end
75
+
76
+ data.compact!
77
+ data.flatten! if rule_options.key?('flatten') && rule_options['flatten']
78
+ else
79
+ return rule_options['map'][data] if rule_options['map'].key?(data)
80
+ end
81
+ when 'each'
82
+ data = [data] unless data.is_a?(Array)
83
+ if options.empty?
84
+ data = data.map { |d| rule_options['lambda'].call(d) }
85
+ else
86
+ data = data.map { |d| rule_options['lambda'].call(d, options) }
87
+ end
88
+ data.flatten! if rule_options.key?('flatten') && rule_options['flatten']
89
+ when 'call'
90
+ if options.empty?
91
+ data = rule_options['lambda'].call(data)
92
+ else
93
+ data = rule_options['lambda'].call(data, options)
94
+ end
95
+ return data
96
+ end
97
+ end
98
+
99
+ if rule_options.key?('suffix')
100
+ data = add_suffix(data, rule_options['suffix'])
101
+ end
102
+
103
+ end
104
+
105
+ end
106
+
107
+ return data
108
+ end
109
+
110
+ def add_suffix(data, suffix)
111
+ case data.class.name
112
+ when 'Array'
113
+ result = []
114
+ data.each do |d|
115
+ result << add_suffix(d, suffix)
116
+ end
117
+ data = result
118
+ when 'Hash'
119
+ data.each do |k, v|
120
+ data[k] = add_suffix(v, suffix)
121
+ end
122
+ else
123
+ data = data.to_s
124
+ data += suffix
125
+ end
126
+ data
127
+ end
128
+
129
+ end
130
+ end
@@ -51,30 +51,51 @@ module DataCollector

  data = apply_filtered_data_on_payload(data, rule_payload, options)

- output_data << {tag.to_sym => data} unless data.nil? || (data.is_a?(Array) && data.empty?)
+ output_data << { tag.to_sym => data } unless data.nil? || (data.is_a?(Array) && data.empty?)
  rescue StandardError => e
- puts "error running rule '#{tag}'\n\t#{e.message}"
- puts e.backtrace.join("\n")
+ # puts "error running rule '#{tag}'\n\t#{e.message}"
+ # puts e.backtrace.join("\n")
+ raise DataCollector::Error, "error running rule '#{tag}'\n\t#{e.message}"
  end

  def apply_filtered_data_on_payload(input_data, payload, options = {})
  return nil if input_data.nil?

+ normalized_options = options.select { |k, v| k !~ /^_/ }.with_indifferent_access
  output_data = nil
  case payload.class.name
  when 'Proc'
  data = input_data.is_a?(Array) ? input_data : [input_data]
- output_data = if options.empty?
- data.map { |d| payload.call(d) }
+ output_data = if normalized_options.empty?
+ # data.map { |d| payload.curry.call(d).call(d) }
+ data.map { |d|
+ loop do
+ payload_result = payload.curry.call(d)
+ break payload_result unless payload_result.is_a?(Proc)
+ end
+ }
  else
- data.map { |d| payload.call(d, options) }
+ data.map { |d|
+ loop do
+ payload_result = payload.curry.call(d, normalized_options)
+ break payload_result unless payload_result.is_a?(Proc)
+ end
+ }
  end
  when 'Hash'
  input_data = [input_data] unless input_data.is_a?(Array)
  if input_data.is_a?(Array)
  output_data = input_data.map do |m|
  if payload.key?('suffix')
- "#{m}#{payload['suffix']}"
+ if (m.is_a?(Hash))
+ m.transform_values { |v| v.is_a?(String) ? "#{v}#{payload['suffix']}" : v }
+ elsif m.is_a?(Array)
+ m.map { |n| n.is_a?(String) ? "#{n}#{payload['suffix']}" : n }
+ elsif m.methods.include?(:to_s)
+ "#{m}#{payload['suffix']}"
+ else
+ m
+ end
  else
  payload[m]
  end
@@ -83,7 +104,7 @@ module DataCollector
  when 'Array'
  output_data = input_data
  payload.each do |p|
- output_data = apply_filtered_data_on_payload(output_data, p, options)
+ output_data = apply_filtered_data_on_payload(output_data, p, normalized_options)
  end
  else
  output_data = [input_data]
@@ -92,17 +113,21 @@ module DataCollector
  output_data.compact! if output_data.is_a?(Array)
  output_data.flatten! if output_data.is_a?(Array)
  if output_data.is_a?(Array) &&
- output_data.size == 1 &&
- (output_data.first.is_a?(Array) || output_data.first.is_a?(Hash))
+ output_data.size == 1 &&
+ (output_data.first.is_a?(Array) || output_data.first.is_a?(Hash))
  output_data = output_data.first
  end

- if options.key?('_no_array_with_one_element') && options['_no_array_with_one_element'] &&
+ if options.with_indifferent_access.key?('_no_array_with_one_element') && options.with_indifferent_access['_no_array_with_one_element'] &&
  output_data.is_a?(Array) && output_data.size == 1
  output_data = output_data.first
  end

  output_data
+ rescue StandardError => e
+ # puts "error applying filtered data on payload'#{payload.to_json}'\n\t#{e.message}"
+ # puts e.backtrace.join("\n")
+ raise DataCollector::Error, "error applying filtered data on payload'#{payload.to_json}'\n\t#{e.message}"
  end

  def json_path_filter(filter, input_data)
@@ -111,6 +136,10 @@ module DataCollector
  return input_data if input_data.is_a?(String)

  Core.filter(input_data, filter)
+ rescue StandardError => e
+ puts "error running filter '#{filter}'\n\t#{e.message}"
+ puts e.backtrace.join("\n")
+ raise DataCollector::Error, "error running filter '#{filter}'\n\t#{e.message}"
  end
  end
 end
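The Proc branch above switches from a plain `payload.call` to `Proc#curry` inside a loop, so a rule lambda may declare either one parameter (the record) or two (record plus options): currying a 2-arity lambda with only the record yields an intermediate Proc rather than raising an ArgumentError. A hypothetical standalone sketch of that idea (`run_lambda` and the rule names are illustrative, not the gem's code):

```ruby
# Hypothetical sketch: invoke a rule lambda via Proc#curry and keep
# applying the options hash until a concrete (non-Proc) value comes back.
def run_lambda(rule, record, options = {})
  result = options.empty? ? rule.curry.call(record) : rule.curry.call(record, options)
  # A 2-arity lambda given only the record yields an intermediate Proc;
  # complete the call with the options hash.
  result = result.call(options) while result.is_a?(Proc)
  result
end

upcase_rule = ->(d) { d.upcase }
suffix_rule = ->(d, opts) { "#{d}#{opts['suffix']}" }

run_lambda(upcase_rule, 'abc')                   # => "ABC"
run_lambda(suffix_rule, 'abc', 'suffix' => '!')  # => "abc!"
```

Note this sketch completes a partial Proc by applying the remaining argument; the loop in the diff instead breaks as soon as the curried call stops returning a Proc.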
@@ -29,7 +29,6 @@ module DataCollector
  puts e.message
  puts e.backtrace.join("\n")
  ensure
- # output.tar_file.close unless output.tar_file.closed?
  @logger.info("Finished in #{((Time.now - @time_start)*1000).to_i} ms")
  end

@@ -1,4 +1,4 @@
  # encoding: utf-8
  module DataCollector
- VERSION = "0.17.0"
+ VERSION = "0.19.0"
  end
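Several rescue blocks earlier in this diff stop printing errors and instead re-raise them as a library-specific `DataCollector::Error`, so callers can rescue a single class uniformly. A self-contained sketch of that pattern (the `run_rule` wrapper is hypothetical; the `Error` class is redefined here only to keep the sketch runnable on its own):

```ruby
# Sketch of the new error-handling pattern: wrap rule execution and
# re-raise any failure as DataCollector::Error, keeping the rule tag
# and the original message in the error text.
module DataCollector
  class Error < StandardError; end
end

def run_rule(tag)
  yield
rescue StandardError => e
  raise DataCollector::Error, "error running rule '#{tag}'\n\t#{e.message}"
end

begin
  run_rule('title') { raise 'boom' }
rescue DataCollector::Error => e
  e.message  # => "error running rule 'title'\n\tboom"
end
```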
@@ -4,6 +4,7 @@ require 'logger'

  require 'data_collector/version'
  require 'data_collector/runner'
+ require 'data_collector/pipeline'
  require 'data_collector/ext/xml_utility_node'

  module DataCollector
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: data_collector
  version: !ruby/object:Gem::Version
- version: 0.17.0
+ version: 0.19.0
  platform: ruby
  authors:
  - Mehmet Celik
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2023-03-16 00:00:00.000000000 Z
+ date: 2023-05-08 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: activesupport
@@ -114,14 +114,14 @@ dependencies:
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '1.13'
+ version: '1.14'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '1.13'
+ version: '1.14'
  - !ruby/object:Gem::Dependency
  name: nori
  requirement: !ruby/object:Gem::Requirement
@@ -136,6 +136,48 @@ dependencies:
  - - "~>"
  - !ruby/object:Gem::Version
  version: '2.6'
+ - !ruby/object:Gem::Dependency
+ name: iso8601
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '0.13'
+ type: :runtime
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '0.13'
+ - !ruby/object:Gem::Dependency
+ name: listen
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '3.8'
+ type: :runtime
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '3.8'
+ - !ruby/object:Gem::Dependency
+ name: bunny
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '2.20'
+ type: :runtime
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '2.20'
  - !ruby/object:Gem::Dependency
  name: bundler
  requirement: !ruby/object:Gem::Requirement
@@ -156,14 +198,14 @@ dependencies:
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '5.16'
+ version: '5.18'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '5.16'
+ version: '5.18'
  - !ruby/object:Gem::Dependency
  name: rake
  requirement: !ruby/object:Gem::Requirement
@@ -208,13 +250,19 @@ files:
  - bin/console
  - bin/setup
  - data_collector.gemspec
+ - examples/marc.rb
  - lib/data_collector.rb
  - lib/data_collector/config_file.rb
  - lib/data_collector/core.rb
  - lib/data_collector/ext/xml_utility_node.rb
  - lib/data_collector/input.rb
+ - lib/data_collector/input/dir.rb
+ - lib/data_collector/input/generic.rb
+ - lib/data_collector/input/queue.rb
  - lib/data_collector/output.rb
+ - lib/data_collector/pipeline.rb
  - lib/data_collector/rules.rb
+ - lib/data_collector/rules.rb.depricated
  - lib/data_collector/rules_ng.rb
  - lib/data_collector/runner.rb
  - lib/data_collector/version.rb
@@ -240,7 +288,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  - !ruby/object:Gem::Version
  version: '0'
  requirements: []
- rubygems_version: 3.1.6
+ rubygems_version: 3.4.10
  signing_key:
  specification_version: 4
  summary: ETL helper library