metacrunch 3.0.1 → 3.0.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/Readme.md +221 -41
- data/lib/metacrunch/db/writer.rb +29 -1
- data/lib/metacrunch/job.rb +10 -16
- data/lib/metacrunch/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 140352c3ee66626aef744b87358762a4130f6823
+  data.tar.gz: f9bd336d44ac985f5806045b852e219236e1d038
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 494530523e869e12ef00bd709ad139e840b1fd580d39d587575b85b1acdf029a573c330776eb65085a99c6714f4fd01afdad8a37d25dfe7197568682d229cde2
+  data.tar.gz: f09ba8cadfc1a10cb26b9a5125797da1dbb7dd9c24f4b11381183181b0916fa57a74abbf20713614f062e0026f7b0670f4cb3fa2a568e13b5b017f4d718a0c07
data/Readme.md
CHANGED
@@ -17,51 +17,64 @@ $ gem install metacrunch
 ```
 
 
-
-
+Creating ETL jobs
+-----------------
 
-The basic idea behind an ETL job in metacrunch is the concept of a data processing pipeline. Each ETL job reads data from one or more **sources** (extract step), runs one or more **transformations** (transform step) on the data and finally writes the transformed data
+The basic idea behind an ETL job in metacrunch is the concept of a data processing pipeline. Each ETL job reads data from one or more **sources** (extract step), runs one or more **transformations** (transform step) on the data and finally writes the transformed data to one or more **destinations** (load step).
 
-metacrunch provides you with a simple DSL to define such ETL jobs. Just create a text file with the extension `.metacrunch`. Note: The extension doesn't really matter but you should avoid `.rb` to not loading them by mistake from another Ruby component
+metacrunch provides you with a simple DSL to define and run such ETL jobs. Just create a text file with the extension `.metacrunch`. *Note: The extension doesn't really matter but you should avoid `.rb` to not loading them by mistake from another Ruby component.*
 
-Let's
+Let's walk through the main steps of creating ETL jobs with metacrunch. For a collection of working examples check out our [metacrunch-demo](https://github.com/ubpb/metacrunch-demo) repo.
+
+#### It's Ruby
+
+Every `.metacrunch` job file is a regular Ruby file. So you can always use regular stuff like e.g. declaring methods, classes, variable and requiring other Ruby files.
 
 ```ruby
 # File: my_etl_job.metacrunch
 
-# Every metacrunch job file is a regular Ruby file. So you can always use regular Ruby
-# stuff like declaring methods
 def my_helper
   # ...
 end
 
-# ... declaring classes
 class MyHelper
   # ...
 end
 
-
-foo = "bar"
+helper = MyHelper.new
 
-
+require "SomeGem"
 require_relative "./some/other/ruby/file"
+```
+
+#### Defining sources
+
+A source (aka. a reader) is an object that reads data into the metacrunch processing pipeline. Use one of the build-in or 3rd party sources or implement it by yourself. Implementing sources is easy – [see notes below](#implementing-sources). You can declare one or more sources. They are processed in the order they are defined.
+
+You must declare at least one source to allow a job to run.
+
+```ruby
+# File: my_etl_job.metacrunch
 
-
-# At least one source is required to allow the job to run.
+source Metacrunch::Fs::Reader.new(args)
 source MySource.new
-
-source MyOtherSource.new
+```
 
-
-
-
-
-
-
+This example uses a build-in file reader source. To learn more about the build-in sources see [notes below](#built-in-sources-and-destinations).
+
+#### Defining transformations
+
+To process, transform or manipulate data use the `#transformation` hook. A transformation can be implemented as a block, a lambda or as an (callable) object. To learn more about transformations check the section about [implementing transformations](#implementing-transformations) below.
+
+The current data object (the object that is currently read by the source) will be passed to the first transformation as a parameter. The return value of a transformation will then be passed to the next transformation - or to the destination if the current transformation is the last one.
+
+If you return nil the current data object will be dismissed and the next transformation (or destination) won't be called.
+
+```ruby
+# File: my_etl_job.metacrunch
 
-# To process data use the #transformation hook.
 transformation do |data|
-  # Called for each data object that has been
+  # Called for each data object that has been read by a source.
 
   # Do your data transformation process here.
 
@@ -71,60 +84,227 @@ transformation do |data|
 end
 
 # Instead of passing a block to #transformation you can pass a
-# `callable` object (
-transformation
-#
+# `callable` object (any object responding to #call).
+transformation ->(data) {
+  # Lambdas responds to #call
 }
 
 # MyTransformation defines #call
 transformation MyTransformation.new
+```
+
+#### Defining destinations
+
+A destination (aka. a writer) is an object that writes the transformed data to an external system. Use one of the build-in or 3rd party destinations or implement it by yourself. Implementing destinations is easy – [see notes below](#implementing-destinations). You can declare one or more destinations. They are processed in the order they are defined.
+
+```ruby
+# File: my_etl_job.metacrunch
 
-
+destination MyDestination.new
+```
+
+This example uses a custom destination. To learn more about the build-in destinations see [notes below](#built-in-sources-and-destinations).
+
+#### Pre/Post process
+
+To run arbitrary code before the first transformation use the
+`#pre_process` hook. To run arbitrary after the last transformation use
+`#post_process`. Like transformations, `#post_process` and `#pre_process` can be called with a block, a lambda or a (callable) object.
+
+```ruby
 pre_process do
   # Called before the first transformation
 end
 
-# To run arbitrary code after the last transformation use the #post_process hook.
 post_process do
   # Called after the last transformation
 end
 
-
-#
-pre_process Proc.new {
-  # Procs and Lambdas responds to #call
+pre_process ->() {
+  # Lambdas responds to #call
 }
 
 # MyCallable class defines #call
 post_process MyCallable.new
-
 ```
 
+#### Defining options
 
-
-------------
+TBD.
 
-
+Running ETL jobs
+----------------
+
+metacrunch comes with a handy command line tool. In a terminal use
 
 
 ```
 $ metacrunch run my_etl_job.metacrunch
 ```
 
-to run
+to run a job.
+
+If you use [Bundler](http://bundler.io) to manage dependencies for your jobs make sure to change into the directory where your Gemfile is (or set BUNDLE_GEMFILE environment variable) and run metacrunch with `bundle exec`.
+
+```
+$ bundle exec metacrunch run my_etl_job.metacrunch
+```
+
+Depending on your environment `bundle exec` may not be required (e.g. you have rubygems-bundler installed) but we recommend using it whenever you have a Gemfile you like to use. When using Bundler make sure to add `gem "metacrunch"` to the Gemfile.
+
+To pass options to the job, separate job options from the metacrunch command options using the `@@` separator.
+
+Use the following syntax
+
+```
+$ [bundle exec] metacrunch run [COMMAND_OPTIONS] JOB_FILE [@@ [JOB_OPTIONS] [JOB_ARGS...]]
+```
+
 
 Implementing sources
 --------------------
 
-
+A source (aka a reader) is any Ruby object that responds to the `each` method that yields data objects one by one.
+
+The data is usually a `Hash` instance, but could be other structures as long as the rest of your pipeline is expecting it.
+
+Any `enumerable` object (e.g. `Array`) responds to `each` and can be used as a source in metacrunch.
+
+```ruby
+# File: my_etl_job.metacrunch
+source [1,2,3,4,5,6,7,8,9]
+```
+
+Usually you implement your sources as classes. Doing so you can unit test and reuse them.
+
+Here is a simple CSV source
+
+```ruby
+# File: my_csv_source.rb
+require 'csv'
+
+class MyCsvSource
+  def initialize(input_file)
+    @csv = CSV.open(input_file, headers: true, header_converters: :symbol)
+  end
+
+  def each
+    @csv.each do |data|
+      yield(data.to_hash)
+    end
+    @csv.close
+  end
+end
+```
+
+You can then use that source in your job
+
+```ruby
+# File: my_etl_job.metacrunch
+require "my_csv_source"
+
+source MyCsvSource.new("my_data.csv")
+```
+
 
 Implementing transformations
 ----------------------------
 
-
+Transformations can be implemented as blocks or as a `callable`. A `callable` in Ruby is any object that responds to the `call` method.
+
+### Transformations as a block
+
+When using the block syntax the current data row will be passed as a parameter.
+
+```ruby
+# File: my_etl_job.metacrunch
+
+transformation do |data|
+  # DO YOUR TRANSFORMATION HERE
+  data = ...
+
+  # Make sure to return the data to keep it in the pipeline. Dismiss the
+  # data conditionally by returning nil.
+  data
+end
+
+```
+
+### Transformations as a callable
+
+Procs and Lambdas in Ruby respond to `call`. They can be used to implement transformations similar to blocks.
+
+```ruby
+# File: my_etl_job.metacrunch
+
+transformation -> (data) do
+  # ...
+end
+
+```
+
+Like sources you can create classes to test and reuse transformation logic.
+
+```ruby
+# File: my_transformation.rb
+
+class MyTransformation
+
+  def call(data)
+    # ...
+  end
+
+end
+```
+
+You can use this transformation in your job
+
+```ruby
+# File: my_etl_job.metacrunch
+
+require "my_transformation"
+
+transformation MyTransformation.new
+
+```
+
+Implementing destinations
+-------------------------
+
+A destination (aka a writer) is any Ruby object that responds to `write(data)` and `close`.
+
+Like sources you are encouraged to implement destinations as classes.
+
+```ruby
+# File: my_destination.rb
+
+class MyDestination
+
+  def write(data)
+    # Write data to files, remote services, databases etc.
+  end
+
+  def close
+    # Use this method to close connections, files etc.
+  end
+
+end
+```
+
+In your job
+
+```ruby
+# File: my_etl_job.metacrunch
+
+require "my_destination"
+
+destination MyDestination.new
+
+```
+
 
-
-
+Built in sources and destinations
+---------------------------------
 
 TBD.
 
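The new "Running ETL jobs" section above stops at the abstract syntax line for the `@@` separator. As a hedged illustration of that syntax, in the style of the README's own CLI snippets (the `--limit` job option and its value are hypothetical, not flags defined by metacrunch itself):

```
$ bundle exec metacrunch run my_etl_job.metacrunch @@ --limit 100
```

Everything before `@@` is handled by the metacrunch command itself; everything after it is passed on to the job as its options and arguments.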
data/lib/metacrunch/db/writer.rb
CHANGED
@@ -2,6 +2,9 @@ module Metacrunch
   class Db::Writer
 
     def initialize(database_connection_or_url, dataset_proc, options = {})
+      @use_upsert = options.delete(:use_upsert) || false
+      @id_key = options.delete(:id_key) || :id
+
       @db = if database_connection_or_url.is_a?(String)
         Sequel.connect(database_connection_or_url, options)
       else
@@ -12,12 +15,37 @@ module Metacrunch
     end
 
     def write(data)
-
+      if data.is_a?(Array)
+        @db.transaction do
+          data.each{|d| insert_or_upsert(d) }
+        end
+      else
+        insert_or_upsert(data)
+      end
     end
 
     def close
       @db.disconnect
     end
 
+    private
+
+    def insert_or_upsert(data)
+      @use_upsert ? upsert(data) : insert(data)
+    end
+
+    def insert(data)
+      @dataset.insert(data) if data
+    end
+
+    def upsert(data)
+      if data
+        rec = @dataset.where(id: data[@id_key])
+        if 1 != rec.update(data)
+          insert(data)
+        end
+      end
+    end
+
   end
 end
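For readers of the `Db::Writer` change above: `initialize` now strips two new options, `:use_upsert` and `:id_key`, before the remaining options reach `Sequel.connect`, and `write` wraps `Array` input in a single transaction. A minimal usage sketch, assuming a `.metacrunch` job file and a Sequel-backed table; the connection URL, the `:records` table and the lambda shape of `dataset_proc` are illustrative assumptions, not taken from the gem's documentation:

```ruby
# File: my_etl_job.metacrunch
# Sketch only: the URL, the :records table and the dataset lambda below are assumptions.
require "metacrunch/db/writer"   # assumed require path, mirroring lib/metacrunch/db/writer.rb

destination Metacrunch::Db::Writer.new(
  "postgres://localhost/my_db",  # or an already connected Sequel database object
  ->(db) { db[:records] },       # dataset_proc: assumed to pick the dataset that is written to
  use_upsert: true,              # new in 3.0.2: update by id, fall back to insert when no row matches
  id_key: :id                    # new in 3.0.2: key of the data hash that holds the record id
)
```

With such a writer, passing an `Array` to `write` runs every row through `insert_or_upsert` inside one transaction; a single hash goes through `insert_or_upsert` directly.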
data/lib/metacrunch/job.rb
CHANGED
@@ -110,37 +110,31 @@ module Metacrunch
     def run_transformations
       sources.each do |source|
         # sources are expected to respond to `each`
-        source.each do |
-
+        source.each do |data|
+          run_transformations_and_write_destinations(data)
         end
 
         # Run all transformations a last time to flush possible buffers
-
+        run_transformations_and_write_destinations(nil, flush_buffers: true)
       end
 
       # destination implementations are expected to respond to `close`
       destinations.each(&:close)
     end
 
-    def
+    def run_transformations_and_write_destinations(data, flush_buffers: false)
       transformations.each do |transformation|
-
-        if
-
-        else
-          transformation.buffer(row)
-        end
+        if transformation.is_a?(Buffer)
+          data = transformation.buffer(data) if data.present?
+          data = transformation.flush if flush_buffers
         else
-          transformation.call(
+          data = transformation.call(data) if data.present?
         end
-
-        break unless row
       end
 
-      if
+      if data.present?
        destinations.each do |destination|
-          # destinations are expected to respond to `write(
-          destination.write(row)
+          destination.write(data) # destinations are expected to respond to `write(data)`
        end
      end
    end
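The `Job#run_transformations` rewrite above threads a single `data` value through every transformation, gives `Buffer` transformations a chance to collect and later flush rows, and only then writes to the destinations. The sketch below is a standalone paraphrase of that control flow, not the gem's code: plain `nil` checks stand in for ActiveSupport's `present?`, and the hypothetical `BufferLike` marker stands in for metacrunch's `Buffer` class.

```ruby
# Standalone paraphrase of the pipeline semantics shown in the diff above (not the gem's code).
module BufferLike; end   # hypothetical marker standing in for Metacrunch's Buffer

def run_pipeline(sources, transformations, destinations)
  process = lambda do |data, flush: false|
    transformations.each do |t|
      if t.is_a?(BufferLike)
        data = t.buffer(data) if data   # buffers collect rows while data flows in
        data = t.flush if flush         # the final pass emits what was buffered
      else
        data = t.call(data) if data     # returning nil dismisses the row
      end
    end
    destinations.each { |d| d.write(data) } if data
  end

  sources.each do |source|
    source.each { |data| process.call(data) }  # sources respond to #each
    process.call(nil, flush: true)             # flush buffers after each source is exhausted
  end

  destinations.each(&:close)                   # destinations respond to #close
end
```

The behavioural points carried over from the diff: a `nil` return from any transformation drops the row before it reaches a destination, and the extra pass with `flush_buffers: true` is what pushes buffered rows out once a source has been read completely.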
data/lib/metacrunch/version.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: metacrunch
 version: !ruby/object:Gem::Version
-  version: 3.0.1
+  version: 3.0.2
 platform: ruby
 authors:
 - René Sprotte
@@ -10,7 +10,7 @@ authors:
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2016-
+date: 2016-07-17 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activesupport