RubyGems - metacrunch - Versions diffs - 3.1.4 → 4.0.1 - Mend

metacrunch 3.1.4 → 4.0.1


Files changed (26)

checksums.yaml +4 -4
data/.travis.yml +2 -2
data/Gemfile +5 -11
data/Rakefile +1 -0
data/Readme.md +98 -90
data/lib/metacrunch.rb +0 -5
data/lib/metacrunch/cli.rb +22 -61
data/lib/metacrunch/job.rb +65 -84
data/lib/metacrunch/job/dsl.rb +10 -14
data/lib/metacrunch/job/dsl/options.rb +80 -0
data/lib/metacrunch/job/dsl/options/dsl.rb +21 -0
data/lib/metacrunch/version.rb +1 -1
data/metacrunch.gemspec +2 -6
metadata +10 -68
data/lib/metacrunch/db.rb +0 -8
data/lib/metacrunch/db/reader.rb +0 -33
data/lib/metacrunch/db/writer.rb +0 -55
data/lib/metacrunch/fs.rb +0 -6
data/lib/metacrunch/fs/entry.rb +0 -17
data/lib/metacrunch/fs/reader.rb +0 -63
data/lib/metacrunch/job/dsl/option_support.rb +0 -102
data/lib/metacrunch/parallel_processable_reader.rb +0 -21
data/lib/metacrunch/redis.rb +0 -8
data/lib/metacrunch/redis/queue_reader.rb +0 -43
data/lib/metacrunch/redis/queue_writer.rb +0 -39
data/lib/metacrunch/redis/writer.rb +0 -33

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 2ce37e928a0b2b84de0067604a3b56fe0206bda6
-  data.tar.gz: 42cf1bc2c321c35e34b716331300fcdaaf6f282b
+  metadata.gz: 7a75f1157f466513ad8b721d4aada832173503ef
+  data.tar.gz: 97872d8c9ee3c7ed78f5cfda03d5535c94d06b53
 SHA512:
-  metadata.gz: 2e870644b9c78a4dd8590776f913ae712ebd8dedf381d1cbd6541e9a3c5a51311b1126c8af1a47b1d39d496675a8176076f5f99352741eeb4caa58ae29078dff
-  data.tar.gz: 6ba5ade2739a26d424f14add74c49fcadd7d03868a2f4c0e8651aff45d833b636f9e00a5176d5a1f1705703a6d99a6383b657c56d0902b6fcc97e840da98727d
+  metadata.gz: 92f378e2f694693d17e593a7d4d37aa6e23c7b7e88d659a675772950b38bea37f2462bfdbda951b1ac316fae531b0c96c6e6714484e5047e8b9e6ccaa6558e28
+  data.tar.gz: b0bc8b77912abb87fecd6d756911d6eec7802cdf20a8d81d51b75f775889eda798decef8d069abf75fa9979e61d42be6f1423da83bebafb64d2550aa78c37f87

data/.travis.yml CHANGED

@@ -1,4 +1,4 @@
 language: ruby
 rvm:
-  - ruby-2.3.1
-  - jruby-9.0.5.0
+  - ruby-2.3.5
+  - ruby-2.4.2

data/Gemfile CHANGED

@@ -3,21 +3,15 @@ source "https://rubygems.org"
 gemspec
 group :development do
-  gem "bundler",      ">= 1.7"
-  gem "rake",         ">= 11.1"
-  gem "rspec",        ">= 3.0.0",  "< 4.0.0"
-  gem "simplecov",    ">= 0.11.0"
-  gem "sqlite3",      ">= 1.3.11", platform: :ruby
-  gem "jdbc-sqlite3", ">= 3.8", platform: :jruby
+  gem "bundler", ">= 1.15"
+  gem "rake",    ">= 12.1"
+  gem "rspec",   ">= 3.5.0", "< 4.0.0"
   if !ENV["CI"]
-    gem "hashdiff",   ">= 0.3.0", platform: :ruby
-    gem "pry-byebug", ">= 3.3.0", platform: :ruby
-    gem "pry-rescue", ">= 1.4.2", platform: :ruby
-    gem "pry-state",  ">= 0.1.7", platform: :ruby
+    gem "pry-byebug", ">= 3.5.0"
   end
 end
 group :test do
-  gem "codeclimate-test-reporter", ">= 0.5.0", require: nil
+  gem "simplecov", ">= 0.15.0"
 end

data/Rakefile CHANGED

@@ -1,4 +1,5 @@
 require "rspec/core/rake_task"
+require "bundler/gem_tasks"
 RSpec::Core::RakeTask.new(:spec)

data/Readme.md CHANGED

@@ -8,7 +8,6 @@ metacrunch
 metacrunch is a simple and lightweight data processing and ETL ([Extract-Transform-Load](http://en.wikipedia.org/wiki/Extract,_transform,_load))
 toolkit for Ruby.
-**NOTE: THIS README IS FOR THE MASTER BRANCH. CHECK THE [RELEASES-PAGE](https://github.com/ubpb/metacrunch/releases) TO SEE THE README FOR THE RELEVANT RELEASES**
 Installation
 ------------
@@ -17,15 +16,17 @@ Installation
 $ gem install metacrunch
 ```
+*Note: When upgrading from metacrunch 3.x, there are some breaking changes you need to address. See the [notes below](#upgrading) for details.*
 Creating ETL jobs
 -----------------
-The basic idea behind an ETL job in metacrunch is the concept of a data processing pipeline. Each ETL job reads data from one or more **sources** (extract step), runs one or more **transformations** (transform step) on the data and finally writes the transformed data to one or more **destinations** (load step).
+The basic idea behind an ETL job in metacrunch is the concept of a data processing pipeline. Each ETL job reads data from a **source** (extract step), runs one or more **transformations** (transform step) on the data and finally loads the transformed data to a **destination** (load step).
-metacrunch provides you with a simple DSL to define and run such ETL jobs in Ruby. Just create a text file with the extension `.metacrunch` and [run it](#running-etl-jobs) with the provided `metacrunch` CLI command. *Note: The extension doesn't really matter but you should avoid `.rb` to not loading them by mistake from another Ruby component.*
+metacrunch gives you a simple DSL ([Domain-specific language](https://en.wikipedia.org/wiki/Domain-specific_language)) to define and run ETL jobs in Ruby. Just create a text file with the extension `.metacrunch` and [run it](#running-etl-jobs) with the provided `metacrunch` CLI command. *Note: The file extension doesn't really matter but you should avoid `.rb` to not loading them by mistake from another Ruby component.*
-Let's walk through the main steps of creating ETL jobs with metacrunch. For a collection of working examples check out our [metacrunch-demo](https://github.com/ubpb/metacrunch-demo) repo.
+Let's walk through the main steps of creating ETL jobs with metacrunch. For a collection of working examples check out our [metacrunch-demo](https://github.com/ubpb/metacrunch-demo) repository.
 #### It's Ruby
@@ -49,80 +50,91 @@ require "SomeGem"
 require_relative "./some/other/ruby/file"
 ```
-#### Defining sources
+#### Defining a source
-A source (aka. a reader) is an object that reads data into the metacrunch processing pipeline. Use one of the build-in or 3rd party sources or implement it by yourself. Implementing sources is easy – [see notes below](#implementing-sources). You can declare one or more sources. They are processed in the order they are defined.
+A source is an object that reads data (e.g. from a file or an external system) into the metacrunch processing pipeline. Implementing sources is easy – a source can be any Ruby object that responds to `#each`. For more information on how to implement sources [see notes below](#implementing-sources).
-You must declare at least one source to allow a job to run.
+You must declare a source to allow a job to run.
 ```ruby
 # File: my_etl_job.metacrunch
-source Metacrunch::Fs::Reader.new(args)
+source [1,2,3,4]
+# or ...
+source Metacrunch::File::Source.new(ARGV)
+# or ...
 source MySource.new
 ```
-This example uses a build-in file reader source. To learn more about the build-in sources see [notes below](#built-in-sources-and-destinations).
 #### Defining transformations
-To process, transform or manipulate data use the `#transformation` hook. A transformation can be implemented as a block, a lambda or as an (callable) object. To learn more about transformations check the section about [implementing transformations](#implementing-transformations) below.
+To process, transform or manipulate data use the `#transformation` hook. A transformation is implemented with a `callable` object (any Ruby object that responds to `#call`. E.g. a lambda). To learn more about transformations check the section about [implementing transformations](#implementing-transformations) below.
-The current data object (the object that is currently read by the source) will be passed to the first transformation as a parameter. The return value of a transformation will then be passed to the next transformation - or to the destination if the current transformation is the last one.
+The current data object (the last object yielded by the source) will be passed to the first transformation as a parameter. The return value of a transformation will then be passed to the next transformation and so on.
-If you return nil the current data object will be dismissed and the next transformation (or destination) won't be called.
+If you return `nil` the current data object will be dismissed and the next transformation won't be called.
 ```ruby
 # File: my_etl_job.metacrunch
-transformation do |data|
-  # Called for each data object that has been read by a source.
-  # Do your data transformation process here.
+# Array implements #each and therefore is a valid source
+source [1,2,3,4,5,6,7,8,9]
+# A transformation is implemented with a `callable` object (any
+# object that responds to #call).
+# Lambdas responds to #call
+transformation ->(number) {
+  # Called for each data object that has been read by a source.
   # You must return the data to keep it in the pipeline. Dismiss the
   # data conditionally by returning nil.
-  data
-end
+  number if number.odd?
+}
-# Instead of passing a block to #transformation you can pass a
-# `callable` object (any object responding to #call).
-transformation ->(data) {
-  # Lambdas responds to #call
+transformation ->(odd_number) {
+  odd_number * 2
 }
-# MyTransformation defines #call
+# MyTransformation implements #call
 transformation MyTransformation.new
 ```
-#### Defining destinations
+#### Using a transformation buffer
-A destination (aka. a writer) is an object that writes the transformed data to an external system. Use one of the build-in or 3rd party destinations or implement it by yourself. Implementing destinations is easy – [see notes below](#implementing-destinations). You can declare one or more destinations. They are processed in the order they are defined.
+Sometimes it is useful to buffer data between transformation steps to allow a transformation to work on larger bulks of data. metacrunch uses a simple transformation buffer to achieve this.
+To use a transformation buffer pass the buffer size as an option to the transformation.
 ```ruby
 # File: my_etl_job.metacrunch
-destination MyDestination.new
+source 1..95 # A range responds to #each and is a valid source
+transformation ->(bulk) {
+  # this transformation is called when the buffer
+  # is filled with 10 objects or if the source has
+  # yielded the last data object.
+  # bulk would be: [1,...,10], [11,...,20], ..., [91,...,95]
+}, buffer_size: 10
 ```
-This example uses a custom destination. To learn more about the build-in destinations see [notes below](#built-in-sources-and-destinations).
+#### Defining a destination
-#### Pre/Post process
+A destination is an object that writes the transformed data to an external system. Implementing destinations is easy – [see notes below](#implementing-destinations). A destination receives the return value from the last transformation as a parameter if the return value from the last transformation was not `nil`.
-To run arbitrary code before the first transformation use the
-`#pre_process` hook. To run arbitrary after the last transformation use
-`#post_process`. Like transformations, `#post_process` and `#pre_process` can be called with a block, a lambda or a (callable) object.
+Using destinations is optional. In most cases using the last transformation to write the data to an external system is fine. Destinations are useful if the required code is more complex.
 ```ruby
-pre_process do
-  # Called before the first transformation
-end
+# File: my_etl_job.metacrunch
-post_process do
-  # Called after the last transformation
-end
+destination MyDestination.new
+```
+#### Pre/Post process
+To run arbitrary code before the first transformation is run on the first data object use the `#pre_process` hook. To run arbitrary code after the last transformation is run on the last data object use `#post_process`. Like transformations, `#post_process` and `#pre_process` must be implemented using a `callable` object.
-pre_process ->() {
+```ruby
+pre_process -> {
   # Lambdas responds to #call
 }
@@ -130,42 +142,60 @@ pre_process ->() {
 post_process MyCallable.new
 ```
-#### Defining options
+#### Defining job options
-metacrunch has build-in support to parameterize your jobs. Using the `option` helper, you can declare options that can be set/overridden by the CLI when [running your jobs](#running-etl-jobs).
+metacrunch has build-in support to parameterize jobs. Using the `options` hook you can declare options that can be set/overridden by the CLI when [running your jobs](#running-etl-jobs).
 ```ruby
+# File: my_etl_job.metacrunch
 options do
-  add :number_of_processes, "-n", "--no-of-processes N", "Number of processes", default: 2
+  add :log_level, "-l", "--log-level LEVEL", "Log level (debug,info,warn,error)", default: "info"
   add :database_url, "-d", "--database URL", "Database connection URL", required: true
 end
+# Prints out 'info'
+echo options[:log_level]
 ```
-In this example we declare two options `number_of_processes` and `database_url`. `number_of_processes` defaults to 2, whereas `database_url` has no default and is required. In your job file you can access the option values using the `options` Hash. E.g. `options[:number_of_processes]`.
+In this example we declare two options `log_level` and `database_url`. `log_level` defaults to `info`, whereas `database_url` has no default and is required. In your job file you can access the option values using the `options` Hash. E.g. `options[:log_level]`.
 To set/override these options use the command line.
 ```
-$ bundle exec metacrunch my_etl_job.metacrunch @@ --no-of-processes 4
+$ bundle exec metacrunch my_etl_job.metacrunch --log-level debug
 ```
-This will set the `options[:number_of_processes]` to `4`.
+This will set the `options[:log_level]` to `debug`.
 To get a list of available options for a job, use `--help` on the command line.
 ```
-$ bundle exec metacrunch my_etl_job.metacrunch @@ --help
+$ bundle exec metacrunch my_etl_job.metacrunch --help
-Usage: metacrunch run [options] JOB_FILE @@ [job-options] [ARGS]
+Usage: metacrunch [options] JOB_FILE [job-options] [ARGS]
 Job options:
-    -n, --no-of-processes N          Number of processes
-                                     DEFAULT: 2
+    -l, --log-level LEVEL            Log level (debug,info,warn,error)
+                                     DEFAULT: info
     -d, --database URL               Database connection URL
                                      REQUIRED
 ```
 To learn more about defining options take a look at the [reference below](#defining-job-options).
+#### Require non-option arguments
+All non-option arguments that get passed to the job when running are available to the `ARGV` constant. If your job requires such arguments (e.g. if you work with a list of files) you can require it.
+```ruby
+# File: my_etl_job.metacrunch
+options(require_args: true) do
+  # ...
+end
+```
 Running ETL jobs
 ----------------
@@ -185,21 +215,19 @@ If you use [Bundler](http://bundler.io) to manage dependencies for your jobs mak
 $ bundle exec metacrunch my_etl_job.metacrunch
 ```
-Depending on your environment `bundle exec` may not be required (e.g. you have rubygems-bundler installed) but we recommend using it whenever you have a Gemfile you like to use. When using Bundler make sure to add `gem "metacrunch"` to the Gemfile.
-To pass options to the job, separate job options from the metacrunch command options using the `@@` separator.
+Depending on your environment `bundle exec` may not be required (e.g. if you have rubygems-bundler installed) but we recommend using it whenever you have a Gemfile you like to use. When using Bundler make sure to add `gem "metacrunch"` to the Gemfile.
-Use the following syntax
+Use the following syntax to run a metacrunch job
 ```
-$ [bundle exec] metacrunch [COMMAND_OPTIONS] JOB_FILE [@@ [JOB_OPTIONS] [JOB_ARGS...]]
+$ [bundle exec] metacrunch [COMMAND_OPTIONS] JOB_FILE [JOB_OPTIONS] [JOB_ARGS...]
 ```
 Implementing sources
 --------------------
-A source (aka a reader) is any Ruby object that responds to the `each` method that yields data objects one by one.
+A metacrunch source is any Ruby object that responds to the `each` method that yields data objects one by one.
 The data is usually a `Hash` instance, but could be other structures as long as the rest of your pipeline is expecting it.
@@ -245,29 +273,9 @@ source MyCsvSource.new("my_data.csv")
 Implementing transformations
 ----------------------------
-Transformations can be implemented as blocks or as a `callable`. A `callable` in Ruby is any object that responds to the `call` method.
-### Transformations as a block
-When using the block syntax the current data row will be passed as a parameter.
-```ruby
-# File: my_etl_job.metacrunch
-transformation do |data|
-  # DO YOUR TRANSFORMATION HERE
-  data = ...
-  # Make sure to return the data to keep it in the pipeline. Dismiss the
-  # data conditionally by returning nil.
-  data
-end
-```
+A metacrunch transformation is implemented as a `callable` object. A `callable` in Ruby is any object that responds to the `call` method.
-### Transformations as a callable
-Procs and Lambdas in Ruby respond to `call`. They can be used to implement transformations similar to blocks.
+Procs and Lambdas in Ruby respond to `call`. They can be used to implement transformations inline.
 ```ruby
 # File: my_etl_job.metacrunch
@@ -306,7 +314,7 @@ transformation MyTransformation.new
 Implementing destinations
 -------------------------
-A destination (aka a writer) is any Ruby object that responds to `write(data)` and `close`.
+A destination is any Ruby object that responds to `write(data)` and `close`.
 Like sources you are encouraged to implement destinations as classes.
@@ -337,21 +345,21 @@ destination MyDestination.new
 ```
+Upgrading
+---------
-Built in sources and destinations
----------------------------------
-TBD.
-Defining job dependencies
--------------------------
-TBD.
+#### 3.x -> 4.x
-Defining job options
---------------------
+When upgrading from metacrunch 3.x, there are some breaking changes you need to address.
-TBD.
+* There is now only one `source` and `destination`. If you have more than one in your job file the last definition will used.
+* There is no `transformation_buffer` anymore. Instead set `buffer_size` as an option to `transformation`.
+* `transformation`, `pre_process` and `post_process` can't be implemented using a block anymore. Always use a `callable` (E.g. Lambda, Proc or any object responding to `#call`).
+* When running jobs via the CLI you do not need to separate the arguments passed to metacrunch from the arguments passed to the job with `@@`.
+* The `args` function used to get the non-option arguments passed to a job has been removed. Use `ARGV` instead.
+* `Metacrunch::Db` classes have been moved into the [metacrunch-db](https://github.com/ubpb/metacrunch-db) gem package.
+* `Metacrunch::Redis` classes have been moved into the [metacrunch-redis](https://github.com/ubpb/metacrunch-redis) gem package.
+* `Metacrunch::File` classes have been moved into the [metacrunch-file](https://github.com/ubpb/metacrunch-file) gem package.
 License
 -------
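
Note on the Readme changes above: the 4.x Readme describes a pipeline with one source (any object responding to `#each`), a chain of callable transformations where a `nil` result drops the record, an optional destination responding to `#write(data)` and `#close`, and a `buffer_size` option that hands a transformation bulks of records. The following is a minimal plain-Ruby sketch of those semantics, not metacrunch's implementation. `run_pipeline` and `ListDestination` are hypothetical names introduced only for illustration, and the bulking at the end uses `Enumerable#each_slice` as a stand-in for the buffer behaviour the Readme describes.

```ruby
# Illustrative only: mimics the pipeline semantics described in the 4.x Readme.
# `ListDestination` and `run_pipeline` are hypothetical, not part of metacrunch.

class ListDestination
  attr_reader :items

  def initialize
    @items = []
    @closed = false
  end

  # A destination responds to #write(data) ...
  def write(data)
    @items << data
  end

  # ... and to #close.
  def close
    @closed = true
  end

  def closed?
    @closed
  end
end

# source:          any object responding to #each
# transformations: callables applied in order; a nil result drops the record
# destination:     optional; receives each non-nil final result
def run_pipeline(source, transformations, destination = nil)
  source.each do |data|
    transformations.each do |callable|
      data = callable.call(data)
      break if data.nil? # skip the remaining transformations
    end
    destination.write(data) if destination && !data.nil?
  end
  destination.close if destination
end

destination = ListDestination.new
run_pipeline(
  1..9,                                         # a Range responds to #each
  [->(n) { n if n.odd? }, ->(odd) { odd * 2 }], # keep odd numbers, then double
  destination
)
p destination.items # => [2, 6, 10, 14, 18]

# Bulks of 10 from a source of 95 records, as in the Readme's buffer example.
bulks = (1..95).each_slice(10).to_a
p bulks.length # => 10
p bulks.last   # => [91, 92, 93, 94, 95]
```

Because a 4.x job has a single source and at most one destination, a sketch like this needs no fan-in or fan-out logic. Jobs that declared several sources or destinations under 3.x have to be restructured accordingly.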

data/lib/metacrunch.rb CHANGED

@@ -1,14 +1,9 @@
 require "active_support"
 require "active_support/core_ext"
 require "colorized_string"
-require "parallel"
 module Metacrunch
   require_relative "metacrunch/version"
   require_relative "metacrunch/cli"
   require_relative "metacrunch/job"
-  require_relative "metacrunch/parallel_processable_reader"
-  require_relative "metacrunch/fs"
-  require_relative "metacrunch/db"
-  require_relative "metacrunch/redis"
 end
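
Note on the cli.rb diff that follows: the new `Cli#run` drops the `@@` separator and relies on `OptionParser#order` instead. Called without a block, `order` stops at the first non-option argument and returns it together with everything after it, so the job file and the job's own options are left unparsed. The sketch below shows that behaviour in isolation, assuming only Ruby's standard `optparse` library. `split_command_line` is a hypothetical helper written for this illustration, not the method metacrunch ships.

```ruby
require "optparse"

# Hypothetical helper (not metacrunch's code): split a command line into the
# global options, the job file, and the arguments left over for the job.
def split_command_line(argv)
  global = {}
  parser = OptionParser.new do |opts|
    opts.on("-v", "--version") { global[:version] = true }
  end

  # Without a block, #order parses options until it meets the first
  # non-option argument, then returns that argument and the rest unparsed.
  # The non-bang form works on a copy, so `argv` itself is not modified.
  rest = parser.order(argv)

  job_file = rest[0]
  # Drop the legacy "@@" separator, as the 4.x CLI does for compatibility.
  job_argv = (rest[1..-1] || []).reject { |arg| arg == "@@" }

  [global, job_file, job_argv]
end

global, job_file, job_argv = split_command_line(
  ["-v", "my_etl_job.metacrunch", "@@", "--log-level", "debug", "a.xml"]
)

p global[:version] # => true
p job_file         # => "my_etl_job.metacrunch"
p job_argv         # => ["--log-level", "debug", "a.xml"]
```

`--log-level` is unknown to the global parser here, yet no `OptionParser::InvalidOption` is raised, because parsing has already stopped at the job file. That is what lets a job define its own options without a separator.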

data/lib/metacrunch/cli.rb CHANGED

@@ -2,21 +2,32 @@ require "optparse"
 module Metacrunch
   class Cli
-    ARGS_SEPERATOR = "@@"
     def run
-      job_files = global_parser.parse!(global_argv)
-      run!(job_files)
+      # Parse global options on order
+      job_argv = global_parser.order(ARGV)
+      # The first of the unparsed arguments is by definition the filename
+      # of the job.
+      job_file = job_argv[0]
+      # Manipulate ARGV so that the option handling for the job can work
+      ARGV.clear
+      job_argv[1..-1]&.each {|arg| ARGV << arg}
+      # Delete the old separator symbol for backward compatability
+      ARGV.delete_if{|arg| arg == "@@"}
+      # Finally run the job
+      run!(job_file)
+    rescue OptionParser::ParseError => e
+      error(e.message)
     end
   private
     def global_parser
       @global_parser ||= OptionParser.new do |opts|
         opts.banner = <<-BANNER.strip_heredoc
           #{ColorizedString["Usage:"].bold}
-            metacrunch [options] JOB_FILE @@ [job-options] [ARGS...]
+            metacrunch [options] JOB_FILE [job-options] [ARGS...]
           #{ColorizedString["Options:"].bold}
         BANNER
@@ -24,22 +35,9 @@ module Metacrunch
         opts.on("-v", "--version", "Show metacrunch version and exit") do
           show_version
         end
-        opts.on("-n INTEGER", "--number-of-processes INTEGER", Integer, "Number of parallel processes to run the job. Source needs to support this. DEFAULT: 1") do |n|
-          error("--number-of-procs must be > 0") if n <= 0
-          global_options[:number_of_processes] = n
-        end
-        opts.separator "\n"
       end
     end
-    def global_options
-      @global_options ||= {
-        number_of_processes: 1
-      }
-    end
     def show_version
       puts Metacrunch::VERSION
       exit(0)
@@ -51,32 +49,13 @@ module Metacrunch
       exit(0)
     end
-    def global_argv
-      index = ARGV.index(ARGS_SEPERATOR)
-      if index == 0
-        []
-      else
-        @global_argv ||= index ? ARGV[0..index-1] : ARGV
-      end
-    end
-    def job_argv
-      index = ARGV.index(ARGS_SEPERATOR)
-      @job_argv ||= index ? ARGV[index+1..-1] : nil
-    end
-    def run!(job_files)
-      if job_files.first == "run"
-        puts ColorizedString["WARN: Using 'run' is deprecated. Just use 'metacrunch [options] JOB_FILE @@ [job-options] [ARGS...]'\n"].yellow.bold
-        job_files = job_files[1..-1]
-      end
-      if job_files.empty?
+    def run!(job_file)
+      if job_file.blank?
         error "You need to provide a job file."
-      elsif job_files.count > 1
-        error "You must provide exactly one job file."
+      elsif !File.exists?(job_file)
+        error "The file `#{job_file}` doesn't exist."
       else
-        job_filename = File.expand_path(job_files.first)
+        job_filename = File.expand_path(job_file)
         dir = File.dirname(job_filename)
         Dir.chdir(dir) do
@@ -86,25 +65,7 @@ module Metacrunch
     end
     def run_job!(job_filename)
-      if global_options[:number_of_processes] > 1
-        process_indicies = (0..(global_options[:number_of_processes] - 1)).to_a
-        Parallel.each(process_indicies) do |process_index|
-          Metacrunch::Job.define(
-            File.read(job_filename),
-            filename: job_filename,
-            args: job_argv,
-            number_of_processes: global_options[:number_of_processes],
-            process_index: process_index
-          ).run
-        end
-      else
-        Metacrunch::Job.define(
-          File.read(job_filename),
-          filename: job_filename,
-          args: job_argv
-        ).run
-      end
+      Metacrunch::Job.define(File.read(job_filename)).run
     end
   end