RubyGems - advance - Versions diffs - 0.1.0 - Mend

advance 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (18)

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 0704814d6913c4e1e3bccbc24d697e2681ba71d7
+  data.tar.gz: 0408447d2efe199de361c058262065193341e964
+SHA512:
+  metadata.gz: cdb7d8f6c80d3f7df834d5eab82846c055bfc968d68d0df3ee863c08994adf2c3a09f50d79d42653119bf51d7c5407df6a0dc69832fbdc4be90d412f7f1441b2
+  data.tar.gz: e77444b0f5e04ed12f31df208c22cfbb6d2b2eed99b2310caa040d17a01c7dc438b030a36b69d89121a161ef4bfb644086453c35ca13c418cb4bf1939ceaa4dc
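
A minimal sketch of how digests like the ones above could be checked: it recomputes a file's digest and compares it with an expected hex string. The helper name `digest_matches?` and the sample file are hypothetical; the only assumption about the environment is Ruby's standard `digest` library.

```ruby
require "digest"
require "tmpdir"

# Hypothetical helper: true when the file's hex digest equals the expected value.
# `algorithm` is a Digest class such as Digest::SHA1 or Digest::SHA512.
def digest_matches?(path, expected_hex, algorithm = Digest::SHA512)
  algorithm.file(path).hexdigest == expected_hex.strip.downcase
end

# Usage with a throwaway file standing in for data.tar.gz.
Dir.mktmpdir do |dir|
  sample = File.join(dir, "sample.bin")
  File.binwrite(sample, "example payload")
  expected = Digest::SHA512.hexdigest("example payload")
  puts digest_matches?(sample, expected)   # prints true
  puts digest_matches?(sample, "0" * 128)  # prints false
end
```

To check the real archive, the expected value would be the `SHA512` entry for `data.tar.gz` listed in `checksums.yaml`.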

data/.gitignore ADDED

@@ -0,0 +1,9 @@
+/.bundle/
+/.yardoc
+/_yardoc/
+/coverage/
+/doc/
+/pkg/
+/spec/reports/
+/tmp/
+.idea

data/.travis.yml ADDED

@@ -0,0 +1,7 @@
+---
+sudo: false
+language: ruby
+cache: bundler
+rvm:
+  - 2.2.4
+before_install: gem install bundler -v 1.16.6

data/Gemfile ADDED

@@ -0,0 +1,6 @@
+source "https://rubygems.org"
+git_source(:github) {|repo_name| "https://github.com/#{repo_name}" }
+# Specify your gem's dependencies in advance.gemspec
+gemspec

data/Gemfile.lock ADDED

@@ -0,0 +1,24 @@
+PATH
+  remote: .
+  specs:
+    advance (0.1.0)
+      team_effort
+GEM
+  remote: https://rubygems.org/
+  specs:
+    minitest (5.10.1)
+    rake (10.4.2)
+    team_effort (1.0.0)
+PLATFORMS
+  ruby
+DEPENDENCIES
+  advance!
+  bundler (~> 1.16)
+  minitest (~> 5.0)
+  rake (~> 10.0)
+BUNDLED WITH
+   1.16.6

data/LICENSE.txt ADDED

@@ -0,0 +1,21 @@
+The MIT License (MIT)
+Copyright (c) 2019 janemacfarlane
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,152 @@
+# Advance
+Advance is a framework for building data transformation pipelines.
+Advance allows you to concisely script your
+data transformation process and to
+incrementally build and easily debug that process.
+Each data transformation is a step and the results of each
+step become the input to the next step.
+The artifacts of each step are preserved in step-named directories.
+When the results of a step are not right, just
+adjust the Advance script, delete the step directory with the bad data and
+rerun the script. Previously successful steps are skipped so the script
+moves quickly to the incomplete step. Similarly, when steps fail the results
+are preserved in directories prefixed with "tmp_". This isolates incomplete
+step data and ensures that the step is re-processed when the problem is
+resolved.
+Advance scripts are easy to understand. They are ruby scripts,
+composed of a series of function calls that invoke your scripts
+or commands to transform your data. Each step is composed of a
+step processing type function, followed by a
+slug for the step, followed by the command or script. For example:
+```ruby
+single :unzip_7z_raw_data_file, "7z x {previous_file}"
+single :split_files, "split -l 10000 -a 3 {previous_file} gps_data_"
+multi :add_local_time, "cat {file_path} | add_local_time.rb timestamp local_time US/Pacific > {file}"
+# ...
+```
+The step processing functions are `single` and `multi`. `single` applies the command
+to the last output, which should be a single file. `multi` speeds processing of multiple
+files by doing the work in parallel (via the [TeamEffort gem][1]).
+[1]: https://rubygems.org/gems/team_effort
+> _[Advance][2]: To help the progress of (something); to further._
+[2]: https://en.wiktionary.org/wiki/advance
+## Installation
+Advance is meant to augment a standalone ruby script. The advance gem needs to be
+available to your instance of ruby. Here are 2 techniques to make Advance available
+to your script:
+ * simply install the gem:
+    $ gem install advance
+ * install [bundler][3], and add this ruby snippet to the beginning of your script:
+[3]: https://rubygems.org/gems/bundler
+```ruby
+    #!/usr/bin/env ruby
+    require "bundler/inline"
+    gemfile do
+      source "https://rubygems.org"
+      gem "advance"
+    end
+```
+## Usage
+You will likely need multiple supporting scripts. Ideally you will
+create your Advance script and your supporting scripts in a single directory.
+Creating your Advance script is an incremental process. Start with a single
+step, run the script and check the results. When the output is as you expect,
+add the next step. After you add a step to your script you can simply rerun
+the script. Previously successful steps are skipped and your script moves on
+to the first incomplete step.
+When the results are not what you expect, just delete the step directory with
+the bad data, adjust your step, and rerun. Advance will rerun that step and
+all subsequent steps.
+Steps have 3 components:
+ * a step processing type (single or multi)
+ * a descriptive slug describing the step (as a ruby symbol)
+ * the command that transforms the data
+Advance adds the bin dir of the Advance gem to PATH, so that you can invoke the
+supporting advance scripts in your pipeline without specifying the full path
+of the script. Advance also adds the path of your script to PATH so that you can
+invoke scripts in the same directory as your main script without specifying
+the full path of the script. Of course, you can invoke any script if the path
+to the script is fully specified or the path is already on PATH.
+**Specifying Script Input and Output**
+Since your command is transforming data, you need a way to specify the input
+file or directory and the output file name. Advance provides a few tokens
+that can be inserted in the command string for this purpose:
+ * **{previous_file}** indicates the output file from the previous step when
+   the output of the previous step was a single output file. It is also used
+   to indicate the first file to be used and it finds that file in the current
+   working dir.
+ * **{file_path}** indicates an output file from the previous step when the
+   previous step generated multiple output files and the current step is a
+   `multi` step.
+ * **{file}** indicates an output file name, which is the basename from
+   {file_path}. Commands often process multiple files from previous steps,
+   generating multiple output files. Those output files are placed in the
+   step directory.
+ * **{previous_dir}** indicates the directory of the previous step.
+**Example Script**
+```ruby
+#!/usr/bin/env ruby
+require "bundler/inline"
+gemfile do
+  source "https://rubygems.org"
+  gem "advance"
+end
+ensure_bin_on_path # ensures the directory for this script is on
+                   # the path so that related scripts can be referenced
+                   # without paths
+single :unzip_7z_raw_data_file, "7z x {previous_file}" # uses 7z to inflate a file in the current dir
+single :split_files, "split -l 10000 -a 3 {previous_file} gps_data_" # split the file
+multi :add_local_time, "cat {file_path} | add_local_time.rb timestamp local_time US/Pacific > {file}" # adds a local_time column to a csv
+```
+**Running Your Script**
+When running your pipeline, it is helpful to have a directory with the single, initial file.
+1. Move to your data directory with your single initial file.
+2. Invoke your script from there.
+## Development
+After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
+## Contributing
+Bug reports and pull requests are welcome on GitHub at https://github.com/doctorjane/advance.
+## License
+The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
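
The README's token list (`{previous_file}`, `{file_path}`, `{file}`, `{previous_dir}`) can be illustrated with a small substitution sketch. The helper `expand_tokens` and the example paths are hypothetical and not part of the gem; the sketch only shows the general idea of filling a command template without mutating it.

```ruby
# Hypothetical helper, not part of the gem: expands Advance-style tokens
# in a command template and returns a new string.
def expand_tokens(template, values)
  values.reduce(template) do |command, (token, value)|
    # Block form of gsub inserts the value literally, so backslashes
    # in a path are not treated as back-references.
    command.gsub("{#{token}}") { value }
  end
end

template = "cat {file_path} | add_local_time.rb timestamp local_time US/Pacific > {file}"
command = expand_tokens(template,
                        file_path: "/data/step_002_split_files/gps_data_aaa",
                        file: "gps_data_aaa")
puts command
```

Returning a fresh string keeps the template reusable for every input file, which matters when one template is applied to many files in a `multi` step.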

data/Rakefile ADDED

@@ -0,0 +1,10 @@
+require "bundler/gem_tasks"
+require "rake/testtask"
+Rake::TestTask.new(:test) do |t|
+  t.libs << "test"
+  t.libs << "lib"
+  t.test_files = FileList["test/**/*_test.rb"]
+end
+task :default => :test

data/advance.gemspec ADDED

@@ -0,0 +1,35 @@
+lib = File.expand_path("../lib", __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require "advance/version"
+Gem::Specification.new do |spec|
+  spec.name          = "advance"
+  spec.version       = Advance::VERSION
+  spec.authors       = ["janemacfarlane"]
+  spec.email         = ["jfmacfarlane@lbl.gov"]
+  spec.summary       = %q{A framework for building data transformation pipelines}
+  spec.description   = %q{Advance allows you to concisely script your
+data transformation process and to
+incrementally build and easily debug that process.
+Each data transformation is a step and the results of each
+step become the input to the next step.
+}
+  spec.homepage      = "https://github.com/doctorjane/advance"
+  spec.license       = "MIT"
+  # Specify which files should be added to the gem when it is released.
+  # The `git ls-files -z` loads the files in the RubyGem that have been added into git.
+  spec.files         = Dir.chdir(File.expand_path('..', __FILE__)) do
+    `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
+  end
+  spec.bindir        = "bin"
+  spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
+  spec.require_paths = ["lib"]
+  spec.add_runtime_dependency "team_effort"
+  spec.add_development_dependency "bundler", "~> 1.16"
+  spec.add_development_dependency "rake", "~> 10.0"
+  spec.add_development_dependency "minitest", "~> 5.0"
+end
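
The gemspec above pins its development dependencies with the pessimistic operator (`~> 1.16`, `~> 10.0`, `~> 5.0`) and leaves `team_effort` unconstrained, which the metadata records as `>= 0`. The sketch below uses RubyGems' own `Gem::Requirement` and `Gem::Version` classes to show what those constraints accept; the version numbers tried are illustrative only.

```ruby
require "rubygems"

# "~> 1.16" accepts 1.16 and later releases below 2.0.
bundler_requirement = Gem::Requirement.new("~> 1.16")
puts bundler_requirement.satisfied_by?(Gem::Version.new("1.16.6")) # prints true
puts bundler_requirement.satisfied_by?(Gem::Version.new("1.15.9")) # prints false
puts bundler_requirement.satisfied_by?(Gem::Version.new("2.0.0"))  # prints false

# A dependency declared without a version accepts any release.
open_requirement = Gem::Requirement.new(">= 0")
puts open_requirement.satisfied_by?(Gem::Version.new("1.0.0"))     # prints true
```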

data/bin/concat_csv.rb ADDED

@@ -0,0 +1,31 @@
+#!/usr/bin/env ruby
+require "find"
+require "team_effort"
+def do_cmd(cmd)
+  system cmd
+  status = $?
+  raise "'#{cmd}' failed with #{status}" if !status.success?
+end
+files_dir_path = ARGV[0]
+output_file = ARGV[1]
+files = Find.find(files_dir_path).reject { |p| FileTest.directory?(p) || p =~ %r(\b(stdout|stderr)$) }
+# 1. capture the header from the first file
+do_cmd "ghead -n 1 #{files.first} > header"
+# 2. strip the header from all files
+TeamEffort.work(files) do |file_path|
+  file = File.basename(file_path)
+  do_cmd "gtail -n +2 #{file_path} > #{file}"
+end
+# 3. concatenate the header and all files
+tmp_files = files.map{|f| File.basename(f)}
+(["header"] + tmp_files).each_slice(20) do |files_to_concat|
+  file_list = files_to_concat.join(' ')
+  do_cmd "gcat #{file_list} >> #{output_file}"
+  do_cmd "rm #{file_list}"
+end
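
The script above shells out to `ghead`, `gtail` and `gcat`, which assumes GNU coreutils are installed under those names. As a rough alternative, the same header-once concatenation can be sketched in plain Ruby. The helper `concat_with_single_header` is hypothetical, and it assumes every input file starts with the same header line and ends with a newline.

```ruby
require "tmpdir"

# Hypothetical pure-Ruby variant: writes the header of the first file once,
# then appends the body lines of every file in order.
def concat_with_single_header(input_paths, output_path)
  File.open(output_path, "w") do |out|
    input_paths.each_with_index do |path, index|
      File.open(path) do |input|
        header = input.gets
        out.write(header) if index.zero? && header
        input.each_line { |line| out.write(line) }
      end
    end
  end
end

Dir.mktmpdir do |dir|
  first = File.join(dir, "a.csv")
  second = File.join(dir, "b.csv")
  File.write(first, "id,name\n1,alpha\n")
  File.write(second, "id,name\n2,beta\n")
  combined = File.join(dir, "combined.csv")
  concat_with_single_header([first, second], combined)
  puts File.read(combined)
end
```

Unlike the shell-based version, this sketch never interpolates file names into a command line, so names containing spaces need no escaping.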

data/bin/console ADDED

@@ -0,0 +1,14 @@
+#!/usr/bin/env ruby
+require "bundler/setup"
+require "advance"
+# You can add fixtures and/or initialization code here to make experimenting
+# with your gem easier. You can also use a different console, if you like.
+# (If you use this, don't forget to add pry to your Gemfile!)
+# require "pry"
+# Pry.start
+require "irb"
+IRB.start(__FILE__)

data/bin/csv_select.rb ADDED

@@ -0,0 +1,21 @@
+#!/usr/bin/env ruby
+require 'csv'
+# $stderr.puts "#{__FILE__}:#{__LINE__}"
+test_proc = eval "lambda {|row| #{ARGV.shift}}"
+input = CSV.new(ARGF, :headers => true, :return_headers => true, :converters => :numeric)
+output = CSV.new($stdout, :headers => true, :write_headers => true)
+input.each.with_index do |row, index|
+  # $stderr.puts "#{index}: >>#{row.to_s.chomp}<<"
+  if row.header_row?
+    output << row
+    next
+  end
+  if test_proc.call(row)
+    output << row
+    next
+  end
+end
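
`csv_select.rb` builds its row predicate by passing the first command-line argument to `eval`. The filtering itself can be sketched with an ordinary block instead; the helper `select_rows` and the sample data below are hypothetical, and only Ruby's `csv` library is assumed.

```ruby
require "csv"

# Hypothetical helper: returns CSV text holding the header row plus
# every data row for which the block returns a truthy value.
def select_rows(csv_text)
  table = CSV.parse(csv_text, headers: true, converters: :numeric)
  CSV.generate do |out|
    out << table.headers
    table.each { |row| out << row.fields if yield(row) }
  end
end

data = "id,speed\n1,12.5\n2,3\n3,40\n"
puts select_rows(data) { |row| row["speed"] > 10 }
```

With the `:numeric` converter, as in the script above, `row["speed"]` is a number rather than a string, so the comparison works without an explicit conversion.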

data/bin/csv_split_on_change.rb ADDED

@@ -0,0 +1,29 @@
+#!/usr/bin/env ruby
+require 'csv'
+def columns_changed?(previous_row, row, change_columns)
+  change_columns.any? { |column| previous_row[column] != row[column] }
+end
+def file_name_from_changed_columns(row, change_columns)
+  change_columns.map { |column| row[column] }.join("_") + ".csv"
+end
+change_columns = ARGV[0].split(/,/).map(&:to_i)
+input_file = ARGV[1]
+previous_row = output_csv = nil
+CSV.foreach(input_file) do |row|
+  if previous_row.nil? || (previous_row && columns_changed?(previous_row, row, change_columns))
+    output_csv.close if output_csv
+    output_file_name = file_name_from_changed_columns(row, change_columns)
+    output_csv = CSV.open(output_file_name, "w")
+  end
+  output_csv << row
+  previous_row = row
+end
+output_csv.close if output_csv
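
The grouping rule in this script (start a new output whenever any of the chosen columns changes from one row to the next) can be sketched in memory with `Enumerable#slice_when`. The helper `split_on_change` and the sample rows are hypothetical.

```ruby
# Hypothetical in-memory variant of the grouping logic: starts a new group
# whenever any of the given column indexes differs between adjacent rows.
def split_on_change(rows, change_columns)
  rows.slice_when do |previous_row, row|
    change_columns.any? { |column| previous_row[column] != row[column] }
  end.to_a
end

rows = [
  ["veh1", "2019-01-01", "10"],
  ["veh1", "2019-01-01", "11"],
  ["veh1", "2019-01-02", "12"],
  ["veh2", "2019-01-02", "13"],
]

split_on_change(rows, [0, 1]).each do |group|
  name = group.first.values_at(0, 1).join("_") + ".csv"
  puts "#{name}: #{group.length} row(s)"
end
```

Because only adjacent rows are compared, input that is not sorted on the chosen columns can produce the same key more than once. In the script above each output file is opened with mode `"w"`, so a repeated key would overwrite the earlier file of that name.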

data/bin/setup ADDED

@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+set -euo pipefail
+IFS=$'\n\t'
+set -vx
+bundle install
+# Do any other automated setup that you need to do here

data/bin/split_csv.rb ADDED

@@ -0,0 +1,21 @@
+#!/usr/bin/env ruby
+require 'team_effort'
+def do_cmd(cmd)
+  `#{cmd}`
+  status = $? # capture the exit status so the error message can report it
+  raise "'#{cmd}' failed with #{status}" if !status.success?
+end
+csv_file = ARGV[0]
+lines = ARGV[1]
+csv_file_name = File.basename(csv_file)
+system "ghead -n 1 #{csv_file} > #{csv_file_name}_header"
+system "gtail -n +2 #{csv_file} | gsplit -l #{lines} -a 3 - #{csv_file_name}_"
+files = Dir.entries(".").reject { |f| f =~ %r{^(\.\.?|stdout|stderr)$} }
+TeamEffort.work(files, 1) do |file|
+  tmp_file = "tmp_#{file}"
+  do_cmd "gcat #{csv_file_name}_header #{file} >> #{tmp_file}"
+  do_cmd "mv #{tmp_file} #{file}"
+end
+do_cmd "rm #{csv_file_name}_header"
+puts ""
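
The splitting step in this script (peel off the header, cut the body into fixed-size pieces, then put the header back on each piece) can be sketched in memory with `each_slice`. The helper `split_with_header` is hypothetical and works on an array of lines rather than on files.

```ruby
# Hypothetical helper: splits lines into chunks of at most rows_per_chunk
# body rows, repeating the header line at the top of every chunk.
def split_with_header(lines, rows_per_chunk)
  header, *body = lines
  body.each_slice(rows_per_chunk).map { |chunk| [header] + chunk }
end

lines = ["id,value", "1,a", "2,b", "3,c"]
split_with_header(lines, 2).each_with_index do |chunk, index|
  puts "chunk #{index}: #{chunk.inspect}"
end
```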

data/lib/advance/version.rb ADDED

@@ -0,0 +1,3 @@
+module Advance
+  VERSION = "0.1.0"
+end

data/lib/advance.rb ADDED

@@ -0,0 +1,153 @@
+require "advance/version"
+require "fileutils" # FileUtils is used below for rm_rf, mkdir_p, cd, mv and pwd
+require 'open3'
+require "team_effort"
+module Advance
+  RESET="\e[0m"
+  BOLD="\e[1m"
+  ITALIC="\e[3m"
+  UNDERLINE="\e[4m"
+  CYAN="\e[36m"
+  GRAY="\e[37m"
+  GREEN="\e[32m"
+  MAGENTA="\e[35m"
+  RED="\e[31m"
+  WHITE="\e[1;37m"
+  YELLOW="\e[33m"
+  def do_command(command, feedback = true)
+    puts "#{YELLOW}#{command}#{RESET}  " if feedback
+    start_time = Time.now
+    stdout, stderr, status = Open3.capture3(command)
+    elapsed_time = Time.now - start_time
+    File.open("log", "w") do |f|
+      f.puts "%%% command: >#{command}<"
+      f.puts "%%% returned status: >#{status}<"
+      f.puts "%%% elapsed time: #{elapsed_time} seconds"
+      f.puts "%%% stdout:"
+      f.puts stdout
+      f.puts "%%% stderr:"
+      f.puts stderr
+    end
+    if !status.success?
+      # `label` is not defined in this method, so report the command instead
+      raise "step #{$step} failed with #{status}; command: #{command}"
+    end
+  end
+  def previous_dir_path
+    relative_path = case $step
+                    when 1
+                      ".."
+                    else
+                      File.join("..", Dir.entries("..").find { |d| d =~ /^#{step_dir_prefix($step - 1)}/ })
+                    end
+    File.expand_path(relative_path)
+  end
+  def previous_file_path
+    dir_entries = Dir.glob(File.join(previous_dir_path, "*"))
+    dir_entries_clean = dir_entries.reject { |f| File.directory?(f) || File.basename(f) == "log" }
+    dir_entries_clean.first
+  end
+  def single(label, command)
+    step(label) do
+      if command =~ /\{previous_file\}/
+        command.gsub!("{previous_file}", previous_file_path)
+      end
+      if command =~ /\{previous_dir\}/
+        command.gsub!("{previous_dir}", previous_dir_path)
+      end
+      do_command command
+    end
+  end
+  def multi(label, command)
+    no_feedback = false
+    step(label) do
+      # previous_dir_path = File.expand_path(previous_dir_path)
+      files = Dir.entries(previous_dir_path).reject { |f| f =~ %r{^(\.\.?|log)$} }
+      file_path_template = file_path_template(previous_dir_path, files)
+      last_progress = ""
+      progress_proc = ->(index, max_index) do
+        latest_progress = sprintf("%3i%", index.to_f / max_index * 100)
+        puts latest_progress if last_progress != latest_progress
+        last_progress = latest_progress
+      end
+      TeamEffort.work(files, $cores, progress_proc: progress_proc) do |file|
+        begin
+          previous_file_path = file_path_template.gsub("{file}", file)
+          command.gsub!("{file_path}", previous_file_path) unless $step == 1
+          command.gsub!("{file}", file) unless $step == 1
+          puts "#{YELLOW}#{command}#{RESET}"
+          dir_name = file
+          work_in_sub_dir(dir_name) do
+            do_command command, no_feedback
+          end
+        rescue
+          puts "%%%% error while processing #{file}"
+          raise
+        end
+      end
+    end
+  end
+  def file_path_template(dir_path, files)
+    file = files.first
+    file_path = File.join(dir_path, file)
+    if File.directory?(file_path)
+      File.join(dir_path, "{file}", "{file}")
+    else
+      File.join(dir_path, "{file}")
+    end
+  end
+  def work_in_sub_dir(dir_name, existing_message = nil)
+    return if Dir.exist? dir_name
+    tmp_dir_name = "tmp_#{dir_name}"
+    FileUtils.rm_rf tmp_dir_name
+    FileUtils.mkdir_p tmp_dir_name
+    FileUtils.cd tmp_dir_name
+    yield
+    FileUtils.cd ".."
+    FileUtils.mv tmp_dir_name, dir_name
+  end
+  def step_dir_prefix(step_no)
+    "step_%03d" % [step_no]
+  end
+  def step(label)
+    $step ||= 0
+    $step += 1
+    dir_name = "#{step_dir_prefix($step)}_#{label}"
+    $previous_dir = File.join(FileUtils.pwd, dir_name)
+    puts "#{CYAN}step #{$step} #{label}#{WHITE}... #{RESET}"
+    work_in_sub_dir(dir_name, "#{GREEN}OK#{RESET}") do
+      yield
+    end
+  end
+  def ensure_bin_on_path
+    advance_path = File.expand_path("../bin", File.dirname(__FILE__)) # the gem's bin dir, as the README describes
+    add_dir_to_path(advance_path)
+    caller_path = File.dirname(caller[0].split(/:/).first)
+    add_dir_to_path(caller_path)
+  end
+  def add_dir_to_path(dir)
+    bin_dir = File.expand_path(dir)
+    path = ENV["PATH"]
+    return if path.split(File::PATH_SEPARATOR).include?(bin_dir)
+    ENV["PATH"] = [path, bin_dir].join(File::PATH_SEPARATOR)
+  end
+end
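
The skip-and-resume behavior described in the README comes from `work_in_sub_dir`: work happens in a `tmp_`-prefixed directory that is renamed to its final name only after the block finishes. Below is a minimal sketch of that idea, not the gem's implementation. The helper name `in_step_dir` and its return values are hypothetical, and it uses the block form of `Dir.chdir` so the working directory is restored even when the block raises.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical variant of the tmp-then-rename idea: skips the work when the
# final directory already exists, otherwise runs the block inside a fresh
# tmp_ directory and renames it on success.
def in_step_dir(dir_name)
  return :skipped if Dir.exist?(dir_name)

  tmp_dir_name = "tmp_#{dir_name}"
  FileUtils.rm_rf(tmp_dir_name)   # discard leftovers from a failed run
  FileUtils.mkdir_p(tmp_dir_name)
  Dir.chdir(tmp_dir_name) { yield }
  FileUtils.mv(tmp_dir_name, dir_name)
  :completed
end

Dir.mktmpdir do |root|
  Dir.chdir(root) do
    puts in_step_dir("step_001_demo") { File.write("out.txt", "data") } # prints completed
    puts in_step_dir("step_001_demo") { raise "not reached" }           # prints skipped
  end
end
```

If the block raises, the rename never happens, so the partial output stays under the `tmp_` name and the step is retried on the next run.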

metadata ADDED

@@ -0,0 +1,127 @@
+--- !ruby/object:Gem::Specification
+name: advance
+version: !ruby/object:Gem::Version
+  version: 0.1.0
+platform: ruby
+authors:
+- janemacfarlane
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2019-01-10 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: team_effort
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: bundler
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.16'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.16'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.0'
+- !ruby/object:Gem::Dependency
+  name: minitest
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '5.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '5.0'
+description: |
+  Advance allows you to concisely script your
+  data transformation process and to
+  incrementally build and easily debug that process.
+  Each data transformation is a step and the results of each
+  step become the input to the next step.
+email:
+- jfmacfarlane@lbl.gov
+executables:
+- concat_csv.rb
+- console
+- csv_select.rb
+- csv_split_on_change.rb
+- setup
+- split_csv.rb
+extensions: []
+extra_rdoc_files: []
+files:
+- ".gitignore"
+- ".travis.yml"
+- Gemfile
+- Gemfile.lock
+- LICENSE.txt
+- README.md
+- Rakefile
+- advance.gemspec
+- bin/concat_csv.rb
+- bin/console
+- bin/csv_select.rb
+- bin/csv_split_on_change.rb
+- bin/setup
+- bin/split_csv.rb
+- lib/advance.rb
+- lib/advance/version.rb
+homepage: https://github.com/doctorjane/advance
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.4.8
+signing_key:
+specification_version: 4
+summary: A framework for building data transformation pipelines
+test_files: []