csv2avro 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.dockerignore +1 -0
- data/.gitignore +15 -0
- data/.travis.yml +8 -0
- data/CHANGELOG.md +48 -0
- data/Dockerfile +23 -0
- data/Gemfile +4 -0
- data/LICENSE.txt +22 -0
- data/README.md +80 -0
- data/Rakefile +41 -0
- data/bin/csv2avro +58 -0
- data/csv2avro.gemspec +28 -0
- data/lib/avro_schema.rb +57 -0
- data/lib/csv2avro/avro_writer.rb +27 -0
- data/lib/csv2avro/converter.rb +125 -0
- data/lib/csv2avro/schema.rb +44 -0
- data/lib/csv2avro/version.rb +3 -0
- data/lib/csv2avro.rb +78 -0
- data/spec/csv2avro/converter_spec.rb +434 -0
- data/spec/csv2avro/schema_spec.rb +85 -0
- data/spec/csv2avro_spec.rb +38 -0
- data/spec/spec_helper.rb +15 -0
- data/spec/support/avro_reader.rb +22 -0
- data/spec/support/data.csv +4 -0
- data/spec/support/schema.avsc +17 -0
- metadata +161 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 6eae97d5b2bf7476331128770ffee4d3b6d69d7a
+  data.tar.gz: 554e64338b5950de37ccaad44927176f6922f94c
+SHA512:
+  metadata.gz: 3a70f269a7337d6dad0bd24528e9092b897217254610a8257db0fecc72d55dada505c68040b3a1309b3e8266473257f118a88231a54b5627c74ffb63c998d49c
+  data.tar.gz: 01cc32197d34410522aed53d4682aa9c91b20a63ad9e09a80831cc7e6af6d5bfd1a972488bb7b52f263451222a0363f10e5ab8a07e09ab38a5154b46496ff93e
data/.dockerignore
ADDED
@@ -0,0 +1 @@
+.git
data/.gitignore
ADDED
data/.travis.yml
ADDED
data/CHANGELOG.md
ADDED
@@ -0,0 +1,48 @@
+# Changelog
+
+All notable changes to this project are documented in this file.
+This project adheres to [Semantic Versioning](http://semver.org/).
+
+## 1.0.0 (2015-06-05; [compare](https://github.com/sspinc/csv2avro/compare/0.4.0...1.0.0))
+
+### Added
+* Usage description to readme
+* Detailed exception reporting
+* `aws-cli` to Docker image
+
+### Fixed
+* Docker image entrypoint
+
+## 0.4.0 (2015-05-07; [compare](https://github.com/sspinc/csv2avro/compare/0.3.0...0.4.0))
+
+### Added
+* Streaming support (#7)
+* `rake docker:spec` task
+
+### Removed
+* S3 support (#7)
+
+### Changed
+* Do not include .git in Docker build context
+
+### Fixed
+* Build project into Docker image (#9)
+
+## 0.3.0 (2015-04-28; [compare](https://github.com/sspinc/csv2avro/compare/0.1.0...0.3.0))
+
+### Added
+* Docker support (#6)
+* `rake docker:build` task
+* `rake docker:push` task to push to Docker Hub
+* Semantic Docker tags
+* CHANGELOG.md
+
+## 0.1.0 (2015-04-07)
+Initial release
+
+### Added
+* CLI (`csv2avro convert`) to convert CSV files to Avro (#1)
+* Travis CI (#2)
+* Bad rows (#4)
+* Versioning (#5)
+* Gem packaging
data/Dockerfile
ADDED
@@ -0,0 +1,23 @@
+FROM ruby:2.1
+MAINTAINER Secret Sauce Partners, Inc. <dev@sspinc.io>
+
+RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
+    python2.7 get-pip.py && \
+    pip install awscli
+
+# throw errors if Gemfile has been modified since Gemfile.lock
+RUN bundle config --global frozen 1
+
+RUN mkdir -p /srv/csv2avro
+WORKDIR /srv/csv2avro
+
+RUN mkdir -p /srv/csv2avro/lib/csv2avro
+
+COPY lib/csv2avro/version.rb /srv/csv2avro/lib/csv2avro/version.rb
+COPY csv2avro.gemspec Gemfile Gemfile.lock /srv/csv2avro/
+
+RUN bundle install
+
+COPY . /srv/csv2avro
+
+ENTRYPOINT ["./bin/csv2avro"]
data/Gemfile
ADDED
data/LICENSE.txt
ADDED
@@ -0,0 +1,22 @@
+Copyright (c) 2015 Secret Sauce Partners, Inc.
+
+MIT License
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,80 @@
+# CSV2Avro
+
+Convert CSV files to Avro like a boss.
+
+## Installation
+
+    $ gem install csv2avro
+
+or if you prefer to live on the edge, just clone this repository and build it from scratch.
+
+You can run the converter within a **Docker** container; you just need to pull the `sspinc/csv2avro` image.
+
+```
+$ docker pull sspinc/csv2avro
+```
+
+## Usage
+
+### Basic
+```
+$ csv2avro --schema ./spec/support/schema.avsc ./spec/support/data.csv
+```
+This will process the data.csv file and create a *data.avro* file and a *data.bad.csv* file containing the bad rows.
+
+You can override the bad-rows file location with the `--bad-rows [BAD_ROWS]` option.
+
+### CSV2Avro in Docker
+
+```
+$ docker run sspinc/csv2avro --help
+```
+
+### Streaming
+```
+$ cat ./spec/support/data.csv | csv2avro --schema ./spec/support/schema.avsc --bad-rows ./spec/support/data.bad.csv > ./spec/support/data.avro
+```
+This will process the *input stream* and push the Avro data to the *output stream*. If you're working with streams, you will need to specify the `--bad-rows` location.
+
+### Advanced features
+
+#### AWS S3 storage
+
+```
+aws s3 cp s3://csv-bucket/transactions.csv - | csv2avro --schema ./transactions.avsc --bad-rows ./transactions.bad.csv | aws s3 cp - s3://avro-bucket/transactions.avro
+```
+
+This will stream your file stored in AWS S3, convert the data, and push it back to S3. For more information, please check the [AWS CLI documentation](http://docs.aws.amazon.com/cli/latest/reference/s3/index.html).
+
+#### Convert compressed files
+
+```
+gunzip -c ./spec/support/data.csv.gz | csv2avro --schema ./spec/support/schema.avsc --bad-rows ./spec/support/data.bad.csv > ./spec/support/data.avro
+```
+
+This will uncompress the file and convert it to Avro, leaving the original file intact.
+
+### More
+
+For a full list of available options, run `csv2avro --help`
+```
+$ csv2avro --help
+Version 1.0.0 of CSV2Avro
+Usage: csv2avro [options] [file]
+    -s, --schema SCHEMA              A file containing the Avro schema. This value is required.
+    -b, --bad-rows [BAD_ROWS]        The output location of the bad rows file.
+    -d, --delimiter [DELIMITER]      Field delimiter. If none specified, then comma is used as the delimiter.
+    -a [ARRAY_DELIMITER],            Array field delimiter. If none specified, then comma is used as the delimiter.
+        --array-delimiter
+    -D, --write-defaults             Write default values.
+    -c, --stdout                     Output will go to the standard output stream, leaving files intact.
+    -h, --help                       Prints help
+```
+
+## Contributing
+
+1. Fork it ( https://github.com/sspinc/csv2avro/fork )
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Commit your changes (`git commit -am 'Add some feature'`)
+4. Push to the branch (`git push origin my-new-feature`)
+5. Create a new Pull Request
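Note on the `--schema` option: it expects a file holding an Avro record schema (the repository ships one at `spec/support/schema.avsc`, which is not expanded in this diff). As a rough illustration only of what such a file looks like, the record and field names below are invented, not the fixture's contents; the schema can be parsed with the `avro` gem that csv2avro depends on:

```ruby
require 'avro'

# Hypothetical schema for illustration; not the contents of spec/support/schema.avsc.
schema_json = <<-JSON
{
  "type": "record",
  "name": "product",
  "fields": [
    { "name": "id",       "type": "int" },
    { "name": "name",     "type": "string" },
    { "name": "price",    "type": ["null", "double"], "default": null },
    { "name": "category", "type": ["null", "string"], "default": null, "aliases": ["group"] }
  ]
}
JSON

schema = Avro::Schema.parse(schema_json)  # raises if the JSON is not a valid Avro schema
puts schema.to_s
```

Field-level `default` and `aliases` entries are what the `--write-defaults` option and the header aliasing described below rely on.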
data/Rakefile
ADDED
@@ -0,0 +1,41 @@
+require 'rspec/core/rake_task'
+require 'bundler/gem_tasks'
+require 'bump/tasks'
+
+# Remove pre and set rake tasks
+Rake.application.instance_eval do
+  %w[bump:pre bump:set].each do |task|
+    @tasks.delete(task)
+  end
+end
+
+# Default directory to look in is `/spec`
+# Run with `rake spec`
+RSpec::Core::RakeTask.new(:spec) do |task|
+  task.rspec_opts = ['--color', '--format', 'documentation']
+end
+
+task :default => :spec
+
+namespace :docker do
+  desc "Build docker image"
+  task :build do
+    sh "docker build -t sspinc/csv2avro:#{CSV2Avro::VERSION} ."
+    minor_version = CSV2Avro::VERSION.sub(/\.[0-9]+$/, '')
+    sh "docker tag -f sspinc/csv2avro:#{CSV2Avro::VERSION} sspinc/csv2avro:#{minor_version}"
+    major_version = minor_version.sub(/\.[0-9]+$/, '')
+    sh "docker tag -f sspinc/csv2avro:#{CSV2Avro::VERSION} sspinc/csv2avro:#{major_version}"
+
+    sh "docker tag -f sspinc/csv2avro:#{CSV2Avro::VERSION} sspinc/csv2avro:latest"
+  end
+
+  desc "Run specs inside docker image"
+  task :spec => :build do
+    sh "docker run -t --entrypoint=rake sspinc/csv2avro:#{CSV2Avro::VERSION} spec"
+  end
+
+  desc "Push docker image"
+  task :push => :spec do
+    sh "docker push sspinc/csv2avro"
+  end
+end
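A note on the `docker:build` task above: the "semantic Docker tags" mentioned in the changelog come from stripping the last dotted component of the gem version twice, so a single build is tagged four ways. A quick sketch of that substitution for version 1.0.0:

```ruby
version       = "1.0.0"
minor_version = version.sub(/\.[0-9]+$/, '')        # => "1.0"
major_version = minor_version.sub(/\.[0-9]+$/, '')  # => "1"
# docker:build therefore tags the image as sspinc/csv2avro:1.0.0, :1.0, :1 and :latest
```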
data/bin/csv2avro
ADDED
@@ -0,0 +1,58 @@
+#!/usr/bin/env ruby
+
+$LOAD_PATH << File.dirname(__FILE__) + '/../lib' if $0 == __FILE__
+require 'optparse'
+require 'csv2avro'
+
+options = {}
+
+option_parser = OptionParser.new do |opts|
+  opts.banner = "Version #{CSV2Avro::VERSION} of CSV2Avro\n" \
+                "Usage: #{File.basename(__FILE__)} [options] [file]"
+
+  opts.on('-s', '--schema SCHEMA', 'A file containing the Avro schema. This value is required.') do |path|
+    options[:schema] = path
+  end
+
+  opts.on('-b', '--bad-rows [BAD_ROWS]', 'The output location of the bad rows file.') do |path|
+    options[:bad_rows] = path
+  end
+
+  opts.on('-d', '--delimiter [DELIMITER]', 'Field delimiter. If none specified, then comma is used as the delimiter.') do |char|
+    options[:delimiter] = char.gsub("\\t", "\t")
+  end
+
+  opts.on('-a', '--array-delimiter [ARRAY_DELIMITER]', 'Array field delimiter. If none specified, then comma is used as the delimiter.') do |char|
+    options[:array_delimiter] = char
+  end
+
+  opts.on('-D', '--write-defaults', 'Write default values.') do
+    options[:write_defaults] = true
+  end
+
+  opts.on('-c', '--stdout', 'Output will go to the standard output stream, leaving files intact.') do
+    options[:stdout] = true
+  end
+
+  opts.on('-h', '--help', 'Prints help') do
+    puts opts
+    exit
+  end
+end
+
+option_parser.parse!
+
+begin
+  raise OptionParser::MissingArgument.new('--schema') if options[:schema].nil?
+  raise OptionParser::MissingArgument.new('--bad-rows') if options[:bad_rows].nil? && ARGV.empty?
+
+  CSV2Avro.new(options).convert
+rescue OptionParser::MissingArgument => ex
+  puts ex.message
+
+  puts option_parser
+rescue Exception => e
+  puts 'Uh oh, something went wrong!'
+  puts e.message
+  puts e.backtrace.join("\n")
+end
data/csv2avro.gemspec
ADDED
@@ -0,0 +1,28 @@
+# coding: utf-8
+lib = File.expand_path('../lib', __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require 'csv2avro/version'
+
+Gem::Specification.new do |spec|
+  spec.name          = "csv2avro"
+  spec.version       = CSV2Avro::VERSION
+  spec.authors       = ["Peter Ableda"]
+  spec.email         = ["scotty@secretsaucepartners.com"]
+  spec.summary       = %q{Convert CSV files to Avro}
+  spec.description   = %q{Convert CSV files to Avro like a boss.}
+  spec.homepage      = ""
+  spec.license       = "MIT"
+
+  spec.files         = `git ls-files -z`.split("\x0")
+  spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
+  spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
+  spec.require_paths = ["lib"]
+
+  spec.add_development_dependency "bundler", "~> 1.6"
+  spec.add_development_dependency "rake", "~> 10.0"
+  spec.add_development_dependency "rspec", "~> 3.2"
+  spec.add_development_dependency "pry", "~> 0.10"
+  spec.add_development_dependency "bump", "~> 0.5"
+
+  spec.add_dependency "avro", "~> 1.7"
+end
data/lib/avro_schema.rb
ADDED
@@ -0,0 +1,57 @@
+module Avro
+  class Schema
+    @errors = []
+
+    class << self
+      attr_accessor :errors
+    end
+
+    def self.validate(expected_schema, datum, name=nil, suppress_error=false)
+      expected_type = expected_schema.type_sym
+
+      valid = case expected_type
+      when :null
+        datum.nil?
+      when :boolean
+        datum == true || datum == false
+      when :string, :bytes
+        datum.is_a? String
+      when :int
+        (datum.is_a?(Fixnum) || datum.is_a?(Bignum)) &&
+          (INT_MIN_VALUE <= datum) && (datum <= INT_MAX_VALUE)
+      when :long
+        (datum.is_a?(Fixnum) || datum.is_a?(Bignum)) &&
+          (LONG_MIN_VALUE <= datum) && (datum <= LONG_MAX_VALUE)
+      when :float, :double
+        datum.is_a?(Float) || datum.is_a?(Fixnum) || datum.is_a?(Bignum)
+      when :fixed
+        datum.is_a?(String) && datum.size == expected_schema.size
+      when :enum
+        expected_schema.symbols.include? datum
+      when :array
+        datum.is_a?(Array) &&
+          datum.all?{|d| validate(expected_schema.items, d) }
+      when :map
+        datum.keys.all?{|k| k.is_a? String } &&
+          datum.values.all?{|v| validate(expected_schema.values, v) }
+      when :union
+        expected_schema.schemas.any?{|s| validate(s, datum, nil, true) }
+      when :record, :error, :request
+        datum.is_a?(Hash) &&
+          expected_schema.fields.all?{|f| validate(f.type, datum[f.name], f.name) }
+      else
+        false
+      end
+
+      if !suppress_error && !valid && name
+        if datum.nil? && expected_type != :null
+          @errors << "Missing value at #{name}"
+        else
+          @errors << "'#{datum}' at #{name} doesn't match the type '#{expected_schema.to_s}'"
+        end
+      end
+
+      valid
+    end
+  end
+end
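This file monkey-patches `Avro::Schema.validate` from the avro gem so that, besides returning true or false, it records a human-readable message for each failing field in the class-level `Avro::Schema.errors` array; the converter later drains that array to report why a row was rejected. A minimal sketch of that behaviour (the schema here is invented for illustration, and the exact message format may vary):

```ruby
require 'avro'
require 'avro_schema'  # the patch above; assumes the gem's lib/ directory is on the load path

schema = Avro::Schema.parse('{"type":"record","name":"row","fields":[{"name":"id","type":"int"}]}')

# A datum with the wrong type for "id" fails validation and leaves a message behind.
Avro::Schema.validate(schema, { 'id' => 'not-a-number' })  # => false
puts Avro::Schema.errors.inspect
# => something like ["'not-a-number' at id doesn't match the type '\"int\"'"]
```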
data/lib/csv2avro/avro_writer.rb
ADDED
@@ -0,0 +1,27 @@
+require 'avro'
+require 'avro_schema'
+require 'forwardable'
+
+class CSV2Avro
+  class AvroWriter
+    extend Forwardable
+
+    attr_reader :avro_writer
+
+    def_delegators :'avro_writer.writer', :seek, :read, :eof?
+    def_delegators :avro_writer, :flush, :close
+
+    def initialize(writer, schema)
+      datum_writer = Avro::IO::DatumWriter.new(schema.avro_schema)
+      @avro_writer = Avro::DataFile::Writer.new(writer, datum_writer, schema.avro_schema)
+    end
+
+    def writer_schema
+      avro_writer.datum_writer.writers_schema
+    end
+
+    def write(hash)
+      avro_writer << hash
+    end
+  end
+end
data/lib/csv2avro/converter.rb
ADDED
@@ -0,0 +1,125 @@
+require 'csv2avro/schema'
+require 'csv2avro/avro_writer'
+require 'csv'
+
+class CSV2Avro
+  class Converter
+    attr_reader :writer, :bad_rows_writer, :error_writer, :schema, :reader, :csv_options, :converter_options, :header_row, :column_separator
+
+    def initialize(reader, writer, bad_rows_writer, error_writer, options, schema: schema)
+      @writer = writer
+      @bad_rows_writer = bad_rows_writer
+      @error_writer = error_writer
+      @schema = schema
+
+      @column_separator = options[:delimiter] || ','
+
+      @reader = reader
+      @header_row = reader.readline.strip
+      header = header_row.split(column_separator)
+
+      init_header_converter
+      @csv_options = {
+        headers: header,
+        skip_blanks: true,
+        col_sep: column_separator,
+        header_converters: :aliases
+      }
+
+      @converter_options = options
+    end
+
+    def convert
+      defaults = schema.defaults if converter_options[:write_defaults]
+
+      fields_to_convert = schema.types.reject{ |key, value| value == :string }
+
+      reader.each do |line|
+        CSV.parse(line, csv_options) do |row|
+          row = row.to_hash
+
+          if converter_options[:write_defaults]
+            add_defaults_to_row!(row, defaults)
+          end
+
+          convert_fields!(row, fields_to_convert)
+
+          begin
+            writer.write(row)
+            writer.flush
+          rescue
+            if bad_rows_writer.size == 0
+              bad_rows_writer << header_row + "\n"
+            end
+
+            bad_rows_writer << line
+            bad_rows_writer.flush
+
+            until Avro::Schema.errors.empty? do
+              error_writer << "line #{reader.lineno}: #{Avro::Schema.errors.shift}\n"
+            end
+          end
+        end
+      end
+    end
+
+    private
+
+    def convert_fields!(row, fields_to_convert)
+      fields_to_convert.each do |key, value|
+        row[key] = begin
+          case value
+          when :int
+            Integer(row[key])
+          when :float, :double
+            Float(row[key])
+          when :boolean
+            parse_boolean(row[key])
+          when :array
+            parse_array(row[key])
+          when :enum
+            row[key].downcase.tr(" ", "_")
+          end
+        rescue
+          row[key]
+        end
+      end
+
+      row
+    end
+
+    def parse_boolean(value)
+      return true if value == true || value =~ (/^(true|t|yes|y|1)$/i)
+      return false if value == false || value =~ (/^(false|f|no|n|0)$/i)
+      nil
+    end
+
+    def parse_array(value)
+      delimiter = converter_options[:array_delimiter] || ','
+
+      value.split(delimiter) if value
+    end
+
+    def add_defaults_to_row!(row, defaults)
+      # Add default values to nil cells
+      row.each do |key, value|
+        row[key] = defaults[key] if value.nil?
+      end
+
+      # Add default values to missing columns
+      defaults.each do |key, value|
+        row[key] = defaults[key] unless row.has_key?(key)
+      end
+
+      row
+    end
+
+    def init_header_converter
+      aliases = schema.aliases
+
+      CSV::HeaderConverters[:aliases] = lambda do |header|
+        aliases[header] || header
+      end
+    end
+  end
+end
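The converter is plain-IO in and out: it reads the header line itself, parses each subsequent line with Ruby's CSV library against the schema-derived headers and types, and routes rows the Avro writer rejects to the bad-rows writer while the reasons go to the error writer. A rough usage sketch against in-memory IO objects; it assumes the gem is installed (so its `lib/` is on the load path) and that `schema.avsc` is a hypothetical local Avro schema file with an integer `id` field and a string `name` field. csv2avro.rb below does the same wiring for real against ARGF and files.

```ruby
require 'stringio'
require 'csv2avro/converter'  # also pulls in csv2avro/schema, csv2avro/avro_writer and avro

# Hypothetical local schema file; see the README note above for the expected shape.
schema = File.open('schema.avsc') { |f| CSV2Avro::Schema.new(f) }

reader          = StringIO.new("id,name\n1,foo\nx,bar\n")  # the second data row has a bad id
avro_out        = StringIO.new
bad_rows_writer = StringIO.new
error_writer    = StringIO.new

writer = CSV2Avro::AvroWriter.new(avro_out, schema)
CSV2Avro::Converter.new(reader, writer, bad_rows_writer, error_writer, {}, schema: schema).convert
writer.close

avro_out.string        # Avro container bytes for the good rows
bad_rows_writer.string # the header line plus the rejected "x,bar" row
error_writer.string    # one "line N: ..." message per validation error
```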
data/lib/csv2avro/schema.rb
ADDED
@@ -0,0 +1,44 @@
+require 'json'
+
+class CSV2Avro
+  class Schema
+    attr_reader :avro_schema, :schema_string
+
+    def initialize(schema)
+      @schema_string = schema.read
+      @avro_schema = Avro::Schema.parse(schema_string)
+    end
+
+    def defaults
+      Hash[
+        avro_schema.fields.map{ |field| [field.name, field.default] unless field.default.nil? }.compact
+      ]
+    end
+
+    def types
+      Hash[
+        avro_schema.fields.map do |field|
+          type = if field.type.type_sym == :union
+            # use the primary type
+            field.type.schemas[0].type_sym
+          else
+            field.type.type_sym
+          end
+
+          [field.name, type]
+        end
+      ]
+    end
+
+    # TODO: Change this when the avro gem starts to support aliases
+    def aliases
+      schema_as_json = JSON.parse(schema_string)
+
+      Hash[
+        schema_as_json['fields'].select{ |field| field['aliases'] }.flat_map do |field|
+          field['aliases'].map { |one_alias| [one_alias, field['name']]}
+        end
+      ]
+    end
+  end
+end
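CSV2Avro::Schema wraps the parsed Avro schema and exposes the three lookup tables the converter uses: `types` (field name to primary type symbol, taking the first branch of a union), `defaults` (only fields with a non-nil default), and `aliases` (read straight from the JSON because the avro gem does not expose field aliases). A small sketch with an invented two-field schema; the inline `# =>` values show what these helpers should return for it, given the implementation above:

```ruby
require 'stringio'
require 'avro'
require 'csv2avro/schema'

# Invented schema purely to illustrate the three helpers.
schema_io = StringIO.new(<<-JSON)
{
  "type": "record",
  "name": "product",
  "fields": [
    { "name": "id",   "type": "int" },
    { "name": "size", "type": ["string", "null"], "default": "M", "aliases": ["shirt_size"] }
  ]
}
JSON

schema = CSV2Avro::Schema.new(schema_io)
schema.types     # => { "id" => :int, "size" => :string }  (primary branch of the union)
schema.defaults  # => { "size" => "M" }
schema.aliases   # => { "shirt_size" => "size" }
```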
data/lib/csv2avro.rb
ADDED
@@ -0,0 +1,78 @@
+require 'csv2avro/converter'
+require 'csv2avro/version'
+
+class CSV2Avro
+  attr_reader :input_path, :schema_path, :bad_rows_path, :stdout_option, :options
+
+  def initialize(options)
+    @input_path = ARGV.first
+    @schema_path = options.delete(:schema)
+    @bad_rows_path = options.delete(:bad_rows)
+    @stdout_option = !input_path || options.delete(:stdout)
+
+    @options = options
+  end
+
+  def convert
+    Converter.new(reader, writer, bad_rows_writer, error_writer, options, schema: schema).convert
+  ensure
+    writer.close if writer
+
+    if bad_rows_writer.size == 0
+      File.delete(bad_rows_uri)
+    elsif bad_rows_writer
+      bad_rows_writer.close
+    end
+  end
+
+  private
+
+  def schema
+    @schema ||= File.open(schema_path, 'r') do |schema|
+      CSV2Avro::Schema.new(schema)
+    end
+  end
+
+  def reader
+    ARGF.lineno = 0
+    ARGF
+  end
+
+  def writer
+    @__writer ||= begin
+      writer = if stdout_option
+        IO.new(STDOUT.fileno)
+      else
+        File.open(avro_uri, 'w')
+      end
+
+      CSV2Avro::AvroWriter.new(writer, schema)
+    end
+  end
+
+  def avro_uri
+    dir = File.dirname(input_path)
+    ext = File.extname(input_path)
+    name = File.basename(input_path, ext)
+
+    "#{dir}/#{name}.avro"
+  end
+
+  def error_writer
+    $stderr
+  end
+
+  def bad_rows_writer
+    @__bad_rows_writer ||= File.open(bad_rows_uri, 'w')
+  end
+
+  def bad_rows_uri
+    return bad_rows_path if bad_rows_path
+
+    dir = File.dirname(input_path)
+    ext = File.extname(input_path)
+    name = File.basename(input_path, ext)
+
+    "#{dir}/#{name}.bad#{ext}"
+  end
+end