csv2avro 1.0.0
Sign up to get free protection for your applications and to get access to all the features.
- checksums.yaml +7 -0
- data/.dockerignore +1 -0
- data/.gitignore +15 -0
- data/.travis.yml +8 -0
- data/CHANGELOG.md +48 -0
- data/Dockerfile +23 -0
- data/Gemfile +4 -0
- data/LICENSE.txt +22 -0
- data/README.md +80 -0
- data/Rakefile +41 -0
- data/bin/csv2avro +58 -0
- data/csv2avro.gemspec +28 -0
- data/lib/avro_schema.rb +57 -0
- data/lib/csv2avro/avro_writer.rb +27 -0
- data/lib/csv2avro/converter.rb +125 -0
- data/lib/csv2avro/schema.rb +44 -0
- data/lib/csv2avro/version.rb +3 -0
- data/lib/csv2avro.rb +78 -0
- data/spec/csv2avro/converter_spec.rb +434 -0
- data/spec/csv2avro/schema_spec.rb +85 -0
- data/spec/csv2avro_spec.rb +38 -0
- data/spec/spec_helper.rb +15 -0
- data/spec/support/avro_reader.rb +22 -0
- data/spec/support/data.csv +4 -0
- data/spec/support/schema.avsc +17 -0
- metadata +161 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
|
|
1
|
+
---
|
2
|
+
SHA1:
|
3
|
+
metadata.gz: 6eae97d5b2bf7476331128770ffee4d3b6d69d7a
|
4
|
+
data.tar.gz: 554e64338b5950de37ccaad44927176f6922f94c
|
5
|
+
SHA512:
|
6
|
+
metadata.gz: 3a70f269a7337d6dad0bd24528e9092b897217254610a8257db0fecc72d55dada505c68040b3a1309b3e8266473257f118a88231a54b5627c74ffb63c998d49c
|
7
|
+
data.tar.gz: 01cc32197d34410522aed53d4682aa9c91b20a63ad9e09a80831cc7e6af6d5bfd1a972488bb7b52f263451222a0363f10e5ab8a07e09ab38a5154b46496ff93e
|
data/.dockerignore
ADDED
@@ -0,0 +1 @@
|
|
1
|
+
.git
|
data/.gitignore
ADDED
data/.travis.yml
ADDED
data/CHANGELOG.md
ADDED
@@ -0,0 +1,48 @@
|
|
1
|
+
# Changelog
|
2
|
+
|
3
|
+
All notable changes to this project are documented in this file.
|
4
|
+
This project adheres to [Semantic Versioning](http://semver.org/).
|
5
|
+
|
6
|
+
## 1.0.0 (2015-06-05; [compare](https://github.com/sspinc/csv2avro/compare/0.4.0...1.0.0))
|
7
|
+
|
8
|
+
### Added
|
9
|
+
* Usage description to readme
|
10
|
+
* Detailed exception reporting
|
11
|
+
* `aws-cli` to Docker image
|
12
|
+
|
13
|
+
### Fixed
|
14
|
+
* Docker image entrypoint
|
15
|
+
|
16
|
+
## 0.4.0 (2015-05-07; [compare](https://github.com/sspinc/csv2avro/compare/0.3.0...0.4.0))
|
17
|
+
|
18
|
+
### Added
|
19
|
+
* Streaming support (#7)
|
20
|
+
* `rake docker:spec` task
|
21
|
+
|
22
|
+
### Removed
|
23
|
+
* S3 support (#7)
|
24
|
+
|
25
|
+
### Changed
|
26
|
+
* Do not include .git in Docker build context
|
27
|
+
|
28
|
+
### Fixed
|
29
|
+
* Build project into Docker image (#9)
|
30
|
+
|
31
|
+
## 0.3.0 (2015-04-28; [compare](https://github.com/sspinc/csv2avro/compare/0.1.0...0.3.0))
|
32
|
+
|
33
|
+
### Added
|
34
|
+
* Docker support (#6)
|
35
|
+
* `rake docker:build` task
|
36
|
+
* `rake docker:push` task to push to Docker Hub
|
37
|
+
* Semantic Docker tags
|
38
|
+
* CHANGELOG.md
|
39
|
+
|
40
|
+
## 0.1.0 (2015-04-07)
|
41
|
+
Initial release
|
42
|
+
|
43
|
+
### Added
|
44
|
+
* CLI (`csv2avro convert`) to convert CSV files to Avro (#1)
|
45
|
+
* Travis CI (#2)
|
46
|
+
* Bad rows (#4)
|
47
|
+
* Versioning (#5)
|
48
|
+
* Gem packaging
|
data/Dockerfile
ADDED
@@ -0,0 +1,23 @@
|
|
1
|
+
FROM ruby:2.1
|
2
|
+
MAINTAINER Secret Sauce Partners, Inc. <dev@sspinc.io>
|
3
|
+
|
4
|
+
RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
|
5
|
+
python2.7 get-pip.py && \
|
6
|
+
pip install awscli
|
7
|
+
|
8
|
+
# throw errors if Gemfile has been modified since Gemfile.lock
|
9
|
+
RUN bundle config --global frozen 1
|
10
|
+
|
11
|
+
RUN mkdir -p /srv/csv2avro
|
12
|
+
WORKDIR /srv/csv2avro
|
13
|
+
|
14
|
+
RUN mkdir -p /srv/csv2avro/lib/csv2avro
|
15
|
+
|
16
|
+
COPY lib/csv2avro/version.rb /srv/csv2avro/lib/csv2avro/version.rb
|
17
|
+
COPY csv2avro.gemspec Gemfile Gemfile.lock /srv/csv2avro/
|
18
|
+
|
19
|
+
RUN bundle install
|
20
|
+
|
21
|
+
COPY . /srv/csv2avro
|
22
|
+
|
23
|
+
ENTRYPOINT ["./bin/csv2avro"]
|
data/Gemfile
ADDED
data/LICENSE.txt
ADDED
@@ -0,0 +1,22 @@
|
|
1
|
+
Copyright (c) 2015 Secret Sauce Partners, Inc.
|
2
|
+
|
3
|
+
MIT License
|
4
|
+
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining
|
6
|
+
a copy of this software and associated documentation files (the
|
7
|
+
"Software"), to deal in the Software without restriction, including
|
8
|
+
without limitation the rights to use, copy, modify, merge, publish,
|
9
|
+
distribute, sublicense, and/or sell copies of the Software, and to
|
10
|
+
permit persons to whom the Software is furnished to do so, subject to
|
11
|
+
the following conditions:
|
12
|
+
|
13
|
+
The above copyright notice and this permission notice shall be
|
14
|
+
included in all copies or substantial portions of the Software.
|
15
|
+
|
16
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
|
17
|
+
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
|
18
|
+
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
19
|
+
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
|
20
|
+
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
|
21
|
+
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
|
22
|
+
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
data/README.md
ADDED
@@ -0,0 +1,80 @@
|
|
1
|
+
# CSV2Avro
|
2
|
+
|
3
|
+
Convert CSV files to Avro like a boss.
|
4
|
+
|
5
|
+
## Installation
|
6
|
+
|
7
|
+
$ gem install csv2avro
|
8
|
+
|
9
|
+
or if you prefer to live on the edge, just clone this repository and build it from scratch.
|
10
|
+
|
11
|
+
You can run the converter within a **Docker** container, you just need to pull the `sspinc/csv2avro` image.
|
12
|
+
|
13
|
+
```
|
14
|
+
$ docker pull sspinc/csv2avro
|
15
|
+
```
|
16
|
+
|
17
|
+
## Usage
|
18
|
+
|
19
|
+
### Basic
|
20
|
+
```
|
21
|
+
$ csv2avro --schema ./spec/support/schema.avsc ./spec/support/data.csv
|
22
|
+
```
|
23
|
+
This will process the data.csv file and create a *data.avro* file and a *data.bad.csv* file containing the bad rows.
|
24
|
+
|
25
|
+
You can override the bad-rows file location with the `--bad-rows [BAD_ROWS]` option.
|
26
|
+
|
27
|
+
### CSV2Avro in Docker
|
28
|
+
|
29
|
+
```
|
30
|
+
$ docker run sspinc/csv2avro --help
|
31
|
+
```
|
32
|
+
|
33
|
+
### Streaming
|
34
|
+
```
|
35
|
+
$ cat ./spec/support/data.csv | csv2avro --schema ./spec/support/schema.avsc --bad-rows ./spec/support/data.bad.csv > ./spec/support/data.avro
|
36
|
+
```
|
37
|
+
This will process the *input stream* and push the avro data to the *output stream*. If you're working with streams you will need to specify the `--bad-rows` location.
|
38
|
+
|
39
|
+
### Advanced features
|
40
|
+
|
41
|
+
#### AWS S3 storage
|
42
|
+
|
43
|
+
```
|
44
|
+
aws s3 cp s3://csv-bucket/transactions.csv - | csv2avro --schema ./transactions.avsc --bad-rows ./transactions.bad.csv | aws s3 cp - s3://avro-bucket/transactions.avro
|
45
|
+
```
|
46
|
+
|
47
|
+
This will stream your file stored in AWS S3, convert the data and push it back to S3. For more information, please check the [AWS CLI documentation](http://docs.aws.amazon.com/cli/latest/reference/s3/index.html).
|
48
|
+
|
49
|
+
#### Convert compressed files
|
50
|
+
|
51
|
+
```
|
52
|
+
gunzip -c ./spec/support/data.csv.gz | csv2avro --schema ./spec/support/schema.avsc --bad-rows ./spec/support/data.bad.csv > ./spec/support/data.avro
|
53
|
+
```
|
54
|
+
|
55
|
+
This will uncompress the file and convert it to Avro, leaving the original file intact.
|
56
|
+
|
57
|
+
### More
|
58
|
+
|
59
|
+
For a full list of available options, run `csv2avro --help`
|
60
|
+
```
|
61
|
+
$ csv2avro --help
|
62
|
+
Version 1.0.0 of CSV2Avro
|
63
|
+
Usage: csv2avro [options] [file]
|
64
|
+
-s, --schema SCHEMA A file containing the Avro schema. This value is required.
|
65
|
+
-b, --bad-rows [BAD_ROWS] The output location of the bad rows file.
|
66
|
+
-d, --delimiter [DELIMITER] Field delimiter. If none specified, then comma is used as the delimiter.
|
67
|
+
-a [ARRAY_DELIMITER], Array field delimiter. If none specified, then comma is used as the delimiter.
|
68
|
+
--array-delimiter
|
69
|
+
-D, --write-defaults Write default values.
|
70
|
+
-c, --stdout Output will go to the standard output stream, leaving files intact.
|
71
|
+
-h, --help Prints help
|
72
|
+
```
|
73
|
+
|
74
|
+
## Contributing
|
75
|
+
|
76
|
+
1. Fork it ( https://github.com/sspinc/csv2avro/fork )
|
77
|
+
2. Create your feature branch (`git checkout -b my-new-feature`)
|
78
|
+
3. Commit your changes (`git commit -am 'Add some feature'`)
|
79
|
+
4. Push to the branch (`git push origin my-new-feature`)
|
80
|
+
5. Create a new Pull Request
|
data/Rakefile
ADDED
@@ -0,0 +1,41 @@
|
|
1
|
+
require 'rspec/core/rake_task'
require 'bundler/gem_tasks'
require 'bump/tasks'

# The gem uses plain semantic version bumps only, so drop the
# pre-release and arbitrary-set tasks that bump/tasks registers.
%w[bump:pre bump:set].each do |unwanted|
  Rake.application.instance_variable_get(:@tasks).delete(unwanted)
end

# `rake spec` runs the suite under ./spec with colored doc output.
RSpec::Core::RakeTask.new(:spec) do |rspec_task|
  rspec_task.rspec_opts = ['--color', '--format', 'documentation']
end

task default: :spec

namespace :docker do
  desc "Build docker image"
  task :build do
    full_version = CSV2Avro::VERSION
    sh "docker build -t sspinc/csv2avro:#{full_version} ."

    # Re-tag the freshly built image with the minor (x.y), major (x)
    # and latest tags so all of them point at the same image.
    minor_version = full_version.sub(/\.[0-9]+$/, '')
    major_version = minor_version.sub(/\.[0-9]+$/, '')

    [minor_version, major_version, 'latest'].each do |tag|
      sh "docker tag -f sspinc/csv2avro:#{full_version} sspinc/csv2avro:#{tag}"
    end
  end

  desc "Run specs inside docker image"
  task :spec => :build do
    sh "docker run -t --entrypoint=rake sspinc/csv2avro:#{CSV2Avro::VERSION} spec"
  end

  desc "Push docker image"
  task :push => :spec do
    sh "docker push sspinc/csv2avro"
  end
end
|
data/bin/csv2avro
ADDED
@@ -0,0 +1,58 @@
|
|
1
|
+
#!/usr/bin/env ruby

# CLI entry point for CSV2Avro: converts a CSV file (or STDIN) to Avro
# using a user supplied Avro schema. See README.md for usage examples.

$LOAD_PATH << File.dirname(__FILE__) + '/../lib' if $0 == __FILE__
require 'optparse'
require 'csv2avro'

options = {}

option_parser = OptionParser.new do |opts|
  opts.banner = "Version #{CSV2Avro::VERSION} of CSV2Avro\n" \
                "Usage: #{File.basename(__FILE__)} [options] [file]"

  opts.on('-s', '--schema SCHEMA', 'A file containing the Avro schema. This value is required.') do |path|
    options[:schema] = path
  end

  opts.on('-b', '--bad-rows [BAD_ROWS]', 'The output location of the bad rows file.') do |path|
    options[:bad_rows] = path
  end

  opts.on('-d', '--delimiter [DELIMITER]', 'Field delimiter. If none specified, then comma is used as the delimiter.') do |char|
    # Allow a literal "\t" typed on the command line to mean a tab character.
    options[:delimiter] = char.gsub("\\t", "\t")
  end

  opts.on('-a', '--array-delimiter [ARRAY_DELIMITER]', 'Array field delimiter. If none specified, then comma is used as the delimiter.') do |char|
    options[:array_delimiter] = char
  end

  opts.on('-D', '--write-defaults', 'Write default values.') do
    options[:write_defaults] = true
  end

  opts.on('-c', '--stdout', 'Output will go to the standard output stream, leaving files intact.') do
    options[:stdout] = true
  end

  opts.on('-h', '--help', 'Prints help') do
    puts opts
    exit
  end
end

option_parser.parse!

begin
  # --schema is always required; --bad-rows becomes required when reading
  # from STDIN, since there is no input path to derive a default from.
  raise OptionParser::MissingArgument.new('--schema') if options[:schema].nil?
  raise OptionParser::MissingArgument.new('--bad-rows') if options[:bad_rows].nil? && ARGV.empty?

  CSV2Avro.new(options).convert
rescue OptionParser::MissingArgument => ex
  puts ex.message

  puts option_parser
rescue StandardError => e
  # Rescue StandardError instead of Exception so that signals (Ctrl-C),
  # SystemExit and out-of-memory errors still terminate the process.
  puts 'Uh oh, something went wrong!'
  puts e.message
  puts e.backtrace.join("\n")
end
|
data/csv2avro.gemspec
ADDED
@@ -0,0 +1,28 @@
|
|
1
|
+
# coding: utf-8
lib = File.expand_path('../lib', __FILE__)
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
require 'csv2avro/version'

Gem::Specification.new do |spec|
  spec.name          = "csv2avro"
  spec.version       = CSV2Avro::VERSION
  spec.authors       = ["Peter Ableda"]
  spec.email         = ["scotty@secretsaucepartners.com"]
  spec.summary       = %q{Convert CSV files to Avro}
  spec.description   = %q{Convert CSV files to Avro like a boss.}
  spec.homepage      = ""
  spec.license       = "MIT"

  # Everything tracked by git ships in the gem; bin/* become executables
  # and test/spec/features files are marked as test files.
  spec.files         = `git ls-files -z`.split("\x0")
  spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
  spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
  spec.require_paths = ["lib"]

  {
    "bundler" => "~> 1.6",
    "rake"    => "~> 10.0",
    "rspec"   => "~> 3.2",
    "pry"     => "~> 0.10",
    "bump"    => "~> 0.5"
  }.each do |dependency, version|
    spec.add_development_dependency dependency, version
  end

  spec.add_dependency "avro", "~> 1.7"
end
|
data/lib/avro_schema.rb
ADDED
@@ -0,0 +1,57 @@
|
|
1
|
+
# Monkey-patch of the avro gem's Avro::Schema.validate that, in addition
# to returning true/false, records a human readable reason for every
# failed field in Avro::Schema.errors so the converter can report them
# per input line.
module Avro
  class Schema
    # Collected validation error messages; drained by the converter
    # (see CSV2Avro::Converter#convert).
    @errors = []

    class << self
      attr_accessor :errors
    end

    # Returns true when +datum+ conforms to +expected_schema+.
    # +name+ is the field name used in error messages (nil suppresses
    # recording); +suppress_error+ is set when probing union members,
    # where individual mismatches are expected and must not be recorded.
    def self.validate(expected_schema, datum, name=nil, suppress_error=false)
      expected_type = expected_schema.type_sym

      valid = case expected_type
      when :null
        datum.nil?
      when :boolean
        datum == true || datum == false
      when :string, :bytes
        datum.is_a? String
      when :int
        # Integer covers both Fixnum and Bignum (which were removed in
        # Ruby 2.4), so this works on old and new Rubies alike.
        datum.is_a?(Integer) &&
            (INT_MIN_VALUE <= datum) && (datum <= INT_MAX_VALUE)
      when :long
        datum.is_a?(Integer) &&
            (LONG_MIN_VALUE <= datum) && (datum <= LONG_MAX_VALUE)
      when :float, :double
        datum.is_a?(Float) || datum.is_a?(Integer)
      when :fixed
        datum.is_a?(String) && datum.size == expected_schema.size
      when :enum
        expected_schema.symbols.include? datum
      when :array
        datum.is_a?(Array) &&
            datum.all?{|d| validate(expected_schema.items, d) }
      when :map
        datum.keys.all?{|k| k.is_a? String } &&
            datum.values.all?{|v| validate(expected_schema.values, v) }
      when :union
        expected_schema.schemas.any?{|s| validate(s, datum, nil, true) }
      when :record, :error, :request
        datum.is_a?(Hash) &&
            expected_schema.fields.all?{|f| validate(f.type, datum[f.name], f.name) }
      else
        false
      end

      if !suppress_error && !valid && name
        if datum.nil? && expected_type != :null
          @errors << "Missing value at #{name}"
        else
          # Fixed message typo: was "does'n match".
          @errors << "'#{datum}' at #{name} doesn't match the type '#{expected_schema.to_s}'"
        end
      end

      valid
    end
  end
end
|
@@ -0,0 +1,27 @@
|
|
1
|
+
require 'avro'
require 'avro_schema'
require 'forwardable'

class CSV2Avro
  # Thin wrapper around Avro::DataFile::Writer exposing the small
  # IO-ish surface (seek/read/eof?/flush/close) the converter relies on.
  class AvroWriter
    extend Forwardable

    attr_reader :avro_writer

    # Reading back goes straight to the underlying IO object...
    def_delegators :'avro_writer.writer', :seek, :read, :eof?
    # ...while flushing and closing go through the Avro writer itself.
    def_delegators :avro_writer, :flush, :close

    # writer - the destination IO; schema - a CSV2Avro::Schema instance.
    def initialize(writer, schema)
      avro = schema.avro_schema
      @avro_writer = Avro::DataFile::Writer.new(writer, Avro::IO::DatumWriter.new(avro), avro)
    end

    # The Avro schema actually used when writing records.
    def writer_schema
      avro_writer.datum_writer.writers_schema
    end

    # Appends one record (a Hash keyed by field name) to the container file.
    def write(hash)
      avro_writer << hash
    end
  end
end
|
@@ -0,0 +1,125 @@
|
|
1
|
+
require 'csv2avro/schema'
require 'csv2avro/avro_writer'
require 'csv'

class CSV2Avro
  # Streams CSV lines from +reader+, coerces every row according to the
  # Avro schema and appends it through +writer+. Rows rejected by the
  # Avro writer are copied verbatim to +bad_rows_writer+, and the
  # validation errors collected by the Avro::Schema patch are reported
  # on +error_writer+ with the offending line number.
  class Converter
    attr_reader :writer, :bad_rows_writer, :error_writer, :schema, :reader, :csv_options, :converter_options, :header_row, :column_separator

    # options - :delimiter, :array_delimiter, :write_defaults (see CLI).
    # schema: - a CSV2Avro::Schema; now a *required* keyword argument.
    #   (It was declared `schema: schema`, a circular argument reference
    #   that silently defaulted to nil and is rejected as a syntax error
    #   by modern Rubies. All callers already pass it explicitly.)
    def initialize(reader, writer, bad_rows_writer, error_writer, options, schema:)
      @writer = writer
      @bad_rows_writer = bad_rows_writer
      @error_writer = error_writer
      @schema = schema

      @column_separator = options[:delimiter] || ','

      @reader = reader
      # Consume the header up front so each data line can be parsed on
      # its own (streaming) against the same headers.
      @header_row = reader.readline.strip
      header = header_row.split(column_separator)

      init_header_converter
      @csv_options = {
        headers: header,
        skip_blanks: true,
        col_sep: column_separator,
        header_converters: :aliases
      }

      @converter_options = options
    end

    def convert
      defaults = schema.defaults if converter_options[:write_defaults]

      # String fields need no coercion; everything else is handled below.
      fields_to_convert = schema.types.reject{ |key, value| value == :string }

      reader.each do |line|
        CSV.parse(line, csv_options) do |row|
          row = row.to_hash

          if converter_options[:write_defaults]
            add_defaults_to_row!(row, defaults)
          end

          convert_fields!(row, fields_to_convert)

          begin
            writer.write(row)
            writer.flush
          rescue
            # First bad row: start the bad-rows file with the header line.
            if bad_rows_writer.size == 0
              bad_rows_writer << header_row + "\n"
            end

            bad_rows_writer << line
            bad_rows_writer.flush

            # Drain the errors recorded by the Avro::Schema monkey-patch.
            until Avro::Schema.errors.empty? do
              error_writer << "line #{reader.lineno}: #{Avro::Schema.errors.shift}\n"
            end
          end
        end
      end
    end

    private

    # Coerces raw string cells to the schema's types; when a coercion
    # fails the original value is kept so Avro validation reports it.
    def convert_fields!(row, fields_to_convert)
      fields_to_convert.each do |key, value|
        row[key] = begin
          case value
          when :int
            Integer(row[key])
          when :float, :double
            Float(row[key])
          when :boolean
            parse_boolean(row[key])
          when :array
            parse_array(row[key])
          when :enum
            # Enum symbols are normalized: lowercase, spaces to underscores.
            row[key].downcase.tr(" ", "_")
          end
        rescue
          row[key]
        end
      end

      row
    end

    # Accepts common truthy/falsy spellings; nil for anything else.
    def parse_boolean(value)
      return true if value == true || value =~ (/^(true|t|yes|y|1)$/i)
      return false if value == false || value =~ (/^(false|f|no|n|0)$/i)
      nil
    end

    def parse_array(value)
      delimiter = converter_options[:array_delimiter] || ','

      value.split(delimiter) if value
    end

    def add_defaults_to_row!(row, defaults)
      # Add default values to nil cells
      row.each do |key, value|
        row[key] = defaults[key] if value.nil?
      end

      # Add default values to missing columns
      defaults.each do |key, value|
        row[key] = defaults[key] unless row.has_key?(key)
      end

      row
    end

    # Registers a CSV header converter mapping schema aliases to their
    # canonical field names.
    def init_header_converter
      aliases = schema.aliases

      CSV::HeaderConverters[:aliases] = lambda do |header|
        aliases[header] || header
      end
    end
  end
end
|
@@ -0,0 +1,44 @@
|
|
1
|
+
require 'json'

class CSV2Avro
  # Wraps a parsed Avro schema and exposes the per-field metadata the
  # converter needs: default values, primary types and header aliases.
  class Schema
    attr_reader :avro_schema, :schema_string

    # schema - an IO yielding the Avro schema JSON.
    def initialize(schema)
      @schema_string = schema.read
      @avro_schema = Avro::Schema.parse(schema_string)
    end

    # Field name => default value, for fields that declare a default.
    def defaults
      avro_schema.fields.each_with_object({}) do |field, acc|
        acc[field.name] = field.default unless field.default.nil?
      end
    end

    # Field name => primary type symbol. For union types the first
    # member is taken as the primary type.
    def types
      avro_schema.fields.each_with_object({}) do |field, acc|
        field_type = field.type
        acc[field.name] =
          if field_type.type_sym == :union
            field_type.schemas.first.type_sym
          else
            field_type.type_sym
          end
      end
    end

    # Alias => canonical field name, read straight from the schema JSON.
    # TODO: Change this when the avro gem starts to support aliases
    def aliases
      fields = JSON.parse(schema_string)['fields']

      fields.each_with_object({}) do |field, acc|
        next unless field['aliases']

        field['aliases'].each { |one_alias| acc[one_alias] = field['name'] }
      end
    end
  end
end
|
data/lib/csv2avro.rb
ADDED
@@ -0,0 +1,78 @@
|
|
1
|
+
require 'csv2avro/converter'
|
2
|
+
require 'csv2avro/version'
|
3
|
+
|
4
|
+
class CSV2Avro
|
5
|
+
attr_reader :input_path, :schema_path, :bad_rows_path, :stdout_option, :options
|
6
|
+
|
7
|
+
def initialize(options)
|
8
|
+
@input_path = ARGV.first
|
9
|
+
@schema_path = options.delete(:schema)
|
10
|
+
@bad_rows_path = options.delete(:bad_rows)
|
11
|
+
@stdout_option = !input_path || options.delete(:stdout)
|
12
|
+
|
13
|
+
@options = options
|
14
|
+
end
|
15
|
+
|
16
|
+
def convert
|
17
|
+
Converter.new(reader, writer, bad_rows_writer, error_writer, options, schema: schema).convert
|
18
|
+
ensure
|
19
|
+
writer.close if writer
|
20
|
+
|
21
|
+
if bad_rows_writer.size == 0
|
22
|
+
File.delete(bad_rows_uri)
|
23
|
+
elsif bad_rows_writer
|
24
|
+
bad_rows_writer.close
|
25
|
+
end
|
26
|
+
end
|
27
|
+
|
28
|
+
private
|
29
|
+
|
30
|
+
def schema
|
31
|
+
@schema ||= File.open(schema_path, 'r') do |schema|
|
32
|
+
CSV2Avro::Schema.new(schema)
|
33
|
+
end
|
34
|
+
end
|
35
|
+
|
36
|
+
def reader
|
37
|
+
ARGF.lineno = 0
|
38
|
+
ARGF
|
39
|
+
end
|
40
|
+
|
41
|
+
def writer
|
42
|
+
@__writer ||= begin
|
43
|
+
writer = if stdout_option
|
44
|
+
IO.new(STDOUT.fileno)
|
45
|
+
else
|
46
|
+
File.open(avro_uri, 'w')
|
47
|
+
end
|
48
|
+
|
49
|
+
CSV2Avro::AvroWriter.new(writer, schema)
|
50
|
+
end
|
51
|
+
end
|
52
|
+
|
53
|
+
def avro_uri
|
54
|
+
dir = File.dirname(input_path)
|
55
|
+
ext = File.extname(input_path)
|
56
|
+
name = File.basename(input_path, ext)
|
57
|
+
|
58
|
+
"#{dir}/#{name}.avro"
|
59
|
+
end
|
60
|
+
|
61
|
+
def error_writer
|
62
|
+
$stderr
|
63
|
+
end
|
64
|
+
|
65
|
+
def bad_rows_writer
|
66
|
+
@__bad_rows_writer ||= File.open(bad_rows_uri, 'w')
|
67
|
+
end
|
68
|
+
|
69
|
+
def bad_rows_uri
|
70
|
+
return bad_rows_path if bad_rows_path
|
71
|
+
|
72
|
+
dir = File.dirname(input_path)
|
73
|
+
ext = File.extname(input_path)
|
74
|
+
name = File.basename(input_path, ext)
|
75
|
+
|
76
|
+
"#{dir}/#{name}.bad#{ext}"
|
77
|
+
end
|
78
|
+
end
|