csv2avro 1.0.2 → 1.1.0 (RubyGems)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (15)

checksums.yaml +4 -4
data/CHANGELOG.md +15 -5
data/README.md +6 -6
data/bin/csv2avro +9 -8
data/lib/csv2avro/avro_writer.rb +2 -2
data/lib/csv2avro/converter.rb +9 -14
data/lib/csv2avro/datum_writer.rb +31 -0
data/lib/{avro_schema.rb → csv2avro/schema_validator.rb} +19 -10
data/lib/csv2avro/version.rb +1 -1
data/lib/csv2avro.rb +1 -1
data/spec/csv2avro/converter_spec.rb +3 -3
data/spec/csv2avro_spec.rb +54 -20
data/spec/support/data.csv +2 -0
data/spec/support/data_quoted.csv +8 -0
metadata +6 -3

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 3dd5ca9e046d1e46a845614350ad1e0e2dcdd8a6
-  data.tar.gz: 0c7eefec78b9293b6f1ed954fb22b5ec6b6506ba
+  metadata.gz: 6e7e9a8d86d5cd8e85b957ffecb5e60ccfe9c8b5
+  data.tar.gz: 2b12a6828c601dfe19e6d93bc39b481ec1e15118
 SHA512:
-  metadata.gz: abbed521cda772e04453d95ec559f37e1e45b84f1fbe46e600e5a1ec0333abdbf28978b31946a451ab87701de4af089787f68b316262152bb00f6d9ceefe1048
-  data.tar.gz: 683f680b56c5c06f0217ca5c6f2ff384b228435175327d746545601636844755a5832cf904c420f004bcc4a0cbb4d751eaf1ddeefe4eeeaa334c6418b8633a02
+  metadata.gz: 1caef810f21aa9f9b8dd1562c253a967f5b1c94382296f677dd6703624ac51d3d13774457c76820bf52c4a88795829d28a1b46017384ab306abf9f62bb50a078
+  data.tar.gz: cf9f67c9316d2840f883a36082a30187ae8aebbe24db71a4100496ea833b20568b9ec16f4e92258b38705b7334f07b78b52766db78245cc81c2a32e3c6244d95

data/CHANGELOG.md CHANGED

@@ -3,17 +3,27 @@
 All notable changes to this project are documented in this file.
 This project adheres to [Semantic Versioning](http://semver.org/).
-## 1.0.2 (2015-06-29; [compare](https://github.com/sspinc/csv2avro/compare/1.0.1...1.0.2))
+## 1.1.0 (2015-09-16) [compare](https://github.com/sspinc/csv2avro/compare/1.0.2...1.1.0))
+### Changed
+ * Write usage and error messages to stderr
+ * Exit code 1 for general errors, 2 for missing arguments
+ * Bad rows report with error causes instead of bad rows csv
+### Fixed
+ * Handle quoted headers
+## 1.0.2 (2015-06-29) [compare](https://github.com/sspinc/csv2avro/compare/1.0.1...1.0.2))
 ### Fixed
  * Continue on parsing errors
-## 1.0.1 (2015-06-12; [compare](https://github.com/sspinc/csv2avro/compare/1.0.0...1.0.1))
+## 1.0.1 (2015-06-12) [compare](https://github.com/sspinc/csv2avro/compare/1.0.0...1.0.1))
 ### Fixed
  * CSV parsing issues
-## 1.0.0 (2015-06-05; [compare](https://github.com/sspinc/csv2avro/compare/0.4.0...1.0.0))
+## 1.0.0 (2015-06-05) [compare](https://github.com/sspinc/csv2avro/compare/0.4.0...1.0.0))
 ### Added
  * Usage description to readme
@@ -23,7 +33,7 @@ This project adheres to [Semantic Versioning](http://semver.org/).
 ### Fixed
  * Docker image entrypoint
-## 0.4.0 (2015-05-07; [compare](https://github.com/sspinc/csv2avro/compare/0.3.0...0.4.0))
+## 0.4.0 (2015-05-07) [compare](https://github.com/sspinc/csv2avro/compare/0.3.0...0.4.0))
 ### Added
  * Streaming support (#7)
@@ -38,7 +48,7 @@ This project adheres to [Semantic Versioning](http://semver.org/).
 ### Fixed
  * Build project into Docker image (#9)
-## 0.3.0 (2015-04-28; [compare](https://github.com/sspinc/csv2avro/compare/0.1.0...0.3.0))
+## 0.3.0 (2015-04-28) [compare](https://github.com/sspinc/csv2avro/compare/0.1.0...0.3.0))
 ### Added
  * Docker support (#6)

data/README.md CHANGED

@@ -14,13 +14,13 @@ or if you prefer to live on the edge, just clone this repository and build it fr
 ```
 $ csv2avro --schema ./spec/support/schema.avsc ./spec/support/data.csv
 ```
-This will process the data.csv file and creates a *data.avro* file and a *data.bad.csv* file with the bad rows.
+This will process the data.csv file and creates a *data.avro* file and a *data.bad* file with a report of the bad rows.
-You can override the bad-rows file location with the `--bad-rows [BAD_ROWS]` option.
+You can override the bad rows report file location with the `--bad-rows [BAD_ROWS]` option.
 ### Streaming
 ```
-$ cat ./spec/support/data.csv | csv2avro --schema ./spec/support/schema.avsc --bad-rows ./spec/support/data.bad.csv > ./spec/support/data.avro
+$ cat ./spec/support/data.csv | csv2avro --schema ./spec/support/schema.avsc --bad-rows ./spec/support/data.bad > ./spec/support/data.avro
 ```
 This will process the *input stream* and push the avro data to the *output stream*. If you're working with streams you will need to specify the `--bad-rows` location.
@@ -29,7 +29,7 @@ This will process the *input stream* and push the avro data to the *output strea
 #### AWS S3 storage
 ```
-aws s3 cp s3://csv-bucket/transactions.csv - | csv2avro --schema ./transactions.avsc --bad-rows ./transactions.bad.csv | aws s3 cp - s3://avro-bucket/transactions.avro
+aws s3 cp s3://csv-bucket/transactions.csv - | csv2avro --schema ./transactions.avsc --bad-rows ./transactions.bad | aws s3 cp - s3://avro-bucket/transactions.avro
 ```
 This will stream your file stored in AWS S3, converts the data and pushes it back to S3. For more information, please check the [AWS CLI documentation](http://docs.aws.amazon.com/cli/latest/reference/s3/index.html).
@@ -37,7 +37,7 @@ This will stream your file stored in AWS S3, converts the data and pushes it bac
 #### Convert compressed files
 ```
-gunzip -c ./spec/support/data.csv.gz | csv2avro --schema ./spec/support/schema.avsc --bad-rows ./spec/support/data.bad.csv > ./spec/support/data.avro
+gunzip -c ./spec/support/data.csv.gz | csv2avro --schema ./spec/support/schema.avsc --bad-rows ./spec/support/data.bad > ./spec/support/data.avro
 ```
 This will uncompress the file and converts it to avro, leaving the original file intact.
@@ -50,7 +50,7 @@ $ csv2avro --help
 Version 1.0.1 of CSV2Avro
 Usage: csv2avro [options] [file]
     -s, --schema SCHEMA              A file containing the Avro schema. This value is required.
-    -b, --bad-rows [BAD_ROWS]        The output location of the bad rows file.
+    -b, --bad-rows [BAD_ROWS]        The output location of the bad rows report file.
     -d, --delimiter [DELIMITER]      Field delimiter. If none specified, then comma is used as the delimiter.
     -a [ARRAY_DELIMITER],            Array field delimiter. If none specified, then comma is used as the delimiter.
         --array-delimiter

data/bin/csv2avro CHANGED

@@ -14,7 +14,7 @@ option_parser = OptionParser.new do |opts|
     options[:schema] = path
   end
-  opts.on('-b', '--bad-rows [BAD_ROWS]', 'The output location of the bad rows file.') do |path|
+  opts.on('-b', '--bad-rows [BAD_ROWS]', 'The output location of the bad rows report file.') do |path|
     options[:bad_rows] = path
   end
@@ -35,7 +35,7 @@ option_parser = OptionParser.new do |opts|
   end
   opts.on('-h', '--help', 'Prints help') do
-    puts opts
+    $stderr.puts opts
     exit
   end
 end
@@ -48,11 +48,12 @@ begin
   CSV2Avro.new(options).convert
 rescue OptionParser::MissingArgument => ex
-  puts ex.message
-  puts option_parser
+  $stderr.puts ex.message
+  $stderr.puts option_parser
+  exit 2
 rescue Exception => e
-  puts 'Uh oh, something went wrong!'
-  puts e.message
-  puts e.backtrace.join("\n")
+  $stderr.puts 'Uh oh, something went wrong!'
+  $stderr.puts e.message
+  $stderr.puts e.backtrace.join("\n")
+  exit 1
 end
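
The exit-code convention introduced in this hunk (2 for a missing argument, 1 for any other error, messages on stderr) can be sketched in isolation. This is a minimal illustration, not the gem's code: `run_cli` is a hypothetical helper, it returns the code instead of calling `exit`, and it rescues `StandardError` rather than `Exception` so that the sketch does not swallow `SystemExit` or interrupts.

```ruby
require 'optparse'
require 'stringio'

# Hypothetical helper (not part of csv2avro): runs a block and maps
# failures to the exit codes described in the 1.1.0 changelog.
def run_cli(err: $stderr)
  yield
  0
rescue OptionParser::MissingArgument => ex
  # Missing arguments: report on the error stream, exit code 2.
  err.puts ex.message
  2
rescue StandardError => e
  # Any other error: report on the error stream, exit code 1.
  err.puts 'Uh oh, something went wrong!'
  err.puts e.message
  1
end

err = StringIO.new
missing_code = run_cli(err: err) { raise OptionParser::MissingArgument, '--schema' }
general_code = run_cli(err: err) { raise ArgumentError, 'bad input' }
success_code = run_cli(err: err) { :ok }
```

A real executable would pass the returned value to `exit`, which keeps stdout free for the Avro stream in the piped usage shown in the README.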

data/lib/csv2avro/avro_writer.rb CHANGED

@@ -1,6 +1,6 @@
 require 'avro'
-require 'avro_schema'
 require 'forwardable'
+require 'csv2avro/datum_writer'
 class CSV2Avro
   class AvroWriter
@@ -12,7 +12,7 @@ class CSV2Avro
     def_delegators :avro_writer, :flush, :close
     def initialize(writer, schema)
-      datum_writer = Avro::IO::DatumWriter.new(schema.avro_schema)
+      datum_writer = CSV2Avro::DatumWriter.new(schema.avro_schema)
       @avro_writer = Avro::DataFile::Writer.new(writer, datum_writer, schema.avro_schema)
     end

data/lib/csv2avro/converter.rb CHANGED

@@ -13,7 +13,7 @@ class CSV2Avro
       @schema = schema
       # read header row explicitly
-      @header = @reader.readline.strip.split(col_sep)
+      @header = @reader.readline.strip.split(col_sep).map{ |col| col.gsub('"','') }
     end
     def convert
@@ -21,7 +21,9 @@ class CSV2Avro
         begin
           row = csv.shift
         rescue CSV::MalformedCSVError
-          @error_writer.puts("line #{line_number}: Unable to parse")
+          error_msg = "L#{row_number}: Unable to parse"
+          @error_writer.puts(error_msg)
+          @bad_rows_writer.puts(error_msg)
           next
         end
         hash = row.to_hash
@@ -31,12 +33,10 @@ class CSV2Avro
         begin
           @writer.write(hash)
-        rescue Avro::IO::AvroTypeError
-          bad_rows_csv << row
-          until Avro::Schema.errors.empty? do
-            @error_writer.puts("line #{line_number}: #{Avro::Schema.errors.shift}")
-          end
+        rescue CSV2Avro::SchemaValidationError => e
+          error_msg = "L#{row_number}: #{e.errors.join(', ')}"
+          @error_writer.puts(error_msg)
+          @bad_rows_writer.puts(error_msg)
         end
       end
       @writer.flush
@@ -71,12 +71,7 @@ class CSV2Avro
       @csv ||= CSV.new(@reader, csv_options)
     end
-    def bad_rows_csv
-      options = csv_options.tap { |hash| hash.delete(:header_converters) }
-      @bad_rows_csv ||= CSV.new(@bad_rows_writer, options)
-    end
-    def line_number
+    def row_number
       @reader.lineno + 1
     end

data/lib/csv2avro/datum_writer.rb ADDED

@@ -0,0 +1,31 @@
+require 'avro'
+require 'csv2avro/schema_validator'
+class CSV2Avro
+  class DatumWriter < Avro::IO::DatumWriter
+    attr_reader :schema_validator
+    def initialize(*args)
+      super
+      @schema_validator = CSV2Avro::SchemaValidator.new
+    end
+    def write(datum, encoder)
+      schema_validator.clear
+      if !schema_validator.validate(writers_schema, datum)
+        raise SchemaValidationError.new(schema_validator.errors)
+      end
+      super
+    end
+  end
+  class SchemaValidationError < StandardError
+    attr_reader :errors
+    def initialize(schema_errors)
+      @errors = schema_errors
+    end
+  end
+end
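
One detail of the error class above: its `initialize` stores the errors but does not call `super`, so `message` should fall back to the class name. The sketch below shows a possible variant that also sets the message. `ReportableValidationError` is a hypothetical name used only here, and the variant is a suggestion rather than what the gem ships.

```ruby
# Hypothetical variant of the error class in the diff: it keeps the
# list of validation errors and also passes a joined message to super.
class ReportableValidationError < StandardError
  attr_reader :errors

  def initialize(schema_errors)
    @errors = schema_errors
    super(schema_errors.join(', '))
  end
end

begin
  raise ReportableValidationError.new(['Missing value at id', 'Missing value at name'])
rescue ReportableValidationError => e
  caught_errors  = e.errors
  caught_message = e.message
end
```

The converter in this release builds its report line from `e.errors.join(', ')`, so the stored list is what matters there; the message only becomes relevant if the error reaches a generic handler that prints `e.message`.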

data/lib/{avro_schema.rb → csv2avro/schema_validator.rb} RENAMED

@@ -1,12 +1,19 @@
-module Avro
-  class Schema
-    @errors = []
+require 'avro/schema'
-    class << self
-      attr_accessor :errors
+class CSV2Avro
+  class SchemaValidator
+    attr_reader :errors
+    def initialize
+      @errors = []
+    end
+    def clear
+      @errors.clear
     end
-    def self.validate(expected_schema, datum, name=nil, suppress_error=false)
+    def validate(expected_schema, datum, name=nil, suppress_error=false)
       expected_type = expected_schema.type_sym
       valid = case expected_type
@@ -18,10 +25,10 @@ module Avro
                 datum.is_a? String
               when :int
                 (datum.is_a?(Fixnum) || datum.is_a?(Bignum)) &&
-                    (INT_MIN_VALUE <= datum) && (datum <= INT_MAX_VALUE)
+                    (Avro::Schema::INT_MIN_VALUE <= datum) && (datum <= Avro::Schema::INT_MAX_VALUE)
               when :long
                 (datum.is_a?(Fixnum) || datum.is_a?(Bignum)) &&
-                    (LONG_MIN_VALUE <= datum) && (datum <= LONG_MAX_VALUE)
+                    (Avro::Schema::LONG_MIN_VALUE <= datum) && (datum <= Avro::Schema::LONG_MAX_VALUE)
               when :float, :double
                 datum.is_a?(Float) || datum.is_a?(Fixnum) || datum.is_a?(Bignum)
               when :fixed
@@ -38,12 +45,14 @@ module Avro
                 expected_schema.schemas.any?{|s| validate(s, datum, nil, true) }
               when :record, :error, :request
                 datum.is_a?(Hash) &&
-                  expected_schema.fields.all?{|f| validate(f.type, datum[f.name], f.name) }
+                  expected_schema.fields.reduce(true){|result, f|
+                  validate_result = validate(f.type, datum[f.name], f.name)
+                  result && validate_result }
               else
                 false
               end
-      if !suppress_error && !valid && name
+      if !valid && name
         if datum.nil? && expected_type != :null
           @errors << "Missing value at #{name}"
         else

data/lib/csv2avro/version.rb CHANGED

@@ -1,3 +1,3 @@
 class CSV2Avro
-  VERSION = "1.0.2"
+  VERSION = "1.1.0"
 end

data/lib/csv2avro.rb CHANGED

@@ -68,6 +68,6 @@ class CSV2Avro
     ext = File.extname(input_path)
     name = File.basename(input_path, ext)
-    "#{dir}/#{name}.bad#{ext}"
+    "#{dir}/#{name}.bad"
   end
 end
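
The effect of dropping `#{ext}` from the default report path can be checked with a standalone sketch. The method name `bad_rows_path` is hypothetical, and `dir` is assumed to come from `File.dirname`, since that line lies outside the hunk.

```ruby
# Hypothetical standalone version of the default bad rows report path.
# Assumption: dir is derived with File.dirname (not visible in the hunk).
def bad_rows_path(input_path)
  dir  = File.dirname(input_path)
  ext  = File.extname(input_path)
  name = File.basename(input_path, ext)

  "#{dir}/#{name}.bad"
end

default_path = bad_rows_path('./spec/support/data.csv')
quoted_path  = bad_rows_path('./spec/support/data_quoted.csv')
```

Under these assumptions the results line up with the file names the updated specs open, `data.bad` and `data_quoted.bad`, rather than the former `data.bad.csv`.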

data/spec/csv2avro/converter_spec.rb CHANGED

@@ -351,15 +351,15 @@ RSpec.describe CSV2Avro::Converter do
         CSV2Avro::Converter.new(reader, avro_writer, bad_rows_writer, error_writer, { delimiter: "\t" }, schema: schema).convert
       end
-      it 'should have the bad data in the original form' do
+      it 'should report the bad rows correctly' do
         expect(bad_rows_writer.string).to eq(
-          "id\ttitle\tdescription\n1\t\tdresses\n4\t\tfemale-shoes\n"
+          "L2: Missing value at name\nL5: Missing value at name\n"
         )
       end
       it 'should have an error' do
         expect(error_writer.string).to eq(
-          "line 2: Missing value at name\nline 5: Missing value at name\n"
+          "L2: Missing value at name\nL5: Missing value at name\n"
         )
       end

data/spec/csv2avro_spec.rb CHANGED

@@ -3,32 +3,66 @@ require 'spec_helper'
 RSpec.describe CSV2Avro do
   describe '#convert' do
     let(:options) { { schema: './spec/support/schema.avsc' } }
-    before do
-      ARGV.replace ['./spec/support/data.csv']
+    subject(:converter) do
+      CSV2Avro.new(options)
     end
-    subject(:converter) { CSV2Avro.new(options) }
-    it 'should write errors to STDERR' do
-      expect { converter.convert }.to output("line 4: Missing value at name\nline 7: Unable to parse\n").to_stderr
-    end
+    context "Unquoted header" do
+      before do
+        ARGV.replace ['./spec/support/data.csv']
+      end
+      bad_rows_output = "L4: Missing value at name\nL7: Unable to parse\nL9: Missing value at id, Missing value at name\nL10: 'male-shoes' at id doesn't match the type '\"int\"', Missing value at name\n"
+      it 'should write errors to STDERR' do
+        expect { converter.convert }.to output(bad_rows_output).to_stderr
+      end
+      it 'should have bad rows' do
+        File.open('./spec/support/data.bad', 'r') do |file|
+          expect(file.read).to eq(bad_rows_output)
+        end
+      end
-    it 'should have a bad row' do
-      File.open('./spec/support/data.bad.csv', 'r') do |file|
-        expect(file.read).to eq("id,name,description\n3,,Bras\n")
+      it 'should contain the avro data' do
+        File.open('./spec/support/data.avro', 'r') do |file|
+          expect(AvroReader.new(file).read).to eq(
+            [
+              { 'id'=>1, 'name'=>'dresses',     'description'=>'Dresses' },
+              { 'id'=>2, 'name'=>'female-tops', 'description'=>nil },
+              { 'id'=>4, 'name'=>'male-tops',   'description'=>"Male Tops\nand Male Shirts"},
+              { 'id'=>6, 'name'=>'male-shoes', 'description'=>'Male Shoes'}
+            ]
+          )
+        end
       end
     end
-    it 'should contain the avro data' do
-      File.open('./spec/support/data.avro', 'r') do |file|
-        expect(AvroReader.new(file).read).to eq(
-          [
-            { 'id'=>1, 'name'=>'dresses',     'description'=>'Dresses' },
-            { 'id'=>2, 'name'=>'female-tops', 'description'=>nil },
-            { 'id'=>4, 'name'=>'male-tops',   'description'=>"Male Tops\nand Male Shirts"},
-            { 'id'=>6, 'name'=>'male-shoes', 'description'=>'Male Shoes'}
-          ]
-        )
+    context "Quoted header" do
+      before do
+        ARGV.replace ['./spec/support/data_quoted.csv']
+      end
+      it 'should write errors to STDERR' do
+        expect { converter.convert }.to output("L4: Missing value at name\nL7: Unable to parse\n").to_stderr
+      end
+      it 'should have a bad row' do
+        File.open('./spec/support/data_quoted.bad', 'r') do |file|
+          expect(file.read).to eq("L4: Missing value at name\nL7: Unable to parse\n")
+        end
+      end
+      it 'should contain the avro data' do
+        File.open('./spec/support/data_quoted.avro', 'r') do |file|
+          expect(AvroReader.new(file).read).to eq(
+            [
+              { 'id'=>1, 'name'=>'dresses',     'description'=>'Dresses' },
+              { 'id'=>2, 'name'=>'female-tops', 'description'=>nil },
+              { 'id'=>4, 'name'=>'male-tops',   'description'=>"Male Tops\nand Male Shirts"},
+              { 'id'=>6, 'name'=>'male-shoes', 'description'=>'Male Shoes'}
+            ]
+          )
+        end
       end
     end
   end

data/spec/support/data.csv CHANGED

@@ -6,3 +6,5 @@ id,name,description
 and Male Shirts"
 "5","female-shoes","\"Shoes\" for women"
 "6","male-shoes","Male Shoes"
+,,"Male Shoes"
+"male-shoes",,"Male Shoes"

data/spec/support/data_quoted.csv ADDED

@@ -0,0 +1,8 @@
+"id","name","description"
+1,dresses,Dresses
+"2","female-tops",
+"3",,"Bras"
+"4","male-tops","Male Tops
+and Male Shirts"
+"5","female-shoes","\"Shoes\" for women"
+"6","male-shoes","Male Shoes"

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: csv2avro
 version: !ruby/object:Gem::Version
-  version: 1.0.2
+  version: 1.1.0
 platform: ruby
 authors:
 - Peter Ableda
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-06-30 00:00:00.000000000 Z
+date: 2015-09-16 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -113,11 +113,12 @@ files:
 - Rakefile
 - bin/csv2avro
 - csv2avro.gemspec
-- lib/avro_schema.rb
 - lib/csv2avro.rb
 - lib/csv2avro/avro_writer.rb
 - lib/csv2avro/converter.rb
+- lib/csv2avro/datum_writer.rb
 - lib/csv2avro/schema.rb
+- lib/csv2avro/schema_validator.rb
 - lib/csv2avro/version.rb
 - spec/csv2avro/converter_spec.rb
 - spec/csv2avro/schema_spec.rb
@@ -125,6 +126,7 @@ files:
 - spec/spec_helper.rb
 - spec/support/avro_reader.rb
 - spec/support/data.csv
+- spec/support/data_quoted.csv
 - spec/support/schema.avsc
 homepage: ''
 licenses:
@@ -157,5 +159,6 @@ test_files:
 - spec/spec_helper.rb
 - spec/support/avro_reader.rb
 - spec/support/data.csv
+- spec/support/data_quoted.csv
 - spec/support/schema.avsc
 has_rdoc: