csv2avro 1.0.2 → 1.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
-   metadata.gz: 3dd5ca9e046d1e46a845614350ad1e0e2dcdd8a6
-   data.tar.gz: 0c7eefec78b9293b6f1ed954fb22b5ec6b6506ba
+   metadata.gz: 6e7e9a8d86d5cd8e85b957ffecb5e60ccfe9c8b5
+   data.tar.gz: 2b12a6828c601dfe19e6d93bc39b481ec1e15118
  SHA512:
-   metadata.gz: abbed521cda772e04453d95ec559f37e1e45b84f1fbe46e600e5a1ec0333abdbf28978b31946a451ab87701de4af089787f68b316262152bb00f6d9ceefe1048
-   data.tar.gz: 683f680b56c5c06f0217ca5c6f2ff384b228435175327d746545601636844755a5832cf904c420f004bcc4a0cbb4d751eaf1ddeefe4eeeaa334c6418b8633a02
+   metadata.gz: 1caef810f21aa9f9b8dd1562c253a967f5b1c94382296f677dd6703624ac51d3d13774457c76820bf52c4a88795829d28a1b46017384ab306abf9f62bb50a078
+   data.tar.gz: cf9f67c9316d2840f883a36082a30187ae8aebbe24db71a4100496ea833b20568b9ec16f4e92258b38705b7334f07b78b52766db78245cc81c2a32e3c6244d95
data/CHANGELOG.md CHANGED
@@ -3,17 +3,27 @@
  All notable changes to this project are documented in this file.
  This project adheres to [Semantic Versioning](http://semver.org/).
 
- ## 1.0.2 (2015-06-29; [compare](https://github.com/sspinc/csv2avro/compare/1.0.1...1.0.2))
+ ## 1.1.0 (2015-09-16) [compare](https://github.com/sspinc/csv2avro/compare/1.0.2...1.1.0))
+
+ ### Changed
+ * Write usage and error messages to stderr
+ * Exit code 1 for general errors, 2 for missing arguments
+ * Bad rows report with error causes instead of bad rows csv
+
+ ### Fixed
+ * Handle quoted headers
+
+ ## 1.0.2 (2015-06-29) [compare](https://github.com/sspinc/csv2avro/compare/1.0.1...1.0.2))
 
  ### Fixed
  * Continue on parsing errors
 
- ## 1.0.1 (2015-06-12; [compare](https://github.com/sspinc/csv2avro/compare/1.0.0...1.0.1))
+ ## 1.0.1 (2015-06-12) [compare](https://github.com/sspinc/csv2avro/compare/1.0.0...1.0.1))
 
  ### Fixed
  * CSV parsing issues
 
- ## 1.0.0 (2015-06-05; [compare](https://github.com/sspinc/csv2avro/compare/0.4.0...1.0.0))
+ ## 1.0.0 (2015-06-05) [compare](https://github.com/sspinc/csv2avro/compare/0.4.0...1.0.0))
 
  ### Added
  * Usage description to readme
@@ -23,7 +33,7 @@ This project adheres to [Semantic Versioning](http://semver.org/).
  ### Fixed
  * Docker image entrypoint
 
- ## 0.4.0 (2015-05-07; [compare](https://github.com/sspinc/csv2avro/compare/0.3.0...0.4.0))
+ ## 0.4.0 (2015-05-07) [compare](https://github.com/sspinc/csv2avro/compare/0.3.0...0.4.0))
 
  ### Added
  * Streaming support (#7)
@@ -38,7 +48,7 @@ This project adheres to [Semantic Versioning](http://semver.org/).
  ### Fixed
  * Build project into Docker image (#9)
 
- ## 0.3.0 (2015-04-28; [compare](https://github.com/sspinc/csv2avro/compare/0.1.0...0.3.0))
+ ## 0.3.0 (2015-04-28) [compare](https://github.com/sspinc/csv2avro/compare/0.1.0...0.3.0))
 
  ### Added
  * Docker support (#6)
data/README.md CHANGED
@@ -14,13 +14,13 @@ or if you prefer to live on the edge, just clone this repository and build it fr
  ```
  $ csv2avro --schema ./spec/support/schema.avsc ./spec/support/data.csv
  ```
- This will process the data.csv file and creates a *data.avro* file and a *data.bad.csv* file with the bad rows.
+ This will process the data.csv file and creates a *data.avro* file and a *data.bad* file with a report of the bad rows.
 
- You can override the bad-rows file location with the `--bad-rows [BAD_ROWS]` option.
+ You can override the bad rows report file location with the `--bad-rows [BAD_ROWS]` option.
 
  ### Streaming
  ```
- $ cat ./spec/support/data.csv | csv2avro --schema ./spec/support/schema.avsc --bad-rows ./spec/support/data.bad.csv > ./spec/support/data.avro
+ $ cat ./spec/support/data.csv | csv2avro --schema ./spec/support/schema.avsc --bad-rows ./spec/support/data.bad > ./spec/support/data.avro
  ```
  This will process the *input stream* and push the avro data to the *output stream*. If you're working with streams you will need to specify the `--bad-rows` location.
 
@@ -29,7 +29,7 @@ This will process the *input stream* and push the avro data to the *output strea
  #### AWS S3 storage
 
  ```
- aws s3 cp s3://csv-bucket/transactions.csv - | csv2avro --schema ./transactions.avsc --bad-rows ./transactions.bad.csv | aws s3 cp - s3://avro-bucket/transactions.avro
+ aws s3 cp s3://csv-bucket/transactions.csv - | csv2avro --schema ./transactions.avsc --bad-rows ./transactions.bad | aws s3 cp - s3://avro-bucket/transactions.avro
  ```
 
  This will stream your file stored in AWS S3, converts the data and pushes it back to S3. For more information, please check the [AWS CLI documentation](http://docs.aws.amazon.com/cli/latest/reference/s3/index.html).
@@ -37,7 +37,7 @@ This will stream your file stored in AWS S3, converts the data and pushes it bac
  #### Convert compressed files
 
  ```
- gunzip -c ./spec/support/data.csv.gz | csv2avro --schema ./spec/support/schema.avsc --bad-rows ./spec/support/data.bad.csv > ./spec/support/data.avro
+ gunzip -c ./spec/support/data.csv.gz | csv2avro --schema ./spec/support/schema.avsc --bad-rows ./spec/support/data.bad > ./spec/support/data.avro
  ```
 
  This will uncompress the file and converts it to avro, leaving the original file intact.
@@ -50,7 +50,7 @@ $ csv2avro --help
  Version 1.0.1 of CSV2Avro
  Usage: csv2avro [options] [file]
  -s, --schema SCHEMA A file containing the Avro schema. This value is required.
- -b, --bad-rows [BAD_ROWS] The output location of the bad rows file.
+ -b, --bad-rows [BAD_ROWS] The output location of the bad rows report file.
  -d, --delimiter [DELIMITER] Field delimiter. If none specified, then comma is used as the delimiter.
  -a [ARRAY_DELIMITER], Array field delimiter. If none specified, then comma is used as the delimiter.
  --array-delimiter
data/bin/csv2avro CHANGED
@@ -14,7 +14,7 @@ option_parser = OptionParser.new do |opts|
      options[:schema] = path
    end
 
-   opts.on('-b', '--bad-rows [BAD_ROWS]', 'The output location of the bad rows file.') do |path|
+   opts.on('-b', '--bad-rows [BAD_ROWS]', 'The output location of the bad rows report file.') do |path|
      options[:bad_rows] = path
    end
 
@@ -35,7 +35,7 @@ option_parser = OptionParser.new do |opts|
    end
 
    opts.on('-h', '--help', 'Prints help') do
-     puts opts
+     $stderr.puts opts
      exit
    end
  end
@@ -48,11 +48,12 @@ begin
 
    CSV2Avro.new(options).convert
  rescue OptionParser::MissingArgument => ex
-   puts ex.message
-
-   puts option_parser
+   $stderr.puts ex.message
+   $stderr.puts option_parser
+   exit 2
  rescue Exception => e
-   puts 'Uh oh, something went wrong!'
-   puts e.message
-   puts e.backtrace.join("\n")
+   $stderr.puts 'Uh oh, something went wrong!'
+   $stderr.puts e.message
+   $stderr.puts e.backtrace.join("\n")
+   exit 1
  end
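Note on the hunk above: the new error handling can be sketched as a function that returns the exit status instead of calling `exit`, which makes the 1-versus-2 distinction easy to check. This is a minimal sketch under stated assumptions, not the gem's code: `run_cli` and its `err:` parameter are hypothetical names, the explicit check for a missing `--schema` is an assumption about code outside the hunk, and the sketch rescues `StandardError` where the gem rescues `Exception`.

```ruby
require 'optparse'
require 'stringio'

# Hypothetical helper: returns the exit status rather than exiting,
# so the behaviour can be exercised without terminating the process.
def run_cli(argv, err: $stderr)
  options = {}
  parser = OptionParser.new do |opts|
    opts.on('-s', '--schema SCHEMA', 'A file containing the Avro schema.') do |path|
      options[:schema] = path
    end
  end

  begin
    parser.parse!(argv.dup)
    # Assumption: a missing schema is reported as a missing argument.
    raise OptionParser::MissingArgument, '--schema' unless options[:schema]
    yield options if block_given? # stand-in for the actual conversion
    0
  rescue OptionParser::MissingArgument => ex
    err.puts ex.message
    err.puts parser # usage text goes to the error stream, not stdout
    2
  rescue StandardError => e
    err.puts 'Uh oh, something went wrong!'
    err.puts e.message
    1
  end
end

status = run_cli(['-s'], err: StringIO.new)
puts status # 2, because -s was given without a value
```

Rescuing `StandardError` rather than `Exception` leaves `SystemExit` (raised by `exit`) untouched, so a help handler that calls `exit` is not reported as a failure.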
@@ -1,6 +1,6 @@
  require 'avro'
- require 'avro_schema'
  require 'forwardable'
+ require 'csv2avro/datum_writer'
 
  class CSV2Avro
    class AvroWriter
@@ -12,7 +12,7 @@ class CSV2Avro
      def_delegators :avro_writer, :flush, :close
 
      def initialize(writer, schema)
-       datum_writer = Avro::IO::DatumWriter.new(schema.avro_schema)
+       datum_writer = CSV2Avro::DatumWriter.new(schema.avro_schema)
        @avro_writer = Avro::DataFile::Writer.new(writer, datum_writer, schema.avro_schema)
      end
 
@@ -13,7 +13,7 @@ class CSV2Avro
        @schema = schema
 
        # read header row explicitly
-       @header = @reader.readline.strip.split(col_sep)
+       @header = @reader.readline.strip.split(col_sep).map{ |col| col.gsub('"','') }
      end
 
      def convert
@@ -21,7 +21,9 @@ class CSV2Avro
          begin
            row = csv.shift
          rescue CSV::MalformedCSVError
-           @error_writer.puts("line #{line_number}: Unable to parse")
+           error_msg = "L#{row_number}: Unable to parse"
+           @error_writer.puts(error_msg)
+           @bad_rows_writer.puts(error_msg)
            next
          end
          hash = row.to_hash
@@ -31,12 +33,10 @@ class CSV2Avro
 
          begin
            @writer.write(hash)
-         rescue Avro::IO::AvroTypeError
-           bad_rows_csv << row
-
-           until Avro::Schema.errors.empty? do
-             @error_writer.puts("line #{line_number}: #{Avro::Schema.errors.shift}")
-           end
+         rescue CSV2Avro::SchemaValidationError => e
+           error_msg = "L#{row_number}: #{e.errors.join(', ')}"
+           @error_writer.puts(error_msg)
+           @bad_rows_writer.puts(error_msg)
          end
        end
        @writer.flush
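Note on the two converter hunks above: both failure paths now build one `L<row>: <causes>` line and write it to the error stream and to the bad rows report alike. A minimal sketch of that format follows; `report_bad_row` is a hypothetical helper name, and `StringIO` objects stand in for the real writers.

```ruby
require 'stringio'

# Hypothetical helper mirroring the "L<row>: <causes>" format in the hunks.
def report_bad_row(row_number, causes, error_writer, bad_rows_writer)
  error_msg = "L#{row_number}: #{causes.join(', ')}"
  error_writer.puts(error_msg)    # e.g. $stderr in the CLI
  bad_rows_writer.puts(error_msg) # e.g. the data.bad report file
  error_msg
end

error_writer = StringIO.new
bad_rows_writer = StringIO.new
report_bad_row(4, ['Missing value at name'], error_writer, bad_rows_writer)
report_bad_row(9, ['Missing value at id', 'Missing value at name'],
               error_writer, bad_rows_writer)

print bad_rows_writer.string
# L4: Missing value at name
# L9: Missing value at id, Missing value at name
```

Because the report no longer echoes the raw row, the row number is the only link back to the input; the original data has to be looked up in the source CSV.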
@@ -71,12 +71,7 @@ class CSV2Avro
        @csv ||= CSV.new(@reader, csv_options)
      end
 
-     def bad_rows_csv
-       options = csv_options.tap { |hash| hash.delete(:header_converters) }
-       @bad_rows_csv ||= CSV.new(@bad_rows_writer, options)
-     end
-
-     def line_number
+     def row_number
        @reader.lineno + 1
      end
 
@@ -0,0 +1,31 @@
+ require 'avro'
+ require 'csv2avro/schema_validator'
+
+ class CSV2Avro
+   class DatumWriter < Avro::IO::DatumWriter
+
+     attr_reader :schema_validator
+
+     def initialize(*args)
+       super
+       @schema_validator = CSV2Avro::SchemaValidator.new
+     end
+
+     def write(datum, encoder)
+       schema_validator.clear
+       if !schema_validator.validate(writers_schema, datum)
+         raise SchemaValidationError.new(schema_validator.errors)
+       end
+       super
+     end
+   end
+
+   class SchemaValidationError < StandardError
+
+     attr_reader :errors
+
+     def initialize(schema_errors)
+       @errors = schema_errors
+     end
+   end
+ end
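Note on the new file above: the writer subclass validates each datum first and raises an error object that carries every validation message, then defers to the parent `write`. The same pattern can be sketched without the Avro dependency. All names below (`BaseWriter`, `ValidatingWriter`, `ValidationError`, the required-field check) are hypothetical stand-ins, and unlike the gem's error class this one also passes a joined message to `super`.

```ruby
# Hypothetical stand-in for the gem's SchemaValidationError.
class ValidationError < StandardError
  attr_reader :errors

  def initialize(errors)
    @errors = errors
    super(errors.join(', ')) # assumption: also expose a readable #message
  end
end

# Hypothetical stand-in for Avro::IO::DatumWriter.
class BaseWriter
  attr_reader :written

  def initialize
    @written = []
  end

  def write(datum)
    @written << datum
  end
end

class ValidatingWriter < BaseWriter
  def initialize(required_fields)
    super()
    @required_fields = required_fields
  end

  # Validate first; only valid data reaches the parent writer.
  def write(datum)
    errors = @required_fields.select { |f| datum[f].nil? }
                             .map { |f| "Missing value at #{f}" }
    raise ValidationError.new(errors) unless errors.empty?
    super
  end
end

writer = ValidatingWriter.new(%w[id name])
writer.write('id' => 1, 'name' => 'dresses')
begin
  writer.write('id' => 3, 'name' => nil)
rescue ValidationError => e
  puts e.errors.inspect # ["Missing value at name"]
end
```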
@@ -1,12 +1,19 @@
- module Avro
-   class Schema
-     @errors = []
+ require 'avro/schema'
 
-     class << self
-       attr_accessor :errors
+ class CSV2Avro
+   class SchemaValidator
+
+     attr_reader :errors
+
+     def initialize
+       @errors = []
+     end
+
+     def clear
+       @errors.clear
      end
 
-     def self.validate(expected_schema, datum, name=nil, suppress_error=false)
+     def validate(expected_schema, datum, name=nil, suppress_error=false)
        expected_type = expected_schema.type_sym
 
        valid = case expected_type
@@ -18,10 +25,10 @@ module Avro
          datum.is_a? String
        when :int
          (datum.is_a?(Fixnum) || datum.is_a?(Bignum)) &&
-             (INT_MIN_VALUE <= datum) && (datum <= INT_MAX_VALUE)
+             (Avro::Schema::INT_MIN_VALUE <= datum) && (datum <= Avro::Schema::INT_MAX_VALUE)
        when :long
          (datum.is_a?(Fixnum) || datum.is_a?(Bignum)) &&
-             (LONG_MIN_VALUE <= datum) && (datum <= LONG_MAX_VALUE)
+             (Avro::Schema::LONG_MIN_VALUE <= datum) && (datum <= Avro::Schema::LONG_MAX_VALUE)
        when :float, :double
          datum.is_a?(Float) || datum.is_a?(Fixnum) || datum.is_a?(Bignum)
        when :fixed
@@ -38,12 +45,14 @@ module Avro
          expected_schema.schemas.any?{|s| validate(s, datum, nil, true) }
        when :record, :error, :request
          datum.is_a?(Hash) &&
-             expected_schema.fields.all?{|f| validate(f.type, datum[f.name], f.name) }
+             expected_schema.fields.reduce(true){|result, f|
+               validate_result = validate(f.type, datum[f.name], f.name)
+               result && validate_result }
        else
          false
        end
 
-       if !suppress_error && !valid && name
+       if !valid && name
          if datum.nil? && expected_type != :null
            @errors << "Missing value at #{name}"
          else
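Note on the `all?` to `reduce` change in the hunk above: `all?` stops at the first field that fails, so only one error is recorded per row, while the `reduce` form validates every field before combining the results. A minimal sketch of the difference, with hypothetical helper names and a plain "missing value" check standing in for the schema validation:

```ruby
# Hypothetical field check: records an error and returns false on nil.
def check_field(field, row, errors)
  return true unless row[field].nil?
  errors << "Missing value at #{field}"
  false
end

# Short-circuits: stops at the first invalid field.
def errors_with_all(fields, row)
  errors = []
  fields.all? { |f| check_field(f, row, errors) }
  errors
end

# Visits every field, then combines the results.
def errors_with_reduce(fields, row)
  errors = []
  fields.reduce(true) do |result, f|
    field_ok = check_field(f, row, errors)
    result && field_ok
  end
  errors
end

row = { 'id' => nil, 'name' => nil, 'description' => 'Male Shoes' }
fields = %w[id name description]

p errors_with_all(fields, row)    # ["Missing value at id"]
p errors_with_reduce(fields, row) # ["Missing value at id", "Missing value at name"]
```

The order inside the block matters: the field is checked before `result &&` is evaluated, otherwise `&&` would short-circuit again after the first failure.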
@@ -1,3 +1,3 @@
  class CSV2Avro
-   VERSION = "1.0.2"
+   VERSION = "1.1.0"
  end
data/lib/csv2avro.rb CHANGED
@@ -68,6 +68,6 @@ class CSV2Avro
      ext = File.extname(input_path)
      name = File.basename(input_path, ext)
 
-     "#{dir}/#{name}.bad#{ext}"
+     "#{dir}/#{name}.bad"
    end
  end
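Note on the hunk above: the default report path now drops the input extension, so `data.csv` yields `data.bad` rather than `data.bad.csv`. A minimal sketch of the derivation follows. The method name `bad_rows_location` is hypothetical, and `dir = File.dirname(input_path)` is an assumption, since the line that sets `dir` falls outside the hunk.

```ruby
# Hypothetical method name; `dir` is assumed to come from File.dirname.
def bad_rows_location(input_path)
  dir = File.dirname(input_path)
  ext = File.extname(input_path)
  name = File.basename(input_path, ext)

  "#{dir}/#{name}.bad"
end

puts bad_rows_location('./spec/support/data.csv')
# ./spec/support/data.bad
```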
@@ -351,15 +351,15 @@ RSpec.describe CSV2Avro::Converter do
        CSV2Avro::Converter.new(reader, avro_writer, bad_rows_writer, error_writer, { delimiter: "\t" }, schema: schema).convert
      end
 
-     it 'should have the bad data in the original form' do
+     it 'should report the bad rows correctly' do
        expect(bad_rows_writer.string).to eq(
-         "id\ttitle\tdescription\n1\t\tdresses\n4\t\tfemale-shoes\n"
+         "L2: Missing value at name\nL5: Missing value at name\n"
        )
      end
 
      it 'should have an error' do
        expect(error_writer.string).to eq(
-         "line 2: Missing value at name\nline 5: Missing value at name\n"
+         "L2: Missing value at name\nL5: Missing value at name\n"
        )
      end
 
@@ -3,32 +3,66 @@ require 'spec_helper'
  RSpec.describe CSV2Avro do
    describe '#convert' do
      let(:options) { { schema: './spec/support/schema.avsc' } }
-
-     before do
-       ARGV.replace ['./spec/support/data.csv']
+     subject(:converter) do
+       CSV2Avro.new(options)
      end
-     subject(:converter) { CSV2Avro.new(options) }
 
-     it 'should write errors to STDERR' do
-       expect { converter.convert }.to output("line 4: Missing value at name\nline 7: Unable to parse\n").to_stderr
-     end
+     context "Unquoted header" do
+       before do
+         ARGV.replace ['./spec/support/data.csv']
+       end
+
+       bad_rows_output = "L4: Missing value at name\nL7: Unable to parse\nL9: Missing value at id, Missing value at name\nL10: 'male-shoes' at id doesn't match the type '\"int\"', Missing value at name\n"
+       it 'should write errors to STDERR' do
+         expect { converter.convert }.to output(bad_rows_output).to_stderr
+       end
+
+       it 'should have bad rows' do
+         File.open('./spec/support/data.bad', 'r') do |file|
+           expect(file.read).to eq(bad_rows_output)
+         end
+       end
 
-     it 'should have a bad row' do
-       File.open('./spec/support/data.bad.csv', 'r') do |file|
-         expect(file.read).to eq("id,name,description\n3,,Bras\n")
+       it 'should contain the avro data' do
+         File.open('./spec/support/data.avro', 'r') do |file|
+           expect(AvroReader.new(file).read).to eq(
+             [
+               { 'id'=>1, 'name'=>'dresses', 'description'=>'Dresses' },
+               { 'id'=>2, 'name'=>'female-tops', 'description'=>nil },
+               { 'id'=>4, 'name'=>'male-tops', 'description'=>"Male Tops\nand Male Shirts"},
+               { 'id'=>6, 'name'=>'male-shoes', 'description'=>'Male Shoes'}
+             ]
+           )
+         end
        end
      end
 
-     it 'should contain the avro data' do
-       File.open('./spec/support/data.avro', 'r') do |file|
-         expect(AvroReader.new(file).read).to eq(
-           [
-             { 'id'=>1, 'name'=>'dresses', 'description'=>'Dresses' },
-             { 'id'=>2, 'name'=>'female-tops', 'description'=>nil },
-             { 'id'=>4, 'name'=>'male-tops', 'description'=>"Male Tops\nand Male Shirts"},
-             { 'id'=>6, 'name'=>'male-shoes', 'description'=>'Male Shoes'}
-           ]
-         )
+     context "Quoted header" do
+       before do
+         ARGV.replace ['./spec/support/data_quoted.csv']
+       end
+
+       it 'should write errors to STDERR' do
+         expect { converter.convert }.to output("L4: Missing value at name\nL7: Unable to parse\n").to_stderr
+       end
+
+       it 'should have a bad row' do
+         File.open('./spec/support/data_quoted.bad', 'r') do |file|
+           expect(file.read).to eq("L4: Missing value at name\nL7: Unable to parse\n")
+         end
+       end
+
+       it 'should contain the avro data' do
+         File.open('./spec/support/data_quoted.avro', 'r') do |file|
+           expect(AvroReader.new(file).read).to eq(
+             [
+               { 'id'=>1, 'name'=>'dresses', 'description'=>'Dresses' },
+               { 'id'=>2, 'name'=>'female-tops', 'description'=>nil },
+               { 'id'=>4, 'name'=>'male-tops', 'description'=>"Male Tops\nand Male Shirts"},
+               { 'id'=>6, 'name'=>'male-shoes', 'description'=>'Male Shoes'}
+             ]
+           )
+         end
        end
      end
    end
@@ -6,3 +6,5 @@ id,name,description
  and Male Shirts"
  "5","female-shoes","\"Shoes\" for women"
  "6","male-shoes","Male Shoes"
+ ,,"Male Shoes"
+ "male-shoes",,"Male Shoes"
@@ -0,0 +1,8 @@
+ "id","name","description"
+ 1,dresses,Dresses
+ "2","female-tops",
+ "3",,"Bras"
+ "4","male-tops","Male Tops
+ and Male Shirts"
+ "5","female-shoes","\"Shoes\" for women"
+ "6","male-shoes","Male Shoes"
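Note on the quoted-header fixture above: the converter change in this release strips double quotes from each header cell after splitting on the delimiter. A minimal sketch of that approach follows, next to an alternative that parses the header with the standard `csv` library. The alternative is not what the gem does; it is shown only because splitting first breaks when a quoted header contains the delimiter. Both method names are hypothetical.

```ruby
require 'csv'

# Mirrors the approach in the diff: split, then drop double quotes.
def header_by_split(line, col_sep = ',')
  line.strip.split(col_sep).map { |col| col.gsub('"', '') }
end

# Alternative (not the gem's code): let the CSV parser handle quoting.
def header_by_csv(line, col_sep = ',')
  CSV.parse_line(line.strip, col_sep: col_sep)
end

p header_by_split('"id","name","description"')
# ["id", "name", "description"]

# A delimiter inside a quoted header shows the difference.
p header_by_split('"id","first,last"') # ["id", "first", "last"]
p header_by_csv('"id","first,last"')   # ["id", "first,last"]
```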
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: csv2avro
  version: !ruby/object:Gem::Version
-   version: 1.0.2
+   version: 1.1.0
  platform: ruby
  authors:
  - Peter Ableda
@@ -9,7 +9,7 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2015-06-30 00:00:00.000000000 Z
+ date: 2015-09-16 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: bundler
@@ -113,11 +113,12 @@ files:
  - Rakefile
  - bin/csv2avro
  - csv2avro.gemspec
- - lib/avro_schema.rb
  - lib/csv2avro.rb
  - lib/csv2avro/avro_writer.rb
  - lib/csv2avro/converter.rb
+ - lib/csv2avro/datum_writer.rb
  - lib/csv2avro/schema.rb
+ - lib/csv2avro/schema_validator.rb
  - lib/csv2avro/version.rb
  - spec/csv2avro/converter_spec.rb
  - spec/csv2avro/schema_spec.rb
@@ -125,6 +126,7 @@ files:
  - spec/spec_helper.rb
  - spec/support/avro_reader.rb
  - spec/support/data.csv
+ - spec/support/data_quoted.csv
  - spec/support/schema.avsc
  homepage: ''
  licenses:
@@ -157,5 +159,6 @@ test_files:
  - spec/spec_helper.rb
  - spec/support/avro_reader.rb
  - spec/support/data.csv
+ - spec/support/data_quoted.csv
  - spec/support/schema.avsc
  has_rdoc: