csv2avro 1.0.2 → 1.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 3dd5ca9e046d1e46a845614350ad1e0e2dcdd8a6
4
- data.tar.gz: 0c7eefec78b9293b6f1ed954fb22b5ec6b6506ba
3
+ metadata.gz: 6e7e9a8d86d5cd8e85b957ffecb5e60ccfe9c8b5
4
+ data.tar.gz: 2b12a6828c601dfe19e6d93bc39b481ec1e15118
5
5
  SHA512:
6
- metadata.gz: abbed521cda772e04453d95ec559f37e1e45b84f1fbe46e600e5a1ec0333abdbf28978b31946a451ab87701de4af089787f68b316262152bb00f6d9ceefe1048
7
- data.tar.gz: 683f680b56c5c06f0217ca5c6f2ff384b228435175327d746545601636844755a5832cf904c420f004bcc4a0cbb4d751eaf1ddeefe4eeeaa334c6418b8633a02
6
+ metadata.gz: 1caef810f21aa9f9b8dd1562c253a967f5b1c94382296f677dd6703624ac51d3d13774457c76820bf52c4a88795829d28a1b46017384ab306abf9f62bb50a078
7
+ data.tar.gz: cf9f67c9316d2840f883a36082a30187ae8aebbe24db71a4100496ea833b20568b9ec16f4e92258b38705b7334f07b78b52766db78245cc81c2a32e3c6244d95
data/CHANGELOG.md CHANGED
@@ -3,17 +3,27 @@
3
3
  All notable changes to this project are documented in this file.
4
4
  This project adheres to [Semantic Versioning](http://semver.org/).
5
5
 
6
- ## 1.0.2 (2015-06-29; [compare](https://github.com/sspinc/csv2avro/compare/1.0.1...1.0.2))
6
+ ## 1.1.0 (2015-09-16) [compare](https://github.com/sspinc/csv2avro/compare/1.0.2...1.1.0))
7
+
8
+ ### Changed
9
+ * Write usage and error messages to stderr
10
+ * Exit code 1 for general errors, 2 for missing arguments
11
+ * Bad rows report with error causes instead of bad rows csv
12
+
13
+ ### Fixed
14
+ * Handle quoted headers
15
+
16
+ ## 1.0.2 (2015-06-29) [compare](https://github.com/sspinc/csv2avro/compare/1.0.1...1.0.2))
7
17
 
8
18
  ### Fixed
9
19
  * Continue on parsing errors
10
20
 
11
- ## 1.0.1 (2015-06-12; [compare](https://github.com/sspinc/csv2avro/compare/1.0.0...1.0.1))
21
+ ## 1.0.1 (2015-06-12) [compare](https://github.com/sspinc/csv2avro/compare/1.0.0...1.0.1))
12
22
 
13
23
  ### Fixed
14
24
  * CSV parsing issues
15
25
 
16
- ## 1.0.0 (2015-06-05; [compare](https://github.com/sspinc/csv2avro/compare/0.4.0...1.0.0))
26
+ ## 1.0.0 (2015-06-05) [compare](https://github.com/sspinc/csv2avro/compare/0.4.0...1.0.0))
17
27
 
18
28
  ### Added
19
29
  * Usage description to readme
@@ -23,7 +33,7 @@ This project adheres to [Semantic Versioning](http://semver.org/).
23
33
  ### Fixed
24
34
  * Docker image entrypoint
25
35
 
26
- ## 0.4.0 (2015-05-07; [compare](https://github.com/sspinc/csv2avro/compare/0.3.0...0.4.0))
36
+ ## 0.4.0 (2015-05-07) [compare](https://github.com/sspinc/csv2avro/compare/0.3.0...0.4.0))
27
37
 
28
38
  ### Added
29
39
  * Streaming support (#7)
@@ -38,7 +48,7 @@ This project adheres to [Semantic Versioning](http://semver.org/).
38
48
  ### Fixed
39
49
  * Build project into Docker image (#9)
40
50
 
41
- ## 0.3.0 (2015-04-28; [compare](https://github.com/sspinc/csv2avro/compare/0.1.0...0.3.0))
51
+ ## 0.3.0 (2015-04-28) [compare](https://github.com/sspinc/csv2avro/compare/0.1.0...0.3.0))
42
52
 
43
53
  ### Added
44
54
  * Docker support (#6)
data/README.md CHANGED
@@ -14,13 +14,13 @@ or if you prefer to live on the edge, just clone this repository and build it fr
14
14
  ```
15
15
  $ csv2avro --schema ./spec/support/schema.avsc ./spec/support/data.csv
16
16
  ```
17
- This will process the data.csv file and creates a *data.avro* file and a *data.bad.csv* file with the bad rows.
17
+ This will process the data.csv file and creates a *data.avro* file and a *data.bad* file with a report of the bad rows.
18
18
 
19
- You can override the bad-rows file location with the `--bad-rows [BAD_ROWS]` option.
19
+ You can override the bad rows report file location with the `--bad-rows [BAD_ROWS]` option.
20
20
 
21
21
  ### Streaming
22
22
  ```
23
- $ cat ./spec/support/data.csv | csv2avro --schema ./spec/support/schema.avsc --bad-rows ./spec/support/data.bad.csv > ./spec/support/data.avro
23
+ $ cat ./spec/support/data.csv | csv2avro --schema ./spec/support/schema.avsc --bad-rows ./spec/support/data.bad > ./spec/support/data.avro
24
24
  ```
25
25
  This will process the *input stream* and push the avro data to the *output stream*. If you're working with streams you will need to specify the `--bad-rows` location.
26
26
 
@@ -29,7 +29,7 @@ This will process the *input stream* and push the avro data to the *output strea
29
29
  #### AWS S3 storage
30
30
 
31
31
  ```
32
- aws s3 cp s3://csv-bucket/transactions.csv - | csv2avro --schema ./transactions.avsc --bad-rows ./transactions.bad.csv | aws s3 cp - s3://avro-bucket/transactions.avro
32
+ aws s3 cp s3://csv-bucket/transactions.csv - | csv2avro --schema ./transactions.avsc --bad-rows ./transactions.bad | aws s3 cp - s3://avro-bucket/transactions.avro
33
33
  ```
34
34
 
35
35
  This will stream your file stored in AWS S3, converts the data and pushes it back to S3. For more information, please check the [AWS CLI documentation](http://docs.aws.amazon.com/cli/latest/reference/s3/index.html).
@@ -37,7 +37,7 @@ This will stream your file stored in AWS S3, converts the data and pushes it bac
37
37
  #### Convert compressed files
38
38
 
39
39
  ```
40
- gunzip -c ./spec/support/data.csv.gz | csv2avro --schema ./spec/support/schema.avsc --bad-rows ./spec/support/data.bad.csv > ./spec/support/data.avro
40
+ gunzip -c ./spec/support/data.csv.gz | csv2avro --schema ./spec/support/schema.avsc --bad-rows ./spec/support/data.bad > ./spec/support/data.avro
41
41
  ```
42
42
 
43
43
  This will uncompress the file and converts it to avro, leaving the original file intact.
@@ -50,7 +50,7 @@ $ csv2avro --help
50
50
  Version 1.0.1 of CSV2Avro
51
51
  Usage: csv2avro [options] [file]
52
52
  -s, --schema SCHEMA A file containing the Avro schema. This value is required.
53
- -b, --bad-rows [BAD_ROWS] The output location of the bad rows file.
53
+ -b, --bad-rows [BAD_ROWS] The output location of the bad rows report file.
54
54
  -d, --delimiter [DELIMITER] Field delimiter. If none specified, then comma is used as the delimiter.
55
55
  -a [ARRAY_DELIMITER], Array field delimiter. If none specified, then comma is used as the delimiter.
56
56
  --array-delimiter
data/bin/csv2avro CHANGED
@@ -14,7 +14,7 @@ option_parser = OptionParser.new do |opts|
14
14
  options[:schema] = path
15
15
  end
16
16
 
17
- opts.on('-b', '--bad-rows [BAD_ROWS]', 'The output location of the bad rows file.') do |path|
17
+ opts.on('-b', '--bad-rows [BAD_ROWS]', 'The output location of the bad rows report file.') do |path|
18
18
  options[:bad_rows] = path
19
19
  end
20
20
 
@@ -35,7 +35,7 @@ option_parser = OptionParser.new do |opts|
35
35
  end
36
36
 
37
37
  opts.on('-h', '--help', 'Prints help') do
38
- puts opts
38
+ $stderr.puts opts
39
39
  exit
40
40
  end
41
41
  end
@@ -48,11 +48,12 @@ begin
48
48
 
49
49
  CSV2Avro.new(options).convert
50
50
  rescue OptionParser::MissingArgument => ex
51
- puts ex.message
52
-
53
- puts option_parser
51
+ $stderr.puts ex.message
52
+ $stderr.puts option_parser
53
+ exit 2
54
54
  rescue Exception => e
55
- puts 'Uh oh, something went wrong!'
56
- puts e.message
57
- puts e.backtrace.join("\n")
55
+ $stderr.puts 'Uh oh, something went wrong!'
56
+ $stderr.puts e.message
57
+ $stderr.puts e.backtrace.join("\n")
58
+ exit 1
58
59
  end
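The exit-code convention in this change (2 for a missing argument, 1 for any other error, messages on stderr) can be tried in isolation. The following is a minimal sketch, not the gem's code: `parse_cli` is a hypothetical helper that returns the exit code instead of calling `exit`, so the mapping is easy to check.

```ruby
require 'optparse'

# Hypothetical helper (not part of csv2avro): returns [exit_code, detail]
# instead of exiting, mirroring the 2 / 1 convention described in the changelog.
def parse_cli(argv)
  options = {}
  parser = OptionParser.new do |opts|
    opts.on('-s', '--schema SCHEMA', 'A file containing the Avro schema.') do |path|
      options[:schema] = path
    end
  end
  parser.parse(argv)
  [0, options]
rescue OptionParser::MissingArgument => ex
  [2, ex.message] # missing argument
rescue StandardError => e
  [1, e.message]  # any other error
end

code, detail = parse_cli(['--schema'])
# A real script would now print to stderr and call `exit code`.
$stderr.puts detail unless code.zero?
```

The sketch rescues `StandardError` rather than `Exception`, so a `SystemExit` raised by `exit` would not be caught by the general handler. That is a choice made here for illustration, not what the diff above does.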
data/lib/csv2avro/avro_writer.rb CHANGED
@@ -1,6 +1,6 @@
1
1
  require 'avro'
2
- require 'avro_schema'
3
2
  require 'forwardable'
3
+ require 'csv2avro/datum_writer'
4
4
 
5
5
  class CSV2Avro
6
6
  class AvroWriter
@@ -12,7 +12,7 @@ class CSV2Avro
12
12
  def_delegators :avro_writer, :flush, :close
13
13
 
14
14
  def initialize(writer, schema)
15
- datum_writer = Avro::IO::DatumWriter.new(schema.avro_schema)
15
+ datum_writer = CSV2Avro::DatumWriter.new(schema.avro_schema)
16
16
  @avro_writer = Avro::DataFile::Writer.new(writer, datum_writer, schema.avro_schema)
17
17
  end
18
18
 
data/lib/csv2avro/converter.rb CHANGED
@@ -13,7 +13,7 @@ class CSV2Avro
13
13
  @schema = schema
14
14
 
15
15
  # read header row explicitly
16
- @header = @reader.readline.strip.split(col_sep)
16
+ @header = @reader.readline.strip.split(col_sep).map{ |col| col.gsub('"','') }
17
17
  end
18
18
 
19
19
  def convert
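The header fix in this hunk removes double quotes after splitting the first line on the separator. A minimal sketch of that one step in isolation, assuming a hypothetical helper name `read_header` and an in-memory reader:

```ruby
require 'stringio'

# Hypothetical helper: reads the first line and returns unquoted column names,
# using the same split-then-gsub approach as the changed line.
def read_header(reader, col_sep = ',')
  reader.readline.strip.split(col_sep).map { |col| col.gsub('"', '') }
end

quoted   = StringIO.new(%("id","name","description"\n1,dresses,Dresses\n))
unquoted = StringIO.new("id,name,description\n1,dresses,Dresses\n")

read_header(quoted)   # => ["id", "name", "description"]
read_header(unquoted) # => ["id", "name", "description"]
```

Because the line is split before the quotes are removed, a quoted header name that itself contains the separator would still be split in two under this approach.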
@@ -21,7 +21,9 @@ class CSV2Avro
21
21
  begin
22
22
  row = csv.shift
23
23
  rescue CSV::MalformedCSVError
24
- @error_writer.puts("line #{line_number}: Unable to parse")
24
+ error_msg = "L#{row_number}: Unable to parse"
25
+ @error_writer.puts(error_msg)
26
+ @bad_rows_writer.puts(error_msg)
25
27
  next
26
28
  end
27
29
  hash = row.to_hash
@@ -31,12 +33,10 @@ class CSV2Avro
31
33
 
32
34
  begin
33
35
  @writer.write(hash)
34
- rescue Avro::IO::AvroTypeError
35
- bad_rows_csv << row
36
-
37
- until Avro::Schema.errors.empty? do
38
- @error_writer.puts("line #{line_number}: #{Avro::Schema.errors.shift}")
39
- end
36
+ rescue CSV2Avro::SchemaValidationError => e
37
+ error_msg = "L#{row_number}: #{e.errors.join(', ')}"
38
+ @error_writer.puts(error_msg)
39
+ @bad_rows_writer.puts(error_msg)
40
40
  end
41
41
  end
42
42
  @writer.flush
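After this change the same `L<row>: <causes>` line goes to both the error stream and the bad rows report. A small sketch of that formatting, assuming a hypothetical helper `report_bad_row` and `StringIO` objects in place of the real writers:

```ruby
require 'stringio'

# Hypothetical helper: builds one report line and writes it to every writer given.
def report_bad_row(row_number, errors, *writers)
  message = "L#{row_number}: #{errors.join(', ')}"
  writers.each { |writer| writer.puts(message) }
  message
end

error_writer    = StringIO.new
bad_rows_writer = StringIO.new

report_bad_row(9, ['Missing value at id', 'Missing value at name'],
               error_writer, bad_rows_writer)

bad_rows_writer.string # => "L9: Missing value at id, Missing value at name\n"
```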
@@ -71,12 +71,7 @@ class CSV2Avro
71
71
  @csv ||= CSV.new(@reader, csv_options)
72
72
  end
73
73
 
74
- def bad_rows_csv
75
- options = csv_options.tap { |hash| hash.delete(:header_converters) }
76
- @bad_rows_csv ||= CSV.new(@bad_rows_writer, options)
77
- end
78
-
79
- def line_number
74
+ def row_number
80
75
  @reader.lineno + 1
81
76
  end
82
77
 
data/lib/csv2avro/datum_writer.rb ADDED
@@ -0,0 +1,31 @@
1
+ require 'avro'
2
+ require 'csv2avro/schema_validator'
3
+
4
+ class CSV2Avro
5
+ class DatumWriter < Avro::IO::DatumWriter
6
+
7
+ attr_reader :schema_validator
8
+
9
+ def initialize(*args)
10
+ super
11
+ @schema_validator = CSV2Avro::SchemaValidator.new
12
+ end
13
+
14
+ def write(datum, encoder)
15
+ schema_validator.clear
16
+ if !schema_validator.validate(writers_schema, datum)
17
+ raise SchemaValidationError.new(schema_validator.errors)
18
+ end
19
+ super
20
+ end
21
+ end
22
+
23
+ class SchemaValidationError < StandardError
24
+
25
+ attr_reader :errors
26
+
27
+ def initialize(schema_errors)
28
+ @errors = schema_errors
29
+ end
30
+ end
31
+ end
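The new writer validates a datum before delegating to the parent class, and raises an error that carries the collected causes. The pattern can be sketched without the avro gem. All class names below are hypothetical stand-ins, and the required-field check is a simplification of real schema validation:

```ruby
# Hypothetical stand-in for Avro::IO::DatumWriter; it only records what it is given.
class PlainWriter
  attr_reader :written

  def initialize
    @written = []
  end

  def write(datum)
    @written << datum
  end
end

# Hypothetical error class. Unlike the one in the diff, it passes a message to
# super, so e.message is the joined causes rather than the class name.
class InvalidDatumError < StandardError
  attr_reader :errors

  def initialize(errors)
    @errors = errors
    super(errors.join(', '))
  end
end

class ValidatingWriter < PlainWriter
  REQUIRED = %w[id name].freeze # assumed required fields, for illustration only

  def write(datum)
    errors = REQUIRED.select { |f| datum[f].nil? }.map { |f| "Missing value at #{f}" }
    raise InvalidDatumError.new(errors) unless errors.empty?
    super # zero-argument super forwards the original arguments
  end
end

writer = ValidatingWriter.new
writer.write({ 'id' => 1, 'name' => 'dresses' })

caught = nil
begin
  writer.write({ 'id' => 3, 'name' => nil })
rescue InvalidDatumError => e
  caught = e
end
caught.message # => "Missing value at name"
```

Only the valid datum reaches the parent `write`, so the invalid row never ends up in the output.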
@@ -1,12 +1,19 @@
1
- module Avro
2
- class Schema
3
- @errors = []
1
+ require 'avro/schema'
4
2
 
5
- class << self
6
- attr_accessor :errors
3
+ class CSV2Avro
4
+ class SchemaValidator
5
+
6
+ attr_reader :errors
7
+
8
+ def initialize
9
+ @errors = []
10
+ end
11
+
12
+ def clear
13
+ @errors.clear
7
14
  end
8
15
 
9
- def self.validate(expected_schema, datum, name=nil, suppress_error=false)
16
+ def validate(expected_schema, datum, name=nil, suppress_error=false)
10
17
  expected_type = expected_schema.type_sym
11
18
 
12
19
  valid = case expected_type
@@ -18,10 +25,10 @@ module Avro
18
25
  datum.is_a? String
19
26
  when :int
20
27
  (datum.is_a?(Fixnum) || datum.is_a?(Bignum)) &&
21
- (INT_MIN_VALUE <= datum) && (datum <= INT_MAX_VALUE)
28
+ (Avro::Schema::INT_MIN_VALUE <= datum) && (datum <= Avro::Schema::INT_MAX_VALUE)
22
29
  when :long
23
30
  (datum.is_a?(Fixnum) || datum.is_a?(Bignum)) &&
24
- (LONG_MIN_VALUE <= datum) && (datum <= LONG_MAX_VALUE)
31
+ (Avro::Schema::LONG_MIN_VALUE <= datum) && (datum <= Avro::Schema::LONG_MAX_VALUE)
25
32
  when :float, :double
26
33
  datum.is_a?(Float) || datum.is_a?(Fixnum) || datum.is_a?(Bignum)
27
34
  when :fixed
@@ -38,12 +45,14 @@ module Avro
38
45
  expected_schema.schemas.any?{|s| validate(s, datum, nil, true) }
39
46
  when :record, :error, :request
40
47
  datum.is_a?(Hash) &&
41
- expected_schema.fields.all?{|f| validate(f.type, datum[f.name], f.name) }
48
+ expected_schema.fields.reduce(true){|result, f|
49
+ validate_result = validate(f.type, datum[f.name], f.name)
50
+ result && validate_result }
42
51
  else
43
52
  false
44
53
  end
45
54
 
46
- if !suppress_error && !valid && name
55
+ if !valid && name
47
56
  if datum.nil? && expected_type != :null
48
57
  @errors << "Missing value at #{name}"
49
58
  else
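The switch from `all?` to `reduce(true)` in the record branch matters because `all?` stops at the first failing field, so only one cause would be recorded per row. A minimal sketch of the difference, using a hypothetical `check` helper in place of the recursive `validate`:

```ruby
REQUIRED = %w[id name].freeze # assumed field names, for illustration only

# Hypothetical check: records an error and returns false when the field is missing.
def check(field, datum, errors)
  return true unless datum[field].nil?
  errors << "Missing value at #{field}"
  false
end

# all? short-circuits: the second field is never checked.
short = []
REQUIRED.all? { |f| check(f, {}, short) }
short # => ["Missing value at id"]

# reduce visits every field, so every cause is collected.
full = []
REQUIRED.reduce(true) { |ok, f| r = check(f, {}, full); ok && r }
full # => ["Missing value at id", "Missing value at name"]
```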
data/lib/csv2avro/version.rb CHANGED
@@ -1,3 +1,3 @@
1
1
  class CSV2Avro
2
- VERSION = "1.0.2"
2
+ VERSION = "1.1.0"
3
3
  end
data/lib/csv2avro.rb CHANGED
@@ -68,6 +68,6 @@ class CSV2Avro
68
68
  ext = File.extname(input_path)
69
69
  name = File.basename(input_path, ext)
70
70
 
71
- "#{dir}/#{name}.bad#{ext}"
71
+ "#{dir}/#{name}.bad"
72
72
  end
73
73
  end
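With this change the default report path no longer keeps the input extension. A sketch of the derivation, assuming `dir` comes from `File.dirname` (that line is outside the hunk shown) and using a hypothetical method name:

```ruby
# Hypothetical method name; the body follows the lines visible in the hunk.
def default_bad_rows_path(input_path)
  dir  = File.dirname(input_path) # assumption: not shown in the diff
  ext  = File.extname(input_path)
  name = File.basename(input_path, ext)

  "#{dir}/#{name}.bad"
end

default_bad_rows_path('./spec/support/data.csv') # => "./spec/support/data.bad"
```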
data/spec/csv2avro/converter_spec.rb CHANGED
@@ -351,15 +351,15 @@ RSpec.describe CSV2Avro::Converter do
351
351
  CSV2Avro::Converter.new(reader, avro_writer, bad_rows_writer, error_writer, { delimiter: "\t" }, schema: schema).convert
352
352
  end
353
353
 
354
- it 'should have the bad data in the original form' do
354
+ it 'should report the bad rows correctly' do
355
355
  expect(bad_rows_writer.string).to eq(
356
- "id\ttitle\tdescription\n1\t\tdresses\n4\t\tfemale-shoes\n"
356
+ "L2: Missing value at name\nL5: Missing value at name\n"
357
357
  )
358
358
  end
359
359
 
360
360
  it 'should have an error' do
361
361
  expect(error_writer.string).to eq(
362
- "line 2: Missing value at name\nline 5: Missing value at name\n"
362
+ "L2: Missing value at name\nL5: Missing value at name\n"
363
363
  )
364
364
  end
365
365
 
@@ -3,32 +3,66 @@ require 'spec_helper'
3
3
  RSpec.describe CSV2Avro do
4
4
  describe '#convert' do
5
5
  let(:options) { { schema: './spec/support/schema.avsc' } }
6
-
7
- before do
8
- ARGV.replace ['./spec/support/data.csv']
6
+ subject(:converter) do
7
+ CSV2Avro.new(options)
9
8
  end
10
- subject(:converter) { CSV2Avro.new(options) }
11
9
 
12
- it 'should write errors to STDERR' do
13
- expect { converter.convert }.to output("line 4: Missing value at name\nline 7: Unable to parse\n").to_stderr
14
- end
10
+ context "Unquoted header" do
11
+ before do
12
+ ARGV.replace ['./spec/support/data.csv']
13
+ end
14
+
15
+ bad_rows_output = "L4: Missing value at name\nL7: Unable to parse\nL9: Missing value at id, Missing value at name\nL10: 'male-shoes' at id doesn't match the type '\"int\"', Missing value at name\n"
16
+ it 'should write errors to STDERR' do
17
+ expect { converter.convert }.to output(bad_rows_output).to_stderr
18
+ end
19
+
20
+ it 'should have bad rows' do
21
+ File.open('./spec/support/data.bad', 'r') do |file|
22
+ expect(file.read).to eq(bad_rows_output)
23
+ end
24
+ end
15
25
 
16
- it 'should have a bad row' do
17
- File.open('./spec/support/data.bad.csv', 'r') do |file|
18
- expect(file.read).to eq("id,name,description\n3,,Bras\n")
26
+ it 'should contain the avro data' do
27
+ File.open('./spec/support/data.avro', 'r') do |file|
28
+ expect(AvroReader.new(file).read).to eq(
29
+ [
30
+ { 'id'=>1, 'name'=>'dresses', 'description'=>'Dresses' },
31
+ { 'id'=>2, 'name'=>'female-tops', 'description'=>nil },
32
+ { 'id'=>4, 'name'=>'male-tops', 'description'=>"Male Tops\nand Male Shirts"},
33
+ { 'id'=>6, 'name'=>'male-shoes', 'description'=>'Male Shoes'}
34
+ ]
35
+ )
36
+ end
19
37
  end
20
38
  end
21
39
 
22
- it 'should contain the avro data' do
23
- File.open('./spec/support/data.avro', 'r') do |file|
24
- expect(AvroReader.new(file).read).to eq(
25
- [
26
- { 'id'=>1, 'name'=>'dresses', 'description'=>'Dresses' },
27
- { 'id'=>2, 'name'=>'female-tops', 'description'=>nil },
28
- { 'id'=>4, 'name'=>'male-tops', 'description'=>"Male Tops\nand Male Shirts"},
29
- { 'id'=>6, 'name'=>'male-shoes', 'description'=>'Male Shoes'}
30
- ]
31
- )
40
+ context "Quoted header" do
41
+ before do
42
+ ARGV.replace ['./spec/support/data_quoted.csv']
43
+ end
44
+
45
+ it 'should write errors to STDERR' do
46
+ expect { converter.convert }.to output("L4: Missing value at name\nL7: Unable to parse\n").to_stderr
47
+ end
48
+
49
+ it 'should have a bad row' do
50
+ File.open('./spec/support/data_quoted.bad', 'r') do |file|
51
+ expect(file.read).to eq("L4: Missing value at name\nL7: Unable to parse\n")
52
+ end
53
+ end
54
+
55
+ it 'should contain the avro data' do
56
+ File.open('./spec/support/data_quoted.avro', 'r') do |file|
57
+ expect(AvroReader.new(file).read).to eq(
58
+ [
59
+ { 'id'=>1, 'name'=>'dresses', 'description'=>'Dresses' },
60
+ { 'id'=>2, 'name'=>'female-tops', 'description'=>nil },
61
+ { 'id'=>4, 'name'=>'male-tops', 'description'=>"Male Tops\nand Male Shirts"},
62
+ { 'id'=>6, 'name'=>'male-shoes', 'description'=>'Male Shoes'}
63
+ ]
64
+ )
65
+ end
32
66
  end
33
67
  end
34
68
  end
data/spec/support/data.csv CHANGED
@@ -6,3 +6,5 @@ id,name,description
6
6
  and Male Shirts"
7
7
  "5","female-shoes","\"Shoes\" for women"
8
8
  "6","male-shoes","Male Shoes"
9
+ ,,"Male Shoes"
10
+ "male-shoes",,"Male Shoes"
data/spec/support/data_quoted.csv ADDED
@@ -0,0 +1,8 @@
1
+ "id","name","description"
2
+ 1,dresses,Dresses
3
+ "2","female-tops",
4
+ "3",,"Bras"
5
+ "4","male-tops","Male Tops
6
+ and Male Shirts"
7
+ "5","female-shoes","\"Shoes\" for women"
8
+ "6","male-shoes","Male Shoes"
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: csv2avro
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.0.2
4
+ version: 1.1.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Peter Ableda
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2015-06-30 00:00:00.000000000 Z
12
+ date: 2015-09-16 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: bundler
@@ -113,11 +113,12 @@ files:
113
113
  - Rakefile
114
114
  - bin/csv2avro
115
115
  - csv2avro.gemspec
116
- - lib/avro_schema.rb
117
116
  - lib/csv2avro.rb
118
117
  - lib/csv2avro/avro_writer.rb
119
118
  - lib/csv2avro/converter.rb
119
+ - lib/csv2avro/datum_writer.rb
120
120
  - lib/csv2avro/schema.rb
121
+ - lib/csv2avro/schema_validator.rb
121
122
  - lib/csv2avro/version.rb
122
123
  - spec/csv2avro/converter_spec.rb
123
124
  - spec/csv2avro/schema_spec.rb
@@ -125,6 +126,7 @@ files:
125
126
  - spec/spec_helper.rb
126
127
  - spec/support/avro_reader.rb
127
128
  - spec/support/data.csv
129
+ - spec/support/data_quoted.csv
128
130
  - spec/support/schema.avsc
129
131
  homepage: ''
130
132
  licenses:
@@ -157,5 +159,6 @@ test_files:
157
159
  - spec/spec_helper.rb
158
160
  - spec/support/avro_reader.rb
159
161
  - spec/support/data.csv
162
+ - spec/support/data_quoted.csv
160
163
  - spec/support/schema.avsc
161
164
  has_rdoc: