csvlint 0.0.1

Files changed (53)
  1. checksums.yaml +7 -0
  2. data/.coveralls.yml +1 -0
  3. data/.gitignore +22 -0
  4. data/.travis.yml +10 -0
  5. data/Gemfile +7 -0
  6. data/LICENSE.md +22 -0
  7. data/README.md +214 -0
  8. data/Rakefile +17 -0
  9. data/bin/create_schema +32 -0
  10. data/bin/csvlint +52 -0
  11. data/csvlint.gemspec +39 -0
  12. data/features/check_format.feature +46 -0
  13. data/features/csv_options.feature +35 -0
  14. data/features/fixtures/cr-line-endings.csv +1 -0
  15. data/features/fixtures/crlf-line-endings.csv +3 -0
  16. data/features/fixtures/inconsistent-line-endings.csv +2 -0
  17. data/features/fixtures/invalid-byte-sequence.csv +24 -0
  18. data/features/fixtures/lf-line-endings.csv +3 -0
  19. data/features/fixtures/spreadsheet.xls +0 -0
  20. data/features/fixtures/title-row.csv +4 -0
  21. data/features/fixtures/valid.csv +3 -0
  22. data/features/fixtures/windows-line-endings.csv +2 -0
  23. data/features/information.feature +22 -0
  24. data/features/parse_csv.feature +90 -0
  25. data/features/schema_validation.feature +63 -0
  26. data/features/sources.feature +18 -0
  27. data/features/step_definitions/csv_options_steps.rb +19 -0
  28. data/features/step_definitions/information_steps.rb +13 -0
  29. data/features/step_definitions/parse_csv_steps.rb +30 -0
  30. data/features/step_definitions/schema_validation_steps.rb +7 -0
  31. data/features/step_definitions/sources_steps.rb +7 -0
  32. data/features/step_definitions/validation_errors_steps.rb +43 -0
  33. data/features/step_definitions/validation_info_steps.rb +18 -0
  34. data/features/step_definitions/validation_warnings_steps.rb +46 -0
  35. data/features/support/env.rb +30 -0
  36. data/features/support/webmock.rb +1 -0
  37. data/features/validation_errors.feature +151 -0
  38. data/features/validation_info.feature +24 -0
  39. data/features/validation_warnings.feature +74 -0
  40. data/lib/csvlint.rb +13 -0
  41. data/lib/csvlint/error_collector.rb +43 -0
  42. data/lib/csvlint/error_message.rb +15 -0
  43. data/lib/csvlint/field.rb +102 -0
  44. data/lib/csvlint/schema.rb +69 -0
  45. data/lib/csvlint/types.rb +113 -0
  46. data/lib/csvlint/validate.rb +253 -0
  47. data/lib/csvlint/version.rb +3 -0
  48. data/lib/csvlint/wrapped_io.rb +39 -0
  49. data/spec/field_spec.rb +247 -0
  50. data/spec/schema_spec.rb +149 -0
  51. data/spec/spec_helper.rb +20 -0
  52. data/spec/validator_spec.rb +279 -0
  53. metadata +367 -0
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: a5c033caad4d4ee2536ecf2af837461a946151e1
4
+ data.tar.gz: 1c9d644f1295bbbeb2c160828d3a520f6b35999c
5
+ SHA512:
6
+ metadata.gz: cf5029278e55511925b7abad559489810d8e56555ea46afd68960af2d18db786a6f0df2afa30ba4bcd99c1ea9d149d570536e94b84d45ce9bdae837d395dc1f2
7
+ data.tar.gz: 56f23ba6e7b701af2e7ec575519d46e7b4c29bfa6403f123359b89d58c1191df65166b3739a663b1febc66d8c54f9a339394ca74a890e42ccba2fb6cb03fc4d3
@@ -0,0 +1 @@
1
+ service_name: travis-ci
@@ -0,0 +1,22 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .yardoc
6
+ Gemfile.lock
7
+ InstalledFiles
8
+ _yardoc
9
+ coverage
10
+ doc/
11
+ lib/bundler/man
12
+ pkg
13
+ rdoc
14
+ spec/reports
15
+ test/tmp
16
+ test/version_tmp
17
+ tmp
18
+ coverage/
19
+ /.rspec
20
+
21
+ .idea
22
+ .DS_Store
@@ -0,0 +1,10 @@
1
+ rvm:
2
+ - 2.0.0
3
+ notifications:
4
+ irc:
5
+ channels:
6
+ - "irc.freenode.net#theodi"
7
+ template:
8
+ - "%{repository} %{branch} - %{message} %{build_url}"
9
+ on_success: change
10
+ on_failure: always
data/Gemfile ADDED
@@ -0,0 +1,7 @@
1
+ #ruby=ruby-2.0.0
2
+ #ruby-gemset=csvlintrb
3
+
4
+ source 'https://rubygems.org'
5
+
6
+ # Specify your gem's dependencies in csvlint.gemspec
7
+ gemspec
@@ -0,0 +1,22 @@
1
+ ##Copyright (c) 2014 The Open Data Institute
2
+
3
+ #MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,214 @@
1
+ [![Build Status](http://img.shields.io/travis/theodi/csvlint.rb.svg)](https://travis-ci.org/theodi/csvlint.rb)
2
+ [![Dependency Status](http://img.shields.io/gemnasium/theodi/csvlint.rb.svg)](https://gemnasium.com/theodi/csvlint.rb)
3
+ [![Coverage Status](http://img.shields.io/coveralls/theodi/csvlint.rb.svg)](https://coveralls.io/r/theodi/csvlint.rb)
4
+ [![License](http://img.shields.io/:license-mit-blue.svg)](http://theodi.mit-license.org)
5
+ [![Badges](http://img.shields.io/:badges-5/5-ff6799.svg)](https://github.com/pikesley/badger)
6
+
7
+ # CSV Lint
8
+
9
+ A ruby gem to support validating CSV files to check their syntax and contents.
10
+
11
+ ## Installation
12
+
13
+ Add this line to your application's Gemfile:
14
+
15
+ gem 'csvlint'
16
+
17
+ And then execute:
18
+
19
+ $ bundle
20
+
21
+ Or install it yourself as:
22
+
23
+ $ gem install csvlint
24
+
25
+ ## Usage
26
+
27
+ Currently the gem supports validating a CSV retrieved from a URL, a File, or an IO-style object (e.g. StringIO):
28
+
29
+ require 'csvlint'
30
+
31
+ validator = Csvlint::Validator.new( "http://example.org/data.csv" )
32
+ validator = Csvlint::Validator.new( File.new("/path/to/my/data.csv") )
33
+ validator = Csvlint::Validator.new( StringIO.new( my_data_in_a_string ) )
34
+
35
+ When validating from a URL the range of errors and warnings is wider, as the library will also check the HTTP headers
36
+ against best practices.
37
+
38
+ #invoke the validation
39
+ validator.validate
40
+
41
+ #check validation status
42
+ validator.valid?
43
+
44
+ #access array of errors, each is a Csvlint::ErrorMessage object
45
+ validator.errors
46
+
47
+ #access array of warnings
48
+ validator.warnings
49
+
50
+ #access array of information messages
51
+ validator.info_messages
52
+
53
+ #get some information about the CSV file that was validated
54
+ validator.encoding
55
+ validator.content_type
56
+ validator.extension
57
+
58
+ #retrieve HTTP headers from request
59
+ validator.headers
60
+
61
+ ## Controlling CSV Parsing
62
+
63
+ The validator supports configuration of the [CSV Dialect](http://dataprotocols.org/csv-dialect/) used in a data file. This is specified by
64
+ passing an options hash to the constructor:
65
+
66
+ opts = {
67
+ "header" => true,
68
+ "delimiter" => ","
69
+ }
70
+ validator = Csvlint::Validator.new( "http://example.org/data.csv", opts )
71
+
72
+ The options should be a Hash that conforms to the [CSV Dialect](http://dataprotocols.org/csv-dialect/) JSON structure.
73
+
74
+ While these options configure the parser to correctly process the file, the validator will still raise errors or warnings for CSV
75
+ structure that it considers to be invalid, e.g. a missing header or different delimiters.
76
+
77
+ Note that the parser will also check for a `header` parameter on the `Content-Type` header returned when fetching a remote CSV file. As
78
+ specified in [RFC 4180](http://www.ietf.org/rfc/rfc4180.txt) the values for this can be `present` and `absent`, e.g:
79
+
80
+ Content-Type: text/csv; header=present
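
As a rough illustration of the `header` parameter described above, the following sketch reads it from a `Content-Type` value. The `header_param` helper is hypothetical and is not part of csvlint; how the gem itself parses the header may differ.

```ruby
# Hypothetical helper, not part of csvlint: extracts the `header` parameter
# from a Content-Type value such as "text/csv; header=present".
def header_param(content_type)
  return nil if content_type.nil?

  # Drop the media type, then read each "key=value" parameter into a Hash.
  params = content_type.split(";").drop(1).map { |part|
    key, value = part.strip.split("=", 2)
    [key.to_s.downcase, value.to_s.delete('"').downcase]
  }.to_h

  params["header"]
end

p header_param("text/csv; header=present")  # => "present"
p header_param("text/csv; charset=utf-8")   # => nil
```

A `nil` result here corresponds to the case where the server gives no machine-readable statement about the header row.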
81
+
82
+ ## Error Reporting
83
+
84
+ The validator provides feedback on a validation result using instances of `Csvlint::ErrorMessage`. Messages are divided into errors, warnings and information
85
+ messages. A validation attempt is successful if there are no errors.
86
+
87
+ Messages provide context including:
88
+
89
+ * `category` has a symbol that indicates the category of error/warning: `:structure` (well-formedness issues), `:schema` (schema validation), `:context` (publishing metadata, e.g. content type)
90
+ * `type` has a symbol that indicates the type of error or warning being reported
91
+ * `row` holds the line number of the problem
92
+ * `column` holds the column number of the issue
93
+ * `content` holds the contents of the row that generated the error or warning
94
+
95
+ ## Errors
96
+
97
+ The following types of error can be reported:
98
+
99
+ * `:wrong_content_type` -- content type is not `text/csv`
100
+ * `:ragged_rows` -- row has a different number of columns (than the first row in the file)
101
+ * `:blank_rows` -- completely empty row, e.g. blank line or a line where all column values are empty
102
+ * `:invalid_encoding` -- encoding error when parsing row, e.g. because of invalid characters
103
+ * `:not_found` -- HTTP 404 error when retrieving the data
104
+ * `:stray_quote` -- missing or stray quote
105
+ * `:unclosed_quote` -- unclosed quoted field
106
+ * `:whitespace` -- a quoted column has leading or trailing whitespace
107
+ * `:line_breaks` -- line breaks were inconsistent or incorrectly specified
108
+ * `:undeclared_header` -- if there is no machine-readable description of whether a header is present (e.g. in a dialect or `Content-Type` header)
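
To make the `:ragged_rows` and `:blank_rows` definitions above concrete, here is a minimal sketch over rows that have already been parsed into arrays. The helper names and the 1-based row numbering are assumptions for illustration; this is not csvlint's implementation.

```ruby
# Hypothetical helpers for illustration; csvlint's validator may work differently.

# Returns the 1-based numbers of rows whose column count differs from the first row.
def ragged_rows(rows)
  return [] if rows.empty?

  expected = rows.first.size
  rows.each_with_index
      .reject { |row, _index| row.size == expected }
      .map { |_row, index| index + 1 }
end

# Returns the 1-based numbers of rows where every value is nil or empty.
def blank_rows(rows)
  rows.each_with_index
      .select { |row, _index| row.all? { |value| value.nil? || value.strip.empty? } }
      .map { |_row, index| index + 1 }
end

rows = [
  ["id", "name", "price"],
  ["1", "apple"],          # one column short
  ["", "", ""],            # all values empty
  ["2", "pear", "0.40"]
]

p ragged_rows(rows)  # => [2]
p blank_rows(rows)   # => [3]
```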
109
+
110
+ ## Warnings
111
+
112
+ The following types of warning can be reported:
113
+
114
+ * `:no_encoding` -- the `Content-Type` header returned in the HTTP request does not have a `charset` parameter
115
+ * `:encoding` -- the character set is not UTF-8
116
+ * `:no_content_type` -- file is being served without a `Content-Type` header
117
+ * `:excel` -- no `Content-Type` header and the file extension is `.xls`
118
+ * `:check_options` -- CSV file appears to contain only a single column
119
+ * `:inconsistent_values` -- inconsistent values in the same column. Reported if <90% of values seem to have same data type (either numeric or alphanumeric including punctuation)
120
+ * `:empty_column_name` -- a column in the CSV header has an empty name
121
+ * `:duplicate_column_name` -- a column in the CSV header has a duplicate name
122
+ * `:title_row` -- if there appears to be a title field in the first row of the CSV
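
The 90% rule behind `:inconsistent_values` can be sketched as follows. This is only an approximation under stated assumptions: the `inconsistent_column?` name, the regular expression used to recognise numeric values, and the handling of empty values are all hypothetical, and csvlint's own classification may differ.

```ruby
# Hypothetical sketch, not csvlint's implementation.
# Classifies each non-empty value as :numeric or :alphanumeric and reports
# whether the most common class covers fewer than `threshold` of the values.
def inconsistent_column?(values, threshold = 0.9)
  present = values.reject { |v| v.nil? || v.strip.empty? }
  return false if present.empty?

  counts = Hash.new(0)
  present.each do |v|
    kind = v.match?(/\A[-+]?\d+(\.\d+)?\z/) ? :numeric : :alphanumeric
    counts[kind] += 1
  end

  counts.values.max.to_f / present.size < threshold
end

p inconsistent_column?(%w[1 Foo 3 3])  # => true  (3 of 4 numeric, i.e. 75%)
p inconsistent_column?(%w[2 5 2 2])    # => false (all numeric)
```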
123
+
124
+ ## Information Messages
125
+
126
+ There are also information messages available:
127
+
128
+ * `:nonrfc_line_breaks` -- uses non-CRLF line breaks, so doesn't conform to RFC 4180
129
+ * `:assumed_header` -- the validator has assumed that a header is present
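
The distinction between inconsistent line breaks (the `:line_breaks` error) and consistent but non-CRLF line breaks (the `:nonrfc_line_breaks` message) can be illustrated with a small sketch. The `classify_line_breaks` helper and its return values are hypothetical, and for simplicity it also counts line breaks inside quoted fields, which a real validator would need to treat separately.

```ruby
# Hypothetical helper, not part of csvlint.
# Returns :none, :inconsistent, :crlf or :non_crlf for the given text.
# Simplification: line breaks inside quoted fields are counted too.
def classify_line_breaks(text)
  # Match CRLF first so it is not split into a separate CR and LF.
  endings = text.scan(/\r\n|\r|\n/).uniq

  return :none if endings.empty?
  return :inconsistent if endings.size > 1

  endings == ["\r\n"] ? :crlf : :non_crlf
end

p classify_line_breaks("a,b\r\nc,d\r\n")  # => :crlf
p classify_line_breaks("a,b\nc,d\n")      # => :non_crlf
p classify_line_breaks("a,b\r\nc,d\n")    # => :inconsistent
```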
130
+
131
+ ## Schema Validation
132
+
133
+ The library supports validating data against a schema. A schema configuration can be provided as a Hash or parsed from JSON. The structure currently
134
+ follows JSON Table Schema with some extensions.
135
+
136
+ An example schema file is:
137
+
138
+ {
139
+ "fields": [
140
+ {
141
+ "name": "id",
142
+ "constraints": { "required": true }
143
+ },
144
+ {
145
+ "name": "price",
146
+ "constraints": { "required": true, "minLength": 1 }
147
+ },
148
+ {
149
+ "name": "postcode",
150
+ "constraints": {
151
+ "required": true,
152
+ "pattern": "[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}"
153
+ }
154
+ }
155
+ ]
156
+ }
157
+
158
+ Parsing and validating with a schema:
159
+
160
+ schema = Csvlint::Schema.load_from_json_table(uri)
161
+ validator = Csvlint::Validator.new( "http://example.org/data.csv", nil, schema )
162
+
163
+ Supported constraints:
164
+
165
+ * `required` -- there must be a value for this field in every row
166
+ * `unique` -- each value in this column must be unique across all rows
167
+ * `minLength` -- minimum number of characters in the value
168
+ * `maxLength` -- maximum number of characters in the value
169
+ * `pattern` -- values must match the provided regular expression
170
+ * `type` -- specifies an XML Schema data type. Values of the column must be a valid value for that type
171
+ * `minimum` -- specify a minimum value; the value will be parsed as specified by `type`
172
+ * `maximum` -- specify a maximum value; the value will be parsed as specified by `type`
173
+ * `datePattern` -- specify a `strftime` compatible date pattern to be used when parsing date values and min/max constraints
174
+
175
+ Supported data types (this is still a work in progress):
176
+
177
+ * String -- `http://www.w3.org/2001/XMLSchema#string` (effectively a no-op)
178
+ * Integer -- `http://www.w3.org/2001/XMLSchema#integer` or `http://www.w3.org/2001/XMLSchema#int`
179
+ * Float -- `http://www.w3.org/2001/XMLSchema#float`
180
+ * Double -- `http://www.w3.org/2001/XMLSchema#double`
181
+ * URI -- `http://www.w3.org/2001/XMLSchema#anyURI`
182
+ * Boolean -- `http://www.w3.org/2001/XMLSchema#boolean`
183
+ * Non Positive Integer -- `http://www.w3.org/2001/XMLSchema#nonPositiveInteger`
184
+ * Positive Integer -- `http://www.w3.org/2001/XMLSchema#positiveInteger`
185
+ * Non Negative Integer -- `http://www.w3.org/2001/XMLSchema#nonNegativeInteger`
186
+ * Negative Integer -- `http://www.w3.org/2001/XMLSchema#negativeInteger`
187
+ * Date -- `http://www.w3.org/2001/XMLSchema#date`
188
+ * Date Time -- `http://www.w3.org/2001/XMLSchema#dateTime`
189
+ * Year -- `http://www.w3.org/2001/XMLSchema#gYear`
190
+ * Year Month -- `http://www.w3.org/2001/XMLSchema#gYearMonth`
191
+ * Time -- `http://www.w3.org/2001/XMLSchema#time`
192
+
193
+ Use of an unknown data type will result in the column failing to validate.
194
+
195
+ Schema validation provides some additional types of error and warning messages:
196
+
197
+ * `:missing_value` (error) -- a column marked as `required` in the schema has no value
198
+ * `:min_length` (error) -- a column with a `minLength` constraint has a value that is too short
199
+ * `:max_length` (error) -- a column with a `maxLength` constraint has a value that is too long
200
+ * `:pattern` (error) -- a column with a `pattern` constraint has a value that doesn't match the regular expression
201
+ * `:header_name` (warning) -- the header in the CSV has a column name that doesn't match the schema
202
+ * `:missing_column` (warning) -- a row in the CSV file is missing a column that is specified in the schema. This is a warning only, as it may be legitimate
203
+ * `:extra_column` (warning) -- a row in the CSV file has an extra column
204
+ * `:unique` (error) -- a column with a `unique` constraint contains non-unique values
205
+ * `:below_minimum` (error) -- a column with a `minimum` constraint contains a value that is below the minimum
206
+ * `:above_maximum` (error) -- a column with a `maximum` constraint contains a value that is above the maximum
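
The relationship between the `required`, `minLength`, `maxLength` and `pattern` constraints and their error types can be sketched as below. The `constraint_errors` helper is hypothetical; whether csvlint skips the remaining checks for empty values, or anchors the pattern, is an assumption made here for illustration.

```ruby
# Hypothetical sketch; csvlint's own field validation may differ.
# Returns the error types (as symbols) raised by one value against a
# constraints Hash shaped like the "constraints" object in the schema.
def constraint_errors(value, constraints)
  errors = []

  # Assumption: an empty value only triggers :missing_value, nothing else.
  if value.nil? || value.empty?
    errors << :missing_value if constraints["required"]
    return errors
  end

  min = constraints["minLength"]
  max = constraints["maxLength"]
  pattern = constraints["pattern"]

  errors << :min_length if min && value.length < min
  errors << :max_length if max && value.length > max
  # Assumption: the pattern is applied unanchored, as written in the schema.
  errors << :pattern if pattern && value !~ Regexp.new(pattern)
  errors
end

postcode = {
  "required" => true,
  "pattern"  => "[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}"
}

p constraint_errors("SW1A 1AA", postcode)  # => []
p constraint_errors("", postcode)          # => [:missing_value]
p constraint_errors("hello", postcode)     # => [:pattern]
```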
207
+
208
+ ## Contributing
209
+
210
+ 1. Fork it
211
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
212
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
213
+ 4. Push to the branch (`git push origin my-new-feature`)
214
+ 5. Create new Pull Request
@@ -0,0 +1,17 @@
1
+ require "bundler/gem_tasks"
2
+
3
+ $:.unshift File.join( File.dirname(__FILE__), "lib")
4
+
5
+ require 'rubygems'
6
+ require 'cucumber'
7
+ require 'cucumber/rake/task'
8
+ require 'coveralls/rake/task'
9
+ require 'rspec/core/rake_task'
10
+
11
+ RSpec::Core::RakeTask.new(:spec)
12
+ Coveralls::RakeTask.new
13
+ Cucumber::Rake::Task.new(:features) do |t|
14
+ t.cucumber_opts = "features --format pretty"
15
+ end
16
+
17
+ task :default => [:spec, :features, 'coveralls:push']
@@ -0,0 +1,32 @@
1
+ #!/usr/bin/env ruby
2
+ $:.unshift File.join( File.dirname(__FILE__), "..", "lib")
3
+
4
+ require 'csvlint'
5
+
6
+ begin
7
+ puts ARGV[0]
8
+ csv = CSV.new( open(ARGV[0]) )
9
+ headers = csv.shift
10
+
11
+ name = File.basename( ARGV[0] )
12
+ schema = {
13
+ "title" => name,
14
+ "description" => "Auto generated schema for #{name}",
15
+ "fields" => []
16
+ }
17
+
18
+ headers.each do |name|
19
+ schema["fields"] << {
20
+ "name" => name,
21
+ "title" => "",
22
+ "description" => "",
23
+ "constraints" => {}
24
+ }
25
+ end
26
+
27
+ $stdout.puts JSON.pretty_generate(schema)
28
+ rescue => e
29
+ puts e
30
+ puts e.backtrace
31
+ puts "Unable to parse CSV file"
32
+ end
@@ -0,0 +1,52 @@
1
+ #!/usr/bin/env ruby
2
+ $:.unshift File.join( File.dirname(__FILE__), "..", "lib")
3
+
4
+ require 'csvlint'
5
+ require 'colorize'
6
+
7
+ def print_error(index, error, color=:red)
8
+ location = ""
9
+ location += error.row.to_s if error.row
10
+ location += "#{error.row ? "," : ""}#{error.column.to_s}" if error.column
11
+ if error.row || error.column
12
+ location = "#{error.row ? "Row" : "Column"}: #{location}"
13
+ end
14
+ puts "#{index+1}. #{error.type}. #{location}".colorize(color)
15
+ end
16
+
17
+ if ARGV.length == 0 && !$stdin.tty?
18
+ source = StringIO.new(ARGF.read)
19
+ else
20
+ if ARGV[0]
21
+ source = ARGV[0]
22
+ unless source =~ /^http(s)?/
23
+ begin
24
+ source = File.new( source )
25
+ rescue Errno::ENOENT
26
+ puts "File not found"
27
+ exit 1
28
+ end
29
+ end
30
+ else
31
+ puts "Usage: csvlint {file or URL} or {input} | csvlint"
32
+ exit 1
33
+ end
34
+ end
35
+
36
+ validator = Csvlint::Validator.new( source )
37
+
38
+ puts "#{ARGV[0] || "CSV"} is #{validator.valid? ? "VALID".green : "INVALID".red}"
39
+
40
+ if validator.errors.size > 0
41
+ validator.errors.each_with_index do |error, i|
42
+ print_error(i, error)
43
+ end
44
+ end
45
+
46
+ if validator.warnings.size > 0
47
+ validator.warnings.each_with_index do |error, i|
48
+ print_error(i, error, :yellow)
49
+ end
50
+ end
51
+
52
+ exit 1 unless validator.valid?
@@ -0,0 +1,39 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'csvlint/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "csvlint"
8
+ spec.version = Csvlint::VERSION
9
+ spec.authors = ["pezholio"]
10
+ spec.email = ["pezholio@gmail.com"]
11
+ spec.description = %q{CSV Validator}
12
+ spec.summary = %q{CSV Validator}
13
+ spec.homepage = ""
14
+ spec.license = "MIT"
15
+
16
+ spec.files = `git ls-files`.split($/)
17
+ spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
18
+ spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
19
+ spec.require_paths = ["lib"]
20
+
21
+ spec.add_dependency "mime-types"
22
+ spec.add_dependency "colorize"
23
+ spec.add_dependency "open_uri_redirections"
24
+ spec.add_dependency "activesupport"
25
+ spec.add_dependency "addressable"
26
+
27
+ spec.add_development_dependency "bundler", "~> 1.3"
28
+ spec.add_development_dependency "rake"
29
+ spec.add_development_dependency "cucumber"
30
+ spec.add_development_dependency "simplecov"
31
+ spec.add_development_dependency "simplecov-rcov"
32
+ spec.add_development_dependency "spork"
33
+ spec.add_development_dependency "webmock"
34
+ spec.add_development_dependency "rspec"
35
+ spec.add_development_dependency "rspec-pride"
36
+ spec.add_development_dependency "rspec-expectations"
37
+ spec.add_development_dependency "coveralls"
38
+ spec.add_development_dependency "pry"
39
+ end
@@ -0,0 +1,46 @@
1
+ Feature: Check inconsistent formatting
2
+
3
+ Scenario: Inconsistent formatting for integers
4
+ Given I have a CSV with the following content:
5
+ """
6
+ "1","2","3"
7
+ "Foo","5","6"
8
+ "3","2","1"
9
+ "3","2","1"
10
+ """
11
+ And it is stored at the url "http://example.com/example1.csv"
12
+ And I ask if there are warnings
13
+ Then there should be 1 warnings
14
+ And that warning should have the type "inconsistent_values"
15
+ And that warning should have the column "1"
16
+
17
+ Scenario: Inconsistent formatting for alpha fields
18
+ Given I have a CSV with the following content:
19
+ """
20
+ "Foo","Bar","Baz"
21
+ "Biz","1","Baff"
22
+ "Boff","Giff","Goff"
23
+ "Boff","Giff","Goff"
24
+ """
25
+ And it is stored at the url "http://example.com/example1.csv"
26
+ And I ask if there are warnings
27
+ Then there should be 1 warnings
28
+ And that warning should have the type "inconsistent_values"
29
+ And that warning should have the column "2"
30
+
31
+ Scenario: Inconsistent formatting for alphanumeric fields
32
+ Given I have a CSV with the following content:
33
+ """
34
+ "Foo 123","Bar","Baz"
35
+ "1","Bar","Baff"
36
+ "Boff 432423","Giff","Goff"
37
+ "Boff444","Giff","Goff"
38
+ """
39
+ And it is stored at the url "http://example.com/example1.csv"
40
+ And I ask if there are warnings
41
+ Then there should be 1 warnings
42
+ And that warning should have the type "inconsistent_values"
43
+ And that warning should have the column "1"
44
+
45
+
46
+