wjordan213-csvlint 0.2.8

Files changed (77)
  1. checksums.yaml +7 -0
  2. data/.coveralls.yml +1 -0
  3. data/.gitattributes +2 -0
  4. data/.gitignore +28 -0
  5. data/.ruby-version +1 -0
  6. data/.travis.yml +32 -0
  7. data/CHANGELOG.md +361 -0
  8. data/Gemfile +7 -0
  9. data/LICENSE.md +22 -0
  10. data/README.md +328 -0
  11. data/Rakefile +17 -0
  12. data/bin/create_schema +32 -0
  13. data/bin/csvlint +10 -0
  14. data/features/check_format.feature +46 -0
  15. data/features/cli.feature +210 -0
  16. data/features/csv_options.feature +35 -0
  17. data/features/csvupload.feature +145 -0
  18. data/features/csvw_schema_validation.feature +127 -0
  19. data/features/fixtures/cr-line-endings.csv +0 -0
  20. data/features/fixtures/crlf-line-endings.csv +0 -0
  21. data/features/fixtures/inconsistent-line-endings-unquoted.csv +0 -0
  22. data/features/fixtures/inconsistent-line-endings.csv +0 -0
  23. data/features/fixtures/invalid-byte-sequence.csv +0 -0
  24. data/features/fixtures/invalid_many_rows.csv +0 -0
  25. data/features/fixtures/lf-line-endings.csv +0 -0
  26. data/features/fixtures/spreadsheet.xls +0 -0
  27. data/features/fixtures/spreadsheet.xlsx +0 -0
  28. data/features/fixtures/title-row.csv +0 -0
  29. data/features/fixtures/valid.csv +0 -0
  30. data/features/fixtures/valid_many_rows.csv +0 -0
  31. data/features/fixtures/windows-line-endings.csv +0 -0
  32. data/features/information.feature +22 -0
  33. data/features/parse_csv.feature +90 -0
  34. data/features/schema_validation.feature +105 -0
  35. data/features/sources.feature +17 -0
  36. data/features/step_definitions/cli_steps.rb +11 -0
  37. data/features/step_definitions/csv_options_steps.rb +24 -0
  38. data/features/step_definitions/information_steps.rb +13 -0
  39. data/features/step_definitions/parse_csv_steps.rb +42 -0
  40. data/features/step_definitions/schema_validation_steps.rb +33 -0
  41. data/features/step_definitions/sources_steps.rb +7 -0
  42. data/features/step_definitions/validation_errors_steps.rb +90 -0
  43. data/features/step_definitions/validation_info_steps.rb +22 -0
  44. data/features/step_definitions/validation_warnings_steps.rb +60 -0
  45. data/features/support/aruba.rb +56 -0
  46. data/features/support/env.rb +26 -0
  47. data/features/support/load_tests.rb +114 -0
  48. data/features/support/webmock.rb +1 -0
  49. data/features/validation_errors.feature +147 -0
  50. data/features/validation_info.feature +16 -0
  51. data/features/validation_warnings.feature +86 -0
  52. data/lib/csvlint.rb +27 -0
  53. data/lib/csvlint/cli.rb +165 -0
  54. data/lib/csvlint/csvw/column.rb +359 -0
  55. data/lib/csvlint/csvw/date_format.rb +182 -0
  56. data/lib/csvlint/csvw/metadata_error.rb +13 -0
  57. data/lib/csvlint/csvw/number_format.rb +211 -0
  58. data/lib/csvlint/csvw/property_checker.rb +761 -0
  59. data/lib/csvlint/csvw/table.rb +204 -0
  60. data/lib/csvlint/csvw/table_group.rb +165 -0
  61. data/lib/csvlint/error_collector.rb +27 -0
  62. data/lib/csvlint/error_message.rb +15 -0
  63. data/lib/csvlint/field.rb +196 -0
  64. data/lib/csvlint/schema.rb +92 -0
  65. data/lib/csvlint/validate.rb +599 -0
  66. data/lib/csvlint/version.rb +3 -0
  67. data/spec/csvw/column_spec.rb +112 -0
  68. data/spec/csvw/date_format_spec.rb +49 -0
  69. data/spec/csvw/number_format_spec.rb +417 -0
  70. data/spec/csvw/table_group_spec.rb +143 -0
  71. data/spec/csvw/table_spec.rb +90 -0
  72. data/spec/field_spec.rb +252 -0
  73. data/spec/schema_spec.rb +211 -0
  74. data/spec/spec_helper.rb +17 -0
  75. data/spec/validator_spec.rb +619 -0
  76. data/wjordan213_csvlint.gemspec +46 -0
  77. metadata +490 -0
data/LICENSE.md ADDED
@@ -0,0 +1,22 @@
1
+ ##Copyright (c) 2014 The Open Data Institute
2
+
3
+ #MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,328 @@
1
+ [![Build Status](http://img.shields.io/travis/theodi/csvlint.rb.svg)](https://travis-ci.org/theodi/csvlint.rb)
2
+ [![Dependency Status](http://img.shields.io/gemnasium/theodi/csvlint.rb.svg)](https://gemnasium.com/theodi/csvlint.rb)
3
+ [![Coverage Status](http://img.shields.io/coveralls/theodi/csvlint.rb.svg)](https://coveralls.io/r/theodi/csvlint.rb)
4
+ [![License](http://img.shields.io/:license-mit-blue.svg)](http://theodi.mit-license.org)
5
+ [![Badges](http://img.shields.io/:badges-5/5-ff6799.svg)](https://github.com/pikesley/badger)
6
+
7
+ # CSV Lint
8
+
9
+ A ruby gem to support validating CSV files to check their syntax and contents.
10
+
11
+ ## Installation
12
+
13
+ Add this line to your application's Gemfile:
14
+
15
+ gem 'csvlint'
16
+
17
+ And then execute:
18
+
19
+ $ bundle
20
+
21
+ Or install it yourself as:
22
+
23
+ $ gem install csvlint
24
+
25
+ ## Usage
26
+
27
+ You can either use this gem within your own Ruby code, or as a standalone command line application.
28
+
29
+ ## On the command line
30
+
31
+ After installing the gem, you can validate a CSV on the command line like so:
32
+
33
+ csvlint myfile.csv
34
+
35
+ You will then see the validation result, together with any warnings or errors e.g.
36
+
37
+ ```
38
+ myfile.csv is INVALID
39
+ 1. blank_rows. Row: 3
40
+ 1. title_row.
41
+ 2. inconsistent_values. Column: 14
42
+ ```
43
+
44
+ You can also optionally pass a schema file like so:
45
+
46
+ csvlint myfile.csv --schema=schema.json
47
+
48
+ ## In your own Ruby code
49
+
50
+ Currently the gem supports retrieving a CSV accessible from a URL, File, or an IO-style object (e.g. StringIO)
51
+
52
+ require 'csvlint'
53
+
54
+ validator = Csvlint::Validator.new( "http://example.org/data.csv" )
55
+ validator = Csvlint::Validator.new( File.new("/path/to/my/data.csv" ))
56
+ validator = Csvlint::Validator.new( StringIO.new( my_data_in_a_string ) )
57
+
58
+ When validating from a URL the range of errors and warnings is wider as the library will also check HTTP headers for
59
+ best practices
60
+
61
+ #invoke the validation
62
+ validator.validate
63
+
64
+ #check validation status
65
+ validator.valid?
66
+
67
+ #access array of errors, each is a Csvlint::ErrorMessage object
68
+ validator.errors
69
+
70
+ #access array of warnings
71
+ validator.warnings
72
+
73
+ #access array of information messages
74
+ validator.info_messages
75
+
76
+ #get some information about the CSV file that was validated
77
+ validator.encoding
78
+ validator.content_type
79
+ validator.extension
80
+ validator.row_count
81
+
82
+ #retrieve HTTP headers from request
83
+ validator.headers
84
+
85
+ ## Controlling CSV Parsing
86
+
87
+ The validator supports configuration of the [CSV Dialect](http://dataprotocols.org/csv-dialect/) used in a data file. This is specified by
88
+ passing a dialect hash to the constructor:
89
+
90
+ dialect = {
91
+ "header" => true,
92
+ "delimiter" => ","
93
+ }
94
+ validator = Csvlint::Validator.new( "http://example.org/data.csv", dialect )
95
+
96
+ The options should be a Hash that conforms to the [CSV Dialect](http://dataprotocols.org/csv-dialect/) JSON structure.
97
+
98
+ While these options configure the parser to correctly process the file, the validator will still raise errors or warnings for CSV
99
+ structure that it considers to be invalid, e.g. a missing header or different delimiters.
100
+
101
+ Note that the parser will also check for a `header` parameter on the `Content-Type` header returned when fetching a remote CSV file. As
102
+ specified in [RFC 4180](http://www.ietf.org/rfc/rfc4180.txt) the values for this can be `present` and `absent`, e.g:
103
+
104
+ Content-Type: text/csv; header=present
105
+
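To make the `header` parameter concrete, here is a minimal sketch of how such a parameter could be read from a `Content-Type` value. The helper name `header_param` is hypothetical and is not part of csvlint; how the gem itself parses the header is not shown in this README.

```ruby
# Hypothetical helper (not csvlint's API): extract the `header` parameter
# from a Content-Type value such as "text/csv; header=present".
def header_param(content_type)
  return nil if content_type.nil?

  # Drop the media type, then inspect each "key=value" parameter.
  params = content_type.split(";").drop(1).map(&:strip)
  params.each do |param|
    key, value = param.split("=", 2)
    next unless key && value
    # Parameter names are compared case-insensitively; quotes are removed.
    return value.strip.delete('"').downcase if key.strip.casecmp?("header")
  end
  nil
end

p header_param("text/csv; header=present")  # "present"
p header_param("text/csv")                  # nil
```

A result of `nil` means the server gave no hint, in which case a caller would have to fall back on the dialect options or on an assumption.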
106
+ ## Error Reporting
107
+
108
+ The validator provides feedback on a validation result using instances of `Csvlint::ErrorMessage`. Messages are divided into errors, warnings and information
109
+ messages. A validation attempt is successful if there are no errors.
110
+
111
+ Messages provide context including:
112
+
113
+ * `category` has a symbol that indicates the category of error or warning: `:structure` (well-formedness issues), `:schema` (schema validation), `:context` (publishing metadata, e.g. content type)
114
+ * `type` has a symbol that indicates the type of error or warning being reported
115
+ * `row` holds the line number of the problem
116
+ * `column` holds the column number of the issue
117
+ * `content` holds the contents of the row that generated the error or warning
118
+
119
+ ## Errors
120
+
121
+ The following types of error can be reported:
122
+
123
+ * `:wrong_content_type` -- content type is not `text/csv`
124
+ * `:ragged_rows` -- row has a different number of columns (than the first row in the file)
125
+ * `:blank_rows` -- completely empty row, e.g. blank line or a line where all column values are empty
126
+ * `:invalid_encoding` -- encoding error when parsing row, e.g. because of invalid characters
127
+ * `:not_found` -- HTTP 404 error when retrieving the data
128
+ * `:stray_quote` -- missing or stray quote
129
+ * `:unclosed_quote` -- unclosed quoted field
130
+ * `:whitespace` -- a quoted column has leading or trailing whitespace
131
+ * `:line_breaks` -- line breaks were inconsistent or incorrectly specified
132
+
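As an illustration of two of the structural errors listed above, the following sketch flags blank and ragged rows in data that has already been parsed into arrays. It is an assumption-laden example, not csvlint's implementation: the method name `structure_problems` and the `[type, line]` return shape are invented for this sketch.

```ruby
# Hypothetical sketch (not csvlint's code): report :blank_rows and
# :ragged_rows for rows that are already split into arrays of strings.
def structure_problems(rows)
  # The first row defines the expected number of columns.
  expected = rows.first ? rows.first.length : 0
  problems = []

  rows.each_with_index do |row, index|
    line = index + 1
    if row.all? { |value| value.nil? || value.strip.empty? }
      # A blank line, or a line where every column value is empty.
      problems << [:blank_rows, line]
    elsif row.length != expected
      problems << [:ragged_rows, line]
    end
  end
  problems
end

rows = [%w[id name], %w[1 a], [], %w[2 b c]]
p structure_problems(rows)  # [[:blank_rows, 3], [:ragged_rows, 4]]
```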
133
+ ## Warnings
134
+
135
+ The following types of warning can be reported:
136
+
137
+ * `:no_encoding` -- the `Content-Type` header returned in the HTTP request does not have a `charset` parameter
138
+ * `:encoding` -- the character set is not UTF-8
139
+ * `:no_content_type` -- file is being served without a `Content-Type` header
140
+ * `:excel` -- no `Content-Type` header and the file extension is `.xls`
141
+ * `:check_options` -- CSV file appears to contain only a single column
142
+ * `:inconsistent_values` -- inconsistent values in the same column. Reported if <90% of values seem to have same data type (either numeric or alphanumeric including punctuation)
143
+ * `:empty_column_name` -- a column in the CSV header has an empty name
144
+ * `:duplicate_column_name` -- a column in the CSV header has a duplicate name
145
+ * `:title_row` -- if there appears to be a title field in the first row of the CSV
146
+
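The 90% rule behind `:inconsistent_values` can be sketched as follows. This is only one reading of the description above, assuming two classes of value (numeric and everything else) and counting every row it is given; the `NUMERIC` pattern, the method name and the `threshold` keyword are hypothetical, and the gem's actual heuristic may differ.

```ruby
# Hypothetical sketch of the "<90% of values share a type" check.
# Values are classed as numeric or non-numeric; empty values are ignored.
NUMERIC = /\A[-+]?\d+(\.\d+)?\z/

def inconsistent_columns(rows, threshold: 0.9)
  return [] if rows.empty?

  flagged = []
  rows.first.length.times do |index|
    values = rows.map { |row| row[index] }.compact.reject(&:empty?)
    next if values.empty?

    numeric = values.count { |value| value.match?(NUMERIC) }
    # Share of the column taken by whichever class is more common.
    dominant = [numeric, values.length - numeric].max
    flagged << index + 1 if dominant.to_f / values.length < threshold
  end
  flagged # 1-based column numbers
end

rows = [%w[1 2 3], %w[Foo 5 6], %w[3 2 1], %w[3 2 1]]
p inconsistent_columns(rows)  # [1]
```

In this example three of the four values in the first column are numeric (75%), which is below the threshold, so only column 1 is flagged.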
147
+ ## Information Messages
148
+
149
+ There are also information messages available:
150
+
151
+ * `:nonrfc_line_breaks` -- uses non-CRLF line breaks, so doesn't conform to RFC4180.
152
+ * `:assumed_header` -- the validator has assumed that a header is present
153
+
154
+ ## Schema Validation
155
+
156
+ The library supports validating data against a schema. A schema configuration can be provided as a Hash or parsed from JSON. The structure currently
157
+ follows JSON Table Schema with some extensions and rudimentary [CSV on the Web Metadata](http://www.w3.org/TR/tabular-metadata/).
158
+
159
+ An example JSON Table Schema schema file is:
160
+
161
+ {
162
+ "fields": [
163
+ {
164
+ "name": "id",
165
+ "constraints": {
166
+ "required": true,
167
+ "type": "http://www.w3.org/TR/xmlschema-2/#integer"
168
+ }
169
+ },
170
+ {
171
+ "name": "price",
172
+ "constraints": {
173
+ "required": true,
174
+ "minLength": 1
175
+ }
176
+ },
177
+ {
178
+ "name": "postcode",
179
+ "constraints": {
180
+ "required": true,
181
+ "pattern": "[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}"
182
+ }
183
+ }
184
+ ]
185
+ }
186
+
187
+ An equivalent CSV on the Web Metadata file is:
188
+
189
+ {
190
+ "@context": "http://www.w3.org/ns/csvw",
191
+ "url": "http://example.com/example1.csv",
192
+ "tableSchema": {
193
+ "columns": [
194
+ {
195
+ "name": "id",
196
+ "required": true,
197
+ "datatype": { "base": "integer" }
198
+ },
199
+ {
200
+ "name": "price",
201
+ "required": true,
202
+ "datatype": { "base": "string", "minLength": 1 }
203
+ },
204
+ {
205
+ "name": "postcode",
206
+ "required": true
207
+ }
208
+ ]
209
+ }
210
+ }
211
+
212
+ Parsing and validating with a schema (of either kind):
213
+
214
+ schema = Csvlint::Schema.load_from_json(uri)
215
+ validator = Csvlint::Validator.new( "http://example.org/data.csv", nil, schema )
216
+
217
+ ### CSV on the Web Validation Support
218
+
219
+ This gem passes all the validation tests in the [official CSV on the Web test suite](http://w3c.github.io/csvw/tests/) (though there might still be errors or parts of the [CSV on the Web standard](http://www.w3.org/TR/tabular-metadata/) that aren't tested by that test suite).
220
+
221
+ ### JSON Table Schema Support
222
+
223
+ Supported constraints:
224
+
225
+ * `required` -- there must be a value for this field in every row
226
+ * `unique` -- the values in every row should be unique
227
+ * `minLength` -- minimum number of characters in the value
228
+ * `maxLength` -- maximum number of characters in the value
229
+ * `pattern` -- values must match the provided regular expression
230
+ * `type` -- specifies an XML Schema data type. Values of the column must be a valid value for that type
231
+ * `minimum` -- specify a minimum range for values, the value will be parsed as specified by `type`
232
+ * `maximum` -- specify a maximum range for values, the value will be parsed as specified by `type`
233
+ * `datePattern` -- specify a `strftime` compatible date pattern to be used when parsing date values and min/max constraints
234
+
235
+ Supported data types (this is still a work in progress):
236
+
237
+ * String -- `http://www.w3.org/2001/XMLSchema#string` (effectively a no-op)
238
+ * Integer -- `http://www.w3.org/2001/XMLSchema#integer` or `http://www.w3.org/2001/XMLSchema#int`
239
+ * Float -- `http://www.w3.org/2001/XMLSchema#float`
240
+ * Double -- `http://www.w3.org/2001/XMLSchema#double`
241
+ * URI -- `http://www.w3.org/2001/XMLSchema#anyURI`
242
+ * Boolean -- `http://www.w3.org/2001/XMLSchema#boolean`
243
+ * Non Positive Integer -- `http://www.w3.org/2001/XMLSchema#nonPositiveInteger`
244
+ * Positive Integer -- `http://www.w3.org/2001/XMLSchema#positiveInteger`
245
+ * Non Negative Integer -- `http://www.w3.org/2001/XMLSchema#nonNegativeInteger`
246
+ * Negative Integer -- `http://www.w3.org/2001/XMLSchema#negativeInteger`
247
+ * Date -- `http://www.w3.org/2001/XMLSchema#date`
248
+ * Date Time -- `http://www.w3.org/2001/XMLSchema#dateTime`
249
+ * Year -- `http://www.w3.org/2001/XMLSchema#gYear`
250
+ * Year Month -- `http://www.w3.org/2001/XMLSchema#gYearMonth`
251
+ * Time -- `http://www.w3.org/2001/XMLSchema#time`
252
+
253
+ Use of an unknown data type will result in the column failing to validate.
254
+
255
+ Schema validation provides some additional types of error and warning messages:
256
+
257
+ * `:missing_value` (error) -- a column marked as `required` in the schema has no value
258
+ * `:min_length` (error) -- a column with a `minLength` constraint has a value that is too short
259
+ * `:max_length` (error) -- a column with a `maxLength` constraint has a value that is too long
260
+ * `:pattern` (error) -- a column with a `pattern` constraint has a value that doesn't match the regular expression
261
+ * `:malformed_header` (warning) -- the header in the CSV doesn't match the schema
262
+ * `:missing_column` (warning) -- a row in the CSV file is missing a column that is specified in the schema. This is a warning only, as it may be legitimate
263
+ * `:extra_column` (warning) -- a row in the CSV file has an extra column
264
+ * `:unique` (error) -- a column with a `unique` constraint contains non-unique values
265
+ * `:below_minimum` (error) -- a column with a `minimum` constraint contains a value that is below the minimum
266
+ * `:above_maximum` (error) -- a column with a `maximum` constraint contains a value that is above the maximum
267
+
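To show how a few of these constraints map to error types, here is a small self-contained sketch covering `required`, `minLength`, `maxLength` and `pattern`. It does not use csvlint itself: `check_value` is a hypothetical name, and anchoring the pattern to the whole value is an assumption made for this example rather than documented behaviour.

```ruby
# Hypothetical sketch (not csvlint's code): apply a JSON Table Schema
# "constraints" hash to one value and return the error types it triggers.
def check_value(value, constraints)
  errors = []

  if value.nil? || value.empty?
    errors << :missing_value if constraints["required"]
    return errors # nothing else to check on an empty value
  end

  min = constraints["minLength"]
  max = constraints["maxLength"]
  pattern = constraints["pattern"]

  errors << :min_length if min && value.length < min
  errors << :max_length if max && value.length > max
  # Assumption: the pattern must match the whole value.
  errors << :pattern if pattern && !value.match?(/\A(?:#{pattern})\z/)
  errors
end

postcode = {
  "required" => true,
  "pattern" => "[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}"
}

p check_value("SW1A 1AA", postcode)  # []
p check_value("12345", postcode)     # [:pattern]
p check_value("", postcode)          # [:missing_value]
```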
268
+ ## Other validation options
269
+
270
+ You can also provide an optional options hash as the fourth argument to Validator#new. Supported options are:
271
+
272
+ * :limit_lines -- only check this number of lines of the CSV file. Good for a quick check on huge files.
273
+
274
+ ```
275
+ options = {
276
+ limit_lines: 100
277
+ }
278
+ validator = Csvlint::Validator.new( "http://example.org/data.csv", nil, nil, options )
279
+ ```
280
+
281
+ * :lambda -- pass a lambda to be called as each line is validated; it receives the `Validator` object. For example, this prints the current line number for every line validated:
282
+
283
+ ```
284
+ options = {
285
+ lambda: ->(validator) { puts validator.current_line }
286
+ }
287
+ validator = Csvlint::Validator.new( "http://example.org/data.csv", nil, nil, options )
288
+ => 1
289
+ 2
290
+ 3
291
+ 4
292
+ .....
293
+ ```
294
+
295
+ ## Contributing
296
+
297
+ 1. Fork it
298
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
299
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
300
+ 4. Push to the branch (`git push origin my-new-feature`)
301
+ 5. Create new Pull Request
302
+
303
+ ### Testing
304
+
305
+ The codebase includes both rspec and cucumber tests, which can be run together using:
306
+
307
+ $ rake
308
+
309
+ or separately:
310
+
311
+ $ rake spec
312
+ $ rake features
313
+
314
+ When the cucumber tests are first run, a script will create tests based on the latest version of the [CSV on the Web test suite](http://w3c.github.io/csvw/tests/), including creating a local cache of the test files. This requires an internet connection and some patience. Following that download, the tests will run locally; there's also a batch script:
315
+
316
+ $ bin/run-csvw-tests
317
+
318
+ which will run the tests from the command line.
319
+
320
+ If you need to refresh the CSV on the Web tests:
321
+
322
+ $ rm bin/run-csvw-tests
323
+ $ rm features/csvw_validation_tests.feature
324
+ $ rm -r features/fixtures/csvw
325
+
326
+ and then run the cucumber tests again or:
327
+
328
+ $ ruby features/support/load_tests.rb
data/Rakefile ADDED
@@ -0,0 +1,17 @@
1
+ require "bundler/gem_tasks"
2
+
3
+ $:.unshift File.join( File.dirname(__FILE__), "lib")
4
+
5
+ require 'rubygems'
6
+ require 'cucumber'
7
+ require 'cucumber/rake/task'
8
+ require 'coveralls/rake/task'
9
+ require 'rspec/core/rake_task'
10
+
11
+ RSpec::Core::RakeTask.new(:spec)
12
+ Coveralls::RakeTask.new
13
+ Cucumber::Rake::Task.new(:features) do |t|
14
+ t.cucumber_opts = "features --format pretty"
15
+ end
16
+
17
+ task :default => [:spec, :features, 'coveralls:push']
data/bin/create_schema ADDED
@@ -0,0 +1,32 @@
1
+ #!/usr/bin/env ruby
2
+ $:.unshift File.join( File.dirname(__FILE__), "..", "lib")
3
+
4
+ require 'csvlint'
5
+
6
+ begin
7
+ puts ARGV[0]
8
+ csv = CSV.new( open(ARGV[0]) )
9
+ headers = csv.shift
10
+
11
+ name = File.basename( ARGV[0] )
12
+ schema = {
13
+ "title" => name,
14
+ "description" => "Auto generated schema for #{name}",
15
+ "fields" => []
16
+ }
17
+
18
+ headers.each do |name|
19
+ schema["fields"] << {
20
+ "name" => name,
21
+ "title" => "",
22
+ "description" => "",
23
+ "constraints" => {}
24
+ }
25
+ end
26
+
27
+ $stdout.puts JSON.pretty_generate(schema)
28
+ rescue => e
29
+ puts e
30
+ puts e.backtrace
31
+ puts "Unable to parse CSV file"
32
+ end
data/bin/csvlint ADDED
@@ -0,0 +1,10 @@
1
+ #!/usr/bin/env ruby
2
+ $:.unshift File.join( File.dirname(__FILE__), "..", "lib")
3
+
4
+ require 'csvlint/cli'
5
+
6
+ if ARGV == ["help"]
7
+ Csvlint::Cli.start(["help"])
8
+ else
9
+ Csvlint::Cli.start(ARGV.unshift("validate"))
10
+ end
data/features/check_format.feature ADDED
@@ -0,0 +1,46 @@
1
+ Feature: Check inconsistent formatting
2
+
3
+ Scenario: Inconsistent formatting for integers
4
+ Given I have a CSV with the following content:
5
+ """
6
+ "1","2","3"
7
+ "Foo","5","6"
8
+ "3","2","1"
9
+ "3","2","1"
10
+ """
11
+ And it is stored at the url "http://example.com/example1.csv"
12
+ And I ask if there are warnings
13
+ Then there should be 1 warnings
14
+ And that warning should have the type "inconsistent_values"
15
+ And that warning should have the column "1"
16
+
17
+ Scenario: Inconsistent formatting for alpha fields
18
+ Given I have a CSV with the following content:
19
+ """
20
+ "Foo","Bar","Baz"
21
+ "Biz","1","Baff"
22
+ "Boff","Giff","Goff"
23
+ "Boff","Giff","Goff"
24
+ """
25
+ And it is stored at the url "http://example.com/example1.csv"
26
+ And I ask if there are warnings
27
+ Then there should be 1 warnings
28
+ And that warning should have the type "inconsistent_values"
29
+ And that warning should have the column "2"
30
+
31
+ Scenario: Inconsistent formatting for alphanumeric fields
32
+ Given I have a CSV with the following content:
33
+ """
34
+ "Foo 123","Bar","Baz"
35
+ "1","Bar","Baff"
36
+ "Boff 432423","Giff","Goff"
37
+ "Boff444","Giff","Goff"
38
+ """
39
+ And it is stored at the url "http://example.com/example1.csv"
40
+ And I ask if there are warnings
41
+ Then there should be 1 warnings
42
+ And that warning should have the type "inconsistent_values"
43
+ And that warning should have the column "1"
44
+
45
+
46
+