wjordan213-csvlint 0.2.8

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (77)
  1. checksums.yaml +7 -0
  2. data/.coveralls.yml +1 -0
  3. data/.gitattributes +2 -0
  4. data/.gitignore +28 -0
  5. data/.ruby-version +1 -0
  6. data/.travis.yml +32 -0
  7. data/CHANGELOG.md +361 -0
  8. data/Gemfile +7 -0
  9. data/LICENSE.md +22 -0
  10. data/README.md +328 -0
  11. data/Rakefile +17 -0
  12. data/bin/create_schema +32 -0
  13. data/bin/csvlint +10 -0
  14. data/features/check_format.feature +46 -0
  15. data/features/cli.feature +210 -0
  16. data/features/csv_options.feature +35 -0
  17. data/features/csvupload.feature +145 -0
  18. data/features/csvw_schema_validation.feature +127 -0
  19. data/features/fixtures/cr-line-endings.csv +0 -0
  20. data/features/fixtures/crlf-line-endings.csv +0 -0
  21. data/features/fixtures/inconsistent-line-endings-unquoted.csv +0 -0
  22. data/features/fixtures/inconsistent-line-endings.csv +0 -0
  23. data/features/fixtures/invalid-byte-sequence.csv +0 -0
  24. data/features/fixtures/invalid_many_rows.csv +0 -0
  25. data/features/fixtures/lf-line-endings.csv +0 -0
  26. data/features/fixtures/spreadsheet.xls +0 -0
  27. data/features/fixtures/spreadsheet.xlsx +0 -0
  28. data/features/fixtures/title-row.csv +0 -0
  29. data/features/fixtures/valid.csv +0 -0
  30. data/features/fixtures/valid_many_rows.csv +0 -0
  31. data/features/fixtures/windows-line-endings.csv +0 -0
  32. data/features/information.feature +22 -0
  33. data/features/parse_csv.feature +90 -0
  34. data/features/schema_validation.feature +105 -0
  35. data/features/sources.feature +17 -0
  36. data/features/step_definitions/cli_steps.rb +11 -0
  37. data/features/step_definitions/csv_options_steps.rb +24 -0
  38. data/features/step_definitions/information_steps.rb +13 -0
  39. data/features/step_definitions/parse_csv_steps.rb +42 -0
  40. data/features/step_definitions/schema_validation_steps.rb +33 -0
  41. data/features/step_definitions/sources_steps.rb +7 -0
  42. data/features/step_definitions/validation_errors_steps.rb +90 -0
  43. data/features/step_definitions/validation_info_steps.rb +22 -0
  44. data/features/step_definitions/validation_warnings_steps.rb +60 -0
  45. data/features/support/aruba.rb +56 -0
  46. data/features/support/env.rb +26 -0
  47. data/features/support/load_tests.rb +114 -0
  48. data/features/support/webmock.rb +1 -0
  49. data/features/validation_errors.feature +147 -0
  50. data/features/validation_info.feature +16 -0
  51. data/features/validation_warnings.feature +86 -0
  52. data/lib/csvlint.rb +27 -0
  53. data/lib/csvlint/cli.rb +165 -0
  54. data/lib/csvlint/csvw/column.rb +359 -0
  55. data/lib/csvlint/csvw/date_format.rb +182 -0
  56. data/lib/csvlint/csvw/metadata_error.rb +13 -0
  57. data/lib/csvlint/csvw/number_format.rb +211 -0
  58. data/lib/csvlint/csvw/property_checker.rb +761 -0
  59. data/lib/csvlint/csvw/table.rb +204 -0
  60. data/lib/csvlint/csvw/table_group.rb +165 -0
  61. data/lib/csvlint/error_collector.rb +27 -0
  62. data/lib/csvlint/error_message.rb +15 -0
  63. data/lib/csvlint/field.rb +196 -0
  64. data/lib/csvlint/schema.rb +92 -0
  65. data/lib/csvlint/validate.rb +599 -0
  66. data/lib/csvlint/version.rb +3 -0
  67. data/spec/csvw/column_spec.rb +112 -0
  68. data/spec/csvw/date_format_spec.rb +49 -0
  69. data/spec/csvw/number_format_spec.rb +417 -0
  70. data/spec/csvw/table_group_spec.rb +143 -0
  71. data/spec/csvw/table_spec.rb +90 -0
  72. data/spec/field_spec.rb +252 -0
  73. data/spec/schema_spec.rb +211 -0
  74. data/spec/spec_helper.rb +17 -0
  75. data/spec/validator_spec.rb +619 -0
  76. data/wjordan213_csvlint.gemspec +46 -0
  77. metadata +490 -0
data/LICENSE.md ADDED
@@ -0,0 +1,22 @@
1
+ ##Copyright (c) 2014 The Open Data Institute
2
+
3
+ #MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,328 @@
1
+ [![Build Status](http://img.shields.io/travis/theodi/csvlint.rb.svg)](https://travis-ci.org/theodi/csvlint.rb)
2
+ [![Dependency Status](http://img.shields.io/gemnasium/theodi/csvlint.rb.svg)](https://gemnasium.com/theodi/csvlint.rb)
3
+ [![Coverage Status](http://img.shields.io/coveralls/theodi/csvlint.rb.svg)](https://coveralls.io/r/theodi/csvlint.rb)
4
+ [![License](http://img.shields.io/:license-mit-blue.svg)](http://theodi.mit-license.org)
5
+ [![Badges](http://img.shields.io/:badges-5/5-ff6799.svg)](https://github.com/pikesley/badger)
6
+
7
+ # CSV Lint
8
+
9
+ A ruby gem to support validating CSV files to check their syntax and contents.
10
+
11
+ ## Installation
12
+
13
+ Add this line to your application's Gemfile:
14
+
15
+ gem 'csvlint'
16
+
17
+ And then execute:
18
+
19
+ $ bundle
20
+
21
+ Or install it yourself as:
22
+
23
+ $ gem install csvlint
24
+
25
+ ## Usage
26
+
27
+ You can either use this gem within your own Ruby code, or as a standalone command line application.
28
+
29
+ ## On the command line
30
+
31
+ After installing the gem, you can validate a CSV on the command line like so:
32
+
33
+ csvlint myfile.csv
34
+
35
+ You will then see the validation result, together with any warnings or errors e.g.
36
+
37
+ ```
38
+ myfile.csv is INVALID
39
+ 1. blank_rows. Row: 3
40
+ 1. title_row.
41
+ 2. inconsistent_values. Column: 14
42
+ ```
43
+
44
+ You can also optionally pass a schema file like so:
45
+
46
+ csvlint myfile.csv --schema=schema.json
47
+
48
+ ## In your own Ruby code
49
+
50
+ Currently the gem supports retrieving a CSV accessible from a URL, File, or an IO-style object (e.g. StringIO)
51
+
52
+ require 'csvlint'
53
+
54
+ validator = Csvlint::Validator.new( "http://example.org/data.csv" )
55
+ validator = Csvlint::Validator.new( File.new("/path/to/my/data.csv" ))
56
+ validator = Csvlint::Validator.new( StringIO.new( my_data_in_a_string ) )
57
+
58
+ When validating from a URL the range of errors and warnings is wider as the library will also check HTTP headers for
59
+ best practices
60
+
61
+ #invoke the validation
62
+ validator.validate
63
+
64
+ #check validation status
65
+ validator.valid?
66
+
67
+ #access array of errors, each is a Csvlint::ErrorMessage object
68
+ validator.errors
69
+
70
+ #access array of warnings
71
+ validator.warnings
72
+
73
+ #access array of information messages
74
+ validator.info_messages
75
+
76
+ #get some information about the CSV file that was validated
77
+ validator.encoding
78
+ validator.content_type
79
+ validator.extension
80
+ validator.row_count
81
+
82
+ #retrieve HTTP headers from request
83
+ validator.headers
84
+
85
+ ## Controlling CSV Parsing
86
+
87
+ The validator supports configuration of the [CSV Dialect](http://dataprotocols.org/csv-dialect/) used in a data file. This is specified by
88
+ passing a dialect hash to the constructor:
89
+
90
+ dialect = {
91
+ "header" => true,
92
+ "delimiter" => ","
93
+ }
94
+ validator = Csvlint::Validator.new( "http://example.org/data.csv", dialect )
95
+
96
+ The options should be a Hash that conforms to the [CSV Dialect](http://dataprotocols.org/csv-dialect/) JSON structure.
97
+
98
+ While these options configure the parser to correctly process the file, the validator will still raise errors or warnings for CSV
99
+ structure that it considers to be invalid, e.g. a missing header or different delimiters.
100
+
101
+ Note that the parser will also check for a `header` parameter on the `Content-Type` header returned when fetching a remote CSV file. As
102
+ specified in [RFC 4180](http://www.ietf.org/rfc/rfc4180.txt) the values for this can be `present` and `absent`, e.g:
103
+
104
+ Content-Type: text/csv; header=present
105
+
106
+ ## Error Reporting
107
+
108
+ The validator provides feedback on a validation result using instances of `Csvlint::ErrorMessage`. Messages are divided into errors, warnings and information
109
+ messages. A validation attempt is successful if there are no errors.
110
+
111
+ Messages provide context including:
112
+
113
+ * `category` has a symbol that indicates the category of error/warning: `:structure` (well-formedness issues), `:schema` (schema validation), `:context` (publishing metadata, e.g. content type)
114
+ * `type` has a symbol that indicates the type of error or warning being reported
115
+ * `row` holds the line number of the problem
116
+ * `column` holds the column number of the issue
117
+ * `content` holds the contents of the row that generated the error or warning
118
+
119
+ ## Errors
120
+
121
+ The following types of error can be reported:
122
+
123
+ * `:wrong_content_type` -- content type is not `text/csv`
124
+ * `:ragged_rows` -- row has a different number of columns (than the first row in the file)
125
+ * `:blank_rows` -- completely empty row, e.g. blank line or a line where all column values are empty
126
+ * `:invalid_encoding` -- encoding error when parsing row, e.g. because of invalid characters
127
+ * `:not_found` -- HTTP 404 error when retrieving the data
128
+ * `:stray_quote` -- missing or stray quote
129
+ * `:unclosed_quote` -- unclosed quoted field
130
+ * `:whitespace` -- a quoted column has leading or trailing whitespace
131
+ * `:line_breaks` -- line breaks were inconsistent or incorrectly specified
132
+
133
+ ## Warnings
134
+
135
+ The following types of warning can be reported:
136
+
137
+ * `:no_encoding` -- the `Content-Type` header returned in the HTTP request does not have a `charset` parameter
138
+ * `:encoding` -- the character set is not UTF-8
139
+ * `:no_content_type` -- file is being served without a `Content-Type` header
140
+ * `:excel` -- no `Content-Type` header and the file extension is `.xls`
141
+ * `:check_options` -- CSV file appears to contain only a single column
142
+ * `:inconsistent_values` -- inconsistent values in the same column. Reported if <90% of values seem to have same data type (either numeric or alphanumeric including punctuation)
143
+ * `:empty_column_name` -- a column in the CSV header has an empty name
144
+ * `:duplicate_column_name` -- a column in the CSV header has a duplicate name
145
+ * `:title_row` -- if there appears to be a title field in the first row of the CSV
146
+
147
+ ## Information Messages
148
+
149
+ There are also information messages available:
150
+
151
+ * `:nonrfc_line_breaks` -- uses non-CRLF line breaks, so doesn't conform to RFC4180.
152
+ * `:assumed_header` -- the validator has assumed that a header is present
153
+
154
+ ## Schema Validation
155
+
156
+ The library supports validating data against a schema. A schema configuration can be provided as a Hash or parsed from JSON. The structure currently
157
+ follows JSON Table Schema with some extensions and rudimentary support for [CSV on the Web Metadata](http://www.w3.org/TR/tabular-metadata/).
158
+
159
+ An example JSON Table Schema schema file is:
160
+
161
+ {
162
+ "fields": [
163
+ {
164
+ "name": "id",
165
+ "constraints": {
166
+ "required": true,
167
+ "type": "http://www.w3.org/2001/XMLSchema#integer"
168
+ }
169
+ },
170
+ {
171
+ "name": "price",
172
+ "constraints": {
173
+ "required": true,
174
+ "minLength": 1
175
+ }
176
+ },
177
+ {
178
+ "name": "postcode",
179
+ "constraints": {
180
+ "required": true,
181
+ "pattern": "[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}"
182
+ }
183
+ }
184
+ ]
185
+ }
186
+
187
+ An equivalent CSV on the Web Metadata file is:
188
+
189
+ {
190
+ "@context": "http://www.w3.org/ns/csvw",
191
+ "url": "http://example.com/example1.csv",
192
+ "tableSchema": {
193
+ "columns": [
194
+ {
195
+ "name": "id",
196
+ "required": true,
197
+ "datatype": { "base": "integer" }
198
+ },
199
+ {
200
+ "name": "price",
201
+ "required": true,
202
+ "datatype": { "base": "string", "minLength": 1 }
203
+ },
204
+ {
205
+ "name": "postcode",
206
+ "required": true
207
+ }
208
+ ]
209
+ }
210
+ }
211
+
212
+ Parsing and validating with a schema (of either kind):
213
+
214
+ schema = Csvlint::Schema.load_from_json(uri)
215
+ validator = Csvlint::Validator.new( "http://example.org/data.csv", nil, schema )
216
+
217
+ ### CSV on the Web Validation Support
218
+
219
+ This gem passes all the validation tests in the [official CSV on the Web test suite](http://w3c.github.io/csvw/tests/) (though there might still be errors or parts of the [CSV on the Web standard](http://www.w3.org/TR/tabular-metadata/) that aren't tested by that test suite).
220
+
221
+ ### JSON Table Schema Support
222
+
223
+ Supported constraints:
224
+
225
+ * `required` -- there must be a value for this field in every row
226
+ * `unique` -- the values in every row should be unique
227
+ * `minLength` -- minimum number of characters in the value
228
+ * `maxLength` -- maximum number of characters in the value
229
+ * `pattern` -- values must match the provided regular expression
230
+ * `type` -- specifies an XML Schema data type. Values of the column must be a valid value for that type
231
+ * `minimum` -- specify a minimum range for values, the value will be parsed as specified by `type`
232
+ * `maximum` -- specify a maximum range for values, the value will be parsed as specified by `type`
233
+ * `datePattern` -- specify a `strftime` compatible date pattern to be used when parsing date values and min/max constraints
234
+
235
+ Supported data types (this is still a work in progress):
236
+
237
+ * String -- `http://www.w3.org/2001/XMLSchema#string` (effectively a no-op)
238
+ * Integer -- `http://www.w3.org/2001/XMLSchema#integer` or `http://www.w3.org/2001/XMLSchema#int`
239
+ * Float -- `http://www.w3.org/2001/XMLSchema#float`
240
+ * Double -- `http://www.w3.org/2001/XMLSchema#double`
241
+ * URI -- `http://www.w3.org/2001/XMLSchema#anyURI`
242
+ * Boolean -- `http://www.w3.org/2001/XMLSchema#boolean`
243
+ * Non Positive Integer -- `http://www.w3.org/2001/XMLSchema#nonPositiveInteger`
244
+ * Positive Integer -- `http://www.w3.org/2001/XMLSchema#positiveInteger`
245
+ * Non Negative Integer -- `http://www.w3.org/2001/XMLSchema#nonNegativeInteger`
246
+ * Negative Integer -- `http://www.w3.org/2001/XMLSchema#negativeInteger`
247
+ * Date -- `http://www.w3.org/2001/XMLSchema#date`
248
+ * Date Time -- `http://www.w3.org/2001/XMLSchema#dateTime`
249
+ * Year -- `http://www.w3.org/2001/XMLSchema#gYear`
250
+ * Year Month -- `http://www.w3.org/2001/XMLSchema#gYearMonth`
251
+ * Time -- `http://www.w3.org/2001/XMLSchema#time`
252
+
253
+ Use of an unknown data type will result in the column failing to validate.
254
+
255
+ Schema validation provides some additional types of error and warning messages:
256
+
257
+ * `:missing_value` (error) -- a column marked as `required` in the schema has no value
258
+ * `:min_length` (error) -- a column with a `minLength` constraint has a value that is too short
259
+ * `:max_length` (error) -- a column with a `maxLength` constraint has a value that is too long
260
+ * `:pattern` (error) -- a column with a `pattern` constraint has a value that doesn't match the regular expression
261
+ * `:malformed_header` (warning) -- the header in the CSV doesn't match the schema
262
+ * `:missing_column` (warning) -- a row in the CSV file has a missing column, that is specified in the schema. This is a warning only, as it may be legitimate
263
+ * `:extra_column` (warning) -- a row in the CSV file has an extra column
264
+ * `:unique` (error) -- a column with a `unique` constraint contains non-unique values
265
+ * `:below_minimum` (error) -- a column with a `minimum` constraint contains a value that is below the minimum
266
+ * `:above_maximum` (error) -- a column with a `maximum` constraint contains a value that is above the maximum
267
+
268
+ ## Other validation options
269
+
270
+ You can also provide an optional options hash as the fourth argument to Validator#new. Supported options are:
271
+
272
+ * :limit_lines -- only check this number of lines of the CSV file. Good for a quick check on huge files.
273
+
274
+ ```
275
+ options = {
276
+ limit_lines: 100
277
+ }
278
+ validator = Csvlint::Validator.new( "http://example.org/data.csv", nil, nil, options )
279
+ ```
280
+
281
+ * :lambda -- Pass a lambda to be called as each line is validated; it receives the `Validator` object. For example, this prints the current line number for every line validated:
282
+
283
+ ```
284
+ options = {
285
+ lambda: ->(validator) { puts validator.current_line }
286
+ }
287
+ validator = Csvlint::Validator.new( "http://example.org/data.csv", nil, nil, options )
288
+ => 1
289
+ 2
290
+ 3
291
+ 4
292
+ .....
293
+ ```
294
+
295
+ ## Contributing
296
+
297
+ 1. Fork it
298
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
299
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
300
+ 4. Push to the branch (`git push origin my-new-feature`)
301
+ 5. Create new Pull Request
302
+
303
+ ### Testing
304
+
305
+ The codebase includes both rspec and cucumber tests, which can be run together using:
306
+
307
+ $ rake
308
+
309
+ or separately:
310
+
311
+ $ rake spec
312
+ $ rake features
313
+
314
+ When the cucumber tests are first run, a script will create tests based on the latest version of the [CSV on the Web test suite](http://w3c.github.io/csvw/tests/), including creating a local cache of the test files. This requires an internet connection and some patience. Following that download, the tests will run locally; there's also a batch script:
315
+
316
+ $ bin/run-csvw-tests
317
+
318
+ which will run the tests from the command line.
319
+
320
+ If you need to refresh the CSV on the Web tests:
321
+
322
+ $ rm bin/run-csvw-tests
323
+ $ rm features/csvw_validation_tests.feature
324
+ $ rm -r features/fixtures/csvw
325
+
326
+ and then run the cucumber tests again or:
327
+
328
+ $ ruby features/support/load_tests.rb
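Reviewer note on the README's Schema Validation section: the `required`, `minLength`, `maxLength` and `pattern` constraints can be illustrated with a small standalone sketch. This uses only the Ruby standard library and is not the gem's implementation. The helper names `constraint_errors` and `validate_rows` are hypothetical, and anchoring the pattern to the whole value is an assumption of this sketch, since the README does not say how the gem applies the regular expression.

```ruby
require "csv"

# Hypothetical helper, not part of csvlint: checks one value against a
# subset of the constraints listed in the README and returns error types.
def constraint_errors(value, constraints)
  errors = []
  blank = value.nil? || value.empty?
  errors << :missing_value if constraints["required"] && blank
  return errors if blank # length and pattern checks need a value

  min = constraints["minLength"]
  max = constraints["maxLength"]
  pattern = constraints["pattern"]
  errors << :min_length if min && value.length < min
  errors << :max_length if max && value.length > max
  # Assumption: the pattern must match the whole value.
  errors << :pattern if pattern && value !~ /\A(?:#{pattern})\z/
  errors
end

# Schema in the JSON Table Schema shape used by the README example.
SCHEMA = {
  "fields" => [
    { "name" => "id", "constraints" => { "required" => true } },
    { "name" => "price", "constraints" => { "required" => true, "minLength" => 1 } },
    { "name" => "postcode",
      "constraints" => { "required" => true,
                         "pattern" => "[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}" } }
  ]
}.freeze

# Hypothetical driver: treats the first row as a header and reports
# row (line number) and column (1-based), mirroring the README's message fields.
def validate_rows(csv_text, schema)
  rows = CSV.parse(csv_text)
  rows.shift # drop the header row
  messages = []
  rows.each_with_index do |row, i|
    schema["fields"].each_with_index do |field, col|
      constraint_errors(row[col], field["constraints"] || {}).each do |type|
        messages << { type: type, row: i + 2, column: col + 1 }
      end
    end
  end
  messages
end

SAMPLE_CSV = "id,price,postcode\n1,9.99,SW1A 1AA\n2,,not a postcode\n"
validate_rows(SAMPLE_CSV, SCHEMA).each { |message| puts message.inspect }
```

With the sample data, the second data row has an empty `price` and a `postcode` that does not match the pattern, so the sketch reports one `:missing_value` and one `:pattern` message for line 3.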
data/Rakefile ADDED
@@ -0,0 +1,17 @@
1
+ require "bundler/gem_tasks"
2
+
3
+ $:.unshift File.join( File.dirname(__FILE__), "lib")
4
+
5
+ require 'rubygems'
6
+ require 'cucumber'
7
+ require 'cucumber/rake/task'
8
+ require 'coveralls/rake/task'
9
+ require 'rspec/core/rake_task'
10
+
11
+ RSpec::Core::RakeTask.new(:spec)
12
+ Coveralls::RakeTask.new
13
+ Cucumber::Rake::Task.new(:features) do |t|
14
+ t.cucumber_opts = "features --format pretty"
15
+ end
16
+
17
+ task :default => [:spec, :features, 'coveralls:push']
data/bin/create_schema ADDED
@@ -0,0 +1,32 @@
1
+ #!/usr/bin/env ruby
2
+ $:.unshift File.join( File.dirname(__FILE__), "..", "lib")
3
+
4
+ require 'csvlint'
5
+
6
+ begin
7
+ puts ARGV[0]
8
+ csv = CSV.new( open(ARGV[0]) )
9
+ headers = csv.shift
10
+
11
+ name = File.basename( ARGV[0] )
12
+ schema = {
13
+ "title" => name,
14
+ "description" => "Auto generated schema for #{name}",
15
+ "fields" => []
16
+ }
17
+
18
+ headers.each do |name|
19
+ schema["fields"] << {
20
+ "name" => name,
21
+ "title" => "",
22
+ "description" => "",
23
+ "constraints" => {}
24
+ }
25
+ end
26
+
27
+ $stdout.puts JSON.pretty_generate(schema)
28
+ rescue => e
29
+ puts e
30
+ puts e.backtrace
31
+ puts "Unable to parse CSV file"
32
+ end
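Reviewer note on `bin/create_schema`: the script reads the header row and emits an empty JSON Table Schema skeleton. The sketch below shows the same idea as a function that can be tested in isolation. `schema_skeleton` is a hypothetical name, not something the gem defines. It takes CSV text rather than a path, which avoids passing a user-supplied argument to `Kernel#open`, and it gives the block variable a different name so it does not shadow the outer `name`.

```ruby
require "csv"
require "json"

# Hypothetical helper, not part of the gem: builds the skeleton that
# bin/create_schema prints, from CSV text and a title.
def schema_skeleton(csv_text, title)
  headers = CSV.parse_line(csv_text) || [] # nil when the text is empty
  fields = headers.map do |header|
    { "name" => header, "title" => "", "description" => "", "constraints" => {} }
  end
  {
    "title" => title,
    "description" => "Auto generated schema for #{title}",
    "fields" => fields
  }
end

puts JSON.pretty_generate(
  schema_skeleton("id,price,postcode\n1,9.99,SW1A 1AA\n", "example.csv")
)
```

To use it with a file, something like `schema_skeleton(File.read(path), File.basename(path))` would work, assuming the file fits in memory.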
data/bin/csvlint ADDED
@@ -0,0 +1,10 @@
1
+ #!/usr/bin/env ruby
2
+ $:.unshift File.join( File.dirname(__FILE__), "..", "lib")
3
+
4
+ require 'csvlint/cli'
5
+
6
+ if ARGV == ["help"]
7
+ Csvlint::Cli.start(["help"])
8
+ else
9
+ Csvlint::Cli.start(ARGV.unshift("validate"))
10
+ end
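Reviewer note on `bin/csvlint`: the wrapper makes `validate` the implicit subcommand unless the only argument is `help`. A minimal sketch of that dispatch as a pure function follows. `cli_args` is a hypothetical name, and unlike `ARGV.unshift` it does not mutate its input.

```ruby
# Hypothetical pure-function version of the dispatch in bin/csvlint:
# returns the argument list that would be handed to Csvlint::Cli.start.
def cli_args(argv)
  argv == ["help"] ? ["help"] : ["validate", *argv]
end

p cli_args(["help"])
p cli_args(["myfile.csv", "--schema=schema.json"])
```

One consequence of this design is that `csvlint help validate` is routed to `validate` with `help` as its first argument, because only the exact argument list `["help"]` is special-cased.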
data/features/check_format.feature ADDED
@@ -0,0 +1,46 @@
1
+ Feature: Check inconsistent formatting
2
+
3
+ Scenario: Inconsistent formatting for integers
4
+ Given I have a CSV with the following content:
5
+ """
6
+ "1","2","3"
7
+ "Foo","5","6"
8
+ "3","2","1"
9
+ "3","2","1"
10
+ """
11
+ And it is stored at the url "http://example.com/example1.csv"
12
+ And I ask if there are warnings
13
+ Then there should be 1 warnings
14
+ And that warning should have the type "inconsistent_values"
15
+ And that warning should have the column "1"
16
+
17
+ Scenario: Inconsistent formatting for alpha fields
18
+ Given I have a CSV with the following content:
19
+ """
20
+ "Foo","Bar","Baz"
21
+ "Biz","1","Baff"
22
+ "Boff","Giff","Goff"
23
+ "Boff","Giff","Goff"
24
+ """
25
+ And it is stored at the url "http://example.com/example1.csv"
26
+ And I ask if there are warnings
27
+ Then there should be 1 warnings
28
+ And that warning should have the type "inconsistent_values"
29
+ And that warning should have the column "2"
30
+
31
+ Scenario: Inconsistent formatting for alphanumeric fields
32
+ Given I have a CSV with the following content:
33
+ """
34
+ "Foo 123","Bar","Baz"
35
+ "1","Bar","Baff"
36
+ "Boff 432423","Giff","Goff"
37
+ "Boff444","Giff","Goff"
38
+ """
39
+ And it is stored at the url "http://example.com/example1.csv"
40
+ And I ask if there are warnings
41
+ Then there should be 1 warnings
42
+ And that warning should have the type "inconsistent_values"
43
+ And that warning should have the column "1"
44
+
45
+
46
+
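Reviewer note on the scenarios above: the README describes `:inconsistent_values` as reported when fewer than 90% of a column's values share a data type (numeric or alphanumeric). The sketch below is one way to express that rule with the standard library. It is not the gem's code. `value_kind` and `inconsistent_columns` are hypothetical names, the numeric regular expression is an assumption, and every row is treated as data with no header handling.

```ruby
require "csv"

# Hypothetical classifier: the README only distinguishes numeric from
# alphanumeric values, so this sketch does the same.
def value_kind(value)
  value.to_s.match?(/\A[-+]?\d+(\.\d+)?\z/) ? :numeric : :alphanumeric
end

# Returns the 1-based numbers of columns in which the most common kind
# covers less than `threshold` of the non-empty values.
def inconsistent_columns(csv_text, threshold: 0.9)
  rows = CSV.parse(csv_text)
  width = rows.map(&:length).max || 0
  flagged = (0...width).select do |col|
    values = rows.map { |row| row[col] }.reject { |v| v.nil? || v.empty? }
    next false if values.empty?

    counts = values.map { |v| value_kind(v) }.tally
    counts.values.max.to_f / values.length < threshold
  end
  flagged.map { |col| col + 1 }
end

integers = <<~CSV
  "1","2","3"
  "Foo","5","6"
  "3","2","1"
  "3","2","1"
CSV

p inconsistent_columns(integers) # prints [1]
```

In the first scenario's data, column 1 has three numeric values out of four (75%), which is below the 90% threshold, so only that column is flagged. This matches the single warning the scenario expects.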