csvreader 1.2.4 → 1.2.5
- checksums.yaml +5 -5
- data/{HISTORY.md → CHANGELOG.md} +3 -3
- data/Manifest.txt +1 -2
- data/README.md +682 -682
- data/Rakefile +33 -32
- data/datasets/cars11.csv +10 -10
- data/datasets/cities11.csv +12 -12
- data/datasets/customers11.csv +13 -13
- data/datasets/iris.attrib.csv +25 -25
- data/datasets/iris11.csv +163 -163
- data/datasets/lcc.attrib.csv +14 -14
- data/datasets/shakespeare.csv +9 -9
- data/lib/csvreader/base.rb +6 -2
- data/lib/csvreader/buffer.rb +0 -1
- data/lib/csvreader/builder.rb +0 -1
- data/lib/csvreader/converter.rb +0 -1
- data/lib/csvreader/parser.rb +32 -33
- data/lib/csvreader/parser_fixed.rb +105 -106
- data/lib/csvreader/parser_json.rb +23 -24
- data/lib/csvreader/parser_std.rb +582 -583
- data/lib/csvreader/parser_strict.rb +290 -291
- data/lib/csvreader/parser_tab.rb +22 -23
- data/lib/csvreader/parser_table.rb +122 -123
- data/lib/csvreader/parser_yaml.rb +23 -24
- data/lib/csvreader/reader.rb +2 -3
- data/lib/csvreader/reader_hash.rb +1 -2
- data/lib/csvreader/version.rb +30 -32
- data/lib/csvreader.rb +0 -1
- data/test/test_parser_formats.rb +66 -66
- data/test/test_parser_java.rb +208 -208
- metadata +18 -15
- data/LICENSE.md +0 -116
data/README.md
CHANGED
@@ -1,682 +1,682 @@
|
|
1
|
-
# csvreader - read tabular data in the comma-separated values (csv) format the right way (uses best practices out-of-the-box with zero-configuration)
|
2
|
-
|
3
|
-
|
4
|
-
* home :: [github.com/csvreader/csvreader](https://github.com/csvreader/csvreader)
|
5
|
-
* bugs :: [github.com/csvreader/csvreader/issues](https://github.com/csvreader/csvreader/issues)
|
6
|
-
* gem :: [rubygems.org/gems/csvreader](https://rubygems.org/gems/csvreader)
|
7
|
-
* rdoc :: [rubydoc.info/gems/csvreader](http://rubydoc.info/gems/csvreader)
|
8
|
-
* forum :: [wwwmake](http://groups.google.com/group/wwwmake)
|
9
|
-
|
10
|
-
|
11
|
-
|
12
|
-
|
13
|
-
## What's New?
|
14
|
-
|
15
|
-
**v1.2.2** Added auto-fix/correction/recovery
|
16
|
-
for double quoted value with extra trailing value
|
17
|
-
to the default parser (`ParserStd`) e.g. `"Freddy" Mercury`
|
18
|
-
will get read "as is" and turned
|
19
|
-
into an "unquoted" value with "literal" quotes e.g. `"Freddy" Mercury`.
|
20
|
-
|
21
|
-
|
22
|
-
**v1.2.1** Added support for (optional) hashtag
|
23
|
-
to the default parser (`ParserStd`) for
|
24
|
-
supporting the [Humanitarian eXchange Language (HXL)](https://github.com/csvspecs/csv-hxl).
|
25
|
-
Default is turned off (`false`). Use `Csv.human`
|
26
|
-
or `Csv.hum` or `Csv.hxl` for pre-defined parsers with hashtag turned on.
|
27
|
-
|
28
|
-
|
29
|
-
**v1.2** Added support for alternative (non-space) separators (e.g. `;|^:`)
|
30
|
-
to the default parser (`ParserStd`).
|
31
|
-
|
32
|
-
|
33
|
-
**v1.1.5** Added built-in support for (optional) alternative space
|
34
|
-
character
|
35
|
-
(e.g. `_-+•`)
|
36
|
-
to the default parser (`ParserStd`) and the table parser (`ParserTable`).
|
37
|
-
Turns `Man_Utd` into `Man Utd`, for example. Default is turned off (`nil`).
|
38
|
-
|
39
|
-
|
40
|
-
**v1.1.4** Added new "classic" table parser (see `ParserTable`) for supporting fields separated by (one or more) spaces
|
41
|
-
e.g. `Csv.table.parse( txt )`.
|
42
|
-
|
43
|
-
|
44
|
-
**v1.1.3**: Added built-in support for french single and double quotes / guillemets (`‹› «»`) to default parser ("The Right Way").
|
45
|
-
Now you can use both, that is, single (`‹...›` or `›...‹`)
|
46
|
-
or double (`«...»` or `»...«`).
|
47
|
-
Note: A quote only "kicks-in" if it's the first (non-whitespace)
|
48
|
-
character of the value (otherwise it's just a "vanilla" literal character).
|
49
|
-
|
50
|
-
|
51
|
-
**v1.1.2**: Added built-in support for single quotes (`'`) to default parser ("The Right Way").
|
52
|
-
Now you can use both, that is, single (`'...'`) or double quotes (`"..."`)
|
53
|
-
like in ruby (or javascript or html or ...) :-).
|
54
|
-
Note: A quote only "kicks-in" if it's the first (non-whitespace)
|
55
|
-
character of the value (otherwise it's just a "vanilla" literal character)
|
56
|
-
e.g. `48°51'24"N` needs no quote :-).
|
57
|
-
With the "strict" parser you will get a firework of "stray" quote errors / exceptions.
|
58
|
-
|
59
|
-
|
60
|
-
|
61
|
-
**v1.1.1**: Added built-in support for (optional) alternative comments (`%`) - used by
|
62
|
-
[ARFF (attribute-relation file format)](https://github.com/csvspecs/csv-meta#attribute-relation-classic) -
|
63
|
-
and support for (optional) directives (`@`) in header (that is, before any records)
|
64
|
-
to default parser ("The Right Way").
|
65
|
-
Now you can use either `#` or `%` for comments, the first one "wins" - you CANNOT use both.
|
66
|
-
Now you can use either a front matter (`---`) block
|
67
|
-
or directives (e.g. `@attribute`, `@relation`, etc.)
|
68
|
-
for meta data, the first one "wins" - you CANNOT use both.
|
69
|
-
|
70
|
-
|
71
|
-
**v1.1.0**: Added new fixed width field (fwf) parser (see `ParserFixed`) for supporting fields with fixed width (and no separator)
|
72
|
-
e.g. `Csv.fixed.parse( txt, width: [8,-2,8,-3,32,-2,14] )`.
|
73
|
-
|
74
|
-
|
75
|
-
**v1.0.3**: Added built-in support for an (optional) front matter (`---`) meta data block
|
76
|
-
in header (that is, before any records)
|
77
|
-
to default parser ("The Right Way") - used by [CSVY (yaml front matter for csv file format)](https://github.com/csvspecs/csv-meta#front-matter-in-yaml).
|
78
|
-
Use `Csv.parser.meta` to get the parsed meta data block hash (or `nil` if none).
|
79
|
-
|
80
|
-
|
81
|
-
|
82
|
-
|
83
|
-
## Usage
|
84
|
-
|
85
|
-
|
86
|
-
``` ruby
|
87
|
-
txt = <<TXT
|
88
|
-
1,2,3
|
89
|
-
4,5,6
|
90
|
-
TXT
|
91
|
-
|
92
|
-
records = Csv.parse( txt ) ## or CsvReader.parse
|
93
|
-
pp records
|
94
|
-
# => [["1","2","3"],
|
95
|
-
# ["4","5","6"]]
|
96
|
-
|
97
|
-
# -or-
|
98
|
-
|
99
|
-
records = Csv.read( "values.csv" ) ## or CsvReader.read
|
100
|
-
pp records
|
101
|
-
# => [["1","2","3"],
|
102
|
-
# ["4","5","6"]]
|
103
|
-
|
104
|
-
# -or-
|
105
|
-
|
106
|
-
Csv.foreach( "values.csv" ) do |rec| ## or CsvReader.foreach
|
107
|
-
pp rec
|
108
|
-
end
|
109
|
-
# => ["1","2","3"]
|
110
|
-
# => ["4","5","6"]
|
111
|
-
```
|
112
|
-
|
113
|
-
|
114
|
-
### What about type inference and data converters?
|
115
|
-
|
116
|
-
Use the converters keyword option to (auto-)convert strings to nulls, booleans, integers, floats, dates, etc.
|
117
|
-
Example:
|
118
|
-
|
119
|
-
``` ruby
|
120
|
-
txt = <<TXT
|
121
|
-
1,2,3
|
122
|
-
true,false,null
|
123
|
-
TXT
|
124
|
-
|
125
|
-
records = Csv.parse( txt, :converters => :all ) ## or CsvReader.parse
|
126
|
-
pp records
|
127
|
-
# => [[1,2,3],
|
128
|
-
# [true,false,nil]]
|
129
|
-
```
|
130
|
-
|
131
|
-
|
132
|
-
Built-in converters include:
|
133
|
-
|
134
|
-
| Converter | Comments |
|
135
|
-
|--------------|-------------------|
|
136
|
-
| `:integer` | convert matching strings to integer |
|
137
|
-
| `:float` | convert matching strings to float |
|
138
|
-
| `:numeric` | shortcut for `[:integer, :float]` |
|
139
|
-
| `:date` | convert matching strings to `Date` (year/month/day) |
|
140
|
-
| `:date_time` | convert matching strings to `DateTime` |
|
141
|
-
| `:null` | convert matching strings to null (`nil`) |
|
142
|
-
| `:boolean` | convert matching strings to boolean (`true` or `false`) |
|
143
|
-
| `:all` | shortcut for `[:null, :boolean, :date_time, :numeric]` |
|
144
|
-
|
145
|
-
|
146
|
-
Or add your own converters. Example:
|
147
|
-
|
148
|
-
``` ruby
|
149
|
-
Csv.parse( 'Ruby, 2020-03-01, 100', converters: [->(v) { Time.parse(v) rescue v }] )
|
150
|
-
#=> [["Ruby", 2020-03-01 00:00:00 +0200, "100"]]
|
151
|
-
```
|
152
|
-
|
153
|
-
A custom converter is a method that gets the value passed in
|
154
|
-
and if successful returns a non-string type (e.g. integer, float, date, etc.)
|
155
|
-
or a string (for further processing with all other converters in the "pipeline" configuration).
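To illustrate the "pipeline" idea, here is a minimal stand-alone sketch in plain Ruby. It is not csvreader's internal code; the names `to_integer`, `to_float`, `to_date` and `convert` are made up for this example. Each converter returns either a converted (non-string) value or the unchanged string, and the pipeline stops converting as soon as the value is no longer a string.

```ruby
require 'date'

# Hypothetical converters (names are assumptions for this sketch only).
# Each one returns a converted value on success or the unchanged string.
to_integer = ->(value) { Integer(value, 10, exception: false) || value }
to_float   = ->(value) { Float(value, exception: false) || value }
to_date    = lambda do |value|
  begin
    Date.strptime(value, '%Y-%m-%d')
  rescue ArgumentError
    value   # not a date, hand the string on unchanged
  end
end

# Run the value through the converters; skip the rest once it is converted.
def convert(value, converters)
  converters.reduce(value) do |current, converter|
    current.is_a?(String) ? converter.call(current) : current
  end
end

pipeline = [to_integer, to_float, to_date]
row = ['Ruby', '2020-03-01', '100', '5.6'].map { |value| convert(value, pipeline) }
pp row
```

Order matters in such a pipeline: the integer converter runs before the float converter so that `"100"` ends up as an integer and not as `100.0`.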
|
156
|
-
|
157
|
-
|
158
|
-
|
159
|
-
### What about Enumerable?
|
160
|
-
|
161
|
-
Yes, every reader includes `Enumerable` and runs on `each`.
|
162
|
-
Use `new` or `open` without a block
|
163
|
-
to get the enumerator (iterator).
|
164
|
-
Example:
|
165
|
-
|
166
|
-
|
167
|
-
``` ruby
|
168
|
-
csv = Csv.new( "a,b,c" )
|
169
|
-
it = csv.to_enum
|
170
|
-
pp it.next
|
171
|
-
# => ["a","b","c"]
|
172
|
-
|
173
|
-
# -or-
|
174
|
-
|
175
|
-
csv = Csv.open( "values.csv" )
|
176
|
-
it = csv.to_enum
|
177
|
-
pp it.next
|
178
|
-
# => ["1","2","3"]
|
179
|
-
pp it.next
|
180
|
-
# => ["4","5","6"]
|
181
|
-
```
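What "includes `Enumerable` and runs on `each`" means can be sketched with a toy reader in plain Ruby. `TinyReader` is a hypothetical class for this example only; it uses a naive comma split that ignores quotes, so it only shows the enumerator mechanics and is no substitute for a real parser.

```ruby
# Hypothetical toy reader (not csvreader's code): a naive comma split that
# ignores quotes, used only to show the Enumerable / enumerator mechanics.
class TinyReader
  include Enumerable

  def initialize(text)
    @text = text
  end

  # Yields one record (array of strings) per non-blank line.
  def each
    return to_enum(:each) unless block_given?

    @text.each_line do |line|
      line = line.strip
      next if line.empty?

      yield line.split(',').map(&:strip)
    end
  end
end

reader = TinyReader.new("1,2,3\n\n4,5,6\n")
it = reader.to_enum
pp it.next
# => ["1", "2", "3"]
pp it.next
# => ["4", "5", "6"]

# Everything else in Enumerable (map, select, first, ...) comes for free.
pp reader.map { |rec| rec.map(&:to_i).sum }
# => [6, 15]
```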
|
182
|
-
|
183
|
-
|
184
|
-
|
185
|
-
|
186
|
-
|
187
|
-
### What about headers?
|
188
|
-
|
189
|
-
Use the `CsvHash`
|
190
|
-
if the first line is a header (or if missing pass in the headers
|
191
|
-
as an array) and you want your records as hashes instead of arrays of strings.
|
192
|
-
Example:
|
193
|
-
|
194
|
-
``` ruby
|
195
|
-
txt = <<TXT
|
196
|
-
A,B,C
|
197
|
-
1,2,3
|
198
|
-
4,5,6
|
199
|
-
TXT
|
200
|
-
|
201
|
-
records = CsvHash.parse( txt ) ## or CsvHashReader.parse
|
202
|
-
pp records
|
203
|
-
|
204
|
-
# -or-
|
205
|
-
|
206
|
-
txt2 = <<TXT
|
207
|
-
1,2,3
|
208
|
-
4,5,6
|
209
|
-
TXT
|
210
|
-
|
211
|
-
records = CsvHash.parse( txt2, headers: ["A","B","C"] ) ## or CsvHashReader.parse
|
212
|
-
pp records
|
213
|
-
|
214
|
-
# => [{"A": "1", "B": "2", "C": "3"},
|
215
|
-
# {"A": "4", "B": "5", "C": "6"}]
|
216
|
-
|
217
|
-
# -or-
|
218
|
-
|
219
|
-
records = CsvHash.read( "hash.csv" ) ## or CsvHashReader.read
|
220
|
-
pp records
|
221
|
-
# => [{"A": "1", "B": "2", "C": "3"},
|
222
|
-
# {"A": "4", "B": "5", "C": "6"}]
|
223
|
-
|
224
|
-
# -or-
|
225
|
-
|
226
|
-
CsvHash.foreach( "hash.csv" ) do |rec| ## or CsvHashReader.foreach
|
227
|
-
pp rec
|
228
|
-
end
|
229
|
-
# => {"A": "1", "B": "2", "C": "3"}
|
230
|
-
# => {"A": "4", "B": "5", "C": "6"}
|
231
|
-
```
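Conceptually, turning records into hashes is a zip of the headers with each record. The following plain-Ruby sketch shows both cases (header in the first row, or headers passed in); the helper name `to_hashes` is an assumption for this example and not part of the library.

```ruby
# Plain-Ruby sketch (an assumption, not csvreader's implementation).
rows = [%w[A B C], %w[1 2 3], %w[4 5 6]]

# Case 1: the first row is the header.
headers, *records = rows
hashes = records.map { |record| headers.zip(record).to_h }

# Case 2: no header row in the data, so pass the headers in yourself.
def to_hashes(records, headers)
  records.map { |record| headers.zip(record).to_h }
end

same = to_hashes([%w[1 2 3], %w[4 5 6]], %w[A B C])

pp hashes == same
# => true
```

Note that the keys are strings here; converting them to symbols is a separate step.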
|
232
|
-
|
233
|
-
|
234
|
-
### What about symbol keys for hashes?
|
235
|
-
|
236
|
-
Yes, you can use the header_converters keyword option.
|
237
|
-
Use `:symbol` for (auto-)converting header (strings) to symbols.
|
238
|
-
Note: the symbol converter will also downcase all letters and
|
239
|
-
remove all non-alphanumeric (e.g. `!?$%`) chars
|
240
|
-
and replace spaces with underscores.
|
241
|
-
|
242
|
-
Example:
|
243
|
-
|
244
|
-
``` ruby
|
245
|
-
txt = <<TXT
|
246
|
-
a,b,c
|
247
|
-
1,2,3
|
248
|
-
true,false,null
|
249
|
-
TXT
|
250
|
-
|
251
|
-
records = CsvHash.parse( txt, :converters => :all, :header_converters => :symbol )
|
252
|
-
pp records
|
253
|
-
# => [{a: 1, b: 2, c: 3},
|
254
|
-
# {a: true, b: false, c: nil}]
|
255
|
-
|
256
|
-
# -or-
|
257
|
-
options = { :converters => :all,
|
258
|
-
:header_converters => :symbol }
|
259
|
-
|
260
|
-
records = CsvHash.parse( txt, options )
|
261
|
-
pp records
|
262
|
-
# => [{a: 1, b: 2, c: 3},
|
263
|
-
# {a: true, b: false, c: nil}]
|
264
|
-
```
|
265
|
-
|
266
|
-
Built-in header converters include:
|
267
|
-
|
268
|
-
| Converter | Comments |
|
269
|
-
|--------------|---------------------|
|
270
|
-
| `:downcase` | downcase strings |
|
271
|
-
| `:symbol` | convert strings to symbols (and downcase and remove non-alphanumerics) |
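The rules described for `:symbol` (downcase, remove non-alphanumerics, replace spaces with underscores) can be approximated in a few lines of plain Ruby. This is a hypothetical stand-in named `symbolize`, written from the description above; the built-in converter may differ in details, for example in how it treats non-ASCII letters or existing underscores.

```ruby
# Hypothetical stand-in for a `:symbol`-style header converter; the exact
# rules of the built-in converter may differ.
symbolize = lambda do |header|
  header.downcase
        .gsub(/[^a-z0-9_ ]/, '')   # keep only letters, digits, underscores, spaces
        .strip                     # drop leading / trailing spaces left over
        .gsub(/ +/, '_')           # turn (runs of) spaces into underscores
        .to_sym
end

pp ['Brewery', 'City', 'Alcohol by Volume (%)'].map(&symbolize)
# => [:brewery, :city, :alcohol_by_volume]
```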
|
272
|
-
|
273
|
-
|
274
|
-
|
275
|
-
### What about (typed) structs?
|
276
|
-
|
277
|
-
See the [csvrecord library »](https://github.com/csvreader/csvrecord)
|
278
|
-
|
279
|
-
Example from the csvrecord docs:
|
280
|
-
|
281
|
-
Step 1: Define a (typed) struct for the comma-separated values (csv) records. Example:
|
282
|
-
|
283
|
-
```ruby
|
284
|
-
require 'csvrecord'
|
285
|
-
|
286
|
-
Beer = CsvRecord.define do
|
287
|
-
field :brewery ## note: default type is :string
|
288
|
-
field :city
|
289
|
-
field :name
|
290
|
-
field :abv, Float ## allows type specified as class (or use :float)
|
291
|
-
end
|
292
|
-
```
|
293
|
-
|
294
|
-
or in "classic" style:
|
295
|
-
|
296
|
-
```ruby
|
297
|
-
class Beer < CsvRecord::Base
|
298
|
-
field :brewery
|
299
|
-
field :city
|
300
|
-
field :name
|
301
|
-
field :abv, Float
|
302
|
-
end
|
303
|
-
```
|
304
|
-
|
305
|
-
|
306
|
-
Step 2: Read in the comma-separated values (csv) datafile. Example:
|
307
|
-
|
308
|
-
```ruby
|
309
|
-
beers = Beer.read( 'beer.csv' )
|
310
|
-
|
311
|
-
puts "#{beers.size} beers:"
|
312
|
-
pp beers
|
313
|
-
```
|
314
|
-
|
315
|
-
pretty prints (pp):
|
316
|
-
|
317
|
-
```
|
318
|
-
6 beers:
|
319
|
-
[#<Beer:0x302c760 @values=
|
320
|
-
["Andechser Klosterbrauerei", "Andechs", "Doppelbock Dunkel", 7.0]>,
|
321
|
-
#<Beer:0x3026fe8 @values=
|
322
|
-
["Augustiner Br\u00E4u M\u00FCnchen", "M\u00FCnchen", "Edelstoff", 5.6]>,
|
323
|
-
#<Beer:0x30257a0 @values=
|
324
|
-
["Bayerische Staatsbrauerei Weihenstephan", "Freising", "Hefe Weissbier", 5.4]>,
|
325
|
-
...
|
326
|
-
]
|
327
|
-
```
|
328
|
-
|
329
|
-
Or loop over the records. Example:
|
330
|
-
|
331
|
-
``` ruby
|
332
|
-
Beer.read( 'beer.csv' ).each do |rec|
|
333
|
-
puts "#{rec.name} (#{rec.abv}%) by #{rec.brewery}, #{rec.city}"
|
334
|
-
end
|
335
|
-
|
336
|
-
# -or-
|
337
|
-
|
338
|
-
Beer.foreach( 'beer.csv' ) do |rec|
|
339
|
-
puts "#{rec.name} (#{rec.abv}%) by #{rec.brewery}, #{rec.city}"
|
340
|
-
end
|
341
|
-
```
|
342
|
-
|
343
|
-
|
344
|
-
printing:
|
345
|
-
|
346
|
-
```
|
347
|
-
Doppelbock Dunkel (7.0%) by Andechser Klosterbrauerei, Andechs
|
348
|
-
Edelstoff (5.6%) by Augustiner Bräu München, München
|
349
|
-
Hefe Weissbier (5.4%) by Bayerische Staatsbrauerei Weihenstephan, Freising
|
350
|
-
Rauchbier Märzen (5.1%) by Brauerei Spezial, Bamberg
|
351
|
-
Münchner Dunkel (5.0%) by Hacker-Pschorr Bräu, München
|
352
|
-
Hofbräu Oktoberfestbier (6.3%) by Staatliches Hofbräuhaus München, München
|
353
|
-
```
|
354
|
-
|
355
|
-
|
356
|
-
### What about tabular data packages with pre-defined types / schemas?
|
357
|
-
|
358
|
-
See the [csvpack library »](https://github.com/csvreader/csvpack)
|
359
|
-
|
360
|
-
|
361
|
-
|
362
|
-
|
363
|
-
|
364
|
-
## Frequently Asked Questions (FAQ) and Answers
|
365
|
-
|
366
|
-
### Q: What's CSV the right way? What best practices can I use?
|
367
|
-
|
368
|
-
Use best practices out-of-the-box with zero-configuration.
|
369
|
-
Do you know how to skip blank lines or how to add `#` single-line comments?
|
370
|
-
Or how to trim leading and trailing spaces? No worries. It's turned on by default.
|
371
|
-
|
372
|
-
Yes, you can. Use
|
373
|
-
|
374
|
-
```
|
375
|
-
#######
|
376
|
-
# try with some comments
|
377
|
-
# and blank lines even before header (first row)
|
378
|
-
|
379
|
-
Brewery,City,Name,Abv
|
380
|
-
Andechser Klosterbrauerei,Andechs,Doppelbock Dunkel,7%
|
381
|
-
Augustiner Bräu München,München,Edelstoff,5.6%
|
382
|
-
|
383
|
-
Bayerische Staatsbrauerei Weihenstephan, Freising, Hefe Weissbier, 5.4%
|
384
|
-
Brauerei Spezial, Bamberg, Rauchbier Märzen, 5.1%
|
385
|
-
Hacker-Pschorr Bräu, München, Münchner Dunkel, 5.0%
|
386
|
-
Staatliches Hofbräuhaus München, München, Hofbräu Oktoberfestbier, 6.3%
|
387
|
-
```
|
388
|
-
|
389
|
-
instead of strict "classic"
|
390
|
-
(no blank lines, no comments, no leading and trailing spaces, etc.):
|
391
|
-
|
392
|
-
```
|
393
|
-
Brewery,City,Name,Abv
|
394
|
-
Andechser Klosterbrauerei,Andechs,Doppelbock Dunkel,7%
|
395
|
-
Augustiner Bräu München,München,Edelstoff,5.6%
|
396
|
-
Bayerische Staatsbrauerei Weihenstephan,Freising,Hefe Weissbier,5.4%
|
397
|
-
Brauerei Spezial,Bamberg,Rauchbier Märzen,5.1%
|
398
|
-
Hacker-Pschorr Bräu,München,Münchner Dunkel,5.0%
|
399
|
-
Staatliches Hofbräuhaus München,München,Hofbräu Oktoberfestbier,6.3%
|
400
|
-
```
|
401
|
-
|
402
|
-
|
403
|
-
Or use the ARFF (attribute-relation file format)-like alternative style
|
404
|
-
with `%` for comments and `@`-directives
|
405
|
-
for "meta data" in the header (before any records):
|
406
|
-
|
407
|
-
```
|
408
|
-
%%%%%%%%%%%%%%%%%%
|
409
|
-
% try with some comments
|
410
|
-
% and blank lines even before @-directives in header
|
411
|
-
|
412
|
-
@RELATION Beer
|
413
|
-
|
414
|
-
@ATTRIBUTE Brewery
|
415
|
-
@ATTRIBUTE City
|
416
|
-
@ATTRIBUTE Name
|
417
|
-
@ATTRIBUTE Abv
|
418
|
-
|
419
|
-
@DATA
|
420
|
-
Andechser Klosterbrauerei,Andechs,Doppelbock Dunkel,7%
|
421
|
-
Augustiner Bräu München,München,Edelstoff,5.6%
|
422
|
-
|
423
|
-
Bayerische Staatsbrauerei Weihenstephan, Freising, Hefe Weissbier, 5.4%
|
424
|
-
Brauerei Spezial, Bamberg, Rauchbier Märzen, 5.1%
|
425
|
-
Hacker-Pschorr Bräu, München, Münchner Dunkel, 5.0%
|
426
|
-
Staatliches Hofbräuhaus München, München, Hofbräu Oktoberfestbier, 6.3%
|
427
|
-
```
|
428
|
-
|
429
|
-
Or use the ARFF (attribute-relation file format)-like alternative style with `@`-directives
|
430
|
-
inside comments (for easier backwards compatibility with old readers)
|
431
|
-
for "meta data" in the header (before any records):
|
432
|
-
|
433
|
-
```
|
434
|
-
##########################
|
435
|
-
# try with some comments
|
436
|
-
# and blank lines even before @-directives in header
|
437
|
-
#
|
438
|
-
# @RELATION Beer
|
439
|
-
#
|
440
|
-
# @ATTRIBUTE Brewery
|
441
|
-
# @ATTRIBUTE City
|
442
|
-
# @ATTRIBUTE Name
|
443
|
-
# @ATTRIBUTE Abv
|
444
|
-
|
445
|
-
Andechser Klosterbrauerei,Andechs,Doppelbock Dunkel,7%
|
446
|
-
Augustiner Bräu München,München,Edelstoff,5.6%
|
447
|
-
|
448
|
-
Bayerische Staatsbrauerei Weihenstephan, Freising, Hefe Weissbier, 5.4%
|
449
|
-
Brauerei Spezial, Bamberg, Rauchbier Märzen, 5.1%
|
450
|
-
Hacker-Pschorr Bräu, München, Münchner Dunkel, 5.0%
|
451
|
-
Staatliches Hofbräuhaus München, München, Hofbräu Oktoberfestbier, 6.3%
|
452
|
-
```
|
453
|
-
|
454
|
-
|
455
|
-
|
456
|
-
### Q: How can I change the default format / dialect?
|
457
|
-
|
458
|
-
The reader includes more than half a dozen pre-configured formats,
|
459
|
-
dialects.
|
460
|
-
|
461
|
-
Use strict if you do NOT want to trim leading and trailing spaces
|
462
|
-
and if you do NOT want to skip blank lines. Example:
|
463
|
-
|
464
|
-
``` ruby
|
465
|
-
txt = <<TXT
|
466
|
-
1, 2,3
|
467
|
-
4,5 ,6
|
468
|
-
|
469
|
-
TXT
|
470
|
-
|
471
|
-
records = Csv.strict.parse( txt )
|
472
|
-
pp records
|
473
|
-
# => [["1","•2","3"],
|
474
|
-
# ["4","5•","6"],
|
475
|
-
# [""]]
|
476
|
-
```
|
477
|
-
|
478
|
-
More strict pre-configured variants include:
|
479
|
-
|
480
|
-
`Csv.mysql` uses:
|
481
|
-
|
482
|
-
``` ruby
|
483
|
-
ParserStrict.new( sep: "\t",
|
484
|
-
quote: false,
|
485
|
-
escape: true,
|
486
|
-
null: "\\N" )
|
487
|
-
```
|
488
|
-
|
489
|
-
`Csv.postgres` or `Csv.postgresql` uses:
|
490
|
-
|
491
|
-
``` ruby
|
492
|
-
ParserStrict.new( doublequote: false,
|
493
|
-
escape: true,
|
494
|
-
null: "" )
|
495
|
-
```
|
496
|
-
|
497
|
-
`Csv.postgres_text` or `Csv.postgresql_text` uses:
|
498
|
-
|
499
|
-
``` ruby
|
500
|
-
ParserStrict.new( sep: "\t",
|
501
|
-
quote: false,
|
502
|
-
escape: true,
|
503
|
-
null: "\\N" )
|
504
|
-
```
|
505
|
-
|
506
|
-
and so on.
|
507
|
-
|
508
|
-
|
509
|
-
### Q: How can I change the separator to semicolon (`;`) or pipe (`|`) or tab (`\t`)?
|
510
|
-
|
511
|
-
Pass in the `sep` keyword option
|
512
|
-
to the parser. Example:
|
513
|
-
|
514
|
-
``` ruby
|
515
|
-
Csv.parse( ..., sep: ';' )
|
516
|
-
Csv.read( ..., sep: ';' )
|
517
|
-
# ...
|
518
|
-
Csv.parse( ..., sep: '|' )
|
519
|
-
Csv.read( ..., sep: '|' )
|
520
|
-
# and so on
|
521
|
-
```
|
522
|
-
|
523
|
-
Note: If you use tab (`\t`) use the `TabReader`
|
524
|
-
(or for your convenience the built-in `Csv.tab` alias)!
|
525
|
-
If you use the "classic" one or more space or tab (`/[ \t]+/`) regex
|
526
|
-
use the `TableReader`
|
527
|
-
(or for your convenience the built-in `Csv.table` alias)!
|
528
|
-
|
529
|
-
|
530
|
-
Note: The default ("The Right Way") parser does NOT allow space or tab
|
531
|
-
as separator (because leading and trailing space always gets trimmed
|
532
|
-
unless inside quotes, etc.). Use the `strict` parser if you want
|
533
|
-
to make up your own format with space or tab as a separator
|
534
|
-
or if you want that every space or tab counts (is significant).
|
535
|
-
|
536
|
-
|
537
|
-
|
538
|
-
Aside: Why? Tab != CSV. Yes, tab is
|
539
|
-
its own (even) simpler format
|
540
|
-
(e.g. no escape rules, no newlines in values, etc.),
|
541
|
-
see [`TabReader` »](https://github.com/csvreader/tabreader).
|
542
|
-
|
543
|
-
``` ruby
|
544
|
-
Csv.tab.parse( ... ) # note: "classic" strict tab format
|
545
|
-
Csv.tab.read( ... )
|
546
|
-
# ...
|
547
|
-
|
548
|
-
Csv.table.parse( ... ) # note: "classic" one or more space (or tab) table format
|
549
|
-
Csv.table.read( ... )
|
550
|
-
# ...
|
551
|
-
```
|
552
|
-
|
553
|
-
If you want double quote escape rules, newlines in quoted values, etc. use
|
554
|
-
the "strict" parser with the separator (`sep`) changed to tab (`\t`).
|
555
|
-
|
556
|
-
``` ruby
|
557
|
-
Csv.strict.parse( ..., sep: "\t" ) # note: csv-like tab format with quotes
|
558
|
-
Csv.strict.read( ..., sep: "\t" )
|
559
|
-
# ...
|
560
|
-
```
|
561
|
-
|
562
|
-
|
563
|
-
|
564
|
-
|
565
|
-
### Q: How can I read records with fixed width fields (and no separator)?
|
566
|
-
|
567
|
-
Pass in the `width` keyword option with the field widths / lengths
|
568
|
-
to the "fixed" parser. Example:
|
569
|
-
|
570
|
-
``` ruby
|
571
|
-
txt = <<TXT
|
572
|
-
12345678123456781234567890123456789012345678901212345678901234
|
573
|
-
TXT
|
574
|
-
|
575
|
-
Csv.fixed.parse( txt, width: [8,8,32,14] ) # or Csv.fix or Csv.f
|
576
|
-
# => [["12345678","12345678", "12345678901234567890123456789012", "12345678901234"]]
|
577
|
-
|
578
|
-
|
579
|
-
txt = <<TXT
|
580
|
-
John    Smith   john@example.com                1-888-555-6666
|
581
|
-
Michele O'Reileymichele@example.com             1-333-321-8765
|
582
|
-
TXT
|
583
|
-
|
584
|
-
Csv.fixed.parse( txt, width: [8,8,32,14] ) # or Csv.fix or Csv.f
|
585
|
-
# => [["John", "Smith", "john@example.com", "1-888-555-6666"],
|
586
|
-
# ["Michele", "O'Reiley", "michele@example.com", "1-333-321-8765"]]
|
587
|
-
|
588
|
-
# and so on
|
589
|
-
```
|
590
|
-
|
591
|
-
<!--
|
592
|
-
Note: You can use for your convenience the built-in
|
593
|
-
`Csv.fix` or `Csv.f` aliases / shortcuts.
|
594
|
-
-->
|
595
|
-
|
596
|
-
|
597
|
-
Note: You can use negative widths (e.g. `-2`, `-3`, and so on)
|
598
|
-
to "skip" filler fields (e.g. `--`, `---`, and so on).
|
599
|
-
Example:
|
600
|
-
|
601
|
-
``` ruby
|
602
|
-
txt = <<TXT
|
603
|
-
12345678--12345678---12345678901234567890123456789012--12345678901234XXX
|
604
|
-
TXT
|
605
|
-
|
606
|
-
Csv.fixed.parse( txt, width: [8,-2,8,-3,32,-2,14] ) # or Csv.fix or Csv.f
|
607
|
-
# => [["12345678","12345678", "12345678901234567890123456789012", "12345678901234"]]
|
608
|
-
```
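How positive and negative widths work together can be sketched in plain Ruby. The helper `split_fixed` below is hypothetical (it is not the `ParserFixed` code): it walks the line from left to right, keeps the fields with a positive width, skips the filler with a negative width and trims the padding.

```ruby
# Hypothetical helper (not part of csvreader): positive widths are read as
# fields, negative widths are skipped as filler.
def split_fixed(line, widths)
  pos = 0
  widths.each_with_object([]) do |width, values|
    values << line[pos, width].to_s.strip if width.positive?
    pos += width.abs   # always advance, whether the field was kept or skipped
  end
end

line = '12345678--12345678---12345678901234567890123456789012--12345678901234XXX'
fields = split_fixed(line, [8, -2, 8, -3, 32, -2, 14])
# fields == ["12345678", "12345678", "12345678901234567890123456789012", "12345678901234"]
```

Anything after the last width (here the trailing `XXX`) is simply ignored in this sketch.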
|
609
|
-
|
610
|
-
|
611
|
-
|
612
|
-
|
613
|
-
|
614
|
-
### Q: What's broken in the standard library CSV reader?
|
615
|
-
|
616
|
-
Two major design bugs and many, many minor ones.
|
617
|
-
|
618
|
-
(1) The CSV class uses [`line.split(',')`](https://github.com/ruby/csv/blob/master/lib/csv.rb#L1255) with some kludges (†) with the claim it's faster.
|
619
|
-
What?! The right way: CSV needs its own purpose-built parser. There's no other
|
620
|
-
way you can handle all the (edge) cases with double quotes and escaped doubled up
|
621
|
-
double quotes. Period.
|
622
|
-
|
623
|
-
For example, the CSV class cannot handle leading or trailing spaces
|
624
|
-
for double quoted values `1,•"2","3"•`.
|
625
|
-
Nor can it handle double quotes inside values, and so on.
|
626
|
-
|
627
|
-
(2) The CSV class returns `nil` for `,,` but an empty string (`""`)
|
628
|
-
for `"","",""`. The right way: All values are always strings. Period.
|
629
|
-
|
630
|
-
If you want to use `nil` you MUST configure a string (or strings)
|
631
|
-
such as `NA`, `n/a`, `\N`, or similar that map to `nil`.
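The "all values are strings, `nil` only by configuration" rule can be sketched as an explicit mapping step in plain Ruby. The names `null_values` and `to_null` are assumptions for this example; in the library itself the strict presets shown earlier pass a `null:` option (e.g. `null: "\\N"`).

```ruby
# Hypothetical null converter (names are assumptions for this sketch).
null_values = ['NA', 'n/a', '\N']   # note: '\N' is a backslash followed by N
to_null = ->(value) { null_values.include?(value) ? nil : value }

row = ['1', 'NA', '', '\N'].map(&to_null)
pp row
# => ["1", nil, "", nil]
```

The empty string stays an empty string; only the configured markers turn into `nil`.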
|
632
|
-
|
633
|
-
|
634
|
-
(†): kludge - a workaround or quick-and-dirty solution that is clumsy, inelegant, inefficient, difficult to extend and hard to maintain
|
635
|
-
|
636
|
-
Appendix: Simple examples the standard csv library cannot read:
|
637
|
-
|
638
|
-
Quoted values with leading or trailing spaces e.g.
|
639
|
-
|
640
|
-
```
|
641
|
-
1, "2","3" , "4" ,5
|
642
|
-
```
|
643
|
-
|
644
|
-
=>
|
645
|
-
|
646
|
-
``` ruby
|
647
|
-
["1", "2", "3", "4", "5"]
|
648
|
-
```
|
649
|
-
|
650
|
-
"Auto-fix" unambiguous quotes in "unquoted" values e.g.
|
651
|
-
|
652
|
-
```
|
653
|
-
value with "quotes", another value
|
654
|
-
```
|
655
|
-
|
656
|
-
=>
|
657
|
-
|
658
|
-
``` ruby
|
659
|
-
["value with \"quotes\"", "another value"]
|
660
|
-
```
|
661
|
-
|
662
|
-
and some more.
|
663
|
-
|
664
|
-
|
665
|
-
|
666
|
-
|
667
|
-
## Alternatives
|
668
|
-
|
669
|
-
See the Libraries & Tools section in the [Awesome CSV](https://github.com/csvspecs/awesome-csv#libraries--tools) page.
|
670
|
-
|
671
|
-
|
672
|
-
## License
|
673
|
-
|
674
|
-
![](https://publicdomainworks.github.io/buttons/zero88x31.png)
|
675
|
-
|
676
|
-
The `csvreader` scripts are dedicated to the public domain.
|
677
|
-
Use it as you please with no restrictions whatsoever.
|
678
|
-
|
679
|
-
## Questions? Comments?
|
680
|
-
|
681
|
-
Send them along to the [wwwmake forum](http://groups.google.com/group/wwwmake).
|
682
|
-
Thanks!
|
1
|
+
# csvreader - read tabular data in the comma-separated values (csv) format the right way (uses best practices out-of-the-box with zero-configuration)
|
2
|
+
|
3
|
+
|
4
|
+
* home :: [github.com/csvreader/csvreader](https://github.com/csvreader/csvreader)
|
5
|
+
* bugs :: [github.com/csvreader/csvreader/issues](https://github.com/csvreader/csvreader/issues)
|
6
|
+
* gem :: [rubygems.org/gems/csvreader](https://rubygems.org/gems/csvreader)
|
7
|
+
* rdoc :: [rubydoc.info/gems/csvreader](http://rubydoc.info/gems/csvreader)
|
8
|
+
* forum :: [wwwmake](http://groups.google.com/group/wwwmake)
|
9
|
+
|
10
|
+
|
11
|
+
|
12
|
+
|
13
|
+
## What's New?
|
14
|
+
|
15
|
+
**v1.2.2** Added auto-fix/correction/recovery
|
16
|
+
for double quoted value with extra trailing value
|
17
|
+
to the default parser (`ParserStd`) e.g. `"Freddy" Mercury`
|
18
|
+
will get read "as is" and turned
|
19
|
+
into an "unquoted" value with "literal" quotes e.g. `"Freddy" Mercury`.
|
20
|
+
|
21
|
+
|
22
|
+
**v1.2.1** Added support for (optional) hashtag
|
23
|
+
to the default parser (`ParserStd`) for
|
24
|
+
supporting the [Humanitarian eXchange Language (HXL)](https://github.com/csvspecs/csv-hxl).
|
25
|
+
Default is turned off (`false`). Use `Csv.human`
|
26
|
+
or `Csv.hum` or `Csv.hxl` for pre-defined parsers with hashtag turned on.
|
27
|
+
|
28
|
+
|
29
|
+
**v1.2** Added support for alternative (non-space) separators (e.g. `;|^:`)
|
30
|
+
to the default parser (`ParserStd`).
|
31
|
+
|
32
|
+
|
33
|
+
**v1.1.5** Added built-in support for (optional) alternative space
|
34
|
+
character
|
35
|
+
(e.g. `_-+•`)
|
36
|
+
to the default parser (`ParserStd`) and the table parser (`ParserTable`).
|
37
|
+
Turns `Man_Utd` into `Man Utd`, for example. Default is turned off (`nil`).
|
38
|
+
|
39
|
+
|
40
|
+
**v1.1.4** Added new "classic" table parser (see `ParserTable`) for supporting fields separated by (one or more) spaces
|
41
|
+
e.g. `Csv.table.parse( txt )`.
|
42
|
+
|
43
|
+
|
44
|
+
**v1.1.3**: Added built-in support for french single and double quotes / guillemets (`‹› «»`) to default parser ("The Right Way").
|
45
|
+
Now you can use both, that is, single (`‹...›` or `›...‹`)
|
46
|
+
or double (`«...»` or `»...«`).
|
47
|
+
Note: A quote only "kicks-in" if it's the first (non-whitespace)
|
48
|
+
character of the value (otherwise it's just a "vanilla" literal character).
|
49
|
+
|
50
|
+
|
51
|
+
**v1.1.2**: Added built-in support for single quotes (`'`) to default parser ("The Right Way").
|
52
|
+
Now you can use both, that is, single (`'...'`) or double quotes (`"..."`)
|
53
|
+
like in ruby (or javascript or html or ...) :-).
|
54
|
+
Note: A quote only "kicks-in" if it's the first (non-whitespace)
|
55
|
+
character of the value (otherwise it's just a "vanilla" literal character)
|
56
|
+
e.g. `48°51'24"N` needs no quote :-).
|
57
|
+
With the "strict" parser you will get a firework of "stray" quote errors / exceptions.
|
58
|
+
|
59
|
+
|
60
|
+
|
61
|
+
**v1.1.1**: Added built-in support for (optional) alternative comments (`%`) - used by
|
62
|
+
[ARFF (attribute-relation file format)](https://github.com/csvspecs/csv-meta#attribute-relation-classic) -
|
63
|
+
and support for (optional) directives (`@`) in header (that is, before any records)
|
64
|
+
to default parser ("The Right Way").
|
65
|
+
Now you can use either `#` or `%` for comments, the first one "wins" - you CANNOT use both.
|
66
|
+
Now you can use either a front matter (`---`) block
|
67
|
+
or directives (e.g. `@attribute`, `@relation`, etc.)
|
68
|
+
for meta data, the first one "wins" - you CANNOT use both.
|
69
|
+
|
70
|
+
|
71
|
+
**v1.1.0**: Added new fixed width field (fwf) parser (see `ParserFixed`) for supporting fields with fixed width (and no separator)
|
72
|
+
e.g. `Csv.fixed.parse( txt, width: [8,-2,8,-3,32,-2,14] )`.
|
73
|
+
|
74
|
+
|
75
|
+
**v1.0.3**: Added built-in support for an (optional) front matter (`---`) meta data block
|
76
|
+
in header (that is, before any records)
|
77
|
+
to default parser ("The Right Way") - used by [CSVY (yaml front matter for csv file format)](https://github.com/csvspecs/csv-meta#front-matter-in-yaml).
|
78
|
+
Use `Csv.parser.meta` to get the parsed meta data block hash (or `nil` if none).
|
79
|
+
|
80
|
+
|
81
|
+
|
82
|
+
|
83
|
+
## Usage
|
84
|
+
|
85
|
+
|
86
|
+
``` ruby
|
87
|
+
txt = <<TXT
|
88
|
+
1,2,3
|
89
|
+
4,5,6
|
90
|
+
TXT
|
91
|
+
|
92
|
+
records = Csv.parse( txt ) ## or CsvReader.parse
|
93
|
+
pp records
|
94
|
+
# => [["1","2","3"],
|
95
|
+
# ["4","5","6"]]
|
96
|
+
|
97
|
+
# -or-
|
98
|
+
|
99
|
+
records = Csv.read( "values.csv" ) ## or CsvReader.read
|
100
|
+
pp records
|
101
|
+
# => [["1","2","3"],
|
102
|
+
# ["4","5","6"]]
|
103
|
+
|
104
|
+
# -or-
|
105
|
+
|
106
|
+
Csv.foreach( "values.csv" ) do |rec| ## or CsvReader.foreach
|
107
|
+
pp rec
|
108
|
+
end
|
109
|
+
# => ["1","2","3"]
|
110
|
+
# => ["4","5","6"]
|
111
|
+
```
|
112
|
+
|
113
|
+
|
114
|
+
### What about type inference and data converters?
|
115
|
+
|
116
|
+
Use the converters keyword option to (auto-)convert strings to nulls, booleans, integers, floats, dates, etc.
|
117
|
+
Example:
|
118
|
+
|
119
|
+
``` ruby
|
120
|
+
txt = <<TXT
|
121
|
+
1,2,3
|
122
|
+
true,false,null
|
123
|
+
TXT
|
124
|
+
|
125
|
+
records = Csv.parse( txt, :converters => :all ) ## or CsvReader.parse
|
126
|
+
pp records
|
127
|
+
# => [[1,2,3],
|
128
|
+
# [true,false,nil]]
|
129
|
+
```
|
130
|
+
|
131
|
+
|
132
|
+
Built-in converters include:
|
133
|
+
|
134
|
+
| Converter | Comments |
|
135
|
+
|--------------|-------------------|
|
136
|
+
| `:integer` | convert matching strings to integer |
|
137
|
+
| `:float` | convert matching strings to float |
|
138
|
+
| `:numeric` | shortcut for `[:integer, :float]` |
|
139
|
+
| `:date` | convert matching strings to `Date` (year/month/day) |
|
140
|
+
| `:date_time` | convert matching strings to `DateTime` |
|
141
|
+
| `:null` | convert matching strings to null (`nil`) |
|
142
|
+
| `:boolean` | convert matching strings to boolean (`true` or `false`) |
|
143
|
+
| `:all` | shortcut for `[:null, :boolean, :date_time, :numeric]` |
|
144
|
+
|
145
|
+
|
146
|
+
Or add your own converters. Example:
|
147
|
+
|
148
|
+
``` ruby
|
149
|
+
Csv.parse( 'Ruby, 2020-03-01, 100', converters: [->(v) { Time.parse(v) rescue v }] )
|
150
|
+
#=> [["Ruby", 2020-03-01 00:00:00 +0200, "100"]]
|
151
|
+
```
|
152
|
+
|
153
|
+
A custom converter is a callable (e.g. a lambda or a method) that gets the value passed in.
If the conversion succeeds, it returns a non-string type (e.g. integer, float, date, etc.);
otherwise it returns the string unchanged (for further processing with all other converters in the "pipeline" configuration).
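
The pipeline idea can be sketched in plain Ruby. This is a minimal illustration and not the csvreader implementation: the `convert` helper, the three lambdas, and the rule "stop as soon as a converter returns a non-string" are assumptions based on the description above.

```ruby
# Hypothetical converters: each returns a non-string on success,
# or the unchanged string so that the next converter can try.
TO_NULL    = ->(v) { %w[null NA].include?(v) ? nil : v }
TO_INTEGER = ->(v) { Integer(v, 10) rescue v }
TO_FLOAT   = ->(v) { Float(v) rescue v }

PIPELINE = [TO_NULL, TO_INTEGER, TO_FLOAT]

# Run a value through the converters in order (assumed pipeline semantics).
def convert(value, converters)
  converters.each do |converter|
    value = converter.call(value)
    break unless value.is_a?(String)   # converted; skip the remaining converters
  end
  value
end

p %w[42 4.5 null Ruby].map { |v| convert(v, PIPELINE) }
# => [42, 4.5, nil, "Ruby"]
```

The order matters in this sketch: the integer converter runs before the float converter so that `"42"` stays an integer instead of becoming `42.0`.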
|
156
|
+
|
157
|
+
|
158
|
+
|
159
|
+
### What about Enumerable?
|
160
|
+
|
161
|
+
Yes, every reader includes `Enumerable` and runs on `each`.
|
162
|
+
Use `new` or `open` without a block
|
163
|
+
to get the enumerator (iterator).
|
164
|
+
Example:
|
165
|
+
|
166
|
+
|
167
|
+
``` ruby
|
168
|
+
csv = Csv.new( "a,b,c" )
|
169
|
+
it = csv.to_enum
|
170
|
+
pp it.next
|
171
|
+
# => ["a","b","c"]
|
172
|
+
|
173
|
+
# -or-
|
174
|
+
|
175
|
+
csv = Csv.open( "values.csv" )
|
176
|
+
it = csv.to_enum
|
177
|
+
pp it.next
|
178
|
+
# => ["1","2","3"]
|
179
|
+
pp it.next
|
180
|
+
# => ["4","5","6"]
|
181
|
+
```
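
Why "runs on `each`" is enough can be shown with a small stand-in class. `TinyReader` below is a hypothetical name and not the csvreader reader; it only assumes what the text above says: define `each`, include `Enumerable`, and return an enumerator when no block is given.

```ruby
# TinyReader is a hypothetical stand-in, not the csvreader class.
class TinyReader
  include Enumerable

  def initialize(text)
    @text = text
  end

  # Enumerable builds map, select, first, etc. on top of each.
  def each
    return to_enum(:each) unless block_given?

    @text.each_line do |line|
      # naive split for illustration only; real csv needs a proper parser
      yield line.chomp.split(',', -1)
    end
  end
end

reader = TinyReader.new("1,2,3\n4,5,6\n")
enum   = reader.to_enum
p enum.next                         # => ["1", "2", "3"]
p enum.next                         # => ["4", "5", "6"]
p reader.map { |rec| rec.first }    # => ["1", "4"]
```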
|
182
|
+
|
183
|
+
|
184
|
+
|
185
|
+
|
186
|
+
|
187
|
+
### What about headers?
|
188
|
+
|
189
|
+
Use `CsvHash`
if the first line is a header (or, if the header is missing, pass in the headers
as an array) and you want your records as hashes instead of arrays of strings.
|
192
|
+
Example:
|
193
|
+
|
194
|
+
``` ruby
|
195
|
+
txt = <<TXT
|
196
|
+
A,B,C
|
197
|
+
1,2,3
|
198
|
+
4,5,6
|
199
|
+
TXT
|
200
|
+
|
201
|
+
records = CsvHash.parse( txt ) ## or CsvHashReader.parse
|
202
|
+
pp records
|
203
|
+
|
204
|
+
# -or-
|
205
|
+
|
206
|
+
txt2 = <<TXT
|
207
|
+
1,2,3
|
208
|
+
4,5,6
|
209
|
+
TXT
|
210
|
+
|
211
|
+
records = CsvHash.parse( txt2, headers: ["A","B","C"] ) ## or CsvHashReader.parse
|
212
|
+
pp records
|
213
|
+
|
214
|
+
# => [{"A"=>"1", "B"=>"2", "C"=>"3"},
#     {"A"=>"4", "B"=>"5", "C"=>"6"}]
|
216
|
+
|
217
|
+
# -or-
|
218
|
+
|
219
|
+
records = CsvHash.read( "hash.csv" ) ## or CsvHashReader.read
|
220
|
+
pp records
|
221
|
+
# => [{"A"=>"1", "B"=>"2", "C"=>"3"},
#     {"A"=>"4", "B"=>"5", "C"=>"6"}]
|
223
|
+
|
224
|
+
# -or-
|
225
|
+
|
226
|
+
CsvHash.foreach( "hash.csv" ) do |rec| ## or CsvHashReader.foreach
|
227
|
+
pp rec
|
228
|
+
end
|
229
|
+
# => {"A"=>"1", "B"=>"2", "C"=>"3"}
# => {"A"=>"4", "B"=>"5", "C"=>"6"}
|
231
|
+
```
|
232
|
+
|
233
|
+
|
234
|
+
### What about symbol keys for hashes?
|
235
|
+
|
236
|
+
Yes, you can use the `header_converters` keyword option.
|
237
|
+
Use `:symbol` for (auto-)converting header (strings) to symbols.
|
238
|
+
Note: The symbol converter also downcases all letters,
removes all non-alphanumeric characters (e.g. `!?$%`),
and replaces spaces with underscores.
|
241
|
+
|
242
|
+
Example:
|
243
|
+
|
244
|
+
``` ruby
|
245
|
+
txt = <<TXT
|
246
|
+
a,b,c
|
247
|
+
1,2,3
|
248
|
+
true,false,null
|
249
|
+
TXT
|
250
|
+
|
251
|
+
records = CsvHash.parse( txt, :converters => :all, :header_converters => :symbol )
|
252
|
+
pp records
|
253
|
+
# => [{a: 1, b: 2, c: 3},
|
254
|
+
# {a: true, b: false, c: nil}]
|
255
|
+
|
256
|
+
# -or-
|
257
|
+
options = { :converters => :all,
|
258
|
+
:header_converters => :symbol }
|
259
|
+
|
260
|
+
records = CsvHash.parse( txt, options )
|
261
|
+
pp records
|
262
|
+
# => [{a: 1, b: 2, c: 3},
|
263
|
+
# {a: true, b: false, c: nil}]
|
264
|
+
```
|
265
|
+
|
266
|
+
Built-in header converters include:
|
267
|
+
|
268
|
+
| Converter | Comments |
|
269
|
+
|--------------|---------------------|
|
270
|
+
| `:downcase` | downcase strings |
|
271
|
+
| `:symbol` | convert strings to symbols (and downcase and remove non-alphanumerics) |
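
The `:symbol` rules described above can be approximated in a few lines of plain Ruby. `symbolize_header` is a hypothetical helper, not the converter shipped with csvreader, and it assumes ASCII-only headers and the order "remove, trim, then replace spaces".

```ruby
# Hypothetical helper; approximates the rules described above (ASCII only).
def symbolize_header(header)
  header.downcase
        .gsub(/[^a-z0-9 ]/, '')   # remove non-alphanumeric chars (keep spaces)
        .strip
        .gsub(/ +/, '_')          # replace spaces with underscores
        .to_sym
end

p ['City', 'Unit Price ($)', 'Abv%'].map { |h| symbolize_header(h) }
# => [:city, :unit_price, :abv]
```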
|
272
|
+
|
273
|
+
|
274
|
+
|
275
|
+
### What about (typed) structs?
|
276
|
+
|
277
|
+
See the [csvrecord library »](https://github.com/csvreader/csvrecord)
|
278
|
+
|
279
|
+
Example from the csvrecord documentation:
|
280
|
+
|
281
|
+
Step 1: Define a (typed) struct for the comma-separated values (csv) records. Example:
|
282
|
+
|
283
|
+
```ruby
|
284
|
+
require 'csvrecord'
|
285
|
+
|
286
|
+
Beer = CsvRecord.define do
|
287
|
+
field :brewery ## note: default type is :string
|
288
|
+
field :city
|
289
|
+
field :name
|
290
|
+
field :abv, Float ## allows type specified as class (or use :float)
|
291
|
+
end
|
292
|
+
```
|
293
|
+
|
294
|
+
or in "classic" style:
|
295
|
+
|
296
|
+
```ruby
|
297
|
+
class Beer < CsvRecord::Base
|
298
|
+
field :brewery
|
299
|
+
field :city
|
300
|
+
field :name
|
301
|
+
field :abv, Float
|
302
|
+
end
|
303
|
+
```
|
304
|
+
|
305
|
+
|
306
|
+
Step 2: Read in the comma-separated values (csv) datafile. Example:
|
307
|
+
|
308
|
+
```ruby
|
309
|
+
beers = Beer.read( 'beer.csv' )
|
310
|
+
|
311
|
+
puts "#{beers.size} beers:"
|
312
|
+
pp beers
|
313
|
+
```
|
314
|
+
|
315
|
+
pretty prints (pp):
|
316
|
+
|
317
|
+
```
|
318
|
+
6 beers:
|
319
|
+
[#<Beer:0x302c760 @values=
|
320
|
+
["Andechser Klosterbrauerei", "Andechs", "Doppelbock Dunkel", 7.0]>,
|
321
|
+
#<Beer:0x3026fe8 @values=
|
322
|
+
["Augustiner Br\u00E4u M\u00FCnchen", "M\u00FCnchen", "Edelstoff", 5.6]>,
|
323
|
+
#<Beer:0x30257a0 @values=
|
324
|
+
["Bayerische Staatsbrauerei Weihenstephan", "Freising", "Hefe Weissbier", 5.4]>,
|
325
|
+
...
|
326
|
+
]
|
327
|
+
```
|
328
|
+
|
329
|
+
Or loop over the records. Example:
|
330
|
+
|
331
|
+
``` ruby
|
332
|
+
Beer.read( 'beer.csv' ).each do |rec|
|
333
|
+
puts "#{rec.name} (#{rec.abv}%) by #{rec.brewery}, #{rec.city}"
|
334
|
+
end
|
335
|
+
|
336
|
+
# -or-
|
337
|
+
|
338
|
+
Beer.foreach( 'beer.csv' ) do |rec|
|
339
|
+
puts "#{rec.name} (#{rec.abv}%) by #{rec.brewery}, #{rec.city}"
|
340
|
+
end
|
341
|
+
```
|
342
|
+
|
343
|
+
|
344
|
+
printing:
|
345
|
+
|
346
|
+
```
|
347
|
+
Doppelbock Dunkel (7.0%) by Andechser Klosterbrauerei, Andechs
|
348
|
+
Edelstoff (5.6%) by Augustiner Bräu München, München
|
349
|
+
Hefe Weissbier (5.4%) by Bayerische Staatsbrauerei Weihenstephan, Freising
|
350
|
+
Rauchbier Märzen (5.1%) by Brauerei Spezial, Bamberg
|
351
|
+
Münchner Dunkel (5.0%) by Hacker-Pschorr Bräu, München
|
352
|
+
Hofbräu Oktoberfestbier (6.3%) by Staatliches Hofbräuhaus München, München
|
353
|
+
```
|
354
|
+
|
355
|
+
|
356
|
+
### What about tabular data packages with pre-defined types / schemas?
|
357
|
+
|
358
|
+
See the [csvpack library »](https://github.com/csvreader/csvpack)
|
359
|
+
|
360
|
+
|
361
|
+
|
362
|
+
|
363
|
+
|
364
|
+
## Frequently Asked Questions (FAQ) and Answers
|
365
|
+
|
366
|
+
### Q: What's CSV the right way? What best practices can I use?
|
367
|
+
|
368
|
+
Use best practices out-of-the-box with zero-configuration.
|
369
|
+
Do you know how to skip blank lines or how to add `#` single-line comments?
|
370
|
+
Or how to trim leading and trailing spaces? No worries. It's turned on by default.
|
371
|
+
|
372
|
+
Yes, you can. Use
|
373
|
+
|
374
|
+
```
|
375
|
+
#######
|
376
|
+
# try with some comments
|
377
|
+
# and blank lines even before header (first row)
|
378
|
+
|
379
|
+
Brewery,City,Name,Abv
|
380
|
+
Andechser Klosterbrauerei,Andechs,Doppelbock Dunkel,7%
|
381
|
+
Augustiner Bräu München,München,Edelstoff,5.6%
|
382
|
+
|
383
|
+
Bayerische Staatsbrauerei Weihenstephan, Freising, Hefe Weissbier, 5.4%
|
384
|
+
Brauerei Spezial, Bamberg, Rauchbier Märzen, 5.1%
|
385
|
+
Hacker-Pschorr Bräu, München, Münchner Dunkel, 5.0%
|
386
|
+
Staatliches Hofbräuhaus München, München, Hofbräu Oktoberfestbier, 6.3%
|
387
|
+
```
|
388
|
+
|
389
|
+
instead of strict "classic"
|
390
|
+
(no blank lines, no comments, no leading and trailing spaces, etc.):
|
391
|
+
|
392
|
+
```
|
393
|
+
Brewery,City,Name,Abv
|
394
|
+
Andechser Klosterbrauerei,Andechs,Doppelbock Dunkel,7%
|
395
|
+
Augustiner Bräu München,München,Edelstoff,5.6%
|
396
|
+
Bayerische Staatsbrauerei Weihenstephan,Freising,Hefe Weissbier,5.4%
|
397
|
+
Brauerei Spezial,Bamberg,Rauchbier Märzen,5.1%
|
398
|
+
Hacker-Pschorr Bräu,München,Münchner Dunkel,5.0%
|
399
|
+
Staatliches Hofbräuhaus München,München,Hofbräu Oktoberfestbier,6.3%
|
400
|
+
```
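
What the relaxed defaults amount to can be sketched in plain Ruby. `relaxed_rows` is a hypothetical helper and not the csvreader parser: it assumes unquoted values only and uses a naive split, so it shows just the three defaults named above (skip blank lines, skip `#` comments, trim leading and trailing spaces).

```ruby
# Hypothetical helper, not the csvreader parser: naive split, no quote handling.
def relaxed_rows(text)
  text.each_line
      .map(&:strip)
      .reject { |line| line.empty? || line.start_with?('#') }   # skip blanks and comments
      .map { |line| line.split(',', -1).map(&:strip) }           # trim every value
end

txt = <<~TXT
  # try with some comments

  Brewery,City,Name,Abv
  Bayerische Staatsbrauerei Weihenstephan, Freising, Hefe Weissbier, 5.4%
TXT

p relaxed_rows(txt)
# => [["Brewery", "City", "Name", "Abv"],
#     ["Bayerische Staatsbrauerei Weihenstephan", "Freising", "Hefe Weissbier", "5.4%"]]
```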
|
401
|
+
|
402
|
+
|
403
|
+
Or use the ARFF (attribute-relation file format)-like alternative style
|
404
|
+
with `%` for comments and `@`-directives
|
405
|
+
for "meta data" in the header (before any records):
|
406
|
+
|
407
|
+
```
|
408
|
+
%%%%%%%%%%%%%%%%%%
|
409
|
+
% try with some comments
|
410
|
+
% and blank lines even before @-directives in header
|
411
|
+
|
412
|
+
@RELATION Beer
|
413
|
+
|
414
|
+
@ATTRIBUTE Brewery
|
415
|
+
@ATTRIBUTE City
|
416
|
+
@ATTRIBUTE Name
|
417
|
+
@ATTRIBUTE Abv
|
418
|
+
|
419
|
+
@DATA
|
420
|
+
Andechser Klosterbrauerei,Andechs,Doppelbock Dunkel,7%
|
421
|
+
Augustiner Bräu München,München,Edelstoff,5.6%
|
422
|
+
|
423
|
+
Bayerische Staatsbrauerei Weihenstephan, Freising, Hefe Weissbier, 5.4%
|
424
|
+
Brauerei Spezial, Bamberg, Rauchbier Märzen, 5.1%
|
425
|
+
Hacker-Pschorr Bräu, München, Münchner Dunkel, 5.0%
|
426
|
+
Staatliches Hofbräuhaus München, München, Hofbräu Oktoberfestbier, 6.3%
|
427
|
+
```
|
428
|
+
|
429
|
+
Or use the ARFF (attribute-relation file format)-like alternative style with `@`-directives
|
430
|
+
inside comments (for easier backwards compatibility with old readers)
|
431
|
+
for "meta data" in the header (before any records):
|
432
|
+
|
433
|
+
```
|
434
|
+
##########################
|
435
|
+
# try with some comments
|
436
|
+
# and blank lines even before @-directives in header
|
437
|
+
#
|
438
|
+
# @RELATION Beer
|
439
|
+
#
|
440
|
+
# @ATTRIBUTE Brewery
|
441
|
+
# @ATTRIBUTE City
|
442
|
+
# @ATTRIBUTE Name
|
443
|
+
# @ATTRIBUTE Abv
|
444
|
+
|
445
|
+
Andechser Klosterbrauerei,Andechs,Doppelbock Dunkel,7%
|
446
|
+
Augustiner Bräu München,München,Edelstoff,5.6%
|
447
|
+
|
448
|
+
Bayerische Staatsbrauerei Weihenstephan, Freising, Hefe Weissbier, 5.4%
|
449
|
+
Brauerei Spezial, Bamberg, Rauchbier Märzen, 5.1%
|
450
|
+
Hacker-Pschorr Bräu, München, Münchner Dunkel, 5.0%
|
451
|
+
Staatliches Hofbräuhaus München, München, Hofbräu Oktoberfestbier, 6.3%
|
452
|
+
```
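
How a reader might collect the `@`-directives from either style can be sketched as follows. `read_with_directives` is a hypothetical helper, not the csvreader API; it assumes `%` lines are comments, `#` lines are comments unless they carry an `@`-directive, and it splits records naively without quote handling.

```ruby
# Hypothetical helper; returns [meta, rows] for both header styles shown above.
def read_with_directives(text)
  meta = { relation: nil, attributes: [] }
  rows = []
  text.each_line do |raw|
    line = raw.strip
    next if line.empty? || line.start_with?('%')     # blank line or % comment
    if line.start_with?('#')
      line = line.sub(/\A#+\s*/, '')                 # look inside the # comment
      next unless line.start_with?('@')              # plain comment, skip
    end
    if line.start_with?('@')
      name, value = line[1..-1].split(/\s+/, 2)
      case name.to_s.upcase
      when 'RELATION'  then meta[:relation] = value
      when 'ATTRIBUTE' then meta[:attributes] << value
      end
      next                                           # @DATA and others carry no value here
    end
    rows << line.split(',', -1).map(&:strip)         # naive split, trims values
  end
  [meta, rows]
end

txt = <<~TXT
  # @RELATION Beer
  #
  # @ATTRIBUTE Brewery
  # @ATTRIBUTE City

  Brauerei Spezial, Bamberg
TXT

meta, rows = read_with_directives(txt)
p meta[:relation]     # => "Beer"
p meta[:attributes]   # => ["Brewery", "City"]
p rows                # => [["Brauerei Spezial", "Bamberg"]]
```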
|
453
|
+
|
454
|
+
|
455
|
+
|
456
|
+
### Q: How can I change the default format / dialect?
|
457
|
+
|
458
|
+
The reader includes more than half a dozen pre-configured formats (dialects).
|
460
|
+
|
461
|
+
Use strict if you do NOT want to trim leading and trailing spaces
|
462
|
+
and if you do NOT want to skip blank lines. Example:
|
463
|
+
|
464
|
+
``` ruby
|
465
|
+
txt = <<TXT
|
466
|
+
1, 2,3
|
467
|
+
4,5 ,6
|
468
|
+
|
469
|
+
TXT
|
470
|
+
|
471
|
+
records = Csv.strict.parse( txt )
|
472
|
+
pp records
|
473
|
+
# => [["1","•2","3"],
|
474
|
+
# ["4","5•","6"],
|
475
|
+
# [""]]
|
476
|
+
```
|
477
|
+
|
478
|
+
More strict pre-configured variants include:
|
479
|
+
|
480
|
+
`Csv.mysql` uses:
|
481
|
+
|
482
|
+
``` ruby
|
483
|
+
ParserStrict.new( sep: "\t",
|
484
|
+
quote: false,
|
485
|
+
escape: true,
|
486
|
+
null: "\\N" )
|
487
|
+
```
|
488
|
+
|
489
|
+
`Csv.postgres` or `Csv.postgresql` uses:
|
490
|
+
|
491
|
+
``` ruby
|
492
|
+
ParserStrict.new( doublequote: false,
|
493
|
+
escape: true,
|
494
|
+
null: "" )
|
495
|
+
```
|
496
|
+
|
497
|
+
`Csv.postgres_text` or `Csv.postgresql_text` uses:
|
498
|
+
|
499
|
+
``` ruby
|
500
|
+
ParserStrict.new( sep: "\t",
|
501
|
+
quote: false,
|
502
|
+
escape: true,
|
503
|
+
null: "\\N" )
|
504
|
+
```
|
505
|
+
|
506
|
+
and so on.
|
507
|
+
|
508
|
+
|
509
|
+
### Q: How can I change the separator to semicolon (`;`) or pipe (`|`) or tab (`\t`)?
|
510
|
+
|
511
|
+
Pass in the `sep` keyword option
|
512
|
+
to the parser. Example:
|
513
|
+
|
514
|
+
``` ruby
|
515
|
+
Csv.parse( ..., sep: ';' )
|
516
|
+
Csv.read( ..., sep: ';' )
|
517
|
+
# ...
|
518
|
+
Csv.parse( ..., sep: '|' )
|
519
|
+
Csv.read( ..., sep: '|' )
|
520
|
+
# and so on
|
521
|
+
```
|
522
|
+
|
523
|
+
Note: If you use tab (`\t`) use the `TabReader`
|
524
|
+
(or for your convenience the built-in `Csv.tab` alias)!
|
525
|
+
If you use the "classic" one or more space or tab (`/[ \t]+/`) regex
|
526
|
+
use the `TableReader`
|
527
|
+
(or for your convenience the built-in `Csv.table` alias)!
|
528
|
+
|
529
|
+
|
530
|
+
Note: The default ("The Right Way") parser does NOT allow space or tab
|
531
|
+
as separator (because leading and trailing space always gets trimmed
|
532
|
+
unless inside quotes, etc.). Use the `strict` parser if you want
|
533
|
+
to make up your own format with space or tab as a separator
|
534
|
+
or if you want that every space or tab counts (is significant).
|
535
|
+
|
536
|
+
|
537
|
+
|
538
|
+
Aside: Why? Tab != CSV. Yes, tab is
|
539
|
+
its own (even) simpler format
|
540
|
+
(e.g. no escape rules, no newlines in values, etc.),
|
541
|
+
see [`TabReader` »](https://github.com/csvreader/tabreader).
|
542
|
+
|
543
|
+
``` ruby
|
544
|
+
Csv.tab.parse( ... ) # note: "classic" strict tab format
|
545
|
+
Csv.tab.read( ... )
|
546
|
+
# ...
|
547
|
+
|
548
|
+
Csv.table.parse( ... ) # note: "classic" one or more space (or tab) table format
|
549
|
+
Csv.table.read( ... )
|
550
|
+
# ...
|
551
|
+
```
|
552
|
+
|
553
|
+
If you want double quote escape rules, newlines in quoted values, etc. use
|
554
|
+
the "strict" parser with the separator (`sep`) changed to tab (`\t`).
|
555
|
+
|
556
|
+
``` ruby
|
557
|
+
Csv.strict.parse( ..., sep: "\t" ) # note: csv-like tab format with quotes
|
558
|
+
Csv.strict.read( ..., sep: "\t" )
|
559
|
+
# ...
|
560
|
+
```
|
561
|
+
|
562
|
+
|
563
|
+
|
564
|
+
|
565
|
+
### Q: How can I read records with fixed width fields (and no separator)?
|
566
|
+
|
567
|
+
Pass in the `width` keyword option with the field widths / lengths
|
568
|
+
to the "fixed" parser. Example:
|
569
|
+
|
570
|
+
``` ruby
|
571
|
+
txt = <<TXT
|
572
|
+
12345678123456781234567890123456789012345678901212345678901234
|
573
|
+
TXT
|
574
|
+
|
575
|
+
Csv.fixed.parse( txt, width: [8,8,32,14] ) # or Csv.fix or Csv.f
|
576
|
+
# => [["12345678","12345678", "12345678901234567890123456789012", "12345678901234"]]
|
577
|
+
|
578
|
+
|
579
|
+
txt = <<TXT
|
580
|
+
John    Smith   john@example.com                1-888-555-6666
Michele O'Reileymichele@example.com             1-333-321-8765
|
582
|
+
TXT
|
583
|
+
|
584
|
+
Csv.fixed.parse( txt, width: [8,8,32,14] ) # or Csv.fix or Csv.f
|
585
|
+
# => [["John", "Smith", "john@example.com", "1-888-555-6666"],
|
586
|
+
# ["Michele", "O'Reiley", "michele@example.com", "1-333-321-8765"]]
|
587
|
+
|
588
|
+
# and so on
|
589
|
+
```
|
590
|
+
|
591
|
+
<!--
|
592
|
+
Note: You can use for your convenience the built-in
|
593
|
+
`Csv.fix` or `Csv.f` aliases / shortcuts.
|
594
|
+
-->
|
595
|
+
|
596
|
+
|
597
|
+
Note: You can use negative widths (e.g. `-2`, `-3`, and so on)
|
598
|
+
to "skip" filler fields (e.g. `--`, `---`, and so on).
|
599
|
+
Example:
|
600
|
+
|
601
|
+
``` ruby
|
602
|
+
txt = <<TXT
|
603
|
+
12345678--12345678---12345678901234567890123456789012--12345678901234XXX
|
604
|
+
TXT
|
605
|
+
|
606
|
+
Csv.fixed.parse( txt, width: [8,-2,8,-3,32,-2,14] ) # or Csv.fix or Csv.f
|
607
|
+
# => [["12345678","12345678", "12345678901234567890123456789012", "12345678901234"]]
|
608
|
+
```
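
The width rules can be sketched in plain Ruby. `split_fixed` is a hypothetical helper, not the csvreader "fixed" parser; it assumes that positive widths take a field, negative widths skip filler, values are trimmed, and anything past the last width is ignored.

```ruby
# Hypothetical helper; slices one line by field widths.
def split_fixed(line, widths)
  pos    = 0
  values = []
  widths.each do |width|
    if width < 0
      pos += -width                            # negative width: skip filler
    else
      values << line[pos, width].to_s.strip    # take the field, trim padding
      pos += width
    end
  end
  values
end

line = '12345678--12345678---12345678901234567890123456789012--12345678901234XXX'
p split_fixed(line, [8, -2, 8, -3, 32, -2, 14])
# => ["12345678", "12345678", "12345678901234567890123456789012", "12345678901234"]

padded = 'John'.ljust(8) + 'Smith'.ljust(8) + 'john@example.com'.ljust(32) + '1-888-555-6666'
p split_fixed(padded, [8, 8, 32, 14])
# => ["John", "Smith", "john@example.com", "1-888-555-6666"]
```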
|
609
|
+
|
610
|
+
|
611
|
+
|
612
|
+
|
613
|
+
|
614
|
+
### Q: What's broken in the standard library CSV reader?
|
615
|
+
|
616
|
+
Two major design bugs and many, many minor ones.
|
617
|
+
|
618
|
+
(1) The CSV class uses [`line.split(',')`](https://github.com/ruby/csv/blob/master/lib/csv.rb#L1255) with some kludges (†) with the claim it's faster.
|
619
|
+
What?! The right way: CSV needs its own purpose-built parser. There's no other
|
620
|
+
way you can handle all the (edge) cases with double quotes and escaped doubled up
|
621
|
+
double quotes. Period.
|
622
|
+
|
623
|
+
For example, the CSV class cannot handle leading or trailing spaces
around double-quoted values (e.g. `1,•"2","3"•`),
nor double quotes inside values, and so on.
|
626
|
+
|
627
|
+
(2) The CSV class returns `nil` for `,,` but an empty string (`""`)
|
628
|
+
for `"","",""`. The right way: All values are always strings. Period.
|
629
|
+
|
630
|
+
If you want to use `nil` you MUST configure a string (or strings)
|
631
|
+
such as `NA`, `n/a`, `\N`, or similar that map to `nil`.
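
The difference is easy to reproduce with the standard library. The two `CSV.parse_line` calls below show the behavior described in (2); `NULL_STRINGS` and `map_nulls` are hypothetical names that sketch the "configure explicit null strings" idea and are not the csvreader API.

```ruby
require 'csv'

# Standard library: unquoted empty values become nil, quoted ones stay "".
p CSV.parse_line(',,')          # => [nil, nil, nil]
p CSV.parse_line('"","",""')    # => ["", "", ""]

# Hypothetical configuration: only these strings map to nil.
NULL_STRINGS = ['NA', 'n/a', '\N']

def map_nulls(values)
  values.map { |v| NULL_STRINGS.include?(v) ? nil : v }
end

p map_nulls(['1', 'NA', '', '\N'])   # => ["1", nil, "", nil]
```

With an explicit list, an empty string stays an empty string, so every value is a string unless it matches one of the configured null markers.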
|
632
|
+
|
633
|
+
|
634
|
+
(†): kludge - a workaround or quick-and-dirty solution that is clumsy, inelegant, inefficient, difficult to extend and hard to maintain
|
635
|
+
|
636
|
+
Appendix: Simple examples the standard csv library cannot read:
|
637
|
+
|
638
|
+
Quoted values with leading or trailing spaces e.g.
|
639
|
+
|
640
|
+
```
|
641
|
+
1, "2","3" , "4" ,5
|
642
|
+
```
|
643
|
+
|
644
|
+
=>
|
645
|
+
|
646
|
+
``` ruby
|
647
|
+
["1", "2", "3", "4", "5"]
|
648
|
+
```
|
649
|
+
|
650
|
+
"Auto-fix" unambiguous quotes in "unquoted" values e.g.
|
651
|
+
|
652
|
+
```
|
653
|
+
value with "quotes", another value
|
654
|
+
```
|
655
|
+
|
656
|
+
=>
|
657
|
+
|
658
|
+
``` ruby
|
659
|
+
["value with \"quotes\"", "another value"]
|
660
|
+
```
|
661
|
+
|
662
|
+
and some more.
|
663
|
+
|
664
|
+
|
665
|
+
|
666
|
+
|
667
|
+
## Alternatives
|
668
|
+
|
669
|
+
See the Libraries & Tools section in the [Awesome CSV](https://github.com/csvspecs/awesome-csv#libraries--tools) page.
|
670
|
+
|
671
|
+
|
672
|
+
## License
|
673
|
+
|
674
|
+
![](https://publicdomainworks.github.io/buttons/zero88x31.png)
|
675
|
+
|
676
|
+
The `csvreader` scripts are dedicated to the public domain.
|
677
|
+
Use them as you please with no restrictions whatsoever.
|
678
|
+
|
679
|
+
## Questions? Comments?
|
680
|
+
|
681
|
+
Send them along to the [wwwmake forum](http://groups.google.com/group/wwwmake).
|
682
|
+
Thanks!
|