csvreader 0.4.0 → 0.5.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: ed373a97a0bdb4c45d2980894a32014cdcb8ca7c
4
- data.tar.gz: 784adcade81e39ad9accd1a9b2d0c76fd666b6f9
3
+ metadata.gz: ea1d667219773e3a355c81f815d91e92340d61a1
4
+ data.tar.gz: ba7a43ccb5e110fc1f6eca76ca2a74a62f1131fb
5
5
  SHA512:
6
- metadata.gz: 5523a8697990c691f55aa7c3b23867104b1c4c5b8e9e25b0424a3191e73cbb32cee369541b712f60fc366ba76a8207a77d6b12b68ea209896b6c26e11c5712de
7
- data.tar.gz: 7c33c812c2a53303911b6686d03554d6e388b3f936a3b6b8d995ed237651bd171d3bdb8ab8f38f7f327e1a9d1be26d1fa955918012f37cf6a9e1c2cc6ab08373
6
+ metadata.gz: 0543a4338d2d12e36da16acdad9abff28633e519baa1d92044d1ca8f5e3472d835d00a10d8b19c24561b06e0d724f87414495600f4c83eef7c9e033474b4c09e
7
+ data.tar.gz: 8df669bc86f2066b2650a67bda5698fae7b6d58766b9c318f47958b0499671d0a4d39e862b8d3af842105a2777a3cf7ad05168380c338a3053fd3d363697abfb
data/Manifest.txt CHANGED
@@ -13,5 +13,7 @@ test/data/beer11.csv
13
13
  test/data/shakespeare.csv
14
14
  test/helper.rb
15
15
  test/test_parser.rb
16
+ test/test_parser_formats.rb
17
+ test/test_parser_rfc4180.rb
16
18
  test/test_reader.rb
17
19
  test/test_reader_hash.rb
data/README.md CHANGED
@@ -164,17 +164,15 @@ see [`TabReader` »](https://github.com/datatxt/tabreader).
164
164
 
165
165
  Two major design bugs and many many minor.
166
166
 
167
- (1) The CSV class uses `line.split(',')` with some kludges (†) with the claim its faster.
167
+ (1) The CSV class uses [`line.split(',')`](https://github.com/ruby/csv/blob/master/lib/csv.rb#L1248) with some kludges (†) with the claim it's faster.
168
168
  What?! The right way: CSV needs its own purpose-built parser. There's no other
169
169
  way you can handle all the (edge) cases with double quotes and escaped doubled up
170
170
  double quotes. Period.
171
171
 
172
- For example, the CSV class cannot handle leading or trailing spaces
172
+ For example, the CSV class cannot handle leading or trailing spaces
173
173
  for double quoted values `1,•"2","3"•`.
174
174
  Or handling double quotes inside values and so on and on.
175
175
 
176
- (†): kludge - a workaround or quick-and-dirty solution that is clumsy, inelegant, inefficient, difficult to extend and hard to maintain
177
-
178
176
  (2) The CSV class returns `nil` for `,,` but an empty string (`""`)
179
177
  for `"","",""`. The right way: All values are always strings. Period.
180
178
 
@@ -182,6 +180,36 @@ If you want to use `nil` you MUST configure a string (or strings)
182
180
  such as `NA`, `n/a`, `\N`, or similar that map to `nil`.
183
181
 
184
182
 
183
+ (†): kludge - a workaround or quick-and-dirty solution that is clumsy, inelegant, inefficient, difficult to extend and hard to maintain
184
+
185
+ Appendix: Simple examples the standard csv library cannot read:
186
+
187
+ Quoted values with leading or trailing spaces e.g.
188
+
189
+ ```
190
+ 1, "2","3" , "4" ,5
191
+ ```
192
+
193
+ =>
194
+
195
+ ``` ruby
196
+ ["1", "2", "3", "4" ,"5"]
197
+ ```
198
+
199
+ "Auto-fix" unambiguous quotes in "unquoted" values e.g.
200
+
201
+ ```
202
+ value with "quotes", another value
203
+ ```
204
+
205
+ =>
206
+
207
+ ``` ruby
208
+ ["value with \"quotes\"", "another value"]
209
+ ```
210
+
211
+ and some more.
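
Point (2) of the README hunk above can be checked with a short sketch. It uses only Ruby's standard `csv` library, not csvreader, and the variable names are illustrative assumptions rather than anything from the gem.

```ruby
require 'csv'

# Standard library behavior described in the README:
# unquoted empty fields come back as nil ...
unquoted = CSV.parse_line( ",," )

# ... while quoted empty fields come back as empty strings.
quoted = CSV.parse_line( '"","",""' )

pp unquoted
pp quoted
```

csvreader's default parser, by contrast, is documented in the diff's own tests (`test_parse_empties`) to return `["", "", ""]` for both inputs.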
212
+
185
213
 
186
214
 
187
215
 
data/lib/csvreader.rb CHANGED
@@ -3,6 +3,7 @@
3
3
  require 'csv'
4
4
  require 'json'
5
5
  require 'pp'
6
+ require 'logger'
6
7
 
7
8
 
8
9
  ###
@@ -18,22 +18,10 @@ class BufferIO ## todo: find a better name - why? why not? is really just for
18
18
  end
19
19
  end # method getc
20
20
 
21
-
22
- def ungetc( c )
23
- ## add upfront as first char in buffer
24
- ## last in/first out queue!!!!
25
- @buf.unshift( c )
26
- ## puts "ungetc - >#{c} (#{c.ord})< => >#{@buf}<"
27
- end
28
-
29
-
30
21
  def peek
31
- ## todo/fix:
32
- ## use Hexadecimal code: 1A, U+001A for eof char - why? why not?
33
22
  if @buf.size == 0 && @io.eof?
34
23
  puts "peek - hitting eof!!!"
35
- ## return eof char(s) - exits? is \0 ?? double check
36
- return "\0"
24
+ return "\0" ## return NUL char (0) for now
37
25
  end
38
26
 
39
27
  if @buf.size == 0
@@ -44,5 +32,6 @@ class BufferIO ## todo: find a better name - why? why not? is really just for
44
32
 
45
33
  @buf.first
46
34
  end # method peek
35
+
47
36
  end # class BufferIO
48
37
  end # class CsvReader
@@ -1,74 +1,92 @@
1
1
  # encoding: utf-8
2
2
 
3
3
  class CsvReader
4
- class Parser
5
-
6
4
 
7
- ## char constants
8
- DOUBLE_QUOTE = "\""
9
- COMMENT = "#" ## use COMMENT_HASH or HASH or ??
10
- SPACE = " "
11
- TAB = "\t"
12
- LF = "\n" ## 0A (hex) 10 (dec)
13
- CR = "\r" ## 0D (hex) 13 (dec)
14
5
 
15
6
 
16
- def self.parse( data )
17
- puts "parse:"
18
- pp data
19
7
 
20
- parser = new
21
- parser.parse( data )
22
- end
23
8
 
24
- def self.parse_line( data )
25
- puts "parse_line:"
9
+ class Parser
26
10
 
27
- parser = new
28
- records = parser.parse( data, limit: 1 )
29
11
 
30
- ## unwrap record if empty return nil - why? why not?
31
- ## return empty record e.g. [] - why? why not?
32
- records.size == 0 ? nil : records.first
12
+ ## char constants
13
+ DOUBLE_QUOTE = "\""
14
+ BACKSLASH = "\\" ## use BACKSLASH_ESCAPE ??
15
+ COMMENT = "#" ## use COMMENT_HASH or HASH or ??
16
+ SPACE = " " ## \s == ASCII 32 (dec) = (Space)
17
+ TAB = "\t" ## \t == ASCII 0x09 (hex) = HT (Tab/horizontal tab)
18
+ LF = "\n" ## \n == ASCII 0x0A (hex) 10 (dec) = LF (Newline/line feed)
19
+ CR = "\r" ## \r == ASCII 0x0D (hex) 13 (dec) = CR (Carriage return)
20
+
21
+
22
+ ###################################
23
+ ## add simple logger with debug flag/switch
24
+ #
25
+ # use Parser.debug = true # to turn on
26
+ #
27
+ # todo/fix: use logutils instead of std logger - why? why not?
28
+
29
+ def self.logger() @@logger ||= Logger.new( STDOUT ); end
30
+ def logger() self.class.logger; end
31
+
32
+
33
+
34
+ attr_reader :config ## todo/fix: change config to proper dialect class/struct - why? why not?
35
+
36
+ def initialize( sep: ',',
37
+ quote: DOUBLE_QUOTE, ## note: set to nil for no quote
38
+ doublequote: true,
39
+ escape: BACKSLASH, ## note: set to nil for no escapes
40
+ trim: true, ## note: will toggle between human/default and strict mode parser!!!
41
+ na: ['\N', 'NA'], ## note: set to nil for no null vales / not availabe (na)
42
+ quoted_empty: '', ## note: only available in strict mode (e.g. trim=false)
43
+ unquoted_empty: '' ## note: only available in strict mode (e.g. trim=false)
44
+ )
45
+ @config = {} ## todo/fix: change config to proper dialect class/struct - why? why not?
46
+ @config[:sep] = sep
47
+ @config[:quote] = quote
48
+ @config[:doublequote] = doublequote
49
+ @config[:escape] = escape
50
+ @config[:trim] = trim
51
+ @config[:na] = na
52
+ @config[:quoted_empty] = quoted_empty
53
+ @config[:unquoted_empty] = unquoted_empty
33
54
  end
34
55
 
35
56
 
36
57
 
37
- def self.read( path )
38
- parser = new
39
- File.open( path, 'r:bom|utf-8' ) do |file|
40
- parser.parse( file )
41
- end
42
- end
58
+ def strict?
59
+ ## note: use trim for separating two different parsers / code paths:
60
+ ## - human with trim leading and trailing whitespace and
61
+ ## - strict with no leading and trailing whitespaces allowed
43
62
 
44
- def self.foreach( path, &block )
45
- parser = new
46
- File.open( path, 'r:bom|utf-8' ) do |file|
47
- parser.foreach( file, &block )
48
- end
63
+ ## for now use - trim == false for strict version flag alias
64
+ ## todo/fix: add strict flag - why? why not?
65
+ @config[:trim] ? false : true
49
66
  end
50
67
 
51
- def self.parse_lines( data, &block )
52
- parser = new
53
- parser.parse_lines( data, &block )
54
- end
55
68
 
69
+ DEFAULT = new( sep: ',', trim: true )
70
+ RFC4180 = new( sep: ',', trim: false )
71
+ EXCEL = new( sep: ',', trim: false )
56
72
 
73
+ def self.default() DEFAULT; end ## alternative alias for DEFAULT
74
+ def self.rfc4180() RFC4180; end ## alternative alias for RFC4180
75
+ def self.excel() EXCEL; end ## alternative alias for EXCEL
57
76
 
58
77
 
59
78
 
60
- def parse_field( io, trim: true )
79
+
80
+ def parse_field( io, sep: )
81
+ logger.debug "parse field - sep: >#{sep}< (#{sep.ord})" if logger.debug?
82
+
61
83
  value = ""
62
- value << parse_spaces( io ) ## add leading spaces
84
+ skip_spaces( io ) ## strip leading spaces
63
85
 
64
86
  if (c=io.peek; c=="," || c==LF || c==CR || io.eof?) ## empty field
65
- value = value.strip if trim ## strip all spaces
66
87
  ## return value; do nothing
67
88
  elsif io.peek == DOUBLE_QUOTE
68
- puts "start double_quote field - value >#{value}<"
69
- value = value.strip ## note always strip/trim leading spaces in quoted value
70
-
71
- puts "start double_quote field - peek >#{io.peek}< (#{io.peek.ord})"
89
+ logger.debug "start double_quote field - peek >#{io.peek}< (#{io.peek.ord})" if logger.debug?
72
90
  io.getc ## eat-up double_quote
73
91
 
74
92
  loop do
@@ -89,18 +107,18 @@ def parse_field( io, trim: true )
89
107
 
90
108
  ## note: always eat-up all trailing spaces (" ") and tabs (\t)
91
109
  skip_spaces( io )
92
- puts "end double_quote field - peek >#{io.peek}< (#{io.peek.ord})"
110
+ logger.debug "end double_quote field - peek >#{io.peek}< (#{io.peek.ord})" if logger.debug?
93
111
  else
94
- puts "start reg field - peek >#{io.peek}< (#{io.peek.ord})"
112
+ logger.debug "start reg field - peek >#{io.peek}< (#{io.peek.ord})" if logger.debug?
95
113
  ## consume simple value
96
114
  ## until we hit "," or "\n" or "\r"
97
115
  ## note: will eat-up quotes too!!!
98
116
  while (c=io.peek; !(c=="," || c==LF || c==CR || io.eof?))
99
- puts " add char >#{io.peek}< (#{io.peek.ord})"
117
+ logger.debug " add char >#{io.peek}< (#{io.peek.ord})" if logger.debug?
100
118
  value << io.getc ## eat-up all spaces (" ") and tabs (\t)
101
119
  end
102
- value = value.strip if trim ## strip all spaces
103
- puts "end reg field - peek >#{io.peek}< (#{io.peek.ord})"
120
+ value = value.strip ## strip all trailing spaces
121
+ logger.debug "end reg field - peek >#{io.peek}< (#{io.peek.ord})" if logger.debug?
104
122
  end
105
123
 
106
124
  value
@@ -108,12 +126,60 @@ end
108
126
 
109
127
 
110
128
 
111
- def parse_record( io, trim: true )
129
+
130
+ def parse_field_strict( io, sep: )
131
+ logger.debug "parse field (strict) - sep: >#{sep}< (#{sep.ord})" if logger.debug?
132
+
133
+ value = ""
134
+
135
+ if (c=io.peek; c==sep || c==LF || c==CR || io.eof?) ## empty unquoted field
136
+ value = config[:unquoted_empty] ## defaults to "" (might be set to nil if needed)
137
+ ## return value; do nothing
138
+ elsif config[:quote] && io.peek == config[:quote]
139
+ logger.debug "start quote field (strict) - peek >#{io.peek}< (#{io.peek.ord})" if logger.debug?
140
+ io.getc ## eat-up double_quote
141
+
142
+ loop do
143
+ while (c=io.peek; !(c==config[:quote] || io.eof?))
144
+ value << io.getc ## eat-up everything unit quote (")
145
+ end
146
+
147
+ break if io.eof?
148
+
149
+ io.getc ## eat-up double_quote
150
+
151
+ if config[:doublequote] && io.peek == config[:quote] ## doubled up quote?
152
+ value << io.getc ## add doube quote and continue!!!!
153
+ else
154
+ break
155
+ end
156
+ end
157
+
158
+ value = config[:quoted_empty] if value == "" ## defaults to "" (might be set to nil if needed)
159
+
160
+ logger.debug "end double_quote field (strict) - peek >#{io.peek}< (#{io.peek.ord})" if logger.debug?
161
+ else
162
+ logger.debug "start reg field (strict) - peek >#{io.peek}< (#{io.peek.ord})" if logger.debug?
163
+ ## consume simple value
164
+ ## until we hit "," or "\n" or "\r" or stroy "\"" double quote
165
+ while (c=io.peek; !(c==sep || c==LF || c==CR || c==config[:quote] || io.eof?))
166
+ logger.debug " add char >#{io.peek}< (#{io.peek.ord})" if logger.debug?
167
+ value << io.getc
168
+ end
169
+ logger.debug "end reg field (strict) - peek >#{io.peek}< (#{io.peek.ord})" if logger.debug?
170
+ end
171
+
172
+ value
173
+ end
174
+
175
+
176
+
177
+ def parse_record( io, sep: )
112
178
  values = []
113
179
 
114
180
  loop do
115
- value = parse_field( io, trim: trim )
116
- puts "value: »#{value}«"
181
+ value = parse_field( io, sep: sep )
182
+ logger.debug "value: »#{value}«" if logger.debug?
117
183
  values << value
118
184
 
119
185
  if io.eof?
@@ -133,6 +199,33 @@ def parse_record( io, trim: true )
133
199
  end
134
200
 
135
201
 
202
+
203
+ def parse_record_strict( io, sep: )
204
+ values = []
205
+
206
+ loop do
207
+ value = parse_field_strict( io, sep: sep )
208
+ logger.debug "value: »#{value}«" if logger.debug?
209
+ values << value
210
+
211
+ if io.eof?
212
+ break
213
+ elsif (c=io.peek; c==LF || c==CR)
214
+ skip_newline( io ) ## note: singular / single newline only (NOT plural)
215
+ break
216
+ elsif io.peek == sep
217
+ io.getc ## eat-up FS (,)
218
+ else
219
+ puts "*** csv parse error (strict): found >#{io.peek} (#{io.peek.ord})< - FS (,) or RS (\\n) expected!!!!"
220
+ exit(1)
221
+ end
222
+ end
223
+
224
+ values
225
+ end
226
+
227
+
228
+
136
229
  def skip_newlines( io )
137
230
  return if io.eof?
138
231
 
@@ -142,6 +235,22 @@ def skip_newlines( io )
142
235
  end
143
236
 
144
237
 
238
+ def skip_newline( io ) ## note: singular (strict) version
239
+ return if io.eof?
240
+
241
+ ## only skip CR LF or LF or CR
242
+ if io.peek == CR
243
+ io.getc ## eat-up
244
+ io.getc if io.peek == LF
245
+ elsif io.peek == LF
246
+ io.getc ## eat-up
247
+ else
248
+ # do nothing
249
+ end
250
+ end
251
+
252
+
253
+
145
254
  def skip_until_eol( io )
146
255
  return if io.eof?
147
256
 
@@ -161,91 +270,95 @@ end
161
270
 
162
271
 
163
272
 
164
- def parse_spaces( io ) ## helper method
165
- spaces = ""
166
- ## add leading spaces
167
- while (c=io.peek; c==SPACE || c==TAB)
168
- spaces << io.getc ## eat-up all spaces (" ") and tabs (\t)
169
- end
170
- spaces
171
- end
172
-
173
-
174
-
175
-
176
- def parse_lines( io_maybe, trim: true,
177
- comments: true,
178
- blanks: true, &block )
179
273
 
180
- ## find a better name for io_maybe
181
- ## make sure io is a wrapped into BufferIO!!!!!!
182
- if io_maybe.is_a?( BufferIO ) ### allow (re)use of BufferIO if managed from "outside"
183
- io = io_maybe
184
- else
185
- io = BufferIO.new( io_maybe )
186
- end
187
274
 
275
+ def parse_lines_human( io, sep:, &block )
188
276
 
189
277
  loop do
190
278
  break if io.eof?
191
279
 
192
- ## hack: use own space buffer for peek( x ) lookahead (more than one char)
193
- ## check for comments or blank lines
194
- if comments || blanks
195
- spaces = parse_spaces( io )
196
- end
280
+ skip_spaces( io )
197
281
 
198
- if comments && io.peek == COMMENT ## comment line
199
- puts "skipping comment - peek >#{io.peek}< (#{io.peek.ord})"
282
+ if io.peek == COMMENT ## comment line
283
+ logger.debug "skipping comment - peek >#{io.peek}< (#{io.peek.ord})" if logger.debug?
200
284
  skip_until_eol( io )
201
285
  skip_newlines( io )
202
- elsif blanks && (c=io.peek; c==LF || c==CR || io.eof?)
203
- puts "skipping blank - peek >#{io.peek}< (#{io.peek.ord})"
286
+ elsif (c=io.peek; c==LF || c==CR || io.eof?)
287
+ logger.debug "skipping blank - peek >#{io.peek}< (#{io.peek.ord})" if logger.debug?
204
288
  skip_newlines( io )
205
- else # undo (ungetc spaces)
206
- puts "start record - peek >#{io.peek}< (#{io.peek.ord})"
207
-
208
- if comments || blanks
209
- ## note: MUST ungetc in "reverse" order
210
- ## ## buffer is last in/first out queue!!!!
211
- spaces.reverse.each_char { |space| io.ungetc( space ) }
212
- end
213
-
214
- record = parse_record( io, trim: trim )
289
+ else
290
+ logger.debug "start record - peek >#{io.peek}< (#{io.peek.ord})" if logger.debug?
215
291
 
292
+ record = parse_record( io, sep: sep )
216
293
  ## note: requires block - enforce? how? why? why not?
217
294
  block.call( record ) ## yield( record )
218
295
  end
219
296
  end # loop
220
- end # method parse_lines
297
+ end # method parse_lines_human
298
+
299
+
300
+
301
+ def parse_lines_strict( io, sep:, &block )
302
+
303
+ ## no leading and trailing whitespaces trimmed/stripped
304
+ ## no comments skipped
305
+ ## no blanks skipped
306
+ ## - follows strict rules of
307
+ ## note: this csv format is NOT recommended;
308
+ ## please, use a format with comments, leading and trailing whitespaces, etc.
309
+ ## only added for checking compatibility
310
+
311
+ loop do
312
+ break if io.eof?
313
+
314
+ logger.debug "start record (strict) - peek >#{io.peek}< (#{io.peek.ord})" if logger.debug?
315
+
316
+ record = parse_record_strict( io, sep: sep )
317
+
318
+ ## note: requires block - enforce? how? why? why not?
319
+ block.call( record ) ## yield( record )
320
+ end # loop
321
+ end # method parse_lines_strict
322
+
221
323
 
222
324
 
325
+ def parse_lines( io_maybe, sep: config[:sep], &block )
326
+ ## find a better name for io_maybe
327
+ ## make sure io is a wrapped into BufferIO!!!!!!
328
+ if io_maybe.is_a?( BufferIO ) ### allow (re)use of BufferIO if managed from "outside"
329
+ io = io_maybe
330
+ else
331
+ io = BufferIO.new( io_maybe )
332
+ end
223
333
 
334
+ if strict?
335
+ parse_lines_strict( io, sep: sep, &block )
336
+ else
337
+ parse_lines_human( io, sep: sep, &block )
338
+ end
339
+ end ## parse_lines
340
+
341
+
342
+
343
+ ## fix: add optional block - lets you use it like foreach!!!
344
+ ## make foreach an alias of parse with block - why? why not?
345
+ ##
346
+ ## unifiy with (make one) parse and parse_lines!!!! - why? why not?
224
347
 
225
- def parse( io_maybe, trim: true,
226
- comments: true,
227
- blanks: true,
228
- limit: nil )
348
+ def parse( io_maybe, sep: config[:sep], limit: nil )
229
349
  records = []
230
350
 
231
- parse_lines( io_maybe, trim: trim, comments: comments, blanks: blanks ) do |record|
351
+ parse_lines( io_maybe, sep: sep ) do |record|
232
352
  records << record
233
353
 
234
354
  ## set limit to 1 for processing "single" line (that is, get one record)
235
- return records if limit && limit >= records.size
355
+ break if limit && limit >= records.size
236
356
  end
237
357
 
238
358
  records
239
359
  end ## method parse
240
360
 
241
361
 
242
- def foreach( io_maybe, trim: true,
243
- comments: true,
244
- blanks: true, &block )
245
- parse_lines( io_maybe, trim: trim, comments: comments, blanks: blanks, &block )
246
- end
247
-
248
-
249
362
 
250
363
  end # class Parser
251
364
  end # class CsvReader
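
The doubled-quote handling added in `parse_field_strict` above can be illustrated in isolation. The following is a minimal sketch, not the gem's code: `read_quoted_field` is a hypothetical helper, and it reads from a plain `StringIO` with `getc`/`ungetc` instead of the gem's `BufferIO#peek`.

```ruby
require 'stringio'

# Hypothetical helper (not part of csvreader): reads one quoted field and
# treats a doubled quote ("") inside the field as one literal quote.
def read_quoted_field( io, quote: '"' )
  value = +""
  io.getc                        # eat-up the opening quote

  loop do
    # copy everything up to the next quote (or the end of input)
    while (c = io.getc) && c != quote
      value << c
    end
    break if c.nil?              # hit eof inside the quoted field

    nxt = io.getc                # look one char ahead
    if nxt == quote
      value << quote             # doubled quote => literal quote, keep going
    else
      io.ungetc( nxt ) if nxt    # not a quote; push the char back
      break                      # closing quote reached
    end
  end

  value
end

io    = StringIO.new( %q{"say ""hi"" now",next} )
field = read_quoted_field( io )
rest  = io.read                  # what is left for the caller: separator + next field

puts field
puts rest
```

After the call the stream is positioned at the separator, which is where the record-level loop (`parse_record_strict` in the diff) expects to find either the field separator or a newline.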
@@ -1,150 +1,98 @@
1
1
  # encoding: utf-8
2
2
 
3
3
 
4
- module Csv ## check: rename to CsvSettings / CsvPref / CsvGlobals or similar - why? why not???
5
4
 
5
+ class CsvReader
6
6
 
7
- class Dialect ## todo: use a module - it's just a namespace/module now - why? why not?
8
- ###
9
- # (auto-)add these flavors/dialects:
10
- # :tab -> uses TabReader(!)
11
- # :strict|:rfc4180
12
- # :unix -> uses unix-style escapes e.g. \n \" etc.
13
- # :windows|:excel
14
- # :guess|:auto -> guess (auto-detect) separator - why? why not?
15
-
16
- ## e.g. use Dialect.registry[:unix] = { ... } etc.
17
- ## note use @@ - there is only one registry
18
- def self.registry() @@registry ||={} end
19
-
20
- ## add built-in dialects:
21
- ## trim - use strip? why? why not? use alias?
22
- registry[:tab] = {} ##{ class: TabReader }
23
- registry[:strict] = { strict: true, trim: false } ## add no comments, blank lines, etc. ???
24
- registry[:rfc4180] = :strict ## alternative name
25
- registry[:windows] = {}
26
- registry[:excel] = :windows
27
- registry[:unix] = {}
28
-
29
- ## todo: add some more
30
- end # class Dialect
31
-
32
-
33
-
34
- class Configuration
35
-
7
+ def initialize( parser )
8
+ @parser = parser
9
+ end
36
10
 
37
- attr_accessor :sep ## col_sep (column separator)
38
- attr_accessor :na ## not available (string or array of strings or nil) - rename to nas/nils/nulls - why? why not?
39
- attr_accessor :trim ### allow ltrim/rtrim/trim - why? why not?
40
- attr_accessor :blanks
41
- attr_accessor :comments
42
- attr_accessor :dialect
11
+ DEFAULT = new( Parser::DEFAULT )
12
+ RFC4180 = new( Parser::RFC4180 )
13
+ EXCEL = new( Parser::EXCEL )
43
14
 
44
- def initialize
45
- @sep = ','
46
- @blanks = true
47
- @comments = true
48
- @trim = true
49
- ## note: do NOT add headers as global - should ALWAYS be explicit
50
- ## headers (true/false) - changes resultset and requires different processing!!!
15
+ def self.default() DEFAULT; end ## alternative alias for DEFAULT
16
+ def self.rfc4180() RFC4180; end ## alternative alias for RFC4180
17
+ def self.excel() EXCEL; end ## alternative alias for EXCEL
51
18
 
52
- self ## return self for chaining
53
- end
54
19
 
55
- ## strip leading and trailing spaces
56
- def trim?() @trim; end
57
-
58
- ## skip blank lines (with only 1+ spaces)
59
- ## note: for now blank lines with no spaces will always get skipped
60
- def blanks?() @blanks; end
61
-
62
-
63
- def comments?() @comments; end
64
-
65
-
66
- ## built-in (default) options
67
- ## todo: find a better name?
68
- def default_options
69
- ## note:
70
- ## do NOT include sep character and
71
- ## do NOT include headers true/false here
72
- ##
73
- ## make default sep its own "global" default config
74
- ## e.g. Csv.config.sep =
75
-
76
- ## common options
77
- ## skip comments starting with #
78
- ## skip blank lines
79
- ## strip leading and trailing spaces
80
- ## NOTE/WARN: leading and trailing spaces NOT allowed/working with double quoted values!!!!
81
- defaults = {
82
- blanks: @blanks, ## note: skips lines with no whitespaces only!! (e.g. line with space is NOT blank!!)
83
- comments: @comments,
84
- trim: @trim
85
- ## :converters => :strip
86
- }
87
- defaults
88
- end
89
- end # class Configuration
20
+ #####################
21
+ ## convenience helpers defaulting to default csv dialect/format reader
22
+ ##
23
+ ## CsvReader.parse_line is the same as
24
+ ## CsvReader::DEFAULT.parse_line or CsvReader.default.parse_line
25
+ ##
90
26
 
27
+ def self.parse_line( data, sep: nil,
28
+ converters: nil )
29
+ DEFAULT.parse_line( data, sep: sep, converters: converters )
30
+ end
91
31
 
92
- ## lets you use
93
- ## Csv.configure do |config|
94
- ## config.sep = ',' ## or "/t"
95
- ## end
32
+ def self.parse( data, sep: nil,
33
+ converters: nil )
34
+ DEFAULT.parse( data, sep: sep, converters: converters )
35
+ end
96
36
 
97
- def self.configure
98
- yield( config )
37
+ #### fix!!! remove - replace with parse with (optional) block!!!!!
38
+ def self.parse_lines( data, sep: nil,
39
+ converters: nil, &block )
40
+ DEFAULT.parse_lines( data, sep: sep, converters: nil, &block )
99
41
  end
100
42
 
101
- def self.config
102
- @config ||= Configuration.new
43
+ def self.read( path, sep: nil,
44
+ converters: nil )
45
+ DEFAULT.read( path, sep: sep, converters: converters )
103
46
  end
104
- end # module Csvv
105
47
 
48
+ def self.header( path, sep: nil )
49
+ DEFAULT.header( path, sep: sep )
50
+ end
106
51
 
52
+ def self.foreach( path, sep: nil,
53
+ converters: nil, &block )
54
+ DEFAULT.foreach( path, sep: sep, converters: converters, &block )
55
+ end
107
56
 
108
- ####
109
- ## use our own wrapper
110
57
 
111
- class CsvReader
112
58
 
113
- def self.parse_line( txt, sep: Csv.config.sep,
114
- trim: Csv.config.trim?,
115
- na: Csv.config.na,
116
- dialect: Csv.config.dialect,
117
- converters: nil)
118
- ## note: do NOT include headers option (otherwise single row gets skipped as first header row :-)
119
- csv_options = Csv.config.default_options.merge(
120
- col_sep: sep
121
- )
122
- ## pp csv_options
123
- Parser.parse_line( txt ) ##, csv_options )
124
- end
59
+ #############################
60
+ ## all "high-level" reader methods
61
+ ##
62
+ ## note: allow "overriding" of separator
63
+ ## if sep is not nil otherwise use default dialect/format separator
125
64
 
126
65
 
127
66
  ##
128
67
  ## todo/fix: "unify" parse and parse_lines !!!
129
68
  ## check for block_given? - why? why not?
130
69
 
131
- def self.parse( txt, sep: Csv.config.sep )
132
- csv_options = Csv.config.default_options.merge(
133
- col_sep: sep
134
- )
135
- ## pp csv_options
136
- Parser.parse( txt ) ###, csv_options )
70
+ def parse( data, sep: nil, limit: nil,
71
+ converters: nil )
72
+ sep = @parser.config[:sep] if sep.nil?
73
+ @parser.parse( data, sep: sep, limit: limit )
74
+ end
75
+
76
+ #### fix!!! remove - replace with parse with (optional) block!!!!!
77
+ def parse_lines( data, sep: nil,
78
+ converters: nil, &block )
79
+ sep = @parser.config[:sep] if sep.nil?
80
+ @parser.parse_lines( data, sep: sep, &block )
137
81
  end
138
82
 
139
- def self.parse_lines( txt, sep: Csv.config.sep, &block )
140
- csv_options = Csv.config.default_options.merge(
141
- col_sep: sep
142
- )
143
- ## pp csv_options
144
- Parser.parse_lines( txt, &block ) ###, csv_options )
83
+
84
+
85
+ def parse_line( data, sep: nil,
86
+ converters: nil )
87
+ records = parse( data, sep: sep, limit: 1 )
88
+
89
+ ## unwrap record if empty return nil - why? why not?
90
+ ## return empty record e.g. [] - why? why not?
91
+ records.size == 0 ? nil : records.first
145
92
  end
146
93
 
147
- def self.read( path, sep: Csv.config.sep )
94
+ def read( path, sep: nil,
95
+ converters: nil )
148
96
  ## note: use our own file.open
149
97
  ## always use utf-8 for now
150
98
  ## check/todo: add skip option bom too - why? why not?
@@ -152,33 +100,26 @@ class CsvReader
152
100
  parse( txt, sep: sep )
153
101
  end
154
102
 
155
-
156
- def self.foreach( path, sep: Csv.config.sep, &block )
157
- csv_options = Csv.config.default_options.merge(
158
- col_sep: sep
159
- )
160
-
161
- Parser.foreach( path, &block ) ###, csv_options )
103
+ def foreach( path, sep: nil,
104
+ converters: nil, &block )
105
+ File.open( path, 'r:bom|utf-8' ) do |file|
106
+ parse_lines( file, sep: sep, &block )
107
+ end
162
108
  end
163
109
 
164
110
 
165
- def self.header( path, sep: Csv.config.sep ) ## use header or headers - or use both (with alias)?
166
- # read first lines (only)
167
- # and parse with csv to get header from csv library itself
168
- #
169
- # check - if there's an easier or built-in way for the csv library
170
111
 
171
- ## readlines until
172
- ## - NOT a comments line or
173
- ## - NOT a blank line
112
+ def header( path, sep: nil ) ## use header or headers - or use both (with alias)?
113
+ # read first lines (only)
114
+ # and parse with csv to get header from csv library itself
174
115
 
175
116
  record = nil
176
117
  File.open( path, 'r:bom|utf-8' ) do |file|
177
- record = Parser.parse_line( file )
118
+ record = parse_line( file, sep: sep )
178
119
  end
179
120
 
180
- record ## todo/fix: return nil for empty - why? why not?
181
- end # method self.header
121
+ record ## todo/fix: returns nil for empty - why? why not?
122
+ end # method self.header
182
123
 
183
124
  end # class CsvReader
184
125
 
@@ -188,13 +129,13 @@ end # class CsvReader
188
129
  class CsvHashReader
189
130
 
190
131
 
191
- def self.parse( txt, sep: Csv.config.sep, headers: nil )
132
+ def self.parse( data, sep: nil, headers: nil )
192
133
 
193
134
  ## pass in headers as array e.g. ['A', 'B', 'C']
194
135
  names = headers ? headers : nil
195
136
 
196
137
  records = []
197
- CsvReader.parse_lines( txt ) do |values| # sep: sep
138
+ CsvReader.parse_lines( data ) do |values| # sep: sep
198
139
  if names.nil?
199
140
  names = values ## store header row / a.k.a. field/column names
200
141
  else
@@ -206,13 +147,13 @@ def self.parse( txt, sep: Csv.config.sep, headers: nil )
206
147
  end
207
148
 
208
149
 
209
- def self.read( path, sep: Csv.config.sep, headers: nil )
150
+ def self.read( path, sep: nil, headers: nil )
210
151
  txt = File.open( path, 'r:bom|utf-8' ).read
211
152
  parse( txt, sep: sep, headers: headers )
212
153
  end
213
154
 
214
155
 
215
- def self.foreach( path, sep: Csv.config.sep, headers: nil, &block )
156
+ def self.foreach( path, sep: nil, headers: nil, &block )
216
157
 
217
158
  ## pass in headers as array e.g. ['A', 'B', 'C']
218
159
  names = headers ? headers : nil
@@ -228,7 +169,7 @@ def self.foreach( path, sep: Csv.config.sep, headers: nil, &block )
228
169
  end
229
170
 
230
171
 
231
- def self.header( path, sep: Csv.config.sep ) ## add header too? why? why not?
172
+ def self.header( path, sep: nil ) ## add header too? why? why not?
232
173
  ## same as "classic" header method - delegate/reuse :-)
233
174
  CsvReader.header( path, sep: sep )
234
175
  end
@@ -4,7 +4,7 @@
4
4
  class CsvReader ## note: uses a class for now - change to module - why? why not?
5
5
 
6
6
  MAJOR = 0 ## todo: namespace inside version or something - why? why not??
7
- MINOR = 4
7
+ MINOR = 5
8
8
  PATCH = 0
9
9
  VERSION = [MAJOR,MINOR,PATCH].join('.')
10
10
 
data/test/test_parser.rb CHANGED
@@ -9,24 +9,38 @@ require 'helper'
9
9
 
10
10
  class TestParser < MiniTest::Test
11
11
 
12
+ def setup
13
+ CsvReader::Parser.logger.level = :debug ## turn on "global" logging - move to helper - why? why not?
14
+ end
15
+
16
+ def parser
17
+ parser = CsvReader::Parser::DEFAULT
18
+ end
19
+
12
20
 
13
- def test_parse1
14
- records = [["a", "b", "c"],
15
- ["1", "2", "3"],
16
- ["4", "5", "6"]]
17
-
18
- ## don't care about newlines (\r\n)
19
- assert_equal records, CsvReader::Parser.parse( "a,b,c\n1,2,3\n4,5,6" )
20
- assert_equal records, CsvReader::Parser.parse( "a,b,c\n1,2,3\n4,5,6\n" )
21
- assert_equal records, CsvReader::Parser.parse( "a,b,c\r1,2,3\r4,5,6" )
22
- assert_equal records, CsvReader::Parser.parse( "a,b,c\r\n1,2,3\r\n4,5,6\r\n" )
23
-
24
- ## or leading and trailing spaces
25
- assert_equal records, CsvReader::Parser.parse( " \n a , b , c \n 1,2 ,3 \n 4,5,6 " )
26
- assert_equal records, CsvReader::Parser.parse( "\n\na, b,c \n 1, 2, 3\n 4, 5, 6" )
27
- assert_equal records, CsvReader::Parser.parse( " \"a\" , b , \"c\" \n1, 2,\"3\" \n4,5, \"6\"" )
28
- assert_equal records, CsvReader::Parser.parse( "a, b, c\n1, 2,3\n\n\n4,5,6\n\n\n" )
29
- assert_equal records, CsvReader::Parser.parse( " a, b ,c \n 1 , 2 , 3 \n4,5,6 " )
21
+ def test_parser_default
22
+ pp CsvReader::Parser::DEFAULT
23
+ pp CsvReader::Parser.default
24
+ assert true
25
+ end
26
+
27
+ def test_parse
28
+ records = [["a", "b", "c"],
29
+ ["1", "2", "3"],
30
+ ["4", "5", "6"]]
31
+
32
+ ## don't care about newlines (\r\n)
33
+ assert_equal records, parser.parse( "a,b,c\n1,2,3\n4,5,6" )
34
+ assert_equal records, parser.parse( "a,b,c\n1,2,3\n4,5,6\n" )
35
+ assert_equal records, parser.parse( "a,b,c\r1,2,3\r4,5,6" )
36
+ assert_equal records, parser.parse( "a,b,c\r\n1,2,3\r\n4,5,6\r\n" )
37
+
38
+ ## or leading and trailing spaces
39
+ assert_equal records, parser.parse( " \n a , b , c \n 1,2 ,3 \n 4,5,6 " )
40
+ assert_equal records, parser.parse( "\n\na, b,c \n 1, 2, 3\n 4, 5, 6" )
41
+ assert_equal records, parser.parse( " \"a\" , b , \"c\" \n1, 2,\"3\" \n4,5, \"6\"" )
42
+ assert_equal records, parser.parse( "a, b, c\n1, 2,3\n\n\n4,5,6\n\n\n" )
43
+ assert_equal records, parser.parse( " a, b ,c \n 1 , 2 , 3 \n4,5,6 " )
30
44
  end
31
45
 
32
46
 
@@ -34,19 +48,19 @@ def test_parse_quotes
    records = [["a", "b", "c"],
               ["11 \n 11", "\"2\"", "3"]]
 
-   assert_equal records, CsvReader::Parser.parse( " a, b ,c \n\"11 \n 11\", \"\"\"2\"\"\" , 3 \n" )
-   assert_equal records, CsvReader::Parser.parse( "\n\n \"a\", \"b\" ,\"c\" \n \"11 \n 11\" , \"\"\"2\"\"\" , 3 \n" )
+   assert_equal records, parser.parse( " a, b ,c \n\"11 \n 11\", \"\"\"2\"\"\" , 3 \n" )
+   assert_equal records, parser.parse( "\n\n \"a\", \"b\" ,\"c\" \n \"11 \n 11\" , \"\"\"2\"\"\" , 3 \n" )
  end
 
  def test_parse_empties
    records = [["", "", ""]]
 
-   assert_equal records, CsvReader::Parser.parse( ",," )
-   assert_equal records, CsvReader::Parser.parse( <<TXT )
+   assert_equal records, parser.parse( ",," )
+   assert_equal records, parser.parse( <<TXT )
  "","",""
  TXT
 
-   assert_equal [], CsvReader::Parser.parse( "" )
+   assert_equal [], parser.parse( "" )
  end
 
 
@@ -54,7 +68,7 @@ def test_parse_comments
    records = [["a", "b", "c"],
               ["1", "2", "3"]]
 
-   assert_equal records, CsvReader::Parser.parse( <<TXT )
+   assert_equal records, parser.parse( <<TXT )
  # comment
  # comment
  ## comment
@@ -64,7 +78,7 @@ a, b, c
 
  TXT
 
-   assert_equal records, CsvReader::Parser.parse( <<TXT )
+   assert_equal records, parser.parse( <<TXT )
  a, b, c
  1, 2, 3
 
data/test/test_parser_formats.rb ADDED
@@ -0,0 +1,69 @@
+ # encoding: utf-8
+
+ ###
+ #  to run use
+ #     ruby -I ./lib -I ./test test/test_parser_formats.rb
+
+
+ require 'helper'
+
+ class TestParserFormats < MiniTest::Test
+
+ def setup
+   CsvReader::Parser.logger.level = :debug   ## turn on "global" logging - move to helper - why? why not?
+ end
+
+ def parser
+   CsvReader::Parser
+ end
+
+
+ def test_parse_whitespace
+   records = [["a", "b", "c"],
+              ["1", "2", "3"]]
+
+   ## don't care about newlines (\r\n) ??? - fix? why? why not?
+   assert_equal records, parser.default.parse( "a,b,c\n1,2,3" )
+   assert_equal records, parser.default.parse( "a,b,c\n1,2,3\n" )
+   assert_equal records, parser.default.parse( " a, b ,c \n\n1,2,3\n" )
+   assert_equal records, parser.default.parse( " a, b ,c \n \n1,2,3\n" )
+
+   assert_equal [["a", "b", "c"],
+                 [""],
+                 ["1", "2", "3"]], parser.default.parse( %Q{a,b,c\n""\n1,2,3\n} )
+   assert_equal [["", ""],
+                 [""],
+                 ["", "", ""]], parser.default.parse( %Q{,\n""\n"","",""\n} )
+
+
+   ## strict rfc4180 - no trim leading or trailing spaces or blank lines
+   assert_equal records, parser.rfc4180.parse( "a,b,c\n1,2,3" )
+   assert_equal [["a", "b", "c"],
+                 [""],
+                 ["1", "2", "3"]], parser.rfc4180.parse( "a,b,c\n\n1,2,3" )
+   assert_equal [[" a", " b ", "c "],
+                 [""],
+                 ["1", "2", "3"]], parser.rfc4180.parse( " a, b ,c \n\n1,2,3" )
+   assert_equal [[" a", " b ", "c "],
+                 [" "],
+                 ["",""],
+                 ["1", "2", "3"]], parser.rfc4180.parse( " a, b ,c \n \n,\n1,2,3" )
+ end
+
+
+ def test_parse_empties
+   assert_equal [], parser.default.parse( "\n \n \n" )
+
+   ## strict rfc4180 - no trim leading or trailing spaces or blank lines
+   assert_equal [[""],
+                 [" "],
+                 [" "]], parser.rfc4180.parse( "\n \n \n" )
+   assert_equal [[""],
+                 [" "],
+                 [" "]], parser.rfc4180.parse( "\n \n " )
+
+   assert_equal [[""]], parser.rfc4180.parse( "\n" )
+   assert_equal [], parser.rfc4180.parse( "" )
+ end
+
+ end # class TestParserFormats
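
Reviewer note: the tests in this file contrast two whitespace policies. The default format skips blank lines and trims spaces around values, while the strict rfc4180 format keeps both. A minimal sketch of that contrast follows. The helpers `split_lines`, `parse_lenient` and `parse_strict` are hypothetical names made up for illustration, they ignore quoting entirely, and they are not csvreader's implementation.

```ruby
# Hypothetical helpers, not csvreader's implementation. They ignore quoting
# entirely and only contrast the two whitespace policies.

# Split text into lines on \r\n, \r or \n; a final newline ends the last
# record instead of starting an extra empty one.
def split_lines(text)
  lines = text.split(/\r\n|\r|\n/, -1)
  lines.pop if lines.last == ""
  lines
end

# Lenient: drop blank lines and trim spaces around every value.
def parse_lenient(text, sep: ",")
  split_lines(text)
    .reject { |line| line.strip.empty? }
    .map { |line| line.split(sep, -1).map(&:strip) }
end

# Strict: keep every line and every space; an empty line is one empty value.
def parse_strict(text, sep: ",")
  split_lines(text).map { |line| line.empty? ? [""] : line.split(sep, -1) }
end

p parse_lenient(" a, b ,c \n \n1,2,3\n")
p parse_strict(" a, b ,c \n \n,\n1,2,3")
```

Passing `-1` as the limit to `String#split` keeps trailing empty values, so a line such as `,` yields two empty strings rather than none.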
data/test/test_parser_rfc4180.rb ADDED
@@ -0,0 +1,95 @@
+ # encoding: utf-8
+
+ ###
+ #  to run use
+ #     ruby -I ./lib -I ./test test/test_parser_rfc4180.rb
+
+
+ require 'helper'
+
+ class TestParserRfc4180 < MiniTest::Test
+
+ def setup
+   CsvReader::Parser.logger.level = :debug   ## turn on "global" logging - move to helper - why? why not?
+ end
+
+ def parser
+   CsvReader::Parser::RFC4180
+ end
+
+
+ def test_parser_rfc4180
+   pp CsvReader::Parser::RFC4180
+   pp CsvReader::Parser.rfc4180
+   assert true
+ end
+
+ def test_parse
+   records = [["a", "b", "c"],
+              ["1", "2", "3"],
+              ["4", "5", "6"]]
+
+   ## don't care about newlines (\r\n) ??? - fix? why? why not?
+   assert_equal records, parser.parse( "a,b,c\n1,2,3\n4,5,6" )
+   assert_equal records, parser.parse( "a,b,c\n1,2,3\n4,5,6\n" )
+   assert_equal records, parser.parse( "a,b,c\r1,2,3\r4,5,6" )
+   assert_equal records, parser.parse( "a,b,c\r\n1,2,3\r\n4,5,6\r\n" )
+ end
+
+ def test_parse_semicolon
+   records = [["a", "b", "c"],
+              ["1", "2", "3"],
+              ["4", "5", "6"]]
+
+   ## don't care about newlines (\r\n) ??? - fix? why? why not?
+   assert_equal records, parser.parse( "a;b;c\n1;2;3\n4;5;6", sep: ';' )
+   assert_equal records, parser.parse( "a;b;c\n1;2;3\n4;5;6\n", sep: ';' )
+   assert_equal records, parser.parse( "a;b;c\r1;2;3\r4;5;6", sep: ';' )
+   assert_equal records, parser.parse( "a;b;c\r\n1;2;3\r\n4;5;6\r\n", sep: ';' )
+ end
+
+ def test_parse_tab
+   records = [["a", "b", "c"],
+              ["1", "2", "3"],
+              ["4", "5", "6"]]
+
+   ## don't care about newlines (\r\n) ??? - fix? why? why not?
+   assert_equal records, parser.parse( "a\tb\tc\n1\t2\t3\n4\t5\t6", sep: "\t" )
+   assert_equal records, parser.parse( "a\tb\tc\n1\t2\t3\n4\t5\t6\n", sep: "\t" )
+   assert_equal records, parser.parse( "a\tb\tc\r1\t2\t3\r4\t5\t6", sep: "\t" )
+   assert_equal records, parser.parse( "a\tb\tc\r\n1\t2\t3\r\n4\t5\t6\r\n", sep: "\t" )
+ end
+
+
+
+ def test_parse_empties
+   assert_equal [["","",""],["","",""]], parser.parse( %Q{"","",""\n,,} )
+
+   parser.config[:quoted_empty] = nil
+
+   assert_nil parser.config[:quoted_empty]
+   assert_equal "", parser.config[:unquoted_empty]
+
+   assert_equal [[nil,nil,nil," "],["","",""," "]], parser.parse( %Q{"","",""," "\n,,, } )
+
+
+   parser.config[:unquoted_empty] = nil
+
+   assert_nil parser.config[:quoted_empty]
+   assert_nil parser.config[:unquoted_empty]
+
+   assert_equal [[nil,nil,nil," "],[nil,nil,nil," "]], parser.parse( %Q{"","",""," "\n,,, } )
+
+
+   ## reset to defaults
+   parser.config[:quoted_empty] = ""
+   parser.config[:unquoted_empty] = ""
+
+   assert_equal "", parser.config[:quoted_empty]
+   assert_equal "", parser.config[:unquoted_empty]
+
+   assert_equal [["","",""],["","",""]], parser.parse( %Q{"","",""\n,,} )
+ end
+
+
+ end # class TestParserRfc4180
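
Reviewer note: `test_parse_empties` above shows that an empty quoted value (`""`) and an empty unquoted value can each be mapped to a different result through `config[:quoted_empty]` and `config[:unquoted_empty]`. The rule can be sketched as below. `convert_field` is a hypothetical helper written for illustration, and it assumes the caller already knows whether the value was quoted; it is not how the gem is implemented.

```ruby
# Hypothetical helper, not part of csvreader: decide what an empty field
# becomes, depending on whether it was written as "" or left bare.
# Non-empty values (including a single space) pass through unchanged.
def convert_field(value, quoted:, quoted_empty: "", unquoted_empty: "")
  return value unless value.empty?
  quoted ? quoted_empty : unquoted_empty
end

# Pairs of [value, quoted?], as a parser might report them for one row.
row = [["", true], ["", false], [" ", true]]

p row.map { |value, quoted| convert_field(value, quoted: quoted, quoted_empty: nil) }
# => [nil, "", " "]
```

With both options set to `nil`, every empty value becomes `nil` while the single-space value is still returned as is, which matches the second group of assertions in the test.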
data/test/test_reader.rb CHANGED
@@ -9,6 +9,10 @@ require 'helper'
 
  class TestReader < MiniTest::Test
 
+ def setup
+   CsvReader::Parser.logger.level = :debug   ## turn on "global" logging - move to helper - why? why not?
+ end
+
 
  def test_read
    puts "== read: beer.csv:"
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: csvreader
  version: !ruby/object:Gem::Version
-   version: 0.4.0
+   version: 0.5.0
  platform: ruby
  authors:
  - Gerald Bauer
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2018-08-21 00:00:00.000000000 Z
+ date: 2018-09-25 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: rdoc
@@ -64,6 +64,8 @@ files:
  - test/data/shakespeare.csv
  - test/helper.rb
  - test/test_parser.rb
+ - test/test_parser_formats.rb
+ - test/test_parser_rfc4180.rb
  - test/test_reader.rb
  - test/test_reader_hash.rb
  homepage: https://github.com/csv11/csvreader