RubyGems - csvreader - Versions diffs - 1.1.5 → 1.2.0 - Mend

csvreader 1.1.5 → 1.2.0

Files changed (11)

checksums.yaml +4 -4
data/Manifest.txt +1 -0
data/README.md +34 -6
data/lib/csvreader/parser_std.rb +70 -29
data/lib/csvreader/parser_table.rb +36 -2
data/lib/csvreader/version.rb +2 -2
data/test/data/test.csv +21 -0
data/test/test_parser.rb +73 -0
data/test/test_parser_table.rb +42 -0
data/test/test_samples.rb +14 -0
metadata +3 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 5a26d07e0e618e2cfb6803c8f059e76cf0541a47
-  data.tar.gz: 71ed3dba6b2d0df06e733832b35a709bd427aabc
+  metadata.gz: 5e83a3e71ad1ec014c4744e80be07aa7b6caef10
+  data.tar.gz: c181a4d7f379f241e5a8c1a21af99523c8a0c9d3
 SHA512:
-  metadata.gz: 10a4ef973d7c573243fd02c224ee16951a7403ad8472ad5a7b0806c252ec718be3ef404c90e3c32ff41dc2388e441e6a3931226330ce870da54dcff979ced024
-  data.tar.gz: '09030d9645657ef71643c1703209806613b29d6da34aa9c01eb88f3c9a7530c6b2b4953931e5f3789d5fc4a3a2daac9e65317544305dee00347f84cd5e41d22d'
+  metadata.gz: be9862e8ff97642f27a18e8d9160d534e3197a862e2a8c2d94b2ffe01b264a47b2e0996693dbc57b5cc68bd6113813fce7ea75289fd2a0e227ad0adc476868b3
+  data.tar.gz: b7696e7342f7676a928c15f6f35b90c78c56e7519de3eb6de95e4fbb3dd3c7823136bda2676f46aea29930c5026ee9d7384e86fb26cffefb1603e9279fbe0ce7

data/Manifest.txt CHANGED

@@ -27,6 +27,7 @@ test/data/iris.attrib.csv
 test/data/iris11.csv
 test/data/lcc.attrib.csv
 test/data/shakespeare.csv
+test/data/test.csv
 test/helper.rb
 test/test_buffer.rb
 test/test_converter.rb

data/README.md CHANGED

@@ -11,6 +11,17 @@
 ## What's News?
+**v1.2** Add support for alternative (non-space) separators (e.g. `;|^:`)
+to the default parser (`ParserStd`).
+**v1.1.5**  Added built-in support for (optional) alternative space
+character
+(e.g. `_-+•`)
+to the default parser (`ParserStd`) and the table parser (`ParserTable`).
+Turns `Man_Utd` into `Man Utd`, for example. Default is turned off (`nil`).
 **v1.1.4**  Added new "classic" table parser (see `ParserTable`) for supporting fields separated by (one or more) spaces
 e.g. `Csv.table.parse( txt )`.
@@ -484,20 +495,33 @@ and so on.
 ### Q: How can I change the separator to semicolon (`;`) or pipe (`|`) or tab (`\t`)?
 Pass in the `sep` keyword option
-to the "strict" parser. Example:
+to the parser. Example:
 ``` ruby
-Csv.strict.parse( ..., sep: ';' )
-Csv.strict.read( ..., sep: ';' )
+Csv.parse( ..., sep: ';' )
+Csv.read( ..., sep: ';' )
 # ...
-Csv.strict.parse( ..., sep: '|' )
-Csv.strict.read( ..., sep: '|' )
+Csv.parse( ..., sep: '|' )
+Csv.read( ..., sep: '|' )
 # and so on
 ```
 Note: If you use tab (`\t`) use the `TabReader`
 (or for your convenience the built-in `Csv.tab` alias)!
-Why? Tab =! CSV. Yes, tab is
+If you use the "classic" one or more space or tab (`/[ \t]+/`) regex
+use the `TableReader`
+(or for your convenience the built-in `Csv.table` alias)!
+Note: The default ("The Right Way") parser does NOT allow space or tab
+as separator (because leading and trailing space always gets trimmed
+unless inside quotes, etc.). Use the `strict` parser if you want
+to make up your own format with space or tab as a separator
+or if you want that every space or tab counts (is significant).
+Aside:  Why? Tab =! CSV. Yes, tab is
 its own (even) simpler format
 (e.g. no escape rules, no newlines in values, etc.),
 see [`TabReader` »](https://github.com/csvreader/tabreader).
@@ -506,6 +530,10 @@ see [`TabReader` »](https://github.com/csvreader/tabreader).
 Csv.tab.parse( ... )  # note: "classic" strict tab format
 Csv.tab.read( ... )
 # ...
+Csv.table.parse( ... )  # note: "classic" table format (one or more spaces or tabs)
+Csv.table.read( ... )
+# ...
 ```
 If you want double quote escape rules, newlines in quotes values, etc. use
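
Note on the README change above: the `sep` option moves from the "strict" parser to the default parser, and the text explains that space or tab cannot be used as a separator there because unquoted values get trimmed. Once values are trimmed, a space separator would be indistinguishable from padding. A minimal sketch of the trimming rule, using a hypothetical helper `split_trimmed` that is not part of csvreader and ignores quotes, escapes, comments and `\r`-only newlines:

```ruby
# Hypothetical helper for illustration only; NOT the csvreader parser.
# Splits each line on a one-character separator and trims every value,
# mimicking the "leading and trailing space gets trimmed" rule for unquoted values.
def split_trimmed( text, sep: ',' )
  text.each_line.map do |line|
    line.chomp.split( sep, -1 ).map( &:strip )   # -1 keeps trailing empty values
  end
end

p split_trimmed( " a;  b ; c\n1; 2;  3\n 4; 5;6   ", sep: ';' )
p split_trimmed( "a| b| c\n", sep: '|' )
```

The padding around each value disappears in both calls, which matches what the new `test_parse_semicolon` and `test_parse_pipe` tests expect from the real parser.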

data/lib/csvreader/parser_std.rb CHANGED

@@ -49,12 +49,17 @@ attr_reader   :meta
 ##    null values - include NA - why? why not?
 ##        make null values case sensitive or add an option for case sensitive
 ##        or better allow a proc as option for checking too!!!
-def initialize( null:     ['\N', 'NA'],  ## note: set to nil for no null vales / not availabe (na)
+def initialize( sep:      ',',
+                null:     ['\N', 'NA'],  ## note: set to nil for no null vales / not availabe (na)
                 numeric:  false,   ## (auto-)convert all non-quoted values to float
-                nan:      nil      ## note: only if numeric - set mappings for Float::NAN (not a number) values
+                nan:      nil,      ## note: only if numeric - set mappings for Float::NAN (not a number) values
+                space:    nil
               )
   @config = {}   ## todo/fix: change config to proper dialect class/struct - why? why not?
+  check_sep( sep )
+  @config[:sep]     = sep
   ## note: null values must get handled by parser
   ##   only get checked for unquoted strings (and NOT for quoted strings)
   ##   "higher-level" code only knows about strings and has no longer any info if string was quoted or unquoted
@@ -62,40 +67,66 @@ def initialize( null:     ['\N', 'NA'],  ## note: set to nil for no null vales /
   @config[:numeric] = numeric
   @config[:nan]     = nan   # not a number (NaN) e.g. Float::NAN
+  ## e.g. treat/convert char to space e.g. _-+• etc
+  ##   Man_Utd   => Man Utd
+  ##  or use it for leading and trailing spaces without quotes
+  ##  todo/check: only use for unquoted values? why? why not?
+  @config[:space]   = space
   @meta  = nil     ## no meta data block   (use empty hash {} - why? why not?)
 end
+  SEPARATORS = ",;|^:"
+def check_sep( sep )
+  ## note: parse does NOT support space or tab as separator!!
+  ##    leading and trailing space or tab (whitespace) gets by default trimmed
+  ##      unless quoted (or alternative space char used e.g. _-+ if configured)
+  if SEPARATORS.include?( sep )
+     ## everything ok
+  else
+    raise ArgumentError, "invalid/unsupported sep >#{sep}< - for now only >#{SEPARATORS}< allowed; sorry"
+  end
+end
 #########################################
 ## config convenience helpers
 ##   e.g. use like  Csv.default.null = '\N'   etc.   instead of
 ##                  Csv.default.config[:null] = '\N'
-def null=( value )     @config[:null]=value; end
+def sep=( value )         check_sep( value );  @config[:sep]=value; end
+def null=( value )        @config[:null]=value; end
 def numeric=( value )     @config[:numeric]=value; end
 def nan=( value )         @config[:nan]=value; end
+def space=( value )       @config[:space]=value; end
-def parse( data, **kwargs, &block )
+def parse( str_or_readable, sep: config[:sep], &block )
+  check_sep( sep )
   ## note: data - will wrap either a String or IO object passed in data
   ## note: kwargs NOT used for now (but required for "protocol/interface" by other parsers)
   ##   make sure data (string or io) is a wrapped into Buffer!!!!!!
-  if data.is_a?( Buffer )    ### allow (re)use of Buffer if managed from "outside"
-    input = data
+  if str_or_readable.is_a?( Buffer )    ### allow (re)use of Buffer if managed from "outside"
+    input = str_or_readable
   else
-    input = Buffer.new( data )
+    input = Buffer.new( str_or_readable )
   end
   if block_given?
-    parse_lines( input, &block )
+    parse_lines( input, sep: sep, &block )
   else
     records = []
-    parse_lines( input ) do |record|
+    parse_lines( input, sep: sep ) do |record|
       records << record
     end
@@ -108,11 +139,11 @@ end ## method parse
 private
-def parse_escape( input )
+def parse_escape( input, sep: )
   value = ""
   if input.peek == BACKSLASH
     input.getc ## eat-up backslash
-    if (c=input.peek; c==BACKSLASH || c==LF || c==CR || c==',' || c==DOUBLE_QUOTE || c==SINGLE_QUOTE )
+    if (c=input.peek; c==BACKSLASH || c==LF || c==CR || c==sep || c==DOUBLE_QUOTE || c==SINGLE_QUOTE )
       logger.debug "  add escaped char >#{input.peek}< (#{input.peek.ord})"  if logger.debug?
       value << input.getc     ## add escaped char (e.g. lf, cr, etc.)
     else
@@ -128,7 +159,7 @@ end
-def parse_quote( input, opening_quote:, closing_quote:)
+def parse_quote( input, sep:, opening_quote:, closing_quote:)
   value = ""
   if input.peek == opening_quote
     input.getc  ## eat-up opening quote
@@ -141,7 +172,7 @@ def parse_quote( input, opening_quote:, closing_quote:)
       if input.eof?
         break
       elsif input.peek == BACKSLASH
-        value << parse_escape( input )
+        value << parse_escape( input, sep: sep )
       else   ## assume input.peek == quote
         input.getc ## eat-up quote
         if opening_quote == closing_quote && input.peek == closing_quote
@@ -162,7 +193,7 @@ end
-def parse_field( input )
+def parse_field( input, sep: )
   value = ""
   numeric = config[:numeric]
@@ -172,7 +203,7 @@ def parse_field( input )
   skip_spaces( input )   ## strip leading spaces
-  if (c=input.peek; c=="," || c==LF || c==CR || input.eof?) ## empty field
+  if (c=input.peek; c==sep || c==LF || c==CR || input.eof?) ## empty field
     ## note: allows null = '' that is turn unquoted empty strings into null/nil
     ##   or if using numeric into NotANumber (NaN)
     if is_null?( value )
@@ -184,7 +215,8 @@ def parse_field( input )
     end
   elsif input.peek == DOUBLE_QUOTE
     logger.debug "start double_quote field - peek >#{input.peek}< (#{input.peek.ord})"  if logger.debug?
-    value << parse_quote( input, opening_quote: DOUBLE_QUOTE,
+    value << parse_quote( input, sep: sep,
+                                 opening_quote: DOUBLE_QUOTE,
                                  closing_quote: DOUBLE_QUOTE )
     ## note: always eat-up all trailing spaces (" ") and tabs (\t)
@@ -192,26 +224,31 @@ def parse_field( input )
     logger.debug "end double_quote field - peek >#{input.peek}< (#{input.peek.ord})"  if logger.debug?
   elsif input.peek == SINGLE_QUOTE    ## allow single quote too (by default)
     logger.debug "start single_quote field - peek >#{input.peek}< (#{input.peek.ord})"  if logger.debug?
-    value << parse_quote( input, opening_quote: SINGLE_QUOTE,
+    value << parse_quote( input, sep: sep,
+                                 opening_quote: SINGLE_QUOTE,
                                  closing_quote: SINGLE_QUOTE )
     ## note: always eat-up all trailing spaces (" ") and tabs (\t)
     skip_spaces( input )
     logger.debug "end single_quote field - peek >#{input.peek}< (#{input.peek.ord})"  if logger.debug?
   elsif input.peek == "«"
-    value << parse_quote( input, opening_quote: "«",
+    value << parse_quote( input, sep: sep,
+                                 opening_quote: "«",
                                  closing_quote: "»" )
     skip_spaces( input )
   elsif input.peek == "»"
-    value << parse_quote( input, opening_quote: "»",
+    value << parse_quote( input, sep: sep,
+                                 opening_quote: "»",
                                  closing_quote: "«" )
     skip_spaces( input )
   elsif input.peek == "‹"
-    value << parse_quote( input, opening_quote: "‹",
+    value << parse_quote( input, sep: sep,
+                                 opening_quote: "‹",
                                  closing_quote: "›" )
     skip_spaces( input )
   elsif input.peek == "›"
-    value << parse_quote( input, opening_quote: "›",
+    value << parse_quote( input, sep: sep,
+                                 opening_quote: "›",
                                  closing_quote: "‹" )
     skip_spaces( input )
   else
@@ -219,9 +256,9 @@ def parse_field( input )
     ## consume simple value
     ##   until we hit "," or "\n" or "\r"
     ##    note: will eat-up quotes too!!!
-    while (c=input.peek; !(c=="," || c==LF || c==CR || input.eof?))
+    while (c=input.peek; !(c==sep || c==LF || c==CR || input.eof?))
       if input.peek == BACKSLASH
-        value << parse_escape( input )
+        value << parse_escape( input, sep: sep )
       else
         logger.debug "  add char >#{input.peek}< (#{input.peek.ord})"  if logger.debug?
         value << input.getc   ## note: eat-up all spaces (" ") and tabs (\t) too (strip trailing spaces at the end)
@@ -256,11 +293,15 @@ end
-def parse_record( input )
+def parse_record( input, sep: )
   values = []
+  space = config[:space]
   loop do
-     value = parse_field( input )
+     value = parse_field( input, sep: sep )
+     value = value.tr( space, ' ' )   if space && value.is_a?( String )
      logger.debug "value: »#{value}«"  if logger.debug?
      values << value
@@ -269,10 +310,10 @@ def parse_record( input )
      elsif (c=input.peek; c==LF || c==CR)
        skip_newline( input )
        break
-     elsif input.peek == ","
+     elsif input.peek == sep
        input.getc   ## eat-up FS(,)
      else
-       raise ParseError.new( "found >#{input.peek} (#{input.peek.ord})< - FS (,) or RS (\\n) expected!!!!" )
+       raise ParseError.new( "found >#{input.peek} (#{input.peek.ord})< - FS (#{sep}) or RS (\\n) expected!!!!" )
      end
   end
@@ -375,7 +416,7 @@ end
-def parse_lines( input, &block )
+def parse_lines( input, sep:, &block )
   ## note: reset (optional) meta data block
   @meta  = nil     ## no meta data block   (use empty hash {} - why? why not?)
@@ -426,7 +467,7 @@ def parse_lines( input, &block )
     else
       logger.debug "start record - peek >#{input.peek}< (#{input.peek.ord})"  if logger.debug?
-      record = parse_record( input )
+      record = parse_record( input, sep: sep )
       record_num +=1
       ## note: requires block - enforce? how? why? why not?
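
Note on `check_sep` above: the guard relies on `String#include?`, which is a substring test. Going by Ruby's documented semantics rather than a run against the gem, the empty string and multi-character runs such as `",;"` would pass the guard as well. A sketch of a stricter check follows; `strict_sep?` is a hypothetical name and not part of csvreader:

```ruby
SEPARATORS = ",;|^:"   # same string as in the diff

# Hypothetical stricter variant (not in csvreader):
# accept only a String of exactly one character that is listed in SEPARATORS.
def strict_sep?( sep )
  sep.is_a?( String ) && sep.size == 1 && SEPARATORS.include?( sep )
end

p SEPARATORS.include?( "" )     # substring test: the empty string is accepted
p SEPARATORS.include?( ",;" )   # ... and so is a multi-character run
p strict_sep?( ";" )            # single listed character
p strict_sep?( ",;" )           # rejected by the length check
```

Whether the looser behavior matters in practice depends on how callers pass `sep`; the sketch only shows one way to narrow it.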

data/lib/csvreader/parser_table.rb CHANGED

@@ -22,12 +22,38 @@ def logger()  self.class.logger; end
-def parse( data, **kwargs, &block )
+attr_reader   :config   ## todo/fix: change config to proper dialect class/struct - why? why not?
+##
+##  todo/check:
+##    null values - include NA - why? why not?
+##        make null values case sensitive or add an option for case sensitive
+##        or better allow a proc as option for checking too!!!
+def initialize( space: nil )
+  @config = {}   ## todo/fix: change config to proper dialect class/struct - why? why not?
+  ## e.g. treat/convert char to space e.g. _-+• etc
+  ##   Man_Utd   => Man Utd
+  ##  or use it for leading and trailing spaces without quotes
+  ##  todo/check: only use for unquoted values? why? why not?
+  @config[:space]   = space
+end
+#########################################
+## config convenience helpers
+def space=( value )       @config[:space]=value; end
+def parse( str_or_readable, **kwargs, &block )
   ## note: input: required each_line (string or io/file for example)
   ## note: kwargs NOT used for now (but required for "protocol/interface" by other parsers)
-  input = data   ## assume it's a string or io/file handle
+  input = str_or_readable   ## assume it's a string or io/file handle
   if block_given?
     parse_lines( input, &block )
@@ -48,6 +74,8 @@ private
 def parse_lines( input, &block )
+  space = config[:space]
   ## note: each line only works with \n (unix) or \r\n (windows)
   ##   will NOT work with \r (old mac, any others?) only!!!!
   input.each_line do |line|
@@ -79,6 +107,12 @@ def parse_lines( input, &block )
     values = line.split( /[ \t]+/ )
     logger.debug values.pretty_inspect   if logger.debug?
+    if space
+      ## e.g. translate _-+ etc. if configured to space
+      ##  Man_Utd => Man Utd etc.
+       values = values.map {|value| value.tr(space,' ') }
+    end
     ## note: requires block - enforce? how? why? why not?
     block.call( values )
   end
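
Note on the `ParserTable` change above: values are split on runs of spaces or tabs first, and only afterwards is the configured `space` character translated into a real space. A minimal sketch of that order of operations; `parse_table_line` is a hypothetical helper, and the `strip` call is an assumption, since the diff does not show how the real parser handles leading whitespace:

```ruby
# Hypothetical helper for illustration only; NOT the csvreader table parser.
# Split on runs of spaces or tabs first, then translate the alternative
# space character inside each value.
def parse_table_line( line, space: nil )
  values = line.strip.split( /[ \t]+/ )   # assumption: strip leading/trailing whitespace
  values = values.map { |value| value.tr( space, ' ' ) }  if space
  values
end

p parse_table_line( "      8  Man_Utd 10 5 2 3 17 17 0 17", space: '_' )
p parse_table_line( "_ _ __", space: '_' )
```

Because the split happens before the translation, `Man_Utd` stays one value and only then becomes `Man Utd`.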

data/lib/csvreader/version.rb CHANGED

@@ -5,8 +5,8 @@ class CsvReader   ## note: uses a class for now - change to module - why? why no
   module Version
     MAJOR = 1    ## todo: namespace inside version or something - why? why not??
-    MINOR = 1
-    PATCH = 5
+    MINOR = 2
+    PATCH = 0
     ## self.to_s  - why? why not?
   end

data/test/data/test.csv ADDED

@@ -0,0 +1,21 @@
+##################################################
+##   Apache Commons CSV Reader Test Sample
+##    see https://github.com/apache/commons-csv/blob/master/src/test/resources/CSVFileParser/test.csv
+A,B,C,"D"
+# plain values
+a,b,c,d
+# spaces before and after
+ e ,f , g,h
+# quoted: with spaces before and after
+" i ", " j " , " k "," l "
+# empty values
+,,,
+# empty quoted values
+"","","",""
+# 3 empty lines
+# EOF on next line

data/test/test_parser.rb CHANGED

@@ -41,6 +41,79 @@ def test_parse
 end
+def test_parse_space
+  records = [["1", "Man City"],
+             ["2", "Liverpool"],
+             ["3", "Chelsea"],
+             ["4", "Arsenal"],
+             ["8", "Man Utd"],
+             ["13", "West Ham"],
+             ["14", "Crystal Palace"]]
+  parser.space='_'
+  assert_equal records, parser.parse( <<TXT )
+      1, Man_City
+      2, Liverpool
+      3, Chelsea
+      4, Arsenal
+      8, Man_Utd
+     13, West_Ham
+     14, Crystal_Palace
+TXT
+  assert_equal [[" "," ","  "]], parser.parse( "_ , _ , __" )
+  parser.space='•'
+  assert_equal records, parser.parse( <<TXT )
+      1, Man•City
+      2, Liverpool
+      3, Chelsea
+      4, Arsenal
+      8, Man•Utd
+     13, West•Ham
+     14, Crystal•Palace
+TXT
+  assert_equal [[" "," ","  "]], parser.parse( "• , • , ••" )
+  parser.space = nil  ## reset to default setting
+end
+def test_parse_semicolon
+   records = [["a", "b", "c"],
+              ["1", "2", "3"],
+              ["4", "5", "6"]]
+   ## don't care about newlines (\r\n) ??? - fix? why? why not?
+   assert_equal records, parser.parse( "a;b;c\n1;2;3\n4;5;6",         sep: ';' )
+   assert_equal records, parser.parse( "a;b;c\n1;2;3\n4;5;6\n",       sep: ';' )
+   assert_equal records, parser.parse( "a;b;c\r1;2;3\r4;5;6",         sep: ';' )
+   assert_equal records, parser.parse( "a;b;c\r\n1;2;3\r\n4;5;6\r\n", sep: ';' )
+   assert_equal records, parser.parse( " a;  b ; c\n1; 2;  3\n 4; 5;6   ",  sep: ';' )
+   assert_equal records, parser.parse( "a; b; c\n  1; 2 ;3 \n4;5;6\n",      sep: ';' )
+end
+def test_parse_pipe   # or bar e.g. |||
+   records = [["a", "b", "c"],
+              ["1", "2", "3"],
+              ["4", "5", "6"]]
+   ## don't care about newlines (\r\n) ??? - fix? why? why not?
+   assert_equal records, parser.parse( "a|b|c\n1|2|3\n4|5|6",         sep: '|' )
+   assert_equal records, parser.parse( "a|b|c\n1|2|3\n4|5|6\n",       sep: '|' )
+   assert_equal records, parser.parse( "a|b|c\r1|2|3\r4|5|6",         sep: '|' )
+   assert_equal records, parser.parse( "a|b|c\r\n1|2|3\r\n4|5|6\r\n", sep: '|' )
+   assert_equal records, parser.parse( " a|  b | c\n1| 2|  3\n 4| 5|6   ",  sep: '|' )
+   assert_equal records, parser.parse( "a| b| c\n  1| 2 |3 \n4|5|6\n",      sep: '|' )
+end
 def test_parse_quotes
   records = [["a", "b", "c"],
              ["11 \n 11", "\"2\"", "3"]]

data/test/test_parser_table.rb CHANGED

@@ -13,6 +13,48 @@ class TestParserTable < MiniTest::Test
 def parser() CsvReader::Parser::TABLE;  end
+def test_space
+  records = [["1", "Man City", "10", "8", "2", "0", "27", "3", "24", "26"],
+             ["2", "Liverpool", "10", "8", "2", "0", "20", "4", "16", "26"],
+             ["3", "Chelsea", "10", "7", "3", "0", "24", "7", "17", "24"],
+             ["4", "Arsenal", "10", "7", "1", "2", "24", "13", "11", "22"],
+             ["8", "Man Utd", "10", "5", "2", "3", "17", "17", "0", "17"],
+             ["13", "West Ham", "10", "2", "2", "6", "9", "15", "-6", "8"],
+             ["14", "Crystal Palace", "10", "2", "2", "6", "7", "13", "-6", "8"]]
+  parser.space='_'
+  assert_equal records, parser.parse( <<TXT )
+      1  Man_City 10 8 2 0 27 3 24 26
+      2  Liverpool 10 8 2 0 20 4 16 26
+      3  Chelsea 10 7 3 0 24 7 17 24
+      4  Arsenal 10 7 1 2 24 13 11 22
+      8  Man_Utd 10 5 2 3 17 17 0 17
+      13  West_Ham 10 2 2 6 9 15 -6 8
+      14  Crystal_Palace 10 2 2 6 7 13 -6 8
+TXT
+  assert_equal [[" "," ","  "]], parser.parse( "_ _ __" )
+  parser.space='•'
+  assert_equal records, parser.parse( <<TXT )
+      1  Man•City 10 8 2 0 27 3 24 26
+      2  Liverpool 10 8 2 0 20 4 16 26
+      3  Chelsea 10 7 3 0 24 7 17 24
+      4  Arsenal 10 7 1 2 24 13 11 22
+      8  Man•Utd 10 5 2 3 17 17 0 17
+      13  West•Ham 10 2 2 6 9 15 -6 8
+      14  Crystal•Palace 10 2 2 6 7 13 -6 8
+TXT
+  assert_equal [[" "," ","  "]], parser.parse( "• • ••" )
+  parser.space = nil  ## reset to default setting
+end
 def test_contacts
   records = [["aa", "bbb"],
              ["cc", "dd", "ee"]]

data/test/test_samples.rb CHANGED

@@ -61,4 +61,18 @@ def test_shakespeare11
                 ["Tomorrow, and tomorrow, and tomorrow", "Macbeth", "Act 5, scene 5, 19"]], records
 end
+def test_test
+  records = CsvReader.read( "#{CsvReader.test_data_dir}/test.csv" )
+  pp records
+  assert_equal [["A", "B", "C", "D"],
+                ["a", "b", "c", "d"],
+                ["e", "f", "g", "h"],
+                [" i ", " j ", " k ", " l "],
+                ["", "", "", ""],
+                ["", "", "", ""]], records
+end
 end # class TestSamples
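
Note on the `space` tests in this release (`test_parse_space`, `test_space`): they set one alternative space character at a time. Both parsers hand the option straight to `String#tr`, so a multi-character value would be read as a character set, with a hyphen between two characters treated as a range. The sketch below shows standard `String#tr` behavior through a hypothetical wrapper `to_space` and was not run against the gem:

```ruby
# Hypothetical helper for illustration; mirrors value.tr( space, ' ' ) from the diff.
def to_space( value, space )
  value.tr( space, ' ' )
end

p to_space( "Man_Utd", '_' )          # one character: plain substitution
p to_space( "West-Ham_Utd", '_-' )    # hyphen at the end is taken literally
begin
  to_space( "Man_Utd", '_-+' )        # hyphen in the middle is read as a range
rescue ArgumentError => e
  puts "ArgumentError: #{e.message}"
end
```

If several alternative space characters are wanted at once, placing the hyphen first or last (or escaping it with a backslash) avoids the range interpretation.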

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: csvreader
 version: !ruby/object:Gem::Version
-  version: 1.1.5
+  version: 1.2.0
 platform: ruby
 authors:
 - Gerald Bauer
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2018-11-04 00:00:00.000000000 Z
+date: 2018-11-05 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rdoc
@@ -78,6 +78,7 @@ files:
 - test/data/iris11.csv
 - test/data/lcc.attrib.csv
 - test/data/shakespeare.csv
+- test/data/test.csv
 - test/helper.rb
 - test/test_buffer.rb
 - test/test_converter.rb