smarter_csv 1.8.1 → 1.8.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: a7aa350efc77f90c6986a7573e733b5d9d02930c94465f17d2227b346263a6ce
-   data.tar.gz: 42351edf3e618b8c025f266796897aa0c3572d77e42788a05b1ee37ce8bdeed2
+   metadata.gz: 654b04532f0d0b1e15bf84c2e23231e00946a1f57c613f53555ba2d531eaf4f9
+   data.tar.gz: d99a921a908864764a39e94818be45c9feb8a1fbe15eb776e24ef10e98c749fd
  SHA512:
-   metadata.gz: 8bd9d59d7260a8e90ce472917801b98d088e37de5b1e912914f820f2efbbeb0491f5056d47575debdf1bccb8b9b8670cd089647efa15ec93b02413747dcfe702
-   data.tar.gz: 861364c6213af99c11cd3b9a59b2cf46f8c8e850ee2273e4f1b790714c9cd0ca66a734d64233737e086669c2b6aa51415f1343c3d61811547ec3c715d7a1620c
+   metadata.gz: 8005c2b6bdd4e82ab1acc8849afd4b8d7abf0d744bb18fa76aaac68a707a8f14300b4e844abac3dacab24b254c81787dc4501a1cc1138ebdc97fe52728e82f30
+   data.tar.gz: 0baead2aa4d6841f3770e27a24e5dc5d783873db253c8185e81366a2b5a36045d82f4dc2011fbd46373cf434968675b232217c8764c2c858d74c1cceaebd45ed
data/CHANGELOG.md CHANGED
@@ -1,13 +1,25 @@

  # SmarterCSV 1.x Change Log

+ ## 1.8.3 (2023-03-30)
+ * bugfix: windows one-column files were raising NoColSepDetected (issue #229)
+
+
+ ## 1.8.2 (2023-03-21)
+ * bugfix: do not raise `NoColSepDetected` for CSV files with only one column in most cases (issue #222)
+ If the first lines contain non-ASCII characters, and no col_sep is detected, it will still raise `NoColSepDetected`
+
  ## 1.8.1 (2023-03-19)
  * added validation against invalid values for :col_sep, :row_sep, :quote_char (issue #216)
  * deprecating `required_headers` and replace with `required_keys` (issue #140)
  * fixed issue with require statement

- ## 1.8.0 (2023-03-18)
+ ## 1.8.0 (2023-03-18) BREAKING
  * NEW DEFAULTS: `col_sep: :auto`, `row_sep: :auto`. Fully automatic detection by default.
+
+ MAKE SURE to rescue `NoColSepDetected` if your CSV files can have unexpected formats,
+ e.g. from users uploading them to a service, and handle those cases.
+
  * ignore Byte Order Marker (BOM) in first line in file (issues #27, #219)

  ## 1.7.4 (2023-01-13)
data/README.md CHANGED
@@ -3,26 +3,33 @@

  [![codecov](https://codecov.io/gh/tilo/smarter_csv/branch/main/graph/badge.svg?token=1L7OD80182)](https://codecov.io/gh/tilo/smarter_csv) [![Gem Version](https://badge.fury.io/rb/smarter_csv.svg)](http://badge.fury.io/rb/smarter_csv)

+ #### Development Branches
+
+ * default branch is `main` for 1.x development
+ * 2.x development is on `2.0-development` (check this branch for 2.0 documentation)
+
  #### Work towards Future Version 2.0

  * Work towards SmarterCSV 2.0 is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
  Please check the [2.0-develop branch](https://github.com/tilo/smarter_csv/tree/2.0-develop), open any issues and pull requests with mention of tag v2.0.

- * New versions of SmarterCSV 1.x will soon print a deprecation warning if you set :verbose to true
- See below for list of deprecated options.
+ ---------------

- #### Restructured Branches
+ #### SmarterCSV 1.x [Current Version]

- * default branch is `main` for 1.x development
- * 2.x development is on `2.0-development`
+ `smarter_csv` is a Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, suitable for direct processing with ActiveRecord, parallel processing, kicking-off batch jobs with Sidekiq, or uploading data to S3.

- ---------------
+ The goals for SmarterCSV are:
+ * ease of use for handling most common CSV files without having to tweak options
+ * improve robustness of your code when you have no control over the quality of the CSV files which are processed
+ * formatting each row of data as a hash, in order to allow easy processing with ActiveRecord, parallel processing, kicking-off batch jobs with Sidekiq, or uploading data to S3.

- #### SmarterCSV 1.x [Current Version]
+ #### Rescue from Exceptions
+ While SmarterCSV uses sensible defaults to process the most common CSV files, it will raise exceptions if it cannot auto-detect `col_sep`, `row_sep`, or if it encounters other problems. Therefore, when calling `SmarterCSV.process`, please rescue from `SmarterCSVException`, and handle outliers according to your requirements.

- `smarter_csv` is a Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, suitable for direct processing with ActiveRecord, parallel processing, or kicking-off batch jobs with Sidekiq.
+ If you encounter unusual CSV files, please follow the tips in the Troubleshooting section below. You can use the options below to accommodate unusual formats.

- To create high-quality output, some options are enabled as a default. Please make sure to check the output and tweak the options accordingly.
+ #### Features

  One `smarter_csv` user wrote:

@@ -77,7 +84,10 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv

  Here are some examples to demonstrate the versatility of SmarterCSV.

- By default SmarterCSV determines the `row_sep` and `col_sep` values automatically.
+ **It is generally recommended to rescue `SmarterCSVException` or its subclasses.**
+
+ By default SmarterCSV determines the `row_sep` and `col_sep` values automatically. In cases where the automatic detection fails, an exception will be raised, e.g. `NoColSepDetected`. Rescuing from these exceptions will make sure that you don't miss processing CSV files, in case users upload CSV files with unexpected formats.
+
  In rare cases you may have to manually set these values, after going through the troubleshooting procedure described above.

  #### Example 1a: How SmarterCSV processes CSV-files as array of hashes:
data/TO_DO_v2.md ADDED
@@ -0,0 +1,14 @@
+ # SmarterCSV v2.0 TO DO List
+
+ * add enumerable to speed up parallel processing [issue #66](https://github.com/tilo/smarter_csv/issues/66), [issue #32](https://github.com/tilo/smarter_csv/issues/32)
+ * use Procs for validations and transformations [issue #118](https://github.com/tilo/smarter_csv/issues/118)
+ * make @errors and @warnings work [issue #118](https://github.com/tilo/smarter_csv/issues/118)
+ * skip file opening, allow reading from CSV string, e.g. reading from S3 file [issue #120](https://github.com/tilo/smarter_csv/issues/120).
+ Or stream large file from S3 (linked in the issue)
+ * Collect all Errors, before surfacing them. Avoid throwing an exception on the first error [issue #133](https://github.com/tilo/smarter_csv/issues/133)
+ * Don't call rewind on filehandle
+ * [2.0 BUG] :convert_values_to_numeric_unless_leading_zeros drops leading zeros [issue #151](https://github.com/tilo/smarter_csv/issues/151)
+ * [2.0 BUG] convert_to_float saves Proc as @@convert_to_integer [issue #157](https://github.com/tilo/smarter_csv/issues/157)
+ * Provide an example for custom Procs for hash_transformations in the docs [issue #174](https://github.com/tilo/smarter_csv/issues/174)
+ * Replace remove_empty_values: false [issue #213](https://github.com/tilo/smarter_csv/issues/213)
+
data/ext/smarter_csv/smarter_csv.c CHANGED
@@ -15,67 +15,67 @@
  static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quote_char, VALUE max_size) {
    if (RB_TYPE_P(line, T_NIL) == 1) {
      return rb_ary_new();
+   }

-   } else if (RB_TYPE_P(line, T_STRING) == 1) {
-     rb_encoding *encoding = rb_enc_get(line); /* get the encoding from the input line */
-     char *startP = RSTRING_PTR(line); /* may not be null terminated */
-     long line_len = RSTRING_LEN(line);
-     char *endP = startP + line_len ; /* points behind the string */
-     char *p = startP;
+   if (RB_TYPE_P(line, T_STRING) != 1) {
+     rb_raise(rb_eTypeError, "ERROR in SmarterCSV.parse_line: line has to be a string or nil");
+   }

-     char *col_sepP = RSTRING_PTR(col_sep);
-     long col_sep_len = RSTRING_LEN(col_sep);
+   rb_encoding *encoding = rb_enc_get(line); /* get the encoding from the input line */
+   char *startP = RSTRING_PTR(line); /* may not be null terminated */
+   long line_len = RSTRING_LEN(line);
+   char *endP = startP + line_len ; /* points behind the string */
+   char *p = startP;

-     char *quoteP = RSTRING_PTR(quote_char);
-     long quote_count = 0;
+   char *col_sepP = RSTRING_PTR(col_sep);
+   long col_sep_len = RSTRING_LEN(col_sep);

-     bool col_sep_found = true;
+   char *quoteP = RSTRING_PTR(quote_char);
+   long quote_count = 0;

-     VALUE elements = rb_ary_new();
-     VALUE field;
-     long i;
+   bool col_sep_found = true;

-     while (p < endP) {
-       /* does the remaining string start with col_sep ? */
-       col_sep_found = true;
-       for(i=0; (i < col_sep_len) && (p+i < endP) ; i++) {
-         col_sep_found = col_sep_found && (*(p+i) == *(col_sepP+i));
-       }
-       /* if col_sep was found and we have even quotes */
-       if (col_sep_found && (quote_count % 2 == 0)) {
-         /* if max_size != nil && elements.size >= header_size */
-         if ((max_size != Qnil) && RARRAY_LEN(elements) >= NUM2INT(max_size)) {
-           break;
-         } else {
-           /* push that field with original encoding onto the results */
-           field = rb_enc_str_new(startP, p - startP, encoding);
-           rb_ary_push(elements, field);
+   VALUE elements = rb_ary_new();
+   VALUE field;
+   long i;

-           p += col_sep_len;
-           startP = p;
-         }
+   while (p < endP) {
+     /* does the remaining string start with col_sep ? */
+     col_sep_found = true;
+     for(i=0; (i < col_sep_len) && (p+i < endP) ; i++) {
+       col_sep_found = col_sep_found && (*(p+i) == *(col_sepP+i));
+     }
+     /* if col_sep was found and we have even quotes */
+     if (col_sep_found && (quote_count % 2 == 0)) {
+       /* if max_size != nil && elements.size >= header_size */
+       if ((max_size != Qnil) && RARRAY_LEN(elements) >= NUM2INT(max_size)) {
+         break;
        } else {
-         if (*p == *quoteP) {
-           quote_count += 1;
-         }
-         p++;
-       }
-     } /* while */
+         /* push that field with original encoding onto the results */
+         field = rb_enc_str_new(startP, p - startP, encoding);
+         rb_ary_push(elements, field);

-     /* check if the last part of the line needs to be processed */
-     if ((max_size == Qnil) || RARRAY_LEN(elements) < NUM2INT(max_size)) {
-       /* copy the remaining line as a field with original encoding onto the results */
-       field = rb_enc_str_new(startP, endP - startP, encoding);
-       rb_ary_push(elements, field);
+         p += col_sep_len;
+         startP = p;
+       }
+     } else {
+       if (*p == *quoteP) {
+         quote_count += 1;
+       }
+       p++;
      }
+   } /* while */

-     return elements;
+   /* check if the last part of the line needs to be processed */
+   if ((max_size == Qnil) || RARRAY_LEN(elements) < NUM2INT(max_size)) {
+     /* copy the remaining line as a field with original encoding onto the results */
+     field = rb_enc_str_new(startP, endP - startP, encoding);
+     rb_ary_push(elements, field);
    }

-   rb_raise(rb_eTypeError, "ERROR in SmarterCSV.parse_line: line has to be a string or nil");
+   return elements;
  }

-
  VALUE SmarterCSV = Qnil;

  void Init_smarter_csv(void) {
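
The reworked `rb_parse_csv_line` above treats `col_sep` as a field boundary only while the number of quote characters seen so far is even, and it stops early once `max_size` fields have been collected. The block below is a rough Ruby sketch of that loop for readers who do not want to trace the C pointers. It is not the gem's code: `parse_csv_line_sketch` is a hypothetical name, it indexes by character where the C version walks bytes, and the guard against an empty `col_sep` is an addition of this sketch.

```ruby
# Hypothetical Ruby rendering of the C loop above (illustration only).
def parse_csv_line_sketch(line, col_sep, quote_char, max_size = nil)
  return [] if line.nil?
  raise TypeError, "line has to be a string or nil" unless line.is_a?(String)
  raise ArgumentError, "col_sep must not be empty" if col_sep.empty? # sketch-only guard

  elements = []
  field_start = 0
  pos = 0
  quote_count = 0

  while pos < line.length
    # A separator only counts when we are outside of quotes (even quote count).
    if line[pos, col_sep.length] == col_sep && quote_count.even?
      break if max_size && elements.size >= max_size

      elements << line[field_start...pos]
      pos += col_sep.length
      field_start = pos
    else
      quote_count += 1 if line[pos] == quote_char[0]
      pos += 1
    end
  end

  # The remainder of the line is the last field, unless max_size was reached.
  elements << line[field_start..-1] if max_size.nil? || elements.size < max_size
  elements
end

p parse_csv_line_sketch('a,"b,c",d', ',', '"')   # => ["a", "\"b,c\"", "d"]
p parse_csv_line_sketch('a,b,c,d', ',', '"', 2)  # => ["a", "b"]
```

As in the C function, the quote characters stay in the returned fields; this sketch makes no attempt to strip them.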
data/lib/smarter_csv/version.rb CHANGED
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module SmarterCSV
-   VERSION = "1.8.1"
+   VERSION = "1.8.3"
  end
data/lib/smarter_csv.rb CHANGED
@@ -3,8 +3,11 @@
  require_relative "extensions/hash"
  require_relative "smarter_csv/version"

- require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
- # require 'smarter_csv.bundle' unless ENV['CI'] # local testing
+ if `uname -s`.chomp == 'Darwin'
+   require 'smarter_csv.bundle' unless ENV['CI'] # local testing
+ else
+   require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
+ end

  module SmarterCSV
    class SmarterCSVException < StandardError; end
@@ -393,15 +396,28 @@ module SmarterCSV
      def guess_column_separator(filehandle, options)
        skip_lines(filehandle, options)

-       possible_delimiters = [',', "\t", ';', ':', '|']
+       delimiters = [',', "\t", ';', ':', '|']
+
+       line = nil
+       has_header = options[:headers_in_file]
+       candidates = Hash.new(0)
+       count = has_header ? 1 : 5
+       count.times do
+         line = readline_with_counts(filehandle, options)
+         delimiters.each do |d|
+           candidates[d] += line.scan(d).count
+         end
+       rescue EOFError # short files
+         break
+       end
+       rewind(filehandle)

-       candidates = if options.fetch(:headers_in_file)
-         candidated_column_separators_from_headers(filehandle, options, possible_delimiters)
-       else
-         candidated_column_separators_from_contents(filehandle, options, possible_delimiters)
-       end
+       if candidates.values.max == 0
+         # if the header only contains a single word (one-column file), default to ','
+         return ',' if line.chomp(options[:row_sep]) =~ /^\w+$/

-       raise SmarterCSV::NoColSepDetected if candidates.values.max == 0
+         raise SmarterCSV::NoColSepDetected
+       end

        candidates.key(candidates.values.max)
      end
@@ -582,35 +598,5 @@ module SmarterCSV
        return true if str.is_a?(String) && !str.empty?
        false
      end
-
-     def candidated_column_separators_from_headers(filehandle, options, delimiters)
-       candidates = Hash.new(0)
-       line = readline_with_counts(filehandle, options.slice(:row_sep))
-
-       delimiters.each do |d|
-         candidates[d] += line.scan(d).count
-       end
-
-       rewind(filehandle)
-
-       candidates
-     end
-
-     def candidated_column_separators_from_contents(filehandle, options, delimiters)
-       candidates = Hash.new(0)
-
-       5.times do
-         line = readline_with_counts(filehandle, options.slice(:row_sep))
-         delimiters.each do |d|
-           candidates[d] += line.scan(d).count
-         end
-       rescue EOFError # short files
-         break
-       end
-
-       rewind(filehandle)
-
-       candidates
-     end
    end
  end
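
The new `guess_column_separator` counts each candidate delimiter in the header line (or in up to five lines when there is no header). When none is found, it falls back to `,` for a line that is a single word and raises `NoColSepDetected` otherwise. The following is a minimal standalone sketch of that idea so it can be tried without the gem. Everything named `*_sketch` / `*Sketch` is hypothetical, `NoColSepDetectedSketch` merely stands in for `SmarterCSV::NoColSepDetected`, and the sketch anchors the regex with `\A`/`\z` and treats empty input as "not detected", which are assumptions rather than the gem's behavior.

```ruby
require "stringio"

# Stand-in for SmarterCSV::NoColSepDetected so the sketch runs without the gem.
class NoColSepDetectedSketch < StandardError; end

CANDIDATE_DELIMITERS = [",", "\t", ";", ":", "|"].freeze

# Hypothetical helper mirroring the counting logic in the diff above.
def guess_col_sep_sketch(io, headers_in_file: true, row_sep: "\n")
  counts = Hash.new(0)
  line = nil

  # Inspect only the header line if there is one, otherwise up to five lines.
  (headers_in_file ? 1 : 5).times do
    begin
      line = io.readline(row_sep)
    rescue EOFError # short files
      break
    end
    CANDIDATE_DELIMITERS.each { |d| counts[d] += line.scan(d).count }
  end
  io.rewind

  best = counts.values.max.to_i
  if best.zero?
    # A lone word suggests a one-column file, so any separator will do.
    return "," if line && line.chomp(row_sep).match?(/\A\w+\z/)

    raise NoColSepDetectedSketch
  end

  counts.key(best)
end

puts guess_col_sep_sketch(StringIO.new("a;b;c\n1;2;3\n")) # prints ";"
puts guess_col_sep_sketch(StringIO.new("name\nalice\n"))   # prints ","

begin
  guess_col_sep_sketch(StringIO.new("first name\nalice\n"))
rescue NoColSepDetectedSketch
  puts "no column separator detected" # handle the upload as the README advises
end
```

The last call shows the case the changelog warns about: a header that contains no candidate delimiter and is not a single word still raises, so callers are expected to rescue it.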
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: smarter_csv
  version: !ruby/object:Gem::Version
-   version: 1.8.1
+   version: 1.8.3
  platform: ruby
  authors:
  - Tilo Sloboda
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2023-03-19 00:00:00.000000000 Z
+ date: 2023-03-30 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: awesome_print
@@ -112,6 +112,7 @@ files:
  - LICENSE.txt
  - README.md
  - Rakefile
+ - TO_DO_v2.md
  - ext/smarter_csv/extconf.rb
  - ext/smarter_csv/smarter_csv.c
  - lib/extensions/hash.rb