smarter_csv 1.8.1 → 1.8.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: a7aa350efc77f90c6986a7573e733b5d9d02930c94465f17d2227b346263a6ce
-  data.tar.gz: 42351edf3e618b8c025f266796897aa0c3572d77e42788a05b1ee37ce8bdeed2
+  metadata.gz: 654b04532f0d0b1e15bf84c2e23231e00946a1f57c613f53555ba2d531eaf4f9
+  data.tar.gz: d99a921a908864764a39e94818be45c9feb8a1fbe15eb776e24ef10e98c749fd
 SHA512:
-  metadata.gz: 8bd9d59d7260a8e90ce472917801b98d088e37de5b1e912914f820f2efbbeb0491f5056d47575debdf1bccb8b9b8670cd089647efa15ec93b02413747dcfe702
-  data.tar.gz: 861364c6213af99c11cd3b9a59b2cf46f8c8e850ee2273e4f1b790714c9cd0ca66a734d64233737e086669c2b6aa51415f1343c3d61811547ec3c715d7a1620c
+  metadata.gz: 8005c2b6bdd4e82ab1acc8849afd4b8d7abf0d744bb18fa76aaac68a707a8f14300b4e844abac3dacab24b254c81787dc4501a1cc1138ebdc97fe52728e82f30
+  data.tar.gz: 0baead2aa4d6841f3770e27a24e5dc5d783873db253c8185e81366a2b5a36045d82f4dc2011fbd46373cf434968675b232217c8764c2c858d74c1cceaebd45ed
data/CHANGELOG.md CHANGED
@@ -1,13 +1,25 @@
 
 # SmarterCSV 1.x Change Log
 
+## 1.8.3 (2023-03-30)
+* bugfix: windows one-column files were raising NoColSepDetected (issue #229)
+
+
+## 1.8.2 (2023-03-21)
+* bugfix: do not raise `NoColSepDetected` for CSV files with only one column in most cases (issue #222)
+  If the first lines contain non-ASCII characters, and no col_sep is detected, it will still raise `NoColSepDetected`
+
 ## 1.8.1 (2023-03-19)
 * added validation against invalid values for :col_sep, :row_sep, :quote_char (issue #216)
 * deprecating `required_headers` and replace with `required_keys` (issue #140)
 * fixed issue with require statement
 
-## 1.8.0 (2023-03-18)
+## 1.8.0 (2023-03-18) BREAKING
 * NEW DEFAULTS: `col_sep: :auto`, `row_sep: :auto`. Fully automatic detection by default.
+
+  MAKE SURE to rescue `NoColSepDetected` if your CSV files can have unexpected formats,
+  e.g. from users uploading them to a service, and handle those cases.
+
 * ignore Byte Order Marker (BOM) in first line in file (issues #27, #219)
 
 ## 1.7.4 (2023-01-13)
data/README.md CHANGED
@@ -3,26 +3,33 @@
 
 [![codecov](https://codecov.io/gh/tilo/smarter_csv/branch/main/graph/badge.svg?token=1L7OD80182)](https://codecov.io/gh/tilo/smarter_csv) [![Gem Version](https://badge.fury.io/rb/smarter_csv.svg)](http://badge.fury.io/rb/smarter_csv)
 
+#### Development Branches
+
+* default branch is `main` for 1.x development
+* 2.x development is on `2.0-development` (check this branch for 2.0 documentation)
+
 #### Work towards Future Version 2.0
 
 * Work towards SmarterCSV 2.0 is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
   Please check the [2.0-develop branch](https://github.com/tilo/smarter_csv/tree/2.0-develop), open any issues and pull requests with mention of tag v2.0.
 
-* New versions of SmarterCSV 1.x will soon print a deprecation warning if you set :verbose to true
-  See below for list of deprecated options.
+---------------
 
-#### Restructured Branches
+#### SmarterCSV 1.x [Current Version]
 
-* default branch is `main` for 1.x development
-* 2.x development is on `2.0-development`
+`smarter_csv` is a Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, suitable for direct processing with ActiveRecord, parallel processing, kicking-off batch jobs with Sidekiq, or uploading data to S3.
 
----------------
+The goals for SmarterCSV are:
+* ease of use for handling most common CSV files without having to tweak options
+* improve robustness of your code when you have no control over the quality of the CSV files which are processed
+* formatting each row of data as a hash, in order to allow easy processing with ActiveRecord, parallel processing, kicking-off batch jobs with Sidekiq, or uploading data to S3.
 
-#### SmarterCSV 1.x [Current Version]
+#### Rescue from Exceptions
+While SmarterCSV uses sensible defaults to process the most common CSV files, it will raise exceptions if it cannot auto-detect `col_sep`, `row_sep`, or if it encounters other problems. Therefore, when calling `SmarterCSV.process`, please rescue from `SmarterCSVException`, and handle outliers according to your requirements.
 
-`smarter_csv` is a Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, suitable for direct processing with ActiveRecord, parallel processing, or kicking-off batch jobs with Sidekiq.
+If you encounter unusual CSV files, please follow the tips in the Troubleshooting section below. You can use the options below to accommodate unusual formats.
 
-To create high-quality output, some options are enabled as a default. Please make sure to check the output and tweak the options accordingly.
+#### Features
 
 One `smarter_csv` user wrote:
 
@@ -77,7 +84,10 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
 
 Here are some examples to demonstrate the versatility of SmarterCSV.
 
-By default SmarterCSV determines the `row_sep` and `col_sep` values automatically.
+**It is generally recommended to rescue `SmarterCSVException` or its sub-classes.**
+
+By default SmarterCSV determines the `row_sep` and `col_sep` values automatically. In cases where the automatic detection fails, an exception will be raised, e.g. `NoColSepDetected`. Rescuing from these exceptions will make sure that you don't miss processing CSV files, in case users upload CSV files with unexpected formats.
+
 In rare cases you may have to manually set these values, after going through the troubleshooting procedure described above.
 
 #### Example 1a: How SmarterCSV processes CSV-files as array of hashes:
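Reviewer note: the README change above asks callers to rescue `SmarterCSVException` when calling `SmarterCSV.process`. A minimal sketch of what that could look like follows. The helper name `import_csv`, its `nil` return convention, and the assumption that `NoColSepDetected` is a subclass of `SmarterCSVException` are illustrative and not taken from the gem's documentation.

```ruby
# Hypothetical wrapper around SmarterCSV.process; returns an array of hashes,
# or nil when the file could not be processed.
def import_csv(path, options = {})
  require 'smarter_csv' # loaded lazily; assumes the gem is installed

  SmarterCSV.process(path, options)
rescue SmarterCSV::NoColSepDetected => e
  # Auto-detection of col_sep failed, e.g. for an unexpected user upload.
  warn "#{path}: could not detect a column separator (#{e.class})"
  nil
rescue SmarterCSV::SmarterCSVException => e
  # Any other error raised by the gem itself.
  warn "#{path}: rejected by SmarterCSV (#{e.class}: #{e.message})"
  nil
end
```

Listing the more specific exception first lets the caller treat a failed separator detection differently from other errors, for example by retrying with an explicit `col_sep`.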
data/TO_DO_v2.md ADDED
@@ -0,0 +1,14 @@
+# SmarterCSV v2.0 TO DO List
+
+* add enumerable to speed up parallel processing [issue #66](https://github.com/tilo/smarter_csv/issues/66), [issue #32](https://github.com/tilo/smarter_csv/issues/32)
+* use Procs for validations and transformations [issue #118](https://github.com/tilo/smarter_csv/issues/118)
+* make @errors and @warnings work [issue #118](https://github.com/tilo/smarter_csv/issues/118)
+* skip file opening, allow reading from CSV string, e.g. reading from S3 file [issue #120](https://github.com/tilo/smarter_csv/issues/120).
+  Or stream large file from S3 (linked in the issue)
+* Collect all Errors, before surfacing them. Avoid throwing an exception on the first error [issue #133](https://github.com/tilo/smarter_csv/issues/133)
+* Don't call rewind on filehandle
+* [2.0 BUG] :convert_values_to_numeric_unless_leading_zeros drops leading zeros [issue #151](https://github.com/tilo/smarter_csv/issues/151)
+* [2.0 BUG] convert_to_float saves Proc as @@convert_to_integer [issue #157](https://github.com/tilo/smarter_csv/issues/157)
+* Provide an example for custom Procs for hash_transformations in the docs [issue #174](https://github.com/tilo/smarter_csv/issues/174)
+* Replace remove_empty_values: false [issue #213](https://github.com/tilo/smarter_csv/issues/213)
+
data/ext/smarter_csv/smarter_csv.c CHANGED
@@ -15,67 +15,67 @@
 static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quote_char, VALUE max_size) {
   if (RB_TYPE_P(line, T_NIL) == 1) {
     return rb_ary_new();
+  }
 
-  } else if (RB_TYPE_P(line, T_STRING) == 1) {
-    rb_encoding *encoding = rb_enc_get(line); /* get the encoding from the input line */
-    char *startP = RSTRING_PTR(line); /* may not be null terminated */
-    long line_len = RSTRING_LEN(line);
-    char *endP = startP + line_len ; /* points behind the string */
-    char *p = startP;
+  if (RB_TYPE_P(line, T_STRING) != 1) {
+    rb_raise(rb_eTypeError, "ERROR in SmarterCSV.parse_line: line has to be a string or nil");
+  }
 
-    char *col_sepP = RSTRING_PTR(col_sep);
-    long col_sep_len = RSTRING_LEN(col_sep);
+  rb_encoding *encoding = rb_enc_get(line); /* get the encoding from the input line */
+  char *startP = RSTRING_PTR(line); /* may not be null terminated */
+  long line_len = RSTRING_LEN(line);
+  char *endP = startP + line_len ; /* points behind the string */
+  char *p = startP;
 
-    char *quoteP = RSTRING_PTR(quote_char);
-    long quote_count = 0;
+  char *col_sepP = RSTRING_PTR(col_sep);
+  long col_sep_len = RSTRING_LEN(col_sep);
 
-    bool col_sep_found = true;
+  char *quoteP = RSTRING_PTR(quote_char);
+  long quote_count = 0;
 
-    VALUE elements = rb_ary_new();
-    VALUE field;
-    long i;
+  bool col_sep_found = true;
 
-    while (p < endP) {
-      /* does the remaining string start with col_sep ? */
-      col_sep_found = true;
-      for(i=0; (i < col_sep_len) && (p+i < endP) ; i++) {
-        col_sep_found = col_sep_found && (*(p+i) == *(col_sepP+i));
-      }
-      /* if col_sep was found and we have even quotes */
-      if (col_sep_found && (quote_count % 2 == 0)) {
-        /* if max_size != nil && elements.size >= header_size */
-        if ((max_size != Qnil) && RARRAY_LEN(elements) >= NUM2INT(max_size)) {
-          break;
-        } else {
-          /* push that field with original encoding onto the results */
-          field = rb_enc_str_new(startP, p - startP, encoding);
-          rb_ary_push(elements, field);
+  VALUE elements = rb_ary_new();
+  VALUE field;
+  long i;
 
-          p += col_sep_len;
-          startP = p;
-        }
+  while (p < endP) {
+    /* does the remaining string start with col_sep ? */
+    col_sep_found = true;
+    for(i=0; (i < col_sep_len) && (p+i < endP) ; i++) {
+      col_sep_found = col_sep_found && (*(p+i) == *(col_sepP+i));
+    }
+    /* if col_sep was found and we have even quotes */
+    if (col_sep_found && (quote_count % 2 == 0)) {
+      /* if max_size != nil && elements.size >= header_size */
+      if ((max_size != Qnil) && RARRAY_LEN(elements) >= NUM2INT(max_size)) {
+        break;
       } else {
-        if (*p == *quoteP) {
-          quote_count += 1;
-        }
-        p++;
-      }
-    } /* while */
+        /* push that field with original encoding onto the results */
+        field = rb_enc_str_new(startP, p - startP, encoding);
+        rb_ary_push(elements, field);
 
-    /* check if the last part of the line needs to be processed */
-    if ((max_size == Qnil) || RARRAY_LEN(elements) < NUM2INT(max_size)) {
-      /* copy the remaining line as a field with original encoding onto the results */
-      field = rb_enc_str_new(startP, endP - startP, encoding);
-      rb_ary_push(elements, field);
+        p += col_sep_len;
+        startP = p;
+      }
+    } else {
+      if (*p == *quoteP) {
+        quote_count += 1;
+      }
+      p++;
     }
+  } /* while */
 
-    return elements;
+  /* check if the last part of the line needs to be processed */
+  if ((max_size == Qnil) || RARRAY_LEN(elements) < NUM2INT(max_size)) {
+    /* copy the remaining line as a field with original encoding onto the results */
+    field = rb_enc_str_new(startP, endP - startP, encoding);
+    rb_ary_push(elements, field);
   }
 
-  rb_raise(rb_eTypeError, "ERROR in SmarterCSV.parse_line: line has to be a string or nil");
+  return elements;
 }
 
-
 VALUE SmarterCSV = Qnil;
 
 void Init_smarter_csv(void) {
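Reviewer note: the C change above only turns the nested `else if` into early-return guards; the splitting logic is unchanged. To make that logic easier to follow, here is a rough pure-Ruby sketch of it. This is an illustration, not the gem's code: the method name `parse_csv_line_sketch` is hypothetical, it works on characters rather than bytes, and it requires a full `col_sep` match, whereas the C loop stops comparing at the end of the line.

```ruby
# Split a line on col_sep, ignoring separators inside quoted sections.
# Quote characters are kept in the returned fields, as in the C version.
def parse_csv_line_sketch(line, col_sep = ',', quote_char = '"', max_size = nil)
  return [] if line.nil?
  raise TypeError, 'line has to be a string or nil' unless line.is_a?(String)

  elements = []
  start = 0        # index where the current field begins
  pos = 0          # current scan position
  quote_count = 0  # an even count means we are outside of quotes

  while pos < line.length
    if line[pos, col_sep.length] == col_sep && quote_count.even?
      break if max_size && elements.size >= max_size

      elements << line[start...pos]
      pos += col_sep.length
      start = pos
    else
      quote_count += 1 if line[pos] == quote_char
      pos += 1
    end
  end

  # The remainder of the line is the last field, unless max_size is reached.
  elements << line[start..-1] if max_size.nil? || elements.size < max_size
  elements
end
```

The guard-clause form used here mirrors the new C layout: invalid input is rejected up front, and the main loop sits at the top level of the function body.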
data/lib/smarter_csv/version.rb CHANGED
@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 
 module SmarterCSV
-  VERSION = "1.8.1"
+  VERSION = "1.8.3"
 end
data/lib/smarter_csv.rb CHANGED
@@ -3,8 +3,11 @@
 require_relative "extensions/hash"
 require_relative "smarter_csv/version"
 
-require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
-# require 'smarter_csv.bundle' unless ENV['CI'] # local testing
+if `uname -s`.chomp == 'Darwin'
+  require 'smarter_csv.bundle' unless ENV['CI'] # local testing
+else
+  require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
+end
 
 module SmarterCSV
   class SmarterCSVException < StandardError; end
@@ -393,15 +396,28 @@ module SmarterCSV
     def guess_column_separator(filehandle, options)
       skip_lines(filehandle, options)
 
-      possible_delimiters = [',', "\t", ';', ':', '|']
+      delimiters = [',', "\t", ';', ':', '|']
+
+      line = nil
+      has_header = options[:headers_in_file]
+      candidates = Hash.new(0)
+      count = has_header ? 1 : 5
+      count.times do
+        line = readline_with_counts(filehandle, options)
+        delimiters.each do |d|
+          candidates[d] += line.scan(d).count
+        end
+      rescue EOFError # short files
+        break
+      end
+      rewind(filehandle)
 
-      candidates = if options.fetch(:headers_in_file)
-                     candidated_column_separators_from_headers(filehandle, options, possible_delimiters)
-                   else
-                     candidated_column_separators_from_contents(filehandle, options, possible_delimiters)
-                   end
+      if candidates.values.max == 0
+        # if the header only contains
+        return ',' if line.chomp(options[:row_sep]) =~ /^\w+$/
 
-      raise SmarterCSV::NoColSepDetected if candidates.values.max == 0
+        raise SmarterCSV::NoColSepDetected
+      end
 
       candidates.key(candidates.values.max)
     end
@@ -582,35 +598,5 @@ module SmarterCSV
       return true if str.is_a?(String) && !str.empty?
       false
     end
-
-    def candidated_column_separators_from_headers(filehandle, options, delimiters)
-      candidates = Hash.new(0)
-      line = readline_with_counts(filehandle, options.slice(:row_sep))
-
-      delimiters.each do |d|
-        candidates[d] += line.scan(d).count
-      end
-
-      rewind(filehandle)
-
-      candidates
-    end
-
-    def candidated_column_separators_from_contents(filehandle, options, delimiters)
-      candidates = Hash.new(0)
-
-      5.times do
-        line = readline_with_counts(filehandle, options.slice(:row_sep))
-        delimiters.each do |d|
-          candidates[d] += line.scan(d).count
-        end
-      rescue EOFError # short files
-        break
-      end
-
-      rewind(filehandle)
-
-      candidates
-    end
   end
 end
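Reviewer note: the rewritten `guess_column_separator` counts each candidate delimiter in the header line (or in up to five content lines when there is no header), falls back to `,` for a file whose sampled line is a single ASCII word, and otherwise raises. The standalone sketch below reproduces that flow against any IO-like object so it can be tried without the gem. The names `guess_col_sep_sketch`, `DELIMITERS`, and the local `NoColSepDetected` class are stand-ins, and `io.gets` replaces the gem's `readline_with_counts`, which is not shown in this diff.

```ruby
require 'stringio'

DELIMITERS = [',', "\t", ';', ':', '|'].freeze

# Stand-in for SmarterCSV::NoColSepDetected.
class NoColSepDetected < StandardError; end

def guess_col_sep_sketch(io, has_header: true, row_sep: "\n")
  candidates = Hash.new(0)
  last_line = nil

  # Sample one line when a header is present, otherwise up to five lines.
  (has_header ? 1 : 5).times do
    line = io.gets(row_sep)
    break if line.nil? # short file

    last_line = line
    DELIMITERS.each { |d| candidates[d] += line.scan(d).count }
  end
  io.rewind

  if candidates.values.max.to_i.zero?
    # One-column file: a single word followed by row_sep (e.g. "\r\n").
    return ',' if last_line && last_line.chomp(row_sep) =~ /\A\w+\z/

    raise NoColSepDetected
  end

  candidates.key(candidates.values.max)
end
```

For example, `guess_col_sep_sketch(StringIO.new("name\r\nbob\r\n"), row_sep: "\r\n")` takes the one-column fallback because the row separator is stripped before the word check, which corresponds to the Windows case named in the 1.8.3 changelog entry.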
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: smarter_csv
 version: !ruby/object:Gem::Version
-  version: 1.8.1
+  version: 1.8.3
 platform: ruby
 authors:
 - Tilo Sloboda
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2023-03-19 00:00:00.000000000 Z
+date: 2023-03-30 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: awesome_print
@@ -112,6 +112,7 @@ files:
 - LICENSE.txt
 - README.md
 - Rakefile
+- TO_DO_v2.md
 - ext/smarter_csv/extconf.rb
 - ext/smarter_csv/smarter_csv.c
 - lib/extensions/hash.rb