smarter_csv 1.8.1 → 1.8.3
- checksums.yaml +4 -4
- data/CHANGELOG.md +13 -1
- data/README.md +20 -10
- data/TO_DO_v2.md +14 -0
- data/ext/smarter_csv/smarter_csv.c +46 -46
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv.rb +25 -39
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 654b04532f0d0b1e15bf84c2e23231e00946a1f57c613f53555ba2d531eaf4f9
+  data.tar.gz: d99a921a908864764a39e94818be45c9feb8a1fbe15eb776e24ef10e98c749fd
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 8005c2b6bdd4e82ab1acc8849afd4b8d7abf0d744bb18fa76aaac68a707a8f14300b4e844abac3dacab24b254c81787dc4501a1cc1138ebdc97fe52728e82f30
+  data.tar.gz: 0baead2aa4d6841f3770e27a24e5dc5d783873db253c8185e81366a2b5a36045d82f4dc2011fbd46373cf434968675b232217c8764c2c858d74c1cceaebd45ed
data/CHANGELOG.md
CHANGED
@@ -1,13 +1,25 @@
 
 # SmarterCSV 1.x Change Log
 
+## 1.8.3 (2023-03-30)
+* bugfix: windows one-column files were raising NoColSepDetected (issue #229)
+
+
+## 1.8.2 (2023-03-21)
+* bugfix: do not raise `NoColSepDetected` for CSV files with only one column in most cases (issue #222)
+  If the first lines contain non-ASCII characters, and no col_sep is detected, it will still raise `NoColSepDetected`
+
 ## 1.8.1 (2023-03-19)
 * added validation against invalid values for :col_sep, :row_sep, :quote_char (issue #216)
 * deprecating `required_headers` and replace with `required_keys` (issue #140)
 * fixed issue with require statement
 
-## 1.8.0 (2023-03-18)
+## 1.8.0 (2023-03-18) BREAKING
 * NEW DEFAULTS: `col_sep: :auto`, `row_sep: :auto`. Fully automatic detection by default.
+
+  MAKE SURE to rescue `NoColSepDetected` if your CSV files can have unexpected formats,
+  e.g. from users uploading them to a service, and handle those cases.
+
 * ignore Byte Order Marker (BOM) in first line in file (issues #27, #219)
 
 ## 1.7.4 (2023-01-13)
data/README.md
CHANGED
@@ -3,26 +3,33 @@
 
 [](https://codecov.io/gh/tilo/smarter_csv) [](http://badge.fury.io/rb/smarter_csv)
 
+#### Development Branches
+
+* default branch is `main` for 1.x development
+* 2.x development is on `2.0-development` (check this branch for 2.0 documentation)
+
 #### Work towards Future Version 2.0
 
 * Work towards SmarterCSV 2.0 is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
 Please check the [2.0-develop branch](https://github.com/tilo/smarter_csv/tree/2.0-develop), open any issues and pull requests with mention of tag v2.0.
 
-
-See below for list of deprecated options.
+---------------
 
-####
+#### SmarterCSV 1.x [Current Version]
 
-
-* 2.x development is on `2.0-development`
+`smarter_csv` is a Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, suitable for direct processing with ActiveRecord, parallel processing, kicking-off batch jobs with Sidekiq, or oploading data to S3.
 
-
+The goals for SmarterCSV are:
+* ease of use for handling most common CSV files without having to tweak options
+* improve robustness of your code when you have no control over the quality of the CSV files which are processed
+* formatting each row of data as a hash, in order to allow easy processing with ActiveRecord, parallel processing, kicking-off batch jobs with Sidekiq, or oploading data to S3.
 
-####
+#### Rescue from Exceptions
+While SmarterCSV uses sensible defaults to process the most common CSV files, it will raise exceptions if it can not auto-detect `col_sep`, `row_sep`, or if it encounters other problems. Therefore, when calling `SmarterCSV.process`, please rescue from `SmarterCSVException`, and handle outliers according to your requirements.
 
-
+If you encounter unusual CSV files, please follow the tips in the Troubleshooting section below. You can use the options below to accomodate for unusual formats.
 
-
+#### Features
 
 One `smarter_csv` user wrote:
 
@@ -77,7 +84,10 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
 
 Here are some examples to demonstrate the versatility of SmarterCSV.
 
-
+**It is generally recommended to rescue `SmarterCSVException` or it's sub-classes.**
+
+By default SmarterCSV determines the `row_sep` and `col_sep` values automatically. In cases where the automatic detection fails, an exception will be raised, e.g. `NoColSepDetected`. Rescuing from these exceptions will make sure that you don't miss processing CSV files, in case users upload CSV files with unexpected formats.
+
 In rare cases you may have to manually set these values, after going through the troubleshooting procedure described above.
 
 #### Example 1a: How SmarterCSV processes CSV-files as array of hashes:
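Reviewer note: the README text added above asks callers to rescue `SmarterCSVException` or its subclasses. The sketch below shows one possible shape for that. It runs without the gem, so `DemoCSV` and its two classes are stand-ins for the gem's error classes, and it assumes `NoColSepDetected` inherits from `SmarterCSVException` (the README wording suggests this, but this diff does not show it). `import_rows` and the `parser:` argument are hypothetical names used only for illustration.

```ruby
# Stand-in error hierarchy so the sketch runs without the gem installed.
# Assumption: the real NoColSepDetected is a subclass of SmarterCSVException.
module DemoCSV
  class SmarterCSVException < StandardError; end
  class NoColSepDetected < SmarterCSVException; end
end

# Hypothetical wrapper: `parser` is any callable that takes a path and
# returns an array of hashes. With the real gem it might be something like
# ->(path) { SmarterCSV.process(path) }.
def import_rows(path, parser:)
  { ok: true, rows: parser.call(path) }
rescue DemoCSV::NoColSepDetected
  # Auto-detection failed: report it instead of dropping the upload silently.
  { ok: false, reason: :no_col_sep }
rescue DemoCSV::SmarterCSVException => e
  # Any other error from the same hierarchy.
  { ok: false, reason: :invalid_csv, message: e.message }
end
```

The more specific class is rescued first so that a failed separator detection can be handled differently from other parse errors, for example by asking the user which separator the file uses.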
data/TO_DO_v2.md
ADDED
@@ -0,0 +1,14 @@
+# SmarterCSV v2.0 TO DO List
+
+* add enumerable to speed up parallel processing [issue #66](https://github.com/tilo/smarter_csv/issues/66), [issue #32](https://github.com/tilo/smarter_csv/issues/32)
+* use Procs for validations and transformatoins [issue #118](https://github.com/tilo/smarter_csv/issues/118)
+* make @errors and @warnings work [issue #118](https://github.com/tilo/smarter_csv/issues/118)
+* skip file opening, allow reading from CSV string, e.g. reading from S3 file [issue #120](https://github.com/tilo/smarter_csv/issues/120).
+  Or stream large file from S3 (linked in the issue)
+* Collect all Errors, before surfacing them. Avoid throwing an exception on the first error [issue #133](https://github.com/tilo/smarter_csv/issues/133)
+* Don't call rewind on filehandle
+* [2.0 BUG] :convert_values_to_numeric_unless_leading_zeros drops leading zeros [issue #151](https://github.com/tilo/smarter_csv/issues/151)
+* [2.0 BUG] convert_to_float saves Proc as @@convert_to_integer [issue #157](https://github.com/tilo/smarter_csv/issues/157)
+* Provide an example for custom Procs for hash_transformations in the docs [issue #174](https://github.com/tilo/smarter_csv/issues/174)
+* Replace remove_empty_values: false [issue #213](https://github.com/tilo/smarter_csv/issues/213)
+
data/ext/smarter_csv/smarter_csv.c
CHANGED
@@ -15,67 +15,67 @@
 static VALUE rb_parse_csv_line(VALUE self, VALUE line, VALUE col_sep, VALUE quote_char, VALUE max_size) {
   if (RB_TYPE_P(line, T_NIL) == 1) {
     return rb_ary_new();
+  }
 
-
-
-
-  long line_len = RSTRING_LEN(line);
-  char *endP = startP + line_len ; /* points behind the string */
-  char *p = startP;
+  if (RB_TYPE_P(line, T_STRING) != 1) {
+    rb_raise(rb_eTypeError, "ERROR in SmarterCSV.parse_line: line has to be a string or nil");
+  }
 
-
-
+  rb_encoding *encoding = rb_enc_get(line); /* get the encoding from the input line */
+  char *startP = RSTRING_PTR(line); /* may not be null terminated */
+  long line_len = RSTRING_LEN(line);
+  char *endP = startP + line_len ; /* points behind the string */
+  char *p = startP;
 
-
-
+  char *col_sepP = RSTRING_PTR(col_sep);
+  long col_sep_len = RSTRING_LEN(col_sep);
 
-
+  char *quoteP = RSTRING_PTR(quote_char);
+  long quote_count = 0;
 
-
-  VALUE field;
-  long i;
+  bool col_sep_found = true;
 
-
-
-
-    for(i=0; (i < col_sep_len) && (p+i < endP) ; i++) {
-      col_sep_found = col_sep_found && (*(p+i) == *(col_sepP+i));
-    }
-    /* if col_sep was found and we have even quotes */
-    if (col_sep_found && (quote_count % 2 == 0)) {
-      /* if max_size != nil && lements.size >= header_size */
-      if ((max_size != Qnil) && RARRAY_LEN(elements) >= NUM2INT(max_size)) {
-        break;
-      } else {
-        /* push that field with original encoding onto the results */
-        field = rb_enc_str_new(startP, p - startP, encoding);
-        rb_ary_push(elements, field);
+  VALUE elements = rb_ary_new();
+  VALUE field;
+  long i;
 
-
-
-
+  while (p < endP) {
+    /* does the remaining string start with col_sep ? */
+    col_sep_found = true;
+    for(i=0; (i < col_sep_len) && (p+i < endP) ; i++) {
+      col_sep_found = col_sep_found && (*(p+i) == *(col_sepP+i));
+    }
+    /* if col_sep was found and we have even quotes */
+    if (col_sep_found && (quote_count % 2 == 0)) {
+      /* if max_size != nil && lements.size >= header_size */
+      if ((max_size != Qnil) && RARRAY_LEN(elements) >= NUM2INT(max_size)) {
+        break;
       } else {
-
-
-
-      p++;
-    }
-  } /* while */
+        /* push that field with original encoding onto the results */
+        field = rb_enc_str_new(startP, p - startP, encoding);
+        rb_ary_push(elements, field);
 
-
-
-
-
-
+        p += col_sep_len;
+        startP = p;
+      }
+    } else {
+      if (*p == *quoteP) {
+        quote_count += 1;
+      }
+      p++;
     }
+  } /* while */
 
-
+  /* check if the last part of the line needs to be processed */
+  if ((max_size == Qnil) || RARRAY_LEN(elements) < NUM2INT(max_size)) {
+    /* copy the remaining line as a field with original encoding onto the results */
+    field = rb_enc_str_new(startP, endP - startP, encoding);
+    rb_ary_push(elements, field);
   }
 
-
+  return elements;
 }
 
-
 VALUE SmarterCSV = Qnil;
 
 void Init_smarter_csv(void) {
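Reviewer note: the loop in `rb_parse_csv_line` is easier to follow as a Ruby sketch of the same idea. This is an illustrative port, not the gem's code. It works on characters rather than bytes, and `split_csv_line` is a hypothetical name. It keeps the behaviour visible in the C source: a separator only ends a field when the number of quote characters seen so far is even, quote characters stay in the returned fields, and `max_size` caps the number of fields.

```ruby
# Illustrative Ruby port of the C splitting loop (character-based sketch).
def split_csv_line(line, col_sep = ',', quote_char = '"', max_size = nil)
  return [] if line.nil?
  raise TypeError, 'line has to be a string or nil' unless line.is_a?(String)

  elements = []
  start = 0        # index where the current field begins
  pos = 0          # current scan position
  quote_count = 0  # number of quote characters seen so far

  while pos < line.length
    # A separator only counts outside quotes, i.e. after an even quote count.
    if line[pos, col_sep.length] == col_sep && quote_count.even?
      break if max_size && elements.size >= max_size

      elements << line[start...pos]
      pos += col_sep.length
      start = pos
    else
      quote_count += 1 if line[pos] == quote_char
      pos += 1
    end
  end

  # Whatever follows the last separator becomes the final field.
  elements << line[start..-1] if max_size.nil? || elements.size < max_size
  elements
end
```

For example, `split_csv_line('a,"b,c",d')` returns `["a", "\"b,c\"", "d"]`: the comma inside the quotes does not split the field, and the quotes are left in place for later processing.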
data/lib/smarter_csv/version.rb
CHANGED
data/lib/smarter_csv.rb
CHANGED
@@ -3,8 +3,11 @@
 require_relative "extensions/hash"
 require_relative "smarter_csv/version"
 
-
-
+if `uname -s`.chomp == 'Darwin'
+  require 'smarter_csv.bundle' unless ENV['CI'] # local testing
+else
+  require_relative "smarter_csv/smarter_csv" unless ENV['CI'] # does not compile/link in CI?
+end
 
 module SmarterCSV
   class SmarterCSVException < StandardError; end
@@ -393,15 +396,28 @@ module SmarterCSV
     def guess_column_separator(filehandle, options)
       skip_lines(filehandle, options)
 
-
+      delimiters = [',', "\t", ';', ':', '|']
+
+      line = nil
+      has_header = options[:headers_in_file]
+      candidates = Hash.new(0)
+      count = has_header ? 1 : 5
+      count.times do
+        line = readline_with_counts(filehandle, options)
+        delimiters.each do |d|
+          candidates[d] += line.scan(d).count
+        end
+      rescue EOFError # short files
+        break
+      end
+      rewind(filehandle)
 
-      candidates
-
-
-        candidated_column_separators_from_contents(filehandle, options, possible_delimiters)
-      end
+      if candidates.values.max == 0
+        # if the header only contains
+        return ',' if line.chomp(options[:row_sep]) =~ /^\w+$/
 
-
+        raise SmarterCSV::NoColSepDetected
+      end
 
       candidates.key(candidates.values.max)
     end
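Reviewer note: the new `guess_column_separator` counts candidate delimiters in the first line (or the first five lines when there is no header) and falls back to `','` when the last line read is a single word. The sketch below reproduces that decision on an in-memory array of lines so it can be tried without a file handle. `guess_col_sep` and the local `NoColSepDetected` class are stand-ins, and plain `chomp` replaces `chomp(options[:row_sep])`. Because Ruby's `\w` matches only ASCII word characters, a single-column header with non-ASCII characters still raises, which matches the 1.8.2 changelog note.

```ruby
# Stand-in for SmarterCSV::NoColSepDetected so the sketch is self-contained.
class NoColSepDetected < StandardError; end

DELIMITERS = [',', "\t", ';', ':', '|'].freeze

# Hypothetical helper mirroring the decision logic in the diff above.
def guess_col_sep(lines, headers_in_file: true)
  sample = lines.first(headers_in_file ? 1 : 5)

  # Count how often each candidate delimiter occurs in the sampled lines.
  candidates = Hash.new(0)
  sample.each do |line|
    DELIMITERS.each { |d| candidates[d] += line.scan(d).count }
  end

  if candidates.values.max.to_i.zero?
    # No delimiter found: treat one ASCII word as a one-column file.
    return ',' if sample.last.to_s.chomp =~ /\A\w+\z/

    raise NoColSepDetected
  end

  # The delimiter with the highest count wins (first one on ties).
  candidates.key(candidates.values.max)
end
```

With this sketch, `guess_col_sep(["name\r\n"])` returns `","` even with Windows line endings, while a header such as `"first name\n"` raises because the space keeps it from matching a single word.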
@@ -582,35 +598,5 @@ module SmarterCSV
       return true if str.is_a?(String) && !str.empty?
       false
     end
-
-    def candidated_column_separators_from_headers(filehandle, options, delimiters)
-      candidates = Hash.new(0)
-      line = readline_with_counts(filehandle, options.slice(:row_sep))
-
-      delimiters.each do |d|
-        candidates[d] += line.scan(d).count
-      end
-
-      rewind(filehandle)
-
-      candidates
-    end
-
-    def candidated_column_separators_from_contents(filehandle, options, delimiters)
-      candidates = Hash.new(0)
-
-      5.times do
-        line = readline_with_counts(filehandle, options.slice(:row_sep))
-        delimiters.each do |d|
-          candidates[d] += line.scan(d).count
-        end
-      rescue EOFError # short files
-        break
-      end
-
-      rewind(filehandle)
-
-      candidates
-    end
   end
 end
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: smarter_csv
 version: !ruby/object:Gem::Version
-  version: 1.8.
+  version: 1.8.3
 platform: ruby
 authors:
 - Tilo Sloboda
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2023-03-
+date: 2023-03-30 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: awesome_print
@@ -112,6 +112,7 @@ files:
 - LICENSE.txt
 - README.md
 - Rakefile
+- TO_DO_v2.md
 - ext/smarter_csv/extconf.rb
 - ext/smarter_csv/smarter_csv.c
 - lib/extensions/hash.rb