smarter_csv 1.15.2 → 1.16.0
- checksums.yaml +4 -4
- data/.rubocop.yml +9 -0
- data/CHANGELOG.md +68 -1
- data/CONTRIBUTORS.md +3 -1
- data/Gemfile +1 -0
- data/README.md +123 -27
- data/docs/_introduction.md +40 -24
- data/docs/bad_row_quarantine.md +285 -0
- data/docs/basic_read_api.md +151 -9
- data/docs/basic_write_api.md +474 -59
- data/docs/batch_processing.md +161 -4
- data/docs/column_selection.md +183 -0
- data/docs/data_transformations.md +162 -29
- data/docs/examples.md +339 -46
- data/docs/header_transformations.md +93 -12
- data/docs/header_validations.md +56 -18
- data/docs/history.md +117 -0
- data/docs/instrumentation.md +165 -0
- data/docs/migrating_from_csv.md +290 -0
- data/docs/options.md +150 -87
- data/docs/parsing_strategy.md +63 -1
- data/docs/real_world_csv.md +262 -0
- data/docs/releases/1.16.0/benchmarks.md +223 -0
- data/docs/releases/1.16.0/changes.md +272 -0
- data/docs/releases/1.16.0/performance_notes.md +114 -0
- data/docs/row_col_sep.md +14 -5
- data/docs/value_converters.md +193 -57
- data/ext/smarter_csv/extconf.rb +3 -0
- data/ext/smarter_csv/smarter_csv.c +1007 -71
- data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png +0 -0
- data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.svg +108 -0
- data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.png +0 -0
- data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg +141 -0
- data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.png +0 -0
- data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.svg +139 -0
- data/lib/smarter_csv/errors.rb +8 -0
- data/lib/smarter_csv/file_io.rb +1 -1
- data/lib/smarter_csv/hash_transformations.rb +14 -13
- data/lib/smarter_csv/header_transformations.rb +21 -2
- data/lib/smarter_csv/headers.rb +2 -1
- data/lib/smarter_csv/options.rb +124 -7
- data/lib/smarter_csv/parser.rb +362 -75
- data/lib/smarter_csv/reader.rb +494 -46
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv/writer.rb +71 -19
- data/lib/smarter_csv.rb +95 -12
- data/smarter_csv.gemspec +20 -10
- metadata +37 -80
data/docs/value_converters.md
CHANGED
|
@@ -2,6 +2,7 @@
|
|
|
2
2
|
### Contents
|
|
3
3
|
|
|
4
4
|
* [Introduction](./_introduction.md)
|
|
5
|
+
* [Migrating from Ruby CSV](./migrating_from_csv.md)
|
|
5
6
|
* [Parsing Strategy](./parsing_strategy.md)
|
|
6
7
|
* [The Basic Read API](./basic_read_api.md)
|
|
7
8
|
* [The Basic Write API](./basic_write_api.md)
|
|
@@ -10,76 +11,211 @@
|
|
|
10
11
|
* [Row and Column Separators](./row_col_sep.md)
|
|
11
12
|
* [Header Transformations](./header_transformations.md)
|
|
12
13
|
* [Header Validations](./header_validations.md)
|
|
14
|
+
* [Column Selection](./column_selection.md)
|
|
13
15
|
* [Data Transformations](./data_transformations.md)
|
|
14
16
|
* [**Value Converters**](./value_converters.md)
|
|
15
|
-
|
|
16
|
-
|
|
17
|
+
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
18
|
+
* [Instrumentation Hooks](./instrumentation.md)
|
|
19
|
+
* [Examples](./examples.md)
|
|
20
|
+
* [Real-World CSV Files](./real_world_csv.md)
|
|
21
|
+
* [SmarterCSV over the Years](./history.md)
|
|
22
|
+
* [Release Notes](./releases/1.16.0/changes.md)
|
|
23
|
+
|
|
24
|
+
--------------
|
|
17
25
|
|
|
18
26
|
# Using Value Converters for Reading CSV
|
|
19
27
|
|
|
20
|
-
Value
|
|
28
|
+
Value converters let you transform raw CSV strings into the types your downstream code
|
|
29
|
+
expects — dates, booleans, numbers, Money objects, whatever you need. They run per-key,
|
|
30
|
+
after SmarterCSV has parsed and mapped the headers.
|
|
31
|
+
|
|
32
|
+
A converter is either a **lambda** (for simple inline cases) or a **class** implementing
|
|
33
|
+
`self.convert(value)` (for reusable, independently testable converters). Both forms are
|
|
34
|
+
fully supported.
|
|
35
|
+
|
|
36
|
+
The examples throughout this page use the following fixture file:
|
|
37
|
+
|
|
38
|
+
```
|
|
39
|
+
first,last,date,price,member
|
|
40
|
+
Ben,Miller,10/30/1998,$44.50,TRUE
|
|
41
|
+
Tom,Turner,2/1/2011,$15.99,False
|
|
42
|
+
Ken,Smith,01/09/2013,$199.99,true
|
|
43
|
+
```
|
|
44
|
+
|
|
45
|
+
> **Key mapping interaction:** if you use `key_mapping:`, converters must reference the
|
|
46
|
+
> **mapped** key name, not the original CSV header name. The mapping runs first; converters
|
|
47
|
+
> see the final key.
|
|
48
|
+
|
|
49
|
+
## Lambda Converters
|
|
21
50
|
|
|
22
|
-
|
|
51
|
+
Lambdas are the quickest way to define a converter inline.
|
|
52
|
+
|
|
53
|
+
**Boolean:**
|
|
23
54
|
|
|
24
55
|
```ruby
|
|
25
|
-
|
|
26
|
-
first,last,date,price,member
|
|
27
|
-
Ben,Miller,10/30/1998,$44.50,TRUE
|
|
28
|
-
Tom,Turner,2/1/2011,$15.99,False
|
|
29
|
-
Ken,Smith,01/09/2013,$199.99,true
|
|
30
|
-
|
|
31
|
-
$ irb
|
|
32
|
-
> require 'smarter_csv'
|
|
33
|
-
> require 'date'
|
|
34
|
-
|
|
35
|
-
# define a custom converter class, which implements self.convert(value)
|
|
36
|
-
class DateConverter
|
|
37
|
-
def self.convert(value)
|
|
38
|
-
Date.strptime( value, '%m/%d/%Y') # parses custom date format into Date instance
|
|
39
|
-
end
|
|
40
|
-
end
|
|
56
|
+
bool = ->(v) { v&.match?(/\Atrue\z/i) }
|
|
41
57
|
|
|
42
|
-
|
|
43
|
-
|
|
44
|
-
|
|
45
|
-
|
|
46
|
-
|
|
58
|
+
data = SmarterCSV.process('records.csv', value_converters: { active: bool, verified: bool })
|
|
59
|
+
# "TRUE" => true
|
|
60
|
+
# "false" => false
|
|
61
|
+
# nil     => nil   (the &. safe-navigation guard handles missing fields; "" => false)
|
|
62
|
+
```
|
|
47
63
|
|
|
48
|
-
|
|
49
|
-
|
|
50
|
-
|
|
51
|
-
|
|
52
|
-
|
|
53
|
-
|
|
54
|
-
|
|
64
|
+
**Strip currency symbol and convert to Float:**
|
|
65
|
+
|
|
66
|
+
```ruby
|
|
67
|
+
dollar = ->(v) { v&.sub('$', '')&.to_f }
|
|
68
|
+
|
|
69
|
+
data = SmarterCSV.process('records.csv', value_converters: { price: dollar, tax: dollar })
|
|
70
|
+
# "$44.50" => 44.5
|
|
71
|
+
# nil => nil
|
|
72
|
+
```
|
|
73
|
+
|
|
74
|
+
**Reusing the same lambda across multiple keys:**
|
|
75
|
+
|
|
76
|
+
```ruby
|
|
77
|
+
date = ->(v) { v ? Date.strptime(v, '%m/%d/%Y') : nil }
|
|
78
|
+
|
|
79
|
+
data = SmarterCSV.process('records.csv', value_converters: { start_date: date, end_date: date })
|
|
80
|
+
```
|
|
81
|
+
|
|
82
|
+
**`key_mapping` + `value_converters` — always use the mapped name:**
|
|
83
|
+
|
|
84
|
+
```ruby
|
|
85
|
+
# CSV header is "MemberSince" — mapped to :member_since
|
|
86
|
+
options = {
|
|
87
|
+
key_mapping: { membersince: :member_since },
|
|
88
|
+
value_converters: { member_since: ->(v) { v ? Date.strptime(v, '%m/%d/%Y') : nil } },
|
|
89
|
+
}
|
|
90
|
+
data = SmarterCSV.process('records.csv', options)
|
|
91
|
+
```
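The ordering in the example above can be pictured with plain Ruby hashes. This is only an illustration of "map the keys first, then convert", not SmarterCSV's actual implementation; the `row` hash and the two-step pipeline are assumptions made for the sketch.

```ruby
require 'date'

# Illustration only: a parsed row as it might look before key mapping.
row = { membersince: '10/30/1998', first: 'Ben' }

key_mapping = { membersince: :member_since }
converters  = { member_since: ->(v) { v ? Date.strptime(v, '%m/%d/%Y') : nil } }

# Step 1: rename keys. Step 2: look up converters by the *mapped* key.
mapped    = row.transform_keys { |k| key_mapping.fetch(k, k) }
converted = mapped.to_h { |k, v| [k, converters.key?(k) ? converters[k].call(v) : v] }

converted[:member_since].iso8601 # => "1998-10-30"
converted[:first]                # => "Ben"
```

A converter registered under `:membersince` would never match in this sketch, because that key no longer exists once the mapping has run.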
|
|
92
|
+
|
|
93
|
+
## Handling nil and Empty Fields
|
|
94
|
+
|
|
95
|
+
Converters receive the raw string value from the CSV field. If a field is blank or missing,
|
|
96
|
+
the value passed to your converter may be `nil` or `""`. Always guard against this:
|
|
97
|
+
|
|
98
|
+
```ruby
|
|
99
|
+
# Safe: returns nil when the field is nil instead of raising (an empty string yields 0.0)
|
|
100
|
+
price = ->(v) { v&.sub('$', '')&.to_f }
|
|
101
|
+
|
|
102
|
+
# Unsafe: raises NoMethodError when v is nil
|
|
103
|
+
price = ->(v) { v.sub('$', '').to_f }
|
|
104
|
+
```
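The `&.` guard covers `nil` but still lets an empty string through to the conversion. A minimal sketch of a wrapper that treats `nil`, `""` and whitespace-only strings alike is shown below. `blank_safe` is a hypothetical helper name, not part of SmarterCSV, and the lambda is called directly rather than through `SmarterCSV.process`.

```ruby
# Hypothetical helper (not part of SmarterCSV): wraps a converter so that
# nil, "" and whitespace-only strings return nil instead of reaching it.
def blank_safe(converter)
  lambda do |v|
    (v.nil? || v.strip.empty?) ? nil : converter.call(v)
  end
end

price = blank_safe(->(v) { v.sub('$', '').to_f })

price.call('$44.50') # => 44.5
price.call('')       # => nil
price.call('   ')    # => nil
price.call(nil)      # => nil
```

The wrapped lambda can be passed in `value_converters:` like any other lambda.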
|
|
55
105
|
|
|
56
|
-
|
|
57
|
-
|
|
58
|
-
|
|
59
|
-
|
|
60
|
-
|
|
61
|
-
|
|
62
|
-
|
|
63
|
-
|
|
64
|
-
|
|
65
|
-
|
|
66
|
-
|
|
106
|
+
For class-based converters, add an explicit guard at the top of `self.convert`:
|
|
107
|
+
|
|
108
|
+
```ruby
|
|
109
|
+
def self.convert(value)
|
|
110
|
+
return nil if value.nil? || value.empty?
|
|
111
|
+
# ... rest of conversion
|
|
112
|
+
end
|
|
113
|
+
```
|
|
114
|
+
|
|
115
|
+
## Class-Based Converters
|
|
116
|
+
|
|
117
|
+
For converters you want to reuse across the codebase or test independently, define a class
|
|
118
|
+
with a `self.convert(value)` class method:
|
|
119
|
+
|
|
120
|
+
```ruby
|
|
121
|
+
require 'date'
|
|
122
|
+
|
|
123
|
+
class DateConverter
|
|
124
|
+
def self.convert(value)
|
|
125
|
+
return nil if value.nil? || value.empty?
|
|
126
|
+
Date.strptime(value, '%m/%d/%Y')
|
|
127
|
+
end
|
|
128
|
+
end
|
|
129
|
+
|
|
130
|
+
class DollarConverter
|
|
131
|
+
def self.convert(value)
|
|
132
|
+
return nil if value.nil? || value.empty?
|
|
133
|
+
value.sub('$', '').to_f
|
|
134
|
+
end
|
|
135
|
+
end
|
|
136
|
+
|
|
137
|
+
class BooleanConverter
|
|
138
|
+
def self.convert(value)
|
|
139
|
+
case value
|
|
140
|
+
when /\Atrue\z/i then true
|
|
141
|
+
when /\Afalse\z/i then false
|
|
67
142
|
end
|
|
143
|
+
end
|
|
144
|
+
end
|
|
68
145
|
|
|
69
|
-
|
|
70
|
-
|
|
71
|
-
|
|
72
|
-
|
|
73
|
-
|
|
74
|
-
|
|
75
|
-
|
|
76
|
-
|
|
77
|
-
|
|
78
|
-
|
|
79
|
-
|
|
80
|
-
|
|
81
|
-
=> true
|
|
146
|
+
options = {
|
|
147
|
+
value_converters: {
|
|
148
|
+
date: DateConverter,
|
|
149
|
+
price: DollarConverter,
|
|
150
|
+
member: BooleanConverter,
|
|
151
|
+
}
|
|
152
|
+
}
|
|
153
|
+
data = SmarterCSV.process('spec/fixtures/with_dates.csv', options)
|
|
154
|
+
|
|
155
|
+
data.first[:date] #=> #<Date: 1998-10-30>
|
|
156
|
+
data.first[:price] #=> 44.5
|
|
157
|
+
data.first[:member] #=> true
|
|
82
158
|
```
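Because a class-based converter is a plain Ruby class, it can be exercised without reading any CSV file. The sketch below repeats `BooleanConverter` from the example above so that it runs on its own, then calls it directly. The sample inputs are made up for illustration.

```ruby
class BooleanConverter
  def self.convert(value)
    case value
    when /\Atrue\z/i  then true
    when /\Afalse\z/i then false
    end
  end
end

BooleanConverter.convert('TRUE')  # => true
BooleanConverter.convert('False') # => false
BooleanConverter.convert('yes')   # => nil (no branch matches)
BooleanConverter.convert(nil)     # => nil (a Regexp never matches nil)
```

The `case` has no `else` branch, so any value other than a case-insensitive "true" or "false" becomes `nil`; add an `else` if unrecognized values should be kept or should raise.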
|
|
83
159
|
|
|
160
|
+
## Money Converter
|
|
161
|
+
|
|
162
|
+
For applications using the [`money`](https://github.com/RubyMoney/money) gem:
|
|
163
|
+
|
|
164
|
+
```ruby
|
|
165
|
+
require 'money'
|
|
166
|
+
|
|
167
|
+
class MoneyConverter
|
|
168
|
+
def self.convert(value)
|
|
169
|
+
return nil if value.nil? || value.empty?
|
|
170
|
+
# remove currency symbol and thousands separators before converting
|
|
171
|
+
Money.from_amount(value.gsub(/[\s$,]/, '').to_f)
|
|
172
|
+
end
|
|
173
|
+
end
|
|
174
|
+
|
|
175
|
+
data = SmarterCSV.process('invoices.csv', value_converters: { amount: MoneyConverter })
|
|
176
|
+
```
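`MoneyConverter` above passes the amount through `to_f`, and a Float cannot represent every decimal amount exactly. If that matters for your data, one option is to parse into integer cents with exact Rational arithmetic first. The sketch below uses only core Ruby; `amount_in_cents` is a hypothetical helper name, and two decimal places per unit is an assumption.

```ruby
# Hypothetical helper: parse "$1,234.56" into integer cents using exact
# Rational arithmetic, so no Float is involved at any step.
def amount_in_cents(value)
  return nil if value.nil? || value.strip.empty?

  cleaned = value.gsub(/[\s$,]/, '') # same cleanup as MoneyConverter above
  (cleaned.to_r * 100).round
end

amount_in_cents('$1,234.56') # => 123456
amount_in_cents('$44.50')    # => 4450
amount_in_cents('')          # => nil
```

With the `money` gem loaded, the result could be passed to `Money.new(cents, 'USD')`, which takes an amount in cents; the currency here is an assumption about your data.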
|
|
177
|
+
|
|
178
|
+
## Why there are no built-in Date / Time / DateTime converters
|
|
179
|
+
|
|
180
|
+
SmarterCSV intentionally does not ship built-in date or time converters. The reason is
|
|
181
|
+
**localization (L10N)**: date formats vary widely across regions and there is no single
|
|
182
|
+
correct interpretation of a bare string like `"12/03/2020"` — it is December 3rd in the
|
|
183
|
+
United States but March 12th in most of Europe.
|
|
184
|
+
|
|
185
|
+
Ruby's standard library `Date.parse` / `DateTime.parse` handle ISO 8601 and a handful of
|
|
186
|
+
English-language formats, but they are not locale-aware and will silently produce the wrong
|
|
187
|
+
date for locale-specific formats. Shipping a built-in converter that is wrong for half the
|
|
188
|
+
world's locales would be worse than shipping none.
|
|
189
|
+
|
|
190
|
+
The right solution is a `value_converter` with an explicit format string tuned to your data:
|
|
191
|
+
|
|
192
|
+
```ruby
|
|
193
|
+
require 'date'
|
|
194
|
+
|
|
195
|
+
# US format: MM/DD/YYYY
|
|
196
|
+
us_date = ->(v) { Date.strptime(v, '%m/%d/%Y') rescue v }
|
|
197
|
+
|
|
198
|
+
# European format: DD.MM.YYYY
|
|
199
|
+
eu_date = ->(v) { Date.strptime(v, '%d.%m.%Y') rescue v }
|
|
200
|
+
|
|
201
|
+
# ISO 8601 (unambiguous; the rescue only covers nil or malformed input)
|
|
202
|
+
iso_date = ->(v) { Date.iso8601(v) rescue v }
|
|
203
|
+
|
|
204
|
+
options = {
|
|
205
|
+
value_converters: {
|
|
206
|
+
birth_date: eu_date,
|
|
207
|
+
created_at: iso_date,
|
|
208
|
+
invoiced_on: us_date,
|
|
209
|
+
}
|
|
210
|
+
}
|
|
211
|
+
data = SmarterCSV.process('records.csv', options)
|
|
212
|
+
```
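The ambiguity described above is easy to reproduce with the standard library alone. In this short sketch the same string parses to two different dates depending on the format string; the sample values are made up for illustration.

```ruby
require 'date'

raw = '12/03/2020'

us = Date.strptime(raw, '%m/%d/%Y') # month first
eu = Date.strptime(raw, '%d/%m/%Y') # day first

us.iso8601 # => "2020-12-03"
eu.iso8601 # => "2020-03-12"

# With the `rescue v` modifier, a value that does not fit the format
# comes back unchanged as a String instead of raising.
us_date = ->(v) { Date.strptime(v, '%m/%d/%Y') rescue v }
us_date.call('30/10/1998') # => "30/10/1998"
```

Note that the `rescue v` form leaves a mix of `Date` and `String` values in the column, so downstream code has to check the type; drop the `rescue` if a malformed date should stop the import instead.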
|
|
213
|
+
|
|
214
|
+
For locale-aware parsing of user-supplied date strings (e.g., "3. Oktober 2024" in German),
|
|
215
|
+
consider the [`delocalize`](https://github.com/clemens/delocalize) gem, which integrates
|
|
216
|
+
with Rails' I18n locale configuration. For natural-language date strings, consider
|
|
217
|
+
[`chronic`](https://github.com/mojombo/chronic).
|
|
218
|
+
|
|
84
219
|
--------------------
|
|
85
|
-
|
|
220
|
+
|
|
221
|
+
PREVIOUS: [Data Transformations](./data_transformations.md) | NEXT: [Bad Row Quarantine](./bad_row_quarantine.md) | UP: [README](../README.md)
|
data/ext/smarter_csv/extconf.rb
CHANGED
|
@@ -11,6 +11,9 @@ end
|
|
|
11
11
|
|
|
12
12
|
optflags = "-O3 -flto -fomit-frame-pointer -DNDEBUG".dup
|
|
13
13
|
optflags << " -march=native" unless RUBY_PLATFORM.start_with?("arm64-darwin")
|
|
14
|
+
# -fno-semantic-interposition: GCC/Clang only (not MSVC). Allows intra-library
|
|
15
|
+
# calls to bypass the PLT on Linux and enables more aggressive LTO inlining.
|
|
16
|
+
optflags << " -fno-semantic-interposition" unless RUBY_PLATFORM.include?("mswin")
|
|
14
17
|
|
|
15
18
|
append_cflags('-Wno-compound-token-split-by-macro')
|
|
16
19
|
|