smarter_csv 1.16.0 → 1.16.2
- checksums.yaml +4 -4
- data/.rspec +2 -0
- data/CHANGELOG.md +60 -1
- data/CONTRIBUTORS.md +3 -1
- data/README.md +10 -4
- data/docs/_introduction.md +6 -1
- data/docs/bad_row_quarantine.md +90 -33
- data/docs/basic_read_api.md +1 -0
- data/docs/basic_write_api.md +1 -0
- data/docs/batch_processing.md +1 -0
- data/docs/column_selection.md +1 -0
- data/docs/data_transformations.md +1 -0
- data/docs/examples.md +1 -0
- data/docs/header_transformations.md +1 -0
- data/docs/header_validations.md +1 -0
- data/docs/history.md +2 -0
- data/docs/instrumentation.md +1 -0
- data/docs/migrating_from_csv.md +364 -89
- data/docs/options.md +2 -1
- data/docs/parsing_strategy.md +2 -1
- data/docs/real_world_csv.md +1 -0
- data/docs/releases/1.16.0/changes.md +1 -2
- data/docs/row_col_sep.md +1 -0
- data/docs/ruby_csv_pitfalls.md +545 -0
- data/docs/value_converters.md +1 -0
- data/ext/smarter_csv/smarter_csv.c +10 -11
- data/lib/smarter_csv/hash_transformations.rb +1 -1
- data/lib/smarter_csv/header_transformations.rb +11 -9
- data/lib/smarter_csv/parser.rb +5 -8
- data/lib/smarter_csv/reader.rb +2 -2
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv/writer.rb +1 -1
- data/lib/smarter_csv.rb +41 -3
- metadata +3 -2
data/docs/migrating_from_csv.md
CHANGED
|
@@ -3,6 +3,7 @@
|
|
|
3
3
|
|
|
4
4
|
* [Introduction](./_introduction.md)
|
|
5
5
|
* [**Migrating from Ruby CSV**](./migrating_from_csv.md)
|
|
6
|
+
* [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md)
|
|
6
7
|
* [Parsing Strategy](./parsing_strategy.md)
|
|
7
8
|
* [The Basic Read API](./basic_read_api.md)
|
|
8
9
|
* [The Basic Write API](./basic_write_api.md)
|
|
@@ -25,9 +26,18 @@
|
|
|
25
26
|
|
|
26
27
|
# Migrating from Ruby CSV
|
|
27
28
|
|
|
28
|
-
Already using Ruby's built-in `CSV` library?
|
|
29
|
-
|
|
30
|
-
|
|
29
|
+
Already using Ruby's built-in `CSV` library? There are three good reasons to switch — and switching is typically a one- or two-line change.
|
|
30
|
+
|
|
31
|
+
### Inconvenient
|
|
32
|
+
`CSV.read` returns arrays of arrays, so your code must manually handle column indexing, header normalization, type conversion, and whitespace stripping. SmarterCSV returns Rails-ready hashes with symbol keys, numeric conversion, and whitespace stripping out of the box — no boilerplate needed.
|
|
33
|
+
|
|
34
|
+
### Hidden failure modes
|
|
35
|
+
`CSV.read` has 10 ways to silently corrupt or lose data — no exception, no warning, no log line.
|
|
36
|
+
|
|
37
|
+
➡️ See [**Ruby CSV Pitfalls**](./ruby_csv_pitfalls.md) for reproducible examples and the SmarterCSV fix for each.
|
|
38
|
+
|
|
39
|
+
### Slow
|
|
40
|
+
On top of everything else, Ruby CSV is up to 129× slower than SmarterCSV for equivalent end-to-end work; see the [Performance](#performance) section below.
|
|
31
41
|
|
|
32
42
|
> **Medium article:** *"Switch from Ruby CSV to SmarterCSV in 5 Minutes"* — *(coming soon)*
|
|
33
43
|
|
|
@@ -50,29 +60,78 @@ _‡ `CSV.table` is the closest Ruby equivalent to SmarterCSV — both return sy
|
|
|
50
60
|
|
|
51
61
|
## The one-line switch
|
|
52
62
|
|
|
63
|
+
Real-world CSV files are messy — whitespace-padded headers, extra columns without headers, trailing
|
|
64
|
+
commas. Consider this file:
|
|
65
|
+
|
|
66
|
+
```
|
|
67
|
+
$ cat data.csv
|
|
68
|
+
First Name , Last Name , Age
|
|
69
|
+
Alice , Smith, 30, VIP, Gold ,
|
|
70
|
+
Bob, Jones, 25
|
|
71
|
+
```
|
|
72
|
+
|
|
73
|
+
**With Ruby CSV:**
|
|
53
74
|
```ruby
|
|
54
|
-
|
|
55
|
-
rows
|
|
75
|
+
rows = CSV.read('data.csv', headers: true).map(&:to_h)
|
|
76
|
+
rows.first
|
|
77
|
+
# => { " First Name " => "Alice ", " Last Name " => " Smith", " Age" => " 30", nil => "" }
|
|
78
|
+
# "VIP" and "Gold" silently lost — both compete for the nil key, last one wins
|
|
79
|
+
```
|
|
80
|
+
|
|
81
|
+
Whitespace-polluted keys, `Age` as a string, and extra columns competing for the same `nil` key —
|
|
82
|
+
the last one wins and the rest are silently discarded.
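The key collision can be reproduced with the standard library alone. The following is a minimal sketch that inlines a two-line CSV string (made up for illustration, mirroring `data.csv` above) and compares the number of parsed fields with the number of hash entries. Which of the colliding values survives `to_h` may differ between versions of the `csv` gem, so the sketch checks only that entries are lost.

```ruby
require 'csv'

# Hypothetical inline data mirroring data.csv: three headers, six fields in the data row.
csv_text = " First Name , Last Name , Age\nAlice , Smith, 30, VIP, Gold ,\n"

row = CSV.parse(csv_text, headers: true).first

field_count = row.fields.size  # every parsed field, including the header-less extras
hash_size   = row.to_h.size    # the extras share the nil key and collapse into one entry

puts "fields parsed: #{field_count}"
puts "hash entries:  #{hash_size}"
puts "first key:     #{row.headers.first.inspect}"  # surrounding whitespace is kept
```

Comparing the two counts is a cheap sanity check to run on real input before trusting `.map(&:to_h)`.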
|
|
56
83
|
|
|
57
|
-
|
|
58
|
-
|
|
84
|
+
**With SmarterCSV:**
|
|
85
|
+
```ruby
|
|
86
|
+
rows = SmarterCSV.process('data.csv')
|
|
87
|
+
rows.first
|
|
88
|
+
# => { first_name: "Alice", last_name: "Smith", age: 30, column_1: "VIP", column_2: "Gold" }
|
|
59
89
|
```
|
|
60
90
|
|
|
61
|
-
|
|
91
|
+
Clean symbol keys, whitespace stripped, `age` converted to `Integer`, extra columns named — no data loss.
|
|
92
|
+
|
|
93
|
+
No `.map(&:to_h)`, no `header_converters:`, no manual post-processing.
|
|
94
|
+
|
|
95
|
+
---
|
|
96
|
+
|
|
97
|
+
## Sample file used in remaining examples
|
|
98
|
+
|
|
99
|
+
The sections below use a simpler file to keep the focus on the specific behavior being illustrated:
|
|
100
|
+
|
|
101
|
+
```
|
|
102
|
+
$ cat sample.csv
|
|
103
|
+
name,age,city
|
|
104
|
+
Alice,30,New York
|
|
105
|
+
Bob,25,
|
|
106
|
+
Charlie,35,Chicago
|
|
107
|
+
```
|
|
108
|
+
|
|
109
|
+
Bob's `city` field is intentionally empty to illustrate empty-value handling.
|
|
62
110
|
|
|
63
111
|
---
|
|
64
112
|
|
|
65
113
|
## Parsing a CSV string
|
|
66
114
|
|
|
115
|
+
**With Ruby CSV:**
|
|
67
116
|
```ruby
|
|
68
|
-
csv_string = "name,age\nAlice,30\nBob,25\n"
|
|
69
|
-
|
|
70
|
-
|
|
71
|
-
|
|
117
|
+
csv_string = "name,age,city\nAlice,30,New York\nBob,25,\nCharlie,35,Chicago\n"
|
|
118
|
+
|
|
119
|
+
rows = CSV.parse(csv_string, headers: true, header_converters: :symbol).map(&:to_h)
|
|
120
|
+
# => [
|
|
121
|
+
# { name: "Alice", age: "30", city: "New York" },
|
|
122
|
+
# { name: "Bob", age: "25", city: nil },
|
|
123
|
+
# { name: "Charlie", age: "35", city: "Chicago" }
|
|
124
|
+
# ]
|
|
125
|
+
```
|
|
72
126
|
|
|
73
|
-
|
|
127
|
+
**With SmarterCSV:**
|
|
128
|
+
```ruby
|
|
74
129
|
rows = SmarterCSV.parse(csv_string)
|
|
75
|
-
# => [
|
|
130
|
+
# => [
|
|
131
|
+
# { name: "Alice", age: 30, city: "New York" },
|
|
132
|
+
# { name: "Bob", age: 25 },
|
|
133
|
+
# { name: "Charlie", age: 35, city: "Chicago" }
|
|
134
|
+
# ]
|
|
76
135
|
```
|
|
77
136
|
|
|
78
137
|
`SmarterCSV.parse` is a convenience wrapper added in 1.16.0. Under the hood it wraps the
|
|
@@ -82,15 +141,17 @@ string in a `StringIO` — but you don't need to think about that.
|
|
|
82
141
|
|
|
83
142
|
## Row-by-row iteration
|
|
84
143
|
|
|
144
|
+
**With Ruby CSV:**
|
|
85
145
|
```ruby
|
|
86
|
-
|
|
87
|
-
|
|
88
|
-
MyModel.create(row.to_h)
|
|
146
|
+
CSV.foreach('sample.csv', headers: true, header_converters: :symbol) do |row|
|
|
147
|
+
MyModel.create(row.to_h) # row is a CSV::Row — needs .to_h
|
|
89
148
|
end
|
|
149
|
+
```
|
|
90
150
|
|
|
91
|
-
|
|
92
|
-
|
|
93
|
-
|
|
151
|
+
**With SmarterCSV:**
|
|
152
|
+
```ruby
|
|
153
|
+
SmarterCSV.each('sample.csv') do |row|
|
|
154
|
+
MyModel.create(row) # row is already a plain Hash — no .to_h needed
|
|
94
155
|
end
|
|
95
156
|
```
|
|
96
157
|
|
|
@@ -98,53 +159,67 @@ end
|
|
|
98
159
|
`Enumerable` API is available:
|
|
99
160
|
|
|
100
161
|
```ruby
|
|
101
|
-
names
|
|
102
|
-
|
|
103
|
-
|
|
162
|
+
names = SmarterCSV.each('sample.csv').map { |row| row[:name] }
|
|
163
|
+
# => ["Alice", "Bob", "Charlie"]
|
|
164
|
+
|
|
165
|
+
us_rows = SmarterCSV.each('sample.csv').select { |row| row[:city] == 'New York' }
|
|
166
|
+
# => [{ name: "Alice", age: 30, city: "New York" }]
|
|
167
|
+
|
|
168
|
+
first2 = SmarterCSV.each('sample.csv').lazy.first(2)
|
|
169
|
+
# => [{ name: "Alice", age: 30, city: "New York" }, { name: "Bob", age: 25 }]
|
|
104
170
|
```
|
|
105
171
|
|
|
106
172
|
---
|
|
107
173
|
|
|
108
174
|
## Key behavior differences
|
|
109
175
|
|
|
110
|
-
### 1.
|
|
176
|
+
### 1. String keys → Symbol keys
|
|
111
177
|
|
|
112
|
-
|
|
113
|
-
|
|
178
|
+
`CSV.read` returns string keys by default. SmarterCSV returns symbol keys, which are more
|
|
179
|
+
efficient (interned in memory) and idiomatic for Rails and ActiveRecord.
|
|
114
180
|
|
|
181
|
+
**With Ruby CSV:**
|
|
115
182
|
```ruby
|
|
116
|
-
|
|
117
|
-
rows
|
|
118
|
-
rows.first['
|
|
183
|
+
rows = CSV.read('sample.csv', headers: true).map(&:to_h)
|
|
184
|
+
rows.first['name'] # => "Alice"
|
|
185
|
+
rows.first['age'] # => "30"
|
|
186
|
+
```
|
|
119
187
|
|
|
120
|
-
|
|
121
|
-
|
|
122
|
-
rows
|
|
188
|
+
**With SmarterCSV:**
|
|
189
|
+
```ruby
|
|
190
|
+
rows = SmarterCSV.process('sample.csv')
|
|
191
|
+
rows.first[:name] # => "Alice"
|
|
192
|
+
rows.first[:age] # => 30
|
|
123
193
|
|
|
124
|
-
#
|
|
125
|
-
rows = SmarterCSV.process('
|
|
126
|
-
rows.first['name']
|
|
194
|
+
# To match CSV.read string-key behavior:
|
|
195
|
+
rows = SmarterCSV.process('sample.csv', strings_as_keys: true)
|
|
196
|
+
rows.first['name'] # => "Alice"
|
|
127
197
|
```
|
|
128
198
|
|
|
129
199
|
### 2. Numeric conversion is automatic
|
|
130
200
|
|
|
131
|
-
SmarterCSV converts numeric strings to `Integer`
|
|
132
|
-
|
|
201
|
+
`CSV.read` returns everything as strings. SmarterCSV converts numeric strings to `Integer`
|
|
202
|
+
or `Float` automatically — no `converters: :numeric` needed.
|
|
133
203
|
|
|
134
|
-
|
|
135
|
-
|
|
136
|
-
CSV.table('data.csv', converters: :numeric)
|
|
204
|
+
Watch out for columns where leading zeros matter — ZIP codes, phone numbers, account numbers —
|
|
205
|
+
and exclude them:
|
|
137
206
|
|
|
138
|
-
|
|
139
|
-
|
|
207
|
+
**With Ruby CSV:**
|
|
208
|
+
```ruby
|
|
209
|
+
rows = CSV.read('sample.csv', headers: true).map(&:to_h)
|
|
210
|
+
rows.first['age'] # => "30" (String)
|
|
211
|
+
rows.first['age'].class # => String
|
|
140
212
|
```
|
|
141
213
|
|
|
142
|
-
|
|
143
|
-
|
|
144
|
-
To limit conversion to specific columns:
|
|
214
|
+
**With SmarterCSV:**
|
|
145
215
|
```ruby
|
|
146
|
-
SmarterCSV.process('
|
|
147
|
-
|
|
216
|
+
rows = SmarterCSV.process('sample.csv')
|
|
217
|
+
rows.first[:age] # => 30 (Integer)
|
|
218
|
+
rows.first[:age].class # => Integer
|
|
219
|
+
|
|
220
|
+
# Exclude columns where leading zeros matter:
|
|
221
|
+
rows = SmarterCSV.process('sample.csv',
|
|
222
|
+
convert_values_to_numeric: { except: [:zip_code, :phone, :account_number] })
|
|
148
223
|
```
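Why leading zeros matter can be seen without any CSV library. This is a small plain-Ruby sketch; the ZIP code is a made-up sample value, and the conversion shown is ordinary `Integer` parsing, not SmarterCSV's internal converter.

```ruby
zip = "02134"

# Base 10 is given explicitly; Integer("02134") without a base would treat the
# leading zero as an octal prefix.
as_number = Integer(zip, 10)

puts as_number                  # the leading zero is gone
puts as_number.to_s == zip      # round-tripping does not restore it
puts format('%05d', as_number)  # restoring it requires knowing the original width
```

Once a value like this has been converted, the original string cannot be recovered unless the column width is known, which is why such columns are better excluded up front.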
|
|
149
224
|
|
|
150
225
|
### 3. Empty values are removed by default
|
|
@@ -152,18 +227,20 @@ SmarterCSV.process('data.csv', convert_values_to_numeric: { except: [:zip_code]
|
|
|
152
227
|
SmarterCSV drops key/value pairs where the value is `nil` or blank
|
|
153
228
|
(`remove_empty_values: true` is the default). Ruby CSV keeps them as `nil`.
|
|
154
229
|
|
|
230
|
+
**With Ruby CSV:**
|
|
155
231
|
```ruby
|
|
156
|
-
|
|
157
|
-
|
|
158
|
-
|
|
159
|
-
# => {name: "Alice", city: nil, age: 30}
|
|
232
|
+
rows = CSV.read('sample.csv', headers: true, header_converters: :symbol).map(&:to_h)
|
|
233
|
+
rows[1] # => { name: "Bob", age: "25", city: nil }
|
|
234
|
+
```
|
|
160
235
|
|
|
161
|
-
|
|
162
|
-
|
|
236
|
+
**With SmarterCSV:**
|
|
237
|
+
```ruby
|
|
238
|
+
rows = SmarterCSV.process('sample.csv')
|
|
239
|
+
rows[1] # => { name: "Bob", age: 25 } ← empty city removed
|
|
163
240
|
|
|
164
|
-
#
|
|
165
|
-
SmarterCSV.process('
|
|
166
|
-
# => {name: "
|
|
241
|
+
# To keep nil values and match Ruby CSV behavior:
|
|
242
|
+
rows = SmarterCSV.process('sample.csv', remove_empty_values: false)
|
|
243
|
+
rows[1] # => { name: "Bob", age: 25, city: nil }
|
|
167
244
|
```
|
|
168
245
|
|
|
169
246
|
### 4. Plain Hash, not CSV::Row
|
|
@@ -173,18 +250,69 @@ Ruby CSV returns `CSV::Row` objects. SmarterCSV returns plain Ruby `Hash` object
|
|
|
173
250
|
`CSV::Row` wraps a hash with extra methods (`.headers`, `.fields`, `.to_h`, `.to_a`).
|
|
174
251
|
With SmarterCSV you work directly with the hash — no wrapper, no `.to_h` needed.
|
|
175
252
|
|
|
253
|
+
**With Ruby CSV:**
|
|
176
254
|
```ruby
|
|
177
|
-
|
|
178
|
-
row = CSV.table('data.csv').first
|
|
255
|
+
row = CSV.read('sample.csv', headers: true).first
|
|
179
256
|
row.class # => CSV::Row
|
|
180
|
-
row
|
|
181
|
-
row
|
|
257
|
+
row['name'] # => "Alice"
|
|
258
|
+
row['age'] # => "30" (String)
|
|
259
|
+
row.to_h # => { "name" => "Alice", "age" => "30", "city" => "New York" }
|
|
260
|
+
```
|
|
182
261
|
|
|
183
|
-
|
|
184
|
-
|
|
262
|
+
**With SmarterCSV:**
|
|
263
|
+
```ruby
|
|
264
|
+
row = SmarterCSV.process('sample.csv').first
|
|
185
265
|
row.class # => Hash
|
|
186
|
-
row
|
|
187
|
-
row
|
|
266
|
+
row[:name] # => "Alice"
|
|
267
|
+
row[:age] # => 30 (Integer)
|
|
268
|
+
row # => { name: "Alice", age: 30, city: "New York" }
|
|
269
|
+
```
|
|
270
|
+
|
|
271
|
+
---
|
|
272
|
+
|
|
273
|
+
## Renaming headers to match your schema
|
|
274
|
+
|
|
275
|
+
CSV column names rarely match your ActiveRecord attribute names. Use `key_mapping:` to rename
|
|
276
|
+
them in one step — the mapping uses the normalized (downcased, underscored) header name as input:
|
|
277
|
+
|
|
278
|
+
**With SmarterCSV:**
|
|
279
|
+
```ruby
|
|
280
|
+
# CSV headers: "First Name", "Last Name", "E-Mail", "Date of Birth"
|
|
281
|
+
# After normalization: :first_name, :last_name, :e_mail, :date_of_birth
|
|
282
|
+
|
|
283
|
+
rows = SmarterCSV.process('contacts.csv',
|
|
284
|
+
key_mapping: {
|
|
285
|
+
first_name: :given_name,
|
|
286
|
+
last_name: :family_name,
|
|
287
|
+
e_mail: :email,
|
|
288
|
+
date_of_birth: :dob,
|
|
289
|
+
})
|
|
290
|
+
# => [{ given_name: "Alice", family_name: "Smith", email: "alice@example.com", dob: "1990-05-14" }, ...]
|
|
291
|
+
```
|
|
292
|
+
|
|
293
|
+
Map a key to `nil` to drop that column entirely:
|
|
294
|
+
|
|
295
|
+
```ruby
|
|
296
|
+
key_mapping: { internal_id: nil, created_at: nil } # these columns won't appear in results
|
|
297
|
+
```
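The renaming and dropping rules described above can be imitated in plain Ruby. This is only a sketch of the semantics, not SmarterCSV's implementation; the mapping and the row are hypothetical, and it assumes that keys absent from the mapping pass through unchanged.

```ruby
# Hypothetical mapping and row, for illustration only.
mapping = { first_name: :given_name, e_mail: :email, internal_id: nil }
row     = { first_name: 'Alice', e_mail: 'alice@example.com', internal_id: 42, age: 30 }

renamed = row.each_with_object({}) do |(key, value), out|
  next if mapping.key?(key) && mapping[key].nil?  # mapped to nil: drop the column
  out[mapping.fetch(key, key)] = value            # rename if mapped, else keep the key
end

p renamed
```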
|
|
298
|
+
|
|
299
|
+
---
|
|
300
|
+
|
|
301
|
+
## Select only the columns you need
|
|
302
|
+
|
|
303
|
+
Wide CSV files often have dozens of columns your application doesn't need. Use `headers: { only: }`
|
|
304
|
+
to declare upfront which columns to keep — SmarterCSV skips everything else at the parser level,
|
|
305
|
+
so unneeded fields are never allocated:
|
|
306
|
+
|
|
307
|
+
**With SmarterCSV:**
|
|
308
|
+
```ruby
|
|
309
|
+
# CSV has 50 columns — you only need 3
|
|
310
|
+
rows = SmarterCSV.process('contacts.csv',
|
|
311
|
+
headers: { only: [:email, :first_name, :last_name] })
|
|
312
|
+
# => [{ email: "alice@example.com", first_name: "Alice", last_name: "Smith" }, ...]
|
|
313
|
+
|
|
314
|
+
# Or exclude a known noisy column while keeping everything else:
|
|
315
|
+
rows = SmarterCSV.process('export.csv', headers: { except: [:internal_notes] })
|
|
188
316
|
```
|
|
189
317
|
|
|
190
318
|
---
|
|
@@ -195,16 +323,44 @@ Ruby CSV has built-in `:date` and `:date_time` converters. SmarterCSV intentiona
|
|
|
195
323
|
them because date formats are locale-dependent (`12/03/2020` means December 3rd in the US
|
|
196
324
|
but March 12th in Europe). Use a `value_converter` instead:
|
|
197
325
|
|
|
326
|
+
**With Ruby CSV:**
|
|
327
|
+
```ruby
|
|
328
|
+
rows = CSV.read('data.csv', headers: true, converters: :date)
|
|
329
|
+
rows.first['birth_date'] # => #<Date: 1990-05-15> (assumes ISO 8601 format only)
|
|
330
|
+
```
|
|
331
|
+
|
|
332
|
+
**With SmarterCSV:**
|
|
198
333
|
```ruby
|
|
199
334
|
require 'date'
|
|
200
335
|
|
|
201
|
-
|
|
202
|
-
|
|
336
|
+
rows = SmarterCSV.process('data.csv',
|
|
337
|
+
value_converters: {
|
|
338
|
+
birth_date: ->(v) { v ? Date.strptime(v, '%Y-%m-%d') : nil }, # ISO 8601
|
|
339
|
+
# birth_date: ->(v) { v ? Date.strptime(v, '%m/%d/%Y') : nil }, # US format
|
|
340
|
+
# birth_date: ->(v) { v ? Date.strptime(v, '%d.%m.%Y') : nil }, # EU format
|
|
341
|
+
})
|
|
342
|
+
rows.first[:birth_date] # => #<Date: 1990-05-15>
|
|
343
|
+
```
|
|
344
|
+
|
|
345
|
+
See [Value Converters](./value_converters.md) for full details.
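The locale ambiguity mentioned above is easy to reproduce with the standard library. A minimal sketch, using the same sample string as the text; only `Date.strptime` from Ruby's `date` library is involved.

```ruby
require 'date'

raw = '12/03/2020'

us = Date.strptime(raw, '%m/%d/%Y')  # month first
eu = Date.strptime(raw, '%d/%m/%Y')  # day first

puts us.iso8601
puts eu.iso8601
```

Both calls succeed without any error, so the format has to be chosen by whoever knows where the file came from.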
|
|
346
|
+
|
|
347
|
+
---
|
|
348
|
+
|
|
349
|
+
## Custom value converters
|
|
203
350
|
|
|
204
|
-
SmarterCSV
|
|
351
|
+
SmarterCSV lets you apply any transformation per column — prices, booleans, custom types:
|
|
352
|
+
|
|
353
|
+
**With SmarterCSV:**
|
|
354
|
+
```ruby
|
|
355
|
+
rows = SmarterCSV.process('records.csv',
|
|
356
|
+
value_converters: {
|
|
357
|
+
birth_date: ->(v) { v ? Date.strptime(v, '%m/%d/%Y') : nil },
|
|
358
|
+
price: ->(v) { v&.delete('$,')&.to_f },
|
|
359
|
+
active: ->(v) { v&.match?(/\Atrue\z/i) },
|
|
360
|
+
})
|
|
205
361
|
```
|
|
206
362
|
|
|
207
|
-
See [Value Converters](./value_converters.md) for full details
|
|
363
|
+
See [Value Converters](./value_converters.md) for full details.
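Because a converter is just a callable, it can be exercised on its own before it is wired into an import. The sketch below reuses the `price` and `active` lambdas from the example above and calls them on made-up sample strings.

```ruby
# The same lambdas as in the example above, called directly on sample strings.
price  = ->(v) { v&.delete('$,')&.to_f }
active = ->(v) { v&.match?(/\Atrue\z/i) }

p price.call('$1,234.50')
p price.call(nil)
p active.call('TRUE')
p active.call('no')
p active.call(nil)
```

Note that `active` returns `nil`, not `false`, for a missing value; whether that distinction matters depends on the column it feeds.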
|
|
208
364
|
|
|
209
365
|
---
|
|
210
366
|
|
|
@@ -213,50 +369,72 @@ See [Value Converters](./value_converters.md) for full details and examples for
|
|
|
213
369
|
Ruby CSV leaves these as strings. SmarterCSV lets you nil-ify them (and optionally remove
|
|
214
370
|
the key) in a single option:
|
|
215
371
|
|
|
372
|
+
**With SmarterCSV:**
|
|
216
373
|
```ruby
|
|
217
|
-
# Remove
|
|
218
|
-
SmarterCSV.process('data.csv', nil_values_matching: /\A(NULL|NaN|#VALUE!)\z/)
|
|
374
|
+
# Remove keys where value matches (remove_empty_values: true is the default)
|
|
375
|
+
rows = SmarterCSV.process('data.csv', nil_values_matching: /\A(NULL|N\/A|NaN|#VALUE!)\z/i)
|
|
376
|
+
# fields matching the pattern are removed entirely
|
|
219
377
|
|
|
220
|
-
# Keep the key but set the value to nil
|
|
221
|
-
SmarterCSV.process('data.csv',
|
|
378
|
+
# Keep the key but set the value to nil:
|
|
379
|
+
rows = SmarterCSV.process('data.csv',
|
|
222
380
|
nil_values_matching: /\ANULL\z/,
|
|
223
381
|
remove_empty_values: false,
|
|
224
382
|
)
|
|
383
|
+
# => [{ name: "Alice", score: nil, ... }]
|
|
225
384
|
```
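The anchors in the pattern matter: `\A` and `\z` make it match whole values only. The following plain-Ruby sketch imitates the effect on a made-up row; it is not SmarterCSV's implementation, only a way to try a pattern before using it.

```ruby
pattern = /\A(NULL|N\/A|NaN|#VALUE!)\z/i

# Hypothetical row for illustration.
row = { name: 'Alice', score: 'null', grade: 'N/A', note: 'NULLABLE field' }

# Plain-Ruby imitation of the effect: drop pairs whose value matches in full.
cleaned = row.reject { |_key, value| value.to_s.match?(pattern) }

p cleaned
```

The `note` value survives because it merely contains `NULL`; without the anchors it would have been removed as well.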
|
|
226
385
|
|
|
227
386
|
---
|
|
228
387
|
|
|
229
388
|
## Malformed / bad rows
|
|
230
389
|
|
|
231
|
-
Ruby CSV
|
|
232
|
-
SmarterCSV gives you explicit control:
|
|
233
|
-
|
|
390
|
+
**With Ruby CSV:**
|
|
234
391
|
```ruby
|
|
235
|
-
#
|
|
236
|
-
CSV.read('data.csv', liberal_parsing: true)
|
|
392
|
+
# Silent ignore — errors are swallowed
|
|
393
|
+
rows = CSV.read('data.csv', liberal_parsing: true)
|
|
394
|
+
```
|
|
237
395
|
|
|
238
|
-
|
|
396
|
+
**With SmarterCSV:**
|
|
397
|
+
```ruby
|
|
398
|
+
# Collect bad rows so you can inspect, log, or quarantine them
|
|
239
399
|
reader = SmarterCSV::Reader.new('data.csv', on_bad_row: :collect)
|
|
240
400
|
good_rows = reader.process
|
|
241
|
-
bad_rows = reader.errors[:bad_rows]
|
|
401
|
+
bad_rows = reader.errors[:bad_rows]
|
|
402
|
+
|
|
403
|
+
puts "#{good_rows.size} imported, #{bad_rows.size} bad rows"
|
|
404
|
+
bad_rows.each { |r| puts "Line #{r[:file_line_number]}: #{r[:error_message]}" }
|
|
242
405
|
```
|
|
243
406
|
|
|
244
407
|
See [Bad Row Quarantine](./bad_row_quarantine.md) for full details.
|
|
245
408
|
|
|
246
409
|
---
|
|
247
410
|
|
|
411
|
+
## Batch processing for large files
|
|
412
|
+
|
|
413
|
+
**With SmarterCSV:**
|
|
414
|
+
```ruby
|
|
415
|
+
SmarterCSV.process('big.csv', chunk_size: 500) do |chunk|
|
|
416
|
+
MyModel.insert_all(chunk) # bulk insert 500 rows at a time
|
|
417
|
+
end
|
|
418
|
+
```
|
|
419
|
+
|
|
420
|
+
---
|
|
421
|
+
|
|
248
422
|
## Writing CSV
|
|
249
423
|
|
|
424
|
+
**With Ruby CSV:**
|
|
250
425
|
```ruby
|
|
251
|
-
|
|
252
|
-
CSV.open('out.csv', 'w', write_headers: true, headers: ['name','age']) do |csv|
|
|
426
|
+
CSV.open('out.csv', 'w', write_headers: true, headers: ['name', 'age']) do |csv|
|
|
253
427
|
csv << ['Alice', 30]
|
|
428
|
+
csv << ['Bob', 25]
|
|
254
429
|
end
|
|
430
|
+
```
|
|
255
431
|
|
|
256
|
-
|
|
432
|
+
**With SmarterCSV:**
|
|
433
|
+
```ruby
|
|
434
|
+
# Takes hashes, discovers headers automatically
|
|
257
435
|
SmarterCSV.generate('out.csv') do |csv|
|
|
258
|
-
csv << {name: 'Alice', age: 30}
|
|
259
|
-
csv << {name: 'Bob', age: 25}
|
|
436
|
+
csv << { name: 'Alice', age: 30 }
|
|
437
|
+
csv << { name: 'Bob', age: 25 }
|
|
260
438
|
end
|
|
261
439
|
```
|
|
262
440
|
|
|
@@ -270,21 +448,118 @@ send_data io.string, type: 'text/csv'
|
|
|
270
448
|
|
|
271
449
|
---
|
|
272
450
|
|
|
451
|
+
## Advanced patterns
|
|
452
|
+
|
|
453
|
+
### Rails file upload
|
|
454
|
+
|
|
455
|
+
Accepting a CSV upload in a Rails controller — pass the tempfile path directly:
|
|
456
|
+
|
|
457
|
+
```ruby
|
|
458
|
+
def create
|
|
459
|
+
file = params[:file] # ActionDispatch::Http::UploadedFile
|
|
460
|
+
|
|
461
|
+
SmarterCSV.process(file.path, chunk_size: 500) do |chunk|
|
|
462
|
+
MyModel.insert_all(chunk)
|
|
463
|
+
end
|
|
464
|
+
|
|
465
|
+
redirect_to root_path, notice: "Import complete"
|
|
466
|
+
end
|
|
467
|
+
```
|
|
468
|
+
|
|
469
|
+
### Parallel processing with Sidekiq
|
|
470
|
+
|
|
471
|
+
```ruby
|
|
472
|
+
SmarterCSV.process('users.csv', chunk_size: 100) do |chunk, chunk_index|
|
|
473
|
+
puts "Queueing chunk #{chunk_index} (#{chunk.size} records)..."
|
|
474
|
+
Sidekiq::Client.push_bulk(
|
|
475
|
+
'class' => UserImportWorker,
|
|
476
|
+
'args' => chunk.map { |row| [row] }, # push_bulk expects one argument array per job
|
|
477
|
+
)
|
|
478
|
+
end
|
|
479
|
+
```
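`Sidekiq::Client.push_bulk` takes `'args'` as an array of argument arrays, one inner array per job, so a chunk of hashes generally needs to be wrapped before it is pushed. The sketch below shows only the payload shape in plain Ruby (no Sidekiq required); the sample rows are made up, and converting keys to strings is a suggestion for JSON-friendly arguments rather than something the original example does.

```ruby
# Hypothetical chunk, shaped like the rows a chunked import yields.
chunk = [
  { name: 'Alice', age: 30 },
  { name: 'Bob',   age: 25 },
]

# One inner array per job; string keys keep the payload JSON-friendly.
job_args = chunk.map { |row| [row.transform_keys(&:to_s)] }

p job_args
```

Each worker would then receive a single hash argument with string keys.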
|
|
480
|
+
|
|
481
|
+
### Streaming directly from S3
|
|
482
|
+
|
|
483
|
+
SmarterCSV accepts any IO-like object — stream a CSV directly from S3 without writing a temp file:
|
|
484
|
+
|
|
485
|
+
```ruby
|
|
486
|
+
require 'aws-sdk-s3'
|
|
487
|
+
|
|
488
|
+
s3 = Aws::S3::Client.new(region: 'us-east-1')
|
|
489
|
+
obj = s3.get_object(bucket: 'my-bucket', key: 'imports/contacts.csv')
|
|
490
|
+
|
|
491
|
+
SmarterCSV::Reader.new(obj.body, chunk_size: 500).each_chunk do |chunk, _index|
|
|
492
|
+
MyModel.insert_all(chunk)
|
|
493
|
+
end
|
|
494
|
+
```
|
|
495
|
+
|
|
496
|
+
### Production instrumentation
|
|
497
|
+
|
|
498
|
+
```ruby
|
|
499
|
+
SmarterCSV.process('large_import.csv',
|
|
500
|
+
chunk_size: 1_000,
|
|
501
|
+
on_start: ->(info) { Rails.logger.info "Import started: #{info[:input]} (#{info[:file_size]} bytes)" },
|
|
502
|
+
on_chunk: ->(info) { Rails.logger.debug "Chunk #{info[:chunk_number]}: #{info[:rows_in_chunk]} rows (#{info[:total_rows_so_far]} total)" },
|
|
503
|
+
on_complete: ->(stats) {
|
|
504
|
+
Rails.logger.info "Done: #{stats[:total_rows]} rows in #{stats[:duration].round(2)}s, #{stats[:bad_rows]} bad rows"
|
|
505
|
+
StatsD.histogram('csv.import.duration', stats[:duration])
|
|
506
|
+
},
|
|
507
|
+
) { |chunk| MyModel.insert_all(chunk) }
|
|
508
|
+
```
|
|
509
|
+
|
|
510
|
+
See [Instrumentation Hooks](./instrumentation.md) for full details.
|
|
511
|
+
|
|
512
|
+
### Resumable imports with Rails ActiveJob
|
|
513
|
+
|
|
514
|
+
Rails 8.1 introduced `ActiveJob::Continuable` — jobs that pause on deployment and resume exactly
|
|
515
|
+
where they stopped. SmarterCSV's `chunk_index` maps directly onto the job cursor:
|
|
516
|
+
|
|
517
|
+
```ruby
|
|
518
|
+
class ImportCsvJob < ApplicationJob
|
|
519
|
+
include ActiveJob::Continuable
|
|
520
|
+
|
|
521
|
+
def perform(file_path)
|
|
522
|
+
step :import_rows do |step|
|
|
523
|
+
SmarterCSV.process(file_path, chunk_size: 500) do |chunk, chunk_index|
|
|
524
|
+
next if chunk_index < step.cursor.to_i # skip already-processed chunks on resume
|
|
525
|
+
|
|
526
|
+
MyModel.insert_all(chunk)
|
|
527
|
+
step.set! chunk_index + 1
|
|
528
|
+
end
|
|
529
|
+
end
|
|
530
|
+
end
|
|
531
|
+
end
|
|
532
|
+
```
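The skip-and-advance logic of the job above can be simulated in plain Ruby. This is a sketch under stated assumptions: `chunks` stands in for the chunks of a CSV file, and `cursor` is an ordinary local variable standing in for the job cursor, with a starting value of 1 to imitate a job that was interrupted after its first chunk.

```ruby
chunks = [%w[a b], %w[c d], %w[e f]]  # stand-in for the chunks of a CSV file
cursor = 1                            # hypothetical: chunk 0 was done before the interruption

processed = []
chunks.each_with_index do |chunk, chunk_index|
  next if chunk_index < cursor  # already handled before the interruption

  processed.concat(chunk)       # stands in for the bulk insert
  cursor = chunk_index + 1      # record progress after each chunk
end

p processed
p cursor
```

As in the job above, the input is still read from the beginning on resume; the cursor only prevents chunks from being inserted twice.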
|
|
533
|
+
|
|
534
|
+
### Bulk upsert — insert or update
|
|
535
|
+
|
|
536
|
+
```ruby
|
|
537
|
+
SmarterCSV.process('contacts.csv',
|
|
538
|
+
chunk_size: 500,
|
|
539
|
+
key_mapping: { e_mail: :email },
|
|
540
|
+
) do |chunk|
|
|
541
|
+
Contact.upsert_all(chunk, unique_by: :email)
|
|
542
|
+
end
|
|
543
|
+
```
|
|
544
|
+
|
|
545
|
+
---
|
|
546
|
+
|
|
273
547
|
## Quick reference
|
|
274
548
|
|
|
275
549
|
| Ruby CSV | SmarterCSV equivalent | Notes |
|
|
276
550
|
|---|---|---|
|
|
277
|
-
| `CSV.
|
|
278
|
-
| `CSV.read(f, headers: true)` | `SmarterCSV.process(f
|
|
551
|
+
| `CSV.read(f, headers: true).map(&:to_h)` | `SmarterCSV.process(f)` | Symbol keys, numeric conversion, whitespace stripped. |
|
|
552
|
+
| `CSV.read(f, headers: true, header_converters: :symbol).map(&:to_h)` | `SmarterCSV.process(f)` | Drop-in. |
|
|
553
|
+
| `CSV.table(f).map(&:to_h)` | `SmarterCSV.process(f)` | Drop-in. |
|
|
279
554
|
| `CSV.parse(str, headers: true, header_converters: :symbol)` | `SmarterCSV.parse(str)` | Direct string parsing. |
|
|
280
555
|
| `CSV.foreach(f, headers: true) { \|r\| }` | `SmarterCSV.each(f) { \|r\| }` | Row is already a plain Hash. |
|
|
281
556
|
| `converters: :numeric` | default | Automatic in SmarterCSV. |
|
|
282
|
-
| `converters: :date` | `value_converters: {col:
|
|
283
|
-
| `liberal_parsing: true` | `on_bad_row: :collect` | Explicit quarantine
|
|
557
|
+
| `converters: :date` | `value_converters: {col: ->(v) { ... } }` | Use explicit format strings — date formats are locale-dependent. |
|
|
558
|
+
| `liberal_parsing: true` | `on_bad_row: :collect` | Explicit quarantine gives you visibility. |
|
|
284
559
|
| `skip_blanks: true` | `remove_empty_hashes: true` | Default in SmarterCSV. |
|
|
285
560
|
| `row.to_h` | `row` | Already a plain Hash — no conversion needed. |
|
|
286
561
|
| `row.headers` | `reader.headers` | Available on the `Reader` instance. |
|
|
287
562
|
|
|
288
563
|
---
|
|
289
|
-
PREVIOUS: [Introduction](./_introduction.md) | NEXT: [
|
|
564
|
+
PREVIOUS: [Introduction](./_introduction.md) | NEXT: [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md) | UP: [README](../README.md)
|
|
290
565
|
|
data/docs/options.md
CHANGED
|
@@ -3,6 +3,7 @@
|
|
|
3
3
|
|
|
4
4
|
* [Introduction](./_introduction.md)
|
|
5
5
|
* [Migrating from Ruby CSV](./migrating_from_csv.md)
|
|
6
|
+
* [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md)
|
|
6
7
|
* [Parsing Strategy](./parsing_strategy.md)
|
|
7
8
|
* [The Basic Read API](./basic_read_api.md)
|
|
8
9
|
* [The Basic Write API](./basic_write_api.md)
|
|
@@ -118,7 +119,7 @@ See [Parsing Strategy](./parsing_strategy.md) for full details on quote handling
|
|
|
118
119
|
|--------|---------|-------------|
|
|
119
120
|
| `:strip_whitespace` | `true` | Remove whitespace before/after values and headers. |
|
|
120
121
|
| `:convert_values_to_numeric` | `true` | Convert strings containing integers or floats to the appropriate numeric type. Accepts `{except: [:key1, :key2]}` or `{only: :key3}` to limit which columns. |
|
|
121
|
-
| `:value_converters` | `nil` | Hash of `:header =>
|
|
122
|
+
| `:value_converters` | `nil` | Hash of `:header => converter`; converter can be a lambda/Proc or a class implementing `self.convert(value)`. See [Value Converters](./value_converters.md). |
|
|
122
123
|
| `:remove_empty_values` | `true` | Remove key/value pairs where the value is `nil` or an empty string. |
|
|
123
124
|
| `:remove_zero_values` | `false` | Remove key/value pairs where the numeric value equals zero. |
|
|
124
125
|
| `:nil_values_matching` | `nil` | Set matching values to `nil`. Accepts a regular expression matched against the string representation of each value (e.g. `/\ANAN\z/` for NaN, `/\A#VALUE!\z/` for Excel errors). With `remove_empty_values: true` (default), nil-ified values are then removed. With `remove_empty_values: false`, the key is retained with a `nil` value. |
|
data/docs/parsing_strategy.md
CHANGED
|
@@ -3,6 +3,7 @@
|
|
|
3
3
|
|
|
4
4
|
* [Introduction](./_introduction.md)
|
|
5
5
|
* [Migrating from Ruby CSV](./migrating_from_csv.md)
|
|
6
|
+
* [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md)
|
|
6
7
|
* [**Parsing Strategy**](./parsing_strategy.md)
|
|
7
8
|
* [The Basic Read API](./basic_read_api.md)
|
|
8
9
|
* [The Basic Write API](./basic_write_api.md)
|
|
@@ -158,4 +159,4 @@ Both options apply simultaneously. `quote_boundary` governs *where* a quote is r
|
|
|
158
159
|
|
|
159
160
|
--------------
|
|
160
161
|
|
|
161
|
-
PREVIOUS: [
|
|
162
|
+
PREVIOUS: [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md) | NEXT: [The Basic Read API](./basic_read_api.md) | UP: [README](../README.md)
|
data/docs/real_world_csv.md
CHANGED
|
@@ -3,6 +3,7 @@
|
|
|
3
3
|
|
|
4
4
|
* [Introduction](./_introduction.md)
|
|
5
5
|
* [Migrating from Ruby CSV](./migrating_from_csv.md)
|
|
6
|
+
* [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md)
|
|
6
7
|
* [Parsing Strategy](./parsing_strategy.md)
|
|
7
8
|
* [The Basic Read API](./basic_read_api.md)
|
|
8
9
|
* [The Basic Write API](./basic_write_api.md)
|
|
@@ -3,6 +3,7 @@
|
|
|
3
3
|
|
|
4
4
|
* [Introduction](../../_introduction.md)
|
|
5
5
|
* [Migrating from Ruby CSV](../../migrating_from_csv.md)
|
|
6
|
+
* [Ruby CSV Pitfalls](../../ruby_csv_pitfalls.md)
|
|
6
7
|
* [Parsing Strategy](../../parsing_strategy.md)
|
|
7
8
|
* [The Basic Read API](../../basic_read_api.md)
|
|
8
9
|
* [The Basic Write API](../../basic_write_api.md)
|
|
@@ -194,8 +195,6 @@ See [performance_notes.md](performance_notes.md) and [benchmarks.md](benchmarks.
|
|
|
194
195
|
|
|
195
196
|
**Deprecations:**
|
|
196
197
|
|
|
197
|
-
- `only_headers:` → use `headers: { only: }`
|
|
198
|
-
- `except_headers:` → use `headers: { except: }`
|
|
199
198
|
- `remove_values_matching:` → use `nil_values_matching:`
|
|
200
199
|
- `strict: true` → use `missing_headers: :raise`
|
|
201
200
|
- `strict: false` → use `missing_headers: :auto`
|
data/docs/row_col_sep.md
CHANGED
|
@@ -3,6 +3,7 @@
|
|
|
3
3
|
|
|
4
4
|
* [Introduction](./_introduction.md)
|
|
5
5
|
* [Migrating from Ruby CSV](./migrating_from_csv.md)
|
|
6
|
+
* [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md)
|
|
6
7
|
* [Parsing Strategy](./parsing_strategy.md)
|
|
7
8
|
* [The Basic Read API](./basic_read_api.md)
|
|
8
9
|
* [The Basic Write API](./basic_write_api.md)
|