smarter_csv 1.15.2 → 1.16.0
- checksums.yaml +4 -4
- data/.rubocop.yml +9 -0
- data/CHANGELOG.md +68 -1
- data/CONTRIBUTORS.md +3 -1
- data/Gemfile +1 -0
- data/README.md +123 -27
- data/docs/_introduction.md +40 -24
- data/docs/bad_row_quarantine.md +285 -0
- data/docs/basic_read_api.md +151 -9
- data/docs/basic_write_api.md +474 -59
- data/docs/batch_processing.md +161 -4
- data/docs/column_selection.md +183 -0
- data/docs/data_transformations.md +162 -29
- data/docs/examples.md +339 -46
- data/docs/header_transformations.md +93 -12
- data/docs/header_validations.md +56 -18
- data/docs/history.md +117 -0
- data/docs/instrumentation.md +165 -0
- data/docs/migrating_from_csv.md +290 -0
- data/docs/options.md +150 -87
- data/docs/parsing_strategy.md +63 -1
- data/docs/real_world_csv.md +262 -0
- data/docs/releases/1.16.0/benchmarks.md +223 -0
- data/docs/releases/1.16.0/changes.md +272 -0
- data/docs/releases/1.16.0/performance_notes.md +114 -0
- data/docs/row_col_sep.md +14 -5
- data/docs/value_converters.md +193 -57
- data/ext/smarter_csv/extconf.rb +3 -0
- data/ext/smarter_csv/smarter_csv.c +1007 -71
- data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png +0 -0
- data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.svg +108 -0
- data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.png +0 -0
- data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg +141 -0
- data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.png +0 -0
- data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.svg +139 -0
- data/lib/smarter_csv/errors.rb +8 -0
- data/lib/smarter_csv/file_io.rb +1 -1
- data/lib/smarter_csv/hash_transformations.rb +14 -13
- data/lib/smarter_csv/header_transformations.rb +21 -2
- data/lib/smarter_csv/headers.rb +2 -1
- data/lib/smarter_csv/options.rb +124 -7
- data/lib/smarter_csv/parser.rb +362 -75
- data/lib/smarter_csv/reader.rb +494 -46
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv/writer.rb +71 -19
- data/lib/smarter_csv.rb +95 -12
- data/smarter_csv.gemspec +20 -10
- metadata +37 -80
data/docs/examples.md
CHANGED
|
@@ -2,6 +2,7 @@
|
|
|
2
2
|
### Contents
|
|
3
3
|
|
|
4
4
|
* [Introduction](./_introduction.md)
|
|
5
|
+
* [Migrating from Ruby CSV](./migrating_from_csv.md)
|
|
5
6
|
* [Parsing Strategy](./parsing_strategy.md)
|
|
6
7
|
* [The Basic Read API](./basic_read_api.md)
|
|
7
8
|
* [The Basic Write API](./basic_write_api.md)
|
|
@@ -10,70 +11,362 @@
|
|
|
10
11
|
* [Row and Column Separators](./row_col_sep.md)
|
|
11
12
|
* [Header Transformations](./header_transformations.md)
|
|
12
13
|
* [Header Validations](./header_validations.md)
|
|
14
|
+
* [Column Selection](./column_selection.md)
|
|
13
15
|
* [Data Transformations](./data_transformations.md)
|
|
14
16
|
* [Value Converters](./value_converters.md)
|
|
15
|
-
|
|
16
|
-
|
|
17
|
+
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
18
|
+
* [Instrumentation Hooks](./instrumentation.md)
|
|
19
|
+
* [**Examples**](./examples.md)
|
|
20
|
+
* [Real-World CSV Files](./real_world_csv.md)
|
|
21
|
+
* [SmarterCSV over the Years](./history.md)
|
|
22
|
+
* [Release Notes](./releases/1.16.0/changes.md)
|
|
23
|
+
|
|
24
|
+
--------------
|
|
17
25
|
|
|
18
26
|
# Examples
|
|
19
27
|
|
|
20
|
-
|
|
28
|
+
**Rescue from `SmarterCSV::Error` (recommended):** SmarterCSV auto-detects row and column separators. In rare cases detection fails and raises an exception (e.g. `NoColSepDetected`). Rescuing from `SmarterCSV::Error` ensures your application handles unexpected CSV formats gracefully.
|
|
29
|
+
|
|
30
|
+
---
|
|
21
31
|
|
|
22
|
-
|
|
32
|
+
1. [CSV → Array of Hashes](#example-1-csv--array-of-hashes)
|
|
33
|
+
2. [Parsing a CSV String](#example-2-parsing-a-csv-string)
|
|
34
|
+
3. [Key Mapping and Column Selection](#example-3-key-mapping-and-column-selection)
|
|
35
|
+
4. [Encoding and Preamble Skip](#example-4-encoding-and-preamble-skip)
|
|
36
|
+
5. [Value Converters](#example-5-value-converters)
|
|
37
|
+
6. [Header Validation](#example-6-header-validation)
|
|
38
|
+
7. [Bad Row Handling](#example-7-bad-row-handling)
|
|
39
|
+
8. [Writing CSV](#example-8-writing-csv)
|
|
40
|
+
9. [Using `each` and `each_chunk` Enumerators](#example-9-using-each-and-each_chunk-enumerators)
|
|
41
|
+
10. [Importing into a Database](#example-10-importing-into-a-database)
|
|
42
|
+
11. [Batch Processing with Sidekiq](#example-11-batch-processing-with-sidekiq)
|
|
43
|
+
12. [Resumable CSV Import with Rails ActiveJob](#example-12-resumable-csv-import-with-rails-activejob-rails-81)
|
|
44
|
+
13. [Instrumentation](#example-13-instrumentation)
|
|
23
45
|
|
|
24
|
-
|
|
46
|
+
---
|
|
25
47
|
|
|
26
|
-
|
|
48
|
+
## Example 1: CSV → Array of Hashes
|
|
27
49
|
|
|
28
|
-
|
|
29
|
-
Please note how each hash contains only the keys for columns with non-null values.
|
|
50
|
+
Each hash only contains keys for columns with non-nil, non-empty values — columns with blank entries are omitted automatically:
|
|
30
51
|
|
|
31
52
|
```ruby
|
|
32
|
-
|
|
33
|
-
|
|
34
|
-
|
|
35
|
-
|
|
36
|
-
|
|
37
|
-
|
|
38
|
-
|
|
39
|
-
|
|
40
|
-
|
|
41
|
-
|
|
42
|
-
|
|
43
|
-
|
|
44
|
-
|
|
45
|
-
|
|
46
|
-
|
|
53
|
+
$ cat pets.csv
|
|
54
|
+
first name,last name,dogs,cats,birds,fish
|
|
55
|
+
Dan,McAllister,2,,,
|
|
56
|
+
Lucy,Laweless,,5,,
|
|
57
|
+
Miles,O'Brian,,,,21
|
|
58
|
+
Nancy,Homes,2,,1,
|
|
59
|
+
|
|
60
|
+
$ irb
|
|
61
|
+
> require 'smarter_csv'
|
|
62
|
+
> pets_by_owner = SmarterCSV.process('pets.csv')
|
|
63
|
+
=> [ {first_name: "Dan", last_name: "McAllister", dogs: 2},
|
|
64
|
+
{first_name: "Lucy", last_name: "Laweless", cats: 5},
|
|
65
|
+
{first_name: "Miles", last_name: "O'Brian", fish: 21},
|
|
66
|
+
{first_name: "Nancy", last_name: "Homes", dogs: 2, birds: 1}
|
|
67
|
+
]
|
|
47
68
|
```
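To make the "blank columns are omitted" behaviour concrete, here is a minimal plain-Ruby sketch of the idea. It is not SmarterCSV's implementation: the `header` and `fields` arrays are hand-written stand-ins for one parsed line of `pets.csv`, and numeric conversion is left out.

```ruby
# Sketch only: pair each header with its field, then drop blank fields.
header = %i[first_name last_name dogs cats birds fish]
fields = ['Dan', 'McAllister', '2', '', '', '']   # the "Dan" line, split on commas

row = header.zip(fields)
            .reject { |_key, value| value.nil? || value.strip.empty? }
            .to_h

p row
# => {first_name: "Dan", last_name: "McAllister", dogs: "2"}  (printed in this form on Ruby 3.4+)
```

Because empty fields never become keys, `row.key?(:cats)` is `false` for this owner, so code that reads optional columns should use `row[:cats]` or `row.fetch(:cats, 0)` rather than assuming the key exists.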
|
|
48
69
|
|
|
70
|
+
---
|
|
71
|
+
|
|
72
|
+
## Example 2: Parsing a CSV String
|
|
73
|
+
|
|
74
|
+
Use `SmarterCSV.parse` to parse a CSV string directly — no file needed. Useful in tests, API responses, or when the CSV arrives as a string in memory:
|
|
49
75
|
|
|
50
|
-
## Example 3: Populate a MySQL or MongoDB Database with SmarterCSV:
|
|
51
76
|
```ruby
|
|
52
|
-
|
|
53
|
-
|
|
54
|
-
|
|
55
|
-
|
|
56
|
-
|
|
57
|
-
# when chunking is not enabled, there is only one hash in each array
|
|
58
|
-
MyModel.create( array.first )
|
|
59
|
-
end
|
|
77
|
+
csv_string = <<~CSV
|
|
78
|
+
name,age,city
|
|
79
|
+
Alice,30,New York
|
|
80
|
+
Bob,25,Chicago
|
|
81
|
+
CSV
|
|
60
82
|
|
|
61
|
-
|
|
83
|
+
data = SmarterCSV.parse(csv_string)
|
|
84
|
+
# => [{name: "Alice", age: 30, city: "New York"}, {name: "Bob", age: 25, city: "Chicago"}]
|
|
62
85
|
```
|
|
63
86
|
|
|
64
|
-
|
|
65
|
-
|
|
87
|
+
See [The Basic Read API](./basic_read_api.md) and [Migrating from Ruby CSV](./migrating_from_csv.md).
|
|
88
|
+
|
|
89
|
+
---
|
|
90
|
+
|
|
91
|
+
## Example 3: Key Mapping and Column Selection
|
|
92
|
+
|
|
93
|
+
Rename headers and drop unwanted columns in one pass:
|
|
94
|
+
|
|
95
|
+
```ruby
|
|
96
|
+
options = {
|
|
97
|
+
key_mapping: {
|
|
98
|
+
first_name: :fname,
|
|
99
|
+
last_name: :lname,
|
|
100
|
+
dob: :birth_date,
|
|
101
|
+
ssn: nil, # drop this column entirely
|
|
102
|
+
},
|
|
103
|
+
}
|
|
104
|
+
data = SmarterCSV.process('people.csv', options)
|
|
105
|
+
# => [{fname: "Alice", lname: "Smith", birth_date: "1990-05-14"}, ...]
|
|
106
|
+
# ↑ :ssn is gone; original CSV headers remapped to your domain names
|
|
107
|
+
```
|
|
108
|
+
|
|
109
|
+
Keep only specific columns using `headers: { only: }`:
|
|
110
|
+
|
|
111
|
+
```ruby
|
|
112
|
+
data = SmarterCSV.process('people.csv', headers: { only: [:name, :email] })
|
|
113
|
+
# => [{name: "Alice", email: "alice@example.com"}, ...]
|
|
114
|
+
```
|
|
115
|
+
|
|
116
|
+
See [Header Transformations](./header_transformations.md) and [Column Selection](./column_selection.md).
|
|
117
|
+
|
|
118
|
+
---
|
|
119
|
+
|
|
120
|
+
## Example 4: Encoding and Preamble Skip
|
|
121
|
+
|
|
122
|
+
Handle non-UTF-8 files and metadata rows before the header:
|
|
123
|
+
|
|
124
|
+
```ruby
|
|
125
|
+
# Bank statement export: Windows-1252, 3 preamble rows, then header
|
|
126
|
+
data = SmarterCSV.process('statement.csv',
|
|
127
|
+
file_encoding: 'windows-1252',
|
|
128
|
+
skip_lines: 3)
|
|
129
|
+
|
|
130
|
+
# European lab instrument export: semicolon-separated, Latin-1
|
|
131
|
+
data = SmarterCSV.process('results.csv',
|
|
132
|
+
file_encoding: 'iso-8859-1',
|
|
133
|
+
col_sep: :auto) # :auto detects the semicolon
|
|
134
|
+
```
|
|
135
|
+
|
|
136
|
+
See [Row and Column Separators](./row_col_sep.md) and [Real-World CSV Files](./real_world_csv.md).
|
|
137
|
+
|
|
138
|
+
---
|
|
139
|
+
|
|
140
|
+
## Example 5: Value Converters
|
|
141
|
+
|
|
142
|
+
Transform raw strings into typed values — dates, booleans, currency:
|
|
143
|
+
|
|
66
144
|
```ruby
|
|
67
|
-
|
|
68
|
-
|
|
69
|
-
|
|
70
|
-
|
|
71
|
-
|
|
72
|
-
|
|
73
|
-
|
|
74
|
-
|
|
75
|
-
|
|
76
|
-
|
|
145
|
+
require 'date'
|
|
146
|
+
|
|
147
|
+
data = SmarterCSV.process('records.csv',
|
|
148
|
+
value_converters: {
|
|
149
|
+
# Parse US date format
|
|
150
|
+
dob: ->(v) { v ? Date.strptime(v, '%m/%d/%Y') : nil },
|
|
151
|
+
|
|
152
|
+
# Strip currency symbol and convert to Float
|
|
153
|
+
price: ->(v) { v&.delete('$,')&.to_f },
|
|
154
|
+
|
|
155
|
+
# Boolean: true only for "true" (any letter case), false for other strings
|
|
156
|
+
active: ->(v) { v&.match?(/\Atrue\z/i) },
|
|
157
|
+
})
|
|
158
|
+
|
|
159
|
+
data.first[:dob] # => #<Date: 1990-05-14>
|
|
160
|
+
data.first[:price] # => 44.5
|
|
161
|
+
data.first[:active] # => true
|
|
162
|
+
```
|
|
163
|
+
|
|
164
|
+
Combining with `nil_values_matching` to clean sentinel values before conversion:
|
|
165
|
+
|
|
166
|
+
```ruby
|
|
167
|
+
data = SmarterCSV.process('export.csv',
|
|
168
|
+
nil_values_matching: /\A(N\/A|NULL|#N\/A)\z/i,
|
|
169
|
+
value_converters: {
|
|
170
|
+
score: ->(v) { v&.to_f }, # v is nil for N/A rows — guard with &.
|
|
171
|
+
})
|
|
172
|
+
```
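The converter lambdas above are ordinary Ruby callables, so they can be exercised without a CSV file. The sketch below assumes each converter receives the raw field value (or `nil`) for its column; the `row` hash and the identity fallback for unlisted keys are illustrative, not part of the SmarterCSV API.

```ruby
require 'date'

# Same lambdas as in the example, called by hand on a sample row.
converters = {
  dob:    ->(v) { v ? Date.strptime(v, '%m/%d/%Y') : nil },
  price:  ->(v) { v&.delete('$,')&.to_f },
  active: ->(v) { v&.match?(/\Atrue\z/i) },
}
identity = ->(v) { v }   # used for keys that have no converter

row = { dob: '05/14/1990', price: '$1,044.50', active: 'TRUE', note: 'n/a' }
converted = row.to_h { |key, value| [key, converters.fetch(key, identity).call(value)] }

converted[:price]               # => 1044.5
converted[:active]              # => true
converters[:price].call(nil)    # => nil, thanks to the &. guard
```

Testing converters in isolation like this is a cheap way to confirm the `nil` guards before running a full import.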
|
|
173
|
+
|
|
174
|
+
See [Value Converters](./value_converters.md).
|
|
175
|
+
|
|
176
|
+
---
|
|
177
|
+
|
|
178
|
+
## Example 6: Header Validation
|
|
179
|
+
|
|
180
|
+
Raise early if required columns are missing, before processing any data rows:
|
|
181
|
+
|
|
182
|
+
```ruby
|
|
183
|
+
begin
|
|
184
|
+
data = SmarterCSV.process('transactions.csv',
|
|
185
|
+
required_keys: [:account_id, :amount, :currency])
|
|
186
|
+
rescue SmarterCSV::MissingKeys => e
|
|
187
|
+
puts "CSV is missing required columns: #{e.keys.join(', ')}"
|
|
188
|
+
# => "CSV is missing required columns: currency"
|
|
189
|
+
end
|
|
190
|
+
```
|
|
191
|
+
|
|
192
|
+
See [Header Validations](./header_validations.md).
|
|
193
|
+
|
|
194
|
+
---
|
|
195
|
+
|
|
196
|
+
## Example 7: Bad Row Handling
|
|
197
|
+
|
|
198
|
+
Collect parse errors without stopping the import:
|
|
199
|
+
|
|
200
|
+
```ruby
|
|
201
|
+
reader = SmarterCSV::Reader.new('data.csv', on_bad_row: :collect)
|
|
202
|
+
good_rows = reader.process
|
|
203
|
+
|
|
204
|
+
bad = reader.errors[:bad_rows]
|
|
205
|
+
puts "Imported #{good_rows.size} rows, #{bad.size} bad rows"
|
|
206
|
+
bad.each do |rec|
|
|
207
|
+
puts "Line #{rec[:file_line_number]}: #{rec[:error_message]}"
|
|
208
|
+
puts " Raw: #{rec[:raw_line]}"
|
|
209
|
+
end
|
|
210
|
+
```
|
|
211
|
+
|
|
212
|
+
Cap the number of tolerated bad rows and limit field sizes to guard against malformed input:
|
|
213
|
+
|
|
214
|
+
```ruby
|
|
215
|
+
SmarterCSV.process('untrusted.csv',
|
|
216
|
+
on_bad_row: :skip,
|
|
217
|
+
bad_row_limit: 10,
|
|
218
|
+
field_size_limit: 4096)
|
|
219
|
+
```
|
|
220
|
+
|
|
221
|
+
See [Bad Row Quarantine](./bad_row_quarantine.md).
|
|
222
|
+
|
|
223
|
+
---
|
|
224
|
+
|
|
225
|
+
## Example 8: Writing CSV
|
|
226
|
+
|
|
227
|
+
```ruby
|
|
228
|
+
records = [
|
|
229
|
+
{ name: "Alice", age: 30, city: "New York" },
|
|
230
|
+
{ name: "Bob", age: 25, city: "Chicago" },
|
|
231
|
+
]
|
|
232
|
+
|
|
233
|
+
SmarterCSV.generate('output.csv') do |csv|
|
|
234
|
+
records.each { |r| csv << r }
|
|
235
|
+
end
|
|
236
|
+
# output.csv:
|
|
237
|
+
# name,age,city
|
|
238
|
+
# Alice,30,New York
|
|
239
|
+
# Bob,25,Chicago
|
|
240
|
+
```
|
|
241
|
+
|
|
242
|
+
Writing with header renaming and value converters:
|
|
243
|
+
|
|
244
|
+
```ruby
|
|
245
|
+
require 'date'
|
|
246
|
+
|
|
247
|
+
SmarterCSV.generate('report.csv',
|
|
248
|
+
map_headers: { name: 'Full Name', dob: 'Date of Birth' },
|
|
249
|
+
value_converters: { dob: ->(v) { v&.strftime('%m/%d/%Y') } },
|
|
250
|
+
) do |csv|
|
|
251
|
+
User.find_each { |u| csv << { name: u.full_name, dob: u.dob } }
|
|
252
|
+
end
|
|
253
|
+
```
|
|
254
|
+
|
|
255
|
+
See [The Basic Write API](./basic_write_api.md).
|
|
256
|
+
|
|
257
|
+
---
|
|
258
|
+
|
|
259
|
+
## Example 9: Using `each` and `each_chunk` Enumerators
|
|
260
|
+
|
|
261
|
+
The modern API gives you full Enumerable power without loading the whole file:
|
|
262
|
+
|
|
263
|
+
```ruby
|
|
264
|
+
# each — one hash per row
|
|
265
|
+
reader = SmarterCSV::Reader.new('data.csv')
|
|
266
|
+
reader.each { |hash| MyModel.upsert(hash) }
|
|
267
|
+
puts reader.headers.inspect # accessible after processing
|
|
268
|
+
|
|
269
|
+
# Enumerable methods
|
|
270
|
+
active_users = reader.select { |h| h[:status] == 'active' }
|
|
271
|
+
names = reader.map { |h| h[:name] }
|
|
272
|
+
|
|
273
|
+
# Lazy — stop early without reading the whole file
|
|
274
|
+
first_ten_active = reader.lazy.select { |h| h[:active] }.first(10)
|
|
275
|
+
|
|
276
|
+
# each_slice — manual batching without chunk_size
|
|
277
|
+
reader.each_slice(500) { |batch| MyModel.insert_all(batch) }
|
|
278
|
+
```
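The early-exit claim for `lazy` can be checked with a plain Ruby `Enumerator`. In this sketch the enumerator stands in for a reader that yields one hash per row, and `rows_read` is a hypothetical counter added only to show how many rows were pulled.

```ruby
rows_read = 0

# Stand-in for a row source: yields 1,000 hashes and counts each one produced.
rows = Enumerator.new do |yielder|
  1.upto(1_000) do |i|
    rows_read += 1
    yielder << { id: i, active: i.even? }
  end
end

# Lazy: stops pulling rows as soon as three matches are found.
first_three = rows.lazy.select { |h| h[:active] }.first(3)
lazy_reads = rows_read          # => 6

# Eager: select walks the whole source before first(3) is applied.
rows_read = 0
rows.select { |h| h[:active] }.first(3)
eager_reads = rows_read         # => 1000
```

The same difference applies to any `each`-based source: without `lazy`, `select` and `map` build a full intermediate array before `first` runs.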
|
|
279
|
+
|
|
280
|
+
See [Batch Processing](./batch_processing.md) and [The Basic Read API](./basic_read_api.md).
|
|
281
|
+
|
|
282
|
+
---
|
|
283
|
+
|
|
284
|
+
## Example 10: Importing into a Database
|
|
285
|
+
|
|
286
|
+
```ruby
|
|
287
|
+
filename = '/tmp/some.csv'
|
|
288
|
+
options = { key_mapping: { unwanted_row: nil, old_row_name: :new_name } }
|
|
289
|
+
|
|
290
|
+
n = SmarterCSV.process(filename, options) do |array|
|
|
291
|
+
MyModel.create(array.first)
|
|
292
|
+
end
|
|
293
|
+
# => returns number of rows processed
|
|
294
|
+
```
|
|
295
|
+
|
|
296
|
+
---
|
|
297
|
+
|
|
298
|
+
## Example 11: Batch Processing with Sidekiq
|
|
299
|
+
|
|
300
|
+
Processing in chunks reduces memory usage and enables parallel processing. The block receives the chunk index as an optional second parameter:
|
|
301
|
+
|
|
302
|
+
```ruby
|
|
303
|
+
filename = '/tmp/input.csv'
|
|
304
|
+
|
|
305
|
+
n = SmarterCSV.process(filename, chunk_size: 100) do |chunk, chunk_index|
|
|
306
|
+
puts "Queueing chunk #{chunk_index} with #{chunk.size} records..."
|
|
307
|
+
Sidekiq::Client.push_bulk(
|
|
308
|
+
'class' => SidekiqWorkerClass,
|
|
309
|
+
'args' => chunk.map { |row| [row.transform_keys(&:to_s)] }, # one argument array per job; string keys are JSON-safe
|
|
310
|
+
)
|
|
311
|
+
end
|
|
312
|
+
# => returns number of chunks
|
|
313
|
+
```
|
|
314
|
+
|
|
315
|
+
See [Batch Processing](./batch_processing.md).
|
|
316
|
+
|
|
317
|
+
---
|
|
318
|
+
|
|
319
|
+
## Example 12: Resumable CSV Import with Rails ActiveJob (Rails 8.1+)
|
|
320
|
+
|
|
321
|
+
Rails 8.1 introduced `ActiveJob::Continuable`, which lets a job pause and resume from exactly where it stopped — for example during a deployment or queue drain.
|
|
322
|
+
|
|
323
|
+
```ruby
|
|
324
|
+
# app/jobs/import_csv_job.rb
|
|
325
|
+
class ImportCsvJob < ApplicationJob
|
|
326
|
+
include ActiveJob::Continuable
|
|
327
|
+
|
|
328
|
+
def perform(file_path)
|
|
329
|
+
step :import_rows do |step|
|
|
330
|
+
SmarterCSV.process(file_path, chunk_size: 500) do |chunk, chunk_index|
|
|
331
|
+
next if chunk_index < step.cursor.to_i # skip already-processed chunks on resume
|
|
332
|
+
|
|
333
|
+
MyModel.import!(chunk)
|
|
334
|
+
step.set! chunk_index + 1
|
|
335
|
+
end
|
|
77
336
|
end
|
|
78
|
-
|
|
337
|
+
end
|
|
338
|
+
end
|
|
339
|
+
```
|
|
340
|
+
|
|
341
|
+
- `step.cursor` starts as `nil` (→ `0`), so the first run processes all chunks.
|
|
342
|
+
- If interrupted after chunk 7, Rails persists the cursor as `8`.
|
|
343
|
+
- On the next run chunks 0–7 are skipped quickly via `next`; processing resumes from chunk 8.
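The cursor behaviour described in the list above can be simulated in plain Ruby, without Rails. In this sketch `run_import` and `interrupt_after` are hypothetical names: the method-local `cursor` stands in for `step.cursor`, and the assignment stands in for `step.set!`.

```ruby
# Simulates one job run; returns the cursor value that would be persisted.
def run_import(chunks, cursor:, processed:, interrupt_after: nil)
  chunks.each_with_index do |chunk, chunk_index|
    next if chunk_index < cursor.to_i         # skip chunks finished in an earlier run

    processed << chunk                        # stand-in for MyModel.import!(chunk)
    cursor = chunk_index + 1                  # stand-in for step.set!
    return cursor if interrupt_after == chunk_index   # simulated shutdown
  end
  cursor
end

chunks = (0..9).map { |i| "chunk-#{i}" }
processed = []

cursor_after_first_run = run_import(chunks, cursor: nil, processed: processed, interrupt_after: 7)
# => 8
cursor_after_resume = run_import(chunks, cursor: cursor_after_first_run, processed: processed)
# => 10
```

After both runs `processed` holds every chunk exactly once. Note that the skipped chunks are still parsed on resume; only the database work is avoided.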
|
|
344
|
+
|
|
345
|
+
> Requires Rails 8.1+ and a queue adapter that supports graceful shutdown (Sidekiq, Solid Queue).
|
|
346
|
+
|
|
347
|
+
---
|
|
348
|
+
|
|
349
|
+
## Example 13: Instrumentation
|
|
350
|
+
|
|
351
|
+
```ruby
|
|
352
|
+
SmarterCSV.process('large_import.csv',
|
|
353
|
+
chunk_size: 1000,
|
|
354
|
+
|
|
355
|
+
on_start: ->(info) {
|
|
356
|
+
Rails.logger.info "Import started: #{info[:input]} (#{info[:file_size]} bytes)"
|
|
357
|
+
},
|
|
358
|
+
|
|
359
|
+
on_chunk: ->(info) {
|
|
360
|
+
Rails.logger.debug "Chunk #{info[:chunk_number]}: #{info[:rows_in_chunk]} rows"
|
|
361
|
+
},
|
|
362
|
+
|
|
363
|
+
on_complete: ->(stats) {
|
|
364
|
+
Rails.logger.info "Done: #{stats[:total_rows]} rows in #{stats[:duration].round(2)}s"
|
|
365
|
+
},
|
|
366
|
+
) { |chunk| MyModel.insert_all(chunk) }
|
|
79
367
|
```
|
|
368
|
+
|
|
369
|
+
See [Instrumentation Hooks](./instrumentation.md).
|
|
370
|
+
|
|
371
|
+
--------------------
|
|
372
|
+
PREVIOUS: [Instrumentation Hooks](./instrumentation.md) | NEXT: [Real-World CSV Files](./real_world_csv.md) | UP: [README](../README.md)
|
data/docs/header_transformations.md
CHANGED
|
@@ -2,6 +2,7 @@
|
|
|
2
2
|
### Contents
|
|
3
3
|
|
|
4
4
|
* [Introduction](./_introduction.md)
|
|
5
|
+
* [Migrating from Ruby CSV](./migrating_from_csv.md)
|
|
5
6
|
* [Parsing Strategy](./parsing_strategy.md)
|
|
6
7
|
* [The Basic Read API](./basic_read_api.md)
|
|
7
8
|
* [The Basic Write API](./basic_write_api.md)
|
|
@@ -10,15 +11,55 @@
|
|
|
10
11
|
* [Row and Column Separators](./row_col_sep.md)
|
|
11
12
|
* [**Header Transformations**](./header_transformations.md)
|
|
12
13
|
* [Header Validations](./header_validations.md)
|
|
14
|
+
* [Column Selection](./column_selection.md)
|
|
13
15
|
* [Data Transformations](./data_transformations.md)
|
|
14
16
|
* [Value Converters](./value_converters.md)
|
|
15
|
-
|
|
16
|
-
|
|
17
|
+
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
18
|
+
* [Instrumentation Hooks](./instrumentation.md)
|
|
19
|
+
* [Examples](./examples.md)
|
|
20
|
+
* [Real-World CSV Files](./real_world_csv.md)
|
|
21
|
+
* [SmarterCSV over the Years](./history.md)
|
|
22
|
+
* [Release Notes](./releases/1.16.0/changes.md)
|
|
23
|
+
|
|
24
|
+
--------------
|
|
17
25
|
|
|
18
26
|
# Header Transformations
|
|
19
27
|
|
|
20
28
|
By default SmarterCSV assumes that a CSV file has headers, and it automatically normalizes the headers and transforms them into Ruby symbols. You can completely customize or override this (see below).
|
|
21
29
|
|
|
30
|
+
## Header Transformation Pipeline
|
|
31
|
+
|
|
32
|
+
When a CSV file is opened, the header line passes through the following steps in order:
|
|
33
|
+
|
|
34
|
+
```
|
|
35
|
+
[user_provided_headers] ──► skips steps below; uses your array directly
|
|
36
|
+
│
|
|
37
|
+
▼ (when headers come from the file)
|
|
38
|
+
comment_regexp ──► strip_chars_from_headers ──► split on col_sep
|
|
39
|
+
──► strip quote_char ──► strip_whitespace
|
|
40
|
+
──► [unless keep_original_headers]: gsub spaces/dashes→_ ──► downcase_header
|
|
41
|
+
──► disambiguate_headers ──► symbolize ──► key_mapping
|
|
42
|
+
```
|
|
43
|
+
|
|
44
|
+
| Step | Option | Default | Description |
|
|
45
|
+
|------|--------|---------|-------------|
|
|
46
|
+
| 1 | `comment_regexp` | `nil` | Strips a comment prefix from the raw header line (e.g. `# ` at start) |
|
|
47
|
+
| 2 | `strip_chars_from_headers` | `nil` | Removes characters matching a regexp from the raw header line (e.g. `/[\-"]/`) |
|
|
48
|
+
| 3 | *(split)* | `col_sep` | Splits the header line into individual column tokens |
|
|
49
|
+
| 4 | `quote_char` | `"` | Strips surrounding quote characters from each token |
|
|
50
|
+
| 5 | `strip_whitespace` | `true` | Strips leading/trailing whitespace from each header |
|
|
51
|
+
| 6 | *(normalize)* | — | Replaces spaces and dashes with `_` (`keep_original_headers` skips this and steps 7–9) |
|
|
52
|
+
| 7 | `downcase_header` | `true` | Downcases each header string |
|
|
53
|
+
| 8 | `duplicate_header_suffix` | `''` | Renames empty headers to `column_N`; appends suffix+number to duplicates |
|
|
54
|
+
| 9 | `strings_as_keys` | `false` | Converts headers to symbols (skipped if `true` or `keep_original_headers`) |
|
|
55
|
+
| 10 | `key_mapping` | `nil` | Renames or drops headers; use post-transformation key names as input |
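The steps in the table above can be approximated in a few lines of plain Ruby. This is a sketch of the documented order, not SmarterCSV's internals: `normalize_headers` is a hypothetical helper, steps 1 and 2 are omitted, steps 4 and 5 are folded together, and the exact naming of duplicates (`email2`) and empty headers (`column_3`) is an assumption based on the table's description.

```ruby
def normalize_headers(line, col_sep: ',', quote_char: '"', duplicate_suffix: '', key_mapping: {})
  seen = Hash.new(0)
  names = line.chomp.split(col_sep, -1).each_with_index.map do |token, index|
    # steps 4-5: strip surrounding quotes and whitespace
    name = token.strip.delete_prefix(quote_char).delete_suffix(quote_char).strip
    # steps 6-7: spaces and dashes become underscores, then downcase
    name = name.gsub(/[\s-]+/, '_').downcase
    # step 8: name empty headers by position, number repeated headers
    name = "column_#{index + 1}" if name.empty?
    seen[name] += 1
    seen[name] > 1 ? "#{name}#{duplicate_suffix}#{seen[name]}" : name
  end
  # steps 9-10: symbolize, then rename via key_mapping (keys are the transformed names)
  names.map(&:to_sym).map { |key| key_mapping.fetch(key, key) }
end

headers = normalize_headers(%( "First Name" ,last-name,,Email,email\n),
                            key_mapping: { last_name: :lname })
# => [:first_name, :lname, :column_3, :email, :email2]
```

The point to take away is the ordering: `key_mapping` sees `:last_name`, not the raw `last-name`, which is why mapping keys must use the transformed names.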
|
|
56
|
+
|
|
57
|
+
> `user_provided_headers` bypasses all file header reading and transformation entirely — your array is used as-is. Versions >1.13 automatically set `headers_in_file: false` when `user_provided_headers` is given; if the file has a header row you want to skip, set `headers_in_file: true` explicitly.
|
|
58
|
+
|
|
59
|
+
See [Configuration Options](./options.md) for full option reference.
|
|
60
|
+
|
|
61
|
+
---
|
|
62
|
+
|
|
22
63
|
## Header Normalization
|
|
23
64
|
|
|
24
65
|
When processing the headers, it transforms them into Ruby symbols, stripping extra spaces, lower-casing them and replacing spaces with underscores. e.g. " \t Annual Sales " becomes `:annual_sales`. (see Notes below)
|
|
@@ -81,16 +122,57 @@ end
|
|
|
81
122
|
|
|
82
123
|
## Key Mapping
|
|
83
124
|
|
|
84
|
-
|
|
85
|
-
|
|
125
|
+
`key_mapping:` renames CSV headers to the symbols your application expects. Any header not
|
|
126
|
+
listed in the mapping is kept as-is by default.
|
|
86
127
|
|
|
87
|
-
|
|
128
|
+
```ruby
|
|
129
|
+
# CSV headers: first_name, last_name, internal_id, created_at
|
|
130
|
+
data = SmarterCSV.process('contacts.csv',
|
|
131
|
+
key_mapping: { first_name: :given_name, last_name: :family_name },
|
|
132
|
+
)
|
|
133
|
+
# => [{given_name: "Alice", family_name: "Smith", internal_id: 42, created_at: "2026-01-01"}, ...]
|
|
134
|
+
# ^^^ renamed ^^^ unmapped keys kept as-is
|
|
135
|
+
```
|
|
88
136
|
|
|
89
|
-
|
|
90
|
-
|
|
91
|
-
|
|
92
|
-
|
|
93
|
-
|
|
137
|
+
To delete a specific column, map it to `nil` — it will be removed from every row hash:
|
|
138
|
+
|
|
139
|
+
```ruby
|
|
140
|
+
key_mapping: { internal_id: nil, created_at: nil } # drop these two columns
|
|
141
|
+
```
|
|
142
|
+
|
|
143
|
+
### `remove_unmapped_keys:` — drop everything not in the map
|
|
144
|
+
|
|
145
|
+
When you have files with many columns and only care about a few, listing every unwanted
|
|
146
|
+
column as `nil` is tedious. Use `remove_unmapped_keys: true` to implicitly drop any header
|
|
147
|
+
that has no entry in `key_mapping:`:
|
|
148
|
+
|
|
149
|
+
```ruby
|
|
150
|
+
# CSV has 50 columns; you only want two of them, renamed
|
|
151
|
+
data = SmarterCSV.process('contacts.csv',
|
|
152
|
+
key_mapping: { first_name: :given_name, last_name: :family_name },
|
|
153
|
+
remove_unmapped_keys: true,
|
|
154
|
+
)
|
|
155
|
+
# => [{given_name: "Alice", family_name: "Smith"}, ...] # only the two mapped columns
|
|
156
|
+
```
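The difference between the default behaviour and `remove_unmapped_keys: true` can be expressed as a small plain-Ruby function over one row hash. This is an illustrative sketch under the semantics described above, not the library's code; `apply_key_mapping` is a hypothetical helper.

```ruby
def apply_key_mapping(row, key_mapping, remove_unmapped_keys: false)
  row.each_with_object({}) do |(key, value), result|
    if key_mapping.key?(key)
      new_key = key_mapping[key]
      result[new_key] = value unless new_key.nil?   # mapping to nil drops the column
    elsif !remove_unmapped_keys
      result[key] = value                           # unmapped keys kept by default
    end
  end
end

row     = { first_name: 'Alice', last_name: 'Smith', internal_id: 42 }
mapping = { first_name: :given_name, internal_id: nil }

kept    = apply_key_mapping(row, mapping)
# => {given_name: "Alice", last_name: "Smith"}
trimmed = apply_key_mapping(row, mapping, remove_unmapped_keys: true)
# => {given_name: "Alice"}
```

Both variants run after the row has been parsed, which is why the comparison table below describes `remove_unmapped_keys:` as a post-parse filter.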
|
|
157
|
+
|
|
158
|
+
### `remove_unmapped_keys:` vs `headers: { only: }`
|
|
159
|
+
|
|
160
|
+
Both achieve column selection, but they serve different purposes:
|
|
161
|
+
|
|
162
|
+
| | `remove_unmapped_keys: true` | `headers: { only: [...] }` |
|
|
163
|
+
|---|---|---|
|
|
164
|
+
| Use when | Already using `key_mapping:` and want to implicitly drop the rest | Pure column selection, no renaming needed |
|
|
165
|
+
| Performance | Post-parse filter — all fields parsed, unmapped keys deleted | **C-path early exit** — unneeded fields never parsed |
|
|
166
|
+
| Renaming | Yes — combines selection and rename in one step | No renaming (use `key_mapping:` alongside if needed) |
|
|
167
|
+
|
|
168
|
+
For wide files where performance matters, prefer `headers: { only: }` — it skips unneeded
|
|
169
|
+
fields entirely inside the C parser and can be **10–14× faster** on very wide files.
|
|
170
|
+
Use `remove_unmapped_keys: true` when you are already remapping headers and the convenience
|
|
171
|
+
of a single option outweighs the (usually small) performance difference.
|
|
172
|
+
|
|
173
|
+
See [Column Selection](./column_selection.md) for full details on `headers: { only: }`.
|
|
174
|
+
|
|
175
|
+
> **Note:** Key mapping is particularly useful when importing CSV data directly into a database or document store. By remapping headers to the exact symbol names your application uses internally (e.g. ActiveRecord attributes, DynamoDB document keys, Sidekiq job parameters), you can pass the resulting hashes directly without any further transformation.
|
|
94
176
|
|
|
95
177
|
## CSV Files without Headers
|
|
96
178
|
|
|
@@ -124,5 +206,4 @@ For CSV files with headers, you can either:
|
|
|
124
206
|
* some CSV files use un-escaped quotation characters inside fields. This can cause the import to break. To get around this, set the `quote_char` to something different, e.g. `quote_char: "%"`, or try setting `:strip_chars_from_headers => /[\-"]/`
|
|
125
207
|
|
|
126
208
|
---------------
|
|
127
|
-
PREVIOUS: [Row and Column Separators](./row_col_sep.md) | NEXT: [Header Validations](./header_validations.md)
|
|
128
|
-
|
|
209
|
+
PREVIOUS: [Row and Column Separators](./row_col_sep.md) | NEXT: [Header Validations](./header_validations.md) | UP: [README](../README.md)
|
data/docs/header_validations.md
CHANGED
|
@@ -2,6 +2,7 @@
|
|
|
2
2
|
### Contents
|
|
3
3
|
|
|
4
4
|
* [Introduction](./_introduction.md)
|
|
5
|
+
* [Migrating from Ruby CSV](./migrating_from_csv.md)
|
|
5
6
|
* [Parsing Strategy](./parsing_strategy.md)
|
|
6
7
|
* [The Basic Read API](./basic_read_api.md)
|
|
7
8
|
* [The Basic Write API](./basic_write_api.md)
|
|
@@ -10,43 +11,80 @@
|
|
|
10
11
|
* [Row and Column Separators](./row_col_sep.md)
|
|
11
12
|
* [Header Transformations](./header_transformations.md)
|
|
12
13
|
* [**Header Validations**](./header_validations.md)
|
|
14
|
+
* [Column Selection](./column_selection.md)
|
|
13
15
|
* [Data Transformations](./data_transformations.md)
|
|
14
16
|
* [Value Converters](./value_converters.md)
|
|
15
|
-
|
|
16
|
-
|
|
17
|
+
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
18
|
+
* [Instrumentation Hooks](./instrumentation.md)
|
|
19
|
+
* [Examples](./examples.md)
|
|
20
|
+
* [Real-World CSV Files](./real_world_csv.md)
|
|
21
|
+
* [SmarterCSV over the Years](./history.md)
|
|
22
|
+
* [Release Notes](./releases/1.16.0/changes.md)
|
|
23
|
+
|
|
24
|
+
--------------
|
|
17
25
|
|
|
18
26
|
# Header Validations
|
|
19
27
|
|
|
20
|
-
When
|
|
28
|
+
When importing data it is important to verify that all required columns are present — catching a missing column upfront is far better than a cryptic error later when your code tries to access a key that was never populated.
|
|
21
29
|
|
|
22
|
-
|
|
30
|
+
## `required_keys`
|
|
23
31
|
|
|
24
|
-
|
|
32
|
+
Use `required_keys` to specify an array of hash keys that must be present after header transformation. Validation runs once, after the header row is parsed and all header transformations (downcase, symbolize, `key_mapping`) have been applied — so use the **transformed** key names, not the raw CSV header strings.
|
|
25
33
|
|
|
26
|
-
|
|
34
|
+
If any required key is absent, `SmarterCSV::MissingKeys` is raised before any data rows are processed.
|
|
27
35
|
|
|
28
36
|
```ruby
|
|
29
|
-
|
|
30
|
-
|
|
31
|
-
|
|
32
|
-
|
|
33
|
-
|
|
34
|
-
=> this will raise SmarterCSV::MissingKeys if any row does not contain these three keys
|
|
37
|
+
options = {
|
|
38
|
+
required_keys: [:source_account, :destination_account, :amount]
|
|
39
|
+
}
|
|
40
|
+
data = SmarterCSV.process('/tmp/transactions.csv', options)
|
|
41
|
+
# => raises SmarterCSV::MissingKeys if any of the three columns is missing
|
|
35
42
|
```
|
|
36
43
|
|
|
37
|
-
|
|
44
|
+
### Accessing the missing keys
|
|
38
45
|
|
|
39
|
-
|
|
46
|
+
`SmarterCSV::MissingKeys` exposes the missing keys via the `keys` accessor:
|
|
40
47
|
|
|
41
48
|
```ruby
|
|
42
49
|
begin
|
|
43
|
-
|
|
44
|
-
|
|
50
|
+
data = SmarterCSV.process('/tmp/transactions.csv',
|
|
51
|
+
required_keys: [:source_account, :destination_account, :amount])
|
|
45
52
|
rescue SmarterCSV::MissingKeys => e
|
|
46
53
|
puts "Missing columns: #{e.keys.join(', ')}"
|
|
47
|
-
# =>
|
|
54
|
+
# => "Missing columns: amount"
|
|
48
55
|
end
|
|
49
56
|
```
|
|
50
57
|
|
|
58
|
+
### Interaction with `key_mapping`
|
|
59
|
+
|
|
60
|
+
`required_keys` uses the **post-mapping** key names. If you remap CSV headers, reference the mapped names:
|
|
61
|
+
|
|
62
|
+
```ruby
|
|
63
|
+
options = {
|
|
64
|
+
key_mapping: { acct_from: :source_account, acct_to: :destination_account },
|
|
65
|
+
required_keys: [:source_account, :destination_account, :amount],
|
|
66
|
+
}
|
|
67
|
+
```
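The check itself amounts to a set difference between the required keys and the transformed headers. The sketch below shows that logic in plain Ruby; `MissingKeysSketch` and `validate_required_keys!` are hypothetical stand-ins for illustration, not the classes or methods SmarterCSV defines.

```ruby
# Hypothetical error class that exposes the missing keys, like the `keys` accessor above.
class MissingKeysSketch < StandardError
  attr_reader :keys

  def initialize(keys)
    @keys = keys
    super("missing required keys: #{keys.join(', ')}")
  end
end

def validate_required_keys!(headers, required_keys)
  missing = required_keys - headers
  raise MissingKeysSketch.new(missing) unless missing.empty?

  headers
end

# Headers as they look after key_mapping: acct_from/acct_to already renamed.
mapped_headers = [:source_account, :destination_account]

begin
  validate_required_keys!(mapped_headers, [:source_account, :destination_account, :amount])
rescue MissingKeysSketch => e
  missing_keys = e.keys   # => [:amount]
end
```

Listing the pre-mapping names (`:acct_from`, `:acct_to`) in `required_keys` would fail this check even for a valid file, because those names no longer exist once the mapping has been applied.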
|
|
68
|
+
|
|
69
|
+
---
|
|
70
|
+
|
|
71
|
+
## `silence_missing_keys`
|
|
72
|
+
|
|
73
|
+
When using `key_mapping`, SmarterCSV raises `SmarterCSV::KeyMappingError` if a mapped key is not found in the CSV header. Use `silence_missing_keys` to make some or all mapped keys optional:
|
|
74
|
+
|
|
75
|
+
```ruby
|
|
76
|
+
# All mapped keys are optional — no error if any are absent
|
|
77
|
+
options = {
|
|
78
|
+
key_mapping: { optional_field: :my_field, required_field: :other_field },
|
|
79
|
+
silence_missing_keys: true,
|
|
80
|
+
}
|
|
81
|
+
|
|
82
|
+
# Only specific mapped keys are optional
|
|
83
|
+
options = {
|
|
84
|
+
key_mapping: { optional_field: :my_field, required_field: :other_field },
|
|
85
|
+
silence_missing_keys: [:optional_field],
|
|
86
|
+
}
|
|
87
|
+
```
|
|
88
|
+
|
|
51
89
|
----------------
|
|
52
|
-
PREVIOUS: [Header Transformations](./header_transformations.md) | NEXT: [
|
|
90
|
+
PREVIOUS: [Header Transformations](./header_transformations.md) | NEXT: [Column Selection](./column_selection.md) | UP: [README](../README.md)
|