smarter_csv 1.15.2 → 1.16.1

Files changed (50)
  1. checksums.yaml +4 -4
  2. data/.rspec +2 -0
  3. data/.rubocop.yml +9 -0
  4. data/CHANGELOG.md +112 -1
  5. data/CONTRIBUTORS.md +4 -1
  6. data/Gemfile +1 -0
  7. data/README.md +129 -27
  8. data/docs/_introduction.md +45 -24
  9. data/docs/bad_row_quarantine.md +342 -0
  10. data/docs/basic_read_api.md +152 -9
  11. data/docs/basic_write_api.md +475 -59
  12. data/docs/batch_processing.md +162 -4
  13. data/docs/column_selection.md +184 -0
  14. data/docs/data_transformations.md +163 -29
  15. data/docs/examples.md +340 -46
  16. data/docs/header_transformations.md +94 -12
  17. data/docs/header_validations.md +57 -18
  18. data/docs/history.md +119 -0
  19. data/docs/instrumentation.md +166 -0
  20. data/docs/migrating_from_csv.md +565 -0
  21. data/docs/options.md +151 -87
  22. data/docs/parsing_strategy.md +64 -1
  23. data/docs/real_world_csv.md +263 -0
  24. data/docs/releases/1.16.0/benchmarks.md +223 -0
  25. data/docs/releases/1.16.0/changes.md +273 -0
  26. data/docs/releases/1.16.0/performance_notes.md +114 -0
  27. data/docs/row_col_sep.md +15 -5
  28. data/docs/ruby_csv_pitfalls.md +514 -0
  29. data/docs/value_converters.md +194 -57
  30. data/ext/smarter_csv/extconf.rb +3 -0
  31. data/ext/smarter_csv/smarter_csv.c +1017 -82
  32. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png +0 -0
  33. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.svg +108 -0
  34. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.png +0 -0
  35. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg +141 -0
  36. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.png +0 -0
  37. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.svg +139 -0
  38. data/lib/smarter_csv/errors.rb +8 -0
  39. data/lib/smarter_csv/file_io.rb +1 -1
  40. data/lib/smarter_csv/hash_transformations.rb +14 -13
  41. data/lib/smarter_csv/header_transformations.rb +21 -2
  42. data/lib/smarter_csv/headers.rb +2 -1
  43. data/lib/smarter_csv/options.rb +124 -7
  44. data/lib/smarter_csv/parser.rb +358 -74
  45. data/lib/smarter_csv/reader.rb +494 -46
  46. data/lib/smarter_csv/version.rb +1 -1
  47. data/lib/smarter_csv/writer.rb +71 -19
  48. data/lib/smarter_csv.rb +134 -13
  49. data/smarter_csv.gemspec +20 -10
  50. metadata +38 -80
data/docs/examples.md CHANGED
@@ -2,6 +2,8 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
6
+ * [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md)
5
7
  * [Parsing Strategy](./parsing_strategy.md)
6
8
  * [The Basic Read API](./basic_read_api.md)
7
9
  * [The Basic Write API](./basic_write_api.md)
@@ -10,70 +12,362 @@
10
12
  * [Row and Column Separators](./row_col_sep.md)
11
13
  * [Header Transformations](./header_transformations.md)
12
14
  * [Header Validations](./header_validations.md)
15
+ * [Column Selection](./column_selection.md)
13
16
  * [Data Transformations](./data_transformations.md)
14
17
  * [Value Converters](./value_converters.md)
15
-
16
- --------------
18
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Instrumentation Hooks](./instrumentation.md)
20
+ * [**Examples**](./examples.md)
21
+ * [Real-World CSV Files](./real_world_csv.md)
22
+ * [SmarterCSV over the Years](./history.md)
23
+ * [Release Notes](./releases/1.16.0/changes.md)
24
+
25
+ --------------
17
26
 
18
27
  # Examples
19
28
 
20
- Here are some examples to demonstrate the versatility of SmarterCSV.
29
+ **Rescue from `SmarterCSV::Error` (recommended):** SmarterCSV auto-detects row and column separators. In rare cases detection fails and raises an exception (e.g. `NoColSepDetected`). Rescuing from `SmarterCSV::Error` ensures your application handles unexpected CSV formats gracefully.
30
+
31
+ ---
21
32
 
22
- **It is generally recommended to rescue `SmarterCSV::Error` or it's sub-classes.**
33
+ 1. [CSV Array of Hashes](#example-1-csv--array-of-hashes)
34
+ 2. [Parsing a CSV String](#example-2-parsing-a-csv-string)
35
+ 3. [Key Mapping and Column Selection](#example-3-key-mapping-and-column-selection)
36
+ 4. [Encoding and Preamble Skip](#example-4-encoding-and-preamble-skip)
37
+ 5. [Value Converters](#example-5-value-converters)
38
+ 6. [Header Validation](#example-6-header-validation)
39
+ 7. [Bad Row Handling](#example-7-bad-row-handling)
40
+ 8. [Writing CSV](#example-8-writing-csv)
41
+ 9. [Using `each` and `each_chunk` Enumerators](#example-9-using-each-and-each_chunk-enumerators)
42
+ 10. [Importing into a Database](#example-10-importing-into-a-database)
43
+ 11. [Batch Processing with Sidekiq](#example-11-batch-processing-with-sidekiq)
44
+ 12. [Resumable CSV Import with Rails ActiveJob](#example-12-resumable-csv-import-with-rails-activejob-rails-81)
45
+ 13. [Instrumentation](#example-13-instrumentation)
23
46
 
24
- By default SmarterCSV determines the `row_sep` and `col_sep` values automatically. In cases where the automatic detection fails, an exception will be raised, e.g. `NoColSepDetected`. Rescuing from these exceptions will make sure that you don't miss processing CSV files, in case users upload CSV files with unexpected formats.
47
+ ---
25
48
 
26
- In rare cases you may have to manually set these values, after going through the troubleshooting procedure described above.
49
+ ## Example 1: CSV Array of Hashes
27
50
 
28
- ## Example 1a: How SmarterCSV processes CSV-files as array of hashes:
29
- Please note how each hash contains only the keys for columns with non-null values.
51
+ Each hash contains keys only for columns with non-nil, non-empty values; columns with blank entries are omitted automatically:
30
52
 
31
53
  ```ruby
32
- $ cat pets.csv
33
- first name,last name,dogs,cats,birds,fish
34
- Dan,McAllister,2,,,
35
- Lucy,Laweless,,5,,
36
- Miles,O'Brian,,,,21
37
- Nancy,Homes,2,,1,
38
- $ irb
39
- > require 'smarter_csv'
40
- => true
41
- > pets_by_owner = SmarterCSV.process('/tmp/pets.csv')
42
- => [ {:first_name=>"Dan", :last_name=>"McAllister", :dogs=>"2"},
43
- {:first_name=>"Lucy", :last_name=>"Laweless", :cats=>"5"},
44
- {:first_name=>"Miles", :last_name=>"O'Brian", :fish=>"21"},
45
- {:first_name=>"Nancy", :last_name=>"Homes", :dogs=>"2", :birds=>"1"}
46
- ]
54
+ $ cat pets.csv
55
+ first name,last name,dogs,cats,birds,fish
56
+ Dan,McAllister,2,,,
57
+ Lucy,Laweless,,5,,
58
+ Miles,O'Brian,,,,21
59
+ Nancy,Homes,2,,1,
60
+
61
+ $ irb
62
+ > require 'smarter_csv'
63
+ > pets_by_owner = SmarterCSV.process('pets.csv')
64
+ => [ {first_name: "Dan", last_name: "McAllister", dogs: 2},
65
+ {first_name: "Lucy", last_name: "Laweless", cats: 5},
66
+ {first_name: "Miles", last_name: "O'Brian", fish: 21},
67
+ {first_name: "Nancy", last_name: "Homes", dogs: 2, birds: 1}
68
+ ]
47
69
  ```
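The omission of blank columns can be pictured with a small plain-Ruby sketch. It does not call SmarterCSV: `sketch_row_to_hash` is a hypothetical helper that only mimics the documented behaviour for a single row, and it leaves values as strings instead of converting numerics.

```ruby
# Hypothetical helper, not part of SmarterCSV: pair headers with values,
# then drop every pair whose value is nil or blank.
def sketch_row_to_hash(headers, line, col_sep: ',')
  values = line.chomp.split(col_sep, -1) # -1 keeps trailing empty fields
  headers.zip(values).reject { |_key, value| value.nil? || value.strip.empty? }.to_h
end

headers = %i[first_name last_name dogs cats birds fish]
row = sketch_row_to_hash(headers, "Nancy,Homes,2,,1,")
# row keeps only :first_name, :last_name, :dogs and :birds
```

The `cats` and `fish` columns never appear as keys, so `row.key?(:cats)` is `false` rather than returning an empty string.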
48
70
 
71
+ ---
72
+
73
+ ## Example 2: Parsing a CSV String
74
+
75
+ Use `SmarterCSV.parse` to parse a CSV string directly — no file needed. Useful in tests, API responses, or when the CSV arrives as a string in memory:
49
76
 
50
- ## Example 3: Populate a MySQL or MongoDB Database with SmarterCSV:
51
77
  ```ruby
52
- # without using chunks:
53
- filename = '/tmp/some.csv'
54
- options = {:key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}
55
- n = SmarterCSV.process(filename, options) do |array|
56
- # we're passing a block in, to process each resulting hash / =row (the block takes array of hashes)
57
- # when chunking is not enabled, there is only one hash in each array
58
- MyModel.create( array.first )
59
- end
78
+ csv_string = <<~CSV
79
+ name,age,city
80
+ Alice,30,New York
81
+ Bob,25,Chicago
82
+ CSV
60
83
 
61
- => returns number of chunks / rows we processed
84
+ data = SmarterCSV.parse(csv_string)
85
+ # => [{name: "Alice", age: 30, city: "New York"}, {name: "Bob", age: 25, city: "Chicago"}]
62
86
  ```
63
87
 
64
- ## Example 4: Processing a CSV File, and inserting batch jobs in Sidekiq:
65
- The block receives an optional second parameter `chunk_index` (0-based) for progress tracking:
88
+ See [The Basic Read API](./basic_read_api.md) and [Migrating from Ruby CSV](./migrating_from_csv.md).
89
+
90
+ ---
91
+
92
+ ## Example 3: Key Mapping and Column Selection
93
+
94
+ Rename headers and drop unwanted columns in one pass:
95
+
96
+ ```ruby
97
+ options = {
98
+ key_mapping: {
99
+ first_name: :fname,
100
+ last_name: :lname,
101
+ dob: :birth_date,
102
+ ssn: nil, # drop this column entirely
103
+ },
104
+ }
105
+ data = SmarterCSV.process('people.csv', options)
106
+ # => [{fname: "Alice", lname: "Smith", birth_date: "1990-05-14"}, ...]
107
+ # ↑ :ssn is gone; original CSV headers remapped to your domain names
108
+ ```
109
+
110
+ Keep only specific columns using `headers: { only: }`:
111
+
112
+ ```ruby
113
+ data = SmarterCSV.process('people.csv', headers: { only: [:name, :email] })
114
+ # => [{name: "Alice", email: "alice@example.com"}, ...]
115
+ ```
116
+
117
+ See [Header Transformations](./header_transformations.md) and [Column Selection](./column_selection.md).
118
+
119
+ ---
120
+
121
+ ## Example 4: Encoding and Preamble Skip
122
+
123
+ Handle non-UTF-8 files and metadata rows before the header:
124
+
125
+ ```ruby
126
+ # Bank statement export: Windows-1252, 3 preamble rows, then header
127
+ data = SmarterCSV.process('statement.csv',
128
+ file_encoding: 'windows-1252',
129
+ skip_lines: 3)
130
+
131
+ # European lab instrument export: semicolon-separated, Latin-1
132
+ data = SmarterCSV.process('results.csv',
133
+ file_encoding: 'iso-8859-1',
134
+ col_sep: :auto) # :auto detects the semicolon
135
+ ```
136
+
137
+ See [Row and Column Separators](./row_col_sep.md) and [Real-World CSV Files](./real_world_csv.md).
138
+
139
+ ---
140
+
141
+ ## Example 5: Value Converters
142
+
143
+ Transform raw strings into typed values — dates, booleans, currency:
144
+
66
145
  ```ruby
67
- filename = '/tmp/input.csv' # CSV file containing ids or data to process
68
- options = { :chunk_size => 100 }
69
- n = SmarterCSV.process(filename, options) do |chunk, chunk_index|
70
- puts "Queueing chunk #{chunk_index} with #{chunk.size} records..."
71
- Sidekiq::Client.push_bulk(
72
- 'class' => SidekiqIndividualWorkerClass,
73
- 'args' => chunk,
74
- )
75
- # OR:
76
- # SidekiqBatchWorkerClass.process_async(chunk) # pass an array of hashes to Sidekiq workers for parallel processing
146
+ require 'date'
147
+
148
+ data = SmarterCSV.process('records.csv',
149
+ value_converters: {
150
+ # Parse US date format
151
+ dob: ->(v) { v ? Date.strptime(v, '%m/%d/%Y') : nil },
152
+
153
+ # Strip currency symbol and convert to Float
154
+ price: ->(v) { v&.delete('$,')&.to_f },
155
+
156
+ # Boolean from various representations
157
+ active: ->(v) { v&.match?(/\Atrue\z/i) },
158
+ })
159
+
160
+ data.first[:dob] # => #<Date: 1990-05-14>
161
+ data.first[:price] # => 44.5
162
+ data.first[:active] # => true
163
+ ```
164
+
165
+ Combining with `nil_values_matching` to clean sentinel values before conversion:
166
+
167
+ ```ruby
168
+ data = SmarterCSV.process('export.csv',
169
+ nil_values_matching: /\A(N\/A|NULL|#N\/A)\z/i,
170
+ value_converters: {
171
+ score: ->(v) { v&.to_f }, # v is nil for N/A rows — guard with &.
172
+ })
173
+ ```
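Why the converters need a `nil` guard can be seen in a plain-Ruby sketch of the ordering described above: sentinel values are turned into `nil` first, and only then is the converter called. `sketch_convert_row` is a hypothetical helper, not SmarterCSV's implementation.

```ruby
require 'date'

# Hypothetical helper: nil-out sentinel strings, then run the per-key converter.
def sketch_convert_row(row, nil_pattern, converters)
  row.to_h do |key, value|
    value = nil if value.is_a?(String) && value.match?(nil_pattern)
    converter = converters[key]
    [key, converter ? converter.call(value) : value]
  end
end

nil_pattern = /\A(N\/A|NULL|#N\/A)\z/i
converters = {
  score: ->(v) { v&.to_f },                                # safe for nil
  dob:   ->(v) { v ? Date.strptime(v, '%m/%d/%Y') : nil }, # safe for nil
}

converted = sketch_convert_row({ score: 'N/A', dob: '05/14/1990', name: 'Alice' },
                               nil_pattern, converters)
# converted[:score] is nil, converted[:dob] is a Date, converted[:name] is unchanged
```

Without the `&.` guard, the `score` converter would call `to_f` on `nil` and silently produce `0.0`, which is indistinguishable from a real zero score.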
174
+
175
+ See [Value Converters](./value_converters.md).
176
+
177
+ ---
178
+
179
+ ## Example 6: Header Validation
180
+
181
+ Raise early if required columns are missing, before processing any data rows:
182
+
183
+ ```ruby
184
+ begin
185
+ data = SmarterCSV.process('transactions.csv',
186
+ required_keys: [:account_id, :amount, :currency])
187
+ rescue SmarterCSV::MissingKeys => e
188
+ puts "CSV is missing required columns: #{e.keys.join(', ')}"
189
+ # => "CSV is missing required columns: currency"
190
+ end
191
+ ```
192
+
193
+ See [Header Validations](./header_validations.md).
194
+
195
+ ---
196
+
197
+ ## Example 7: Bad Row Handling
198
+
199
+ Collect parse errors without stopping the import:
200
+
201
+ ```ruby
202
+ reader = SmarterCSV::Reader.new('data.csv', on_bad_row: :collect)
203
+ good_rows = reader.process
204
+
205
+ bad = reader.errors[:bad_rows]
206
+ puts "Imported #{good_rows.size} rows, #{bad.size} bad rows"
207
+ bad.each do |rec|
208
+ puts "Line #{rec[:file_line_number]}: #{rec[:error_message]}"
209
+ puts " Raw: #{rec[:raw_line]}"
210
+ end
211
+ ```
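The shape of the collected records can be illustrated without the library. The sketch below is plain Ruby with a hypothetical helper; the rule it uses for a "bad" row (a field count that differs from the header) is an assumption chosen for brevity, and only the record keys are taken from the example above.

```ruby
# Hypothetical stand-in for a collect-style policy; not SmarterCSV code.
def sketch_collect_bad_rows(lines, col_sep: ',')
  header, *data = lines
  expected = header.split(col_sep, -1).size
  good = []
  bad = []
  data.each_with_index do |line, index|
    fields = line.split(col_sep, -1)
    if fields.size == expected
      good << fields
    else
      bad << {
        file_line_number: index + 2, # the header occupies line 1
        error_message: "expected #{expected} fields, got #{fields.size}",
        raw_line: line,
      }
    end
  end
  [good, bad]
end

good, bad = sketch_collect_bad_rows(["id,name", "1,Alice", "2,Bob,extra", "3,Carol"])
# good holds two rows; bad holds one record pointing at file line 3
```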
212
+
213
+ Cap the number of tolerated bad rows and limit field sizes to guard against malformed input:
214
+
215
+ ```ruby
216
+ SmarterCSV.process('untrusted.csv',
217
+ on_bad_row: :skip,
218
+ bad_row_limit: 10,
219
+ field_size_limit: 4096)
220
+ ```
221
+
222
+ See [Bad Row Quarantine](./bad_row_quarantine.md).
223
+
224
+ ---
225
+
226
+ ## Example 8: Writing CSV
227
+
228
+ ```ruby
229
+ records = [
230
+ { name: "Alice", age: 30, city: "New York" },
231
+ { name: "Bob", age: 25, city: "Chicago" },
232
+ ]
233
+
234
+ SmarterCSV.generate('output.csv') do |csv|
235
+ records.each { |r| csv << r }
236
+ end
237
+ # output.csv:
238
+ # name,age,city
239
+ # Alice,30,New York
240
+ # Bob,25,Chicago
241
+ ```
242
+
243
+ Writing with header renaming and value converters:
244
+
245
+ ```ruby
246
+ require 'date'
247
+
248
+ SmarterCSV.generate('report.csv',
249
+ map_headers: { name: 'Full Name', dob: 'Date of Birth' },
250
+ value_converters: { dob: ->(v) { v&.strftime('%m/%d/%Y') } },
251
+ ) do |csv|
252
+ User.find_each { |u| csv << { name: u.full_name, dob: u.dob } }
253
+ end
254
+ ```
255
+
256
+ See [The Basic Write API](./basic_write_api.md).
257
+
258
+ ---
259
+
260
+ ## Example 9: Using `each` and `each_chunk` Enumerators
261
+
262
+ The modern API gives you full Enumerable power without loading the whole file:
263
+
264
+ ```ruby
265
+ # each — one hash per row
266
+ reader = SmarterCSV::Reader.new('data.csv')
267
+ reader.each { |hash| MyModel.upsert(hash) }
268
+ puts reader.headers.inspect # accessible after processing
269
+
270
+ # Enumerable methods
271
+ active_users = reader.select { |h| h[:status] == 'active' }
272
+ names = reader.map { |h| h[:name] }
273
+
274
+ # Lazy — stop early without reading the whole file
275
+ first_ten_active = reader.lazy.select { |h| h[:active] }.first(10)
276
+
277
+ # each_slice — manual batching without chunk_size
278
+ reader.each_slice(500) { |batch| MyModel.insert_all(batch) }
279
+ ```
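The early-exit behaviour of `lazy` comes from Ruby's `Enumerable`, not from anything CSV-specific. The sketch below demonstrates it with a hypothetical `SketchReader` class that counts how many rows it has yielded; it stands in for a reader and is not SmarterCSV code.

```ruby
# Hypothetical reader: yields in-memory rows and counts how many were read.
class SketchReader
  include Enumerable

  attr_reader :rows_read

  def initialize(rows)
    @rows = rows
    @rows_read = 0
  end

  def each
    return enum_for(:each) unless block_given?

    @rows.each do |row|
      @rows_read += 1
      yield row
    end
  end
end

rows = (1..1000).map { |i| { id: i, active: i.even? } }
reader = SketchReader.new(rows)
first_three = reader.lazy.select { |h| h[:active] }.first(3)
# first_three has ids 2, 4, 6; reader.rows_read is far below 1000
```

An eager `reader.select { ... }.first(3)` would return the same three hashes but only after iterating all 1000 rows.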
280
+
281
+ See [Batch Processing](./batch_processing.md) and [The Basic Read API](./basic_read_api.md).
282
+
283
+ ---
284
+
285
+ ## Example 10: Importing into a Database
286
+
287
+ ```ruby
288
+ filename = '/tmp/some.csv'
289
+ options = { key_mapping: { unwanted_row: nil, old_row_name: :new_name } }
290
+
291
+ n = SmarterCSV.process(filename, options) do |array|
292
+ MyModel.create(array.first)
293
+ end
294
+ # => returns number of rows processed
295
+ ```
296
+
297
+ ---
298
+
299
+ ## Example 11: Batch Processing with Sidekiq
300
+
301
+ Processing in chunks reduces memory usage and enables parallel processing. The block receives the chunk as its first parameter and, optionally, the 0-based `chunk_index` as a second parameter:
302
+
303
+ ```ruby
304
+ filename = '/tmp/input.csv'
305
+
306
+ n = SmarterCSV.process(filename, chunk_size: 100) do |chunk, chunk_index|
307
+ puts "Queueing chunk #{chunk_index} with #{chunk.size} records..."
308
+ Sidekiq::Client.push_bulk(
309
+ 'class' => SidekiqWorkerClass,
310
+ 'args' => chunk,
311
+ )
312
+ end
313
+ # => returns number of chunks
314
+ ```
315
+
316
+ See [Batch Processing](./batch_processing.md).
317
+
318
+ ---
319
+
320
+ ## Example 12: Resumable CSV Import with Rails ActiveJob (Rails 8.1+)
321
+
322
+ Rails 8.1 introduced `ActiveJob::Continuable`, which lets a job pause and resume from exactly where it stopped — for example during a deployment or queue drain.
323
+
324
+ ```ruby
325
+ # app/jobs/import_csv_job.rb
326
+ class ImportCsvJob < ApplicationJob
327
+ include ActiveJob::Continuable
328
+
329
+ def perform(file_path)
330
+ step :import_rows do |step|
331
+ SmarterCSV.process(file_path, chunk_size: 500) do |chunk, chunk_index|
332
+ next if chunk_index < step.cursor.to_i # skip already-processed chunks on resume
333
+
334
+ MyModel.import!(chunk)
335
+ step.set! chunk_index + 1
336
+ end
77
337
  end
78
- => returns number of chunks
338
+ end
339
+ end
340
+ ```
341
+
342
+ - `step.cursor` starts as `nil` (→ `0`), so the first run processes all chunks.
343
+ - If interrupted after chunk 7, Rails persists the cursor as `8`.
344
+ - On the next run chunks 0–7 are skipped quickly via `next`; processing resumes from chunk 8.
345
+
346
+ > Requires Rails 8.1+ and a queue adapter that supports graceful shutdown (Sidekiq, Solid Queue).
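The cursor arithmetic in the bullets above can be checked with a plain-Ruby simulation. `sketch_run_import` is a hypothetical function that involves neither Rails nor SmarterCSV; it only reproduces the skip-and-advance logic.

```ruby
# Hypothetical simulation of the cursor logic.
# Returns the chunk indexes actually processed and the cursor to persist.
def sketch_run_import(chunk_count, cursor, interrupt_after: nil)
  processed = []
  chunk_count.times do |chunk_index|
    next if chunk_index < cursor.to_i # skip chunks finished in an earlier run

    processed << chunk_index
    cursor = chunk_index + 1          # advance only after the chunk succeeded
    break if chunk_index == interrupt_after
  end
  [processed, cursor]
end

first_run, saved_cursor = sketch_run_import(10, nil, interrupt_after: 7)
second_run, final_cursor = sketch_run_import(10, saved_cursor)
# first_run covers chunks 0..7, second_run covers chunks 8 and 9
```

Because the cursor is advanced only after a chunk has been imported, an interruption in the middle of a chunk causes that chunk to be retried, so the import step should be idempotent.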
347
+
348
+ ---
349
+
350
+ ## Example 13: Instrumentation
351
+
352
+ ```ruby
353
+ SmarterCSV.process('large_import.csv',
354
+ chunk_size: 1000,
355
+
356
+ on_start: ->(info) {
357
+ Rails.logger.info "Import started: #{info[:input]} (#{info[:file_size]} bytes)"
358
+ },
359
+
360
+ on_chunk: ->(info) {
361
+ Rails.logger.debug "Chunk #{info[:chunk_number]}: #{info[:rows_in_chunk]} rows"
362
+ },
363
+
364
+ on_complete: ->(stats) {
365
+ Rails.logger.info "Done: #{stats[:total_rows]} rows in #{stats[:duration].round(2)}s"
366
+ },
367
+ ) { |chunk| MyModel.insert_all(chunk) }
79
368
  ```
369
+
370
+ See [Instrumentation Hooks](./instrumentation.md).
371
+
372
+ --------------------
373
+ PREVIOUS: [Instrumentation Hooks](./instrumentation.md) | NEXT: [Real-World CSV Files](./real_world_csv.md) | UP: [README](../README.md)
data/docs/header_transformations.md CHANGED
@@ -2,6 +2,8 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
6
+ * [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md)
5
7
  * [Parsing Strategy](./parsing_strategy.md)
6
8
  * [The Basic Read API](./basic_read_api.md)
7
9
  * [The Basic Write API](./basic_write_api.md)
@@ -10,15 +12,55 @@
10
12
  * [Row and Column Separators](./row_col_sep.md)
11
13
  * [**Header Transformations**](./header_transformations.md)
12
14
  * [Header Validations](./header_validations.md)
15
+ * [Column Selection](./column_selection.md)
13
16
  * [Data Transformations](./data_transformations.md)
14
17
  * [Value Converters](./value_converters.md)
15
-
16
- --------------
18
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Instrumentation Hooks](./instrumentation.md)
20
+ * [Examples](./examples.md)
21
+ * [Real-World CSV Files](./real_world_csv.md)
22
+ * [SmarterCSV over the Years](./history.md)
23
+ * [Release Notes](./releases/1.16.0/changes.md)
24
+
25
+ --------------
17
26
 
18
27
  # Header Transformations
19
28
 
20
29
  By default SmarterCSV assumes that a CSV file has headers, and it automatically normalizes the headers and transforms them into Ruby symbols. You can completely customize or override this (see below).
21
30
 
31
+ ## Header Transformation Pipeline
32
+
33
+ When a CSV file is opened, the header line passes through the following steps in order:
34
+
35
+ ```
36
+ [user_provided_headers] ──► skips steps below; uses your array directly
37
+
38
+ ▼ (when headers come from the file)
39
+ comment_regexp ──► strip_chars_from_headers ──► split on col_sep
40
+ ──► strip quote_char ──► strip_whitespace
41
+ ──► [unless keep_original_headers]: gsub spaces/dashes→_ ──► downcase_header
42
+ ──► disambiguate_headers ──► symbolize ──► key_mapping
43
+ ```
44
+
45
+ | Step | Option | Default | Description |
46
+ |------|--------|---------|-------------|
47
+ | 1 | `comment_regexp` | `nil` | Strips a comment prefix from the raw header line (e.g. `# ` at start) |
48
+ | 2 | `strip_chars_from_headers` | `nil` | Removes characters matching a regexp from the raw header line (e.g. `/[\-"]/`) |
49
+ | 3 | *(split)* | `col_sep` | Splits the header line into individual column tokens |
50
+ | 4 | `quote_char` | `"` | Strips surrounding quote characters from each token |
51
+ | 5 | `strip_whitespace` | `true` | Strips leading/trailing whitespace from each header |
52
+ | 6 | *(normalize)* | — | Replaces spaces and dashes with `_` (`keep_original_headers` skips this and steps 7–9) |
53
+ | 7 | `downcase_header` | `true` | Downcases each header string |
54
+ | 8 | `duplicate_header_suffix` | `''` | Renames empty headers to `column_N`; appends suffix+number to duplicates |
55
+ | 9 | `strings_as_keys` | `false` | Converts headers to symbols (skipped if `true` or `keep_original_headers`) |
56
+ | 10 | `key_mapping` | `nil` | Renames or drops headers; use post-transformation key names as input |
57
+
58
+ > `user_provided_headers` bypasses all file header reading and transformation entirely — your array is used as-is. Versions >1.13 automatically set `headers_in_file: false` when `user_provided_headers` is given; if the file has a header row you want to skip, set `headers_in_file: true` explicitly.
59
+
60
+ See [Configuration Options](./options.md) for full option reference.
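A simplified plain-Ruby sketch of steps 3 to 9 with their default settings may help to visualise the pipeline. `sketch_normalize_headers` is a hypothetical helper, not SmarterCSV's implementation: it skips `comment_regexp`, `strip_chars_from_headers`, duplicate handling and `key_mapping`, and naming an empty header after its 1-based position is an assumption.

```ruby
# Hypothetical, simplified version of steps 3-9 of the table above.
def sketch_normalize_headers(line, col_sep: ',', quote_char: '"')
  tokens = line.chomp.split(col_sep, -1)         # step 3: split on col_sep
  tokens.map.with_index(1) do |token, position|
    name = token.delete_prefix(quote_char).delete_suffix(quote_char) # step 4
    name = name.strip                            # step 5: strip whitespace
    name = name.gsub(/[\s-]+/, '_')              # step 6: spaces/dashes -> _
    name = name.downcase                         # step 7: downcase
    name = "column_#{position}" if name.empty?   # step 8 (empty headers only)
    name.to_sym                                  # step 9: symbolize
  end
end

keys = sketch_normalize_headers('"First Name",Last-Name, AGE ,')
# keys starts with :first_name, :last_name, :age; the empty fourth header gets a placeholder
```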
61
+
62
+ ---
63
+
22
64
  ## Header Normalization
23
65
 
24
66
  When processing the headers, it transforms them into Ruby symbols, stripping extra spaces, lower-casing them and replacing spaces with underscores. e.g. " \t Annual Sales " becomes `:annual_sales`. (see Notes below)
@@ -81,16 +123,57 @@ end
81
123
 
82
124
  ## Key Mapping
83
125
 
84
- The above example already illustrates how intermediate keys can be mapped into something different.
85
- This transfoms some of the keys in the input, but other keys are still present.
126
+ `key_mapping:` renames CSV headers to the symbols your application expects. Any header not
127
+ listed in the mapping is kept as-is by default.
86
128
 
87
- There is an additional option `remove_unmapped_keys` which can be enabled to only produce the mapped keys in the resulting hashes, and drops any other columns.
129
+ ```ruby
130
+ # CSV headers: first_name, last_name, internal_id, created_at
131
+ data = SmarterCSV.process('contacts.csv',
132
+ key_mapping: { first_name: :given_name, last_name: :family_name },
133
+ )
134
+ # => [{given_name: "Alice", family_name: "Smith", internal_id: 42, created_at: "2026-01-01"}, ...]
135
+ # ^^^ renamed ^^^ unmapped keys kept as-is
136
+ ```
88
137
 
89
-
90
- ### NOTES on Key Mapping:
91
- * keys in the header line of the file can be re-mapped to a chosen set of symbols, so the resulting Hashes can be better used internally in your application (e.g. when directly creating MongoDB entries with them)
92
- * if you want to completely delete a key, then map it to nil or to '', they will be automatically deleted from any result Hash
93
- * if you have input files with a large number of columns, and you want to ignore all columns which are not specifically mapped with :key_mapping, then use option :remove_unmapped_keys => true
138
+ To delete a specific column, map it to `nil` — it will be removed from every row hash:
139
+
140
+ ```ruby
141
+ key_mapping: { internal_id: nil, created_at: nil } # drop these two columns
142
+ ```
143
+
144
+ ### `remove_unmapped_keys:` — drop everything not in the map
145
+
146
+ When you have files with many columns and only care about a few, listing every unwanted
147
+ column as `nil` is tedious. Use `remove_unmapped_keys: true` to implicitly drop any header
148
+ that has no entry in `key_mapping:`:
149
+
150
+ ```ruby
151
+ # CSV has 50 columns; you only want two of them, renamed
152
+ data = SmarterCSV.process('contacts.csv',
153
+ key_mapping: { first_name: :given_name, last_name: :family_name },
154
+ remove_unmapped_keys: true,
155
+ )
156
+ # => [{given_name: "Alice", family_name: "Smith"}, ...] # only the two mapped columns
157
+ ```
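The three behaviours described in this section (rename, drop via `nil`, and implicit drop of unmapped keys) can be summarised in a plain-Ruby sketch that operates on a single row hash. `sketch_apply_key_mapping` is a hypothetical helper for illustration, not the library's code.

```ruby
# Hypothetical helper mirroring the documented key_mapping semantics for one row.
def sketch_apply_key_mapping(row, key_mapping, remove_unmapped_keys: false)
  row.each_with_object({}) do |(key, value), result|
    if key_mapping.key?(key)
      new_key = key_mapping[key]
      result[new_key] = value unless new_key.nil? # nil target drops the column
    elsif !remove_unmapped_keys
      result[key] = value                          # unmapped keys kept as-is
    end
  end
end

row = { first_name: "Alice", last_name: "Smith", internal_id: 42, created_at: "2026-01-01" }
mapping = { first_name: :given_name, last_name: :family_name, internal_id: nil }

kept    = sketch_apply_key_mapping(row, mapping)
trimmed = sketch_apply_key_mapping(row, mapping, remove_unmapped_keys: true)
# kept still contains :created_at; trimmed contains only the two renamed keys
```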
158
+
159
+ ### `remove_unmapped_keys:` vs `headers: { only: }`
160
+
161
+ Both achieve column selection, but they serve different purposes:
162
+
163
+ | | `remove_unmapped_keys: true` | `headers: { only: [...] }` |
164
+ |---|---|---|
165
+ | Use when | Already using `key_mapping:` and want to implicitly drop the rest | Pure column selection, no renaming needed |
166
+ | Performance | Post-parse filter — all fields parsed, unmapped keys deleted | **C-path early exit** — unneeded fields never parsed |
167
+ | Renaming | Yes — combines selection and rename in one step | No renaming (use `key_mapping:` alongside if needed) |
168
+
169
+ For wide files where performance matters, prefer `headers: { only: }` — it skips unneeded
170
+ fields entirely inside the C parser and can be **10–14× faster** on very wide files.
171
+ Use `remove_unmapped_keys: true` when you are already remapping headers and the convenience
172
+ of a single option outweighs the (usually small) performance difference.
173
+
174
+ See [Column Selection](./column_selection.md) for full details on `headers: { only: }`.
175
+
176
+ > **Note:** Key mapping is particularly useful when importing CSV data directly into a database or document store. By remapping headers to the exact symbol names your application uses internally (e.g. ActiveRecord attributes, DynamoDB document keys, Sidekiq job parameters), you can pass the resulting hashes directly without any further transformation.
94
177
 
95
178
  ## CSV Files without Headers
96
179
 
@@ -124,5 +207,4 @@ For CSV files with headers, you can either:
124
207
  * some CSV files use un-escaped quotation characters inside fields. This can cause the import to break. To get around this, set the `quote_char` to something different, e.g. `quote_char: "%"`, or try setting `:strip_chars_from_headers => /[\-"]/`
125
208
 
126
209
  ---------------
127
- PREVIOUS: [Row and Column Separators](./row_col_sep.md) | NEXT: [Header Validations](./header_validations.md)
128
-
210
+ PREVIOUS: [Row and Column Separators](./row_col_sep.md) | NEXT: [Header Validations](./header_validations.md) | UP: [README](../README.md)
data/docs/header_validations.md CHANGED
@@ -2,6 +2,8 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
6
+ * [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md)
5
7
  * [Parsing Strategy](./parsing_strategy.md)
6
8
  * [The Basic Read API](./basic_read_api.md)
7
9
  * [The Basic Write API](./basic_write_api.md)
@@ -10,43 +12,80 @@
10
12
  * [Row and Column Separators](./row_col_sep.md)
11
13
  * [Header Transformations](./header_transformations.md)
12
14
  * [**Header Validations**](./header_validations.md)
15
+ * [Column Selection](./column_selection.md)
13
16
  * [Data Transformations](./data_transformations.md)
14
17
  * [Value Converters](./value_converters.md)
15
-
16
- --------------
18
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Instrumentation Hooks](./instrumentation.md)
20
+ * [Examples](./examples.md)
21
+ * [Real-World CSV Files](./real_world_csv.md)
22
+ * [SmarterCSV over the Years](./history.md)
23
+ * [Release Notes](./releases/1.16.0/changes.md)
24
+
25
+ --------------
17
26
 
18
27
  # Header Validations
19
28
 
20
- When you are importing data, it can be important to verify that all required data is present, to ensure consistent quality when importing data.
29
+ When importing data, it is important to verify that all required columns are present; catching a missing column up front is far better than a cryptic error later, when your code tries to access a key that was never populated.
21
30
 
22
- You can use the `required_keys` option to specify an array of hash keys that you require to be present at a minimum for every data row (after header transformation).
31
+ ## `required_keys`
23
32
 
24
- If these keys are not present, `SmarterCSV::MissingKeys` will be raised to inform you of the data inconsistency.
33
+ Use `required_keys` to specify an array of hash keys that must be present after header transformation. Validation runs once, after the header row is parsed and all header transformations (downcase, symbolize, `key_mapping`) have been applied, so use the **transformed** key names, not the raw CSV header strings.
25
34
 
26
- ## Example
35
+ If any required key is absent, `SmarterCSV::MissingKeys` is raised before any data rows are processed.
27
36
 
28
37
  ```ruby
29
- options = {
30
- required_keys: [:source_account, :destination_account, :amount]
31
- }
32
- data = SmarterCSV.process("/tmp/transactions.csv", options)
33
-
34
- => this will raise SmarterCSV::MissingKeys if any row does not contain these three keys
38
+ options = {
39
+ required_keys: [:source_account, :destination_account, :amount]
40
+ }
41
+ data = SmarterCSV.process('/tmp/transactions.csv', options)
42
+ # => raises SmarterCSV::MissingKeys if any of the three columns are missing
35
43
  ```
36
44
 
37
- ## Handling Missing Keys Programmatically
45
+ ### Accessing the missing keys
38
46
 
39
- When `SmarterCSV::MissingKeys` is raised, you can access the missing keys directly via the `keys` accessor, without parsing the error message:
47
+ `SmarterCSV::MissingKeys` exposes the missing keys via the `keys` accessor:
40
48
 
41
49
  ```ruby
42
50
  begin
43
- options = { required_keys: [:source_account, :destination_account, :amount] }
44
- data = SmarterCSV.process("/tmp/transactions.csv", options)
51
+ data = SmarterCSV.process('/tmp/transactions.csv',
52
+ required_keys: [:source_account, :destination_account, :amount])
45
53
  rescue SmarterCSV::MissingKeys => e
46
54
  puts "Missing columns: #{e.keys.join(', ')}"
47
- # => e.keys returns [:amount] (array of missing key symbols)
55
+ # => "Missing columns: amount"
48
56
  end
49
57
  ```
50
58
 
59
+ ### Interaction with `key_mapping`
60
+
61
+ `required_keys` uses the **post-mapping** key names. If you remap CSV headers, reference the mapped names:
62
+
63
+ ```ruby
64
+ options = {
65
+ key_mapping: { acct_from: :source_account, acct_to: :destination_account },
66
+ required_keys: [:source_account, :destination_account, :amount],
67
+ }
68
+ ```
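The reason the post-mapping names are the ones that count can be shown with a plain-Ruby sketch: the mapping is applied to the headers first, and the required keys are compared against the result. `SketchMissingKeys` and `sketch_check_required!` are hypothetical names that imitate the documented behaviour, including the `keys` accessor.

```ruby
# Hypothetical error class imitating the documented `keys` accessor.
class SketchMissingKeys < StandardError
  attr_reader :keys

  def initialize(keys)
    @keys = keys
    super("missing required keys: #{keys.join(', ')}")
  end
end

# Apply the mapping to the headers, then check the required keys against the result.
def sketch_check_required!(file_headers, key_mapping, required_keys)
  mapped = file_headers.map { |header| key_mapping.fetch(header, header) }.compact
  missing = required_keys - mapped
  raise SketchMissingKeys.new(missing) unless missing.empty?

  mapped
end

mapping  = { acct_from: :source_account, acct_to: :destination_account }
required = [:source_account, :destination_account, :amount]

ok_headers = sketch_check_required!([:acct_from, :acct_to, :amount], mapping, required)

missing_keys =
  begin
    sketch_check_required!([:acct_from, :acct_to], mapping, required)
    []
  rescue SketchMissingKeys => e
    e.keys
  end
# missing_keys lists :amount, the only required column absent from the second file
```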
69
+
70
+ ---
71
+
72
+ ## `silence_missing_keys`
73
+
74
+ When using `key_mapping`, SmarterCSV raises `SmarterCSV::KeyMappingError` if a mapped key is not found in the CSV header. Use `silence_missing_keys` to make some or all mapped keys optional:
75
+
76
+ ```ruby
77
+ # All mapped keys are optional — no error if any are absent
78
+ options = {
79
+ key_mapping: { optional_field: :my_field, required_field: :other_field },
80
+ silence_missing_keys: true,
81
+ }
82
+
83
+ # Only specific mapped keys are optional
84
+ options = {
85
+ key_mapping: { optional_field: :my_field, required_field: :other_field },
86
+ silence_missing_keys: [:optional_field],
87
+ }
88
+ ```
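The difference between `true` and an array can be made concrete with a plain-Ruby sketch. `sketch_missing_mapping_keys` is a hypothetical helper that returns the mapped source keys which would be reported as missing; it imitates the behaviour described above and is not the library's code.

```ruby
# Hypothetical helper: which key_mapping source keys would be reported as missing?
def sketch_missing_mapping_keys(headers, key_mapping, silence_missing_keys: false)
  return [] if silence_missing_keys == true # everything is optional

  silenced = silence_missing_keys.is_a?(Array) ? silence_missing_keys : []
  key_mapping.keys - headers - silenced
end

headers = [:required_field] # the file lacks :optional_field
mapping = { optional_field: :my_field, required_field: :other_field }

strict    = sketch_missing_mapping_keys(headers, mapping)
all_quiet = sketch_missing_mapping_keys(headers, mapping, silence_missing_keys: true)
selective = sketch_missing_mapping_keys(headers, mapping, silence_missing_keys: [:optional_field])
# strict reports :optional_field; the other two report nothing
```

The array form is the safer default for mixed mappings, because a missing `:required_field` would still be reported.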
89
+
51
90
  ----------------
52
- PREVIOUS: [Header Transformations](./header_transformations.md) | NEXT: [Data Transformations](./data_transformations.md)
91
+ PREVIOUS: [Header Transformations](./header_transformations.md) | NEXT: [Column Selection](./column_selection.md) | UP: [README](../README.md)