smarter_csv 1.15.2 → 1.16.0

Files changed (48)
  1. checksums.yaml +4 -4
  2. data/.rubocop.yml +9 -0
  3. data/CHANGELOG.md +68 -1
  4. data/CONTRIBUTORS.md +3 -1
  5. data/Gemfile +1 -0
  6. data/README.md +123 -27
  7. data/docs/_introduction.md +40 -24
  8. data/docs/bad_row_quarantine.md +285 -0
  9. data/docs/basic_read_api.md +151 -9
  10. data/docs/basic_write_api.md +474 -59
  11. data/docs/batch_processing.md +161 -4
  12. data/docs/column_selection.md +183 -0
  13. data/docs/data_transformations.md +162 -29
  14. data/docs/examples.md +339 -46
  15. data/docs/header_transformations.md +93 -12
  16. data/docs/header_validations.md +56 -18
  17. data/docs/history.md +117 -0
  18. data/docs/instrumentation.md +165 -0
  19. data/docs/migrating_from_csv.md +290 -0
  20. data/docs/options.md +150 -87
  21. data/docs/parsing_strategy.md +63 -1
  22. data/docs/real_world_csv.md +262 -0
  23. data/docs/releases/1.16.0/benchmarks.md +223 -0
  24. data/docs/releases/1.16.0/changes.md +272 -0
  25. data/docs/releases/1.16.0/performance_notes.md +114 -0
  26. data/docs/row_col_sep.md +14 -5
  27. data/docs/value_converters.md +193 -57
  28. data/ext/smarter_csv/extconf.rb +3 -0
  29. data/ext/smarter_csv/smarter_csv.c +1007 -71
  30. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png +0 -0
  31. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.svg +108 -0
  32. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.png +0 -0
  33. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg +141 -0
  34. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.png +0 -0
  35. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.svg +139 -0
  36. data/lib/smarter_csv/errors.rb +8 -0
  37. data/lib/smarter_csv/file_io.rb +1 -1
  38. data/lib/smarter_csv/hash_transformations.rb +14 -13
  39. data/lib/smarter_csv/header_transformations.rb +21 -2
  40. data/lib/smarter_csv/headers.rb +2 -1
  41. data/lib/smarter_csv/options.rb +124 -7
  42. data/lib/smarter_csv/parser.rb +362 -75
  43. data/lib/smarter_csv/reader.rb +494 -46
  44. data/lib/smarter_csv/version.rb +1 -1
  45. data/lib/smarter_csv/writer.rb +71 -19
  46. data/lib/smarter_csv.rb +95 -12
  47. data/smarter_csv.gemspec +20 -10
  48. metadata +37 -80
data/docs/examples.md CHANGED
@@ -2,6 +2,7 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
5
6
  * [Parsing Strategy](./parsing_strategy.md)
6
7
  * [The Basic Read API](./basic_read_api.md)
7
8
  * [The Basic Write API](./basic_write_api.md)
@@ -10,70 +11,362 @@
10
11
  * [Row and Column Separators](./row_col_sep.md)
11
12
  * [Header Transformations](./header_transformations.md)
12
13
  * [Header Validations](./header_validations.md)
14
+ * [Column Selection](./column_selection.md)
13
15
  * [Data Transformations](./data_transformations.md)
14
16
  * [Value Converters](./value_converters.md)
15
-
16
- --------------
17
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
18
+ * [Instrumentation Hooks](./instrumentation.md)
19
+ * [**Examples**](./examples.md)
20
+ * [Real-World CSV Files](./real_world_csv.md)
21
+ * [SmarterCSV over the Years](./history.md)
22
+ * [Release Notes](./releases/1.16.0/changes.md)
23
+
24
+ --------------
17
25
 
18
26
  # Examples
19
27
 
20
- Here are some examples to demonstrate the versatility of SmarterCSV.
28
+ **Rescue from `SmarterCSV::Error` (recommended):** SmarterCSV auto-detects row and column separators. In rare cases detection fails and raises an exception (e.g. `NoColSepDetected`). Rescuing from `SmarterCSV::Error` ensures your application handles unexpected CSV formats gracefully.
29
+
30
+ ---
21
31
 
22
- **It is generally recommended to rescue `SmarterCSV::Error` or it's sub-classes.**
32
+ 1. [CSV Array of Hashes](#example-1-csv--array-of-hashes)
33
+ 2. [Parsing a CSV String](#example-2-parsing-a-csv-string)
34
+ 3. [Key Mapping and Column Selection](#example-3-key-mapping-and-column-selection)
35
+ 4. [Encoding and Preamble Skip](#example-4-encoding-and-preamble-skip)
36
+ 5. [Value Converters](#example-5-value-converters)
37
+ 6. [Header Validation](#example-6-header-validation)
38
+ 7. [Bad Row Handling](#example-7-bad-row-handling)
39
+ 8. [Writing CSV](#example-8-writing-csv)
40
+ 9. [Using `each` and `each_chunk` Enumerators](#example-9-using-each-and-each_chunk-enumerators)
41
+ 10. [Importing into a Database](#example-10-importing-into-a-database)
42
+ 11. [Batch Processing with Sidekiq](#example-11-batch-processing-with-sidekiq)
43
+ 12. [Resumable CSV Import with Rails ActiveJob](#example-12-resumable-csv-import-with-rails-activejob-rails-81)
44
+ 13. [Instrumentation](#example-13-instrumentation)
23
45
 
24
- By default SmarterCSV determines the `row_sep` and `col_sep` values automatically. In cases where the automatic detection fails, an exception will be raised, e.g. `NoColSepDetected`. Rescuing from these exceptions will make sure that you don't miss processing CSV files, in case users upload CSV files with unexpected formats.
46
+ ---
25
47
 
26
- In rare cases you may have to manually set these values, after going through the troubleshooting procedure described above.
48
+ ## Example 1: CSV Array of Hashes
27
49
 
28
- ## Example 1a: How SmarterCSV processes CSV-files as array of hashes:
29
- Please note how each hash contains only the keys for columns with non-null values.
50
+ Each hash contains only the keys for columns with non-nil, non-empty values; columns with blank entries are omitted automatically:
30
51
 
31
52
  ```ruby
32
- $ cat pets.csv
33
- first name,last name,dogs,cats,birds,fish
34
- Dan,McAllister,2,,,
35
- Lucy,Laweless,,5,,
36
- Miles,O'Brian,,,,21
37
- Nancy,Homes,2,,1,
38
- $ irb
39
- > require 'smarter_csv'
40
- => true
41
- > pets_by_owner = SmarterCSV.process('/tmp/pets.csv')
42
- => [ {:first_name=>"Dan", :last_name=>"McAllister", :dogs=>"2"},
43
- {:first_name=>"Lucy", :last_name=>"Laweless", :cats=>"5"},
44
- {:first_name=>"Miles", :last_name=>"O'Brian", :fish=>"21"},
45
- {:first_name=>"Nancy", :last_name=>"Homes", :dogs=>"2", :birds=>"1"}
46
- ]
53
+ $ cat pets.csv
54
+ first name,last name,dogs,cats,birds,fish
55
+ Dan,McAllister,2,,,
56
+ Lucy,Laweless,,5,,
57
+ Miles,O'Brian,,,,21
58
+ Nancy,Homes,2,,1,
59
+
60
+ $ irb
61
+ > require 'smarter_csv'
62
+ > pets_by_owner = SmarterCSV.process('pets.csv')
63
+ => [ {first_name: "Dan", last_name: "McAllister", dogs: 2},
64
+ {first_name: "Lucy", last_name: "Laweless", cats: 5},
65
+ {first_name: "Miles", last_name: "O'Brian", fish: 21},
66
+ {first_name: "Nancy", last_name: "Homes", dogs: 2, birds: 1}
67
+ ]
47
68
  ```
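The "blank columns are omitted" behaviour in Example 1 can be pictured with a few lines of plain Ruby. This is a minimal sketch under stated assumptions: the helper `rows_without_blank_values` is hypothetical, it splits naively on commas (no quoting support), and it leaves values as strings. It is not SmarterCSV's parser.

```ruby
# Sketch only: naive comma split without quote handling, NOT SmarterCSV's parser.
def rows_without_blank_values(csv_text)
  lines = csv_text.lines.map(&:chomp).reject(&:empty?)
  # Rough header normalization: trim, downcase, spaces -> "_", then symbolize.
  headers = lines.shift.split(',').map { |h| h.strip.downcase.tr(' ', '_').to_sym }

  lines.map do |line|
    values = line.split(',', -1) # -1 keeps trailing empty fields
    headers.zip(values)
           .reject { |_key, value| value.nil? || value.strip.empty? }
           .to_h
  end
end

PETS_CSV = <<~CSV
  first name,last name,dogs,cats,birds,fish
  Dan,McAllister,2,,,
  Miles,O'Brian,,,,21
CSV

p rows_without_blank_values(PETS_CSV)
```

Each resulting hash carries only the keys that had a value in that row, which is why the hashes in the irb session above have different key sets.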
48
69
 
70
+ ---
71
+
72
+ ## Example 2: Parsing a CSV String
73
+
74
+ Use `SmarterCSV.parse` to parse a CSV string directly — no file needed. Useful in tests, API responses, or when the CSV arrives as a string in memory:
49
75
 
50
- ## Example 3: Populate a MySQL or MongoDB Database with SmarterCSV:
51
76
  ```ruby
52
- # without using chunks:
53
- filename = '/tmp/some.csv'
54
- options = {:key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}
55
- n = SmarterCSV.process(filename, options) do |array|
56
- # we're passing a block in, to process each resulting hash / =row (the block takes array of hashes)
57
- # when chunking is not enabled, there is only one hash in each array
58
- MyModel.create( array.first )
59
- end
77
+ csv_string = <<~CSV
78
+ name,age,city
79
+ Alice,30,New York
80
+ Bob,25,Chicago
81
+ CSV
60
82
 
61
- => returns number of chunks / rows we processed
83
+ data = SmarterCSV.parse(csv_string)
84
+ # => [{name: "Alice", age: 30, city: "New York"}, {name: "Bob", age: 25, city: "Chicago"}]
62
85
  ```
63
86
 
64
- ## Example 4: Processing a CSV File, and inserting batch jobs in Sidekiq:
65
- The block receives an optional second parameter `chunk_index` (0-based) for progress tracking:
87
+ See [The Basic Read API](./basic_read_api.md) and [Migrating from Ruby CSV](./migrating_from_csv.md).
88
+
89
+ ---
90
+
91
+ ## Example 3: Key Mapping and Column Selection
92
+
93
+ Rename headers and drop unwanted columns in one pass:
94
+
95
+ ```ruby
96
+ options = {
97
+ key_mapping: {
98
+ first_name: :fname,
99
+ last_name: :lname,
100
+ dob: :birth_date,
101
+ ssn: nil, # drop this column entirely
102
+ },
103
+ }
104
+ data = SmarterCSV.process('people.csv', options)
105
+ # => [{fname: "Alice", lname: "Smith", birth_date: "1990-05-14"}, ...]
106
+ # ↑ :ssn is gone; original CSV headers remapped to your domain names
107
+ ```
108
+
109
+ Keep only specific columns using `headers: { only: }`:
110
+
111
+ ```ruby
112
+ data = SmarterCSV.process('people.csv', headers: { only: [:name, :email] })
113
+ # => [{name: "Alice", email: "alice@example.com"}, ...]
114
+ ```
115
+
116
+ See [Header Transformations](./header_transformations.md) and [Column Selection](./column_selection.md).
117
+
118
+ ---
119
+
120
+ ## Example 4: Encoding and Preamble Skip
121
+
122
+ Handle non-UTF-8 files and metadata rows before the header:
123
+
124
+ ```ruby
125
+ # Bank statement export: Windows-1252, 3 preamble rows, then header
126
+ data = SmarterCSV.process('statement.csv',
127
+ file_encoding: 'windows-1252',
128
+ skip_lines: 3)
129
+
130
+ # European lab instrument export: semicolon-separated, Latin-1
131
+ data = SmarterCSV.process('results.csv',
132
+ file_encoding: 'iso-8859-1',
133
+ col_sep: :auto) # :auto detects the semicolon
134
+ ```
135
+
136
+ See [Row and Column Separators](./row_col_sep.md) and [Real-World CSV Files](./real_world_csv.md).
137
+
138
+ ---
139
+
140
+ ## Example 5: Value Converters
141
+
142
+ Transform raw strings into typed values — dates, booleans, currency:
143
+
66
144
  ```ruby
67
- filename = '/tmp/input.csv' # CSV file containing ids or data to process
68
- options = { :chunk_size => 100 }
69
- n = SmarterCSV.process(filename, options) do |chunk, chunk_index|
70
- puts "Queueing chunk #{chunk_index} with #{chunk.size} records..."
71
- Sidekiq::Client.push_bulk(
72
- 'class' => SidekiqIndividualWorkerClass,
73
- 'args' => chunk,
74
- )
75
- # OR:
76
- # SidekiqBatchWorkerClass.process_async(chunk) # pass an array of hashes to Sidekiq workers for parallel processing
145
+ require 'date'
146
+
147
+ data = SmarterCSV.process('records.csv',
148
+ value_converters: {
149
+ # Parse US date format
150
+ dob: ->(v) { v ? Date.strptime(v, '%m/%d/%Y') : nil },
151
+
152
+ # Strip currency symbol and convert to Float
153
+ price: ->(v) { v&.delete('$,')&.to_f },
154
+
155
+ # Boolean: true only for "true" (case-insensitive); other strings give false, nil stays nil
156
+ active: ->(v) { v&.match?(/\Atrue\z/i) },
157
+ })
158
+
159
+ data.first[:dob] # => #<Date: 1990-05-14>
160
+ data.first[:price] # => 44.5
161
+ data.first[:active] # => true
162
+ ```
163
+
164
+ Combining with `nil_values_matching` to clean sentinel values before conversion:
165
+
166
+ ```ruby
167
+ data = SmarterCSV.process('export.csv',
168
+ nil_values_matching: /\A(N\/A|NULL|#N\/A)\z/i,
169
+ value_converters: {
170
+ score: ->(v) { v&.to_f }, # v is nil for N/A rows — guard with &.
171
+ })
172
+ ```
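The converter lambdas above are ordinary Ruby callables, so their nil handling can be checked in isolation. The sketch below reuses the three lambdas from Example 5; the `CONVERTERS` constant and the `convert_row` helper are hypothetical names introduced only for illustration and are not part of SmarterCSV.

```ruby
require 'date'

# Same lambdas as in Example 5. Each one guards against nil, which is what a
# field holds once a sentinel such as "N/A" has been turned into nil.
CONVERTERS = {
  dob:    ->(v) { v ? Date.strptime(v, '%m/%d/%Y') : nil },
  price:  ->(v) { v&.delete('$,')&.to_f },
  active: ->(v) { v&.match?(/\Atrue\z/i) },
}.freeze

# Hypothetical helper: apply the converter registered for a key, if there is one.
def convert_row(row, converters = CONVERTERS)
  row.to_h do |key, value|
    [key, converters.key?(key) ? converters[key].call(value) : value]
  end
end

p convert_row({ dob: '05/14/1990', price: '$1,044.50', active: 'TRUE', name: 'Alice' })
p convert_row({ dob: nil, price: nil, active: nil })
```

Writing every converter so that it tolerates `nil` keeps the conversion step independent of whichever option produced the nil in the first place.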
173
+
174
+ See [Value Converters](./value_converters.md).
175
+
176
+ ---
177
+
178
+ ## Example 6: Header Validation
179
+
180
+ Raise early if required columns are missing, before processing any data rows:
181
+
182
+ ```ruby
183
+ begin
184
+ data = SmarterCSV.process('transactions.csv',
185
+ required_keys: [:account_id, :amount, :currency])
186
+ rescue SmarterCSV::MissingKeys => e
187
+ puts "CSV is missing required columns: #{e.keys.join(', ')}"
188
+ # => "CSV is missing required columns: currency"
189
+ end
190
+ ```
191
+
192
+ See [Header Validations](./header_validations.md).
193
+
194
+ ---
195
+
196
+ ## Example 7: Bad Row Handling
197
+
198
+ Collect parse errors without stopping the import:
199
+
200
+ ```ruby
201
+ reader = SmarterCSV::Reader.new('data.csv', on_bad_row: :collect)
202
+ good_rows = reader.process
203
+
204
+ bad = reader.errors[:bad_rows]
205
+ puts "Imported #{good_rows.size} rows, #{bad.size} bad rows"
206
+ bad.each do |rec|
207
+ puts "Line #{rec[:file_line_number]}: #{rec[:error_message]}"
208
+ puts " Raw: #{rec[:raw_line]}"
209
+ end
210
+ ```
211
+
212
+ Cap the number of tolerated bad rows and limit field sizes to guard against malformed input:
213
+
214
+ ```ruby
215
+ SmarterCSV.process('untrusted.csv',
216
+ on_bad_row: :skip,
217
+ bad_row_limit: 10,
218
+ field_size_limit: 4096)
219
+ ```
220
+
221
+ See [Bad Row Quarantine](./bad_row_quarantine.md).
222
+
223
+ ---
224
+
225
+ ## Example 8: Writing CSV
226
+
227
+ ```ruby
228
+ records = [
229
+ { name: "Alice", age: 30, city: "New York" },
230
+ { name: "Bob", age: 25, city: "Chicago" },
231
+ ]
232
+
233
+ SmarterCSV.generate('output.csv') do |csv|
234
+ records.each { |r| csv << r }
235
+ end
236
+ # output.csv:
237
+ # name,age,city
238
+ # Alice,30,New York
239
+ # Bob,25,Chicago
240
+ ```
241
+
242
+ Writing with header renaming and value converters:
243
+
244
+ ```ruby
245
+ require 'date'
246
+
247
+ SmarterCSV.generate('report.csv',
248
+ map_headers: { name: 'Full Name', dob: 'Date of Birth' },
249
+ value_converters: { dob: ->(v) { v&.strftime('%m/%d/%Y') } },
250
+ ) do |csv|
251
+ User.find_each { |u| csv << { name: u.full_name, dob: u.dob } }
252
+ end
253
+ ```
254
+
255
+ See [The Basic Write API](./basic_write_api.md).
256
+
257
+ ---
258
+
259
+ ## Example 9: Using `each` and `each_chunk` Enumerators
260
+
261
+ The modern API gives you full Enumerable power without loading the whole file:
262
+
263
+ ```ruby
264
+ # each — one hash per row
265
+ reader = SmarterCSV::Reader.new('data.csv')
266
+ reader.each { |hash| MyModel.upsert(hash) }
267
+ puts reader.headers.inspect # accessible after processing
268
+
269
+ # Enumerable methods
270
+ active_users = reader.select { |h| h[:status] == 'active' }
271
+ names = reader.map { |h| h[:name] }
272
+
273
+ # Lazy — stop early without reading the whole file
274
+ first_ten_active = reader.lazy.select { |h| h[:active] }.first(10)
275
+
276
+ # each_slice — manual batching without chunk_size
277
+ reader.each_slice(500) { |batch| MyModel.insert_all(batch) }
278
+ ```
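Why `lazy` allows an early stop can be seen with a plain Ruby `Enumerator`. In this sketch `fake_rows` and `ROWS_READ` are hypothetical stand-ins for a row source and a read counter; nothing here touches `SmarterCSV::Reader`.

```ruby
# Stand-in row source that counts how many rows were actually produced.
ROWS_READ = [0]

def fake_rows(total)
  Enumerator.new do |yielder|
    total.times do |i|
      ROWS_READ[0] += 1
      yielder << { id: i, active: i.even? }
    end
  end
end

first_three = fake_rows(1_000).lazy.select { |h| h[:active] }.first(3)
p first_three.map { |h| h[:id] }
puts "rows produced: #{ROWS_READ[0]} of 1000"
```

Only as many rows as needed to find three matches are pulled from the source; an eager `select` would have consumed all 1,000 first.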
279
+
280
+ See [Batch Processing](./batch_processing.md) and [The Basic Read API](./basic_read_api.md).
281
+
282
+ ---
283
+
284
+ ## Example 10: Importing into a Database
285
+
286
+ ```ruby
287
+ filename = '/tmp/some.csv'
288
+ options = { key_mapping: { unwanted_row: nil, old_row_name: :new_name } }
289
+
290
+ n = SmarterCSV.process(filename, options) do |array|
291
+ MyModel.create(array.first)
292
+ end
293
+ # => returns number of rows processed
294
+ ```
295
+
296
+ ---
297
+
298
+ ## Example 11: Batch Processing with Sidekiq
299
+
300
+ Processing in chunks reduces memory usage and enables parallel processing. The block receives the chunk and, as an optional second parameter, the 0-based `chunk_index`:
301
+
302
+ ```ruby
303
+ filename = '/tmp/input.csv'
304
+
305
+ n = SmarterCSV.process(filename, chunk_size: 100) do |chunk, chunk_index|
306
+ puts "Queueing chunk #{chunk_index} with #{chunk.size} records..."
307
+ Sidekiq::Client.push_bulk(
308
+ 'class' => SidekiqWorkerClass,
309
+ 'args' => chunk,
310
+ )
311
+ end
312
+ # => returns number of chunks
313
+ ```
314
+
315
+ See [Batch Processing](./batch_processing.md).
316
+
317
+ ---
318
+
319
+ ## Example 12: Resumable CSV Import with Rails ActiveJob (Rails 8.1+)
320
+
321
+ Rails 8.1 introduced `ActiveJob::Continuable`, which lets a job pause and resume from exactly where it stopped — for example during a deployment or queue drain.
322
+
323
+ ```ruby
324
+ # app/jobs/import_csv_job.rb
325
+ class ImportCsvJob < ApplicationJob
326
+ include ActiveJob::Continuable
327
+
328
+ def perform(file_path)
329
+ step :import_rows do |step|
330
+ SmarterCSV.process(file_path, chunk_size: 500) do |chunk, chunk_index|
331
+ next if chunk_index < step.cursor.to_i # skip already-processed chunks on resume
332
+
333
+ MyModel.import!(chunk)
334
+ step.set! chunk_index + 1
335
+ end
77
336
  end
78
- => returns number of chunks
337
+ end
338
+ end
339
+ ```
340
+
341
+ - `step.cursor` starts as `nil` (→ `0`), so the first run processes all chunks.
342
+ - If interrupted after chunk 7, Rails persists the cursor as `8`.
343
+ - On the next run chunks 0–7 are skipped quickly via `next`; processing resumes from chunk 8.
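The cursor arithmetic in the bullets above can be simulated without Rails. This is a sketch under stated assumptions: `run_import` is a hypothetical function, its `cursor` argument stands in for `step.cursor`, and `stop_after` fakes an interruption.

```ruby
# Simulation only: no Rails, no SmarterCSV. Chunks are built with each_slice.
def run_import(rows, chunk_size:, cursor: nil, stop_after: nil)
  processed = []
  rows.each_slice(chunk_size).with_index do |chunk, chunk_index|
    next if chunk_index < cursor.to_i # skip chunks finished in an earlier run

    processed << chunk_index          # a real job would import `chunk` here
    cursor = chunk_index + 1          # remember the next chunk to process
    break if stop_after && chunk_index >= stop_after
  end
  { processed: processed, cursor: cursor }
end

rows   = (1..100).to_a
first  = run_import(rows, chunk_size: 10, stop_after: 7)
second = run_import(rows, chunk_size: 10, cursor: first[:cursor])
p first   # chunks 0..7 processed, cursor 8
p second  # chunks 8..9 processed, cursor 10
```

Note that the skipped chunks are still read and parsed on resume; the cursor only prevents them from being imported twice.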
344
+
345
+ > Requires Rails 8.1+ and a queue adapter that supports graceful shutdown (Sidekiq, Solid Queue).
346
+
347
+ ---
348
+
349
+ ## Example 13: Instrumentation
350
+
351
+ ```ruby
352
+ SmarterCSV.process('large_import.csv',
353
+ chunk_size: 1000,
354
+
355
+ on_start: ->(info) {
356
+ Rails.logger.info "Import started: #{info[:input]} (#{info[:file_size]} bytes)"
357
+ },
358
+
359
+ on_chunk: ->(info) {
360
+ Rails.logger.debug "Chunk #{info[:chunk_number]}: #{info[:rows_in_chunk]} rows"
361
+ },
362
+
363
+ on_complete: ->(stats) {
364
+ Rails.logger.info "Done: #{stats[:total_rows]} rows in #{stats[:duration].round(2)}s"
365
+ },
366
+ ) { |chunk| MyModel.insert_all(chunk) }
79
367
  ```
368
+
369
+ See [Instrumentation Hooks](./instrumentation.md).
370
+
371
+ --------------------
372
+ PREVIOUS: [Instrumentation Hooks](./instrumentation.md) | NEXT: [Real-World CSV Files](./real_world_csv.md) | UP: [README](../README.md)
data/docs/header_transformations.md CHANGED
@@ -2,6 +2,7 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
5
6
  * [Parsing Strategy](./parsing_strategy.md)
6
7
  * [The Basic Read API](./basic_read_api.md)
7
8
  * [The Basic Write API](./basic_write_api.md)
@@ -10,15 +11,55 @@
10
11
  * [Row and Column Separators](./row_col_sep.md)
11
12
  * [**Header Transformations**](./header_transformations.md)
12
13
  * [Header Validations](./header_validations.md)
14
+ * [Column Selection](./column_selection.md)
13
15
  * [Data Transformations](./data_transformations.md)
14
16
  * [Value Converters](./value_converters.md)
15
-
16
- --------------
17
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
18
+ * [Instrumentation Hooks](./instrumentation.md)
19
+ * [Examples](./examples.md)
20
+ * [Real-World CSV Files](./real_world_csv.md)
21
+ * [SmarterCSV over the Years](./history.md)
22
+ * [Release Notes](./releases/1.16.0/changes.md)
23
+
24
+ --------------
17
25
 
18
26
  # Header Transformations
19
27
 
20
28
  By default SmarterCSV assumes that a CSV file has headers, and it automatically normalizes the headers and transforms them into Ruby symbols. You can completely customize or override this (see below).
21
29
 
30
+ ## Header Transformation Pipeline
31
+
32
+ When a CSV file is opened, the header line passes through the following steps in order:
33
+
34
+ ```
35
+ [user_provided_headers] ──► skips steps below; uses your array directly
36
+
37
+ ▼ (when headers come from the file)
38
+ comment_regexp ──► strip_chars_from_headers ──► split on col_sep
39
+ ──► strip quote_char ──► strip_whitespace
40
+ ──► [unless keep_original_headers]: gsub spaces/dashes→_ ──► downcase_header
41
+ ──► disambiguate_headers ──► symbolize ──► key_mapping
42
+ ```
43
+
44
+ | Step | Option | Default | Description |
45
+ |------|--------|---------|-------------|
46
+ | 1 | `comment_regexp` | `nil` | Strips a comment prefix from the raw header line (e.g. `# ` at start) |
47
+ | 2 | `strip_chars_from_headers` | `nil` | Removes characters matching a regexp from the raw header line (e.g. `/[\-"]/`) |
48
+ | 3 | *(split)* | `col_sep` | Splits the header line into individual column tokens |
49
+ | 4 | `quote_char` | `"` | Strips surrounding quote characters from each token |
50
+ | 5 | `strip_whitespace` | `true` | Strips leading/trailing whitespace from each header |
51
+ | 6 | *(normalize)* | — | Replaces spaces and dashes with `_` (`keep_original_headers` skips this and steps 7–9) |
52
+ | 7 | `downcase_header` | `true` | Downcases each header string |
53
+ | 8 | `duplicate_header_suffix` | `''` | Renames empty headers to `column_N`; appends suffix+number to duplicates |
54
+ | 9 | `strings_as_keys` | `false` | Converts headers to symbols (skipped if `true` or `keep_original_headers`) |
55
+ | 10 | `key_mapping` | `nil` | Renames or drops headers; use post-transformation key names as input |
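Steps 5 to 10 of the table can be approximated in plain Ruby for a header row that has already been split. This is a rough sketch, not SmarterCSV's implementation: `normalize_headers` is a hypothetical helper, and the 1-based `column_N` numbering and the bare-number duplicate suffix are assumptions.

```ruby
# Illustrative re-creation of steps 5-10 for already-split header tokens.
def normalize_headers(raw_headers, key_mapping: {})
  seen = Hash.new(0)
  raw_headers.each_with_index.map do |header, index|
    name = header.strip                              # step 5: strip whitespace
    name = name.gsub(/[\s\-]+/, '_').downcase        # steps 6-7: normalize, downcase
    name = "column_#{index + 1}" if name.empty?      # step 8: name empty headers (assumed 1-based)
    seen[name] += 1
    name = "#{name}#{seen[name]}" if seen[name] > 1  # step 8: number duplicates
    key = name.to_sym                                # step 9: symbolize
    key_mapping.fetch(key, key)                      # step 10: rename; nil marks a dropped column
  end
end

p normalize_headers([" \t Annual Sales ", "First-Name", "", "first name"])
p normalize_headers(["First-Name", "SSN"], key_mapping: { first_name: :fname, ssn: nil })
```

Because `key_mapping` runs last, its keys must be the already-normalized symbols (`:first_name`), not the raw header text (`"First-Name"`).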
56
+
57
+ > `user_provided_headers` bypasses all file header reading and transformation entirely — your array is used as-is. Versions >1.13 automatically set `headers_in_file: false` when `user_provided_headers` is given; if the file has a header row you want to skip, set `headers_in_file: true` explicitly.
58
+
59
+ See [Configuration Options](./options.md) for full option reference.
60
+
61
+ ---
62
+
22
63
  ## Header Normalization
23
64
 
24
65
  When processing the headers, it transforms them into Ruby symbols, stripping extra spaces, lower-casing them and replacing spaces with underscores. e.g. " \t Annual Sales " becomes `:annual_sales`. (see Notes below)
@@ -81,16 +122,57 @@ end
81
122
 
82
123
  ## Key Mapping
83
124
 
84
- The above example already illustrates how intermediate keys can be mapped into something different.
85
- This transfoms some of the keys in the input, but other keys are still present.
125
+ `key_mapping:` renames CSV headers to the symbols your application expects. Any header not
126
+ listed in the mapping is kept as-is by default.
86
127
 
87
- There is an additional option `remove_unmapped_keys` which can be enabled to only produce the mapped keys in the resulting hashes, and drops any other columns.
128
+ ```ruby
129
+ # CSV headers: first_name, last_name, internal_id, created_at
130
+ data = SmarterCSV.process('contacts.csv',
131
+ key_mapping: { first_name: :given_name, last_name: :family_name },
132
+ )
133
+ # => [{given_name: "Alice", family_name: "Smith", internal_id: 42, created_at: "2026-01-01"}, ...]
134
+ # ^^^ renamed ^^^ unmapped keys kept as-is
135
+ ```
88
136
 
89
-
90
- ### NOTES on Key Mapping:
91
- * keys in the header line of the file can be re-mapped to a chosen set of symbols, so the resulting Hashes can be better used internally in your application (e.g. when directly creating MongoDB entries with them)
92
- * if you want to completely delete a key, then map it to nil or to '', they will be automatically deleted from any result Hash
93
- * if you have input files with a large number of columns, and you want to ignore all columns which are not specifically mapped with :key_mapping, then use option :remove_unmapped_keys => true
137
+ To delete a specific column, map it to `nil` — it will be removed from every row hash:
138
+
139
+ ```ruby
140
+ key_mapping: { internal_id: nil, created_at: nil } # drop these two columns
141
+ ```
142
+
143
+ ### `remove_unmapped_keys:` — drop everything not in the map
144
+
145
+ When you have files with many columns and only care about a few, listing every unwanted
146
+ column as `nil` is tedious. Use `remove_unmapped_keys: true` to implicitly drop any header
147
+ that has no entry in `key_mapping:`:
148
+
149
+ ```ruby
150
+ # CSV has 50 columns; you only want two of them, renamed
151
+ data = SmarterCSV.process('contacts.csv',
152
+ key_mapping: { first_name: :given_name, last_name: :family_name },
153
+ remove_unmapped_keys: true,
154
+ )
155
+ # => [{given_name: "Alice", family_name: "Smith"}, ...] # only the two mapped columns
156
+ ```
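The difference between the default behaviour and `remove_unmapped_keys: true` can be shown on a single row hash. The helper `apply_key_mapping` below is hypothetical and works on an already-parsed hash; it is a sketch of the semantics described above, not the library's code.

```ruby
# Post-parse view of key mapping on one row hash.
def apply_key_mapping(row, key_mapping, remove_unmapped_keys: false)
  row.each_with_object({}) do |(key, value), mapped|
    if key_mapping.key?(key)
      new_key = key_mapping[key]
      mapped[new_key] = value unless new_key.nil? # a nil target drops the column
    elsif !remove_unmapped_keys
      mapped[key] = value                         # unmapped keys are kept by default
    end
  end
end

row     = { first_name: "Alice", last_name: "Smith", internal_id: 42 }
mapping = { first_name: :given_name, last_name: :family_name }

p apply_key_mapping(row, mapping)
p apply_key_mapping(row, mapping, remove_unmapped_keys: true)
p apply_key_mapping(row, mapping.merge(internal_id: nil))
```

The last two calls produce the same hash here: dropping via `nil` is explicit per column, while `remove_unmapped_keys` drops everything that is not listed.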
157
+
158
+ ### `remove_unmapped_keys:` vs `headers: { only: }`
159
+
160
+ Both achieve column selection, but they serve different purposes:
161
+
162
+ | | `remove_unmapped_keys: true` | `headers: { only: [...] }` |
163
+ |---|---|---|
164
+ | Use when | Already using `key_mapping:` and want to implicitly drop the rest | Pure column selection, no renaming needed |
165
+ | Performance | Post-parse filter — all fields parsed, unmapped keys deleted | **C-path early exit** — unneeded fields never parsed |
166
+ | Renaming | Yes — combines selection and rename in one step | No renaming (use `key_mapping:` alongside if needed) |
167
+
168
+ For wide files where performance matters, prefer `headers: { only: }` — it skips unneeded
169
+ fields entirely inside the C parser and can be **10–14× faster** on very wide files.
170
+ Use `remove_unmapped_keys: true` when you are already remapping headers and the convenience
171
+ of a single option outweighs the (usually small) performance difference.
172
+
173
+ See [Column Selection](./column_selection.md) for full details on `headers: { only: }`.
174
+
175
+ > **Note:** Key mapping is particularly useful when importing CSV data directly into a database or document store. By remapping headers to the exact symbol names your application uses internally (e.g. ActiveRecord attributes, DynamoDB document keys, Sidekiq job parameters), you can pass the resulting hashes directly without any further transformation.
94
176
 
95
177
  ## CSV Files without Headers
96
178
 
@@ -124,5 +206,4 @@ For CSV files with headers, you can either:
124
206
  * some CSV files use un-escaped quotation characters inside fields. This can cause the import to break. To get around this, set the `quote_char` to something different, e.g. `quote_char: "%"`, or try setting `:strip_chars_from_headers => /[\-"]/`
125
207
 
126
208
  ---------------
127
- PREVIOUS: [Row and Column Separators](./row_col_sep.md) | NEXT: [Header Validations](./header_validations.md)
128
-
209
+ PREVIOUS: [Row and Column Separators](./row_col_sep.md) | NEXT: [Header Validations](./header_validations.md) | UP: [README](../README.md)
data/docs/header_validations.md CHANGED
@@ -2,6 +2,7 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
5
6
  * [Parsing Strategy](./parsing_strategy.md)
6
7
  * [The Basic Read API](./basic_read_api.md)
7
8
  * [The Basic Write API](./basic_write_api.md)
@@ -10,43 +11,80 @@
10
11
  * [Row and Column Separators](./row_col_sep.md)
11
12
  * [Header Transformations](./header_transformations.md)
12
13
  * [**Header Validations**](./header_validations.md)
14
+ * [Column Selection](./column_selection.md)
13
15
  * [Data Transformations](./data_transformations.md)
14
16
  * [Value Converters](./value_converters.md)
15
-
16
- --------------
17
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
18
+ * [Instrumentation Hooks](./instrumentation.md)
19
+ * [Examples](./examples.md)
20
+ * [Real-World CSV Files](./real_world_csv.md)
21
+ * [SmarterCSV over the Years](./history.md)
22
+ * [Release Notes](./releases/1.16.0/changes.md)
23
+
24
+ --------------
17
25
 
18
26
  # Header Validations
19
27
 
20
- When you are importing data, it can be important to verify that all required data is present, to ensure consistent quality when importing data.
28
+ When importing data, it is important to verify that all required columns are present. Catching a missing column upfront is far better than a cryptic error later, when your code tries to access a key that was never populated.
21
29
 
22
- You can use the `required_keys` option to specify an array of hash keys that you require to be present at a minimum for every data row (after header transformation).
30
+ ## `required_keys`
23
31
 
24
- If these keys are not present, `SmarterCSV::MissingKeys` will be raised to inform you of the data inconsistency.
32
+ Use `required_keys` to specify an array of hash keys that must be present after header transformation. Validation runs once, after the header row is parsed and all header transformations (downcase, symbolize, `key_mapping`) have been applied, so use the **transformed** key names, not the raw CSV header strings.
25
33
 
26
- ## Example
34
+ If any required key is absent, `SmarterCSV::MissingKeys` is raised before any data rows are processed.
27
35
 
28
36
  ```ruby
29
- options = {
30
- required_keys: [:source_account, :destination_account, :amount]
31
- }
32
- data = SmarterCSV.process("/tmp/transactions.csv", options)
33
-
34
- => this will raise SmarterCSV::MissingKeys if any row does not contain these three keys
37
+ options = {
38
+ required_keys: [:source_account, :destination_account, :amount]
39
+ }
40
+ data = SmarterCSV.process('/tmp/transactions.csv', options)
41
+ # => raises SmarterCSV::MissingKeys if any of the three columns are missing
35
42
  ```
36
43
 
37
- ## Handling Missing Keys Programmatically
44
+ ### Accessing the missing keys
38
45
 
39
- When `SmarterCSV::MissingKeys` is raised, you can access the missing keys directly via the `keys` accessor, without parsing the error message:
46
+ `SmarterCSV::MissingKeys` exposes the missing keys via the `keys` accessor:
40
47
 
41
48
  ```ruby
42
49
  begin
43
- options = { required_keys: [:source_account, :destination_account, :amount] }
44
- data = SmarterCSV.process("/tmp/transactions.csv", options)
50
+ data = SmarterCSV.process('/tmp/transactions.csv',
51
+ required_keys: [:source_account, :destination_account, :amount])
45
52
  rescue SmarterCSV::MissingKeys => e
46
53
  puts "Missing columns: #{e.keys.join(', ')}"
47
- # => e.keys returns [:amount] (array of missing key symbols)
54
+ # => "Missing columns: amount"
48
55
  end
49
56
  ```
50
57
 
58
+ ### Interaction with `key_mapping`
59
+
60
+ `required_keys` uses the **post-mapping** key names. If you remap CSV headers, reference the mapped names:
61
+
62
+ ```ruby
63
+ options = {
64
+ key_mapping: { acct_from: :source_account, acct_to: :destination_account },
65
+ required_keys: [:source_account, :destination_account, :amount],
66
+ }
67
+ ```
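The check itself amounts to a set difference between the required keys and the post-mapping header keys. The sketch below illustrates that idea with hypothetical names (`MissingKeysSketch`, `validate_required_keys!`); it does not reproduce `SmarterCSV::MissingKeys` or the library's validation code.

```ruby
# Hypothetical stand-in for an error class that exposes the missing keys.
class MissingKeysSketch < StandardError
  attr_reader :keys

  def initialize(keys)
    @keys = keys
    super("missing required keys: #{keys.join(', ')}")
  end
end

# Compare the header keys (after key_mapping) against the required ones.
def validate_required_keys!(header_keys, required_keys)
  missing = required_keys - header_keys
  raise MissingKeysSketch.new(missing) unless missing.empty?

  header_keys
end

mapped_headers = [:source_account, :destination_account] # e.g. after key_mapping
begin
  validate_required_keys!(mapped_headers, [:source_account, :destination_account, :amount])
rescue MissingKeysSketch => e
  puts "Missing columns: #{e.keys.join(', ')}"
end
```

Running the check against the header row alone means a bad file is rejected before any data row is read.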
68
+
69
+ ---
70
+
71
+ ## `silence_missing_keys`
72
+
73
+ When using `key_mapping`, SmarterCSV raises `SmarterCSV::KeyMappingError` if a mapped key is not found in the CSV header. Use `silence_missing_keys` to make some or all mapped keys optional:
74
+
75
+ ```ruby
76
+ # All mapped keys are optional — no error if any are absent
77
+ options = {
78
+ key_mapping: { optional_field: :my_field, required_field: :other_field },
79
+ silence_missing_keys: true,
80
+ }
81
+
82
+ # Only specific mapped keys are optional
83
+ options = {
84
+ key_mapping: { optional_field: :my_field, required_field: :other_field },
85
+ silence_missing_keys: [:optional_field],
86
+ }
87
+ ```
88
+
51
89
  ----------------
52
- PREVIOUS: [Header Transformations](./header_transformations.md) | NEXT: [Data Transformations](./data_transformations.md)
90
+ PREVIOUS: [Header Transformations](./header_transformations.md) | NEXT: [Column Selection](./column_selection.md) | UP: [README](../README.md)