smarter_csv 1.15.2 → 1.16.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (50)
  1. checksums.yaml +4 -4
  2. data/.rspec +2 -0
  3. data/.rubocop.yml +9 -0
  4. data/CHANGELOG.md +112 -1
  5. data/CONTRIBUTORS.md +4 -1
  6. data/Gemfile +1 -0
  7. data/README.md +129 -27
  8. data/docs/_introduction.md +45 -24
  9. data/docs/bad_row_quarantine.md +342 -0
  10. data/docs/basic_read_api.md +152 -9
  11. data/docs/basic_write_api.md +475 -59
  12. data/docs/batch_processing.md +162 -4
  13. data/docs/column_selection.md +184 -0
  14. data/docs/data_transformations.md +163 -29
  15. data/docs/examples.md +340 -46
  16. data/docs/header_transformations.md +94 -12
  17. data/docs/header_validations.md +57 -18
  18. data/docs/history.md +119 -0
  19. data/docs/instrumentation.md +166 -0
  20. data/docs/migrating_from_csv.md +565 -0
  21. data/docs/options.md +151 -87
  22. data/docs/parsing_strategy.md +64 -1
  23. data/docs/real_world_csv.md +263 -0
  24. data/docs/releases/1.16.0/benchmarks.md +223 -0
  25. data/docs/releases/1.16.0/changes.md +273 -0
  26. data/docs/releases/1.16.0/performance_notes.md +114 -0
  27. data/docs/row_col_sep.md +15 -5
  28. data/docs/ruby_csv_pitfalls.md +514 -0
  29. data/docs/value_converters.md +194 -57
  30. data/ext/smarter_csv/extconf.rb +3 -0
  31. data/ext/smarter_csv/smarter_csv.c +1017 -82
  32. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png +0 -0
  33. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.svg +108 -0
  34. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.png +0 -0
  35. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg +141 -0
  36. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.png +0 -0
  37. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.svg +139 -0
  38. data/lib/smarter_csv/errors.rb +8 -0
  39. data/lib/smarter_csv/file_io.rb +1 -1
  40. data/lib/smarter_csv/hash_transformations.rb +14 -13
  41. data/lib/smarter_csv/header_transformations.rb +21 -2
  42. data/lib/smarter_csv/headers.rb +2 -1
  43. data/lib/smarter_csv/options.rb +124 -7
  44. data/lib/smarter_csv/parser.rb +358 -74
  45. data/lib/smarter_csv/reader.rb +494 -46
  46. data/lib/smarter_csv/version.rb +1 -1
  47. data/lib/smarter_csv/writer.rb +71 -19
  48. data/lib/smarter_csv.rb +134 -13
  49. data/smarter_csv.gemspec +20 -10
  50. metadata +38 -80
@@ -0,0 +1,342 @@
1
+
2
+ ### Contents
3
+
4
+ * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
6
+ * [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md)
7
+ * [Parsing Strategy](./parsing_strategy.md)
8
+ * [The Basic Read API](./basic_read_api.md)
9
+ * [The Basic Write API](./basic_write_api.md)
10
+ * [Batch Processing](./batch_processing.md)
11
+ * [Configuration Options](./options.md)
12
+ * [Row and Column Separators](./row_col_sep.md)
13
+ * [Header Transformations](./header_transformations.md)
14
+ * [Header Validations](./header_validations.md)
15
+ * [Column Selection](./column_selection.md)
16
+ * [Data Transformations](./data_transformations.md)
17
+ * [Value Converters](./value_converters.md)
18
+ * [**Bad Row Quarantine**](./bad_row_quarantine.md)
19
+ * [Instrumentation Hooks](./instrumentation.md)
20
+ * [Examples](./examples.md)
21
+ * [Real-World CSV Files](./real_world_csv.md)
22
+ * [SmarterCSV over the Years](./history.md)
23
+ * [Release Notes](./releases/1.16.0/changes.md)
24
+
25
+ --------------
26
+
27
+ # Bad Row Quarantine
28
+
29
+ Real-world CSV files are often malformed. By default, SmarterCSV raises an exception on the
30
+ first bad row it encounters. The `on_bad_row` option lets you keep processing and handle bad
31
+ rows in whatever way suits your application.
32
+
33
+ ## What counts as a bad row
34
+
35
+ - Malformed CSV (unclosed quoted fields, unterminated multiline rows)
36
+ - A field that exceeds `field_size_limit` (see [Limiting field size](#limiting-field-size-field_size_limit))
37
+ - Extra columns when running in `strict: true` mode
38
+ - Any `SmarterCSV::Error` or `EOFError` raised during row parsing
39
+
40
+ ## Options
41
+
42
+ | Option | Default | Description |
43
+ |--------|---------|-------------|
44
+ | `on_bad_row` | `:raise` | How to handle a bad row: `:raise`, `:skip`, `:collect`, or a callable |
45
+ | `collect_raw_lines` | `true` | Include `raw_logical_line` in the error record |
46
+ | `bad_row_limit` | `nil` | Raise `SmarterCSV::TooManyBadRows` after this many bad rows |
47
+
48
+ ## Modes
49
+
50
+ ### `:raise` (default)
51
+
52
+ Current behavior — the exception propagates and processing stops:
53
+
54
+ ```ruby
55
+ SmarterCSV.process('data.csv')
56
+ # => raises SmarterCSV::MalformedCSV on the first bad row
57
+ ```
58
+
59
+ The `on_bad_row` option controls what happens when a bad row is encountered:
60
+
61
+ * `on_bad_row: :raise` (default) fails fast.
62
+ * `on_bad_row: :collect` quarantines bad rows; error records are available via `SmarterCSV.errors` or `reader.errors`.
63
+ * `on_bad_row: ->(rec) { ... }` calls your lambda per bad row — works with both `SmarterCSV.process` and `SmarterCSV::Reader`.
64
+ * `on_bad_row: :skip` discards bad rows silently — count available via `SmarterCSV.errors` or `reader.errors`.
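The four modes listed above can be pictured as a single dispatch step wrapped around row parsing. The following is a minimal sketch in plain Ruby that does not use SmarterCSV at all: `handle_rows`, `BadRow`, `PARSER`, and the `line_number` key are hypothetical names chosen for illustration, and the gem's internals may be organized quite differently.

```ruby
# Hypothetical illustration only: none of these names come from SmarterCSV.
class BadRow < StandardError; end

# Parses each line with `parser` and routes failures according to `on_bad_row`.
# Returns [good_rows, errors]; `errors` loosely mimics { bad_row_count:, bad_rows: }.
def handle_rows(lines, parser:, on_bad_row: :raise)
  good = []
  errors = { bad_row_count: 0, bad_rows: [] }

  lines.each_with_index do |line, idx|
    begin
      good << parser.call(line)
    rescue BadRow => e
      raise if on_bad_row == :raise # fail fast: re-raise the current exception

      errors[:bad_row_count] += 1
      record = { line_number: idx + 1, error_class: e.class,
                 error_message: e.message, raw_logical_line: line }

      if on_bad_row == :collect
        errors[:bad_rows] << record   # quarantine the record
      elsif on_bad_row.respond_to?(:call)
        on_bad_row.call(record)       # hand the record to the caller
      end                             # :skip keeps only the count
    end
  end

  [good, errors]
end

# A toy parser that rejects rows without exactly two fields.
PARSER = lambda do |line|
  fields = line.chomp.split(',')
  raise BadRow, "expected 2 fields, got #{fields.size}" unless fields.size == 2

  fields
end

LINES = ["a,1\n", "b,2,EXTRA\n", "c,3\n"].freeze

good, errors = handle_rows(LINES, parser: PARSER, on_bad_row: :collect)
puts "good rows: #{good.size}, bad rows: #{errors[:bad_row_count]}"
```

In every mode except `:raise`, the loop continues after a bad row; the modes differ only in what is kept about the failure.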
65
+
66
+ ### `:collect`
67
+
68
+ Continue processing and store a structured error record for each bad row.
69
+ Error records are available via `SmarterCSV.errors[:bad_rows]` (class-level API)
70
+ or `reader.errors[:bad_rows]` (Reader API).
71
+
72
+ ```ruby
73
+ # Class-level API — use SmarterCSV.errors after the call
74
+ good_rows = SmarterCSV.process('data.csv', on_bad_row: :collect)
75
+
76
+ good_rows.each { |row| MyModel.create!(row) }
77
+
78
+ SmarterCSV.errors[:bad_rows].each do |rec|
79
+ Rails.logger.warn "Bad row at line #{rec[:csv_line_number]}: #{rec[:error_message]}"
80
+ Rails.logger.warn "Raw content: #{rec[:raw_logical_line]}"
81
+ end
82
+ ```
83
+
84
+ ```ruby
85
+ # Reader API — use when you also need access to headers or other reader state
86
+ reader = SmarterCSV::Reader.new('data.csv', on_bad_row: :collect)
87
+ result = reader.process
88
+
89
+ result.each { |row| MyModel.create!(row) }
90
+
91
+ reader.errors[:bad_rows].each do |rec|
92
+ Rails.logger.warn "Bad row at line #{rec[:csv_line_number]}: #{rec[:error_message]}"
93
+ Rails.logger.warn "Raw content: #{rec[:raw_logical_line]}"
94
+ end
95
+ ```
96
+
97
+ ### Callable (lambda / proc)
98
+
99
+ Pass any object that responds to `#call`. It is invoked once per bad row with the
100
+ error record hash, then processing continues. Because the lambda receives errors
101
+ inline, **this works with both `SmarterCSV.process` and `SmarterCSV::Reader`** —
102
+ you do not need a `Reader` instance to handle bad rows.
103
+
104
+ ```ruby
105
+ # Works with SmarterCSV.process — no Reader instance needed
106
+ bad_rows = []
107
+ good_rows = SmarterCSV.process('data.csv',
108
+ on_bad_row: ->(rec) { bad_rows << rec })
109
+ ```
110
+
111
+ ```ruby
112
+ # Log to a dead-letter file
113
+ quarantine = File.open('quarantine.csv', 'w')
114
+ SmarterCSV.process('data.csv',
115
+ on_bad_row: ->(rec) { quarantine.puts(rec[:raw_logical_line]) })
116
+ quarantine.close
117
+ ```
118
+
119
+ ```ruby
120
+ # Send to a monitoring system
121
+ SmarterCSV.process('data.csv',
122
+ on_bad_row: ->(rec) { Metrics.increment('csv.bad_rows', tags: { error: rec[:error_class].name }) })
123
+ ```
124
+
125
+ ### `:skip`
126
+
127
+ Silently skip bad rows and continue. The count of skipped rows is available via
128
+ `SmarterCSV.errors[:bad_row_count]` (class-level API) or `reader.errors[:bad_row_count]`
129
+ (Reader API). No error records are stored.
130
+
131
+ ```ruby
132
+ # Class-level API — use SmarterCSV.errors after the call
133
+ SmarterCSV.process('data.csv', on_bad_row: :skip)
134
+ puts "Skipped: #{SmarterCSV.errors[:bad_row_count] || 0} bad rows"
135
+ ```
136
+
137
+ ```ruby
138
+ # Reader API — access reader.errors directly
139
+ reader = SmarterCSV::Reader.new('data.csv', on_bad_row: :skip)
140
+ result = reader.process
141
+
142
+ puts "Processed: #{result.size} good rows"
143
+ puts "Skipped: #{reader.errors[:bad_row_count] || 0} bad rows"
144
+ ```
145
+
146
+ ## Error record structure
147
+
148
+ Each error record is a Hash:
149
+
150
+ ```ruby
151
+ {
152
+ csv_line_number: 3, # logical row (counting header as row 1)
153
+ file_line_number: 3, # physical file line where the row started
154
+ file_lines_consumed: 1, # physical lines spanned (>1 for multiline)
155
+ error_class: SmarterCSV::HeaderSizeMismatch, # exception class object
156
+ error_message: "extra columns detected ...", # exception message string
157
+ raw_logical_line: "Jane,25,Boston,EXTRA_DATA\n", # present when collect_raw_lines: true (default)
158
+ }
159
+ ```
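Because each record is a plain Hash, standard Ruby collection methods are enough to summarize a batch of them. The records below are hand-written samples shaped like the structure above, not output from a real run, and `ArgumentError` and `EOFError` merely stand in for whatever error classes a real run would report.

```ruby
# Hand-written sample records shaped like the structure above (not real output).
# ArgumentError and EOFError are stand-ins for the error classes of a real run.
records = [
  { csv_line_number: 3, file_line_number: 3, file_lines_consumed: 1,
    error_class: ArgumentError, error_message: 'extra columns detected',
    raw_logical_line: "Jane,25,Boston,EXTRA_DATA\n" },
  { csv_line_number: 7, file_line_number: 7, file_lines_consumed: 4,
    error_class: EOFError, error_message: 'unclosed quoted field',
    raw_logical_line: "Bob,\"never\ncloses\n...\n" },
  { csv_line_number: 9, file_line_number: 12, file_lines_consumed: 1,
    error_class: ArgumentError, error_message: 'extra columns detected',
    raw_logical_line: "Ann,31,Denver,EXTRA_DATA\n" }
]

# Count bad rows per error class.
by_class = records.group_by { |r| r[:error_class] }.transform_values(&:size)

# Rows that spanned more than one physical line often deserve a closer look.
multiline = records.select { |r| r[:file_lines_consumed] > 1 }

by_class.each { |klass, n| puts "#{klass}: #{n}" }
puts "Multiline bad rows start at file lines: #{multiline.map { |r| r[:file_line_number] }.inspect}"
```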
160
+
161
+ ### `collect_raw_lines`
162
+
163
+ `collect_raw_lines: true` (default) — `raw_logical_line` is always included in the error
164
+ record. Set to `false` if you want to reduce memory usage and don't need the raw content:
165
+
166
+ ```ruby
167
+ reader = SmarterCSV::Reader.new('data.csv',
168
+ on_bad_row: :collect,
169
+ collect_raw_lines: false,
170
+ )
171
+ ```
172
+
173
+ For multiline rows (quoted fields spanning several physical lines), `raw_logical_line` contains
174
+ the fully stitched content — it may include embedded newline characters. The
175
+ `file_lines_consumed` field tells you how many physical lines were read.
176
+
177
+ ## Limiting bad rows with `bad_row_limit`
178
+
179
+ To abort processing after too many failures, set `bad_row_limit`. This works with `:skip`,
180
+ `:collect`, and callable modes:
181
+
182
+ ```ruby
183
+ reader = SmarterCSV::Reader.new('data.csv',
184
+ on_bad_row: :collect,
185
+ bad_row_limit: 10,
186
+ )
187
+
188
+ begin
189
+ result = reader.process
190
+ rescue SmarterCSV::TooManyBadRows => e
191
+ puts "Aborting: #{e.message}"
192
+ puts "Collected so far: #{reader.errors[:bad_rows].size} bad rows"
193
+ end
194
+ ```
195
+
196
+ ## Accessing errors
197
+
198
+ There are two ways to access bad row data after processing:
199
+
200
+ ### Via `SmarterCSV.errors` (class-level API)
201
+
202
+ `SmarterCSV.errors` returns the errors from the most recent call to `process`, `parse`,
203
+ `each`, or `each_chunk` on the current thread. It is cleared at the start of each new call.
204
+
205
+ ```ruby
206
+ SmarterCSV.process('data.csv', on_bad_row: :skip)
207
+ puts SmarterCSV.errors[:bad_row_count] # => 3
208
+
209
+ SmarterCSV.process('data.csv', on_bad_row: :collect)
210
+ puts SmarterCSV.errors[:bad_row_count] # => 3
211
+ puts SmarterCSV.errors[:bad_rows].size # => 3
212
+ ```
213
+
214
+ > **Note:** `SmarterCSV.errors` only surfaces errors from the **most recent run on the
215
+ > current thread**. In a multi-threaded environment (Puma, Sidekiq), each thread maintains
216
+ > its own error state independently. If you call `SmarterCSV.process` twice in the same
217
+ > thread, the second call's errors replace the first's. For long-running or complex
218
+ > pipelines where you need to aggregate errors across multiple files, use the Reader API.
219
+ >
220
+ > ⚠️ **Fibers:** `SmarterCSV.errors` uses `Thread.current` for storage, which is **shared
221
+ > across all fibers running in the same thread**. If you process CSV files concurrently
222
+ > in fibers (e.g. with `Async`, `Falcon`, or manual `Fiber` scheduling), `SmarterCSV.errors`
223
+ > may return stale or wrong results. **Use `SmarterCSV::Reader` directly** — errors are
224
+ > scoped to the reader instance and are always correct regardless of fiber context.
225
+
226
+ ### Via `reader.errors` (Reader API)
227
+
228
+ For full control — including access to headers, raw headers, and errors from a specific
229
+ call — use `SmarterCSV::Reader` directly:
230
+
231
+ | Attribute | Description |
232
+ |-----------|-------------|
233
+ | `reader.errors[:bad_row_count]` | Total bad rows encountered (all modes) |
234
+ | `reader.errors[:bad_rows]` | Array of error records (`:collect` mode only) |
235
+
236
+ ```ruby
237
+ reader = SmarterCSV::Reader.new('data.csv', on_bad_row: :collect)
238
+ reader.process
239
+ puts reader.errors[:bad_row_count]
240
+ puts reader.headers.inspect
241
+ ```
242
+
243
+ ## Chunked processing
244
+
245
+ Bad row quarantine works seamlessly with `chunk_size`. Skipped rows are simply not added to the
246
+ current chunk — chunk sizes remain consistent:
247
+
248
+ ```ruby
249
+ reader = SmarterCSV::Reader.new('large_file.csv',
250
+ chunk_size: 500,
251
+ on_bad_row: :collect,
252
+ )
253
+ reader.process do |chunk, index|
254
+ MyModel.import(chunk)
255
+ end
256
+ puts "Bad rows: #{reader.errors[:bad_row_count]}"
257
+ ```
258
+
259
+ ## Limiting field size: `field_size_limit`
260
+
261
+ Real-world CSV files sometimes contain unexpectedly large fields — either intentionally
262
+ (a DoS attempt) or accidentally (a forgotten closing quote, a JSON blob in a cell, a notes
263
+ field that ran away). Without a limit, SmarterCSV will happily stitch together physical lines
264
+ until it either finds the closing quote or reaches end-of-file, potentially consuming hundreds
265
+ of megabytes.
266
+
267
+ `field_size_limit` sets a hard cap (in bytes) on the size of any individual extracted field.
268
+ The default is `nil` (no limit). When a field exceeds the limit, a
269
+ `SmarterCSV::FieldSizeLimitExceeded` exception is raised — and because it inherits from
270
+ `SmarterCSV::Error`, the `on_bad_row` option handles it exactly like any other parse error.
271
+
272
+ ### The three cases it prevents
273
+
274
+ **1. Huge inline field** — a single-line field containing a large payload (e.g. a JSON blob,
275
+ a base64-encoded file, or a runaway notes column):
276
+
277
+ ```csv
278
+ id,payload
279
+ 1,"{... 500 KB of JSON ...}"
280
+ ```
281
+
282
+ **2. Quoted field spanning many embedded newlines** — a legitimate multiline field in a
283
+ poorly exported file that happens to be enormous:
284
+
285
+ ```csv
286
+ ticket_id,notes
287
+ 42,"Customer wrote:
288
+ ... (thousands of lines of chat history) ...
289
+ "
290
+ ```
291
+
292
+ **3. Never-closing quoted field** — a missing closing quote causes the parser to stitch every
293
+ subsequent physical line into one logical row until EOF:
294
+
295
+ ```csv
296
+ id,comment
297
+ 1,"this quote never closes
298
+ 2,this entire row is now inside the field
299
+ 3,and this one too ...
300
+ ```
301
+
302
+ Without `field_size_limit`, case 3 reads the entire rest of the file into memory. With the
303
+ limit set, the stitch loop raises `FieldSizeLimitExceeded` as soon as the accumulating buffer
304
+ crosses the threshold.
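One way to picture this guard is a stitch loop that checks the buffer size after each physical line it appends. The sketch below is plain Ruby and is not SmarterCSV's parser: `stitch_logical_line` and `FieldTooLarge` are hypothetical names, and quote handling is reduced to counting `"` characters.

```ruby
require 'stringio'

# Hypothetical error class for this sketch only.
class FieldTooLarge < StandardError; end

# Reads physical lines from `io` until the double quotes balance, i.e. until the
# logical row is complete. Raises FieldTooLarge as soon as the accumulated buffer
# exceeds `limit` bytes; a nil limit disables the check.
def stitch_logical_line(io, limit: nil)
  buffer = +''
  while (line = io.gets)
    buffer << line
    if limit && buffer.bytesize > limit
      raise FieldTooLarge, "buffer grew to #{buffer.bytesize} bytes (limit #{limit})"
    end
    return buffer if buffer.count('"').even? # every opened quote has been closed
  end
  buffer.empty? ? nil : buffer # EOF: return what was read, or nil if nothing was
end

# A never-closing quote followed by many more rows.
io = StringIO.new("1,\"this quote never closes\n" + "2,filler row\n" * 100)
begin
  stitch_logical_line(io, limit: 64)
rescue FieldTooLarge => e
  puts "Stopped early: #{e.message}"
end
```

With the limit in place, the loop gives up after a few lines instead of consuming the remaining rows.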
305
+
306
+ ### Usage
307
+
308
+ ```ruby
309
+ # Raise immediately on any oversized field (default on_bad_row: :raise)
310
+ SmarterCSV.process('data.csv', field_size_limit: 1_000_000) # 1 MB per field
311
+
312
+ # Skip oversized rows and continue
313
+ SmarterCSV.process('data.csv', field_size_limit: 1_000_000, on_bad_row: :skip)
314
+
315
+ # Collect oversized rows for inspection
316
+ reader = SmarterCSV::Reader.new('data.csv',
317
+ field_size_limit: 1_000_000,
318
+ on_bad_row: :collect,
319
+ )
320
+ result = reader.process
321
+ reader.errors[:bad_rows].each do |rec|
322
+ Rails.logger.warn "Oversized field on row #{rec[:csv_line_number]}: #{rec[:error_message]}"
323
+ end
324
+ ```
325
+
326
+ ### What "bytes" means here
327
+
328
+ The limit is checked against `String#bytesize` (raw byte count), not character count.
329
+ For ASCII content they are identical. For multi-byte UTF-8 content (e.g. CJK characters)
330
+ bytesize is larger than the character count — so the limit is a memory cap, not a
331
+ character cap, which is what matters for DoS protection.
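The difference is easy to see with Ruby's own `String#length` and `String#bytesize`. This short example uses only core Ruby; the 8-byte `limit` is an arbitrary value picked for illustration.

```ruby
ascii = 'hello'
cjk   = '日本語' # three characters; each takes three bytes in UTF-8

puts ascii.length   # 5
puts ascii.bytesize # 5
puts cjk.length     # 3
puts cjk.bytesize   # 9

limit = 8 # arbitrary byte cap for illustration
puts cjk.bytesize > limit # true: over the cap despite being only 3 characters
```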
332
+
333
+ ### Performance
334
+
335
+ `field_size_limit` is zero-overhead when not set (the default `nil` short-circuits all
336
+ checks). When set, a single integer comparison is performed per logical row; the per-field
337
+ scan only runs when the raw line is large enough to potentially contain an oversized field.
338
+ Normal rows (where the entire line fits within the limit) bypass per-field checking entirely.
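The fast path described above can be sketched as a two-tier check: compare the whole line against the limit first, and scan individual fields only when the line is too large for that comparison to settle the question. `oversized_field?` is a hypothetical helper written for this note, not the gem's implementation.

```ruby
# Hypothetical helper, not SmarterCSV's implementation.
# Returns true when any field in `fields` exceeds `limit` bytes.
def oversized_field?(raw_line, fields, limit)
  return false if limit.nil?                 # no limit configured: skip all checks
  return false if raw_line.bytesize <= limit # the line fits, so every field fits too

  fields.any? { |f| f.bytesize > limit }     # slow path: scan individual fields
end

puts oversized_field?('a,b', %w[a b], 10)                       # false (fast path)
puts oversized_field?('aaaaaa,bbbbbb', %w[aaaaaa bbbbbb], 10)   # false (slow path, no field too big)
puts oversized_field?('aaaaaaaaaaaa,b', %w[aaaaaaaaaaaa b], 10) # true
```

The first comparison is sound because a field can never be larger than the line that contains it.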
339
+
340
+ --------------------
341
+
342
+ PREVIOUS: [Value Converters](./value_converters.md) | NEXT: [Instrumentation Hooks](./instrumentation.md) | UP: [README](../README.md)
@@ -2,6 +2,8 @@
2
2
  ### Contents
3
3
 
4
4
  * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
6
+ * [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md)
5
7
  * [Parsing Strategy](./parsing_strategy.md)
6
8
  * [**The Basic Read API**](./basic_read_api.md)
7
9
  * [The Basic Write API](./basic_write_api.md)
@@ -10,10 +12,17 @@
10
12
  * [Row and Column Separators](./row_col_sep.md)
11
13
  * [Header Transformations](./header_transformations.md)
12
14
  * [Header Validations](./header_validations.md)
15
+ * [Column Selection](./column_selection.md)
13
16
  * [Data Transformations](./data_transformations.md)
14
17
  * [Value Converters](./value_converters.md)
15
-
16
- --------------
18
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Instrumentation Hooks](./instrumentation.md)
20
+ * [Examples](./examples.md)
21
+ * [Real-World CSV Files](./real_world_csv.md)
22
+ * [SmarterCSV over the Years](./history.md)
23
+ * [Release Notes](./releases/1.16.0/changes.md)
24
+
25
+ --------------
17
26
 
18
27
  # SmarterCSV Basic API
19
28
 
@@ -22,7 +31,7 @@ Let's explore the basic APIs for reading and writing CSV files. There is a simpl
22
31
  ## Reading CSV
23
32
 
24
33
  SmarterCSV has convenient defaults for automatically detecting row and column separators based on the given data. This provides more robust parsing of input files when you have no control over the data, e.g. when users upload CSV files.
25
- Learn more about this [in this section](docs/examples/row_col_sep.md).
34
+ Learn more about this [in this section](./row_col_sep.md).
26
35
 
27
36
  ### Simplified Interface
28
37
 
@@ -32,11 +41,23 @@ The simplified call to read CSV files is:
32
41
  array_of_hashes = SmarterCSV.process(file_or_input, options)
33
42
 
34
43
  ```
44
+
45
+ To parse a CSV **string** directly (no file needed), use `SmarterCSV.parse`:
46
+
47
+ ```
48
+ array_of_hashes = SmarterCSV.parse(csv_string, options)
49
+
50
+ ```
51
+
52
+ This is equivalent to `SmarterCSV.process(StringIO.new(csv_string), options)` and is the
53
+ idiomatic replacement for `CSV.parse(str, headers: true, header_converters: :symbol)`.
54
+ See [Migrating from Ruby CSV](./migrating_from_csv.md) for a full comparison.
55
+
35
56
  It can also be used with a block. The block always receives an array of hashes and an optional chunk index:
36
57
 
37
58
  ```
38
59
  SmarterCSV.process(file_or_input, options) do |array_of_hashes|
39
- # without chunk_size, each yield conatins a one-element array (one row)
60
+ # without chunk_size, each yield contains a one-element array (one row)
40
61
  end
41
62
  ```
42
63
 
@@ -81,11 +102,133 @@ It can also be used with a block. The block always receives an array of hashes a
81
102
  This allows you access to the internal state of the `reader` instance after processing.
82
103
 
83
104
 
105
+ ## Modern Enumerator API — `each`
106
+
107
+ `Reader#each` is the modern, idiomatic way to read CSV rows one at a time. It always yields a single `Hash` per row, and because `Reader` includes `Enumerable`, every standard Ruby enumerable method works out of the box.
108
+
109
+ ### Simplified form
110
+
111
+ ```ruby
112
+ SmarterCSV.each('data.csv', options) do |hash|
113
+ MyModel.upsert(hash)
114
+ end
115
+ ```
116
+
117
+ ### Full form (recommended — retains reader state after processing)
118
+
119
+ ```ruby
120
+ reader = SmarterCSV::Reader.new('data.csv', options)
121
+
122
+ reader.each do |hash|
123
+ MyModel.upsert(hash)
124
+ end
125
+
126
+ puts reader.headers # accessible after processing
127
+ puts reader.errors.inspect
128
+ ```
129
+
130
+ ### Returns an Enumerator when called without a block
131
+
132
+ ```ruby
133
+ enum = SmarterCSV.each('data.csv', options)
134
+ enum.to_a # => [{ name: "Alice", ... }, { name: "Bob", ... }, ...]
135
+ ```
136
+
137
+ ### Enumerable methods work directly
138
+
139
+ Because `Reader` includes `Enumerable`, all standard Ruby enumerable methods work:
140
+
141
+ ```ruby
142
+ reader = SmarterCSV::Reader.new('data.csv', options)
143
+
144
+ # Filter rows
145
+ us_users = reader.select { |h| h[:country] == 'US' }
146
+
147
+ # Transform
148
+ names = reader.map { |h| h[:name] }
149
+
150
+ # Count good rows
151
+ reader.count
152
+
153
+ # Row index (0-based count of successfully parsed rows, excluding bad rows)
154
+ reader.each_with_index do |hash, i|
155
+ puts "Row #{i}: #{hash[:name]}"
156
+ end
157
+
158
+ # Free chunking via Enumerable — no chunk_size needed
159
+ reader.each_slice(100) do |batch|
160
+ MyModel.insert_all(batch)
161
+ end
162
+ ```
163
+
164
+ ### Lazy evaluation
165
+
166
+ `lazy` lets you stop early without reading the entire file:
167
+
168
+ ```ruby
169
+ # Read only the first 10 rows matching a condition
170
+ reader = SmarterCSV::Reader.new('big.csv', options)
171
+ result = reader.lazy.select { |h| h[:status] == 'active' }.first(10)
172
+ ```
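The early-stop behavior comes from Ruby's `Enumerable#lazy`, not from anything CSV-specific. The sketch below shows it with a hypothetical stand-in class, `CountingRows`, that includes `Enumerable` and counts how many rows it has produced; it does not read a file or use SmarterCSV.

```ruby
# Hypothetical stand-in for a row source; it counts how many rows were produced.
class CountingRows
  include Enumerable
  attr_reader :rows_read

  def initialize(total)
    @total = total
    @rows_read = 0
  end

  # Yields one Hash per row, like a row-at-a-time reader would.
  def each
    return enum_for(:each) unless block_given?

    @total.times do |i|
      @rows_read += 1
      yield({ id: i, status: i.even? ? 'active' : 'inactive' })
    end
  end
end

rows = CountingRows.new(1_000)
first_three = rows.lazy.select { |h| h[:status] == 'active' }.first(3)

puts first_three.map { |h| h[:id] }.inspect
puts "rows produced: #{rows.rows_read} of 1000" # far fewer than 1,000
```

Without `lazy`, `select` would build the full filtered array before `first(3)` ran, so every row would be produced.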
173
+
174
+ ### `each` ignores `chunk_size`
175
+
176
+ If `chunk_size` is set in options, `each` ignores it and always yields individual `Hash` objects. Use [`each_chunk`](./batch_processing.md) for chunked batch processing.
177
+
178
+ ### Interaction with `on_bad_row`
179
+
180
+ `each` respects all `on_bad_row` options. Bad rows are skipped (or routed to your handler) and never yielded:
181
+
182
+ ```ruby
183
+ reader = SmarterCSV::Reader.new('data.csv', on_bad_row: :collect)
184
+ reader.each { |hash| MyModel.upsert(hash) }
185
+ reader.errors[:bad_rows].each { |rec| puts "Bad row: #{rec[:error_message]}" }
186
+ ```
187
+
188
+ ---
189
+
190
+ ## Value Transformation Pipeline
191
+
192
+ After each row is parsed, SmarterCSV applies transformations to field values in this order:
193
+
194
+ | Step | Option | Default | Description |
195
+ |------|--------|---------|-------------|
196
+ | 1 | `strip_whitespace` | `true` | Strips leading/trailing whitespace from all values (and headers) at parse time |
197
+ | 2 | `nil_values_matching` | `nil` | Sets values matching the regexp to `nil` |
198
+ | 3 | `remove_empty_values` | `true` | Removes keys whose value is `nil` or blank |
199
+ | 4 | `remove_zero_values` | `false` | Removes keys whose value is numeric zero |
200
+ | 5 | `convert_values_to_numeric` | `true` | Converts numeric-looking strings to `Integer` or `Float` |
201
+ | 6 | `value_converters` | `nil` | Applies per-key custom converter lambdas or classes |
202
+ | 7 | `remove_empty_hashes` | `true` | Drops rows that are entirely empty after all transformations |
203
+
204
+ > Steps 2–6 run per field, in that order, for every key/value pair in the row.
205
+ > `value_converters` receive the value **after** numeric conversion — guard against `Integer`/`Float` input if needed.
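The note about guarding against `Integer`/`Float` input can be illustrated without the gem. In this sketch, `to_numeric` is a simplified stand-in for step 5 and `price_in_cents` is a hypothetical converter invoked directly with `call`; how SmarterCSV itself invokes converters is not shown here.

```ruby
# Simplified stand-in for step 5 (numeric conversion); not SmarterCSV's code.
def to_numeric(value)
  case value
  when /\A-?\d+\z/      then value.to_i
  when /\A-?\d+\.\d+\z/ then value.to_f
  else value
  end
end

# Hypothetical converter that tolerates String, Integer, or Float input.
price_in_cents = lambda do |value|
  case value
  when Integer, Float then (value * 100).round
  when String         then (Float(value.delete('$')) * 100).round
  end
end

raw = { price: '12.50', qty: '3', note: '$4.25' }

row = raw.transform_values { |v| to_numeric(v) } # step 5: '12.50' becomes a Float
row[:price] = price_in_cents.call(row[:price])   # step 6 receives a Float
row[:note]  = price_in_cents.call(row[:note])    # step 6 receives a String

puts row.inspect
```

A converter that assumed String input would fail on `:price` here, because the value has already been converted to a `Float` by the time the converter sees it.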
206
+
207
+ See [Data Transformations](./data_transformations.md) and [Value Converters](./value_converters.md) for details.
208
+
209
+ ---
210
+
211
+ ## Header Transformation Pipeline
212
+
213
+ Before any data rows are processed, the header line passes through these steps:
214
+
215
+ ```
216
+ comment_regexp → strip_chars_from_headers → split on col_sep → strip quote_char
217
+ → strip_whitespace → [gsub spaces/dashes→_ → downcase_header]
218
+ → disambiguate_headers → symbolize → key_mapping
219
+ ```
220
+
221
+ `user_provided_headers` bypasses the file header and all transformation steps — your array is used as-is.
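Several of the steps in the pipeline above can be approximated in a few lines of plain Ruby. `normalize_headers` is a hypothetical helper written for this note; it omits comment filtering, `strip_chars_from_headers`, and header disambiguation, and it splits naively on `col_sep` without honoring separators inside quotes.

```ruby
# Hypothetical sketch of some of the header steps; not SmarterCSV's code.
def normalize_headers(line, col_sep: ',', quote_char: '"', key_mapping: {})
  line.chomp.split(col_sep).map do |h|
    h = h.delete(quote_char)           # strip quote_char
    h = h.strip                        # strip_whitespace
    h = h.gsub(/[\s-]+/, '_').downcase # spaces/dashes -> _, then downcase
    key = h.to_sym                     # symbolize
    key_mapping.fetch(key, key)        # key_mapping renames selected keys
  end
end

headers = normalize_headers(%("First Name", Last-Name ,E-Mail\n),
                            key_mapping: { e_mail: :email })
puts headers.inspect # [:first_name, :last_name, :email]
```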
222
+
223
+ See [Header Transformations](./header_transformations.md) for the full step-by-step table and options.
224
+
225
+ ---
226
+
84
227
  ## Rescue from Exceptions
85
228
 
86
229
  While SmarterCSV uses sensible defaults to process the most common CSV files, it will raise exceptions if it can not auto-detect `col_sep`, `row_sep`, or if it encounters other problems. Therefore please rescue from `SmarterCSV::Error`, and handle outliers according to your requirements.
87
230
 
88
- If you encounter unusual CSV files, please follow the tips in the Troubleshooting section below. You can use the options below to accomodate for unusual formats.
231
+ If you encounter unusual CSV files, please follow the tips in the Troubleshooting section below. You can use the options below to accommodate unusual formats.
89
232
 
90
233
  ## Troubleshooting
91
234
 
@@ -102,9 +245,8 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
102
245
 
103
246
  ## Assumptions / Limitations
104
247
 
105
- * the escape character is `\`, as on UNIX and Windows systems.
106
- * quote charcters around fields are balanced, e.g. valid: `"field"`, invalid: `"field\"`
107
- e.g. an escaped `quote_char` does not denote the end of a field.
248
+ * By default, quote escaping uses `:auto` mode: SmarterCSV tries backslash-escape (`\"`) first and falls back to RFC 4180 doubled quotes (`""`). Use `quote_escaping: :double_quotes` or `:backslash` to fix the mode explicitly. See [Parsing Strategy](./parsing_strategy.md).
249
+ * Quote characters around fields are expected to be balanced, e.g. valid: `"field"`, invalid: `"field\"` — an escaped `quote_char` does not denote the end of a field.
108
250
 
109
251
 
110
252
  ## NOTES about File Encodings:
@@ -125,4 +267,5 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
125
267
  ```
126
268
 
127
269
  ----------------
128
- PREVIOUS: [Parsing Strategy](./parsing_strategy.md) | NEXT: [The Basic Write API](./basic_write_api.md)
270
+
271
+ PREVIOUS: [Parsing Strategy](./parsing_strategy.md) | NEXT: [The Basic Write API](./basic_write_api.md) | UP: [README](../README.md)