smarter_csv 1.15.2 → 1.16.1

Files changed (50)
  1. checksums.yaml +4 -4
  2. data/.rspec +2 -0
  3. data/.rubocop.yml +9 -0
  4. data/CHANGELOG.md +112 -1
  5. data/CONTRIBUTORS.md +4 -1
  6. data/Gemfile +1 -0
  7. data/README.md +129 -27
  8. data/docs/_introduction.md +45 -24
  9. data/docs/bad_row_quarantine.md +342 -0
  10. data/docs/basic_read_api.md +152 -9
  11. data/docs/basic_write_api.md +475 -59
  12. data/docs/batch_processing.md +162 -4
  13. data/docs/column_selection.md +184 -0
  14. data/docs/data_transformations.md +163 -29
  15. data/docs/examples.md +340 -46
  16. data/docs/header_transformations.md +94 -12
  17. data/docs/header_validations.md +57 -18
  18. data/docs/history.md +119 -0
  19. data/docs/instrumentation.md +166 -0
  20. data/docs/migrating_from_csv.md +565 -0
  21. data/docs/options.md +151 -87
  22. data/docs/parsing_strategy.md +64 -1
  23. data/docs/real_world_csv.md +263 -0
  24. data/docs/releases/1.16.0/benchmarks.md +223 -0
  25. data/docs/releases/1.16.0/changes.md +273 -0
  26. data/docs/releases/1.16.0/performance_notes.md +114 -0
  27. data/docs/row_col_sep.md +15 -5
  28. data/docs/ruby_csv_pitfalls.md +514 -0
  29. data/docs/value_converters.md +194 -57
  30. data/ext/smarter_csv/extconf.rb +3 -0
  31. data/ext/smarter_csv/smarter_csv.c +1017 -82
  32. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png +0 -0
  33. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.svg +108 -0
  34. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.png +0 -0
  35. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg +141 -0
  36. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.png +0 -0
  37. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.svg +139 -0
  38. data/lib/smarter_csv/errors.rb +8 -0
  39. data/lib/smarter_csv/file_io.rb +1 -1
  40. data/lib/smarter_csv/hash_transformations.rb +14 -13
  41. data/lib/smarter_csv/header_transformations.rb +21 -2
  42. data/lib/smarter_csv/headers.rb +2 -1
  43. data/lib/smarter_csv/options.rb +124 -7
  44. data/lib/smarter_csv/parser.rb +358 -74
  45. data/lib/smarter_csv/reader.rb +494 -46
  46. data/lib/smarter_csv/version.rb +1 -1
  47. data/lib/smarter_csv/writer.rb +71 -19
  48. data/lib/smarter_csv.rb +134 -13
  49. data/smarter_csv.gemspec +20 -10
  50. metadata +38 -80
@@ -0,0 +1,514 @@
1
+
2
+ ### Contents
3
+
4
+ * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
6
+ * [**Ruby CSV Pitfalls**](./ruby_csv_pitfalls.md)
7
+ * [Parsing Strategy](./parsing_strategy.md)
8
+ * [The Basic Read API](./basic_read_api.md)
9
+ * [The Basic Write API](./basic_write_api.md)
10
+ * [Batch Processing](./batch_processing.md)
11
+ * [Configuration Options](./options.md)
12
+ * [Row and Column Separators](./row_col_sep.md)
13
+ * [Header Transformations](./header_transformations.md)
14
+ * [Header Validations](./header_validations.md)
15
+ * [Column Selection](./column_selection.md)
16
+ * [Data Transformations](./data_transformations.md)
17
+ * [Value Converters](./value_converters.md)
18
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Instrumentation Hooks](./instrumentation.md)
20
+ * [Examples](./examples.md)
21
+ * [Real-World CSV Files](./real_world_csv.md)
22
+ * [SmarterCSV over the Years](./history.md)
23
+ * [Release Notes](./releases/1.16.0/changes.md)
24
+
25
+ ---
26
+
27
+ # Ruby CSV Pitfalls: Silent Data Corruption and Loss
28
+
29
+ Ruby's built-in `CSV` library is the go-to choice for many developers: it ships with Ruby and requires no dependencies. But it has failure modes that produce **no exception, no warning, and no indication that anything went wrong**. Your import runs, your tests pass, and your data is quietly wrong.
30
+
31
+ This page documents ten reproducible ways `CSV.read` (and `CSV.table`) can silently corrupt or lose data, with examples you can run yourself, and how SmarterCSV handles each case.
32
+
33
+ > **Note on `CSV.table`:** It's a convenience wrapper for `CSV.read` with `headers: true`, `header_converters: :symbol`, and `converters: :numeric`.
34
+
35
+ ---
36
+
37
+ ## At a Glance
38
+
39
+ | # | Ruby CSV Issue | Failure Mode | SmarterCSV fix | SmarterCSV Details |
40
+ |---|-------|-------------|:--------------:|---------|
41
+ | 1 | Extra columns silently dropped | Values beyond header count compete for the `nil` key — all but the last are discarded | by default ✅ | Default `missing_headers: :auto` auto-generates `:column_N` keys |
42
+ | 2 | Duplicate headers — last wins | `.to_h` keeps only the last value for a repeated header; earlier values silently lost | by default ✅ | Default `duplicate_header_suffix:` → `:score`, `:score2`, `:score3` |
43
+ | 3 | Empty headers — `""` key collision | Blank header cells become `""` keys; multiple blanks collide and overwrite each other | by default ✅ | Default `missing_header_prefix:` → `:column_1`, `:column_2` |
44
+ | 4 | BOM corrupts first header | `"\xEF\xBB\xBFname"` ≠ `"name"` — first column becomes unreachable by its key | by default ✅ | Automatic BOM stripping — always on, no option needed |
45
+ | 5 | Whitespace in headers ¹ | `" Age"` ≠ `"Age"` — lookup silently returns `nil` | by default ✅ | Default `strip_whitespace: true` strips headers and values |
46
+ | 6 | `liberal_parsing` garbles fields | Unmatched quotes produce wrong field boundaries — corrupted data returned as valid | by default ✅ | `on_bad_row: :raise` (default); opt-in `:skip` / `:collect` for quarantine |
47
+ | 7 | `nil` vs `""` for empty fields | Unquoted empty → `nil`, quoted empty → `""` — inconsistent empty checks | by default ✅ | Default `remove_empty_values: true` removes both; `false` normalizes both to `nil` |
48
+ | 8 | Backslash-escaped quotes (MySQL/Unix) | `\"` treated as field-closing quote — crash or garbled data | by default ✅ | Default `quote_escaping: :auto` handles both RFC 4180 and backslash escaping |
49
+ | 9 | Missing closing quote eats the rest of the file | One unclosed `"` swallows all subsequent rows into one field value | via option | `field_size_limit: N` raises immediately; `quote_boundary: :standard` (default) reduces exposure |
50
+ | 10 | No encoding auto-detection | Non-UTF-8 files either crash or silently produce mojibake | via option | `file_encoding:`, `force_utf8: true`, `invalid_byte_sequence:` |
51
+
52
+ ¹ The one case where `CSV.table` does better than `CSV.read`: its `header_converters: :symbol` option includes `.strip`, so whitespace is removed from headers. All other nine issues are identical between `CSV.read` and `CSV.table`.
53
+
54
+ ---
55
+
56
+ ## Why These Failures Are Dangerous
57
+
58
+ Every failure in this list is **silent**. No exception, no warning, no log line — the import completes successfully and the data is quietly wrong. That makes them hard to catch in tests and easy to miss in code review.
59
+
60
+ The root cause is that `CSV.read` is a tokenizer, not a data pipeline. It splits bytes into fields and returns them with no normalization, no validation, and no defensive handling of real-world messiness. Every assumption about what "clean" input looks like is left to the caller.
61
+
62
+ `CSV.table` fixes exactly one issue out of ten — whitespace in headers — because its `:symbol` converter happens to call `.strip`. Everything else is identical.
63
+
64
+ These are not obscure edge cases. Extra columns, trailing commas, BOMs, Windows-1252 encoding, duplicate headers, and blank header cells are all common in CSV files exported from Excel, reporting tools, ERP systems, and legacy data pipelines.
65
+
66
+ > **Ready to switch?** ➡️ [Migrating from Ruby CSV](./migrating_from_csv.md)
67
+
68
+ ---
69
+
70
+ ## 1. Extra Columns Without Headers — Values Silently Discarded
71
+
72
+ When a row has more fields than there are headers, `CSV.read` maps every extra field to the `nil` key. If there are multiple extra fields, they all compete for the same `nil` key — **only the last one survives**, the rest are silently discarded.
73
+
74
+ ```
75
+ $ cat example1.csv
76
+ First Name , Last Name , Age
77
+ Alice , Smith, 30, VIP, Gold ,
78
+ Bob, Jones, 25
79
+ ```
80
+
81
+ **With Ruby CSV:**
82
+
83
+ ```ruby
84
+ rows = CSV.read('example1.csv', headers: true).map(&:to_h)
85
+ rows.first
86
+ # => {" First Name " => "Alice ", " Last Name " => " Smith", " Age" => " 30", nil => ""}
87
+ # the values "VIP" and "Gold" are silently lost here ^^^^^^^^^
88
+ ```
89
+
90
+ Alice's row has 6 fields but only 3 headers. The extra fields `"VIP"`, `"Gold"`, and `""` (trailing comma) all land on the `nil` key, each overwriting the previous one. No error, no warning.
91
+
92
+ This is common in real-world exports: tools frequently append audit columns, status flags, or trailing commas that don't correspond to headers.
93
+
94
+ **`CSV.table` has the same problem.**
95
+
96
+ **With SmarterCSV:**
97
+
98
+ ```ruby
99
+ rows = SmarterCSV.process('example1.csv')
100
+ rows.first
101
+ # => {first_name: "Alice", last_name: "Smith", age: 30, column_1: "VIP", column_2: "Gold"}
102
+ ```
103
+
104
+ The default `missing_headers: :auto` auto-generates distinct names for extra columns using `missing_header_prefix` (default: `"column_"`). The trailing empty field is dropped by the default `remove_empty_values: true` setting. No data loss.
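The collision comes from how a hash is built out of header/field pairs, so it can be reproduced without any CSV library. The following is a minimal sketch in plain Ruby; `naive_row` and `safe_row` are hypothetical helpers written for illustration and are not the actual implementation of Ruby CSV or SmarterCSV.

```ruby
# Illustrative only: hypothetical helpers, not library code.
HEADERS = %w[first_name last_name age].freeze
FIELDS  = %w[Alice Smith 30 VIP Gold].freeze

# Naive mapping: headers are padded with nil, so every extra field
# is written to the same nil key and only the last one survives.
def naive_row(headers, fields)
  padded = headers + [nil] * [fields.size - headers.size, 0].max
  padded.zip(fields).each_with_object({}) { |(key, value), row| row[key] = value }
end

# Defensive mapping: give each extra field its own generated name.
def safe_row(headers, fields, prefix: "column_")
  extra = [fields.size - headers.size, 0].max
  names = headers + (1..extra).map { |i| "#{prefix}#{i}" }
  names.zip(fields).to_h
end

naive_row(HEADERS, FIELDS)  # 4 keys; the nil key holds only "Gold"
safe_row(HEADERS, FIELDS)   # 5 keys; "column_1" => "VIP", "column_2" => "Gold"
```

The naive version ends with fewer keys than fields. Comparing those two counts is a cheap sanity check to run on any imported row.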
105
+
106
+ ---
107
+
108
+ ## 2. Duplicate Header Names — First Value Silently Dropped
109
+
110
+ When two columns share the same header name, `CSV::Row#to_h` keeps only the **last** value. The first is silently dropped.
111
+
112
+ ```
113
+ $ cat example2.csv
114
+ score,name,score
115
+ 95,Alice,87
116
+ ```
117
+
118
+ **With Ruby CSV:**
119
+
120
+ ```ruby
121
+ rows = CSV.read('example2.csv', headers: true).map(&:to_h)
122
+ rows.first
123
+ # => {"score" => "87", "name" => "Alice"}
124
+ # ^^^ first score (95) silently lost
125
+ ```
126
+
127
+ Common with reporting tool exports that repeat a column (e.g., two date columns both labeled `"Date"`).
128
+
129
+ **With SmarterCSV:**
130
+
131
+ ```ruby
132
+ rows = SmarterCSV.process('example2.csv')
133
+ rows.first
134
+ # => {score: 95, name: "Alice", score2: 87}
135
+ ```
136
+
137
+ * The default `duplicate_header_suffix: ""` disambiguates by appending a counter: `:score`, `:score2`, `:score3`.
138
+ * Use `duplicate_header_suffix: '_'` to get `:score_2`, `:score_3`.
139
+ * Set `duplicate_header_suffix: nil` to raise `DuplicateHeaders` instead.
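The counter-based naming described in the list above can be sketched in a few lines of plain Ruby. `disambiguate_headers` is a hypothetical helper, shown only to illustrate the idea; it is not SmarterCSV's implementation and makes no claim about how the library handles edge cases.

```ruby
# Illustrative only: a hypothetical helper, not SmarterCSV's implementation.
# Appends a running counter to repeated header names; the first
# occurrence keeps its original name.
def disambiguate_headers(headers, suffix: "")
  seen = Hash.new(0)
  headers.map do |name|
    seen[name] += 1
    seen[name] == 1 ? name : "#{name}#{suffix}#{seen[name]}"
  end
end

disambiguate_headers(%w[score name score])               # => ["score", "name", "score2"]
disambiguate_headers(%w[score name score], suffix: "_")  # => ["score", "name", "score_2"]
```

This sketch does not guard against a generated name such as `score2` clashing with a header that already exists in the file.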
140
+
141
+ ---
142
+
143
+ ## 3. Empty Header Fields — `""` Key Collision
144
+
145
+ A CSV file with blank header cells (e.g., `name,,age`) gives those columns an empty string key. Multiple blank headers all collide on `""` — same overwrite problem as issue #1.
146
+
147
+ > This is distinct from issue #1. Issue #1 is about extra *data* fields beyond the header count, which get keyed under `nil`. Issue #3 is about blank cells *in the header row itself*, which get keyed under `""`.
148
+
149
+ ```
150
+ $ cat example3.csv
151
+ name,,,age
152
+ Alice,foo,bar,30
153
+ ```
154
+
155
+ **With Ruby CSV:**
156
+
157
+ ```ruby
158
+ rows = CSV.read('example3.csv', headers: true).map(&:to_h)
159
+ rows.first
160
+ # => {"name" => "Alice", "" => "bar", "age" => "30"}
161
+ # ^^^ "foo" silently lost — both blank headers wrote to the "" key
162
+ ```
163
+
164
+ `CSV.table` converts headers to symbols, so blank headers become `:""`. Same collision, different key:
165
+
166
+ ```ruby
167
+ rows = CSV.table('example3.csv').map(&:to_h)
168
+ rows.first
169
+ # => {name: "Alice", :"" => "bar", age: 30}
170
+ # ^^^ "foo" still silently lost
171
+ ```
172
+
173
+ **With SmarterCSV:**
174
+
175
+ ```ruby
176
+ rows = SmarterCSV.process('example3.csv')
177
+ rows.first
178
+ # => {name: "Alice", column_1: "foo", column_2: "bar", age: 30}
179
+ ```
180
+
181
+ `missing_header_prefix:` (default `"column_"`) auto-generates names for blank headers: `:column_1`, `:column_2`, etc. No collision, no data loss.
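When a parser cannot be swapped out, a pre-flight check on the header row at least makes the problem visible. The sketch below assumes the header row is already available as an array of strings. `header_problems` is a hypothetical helper and not part of any library.

```ruby
# Illustrative only: hypothetical pre-flight check for a header row.
# Returns the positions of blank headers and the names that repeat.
def header_problems(headers)
  blank_positions = headers.each_index.select { |i| headers[i].to_s.strip.empty? }
  duplicates = headers.reject { |h| h.to_s.strip.empty? }
                      .tally
                      .select { |_name, count| count > 1 }
                      .keys
  { blank_positions: blank_positions, duplicates: duplicates }
end

header_problems(["name", "", nil, "age"])
# blank positions 1 and 2, no duplicates
header_problems(%w[score name score])
# no blank positions, "score" reported as a duplicate
```

Failing the import when either list is non-empty turns a silent overwrite into an explicit error.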
182
+
183
+ ---
184
+
185
+ ## 4. BOM Corrupts the First Header
186
+
187
+ Files saved by Excel on Windows often include a UTF-8 BOM (`\xEF\xBB\xBF`) at the start. `CSV.read` does not strip it, so the BOM is silently prepended to the first header name.
188
+
189
+ ```
190
+ $ cat example4.csv
191
+ name,age
192
+ Alice,30
193
+ ```
194
+
195
+ ```
196
+ $ hexdump -C example4.csv
197
+ 00000000 ef bb bf 6e 61 6d 65 2c 61 67 65 0a 41 6c 69 63 |...name,age.Alic|
198
+ 00000010 65 2c 33 30 0a |e,30.|
199
+ ```
200
+
201
+ The `ef bb bf` at offset 0 is the UTF-8 BOM — invisible in `cat` output but silently prepended to the first header by `CSV.read`.
202
+
203
+ **With Ruby CSV:**
204
+
205
+ ```ruby
206
+ rows = CSV.read('example4.csv', headers: true).map(&:to_h)
207
+ rows.first.keys.first # => "\xEF\xBB\xBFname" ← not "name"
208
+
209
+ rows.first['name'] # => nil ← first column unreachable
210
+ ```
211
+
212
+ The data is present but every lookup on the first column silently returns `nil`. The BOM is invisible in most terminals and editors — the output appears correct.
213
+
214
+ **With SmarterCSV:**
215
+
216
+ ```ruby
217
+ rows = SmarterCSV.process('example4.csv')
218
+ rows.first[:name] # => "Alice" ← BOM stripped automatically
219
+ ```
220
+
221
+ By default SmarterCSV automatically detects and strips BOMs. Always on, no option needed.
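The effect of the BOM can be reproduced with Ruby's own file I/O, independent of any CSV library. The sketch below writes a temporary file that starts with the three BOM bytes, then reads it twice: once as plain UTF-8 and once with Ruby's `bom|utf-8` encoding prefix, which skips a leading BOM when one is present. The file name and contents are made up for the example.

```ruby
require "tempfile"

# Write a small file that starts with the UTF-8 BOM (EF BB BF).
file = Tempfile.new(["bom_example", ".csv"])
file.binmode
file.write("\xEF\xBB\xBFname,age\nAlice,30\n".b)
file.close

# Plain UTF-8 read: the BOM survives as U+FEFF in front of "name".
RAW = File.read(file.path, encoding: "utf-8")

# "bom|utf-8" tells Ruby to skip a leading BOM if one is present.
CLEAN = File.open(file.path, "r:bom|utf-8", &:read)

RAW.start_with?("\uFEFF")   # => true
CLEAN.start_with?("name")   # => true
```

The same check, `text.start_with?("\uFEFF")`, works as a cheap guard in tests for any code path that reads user-supplied files.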
222
+
223
+ ---
224
+
225
+ ## 5. Whitespace in Header Names — Silent `nil` on Lookup
226
+
227
+ `CSV.read` returns headers exactly as they appear in the file, including leading and trailing whitespace. Code that accesses columns by the expected name silently gets `nil`.
228
+
229
+ ```
230
+ $ cat example5.csv
231
+ name , age
232
+ Alice,30
233
+ ```
234
+
235
+ **With Ruby CSV:**
236
+
237
+ ```ruby
238
+ rows = CSV.read('example5.csv', headers: true).map(&:to_h)
239
+ rows.first
240
+ # => {" name " => "Alice", " age " => "30"}
241
+
242
+ rows.first['name'] # => nil ← key is " name ", not "name"
243
+ rows.first['age'] # => nil
244
+ ```
245
+
246
+ > `CSV.table` mitigates this: the `:symbol` header converter includes `.strip`. This is the one issue where `CSV.table` behaves better than `CSV.read`.
247
+
248
+ **With SmarterCSV:**
249
+
250
+ ```ruby
251
+ rows = SmarterCSV.process('example5.csv')
252
+ rows.first
253
+ # => {name: "Alice", age: 30}
254
+ ```
255
+
256
+ The default setting `strip_whitespace: true` strips leading/trailing whitespace from both headers and values.
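Header normalization of this kind is a small transformation. The sketch below shows one possible version in plain Ruby. `normalize_header` is a hypothetical helper and is not claimed to match the exact rules of SmarterCSV or of `CSV.table`'s `:symbol` converter.

```ruby
# Illustrative only: hypothetical helper, not a library API.
# Strips surrounding whitespace, lowercases, and joins words with "_".
def normalize_header(raw)
  raw.to_s.strip.downcase.gsub(/\s+/, "_").to_sym
end

normalize_header(" First Name ")  # => :first_name
normalize_header(" age ")         # => :age
```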
257
+
258
+ ---
259
+
260
+ ## 6. `liberal_parsing: true` Garbles Field Values
261
+
262
+ `CSV.read` raises `MalformedCSVError` when it encounters an unmatched quote. `liberal_parsing: true` suppresses the error and returns a row anyway — but with wrong field boundaries.
263
+
264
+ **The key danger:** without `liberal_parsing` you at least know something is wrong. With it, corrupted data is silently returned as valid.
265
+
266
+ ```
267
+ $ cat example6.csv
268
+ name,note,score
269
+ Alice,"unclosed quote,99
270
+ Bob,normal,87
271
+ ```
272
+
273
+ **With Ruby CSV:**
274
+
275
+ ```ruby
276
+ # Without liberal_parsing: you know something is wrong
277
+ CSV.read('example6.csv', headers: true)
278
+ # => CSV::MalformedCSVError: Unclosed quoted field on line 2
279
+
280
+ # With liberal_parsing: silent corruption
281
+ rows = CSV.read('example6.csv', headers: true, liberal_parsing: true).map(&:to_h)
282
+ rows.length # => 1 (not 2 — Bob's row is gone)
283
+ rows[0]
284
+ # => {"name" => "Alice", "note" => "unclosed quote,99\nBob,normal,87", "score" => nil}
285
+ # ^^^ Alice's note field swallowed the rest of the file; Bob vanished
286
+ ```
287
+
288
+ The garbled row passes validations, gets inserted into the database, and surfaces as a data quality issue later.
289
+
290
+ **With SmarterCSV:**
291
+
292
+ ```ruby
293
+ reader = SmarterCSV::Reader.new('example6.csv', on_bad_row: :collect)
294
+ good_rows = reader.process
295
+ reader.errors
296
+ # => {
297
+ # :bad_row_count => 1,
298
+ # :bad_rows => [
299
+ # {
300
+ # :csv_line_number => 2,
301
+ # :file_line_number => 2,
302
+ # :file_lines_consumed => 2,
303
+ # :error_class => SmarterCSV::MalformedCSV,
304
+ # :error_message => "Unclosed quoted field detected in multiline data",
305
+ # :raw_logical_line => "Alice,\"unclosed quote,99\nBob,normal,87\n"
306
+ # }
307
+ # ]
308
+ # }
309
+ ```
310
+
311
+ Or pass a lambda to `on_bad_row` — works with `SmarterCSV.process` (no `Reader` instance needed):
312
+
313
+ ```ruby
314
+ bad_rows = []
315
+ good_rows = SmarterCSV.process('example6.csv',
316
+ on_bad_row: ->(rec) { bad_rows << rec })
317
+ ```
318
+
319
+ * `on_bad_row: :raise` (default) fails fast.
320
+ * `on_bad_row: :collect` quarantines bad rows; use `reader.errors` to access them.
321
+ * `on_bad_row: ->(rec) { ... }` calls your lambda per bad row; works with `SmarterCSV.process`.
322
+ * `on_bad_row: :skip` discards bad rows silently.
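The quarantine idea can also be approximated before any parser runs. The sketch below is a heuristic rather than a CSV parser: in RFC 4180 input, a complete record contains an even number of double quotes, so a physical line with an odd count is set aside for inspection. `partition_suspect_lines` and the keys in its result are hypothetical names chosen for this example.

```ruby
# Illustrative only: a hypothetical pre-flight heuristic, not a CSV parser.
# In RFC 4180 input every complete record holds an even number of
# double quotes, so an odd count on a physical line is suspicious.
def partition_suspect_lines(text)
  good = []
  suspect = []
  text.each_line(chomp: true).with_index(1) do |line, number|
    if line.count('"').odd?
      suspect << { file_line_number: number, raw: line }
    else
      good << line
    end
  end
  [good, suspect]
end

SAMPLE = <<~DATA
  name,note,score
  Alice,"unclosed quote,99
  Bob,normal,87
DATA

good, suspect = partition_suspect_lines(SAMPLE)
# good holds the header and Bob's line; suspect holds one entry, for line 2
```

Files with legitimate multi-line quoted fields will also be flagged, because the opening and closing lines of such a field each carry an odd quote count. Treat the result as a list of lines to review, not as a verdict.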
323
+
324
+ ---
325
+
326
+ ## 7. `nil` vs `""` for Empty Fields — Inconsistent Empty Checks
327
+
328
+ `CSV.read` treats unquoted empty fields and quoted empty fields differently:
329
+
330
+ - Unquoted empty (`,,`) → `nil`
331
+ - Quoted empty (`,"",`) → `""`
332
+
333
+ ```
334
+ $ cat example7.csv
335
+ name,city
336
+ Alice,
337
+ Bob,""
338
+ ```
339
+
340
+ **With Ruby CSV:**
341
+
342
+ ```ruby
343
+ rows = CSV.read('example7.csv', headers: true).map(&:to_h)
344
+
345
+ rows[0]['city'] # => nil (unquoted empty)
346
+ rows[1]['city'] # => "" (quoted empty)
347
+
348
+ rows[0]['city'].nil? # => true
349
+ rows[1]['city'].nil? # => false ← same semantic meaning, different Ruby type
350
+ ```
351
+
352
+ Both rows have no city, but your code sees two different things. Any check using `.nil?`, `.blank?`, `.present?`, or `if row['city']` will behave differently depending on how the upstream exporter quoted the empty field.
353
+
354
+ **With SmarterCSV:**
355
+
356
+ ```ruby
357
+ # remove_empty_values: true (default) — both empty cities are dropped from the hash
358
+ rows = SmarterCSV.process('example7.csv')
359
+ rows[0] # => {name: "Alice"}
360
+ rows[1] # => {name: "Bob"}
361
+
362
+ # remove_empty_values: false — both normalized to nil
363
+ rows = SmarterCSV.process('example7.csv', remove_empty_values: false)
364
+ rows[0] # => {name: "Alice", city: nil}
365
+ rows[1] # => {name: "Bob", city: nil}
366
+ ```
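If rows have already been parsed into hashes by another tool, the same normalization can be applied afterwards. The helpers below are hypothetical and written for illustration. They treat `nil`, `""`, and whitespace-only strings as equally blank, which is an assumption you may want to narrow.

```ruby
# Illustrative only: hypothetical helpers for rows parsed as hashes.
def blank_value?(value)
  value.nil? || value.to_s.strip.empty?
end

# Map every blank value to nil so `.nil?` checks behave uniformly.
def normalize_blanks(row)
  row.transform_values { |value| blank_value?(value) ? nil : value }
end

# Or drop blank entries from the row entirely.
def drop_blanks(row)
  row.reject { |_key, value| blank_value?(value) }
end

normalize_blanks({ "name" => "Bob", "city" => "" })   # "city" becomes nil
drop_blanks({ "name" => "Alice", "city" => nil })      # only "name" remains
```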
367
+
368
+ ---
369
+
370
+ ## 8. Backslash-Escaped Quotes — MySQL / Unix Dump Format
371
+
372
+ MySQL's `SELECT INTO OUTFILE`, PostgreSQL `COPY TO`, and many Unix data-pipeline tools escape embedded double quotes as `\"` — not as `""` (the RFC 4180 standard). Ruby's `CSV` only understands RFC 4180, so a backslash before a quote is treated as two separate characters: a literal `\` followed by a `"` that immediately **closes the field**.
373
+
374
+ ```
375
+ $ cat example8.csv
376
+ name,note
377
+ Alice,"She said \"hello\" to everyone"
378
+ Bob,"Normal note"
379
+ ```
380
+
381
+ **With Ruby CSV — Scenario 1: crash** (at least you know something went wrong):
382
+
383
+ ```ruby
384
+ rows = CSV.read('example8.csv', headers: true)
385
+ # => CSV::MalformedCSVError: Illegal quoting in line 2.
386
+ ```
387
+
388
+ **With Ruby CSV — Scenario 2: silent garbling** with `liberal_parsing: true`:
389
+
390
+ ```ruby
391
+ rows = CSV.read('example8.csv', headers: true, liberal_parsing: true)
392
+ rows[0]['name'] # => "Alice"
393
+ rows[0]['note'] # => "She said \\" ← field closed at the backslash-quote; rest lost
394
+ rows[1]['name'] # => "hello" ← Alice's leftovers eaten as Bob's name
395
+ rows[1]['note'] # => nil
396
+ ```
397
+
398
+ No exception. No warning. `rows.length` is still 2. The data just quietly moved to the wrong fields.
399
+
400
+ **With SmarterCSV:**
401
+
402
+ ```ruby
403
+ rows = SmarterCSV.process('example8.csv')
404
+ rows[0] # => {name: "Alice", note: "She said \"hello\" to everyone"}
405
+ rows[1] # => {name: "Bob", note: "Normal note"}
406
+ ```
407
+
408
+ `quote_escaping: :auto` (default) detects and handles both `""` and `\"` escaping row-by-row. No option required. This covers MySQL `SELECT INTO OUTFILE`, PostgreSQL `COPY TO`, and Unix `csvkit`/`awk`-generated files.
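For pipelines that must keep an RFC 4180-only parser, one workaround is to rewrite the escapes before parsing. The sketch below shows that pre-processing step in plain Ruby. `backslash_to_rfc4180` is a hypothetical helper, and the approach is a swapped-in technique, not a description of how SmarterCSV detects escaping.

```ruby
# Illustrative only: a hypothetical pre-processing step.
# Rewrites backslash-escaped quotes (\") as doubled quotes ("").
def backslash_to_rfc4180(line)
  line.gsub('\\"', '""')
end

backslash_to_rfc4180('Alice,"She said \"hello\" to everyone"')
# => Alice,"She said ""hello"" to everyone"
```

This blind substitution is wrong for a field that ends in a literal backslash, such as `"C:\"`, and for escaped backslashes followed by a quote. It is only safe when the exporter is known to use `\"` exclusively as a quote escape.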
409
+
410
+ ---
411
+
412
+ ## 9. Missing Closing Quote Consumes the Rest of the File
413
+
414
+ A single unclosed `"` causes the parser to enter quoted-field mode and treat everything that follows — newlines included — as part of one field. **All remaining rows are swallowed into a single field value.**
415
+
416
+ ```
417
+ $ cat example9.csv
418
+ name,age
419
+ "Alice,30
420
+ Bob,25
421
+ Carol,40
422
+ ```
423
+
424
+ **With Ruby CSV:**
425
+
426
+ ```ruby
427
+ rows = CSV.read('example9.csv', headers: true)
428
+ rows.length # => 1 (not 3)
429
+ rows.first['name'] # => "Alice,30\nBob,25\nCarol,40"
430
+ # ^^^ entire remainder of file in one field
431
+ ```
432
+
433
+ On a large file this is an OOM risk: the parser accumulates an ever-growing string until EOF or memory exhaustion. There is no field size limit, no timeout, and no error until the file ends.
434
+
435
+ **With SmarterCSV:**
436
+
437
+ ```ruby
438
+ reader = SmarterCSV::Reader.new('example9.csv',
439
+ on_bad_row: :collect,
440
+ )
441
+ good_rows = reader.process
442
+ reader.errors
443
+ # => {
444
+ # :bad_row_count => 1,
445
+ # :bad_rows => [
446
+ # {
447
+ # :csv_line_number => 2,
448
+ # :file_line_number => 2,
449
+ # :file_lines_consumed => 3,
450
+ # :error_class => SmarterCSV::MalformedCSV,
451
+ # :error_message => "Unclosed quoted field detected in multiline data",
452
+ # :raw_logical_line => "\"Alice,30\nBob,25\nCarol,40\n"
453
+ # }
454
+ # ]
455
+ # }
456
+ ```
457
+
458
+ Or pass a lambda to `on_bad_row` — works with `SmarterCSV.process` (no `Reader` instance needed):
459
+
460
+ ```ruby
461
+ bad_rows = []
462
+ good_rows = SmarterCSV.process('example9.csv',
463
+ on_bad_row: ->(rec) { bad_rows << rec })
464
+ ```
465
+
466
+ `field_size_limit: N` raises `SmarterCSV::FieldSizeLimitExceeded` as soon as any field or accumulating multiline buffer exceeds N bytes — the runaway parse stops immediately. Additionally, `quote_boundary: :standard` (default since 1.16.0) means mid-field quotes don't toggle quoted mode, reducing the attack surface further.
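The idea of a size limit on the accumulating buffer can be sketched in plain Ruby. Everything named here (`FieldTooLarge`, `logical_records`, the `limit:` keyword) is hypothetical and only illustrates the technique. It is not SmarterCSV's implementation, and it uses a simple odd-quote count to decide whether a quoted field is still open.

```ruby
# Illustrative only: hypothetical names, not SmarterCSV's implementation.
class FieldTooLarge < StandardError; end

# Joins physical lines into logical records while a quoted field is
# still open (odd number of double quotes so far), and aborts once
# the pending buffer grows past `limit` bytes.
def logical_records(lines, limit:)
  records = []
  buffer = +""
  lines.each do |line|
    buffer << line
    raise FieldTooLarge, "buffer exceeded #{limit} bytes" if buffer.bytesize > limit
    next if buffer.count('"').odd? # quote still open: keep accumulating
    records << buffer
    buffer = +""
  end
  records << buffer unless buffer.empty? # unterminated tail, if any
  records
end

lines = ["name,age\n", "\"Alice,30\n", "Bob,25\n", "Carol,40\n"]
logical_records(lines, limit: 1_000).size  # => 2 (header plus one runaway record)
# logical_records(lines, limit: 16) raises FieldTooLarge on the third line
```

Without the limit, the runaway record keeps growing until the input ends. With it, the failure surfaces after a bounded amount of memory has been used.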
467
+
468
+ ---
469
+
470
+ ## 10. No Encoding Auto-Detection — Crash or Mojibake
471
+
472
+ `CSV.read` assumes UTF-8. CSV files exported from Excel on Windows are typically Windows-1252 (CP1252), which encodes accented characters (é, ü, ñ) differently from UTF-8.
473
+
474
+ ```
475
+ $ cat example10.csv
476
+ last_name,first_name
477
+ Müller,Hans
478
+ ```
479
+
480
+ The file is saved in Windows-1252 encoding: `ü` is stored as the single byte `\xFC`, which is not valid UTF-8.
481
+
482
+ **With Ruby CSV — Scenario 1: crash** (the better outcome — at least you know):
483
+
484
+ ```ruby
485
+ rows = CSV.read('example10.csv', headers: true)
486
+ # => Encoding::InvalidByteSequenceError: "\xFC" from ASCII-8BIT to UTF-8
487
+ ```
488
+
489
+ **With Ruby CSV — Scenario 2: silent mojibake** (the worse outcome):
490
+
491
+ ```ruby
492
+ # Specifying the wrong encoding suppresses the error
493
+ rows = CSV.read('example10.csv', headers: true, encoding: 'binary')
494
+ rows.first['last_name'] # => "M\xFCller" ← garbled string
495
+ rows.first['last_name'].valid_encoding? # => true ← Ruby thinks it's fine
496
+ ```
497
+
498
+ The mojibake string passes `.valid_encoding?`, passes database validations, gets stored, and surfaces as a display bug in production.
499
+
500
+ **With SmarterCSV:**
501
+
502
+ ```ruby
503
+ rows = SmarterCSV.process('example10.csv',
504
+ file_encoding: 'windows-1252:utf-8')
505
+ rows.first[:last_name] # => "Müller"
506
+ ```
507
+
508
+ * `file_encoding:` accepts Ruby's `'external:internal'` transcoding notation.
509
+ * `force_utf8: true` transcodes to UTF-8 automatically.
510
+ * `invalid_byte_sequence:` controls the replacement character for bytes that can't be transcoded.
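What transcoding does to the bytes can be seen with Ruby's `String` encoding methods alone. The sketch below uses the byte sequence from the example file and only core Ruby calls (`force_encoding`, `encode`, `valid_encoding?`, `scrub`). The constant names are made up for the example.

```ruby
# "M\xFCller" is how Windows-1252 stores "Müller" (0xFC is "ü").
RAW_BYTES = "M\xFCller".b

# Mislabeling the bytes as UTF-8 yields an invalid string.
AS_UTF8 = RAW_BYTES.dup.force_encoding("UTF-8")
AS_UTF8.valid_encoding?   # => false

# Labeling them correctly and transcoding yields proper UTF-8.
TRANSCODED = RAW_BYTES.dup.force_encoding("Windows-1252").encode("UTF-8")
TRANSCODED                # => "Müller"

# When the source encoding is unknown, invalid bytes can at least be
# replaced with a visible marker instead of being stored as-is.
SCRUBBED = AS_UTF8.scrub("?")
SCRUBBED                  # => "M?ller"
```

A `valid_encoding?` check is only meaningful once the string carries the encoding label you intend to store. On a binary-labeled string it always returns `true`.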
511
+
512
+ ---
513
+
514
+ PREVIOUS: [Migrating from Ruby CSV](./migrating_from_csv.md) | NEXT: [Parsing Strategy](./parsing_strategy.md) | UP: [README](../README.md)