smarter_csv 1.16.4 → 1.17.0

Files changed (46)
  1. checksums.yaml +4 -4
  2. data/.rubocop.yml +10 -1
  3. data/CHANGELOG.md +54 -0
  4. data/Gemfile +10 -5
  5. data/README.md +98 -14
  6. data/TO_DO.md +109 -0
  7. data/docs/_introduction.md +1 -0
  8. data/docs/bad_row_quarantine.md +2 -1
  9. data/docs/basic_read_api.md +6 -1
  10. data/docs/basic_write_api.md +30 -0
  11. data/docs/batch_processing.md +25 -0
  12. data/docs/column_selection.md +1 -0
  13. data/docs/data_transformations.md +1 -0
  14. data/docs/examples.md +126 -0
  15. data/docs/header_transformations.md +23 -0
  16. data/docs/header_validations.md +1 -0
  17. data/docs/history.md +1 -0
  18. data/docs/instrumentation.md +2 -1
  19. data/docs/migrating_from_csv.md +1 -0
  20. data/docs/options.md +20 -18
  21. data/docs/parsing_strategy.md +1 -0
  22. data/docs/real_world_csv.md +51 -1
  23. data/docs/releases/1.16.0/performance_notes.md +15 -15
  24. data/docs/releases/1.17.0/benchmarks.md +121 -0
  25. data/docs/releases/1.17.0/changes.md +161 -0
  26. data/docs/releases/1.17.0/performance_notes.md +126 -0
  27. data/docs/row_col_sep.md +21 -1
  28. data/docs/ruby_csv_pitfalls.md +1 -0
  29. data/docs/value_converters.md +24 -0
  30. data/docs/warnings.md +141 -0
  31. data/ext/smarter_csv/smarter_csv.c +98 -32
  32. data/images/SmarterCSV_1.17.0_vs_RubyCSV_3.3.5_speedup.svg +106 -0
  33. data/images/SmarterCSV_1.17.0_vs_previous_C-speedup.svg +181 -0
  34. data/images/SmarterCSV_1.17.0_vs_previous_Rb-speedup.svg +179 -0
  35. data/lib/smarter_csv/auto_detection.rb +215 -30
  36. data/lib/smarter_csv/file_io.rb +2 -2
  37. data/lib/smarter_csv/hash_transformations.rb +29 -13
  38. data/lib/smarter_csv/parser.rb +42 -33
  39. data/lib/smarter_csv/peekable_io.rb +453 -0
  40. data/lib/smarter_csv/reader.rb +119 -23
  41. data/lib/smarter_csv/reader_options.rb +61 -1
  42. data/lib/smarter_csv/version.rb +1 -1
  43. data/lib/smarter_csv.rb +40 -12
  44. metadata +12 -5
  45. data/TO_DO_v2.md +0 -14
  46. data/ext/smarter_csv/Makefile +0 -270
data/docs/examples.md CHANGED
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [**Examples**](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
@@ -43,6 +44,12 @@
43
44
  11. [Batch Processing with Sidekiq](#example-11-batch-processing-with-sidekiq)
44
45
  12. [Resumable CSV Import with Rails ActiveJob](#example-12-resumable-csv-import-with-rails-activejob-rails-81)
45
46
  13. [Instrumentation](#example-13-instrumentation)
47
+ 14. [Streaming Inputs (Non-Seekable IO)](#example-14-streaming-inputs-non-seekable-io)
48
+ 15. [Resumable Import (Plain Ruby)](#example-15-resumable-import-plain-ruby)
49
+ 16. [CSV Files with Comment Lines](#example-16-csv-files-with-comment-lines)
50
+ 17. [Tab-Separated Values (TSV)](#example-17-tab-separated-values-tsv)
51
+ 18. [Multi-Line Fields](#example-18-multi-line-fields)
52
+ 19. [Filtering and Transforming a CSV File](#example-19-filtering-and-transforming-a-csv-file)
46
53
 
47
54
  ---
48
55
 
@@ -369,5 +376,124 @@ SmarterCSV.process('large_import.csv',
369
376
 
370
377
  See [Instrumentation Hooks](./instrumentation.md).
371
378
 
379
+ ---
380
+
381
+ ## Example 14: Streaming Inputs (Non-Seekable IO)
382
+
383
+ *(1.17.0+)* SmarterCSV reads from gzipped files, HTTP responses, S3 objects, or piped STDIN — no need to materialize the file on disk first.
384
+
385
+ ```ruby
386
+ require 'zlib'
387
+ Zlib::GzipReader.open('huge.csv.gz') do |io|
388
+ SmarterCSV.process(io) { |row| MyModel.upsert(row.first) }
389
+ end
390
+ ```
391
+
392
+ See [Real-World CSV Files → I/O Patterns](./real_world_csv.md#io-patterns) for gzip, S3, HTTP, STDIN, and `IO.popen` worked examples.
393
+
394
+ ---
395
+
396
+ ## Example 15: Resumable Import (Plain Ruby)
397
+
398
+ A non-Rails counterpart to Example 12 — track the chunk cursor in a JSON file so an interrupted import resumes where it left off.
399
+
400
+ See [Batch Processing → Resumable Import (Plain Ruby)](./batch_processing.md#example-resumable-import-plain-ruby) for the worked example.
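The cursor idea can be sketched in plain Ruby without any CSV library. This is an illustration only: `load_cursor`, `save_cursor`, `import_chunks`, and the `last_chunk` JSON key are hypothetical names, and `each_slice` stands in for whatever chunked reader is in use.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical cursor helpers: remember the index of the last chunk that
# finished, so a restarted run can skip everything up to that point.
def load_cursor(path)
  File.exist?(path) ? JSON.parse(File.read(path)).fetch('last_chunk', -1) : -1
end

def save_cursor(path, index)
  File.write(path, JSON.generate({ 'last_chunk' => index }))
end

# Yield each chunk that has not been recorded as done, then record it.
def import_chunks(chunks, cursor_path)
  done = load_cursor(cursor_path)
  chunks.each_with_index do |chunk, index|
    next if index <= done
    yield chunk, index
    save_cursor(cursor_path, index)
  end
end

rows = (1..10).map { |i| { id: i } }
cursor_path = File.join(Dir.mktmpdir, 'import_cursor.json')
processed = []

# First run: simulate a crash while handling the third chunk (index 2).
begin
  import_chunks(rows.each_slice(3), cursor_path) do |chunk, index|
    raise 'simulated crash' if index == 2
    processed << chunk.map { |r| r[:id] }
  end
rescue RuntimeError
  # The cursor file still names index 1 as the last completed chunk.
end

# Second run: picks up at index 2 and finishes the remaining chunks.
import_chunks(rows.each_slice(3), cursor_path) do |chunk, _index|
  processed << chunk.map { |r| r[:id] }
end
```

Saving the cursor only after the block returns means a chunk that fails midway is retried on the next run, so the per-chunk work should be idempotent (an upsert rather than a blind insert, for example).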
401
+
402
+ ---
403
+
404
+ ## Example 16: CSV Files with Comment Lines
405
+
406
+ Strip lines matching a pattern (e.g. `#`-prefixed comments in DB dumps and log exports) using `comment_regexp`:
407
+
408
+ ```ruby
409
+ SmarterCSV.process('data.csv', comment_regexp: /\A#/)
410
+ ```
411
+
412
+ See [Header Transformations → CSV Files with Comment Lines](./header_transformations.md#csv-files-with-comment-lines) for the worked example.
413
+
414
+ ---
415
+
416
+ ## Example 17: Tab-Separated Values (TSV)
417
+
418
+ ```ruby
419
+ SmarterCSV.process('data.tsv') # auto-detected
420
+ SmarterCSV.process('data.tsv', col_sep: "\t") # explicit
421
+ ```
422
+
423
+ See [Row and Column Separators → Tab-Separated Values (TSV)](./row_col_sep.md#tab-separated-values-tsv) for details.
424
+
425
+ ---
426
+
427
+ ## Example 18: Multi-Line Fields
428
+
429
+ Newlines inside `"..."` are preserved as part of the field — common in addresses, CRM notes, and free-text comments. No configuration needed.
430
+
431
+ See [Real-World CSV Files → Multi-Line Quoted Fields](./real_world_csv.md#multi-line-quoted-fields) for the worked example.
432
+
433
+ ---
434
+
435
+ ## Example 19: Filtering and Transforming a CSV File
436
+
437
+ The Ruby CSV library has `CSV.filter` for "read CSV, mutate each row, write CSV." In SmarterCSV this is a two-line composition of `SmarterCSV.each` and `SmarterCSV.generate`:
438
+
439
+ ```ruby
440
+ SmarterCSV.generate('out.csv') do |csv|
441
+ SmarterCSV.each('in.csv') do |row|
442
+ row[:price] = (row[:price] * 1.1).round(2)
443
+ row.delete(:internal_notes)
444
+ csv << row
445
+ end
446
+ end
447
+ ```
448
+
449
+ The explicit `csv << row` is the win over `CSV.filter` — emission is intentional, not a side effect of mutating the block argument.
450
+
451
+ ### Pipeline (STDIN → STDOUT)
452
+
453
+ ```ruby
454
+ # cat in.csv | ruby filter.rb > out.csv
455
+ SmarterCSV.generate($stdout) do |csv|
456
+ SmarterCSV.each($stdin) { |row| csv << row }
457
+ end
458
+ ```
459
+
460
+ ### Skipping rows
461
+
462
+ ```ruby
463
+ SmarterCSV.generate('out.csv') do |csv|
464
+ SmarterCSV.each('in.csv') do |row|
465
+ next if row[:status] == 'archived' # just skip — no emit
466
+ csv << row
467
+ end
468
+ end
469
+ ```
470
+
471
+ ### Compressed in, compressed out
472
+
473
+ ```ruby
474
+ require 'zlib'
475
+ Zlib::GzipWriter.open('out.csv.gz') do |gz_out|
476
+ SmarterCSV.generate(gz_out) do |csv|
477
+ Zlib::GzipReader.open('in.csv.gz') do |gz_in|
478
+ SmarterCSV.each(gz_in) { |row| csv << row }
479
+ end
480
+ end
481
+ end
482
+ ```
483
+
484
+ Both endpoints are non-seekable streams — a pattern `CSV.filter` cannot handle, since it requires seekable input/output.
485
+
486
+ ### Header renaming on the way through
487
+
488
+ ```ruby
489
+ SmarterCSV.generate('out.csv', headers: [:given_name, :family_name, :email]) do |csv|
490
+ SmarterCSV.each('in.csv',
491
+ key_mapping: { first_name: :given_name, last_name: :family_name }
492
+ ) { |row| csv << row }
493
+ end
494
+ ```
495
+
496
+ Use `key_mapping:` on the read side to rename columns and `headers:` on the write side to enforce output column order.
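For a single row, the two options amount to roughly the following in plain Ruby. This is a sketch of the effect rather than of SmarterCSV's code path, and the sample row is made up for illustration.

```ruby
# Same mapping and header list as in the example above.
key_mapping = { first_name: :given_name, last_name: :family_name }
headers     = [:given_name, :family_name, :email]

# Hypothetical input row, with keys in an arbitrary order.
row = { email: 'alice@example.com', last_name: 'Smith', first_name: 'Alice' }

# Read side: rename keys that appear in the mapping, leave the others alone.
renamed = row.transform_keys { |key| key_mapping.fetch(key, key) }

# Write side: emit values in the order given by the header list.
# (No quoting or escaping here; a real writer has to handle both.)
line = headers.map { |header| renamed[header] }.join(',')
```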
497
+
372
498
  --------------------
373
499
  PREVIOUS: [Instrumentation Hooks](./instrumentation.md) | NEXT: [Real-World CSV Files](./real_world_csv.md) | UP: [README](../README.md)
data/docs/header_transformations.md CHANGED
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
@@ -61,6 +62,28 @@ See [Configuration Options](./options.md) for full option reference.
61
62
 
62
63
  ---
63
64
 
65
+ ## CSV Files with Comment Lines
66
+
67
+ Strip comment lines anywhere in the file — including before the header — using `comment_regexp`:
68
+
69
+ ```ruby
70
+ $ cat data.csv
71
+ # Generated 2026-01-15 by exporter v3.2
72
+ # Confidential — internal use only
73
+ id,name,amount
74
+ 1,Alice,100
75
+ 2,Bob,200
76
+ # end of file
77
+
78
+ data = SmarterCSV.process('data.csv', comment_regexp: /\A#/)
79
+ # => [{id: 1, name: "Alice", amount: 100},
80
+ # {id: 2, name: "Bob", amount: 200}]
81
+ ```
82
+
83
+ Common in database dumps, log exports, and pipelines that prepend provenance metadata. The regexp is applied per line — any line matching is dropped before parsing.
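The per-line rule can be pictured with a few lines of plain Ruby. This is only a sketch of the idea, not SmarterCSV's implementation; `strip_comment_lines` is a hypothetical helper, and the sample text is a shortened version of the file shown above.

```ruby
# Hypothetical helper: drop every line that matches the comment pattern
# before any CSV parsing takes place.
def strip_comment_lines(text, comment_regexp)
  text.each_line.reject { |line| line.match?(comment_regexp) }.join
end

raw = <<~CSV
  # Generated 2026-01-15 by exporter v3.2
  id,name,amount
  1,Alice,100
  # end of file
CSV

cleaned = strip_comment_lines(raw, /\A#/)
puts cleaned
```

Because the pattern is tested against each line separately, `\A` anchors at the start of every line, so `/\A#/` only drops lines whose first character is `#`; a `#` later in a data row is left alone.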
84
+
85
+ ---
86
+
64
87
  ## Header Normalization
65
88
 
66
89
  When processing the headers, it transforms them into Ruby symbols, stripping extra spaces, lower-casing them and replacing spaces with underscores. e.g. " \t Annual Sales " becomes `:annual_sales`. (see Notes below)
data/docs/header_validations.md CHANGED
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
data/docs/history.md CHANGED
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
data/docs/instrumentation.md CHANGED
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [**Instrumentation Hooks**](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
@@ -163,4 +164,4 @@ SmarterCSV.process(file, on_start: ON_START, on_complete: ON_COMPLETE)
163
164
  ```
164
165
 
165
166
  --------------------
166
- PREVIOUS: [Bad Row Quarantine](./bad_row_quarantine.md) | NEXT: [Examples](./examples.md) | UP: [README](../README.md)
167
+ PREVIOUS: [Warnings](./warnings.md) | NEXT: [Examples](./examples.md) | UP: [README](../README.md)
data/docs/migrating_from_csv.md CHANGED
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
data/docs/options.md CHANGED
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
@@ -52,27 +53,28 @@
52
53
 
53
54
  ### File Input & Encoding
54
55
 
55
- | Option | Default | Explanation |
56
- |--------|---------|-------------|
57
- | `:file_encoding` | `utf-8` | Set the file encoding, e.g. `'windows-1252'` or `'iso-8859-1'`. |
58
- | `:invalid_byte_sequence` | `''` | What to replace invalid byte sequences with. |
59
- | `:force_utf8` | `false` | Force UTF-8 encoding of all lines (including headers) in the CSV file. |
56
+ | Option | Default | Explanation |
57
+ |--------------------------|---------|------------------------------------------------------------------------|
58
+ | `:file_encoding` | `utf-8` | Set the file encoding, e.g. `'windows-1252'` or `'iso-8859-1'`. |
59
+ | `:invalid_byte_sequence` | `''` | What to replace invalid byte sequences with. |
60
+ | `:force_utf8` | `false` | Force UTF-8 encoding of all lines (including headers) in the CSV file. |
60
61
 
61
62
  ### File Layout
62
63
 
63
- | Option | Default | Explanation |
64
- |--------|---------|-------------|
65
- | `:skip_lines` | `nil` | How many lines to skip before the first line or header line is processed. |
66
- | `:comment_regexp` | `nil` | Regular expression to ignore comment lines (e.g. `/\A#/`). See NOTE on CSV header. |
67
- | `:chunk_size` | `nil` | If set, data is yielded in chunks of this many rows instead of all at once. Use with `SmarterCSV.each_chunk` for memory-efficient batch processing. |
64
+ | Option | Default | Explanation |
65
+ |-------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------|
66
+ | `:skip_lines` | `nil` | How many lines to skip before the first line or header line is processed. |
67
+ | `:comment_regexp` | `nil` | Regular expression to ignore comment lines (e.g. `/\A#/`). See NOTE on CSV header. |
68
+ | `:chunk_size` | `nil` | If set, data is yielded in chunks of this many rows instead of all at once. Use with `SmarterCSV.each_chunk` for memory-efficient batch processing. |
68
69
 
69
70
  ### Separators
70
71
 
71
72
  | Option | Default | Explanation |
72
73
  |--------|---------|-------------|
73
74
  | `:col_sep` | `:auto` | Column separator. `:auto` detects from file content (previous default was `','`). |
74
- | `:row_sep` | `:auto` | Row / record separator. `:auto` detects from file content. Manual detection reads the whole file first (slow on large files). |
75
- | `:auto_row_sep_chars` | `500` | How many characters to analyze when using `:row_sep => :auto`. `nil` or `0` means whole file. |
75
+ | `:row_sep` | `:auto` | Row / record separator. `:auto` detects from file content by scanning in chunks of `auto_row_sep_chars` bytes, up to a 64KB hard cap. |
76
+ | `:auto_row_sep_chars` | `4096` | Initial scan size for `:row_sep => :auto` detection. Scan stops as soon as one separator has a clear majority, up to a 64KB cap. Bump this if your files have very wide headers or long comment preambles. Out-of-range values, `nil`, or `0` fall back to the default with a warning. |
77
+ | `:buffer_size` | `16_384` | Peek buffer chunk size for non-seekable inputs (pipes, gzip readers, HTTP/S3 bodies). Out-of-range values warn and clamp to the supported range. Has no effect on seekable inputs (file paths, `File`, `StringIO`, `Tempfile`). |
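The chunked scan described for `:row_sep` can be sketched as follows. This is a rough illustration, not the library's detection code: `guess_row_sep` is a hypothetical name, and the "at least two hits and more than twice all other candidates" rule is an assumed stand-in for whatever majority test SmarterCSV actually applies. Only the 4096-byte default and the 64KB cap come from the table above.

```ruby
SCAN_CAP = 64 * 1024 # hard cap on bytes examined

# Hypothetical sketch: count candidate row separators chunk by chunk and
# return as soon as one of them clearly leads.
def guess_row_sep(sample, chunk_size: 4096)
  bytes  = sample.b # raw bytes, so a chunk border cannot split a character
  limit  = [bytes.bytesize, SCAN_CAP].min
  counts = Hash.new(0)
  carry  = ''.b
  offset = 0
  while offset < limit
    text = carry + bytes.byteslice(offset, [chunk_size, limit - offset].min)
    offset += chunk_size
    # Hold back a trailing "\r": it may be the first half of a "\r\n" pair.
    carry = text.end_with?("\r") ? text.slice!(-1) : ''.b
    text.scan(/\r\n|\n|\r/) { |sep| counts[sep] += 1 }
    best, best_count = counts.max_by { |_, n| n }
    next if best.nil?
    others = counts.values.sum - best_count
    # Assumed majority rule, for illustration only.
    return best if best_count >= 2 && best_count > 2 * others
  end
  counts["\r".b] += 1 unless carry.empty?
  best, = counts.max_by { |_, n| n }
  best || "\n" # nothing found: fall back to a plain newline
end

windows_sep = guess_row_sep("id,name\r\n1,Alice\r\n2,Bob\r\n", chunk_size: 8)
unix_sep    = guess_row_sep("id,name\n1,Alice\n2,Bob\n")
```

The tiny `chunk_size: 8` in the first call forces a `"\r\n"` pair to straddle a chunk border, which is the case the `carry` variable exists for.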
76
78
 
77
79
  ### Quoting
78
80
 
@@ -121,8 +123,8 @@ See [Parsing Strategy](./parsing_strategy.md) for full details on quote handling
121
123
  | `:strip_whitespace` | `true` | Remove whitespace before/after values and headers. |
122
124
  | `:convert_values_to_numeric` | `true` | Convert strings containing integers or floats to the appropriate numeric type. Accepts `{except: [:key1, :key2]}` or `{only: :key3}` to limit which columns. |
123
125
  | `:value_converters` | `nil` | Hash of `:header => converter`; converter can be a lambda/Proc or a class implementing `self.convert(value)`. See [Value Converters](./value_converters.md). |
124
- | `:remove_empty_values` | `true` | Remove key/value pairs where the value is `nil` or an empty string. |
125
- | `:remove_zero_values` | `false` | Remove key/value pairs where the numeric value equals zero. |
126
+ | `:remove_empty_values` | `true` | Remove key/value pairs where the value is `nil`, empty, or whitespace-only (any Unicode whitespace, the same rule as ActiveSupport's `String#blank?`). |
127
+ | `:remove_zero_values` | `false` | Remove key/value pairs whose value is zero — numeric `0` / `0.0`, or any textual form of zero (`"0"`, `"0.0"`, `"00.00"`, `"+0"`, `"-0.0"`, …). |
126
128
  | `:nil_values_matching` | `nil` | Set matching values to `nil`. Accepts a regular expression matched against the string representation of each value (e.g. `/\ANAN\z/` for NaN, `/\A#VALUE!\z/` for Excel errors). With `remove_empty_values: true` (default), nil-ified values are then removed. With `remove_empty_values: false`, the key is retained with a `nil` value. |
127
129
  | `:remove_empty_hashes` | `true` | Remove result hashes that have no key/value pairs or all-empty values. |
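The `:remove_empty_values` and `:remove_zero_values` descriptions above can be read as two predicates. The sketch below is an assumption-laden illustration: `blank_value?`, `zero_value?`, and both regular expressions are hypothetical and are not taken from the SmarterCSV source.

```ruby
# nil, empty, or made up only of whitespace (Unicode whitespace included).
def blank_value?(value)
  value.nil? || (value.is_a?(String) && value.match?(/\A[[:space:]]*\z/))
end

# Numeric zero, or a string that spells zero: "0", "0.0", "00.00", "+0", "-0.0".
def zero_value?(value)
  case value
  when Numeric then value.zero?
  when String  then value.match?(/\A[+-]?0+(\.0+)?\z/)
  else false
  end
end

# Hypothetical row; with both options enabled, only :id and :name would remain.
row  = { id: 7, note: " \u3000", balance: '0.00', qty: 0, name: 'Alice' }
kept = row.reject { |_, value| blank_value?(value) || zero_value?(value) }
```

`[[:space:]]` is used instead of `\s` because Ruby's `\s` only matches ASCII whitespace, while the POSIX bracket class also covers characters such as the ideographic space (U+3000).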
128
130
 
@@ -142,7 +144,7 @@ See [Bad Row Quarantine](./bad_row_quarantine.md) for full details.
142
144
  | Option | Default | Explanation |
143
145
  |--------|---------|-------------|
144
146
  | `:with_line_numbers` | `false` | Add `:csv_line_number` to each result hash. |
145
- | `:verbose` | `:normal` | Controls warning and diagnostic output. Accepted values:<br>• `:quiet` — suppress all warnings and notices (recommended for production)<br>• `:normal` — show behavioral warnings, e.g. auto-configuration notices **(default)**<br>• `:debug` — `:normal` + print computed options and per-row diagnostics to stderr<br>`nil` is silently treated as `:normal`. Passing `true` or `false` still works but is deprecated — see below. |
147
+ | `:verbose` | `:normal` | Controls warning and diagnostic output. Accepted values:<br>• `:quiet` — suppress all warnings and notices (recommended for production)<br>• `:normal` — show behavioral warnings, e.g. auto-configuration notices **(default)**<br>• `:debug` — `:normal` + print computed options and per-row diagnostics to stderr<br>`nil` is silently treated as `:normal`. Passing `true` or `false` still works but is deprecated — see below. See [Warnings](./warnings.md) for the structured warning collection. |
146
148
 
147
149
  ### Instrumentation Hooks
148
150
 
@@ -156,9 +158,9 @@ See [Instrumentation Hooks](./instrumentation.md) for full details and payload r
156
158
 
157
159
  ### Performance
158
160
 
159
- | Option | Default | Explanation |
160
- |--------|---------|-------------|
161
- | `:acceleration` | `true` | Use the C extension for parsing (MRI Ruby only). Set to `false` to force the pure-Ruby fallback (always used on JRuby/TruffleRuby). |
161
+ | Option | Default | Explanation |
162
+ |-------------------|---------|-------------------------------------------------------------------------------------------------------------------------------------|
163
+ | `:acceleration` | `true` | Use the C extension for parsing (MRI Ruby only). Set to `false` to force the pure-Ruby fallback (always used on JRuby/TruffleRuby). |
162
164
 
163
165
  ---
164
166
 
data/docs/parsing_strategy.md CHANGED
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [Real-World CSV Files](./real_world_csv.md)
data/docs/real_world_csv.md CHANGED
@@ -16,6 +16,7 @@
16
16
  * [Data Transformations](./data_transformations.md)
17
17
  * [Value Converters](./value_converters.md)
18
18
  * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Warnings](./warnings.md)
19
20
  * [Instrumentation Hooks](./instrumentation.md)
20
21
  * [Examples](./examples.md)
21
22
  * [**Real-World CSV Files**](./real_world_csv.md)
@@ -186,10 +187,59 @@ Numeric conversion is one of the most common sources of data loss. SmarterCSV co
186
187
 
187
188
  ### I/O Patterns
188
189
 
190
+ SmarterCSV accepts any IO-compatible source — file paths, open `File` handles, `StringIO`, and **non-seekable streams** like pipes, `STDIN`, and `Zlib::GzipReader`. Auto-detection of `row_sep` / `col_sep` works on streaming sources too thanks to internal buffering — the underlying source never needs to support `rewind` or `seek`. (Streaming IO support landed in 1.17.0.)
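Why buffering removes the need for `rewind` can be shown with a small stand-alone wrapper. This is a minimal sketch under stated assumptions: `PeekBuffer` is a hypothetical class written for this illustration and is not the gem's own peekable IO implementation, and the separator guess is deliberately naive.

```ruby
# Hypothetical wrapper: bytes consumed while peeking are kept in a buffer and
# replayed to later reads, so the source never has to support rewind or seek.
class PeekBuffer
  def initialize(io)
    @io = io
    @buffer = ''.b
  end

  # Return up to `n` bytes from the front of the stream without consuming them.
  def peek(n)
    while @buffer.bytesize < n
      chunk = @io.read(n - @buffer.bytesize)
      break if chunk.nil? || chunk.empty?
      @buffer << chunk
    end
    @buffer.byteslice(0, n)
  end

  # Read the next line, serving buffered bytes before touching the source.
  def gets
    if (idx = @buffer.index("\n"))
      return @buffer.slice!(0..idx)
    end
    rest = @io.gets
    return nil if rest.nil? && @buffer.empty?
    line = @buffer + rest.to_s
    @buffer = ''.b
    line
  end
end

reader, writer = IO.pipe # a pipe cannot be rewound
writer.write("id;name\n1;Alice\n2;Bob\n")
writer.close

source  = PeekBuffer.new(reader)
sample  = source.peek(16)
col_sep = [',', ';', "\t"].max_by { |sep| sample.count(sep) }

lines = []
while (line = source.gets)
  lines << line.chomp
end
```

The header line is inspected once for detection and then delivered again to the parser, even though the underlying pipe only ever moves forward.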
191
+
189
192
  | Source | Issue | Status | Notes |
190
193
  |--------|-------|--------|-------|
191
- | Gzipped CSV (`.csv.gz`) | Compressed file | 🔘 | Decompress and pass the resulting IO object: `SmarterCSV.process(Zlib::GzipReader.open(path))`. |
194
+ | Gzipped CSV (`.csv.gz`) | Compressed, non-seekable stream | 🔘 | `SmarterCSV.process(Zlib::GzipReader.open(path))` — no need to decompress to disk first. |
192
195
  | HTTP streaming | Parsing from a live HTTP response | 🔘 | Pass any IO-compatible object that responds to `#gets`. |
196
+ | `STDIN` / shell pipes | Non-seekable input | 🔘 | `cat data.csv \| ruby -rsmarter_csv -e 'SmarterCSV.process(STDIN) { \|h\| ... }'` |
197
+ | `IO.popen` output | Non-seekable subprocess stream | 🔘 | `IO.popen('zcat data.csv.gz') { \|io\| SmarterCSV.process(io) }` |
198
+ | S3 object body | Non-seekable HTTP stream | 🔘 | `SmarterCSV.process(s3.get_object(...).body)` — see worked example below. |
199
+
200
+ #### Streaming Inputs
201
+
202
+ ```ruby
203
+ # Gzipped CSV — stream-decompressed, never written to disk
204
+ require 'zlib'
205
+ Zlib::GzipReader.open('huge.csv.gz') do |io|
206
+ SmarterCSV.process(io) { |row| MyModel.upsert(row.first) }
207
+ end
208
+
209
+ # STDIN / pipes
210
+ SmarterCSV.process($stdin) { |row, _| MyModel.upsert(row.first) }
211
+
212
+ # HTTP response body
213
+ require 'open-uri'
214
+ URI.open('https://example.com/data.csv') { |io| SmarterCSV.process(io) }
215
+
216
+ # S3 — stream the response body directly
217
+ require 'aws-sdk-s3'
218
+ obj = Aws::S3::Client.new.get_object(bucket: 'data', key: 'imports/users.csv')
219
+ SmarterCSV::Reader.new(obj.body, chunk_size: 500).each_chunk do |chunk, _index|
220
+ MyModel.insert_all(chunk)
221
+ end
222
+
223
+ # Subprocess output
224
+ IO.popen('zcat data.csv.gz') { |io| SmarterCSV.process(io) }
225
+ ```
226
+
227
+ #### Multi-Line Quoted Fields
228
+
229
+ Newlines inside `"..."` are preserved as part of the field — useful for address blocks, CRM notes, and free-text comments. No configuration needed:
230
+
231
+ ```ruby
232
+ $ cat addresses.csv
233
+ id,name,address
234
+ 1,Alice,"123 Main St
235
+ Apt 4B
236
+ Brooklyn, NY 11201"
237
+ 2,Bob,"42 Elm Ave"
238
+
239
+ data = SmarterCSV.process('addresses.csv')
240
+ # => [{id: 1, name: "Alice", address: "123 Main St\nApt 4B\nBrooklyn, NY 11201"},
241
+ # {id: 2, name: "Bob", address: "42 Elm Ave"}]
242
+ ```
193
243
 
194
244
  †: Legacy Apple DB Dump and older UNIX data dumps use ASCII control characters as delimiters:
195
245
 
data/docs/releases/1.16.0/performance_notes.md CHANGED
@@ -45,14 +45,14 @@ rows with type conversion applied. SmarterCSV/C is dramatically faster:
45
45
 
46
46
  ### C path
47
47
 
48
- | Gain | Files |
49
- |--------------|---------------------------------------------------------------------|
50
- | **2.4×** | long_fields — biggest win; `memchr` skip-ahead in quoted fields |
51
- | **1.5×** | heavy_quoting — same skip-ahead benefit |
52
- | **1.4×** | tab_separated |
48
+ | Gain | Files |
49
+ |--------------|-----------------------------------------------------------------------------|
50
+ | **2.4×** | long_fields — biggest win; `memchr` skip-ahead in quoted fields |
51
+ | **1.5×** | heavy_quoting — same skip-ahead benefit |
52
+ | **1.4×** | tab_separated |
53
53
  | **1.2–1.3×** | embedded_sep, utf8, PEOPLE_IMPORT_C/NC, worldcities, whitespace, multi_char |
54
- | **1.1–1.2×** | PEOPLE_IMPORT_B/NB, uszips, sample_10M, wide_500_cols |
55
- | **~1.0×** | sensor_data, embedded_newlines (within noise) |
54
+ | **1.1–1.2×** | PEOPLE_IMPORT_B/NB, uszips, sample_10M, wide_500_cols |
55
+ | **~1.0×** | sensor_data, embedded_newlines (within noise) |
56
56
 
57
57
  15 of 19 files are measurably faster; 2 within noise; 2 files show a small regression
58
58
  (PEOPLE_IMPORT_NB −7%, wide_500_cols −5%) attributable to the new `quote_boundary: :standard`
@@ -60,11 +60,11 @@ default adding one extra state check on the unquoted fast path.
60
60
 
61
61
  ### Ruby path
62
62
 
63
- | Gain | Files |
64
- |--------------|---------------------------------------------------------------------|
63
+ | Gain | Files |
64
+ |--------------|-----------------------------------------------------------------------------------|
65
65
  | **1.9×** | PEOPLE_IMPORT_C (117 cols) — direct hash construction bypasses intermediate Array |
66
- | **1.5×** | PEOPLE_IMPORT_NC, multi_char_sep |
67
- | **1.0–1.1×** | most other files |
66
+ | **1.5×** | PEOPLE_IMPORT_NC, multi_char_sep |
67
+ | **1.0–1.1×** | most other files |
68
68
 
69
69
  The Ruby path gains are concentrated on wide/complex files where the direct-hash
70
70
  construction optimization (Opt #11) has the most impact.
@@ -106,9 +106,9 @@ are skipped entirely in the C hot path — no string allocation, no conversion,
106
106
  insertion. Benchmark on `wide_500_cols_20k.csv` (500 columns):
107
107
 
108
108
  | Columns kept | Speedup vs no selection |
109
- |---|---|
110
- | 2 of 500 | ~16× faster |
111
- | 10 of 500 | ~8× faster |
112
- | 50 of 500 | ~3× faster |
109
+ |--------------|-------------------------|
110
+ | 2 of 500 | ~16× faster |
111
+ | 10 of 500 | ~8× faster |
112
+ | 50 of 500 | ~3× faster |
113
113
 
114
114
  This is additive on top of the baseline gains above.
data/docs/releases/1.17.0/benchmarks.md ADDED
@@ -0,0 +1,121 @@
1
+ # SmarterCSV 1.17.0 — Benchmark Results
2
+
3
+ - **Date:** 2026-05-06
4
+ - **Ruby:** 3.4.7 [arm64-darwin25] on Apple M1 Pro
5
+ - **SmarterCSV:** 1.17.0
6
+ - **Versions compared:** 1.14.4, 1.15.2, 1.16.4, 1.17.0
7
+ - **Ruby CSV:** 3.3.5
8
+ - **Methodology:** best of 40 measured runs (2 warm-up)
9
+ - **Raw data files:**
10
+ - [`2026-05-06_1250_ruby3.4.7.md`](2026-05-06_1250_ruby3.4.7.md) / [`.json`](2026-05-06_1250_ruby3.4.7.json) — version comparison (1.14.4 / 1.15.2 / 1.16.4 / 1.17.0)
11
+ - [`2026-05-06_1511_ruby3.4.7.md`](2026-05-06_1511_ruby3.4.7.md) / [`.json`](2026-05-06_1511_ruby3.4.7.json) — vs Ruby CSV 3.3.5
12
+
13
+ See [performance_notes.md](performance_notes.md) for analysis of these numbers.
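A "best of N runs with warm-up" measurement and the ratio columns in the tables below can be sketched like this. The harness is an assumption about how such numbers are commonly produced; the benchmark script actually used for this release is not shown here, and `best_of` and `speedup` are hypothetical names.

```ruby
# Hypothetical harness: run a few untimed warm-ups, then keep the fastest of
# the measured runs.
def best_of(runs:, warmup:)
  warmup.times { yield }
  Array.new(runs) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end.min
end

# The "N× faster" and "N× slower" columns are plain ratios of two timings.
def speedup(baseline_seconds, candidate_seconds)
  (baseline_seconds / candidate_seconds).round(2)
end

# Placeholder workload; a real benchmark would parse one of the CSV files.
fastest = best_of(runs: 40, warmup: 2) { (1..10_000).sum }

# PEOPLE_IMPORT_C.csv on the C path: v1.14.4 (8.0347s) against v1.17.0 (0.1746s).
ratio = speedup(8.0347, 0.1746)
```

Taking the minimum rather than the mean reports the run least disturbed by other activity on the machine, which is one common reason for quoting "best of N".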
14
+
15
+ ---
16
+
17
+ ## SmarterCSV C accelerated — version comparison
18
+
19
+ | File | Rows | v1.14.4 | v1.15.2 | v1.16.4 | v1.17.0 | newest vs oldest |
20
+ |----------------------------------|--------|------------|-----------|-----------|-----------|------------------|
21
+ | PEOPLE_IMPORT_B.csv | 50000 | 1.6175s | 0.1049s | 0.0867s | 0.0872s | 18.54× faster |
22
+ | PEOPLE_IMPORT_C.csv | 50000 | 8.0347s | 0.2055s | 0.1763s | 0.1746s | 46.02× faster |
23
+ | PEOPLE_IMPORT_NB.csv | 50000 | 1.5629s | 0.0994s | 0.0694s | 0.0708s | 22.08× faster |
24
+ | PEOPLE_IMPORT_NC.csv | 50000 | 1.4679s | 0.0855s | 0.0711s | 0.0705s | 20.83× faster |
25
+ | uscities.csv | 31257 | 1.0357s | 0.1129s | 0.0878s | 0.0819s | 12.64× faster |
26
+ | uszips.csv | 33782 | 1.2419s | 0.1121s | 0.0880s | 0.0879s | 14.13× faster |
27
+ | worldcities.csv | 48059 | 1.0420s | 0.1174s | 0.0861s | 0.0773s | 13.49× faster |
28
+ | embedded_newlines_20k.csv | 80000 | 0.5337s | 0.0633s | 0.0591s | 0.0545s | 9.80× faster |
29
+ | embedded_separators_20k.csv | 20000 | 0.2761s | 0.0328s | 0.0215s | 0.0214s | 12.90× faster |
30
+ | heavy_quoting_20k.csv | 20000 | 0.5129s | 0.0561s | 0.0364s | 0.0358s | 14.34× faster |
31
+ | long_fields_20k.csv | 20000 | 2.9215s | 0.1082s | 0.0464s | 0.0392s | 74.54× faster |
32
+ | many_empty_fields_20k.csv | 20000 | 0.3885s | 0.0314s | 0.0240s | 0.0262s | 14.81× faster |
33
+ | multi_char_separator_20k.csv | 20000 | 0.5305s | 0.0340s | 0.0272s | 0.0296s | 17.90× faster |
34
+ | sample_10M.csv | 50000 | 0.4513s | 0.0619s | 0.0480s | 0.0446s | 10.11× faster |
35
+ | sensor_data_50krows_50cols.csv | 50000 | 3.8704s | 0.2714s | 0.2559s | 0.2549s | 15.19× faster |
36
+ | tab_separated_20k.tsv | 20000 | 0.4496s | 0.0337s | 0.0255s | 0.0256s | 17.54× faster |
37
+ | utf8_multibyte_20k.csv | 20000 | 0.2233s | 0.0210s | 0.0152s | 0.0149s | 14.96× faster |
38
+ | whitespace_heavy_20k.csv | 20000 | 0.5244s | 0.0349s | 0.0250s | 0.0286s | 18.34× faster |
39
+ | wide_500_cols_20k.csv | 20000 | 17.3477s | 1.2805s | 1.2798s | 1.2701s | 13.66× faster |
40
+
41
+ ## SmarterCSV Ruby path — version comparison
42
+
43
+ | File | Rows | v1.14.4 | v1.15.2 | v1.16.4 | v1.17.0 | newest vs oldest |
44
+ |----------------------------------|--------|------------|-----------|-----------|-----------|------------------|
45
+ | PEOPLE_IMPORT_B.csv | 50000 | 4.5718s | 0.5635s | 0.5272s | 0.4971s | 9.20× faster |
46
+ | PEOPLE_IMPORT_C.csv | 50000 | 26.0194s | 2.5511s | 1.3401s | 1.3328s | 19.52× faster |
47
+ | PEOPLE_IMPORT_NB.csv | 50000 | 4.4999s | 0.5268s | 0.4757s | 0.4791s | 9.39× faster |
48
+ | PEOPLE_IMPORT_NC.csv | 50000 | 4.3233s | 0.5752s | 0.3989s | 0.4017s | 10.76× faster |
49
+ | uscities.csv | 31257 | 2.6702s | 1.8124s | 1.0662s | 1.0944s | 2.44× faster |
50
+ | uszips.csv | 33782 | 3.1853s | 2.1641s | 1.3332s | 1.3434s | 2.37× faster |
51
+ | worldcities.csv | 48059 | 2.8397s | 1.8978s | 1.0910s | 1.0909s | 2.60× faster |
52
+ | embedded_newlines_20k.csv | 80000 | 0.9578s | 0.4629s | 0.4291s | 0.4314s | 2.22× faster |
53
+ | embedded_separators_20k.csv | 20000 | 0.7074s | 0.4535s | 0.2748s | 0.2748s | 2.57× faster |
54
+ | heavy_quoting_20k.csv | 20000 | 1.4361s | 0.8598s | 0.5241s | 0.5273s | 2.72× faster |
55
+ | long_fields_20k.csv | 20000 | 8.8715s | 4.7839s | 2.5696s | 2.5624s | 3.46× faster |
56
+ | many_empty_fields_20k.csv | 20000 | 0.8635s | 0.2521s | 0.1680s | 0.1664s | 5.19× faster |
57
+ | multi_char_separator_20k.csv | 20000 | 1.4172s | 0.2463s | 0.1853s | 0.1879s | 7.54× faster |
58
+ | sample_10M.csv | 50000 | 1.0547s | 0.2388s | 0.2238s | 0.2211s | 4.77× faster |
59
+ | sensor_data_50krows_50cols.csv | 50000 | 8.9445s | 1.8246s | 1.8348s | 1.8181s | 4.92× faster |
60
+ | tab_separated_20k.tsv | 20000 | 1.2664s | 0.1596s | 0.1553s | 0.1536s | 8.24× faster |
61
+ | utf8_multibyte_20k.csv | 20000 | 0.6484s | 0.1124s | 0.1068s | 0.1066s | 6.08× faster |
62
+ | whitespace_heavy_20k.csv | 20000 | 1.5513s | 0.1613s | 0.1654s | 0.1610s | 9.63× faster |
63
+ | wide_500_cols_20k.csv | 20000 | 44.5782s | 7.2023s | 6.9748s | 6.9261s | 6.44× faster |
64
+
65
+ ---
66
+
67
+ ## SmarterCSV 1.17.0 vs Ruby CSV 3.3.5 — full results
68
+
69
+ | File | Rows | CSV.read¹ | CSV.hashes¹ | SmarterCSV/C | SmarterCSV/Rb |
70
+ |----------------------------------|--------|------------|-------------|---------------|---------------|
71
+ | PEOPLE_IMPORT_B.csv | 50000 | 0.2718s | 0.7750s | 0.0673s | 0.5034s |
72
+ | PEOPLE_IMPORT_C.csv | 50000 | 1.4111s | 8.0199s | 0.1907s | 1.4032s |
73
+ | PEOPLE_IMPORT_NB.csv | 50000 | 0.2659s | 0.7603s | 0.0638s | 0.4800s |
74
+ | PEOPLE_IMPORT_NC.csv | 50000 | 0.2860s | 0.9173s | 0.0630s | 0.4132s |
75
+ | uscities.csv | 31257 | 0.5640s | 0.8803s | 0.0789s | 1.1120s |
76
+ | uszips.csv | 33782 | 0.7414s | 1.1604s | 0.0929s | 1.3645s |
77
+ | worldcities.csv | 48059 | 0.6313s | 0.9906s | 0.0794s | 1.0945s |
78
+ | embedded_newlines_20k.csv | 80000 | 0.1693s | 0.2245s | 0.0554s | 0.4451s |
79
+ | embedded_separators_20k.csv | 20000 | 0.1312s | 0.1838s | 0.0206s | 0.2830s |
80
+ | heavy_quoting_20k.csv | 20000 | 0.1167s | 0.2410s | 0.0338s | 0.5400s |
81
+ | long_fields_20k.csv | 20000 | 0.2373s | 0.2762s | 0.0392s | 2.6172s |
82
+ | many_empty_fields_20k.csv | 20000 | 0.1145s | 0.3622s | 0.0216s | 0.1727s |
83
+ | multi_char_separator_20k.csv | 20000 | 0.0890s | 0.2122s | 0.0293s | 0.1662s |
84
+ | sample_10M.csv | 50000 | 0.1685s | 0.3012s | 0.0357s | 0.2361s |
85
+ | sensor_data_50krows_50cols.csv | 50000 | 0.5655s | 2.6744s | 0.2442s | 1.8878s |
86
+ | tab_separated_20k.tsv | 20000 | 0.0832s | 0.2029s | 0.0219s | 0.1651s |
87
+ | utf8_multibyte_20k.csv | 20000 | 0.0662s | 0.1427s | 0.0156s | 0.1138s |
88
+ | whitespace_heavy_20k.csv | 20000 | 0.0890s | 0.2169s | 0.0278s | 0.1670s |
89
+ | wide_500_cols_20k.csv | 20000 | 2.3351s | 32.4002s | 1.2823s | 7.3504s |
90
+
91
+ ## Ruby CSV 3.3.5 vs SmarterCSV 1.17.0 (C accelerated)
92
+
93
+ | File | Rows | CSV.read¹ | CSV.hashes¹ |
94
+ |----------------------------------|--------|---------------|---------------|
95
+ | PEOPLE_IMPORT_B.csv | 50000 | 4.04× slower | 11.51× slower |
96
+ | PEOPLE_IMPORT_C.csv | 50000 | 7.40× slower | 42.04× slower |
97
+ | PEOPLE_IMPORT_NB.csv | 50000 | 4.17× slower | 11.92× slower |
98
+ | PEOPLE_IMPORT_NC.csv | 50000 | 4.54× slower | 14.55× slower |
99
+ | uscities.csv | 31257 | 7.15× slower | 11.16× slower |
100
+ | uszips.csv | 33782 | 7.98× slower | 12.50× slower |
101
+ | worldcities.csv | 48059 | 7.95× slower | 12.48× slower |
102
+ | embedded_newlines_20k.csv | 80000 | 3.05× slower | 4.05× slower |
103
+ | embedded_separators_20k.csv | 20000 | 6.36× slower | 8.91× slower |
104
+ | heavy_quoting_20k.csv | 20000 | 3.46× slower | 7.14× slower |
105
+ | long_fields_20k.csv | 20000 | 6.05× slower | 7.04× slower |
106
+ | many_empty_fields_20k.csv | 20000 | 5.29× slower | 16.73× slower |
107
+ | multi_char_separator_20k.csv | 20000 | 3.04× slower | 7.25× slower |
108
+ | sample_10M.csv | 50000 | 4.72× slower | 8.43× slower |
109
+ | sensor_data_50krows_50cols.csv | 50000 | 2.32× slower | 10.95× slower |
110
+ | tab_separated_20k.tsv | 20000 | 3.80× slower | 9.28× slower |
111
+ | utf8_multibyte_20k.csv | 20000 | 4.24× slower | 9.14× slower |
112
+ | whitespace_heavy_20k.csv | 20000 | 3.20× slower | 7.81× slower |
113
+ | wide_500_cols_20k.csv | 20000 | 1.82× slower | 25.27× slower |
114
+
115
+ ---
116
+
117
+ ¹ **Raw output** — no post-processing applied. Returns plain arrays or string-keyed hashes. No header normalization, type conversion, whitespace stripping, or empty-value removal. Your own post-processing must be added to produce usable data.
118
+
119
+ ---
120
+
121
+ PREVIOUS: [Performance Notes](./performance_notes.md) | UP: [README](../../../README.md)