smarter_csv 1.16.4 → 1.17.0
- checksums.yaml +4 -4
- data/.rubocop.yml +10 -1
- data/CHANGELOG.md +54 -0
- data/Gemfile +10 -5
- data/README.md +98 -14
- data/TO_DO.md +109 -0
- data/docs/_introduction.md +1 -0
- data/docs/bad_row_quarantine.md +2 -1
- data/docs/basic_read_api.md +6 -1
- data/docs/basic_write_api.md +30 -0
- data/docs/batch_processing.md +25 -0
- data/docs/column_selection.md +1 -0
- data/docs/data_transformations.md +1 -0
- data/docs/examples.md +126 -0
- data/docs/header_transformations.md +23 -0
- data/docs/header_validations.md +1 -0
- data/docs/history.md +1 -0
- data/docs/instrumentation.md +2 -1
- data/docs/migrating_from_csv.md +1 -0
- data/docs/options.md +20 -18
- data/docs/parsing_strategy.md +1 -0
- data/docs/real_world_csv.md +51 -1
- data/docs/releases/1.16.0/performance_notes.md +15 -15
- data/docs/releases/1.17.0/benchmarks.md +121 -0
- data/docs/releases/1.17.0/changes.md +161 -0
- data/docs/releases/1.17.0/performance_notes.md +126 -0
- data/docs/row_col_sep.md +21 -1
- data/docs/ruby_csv_pitfalls.md +1 -0
- data/docs/value_converters.md +24 -0
- data/docs/warnings.md +141 -0
- data/ext/smarter_csv/smarter_csv.c +98 -32
- data/images/SmarterCSV_1.17.0_vs_RubyCSV_3.3.5_speedup.svg +106 -0
- data/images/SmarterCSV_1.17.0_vs_previous_C-speedup.svg +181 -0
- data/images/SmarterCSV_1.17.0_vs_previous_Rb-speedup.svg +179 -0
- data/lib/smarter_csv/auto_detection.rb +215 -30
- data/lib/smarter_csv/file_io.rb +2 -2
- data/lib/smarter_csv/hash_transformations.rb +29 -13
- data/lib/smarter_csv/parser.rb +42 -33
- data/lib/smarter_csv/peekable_io.rb +453 -0
- data/lib/smarter_csv/reader.rb +119 -23
- data/lib/smarter_csv/reader_options.rb +61 -1
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv.rb +40 -12
- metadata +12 -5
- data/TO_DO_v2.md +0 -14
- data/ext/smarter_csv/Makefile +0 -270
data/docs/examples.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [**Examples**](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
|
@@ -43,6 +44,12 @@
|
|
|
43
44
|
11. [Batch Processing with Sidekiq](#example-11-batch-processing-with-sidekiq)
|
|
44
45
|
12. [Resumable CSV Import with Rails ActiveJob](#example-12-resumable-csv-import-with-rails-activejob-rails-81)
|
|
45
46
|
13. [Instrumentation](#example-13-instrumentation)
|
|
47
|
+
14. [Streaming Inputs (Non-Seekable IO)](#example-14-streaming-inputs-non-seekable-io)
|
|
48
|
+
15. [Resumable Import (Plain Ruby)](#example-15-resumable-import-plain-ruby)
|
|
49
|
+
16. [CSV Files with Comment Lines](#example-16-csv-files-with-comment-lines)
|
|
50
|
+
17. [Tab-Separated Values (TSV)](#example-17-tab-separated-values-tsv)
|
|
51
|
+
18. [Multi-Line Fields](#example-18-multi-line-fields)
|
|
52
|
+
19. [Filtering and Transforming a CSV File](#example-19-filtering-and-transforming-a-csv-file)
|
|
46
53
|
|
|
47
54
|
---
|
|
48
55
|
|
|
@@ -369,5 +376,124 @@ SmarterCSV.process('large_import.csv',
|
|
|
369
376
|
|
|
370
377
|
See [Instrumentation Hooks](./instrumentation.md).
|
|
371
378
|
|
|
379
|
+
---
|
|
380
|
+
|
|
381
|
+
## Example 14: Streaming Inputs (Non-Seekable IO)
|
|
382
|
+
|
|
383
|
+
*(1.17.0+)* SmarterCSV reads from gzipped files, HTTP responses, S3 objects, or piped STDIN — no need to materialize the file on disk first.
|
|
384
|
+
|
|
385
|
+
```ruby
|
|
386
|
+
require 'zlib'
|
|
387
|
+
Zlib::GzipReader.open('huge.csv.gz') do |io|
|
|
388
|
+
SmarterCSV.process(io) { |row| MyModel.upsert(row.first) }
|
|
389
|
+
end
|
|
390
|
+
```
|
|
391
|
+
|
|
392
|
+
See [Real-World CSV Files → I/O Patterns](./real_world_csv.md#io-patterns) for gzip, S3, HTTP, STDIN, and `IO.popen` worked examples.
|
|
393
|
+
|
|
394
|
+
---
|
|
395
|
+
|
|
396
|
+
## Example 15: Resumable Import (Plain Ruby)
|
|
397
|
+
|
|
398
|
+
A non-Rails counterpart to Example 12 — track the chunk cursor in a JSON file so an interrupted import resumes where it left off.
|
|
399
|
+
|
|
400
|
+
See [Batch Processing → Resumable Import (Plain Ruby)](./batch_processing.md#example-resumable-import-plain-ruby) for the worked example.
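The cursor idea can be sketched in plain Ruby without any CSV library. This is only an illustration under stated assumptions: `resumable_import`, the `next_chunk` key, and the cursor file name are hypothetical and are not taken from the linked worked example.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical helper: processes `rows` in chunks and records the index of
# the next chunk in a JSON file after each chunk completes successfully.
def resumable_import(rows, cursor_path, chunk_size: 2)
  start = File.exist?(cursor_path) ? JSON.parse(File.read(cursor_path))['next_chunk'] : 0
  rows.each_slice(chunk_size).each_with_index do |chunk, index|
    next if index < start # already imported by an earlier run
    yield chunk, index
    File.write(cursor_path, JSON.generate('next_chunk' => index + 1))
  end
end

rows = (1..5).map { |i| { id: i } }
imported = []

Dir.mktmpdir do |dir|
  cursor = File.join(dir, 'cursor.json')
  begin
    # First run: simulate a crash while handling the second chunk.
    resumable_import(rows, cursor) do |chunk, index|
      raise 'simulated crash' if index == 1
      imported.concat(chunk)
    end
  rescue RuntimeError
    # The cursor file now points at chunk 1.
  end
  # Second run: resumes at the chunk that failed.
  resumable_import(rows, cursor) { |chunk, _index| imported.concat(chunk) }
end
```

Writing the cursor only after the block returns means a failed chunk is retried on the next run, so the per-chunk work should be idempotent (for example, an upsert).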
|
|
401
|
+
|
|
402
|
+
---
|
|
403
|
+
|
|
404
|
+
## Example 16: CSV Files with Comment Lines
|
|
405
|
+
|
|
406
|
+
Strip lines matching a pattern (e.g. `#`-prefixed comments in DB dumps and log exports) using `comment_regexp`:
|
|
407
|
+
|
|
408
|
+
```ruby
|
|
409
|
+
SmarterCSV.process('data.csv', comment_regexp: /\A#/)
|
|
410
|
+
```
|
|
411
|
+
|
|
412
|
+
See [Header Transformations → CSV Files with Comment Lines](./header_transformations.md#csv-files-with-comment-lines) for the worked example.
|
|
413
|
+
|
|
414
|
+
---
|
|
415
|
+
|
|
416
|
+
## Example 17: Tab-Separated Values (TSV)
|
|
417
|
+
|
|
418
|
+
```ruby
|
|
419
|
+
SmarterCSV.process('data.tsv') # auto-detected
|
|
420
|
+
SmarterCSV.process('data.tsv', col_sep: "\t") # explicit
|
|
421
|
+
```
|
|
422
|
+
|
|
423
|
+
See [Row and Column Separators → Tab-Separated Values (TSV)](./row_col_sep.md#tab-separated-values-tsv) for details.
|
|
424
|
+
|
|
425
|
+
---
|
|
426
|
+
|
|
427
|
+
## Example 18: Multi-Line Fields
|
|
428
|
+
|
|
429
|
+
Newlines inside `"..."` are preserved as part of the field — common in addresses, CRM notes, and free-text comments. No configuration needed.
|
|
430
|
+
|
|
431
|
+
See [Real-World CSV Files → Multi-Line Quoted Fields](./real_world_csv.md#multi-line-quoted-fields) for the worked example.
|
|
432
|
+
|
|
433
|
+
---
|
|
434
|
+
|
|
435
|
+
## Example 19: Filtering and Transforming a CSV File
|
|
436
|
+
|
|
437
|
+
The Ruby CSV library has `CSV.filter` for "read CSV, mutate each row, write CSV." In SmarterCSV the same workflow is a composition of two calls, `SmarterCSV.each` nested inside `SmarterCSV.generate`:
|
|
438
|
+
|
|
439
|
+
```ruby
|
|
440
|
+
SmarterCSV.generate('out.csv') do |csv|
|
|
441
|
+
SmarterCSV.each('in.csv') do |row|
|
|
442
|
+
row[:price] = (row[:price] * 1.1).round(2)
|
|
443
|
+
row.delete(:internal_notes)
|
|
444
|
+
csv << row
|
|
445
|
+
end
|
|
446
|
+
end
|
|
447
|
+
```
|
|
448
|
+
|
|
449
|
+
The explicit `csv << row` is the win over `CSV.filter` — emission is intentional, not a side effect of mutating the block argument.
|
|
450
|
+
|
|
451
|
+
### Pipeline (STDIN → STDOUT)
|
|
452
|
+
|
|
453
|
+
```ruby
|
|
454
|
+
# cat in.csv | ruby filter.rb > out.csv
|
|
455
|
+
SmarterCSV.generate($stdout) do |csv|
|
|
456
|
+
SmarterCSV.each($stdin) { |row| csv << row }
|
|
457
|
+
end
|
|
458
|
+
```
|
|
459
|
+
|
|
460
|
+
### Skipping rows
|
|
461
|
+
|
|
462
|
+
```ruby
|
|
463
|
+
SmarterCSV.generate('out.csv') do |csv|
|
|
464
|
+
SmarterCSV.each('in.csv') do |row|
|
|
465
|
+
next if row[:status] == 'archived' # just skip — no emit
|
|
466
|
+
csv << row
|
|
467
|
+
end
|
|
468
|
+
end
|
|
469
|
+
```
|
|
470
|
+
|
|
471
|
+
### Compressed in, compressed out
|
|
472
|
+
|
|
473
|
+
```ruby
|
|
474
|
+
require 'zlib'
|
|
475
|
+
Zlib::GzipWriter.open('out.csv.gz') do |gz_out|
|
|
476
|
+
SmarterCSV.generate(gz_out) do |csv|
|
|
477
|
+
Zlib::GzipReader.open('in.csv.gz') do |gz_in|
|
|
478
|
+
SmarterCSV.each(gz_in) { |row| csv << row }
|
|
479
|
+
end
|
|
480
|
+
end
|
|
481
|
+
end
|
|
482
|
+
```
|
|
483
|
+
|
|
484
|
+
Both endpoints are non-seekable streams — a pattern `CSV.filter` cannot handle, since it requires seekable input/output.
|
|
485
|
+
|
|
486
|
+
### Header renaming on the way through
|
|
487
|
+
|
|
488
|
+
```ruby
|
|
489
|
+
SmarterCSV.generate('out.csv', headers: [:given_name, :family_name, :email]) do |csv|
|
|
490
|
+
SmarterCSV.each('in.csv',
|
|
491
|
+
key_mapping: { first_name: :given_name, last_name: :family_name }
|
|
492
|
+
) { |row| csv << row }
|
|
493
|
+
end
|
|
494
|
+
```
|
|
495
|
+
|
|
496
|
+
Use `key_mapping:` on the read side to rename columns and `headers:` on the write side to enforce output column order.
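The two roles can be pictured with plain hashes. This is a sketch of the idea only, not of how SmarterCSV implements either option; the sample row is invented for illustration.

```ruby
# Read side: rename keys according to the mapping, leaving other keys alone.
key_mapping = { first_name: :given_name, last_name: :family_name }
row = { email: 'a@example.com', first_name: 'Alice', last_name: 'Ng' }
renamed = row.transform_keys { |key| key_mapping.fetch(key, key) }

# Write side: the header list, not the hash's key order, decides column order.
headers = [:given_name, :family_name, :email]
line = renamed.values_at(*headers).join(',')
```

Here `line` comes out in header order even though `email` was the first key in the input row.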
|
|
497
|
+
|
|
372
498
|
--------------------
|
|
373
499
|
PREVIOUS: [Instrumentation Hooks](./instrumentation.md) | NEXT: [Real-World CSV Files](./real_world_csv.md) | UP: [README](../README.md)
|
data/docs/header_transformations.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
|
@@ -61,6 +62,28 @@ See [Configuration Options](./options.md) for full option reference.
|
|
|
61
62
|
|
|
62
63
|
---
|
|
63
64
|
|
|
65
|
+
## CSV Files with Comment Lines
|
|
66
|
+
|
|
67
|
+
Strip comment lines anywhere in the file — including before the header — using `comment_regexp`:
|
|
68
|
+
|
|
69
|
+
```ruby
|
|
70
|
+
$ cat data.csv
|
|
71
|
+
# Generated 2026-01-15 by exporter v3.2
|
|
72
|
+
# Confidential — internal use only
|
|
73
|
+
id,name,amount
|
|
74
|
+
1,Alice,100
|
|
75
|
+
2,Bob,200
|
|
76
|
+
# end of file
|
|
77
|
+
|
|
78
|
+
data = SmarterCSV.process('data.csv', comment_regexp: /\A#/)
|
|
79
|
+
# => [{id: 1, name: "Alice", amount: 100},
|
|
80
|
+
# {id: 2, name: "Bob", amount: 200}]
|
|
81
|
+
```
|
|
82
|
+
|
|
83
|
+
Common in database dumps, log exports, and pipelines that prepend provenance metadata. The regexp is applied per line — any line matching is dropped before parsing.
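The per-line behavior can be pictured with a few lines of plain Ruby. This is a sketch of the concept, not SmarterCSV's internal code, and `strip_comment_lines` is a hypothetical helper name.

```ruby
require 'stringio'

# Hypothetical helper: drop every line that matches the comment pattern,
# wherever it appears (before the header, between rows, or at the end).
def strip_comment_lines(io, comment_regexp)
  io.each_line.reject { |line| line.match?(comment_regexp) }
end

input = StringIO.new(<<~CSV)
  # Generated 2026-01-15 by exporter v3.2
  id,name,amount
  1,Alice,100
  # end of file
CSV

kept = strip_comment_lines(input, /\A#/)
```

Because each line is tested on its own, `\A` anchors at the start of every line, so a `#` inside a field value does not cause that row to be dropped.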
|
|
84
|
+
|
|
85
|
+
---
|
|
86
|
+
|
|
64
87
|
## Header Normalization
|
|
65
88
|
|
|
66
89
|
When processing headers, SmarterCSV transforms them into Ruby symbols: it strips extra whitespace, lower-cases them, and replaces spaces with underscores. For example, `" \t Annual Sales "` becomes `:annual_sales` (see Notes below).
|
data/docs/header_validations.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
data/docs/history.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
data/docs/instrumentation.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [**Instrumentation Hooks**](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
|
@@ -163,4 +164,4 @@ SmarterCSV.process(file, on_start: ON_START, on_complete: ON_COMPLETE)
|
|
|
163
164
|
```
|
|
164
165
|
|
|
165
166
|
--------------------
|
|
166
|
-
PREVIOUS: [
|
|
167
|
+
PREVIOUS: [Warnings](./warnings.md) | NEXT: [Examples](./examples.md) | UP: [README](../README.md)
|
data/docs/migrating_from_csv.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
data/docs/options.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
|
@@ -52,27 +53,28 @@
|
|
|
52
53
|
|
|
53
54
|
### File Input & Encoding
|
|
54
55
|
|
|
55
|
-
| Option
|
|
56
|
-
|
|
57
|
-
| `:file_encoding`
|
|
58
|
-
| `:invalid_byte_sequence` | `''`
|
|
59
|
-
| `:force_utf8`
|
|
56
|
+
| Option | Default | Explanation |
|
|
57
|
+
|--------------------------|---------|------------------------------------------------------------------------|
|
|
58
|
+
| `:file_encoding` | `utf-8` | Set the file encoding, e.g. `'windows-1252'` or `'iso-8859-1'`. |
|
|
59
|
+
| `:invalid_byte_sequence` | `''` | What to replace invalid byte sequences with. |
|
|
60
|
+
| `:force_utf8` | `false` | Force UTF-8 encoding of all lines (including headers) in the CSV file. |
|
|
60
61
|
|
|
61
62
|
### File Layout
|
|
62
63
|
|
|
63
|
-
| Option
|
|
64
|
-
|
|
65
|
-
| `:skip_lines`
|
|
66
|
-
| `:comment_regexp` | `nil`
|
|
67
|
-
| `:chunk_size`
|
|
64
|
+
| Option | Default | Explanation |
|
|
65
|
+
|-------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------|
|
|
66
|
+
| `:skip_lines` | `nil` | How many lines to skip before the first line or header line is processed. |
|
|
67
|
+
| `:comment_regexp` | `nil` | Regular expression to ignore comment lines (e.g. `/\A#/`). See NOTE on CSV header. |
|
|
68
|
+
| `:chunk_size` | `nil` | If set, data is yielded in chunks of this many rows instead of all at once. Use with `SmarterCSV.each_chunk` for memory-efficient batch processing. |
|
|
68
69
|
|
|
69
70
|
### Separators
|
|
70
71
|
|
|
71
72
|
| Option | Default | Explanation |
|
|
72
73
|
|--------|---------|-------------|
|
|
73
74
|
| `:col_sep` | `:auto` | Column separator. `:auto` detects from file content (previous default was `','`). |
|
|
74
|
-
| `:row_sep` | `:auto` | Row / record separator. `:auto` detects from file content
|
|
75
|
-
| `:auto_row_sep_chars` | `
|
|
75
|
+
| `:row_sep` | `:auto` | Row / record separator. `:auto` detects from file content by scanning in chunks of `auto_row_sep_chars` bytes, up to a 64KB hard cap. |
|
|
76
|
+
| `:auto_row_sep_chars` | `4096` | Initial scan size for `:row_sep => :auto` detection. Scan stops as soon as one separator has a clear majority, up to a 64KB cap. Bump this if your files have very wide headers or long comment preambles. Out-of-range values, `nil`, or `0` fall back to the default with a warning. |
|
|
77
|
+
| `:buffer_size` | `16_384` | Peek buffer chunk size for non-seekable inputs (pipes, gzip readers, HTTP/S3 bodies). Out-of-range values warn and clamp to the supported range. Has no effect on seekable inputs (file paths, `File`, `StringIO`, `Tempfile`). |
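The chunked scan described for `:row_sep` can be sketched as follows. This is a simplified stand-in, not the gem's detection code: `guess_row_sep`, the "more than half of all separators seen" rule, and the `"\n"` fallback are assumptions made for illustration, and a `"\r\n"` pair split across two reads is only counted correctly because the sample is re-counted as a whole.

```ruby
require 'stringio'

HARD_CAP = 64 * 1024 # the 64KB cap mentioned in the table

# Reads `scan_size` bytes at a time and stops as soon as one candidate
# accounts for more than half of the separators seen so far.
def guess_row_sep(io, scan_size: 4096)
  sample = String.new
  while sample.bytesize < HARD_CAP && (chunk = io.read(scan_size))
    sample << chunk
    crlf = sample.scan("\r\n").size
    counts = {
      "\r\n" => crlf,
      "\n"   => sample.count("\n") - crlf, # bare LF only
      "\r"   => sample.count("\r") - crlf  # bare CR only
    }
    sep, hits = counts.max_by { |_sep, n| n }
    return sep if hits > counts.values.sum / 2.0
  end
  "\n" # assumed fallback when the scan is inconclusive
end

unix    = guess_row_sep(StringIO.new("a,b\n1,2\n"))
windows = guess_row_sep(StringIO.new("a,b\r\n1,2\r\n"))
old_mac = guess_row_sep(StringIO.new("a,b\r1,2\r"))
```

A larger `scan_size` matters when the first chunk contains no separator at all, for example a single very wide header line.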
|
|
76
78
|
|
|
77
79
|
### Quoting
|
|
78
80
|
|
|
@@ -121,8 +123,8 @@ See [Parsing Strategy](./parsing_strategy.md) for full details on quote handling
|
|
|
121
123
|
| `:strip_whitespace` | `true` | Remove whitespace before/after values and headers. |
|
|
122
124
|
| `:convert_values_to_numeric` | `true` | Convert strings containing integers or floats to the appropriate numeric type. Accepts `{except: [:key1, :key2]}` or `{only: :key3}` to limit which columns. |
|
|
123
125
|
| `:value_converters` | `nil` | Hash of `:header => converter`; converter can be a lambda/Proc or a class implementing `self.convert(value)`. See [Value Converters](./value_converters.md). |
|
|
124
|
-
| `:remove_empty_values` | `true` | Remove key/value pairs where the value is `nil
|
|
125
|
-
| `:remove_zero_values` | `false` | Remove key/value pairs
|
|
126
|
+
| `:remove_empty_values` | `true` | Remove key/value pairs where the value is `nil`, empty, or whitespace-only (any Unicode whitespace, same as ActiveSupport's `String#blank?`). |
|
|
127
|
+
| `:remove_zero_values` | `false` | Remove key/value pairs whose value is zero — numeric `0` / `0.0`, or any textual form of zero (`"0"`, `"0.0"`, `"00.00"`, `"+0"`, `"-0.0"`, …). |
|
|
126
128
|
| `:nil_values_matching` | `nil` | Set matching values to `nil`. Accepts a regular expression matched against the string representation of each value (e.g. `/\ANAN\z/` for NaN, `/\A#VALUE!\z/` for Excel errors). With `remove_empty_values: true` (default), nil-ified values are then removed. With `remove_empty_values: false`, the key is retained with a `nil` value. |
|
|
127
129
|
| `:remove_empty_hashes` | `true` | Remove result hashes that have no key/value pairs or all-empty values. |
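The "blank" and "textual zero" rules above can be approximated with two regular expressions. The patterns and helper names below are assumptions chosen to match the examples in the table; the gem's own checks may differ.

```ruby
# Assumed patterns for illustration only.
BLANK        = /\A[[:space:]]*\z/        # empty or Unicode-whitespace-only
TEXTUAL_ZERO = /\A[+-]?0+(?:\.0+)?\z/    # "0", "0.0", "00.00", "+0", "-0.0"

def blank_value?(value)
  value.nil? || (value.is_a?(String) && value.match?(BLANK))
end

def zero_value?(value)
  case value
  when Numeric then value.zero?
  when String  then value.match?(TEXTUAL_ZERO)
  else false
  end
end

row = { a: '0.0', b: " \u00A0 ", c: 0, d: '10', e: nil, f: '-0.0' }
cleaned = row.reject { |_key, value| blank_value?(value) || zero_value?(value) }
```

Only `d` survives in `cleaned`: the other values are `nil`, whitespace-only (including a non-breaking space), or some form of zero.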
|
|
128
130
|
|
|
@@ -142,7 +144,7 @@ See [Bad Row Quarantine](./bad_row_quarantine.md) for full details.
|
|
|
142
144
|
| Option | Default | Explanation |
|
|
143
145
|
|--------|---------|-------------|
|
|
144
146
|
| `:with_line_numbers` | `false` | Add `:csv_line_number` to each result hash. |
|
|
145
|
-
| `:verbose` | `:normal` | Controls warning and diagnostic output. Accepted values:<br>• `:quiet` — suppress all warnings and notices (recommended for production)<br>• `:normal` — show behavioral warnings, e.g. auto-configuration notices **(default)**<br>• `:debug` — `:normal` + print computed options and per-row diagnostics to stderr<br>`nil` is silently treated as `:normal`. Passing `true` or `false` still works but is deprecated — see below. |
|
|
147
|
+
| `:verbose` | `:normal` | Controls warning and diagnostic output. Accepted values:<br>• `:quiet` — suppress all warnings and notices (recommended for production)<br>• `:normal` — show behavioral warnings, e.g. auto-configuration notices **(default)**<br>• `:debug` — `:normal` + print computed options and per-row diagnostics to stderr<br>`nil` is silently treated as `:normal`. Passing `true` or `false` still works but is deprecated — see below. See [Warnings](./warnings.md) for the structured warning collection. |
|
|
146
148
|
|
|
147
149
|
### Instrumentation Hooks
|
|
148
150
|
|
|
@@ -156,9 +158,9 @@ See [Instrumentation Hooks](./instrumentation.md) for full details and payload r
|
|
|
156
158
|
|
|
157
159
|
### Performance
|
|
158
160
|
|
|
159
|
-
| Option
|
|
160
|
-
|
|
161
|
-
| `:acceleration`
|
|
161
|
+
| Option | Default | Explanation |
|
|
162
|
+
|-------------------|---------|-------------------------------------------------------------------------------------------------------------------------------------|
|
|
163
|
+
| `:acceleration` | `true` | Use the C extension for parsing (MRI Ruby only). Set to `false` to force the pure-Ruby fallback (always used on JRuby/TruffleRuby). |
|
|
162
164
|
|
|
163
165
|
---
|
|
164
166
|
|
data/docs/parsing_strategy.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [Real-World CSV Files](./real_world_csv.md)
|
data/docs/real_world_csv.md
CHANGED
|
@@ -16,6 +16,7 @@
|
|
|
16
16
|
* [Data Transformations](./data_transformations.md)
|
|
17
17
|
* [Value Converters](./value_converters.md)
|
|
18
18
|
* [Bad Row Quarantine](./bad_row_quarantine.md)
|
|
19
|
+
* [Warnings](./warnings.md)
|
|
19
20
|
* [Instrumentation Hooks](./instrumentation.md)
|
|
20
21
|
* [Examples](./examples.md)
|
|
21
22
|
* [**Real-World CSV Files**](./real_world_csv.md)
|
|
@@ -186,10 +187,59 @@ Numeric conversion is one of the most common sources of data loss. SmarterCSV co
|
|
|
186
187
|
|
|
187
188
|
### I/O Patterns
|
|
188
189
|
|
|
190
|
+
SmarterCSV accepts any IO-compatible source — file paths, open `File` handles, `StringIO`, and **non-seekable streams** like pipes, `STDIN`, and `Zlib::GzipReader`. Auto-detection of `row_sep` / `col_sep` works on streaming sources too thanks to internal buffering — the underlying source never needs to support `rewind` or `seek`. (Streaming IO support landed in 1.17.0.)
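To see why buffering removes the need for `rewind`, consider this minimal sketch. `PeekBuffer` and `guess_col_sep` are hypothetical names, and the code is not SmarterCSV's `PeekableIO`; it only shows the general technique of inspecting the first bytes and then replaying them ahead of the live stream.

```ruby
# Hypothetical wrapper around a non-seekable source.
class PeekBuffer
  def initialize(io, peek_size = 4096)
    @io = io
    @buffer = io.read(peek_size) || String.new
  end

  # Bytes available for separator detection; nothing is consumed.
  def peek
    @buffer
  end

  # Yields every line, starting with the buffered bytes.
  def each_line
    return enum_for(:each_line) unless block_given?

    pending = @buffer.dup
    loop do
      while (idx = pending.index("\n"))
        yield pending.slice!(0..idx)
      end
      chunk = @io.read(4096)
      break if chunk.nil?
      pending << chunk
    end
    yield pending unless pending.empty?
  end
end

# Naive guess: the candidate that occurs most often in the sample.
def guess_col_sep(sample)
  [',', "\t", ';', '|'].max_by { |sep| sample.count(sep) }
end

reader, writer = IO.pipe # a pipe cannot be rewound
writer.write("id;name\n1;Alice\n2;Bob\n")
writer.close

source  = PeekBuffer.new(reader)
col_sep = guess_col_sep(source.peek)
rows    = source.each_line.map { |line| line.chomp.split(col_sep) }
reader.close
```

The detection step reads from `peek`, and parsing then starts from the same bytes, so the pipe itself is only ever read forward.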
|
|
191
|
+
|
|
189
192
|
| Source | Issue | Status | Notes |
|
|
190
193
|
|--------|-------|--------|-------|
|
|
191
|
-
| Gzipped CSV (`.csv.gz`) | Compressed
|
|
194
|
+
| Gzipped CSV (`.csv.gz`) | Compressed, non-seekable stream | 🔘 | `SmarterCSV.process(Zlib::GzipReader.open(path))` — no need to decompress to disk first. |
|
|
192
195
|
| HTTP streaming | Parsing from a live HTTP response | 🔘 | Pass any IO-compatible object that responds to `#gets`. |
|
|
196
|
+
| `STDIN` / shell pipes | Non-seekable input | 🔘 | `cat data.csv \| ruby -rsmarter_csv -e 'SmarterCSV.process(STDIN) { \|h\| ... }'` |
|
|
197
|
+
| `IO.popen` output | Non-seekable subprocess stream | 🔘 | `IO.popen('zcat data.csv.gz') { \|io\| SmarterCSV.process(io) }` |
|
|
198
|
+
| S3 object body | Non-seekable HTTP stream | 🔘 | `SmarterCSV.process(s3.get_object(...).body)` — see worked example below. |
|
|
199
|
+
|
|
200
|
+
#### Streaming Inputs
|
|
201
|
+
|
|
202
|
+
```ruby
|
|
203
|
+
# Gzipped CSV — stream-decompressed, never written to disk
|
|
204
|
+
require 'zlib'
|
|
205
|
+
Zlib::GzipReader.open('huge.csv.gz') do |io|
|
|
206
|
+
SmarterCSV.process(io) { |row| MyModel.upsert(row.first) }
|
|
207
|
+
end
|
|
208
|
+
|
|
209
|
+
# STDIN / pipes
|
|
210
|
+
SmarterCSV.process($stdin) { |row, _| MyModel.upsert(row.first) }
|
|
211
|
+
|
|
212
|
+
# HTTP response body
|
|
213
|
+
require 'open-uri'
|
|
214
|
+
URI.open('https://example.com/data.csv') { |io| SmarterCSV.process(io) }
|
|
215
|
+
|
|
216
|
+
# S3 — stream the response body directly
|
|
217
|
+
require 'aws-sdk-s3'
|
|
218
|
+
obj = Aws::S3::Client.new.get_object(bucket: 'data', key: 'imports/users.csv')
|
|
219
|
+
SmarterCSV::Reader.new(obj.body, chunk_size: 500).each_chunk do |chunk, _index|
|
|
220
|
+
MyModel.insert_all(chunk)
|
|
221
|
+
end
|
|
222
|
+
|
|
223
|
+
# Subprocess output
|
|
224
|
+
IO.popen('zcat data.csv.gz') { |io| SmarterCSV.process(io) }
|
|
225
|
+
```
|
|
226
|
+
|
|
227
|
+
#### Multi-Line Quoted Fields
|
|
228
|
+
|
|
229
|
+
Newlines inside `"..."` are preserved as part of the field — useful for address blocks, CRM notes, and free-text comments. No configuration needed:
|
|
230
|
+
|
|
231
|
+
```ruby
|
|
232
|
+
$ cat addresses.csv
|
|
233
|
+
id,name,address
|
|
234
|
+
1,Alice,"123 Main St
|
|
235
|
+
Apt 4B
|
|
236
|
+
Brooklyn, NY 11201"
|
|
237
|
+
2,Bob,"42 Elm Ave"
|
|
238
|
+
|
|
239
|
+
data = SmarterCSV.process('addresses.csv')
|
|
240
|
+
# => [{id: 1, name: "Alice", address: "123 Main St\nApt 4B\nBrooklyn, NY 11201"},
|
|
241
|
+
# {id: 2, name: "Bob", address: "42 Elm Ave"}]
|
|
242
|
+
```
|
|
193
243
|
|
|
194
244
|
†: Legacy Apple DB Dump and older UNIX data dumps use ASCII control characters as delimiters:
|
|
195
245
|
|
data/docs/releases/1.16.0/performance_notes.md
CHANGED
|
@@ -45,14 +45,14 @@ rows with type conversion applied. SmarterCSV/C is dramatically faster:
|
|
|
45
45
|
|
|
46
46
|
### C path
|
|
47
47
|
|
|
48
|
-
| Gain | Files
|
|
49
|
-
|
|
50
|
-
| **2.4×** | long_fields — biggest win; `memchr` skip-ahead in quoted fields
|
|
51
|
-
| **1.5×** | heavy_quoting — same skip-ahead benefit
|
|
52
|
-
| **1.4×** | tab_separated
|
|
48
|
+
| Gain | Files |
|
|
49
|
+
|--------------|-----------------------------------------------------------------------------|
|
|
50
|
+
| **2.4×** | long_fields — biggest win; `memchr` skip-ahead in quoted fields |
|
|
51
|
+
| **1.5×** | heavy_quoting — same skip-ahead benefit |
|
|
52
|
+
| **1.4×** | tab_separated |
|
|
53
53
|
| **1.2–1.3×** | embedded_sep, utf8, PEOPLE_IMPORT_C/NC, worldcities, whitespace, multi_char |
|
|
54
|
-
| **1.1–1.2×** | PEOPLE_IMPORT_B/NB, uszips, sample_10M, wide_500_cols
|
|
55
|
-
| **~1.0×** | sensor_data, embedded_newlines (within noise)
|
|
54
|
+
| **1.1–1.2×** | PEOPLE_IMPORT_B/NB, uszips, sample_10M, wide_500_cols |
|
|
55
|
+
| **~1.0×** | sensor_data, embedded_newlines (within noise) |
|
|
56
56
|
|
|
57
57
|
15 of 19 files are measurably faster; 2 within noise; 2 files show a small regression
|
|
58
58
|
(PEOPLE_IMPORT_NB −7%, wide_500_cols −5%) attributable to the new `quote_boundary: :standard`
|
|
@@ -60,11 +60,11 @@ default adding one extra state check on the unquoted fast path.
|
|
|
60
60
|
|
|
61
61
|
### Ruby path
|
|
62
62
|
|
|
63
|
-
| Gain | Files
|
|
64
|
-
|
|
63
|
+
| Gain | Files |
|
|
64
|
+
|--------------|-----------------------------------------------------------------------------------|
|
|
65
65
|
| **1.9×** | PEOPLE_IMPORT_C (117 cols) — direct hash construction bypasses intermediate Array |
|
|
66
|
-
| **1.5×** | PEOPLE_IMPORT_NC, multi_char_sep
|
|
67
|
-
| **1.0–1.1×** | most other files
|
|
66
|
+
| **1.5×** | PEOPLE_IMPORT_NC, multi_char_sep |
|
|
67
|
+
| **1.0–1.1×** | most other files |
|
|
68
68
|
|
|
69
69
|
The Ruby path gains are concentrated on wide/complex files where the direct-hash
|
|
70
70
|
construction optimization (Opt #11) has the most impact.
|
|
@@ -106,9 +106,9 @@ are skipped entirely in the C hot path — no string allocation, no conversion,
|
|
|
106
106
|
insertion. Benchmark on `wide_500_cols_20k.csv` (500 columns):
|
|
107
107
|
|
|
108
108
|
| Columns kept | Speedup vs no selection |
|
|
109
|
-
|
|
110
|
-
|
|
|
111
|
-
|
|
|
112
|
-
|
|
|
109
|
+
|--------------|-------------------------|
|
|
110
|
+
| 2 of 500 | ~16× faster |
|
|
111
|
+
| 10 of 500 | ~8× faster |
|
|
112
|
+
| 50 of 500 | ~3× faster |
|
|
113
113
|
|
|
114
114
|
This is additive on top of the baseline gains above.
|
data/docs/releases/1.17.0/benchmarks.md
ADDED
|
@@ -0,0 +1,121 @@
|
|
|
1
|
+
# SmarterCSV 1.17.0 — Benchmark Results
|
|
2
|
+
|
|
3
|
+
- **Date:** 2026-05-06
|
|
4
|
+
- **Ruby:** 3.4.7 [arm64-darwin25] on Apple M1 Pro
|
|
5
|
+
- **SmarterCSV:** 1.17.0
|
|
6
|
+
- **Versions compared:** 1.14.4, 1.15.2, 1.16.4, 1.17.0
|
|
7
|
+
- **Ruby CSV:** 3.3.5
|
|
8
|
+
- **Methodology:** best of 40 measured runs (2 warm-up)
|
|
9
|
+
- **Raw data files:**
|
|
10
|
+
- [`2026-05-06_1250_ruby3.4.7.md`](2026-05-06_1250_ruby3.4.7.md) / [`.json`](2026-05-06_1250_ruby3.4.7.json) — version comparison (1.14.4 / 1.15.2 / 1.16.4 / 1.17.0)
|
|
11
|
+
- [`2026-05-06_1511_ruby3.4.7.md`](2026-05-06_1511_ruby3.4.7.md) / [`.json`](2026-05-06_1511_ruby3.4.7.json) — vs Ruby CSV 3.3.5
|
|
12
|
+
|
|
13
|
+
See [performance_notes.md](performance_notes.md) for analysis of these numbers.
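A "best of N measured runs after warm-up" measurement can be sketched like this. `best_of` is a hypothetical helper, not the project's benchmark harness, and the workload shown is a placeholder.

```ruby
# Runs the block `warmup` times unmeasured, then `runs` times measured,
# and returns the fastest measured wall-clock time in seconds.
def best_of(runs: 40, warmup: 2)
  warmup.times { yield }
  timings = Array.new(runs) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
  timings.min
end

calls = 0
best = best_of(runs: 5, warmup: 1) do
  calls += 1
  (1..10_000).sum # placeholder workload
end
```

Taking the minimum rather than the mean reduces the influence of GC pauses and other background noise on the reported time.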
|
|
14
|
+
|
|
15
|
+
---
|
|
16
|
+
|
|
17
|
+
## SmarterCSV C accelerated — version comparison
|
|
18
|
+
|
|
19
|
+
| File | Rows | v1.14.4 | v1.15.2 | v1.16.4 | v1.17.0 | newest vs oldest |
|
|
20
|
+
|----------------------------------|--------|------------|-----------|-----------|-----------|------------------|
|
|
21
|
+
| PEOPLE_IMPORT_B.csv | 50000 | 1.6175s | 0.1049s | 0.0867s | 0.0872s | 18.54× faster |
|
|
22
|
+
| PEOPLE_IMPORT_C.csv | 50000 | 8.0347s | 0.2055s | 0.1763s | 0.1746s | 46.02× faster |
|
|
23
|
+
| PEOPLE_IMPORT_NB.csv | 50000 | 1.5629s | 0.0994s | 0.0694s | 0.0708s | 22.08× faster |
|
|
24
|
+
| PEOPLE_IMPORT_NC.csv | 50000 | 1.4679s | 0.0855s | 0.0711s | 0.0705s | 20.83× faster |
|
|
25
|
+
| uscities.csv | 31257 | 1.0357s | 0.1129s | 0.0878s | 0.0819s | 12.64× faster |
|
|
26
|
+
| uszips.csv | 33782 | 1.2419s | 0.1121s | 0.0880s | 0.0879s | 14.13× faster |
|
|
27
|
+
| worldcities.csv | 48059 | 1.0420s | 0.1174s | 0.0861s | 0.0773s | 13.49× faster |
|
|
28
|
+
| embedded_newlines_20k.csv | 80000 | 0.5337s | 0.0633s | 0.0591s | 0.0545s | 9.80× faster |
|
|
29
|
+
| embedded_separators_20k.csv | 20000 | 0.2761s | 0.0328s | 0.0215s | 0.0214s | 12.90× faster |
|
|
30
|
+
| heavy_quoting_20k.csv | 20000 | 0.5129s | 0.0561s | 0.0364s | 0.0358s | 14.34× faster |
|
|
31
|
+
| long_fields_20k.csv | 20000 | 2.9215s | 0.1082s | 0.0464s | 0.0392s | 74.54× faster |
|
|
32
|
+
| many_empty_fields_20k.csv | 20000 | 0.3885s | 0.0314s | 0.0240s | 0.0262s | 14.81× faster |
|
|
33
|
+
| multi_char_separator_20k.csv | 20000 | 0.5305s | 0.0340s | 0.0272s | 0.0296s | 17.90× faster |
|
|
34
|
+
| sample_10M.csv | 50000 | 0.4513s | 0.0619s | 0.0480s | 0.0446s | 10.11× faster |
|
|
35
|
+
| sensor_data_50krows_50cols.csv | 50000 | 3.8704s | 0.2714s | 0.2559s | 0.2549s | 15.19× faster |
|
|
36
|
+
| tab_separated_20k.tsv | 20000 | 0.4496s | 0.0337s | 0.0255s | 0.0256s | 17.54× faster |
|
|
37
|
+
| utf8_multibyte_20k.csv | 20000 | 0.2233s | 0.0210s | 0.0152s | 0.0149s | 14.96× faster |
|
|
38
|
+
| whitespace_heavy_20k.csv | 20000 | 0.5244s | 0.0349s | 0.0250s | 0.0286s | 18.34× faster |
|
|
39
|
+
| wide_500_cols_20k.csv | 20000 | 17.3477s | 1.2805s | 1.2798s | 1.2701s | 13.66× faster |
|
|
40
|
+
|
|
41
|
+
## SmarterCSV Ruby path — version comparison
|
|
42
|
+
|
|
43
|
+
| File | Rows | v1.14.4 | v1.15.2 | v1.16.4 | v1.17.0 | newest vs oldest |
|
|
44
|
+
|----------------------------------|--------|------------|-----------|-----------|-----------|------------------|
|
|
45
|
+
| PEOPLE_IMPORT_B.csv | 50000 | 4.5718s | 0.5635s | 0.5272s | 0.4971s | 9.20× faster |
|
|
46
|
+
| PEOPLE_IMPORT_C.csv | 50000 | 26.0194s | 2.5511s | 1.3401s | 1.3328s | 19.52× faster |
|
|
47
|
+
| PEOPLE_IMPORT_NB.csv | 50000 | 4.4999s | 0.5268s | 0.4757s | 0.4791s | 9.39× faster |
|
|
48
|
+
| PEOPLE_IMPORT_NC.csv | 50000 | 4.3233s | 0.5752s | 0.3989s | 0.4017s | 10.76× faster |
|
|
49
|
+
| uscities.csv | 31257 | 2.6702s | 1.8124s | 1.0662s | 1.0944s | 2.44× faster |
|
|
50
|
+
| uszips.csv | 33782 | 3.1853s | 2.1641s | 1.3332s | 1.3434s | 2.37× faster |
|
|
51
|
+
| worldcities.csv | 48059 | 2.8397s | 1.8978s | 1.0910s | 1.0909s | 2.60× faster |
|
|
52
|
+
| embedded_newlines_20k.csv | 80000 | 0.9578s | 0.4629s | 0.4291s | 0.4314s | 2.22× faster |
|
|
53
|
+
| embedded_separators_20k.csv | 20000 | 0.7074s | 0.4535s | 0.2748s | 0.2748s | 2.57× faster |
|
|
54
|
+
| heavy_quoting_20k.csv | 20000 | 1.4361s | 0.8598s | 0.5241s | 0.5273s | 2.72× faster |
|
|
55
|
+
| long_fields_20k.csv | 20000 | 8.8715s | 4.7839s | 2.5696s | 2.5624s | 3.46× faster |
|
|
56
|
+
| many_empty_fields_20k.csv | 20000 | 0.8635s | 0.2521s | 0.1680s | 0.1664s | 5.19× faster |
|
|
57
|
+
| multi_char_separator_20k.csv | 20000 | 1.4172s | 0.2463s | 0.1853s | 0.1879s | 7.54× faster |
|
|
58
|
+
| sample_10M.csv | 50000 | 1.0547s | 0.2388s | 0.2238s | 0.2211s | 4.77× faster |
|
|
59
|
+
| sensor_data_50krows_50cols.csv | 50000 | 8.9445s | 1.8246s | 1.8348s | 1.8181s | 4.92× faster |
|
|
60
|
+
| tab_separated_20k.tsv | 20000 | 1.2664s | 0.1596s | 0.1553s | 0.1536s | 8.24× faster |
|
|
61
|
+
| utf8_multibyte_20k.csv | 20000 | 0.6484s | 0.1124s | 0.1068s | 0.1066s | 6.08× faster |
|
|
62
|
+
| whitespace_heavy_20k.csv | 20000 | 1.5513s | 0.1613s | 0.1654s | 0.1610s | 9.63× faster |
|
|
63
|
+
| wide_500_cols_20k.csv | 20000 | 44.5782s | 7.2023s | 6.9748s | 6.9261s | 6.44× faster |
|
|
64
|
+
|
|
65
|
+
---
|
|
66
|
+
|
|
67
|
+
## SmarterCSV 1.17.0 vs Ruby CSV 3.3.5 — full results
|
|
68
|
+
|
|
69
|
+
| File | Rows | CSV.read¹ | CSV.hashes¹ | SmarterCSV/C | SmarterCSV/Rb |
|
|
70
|
+
|----------------------------------|--------|------------|-------------|---------------|---------------|
|
|
71
|
+
| PEOPLE_IMPORT_B.csv | 50000 | 0.2718s | 0.7750s | 0.0673s | 0.5034s |
|
|
72
|
+
| PEOPLE_IMPORT_C.csv | 50000 | 1.4111s | 8.0199s | 0.1907s | 1.4032s |
|
|
73
|
+
| PEOPLE_IMPORT_NB.csv | 50000 | 0.2659s | 0.7603s | 0.0638s | 0.4800s |
|
|
74
|
+
| PEOPLE_IMPORT_NC.csv | 50000 | 0.2860s | 0.9173s | 0.0630s | 0.4132s |
|
|
75
|
+
| uscities.csv | 31257 | 0.5640s | 0.8803s | 0.0789s | 1.1120s |
|
|
76
|
+
| uszips.csv | 33782 | 0.7414s | 1.1604s | 0.0929s | 1.3645s |
|
|
77
|
+
| worldcities.csv | 48059 | 0.6313s | 0.9906s | 0.0794s | 1.0945s |
|
|
78
|
+
| embedded_newlines_20k.csv | 80000 | 0.1693s | 0.2245s | 0.0554s | 0.4451s |
|
|
79
|
+
| embedded_separators_20k.csv | 20000 | 0.1312s | 0.1838s | 0.0206s | 0.2830s |
|
|
80
|
+
| heavy_quoting_20k.csv | 20000 | 0.1167s | 0.2410s | 0.0338s | 0.5400s |
|
|
81
|
+
| long_fields_20k.csv | 20000 | 0.2373s | 0.2762s | 0.0392s | 2.6172s |
|
|
82
|
+
| many_empty_fields_20k.csv | 20000 | 0.1145s | 0.3622s | 0.0216s | 0.1727s |
|
|
83
|
+
| multi_char_separator_20k.csv | 20000 | 0.0890s | 0.2122s | 0.0293s | 0.1662s |
|
|
84
|
+
| sample_10M.csv | 50000 | 0.1685s | 0.3012s | 0.0357s | 0.2361s |
|
|
85
|
+
| sensor_data_50krows_50cols.csv | 50000 | 0.5655s | 2.6744s | 0.2442s | 1.8878s |
|
|
86
|
+
| tab_separated_20k.tsv | 20000 | 0.0832s | 0.2029s | 0.0219s | 0.1651s |
|
|
87
|
+
| utf8_multibyte_20k.csv | 20000 | 0.0662s | 0.1427s | 0.0156s | 0.1138s |
|
|
88
|
+
| whitespace_heavy_20k.csv | 20000 | 0.0890s | 0.2169s | 0.0278s | 0.1670s |
|
|
89
|
+
| wide_500_cols_20k.csv | 20000 | 2.3351s | 32.4002s | 1.2823s | 7.3504s |
|
|
90
|
+
|
|
91
|
+
## Ruby CSV 3.3.5 vs SmarterCSV 1.17.0 (C accelerated)
|
|
92
|
+
|
|
93
|
+
| File | Rows | CSV.read¹ | CSV.hashes¹ |
|
|
94
|
+
|----------------------------------|--------|---------------|---------------|
|
|
95
|
+
| PEOPLE_IMPORT_B.csv | 50000 | 4.04× slower | 11.51× slower |
|
|
96
|
+
| PEOPLE_IMPORT_C.csv | 50000 | 7.40× slower | 42.04× slower |
|
|
97
|
+
| PEOPLE_IMPORT_NB.csv | 50000 | 4.17× slower | 11.92× slower |
|
|
98
|
+
| PEOPLE_IMPORT_NC.csv | 50000 | 4.54× slower | 14.55× slower |
|
|
99
|
+
| uscities.csv | 31257 | 7.15× slower | 11.16× slower |
|
|
100
|
+
| uszips.csv | 33782 | 7.98× slower | 12.50× slower |
|
|
101
|
+
| worldcities.csv | 48059 | 7.95× slower | 12.48× slower |
|
|
102
|
+
| embedded_newlines_20k.csv | 80000 | 3.05× slower | 4.05× slower |
|
|
103
|
+
| embedded_separators_20k.csv | 20000 | 6.36× slower | 8.91× slower |
|
|
104
|
+
| heavy_quoting_20k.csv | 20000 | 3.46× slower | 7.14× slower |
|
|
105
|
+
| long_fields_20k.csv | 20000 | 6.05× slower | 7.04× slower |
|
|
106
|
+
| many_empty_fields_20k.csv | 20000 | 5.29× slower | 16.73× slower |
|
|
107
|
+
| multi_char_separator_20k.csv | 20000 | 3.04× slower | 7.25× slower |
|
|
108
|
+
| sample_10M.csv | 50000 | 4.72× slower | 8.43× slower |
|
|
109
|
+
| sensor_data_50krows_50cols.csv | 50000 | 2.32× slower | 10.95× slower |
|
|
110
|
+
| tab_separated_20k.tsv | 20000 | 3.80× slower | 9.28× slower |
|
|
111
|
+
| utf8_multibyte_20k.csv | 20000 | 4.24× slower | 9.14× slower |
|
|
112
|
+
| whitespace_heavy_20k.csv | 20000 | 3.20× slower | 7.81× slower |
|
|
113
|
+
| wide_500_cols_20k.csv | 20000 | 1.82× slower | 25.27× slower |
|
|
114
|
+
|
|
115
|
+
---
|
|
116
|
+
|
|
117
|
+
¹ **Raw output** — no post-processing applied. Returns plain arrays or string-keyed hashes. No header normalization, type conversion, whitespace stripping, or empty-value removal. Your own post-processing must be added to produce usable data.
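For a sense of what that post-processing involves, here is a rough sketch applied to one string-keyed hash. `normalize_row` and its conversion rules are assumptions for illustration; they are not what the benchmark ran and not SmarterCSV's own logic.

```ruby
# Hypothetical clean-up of a raw, string-keyed row.
def normalize_row(raw)
  raw.each_with_object({}) do |(header, value), row|
    value = value.to_s.strip                             # whitespace stripping
    next if value.empty?                                 # empty-value removal
    key = header.strip.downcase.gsub(/\s+/, '_').to_sym  # header normalization
    row[key] = case value                                # type conversion
               when /\A[+-]?\d+\z/      then value.to_i
               when /\A[+-]?\d+\.\d+\z/ then value.to_f
               else value
               end
  end
end

raw = { " \t Annual Sales " => ' 1200 ', 'Region' => 'EMEA', 'Notes' => '  ', 'Growth' => '4.5' }
normalized = normalize_row(raw)
```

Each of these steps adds per-row work on top of the raw `CSV.read` / `CSV.hashes` timings in the tables above.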
|
|
118
|
+
|
|
119
|
+
---
|
|
120
|
+
|
|
121
|
+
PREVIOUS: [Performance Notes](./performance_notes.md) | UP: [README](../../../README.md)
|