smarter_csv 1.15.2 → 1.16.0
- checksums.yaml +4 -4
- data/.rubocop.yml +9 -0
- data/CHANGELOG.md +68 -1
- data/CONTRIBUTORS.md +3 -1
- data/Gemfile +1 -0
- data/README.md +123 -27
- data/docs/_introduction.md +40 -24
- data/docs/bad_row_quarantine.md +285 -0
- data/docs/basic_read_api.md +151 -9
- data/docs/basic_write_api.md +474 -59
- data/docs/batch_processing.md +161 -4
- data/docs/column_selection.md +183 -0
- data/docs/data_transformations.md +162 -29
- data/docs/examples.md +339 -46
- data/docs/header_transformations.md +93 -12
- data/docs/header_validations.md +56 -18
- data/docs/history.md +117 -0
- data/docs/instrumentation.md +165 -0
- data/docs/migrating_from_csv.md +290 -0
- data/docs/options.md +150 -87
- data/docs/parsing_strategy.md +63 -1
- data/docs/real_world_csv.md +262 -0
- data/docs/releases/1.16.0/benchmarks.md +223 -0
- data/docs/releases/1.16.0/changes.md +272 -0
- data/docs/releases/1.16.0/performance_notes.md +114 -0
- data/docs/row_col_sep.md +14 -5
- data/docs/value_converters.md +193 -57
- data/ext/smarter_csv/extconf.rb +3 -0
- data/ext/smarter_csv/smarter_csv.c +1007 -71
- data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png +0 -0
- data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.svg +108 -0
- data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.png +0 -0
- data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg +141 -0
- data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.png +0 -0
- data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.svg +139 -0
- data/lib/smarter_csv/errors.rb +8 -0
- data/lib/smarter_csv/file_io.rb +1 -1
- data/lib/smarter_csv/hash_transformations.rb +14 -13
- data/lib/smarter_csv/header_transformations.rb +21 -2
- data/lib/smarter_csv/headers.rb +2 -1
- data/lib/smarter_csv/options.rb +124 -7
- data/lib/smarter_csv/parser.rb +362 -75
- data/lib/smarter_csv/reader.rb +494 -46
- data/lib/smarter_csv/version.rb +1 -1
- data/lib/smarter_csv/writer.rb +71 -19
- data/lib/smarter_csv.rb +95 -12
- data/smarter_csv.gemspec +20 -10
- metadata +37 -80
data/docs/history.md
ADDED
@@ -0,0 +1,117 @@

### Contents

* [Introduction](./_introduction.md)
* [Migrating from Ruby CSV](./migrating_from_csv.md)
* [Parsing Strategy](./parsing_strategy.md)
* [The Basic Read API](./basic_read_api.md)
* [The Basic Write API](./basic_write_api.md)
* [Batch Processing](./batch_processing.md)
* [Configuration Options](./options.md)
* [Row and Column Separators](./row_col_sep.md)
* [Header Transformations](./header_transformations.md)
* [Header Validations](./header_validations.md)
* [Column Selection](./column_selection.md)
* [Data Transformations](./data_transformations.md)
* [Value Converters](./value_converters.md)
* [Bad Row Quarantine](./bad_row_quarantine.md)
* [Instrumentation Hooks](./instrumentation.md)
* [Examples](./examples.md)
* [Real-World CSV Files](./real_world_csv.md)
* [**SmarterCSV over the Years**](./history.md)
* [Release Notes](./releases/1.16.0/changes.md)

--------------

# SmarterCSV over the Years

## Origin

SmarterCSV was born from a [StackOverflow question in 2011](https://stackoverflow.com/questions/7788618/update-mongodb-with-array-from-csv-join-table/7788746#7788746) about importing CSV data into MongoDB. The answer involved processing CSV rows as hashes — which turned out to be so useful that it became a gem.

The original write-up is preserved in [the original post](http://www.unixgods.org/Ruby/process_csv_as_hashes.html).

The first gem release was **v1.0.1 on 2012-07-30**.

---

## Key Milestones

| Version | Date | Highlight |
|---------|------------|-----------|
| 1.0.1 | 2012-07-30 | First release: CSV → array of hashes, batch processing, key mapping |
| 1.0.17 | 2014-01-13 | `row_sep: :auto` — automatic row separator detection |
| 1.0.18 | 2014-10-27 | Multi-line / embedded-newline field support |
| 1.1.0 | 2015-07-26 | `value_converters` — custom per-column type parsing (dates, money, …) |
| 1.4.0 | 2022-02-11 | Experimental `col_sep: :auto` detection; switched to MIT-only licence |
| 1.5.1 | 2022-04-27 | `duplicate_header_suffix` for CSV files with repeated headers |
| 1.6.0 | 2022-05-03 | Complete rewrite of the pure-Ruby line parser |
| **1.7.0** | **2022-06-26** | **First C extension — >10× speedup over 1.6.x announced** |
| 1.8.0 | 2023-03-18 | `col_sep: :auto` and `row_sep: :auto` made the **default** |
| 1.9.0 | 2023-09-04 | Structured error objects with programmatic key access |
| 1.10.0 | 2023-12-31 | Performance & memory improvements; stricter `user_provided_headers` |
| **1.11.0** | **2024-07-02** | **SmarterCSV::Writer** — CSV generation from hashes |
| **1.12.0** | **2024-07-09** | **Thread-safe `SmarterCSV::Reader` class**; docs site added |
| 1.13.0 | 2024-11-06 | Auto-generation of extra column names; improved quote robustness |
| 1.14.0 | 2025-04-07 | Advanced Writer options; `header_converter` |
| 1.14.3 | 2025-05-04 | C-extension fast path for unquoted fields; inline whitespace stripping |
| **1.15.0** | **2026-02-04** | **Major C-extension rewrite — ~5× faster than 1.14.4; 39% less memory** |
| 1.15.1 | 2026-02-17 | Fix for backslash in quoted fields (`quote_escaping:` option) |
| 1.15.2 | 2026-02-20 | Further C-path optimisations; 5.4×–37.4× faster than 1.14.4 |
| **1.16.0** | **2026-03-12** | **New `each`/`each_chunk` enumerator API; `SmarterCSV.parse`; bad row quarantine; column selection `headers: { only: }`; 1.8×–8.6× faster than Ruby CSV.read; new features for Reader and Writer; minor breaking: `quote_boundary: :standard`** |

---

## Performance Journey

Measured on Apple M1, Ruby 3.4.7. Best of 2 sessions × 30 runs.
All times are **C-accelerated** except the `1.6.1` column (no C extension existed).
`—` = not measured for that version.

| File | Rows | 1.6.1 Rb (s) | 1.7.1 C (s) | 1.14.4 C (s) | 1.15.2 C (s) | 1.16.0 C (s) | total gain |
|--------------------------------|------:|-------------:|------------:|-------------:|-------------:|-------------:|-----------:|
| PEOPLE_IMPORT_B.csv | 50k | 3.793 | 1.083 | 1.656 | 0.101 | 0.087 | **43.6×** |
| PEOPLE_IMPORT_C.csv | 50k | 21.612 | 2.763 | 8.172 | 0.207 | 0.169 | **127.8×** |
| PEOPLE_IMPORT_NB.csv | 50k | 3.746 | 1.053 | 1.605 | 0.086 | 0.080 | **46.9×** |
| PEOPLE_IMPORT_NC.csv | 50k | 3.831 | 1.018 | 1.495 | 0.076 | 0.063 | **60.8×** |
| uscities.csv | 31k | — | — | 1.058 | 0.113 | 0.108 | — |
| uszips.csv | 34k | — | — | 1.277 | 0.111 | 0.102 | — |
| worldcities.csv | 48k | — | — | 1.070 | 0.116 | 0.097 | — |
| fmap.csv | 50k | 2.130 | 0.873 | — | — | — | — |
| zipcode.csv | 44k | 1.572 | 0.797 | — | — | — | — |
| sample_10M.csv | 50k | 1.291 | 0.661 | 0.459 | 0.053 | 0.046 | **28.0×** |
| sensor_data_50krows_50cols.csv | 50k | — | — | 3.985 | 0.272 | 0.264 | — |
| embedded_newlines_20k.csv | 80k | 0.716 | 0.366 | 0.540 | 0.056 | 0.054 | **13.2×** |
| embedded_separators_20k.csv | 20k | 0.714 | 0.333 | 0.278 | 0.032 | 0.025 | **28.6×** |
| heavy_quoting_20k.csv | 20k | 1.309 | 0.484 | 0.522 | 0.054 | 0.036 | **36.5×** |
| long_fields_20k.csv | 20k | 5.698 | 1.112 | 2.960 | 0.110 | 0.045 | **126.6×** |
| many_empty_fields_20k.csv | 20k | 1.149 | 0.420 | 0.395 | 0.031 | 0.025 | **45.8×** |
| multi_char_separator_20k.csv | 20k | — | — | 0.539 | 0.033 | 0.026 | — |
| tab_separated_20k.tsv | 20k | — | — | 0.462 | 0.034 | 0.025 | — |
| utf8_multibyte_20k.csv | 20k | 0.709 | 0.305 | 0.228 | 0.020 | 0.017 | **41.7×** |
| whitespace_heavy_20k.csv | 20k | 1.335 | 0.393 | 0.536 | 0.036 | 0.028 | **47.5×** |
| wide_500_cols_20k.csv | 20k | 39.755 | 9.532 | 17.658 | 1.419 | 1.352 | **29.4×** |

`total gain` = v1.6.1 Ruby time / v1.16.0 C-accelerated time (files without 1.6.1 data show `—`)
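
As a quick sanity check, the `total gain` ratio can be recomputed from the published timings. The sketch below copies three rows from the table; `TIMINGS` and `total_gain` are illustrative names, not part of SmarterCSV. The published ratios were presumably computed from unrounded timings, so a few other rows differ from this recomputation in the last digit.

```ruby
# Timings in seconds, copied from the table:
# [v1.6.1 pure-Ruby time, v1.16.0 C-accelerated time]
TIMINGS = {
  'PEOPLE_IMPORT_B.csv'   => [3.793, 0.087],
  'long_fields_20k.csv'   => [5.698, 0.045],
  'wide_500_cols_20k.csv' => [39.755, 1.352]
}.freeze

# total gain = v1.6.1 time / v1.16.0 time, rounded to one decimal place
def total_gain(ruby_seconds, c_seconds)
  (ruby_seconds / c_seconds).round(1)
end

TIMINGS.each do |file, (before, after)|
  puts format('%-24s %6.1fx', file, total_gain(before, after))
end
```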

--------------

**Highlights:**
- `long_fields_20k` (long quoted fields): **126.6×** — `memchr`-based field scanning makes long quoted fields essentially free to skip.
- `PEOPLE_IMPORT_C` (116 columns): **127.8×** — wide rows multiply every per-field saving across all columns.
- `PEOPLE_IMPORT_NC` (17 columns): **60.8×** — Ruby-path optimisations #10 & #11 provide an extra boost on moderately wide files.
- `wide_500_cols_20k` went from **39.8 seconds → 1.35 seconds** — and with `headers: { only: }` keeping just 2 of those 500 columns it drops further to **~0.1 seconds** (an additional ~16× on top).
- `embedded_newlines` shows the smallest gain (**13.2×**) — multi-line stitching is bounded by I/O and the line-counting loop, not field parsing.

---

## Related Reading

- [Parsing CSV Files in Ruby with SmarterCSV](https://tilo-sloboda.medium.com/parsing-csv-files-in-ruby-with-smartercsv-6ce66fb6cf38)
- [SmarterCSV 1.15.2 — Faster than raw CSV arrays](https://tilo-sloboda.medium.com/smartercsv-1-15-2-faster-than-raw-csv-arrays-benchmarks-zsv-and-the-full-pipeline-2c12a798032e)
- [Processing 1.4 Million CSV Records in Ruby, fast](https://lcx.wien/blog/processing-14-million-csv-records-in-ruby/)
- [Faster Parsing CSV with Parallel Processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing) by [Jack Lin](https://github.com/xjlin0/)

--------------------

PREVIOUS: [Real-World CSV Files](./real_world_csv.md) | NEXT: [Release Notes](./releases/1.16.0/changes.md) | UP: [README](../README.md)
data/docs/instrumentation.md
ADDED
@@ -0,0 +1,165 @@

### Contents

* [Introduction](./_introduction.md)
* [Migrating from Ruby CSV](./migrating_from_csv.md)
* [Parsing Strategy](./parsing_strategy.md)
* [The Basic Read API](./basic_read_api.md)
* [The Basic Write API](./basic_write_api.md)
* [Batch Processing](./batch_processing.md)
* [Configuration Options](./options.md)
* [Row and Column Separators](./row_col_sep.md)
* [Header Transformations](./header_transformations.md)
* [Header Validations](./header_validations.md)
* [Column Selection](./column_selection.md)
* [Data Transformations](./data_transformations.md)
* [Value Converters](./value_converters.md)
* [Bad Row Quarantine](./bad_row_quarantine.md)
* [**Instrumentation Hooks**](./instrumentation.md)
* [Examples](./examples.md)
* [Real-World CSV Files](./real_world_csv.md)
* [SmarterCSV over the Years](./history.md)
* [Release Notes](./releases/1.16.0/changes.md)

--------------

# Instrumentation Hooks

SmarterCSV provides three optional callback hooks so you can observe file processing
without wrapping every call site in timing code. The hooks work with `SmarterCSV.process`
(library-controlled iteration). Enumerator modes (`each`, `each_chunk`) do not fire
hooks — in those modes the caller owns the lifecycle and should instrument their own loop.

## The Three Hooks

| Hook | Fires when | Useful for |
|---------------|-----------------------------------------------------|---------------------------------------------|
| `on_start` | Once, before the first row is parsed | Logging intent, starting timers, counters |
| `on_chunk` | After each chunk is parsed, before block runs | Progress tracking, per-batch metrics |
| `on_complete` | Once, after the entire file is exhausted | Total duration, row counts, summary metrics |

`on_chunk` only fires when `chunk_size` is set. In non-chunked mode only `on_start` and
`on_complete` fire.

## Usage

All three hooks are lambdas (or any callable) passed as options:

```ruby
SmarterCSV.process('data.csv',
  chunk_size: 500,

  on_start: ->(info) {
    Rails.logger.info "Starting CSV import: #{info[:input]} (#{info[:file_size]} bytes)"
    Metrics.increment('csv.import.start')
  },

  on_chunk: ->(info) {
    Rails.logger.debug "Chunk #{info[:chunk_number]}: #{info[:rows_in_chunk]} rows " \
                       "(#{info[:total_rows_so_far]} so far)"
  },

  on_complete: ->(stats) {
    Rails.logger.info "Import complete: #{stats[:total_rows]} rows in #{stats[:duration].round(2)}s"
    Metrics.histogram('csv.import.duration', stats[:duration])
    Metrics.gauge('csv.import.rows', stats[:total_rows])
    Metrics.increment('csv.import.bad_rows', stats[:bad_rows]) if stats[:bad_rows] > 0
  },
) { |chunk| MyModel.insert_all(chunk) }
```

## Hook Payloads

### `on_start`

| Key | Type | Description |
|--------------|---------------|---------------------------------------------------------------------|
| `:input` | String | File path if input is a filename; class name (e.g. `"File"`) otherwise |
| `:file_size` | Integer / nil | File size in bytes if determinable; nil for IO objects |
| `:col_sep` | String | Effective column separator (after auto-detection) |
| `:row_sep` | String | Effective row separator (after auto-detection) |

### `on_chunk`

| Key | Type | Description |
|-----------------------|---------|------------------------------------------------------|
| `:chunk_number` | Integer | 1-based index of this chunk |
| `:rows_in_chunk` | Integer | Number of rows in this chunk (≤ `chunk_size`) |
| `:total_rows_so_far` | Integer | Cumulative rows processed including this chunk |

### `on_complete`

| Key | Type | Description |
|-----------------|---------|--------------------------------------------------------------------|
| `:total_rows` | Integer | Total rows successfully parsed |
| `:total_chunks` | Integer | Number of chunks yielded (0 in non-chunked mode) |
| `:duration` | Float | Elapsed seconds from `on_start` to `on_complete` |
| `:bad_rows` | Integer | Number of rows that triggered `on_bad_row` handling (0 if none) |
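
The payload keys listed here are enough to build a small progress tracker. The sketch below uses a hypothetical `ImportProgress` class (not part of SmarterCSV) whose lambdas could be passed as `on_start:`, `on_chunk:`, and `on_complete:`. To keep it runnable without a CSV file, the payload hashes are built by hand with the documented keys and made-up values.

```ruby
# Hypothetical helper: turns hook payloads into log lines.
class ImportProgress
  attr_reader :lines

  def initialize
    @lines = []
  end

  # Each method returns a lambda suitable for the matching hook option.
  def on_start
    ->(info) { @lines << "start #{info[:input]} (#{info[:file_size] || 'unknown'} bytes)" }
  end

  def on_chunk
    ->(info) { @lines << "chunk #{info[:chunk_number]}: #{info[:total_rows_so_far]} rows so far" }
  end

  def on_complete
    lambda do |stats|
      @lines << "done: #{stats[:total_rows]} rows, #{stats[:total_chunks]} chunks, " \
                "#{stats[:bad_rows]} bad, #{stats[:duration].round(2)}s"
    end
  end
end

progress = ImportProgress.new

# Hand-built payloads that mimic the documented keys (values are made up).
progress.on_start.call({ input: 'data.csv', file_size: nil, col_sep: ',', row_sep: "\n" })
progress.on_chunk.call({ chunk_number: 1, rows_in_chunk: 500, total_rows_so_far: 500 })
progress.on_chunk.call({ chunk_number: 2, rows_in_chunk: 120, total_rows_so_far: 620 })
progress.on_complete.call({ total_rows: 620, total_chunks: 2, duration: 0.0431, bad_rows: 0 })

puts progress.lines
```

With a real file, the same lambdas would be handed to `SmarterCSV.process` together with `chunk_size:`, for example `on_start: progress.on_start`.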

## Non-chunked mode

When `chunk_size` is not set, `on_chunk` never fires. `on_start` and `on_complete`
still fire and give you the full-file summary:

```ruby
# `log` stands for your own logging helper
SmarterCSV.process('data.csv',
  on_start:    ->(info)  { @started_at = Time.now; log "Importing #{info[:input]}" },
  on_complete: ->(stats) { log "Done: #{stats[:total_rows]} rows in #{stats[:duration].round(3)}s" },
)
```

## Execution order

```
on_start
  ├─ on_chunk  (chunk 1 parsed) → block runs → returns
  ├─ on_chunk  (chunk 2 parsed) → block runs → returns
  └─ on_chunk  (chunk N parsed) → block runs → returns
on_complete
```

`on_chunk` fires **before** the block receives the chunk, so you can record timing or
state before your processing logic runs.
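
One way to use that ordering is to start a timer in `on_chunk` and stop it at the end of your own block, which isolates the time spent in your processing code. The sketch below is a minimal illustration under stated assumptions: `BlockTimer` is a hypothetical helper, and `simulate_process` is a stand-in driver that only imitates the call order documented here so the example runs without SmarterCSV.

```ruby
# Hypothetical helper: measures how long the caller's block takes per chunk.
class BlockTimer
  attr_reader :durations

  def initialize
    @durations = []
    @started_at = nil
  end

  # Pass as on_chunk: it fires before the block, so it marks the start time.
  def on_chunk
    ->(_info) { @started_at = now }
  end

  # Call as the last statement of your block to record the elapsed time.
  def finish_chunk
    @durations << (now - @started_at)
  end

  private

  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

# Stand-in driver (not SmarterCSV): on_start, then on_chunk followed by the
# block for every chunk, then on_complete.
def simulate_process(chunks, on_start:, on_chunk:, on_complete:)
  on_start.call({})
  chunks.each_with_index do |chunk, index|
    on_chunk.call({ chunk_number: index + 1, rows_in_chunk: chunk.size })
    yield chunk
  end
  on_complete.call({ total_chunks: chunks.size })
end

timer = BlockTimer.new
simulate_process([[1, 2], [3, 4], [5]],
                 on_start: ->(_info) {},
                 on_chunk: timer.on_chunk,
                 on_complete: ->(_stats) {}) do |chunk|
  chunk.sum            # placeholder for real per-chunk work
  timer.finish_chunk
end

puts timer.durations.size
```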

## Without Rails / ActiveSupport

The hooks are plain callables — no dependency on Rails or any framework:

```ruby
require 'logger'
logger = Logger.new($stdout)

SmarterCSV.process('import.csv',
  on_start:    ->(i) { logger.info "CSV import started: #{i[:input]}" },
  on_complete: ->(s) { logger.info "CSV import done: #{s[:total_rows]} rows, #{s[:duration].round(2)}s" },
)
```

## With `ActiveSupport::Notifications` (Rails)

If you prefer Rails-style instrumentation, wrap the hooks yourself:

```ruby
# config/initializers/smarter_csv_instrumentation.rb
ON_START = ->(info) {
  ActiveSupport::Notifications.instrument('start.smarter_csv', info)
}
ON_COMPLETE = ->(stats) {
  ActiveSupport::Notifications.instrument('complete.smarter_csv', stats)
}

# Subscribe once at startup:
ActiveSupport::Notifications.subscribe('complete.smarter_csv') do |*, payload|
  StatsD.histogram('csv.duration', payload[:duration])
  StatsD.gauge('csv.rows', payload[:total_rows])
end
```

Then pass the cached lambdas to any `process` call:

```ruby
SmarterCSV.process(file, on_start: ON_START, on_complete: ON_COMPLETE)
```

--------------------
PREVIOUS: [Bad Row Quarantine](./bad_row_quarantine.md) | NEXT: [Examples](./examples.md) | UP: [README](../README.md)
data/docs/migrating_from_csv.md
ADDED
@@ -0,0 +1,290 @@

### Contents

* [Introduction](./_introduction.md)
* [**Migrating from Ruby CSV**](./migrating_from_csv.md)
* [Parsing Strategy](./parsing_strategy.md)
* [The Basic Read API](./basic_read_api.md)
* [The Basic Write API](./basic_write_api.md)
* [Batch Processing](./batch_processing.md)
* [Configuration Options](./options.md)
* [Row and Column Separators](./row_col_sep.md)
* [Header Transformations](./header_transformations.md)
* [Header Validations](./header_validations.md)
* [Column Selection](./column_selection.md)
* [Data Transformations](./data_transformations.md)
* [Value Converters](./value_converters.md)
* [Bad Row Quarantine](./bad_row_quarantine.md)
* [Instrumentation Hooks](./instrumentation.md)
* [Examples](./examples.md)
* [Real-World CSV Files](./real_world_csv.md)
* [SmarterCSV over the Years](./history.md)
* [Release Notes](./releases/1.16.0/changes.md)

--------------

# Migrating from Ruby CSV

Already using Ruby's built-in `CSV` library? Switching to SmarterCSV is typically a one- or
two-line change — and you get **1.7×–8.6× faster** end-to-end throughput vs `CSV.read`, plain Ruby
hashes with symbol keys, automatic type conversion, and a much richer feature set in return.

> **Medium article:** *"Switch from Ruby CSV to SmarterCSV in 5 Minutes"* — *(coming soon)*

---

## Performance

| Comparison | Range |
|---|---|
| SmarterCSV vs `CSV.read` † | **1.7×–8.6× faster** |
| SmarterCSV vs `CSV.table` ‡ | **7×–129× faster** |

_Benchmarks: 19 CSV files (20k–80k rows), Ruby 3.4.7, Apple M1._

_† `CSV.read` returns raw arrays of arrays — hash construction, key normalization, and type conversion still need to happen, understating the real cost difference._

_‡ `CSV.table` is the closest Ruby equivalent to SmarterCSV — both return symbol-keyed hashes._

---

## The one-line switch

```ruby
# Before — Ruby CSV
rows = CSV.table('data.csv').map(&:to_h)   # array of hashes with symbol keys

# After — SmarterCSV (drop-in, up to 129× faster)
rows = SmarterCSV.process('data.csv')      # array of hashes with symbol keys
```

That's it for the common case. Keep reading for the few behavior differences to be aware of.

---

## Parsing a CSV string

```ruby
csv_string = "name,age\nAlice,30\nBob,25\n"

# Ruby CSV
rows = CSV.parse(csv_string, headers: true, header_converters: :symbol)

# SmarterCSV — direct string parsing
rows = SmarterCSV.parse(csv_string)
# => [{name: "Alice", age: 30}, {name: "Bob", age: 25}]
```

`SmarterCSV.parse` is a convenience wrapper added in 1.16.0. Under the hood it wraps the
string in a `StringIO` — but you don't need to think about that.

---

## Row-by-row iteration

```ruby
# Ruby CSV
CSV.foreach('data.csv', headers: true, header_converters: :symbol) do |row|
  MyModel.create(row.to_h)
end

# SmarterCSV
SmarterCSV.each('data.csv') do |row|
  MyModel.create(row)   # row is already a plain Hash — no .to_h needed
end
```

`SmarterCSV.each` returns an `Enumerator` when called without a block, so the full
`Enumerable` API is available:

```ruby
names   = SmarterCSV.each('data.csv').map { |row| row[:name] }
us_rows = SmarterCSV.each('data.csv').select { |row| row[:country] == 'US' }
first10 = SmarterCSV.each('data.csv').lazy.first(10)
```

---

## Key behavior differences

### 1. Symbol keys (same as `CSV.table`, different from `CSV.read`)

SmarterCSV returns symbol keys by default — the same as `CSV.table`. If you were using
`CSV.read` with string keys, add `strings_as_keys: true`:

```ruby
# Ruby CSV.read — string keys
rows = CSV.read('data.csv', headers: true)
rows.first['name']  # string key

# SmarterCSV default — symbol keys (same as CSV.table)
rows = SmarterCSV.process('data.csv')
rows.first[:name]  # symbol key

# SmarterCSV with string keys — if you need to match CSV.read behavior
rows = SmarterCSV.process('data.csv', strings_as_keys: true)
rows.first['name']
```

### 2. Numeric conversion is automatic

SmarterCSV converts numeric strings to `Integer` or `Float` automatically (the `:numeric`
converter in Ruby CSV terms). You get integers and floats back without requesting it:

```ruby
# Ruby CSV — explicit converter needed
CSV.table('data.csv', converters: :numeric)

# SmarterCSV — automatic (convert_values_to_numeric: true is the default)
SmarterCSV.process('data.csv')
```

To disable: `convert_values_to_numeric: false`.

To limit conversion to specific columns:
```ruby
SmarterCSV.process('data.csv', convert_values_to_numeric: { only: [:age, :score] })
SmarterCSV.process('data.csv', convert_values_to_numeric: { except: [:zip_code] })
```

### 3. Empty values are removed by default

SmarterCSV drops key/value pairs where the value is `nil` or blank
(`remove_empty_values: true` is the default). Ruby CSV keeps them as `nil`.

```ruby
# CSV "Alice,,30" with header "name,city,age"

# Ruby CSV — nil values present
# => {name: "Alice", city: nil, age: 30}

# SmarterCSV default — nil removed
# => {name: "Alice", age: 30}

# SmarterCSV — keep nil values (match Ruby CSV behavior)
SmarterCSV.process('data.csv', remove_empty_values: false)
# => {name: "Alice", city: nil, age: 30}
```
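
One practical consequence of this default: a removed key still reads as `nil` through `Hash#[]`, but `Hash#key?` returns `false` for it. The plain-Ruby sketch below uses hand-written hashes shaped like the two results shown above (there is no SmarterCSV call) and a hypothetical `HEADERS` list to show one way to restore the full set of keys afterwards.

```ruby
HEADERS = %i[name city age].freeze   # hypothetical list of expected columns

default_row = { name: 'Alice', age: 30 }              # shape with remove_empty_values: true
full_row    = { name: 'Alice', city: nil, age: 30 }   # shape with remove_empty_values: false

default_row[:city]        # => nil (a missing key reads as nil)
default_row.key?(:city)   # => false
full_row.key?(:city)      # => true

# Re-add any missing columns with nil values
normalized = HEADERS.to_h { |header| [header, default_row[header]] }
puts normalized == full_row
```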

### 4. Plain Hash, not CSV::Row

Ruby CSV returns `CSV::Row` objects. SmarterCSV returns plain Ruby `Hash` objects.

`CSV::Row` is a wrapper object with extra methods (`.headers`, `.fields`, `.to_h`, `.to_a`).
With SmarterCSV you work directly with the hash — no wrapper, no `.to_h` needed.

```ruby
# Ruby CSV — CSV::Row object
row = CSV.table('data.csv').first
row.class    # => CSV::Row
row.headers  # => [:name, :age]
row.to_h     # => {name: "Alice", age: 30}

# SmarterCSV — plain Hash
row = SmarterCSV.process('data.csv').first
row.class    # => Hash
row.keys     # => [:name, :age]
row          # => {name: "Alice", age: 30}
```

---

## Date / DateTime conversion

Ruby CSV has built-in `:date` and `:date_time` converters. SmarterCSV intentionally omits
them because date formats are locale-dependent (`12/03/2020` means December 3rd in the US
but March 12th in Europe). Use a `value_converter` instead:

```ruby
require 'date'

# ISO 8601 (YYYY-MM-DD) — unambiguous
iso_date = Class.new { def self.convert(v) = v ? Date.strptime(v, '%Y-%m-%d') : nil }

SmarterCSV.process('data.csv', value_converters: { birth_date: iso_date })
```

See [Value Converters](./value_converters.md) for full details and examples for US/EU formats.
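
To make the locale problem concrete, here is a minimal sketch of two converters applied to the same input string. The module names `UsDate` and `EuDate` are illustrative; the only assumption taken from this section is that a value converter is an object responding to `convert(value)`. Either module could be passed the same way as `iso_date`, for example `value_converters: { birth_date: UsDate }`.

```ruby
require 'date'

# Month/day/year, as commonly written in the US
module UsDate
  def self.convert(value)
    value ? Date.strptime(value, '%m/%d/%Y') : nil
  end
end

# Day/month/year, as commonly written in Europe
module EuDate
  def self.convert(value)
    value ? Date.strptime(value, '%d/%m/%Y') : nil
  end
end

puts UsDate.convert('12/03/2020').iso8601   # 2020-12-03
puts EuDate.convert('12/03/2020').iso8601   # 2020-03-12
```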

---

## Sentinel values (NULL, NaN, #VALUE!)

Ruby CSV leaves these as strings. SmarterCSV lets you nil-ify them (and optionally remove
the key) in a single option:

```ruby
# Drop key/value pairs whose value is NULL, NaN, or an Excel error
# (matches become nil, and nil values are removed by default)
SmarterCSV.process('data.csv', nil_values_matching: /\A(NULL|NaN|#VALUE!)\z/)

# Keep the key but set the value to nil (so a nil-ified value can be told apart from an absent column)
SmarterCSV.process('data.csv',
  nil_values_matching: /\ANULL\z/,
  remove_empty_values: false,
)
```
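
The anchors `\A` and `\z` matter in these patterns: they make the regex match only when the entire field is a sentinel. The plain-Ruby sketch below shows which raw strings the three-sentinel pattern catches. The sample values are made up, and the `select` call only illustrates the regex itself, not how SmarterCSV applies it internally.

```ruby
SENTINEL = /\A(NULL|NaN|#VALUE!)\z/

samples = ['NULL', 'NaN', '#VALUE!', 'NULLABLE', 'null', ' NULL', '42']

# Only exact, case-sensitive, whole-field matches are selected
matched = samples.select { |value| value.match?(SENTINEL) }
puts matched.inspect
```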

---

## Malformed / bad rows

Ruby CSV has `liberal_parsing: true`, which accepts many malformed rows without raising an error.
SmarterCSV gives you explicit control:

```ruby
# Ruby CSV — silent ignore
CSV.read('data.csv', liberal_parsing: true)

# SmarterCSV — collect bad rows so you can inspect them
reader = SmarterCSV::Reader.new('data.csv', on_bad_row: :collect)
good_rows = reader.process
bad_rows  = reader.errors[:bad_rows]   # inspect / log / quarantine
```

See [Bad Row Quarantine](./bad_row_quarantine.md) for full details.

---

## Writing CSV

```ruby
# Ruby CSV
CSV.open('out.csv', 'w', write_headers: true, headers: ['name','age']) do |csv|
  csv << ['Alice', 30]
end

# SmarterCSV — takes hashes, discovers headers automatically
SmarterCSV.generate('out.csv') do |csv|
  csv << {name: 'Alice', age: 30}
  csv << {name: 'Bob', age: 25}
end
```

SmarterCSV's writer also accepts any IO object (StringIO, open file handle) for streaming:

```ruby
io = StringIO.new
SmarterCSV.generate(io) { |csv| records.each { |r| csv << r } }
send_data io.string, type: 'text/csv'
```

---

## Quick reference

| Ruby CSV | SmarterCSV equivalent | Notes |
|---|---|---|
| `CSV.table(f)` | `SmarterCSV.process(f)` | Drop-in. Symbol keys, numeric conversion. |
| `CSV.read(f, headers: true)` | `SmarterCSV.process(f, strings_as_keys: true)` | Add `strings_as_keys:` for string keys. |
| `CSV.parse(str, headers: true, header_converters: :symbol)` | `SmarterCSV.parse(str)` | Direct string parsing. |
| `CSV.foreach(f, headers: true) { \|r\| }` | `SmarterCSV.each(f) { \|r\| }` | Row is already a plain Hash. |
| `converters: :numeric` | default | Automatic in SmarterCSV. |
| `converters: :date` | `value_converters: {col: DateConverter}` | See [Value Converters](./value_converters.md). |
| `liberal_parsing: true` | `on_bad_row: :collect` | Explicit quarantine is better. |
| `skip_blanks: true` | `remove_empty_hashes: true` | Default in SmarterCSV. |
| `row.to_h` | `row` | Already a plain Hash — no conversion needed. |
| `row.headers` | `reader.headers` | Available on the `Reader` instance. |

---
PREVIOUS: [Introduction](./_introduction.md) | NEXT: [Parsing Strategy](./parsing_strategy.md) | UP: [README](../README.md)