smarter_csv 1.15.2 → 1.16.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (50)
  1. checksums.yaml +4 -4
  2. data/.rspec +2 -0
  3. data/.rubocop.yml +9 -0
  4. data/CHANGELOG.md +112 -1
  5. data/CONTRIBUTORS.md +4 -1
  6. data/Gemfile +1 -0
  7. data/README.md +129 -27
  8. data/docs/_introduction.md +45 -24
  9. data/docs/bad_row_quarantine.md +342 -0
  10. data/docs/basic_read_api.md +152 -9
  11. data/docs/basic_write_api.md +475 -59
  12. data/docs/batch_processing.md +162 -4
  13. data/docs/column_selection.md +184 -0
  14. data/docs/data_transformations.md +163 -29
  15. data/docs/examples.md +340 -46
  16. data/docs/header_transformations.md +94 -12
  17. data/docs/header_validations.md +57 -18
  18. data/docs/history.md +119 -0
  19. data/docs/instrumentation.md +166 -0
  20. data/docs/migrating_from_csv.md +565 -0
  21. data/docs/options.md +151 -87
  22. data/docs/parsing_strategy.md +64 -1
  23. data/docs/real_world_csv.md +263 -0
  24. data/docs/releases/1.16.0/benchmarks.md +223 -0
  25. data/docs/releases/1.16.0/changes.md +273 -0
  26. data/docs/releases/1.16.0/performance_notes.md +114 -0
  27. data/docs/row_col_sep.md +15 -5
  28. data/docs/ruby_csv_pitfalls.md +514 -0
  29. data/docs/value_converters.md +194 -57
  30. data/ext/smarter_csv/extconf.rb +3 -0
  31. data/ext/smarter_csv/smarter_csv.c +1017 -82
  32. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.png +0 -0
  33. data/images/SmarterCSV_1.16.0_vs_RubyCSV_3.3.5_speedup.svg +108 -0
  34. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.png +0 -0
  35. data/images/SmarterCSV_1.16.0_vs_previous_C-speedup.svg +141 -0
  36. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.png +0 -0
  37. data/images/SmarterCSV_1.16.0_vs_previous_Rb-speedup.svg +139 -0
  38. data/lib/smarter_csv/errors.rb +8 -0
  39. data/lib/smarter_csv/file_io.rb +1 -1
  40. data/lib/smarter_csv/hash_transformations.rb +14 -13
  41. data/lib/smarter_csv/header_transformations.rb +21 -2
  42. data/lib/smarter_csv/headers.rb +2 -1
  43. data/lib/smarter_csv/options.rb +124 -7
  44. data/lib/smarter_csv/parser.rb +358 -74
  45. data/lib/smarter_csv/reader.rb +494 -46
  46. data/lib/smarter_csv/version.rb +1 -1
  47. data/lib/smarter_csv/writer.rb +71 -19
  48. data/lib/smarter_csv.rb +134 -13
  49. data/smarter_csv.gemspec +20 -10
  50. metadata +38 -80
data/docs/history.md ADDED
@@ -0,0 +1,119 @@
1
+
2
+ ### Contents
3
+
4
+ * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
6
+ * [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md)
7
+ * [Parsing Strategy](./parsing_strategy.md)
8
+ * [The Basic Read API](./basic_read_api.md)
9
+ * [The Basic Write API](./basic_write_api.md)
10
+ * [Batch Processing](./batch_processing.md)
11
+ * [Configuration Options](./options.md)
12
+ * [Row and Column Separators](./row_col_sep.md)
13
+ * [Header Transformations](./header_transformations.md)
14
+ * [Header Validations](./header_validations.md)
15
+ * [Column Selection](./column_selection.md)
16
+ * [Data Transformations](./data_transformations.md)
17
+ * [Value Converters](./value_converters.md)
18
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [Instrumentation Hooks](./instrumentation.md)
20
+ * [Examples](./examples.md)
21
+ * [Real-World CSV Files](./real_world_csv.md)
22
+ * [**SmarterCSV over the Years**](./history.md)
23
+ * [Release Notes](./releases/1.16.0/changes.md)
24
+
25
+ --------------
26
+
27
+ # SmarterCSV over the Years
28
+
29
+ ## Origin
30
+
31
+ SmarterCSV was born from a [StackOverflow question in 2011](https://stackoverflow.com/questions/7788618/update-mongodb-with-array-from-csv-join-table/7788746#7788746) about importing CSV data into MongoDB. The answer involved processing CSV rows as hashes — which turned out to be so useful that it became a gem.
32
+
33
+ The original write-up is [preserved here](http://www.unixgods.org/Ruby/process_csv_as_hashes.html).
34
+
35
+ The first gem release was **v1.0.1 on 2012-07-30**.
36
+
37
+ ---
38
+
39
+ ## Key Milestones
40
+
41
+ | Version | Date | Highlight |
42
+ |---------|------------|-----------|
43
+ | 1.0.1 | 2012-07-30 | First release: CSV → array of hashes, batch processing, key mapping |
44
+ | 1.0.17 | 2014-01-13 | `row_sep: :auto` — automatic row separator detection |
45
+ | 1.0.18 | 2014-10-27 | Multi-line / embedded-newline field support |
46
+ | 1.1.0 | 2015-07-26 | `value_converters` — custom per-column type parsing (dates, money, …) |
47
+ | 1.4.0 | 2022-02-11 | Experimental `col_sep: :auto` detection; switched to MIT-only licence |
48
+ | 1.5.1 | 2022-04-27 | `duplicate_header_suffix` for CSV files with repeated headers |
49
+ | 1.6.0 | 2022-05-03 | Complete rewrite of the pure-Ruby line parser |
50
+ | **1.7.0** | **2022-06-26** | **First C extension — >10× speedup over 1.6.x announced** |
51
+ | 1.8.0 | 2023-03-18 | `col_sep: :auto` and `row_sep: :auto` made the **default** |
52
+ | 1.9.0 | 2023-09-04 | Structured error objects with programmatic key access |
53
+ | 1.10.0 | 2023-12-31 | Performance & memory improvements; stricter `user_provided_headers` |
54
+ | **1.11.0** | **2024-07-02** | **SmarterCSV::Writer** — CSV generation from hashes |
55
+ | **1.12.0** | **2024-07-09** | **Thread-safe `SmarterCSV::Reader` class**; docs site added |
56
+ | 1.13.0 | 2024-11-06 | Auto-generation of extra column names; improved quote robustness |
57
+ | 1.14.0 | 2025-04-07 | Advanced Writer options; `header_converter` |
58
+ | 1.14.3 | 2025-05-04 | C-extension fast path for unquoted fields; inline whitespace stripping |
59
+ | **1.15.0** | **2026-02-04** | **Major C-extension rewrite — ~5× faster than 1.14.4; 39% less memory** |
60
+ | 1.15.1 | 2026-02-17 | Fix for backslash in quoted fields (`quote_escaping:` option) |
61
+ | 1.15.2 | 2026-02-20 | Further C-path optimisations; 5.4×–37.4× faster than 1.14.4 |
62
+ | **1.16.0** | **2026-03-12** | **New `each`/`each_chunk` enumerator API; `SmarterCSV.parse`; bad row quarantine; column selection `headers: { only: }`; 1.8×–8.6× faster than Ruby CSV.read; new features for Reader and Writer; minor breaking change: `quote_boundary: :standard`** |
63
+ | 1.16.1 | 2026-03-16 | `SmarterCSV.errors` class-level error access; fix `col_sep` in quoted headers (#325); fix quoted numeric conversion |
64
+
65
+ ---
66
+
67
+ ## Performance Journey
68
+
69
+ Measured on Apple M1, Ruby 3.4.7. Best of 2 sessions × 30 runs.
70
+ All times are **C-accelerated** except the `1.6.1` column (no C extension existed).
71
+ `—` = not measured for that version.
72
+
73
+ | File | Rows | 1.6.1 Rb (s) | 1.7.1 C (s) | 1.14.4 C (s) | 1.15.2 C (s) | 1.16.0 C (s) | total gain |
74
+ |--------------------------------|------:|-------------:|------------:|-------------:|-------------:|-------------:|-----------:|
75
+ | PEOPLE_IMPORT_B.csv | 50k | 3.793 | 1.083 | 1.656 | 0.101 | 0.087 | **43.6×** |
76
+ | PEOPLE_IMPORT_C.csv | 50k | 21.612 | 2.763 | 8.172 | 0.207 | 0.169 | **127.8×** |
77
+ | PEOPLE_IMPORT_NB.csv | 50k | 3.746 | 1.053 | 1.605 | 0.086 | 0.080 | **46.9×** |
78
+ | PEOPLE_IMPORT_NC.csv | 50k | 3.831 | 1.018 | 1.495 | 0.076 | 0.063 | **60.8×** |
79
+ | uscities.csv | 31k | — | — | 1.058 | 0.113 | 0.108 | — |
80
+ | uszips.csv | 34k | — | — | 1.277 | 0.111 | 0.102 | — |
81
+ | worldcities.csv | 48k | — | — | 1.070 | 0.116 | 0.097 | — |
82
+ | fmap.csv | 50k | 2.130 | 0.873 | — | — | — | — |
83
+ | zipcode.csv | 44k | 1.572 | 0.797 | — | — | — | — |
84
+ | sample_10M.csv | 50k | 1.291 | 0.661 | 0.459 | 0.053 | 0.046 | **28.0×** |
85
+ | sensor_data_50krows_50cols.csv | 50k | — | — | 3.985 | 0.272 | 0.264 | — |
86
+ | embedded_newlines_20k.csv | 80k | 0.716 | 0.366 | 0.540 | 0.056 | 0.054 | **13.2×** |
87
+ | embedded_separators_20k.csv | 20k | 0.714 | 0.333 | 0.278 | 0.032 | 0.025 | **28.6×** |
88
+ | heavy_quoting_20k.csv | 20k | 1.309 | 0.484 | 0.522 | 0.054 | 0.036 | **36.5×** |
89
+ | long_fields_20k.csv | 20k | 5.698 | 1.112 | 2.960 | 0.110 | 0.045 | **126.6×** |
90
+ | many_empty_fields_20k.csv | 20k | 1.149 | 0.420 | 0.395 | 0.031 | 0.025 | **45.8×** |
91
+ | multi_char_separator_20k.csv | 20k | — | — | 0.539 | 0.033 | 0.026 | — |
92
+ | tab_separated_20k.tsv | 20k | — | — | 0.462 | 0.034 | 0.025 | — |
93
+ | utf8_multibyte_20k.csv | 20k | 0.709 | 0.305 | 0.228 | 0.020 | 0.017 | **41.7×** |
94
+ | whitespace_heavy_20k.csv | 20k | 1.335 | 0.393 | 0.536 | 0.036 | 0.028 | **47.5×** |
95
+ | wide_500_cols_20k.csv | 20k | 39.755 | 9.532 | 17.658 | 1.419 | 1.352 | **29.4×** |
96
+
97
+ `total gain` = v1.6.1 Ruby time / v1.16.0 C-accelerated time (files without 1.6.1 data show `—`)
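
As a quick check of the `total gain` column, the ratio can be recomputed from the published timings. The sketch below copies four rows from the table above. Because the table's times are already rounded to three decimals, a recomputed ratio can differ from the published figure in the last digit for some rows.

```ruby
# Timings copied from the table above: [v1.6.1 Ruby seconds, v1.16.0 C seconds].
TIMINGS = {
  'PEOPLE_IMPORT_B.csv'   => [3.793, 0.087],
  'PEOPLE_IMPORT_NC.csv'  => [3.831, 0.063],
  'long_fields_20k.csv'   => [5.698, 0.045],
  'wide_500_cols_20k.csv' => [39.755, 1.352],
}.freeze

# total gain = v1.6.1 Ruby time / v1.16.0 C-accelerated time
def total_gain(ruby_seconds, c_seconds)
  (ruby_seconds / c_seconds).round(1)
end

GAINS = TIMINGS.transform_values { |(rb, c)| total_gain(rb, c) }
GAINS.each { |file, gain| puts format('%-24s %6.1fx', file, gain) }
```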
98
+
99
+ --------------
100
+
101
+ **Highlights:**
102
+ - `long_fields_20k` (long quoted fields): **126.6×** — `memchr`-based field scanning makes long quoted fields essentially free to skip.
103
+ - `PEOPLE_IMPORT_C` (116 columns): **127.8×** — wide rows multiply every per-field saving across all columns.
104
+ - `PEOPLE_IMPORT_NC` (17 columns): **60.8×** — Ruby-path optimisations #10 & #11 provide an extra boost on moderately wide files.
105
+ - `wide_500_cols_20k` went from **39.8 seconds → 1.35 seconds** — and with `headers: { only: }` keeping just 2 of those 500 columns it drops further to **~0.1 seconds** (an additional ~16× on top).
106
+ - `embedded_newlines` shows the smallest gain (**13.2×**) — multi-line stitching is bounded by I/O and the line-counting loop, not field parsing.
107
+
108
+ ---
109
+
110
+ ## Related Reading
111
+
112
+ - [Parsing CSV Files in Ruby with SmarterCSV](https://tilo-sloboda.medium.com/parsing-csv-files-in-ruby-with-smartercsv-6ce66fb6cf38)
113
+ - [SmarterCSV 1.15.2 — Faster than raw CSV arrays](https://tilo-sloboda.medium.com/smartercsv-1-15-2-faster-than-raw-csv-arrays-benchmarks-zsv-and-the-full-pipeline-2c12a798032e)
114
+ - [Processing 1.4 Million CSV Records in Ruby, fast](https://lcx.wien/blog/processing-14-million-csv-records-in-ruby/)
115
+ - [Faster Parsing CSV with Parallel Processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing) by [Jack Lin](https://github.com/xjlin0/)
116
+
117
+ --------------------
118
+
119
+ PREVIOUS: [Real-World CSV Files](./real_world_csv.md) | NEXT: [Release Notes](./releases/1.16.0/changes.md) | UP: [README](../README.md)
data/docs/instrumentation.md ADDED
@@ -0,0 +1,166 @@
1
+
2
+ ### Contents
3
+
4
+ * [Introduction](./_introduction.md)
5
+ * [Migrating from Ruby CSV](./migrating_from_csv.md)
6
+ * [Ruby CSV Pitfalls](./ruby_csv_pitfalls.md)
7
+ * [Parsing Strategy](./parsing_strategy.md)
8
+ * [The Basic Read API](./basic_read_api.md)
9
+ * [The Basic Write API](./basic_write_api.md)
10
+ * [Batch Processing](./batch_processing.md)
11
+ * [Configuration Options](./options.md)
12
+ * [Row and Column Separators](./row_col_sep.md)
13
+ * [Header Transformations](./header_transformations.md)
14
+ * [Header Validations](./header_validations.md)
15
+ * [Column Selection](./column_selection.md)
16
+ * [Data Transformations](./data_transformations.md)
17
+ * [Value Converters](./value_converters.md)
18
+ * [Bad Row Quarantine](./bad_row_quarantine.md)
19
+ * [**Instrumentation Hooks**](./instrumentation.md)
20
+ * [Examples](./examples.md)
21
+ * [Real-World CSV Files](./real_world_csv.md)
22
+ * [SmarterCSV over the Years](./history.md)
23
+ * [Release Notes](./releases/1.16.0/changes.md)
24
+
25
+ --------------
26
+
27
+ # Instrumentation Hooks
28
+
29
+ SmarterCSV provides three optional callback hooks so you can observe file processing
30
+ without wrapping every call site in timing code. The hooks work with `SmarterCSV.process`
31
+ (library-controlled iteration). Enumerator modes (`each`, `each_chunk`) do not fire
32
+ hooks — in those modes the caller owns the lifecycle and should instrument their own loop.
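
Since enumerator modes leave instrumentation to the caller, a small wrapper around the loop can stand in for the hooks. The following is a minimal sketch and not part of SmarterCSV: `instrument_chunks` is a hypothetical helper that accepts any object whose `each` yields arrays of rows, and it returns stats under keys that mirror the `on_complete` payload names described on this page. How the chunk enumerator is obtained from SmarterCSV is deliberately left out; the sample data is a plain array used as a stand-in.

```ruby
# Hypothetical helper (not part of SmarterCSV): times a caller-owned loop over
# any enumerable of chunks and returns summary stats.
def instrument_chunks(chunks)
  started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  total_rows = 0
  total_chunks = 0

  chunks.each do |chunk|
    total_chunks += 1
    total_rows += chunk.size
    yield chunk, total_chunks # caller's processing logic runs here
  end

  duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at
  { total_rows: total_rows, total_chunks: total_chunks, duration: duration }
end

# Stand-in data: in real use `chunks` would be the enumerator you iterate yourself.
sample_chunks = [[{ id: 1 }, { id: 2 }], [{ id: 3 }]]
seen = []
stats = instrument_chunks(sample_chunks) { |chunk, n| seen << [n, chunk.size] }
puts "#{stats[:total_rows]} rows in #{stats[:total_chunks]} chunks"
```

The monotonic clock is used instead of `Time.now` so that wall-clock adjustments during a long import do not distort the measured duration.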
33
+
34
+ ## The Three Hooks
35
+
36
+ | Hook | Fires when | Useful for |
37
+ |---------------|-----------------------------------------------------|---------------------------------------------|
38
+ | `on_start` | Once, before the first row is parsed | Logging intent, starting timers, counters |
39
+ | `on_chunk` | After each chunk is parsed, before block runs | Progress tracking, per-batch metrics |
40
+ | `on_complete` | Once, after the entire file is exhausted | Total duration, row counts, summary metrics |
41
+
42
+ `on_chunk` only fires when `chunk_size` is set. In non-chunked mode only `on_start` and
43
+ `on_complete` fire.
44
+
45
+ ## Usage
46
+
47
+ All three hooks are lambdas (or any callable) passed as options:
48
+
49
+ ```ruby
50
+ SmarterCSV.process('data.csv',
51
+ chunk_size: 500,
52
+
53
+ on_start: ->(info) {
54
+ Rails.logger.info "Starting CSV import: #{info[:input]} (#{info[:file_size]} bytes)"
55
+ Metrics.increment('csv.import.start')
56
+ },
57
+
58
+ on_chunk: ->(info) {
59
+ Rails.logger.debug "Chunk #{info[:chunk_number]}: #{info[:rows_in_chunk]} rows " \
60
+ "(#{info[:total_rows_so_far]} so far)"
61
+ },
62
+
63
+ on_complete: ->(stats) {
64
+ Rails.logger.info "Import complete: #{stats[:total_rows]} rows in #{stats[:duration].round(2)}s"
65
+ Metrics.histogram('csv.import.duration', stats[:duration])
66
+ Metrics.gauge('csv.import.rows', stats[:total_rows])
67
+ Metrics.increment('csv.import.bad_rows', stats[:bad_rows]) if stats[:bad_rows] > 0
68
+ },
69
+ ) { |chunk| MyModel.insert_all(chunk) }
70
+ ```
71
+
72
+ ## Hook Payloads
73
+
74
+ ### `on_start`
75
+
76
+ | Key | Type | Description |
77
+ |--------------|---------------|---------------------------------------------------------------------|
78
+ | `:input` | String | File path if input is a filename; class name (e.g. `"File"`) otherwise |
79
+ | `:file_size` | Integer / nil | File size in bytes if determinable; nil for IO objects |
80
+ | `:col_sep` | String | Effective column separator (after auto-detection) |
81
+ | `:row_sep` | String | Effective row separator (after auto-detection) |
82
+
83
+ ### `on_chunk`
84
+
85
+ | Key | Type | Description |
86
+ |-----------------------|---------|------------------------------------------------------|
87
+ | `:chunk_number` | Integer | 1-based index of this chunk |
88
+ | `:rows_in_chunk` | Integer | Number of rows in this chunk (≤ `chunk_size`) |
89
+ | `:total_rows_so_far` | Integer | Cumulative rows processed including this chunk |
90
+
91
+ ### `on_complete`
92
+
93
+ | Key | Type | Description |
94
+ |-----------------|---------|--------------------------------------------------------------------|
95
+ | `:total_rows` | Integer | Total rows successfully parsed |
96
+ | `:total_chunks` | Integer | Number of chunks yielded (0 in non-chunked mode) |
97
+ | `:duration` | Float | Elapsed seconds from `on_start` to `on_complete` |
98
+ | `:bad_rows` | Integer | Number of rows that triggered `on_bad_row` handling (0 if none) |
99
+
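
The `on_complete` payload is enough to derive a few summary figures. This is a minimal sketch that relies only on the keys listed above. `summarize` is a hypothetical helper name, the sample hash is invented data rather than real output, and the ratio assumes that `:bad_rows` are not already counted in `:total_rows`, which the table does not state explicitly.

```ruby
# Hypothetical helper: derives summary figures from an on_complete-style payload.
# Assumption: :bad_rows are counted separately from :total_rows.
def summarize(stats)
  rows     = stats[:total_rows]
  bad      = stats[:bad_rows]
  duration = stats[:duration]
  seen     = rows + bad

  {
    rows_per_second: duration.positive? ? (rows / duration).round(1) : nil,
    bad_row_ratio:   seen.zero? ? 0.0 : (bad.to_f / seen).round(4),
  }
end

# A callable that could be passed as `on_complete:`.
report = ->(stats) { puts summarize(stats).inspect }

# Illustrative payload only; the values are made up for the example.
sample = { total_rows: 9_900, total_chunks: 20, duration: 0.5, bad_rows: 100 }
report.call(sample)
```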
100
+ ## Non-chunked mode
101
+
102
+ When `chunk_size` is not set, `on_chunk` never fires. `on_start` and `on_complete`
103
+ still fire and give you the full-file summary:
104
+
105
+ ```ruby
106
+ SmarterCSV.process('data.csv',
107
+ on_start: ->(info) { @started_at = Time.now; log "Importing #{info[:input]}" },
108
+ on_complete: ->(stats) { log "Done: #{stats[:total_rows]} rows in #{stats[:duration].round(3)}s" },
109
+ )
110
+ ```
111
+
112
+ ## Execution order
113
+
114
+ ```
115
+ on_start
116
+ ├─ on_chunk (chunk 1 parsed) → block runs → returns
117
+ ├─ on_chunk (chunk 2 parsed) → block runs → returns
118
+ └─ on_chunk (chunk N parsed) → block runs → returns
119
+ on_complete
120
+ ```
121
+
122
+ `on_chunk` fires **before** the block receives the chunk, so you can record timing or
123
+ state before your processing logic runs.
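
Because `on_chunk` fires before the block, the gap between two consecutive hook calls roughly covers one block run plus the parsing of the next chunk. A timer built on that ordering could look like the sketch below. `ChunkTimer` is a hypothetical class, and the driver at the bottom is a stand-in that only mimics the documented call order with a fake clock, so the example runs without a CSV file.

```ruby
# Hypothetical helper: records the time between consecutive hook calls.
class ChunkTimer
  attr_reader :gaps

  def initialize(clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @clock = clock
    @gaps = {}   # chunk_number => seconds until the next hook fired
    @last = nil
  end

  # Callable for `on_chunk:`; closes the previous chunk's gap and opens a new one.
  def on_chunk
    ->(info) { mark(info[:chunk_number]) }
  end

  # Callable for `on_complete:`; closes the final chunk's gap.
  def on_complete
    ->(_stats) { mark(nil) }
  end

  private

  def mark(next_chunk)
    now = @clock.call
    @gaps[@last[:chunk]] = now - @last[:at] if @last
    @last = next_chunk && { chunk: next_chunk, at: now }
  end
end

# Stand-in driver mimicking the documented order: on_chunk, block, ..., on_complete.
ticks = [0.0, 1.5, 4.0].each # fake clock readings for a deterministic demo
timer = ChunkTimer.new(clock: -> { ticks.next })
chunk_hook = timer.on_chunk
[1, 2].each { |n| chunk_hook.call({ chunk_number: n }) } # blocks would run in between
timer.on_complete.call({})
p timer.gaps
```

In real use the two callables would be passed as `on_chunk: timer.on_chunk, on_complete: timer.on_complete`, with the default monotonic clock left in place.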
124
+
125
+ ## Without Rails / ActiveSupport
126
+
127
+ The hooks are plain callables — no dependency on Rails or any framework:
128
+
129
+ ```ruby
130
+ require 'logger'
131
+ logger = Logger.new($stdout)
132
+
133
+ SmarterCSV.process('import.csv',
134
+ on_start: ->(i) { logger.info "CSV import started: #{i[:input]}" },
135
+ on_complete: ->(s) { logger.info "CSV import done: #{s[:total_rows]} rows, #{s[:duration].round(2)}s" },
136
+ )
137
+ ```
138
+
139
+ ## With `ActiveSupport::Notifications` (Rails)
140
+
141
+ If you prefer Rails-style instrumentation, wrap the hooks yourself:
142
+
143
+ ```ruby
144
+ # config/initializers/smarter_csv_instrumentation.rb
145
+ ON_START = ->(info) {
146
+ ActiveSupport::Notifications.instrument('start.smarter_csv', info)
147
+ }
148
+ ON_COMPLETE = ->(stats) {
149
+ ActiveSupport::Notifications.instrument('complete.smarter_csv', stats)
150
+ }
151
+
152
+ # Subscribe once at startup:
153
+ ActiveSupport::Notifications.subscribe('complete.smarter_csv') do |*, payload|
154
+ StatsD.histogram('csv.duration', payload[:duration])
155
+ StatsD.gauge('csv.rows', payload[:total_rows])
156
+ end
157
+ ```
158
+
159
+ Then pass the cached lambdas to any `process` call:
160
+
161
+ ```ruby
162
+ SmarterCSV.process(file, on_start: ON_START, on_complete: ON_COMPLETE)
163
+ ```
164
+
165
+ --------------------
166
+ PREVIOUS: [Bad Row Quarantine](./bad_row_quarantine.md) | NEXT: [Examples](./examples.md) | UP: [README](../README.md)